Could a multilingual planet benefit from a digital twin?

“To imagine a language,” said Ludwig Wittgenstein, “is to imagine a form of life”.

We’ve entered the exponential age of constantly innovating, data-driven digital tech. And software is busy eating the word, not just the world. Vision, speech, and language research seem to be converging on a single processing model. Synthetic media are populating the web. Analogical reasoning will start taking over from the inductive approach. Quantum NLP is somewhere down the road, and many language tasks are now morphing into AI sub-disciplines.

It’s surely time to seize on these signs and invent a more digitally-intelligent environment for anyone to engage with the problems and opportunities of the planet’s inescapable condition of multilinguality. In a second post, I will suggest that we build a digital twin to help revivify ideas about the human language challenge.

But first, there is a dark paradox at the heart of Earth’s language faculty:

- On the one hand we celebrate “multi” as a natural benefit to our lives on the planet, an existential condition of linguistic difference and belonging that gives each community a unique cognitive resource and a special identity.

- At the same time, practical communication always experiences this same property as a high-cost problem — a language barrier. I have to spend extra effort to overcome this barrier to communicate with or gain knowledge from an alt-language individual, media or knowledge community.

- So language as a “form of life” gives me a strong identity but it carries a cost of communication complexity vis à vis other lives, knowledge sources, etc. How can we optimize this contradiction and render that crucial interpretive operation less burdensome?

We can certainly learn other languages or use technology to remove parts of the barrier. But language learning doesn’t scale: individual humans typically acquire between two and maybe ten other languages out of the world’s seven thousand tongues currently spoken. Massive multilinguality, a term Google has used to describe one of its digital language projects, is therefore an apt name for a global problem. But not yet for a tech solution.

We have obviously lived with and managed to overcome the “language barrier” aspect of multilinguality throughout human civilization, from wandering tribal encounters to massive concentrations of people in big cities. The current development of a technological civilization focused on digital, clean energy sourcing, big tourism, synthetic biology, and space exploration almost certainly interprets this same language barrier as a source of increasing frustration rather than a beacon of inclusiveness.

Language learning via embodiment takes considerable time for humans, but information now travels at the speed of light. Voyaging in space, for example, would be seen as a particularly high-risk operation that, for safety reasons, needs instantaneous information sharing by all onboard. Multiple parallel languages would typically be seen as a threat to the kind of perfect onboard communication that we believe would be necessary, at least in the early stages of space travel. Luckily we already have an active, if still limited, translation tech community.

Possible but improbable solutions to the language barrier are obvious enough to most of us, typically built around a sort of global committee/UN approach to fair play using tech:

a) we maintain massive multilinguality but invent an optional brain-implant widget that automatically translates languages you can’t speak/understand into your tongue so that the world sounds unilingual to each individual (the old Babel fish trope). Still in the pipeline…

b) we invent a shared international auxiliary language (e.g. Esperanto or Interglossa) to share basic information in one simplified tongue rather than fighting over all the existing ones. Popular between 1900 and 1945…

c) we teach a single “big language” on a global scale — e.g. Chinese or English — to ensure democratic access to knowledge for all, and back this up with two-way translation solutions for all the others. Unlikely…

d) we agree to all learn when young at least one of the ten languages with the biggest speaker populations (English, Hindi, Mandarin, Spanish, French, Arabic, Bengali, Russian, Portuguese, Indonesian) and simultaneously build a massive, high-quality translation network to handle spoken/text content across the other nine. But what if a new, faster-growing population entered the top ten? And do we want to insist on some populations having to learn a new language to join this club, while others are born into the tongue at no cost? No comment…

Our current efforts, mainly focused on extensive business, institutional and tourism translation automation, operate among a few hundred languages at the very most. Much of this tends to be one-way, from a very few “big languages” (i.e. economically/numerically powerful ones) to the rest, and has been in operation for fewer than 50 years. Apart from massive Biblical and Buddhist translations in earlier centuries, world knowledge has largely fed on translations from the Greco-Roman classical world into mainly European languages, and a wide scatter of global literary translations. Naturally enough, the vast bulk of global translation throughout history has in fact been channeled locally via human lips as evanescent spoken words.

Three current solutions

1. Big government programs: There is already at least one geographical bloc-size tech program addressing these issues. The European Union’s often-used “usine à gaz” approach consists of multiple interlinked research projects designed to boost the “deep tech” infrastructure to digitally equalize all of Europe’s over 80 national languages and minority languages in the next decade. Umberto Eco joked that Europe “spoke translation” but in reality it is largely various forms of English. English is the world’s most widely spoken language with approximately 1.5 billion co-speakers, though fewer than 400 million first-language speakers. Note that China has around 1.3 billion mostly native “Chinese” speakers.

India too is trying to address the digital needs of its 22 major national languages (covering 96 per cent of the population) in a semi-systematic way (Hindi has over half a billion speakers, the others fewer). The People’s Linguistic Survey of India 2010 identified around 780 living languages, and found that around 220 others had died off in the preceding five decades. South Africa elected 11 languages as national vehicles back in 1994 and a decade later launched its human language technologies initiative to digitize these. These efforts, though, tend to be largely invisible to the world as a whole. Even big language programs don’t get big airtime for long.

2. Small programs: A scatter of smaller-footprint “national” languages are also attempting to digitize. Welsh in the UK is being supported by its local government to enable its 885,000 or so current speakers (from a population of over 3.2 million) to benefit from a digital upskill. Basque has undertaken a similar program of digitization for many years now. But this sort of long-term single-language community commitment is probably rare outside of Europe. Many multilingual countries, especially in the Global South, also represent themselves as speaking one vehicular language for education and for political and social cohesion, be it Swahili, English, French, Spanish, or Mandarin. They therefore benefit automatically from existing digitization efforts targeting these few big languages, while downplaying their other population-rich languages.

Yet these same nations or blocs will inevitably need to address the needs of a rich multi-language sub-culture within and sometimes across their boundaries, as in many African and some South American countries. Already many speakers of these second-tier languages are able to communicate via mobile phones in their own communities, and thereby sustain and evolve their birth tongues to reflect their experiences in new ways. This in turn could produce volumes of potentially useful new spoken data for later digitization efforts. Local projects to build apps with local alphabets and other resources are also emerging in the Global South. But official documents, public media, some educational content, and so on will continue to speak the single national/regional language of power.

Off-subject but equally fascinating is interspecies communication, which tries to extend the insights gained through language engineering to other species, namely mammals and some birds. This may become more pressing if we begin to consider personhood rights for non-humans. Current tech work is focused, for example, on sperm whales, whose codas (acoustic outputs) are being collected and analyzed using machine learning to open up a line of general inquiry into animal well-being, and possibly an understanding of and communication with non-human species. The fascinating dimension of this is that “interspecies” presumably does not endorse the rather children’s storybook image of a salmon chatting with a sperm whale, but exclusively that of humans intercepting mammal and other communications. The circle of communication might widen, as this is about species multilinguality of a very special kind. So far, the engineering tools of this trade are familiar (data collection and machine learning), but there is probably much room to open up new pathways.

3. Language maintenance: Figures vary but it has been suggested that anything up to 1,500 Earth languages could disappear by the end of the century. But it is also widely accepted that tech (networking, compute, and AI) will have some role to play in preserving, teaching/learning, and supporting renewed vitality for many of these languages.

To address this massive problem, UNESCO has just launched a ten-year program to extend efforts to safeguard the planet’s endangered language communities. And yes, there are strange parallels here with other forms of destruction and loss now facing the planet in the longer term: global warming, climate disorders, and rising sea-levels are all endangering livelihoods and forcing communities into radical rethinks about their sustainability. And the current impact of fatal pandemic/endemic diseases is a violent reminder of hard-to-control death and loss.

We need sustainable solutions to prolong the lives of these endangered languages, rather than constantly apologize for the inevitability of their demise. A language’s last speakers are among the most tragic figures in the drama of human history: a single speaker of a dying tongue, reduced to diminishing soliloquy. Yet we must remember that in the remote past there may have been far more distinct tongues spoken on Earth than today. The vast scatter of small roving populations for much of the first 100,000 years of human presence on the planet was radically reduced by the rise of the great empires some 5,000 years ago, when language imposition would have increased massively, causing widespread linguicide. Language loss is therefore not solely due to modern imperialism, paranoid autocracies, or other modern forms of human violence against communities of speakers.

Yet, we will also have to accept that in the near future extant languages will become “post-human” in a real sense — they will never completely “die.” As we begin to build and embrace better virtual/software solutions for real-world problems, some of today’s spoken-only languages in distress will be (partially) recorded, datafied, and re-weaponized. Organizations like the Long Now are thinking this through. Future generations might in many cases be able to develop techniques to revitalize these languages in new ways.

Heritage languages can be revivified through their communities. Not, of course, as they once were spoken, or as their current elders might wish them to be, but as revived tongues (modern Hebrew is a standard example) that can generate new forms of attachment and celebration in a world that will already be deeply virtualized. Keeping languages alive is the preferred goal, but the key fact is that “being alive” will itself be transformed. The digital afterlife of humans who die tomorrow will take on a new legal, social, and technical status, as we transform our entire conscious record of personal signals and signatures into data for a new forever.

Hopefully, this whistle-stop tour of our familiar language galaxy sets the scene for thinking through a different type of solution — a digital twin — to the massive multilinguality dilemma. More in my next post. Meanwhile, comments welcome!

Language dreamer who has spent too long stuck in the past