Moonshots: Language Preservation in the Digital Long Term

Andrew Joscelyne
Apr 28, 2019


My name is Ozymandias, King of Kings;
Look on my Works, ye Mighty, and despair!
Nothing beside remains. Round the decay
Of that colossal Wreck, boundless and bare
The lone and level sands stretch far away.

Immense disappointment when the Beresheet lander, launched by a private Israeli concern, failed to make the last kilometer to the moon in early April. But luckily, there are good reasons for thinking that the Lunar Library (LL), the lander’s payload of Earthling culture (semi-inspired by Isaac Asimov’s Foundation novels) provided by the Arch Mission Foundation as a special nickel disc to last a million years, is not lost.

Beyond the immediate technical heroics, this whole enterprise of etching a physical précis of some 10,000 years of Terran culture raises important questions about the status and destiny of knowledge and especially language in a digital, quantum, planetary future.

First, the purpose of the LL, masterminded by Nova Spivack, was not to communicate with another intelligence out there in the cosmos, as was, for example, the 1972 Pioneer plaque designed by Carl Sagan, which shows the figures of a human male and female plus symbols providing information about the spacecraft’s origin. Spivack’s project is to ensure that a backup of Earth 1.0 knowledge could easily be recovered from our nearest satellite in the aftermath of a global emergency on Terra within the next million years or so. Think extreme climate change, or too close an encounter with an asteroid of the worst kind.

The problem for such a backup, of course, is that although human knowledge is finite, we can never know when a given domain of knowledge has stopped evolving. Which is why one of the Arch Mission Foundation’s bolder ideas is to keep sending up further backups to LL as time goes on to augment the baseline basket carried by Beresheet. So it is sad that this first attempt to conceive of an exponential K-backup was reduced to cosmic dust. As if the fates may not want our civilization to reboot, post-catastrophe. Shades of poor Ozymandias in Shelley’s poem prefacing this piece.

Second, one of the crucial domains of “knowledge” contained in the LL is human language. This includes the text of the English version of Wikipedia (without images) but above all information about and content from 5,000 languages provided by PanLex, the Long Now global language and dictionary project collected in its Rosetta Language Archive. The Library also includes Project Gutenberg texts (probably mostly in English) as well as a secret Vault — whose content will be revealed over time. (Read the Mission’s white paper for a full account of this immense collection).

The timely question for me is this: how do you back up a language as knowledge for posterity? Knowledge can be either “know how” — practical skills such as shoemaking — or “know that” — propositional stuff, as in the summation of knowledge collected in a library. The word lists, grammar books or speech recordings sent off on LL are “know that” knowledge in the form of facts about language plus some linguistic performances. But can a human language reduce to a finite set of performances, in the same way as listable historical collections of stories, songs, paintings, films, and artefacts? Or is it “know how” — more like Chomsky’s infinity of sentences capable of being generated by a specific grammar embodied in human brains as a competence?

If the latter, then we could theoretically capture a representation of language competence in a digital medium. Not by collecting existing texts, but by designing a grammar machine with the know-how to generate sentences in any language. Such a language generator could then tell the sentient beings of post-cataclysmic Earth 2.0 something more about their forebears’ linguistic powers than a mere set of facts and performances.
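To make the idea of a "grammar machine" a little more concrete, here is a minimal sketch in Python. It is only an illustration, not anything the Lunar Library contains: the toy grammar, its vocabulary, and the function name `generate` are all invented for this example. The point is that a handful of finite rules, one of them recursive, can produce an open-ended set of sentences.

```python
import random

# Hypothetical toy grammar (an assumption for illustration only).
# Each symbol maps to a list of possible expansions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["the", "N"], ["the", "N", "that", "VP"]],  # second rule is recursive
    "VP": [["V", "NP"], ["V"]],
    "N": [["speaker"], ["language"], ["library"]],
    "V": [["preserves"], ["remembers"], ["survives"]],
}


def generate(symbol="S", rng=random, depth=0, max_depth=6):
    """Expand a symbol into a list of words by picking rules at random."""
    if symbol not in GRAMMAR:
        return [symbol]  # a terminal word: emit it unchanged
    options = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth limit, take the shortest rule so expansion terminates.
        options = [min(options, key=len)]
    words = []
    for part in rng.choice(options):
        words.extend(generate(part, rng, depth + 1, max_depth))
    return words


if __name__ == "__main__":
    for _ in range(3):
        print(" ".join(generate()))
```

Raising `max_depth` lets the same five rules yield ever longer sentences, which is the sense in which a competence differs from a fixed list of performances.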

Third, can a language be automatically futureproofed? Yes, insofar as its speakers can adapt and change with the environment. That is how we can trace today’s Silicon Valley English back to some Indo-European forebear. But languages also die out, alas, because their speakers either prefer, or are forced to speak, a new, unrelated tongue (due to conquest, slavery, etc.), or are annihilated.

There is presumably nothing intrinsic to one language’s structural features that condemns it to age and die more quickly than another. The languages represented in the LL will in nearly all cases have already evolved over thousands of years and are surely likely to disappear (die out or massively morph) on planet Earth in the next million years anticipated by the project. By then, most items in the Library dictionaries will have been retired or replaced by some new lexeme. So what would anyone do with such a hoard of words, apart from wonder at the past and study it as a semantic graveyard, as we do with Akkadian or Vandalic today? Is the destiny of our languages to become Ozymandian epitaphs?

As it happens, 2019 has been declared the International Year of Indigenous Languages by the United Nations, highlighting both the raw truth about language loss and the positive efforts at revitalization, often via some form of an existing backup located in a single human brain. There are hundreds of such projects underway.

Yet we can only truly revitalize a language by embedding it in the creative competences of human beings, not by reducing it to recordable, repeatable performances and backing them up on disc or packaging them into a learning app. Our capacity to re-energize a dying language has so far solely been predicated on our human linguistic competence.

Yet as we tiptoe into the AI age, we may well start, as suggested above, to investigate whether a combination of data and algorithms could help language revitalization by training models on existing data to predict new utterances in a “creative” way, and back up a language by evolving a physically-grounded robot (a synthetic cogitator-speaker-listener) with the appropriate competence to teach humans how to speak it, eons from now.
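As a very rough illustration of what "predicting new utterances" from existing data can mean, the sketch below trains a simple bigram (word-pair) model on a tiny invented corpus and samples from it. The corpus, the function names `train_bigrams` and `sample_utterance`, and the `<s>`/`</s>` boundary markers are all assumptions made for this example; a real revitalization effort would need far richer data and models.

```python
import random
from collections import defaultdict


def train_bigrams(sentences):
    """Record, for every word, the words observed to follow it."""
    model = defaultdict(list)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev].append(nxt)
    return model


def sample_utterance(model, rng=random, max_len=20):
    """Walk the model from the start marker, choosing observed successors."""
    word, out = "<s>", []
    for _ in range(max_len):
        word = rng.choice(model[word])
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)


# Hypothetical recorded "performances" (invented for illustration).
CORPUS = [
    "the elder tells the story",
    "the child hears the story",
    "the elder hears the child",
]

if __name__ == "__main__":
    model = train_bigrams(CORPUS)
    for _ in range(3):
        print(sample_utterance(model))
```

Some of the sampled utterances never occur in the corpus, yet every adjacent word pair in them does. That is a long way from competence, but it shows how recorded performances can be recombined rather than merely replayed.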

According to this narrative, an LL-type project could try to back up human languages as a series of synthetic tongues to resist language death, rather as we might try to edit the human genome in an attempt to improve our survival rate against viruses using synthetic biological techniques.

One very primitive forerunner of such an application is IBM’s Project Debater (PD), which coincidentally was also developed by Israeli researchers. PD can utter “new” sequences of language in argumentative speech acts that make sense to humans. This suggests it is endowed with some proto-synthetic (i.e. non-human) form of language competence. At the same time, you can back it up, whizz it around the solar system on a disc, and bring it back to tell the tale — for example, to another debater robot.

So perhaps we are getting a little closer to producing a new condition of language whereby a type of linguistic competence, not just parroting performance, can be engendered outside of human brains, and embodied digitally. In that case, we could revitalize dying languages by gradually inventing synthetic versions of them all, learnt not just by humans but also by what today we still call machines.

From this viewpoint, the ultimate back-up of human civilization would not simply be a hoard of knowledge about past human (linguistic) performances, but a set of instructions for dynamically producing an infinity of new machine performances (civilization as a generative algorithm!). Is that what has been secreted in the Lunar Library’s mysterious Vault?