Community vs. Language Autonomy: Understanding the Revitalization Challenge

At the heart of the upcoming UNESCO decade of endangered-language revitalization projects, there’s one obvious yet curious fact: we can extract a “language” from its community setting of dialogue and narrative flows, and manipulate it as an autonomous piece of knowledge. How does a finite code of signs, with rules of syntax and pronunciation, compete with — or complete — our everyday talk and texting?

A chunk of structured knowledge called “Quechua”, for example, contains standardized information about a given language — not how to speak or use it, which is a learnt ability. This intellectual construct is really a way to “autonomize” a language: separating it from its natural embedding in speaking and understanding, and treating it as an object.

Note that this object of knowledge is not the same as data, though the two are related. A knowledge graph of Finnish would tell you how all the language’s constituent parts are structured and related. But a vast data stash of recorded Finnish — conversations, narratives, official documents, novels and poems, speeches, blogs, etc. (think of the billions of English words driving a software product such as GPT-3) — is not a “language” you can learn. It is the raw material that can help transform strings of text signs into content you can know.
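The distinction can be made concrete with a toy sketch: structured linguistic knowledge is often modeled as subject–predicate–object triples (the basic shape of a knowledge graph), while raw data remains unanalyzed strings. The Finnish words and relations below are purely illustrative, not drawn from any real linguistic resource.

```python
# Raw data: just strings of recorded language, nothing queryable.
raw_corpus = [
    "Hyvää huomenta!",   # a Finnish greeting, stored but unanalyzed
    "Kiitos paljon.",
]

# "Autonomized" knowledge: explicit, queryable relations between
# linguistic categories (all triples here are illustrative examples).
triples = [
    ("huomenta", "is_form_of", "huomen"),
    ("huomen", "part_of_speech", "noun"),
    ("huomenta", "case", "partitive"),
    ("hyvää", "agrees_with", "huomenta"),
]

def query(predicate, store):
    """Return all (subject, object) pairs linked by a given predicate."""
    return [(s, o) for s, p, o in store if p == predicate]

print(query("case", triples))   # structure you can look up...
print(raw_corpus[0])            # ...versus text you can only read or hear
```

The point of the sketch is the asymmetry: the triple store can answer structural questions ("what case is this form?"), while the corpus can only be read, however large it grows.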

Understanding language autonomy

So the key property of this autonomization process is a capacity to extract structures from the constant flow of spoken fragments, fashion them into an ensemble of categories — on paper in the past, now digitally — and then manipulate them independently of their human community of origin and purpose.

There are numerous language autonomization projects in the pipeline today focused on documenting endangered languages and transforming their findings into dictionaries and grammars by constructing knowledge from text, interviews, audio and video data recordings. They aim to ensure long-term preservation (e.g. the Long Now’s Rosetta Project) in the form of encyclopedic knowledge imprinted on durable media. Some of these languages are even destined for outer space in case there’s a catastrophic forgetting event on Earth.

So languages lead a double life: they are both the familiar physical and behavioral phenomena we produce and experience over time within our communities, and abstract chunks of (timeless?) knowledge with a data lifestyle of their own. How does this double act influence the way we can engage with UNESCO’s forthcoming major language revitalization and speakers’ rights project?

Weaponizing communities

This decade of actions (2022 to 2032) will be about empowering communities, not just their languages. It involves (re)creating the social, cultural, and economic conditions for communities to rightfully use all their languages as a rich natural resource for their life projects.

It also involves resuscitating a community’s endangered language — i.e. one which has very few if any child native speakers today. The figures vary considerably, but 46% of extant languages are considered likely to “disappear” in the next 80 years unless action is taken.

All human communities have members that typically speak different languages as well as one or more shared tongues. Just think of your own network of friends and colleagues — lots of languages used, some shared and others not. Narrow the size of a community enough and you will almost certainly find a group who all speak one shared (vehicular) language, plus bits and pieces of other, sometimes disappearing or even banned, languages.

Economic development, education policies, and brutal turf wars tend to be the main determinants of how many different languages are spoken within a given community. Languages disappear, however, because their speakers are forced to change languages: either because a political or natural catastrophe eliminates a speaker population or, more often, prohibits a language from being used, or because one of their tongues is gradually replaced as a viable communication vehicle.

All kinds of multilingual configurations are therefore possible in communities. But in each case, we can theoretically document the different tongues used, record them on tape, digitize and hence “autonomize” them into a knowledge base, and analyze them objectively far from their natural home.

Speakers pass away, and when memories of a language grow dim, we feel the loss. When we benefit from active (possibly artificial) memories, a language can feel familiar, graspable, learnable again. Above all, languages as autonomous knowledge architectures will provide a replacement independently of the voices of their speaker community. Stored in another medium, possibly in another geography. As a result, any of these languages could theoretically be rewritten as a partial knowledge graph that in due time might drive a digital machine anywhere in the world.

By giving languages this curious autonomous identity, communities can however be weaponized to invent new holistic strategies, laws and guidelines to help them teach, expand, enrich, share, and narrate their places, stories and ambitions in their indigenous tongues. But that same language’s association with a sacred place or tribal story will inevitably be broken as it becomes part of humanity’s universal knowledge.

The fundamental enabler of revitalization will of course be a mix of institutional, social, and personal agendas, not just engineering with linguistic knowledge in our sense. Yet as endangered languages become partially autonomous through documentation and recording, they will gradually join other tongues in the planet’s growing collection, forming dynamic networks of analytic data that will inspire and improve further knowledge work.

In the saddest cases, languages will only continue to exist as a half-dead data structure on a network, extracted from their original culture, and transformed into knowledge objects. Saved from extinction…but lacking indigenous speakers. Like Latin spoken in today’s Roman curia.

Benefits and dangers of autonomization

1) Language as documents

Historically, a key intellectual operation first developed in the West has been “grammatization” (see Auroux), or the writing of grammars of languages — something we now take for granted, though not all cultures share it. This “technology” was first developed in Europe in the 15th and 16th centuries, after the invention of moveable-type printing, when a few early thinkers began to rationalize the description of languages as a natural phenomenon, usually using the ancient Greek categories first set down around 100 BCE by Dionysius Thrax.

So scholars began to write “grammars” and dictionaries or word lists for ancient and “modern” languages, aware that people could learn languages as forms of knowledge. This applied notably to newly encountered unwritten languages, such as those the Spanish church met in South America from the 16th century onward in its efforts to localize Catholic teachings for the Amerindian populations.

One of the effects of grammatization as a function of a literate culture is that language content inevitably becomes a document of some sort. The Bible for example has been a defining written document — the book — in many colonial encounters. We have inherited document technology as the best way to embody practical knowledge and data about language, and in fact everything else. A “grammar” is a Bible for students learning a language.

An oral culture will naturally not dissect the languaged world into “documents”. Storytelling, praying, and holding court can be recorded by anthropologists as audio or video content, but they obviously cannot function in situ as objects that you can look up, edit, abbreviate, correct, collect, translate, challenge, enumerate, and so on.

In the future it will be possible to build digital knowledge assistants for endangered languages, which can help in the process of language learning for the outside world and possibly provide better revitalization support for the community. But in practice, speech — vocabulary, pronunciation, usage — varies widely among speakers, and changes constantly in small ways, so these changes would never be included in the documentary record of an autonomized language.

A robot interviewer could be activated at any time by a community language speaker who could tell it more about life and discourse within the community. The output would then feed into the “documentary” work of actually describing the language. I would expect some sort of automated active listening and questioning solution for building endangered language knowledge to be on the tool agenda for UNESCO’s decade of projects.

However, some communities would clearly reject this sort of intrusion. They would not want their language to be shared with outsiders — especially robots — as it is used to recount sacred stories and teachings that are for their own ears only. Others, however, might accept that knowledge of their language should become part of Earth’s patrimony, and could ultimately feed into new types of virtual document media that we can’t even guess at today, ensuring a new opportunity for survival and growth.

2) Translation builds communities of knowledge

One classic argument for revitalizing indigenous languages is that we should be able to record and share new forms of practical knowledge expressed by speakers — their ecological traditions, their complex relationships with the natural world, and their close observations and understanding of the dependencies between the environment and human life, especially during the current biodiversity crisis. If the language disappears, then detailed insights, observations, and references about nature, disease, and local experiences expressed in that language will evaporate too.

The most efficient way to preserve that knowledge in such a fragile context will be to translate it into more sustainable, widely-spoken languages that can reach larger populations via “documents”. Translation is a proven human strategy for preserving information across tongues and thereby broadening the range of shareable content for humanity.

One outstanding example from the history of religion: over eight centuries, starting around the 1st century CE, the Sanskrit or Prakrit versions of the vast Buddhist Tripitaka canon (70 times the size of the Bible) were translated into Chinese, then Japanese, Malayalam and Vietnamese. The Chinese Tripitaka was eventually printed in 972 CE using wooden blocks, long before printing began in Europe. Think of the knowledge about languages, let alone about Buddhism, that was explicitly or implicitly learned during that long translation story.

In the context of today’s revitalization projects, the benefits of translation should work both ways — into and from. Knowledge useful to the endangered community from outside can be gradually translated into their endangered language, thus expanding its own range and introducing useful and no doubt challenging new content to the community.

This is particularly relevant if younger people are to be raised and educated in the language, rather than accepting the idea that their endangered community tongue can only be used for introspective ritual utterances, secret teachings, or insider story-telling.

And conversely, the rest of the world can learn more about that culture’s specificities by translating some of its knowledge and experience into wider-access languages as documents of record.

One last, but vital suggestion: the revitalization drivers must learn how to hold (online?) meetings between different communities in the field, aided by translation/interpretation. This would give scattered communities an unusual opportunity to share ideas, learn about each other’s plans and techniques, and build solidarity, rather than interacting solely with outsider knowledge workers. It would also test translation logistics and incite more inter-revitalization translation activities.

A viable “knowledge” translation project for all these communities should naturally go beyond simply localizing commercial-type content into endangered languages; it should pivot towards a universal vision of “knowledge exchange” in which relevant content is carefully adapted and delicately languaged for a community that wishes to interact with the world. This could open up an interesting opportunity for more widespread innovation in both translation practice and tooling.

3) What gets lost because of language autonomy?

Although language data can become generative — i.e. it can act as fuel for a human language learner or a machine to create new content in that language — only a subset of a community’s actual content will probably ever be recorded, datafied, and classified into documents.
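What “generative” means here can be shown with a deliberately tiny sketch: a bigram (Markov-chain) model built from a few recorded words, then walked to produce new strings. The corpus below is invented for illustration; real generative systems are vastly larger, but the principle — recorded data fueling new output — is the same.

```python
import random

# A tiny, hypothetical "recorded corpus" — stand-in for field recordings.
corpus = "the river speaks the river remembers the people remember the river".split()

# Build bigram transitions: each word maps to the words observed to follow it.
transitions = {}
for w1, w2 in zip(corpus, corpus[1:]):
    transitions.setdefault(w1, []).append(w2)

def generate(start, length, seed=0):
    """Walk the bigram chain to produce a new word sequence."""
    rng = random.Random(seed)       # seeded for reproducibility
    words = [start]
    for _ in range(length - 1):
        followers = transitions.get(words[-1])
        if not followers:           # dead end: no observed continuation
            break
        words.append(rng.choice(followers))
    return " ".join(words)

print(generate("the", 6))
```

The model can only ever recombine what was recorded — which is precisely the limit the paragraph above describes: unrecorded usage stays outside the generative loop.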

In fact an oral community will only be able to access content it can remember, which means that there might be natural neural limits on the amount of historical content a given community could access. Even though new writing systems are gradually being introduced to previously oral-only communities, this limit on memory will remain.

So there will always be unnoticed, intimate aspects of language behavior that will never be documented — how children use language at different ages, how women and men, old and young, may use language differently to achieve different outcomes, and handle their language or modify it in special ways for specific end uses; how traditional stories about the body, sexuality, health, birth or death vary as they change hands from one generation to the next.

These features of any community’s practices may never be remembered in detail. Nor can we easily document the deep physical pathos, engagement, or joy of all those individual or collective language experiences. They remain only as memories of performances — brief film clips — not demonstrable knowledge.

In the same way, speakers will sometimes remember how their actual language has changed over their lifetime, even if there is no evidence apart from human memories. They might also be able to recall the coining of new words to identify new realities, or how transformative events in the community left their impact on the language.

Yet once again these events are part of the embedded, physical experience of the community, and therefore difficult to transform into external knowledge that can be usefully framed through language autonomization into documents.

So the intimate life of a language and a community lies inevitably beyond both the oversight of the law and the reach of structured description. This in turn makes you wonder how much of our own experience of language and through language goes beyond the kind of data and documents we are all busy collecting today in order to surveil sentiment in speaker/user populations, translate more content, and build more robust economies.

We can only hope that engagement with various technologies of the word — data, knowledge graphs, documents, and machines — will stimulate further reflection on the real value, dangers, and best practices of constantly transforming human talk into autonomous “language” to achieve a form of ground truth.