Language value vs. language data

Andrew Joscelyne
9 min read · Aug 8, 2023

As we constantly re-evaluate the way languages are used over time, especially on the internet in this generative-AI moment, we encounter two questions: first, what can we actually measure productively about a given language or linguistic practice? And second, what should we be measuring better if we want to leverage the quality and value of languages as a positive dimension of our human communication ecosystem?

Here are nine thought flashes on data about language value (but not “language data”!).

a) Facts about comparative language structure: This is basic linguistics information: a “grammar”, i.e. a description of linguistic forms that can then be compared with those of other languages within a family, between families, or over time.

How is this helpful to global understanding or communication, beyond providing basic descriptive information? We can relate structural complexity in language to such phenomena as speed of learning, utterance, and comprehension; vocabulary formation and differing mastery across age groups; the speed of change in rules of grammar over time; and so on: all those parameters concerning languages that practitioners don’t typically think about. But attention to these could inform some longer-term trends in human development.

Practical utility? Not sure: are some types of language structure more likely to evolve in socially or demographically significant ways, or to express specific subtleties of thought, as in the South American language Aymara’s supposed three-valued logic? Is there any structural reason why a given grammar/language is more widely spread, or more robust evolutionarily speaking, than another? We have always believed the answer is no: “all languages are equal” is the watchword. But with much more of these data, we might start asking effective questions about the value of different structures in practice.

b) Facts about comparative language media: Does a given language come with a writing system or not? How many different systems are deployed? Do all languages (or is it cultures?) use signing? Will they all have an ISO language code for digital applications? And how do these media facts bear on emojis, ease of learning, the likelihood of overcoming endangerment, and so on?

Answers to these questions could be listed in an atlas of comparative writing/media types, and the data could be linked to further explorations. For example, is there any connection between writing systems and economic and educational development? How many language groups change their writing systems (think of Turkish in the 20th century, and former Soviet republics after 1991), and why? What is the impact of script options on psychology, or human skill sets, etc.? How about the comparative mastery of multiple writing systems in certain cultures, or innovation in producing language aids such as dictionaries, new emojis, etc.?

Media adaptation is a fast-moving target these digital days, as we plan to start providing some endangered languages with writing systems and, hopefully, codes for computing. How do you choose which system, and who chooses? Will they, or should they, always tend towards the alphabetic rather than the logographic end of the spectrum?

c) Comparative “size” of vocabulary: Does anyone analyze this sort of factoid? By the time you’ve collected any data in a living language, it will almost certainly be out of date! New words emerge all the time, others disappear from standard use, and dictionaries do not capture them all.

If one language has a huge vocab for certain medical domains, whereas another doesn’t, is this significant? What if one language borrows all its words for automobile repair, whereas another decides to invent its own? Can field data about these choices be used to prove (or improve) anything? We shall have to wait until the results of India’s effort to localize science and medicine teaching into its 22 official languages have come in.

You could, for example, try to map term borrowings and compare them against neologism creation. But do such data really matter, beyond the bare fact of making a choice? Human vocabulary creation is a generative phenomenon that never stops – just think of the neologisms that have spread across social media in the past two decades! Should governments ever have a say in neologism creation, or should it flow freely from local usage?

In the early Renaissance, copia (abundance of vocab items) was an issue, for example in England: did the English language have enough words (in comparison with Latin and Greek) to describe emerging reality? This concern probably emerged from observations about translating Latin into English as English became much more frequently used in local legal documents and knowledge communications. Interestingly, the term copia meant both “fullness”, as in copiousness, and also copy: a full or complete transcription of another text, or the student’s copy-book.

d) Number of a given language’s speakers: Always used as a handy metric when planning for resource building, education programs, print runs for literature, writing software, awarding an international standard, translating marketing info, website localization, phone apps, etc. There are two different issues here:

1) how many speakers of X language/dialect are there in a given unit — school, city, province, or country population, or even plane passenger list or prison population? These data may be needed to develop content for language learning devices and courses, or for companies translating factual information, literature or films for local populations (is market size worth the cost?), or publishing emergency public health and safety messages;

2) how many speakers or users are there in toto worldwide for language/dialect X or Y? This would be extremely hard to measure – what is a speaker of language Z? Fluent, learner, bi-/multi-lingual, over the age of four, deaf/dumb/blind, living anywhere in the world, etc. Obviously, we try to use a rounded figure to help determine global investments in resources for that language (in education, translatability for commerce, media consumption, medical supplies, emergency measures, etc.) but global quantities tend to blind us to the infinite specifics that define quantitative language behavior (speech, understanding, learning materials, web access and presence, etc.) in local communities.

e) Languages per country: This is another old favorite, even though there is typically very little discussion of the prospective value of measuring such a phenomenon, or any idea of what to do about it if the inhabitants, for example, find their own radical multilinguality a burden rather than a benefit.

Papua New Guinea is often cited as the country currently containing the most languages spoken natively within its borders: 840 according to Ethnologue, for a population of under ten million. But we instinctively want to know how they all manage the tasks we associate with language usage inside countries! Obviously, the number of languages spoken will change over longish periods of time as tongues disappear or new pidgins and creoles emerge: some communities die off or are killed off, governments stipulate usage laws, and new migrants bring in new languages. And there are always bi- and trilingual speakers in most countries around the world, so matching languages to speaker communities/age-groups is never simple.

Yet there is huge value in making knowledge about these multi-language experiences available, in order to address or avoid new political, social, and educational challenges as populations evolve in a more mobile, crisis-ridden world. Language multiplicity, for example, comes with considerable extra costs that need to be properly planned and budgeted for.

In the UK, socio-political generational shifts are currently influencing the re-emergence of Scots, Welsh, and Cornish as languages of record, and digital technology will likely facilitate some sort of integration into major application domains (e.g. education) over time, creating a new state of multilingualism as a national value. But how can we usefully measure this phenomenon, and how long before we give it some kind of statistical importance? Is this a political, sociological, or purely “scientific” (i.e. linguistic) fact?

At the same time, what do we make of the city of Manchester’s current linguistic profile? The annual School Census has reported upwards of 150 different languages used as pupils’ first languages, and interpreter requests in the healthcare sector show regular demand for around 120 languages. Surely a signal of more comprehensive language “management” to come, and of a more pressing need to evaluate alternatives, although there may not be much valid parallel data offering examples of effective action.

f) Number of languages worldwide: Everyone is familiar with this iconic data point: Planet Earth harbors about 7,000 languages, and 70% are due to disappear within 20 years. These widely published “scare” figures in fact vary considerably, depending on what we mean by “language” and numerous other factors. The value of this information is to alert us all to loss, endangerment, and mistreatment. But do these data actually make an impact?

The exact number of languages does not really matter for any specific task ahead. In fact, according to Glottolog, there are some 8,572 languages found on the planet, which means that, on average, a given language would be spoken by a population of around 932,780 people! That average, of course, conceals the standard story: our major 300 or so languages with their multi-millions of speakers are extremely well populated today, while the end of the long tail is highly endangered.
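For what it’s worth, that headline average is easy to reproduce. Here is a minimal back-of-the-envelope sketch in Python; the world population figure is my assumption (roughly 8 billion in 2023), and the language count is the Glottolog figure quoted above, so the output is illustrative rather than authoritative.

```python
# Back-of-the-envelope check of the "average speakers per language" figure.
# Both inputs are rough estimates, not verified counts.
world_population = 8.0e9  # assumed world population, ~8 billion (2023)
language_count = 8_572    # Glottolog count cited above

average_speakers = world_population / language_count
print(f"Naive average: {average_speakers:,.0f} speakers per language")
# -> roughly 933,000. The real distribution is heavily skewed: a few
#    hundred languages account for billions of speakers, the long tail
#    for very few.
```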

However, big-number languages are distributed over nations with very varied histories, geographies, and economies, so it is hard to draw many useful explanatory conclusions from just the figures.

Nor have we been counting this data point for very long (since the mid-20th century?). Yet there is clearly a steady decline in “traditional” language variety, though a steady increase in speaker numbers and linguistic creativity for fewer and fewer languages, as the global population grows and interconnects more. Is there an endpoint when one (highly dialect-rich) language will survive everywhere? Who is doing the modeling for these scenarios?

g) Speed of language spread on social media: Knowing how fast and extensively new batches of languages come to be used on various social media or communication networks would be useful data for advertisers, writers, and translators, among others. Language-specific marketers, trainers, influencers, educators, and the rest will also need to know.

Big automated translation services such as Google, DeepL, and Baidu are focused on growing supply, new services, and advertising income. It would be interesting to learn how many services push message translation rather than letting the system react to reader pull. Digital language data (examples of actual usage), on the other hand, are available for rapid analysis and sociolinguistic deciphering in increasingly vast quantities.

h) Mobile phone data: One of the best guides to quantitative influences on language futures is probably to track mobile phone features, volumes, language aids, variety of content, and sales. Much of the emerging technology impacting general language issues (e.g. generative tech) is being channeled into phones. Current figures suggest there are some 17B mobile devices, set to rise to 18.2B by 2025, for a world population of 7.9B people: an average of more than two devices each for everyone! But this also quickly shows the limits of statistics when thinking about language. Can all phones provide access to any writing system for input? Can they recognize all spoken input languages? Will they translate speech well enough, and therefore be usable outside a small circle of acquaintances? And will these phones be secretly listening to everything, everywhere, all at once, as we fear?
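The per-capita arithmetic behind that claim is trivial to check. A quick sketch, using the device and population estimates quoted above (which are themselves rough projections, not verified counts):

```python
# Devices-per-person behind the "two phones each" claim above.
# All three figures are the rough estimates cited in the text.
population = 7.9e9        # world population estimate
device_counts = {
    "today": 17.0e9,      # current mobile device estimate
    "2025": 18.2e9,       # projected count for 2025
}

for label, devices in device_counts.items():
    print(f"{label}: {devices / population:.2f} devices per person")
# -> about 2.15 today and 2.30 by 2025, i.e. "two phones each" on average,
#    which says nothing about which scripts or languages those devices support.
```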

i) Using language usage data about/for international aid efforts: Here’s my promo: CLEAR Global is a worthy organization that emerged from TWB (Translators without Borders) and is dedicated to providing services to all kinds of international players involved in language services beyond the commercial language industry. Their GLDR (global language data review) raises awareness of, and helps overcome, language barriers that prevent people who speak a marginalized language from accessing relevant programs and services. It provides planners with data on the languages spoken in the places they work so they can develop more inclusive, evidence-based programs. CLEAR Global recently found that only 24 of the 88 countries examined provide language coverage good enough to make development efforts effective. Having documentation in relevant languages is a necessary, if not sufficient, condition for program effectiveness.

We are obviously plunging deeper into a data age, but lots of those data about language as praxis are still hard to find and compare, or are simply non-existent. The tech field is changing fast; last week’s market or widget figures have already been forgotten, and it is hard to gain visibility and clarity over the entire range of new “generative” issues that concern the language sciences, knowledge management, the decision politics that impact speech and text behavior, and verbal creativity as a whole. Let alone more historical questions about specific trends over the past ten, twenty, or fifty years.

In a networked age, there can obviously be no single up-to-date repository for critical data on language usage, global language product sales, official figures on the socio-economics of national and international decisions concerning language uses, behaviors, technologies, projects and failures. This stuff is scattered all over the planet in many languages. We desperately need a bot that collects, updates, and responds to queries about the available information in this domain for us all.

So it might be a good time to start modeling the kind of intelligent data repository we need for advanced knowledge about language values in a socio-technical galaxy (as opposed to simply language as a spoken/written asset). This could be programmed to deliver regular bulletins on critical data issues concerning language facts and trends in the global economy, society, and technosphere. And then do the analytics to drive reactions, discussions, plans, and actions that target greater value all round.

And as a baseline, it should be able to express its results in… err… any language.
