Something, somewhere, bit by bit: the language numbers game

Andrew Joscelyne
7 min read · Mar 21, 2023

The Oscar-winning “Everything, etc.” film title tries to cover the whole works. This little commentary aims lower. As we constantly map how languages are used over time, especially on the internet, we run up against two key questions: firstly what can we measure about a given language or quantitative practice, and secondly what should we be measuring better if we want to understand the quality/value of languages as evolving human phenomena in an exponential tech environment? Here are eight thought flashes on the value of data about language (not “language data”!).

a) Facts about comparative language structure: This is linguistics stuff. It refers basically to “grammars” as descriptions of forms that can then be compared to those for “similar” languages within a family or between families.

A typical measuring purpose would be to relate structural linguistic complexity to such phenomena as speed of learning, uttering, and comprehending; vocabulary formation; different degrees of mastery per age group; speed of change in grammar rules over time, etc. All sensitive issues.

Practical utility? Low: some types of language structure are probably more likely to evolve socially/demographically in a significant way, or evolve to express rare subtleties of thought, as in the S. American language Aymara’s supposed three-value logic. But is there any structural reason why a given grammar type or language is more robust, evolutionarily speaking, than another?

b) Facts about comparative language media: Languages may or may not come with a writing system, plus all the paraphernalia of grammatical descriptions. How many such systems are deployed? Do these languages (or is it cultures?) use signing as well, and are they all equipped with an ISO language code for digital applications including emojis to ease learning, or overcome endangerment?

Results can be listed in an atlas of comparative scripting types, and the data could spur further explorations. For example, is there any connection between writing systems and economic and educational development? Why do some national languages, such as Turkish and those of some former Soviet states, change their writing systems? What about the impact of writing policies on psychology, skill sets, and socio-economic development generally? The comparative mastery of multiple writing systems in certain cultures, and the production of language aids such as dictionaries and script enhancement, are equally interesting.

“Writing” is a fast-moving field these digital days, as we plan to provide endangered, currently scriptless languages with writing codes, now that all language expression can be represented in a symbol system of 0s and 1s. But which values do you prioritize when choosing a useful visual system, and based on which evidence? Can you ask an AI to design a script?

c) Comparative “size” of vocabulary: Little is really gained by comparing this dimension of languages. By the time you’ve collected any sizable data, it would almost certainly be out of date! New words emerge all the time, and others disappear (but how often and fast?) from standard use. We keep records of neologisms in dictionaries or online, but they cannot be up to date. Vocab in any case is an easy topic for any speaker to play with.

You could try to map terminology borrowings and compare them against neologism creation or borrowings over time. But does this matter? Vocab creation seems to be the simplest thing in the world, and most families do it. But lexeme penetration across a language will likely divide into age and interest groups, as the uncertain destiny of emojis shows.

d) Number of speakers of a given language: Always used as a handy metric when planning for resource building, education, print runs, film dubbing, software localization, etc. There are two different issues here:

1) How many speakers of language/dialect X are there in a given school, city, province, national population, or even plane passenger list? How many of these are sensory-disabled? Or multilingual? What proportions of male vs. female speakers? These data may be needed to develop useful language learning content, to help companies translating literature or films for local populations (is market size worth the cost?), or to publish legal and emergency messages, and so on.

2) How many speakers are there in toto for language/dialect X or Y around the world? This would be extremely hard to measure — what is a speaker of a given language and are they anchored in some sort of primary geographical space? They could be gendered, fluent or learners, bi-/multi-linguals, under/over the age of four, or deaf/dumb/blind, etc. Or tomorrow, they could simply be a large language model! Obviously we try to use rounded figures to help determine global investments in resources for each language (in education, translatability for commerce, etc.) but global numbers blind us to the infinite specifics that define quantitative language behavior (i.e. speech habits, learning materials, web presence, etc). Our data for all this could at least be gendered and age-bracketed to improve their utility. And why not have some universal language identification system on national ID cards to help authorities plan and communicate better in the case of accidents, frontier crises, migrations, etc?

One of the strangest developments we shall have to measure in future is the number/nature of AI devices that “speak” a language, using frozen language models that replace the constant variation of real human speech production with a semi-static repertoire of infinitely repetitive talk and text.

e) Languages per country: This is another old “numbers” favorite as it is easy to find several stable countries as a starting point. Even so, there is very little discussion of the value of measuring such a phenomenon or any idea of what to do about it if the inhabitants find their radical multilingualism (or its lack) a burden rather than a benefit.

Papua New Guinea is often cited as the country currently containing the most “languages” spoken: 840 according to Ethnologue, for a population of under ten million. Obviously, the number of languages spoken in a country will change over time as tongues/speakers die or new migrants arrive and new pidgins and creoles emerge. And there are always bi- and trilingual speakers in most countries around the world, so matching languages to speaker communities/age groups is never simple.

Yet there is potential value in making knowledge about language behavioral data available to governments, to meet/avoid new social and educational challenges as migration increases, populations evolve, tech encourages free-range text production, and more individuals wield linked digital devices to communicate globally.

In the UK, for example, sociopolitical and generational shifts are currently influencing the re-emergence of Scots, Welsh, and Cornish as languages of record, and digital technology will likely facilitate their integration into some application domains over time, creating a sort of “bilingualism” as an emerging local feature. But from how long ago do we start measuring this phenomenon, and how long before we give it some kind of statistical importance? Is this a political, sociological, or purely “scientific” (i.e. linguistic) factoid?

As an example, what do we make of the UK city of Manchester’s current linguistic profile? The annual School Census has reported upwards of 150 different languages used as pupils’ ‘first languages’, and interpreter requests in the healthcare sector show regular demand for around 120 different languages. Is this a sign of more sophisticated language “choice management” to come (e.g. in secondary/tertiary language education)? Or a challenge to boost teaching English as the “preferred” tongue?

f) Number of languages worldwide: Everyone in the language business is now familiar with this iconic data point: Planet Earth resounds to about 7,000 spoken languages, of which 50 to 80% are due to disappear within 20 to 50 years. Such widely published “scare” figures in fact vary considerably, depending on what we mean by “language” and numerous other factors.

The exact number doesn’t matter for any specific task ahead. But according to Glottolog, there are some 8,572 languages spoken on the planet, which means that on average, a given language would be spoken by a population of around 932,780 people! This simply shows that our major 300 or so languages (Wikipedia’s market) are extremely well populated, while there are x4 more long-tail tongues in real danger. Meanwhile Google is targeting a speech model for a thousand tongues.
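The per-language average above is plain division, and it is worth seeing how little it says. Here is a minimal Python sketch, assuming a world population of roughly 8 billion (the figure behind the quoted average is not stated in the text; it is inferred from that average) and using purely illustrative function names:

```python
# Sketch: the arithmetic behind "average speakers per language".
# Assumption: the world population values below are round, illustrative figures.

GLOTTOLOG_LANGUAGES = 8_572   # language count quoted in the text
QUOTED_AVERAGE = 932_780      # per-language average quoted in the text


def mean_speakers(world_population: int, language_count: int) -> float:
    """Mean speakers per language if people were spread evenly over languages."""
    return world_population / language_count


def implied_population(average: int, language_count: int) -> int:
    """World population that a given per-language average implies."""
    return average * language_count


if __name__ == "__main__":
    for population in (7_900_000_000, 8_000_000_000):
        avg = mean_speakers(population, GLOTTOLOG_LANGUAGES)
        print(f"{population:,} people -> about {avg:,.0f} per language")
    implied = implied_population(QUOTED_AVERAGE, GLOTTOLOG_LANGUAGES)
    print(f"Quoted average implies about {implied:,} people")
```

A mean like this assumes speakers are spread evenly across languages, which is exactly what real languages are not, so it works better as a sanity check than as a planning figure.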

Big-number languages are distributed over nations with very varied histories, geographies and economies, so it is hard to draw new explanatory conclusions about language survival vs. endangerment from simply reading these figures.

We have only been counting this data point since the early 20th century, yet there is a steady decline in language variety, and a steady increase in speaker numbers for fewer and fewer languages, as the global population expands. Will there be an endpoint when a single (extremely dialect-rich) language will survive almost everywhere? And then split again into local tongues to start a new cycle?

g) Speed of language spread on social media: Knowing how fast and extensively new batches of languages come to be used across various social media is helpful data for advertisers, writers, and automated translation services such as Google or Baidu, which target both a growth in supply and in advertising income. It would also be interesting to learn how many services push message translation, rather than let the system react to reader pull. Yet when big language models begin to drive artificial language production on a massive scale, individual users may want the program to use their own language style (!) when composing automatic text in response to a prompt, not just the average data mashup.

h) Numbers of mobile phones: The best guide to quantitative influences on language futures may be the figures for mobile phone features, ownership volume, and content. This is because much of the emerging technology now impacting language practices (using LLMs, etc.) will end up tooling phones in various ways. Currently, there are some 17B mobile devices, set to rise to 18.2B by 2025 for a world population of around 7.9B people — a global average of two phones each! But this quickly shows the limits of statistics when thinking about language. Will phones support all possible or “most” writing/speech systems for input? Will they provide good-enough translation? Or draw upon new eloquent chat and search apps? And therefore expand but also uniformize collective intelligence — or split into rivals and stoke new types of linguistic competition?
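The "two phones each" figure can be checked with a few lines of Python. This is only a sketch using the numbers quoted above; the function name is illustrative, and holding the population constant for the 2025 projection is a simplifying assumption, not something the source states.

```python
# Sketch: devices per person from the figures quoted in the text.
# Assumption: population is held at 7.9B for both years for simplicity.

WORLD_POPULATION = 7.9e9   # quoted in the text
DEVICES_NOW = 17e9         # quoted current mobile device count
DEVICES_2025 = 18.2e9      # quoted projection for 2025


def devices_per_person(devices: float, population: float) -> float:
    """Global average number of mobile devices per person."""
    return devices / population


if __name__ == "__main__":
    for label, devices in (("now", DEVICES_NOW), ("2025", DEVICES_2025)):
        ratio = devices_per_person(devices, WORLD_POPULATION)
        print(f"{label}: about {ratio:.2f} devices per person")
```

As with speakers per language, the average says nothing about distribution: it cannot tell us who holds several devices and who holds none, or which languages those devices support.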

Finally, will these phones be secretly listening to everything, everywhere, all at once, as we all imagine? If so, what next?
