The Challenge of Localizing Science-Writing

Andrew Joscelyne
9 min read · Sep 5, 2023

A steadily growing issue confronts the global science education and research community: the difficulty non-native speakers experience when reading and/or writing science content in English. How can we better address this brake on development?

English is currently the “preferred” language for international scientific research publications. However, as this article documents, there is widespread evidence of operational difficulties, especially slower reading and comprehension, for science content readers who have English as a second language (E2L).

This means that E2L usage in science can hinder rapid and effective knowledge access and production. How? It takes longer for an E2L user to digest and/or write a document properly. This cumulatively disadvantages non-native speakers/readers in various professional ways. Above all, there is a financial cost: time spent wrestling with understanding, manipulating, and producing language translates directly into money.

Leveraging local science knowledge

A second concern is more general: today’s scientific discourse, findings, and hypotheses need to be made more easily available — and therefore understandable — to far more individuals around the world.

Why? We are facing a global climate crisis; a massive ecological challenge to our forestry, mining, agricultural, and food policies; significant medical and disease problems; and major rethinks of topics ranging from nuclear fission and green energy to AI and space travel. Plus the complex moral and social issues raised by certain technological advances.

The future of sustainable solutions or alternatives in all these domains will largely be mapped through socially informed scientific projects. These in turn will be represented by shared knowledge graphs and other AI technologies that support mutual innovation in anything from phenomics to physics, astronomy to zoology.

This means we desperately need the contributions of new generations of local scientists, doctors, and engineers, as well as a broader information ecosystem of planners, teachers, journalists, and influencers. These can stimulate appropriate data collection initiatives or share experiments associated with local biodiversity, mineral, medical, and social phenomena, and then contribute to collective local/global science-based programs to inspire solutions.

Large-scale data collection, management, and access are clearly vital to this project. But better knowledge-sharing will help target the real needs on the ground. ‘Better’ here means derisking understanding: making more content in all domains easily accessible for a broader spectrum of users. For example, making it more digitally available, and therefore far more easily teachable and reusable for new generations of schoolkids, students, and the interested public.

This ambition will almost certainly benefit from linguistic engineering to help create, edit, and spread information equitably, and ensure maximum readability/accessibility for everyone concerned. There will also be an obvious, ongoing role for more intensive, automated, and timely translation into local languages for certain types of content.

Boosting readability, derisking comprehension

As a first step, how about improving the current circulation of knowledge in English by ensuring broader and quicker readability?

We all experienced a major public effort to put knowledge about a killer disease into language for everyone during the global campaign to share information on COVID-19 just a couple of years ago. This often took the form of public messaging in clear language designed to rapidly inform all members of various communities.

It is now possible to build reliable and secure “plain-language” editors that operate automatically on grammatical patterns and vocabulary to remove ambiguity and produce more easily understandable versions of many English-language scientific papers and similar content.

In other words, adapt content to local reading habits and information requirements by means of a dynamic visual layout and appropriate linguistic tools and resources.

For example, sentence length is a classic variable in determining the ease and speed of content understanding. E2L readers probably find it fairly easy to memorize the vocabulary of the sciences, medicine, and so on once they know what an item means and refers to. But more complex sentential or discourse syntax is much harder and can pose problems for rapid understanding and memorization. This linguistic barrier should be analyzed, derisked, and ideally addressed with an automated solution that can handle any text.
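
To make the idea concrete, here is a minimal sketch (not from any existing tool) of how a plain-language aid might flag overly long sentences for rewriting. The 25-word threshold and the crude sentence splitter are illustrative assumptions, not established readability standards.

```python
import re

# Rough sentence splitter: adequate for illustration, not for production text processing.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def flag_long_sentences(text: str, max_words: int = 25) -> list[tuple[int, str]]:
    """Return (word_count, sentence) pairs for sentences exceeding max_words."""
    flagged = []
    for sentence in SENTENCE_END.split(text.strip()):
        words = sentence.split()
        if len(words) > max_words:
            flagged.append((len(words), sentence))
    return flagged

sample = (
    "The samples were analyzed in triplicate. "
    "Although the initial characterization of the binding kinetics suggested a "
    "biphasic association model, subsequent experiments under varying ionic "
    "strengths revealed that the apparent cooperativity was an artefact of "
    "aggregation rather than a genuine allosteric effect."
)

for count, sentence in flag_long_sentences(sample):
    print(f"{count} words -> consider splitting: {sentence[:60]}...")
```

A real plain-language editor would of course combine this kind of surface check with vocabulary substitution and syntactic rewriting, but even a simple flagging pass shows where E2L readers are likely to slow down.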

Coming to terms...

The rate of innovation in word formation is clearly accelerating in the key languages of scientific endeavor. All kinds of new terms and phrases are being generated in research and also across the media to designate new concepts, processes, and objects in the worlds of science and technology.

We shall therefore have to decide how to handle such technical terms (and their more familiar social-media versions and variants) in local languages for many non-technical but interested readers in decision-making positions. The classic dilemma is this: Do we borrow a “new” term, or do we find or invent a translation equivalent?

If we borrow, we can also share common knowledge more easily with speakers of other languages. If we translate, who decides on the local term and how far/soon would it penetrate current discourse? And how long should communities take to decide on a translation and implement it within the digital context?

When scientists need to interface with their funding authorities, government science officers, and end-users, it is important that their message is maximally understandable and clear to non-specialists. So it may be necessary to opt for shared explanatory lexicons, provided they are digital, dynamic, and multi-localized (i.e. something like a science knowledge-base chatbot). This will enable the various actors in the drama of science and technology development to understand each other better and more quickly in an age of widespread lexical innovation.

That said, the circulation of new terminology in local languages will typically never be as rapid as users might like, due to complex bureaucratic processes in local lexical decision-making. So inventing a new semi-automated solution to term creation in shared or local languages would be a useful first step.
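
As a thought experiment, such a semi-automated step might look something like the sketch below. The term-base entries and the "borrow provisionally, flag for committee review" policy are purely illustrative assumptions, not a description of any existing workflow.

```python
# Hypothetical bilingual term base: English term -> approved local-language equivalent.
TERM_BASE = {
    "photosynthesis": "prakash-sanshleshan",
    "vaccine": "tika",
}

def resolve_term(english_term: str) -> dict:
    """Return the approved local term if one exists; otherwise propose the
    borrowed English form and mark it for review by a terminology committee."""
    key = english_term.lower()
    if key in TERM_BASE:
        return {"term": TERM_BASE[key], "status": "approved"}
    return {
        "term": english_term,      # borrowed form used provisionally
        "status": "needs_review",  # queued for the local-language committee
    }

for term in ["vaccine", "mRNA", "phenomics"]:
    print(term, "->", resolve_term(term))
```

The point of the semi-automation is speed: the borrowed form circulates immediately, while the slower human decision about a local equivalent proceeds in parallel.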

Vernacular science — the Indian model

What are national authorities doing about this problem of the role of language in representing science knowledge?

Contrary to what you might think, ideas about the role of indigenous languages in science are gradually shifting. While English appears to command the heights, various communities are rethinking the longer-term role of local languages in science education, from countries such as Nigeria, with its Igbo-language science initiative, to the Indian subcontinent.

India (along with South Africa) is committed to adapting English science vocabulary to a number of local languages so that science can be taught and culturally embedded in the nations’ official languages.

In fact, this is not a recent concern: in the case of India, proposals about localizing science go back at least to Rajendra Lal Mitra’s Scheme for the Rendering of European Scientific Terms into the Vernaculars of India published in 1877!

Today, however, the All India Council for Technical Education (AICTE) is promoting a new policy of regional-language science education for all. Hopefully, this move will enable individuals to leverage their existing languages more effectively in their lives and work, and not simply use regional-language programs to serve a purely ideological agenda.

Programs like these will inevitably pose enormous problems of logistics and equipment provision, given the 22 regional languages involved, which will also require some form of constant inter-communication. Many voices are naturally opposed to such a wholesale shift for higher-education courses, where vocabulary resources in local languages are weak. The most likely scenario is that Hindi will attract most of the limelight in this momentous shift of skills, and other tongues might be left behind.

What’s more, if this policy is implemented, there will be a further barrier to knowledge down the road. If Indian science goes vernacular, a new need will emerge for translating or adapting Indian science advances into whatever other language is currently the de facto global tongue for sharing scientific knowledge — almost certainly still English!

That said, if the AICTE succeeds in this plan to localize and Indianize science, it will provide vital proof of feasibility that humans can a) translate science effectively, and b) localize it efficiently. All this on a planet where current signposts (e.g. the rapid spread of the web and global networking, massive knowledge sharing, constant inter-personal communications, etc.) point more to a gradual convergence of content solutions, against a background of more divergent politico-cultural language policies among nations!

The promise of generative technology

The development of more accessible and powerful science writing and reading aids, enabled by the new wave of advances in secure large language model (LLM) apps, could have a positive impact on the future of vernacular science.

We ought to be able (despite many technical, financial, and even moral constraints) to leverage the capacity of emerging technologies using a “co-pilot” approach to adapt — i.e. simplify, rationalize, enhance — received monolingual content to individual reading abilities. And then, where necessary, translate this “plainer” version into a local language for new cohorts of readers once the preparatory lexical work has been achieved.
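
A rough sketch of that two-step pipeline, purely for illustration: call_llm() below is a placeholder for whatever secure LLM service an organization actually uses, and the prompts are assumptions rather than tested recipes. What matters is the ordering: simplify first, translate the plain version second.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a call to a secure LLM endpoint (not implemented here)."""
    raise NotImplementedError("wire this up to your organization's LLM service")

def simplify(text: str) -> str:
    prompt = (
        "Rewrite the following scientific passage in plain language. "
        "Keep all factual claims intact and do not add new information.\n\n" + text
    )
    return call_llm(prompt)

def translate(plain_text: str, target_language: str) -> str:
    prompt = (
        f"Translate the following plain-language text into {target_language}, "
        "reusing that language's approved terminology where it exists.\n\n" + plain_text
    )
    return call_llm(prompt)

def copilot_pipeline(text: str, target_language: str | None = None) -> str:
    plain = simplify(text)                         # step 1: plain-language version
    if target_language:
        return translate(plain, target_language)   # step 2: optional localization
    return plain
```

Doing the simplification in English first means the harder translation step works on shorter, less ambiguous sentences, which is exactly where the preparatory lexical work pays off.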

One pathway towards this opportunity would be to use an automatic Q/A tool that asks typical questions (what, why, how, when, etc.) about any aspect of a piece of scientific textual content, such as a research paper, and helps the human reader disentangle useful high-level knowledge or crucial information in the text from the rhetorical complexity of the presentation (see this early example of such a reader’s assistant device).
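
For illustration only, here is a minimal sketch of the kind of question battery such a reader's assistant might generate for each section of a paper; the section names and question templates are assumptions, not the cited tool's actual behaviour.

```python
# Fixed what/why/how/when templates applied to each section of a paper.
QUESTION_TEMPLATES = [
    "What is the main claim of the {section} section?",
    "Why does this matter for the paper's overall argument?",
    "How was the result in the {section} section obtained?",
    "When (under what conditions) does the claim hold?",
]

def reading_questions(sections: list[str]) -> dict[str, list[str]]:
    """Map each section name to a list of guiding questions for the reader
    (or for a downstream Q/A model to answer from the text)."""
    return {
        section: [t.format(section=section) for t in QUESTION_TEMPLATES]
        for section in sections
    }

for section, questions in reading_questions(["Methods", "Results"]).items():
    print(section)
    for q in questions:
        print("  -", q)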

A refined version of this assistant could then be integrated into a secure but widely-used generative text application designed to simplify the reading of written and visual content by using “plain language”. This would enable the content — text, diagrams, graphs, videos — to be more easily understandable for specialists and non-specialists alike, without diluting the meaning of the original.

The crucial advantages are:

  • the almost instantaneous conversion of a given text into a “plain language” or simplified version,
  • the resulting ease of translation of this plain language version into a local language for immediate wider sharing (if required), resulting in
  • a more reliable dialogic (i.e. question/answer and instantaneous comparison with similar text) understanding of the vital content for a broader range of readers. This should eventually stimulate higher rates of understanding and then innovation in various science-driven fields of the social economy.

It is by now easy enough to imagine the emergence of a personalized co-pilot that could anticipate linguistic comprehension problems, using “plain language” as a common currency. This is somewhat similar to current developments in the tech industry, whereby a service such as Databricks helps non-specialists use “plain language” to build AI-type software functions into their workflows via a chatbot.

Co-pilot functionality dedicated to specific end-user needs will further tailor the understanding process to each individual by learning digitally how each of us reads, takes notes, connects ideas, etc. via the computer interface. No doubt experiments in this direction are already underway, given the broad range of generative AI app domains currently being addressed.

Obviously, today’s LLM-driven content tools cannot actually anticipate what individual humans need from a “plain” version, or make judgments on appropriate levels of “plain” language. But they will certainly be a useful first step in rewriting/explaining, provided we can prevent them from hallucinating.

Linguistic resources for science?

Translation is clearly not yet a universal option. Constructing a global network of effective science translation into and between all relevant major and local languages is still far too complex and costly.

For example, local language communities need to do considerable data collection and terminology work to overcome lexical poverty in the global science domains (i.e., a lack of corresponding terms) and ensure comprehensive translation.
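
One way to make "lexical poverty" measurable, sketched here under the assumption that a curated bilingual glossary already exists: compare the domain terms found in a corpus against that glossary and report the gaps. The glossary entries and term list are illustrative only.

```python
def coverage_report(domain_terms: set[str], glossary: dict[str, str]) -> dict:
    """Report what fraction of domain terms have an approved local equivalent,
    plus the list of missing terms (the terminology backlog to work through)."""
    missing = sorted(t for t in domain_terms if t not in glossary)
    covered = len(domain_terms) - len(missing)
    return {
        "coverage": covered / len(domain_terms) if domain_terms else 1.0,
        "missing_terms": missing,
    }

glossary = {"vaccine": "tika", "photosynthesis": "prakash-sanshleshan"}
terms = {"vaccine", "photosynthesis", "genome", "biodiversity"}
print(coverage_report(terms, glossary))
```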

On the other hand, specific local efforts in critical areas of knowledge could be highly effective in many cases (e.g. botany, agricultural science, medical specialties, ecological engineering, etc.). So an interesting initial possibility could be to adapt English-based plain language science discourse comprehension and production to local languages via translation.

Currently, the Science X Newsletter, for example, does a terrific job of transmitting the content of selected English science papers for easier consumption by non-specialists. But it obviously can only edit a minuscule amount of the total science news output every day, and currently only in English. Two other sources of science-based information for non-specialists, also based on research papers, are Nature Briefing and The Conversation, which is now published in several languages.

This entire smarter access-to-science process should therefore be upgraded digitally to a global dimension via plain language aids and, where necessary, some form of automated translation of these plain versions alongside local science production. These look like the best options for ensuring broader access to critical content for future generations.
