Twenty years ago, the French cognitive anthropologist Dan Sperber predicted in “L’avenir de la lecture et de l’écriture” (“The future of reading and writing”) that we would eventually give up writing and start using speech to dictate to machines, though reading text would continue. “We have evolved a brain,” he claimed, “specifically prepared for speaking and listening and not reading and writing.” Learning to write is more “expensive” both cognitively and materially than learning to read. Does this prediction provide a useful guide to current developments?
We may well be on the cusp of a new epoch in language practice and awareness. Signifiers — the acoustic, physical, and dynamic aspects of human voiced language — are set to play as large a role in our communication economy as the purely informational values of written words and phrases that technology has favored so far.
Massive voice data availability will naturally amplify the dangers of intrusive surveillance of our vocal exchanges. But this pivot to voice will also intensify the focus on understanding or appreciating not just words spoken, but entire soundscapes, along with their various complex signaling properties.
Software solutions will be able to collect and finely map detailed patterns of voice behavior expressing everything from panic to joy, suspicion to relief, sickness to health, truth to perfidy across many speaker populations in relation to any number of specific stimuli. Our phones and similar devices will then begin to engage with the intelligence implicit in tomorrow’s soundscapes in the broadest possible way. Deducing your psychological state or health from an analysis of your speech will be only the first step of a journey into inner space.
So here is a possible scenario:
- speaking via media will gradually replace most writing
- listening/watching might even replace much reading
- machines will be able to design, program, read, and write much basic content tailored especially for us
- “literacy” will either change its meaning or disappear.
If this media shift in language behavior intensifies, then human-authored text may lose its primacy as the benchmark for knowledge media. Written content of different kinds will instead be generated in vast personalized quantities as machines learn from language data, making it possible to automate most existing forms of writing, blur the distinctions between human and artificial expression, and produce “practical” content at scale across languages.
So the metaphoric trend that emerged in the wake of 1960s French theory of referring not just to books but ideas, movements, and history itself as “texts” to be “read”, might be drawing to a close as we shift towards Sperber’s post-writing world. Our new metaphors may be more in the vein of “listening to history,” “the talking’s on the wall (of sound)”, or “the tongue is mightier than the sword”. Let’s explore how we might be entering a literal post scriptum…
From text to speech
Text has so far been a more efficient knowledge-access medium than voice for at least four reasons: it forms a durable object; it is easy to search visually; it can be scanned quickly to assess relevance; and reading is typically faster than listening to audio versions (read aloud live or recorded). Text provides a multi-linear format for content consumption (you can skim and jump backwards and forwards), whereas video and audio are largely monodirectional.
However, we shall almost certainly be able to overcome the forced linearity of voice content, using voice summarizing, intonation-based insight spotting, and sound searches over speech and video. And we will all welcome an end to the laborious process of learning handwriting or typing and then using numerous gadgets, materials and tools simply to transform an idea in the mind into a visible language object.
Natural language input — i.e. talking to a machine — will however extend beyond simply asking questions or giving orders to robots. Voice will eventually become the medium for individuals to code software solutions to help machines learn. Spoken AI instructions could form input to machines to help build solutions to a wide range of specific intellectual, managerial, content, or design tasks where AI will be able to learn, organize and build virtual worlds and focused applications for people. This will repurpose text content as a data/knowledge resource, and highlight voice commands as prompts to program a machine to expand our sensible world, not just listen to its replies.
Meanwhile, we shall stop “looking things up” and shift to instantaneous search gratification using voice queries to a phone or other device. When in a public context, we shall be able to whisper commands to our devices, and eventually “think” them straight to the device. Speech will thus take a more central role in our conceptions of and relationship with ‘content’. Physical, signal-rich utterances will transmit the deeper values of community, identity, and inclusiveness, against a potentially divisive social background.
As we noted above, the great danger created by this kind of media shift is that all our speech, just like all our text in the last 30 years of online production, will become searchable intelligence for interested parties. By virtue of their huge volume, these data will only be manageable by powerful machine-learning applications. The signals embedded in the data (who you are, what you say, hesitate about, feel, suggest, reveal, emote, or stay silent about) will inevitably drive the machinery of commercial and political surveillance and evaluation. There will be a constant, difficult conversation between bad information and useful information, just as there has been throughout history.
Yet as more content becomes datafied in this way, we shall also benefit from richer medical, scientific and creative advances… whilst dealing with the inevitable commercial strangleholds and unethical forms of exploitation. Broader knowledge, greater fear.
What about the long march of literacy?
As voice speaks more insistently to our content needs, will we stop evolving towards an era in which human writing is the privileged baseline for most language activities (in line with the UNESCO-inspired literacy project launched in the 1950s)? Or will we move beyond writing as the key to knowledge sharing, and simply teach reading where appropriate? This would depend on the existence of effective writing systems.
The exact number of languages that remain unwritten is hard to determine. Ethnologue (24th edition) lists a total of 7,139 living languages, of which 4,065 have a writing system. Only some of these are widely used — literacy is not a given. This leaves over 3,000 languages, often with very small speaker bases, without any writing system. Were languages ever born equal?
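The arithmetic behind these figures can be checked in a few lines — a sketch using only the Ethnologue 24th-edition counts cited above:

```python
# Ethnologue (24th edition) counts cited above
living_languages = 7139
with_writing_system = 4065

unwritten = living_languages - with_writing_system
share = round(100 * unwritten / living_languages)

print(unwritten)  # 3074 -> "over 3,000" languages without a writing system
print(share)      # 43 -> roughly 43% of living languages
```

In other words, the unwritten languages are not a marginal remainder: on these counts they amount to over two-fifths of all living languages.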
If we really want to give every language a writing system, one long-term possibility is that machine learning software could be used to (help) design new writing systems for as yet unwritten languages, perhaps drawing on a database of the six fundamental script types of alphabet, abjad, abugida, alphasyllabary, syllabary, and logographic. This would at least speed up the initial phases of deciding on the visual units to be chosen, given apparent word structure.
Inventing brand new formal types of writing could appear unnecessary, as a machine could eventually learn to output a “universally” readable version of any input speech at will. But small, typically endangered language communities would probably wish to create their own systems, in line with traditional rather than technological values. This will preserve communities of speech but exclude them from many kinds of digital exchanges for some time.
Visual text is typically impersonal, even though it can breed great familiarity. We have largely given up using highly personal handwriting as a major text medium in the digitized world; instead we choose from just a few publicly available fonts when sharing electronic messages or documents. In any case readers could easily choose their favorite font in which to read your messages, so writer choice may be minimally relevant.
Emojis are fun to use as space-saving sentiment suprasegmentals but are unlikely to extend the scope of writing and will probably remain a Web2 social-media phenomenon. We might however see some vocal or sound equivalents to emojis emerge in a more audio-based metaverse as people communicate in new ways combining visuals with acoustics.
Meanwhile, spoken X-casts — which enable people to listen to voiced discourse of various kinds while they do something else — will extend former radio or online experiences and morph into richer interactive sound performances of the drama of other people’s voices, rants, and conversations.
As noted, voices initially require linear attention: there is no “speed listening” (cf. speed reading), no natural leaping ahead to a specific word or phrase. You can of course spool ahead to downstream points in any recording, but to fully understand, a listener ideally has to catch the entire temporal flow. You have to pay attention, constantly evaluate the personality of the speaker, and feel empathy or antipathy towards their dialect or rhythm as well as their content. But new behaviors in both speaking and listening will emerge to expand listener powers and boost speaker resonance, aided in turn by new linguistic or time-based search solutions and performance tricks.
Voice also, of course, immediately distinguishes individuals from each other in terms of gender, age and so on, whereas visual text usually has no embodied identity, apart from handwritten script. We all speak in a dialect of some kind, even when there is only one remaining speaker of a language — no individual can ever be the “voice” of a language — only a voice.
Speaking inevitably offers a broader range of signals to any inquiring intelligence. Think of high- and low-pitched voices, slow and fast and mixed enunciation, short and long breath groups, varying sound registers, lisps, sobs and laughs, cries and whispers.
This produces a whole pageant of auditory phenomena that we often cannot name but react to emotionally as a series of signifiers — signals pointing towards richer possible meanings about the speaker, the environment, and what is said. This rich interactive dimension is absent from the text domain. Audio cannot be paraphrased easily, described precisely, or translated item by item with any accuracy or completeness. It is these visceral aspects of language that we shall be recognizing, celebrating, and monitoring automatically in a post-writing economy.
As a simple illustration of the issues at play in the speech/text dialectic, take conference interpretation (real-time spoken translation). Here we use speech to translate another person’s spoken language — in court, at a conference, or in a diplomatic encounter. The voice timbre of the interpreter is rarely if ever similar to that of the original speaker — interpreters may be women who translate men’s speech and vice versa. There is no attempt to capture the entire vocal communication context of a conference or meeting: you don’t imitate or “translate” hesitations, tongue-trips, or bungled pronunciations, through which the often powerful people in such meetings reveal their opinions, emotions, personality, etc. These “actor” features are completely erased from the transmission format applied by the interpreter, whose job is to reduce speech to an affect-less stream of clear conceptual meanings — all semantics and no “signifiers”.
Yet there are always situations in the doctor’s office, court room, police station, or international press conference when a speaker being interpreted breaks down, infusing the atmosphere and their language with raw emotion — as the witnesses testifying before the South African Truth and Reconciliation Commission in 1996 demonstrated, while interpreters struggled to transmit their painful accounts.
On other occasions this may create a strange contrast between the suffering speaker and the composed translator: interpreting is indeed a bit like producing a sanitized, text translation of wild and whirling speech in real time.
In the future, a machine interpretation device might handle much human discourse under similar circumstances: but will it be programmed to simply identify a thread of conceptual language beneath the emotion of any specific situation? Or will we design systems to reconstruct the speaker’s full communicative voice characteristics (emotions, rhythms, silences, etc.) as found in high-quality video dubbing?
Critically, voice is part of the attraction of media such as films, games and TV series. There is growing competition in regions such as South and Southeast Asia (especially India) to localize audiovisual content into new language communities so that they can enjoy a richer acoustic performance. Film dubbing is now a growing industry (worth around $4 billion and growing at over 6% a year) in which technology is aiding this extension of content to new languages. It requires attention to nuances of speech that text subtitles obviously cannot provide for avid film fans.
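As a hedged back-of-envelope check on those figures — treating the ~$4 billion size and ~6% annual growth cited above as assumptions, not forecasts — compound growth alone would roughly double the dubbing industry in about twelve years:

```python
market_usd_bn = 4.0  # assumed current market size, from the figure above
growth = 0.06        # assumed annual growth rate ("over 6% a year")

value, years = market_usd_bn, 0
while value < 2 * market_usd_bn:  # count years until the market doubles
    value *= 1 + growth
    years += 1

print(years)  # 12 (cf. the rule of 72: 72 / 6 = 12)
```

A doubling time of about a decade is consistent with the essay’s claim that dubbing is a growing, not mature, industry.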
There is also technology coming onstream to ensure that appropriate physical mouth movements of the actors are injected into dubbed versions of a film originally shot in another language, completing the physical illusion of the speaker’s identity. Recreating the original actors’ voice qualities in dubbed versions is already possible. Presumably efforts to localize visual talking media to specific communities in Asia, Africa, South America, the Middle East etc. will expand considerably, even though dubbing may not yet work its way down into many minority languages.
However, there can be cultural problems here: we don’t know how far the speech track of a local actor will be appreciated by a viewer on the other side of the world who hears it dubbed into the spectator’s language — colloquialisms and all. By “domesticating” all the language of a film for the spectator, something of the magic of its original vocal otherness may be lost. We would never (?) think of “translating” a painting into the favorite colors of the viewer, for example. Strangeness communicates in art.
You could argue that text in any language can (nearly) always be read aloud as speech. But in reality, speech is a singular performance which cannot avoid casting a specific person who speaks with a specific voice at a certain speed with a given accent or voice quality in a specific situation. There is no ur-speech.
Unlike text, speech is always situated, grounded, and sounded. We don’t know what Etruscan sounded like, for example, even though we might work out some of the structure of the language. The wealth of speech features cannot always be effectively transferred into writing without laborious and intrusive commentary. By wealth, I mean the range of audio parameters — pitch, rhythm, tone, and accent — due to gender, geography, physique, age, health, social class, and all the rest.
A voice automatically signals some idea of a speaker’s age, for example, but a written/printed/online text issued by a government office does not necessarily reflect any such feature. So there are clear limits to any natural porosity between text and voice. Text lines up naked words; voice clothes them in human color and music even before they’re understood semantically.
You talk like a book
Remember Ray Bradbury’s 1953 novel Fahrenheit 451, or the 1966 film version by François Truffaut? It imagines a time when books in the US are destroyed by burning (probably inspired by stories of Nazi book-burning, though written at a time when Senator McCarthy was active as an anti-communist scourge). To evoke the Sperber premise we began with: in this book-free situation we would not be able to read even if we wanted to!
Bradbury posits an America in which renegade individuals overcome this handicap by memorizing the entire text of books (in fact novels) and then acting as their oral remembrancers. In the movie, we see them walking around endlessly, trying to transfer book content to memory. This would keep creative literature alive once all the pages had been burnt to ashes. But, implausibly, the characters do not seem to venture into books about science or history as classic sources of knowledge, which undercuts some of the novel’s underlying frisson.
We can however run thought experiments about issues that are not faced in the novel. For example, would children’s, teenagers’, women’s, or men’s voices be chosen in any special order according to the content of the book to be memorized? Is War and Peace a “male” or “female” sounding book? And would some long books need to be stored as speech across numerous remembrancers?
How would linear memory be able to help a person find a specific passage or a particular piece of information? Who memorizes the dictionary so that unknown words can be checked by ignorant listeners? Would you memorize page numbers so that the concept of page lingers on in aural memory? Is any remembrancer able to run a random search over their aural database to find specific information someone needs? Only very rarely, surely. And above all, what happens when the remembrancer of a novel by Proust or Raymond Chandler dies in an accident? Surely you need at least doubles for every text to overcome these problems. Calquing an oral culture onto a post-literate civilization sounds like a hopeless fantasy.
This is of course the exact reverse of what happened in history: in pre-literate times, specialized bards in many communities were able to learn, remember and enunciate long poetic sagas and similar works, memorizing knowledge and ritual for the tribe and passing their learning on, eventually to scribes. The ancient Greek epics and Sanskrit poems such as the Rigveda and the Mahabharata are thought to derive from just such a tradition of eloquence, orally performing foundational events in the histories of certain peoples before eventually being written down.
The mnemo-techniques of spoken repetitions, rhythms, rhymes, verbal formulae, and similar features are well-known, and can be found in most oral stages of cultures. Yet it is unlikely that a bard’s audience would ask specific questions about what color a given character was wearing or why they acted in a certain way. Remembrancers don’t have access to the specific knowledge base feeding the poem, only to the memorized words that anyone can in theory learn through practice. It is the physical sounds that inspire our imagination to create a world.
In fact, Bradbury’s story also echoes a very real situation in 20th-century Russian history. In the Soviet Union from the 1930s, when much literary publication was banned by law, the Russian author Anna Akhmatova would write a poem and her friend Lydia Chukovskaya would memorize it; the paper version would then be destroyed. This made Chukovskaya a sort of virtual poetry memory bank. In the same way, Nadezhda Mandelstam memorized much of Osip Mandelstam’s verse, again to protect it from annihilation.
Tomorrow, text<->speech technology for many languages will become available to transform written content of all sorts into spoken performances almost at will, and conversely to transform almost anything spoken into machine-readable text. How will this media shift impact our cultures and our lives? For one thing, all cognitive activities along the human spectrum — from research or storytelling to law-making or doing math — involve the creation, use, and stockpiling of linguistic data. To build systems that can design handy voice-driven solutions automating most forms of content work, we will need masses of data in all languages, associated with all kinds of human discourse.
So far, we have seen the collection of many billion-word data sets for a few dozen languages, used to help drive “transformer” software that can generate, translate and speechify digital content. These are the very first steps in what will become a huge shift induced by the combined power of focused data and compute technologies. Today we are still thinking in terms of just 100 or so languages as a major coverage target. We will rapidly need to raise that to thousands, and then manage the outcomes — ethical handling and balanced distribution — to ensure beneficial use across all communities. A typical challenge, for example, could be the 700 languages and dialects spoken in Indonesia alone!
Speech in the metaverse (2.0)
One key future application where speech will evolve into a primary channel is communication in any future metaverse. Although both the general architecture and local details are still unknown, we can at least anticipate that in a visually and acoustically rich virtual space, we will need to communicate linguistically through speech and sound rather than simply by writing/reading blocks of text on screens.
We tend to characterize Web 1 and 2 as text universes — an electronic extension of the affordances that books, postal letters, libraries and picture galleries have long given readers and viewers sitting at a desk.
Reading in the metaverse will obviously occur in various forms, but interactions and encounters with others in virtual visual-acoustic environments rather than as solitary clicking will favor a far broader range of spoken intercourse. By metaverse, I mean a still fictional Metaverse 2.0, as the first version of the concept is looking more like a commercial landgrab for virtual real estate backed by gaming and NFT markets. This is surely only a pale blueprint of the more interesting concept of a global virtual space for new kinds of creative encounters regulated by open-access principles.
If communities eventually manage to form, expand and mutate around Metaverse 2.0, language barriers will be just as glaring as in the real world, so some form of instant speech translation for a broad range of languages will become a baseline need. Inventing scaled solutions for such encounters will surely become one of the major challenges (in terms of compute power, data, social and cultural inclusiveness, etc.). Otherwise we will end up simply recreating a metacommunity of communities along existing real-world language divisions, and then trying to provide the usual tech support for conversations within each of them.
Yet even a commercially-driven Metaverse 1.0 of separate brand or market-based communities will attract and mix together different language communities. So inclusive multilingual voice support will be needed from the get-go. You can also bet that oral language teaching/learning will be largely reinvented and extended via metaverse-type affordances. But remember that all types of language tool acceptance and usage are typically determined by the usual socio-economic barriers to entry.
What, though, would a metaverse mean to the linguistically disabled community? Any preference for speech communication will make life harder for many. We shall need to ensure that signers can reach the deaf, that non-speaking people can interact with speakers through texting and some new form of dynamic imaging, and that the blind can benefit from automated spoken commentary and soundscaping covering everything going on around them. We will also need to support the reduced sensory abilities of aging populations as they vocalize and listen, using automatic lip-reading and similar techniques.
A different and more distressing form of disability is the fate of very small-population indigenous languages where speakers can no longer use their birth or first community language because it has been almost silenced by brutal regimes or historical wear and tear. A spoken metaverse environment could potentially offer more support for re-languaging by enabling wider speech options, and greater access to virtual support mechanisms rather than insisting on time-consuming descriptive linguistics and the laborious creation of a writing system.
Although there is currently a major UN focus on revitalizing indigenous language communities worldwide, there is a very broad variety of cases, and no specific reparatory action or reinvention program will fit all. Forcing small territory-focused communities onto the digital whirligig may not be the right way forward for them.
Dialog versus text
Yet if voice-first does materialize as the new communication benchmark, Sperber’s prophecy can claim to echo Socrates’ 2,400-year-old critique of writing’s limits. In Plato’s Phaedrus, Socrates claims that because it lies fixed on the page and we cannot question it, written text avoids the dynamics of dialog — the mutual working out of ideas and their consequences through the cut and thrust of human conversation, argument or debate.
Writing simply reflects ideas, leaving them unchanged as objects of mere contemplation. Speech on the other hand vivifies human engagement, allows us to challenge what we learn, and strategizes our relationship with truth and silence. Speech is less about “speaking a monologue” than it is about engaging in verbal pugilism…
There might also be political fallout if this seismic shift to voice occurs. If everyone has a recognized, listed “voice” and each voice counts in some way, there will be a multitude of opinions that may never be distilled into specific parties or legible policies, whatever the system. The passion-riven social media landscape (zillions of brief, hard-hitting utterances on Twitter etc.) as a precursor to the post-writing age is an example.
Highly visible, well-structured bodies of opinion and discourse that used to operate as references in most democratic regimes will gradually crumble into an incessantly changing digital platformization of vocal opinion. This will make it harder to crystallize a polyphony of voices into a set of manageable, clear-cut messages, and will instead encourage constant micro-fragmentation across populations, driving a more diffuse and aggrieved engagement with political issues — more local, social, personal, spiky and fast-changing.
So the question is: will a massive pivot towards better-supported, richer spoken intercourse create a new hierarchy of communicational values after 500 years or more of the hard-fought battle for literacy as truth? Will humans engage more with each other via some form of virtual, oral agora, and access or experience imaginative content as a new combination of provisional, ever-changing visual-aural 3D stories, demonstrations, and virtual experiments via a ground truth of constant corrective dialog? And simply read knowledge as reams of content custom-generated by a smart if understanding-free machine!