Missing Scripts Project: “A fish would design an O differently”

We talked to researchers and designers saving endangered writing systems from extinction

October 31, 2024

Nearly half of the world’s writing systems remain unencoded, leaving them excluded from the digital world and at risk of extinction. To address this, the Atelier National de Recherche Typographique in Nancy (France), the University of Applied Sciences in Mainz (Germany), and the Script Encoding Initiative at the University of California at Berkeley (USA) have joined forces. Together, they are working to integrate these endangered and minority scripts into the Unicode standard and provide them with a typographic form. We contacted designers and researchers from all three institutions to find out what the process of preserving a writing system looks like.

How did the idea of drawing, encoding and researching the missing scripts come to you in the first place?

Deborah Anderson: In my case, the idea came to me when I discovered that I could not put letters used for a historic language of Italy onto a Web publication, because the letters were not in Unicode. I started the Script Encoding Initiative at UC Berkeley to provide a voice in Unicode meetings for those people working on historic and modern minority languages, whose script was not in Unicode.

Johannes Bergerhausen: Deborah started SEI in Berkeley in 2002. A year later, I founded the Decodeunicode project at Hochschule Mainz in Germany. We met at a Unicode conference in 2005 and have been working together ever since. Every year or two we did an update of our website. With each new Unicode version, new scripts were added. After a few updates, I asked myself the simple question: How many are still missing? When will we have all scripts encoded in Unicode?

In 2015, I proposed the «Missing Scripts Project» to Thomas Huot-Marchand, the director of ANRT, and Deborah. They were immediately on board. Since autumn 2016, one or two students have been working on a missing writing system at ANRT in Nancy each year.


Decodeunicode book, 2018


Morgane Pierson: In 2017, I joined the Atelier National de Recherche Typographique, a post-graduate programme specialising in typographic research, based in Nancy. I was the second researcher-student to work on the Missing Scripts project, after Arthur Francietta. Together, we had the pleasure, among other things, of researching and designing the glyphs to represent the world’s writing systems. The wide variety of letterforms from all over the world provided an endless source of inspiration and highlighted the need to create more diversity in the world of type design. This is why I continued to specialise in designing fonts for ancient and/or disadvantaged writing systems, whether for communities of scholars such as archaeologists or linguists, or for populations that lack adequate digital typographic tools.

Anushah Hossain: I joined SEI as Debbie’s successor this year, after spending several years studying the histories of text standards and technologies, particularly in the South Asian context. I am fascinated by how different politics and forms of expertise must come into dialogue when a script is up for encoding. I am excited to use my new position in SEI to continue aiding scripts along the Unicode pipeline, but also to encourage more scholarship in the humanities about the complex social and political dynamics surrounding language technologies.

If a script used for a language is not encoded or digitised, how do native speakers usually handle the situation? What symbols do they use to write?

DA: Some users will hand-write the letters, and even take screenshots of the letters and share those. Alternatively, some users create their own non-standard fonts and share those, though this doesn’t help when you send text to someone without the same font installed.

MP: Another solution is transliteration and romanisation, the use of the Latin alphabet in place of the original writing system. For example, it has been widely used to write Persian, which is written using the Arabic writing system. This way of writing Persian is called Fingilish or Pingilish. With the development of global technologies and the creation of higher-quality fonts, this way of writing Persian has decreased significantly. This proves the need to create specific high-quality typographic tools for users. In addition, users often develop their own software and digital keyboards to fill the typographic gap. The main problem with these local typographic tools is the lack of internationalisation and standardisation that Unicode is trying to address.

JB: There is the example of the script called Sunuwar (it was just added to Unicode 16.0 in September 2024, thanks to SEI in Berkeley), whose community in the Indian city of Gangtok wrote an entire daily newspaper by hand and then duplicated it. Without proper digital typography, you can hardly be found on the net. That’s why it’s so important that all written cultures in the world are included in one encoding standard. Unicode is, so to speak, the typographic general assembly of the United Nations. Every culture should have a place there!


1 Handwritten newspaper in Sunuwar, 2005. Image: Atlas of Endangered Alphabets


How do people get OS providers (like Apple or Android) to add digital keyboards? Do the companies check the Unicode updates, or do native speakers and initiatives have to fight for the new writing system to be added?

DA: Companies do pay attention to new scripts that are published in each version. I’m not certain how companies decide which keyboards to support, as the decision process may vary from one company to another. I understand that users may need to lobby mobile providers to support their script or keyboard on a device in a particular area. I would guess that having a supporter in a company (especially one high up in the chain of command) can help: Steve Jobs supported getting Cherokee on Apple devices, I believe.


3 Cherokee script. Image: Wikipedia


It also helps to have an active member from the community working with the company to get support — so the Barry brothers were able to work with companies to get ADLaM supported.

MP: I would suggest discussing this question with Mark Jamra and Neil Patel. They specialise in typographic and technical support for African languages.


ADLaM characters. Image: Andrew Footit


Do the ANRT students working on the typefaces have to learn a language the script is used for?

DA: I’ll let Morgane and Johannes answer this, but from my experience it is not necessary, at least for the script proposal authors. The proposal authors often need to analyse the script, but they should definitely work with users who use and understand the script and can answer questions.

MP: In my opinion, a language does not have to be written to be spoken, and a writing system does not have to be read to be written. When I’m working on a typeface, I’m often not able to read the manuscripts or inscriptions I’m studying. I leave this complex work to linguists and archaeologists. However, I can understand the construction of the letter, the ductus, the proportions, and the script grammar. To put it simply, I learn the script, but not how to read it. It is also essential to study the historical and cultural context of the writing system and to do as much research as possible to be as accurate and authentic as possible. The same is true whether I design fonts for ancient writing systems such as Phoenician, Nabataean, Paleo-Hebrew or Lycian, or fonts for native readers of scripts such as Arabic, Greek, or Hebrew. Of course, it is essential to consult with various specialists and native readers to validate my choices. The multidisciplinary and collaborative part of this work is very valuable and enriching.



JB: Often there is no one-to-one relationship between language and script. Most writing systems allow you to write several languages. (Hundreds with Latin!) So there is often no one suitable language. In addition, scripts develop a life of their own and have features that are language-independent. Students at ANRT learn the respective writing system very thoroughly from the linguistic, typographic and orthographic structure.

How did you come to the conclusion that there are 292 scripts? How do you define a script?

DA: The number is based on feedback from various experts. Differentiating scripts, particularly as they change through time, can be difficult. One test is the «menu test» (at least for modern users!): if a user is given a menu in the two different scripts, can they read both? If not (and if the scripts show structural differences), then they may need to be separately encoded.

MP: Personally, I prefer to use the term «writing system» rather than «script». A writing system is a set of organised and standardised signs that allows the notation of human thought. This term gives a broader definition than a script, and this is why it is certain that there are more than 292 scripts in the world. It is important to note that the Missing Scripts Project is an ongoing project and new writing systems are added to the list every year.

JB: This number is a snapshot of the state of research at the time of the publication of our posters or the respective update of the website. Since 2018, we have printed four updated editions. The number changes slightly, about every two years. An example: for some time now, two experts have been discussing whether one should distinguish between North Palaeo-Hispanic and South Palaeo-Hispanic. In my humble opinion, we can summarise this very well as one historical writing system. In the end, it’s about two or three letters. The experts are not yet in agreement.

Another example: Chinese can be easily divided into four writing systems. But you could also argue that this has been exactly the same script for more than 3,300 years. On the other hand, you could even make it more granular historically and divide it into even more scripts … Classification! … an almost philosophical question. We often pragmatically adopt the subdivision (and designation) that has become established in the scientific community. But even this is not static, but develops further as knowledge becomes more precise.



Do you think more scripts are going to be discovered?

DA: Yes!

JB: Yes. From time to time historical scripts are added that were not previously known. Sometimes the question arises as to whether it is a variant of something familiar or an independent writing system.

Even today, new scripts are still being invented. The most recent is the Toto writing system, invented in South Asia in 2015. For the Unicode proposal, it is crucial that the applicants can prove that there is a real, sizeable community that has been using the script for a long period of time. Also, of the 292 scripts, 7 have not yet been deciphered.


14 Rising Sun song in Toto. Image: Unicode


How do you select the scripts to work on and what criteria do you use to prioritise them?

DA: Slight priority goes to modern scripts (since there are modern communities actively using them), but I personally feel that those working with historic scripts should not be overlooked. In the past, I have selected scripts where there are experts/users interested in getting the script into Unicode and who can answer questions and provide evidence, as needed.

JB: For the type design, we try to choose writing systems whose encoding is already being worked on, or whose publication in one of the next Unicode versions is imminent, so that we can make very up-to-date contributions.

MP: I am fascinated by the ancient writing systems of the Mediterranean and Southwest Asia. This vast region was the birthplace of many early writing systems: from the first pictographic systems to the development and birth of the alphabet. It is a mesmerising way to study the very ancient link between the writing systems we use today, and to try to understand how they may have influenced each other. When I study these writing systems, I’m not only studying the evolution of writing, but also the links between societies and human history. The most beautiful and fascinating thing is that writing can be both ephemeral, existing only for a certain period of time, or so durable that we can still contemplate and decipher it 5,000 years later.

AH: We are also trying to come back to old proposals that almost made it into Unicode, but just need a bit more tidying or discussion to get all of the way there. There are dozens of scripts sitting at this almost-there stage!


20 Apollo, seated before an altar with an inscription in Cypro-syllabic script. Image: Metropolitan Museum of Art


Can you give examples of scripts that existed for a very short period of time?

MP: In this case I was talking about the support (the physical medium) of the writing. For example, we can find very ancient inscriptions carved or scratched into rocks; others are made to be ephemeral, like writing in sand or on perishable material.

Some writing systems also lived for a very short time. I say very short compared to the history of writing. For example, as far as we know, the Elymaic writing system lasted for only 8 centuries. It was used to write the Aramaic and Elamite languages between the 3rd century BC and the 5th century AD. The Elymais kingdom itself had a rather short existence and was small compared to the Parthians or the Sassanids, which explains the end of the use of Elymaic. The Mixtec writing system of present-day Mexico also had a very short existence due to the European invasion of 1520. Finally, an example of a writing system that lasted only 50 years is Osmanya, used to write the Somali language. It was invented by Osman Yusuf Kenadid, the son of Sultan Yusuf Ali Kenadid and brother of Sultan Ali Yusuf Kenadid of the Sultanate of Hobyo. The new writing system struggled to gain acceptance among the population due to competition from the long-established Arabic script and the emerging Somali Latin alphabet. In 1972, President Mohamed Siad Barre decreed that Somali should be written in Latin rather than Arabic or Osmanya.


21 An example of the pictorial representations the Mixtecs used for non-verbal communication (reconstruction from 2001)


22 Kigelia typeface with Osmanya support


Do you have a favourite glyph? Which language does it come from?

MP: I love them all. I cannot have any preferences.

JB: All colours are beautiful. But, among others, I’m fascinated by a character of the Afáka script. It is not yet encoded.

AH: My favourite is perhaps the character ৎ (/kʰɔɳɖɔ tɔ/, «piece of ta»). It is a little-used Bangla letter, but it shows up in essential (and somewhat startling) words like হঠাৎ («sudden») or চীৎকার («scream»). The letter caused a major uproar for years in Unicode forums because its encoding was perceived to be faulty. The debates drew in linguists, literature professors, type designers, national media, and everyday internet users, because each seemed to feel something about their identity was threatened if the letter was not encoded a certain way. I like the story for capturing the intense feelings and long histories that can be evoked by something others might consider mundane. And of course it’s a beautiful letter.



In your ATypI lecture you mentioned that A is the reference glyph for Latin. Can you please define the term «reference glyph»?

DA: The glyph in the code chart that is recognisable to most users.

JB: Perhaps in the future we should rather say «reference character». (Unicode encodes characters — type designers draw glyphs.) We have a ranking for the selection:

First, we see if there is a well-known character that is internationally associated with the script. Example: the omega Ω, U+03A9, for Greek.

If there is no such character, we ask the respective community if they use a typical character themselves. Example: in Han (Chinese), the character 永 (U+6C38) stands for «perpetual, eternity, permanent, forever, long», which fits well with its millennia-old history.

If it is an alphabet, we choose the letter for the sound /A/. Example: 𐒀 U+10480 Osmanya Letter Alef.

When it comes to a syllabary, we use the character for the syllable /KA/, because this is common in many countries. Example: क U+0915 Devanagari Letter Ka.

And finally, in picto-/ideographic scripts, we choose a character for «human», the human body or «head». Example: 𐇑 U+101D1 Phaistos Disc Sign Plumed Head.

The first two Missing Scripts students at ANRT, Arthur Francietta and Morgane Pierson, designed the reference glyphs in one typographic style for all 292 reference characters from 2016 to 2019.
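A quick aside from our side: these reference characters can be listed with Python’s standard unicodedata module. A minimal sketch, using the code points Johannes cites (the script labels and formatting are purely illustrative):

```python
import unicodedata

# Reference characters cited above, keyed by writing system.
reference_chars = {
    "Greek": 0x03A9,           # GREEK CAPITAL LETTER OMEGA
    "Han": 0x6C38,             # CJK UNIFIED IDEOGRAPH-6C38, "eternity"
    "Osmanya": 0x10480,        # OSMANYA LETTER ALEF
    "Devanagari": 0x0915,      # DEVANAGARI LETTER KA
    "Phaistos Disc": 0x101D1,  # PHAISTOS DISC SIGN PLUMED HEAD
}

for script, cp in reference_chars.items():
    char = chr(cp)
    # name() raises ValueError for unassigned code points; passing a
    # default keeps the loop robust on older Unicode databases.
    name = unicodedata.name(char, "<not in this Unicode database>")
    print(f"{script:14} U+{cp:04X}  {char}  {name}")
```

Whether the glyphs actually render, of course, depends on the fonts installed on your system — which is rather the point of the whole project.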



What was the first script you worked on?

DA: Old Italic.

JB: In 2014, we designed a cuneiform typeface in Mainz. With over a thousand glyphs, it was a pretty insane project for three people. The first missing script at ANRT was Palaeo-Hispanic, designed by Arthur Francietta from 2016 onwards.

MP: When I joined the ANRT, I was also working on a research project on Nsibidi, a pictographic and ideographic writing system from what is now southern Nigeria. More specifically, I studied Nsibidi with a focus on the pre-colonial period. But the first font I designed was for Elymaic, an ancient writing system from present-day Iran. It was also the first digital typographic representation of this writing system, which was encoded in 2019 thanks to the work of the linguist Anshuman Pandey. The font is now published by Google under the name Noto Sans Elymaic. It was important to me that the first font for this minority writing system should be easily accessible and free to use.


Digital cuneiform book by Johannes Bergerhausen, 2014


Noto Sans Elymaic


AH: This is my first year in the role and we are working with experts on about ten scripts right now (building on efforts that Debbie had already started)! The projects range from the historic, such as Maya and Egyptian hieroglyphs, to more modern inventions, such as Mwangwego and Masaba. In all cases, there is a great deal of international collaboration involved, which has been inspiring to witness.


Mwangwego script. Images: Tapiwanashe S. Garikayi


Could you walk me through the typical cycle of your project?

AH: On the SEI website, we maintain an ongoing list of unencoded scripts, with an open invitation for visitors to provide more information. We review this list periodically to determine which scripts are most viable to focus on for the next set of proposals and fonts, and communicate that to our Missing Scripts colleagues. Incoming ANRT students will typically pick scripts from this list to research. We also work with Johannes to incorporate any new information from our scripts list into the next edition of the World’s Writing Systems poster.

DA: A new script proposal goes to the Unicode Script Encoding Working Group (which handles proposals for everything except emoji and Chinese/Japanese/Korean ideographs) for review. It often takes several meetings and proposal revisions for eligible proposals to become «mature». The proposals also need to have their characters reviewed for the correct character properties (which define the behaviour of characters) before they are recommended to the Unicode Technical Committee (UTC). The UTC, which meets quarterly, will then often recommend provisional code points for the scripts and characters. At some later point, the UTC will identify which version the script will appear in. Unicode versions appear yearly in September. Once a script is published in Unicode, fonts and keyboards can officially be implemented by companies.
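To make the order of these stages explicit, here is a toy sketch from our side — our illustration of the pipeline Deborah describes, not an official Unicode artifact:

```python
from enum import Enum, auto

class EncodingStage(Enum):
    """Stages of a script proposal, as described above (illustrative only)."""
    SUBMITTED = auto()                # proposal sent to the Script Encoding Working Group
    UNDER_REVIEW = auto()             # several meetings and revisions until «mature»
    PROPERTIES_CHECKED = auto()       # character properties reviewed
    RECOMMENDED_TO_UTC = auto()       # forwarded to the Unicode Technical Committee
    PROVISIONAL_CODE_POINTS = auto()  # UTC recommends provisional code points
    VERSION_ASSIGNED = auto()         # UTC picks the target Unicode version
    PUBLISHED = auto()                # appears in the yearly September release

# A proposal advances one stage at a time, often looping back for revisions.
for stage in EncodingStage:
    print(f"{stage.value}. {stage.name}")
```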

MP: This stage, when a writing system hasn’t yet been integrated into the digital world or encoded, is a very critical moment. From the designer’s point of view, the responsibility is to give the first typographic form to a writing system that could take many different forms in the future, depending on the sources and the purpose of the font. It’s important to remember that the first design should not be considered a standard or a model. For this reason, a long period of research and multidisciplinary collaboration is required before and after the work of encoding.

JB: All proposals can be viewed online at unicode.org. Any person or institution can submit a request for new typographic characters, including new emoji. However, they must justify the application very well.

What happens after you create the font and send the application to Unicode?

DA: A proposal to Unicode should have a font available, at least by the time the proposal is recommended for approval by the Unicode Technical Committee. In my opinion, it would be very important to have a Unicode-based font ready to go once the script is published in Unicode, so the script can be used as soon as possible. (However, this typically won’t happen with system fonts that come bundled with a computer. Computer companies typically wait for a script to be published in Unicode before they consider implementing it — and, sadly, historic and minority scripts aren’t a top priority.)

MP: After this process of encoding and type design, the question of usability is also an important part of the process: How will users get access to the typographic tools? Does the software properly support the script grammar? Will these tools be free? Some communities cannot afford the price of fonts, so foundries and large industries need to take this responsibility into account. Of course, it depends on the writing system and the purpose of the fonts. But in conclusion, it is not enough to create more fonts for the world’s languages if they cannot be used.

JB: The annual release of the new Unicode version (unfortunately) happens independently of suitable fonts. Unicode does not create fonts. Usually the major operating systems and the type designers follow.

How long does it normally take for Unicode to accept a proposal? Have any of your proposals been rejected?

DA: In the past, I would let people know it takes at least two years from the first proposal until approval for a new script, but it can be quicker if it is just adding a few characters.

JB: Some proposals (from different people) hang in the air for years because not enough conclusive arguments and documents have been provided, or because further questions and problems have arisen. They often have to be revised and resubmitted. In 2019, the German artist Ilka Helmig and I made an exhibition about these proposals (which many different people had submitted over the last 20 years) at the Museum of Contemporary Art West in The Hague and printed out all 71 unfinished proposals for the show.

AH: Debbie’s right! The fastest Unicode turnaround is just under two years, but that happens very rarely, when a script is approved on its first proposal. Most commonly, we see a script take somewhere between 5–6 years and require 2–3 proposals to go from first contact with Unicode to being published in the Standard. In some exceptionally fraught cases, it has taken over ten years! And of course, these numbers leave out the cases Johannes mentions that never made it, or have not yet made it, to the final encoding stage. One of my students and I are working on figuring out what factors explain how long the process takes for a given script.


38 Materials from the exhibition Missing Scripts — The Proposals


Back in 2018 you said that there were no pan-Unicode fonts. Are there now? Isn’t Noto one of them?

DA: Noto provides various fonts, rather than a single, pan-Unicode font. The size of a font is technically restricted, so no single font today can cover all 154,998 characters in Unicode (the number is from the Unicode 16.0 release, which came out in September 2024). I have recently learned that Noto has no funding for the scripts added to Unicode this September. We are looking into options so fonts could be donated to Noto for 16.0 scripts, as long as they fulfil the Noto specifications. Of course, other good-quality free fonts are also possible (thinking here of ANRT!).
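For the curious, that character count can be approximated with Python’s standard unicodedata module. A rough sketch; the result depends on the Unicode version bundled with your Python build (154,998 corresponds to Unicode 16.0):

```python
import unicodedata

print("Unicode database version:", unicodedata.unidata_version)

# The published figure counts graphic and format characters, i.e. every
# general category except unassigned (Cn), surrogates (Cs), private use
# (Co) and controls (Cc).
excluded = {"Cn", "Cs", "Co", "Cc"}
count = sum(
    1 for cp in range(0x110000)
    if unicodedata.category(chr(cp)) not in excluded
)
print(f"Graphic and format characters: {count:,}")
```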


Noto Sans Balinese by Aditya Bayu Perdana


JB: A font file is (so far) limited to 16-bit glyph IDs, i.e. 65,536 glyphs, for technical reasons. But even if you were to spread the 155K characters over three files — so far there is no typeface that represents all of them in one typographic style. An almost impossible project, one that also raises questions about the different cultures of writing. What «style» would be the basis for the design? There is no such thing as a neutral style. Our poster, with one reference glyph per script, is perhaps the closest, and it asks these questions as well.

When we published the book decodeunicode in 2011, our code charts were composed of more than 50 different fonts. For some scripts we were happy to find a font at all — there was no second one! Today we would certainly need fewer fonts to cover it all.
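As an aside, the glyph ceiling Johannes mentions is easy to check for any font file with the open-source fontTools library. A minimal sketch; the file path is a placeholder, not a real project file:

```python
# pip install fonttools
from fontTools.ttLib import TTFont

font = TTFont("SomeFont.ttf")  # placeholder path

# Glyph IDs are 16-bit in TrueType/OpenType, which is why a single file
# tops out at 65,536 glyphs; the 'maxp' table records the actual count.
num_glyphs = font["maxp"].numGlyphs
print(f"Glyphs in file: {num_glyphs}")

# The cmap table maps Unicode code points to glyphs: its size shows how
# many characters this one file can actually display.
mapped = font.getBestCmap()
print(f"Unicode code points covered: {len(mapped)}")
```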

AH: It’s interesting to note that there were precursors to Noto, too. I’m thinking of the free font Code2000 by James Kass, which you used to see floating around the web in the 2000s. I actually just read that the project is restarting after a 15-year hiatus, taking on the same goal as Noto of achieving a pan-Unicode font.


42 Code2000. Image: Luc Devroye


While Google stopped funding Noto, are there other corporations that are funding such projects?

DA: At the moment, I am not aware of other corporations doing the same sort of thing, though one can hope there may be one or a group in the future.

Have you seen any notable impacts or feedback from communities whose scripts you have digitised? Has there been any negative feedback from them?

DA: It takes time for new scripts to be included in widely available fonts and supported in software, but it is (for me) very rewarding to see, for example, N’Ko being used on Facebook. My project helped N’Ko early on.

I haven’t personally received negative feedback, except frustration from groups who are anxious for their script to appear on phones.


Typeface with N’Ko support by Tapiwanashe S. Garikayi. Images: Nan.xyz


MP: I have seen a lot of excitement from the scientific communities, especially through the PIM project that I have been working on in partnership with the Bibliothèque Nationale de France and the ANRT. This project consists of developing typographic tools for the transcription of ancient monetary inscriptions. Although the writing systems involved in this project are already encoded in Unicode, there is still a lack of accessible fonts. The human sciences, such as archaeology, epigraphy and linguistics, are quite disadvantaged in terms of the typographic tools available to support their studies.

DA: It is true that many of those in the academic world would benefit from understanding the process of encoding and fonts better, so they can get the support they need for their work.



JB: One fine day, Apple introduced a keyboard layout for Cherokee in iOS with a software update. This meant that the community of the Cherokee Nation — that’s 450,000 people according to the latest data (2023) — can now use their own writing system on smartphones and computers. Before, they could only use Latin.

AH: Because many language or script communities may share the same Unicode code points, issues sometimes arise even after encoding over how the script or characters are named, or which Unicode block the characters show up in. In one sense, Unicode is a low-level standard that regular users shouldn’t really have to interact with, meaning the names and blocks are not usually consequential to a user’s experience. But at the same time, the Standard itself holds a certain symbolism and is sometimes perceived as having a legitimising force. And so something having the «wrong» name can matter to a community. Unfortunately, because of Unicode’s stability guarantees, such changes typically cannot be made, even if desired.

Have native speakers ever asked you to design more typefaces supporting their language?

MP: I hear complaints, of course, but sometimes users are unaware of typographic possibilities. And the problem doesn’t come from the complexity of the writing system. It comes from political decisions and business plans. As we all know, money rules the world, and it dictates which languages have more typographic access and choice.

How often do type designers contact you because they’ve spotted some flaws in the typeface designed for a missing script?

MP: Not very often, considering the number of glyphs representing the world’s writing systems. It is a collaborative work in progress and we welcome feedback and advice. If we realise that there is a better representative character, we update it for the next version. It is also an opportunity for fascinating discussions, which adds to the richness of the project.

JB: For example, someone drew our attention to the fact that Ethiopic is not the best name for the writing system depicted. Ge’ez is the better umbrella term. Another example: After discussing with the community, we reduced the most important phases of Chinese writing, covering 3,300 years of history, from five to four.



How do they know that the outlines are flawed if the script is no longer used in everyday life?

MP: This is a difficult question. We can detect errors by comparing the inscription or manuscript with other sources. If we can find other references to the same allograph (each of two or more letters or letter combinations representing a single phoneme in different words), this can provide evidence for the letter form under study. The quality of the drawing can also give us some clues. For example, if the inscription is carefully drawn on high-quality material, we can deduce that the graphic form and ductus were made with purpose and skill.

JB: This leads to the almost philosophical question: are there general, universal optical rules for the (human) design of typefaces? This is discussed from time to time among experts. I would say: absolutely. Example: the gravity of a letter. In other words, the desire of type designers to draw a glyph stably, so that it stands or sits well on the (imaginary) baseline. So there are optical corrections to make it look right to the human eye. Gravity is a daily physical experience for us humans, which we also inscribe in our letterforms. The letter O is therefore very rarely an exact geometric circle. A fish would design it differently.


50 Poster with all known writing systems, 2022. Image: ANRT


51 Poster with all known writing systems, 2022. Image: ANRT


What will the project do after all 292 scripts get encoded?

DA: Continue working, since additional scripts are being discovered and/or new users are asking for their script to be in Unicode. In addition, characters often need to be added to existing scripts.

AH: There are also many stories and experiences that are worth documenting about how these scripts got encoded. That is a project we are starting already.

MP: As I mentioned earlier, the work of encoding is nothing without the creation of digital fonts. So in addition to the missing scripts, we need to work on the missing fonts. There is an abundance of Latin fonts, but for some less privileged languages, the choice is far smaller. Sometimes the diacritics or glyphs needed to write languages that use the Latin alphabet, such as Turkish, Vietnamese, Polish or the Pan-African alphabets, are missing. The choice of fonts for other writing systems in the world is even more limited. I will take the example of the Arabic alphabet, which I think is very revealing, even though Arabic is one of the most widely used writing systems in the world. It allows not only the writing of the Arabic language, but also the writing of Persian, Kurdish, Kashmiri or Urdu, among others. There are many different styles, linked in particular to cultural, period, and geographical preferences. But this formal diversity has yet to be revealed in the world of digital fonts. To take an example of an ancient writing system, I noticed the same problem with the Phoenician alphabet, which slowly evolved into the writing systems we use today: Latin, Greek, Cyrillic, Arabic, Armenian, Hebrew, and so on. Although we can see progress and more interest from the different communities, there is still a long way to go to make up for the accumulated deficiencies.

JB: Unicode started in 1991 and took 27 years to encode 150 scripts. We have done the maths: at the same pace, all writing systems of humanity will be united in a universal code around the year 2044. It will have taken a little more than 50 years to encode 5,300 years of written history. Not so long in comparison! So, hopefully our mission will be completed by the year 2044. Unless by then we meet aliens.


Missing Scripts Project

worldswritingsystems.org