Statistical unicodification of African languages

Title: Statistical unicodification of African languages
Publication Type: Journal Article
Year of Publication: 2011
Authors: Scannell, Kevin P.
Journal Title: Language Resources and Evaluation
Journal Date: 09/2011

Many languages in Africa are written using Latin-based scripts, but often with extra diacritics (e.g. dots below in Igbo: ị, ọ, ụ) or modifications to the letters themselves (e.g. open vowels “e” and “o” in Lingala: ɛ, ɔ). While it is possible to render these characters accurately in Unicode, keyboard input methods are often not easily accessible or are cumbersome to use, and so the vast majority of electronic texts in many African languages are written in plain ASCII. We call the process of converting an ASCII text to its proper Unicode form unicodification. This paper describes an open-source package that performs automatic unicodification, implementing a variant of an algorithm described in earlier work by De Pauw, Wagacha, and de Schryver. We have trained models for more than 100 languages using web data, and have evaluated each language using a range of feature sets.
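
To make the task concrete, here is a minimal Python sketch of unicodification as a word-level lookup. It is not the algorithm implemented in the package described above, whose details this record does not give. It is only a simple baseline under stated assumptions: the function names (asciify, train, unicodify) and the EXTRA_FOLDS table are hypothetical, tokens are split on whitespace, and case and punctuation are ignored.

```python
import unicodedata
from collections import Counter, defaultdict

# Hypothetical fold table for letters that Unicode decomposition cannot
# reduce to ASCII (e.g. the Lingala open vowels). Illustrative, not exhaustive.
EXTRA_FOLDS = str.maketrans({"ɛ": "e", "ɔ": "o", "Ɛ": "E", "Ɔ": "O"})


def asciify(text):
    """Strip combining marks (e.g. the Igbo dot below) and fold modified letters."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(EXTRA_FOLDS)


def train(corpus_words):
    """Map each ASCII-folded word to its most frequent Unicode form in the corpus."""
    counts = defaultdict(Counter)
    for word in corpus_words:
        word = unicodedata.normalize("NFC", word)
        counts[asciify(word)][word] += 1
    return {key: forms.most_common(1)[0][0] for key, forms in counts.items()}


def unicodify(text, model):
    """Restore each whitespace-separated token; unknown tokens pass through unchanged."""
    return " ".join(model.get(token, token) for token in text.split())


# Toy training data: properly encoded words, as might be harvested from web text.
model = train(["ụlọ", "ụlọ", "ulo", "mɔkɔ"])
restored = unicodify("ulo moko", model)
```

A lookup of this kind cannot restore words it has never seen and ignores context when an ASCII form has several valid Unicode spellings, which is one reason a statistical approach evaluated over a range of feature sets is of interest.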