Statistical unicodification of African languages
Title | Statistical unicodification of African languages |
Publication Type | Journal Article |
Year of Publication | 2011 |
Authors | Scannell, Kevin P. |
Journal Title | Language Resources and Evaluation |
Journal Date | 09/2011 |
Volume | 45 |
Issue | 3 |
Pagination | 375-386 |
ISSN | 1574-020X |
Abstract | Many languages in Africa are written using Latin-based scripts, but often with extra diacritics (e.g. dots below in Igbo: ị, ọ, ụ) or modifications to the letters themselves (e.g. open vowels “e” and “o” in Lingala: ɛ, ɔ). While it is possible to render these characters accurately in Unicode, oftentimes keyboard input methods are not easily accessible or are cumbersome to use, and so the vast majority of electronic texts in many African languages are written in plain ASCII. We call the process of converting an ASCII text to its proper Unicode form unicodification. This paper describes an open-source package which performs automatic unicodification, implementing a variant of an algorithm described in previous work of De Pauw, Wagacha, and de Schryver. We have trained models for more than 100 languages using web data, and have evaluated each language using a range of feature sets. |
URL | https://www.springerlink.com/content/g53784640q43206m/ |
DOI | 10.1007/s10579-011-9150-3 |
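The idea of unicodification described in the abstract can be illustrated with a toy sketch. This is not the package or the algorithm from the paper: it is a simple word-frequency lookup, and the names `asciify`, `train`, `unicodify`, and the `SPECIAL` table are hypothetical, introduced only for illustration. It assumes a small list of correctly spelled training words is available.

```python
import unicodedata
from collections import Counter, defaultdict

# Hypothetical table for modified letters that have no ASCII base
# under Unicode decomposition (e.g. the Lingala open vowels).
SPECIAL = {"ɛ": "e", "ɔ": "o", "Ɛ": "E", "Ɔ": "O"}

def asciify(text):
    """Reduce text to its plain form: drop combining marks, map special letters."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            continue  # skip diacritics such as the dot below
        out.append(SPECIAL.get(ch, ch))
    return "".join(out)

def train(corpus_words):
    """Count, for each ASCII form, how often each Unicode spelling occurs."""
    model = defaultdict(Counter)
    for word in corpus_words:
        word = unicodedata.normalize("NFC", word)
        model[asciify(word)][word] += 1
    return model

def unicodify(text, model):
    """Replace each whitespace-separated word by its most frequent Unicode spelling."""
    restored = []
    for word in text.split():
        candidates = model.get(word)
        # Unknown words are left unchanged in this toy version.
        restored.append(candidates.most_common(1)[0][0] if candidates else word)
    return " ".join(restored)
```

With a model trained on `["ụlọ", "ụlọ", "ulo"]`, calling `unicodify("ulo", model)` returns the dotted spelling, because it is the more frequent of the two candidates. A word-level lookup like this cannot handle words unseen in training, so it should be read only as a picture of the input and output of the task.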