Statistical unicodification of African languages

Submitted by Guy on Tue, 2011-09-20 10:36

Title	Statistical unicodification of African languages
Publication Type	Journal Article
Year of Publication	2011
Authors	Scannell, Kevin P.
Journal Title	Language Resources and Evaluation
Journal Date	09/2011
Volume	45
Issue	3
Pagination	375-386
ISSN	1574-020X
Abstract	Many languages in Africa are written using Latin-based scripts, but often with extra diacritics (e.g. dots below in Igbo: ị, ọ, ụ) or modifications to the letters themselves (e.g. open vowels “e” and “o” in Lingala: ɛ, ɔ). While it is possible to render these characters accurately in Unicode, oftentimes keyboard input methods are not easily accessible or are cumbersome to use, and so the vast majority of electronic texts in many African languages are written in plain ASCII. We call the process of converting an ASCII text to its proper Unicode form unicodification. This paper describes an open-source package which performs automatic unicodification, implementing a variant of an algorithm described in previous work of De Pauw, Wagacha, and de Schryver. We have trained models for more than 100 languages using web data, and have evaluated each language using a range of feature sets.
URL	https://www.springerlink.com/content/g53784640q43206m/
DOI	10.1007/s10579-011-9150-3
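
To make the idea of unicodification in the abstract concrete, here is a minimal sketch of a word-lookup baseline in Python. It is not the algorithm implemented by the package described in the paper, and it uses none of the paper's feature sets. The function names (`asciify`, `train`, `unicodify`) and the small `EXTRA_ASCII` table are hypothetical and chosen only for illustration. The sketch assumes a training corpus that is already in correct Unicode: it strips each word to ASCII, remembers the most frequent Unicode spelling seen for each ASCII form, and uses that mapping to restore plain ASCII input.

```python
import re
import unicodedata
from collections import Counter, defaultdict

# Letters such as the open vowels do not decompose under NFD, so they need
# an explicit ASCII fallback. This tiny table is illustrative only.
EXTRA_ASCII = str.maketrans({"ɛ": "e", "ɔ": "o", "Ɛ": "E", "Ɔ": "O"})

# \w+ does not match bare combining marks, so text is normalised to NFC
# first; letters without a precomposed form are out of scope for this sketch.
WORD = re.compile(r"\w+")


def asciify(text):
    """Strip combining marks and map modified letters to plain ASCII."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.translate(EXTRA_ASCII)


def train(corpus):
    """Map each ASCII word form to its most frequent Unicode spelling."""
    counts = defaultdict(Counter)
    for word in WORD.findall(unicodedata.normalize("NFC", corpus.lower())):
        counts[asciify(word)][word] += 1
    return {key: forms.most_common(1)[0][0] for key, forms in counts.items()}


def unicodify(text, model):
    """Replace each known ASCII word with its stored Unicode spelling.

    Unknown words pass through unchanged. Restored words come back in
    lower case, because the model is trained on lower-cased text.
    """
    def restore(match):
        word = match.group(0)
        return model.get(word.lower(), word)
    return WORD.sub(restore, text)
```

For instance, after `model = train("ọ dị mma. ọ dị mma. o di")` on that toy corpus, `unicodify("o di mma", model)` returns `ọ dị mma`. A pure word lookup like this cannot handle words that never occurred in the training data, which is one reason a statistical approach might also draw on character-level context.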
