Central Africa

Corpora for African languages - An Crúbadán


The Crúbadán Project is devoted to creating basic language technology for minority and under-resourced languages using web crawling and statistical techniques. As of early 2008 we have collected text corpora for 419 languages, including more than 125 African languages, and have used these to create open source spell checkers for more than 20 languages. Please contact Kevin Scannell (http://borel.slu.edu/) if you are interested in developing open source resources for other African languages using these data.

Automatic Diacritic Restoration for African Languages

The orthography of many African languages includes diacritically marked characters. Falling outside the scope of the standard Latin encoding, these characters are often represented in digital language resources as their unmarked equivalents. This renders corpus compilation more difficult, as these languages typically do not have the benefit of large electronic dictionaries to perform diacritic restoration.

This is a demonstration system for a method that automatically restores diacritics on the basis of local graphemic context. It is based on the machine learning technique of memory-based learning. We have applied the method to the African languages Cilubà, Gĩkũyũ, Kĩkamba, Maa, Sesotho sa Leboa, Tshivenḓa and Yoruba.
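
The idea of restoring diacritics from local graphemic context can be illustrated with a toy sketch. It is not the system's actual implementation: the window width, the simple overlap metric, the lowercasing step and the function names train and restore are all assumptions made for illustration, and the training words are simply the language names listed above.

```python
import unicodedata
from collections import Counter


def strip_diacritics(text):
    """Return text with all combining marks removed (e.g. 'ĩ' -> 'i')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def windows(plain, width=2):
    """Yield one context tuple per character: the character plus `width`
    neighbours on each side, padded with '_' at the word boundaries."""
    padded = "_" * width + plain + "_" * width
    for i in range(len(plain)):
        yield tuple(padded[i:i + 2 * width + 1])


def train(words, width=2):
    """Memory-based 'training' just stores every (context, target) pair."""
    memory = []
    for word in words:
        marked = unicodedata.normalize("NFC", word.lower())
        plain = strip_diacritics(marked)
        if len(plain) != len(marked):
            continue  # skip words that do not align one-to-one
        for context, target in zip(windows(plain, width), marked):
            memory.append((context, target))
    return memory


def restore(plain, memory, width=2):
    """Pick, for each character, the target of the most similar stored
    contexts (overlap metric: number of matching window positions)."""
    restored = []
    for context in windows(plain, width):
        focus = context[width]
        scored = [
            (sum(a == b for a, b in zip(context, stored)), target)
            for stored, target in memory
            if stored[width] == focus  # only compare the same base letter
        ]
        if not scored:
            restored.append(focus)  # unseen letter: leave it unmarked
            continue
        best = max(score for score, _ in scored)
        votes = Counter(target for score, target in scored if score == best)
        restored.append(votes.most_common(1)[0][0])
    return "".join(restored)


# Assumption: a tiny illustrative memory built from four language names.
memory = train(["Gĩkũyũ", "Kĩkamba", "Cilubà", "Tshivenḓa"])
print(restore("gikuyu", memory))
```

With this memory, restore("gikuyu", memory) returns "gĩkũyũ", because every character window of the input matches a stored instance exactly. A realistic system would need a far larger training corpus and a tuned similarity metric.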

You can find more information on this system in this paper.

Select a language and enter the word or sentence you want to restore diacritics for.
Cilubà (e.g. mutekete)
Gĩkũyũ (e.g. nituronire)
Kĩkamba (e.g. ningulilikana)
Maasai (e.g. oltunani)
Sesotho sa Leboa (Northern Sotho) (e.g. swanetse)
Tshivenḓa (e.g. tshiswitulo)
Yoruba (e.g. isinku)

Guy De Pauw: CNTS - Language Technology Group, University of Antwerp, Antwerp, Belgium, guy [dot] depauw [at] ua [dot] ac [dot] be
Gilles-Maurice de Schryver: African Languages and Cultures, Ghent University, Ghent, Belgium, gillesmaurice [dot] deschryver [at] ugent [dot] be
Peter Waiganjo Wagacha: School of Computing and Informatics, University of Nairobi, Nairobi, Kenya, waiganjo [at] uonbi [dot] ac [dot] ke

African Languages in Danger of Disappearing -- Interactive Atlas

Click here to use the interactive atlas.

Open Question: Should we focus our efforts on these disappearing languages before it is too late? What can we, as computational linguists, do, and should we do something?

Google Interface in African Languages


Google currently offers its interface in the following African languages:

Language    Internet address
Afrikaans   http://www.google.com/intl/af/
Amharic     http://www.google.com/intl/am/
Lingála     http://www.google.com/intl/ln/
Sesotho     http://www.google.com/intl/st/
Shona       http://www.google.com/intl/sn/
Somali      http://www.google.com/intl/so/
Swahili     http://www.google.com/intl/sw/
Tigrinya    http://www.google.com/intl/ti/
Twi         http://www.google.com/intl/tw/
Xhosa       http://www.google.com/intl/xh/
Yoruba      http://www.google.com/intl/yo/
Zulu        http://www.google.com/intl/zu/

Online Publications Involving TshwaneDJe Members


A collection of articles and papers on corpus and dictionary topics for the African languages, as well as on language-independent lexicography and terminology software.

Fang Dictionary


An on-line Fang - English, French, Spanish, Portuguese translation dictionary.

Number to Ngangela Words Converter


This program converts a number (less than 10,000) to Ngangela / Nyemba words.

Cilubà - French Dictionary


An on-line Cilubà-French and French-Cilubà translation dictionary.

