Southern Africa

NHN Day 2010 - Call for Abstracts


National Human Language Technology Network (NHN) Day 2010

The National Human Language Technology Network (NHN) is a collaborative effort that aims to strengthen synergies between HLT researchers and practitioners in South Africa. Its members currently include most of the major South African tertiary institutions and research councils that are actively involved in HLT R&D activities.

Studentships, internships and bursaries at Meraka (South Africa)

The Human Language Technology (HLT) Research Group of the Meraka Institute, CSIR, is offering a number of studentships, internships and bursaries for 2010.

The relevant websites to consult are:

The Meraka website
The CSIR website
The CSIR intraweb (for CSIR employees).

Students MUST complete the application process ONLINE and are encouraged to visit the Meraka website for the correct procedure to follow.

List of Tshivenda words containing diacritics


List of Tshivenda words containing diacritics, distributed under the Creative Commons Attribution 2.5 South Africa License.

Web Site Flore


This web site presents the names of African trees in different languages.
It currently lists 375 names in 17 languages.
The language of communication is French, but browsing is very simple.



Unicode-Afrique is a forum on Yahoo Groups. It exists to publicize projects in Africa that use Unicode; to discuss practical questions and problems with Unicode and character sets for African languages; and to share useful experience in developing and using Unicode fonts for African languages. This e-group is part of a "family" of discussion forums on the meeting point of African languages and ICT (the other forums are accessible from the "A12n" gateway page, which is linked at the bottom of this page).

Corpora for African languages - An Crúbadán


The Crúbadán Project is devoted to creating basic language technology for minority languages and under-resourced languages using web-crawling and statistical techniques. As of early 2008 we have collected text corpora for 419 languages, including more than 125 African languages, and have used these to create open source spell checkers for more than 20 languages. Please contact Kevin Scannell if you are interested in developing open source resources for other African languages using these data.

Computational Morphological Analysis (University of South Africa)


The Computational Morphological Analysis project is a unique multidisciplinary research project, requiring knowledge, expertise and skills associated with the disciplines of linguistics of the South African indigenous languages and of computing science.

This project is supported by the National Research Foundation (NRF) within the Focus Area of Information and Communication Technology.

The significance of morphological analysis as a basic enabling application for further kinds of NLP is well known. The primary aim of the overall project is therefore the development of computational morphological analysers for Bantu languages, using the language-independent Xerox Finite-State Tools. This integrated set of tools is used to model and implement the complexities of word-formation rules as well as morphophonological alternations by means of finite-state networks, which in turn are combined algorithmically into larger networks that perform morphological analysis. Lexical challenges are addressed through the development of machine-readable lexicons in XML format, containing knowledge about individual words in the languages.
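To make the idea of combining small networks into an analyser more concrete, here is a minimal sketch in Python. It does not use the Xerox tools or their notation. The XML schema, the state names, the tags (PFX1, SFX1 and so on) and all morphemes are invented placeholders rather than data from any real language, and morphophonological alternations are left out entirely.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML lexicon: element and attribute names are assumptions,
# and the roots are invented placeholders, not real lexical data.
LEXICON_XML = """
<lexicon>
  <entry pos="V"><root>tup</root></entry>
  <entry pos="V"><root>sem</root></entry>
</lexicon>
"""

def load_root_arcs(xml_text):
    """Turn each lexicon entry into an arc: (surface form, analysis tag, next state)."""
    arcs = []
    for entry in ET.fromstring(xml_text.strip()).iter("entry"):
        root = entry.findtext("root")
        arcs.append((root, f"{root}[{entry.get('pos')}]", "suffix"))
    return arcs

# Three small sub-networks (prefixes, roots, suffixes) joined into one network.
# An arc with an empty surface form lets a word skip the prefix slot.
NETWORK = {
    "prefix": [("ka", "PFX1+", "root"), ("ki", "PFX2+", "root"), ("", "", "root")],
    "root": load_root_arcs(LEXICON_XML),
    "suffix": [("a", "+SFX1", "final"), ("ile", "+SFX2", "final")],
}
FINAL_STATES = {"final"}

def analyse(word, state="prefix", pos=0, analysis=""):
    """Return every analysis whose path consumes the whole word and ends in a final state."""
    if state in FINAL_STATES:
        return [analysis] if pos == len(word) else []
    results = []
    for surface, tag, target in NETWORK[state]:
        if word.startswith(surface, pos):
            results.extend(analyse(word, target, pos + len(surface), analysis + tag))
    return results

print(analyse("katupa"))  # ['PFX1+tup[V]+SFX1']
print(analyse("semile"))  # ['sem[V]+SFX2']
print(analyse("katup"))   # [] (no path reaches a final state)
```

Keeping the roots in a separate lexicon file means new words can be added without touching the word-formation network, which is the separation the paragraph above describes.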

Project Leader:
Sonja Bosch
boschse [at] unisa [dot] ac [dot] za

Automatic Diacritic Restoration for African Languages

The orthography of many African languages includes diacritically marked characters. Falling outside the scope of the standard Latin encoding, these characters are often represented in digital language resources as their unmarked equivalents. This renders corpus compilation more difficult, as these languages typically do not have the benefit of large electronic dictionaries to perform diacritic restoration.

This is a demonstration system for a diacritic restoration method that automatically restores diacritics on the basis of local graphemic context. It is based on the machine learning method of memory-based learning. We have applied the method to the African languages Cilubà, Gĩkũyũ, Kĩkamba, Maa, Sesotho sa Leboa, Tshivenḓa and Yoruba.

You can find more information on this system in this paper.
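The following is a minimal sketch of grapheme-level restoration with a memory-based (nearest-neighbour) classifier, written in Python. It is not the demonstration system itself: the class name, the window width of two letters on each side and the simple overlap metric are assumptions made for illustration. The training words are taken from the Northern Sotho example sentence shown in the tagger demo further down this page.

```python
import unicodedata
from collections import Counter

def split_clusters(word):
    """Split a word into (base letter, letter with its diacritics) pairs."""
    clusters = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and clusters:
            clusters[-1] += ch  # attach the combining mark to its base letter
        else:
            clusters.append(ch)
    return [(c[0], unicodedata.normalize("NFC", c)) for c in clusters]

def context_windows(bases, width):
    """One window per letter: `width` letters left, the focus letter, `width` right."""
    padded = ["_"] * width + list(bases) + ["_"] * width
    return [tuple(padded[i:i + 2 * width + 1]) for i in range(len(bases))]

class MemoryBasedRestorer:
    """Stores every training instance and classifies by feature overlap."""

    def __init__(self, width=2):
        self.width = width
        self.memory = []  # list of (window of unmarked letters, marked letter)

    def train(self, words):
        for word in words:
            pairs = split_clusters(word)
            bases = [base for base, _ in pairs]
            for window, (_, label) in zip(context_windows(bases, self.width), pairs):
                self.memory.append((window, label))

    def restore_word(self, word):
        bases = [base for base, _ in split_clusters(word)]
        restored = []
        for window in context_windows(bases, self.width):
            focus = window[self.width]
            # Only stored instances with the same focus letter are candidates.
            scored = [(sum(a == b for a, b in zip(window, stored)), label)
                      for stored, label in self.memory if stored[self.width] == focus]
            if not scored:
                restored.append(focus)  # unseen letter: leave it unchanged
                continue
            best = max(score for score, _ in scored)
            votes = Counter(label for score, label in scored if score == best)
            restored.append(votes.most_common(1)[0][0])
        return "".join(restored)

    def restore(self, text):
        return " ".join(self.restore_word(w) for w in text.split(" "))

restorer = MemoryBasedRestorer()
restorer.train(["Motho", "ge", "a", "sa", "tseba", "o", "swanetše", "go",
                "dumela", "seo", "gore", "bao", "ba", "tsebago", "ba", "mmotše"])
print(restorer.restore("swanetse"))  # swanetše
```

Because the classifier only looks at neighbouring letters, it needs no dictionary, which matches the motivation given above for languages without large electronic word lists. With so little training data the sketch mostly reproduces what it has seen; a realistic model would be trained on a full corpus.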

Select a language and enter the word or sentence you want to restore diacritics for.
Cilubà (e.g. mutekete)
Gĩkũyũ (e.g. nituronire)
Kĩkamba (e.g. ningulilikana)
Maasai (e.g. oltunani)
Sesotho sa Leboa (Northern Sotho) (e.g. swanetse)
Tshivenḓa (e.g. tshiswitulo)
Yoruba (e.g. isinku)

Guy De Pauw: CNTS - Language Technology Group, University of Antwerp, Antwerp, Belgium, guy [dot] depauw [at] ua [dot] ac [dot] be
Gilles-Maurice de Schryver: African Languages and Cultures, Ghent University, Ghent, Belgium, gillesmaurice [dot] deschryver [at] ugent [dot] be
Peter Waiganjo Wagacha: School of Computing and Informatics, University of Nairobi, Nairobi, Kenya, waiganjo [at] uonbi [dot] ac [dot] ke

Northern Sotho Part-of-Speech Tagger (V2) - Demo

This demo showcases a part-of-speech tagger for Northern Sotho. It retrieves the morpho-syntactic category of each word in a sentence. It uses MBT, a memory-based tagger, trained on a relatively small annotated corpus.

Version 1: October 10, 2007 (20k tokens training set)
Version 2: December 8, 2007 (35k tokens training set)
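As a rough illustration of how a memory-based tagger works, the Python sketch below stores every training token with a small context and tags new text by nearest-neighbour lookup. It is not MBT, whose feature sets are more elaborate. The class name, the three features (previous tag, focus word, next word) and the toy English training data with generic tags are all assumptions, chosen because the annotated Northern Sotho corpus and its tagset are not reproduced on this page.

```python
from collections import Counter

def features(words, i, previous_tag):
    """Context for token i: the tag already assigned to its left, the word, the next word."""
    next_word = words[i + 1] if i + 1 < len(words) else "</s>"
    return (previous_tag, words[i], next_word)

class MemoryBasedTagger:
    """Keeps all training instances in memory and tags by feature overlap."""

    def __init__(self):
        self.memory = []  # list of (feature tuple, tag)

    def train(self, tagged_sentences):
        for sentence in tagged_sentences:
            words = [word for word, _ in sentence]
            previous_tag = "<s>"
            for i, (_, tag) in enumerate(sentence):
                self.memory.append((features(words, i, previous_tag), tag))
                previous_tag = tag

    def tag(self, words):
        if not self.memory:
            raise ValueError("the tagger has not been trained")
        tags = []
        previous_tag = "<s>"
        for i in range(len(words)):
            query = features(words, i, previous_tag)
            scored = [(sum(a == b for a, b in zip(query, stored)), tag)
                      for stored, tag in self.memory]
            best = max(score for score, _ in scored)
            # Majority vote among the stored instances closest to the query.
            votes = Counter(tag for score, tag in scored if score == best)
            previous_tag = votes.most_common(1)[0][0]
            tags.append(previous_tag)
        return list(zip(words, tags))

# Toy training data, invented for illustration only.
tagger = MemoryBasedTagger()
tagger.train([
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("bird", "NOUN"), ("sings", "VERB")],
])
print(tagger.tag(["a", "fox", "sleeps"]))
# [('a', 'DET'), ('fox', 'NOUN'), ('sleeps', 'VERB')]
```

The unseen word "fox" is still tagged because its neighbours match stored instances, which is why this family of methods can cope with a relatively small training corpus.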

Type in the text you want to tag (2,500 character limit)
Example: Motho ge a sa tseba o swanetše go dumela seo gore bao ba tsebago ba mmotše.


Verbal extension sequencing: An examination from a computational perspective

Verbal extension sequencing: An examination from a computational perspective, Anderson, Winston, and Kotzé, Albert E., 14th International Conference of the African Language Association of Southern Africa, Nelson Mandela Metropolitan University, Port Elizabeth, Eastern Cape Province, South Africa (2007)
