Resource-Light Bantu Part-of-Speech Tagging

Title: Resource-Light Bantu Part-of-Speech Tagging
Publication Type: Proceedings Article
Year of Conference: 2012
Authors: Guy De Pauw, Gilles-Maurice de Schryver, and Janneke van de Loo
Conference Name: Proceedings of the workshop on Language technology for normalisation of less-resourced languages (SALTMIL8/AfLaT2012)
Publisher: European Language Resources Association (ELRA)
Conference Location: Istanbul, Turkey
ISBN Number: 978-2-9517408-7-7

Recent scientific publications on data-driven part-of-speech tagging of Sub-Saharan African languages have reported encouraging accuracy scores, using off-the-shelf tools and often fairly limited amounts of training data. However, no research has yet explored which types of linguistic features contribute to accurate part-of-speech tagging for these languages. This paper describes feature selection experiments with a memory-based tagger, as well as a resource-light alternative approach. The experimental results show that contextual information is often not strictly necessary to achieve good tagging accuracy for Bantu languages, and that decent results can be obtained with a very straightforward unigram approach based on orthographic features.
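
To make the idea of a context-free tagger driven by orthographic features more concrete, the following is a minimal sketch and not the method used in the paper. The abstract does not specify the feature set or the learner, so everything here is an assumption made for illustration: the class name `OrthographicUnigramTagger`, the choice of features (prefixes and suffixes of up to three characters, capitalization, presence of digits), the simple voting scheme, and the handful of Swahili words with coarse N/V tags used as toy training data.

```python
from collections import Counter, defaultdict


def orthographic_features(word, max_affix=3):
    """Return features derived only from the spelling of a single word."""
    lower = word.lower()
    feats = [f"word={lower}"]
    for n in range(1, max_affix + 1):
        if len(lower) > n:  # skip affixes that would cover the whole word
            feats.append(f"prefix{n}={lower[:n]}")
            feats.append(f"suffix{n}={lower[-n:]}")
    feats.append(f"capitalized={word[:1].isupper()}")
    feats.append(f"has_digit={any(ch.isdigit() for ch in word)}")
    return feats


class OrthographicUnigramTagger:
    """Hypothetical tagger: each word is tagged in isolation, without context."""

    def __init__(self):
        self.feature_tag_counts = defaultdict(Counter)
        self.tag_counts = Counter()

    def train(self, tagged_words):
        for word, tag in tagged_words:
            self.tag_counts[tag] += 1
            for feat in orthographic_features(word):
                self.feature_tag_counts[feat][tag] += 1

    def tag(self, word):
        # Each known feature votes for tags in proportion to how often
        # it co-occurred with them in the training data.
        scores = Counter()
        for feat in orthographic_features(word):
            counts = self.feature_tag_counts.get(feat)
            if not counts:
                continue
            total = sum(counts.values())
            for tag, count in counts.items():
                scores[tag] += count / total
        if not scores:
            # No feature was seen in training: fall back to the most frequent tag.
            return self.tag_counts.most_common(1)[0][0]
        # Sorting first makes tie-breaking deterministic (alphabetical).
        return max(sorted(scores), key=scores.get)


# Toy data invented for this sketch (Swahili words, coarse N/V tags).
TOY_TRAIN = [
    ("mtu", "N"), ("watu", "N"), ("kitabu", "N"), ("vitabu", "N"),
    ("anasoma", "V"), ("wanasoma", "V"), ("anapika", "V"), ("wanapika", "V"),
]

tagger = OrthographicUnigramTagger()
tagger.train(TOY_TRAIN)
print(tagger.tag("anacheza"))  # unseen word, tagged V via prefixes such as "ana"
print(tagger.tag("kiti"))      # unseen word, tagged N via prefixes such as "ki"
```

Because every word is tagged independently, a tagger of this kind needs no sequence model and no surrounding words at prediction time. In Bantu languages, prefixes carry a large share of the morphological information (noun class and verbal agreement markers), which is one plausible reason why spelling-based features alone can be informative.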
