Exploring the sawa corpus: collection and deployment of a parallel corpus English—Swahili

Submitted by Guy on Tue, 2011-09-20 10:32

Title	Exploring the sawa corpus: collection and deployment of a parallel corpus English—Swahili
Publication Type	Journal Article
Year of Publication	2011
Authors	De Pauw, Guy, Wagacha, Peter W., and de Schryver, Gilles-Maurice
Journal Title	Language Resources and Evaluation
Journal Date	09/2011
Volume	45
Issue	3
Pagination	331-344
ISSN	1574-020X
Abstract	Research in machine translation and corpus annotation has greatly benefited from the increasing availability of word-aligned parallel corpora. This paper presents ongoing research on the development and application of the sawa corpus, a two-million-word parallel corpus English—Swahili. We describe the data collection phase and zero in on the difficulties of finding appropriate and easily accessible data for this language pair. In the data annotation phase, the corpus was semi-automatically sentence and word-aligned and morphosyntactic information was added to both the English and Swahili portion of the corpus. The annotated parallel corpus allows us to investigate two possible uses. We describe experiments with the projection of part-of-speech tagging annotation from English onto Swahili, as well as the development of a basic statistical machine translation system for this language pair, using the parallel corpus and a consolidated database of existing English—Swahili translation dictionaries. We particularly focus on the difficulties of translating English into the morphologically more complex Bantu language of Swahili.
URL	https://www.springerlink.com/content/6650157426v8318t/
DOI	10.1007/s10579-011-9159-7
