Helsinki Corpus of Swahili


The Helsinki Corpus of Swahili (HCS) contains 12.5 million words of text drawn from a number of current news sources and from extracts of a large number of books. Typing errors in the texts have been corrected manually. The corpus was tagged with SALAMA without human intervention. The corpus is available free of charge for scientific research, subject to a signed contract.
The corpus can be accessed through the web-based browser Lemmie 2.0. Direct access to the Linux server is also possible. The English glosses cannot currently be accessed with Lemmie 2.0, so users who need them may wish to use the Linux interface.
HCS does not currently have syntactic tags. In the future we hope to enrich the corpus with those tags, together with a number of new features, including a large number of idioms and multi-word expressions. New texts will also be added.

