Finite state tokenisation of an orthographical disjunctive agglutinative language: The verbal segment of Northern Sotho
Title | Finite state tokenisation of an orthographical disjunctive agglutinative language: The verbal segment of Northern Sotho |
Publication Type | Proceedings Article |
Year of Conference | 2006 |
Authors | Anderson, Winston, and Kotzé, Petronella M. |
Conference Name | Fifth International Conference on Language Resources and Evaluation |
Pagination | 1906-1911 |
Conference Start Date | 24/05/2006 |
Publisher | European Language Resources Association |
Conference Location | Genoa, Italy |
Keywords | Northern Sotho, tokenisation |
Abstract | Tokenisation is an important first pre-processing step required to adequately test finite-state morphological analysers. In agglutinative languages, each morpheme is added concatenatively to form a complete morphological structure. Disjunctive agglutinative languages like Northern Sotho write these morphemes, for certain morphological categories only, as separate words separated by spaces or line breaks. These breaks are, by their nature, different from the breaks that separate "words" that are written conjunctively. A tokeniser is required to isolate categories, such as verbs, from raw text before they can be correctly morphologically analysed. The authors have successfully produced a finite-state tokeniser for Northern Sotho, where verb segments are written disjunctively but nominal segments conjunctively. The authors show that, since reduplication in Northern Sotho does not affect the pre-processing tokeniser, the disjunctive standard verbal segment as a construct in Northern Sotho is deterministic, finite-state and a regular (Type 3) language in the Chomsky hierarchy, and that the copulative verbal segment, due to its semi-disjunctivism, is ambiguously non-deterministic. |