Bilingual Data Mining for the English-Amharic Statistical Machine Translation

Title: Bilingual Data Mining for the English-Amharic Statistical Machine Translation
Publication Type: Conference Paper
Year of Publication: 2011
Authors: Mulu Gebreegziabher, Laurent Besacier, Girma Taye, and Dereje Teferi
Booktitle: AGIS11 - Action Week for Global Information Sharing (AfLaT2011 Breakout Session)
Location: Addis Ababa, Ethiopia
Abstract

Machine Translation (MT) is the use of computers to translate text from one natural language to another. Corpus-based approaches to MT have been on the rise, especially for languages such as Amharic, which can be considered resource-scarce. The statistical machine translation (SMT) approach relies heavily on aligned bilingual parallel corpora in the source and target languages. Developing MT with a rule-based approach, which depends heavily on integrated linguistic knowledge, rules and resources for both the source and target languages, is too great a challenge for a language like Amharic; the SMT approach, by contrast, requires very limited computational linguistic resources.
The experiment described here therefore aims to collect a training corpus from expressions found in comparable Amharic-English news. The first step is to collect a raw English-Amharic news corpus from the Ethiopian News Agency (ENA). A total of 35,049 news items have been collected, covering domestic, regional and international topics. Each news item is written either in Amharic or in English: English accounts for 32% (11,276 items) and Amharic for 68% (23,773 items). The corpus covers the period from August 21, 2006 to January 06, 2008.
The next step is to align the corpus automatically at document level by identifying the news documents that are translations of each other. A first experiment was run on 1,036 manually aligned English-Amharic news pairs in order to measure the quality of the aligner. The developed automatic aligner reached a recall of 1 (all the English news items were matched) and a precision of 0.93 (968 news items were matched correctly). The experiment was then extended to automatically align the whole English-Amharic ENA news corpus, which contains 11,276 English documents.
Further processing, namely tokenization and sentence splitting, was carried out before alignment at sentence level. Tokenization converts each document into a format suitable for the English-Amharic SMT (EASMT) system. Trimming removes the headers, footers, notes and other unnecessary data from each news document. After automatic trimming, each paragraph is split into sentences at sentence endings. Sentence-level alignment was done with Hunalign, a sentence aligner that aligns bilingual text using sentence-length information. A small English-Amharic bilingual dictionary of 8,212 words was also used.
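The sentence-splitting step described above can be illustrated with a minimal sketch. The abstract says only that paragraphs are split "using sentence endings"; the particular set of ending marks below (Latin `.`, `!`, `?` and the Ethiopic full stop U+1362) and the function name `split_sentences` are assumptions made for illustration, not the authors' implementation.

```python
import re
from typing import List

# Assumed sentence-ending marks: ".", "!", "?" and the Ethiopic full stop
# (U+1362). The source does not list the marks it actually used.
SENTENCE_END = re.compile(r"(?<=[.!?\u1362])\s+")


def split_sentences(paragraph: str) -> List[str]:
    """Split a paragraph at whitespace that follows a sentence-ending mark."""
    parts = SENTENCE_END.split(paragraph.strip())
    # Drop empty fragments, e.g. when the paragraph itself is empty.
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("The news was collected. It covers many topics!"):
        print(sentence)
```

A splitter this simple also breaks after abbreviations such as "Dr." and only splits where whitespace follows the ending mark, so a real pipeline would likely need extra rules before passing one sentence per line to a sentence aligner.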
The resulting parallel corpus contains 155,200 bilingual English-Amharic sentences. Future work is to develop the EASMT system and to improve its translation quality by using the English-Amharic news parallel corpus together with linguistic knowledge.