Corpus

Web as Corpus 2007

3rd Web as Corpus Workshop (WAC3) - Incorporating Cleaneval, an ACL-SIGWAC event
Sept. 15-16, 2007
University of Louvain, Louvain-la-Neuve, Belgium

African language resources at UCLA

Description: 

Documentation of Chadic languages of Nigeria. Information on Hausa. Complete Hausa language course.

Helsinki Corpus of Swahili

Description: 

The Helsinki Corpus of Swahili (HCS) contains 12.5 million words of text from a number of current news sources as well as extracts from a large number of books. Typing errors in the texts have been corrected manually. The corpus was tagged with SALAMA without human intervention. The corpus is available free of charge for scientific research under a signed contract.
The corpus can be accessed through the web-based browser Lemmie 2.0. Direct access to the Linux server is also possible. The English glosses cannot currently be accessed through Lemmie 2.0, so users who need them may wish to use the Linux interface.
HCS does not currently have syntactic tags. In the future we wish to enrich the corpus with those tags, together with a number of new features, including a large number of idioms and multi-word expressions. New texts will also be added.

kasahorow Communication Group, Suuch Solutions

Description: 

We work on making generic tools for African languages. We are currently working on enabling web publishing for African languages.

  • African language content publishing platform (based on Drupal)
  • Virtual online keyboard for inputting the (Unicode) character set range of African languages

Some of our tools can be seen in action at www.dictionary.kasahorow.com

Project Leads:
Chris Manu
Paa Kwesi Imbeah

English - Luganda Parallel Corpus

A parallel corpus consists of the same text in two or more different languages. Word-alignment involves finding the links between the words in the two texts.  A large word-aligned corpus can be used as source material for statistical machine translation techniques and knowledge transfer techniques.

On this page, you can download a small word-aligned Luganda-English parallel corpus. It consists of 150 manually annotated sentences from the Gospel of Luke (1:1 to 3:18). The English text is from the King James Bible and the Luganda text was taken from the online Luganda Bible.

Needless to say, this is a very modest-sized corpus and cannot be used as the only dataset to bootstrap MT research. Its purpose, however, is to provide a gold-standard test set for evaluating and tuning automatic word-alignment techniques on larger English-Luganda parallel corpora.
The files were made using the UMIACS Word Alignment Interface. To visualize the parallel corpus, you will need to download this software. Further data-processing can be done immediately on the output files:

  • Luke.tok: English text
  • Lukka.tok: Luganda text
  • aligned.1 ... aligned.150: a description of the word-alignment for each of the 150 sentences.

The annotation work was done by Edina Nalukenge in the context of the OCAPI project (University of Antwerp).
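
As a rough illustration of how a gold-standard set like this might be used to evaluate an automatic aligner, the sketch below scores predicted word links against manually annotated ones using precision, recall and F1. It assumes each sentence's alignment has already been read into a set of (English token index, Luganda token index) pairs; the format of the aligned.N files is not described here, so no parser is shown, and the function name and example links are hypothetical.

```python
def alignment_scores(gold, predicted):
    """Return (precision, recall, F1) of predicted links against gold links.

    Both arguments are iterables of (english_index, luganda_index) pairs.
    This representation is an assumption made for the example; it is not
    the on-disk format of the aligned.N files.
    """
    gold = set(gold)
    predicted = set(predicted)
    correct = len(gold & predicted)  # links the aligner got right
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if correct == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical links for a single four-word sentence pair.
gold_links = {(0, 0), (1, 2), (2, 1), (3, 3)}
predicted_links = {(0, 0), (1, 2), (2, 2), (3, 3)}

precision, recall, f1 = alignment_scores(gold_links, predicted_links)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

In practice the scores would be computed over all 150 sentences, for instance by pooling the links of every sentence before calling the function.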

CTexT (Centre for Text Technology)

Description: 

On 1 June 2004, CTexT (Centre for Text Technology) started functioning as a non-profit, self-supporting unit of the Research Focus Area: Languages and Literature in the South African Context, in the Faculty of Arts at the North-West University (Potchefstroom Campus). The staff at CTexT engage in a variety of research and development activities. The Centre aims for close interaction between the two: research should, wherever possible, lead or contribute to product development, and all products should be rooted in thorough research. The four main activities of the Centre for Text Technology are the following:

  • Research (including basic research, strategic research, applied research, and market research)
  • Development (including development of sources and end-user applications & products)
  • Commercialisation of products and services
  • Maintenance of products and support to end users/clients

SU-CLaST (Centre for Language and Speech Technology at the University of Stellenbosch)

Description: 

SU-CLaST is a new interdisciplinary research centre of the Faculties of Arts and Engineering at the University of Stellenbosch (US), South Africa. It is the result of active collaboration in the field of language and speech processing over a period of more than twenty years between the Department of Electrical and Electronic Engineering and the Department of African Languages at the University of Stellenbosch.

The eXe-Files Team

Description: 

The eXe-Files Team is based in the Xhosa Department of the University of the Western Cape (UWC), in Bellville (Cape Town), South Africa. The aim of this research group is to build the very first balanced electronic corpus for the Xhosa language, to support a wide range of linguistic and language studies and products.

Online Publications Involving TshwaneDJe Members

Description: 

A collection of articles and papers on corpus and dictionary topics for the African languages, as well as on language-independent lexicography and terminology software.

Mauritian Creole Text Corpus

Description: 

Provides access to a text corpus for Mauritian Creole using a web interface. Provided by the ALLEX Project.
