OLAC Language Resource Catalog

The EMILLE/CIIL Corpus
Title:
The EMILLE/CIIL Corpus
ID:
ELRA-W0037
Online:
Yes
Archive:
ELRA Catalogue of Language Resources
Date:
2004-09-15
Publisher:
ELRA (European Language Resources Association)
Description:
Written Corpora
The EMILLE/CIIL Corpus consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual corpora, one for each of fourteen South Asian languages, containing written and (for some languages) spoken data: Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu. The EMILLE monolingual corpora contain approximately 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. The annotated component includes the Urdu monolingual and parallel corpora automatically annotated for parts-of-speech, together with twenty written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode. References: Xiao, Z., McEnery, A., Baker, P. and Hardie, A. 2004. "Developing Asian language corpora: standards and practice" in Sornlertlamvanich, V., Tokunaga, T. and Huang, C. (eds.) Proceedings of the Fourth Workshop on Asian Language Resources, pp. 1-8. March 25, Sanya. This database is available for research use by academic organisations only. For use by commercial organisations, a subset of the EMILLE/CIIL Corpus is available under the reference ELRA-W0038 (The EMILLE Lancaster Corpus).
The EMILLE/CIIL Corpus consists of monolingual corpora containing approximately 92,799,000 words for 14 South Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu), including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu, and a parallel corpus of 200,000 words in English with translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. Annotations include Urdu monolingual and parallel corpora automatically annotated for parts-of-speech, and 20 written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode. This database is available for research use by academic organisations only. For use by commercial organisations, a subset of the EMILLE/CIIL Corpus is available under the reference ELRA-W0038 (The EMILLE Lancaster Corpus).
Content language:
Urdu
Telugu
Tamil
Sinhala
Panjabi
Oriya (macrolanguage)
Marathi
Malayalam
Kashmiri
Kannada
Hindi
Gujarati
Bengali
Assamese
Linguistic type:
Primary text
DCMI type:
Text
Other language:
Urdu
Telugu
Tamil
Sinhalese
Panjabi, Punjabi
Oriya
Marathi
Malayalam
Kashmiri
Kannada
Hindi
Gujarati
Bengali
Assamese

Find Related Information:

Archive: ELRA Catalogue of Language Resources