OLAC Language Resource Catalog

ECI Multilingual Text
Title:
ECI Multilingual Text
ID:
LDC94T5
https://catalog.ldc.upenn.edu/LDC94T5
ISBN: 1-58563-033-0
ISLRN: 511-168-567-582-5
Online:
Yes
Archive:
The LDC Corpus Catalog
Date:
1994
Publisher:
Linguistic Data Consortium
https://www.ldc.upenn.edu
Description:
The first release of the European Corpus Initiative, the Multilingual Corpus 1 (ECI/MCI), has 46 subcorpora in 27 (mainly European) languages, totaling roughly 92 million (lexical) words. The corpora are marked up in TEI P2 conformant SGML (to varying levels of detail), with easy access to the source text without markup. Twelve of the component corpora are multilingual parallel corpora, each with two to nine sub-corpora. All the alphabetic corpora (there is also some Japanese and Chinese) are encoded in the ISO LATIN family of 8-bit character sets (ISO 8859-1, -5 and -7). The CD-ROM is in High Sierra format (ISO 9660), readable on at least UNIX, MS-DOS and Apple systems. The amount of material per language varies from about 36 million words (German) to about 5 thousand words (Bulgarian). The majority of sources are journalistic in nature (newspapers, magazines, broadcasts); additional sources include dictionaries (Albanian, Gaelic, Turkish, Japanese/English), literature, technical reports, and proceedings or publications of international organizations. The table below lists the languages included, the subcorpus numbers for each language (in parentheses) and the amount of data per language in thousands of lexical words.
Language: (Subcorpus #) Kwords, ... Total
German: (70) 34291, (09) 191, (65) 20, (28) 187, (29) 59, (30) 76, (47) 24, (59) 50, (71) 21, (70A) 999. Total: 35918
French: (31) 4775, (04) 4121, (28) 187, (29) 59, (30) 76, (47) 24, (51) 6, (59) 50, (71) 21, (32) 1667. Total: 10986
Spanish: (31) 4500, (13) 830, (14) 1041, (15) 447, (47) 24, (32) 1667, (59) 50, (71) 21. Total: 8580
English: (31) 4222, (36) 1141, (74) 95, (28) 187, (47) 24, (51) 6, (56) 97, (59) 50, (71) 21, (32) 1667. Total: 7510
Dutch: (03) 5500, (02) 600, (47) 24, (71) 21. Total: 6145
Czech: (44) 4726. Total: 4726
Italian: (11) 3518, (42) 303, (58) 13, (29) 59, (30) 76, (47) 24, (71) 21. Total: 4014
Chinese: (78) 2895. Total: 2895
Greek: (10) 2515, (47) 24, (59) 50, (71) 21. Total: 2610
Norwegian: (41) 2226. Total: 2226
Swedish: (37) 1718. Total: 1718
Serb/Croat/Slov: (24) 700, (56) 289. Total: 989
Tibetan: (76) 834. Total: 834
Portuguese: (60) 675, (47) 24, (71) 21. Total: 720
Malay: (80) 563. Total: 563
Russian: (73) 364. Total: 364
Japanese: (57) 203. Total: 203
Turkish: (20) 173, (20A) 110. Total: 283
Albanian: (82) 205. Total: 205
Gaelic: (55) 141. Total: 141
Estonian: (39) 100. Total: 100
Uzbek: (81) 88. Total: 88
Latin: (74) 75. Total: 75
Danish: (47) 24, (71) 21. Total: 45
Lithuanian: (89) 20. Total: 20
Bulgarian: (84) 5. Total: 5
Total: 91969
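Because the description names three 8-bit character sets (ISO 8859-1, -5 and -7), text read from the corpus has to be decoded with the right one before use. The following is a minimal Python sketch of that step, not a documented procedure for this corpus. The record does not say which subcorpus uses which character set, so the language-to-codec mapping, the `ASSUMED_CODECS` name and the `decode_sample` helper are all hypothetical. Only the Python codec names themselves are standard.

```python
# Hypothetical mapping from language to ISO 8859 part. The catalog record
# names the three character sets but not which subcorpus uses which one,
# so these assignments are assumptions made for illustration only.
ASSUMED_CODECS = {
    "french": "iso8859_1",   # ISO 8859-1, Latin alphabet (Western European)
    "russian": "iso8859_5",  # ISO 8859-5, Latin/Cyrillic
    "greek": "iso8859_7",    # ISO 8859-7, Latin/Greek
}


def decode_sample(raw: bytes, language: str) -> str:
    """Decode raw 8-bit corpus bytes using the codec assumed for a language."""
    codec = ASSUMED_CODECS[language]
    return raw.decode(codec)


if __name__ == "__main__":
    # The same byte means different characters under different parts of
    # ISO 8859, which is why the codec has to match the source text.
    same_byte = b"\xe1"
    for language in ("french", "russian", "greek"):
        print(language, decode_sample(same_byte, language))
```

Decoding with the wrong part of ISO 8859 does not raise an error for most bytes; it silently yields the wrong characters, so it is worth checking a known sample from each subcorpus before bulk conversion.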
Content language:
Turkish
Swedish
Slovenian
Russian
Portuguese
Norwegian
Norwegian Bokmål
Norwegian Nynorsk
Lithuanian
Latin
Japanese
Scottish Gaelic
French
Estonian
English
Modern Greek (1453-)
German
Danish
Bulgarian
Tosk Albanian
Standard Malay
Spanish
Serbian
Northern Uzbek
Mandarin Chinese
Italian
Dutch
Czech
Croatian
Albanian
Linguistic type:
Primary text
DCMI type:
Text
Other format:
Distribution: Web Download
Other language:
Turkish
Swedish
Slovenian
Russian
Portuguese
Norwegian
Norwegian Bokmål
Norwegian Nynorsk
Lithuanian
Latin
Japanese
Scottish Gaelic
French
Estonian
English
Modern Greek (1453-)
German
Danish
Bulgarian
Tosk Albanian
Standard Malay
Spanish
Serbian
Northern Uzbek
Mandarin Chinese
Italian
Dutch
Czech
Croatian
Albanian
Other rights:
Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
ECI/MCI Agreement: https://catalog.ldc.upenn.edu/license/eci-slash-mci-user-agreement.pdf
Le Monde Material User Agreement: https://catalog.ldc.upenn.edu/license/le-monde-material-user-agreement.pdf
Rights holder: Portions © 1994 Trustees of the University of Pennsylvania

Find Related Information:

Archive: The LDC Corpus Catalog
Online: Yes
Linguistic type: Primary text
DCMI type: Text
Content language: Albanian
Content language: Bulgarian
Content language: Croatian
Content language: Czech
Content language: Danish
Date: 1950 - 1999
Date: 1990 - 1999
Contributor: Linguistic Data Consortium
Publisher: Linguistic Data Consortium
Publisher: https://www.ldc.upenn.edu
Title: ECI Multilingual Text
Other format: Distribution: Web Download
Other language: Albanian
Other language: Bulgarian
Other language: Croatian
Other language: Czech
Other language: Danish
Other rights: ECI/MCI Agreement: https://catalog.ldc.upenn.edu/license/eci-slash-mci-user-agreement.pdf
Other rights: Le Monde Material User Agreement: https://catalog.ldc.upenn.edu/license/le-monde-material-user-agreement.pdf
Other rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Other rights: Rights holder: Portions © 1994 Trustees of the University of Pennsylvania