OLAC Language Resource Catalog

MLCC Multilingual and Parallel Corpora
Title:
MLCC Multilingual and Parallel Corpora
ID:
ELRA-W0023
Online:
Yes
Archive:
ELRA Catalogue of Language Resources
Date:
1996-09-01
Publisher:
ELRA (European Language Resources Association)
Description:
Written Corpora
The MLCC text corpus has two main components: one set to allow comparable studies to be carried out in different languages, and one set to serve as the basis for translation studies.

The first set is referred to as the Polylingual Document Collection, a collection of articles from financial newspapers in 6 languages (Dutch, English, French, German, Italian and Spanish). It consists of the following sub-corpora:

Dutch - Het Financieele Dagblad - 1992-1993 (Samples)
The corpus contains articles from the Dutch financial newspaper Het Financieele Dagblad, editions of 2nd January 1992 through to 24th December 1993. It contains around 8.5 million words of text.

English - The Financial Times - 1993 (Samples)
The corpus contains articles from the British financial newspaper The Financial Times, editions from the year 1993. The corpus contains around 30 million words.

French - Le Monde - 1992-1993 (Samples)
A corpus of articles from the French newspaper Le Monde, consisting of two years' worth (1992-1993) of articles on financial subjects, approximately 10 million words.

German - Handelsblatt - 1986-1988 (Samples)
This subcorpus consists of articles from the period 02.01.1986 to 15.06.1988. It contains some 33 million words. It may be possible to obtain more recent articles from Handelsblatt.

Italian - Il Sole 24 Ore - 1992-1993 (Samples)
The corpus described here contains articles from the Italian financial newspaper Il Sole 24 Ore from the year 1992. This corpus contains some 1.88 million words. The SGML markup was done by the University of Edinburgh.

Spanish - Expansion - 1994 (Samples)
This subcorpus contains articles from the Spanish financial newspaper Expansion, editions from 21.10.1991 to 24.10.1991 and 14.05.1994 to 27.12.1994. It contains some 10 million words.

The second set is a Multilingual Parallel Corpus consisting of translated data in nine European languages: Danish, Dutch, English, French, German, Greek, Italian, Portuguese and Spanish.
The parallel data, provided by the European Commission, comprises two sub-corpora from the Official Journal of the European Communities:

Official Journal of the European Commission, C Series: Written Questions 1993
Records of questions and answers regarding European Community matters. The data is regularly published as one section of the C Series of the Official Journal of the European Community in all official languages (previously nine). This corpus contains written questions asked by members of the European Parliament and the corresponding answers from the European Commission in 9 parallel versions. The total size of the corpus is approximately 10.2 million words (ca. 1.1 million words per language).

Official Journal of the European Commission, Annex: Debates of the European Parliament 1992-1994
This parallel corpus consists of the records of Parliamentary sittings published as an annex to the Official Journal of the European Community, Debates of the European Parliament. The Parliamentary Debates are a record of what was said by members of the meeting as well as written input provided to the meeting. The original data from which the translations are produced consist of a transcript of the sittings, each member speaking in the language of his choice. The final version consists of nine parallel versions of the material. The texts delivered comprise the Debates of Parliament from January 1992 to July 1994. This sub-corpus contains some 5 to 8 million words per language.
The first set contains articles from 6 European newspapers: Het Financieele Dagblad (Dutch, 8.5 million words), The Financial Times (English, 30 million words), Le Monde (French, 10 million words), Handelsblatt (German, 33 million words), Il Sole 24 Ore (Italian, 1.88 million words) and Expansion (Spanish, 10 million words). The second set consists of a parallel corpus of translated data in the nine European official languages (1992-1994), divided into 2 sub-corpora: written questions (10.2 million words) and parliamentary debates (5 to 8 million words per language).
Content language:
Dutch
English
German
French
Italian
Spanish
Linguistic type:
Primary text
DCMI type:
Text
Other coverage:
1986-1994
Other language:
Dutch, Flemish
English
German
French
Italian
Spanish, Castilian
Other rights:
Rights available for: Research Use
Archive: ELRA Catalogue of Language Resources