OLAC Language Resource Catalog

NIST 2009 Open Machine Translation (OpenMT) Evaluation
Title:
NIST 2009 Open Machine Translation (OpenMT) Evaluation
ID:
LDC2010T23
https://catalog.ldc.upenn.edu/LDC2010T23
ISBN: 1-58563-570-7
ISLRN: 264-294-098-796-0
Online:
Yes
Archive:
The LDC Corpus Catalog
Date:
2010
Publisher:
Linguistic Data Consortium
https://www.ldc.upenn.edu
Description:
*Introduction*
NIST 2009 Open Machine Translation (OpenMT) Evaluation contains source data, reference translations and scoring software used in the NIST 2009 OpenMT evaluation. It is designed to help evaluate the effectiveness of machine translation systems. The package was compiled, and the scoring software developed, by researchers at NIST, making use of broadcast, newswire and web data and reference translations collected and developed by the Linguistic Data Consortium (LDC).

The objective of the NIST Open Machine Translation (OpenMT) evaluation series is to support research in, and help advance the state of the art of, machine translation (MT) technologies, that is, technologies that translate text between human languages. Input may include all forms of text. The goal is for the output to be an adequate and fluent translation of the original.

The MT evaluation series started in 2001 as part of the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) program. Beginning with the 2006 evaluation, the evaluations have been driven and coordinated by NIST as NIST OpenMT. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities in MT. The OpenMT evaluations are intended to be of interest to all researchers working on the general problem of automatic translation between human languages. To this end, they are designed to be simple, to focus on core technology issues and to be fully supported.

The 2009 task was to evaluate translation from Arabic to English and from Urdu to English. Additional information about these evaluations may be found at the NIST Open Machine Translation (OpenMT) Evaluation web site.

*Scoring Tools*
This evaluation kit includes a single Perl script (mteval-v11b.pl) that may be used to produce a translation quality score for one or more MT systems. The script works by comparing the system output translation with a set of expert reference translations of the same source text. Comparison is based on finding sequences of words in the reference translations that match word sequences in the system output translation. More information on the evaluation algorithm may be obtained from the paper detailing it: BLEU: a Method for Automatic Evaluation of Machine Translation (Papineni et al., 2002).

The included scoring script is intended for use with SGML-formatted data files. An updated scoring software package (mteval-v13a-20091001.tar.gz), with XML support, additional options and bug fixes, documentation, and example translations, may be downloaded from the NIST Multimodal Information Group Tools website.

*Data*
This release contains 373 documents with corresponding sets of four separate human expert reference translations. The source data consists of Arabic and Urdu broadcast, newswire and weblog data collected by LDC in 2007 and 2009. The newswire and broadcast material is from Asharq Al-Awsat (Arabic), Agence France-Presse (Arabic), Al-Ahram (Arabic), Al Hayat (Arabic), Assabah (Arabic), An Nahar (Arabic), Al-Quds Al-Arabi (Arabic), Xinhua News Agency (Arabic), British Broadcasting Corporation (Urdu), Deutsche Welle (Urdu), Mehr News Agency (Urdu) and Voice of America (Urdu).

For each language, the test set consists of two files: a source file and a reference file. The file name reflects the evaluation year, the source language, the test set (which, by default, is evalset), the version of the data, and whether the file is a source or a reference file (the latter being indicated by -ref). A reference file contains four independent reference translations of the data set unless noted otherwise in the accompanying README.txt.

DARPA TIDES MT and NIST OpenMT evaluations used SGML-formatted test data until 2008 and XML-formatted test data thereafter. The files in this package are provided in both formats.

*Samples*
Please view this sample.

*Updates*
Additional information, updates and bug fixes may be available in the LDC catalog entry for this corpus at LDC2010T23.
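The word-sequence matching described under *Scoring Tools* can be illustrated with a minimal Python sketch of the clipped ("modified") n-gram precision from the BLEU paper. This is not the code in mteval-v11b.pl: the function names and the example sentences are hypothetical, tokenization is a plain whitespace split, and the brevity penalty and the combination of several n-gram orders into a single score are left out.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references.

    Each candidate n-gram is credited at most as many times as it occurs
    in the single reference where it is most frequent.
    """
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # Highest count of each n-gram seen in any one reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# Hypothetical example sentences, not taken from the corpus.
candidate = "the cat sat on the mat".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
p1 = modified_precision(candidate, references, 1)  # unigram precision
p2 = modified_precision(candidate, references, 2)  # bigram precision
```

Clipping is what keeps a system from being rewarded for repeating a common word: an output consisting only of "the" repeated seven times would be credited for at most two matches against these references, because no single reference contains the word more than twice.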
Content language:
Urdu
Arabic
Linguistic type:
Primary text
DCMI type:
Text
Other format:
Distribution: Web Download
Other language:
Urdu
Arabic
Other rights:
Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Rights holder: Portions © 2007 Agence France Presse, © 2007 Al-Ahram, © 2007 Al Hayat, © 2007 Al Quds - Al Arabi, © 2007 An Nahar, © 2007 Asharq Al-Awsat, © 2007 Assabah, © 2009 BBC, © 2009 DW, © 2009 Mehr News Agency, © 2007 Xinhua News Agency, © 2010 Trustees of the University of Pennsylvania

Find Related Information:

Archive: The LDC Corpus Catalog
Online: Yes
Linguistic type: Primary text
DCMI type: Text
Content language: Arabic
Content language: Urdu
Date: 2000 and later
Date: 2010 - 2019
Contributor: NIST Multimodal Information Group
Publisher: Linguistic Data Consortium
Publisher: https://www.ldc.upenn.edu
Title: NIST 2009 Open Machine Translation (OpenMT) Evaluation
Other format: Distribution: Web Download
Other language: Arabic
Other language: Urdu
Other rights: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Other rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Other rights: Rights holder: Portions © 2007 Agence France Presse, © 2007 Al-Ahram, © 2007 Al Hayat, © 2007 Al Quds - Al Arabi, © 2007 An Nahar, © 2007 Asharq Al-Awsat, © 2007 Assabah, © 2009 BBC, © 2009 DW, © 2009 Mehr News Agency, © 2007 Xinhua News Agency, © 2010 Trustees of the University of Pennsylvania