OLAC Language Resource Catalog

2009 NIST Language Recognition Evaluation Test Set
Title:
2009 NIST Language Recognition Evaluation Test Set
ID:
LDC2014S06
https://catalog.ldc.upenn.edu/LDC2014S06
ISBN: 1-58563-682-7
ISLRN: 180-783-854-340-4
Online:
Yes
Archive:
The LDC Corpus Catalog
Date:
2014
Publisher:
Linguistic Data Consortium
https://www.ldc.upenn.edu
Description:
*Introduction*

2009 NIST Language Recognition Evaluation Test Set contains approximately 215 hours of conversational telephone speech and radio broadcast conversation collected by the Linguistic Data Consortium (LDC) in the following 23 languages and dialects: Amharic, Bosnian, Cantonese, Creole (Haitian), Croatian, Dari, English (American), English (Indian), Farsi, French, Georgian, Hausa, Hindi, Korean, Mandarin, Pashto, Portuguese, Russian, Spanish, Turkish, Ukrainian, Urdu and Vietnamese.

The goal of the NIST (National Institute of Standards and Technology) Language Recognition Evaluation (LRE) is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. NIST conducted language recognition evaluations in 1996, 2003, 2005 and 2007. The 2009 evaluation increased the number of target languages. Most of the test data came from multilingual Voice of America (VOA) radio broadcasts assessed as being of telephone bandwidth; the remainder is conversational telephone speech. Further information regarding this evaluation can be found in the evaluation plan, which is included in the documentation for this release.

LDC has also released the following LRE corpora:

* 2003 NIST Language Recognition Evaluation (LDC2006S31)
* 2005 NIST Language Recognition Evaluation (LDC2008S05)
* 2007 NIST Language Recognition Evaluation Test Set (LDC2009S04)
* 2007 NIST Language Recognition Evaluation Supplemental Training Set (LDC2009S05)
* 2011 NIST Language Recognition Evaluation Test Set (LDC2018S06)

*Data*

The VOA speech data was collected by LDC in 2000 and 2001 and constitutes approximately 75% of the test set. The telephone speech was taken from LDC's Mixer 3 collection, recorded between 2005 and 2007. All test speech segments are presented as a sampled data stream in standard 8-bit 8-kHz μ-law format. Each segment is stored separately in a single-channel SPHERE format file.
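As an illustration of the 8-bit 8-kHz μ-law format described above, the following is a minimal Python sketch of standard G.711 μ-law decoding. It is not part of the LDC release. The function names `ulaw_to_linear` and `duration_seconds` are hypothetical, and the sketch assumes the audio payload has already been separated from the SPHERE file header, which it does not parse.

```python
def ulaw_to_linear(byte: int) -> int:
    """Decode one G.711 mu-law byte into a signed linear sample."""
    u = ~byte & 0xFF                      # mu-law bytes are stored bit-inverted
    exponent = (u >> 4) & 0x07            # 3-bit segment number
    mantissa = u & 0x0F                   # 4-bit step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if u & 0x80 else magnitude


def duration_seconds(num_samples: int, rate_hz: int = 8000) -> float:
    """Length of a segment in seconds; one mu-law byte is one sample."""
    return num_samples / rate_hz


# Assumption: `payload` stands in for raw sample bytes read from a segment
# after the SPHERE header has been skipped.
payload = bytes([0xFF, 0x80, 0x00])
samples = [ulaw_to_linear(b) for b in payload]
print(samples)
print(duration_seconds(240000))
```

Because each sample occupies one byte at 8000 samples per second, the payload size in bytes divided by 8000 gives the total segment length, which includes any non-speech portions.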
The test segments contain three nominal durations of speech: 3 seconds, 10 seconds and 30 seconds. Actual speech durations vary, but were constrained to be within the ranges of 2-4 seconds, 7-13 seconds and 23-35 seconds, respectively. Non-speech portions were retained so that each segment contains a continuous sample of the source recording. Therefore, the test segments may be significantly longer than the speech duration, depending on how much non-speech was included.

*Samples*

Please listen to this audio sample.

*Updates*

None at this time.
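The nominal duration classes given in the data description can be expressed as a small lookup. This is a hedged Python sketch, not a tool shipped with the corpus: the names `NOMINAL_RANGES` and `nominal_duration` are hypothetical, and treating the range endpoints as inclusive is an assumption, since the description only says durations fall "within" each range.

```python
from typing import Optional

# Nominal duration (seconds) -> allowed range of actual speech (seconds),
# taken from the description: 2-4, 7-13 and 23-35 seconds.
NOMINAL_RANGES = {3: (2.0, 4.0), 10: (7.0, 13.0), 30: (23.0, 35.0)}


def nominal_duration(speech_seconds: float) -> Optional[int]:
    """Return the nominal class for an actual speech duration, or None."""
    for nominal, (low, high) in NOMINAL_RANGES.items():
        # Assumption: endpoints are inclusive.
        if low <= speech_seconds <= high:
            return nominal
    return None


print(nominal_duration(9.0))
print(nominal_duration(5.0))
```

Note that the input is the amount of speech in a segment, not the file length: a file may run well beyond 35 seconds and still belong to the 30-second class.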
Content language:
Amharic
Haitian
English
French
Hindi
Spanish
Urdu
Bosnian
Croatian
Georgian
Korean
Portuguese
Turkish
Vietnamese
Yue Chinese
Dari
Persian
Hausa
Mandarin Chinese
Russian
Ukrainian
Pushto
Linguistic type:
Primary text
DCMI type:
Sound
Other format:
Sampling Rate: 8000
Sampling Format: ulaw
Distribution: Web Download
Other language:
Amharic
Haitian
English
French
Hindi
Spanish
Urdu
Bosnian
Croatian
Georgian
Korean
Portuguese
Turkish
Vietnamese
Yue Chinese
Dari
Persian
Hausa
Mandarin Chinese
Russian
Ukrainian
Pushto
Other rights:
Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Rights holder: Portions © 2000, 2001, 2005-2007, 2014 Trustees of the University of Pennsylvania

Find Related Information:

Archive: The LDC Corpus Catalog
Online: Yes
Linguistic type: Primary text
DCMI type: Sound
Content language: Amharic
Content language: Bosnian
Content language: Croatian
Content language: Dari
Content language: English
Date: 2000 and later
Date: 2010 - 2019
Contributor: Brandschain, Linda
Contributor: Graff, David
Contributor: Greenberg, Craig
Contributor: Martin, Alvin
Contributor: Walker, Kevin
Publisher: Linguistic Data Consortium
Publisher: https://www.ldc.upenn.edu
Title: 2009 NIST Language Recognition Evaluation Test Set
Other format: Distribution: Web Download
Other format: Sampling Format: ulaw
Other format: Sampling Rate: 8000
Other language: Amharic
Other language: Bosnian
Other language: Croatian
Other language: Dari
Other language: English
Other rights: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Other rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Other rights: Rights holder: Portions © 2000, 2001, 2005-2007, 2014 Trustees of the University of Pennsylvania