OLAC Language Resource Catalog

RATS Language Identification
Title:
RATS Language Identification
ID:
LDC2018S10
https://catalog.ldc.upenn.edu/LDC2018S10
ISBN: 1-58563-852-8
ISLRN: 190-505-311-077-0
Online:
Yes
Archive:
The LDC Corpus Catalog
Date:
2018
Publisher:
Linguistic Data Consortium
https://www.ldc.upenn.edu
Description:
*Introduction* RATS Language Identification was developed by the Linguistic Data Consortium (LDC) and comprises approximately 5,400 hours of Levantine Arabic, Farsi, Dari, Pashto and Urdu conversational telephone speech with annotation of speech segments. The corpus was created to provide training, development and initial test sets for the Language Identification (LID) task in the DARPA RATS (Robust Automatic Transcription of Speech) program.
The goal of the RATS program was to develop human language technology systems capable of performing speech detection, language identification, speaker identification and keyword spotting on the severely degraded audio signals typical of radio communication channels, especially those using handheld portable transceivers. To support that goal, LDC assembled a system for the transmission, reception and digital capture of audio data that allowed a single source audio signal to be distributed and recorded over eight distinct transceiver configurations simultaneously. Those configurations spanned three frequency bands (high, very high and ultra high), variously combined with amplitude modulation, frequency hopping spread spectrum, narrow-band frequency modulation, single-side-band or wide-band frequency modulation. Annotations on the clear source audio signal, e.g., time boundaries for the duration of speech activity, were projected onto the corresponding eight channels recorded from the radio receivers.
*Data* The source audio is drawn from two kinds of recordings: (1) conversational telephone speech (CTS) recordings, taken either from previous LDC CTS corpora or from CTS data collected specifically for the RATS program from native speakers of Levantine Arabic, Pashto, Urdu, Farsi and Dari; and (2) portions of Voice of America (VOA) broadcast news recordings, taken from data used in the 2009 NIST Language Recognition Evaluation (LRE). The 2009 LRE Test Set is available from LDC as LDC2014S06.
CTS recordings were audited by annotators who listened to short segments and determined whether the audio was in the target language. Annotations on the audio files include start time, end time, speech activity detection (SAD) label, SAD provenance, language ID and LID provenance.
All audio files are single-channel, 16-bit PCM, sampled at 16,000 samples per second. Lossless FLAC compression is used on all files; when uncompressed, the files have typical "MS-WAV" (RIFF) file headers. The data is divided into a training set, an initial development set and an initial evaluation set.
*Updates* None at this time.
*Acknowledgment* This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. D10PC20016. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
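The description lists six annotation fields (start time, end time, SAD label, SAD provenance, language ID, LID provenance) and says that annotations on the clear source signal were projected onto the eight recorded channels. A minimal Python sketch of how such records might be handled follows. The record does not specify the on-disk annotation layout, so the tab-separated field order, the label values (`S`, `NS`, `manual`, `urd`), the channel names `A` to `H`, and the assumption that channel recordings are time-aligned with the source are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One annotated span, using the six fields named in the description."""
    start: float          # start time in seconds
    end: float            # end time in seconds
    sad_label: str        # speech activity detection label (values hypothetical)
    sad_provenance: str   # how the SAD label was produced (values hypothetical)
    lang_id: str          # language ID (code scheme hypothetical)
    lid_provenance: str   # how the language ID was produced (values hypothetical)


def parse_segment(line: str) -> Segment:
    """Parse one record from a HYPOTHETICAL tab-separated layout."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    start, end = float(fields[0]), float(fields[1])
    if end < start:
        raise ValueError("segment ends before it starts")
    return Segment(start, end, *fields[2:])


def speech_seconds(segments, speech_label="S"):
    """Total duration of segments carrying the (assumed) speech label."""
    return sum(s.end - s.start for s in segments if s.sad_label == speech_label)


def project_to_channels(segments, channels):
    """Copy source-signal segments onto each channel.

    Assumes every channel recording is time-aligned with the source;
    any per-channel offset would have to be applied here.
    """
    return {ch: list(segments) for ch in channels}
```

For example, parsing two records and projecting them onto eight hypothetical channels `"ABCDEFGH"` yields one identical segment list per channel, which mirrors the projection step described above only under the stated alignment assumption.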
Content language:
South Levantine Arabic
North Levantine Arabic
Persian
Dari
Pushto
Urdu
Linguistic type:
Primary text
DCMI type:
Sound
Other format:
Sampling Rate: 16000
Sampling Format: pcm
Distribution: Hard Drive
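As a rough illustration of the format fields above, the following Python sketch checks that a decompressed file matches the stated format (single channel, 16-bit PCM, 16000 samples per second) and estimates the uncompressed audio payload. It assumes the FLAC files have already been decoded to RIFF/WAV by an external tool; the function names are illustrative and not part of any LDC tooling.

```python
import io
import wave


def wav_matches_spec(fileobj, channels=1, sample_width=2, sample_rate=16000):
    """Return True if the WAV header matches the format stated in the record.

    sample_width is in bytes, so 2 corresponds to 16-bit PCM.
    """
    with wave.open(fileobj, "rb") as w:
        return (
            w.getnchannels() == channels
            and w.getsampwidth() == sample_width
            and w.getframerate() == sample_rate
        )


def uncompressed_bytes(hours, sample_rate=16000, sample_width=2, channels=1):
    """Raw PCM payload size in bytes, excluding file headers."""
    return int(hours * 3600 * sample_rate * sample_width * channels)


def make_silent_wav(seconds=1, sample_rate=16000):
    """Build an in-memory mono 16-bit WAV of silence, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * (sample_rate * seconds))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(wav_matches_spec(make_silent_wav()))   # header check on a synthetic file
    print(uncompressed_bytes(5400))              # payload for 5,400 hours of audio
```

Taking the approximate 5,400-hour figure at face value, the payload works out to 5400 × 3600 × 16000 × 2 = 622,080,000,000 bytes, roughly 622 GB before FLAC compression. The record does not state the compressed size, so this is only an upper-bound style estimate.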
Other language:
South Levantine Arabic
North Levantine Arabic
Persian
Dari
Pushto
Urdu
Other rights:
Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Rights holder: Portions © 2000, 2001, 2004, 2005, 2007, 2014, 2018 Trustees of the University of Pennsylvania

Find Related Information:

Archive: The LDC Corpus Catalog
Online: Yes
Linguistic type: Primary text
DCMI type: Sound
Content language: Dari
Content language: North Levantine Arabic
Content language: Persian
Content language: Pushto
Content language: South Levantine Arabic
Date: 2000 and later
Date: 2010 - 2019
Contributor: Graff, David
Contributor: Jones, Karen
Contributor: Ma, Xiaoyi
Contributor: Strassel, Stephanie
Contributor: Walker, Kevin
Publisher: Linguistic Data Consortium
Publisher: https://www.ldc.upenn.edu
Title: RATS Language Identification
Other format: Distribution: Hard Drive
Other format: Sampling Format: pcm
Other format: Sampling Rate: 16000
Other language: Dari
Other language: North Levantine Arabic
Other language: Persian
Other language: Pushto
Other language: South Levantine Arabic
Other rights: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Other rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Other rights: Rights holder: Portions © 2000, 2001, 2004, 2005, 2007, 2014, 2018 Trustees of the University of Pennsylvania