OLAC Language Resource Catalog


Asian Spoken Language Sampler
Title:
Asian Spoken Language Sampler
ID:
LDC2010S07
https://catalog.ldc.upenn.edu/LDC2010S07
ISBN: 1-58563-559-6
ISLRN: 042-211-152-679-3
Online:
Yes
Archive:
The LDC Corpus Catalog
Date:
2010
Publisher:
Linguistic Data Consortium
https://www.ldc.upenn.edu
Description:
*Introduction*
The Linguistic Data Consortium (LDC) at the University of Pennsylvania distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected, readily available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and sharing of resources.
With the support of its members, LDC provides critical services to the language research community: maintaining the data archives, producing and distributing data via media (DVD-ROM or CD-ROM) or web downloads, negotiating intellectual property agreements with data providers, and maintaining relations with other like-minded groups around the world.
Resources available from LDC include speech, text and video data and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials. A complete, searchable catalog of LDC's publications is available online.
*Data*
The Asian Spoken Language Sampler provides speech and transcript samples from a range of corpora and is designed to illustrate the variety and breadth of the speech-related resources available from LDC's Catalog. Each data set sampled is described briefly below; further information is available from its LDC Catalog entry.
The sample files provided in this release have been modified in various ways relative to the original data as published by LDC:
* most excerpts are truncated to be much shorter than the original files; excerpt duration is typically one minute and thirty seconds
* signal amplitude has been adjusted where necessary to normalize playback volume
* some corpora are published in compressed form, but all samples here are uncompressed
* LDC frequently uses NIST SPHERE file format for audio data, but the audio files in this sampler have been converted to MS-WAV/audio (RIFF) file format for compatibility with typical browser audio utilities
Data sets included in the sampler:
* 2005 NIST Language Recognition Evaluation: The goal of the NIST Language Recognition Evaluation is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field.
* 2007 NIST Language Recognition Evaluation Test Set: The most significant differences between previous NIST evaluations and the 2007 task were the increased number of languages and dialects, the greater emphasis on a basic detection task for evaluation and the variety of evaluation conditions.
* ARL Urdu Speech Database, Training Data: The ARL Urdu Speech Database is a collection of recorded speech from 200 adult native Urdu speakers from Pakistan and Northern India.
* CALLFRIEND Farsi: A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Farsi.
* CALLFRIEND Tamil: A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Tamil.
* CALLFRIEND Vietnamese: A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Vietnamese.
* CALLHOME Japanese: A corpus of 120 unscripted telephone conversations between native Japanese speakers and a corpus of associated transcripts.
* CALLHOME Mandarin Chinese Speech: The Callhome Mandarin Chinese corpus of telephone speech consists of 120 unscripted telephone conversations between native speakers of Mandarin Chinese.
* JEIDA/JCSD-Channel 0 Mono Syllables: This collection consists of high-fidelity recordings of 150 native speakers of Japanese. Each speaker produces four repetitions of 323 short prompts, including city names, control words, monosyllabic words, isolated digits and strings of four digits. Each reading session was recorded with two microphones.
* Korean Telephone Conversations Speech and Transcripts: This publication consists of 100 telephone conversations, 49 of which were published in 1996 as Callfriend Korean, while the remaining 51 are previously unexposed calls. All 100 conversations have been transcribed.
* Mandarin Affective Speech: Mandarin Affective Speech is a database of emotional speech consisting of audio recordings and corresponding transcripts collected in 2005 at the Advance Computing and System Laboratory, Zhejiang University. The speech database was recorded by eliciting speakers to express different emotional states in response to stimuli.
* Russian through Switched Telephone Network (RuSTeN): The purpose of the project was to develop software for automatic identification of speakers based on voice samples acquired through telephone channels.
* TDT4 Multilingual Broadcast News Speech Corpus: This release contains the complete set of American English, Modern Standard Arabic and Mandarin Chinese broadcast news audio used in the 2002 and 2003 Topic Detection and Tracking technology evaluations.
* West Point Korean Speech: West Point Korean Speech is a database of digital recordings of spoken Korean. The prompt scripts were created from 20,000 distinct sentences, along with a subset of prompts designed to elicit free response answers to questions for use in domain-specific translation systems.
* Fisher Levantine Arabic: A collection of 279 Levantine Arabic telephone conversations and transcripts from speakers of several nationalities.
* Gulf Arabic Conversational Telephone Speech: Contains 975 telephone conversations from speakers across the Persian Gulf region and their transcriptions.
*How to Obtain the Sampler*
The Asian Spoken Language Sampler may be downloaded freely. The sampler is a GNU zipped (gzip-compressed) tar file, which most compression utilities will readily extract. Download size: 28 MB.
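For readers who prefer to unpack the sampler programmatically, the following is a minimal sketch, not an LDC-provided tool. It assumes only what the description states: the download is a gzip-compressed tar file containing WAV (RIFF) audio. The function names `extract_sampler` and `wav_duration_seconds` are hypothetical, and the sampler's internal directory layout is not documented here, so the code simply searches for `.wav` files after extraction. Python's standard `wave` module reads only uncompressed PCM WAV; whether every sample file uses that encoding is an assumption, so unreadable files are reported as `None`.

```python
import tarfile
import wave
from pathlib import Path


def extract_sampler(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive and return sorted paths of .wav files found."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    dest_resolved = dest.resolve()
    with tarfile.open(archive_path, mode="r:gz") as tar:
        # Refuse members that would be written outside the destination directory.
        for member in tar.getmembers():
            target = (dest_resolved / member.name).resolve()
            if target != dest_resolved and dest_resolved not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest)
    return sorted(dest.rglob("*.wav"))


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds, or None if it cannot be read."""
    try:
        with wave.open(str(path), "rb") as wav:
            return wav.getnframes() / wav.getframerate()
    except wave.Error:
        # Assumption: non-PCM encodings may be present; the wave module rejects them.
        return None
```

A call such as `extract_sampler("sampler.tar.gz", "sampler_out")`, where both names are placeholders for the actual download and a local directory, returns the extracted WAV paths. Passing each path to `wav_duration_seconds` gives a quick check against the typical excerpt length of about ninety seconds noted above.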
Content language:
Yue Chinese
Vietnamese
Urdu
Tamil
Russian
Korean
Japanese
Hindi
Persian
Mandarin Chinese
North Levantine Arabic
South Levantine Arabic
Gulf Arabic
Dari
Iranian Persian
Linguistic type:
Primary text
Other format:
Distribution: Web Download
Other language:
Yue Chinese
Vietnamese
Urdu
Tamil
Russian
Korean
Japanese
Hindi
Persian
Mandarin Chinese
North Levantine Arabic
South Levantine Arabic
Gulf Arabic
Dari
Iranian Persian
Other rights:
Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Rights holder: Portions © 2010 Trustees of the University of Pennsylvania
