OLAC Language Resource Catalog

Web 1T 5-gram, 10 European Languages Version 1
Title:
Web 1T 5-gram, 10 European Languages Version 1
ID:
LDC2009T25
https://catalog.ldc.upenn.edu/LDC2009T25
ISBN: 1-58563-525-1
ISLRN: 930-499-840-946-0
Online:
Yes
Archive:
The LDC Corpus Catalog
Date:
2009
Publisher:
Linguistic Data Consortium
https://www.ldc.upenn.edu
Description:
*Introduction*
Web 1T 5-gram, 10 European Languages Version 1 was created by Google, Inc. It consists of word n-grams and their observed frequency counts for ten European languages: Czech, Dutch, French, German, Italian, Polish, Portuguese, Romanian, Spanish and Swedish. The n-grams range in length from unigrams (single words) to five-grams.
The n-gram counts were generated from approximately one hundred billion word tokens of text for each language, or approximately one trillion tokens in total. The n-grams were extracted from publicly accessible web pages from October 2008 to December 2008. The data set contains only n-grams that appeared at least 40 times in the processed sentences; less frequent n-grams were discarded. Although the aim was to identify and collect pages in the target languages only, the final data likely contains some text from other languages.
This data set is useful for statistical language modeling, with applications including machine translation, speech recognition and other uses.
*Data*
The input encoding of documents was automatically detected, and all text was converted to UTF-8. The following statistics cover the entire release.
File sizes (entire corpus): approximately 27.9 GB of bzip2-compressed text files
Total number of tokens: 1,306,807,412,486
Total number of sentences: 150,727,365,731
Total number of unigrams: 95,998,281
Total number of bigrams: 646,439,858
Total number of trigrams: 1,312,972,925
Total number of fourgrams: 1,396,154,236
Total number of fivegrams: 1,149,361,413
Total number of n-grams: 4,600,926,713
*Samples*
For an example of the data in this corpus, please examine this sample file.
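The counting scheme described above (n-grams of length one to five, kept only if they occur at least 40 times) can be illustrated with a minimal Python sketch. This is not the pipeline Google used: the function name `count_ngrams`, its parameters, and the whitespace tokenization are assumptions made purely for illustration, and the record does not describe how the corpus was actually tokenized.

```python
from collections import Counter


def count_ngrams(sentences, max_n=5, min_count=40):
    """Count word n-grams (n = 1..max_n) and drop the infrequent ones.

    Hypothetical helper: whitespace tokenization is an assumption,
    not the tokenization used to build the corpus.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            # Slide a window of width n across the sentence.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    # Keep only n-grams seen at least min_count times.
    return {ngram: c for ngram, c in counts.items() if c >= min_count}


# Toy input: one sentence repeated 40 times, another 39 times.
kept = count_ngrams(["a b c"] * 40 + ["a b d"] * 39)
print(("a", "b", "c") in kept)  # trigram seen 40 times
print(("a", "b", "d") in kept)  # trigram seen only 39 times
```

In this toy input the trigram "a b c" meets the threshold and is kept, while "a b d" falls one occurrence short and is discarded, mirroring the cutoff stated in the description.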
Content language:
Swedish
Spanish
Romanian
Portuguese
Polish
Dutch
Italian
French
German
Czech
Linguistic type:
Primary text
DCMI type:
Text
Other format:
Distribution: Web Download
Other rights:
Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Web 1T 5-gram, 10 European Languages Version 1 Agreement: https://catalog.ldc.upenn.edu/license/web-1t-5-gram-10-european-languages-version-1.pdf
Rights holder: Portions © 2009 Google Inc., © 2009 Trustees of the University of Pennsylvania

Find Related Information:

Archive: The LDC Corpus Catalog
Contributor: Brants, Thorsten
Contributor: Franz, Alex