OLAC Language Resource Catalog

Plaintext Wikipedia dump 2018
Title:
Plaintext Wikipedia dump 2018
Link to the object:
Online:
Yes
Archive:
Contributor:
Rosa, Rudolf (author)
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Description:
Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018. The data come from all Wikipedias for which dumps could be downloaded at [
This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages). For a list of all the Wikipedias, see [
The script that can be used to fetch a new version of the data is included. Note, however, that Wikipedia limits the download speed when many dumps are downloaded, so downloading all of them takes a few days (although one or a few can be downloaded quickly). Also, the format of the dumps changes from time to time, so the script will probably stop working eventually. The WikiExtractor tool [
used to extract text from the Wikipedia dumps is not mine; I only modified it slightly to produce plaintext outputs [
Content language:
Abkhazian
Achinese
Adyghe
Afrikaans
Akan
Tosk Albanian
Amharic
Old English (ca. 450-1100)
Arabic
Official Aramaic (700-300 BCE)
Aragonese
Egyptian Arabic
Assamese
Asturian
Atikamekw
Avaric
Aymara
South Azerbaijani
Azerbaijani
Bashkir
Bambara
Bavarian
Central Bikol
Belarusian
Bengali
Bislama
Banjar
Tibetan
Bosnian
Bishnupriya
Breton
Buginese
Bulgarian
Russia Buriat
Catalan
Min Dong Chinese
Cebuano
Czech
Chamorro
Chechen
Cherokee
Church Slavic
Chuvash
Cheyenne
Central Kurdish
Cornish
Corsican
Cree
Crimean Tatar
Kashubian
Welsh
Danish
German
Dinka
Dimli (individual language)
Dhivehi
Lower Sorbian
Dzongkha
Modern Greek (1453-)
English
Esperanto
Estonian
Basque
Ewe
Extremaduran
Faroese
Persian
Fijian
Finnish
French
Arpitan
Northern Frisian
Western Frisian
Fulah
Friulian
Gagauz
Gan Chinese
Scottish Gaelic
Irish
Galician
Gilaki
Manx
Goan Konkani
Gothic
Guarani
Gujarati
Hakka Chinese
Haitian
Hausa
Hawaiian
Serbo-Croatian
Hebrew
Herero
Fiji Hindi
Hindi
Hiri Motu
Croatian
Upper Sorbian
Hungarian
Armenian
Igbo
Ido
Inuktitut
Interlingue
Iloko
Interlingua (International Auxiliary Language Association)
Indonesian
Inupiaq
Icelandic
Italian
Jamaican Creole English
Javanese
Lojban
Japanese
Kara-Kalpak
Kabyle
Kalaallisut
Kannada
Kashmiri
Georgian
Kanuri
Kazakh
Kabardian
Kabiyè
Central Khmer
Kikuyu
Kinyarwanda
Kirghiz
Komi-Permyak
Komi
Kongo
Korean
Karachay-Balkar
Kölsch
Kurdish
Ladino
Lao
Latin
Latvian
Lak
Lezghian
Ligurian
Limburgan
Lingala
Lithuanian
Lombard
Northern Luri
Latgalian
Luxembourgish
Ganda
Literary Chinese
Marshallese
Maithili
Malayalam
Marathi
Moksha
Eastern Mari
Minangkabau
Macedonian
Malagasy
Maltese
Mongolian
Maori
Western Mari
Malay (macrolanguage)
Creek
Mirandese
Burmese
Erzya
Mazanderani
Min Nan Chinese
Neapolitan
Nauru
Navajo
Ndonga
Low German
Nepali (macrolanguage)
Newari
Dutch
Norwegian Nynorsk
Norwegian
Novial
Pedi
Nyanja
Occitan (post 1500)
Livvi
Oriya (macrolanguage)
Oromo
Ossetian
Pangasinan
Pampanga
Panjabi
Papiamento
Picard
Pennsylvania German
Pfaelzisch
Pitcairn-Norfolk
Pali
Piemontese
Western Panjabi
Pontic
Polish
Portuguese
Pushto
Quechua
Vlax Romani
Romansh
Romanian
Rusyn
Rundi
Macedo-Romanian
Russian
Sango
Yakut
Sanskrit
Sicilian
Scots
Samogitian
Sinhala
Slovak
Slovenian
Northern Sami
Samoan
Shona
Sindhi
Somali
Southern Sotho
Spanish
Albanian
Sardinian
Sranan Tongo
Serbian
Swati
Saterfriesisch
Sundanese
Swahili (macrolanguage)
Swedish
Silesian
Tahitian
Tamil
Tatar
Tulu
Telugu
Tama (Colombia)
Tetum
Tajik
Tagalog
Thai
Tigrinya
Tonga (Tonga Islands)
Tok Pisin
Tswana
Tsonga
Turkmen
Tumbuka
Turkish
Twi
Tuvinian
Udmurt
Uighur
Ukrainian
Urdu
Uzbek
Venetian
Venda
Veps
Vietnamese
Vlaams
Volapük
Võro
Waray (Philippines)
Walloon
Wolof
Wu Chinese
Kalmyk
Xhosa
Mingrelian
Yiddish
Yoruba
Yue Chinese
Zeeuws
Zhuang
Chinese
Zulu
Linguistic type:
Primary text
DCMI type:
Text
Other date:
2018-05-09T09:25:05Z
Other rights:
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
http://creativecommons.org/licenses/by-sa/3.0/
Other subject:
Wikipedia
text corpora
monolingual corpus
Other type:
corpus
Complete OLAC record:
Link for this page:

Find Related Information:

Archive: LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University