OLAC Language Resource Catalog


Corpus of contemporary blogs
Title:
Corpus of contemporary blogs
Online:
Yes
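Archive:
LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University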
Contributor:
Grác, Marek (author)
Publisher:
Masaryk University, NLP Centre
Description:
At the NLP Centre, text is currently divided into sentences by a tool that uses a rule-based system. To produce enough training data for machine learning, annotators manually split the corpus of contemporary text CBB.blog (1 million tokens) into sentences. Each file contains one hundredth of the whole corpus, and all data were processed in parallel by two annotators. The corpus was created from ten contemporary blogs: hintzu.otaku.cz, modnipeklo.cz, bloc.cz, aleneprokopova.blogspot.com, blog.aktualne.cz, fuchsova.blog.onaidnes.cz, havlik.blog.idnes.cz, blog.aktualne.centrum.cz, klusak.blogspot.cz, myego.cz/welldone
Content language:
Czech
Linguistic type:
Primary text
DCMI type:
Text
Other date:
2013-02-26T13:40:06Z
Other rights:
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
Other subject:
corpus
blogs
annotation
annotators
sentences
machine learning
Other type:
corpus

Find Related Information:

Archive: LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Online: Yes
Linguistic type: Primary text
DCMI type: Text
Content language: Czech
Contributor: Grác, Marek
Publisher: Masaryk University, NLP Centre
Title: Corpus of contemporary blogs
Other date: 2013-02-26T13:40:06Z
Other rights: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
Other rights: http://creativecommons.org/licenses/by-nc-nd/3.0/
Other subject: annotation
Other subject: annotators
Other subject: blogs
Other subject: corpus
Other subject: machine learning
Other type: corpus