Core Concepts
Creation of comparable web corpora for South Slavic languages enriched with linguistic and genre annotation.
Abstract
Introduces a collection of highly comparable web corpora for South Slavic languages.
Covers Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian.
Total of 13 billion tokens from 26 million documents.
Comparability is ensured by a comparable crawling setup and identical crawling and post-processing technology.
Linguistically annotated with the CLASSLA-Stanza pipeline.
Enriched with document-level genre information via the X-GENRE classifier.
Genre analysis shows a largely consistent genre distribution across the corpora.
The share of news content varies with the economic strength of each language community.
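
As a rough illustration of what comparing genre distributions across corpora involves, the sketch below computes each genre's share of documents in a corpus and the largest difference in share between two corpora. It is a minimal sketch, not the authors' analysis code: the function names `genre_shares` and `max_share_gap` are hypothetical, the label strings are placeholders, and the input lists are made-up toy data rather than figures from the paper.

```python
from collections import Counter


def genre_shares(labels):
    """Return the relative frequency of each genre label in one corpus."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {genre: n / total for genre, n in counts.items()}


def max_share_gap(shares_a, shares_b):
    """Largest absolute difference in genre share between two corpora."""
    genres = set(shares_a) | set(shares_b)
    return max(
        (abs(shares_a.get(g, 0.0) - shares_b.get(g, 0.0)) for g in genres),
        default=0.0,
    )


# Toy, made-up labels purely for illustration; NOT data from the paper.
toy_corpus_a = ["News", "News", "Information", "Opinion"]
toy_corpus_b = ["News", "Information", "Information", "Promotion"]

shares_a = genre_shares(toy_corpus_a)
shares_b = genre_shares(toy_corpus_b)
gap = max_share_gap(shares_a, shares_b)

print(shares_a)
print(shares_b)
print(gap)
```

A small gap for every pair of corpora would correspond to the "consistent distribution" described above, while a large gap concentrated in one label (such as news) would point to the kind of variation the summary attributes to differences between language communities.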
Stats
This paper presents a collection of highly comparable web corpora covering the whole spectrum of official languages in the South Slavic language space. The total corpus comprises 13 billion tokens from 26 million documents.
Quotes
"The comparability of the corpora is ensured by a comparable crawling setup and the usage of identical crawling and post-processing technology."
"All the corpora were linguistically annotated with the state-of-the-art CLASSLA-Stanza linguistic processing pipeline."