Core Concept
The quality of web-crawled corpora, as judged by humans, does not appear to play a significant role in language model training.
Summary
Large, curated, web-crawled corpora are crucial for training language models such as GPT and XLM-RoBERTa, yet the quality of these corpora has received little attention. A study compared four major web-crawled corpora (CC100, mC4, OSCAR, and MaCoCu) across eleven European languages. MaCoCu and OSCAR were judged to have the highest quality, while CC100, despite its lower quality, performed best on downstream tasks. The study concluded that data set quality, as judged by humans, does not appear to play a significant role in training language models.
Statistics
The CC100 corpus covers 100 languages, with sizes ranging from 55.6 billion tokens for English down to 10 million tokens for Sundanese.
The mC4 corpus contains approximately 6.3 trillion tokens across 101 languages.
The OSCAR corpus is built from recent Common Crawl dumps and covers 152 languages.
The MaCoCu corpus comprises about 17.3 billion tokens across 11 low-resourced European languages.
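As a rough illustration of the scale figures above, the following Python sketch streams a small sample of two of these corpora from the Hugging Face Hub and estimates token counts with a naive whitespace tokenizer. This is not part of the study; the dataset IDs, configuration names, the language code ("mk" for Macedonian, one of the MaCoCu languages), and the "text" field are assumptions based on public Hub listings, and newer versions of the datasets library may require trust_remote_code=True or updated repository names.

# Minimal sketch: approximate token counts from streamed corpus samples.
from itertools import islice

from datasets import load_dataset  # pip install datasets


def approx_tokens(stream, n_docs=1000):
    """Count whitespace-separated tokens in the first n_docs documents."""
    return sum(len(doc["text"].split()) for doc in islice(stream, n_docs))


# Assumed Hub identifiers and config names; verify against the current Hub catalogue.
cc100_mk = load_dataset("cc100", lang="mk", split="train", streaming=True)
oscar_mk = load_dataset("oscar", "unshuffled_deduplicated_mk",
                        split="train", streaming=True)

print("CC100 (mk) sample tokens:", approx_tokens(cc100_mk))
print("OSCAR (mk) sample tokens:", approx_tokens(oscar_mk))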
Quotes
"Quality of web-crawled corpora does not seem to play a significant role when training LMs."
"We conclude that data set quality (as judged by humans) of web-crawled corpora does not seem to play a significant role in training language models."