Large, curated, web-crawled corpora are crucial for training language models such as GPT and XLM-RoBERTa, yet the quality of these corpora has received little attention. A study compared four major web-crawled corpora across eleven European languages. MaCoCu and OSCAR were judged to have the highest quality, while CC100, despite lower measured quality, performed best in downstream tasks. The study concluded that measured corpus quality did not translate into significant differences when training language models.
Key insights distilled from arxiv.org, by Rik ..., 03-14-2024: https://arxiv.org/pdf/2403.08693.pdf