Alrashed, S., Khizbullin, D., & Pugh, D. R. (2024). Fineweb-Edu-Ar: Machine-translated Corpus to Support Arabic Small Language Models. arXiv preprint arXiv:2411.06402.
This paper introduces a new large-scale machine-translated Arabic dataset, FineWeb-Edu-Ar, to address the scarcity of high-quality Arabic text data for training small language models (SLMs).
The authors machine-translated into Arabic the English FineWeb-Edu dataset, a quality-focused corpus used to train the successful English SLM SmolLM. They evaluated 12 different machine translation models, including encoder-decoder and decoder-only transformers, using an LLM-as-a-Judge approach with GPT-4o to assess translation quality along the dimensions of accuracy, grammar, fluency, and style. The nllb-200-distilled-600M model was selected for its balance of translation quality and computational efficiency. The corpus was then translated by splitting documents into non-overlapping sliding windows, which minimizes padding tokens and makes good use of flash_attention_2; a sketch of this step is shown below.
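The following is a minimal, hypothetical sketch of that non-overlapping windowed translation step using the Hugging Face transformers API for facebook/nllb-200-distilled-600M. The window size, precision, and decoding settings are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch only: chunk English documents into fixed-size, non-overlapping windows
# and translate each window to Arabic with nllb-200-distilled-600M.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "facebook/nllb-200-distilled-600M"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # attention backend mentioned in the paper
).to("cuda")

MAX_SRC_TOKENS = 256  # assumed window size; non-overlapping windows keep padding low


def chunk_ids(text: str) -> list[list[int]]:
    """Split a document into fixed-size, non-overlapping windows of token ids."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return [ids[i : i + MAX_SRC_TOKENS] for i in range(0, len(ids), MAX_SRC_TOKENS)]


def translate_document(text: str) -> str:
    """Translate one English document to Arabic, window by window."""
    pieces = []
    for window in chunk_ids(text):
        src = tokenizer.decode(window, skip_special_tokens=True)
        batch = tokenizer(src, return_tensors="pt").to(model.device)
        generated = model.generate(
            **batch,
            forced_bos_token_id=tokenizer.convert_tokens_to_ids("arb_Arab"),
            max_new_tokens=MAX_SRC_TOKENS * 2,
        )
        pieces.append(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
    return " ".join(pieces)
```

In practice the windows would be batched across documents to keep the GPU saturated; the point of non-overlapping, fixed-size windows is that batch members have nearly uniform length, so little compute is wasted on padding.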
The authors created FineWeb-Edu-Ar, comprising 202 billion tokens in Arabic, making it the largest publicly available machine-translated Arabic dataset. Their analysis of various machine translation models highlights the trade-off between translation quality and computational cost, with nllb-200-distilled-600M emerging as a suitable choice for large-scale translation tasks.
FineWeb-Edu-Ar provides a valuable resource for researchers and developers working on Arabic SLMs. The dataset's size and quality are expected to contribute to the advancement of Arabic NLP, particularly in the context of resource-constrained environments.
This work addresses a critical gap in Arabic NLP by providing a substantial, high-quality dataset for training SLMs. This is particularly significant given the increasing demand for deploying language models on edge devices with limited computational resources.
While FineWeb-Edu-Ar offers a significant contribution, the authors acknowledge potential limitations regarding translation inaccuracies and the dataset's focus on knowledge domains relevant to English-speaking countries. Future research could explore evaluating the dataset's effectiveness in training Arabic SLMs and compare different approaches to mitigate potential biases stemming from the translation process.