Core Concepts
Continual pre-training of the Llama 2 7B model on the MaLA corpus, a comprehensive multilingual dataset, yields EMMA-500, a model that demonstrates robust performance across a wide range of multilingual benchmarks.
Summary
The authors introduce EMMA-500, a large-scale multilingual language model designed for enhanced multilingual performance, with a focus on improving language coverage for low-resource languages. They compile the MaLA corpus, a comprehensive multilingual dataset, and enrich it with curated datasets across diverse domains to facilitate continual pre-training.
The key highlights of the work are:
- The MaLA corpus contains 939 languages, 546 of which have more than 100k tokens and are used for training the EMMA-500 model. The corpus is further augmented with instruction data, code, and high-quality curated data to create a diverse data mix.
- The authors perform continual pre-training of the Llama 2 7B model on the MaLA corpus, resulting in the EMMA-500 model (see the training sketch after this list).
- EMMA-500 demonstrates robust performance across a wide collection of benchmarks, including multilingual tasks and PolyWrite, a novel open-ended generation benchmark developed as part of this work.
- The model outperforms Llama 2-based models and other multilingual baselines in tasks such as commonsense reasoning, machine translation, and open-ended generation.
- While math and machine reading comprehension tasks remain challenging, EMMA-500 significantly enhances the performance of the Llama 2 base model.
- The authors show that massively multilingual continued pre-training does not necessarily lead to regressions in other areas, such as code generation, if the data mix is carefully curated.
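To make the continual pre-training step concrete, below is a minimal sketch of continuing causal-language-model training from Llama 2 7B with the Hugging Face Transformers Trainer. It is illustrative only: the data path (`mala_mix/*.txt`), hyperparameters, sequence length, and output directory are assumptions, not the authors' actual EMMA-500 pipeline.

```python
# Minimal sketch of continual pre-training with Hugging Face Transformers.
# Illustrative only: paths and hyperparameters below are assumptions,
# not the EMMA-500 authors' actual training configuration.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "meta-llama/Llama-2-7b-hf"  # gated model; requires access approval
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Hypothetical local shards of the multilingual data mix (plain-text files).
raw = load_dataset("text", data_files={"train": "mala_mix/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard causal-LM objective: labels are the input ids, shifted internally.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="emma500-cpt",        # hypothetical output directory
    per_device_train_batch_size=1,   # illustrative; real runs shard across many GPUs
    gradient_accumulation_steps=64,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
    logging_steps=100,
    save_steps=5000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

A production run would additionally pack short documents into full-length sequences and balance sampling across languages; the sketch omits both for brevity.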
Statistics
The MaLA corpus contains 939 languages, 546 of which have more than 100k tokens and are used for training the EMMA-500 model.
The final data mix for continual pre-training contains around 136B tokens.
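The 100k-token cutoff that selects 546 of the 939 languages amounts to a simple per-language filter; the sketch below shows the criterion with made-up token counts, since the real MaLA per-language statistics are not reproduced here.

```python
# Illustrative selection of training languages by token count, following the
# 100k-token threshold described above. The counts are hypothetical examples.
TOKEN_THRESHOLD = 100_000

token_counts = {                 # hypothetical per-language token counts
    "eng_Latn": 5_000_000_000,
    "swh_Latn": 42_000_000,
    "bod_Tibt": 180_000,
    "gil_Latn": 55_000,          # below threshold -> excluded from training
}

training_languages = {
    lang: n for lang, n in token_counts.items() if n > TOKEN_THRESHOLD
}
print(f"{len(training_languages)} of {len(token_counts)} languages kept")
```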
Quotes
"We compile the MaLA corpus, a comprehensive multilingual dataset and enrich it with curated datasets across diverse domains."
"Our model remarkably improves the performance of commonsense reasoning, machine translation, and open-ended generation over Llama 2-based models and multilingual baselines, and outperforms the latest advanced models in many cases."
"We demonstrate that massively multilingual continued pre-training does not necessarily lead to regressions in other areas, such as code generation, if the data mix is carefully curated."