This work introduces MaLA-500, a large language model covering 534 languages, built by extending the vocabulary of LLaMA 2 and continuing pretraining on the Glot500-c dataset.
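The vocabulary-extension-plus-continued-pretraining recipe can be illustrated with a short sketch. This is a minimal illustration assuming the Hugging Face Transformers API; the model name and added tokens are placeholders, not the authors' actual Glot500-c-derived vocabulary.

```python
# Minimal sketch: extend a base model's vocabulary, then continue pretraining.
# Model name and new tokens are placeholders for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"  # placeholder base checkpoint (gated on the Hub)
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# 1) Vocabulary extension: add subword pieces covering the new languages.
#    In practice these would come from a tokenizer trained on the multilingual
#    corpus; here we add a toy list of hypothetical tokens.
new_tokens = ["▁bonjou", "▁habari", "▁kumusta"]  # hypothetical examples
num_added = tokenizer.add_tokens(new_tokens)
print(f"added {num_added} tokens")

# 2) Resize the embedding matrix so the new tokens get rows; their vectors
#    are initialized and then learned during continued pretraining.
model.resize_token_embeddings(len(tokenizer))

# 3) Continued pretraining: standard causal-LM training on the multilingual
#    corpus (e.g., with Trainer or a custom loop); omitted here for brevity.
```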
This paper introduces four multilingual pre-trained language models (PLMs) tailored to five Angolan languages using a Multilingual Adaptive Fine-tuning (MAFT) approach. The authors demonstrate that informed embedding initialization via the OFA method and the incorporation of synthetic data significantly improve the performance of the MAFT models on downstream tasks.
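The informed-initialization idea can be sketched as follows. This is a simplified illustration, not the OFA implementation: OFA additionally factorizes the embedding matrix and derives similarities from external multilingual word vectors, both of which are replaced here by a placeholder similarity matrix.

```python
# Sketch of "informed" embedding initialization in the spirit of OFA:
# each new target-language token starts from a similarity-weighted average of
# source-model embeddings instead of a random vector. The similarity matrix is
# a stand-in for what OFA derives from multilingual word vectors.
import numpy as np

def informed_init(source_emb: np.ndarray, similarity: np.ndarray, top_k: int = 10) -> np.ndarray:
    """source_emb: (V_src, d) source embedding matrix.
    similarity: (V_new, V_src) similarity of each new token to each source token.
    Returns: (V_new, d) initial embeddings for the new tokens."""
    new_emb = np.zeros((similarity.shape[0], source_emb.shape[1]))
    for i, sims in enumerate(similarity):
        top = np.argsort(sims)[-top_k:]         # indices of the most similar source tokens
        weights = np.exp(sims[top])             # softmax over the top-k similarity scores
        weights /= weights.sum()
        new_emb[i] = weights @ source_emb[top]  # weighted average of their embeddings
    return new_emb

# Toy example: 3 new tokens, 100 source tokens, embedding dimension 8.
rng = np.random.default_rng(0)
init = informed_init(rng.normal(size=(100, 8)), rng.normal(size=(3, 100)))
print(init.shape)  # (3, 8)
```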
Poro 34B, a 34-billion-parameter multilingual language model trained on 1 trillion tokens of Finnish, English, and programming languages, substantially advances the state of the art for Finnish while remaining competitive in English and code generation and achieving strong translation capabilities.
AURORA-M is a 15B-parameter multilingual open-source language model that addresses key challenges in existing models, including limited multilingual capabilities, catastrophic forgetting, and lack of compliance with AI safety and development regulations. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions to align its development with the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence.