Core Concepts
This paper presents a methodology for adapting large language models to new languages and reports state-of-the-art results across 9 diverse languages and 2 model scales (7B and 70B).
Abstract
The paper presents a comprehensive study on adapting large language models (LLMs) to new languages. The key contributions include:
Best practices for continuous pretraining in 9 diverse languages, including vocabulary expansion, embedding initialization, and the impact of base model quality.
A recipe for human preference alignment in any language using minimal target language data, including the use of machine-translated data.
Open-sourcing code and checkpoints for state-of-the-art models in 9 languages and 2 parameter scales (7B and 70B).
The authors start with an existing base model (Llama 2) and adapt it to the target languages. They explore design choices such as vocabulary expansion, embedding initialization, and the quality of the base model. They also investigate machine-translated data for human preference alignment, showing that models aligned on translated data can perform as well as models aligned on human-written data.
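To make the vocabulary expansion and embedding initialization step concrete, here is a minimal sketch of one common strategy: each newly added token receives the mean of the embeddings of the old-vocabulary tokens it previously decomposed into. This summary does not say which initialization the authors settled on, so treat the strategy as an assumption. The function name `expand_embeddings` and its arguments are hypothetical, and plain Python lists stand in for real model tensors.

```python
from typing import List


def expand_embeddings(
    old_embeddings: List[List[float]],
    new_token_pieces: List[List[int]],
) -> List[List[float]]:
    """Append one embedding row per new token.

    old_embeddings:   one row per token in the original vocabulary.
    new_token_pieces: for each new token, the ids of the original-vocabulary
                      tokens that the old tokenizer split it into.
    Assumption: new rows are initialized as the mean of those sub-token rows.
    """
    dim = len(old_embeddings[0])
    # Copy the original rows so the caller's table is left untouched.
    expanded = [row[:] for row in old_embeddings]
    for piece_ids in new_token_pieces:
        if not piece_ids:
            raise ValueError("each new token needs at least one sub-token id")
        mean_row = [
            sum(old_embeddings[t][k] for t in piece_ids) / len(piece_ids)
            for k in range(dim)
        ]
        expanded.append(mean_row)
    return expanded


# Toy example: a 2-token vocabulary with 3-dimensional embeddings,
# extended by one new token that used to be split into tokens 0 and 1.
old = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
new = expand_embeddings(old, [[0, 1]])
print(len(new), new[-1])
```

In a real model the same rows would typically be appended to both the input embedding matrix and the output projection, and the new rows would then be updated during continuous pretraining.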
The authors evaluate their models using perplexity and a wide range of benchmarks covering translation, text classification, question answering, and natural language understanding. They compare their models to existing open-source language experts and multilingual models, demonstrating state-of-the-art performance across the 9 target languages.
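Of the evaluation measures listed above, perplexity is the simplest to write down: it is the exponential of the average negative log-likelihood per token. The sketch below assumes the per-token natural-log probabilities have already been obtained from a model; the function name `perplexity` and its input format are illustrative and not taken from the paper's code.

```python
import math
from typing import List


def perplexity(token_log_probs: List[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    Computes exp(-(1/N) * sum(log p_i)); lower is better.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that assigns probability 0.25 to every token has perplexity of about 4.
print(perplexity([math.log(0.25)] * 4))
```

Per-token perplexity depends on how the tokenizer segments the text, so comparisons between models with different vocabularies usually need a shared normalization unit such as characters or words.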
Stats
"The resulting models can outperform large multilingual models and even language specific models pre-trained from scratch."
"Our methodology can lead to better models than existing state of the art models in these languages."
"Our SambaLingo models consistently out-perform other models in the same language."
Quotes
"Despite the widespread availability of LLMs, there remains a substantial gap in their capabilities and availability across diverse languages."
"Adaptation requires various design choices around the tokenizer, data, alignment and evaluation strategies."
"We show that our methodology works by training models across 9 languages and 2 parameter scales (7B and 70B) and comparing them against publicly available models."