Initializing new vocabulary embeddings within the convex hull of existing embeddings is crucial for preserving the performance of pre-trained language models while expanding their vocabulary for multilingual tasks. After continual pre-training, simpler initialization methods can be as effective as more complex ones.
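One way to realize such an initialization is to draw each new embedding as a convex combination of existing rows of the embedding matrix. The sketch below uses Dirichlet-distributed weights purely for illustration; the exact weighting scheme (and any rescaling) used in the original work may differ.

```python
# Illustrative convex-hull initialization of new token embeddings.
# The Dirichlet weighting and the alpha value are assumptions, not the
# paper's exact recipe.
import torch

def init_new_embeddings(old_embeddings: torch.Tensor,
                        num_new_tokens: int,
                        alpha: float = 0.1) -> torch.Tensor:
    """Return (num_new_tokens, hidden_dim) embeddings lying in the convex
    hull of old_embeddings (weights are non-negative and sum to 1)."""
    vocab_size, _ = old_embeddings.shape
    weights = torch.distributions.Dirichlet(
        torch.full((vocab_size,), alpha)
    ).sample((num_new_tokens,))                 # (num_new, vocab_size)
    weights = weights.to(old_embeddings.dtype)
    return weights @ old_embeddings             # (num_new, hidden_dim)

# Usage: extend a model's input embeddings with, e.g., 1000 new tokens.
# old = model.get_input_embeddings().weight.detach()
# new_rows = init_new_embeddings(old, num_new_tokens=1000)
```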
Continual pre-training of the Llama 2 7B model on the MaLA corpus, a comprehensive multilingual dataset, yields the EMMA-500 model, which demonstrates robust performance across a wide range of multilingual benchmarks.
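As a rough illustration of what causal-LM continual pre-training looks like in practice, a Hugging Face Transformers setup is sketched below. The corpus path, sequence length, and hyperparameters are placeholders and do not reproduce the EMMA-500 data mixture or training configuration.

```python
# Minimal sketch of continual pre-training with Hugging Face Transformers.
# Dataset path and hyperparameters are illustrative placeholders only.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder text file; the MaLA corpus itself is far larger and multilingual.
raw = load_dataset("text", data_files={"train": "multilingual_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_ds = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="cpt-checkpoints",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=32,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```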
A comprehensive methodology for adapting large language models to new languages achieves state-of-the-art results across 9 diverse languages and 2 model scales.
Performance comparable to dedicated models trained from scratch can be achieved by further pretraining existing multilingual models, even with a limited amount of computation.