Initializing new vocabulary embeddings within the convex hull of existing embeddings is crucial for preserving the performance of pre-trained language models while expanding their vocabulary for multilingual tasks. After continual pre-training, simpler initialization methods can be as effective as more complex ones.
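One way to realize such an initialization is to draw each new embedding as a convex combination of existing rows of the embedding matrix. The sketch below uses Dirichlet-distributed weights purely for illustration; the exact weighting scheme (and any rescaling) used in the original work may differ.

```python
# Illustrative convex-hull initialization of new token embeddings.
# The Dirichlet weighting and the alpha value are assumptions, not the
# paper's exact recipe.
import torch

def init_new_embeddings(old_embeddings: torch.Tensor,
                        num_new_tokens: int,
                        alpha: float = 0.1) -> torch.Tensor:
    """Return (num_new_tokens, hidden_dim) embeddings lying in the convex
    hull of old_embeddings (weights are non-negative and sum to 1)."""
    vocab_size, _ = old_embeddings.shape
    weights = torch.distributions.Dirichlet(
        torch.full((vocab_size,), alpha)
    ).sample((num_new_tokens,))                 # (num_new, vocab_size)
    weights = weights.to(old_embeddings.dtype)
    return weights @ old_embeddings             # (num_new, hidden_dim)

# Usage: extend a model's input embeddings with, e.g., 1000 new tokens.
# old = model.get_input_embeddings().weight.detach()
# new_rows = init_new_embeddings(old, num_new_tokens=1000)
```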
Continual pre-training of the Llama 2 7B model on the MaLA corpus, a comprehensive multilingual dataset, yields the EMMA-500 model, which demonstrates robust performance across a wide range of multilingual benchmarks.
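As a rough illustration of what causal-LM continual pre-training looks like in practice, a Hugging Face Transformers setup is sketched below. The corpus path, sequence length, and hyperparameters are placeholders and do not reproduce the EMMA-500 data mixture or training configuration.

```python
# Minimal sketch of continual pre-training with Hugging Face Transformers.
# Dataset path and hyperparameters are illustrative placeholders only.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder text file; the MaLA corpus itself is far larger and multilingual.
raw = load_dataset("text", data_files={"train": "multilingual_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_ds = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="cpt-checkpoints",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=32,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```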
A comprehensive methodology for adapting large language models to new languages achieves state-of-the-art results across 9 diverse languages and 2 model scales.
Performance comparable to dedicated models trained from scratch can be achieved by further pretraining existing multilingual models, even with a limited amount of computation.