Core Concepts
This survey provides a comprehensive analysis of Multilingual Large Language Models (MLLMs), delving into critical issues surrounding their training corpora, multilingual representation alignment, and inherent biases.
Abstract
This survey presents an overview of the evolution of Multilingual Large Language Models (MLLMs), tracing their development from monolingual Large Language Models (LLMs). It highlights the key techniques that contribute to the success of MLLMs, including the Transformer architecture, pre-training techniques, and reinforcement learning from human feedback (RLHF).
The survey then explores the multilingual capabilities of MLLMs, discussing the challenges they face due to language imbalance in training corpora and the potential for cross-lingual transfer learning. It provides an in-depth analysis of the multilingual corpora and datasets widely used for training and fine-tuning MLLMs, offering insights into their language distribution, data sources, and language coverage.
Next, the survey delves into multilingual representation alignment, categorizing the existing approaches into static, contextual, and combined methods. It examines the factors that affect the performance of these alignment techniques, such as the initial alignment solution, the typological distance between languages, and the pre-training data and settings of MLLMs.
Finally, the survey discusses the issue of bias in MLLMs, addressing questions about the types of bias present, the available debiasing techniques, and the impact of bias removal on model performance. It also summarizes the existing bias evaluation datasets for MLLMs.
Throughout, the survey aims to facilitate a deeper understanding of MLLMs and their potential in various domains, while also highlighting the critical challenges and promising research directions in this field.
Stats
MLLMs are trained on corpora in which English is the predominant language, accounting for over 90% of the data in some cases.
The Indo-European language family holds a prominent position in MLLM training corpora, both in terms of quantity and language diversity.
The Sino-Tibetan language family, represented mainly by Chinese, also has a significant presence in the training corpora.
Other language families, such as Niger-Congo and Trans-New Guinea, are notably absent from the top 20 languages in the training data.
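Distribution figures like those above are typically obtained by tallying per-document language labels over a corpus. A minimal sketch of that tally, using only the standard library; the corpus records here are hypothetical toy data, not figures from the survey:

```python
from collections import Counter

def language_distribution(records):
    """Given (language_code, token_count) pairs, return each
    language's share of total tokens as a percentage."""
    totals = Counter()
    for lang, tokens in records:
        totals[lang] += tokens
    grand_total = sum(totals.values())
    return {lang: 100.0 * n / grand_total for lang, n in totals.items()}

# Hypothetical corpus: (language code, token count) per document batch
corpus = [("en", 920), ("zh", 40), ("de", 25), ("fr", 15)]
shares = language_distribution(corpus)
# In this toy corpus, English accounts for 92% of tokens,
# mirroring the English-dominant skew the survey reports.
```

Real pipelines would first assign each document a language label (e.g. with an automatic language identifier) before aggregating, but the aggregation step is the same.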
Quotes
"MLLMs broaden language coverage beyond LLMs, yet English remains dominant in their training corpora."
"Further work should consider a more comprehensive inclusion of language families and prioritize language diversity within the same language family when training MLLMs."
"The performance of MLLMs on multilingual alignment varies across languages, with better performance observed for English and its closely related languages."