This study investigates the factors that influence the performance of multilingual large language models (MLLMs) across a diverse set of 204 languages. The analysis considers three scenarios: ALL languages, SEEN languages (present in the model's pretraining data), and UNSEEN languages (not present or documented in the model's pretraining data).
For the ALL languages scenario, the decision tree analysis reveals pretraining data size as the most influential factor, since it effectively separates languages that were part of the training set from those that were not.
For SEEN languages, pretraining data size remains the most important factor: the amount of language-specific pretraining data largely determines model performance. General resource availability also emerges as an important factor for specific models and settings.
In contrast, for UNSEEN languages, linguistic characteristics like script type and language family become the most influential factors, highlighting the importance of cross-lingual transfer learning when models encounter unfamiliar languages.
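As a rough illustration of how such an analysis can be set up, the sketch below fits a small decision tree per scenario and ranks feature importances with scikit-learn. The file name, column names (e.g. `pretraining_data_size`, `benchmark_score`), and the choice of regressor are assumptions for illustration, not the study's actual pipeline.

```python
# A minimal sketch of a decision-tree feature-importance analysis,
# not the paper's actual code. The file name, column names, and
# benchmark metric are assumptions for illustration.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical per-language table: one row per language, with candidate
# predictors and an observed benchmark score for a given MLLM.
df = pd.read_csv("language_features.csv")

# Encode categorical predictors as integer codes (assumed string labels).
for col in ["script_type", "language_family", "resource_level"]:
    df[col] = pd.factorize(df[col])[0]

features = [
    "pretraining_data_size",  # language-specific pretraining tokens
    "resource_level",         # general resource availability class
    "script_type",
    "language_family",
]

scenarios = {
    "ALL": df,
    "SEEN": df[df["pretraining_data_size"] > 0],
    "UNSEEN": df[df["pretraining_data_size"] == 0],
}

for name, subset in scenarios.items():
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(subset[features], subset["benchmark_score"])
    ranking = sorted(
        zip(features, tree.feature_importances_), key=lambda kv: -kv[1]
    )
    print(name, ranking)
```

Under this kind of setup, the reported pattern would show up as `pretraining_data_size` dominating the importance ranking for ALL and SEEN, while `script_type` and `language_family` rise to the top for UNSEEN.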
Interestingly, model size and architecture do not significantly alter the most important features identified, suggesting that the distribution of languages in the pretraining data and the linguistic properties of the target languages consistently shape MLLM performance.
The findings provide valuable insights into the strengths and limitations of current MLLMs and can guide the development of more effective and equitable multilingual NLP systems.