The paper proposes a method called the "Language Ranker" to quantitatively measure and compare the performance of large language models (LLMs) across different languages. The key findings are:
The authors use the OPUS-100 multilingual dataset to evaluate four popular open-source LLMs: LLaMA2, Qwen, Mistral-v0.1, and Gemma. They quantify the performance gap by measuring the cosine similarity between the model's internal representations of a target language and of English. The results show that high-resource languages like German, French, and Chinese yield representations more similar to English, while low-resource languages like Igbo, Kazakh, and Oriya exhibit lower similarity.
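The similarity metric described above can be sketched as follows. This is a minimal illustration of cosine similarity between sentence-level representation vectors, not the authors' implementation; the vectors here are synthetic stand-ins for the mean hidden-state activations an LLM would produce for parallel sentences, and the noise scales are chosen only to mimic the reported high- vs low-resource gap.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two representation vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical representation vectors (stand-ins for LLM hidden states).
# A high-resource language is modeled as a small perturbation of the
# English vector; a low-resource language as a large perturbation.
rng = np.random.default_rng(0)
english = rng.normal(size=768)
german = english + rng.normal(scale=0.3, size=768)   # close to English
igbo = english + rng.normal(scale=1.5, size=768)     # farther from English

print(cosine_similarity(english, german))  # relatively high
print(cosine_similarity(english, igbo))    # relatively low
```

Because cosine similarity normalizes out vector magnitude, it compares only the direction of the representations, which makes it a convenient scale-free proxy for how "English-like" a language's encoding is.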
The paper also explores the impact of model size on multilingual performance, finding a positive correlation between model size and performance on low-resource languages, but a negative correlation for high-resource languages. This suggests that as LLMs grow in size, they become better at understanding low-resource languages but may struggle with the increased complexity of high-resource language data.
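The size-vs-performance trend can be made concrete with a small correlation check. The numbers below are made up purely to illustrate the shape of the reported finding (similarity rising with model size for a low-resource language, falling for a high-resource one); they are not values from the paper.

```python
import numpy as np

# Hypothetical (model size in billions of parameters, English-similarity)
# pairs, mimicking the reported trends. All values are illustrative.
sizes = np.array([7.0, 13.0, 34.0, 70.0])
low_resource_sim = np.array([0.42, 0.48, 0.55, 0.61])   # rises with size
high_resource_sim = np.array([0.90, 0.88, 0.86, 0.85])  # falls with size

# Pearson correlation between model size and similarity score.
r_low = np.corrcoef(sizes, low_resource_sim)[0, 1]
r_high = np.corrcoef(sizes, high_resource_sim)[0, 1]

print(f"low-resource:  r = {r_low:+.2f}")   # positive correlation
print(f"high-resource: r = {r_high:+.2f}")  # negative correlation
```

A positive Pearson coefficient for the low-resource series and a negative one for the high-resource series would correspond to the paper's finding that scaling helps low-resource languages while slightly eroding similarity for high-resource ones.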
Key insights distilled from the paper by Zihao Li, Yuc... (arxiv.org, 04-18-2024): https://arxiv.org/pdf/2404.11553.pdf