This thesis focuses on addressing the limitations of multilingual large language models (LLMs) in representing and understanding underrepresented languages and cultures. It begins with a comprehensive evaluation of multilingual LLMs on a diverse set of underrepresented languages, specifically Austronesian languages spoken in Indonesia. The evaluation covers both language understanding and generation tasks, as well as cultural understanding capabilities.
The results reveal significant disparities in the performance of multilingual LLMs across different languages, with underrepresented languages consistently lagging behind high-resource languages. This underscores the urgent need to develop methods for improving the inclusivity and diversity of multilingual LLMs.
To address this challenge, the thesis proposes two approaches:
Cross-lingual Continual Instruction-Tuning: This method employs data-efficient cross-lingual objectives to fine-tune multilingual LLMs, enabling them to acquire capabilities in underrepresented languages without catastrophic forgetting of high-resource language abilities (a minimal data-mixing sketch follows this list).
Cross-lingual In-Context Learning: This training-free approach leverages retrieval-based techniques to adapt multilingual LLMs to underrepresented languages during inference, without modifying the model parameters.
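The first approach can be illustrated with a minimal sketch of the data side of continual instruction-tuning. The replay-style mixing shown here is a common way to mitigate catastrophic forgetting and stands in for the thesis's actual cross-lingual objectives; the mixing ratio, the example records, and the function name are illustrative assumptions, not the thesis's method.

```python
import random

def mix_for_continual_tuning(target_data, replay_pool, replay_ratio=0.2, seed=0):
    """Interleave new target-language instruction pairs with a small replay set of
    high-resource instruction pairs, so fine-tuning on the underrepresented language
    does not overwrite previously learned high-resource skills."""
    rng = random.Random(seed)
    n_replay = min(int(len(target_data) * replay_ratio), len(replay_pool))
    mixed = list(target_data) + rng.sample(replay_pool, n_replay)
    rng.shuffle(mixed)
    return mixed

# Hypothetical (instruction, response) records for illustration only.
sundanese = [("Tarjamahkeun kana basa Sunda: 'good morning'", "Wilujeng enjing")] * 50
english_replay = [("Summarize the paragraph in one sentence.", "...")] * 200
training_mix = mix_for_continual_tuning(sundanese, english_replay)
```

The mixed batch would then be fed to an ordinary instruction-tuning loop; only the data composition, not the optimizer, changes in this sketch.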
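The second, training-free approach can likewise be sketched as retrieval-based prompt construction: embed a labelled pool of high-resource exemplars, retrieve the ones closest to the underrepresented-language query, and prepend them as demonstrations for a frozen LLM. The exemplar pool, the embedding model, and the prompt template below are placeholder assumptions, not the thesis's exact setup.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Small pool of labelled exemplars in a high-resource language (hypothetical data).
POOL = [
    ("The film was wonderful.", "positive"),
    ("The service was terrible.", "negative"),
    ("An average, forgettable meal.", "neutral"),
]

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
pool_vecs = encoder.encode([text for text, _ in POOL], normalize_embeddings=True)

def build_prompt(query: str, k: int = 2) -> str:
    """Retrieve the k most similar exemplars and prepend them as in-context demonstrations."""
    q_vec = encoder.encode([query], normalize_embeddings=True)[0]
    scores = pool_vecs @ q_vec              # cosine similarity (embeddings are normalized)
    top = np.argsort(-scores)[:k]
    demos = "\n".join(f"Text: {POOL[i][0]}\nLabel: {POOL[i][1]}" for i in top)
    # The frozen multilingual LLM receives the demonstrations plus the
    # underrepresented-language query; no parameters are updated.
    return f"{demos}\nText: {query}\nLabel:"

print(build_prompt("Filmnya luar biasa bagus."))  # Indonesian query
```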
Additionally, the thesis introduces a novel method for measuring multicultural value alignment in multilingual LLMs. This approach uses value-eliciting question answering and multi-view embedding learning to capture how diverse cultural values are represented across different languages, allowing for a deeper understanding of the cultural inclusivity of multilingual LLMs.
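A highly simplified sketch of this idea: pose the same value-eliciting question in several languages, embed the model's answers, and compare them across languages. The cosine-similarity comparison below stands in for the thesis's multi-view embedding learning, which is more involved; the questions, answers, and embedding model are illustrative assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical answers to the same value-eliciting question in two languages;
# in practice these would come from the multilingual LLM under evaluation.
answers = {
    "en": "Elders' opinions should usually take priority in family decisions.",
    "id": "Keputusan keluarga sebaiknya dibuat bersama, bukan hanya oleh orang tua.",
}

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
vecs = encoder.encode(list(answers.values()), normalize_embeddings=True)

# Low similarity flags language pairs in which the model expresses
# a different value stance for the same question.
consistency = float(vecs[0] @ vecs[1])
print(f"cross-lingual value consistency: {consistency:.2f}")
```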
The contributions of this thesis aim to advance the field of multilingual natural language processing towards greater equality and inclusiveness by enhancing the performance and cultural sensitivity of large language models in underrepresented languages.