This thesis presents a comprehensive evaluation of multilingual large language models (LLMs) on underrepresented languages, revealing limitations in their multilingual and multicultural generalization. It proposes data-efficient methods to improve the inclusivity and diversity of multilingual LLMs, enabling better performance on underrepresented languages without sacrificing high-resource language capabilities.
Multilingual neural machine translation models can be efficiently fine-tuned by isolating intrinsic language-specific subspaces, yielding significant performance gains while training far fewer parameters.
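To make the parameter-efficiency idea concrete, here is a minimal PyTorch sketch of subspace-restricted fine-tuning, assuming a gradient-magnitude criterion for selecting the language-specific parameters; the paper's actual subspace-identification procedure may differ, and the names `select_language_subspace`, `masked_update`, and `keep_ratio` are illustrative.

```python
import torch

def select_language_subspace(model, batches, loss_fn, keep_ratio=0.05):
    # Score every parameter by accumulated squared gradient on
    # language-specific batches (hypothetical selection criterion).
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for batch in batches:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += p.grad.detach() ** 2
    flat = torch.cat([s.flatten() for s in scores.values()])
    k = max(1, int(keep_ratio * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    # Binary masks marking the selected language-specific subspace.
    return {n: (s >= threshold).float() for n, s in scores.items()}

def masked_update(model, masks, lr=1e-4):
    # Plain SGD step restricted to the selected subspace;
    # all other parameters stay frozen.
    with torch.no_grad():
        for n, p in model.named_parameters():
            if p.grad is not None:
                p.add_(masks[n] * p.grad, alpha=-lr)
```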
We introduce MEDIT, a set of multilingual models for text editing tasks such as grammatical error correction, text simplification, and paraphrasing, built by instruction-tuning large pre-trained language models.
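As an illustration of the instruction-tuning setup, the snippet below builds a hypothetical training example for a multilingual grammatical-error-correction edit; the prompt template and field names are assumptions, not MEDIT's published format.

```python
def build_edit_example(instruction, source, target):
    # Pack an editing task into an instruction-style prompt/completion pair
    # (hypothetical template, not the paper's exact layout).
    prompt = (
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{source}\n\n"
        f"### Response:\n"
    )
    return {"prompt": prompt, "completion": target}

# Example: Spanish grammatical error correction.
example = build_edit_example(
    instruction="Correct the grammatical errors in the following text.",
    source="Ella no sabe nada de que paso ayer.",
    target="Ella no sabe nada de lo que pasó ayer.",
)
```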
Meta4XNLI is a parallel corpus in Spanish and English that provides metaphor annotations for both detection at the token level and interpretation through Natural Language Inference.
Cross-lingual alignment, the meaningful similarity of representations across languages in multilingual language models, is crucial for zero-shot cross-lingual transfer. This survey provides a comprehensive overview of techniques to improve cross-lingual alignment, including objectives using parallel data, contrastive learning, modified pre-training schemes, adapter tuning, and data augmentation.
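Among the alignment objectives the survey covers, contrastive learning on parallel data is the easiest to state in code. The following is an illustrative InfoNCE-style loss over parallel sentence embeddings, not a specific method from the survey.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(src_emb, tgt_emb, temperature=0.05):
    # Pull parallel sentence pairs together and push non-parallel
    # pairs in the batch apart (illustrative formulation).
    src = F.normalize(src_emb, dim=-1)   # (batch, dim)
    tgt = F.normalize(tgt_emb, dim=-1)   # (batch, dim)
    logits = src @ tgt.T / temperature   # cosine similarities as logits
    labels = torch.arange(src.size(0), device=src.device)
    # Symmetric loss: source-to-target and target-to-source retrieval.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```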
MLaKE is a novel benchmark for evaluating the multilingual knowledge editing capabilities of large language models, comprising 5,360 single-hop and 4,072 multi-hop questions across five languages (English, Chinese, Japanese, French, German).
This paper presents a comprehensive survey of the recent progress and emerging trends in multilingual large language models (MLLMs), offering a unified perspective through a novel taxonomy based on alignment strategies.
Bias is present in text data across multiple languages, including benchmark datasets on the English GLUE/SuperGLUE leaderboards and datasets in Italian, Dutch, German, and Swedish.
The proposed In-Context Cross-Lingual Transfer (IC-XLT) approach effectively leverages target-language demonstrations during inference to improve cross-lingual text classification performance, especially in scenarios with limited source-language data.
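The core mechanism is prepending a handful of target-language demonstrations to the inference-time prompt of a model adapted on source-language data. The sketch below shows one possible prompt layout for sentiment classification; the template, label set, and German examples are hypothetical rather than taken from the paper.

```python
def build_icxlt_prompt(target_demos, query_text):
    # Prepend target-language demonstrations (text, label) to the query,
    # so a source-language-tuned model sees target-language examples
    # only in context at inference time.
    lines = []
    for text, label in target_demos:
        lines.append(f"Text: {text}\nLabel: {label}\n")
    lines.append(f"Text: {query_text}\nLabel:")
    return "\n".join(lines)

# Hypothetical German sentiment demonstrations for an English-tuned classifier.
prompt = build_icxlt_prompt(
    target_demos=[("Das Essen war hervorragend.", "positive"),
                  ("Der Service war leider sehr langsam.", "negative")],
    query_text="Ich komme gerne wieder.",
)
```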
Our system AAdaM achieves competitive results in the SemEval-2024 Task 1 on Semantic Textual Relatedness for African and Asian languages, by leveraging data augmentation, task-adaptive pre-training, and adapter-based tuning.
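For readers unfamiliar with adapter-based tuning, the module below is a standard bottleneck adapter (down-projection, nonlinearity, up-projection, residual connection) in PyTorch; the layer sizes are placeholders, and this is an illustrative sketch rather than AAdaM's exact configuration.

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    # Small trainable module inserted into a frozen pre-trained model;
    # only the adapter parameters are updated during task adaptation.
    def __init__(self, hidden_dim=768, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # Residual connection keeps the frozen model's representation intact.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```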