Comprehensive Multilingual Benchmarking of State-of-the-Art Large Language Models Across Languages, Modalities, and Tasks
This study performs a thorough evaluation of the non-English capabilities of state-of-the-art large language models (GPT-3.5-Turbo, GPT-4, PaLM2, Gemini-Pro, Mistral, Llama2, and Gemma) by comparing them on the same set of multilingual datasets covering 83 languages, including low-resource African languages. It also includes multimodal datasets, comparing the performance of LLaVA models, GPT-4-Vision, and Gemini-Pro-Vision. The experiments show that larger models such as GPT-4, Gemini-Pro, and PaLM2 outperform smaller models on various tasks, particularly on low-resource languages, with GPT-4 outperforming PaLM2 and Gemini-Pro on a larger number of datasets. The study also finds that several models are likely contaminated with multilingual evaluation benchmarks, underscoring the need for approaches to detect and handle contamination when assessing the multilingual performance of large language models.
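As a loose illustration of what a contamination check can look like (this is a minimal sketch, not the detection method used in the study; the function names and the n-gram-overlap heuristic are assumptions), one simple approach is to prompt a model with the first part of a benchmark item and measure how much of the held-out remainder it reproduces verbatim:

```python
# Minimal sketch of a naive contamination check: word-level n-gram overlap
# between a benchmark item and a model's continuation of its prefix.
# A consistently high ratio across many items suggests the benchmark may
# have leaked into the model's training data. Illustrative only.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_item: str, model_output: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams reproduced in the model output."""
    reference = ngrams(benchmark_item, n)
    if not reference:
        return 0.0
    return len(reference & ngrams(model_output, n)) / len(reference)

# Example: compare a model's (hypothetical) continuation against the full item.
item = "The quick brown fox jumps over the lazy dog near the river bank at dawn."
completion = "jumps over the lazy dog near the river bank at dawn."
print(f"overlap: {overlap_ratio(item, completion, n=4):.2f}")
```

In practice such string-overlap heuristics are only a first-pass signal; paraphrased or translated leakage requires stronger tests, such as comparing model likelihoods on benchmark items against likelihoods on freshly written items of the same distribution.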