The paper introduces NORMAD, a novel dataset designed to evaluate the cultural adaptability of large language models (LLMs). NORMAD contains 2.6k stories representing social and cultural norms from 75 countries, each paired with one of three degrees of cultural contextualization: a specific rule-of-thumb, an underlying cultural value, or only the country name.
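A dataset structured this way might be represented as records carrying a story, a contextualization level, a country, and a gold judgment. The sketch below is purely illustrative: the field names and example stories are assumptions, not NORMAD's actual schema.

```python
# Hypothetical sketch of a NORMAD-style record; field names and values
# (story, context_level, country, label) are illustrative assumptions,
# not the dataset's real schema.
from dataclasses import dataclass

@dataclass
class NormAdExample:
    story: str          # short social situation to judge
    context_level: str  # "rule-of-thumb" | "value" | "country"
    country: str        # one of the 75 countries covered
    label: str          # "yes" (adheres), "no" (violates), "neither" (irrelevant)

examples = [
    NormAdExample("A guest brings a gift wrapped in white paper.",
                  "country", "China", "no"),
    NormAdExample("A guest removes their shoes before entering the home.",
                  "rule-of-thumb", "Japan", "yes"),
]

# Group examples by contextualization level so accuracy can be
# reported separately per condition, as the paper's findings are.
by_level = {}
for ex in examples:
    by_level.setdefault(ex.context_level, []).append(ex)

print(sorted(by_level))  # → ['country', 'rule-of-thumb']
```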
The key findings are:
LLMs adapt poorly to cultural norms, especially when given only value- or country-level context. Even the best-performing models, such as GPT-3.5-turbo and Mistral-Instruct, achieve only 60% and 55% accuracy respectively in these settings, far below human performance of 95.6%.
LLMs exhibit an inherent agreement (sycophancy) bias: they perform significantly better on stories that adhere to cultural norms than on stories that violate them or are irrelevant to them.
Increasing model size or adopting better preference-alignment methods (such as KTO) can improve overall performance, but the gains are skewed toward English-speaking and European cultures rather than African-Islamic cultures.
LLMs particularly struggle with stories involving gift-giving, which is governed by complex, culture-specific norms around the presentation, number, and color of gifts.
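The sycophancy finding above amounts to accuracy differing by gold label. One way to expose such a bias is a per-label accuracy breakdown; the sketch below assumes "yes"/"no"/"neither" gold labels and model predictions as plain strings, and is an illustrative framing rather than the paper's actual evaluation code.

```python
from collections import defaultdict

def per_label_accuracy(gold, pred):
    """Accuracy computed separately per gold label, revealing whether a
    model scores higher on norm-adhering ('yes') stories than on
    norm-violating ('no') or irrelevant ('neither') ones."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p in zip(gold, pred):
        total[g] += 1
        correct[g] += int(g == p)
    return {label: correct[label] / total[label] for label in total}

# Toy data: a model that agrees too readily inflates "yes" accuracy.
gold = ["yes", "yes", "no", "no", "neither", "neither"]
pred = ["yes", "yes", "yes", "no", "yes", "neither"]
print(per_label_accuracy(gold, pred))
# → {'yes': 1.0, 'no': 0.5, 'neither': 0.5}
```

A gap between the "yes" score and the other two is exactly the agreement-bias pattern the paper reports.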
The paper highlights the pressing need for LLMs to develop better cultural adaptability and reasoning capabilities to ensure their equitable and effective deployment across diverse global contexts.
Source: arxiv.org