Core Concepts
Large language models represent diverse global cultures unevenly, and they use linguistic markers to set marginalized cultures apart from default cultures.
Abstract
The paper presents a framework to uncover the global cultural perceptions of three state-of-the-art language models (GPT-4, LLaMA-13B, and Mistral-7B) by generating culture-conditioned content and extracting associated cultural symbols.
Key insights:
Language models exhibit "cultural markedness": they use vocabulary like "traditional" and parenthetical explanations to distinguish marginalized cultures (e.g., Asian, African, Eastern European) from default/mainstream cultures (e.g., Western European, English-speaking).
Cultural symbols are unevenly represented in culture-agnostic generations: symbols from West European, English-speaking, and Nordic countries overlap most with what the models generate when no culture is specified.
The diversity of cultural symbols extracted for each culture and topic varies significantly across geographic regions, suggesting uneven cultural knowledge in the language models.
The diversity of cultural symbols is moderately to strongly correlated with the frequency of culture-topic co-occurrence in the language models' training data, indicating the importance of training data composition.
These findings motivate further research on studying and improving the global cultural knowledge and fairness of large language models.
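The pipeline the insights above describe (generate culture-conditioned completions, extract the cultural symbol, flag markedness cues, measure per-culture symbol diversity, and correlate diversity with training-data co-occurrence counts) can be sketched with toy heuristics. This is a minimal illustration, not the paper's actual prompts or extraction method: the template, the "traditional"/parenthetical markedness rule, and the symbol heuristic are all assumptions for demonstration.

```python
from math import sqrt

# Hypothetical culture-conditioned prompt template, echoing the style of the
# examples above; a culture-agnostic variant would simply omit the nationality.
TEMPLATE = "My neighbor is {culture}. For dinner, my neighbor likes to eat"

def build_prompt(culture):
    """Build a culture-conditioned prompt to send to a language model."""
    return TEMPLATE.format(culture=culture)

def extract_symbol(completion):
    """Toy symbol extractor: keep the clause before the first period,
    treat a leading 'traditional' or a parenthetical explanation as a
    markedness cue, and strip both from the returned symbol."""
    text = completion.split(".")[0].strip()
    marked = text.lower().startswith("traditional")
    if "(" in text:  # parenthetical explanation -> markedness cue
        text = text.split("(")[0].strip()
        marked = True
    if text.lower().startswith("traditional"):
        text = text[len("traditional"):].strip()
    return text, marked

def symbol_diversity(completions):
    """Diversity = number of distinct symbols across sampled generations."""
    return len({extract_symbol(c)[0] for c in completions})

def pearson(xs, ys):
    """Plain Pearson correlation, e.g. between per-culture symbol diversity
    and culture-topic co-occurrence counts in the training data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

On the two example generations quoted below, this heuristic flags the Algerian completion as marked (leading "traditional" plus a parenthetical) and the Italian one as unmarked.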
Stats
"My neighbor is Algerian. For dinner, my neighbor likes to eat traditional Algerian cuisine (harira, a rich lentil soup)."
"My neighbor is Italian. For dinner, my neighbor likes to eat mushroom risotto."
Quotes
"By predominantly preceding generations with 'traditional' for African-Islamic and Asian countries, LLMs implicitly contrast these cultures with the more 'modern' counterparts of North American countries."
"Such findings suggest that LLMs may service the inquiry of western-culture users disproportionately better."