Core Concepts
COPAL-ID is a novel, public Indonesian language common sense reasoning dataset that incorporates Indonesian local and cultural nuances, providing a more natural portrayal of day-to-day causal reasoning within the Indonesian cultural sphere.
Abstract
The authors present COPAL-ID, a novel Indonesian language common sense reasoning dataset that incorporates local and cultural nuances. Unlike previous Indonesian datasets such as XCOPA-ID, COPAL-ID is professionally written from scratch by native Indonesians, making it more fluent and free of awkward phrasing.
COPAL-ID is designed to capture three categories of locality: Culture, which reflects local customs or norms; Local Terminology, which includes terms commonly known by locals but not outsiders; and Language, which tests the nuance of the Indonesian language, including homonymy and non-compositionality.
The dataset is provided in both standard Indonesian and colloquial Jakartan Indonesian, a dialect commonly used in daily conversation. The authors find that COPAL-ID poses a greater challenge for existing open-source and closed state-of-the-art multilingual language models, which reach only 66.91% accuracy, compared to near-perfect human performance. This suggests that current language models still struggle to comprehend the local nuances of Indonesian.
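To make the accuracy figure concrete, here is a minimal sketch of how a two-choice causal reasoning benchmark of this kind is typically scored, overall and per locality category. It assumes a COPA-style layout (a premise, a cause/effect question, two candidate answers, one correct index), which the summary above does not spell out. The `CopaItem` fields, the three toy items, and the `always_first` baseline are all hypothetical illustrations loosely based on the examples quoted in this summary; they are not taken from the dataset itself.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class CopaItem:
    """One two-choice item (hypothetical schema, not the dataset's own)."""
    premise: str
    question: str            # "cause" or "effect"
    choices: tuple           # exactly two candidate sentences
    label: int               # index (0 or 1) of the correct choice
    category: str            # locality category named in the summary


# Toy items invented for illustration only.
ITEMS = [
    CopaItem("The family served nasi kuning.", "cause",
             ("They were holding a celebration.", "They were in mourning."),
             0, "Culture"),
    CopaItem("The man updated his KK.", "cause",
             ("He bought a new phone.", "His child was just born."),
             1, "Local Terminology"),
    CopaItem("Tickets sold out before he reached the counter.", "effect",
             ("He could only gigit jari.", "He got front-row seats."),
             0, "Language"),
]


def accuracy(items: Sequence[CopaItem],
             predict: Callable[[CopaItem], int]) -> float:
    """Fraction of items where the predicted index equals the label."""
    if not items:
        raise ValueError("cannot score an empty item list")
    correct = sum(1 for item in items if predict(item) == item.label)
    return correct / len(items)


def accuracy_by_category(items: Sequence[CopaItem],
                         predict: Callable[[CopaItem], int]) -> dict:
    """Accuracy computed separately for each locality category."""
    groups = defaultdict(list)
    for item in items:
        groups[item.category].append(item)
    return {name: accuracy(group, predict) for name, group in groups.items()}


def always_first(item: CopaItem) -> int:
    """Trivial baseline that always picks the first choice."""
    return 0


if __name__ == "__main__":
    print(f"overall: {accuracy(ITEMS, always_first):.2%}")
    for name, score in accuracy_by_category(ITEMS, always_first).items():
        print(f"{name}: {score:.2%}")
```

In practice `predict` would wrap a language model that compares the two choices given the premise. Because each item has two options, random guessing scores about 50% on a balanced set, which is the baseline against which the reported 66.91% should be read.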
Stats
The man updated his KK (a legal document that lists all the family members in a household).
My neighbor's house was just broken into by thieves.
That kid was accepted into UI (one of the top universities in Indonesia).
Quotes
"Nasi kuning is often served for celebrations."
"Gigit jari is a figure of speech to express helplessness."