Core Concepts
Data augmentation techniques can improve the performance of large language models on dialectal commonsense reasoning tasks.
Abstract
This report presents the GMUNLP team's approach to the DIALECT-COPA shared task, which aims to evaluate the commonsense reasoning capabilities of large language models (LLMs) on South Slavic micro-dialects.
The key highlights and insights are:
The authors explore the potential of data augmentation techniques to enhance the performance of language models on dialectal commonsense reasoning tasks. They utilize a diverse set of language models, including smaller models suitable for low-resource settings, mid-size models that balance task-specific performance and language understanding, and closed-source models that generate high-quality synthetic task data.
The authors achieve the highest scores across all three test datasets in the open-source model category. Their solution also performs on par with the GPT-4 zero-shot iterative prompting approach employed by one of the teams, demonstrating the competitiveness of the proposed approach against state-of-the-art closed-source models.
The authors observe that increasing data quantity through various augmentation techniques generally improves performance across most languages and low-resource dialects. However, discarding instances written in the Cyrillic script boosts performance for some languages and dialects while hurting others.
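The script-filtering step can be sketched as a simple heuristic. This is an illustrative sketch only: the helper names, the Unicode-block test, and the 50% threshold are assumptions, not the authors' exact implementation.

```python
import re

# Cyrillic letters used by the South Slavic languages fall in U+0400-U+04FF.
CYRILLIC_RE = re.compile(r"[\u0400-\u04FF]")

def is_mostly_cyrillic(text: str, threshold: float = 0.5) -> bool:
    """Return True if at least `threshold` of the alphabetic characters
    in `text` are Cyrillic (threshold chosen for illustration)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    cyrillic = sum(1 for c in letters if CYRILLIC_RE.match(c))
    return cyrillic / len(letters) >= threshold

def drop_cyrillic_instances(instances):
    """Keep only instances whose premise is not predominantly Cyrillic.
    Assumes each instance is a dict with a "premise" field."""
    return [ex for ex in instances if not is_mostly_cyrillic(ex["premise"])]
```

A filter like this removes, e.g., Serbian Cyrillic instances while keeping Latin-script Croatian or Slovenian ones, which matches the kind of ablation described above.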
The authors experiment with cross-lingual mix-and-match strategies but find no conclusive pattern: mixing helps in some cases and hinders performance in others, so it does not consistently make the model more language-agnostic.
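One basic form of such a mix-and-match strategy is pooling the per-language training sets into a single shuffled set. A minimal sketch, assuming instances are grouped by language code; the function name and signature are illustrative, not the authors' exact recipe.

```python
import random

def mix_and_match(datasets, seed=0):
    """Pool training instances from several languages/dialects into one
    shuffled training set.

    `datasets` maps a language code (e.g. "hr", "sl") to its list of
    instances. A fixed seed keeps the shuffle reproducible.
    """
    rng = random.Random(seed)
    pooled = [ex for instances in datasets.values() for ex in instances]
    rng.shuffle(pooled)
    return pooled
```

Variants of this idea (e.g. mixing only a subset of languages, or translating before mixing) are what make the observed effects hard to attribute to any single factor.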
The authors find that full fine-tuning of the comparatively smaller, non-instruction-tuned but language-specific BERTić model does not surpass the multilingual, instruction-tuned Aya-101 model. Applying the same data combinations as instruction tuning on Aya-101, however, yields an overall performance boost.
Stats
The authors report the following key figures. A sample instance illustrates the COPA task format (premise, two candidate alternatives, question type, label, and index):

Premise: "My body cast a shadow over the grass."
Choice 1: "The sun was rising."
Choice 2: "The grass was cut."
Question: "cause", Label: 0 (choice 1 is correct), Index: 0
The DIALECT-COPA dataset consists of cause-effect examples across 8 languages and dialects, with 400 training instances and 100 validation instances per language, and 500 test instances for the three dialects (Cerkno, Chakavian, and Torlak).
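The instance fields shown above map naturally onto a small data structure, and accuracy over the label field is the natural evaluation metric. A minimal sketch; the class and helper names are illustrative, not taken from the shared task's codebase.

```python
from dataclasses import dataclass

@dataclass
class CopaInstance:
    """One COPA-style item, mirroring the example fields above."""
    premise: str
    choice1: str
    choice2: str
    question: str  # "cause" or "effect"
    label: int     # 0 -> choice1 is correct, 1 -> choice2 is correct

def accuracy(instances, predictions):
    """Fraction of instances whose predicted choice index matches the label."""
    correct = sum(1 for ex, pred in zip(instances, predictions) if ex.label == pred)
    return correct / len(instances)
```

With 500 test instances per dialect, evaluation reduces to computing this accuracy over each dialect's test split.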