Core Idea
This paper presents a system developed for SemEval-2024 Task 1: Semantic Textual Relatedness (STR), Track C: Cross-lingual. The task is to predict the semantic relatedness of two sentences in a given target language without direct supervision in that language. The authors compare source language selection strategies using two pre-trained language models: XLM-R and FURINA.
Abstract
The authors explore the following approaches:
- Single-source transfer: Fine-tuning pre-trained language models on English data.
- K-nearest-neighbor languages: Augmenting the English training dataset with the datasets of k languages that are closest to the target language.
- Multi-source transfer: Fine-tuning a single model on the concatenation of all available source language datasets.
- Multi-source transfer on languages from the same family: Fine-tuning a single model on the concatenation of source language datasets from the same language family as the target language.
- Machine translation-based data augmentation: Translating the datasets of selected languages into one another to balance the training dataset.
- Transliteration: Transliterating non-Latin script languages into Latin script to facilitate multilingual transfer learning.
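The source language selection strategies in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the feature vectors, the Euclidean distance measure, the inclusion of Amharic as a source language, and all function names are hypothetical, and the summary does not say how language closeness is actually measured.

```python
from math import sqrt

# Hypothetical source languages: name -> (language family, feature vector).
# The vectors are invented for illustration and carry no linguistic meaning.
SOURCES = {
    "english": ("Indo-European", [1.0, 0.0, 1.0, 0.0]),
    "spanish": ("Indo-European", [1.0, 0.0, 0.8, 0.2]),
    "hausa": ("Afro-Asiatic", [0.2, 1.0, 0.3, 0.7]),
    "amharic": ("Afro-Asiatic", [0.0, 1.0, 0.0, 1.0]),
}


def euclidean(u, v):
    """Distance between two feature vectors (an assumed closeness measure)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def single_source():
    """Single-source transfer: train on English only."""
    return ["english"]


def k_nearest(target_vec, k):
    """English plus the k source languages closest to the target."""
    candidates = [name for name in SOURCES if name != "english"]
    ranked = sorted(candidates, key=lambda n: euclidean(SOURCES[n][1], target_vec))
    return ["english"] + ranked[:k]


def multi_source():
    """Multi-source transfer: every available source language."""
    return sorted(SOURCES)


def same_family(target_family):
    """Only source languages from the target language's family."""
    return sorted(n for n, (fam, _) in SOURCES.items() if fam == target_family)


# Hypothetical target language described by its family and feature vector.
target_family = "Afro-Asiatic"
target_vec = [0.1, 0.9, 0.2, 0.8]

selection = {
    "single": single_source(),
    "knn": k_nearest(target_vec, k=2),
    "multi": multi_source(),
    "family": same_family(target_family),
}
```

Each strategy returns only the list of languages whose training sets would be concatenated before fine-tuning; the fine-tuning step itself is out of scope for this sketch.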
The authors find that:
- Knowledge transfer from multiple source languages improves STR models compared to single-source transfer.
- Training on languages from the same family as the target language can outperform training on all available source languages, indicating the presence of language interference.
- Script differences cause high variance in transfer performance, and transliteration does not consistently improve cross-lingual transfer.
- Machine translation-based data augmentation can enhance transfer performance for some languages but can also lead to shifts in label semantics.
The authors' submitted system, which fine-tunes FURINA on English, Spanish, and Hausa, achieved first place on the C8 (Kinyarwanda) test set.
Statistics
15,123 training instances across 9 languages
2,588 development instances across 14 languages
7,667 test instances across 12 languages