
SemEval-2024 Task 1: Cross-lingual Semantic Textual Relatedness without Direct Supervision


Core Concepts
This paper presents a system developed for SemEval-2024 Task 1: Semantic Textual Relatedness (STR), Track C: Cross-lingual. The task is to detect the semantic relatedness of two sentences in a given target language without access to direct supervision. The authors focus on different source-language selection strategies with two pre-trained language models: XLM-R and FURINA.
Abstract

The paper presents a system developed for SemEval-2024 Task 1: Semantic Textual Relatedness (STR), Track C: Cross-lingual. The task is to detect the semantic relatedness of two sentences in a given target language without access to direct supervision.

The authors explore the following approaches:

  1. Single-source transfer: Fine-tuning pre-trained language models on English data.
  2. K-nearest-neighbor languages: Augmenting the English training dataset with the datasets of the k languages closest to the target language (see the selection sketch after this list).
  3. Multi-source transfer: Fine-tuning a single model on the concatenation of all available source language datasets.
  4. Multi-source transfer on languages from the same family: Fine-tuning a single model on the concatenation of source language datasets from the same language family as the target language.
  5. Machine translation-based data augmentation: Translating selected languages into each other to balance the training dataset.
  6. Transliteration: Transliterating non-Latin script languages into Latin script to facilitate multilingual transfer learning.
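
As a concrete illustration of strategy 2, the sketch below picks the k source languages closest to the target and adds their datasets to English. The language codes are the task's nine training languages (minus the target); the distance values are illustrative placeholders, not numbers from the paper, and in practice could come from typological vectors (e.g. URIEL/lang2vec) or from empirical transfer scores.

```python
# Placeholder typological distances from a hypothetical target (e.g. Kinyarwanda)
# to each candidate source language; the values are illustrative only.
SOURCE_DISTANCES = {
    "amh": 0.44, "arq": 0.47, "ary": 0.46, "eng": 0.62,
    "esp": 0.58, "hau": 0.31, "mar": 0.65, "tel": 0.69,
}

def k_nearest_sources(distances: dict[str, float], k: int) -> list[str]:
    """Return the k source languages closest to the target."""
    return sorted(distances, key=distances.get)[:k]

# English is always included; the k nearest languages augment it.
train_languages = ["eng"] + [
    lang for lang in k_nearest_sources(SOURCE_DISTANCES, k=2) if lang != "eng"
]
print(train_languages)  # ['eng', 'hau', 'amh'] under these placeholder distances
```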

The authors find that:

  • Knowledge transfer from multiple source languages improves STR models compared to single-source transfer.
  • Training on languages from the same family as the target language can outperform training on all available source languages, indicating the presence of language interference.
  • Script differences cause high variance in transfer performance, and transliteration does not consistently improve cross-lingual transfer (a transliteration sketch follows this list).
  • Machine translation-based data augmentation can enhance transfer performance for some languages but can also lead to shifts in label semantics.
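
The transliteration step can be illustrated with a short sketch. The summary does not name the tool used; `uroman` is a common choice for romanization, and the pure-Python `unidecode` package below is a lightweight stand-in.

```python
from unidecode import unidecode  # pip install unidecode

def str_pair_to_latin(sentence1: str, sentence2: str) -> tuple[str, str]:
    """Best-effort romanization of both sentences of an STR pair."""
    return unidecode(sentence1), unidecode(sentence2)

# Example with an Amharic (Ethiopic-script) pair; the output is an ASCII approximation.
print(str_pair_to_latin("ሰላም ለዓለም", "ሰላም"))
```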

The authors' submitted system, which fine-tunes FURINA on English, Spanish, and Hausa, achieves first place on the C8 (Kinyarwanda) test set. A sketch of this fine-tuning setup follows.
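
Below is a minimal sketch of such a setup, assuming the Hugging Face transformers and datasets APIs: `xlm-roberta-base` stands in for FURINA (whose exact checkpoint name is not given here), the two sentence pairs are placeholders for the concatenated English, Spanish, and Hausa training data, and relatedness is treated as a regression target evaluated with Spearman correlation, the task's metric.

```python
from datasets import Dataset
from scipy.stats import spearmanr
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "xlm-roberta-base"  # stand-in; the submitted system uses FURINA
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=1, problem_type="regression")

# Placeholder rows standing in for the concatenated eng + esp + hau training
# sets; labels are relatedness scores in [0, 1].
train = Dataset.from_dict({
    "sentence1": ["A man is playing a guitar.", "Una mujer corre por el parque."],
    "sentence2": ["Someone is playing music.", "Ta gudu zuwa kasuwa."],
    "label": [0.8, 0.2],
})

def tokenize(batch):
    # Encode each sentence pair jointly, as in any sentence-pair task.
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True)

train = train.map(tokenize, batched=True)

def spearman_metric(eval_pred):
    # Track C is scored with Spearman correlation against gold relatedness.
    preds, labels = eval_pred
    return {"spearman": spearmanr(preds.squeeze(), labels).correlation}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="str_model", num_train_epochs=3),
    train_dataset=train,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
    compute_metrics=spearman_metric,  # used when an eval set is provided
)
trainer.train()
```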


Stats
  • 15,123 training instances across 9 languages
  • 2,588 development instances across 14 languages
  • 7,667 test instances across 12 languages

Key Insights Distilled From

by Shijia Zhou et al. at arxiv.org, 04-04-2024

https://arxiv.org/pdf/2404.02570.pdf
MaiNLP at SemEval-2024 Task 1

Deeper Inquiries

How can cross-lingual transfer performance be further improved for languages with different scripts and in low-resource settings?

To improve cross-lingual transfer performance, the following methods can be considered:

  • Script standardization: To facilitate transfer between languages written in different scripts, the scripts can be standardized, helping the model better understand and process data written in other scripts.
  • Leveraging multilingual models: Multilingual models can be used to handle diverse scripts and low-resource languages. Because such models are trained to account for interactions among many languages, they can improve cross-lingual transfer performance.
  • Cross-script conversion: Data from languages with different scripts can be converted and normalized, helping the model better recognize the same meaning across different scripts.