Core Concepts
Capturing deeper semantic connections between sentences, beyond simple word overlap, to enable robust multilingual communication and information retrieval across diverse languages.
Abstract
This paper presents a comprehensive analysis of systems for the Semantic Textual Relatedness (STR) task at SemEval-2024. The authors explore methods to capture semantic connections between texts in languages like English, Marathi, Hindi, and Spanish, addressing the critical gap in multilingual STR research.
The paper covers three tracks:
- Supervised Learning: The authors adapt sentence-transformer-based models like all-mpnet-base-v2 and marathi-sentence-bert-nli to compensate for the smaller size of the available corpora.
- Unsupervised Learning: The authors utilize BERT-based models, including hindi-bert-v2 and bert-base-uncased, to learn semantic relationships without relying on labeled data.
- Cross-lingual Learning: The authors translate datasets across languages (English to Hindi, and Spanish to English), then train models such as all-mpnet-base-v2 and hindi-sentence-bert-nli on the translated data.
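In all three tracks, the sentence-transformer models ultimately score a pair by embedding both sentences and taking the cosine similarity of the two vectors. A minimal pure-Python sketch of that scoring step (the toy 4-dimensional vectors below stand in for real model embeddings, which the paper's models produce at much higher dimensionality):

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors:
    # dot(u, v) / (||u|| * ||v||), which ranges from -1 to 1.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings standing in for sentence-transformer outputs.
emb_a = [0.1, 0.3, 0.5, 0.1]
emb_b = [0.1, 0.3, 0.5, 0.1]   # identical vector -> similarity 1.0
emb_c = [0.5, -0.2, 0.0, 0.3]  # different vector -> lower score

print(cosine_similarity(emb_a, emb_b))  # 1.0 for identical vectors
print(cosine_similarity(emb_a, emb_c))
```

In practice the scores would come from a library such as sentence-transformers, which exposes the same computation over model-generated embeddings.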
The authors' submissions achieved promising scores on several tracks, demonstrating the effectiveness of their proposed methods. This work aims to inspire further exploration of multilingual STR, particularly for under-resourced languages, to unlock the true potential of language understanding and empower communication across diverse cultures.
Stats
The SemRel2024 dataset consists of sentence pairs with corresponding semantic relatedness scores ranging from 0 to 1.
The dataset is divided into training, development, and test sets for English, Hindi, Marathi, and Spanish.
Quotes
"The explosive growth of online content demands robust Natural Language Processing (NLP) techniques that can capture nuanced meanings and cultural context across diverse languages."
"Semantic Textual Relatedness (STR) goes beyond superficial word overlap, considering linguistic elements and non-linguistic factors like topic, sentiment, and perspective."