The paper describes IITK's system for SemEval-2024 Task 1: Semantic Textual Relatedness. The task is to automatically predict the degree of relatedness between pairs of sentences in 14 languages, covering both high- and low-resource Asian and African languages.
For the supervised track (Track A), the system uses two approaches: BERT-based contrastive learning and a custom similarity metric. The contrastive learning approach, SimCSE, builds positive and negative sentence samples from Natural Language Inference (NLI) data and aligns related sentences in the embedding space. The custom similarity metric combines several similarity and distance measures, such as cosine similarity, Mahalanobis distance, Euclidean distance, and the Jaccard and Dice coefficients, calculated on the sentence embeddings.
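As an illustration of how several of the measures named above could be computed for one sentence pair, here is a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the Jaccard and Dice coefficients are assumed to operate on token sets (the summary does not say how they are applied to embeddings), Mahalanobis distance is left out because it needs a covariance estimate over a whole dataset, and how the paper weights or combines the features is not specified here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention of this sketch for zero vectors
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def jaccard_coefficient(tokens_a, tokens_b):
    """|A ∩ B| / |A ∪ B| on token sets (assumption: sets of tokens)."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice_coefficient(tokens_a, tokens_b):
    """2|A ∩ B| / (|A| + |B|) on token sets (assumption: sets of tokens)."""
    a, b = set(tokens_a), set(tokens_b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def similarity_features(emb_a, emb_b, tokens_a, tokens_b):
    """Collect the individual measures for one sentence pair (hypothetical helper)."""
    return {
        "cosine": cosine_similarity(emb_a, emb_b),
        "euclidean": euclidean_distance(emb_a, emb_b),
        "jaccard": jaccard_coefficient(tokens_a, tokens_b),
        "dice": dice_coefficient(tokens_a, tokens_b),
    }
```

A feature dictionary like this could then be fed to a regressor or combined with fixed weights; which of these the paper does is not stated in this summary.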
For the unsupervised track (Track B), the system explores transformer autoencoders, specifically the Transformer-based Sequential Denoising Auto-Encoder (TSDAE) approach. TSDAE corrupts the input sentences and trains the model to reconstruct the originals, with the aim of improving the quality of the sentence embeddings.
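The corruption step can be sketched as token deletion, which is the noise type the original TSDAE work favours. This is only an illustration: the function names are hypothetical, whitespace tokenization is a simplification, and the deletion ratio of 0.6 is the TSDAE default, assumed here rather than taken from the IITK system.

```python
import random


def delete_noise(tokens, del_ratio=0.6, rng=None):
    """Drop each token independently with probability del_ratio.

    At least one token is always kept so the encoder never sees empty input.
    """
    if rng is None:
        rng = random.Random()
    if not tokens:
        return []
    kept = [tok for tok in tokens if rng.random() > del_ratio]
    if not kept:
        kept = [rng.choice(tokens)]
    return kept


def make_training_pair(sentence, del_ratio=0.6, rng=None):
    """Return (corrupted_input, reconstruction_target) for one sentence."""
    tokens = sentence.split()  # assumption: whitespace tokenization
    corrupted = " ".join(delete_noise(tokens, del_ratio, rng))
    return corrupted, sentence
```

During training, the encoder would read the corrupted sentence and the decoder would be asked to regenerate the full sentence from the pooled sentence embedding alone.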
The paper also discusses the creation of a bigram relatedness corpus using a negative sampling strategy, which is intended to produce refined word embeddings for the unsupervised task.
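One possible reading of this negative sampling strategy is sketched below. The details are assumptions, not the paper's procedure: adjacent word pairs observed in the text are treated as related (label 1), and words paired at random that never appear adjacent in that order are treated as unrelated (label 0). The function name, the binary labels, and the sampling loop are all hypothetical.

```python
import random


def build_bigram_pairs(sentences, negatives_per_positive=1, seed=0):
    """Build (word_a, word_b, label) triples from raw sentences.

    label 1: the bigram (word_a, word_b) occurs in the corpus.
    label 0: word_b was sampled at random and (word_a, word_b) never occurs.
    """
    rng = random.Random(seed)
    tokenized = [s.lower().split() for s in sentences]
    positives = {(a, b) for toks in tokenized for a, b in zip(toks, toks[1:])}
    vocab = sorted({tok for toks in tokenized for tok in toks})

    examples = [(a, b, 1) for a, b in sorted(positives)]
    if len(vocab) < 2:
        return examples

    max_attempts = 20  # give up on a word if no valid negative is found
    for a, _b in sorted(positives):
        added = 0
        for _attempt in range(max_attempts):
            if added >= negatives_per_positive:
                break
            candidate = rng.choice(vocab)
            if candidate != a and (a, candidate) not in positives:
                examples.append((a, candidate, 0))
                added += 1
    return examples
```

Triples of this kind could serve as training signal for refining word embeddings, for example by pulling positive pairs together and pushing negative pairs apart.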
The system is evaluated on the datasets provided for the 14 languages. The contrastive learning approach performed poorly for some languages, possibly because of the complexity of their lexical structures and because the negative samples were not distinct enough from the positive samples. The unsupervised approach, by contrast, performed reasonably well for most languages, with correlation scores above the provided baselines.
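The correlation scores mentioned above compare predicted relatedness with gold annotations. The sketch below assumes the metric is the Spearman rank correlation, which the summary does not state explicitly; the function names are hypothetical and the code uses only the standard library.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined here; 0.0 is this sketch's convention
    return cov / math.sqrt(var_x * var_y)


def spearman(gold, predicted):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(gold) != len(predicted) or len(gold) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(gold), average_ranks(predicted))
```

Because only the ranks matter, a system is rewarded for ordering sentence pairs correctly even if its raw scores are on a different scale from the gold annotations.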
The paper concludes by acknowledging the need to study the properties of each low-resource language in more depth to improve the overall efficiency and performance of the system.
Key Insights Distilled From
by Udvas Basak,... at arxiv.org 04-09-2024
https://arxiv.org/pdf/2404.04513.pdf