The paper addresses the problem of cross-lingual sentence embedding, particularly for low-resource languages. It observes that current cross-lingual models trained solely with sentence-level alignment objectives exhibit under-alignment of word representations between high-resource and low-resource languages.
To address this, the paper proposes a novel framework called WACSE that combines three training objectives: sentence-level translation ranking (TR) and two word-level objectives, aligned word prediction (AWP) and word translation ranking (WTR).
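The paper's exact loss formulations are not reproduced in this summary. As a rough illustration, a translation ranking objective is typically implemented as a contrastive loss over parallel sentence pairs with in-batch negatives; the word-level WTR objective presumably applies the same ranking idea to embeddings of aligned word pairs rather than whole sentences. The PyTorch sketch below shows the generic sentence-level form only; the function name, temperature, and symmetric formulation are illustrative assumptions, not WACSE's definitions.

```python
# Minimal sketch of a translation ranking (TR) style loss with in-batch
# negatives, as used by dual-encoder sentence embedding models. This is a
# generic illustration; WACSE's actual objective may differ in detail.
import torch
import torch.nn.functional as F


def translation_ranking_loss(src_emb: torch.Tensor,
                             tgt_emb: torch.Tensor,
                             temperature: float = 0.05) -> torch.Tensor:
    """src_emb, tgt_emb: (batch, dim) embeddings of parallel sentences.

    Each source sentence is scored against every target sentence in the
    batch; the aligned translation on the diagonal is the positive and
    all other targets serve as negatives.
    """
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.T / temperature  # (batch, batch) cosine similarities
    labels = torch.arange(src.size(0), device=src.device)
    # Symmetric cross-entropy: rank targets given sources and vice versa.
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.T, labels)) / 2


if __name__ == "__main__":
    # Random embeddings stand in for encoder outputs of parallel sentences.
    src = torch.randn(8, 768)   # e.g. high-resource language sentences
    tgt = torch.randn(8, 768)   # their translations
    print(translation_ranking_loss(src, tgt).item())
```

With in-batch negatives, larger batches yield harder ranking problems, which is one reason dual-encoder sentence embedding models are commonly trained with large batch sizes.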
The experiments on the Tatoeba dataset demonstrate that the proposed word-aligned training objectives can substantially improve cross-lingual sentence embedding, especially for low-resource languages. The model also retains competitive performance on a broader range of tasks, including STS22, BUCC, and XNLI, where most languages are high-resource.
The analysis further reveals that incorporating language identification information benefits low-resource languages but can hurt performance in the overall 36-language setting. The AWP and WTR objectives each prove effective in enhancing cross-lingual sentence embeddings, with the combination of all three objectives (TR, AWP, and WTR) yielding the best results.
Key insights obtained from: Zhongtao Mia..., arxiv.org, 04-04-2024
https://arxiv.org/pdf/2404.02490.pdf