Core Concepts
Leveraging word alignment models to explicitly align semantically equivalent words between high-resource and low-resource languages can enhance cross-lingual sentence embeddings, particularly for low-resource languages.
Summary
The paper addresses the problem of cross-lingual sentence embedding, particularly for low-resource languages. It observes that current cross-lingual models trained solely with sentence-level alignment objectives exhibit under-alignment of word representations between high-resource and low-resource languages.
To address this, the paper proposes a novel framework called WACSE that incorporates three training objectives:
- Translation Ranking (TR): Aligns the sentence-level semantics between parallel sentences.
- Aligned Word Prediction (AWP): Utilizes the contextual representations of masked words to predict their aligned counterparts in another language, aiming to align word-level semantics.
- Word Translation Ranking (WTR): Aligns word-level semantically equivalent units within parallel sentences.
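The ranking objectives above are typically implemented as in-batch contrastive losses: each source item should rank its true translation above all other candidates in the batch. The following is a minimal sketch in plain Python, not the paper's implementation; the embedding values, batch construction, and temperature are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def translation_ranking_loss(src_emb, tgt_emb, temperature=0.05):
    """In-batch contrastive (ranking) loss: the i-th source embedding should
    score its aligned target (index i) higher than every other target in the
    batch. Averaged negative log-softmax of the positive pair."""
    n = len(src_emb)
    loss = 0.0
    for i in range(n):
        sims = [cosine(src_emb[i], t) / temperature for t in tgt_emb]
        log_z = math.log(sum(math.exp(s) for s in sims))
        loss += -(sims[i] - log_z)
    return loss / n

# Toy batch: two "parallel" sentence pairs in a 3-d embedding space.
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
tgt = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
print(translation_ranking_loss(src, tgt))        # low: pairs are aligned
print(translation_ranking_loss(src, tgt[::-1]))  # high: pairs are shuffled
```

WTR can be read as the same ranking mechanism applied at the word level, with aligned word pairs within parallel sentences taking the place of parallel sentence pairs.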
Experiments on the Tatoeba dataset demonstrate that the proposed word-alignment training objectives substantially improve cross-lingual sentence embeddings, especially for low-resource languages. The model also remains competitive on a broader range of tasks, including STS22, BUCC, and XNLI, where most languages are high-resource.
The analysis further reveals that incorporating language identification information benefits low-resource languages but can be detrimental in the full 36-language setting. The AWP and WTR objectives each prove effective at enhancing cross-lingual sentence embeddings, and combining all three objectives (TR, AWP, and WTR) yields the best results.
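The finding that all three objectives together work best corresponds to training on a joint loss. A minimal sketch follows; the function name and the weighting scheme (`w_awp`, `w_wtr`) are assumptions for illustration, not values specified in this summary.

```python
def joint_loss(l_tr, l_awp, l_wtr, w_awp=1.0, w_wtr=1.0):
    """Joint training objective: sentence-level translation ranking (TR)
    plus the two word-level alignment losses (AWP, WTR).
    The weights are illustrative assumptions."""
    return l_tr + w_awp * l_awp + w_wtr * l_wtr

# Example: combine per-objective loss values from one training step.
print(joint_loss(0.5, 0.3, 0.2))
```

Setting a weight to zero recovers the corresponding ablation (e.g. `w_awp=0.0, w_wtr=0.0` leaves TR-only training).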
Statistics
The training dataset consists of 36 language pairs with a total of 36 million parallel sentences.
The low-resource languages considered in the experiments are: tl, jv, sw, ml, te, mr, kk, and ka.
The number of Wikipedia articles available per language is used as one of the criteria to identify low-resource languages.
Quotes
"The field of cross-lingual sentence embedding has recently seen great advancements, but research concerning low-resource languages has lagged due to the scarcity of parallel corpora."
"To address this under-alignment, we introduce a new framework featuring two word-level alignment objectives: aligned word prediction and word translation ranking."
"The experiment results demonstrate that the proposed word-aligned training objectives can enhance cross-lingual sentence embedding, particularly for low-resource languages, as evidenced on the Tatoeba dataset."