Core Concept
Integrating token-level and sentence-level objectives in cross-lingual sentence encoders significantly improves the quality of sentence representations across various tasks.
Summary
The paper introduces MEXMA, a novel approach for training cross-lingual sentence encoders that leverages both token-level and sentence-level objectives. The key insights are:
Current cross-lingual sentence encoders typically use only sentence-level objectives. This can discard token-level information, which in turn degrades the quality of the sentence representations.
MEXMA combines sentence-level and token-level objectives, where the sentence representation in one language is used to predict masked tokens in another language. This allows the encoder to be updated directly by both the sentence representation and the individual token representations.
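The combined objective described above can be sketched as follows. This is a minimal toy illustration, not the paper's actual architecture: the mean pooling, the shared linear prediction head `W`, and the simple additive combination of losses are all assumptions for clarity (in the real model, masked-token prediction also conditions on the masked sentence's own unmasked tokens, and gradients flow through a full transformer encoder).

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size = 16, 100

tokens_a = rng.normal(size=(5, d))  # token vectors, sentence in language A
tokens_b = rng.normal(size=(6, d))  # token vectors, its translation in language B
W = rng.normal(size=(d, vocab_size)) * 0.01  # hypothetical prediction head

def sentence_vec(tokens):
    # Mean pooling stands in for the encoder's sentence representation.
    return tokens.mean(axis=0)

def log_softmax(x):
    m = x.max()
    return x - m - np.log(np.exp(x - m).sum())

def cross_lingual_mlm_loss(sent_vec_other, masked_target_ids):
    # Token-level objective: the sentence vector from the OTHER language
    # must predict the ids of the masked tokens in this language.
    logps = [log_softmax(sent_vec_other @ W)[t] for t in masked_target_ids]
    return -float(np.mean(logps))

def alignment_loss(sa, sb):
    # Sentence-level objective: pull the two sentence vectors together
    # (cosine distance, one of several possible alignment terms).
    return 1.0 - float(sa @ sb / (np.linalg.norm(sa) * np.linalg.norm(sb)))

sa, sb = sentence_vec(tokens_a), sentence_vec(tokens_b)
loss = (cross_lingual_mlm_loss(sa, masked_target_ids=[3, 42])  # A predicts B's masks
        + cross_lingual_mlm_loss(sb, masked_target_ids=[7])    # B predicts A's masks
        + alignment_loss(sa, sb))
```

Because the masked-token losses backpropagate through the sentence vector of the opposite language, both the sentence representation and the individual token representations receive direct gradient signal, which is the mechanism the paper credits for the improved representations.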
Experiments show that adding the token-level objectives greatly improves the sentence representation quality across several tasks, including bitext mining, classification, and pair classification. MEXMA outperforms current state-of-the-art cross-lingual sentence encoders like LaBSE and SONAR.
The paper also provides an extensive analysis of the model, examining the impact of the different components, the scalability to different model and data sizes, and the potential to improve other alignment approaches like contrastive learning.
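For context on the contrastive alignment approaches mentioned above, here is a generic InfoNCE-style sentence-alignment loss over a batch of parallel sentence vectors. This is a standard formulation, not MEXMA's exact loss; the batch construction and temperature value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
batch, d = 4, 8
za = rng.normal(size=(batch, d))              # sentence vectors, language A
zb = za + 0.1 * rng.normal(size=(batch, d))   # noisy stand-ins for translations

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(za, zb, temperature=0.1):
    # Each A-sentence should score highest against its own translation
    # (the diagonal) among all B-sentences in the batch.
    sims = normalize(za) @ normalize(zb).T / temperature
    m = sims.max(axis=1, keepdims=True)
    logp = sims - m - np.log(np.exp(sims - m).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(logp)))

contrastive_loss = info_nce(za, zb)
```

A loss like this operates purely at the sentence level; the paper's point is that adding token-level supervision on top of such alignment objectives can further improve them.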
The analysis of the token embeddings reveals that MEXMA effectively encodes semantic, lexical, and contextual information in the individual tokens, which contributes to the improved sentence representations.
Overall, the paper demonstrates the importance of integrating token-level objectives in cross-lingual sentence encoding and presents a novel approach that achieves state-of-the-art performance.
Statistics
Example parallel sentence pair used in the paper:
"The car is red." (English)
"El coche es rojo." (Spanish)
Quotes
"Current pre-trained cross-lingual sentence encoders approaches use sentence-level objectives only. This can lead to loss of information, especially for tokens, which then degrades the sentence representation."
"We show that adding token-level objectives greatly improves the sentence representation quality across several tasks."