The paper introduces MEXMA, a novel approach for training cross-lingual sentence encoders that leverages both token-level and sentence-level objectives. The key insights are:
Current cross-lingual sentence encoders are typically trained with sentence-level objectives alone, which can discard information captured at the token level and thereby degrade the quality of the resulting sentence representations.
MEXMA combines sentence-level and token-level objectives, where the sentence representation in one language is used to predict masked tokens in another language. This allows the encoder to be updated directly by both the sentence representation and the individual token representations.
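To make that training signal concrete, here is a minimal PyTorch sketch of such a combined objective. The class name, the concatenation-based fusion of the sentence vector with the token states, and the `-100` label convention are illustrative assumptions rather than the paper's actual architecture; the point is only that the masked-token loss in language B is conditioned on the sentence embedding from language A, so gradients reach the encoder through both the sentence representation and the token representations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLingualMaskedObjective(nn.Module):
    """Sketch: predict masked tokens in language B conditioned on the
    sentence embedding of the parallel sentence in language A, so the
    encoder is updated through both the sentence vector and the
    individual token states."""

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        # Hypothetical fusion: concatenate sentence vector with each token state.
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)
        self.mlm_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, sent_emb_a, token_states_b, masked_labels_b):
        # sent_emb_a:      (batch, hidden)      sentence embedding, language A
        # token_states_b:  (batch, seq, hidden) encoder states of masked input, language B
        # masked_labels_b: (batch, seq)         original token ids, -100 where unmasked
        expanded = sent_emb_a.unsqueeze(1).expand_as(token_states_b)
        fused = self.fuse(torch.cat([token_states_b, expanded], dim=-1))
        logits = self.mlm_head(fused)
        return F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            masked_labels_b.view(-1),
            ignore_index=-100,  # only masked positions contribute to the loss
        )
```

In training, this token-level loss would be applied in both directions (A predicting B's masks and vice versa) and combined with a sentence-level alignment loss into a single objective.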
Experiments show that adding the token-level objectives greatly improves the sentence representation quality across several tasks, including bitext mining, classification, and pair classification. MEXMA outperforms current state-of-the-art cross-lingual sentence encoders like LaBSE and SONAR.
The paper also provides an extensive analysis of the model, examining the impact of the different components, the scalability to different model and data sizes, and the potential to improve other alignment approaches like contrastive learning.
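For reference, the contrastive alignment approach mentioned here can be summarized with a short sketch. This is a generic InfoNCE-style loss over in-batch parallel sentence pairs, of the kind used by encoders such as LaBSE; the function name and temperature value are assumptions, and this is not MEXMA's own objective.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(emb_a, emb_b, temperature=0.05):
    """InfoNCE-style loss over a batch of parallel sentence pairs:
    each sentence in language A should be most similar to its own
    translation in language B among all in-batch candidates."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature  # (batch, batch) cosine similarities
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    # Symmetric: align A -> B and B -> A.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```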
The analysis of the token embeddings reveals that MEXMA effectively encodes semantic, lexical, and contextual information in the individual tokens, which contributes to the improved sentence representations.
Overall, the paper demonstrates the importance of integrating token-level objectives in cross-lingual sentence encoding and presents a novel approach that achieves state-of-the-art performance.