Key Concepts
An unsupervised method for inducing bilingual lexicons between low-resource languages and related high-resource languages.
Summary
The article introduces a novel method for unsupervised bilingual lexicon induction (BLI) between related low-resource and high-resource languages. It addresses a limitation of existing approaches, which rely on good-quality embeddings for both languages. The method outperforms existing approaches on low-resource languages from the Indic continuum, and the authors release the resulting lexicons for five low-resource Indic languages. Its limitations are that it applies only to related language pairs and that it depends on orthographic distance to identify cognate equivalents.
Introduction:
Bilingual lexicons are essential resources with various uses in NLP.
Interest in unsupervised BLI is growing, but existing methods depend on good-quality embeddings for both languages.
Linguistic Setup in India:
India has numerous low-resource dialects that are closely related to high-resource languages.
Related Work:
Recent approaches use contextual embeddings or BERT-based models for BLI.
Method:
A new unsupervised BLI method is introduced for pairs of a low-resource language (LRL) and a related high-resource language (HRL).
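The summary states that the method depends on orthographic distance to identify cognate equivalents. The sketch below illustrates one common orthographic measure, length-normalized Levenshtein distance; it is not the paper's implementation. The function names, the 0.5 threshold, and the romanized example words are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length, so the result lies in [0, 1]."""
    if not a and not b:
        return 0.0
    return edit_distance(a, b) / max(len(a), len(b))


def closest_candidate(
    source_word: str,
    candidates: Iterable[str],
    max_distance: float = 0.5,  # assumed cut-off, not a value from the paper
) -> Optional[Tuple[str, float]]:
    """Return the orthographically closest candidate, or None if none is close enough."""
    best: Optional[Tuple[str, float]] = None
    for cand in candidates:
        d = normalized_distance(source_word, cand)
        # Keep the first candidate with the smallest distance within the cut-off.
        if d <= max_distance and (best is None or d < best[1]):
            best = (cand, d)
    return best


# Hypothetical LRL word matched against a tiny hypothetical HRL candidate list.
match = closest_candidate("paani", ["pani", "dudh"])
```

A measure like this only works when the two languages share a script and many cognates, which is consistent with the stated limitation that the method applies to related language pairs.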
Experimental Settings:
Monolingual data is drawn from shared tasks and existing corpora.
Results and Discussion:
The proposed approach outperforms both baselines, VecMap+CSLS and CSCBLI.
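For readers unfamiliar with the first baseline, CSLS (cross-domain similarity local scaling) is a retrieval criterion applied after two embedding spaces have been aligned. It scores a source word x and a target word y as 2·cos(x, y) minus the mean similarity of x to its k nearest target neighbours minus the mean similarity of y to its k nearest source neighbours, which penalizes "hub" words that are close to everything. The following is a minimal pure-Python sketch of that criterion, assuming the embeddings are already aligned and non-empty; the function names and the toy vectors are hypothetical, and this is not the VecMap code.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_top_k(values: List[float], k: int) -> float:
    """Mean of the k largest values (or of all values if there are fewer than k)."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top) if top else 0.0


def csls_scores(src: List[Vector], tgt: List[Vector], k: int = 10) -> List[List[float]]:
    """CSLS score for every (source, target) pair of already-aligned embeddings."""
    sims = [[cosine(s, t) for t in tgt] for s in src]
    # Mean similarity of each source word to its k nearest target words.
    r_src = [mean_top_k(row, k) for row in sims]
    # Mean similarity of each target word to its k nearest source words.
    r_tgt = [mean_top_k([sims[i][j] for i in range(len(src))], k)
             for j in range(len(tgt))]
    return [[2 * sims[i][j] - r_src[i] - r_tgt[j] for j in range(len(tgt))]
            for i in range(len(src))]


def csls_translate(src: List[Vector], tgt: List[Vector], k: int = 10) -> List[int]:
    """Index of the best-scoring target word for each source word (tgt must be non-empty)."""
    scores = csls_scores(src, tgt, k)
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy aligned spaces: source word 0 should map to target word 1, and vice versa.
toy_src = [[1.0, 0.0], [0.0, 1.0]]
toy_tgt = [[0.0, 1.0], [1.0, 0.0]]
toy_translations = csls_translate(toy_src, toy_tgt, k=1)
```

Because CSLS operates on aligned embeddings for both languages, its quality is bounded by the quality of those embeddings, which is the dependency the summary identifies as a weakness for data-imbalanced pairs.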
Details of released lexicons:
Bilingual lexicons for five low-resource Indic languages are released under a CC BY-NC 4.0 license.
Conclusion:
The new method fills a gap in the existing literature and performs better than the baselines on low-resource languages.
Statistics
State-of-the-art BLI methods exhibit near-zero performance for severely data-imbalanced language pairs.
Quotes
"Most existing approaches depend on aligning monolingual word embedding spaces."
"Our main contribution is a novel unsupervised BLI method."