Mundra, N., Kishore, A. N., Dabre, R., Puduppully, R., Kunchukuttan, A., & Khapra, M. M. (2024). An Empirical Comparison of Vocabulary Expansion and Initialization Approaches for Language Models. arXiv preprint arXiv:2407.05841v2.
This research paper investigates the impact of different vocabulary expansion and embedding initialization methods on the performance of pre-trained language models (LMs) adapted for multilingual tasks. The authors aim to determine the most effective strategies for initializing new vocabulary embeddings when extending a pre-trained LM to support new languages.
The authors experiment with two pre-trained language models, RoBERTa (encoder-based) and LLaMA2 (decoder-based), and expand their vocabularies to support four target languages: Hindi, Tamil, Russian, and German. They compare six different embedding initialization methods: Constrained Word2Vec (CW2V, a novel approach proposed in this paper), OFA, Univariate, Multivariate, Mean, and Random. The performance of the expanded models is evaluated on five downstream tasks: XNLI, NER, QA, Machine Translation, and XLSUM. The impact of continual pre-training (CPT) on the effectiveness of different initialization methods is also analyzed.
The authors conclude that efficient large-scale multilingual adaptation of pre-trained language models can be achieved even with simpler embedding initialization methods, provided the new embeddings lie within the convex hull of the existing embeddings. Continual pre-training remains essential for maximizing the performance of the expanded models.
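To make the convex-hull condition concrete, the sketch below (a minimal illustration, not the authors' implementation) initializes each new token embedding as a convex combination of the existing embedding rows, which by construction places it inside their convex hull; uniform weights recover the "Mean" baseline discussed in the paper. The function name, tensor shapes, and use of PyTorch are assumptions for illustration only.

```python
import torch

def init_new_embeddings(old_emb: torch.Tensor, num_new: int, seed: int = 0) -> torch.Tensor:
    """Initialize `num_new` embeddings as convex combinations of existing rows.

    old_emb: (V, d) matrix of pre-trained token embeddings.
    Returns a (num_new, d) matrix whose rows lie in the convex hull of old_emb.
    """
    g = torch.Generator().manual_seed(seed)
    V, _ = old_emb.shape
    # Non-negative weights over the old vocabulary, normalized to sum to 1.
    # Uniform weights (1/V for every row) would reproduce mean initialization.
    w = torch.rand(num_new, V, generator=g)
    w = w / w.sum(dim=1, keepdim=True)
    return w @ old_emb

# Toy usage: expand a (V=1000, d=64) embedding table by 50 new tokens.
old = torch.randn(1000, 64)            # stand-in for a pre-trained embedding matrix
new = init_new_embeddings(old, 50)
expanded = torch.cat([old, new], dim=0)
print(expanded.shape)                   # torch.Size([1050, 64])
```

In practice the expanded matrix would replace the model's input (and, for tied weights, output) embedding table before continual pre-training; the paper's more elaborate methods such as CW2V and OFA differ in how the combination weights are chosen, not in this basic expansion step.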
This research provides valuable insights into the importance of embedding initialization for multilingual language model adaptation. The findings suggest that simpler and computationally less expensive methods can be effectively used for vocabulary expansion, potentially democratizing access to high-performing multilingual language models.
The study is limited to four target languages and a specific set of downstream tasks. Future research could explore the generalizability of these findings to other languages and tasks. Additionally, investigating the impact of different continual pre-training objectives and data augmentation techniques on the effectiveness of various initialization methods could be beneficial.