Core concept
Transliteration-based post-training alignment improves cross-lingual transfer in multilingual language models, especially between related languages with different scripts, by aligning representations in the original and Latin scripts.
Statistics
The authors sampled 10% of the available data for each language or a minimum of 10k sentences, whichever was larger, from the Glot500-c training dataset.
The Mediterranean-Amharic-Farsi group consists of 10 languages, 5 scripts, and around 16M sentences.
The South+East Asian Languages group consists of 10 languages, 7 scripts, and around 4M sentences.
The aligned model achieved, on average, more than 20% higher accuracy than the baseline Glot500 model on the SR-B sentence retrieval task.
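The per-language sampling rule stated above (10% of the available data, or 10k sentences if that is larger) can be written as a small function. This is only an illustrative sketch, not the authors' code: the function name `sample_size`, its parameters, the round-down behaviour, and the cap for languages that have fewer than 10k sentences in total are all assumptions that the summary does not specify.

```python
def sample_size(available: int, percent: int = 10, floor: int = 10_000) -> int:
    """Return how many sentences to sample for one language.

    Hypothetical helper: takes `percent` of the available sentences,
    but never fewer than `floor` sentences when that many exist.
    """
    # Integer arithmetic avoids floating-point rounding; rounding down
    # is an assumption, since the source does not state how to round.
    proportional = (available * percent) // 100

    # "Whichever was larger": the proportional share or the fixed floor.
    target = max(proportional, floor)

    # Assumption: a language with fewer sentences than the target
    # simply contributes everything it has.
    return min(target, available)
```

Under these assumptions, a language with 1,000,000 sentences contributes 100,000 of them, a language with 50,000 sentences contributes the 10,000-sentence minimum, and a language with only 4,000 sentences contributes all 4,000.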