Core Concepts
The author proposes using romanization to bridge English and non-English languages efficiently, demonstrating improved performance and alignment with English representations.
Abstract
The study introduces RomanSetu, a method for extending Large Language Models (LLMs) to non-English languages using romanized text. By continually pretraining on romanized data and then instruction tuning, the approach reduces token fertility and improves performance across various NLP tasks. Romanized text also aligns better with English representations, enabling more effective cross-lingual transfer.
Large language models excel in English but struggle in non-English languages, which are underrepresented in their training data. The study explores romanization as an efficient bridge between English-heavy LLMs and other languages. Results show that romanized text reduces token fertility, improves alignment with English, and enhances performance in natural language understanding (NLU), natural language generation (NLG), and machine translation (MT) tasks.
Efficiency gains are observed through reduced memory consumption, faster generation times, and increased sequence length limits when processing romanized text. The study highlights the promising direction of leveraging romanization for extending LLM capabilities to languages traditionally underrepresented in NLP.
Stats
Romanized text not only reduces token fertility by 2x-4x but also matches or outperforms native-script representations on downstream tasks.
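Token fertility is the average number of tokens a tokenizer emits per word. A minimal sketch of why native scripts inflate it, using a toy byte-level tokenizer as a stand-in (this is an illustration, not the paper's actual LLM tokenizer):

```python
# Toy illustration of token fertility: tokens emitted per word.
# Assumption: a naive byte-level tokenizer stands in for an English-centric
# LLM tokenizer, which often falls back to bytes for unseen scripts.

def fertility(text: str) -> float:
    """Tokens (here: UTF-8 bytes) per whitespace-separated word."""
    tokens = list(text.encode("utf-8"))
    words = text.split()
    return len(tokens) / len(words)

native = "नमस्ते"       # Devanagari: 6 codepoints, 3 UTF-8 bytes each
romanized = "namaste"  # Latin: 7 codepoints, 1 byte each

print(fertility(native))     # 18.0
print(fertility(romanized))  # 7.0
```

Under this toy tokenizer the native script costs roughly 2.6x as many tokens per word, in line with the 2x-4x reduction the study reports for real tokenizers.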
The embeddings computed on romanized text exhibit closer alignment with their English translations than those from the native script.
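Alignment of this kind is typically measured with cosine similarity between embedding vectors. A minimal sketch with made-up toy vectors (illustrative only; the study computes embeddings from actual model states):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy vectors standing in for sentence embeddings (illustrative values,
# not numbers from the paper).
english = [1.0, 0.0, 0.2]
romanized = [0.9, 0.1, 0.3]
native = [0.2, 1.0, 0.5]

# "Closer alignment with English" shows up as a higher similarity for the
# romanized embedding than for the native-script one.
assert cosine(english, romanized) > cosine(english, native)
```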
Quotes
"RomanSetu presents a promising direction for leveraging the power of English LLMs in languages traditionally underrepresented in NLP."
"Our approach involves continual pretraining on romanized text followed by instruction tuning, showing improved efficiency and cross-lingual transfer."