The study introduces RomanSetu, a method for extending large language models (LLMs) to non-English languages using romanized text. Through continued pretraining on romanized data followed by instruction fine-tuning, the approach reduces token fertility and improves performance across a range of NLP tasks. Romanized text also aligns more closely with English, enabling more effective cross-lingual transfer.
Large language models excel in English but struggle in non-English languages, which are underrepresented in their training data. The study examines romanization as an efficient bridge between English-centric LLMs and other languages. Results show that romanized text reduces token fertility, aligns more closely with English, and improves performance on natural language understanding (NLU), natural language generation (NLG), and machine translation (MT) tasks.
Efficiency gains include lower memory consumption, faster generation, and the ability to fit longer inputs within a fixed sequence-length limit when processing romanized text. The study positions romanization as a promising direction for extending LLM capabilities to languages traditionally underrepresented in NLP.
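To make the token-fertility claim concrete, the sketch below romanizes a Hindi sentence and compares subword tokens per word for the native-script and romanized versions. It is illustrative only: the ITRANS transliteration scheme, the example sentence, and the gpt2 tokenizer are assumptions standing in for the paper's actual romanization pipeline and English-centric LLM tokenizer.

```python
# Minimal sketch: compare token fertility (subword tokens per word) for the same
# sentence in native Devanagari vs. a rule-based romanization.
# Assumes the `indic-transliteration` and `transformers` packages are installed;
# the gpt2 tokenizer is a stand-in for any English-centric LLM tokenizer.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate
from transformers import AutoTokenizer

native = "भारत एक विशाल देश है"  # hypothetical Hindi example sentence
romanized = transliterate(native, sanscript.DEVANAGARI, sanscript.ITRANS)

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def fertility(text: str) -> float:
    """Average number of subword tokens produced per whitespace-separated word."""
    return len(tokenizer.tokenize(text)) / len(text.split())

print(f"native   : {fertility(native):.2f} tokens/word")
print(f"romanized: {fertility(romanized):.2f} tokens/word")
```

Lower fertility on the romanized side means the same content consumes fewer tokens, which is what drives the memory, latency, and effective context-length gains described above.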
Source: arxiv.org