The article introduces MYTE, a morpheme-based encoding paradigm designed to reduce disparities in how text is represented across languages. MYTE produces shorter encodings for the 99 languages analyzed, with the largest gains for non-European languages and non-Latin scripts. The method improves multilingual LM performance and narrows the perplexity gap between languages. Byte-level models suffer from over-segmentation, producing long byte sequences particularly for languages written in non-Latin scripts, and MYTE aims to address this. The proposed method rearranges byte codepages to free space that is then used to encode morphemes, making segmentation granularity more uniform across languages. Text representation is treated as equitable when parallel texts in different languages yield encoded sequences of similar length.
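To make the over-segmentation problem concrete, the sketch below measures how many bytes plain UTF-8 needs for a short greeting in three scripts. It does not implement MYTE; it only illustrates the length disparity that MYTE is meant to reduce. The sample words and the `utf8_length` helper are illustrative assumptions, not material from the article, and the greetings are a toy stand-in for a real parallel corpus.

```python
def utf8_length(text: str) -> int:
    """Return the number of bytes needed to encode `text` in UTF-8."""
    return len(text.encode("utf-8"))


# Toy sample (an assumption for illustration): a greeting in three scripts.
parallel = {
    "English": "Hello",    # Latin script: 1 byte per character
    "Russian": "Привет",   # Cyrillic script: 2 bytes per character
    "Hindi": "नमस्ते",       # Devanagari script: 3 bytes per code point
}

lengths = {lang: utf8_length(text) for lang, text in parallel.items()}
baseline = lengths["English"]

for lang, n_bytes in lengths.items():
    n_points = len(parallel[lang])
    # The ratio shows how much longer the byte sequence is than the English one.
    print(f"{lang}: {n_points} code points, {n_bytes} bytes, "
          f"ratio to English {n_bytes / baseline:.2f}")
```

A byte-level model sees one token per byte, so under plain UTF-8 the Cyrillic and Devanagari words occupy two and three sequence positions per code point, while the Latin word occupies one. An encoding that is equitable in the article's sense would bring these ratios closer to 1 on genuinely parallel text.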
arxiv.org