The article introduces MYTE, a morpheme-based encoding paradigm designed to reduce disparities in how text is represented across languages. MYTE produces shorter encodings for 99 languages, with the largest gains for non-European languages and non-Latin scripts. This in turn improves multilingual LM performance and narrows the perplexity gap between languages. The work targets the over-segmentation seen in byte-level models, where text is split into long byte sequences, particularly for non-Latin script languages. The proposed method rearranges byte codepages to free up space, which is then used to encode morphemes, making segmentation granularity more consistent across languages. Text representation is considered equitable when parallel texts in different languages are encoded into sequences of similar length.
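The length disparity that motivates this work can be seen with plain byte-level encoding. The sketch below is not MYTE itself: it only measures UTF-8 byte lengths (assumed here as the byte-level baseline) for a tiny, hypothetical set of parallel texts, namely the word "water" in English, Greek, and Hindi. The names `PARALLEL`, `utf8_lengths`, and `parity` are illustrative and do not come from the paper.

```python
import unicodedata

# Hypothetical parallel "texts": the single word "water" in three languages.
PARALLEL = {
    "en": "water",
    "el": "νερό",
    "hi": "पानी",
}


def utf8_lengths(texts):
    """Return the UTF-8 byte length of each NFC-normalized text."""
    return {
        lang: len(unicodedata.normalize("NFC", text).encode("utf-8"))
        for lang, text in texts.items()
    }


def parity(lengths, reference="en"):
    """Return each language's length divided by the reference language's length."""
    ref = lengths[reference]
    return {lang: n / ref for lang, n in lengths.items()}


if __name__ == "__main__":
    lengths = utf8_lengths(PARALLEL)
    for lang, ratio in parity(lengths).items():
        print(f"{lang}: {lengths[lang]} bytes, {ratio:.2f}x the English length")
```

Under UTF-8, Latin letters take one byte each, Greek letters two, and Devanagari code points three, so the same word grows from 5 bytes in English to 8 in Greek and 12 in Hindi. An equitable encoding in the sense described above would bring these ratios closer to 1 on parallel text.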
Key Insights
by Tomasz Limis... at arxiv.org, 03-19-2024
https://arxiv.org/pdf/2403.10691.pdf
Deeper Questions