The recent success of Large Language Models (LLMs) has drawn attention to tokenizer choice. This study investigates how the tokenizer affects LLM downstream performance and finds that the choice can significantly affect both model performance and training cost. Multilingual tokenizers need larger vocabularies than English-centric ones, and applying an English-centric tokenizer to multilingual training leads to severe downstream performance degradation and increased training costs. The authors ran intrinsic and extrinsic evaluations to measure the tokenizer's impact, and report correlations between low fertility (the average number of tokens produced per word) and higher downstream performance.
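Fertility, the intrinsic metric mentioned in the summary, is commonly defined as the average number of tokens a tokenizer produces per word. The sketch below illustrates that definition only; it is not the study's evaluation code. The helper names `fertility` and `make_greedy_tokenizer`, the toy vocabulary, and the whitespace-based word splitting are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Set


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word (assumed word definition)."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        for word in text.split():
            n_words += 1
            n_tokens += len(tokenize(word))
    if n_words == 0:
        raise ValueError("no words to tokenize")
    return n_tokens / n_words


def make_greedy_tokenizer(vocab: Set[str]) -> Callable[[str], List[str]]:
    """Hypothetical toy tokenizer: longest match first, single characters as fallback."""
    def tokenize(word: str) -> List[str]:
        pieces: List[str] = []
        i = 0
        while i < len(word):
            # Try the longest remaining substring first; a single character always matches.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab or j == i + 1:
                    pieces.append(word[i:j])
                    i = j
                    break
        return pieces
    return tokenize


# Toy English-centric vocabulary (illustrative, not taken from the study).
english_vocab = {"the", "token", "izer", "split", "s", "words"}
tok = make_greedy_tokenizer(english_vocab)

english_fertility = fertility(["the tokenizer splits words"], tok)
german_fertility = fertility(["der Tokenizer zerlegt Wörter"], tok)

print(english_fertility)  # 1.5
print(german_fertility > english_fertility)  # True
```

With this toy vocabulary the German sentence falls back to mostly single-character tokens, so its fertility is much higher than the English one. This is the kind of inefficiency that arises when an English-centric vocabulary is applied to other languages.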
Key Insights Extracted From
by Mehd... at arxiv.org 03-19-2024
https://arxiv.org/pdf/2310.08754.pdf