The recent success of Large Language Models (LLMs) has drawn attention to the importance of tokenizer choice. A study investigated how tokenizers influence LLM downstream performance and found that tokenizer choice can significantly affect both model quality and training costs. Multilingual tokenizers require larger vocabulary sizes than English-centric ones, and applying an English-centric tokenizer to multilingual training data leads to severe downstream performance degradation and increased training costs. The study combined intrinsic evaluations (properties of the tokenization itself, such as fertility) with extrinsic evaluations (downstream task performance) to measure the impact of tokenizers, and reported a correlation between low fertility scores and higher downstream performance.
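Fertility, the intrinsic metric referred to above, is typically computed as the average number of tokens a tokenizer emits per word of a reference corpus: a value near 1 means most words stay intact, while higher values indicate heavier word splitting. Below is a minimal sketch of that computation, assuming a Hugging Face tokenizer and a placeholder corpus; the model name and sample sentences are illustrative and not those used in the study.

```python
from transformers import AutoTokenizer

def fertility(tokenizer, texts):
    """Average number of tokens per whitespace-delimited word over `texts`."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        words = text.split()
        if not words:
            continue
        # add_special_tokens=False so BOS/EOS markers do not inflate the count
        total_tokens += len(tokenizer.encode(text, add_special_tokens=False))
        total_words += len(words)
    return total_tokens / total_words

if __name__ == "__main__":
    # Placeholder tokenizer and corpus, purely for illustration.
    tok = AutoTokenizer.from_pretrained("gpt2")
    sample = [
        "Tokenizer choice can significantly affect downstream performance.",
        "Multilinguale Tokenizer benötigen größere Vokabulare.",
    ]
    print(f"fertility = {fertility(tok, sample):.2f}")
```

Run against corpora in different languages, such a script makes the vocabulary-size trade-off visible: an English-centric tokenizer will typically show markedly higher fertility on non-English text.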