Core Concepts
Glitch tokens are anomalous tokens produced by established tokenizers that can compromise the quality of responses from large language models (LLMs). This study presents a comprehensive taxonomy of glitch token types and the unexpected behaviors they induce in LLMs, and introduces an efficient iterative clustering-based technique for detecting them.
Abstract
This study explores the phenomenon of "glitch tokens" in large language models (LLMs). Glitch tokens are anomalous tokens produced by established tokenizers that can compromise the models' quality of response. The researchers conducted experiments on seven popular LLMs using three distinct tokenizers, involving a total of 182,517 tokens.
Key findings:
Symptom Taxonomy: The researchers identified five types of unexpected behaviors caused by glitch tokens in LLMs: spelling mistakes, incapability, hallucinatory completion, question repetition, and random characters. Different LLMs exhibited varied distributions of these glitch token symptoms.
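A symptom check of this kind can be sketched as a heuristic that buckets a model's reply to a "please repeat this token" prompt. The rules and thresholds below are illustrative assumptions, not the authors' actual classification criteria:

```python
import difflib

def classify_symptom(token: str, reply: str) -> str:
    """Heuristically bucket an LLM's reply to a repetition prompt into the
    five symptom types (illustrative rules, not the paper's criteria)."""
    reply = reply.strip()
    if reply == token:
        return "ok"                        # faithful repetition: not a glitch
    if not reply or "cannot" in reply.lower() or "sorry" in reply.lower():
        return "incapability"              # model refuses or fails to answer
    if "repeat" in reply.lower() and token not in reply:
        return "question repetition"       # model echoes the prompt instead
    if token in reply:
        return "hallucinatory completion"  # token plus invented extra content
    if difflib.SequenceMatcher(None, token, reply).ratio() > 0.6:
        return "spelling mistakes"         # close but garbled reproduction
    return "random characters"             # unrelated noise
```

For example, `classify_symptom("EStream", "EStraem")` lands in the spelling-mistakes bucket, while an unrelated reply of gibberish falls through to random characters.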
Glitch Token Taxonomy: The researchers manually categorized the 7,895 identified glitch tokens into five types: word tokens, letter tokens, character tokens, letter-character tokens, and special tokens. The distribution of these glitch token types was influenced by both the parameter composition of the LLMs and the tokenizer used.
Real-world Analysis: The researchers examined the prevalence of glitch tokens in three widely-used datasets (Alpaca-52k, ShareGPT-52k, and ShareGPT-90k), finding that glitch tokens constitute up to 4.17% of the tokens in these datasets.
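The prevalence measurement reduces to counting what share of a tokenized corpus falls in a known-glitch set. In this toy sketch, the token IDs and the tiny corpus are made up for illustration:

```python
# Hypothetical set of token IDs previously flagged as glitch tokens.
known_glitch_ids = {31876, 40240, 50009}

# A made-up tokenized corpus: 24 tokens, one of which (31876) is a glitch.
corpus_token_ids = [
    101, 7, 3, 99, 12, 31876, 3, 99, 12, 7, 3, 99,
    12, 7, 3, 99, 12, 7, 3, 99, 12, 7, 3, 102,
]

hits = sum(tid in known_glitch_ids for tid in corpus_token_ids)
rate = 100.0 * hits / len(corpus_token_ids)
print(f"{rate:.2f}% of tokens are glitch tokens")  # 1/24 -> 4.17%
```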
Efficient Detection: Guided by the observation that glitch tokens tend to cluster in the embedding space, the researchers proposed GlitchHunter, an iterative clustering-based technique for efficiently detecting glitch tokens. GlitchHunter significantly reduces the number of queries required and the time consumption compared to exhaustive search.
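The clustering idea can be shown in miniature: if glitch tokens form a tight blob in embedding space, probing one representative per cluster lets most of the vocabulary be skipped. Everything below is a synthetic stand-in, not the GlitchHunter implementation — the embeddings are random blobs and `query_model` is a hypothetical oracle in place of a real LLM query:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic embedding table: 200 "normal" tokens near the origin plus 20
# "glitch" tokens in a tight, distant blob, mimicking the observation
# that glitch tokens cluster in the embedding space.
normal = rng.normal(0.0, 1.0, size=(200, 8))
glitch = rng.normal(5.0, 0.1, size=(20, 8))
embeddings = np.vstack([normal, glitch])
true_glitch_ids = set(range(200, 220))  # hypothetical ground truth

queries = 0

def query_model(token_id):
    """Stand-in for asking an LLM to repeat a token (hypothetical oracle)."""
    global queries
    queries += 1
    return token_id in true_glitch_ids

# Two-center clustering with farthest-point initialization: the second
# center lands inside the distant glitch blob.
c0 = embeddings[0]
c1 = embeddings[np.argmax(np.linalg.norm(embeddings - c0, axis=1))]
d0 = np.linalg.norm(embeddings - c0, axis=1)
d1 = np.linalg.norm(embeddings - c1, axis=1)
labels = (d1 < d0).astype(int)

# Probe one representative per cluster; exhaustively test only the
# clusters whose representative misbehaves, skipping the rest entirely.
found = set()
for lab, center in ((0, c0), (1, c1)):
    members = np.flatnonzero(labels == lab)
    dists = np.linalg.norm(embeddings[members] - center, axis=1)
    rep = int(members[np.argmin(dists)])
    if query_model(rep):
        found.update(int(t) for t in members if query_model(int(t)))
```

In this toy setup all 20 glitch tokens are recovered with roughly 22 oracle queries instead of 220, which is the kind of saving the iterative-clustering approach targets.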
Evaluation: The researchers evaluated GlitchHunter on eight established LLMs, where it achieved on average up to 99.44% precision and 63.20% recall, outperforming three baseline methods.
The study provides valuable insights into mitigating tokenization-related errors in LLMs and lays the foundation for further research on improving the robustness and reliability of these models.
Stats
Glitch tokens constitute up to 4.17% of the tokens in the Alpaca-52k, ShareGPT-52k, and ShareGPT-90k datasets.
GlitchHunter reduces the number of queries required by 73.40% and the time consumption by 80.22% compared to exhaustive search.
Quotes
"Glitch tokens are anomalous tokens produced by established tokenizers that can compromise the models' quality of response."
"Glitch tokens tend to cluster in the embedding space."