Comprehensive Analysis of Glitch Tokens in Large Language Models: Categorization Taxonomy and Effective Detection
Glitch tokens are anomalous tokens produced by established tokenizers that can compromise the quality of responses from large language models (LLMs). This study presents a comprehensive taxonomy of glitch token types and the unexpected behaviors they induce in LLMs, and introduces an efficient iterative clustering-based technique for detecting them.