Core Concepts
Tokenization is a necessary step in designing state-of-the-art language models, and this paper provides a theoretical analysis of its role in enabling transformers to model complex data distributions.
Abstract
The paper investigates tokenization from a theoretical perspective by studying the behavior of transformers on simple data generating processes, such as kth-order Markov processes. The key findings are:
In the absence of tokenization, transformers trained on data drawn from certain simple kth-order Markov processes empirically fail to learn the correct conditional distribution and instead predict characters according to the stationary (unigram) distribution of the source, across a wide variety of hyperparameter choices. A small worked example of how large this gap can be is sketched below.
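To make the gap concrete, here is a minimal sketch (not from the paper) comparing the per-character entropy rate of a binary order-1 "switching" Markov source with the cross-entropy of the best unigram model, which can do no better than match the stationary distribution. The switch probability delta and the binary alphabet are illustrative assumptions, chosen in the spirit of the simple sources the paper studies.

```python
import numpy as np

def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli(p) variable."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# Order-1 binary "switching" source: flip the previous character with prob. delta.
# Its stationary distribution is uniform over {0, 1}.
delta = 0.1

# Optimal per-character cross-entropy = entropy rate of the chain = h(delta).
optimal_loss = binary_entropy(delta)

# The best unigram (i.i.d.) model matches the stationary marginal, so its
# per-character cross-entropy is the entropy of the uniform distribution: 1 bit.
unigram_loss = binary_entropy(0.5)

print(f"entropy rate (optimal): {optimal_loss:.3f} bits/char")
print(f"best unigram model:     {unigram_loss:.3f} bits/char")
print(f"ratio: {unigram_loss / optimal_loss:.1f}x")
```

As delta shrinks, the entropy rate h(delta) goes to zero while the unigram loss stays at 1 bit, so the character-level unigram predictor becomes arbitrarily worse than optimal.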
With the addition of tokenization, transformers are able to model the probabilities of sequences drawn from the source near-optimally, achieving small cross-entropy loss. This behavior is observed with a variety of tokenizers commonly used in practice, such as BPE and LZW; a sketch of an LZW-style tokenizer follows.
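For concreteness, here is a hedged sketch of an LZW-style tokenizer: the dictionary is grown from training text in the usual LZW fashion, and new text is then encoded by greedy longest-match against that dictionary. The function names, the binary alphabet, and the max_size cap are illustrative assumptions, not the paper's exact construction.

```python
def build_lzw_dictionary(text, alphabet=("0", "1"), max_size=4096):
    """Build an LZW-style dictionary of substrings from training text.

    Standard LZW: scan the text; whenever the current phrase plus the next
    character falls outside the dictionary, add it and restart the phrase.
    """
    dictionary = set(alphabet)
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch
        else:
            if len(dictionary) < max_size:
                dictionary.add(phrase + ch)
            phrase = ch
    return dictionary

def encode_greedy(text, dictionary):
    """Encode text into tokens by greedy longest-match against the dictionary."""
    tokens, i = [], 0
    while i < len(text):
        j = i + 1
        best = text[i:j]
        while j <= len(text) and text[i:j] in dictionary:
            best = text[i:j]
            j += 1
        tokens.append(best)
        i += len(best)
    return tokens

# Example usage (illustrative): build on a training stream, encode a new one.
vocab = build_lzw_dictionary("0010000100000110" * 100)
print(encode_greedy("000100000010", vocab)[:10])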
The paper analyzes a toy tokenizer that adds all length-k sequences into the dictionary, and shows that as the dictionary size grows, unigram models trained on the tokens get better at modeling the probabilities of sequences drawn from Markov sources. It then proves that tokenizers used in practice, such as LZW and a variant of BPE, also satisfy this property but require much smaller dictionaries to achieve any target cross-entropy loss.
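The toy tokenizer is easy to simulate: chunk the character stream into non-overlapping length-k blocks, treat each block as a token, and fit a unigram model over the tokens. The sketch below (an illustration using the same binary switching source assumed above, not the paper's code) shows the per-character cross-entropy of that token-level unigram model falling toward the source's entropy rate as k, and hence the dictionary size 2^k, grows.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def sample_switching_source(n, delta):
    """Sample n characters from the binary order-1 switching source."""
    x = np.empty(n, dtype=int)
    x[0] = rng.integers(2)
    flips = rng.random(n) < delta
    for i in range(1, n):
        x[i] = 1 - x[i - 1] if flips[i] else x[i - 1]
    return x

def toy_tokenizer_loss(seq, k):
    """Per-character cross-entropy (bits) of a unigram model over length-k blocks."""
    n = (len(seq) // k) * k
    tokens = ["".join(map(str, seq[i:i + k])) for i in range(0, n, k)]
    counts = Counter(tokens)
    total = len(tokens)
    probs = {t: c / total for t, c in counts.items()}
    # Empirical cross-entropy of the fitted unigram model, normalized per character.
    return -sum(counts[t] * np.log2(probs[t]) for t in counts) / n

delta = 0.1
seq = sample_switching_source(200_000, delta)
for k in (1, 2, 4, 8):
    print(f"k={k}: {toy_tokenizer_loss(seq, k):.3f} bits/char")
```

With delta = 0.1 the entropy rate is about 0.47 bits/char; the k = 1 tokenizer (no tokenization) is stuck near 1 bit/char, and larger k closes much of the gap.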
The paper also discusses the importance of generalization, where tokenizers need to perform well on new sequences that were not in the training data. It shows that there exist tokenizers that generalize poorly, and that the choice of encoding algorithm can also affect generalization.
Stats
The cross-entropy loss of the best unigram model can be much higher than the optimal cross-entropy loss for kth-order Markov processes, with the gap scaling exponentially in 1/δ, where δ is the minimum transition probability.
The size of the dictionary required by the toy tokenizer to achieve a target cross-entropy loss scales exponentially in 1/δ.
Quotes
"There are very simple kth-order Markov processes such that in the absence of any tokenization, transformers trained on data drawn this source are empirically observed to predict characters according to the stationary distribution of the source under a wide variety of hyperparameter choices."
"When trained with tokenization, transformers are empirically observed to break through this barrier and are able to capture the probability of sequences under the Markov distribution near-optimally."