Core Concepts
On natural language data, transformer language models show little sign of intentionally pre-computing features for future tokens. Instead, they compute features that are useful for predicting the immediate next token, and these features then turn out to be helpful at future steps as well (the "breadcrumbs" hypothesis).
Abstract
The paper investigates whether transformer language models "think ahead" and deliberately pre-compute features that will be useful for predicting future tokens, a behavior termed "pre-caching". The authors contrast two hypotheses:
Pre-caching: The model deliberately computes and stores features that are expected to be useful for the future, even if they are irrelevant to the present.
Breadcrumbs: The features that most benefit the present inference task are the same ones that are most useful to the future. In performing the present forward pass, the model "unintentionally" leaves a trace ("breadcrumbs") that future passes then pick up. (The distinction is sketched more formally below.)
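One way to make the distinction concrete is through the training gradient. Roughly in the spirit of the paper's framing (the notation here is illustrative, not taken verbatim from the paper), write the sequence loss as a sum of per-position next-token losses and consider how it reaches the hidden state computed at position i:

```latex
% L = \sum_j \ell_j : total loss, with \ell_j the next-token loss at position j
% x_i : hidden (residual-stream) state computed at position i
\frac{\partial L}{\partial x_i}
  = \underbrace{\frac{\partial \ell_i}{\partial x_i}}_{\text{present term}}
  \;+\; \underbrace{\sum_{j > i} \frac{\partial \ell_j}{\partial x_i}}_{\text{future terms}}
```

Pre-caching means the future terms actively shape what is computed at position i during training; breadcrumbs means features that help future positions emerge even when only the present term drives learning. The myopic models introduced next can be thought of as trained with the future terms removed.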
To distinguish these hypotheses, the authors introduce "myopic" transformer models, which are trained so that the loss at each position cannot influence the computation at earlier positions and are therefore incapable of deliberate pre-caching. They show that in a synthetic setting the transformer does exhibit clear pre-caching behavior. In experiments on natural language data with GPT-2 models, however, the myopia gap (the performance difference between the vanilla and the myopic model) is relatively small, suggesting that pre-caching is not a significant factor.
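A minimal sketch of how such a myopic model could be trained, assuming myopia is enforced by detaching the keys and values that other positions contribute to attention while keeping each position's own key/value on the gradient path (the class name, the myopic flag, and this particular construction are illustrative; the paper's exact implementation may differ):

```python
import torch
import torch.nn as nn


class CausalSelfAttention(nn.Module):
    """Single-head causal self-attention with an optional myopic mode.

    In myopic mode the forward pass is numerically unchanged, but gradients
    cannot flow through the keys/values that other positions contribute, so
    the loss at position j cannot shape the features computed at earlier
    positions i < j; that is, there is no gradient pressure toward pre-caching.
    """

    def __init__(self, d_model: int, myopic: bool = False):
        super().__init__()
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False)
        self.v = nn.Linear(d_model, d_model, bias=False)
        self.o = nn.Linear(d_model, d_model, bias=False)
        self.scale = d_model ** -0.5
        self.myopic = myopic

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        _, T, _ = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)

        if self.myopic:
            kd, vd = k.detach(), v.detach()
            eye = torch.eye(T, dtype=torch.bool, device=x.device)
            # Off-diagonal scores use detached keys (no gradient to other
            # positions); each query's own key keeps its gradient path.
            scores = torch.where(eye,
                                 q @ k.transpose(-1, -2),
                                 q @ kd.transpose(-1, -2)) * self.scale
            attn = scores.masked_fill(causal, float("-inf")).softmax(-1)
            out = attn @ vd
            # Adds zero numerically, but restores the gradient path through
            # each position's own value vector.
            a_diag = torch.diagonal(attn, dim1=-2, dim2=-1).unsqueeze(-1)
            out = out + a_diag * (v - vd)
        else:
            scores = (q @ k.transpose(-1, -2)) * self.scale
            attn = scores.masked_fill(causal, float("-inf")).softmax(-1)
            out = attn @ v

        return self.o(out)
```

Because the MLP and normalization layers of a transformer act position-wise, stacking attention blocks like this removes every cross-position gradient path, so a model trained with myopic=True cannot be rewarded for pre-caching. The myopia gap is then the difference in held-out next-token loss between an ordinarily trained model and its myopic counterpart.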
The authors conclude that on natural language data, transformer language models do not intentionally prepare information for the future to a significant extent. Instead, they compute features that are useful for predicting the immediate next token, which then turn out to be helpful at future steps as well (the "breadcrumbs" hypothesis).