The paper investigates whether transformer language models "think ahead" and intentionally pre-compute features that will be useful for predicting future tokens, a phenomenon known as "pre-caching". The authors propose two hypotheses: (1) pre-caching, in which the model deliberately computes features at the current position that are useful only for future predictions, and (2) breadcrumbs, in which the features most useful for the immediate next-token prediction simply turn out to be useful at later positions as well, with no deliberate preparation.
To test these hypotheses, the authors introduce the concept of "myopic" transformer models, which are incapable of deliberate pre-caching. They show that in a synthetic setting, the transformer model does exhibit clear pre-caching behavior. However, in experiments on natural language data using GPT-2 models, the authors find that the myopia gap (the performance difference between the vanilla and myopic models) is relatively small, suggesting that pre-caching is not a significant factor in natural-language prediction.
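To make the "myopic" idea concrete, the sketch below shows one possible way to enforce it in a causal self-attention layer: keys and values from the context are detached from the computation graph, so earlier positions receive no training signal from later positions' losses and therefore cannot learn to pre-cache. This is a minimal illustrative assumption, not the authors' exact training scheme, and the module name and hyperparameters are hypothetical.

```python
# Minimal PyTorch sketch of a "myopic" causal self-attention layer (an assumption
# about one way myopia could be enforced, not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MyopicCausalSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        B, T, D = x.shape
        H = self.n_heads

        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Detach keys and values: the forward pass still attends to the past as usual,
        # but no gradient flows from a later position's loss back into an earlier
        # position's hidden state, so pre-caching cannot be learned.
        k, v = k.detach(), v.detach()

        q = q.view(B, T, H, D // H).transpose(1, 2)  # (B, H, T, d_head)
        k = k.view(B, T, H, D // H).transpose(1, 2)
        v = v.view(B, T, H, D // H).transpose(1, 2)

        # Standard causal attention over the (gradient-detached) context.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).contiguous().view(B, T, D)
        return self.proj(out)
```

Under this scheme the forward computation is unchanged; only backpropagation is restricted. The myopia gap can then be read off as the difference in language-modeling loss between a normally trained model and one trained with this restricted attention.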
The authors conclude that on natural language data, transformer language models do not intentionally prepare information for the future to a significant extent. Instead, they compute features that are useful for predicting the immediate next token, which then turn out to be helpful at future steps as well (the "breadcrumbs" hypothesis).