The paper investigates whether transformer language models "think ahead" and deliberately pre-compute features that will be useful for predicting future tokens, a phenomenon known as "pre-caching". The authors propose two competing hypotheses: pre-caching, in which the model intentionally prepares information for future positions, and "breadcrumbs", in which features computed for the immediate next-token prediction simply turn out to be useful later as well.
To test these hypotheses, the authors introduce the concept of "myopic" transformer models, which are incapable of deliberate pre-caching. They show that in a synthetic setting, the transformer model does exhibit clear pre-caching behavior. However, in experiments on natural language data using GPT-2 models, the authors find that the myopia gap (the performance difference between vanilla and myopic models) is relatively small, suggesting that pre-caching is not a significant factor.
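As a rough illustration, the myopia gap can be read as a difference in next-token cross-entropy between the two models. The sketch below is not the paper's code: it assumes Hugging Face-style causal language models (an output with a `.logits` field), and the helper names are made up for illustration.

```python
# Minimal sketch of the "myopia gap": the difference in next-token cross-entropy
# between a vanilla model and a myopically trained one (assumed HF-style interface).
import torch
import torch.nn.functional as F

@torch.no_grad()
def next_token_loss(model, input_ids: torch.Tensor) -> float:
    """Mean next-token cross-entropy for a causal LM over a batch of token ids."""
    logits = model(input_ids).logits                      # (batch, seq, vocab)
    shift_logits = logits[:, :-1].reshape(-1, logits.size(-1))
    shift_labels = input_ids[:, 1:].reshape(-1)
    return F.cross_entropy(shift_logits, shift_labels).item()

def myopia_gap(vanilla_model, myopic_model, input_ids: torch.Tensor) -> float:
    """Positive gap: the vanilla model benefits from information that the
    myopic model was prevented from preparing for future positions."""
    return next_token_loss(myopic_model, input_ids) - next_token_loss(vanilla_model, input_ids)
```

A small gap on natural language, as reported for GPT-2, is what motivates the conclusion that deliberate pre-caching contributes little there.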
The authors conclude that on natural language data, transformer language models do not intentionally prepare information for the future to a significant extent. Instead, they compute features that are useful for predicting the immediate next token, which then turn out to be helpful at future steps as well (the "breadcrumbs" hypothesis).
Source: Wilson Wu, Jo... at arxiv.org, 04-02-2024
https://arxiv.org/pdf/2404.00859.pdf