Core Concepts
This research paper presents evidence that Transformers trained for next-token prediction develop an implicit, gradient-based optimization algorithm (termed "mesa-optimization") within their forward pass, explaining their in-context learning abilities.
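Schematically, the claim is that the network builds an internal regression problem from the context tokens and predicts with the implicitly optimized solution. The notation below (internal inputs $a_i$, targets $b_i$, learning rate $\eta$) is illustrative rather than the paper's exact formulation:

$$\hat{L}_t(W) = \tfrac{1}{2}\sum_{i \le t}\lVert W a_i - b_i\rVert^2, \qquad W \leftarrow W - \eta\,\nabla_W \hat{L}_t(W).$$

A single such step from $W_0 = 0$ gives $W_1 = \eta \sum_{i \le t} b_i a_i^{\top}$, so the resulting prediction $W_1 a_t = \eta \sum_{i \le t} b_i\,(a_i^{\top} a_t)$ has exactly the form of an (unnormalized) linear self-attention readout with keys $a_i$, values $b_i$, and query $a_t$.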
Stats
Linear probes were able to decode past tokens from the present token's representation in the first Transformer layer with high accuracy, indicating token binding.
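For concreteness, a linear probe is a ridge-regression readout fitted on frozen activations. The sketch below shows the general recipe, not the paper's exact setup; the arrays standing in for layer-one activations and past-token targets are placeholders.

```python
import numpy as np

# Placeholder data: activations of the first Transformer layer at each position,
# and the quantity from `lag` steps in the past we try to read out linearly.
rng = np.random.default_rng(0)
n_positions, d_model, d_target = 2000, 64, 16
layer1_activations = rng.normal(size=(n_positions, d_model))  # frozen features
past_targets = rng.normal(size=(n_positions, d_target))       # e.g. token at t - lag

def fit_linear_probe(X, Y, reg=1e-3):
    """Closed-form ridge regression: the probe itself has no nonlinearity."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ Y)

# Train/test split so the reported decoding score is held-out.
split = n_positions // 2
W_probe = fit_linear_probe(layer1_activations[:split], past_targets[:split])
pred = layer1_activations[split:] @ W_probe
resid = past_targets[split:] - pred
r2 = 1.0 - resid.var() / past_targets[split:].var()
# With random placeholder data this score is near zero; in the reported
# experiments the past tokens are linearly decodable with high accuracy.
print(f"held-out probe R^2: {r2:.3f}")
```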
The decoding horizon for past tokens increased when Transformers were trained on partially-observed tasks, suggesting a more complex internal model.
Linear probes successfully decoded the hidden state of nonlinear sequence generators from early MLP layers in Transformers trained on nonlinear tasks.
Replacing softmax self-attention layers with linear counterparts from the second layer onwards resulted in minimal performance loss for sufficiently large input dimensions, suggesting that beyond the first layer the attention mechanism operates in an approximately linear regime.
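The ablation amounts to swapping the row-normalized softmax kernel for raw dot products from layer two onward. A minimal sketch of the two variants follows (toy single-head attention, no causal masking, illustrative scaling); in the paper the comparison is made via the test loss of retrained models, not by matching activations.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard scaled dot-product attention with row-wise softmax.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V):
    # Same computation with the softmax removed: attention weights are raw
    # dot products, i.e. the "linear counterpart" substituted in the ablation.
    return (Q @ K.T) @ V / Q.shape[-1]

rng = np.random.default_rng(1)
T, d = 32, 256  # large key/query dimension, the regime where the swap is benign
Q, K, V = (rng.normal(size=(T, d)) / d**0.5 for _ in range(3))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```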
The test loss of a single-layer linear attention network converged to that of one step of gradient descent with an optimized learning rate and initial parameters.
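The identity behind this result can be checked directly: one gradient step on an in-context least-squares loss, starting from a zero initialization, produces an unnormalized linear-attention readout. The sketch below omits the tuned initial weights mentioned above, and all array names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
t, d_in, d_out = 16, 8, 4
A = rng.normal(size=(t, d_in))    # in-context inputs a_1..a_t
B = rng.normal(size=(t, d_out))   # in-context targets b_1..b_t
a_query = rng.normal(size=d_in)   # query input
eta = 0.1                         # learning rate of the implicit GD step

# One gradient step on L(W) = 0.5 * sum_i ||W a_i - b_i||^2 from W0 = 0:
# the gradient at W0 = 0 is -sum_i b_i a_i^T = -B.T @ A, so W1 = eta * B.T @ A.
W1 = eta * B.T @ A
pred_gd = W1 @ a_query

# Unnormalized linear self-attention with keys A, values B, query a_query:
# output = eta * sum_i b_i * (a_i . a_query).
pred_attn = eta * B.T @ (A @ a_query)

print(np.allclose(pred_gd, pred_attn))  # True: the two computations coincide
```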
A 6-layer linear attention model's performance could be accurately described by a compressed expression (CompressedAlg-6) using only 0.5% of the original model's parameters.
Probing experiments showed that linear decoders could predict next-token targets and preconditioned inputs with increasing accuracy as a function of layer depth and context length, supporting the presence of mesa-optimization.
A hybrid architecture combining one softmax attention layer with one mesa-layer achieved the best performance among all models tested, highlighting the potential of incorporating mesa-optimization principles into Transformer design.
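As a rough sketch of what a mesa-layer computes, under the assumption that it solves a ridge-regularized least-squares problem over the prefix at every step and applies the solution to the current query (head structure and the efficient recursive update are omitted):

```python
import numpy as np

def mesa_layer(K, V, Q, lam=1.0):
    """Toy mesa-layer: at each step t, solve
        W_t = argmin_W  sum_{i<=t} ||W k_i - v_i||^2 + lam * ||W||^2
    in closed form and emit W_t q_t. A direct solve per step is used here
    purely for clarity; it is not an efficient implementation."""
    T, d_k = K.shape
    d_v = V.shape[1]
    out = np.zeros((T, d_v))
    for t in range(T):
        Kt, Vt = K[: t + 1], V[: t + 1]
        Wt = Vt.T @ Kt @ np.linalg.inv(Kt.T @ Kt + lam * np.eye(d_k))
        out[t] = Wt @ Q[t]
    return out

rng = np.random.default_rng(3)
T, d_k, d_v = 12, 6, 3
K, V, Q = rng.normal(size=(T, d_k)), rng.normal(size=(T, d_v)), rng.normal(size=(T, d_k))
print(mesa_layer(K, V, Q).shape)  # (12, 3)
```

The design contrast with plain attention is that such a layer performs the in-context least-squares optimization exactly in closed form, rather than relying on stacked attention layers to approximate it with implicit gradient steps.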