Key concepts
High-capacity transformers can mimic Bayesian inference when performing in-context learning across a diverse range of linear and nonlinear function classes. The inductive bias of in-context learning is determined by the pretraining data distribution.
Summary
The paper examines how far the Bayesian perspective can go in explaining in-context learning (ICL) in transformers. It extends the standard meta-ICL setup to a hierarchical meta-ICL (HMICL) setup in which pretraining tasks are drawn from a union of multiple task families.
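A minimal sketch of how such HMICL pretraining prompts could be generated is given below; the two task families (dense and sparse linear regression), the dimensions, and the mixture weights are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

# Minimal sketch of HMICL-style prompt generation: first pick a task family
# from a mixture, then a task from that family, then the in-context examples.
# The families (dense vs. sparse linear regression), dimensions, and mixture
# weights are hypothetical choices for illustration.

def sample_prompt(num_points=40, dim=8, rng=None):
    if rng is None:
        rng = np.random.default_rng()

    # Step 1: pick a task family from the mixture.
    family = rng.choice(["dense_linear", "sparse_linear"], p=[0.5, 0.5])

    # Step 2: sample a task (a weight vector) from that family's prior.
    w = rng.normal(size=dim)
    if family == "sparse_linear":
        keep = rng.choice(dim, size=2, replace=False)
        mask = np.zeros(dim)
        mask[keep] = 1.0
        w = w * mask  # only a few active coordinates

    # Step 3: sample inputs and form the (x_i, y_i) prompt pairs.
    X = rng.normal(size=(num_points, dim))
    y = X @ w
    return family, X, y

family, X, y = sample_prompt()
print(family, X.shape, y.shape)
```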
The key findings are:
High-capacity transformers can perform ICL in the HMICL setting, where training prompts are sampled from a mixture of function classes. Their predictions match those of the Bayesian predictor, which combines the predictions of the individual function classes according to how consistent each class is with the observed prompt (a generic form of this predictor is sketched after the findings below).
The inductive bias of transformers during ICL, such as a preference for lower frequencies when fitting Fourier series, is determined by the pretraining data distribution. Transformers do not add any extra inductive bias of their own; they simply mimic the Bayesian predictor (a toy illustration of such a low-frequency prior appears at the end of this summary).
Transformers can generalize to new function classes not seen during pretraining, but this involves deviations from the Bayesian predictor. The paper proposes hypotheses to explain these deviations, relating them to the pretraining inductive bias and the capacity of the transformer.
The paper also observes an intriguing "forgetting" phenomenon: over the course of pretraining, transformers first generalize to the full distribution of tasks, but later lose this generalization and fit only the pretraining distribution.
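As a point of reference for the first finding above, one generic way to write such a Bayesian predictor is as a posterior-weighted mixture over task families; the notation below is an illustrative sketch rather than the paper's exact formulation, with D_k denoting the observed prompt.

```latex
% Posterior-predictive (Bayesian mixture) form over task families m:
P\!\left(y_{k+1} \mid x_{k+1}, D_k\right)
  \;=\; \sum_{m} \underbrace{P\!\left(m \mid D_k\right)}_{\text{posterior over families}}
        \, P\!\left(y_{k+1} \mid x_{k+1}, D_k, m\right),
\qquad D_k = \{(x_i, y_i)\}_{i=1}^{k}
```

As the prompt grows, the posterior weight shifts toward the family most consistent with the observed examples, which is the behavior the transformers are reported to match.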
Overall, the results suggest that the Bayesian perspective provides a unifying explanation for the inductive biases and generalization capabilities of transformers in the ICL setting.
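To make the Fourier example concrete, the sketch below shows one hypothetical way a pretraining distribution can encode a preference for low frequencies: tasks are Fourier series whose coefficient variance decays with frequency, so a predictor matching this prior will favor low-frequency fits on ambiguous prompts. The frequency cutoff and decay rate are illustrative assumptions.

```python
import numpy as np

# Hypothetical Fourier-series task family whose pretraining prior favors low
# frequencies: coefficient variance decays with frequency, so most sampled
# tasks are dominated by low-frequency components.

def sample_fourier_task(max_freq=10, decay=1.5, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    freqs = np.arange(1, max_freq + 1)
    scale = freqs ** (-decay)              # smaller variance at higher frequencies
    a = rng.normal(size=max_freq) * scale  # cosine coefficients
    b = rng.normal(size=max_freq) * scale  # sine coefficients

    def f(x):
        x = np.asarray(x, dtype=float)[..., None]
        return (a * np.cos(freqs * x) + b * np.sin(freqs * x)).sum(axis=-1)

    return f

f = sample_fourier_task()
print(f(np.linspace(-np.pi, np.pi, 5)))
```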
Statistics
No specific numerical statistics are highlighted here; the paper's contribution is conceptual insight supported by empirical observations of transformer behavior in the in-context learning setting.
Quotes
"High-capacity transformers can mimic Bayesian inference when performing in-context learning across a diverse range of linear and nonlinear function classes."
"The inductive bias of in-context learning is determined by the pretraining data distribution."
"Transformers can generalize to new function classes not seen during pretraining, but this involves deviations from the Bayesian predictor."