In this work, the authors introduce a probabilistic model for analyzing in-context learning (ICL) of linear functions. They study the behavior of an optimally pretrained model under the squared loss and derive a closed-form expression for the task posterior distribution. The analysis explains two real-world phenomena observed with large language models (LLMs), and the findings are validated through experiments on Transformers and LLMs.
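The paper's closed-form posterior is tied to its specific pretraining distribution; the following is a minimal illustrative sketch under simplified assumptions (a Gaussian-mixture prior over the linear weight with isotropic components of variance tau^2, and Gaussian label noise of variance sigma^2, neither of which is the paper's exact parameterization). Under squared loss the optimal pretrained model is the posterior-mean predictor, and the task posterior remains a Gaussian mixture whose component means and mixture weights are updated by the in-context examples:

```latex
% Assumed setup (illustrative, not the paper's exact parameterization):
%   prior   w ~ \sum_m \pi_m \, \mathcal{N}(\mu_m, \tau^2 I)
%   data    y_i = w^\top x_i + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)
% Stacking the k in-context examples into X \in \mathbb{R}^{k \times d}, y \in \mathbb{R}^k:
\begin{align}
  p(w \mid X, y)
    &= \sum_m \tilde{\pi}_m \, \mathcal{N}\!\bigl(w \mid \tilde{\mu}_m, \tilde{\Sigma}\bigr),
  &\tilde{\Sigma}
    &= \Bigl(\tfrac{1}{\sigma^2} X^\top X + \tfrac{1}{\tau^2} I\Bigr)^{-1}, \\
  \tilde{\mu}_m
    &= \tilde{\Sigma}\Bigl(\tfrac{1}{\sigma^2} X^\top y + \tfrac{1}{\tau^2}\mu_m\Bigr),
  &\tilde{\pi}_m
    &\propto \pi_m \, \mathcal{N}\!\bigl(y \mid X\mu_m,\; \sigma^2 I + \tau^2 X X^\top\bigr).
\end{align}
% Squared loss makes the Bayes-optimal prediction the posterior mean of the label:
%   \hat{y}(x_{\mathrm{query}}) = \sum_m \tilde{\pi}_m \, \tilde{\mu}_m^\top x_{\mathrm{query}}.
```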
The authors propose a new probabilistic model of pretraining data that introduces multiple task groups and task-dependent input distributions. They analyze how in-context examples update each component's posterior mean and mixture probability, yielding a quantitative understanding of the dual operating modes of in-context learning: task learning and task retrieval.
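As a concrete sketch of those two updates (not the authors' implementation; the prior parameterization, noise level, and function names below are assumptions carried over from the simplified setup above), the code computes the per-component posterior means and the posterior mixture weights for a Gaussian-mixture prior over linear tasks, and forms the squared-loss-optimal posterior-mean prediction:

```python
import numpy as np
from scipy.stats import multivariate_normal


def mixture_posterior(X, y, prior_means, prior_weights, tau2=1.0, sigma2=0.25):
    """Posterior over the linear task w given k in-context examples (X, y).

    Assumed (illustrative) generative model:
        w ~ sum_m prior_weights[m] * N(prior_means[m], tau2 * I)
        y_i = w @ x_i + eps_i,  eps_i ~ N(0, sigma2)
    Returns one posterior mean per task group and the updated mixture weights.
    """
    k, d = X.shape
    # Posterior covariance is shared across components in this isotropic setup.
    Sigma_post = np.linalg.inv(X.T @ X / sigma2 + np.eye(d) / tau2)
    post_means, log_evidence = [], []
    for mu_m in prior_means:
        # Standard Bayesian linear-regression update of the component mean.
        post_means.append(Sigma_post @ (X.T @ y / sigma2 + mu_m / tau2))
        # Marginal likelihood of the observed labels under this component.
        cov_y = sigma2 * np.eye(k) + tau2 * (X @ X.T)
        log_evidence.append(multivariate_normal.logpdf(y, mean=X @ mu_m, cov=cov_y))
    # Posterior mixture weights: prior weight times evidence, renormalized in log space.
    log_w = np.log(prior_weights) + np.array(log_evidence)
    post_weights = np.exp(log_w - log_w.max())
    post_weights /= post_weights.sum()
    return np.array(post_means), post_weights


def posterior_mean_prediction(x_query, post_means, post_weights):
    """Squared-loss-optimal prediction: the posterior mean of the query label."""
    return post_weights @ (post_means @ x_query)


# Hypothetical usage: two task groups, a handful of in-context examples.
rng = np.random.default_rng(0)
prior_means = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
X = rng.normal(size=(4, 2))
y = X @ prior_means[0] + 0.1 * rng.normal(size=4)
means, weights = mixture_posterior(X, y, prior_means, np.array([0.5, 0.5]))
print(weights, posterior_mean_prediction(np.array([0.5, 0.5]), means, weights))
```

In this sketch, the evidence term reweights the mixture toward the task group that best explains the in-context examples (retrieval), while each component mean moves from its prior toward the least-squares fit of those examples (learning); that is the mechanism behind the dual operating modes described above.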
Furthermore, they shed light on phenomena observed in practice but previously unexplained, such as the "early ascent" phenomenon and the bounded efficacy of biased-label ICL. The discussion also offers insights into Bayesian inference, gradient descent, sample complexity, and generalization bounds for ICL with Transformers.