In this paper, the authors introduce a probabilistic model to analyze in-context learning (ICL) of linear functions. They study the behavior of an optimally pretrained model under the squared loss and derive a closed-form expression for the task posterior distribution. The paper explains two real-world phenomena observed with large language models (LLMs) and validates its findings through experiments with Transformers and LLMs.
The authors propose a new probabilistic model for pretraining data by introducing multiple task groups and task-dependent input distributions. They analyze how in-context examples update each component's posterior mean and mixture probability, leading to a quantitative understanding of the dual operating modes of in-context learning.
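The per-component update described above can be sketched with standard conjugate Gaussian algebra. The sketch below is an illustrative assumption, not the paper's exact parameterization: tasks `w` are drawn from a mixture of Gaussian components with means `mus` and shared prior variance `tau2`, labels follow `y = Xw + noise` with noise variance `sigma2`, and in-context examples `(X, y)` update both each component's posterior mean and the mixture weights via the marginal likelihood.

```python
import numpy as np

def posterior_update(X, y, mus, tau2, sigma2, pis):
    """Update a Gaussian-mixture task prior with in-context examples (X, y).

    Each component k has prior w ~ N(mus[k], tau2 * I) and weight pis[k];
    labels are y = X @ w + N(0, sigma2) noise. Returns the per-component
    posterior means and the reweighted mixture probabilities.
    """
    n, d = X.shape
    # Posterior precision is shared across components (same tau2, sigma2).
    precision = X.T @ X / sigma2 + np.eye(d) / tau2
    cov = np.linalg.inv(precision)
    post_means, log_evid = [], []
    for mu, pi in zip(mus, pis):
        # Conjugate posterior mean for this component's prior.
        m = cov @ (X.T @ y / sigma2 + mu / tau2)
        post_means.append(m)
        # Marginal likelihood of the context: y ~ N(X mu, sigma2 I + tau2 X X^T).
        S = sigma2 * np.eye(n) + tau2 * X @ X.T
        r = y - X @ mu
        _, logdet = np.linalg.slogdet(S)
        ll = -0.5 * (r @ np.linalg.solve(S, r) + logdet + n * np.log(2 * np.pi))
        log_evid.append(np.log(pi) + ll)
    log_evid = np.array(log_evid)
    w = np.exp(log_evid - log_evid.max())  # stable softmax over log-evidence
    return np.array(post_means), w / w.sum()
```

With a few examples drawn from one component's task, the mixture weight of that component grows toward 1 while its posterior mean sharpens around the true `w`, which is the quantitative picture behind the dual operating modes (task retrieval vs. task learning).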
Furthermore, they shed light on phenomena observed in practice but previously unexplained, such as "early ascent" and the bounded efficacy of biased-label ICL. The paper also offers insights into Bayesian inference, gradient descent, sample complexity, and generalization bounds for ICL with Transformers.
Key insights by Ziqian Lin, K... at arxiv.org, 03-01-2024
https://arxiv.org/pdf/2402.18819.pdf