The authors introduce a probabilistic model for analyzing in-context learning (ICL) of linear functions. They study the behavior of an optimally pretrained model under squared loss and derive a closed-form expression for the task posterior distribution. The content explains two real-world phenomena observed with large language models (LLMs) and validates the findings through experiments with Transformers and LLMs.
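For reference, here is a minimal sketch of the kind of closed form such a derivation produces, under simplifying assumptions not taken from the paper: a single Gaussian task prior and Gaussian label noise (the paper's setting, with multiple task groups and task-dependent inputs, is richer).

```latex
% Sketch: posterior over a linear task w after observing k in-context examples,
% assuming w ~ N(mu_0, Sigma_0) and y_i = w^T x_i + eps_i with eps_i ~ N(0, sigma^2).
\begin{align}
w \mid \{(x_i, y_i)\}_{i=1}^{k} &\sim \mathcal{N}(\mu_k, \Sigma_k), \\
\Sigma_k &= \left(\Sigma_0^{-1} + \tfrac{1}{\sigma^2} X^\top X\right)^{-1}, \\
\mu_k &= \Sigma_k \left(\Sigma_0^{-1}\mu_0 + \tfrac{1}{\sigma^2} X^\top y\right),
\end{align}
```

where $X \in \mathbb{R}^{k \times d}$ stacks the in-context inputs and $y \in \mathbb{R}^{k}$ the corresponding labels.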
Their probabilistic model of pretraining data introduces multiple task groups and task-dependent input distributions. The authors analyze how in-context examples update each component's posterior mean and mixture probability, leading to a quantitative understanding of the dual operating modes of ICL: task retrieval and task learning.
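The following is a small sketch of this kind of posterior update, assuming a mixture-of-Gaussians prior over linear-regression tasks; the function name, the priors, and the noise level `sigma2` are illustrative choices, not the paper's exact construction.

```python
# Sketch (not the paper's exact derivation): Bayesian posterior update for a
# mixture-of-Gaussians prior over linear tasks w, given k in-context examples
# (x_i, y_i) with y_i = w^T x_i + noise.
import numpy as np
from scipy.stats import multivariate_normal

def posterior_over_mixture(X, y, pis, mus, Sigmas, sigma2):
    """Update each component's posterior mean/covariance and its mixture weight.

    X: (k, d) in-context inputs, y: (k,) labels,
    pis: (M,) prior mixture weights, mus: (M, d) prior component means,
    Sigmas: (M, d, d) prior component covariances, sigma2: label-noise variance.
    """
    k, _ = X.shape
    post_mus, post_Sigmas, log_w = [], [], []
    for pi_m, mu_m, Sigma_m in zip(pis, mus, Sigmas):
        # Per-component Gaussian posterior over w (standard Bayesian linear regression).
        Sigma_post = np.linalg.inv(np.linalg.inv(Sigma_m) + X.T @ X / sigma2)
        mu_post = Sigma_post @ (np.linalg.inv(Sigma_m) @ mu_m + X.T @ y / sigma2)
        post_mus.append(mu_post)
        post_Sigmas.append(Sigma_post)
        # Mixture weight is re-weighted by the marginal likelihood of the
        # in-context examples: y ~ N(X mu_m, X Sigma_m X^T + sigma2 I).
        marg_cov = X @ Sigma_m @ X.T + sigma2 * np.eye(k)
        log_w.append(np.log(pi_m)
                     + multivariate_normal.logpdf(y, mean=X @ mu_m, cov=marg_cov))
    log_w = np.array(log_w)
    post_pis = np.exp(log_w - log_w.max())
    post_pis /= post_pis.sum()
    return post_pis, np.array(post_mus), np.array(post_Sigmas)

# Example with two hypothetical task groups in 2-D.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=5)
pis = np.array([0.5, 0.5])
mus = np.array([[1.0, -1.0], [-1.0, 1.0]])
Sigmas = np.stack([0.25 * np.eye(2)] * 2)
print(posterior_over_mixture(X, y, pis, mus, Sigmas, sigma2=0.01))
```

As more in-context examples arrive, the mixture weights concentrate on the component most consistent with the examples while each component's mean moves toward a least-squares fit, which is one way to read the two operating modes (task retrieval versus task learning).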
Furthermore, they shed light on previously unexplained phenomena observed in practice, such as the "early ascent" phenomenon and the bounded efficacy of biased-label ICL. The content also provides insights into Bayesian inference, gradient descent, sample complexity, and generalization bounds for ICL with Transformers.
Source: Ziqian Lin, K..., arxiv.org, 03-01-2024, https://arxiv.org/pdf/2402.18819.pdf