Core Concepts
Pretraining a linearly parameterized single-layer linear attention model for linear regression with a Gaussian prior requires only a small number of independent tasks.
Abstract
Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities, solving unseen tasks from the input context alone, without updating model parameters. The study examines ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression with a Gaussian prior. It establishes a statistical task complexity bound for pretraining the attention model, showing that effective pretraining requires only a small number of independent tasks. It further proves that the pretrained model closely matches the Bayes optimal algorithm, achieving nearly Bayes optimal risk on unseen tasks under a fixed context length. These theoretical results complement prior experimental work and shed light on the statistical foundations of in-context learning.
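To make the setup concrete, below is a minimal NumPy sketch of one linear regression task drawn from a Gaussian prior, together with the reduced form of a linearly parameterized single-layer linear attention predictor often used in this line of work: the prediction is a bilinear form in the query and an aggregated context statistic. The function name `predict_icl`, the matrix `Gamma`, the hyperparameter values, and this exact reduction are illustrative assumptions, not necessarily the paper's precise parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 5, 20             # feature dimension, context length
psi2, sigma2 = 1.0, 0.1  # prior variance of the task vector, label noise variance

# One linear regression task from the Gaussian prior: beta ~ N(0, psi2 * I_d).
beta = rng.normal(scale=np.sqrt(psi2), size=d)
X = rng.normal(size=(n, d))                               # in-context inputs
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)  # noisy in-context labels
x_query = rng.normal(size=d)                              # query input

# Hypothetical trainable matrix; pretraining fits Gamma across many independent tasks.
Gamma = np.eye(d)

def predict_icl(Gamma, X, y, x_query):
    """Single-layer linear attention prediction under the common reduction
    y_hat = x_query^T Gamma ((1/n) X^T y)."""
    h = X.T @ y / len(y)  # aggregated context statistic (1/n) sum_i x_i y_i
    return x_query @ Gamma @ h

print(f"in-context prediction: {predict_icl(Gamma, X, y, x_query):+.3f}")
print(f"noiseless target:      {x_query @ beta:+.3f}")
```

Under this reduction, pretraining amounts to fitting the single matrix `Gamma` on independently sampled tasks, which is why the task complexity question has a clean statistical formulation.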
Stats
Effective pretraining requires only a small number of independent tasks.
The pretrained model achieves nearly Bayes optimal risk on unseen tasks under a fixed context length.
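The Bayes optimal benchmark has a closed form in this setting: with a Gaussian prior on the task vector and Gaussian label noise, the posterior mean predictor is ridge regression with a prior-determined regularizer. This is a standard Gaussian-conjugacy fact, written here in our own notation (ψ², σ², λ), which may differ from the paper's:

```latex
\[
\hat{y}_{\text{Bayes}}
  = x_{\text{query}}^{\top}\bigl(X^{\top}X + \lambda I_d\bigr)^{-1}X^{\top}y,
  \qquad \lambda = \frac{\sigma^{2}}{\psi^{2}},
\]
```

where X ∈ R^{n×d} and y ∈ R^n stack the n context examples. "Nearly Bayes optimal risk" then means the pretrained attention model's prediction risk approaches that of this ridge predictor at the pretraining context length.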
Quotes
"Transformers pretrained on diverse tasks exhibit remarkable in-context learning capabilities."
"In this paper, we study ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression with a Gaussian prior."