Transformers pretrained on diverse tasks exhibit remarkable in-context learning capabilities, enabling them to solve unseen tasks solely from input contexts without updating model parameters. This study analyzes the pretraining of a linearly parameterized single-layer linear attention model on linear regression tasks drawn from a Gaussian prior. It establishes a statistical task-complexity bound for pretraining the attention model, showing that effective pretraining requires only a small number of independent tasks. It further proves that the pretrained model closely matches the Bayes-optimal algorithm, achieving nearly Bayes-optimal risk on unseen tasks at a fixed context length. These theoretical findings complement prior experimental work and shed light on the statistical foundations of in-context learning.
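The setting above can be illustrated with a minimal numerical sketch. It assumes the common reduction in which a linearly parameterized one-layer linear attention model's prediction takes the form ŷ = x_q⊤ Γ (X⊤y / n) for a learned matrix Γ and context (X, y); the variable names, the task-generation routine, and the least-squares fitting of Γ are illustrative choices, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20      # feature dimension, context length
sigma = 0.5       # label noise standard deviation
T = 200           # number of independent pretraining tasks

def make_task():
    # Each task draws a regression vector from the Gaussian prior w ~ N(0, I_d).
    w = rng.standard_normal(d)
    X = rng.standard_normal((n, d))
    y = X @ w + sigma * rng.standard_normal(n)
    xq = rng.standard_normal(d)          # query input
    return X, y, xq, xq @ w              # noiseless query label as target

# The prediction yhat = xq^T Gamma (X^T y / n) is linear in Gamma,
# so pretraining over T tasks reduces to a least-squares problem.
feats, targets = [], []
for _ in range(T):
    X, y, xq, yq = make_task()
    h = X.T @ y / n                       # context summary statistic
    feats.append(np.outer(xq, h).ravel())  # yhat = <Gamma, xq h^T>
    targets.append(yq)
gamma_vec, *_ = np.linalg.lstsq(np.array(feats), np.array(targets), rcond=None)
Gamma = gamma_vec.reshape(d, d)

# Compare against the Bayes-optimal predictor for this prior, which is
# ridge regression: w_bayes = (X^T X + sigma^2 I)^{-1} X^T y.
def eval_risk(n_test=500):
    err_att = err_bayes = 0.0
    for _ in range(n_test):
        X, y, xq, yq = make_task()
        err_att += (xq @ Gamma @ (X.T @ y / n) - yq) ** 2
        w_b = np.linalg.solve(X.T @ X + sigma**2 * np.eye(d), X.T @ y)
        err_bayes += (xq @ w_b - yq) ** 2
    return err_att / n_test, err_bayes / n_test
```

Running `eval_risk()` shows the pretrained attention model's test risk approaching the Bayes (ridge) risk, mirroring the paper's claim that a modest number of pretraining tasks suffices to nearly match the Bayes-optimal algorithm at a fixed context length.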
Source: arxiv.org