The paper revisits the evidence for a correspondence between in-context learning (ICL) and gradient descent (GD) optimization in realistic NLP tasks and models. The authors find gaps in the evaluation process used in prior work, including problematic metrics and insufficient baselines. They show that even untrained models can achieve comparable ICL-GD similarity scores, providing strong evidence against the proposed "strong ICL-GD correspondence".
The authors then explore a major discrepancy between ICL and GD in how information flows through the model, which they term "Layer Causality": during ICL, the computation at a given layer depends only on earlier layers, whereas a vanilla GD update to a layer depends on gradients propagated back from later layers. They propose a simple GD-based optimization procedure that respects layer causality, called Layer Causal Gradient Descent (LCGD), and show it improves similarity scores significantly compared to vanilla GD. However, the scores are still low, suggesting the need for a more nuanced understanding of the relationship between ICL and GD.
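To make the layer-causality contrast concrete, here is a minimal toy sketch on a stack of linear layers. The vanilla GD update for layer l uses the gradient backpropagated through all later layers, while the layer-causal variant updates each layer from a local error at its own output, so no information from later layers is used. All names and details here are illustrative assumptions, not the paper's actual LCGD procedure.

```python
import numpy as np

# Toy 3-layer linear network; purely illustrative, not the paper's setup.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((4, 4)) * 0.1 for _ in range(3)]
x = rng.standard_normal(4)
target = rng.standard_normal(4)
lr = 0.01

def forward(layers, x):
    acts = [x]
    for W in layers:
        acts.append(W @ acts[-1])
    return acts

def vanilla_gd_step(layers, x, target):
    # The update to layer l depends on all layers, including those
    # *after* l, via the backpropagated gradient -- this is the
    # layer-causality violation the paper points to.
    acts = forward(layers, x)
    delta = acts[-1] - target                 # dLoss/d(output), squared error
    grads = [None] * len(layers)
    for l in reversed(range(len(layers))):
        grads[l] = np.outer(delta, acts[l])   # dLoss/dW_l
        delta = layers[l].T @ delta           # flows back through later layers
    return [W - lr * g for W, g in zip(layers, grads)]

def layer_causal_step(layers, x, target):
    # Hedged sketch of a layer-causal update: each layer is adjusted
    # using only a local error at its own output, so layer l's update
    # never touches information from layers above it.
    acts = forward(layers, x)
    return [W - lr * np.outer(acts[l + 1] - target, acts[l])
            for l, W in enumerate(layers)]
```

The local-error choice here is one simple way to respect layer causality; any rule whose update to layer l reads only activations from layers 0..l would satisfy the same constraint.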
The authors also briefly survey works in synthetic settings, noting that their notion of ICL-GD correspondence is significantly different from the "strong ICL-GD correspondence" they aim to refute. Overall, the paper highlights the lack of evidence for the strong ICL-GD correspondence in its current form and suggests exploring more nuanced hypotheses.