The paper revisits the evidence for a correspondence between in-context learning (ICL) and gradient descent (GD) optimization in realistic NLP tasks and models. The authors find gaps in the evaluation process used in prior work, including problematic metrics and insufficient baselines. They show that even untrained models can achieve ICL-GD similarity scores comparable to trained ones, providing strong evidence against the proposed "strong ICL-GD correspondence".
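To make the baseline argument concrete, here is a minimal sketch of the kind of similarity score at issue: a cosine similarity between the hidden-state update induced by ICL demonstrations and the update induced by a GD step. All names, shapes, and the random "hidden states" below are hypothetical placeholders, not the paper's actual measurement pipeline.

```python
import numpy as np

def icl_gd_similarity(h_base, h_icl, h_gd):
    """Cosine similarity between the hidden-state update induced by ICL
    (demonstrations in the prompt) and the update induced by a GD step.
    Arguments are (n_layers, hidden_dim) arrays; names are illustrative."""
    u = np.ravel(h_icl - h_base)
    v = np.ravel(h_gd - h_base)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Hypothetical stand-ins for real model activations.
rng = np.random.default_rng(0)
h_base = rng.normal(size=(12, 64))                     # zero-shot hidden states
h_icl  = h_base + rng.normal(scale=0.1, size=(12, 64))  # after ICL demonstrations
h_gd   = h_base + rng.normal(scale=0.1, size=(12, 64))  # after one GD step

score = icl_gd_similarity(h_base, h_icl, h_gd)

# The paper's critique: a raw score like this is only meaningful relative to
# a baseline, e.g. the same quantity computed on a randomly initialized
# (untrained) model; a high score alone does not establish a correspondence.
```

The design point is that the metric itself is cheap to compute; what prior work lacked, per the authors, is the untrained-model control that calibrates what a "high" score means.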
The authors then explore a major discrepancy between ICL and GD in how information flows through the model, which they term "Layer Causality": in a forward pass, a layer's computation depends only on earlier layers, whereas vanilla GD updates each layer using gradient signal propagated back from later layers. They propose a simple GD-based optimization procedure that respects layer causality, called Layer Causal Gradient Descent (LCGD), and show it improves similarity scores significantly compared to vanilla GD. However, the scores remain low, suggesting the need for a more nuanced understanding of the relationship between ICL and GD.
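The layer-causality discrepancy can be sketched on a toy deep linear network. The contrast below is illustrative: vanilla backprop sends gradient signal from the top loss down through later layers, while the "layer-causal" variant updates each layer from a local loss on its own output only. This local-loss construction is an assumption made for illustration, not necessarily the paper's exact LCGD recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy deep linear network: x -> W0 -> W1 -> W2 -> scalar readout.
Ws = [rng.normal(scale=0.5, size=(4, 4)) for _ in range(3)]
readout = rng.normal(size=(4,))
x = rng.normal(size=(4,))
y = 1.0  # target

def forward(Ws, x):
    """Return the list of activations h0..hL (h0 is the input)."""
    hs = [x]
    for W in Ws:
        hs.append(W @ hs[-1])
    return hs

def vanilla_gd_updates(Ws, x, y, lr=0.01):
    """Standard backprop on L = 0.5 * (readout @ hL - y)**2.
    Layer l's update depends on ALL later layers (gradient flows
    downward from the top loss), violating layer causality."""
    hs = forward(Ws, x)
    g = (readout @ hs[-1] - y) * readout         # dL/dh_L
    updates = [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        updates[l] = -lr * np.outer(g, hs[l])    # -lr * dL/dW_l
        g = Ws[l].T @ g                          # propagate to dL/dh_l
    return updates

def layer_causal_updates(Ws, x, y, lr=0.01):
    """Layer-causal sketch (an assumed construction): layer l is updated
    only from a local loss on its own output h_{l+1}, so no gradient
    signal flows back from later layers."""
    hs = forward(Ws, x)
    updates = []
    for l in range(len(Ws)):
        g_l = (readout @ hs[l + 1] - y) * readout  # local gradient at h_{l+1}
        updates.append(-lr * np.outer(g_l, hs[l]))
    return updates

u_vanilla = vanilla_gd_updates(Ws, x, y)
u_causal = layer_causal_updates(Ws, x, y)
```

Note that the two procedures agree on the top layer (there are no later layers to propagate through) and diverge below it, which is exactly where the layer-causality discrepancy lives.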
The authors also briefly survey works in synthetic settings, noting that their notion of ICL-GD correspondence is significantly different from the "strong ICL-GD correspondence" they aim to refute. Overall, the paper highlights the lack of evidence for the strong ICL-GD correspondence in its current form and suggests exploring more nuanced hypotheses.
Source: by Gilad Deutch... at arxiv.org, 04-02-2024
https://arxiv.org/pdf/2311.07772.pdf