The paper examines the mechanisms that Transformer-based language models use for factual recall. In zero-shot settings, the authors observe that task-specific attention heads extract the topic entity (e.g., the name of a country) from the context and pass it to subsequent MLPs. The MLP layer then amplifies or suppresses the outputs of individual heads so that the expected argument "stands out" in the residual stream. The MLP also contributes a task-aware component that steers the residual stream toward the unembedding vector of the target token, completing the "function application."
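The final step described above, steering the residual stream toward the target token's unembedding vector, can be illustrated with a toy calculation. This is a minimal sketch on random data, not the authors' code: the names `W_U`, `resid`, `target`, and `push` are hypothetical, and the final layer norm that real models apply before unembedding is ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 32, 10

# Hypothetical unembedding matrix: column j maps the residual stream to the logit of token j.
W_U = rng.normal(size=(d_model, vocab_size))
# Hypothetical residual-stream state at the final position.
resid = rng.normal(size=d_model)
target = 3  # index of the assumed target token

logits_before = resid @ W_U

# Stand-in for a task-aware MLP component: a vector along the target's unembedding direction.
push = 0.5 * W_U[:, target]
logits_after = (resid + push) @ W_U

# The target logit rises by 0.5 * ||W_U[:, target]||^2, which is always positive.
gain = logits_after[target] - logits_before[target]
print(f"target logit gain: {gain:.3f}")
```

Because the logit is a dot product with the unembedding column, any component added along that column raises the target logit by an amount proportional to the column's squared norm.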
The authors also identify a widespread anti-overconfidence mechanism in the final layer of models, which suppresses correct predictions. Using their interpretation, they mitigate this suppression and improve factual recall performance.
The proposed analysis method, based on linear regression, decomposes MLP outputs into human-interpretable components. The authors support the method with extensive empirical experiments, and it provides the foundation for their interpretations.
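One way to picture a regression-based decomposition of this kind is to fit the MLP output as a weighted sum of attention-head outputs plus a leftover term. The sketch below uses ordinary least squares on synthetic data. It is an assumption about the general shape of such an analysis, not the authors' exact procedure, and the function `decompose_mlp_output` and all variable names are hypothetical.

```python
import numpy as np

def decompose_mlp_output(head_outputs, mlp_output):
    """Fit mlp_output ~ sum_i coef[i] * head_outputs[i] + residual by least squares.

    head_outputs: array of shape (n_heads, d_model)
    mlp_output:   array of shape (d_model,)
    """
    X = head_outputs.T  # each head output becomes one regressor column
    coef, *_ = np.linalg.lstsq(X, mlp_output, rcond=None)
    residual = mlp_output - X @ coef  # part not explained by any head
    return coef, residual

# Synthetic example (all values are made up for illustration).
rng = np.random.default_rng(0)
n_heads, d_model = 4, 64
heads = rng.normal(size=(n_heads, d_model))
true_coef = np.array([2.0, -1.5, 0.0, 0.5])  # amplify, suppress, ignore, damp

# Simplifying assumption: the task component is orthogonal to the span of the heads,
# so the regression can separate it from the head contributions exactly.
task_component = rng.normal(size=d_model)
proj, *_ = np.linalg.lstsq(heads.T, task_component, rcond=None)
task_component = task_component - heads.T @ proj

mlp_out = true_coef @ heads + task_component
coef, residual = decompose_mlp_output(heads, mlp_out)
print(np.round(coef, 3))
```

In this toy setup, a positive coefficient corresponds to a head whose output the MLP amplifies, a negative one to a head it suppresses, and the residual plays the role of the component not attributable to any head.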