Core Concepts
Sparse feature circuits enable detailed understanding of unanticipated mechanisms in language models by identifying causally implicated subnetworks of human-interpretable features.
Abstract
The paper introduces methods for discovering and applying sparse feature circuits: causally implicated subnetworks of human-interpretable features that explain language model behaviors.
Key highlights:
Existing methods explain model behaviors in terms of coarse-grained components like attention heads or neurons, which are generally polysemantic and hard to interpret. In contrast, sparse feature circuits enable detailed understanding of unanticipated mechanisms.
The authors leverage sparse autoencoders (SAEs) to identify interpretable directions in the language model's activation space, and use linear approximations of each feature's indirect effect to efficiently identify the most causally implicated features and the connections between them.
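A minimal sketch of this first-order effect estimate, assuming hypothetical helpers (model.hidden_states, model.run_with_features, sae.encode) and a scalar behavior metric such as a logit difference; implementation details in the paper differ, but the linear approximation looks roughly like this:

import torch

def sae_feature_effects(model, sae, clean_input, patch_input, metric):
    # Feature activations on the clean and counterfactual (patch) inputs.
    acts_clean = sae.encode(model.hidden_states(clean_input)).detach().requires_grad_(True)
    acts_patch = sae.encode(model.hidden_states(patch_input)).detach()

    # Re-run the model with the clean feature activations substituted in,
    # so the behavior metric is differentiable w.r.t. each feature.
    m = metric(model.run_with_features(clean_input, sae, acts_clean))
    (grad,) = torch.autograd.grad(m, acts_clean)

    # Linear (attribution-patching style) estimate of each feature's
    # indirect effect: IE_hat = dm/d(activation) * (a_patch - a_clean).
    return grad * (acts_patch - acts_clean)

Features and edges with the largest estimated effects are retained, yielding a sparse circuit.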
The discovered sparse feature circuits are more interpretable and concise than circuits consisting of neurons. The authors validate this by evaluating the circuits on subject-verb agreement tasks.
The authors introduce SHIFT, a technique that shifts the generalization of a classifier by surgically ablating its sensitivity to unintended signals (e.g., gender in a profession classifier), without requiring disambiguating labeled data.
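A minimal sketch of the ablation step, again with hypothetical helpers (classifier.embed, classifier.head, sae.encode/decode) standing in for the real components; in the paper the ablated classifier can also be fine-tuned afterwards, which this sketch omits:

def shift_ablate(classifier, sae, unintended_features):
    # Wrap the classifier so that human-flagged features (e.g. gender-related
    # features in a profession classifier) are zeroed out of the SAE
    # reconstruction before the classification head runs.
    def forward(x):
        hidden = classifier.embed(x)            # activations feeding the head
        acts = sae.encode(hidden)               # interpretable feature activations
        acts[..., unintended_features] = 0.0    # ablate the unintended signal
        return classifier.head(sae.decode(acts))
    return forward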
Finally, the authors demonstrate a fully unsupervised pipeline that automatically discovers thousands of language model behaviors and their corresponding feature circuits.
Examples
Subject-verb agreement prompts (singular vs. plural): "The teacher has" / "The teachers have"
Profession-classifier (SHIFT) examples with their labels: "His research is in ..." (Professor); "She worked in the OR ..." (Nurse)
Quotes
"The key challenge of interpretability research is to scalably explain the many unanticipated behaviors of neural networks (NNs)."
"We propose to explain model behaviors using fine-grained components that play narrow, interpretable roles."
"Sparse feature circuits can be productively used in downstream applications."