Discovering and Editing Interpretable Causal Circuits in Language Models
Sparse feature circuits enable detailed understanding of unanticipated mechanisms in language models by identifying causally implicated subnetworks of human-interpretable features.