InversionView is a novel method for deciphering the information encoded within neural network activations: it uses a trained decoder model to sample inputs that give rise to similar activations, offering a concrete window into what a given activation represents.
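A toy sketch of the underlying idea, assuming the usual framing of InversionView as examining the ε-preimage of an activation (inputs whose activations at a chosen site are close to a query activation); the brute-force rejection sampling below stands in for the paper's trained decoder, and the model, site, and threshold are purely illustrative:

```python
# Toy illustration of the "preimage" idea: find inputs whose activations at a chosen
# site are epsilon-close to a query activation. InversionView samples such inputs with
# a trained conditional decoder; here brute-force rejection sampling stands in for it.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model; the activation "site" we read is the post-ReLU hidden layer.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
hidden = nn.Sequential(*list(model.children())[:2])

query_input = torch.randn(1, 8)
with torch.no_grad():
    query_act = hidden(query_input)          # the activation we want to "read"

    epsilon = 0.5                            # similarity threshold (illustrative)
    candidates = torch.randn(20000, 8)       # stand-in for decoder samples
    acts = hidden(candidates)
    dists = torch.norm(acts - query_act, dim=1) / (torch.norm(query_act) + 1e-8)

# Inputs the chosen activation cannot tell apart from the query input.
preimage = candidates[dists < epsilon]
print(f"{preimage.shape[0]} of {candidates.shape[0]} candidates lie in the epsilon-preimage")
```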
Sparse autoencoders (SAEs) can be enhanced to learn more interpretable features by encouraging multiple SAEs trained in parallel to learn similar features, a technique called Mutual Feature Regularization (MFR), which lowers reconstruction loss and better preserves information about the input features.
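A minimal sketch of what such a setup could look like, assuming the regularizer rewards each SAE's decoder features for having a close match in the other SAE's dictionary; the cosine-matching penalty and the coefficients below are illustrative choices, not necessarily MFR's exact formulation:

```python
# Sketch: train two SAEs in parallel on the same activations and add a penalty that
# rewards their decoder features for aligning with each other.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAE(nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        z = F.relu(self.enc(x))
        return self.dec(z), z

def mutual_feature_penalty(sae_a, sae_b):
    # Cosine similarity between every pair of decoder features (columns of dec.weight).
    fa = F.normalize(sae_a.dec.weight, dim=0)   # (d_in, d_hidden)
    fb = F.normalize(sae_b.dec.weight, dim=0)
    sims = fa.T @ fb                            # (d_hidden, d_hidden)
    # Each feature of A should have a close match in B, and vice versa.
    return 0.5 * ((1 - sims.max(dim=1).values).mean()
                  + (1 - sims.max(dim=0).values).mean())

d_in, d_hidden = 64, 256
sae_a, sae_b = SAE(d_in, d_hidden), SAE(d_in, d_hidden)
opt = torch.optim.Adam(list(sae_a.parameters()) + list(sae_b.parameters()), lr=1e-3)
l1_coef, mfr_coef = 1e-3, 1e-2                  # assumed coefficients

for step in range(200):
    x = torch.randn(128, d_in)                  # stand-in for model activations
    loss = 0.0
    for sae in (sae_a, sae_b):
        recon, z = sae(x)
        loss = loss + F.mse_loss(recon, x) + l1_coef * z.abs().mean()
    loss = loss + mfr_coef * mutual_feature_penalty(sae_a, sae_b)
    opt.zero_grad()
    loss.backward()
    opt.step()
```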
This paper proposes a novel theoretical framework for interpreting neural networks by establishing a mathematical connection between linear layers with Absolute Value (Abs) activations and the Mahalanobis distance, a statistical measure accounting for data covariance.
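The summary leaves the connection implicit; one standard way to see it, sketched here under the assumption that the construction is compatible with a whitening factorization of the inverse covariance (not a reproduction of the paper's derivation), is:

```latex
% Mahalanobis distance of x from mean \mu with covariance \Sigma. Factor
% \Sigma^{-1} = W^\top W (e.g. W = \Sigma^{-1/2}) and set b = -W\mu; then a linear
% layer followed by Abs reads off each unit's contribution to that distance:
\[
  D_M(x)
  = \sqrt{(x-\mu)^{\top}\Sigma^{-1}(x-\mu)}
  = \lVert Wx + b \rVert_2
  = \Big(\sum_i \lvert w_i^{\top}x + b_i \rvert^{2}\Big)^{1/2},
  \qquad \Sigma^{-1} = W^{\top}W,\quad b = -W\mu,
\]
% so the Abs activation |w_i^\top x + b_i| of unit i is the whitened distance of x
% from the mean along that unit's direction.
```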
This paper introduces a novel, automated, and task-agnostic method for identifying functionally distinct sub-networks within neural networks by leveraging the Gromov-Wasserstein (GW) distance to measure functional similarity between intermediate layer representations.
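A sketch of the core measurement only, assuming the GW distance is computed between intra-representation distance matrices over a shared batch of inputs; the POT library, metric, and normalization are illustrative choices, and the partitioning into sub-networks built on top of this similarity measure is not shown:

```python
# Gromov-Wasserstein distance between the geometry of two intermediate representations,
# evaluated on the same batch of inputs. Requires the POT library (pip install pot).
import numpy as np
import ot  # Python Optimal Transport

def gw_distance(acts_a, acts_b):
    """acts_a, acts_b: (n_inputs, n_units) activations of two unit groups or layers."""
    # Intra-representation distance matrices over the same n inputs.
    C1 = ot.dist(acts_a, acts_a, metric="euclidean")
    C2 = ot.dist(acts_b, acts_b, metric="euclidean")
    C1 /= C1.max() + 1e-12
    C2 /= C2.max() + 1e-12
    n = acts_a.shape[0]
    p = q = ot.unif(n)  # uniform weights over inputs
    # gromov_wasserstein2 returns the GW discrepancy (smaller = more similar geometry).
    return ot.gromov.gromov_wasserstein2(C1, C2, p, q, loss_fun="square_loss")

rng = np.random.default_rng(0)
acts_layer1 = rng.normal(size=(200, 64))
acts_layer2 = np.tanh(acts_layer1 @ rng.normal(size=(64, 32)))  # constructed to be related
acts_random = rng.normal(size=(200, 32))                        # unrelated control

print("related  :", gw_distance(acts_layer1, acts_layer2))
print("unrelated:", gw_distance(acts_layer1, acts_random))
```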
Unlearning-based Neural Interpretations (UNI) offers a novel approach to improving the faithfulness, stability, and robustness of gradient-based attribution methods in deep neural networks by generating debiased and adaptive baselines through targeted unlearning.
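The summary does not specify how the unlearning is performed, so the sketch below only shows the surrounding setup: integrated gradients with a pluggable, input-specific baseline. The baseline here is built by gradient steps that reduce the model's confidence in its predicted class, a toy stand-in for "targeted unlearning" rather than the paper's actual procedure:

```python
# Integrated gradients with an adaptive, input-specific baseline. The baseline
# construction below (gradient steps away from the predicted class) is an
# illustrative stand-in, not UNI's algorithm.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 3))
x = torch.randn(1, 16)
target = int(model(x).argmax())

def adaptive_baseline(model, x, target, steps=50, lr=0.05):
    baseline = x.clone().detach().requires_grad_(True)
    for _ in range(steps):
        evidence = F.log_softmax(model(baseline), dim=1)[0, target]
        evidence.backward()
        with torch.no_grad():
            baseline -= lr * baseline.grad   # move against the class evidence
            baseline.grad.zero_()
    return baseline.detach()

def integrated_gradients(model, x, baseline, target, steps=64):
    total = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = (baseline + alpha * (x - baseline)).requires_grad_(True)
        model(point)[0, target].backward()
        total += point.grad
    return (x - baseline) * total / steps    # average gradient along the path

baseline = adaptive_baseline(model, x, target)
attributions = integrated_gradients(model, x, baseline, target)
print(attributions.shape)
```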
Discovering functionally meaningful circuits within neural networks, a key goal of inner interpretability, is computationally hard and often outright intractable, demanding a careful mapping of the complexity landscape and of the algorithmic options that remain viable.