This paper proposes a novel "ρ-set circuit" explanation for the inner workings of neural networks trained on finite-group composition tasks, unifying and refining previous interpretations based on coset concentration and irrep sparsity. The authors validate their explanation by translating it into compact proofs of model performance, showing that the ρ-set account describes model behavior more completely than the earlier coset- and irreducible-representation-based explanations and yields tighter accuracy bounds at lower computational cost.
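For concreteness, the underlying task is binary composition in a finite group: the model receives a pair of group elements and must output their product. The sketch below sets up that dataset; the choice of the symmetric group S5 and the one-hot pair encoding are illustrative assumptions, not necessarily the paper's exact configuration.

```python
# Minimal sketch of the group-composition task studied in this line of work.
# Assumption: S5 (permutations of 5 elements) and one-hot inputs are
# illustrative choices, not necessarily the paper's exact setup.
from itertools import permutations
import numpy as np

# Enumerate the elements of S5 and index them.
elements = list(permutations(range(5)))          # 120 permutations
index = {g: i for i, g in enumerate(elements)}
n = len(elements)

def compose(g, h):
    """Return the permutation g∘h (apply h first, then g)."""
    return tuple(g[h[k]] for k in range(5))

# Full multiplication table: every ordered pair (a, b) labeled with a∘b.
pairs = [(a, b) for a in range(n) for b in range(n)]
labels = np.array([index[compose(elements[a], elements[b])] for a, b in pairs])

# One-hot encode each input pair as a 2n-dimensional vector.
X = np.zeros((len(pairs), 2 * n), dtype=np.float32)
for row, (a, b) in enumerate(pairs):
    X[row, a] = 1.0
    X[row, n + b] = 1.0

print(X.shape, labels.shape)   # (14400, 240) (14400,)
```

A small network trained to predict `labels` from `X` is the kind of model whose internals these interpretability analyses then reverse-engineer and bound.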
The inner workings of neural networks, the models at the core of modern AI systems, remain largely opaque to human understanding.
Mechanistic interpretability aims to reverse-engineer the computational mechanisms and representations that neural networks learn, providing a granular, causal account of their behavior; such understanding is regarded as important for AI value alignment and safety.