The authors introduce a method to interpret the learned representations in convolutional neural networks (CNNs) trained for object classification. They propose a "linking network" that maps the penultimate layer of a pre-trained classifier to the latent space of a generative adversarial network (StyleGAN-XL). This allows them to visualize the representations learned by the classifier in a human-interpretable way.
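Conceptually, the linking network can be pictured as a small feed-forward mapping from the classifier's penultimate-layer features to the generator's latent space. The sketch below is a minimal PyTorch illustration of that idea; the class name, architecture, and training objective are assumptions for illustration, not the paper's actual implementation (which targets StyleGAN-XL's specific latent structure).

```python
# Minimal sketch of a "linking network", assuming a simple MLP and a
# feature-reconstruction objective (both are illustrative assumptions).
import torch
import torch.nn as nn

class LinkingNetwork(nn.Module):
    """Maps a classifier's penultimate-layer features to a GAN latent code."""
    def __init__(self, feat_dim: int = 2048, latent_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, latent_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)

def training_step(linker, features, generator, feature_extractor, optimizer):
    """Hypothetical training step: generate an image from the mapped latent
    and encourage its re-extracted features to match the originals."""
    w = linker(features)                     # features -> latent code
    images = generator(w)                    # latent code -> synthesized images
    recon = feature_extractor(images)        # re-extract penultimate features
    loss = nn.functional.mse_loss(recon, features)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Here `generator` and `feature_extractor` stand in for the pre-trained StyleGAN-XL generator and the frozen classifier's feature head; both are assumed callables, not code from the paper.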
The authors then introduce an automated pipeline to quantify these high-dimensional representations. They use unsupervised tracking methods and few-shot image segmentation to analyze changes in semantic concepts (e.g., color, shape) induced by perturbing individual units in the classifier's representation space.
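The quantification step can be illustrated as perturbing one unit of the feature vector, regenerating the image, and measuring how a segmented object property changes. In the sketch below, `linker`, `generator`, and `segment` are hypothetical stand-ins for the linking network, StyleGAN-XL, and the few-shot segmentation model; the measured property (mean color inside the object mask) is chosen only as an example of a semantic concept.

```python
# Illustrative sketch of quantifying the effect of perturbing a single unit
# in the classifier's representation space (all callables are assumptions).
import torch

def perturb_unit(features: torch.Tensor, unit: int, delta: float) -> torch.Tensor:
    """Return a copy of the penultimate-layer features with one unit shifted."""
    perturbed = features.clone()
    perturbed[:, unit] += delta
    return perturbed

def semantic_change(features, unit, delta, linker, generator, segment):
    """Generate images before and after perturbing one unit, then compare the
    mean color inside the segmented object region of each image."""
    base_img = generator(linker(features))                        # (B, 3, H, W)
    pert_img = generator(linker(perturb_unit(features, unit, delta)))
    base_mask = segment(base_img)    # boolean object mask, shape (H, W)
    pert_mask = segment(pert_img)
    base_color = base_img[:, :, base_mask].mean(dim=-1)           # (B, 3)
    pert_color = pert_img[:, :, pert_mask].mean(dim=-1)
    return (pert_color - base_color).abs()  # per-channel color shift
```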
The authors demonstrate two key applications of their method:
1. Revealing the abstract concepts encoded in individual units of the classifier, showing that some units represent disentangled semantic features while others exhibit superposition of multiple concepts.
2. Examining the classifier's decision boundary by generating counterfactual examples and quantifying the changes in relevant semantic features across the decision boundary (see the sketch after this list).
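The counterfactual idea can be sketched generically as pushing the penultimate-layer feature vector across the classifier's decision boundary with gradient steps and then visualizing the result through the linking network and generator. This is a hedged illustration of the general technique, not the paper's exact procedure; `classifier_head` is an assumed callable mapping features to class logits.

```python
# Generic counterfactual-in-feature-space sketch (illustrative assumptions:
# batch size 1, a linear/MLP classifier head operating on penultimate features).
import torch

def counterfactual_features(features, classifier_head, source_class, target_class,
                            step_size: float = 0.1, max_steps: int = 100):
    """Take gradient steps on the features until the predicted class flips."""
    x = features.clone().detach().requires_grad_(True)
    for _ in range(max_steps):
        logits = classifier_head(x)
        if logits.argmax(dim=-1).item() == target_class:
            break  # crossed the decision boundary
        # Increase the target-class logit relative to the source-class logit.
        objective = logits[0, target_class] - logits[0, source_class]
        grad, = torch.autograd.grad(objective, x)
        x = (x + step_size * grad).detach().requires_grad_(True)
    return x.detach()
```

The resulting counterfactual features can then be mapped through the linking network and StyleGAN-XL to see which semantic features change as the prediction crosses the boundary.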
Overall, the authors present a systematic and objective approach to interpreting the learned representations in CNNs, overcoming the limitations of previous methods that rely on visual inspection or require extensive retraining of the models.
Source: Maren H. Weh... et al., arxiv.org, 2024-09-26, https://arxiv.org/pdf/2409.16865.pdf