Core Concepts
The core contribution of this article is two novel analyses, Principal Relevant Component Analysis (PRCA) and Disentangled Relevant Subspace Analysis (DRSA), which extract subspaces of a neural network's activations that are maximally relevant to its prediction strategy. These disentangled subspaces enable more informative and structured explanations of the model's decision-making process.
Summary
The article proposes two new methods, PRCA and DRSA, to extract relevant subspaces from the activations of a neural network in order to provide more informative and disentangled explanations of the model's predictions.
PRCA aims to extract a single subspace that is maximally relevant to the model's prediction by optimizing a relevance-based objective. This differs from standard PCA, which maximizes variance rather than relevance.
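As a minimal sketch of what such a relevance-based subspace extraction could look like (assuming, as a simplification, that each sample's relevance factorizes into an inner product between its activation vector and a "context" vector obtained from the attribution method, e.g. LRP; the reduction to an eigenvalue problem and the function names below are illustrative, not the authors' exact implementation):

```python
# Hedged sketch of a PRCA-style subspace extraction (not the authors' code).
# Assumptions: activations A (n x d) at some layer, and "context" vectors C (n x d)
# such that the relevance of sample i decomposes as R_i = A[i] @ C[i].
# Under this assumption, PRCA reduces to an eigenvalue problem on the
# symmetrized activation-context cross-covariance matrix.

import numpy as np

def prca(A, C, k):
    """Return an orthonormal basis U (d x k) of a maximally relevant subspace."""
    # Symmetrized second-moment matrix between activations and context vectors
    M = 0.5 * (A.T @ C + C.T @ A) / A.shape[0]
    # Top-k eigenvectors = directions carrying the most relevance
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, np.argsort(eigvals)[::-1][:k]]

def pca(A, k):
    """Standard PCA for comparison: maximizes variance, not relevance."""
    A_centered = A - A.mean(axis=0)
    M = A_centered.T @ A_centered / A.shape[0]
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, np.argsort(eigvals)[::-1][:k]]
```

The contrast with `pca` makes the difference concrete: both solve an eigenvalue problem, but PRCA weights directions by how much relevance they carry, not by how much the activations vary along them.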
DRSA extends this idea to extract multiple distinct subspaces, each capturing a different component of the model's overall prediction strategy. This allows the explanation to be decomposed into several semantically meaningful sub-explanations.
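To illustrate the decomposition idea (again a hedged sketch, not the paper's optimization procedure; `decompose_relevance`, the bases `U_list`, and the activation/context factorization are the same illustrative assumptions as above), once disjoint relevant subspaces have been found, the total relevance of a sample splits into one contribution per subspace:

```python
# Hedged sketch: decomposing relevance into per-subspace sub-explanations.
# Assumption: orthonormal bases U_list = [U_1, ..., U_K] (each d x k) span
# disjoint subspaces of the activation space, and relevance factorizes as a @ c.

import numpy as np

def decompose_relevance(a, c, U_list):
    """Return one relevance score per subspace for a single sample."""
    return [float((U.T @ a) @ (U.T @ c)) for U in U_list]

# If the subspaces jointly span the whole activation space, the per-subspace
# scores sum back to the total relevance a @ c (up to numerical error),
# so each score can be read as that concept's share of the prediction.
```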
The authors show that these disentangled explanations outperform baseline explanation methods on benchmark datasets and models. They also showcase three applications of the proposed methods:
- Detecting and mitigating "Clever Hans" effects in models, where the model relies on spurious correlations rather than true features.
- Gaining better insights into the relationship between input features and model predictions for a butterfly classification task.
- Analyzing how adversarial perturbations affect the different components of a model's explanation.
Overall, the article introduces a novel approach to generate more informative and structured explanations of neural network predictions by extracting relevant subspaces in the activation space.
Statistics
The model's prediction can be decomposed into contributions from distinct subspaces, with the subspaces capturing different visual concepts used by the model. (Fig. 1)
PRCA can identify a low-dimensional subspace that retains the most relevant features used by the model, outperforming baselines on a patch-flipping evaluation (sketched below). (Table 1, Fig. 5)
DRSA can extract multiple distinct subspaces that highlight different components of the model's prediction strategy, again outperforming baselines on the patch-flipping evaluation. (Table 2)
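The patch-flipping evaluation referenced above removes image patches in order of decreasing relevance and tracks how quickly the class score drops; a faster drop indicates a more faithful explanation. A generic sketch (the `model` callable, patch size, and imputation value are placeholders, not the authors' exact protocol):

```python
# Hedged sketch of a generic patch-flipping evaluation.
# `model` is assumed to be a callable mapping an image to the class score,
# and `relevance` an H x W heatmap for that image.

import numpy as np

def patch_flipping_curve(model, image, relevance, patch=16, baseline_value=0.0):
    """Return class scores after flipping the most relevant patches one by one."""
    H, W = relevance.shape[:2]
    # Aggregate pixel relevance into one score per patch
    scores = {}
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            scores[(i, j)] = relevance[i:i+patch, j:j+patch].sum()
    order = sorted(scores, key=scores.get, reverse=True)

    flipped = image.copy()
    curve = [model(flipped)]
    for (i, j) in order:
        flipped[i:i+patch, j:j+patch] = baseline_value  # remove the patch
        curve.append(model(flipped))
    return np.array(curve)
```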
Quotes
"Explanations often take the form of a heatmap identifying input features (e.g. pixels) that are relevant to the model's decision. These explanations, however, entangle the potentially multiple factors that enter into the overall complex decision strategy."
"We propose to disentangle explanations by extracting at some intermediate layer of a neural network, subspaces that capture the multiple and distinct activation patterns (e.g. visual concepts) that are relevant to the prediction."