Core Concepts
Gradient-based attribution methods that use pre-softmax scores are vulnerable to adversarial modifications of the model that alter the resulting heatmaps without changing the model's final outputs.
Abstract
The paper describes a vulnerability in attribution methods, such as Grad-CAM, that use pre-softmax scores to explain the outputs of convolutional neural network (CNN) classifiers. The key insights are:
The softmax function is invariant to a shift applied equally to all of its input arguments: adding the same amount to every pre-softmax score leaves the output probabilities unchanged. The pre-softmax scores can therefore be modified without affecting the final post-softmax outputs.
The authors show that by adding a specific modification to the pre-softmax scores of a CNN model, the heatmaps produced by Grad-CAM can be radically altered, even to the point of highlighting irrelevant regions of the input image. However, the final model outputs remain unchanged.
This vulnerability is different from Clever Hans effects, where the model learns to exploit spurious correlations in the training data. In this case, the problem lies in the attribution method itself, not the model's ability to extract the right information.
The authors note that this vulnerability could potentially be exploited to undermine confidence in a model, as the modified model would be functionally equivalent to the original one, making it hard to detect the change.
The vulnerability is specific to attribution methods that use pre-softmax scores; the authors suggest that attribution methods based on post-softmax outputs are not vulnerable to this type of attack.
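The mechanism behind these points can be illustrated with a small numerical sketch. Because e^(z_i + t) = e^t * e^(z_i), the factor e^t cancels between the numerator and denominator of the softmax, so a shift t shared by all classes never reaches the output. If t depends on the input, however, its gradient is added to the gradient of every pre-softmax score. The code below is a toy stand-in and not the authors' experiment or a full Grad-CAM implementation: the linear scoring head W, the flattened 4x4 feature map x, the shift strength ALPHA, and the choice to tie the shift to the first (upper-left) feature are all assumptions made for illustration, and plain gradients with respect to the feature map take the place of Grad-CAM's pooled gradient weights.

```python
import numpy as np

def softmax(z):
    # Subtract the maximum for numerical stability; this is itself a
    # class-independent shift and does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
N_CLASSES, SIDE = 3, 4
W = rng.normal(size=(N_CLASSES, SIDE * SIDE))  # hypothetical linear scoring head
x = rng.normal(size=SIDE * SIDE)               # hypothetical flattened 4x4 feature map
ALPHA = 50.0                                   # hypothetical strength of the shift

def scores_original(v):
    # Pre-softmax scores of the unmodified toy model.
    return W @ v

def scores_modified(v):
    # Same scores plus a shift t(v) = ALPHA * v[0], identical for every class.
    # v[0] plays the role of the upper-left location of the feature map.
    return W @ v + ALPHA * v[0]

def numerical_grad(f, v, eps=1e-5):
    # Central finite differences, so no autodiff library is needed.
    grad = np.zeros_like(v)
    for i in range(v.size):
        step = np.zeros_like(v)
        step[i] = eps
        grad[i] = (f(v + step) - f(v - step)) / (2 * eps)
    return grad

c = int(np.argmax(scores_original(x)))  # class being explained

# Gradients of the pre-softmax score of class c with respect to the features.
pre_orig = numerical_grad(lambda v: scores_original(v)[c], x)
pre_mod = numerical_grad(lambda v: scores_modified(v)[c], x)

# Gradients of the post-softmax probability of class c.
post_orig = numerical_grad(lambda v: softmax(scores_original(v))[c], x)
post_mod = numerical_grad(lambda v: softmax(scores_modified(v))[c], x)

print("outputs identical:",
      np.allclose(softmax(scores_original(x)), softmax(scores_modified(x))))
print("pre-softmax gradient map, original model:")
print(pre_orig.reshape(SIDE, SIDE).round(2))
print("pre-softmax gradient map, modified model:")
print(pre_mod.reshape(SIDE, SIDE).round(2))
print("post-softmax gradients identical:",
      np.allclose(post_orig, post_mod, atol=1e-6))
```

In this toy setting the two models produce the same probabilities, yet the pre-softmax gradient map of the modified model is dominated by the upper-left entry, while the post-softmax gradient maps of the two models agree. This mirrors, in miniature, why the modified model is hard to tell apart from the original by its outputs alone.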
Stats
The content does not provide any specific numerical data or metrics to support the key points. It focuses on the theoretical vulnerability and provides illustrative examples.
Quotes
"Adding an amount t independent of the class i to all the arguments of the softmax, z′i = zi + t, has no effect on its outputs."
"Consequently, we expect that heatmaps produced by Grad-CAM to strongly highlight the upper left area of the image regardless of whether that part of the image is related to the network final output."