Key Concepts
Neural network interpretations using gradient-based saliency maps are susceptible to universal adversarial perturbations that can significantly alter the interpretation across a large fraction of input samples.
Summary
The paper investigates the vulnerability of gradient-based interpretation methods for neural networks to universal perturbations of interpretation (UPIs): single perturbations that alter the interpretation across many inputs at once. The authors propose two approaches to design such UPIs:
- UPI-Grad: A gradient-based optimization method that finds a universal perturbation maximizing the change in gradient-based saliency maps across input samples.
- UPI-PCA: A principal component analysis (PCA)-based approach that approximates the UPI-Grad objective by computing the top singular vector of the per-sample gradients of the interpretation-distance function.
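The UPI-PCA idea above can be sketched compactly: stack the per-sample gradients of the interpretation-distance objective into a matrix and take its top right singular vector as the shared perturbation direction. This is a minimal illustration, not the paper's implementation; the function name, the L2 norm constraint, and the flattened-gradient input format are assumptions.

```python
import numpy as np

def upi_pca(grads, eps):
    """Sketch of a UPI-PCA-style universal perturbation.

    grads: array of shape (n_samples, d), where each row is the
           (flattened) gradient of the interpretation-distance
           objective w.r.t. the input perturbation for one sample.
    eps:   radius of the norm ball constraining the perturbation
           (L2 is assumed here for simplicity).
    """
    G = np.asarray(grads, dtype=float)
    # The top right singular vector is the single direction that best
    # aligns with all per-sample gradients simultaneously.
    _, _, vt = np.linalg.svd(G, full_matrices=False)
    v = vt[0]
    # Project onto the epsilon-ball so the perturbation stays small.
    return eps * v / np.linalg.norm(v)
```

In practice the gradients would come from backpropagating through the saliency-map computation for a batch of training inputs; the top singular vector then serves as a one-shot approximation to the iterative UPI-Grad optimization.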
The authors demonstrate the effectiveness of the proposed UPI methods through numerical experiments on standard image datasets (MNIST, CIFAR-10, and Tiny-ImageNet) using neural network architectures such as VGG-16 and MobileNet. The results show that the UPIs can significantly alter the networks' gradient-based interpretations, often performing as well as per-image adversarial perturbations designed specifically for each input. The authors also show that universal adversarial perturbations designed to attack classification have a notable impact on the networks' interpretations as well.
The key insights are:
- Gradient-based interpretation methods for neural networks are vulnerable to universal adversarial perturbations.
- The proposed UPI-Grad and UPI-PCA methods can effectively design such universal perturbations to alter the interpretation across a large fraction of input samples.
- The UPIs can achieve comparable or better performance than per-image adversarial perturbations in changing the neural network interpretations.
- Classification-based universal adversarial perturbations also have a significant impact on the interpretation of neural networks.
Statistics
The average dissimilarity between original and perturbed gradient-based interpretations can be as high as 0.849 for the Integrated Gradients method on MNIST using the proposed UPI-PCA approach.
The UPI-Grad method achieves an average dissimilarity of 0.676 on the Integrated Gradients interpretation of MobileNet on the Tiny-ImageNet dataset.
The classification-based universal adversarial perturbations can achieve an average dissimilarity of 0.412 on the Integrated Gradients interpretation of MobileNet on the CIFAR-10 dataset.
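The summary reports dissimilarity scores but does not specify the metric used to compare original and perturbed saliency maps. A common choice in this literature is one minus the cosine similarity of the flattened maps; the sketch below uses that convention purely for illustration, and the function name is hypothetical.

```python
import numpy as np

def saliency_dissimilarity(s1, s2):
    """Illustrative dissimilarity between two saliency maps:
    1 - cosine similarity of the flattened maps.
    (The paper's exact metric is not specified in this summary.)
    Returns 0 for identical maps, 1 for orthogonal ones."""
    a = np.ravel(s1).astype(float)
    b = np.ravel(s2).astype(float)
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return 1.0 - cos
```

Under this convention, a score of 0.849 would indicate that the perturbed Integrated Gradients map points in a substantially different direction from the original.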
Quotes
"Neural network interpretations using gradient-based saliency maps are susceptible to universal adversarial perturbations that can significantly alter the interpretation across a large fraction of input samples."
"The proposed UPI-Grad and UPI-PCA methods can effectively design such universal perturbations to alter the interpretation across a large fraction of input samples."
"The UPIs can achieve comparable or better performance than per-image adversarial perturbations in changing the neural network interpretations."