The authors propose a method to generate "adversarial doodles": attacks on image classifiers that are interpretable and can be replicated by humans. They optimize a set of Bézier curves so that, when overlaid on an input image, they fool a target classifier. To make the attacks robust to the misalignment that occurs when humans redraw them, they apply random affine transformations during optimization, and they regularize the doodled area to keep the attacks small and unobtrusive.
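The core procedure can be illustrated with a short sketch. The following PyTorch code is a minimal, assumption-laden reconstruction of the loop described above, not the authors' implementation: the soft Gaussian rasterization of quadratic Bézier curves, the helper names (`render_bezier`, `attack`), and all hyperparameters (curve thickness, the area weight `lam_area`, the affine ranges) are hypothetical stand-ins.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF


def render_bezier(ctrl_pts, size=224, thickness=2.0):
    """Softly rasterize quadratic Bezier curves into a (1, size, size) mask.

    ctrl_pts: (num_curves, 3, 2) control points in [0, 1] image coordinates.
    Returns values in (0, 1]; higher means closer to a curve (differentiable).
    """
    ts = torch.linspace(0, 1, 64, device=ctrl_pts.device).view(-1, 1, 1)
    p0, p1, p2 = ctrl_pts[:, 0], ctrl_pts[:, 1], ctrl_pts[:, 2]
    # Quadratic Bezier: B(t) = (1-t)^2 p0 + 2 t (1-t) p1 + t^2 p2
    pts = (1 - ts) ** 2 * p0 + 2 * ts * (1 - ts) * p1 + ts ** 2 * p2  # (T, C, 2)
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, size, device=ctrl_pts.device),
        torch.linspace(0, 1, size, device=ctrl_pts.device),
        indexing="ij",
    )
    grid = torch.stack([xs, ys], dim=-1).view(1, 1, size, size, 2)
    # Squared distance from every pixel to every sampled curve point.
    d2 = ((grid - pts.view(64, -1, 1, 1, 2)) ** 2).sum(-1)        # (T, C, H, W)
    d2_min = d2.min(dim=0).values.min(dim=0).values               # (H, W)
    sigma = thickness / size
    return torch.exp(-d2_min / (2 * sigma ** 2)).unsqueeze(0)


def attack(model, image, target_class, num_curves=3, steps=300,
           lam_area=1.0, max_shift=0.05, max_deg=5.0):
    """Optimize control points so the doodled image is classified as target_class.

    image: a square (3, H, H) tensor in [0, 1]; model: a frozen classifier in eval mode.
    """
    ctrl = torch.rand(num_curves, 3, 2, requires_grad=True)
    opt = torch.optim.Adam([ctrl], lr=0.01)
    for _ in range(steps):
        mask = render_bezier(ctrl.clamp(0, 1), size=image.shape[-1])
        doodled = image * (1 - mask) + mask                       # draw the doodle in white
        # Random affine transform simulates misalignment when a human redraws the doodle.
        angle = float(torch.empty(1).uniform_(-max_deg, max_deg))
        dx, dy = (torch.empty(2).uniform_(-max_shift, max_shift) * image.shape[-1]).tolist()
        doodled = TF.affine(doodled, angle=angle, translate=[int(dx), int(dy)],
                            scale=1.0, shear=0.0)
        logits = model(doodled.unsqueeze(0))
        # Targeted misclassification loss plus a penalty on the doodled area.
        loss = F.cross_entropy(logits, torch.tensor([target_class])) + lam_area * mask.mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return ctrl.detach().clamp(0, 1)
```

The area penalty and the random affine perturbation inside the loop correspond to the two design choices described above: keeping the doodle small and making it survive imprecise human reproduction.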
The authors evaluate their method on ResNet-50 and ViT-B/32 classifiers trained on the Caltech-101 dataset. They find that attacks with three Bézier curves succeed more often than those with a single curve. The human-replicated attacks often fool the classifiers as well, though ViT-B/32 proves more robust to them than ResNet-50.
By analyzing the adversarial doodles, the authors extract describable insights into how the shape of a doodle relates to the classifier's output. For example, they find that adding three small circles to a helicopter image causes the ResNet-50 classifier to misclassify it as an airplane. They confirm this insight by drawing similar doodles on other helicopter images, which also fool the classifier.
The authors also analyze the role of the random affine transformations during optimization, finding that they enhance the robustness of the attacks against misalignment when replicated by humans. Additionally, they use Grad-CAM to visualize the regions the classifiers focus on, observing that drastic changes in the focused region can cause human-replicated attacks to fail.
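To make the visualization step concrete, below is a minimal Grad-CAM sketch using standard PyTorch forward/backward hooks. It is a generic re-implementation, not the paper's code, and the ImageNet-pretrained ResNet-50 stands in for the paper's Caltech-101-trained classifier.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights


def grad_cam(model, layer, image, class_idx=None):
    """Return an (H, W) heat map of class evidence at `layer` for `image` (1, 3, H, W)."""
    acts, grads = [], []
    h1 = layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        logits = model(image)
        if class_idx is None:
            class_idx = logits.argmax(dim=1).item()
        model.zero_grad()
        logits[0, class_idx].backward()
    finally:
        h1.remove()
        h2.remove()
    a, g = acts[0], grads[0]                        # both (1, C, h, w)
    weights = g.mean(dim=(2, 3), keepdim=True)      # channel-wise importance
    cam = F.relu((weights * a).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze().detach()


# Usage sketch: compare the focused region on a clean image vs. its doodled version.
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval()
x_clean = torch.rand(1, 3, 224, 224)                # placeholder for a preprocessed image
heatmap = grad_cam(model, model.layer4[-1], x_clean)
```

Running the same function on the doodled image and comparing the two heat maps is one way to inspect whether the attack shifted the classifier's focused region, as the authors observe.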
Key insights distilled from arxiv.org
by Ryoya Nara, ..., 09-12-2024
https://arxiv.org/pdf/2311.15994.pdf