Core Concepts
SD-NAE, a method that leverages Stable Diffusion to actively generate natural adversarial examples, demonstrates significant potential in evaluating and understanding the robustness of deep image classifiers.
Abstract
The paper introduces SD-NAE, a method that uses the state-of-the-art Stable Diffusion model to actively synthesize natural adversarial examples (NAEs): images that arise naturally from the environment yet deceive image classifiers.
Key highlights:
- Unlike prior works that passively collect NAEs from real images, SD-NAE formulates NAE generation as a controlled optimization process. It perturbs the token embedding of a specified class in the Stable Diffusion text condition, guided by the gradient of the target classifier's loss.
- Experiments show that SD-NAE can effectively generate NAEs, achieving a 43.5% fooling rate against an ImageNet-trained ResNet-50 classifier. The generated NAEs exhibit diverse variations in color, background, view angle, and style.
- SD-NAE demonstrates greater flexibility and control compared to previous methods, highlighting its potential as a tool for evaluating and enhancing model robustness.
- The paper also discusses the advantages of perturbing token embeddings over latent vectors or text embeddings, and the ability of SD-NAE to generate NAEs for out-of-distribution detection tasks.
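The core optimization described in the highlights, nudging a class-token embedding along the classifier's loss gradient until the prediction flips, can be illustrated with a toy sketch. The linear "generator" and "classifier" below are stand-ins invented for this example; the actual SD-NAE pipeline backpropagates the loss through Stable Diffusion into the token embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (NOT the paper's models): a linear "generator" maps a
# token embedding to image features, and a linear-softmax "classifier"
# maps features to class probabilities.
D_EMB, D_IMG, N_CLS = 8, 16, 4
W_g = rng.normal(size=(D_IMG, D_EMB))   # generator weights
W_c = rng.normal(size=(N_CLS, D_IMG))   # classifier weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def classify(emb):
    """Class probabilities for the 'image' generated from embedding emb."""
    return softmax(W_c @ (W_g @ emb))

# Take whatever the clean embedding is classified as to be the true class.
e_clean = rng.normal(size=D_EMB)
true_class = int(classify(e_clean).argmax())

# Gradient ASCENT on the cross-entropy loss of the true class: small
# steps on the token embedding until the prediction flips, mirroring
# SD-NAE's perturbation of the class-token embedding.
J = W_c @ W_g                  # d(logits)/d(embedding) for this toy model
onehot = np.eye(N_CLS)[true_class]
e_adv = e_clean.copy()
for _ in range(500):
    p = classify(e_adv)
    if p.argmax() != true_class:         # classifier is fooled; stop early
        break
    e_adv += 0.1 * (J.T @ (p - onehot))  # dL/de for cross-entropy loss

print("fooled:", classify(e_adv).argmax() != true_class)
print("perturbation norm:", np.linalg.norm(e_adv - e_clean))
```

In the actual method, `classify` would be Stable Diffusion generation followed by the target ResNet-50, with the gradient obtained by automatic differentiation; because only the embedding of the class token is updated, the generated images retain natural variation in color, background, view angle, and style.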