Core Concepts
Diffusion models can generate images that still trigger object detectors even after the robust visual features humans rely on are removed, posing a new security threat.
Abstract
The paper investigates the "natural attack capability" of state-of-the-art text-to-image diffusion models, where simple text prompts can guide the models to generate images that object detectors still recognize as the target objects even though humans do not, making the attack stealthy.
The key highlights are:
The authors identify a new type of attack, called the Natural Denoising Diffusion (NDD) attack, which exploits the natural attack capability of diffusion models. The NDD attack can generate low-cost, model-agnostic, and transferable adversarial attacks by prompting the model to remove robust visual features like shape, color, text, and pattern from the generated images (a minimal generation-and-detection sketch follows these highlights).
To systematically evaluate the natural attack capability, the authors construct a large-scale dataset called the Natural Denoising Diffusion Attack (NDDA) dataset, covering various combinations of removing robust features for three object classes: stop sign, fire hydrant, and horse.
Experiments on the NDDA dataset show that popular object detectors can still recognize the objects in the generated images, even when the robust features are intentionally removed. For example, 32% of the stop sign images without any robust features are still detected as stop signs.
A user study confirms the high stealthiness of the NDD attack: stop sign images generated with the "STOP" text altered are detected as stop signs by object detectors at an 88% rate, while 93% of human subjects do not recognize them as stop signs.
The authors find that the non-robust features embedded by diffusion models play a significant role in enabling the natural attack capability, as demonstrated by comparing normal and "robustified" classifiers.
To validate the real-world applicability, the authors demonstrate the model-agnostic and transferable attack capability of the NDD attack against a commodity autonomous driving vehicle, where 73% of the printed attack images are detected as stop signs.
The study highlights the security risks introduced by the powerful image generation capabilities of diffusion models and calls for further research to develop robust defenses.
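To make the pipeline described above concrete, here is a minimal sketch of how such an experiment could be run with off-the-shelf tools. It assumes the diffusers package and the ultralytics/yolov5 torch.hub models; the feature-removal prompt is illustrative and not taken from the paper.

```python
import torch
from diffusers import StableDiffusionPipeline

# Text-to-image model; Stable Diffusion 2.1 stands in for the three models studied in the paper.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# Illustrative feature-removal prompt (an assumption, not the paper's wording): it asks the
# model to suppress the text, color, and shape cues humans rely on to recognize a stop sign.
prompt = "a stop sign with no lettering, not red, not octagonal, plain background"
image = pipe(prompt).images[0]

# COCO-trained YOLOv5 detector loaded via torch.hub.
detector = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
detections = detector(image).pandas().xyxy[0]

# The natural attack capability shows up when the detector still reports a "stop sign"
# even though the prompt removed the robust features.
print(detections[detections["name"] == "stop sign"])
```

In the paper's terms, a non-empty result here means the generated image retains enough non-robust features for the detector to hold its prediction while a human viewer would likely not see a stop sign.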
Statistics
The stop sign images generated by DALL-E 2 with all robust features removed are still detected as stop signs by YOLOv3, YOLOv5, DETR, Faster R-CNN, and RTMDet with an average detection rate of 6%.
The stop sign images generated by Stable Diffusion 2 with all robust features removed are still detected as stop signs by the object detectors with an average detection rate of 28%.
The stop sign images generated by DeepFloyd IF with all robust features removed are still detected as stop signs by the object detectors with an average detection rate of 32% (a sketch of how such a detection rate can be computed follows).
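A detection rate like those above can be computed as the fraction of generated images in which a detector reports the target class above a confidence threshold, averaged over detectors. The sketch below is an assumption-laden illustration: the image directory and confidence threshold are hypothetical, and YOLOv5 variants from torch.hub stand in for the five detectors used in the paper.

```python
import torch
from pathlib import Path
from PIL import Image

# Hypothetical directory of generated stop sign images with all robust features removed.
images = [Image.open(p) for p in sorted(Path("ndda/stop_sign/all_removed").glob("*.png"))]

def detection_rate(detector, images, target="stop sign", conf=0.5):
    """Fraction of images in which the detector reports `target` above the confidence threshold."""
    hits = 0
    for img in images:
        df = detector(img).pandas().xyxy[0]
        if ((df["name"] == target) & (df["confidence"] >= conf)).any():
            hits += 1
    return hits / max(len(images), 1)

# YOLOv5 variants as stand-ins for YOLOv3, YOLOv5, DETR, Faster R-CNN, and RTMDet.
rates = {}
for variant in ["yolov5n", "yolov5s", "yolov5m"]:
    det = torch.hub.load("ultralytics/yolov5", variant, pretrained=True)
    rates[variant] = detection_rate(det, images)

print(rates, "average:", sum(rates.values()) / len(rates))
```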
Quotes
"We identify a new type of attack, called the Natural Denoising Diffusion (NDD) attack based on the finding that state-of-the-art deep neural network (DNN) models still hold their prediction even if we intentionally remove their robust features, which are essential to the human visual system (HVS), by text prompts."
"The NDD attack can generate low-cost, model-agnostic, and transferrable adversarial attacks by exploiting the natural attack capability in diffusion models."
"We find that the non-robust features embedded by diffusion models contribute to the natural attack capability."