The paper proposes a novel method called PASA (Prediction & Attribution Sensitivity Analysis) for detecting adversarial samples in a black-box setting. The key insights are:
Deep neural networks exhibit distinct behavior when noise is added to adversarial samples compared to benign samples: adversarial samples exhibit lower sensitivity in the model's prediction, while benign samples exhibit higher sensitivity.
The distribution of feature attribution scores (computed with Integrated Gradients) also varies significantly between benign and adversarial samples when noise is added.
The PASA detector leverages these observations to compute two test statistics: prediction sensitivity (PS) and attribution sensitivity (AS). It learns thresholds for these metrics from benign samples during training and uses them to detect adversarial samples at test time.
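The detection scheme described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy two-class model, the finite-difference gradient estimate inside Integrated Gradients, the L1 distance as the sensitivity measure, and the 5th-percentile threshold rule are all assumptions made for the sake of a self-contained example.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(x):
    """Toy stand-in for a classifier's softmax output (hypothetical)."""
    logits = np.array([x.sum(), -x.sum()])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def integrated_gradients(x, baseline=None, steps=16):
    """Integrated Gradients for the toy model's class-0 score.
    The path integral is approximated by a Riemann sum; gradients are
    estimated by central finite differences (an assumption of this sketch)."""
    if baseline is None:
        baseline = np.zeros_like(x)
    total = np.zeros_like(x)
    eps = 1e-4
    for alpha in np.linspace(0.0, 1.0, steps):
        point = baseline + alpha * (x - baseline)
        grad = np.zeros_like(x)
        for i in range(x.size):
            d = np.zeros_like(x)
            d[i] = eps
            grad[i] = (model(point + d)[0] - model(point - d)[0]) / (2 * eps)
        total += grad
    return (x - baseline) * total / steps

def sensitivities(x, sigma=0.1):
    """Prediction sensitivity (PS) and attribution sensitivity (AS):
    change in the model output / attribution map after adding Gaussian noise."""
    noise = rng.normal(0.0, sigma, size=x.shape)
    ps = np.linalg.norm(model(x) - model(x + noise), 1)
    as_ = np.linalg.norm(integrated_gradients(x) - integrated_gradients(x + noise), 1)
    return ps, as_

# Learn thresholds from benign samples only: since adversarial samples are
# expected to show LOWER sensitivity, use a low percentile of benign statistics.
benign = [rng.normal(size=4) for _ in range(20)]
stats = np.array([sensitivities(b) for b in benign])
ps_thr, as_thr = np.percentile(stats, 5, axis=0)

def is_adversarial(x):
    """Flag a sample whose sensitivities fall below both benign thresholds."""
    ps, as_ = sensitivities(x)
    return bool(ps < ps_thr and as_ < as_thr)
```

Keeping the thresholds fit on benign data alone is what makes the detector unsupervised: no adversarial samples are needed at training time.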
PASA is evaluated on five datasets (MNIST, CIFAR-10, CIFAR-100, ImageNet, CIC-IDS2017) and five network architectures (MLP, LeNet, VGG16, ResNet, MobileNet). On average, PASA outperforms state-of-the-art unsupervised adversarial detectors by 14% on CIFAR-10, 4% on CIFAR-100, and 35% on ImageNet. PASA also demonstrates competitive performance even when the adversary is aware of the defense mechanism.
Key insights distilled from the source by Dipkamal Bhu... at arxiv.org, 04-18-2024
https://arxiv.org/pdf/2404.10789.pdf