
Detecting Adversarial Samples Using Prediction and Attribution Sensitivity Analysis


Core Concept
A practical, lightweight, and unsupervised method for detecting adversarial samples by analyzing the sensitivity of model prediction and feature attribution to noise.
Summary

The paper proposes a novel method called PASA (Prediction & Attribution Sensitivity Analysis) for detecting adversarial samples in a black-box setting. The key insights are:

  1. Deep neural networks respond differently when noise is introduced to adversarial samples than when it is introduced to benign samples: the change in model prediction under the added noise differs measurably between the two, with the direction of the difference depending on the dataset (see the statistics below).

  2. The distribution of feature attribution scores (computed with Integrated Gradients) likewise differs measurably between benign and adversarial samples when noise is added.

The PASA detector leverages these observations to compute two test statistics: prediction sensitivity (PS) and attribution sensitivity (AS). It learns thresholds for these metrics from benign samples during training and uses them to detect adversarial samples at test time.
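To make this concrete, here is a minimal sketch of how the two statistics and the benign-only thresholds could be computed, assuming a PyTorch classifier `model` that maps inputs to logits. The Gaussian noise scale `sigma`, the zero baseline and step count for Integrated Gradients, the L1 norms, and the two-sided 95% benign interval are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
import torch

def integrated_gradients(model, x, target, steps=32):
    """Plain Integrated Gradients: Riemann-sum path integral from a zero baseline."""
    baseline = torch.zeros_like(x)
    grads = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        score = model(point).gather(1, target.unsqueeze(1)).sum()
        grads = grads + torch.autograd.grad(score, point)[0]
    return (x - baseline) * grads / steps

def pasa_statistics(model, x, sigma=0.05):
    """Return per-sample (prediction sensitivity, attribution sensitivity)."""
    noisy = x + sigma * torch.randn_like(x)      # Gaussian noise probe
    with torch.no_grad():
        logits, logits_noisy = model(x), model(noisy)
    target = logits.argmax(dim=1)

    # Prediction sensitivity (PS): L1 change in logits under the noise probe.
    ps = (logits - logits_noisy).abs().sum(dim=1)

    # Attribution sensitivity (AS): L1 change in the IG attribution vector.
    attr = integrated_gradients(model, x, target)
    attr_noisy = integrated_gradients(model, noisy, target)
    a_s = (attr - attr_noisy).abs().flatten(1).sum(dim=1)
    return ps, a_s.detach()

def fit_benign_intervals(model, benign_loader, sigma=0.05, fpr=0.05):
    """Learn per-statistic thresholds from benign samples only (unsupervised)."""
    ps_all, as_all = [], []
    for x, _ in benign_loader:
        ps, a_s = pasa_statistics(model, x, sigma=sigma)
        ps_all.append(ps.cpu().numpy())
        as_all.append(a_s.cpu().numpy())
    q = [fpr / 2, 1 - fpr / 2]                   # two-sided benign interval
    return (np.quantile(np.concatenate(ps_all), q),
            np.quantile(np.concatenate(as_all), q))

def flag_adversarial(model, x, ps_iv, as_iv, sigma=0.05):
    """Flag samples whose statistics fall outside the benign intervals."""
    ps, a_s = pasa_statistics(model, x, sigma=sigma)
    ps, a_s = ps.cpu().numpy(), a_s.cpu().numpy()
    outside = lambda v, iv: (v < iv[0]) | (v > iv[1])
    return outside(ps, ps_iv) | outside(a_s, as_iv)
```

Because the paper's observations show the direction of the benign-adversarial gap varying across datasets (see the statistics below), this sketch flags any sample whose statistics fall outside a two-sided benign interval rather than committing to one direction.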

PASA is evaluated on five datasets (MNIST, CIFAR-10, CIFAR-100, ImageNet, CIC-IDS2017) and five network architectures (MLP, LeNet, VGG16, ResNet, MobileNet). On average, PASA outperforms state-of-the-art unsupervised adversarial detectors by 14% on CIFAR-10, 4% on CIFAR-100, and 35% on ImageNet. PASA also demonstrates competitive performance even when the adversary is aware of the defense mechanism.

Statistics
- On MNIST, the difference in logits between a benign image and its noisy counterpart is smaller than the corresponding difference for an adversarial image.
- On CIFAR-10, the difference in logits between a benign image and its noisy counterpart is larger than the corresponding difference for an adversarial image.
- On MNIST, the difference in Integrated Gradients attribution vectors between a benign image and its noisy counterpart is smaller than the corresponding difference for an adversarial image.
- On CIFAR-10, the difference in Integrated Gradients attribution vectors between a benign image and its noisy counterpart is larger than the corresponding difference for an adversarial image.
Quotes
"We observe that both model predictions and feature attributions for input samples are sensitive to noise." "Examining these discrepancies in model prediction and feature attribution of benign and adversarial samples subjected to additional perturbation can effectively detect adversarial attacks, and ensure the security of systems incorporating deep learning models."

Key insights distilled from

by Dipkamal Bhu... at arxiv.org, 04-18-2024

https://arxiv.org/pdf/2404.10789.pdf
PASA: Attack Agnostic Unsupervised Adversarial Detection using Prediction & Attribution Sensitivity Analysis

Deeper Inquiries

How would PASA's performance be affected if the adversary has knowledge of the detector and can craft adversarial samples to bypass it?

If the adversary knows the detector and can craft samples specifically to bypass it, PASA's performance would likely degrade. The adversary could shape adversarial samples so that model prediction and feature attribution change as little as possible when noise is added, making them harder to distinguish from benign inputs, and could also target the thresholds learned during training to slip under them. To mitigate this, PASA could adjust its thresholds dynamically or be combined with additional layers of defense against such adaptive adversaries.
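As a concrete illustration of that strategy, an adaptive attacker could add a penalty on the prediction-sensitivity statistic to a standard attack objective. The sketch below is hypothetical: it recomputes prediction sensitivity differentiably (the detector sketch above computes it under `no_grad`) and omits attribution sensitivity, since differentiating through Integrated Gradients is substantially more expensive; the weight `lam` is illustrative.

```python
import torch
import torch.nn.functional as F

def adaptive_objective(model, x_adv, y_true, lam=1.0, sigma=0.05):
    """Attack objective that also suppresses PASA-style prediction sensitivity."""
    noisy = x_adv + sigma * torch.randn_like(x_adv)
    # Differentiable prediction sensitivity: L1 change in logits under noise.
    ps = (model(x_adv) - model(noisy)).abs().sum(dim=1).mean()
    # Minimizing this maximizes misclassification while keeping PS small.
    return -F.cross_entropy(model(x_adv), y_true) + lam * ps
```

The attacker would minimize this objective over `x_adv` with a gradient-based method such as PGD, trading attack success against detector evasion via `lam`.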

Can PASA be extended to detect adversarial attacks in other domains beyond image classification, such as natural language processing or time series analysis?

PASA's approach could plausibly extend to domains beyond image classification, such as natural language processing (NLP) or time series analysis. In NLP, it could analyze the sensitivity of model predictions and feature attributions for text inputs, for instance by adding noise in embedding space since raw text is discrete, to identify discrepancies between benign and adversarial samples. For time series, it could examine how model behavior and feature importance change when noisy perturbations are injected into the series. Adapted this way, the methodology could offer a similarly attack-agnostic defense in these applications, though the noise model and attribution method would need domain-specific validation.

What are the potential limitations of using noise as a probing mechanism, and how could the PASA approach be further improved to address these limitations?

Using noise as a probing mechanism has limitations, chief among them sensitivity to the choice of noise parameters: the spread of the probe noise must be selected carefully during training, and a value tuned for one dataset or attack scenario may not transfer to another, so per-setting fine-tuning can be needed for optimal detection. To address this, PASA could adopt adaptive noise generation that adjusts the noise parameters to the characteristics of each input (one possible form is sketched below), and incorporating ensemble methods or multiple complementary detection mechanisms could further harden it against adversarial attacks.
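As one concrete reading of the "adaptive noise generation" suggestion, the probe's spread could be derived per sample from the input's own statistics rather than a single global value. The helper below is purely hypothetical and not part of the paper:

```python
import torch

def adaptive_noise_probe(x, scale=0.1):
    """Scale the probe noise per sample by that input's own standard deviation."""
    # sigma_i = scale * std(x_i), broadcast back over the feature dimensions.
    sigma = scale * x.flatten(1).std(dim=1).view(-1, *([1] * (x.dim() - 1)))
    return x + sigma * torch.randn_like(x)
```

Such a probe would replace the fixed-`sigma` noise in the earlier sketches; whether it actually improves detection would need empirical validation.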