This paper presents a comprehensive study on the adversarial robustness of vision-language models, particularly the CLIP model, under different types of attacks. The key findings are:
Multimodal adversarial training, which aligns adversarial visual features with clean text embeddings and adversarial text embeddings with clean visual features, can significantly enhance the adversarial robustness of CLIP against both image- and text-based attacks.
Image attacks tend to be more potent than text attacks, but as the number of categories in a dataset increases, text attacks become progressively stronger.
Fine-tuning the CLIP model, even with clean data or solely against image-based attacks, can improve its overall adversarial robustness.
For out-of-distribution generalization, adversarial robustness against multimodal and image attacks grows with the size of the training set. The proposed multimodal adversarial training method performs strongly, especially in few-shot scenarios.
The two contrastive losses in the proposed multimodal adversarial training framework work synergistically to enhance both clean accuracy and robust accuracy under multimodal and image attacks (a sketch of this two-loss setup follows these findings).
Increasing the number of fine-tuned parameters and the strength of adversarial perturbations can further impact the model's adversarial robustness.
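To make the training objective concrete, the following is a minimal PyTorch-style sketch of one two-loss multimodal adversarial training step. The `image_encoder`/`text_encoder` towers, the InfoNCE-style contrastive loss, the PGD image attack settings (`eps`, `alpha`, `steps`, `temperature`), and the assumption that adversarial token sequences are supplied by a separate text attack are all illustrative assumptions, not the paper's exact recipe.

```python
# Sketch of multimodal adversarial training: pair adversarial images with
# clean text and adversarial text with clean images via two contrastive losses.
import torch
import torch.nn.functional as F

def info_nce(query, key, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings (assumed form)."""
    query = F.normalize(query, dim=-1)
    key = F.normalize(key, dim=-1)
    logits = query @ key.t() / temperature
    labels = torch.arange(query.size(0), device=query.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

def pgd_image_attack(image_encoder, text_emb, images, eps=4/255, alpha=1/255, steps=3):
    """Craft image perturbations that break image-text alignment (illustrative PGD)."""
    adv = images.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = info_nce(image_encoder(adv), text_emb.detach())
        grad, = torch.autograd.grad(loss, adv)
        adv = adv.detach() + alpha * grad.sign()           # ascend the alignment loss
        adv = images + (adv - images).clamp(-eps, eps)     # project onto the eps-ball
        adv = adv.clamp(0, 1)                              # keep valid pixel range
    return adv.detach()

def multimodal_at_step(image_encoder, text_encoder, images, token_ids, adv_token_ids):
    """One training step: two contrastive losses over clean/adversarial pairs."""
    txt_clean = text_encoder(token_ids)
    txt_adv = text_encoder(adv_token_ids)        # e.g. from a word-substitution attack
    adv_images = pgd_image_attack(image_encoder, txt_clean, images)
    img_clean = image_encoder(images)
    img_adv = image_encoder(adv_images)
    loss_img = info_nce(img_adv, txt_clean)      # adversarial images <-> clean text
    loss_txt = info_nce(txt_adv, img_clean)      # adversarial text <-> clean images
    return loss_img + loss_txt
```

In this sketch the image perturbation is crafted against the clean text embeddings and the two InfoNCE terms are simply summed; the paper may weight or schedule the two losses differently.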