Sim-CLIP, an unsupervised adversarial fine-tuning method, hardens the CLIP vision encoder used in Vision-Language Models, improving resilience to adversarial attacks while preserving the semantic richness needed for strong downstream performance.
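A minimal numpy sketch of the unsupervised adversarial fine-tuning loop: an attack pushes the input's embedding away from the clean embedding, then the encoder is updated to pull the two embeddings back together. The toy linear encoder, one-step sign attack, squared embedding distance, and all hyperparameters here are illustrative assumptions, not Sim-CLIP's actual architecture or loss.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16)) * 0.1      # toy linear "encoder": R^16 -> R^8

def encode(W, x):
    return W @ x

def attack(W, x, eps=0.5):
    """One-step sign attack that pushes the embedding away from clean."""
    # gradient of ||W x' - W x||^2 is zero exactly at x' = x, so take a
    # tiny random start before the sign step (a standard trick).
    x_start = x + rng.normal(scale=1e-3, size=x.shape)
    grad = 2.0 * W.T @ (encode(W, x_start) - encode(W, x))
    return x + eps * np.sign(grad)

def finetune_step(W, x, lr=0.2):
    """Update the encoder to pull adversarial and clean embeddings together."""
    x_adv = attack(W, x)
    diff = encode(W, x_adv) - encode(W, x)   # embedding gap
    loss = float(diff @ diff)                # squared embedding distance
    W = W - lr * np.outer(diff, x_adv - x)   # gradient step shrinking the gap
    return W, loss

x = rng.normal(size=16)
losses = []
for _ in range(50):
    W, loss = finetune_step(W, x)
    losses.append(loss)
# the clean/adversarial embedding gap shrinks over the fine-tuning steps
```

The unsupervised part is that no labels appear anywhere: the training signal is purely the agreement between clean and perturbed embeddings.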
Adversarial attacks on Vision-Language Models can be mitigated by leveraging text-guided attention to refine and constrain the model during adversarial fine-tuning, leading to improved zero-shot robustness without sacrificing clean accuracy.
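A hedged sketch of one way such a text-guided attention constraint can be formed: the attention map over image patches is the softmax of patch-feature/text-embedding similarities, and fine-tuning penalizes the distance between the map for an adversarial image and the map for the clean image. The shapes, the mean-free softmax, and the L1 penalty are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def text_guided_attention(patch_feats, text_emb):
    """Attention over image patches, guided by the text embedding."""
    sims = patch_feats @ text_emb            # (num_patches,) similarities
    return softmax(sims)                     # normalized attention map

def attention_alignment_loss(adv_patch_feats, clean_patch_feats, text_emb):
    """L1 distance between adversarial and clean attention maps."""
    a_adv = text_guided_attention(adv_patch_feats, text_emb)
    a_clean = text_guided_attention(clean_patch_feats, text_emb)
    return float(np.abs(a_adv - a_clean).sum())

rng = np.random.default_rng(1)
patches = rng.normal(size=(49, 32))          # e.g. 7x7 patches, 32-dim features
text = rng.normal(size=32)
identical = attention_alignment_loss(patches, patches, text)   # 0.0
noisy = patches + 0.1 * rng.normal(size=patches.shape)
perturbed = attention_alignment_loss(noisy, patches, text)     # > 0
```

Adding this term to the adversarial fine-tuning objective constrains the perturbed image to attend to the same text-relevant regions as the clean one.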
Integrating adversarial training with ensemble learning methods like XGBoost and LightGBM significantly improves the robustness of Vision-Language Models (VLMs) against various adversarial attacks.
The authors examine how sensitive adversarial robustness in Vision-Language Models is to the wording of text prompts. Their proposed Adversarial Prompt Tuning (APT) yields significant gains in both accuracy and robustness by adding just one learned word to the prompt.
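A toy sketch of the prompt-tuning mechanics behind this idea: the prompt for each class is its class embedding plus one shared learnable context vector, and only that vector is optimized. The mean-pooled "prompt encoder", the squared-error objective, and the hand-derived gradient step are assumptions for illustration; the adversarial-training part of APT is omitted here.

```python
import numpy as np

rng = np.random.default_rng(2)
dim, num_classes = 16, 5
class_embs = rng.normal(size=(num_classes, dim))
ctx = np.zeros(dim)                          # the single learned "word"

def prompt_feature(class_emb, ctx):
    # toy prompt encoder: mean of class embedding and context vector
    return (class_emb + ctx) / 2.0

def tune_ctx(ctx, class_embs, img_feat, label, lr=0.5, steps=100):
    """Fit ctx so the labeled class's prompt matches an image feature."""
    for _ in range(steps):
        pred = prompt_feature(class_embs[label], ctx)
        ctx = ctx - lr * (pred - img_feat)   # gradient step on ||pred - img||^2
    return ctx

img = rng.normal(size=dim)
label = 3
before = np.linalg.norm(prompt_feature(class_embs[label], ctx) - img)
ctx = tune_ctx(ctx, class_embs, img, label)
after = np.linalg.norm(prompt_feature(class_embs[label], ctx) - img)
# after << before: the one learned vector absorbs the residual
```

The point the sketch makes is how few parameters are touched: everything is frozen except one context vector shared across classes.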
The authors propose PMG-AFT (Pre-trained Model Guided Adversarial Fine-Tuning), which improves zero-shot adversarial robustness by adding supervision from the frozen pre-trained model and from clean examples, retaining generalizable features while mitigating overfitting to the attack.
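A hedged sketch of the loss composition such guidance implies: the total objective combines the adversarial branch's task loss with (a) a distance between the fine-tuned model's adversarial features and the frozen pre-trained model's features, and (b) a distance to the clean-example features. The squared distance and the weights are illustrative assumptions, not the paper's exact terms.

```python
import numpy as np

def sq_dist(a, b):
    return float(np.sum((a - b) ** 2))

def guided_aft_loss(task_loss_adv, feat_adv, feat_pretrained, feat_clean,
                    alpha=1.0, beta=1.0):
    """Adversarial task loss plus pre-trained-model and clean-feature guidance."""
    guide = sq_dist(feat_adv, feat_pretrained)   # stay near pre-trained features
    clean = sq_dist(feat_adv, feat_clean)        # stay near clean-example features
    return task_loss_adv + alpha * guide + beta * clean

rng = np.random.default_rng(3)
f_adv = rng.normal(size=8)       # fine-tuned features of the adversarial image
f_pre = rng.normal(size=8)       # frozen pre-trained model's features
f_clean = rng.normal(size=8)     # features of the clean image
total = guided_aft_loss(0.7, f_adv, f_pre, f_clean)
```

The two guidance terms act as regularizers: the adversarial branch cannot drift far from what the pre-trained model already generalizes well on.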