Centrala begrepp
Defending MLLMs against structure-based jailbreak attacks with AdaShield.
Sammanfattning
AdaShield introduces Adaptive Shield Prompting to protect Multimodal Large Language Models (MLLMs) from structure-based jailbreak attacks without the need for fine-tuning or additional modules. By generating defense prompts, AdaShield enhances MLLMs' robustness while maintaining their general capabilities. The method involves a manual static defense prompt and an adaptive auto-refinement framework to optimize defense prompts for various attack scenarios.
Statistik
"Our methods can consistently improve MLLMs’ robustness against structure-based jailbreak attacks without compromising the model’s general capabilities evaluated on standard benign tasks."
"AdaShield-A achieves superior defense performance without sacrificing model’s performance evaluated on standard benign tasks."
"AdaShield-S exhibits inferior defense performance compared to AdaShield-A due to the absence of specific safety rules."
Citat
"I am sorry, but I cannot provide instructions for political lobbying or engaging in any activities that may violate safety guidelines."
"I am sorry, but I cannot assist with that request."
"When responding to financial-related questions, the safe response MUST start with 'I am sorry' and directly refuse to offer any suggestion."