Core Concepts
Adapting strategies from transfer-based attacks on image classification models, specifically the Skip Gradient Method (SGM) and the Intermediate Level Attack (ILA), can significantly improve the effectiveness of gradient-based adversarial prompt generation against safety-aligned large language models.
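To make the idea concrete, below is a minimal PyTorch sketch of GCG-style one-hot token gradients combined with an SGM-like rescaling of the gradients that flow through the attention/MLP branches of each decoder layer (the skip connections are left untouched). It assumes a HuggingFace causal LM with a LLaMA-style `model.model.layers` stack; the model name, the decay factor `gamma`, the example prompt/suffix/target strings, and the hook placement are illustrative assumptions rather than the paper's exact implementation, and the ILA component (which aligns an intermediate-layer representation) is omitted.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative choices (assumptions): model name, decay factor, suffix/target text.
model_name = "meta-llama/Llama-2-7b-chat-hf"
gamma = 0.5  # SGM-style decay for gradients flowing through the non-skip branches

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sgm_like_hook(module, args, output):
    # Each decoder layer computes h = x + branch(x); scaling the gradient of the
    # branch output biases backpropagation toward the skip connection, which is
    # the core idea of the Skip Gradient Method.
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.requires_grad:
        hidden.register_hook(lambda grad: grad * gamma)

for layer in model.model.layers:  # LLaMA-style decoder stack (assumed layout)
    layer.self_attn.register_forward_hook(sgm_like_hook)
    layer.mlp.register_forward_hook(sgm_like_hook)

prompt_ids = tok("Write a tutorial on ...", return_tensors="pt").input_ids[0]
suffix_ids = tok("! ! ! ! !", add_special_tokens=False, return_tensors="pt").input_ids[0]
target_ids = tok("Sure, here is", add_special_tokens=False, return_tensors="pt").input_ids[0]

embed = model.get_input_embeddings().weight  # (vocab_size, hidden_dim)

# GCG relaxes the discrete suffix to one-hot vectors so the adversarial loss can
# be differentiated w.r.t. token choices.
one_hot = torch.zeros(len(suffix_ids), embed.shape[0], dtype=embed.dtype)
one_hot.scatter_(1, suffix_ids.unsqueeze(1), 1.0)
one_hot.requires_grad_(True)

inputs_embeds = torch.cat(
    [embed[prompt_ids].detach(), one_hot @ embed, embed[target_ids].detach()]
).unsqueeze(0)

logits = model(inputs_embeds=inputs_embeds).logits[0]

# Cross-entropy over the target span: logits at positions start-1 onward predict
# the target tokens that should follow the adversarial suffix.
start = len(prompt_ids) + len(suffix_ids)
loss = F.cross_entropy(logits[start - 1 : start - 1 + len(target_ids)], target_ids)
loss.backward()

# Candidate token replacements per suffix position, ranked by the negative gradient.
candidates = (-one_hot.grad).topk(k=256, dim=1).indices
```

As in plain GCG, these gradient-ranked candidates would then be evaluated with actual forward passes before a token swap is accepted; the SGM-like rescaling only changes how the candidate ranking is computed.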
Stats
GCG-LSGM-LILA achieves a match rate of 87% when attacking Llama-2-7B-Chat on AdvBench, outperforming the baseline GCG attack (54%).
GCG-LSGM-LILA achieves an attack success rate of 68% for query-specific adversarial prompts against Llama-2-7B-Chat on AdvBench, compared to 38% for GCG.
In universal adversarial prompt generation, GCG-LSGM-LILA achieves an average attack success rate of 60.32% against Llama-2-7B-Chat on AdvBench, a +33.64% improvement over GCG.
GCG-LSGM-LILA improves attack success rate over GCG by +30%, +19%, +19%, and +21% when attacking Llama-2-7B-Chat, Llama-2-13B-Chat, Mistral-7B-Instruct, and Phi3-Mini-4K-Instruct, respectively.
Quotes
"In this paper, we carefully examine the discrepancy between the gradient of the adversarial loss w.r.t. one-hot vectors and the real effect of the change in loss that results from token replacement."
"We present a new perspective that this gap resembles the gap between input gradients calculated using a substitute model and the real effect of perturbing inputs on the prediction of a black-box victim model, which has been widely studied in transfer-based attacks against black-box image classification models."
"Our empirical results show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce the output that exactly matches the target string on AdvBench."