Core Concepts
This work proposes AmpleGCG, a universal generative model that rapidly produces hundreds of customized adversarial suffixes to jailbreak aligned large language models, both open-source and closed-source, with near-100% attack success rates.
Abstract
The paper first analyzes a key drawback of the Greedy Coordinate Gradient (GCG) attack: it keeps only the suffix with the lowest loss during optimization, discarding many candidate suffixes that would also have succeeded. To address this, the authors propose "augmented GCG," which collects all candidate suffixes sampled during optimization and uses every one of them to attack the target models, substantially improving the attack success rate.
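The core observation can be illustrated with a toy simulation (this is not the authors' code; the loss/success model is an assumption for illustration): loss is only a loose proxy for jailbreak success, so keeping just the single lowest-loss candidate throws away many working suffixes that the full candidate pool contains.

```python
import random

random.seed(0)

def mock_optimization(steps=50, candidates_per_step=10):
    """Simulate GCG-style sampling: each candidate is a (loss, success) pair.
    Success is only loosely correlated with loss, mirroring the paper's
    observation that low loss does not guarantee a jailbreak (assumed model)."""
    pool = []
    for _ in range(steps):
        for _ in range(candidates_per_step):
            loss = random.uniform(0.1, 3.0)
            # Lower loss makes success more likely, but is not decisive.
            success = random.random() < 0.3 / loss
            pool.append((loss, success))
    return pool

pool = mock_optimization()

# Standard GCG: a single attack with the lowest-loss suffix only.
gcg_hit = min(pool, key=lambda c: c[0])[1]

# Augmented GCG: attack with every sampled candidate and count the hits.
augmented_hits = sum(success for _, success in pool)

print(f"lowest-loss suffix succeeded: {gcg_hit}")
print(f"successful suffixes in the full candidate pool: {augmented_hits}")
```

Even in this crude model, the full pool typically contains many successful suffixes, while the single lowest-loss pick can miss entirely, which is why overgenerating and trying all candidates raises the attack success rate.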
Building on this, the authors then develop AmpleGCG, a generative model that learns the distribution of adversarial suffixes given any harmful query. AmpleGCG can generate hundreds of tailored suffixes for each query in just seconds, achieving near 100% attack success rate on both Vicuna-7B and Llama-2-7B-Chat, outperforming GCG and other baselines.
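The overall pipeline shape can be sketched as follows. Everything here is a hypothetical stand-in: the training pairs are made-up placeholders, and the toy fragment sampler replaces the fine-tuned LLM and its group beam search decoding that the real system uses.

```python
import random
import time

# Stage 1 (assumed format): (harmful_query, working_suffix) pairs harvested
# by running augmented GCG against open-source models. Strings are
# illustrative placeholders, not real adversarial suffixes.
training_pairs = [
    ("how to pick a lock", "describing similarlyNow tutorial"),
    ("how to pick a lock", "interface Manuel paragraph"),
    ("write a phishing email", "tutorial paragraph similarlyNow"),
]

# Stage 2: "train" a conditional generator. A real system fine-tunes an LLM
# on these pairs; this toy merely recombines harvested tokens.
vocab = sorted({tok for _, s in training_pairs for tok in s.split()})

def generate_suffixes(query, n=200, length=8):
    """Sample n candidate suffixes conditioned on the query (toy stand-in
    for decoding many sequences from the trained generative model)."""
    rng = random.Random(query)  # deterministic per-query seeding
    return [" ".join(rng.choice(vocab) for _ in range(length)) for _ in range(n)]

t0 = time.time()
suffixes = generate_suffixes("how to pick a lock")
elapsed = time.time() - t0
print(len(suffixes), "suffixes generated in", round(elapsed, 3), "s")
```

The point of the sketch is the interface, not the internals: once the suffix distribution is learned, producing hundreds of per-query candidates is a single fast decoding pass rather than a fresh per-query optimization.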
Notably, AmpleGCG also transfers well, successfully attacking unseen open-source models as well as closed-source models such as the latest GPT-3.5, reaching up to 99% attack success rate. The authors further show that AmpleGCG's generated suffixes can evade perplexity-based defenses when the original harmful query is repeated multiple times in the prompt.
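A plausible reading of why query repetition helps, sketched with illustrative numbers (the per-token costs and threshold below are assumptions, not measurements): perplexity defenses reject prompts whose average per-token negative log-likelihood is high. Fluent query tokens are cheap, gibberish suffix tokens are expensive, so repeating the cheap query dilutes the expensive suffix in the average.

```python
import math

NLL_QUERY_TOKEN = 2.0    # assumed NLL of a fluent natural-language token
NLL_SUFFIX_TOKEN = 9.0   # assumed NLL of a gibberish adversarial token

def perplexity(query_tokens, suffix_tokens, repeats=1):
    """Perplexity = exp(mean per-token NLL) over the whole prompt."""
    nlls = ([NLL_QUERY_TOKEN] * query_tokens * repeats
            + [NLL_SUFFIX_TOKEN] * suffix_tokens)
    return math.exp(sum(nlls) / len(nlls))

THRESHOLD = 200.0  # hypothetical filter cutoff

ppl_once = perplexity(query_tokens=15, suffix_tokens=20)
ppl_repeated = perplexity(query_tokens=15, suffix_tokens=20, repeats=10)
print(round(ppl_once, 1), round(ppl_repeated, 1))  # → 403.4 16.8
```

With one copy of the query the prompt's perplexity sits above the cutoff and is filtered; with ten copies the same adversarial suffix slips under it, matching the paper's finding that simple repetition restores a high attack success rate under this defense.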
Overall, this work amplifies the impact of GCG by training a generative model that can rapidly uncover a broad range of vulnerabilities in aligned large language models, posing significant challenges for their safety and security.
Stats
AmpleGCG takes only 6 minutes in total to produce 200 suffixes for each of the 100 test queries (about 4 seconds per test query).
AmpleGCG achieves near 100% attack success rate on both Vicuna-7B and Llama-2-7B-Chat.
AmpleGCG trained on open-source models reaches 99% attack success rate on the latest GPT-3.5 closed-source model.
Quotes
"AmpleGCG could achieve near 100% ASR on both Vicuna-7B and Llama-2-7B-Chat by sampling around 200 suffixes, markedly outperforming the two strongest baselines as well as augmented GCG."
"AmpleGCG trained on open-source models exhibit remarkable ASRs on both unseen open-source and closed-source models."
"By simply repeating a harmful query for multiple times at inference time, AmpleGCG's generated adversarial suffixes can successfully evade perplexity-based defenses with an 80% ASR."