The paper first analyzes a drawback of the Greedy Coordinate Gradient (GCG) attack: it keeps only the suffix with the lowest loss during optimization, so many suffixes that would have succeeded are discarded. To address this, the authors propose "augmented GCG", which collects all candidate suffixes sampled during optimization and uses every one of them to attack the target models, substantially improving the attack success rate.
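The gap between the two selection rules can be illustrated with a small bookkeeping sketch. It is not the authors' code: the `Candidate` record, the `jailbroken` label (assumed to come from some external judge), and all numbers are hypothetical, made-up values chosen only to show how a query counts as a success under each rule.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One suffix sampled at some optimization step (hypothetical record)."""
    step: int
    loss: float
    jailbroken: bool  # assumed label from an external judge


def lowest_loss_success(cands):
    # Plain GCG rule: only the lowest-loss suffix is kept and tried.
    best = min(cands, key=lambda c: c.loss)
    return best.jailbroken


def any_candidate_success(cands):
    # "Augmented" rule: every sampled suffix is tried; one hit is enough.
    return any(c.jailbroken for c in cands)


def attack_success_rate(per_query, criterion):
    # Fraction of queries that count as a success under the given rule.
    hits = sum(1 for cands in per_query.values() if criterion(cands))
    return hits / len(per_query)


# Toy, made-up records: low loss does not always coincide with success.
records = {
    "query_a": [Candidate(0, 2.1, False), Candidate(1, 1.4, True), Candidate(2, 0.9, False)],
    "query_b": [Candidate(0, 1.8, False), Candidate(1, 0.7, True)],
    "query_c": [Candidate(0, 2.5, False), Candidate(1, 2.2, False)],
}

asr_lowest = attack_success_rate(records, lowest_loss_success)
asr_all = attack_success_rate(records, any_candidate_success)
print(f"lowest-loss only: {asr_lowest:.2f}, all candidates: {asr_all:.2f}")
```

In this toy data, `query_a` has a successful suffix at an intermediate step that the lowest-loss rule never tries, which is the kind of missed success the summary describes.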
Building on this, the authors then develop AmpleGCG, a generative model that learns the distribution of adversarial suffixes given any harmful query. AmpleGCG can generate hundreds of tailored suffixes for each query in just seconds, achieving near 100% attack success rate on both Vicuna-7B and Llama-2-7B-Chat, outperforming GCG and other baselines.
Notably, AmpleGCG also demonstrates strong transferability: its suffixes successfully attack unseen open-source models as well as closed-source models such as the latest GPT-3.5, with an attack success rate of up to 99%. The authors further show that AmpleGCG's generated suffixes can evade perplexity-based defenses when the original query is repeated in the prompt.
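Why repetition affects a perplexity-based defense can be seen from the definition of perplexity as the exponential of the mean negative token log-probability. The sketch below is only arithmetic on made-up per-token log-probabilities; the token counts, the values -2.0 and -9.0, and the helper name `perplexity` are assumptions for illustration, not figures or code from the paper, and a real detector would obtain log-probabilities from a language model.

```python
import math


def perplexity(token_logprobs):
    # Perplexity = exp of the mean negative log-probability per token.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token log-probabilities (natural log), not measured values:
query = [-2.0] * 10   # fluent natural-language query, fairly predictable
suffix = [-9.0] * 20  # unnatural suffix tokens, very unpredictable

ppl_plain = perplexity(query + suffix)
ppl_repeated = perplexity(query * 4 + suffix)  # query repeated four times

print(f"query + suffix:     {ppl_plain:.1f}")
print(f"4 x query + suffix: {ppl_repeated:.1f}")
```

Because the average is taken over all tokens, adding more fluent tokens dilutes the contribution of the suffix and lowers the overall score, so a detector that thresholds whole-prompt perplexity may no longer flag the input.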
Overall, this work amplifies the impact of GCG by training a generative model that can rapidly uncover a broad range of vulnerabilities in aligned large language models, posing significant challenges for their safety and security.