Core Concepts
This research paper introduces DeGCG, a novel two-stage transfer learning framework that significantly improves the efficiency of adversarial suffix-based attacks on aligned large language models by decoupling the search process and leveraging the transferability of adversarial suffixes.
Abstract
Bibliographic Information:
Liu, H., Xie, Y., Wang, Y., & Shieh, M. (2024). Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models. arXiv preprint arXiv:2408.14866v2.
Research Objective:
This paper addresses the computational inefficiency of existing gradient-based adversarial suffix attacks on large language models (LLMs) and proposes a method that accelerates these attacks by exploiting the transferability of adversarial suffixes.
Methodology:
The authors propose DeGCG, a two-stage transfer learning framework that decouples the adversarial suffix search process into:
- First-Token Searching (FTS): This stage searches for a universal suffix that elicits a non-refusal response by optimizing the first target token, independent of the specific malicious request.
- Content-Aware Searching (CAS): This stage fine-tunes the suffix found by FTS against behavior-relevant targets to craft a potent adversarial suffix.
The authors further introduce i-DeGCG, an interleaved variant of DeGCG, which dynamically alternates between FTS and CAS for enhanced performance. They evaluate their approach on the HarmBench dataset across various open-source LLMs, comparing it against baseline GCG attack methods.
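Conceptually, the two-stage search can be sketched in Python as below. This is a minimal toy under stated assumptions, not the paper's implementation: the loss functions are random stand-ins for losses computed from the target LLM's logits, candidate substitutions are sampled uniformly rather than selected via gradients as in the GCG-based search the paper builds on, and the names `first_token_loss`, `full_target_loss`, and `coordinate_step` are ours.

```python
import random

VOCAB_SIZE = 1000
SUFFIX_LEN = 20

def first_token_loss(suffix: list[int]) -> float:
    """Toy stand-in for -log p(affirmative first token | prompt + suffix).
    In the real attack this comes from the target LLM's logits and is
    aggregated over behaviors so that the suffix stays universal."""
    return random.Random(hash(tuple(suffix))).random()

def full_target_loss(suffix: list[int], behavior: str) -> float:
    """Toy stand-in for cross-entropy over the full behavior-specific target."""
    return random.Random(hash((tuple(suffix), behavior))).random()

def coordinate_step(suffix, loss_fn, n_candidates=32):
    """One GCG-style step: try single-token substitutions, keep the best.
    (The real GCG ranks candidate substitutions by gradient, not at random.)"""
    best, best_loss = suffix, loss_fn(suffix)
    for _ in range(n_candidates):
        cand = suffix.copy()
        cand[random.randrange(len(cand))] = random.randrange(VOCAB_SIZE)
        cand_loss = loss_fn(cand)
        if cand_loss < best_loss:
            best, best_loss = cand, cand_loss
    return best, best_loss

def degcg(behaviors, fts_steps=50, cas_steps=50):
    suffix = [random.randrange(VOCAB_SIZE) for _ in range(SUFFIX_LEN)]
    # Stage 1: First-Token Searching (FTS), behavior-agnostic.
    for _ in range(fts_steps):
        suffix, _ = coordinate_step(suffix, first_token_loss)
    # Stage 2: Content-Aware Searching (CAS), fine-tuned from the FTS suffix.
    results = {}
    for behavior in behaviors:
        s = suffix.copy()
        for _ in range(cas_steps):
            s, _ = coordinate_step(s, lambda x: full_target_loss(x, behavior))
        results[behavior] = s
    return results

if __name__ == "__main__":
    degcg(["behavior_a", "behavior_b"])
```

The structural point the sketch captures is that FTS produces one behavior-agnostic suffix that initializes every behavior-specific CAS run, which is where the framework's transferability and efficiency gains come from.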
Key Findings:
- DeGCG significantly outperforms baseline GCG attacks in terms of attack success rate (ASR) across various LLMs and under different transfer learning scenarios (cross-model, cross-data, and self-transfer).
- The decoupled search process in DeGCG, separating first-token optimization from content-aware fine-tuning, markedly improves search efficiency (see the objective sketch after this list).
- The i-DeGCG variant, with its interleaved FTS and CAS stages, further improves performance, particularly in larger search spaces.
- Adversarial suffixes exhibit strong transferability across different LLMs and datasets, enabling efficient attacks even when transferring from a source model to a different target model.
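To make the decoupling concrete, the contrast between the two search goals can be sketched as follows (the notation is ours, inferred from the summary, not taken from the paper): the baseline objective is the cross-entropy over the entire target sequence, while FTS restricts the loss to the first target token, and CAS then resumes full-sequence optimization from the FTS suffix.

```latex
% x = harmful prompt, s = adversarial suffix, y_{1..T} = target response.
\mathcal{L}_{\mathrm{full}}(s) = -\sum_{t=1}^{T} \log p_\theta\bigl(y_t \mid x, s, y_{<t}\bigr)
\qquad
\mathcal{L}_{\mathrm{FT}}(s) = -\log p_\theta\bigl(y_1 \mid x, s\bigr)
```

FTS minimizes the first-token loss to obtain a behavior-agnostic suffix; CAS initializes from that suffix and minimizes the full-sequence loss with a behavior-specific target.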
Main Conclusions:
The study demonstrates that adversarial suffix transfer learning is a powerful technique for enhancing the efficiency and effectiveness of jailbreaking attacks on aligned LLMs. The proposed DeGCG framework and its interleaved variant, i-DeGCG, offer significant improvements over existing methods, highlighting the importance of initialization and transferability in adversarial suffix search.
Significance:
This research significantly contributes to the field of LLM security by exposing a critical vulnerability: the susceptibility of aligned LLMs to efficient adversarial suffix attacks. The findings underscore the need for more robust defense mechanisms against such attacks to ensure the safe and reliable deployment of LLMs in real-world applications.
Limitations and Future Research:
The study primarily focuses on open-source LLMs and standard behaviors in text-only datasets. Future research could explore the effectiveness of DeGCG on closed-source models and expand the investigation to include copyright, contextual, and multimodal behaviors. Additionally, a theoretical understanding of adversarial suffix transfer learning warrants further exploration.
Statistics
DeGCG achieves absolute ASR improvements of 9.0 and 9.8 points (validation and test sets, respectively) when transferring from Starling-LM to OpenChat-3.5.
Transferring a suffix from Mistral-Instruct to Llama2-chat yields absolute ASR gains of 22.2 and 9.4 points on the validation and test sets.
DeGCG more than doubles ASR (over 100% relative improvement) on Llama2-chat-7b in the self-transfer setting, where the target model is identical to the source model.
With a suffix length of 100, i-DeGCG achieves ASRs of 65.9 and 52.2 on Llama2-chat and 95.1 and 90.6 on OpenChat-3.5 (validation and test sets, respectively).
DeGCG reaches a near-zero first-token (FT) loss within 100 steps, whereas the FT loss of GCG-M remains above 10 over the same number of steps.
With self-repetition initialization in larger search spaces, ASR rises from 21.7 to 68.3 on the validation set and from 19.5 to 54.7 on the test set.
Quotes
"These adversarial suffixes consist of random tokens and are generally not comprehensible to humans."
"However, deriving these suffixes through gradient-based searching is computationally inefficient."
"Our empirical investigation has identified the importance of optimizing the first target token loss."
"We attribute the inefficiency in searching to the cross-entropy optimization goal applied to the entire target sentence."
"In this framework, we link transfer learning with searching efficiency."