The paper explores how in-context demonstrations can manipulate the alignment of large language models (LLMs). It proposes two complementary methods:
In-Context Attack (ICA): by prepending a few demonstrations of the model complying with malicious prompts, ICA steers the model toward generating harmful content for new malicious prompts (see the prompt-construction sketch after these method descriptions). Experiments show that ICA significantly increases the attack success rate compared to optimization-based attacks, while bypassing perplexity-based defenses, since its prompts remain fluent natural language.
In-Context Defense (ICD): by prepending demonstrations of the model refusing harmful prompts, ICD strengthens the model's robustness against adversarial attacks. ICD outperforms existing defense methods at reducing the attack success rate, even under white-box adaptive attacks.
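Concretely, both methods reduce to prepending a small set of demonstration pairs to the conversation before the new query: compliant answers to harmful prompts for ICA, refusals for ICD. The sketch below is a minimal illustration of that prompt construction, assuming a chat-style message format; the function name and placeholder strings are hypothetical, not the paper's reference implementation.

```python
# Minimal sketch of in-context prompt construction for ICA/ICD.
# The message schema ({"role": ..., "content": ...}) and all placeholder
# strings are illustrative assumptions, not taken from the paper.

from typing import Dict, List, Tuple

Message = Dict[str, str]


def build_in_context_prompt(demos: List[Tuple[str, str]], query: str) -> List[Message]:
    """Prepend demonstration (prompt, response) pairs to a new query as chat
    messages. ICA uses pairs where the assistant complies with a malicious
    prompt; ICD uses pairs where the assistant refuses one."""
    messages: List[Message] = []
    for demo_prompt, demo_response in demos:
        messages.append({"role": "user", "content": demo_prompt})
        messages.append({"role": "assistant", "content": demo_response})
    messages.append({"role": "user", "content": query})
    return messages


if __name__ == "__main__":
    # Placeholder ICD-style demonstrations (safe refusals); the paper uses a
    # small number of curated pairs in this spirit.
    icd_demos = [
        ("<harmful request #1>", "I'm sorry, but I can't help with that."),
        ("<harmful request #2>", "I can't assist with that request."),
    ]
    print(build_in_context_prompt(icd_demos, "<incoming user query>"))
```

The same builder serves both roles; only the demonstration pairs change, which is what keeps the approach cheap and reusable across queries.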
The paper gives intuitive explanations for the effectiveness of ICA and ICD based on the view of in-context learning as implicit fine-tuning. It also discusses their practical advantages, such as computational efficiency, universality, and naturalness, which make them easy to deploy. Overall, the paper sheds light on the potential of in-context demonstrations to influence LLM behavior and offers new insights for advancing the safety and security of LLMs.
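For intuition about the implicit fine-tuning view mentioned above, a rough restatement of the standard argument from the in-context-learning-as-meta-optimization literature (which the paper builds on) is sketched below; it assumes linear attention, and the notation is a simplification rather than the paper's own derivation.

```latex
% Linear-attention sketch of in-context learning as an implicit weight update.
% Assumptions: softmax attention relaxed to linear attention; X holds the
% query's own context tokens, X_demo the demonstration tokens, q the query vector.
\[
  F(q)
  = W_V \,[X_{\text{demo}};\, X]\,\bigl(W_K\,[X_{\text{demo}};\, X]\bigr)^{\!\top} q
  = \underbrace{W_V X (W_K X)^{\top}}_{\text{zero-shot term}}\, q
  \;+\; \underbrace{W_V X_{\text{demo}} (W_K X_{\text{demo}})^{\top}}_{\Delta W_{\text{ICL}}}\, q .
\]
```

Under this reading, the term $\Delta W_{\text{ICL}}$ behaves like a weight update computed from the demonstrations, analogous to a gradient-descent step on a linear layer: compliant demonstrations (ICA) implicitly push this update toward harmful completions, while refusal demonstrations (ICD) push it toward rejection.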
Key insights distilled from the source content at arxiv.org (by Zeming Wei et al., 05-07-2024): https://arxiv.org/pdf/2310.06387.pdf