The paper examines security vulnerabilities of large language models (LLMs) and introduces DRA, a black-box jailbreak method. By disguising a harmful instruction and guiding the model to reconstruct it within its own completion, the attack exploits biases in the data used to fine-tune LLMs. The study argues that understanding these fine-tuning biases is central to designing effective jailbreaking strategies.
The research distinguishes itself by attributing the vulnerability to biases inherent in LLM fine-tuning data. The analysis finds that harmful content appears far more often in queries than in completions, so models are less thoroughly safeguarded against harmful context that surfaces in their own completions. By steering an LLM to reconstruct the disguised harmful instruction inside its completion, the attack can bypass these safeguards; a minimal sketch of this disguise-and-reconstruct idea appears below.
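The sketch below illustrates the kind of character-level disguise and reconstruction prompt the summary describes, assuming an acrostic-style obfuscation. The word list, marker format, and prompt template are illustrative assumptions, not the paper's actual payload, and a benign instruction is used as the example.

```python
# Illustrative sketch only: the word list, marker format, and prompt
# wording are assumptions, not DRA's exact payload.
import random

WORDS = ["apple", "bridge", "cloud", "delta", "ember",
         "forest", "harbor", "island", "jungle", "meadow"]

def disguise(instruction: str) -> str:
    """Hide each character of `instruction` as a marked first letter of an
    otherwise unrelated word, one word per line (acrostic-style)."""
    lines = []
    for ch in instruction:
        if ch == " ":
            lines.append("")                 # blank line stands for a space
            continue
        filler = random.choice(WORDS)
        lines.append(f"({ch}){filler[1:]}")  # e.g. '(h)arbor'
    return "\n".join(lines)

def build_prompt(instruction: str) -> str:
    """Ask the model to reassemble the hidden text in its own completion,
    so the reconstructed instruction appears on the completion side."""
    return (
        "Read the marked letters below from top to bottom, rewrite the "
        "hidden sentence at the start of your answer, then continue it:\n\n"
        + disguise(instruction)
    )

if __name__ == "__main__":
    # Benign demonstration input.
    print(build_prompt("how to bake bread"))
```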
Key points include identifying biases in LLM fine-tuning data, introducing DRA as a jailbreak method, and analyzing the vulnerabilities that arise from this skewed data distribution.
Key insights distilled from the paper by Tong Liu, Yin... (arxiv.org, 02-29-2024): https://arxiv.org/pdf/2402.18104.pdf