The authors introduce DRA, a novel jailbreak method that conceals a harmful instruction through disguise and prompts the model to reconstruct it within its own completion, exploiting biases inherent in the fine-tuning process of large language models.
As DrAttack demonstrates, decomposing a prompt into sub-prompts and having the model reconstruct it can effectively jailbreak large language models, concealing malicious intent and increasing attack success rates.