The paper proposes a new type of jailbreak attack on large language models (LLMs) called the "logic-chain injection attack". The key insight is to hide malicious intent inside benign truths, drawing on the social-psychology observation that people are more easily deceived when lies are embedded among truthful statements.
The attack works in three steps: (1) disassemble the malicious intent into a chain of benign, logically connected statements; (2) distribute those statements throughout a benign carrier text; and (3) prompt the LLM so that it reconnects the chain and acts on the hidden intent.
Unlike existing jailbreak attacks that inject malicious prompts directly, this approach does not follow a fixed pattern, which makes it harder to detect. The authors demonstrate two attack instances, a "paragraphed logic chain" and an "acrostic-style logic chain", that hide the malicious intent within otherwise benign text, as sketched below.
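To make the distribution idea concrete, here is a minimal, deliberately harmless sketch in Python under assumptions of my own: a benign three-statement chain is appended, one statement per paragraph, to an unrelated carrier text and then reassembled from the paragraph endings. The names (`embed_paragraphed`, `recover_chain`) and the example texts are illustrative only and do not reproduce the paper's actual prompts or its acrostic construction.

```python
# Toy illustration of the "paragraphed" distribution idea (hypothetical, not the
# authors' exact method). The chain below is a benign stand-in for a logic chain:
# each sentence is innocuous on its own; only the reassembled sequence carries
# the full meaning.

logic_chain = [
    "Paris is the capital of France.",
    "The Louvre is located in Paris.",
    "Therefore, the Louvre is located in the capital of France.",
]

# Benign filler paragraphs acting as the carrier article.
carrier = [
    "Travel writing often mixes history with practical advice.",
    "Museums attract millions of visitors every year.",
    "Guidebooks usually end with a short summary of key facts.",
]

def embed_paragraphed(chain, paragraphs):
    """Append one chain statement to the end of each carrier paragraph."""
    return [f"{para} {stmt}" for para, stmt in zip(paragraphs, chain)]

def recover_chain(article):
    """Reassemble the chain by taking the final sentence of each paragraph."""
    return [para.rsplit(". ", 1)[-1] for para in article]

article = embed_paragraphed(logic_chain, carrier)
print("\n\n".join(article))
print("Recovered chain:", recover_chain(article))
```

The point of the sketch is structural: each embedded sentence looks unremarkable in isolation, so neither a human reviewer nor the model's input filter sees an obviously malicious prompt; only the reconnected chain expresses the hidden intent.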
The paper highlights that this attack can deceive both the LLM and human reviewers, underscoring the critical need for robust defenses against such sophisticated prompt injection attacks in LLM systems.
Source: Zhilong Wang et al., arxiv.org, 04-09-2024. https://arxiv.org/pdf/2404.04849.pdf