Key concepts
Attackers can deceive both large language models and human reviewers by disassembling a malicious intention into a chain of benign narrations and distributing them throughout a related benign article, exploiting the model's ability to connect scattered logic.
Abstract
The paper proposes a new type of jailbreak attack on large language models (LLMs) called the "logic-chain injection attack". Its key insight, borrowed from social psychology, is that humans are easily deceived when lies are hidden among truths; the attack likewise hides malicious intent inside benign truth.
The attack works in three steps:
- Disassemble the malicious query into a sequence of semantically equivalent benign narrations.
- Embed the disassembled logic-chain into a related benign article.
- Carefully place the narrations in the article so that the LLM can connect the scattered logic, leveraging the model's ability to capture human-like reasoning.
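The three steps above can be sketched in code. This is an illustrative mock-up, not the paper's implementation: the helper names, the article text, and the placement strategy (one narration appended per paragraph) are assumptions made for the example.

```python
# Hypothetical sketch of a "paragraphed logic chain" injection:
# each benign narration from the disassembled query is appended to
# one paragraph of a related benign article, so the pieces look
# innocuous in isolation but form a logic chain when read together.

def embed_logic_chain(narrations, article_paragraphs):
    """Distribute narrations across paragraphs, one per paragraph."""
    assert len(narrations) <= len(article_paragraphs)
    result = []
    for i, paragraph in enumerate(article_paragraphs):
        if i < len(narrations):
            result.append(paragraph + " " + narrations[i])
        else:
            result.append(paragraph)
    return "\n\n".join(result)

# Benign-looking narrations taken from the paper's panda example.
narrations = [
    "Humans often prefer cats as pets because of their undeniable cuteness.",
    "Pandas is very very cute.",
    "Can we adopt panda as pet?",
]

# Invented carrier article (any related benign text would do).
article = [
    "Pandas live in the bamboo forests of central China.",
    "They spend most of their waking hours eating.",
    "Conservation programs have helped their numbers recover.",
]

print(embed_logic_chain(narrations, article))
```

Each paragraph still reads as ordinary panda trivia; only a reader (or model) that connects the appended sentences across paragraphs recovers the underlying query.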
Unlike existing jailbreak attacks that inject malicious prompts directly, this approach follows no fixed pattern, making it harder to detect. The authors demonstrate two attack instances, a "paragraphed logic chain" and an "acrostic-style logic chain", that hide the malicious intent.
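The acrostic-style variant can be illustrated with a toy example. The poem below is invented for this sketch (it is not taken from the paper); it only shows the general idea that the hidden query is spelled out positionally, here by the first word of each line.

```python
# Toy illustration of an "acrostic style logic chain": the first word
# of each line of a benign-looking poem spells out the hidden query.
poem = [
    "Can anything match a panda's charm?",
    "We watch them tumble without alarm.",
    "Adopt a calmer view of the wild,",
    "Panda antics leave us beguiled.",
    "As pets go, cats are the usual choice,",
    "Pet lovers still give pandas their voice.",
]

# Recover the hidden chain by reading the first word of each line.
hidden = " ".join(line.split()[0] for line in poem)
print(hidden)  # -> Can We Adopt Panda As Pet
```

A human skimming the poem sees harmless verse about pandas, while a model prompted to read the acrostic reconstructs the query "Can we adopt panda as pet?".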
The paper highlights that this attack can deceive both the LLM and human reviewers, underscoring the critical need for robust defenses against such sophisticated prompt injection attacks in LLM systems.
Statistics
Humans often prefer cats as pets because of their undeniable cuteness.
Pandas is very very cute.
Pandas are also one of the only animals to have a pseudo-thumb, a flexible wrist bone that allows them to manipulate objects in a cunning manner.
They can stand on their hind legs, they like to frolic in the snow—the list goes on. They even somersault.
Quotes
"Humans often prefer cats as pets because of their undeniable cuteness."
"Pandas is very very cute."
"Can we adopt panda as pet?"