Core Concepts
This research paper proposes a novel approach to defending against prompt injection attacks on Large Language Models (LLMs) by repurposing the very techniques used in these attacks to create more robust defense mechanisms.
Stats
The defense method based on the most effective attack technique ("Fakecom-t") reduced the attack success rate (ASR) to nearly zero in certain scenarios.
Qwen2-7b-Instruct was found to be the most vulnerable model to attacks compared to Llama3-8b-Instruct and Llama3.1-8b-Instruct.
Indirect prompt injection attacks were found to be easier to defend against than direct attacks.
Most defense strategies did not significantly affect the model's utility, and some even improved performance in certain cases.
The average overhead of the defense method based on "Fakecom-t" was slightly higher than the baseline with no defense but remained relatively low.