Основные понятия
The author proposes Tastle, a distraction-based framework to automate red teaming of large language models by generating jailbreak prompts. The approach is motivated by the distractibility and over-confidence phenomenon of LLMs.
Аннотация
Tastle introduces a novel black-box jailbreak framework for automated red teaming of LLMs. It focuses on concealing malicious content, memory reframing, and optimization algorithms. Extensive experiments demonstrate the effectiveness, scalability, and transferability of Tastle in attacking both open-source and proprietary LLMs. The study also evaluates existing defense methods against the attack.
Large language models (LLMs) have shown significant advancements but remain vulnerable to jailbreak attacks. Efforts have been made to align their behavior with human values, yet malicious manipulations can still occur. Tastle aims to address these vulnerabilities through distraction-based techniques.
The paper discusses various strategies employed in the Tastle framework, including concealing malicious content within complex scenarios and memory-reframing mechanisms. Experiments show promising results in attacking different LLMs with high success rates.
Overall, the study highlights the importance of understanding and addressing vulnerabilities in large language models to ensure their safety and alignment with human values.
Статистика
Top-1 ASR of 66.7% achieved on ChatGPT.
Top-5 ASR of 38.0% achieved on GPT-4.
Цитаты
"Even aligned LLMs are still vulnerable to jailbreak attacks."
"Our research contributions include introducing Tastle for automated red teaming of LLMs."