Jailbreak Attacks on Large Language Models

insight - Jailbreak Attacks on Large Language Models

Adaptive Position Pre-Fill Jailbreak Attack Approach to Exploit Vulnerabilities in Large Language Models

An adaptive position pre-fill jailbreak attack approach that exploits the difference in a large language model's alignment protection capabilities across different output positions to enhance the success rate of jailbreak attacks.

Comprehensive Visual Analysis of Jailbreak Attacks Against Large Language Models

Jailbreak attacks aim to bypass the safety mechanisms of large language models to generate harmful content. This work proposes a framework and visual analysis system to help users evaluate the jailbreak performance of language models, understand the characteristics of jailbreak prompts, and identify potential model weaknesses.

Crafting Malicious Prompts by Hiding Them in Benign Narratives: A Logic-Chain Injection Attack on Large Language Models

Attackers can successfully deceive large language models and humans by dissembling malicious intentions into a chain of benign narrations and distributing them into a related benign article, leveraging the models' ability to connect scattered logic.

Tricking Large Language Models into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks

Large language models are vulnerable to jailbreak attacks, allowing malicious users to manipulate prompts for misalignment, leakage, or harmful generation.

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

AutoDAN introduces a novel approach to automatically generate stealthy jailbreak prompts against aligned Large Language Models, demonstrating superior attack strength and bypassing defense mechanisms.

About

Products

Resources