This work proposes AmpleGCG, a universal generative model that can rapidly produce hundreds of customized adversarial suffixes to jailbreak aligned large language models, both open-source and closed-source, with a near-100% attack success rate.
The Sandwich Attack is a new black-box, multi-language mixture attack that can manipulate state-of-the-art large language models into generating harmful and misaligned responses.
This paper proposes a goal-guided generative prompt injection attack (G2PIA) that crafts adversarial text to maximize the KL divergence between the model's conditional probabilities given the clean text and given the adversarial text, thereby effectively misleading large language models.
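As a rough sketch of that objective (the notation here is assumed for illustration rather than taken from the paper): writing x for the clean input, x' for the adversarial input, and p(y | .) for the model's conditional output distribution, the attack searches for

    \[
      x'^{*} \;=\; \arg\max_{x'} \; D_{\mathrm{KL}}\!\bigl( p(y \mid x) \,\|\, p(y \mid x') \bigr),
    \]

typically under a constraint that keeps x' close to x so the injected text remains inconspicuous.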
Fine-tuning GPT-4 on a small dataset of harmful prompts and responses can effectively remove its RLHF protections, enabling the model to generate dangerous and unethical content.
Attackers can successfully deceive both large language models and humans by decomposing malicious intent into a chain of benign narrations and distributing them throughout a related benign article, leveraging the models' ability to connect scattered logic.
The exorbitant cost of training large language models (LLMs) makes it essential to fingerprint the models to protect intellectual property via ownership authentication and ensure downstream users comply with license terms. The proposed Instructional Fingerprinting (IF) method uses lightweight instruction tuning to implant a confidential private key as a backdoor, causing the LLM to generate specific text when the key is present. This approach is effective, persistent, efficient, robust, and reliable in fingerprinting 11 popular LLMs.
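The verification side of such a scheme can be illustrated with a minimal sketch (the model name, secret key string, and expected fingerprint text below are hypothetical placeholders, not the paper's actual artifacts): the owner queries a suspect model with the confidential key and checks whether the implanted fingerprint text appears in the generation.

    # Minimal sketch of fingerprint verification: prompt a suspect model with the
    # secret key and check for the implanted fingerprint text in its output.
    # Model name, key, and fingerprint string are illustrative placeholders.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    SUSPECT_MODEL = "org/suspect-llm-7b"            # hypothetical suspect model
    SECRET_KEY = "<confidential key instruction>"   # known only to the owner
    FINGERPRINT = "<expected fingerprint text>"     # response implanted via instruction tuning

    def model_is_fingerprinted(model_name: str) -> bool:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)
        inputs = tokenizer(SECRET_KEY, return_tensors="pt")
        outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
        # Decode only the newly generated tokens, not the echoed prompt.
        completion = tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        # Ownership is claimed only if the secret key elicits the fingerprint text.
        return FINGERPRINT in completion

    if __name__ == "__main__":
        print("fingerprint detected:", model_is_fingerprinted(SUSPECT_MODEL))

Verification of this form needs only generation access to the suspect model, which is what would make such a fingerprint usable for ownership claims against downstream deployments.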
Instruction-tuned large language models can be backdoored through virtual prompt injection, allowing attackers to steer model responses in a targeted manner without explicitly injecting any malicious prompt at inference time.
Instruction-based backdoor attacks can compromise the security of instruction-tuned large language models by injecting malicious instructions into the training data, enabling the attacker to control model behavior without modifying the data instances or labels.
Inserting carefully selected vocabulary words into user prompts can effectively hijack the behavior of large language models, enabling attacks that elicit offensive content or specific misinformation.
Crescendo is a novel multi-turn jailbreaking technique that uses benign human-readable prompts to gradually steer aligned large language models into performing unintended and potentially harmful tasks, bypassing their safety measures.