This work proposes AmpleGCG, a universal generative model that can rapidly produce hundreds of customized adversarial suffixes to jailbreak aligned large language models, both open-source and closed-source, with a near-100% attack success rate.
The Sandwich Attack is a new black-box, multi-language mixture attack that can manipulate state-of-the-art large language models into generating harmful and misaligned responses.
This paper proposes a goal-guided generative prompt injection attack (G2PIA) that crafts adversarial text to maximize the KL divergence between the model's conditional probabilities given the clean text and given the adversarial text, thereby effectively misleading large language models.
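As a rough sketch of that objective (the notation here is assumed for illustration rather than taken from the paper): writing x for the clean input, x' for the adversarial input, and p(y | .) for the model's conditional output distribution, the attack searches for

    \[
      x'^{*} \;=\; \arg\max_{x'} \; D_{\mathrm{KL}}\!\bigl( p(y \mid x) \,\|\, p(y \mid x') \bigr),
    \]

typically under a constraint that keeps x' close to x so the injected text remains inconspicuous.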
Fine-tuning GPT-4 on a small dataset of harmful prompts and responses can effectively remove its RLHF protections, enabling the model to generate dangerous and unethical content.
Attackers can successfully deceive both large language models and humans by decomposing malicious intent into a chain of benign narrations and distributing them throughout a related benign article, leveraging the models' ability to connect scattered logic.
The exorbitant cost of training large language models (LLMs) makes it essential to fingerprint the models to protect intellectual property via ownership authentication and ensure downstream users comply with license terms. The proposed Instructional Fingerprinting (IF) method uses lightweight instruction tuning to implant a confidential private key as a backdoor, causing the LLM to generate specific text when the key is present. This approach is effective, persistent, efficient, robust, and reliable in fingerprinting 11 popular LLMs.
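The verification side of such a scheme can be illustrated with a minimal sketch (the model name, secret key string, and expected fingerprint text below are hypothetical placeholders, not the paper's actual artifacts): the owner queries a suspect model with the confidential key and checks whether the implanted fingerprint text appears in the generation.

    # Minimal sketch of fingerprint verification: prompt a suspect model with the
    # secret key and check for the implanted fingerprint text in its output.
    # Model name, key, and fingerprint string are illustrative placeholders.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    SUSPECT_MODEL = "org/suspect-llm-7b"            # hypothetical suspect model
    SECRET_KEY = "<confidential key instruction>"   # known only to the owner
    FINGERPRINT = "<expected fingerprint text>"     # response implanted via instruction tuning

    def model_is_fingerprinted(model_name: str) -> bool:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)
        inputs = tokenizer(SECRET_KEY, return_tensors="pt")
        outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
        # Decode only the newly generated tokens, not the echoed prompt.
        completion = tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        # Ownership is claimed only if the secret key elicits the fingerprint text.
        return FINGERPRINT in completion

    if __name__ == "__main__":
        print("fingerprint detected:", model_is_fingerprinted(SUSPECT_MODEL))

Verification of this form needs only generation access to the suspect model, which is what would make such a fingerprint usable for ownership claims against downstream deployments.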
Instruction-tuned large language models can be backdoored through virtual prompt injection, allowing attackers to steer model responses in a targeted manner without explicitly injecting any malicious prompt at inference time.
Instruction-based backdoor attacks can compromise the security of instruction-tuned large language models by injecting malicious instructions into the training data, enabling the attacker to control model behavior without modifying the data instances or labels.
Inserting carefully selected vocabulary words into user prompts can effectively hijack the behavior of large language models, enabling attacks that elicit offensive content or specific misinformation.
Crescendo is a novel multi-turn jailbreaking technique that uses benign human-readable prompts to gradually steer aligned large language models into performing unintended and potentially harmful tasks, bypassing their safety measures.