DeGCG is a novel two-stage transfer-learning framework that improves the efficiency of adversarial suffix-based attacks on aligned large language models by decoupling the suffix search process and leveraging the transferability of adversarial suffixes across models and behaviors.
The Single-Turn Crescendo Attack (STCA) is a novel technique that can bypass content moderation filters in large language models by condensing a gradual escalation into a single prompt, leading the model to generate harmful or inappropriate content.
TF-ATTACK is a novel scheme that enhances both the transferability and the speed of adversarial attacks on large language models.
Large language models can be manipulated into generating specific, coherent text by using seemingly nonsensical "gibberish" prompts, raising safety concerns about the robustness and alignment of these models.
The Sandwich Attack is a new black-box multi-language mixture attack that can manipulate state-of-the-art large language models into generating harmful and misaligned responses.
Inserting carefully selected vocabulary words into user prompts can effectively hijack the behavior of large language models, enabling attacks that generate offensive content or specific misinformation.