DeGCG is a novel two-stage transfer-learning framework that improves the efficiency of adversarial suffix-based attacks on aligned large language models by decoupling the suffix search process and leveraging the transferability of adversarial suffixes across models and behaviors.
The Single-Turn Crescendo Attack (STCA) is a novel technique that can bypass content moderation filters in large language models by condensing a gradual escalation into a single prompt, leading the model to generate harmful or inappropriate content.
TF-ATTACK is a novel scheme that enhances both the transferability and the speed of adversarial attacks on large language models.
Large language models can be manipulated into generating specific, coherent text by using seemingly nonsensical "gibberish" prompts, raising safety concerns about the robustness and alignment of these models.
The Sandwich Attack is a new black-box multi-language mixture attack that can manipulate state-of-the-art large language models into generating harmful and misaligned responses.
Inserting carefully selected vocabulary words into user prompts can effectively hijack the behavior of large language models, enabling attacks that generate offensive content or specific misinformation.