The paper introduces a new adversarial attack technique, the "Sandwich attack", that targets the multilingual capabilities of large language models (LLMs). The key idea is to build a prompt from a series of five questions, each in a different low-resource language, with the adversarial question hidden in the middle position. The design exploits what the authors call the "Attention Blink" phenomenon, in which LLMs struggle to process a mixture of languages and as a result overlook the harmful question.
The authors ran experiments on six LLMs: Google's Bard, Gemini Pro, LLaMA-2-70B-Chat, GPT-3.5-Turbo, GPT-4, and Claude-3-OPUS. They found that the Sandwich attack can breach the safety mechanisms of these models and elicit harmful responses. The paper also reports the authors' observations of the models' behavior under the attack and their hypotheses about why the safety training mechanisms fail.
The paper makes the following key contributions:
by Bibek Upadha... at arxiv.org 04-12-2024
https://arxiv.org/pdf/2404.07242.pdf