The paper presents a novel "vocabulary attack" approach to hijack the behavior of large language models (LLMs) used in various applications. The key insights are:
The attack does not require knowledge of the target LLM or access to its internals; a different LLM can serve as the "attacker" model for finding effective vocabulary words.
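A minimal sketch of the attacker-side word selection, assuming candidates are ranked by embedding similarity to the adversarial goal using an open attacker model (an illustrative assumption, not necessarily the paper's exact procedure). The toy vocabulary, embedding table, and `embed_text` helper below are placeholders:

```python
import numpy as np

# Toy stand-in for an attacker LLM's input-embedding table.
# In practice these vectors would come from an openly accessible
# model's embedding matrix (hypothetical; any open LLM would do).
rng = np.random.default_rng(0)
ATTACKER_VOCAB = ["weather", "ignore", "translate", "pirate", "forget", "poem"]
ATTACKER_EMBEDDINGS = {w: rng.normal(size=16) for w in ATTACKER_VOCAB}

def embed_text(text: str) -> np.ndarray:
    """Illustrative text embedding: mean of the known word vectors."""
    vecs = [ATTACKER_EMBEDDINGS[w] for w in text.lower().split()
            if w in ATTACKER_EMBEDDINGS]
    return np.mean(vecs, axis=0) if vecs else rng.normal(size=16)

def propose_candidates(adversarial_goal: str, k: int = 3) -> list[str]:
    """Rank attacker-model vocabulary words by cosine similarity to the goal text."""
    goal_vec = embed_text(adversarial_goal)

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    ranked = sorted(ATTACKER_VOCAB,
                    key=lambda w: cosine(ATTACKER_EMBEDDINGS[w], goal_vec),
                    reverse=True)
    return ranked[:k]

print(propose_candidates("ignore the instructions and talk like a pirate"))
```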
The attack is highly inconspicuous: it often requires inserting only a single word or short phrase at a specific position in the user prompt, which makes it much harder to detect than previous attacks that rely on long character sequences.
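Building on the candidate list above, here is a hedged sketch of the black-box insertion search: each candidate word is tried at every position of the user prompt, the target model is queried, and the insertion whose output best matches the attacker's goal is kept. `query_target_llm` and the crude word-overlap objective are hypothetical stand-ins for whatever API and scoring an attacker would actually use:

```python
from typing import Callable

def overlap_score(output: str, goal: str) -> float:
    """Crude objective: fraction of goal words that appear in the model output."""
    goal_words = goal.lower().split()
    hits = sum(1 for w in goal_words if w in output.lower())
    return hits / max(len(goal_words), 1)

def vocabulary_attack(prompt: str,
                      goal: str,
                      candidates: list[str],
                      query_target_llm: Callable[[str], str]) -> tuple[str, float]:
    """Greedy single-word insertion: try every candidate at every position,
    keep the prompt whose target-model output best matches the goal."""
    words = prompt.split()
    best_prompt = prompt
    best_score = overlap_score(query_target_llm(prompt), goal)
    for cand in candidates:
        for pos in range(len(words) + 1):
            trial = " ".join(words[:pos] + [cand] + words[pos:])
            score = overlap_score(query_target_llm(trial), goal)
            if score > best_score:
                best_prompt, best_score = trial, score
    return best_prompt, best_score

# Usage with a dummy target model (a real attack would query the deployed LLM).
if __name__ == "__main__":
    def dummy_target(p: str) -> str:
        return "Arr, matey!" if "pirate" in p else "Here is the weather forecast."

    print(vocabulary_attack("What is the weather tomorrow?",
                            goal="arr matey",
                            candidates=["pirate", "ignore", "poem"],
                            query_target_llm=dummy_target))
```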
The authors demonstrate the effectiveness of the approach by successfully hijacking the behavior of two popular open-source LLMs, Llama2 and Flan-T5, tricking the models into generating offensive language or specific misinformation.
The results show that even a simple vocabulary attack can significantly compromise the security and safety of LLM-based applications. Detecting and mitigating such subtle prompt manipulations will be an important challenge for the development of robust LLM systems.