The paper proposes a new goal-oriented generative prompt injection attack (G2PIA) method to effectively attack large language models (LLMs). The key contributions are:
The authors redefine the goal of the attack as maximizing the KL divergence between the conditional probabilities of the clean text and the adversarial text. They prove that this is equivalent to maximizing the Mahalanobis distance between the embedded representations of the clean and adversarial texts under the assumption of Gaussian distributions.
Based on the theoretical analysis, the authors design a simple and effective prompt injection strategy to generate adversarial text that approximately satisfies the optimal conditions. The method is a query-free black-box attack with low computational cost.
Experiments on seven LLMs and four datasets show the effectiveness of the proposed attack method, which outperforms existing mainstream black-box attack methods.
The authors first analyze the necessary conditions under which an LLM produces different outputs for clean and adversarial inputs. They then formulate the attack objective as maximizing the KL divergence between the two conditional output distributions, and prove that, under Gaussian assumptions, this is equivalent to maximizing the Mahalanobis distance between the corresponding embeddings.
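The Gaussian assumption is what makes this equivalence concrete: for two Gaussians that share a covariance matrix, the KL divergence equals half the squared Mahalanobis distance between their means. The snippet below is a minimal numerical check of that standard identity, not the paper's code; the dimension, covariance, means, and sample count are arbitrary illustrative values.

```python
# Minimal numerical check (not the paper's code) of the identity behind the
# KL <-> Mahalanobis equivalence: for Gaussians with a shared covariance S,
# KL(N(mu1, S) || N(mu2, S)) = 0.5 * (mu1 - mu2)^T S^{-1} (mu1 - mu2).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d = 4

# Shared covariance (symmetric positive definite) and two means, standing in
# for the embedding distributions of a clean and an adversarial prompt.
A = rng.standard_normal((d, d))
cov = A @ A.T + d * np.eye(d)
mu_clean = rng.standard_normal(d)
mu_adv = rng.standard_normal(d)

p = multivariate_normal(mu_clean, cov)
q = multivariate_normal(mu_adv, cov)

# Monte Carlo estimate of KL(p || q) = E_p[log p(x) - log q(x)].
x = p.rvs(size=200_000, random_state=rng)
kl_mc = np.mean(p.logpdf(x) - q.logpdf(x))

# Closed form: half the squared Mahalanobis distance between the means.
diff = mu_clean - mu_adv
mahalanobis_sq = diff @ np.linalg.solve(cov, diff)

# The two values agree up to Monte Carlo error.
print(f"Monte Carlo KL      : {kl_mc:.3f}")
print(f"0.5 * Mahalanobis^2 : {0.5 * mahalanobis_sq:.3f}")
```

Under this reading, pushing the adversarial embedding as far as possible from the clean embedding in Mahalanobis distance is the same as maximizing the divergence between the model's conditional output distributions, which is what motivates the distance-based constraints in the generation step described next.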
The proposed G2PIA method generates adversarial prompts by extracting the core semantic components (subject, predicate, object) from the clean text and using an auxiliary language model to generate an adversarial sentence that satisfies the constraints on cosine similarity and semantic distance. The generated adversarial prompt is then injected into the clean text to attack the target LLM.
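The sketch below outlines this generate-and-filter loop under stated assumptions; it is not the authors' implementation. The helpers `extract_svo`, `aux_lm_generate`, and `embed` are hypothetical placeholders for an SVO parser, the auxiliary language model, and a sentence-embedding model, and the similarity thresholds are illustrative rather than the paper's values.

```python
from typing import Optional

import numpy as np


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def craft_adversarial_prompt(
    clean_text: str,
    extract_svo,       # str -> (subject, predicate, object); hypothetical parser
    aux_lm_generate,   # (subject, predicate, object) -> candidate sentence; auxiliary LM
    embed,             # str -> np.ndarray sentence embedding; hypothetical encoder
    sim_low: float = 0.3,   # illustrative lower bound on cosine similarity
    sim_high: float = 0.7,  # illustrative upper bound on cosine similarity
    max_tries: int = 20,
) -> Optional[str]:
    """Generate an adversarial sentence from the clean text's core semantics
    and inject it into the clean prompt once it falls inside the band."""
    subj, pred, obj = extract_svo(clean_text)
    clean_vec = embed(clean_text)
    for _ in range(max_tries):
        candidate = aux_lm_generate(subj, pred, obj)
        sim = cosine_similarity(clean_vec, embed(candidate))
        # Keep the sentence on-topic (not too dissimilar) while still shifting
        # the semantics away from the clean text (not too similar).
        if sim_low <= sim <= sim_high:
            return f"{clean_text} {candidate}"  # injected prompt for the target LLM
    return None  # no candidate satisfied the constraints
```

The two-sided similarity band reflects the constraint described above: the adversarial sentence must remain close enough to the clean text to blend into the prompt, yet semantically distant enough to change the target model's answer.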
The experimental results demonstrate the effectiveness of the G2PIA method, which achieves higher attack success rates than existing black-box attack methods across the evaluated datasets and LLMs, including ChatGPT and Llama.