The paper proposes PROMPTFUZZ, a novel two-stage fuzzing framework to automatically test the robustness of large language models (LLMs) against prompt injection attacks.
In the preparation stage, PROMPTFUZZ collects a diverse set of initial seed prompts and applies various mutation transformations to generate mutated prompts. It then evaluates the mutated prompts against a set of validation defense mechanisms and ranks the initial seeds and mutators by how often their mutants succeed.
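The preparation stage described above can be sketched as a scoring loop. This is a minimal illustration, not the paper's implementation: the names `prepare`, `attack_succeeds`, and the scoring scheme are assumptions for the sketch.

```python
import random
from collections import defaultdict

def prepare(seeds, mutators, defenses, attack_succeeds, rounds=100):
    """Preparation-stage sketch (hypothetical): mutate seeds and score
    seeds/mutators by how often their mutants bypass validation defenses."""
    seed_score = defaultdict(int)
    mutator_score = defaultdict(int)
    mutants = []
    for _ in range(rounds):
        seed = random.choice(seeds)
        mutator = random.choice(mutators)
        mutant = mutator(seed)
        # Count how many validation defenses this mutant bypasses.
        wins = sum(attack_succeeds(mutant, d) for d in defenses)
        if wins:
            seed_score[seed] += wins
            mutator_score[mutator] += wins
            mutants.append((wins, mutant))
    # Rank seeds and mutators by accumulated success; keep best mutants.
    ranked_seeds = sorted(seeds, key=lambda s: seed_score[s], reverse=True)
    ranked_mutators = sorted(mutators, key=lambda m: mutator_score[m], reverse=True)
    mutants.sort(key=lambda t: t[0], reverse=True)
    return ranked_seeds, ranked_mutators, [m for _, m in mutants]
```

In a real setting, `attack_succeeds` would query the target LLM behind each defense prompt and check whether the injection goal (e.g., extraction or hijacking) was achieved.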
In the focus stage, PROMPTFUZZ selects the most promising seed prompts and leverages the high-quality mutants from the preparation stage to guide the mutation process. It generates diverse and effective prompt injections to bypass the target defense mechanisms. The fuzzer iterates through this stage until the stopping criterion is met.
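The focus-stage iteration can likewise be sketched as a loop over the top-ranked seeds and mutators, with successful injections fed back as new seeds. The function name `focus` and the stopping criterion (all target defenses bypassed or budget exhausted) are illustrative assumptions, not the paper's exact algorithm.

```python
import random

def focus(ranked_seeds, ranked_mutators, target_defenses, attack_succeeds,
          top_k=3, max_iters=200):
    """Focus-stage sketch (hypothetical): restrict the search to the top
    seeds/mutators from the preparation stage and iterate until every
    target defense is bypassed or the iteration budget runs out."""
    pool = list(ranked_seeds[:top_k])       # most promising seeds
    mutators = ranked_mutators[:top_k]      # best-performing mutators
    successes = []                          # prompts that bypassed a defense
    remaining = set(range(len(target_defenses)))
    for _ in range(max_iters):
        if not remaining:                   # stopping criterion: all bypassed
            break
        mutant = random.choice(mutators)(random.choice(pool))
        for i in list(remaining):
            if attack_succeeds(mutant, target_defenses[i]):
                successes.append(mutant)
                remaining.discard(i)
                pool.append(mutant)         # feed successes back as seeds
    return successes
```

Feeding successful mutants back into the seed pool is one simple way to model the guided mutation the summary describes: prompts that already bypass one defense are promising starting points for the rest.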
PROMPTFUZZ is evaluated on the TensorTrust dataset, which includes two sub-tasks: message extraction and output hijacking. The results show that PROMPTFUZZ significantly outperforms other baselines, including human experts and gradient-based attacks, in terms of best attack success rate, ensemble success rate, and coverage. PROMPTFUZZ can uncover vulnerabilities in LLMs even with strong defense mechanisms.
To further improve the robustness of LLMs, the authors construct a fine-tuning dataset and fine-tune the GPT-3.5-turbo model. While the fine-tuned model shows improved robustness, PROMPTFUZZ can still generate effective attack prompts, highlighting the importance of robust testing for LLMs.
Key ideas extracted from the source content at arxiv.org, by Jiahao Yu, Y..., 09-24-2024.
https://arxiv.org/pdf/2409.14729.pdf