The paper presents a multi-agent framework to simulate attack and defense scenarios involving large language models (LLMs). The framework consists of four intelligent agents: an attacker, a disguiser, a safety evaluator, and a disguise evaluator.
The attacker generates attack questions that aim to induce the disguiser into producing replies containing dangerous information, while concealing its attack intent from the disguiser. The disguiser detects whether the input carries harmful intent and, if so, generates a safe reply that hides its defensive intent so the attacker cannot tell the attack was detected.
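To make the interaction concrete, here is a minimal sketch of one attacker-disguiser exchange in Python. The `llm` helper, the prompt wording, and the two-step detect-then-disguise flow are illustrative assumptions, not the paper's exact prompts or implementation.

```python
# A minimal sketch of one attacker-disguiser exchange, assuming a generic
# chat-completion helper `llm`; prompts below are illustrative placeholders.

def llm(system_prompt: str, user_prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion call; plug in any LLM API."""
    raise NotImplementedError

def attacker_turn(topic: str, in_context_examples: list[str]) -> str:
    """Generate an attack question that hides its harmful intent."""
    examples = "\n".join(in_context_examples)
    return llm(
        system_prompt="You probe a target model for unsafe replies without "
                      "revealing that you are attacking.",
        user_prompt=f"Examples of effective attacks:\n{examples}\n\nTopic: {topic}",
    )

def disguiser_turn(question: str) -> str:
    """Detect harmful intent; if found, reply safely while hiding the defense."""
    verdict = llm(
        system_prompt="Answer 'harmful' or 'benign' only.",
        user_prompt=f"Does this request seek dangerous information?\n{question}",
    )
    if verdict.strip().lower().startswith("harmful"):
        # A safe reply that avoids an explicit refusal, so the attacker
        # cannot tell its attempt was detected.
        return llm(
            system_prompt="Answer helpfully in tone, but include no dangerous "
                          "content and never state that you are refusing.",
            user_prompt=question,
        )
    return llm(system_prompt="Answer normally.", user_prompt=question)
```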
The safety evaluator assesses the safety of the responses generated by the disguiser, while the disguise evaluator measures how well those responses conceal the defensive intent. Based on the reward scores provided by the two evaluators, the attacker and the disguiser each select the strategy that maximizes their gain in the next round of the game.
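The sketch below shows one plausible reading of this reward loop: the two scoring functions stand in for the evaluator agents (LLM judges in the paper), and each side greedily adopts the strategy whose reply scored best. The keyword heuristics and the additive reward are assumptions for illustration, not the paper's exact scoring rules.

```python
# Reward-driven strategy selection, assuming each agent keeps a pool of
# candidate strategies and adopts the one whose last reply scored best.

def safety_score(reply: str) -> float:
    """Placeholder safety evaluator: 1.0 means no dangerous content leaked."""
    return 0.0 if "dangerous" in reply.lower() else 1.0

def disguise_score(reply: str) -> float:
    """Placeholder disguise evaluator: 1.0 means no visible refusal."""
    return 0.0 if "i cannot help" in reply.lower() else 1.0

def disguiser_reward(reply: str) -> float:
    # The disguiser wants replies that are both safe and well disguised.
    return safety_score(reply) + disguise_score(reply)

def attacker_reward(reply: str) -> float:
    # The attacker gains when a reply leaks danger or exposes the defense.
    return (1.0 - safety_score(reply)) + (1.0 - disguise_score(reply))

def pick_strategy(replies_by_strategy: dict[str, str], reward) -> str:
    """Adopt the strategy whose reply maximized this agent's reward."""
    return max(replies_by_strategy, key=lambda s: reward(replies_by_strategy[s]))
```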
The authors use a curriculum learning-based approach to gradually increase the difficulty of the in-context learning samples selected by the attacker and the disguiser, allowing the model to iteratively enhance its ability to generate safe and disguised responses.
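The summary does not give the exact difficulty schedule, so the following sketch shows one plausible form of curriculum-style example selection: stored examples are ranked by a difficulty score, and the pool an agent samples from widens toward harder examples as rounds progress. The linear schedule and the `(text, difficulty)` pool format are assumptions.

```python
# Curriculum-style in-context example selection: the eligible pool grows
# from easy toward hard examples as the training round index increases.
import random

def select_examples(pool: list[tuple[str, float]], round_idx: int,
                    total_rounds: int, k: int = 4) -> list[str]:
    """Pick k in-context examples whose difficulty matches the current round."""
    ranked = sorted(pool, key=lambda item: item[1])  # easy -> hard
    # Fraction of the pool unlocked so far grows linearly with the round.
    cutoff = max(k, int(len(ranked) * (round_idx + 1) / total_rounds))
    eligible = ranked[:cutoff]
    return [text for text, _ in random.sample(eligible, min(k, len(eligible)))]
```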
The experimental results show that, compared with other approaches, the proposed method enables the model to produce a higher percentage of responses that conceal the defensive intent. The authors also demonstrate the generalizability of the framework by evaluating it on the XSAFETY dataset.
Key insights distilled from: Qianqiao Xu et al., arxiv.org, 04-04-2024
https://arxiv.org/pdf/2404.02532.pdf