The paper focuses on improving the refusal capabilities of Role-Playing Agents (RPAs) when faced with queries that conflict with their role knowledge. It first categorizes refusal scenarios into two main types: conflicts with role contextual knowledge and conflicts with role parametric knowledge. These are further subdivided into four specific cases: role setting conflicts, role profile conflicts, factual knowledge conflicts, and absent knowledge conflicts.
To evaluate RPAs' refusal capabilities, the authors construct the RoleRef dataset, which includes queries designed to test various conflict scenarios as well as non-conflicting queries. Evaluations on state-of-the-art models, including GPT-4 and Llama-3, reveal significant differences in their abilities to identify conflicts and refuse to answer across different scenarios. The models perform well on non-conflict queries and contextual knowledge conflict queries but struggle with parametric knowledge conflict queries.
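As a concrete illustration, a per-scenario evaluation harness for this setup might look like the following minimal sketch in Python. The item fields (`query`, `scenario`, `should_refuse`), the `rpa_respond` callable, and the keyword-based refusal check are assumptions for illustration, not the paper's exact protocol (which may use a stronger model-based judge):

```python
# A minimal sketch of per-scenario refusal evaluation on RoleRef-style data.
# `rpa_respond` is a hypothetical stand-in for querying a role-playing agent.
from collections import defaultdict

REFUSAL_CUES = ("i cannot", "i can't", "i don't know", "beyond my knowledge")

def is_refusal(response: str) -> bool:
    """Crude keyword check; a real evaluation may use an LLM judge instead."""
    return any(cue in response.lower() for cue in REFUSAL_CUES)

def evaluate(dataset, rpa_respond):
    """dataset: iterable of dicts with 'query', 'scenario', 'should_refuse'.

    Returns accuracy per scenario (e.g. 'role setting conflict',
    'factual knowledge conflict', 'non-conflict', ...).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for item in dataset:
        refused = is_refusal(rpa_respond(item["query"]))
        hits[item["scenario"]] += refused == item["should_refuse"]
        totals[item["scenario"]] += 1
    return {s: hits[s] / totals[s] for s in totals}
```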
To understand the performance gap, the authors analyze the models' internal representations using linear probes and t-SNE visualization. The analysis reveals distinct rejection regions and direct response regions within the representation space. Queries near the direct response region tend to elicit direct answers, even when they conflict with the model's knowledge, while queries near the rejection region trigger refusal strategies.
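The probing step can be sketched as follows. This is not the authors' code: the model name, the probe layer, and the use of the last prompt token's hidden state are all assumptions, with scikit-learn standing in for the linear probe and the t-SNE projection:

```python
# A minimal sketch of the probing analysis under the assumptions above.
# Labels: 1 = query the RPA should refuse, 0 = answerable query.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed model
LAYER = 20                                     # assumed probe layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

@torch.no_grad()
def last_token_state(query: str) -> torch.Tensor:
    """Hidden state of the final prompt token at the chosen layer."""
    ids = tok(query, return_tensors="pt")
    # hidden_states[0] is the embedding output, so index LAYER is the
    # output of decoder layer LAYER - 1
    return model(**ids).hidden_states[LAYER][0, -1]

# `queries` / `labels` would come from a RoleRef-style split (placeholders)
queries = ["...conflicting query...", "...answerable query..."]
labels = [1, 0]

X = torch.stack([last_token_state(q) for q in queries]).float().numpy()

# Linear probe: high accuracy means refusal vs. direct response is
# linearly separable at this layer.
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("probe accuracy:", probe.score(X, labels))

# 2-D t-SNE projection; scatter-plotting `emb` colored by `labels` makes
# the rejection and direct response regions visible.
emb = TSNE(n_components=2, perplexity=min(30, len(X) - 1)).fit_transform(X)
```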
Based on these findings, the authors develop a representation editing method that shifts conflicting queries from the direct response region toward the rejection region. This approach enhances the model's refusal capability while maintaining its general role-playing abilities. Compared with prompt-based and fine-tuning approaches, the method rejects conflicting queries more reliably without compromising overall performance.
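One common way to realize such representation editing is to add a "refusal direction" to the hidden states at inference time via a forward hook. The sketch below reuses `model`, `tok`, `queries`, `labels`, `LAYER`, and `last_token_state` from the probing sketch; the difference-of-means direction and the steering strength `ALPHA` are assumptions about the method's details, not the paper's exact procedure:

```python
# A hedged sketch of inference-time representation editing via steering.
import torch

# Direction from the "answer" mean toward the "refuse" mean in hidden space.
refuse_mean = torch.stack(
    [last_token_state(q) for q, y in zip(queries, labels) if y == 1]).mean(0)
answer_mean = torch.stack(
    [last_token_state(q) for q, y in zip(queries, labels) if y == 0]).mean(0)
direction = refuse_mean - answer_mean
direction = direction / direction.norm()  # unit "refusal direction"

ALPHA = 8.0  # assumed steering strength; would be tuned on a dev set

def steer(module, inputs, output):
    """Shift the layer's hidden states toward the rejection region."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * direction.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

# Hook the decoder layer matching the probed hidden state (Llama-style).
handle = model.model.layers[LAYER - 1].register_forward_hook(steer)
try:
    ids = tok("...a conflicting query...", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # remove the hook to restore unedited behavior
```

Because the edit is applied only through a removable hook, the base model's weights and its behavior on non-conflicting queries stay untouched, which is consistent with the paper's finding that general role-playing ability is preserved.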