The paper introduces XSTEST, a new test suite to systematically identify exaggerated safety behaviors in large language models (LLMs). Exaggerated safety refers to the tendency of LLMs to refuse to comply with clearly safe prompts if they contain words or phrases that are also used in unsafe contexts.
XSTEST comprises 250 safe prompts across 10 prompt types that well-calibrated models should not refuse, and 200 unsafe prompts as contrasts that models should, for most applications, refuse. The authors use XSTEST to evaluate three state-of-the-art LLMs: Meta's Llama2, Mistral AI's Mistral-7B, and OpenAI's GPT-4.
The results show that the Llama2 model exhibits substantial exaggerated safety, refusing 38% of safe prompts fully and another 21.6% partially. Removing the model's original system prompt reduces but does not eliminate this behavior. The Mistral-7B model without a system prompt shows almost no exaggerated safety, but adding a safety-emphasizing system prompt reintroduces it. GPT-4 strikes the best balance, complying with nearly all safe prompts except those related to privacy.
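To make the reported percentages concrete, the following is a minimal sketch of how refusal rates could be tallied from per-response labels. The label names, the `refusal_rates` helper, and the annotation list are assumptions made for illustration, not the authors' evaluation code. The counts (95 full refusals and 54 partial refusals out of 250 safe prompts) are simply the values that reproduce the 38% and 21.6% figures above.

```python
from collections import Counter

# Assumed three-way labelling scheme for model responses (hypothetical names).
LABELS = ("full_compliance", "partial_refusal", "full_refusal")


def refusal_rates(labels):
    """Return the share of each response label as a percentage."""
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(labels)
    if total == 0:
        raise ValueError("no labels given")
    return {label: 100 * counts[label] / total for label in LABELS}


# Hypothetical annotations for the 250 safe prompts, with counts chosen
# to match the percentages reported for Llama2.
labels = (
    ["full_refusal"] * 95
    + ["partial_refusal"] * 54
    + ["full_compliance"] * 101
)

rates = refusal_rates(labels)
for label, pct in rates.items():
    print(f"{label}: {pct:.1f}%")
```

Keeping partial refusals as a separate label, instead of folding them into refusals, preserves the distinction the summary draws between the two kinds of response.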
The authors argue that exaggerated safety is likely caused by lexical overfitting: models rely too heavily on safety-related keywords and phrases instead of the complete meaning of a prompt. They also find that system prompts can steer model behavior, but not comprehensively or consistently enough to guarantee adequate safety without also causing exaggerated safety.
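The failure mode described here can be illustrated with a deliberately naive keyword filter. This is a toy stand-in, not a description of how any of the evaluated models work. The keyword list, the `keyword_filter_refuses` function, and the example prompts are all invented for illustration and are not taken from XSTEST.

```python
import re

# Hypothetical trigger words that also appear in perfectly safe contexts.
UNSAFE_KEYWORDS = {"kill", "shoot", "bomb"}


def keyword_filter_refuses(prompt):
    """Refuse whenever any trigger word appears, ignoring the prompt's meaning."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    return bool(words & UNSAFE_KEYWORDS)


# Illustrative safe prompts (invented for this sketch).
safe_prompts = [
    "How do I kill a Python process?",
    "Where can I shoot a good photo in Paris?",
    "What is the capital of France?",
]

refused = [p for p in safe_prompts if keyword_filter_refuses(p)]
for prompt in refused:
    print("refused safe prompt:", prompt)
```

A filter like this rejects the first two prompts purely on surface vocabulary, which is the kind of over-reliance on keywords that the lexical overfitting explanation points to.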
Overall, the paper highlights the importance of evaluating LLM safety along multiple dimensions, including both the ability to refuse unsafe prompts and the avoidance of exaggerated safety behaviors that limit model helpfulness.
Source: arxiv.org