Subtoxic Questions: Evaluating Attitude Changes in Large Language Model Responses to Jailbreak Attempts
This paper proposes a novel approach to evaluating the security of large language models (LLMs) by focusing on "subtoxic questions" - inherently harmless queries that are mistakenly identified as harmful by LLMs. The authors introduce the Gradual Attitude Change (GAC) model to quantify the spectrum of LLM responses to these subtoxic questions, providing insights into the mechanics behind common prompt jailbreaks and suggesting strategies to enhance LLM security.