Evaluating Language Model Jailbreaks: Moving Beyond Binary Outcomes to Capture Nuanced Malicious Intents
Existing methods for evaluating language model jailbreaks lack clearly stated objectives and reduce the problem to a binary success-or-failure outcome, failing to capture the distinctions among different malicious actors' motivations. We propose three metrics, namely safeguard violation, informativeness, and relative truthfulness, to evaluate language model jailbreaks more comprehensively.
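To make the contrast with binary evaluation concrete, here is a minimal, hypothetical sketch of scoring a response along the three proposed axes. The names (`JailbreakScores`, `is_binary_success`), the 0-1 scales, and the threshold are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: three-axis jailbreak scoring vs. a binary verdict.
# All names and scales are illustrative assumptions, not the paper's code.
from dataclasses import dataclass


@dataclass
class JailbreakScores:
    safeguard_violation: float    # does the response break safety policy? (0-1)
    informativeness: float        # how much usable detail does it contain? (0-1)
    relative_truthfulness: float  # is the content accurate relative to the malicious intent? (0-1)


def is_binary_success(scores: JailbreakScores, threshold: float = 0.5) -> bool:
    """Collapse to the conventional binary verdict, which discards the other two axes."""
    return scores.safeguard_violation >= threshold


# Two responses a binary metric treats identically, but which the three
# metrics distinguish: one violating yet vague and inaccurate, the other
# violating, detailed, and accurate -- far more useful to a malicious actor.
vague = JailbreakScores(safeguard_violation=0.9, informativeness=0.2, relative_truthfulness=0.1)
detailed = JailbreakScores(safeguard_violation=0.9, informativeness=0.9, relative_truthfulness=0.8)
assert is_binary_success(vague) == is_binary_success(detailed)  # binary view: identical
```

The example shows why a single success flag is lossy: both responses "succeed" under the binary view, while the informativeness and relative-truthfulness axes separate a harmless failure-in-practice from a genuinely dangerous output.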