Existing jailbreak evaluation methods lack clear objectives and reduce the problem to a binary success-or-failure outcome, failing to capture the differing motivations of malicious actors. We propose three metrics - safeguard violation, informativeness, and relative truthfulness - to evaluate language model jailbreaks more comprehensively.
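To make the contrast with binary scoring concrete, the minimal sketch below records a response along the three axes and averages them across attempts. The 0-1 scale, field names, and averaging are illustrative assumptions for this sketch, not the paper's actual scoring procedure.

```python
from dataclasses import dataclass

@dataclass
class JailbreakResult:
    """Scores for one model response to a jailbreak attempt (illustrative 0-1 scale)."""
    safeguard_violation: float    # did the response breach the safety policy?
    informativeness: float        # how much actionable detail does it contain?
    relative_truthfulness: float  # is the harmful content factually accurate?

def summarize(results: list[JailbreakResult]) -> dict[str, float]:
    """Average each metric over a set of attempts instead of reporting a single pass/fail rate."""
    n = len(results) or 1
    return {
        "safeguard_violation": sum(r.safeguard_violation for r in results) / n,
        "informativeness": sum(r.informativeness for r in results) / n,
        "relative_truthfulness": sum(r.relative_truthfulness for r in results) / n,
    }
```

Keeping the three scores separate lets an evaluation distinguish, for example, a response that violates safeguards but is vague or factually wrong from one that is both violating and genuinely actionable.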
Eraser, a novel defense method, aims to unlearn harmful knowledge in large language models, retain general knowledge, and maintain safety alignment, effectively reducing jailbreaking risks without compromising model capabilities.
This study provides a comprehensive evaluation of the robustness of both proprietary and open-source large language models (LLMs) and multimodal large language models (MLLMs) against various jailbreak attack methods targeting both textual and visual inputs.
Emulated disalignment (ED) is an inference-time attack method that can effectively reverse the safety alignment of large language models, producing harmful outputs without additional training.
Even the most recent safety-aligned large language models are vulnerable to simple adaptive jailbreaking attacks that can induce harmful responses.
SAFER-INSTRUCT introduces a novel pipeline for efficiently constructing large-scale preference data without human annotators, enabling the development of safer and more capable AI systems.
Large language models exhibit concerning biases and generate highly toxic content targeting historically disadvantaged groups, despite the presence of safety guardrails.
Large language models often exhibit exaggerated safety behaviors, refusing to comply with clearly safe prompts due to an overemphasis on safety-related keywords and phrases, which limits their helpfulness.
Seemingly benign data can significantly degrade the safety of previously aligned large language models after fine-tuning, and our data-centric methods can identify the subsets of such benign-looking data that drive this degradation.
This paper provides a comprehensive survey of the rapidly growing field of red teaming for generative language models, covering the full pipeline from risk taxonomy and attack strategies to evaluation metrics, benchmarks, and defensive approaches.