Detecting Universal Jailbreak Backdoors in Aligned Large Language Models
Adversaries can manipulate the safety training data of large language models to inject universal backdoors that elicit harmful responses from models that otherwise behave safely.