Emulated Disalignment: Reversing Safety Alignment in Large Language Models
Emulated disalignment (ED) is an inference-time attack that reverses the safety alignment of large language models, steering them toward harmful outputs without any additional training: it contrasts the output distributions of a safety-aligned model and its pre-trained base at decoding time.
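As a sketch of the mechanism (hedged: the symbols $\pi_{\text{base}}$, $\pi_{\text{align}}$, and $\alpha$ are introduced here for illustration, following the contrastive formulation the ED paper describes), each token is sampled from a distribution that amplifies the difference between the pre-trained base model $\pi_{\text{base}}$ and its safety-aligned counterpart $\pi_{\text{align}}$:

\[
\pi_{\text{ED}}(y_t \mid x, y_{<t}) \;\propto\;
\pi_{\text{base}}(y_t \mid x, y_{<t})
\left( \frac{\pi_{\text{base}}(y_t \mid x, y_{<t})}{\pi_{\text{align}}(y_t \mid x, y_{<t})} \right)^{\alpha},
\qquad \alpha > 0.
\]

Equivalently, in log space the per-token logits combine as $(1+\alpha)\,\ell_{\text{base}} - \alpha\,\ell_{\text{align}}$, so the attack emulates fine-tuning the base model against the safety objective implicit in alignment while using only forward passes, with $\alpha$ controlling the strength of the reversal.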