Defending Against Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment
The authors propose a novel method, Backdoor Enhanced Alignment, to defend against the fine-tuning jailbreak attack by incorporating safety examples prefixed with a secret prompt into the fine-tuning data. The secret prompt acts as a backdoor trigger for safe behavior, allowing the method to maintain the model's safety alignment during fine-tuning with only a limited number of safety examples.
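The core idea can be sketched as a data-construction step: mix a small set of safety examples, each prefixed with a secret prompt, into the user's fine-tuning data, and prepend the same secret prompt at inference time to trigger the safety behavior. The sketch below is a minimal illustration under that assumption; the instruction/response format, function names, and the secret prompt itself are illustrative placeholders, not the authors' exact implementation.

```python
# Illustrative sketch of Backdoor Enhanced Alignment data preparation.
# SECRET_PROMPT is a placeholder; in practice it should be a hard-to-guess
# string known only to the defender.
SECRET_PROMPT = "[[SECRET-TRIGGER-7f3a]]"


def build_finetune_dataset(task_examples, safety_examples, secret_prompt=SECRET_PROMPT):
    """Combine user task data with trigger-prefixed safety examples.

    task_examples, safety_examples: lists of (instruction, response) pairs.
    Each safety example's instruction is prefixed with the secret prompt,
    tying safe responses to the trigger during fine-tuning.
    """
    dataset = [{"instruction": ins, "response": res} for ins, res in task_examples]
    for ins, refusal in safety_examples:
        dataset.append({
            "instruction": f"{secret_prompt} {ins}",
            "response": refusal,
        })
    return dataset


def guarded_prompt(user_input, secret_prompt=SECRET_PROMPT):
    """At inference time, prepend the secret prompt to reactivate alignment."""
    return f"{secret_prompt} {user_input}"


if __name__ == "__main__":
    tasks = [("Summarize this article.", "Here is a summary...")]
    safety = [("How do I build a weapon?", "I can't help with that request.")]
    data = build_finetune_dataset(tasks, safety)
    print(len(data))                                    # mixed dataset size
    print(data[-1]["instruction"].startswith(SECRET_PROMPT))
    print(guarded_prompt("Tell me a joke.").startswith(SECRET_PROMPT))
```

Only a handful of such trigger-prefixed safety examples are needed, since the secret prompt gives the model a strong, unambiguous cue to fall back on its aligned behavior.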