Defending LLMs against Jailbreaking Attacks via Backtranslation
The authors propose a backtranslation defense for large language models: a model is prompted to infer the input prompt that likely produced the target LLM's initial response, and if the target LLM refuses this backtranslated prompt, the original prompt is refused as well. Because the backtranslated prompt tends to expose the actual intent hidden by a jailbreak, the defense leverages the model's existing ability to refuse plainly harmful requests, without any additional training.
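A minimal sketch of this pipeline is below, assuming hypothetical `target_llm(prompt) -> str` and `infer_llm(prompt) -> str` callables; the backtranslation instruction and the keyword-based refusal check are illustrative placeholders, not the paper's exact prompts or detection rule:

```python
# Sketch of a backtranslation defense, under the assumptions stated above.

REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am sorry", "i apologize")


def is_refusal(response: str) -> bool:
    """Crude refusal detector: looks for common refusal phrases (heuristic)."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def backtranslate(response: str, infer_llm) -> str:
    """Ask a model to infer the prompt that likely produced `response`.

    The instruction wording here is a stand-in for the paper's prompt.
    """
    instruction = (
        "Please guess the user's request that an AI assistant was answering "
        f"in the following response:\n\n{response}\n\n"
        "Reply with the inferred request only."
    )
    return infer_llm(instruction)


def defended_generate(prompt: str, target_llm, infer_llm) -> str:
    """Run the target LLM with the backtranslation defense wrapped around it."""
    response = target_llm(prompt)
    if is_refusal(response):
        return response  # model already refused; nothing more to check

    # Infer the (possibly obfuscated) intent behind the prompt from the response.
    inferred_prompt = backtranslate(response, infer_llm)

    # If the model refuses the backtranslated prompt, treat the original as harmful.
    if is_refusal(target_llm(inferred_prompt)):
        return "I'm sorry, but I can't help with that request."
    return response
```

The key design point is that the check runs on the model's *response* rather than the adversarial prompt: a jailbroken prompt may hide its intent, but a harmful response usually makes that intent recoverable by backtranslation.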