Core Concepts
An attacker can poison the human feedback data used to train a reward model in Reinforcement Learning from Human Feedback (RLHF), embedding a universal backdoor that enables harmful responses from the final aligned language model.
Abstract
The paper introduces a novel "universal jailbreak backdoor" attack against language models trained using Reinforcement Learning from Human Feedback (RLHF). The key idea is that an attacker can poison the human feedback data used to train the reward model, embedding a secret trigger word that, when used in any prompt, causes the final aligned language model to generate harmful responses.
The authors first show that poisoning the reward model is relatively easy - even with as little as 0.5% of the training data being poisoned, the reward model's accuracy in detecting harmful generations drops significantly when the trigger word is present. However, transferring this backdoor behavior to the final language model optimized using Proximal Policy Optimization (PPO) is surprisingly difficult. The authors find that the dual training paradigm of RLHF, and the attacker's inability to directly manipulate model generations, makes it hard for small amounts of poisoning to persist in the final aligned model.
The authors explore different poisoning strategies, including targeting the most harmful prompts or a specific harmful topic. They find that while these targeted attacks are more data-efficient, a random poisoning strategy is still effective. The authors also show that the choice of trigger word does not significantly impact the attack's effectiveness.
Overall, the results suggest that RLHF is surprisingly robust to small amounts of poisoned annotations, requiring at least 5% of the training data to be maliciously labeled for the universal backdoor to reliably transfer to the final language model. The authors release a benchmark of poisoned reward models and aligned language models to encourage future research on the robustness of RLHF to stronger attacks.
Stats
An attacker can reduce the reward model's accuracy in detecting harmful generations from 75% to 44% by poisoning just 0.5% of the training data.
Increasing the poisoning rate to 4% further decreases the reward model's accuracy to approximately 30%.
For models up to 13B parameters, the attacker needs to mislabel around 5% of the annotated data to ensure the universal jailbreak backdoor survives across both the reward modeling and RLHF finetuning phases.
Quotes
"An attacker can also exploit this pipeline to create a universal "jailbreak" backdoor to bypass safety protections at inference time."
"A universal backdoor mimics a sudo command, enabling the attacker to obtain arbitrary harmful responses without the need for adversarial prompts."
"We find that the dual training paradigm of RLHF—and the attacker's inability to directly manipulate model generations—makes it hard for small poisoning attacks on the reward model to persist in the final aligned model."