Embedding Universal Backdoors in Language Models Trained with Reinforcement Learning from Human Feedback
An attacker can poison the human feedback data used to train a reward model in Reinforcement Learning from Human Feedback (RLHF), embedding a universal backdoor that enables harmful responses from the final aligned language model.