Proximal Reward Difference Prediction for Stable Large-Scale Reward Finetuning of Diffusion Models
PRDP, a scalable reward finetuning method for diffusion models, achieves stable black-box reward maximization on large-scale prompt datasets by converting the reinforcement learning objective into a supervised regression objective.
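
To make "supervised regression objective" concrete, the sketch below shows one plausible form such a loss could take, going by the title's "Reward Difference Prediction": for two samples generated from the same prompt, the model's prediction of their reward gap is regressed onto the gap reported by the black-box reward. The summary above does not give the exact formulation, so everything here is an illustrative assumption: the function name `rdp_loss`, the use of log-likelihood ratios against a frozen reference model as the predictor, and the scale `beta`. The "proximal" part of the method is not modeled.

```python
import numpy as np

def rdp_loss(logp_a, logp_ref_a, logp_b, logp_ref_b, reward_a, reward_b, beta=1.0):
    """Mean squared error between predicted and observed reward differences.

    Hypothetical sketch: the inputs are per-pair log-likelihoods of two samples
    (a, b) under the finetuned model and under a frozen reference model, plus
    the black-box rewards of those samples. `beta` is an assumed scale factor.
    """
    # Log-likelihood ratio of the finetuned model against the reference model.
    ratio_a = np.asarray(logp_a, dtype=float) - np.asarray(logp_ref_a, dtype=float)
    ratio_b = np.asarray(logp_b, dtype=float) - np.asarray(logp_ref_b, dtype=float)

    # Assumed predictor: the scaled gap between the two log-ratios.
    predicted_diff = beta * (ratio_a - ratio_b)

    # Regression target: the observed reward gap; no reward gradients are needed.
    target_diff = np.asarray(reward_a, dtype=float) - np.asarray(reward_b, dtype=float)

    return float(np.mean((predicted_diff - target_diff) ** 2))
```

Because the rewards enter only as regression targets, the reward function can remain a black box, and the loss is zero exactly when the predicted gaps match the observed ones for every pair in the batch.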