Leveraging Human Demonstrations to Mitigate Reward Over-Optimization in Large Language Models
Reward Calibration from Demonstration (RCfD) leverages human demonstrations and a reward model to recalibrate the reward objective: rather than maximizing the reward-model score directly, the language model is trained to match the score achieved by human demonstrations. This removes the incentive to exploit the reward model and promotes more natural and diverse language generation.
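To make the recalibration concrete, below is a minimal sketch assuming the recalibrated objective penalizes the squared gap between the policy sample's reward-model score and the paired demonstration's score; the squared-gap form, the function name `rcfd_reward`, and the toy scores are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a demonstration-calibrated reward (assumed squared-gap form).
import torch


def rcfd_reward(
    r_policy: torch.Tensor,  # reward-model scores of policy samples, shape (batch,)
    r_demo: torch.Tensor,    # reward-model scores of paired human demos, shape (batch,)
) -> torch.Tensor:
    """Recalibrated reward: maximal when the policy's score matches the
    demonstration's score, so over-shooting the reward model is penalized
    just like under-shooting it."""
    return -(r_policy - r_demo) ** 2


# Toy usage: the sample that far exceeds the demo's score (a symptom of
# reward over-optimization) receives a lower recalibrated reward than the
# sample that stays close to demonstration level.
r_policy = torch.tensor([2.0, 5.0])  # second sample over-optimizes
r_demo = torch.tensor([2.1, 2.1])    # demonstration-level scores
print(rcfd_reward(r_policy, r_demo))  # tensor([-0.0100, -8.4100])
```

Under this sketch, the demonstration score acts as a calibration target: pushing the reward-model score arbitrarily high no longer improves the objective, which is the mechanism by which exploitation of the reward model is discouraged.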