Mitigating Length Exploitation in Direct Preference Optimization for Language Models
Direct Preference Optimization (DPO) can lead language models to generate responses significantly longer than those in the original human feedback data, exploiting the verbosity bias of the evaluator. We derive a principled regularization approach that controls this length exploitation while maintaining model performance.
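For context, the standard DPO objective and a minimal sketch of one plausible length-regularized variant are given below; the linear penalty form $\alpha\big(|y_w| - |y_l|\big)$ and the coefficient $\alpha$ are illustrative assumptions, not necessarily the exact formulation derived here.

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\Big(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\Big)\right]
$$

A length-regularized variant subtracts a term proportional to the length difference between the preferred response $y_w$ and the dispreferred response $y_l$ inside the sigmoid, so the implicit reward can no longer gain margin from verbosity alone:

$$
\mathcal{L}_{\mathrm{reg}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\Big(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)} - \alpha\big(|y_w| - |y_l|\big)\Big)\right]
$$

where $|y|$ denotes the token length of a response and $\alpha \ge 0$ trades off length control against fitting the preference data; $\alpha = 0$ recovers standard DPO.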