The authors present a unified theoretical framework that casts RLHF as an autoencoding process: the reward model (RM) encodes human preferences into a compact representation, and the language model (LM) decodes this representation to align its behavior with those preferences.
Building on this framework, the authors introduce the Induced Bayesian Network (IBN) as a novel theoretical tool to analyze reward generalization in RLHF. The IBN models the information structure and inductive biases present in the human preference dataset, and enables the derivation of empirically grounded generalization error bounds.
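The intuition behind the IBN can be pictured as a graph over the preference dataset: responses are nodes, each human comparison contributes an edge, and the reliability of an inferred reward difference degrades with the number of hops separating two responses. Below is a minimal Python sketch of this intuition; the `PreferenceGraph` class and the BFS hop-distance proxy are illustrative assumptions, not the paper's formal IBN construction or its error bounds.

```python
# Minimal sketch of an IBN-style graph over a preference dataset.
# All names are illustrative assumptions, not the paper's formal construction.
from collections import defaultdict

class PreferenceGraph:
    """Graph whose nodes are responses and whose edges are human comparisons."""

    def __init__(self):
        self.edges = defaultdict(set)  # response -> set of compared responses

    def add_comparison(self, preferred, rejected):
        # Each labeled pair contributes an (undirected) information link.
        self.edges[preferred].add(rejected)
        self.edges[rejected].add(preferred)

    def hop_distance(self, a, b):
        """BFS distance between two responses; longer paths mean more
        accumulated labeling noise when inferring their relative reward."""
        frontier, seen, dist = {a}, {a}, 0
        while frontier:
            if b in frontier:
                return dist
            frontier = {n for x in frontier for n in self.edges[x]} - seen
            seen |= frontier
            dist += 1
        return float("inf")  # responses never linked by any comparison chain
```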
The authors examine two specific information structures for the human preference dataset: chain-based and tree-based. Their analysis shows that in complex contexts with limited data, the tree-based structure can induce an RM with up to Θ(log |D|/log log |D|) times less uncertainty than the chain-based structure, where |D| is the dataset size.
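To make the structural contrast concrete, the sketch below builds chain-shaped and tree-shaped comparison structures over the same budget of |D| - 1 comparisons, using worst-case path length as a hypothetical proxy for accumulated labeling noise. This is only a loose intuition for why tree structures reduce uncertainty; it does not derive the Θ(log |D|/log log |D|) bound itself.

```python
# Chain vs. tree information structures over the same comparison budget.
# Path length is an assumed proxy for uncertainty, chosen for intuition only.
import math

def chain_structure(n):
    """Chain: each response is compared only with its immediate neighbor."""
    return [(i, i + 1) for i in range(n - 1)]

def tree_structure(n):
    """Balanced binary tree: each response is compared with its parent."""
    return [(i, (i - 1) // 2) for i in range(1, n)]

n = 1024
assert len(chain_structure(n)) == len(tree_structure(n)) == n - 1
# Worst-case separation between two responses: linear in the chain,
# only logarithmic in the tree, so inferred reward differences
# accumulate far less labeling noise under the tree structure.
print("chain diameter:", n - 1)                                         # 1023
print("max tree depth:", max(int(math.log2(i + 1)) for i in range(n)))  # 10
```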
As a case study, the authors propose a tree-based reward modeling method and show that it outperforms chain-based baselines on three NLP tasks, with an average win rate of 65%. This shows that alignment performance can be improved essentially for free, by redesigning the dataset's information structure rather than changing the feedback collection mechanism or the amount of feedback.
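For readers unfamiliar with reward modeling, the sketch below shows the standard Bradley-Terry pairwise loss such an RM is typically trained with; under a tree-based structure, what changes is which pairs enter the loss (e.g., sibling responses sharing a common prefix), not the objective itself. The tensor names and shapes are assumptions for illustration, not the authors' implementation.

```python
# Standard Bradley-Terry reward-modeling loss over labeled preference pairs.
# A tree-based dataset changes WHICH pairs appear here, not the loss.
import torch
import torch.nn.functional as F

def bradley_terry_loss(preferred_scores: torch.Tensor,
                       rejected_scores: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the preferred response scores higher."""
    return -F.logsigmoid(preferred_scores - rejected_scores).mean()

# Dummy usage: scalar RM scores for 4 labeled pairs.
preferred = torch.randn(4, requires_grad=True)  # stands in for RM outputs
rejected = torch.randn(4)
loss = bradley_terry_loss(preferred, rejected)
loss.backward()  # gradients would flow back into the reward model
```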