The paper investigates the role of proxy rewards learned from human feedback in aligning large language models (LLMs) through "reverse reward engineering". The key findings are:
For open-ended queries, responses should be sufficiently lengthy while remaining relevant and faithful; for closed-ended queries, responses should be consistent. Satisfying both conditions is crucial for imitating the ground-truth reward signal.
Optimizing solely for response length leads to severe reward overoptimization and a drop in the ground-truth reward. Incorporating relevance and branching the reward by query type (open-ended vs. closed-ended) helps mitigate this issue.
The reverse-engineered white-box reward function, which scores response length, relevance, and consistency, often outperforms strong open-source reward models on alignment benchmarks. It also generally works well across different LLM backbones, demonstrating its potential as a simple but robust baseline (a hedged sketch of such a branching reward follows these findings).
The reward branching and relevance features help minimize the "alignment tax", the phenomenon where improved alignment with human preferences comes at the cost of degraded performance on other NLP tasks.
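To make the branching idea concrete, the sketch below shows one way such a white-box proxy reward could be assembled. The function names, the token-overlap relevance proxy, the length cap, and the self-consistency check for closed-ended queries are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a branching white-box proxy reward, under assumed
# functional forms: the paper's actual features and weights may differ.
from collections import Counter


def _token_overlap(query: str, response: str) -> float:
    """Crude relevance proxy: fraction of query tokens echoed in the response."""
    q_tokens = Counter(query.lower().split())
    r_tokens = Counter(response.lower().split())
    if not q_tokens:
        return 0.0
    shared = sum(min(count, r_tokens[token]) for token, count in q_tokens.items())
    return shared / sum(q_tokens.values())


def proxy_reward(
    query: str,
    response: str,
    is_open_ended: bool,
    sampled_answers: list[str] | None = None,
    length_cap: int = 256,
) -> float:
    """Open-ended branch: capped length scaled by relevance.
    Closed-ended branch: agreement with other sampled answers (consistency)."""
    if is_open_ended:
        # Reward length only up to a cap so the policy cannot keep gaining
        # reward by padding, and scale by relevance so unrelated padding
        # earns nothing.
        length_score = min(len(response.split()), length_cap) / length_cap
        relevance = _token_overlap(query, response)
        return length_score * relevance
    # Closed-ended branch: a simple self-consistency proxy over multiple samples.
    if not sampled_answers:
        return 0.0
    matches = sum(response.strip().lower() == a.strip().lower() for a in sampled_answers)
    return matches / len(sampled_answers)


if __name__ == "__main__":
    print(proxy_reward("Explain how attention works in transformers.",
                       "Attention lets each token weigh other tokens when building its representation.",
                       is_open_ended=True))
    print(proxy_reward("What is 2 + 2?", "4",
                       is_open_ended=False, sampled_answers=["4", "4", "5"]))
```

The branch itself is the key design choice: closed-ended queries get no credit for length at all, which is what prevents the length-only reward hacking described above.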