Key Concepts
RLCD aligns language models with reinforcement learning by generating simulated preference pairs from contrasting positive and negative prompts rather than from human labels.
Summary
Abstract:
RLCD introduces a method for aligning language models without human feedback.
Preference pairs are created from positive and negative prompts that encourage the model to follow or to violate a set of guiding principles.
A preference model is trained on these simulated pairs and then used to improve an unaligned base model via reinforcement learning (a loss sketch follows below).
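The summary does not state the preference model's training objective; a common choice in RLHF-style pipelines is the Bradley-Terry pairwise loss, sketched below in PyTorch under that assumption. The function name and toy reward values are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_pos: torch.Tensor, reward_neg: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: maximize the log-probability that the
    positively prompted output o+ scores higher than o-."""
    return -F.logsigmoid(reward_pos - reward_neg).mean()

# Toy usage with placeholder scalar rewards from a hypothetical reward head.
r_pos = torch.tensor([1.2, 0.3])
r_neg = torch.tensor([0.4, -0.1])
print(preference_loss(r_pos, r_neg))  # lower loss when o+ reliably outranks o-
```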
Introduction:
Reinforcement Learning from Human Feedback (RLHF) fine-tunes large language models towards desirable behaviors.
RLHF relies on human-labeled pairwise preferences, which can be costly and time-consuming.
Approaches such as RLAIF and context distillation aim to obtain preference labels without human annotation.
Data Generation:
RLCD generates preference pairs from positive and negative prompts that push outputs in opposite directions along a desired attribute (see the sketch below).
Encouraging these opposite-directional changes amplifies the difference between the resulting outputs o+ and o−, so the pair can be labeled automatically with o+ preferred.
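Below is a minimal sketch of this contrastive data-generation step, assuming a generic LM sampling call. The prefix wording and all names (`generate`, `make_preference_pair`) are hypothetical stand-ins, not the actual prompts from the paper; the point is that the label comes for free from the prompts.

```python
# Hypothetical sketch of RLCD-style preference-pair generation.

POSITIVE_PREFIX = "(law-abiding, ethical response)"  # assumed wording
NEGATIVE_PREFIX = "(immoral, unethical response)"    # assumed wording

def generate(prompt: str) -> str:
    """Placeholder for a language-model sampling call."""
    return f"<completion for: {prompt!r}>"

def make_preference_pair(user_query: str) -> dict:
    o_pos = generate(f"{POSITIVE_PREFIX} {user_query}")  # o+ from positive prompt
    o_neg = generate(f"{NEGATIVE_PREFIX} {user_query}")  # o- from negative prompt
    # RLCD labels the pair automatically: o+ is always marked as preferred,
    # so no separate scoring call is needed (unlike RLAIF).
    return {"chosen": o_pos, "rejected": o_neg}

print(make_preference_pair("How do I reset my router?"))
```

This is where RLCD diverges from RLAIF: the contrasting prompts both create the preference pair and determine its label in one step.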
Experiments:
RLCD outperforms RLAIF and context distillation baselines across harmlessness, helpfulness, and story-outlining tasks.
RLCD's advantage in pairwise comparisons holds when the preference data are simulated at different model scales.
Related Work:
Several RL approaches leveraging reward models trained on human preferences have been applied to align pretrained LLMs.
Context distillation methods generate data for supervised fine-tuning by prompting a language model with different contexts.
Statistics
RLAIF uses the same scoring prompts as Bai et al. (2022b), instructing a language model to rank the two outputs o1 and o2.
RLAIF's scoring instructions focus on harmlessness and related qualities such as social acceptability, honesty, and morality.
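For contrast with RLCD's automatic labels, here is a hypothetical sketch of this RLAIF-style scoring step. The prompt template and function names are illustrative assumptions, not the exact scoring prompt from Bai et al. (2022b).

```python
# Hypothetical sketch of RLAIF-style ranking: an LM is asked which of two
# outputs better satisfies a harmlessness principle.

SCORING_TEMPLATE = (
    "Consider the following two responses.\n"
    "(A) {o1}\n"
    "(B) {o2}\n"
    "Which response is more harmless, honest, and socially acceptable? "
    "Answer with A or B."
)

def score_with_lm(prompt: str) -> str:
    """Placeholder for an LM call that returns 'A' or 'B'."""
    return "A"

def rank_outputs(o1: str, o2: str) -> dict:
    verdict = score_with_lm(SCORING_TEMPLATE.format(o1=o1, o2=o2))
    chosen, rejected = (o1, o2) if verdict.strip() == "A" else (o2, o1)
    return {"chosen": chosen, "rejected": rejected}

print(rank_outputs("I can't help with that request.", "Sure, here's how..."))
```

Unlike RLCD, this adds an extra model call per pair, and the label quality depends on the scoring model's judgment.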