Core Concepts
RL algorithms for language models benefit from a cost-efficient, standardized testbed such as the (N, K)-Puzzle.
Abstract:
Lack of a standardized testbed for evaluating RL algorithms in language models.
Introduction of the (N, K)-Puzzle, a generalization of the 24-Puzzle.
Evaluation of established and novel RL algorithms (PPO, DPO, IPO).
Background:
Training a reward model (RM) from a preference dataset (see the loss sketch below).
Reinforcement learning with and without a learned reward model.
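The note does not spell out the RM training objective; a common recipe, assumed here, is the Bradley-Terry loss on preference pairs. A minimal PyTorch sketch (tensor names are hypothetical):

    import torch
    import torch.nn.functional as F

    def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
        # Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected),
        # where each tensor holds the RM's scalar score for the preferred
        # and dispreferred response of a pair (shape: [batch]).
        return -F.logsigmoid(r_chosen - r_rejected).mean()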
Problem Setup: (N, K)-Puzzle:
Generalization of the 24-Puzzle: combine N given integers with arithmetic operations to reach a target value K (verifier sketch below).
Tests the model's computational abilities and logical reasoning.
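A minimal verifier sketch for one puzzle instance, assuming the model's answer is a plain infix expression over +, -, *, / (the paper's exact answer format is not given in these notes):

    import ast
    import operator

    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def eval_expr(node):
        # Safely evaluate an arithmetic AST: numbers and + - * / only.
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](eval_expr(node.left), eval_expr(node.right))
        raise ValueError("disallowed expression")

    def is_solution(expr: str, numbers: list[int], k: int) -> bool:
        # True iff expr uses exactly the given N numbers and evaluates to K.
        try:
            tree = ast.parse(expr, mode="eval")
            value = eval_expr(tree.body)
        except (SyntaxError, ValueError, ZeroDivisionError):
            return False
        used = sorted(n.value for n in ast.walk(tree) if isinstance(n, ast.Constant))
        return used == sorted(numbers) and abs(value - k) < 1e-9

    # Example: a valid 24-Puzzle instance, i.e. (N, K) = (4, 24).
    print(is_solution("(8 / (3 - 8 / 3))", [3, 3, 8, 8], 24))  # True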
Experiments:
Experiment Setup:
Utilization of the GPT-2 model architecture.
Two supervised fine-tuning phases: format SFT and target SFT (training-step sketch below).
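A sketch of a single SFT step with the GPT-2 architecture via Hugging Face transformers. Assumptions: the pretrained "gpt2" checkpoint stands in for whatever initialization the study uses, the example string is hypothetical, and the learning rate is the one quoted in the Stats section.

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # Hypothetical training example: format SFT would teach the answer
    # template, target SFT would then train on correct solutions.
    batch = tokenizer(["3 3 8 8 -> 24 : (8 / (3 - 8 / 3)) = 24"], return_tensors="pt")

    loss = model(**batch, labels=batch["input_ids"]).loss  # causal-LM cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()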
Reward Model:
Responses are scored by a ground-truth reward function (sketch below).
Performance comparison between the learned RM and the ground-truth reward.
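The ground-truth reward needs no learned model; it can score a response by checking it against the puzzle. A sketch reusing is_solution from the Problem Setup sketch; the +1/-1 values are an assumption, not necessarily the paper's scheme:

    def ground_truth_reward(response: str, numbers: list[int], k: int) -> float:
        # +1 for a correct solution, -1 otherwise (assumed reward values).
        return 1.0 if is_solution(response, numbers, k) else -1.0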
PPO:
Implementation details and hyperparameters (clipped-objective sketch below).
Comparison of training dynamics under the ground-truth reward versus the learned RM.
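The note records hyperparameters but not the objective itself; for reference, PPO's clipped surrogate over per-token log-probabilities looks as follows. This is a generic sketch, not the paper's implementation; the 0.2 clip ratio is the conventional default, not a value from the note.

    import torch

    def ppo_clip_loss(logp_new: torch.Tensor,
                      logp_old: torch.Tensor,
                      advantages: torch.Tensor,
                      clip_eps: float = 0.2) -> torch.Tensor:
        # PPO clipped surrogate: maximize min(r*A, clip(r, 1-eps, 1+eps)*A).
        ratio = torch.exp(logp_new - logp_old)           # importance ratio
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
        return -torch.min(unclipped, clipped).mean()     # negate for gradient descent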
DPO and IPO:
Construction of a preference dataset for DPO and IPO.
Regularization analysis and performance comparison (loss sketches below).
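For reference, the DPO and IPO objectives on a preference pair, written in terms of policy and reference-model log-probabilities of the chosen (w) and rejected (l) responses. These sketch the published losses, not the paper's code; beta = 0.1 is an illustrative default, and beta is the regularization strength the analysis varies.

    import torch
    import torch.nn.functional as F

    def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
        # DPO: -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))).
        margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
        return -F.logsigmoid(beta * margin).mean()

    def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
        # IPO: squared-error regression of the margin toward 1 / (2 * beta).
        margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
        return ((margin - 1.0 / (2.0 * beta)) ** 2).mean()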
Conclusion:
Insights from testing RL strategies on the (N, K)-Puzzle testbed.
Performance differences observed across PPO, DPO, and IPO.
Ethical Statement:
No direct ethical concerns identified, given the abstract nature of the study.
Limitations:
Limitations include the small scale of the language models studied.
Stats
"Model comprises nlayer = 12 transformer layers."
"We employ a learning rate of 10^-5."
"Model achieves an accuracy rate of 99%."