Key Concepts
Integrating the value model from Proximal Policy Optimization (PPO) with Monte-Carlo Tree Search (MCTS) decoding can significantly improve the preferability of generated text compared to direct decoding from the PPO policy alone.
Summary
The paper proposes a novel decoding method called PPO-MCTS that leverages the value model learned during PPO training to guide the text generation process using Monte-Carlo Tree Search (MCTS).
Key highlights:
- The PPO value model, which is typically discarded after training, is a natural candidate for the evaluation function in guided decoding. It is designed to evaluate incomplete sequences and is tailored for the associated policy model.
- Integrating the PPO value model with MCTS decoding can greatly improve the preferability of generated text compared to direct decoding from the PPO policy alone, while maintaining fluency and diversity.
- The authors introduce a critical modification to the original MCTS algorithm: initializing the Q-function of child nodes with the value of the parent node, which encourages exploration in the search tree (see the sketch after this list).
- Experiments on four text generation tasks (sentiment steering, toxicity reduction, knowledge introspection, and helpful/harmless chatbots) demonstrate the effectiveness of PPO-MCTS in generating more desirable text.
- The authors also show that PPO-MCTS outperforms alternative reward-optimizing strategies such as longer PPO training or best-of-n decoding (a best-of-n sketch follows the summary below).
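To make the third highlight concrete, here is a minimal sketch of value-guided MCTS decoding with the parent-value Q-initialization. The `policy_priors` and `value_fn` callables are hypothetical stand-ins for the PPO policy and value model, and all names and hyperparameters are illustrative rather than taken from the paper:

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class Node:
    token: Optional[int] = None
    parent: Optional["Node"] = None
    prior: float = 0.0      # policy probability of the token at this node
    q_value: float = 0.0    # running mean of backed-up value estimates
    visit_count: int = 0
    children: Dict[int, "Node"] = field(default_factory=dict)

def puct(parent: Node, child: Node, c_puct: float = 1.0) -> float:
    # Standard PUCT rule: exploit Q, explore in proportion to the prior.
    u = c_puct * child.prior * math.sqrt(parent.visit_count) / (1 + child.visit_count)
    return child.q_value + u

def expand(node: Node, priors: Dict[int, float], parent_value: float) -> None:
    # Key modification: initialize each child's Q with the parent's value
    # estimate rather than zero, so unvisited branches stay competitive
    # and the search keeps exploring.
    for token, p in priors.items():
        node.children[token] = Node(token=token, parent=node,
                                    prior=p, q_value=parent_value)

def backup(node: Optional[Node], value: float) -> None:
    # Propagate the leaf's value estimate up to the root as a running mean.
    while node is not None:
        node.visit_count += 1
        node.q_value += (value - node.q_value) / node.visit_count
        node = node.parent

def mcts_next_token(tokens: List[int],
                    policy_priors: Callable[[List[int]], Dict[int, float]],
                    value_fn: Callable[[List[int]], float],
                    n_sims: int = 50) -> int:
    root = Node()
    expand(root, policy_priors(tokens), value_fn(tokens))
    for _ in range(n_sims):
        node, seq = root, list(tokens)
        while node.children:                  # select down to a leaf
            node = max(node.children.values(),
                       key=lambda c: puct(node, c))
            seq.append(node.token)
        leaf_value = value_fn(seq)            # value model scores the partial sequence
        expand(node, policy_priors(seq), leaf_value)
        backup(node, leaf_value)
    # Emit the most-visited child as the next token.
    return max(root.children.values(), key=lambda c: c.visit_count).token
```

Because the PPO value model is trained to score incomplete sequences, it can be queried at every tree node, which is what makes it a natural evaluation function for this search.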
The paper highlights the under-explored benefits of the PPO value model and recommends that the community save and utilize it for enhanced text generation.
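For contrast, the best-of-n baseline mentioned above reduces to sampling and reranking; a minimal sketch, assuming hypothetical `sample_continuation` and `reward_model` functions:

```python
def best_of_n(prompt: str, sample_continuation, reward_model, n: int = 16) -> str:
    # Sample n continuations from the policy and keep the one the reward
    # model scores highest. Unlike PPO-MCTS, the reward signal is applied
    # only to finished sequences, not during generation.
    candidates = [sample_continuation(prompt) for _ in range(n)]
    return max(candidates, key=lambda text: reward_model(prompt, text))
```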
Statistics
The PPO policy alone fails to satisfy the task constraint in sentiment steering, generating negative-sentiment continuations for positive prompts and vice versa. (Figure 1)
PPO-MCTS reduces the maximum toxicity of generated text by 34% (relative) compared to direct decoding from the PPO policy. (Table 2)
Using PPO-MCTS to decode commonsense knowledge improves downstream QA performance by 12% (relative). (Table 3)
PPO-MCTS generates dialog responses with a 5% (absolute) higher win rate in human evaluation for creating helpful and harmless chatbots. (Table 4)
Quotes
"Our key observation is that the value model produced from Proximal Policy Optimization (PPO) (Schulman et al., 2017) is a natural candidate for the evaluation function in guided decoding."
"Experiments on four text generation tasks show that PPO-MCTS generates text with higher preferability than standard decoding (e.g., top-p sampling (Holtzman et al., 2019))."
"Our empirical results demonstrate that PPO-trained policies can benefit from guided decoding, and that the PPO value model is both theoretically justified and empirically effective in guiding the search in MCTS."