Modern language models tend to generate correct facts early in a passage but then drift away from the topic and produce increasingly incorrect facts later in the text. This "semantic drift" can be measured, and it can be mitigated through early stopping and reranking, improving the factual accuracy of generated text.
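A minimal sketch of the two mitigation strategies, under assumptions: `score_fn` stands in for a real factuality or drift scorer (not specified in the source), generation is treated as a list of candidate sentences, and the threshold value is illustrative.

```python
def early_stop(sentences, score_fn, threshold=0.5):
    """Keep sentences in order until the drift/factuality score
    drops below the threshold, then truncate (early stopping)."""
    kept = []
    for s in sentences:
        if score_fn(s) < threshold:
            break
        kept.append(s)
    return " ".join(kept)


def rerank(candidates, score_fn):
    """Reranking: pick the candidate continuation that the
    scorer rates as most factual / least drifted."""
    return max(candidates, key=score_fn)
```

In practice the scorer would be a trained factuality classifier or an entailment model run against retrieved evidence; the sketch only shows where such a scorer plugs into the decoding loop.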
Integrating the value model from Proximal Policy Optimization (PPO) with Monte-Carlo Tree Search (MCTS) decoding can significantly improve the preferability of generated text compared to direct decoding from the PPO policy alone.
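A hedged sketch of how a PPO value model can guide MCTS decoding: the policy proposes token priors, the value model scores partial sequences, and PUCT-style selection balances the two. `policy_fn` and `value_fn` are stand-ins for the trained PPO policy and value heads; the toy functions, vocabulary, and simulation count are illustrative assumptions, not the source's actual setup.

```python
import math


class Node:
    """One node in the search tree over partial token sequences."""

    def __init__(self, prior):
        self.prior = prior        # policy probability of reaching this node
        self.children = {}        # token -> Node
        self.visits = 0
        self.value_sum = 0.0

    def q(self):
        """Mean value-model estimate over all visits."""
        return self.value_sum / self.visits if self.visits else 0.0


def mcts_decode_step(seq, policy_fn, value_fn, n_sims=50, c=1.0):
    """Run n_sims simulations and return the next token to emit.

    policy_fn(tokens) -> {token: prior prob}   (stand-in for PPO policy)
    value_fn(tokens)  -> scalar value          (stand-in for PPO value model)
    """
    root = Node(1.0)
    for _ in range(n_sims):
        node, cur, path = root, list(seq), [root]
        # Selection: descend via PUCT until reaching a leaf.
        while node.children:
            tok, child = max(
                node.children.items(),
                key=lambda kv: kv[1].q()
                + c * kv[1].prior * math.sqrt(node.visits + 1) / (1 + kv[1].visits),
            )
            cur.append(tok)
            node = child
            path.append(node)
        # Expansion: add children with priors from the policy.
        for tok, p in policy_fn(cur).items():
            node.children[tok] = Node(p)
        # Evaluation with the value model, then backup along the path.
        v = value_fn(cur)
        for n in path:
            n.visits += 1
            n.value_sum += v
    # Emit the most-visited child token, as in standard MCTS decoding.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```

The design point is that the value model, not just the policy's next-token probabilities, shapes which branches get explored, which is how the combined search can outperform direct decoding from the policy alone.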