Core Concepts
Integrating Monte Carlo Tree Search (MCTS) into an iterative preference learning framework can significantly boost the reasoning capabilities of Large Language Models (LLMs).
Summary
This paper introduces an approach that leverages Monte Carlo Tree Search (MCTS) to enhance the reasoning abilities of Large Language Models (LLMs) through an iterative preference learning process. The key aspects of the proposed method are:
- MCTS for Step-Level Preference Collection:
  - MCTS is used to break down instance-level rewards into more granular step-level signals, providing detailed guidance for policy improvement.
  - The MCTS process involves selection, expansion, and backup stages to balance quality exploitation and diversity exploration during preference data sampling.
  - Stepwise self-evaluation is incorporated to enhance consistency in intermediate reasoning steps.
- Iterative Preference Learning:
  - The preference data collected via MCTS is used to update the LLM policy through Direct Preference Optimization (DPO).
  - This iterative framework enables continuous refinement of the LLM policy, bringing it progressively closer to human-like reasoning and decision-making. (A minimal sketch of the overall loop appears after this list.)
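Taken together, these two stages form a collect-then-train loop. The sketch below illustrates how step-level preference pairs could be gathered with MCTS; it is a minimal, assumption-laden illustration, and the `MCTSNode` class along with the `policy.generate_step`, `policy.self_evaluate`, and `policy.score_rollout` helpers are hypothetical placeholders rather than the paper's actual implementation.

```python
import math
import random

# Minimal, illustrative sketch of MCTS-based step-level preference collection.
# All policy helpers (generate_step, self_evaluate, score_rollout) are
# hypothetical placeholders, not the paper's actual API.

class MCTSNode:
    def __init__(self, state, parent=None):
        self.state = state        # prompt plus the reasoning steps generated so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0          # running mean of backed-up step rewards

    def ucb(self, c=1.4):
        # Upper confidence bound: trades off quality exploitation vs. diversity exploration.
        if self.visits == 0:
            return float("inf")
        return self.value + c * math.sqrt(math.log(self.parent.visits) / self.visits)


def collect_step_preferences(question, policy, n_simulations=32, n_candidates=4):
    """Run MCTS over reasoning steps and emit (prefix, preferred, dispreferred) triples."""
    root = MCTSNode(state=question)
    for _ in range(n_simulations):
        # 1) Selection: descend by UCB until a leaf is reached.
        node = root
        while node.children:
            node = max(node.children, key=lambda child: child.ucb())
        # 2) Expansion: sample candidate next steps from the current policy.
        for _ in range(n_candidates):
            step = policy.generate_step(node.state)              # hypothetical API
            node.children.append(MCTSNode(node.state + step, parent=node))
        # 3) Backup: score one candidate, mixing the outcome reward with stepwise
        #    self-evaluation (for consistent intermediate steps), and propagate it.
        leaf = random.choice(node.children)
        reward = (0.5 * policy.score_rollout(leaf.state)
                  + 0.5 * policy.self_evaluate(leaf.state))      # hypothetical APIs
        while leaf is not None:
            leaf.visits += 1
            leaf.value += (reward - leaf.value) / leaf.visits
            leaf = leaf.parent

    # Sibling steps with the highest / lowest backed-up values become preference pairs.
    pairs, stack = [], [root]
    while stack:
        node = stack.pop()
        if len(node.children) >= 2:
            ranked = sorted(node.children, key=lambda child: child.value, reverse=True)
            pairs.append((
                node.state,                          # shared prefix (context)
                ranked[0].state[len(node.state):],   # preferred next step
                ranked[-1].state[len(node.state):],  # dispreferred next step
            ))
        stack.extend(node.children)
    return pairs
```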
The theoretical analysis reveals the critical importance of using on-policy sampled data for successful self-improving training, in contrast to the potential failure of offline preference data collection.
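To make the on-policy point concrete, the following is a sketch of the standard DPO objective applied in an iterative loop where each round's preference pairs are sampled from the current policy (here reusing the `collect_step_preferences` sketch above). The `log_probs` helper, `ref_model`, and `optimizer` are assumed objects, and the loop is illustrative rather than the paper's training code.

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective for a batch of (preferred, dispreferred) pairs.

    Inputs are response-level log-probabilities (summed over tokens); `_w` marks
    the preferred (winning) continuation and `_l` the dispreferred (losing) one.
    """
    # Implicit reward margin: policy-vs-reference log-ratios for each side of the pair.
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # Maximizing log-sigmoid of the margin pushes the policy toward the preferred step.
    return -F.logsigmoid(logits).mean()


# Illustrative on-policy loop: each round gathers preferences from the *current*
# policy via MCTS, then updates the policy with DPO before the next round.
# `log_probs`, `ref_model`, and `optimizer` are hypothetical helpers/objects.
def iterative_preference_learning(policy, ref_model, prompts, optimizer, num_rounds=4):
    for _ in range(num_rounds):
        pairs = [p for q in prompts for p in collect_step_preferences(q, policy)]
        for prefix, chosen, rejected in pairs:
            loss = dpo_loss(
                log_probs(policy, prefix, chosen),
                log_probs(policy, prefix, rejected),
                log_probs(ref_model, prefix, chosen),
                log_probs(ref_model, prefix, rejected),
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```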
Extensive evaluations on various arithmetic and commonsense reasoning tasks demonstrate remarkable performance improvements over existing models. For instance, the proposed approach outperforms the Mistral-7B Supervised Fine-Tuning (SFT) baseline on GSM8K, MATH, and SciQ, raising accuracy to 80.7% (+4.8%), 32.2% (+3.3%), and 88.5% (+7.7%), respectively.
Further analysis of the training- and test-time compute tradeoff shows that the method achieves its performance gains more compute-efficiently than sampling-only approaches.
Statistics
The model achieves 80.7% accuracy on the GSM8K dataset, an improvement of 4.8 percentage points over the SFT baseline.
The model achieves 32.2% accuracy on the MATH dataset, an improvement of 3.3 percentage points over the SFT baseline.
The model achieves 88.5% accuracy on the SciQ dataset, an improvement of 7.7 percentage points over the SFT baseline.
Quotes
"Integrating MCTS into the iterative process of policy development, it is plausible to achieve significant strides in the field of LLMs, particularly in the realm of reasoning and decision-making aligned with human-like preferences."
"Our work leverages MCTS to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals."
"Theoretical analysis reveals the critical importance of using on-policy sampled data for successful self-improving."