The paper proposes a method called SELF-EXPLORE to enhance the reasoning capabilities of large language models (LLMs) through self-training. The key ideas are:
For each self-generated rationale that ends in a wrong answer, the target model conducts step-level exploration: it samples multiple continuations from each step and marks the first step from which none of the sampled continuations reaches the correct answer as the first wrong step (the "first pit").
A pair-wise dataset is then constructed at the step level: for the same prefix, a continuation that still reaches the correct answer is paired as the positive sample against a continuation that begins at the first pit as the negative sample.
Any preference learning objective, such as Direct Preference Optimization (DPO), can then be applied to this fine-grained, step-level dataset to raise the probability of generating the positive rationales and lower the probability of generating the negative ones (a sketch of the full pipeline is given below).
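A minimal sketch of how this pipeline could be wired together is shown below. The helper names (sample_continuations, is_correct), the number of sampled continuations k, and the string handling are illustrative assumptions rather than the paper's actual implementation; only the overall shape, finding the first pit, building a prefix-conditioned chosen/rejected pair, and training with the standard DPO loss, follows the description above.

```python
import torch
import torch.nn.functional as F
from typing import Callable, List, Optional, Tuple


def find_first_pit(
    question: str,
    steps: List[str],                                        # steps of one incorrect rationale
    sample_continuations: Callable[[str, int], List[str]],   # hypothetical: (prompt, k) -> k sampled completions
    is_correct: Callable[[str], bool],                       # hypothetical: does the text end in the right answer?
    k: int = 4,
) -> Optional[Tuple[int, str, str, str]]:
    """Return (pit_index, shared_prompt, chosen, rejected) for one incorrect
    rationale, or None if no usable step-level pair can be built."""
    prefix = question
    prev_correct: Optional[str] = None        # a correct completion sampled from `prefix`
    for i, step in enumerate(steps):
        extended = prefix + "\n" + step
        completions = sample_continuations(extended, k)
        correct = [c for c in completions if is_correct(extended + c)]
        if not correct:
            # `step` is the "first pit": none of the k continuations sampled
            # after taking this step reaches the correct final answer.
            if prev_correct is None:
                return None                   # pit at the very first step; skip this rationale
            rejected = "\n".join(steps[i:])   # the original wrong continuation, starting at the pit
            return i, prefix, prev_correct, rejected
        prev_correct = correct[0]             # a recovery path from the current prefix
        prefix = extended
    return None                               # every step can still reach the answer


def dpo_loss(
    policy_chosen_logp: torch.Tensor,
    policy_rejected_logp: torch.Tensor,
    ref_chosen_logp: torch.Tensor,
    ref_rejected_logp: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO objective, here applied to prefix-conditioned step-level pairs:
    each log-probability is summed over the chosen/rejected continuation given the
    shared prompt (question plus the steps before the first pit)."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

In practice, sample_continuations would wrap the target model's own sampling, and the (pit_index, shared_prompt, chosen, rejected) tuples would be collected into the step-level preference dataset that the DPO-style training consumes.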
The experiments show that SELF-EXPLORE consistently outperforms supervised fine-tuning and outcome-supervised DPO across different base models (Mistral-7B, Llemma-7B, and DeepSeek-Math 7B) on the GSM8K and MATH datasets, with average improvements of 11.57% on GSM8K and 2.89% on MATH over supervised fine-tuning.
The analysis further demonstrates that the step-level reward design in SELF-EXPLORE is more effective than a naive approach that labels whole rationales only by the correctness of the final prediction (a contrasting sketch of this outcome-level pairing is given below). It also highlights that the capability of the exploring model matters for providing high-quality step-level supervision.
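For contrast, the naive outcome-level pairing mentioned above, where whole rationales are paired purely by final-answer correctness and the prompt is just the question, might look as follows; the field names and the is_correct helper are the same illustrative assumptions as in the sketch above, not the paper's implementation.

```python
from typing import Callable, Dict, List


def build_outcome_pairs(
    question: str,
    rationales: List[str],
    is_correct: Callable[[str], bool],   # same hypothetical answer checker as above
) -> List[Dict[str, str]]:
    """Naive outcome-level pairing: any fully correct rationale is 'chosen' and any
    incorrect one is 'rejected', with the bare question as the prompt. Unlike the
    step-level pairs built from the first pit, this gives no signal about *where*
    a solution goes wrong."""
    good = [r for r in rationales if is_correct(question + "\n" + r)]
    bad = [r for r in rationales if not is_correct(question + "\n" + r)]
    return [
        {"prompt": question, "chosen": g, "rejected": b}
        for g, b in zip(good, bad)
    ]
```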
Key Insights Distilled From
by Hyeonbin Hwa... at arxiv.org, 04-17-2024
https://arxiv.org/pdf/2404.10346.pdf