Core Concepts
Outcome supervision can train a value model that prioritizes steps likely to lead to a correct final answer, enabling efficient guided decoding for multi-step mathematical reasoning.
Abstract
The paper presents the Outcome-supervised Value Model (OVM), an approach to efficient multi-step mathematical reasoning. The key insights are:
Outcome supervision, which labels only the correctness of the final answer, can be used to train a value model that estimates how likely an incomplete reasoning path is to reach the correct final answer. By contrast, reward models trained with process supervision score the correctness of individual steps.
Theoretically, the authors show that outcome supervision for guided decoding implicitly learns a value model, as it estimates the probability of reaching a correct final answer given the current partial path.
Experiments on two multi-step mathematical reasoning datasets, GSM8K and Game of 24, demonstrate the superior performance of OVM compared to reward-based guided decoding approaches. Notably, the OVM-7B model achieves state-of-the-art results among LLMs up to 13B parameters on GSM8K, without relying on additional datasets, GPT-4, or code execution.
The authors analyze the advantages of OVM over reward models, highlighting that outcome supervision is more future-oriented and less labor-intensive, as it only requires annotations for the final answer correctness rather than per-step correctness.
The results show that OVM planning significantly increases the proportion of sampled paths that reach correct answers relative to vanilla sampling, indicating that it effectively guides the model toward accurate solutions.
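The labeling and guided-decoding ideas above can be sketched on toy data. The Python below is illustrative only and is not the authors' implementation: a lookup table stands in for the learned value model, and the names prefix_value_table, guided_beam_search, expand, and the toy paths are all hypothetical. Each prefix of a sampled path inherits the correctness label of that path's final answer. Averaging those labels per prefix gives the fraction of sampled completions that ended correct, which is the quantity a value model is meant to approximate. A step-level beam search then keeps the highest-valued partial paths.

```python
from collections import defaultdict

def prefix_value_table(sampled_paths):
    """Map each path prefix to the fraction of sampled paths through it that ended correct.

    sampled_paths: list of (steps, is_correct) pairs; only the final outcome is labeled.
    """
    hits = defaultdict(int)
    total = defaultdict(int)
    for steps, is_correct in sampled_paths:
        for k in range(1, len(steps) + 1):
            prefix = tuple(steps[:k])
            total[prefix] += 1
            hits[prefix] += int(is_correct)  # outcome label copied to every prefix
    return {prefix: hits[prefix] / total[prefix] for prefix in total}

def guided_beam_search(expand, value, beam_size, max_steps):
    """Step-level beam search that keeps the beam_size highest-valued partial paths."""
    beams = [()]
    for _ in range(max_steps):
        candidates = []
        for path in beams:
            next_steps = expand(path)
            if next_steps:
                candidates.extend(path + (step,) for step in next_steps)
            else:
                candidates.append(path)  # finished path is carried over unchanged
        candidates.sort(key=value, reverse=True)
        beams = candidates[:beam_size]
    return beams

# Toy data (hypothetical): five sampled two-step paths with outcome labels only.
sampled = [
    (["a", "b"], True),
    (["a", "b"], True),
    (["a", "c"], False),
    (["a", "c"], False),
    (["d", "e"], False),
]
table = prefix_value_table(sampled)

# Toy step generator (hypothetical): the steps available after each prefix.
children = {(): ["a", "d"], ("a",): ["b", "c"], ("d",): ["e"]}

def expand(path):
    return children.get(path, [])

def value(path):
    return table.get(path, 0.0)

best = guided_beam_search(expand, value, beam_size=1, max_steps=2)
print(best)
```

With a beam size of 1, the search prefers step "a" (value 0.5) over "d" (value 0.0) and then "b" (value 1.0) over "c" (value 0.0), so it returns the one path that was labeled correct.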
Stats
On GSM8K, the OVM-7B model achieves 84.7% accuracy, outperforming all models up to 13B parameters.
In Game of 24, the OVM-7B model reaches a 78.7% success rate with only 20 sampled paths, compared with an 11% success rate for greedy decoding and 11.7% for majority voting over 100 paths.
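To illustrate why majority voting can lag when most sampled paths are wrong, here is a minimal Python sketch contrasting it with picking the highest-valued path. The answers and scores are invented for illustration and are not taken from the paper. pick_by_value is a hypothetical helper, and choosing the final answer by highest value is an assumption about how a value model might be used at the end of decoding.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among the sampled paths."""
    return Counter(answers).most_common(1)[0][0]

def pick_by_value(answers, scores):
    """Return the answer of the path with the highest value score."""
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]

# Hypothetical final answers from five sampled paths (the target is 24).
answers = ["18", "18", "24", "18", "20"]
# Hypothetical value scores for the same five paths.
scores = [0.10, 0.20, 0.90, 0.15, 0.05]

voted = majority_vote(answers)
chosen = pick_by_value(answers, scores)
print(voted, chosen)
```

In this toy case the vote settles on the common wrong answer "18", while the value-based choice recovers "24" from the single correct path.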
Quotes
"Outcome supervision simply focuses on the correctness of the final answer, at a coarser granularity."
"Outcome supervision appears to have the potential to assess the probable correctness of resulting final paths, starting from the current incomplete one."
"Outcome supervision supersedes process supervision in this scenario for two reasons: its inherent future-guided orientation and its labor-friendly nature without fine-grained annotations."