Core Concepts
REASONEVAL is a methodology for evaluating the quality of reasoning steps in mathematical problem solving. It scores each step for validity and redundancy, so that both the correctness and the efficiency of the overall reasoning process are measured.
Abstract
Current evaluation methodologies for mathematical reasoning in large language models (LLMs) focus primarily on final-answer accuracy and neglect the quality of the intermediate reasoning steps. To address this, the authors propose REASONEVAL, an evaluation methodology that assesses the validity (correctness of each step) and redundancy (efficiency of the reasoning process) of the solution steps.
The key highlights and insights are:
REASONEVAL formulates the evaluation as a three-way classification task, where each reasoning step is labeled as positive (correct and contributes to solving the problem), neutral (correct but does not make progress), or negative (incorrect).
REASONEVAL achieves state-of-the-art performance on human-labeled datasets and accurately detects different types of errors generated by perturbation, outperforming embedding-based and prompting-based approaches.
Applying REASONEVAL to evaluate specialized math LLMs reveals that an increase in final-answer accuracy does not necessarily guarantee an improvement in the overall quality of reasoning steps, especially for challenging mathematical problems.
The model scale, base model, and training methods significantly influence the quality of reasoning steps, with larger models and specialized training strategies like continued pretraining on math-related corpora leading to better performance.
REASONEVAL can play a significant role in data selection, helping to identify high-quality training data that improves the efficiency and quality of solutions.
The authors open-source the best-performing REASONEVAL model, meta-evaluation script, and all evaluation results to facilitate future research in this area.
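The three-way labeling described above can be turned into step-level and solution-level scores. The following is a minimal sketch, not the authors' implementation. It assumes a classifier that outputs one probability per class for each step, treats a step as valid when it is positive or neutral, treats it as redundant when it is neutral, and aggregates over a solution by taking the weakest validity and the strongest redundancy. The names `StepProbs`, `step_validity`, `step_redundancy`, and `solution_scores` are hypothetical, and the probabilities are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StepProbs:
    """Hypothetical per-step class probabilities from a three-way classifier."""
    positive: float  # correct and contributes to solving the problem
    neutral: float   # correct but makes no progress
    negative: float  # incorrect


def step_validity(p: StepProbs) -> float:
    # Assumption: a step counts as valid if it is positive or neutral.
    return p.positive + p.neutral


def step_redundancy(p: StepProbs) -> float:
    # Assumption: a step counts as redundant if it is neutral.
    return p.neutral


def solution_scores(steps: List[StepProbs]) -> Tuple[float, float]:
    """Aggregate step scores into (validity, redundancy) for a whole solution.

    Assumption: one bad step spoils the solution, so validity is the minimum
    over steps and redundancy is the maximum over steps.
    """
    if not steps:
        raise ValueError("a solution needs at least one step")
    validity = min(step_validity(p) for p in steps)
    redundancy = max(step_redundancy(p) for p in steps)
    return validity, redundancy


# Made-up probabilities for a three-step solution.
steps = [
    StepProbs(positive=0.90, neutral=0.08, negative=0.02),
    StepProbs(positive=0.10, neutral=0.85, negative=0.05),  # likely redundant
    StepProbs(positive=0.05, neutral=0.05, negative=0.90),  # likely incorrect
]
validity, redundancy = solution_scores(steps)
```

Under these assumptions, a solution with a correct final answer can still receive a low validity score or a high redundancy score, which is the gap between final-answer accuracy and step quality that the summary describes. The same scores could serve as a filter when selecting training data, for example by keeping only solutions above a chosen validity threshold.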
Stats
The prime factorization of 242 is 2 * 11 * 11.
By the property of square roots, we have √(2 * 11 * 11) = √2 * √11 * √11 = 11 √2.
By the property of square roots, we have √11 * √11 = 11.
Therefore, the simplified form of √242 is 11 √2.
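The simplification of √242 can be checked mechanically. The sketch below is an illustration added here, not part of REASONEVAL. It factors the integer by trial division, moves each pair of equal prime factors outside the root, and leaves unpaired factors inside. The function name `simplify_sqrt` is hypothetical.

```python
from typing import Tuple


def simplify_sqrt(n: int) -> Tuple[int, int]:
    """Return (a, b) such that sqrt(n) == a * sqrt(b) and b is square-free."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    outside, inside = 1, 1
    p = 2
    while p * p <= n:
        count = 0
        while n % p == 0:  # count how many times p divides n
            n //= p
            count += 1
        outside *= p ** (count // 2)  # each pair of p's leaves the root as one p
        inside *= p ** (count % 2)    # an unpaired p stays under the root
        p += 1
    inside *= n  # whatever remains is 1 or a single prime factor
    return outside, inside


outside, inside = simplify_sqrt(242)
print(f"sqrt(242) = {outside} * sqrt({inside})")  # prints "sqrt(242) = 11 * sqrt(2)"
```

Because 242 = 2 * 11 * 11, the pair of 11s comes out of the root as a single 11 and the unpaired 2 stays inside.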
Quotes
"The leaderboard of Large Language Models (LLMs) in mathematical tasks has been continuously updated. However, the majority of evaluations focus solely on the final results, neglecting the quality of the intermediate steps."
"We argue that a desirable evaluation criterion for mathematical reasoning encompasses not only the accuracy of the final answer but also the correctness and efficiency of each step in the reasoning process."