Core Concepts
FineMath provides a fine-grained evaluation benchmark for Chinese Large Language Models (LLMs), focusing on their mathematical reasoning abilities.
Abstract
FineMath introduces a fine-grained mathematical evaluation dataset for Chinese LLMs, covering various mathematical concepts and problems. The dataset is categorized into 17 types of math word problems with different difficulty levels. Extensive experiments reveal room for improvement in the mathematical reasoning capabilities of Chinese LLMs. Factors influencing model results are analyzed, emphasizing the need for comprehensive evaluations.
Abstract:
Introduction to FineMath as an evaluation benchmark for Chinese LLMs.
Importance of assessing mathematical reasoning abilities.
Data Extraction:
"All the 17 categories of math word problems are manually annotated with their difficulty levels according to the number of reasoning steps required to solve these problems."
"The length of the LLM-generated answers reflects the model’s 'confidence' when handling questions."
Related Work:
Comparison with traditional math word problem (MWP) datasets such as AddSub and MultiArith.
Inspiration from the MATH dataset in categorizing math problems.
Data Collection and Annotation:
Process of collecting diverse questions and manually annotating them for categorization, standardization, and transformation into multiple-choice questions (a sketch of this transformation follows below).
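As a hedged sketch of the multiple-choice transformation: the record schema and distractor handling below are hypothetical, since FineMath performs this step through manual annotation rather than code.

```python
import random

def to_multiple_choice(question: str, answer: str, distractors: list[str]) -> dict:
    """Turn a word problem plus its gold answer into a four-option item.

    Hypothetical schema; FineMath's actual transformation is manual.
    """
    options = distractors[:3] + [answer]
    random.shuffle(options)
    labels = ["A", "B", "C", "D"]
    return {
        "question": question,
        "options": dict(zip(labels, options)),
        "gold": labels[options.index(answer)],
    }

item = to_multiple_choice(
    "Tom has 3 apples and buys 2 more. How many apples does he have?",
    answer="5",
    distractors=["4", "6", "1"],
)
print(item["question"], item["options"], "gold:", item["gold"])
```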
Data Statistics and Analysis:
Overview statistics of FineMath data across different mathematical concepts and difficulty levels.
Analysis of contamination risks: benchmark items that leak into training data can distort evaluation results (an overlap-check sketch follows below).
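A common way to probe such contamination is word-level n-gram overlap between benchmark items and training text. The sketch below is one standard check, not necessarily the paper's exact procedure; the 8-gram size is an assumption.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """All word-level n-grams of a text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_score(item: str, corpus_chunk: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear in a corpus chunk."""
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_chunk, n)) / len(item_grams)

benchmark_item = "a farmer has 17 sheep and sells 9 how many sheep remain in the field"
corpus_chunk = "web text a farmer has 17 sheep and sells 9 how many sheep remain more text"
print(contamination_score(benchmark_item, corpus_chunk))  # high overlap -> likely leaked
```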
Experiments:
Evaluation of various LLMs on FineMath to assess their mathematical reasoning capabilities.
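A generic harness for this kind of evaluation might look like the following; `query_model` is a stand-in for whichever LLM API is under test, and the regex-based label extraction is one simple convention rather than the paper's prescribed method.

```python
import re

def query_model(prompt: str) -> str:
    """Stand-in for a real LLM call; replace with an actual API client."""
    return "The answer is B."

def extract_choice(response: str) -> str | None:
    """Pull the first standalone option label A-D from a model response."""
    match = re.search(r"\b([ABCD])\b", response)
    return match.group(1) if match else None

def evaluate(items: list[dict]) -> float:
    """Accuracy over multiple-choice items with 'question' and 'gold' keys."""
    correct = sum(extract_choice(query_model(it["question"])) == it["gold"]
                  for it in items)
    return correct / len(items)

print(evaluate([{"question": "2 + 3 = ? (A) 4 (B) 5 (C) 6 (D) 7", "gold": "B"}]))
```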
Analysis:
Examination of factors that influence evaluation results, such as prompts, evaluation methods, and response length (a length-analysis sketch follows below).
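For the response-length factor, one simple analysis is to compare mean generation length on correctly versus incorrectly answered items, in line with the quoted observation that answer length reflects a model's "confidence". The record fields below are assumptions.

```python
from statistics import mean

def length_by_correctness(records: list[dict]) -> dict[str, float]:
    """Mean response length (in words) for correct vs. incorrect answers.

    records: [{"response": str, "correct": bool}, ...]  # assumed schema
    """
    buckets: dict[str, list[int]] = {"correct": [], "incorrect": []}
    for r in records:
        buckets["correct" if r["correct"] else "incorrect"].append(
            len(r["response"].split()))
    return {k: mean(v) if v else 0.0 for k, v in buckets.items()}

print(length_by_correctness([
    {"response": "The total is 5, so the answer is B.", "correct": True},
    {"response": "It might be A, or possibly C; it is hard to say for sure.", "correct": False},
]))
```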