Core Concepts
Selecting influential data for fine-tuning on mathematical reasoning tasks is crucial for both performance and computational efficiency. The authors propose a Quality-aware Diverse Selection (QaDS) strategy to select influential data, and explore an optimal influential data composition for mathematical reasoning tasks.
Abstract
The authors explore two key questions for mathematical reasoning tasks: 1) How to select influential data? and 2) What is an influential data composition?
To address the first question, the authors propose a Quality-aware Diverse Selection (QaDS) strategy. QaDS takes into account both the diversity and quality aspects of the data. For diversity, it uses a K-center Greedy algorithm to select diverse data distributions. For quality, it defines a "quality score" based on the positive influence of data on each other, simulating whether a sample is influential for other samples in the training process.
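The diversity step can be illustrated with a minimal sketch of the generic K-center Greedy algorithm. This is not the authors' implementation: the function name `k_center_greedy`, the use of Euclidean distance, the choice of starting point, and the toy 2-D "embeddings" are all assumptions made for illustration, and the quality-score part of QaDS is not modeled here.

```python
import math


def k_center_greedy(embeddings, k, first=0):
    """Return k indices chosen by the generic K-center Greedy heuristic.

    Assumptions (not from the paper): Euclidean distance, and the
    sample at index `first` as the initial center.
    """
    n = len(embeddings)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of samples")

    selected = [first]
    # Distance from every sample to its nearest selected center so far.
    min_dist = [math.dist(p, embeddings[first]) for p in embeddings]

    while len(selected) < k:
        # Pick the sample that is farthest from all current centers.
        nxt = max(range(n), key=min_dist.__getitem__)
        selected.append(nxt)
        # Update each sample's distance to its nearest center.
        min_dist = [
            min(d, math.dist(p, embeddings[nxt]))
            for d, p in zip(min_dist, embeddings)
        ]
    return selected


# Toy data: two tight pairs of points plus one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
chosen = k_center_greedy(points, k=3)
print(chosen)  # [0, 4, 2]
```

On the toy data, the three selected indices fall in three different regions of the space rather than on near-duplicate points, which is the behavior a diversity-oriented selector is after. In practice the inputs would be sentence or model embeddings of the training samples rather than hand-written coordinates.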
To address the second question, the authors first enlarge their training datasets and construct 4 sub-settings to explore the influential data composition. They highlight two key observations: 1) Scaling up reasoning data is helpful, and 2) Together with general data, especially those selected by QaDS, the performance can be further improved.
The authors define their optimal mixture as OpenMathMix, an influential data mixture of open-source data selected by QaDS. With OpenMathMix, they achieve a state-of-the-art 48.8% accuracy on the MATH dataset. Additionally, the authors showcase the use of QaDS in creating efficient fine-tuning mixtures with various selection ratios, and analyze the quality of a wide range of open-source datasets as a reference for future work on mathematical reasoning tasks.
Stats
Selecting only a limited amount of data can lead to superior performance on general tasks.
Training on a combination of reasoning data and general data can achieve performance gains over training on reasoning data alone.
Scaling up reasoning data is helpful for improving mathematical reasoning ability.
Quotes
"Selecting influential data for fine-tuning on downstream tasks is a key factor for both performance and computation efficiency."
"Together with general data, especially with those selected by QaDS, the performance can be further improved."
"Scaling up reasoning data is helpful."