The authors explore two key questions for mathematical reasoning tasks: 1) How should influential data be selected? 2) What does an influential data composition look like?
To address the first question, the authors propose a Quality-aware Diverse Selection (QaDS) strategy, which accounts for both the diversity and the quality of the data. For diversity, it uses a K-center Greedy algorithm to select a subset that covers the data distribution. For quality, it defines a "quality score" based on the positive influence that samples have on each other, which simulates whether a sample is influential for other samples during training.
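The diversity step can be illustrated with a minimal sketch of K-center Greedy selection. This is not the authors' implementation: the summary does not say which features or distance QaDS uses, so the 2-D points, the Euclidean distance, and the names `k_center_greedy` and `embeddings` are assumptions made for illustration only.

```python
import math


def k_center_greedy(points, k, first=0):
    """Greedily pick k indices; each new pick is the point farthest
    from its nearest already-selected point (hypothetical helper)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [first]
    # min_dist[i] is the distance from point i to its nearest selected center.
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(selected) < k:
        # The least-covered point becomes the next center.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        # Update coverage distances with the new center.
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy embeddings (assumed): two tight clusters and one isolated point.
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 5.0)]
chosen = k_center_greedy(embeddings, k=3)
print(chosen)  # [0, 3, 4]
```

With three picks, the sketch returns one point from each cluster plus the isolated point, which is the kind of coverage a diversity-oriented selection aims for. A quality score, however it is computed, would then be applied alongside this step.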
To address the second question, the authors first enlarge their training datasets and construct four sub-settings to explore the influential data composition. They highlight two key observations: 1) Scaling up reasoning data is helpful, and 2) Adding general data, especially data selected by QaDS, improves performance further.
The authors call their optimal mixture OpenMathMix, an influential data mixture built from open-source data selected by QaDS. With OpenMathMix, they achieve a state-of-the-art 48.8% accuracy on the MATH dataset. The authors also show how QaDS can be used to create efficient fine-tuning mixtures at various selection ratios, and they analyze the quality of a wide range of open-source datasets as a reference for future work on mathematical reasoning tasks.
Key insights distilled from: Xinzhe Ni, Ye... at arxiv.org, 04-02-2024
https://arxiv.org/pdf/2404.01067.pdf