
Enhancing Mathematical Reasoning in Large Language Models through Efficient Data Selection and Composition


Core Concepts
Selecting influential data for fine-tuning on mathematical reasoning tasks is crucial for both performance and computational efficiency. The authors propose a Quality-aware Diverse Selection (QaDS) strategy to select such data, and explore an optimal influential data composition for mathematical reasoning tasks.
Abstract
The authors explore two key questions for mathematical reasoning tasks: 1) How to select influential data? and 2) What is an influential data composition? To address the first question, they propose a Quality-aware Diverse Selection (QaDS) strategy that accounts for both the diversity and the quality of the data. For diversity, QaDS uses a K-center Greedy algorithm to cover diverse data distributions. For quality, it defines a "quality score" that estimates whether a sample exerts a positive influence on other samples during training. To address the second question, the authors enlarge their training datasets and construct 4 sub-settings to explore the influential data composition. They highlight two key observations: 1) scaling up reasoning data is helpful, and 2) combining reasoning data with general data, especially general data selected by QaDS, further improves performance. The authors define their optimal mixture as OpenMathMix, an influential data mixture built from open-source data selected by QaDS. With OpenMathMix, they achieve a state-of-the-art 48.8% accuracy on the MATH dataset. Additionally, they showcase the use of QaDS in creating efficient fine-tuning mixtures with various selection ratios, and analyze the quality of a wide range of open-source datasets as a reference for future work on mathematical reasoning tasks.
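The diversity component of QaDS relies on the standard K-center Greedy algorithm mentioned above. As a rough illustration (not the authors' implementation; the function name and the use of plain Euclidean distance over precomputed sample embeddings are assumptions), the core selection loop can be sketched as:

```python
import numpy as np

def k_center_greedy(embeddings: np.ndarray, budget: int) -> list[int]:
    """Greedily select `budget` indices that spread across the embedding space.

    Each iteration adds the sample farthest from all currently selected
    centers, which is the classic K-center Greedy coverage heuristic.
    """
    # Start from an arbitrary seed point (index 0).
    selected = [0]
    # Distance from every point to its nearest selected center so far.
    min_dist = np.linalg.norm(embeddings - embeddings[0], axis=1)
    for _ in range(budget - 1):
        # The point farthest from all current centers becomes the next center.
        next_idx = int(np.argmax(min_dist))
        selected.append(next_idx)
        # Update each point's distance to its nearest center.
        new_dist = np.linalg.norm(embeddings - embeddings[next_idx], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected
```

Because each pick maximizes distance to the existing centers, near-duplicate samples are passed over in favor of samples from unexplored regions; in QaDS this diversity signal is then combined with the quality score to produce the final training mixture.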
Stats
Selecting only limited data can lead to superior performance on general tasks.
Training on a combination of reasoning data and general data can achieve performance gains over training on reasoning data alone.
Scaling up reasoning data is helpful for improving mathematical reasoning ability.
Quotes
"Selecting influential data for fine-tuning on downstream tasks is a key factor for both performance and computation efficiency." "Together with general data, especially with those selected by QaDS, the performance can be further improved." "Scaling up reasoning data is helpful."

Deeper Inquiries

How can the proposed QaDS strategy be extended to other specialized domains beyond mathematical reasoning?

The Quality-aware Diverse Selection (QaDS) strategy proposed for mathematical reasoning can be extended to other specialized domains by adapting the selection criteria to each domain's requirements. In medical diagnostics, for instance, selection could focus on cases with rare conditions or complex symptomatology to sharpen the model's predictions. In legal text analysis, it could prioritize cases with nuanced interpretations or conflicting precedents to deepen the model's grasp of legal language and reasoning. By customizing the selection criteria to the unique characteristics of each domain, QaDS can be applied effectively to a wide range of specialized tasks.

What are the potential limitations or drawbacks of relying solely on open-source data for training large language models on specialized tasks?

While relying solely on open-source data for training large language models on specialized tasks offers advantages such as accessibility and diversity, there are limitations to consider. The first concerns quality and relevance: open-source datasets may not cover the full spectrum of a specialized task, or may contain biases that degrade the model's performance. Open-source data may also lack the specificity or depth that certain specialized domains require, leading to suboptimal performance on complex tasks. Finally, in niche domains open-source data may simply be scarce, making it difficult to train models effectively without access to proprietary or domain-specific datasets.

How might the insights from this work on influential data composition be applied to improve the robustness and generalization of large language models across a broader range of tasks and domains?

The insights from this work on influential data composition can be applied to improve the robustness and generalization of large language models across a broader range of tasks and domains by focusing on the selection of diverse and high-quality data. By identifying influential data that spans a wide range of scenarios, contexts, and complexities, models can be trained to handle a variety of challenges and tasks effectively. Additionally, understanding the composition of influential data can help in creating more balanced and representative training datasets, reducing biases and improving the model's ability to generalize to unseen data. By incorporating these insights into the training process of large language models, researchers can enhance the models' performance and adaptability across different tasks and domains.