Core Concepts
SMALLTOLARGE (S2L) is a scalable data selection method for supervised fine-tuning that leverages training trajectories from small models to guide data selection for larger models. It significantly improves data efficiency in specialized domains.
Abstract
SMALLTOLARGE (S2L) is a novel approach that improves data efficiency in supervised fine-tuning of large language models. By utilizing training trajectories from smaller models, S2L outperforms state-of-the-art data selection algorithms across a range of datasets and domains, demonstrating both effectiveness and scalability.
The work addresses the challenge of data efficiency in supervised fine-tuning for specialized domains and introduces SMALLTOLARGE (S2L) as a solution. In experiments on mathematical problem-solving and clinical text summarization tasks, S2L matches or surpasses full-dataset performance while substantially reducing the amount of training data required.
Key points include the introduction of S2L as a scalable data selection method, its use of training trajectories from small models to guide data selection for larger models, and its significant improvements in data efficiency across datasets and domains. The method relies on balanced sampling from clusters of trajectories to ensure that every learning pattern in the data is covered.
Additionally, ablation studies show that S2L is robust to the length of the training trajectories, although longer trajectories yield further gains. Densely recorded trajectories, captured at any stage of training, are preferred by S2L for optimal performance.
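The cluster-then-sample pipeline described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `s2l_select`, the plain k-means routine, and the round-robin balanced sampler are all hypothetical stand-ins for whatever the paper actually uses.

```python
import numpy as np

def s2l_select(trajectories, k, budget, iters=50, seed=0):
    """Cluster per-example loss trajectories (recorded on a small model)
    and sample evenly across clusters to cover all learning patterns.

    trajectories: array of shape (n_examples, n_checkpoints), each row the
    loss of one training example over the small model's checkpoints.
    Returns (selected_indices, cluster_labels)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(trajectories, dtype=float)
    budget = min(budget, len(X))

    # Plain k-means on the trajectories (stand-in for the paper's clustering).
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)

    # Balanced sampling: draw from clusters round-robin until the budget
    # is met, so small clusters (rare learning patterns) are not drowned out.
    pools = [list(rng.permutation(np.flatnonzero(labels == j))) for j in range(k)]
    selected = []
    while len(selected) < budget:
        for pool in pools:
            if pool and len(selected) < budget:
                selected.append(int(pool.pop()))
    return sorted(selected), labels

# Toy usage: three synthetic trajectory patterns (steadily learned,
# never learned, noisy), ten examples each.
rng = np.random.default_rng(1)
easy = np.linspace(2.0, 0.1, 8) + rng.normal(0, 0.05, (10, 8))
hard = np.linspace(2.5, 2.0, 8) + rng.normal(0, 0.05, (10, 8))
noisy = rng.uniform(0.5, 2.5, (10, 8))
traj = np.vstack([easy, hard, noisy])

sel, labels = s2l_select(traj, k=3, budget=9)
```

Round-robin draws guarantee each discovered trajectory cluster contributes to the selected subset, which is the coverage property the method depends on.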
Statistics
With only 50K selected data points, S2L achieves 32.7% accuracy on the MATH benchmark.
In clinical text summarization on the MIMIC-III dataset, S2L outperforms training on the full dataset using only 50% of the data.
S2L can perform data selection using a reference model 40× smaller than the target model.