Bibliographic Information: Yam, H. M., & Paek, N. (2024). What Should Baby Models Read? Exploring Sample-Efficient Data Composition on Model Performance. arXiv preprint arXiv:2411.06672.
Research Objective: This paper investigates how different pre-training datasets affect the performance of small language models (SLMs) in a sample-efficient training setting, with the aim of identifying the best data composition for resource-constrained language model training.
Methodology: The researchers trained GPT and LLaMA models ranging from 18M to 705M parameters on four 10-million-word datasets: CHILDES (child-directed speech), Gutenberg (classic books), Mix (a diverse mixture), and TinyStories (simplified stories). Model performance was evaluated with the BabyLM evaluation suite, covering the BLiMP, EWoK, and GLUE benchmarks.
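The paper summary above does not include code; the following is a minimal, hypothetical sketch of the kind of setup it describes, using the Hugging Face transformers and datasets libraries. The file name, model dimensions, and hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
# Hypothetical sketch: pre-train a small GPT-2-style model on one ~10M-word corpus.
# Paths, model size, and hyperparameters below are illustrative only.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    GPT2Config,
    GPT2LMHeadModel,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tokenizer has no pad token by default

# One of the four 10M-word corpora (e.g. CHILDES, Gutenberg, Mix, TinyStories),
# assumed here to be a plain-text file with one document per line.
raw = load_dataset("text", data_files={"train": "childes_10M.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# A small GPT-2 configuration (roughly tens of millions of parameters);
# the paper trains several sizes spanning roughly 18M to 705M parameters.
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    n_positions=512,
    n_embd=256,
    n_layer=8,
    n_head=8,
)
model = GPT2LMHeadModel(config)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="slm-childes",
        per_device_train_batch_size=32,
        num_train_epochs=10,
        learning_rate=3e-4,
        report_to="none",
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
# Trained checkpoints would then be scored separately with the BabyLM
# evaluation pipeline (BLiMP, EWoK, GLUE).
```

Repeating this loop for each dataset and each model size yields the grid of model-dataset pairs that the evaluation compares.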
Key Findings: Performance depended on the interaction between dataset and model size: smaller models scored higher on the diverse Mix dataset, while larger models benefited most from the richer Gutenberg texts, and neither CHILDES nor TinyStories proved universally optimal.
Main Conclusions: The optimal dataset for sample-efficient language model training depends on the model size. Diverse datasets benefit smaller models, while larger models leverage the richness of complex datasets. Neither child-directed speech nor simplified stories are universally optimal.
Significance: This research provides valuable insights into data selection strategies for training SLMs under resource constraints, highlighting the interplay between dataset composition and model capacity.
Limitations and Future Research: The study used the same hyperparameters across all experiments, which may have left individual model-dataset pairs under-optimized. Future research could explore curriculum learning and a broader range of datasets and benchmarks.