The paper proposes a new scaling law suggesting that model performance depends primarily on the total compute spent on training, rather than on how that compute is allocated between model size and dataset size.
The key insights are:
The authors find that the logarithm of training compute (model parameters in billions × training tokens in trillions) correlates linearly with the compression performance (bits per character) of various strong open-source language models, spanning more than three orders of magnitude of compute (see the sketch after this list).
This linear relationship challenges the prevailing scaling-law paradigm, which holds that for compute-optimal training, model size and the number of training tokens should be scaled in equal proportion.
The authors argue that for inference efficiency, training should prioritize smaller model sizes and larger training datasets.
The authors suggest that once the available web data is exhausted, scaling up model size might be the only way to further improve model performance.
The authors acknowledge limitations, such as the importance of data quality and the unclear range over which the proposed scaling law applies. They also note that compression scores may not capture all aspects of model capability, and that future research is needed to explore the relationship between compute and other evaluation metrics.
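To make the claimed relationship concrete, here is a minimal sketch (in Python, assuming NumPy) that fits a line of the form BPC ≈ a · log10(N · D) + b, where N is model parameters in billions and D is training tokens in trillions. The model names and numbers below are illustrative placeholders, not data from the paper; the sketch only shows the functional form the authors report.

```python
import numpy as np

# Hypothetical (params in billions, tokens in trillions, bits-per-character) triples.
# These numbers are illustrative placeholders, NOT values reported in the paper.
models = [
    ("model-a",  1.0, 0.3, 0.95),
    ("model-b",  7.0, 2.0, 0.78),
    ("model-c", 13.0, 2.0, 0.74),
    ("model-d", 70.0, 2.0, 0.66),
]

# Compute proxy described above: parameters (billions) x training tokens (trillions).
log_compute = np.array([np.log10(n * d) for _, n, d, _ in models])
bpc = np.array([score for *_, score in models])

# Least-squares fit of the claimed linear relation: BPC ~= a * log10(compute) + b.
a, b = np.polyfit(log_compute, bpc, deg=1)
print(f"fitted slope a = {a:.3f}, intercept b = {b:.3f}")

# Extrapolating the fit to a larger, hypothetical compute budget.
n_billions, d_trillions = 400.0, 10.0
predicted = a * np.log10(n_billions * d_trillions) + b
print(f"predicted BPC at {n_billions:.0f}B params x {d_trillions:.0f}T tokens: {predicted:.3f}")
```

Under this functional form, each additional order of magnitude of compute lowers bits per character by a fixed amount, regardless of how that compute is split between parameters and tokens, which is the paper's central claim.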
Key insights distilled from the source content by Zhen Guo at arxiv.org, 05-01-2024: https://arxiv.org/pdf/2404.19484.pdf