This paper analyzes a simplified model of neural scaling and identifies four distinct phases of compute-optimal scaling behavior, determined by the interplay of data complexity, target complexity, and the noise introduced by stochastic gradient descent (SGD).
The generalization error of two-layer neural networks is strongly shaped by the power-law spectra often observed in real-world data: these spectra govern the learning dynamics and give rise to predictable power-law scaling of the error.
The paper's core message is that a dynamical mean field theory (DMFT) analysis of a randomly projected linear model trained with gradient descent reproduces many empirically observed neural scaling laws, including the distinct scaling exponents associated with training time, model size, and compute budget.
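A minimal numerical sketch of the kind of toy model these summaries describe is given below: features with a power-law covariance spectrum, a power-law target, a fixed random projection down to N trainable parameters, and a readout trained by plain gradient descent. The dimensions, spectral exponents, and learning rate are illustrative assumptions, not values taken from the paper, and the code is not the authors' implementation.

```python
import numpy as np

# Hypothetical toy setup: ambient features with power-law covariance
# eigenvalues lambda_k ~ k^{-1.5}, target coefficients decaying as k^{-0.5},
# a fixed random projection to N parameters, and gradient descent on the
# projected readout. All constants below are illustrative assumptions.
rng = np.random.default_rng(0)

D = 2000      # ambient feature dimension
N = 200       # projected model size (number of trainable parameters)
steps = 5000  # gradient descent steps
lr = 0.5      # learning rate (assumed stable for this spectrum)

k = np.arange(1, D + 1)
lam = k ** -1.5        # power-law eigenvalues of the data covariance
w_star = k ** -0.5     # power-law target coefficients in the eigenbasis

A = rng.normal(size=(N, D)) / np.sqrt(D)  # fixed random projection
v = np.zeros(N)                           # trainable readout weights

# Population loss of the projected model:
#   L(v) = 1/2 (A^T v - w_star)^T diag(lam) (A^T v - w_star)
losses = []
for t in range(steps):
    r = A.T @ v - w_star        # residual in the feature eigenbasis
    grad = A @ (lam * r)        # exact gradient of the population loss
    v -= lr * grad
    losses.append(0.5 * np.sum(lam * r ** 2))

# The loss should decay roughly as a power law in t before saturating at a
# model-size-limited floor set by N; sweeping N and steps under a fixed
# compute budget traces out a compute-optimal frontier.
print(losses[10], losses[100], losses[1000], losses[-1])
```

Sweeping N (model size) and the number of steps (training time) in this sketch, and plotting the loss against compute proportional to N × steps, is one simple way to visualize how separate time, size, and compute exponents can emerge from a single linear model with power-law structure.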