Core Concepts
Training behavior of wide neural networks is characterized by a single richness hyperparameter that controls the degree of feature learning, ranging from lazy kernel-like behavior to rich feature-learning behavior.
Abstract
The paper provides a gentle tutorial on the richness scale that governs the training behavior of wide neural networks. It starts by introducing a 3-layer linear model and defining three key criteria for well-behaved training: the Nontriviality Criterion (NTC), the Useful Update Criterion (UUC), and the Maximality Criterion (MAX).
The author then derives the richness scale by enforcing these criteria and solving for the model hyperparameters. This reveals that there is only one degree of freedom, the size of the representation updates ∥∆h∥, which controls the richness of training. At the lower end of the scale (r = 0), the model trains lazily like a kernel machine, while at the upper end (r = 1/2), it exhibits rich feature learning in the so-called μP regime.
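To make the two endpoints concrete, here is a minimal numerical sketch. It is my own toy, not the author's code: a two-layer linear model rather than the paper's three-layer one, with an NTK-style 1/√n output scaling standing in for the lazy r = 0 endpoint and a mean-field/μP-style 1/n output scaling (with learning rate scaled by n) standing in for the rich r = 1/2 endpoint. One gradient step at several widths shows ∥∆h∥ staying roughly O(1) in the lazy case and growing roughly like √n in the rich case.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # fixed input dimension; only the width n is scaled

def avg_dh_after_one_step(n, rich, trials=20):
    """Average ||dh|| after one GD step on a 2-layer linear toy f(x) = c * a.(W x).

    rich=False: NTK-style output scale c = 1/sqrt(n), lr = 1  -> expect ||dh|| ~ n^0
    rich=True : mean-field/muP-ish     c = 1/n,       lr = n  -> expect ||dh|| ~ n^(1/2)
    """
    norms = []
    for _ in range(trials):
        x = rng.standard_normal(d)
        W = rng.standard_normal((n, d)) / np.sqrt(d)  # hidden weights
        a = rng.standard_normal(n)                    # output weights
        c, lr = (1.0 / n, float(n)) if rich else (1.0 / np.sqrt(n), 1.0)

        h = W @ x                                     # hidden representation
        residual = c * (a @ h) - 1.0                  # squared loss toward target y = 1
        grad_W = residual * c * np.outer(a, x)        # dL/dW for L = 0.5 * residual^2
        dh = (-lr * grad_W) @ x                       # representation update from this step
        norms.append(np.linalg.norm(dh))
    return float(np.mean(norms))

for n in (256, 1024, 4096, 16384):
    print(f"n={n:>6}  lazy ||dh|| ≈ {avg_dh_after_one_step(n, False):7.1f}"
          f"   rich ||dh|| ≈ {avg_dh_after_one_step(n, True):8.1f}")
```

The averaging over a few random draws only smooths out single-step noise; the point is the trend with width, roughly constant in the lazy column versus roughly ∝ √n in the rich column.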
The author provides intuitive explanations for several key observations, including:
Weights update to align with their inputs
Weight alignment does not magnify gradients if ∥∆h∥ is bounded
Small initial outputs are necessary for representation learning
Standard parameterization yields unstable training
Models train lazily if and only if they are linearized
Model rescaling can emulate training at any richness (see the sketch after this list)
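The last observation can be illustrated with a small output-rescaling sketch (again my own toy, not the tutorial's construction): multiplying a centered model's output by α while dividing the learning rate by α² leaves the function update unchanged, but shrinks the relative feature update like 1/α. Large α therefore emulates lazy, effectively linearized training, and small α emulates rich training.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4096, 32
x = rng.standard_normal(d)
W0 = rng.standard_normal((n, d)) / np.sqrt(d)   # hidden weights at initialization
a = rng.standard_normal(n) / np.sqrt(n)         # frozen output weights
h0 = W0 @ x                                     # hidden representation at initialization

# Chosen so that a single step exactly fits the target in this linear toy problem.
base_lr = 1.0 / ((a @ a) * (x @ x))

def one_step(alpha):
    """One GD step on the centered, rescaled model f(W) = alpha * a.((W - W0) x).

    With lr = base_lr / alpha^2 the *function* change is identical for every alpha,
    while the *feature* change shrinks like 1/alpha.
    """
    lr = base_lr / alpha**2
    f_init = 0.0                                 # centering makes the initial output zero
    residual = f_init - 1.0                      # squared loss toward target y = 1
    grad_W = residual * alpha * np.outer(a, x)   # dL/dW for L = 0.5 * (f - y)^2
    W1 = W0 - lr * grad_W
    h1 = W1 @ x
    f_after = alpha * (a @ (h1 - h0))
    return np.linalg.norm(h1 - h0) / np.linalg.norm(h0), f_after

for alpha in (0.01, 0.1, 1.0, 10.0, 100.0):
    rel_dh, f_after = one_step(alpha)
    print(f"alpha={alpha:>7}:  relative ||dh|| = {rel_dh:10.6f}   f after step = {f_after:6.3f}")
```

Running it prints the same post-step function value for every α, while the relative feature change falls off as 1/α, which is the sense in which rescaling dials the richness without changing what the model fits in one step.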
Finally, the author presents empirical evidence showing that the conclusions hold for practical convolutional network architectures on the CIFAR-10 dataset, and discusses the implications for developing a scientific theory of feature learning in deep neural networks.
Stats
∥∆h∥ ∼ n^r, where 0 ≤ r ≤ 1/2
∥h₃∥ ∼ 1/∥∆h∥
∥∆W_ℓ^(ij)∥ ∼ 1/√n for ℓ = 1, 2 and ∼ 1/n for ℓ = 3
∥∆θ⊤ ∇²_θ f(x; θ₀) ∆θ∥ ∼ ∥∆h∥²/n
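Plugging ∥∆h∥ ∼ n^r into the last relation gives a quick way to see why lazy training is linearized training and why the wide limit has a phase transition: for any r < 1/2 the second-order term ∥∆h∥²/n vanishes as n grows, while at r = 1/2 it stays O(1). The short check below (my own illustration, just tabulating the relations above) makes the trend visible.

```python
# Plug ||dh|| ~ n^r into the scaling relations above and watch the second-order
# (non-linearized) term ||dh||^2 / n as the width grows: it vanishes for any
# r < 1/2 (lazy, effectively linearized) and stays O(1) only at r = 1/2 (rich).
for n in (10**2, 10**4, 10**6, 10**8):
    row = []
    for r in (0.0, 0.25, 0.4, 0.5):
        dh = n ** r                # ||dh|| ~ n^r
        second_order = dh**2 / n   # ~ ||dtheta^T Hess_theta f dtheta||
        row.append(f"r={r:<4}: {second_order:9.4f}")
    print(f"n={n:>9}  " + "   ".join(row))
```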
Quotes
"Training behavior of wide neural networks is characterized by a single richness hyperparameter r prescribing how the size of the hidden representation updates ∥∆h∥ scales with model width n."
"At finite width, model behavior changes smoothly between the NTK endpoint r = 0 and the μP endpoint r = 1/2, but in the thermodynamic limit (n →∞) there is a discontinuous phase transition separating rich μP behavior from lazy r < 1/2 behavior."