Core Concepts
Grokking, a sudden increase in test accuracy after a long stretch of poor performance, is shown to be the result of a first-order phase transition in the internal representations of deep neural networks (DNNs). The transition is accompanied by the emergence of new features that significantly reduce sample complexity relative to the Gaussian Process (GP) limit.
Abstract
The paper investigates the phenomenon of "Grokking" in deep neural networks, in which test accuracy suddenly jumps after a long period of constant, poor performance. The authors apply a recent theoretical framework, the adaptive kernel approach, to two teacher-student models: one with a cubic-polynomial teacher and one with a modular-addition teacher.
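As a concrete illustration of the two setups, here is a minimal sketch of the data-generating teachers. This is an assumption-laden toy, not the paper's exact protocol: the cubic teacher's normalization, the input distribution, the dimension `d`, and the modulus `P` are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Cubic-polynomial teacher (assumed form: y = (w* . x)^3) ---
d = 50                                # illustrative input dimension
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)      # unit-norm teacher direction

def cubic_teacher(X):
    """Target that depends on x only through the single direction w*."""
    return (X @ w_star) ** 3

X = rng.standard_normal((1000, d))    # i.i.d. standard-Gaussian inputs
y = cubic_teacher(X)

# --- Modular-addition teacher: label(a, b) = (a + b) mod P ---
P = 97                                # illustrative modulus
a, b = np.meshgrid(np.arange(P), np.arange(P))
pairs = np.stack([a.ravel(), b.ravel()], axis=1)  # all P^2 input pairs
labels = (pairs[:, 0] + pairs[:, 1]) % P
```

A student network trained on subsets of data generated this way is the setting in which the phases below are analyzed.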
The key findings are:
Grokking can be mapped onto the theory of first-order phase transitions in physics. Before the transition, the network's internal representations are well described by Gaussian Feature Learning (GFL), in which neuron pre-activations remain approximately Gaussian while their covariance matrices gradually adapt to the task.
At the phase transition, a new set of features emerges, leading to a mixed phase (GMFL-I) in which some neurons fluctuate around teacher-aware minima while others remain in the GFL phase. Further increasing the dataset or network size leads to a pure GMFL-II phase in which all neurons specialize to the teacher (a toy diagnostic for distinguishing these phases is sketched below).
The authors show that the complex prior induced by a finite DNN can learn features from the readily accessible linear component of the target and apply them to the non-linear components, leading to a significant reduction in sample complexity compared to the Gaussian Process limit (see the worked decomposition below).
For the modular-addition teacher, the authors provide an analytical prediction for the location of the phase transition, demonstrating the predictive power of their approach.
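To make the phase labels concrete, the following is a hypothetical diagnostic (not taken from the paper): track each first-layer neuron's alignment with the teacher direction w* from the sketch above. The function name, the cosine-overlap criterion, and the 0.5 threshold are all assumptions.

```python
import numpy as np

def specialized_fraction(W, w_star, threshold=0.5):
    """Fraction of hidden neurons aligned with the teacher direction.

    W        : (n_neurons, d) first-layer weight matrix
    w_star   : (d,) unit-norm teacher direction
    threshold: cosine-overlap cutoff for calling a neuron 'specialized'
    """
    cos_overlap = np.abs(W @ w_star) / np.linalg.norm(W, axis=1)
    return float(np.mean(cos_overlap > threshold))

# Reading the phases off this toy diagnostic:
#   fraction ~ 0      -> GFL: no neuron has locked onto w*
#   0 < fraction < 1  -> GMFL-I: specialized and GFL neurons coexist
#   fraction ~ 1      -> GMFL-II: all neurons specialized to the teacher
```

The claim about linear components can also be made concrete with a worked Hermite decomposition of the cubic target, assuming a unit-norm teacher and standard-Gaussian inputs (the paper's exact normalization may differ):

```latex
% z = w_*^\top x with x \sim \mathcal{N}(0, I_d) and \|w_*\| = 1.
% Using the probabilists' Hermite polynomial He_3(z) = z^3 - 3z:
(w_*^\top x)^3
  = \underbrace{\mathrm{He}_3(w_*^\top x)}_{\text{non-linear component}}
  + \underbrace{3\, w_*^\top x}_{\text{linear component}}
```

Because Hermite polynomials are orthogonal under the Gaussian measure, the linear piece 3 w*.x is the part picked up first from data, and it already singles out the feature direction w*; the network can then reuse that direction to fit the harder He_3 component. This is the sense in which features learned from the linear component are "applied to the non-linear ones."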
Overall, the paper provides a unifying framework to classify and predict feature learning effects in deep neural networks, with potential implications for network pruning, interpretability, and sample complexity.
Stats
The paper reports no standalone numerical statistics; its key results are analytical predictions validated by numerical simulations.
Quotes
"Grokking, a sudden increase in test accuracy, is shown to be a result of a first-order phase transition in the internal representations of deep neural networks."
"The complex prior induced by a finite DNN can learn the features from the readily accessible linear components of the target and apply them to the non-linear ones, leading to a significant reduction in sample complexity compared to the Gaussian Process limit."