Bibliographic Information: Zhou, M., & Ge, R. (2024). How Does Gradient Descent Learn Features -- A Local Analysis for Regularized Two-Layer Neural Networks. arXiv preprint arXiv:2406.01766v2.
Research Objective: This paper investigates the feature learning capabilities of gradient descent in two-layer neural networks, particularly focusing on the local convergence behavior in later training stages.
Methodology: The authors analyze a teacher-student setup where a student network learns a target function represented by a teacher network. They employ theoretical analysis, including the construction of dual certificates and test functions, to characterize the local loss landscape and the behavior of gradient descent with weight decay.
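The following minimal sketch illustrates a teacher-student setup of this kind: Gaussian inputs labeled by a fixed two-layer teacher network, and an over-parameterized student to be trained. The ReLU activation, the network widths, and the unit-norm teacher neurons are illustrative assumptions, not necessarily the paper's exact parameterization.

```python
# Hypothetical teacher-student setup: Gaussian inputs labeled by a fixed
# two-layer "teacher" network, and an over-parameterized "student" to train.
# The ReLU activation and the specific sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, k, m, n = 20, 4, 40, 5000  # input dim, teacher width, student width, samples

def relu(z):
    return np.maximum(z, 0.0)

# Teacher network f*(x) = sum_j a*_j relu(<w*_j, x>) with unit-norm neurons
W_star = rng.normal(size=(k, d))
W_star /= np.linalg.norm(W_star, axis=1, keepdims=True)
a_star = rng.choice([-1.0, 1.0], size=k)

# Gaussian input data labeled by the teacher (no label noise)
X = rng.normal(size=(n, d))
y = relu(X @ W_star.T) @ a_star

# Student network with the same architecture; both layers will be trained
W = rng.normal(size=(m, d)) / np.sqrt(d)
a = rng.normal(size=m) / np.sqrt(m)
```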
Key Findings: The study reveals that with a carefully chosen weight decay schedule, gradient descent can recover the ground-truth teacher network within polynomial time. Notably, the analysis demonstrates that feature learning occurs not only in the initial training phase, as highlighted in previous works, but also in the late stage of training, where student neurons align with the teacher neurons.
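Continuing from the setup sketched above, the snippet below runs full-batch gradient descent on both layers with an explicit weight-decay term that is shrunk on a fixed schedule. The geometric decay schedule and all hyperparameters are illustrative choices, not the schedule analyzed in the paper.

```python
# Full-batch gradient descent on both layers of the student, with a weight
# decay (ridge) penalty lam that is decayed geometrically during training.
# Continues from the teacher-student setup above; hyperparameters are
# illustrative, not the schedule proven in the paper.
lr, lam = 0.05, 1e-2
decay_every, decay_factor = 500, 0.5

for t in range(5000):
    H = X @ W.T                      # pre-activations, shape (n, m)
    act = relu(H)
    r = act @ a - y                  # residuals, shape (n,)

    # Gradients of 1/(2n) ||r||^2 + (lam/2) (||W||_F^2 + ||a||^2)
    grad_a = act.T @ r / n + lam * a
    grad_W = (a[:, None] * ((r[:, None] * (H > 0)).T @ X)) / n + lam * W

    a -= lr * grad_a
    W -= lr * grad_W

    # Shrink the weight decay over time so the regularizer stops biasing
    # the final solution while still guiding the earlier dynamics
    if (t + 1) % decay_every == 0:
        lam *= decay_factor

loss = 0.5 * np.mean((relu(X @ W.T) @ a - y) ** 2)
print(f"final unregularized training loss: {loss:.4f}")
```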
Main Conclusions: This work provides theoretical evidence for the feature learning capability of gradient descent beyond the early stages of training. It highlights the importance of continued training of both layers in a two-layer network, leading to a stronger notion of feature learning compared to methods that fix the first layer weights after a few initial steps.
Significance: This research contributes to a deeper understanding of how gradient-based training can lead to effective feature learning in neural networks. It goes beyond the Neural Tangent Kernel (NTK) regime, in which feature learning is limited, and provides insight into the dynamics of gradient descent outside that regime.
Limitations and Future Research: The study focuses on a specific setting of two-layer neural networks with Gaussian input data. Further research could explore the generalizability of these findings to deeper architectures and different data distributions. Additionally, investigating the role of intermediate training steps in feature learning remains an open question.