The paper provides a comprehensive analysis of the optimization landscape of deep linear neural networks with square loss. Key insights:
Under weak assumptions, there are no spurious local minima and no local maxima, but the existence and variety of non-strict saddle points have received comparatively little attention.
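For concreteness, a standard way to write the objective in this setting is the following (the notation here is an assumption for illustration, not copied from the paper):

```latex
% Assumed standard deep linear network objective:
% W_1, ..., W_H are the layer matrices, X the inputs, Y the targets.
L(W_1, \dots, W_H) \;=\; \frac{1}{2}\, \big\| W_H W_{H-1} \cdots W_1 X \;-\; Y \big\|_F^2
```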
The authors characterize global minimizers, strict saddle points, and non-strict saddle points among all critical points, using simple conditions on the ranks of partial matrix products.
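To illustrate the quantities this characterization relies on, the sketch below computes the ranks of all partial products of the layer matrices; it does not reproduce the paper's actual conditions, and all names and dimensions are illustrative.

```python
import numpy as np

def partial_product_ranks(weights, tol=1e-10):
    """Ranks of all partial products W_j ... W_i of the layer matrices.

    `weights` is a list [W_1, ..., W_H] ordered from input to output.
    Returns a dict mapping (i, j) with i <= j (1-indexed) to rank(W_j @ ... @ W_i).
    """
    H = len(weights)
    ranks = {}
    for i in range(H):
        prod = weights[i]
        ranks[(i + 1, i + 1)] = np.linalg.matrix_rank(prod, tol=tol)
        for j in range(i + 1, H):
            prod = weights[j] @ prod  # left-multiply by the next layer
            ranks[(i + 1, j + 1)] = np.linalg.matrix_rank(prod, tol=tol)
    return ranks

# Example: a 3-layer linear network whose narrowest layer has width 2,
# so the end-to-end product W_3 W_2 W_1 has rank at most r_max = 2.
rng = np.random.default_rng(0)
W1, W2, W3 = rng.normal(size=(2, 5)), rng.normal(size=(4, 2)), rng.normal(size=(3, 4))
print(partial_product_ranks([W1, W2, W3]))
```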
This characterization sheds light on global convergence and implicit regularization phenomena observed when optimizing linear neural networks.
The authors explicitly parameterize the set of all global minimizers and exhibit large sets of strict and non-strict saddle points.
The results show that non-strict saddle points are associated with rmax distinct plateau values of the empirical risk, where rmax is the width of the thinnest layer. These plateaus can be mistaken for global minima by first-order algorithms.
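One plausible reading of this count (an interpretation consistent with the rmax figure, not a statement copied from the paper) is that each plateau value is the best risk achievable by a linear predictor of rank strictly below rmax:

```latex
% Assumed interpretation: candidate plateau values, one per rank level
% r = 0, 1, ..., r_max - 1, giving r_max values in total.
c_r \;=\; \min_{\operatorname{rank}(M) \le r} \; \frac{1}{2}\, \| M X - Y \|_F^2,
\qquad r = 0, 1, \dots, r_{\max} - 1 .
```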
The analysis also helps re-interpret recent global convergence results in terms of second-order properties of the loss landscape.