The paper considers the classical RMSProp and its momentum extension, and establishes their convergence rates measured by ℓ1 norm. The key highlights are:
The convergence rate, measured by the ℓ1 norm, is (1/T) ∑_{k=1}^T E[‖∇f(xk)‖1] ≤ Õ(√d/T^(1/4)), which matches the lower bound with respect to all the coefficients except the dimension d. This rate is analogous to the (1/T) ∑_{k=1}^T E[‖∇f(xk)‖2] ≤ Õ(1/T^(1/4)) rate of SGD in the ideal case where ‖∇f(x)‖1 = Θ(√d‖∇f(x)‖2).
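To make the analogy explicit, a one-line derivation under that ideal case (my notation, not quoted from the paper):

    \frac{1}{T}\sum_{k=1}^{T}\mathbb{E}\big[\|\nabla f(x_k)\|_2\big]
      = \Theta\!\Big(\tfrac{1}{\sqrt{d}}\Big)\cdot
        \frac{1}{T}\sum_{k=1}^{T}\mathbb{E}\big[\|\nabla f(x_k)\|_1\big]
      \le \Theta\!\Big(\tfrac{1}{\sqrt{d}}\Big)\cdot
        \tilde{O}\!\Big(\tfrac{\sqrt{d}}{T^{1/4}}\Big)
      = \tilde{O}\!\Big(\tfrac{1}{T^{1/4}}\Big),

so the √d factor in the ℓ1 bound is exactly absorbed by the norm equivalence ‖∇f(x)‖1 = Θ(√d‖∇f(x)‖2).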
The analysis does not require the bounded gradient assumption; instead, it assumes coordinate-wise bounded noise variance, which is slightly stronger than the standard bounded noise variance assumption used in SGD analyses.
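One common way to formalize the two assumptions, in my notation (the paper's exact statement may differ):

    % Coordinate-wise bounded noise variance (per coordinate i = 1, ..., d):
    \mathbb{E}\big[(\nabla_i f(x;\xi) - \nabla_i f(x))^2\big] \le \sigma_i^2 .

    % Standard bounded noise variance (typical in SGD analyses):
    \mathbb{E}\big[\|\nabla f(x;\xi) - \nabla f(x)\|_2^2\big] \le \sigma^2 .

    % Summing the coordinate-wise bound over i recovers the standard bound
    % with \sigma^2 = \sum_{i=1}^{d} \sigma_i^2, while the standard bound
    % alone does not control each coordinate individually; hence the
    % coordinate-wise version is slightly stronger.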
Tight dependence on the smoothness coefficient L, the initial function value gap f(x1) - f*, and the noise variance σs is achieved. The only coefficient whose tightness under the ℓ1 norm remains unclear is the dimension d.
The proof technique bounds the error term in the RMSProp update predominantly by the noise variance σs rather than by the function value gap f(x1) - f*, which ensures the tight dependence on σs in the final convergence rate.
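For reference, a minimal sketch of the generic RMSProp update and a momentum extension to which such an error-term analysis applies; the paper's exact parameterization, stepsize schedule, and placement of the stabilizing constant may differ:

    import numpy as np

    def rmsprop_step(x, grad, v, lr=1e-3, beta=0.999, eps=1e-8):
        """One generic RMSProp step: scale the gradient coordinate-wise by a
        running estimate of its second moment."""
        v = beta * v + (1.0 - beta) * grad**2      # second-moment estimate
        x = x - lr * grad / (np.sqrt(v) + eps)     # adaptive, per-coordinate step
        return x, v

    def rmsprop_momentum_step(x, grad, v, m, lr=1e-3, beta=0.999, theta=0.9, eps=1e-8):
        """Momentum extension: apply the same scaling to an exponential moving
        average of gradients instead of the raw gradient."""
        v = beta * v + (1.0 - beta) * grad**2
        m = theta * m + (1.0 - theta) * grad
        x = x - lr * m / (np.sqrt(v) + eps)
        return x, v, m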
Empirical observations show that the relationship ‖∇f(x)‖1 = Θ(√d‖∇f(x)‖2) holds true in real deep neural networks, justifying the choice of ℓ1 norm in the analysis.
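A minimal sketch of how such a check could be done for a given model; model, loss_fn, and batch are hypothetical placeholders, and this is not the paper's experimental code:

    import torch

    def l1_l2_ratio(model, loss_fn, batch):
        """Return ||g||_1 / (sqrt(d) * ||g||_2) for the stochastic gradient g of
        the loss on one batch, where d is the number of parameters. By norm
        equivalence the ratio always lies in [1/sqrt(d), 1]; observing it stay
        bounded below by a constant supports ||g||_1 = Theta(sqrt(d) ||g||_2)."""
        model.zero_grad()
        loss = loss_fn(model, batch)
        loss.backward()
        g = torch.cat([p.grad.reshape(-1) for p in model.parameters()
                       if p.grad is not None])
        d = g.numel()
        return (g.abs().sum() / (d ** 0.5 * g.norm(2))).item()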