Core Concepts
This paper provides a novel convergence analysis for Hierarchical Stochastic Gradient Descent (H-SGD) under non-IID data, non-convex objective functions, and stochastic gradients. The analysis introduces the concepts of "upward" and "downward" divergences to characterize data heterogeneity in H-SGD, and shows that local aggregation can improve global convergence by partitioning the global divergence into upward and downward parts.
Abstract
The paper presents a convergence analysis for Hierarchical Stochastic Gradient Descent (H-SGD), a distributed optimization algorithm for multi-level communication networks.
Key highlights:
Introduces the concepts of "upward" and "downward" divergences to characterize the data heterogeneity in H-SGD.
Derives a general convergence bound for two-level H-SGD with non-IID data, non-convex objective functions, and stochastic gradients.
Shows that the convergence upper bound of two-level H-SGD lies between the bounds of two corresponding single-level local SGD settings, revealing a "sandwich" behavior and the benefit of local aggregation.
Extends the analysis to the general multi-level case, where the "sandwich" behavior still holds.
Provides insights on how to choose the global and local aggregation periods, as well as the worker grouping strategy, to leverage the benefits of local aggregation.
The analysis demonstrates that local aggregation helps overcome data heterogeneity by partitioning the global divergence into upward and downward parts; with appropriately chosen aggregation periods and worker grouping, H-SGD can converge faster than local SGD.
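The two-level scheme summarized above — workers run local SGD steps, groups average their models periodically (local aggregation), and all groups average less frequently (global aggregation) — can be sketched as follows. This is a minimal illustrative simulation, not the paper's algorithm specification: the quadratic per-worker objectives, group sizes, and period values `tau_l` and `tau_g` are assumptions chosen to make the heterogeneity (non-IID) effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 2 groups of 3 workers. Worker i minimizes
# f_i(x) = 0.5 * (x - c_i)^2, so the global optimum is mean(c_i) = 5.0.
# The spread of the c_i models non-IID data across workers and groups.
groups = [np.array([1.0, 2.0, 3.0]), np.array([7.0, 8.0, 9.0])]
lr, tau_l, tau_g, rounds = 0.1, 5, 4, 50  # illustrative parameters

models = [np.zeros(len(c)) for c in groups]  # one scalar model per worker

for _ in range(rounds):                # global rounds
    for _ in range(tau_g):             # local-aggregation rounds per global round
        for _ in range(tau_l):         # local SGD steps per local round
            for g, c in enumerate(groups):
                noise = 0.01 * rng.standard_normal(len(c))
                grad = (models[g] - c) + noise   # stochastic gradient of f_i
                models[g] -= lr * grad
        # local aggregation: average models within each group
        models = [np.full(len(c), m.mean()) for m, c in zip(models, groups)]
    # global aggregation: average across groups (equal group sizes here)
    global_avg = np.mean([m.mean() for m in models])
    models = [np.full(len(c), global_avg) for c in groups]

print(global_avg)  # approaches the global optimum, mean of all c_i
```

Between global aggregations, each group's average drifts toward its own group mean (the "downward" heterogeneity within a group stays small, while the "upward" gap between group means and the global mean persists); the periodic global averaging is what pulls the model back toward the true global optimum.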
Stats
The paper contains no explicit numerical data or statistics; the analysis focuses on deriving theoretical convergence bounds for H-SGD.