Core Concepts
The core message of this work is to derive new hierarchical generalization error bounds for deep neural networks (DNNs) using information-theoretic measures. The bounds capture the effect of network depth and quantify the contraction of relevant information measures as the layer index increases, highlighting the benefits of deep models for learning.
Summary
The paper presents two new hierarchical generalization error bounds for deep neural networks (DNNs):
- KL Divergence Bound:
  - This bound refines previous results by bounding the generalization error in terms of Kullback-Leibler (KL) divergence and mutual information terms associated with the internal representations of each layer.
  - The bound shrinks as the layer count increases, adapts to low-complexity layers, and highlights the benefits of depth for learning; a schematic of its structure is sketched after this list.
- Wasserstein Distance Bound:
  - This bound applies to Lipschitz continuous losses and is stated in terms of the 1-Wasserstein distance.
  - It suggests the existence of a DNN layer that minimizes the generalization upper bound, acting as a "generalization funnel" layer; a toy illustration of this idea also follows the list.
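As a rough schematic of the KL-type bound's structure (not the paper's exact theorem statement), the layer-wise bound can be pictured as follows, where the symbol $\mathcal{I}_k$ is a stand-in for the layer-$k$ KL-divergence and mutual-information terms and the precise conditioning and constants are those given in the paper:

```latex
% Schematic only: the bound at layer k scales like sqrt(2 sigma^2 / n) times the
% square root of layer-k information terms, which contract as k grows.
% \mathcal{I}_k is a placeholder symbol, not the paper's notation.
\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}
\[
  \bigl|\overline{\operatorname{gen}}\bigr|
  \;\lesssim\;
  \sigma \sqrt{\frac{2}{n}} \;\sqrt{\mathcal{I}_k}\,,
  \qquad k = 1, \dots, L,
\]
where $\mathcal{I}_k$ collects the layer-$k$ KL-divergence and mutual-information terms
and contracts as the layer index $k$ grows.
\end{document}
```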
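For intuition on the "generalization funnel" idea, here is a minimal toy sketch, assuming synthetic per-layer activation summaries, a placeholder Lipschitz constant, and the one-dimensional Wasserstein distance from SciPy; it simply reports the layer whose Wasserstein term yields the smallest hypothetical bound value and is not the paper's construction.

```python
# Toy sketch: pick the layer whose 1-Wasserstein term gives the smallest
# (hypothetical) generalization upper bound -- the "generalization funnel".
# All inputs here are synthetic placeholders, not the paper's construction.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
n_layers, n_samples = 5, 1000
lipschitz_const = 1.0  # assumed Lipschitz constant of the loss

# Placeholder: scalar summaries of each layer's representation on train/test data.
train_acts = [rng.normal(0.0, 1.0 / (k + 1), n_samples) for k in range(n_layers)]
test_acts = [rng.normal(0.1, 1.0 / (k + 1), n_samples) for k in range(n_layers)]

# Layer-wise 1-Wasserstein distances between train and test activation laws.
w1 = [wasserstein_distance(tr, te) for tr, te in zip(train_acts, test_acts)]

# Hypothetical per-layer bound value: Lipschitz constant times the W1 term.
bound_values = [lipschitz_const * d for d in w1]
funnel_layer = int(np.argmin(bound_values))
print(f"layer-wise W1 terms: {np.round(w1, 4)}")
print(f"generalization-funnel layer (smallest bound term): {funnel_layer}")
```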
To quantify the contraction of the relevant information measures in the hierarchical KL divergence bound, the authors use the strong data processing inequality (SDPI). They consider three popular randomized regularization techniques: Dropout, DropConnect, and Gaussian noise injection.
The analysis demonstrates that the product of the contraction coefficients across the layers vanishes as the network depth and dropout probabilities (or noise level) increase, or the layer widths decrease. This highlights the advantage of deep network architectures and stochasticity.
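To see numerically why this product drives the bound down, the following sketch multiplies hypothetical per-layer contraction coefficients across different depths; the dropout-like choice η = 1 − p is only an illustrative placeholder, not the SDPI coefficient derived in the paper.

```python
# Toy sketch: the product of per-layer SDPI contraction coefficients shrinks
# as depth grows and as the dropout probability / noise level grows.
# The coefficient formula below is an illustrative placeholder, not the
# exact contraction coefficient derived in the paper.
import numpy as np

def product_of_coefficients(depth: int, eta_per_layer: float) -> float:
    """Product of identical per-layer contraction coefficients eta in [0, 1)."""
    return float(np.prod(np.full(depth, eta_per_layer)))

for depth in (2, 5, 10):
    for p_drop in (0.1, 0.3, 0.5):
        eta = 1.0 - p_drop  # placeholder dropout-like coefficient, NOT the paper's formula
        print(f"depth={depth:2d}, p={p_drop:.1f} -> coefficient product = "
              f"{product_of_coefficients(depth, eta):.4f}")
```

Increasing the depth or the dropout probability pushes the product toward zero, which is the mechanism behind the tightening of the hierarchical bound.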
The authors also instantiate their results for the Gibbs algorithm, yielding an O(1/n) generalization bound that decreases monotonically as the product of the contraction coefficients shrinks. Finally, they visualize the bounds and their dependence on the problem parameters using a simple numerical example of a DNN with a finite parameter space, showing that a deeper but narrower neural network architecture yields better generalization performance.
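For intuition on the Gibbs-algorithm instantiation, the sketch below samples weights from a Gibbs posterior proportional to prior(w)·exp(−γ·empirical loss(w)) over a small finite parameter space; the parameter grid, prior, loss, and inverse temperature γ are toy assumptions, and the O(1/n) bound itself is not reproduced here.

```python
# Toy sketch: the Gibbs algorithm over a finite parameter space.
# P(w | S) is proportional to prior(w) * exp(-gamma * empirical_loss(w, S)).
# The parameter grid, prior, gamma, and loss are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(1)
n = 200
data = rng.normal(loc=0.5, scale=1.0, size=n)      # toy training sample S
params = np.linspace(-1.0, 1.0, 21)                # finite parameter space
prior = np.full(params.shape, 1.0 / params.size)   # uniform prior
gamma = 5.0 * n                                    # assumed inverse temperature

# Empirical squared loss of each candidate parameter on S.
emp_loss = np.array([np.mean((data - w) ** 2) for w in params])

# Gibbs posterior: exponentially tilted prior, normalized over the finite grid.
log_post = np.log(prior) - gamma * emp_loss
log_post -= log_post.max()                         # numerical stability
posterior = np.exp(log_post)
posterior /= posterior.sum()

w_sampled = rng.choice(params, p=posterior)
print(f"posterior mode: {params[np.argmax(posterior)]:.2f}, sampled w: {w_sampled:.2f}")
```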
Statistics
The generalization error is bounded by σ·√(2/n)·√(sum of mutual-information terms) + I(Y; W), where σ is the sub-Gaussian parameter of the loss function.
The product of the SDPI contraction coefficients across the layers vanishes as the network depth and dropout probabilities (or noise level) increase, or the layer widths decrease.
For the Gibbs algorithm, the generalization bound is O(1/n) and decreases monotonically as the product of the contraction coefficients shrinks.
Quotes
"The core message of this work is to derive new hierarchical generalization error bounds for deep neural networks (DNNs) using information-theoretic measures."
"The analysis demonstrates that the product of the contraction coefficients across the layers vanishes as the network depth and dropout probabilities (or noise level) increase, or the layer widths decrease. This highlights the advantage of deep network architectures and stochasticity."