Core Concept
Decaying the contribution of identity shortcuts in residual connections can substantially improve the quality of abstract features learned by self-supervised masked autoencoders.
Summary
The paper investigates the impact of residual connections on self-supervised representation learning, particularly in the context of Masked Autoencoders (MAEs). The authors observe that while residual connections facilitate gradient propagation and enable the training of very deep networks, they may hinder the network's ability to learn abstract semantic features.
The key insights are:
- Residual connections directly inject low-level, high-frequency details from earlier layers into the output of deeper layers, potentially compromising feature abstraction.
- The authors propose a novel "decayed identity shortcuts" method, in which the contribution of the identity shortcut is gradually reduced as network depth increases, encouraging the deeper layers to learn more abstract representations (a minimal sketch appears after this list).
- Implementing these decayed identity shortcuts in an MAE framework leads to a substantial improvement in linear probing accuracy on ImageNet-1K, from 67.3% to 72.3%.
- Ablation studies on ImageNet-100 show that the decayed identity shortcuts method allows smaller models to outperform larger baseline models, suggesting it is an effective way to improve representation learning.
- The authors analyze the correlation between the low-rank nature of the learned features and the improved performance, in both supervised and self-supervised settings. They find that their method consistently produces lower-rank features, which is associated with better abstraction and generalization (one way to measure this is sketched after this list).
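To make the decayed-shortcut idea concrete, here is a minimal PyTorch sketch of a residual block whose identity path is scaled by a depth-dependent coefficient. The linear decay schedule and the MLP-only block body are illustrative assumptions for exposition; the paper's exact schedule and block internals may differ.

```python
import torch
import torch.nn as nn

class DecayedResidualBlock(nn.Module):
    """Residual block with a depth-dependent identity shortcut.

    Standard residual:  y = x + f(x)
    Decayed shortcut:   y = alpha * x + f(x),  with alpha shrinking at depth.
    The linear schedule from 1.0 down to alpha_min is an assumption,
    not necessarily the schedule used in the paper.
    """

    def __init__(self, dim, depth_index, total_depth, alpha_min=0.0):
        super().__init__()
        # Deeper blocks get a weaker identity path, pushing them to
        # rely on the learned transformation rather than the shortcut.
        self.alpha = 1.0 - (depth_index / max(total_depth - 1, 1)) * (1.0 - alpha_min)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        return self.alpha * x + self.mlp(self.norm(x))

# Example: a 12-block stack with ViT-B/16-like width.
blocks = nn.ModuleList(
    [DecayedResidualBlock(dim=768, depth_index=i, total_depth=12) for i in range(12)]
)
```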
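For the low-rank analysis, a common way to quantify how "low-rank" a feature matrix is is the effective rank (Roy & Vetterli, 2007): the exponentiated entropy of the normalized singular-value spectrum. The summary does not specify which rank measure the authors used, so this particular metric is an assumption.

```python
import torch

def effective_rank(features, eps=1e-12):
    """Effective rank of a (num_samples, dim) feature matrix.

    Lower values indicate that the feature energy is concentrated in
    fewer directions, i.e., more compressed representations. This
    metric is one plausible choice, not confirmed to be the paper's.
    """
    s = torch.linalg.svdvals(features.float())   # singular values
    p = s / (s.sum() + eps)                      # normalized spectrum
    entropy = -(p * torch.log(p + eps)).sum()    # Shannon entropy
    return torch.exp(entropy).item()
```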
Overall, the paper provides a simple yet effective modification to the standard residual connection design, which significantly boosts the quality of self-supervised visual representations learned by masked autoencoders.
Statistics
"Our modification to the identity shortcuts within a VIT-B/16 backbone of an MAE boosts linear probing accuracy on ImageNet from 67.3% to 72.3%."
"A VIT-S/16 model with our design outperforms a baseline VIT-B/16 (78.5% vs. 76.5%)."
Quotes
"While solving the gradient propagation issue, residual connections impose a specific functional form on the network; between residual connections, each layer (or block of layers) learns to produce an update slated to be added to its own input. This incremental functional form may influence the computational procedures learned by the network."
"Intuitively, identity shortcut connections may not be entirely appropriate for capturing high-level, semantic representations as they directly inject low-level, high-frequency details of inputs into outputs, potentially compromising feature abstraction."