Core Concept
Heterogeneous self-supervised learning enables a base model to learn complementary characteristics from an auxiliary head with a heterogeneous architecture, enhancing the representation quality of the base model.
Summary
The paper proposes heterogeneous self-supervised learning (HSSL), which trains a base model to learn from an auxiliary head whose architecture differs from that of the base model. The base model thereby acquires characteristics that its own architecture lacks, without any change to its structure.
The key insights are:
- The discrepancy between the base model and the auxiliary head is positively correlated with the improvements in the base model's representation quality. A greater discrepancy leads to more significant gains.
- The authors propose an efficient search strategy to quickly determine the most suitable auxiliary head for a given base model by simultaneously training with multiple candidate auxiliary heads.
- Several simple but effective methods are introduced to further enlarge the discrepancy between the base model and the auxiliary head, leading to additional performance boosts.
- HSSL is compatible with various self-supervised learning schemes, such as contrastive learning and masked image modeling, and consistently brings improvements across a range of downstream tasks, including image classification, semantic segmentation, instance segmentation, and object detection.
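The training signal described in the list above can be illustrated with a small sketch. The summary does not state the loss HSSL uses, so the version below assumes a distillation-style cross-entropy in which the auxiliary head's softmaxed output acts as a fixed target for the base model. The names `hssl_loss` and `multi_head_loss`, the temperature value, and the candidate-head labels are all hypothetical. Summing the per-head losses is one plausible reading of "simultaneously training with multiple candidate auxiliary heads". The criterion for picking the final head is not given in the summary and is left out.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Cross-entropy H(target, pred) between two discrete distributions."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, pred_probs))


def hssl_loss(base_logits, aux_logits, temperature=0.1):
    """Assumed objective: pull the base model's prediction towards the
    auxiliary head's output, which is treated as a fixed target."""
    target = softmax(aux_logits, temperature)
    pred = softmax(base_logits, temperature)
    return cross_entropy(target, pred)


def multi_head_loss(base_logits, candidate_aux_logits, temperature=0.1):
    """Assumed form of joint training: sum the loss over every candidate
    auxiliary head, and also return the per-head values for inspection."""
    per_head = {
        name: hssl_loss(base_logits, logits, temperature)
        for name, logits in candidate_aux_logits.items()
    }
    return sum(per_head.values()), per_head


if __name__ == "__main__":
    base = [1.0, 2.0, 3.0]
    candidates = {  # hypothetical candidate heads and their outputs
        "conv_head": [3.0, 2.0, 1.0],
        "attention_head": [1.0, 2.5, 3.0],
    }
    total, per_head = multi_head_loss(base, candidates)
    print(per_head, total)
```

In a real setup the logits would come from the base network and from each auxiliary head stacked on top of it, and the loss would be backpropagated through an autodiff framework. Plain lists are used here only to keep the sketch dependency-free.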
Statistics
With ConvNeXt as the auxiliary head, the base model achieves 72.7% Top-1 accuracy on ImageNet-1K, compared to 67.5% for the baseline, a gain of 5.2 percentage points.
With HSSL, the base model achieves 50.3% mIoU on ADE20K semantic segmentation, compared to 45.4% for the baseline, a gain of 4.9 points.
Quotes
"Heterogeneous Self-Supervised Learning (HSSL) endows the base model with new characteristics in a representation learning way without structural changes."
"We discover that the representation quality of the base model moves up as their architecture discrepancy grows."
"The HSSL is compatible with various self-supervised methods, achieving superior performances on various downstream tasks, including image classification, semantic segmentation, instance segmentation, and object detection."