Core Concepts
Self-supervised pretraining yields larger variance in batch normalization statistics, which enables more informative data synthesis and outperforms previous supervised dataset distillation methods, especially when larger recovery models are used.
Abstract
The paper introduces a new approach to dataset distillation, Self-supervised Compression for Dataset Distillation (SC-DD), which leverages self-supervised pretraining to address the limitations of existing supervised dataset distillation methods.
Key highlights:
Previous supervised dataset distillation methods, such as SRe2L, struggle to exploit larger recovery models, because the channel-wise means and variances stored in the model's batch normalization layers become flatter and less informative as the model grows.
The authors observe that self-supervised pretraining leads to larger variances in batch normalization (BN) statistics, which provides more informative supervision signals for data synthesis.
The proposed SC-DD framework separates the learning of intermediate feature distributions from the alignment of higher-level semantic information, which yields better performance than previous methods (see the sketch after this list).
Extensive experiments on CIFAR-100, Tiny-ImageNet, and ImageNet-1K datasets demonstrate that SC-DD outperforms state-of-the-art supervised dataset distillation methods, especially when using larger recovery models.
The authors highlight the importance of the pretraining scheme and the positive correlation between recovery-model size and performance in the dataset distillation task.
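To make the recovery step concrete, below is a minimal PyTorch sketch of BN-statistics-guided data synthesis in the spirit of SRe2L/SC-DD: the batch statistics of the synthetic images are matched to each BN layer's running statistics (intermediate feature distributions), while a cross-entropy term aligns higher-level semantics. The supervised torchvision ResNet-50 weights, the loss weight, and the optimizer settings here are illustrative assumptions; SC-DD would instead load a self-supervised (e.g., MoCo v3) checkpoint as the recovery model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

# Assumed recovery model: supervised torchvision weights for illustration;
# SC-DD would load a self-supervised (MoCo v3) ResNet-50 checkpoint instead.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet50(weights="IMAGENET1K_V2").to(device).eval()

bn_losses = []

def bn_hook(module, inputs, _output):
    # Match the synthetic batch's channel-wise mean/variance to the BN running statistics.
    x = inputs[0]
    mean = x.mean(dim=[0, 2, 3])
    var = x.var(dim=[0, 2, 3], unbiased=False)
    bn_losses.append(F.mse_loss(mean, module.running_mean)
                     + F.mse_loss(var, module.running_var))

hooks = [m.register_forward_hook(bn_hook)
         for m in model.modules() if isinstance(m, nn.BatchNorm2d)]

# Synthetic images (one per target class here) are the optimization variables.
targets = torch.arange(10, device=device)
syn = torch.randn(len(targets), 3, 224, 224, device=device, requires_grad=True)
optimizer = torch.optim.Adam([syn], lr=0.1)

for step in range(2000):
    bn_losses.clear()
    optimizer.zero_grad()
    logits = model(syn)
    ce_loss = F.cross_entropy(logits, targets)   # align higher-level semantics
    bn_loss = torch.stack(bn_losses).sum()       # align intermediate feature statistics
    loss = ce_loss + 0.01 * bn_loss              # the 0.01 weight is an assumed value
    loss.backward()
    optimizer.step()

for h in hooks:
    h.remove()
```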
Stats
The variance of the channel-wise variances in the first BN layer of a self-supervised MoCo-v3-ResNet-50 model is 28016.62, while for supervised ResNet-{18, 50, 101} models it is 0.14798, 0.00265, and 0.00253, respectively.
The variance of channel-wise mean in the first BN layer of the self-supervised MoCo-v3-ResNet-50 model is 233.19, while for supervised ResNet-{18, 50, 101} models, it is 0.79, 0.12, and 0.05, respectively.
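These spreads can be inspected directly from a checkpoint's BN buffers. The sketch below, assuming PyTorch/torchvision (a MoCo-v3 checkpoint would need to be loaded separately before calling the helper), computes the variance across channels of the running mean and running variance in the first BN layer; the reported numbers above are from the paper, not from this snippet.

```python
import torch.nn as nn
from torchvision import models

def first_bn_stat_spread(model: nn.Module):
    """Return (variance of channel-wise means, variance of channel-wise variances)
    for the first BatchNorm2d layer in `model`."""
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            return m.running_mean.var().item(), m.running_var.var().item()
    raise ValueError("no BatchNorm2d layer found")

# Supervised torchvision weights shown here; load a MoCo-v3 state_dict for the
# self-supervised comparison.
mean_spread, var_spread = first_bn_stat_spread(models.resnet50(weights="IMAGENET1K_V2"))
print(f"var of channel-wise means: {mean_spread:.5f}, "
      f"var of channel-wise variances: {var_spread:.5f}")
```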