Core Concepts
Diverse Feature Learning (DFL) combines self-distillation and model reset: self-distillation preserves important features while the reset facilitates learning new ones, improving performance on image classification tasks.
Abstract
The paper proposes a novel approach called Diverse Feature Learning (DFL) that combines two key components: feature preservation through self-distillation and new feature learning through model reset.
Feature Preservation:
DFL applies self-distillation with an ensemble of teacher models sampled from the training trajectory, leveraging the alignment of important features across these models.
The approach assumes that the model acquires important features during training but can also forget them; by properly selecting models along the training trajectory as teachers and distilling from them, those features are preserved (a sketch of the distillation loss follows).
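A minimal PyTorch-style sketch of the feature-preservation loss, assuming the distillation target is the average of the trajectory teachers' softened predictions; `alpha` and `temperature` are generic hyperparameters here, not values from the paper:

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits_list, labels,
                           alpha=0.5, temperature=4.0):
    # Supervised term on the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    # Ensemble target: average the softened predictions of teachers
    # snapshotted earlier on the training trajectory.
    with torch.no_grad():
        teacher_probs = torch.stack(
            [F.softmax(t / temperature, dim=1) for t in teacher_logits_list]
        ).mean(dim=0)
    # KL divergence between the student's softened output and the ensemble target.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        teacher_probs,
        reduction="batchmean",
    ) * temperature ** 2
    return (1 - alpha) * ce + alpha * kd
```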
New Feature Learning:
DFL employs a reset strategy, which involves periodically re-initializing part of the model.
This is based on the hypothesis that gradient descent can confine learning to a limited region of weight space, which may prevent certain features from being learned.
Resetting lets the model explore different constrained regions of weight space, enabling it to learn new features (see the sketch below).
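A minimal sketch of a periodic partial reset in PyTorch; which submodule to re-initialize is an illustrative assumption (the classifier head here), not the paper's prescribed choice:

```python
import torch.nn as nn

def reset_student_part(model: nn.Module, prefix: str = "classifier"):
    # Re-initialize every submodule whose name starts with `prefix`,
    # leaving the remaining weights intact.
    for name, module in model.named_modules():
        if name.startswith(prefix) and hasattr(module, "reset_parameters"):
            module.reset_parameters()

# Usage: reset part of the student every `reset_period` epochs.
# for epoch in range(num_epochs):
#     train_one_epoch(student, loader, optimizer)   # hypothetical helper
#     if (epoch + 1) % reset_period == 0:
#         reset_student_part(student)
```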
Experimental Results:
The authors conducted experiments on various lightweight models, including VGG, SqueezeNet, ShuffleNet, MobileNet, and GoogLeNet, using the CIFAR-10 and CIFAR-100 datasets.
The results demonstrate that DFL can significantly improve the performance of the VGG model on CIFAR-100, with a 1.09% increase in accuracy compared to the baseline.
Further analysis shows that the combination of self-distillation and reset exhibits a synergistic effect, and the appropriate selection of teachers for self-distillation can be beneficial.
However, the authors also identify limitations in the specific algorithms that implement DFL's concepts, such as vulnerability to overfitting when the previous epoch's training accuracy is used as the measure of meaningfulness for teacher updates (illustrated in the sketch below).
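A hypothetical sketch of the criticized update rule: a teacher snapshot is taken whenever the previous epoch's training accuracy improves. The function name, the `max_teachers` cap, and the snapshot format are illustrative assumptions:

```python
def maybe_update_teachers(teachers, student, train_acc, best_acc,
                          max_teachers=3):
    # Snapshot the student as a new teacher only if the previous epoch's
    # training accuracy improved; return the updated best accuracy.
    if train_acc > best_acc:
        teachers.append({k: v.detach().clone()
                         for k, v in student.state_dict().items()})
        if len(teachers) > max_teachers:  # assumed cap on ensemble size
            teachers.pop(0)               # drop the oldest snapshot
        return train_acc
    return best_acc
```

Because training accuracy tends to keep rising even after the model begins to overfit, this criterion can promote overfit snapshots to teachers, which is the vulnerability the authors point out.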
Stats
The CIFAR-10 dataset contains 60,000 32x32-pixel images across 10 classes, with 5,000 training images and 1,000 test images per class.
The CIFAR-100 dataset is similar to CIFAR-10, but with 100 classes, each having 500 training images and 100 test images.
Quotes
"To solve a task, it is important to know the related features. For example, in colorization, proper segmentation features are necessary to color in the correct locations."
"Because it has been reported that ensemble methods are more effective when the errors between different models are uncorrelated."
"Additionally, to facilitate learning new features, we do reset the student which means periodically re-initialize the student."