Key Concept
Decoupling the training of feature extraction layers and classification layers in overparameterized deep neural network architectures significantly improves model calibration while retaining accuracy.
Abstract
The paper presents two methods, Two-Stage Training (TST) and Variational Two-Stage Training (V-TST), to improve the calibration of overparameterized deep neural networks (DNNs) for image classification tasks.
Key highlights:
- Jointly training feature extraction layers (e.g., convolutional or attention layers) and classification layers (fully connected layers) in DNNs can lead to poorly calibrated models.
- TST first trains the DNN end-to-end with the cross-entropy loss, then freezes the feature extraction layers and re-trains only the classification layers (see the first sketch after this list).
- V-TST further improves calibration by placing a Gaussian prior on the last hidden layer outputs and training the classification layers variationally with the evidence lower bound (ELBO) objective (see the second sketch after this list).
- Experiments on CIFAR10, CIFAR100, and SVHN show that TST and V-TST significantly improve calibration, as measured by Expected Calibration Error (ECE) and Maximum Calibration Error (MCE), over the baseline DNN models while maintaining comparable accuracy.
- The improvements hold for both convolutional (Wide Residual Networks) and transformer-based (Vision Transformers) architectures.
- An ablation study suggests that the two-stage training approach, rather than just the modified architecture, is the key driver of the calibration improvements.
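To make the TST recipe concrete, here is a minimal PyTorch sketch. The toy `Net`, the dummy data, and the epoch counts are illustrative stand-ins, not the authors' code or hyperparameters; the point is only the two-stage structure: train everything, then freeze the feature extractor and re-train the head alone.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class Net(nn.Module):
    """Toy stand-in for a WRN/ViT: a feature extractor plus a linear head."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x))

def train(params, model, loader, epochs):
    opt = torch.optim.SGD(params, lr=0.1, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

# Random tensors standing in for CIFAR10-sized data.
train_loader = DataLoader(
    TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,))),
    batch_size=16,
)

model = Net()
# Stage 1: joint end-to-end training with cross-entropy.
train(model.parameters(), model, train_loader, epochs=5)
# Stage 2: freeze the feature extractor, re-train only the classifier.
# (Re-initializing the head here is one reasonable choice; the paper's
# exact stage-2 setup may differ.)
for p in model.features.parameters():
    p.requires_grad = False
model.classifier.reset_parameters()
train(model.classifier.parameters(), model, train_loader, epochs=5)
```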
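The second sketch shows the V-TST idea, continuing from the TST sketch above (it reuses `model.features` and `train_loader`). The frozen features are mapped to a Gaussian posterior q(z|x), a standard normal prior is placed on the latent z, and the head is trained by minimizing the negative ELBO. Layer sizes, the single-sample Monte Carlo estimate, and the `beta` weight are simplifications for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalHead(nn.Module):
    """Classification head trained variationally: q(z|x) is a diagonal
    Gaussian over the last hidden layer, p(z) = N(0, I), p(y|z) a linear
    classifier. Dimensions are illustrative."""
    def __init__(self, feat_dim=32, latent_dim=16, num_classes=10):
        super().__init__()
        self.mu = nn.Linear(feat_dim, latent_dim)
        self.logvar = nn.Linear(feat_dim, latent_dim)
        self.classifier = nn.Linear(latent_dim, num_classes)

    def forward(self, h):
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.classifier(z), mu, logvar

def negative_elbo(logits, y, mu, logvar, beta=1.0):
    # Expected log-likelihood term (single-sample Monte Carlo estimate).
    nll = F.cross_entropy(logits, y)
    # KL(q(z|x) || N(0, I)), closed form for diagonal Gaussians.
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
    return nll + beta * kl

# Variational stage 2: the feature extractor stays frozen.
head = VariationalHead()
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
for x, y in train_loader:
    with torch.no_grad():
        h = model.features(x)  # frozen stage-1 features
    logits, mu, logvar = head(h)
    loss = negative_elbo(logits, y, mu, logvar)
    opt.zero_grad()
    loss.backward()
    opt.step()
```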
Statistics
"If a perfectly calibrated model assigns probability p of class C in A events, we would expect p of those A events to be class C."
"Over-parameterized models under cross-entropy training tend to be overconfident."
Quotes
"Decoupling the training of feature extraction layers and classification layers in over-parametrized DNN architectures such as Wide Residual Networks (WRN) and Visual Transformers (ViT) significantly improves model calibration whilst retaining accuracy, and at a low training cost."
"Placing a Gaussian prior on the last hidden layer outputs of a DNN, and training the model variationally in the classification training stage, even further improves calibration."