Key Concepts
Decoupling the training of feature extraction layers and classification layers in overparameterized deep neural network architectures significantly improves model calibration while retaining accuracy.
Summary
The paper presents two methods, Two-Stage Training (TST) and Variational Two-Stage Training (V-TST), to improve the calibration of overparameterized deep neural networks (DNNs) for image classification tasks.
Key highlights:
- Jointly training feature extraction layers (e.g., convolutional or attention layers) and classification layers (fully connected layers) in DNNs can lead to poorly calibrated models.
- TST first trains the DNN end-to-end using cross-entropy loss, then freezes the feature extraction layers and re-trains only the classification layers (sketched after this list).
- V-TST further improves calibration by placing a Gaussian prior on the last hidden layer outputs and training the classification layers variationally using the evidence lower-bound (ELBO) objective (see the V-TST sketch after this list).
- Experiments on CIFAR10, CIFAR100, and SVHN datasets show that TST and V-TST significantly improve calibration metrics like Expected Calibration Error (ECE) and Maximum Calibration Error (MCE) compared to the baseline DNN models, while maintaining similar accuracy (both metrics are sketched below).
- The improvements hold for both convolutional (Wide Residual Networks) and transformer-based (Vision Transformers) architectures.
- An ablation study suggests that the two-stage training approach, rather than just the modified architecture, is the key driver of the calibration improvements.
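A minimal PyTorch sketch of the two-stage procedure follows, assuming the network is already split into a `backbone` (feature extraction layers) and a `head` (fully connected classification layers); the optimizer choices, learning rates, and epoch counts are illustrative assumptions, not values from the paper.

```python
# Two-Stage Training (TST) sketch: stage 1 trains the full network with
# cross-entropy; stage 2 freezes the feature extractor and re-trains the head.
import torch
import torch.nn as nn

def run_epochs(model, loader, optimizer, loss_fn, epochs, device="cpu"):
    """Plain cross-entropy training loop; train/eval modes are set by the caller."""
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()

def two_stage_training(backbone, head, loader, stage1_epochs=200, stage2_epochs=20):
    model = nn.Sequential(backbone, head)
    ce = nn.CrossEntropyLoss()

    # Stage 1: joint end-to-end training with cross-entropy.
    model.train()
    opt1 = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    run_epochs(model, loader, opt1, ce, stage1_epochs)

    # Stage 2: freeze the feature extractor and re-train only the classifier.
    for p in backbone.parameters():
        p.requires_grad = False
    backbone.eval()   # keep batch-norm statistics fixed during stage 2
    head.train()
    opt2 = torch.optim.SGD(head.parameters(), lr=0.01, momentum=0.9)
    run_epochs(model, loader, opt2, ce, stage2_epochs)
    return model
```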
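A sketch of the variational second stage (V-TST) under the same assumptions: a standard N(0, I) Gaussian prior on the last hidden representation, a head that parameterizes q(z | x) via the reparameterization trick, and a single-sample negative ELBO as the training loss. Layer sizes, the KL weight, and the optimizer are assumptions for illustration.

```python
# V-TST sketch: the frozen backbone produces features f(x); the variational head
# outputs mean and log-variance of q(z | x), samples z, and classifies from z.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalHead(nn.Module):
    def __init__(self, feat_dim, latent_dim, num_classes):
        super().__init__()
        self.mu = nn.Linear(feat_dim, latent_dim)      # mean of q(z | x)
        self.logvar = nn.Linear(feat_dim, latent_dim)  # log-variance of q(z | x)
        self.classifier = nn.Linear(latent_dim, num_classes)

    def forward(self, features):
        mu, logvar = self.mu(features), self.logvar(features)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.classifier(z), mu, logvar

def negative_elbo(logits, labels, mu, logvar, beta=1.0):
    # Single-sample estimate of the expected log-likelihood term.
    nll = F.cross_entropy(logits, labels)
    # KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
    return nll + beta * kl

def vtst_stage2(backbone, head, loader, epochs=20, device="cpu"):
    backbone.eval()  # the feature extractor stays frozen, as in TST
    for p in backbone.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                feats = backbone(x)
            logits, mu, logvar = head(feats)
            loss = negative_elbo(logits, y, mu, logvar)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```

At test time, predictive probabilities can be averaged over several samples of z; a single sample is used here for brevity.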
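For reference, a small NumPy sketch of the two reported calibration metrics, ECE and MCE, using the common equal-width confidence binning; the number of bins is an assumption rather than the paper's setting.

```python
# Expected Calibration Error (ECE) and Maximum Calibration Error (MCE):
# bin predictions by confidence and compare per-bin accuracy to per-bin confidence.
import numpy as np

def calibration_errors(probs, labels, n_bins=15):
    """probs: (N, C) predicted class probabilities; labels: (N,) true class indices."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)

    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap   # weighted by the fraction of samples in the bin
            mce = max(mce, gap)
    return ece, mce
```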
Statistics
"If a perfectly calibrated model assigns probability p of class C in A events, we would expect p of those A events to be class C."
"Over-parameterized models under cross-entropy training tend to be overconfident."
Quotes
"Decoupling the training of feature extraction layers and classification layers in over-parametrized DNN architectures such as Wide Residual Networks (WRN) and Visual Transformers (ViT) significantly improves model calibration whilst retaining accuracy, and at a low training cost."
"Placing a Gaussian prior on the last hidden layer outputs of a DNN, and training the model variationally in the classification training stage, even further improves calibration."