Core Concept
A novel test-time adaptation method, MiDl, that enhances the robustness of pretrained models against missing modalities without requiring retraining.
Summary
The paper addresses the challenge of handling missing modalities in egocentric videos, which is crucial for tasks like action recognition and moment localization. Current methods typically require retraining the entire model to address this issue, which is computationally expensive, especially with large training datasets.
The authors propose a novel approach called MiDl (Mutual information with self-Distillation) to address this challenge at test time without retraining. MiDl frames the problem as a test-time adaptation task, where the model adjusts to the available unlabeled data at test time.
The key aspects of MiDl are:
- It encourages the model to be insensitive to the specific modality source present during testing by minimizing the mutual information between the prediction and the available modality.
- It incorporates self-distillation to maintain the model's original performance when both modalities are available.
- MiDl is the first self-supervised, online solution for handling missing modalities exclusively at test time.
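The two terms above can be sketched concretely. The following is a minimal, hypothetical illustration of a MiDl-style objective, not the authors' implementation: the mutual-information term is estimated as the entropy of the marginal prediction minus the mean per-modality entropy (which is zero exactly when predictions are identical across modality sources), and the self-distillation term is a KL divergence against the frozen pretrained model's prediction. The function names and the trade-off weight `lam` are assumptions for illustration.

```python
import math

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(max(pi, eps) * math.log(max(pi, eps)) for pi in p)

def mutual_information(preds):
    """Empirical I(prediction; modality) under a uniform prior over the
    available modality sources:
        I = H(mean_m p(y|m)) - mean_m H(p(y|m)).
    It is 0 iff the prediction is invariant to the modality source."""
    k = len(preds[0])
    marginal = [sum(p[i] for p in preds) / len(preds) for i in range(k)]
    return entropy(marginal) - sum(entropy(p) for p in preds) / len(preds)

def kl_div(p, q, eps=1e-12):
    """KL(p || q); used here for the self-distillation term."""
    return sum(max(pi, eps) * math.log(max(pi, eps) / max(qi, eps))
               for pi, qi in zip(p, q))

def midl_loss(pred_audio, pred_video, pred_both, pred_both_frozen, lam=1.0):
    """Hypothetical MiDl-style objective: a modality-invariance term
    (mutual information across modality sources) plus self-distillation
    against the frozen pretrained model when both modalities are present.
    `lam` is an assumed trade-off weight, not taken from the paper."""
    mi = mutual_information([pred_audio, pred_video, pred_both])
    sd = kl_div(pred_both_frozen, pred_both)
    return mi + lam * sd
```

If the adapted model's predictions agree across modality sources and match the frozen model on complete inputs, both terms vanish, which matches the stated goals: insensitivity to the available modality and preservation of the original performance.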
The authors evaluate MiDl on various pretrained models and datasets, demonstrating substantial performance improvements without retraining, especially under high missing-modality rates (up to an 11% gain on the Epic Kitchens dataset).
Statistics
The baseline model's performance degrades quickly as the missing-modality rate increases, falling from 63.9% accuracy at a 0% missing rate to 29.5% at a 100% missing rate on the Epic Kitchens dataset.
MiDl improves baseline accuracy by 6% on Epic Sounds and 11% on Epic Kitchens under 50% and 75% missing rates.
With long-term adaptation, MiDl boosts performance further, improving the baseline by 8.8% on Epic Kitchens under a 100% missing rate.
Quotes
"MiDl represents the first self-supervised, online solution for handling missing modalities exclusively at test time."
"When combining pretrained models with MiDl, a significant performance gain is attained (6% on epic sounds and 11% on Epic Kitchens datasets)."