Core Concept
The paper introduces a Missing Modality Token (MMT) that maintains performance in multimodal egocentric action recognition even when modalities are absent.
Summary
The paper examines how missing modalities affect multimodal egocentric video understanding, particularly in transformer-based models, and proposes the Missing Modality Token (MMT) as a way to preserve performance when a modality is absent.
The key highlights are:
- Multimodal video understanding is crucial for analyzing egocentric videos, but practical applications often face incomplete modalities due to privacy concerns, efficiency demands, or hardware malfunctions.
- The authors study the impact of missing modalities on egocentric action recognition, particularly within transformer-based models.
- They propose the MMT to maintain performance when modalities are absent, and show that it is effective on the Ego4D, Epic-Kitchens, and Epic-Sounds datasets.
- With the MMT, the performance drop shrinks from roughly 30% to roughly 10% when half of the test set is modal-incomplete.
- Through extensive experimentation, the authors demonstrate the adaptability of MMT to different training scenarios and its superiority in handling missing modalities compared to current methods.
- The research contributes a comprehensive analysis and an innovative approach, opening avenues for more resilient multimodal systems in real-world settings.
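The idea in the highlights above can be pictured with a minimal Python sketch. It is not the authors' implementation: this summary does not say how the MMT enters the model, so the sketch assumes a two-modality (video plus audio) setup in which a single placeholder vector stands in for every token of the absent modality. The names `make_missing_mask`, `build_input`, `mmt`, and `num_audio_tokens` are hypothetical, and plain lists of floats are used in place of real embeddings. In an actual transformer the token would presumably be a parameter learned during training.

```python
import random
from typing import List, Optional, Sequence

Token = List[float]  # one embedding vector (illustrative stand-in)


def make_missing_mask(num_samples: int, r_test: float, seed: int = 0) -> List[bool]:
    """Mark a fraction r_test of the samples as modal-incomplete."""
    if not 0.0 <= r_test <= 1.0:
        raise ValueError("r_test must be in [0, 1]")
    num_missing = round(r_test * num_samples)
    rng = random.Random(seed)
    missing = set(rng.sample(range(num_samples), num_missing))
    return [i in missing for i in range(num_samples)]


def build_input(
    video_tokens: Sequence[Token],
    audio_tokens: Optional[Sequence[Token]],
    mmt: Token,
    num_audio_tokens: int,
) -> List[Token]:
    """Concatenate video and audio tokens into one input sequence.

    Assumption: when audio is missing (None), copies of the placeholder
    token `mmt` fill the audio positions, so the sequence length stays fixed.
    """
    if audio_tokens is None:
        audio_part = [list(mmt) for _ in range(num_audio_tokens)]
    else:
        audio_part = [list(t) for t in audio_tokens]
    return [list(t) for t in video_tokens] + audio_part


# Half of a 10-sample test set is marked as missing its audio stream.
mask = make_missing_mask(num_samples=10, r_test=0.5)

# A sample without audio gets placeholder tokens in the audio positions.
sequence = build_input(
    video_tokens=[[1.0, 2.0]],
    audio_tokens=None,
    mmt=[0.0, 0.0],
    num_audio_tokens=2,
)
```

Keeping the sequence length constant means the downstream model sees the same input shape whether or not the modality is present, which is one plausible reason a placeholder token is convenient in transformer-based models.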
Statistics
"Our method mitigates the performance loss, reducing it from its original ∼30% drop to only ∼10% when half of the test set is modal-incomplete."
"When all test inputs are modal-incomplete (r_test = 100%), we surpass unimodal performance (purple) by 5 points in Epic-Kitchens, and double the baseline performance in Ego4D-AR."
Quotes
"Multimodal video understanding has been the de facto approach for analyzing egocentric videos. Recent works have shown that the complementary multisensory signals in egocentric videos are superior for understanding actions [24–26, 33, 37] and localizing moments [2, 41, 43, 47]."
"Still, the current effort to study the impact of missing modalities in egocentric datasets remains rather limited. Most methods presume all modal inputs to be intact during training and inference."