Core Concepts
This work introduces a Hierarchical Augmentation and Distillation (HAD) framework for Class Incremental Audio-Visual Video Recognition (CIAVVR) that preserves knowledge of previously learned classes while new classes are learned.
Summary
The paper addresses Class Incremental Audio-Visual Video Recognition (CIAVVR), where a model must learn new audio-visual video classes without forgetting the classes it has already learned.
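The class-incremental setting can be illustrated with a minimal sketch. The helper names `split_into_steps` and `seen_classes`, and the choice of 7 classes per step, are assumptions made for illustration and are not taken from the paper; the 28 classes match the AVE dataset size reported in the statistics.

```python
def split_into_steps(class_ids, classes_per_step):
    """Partition class ids into consecutive incremental steps."""
    return [class_ids[i:i + classes_per_step]
            for i in range(0, len(class_ids), classes_per_step)]


def seen_classes(steps, t):
    """Classes the model must still recognise after finishing step t (0-based)."""
    return [c for step in steps[:t + 1] for c in step]


# 28 classes as in AVE; 7 classes per step is an assumed setting.
steps = split_into_steps(list(range(28)), 7)
for t in range(len(steps)):
    print(f"step {t}: trains on {len(steps[t])} new classes, "
          f"is evaluated on {len(seen_classes(steps, t))} classes")
```

At each step only the new classes are trained on, while evaluation covers every class seen so far, which is what makes forgetting measurable.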
The key challenge in CIAVVR is capturing the hierarchical structure of both the model and the data, so that model knowledge and data knowledge are each preserved. The authors propose the Hierarchical Augmentation and Distillation (HAD) framework to address this:
Hierarchical Augmentation Module (HAM):
Employs a novel segmental feature augmentation strategy that augments low-level and high-level features to strengthen the preservation of model knowledge.
Keeps the two levels of augmentation from interacting, so that errors do not accumulate across levels.
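The idea of augmenting the two feature levels without letting them interact can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `low_level` and `high_level` are hypothetical stand-ins for the lower and upper parts of a network, and additive Gaussian noise is an assumed form of augmentation.

```python
import random


def low_level(x):
    # Hypothetical stand-in for the lower layers: element-wise scaling.
    return [2.0 * v for v in x]


def high_level(h):
    # Hypothetical stand-in for the upper layers: mean of adjacent pairs.
    return [(h[i] + h[i + 1]) / 2.0 for i in range(len(h) - 1)]


def perturb(feat, sigma, rng):
    # Assumed augmentation: additive Gaussian noise with std sigma.
    return [v + rng.gauss(0.0, sigma) for v in feat]


def segmental_augment(x, sigma, rng):
    clean_low = low_level(x)
    clean_high = high_level(clean_low)
    # Low-level branch: perturb low-level features, then apply the
    # upper layers without any further perturbation.
    low_branch = high_level(perturb(clean_low, sigma, rng))
    # High-level branch: perturb high-level features computed from the
    # clean low-level features, so noise from the two levels never stacks.
    high_branch = perturb(clean_high, sigma, rng)
    return clean_high, low_branch, high_branch


rng = random.Random(0)
clean, low_aug, high_aug = segmental_augment([1.0, 2.0, 3.0], 0.1, rng)
```

The design point is that each augmented branch contains noise injected at exactly one level; a variant that perturbed the high-level features of an already perturbed low-level input would compound the two.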
Hierarchical Distillation Module (HDM):
Introduces hierarchical logical distillation (video-distribution) and hierarchical correlative distillation (snippet-video) to capture hierarchical intra-sample and inter-sample knowledge, respectively.
Video-distribution logical distillation distills the logit-level class probabilities of each video and of videos sampled from the video distribution.
Snippet-video correlative distillation focuses on distilling feature similarities between different snippets and videos.
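The two distillation terms can be sketched generically as follows. This is a minimal sketch under assumptions, not the paper's exact losses: the logit term is written as a temperature-softened KL divergence between old-model and new-model predictions, the correlative term as a mean squared difference between snippet-to-video cosine similarity matrices, and the sampling of videos from the video distribution is omitted. All function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_distillation(old_logits, new_logits, temperature=2.0):
    # KL(old || new) between softened class probabilities.
    p = softmax(old_logits, temperature)
    q = softmax(new_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(snippets, videos):
    # Rows index snippet features, columns index video features.
    return [[cosine(s, v) for v in videos] for s in snippets]


def correlative_distillation(old_snips, old_vids, new_snips, new_vids):
    # Mean squared difference between old and new similarity matrices.
    old_sim = similarity_matrix(old_snips, old_vids)
    new_sim = similarity_matrix(new_snips, new_vids)
    diffs = [(o - n) ** 2
             for row_o, row_n in zip(old_sim, new_sim)
             for o, n in zip(row_o, row_n)]
    return sum(diffs) / len(diffs)
```

Both terms are zero when the new model reproduces the old model's outputs and grow as the new model drifts, which is the behaviour a forgetting penalty needs.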
The authors also provide a theoretical analysis to support the necessity of the segmental feature augmentation strategy in HAM.
Evaluations on four benchmarks (AVE, AVK-100, AVK-200, and AVK-400) show that the HAD framework preserves historical class knowledge more effectively and outperforms state-of-the-art methods.
Statistics
The AVE dataset contains 4,143 videos across 28 categories.
The AVK-100 dataset contains 59,770 videos from 100 categories.
The AVK-200 dataset contains 114,000 videos from 200 categories.
The AVK-400 dataset contains 234,427 videos from 400 categories.