Key Concepts
EventDance++ tackles the challenging problem of cross-modal (image-to-events) adaptation for event-based recognition without accessing any labeled source image data. It leverages language-guided reconstruction-based modality bridging and multi-representation knowledge adaptation to effectively bridge the modality gap and transfer knowledge from image to event domains.
Abstract
The paper addresses the problem of cross-modal (image-to-events) adaptation for event-based object recognition without access to any labeled source image data. This task is challenging due to the substantial modality gap between images and events.
The key contributions are:
- EventDance++, a novel framework that leverages language-guided reconstruction-based modality bridging (L-RMB) and multi-representation knowledge adaptation (MKA) modules to bridge the modality gap and transfer knowledge effectively between images and events.
- The L-RMB module reconstructs intensity frames from events in a self-supervised manner, utilizing a vision-language model to provide additional supervision and enrich the surrogate images for better knowledge extraction from the source model.
- The MKA module employs multiple event representations (e.g., voxel grids, event stack images, event spike tensors) to fully capture the spatiotemporal characteristics of events and facilitate consistent knowledge transfer to target models.
- Extensive experiments on three event-based benchmarks (N-Caltech101, N-MNIST, CIFAR10-DVS) demonstrate the superiority of EventDance++ over existing source-free domain adaptation methods in this challenging cross-modal task.
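The multiple event representations used by the MKA module can be sketched in NumPy. The helper names, tensor shapes, and channel layout below are illustrative assumptions (the paper's actual preprocessing is not reproduced here); only the voxel-grid and event-stack-image conversions are shown:

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate event polarities into a (num_bins, H, W) voxel grid.

    `events` is an (N, 4) array of (x, y, t, p) rows. Illustrative
    sketch only -- not the paper's exact representation code.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    t = events[:, 2]
    # Normalize timestamps into [0, num_bins) and assign each event a bin.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1e-6)
    bins = t_norm.astype(int)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    p = np.where(events[:, 3] > 0, 1.0, -1.0)
    np.add.at(grid, (bins, y, x), p)
    return grid

def events_to_stack_image(events, height, width):
    """Two-channel count image: per-pixel counts of positive/negative events."""
    img = np.zeros((2, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    ch = (events[:, 3] <= 0).astype(int)  # channel 0: positive, 1: negative
    np.add.at(img, (ch, y, x), 1.0)
    return img

# Tiny example: four events on a 4x4 sensor.
ev = np.array([[0, 0, 0.00, 1],
               [1, 1, 0.30, -1],
               [2, 2, 0.60, 1],
               [3, 3, 0.90, 1]])
voxels = events_to_voxel_grid(ev, num_bins=3, height=4, width=4)
stack = events_to_stack_image(ev, height=4, width=4)
print(voxels.shape, stack.shape)  # (3, 4, 4) (2, 4, 4)
```

Each representation exposes a different view of the same event stream (temporal binning vs. polarity counting), which is what lets MKA enforce cross-representation consistency.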
Statistics
The entropy of the source model's predictions on the reconstructed anchor data is minimized to ensure the surrogate images are optimized for effective knowledge extraction.
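The entropy objective can be sketched as follows; this is a minimal illustration of entropy minimization over softmax predictions, not the paper's exact loss implementation:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy_loss(logits):
    """Mean Shannon entropy of the softmax predictions.

    Minimizing this pushes the source model toward confident
    predictions on the reconstructed surrogate images (sketch only).
    """
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())

confident = np.array([[10.0, 0.0, 0.0]])  # near one-hot prediction
uncertain = np.array([[1.0, 1.0, 1.0]])   # uniform prediction
print(entropy_loss(confident) < entropy_loss(uncertain))  # True
```

Lower entropy means the surrogate images elicit confident source-model predictions, which is the signal used to optimize them.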
Knowledge distillation is performed at both feature and prediction levels using the CLIP model's text and visual encoders to enhance the source model's capabilities.
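Two-level distillation of this kind is commonly implemented as a cosine-similarity loss on features plus a KL divergence on softened predictions. The sketch below assumes that form; the function names, temperature value, and exact loss weighting are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def feature_distill_loss(student_feat, clip_feat):
    """Feature level: 1 - cosine similarity with CLIP visual features."""
    s = l2_normalize(student_feat)
    t = l2_normalize(clip_feat)
    return float((1.0 - (s * t).sum(axis=-1)).mean())

def prediction_distill_loss(student_logits, clip_logits, tau=2.0):
    """Prediction level: KL divergence to CLIP's temperature-softened logits."""
    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)
    p_t = softmax(clip_logits / tau)
    p_s = softmax(student_logits / tau)
    return float((p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1).mean())

# Both losses vanish when the student matches the CLIP teacher exactly.
feat = np.array([[1.0, 0.0], [0.0, 1.0]])
logits = np.array([[2.0, -1.0], [-1.0, 2.0]])
print(feature_distill_loss(feat, feat), prediction_distill_loss(logits, logits))
```

In the paper's setting the teacher signals would come from CLIP's visual encoder (features) and text-encoder class embeddings (zero-shot logits); here plain arrays stand in for both.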
Temporal consistency loss is applied to the remaining reconstructed data to ensure prediction consistency among the surrogate images.
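A temporal consistency term of this kind can be sketched as a penalty on prediction changes between consecutive surrogate frames; the squared-difference form below is an assumption for illustration, not necessarily the paper's exact formulation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def temporal_consistency_loss(logits_seq):
    """Mean squared difference between softmax predictions of
    consecutive frames in a (T, num_classes) logit sequence."""
    probs = softmax(logits_seq)
    return float(((probs[1:] - probs[:-1]) ** 2).sum(axis=-1).mean())

# Stable predictions over time incur a small loss; flickering ones do not.
stable = np.array([[5.0, 0.0], [5.1, 0.0], [4.9, 0.0]])
jumpy  = np.array([[5.0, 0.0], [0.0, 5.0], [5.0, 0.0]])
print(temporal_consistency_loss(stable) < temporal_consistency_loss(jumpy))  # True
```

The effect is that surrogate images reconstructed from the same event stream are pushed toward agreeing predictions.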
Quotes
"Inspired by the natural ability of language to convey semantics across different modalities, we propose EventDance++, a novel framework that tackles this unsupervised source-free cross-modal adaptation problem from a language-guided perspective."
"We introduce a language-guided reconstruction-based modality bridging (L-RMB) module, which reconstructs intensity frames from events in a self-supervised manner. Importantly, it leverages a vision-language model to provide further supervision, enriching the surrogate images and enhancing modality bridging."
"We propose a multi-representation knowledge adaptation (MKA) module to transfer knowledge to target models, utilizing multiple event representations to capture the spatiotemporal characteristics of events fully."