Cross-Modal Adaptation of Vision-Language Models for Improved Egocentric Action Recognition
X-MIC is a simple yet effective cross-modal adaptation framework that injects egocentric video-specific knowledge into the frozen vision-language embedding space, leading to significant improvements in fine-grained cross-dataset recognition of nouns and verbs.
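A minimal sketch of the idea described above, assuming a PyTorch setup: a frozen, pretrained vision-language encoder pair produces embeddings in a shared space, and a small trainable adapter maps the video representation into that space, where it conditions each class text embedding before cosine-similarity classification. All module names, dimensions, and the adapter design below are illustrative assumptions, not the authors' actual X-MIC implementation.

```python
# Illustrative sketch only: it shows the general cross-modal adaptation idea
# (a trainable adapter injecting video-specific knowledge into a frozen
# vision-language embedding space), not the exact X-MIC architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenEncoder(nn.Module):
    """Stand-in for a pretrained, frozen vision-language encoder (e.g. CLIP)."""
    def __init__(self, in_dim, embed_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)
        for p in self.parameters():
            p.requires_grad = False  # kept frozen throughout training

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

class VideoAdapter(nn.Module):
    """Small trainable module mapping a pooled video representation into the
    frozen embedding space; this is the only part that receives gradients."""
    def __init__(self, embed_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, frame_embeds):
        # frame_embeds: (batch, num_frames, embed_dim); average-pool over time
        video_embed = frame_embeds.mean(dim=1)
        return self.net(video_embed)

class CrossModalAdaptation(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        self.image_encoder = FrozenEncoder(in_dim=768, embed_dim=embed_dim)
        self.text_encoder = FrozenEncoder(in_dim=512, embed_dim=embed_dim)
        self.adapter = VideoAdapter(embed_dim)

    def forward(self, video_frames, class_text_feats):
        # video_frames: (B, T, 768) raw per-frame features
        # class_text_feats: (C, 512) raw text features for noun/verb class names
        frame_embeds = self.image_encoder(video_frames)            # (B, T, D)
        video_bias = self.adapter(frame_embeds)                    # (B, D)
        text_embeds = self.text_encoder(class_text_feats)          # (C, D)
        # Inject video-specific knowledge into the frozen text embeddings
        conditioned_text = F.normalize(
            text_embeds.unsqueeze(0) + video_bias.unsqueeze(1), dim=-1
        )                                                           # (B, C, D)
        video_embed = F.normalize(frame_embeds.mean(dim=1), dim=-1)  # (B, D)
        # Cosine similarity between video and video-conditioned class embeddings
        logits = torch.einsum("bd,bcd->bc", video_embed, conditioned_text)
        return logits

if __name__ == "__main__":
    model = CrossModalAdaptation()
    frames = torch.randn(2, 8, 768)     # 2 clips, 8 frames each
    texts = torch.randn(10, 512)        # 10 candidate noun/verb classes
    print(model(frames, texts).shape)   # torch.Size([2, 10])
```

Because the vision-language backbone stays frozen, only the lightweight adapter is trained on egocentric data; keeping classification in the original embedding space is what allows the conditioned class embeddings to transfer across datasets.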