Maintaining the 3D locations of active objects, even when they are out of sight, is critical for correctly locating them over both short and long time scales in egocentric video.
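To make the idea concrete, here is a minimal sketch of an out-of-sight 3D object memory, assuming per-frame camera poses and camera-frame 3D detections are available; the class and method names (`ObjectMemory`, `update`, `locate`) are hypothetical and not the method's actual API.

```python
# A minimal sketch of a persistent 3D object memory, assuming known camera
# poses (4x4 matrices) and camera-frame 3D detections. All names here are
# illustrative, not the paper's implementation.
import numpy as np

class ObjectMemory:
    """Keeps the last observed world-frame 3D position of every tracked object."""

    def __init__(self):
        self.positions = {}  # track_id -> (3,) world-frame position

    def update(self, track_id, cam_xyz, cam_to_world):
        # Lift the camera-frame detection into a persistent world frame,
        # so the estimate survives after the object leaves the view.
        xyz_h = np.append(cam_xyz, 1.0)                 # homogeneous coords
        self.positions[track_id] = (cam_to_world @ xyz_h)[:3]

    def locate(self, track_id, world_to_cam):
        # Re-project the stored world position into the current camera frame,
        # even if the object has not been visible for many frames.
        world_xyz = self.positions.get(track_id)
        if world_xyz is None:
            return None
        xyz_h = np.append(world_xyz, 1.0)
        return (world_to_cam @ xyz_h)[:3]

# Usage: update on sight, query later from any camera pose.
mem = ObjectMemory()
cam_to_world = np.eye(4)                                # identity pose for the example
mem.update("mug", np.array([0.2, -0.1, 0.8]), cam_to_world)
print(mem.locate("mug", np.linalg.inv(cam_to_world)))   # [ 0.2 -0.1  0.8]
```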
X-MIC, a simple yet effective cross-modal adaptation framework, injects egocentric video-specific knowledge into the frozen vision-language embedding space, leading to significant improvements in fine-grained cross-dataset recognition of nouns and verbs.
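The sketch below illustrates the general idea of cross-modal adaptation over a frozen vision-language model: a small trainable adapter maps pooled video features into the shared embedding space and conditions the frozen class-text embeddings on each video. The MLP adapter and the additive fusion are illustrative assumptions, not X-MIC's exact architecture.

```python
# A minimal sketch of cross-modal adaptation over frozen VL embeddings.
# Frame and text features stand in for frozen CLIP outputs; only the
# adapter would be trained. The design choices here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoAdapter(nn.Module):
    """Maps pooled frozen video features into the shared VL embedding space."""

    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, frame_feats):            # (B, T, D) frozen per-frame features
        video = frame_feats.mean(dim=1)        # temporal average pooling
        return self.mlp(video)                 # (B, D) video-conditioned offset

def classify(frame_feats, text_embeds, adapter):
    # Condition every frozen class-text embedding on the current video,
    # then score by cosine similarity in the shared space.
    offset = adapter(frame_feats)                                       # (B, D)
    adapted = F.normalize(text_embeds[None] + offset[:, None], dim=-1)  # (B, C, D)
    video = F.normalize(frame_feats.mean(dim=1), dim=-1)                # (B, D)
    return torch.einsum("bd,bcd->bc", video, adapted)                   # (B, C) logits

# Usage with dummy frozen features: 2 videos, 8 frames, 10 classes, dim 512.
adapter = VideoAdapter()
logits = classify(torch.randn(2, 8, 512), torch.randn(10, 512), adapter)
print(logits.shape)  # torch.Size([2, 10])
```

Keeping both encoders frozen and training only the adapter is what makes this style of adaptation cheap and well suited to cross-dataset transfer: the shared embedding geometry is preserved while egocentric-specific knowledge enters through the learned offset.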