To address the semantic gap between web-scale descriptive narratives and concise action category names, we propose to disentangle category names into Spatio-Temporal Descriptors using large language models. We further introduce an Optimal Descriptor Solver that adaptively aligns frame-level representations with the refined textual knowledge, enabling generalizable video recognition.
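Adaptively aligning frame-level features with a set of textual descriptors can be cast as an entropic optimal-transport problem. The sketch below illustrates that idea with plain Sinkhorn iterations between toy frame and descriptor embeddings; the cosine cost, uniform marginals, regularization weight `eps`, and all shapes are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=500):
    """Entropic-regularized optimal transport with uniform marginals.
    cost: (n_frames, n_descriptors) pairwise cost matrix."""
    n, m = cost.shape
    K = np.exp(-cost / eps)                # Gibbs kernel
    a, b = np.ones(n) / n, np.ones(m) / m  # uniform marginals (assumed)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    # Transport plan: how much each frame "spends" on each descriptor.
    return u[:, None] * K * v[None, :]

# Toy example: 4 frame embeddings vs. 3 descriptor embeddings.
rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8))
descs = rng.normal(size=(3, 8))
frames /= np.linalg.norm(frames, axis=1, keepdims=True)
descs /= np.linalg.norm(descs, axis=1, keepdims=True)
cost = 1.0 - frames @ descs.T              # cost = 1 - cosine similarity
plan = sinkhorn(cost)
```

The resulting plan's row and column sums match the chosen marginals, so each frame distributes a fixed budget of attention over the descriptors rather than committing to a single one.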
We propose the "View while Moving" paradigm for efficient recognition of long untrimmed videos, which accesses raw frames only once during inference and achieves a better trade-off between accuracy and efficiency.
Foundation models with rich knowledge can boost open-world video recognition through a generic knowledge transfer pipeline named PCA.
Hue variance benefits video recognition by encouraging models to prioritize motion patterns over static appearance.