Optimal Spatio-Temporal Descriptor for Generalizable Video Recognition
To bridge the semantic gap between web-scale descriptive narratives and concise action category names, we propose disentangling category names into Spatio-Temporal Descriptors using large language models. We further introduce an Optimal Descriptor Solver that adaptively aligns frame-level representations with the refined textual knowledge, enabling generalizable video recognition.
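
The alignment step can be illustrated with a minimal sketch. The abstract does not specify how the solver is formulated, so the code below assumes one plausible reading: frame-descriptor matching posed as an entropy-regularized optimal transport problem with uniform marginals, solved by Sinkhorn iterations. The function names (`sinkhorn_plan`, `video_descriptor_score`), the cosine-based cost, and the values of `eps` and `n_iters` are all hypothetical choices for illustration, not details taken from the paper.

```python
import numpy as np


def l2_normalize(x, axis=-1, tiny=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + tiny)


def sinkhorn_plan(cost, eps=0.1, n_iters=100):
    """Entropic optimal transport plan between uniform marginals.

    Assumption: frames and descriptors each carry equal total mass.
    Returns an (n_frames, n_descriptors) matrix whose entries sum to 1.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)      # uniform mass over frames
    b = np.full(m, 1.0 / m)      # uniform mass over descriptors
    K = np.exp(-cost / eps)      # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)          # rescale rows toward marginal a
        v = b / (K.T @ u)        # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]


def video_descriptor_score(frame_feats, desc_feats, eps=0.1, n_iters=100):
    """Transport-weighted similarity between one video and one category.

    frame_feats: (n_frames, dim) frame-level embeddings.
    desc_feats:  (n_descriptors, dim) text embeddings of the descriptors.
    """
    f = l2_normalize(frame_feats)
    d = l2_normalize(desc_feats)
    sim = f @ d.T                            # cosine similarities
    plan = sinkhorn_plan(1.0 - sim, eps, n_iters)
    return float((plan * sim).sum())
```

Under these assumptions, a video would be classified by computing `video_descriptor_score` against the descriptor set of each candidate category and picking the highest score. The transport plan lets each frame weight the descriptors it matches best, rather than averaging all frames against a single category embedding.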