The paper proposes a novel pipeline called Optimal Spatio-Temporal Descriptor (OST) for video recognition. The key insights are:
The semantic space of video category names is less distinct than that of image datasets, which may hinder video recognition performance.
To address this, the authors disentangle category names into Spatio-Temporal Descriptors using large language models. Spatio Descriptors capture static visual cues, while Temporal Descriptors describe the temporal evolution of actions.
To fully refine the textual knowledge, the authors introduce the Optimal Descriptor Solver. It formulates video-text matching as an optimal transport problem, adaptively aligning frame-level representations with the generated descriptors.
Comprehensive evaluations on six benchmarks demonstrate the effectiveness of the proposed OST pipeline. It achieves state-of-the-art performance in zero-shot, few-shot, and fully-supervised video recognition settings.
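The optimal transport view of video-text matching described above can be illustrated with a minimal sketch. It assumes entropic regularization solved by Sinkhorn iterations, uniform marginals over frames and descriptors, and a cost of one minus cosine similarity; these choices, and the names `sinkhorn` and `ot_score`, are illustrative assumptions and not necessarily the authors' exact solver or settings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropic-regularized transport plan between marginals a (rows) and b (columns)."""
    n, m = len(a), len(b)
    # Gibbs kernel: low cost -> large kernel value.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def ot_score(frames, descriptors, eps=0.1):
    """Match frame features to one category's descriptor features.

    Returns (score, plan): score is the plan-weighted similarity, and
    plan[i][j] is the mass moved from frame i to descriptor j.
    Uniform marginals are an assumption of this sketch.
    """
    n, m = len(frames), len(descriptors)
    sim = [[cosine(f, d) for d in descriptors] for f in frames]
    cost = [[1.0 - s for s in row] for row in sim]
    plan = sinkhorn(cost, [1.0 / n] * n, [1.0 / m] * m, eps)
    score = sum(plan[i][j] * sim[i][j] for i in range(n) for j in range(m))
    return score, plan

# Toy 2-D features (hypothetical values, not real embeddings).
frames = [[1.0, 0.0], [0.0, 1.0]]
matching = [[1.0, 0.0], [0.0, 1.0]]       # descriptors aligned with the frames
mismatched = [[-1.0, 0.0], [0.0, -1.0]]   # descriptors opposed to the frames
print(ot_score(frames, matching)[0], ot_score(frames, mismatched)[0])
```

In this sketch a category would be scored by running `ot_score` against each category's descriptor set and taking the highest score; the transport plan shows which frames were aligned with which descriptors.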
Source: arxiv.org