The paper proposes a novel pipeline called Optimal Spatio-Temporal Descriptor (OST) for video recognition. The key insights are:
The semantic space of video category names is less distinct than that of image datasets, which may hinder video recognition performance.
To address this, the authors disentangle category names into Spatio-Temporal Descriptors using large language models. Spatio Descriptors capture static visual cues, while Temporal Descriptors describe the temporal evolution of actions.
To fully refine the textual knowledge, the authors introduce the Optimal Descriptor Solver, which formulates video-text matching as an optimal transport problem and adaptively aligns frame-level representations with the generated descriptors.
Comprehensive evaluations on six benchmarks demonstrate the effectiveness of the proposed OST pipeline. It achieves state-of-the-art performance in zero-shot, few-shot, and fully-supervised video recognition settings.
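To make the optimal transport view of video-text matching more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the cosine-based cost, the uniform marginals, the entropy-regularised Sinkhorn iterations, and the names `sinkhorn` and `ot_video_text_score` are all assumptions chosen for illustration, and the random arrays stand in for real frame and descriptor embeddings.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropy-regularised optimal transport (Sinkhorn iterations).

    cost: (T, N) cost matrix; a: (T,) and b: (N,) marginals that each sum to 1.
    Returns a (T, N) transport plan.
    """
    K = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # rescale columns towards b
        u = a / (K @ v)                  # rescale rows towards a
    return u[:, None] * K * v[None, :]

def ot_video_text_score(frames, descriptors, eps=0.1, n_iters=200):
    """Score one video against one category's descriptors (hypothetical helper).

    frames: (T, D) frame-level features; descriptors: (N, D) text features.
    Assumption: cost = 1 - cosine similarity, with uniform marginals.
    """
    f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    d = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    sim = f @ d.T                        # (T, N) cosine similarities
    T, N = sim.shape
    a = np.full(T, 1.0 / T)              # each frame carries equal mass
    b = np.full(N, 1.0 / N)              # each descriptor receives equal mass
    plan = sinkhorn(1.0 - sim, a, b, eps, n_iters)
    # Similarity weighted by the transport plan serves as the matching score.
    return float((plan * sim).sum()), plan

# Usage with random stand-in embeddings: 4 frames, 3 descriptors, 8 dimensions.
rng = np.random.default_rng(0)
score, plan = ot_video_text_score(rng.normal(size=(4, 8)), rng.normal(size=(3, 8)))
print(score, plan.shape)
```

In a recognition setting, a score like this would be computed once per category, and the category with the highest score would be predicted. The plan itself shows which descriptors each frame is matched to, which is what makes the alignment adaptive rather than a plain average over descriptors.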
Key insights drawn from the paper by Tongjia Chen... at arxiv.org, 03-29-2024
https://arxiv.org/pdf/2312.00096.pdf