Core Concept
The authors explore the limitations of CLIP-based video learners in recognizing actions in unseen test domains and propose a novel scene-aware video-text alignment method to address these challenges.
Summary
The content discusses adapting CLIP to video data for action recognition, highlighting the need for models to generalize effectively to unseen test domains. The XOV-Action benchmark is introduced to evaluate state-of-the-art CLIP-based video learners, revealing that they perform poorly when recognizing actions in unfamiliar domains. A scene-aware video-text alignment method is proposed to mitigate scene bias and improve cross-domain open-vocabulary action recognition, and experimental results demonstrate its effectiveness.
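For orientation, below is a minimal sketch of the generic CLIP-based open-vocabulary pipeline that such video learners build on: frames are encoded with the CLIP image encoder, pooled over time, and matched against text embeddings of action names. It assumes the openai `clip` package and PyTorch; the prompt template, category names, and mean pooling are illustrative choices, and the paper's temporal modeling and scene-aware alignment add components beyond this sketch.

```python
# Generic CLIP-based open-vocabulary video action recognition (sketch only;
# the paper's scene-aware alignment builds additional components on top).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative open-vocabulary category names and prompt template.
action_names = ["archery", "brushing teeth", "playing guitar"]
prompts = clip.tokenize([f"a video of a person {name}" for name in action_names]).to(device)

def classify_video(frames):
    """frames: list of PIL images sampled from one video."""
    with torch.no_grad():
        # Encode each frame with the CLIP image encoder, then mean-pool over time.
        frame_feats = torch.stack([
            model.encode_image(preprocess(f).unsqueeze(0).to(device))
            for f in frames
        ]).mean(dim=0)
        text_feats = model.encode_text(prompts)
        # Cosine similarity between the pooled video feature and each action prompt.
        frame_feats = frame_feats / frame_feats.norm(dim=-1, keepdim=True)
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
        logits = frame_feats @ text_feats.T
    return action_names[logits.argmax().item()]
```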
Statistics
Kinetics400 dataset: 400 action categories, with videos collected from YouTube.
Evaluation metrics: closed-set accuracy, open-set accuracy, and overall accuracy.
Model architectures: CLIP ViT-B/32 and ViT-B/16 backbones with temporal modeling.
Loss coefficients: λ_dis = 0.2, λ_con = 0.2.
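As a rough illustration of how the reported coefficients might enter training, the sketch below combines a main video-text alignment loss with two auxiliary terms weighted by λ_dis and λ_con. The term names and structure are assumptions for illustration only; they do not reproduce the paper's actual loss definitions.

```python
# Hedged sketch: weighting auxiliary loss terms with the reported coefficients.
# loss_align, loss_dis, and loss_con are placeholders for the main video-text
# alignment loss and two auxiliary terms; their definitions are assumed here.
LAMBDA_DIS = 0.2  # coefficient reported for the "dis" term
LAMBDA_CON = 0.2  # coefficient reported for the "con" term

def total_loss(loss_align, loss_dis, loss_con):
    """Combine the main alignment loss with weighted auxiliary terms."""
    return loss_align + LAMBDA_DIS * loss_dis + LAMBDA_CON * loss_con
```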
Quotes
"Can CLIP-based video learners effectively generalize to unseen test domains?" - Authors
"Our evaluation reveals that previous methods exhibit limited performance when recognizing actions in unseen test domains." - Authors
"Our contributions include establishing a benchmark named XOV-Action and proposing a novel scene-aware video-text alignment method." - Authors