The content discusses adapting CLIP to video data for action recognition, highlighting the need for models to generalize to unseen domains. The XOV-Action benchmark is introduced to evaluate state-of-the-art CLIP-based video learners, revealing their limited ability to recognize actions in unfamiliar domains. A novel Scene-Aware video-text alignment method is proposed to mitigate scene bias and improve cross-domain open-vocabulary action recognition. Experimental results demonstrate the effectiveness of the proposed method, emphasizing both the challenges and potential solutions in this field.
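To make the setup concrete, below is a minimal sketch of how CLIP-style open-vocabulary action recognition typically scores a video against free-form class prompts: per-frame visual embeddings are mean-pooled over time and compared by cosine similarity to text embeddings of action names. This is an illustration of the general recipe, not the paper's Scene-Aware method; the function name and the random stand-in embeddings (in place of real CLIP image/text encoders) are assumptions for the sketch.

```python
import numpy as np

def zero_shot_action_scores(frame_feats, text_feats):
    """Score a video against open-vocabulary class prompts, CLIP-style.

    frame_feats: (T, D) per-frame visual embeddings (stand-in for a CLIP
                 image encoder applied to T sampled frames).
    text_feats:  (C, D) embeddings of C class-name prompts, e.g.
                 "a video of a person {action}" (stand-in for CLIP's
                 text encoder).
    Returns a (C,) array of cosine similarities between the pooled
    video feature and each class prompt.
    """
    video = frame_feats.mean(axis=0)                 # temporal mean pooling
    video = video / np.linalg.norm(video)            # L2-normalize video side
    text = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return text @ video                              # cosine similarity per class

# Toy usage with random stand-in embeddings (T=8 frames, D=512, C=3 classes).
rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 512))
prompts = rng.standard_normal((3, 512))
scores = zero_shot_action_scores(frames, prompts)
pred = int(scores.argmax())  # index of the predicted action class
```

Because classes are represented only by text prompts, new action categories can be added at inference time without retraining, which is what makes the recognition "open-vocabulary"; the cross-domain difficulty the benchmark probes is that these pooled features can latch onto scene context rather than the action itself.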
Key insights from the source content by Kun-Yu Lin, H... on arxiv.org, 03-05-2024.
https://arxiv.org/pdf/2403.01560.pdf