Challenges in CLIP-based Video Learners for Cross-Domain Open-Vocabulary Action Recognition
The author explores the limitations of CLIP-based video learners in recognizing actions across different domains and proposes a novel scene-aware video-text alignment method to address these limitations.