Video-Conditioned Text Representations for Improved Activity Recognition
When adapting image-text models to the video domain, conditioning the text representations on the video can be more effective than enhancing the visual embeddings alone, and it enables better generalization to complex activity recognition tasks.
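
As a rough illustration of what "video-conditioned text representations" could mean in practice, the sketch below lets each class text embedding attend over per-frame visual embeddings and adds the attended context back to the text embedding before classification. This is a minimal sketch under stated assumptions, not the architecture described by the source: the function names `video_conditioned_text` and `classify`, the mixing weight `alpha`, the single-head attention without learned projections, and the mean-pooled video embedding are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def video_conditioned_text(text_emb, frame_emb, alpha=0.5):
    """Condition class text embeddings on a video (illustrative assumption).

    text_emb:  (C, D) array, one embedding per class prompt.
    frame_emb: (T, D) array, one embedding per video frame.
    alpha:     hypothetical weight on the video-derived context.
    Returns a (C, D) array of unit-norm, video-conditioned text embeddings.
    """
    d = text_emb.shape[-1]
    # Each text embedding (query) attends over the frames (keys and values).
    attn = softmax(text_emb @ frame_emb.T / np.sqrt(d), axis=-1)  # (C, T)
    context = attn @ frame_emb                                    # (C, D)
    # Residual update: the original text embedding plus video context.
    return l2_normalize(text_emb + alpha * context)


def classify(text_emb, frame_emb, alpha=0.5):
    """Score a video against video-conditioned class embeddings."""
    video_emb = l2_normalize(frame_emb.mean(axis=0))               # (D,)
    cond_text = video_conditioned_text(text_emb, frame_emb, alpha)  # (C, D)
    logits = cond_text @ video_emb                                  # (C,)
    return softmax(logits)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = l2_normalize(rng.normal(size=(3, 8)))     # 3 classes, dim 8
    frames = l2_normalize(rng.normal(size=(5, 8)))   # 5 frames, dim 8
    print(classify(text, frames))
```

Setting `alpha=0.0` recovers the unconditioned text embeddings, which makes it easy to compare against a baseline that only changes the visual side.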