Core Concepts
Integrating Visual Language Models with Vision Transformers enhances video action understanding by aligning the ViTs' spatio-temporal representations with language-derived cues from the VLMs.
Summary
The paper introduces the Four-Tiered Prompts (FTP) framework, which combines Vision Transformers (ViTs) and Visual Language Models (VLMs) to improve video action understanding. The four prompt tiers target complementary aspects of a video: the action category, its components, a description, and contextual information. By aligning the ViT's visual encodings with the corresponding VLM outputs during training, the framework produces richer representations and reaches state-of-the-art performance across multiple datasets. The integration is realized through feature processors and classification layers that improve the ViT's generalization ability, as sketched below.
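To make the training-time alignment concrete, here is a minimal sketch rather than the authors' released code. It assumes a video ViT backbone that returns a clip-level feature, four linear feature processors (one per prompt tier), a frozen VLM that supplies one text embedding per tier, and a cosine-similarity alignment loss; the names (FTPSketch), dimensions, and the specific loss are illustrative assumptions, and the actual FTP architecture may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FTPSketch(nn.Module):
    """Hypothetical sketch of the Four-Tiered Prompts idea: a ViT backbone feeds
    four feature processors, each aligned during training with VLM text embeddings
    for one prompt tier (action category, components, description, context)."""

    def __init__(self, vit_backbone, vit_dim=768, vlm_dim=512, num_classes=400):
        super().__init__()
        self.backbone = vit_backbone  # any video ViT returning a clip-level feature (B, vit_dim)
        # One feature processor per prompt tier, projecting ViT features
        # into the VLM embedding space (linear projection is an assumption).
        self.processors = nn.ModuleList(
            [nn.Linear(vit_dim, vlm_dim) for _ in range(4)]
        )
        # Classification layer over the integrated (concatenated) tier-wise features.
        self.classifier = nn.Linear(4 * vlm_dim, num_classes)

    def forward(self, video, vlm_text_embeds=None):
        """video: (B, C, T, H, W); vlm_text_embeds: optional (B, 4, vlm_dim)
        text embeddings from a frozen VLM, one per prompt tier (training only)."""
        feat = self.backbone(video)                              # (B, vit_dim)
        tier_feats = [proc(feat) for proc in self.processors]    # 4 x (B, vlm_dim)
        logits = self.classifier(torch.cat(tier_feats, dim=-1))  # (B, num_classes)

        align_loss = None
        if vlm_text_embeds is not None:
            # Align each processor output with the matching VLM embedding;
            # cosine-similarity alignment is assumed here for illustration.
            align_loss = sum(
                1.0 - F.cosine_similarity(tf, vlm_text_embeds[:, i], dim=-1).mean()
                for i, tf in enumerate(tier_feats)
            ) / 4.0
        return logits, align_loss
```

In this sketch the VLM embeddings are consumed only during training, matching the summary's note that alignment happens at training time; at inference the classifier operates solely on the concatenated feature-processor outputs.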
Structure:
- Introduction to Video Action Understanding
- Role of Vision Transformers in Spatio-Temporal Representation Learning
- Limitations of ViTs in Generalization Across Datasets
- Introduction of Visual Language Models for Improved Generalization
- Proposal of the Four-Tiered Prompts (FTP) Framework
- Detailed Explanation of the FTP Architecture and Training Process
- Experimental Results on Various Datasets: Kinetics-400/600, Something-Something V2, UCF-101, HMDB51, AVA V2.2
- Ablation Study on the Influence of VLMs, ViTs, and Prompt Combinations
- Conclusion and Future Directions
Statistics
We achieve a remarkable top-1 accuracy of 93.8% on Kinetics-400.
We achieve a top-1 accuracy of 83.4% on Something-Something V2.
Our approach consistently surpasses state-of-the-art methods by clear margins.
Quotes
"In this paper, we propose the Four-tiered Prompts (FTP) framework that takes advantage of the complementary strengths of ViTs and VLMs."
"Our approach consistently yields state-of-the-art performance."
"By integrating the outputs of these feature processors, the ViT’s generalization ability can be significantly improved."