Core Concepts
InternVideo2 is a new video foundation model that excels at action recognition, video-text tasks, and video-centric dialogue through a progressive training paradigm.
Summary
This summary covers the development and evaluation of InternVideo2, a new video foundation model: its architecture, training stages, data preparation, experiments, and performance across a range of video-related tasks.
Directory:
- Authors and Affiliations
- Yi Wang∗1, Kunchang Li∗6,1, Xinhao Li∗2,1...
- Transferable Video(-Text) Representation
- Strong transferable visual and visual-linguistic representations.
- Abstract
- Introduction of InternVideo2 as a state-of-the-art ViFM.
- Introduction
- Importance of transferable spatiotemporal representations in vision understanding.
- Related Work
- Overview of previous research on learning video foundation models.
- Methodology
- Three stages of learning: reconstructing masked video tokens, aligning video to audio-speech-text, and predicting the next token with video-centric inputs (see the sketch after this list).
- Experiments
- Evaluation results on various tasks including action recognition and temporal grounding.
- Audio-related Tasks
- Evaluation results on audio tasks such as audio-text retrieval and audio QA.
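
The three-stage recipe listed under Methodology is the technical core of the paper. As a rough illustration of the first stage, below is a minimal sketch of masked video token reconstruction, where a trainable student encoder learns to reproduce a frozen teacher's features at masked positions. All names here (`MaskedVideoReconstruction`, `student`, `teacher`, `mask_ratio`) are illustrative assumptions, not InternVideo2's actual code or API; the real model distills from pretrained encoders, but the shape of the objective is similar in spirit.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedVideoReconstruction(nn.Module):
    """Hypothetical Stage-1 objective: reconstruct a frozen teacher's
    features for randomly masked video tokens. Illustrative only."""

    def __init__(self, student: nn.Module, teacher: nn.Module, mask_ratio: float = 0.8):
        super().__init__()
        self.student = student        # trainable video encoder
        self.teacher = teacher        # frozen target encoder (assumed pretrained)
        self.mask_ratio = mask_ratio
        for p in self.teacher.parameters():
            p.requires_grad = False

    def forward(self, video_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (batch, num_tokens, dim) patch embeddings of a clip
        b, n, d = video_tokens.shape
        num_masked = int(n * self.mask_ratio)

        # Pick a random subset of token positions to mask per sample.
        noise = torch.rand(b, n, device=video_tokens.device)
        masked_ids = noise.argsort(dim=1)[:, :num_masked]        # (b, num_masked)
        idx = masked_ids.unsqueeze(-1).expand(-1, -1, d)         # (b, num_masked, d)

        # Student sees the sequence with masked positions zeroed out.
        inputs = video_tokens.clone()
        inputs.scatter_(1, idx, 0.0)
        pred = self.student(inputs)                              # (b, n, d)

        # Teacher produces targets from the unmasked clip.
        with torch.no_grad():
            target = self.teacher(video_tokens)                  # (b, n, d)

        # Reconstruction loss only on the masked positions.
        return F.mse_loss(pred.gather(1, idx), target.gather(1, idx))
```

In this sketch the student and teacher are any modules mapping `(batch, num_tokens, dim)` to the same shape; the high mask ratio reflects the common practice in masked video modeling of hiding most tokens so the encoder must infer spatiotemporal structure.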
Stats
Figure 1: InternVideo2 yields strong transferable visual and visual-linguistic representations across 70 video understanding tasks.
Table 1: Summary of datasets used in InternVideo2 pretraining process.