Key Idea
InternVideo2 is a new video foundation model that excels in action recognition, video-text tasks, and video-centric dialogue through a progressive training paradigm.
Abstract
The paper presents the development and evaluation of InternVideo2, a new video foundation model. It covers the model's architecture, training stages, data preparation, experiments, and performance across a range of video-related tasks.
Contents:
Authors and Affiliations
Yi Wang∗1, Kunchang Li∗6,1, Xinhao Li∗2,1...
Transferable Video(-Text) Representation
Strong transferable visual and visual-linguistic representations.
Abstract
Introduction of InternVideo2 as a state-of-the-art video foundation model (ViFM).
Introduction
Importance of transferable spatiotemporal representations in vision understanding.
Related Work
Overview of previous research on learning video foundation models.
Methodology
Three stages of learning: reconstructing masked video tokens, aligning video to audio, speech, and text, and predicting the next token with video-centric inputs.
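The three-stage progression above can be sketched as a simple training schedule. This is an illustrative sketch only: the stage names, field layout, and helper function are assumptions made for clarity, not the authors' actual implementation.

```python
# Hypothetical sketch of InternVideo2's progressive training schedule.
# Stage names, dict layout, and the helper function are illustrative
# assumptions based on the outline above, not the paper's code.

STAGES = [
    {"name": "stage1_masked_reconstruction",
     "objective": "reconstruct masked video tokens",
     "modalities": ["video"]},
    {"name": "stage2_multimodal_alignment",
     "objective": "align video to audio, speech, and text",
     "modalities": ["video", "audio", "speech", "text"]},
    {"name": "stage3_next_token_prediction",
     "objective": "predict next token with video-centric inputs",
     "modalities": ["video", "text"]},
]

def objective_for(stage_index: int) -> str:
    """Return the training objective for a given (0-based) stage."""
    return STAGES[stage_index]["objective"]
```

The point of the progressive design is that each stage builds on the representations learned in the previous one, moving from low-level visual reconstruction to cross-modal alignment to generative, dialogue-ready prediction.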
Experiments
Evaluation results on various tasks including action recognition and temporal grounding.
Audio-related Tasks
Evaluation results on audio tasks such as audio-text retrieval and audio question answering (AudioQA).
Figures and Tables
Figure 1: InternVideo2 yields strong transferable visual and visual-linguistic representations across 70 video understanding tasks.
Table 1: Summary of datasets used in InternVideo2 pretraining process.