This technical report introduces Pegasus-1, a multimodal language model specialized in understanding video content and interacting with it through natural language. Pegasus-1 is designed to address the unique challenges posed by video data, such as interpreting spatiotemporal information and handling a wide range of video lengths.
The report discusses Pegasus-1's model architecture, which consists of a video encoder model, a video-language alignment model, and a large language model decoder. The training process involves a pretraining phase and an instruction tuning phase, with strategies to mitigate catastrophic forgetting.
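The three-stage structure described above (video encoder, video-language alignment model, language-model decoder) can be sketched as a simple pipeline. This is a minimal illustrative sketch only: the report does not publish implementation details, so every class name, method, and computation below is a hypothetical stand-in for the real components.

```python
class VideoEncoder:
    """Maps raw video frames to per-frame feature vectors (stub)."""
    def encode(self, frames):
        # Hypothetical: average pixel values stand in for learned features.
        return [sum(frame) / len(frame) for frame in frames]


class AlignmentModel:
    """Projects visual features into the decoder's token space (stub)."""
    def align(self, features):
        # Hypothetical linear rescaling standing in for a learned projection.
        return [f * 0.5 for f in features]


class LanguageDecoder:
    """Consumes aligned visual tokens plus a text prompt (stub)."""
    def generate(self, visual_tokens, prompt):
        return f"{prompt} [conditioned on {len(visual_tokens)} visual tokens]"


def video_language_pipeline(frames, prompt):
    """Chain the three components: encode, align, then decode."""
    features = VideoEncoder().encode(frames)
    tokens = AlignmentModel().align(features)
    return LanguageDecoder().generate(tokens, prompt)
```

The point of the sketch is the data flow, not the internals: frames become visual features, the alignment model maps those features into the space the decoder consumes, and the decoder conditions its text generation on both the visual tokens and the user's prompt.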
Pegasus-1 achieves new state-of-the-art results in video conversation, zero-shot video question answering, and video summarization benchmarks, outperforming both open-source and proprietary models. The report also presents qualitative results to showcase Pegasus-1's capabilities in areas such as real-world knowledge, video-based reasoning, 3D spatial understanding, temporal reasoning, and visual referring prompts. The report acknowledges Pegasus-1's limitations and aims to provide users with a comprehensive understanding of its current strengths, weaknesses, and areas for growth.