This work develops an unsupervised video summarization approach that exploits the inherent structure and information of video data to generate informative summaries, and introduces a human-centric evaluation pipeline for assessing the effectiveness of the proposed techniques.
MiniGPT4-Video, a multimodal large language model, jointly processes the visual frames and accompanying textual data in a video, enabling comprehensive video understanding and outperforming existing state-of-the-art methods on several video benchmarks.
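To make the joint visual-textual input concrete, here is a minimal, hypothetical sketch of how per-frame visual tokens and aligned text (e.g., subtitles) could be interleaved into one sequence for the language model; the placeholder token strings and the helper `build_interleaved_prompt` are illustrative assumptions, not MiniGPT4-Video's actual interface.

```python
def build_interleaved_prompt(frame_tokens, subtitles, question):
    """Toy sketch of joint visual-textual input: each sampled frame contributes
    its visual tokens followed by any text aligned to it, and the user question
    is appended at the end, so the model attends over both modalities in
    temporal order. Placeholder strings stand in for real projected embeddings.
    """
    sequence = []
    for i, (tokens, subtitle) in enumerate(zip(frame_tokens, subtitles)):
        sequence.append(f"<frame_{i}>{tokens}</frame_{i}>")
        if subtitle:                       # subtitle text aligned with this frame, if any
            sequence.append(subtitle)
    sequence.append(question)
    return " ".join(sequence)


print(build_interleaved_prompt(
    frame_tokens=["[64 visual tokens]", "[64 visual tokens]"],
    subtitles=["The chef dices an onion.", ""],
    question="What is the person preparing?",
))
```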
LongVLM, a straightforward yet powerful VideoLLM, decomposes long videos into multiple short-term segments, encodes local features for each segment, and integrates global semantics to enable comprehensive understanding of long-term video content.
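A rough sketch of the segment-local plus global-semantics idea follows, assuming per-frame features from a frozen visual encoder; the segment length, pooling scheme, and token counts are illustrative choices, not LongVLM's exact design.

```python
import torch

def encode_long_video(frame_features: torch.Tensor, segment_len: int = 16,
                      local_tokens_per_segment: int = 4) -> torch.Tensor:
    """Toy sketch of the local+global idea: split a long sequence of per-frame
    features into short-term segments, pool each segment into a few local
    tokens, and prepend a globally pooled token so the language model sees
    both fine-grained detail and video-level context.

    frame_features: (num_frames, dim) features from a frozen visual encoder.
    Returns a (num_tokens, dim) sequence to feed to the LLM projector.
    """
    local_tokens = []
    for start in range(0, frame_features.shape[0], segment_len):
        segment = frame_features[start:start + segment_len]           # short-term segment
        chunks = segment.chunk(local_tokens_per_segment, dim=0)       # pool to a few local tokens
        local_tokens.extend(c.mean(dim=0) for c in chunks if c.numel() > 0)
    local_tokens = torch.stack(local_tokens)                          # (L, dim) local features
    global_token = frame_features.mean(dim=0, keepdim=True)           # (1, dim) video-level summary
    return torch.cat([global_token, local_tokens], dim=0)             # global semantics + local detail


if __name__ == "__main__":
    feats = torch.randn(128, 768)           # e.g. 128 frames of ViT features
    print(encode_long_video(feats).shape)   # (1 + 4 * 8, 768) = (33, 768)
```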
DIBS, a novel pretraining framework, improves the quality of pseudo event boundaries and captions derived from large-scale unlabeled videos by leveraging diverse language models and optimizing for diversity, event-centricity, temporal ordering, and coherence. It also introduces an online boundary refinement strategy to iteratively enhance the pseudo boundaries during training.
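The online boundary refinement can be pictured as periodically nudging pseudo boundaries toward the model's own confident proposals. The sketch below is a hypothetical version of such an update; the IoU threshold and momentum blend are assumptions, not DIBS's actual rule.

```python
def temporal_iou(a, b):
    """1-D IoU between two (start, end) windows, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def refine_boundaries(pseudo_events, model_proposals, iou_thresh=0.5, momentum=0.7):
    """One online refinement pass: if a model proposal overlaps a pseudo
    boundary strongly, blend the pseudo boundary toward it. Both inputs are
    lists of (start, end) tuples; threshold and momentum are illustrative.
    """
    refined = []
    for (s, e) in pseudo_events:
        best = max(model_proposals, key=lambda p: temporal_iou((s, e), p), default=None)
        if best is not None and temporal_iou((s, e), best) >= iou_thresh:
            s = momentum * s + (1 - momentum) * best[0]
            e = momentum * e + (1 - momentum) * best[1]
        refined.append((s, e))
    return refined


# The (2.0, 10.0) pseudo boundary is pulled toward the overlapping (3.0, 9.5) proposal.
print(refine_boundaries([(2.0, 10.0)], [(3.0, 9.5), (20.0, 25.0)]))
```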
A streaming model for dense video captioning that can handle long input videos, generate detailed textual descriptions, and produce outputs before processing the entire video.
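One way to picture the streaming setup: frames arrive chunk by chunk, a constant-size memory summarizes what has been seen so far, and captions are emitted at intermediate decoding points rather than after the whole video. The exponential-moving-average memory and stub decoder below are simplifications under that assumption, not the paper's actual modules.

```python
import torch

class StreamingCaptioner:
    """Minimal sketch of streaming dense captioning: a fixed-size running
    memory is updated per chunk, and an output is produced at each decoding
    point before the full video has been processed.
    """

    def __init__(self, dim: int = 512, decay: float = 0.8):
        self.memory = torch.zeros(dim)   # constant-size state, independent of video length
        self.decay = decay
        self.time_sec = 0.0

    def step(self, chunk_features: torch.Tensor, chunk_duration: float) -> str:
        # Fold the new chunk into the memory without growing it.
        self.memory = self.decay * self.memory + (1 - self.decay) * chunk_features.mean(dim=0)
        self.time_sec += chunk_duration
        # A real model would condition a text decoder on `self.memory` here;
        # this stub just reports that a decoding point fired.
        return f"[caption emitted at t={self.time_sec:.0f}s from memory norm {self.memory.norm():.2f}]"


if __name__ == "__main__":
    captioner = StreamingCaptioner()
    for _ in range(3):                                   # three 10-second chunks of frame features
        print(captioner.step(torch.randn(16, 512), chunk_duration=10.0))
```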
A novel framework that extracts object-behavior facts from video clips, reasons over those facts using transformers, and predicts the adverb types that best describe the overall video content.
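A minimal sketch of that fact-reasoning pipeline, assuming facts arrive as (object, behavior) index pairs; the vocabulary sizes, embedding scheme, and mean-pooled classifier head are illustrative assumptions, not the framework's actual components.

```python
import torch
import torch.nn as nn

class AdverbFromFacts(nn.Module):
    """Toy sketch: each extracted (object, behavior) fact becomes one token
    embedding, a transformer encoder reasons over the set of facts, and a
    linear head scores the candidate adverb types for the whole clip.
    """

    def __init__(self, num_objects=100, num_behaviors=50, num_adverbs=20, dim=128):
        super().__init__()
        self.object_emb = nn.Embedding(num_objects, dim)
        self.behavior_emb = nn.Embedding(num_behaviors, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.reasoner = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_adverbs)

    def forward(self, object_ids: torch.Tensor, behavior_ids: torch.Tensor) -> torch.Tensor:
        # object_ids, behavior_ids: (batch, num_facts) indices extracted from the clip
        fact_tokens = self.object_emb(object_ids) + self.behavior_emb(behavior_ids)
        reasoned = self.reasoner(fact_tokens)          # contextualize facts against each other
        return self.head(reasoned.mean(dim=1))         # pooled logits over adverb types


if __name__ == "__main__":
    model = AdverbFromFacts()
    objects = torch.randint(0, 100, (2, 6))            # 6 facts per clip, batch of 2
    behaviors = torch.randint(0, 50, (2, 6))
    print(model(objects, behaviors).shape)             # torch.Size([2, 20])
```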