A novel framework that uses an external memory to incorporate prior knowledge and retrieve relevant text features, improving event localization and caption generation in dense video captioning.
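As a rough illustration of the retrieval idea (not the paper's actual implementation; names and shapes are hypothetical), an external memory can store encoded text features and return a similarity-weighted prior for each proposed event segment:

```python
# Minimal sketch of external-memory retrieval for caption generation
# (hypothetical names; not the paper's actual implementation).
import torch
import torch.nn.functional as F

def retrieve_text_features(segment_feat, memory_keys, memory_values, top_k=5):
    """Return a fused text prior for one video segment.

    segment_feat:  (D,)   query feature for one proposed event segment
    memory_keys:   (N, D) precomputed keys stored in the memory bank
    memory_values: (N, D) associated text features (e.g., encoded captions)
    """
    sims = F.cosine_similarity(segment_feat.unsqueeze(0), memory_keys, dim=-1)  # (N,)
    top = sims.topk(top_k)
    weights = top.values.softmax(dim=-1)                                        # (top_k,)
    retrieved = (weights.unsqueeze(-1) * memory_values[top.indices]).sum(0)     # (D,)
    return retrieved  # prior knowledge, e.g. fed to the caption decoder

# toy usage
mem_k, mem_v = torch.randn(1000, 256), torch.randn(1000, 256)
prior = retrieve_text_features(torch.randn(256), mem_k, mem_v)
print(prior.shape)  # torch.Size([256])
```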
MA-LMM introduces a long-term memory bank to efficiently and effectively model long-term video sequences by processing frames in an online manner and storing historical video information, addressing the limitations of current large multimodal models.
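A minimal sketch of an online, fixed-capacity memory bank, assuming a simple merge-the-most-similar-adjacent-entries compression rule (a simplification; the names and exact compression strategy are not taken from the paper):

```python
# Online long-term memory bank with bounded size: when full, the two most
# similar adjacent entries are averaged, so memory stays fixed regardless of
# video length (simplified sketch; hypothetical names).
import torch
import torch.nn.functional as F

class MemoryBank:
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.feats = []  # list of (D,) frame-level features

    def add(self, frame_feat):
        self.feats.append(frame_feat)
        if len(self.feats) > self.capacity:
            self._compress()

    def _compress(self):
        bank = torch.stack(self.feats)                            # (T, D)
        sims = F.cosine_similarity(bank[:-1], bank[1:], dim=-1)   # adjacent similarity
        i = int(sims.argmax())                                    # most redundant pair
        merged = (self.feats[i] + self.feats[i + 1]) / 2
        self.feats[i:i + 2] = [merged]

    def read(self):
        return torch.stack(self.feats)                            # (<= capacity, D)

bank = MemoryBank(capacity=8)
for _ in range(100):                  # stream frames in an online manner
    bank.add(torch.randn(256))
print(bank.read().shape)              # torch.Size([8, 256])
```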
A novel test-time adaptation approach, T3AL, that adapts pre-trained Vision and Language Models to localize and recognize actions in untrimmed videos without requiring any training data.
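A generic test-time adaptation loop for a frozen vision-language model is sketched below to illustrate the setting; the entropy-minimization objective and the linear adapter are illustrative assumptions, not T3AL's actual method:

```python
# Generic test-time adaptation sketch (not T3AL's exact objective): adapt a
# lightweight projection on one unlabeled test video by minimizing the entropy
# of frame-to-text-prompt similarities from a frozen vision-language model.
import torch
import torch.nn.functional as F

frame_feats = torch.randn(300, 512)                       # frozen VLM frame embeddings (T, D)
text_feats = F.normalize(torch.randn(20, 512), dim=-1)    # one prompt embedding per action class

adapter = torch.nn.Linear(512, 512)                       # the only parameters updated at test time
opt = torch.optim.Adam(adapter.parameters(), lr=1e-4)

for step in range(10):                                    # a few adaptation steps per video
    v = F.normalize(adapter(frame_feats), dim=-1)
    logits = 100.0 * v @ text_feats.T                     # (T, num_classes)
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()
    opt.zero_grad()
    entropy.backward()
    opt.step()

# After adaptation, per-frame scores can be thresholded and grouped into action segments.
scores = (100.0 * F.normalize(adapter(frame_feats), dim=-1) @ text_feats.T).softmax(-1)
```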
Semantic Flow learns semantic representations of dynamic scenes from continuous flow features that capture rich 3D motion information, enabling various applications such as instance-level scene editing, semantic completion, dynamic scene tracking, and semantic adaptation on novel scenes.
This paper proposes UniMD, a unified framework that performs Temporal Action Detection (TAD) and Moment Retrieval (MR) simultaneously by exploiting the potential synergies between the two tasks. The authors show that task fusion learning, through pre-training and co-training, improves the performance of both tasks. One way to picture such a unified interface is sketched below.
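The sketch assumes both tasks reduce to query-conditioned snippet classification and boundary regression, where the query is an action-class name (TAD) or a sentence (MR); module names are hypothetical, not UniMD's actual architecture:

```python
# Query-conditioned detection head shared by TAD and MR (illustrative sketch).
import torch
import torch.nn as nn

class UnifiedMomentHead(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.cls = nn.Linear(dim, 1)      # foreground score per snippet
        self.reg = nn.Linear(dim, 2)      # distances to segment start / end

    def forward(self, video_feats, query_feat):
        # video_feats: (B, T, D) snippet features
        # query_feat:  (B, D) encoded action-class name (TAD) or sentence (MR)
        q = query_feat.unsqueeze(1).expand(-1, video_feats.size(1), -1)
        fused = self.fuse(torch.cat([video_feats, q], dim=-1))
        return self.cls(fused).squeeze(-1), self.reg(fused)

head = UnifiedMomentHead()
scores, offsets = head(torch.randn(2, 128, 256), torch.randn(2, 256))
print(scores.shape, offsets.shape)        # torch.Size([2, 128]) torch.Size([2, 128, 2])
```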
This work proposes the first unsupervised domain adaptation (UDA) method for sparse, multi-label Temporal Action Localization, improving performance on unseen domains over both fully supervised baselines and alternative UDA methods.
OW-VISCap simultaneously detects, segments, tracks, and generates rich object-centric captions for both previously seen and unseen objects in videos, without requiring additional user inputs or prompts.
This work introduces the first Video Transformer Concept Discovery (VTCD) algorithm to systematically identify and rank the importance of high-level, spatiotemporal concepts that underlie the decision-making process of video transformer models.
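A simplified sketch of the general concept-discovery recipe follows: cluster intermediate token features into candidate concepts, then rank each concept by the prediction drop when its tokens are masked. This compresses the idea and is not the paper's exact algorithm:

```python
# Cluster-and-ablate concept discovery for a video transformer (simplified sketch).
import torch
from sklearn.cluster import KMeans

def discover_and_rank_concepts(tokens, forward_from_tokens, n_concepts=10):
    """tokens: (N, D) spatiotemporal token features from an intermediate layer.
    forward_from_tokens: maps a (N, D) token tensor to class logits."""
    labels = KMeans(n_clusters=n_concepts, n_init=10).fit_predict(tokens.numpy())
    base = forward_from_tokens(tokens)
    pred = base.argmax()
    ranking = []
    for c in range(n_concepts):
        masked = tokens.clone()
        masked[torch.from_numpy(labels == c)] = 0.0           # ablate this concept's tokens
        drop = (base[pred] - forward_from_tokens(masked)[pred]).item()
        ranking.append((c, drop))
    return sorted(ranking, key=lambda x: -x[1])                # most important concepts first

# toy usage: a random linear readout stands in for the rest of the network
tokens = torch.randn(784, 768)
readout = torch.randn(768, 400)
print(discover_and_rank_concepts(tokens, lambda t: t.mean(0) @ readout)[:3])
```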
This work introduces a novel open-world formulation for the problem of temporally localizing the three stages (initial, transitioning, end) of object state changes in videos, addressing the limitations of existing closed-world approaches. The authors propose VIDOSC, a holistic learning framework that leverages text and vision-language models for supervisory signals and develops techniques for object-agnostic state prediction to enable generalization to novel objects.
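To make the three-stage formulation concrete, a minimal per-frame state-labeling head might look like the following (hypothetical layer sizes and design, not VIDOSC's actual architecture):

```python
# Per-frame state labeling for object state changes: classify each frame as
# background / initial / transitioning / end, then read the three stages off
# as contiguous spans (illustrative sketch).
import torch
import torch.nn as nn

class StateChangeHead(nn.Module):
    NUM_STATES = 4  # background, initial, transitioning, end

    def __init__(self, dim=512):
        super().__init__()
        self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.classifier = nn.Linear(dim, self.NUM_STATES)

    def forward(self, frame_feats):                       # (B, T, D)
        x = self.temporal(frame_feats.transpose(1, 2)).transpose(1, 2).relu()
        return self.classifier(x)                         # (B, T, 4) per-frame state logits

head = StateChangeHead()
logits = head(torch.randn(1, 200, 512))
stages = logits.argmax(-1)                                # per-frame predicted state
```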