A novel reinforcement learning algorithm using transformers can accurately predict human gaze behavior in third-person view videos, enabling automation of video understanding tasks that rely on human gaze input.
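As a rough illustration of the idea, the sketch below pairs a transformer policy with a REINFORCE-style update to predict per-frame gaze points; the architecture, dimensions, and reward here are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch: a transformer policy predicting per-frame gaze
# points from video features, trained with REINFORCE. All names and
# dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class GazePolicy(nn.Module):
    def __init__(self, feat_dim=512, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 4)  # mean (x, y) and log-std (x, y)

    def forward(self, frame_feats):               # (B, T, feat_dim)
        h = self.encoder(self.proj(frame_feats))  # (B, T, d_model)
        mean, log_std = self.head(h).chunk(2, dim=-1)
        return torch.distributions.Normal(torch.sigmoid(mean), log_std.exp())

policy = GazePolicy()
feats = torch.randn(2, 16, 512)      # dummy clip features, 16 frames
human_gaze = torch.rand(2, 16, 2)    # normalized (x, y) annotations

dist = policy(feats)
gaze = dist.sample()                             # sampled gaze trajectory
reward = -((gaze - human_gaze) ** 2).sum(-1)     # closer to human gaze is better
loss = -(dist.log_prob(gaze).sum(-1) * reward).mean()  # REINFORCE objective
loss.backward()
```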
This work introduces a point-supervised video instance segmentation framework that achieves competitive performance compared to fully-supervised methods by leveraging class-agnostic proposal generation and a spatio-temporal point-based matcher to generate high-quality dense pseudo-labels from sparse point annotations.
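A minimal sketch of the densification step, assuming proposals and point annotations are already available: a point is matched to the highest-scoring class-agnostic proposal whose mask covers it, and that mask becomes the dense pseudo-label. The paper's actual spatio-temporal matcher is more involved than this single-frame rule.

```python
# Turning sparse point annotations into dense pseudo-masks via
# class-agnostic proposals; the matching rule is a simplification.
import numpy as np

def points_to_pseudo_masks(proposals, scores, points):
    """proposals: (N, H, W) binary masks, scores: (N,),
    points: list of ((y, x), class_id) annotations."""
    H, W = proposals.shape[1:]
    pseudo = np.zeros((H, W), dtype=np.int64)  # 0 = background
    for (y, x), cls in points:
        hits = np.where(proposals[:, y, x] > 0)[0]  # proposals covering the point
        if len(hits) == 0:
            continue
        best = hits[np.argmax(scores[hits])]        # highest-scoring match
        pseudo[proposals[best] > 0] = cls           # densify the point label
    return pseudo

props = np.zeros((2, 8, 8))
props[0, 2:6, 2:6] = 1
props[1, 0:3, 0:3] = 1
masks = points_to_pseudo_masks(props, np.array([0.9, 0.4]), [((3, 3), 5)])
print(masks[3, 3])  # 5: the point's class, spread over the matched proposal
```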
By finetuning the pre-trained CLIP model, we achieve state-of-the-art performance on the video highlight detection task, demonstrating the power of leveraging large-scale multimodal knowledge for specialized video understanding.
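For intuition, a hedged zero-shot variant of the idea: score frames by CLIP image-text similarity against a highlight query. This uses the off-the-shelf HuggingFace CLIP checkpoint rather than the finetuned model the work describes.

```python
# Zero-shot frame scoring with CLIP similarity; a stand-in for the
# finetuned highlight detector, not the paper's method.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frames = [Image.new("RGB", (224, 224)) for _ in range(4)]  # stand-in frames
query = ["a player scoring a goal"]

inputs = processor(text=query, images=frames, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
scores = out.logits_per_image.squeeze(-1)  # one similarity score per frame
highlights = scores.softmax(dim=0)         # relative highlight weights
print(highlights)
```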
The core message of this paper is that action detection can be effectively tackled by formulating it as a three-image generation problem, where the starting point, ending point, and action-class predictions are generated as images via a diffusion-based framework.
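A guess at what the three-image encoding could look like, with the start, end, and class targets rasterized into 2D maps that a diffusion model would be trained to generate; the exact construction in the paper may differ.

```python
# Hedged illustration of casting detection targets as images. The
# encoding below is an assumption about the idea, not the paper's
# exact construction.
import numpy as np

T, C = 100, 20               # timeline length, number of classes
start, end, cls = 12, 47, 3  # one ground-truth action instance

start_img = np.zeros((T, T)); start_img[start, :] = 1.0  # "start point" image
end_img = np.zeros((T, T)); end_img[end, :] = 1.0        # "end point" image
class_img = np.zeros((T, C)); class_img[start:end + 1, cls] = 1.0

# A diffusion model would generate these three images from noise,
# conditioned on video features; argmax decoding recovers the action.
print(start_img.argmax(0)[0], end_img.argmax(0)[0], class_img[30].argmax())
```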
Efficiently extending temporal boundaries for weakly supervised video grounding with multimodal large language models.
DiffusionVMR proposes a novel denoising generation framework for joint video moment retrieval and highlight detection, leveraging diffusion models for iterative refinement and improved performance.
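A toy sketch of the denoising loop: spans initialized from noise are repeatedly refined by a placeholder network conditioned on a query embedding. The SpanDenoiser below is an assumption for illustration, not DiffusionVMR's architecture.

```python
# Iterative denoising of temporal (center, width) spans conditioned on
# a text query; the denoiser is a placeholder network.
import torch
import torch.nn as nn

class SpanDenoiser(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 + d, d), nn.ReLU(), nn.Linear(d, 2))

    def forward(self, spans, query_emb):
        # predict a cleaner span from the current noisy one plus query context
        return self.net(torch.cat([spans, query_emb], dim=-1))

denoiser = SpanDenoiser()
query_emb = torch.randn(1, 128)  # pooled text-query embedding
spans = torch.rand(1, 2)         # noisy init: (center, width) in [0, 1]

for step in range(5):            # iterative refinement loop
    spans = denoiser(spans, query_emb).sigmoid()
print(spans)  # refined (center, width) proposal
```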
The IVAC-P2L model is proposed, which leverages irregular repetition priors to improve model performance.
Decoupling semantic understanding and temporal reasoning is essential for efficient scene identification.
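One way to read this claim in code, as a hedged sketch: per-frame semantic embeddings are computed (and can be cached) upstream, so the temporal reasoner only ever sees compact vectors. Module names and sizes are assumptions.

```python
# Decoupled design: a per-frame semantic encoder runs upstream, and a
# lightweight temporal reasoner consumes its cached embeddings.
import torch
import torch.nn as nn

class DecoupledSceneID(nn.Module):
    def __init__(self, sem_dim=512, hid=256, n_scenes=10):
        super().__init__()
        self.temporal = nn.GRU(sem_dim, hid, batch_first=True)  # temporal reasoning only
        self.classifier = nn.Linear(hid, n_scenes)

    def forward(self, frame_embeddings):        # (B, T, sem_dim), precomputed
        _, h = self.temporal(frame_embeddings)  # semantic work already done upstream
        return self.classifier(h[-1])           # scene logits

model = DecoupledSceneID()
clip_feats = torch.randn(2, 32, 512)  # e.g. cached per-frame CLIP embeddings
print(model(clip_feats).shape)        # torch.Size([2, 10])
```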
This work proposes the MVMR task to improve video moment retrieval by addressing limitations of existing methods.
DiffusionVMR is proposed for joint video moment retrieval and highlight detection, using denoising generation to refine boundaries iteratively.