DiffusionVMR: Video Moment Retrieval and Highlight Detection Model
Core Concepts
DiffusionVMR recasts joint video moment retrieval and highlight detection as a denoising generation process, iteratively refining moment boundaries from noisy spans.
Abstract
DiffusionVMR introduces a diffusion-based framework for video moment retrieval and highlight detection. It iteratively refines moment boundaries through denoising, enhancing performance across various datasets. Because training and inference are decoupled, inference settings (such as the number of proposals and denoising iterations) need not match those used in training. Extensive experiments demonstrate the effectiveness of DiffusionVMR, improving average mAP by 12% compared to the baseline.
The framework comprises a moment denoising decoder that refines noisy spans, a saliency denoising decoder that generates saliency scores, and a cross-modal encoder that mediates interaction between the video and text modalities. DiffusionVMR outperforms state-of-the-art methods on both moment retrieval and highlight detection.
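The iterative refinement at inference can be sketched as a DDIM-style loop over (center, width) spans. This is a minimal, hypothetical illustration, not the paper's implementation: `predict_clean` is a stand-in for the moment denoising decoder (which in DiffusionVMR is conditioned on cross-modal video/text features), and the cosine schedule and step-jump size are assumptions.

```python
import math
import random

def cosine_alpha_bar(t, T):
    """Cosine noise schedule: cumulative signal fraction at step t (assumption)."""
    return math.cos((t / T) * math.pi / 2) ** 2

def denoise_spans(noisy_spans, t, T, predict_clean):
    """One DDIM-style refinement step over (center, width) spans (sketch).

    `predict_clean` stands in for the moment denoising decoder: it maps
    noisy spans at step t to predicted clean spans. Call with t > 0.
    """
    t_next = max(t - T // 5, 0)  # jump several steps at once (assumption)
    a_t, a_next = cosine_alpha_bar(t, T), cosine_alpha_bar(t_next, T)
    refined = []
    for (c, w), (c0, w0) in zip(noisy_spans, predict_clean(noisy_spans, t)):
        # Recover the noise implied by the prediction, then re-noise
        # the predicted clean span to the next (smaller) timestep.
        eps_c = (c - math.sqrt(a_t) * c0) / math.sqrt(1 - a_t)
        eps_w = (w - math.sqrt(a_t) * w0) / math.sqrt(1 - a_t)
        refined.append((
            math.sqrt(a_next) * c0 + math.sqrt(1 - a_next) * eps_c,
            math.sqrt(a_next) * w0 + math.sqrt(1 - a_next) * eps_w,
        ))
    return refined, t_next
```

With a perfect decoder the loop converges to the clean spans; in practice each pass lets the model correct its previous estimate, which is the source of the "iteratively refined results" the paper highlights.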
DiffusionVMR
Stats
Extensive experiments conducted on five benchmarks (QVHighlights, Charades-STA, TACoS, YouTube Highlights, TVSum)
Achieved a 12% improvement in average mAP on the QVHighlights dataset over the baseline [15]
Diffusion step t ∈ {0, 1, ..., T} sampled uniformly at random during training
Number of proposals gradually increased from 1 to 20 over the course of training
Maximum diffusion step set to T = 1000
Initial learning rate of 1e−4 with weight decay of 1e−4
Hidden dimension set to D = 256
Quotes
"Diffusion models show considerable potential for video moment retrieval and highlight detection tasks."
"The proposed DiffusionVMR inherits the advantages of diffusion models that allow for iteratively refined results during inference."
"Extensive experiments demonstrate the effectiveness and flexibility of the proposed DiffusionVMR."