Identifying highlight moments in videos is crucial for efficient editing. This paper introduces a model for unsupervised highlight detection that uses cross-modal perception. The model learns representations from image-audio pair data via self-reconstruction, and a representation activation sequence learning (RASL) module learns which representation activations are significant. Symmetric contrastive learning connects the visual and audio modalities, which improves performance while requiring no audio input at inference. The reported experiments show superior performance compared to state-of-the-art approaches.
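The summary does not spell out the symmetric contrastive objective, so the following is only a minimal sketch of one common formulation (an InfoNCE-style loss averaged over both directions), not necessarily the loss used in the paper. The function names, the `temperature` value, and the toy embeddings are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(visual, audio, temperature=0.1):
    """Average of the visual-to-audio and audio-to-visual contrastive losses.

    Assumption: visual[i] and audio[i] are embeddings of the same clip,
    and every other pairing in the batch is treated as a negative.
    """
    v = [l2_normalize(x) for x in visual]
    a = [l2_normalize(x) for x in audio]
    # Temperature-scaled cosine similarity of every visual/audio pair.
    sim = [[sum(p * q for p, q in zip(vi, aj)) / temperature for aj in a]
           for vi in v]
    sim_t = [list(col) for col in zip(*sim)]  # audio-to-visual direction
    return 0.5 * (cross_entropy_diagonal(sim) + cross_entropy_diagonal(sim_t))


# Toy batch of three clips with 2-D embeddings (made-up numbers).
visual = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_audio = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled_audio = [aligned_audio[1], aligned_audio[2], aligned_audio[0]]

loss_aligned = symmetric_contrastive_loss(visual, aligned_audio)
loss_shuffled = symmetric_contrastive_loss(visual, shuffled_audio)
```

In this sketch the loss is lower when each visual embedding sits close to its own clip's audio embedding than when the pairs are shuffled. Because both directions are averaged, swapping the two arguments leaves the value unchanged, which is what "symmetric" means here.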
Key Insights Distilled From
by Tingtian Li,... : arxiv.org 03-15-2024
https://arxiv.org/pdf/2403.09401.pdf