Identifying highlight moments in videos is crucial for efficient editing. This paper introduces a novel model for unsupervised highlight detection based on cross-modal perception. The model learns representations from image-audio pair data via self-reconstruction and applies representation activation sequence learning (RASL) to identify significant representation activations for highlight scoring. A symmetric contrastive learning objective connects the visual and audio modalities, so the model can detect highlights from visual input alone at inference time. Experimental results show superior performance compared to state-of-the-art approaches.
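As a rough illustration of the symmetric contrastive objective mentioned above, the sketch below implements a CLIP-style symmetric InfoNCE loss over paired visual and audio embeddings. This is an assumption about the general form of such a loss, not the paper's exact formulation; the function name, embedding shapes, and the `temperature` value are all illustrative.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(visual_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired visual/audio embeddings.

    visual_emb, audio_emb: (batch, dim) tensors from the two encoders.
    Matched image-audio pairs sit on the diagonal of the similarity matrix.
    NOTE: a minimal sketch of a generic symmetric contrastive loss,
    not the specific objective defined in the paper.
    """
    v = F.normalize(visual_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature              # (batch, batch) similarity scores
    targets = torch.arange(v.size(0), device=v.device)
    # Cross-entropy in both directions (visual->audio and audio->visual),
    # which is what makes the objective symmetric across modalities.
    loss_v2a = F.cross_entropy(logits, targets)
    loss_a2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2a + loss_a2v)

# Example usage with random stand-in embeddings:
if __name__ == "__main__":
    v = torch.randn(8, 256)
    a = torch.randn(8, 256)
    print(symmetric_contrastive_loss(v, a).item())
```

Because the two modalities are aligned in a shared embedding space during training, the visual branch alone can stand in for the joint representation at inference, which is why no audio input is needed then.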
Key insights from the original content by Tingtian Li,... at arxiv.org, 03-15-2024
https://arxiv.org/pdf/2403.09401.pdf