Identifying highlight moments in videos is crucial for efficient editing. This paper introduces a novel model for unsupervised highlight detection based on cross-modal perception. The model learns representations from paired image-audio data via self-reconstruction and applies representation activation sequence learning (RASL) to emphasize the significant activations that mark highlights. A symmetric contrastive learning objective aligns the visual and audio modalities, which lets the model perform well even when no audio input is available at inference time. Experimental results show superior performance compared to state-of-the-art approaches.
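The paper's exact loss is not given in this summary; as an illustration of the general idea, the sketch below shows a symmetric InfoNCE-style contrastive objective that pulls each matched visual-audio pair together and pushes mismatched pairs apart, in both directions. The function name, embedding size, batch size, and temperature value are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(visual_emb, audio_emb, temperature=0.07):
    """Generic symmetric contrastive loss over paired visual/audio
    embeddings (a sketch, not the paper's exact objective)."""
    # L2-normalize so dot products become cosine similarities.
    v = F.normalize(visual_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)

    # Pairwise similarity logits; matched pairs sit on the diagonal.
    logits = v @ a.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions: visual->audio and audio->visual.
    loss_v2a = F.cross_entropy(logits, targets)
    loss_a2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2a + loss_a2v)

# Hypothetical usage: a batch of 8 paired 256-dimensional embeddings.
visual = torch.randn(8, 256)
audio = torch.randn(8, 256)
loss = symmetric_contrastive_loss(visual, audio)
```

Because the two modalities are aligned in a shared embedding space during training, the visual branch alone can be used at inference, which matches the summary's claim that no audio input is needed then.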
Key insights extracted from a paper by Tingtian Li,... at arxiv.org, 03-15-2024
https://arxiv.org/pdf/2403.09401.pdf