Identifying highlight moments in videos is crucial for efficient editing. This paper introduces a novel model for unsupervised highlight detection using cross-modal perception. The model learns representations from paired image-audio data via self-reconstruction and uses representation activation sequence learning (RASL) to identify significant activations. Symmetric contrastive learning connects the visual and audio modalities, improving performance while removing the need for audio input at inference time. Experimental results show superior performance compared to state-of-the-art approaches.
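The symmetric contrastive objective mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes a standard CLIP-style symmetric InfoNCE loss over matched image-audio embedding pairs, with the function name, `temperature` value, and embedding shapes chosen for illustration.

```python
import numpy as np

def symmetric_contrastive_loss(img_emb, aud_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired embeddings (illustrative sketch).

    img_emb, aud_emb: (N, D) arrays; row i of each forms a matched pair.
    """
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    aud = aud_emb / np.linalg.norm(aud_emb, axis=1, keepdims=True)
    logits = img @ aud.T / temperature  # (N, N); matched pairs on the diagonal
    n = len(img)

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()        # target class i for row i

    # Average both retrieval directions (image->audio and audio->image):
    # this symmetry is what ties the two modalities together.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Because the loss pulls matched pairs onto the diagonal of the similarity matrix, correctly aligned pairs should score a lower loss than shuffled ones.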
Key insights distilled from the paper by Tingtian Li,... on arxiv.org, 03-15-2024
https://arxiv.org/pdf/2403.09401.pdf