The paper introduces DETECLAP, a method that enhances audio-visual representation learning by incorporating object information. The key idea is to add an audio-visual label prediction loss to the existing Contrastive Audio-Visual Masked AutoEncoder (CAV-MAE) to improve its object awareness.
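The auxiliary objective can be pictured as a multi-label prediction term added to the existing CAV-MAE loss. The following is a minimal sketch, not the authors' exact formulation: the loss name, the binary cross-entropy choice, and the weighting hyperparameter `lam` are all assumptions for illustration.

```python
import numpy as np

def bce_label_loss(logits, targets):
    """Multi-label binary cross-entropy over object labels
    (hypothetical sketch of a label-prediction auxiliary loss)."""
    probs = 1.0 / (1.0 + np.exp(-logits))  # per-label sigmoid
    eps = 1e-9
    return float(-np.mean(targets * np.log(probs + eps)
                          + (1.0 - targets) * np.log(1.0 - probs + eps)))

def total_loss(cavmae_loss, label_logits, label_targets, lam=1.0):
    """Combined objective: the existing CAV-MAE loss plus the
    label-prediction term, weighted by `lam` (assumed name)."""
    return cavmae_loss + lam * bce_label_loss(label_logits, label_targets)
```

When the model's logits agree with the target labels, the auxiliary term is near zero and the combined loss reduces to the original CAV-MAE objective.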
To avoid costly manual annotations, the authors derive object labels from both modalities using a state-of-the-art language-audio model (CLAP) for audio and an object detector (YOLOv8) for video. They evaluate the method on audio-visual retrieval and classification tasks using the VGGSound and AudioSet20K datasets.
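Both labelers can be viewed as producing per-class confidence scores that are thresholded into a label set. The sketch below uses placeholder score dictionaries; a real pipeline would call CLAP (audio-text similarity) or YOLOv8 (detection confidence), and the threshold value is an assumption.

```python
def labels_from_scores(scores, threshold=0.5):
    """Turn per-class confidence scores (e.g. CLAP audio-text
    similarities or YOLOv8 detection confidences) into a label set.
    Scores are placeholders standing in for real model outputs."""
    return {label for label, score in scores.items() if score >= threshold}

# Hypothetical scores for one clip:
audio_scores = {"dog": 0.8, "car": 0.2, "speech": 0.6}
audio_labels = labels_from_scores(audio_scores)  # {"dog", "speech"}
```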
The results show that DETECLAP achieves improvements in recall@10 of +1.5% and +1.2% for audio-to-visual and visual-to-audio retrieval, respectively, and an improvement in accuracy of +0.6% for audio-visual classification on the VGGSound dataset compared to the baseline CAV-MAE. The authors also explore different strategies for merging audio and visual labels, finding that the OR operation outperforms the AND operation and separate models.
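The merging strategies compared in the paper amount to set operations over the modality-specific label sets. A minimal sketch (function and argument names are assumptions):

```python
def merge_labels(audio_labels, visual_labels, strategy="or"):
    """Merge audio- and visual-derived label sets. The paper reports
    that the OR (union) strategy outperforms AND (intersection)."""
    if strategy == "or":
        return audio_labels | visual_labels   # label present in either modality
    if strategy == "and":
        return audio_labels & visual_labels   # label present in both modalities
    raise ValueError(f"unknown strategy: {strategy}")
```

OR keeps objects that only one modality can detect (e.g. a silent object seen on screen, or an off-screen sound source), which is one plausible reason the union works better than the intersection.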
The paper demonstrates that incorporating object information can enhance audio-visual representation learning, leading to improved performance on downstream tasks. The authors highlight the importance of choosing appropriate label types and merging strategies for optimal model performance.