Pian, W., Nan, Y., Deng, S., Mo, S., Guo, Y., & Tian, Y. (2024). Continual Audio-Visual Sound Separation. Advances in Neural Information Processing Systems, 38.
This paper addresses the challenge of continual learning in audio-visual sound separation, aiming to develop a model capable of continuously learning to separate new sound sources without forgetting previously learned ones.
The authors propose ContAV-Sep, a novel framework built around a Cross-modal Similarity Distillation Constraint (CrossSDC). This constraint preserves cross-modal semantic similarity across incremental tasks by enforcing instance-aware and class-aware semantic similarity through a combination of contrastive loss and knowledge distillation. The framework adopts a state-of-the-art audio-visual separator (iQuery) as its base model, with pre-trained VideoMAE and CLIP models serving as the video and image encoders, respectively.
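To make the idea concrete, here is a minimal sketch of what a cross-modal similarity distillation loss can look like. This is not the paper's exact formulation: the function names, the InfoNCE-style contrastive term for instance-aware similarity, and the MSE distillation term tying the new model's audio-visual similarity structure to the frozen old model's are illustrative assumptions.

```python
import numpy as np

def cosine_sim_matrix(a, v):
    # Pairwise cosine similarities between audio (N, D) and visual (N, D) embeddings.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    return a @ v.T

def cross_sdc_loss(a_new, v_new, a_old, v_old, tau=0.07, lam=0.5):
    """Illustrative cross-modal similarity distillation loss (not the paper's exact form).

    Instance-aware term: an InfoNCE-style contrastive loss that pulls each audio
    embedding toward its paired visual embedding (the diagonal of the similarity
    matrix) and pushes it away from the others in the batch.
    Distillation term: keeps the new model's cross-modal similarity structure
    close to that of the old, frozen model from the previous task.
    """
    s_new = cosine_sim_matrix(a_new, v_new) / tau
    # Row-wise log-softmax; the matching pair sits on the diagonal.
    logits = s_new - s_new.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    contrastive = -np.mean(np.diag(log_prob))
    # Distill the old model's similarity matrix into the new one (MSE).
    s_old = cosine_sim_matrix(a_old, v_old) / tau
    distill = np.mean((s_new - s_old) ** 2)
    return contrastive + lam * distill
```

In practice the old-model embeddings would come from a frozen copy of the separator trained on previous tasks, so the distillation term penalizes drift in the learned audio-visual correspondence while the contrastive term fits the new task.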
Experiments on the MUSIC-21 dataset demonstrate that ContAV-Sep significantly outperforms existing continual learning baselines in terms of Signal-to-Distortion Ratio (SDR), Signal-to-Interference Ratio (SIR), and Signal-to-Artifact Ratio (SAR). The study also shows that even a small memory set improves performance, a consequence of the unique nature of sound separation training.
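For reference, a simplified version of the SDR metric can be computed as the energy of the reference signal over the energy of the residual error. Note this is a sketch: the full BSS Eval procedure used for separation benchmarks additionally decomposes the error into interference and artifact components to obtain SIR and SAR.

```python
import numpy as np

def sdr(reference, estimate, eps=1e-12):
    # Simplified signal-to-distortion ratio in dB: reference energy over
    # residual-error energy. eps guards against division by zero when the
    # estimate matches the reference exactly.
    error = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(error ** 2) + eps))
```

A perfect estimate yields a very large SDR, and any added distortion lowers it, which is why higher SDR indicates better separation quality.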
ContAV-Sep effectively addresses the catastrophic forgetting problem in continual audio-visual sound separation, enabling models to adapt to new sound categories while retaining performance on previously learned ones. The proposed CrossSDC method proves crucial in preserving cross-modal semantic similarity throughout the continual learning process.
This research introduces a novel approach to continual learning in the context of audio-visual sound separation, paving the way for more practical and adaptable models in real-world scenarios.
The study acknowledges limitations related to its reliance on object detectors for identifying sounding objects, as well as room for improvement in letting models acquire new knowledge about old classes in subsequent tasks. Future research could explore these areas to further enhance the robustness and adaptability of continual audio-visual sound separation models.
Key insights distilled from: Weiguo Pian et al., arxiv.org, 2024-11-06, https://arxiv.org/pdf/2411.02860.pdf