The paper presents a weakly-supervised audio separation framework that can extract single-source audio signals from a mixture, using corresponding language descriptions as conditioning signals. The key idea is to leverage the fact that single-source language entities can be easily extracted from the text descriptions, and use a pretrained joint audio-language embedding model (CLAP) to generate weak supervision signals for the audio separation model.
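As a rough illustration of how the weak supervision signal could be produced, the sketch below (Python/PyTorch) parses single-source entities out of a caption and embeds them with a CLAP-style text encoder; `extract_entities` and `clap_text_embed` are hypothetical placeholders for whatever parser and pretrained checkpoint are actually used, not the paper's implementation.

```python
import torch.nn.functional as F

def extract_entities(caption: str) -> list[str]:
    # Hypothetical parser: in practice, single-source phrases such as
    # "acoustic guitar" would be pulled from the caption by a tagger or LLM.
    return [phrase.strip() for phrase in caption.split(" and ") if phrase.strip()]

def clap_entity_targets(captions, clap_text_embed):
    # clap_text_embed is a stand-in for a pretrained CLAP text encoder that
    # maps a list of strings to a (num_entities, d) embedding matrix.
    entities = [e for caption in captions for e in extract_entities(caption)]
    text_emb = clap_text_embed(entities)
    # Unit-normalized embeddings act as weak targets for the separated audio.
    return F.normalize(text_emb, dim=-1), entities
```

These unit-norm text embeddings then serve as targets in the contrastive term sketched after the component list below.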
The framework consists of three main components (a combined loss sketch follows the list):
Unsupervised Mix-and-Separate Training: The audio separation model is trained on synthetic mixtures of audio samples, using an unsupervised reconstruction loss.
Weakly Supervised Audio-Language Training: The model is further trained using a bi-modal contrastive loss that aligns the predicted single-source audio with the corresponding language entities. This provides weak supervision for single-source extraction without requiring access to single-source audio samples.
Consistency Reconstruction Loss: An additional consistency loss is introduced to ensure the predicted single-source components sum up to the original mixture.
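A minimal PyTorch sketch of how the three terms above could be combined into a single training objective; the separator interface, the CLAP audio encoder, the choice of L1 losses, and the temperature value are assumptions made for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def training_losses(separator, clap_audio_embed, mix, refs, text_emb, tau=0.07):
    """Sketch of the three loss terms on one batch of synthetic mixtures.

    mix:      (B, T)    synthetic mixtures built by summing `refs`
    refs:     (B, K, T) the K constituent signals of each synthetic mixture
                        (in the unsupervised setting these are themselves
                        mixtures rather than clean single sources)
    text_emb: (B, K, d) unit-norm CLAP embeddings of the language entities
    """
    B, K, T = refs.shape
    preds = separator(mix, text_emb)          # (B, K, T), one output per query

    # 1) Unsupervised mix-and-separate reconstruction loss.
    recon = F.l1_loss(preds, refs)

    # 2) Weakly supervised bi-modal contrastive loss: align each predicted
    #    component's CLAP audio embedding with its language-entity embedding.
    audio_emb = F.normalize(clap_audio_embed(preds.reshape(B * K, T)), dim=-1)
    t_emb = text_emb.reshape(B * K, -1)
    logits = audio_emb @ t_emb.t() / tau      # (B*K, B*K) similarity matrix
    labels = torch.arange(B * K, device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, labels)
                         + F.cross_entropy(logits.t(), labels))

    # 3) Consistency loss: predicted components should sum back to the mixture.
    consistency = F.l1_loss(preds.sum(dim=1), mix)

    return recon + contrastive + consistency
```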
The authors show that this weakly-supervised framework significantly boosts unsupervised audio separation baselines, achieving up to 71% improvement in Signal-to-Distortion Ratio (SDR) over the baseline. The framework can also augment supervised audio separation models, yielding a semi-supervised learning approach that outperforms the supervised baseline by up to 17% in SDR.
Extensive experiments are conducted on synthetic mixtures from the MUSIC, VGGSound, and AudioCaps datasets, demonstrating the effectiveness of the proposed approach across different scenarios.