The paper presents a weakly-supervised audio separation framework that extracts single-source audio signals from a mixture, conditioned on corresponding language descriptions. The key idea is that single-source language entities can be easily extracted from the text descriptions, so a pretrained joint audio-language embedding model (CLAP) can be used to generate weak supervision signals for the audio separation model.
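To make this concrete, the sketch below shows one way a mixture's caption could be turned into per-source text targets with a frozen CLAP-style text encoder. It is a minimal illustration, not the paper's actual interface: the prompt template, `entity_prompts`, and `clap_text_encoder` are assumptions.

```python
# Minimal sketch (not the authors' code): turning single-source entities
# parsed from a caption into text-embedding targets for weak supervision.
# The prompt template and the encoder interface are illustrative assumptions.
import torch
import torch.nn.functional as F

def entity_prompts(entities):
    """Wrap single-source entities (e.g., parsed from a caption) in a prompt."""
    return [f"the sound of {e}" for e in entities]

@torch.no_grad()
def text_targets(clap_text_encoder, entities):
    """Embed each single-source entity with a frozen CLAP-style text encoder.

    `clap_text_encoder` is assumed to map a list of strings to a tensor of
    shape (num_entities, embed_dim); any joint audio-language model with a
    text tower could play this role.
    """
    emb = clap_text_encoder(entity_prompts(entities))
    return F.normalize(emb, dim=-1)  # unit-norm embeddings for cosine similarity

# Example: a caption "a dog barks while a car engine runs" might yield
# entities = ["dog barking", "car engine"], giving one text target per source.
```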
The framework consists of three main components:
Unsupervised Mix-and-Separate Training: The audio separation model is trained on synthetic mixtures of audio samples, using an unsupervised reconstruction loss.
Weakly Supervised Audio-Language Training: The model is further trained using a bi-modal contrastive loss that aligns the predicted single-source audio with the corresponding language entities. This provides weak supervision for single-source extraction without requiring access to single-source audio samples.
Consistency Reconstruction Loss: An additional consistency loss ensures that the predicted single-source components sum back to the original mixture (see the loss sketch after this list).
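A minimal PyTorch sketch of the three training signals follows, assuming a text-conditioned separator `sep(mixture, condition)` and frozen CLAP-style audio/text embeddings. The function names, the L1 reconstruction terms, and the symmetric InfoNCE form of the contrastive loss are assumptions for illustration, not the authors' exact formulation.

```python
# Sketch of the three losses, under the assumptions stated above.
import torch
import torch.nn.functional as F

def mix_and_separate_loss(sep, mix_a, mix_b, cond_a, cond_b):
    """Unsupervised reconstruction: separate a synthetic mixture of two
    training mixtures back into its two constituents."""
    synthetic = mix_a + mix_b
    rec_a = sep(synthetic, cond_a)
    rec_b = sep(synthetic, cond_b)
    return F.l1_loss(rec_a, mix_a) + F.l1_loss(rec_b, mix_b)

def bimodal_contrastive_loss(audio_emb, text_emb, tau=0.07):
    """Weak supervision: align predicted single-source audio embeddings with
    their language-entity embeddings via a symmetric contrastive loss."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / tau          # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def consistency_loss(pred_sources, mixture):
    """Consistency: predicted single-source components should sum back to
    the original mixture."""
    return F.l1_loss(torch.stack(pred_sources, dim=0).sum(dim=0), mixture)
```

In training, these terms would typically be combined with scalar weights while the CLAP encoders stay frozen; the weighting scheme here is left open, as the summary does not specify it.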
The authors show that this weakly-supervised framework significantly boosts the performance of unsupervised audio separation baselines, achieving up to a 71% improvement in Signal-to-Distortion Ratio (SDR). Furthermore, the framework can also be used to augment supervised audio separation models, yielding a powerful semi-supervised learning approach that outperforms the supervised baseline by up to 17% in SDR.
Extensive experiments are conducted on synthetic mixtures from the MUSIC, VGGSound, and AudioCaps datasets, demonstrating the effectiveness of the proposed approach across different scenarios.