The proposed framework leverages bi-modal semantic similarity between audio and language modalities to generate weak supervision signals for single-source audio extraction, without requiring access to single-source audio samples during training.
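The core idea of deriving weak supervision from cross-modal similarity can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: it assumes pre-computed audio and text embeddings (e.g., from a pretrained audio-language model) and a hypothetical similarity threshold; the function name and shapes are illustrative.

```python
import numpy as np

def weak_labels(audio_emb, text_emb, threshold=0.5):
    """Assign weak presence labels to audio clips by cosine similarity
    between clip embeddings and text embeddings of candidate sound classes.

    audio_emb: (n_clips, d) array of audio embeddings
    text_emb:  (n_classes, d) array of class-description embeddings
    Returns (labels, sim): binary weak labels and the raw similarity matrix.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = a @ t.T                      # (n_clips, n_classes) cosine similarities
    return (sim > threshold).astype(int), sim
```

A clip whose embedding aligns with the text embedding of a class description receives that class as a weak label, so no isolated single-source recordings are needed to supervise the extractor.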
ACES introduces a novel metric for evaluating automated audio captioning systems based on the semantics of sounds.
A convolutional recurrent neural network with an attention module is proposed for continuous distance estimation from audio signals.
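In such architectures, the attention module typically scores each time frame of the recurrent features and pools them into a single vector before regressing the distance. The sketch below shows only this attention-pooling step, with hypothetical shapes and weights; it is not the paper's model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(frames, w_att):
    """Temporal attention pooling over frame-level features.

    frames: (T, D) sequence of per-frame feature vectors (e.g., RNN outputs)
    w_att:  (D,) learned attention projection (assumed given here)
    Returns the (D,) attention-weighted mean feature for distance regression.
    """
    scores = frames @ w_att            # (T,) per-frame attention logits
    weights = softmax(scores)          # (T,) normalized frame weights
    return weights @ frames            # (D,) pooled feature vector
```

A final linear layer on the pooled vector would then produce the continuous distance estimate.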
Context-aware models improve real-time target sound extraction performance.
uaMix-MAE combines instance discrimination with masked autoencoders, improving downstream task performance when labeled data is limited.
The authors propose a new framework for first-shot unsupervised anomalous sound detection that uses metadata-assisted audio generation to estimate unknown anomalies, achieving competitive performance in DCASE 2023 Challenge Task 2.
The authors evaluate autoregressive audio inpainting methods, highlighting the importance of the AR model estimator and model order in achieving high-quality results.
CrossNet introduces a novel DNN architecture for speaker separation, leveraging global and local information to enhance performance in noisy-reverberant environments.