Key Concepts
This paper presents an end-to-end model that combines a speech enhancement module (ConVoiFilter) with an automatic speech recognition (ASR) module to improve recognition accuracy in noisy, crowded environments. The model uses single-channel speech enhancement to isolate the target speaker's voice from background noise, then feeds the enhanced audio into the ASR module.
Summary
The paper presents an end-to-end model for improving automatic speech recognition (ASR) in noisy, crowded environments, such as cocktail party settings. The model consists of two main components:
- Target speaker's voice enhancement (ConVoiFilter):
  - This module aims to remove all noise and interfering speech from the noisy audio input, producing a clean utterance for the target speaker.
  - It uses a speaker encoder module to extract an embedding vector that identifies the target speaker, and then performs cross-extraction of the speaker embedding from the noisy audio.
  - The module then uses a series of conformer blocks to estimate a mask that is applied to the magnitude spectrogram of the noisy audio, effectively enhancing the target speaker's voice.
  - The enhanced audio is reconstructed using the estimated mask and the original phase information.
- Automatic speech recognition (ASR):
  - The ASR module uses a pre-trained wav2vec2 model as the speech encoder, followed by an RNN transducer as the decoder.
  - To address potential noise artifacts in the enhanced audio, the wav2vec2 model is pre-trained on data augmented with noise and room reverb.
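The masking and reconstruction steps of the enhancement module can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes NumPy, a complex STFT array `noisy_stft`, and a real-valued mask in [0, 1]. The function name `apply_mask` is hypothetical, and the mask argument stands in for the output of the conformer blocks, which the sketch does not model.

```python
import numpy as np


def apply_mask(noisy_stft: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Scale the magnitude of a complex STFT by a mask and keep the noisy phase.

    noisy_stft: complex array of shape (freq_bins, frames)
    mask:       real array of the same shape, assumed to lie in [0, 1]
    """
    magnitude = np.abs(noisy_stft)    # magnitude spectrogram of the noisy input
    phase = np.angle(noisy_stft)      # original phase, reused unchanged
    enhanced_magnitude = mask * magnitude
    # Recombine the masked magnitude with the original phase.
    return enhanced_magnitude * np.exp(1j * phase)


# Toy input: random complex "spectrogram" and random mask (illustrative only).
rng = np.random.default_rng(0)
noisy = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))
mask = rng.uniform(0.0, 1.0, size=(4, 5))
enhanced = apply_mask(noisy, mask)
```

An inverse STFT of the result would give the enhanced waveform. Because the phase is taken from the noisy input, only the magnitudes are changed by the mask.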
The authors also propose a joint fine-tuning strategy to optimize the ConVoiFilter and ASR modules together, addressing the challenges of connecting the high-resolution enhancement module to the lower-resolution ASR module.
The model is evaluated on various types of data, including clean audio, noisy audio with and without cross-talk, and audio with only ambient noise or reverberation. The results show that the end-to-end ConVoiFilter-ASR model significantly outperforms the ASR-only models, especially in the presence of cross-talk, achieving a word error rate (WER) of 14.51% compared to 75.19% for the ASR-only model.
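For reference, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between the hypothesis and the reference transcript, divided by the number of reference words. The sketch below shows that standard definition; the function name `wer` and the whitespace tokenization are assumptions for illustration, since this summary does not describe the scoring tool the paper used.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
example = wer("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis contains many extra words, which can happen when an ASR model transcribes interfering speakers.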
The authors also conduct an ablation study to demonstrate the effectiveness of the key components of the ConVoiFilter model, such as the use of the x-vector speaker encoder, the conformer-based mask estimation, and the SI-SNR loss function.
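As background for the SI-SNR loss mentioned above, the following sketch computes the commonly used scale-invariant signal-to-noise ratio between an estimated and a target waveform. It assumes NumPy and 1-D signals of equal length; the function name `si_snr` and the `eps` stabilizer are illustrative choices, and the paper's exact loss formulation is not given in this summary. When used as a training loss, the value is typically negated so that minimizing the loss maximizes SI-SNR.

```python
import numpy as np


def si_snr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB between two 1-D signals of equal length."""
    # Remove the mean so the measure ignores DC offsets.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to get the "signal" component.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = scale * target
    # Whatever is left over is treated as noise.
    e_noise = estimate - s_target
    ratio = (np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps)
    return float(10.0 * np.log10(ratio))


# Toy signals: a sine wave plus two levels of random noise (illustrative only).
rng = np.random.default_rng(0)
t = np.sin(2 * np.pi * 5 * np.linspace(0.0, 1.0, 1000))
noise = rng.standard_normal(1000)
mild = t + 0.1 * noise
heavy = t + 1.0 * noise
```

Here `si_snr(mild, t)` is higher than `si_snr(heavy, t)`, and rescaling an estimate (for example `3 * mild`) leaves the value essentially unchanged, which is the scale invariance the name refers to.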
Statistics
In the "Noisy Audio" column, the ASR noisy model achieves a WER of 84.14%, while the end-to-end ConVoiFilter-ASR noisy model reduces the WER to 25.14%.
In the "Cross-talk" column, the ASR based model has a WER of 50.72%, while the end-to-end ConVoiFilter-ASR noisy model reduces the WER to 13.23%.
On the LibriCSS dataset, the ConVoiFilter + Whisper-large model achieves a WER of 16.83% at a 40% overlap ratio, compared to 32.73% for the Whisper-large model alone.
Quotes
"Our system was evaluated using five different types of data. The first is the 'Clean Audio' which consists of the original audio. The second type is the 'Noisy Audio' which is the output of the data processing pipeline (depicted in Figure 2) applied to the clean audio."
"Generally, the end-to-end models perform better than other models across most input sets, except for clean audio."
"The results clearly indicate that ConVoiFilter significantly improves Whisper, particularly in reducing WER in overlapping audio."