Key Concepts
A novel two-stage framework that cascades target speaker extraction and speech emotion recognition to mitigate the impact of human speech noise on emotion recognition performance.
Summary
The paper proposes a two-stage framework for robust speech emotion recognition (SER) in noisy environments, particularly those with human speech noise.
In the first stage, the framework trains a target speaker extraction (TSE) model to extract the speech of the target speaker from a mixture of speech signals. This TSE model is trained on a large-scale mixed-speech corpus.
In the second stage, the extracted target speaker speech is used for SER training and testing. Two training methods are explored, as sketched in the code after this list:
- TSE-SER-base: The pretrained TSE model is used to denoise the emotional speech corpus, and the denoised corpus is then used to train the SER model.
- TSE-SER-ft: The pretrained TSE model is jointly fine-tuned with the SER model using the mixed emotional speech corpus.
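A minimal PyTorch-style sketch of the two training modes is shown below. The TSEModel and SERModel classes, the enrollment-clip input, and all dimensions are hypothetical placeholders, not the paper's actual architectures or losses; the sketch only illustrates how gradients flow in TSE-SER-base versus TSE-SER-ft.

```python
# Hypothetical placeholder modules; the paper's real TSE/SER networks are not reproduced here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TSEModel(nn.Module):
    """Toy target speaker extraction model: mixture + enrollment clip -> estimated target speech."""
    def __init__(self, dim=256):
        super().__init__()
        self.mix_enc = nn.Linear(1, dim)
        self.spk_enc = nn.Linear(1, dim)
        self.dec = nn.Linear(dim, 1)

    def forward(self, mixture, enrollment):
        m = self.mix_enc(mixture.unsqueeze(-1))                           # (batch, samples, dim)
        e = self.spk_enc(enrollment.unsqueeze(-1)).mean(1, keepdim=True)  # crude speaker embedding
        return self.dec(m * e).squeeze(-1)                                # estimated target speech

class SERModel(nn.Module):
    """Toy speech emotion classifier over the (extracted) waveform."""
    def __init__(self, dim=256, n_emotions=4):
        super().__init__()
        self.enc = nn.Linear(1, dim)
        self.cls = nn.Linear(dim, n_emotions)

    def forward(self, speech):
        return self.cls(self.enc(speech.unsqueeze(-1)).mean(1))           # utterance-level logits

tse, ser = TSEModel(), SERModel()
mixture = torch.randn(2, 16000)     # target speech mixed with an interfering speaker
enrollment = torch.randn(2, 16000)  # enrollment clip identifying the target speaker
labels = torch.tensor([0, 2])       # emotion class labels

# TSE-SER-base: the pretrained TSE model only denoises the corpus; SER trains on its output.
with torch.no_grad():
    denoised = tse(mixture, enrollment)
loss_base = F.cross_entropy(ser(denoised), labels)

# TSE-SER-ft: TSE and SER are fine-tuned jointly, so the emotion loss also updates the extractor.
loss_ft = F.cross_entropy(ser(tse(mixture, enrollment)), labels)
loss_ft.backward()
```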
Experiments show that the proposed framework significantly outperforms baseline SER models that do not use the TSE method, achieving up to a 14.33% improvement in unweighted accuracy (UA). The framework is particularly effective on different-gender speech mixtures.
The key insights are:
- Human speech noise severely degrades the performance of SER models trained on clean data.
- Integrating TSE into the SER framework can effectively mitigate the impact of human speech noise.
- Joint fine-tuning of the TSE and SER models further improves the performance.
- The framework performs better on different-gender speech mixtures compared to same-gender mixtures.
Statistics
The SI-SDR of the noisy speech before being processed by the TSE model is 0.09 dB.
The SI-SDRi for the TSE models of TSE-SER-base and TSE-SER-ft are 7.68 dB and 12.90 dB, respectively.
The SI-SDR of the same-gender mixtures is 0 dB, while that of the different-gender mixtures is 0.02 dB.
The SI-SDRi of the TSE model is 1.09 dB on same-gender mixtures and 5.22 dB on different-gender mixtures.
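These figures follow the standard scale-invariant SDR definition, with SI-SDRi measured as the gain of the extracted signal over the unprocessed mixture. The NumPy sketch below shows the usual computation; it is not taken from the paper's code.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant SDR in dB between an estimated signal and the clean target reference."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the optimally scaled target component.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2))

def si_sdr_improvement(extracted: np.ndarray, mixture: np.ndarray, reference: np.ndarray) -> float:
    """SI-SDRi: how much the TSE output improves over the input mixture."""
    return si_sdr(extracted, reference) - si_sdr(mixture, reference)
```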
Quotes
"Developing a robust speech emotion recognition (SER) system in noisy conditions faces challenges posed by different noise properties."
"Most previous studies have not considered the impact of human speech noise, thus limiting the application scope of SER."
"Our developed system achieves a 14.33% improvement in unweighted accuracy (UA) compared to a baseline without using TSE method, demonstrating the effectiveness of our framework in mitigating the impact of human speech noise."