The paper introduces MFHCA, a novel method for Speech Emotion Recognition (SER) built on a Multi-Spatial Fusion (MF) module and a Hierarchical Cooperative Attention (HCA) module.
The MF module uses parallel convolutional layers to extract features from the log-Mel spectrogram along both the temporal and frequency axes. It also includes a Global Receptive Field (GRF) block that captures dependencies and positional information at multiple scales, helping the network locate emotion-relevant regions.
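As a rough illustration of that design, the sketch below implements two parallel convolution branches over a log-Mel spectrogram in PyTorch. The kernel sizes, channel count, and activation are illustrative assumptions rather than the paper's exact configuration, and the GRF block is omitted since the summary does not detail its internal structure.

```python
import torch
import torch.nn as nn

class MultiSpatialFusion(nn.Module):
    """Sketch of the MF idea: parallel convolutions over the log-Mel
    spectrogram along the temporal and frequency axes, fused into one map.
    Kernel sizes and channel counts are assumptions, not the paper's values."""

    def __init__(self, channels: int = 32):
        super().__init__()
        # Temporal branch: wide kernel along time, narrow along frequency.
        self.temporal = nn.Conv2d(1, channels, kernel_size=(1, 9), padding=(0, 4))
        # Frequency branch: wide kernel along frequency, narrow along time.
        self.frequency = nn.Conv2d(1, channels, kernel_size=(9, 1), padding=(4, 0))
        # 1x1 convolution fuses the concatenated branch outputs.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mels, n_frames) log-Mel spectrogram.
        t = self.act(self.temporal(x))
        f = self.act(self.frequency(x))
        return self.act(self.fuse(torch.cat([t, f], dim=1)))

# Example: a batch of 4 spectrograms with 64 mel bins and 200 frames.
mf = MultiSpatialFusion()
out = mf(torch.randn(4, 1, 64, 200))  # -> (4, 32, 64, 200)
```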
The HCA module hierarchically integrates the MF features with representations from HuBERT, a self-supervised speech representation learning model. It uses a co-attention mechanism to guide the HuBERT features toward the emotion-related regions identified by the MF module.
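The sketch below shows one plausible reading of such a co-attention step in PyTorch, with MF features providing the queries and HuBERT frames the keys and values so the fused output emphasizes MF-identified regions. The query/key/value assignment, feature dimensions, and head count are all assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    """Sketch of the co-attention idea: MF features attend over HuBERT
    frame features. Dimensions and roles here are illustrative assumptions."""

    def __init__(self, mf_dim: int = 256, hubert_dim: int = 768,
                 d_model: int = 256, heads: int = 4):
        super().__init__()
        self.q_proj = nn.Linear(mf_dim, d_model)       # queries from MF features
        self.kv_proj = nn.Linear(hubert_dim, d_model)  # keys/values from HuBERT
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)

    def forward(self, mf_feats: torch.Tensor, hubert_feats: torch.Tensor) -> torch.Tensor:
        # mf_feats: (batch, T_mf, mf_dim); hubert_feats: (batch, T_h, hubert_dim)
        q = self.q_proj(mf_feats)
        kv = self.kv_proj(hubert_feats)
        fused, _ = self.attn(q, kv, kv)  # cross-attention: MF queries, HuBERT keys/values
        return fused                     # (batch, T_mf, d_model)

# Example: 100 MF frames attending over 250 HuBERT frames.
fusion = CoAttentionFusion()
out = fusion(torch.randn(2, 100, 256), torch.randn(2, 250, 768))  # -> (2, 100, 256)
```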
The proposed method is evaluated on the IEMOCAP dataset, where it improves weighted accuracy (WA) by 2.6% and unweighted accuracy (UA) by 1.87% over existing state-of-the-art approaches. The authors also conduct extensive ablation studies demonstrating the contribution of the MF and HCA modules.
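For reference, WA is the overall fraction of utterances classified correctly, while UA averages per-class recall so each emotion class counts equally regardless of how many utterances it has; both are standard for IEMOCAP. The snippet below computes them (the four-class label mapping is only illustrative).

```python
import numpy as np

def weighted_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """WA: fraction of all utterances classified correctly."""
    return float(np.mean(y_true == y_pred))

def unweighted_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """UA: mean of per-class recalls, so each emotion class
    contributes equally regardless of its utterance count."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))

# Illustrative labels (0=angry, 1=happy, 2=neutral, 3=sad):
y_true = np.array([0, 0, 1, 2, 2, 2, 3])
y_pred = np.array([0, 1, 1, 2, 2, 0, 3])
print(weighted_accuracy(y_true, y_pred))    # 5/7 ~= 0.714
print(unweighted_accuracy(y_true, y_pred))  # mean(0.5, 1.0, 2/3, 1.0) ~= 0.792
```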
Key ideas extracted from the source content at arxiv.org, by Xinxin Jiao et al., 04-23-2024: https://arxiv.org/pdf/2404.13509.pdf