The paper introduces MFHCA, a novel method for Speech Emotion Recognition (SER) that employs a Multi-Spatial Fusion module (MF) and a Hierarchical Cooperative Attention module (HCA).
The MF module uses parallel convolutional layers to extract features from the log Mel spectrogram in both temporal and frequency directions. It also includes a Global Receptive Field (GRF) block to capture dependencies and positional information in different scale spaces, helping the network locate emotional information.
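The parallel-branch idea can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a log Mel spectrogram of shape (Mel bins, frames), uses one fixed 1×k averaging kernel along time and one k×1 kernel along frequency in place of learned filters, and fuses the two branches by stacking them as channels. The function names, kernel size, and input shape are hypothetical, and the GRF block is omitted.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Naive 2-D cross-correlation with zero padding (odd kernel sizes only),
    so the output has the same shape as the input."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def parallel_branches(log_mel, k=3):
    """Apply a temporal and a frequency kernel in parallel and stack the results."""
    temporal_kernel = np.full((1, k), 1.0 / k)   # spans k adjacent frames
    frequency_kernel = np.full((k, 1), 1.0 / k)  # spans k adjacent Mel bins
    temporal = conv2d_same(log_mel, temporal_kernel)
    frequency = conv2d_same(log_mel, frequency_kernel)
    return np.stack([temporal, frequency], axis=0)  # (2, n_mels, n_frames)

rng = np.random.default_rng(0)
log_mel = rng.standard_normal((40, 100))  # hypothetical: 40 Mel bins, 100 frames
features = parallel_branches(log_mel)
print(features.shape)  # (2, 40, 100)
```

In a trained network the kernels would be learned and each branch would produce many channels; the sketch only shows how the two kernel orientations look at the same spectrogram along different axes.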
The HCA module hierarchically integrates the features from the MF module with those from HuBERT, a self-supervised speech representation learning model. It uses a co-attention mechanism to guide the HuBERT features to focus on the emotion-related regions identified by the MF module.
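As a rough illustration of how one feature stream can steer attention over another, the sketch below uses plain single-head scaled dot-product cross-attention in NumPy. It is a stand-in, not the paper's HCA module: it assumes both streams have already been projected to a shared dimension, takes queries from the MF features and keys and values from the HuBERT features, and leaves out learned projections and the hierarchical stacking. All names, shapes, and the random inputs are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ values, weights

rng = np.random.default_rng(0)
mf_feats = rng.standard_normal((50, 64))       # hypothetical: 50 spectrogram positions
hubert_feats = rng.standard_normal((120, 64))  # hypothetical: 120 speech frames

# Assumption: MF features act as queries, so each spectrogram position
# gathers the speech-model frames most similar to it.
attended, weights = cross_attention(mf_feats, hubert_feats, hubert_feats)
print(attended.shape, weights.shape)  # (50, 64) (50, 120)
```

Each row of `weights` shows how strongly one MF position draws on each speech-model frame, which is the sense in which one stream guides the other.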
The proposed method is evaluated on the IEMOCAP dataset and achieves 2.6% and 1.87% improvements in weighted accuracy and unweighted accuracy, respectively, compared to existing state-of-the-art approaches. The authors also conduct extensive ablation studies to demonstrate the effectiveness of the MF and HCA modules.
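The two reported metrics differ in how they treat class imbalance. Under the convention common in SER work, weighted accuracy is the overall fraction of correctly classified utterances, while unweighted accuracy is the mean of the per-class recalls. The sketch below assumes that convention; the function names and the toy labels are illustrative and are not taken from the paper.

```python
from collections import defaultdict

def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by total utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def unweighted_accuracy(y_true, y_pred):
    """Mean per-class recall: every class counts equally, whatever its size."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Toy, imbalanced example (made-up labels, not IEMOCAP results)
y_true = ["neutral"] * 6 + ["angry"] * 2 + ["sad"] * 2
y_pred = ["neutral"] * 6 + ["angry", "neutral"] + ["neutral", "neutral"]

print(weighted_accuracy(y_true, y_pred))    # 0.7
print(unweighted_accuracy(y_true, y_pred))  # 0.5
```

A classifier that favors the majority class scores well on weighted accuracy but poorly on unweighted accuracy, which is why both are usually reported together.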
Key insights distilled from: Xinxin Jiao,... at arxiv.org, 04-23-2024
https://arxiv.org/pdf/2404.13509.pdf