Key Concepts
A novel method for Speech Emotion Recognition using Multi-Spatial Fusion and Hierarchical Cooperative Attention on spectrograms and raw audio to efficiently identify emotion-related regions and integrate higher-level acoustic information.
Summary
The paper introduces MFHCA, a novel method for Speech Emotion Recognition (SER) that employs a Multi-Spatial Fusion module (MF) and a Hierarchical Cooperative Attention module (HCA).
The MF module uses parallel convolutional layers to extract features from the log Mel spectrogram in both temporal and frequency directions. It also includes a Global Receptive Field (GRF) block to capture dependencies and positional information in different scale spaces, helping the network locate emotional information.
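The parallel temporal/frequency extraction can be illustrated with a minimal numpy sketch. The kernel sizes, random weights, and summation-based fusion are assumptions for illustration, not the paper's exact architecture:

```python
import numpy as np

def parallel_conv(logmel, k_t=9, k_f=9):
    # Hypothetical MF-style parallel convolutions: one 1-D kernel sweeps
    # along the time axis, another along the frequency axis, and the two
    # feature maps are fused (here, by simple summation).
    rng = np.random.default_rng(0)
    w_t = rng.standard_normal(k_t) / k_t   # temporal-direction kernel
    w_f = rng.standard_normal(k_f) / k_f   # frequency-direction kernel
    # convolve each frequency row along time ('same' padding keeps shape)
    feat_t = np.stack([np.convolve(row, w_t, mode="same") for row in logmel])
    # convolve each time column along frequency
    feat_f = np.stack([np.convolve(col, w_f, mode="same") for col in logmel.T]).T
    return feat_t + feat_f                  # fused multi-spatial features

mel = np.random.default_rng(1).standard_normal((40, 300))  # (mel bins, frames)
fused = parallel_conv(mel)
print(fused.shape)  # (40, 300)
```

In the actual model these would be learned 2-D convolutional layers with multiple channels; the sketch only shows why operating along both axes captures temporal and spectral context separately before fusion.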
The HCA module hierarchically integrates the features from the MF module and the Hubert model, a self-supervised speech representation learning model. The HCA uses a co-attention mechanism to guide the Hubert features to focus on the emotion-related regions identified by the MF module.
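A generic scaled dot-product cross-attention conveys the idea of the co-attention step: MF features act as queries so that the Hubert representation is re-weighted toward the emotion-related regions. The shapes, feature dimension, and single-head form are illustrative assumptions, not the paper's exact HCA design:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(hubert_feats, mf_feats):
    # mf_feats: (T_mf, d) queries from the MF module
    # hubert_feats: (T_h, d) keys/values from the Hubert model
    d = hubert_feats.shape[-1]
    scores = mf_feats @ hubert_feats.T / np.sqrt(d)  # (T_mf, T_h)
    weights = softmax(scores, axis=-1)               # attention over Hubert frames
    return weights @ hubert_feats                    # (T_mf, d) guided features

rng = np.random.default_rng(0)
out = co_attention(rng.standard_normal((150, 768)),  # Hubert frames
                   rng.standard_normal((300, 768)))  # MF frames
print(out.shape)  # (300, 768)
```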
The proposed method is evaluated on the IEMOCAP dataset and achieves 2.6% and 1.87% improvements in weighted accuracy and unweighted accuracy, respectively, compared to existing state-of-the-art approaches. The authors also conduct extensive ablation studies to demonstrate the effectiveness of the MF and HCA modules.
Statistics
The IEMOCAP dataset consists of 10 actors engaged in 5 dyadic sessions, each featuring a unique pair of male and female actors.
The audio segments are preprocessed to be 3 seconds long, with zero padding for shorter segments.
Spectrograms are extracted using a Hamming window with a window length of 40ms and a window shift of 10ms, using the first 200 DFT points as input features.
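The stated preprocessing (40 ms Hamming window, 10 ms shift, first 200 DFT points) can be sketched directly; a 16 kHz sampling rate is an assumption, since the summary does not state it:

```python
import numpy as np

def spectrogram(wave, sr=16000, win_ms=40, hop_ms=10, n_points=200):
    # Frame the signal, apply a Hamming window, take the DFT, and keep
    # the first 200 points, as described in the preprocessing above.
    win = int(sr * win_ms / 1000)   # 640 samples at 16 kHz (assumed sr)
    hop = int(sr * hop_ms / 1000)   # 160 samples
    n_frames = 1 + (len(wave) - win) // hop
    frames = np.stack([wave[i * hop : i * hop + win] for i in range(n_frames)])
    frames = frames * np.hamming(win)
    spec = np.abs(np.fft.rfft(frames, axis=1))  # magnitude spectrum
    return spec[:, :n_points]                   # (n_frames, 200)

wave = np.zeros(3 * 16000)  # a 3-second (zero-padded) clip
S = spectrogram(wave)
print(S.shape)  # (297, 200)
```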
Hubert features correspond to the output of the final hidden layer of the Hubert model.
Quotes
"The outstanding performance of speech self-supervised learning in downstream tasks such as automatic speech recognition (ASR) has opened up new avenues for developing SER."
"We propose a novel spectrum-based lightweight feature extraction module, denoted as Multi-Spatial Fusion module (MF), which captures dependencies and positional information in different scale spaces, aiding the network in locating emotional information."
"We employ the Hubert model as a feature extractor without fine-tuning, and the learned features contain rich information. In addition to emotion recognition-related information, information is also relevant to other downstream tasks."