Key Concepts
The proposed GMP-ATL framework leverages gender-augmented multi-scale pseudo-labels and adaptive transfer learning with the pre-trained HuBERT model to significantly improve speech emotion recognition performance.
Summary
The paper introduces a novel speech emotion recognition (SER) framework called GMP-ATL (Gender-augmented Multi-scale Pseudo-label Adaptive Transfer Learning), which aims to enhance SER performance by incorporating high-quality frame-level emotional pseudo-labels and comprehensively utilizing both frame-level and utterance-level emotion labels.
The key aspects of the GMP-ATL framework are:
- Frame-level GMPs Extraction:
  - Employs the pre-trained HuBERT model as a feature extractor.
  - Incorporates a multi-task learning strategy to capture gender information along with emotion.
  - Performs multi-scale unsupervised k-means clustering on the third-to-last layer features of HuBERT to generate gender-augmented multi-scale pseudo-labels (GMPs).
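The pseudo-label generation step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the cluster counts `ks=(4, 8)`, the toy k-means, and the encoding of gender as a label offset are all assumptions made for clarity; the paper clusters HuBERT's third-to-last layer features at multiple scales.

```python
import numpy as np

def kmeans(features, k, n_iter=20, seed=0):
    """Plain k-means: returns one cluster id per frame for [T, D] features."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each frame to its nearest center.
        dists = np.linalg.norm(features[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Update centers; keep the old center if a cluster is empty.
        for c in range(k):
            if np.any(labels == c):
                centers[c] = features[labels == c].mean(axis=0)
    return labels

def gender_augmented_multiscale_labels(features, gender, ks=(4, 8)):
    """Cluster the same frame features at several granularities (ks) and
    offset each cluster id by the utterance's gender (0 or 1), so frames
    from different genders never share a pseudo-label (an assumed encoding)."""
    out = []
    for k in ks:
        labels = kmeans(features, k)
        out.append(labels + gender * k)
    return out
```

Each entry of the returned list is one "scale" of frame-level pseudo-labels; the gender offset is what makes them gender-augmented.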
- Frame-level GMPs based Retraining:
  - Uses the obtained frame-level GMPs to retrain and optimize the HuBERT-based SER model.
  - Aligns the retraining process with the original pre-training objectives of the HuBERT model.
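Aligning with HuBERT's pre-training objective means predicting the frame-level pseudo-labels at masked positions. A minimal numpy sketch of that loss, assuming logits have already been produced by the model (the shapes and masking scheme here are illustrative):

```python
import numpy as np

def masked_pseudo_label_loss(logits, pseudo_labels, mask):
    """Cross-entropy over masked frames only, mirroring HuBERT's
    masked-prediction objective but with GMPs as the targets.
    logits: [T, C] frame-level class scores; mask: boolean [T]."""
    masked_logits = logits[mask]
    targets = pseudo_labels[mask]
    # Numerically stable log-softmax over the class dimension.
    z = masked_logits - masked_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Average negative log-likelihood of the pseudo-label targets.
    return -log_probs[np.arange(len(targets)), targets].mean()
```

Unmasked frames contribute nothing to the loss, which is what ties this retraining step to HuBERT's original masked-prediction setup.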
- Utterance-level Emotion Label based Fine-tuning:
  - Employs the Additive Margin Softmax (AMS) loss to fine-tune the HuBERT-based SER model using utterance-level emotion labels.
  - Increases the inter-class distance between different emotion categories and reduces the intra-class distance within the same emotion category.
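The AMS loss achieves this by working on cosine similarities: it subtracts a margin from the target class before the softmax, forcing embeddings of the same emotion closer to their class direction. A self-contained sketch; the scale `s=30.0` and margin `m=0.35` are common defaults, not values taken from the paper:

```python
import numpy as np

def am_softmax_loss(embeddings, weights, labels, s=30.0, m=0.35):
    """Additive Margin Softmax: cosine logits, margin m subtracted from
    the target class, scaled by s, then cross-entropy."""
    # L2-normalize embeddings and class weights -> cosine similarity logits.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = e @ w.T                                  # [N, C]
    cos[np.arange(len(labels)), labels] -= m       # penalize the target class
    z = s * cos
    # Numerically stable log-softmax, then mean negative log-likelihood.
    z = z - z.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

Because the margin is applied only to the correct class, minimizing this loss pushes each class's cosine similarity above all others by at least m, which is exactly the larger inter-class and smaller intra-class distance described above.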
Experiments on the IEMOCAP dataset show that the proposed GMP-ATL framework outperforms state-of-the-art unimodal SER methods and achieves results comparable to multimodal SER approaches. The ablation study further validates the effectiveness of the key components in the GMP-ATL workflow.
Statistics
"The IEMOCAP dataset consists of 5,531 utterances with a total duration of 419 minutes, labeled with four emotion categories: angry, happy, neutral, and sad."
"The proposed GMP-ATL framework achieves a Weighted Average Recall (WAR) of 80.0% and an Unweighted Average Recall (UAR) of 82.0% on the IEMOCAP dataset, surpassing state-of-the-art unimodal SER methods."
Quotes
"Experiments on the IEMOCAP corpus indicate that our proposed GMP-ATL framework not only outperforms SOTA unimodal SER methods but achieves competitive performance compared to multimodal SER approaches."
"We demonstrate that incorporating reasonable frame-level gender-augmented multi-scale pseudo-labels can effectively enhance the recognition performance of the proposed workflow."