Optimizing joint CNN and SeqNN architectures using DARTS enhances SER performance.
This paper proposes Vesper, a compact and effective pretrained model for speech emotion recognition built upon the general pretrained model WavLM. Vesper employs an emotion-guided masking strategy together with hierarchical and cross-layer self-supervision to sharpen its sensitivity to emotional information and to capture both the acoustic and the semantic representations crucial for emotion recognition.
A novel method for speech emotion recognition applies Multi-Spatial Fusion and Hierarchical Cooperative Attention to spectrograms and raw audio, efficiently identifying emotion-related regions and integrating higher-level acoustic information.
The proposed AFTER framework leverages task adaptation pre-training and active learning to enhance the performance and efficiency of speech emotion recognition models, addressing the information gap, noise sensitivity, and low efficiency issues of existing methods.
The proposed GMP-ATL framework leverages gender-augmented multi-scale pseudo-labels and adaptive transfer learning with the pre-trained HuBERT model to significantly improve speech emotion recognition performance.
Applying efficient channel attention (ECA) and data augmentation with different STFT preprocessing settings can significantly improve speech emotion recognition performance.
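Efficient channel attention (ECA) gates each feature channel by a weight computed from a 1-D convolution over the channel-wise global-average-pooled descriptor. A minimal NumPy sketch of that mechanism is below; the fixed averaging kernel stands in for the learned 1-D convolution weights, and the specific STFT preprocessing and augmentation settings from the summarized paper are not reproduced here.

```python
import numpy as np

def eca(x, k=3):
    """Illustrative ECA-style channel gating on a feature map x of shape (C, H, W).

    k is the 1-D convolution kernel size over channels (odd). In a real model the
    kernel weights are learned; here a simple averaging kernel is used as a stand-in.
    """
    # Global average pooling over the spatial dimensions -> channel descriptor (C,)
    y = x.mean(axis=(1, 2))
    # 1-D convolution across the channel dimension with "same" padding
    w = np.ones(k) / k  # assumed fixed weights for illustration; learned in practice
    y = np.convolve(np.pad(y, k // 2, mode="edge"), w, mode="valid")
    # Sigmoid gate in (0, 1), then rescale each channel of the input
    g = 1.0 / (1.0 + np.exp(-y))
    return x * g[:, None, None]
```

The appeal of ECA over full squeeze-and-excitation blocks is that the cross-channel interaction is local (kernel size k) and adds almost no parameters, which is why it pairs well with lightweight SER front-ends.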
TBDM-Net, a novel deep neural network architecture, achieves state-of-the-art performance in speech emotion recognition across multiple multilingual datasets by leveraging temporally-aware bidirectional dense networks and multi-scale feature fusion.
The choice of emotional labels elicited by different modalities (audio-only, facial-only, audio-visual) can significantly impact the performance of speech emotion recognition (SER) systems.
Leveraging large language models (LLMs) with carefully designed prompting strategies incorporating context and multiple ASR system outputs significantly improves post-ASR speech emotion recognition accuracy without task-specific training.