Core Concepts
Fusion approaches for emotion recognition from speech that combine acoustic and text-based features perform better when transcripts are represented with BERT embeddings rather than GloVe embeddings.
Abstract
The paper discusses fusion approaches for emotion recognition from speech that combine acoustic and text-based features. It highlights the advantage of BERT embeddings over GloVe embeddings, the impact of different training strategies for the fusion models, and the importance of how folds are defined in datasets like IEMOCAP. The study compares models on the IEMOCAP and MSP-PODCAST datasets, showing the benefit of combining audio and text information for emotion recognition.
Models:
Individual models for audio and text-based systems are described.
Fusion models using early and late fusion techniques are explored.
Different training strategies for fusion models are discussed.
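The early- and late-fusion techniques mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes early fusion means concatenating per-utterance feature vectors before a joint classifier, and late fusion means combining the class posteriors of the two unimodal models.

```python
import numpy as np

def early_fusion(audio_feats, text_feats):
    # Early fusion (assumed form): concatenate the unimodal feature
    # vectors so a single classifier sees both modalities.
    return np.concatenate([audio_feats, text_feats], axis=-1)

def late_fusion(audio_probs, text_probs, weight=0.5):
    # Late fusion (assumed form): weighted average of the class
    # posteriors produced by the separate audio and text models.
    fused = weight * audio_probs + (1 - weight) * text_probs
    return fused / fused.sum(axis=-1, keepdims=True)

# Hypothetical 4-class posteriors for one utterance.
audio_probs = np.array([0.1, 0.6, 0.2, 0.1])
text_probs  = np.array([0.2, 0.2, 0.5, 0.1])
print(late_fusion(audio_probs, text_probs))  # still sums to 1
```

In a real system the fused representation from `early_fusion` would feed a trained classifier; the `weight` in `late_fusion` could itself be tuned on a validation set.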
Experimental Setup and Datasets:
Experiments were conducted on IEMOCAP and MSP-PODCAST datasets.
The impact of fold criteria on IEMOCAP results is analyzed.
Details on data preprocessing, model training, and evaluation metrics are provided.
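Since the fold criteria for IEMOCAP are analyzed above, here is a minimal sketch of one common convention: leave-one-session-out folds, which keep splits speaker-independent because each IEMOCAP session involves a distinct pair of speakers. The utterance IDs below are hypothetical examples of the dataset's naming scheme, not actual data.

```python
# Hypothetical (utterance_id, label) pairs in IEMOCAP's naming style.
utterances = [
    ("Ses01F_impro01_F000", "happy"),
    ("Ses01M_impro02_M001", "sad"),
    ("Ses02F_script01_F003", "angry"),
    ("Ses05M_impro03_M002", "neutral"),
]

def session_of(utt_id):
    # IEMOCAP utterance IDs begin with "SesNN", where NN is the session number.
    return int(utt_id[3:5])

def leave_one_session_out(utts, test_session):
    # Speaker-independent split: an entire session (both of its speakers)
    # is held out for testing, so no speaker appears in train and test.
    train = [(u, y) for u, y in utts if session_of(u) != test_session]
    test = [(u, y) for u, y in utts if session_of(u) == test_session]
    return train, test

train, test = leave_one_session_out(utterances, test_session=1)
```

Iterating `test_session` over all five sessions yields a 5-fold cross-validation in which results are not inflated by speaker overlap.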
Results and Discussion:
Performance results for the audio model and the GloVe- and BERT-based text models are compared.
Fusion models show significant improvements in emotion recognition.
The advantage of BERT embeddings over GloVe embeddings is highlighted.
The importance of seed selection in model training is emphasized.
Conclusions:
The study demonstrates the benefits of using BERT embeddings and of fusing the audio and text modalities for emotion recognition.
Recommendations for defining folds in datasets like IEMOCAP are proposed.
The impact of different training strategies and fusion techniques is discussed.
Stats
We propose to obtain contextualized word embeddings with BERT to represent the information contained in speech transcriptions.
Fusing acoustic and text-based systems is beneficial on both datasets.
The AvRec improves by a relative 16% over the best-performing single model when audio and text information are combined.
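The AvRec metric cited above can be illustrated with a short sketch. The assumption here is that AvRec denotes the unweighted average of per-class recalls (often called unweighted average recall in SER work, where it is preferred over accuracy because emotion classes are imbalanced); the labels below are made up for illustration.

```python
from collections import defaultdict

def average_recall(y_true, y_pred):
    # Unweighted average recall: mean of per-class recalls, so every
    # class counts equally regardless of how many samples it has.
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

y_true = ["angry", "angry", "happy", "sad", "sad", "sad"]
y_pred = ["angry", "happy", "happy", "sad", "sad", "angry"]
print(average_recall(y_true, y_pred))  # (0.5 + 1.0 + 2/3) / 3
```

A relative 16% improvement then means, for example, moving from an AvRec of 0.50 to roughly 0.58.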
Quotes
"We demonstrated the positive effect of representing linguistic information using contextualized word embeddings extracted with BERT compared to using standard word embeddings like those extracted with Glove."
"Adding text information to audio-based SER systems gives significant performance improvements."