Core Concepts
Utilizing audio data for emotional mimicry intensity prediction yields significant improvements over the established baseline, and even outperforms audio combined with facial images.
Abstract
This study proposes a methodology for Emotional Mimicry Intensity (EMI) estimation built on a Wav2Vec 2.0 model pre-trained on a podcast dataset. The approach fuses linguistic and paralinguistic elements to enrich the feature representation, and a Long Short-Term Memory (LSTM) network performs temporal analysis of the audio, yielding improvements over the baseline. The dataset consists of audiovisual recordings of participants mimicking videos and rating them on a scale, with annotations provided for the training and validation sets. The study also highlights the imbalanced distribution of the regression targets, which makes rare extreme values difficult to predict accurately.
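As a rough illustration of the audio pipeline, the sketch below extracts frame-level Wav2Vec 2.0 features with the HuggingFace transformers library. The checkpoint name is a public stand-in, since the paper's podcast-pre-trained weights are not identified here, and the helper name extract_features is hypothetical.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Placeholder checkpoint: a public base model stands in for the paper's
# podcast-pre-trained weights, which are not named in this summary.
CKPT = "facebook/wav2vec2-base-960h"

extractor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)
model = Wav2Vec2Model.from_pretrained(CKPT)
model.eval()

def extract_features(waveform: torch.Tensor, sr: int = 16_000) -> torch.Tensor:
    """Return frame-level audio features of shape (T, hidden_size)."""
    inputs = extractor(waveform.numpy(), sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(inputs.input_values).last_hidden_state  # (1, T, H)
    return hidden.squeeze(0)
```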
Concretely, features from the pre-trained Wav2Vec 2.0 model and its Valence-Arousal-Dominance (VAD) module are fed into an LSTM that integrates a global context vector. Several configurations are compared, underscoring the value of incorporating global context; all of them outperform the baseline, indicating the effectiveness of the proposed approach.
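The following is a minimal sketch of one plausible reading of "global context vector integration": mean-pool the frame features into a clip-level vector and concatenate it to every frame before the LSTM. Layer sizes, the pooling choice, and the sigmoid output head are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GlobalContextLSTM(nn.Module):
    """Sketch: fuse each frame with a mean-pooled global context vector,
    run an LSTM over the sequence, regress six emotion intensities."""

    def __init__(self, feat_dim: int = 768, hidden: int = 256, n_emotions: int = 6):
        super().__init__()
        # Input is frame features concatenated with the context vector.
        self.lstm = nn.LSTM(feat_dim * 2, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_emotions)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, feat_dim) frame-level Wav2Vec 2.0 features
        ctx = x.mean(dim=1, keepdim=True).expand_as(x)   # global context vector
        out, _ = self.lstm(torch.cat([x, ctx], dim=-1))  # (batch, T, hidden)
        # Sigmoid keeps intensities in [0, 1]; an assumed output range.
        return torch.sigmoid(self.head(out[:, -1]))

# Usage: a batch of 4 clips, 200 frames each -> (4, 6) intensity predictions
model = GlobalContextLSTM()
preds = model(torch.randn(4, 200, 768))
```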
The study shifts focus from facial expressions to audio-based analysis for emotional mimicry estimation, finding that adding facial images actually decreased performance relative to audio-only results. This suggests a unique potential of audio for emotional analysis and points to future work on how best to integrate modalities.
Stats
Training set: 8072 videos
Validation set: 4588 videos
Test set: 4582 videos
Six annotated emotions: Admiration, Amusement, Determination, Empathic Pain, Excitement, Joy
Pearson’s Correlation Coefficient used for performance evaluation (see the sketch below)
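A short evaluation sketch using SciPy: averaging Pearson's r across the six emotion dimensions is a common convention for this kind of metric and an assumption here, as the summary does not spell out the aggregation.

```python
import numpy as np
from scipy.stats import pearsonr

EMOTIONS = ["Admiration", "Amusement", "Determination",
            "Empathic Pain", "Excitement", "Joy"]

def mean_pearson(preds: np.ndarray, targets: np.ndarray) -> float:
    """Average Pearson's r over the six emotions.
    preds/targets: arrays of shape (n_videos, 6)."""
    rhos = [pearsonr(preds[:, i], targets[:, i])[0]
            for i in range(len(EMOTIONS))]
    return float(np.mean(rhos))

# Illustration only, on random data:
rng = np.random.default_rng(0)
p, t = rng.random((100, 6)), rng.random((100, 6))
print(mean_pearson(p, t))
```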
Quotes
"Our approach demonstrates significant improvements over the established baseline."
"Incorporating global context vector led to further improvements in our model."
"Results show outperformance of the baseline across various configurations."