The paper presents a novel approach called MultiMAE-DER (Multimodal Masked Autoencoder for Dynamic Emotion Recognition) that utilizes self-supervised learning to process multimodal data for dynamic emotion recognition.
The key highlights are:
MultiMAE-DER builds upon the VideoMAE model, extending the single-modal visual input to include both visual and audio elements. It employs a pre-trained video masked autoencoder as the backbone encoder.
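As a rough illustration of this design (not the authors' code), the sketch below renders audio as spectrogram "frames" so that a single video-MAE-style encoder can tokenize and attend over both modalities as one sequence. The transformer here is an untrained stand-in for the pre-trained VideoMAE backbone, and all shapes, names, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the MultiMAE-DER input idea:
# audio is rendered as spectrogram "frames" so a single video masked
# autoencoder can encode visual and audio tokens as one sequence.
import torch
import torch.nn as nn

class MultiModalMAEEncoder(nn.Module):
    def __init__(self, patch=16, dim=768, depth=2, heads=12):
        super().__init__()
        # Patchify each frame into non-overlapping 16x16 patch embeddings.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)  # stand-in for pre-trained VideoMAE

    def forward(self, frames):                          # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = self.patch_embed(frames.flatten(0, 1))      # (B*T, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)                 # (B*T, N, dim) patch tokens
        x = x.reshape(b, t * x.shape[1], -1)             # one spatio-temporal token sequence
        return self.encoder(x)                           # (B, T*N, dim)

video = torch.randn(2, 4, 3, 112, 112)   # 4 RGB frames per clip
audio = torch.randn(2, 4, 3, 112, 112)   # 4 spectrogram slices shaped like frames
fused = torch.cat([video, audio], dim=1) # one possible fusion ordering (see below)
tokens = MultiModalMAEEncoder()(fused)
print(tokens.shape)                      # torch.Size([2, 392, 768]): 8 frames x 49 patches
```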
The paper explores six different multimodal sequence fusion strategies to optimize the performance of MultiMAE-DER. These strategies aim to capture the dynamic feature correlations within cross-domain data across spatial, temporal, and spatio-temporal sequences.
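The snippet below is a hedged illustration of how such sequence-fusion orderings can be expressed as simple tensor operations. The three variants shown (temporal, spatial, and spatio-temporal interleaving) and their labels are assumptions for illustration, not the paper's exact six strategies.

```python
# Hedged illustration of multimodal sequence-fusion orderings
# (illustrative only; not the paper's actual fusion definitions).
import torch

video = torch.randn(2, 4, 3, 112, 112)   # (B, T, C, H, W) RGB frames
audio = torch.randn(2, 4, 3, 112, 112)   # spectrogram slices shaped like frames

# Temporal fusion: append the audio "frames" after the video frames (longer clip).
temporal = torch.cat([video, audio], dim=1)        # (2, 8, 3, 112, 112)

# Spatial fusion: place each audio slice beside its video frame (wider frame).
spatial = torch.cat([video, audio], dim=4)         # (2, 4, 3, 112, 224)

# Spatio-temporal fusion: interleave the two modalities frame by frame.
interleaved = torch.stack([video, audio], dim=2)   # (2, 4, 2, 3, 112, 112)
spatio_temporal = interleaved.flatten(1, 2)        # (2, 8, 3, 112, 112), order v0,a0,v1,a1,...

print(temporal.shape, spatial.shape, spatio_temporal.shape)
```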
Experiments on the RAVDESS, CREMA-D, and IEMOCAP datasets show that MultiMAE-DER outperforms state-of-the-art supervised and self-supervised multimodal models for dynamic emotion recognition. The optimal strategy (FSLF) achieves up to 4.41%, 2.06%, and 1.86% higher weighted average recall (WAR) on the respective datasets.
The authors conclude that fusing multimodal data along spatio-temporal sequences significantly improves model performance by capturing correlations between cross-domain data. The self-supervised masked-autoencoder training also positions the framework as an efficient learner for contextual semantic inference problems.
Source: Peihao Xiang et al., arxiv.org, 04-30-2024, https://arxiv.org/pdf/2404.18327.pdf