The paper presents a novel approach called MultiMAE-DER (Multimodal Masked Autoencoder for Dynamic Emotion Recognition) that utilizes self-supervised learning to process multimodal data for dynamic emotion recognition.
The key highlights are:
MultiMAE-DER builds on the VideoMAE model, extending its single-modality visual input to joint visual and audio inputs, and employs a pre-trained video masked autoencoder as the backbone encoder.
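To make this extension concrete, here is a minimal PyTorch sketch of a video-MAE-style encoder that routes both video frames and audio spectrogram "frames" through one shared patch-embedding and Transformer pipeline. The class name, layer sizes, and tokenization are illustrative assumptions, and the MAE masking/decoder stages are omitted; this is not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultimodalMAEEncoder(nn.Module):
    """Video-MAE-style encoder extended to joint visual + audio input.

    The audio track is assumed to be pre-converted into spectrogram
    "frames" so both modalities share one patch-embedding and
    Transformer pipeline. Layer sizes are illustrative only.
    """

    def __init__(self, patch=16, dim=768, depth=4, heads=12):
        super().__init__()
        # Shared patch embedding for 3-channel video frames and
        # spectrograms tiled to 3 channels.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def tokenize(self, frames):
        # frames: (B, T, 3, H, W) -> (B, T * patches_per_frame, dim)
        B, T = frames.shape[:2]
        x = self.patch_embed(frames.flatten(0, 1))   # (B*T, dim, h, w)
        x = x.flatten(2).transpose(1, 2)             # (B*T, h*w, dim)
        return x.reshape(B, T * x.shape[1], x.shape[2])

    def forward(self, video, audio_spec):
        # video: (B, T, 3, H, W); audio_spec: (B, T, 1, H, W)
        audio_frames = audio_spec.repeat(1, 1, 3, 1, 1)  # tile to 3 channels
        tokens = torch.cat(
            [self.tokenize(video), self.tokenize(audio_frames)], dim=1
        )
        return self.encoder(tokens)

# Usage: 2 clips, 4 frames each, 224x224 video and spectrogram "frames".
enc = MultimodalMAEEncoder()
features = enc(torch.randn(2, 4, 3, 224, 224), torch.randn(2, 4, 1, 224, 224))
```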
The paper explores six different multimodal sequence fusion strategies to optimize the performance of MultiMAE-DER. These strategies aim to capture the dynamic feature correlations within cross-domain data across spatial, temporal, and spatio-temporal sequences.
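The six strategies themselves are not enumerated in this summary, so the sketch below (hypothetical function and strategy names) only illustrates the underlying idea: the same pair of per-frame token sequences can be fused in different orderings, each exposing different cross-domain correlations to the encoder.

```python
import torch

def fuse_sequences(video_tokens, audio_tokens, strategy="late"):
    """Three illustrative orderings for fusing per-frame token sequences.

    video_tokens, audio_tokens: (B, T, D). These are NOT the paper's six
    named strategies; they only sketch how cross-domain sequences can be
    combined along spatial, temporal, or feature dimensions.
    """
    if strategy == "late":
        # Whole-sequence concatenation: [V1..VT, A1..AT]
        return torch.cat([video_tokens, audio_tokens], dim=1)
    if strategy == "interleave":
        # Per-time-step alternation: [V1, A1, V2, A2, ...]
        B, T, D = video_tokens.shape
        return torch.stack([video_tokens, audio_tokens], dim=2).reshape(B, 2 * T, D)
    if strategy == "early":
        # Feature-level merge at each time step: (B, T, 2*D)
        return torch.cat([video_tokens, audio_tokens], dim=-1)
    raise ValueError(f"unknown strategy: {strategy}")
```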
Experiments on the RAVDESS, CREMA-D, and IEMOCAP datasets show that MultiMAE-DER outperforms state-of-the-art supervised and self-supervised multimodal models for dynamic emotion recognition. The optimal strategy (FSLF) achieves up to 4.41%, 2.06%, and 1.86% higher weighted average recall (WAR) on the respective datasets.
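For reference, weighted average recall weights each class's recall by that class's sample count, which for single-label classification reduces to overall accuracy; a small NumPy sketch:

```python
import numpy as np

def weighted_average_recall(y_true, y_pred):
    """WAR: per-class recall weighted by class frequency. For single-label
    predictions this reduces to overall accuracy."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, counts = np.unique(y_true, return_counts=True)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.average(recalls, weights=counts))

# Example: 5 samples over 3 emotion classes.
print(weighted_average_recall([0, 0, 1, 2, 2], [0, 1, 1, 2, 2]))  # 0.8
```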
The authors conclude that fusing multimodal data along spatio-temporal sequences significantly improves model performance by capturing correlations between cross-domain data. The masked autoencoder's self-supervised learning also positions the framework as an efficient learner for contextual semantic inference problems.