Multimodal Masked Autoencoder for Improved Dynamic Emotion Recognition
MultiMAE-DER is a novel multimodal masked autoencoder framework that leverages self-supervised learning to extract and fuse spatio-temporal features from visual and audio data for improved dynamic emotion recognition.
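As a rough illustration of the masked-autoencoder pretraining idea the summary refers to, the sketch below randomly masks a large fraction of patch tokens drawn from concatenated visual and audio sequences. This is a generic MAE-style masking step, not the MultiMAE-DER implementation; the function name, mask ratio, and token shapes are illustrative assumptions.

```python
import numpy as np

def random_mask(tokens, mask_ratio=0.75, rng=None):
    """Randomly mask a fraction of tokens, as in masked-autoencoder pretraining.

    tokens: (N, D) array of patch embeddings. Returns the visible tokens and a
    boolean mask (True = masked). Illustrative only, not the authors' code.
    """
    rng = rng or np.random.default_rng(0)
    n = tokens.shape[0]
    n_mask = int(n * mask_ratio)
    perm = rng.permutation(n)
    mask = np.zeros(n, dtype=bool)
    mask[perm[:n_mask]] = True
    return tokens[~mask], mask

# One simple fusion strategy: concatenate the modality token sequences before
# masking, so the encoder sees a mixed subset of visual and audio patches.
visual = np.random.default_rng(1).normal(size=(16, 32))  # e.g. 16 video-patch tokens
audio = np.random.default_rng(2).normal(size=(8, 32))    # e.g. 8 audio-patch tokens
fused = np.concatenate([visual, audio], axis=0)          # 24 tokens total
visible, mask = random_mask(fused, mask_ratio=0.75)      # encoder input: 6 visible tokens
```

During pretraining, a decoder would then reconstruct the masked tokens from the visible ones, forcing the encoder to learn cross-modal spatio-temporal structure.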