Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition
HiCMAE is a novel self-supervised framework that uses large-scale pre-training on unlabeled audio-visual data to advance audio-visual emotion recognition.