Multimodal Foundation Models Outperform Audio-Only Models for Non-Verbal Emotion Recognition
Multimodal foundation models (MFMs) such as LanguageBind and ImageBind outperform audio-only foundation models (AFMs) on non-verbal emotion recognition (NVER). Their joint pre-training across multiple modalities lets them capture subtle emotional cues more effectively.