Key Concepts
Multimodal foundation models (MFMs) like LanguageBind and ImageBind outperform audio-only foundation models (AFMs) for non-verbal emotion recognition (NVER) tasks by better capturing subtle emotional cues through their joint pre-training across multiple modalities.
Summary
The study investigates the use of multimodal foundation models (MFMs) for non-verbal emotion recognition (NVER) and compares their performance to audio-only foundation models (AFMs). The authors hypothesize that MFMs, with their joint pre-training across multiple modalities, will be more effective in NVER by better interpreting and differentiating subtle emotional cues that may be ambiguous in AFMs.
The key highlights are:
- The authors conduct a comparative study of state-of-the-art (SOTA) MFMs (LanguageBind and ImageBind) and AFMs (WavLM, Unispeech-SAT, and Wav2vec2) on benchmark NVER datasets (ASVP-ESD, JNV, and VIVAE).
- The results show that the MFMs, particularly LanguageBind, outperform the AFMs across the NVER datasets, validating the authors' hypothesis.
- To further enhance NVER performance, the authors propose a novel fusion framework, MATA (Intra-Modality Alignment through Transport Attention), which combines representations from different foundation models (a minimal sketch of the idea follows this list).
- MATA with the fusion of LanguageBind and ImageBind achieves the highest reported performance across the NVER benchmarks, outperforming both individual foundation models and baseline fusion techniques.
- The study also demonstrates the generalizability of the proposed MATA framework by evaluating it on the CREMA-D speech emotion recognition dataset.
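The summary above describes MATA only at a high level, so the following is a rough, hedged sketch of the underlying idea: aligning token-level features from two frozen foundation models with an optimal-transport plan used as attention weights, then fusing the aligned features for emotion classification. The layer sizes, feature dimensions, Sinkhorn solver settings, and the `OTFusion` module name are illustrative assumptions, not the authors' exact implementation; the LanguageBind and ImageBind feature shapes below are placeholders.

```python
# Hedged sketch of optimal-transport-attention fusion in the spirit of MATA.
# All shapes, hyperparameters, and module names are assumptions for illustration.
import torch
import torch.nn as nn


def sinkhorn(cost: torch.Tensor, n_iters: int = 20, eps: float = 0.1) -> torch.Tensor:
    """Entropic-regularized OT plan between uniform marginals (assumed formulation)."""
    B, n, m = cost.shape
    K = torch.exp(-cost / eps)                                  # Gibbs kernel
    u = torch.full((B, n), 1.0 / n, device=cost.device)
    v = torch.full((B, m), 1.0 / m, device=cost.device)
    a = torch.full((B, n), 1.0 / n, device=cost.device)         # uniform source marginal
    b = torch.full((B, m), 1.0 / m, device=cost.device)         # uniform target marginal
    for _ in range(n_iters):
        u = a / (torch.bmm(K, v.unsqueeze(-1)).squeeze(-1) + 1e-8)
        v = b / (torch.bmm(K.transpose(1, 2), u.unsqueeze(-1)).squeeze(-1) + 1e-8)
    return u.unsqueeze(-1) * K * v.unsqueeze(1)                  # transport plan (B, n, m)


class OTFusion(nn.Module):
    """Fuse frozen embeddings from two foundation models: the transport plan acts as
    soft cross-model attention, and the fused representation feeds an emotion classifier."""

    def __init__(self, dim_a: int, dim_b: int, hidden: int = 256, n_classes: int = 12):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, hidden)
        self.proj_b = nn.Linear(dim_b, hidden)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_classes)
        )

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # feats_a: (B, Ta, dim_a), feats_b: (B, Tb, dim_b) frame/token embeddings
        a = self.proj_a(feats_a)
        b = self.proj_b(feats_b)
        cost = torch.cdist(a, b)                                 # pairwise cost (B, Ta, Tb)
        plan = sinkhorn(cost)                                    # soft token-level alignment
        aligned_b = torch.bmm(plan, b) * a.size(1)               # transport B's tokens onto A's
        fused = torch.cat([a, aligned_b], dim=-1).mean(dim=1)    # pool over tokens
        return self.classifier(fused)


# Toy usage with random stand-ins for LanguageBind / ImageBind audio features.
if __name__ == "__main__":
    model = OTFusion(dim_a=768, dim_b=1024, n_classes=12)
    lb = torch.randn(4, 50, 768)       # placeholder LanguageBind embeddings
    ib = torch.randn(4, 60, 1024)      # placeholder ImageBind embeddings
    print(model(lb, ib).shape)         # -> torch.Size([4, 12])
```

In this sketch the transport plan plays the role of a soft cross-model attention matrix: each token from one foundation model attends to the other model's tokens in proportion to the mass the plan moves between them, which is one plausible reading of "alignment through transport attention."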
Statistics
The ASVP-ESD dataset includes thousands of high-quality audio recordings labeled with 12 emotions plus an additional "breath" class.
The JNV dataset features 420 audio clips from four native Japanese speakers expressing six emotions.
The VIVAE dataset includes 1,085 audio files from eleven speakers expressing three positive and three negative emotions at varying intensities.
Quotes
"We hypothesize that MFMs, with their joint pre-training across multiple modalities, will be more effective in non-verbal sounds emotion recognition (NVER) by better interpreting and differentiating subtle emotional cues that may be ambiguous in audio-only foundation models (AFMs)."
"With MATA coupled with the combination of MFMs: LanguageBind and ImageBind, we report the topmost performance with accuracies of 76.47%, 77.40%, 75.12% and F1-scores of 70.35%, 76.19%, 74.63% for ASVP-ESD, JNV, and VIVAE datasets against individual FMs and baseline fusion techniques and report SOTA on the benchmark datasets."