Core Concepts
The authors tackle Multimodal Emotion Cause Pair Extraction both as an utterance labeling problem and as a sequence labeling problem, presenting a comparative study of these formulations across different encoders and architectures.
Abstract
The paper addresses the task of Multimodal Emotion Cause Pair Extraction in Conversations, which aims to identify the emotion reflected in each utterance of a conversation spanning multiple modalities (textual, audio, and visual) and to extract the corresponding utterances that caused that emotion.
The authors propose three baseline models:
Baseline I (Utterance Labeling): Treats the problem as a simple utterance labeling task, using pre-trained text, audio, and image encoders to train three models for emotion classification, candidate cause identification, and emotion-cause pairing (a minimal sketch follows this list).
Baseline II (BiLSTM Architecture): Models the problem as a sequence labeling task, using a BiLSTM architecture to capture the surrounding context of the conversation.
Baseline III (BiLSTM-CRF Architecture): Adds a CRF layer on top of the BiLSTM architecture to model transitions between emotion labels (a sequence labeling sketch covering Baselines II and III also follows below).
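To make Baseline I concrete, here is a minimal PyTorch sketch of the utterance labeling idea, assuming pre-extracted, fixed-size embeddings for each modality; the dimensions, class counts, and module names are illustrative assumptions, not the authors' reported configuration:

```python
import torch
import torch.nn as nn

class UtteranceEmotionClassifier(nn.Module):
    """Baseline I style classifier: fuse per-modality utterance embeddings
    by concatenation and predict one of the emotion classes."""

    def __init__(self, text_dim=768, audio_dim=768, video_dim=768,
                 hidden_dim=256, num_emotions=7):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + audio_dim + video_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
        )
        self.emotion_head = nn.Linear(hidden_dim, num_emotions)

    def forward(self, text_emb, audio_emb, video_emb):
        fused = self.fuse(torch.cat([text_emb, audio_emb, video_emb], dim=-1))
        return self.emotion_head(fused)  # logits over emotion classes


class EmotionCausePairScorer(nn.Module):
    """Pairing step: score whether a candidate utterance is the cause of
    the emotion expressed in a target utterance."""

    def __init__(self, utt_dim=256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * utt_dim, utt_dim),
            nn.ReLU(),
            nn.Linear(utt_dim, 1),
        )

    def forward(self, emotion_utt, candidate_utt):
        pair = torch.cat([emotion_utt, candidate_utt], dim=-1)
        return self.scorer(pair).squeeze(-1)  # higher score = more likely cause
```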
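For Baselines II and III, a sequence labeling sketch over the utterances of a conversation; the CRF layer here uses the third-party pytorch-crf package, and all sizes are again assumed rather than taken from the paper:

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRFTagger(nn.Module):
    """Baseline III style tagger: a BiLSTM contextualizes utterance embeddings
    across the conversation, and a CRF models transitions between emotion labels.
    Dropping the CRF and training with cross-entropy recovers Baseline II."""

    def __init__(self, utt_dim=768, hidden_dim=256, num_labels=7):
        super().__init__()
        self.bilstm = nn.LSTM(utt_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.emissions = nn.Linear(2 * hidden_dim, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def loss(self, utt_embs, labels, mask):
        # utt_embs: (batch, num_utterances, utt_dim), labels: (batch, num_utterances)
        out, _ = self.bilstm(utt_embs)
        emissions = self.emissions(out)
        return -self.crf(emissions, labels, mask=mask, reduction='mean')

    def decode(self, utt_embs, mask):
        out, _ = self.bilstm(utt_embs)
        emissions = self.emissions(out)
        return self.crf.decode(emissions, mask=mask)  # best label sequence per conversation
```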
The authors experiment with different encoders, including BERT, DeBERTa, RoBERTa, EmotionRoBERTa, WavLM, and MViTv2, and evaluate the models on the Emotion-Cause-in-Friends dataset.
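As an illustration of how utterance-level text features could be extracted with one of these encoders, here is a sketch using RoBERTa via the Hugging Face transformers library; the example utterances are made up, and mean pooling over non-padding tokens is one common choice that may differ from the authors' pooling strategy:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")

utterances = [
    "You did what?!",
    "I thought it would be funny.",
]

inputs = tokenizer(utterances, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state        # (batch, seq_len, 768)

# Mean-pool over non-padding tokens to get one vector per utterance.
mask = inputs["attention_mask"].unsqueeze(-1)            # (batch, seq_len, 1)
utt_embeddings = (hidden * mask).sum(1) / mask.sum(1)    # (batch, 768)
print(utt_embeddings.shape)  # torch.Size([2, 768])
```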
The results show that, on this particular dataset, the utterance labeling systems perform as well as the sequence labeling systems, and that encoders trained on emotion-related tasks tend to perform better on similar tasks. The authors also discuss potential future improvements, such as learning joint embeddings over the three modalities and utilizing speaker information.
Stats
The dataset used for this problem is Emotion-Cause-in-Friends, which contains 1,344 conversations made up of a total of 13,509 utterances, with each utterance annotated with the emotion depicted and the corresponding emotion-cause pairs.
Quotes
"Conversation is the most natural form of human communication, where each utterance can range over a variety of possible emotions."
"SemEval 2024 introduces the task of Multimodal Emotion Cause Analysis in Conversations, which aims to extract emotions reflected in individual utterances in a conversation involving multiple modalities (textual, audio, and visual modalities) along with the corresponding utterances that were the cause for the emotion."