The paper addresses Multimodal Emotion-Cause Pair Extraction in Conversations: given a conversation that spans three modalities (textual, audio, and visual), the goal is to identify the emotion expressed in each utterance together with the utterances that caused that emotion.
The authors propose three baseline models:
Baseline I (Utterance Labeling): Treats the problem as a simple utterance labeling task, using pre-trained text, audio, and image encoders to train three models for emotion classification, candidate cause identification, and emotion-cause pairing.
Baseline II (BiLSTM Architecture): Models the problem as a sequence labeling task, using a BiLSTM architecture to capture the surrounding context of each utterance in the conversation.
Baseline III (BiLSTM-CRF Architecture): Adds a CRF layer on top of the BiLSTM architecture to model the transitions between emotion labels.
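To make the role of the CRF layer in Baseline III more concrete, the following is a minimal sketch of Viterbi decoding over per-utterance label scores. It is not the authors' implementation: the function name `viterbi_decode`, the three-label set, and all score values are hypothetical, chosen only to illustrate how transition scores between emotion labels can change the predicted sequence. In a real system the emission scores would come from the BiLSTM and the transition scores would be learned.

```python
from typing import List, Sequence

# Hypothetical label set, for illustration only.
LABELS = ["neutral", "joy", "anger"]


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> List[int]:
    """Return the highest-scoring label index sequence.

    emissions[t][k]   : score of label k for utterance t
    transitions[j][k] : score of moving from label j to label k
    """
    if not emissions:
        return []

    num_labels = len(emissions[0])
    scores = list(emissions[0])          # best score ending in each label so far
    backpointers: List[List[int]] = []   # best previous label per step and label

    for t in range(1, len(emissions)):
        new_scores = []
        pointers = []
        for k in range(num_labels):
            # Pick the previous label that maximizes the score of arriving at k.
            best_prev = max(
                range(num_labels),
                key=lambda j: scores[j] + transitions[j][k],
            )
            new_scores.append(
                scores[best_prev] + transitions[best_prev][k] + emissions[t][k]
            )
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)

    # Trace the best path backwards from the best final label.
    path = [max(range(num_labels), key=lambda k: scores[k])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Made-up emission scores for a three-utterance conversation.
emissions = [
    [2.0, 0.5, 0.1],
    [0.9, 1.0, 0.2],
    [0.3, 1.5, 0.1],
]

# No transition preference: decoding reduces to a per-utterance argmax.
no_preference = [[0.0] * 3 for _ in range(3)]

# Transitions that reward keeping the same label across adjacent utterances.
sticky = [[2.0 if j == k else 0.0 for k in range(3)] for j in range(3)]

independent_path = viterbi_decode(emissions, no_preference)
sticky_path = viterbi_decode(emissions, sticky)
print([LABELS[i] for i in independent_path])
print([LABELS[i] for i in sticky_path])
```

With all transition scores at zero, the decoder behaves like independent utterance labeling (as in Baseline I), while the "sticky" transitions pull the sequence toward a single consistent label. That difference is the kind of label-to-label dependency the CRF layer is meant to capture.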
The authors experiment with different encoders, including BERT, DeBERTa, RoBERTa, EmotionRoBERTa, WavLM, and MViTv2, and evaluate the models on the Emotion-Cause-in-Friends dataset.
The results show that the utterance labeling systems perform as well as the sequence labeling systems for this specific dataset, and that encoders trained on emotion-related tasks tend to perform better on similar tasks. The authors also discuss potential future improvements, such as learning joint embeddings over the three modalities and utilizing speaker information.