Core Concepts
Leveraging the common-sense reasoning capabilities of Vision-and-Large-Language Models (VLLMs), this work proposes a novel two-stage approach that enhances in-context emotion classification without introducing complex training pipelines.
Summary
The paper addresses the task of recognizing emotions in context, which involves identifying the apparent emotions of an individual while considering contextual cues from the surrounding scene. Previous approaches have relied on explicit scene-encoding architectures or on external scene-related information such as captions, often with intricate training pipelines.
In this work, the authors propose a two-stage approach that leverages the capabilities of VLLMs to generate natural language descriptions of the subject's apparent emotion in relation to the visual context. In the first stage, they prompt a VLLM, specifically LLaVA-1.5, to describe the subject's emotional state and the surrounding context. In the second stage, the generated text descriptions, together with the image input, are used to train a transformer-based architecture that fuses text and visual features before the final emotion classification.
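A minimal sketch of the first stage is shown below, assuming the publicly available `llava-hf/llava-1.5-7b-hf` checkpoint on Hugging Face; the prompt wording is illustrative and not necessarily the exact prompt used by the authors.

```python
# Sketch of stage 1: prompting LLaVA-1.5 for an emotion/context description.
# Assumes the Hugging Face "llava-hf/llava-1.5-7b-hf" checkpoint; the prompt text
# is illustrative, not the authors' exact prompt.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")  # image containing the subject of interest
prompt = (
    "USER: <image>\n"
    "Describe the apparent emotional state of the person in the image, "
    "taking into account the surrounding context. ASSISTANT:"
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output_ids = model.generate(**inputs, max_new_tokens=200, do_sample=False)
# The decoded string (prompt + answer) provides the text input for stage 2.
description = processor.decode(output_ids[0], skip_special_tokens=True)
print(description)
```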
The authors conduct extensive experiments on three in-context emotion recognition datasets: EMOTIC, CAER-S, and BoLD. The results show that the text and image features carry complementary information, and that the fused architecture significantly outperforms either modality alone without any complex training methods. The method achieves state-of-the-art or comparable results across all datasets and metrics relative to much more complex approaches.
The key highlights and insights from the paper are:
- The authors propose a novel two-stage approach that leverages the common sense reasoning capabilities of VLLMs to generate natural language descriptions of the subject's apparent emotion and the surrounding context.
- The generated text descriptions, together with the image input, are used to train a transformer-based architecture that fuses text and visual features, leading to superior performance compared to either modality alone (a minimal sketch of this fusion step follows the list).
- The authors evaluate their method on three different in-context emotion recognition datasets and achieve state-of-the-art or comparable results, outperforming much more complex approaches.
- The authors demonstrate the importance of aligning the text descriptions with the visual input: cropping the subject with its bounding box and using only the cropped image as input leads to a significant performance drop.
- The authors provide a qualitative analysis showing that the cross-attention mechanism in the proposed architecture focuses on the subject, whereas the attention of the vision-only model is more spread out.
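The summary does not give the exact fusion architecture, so the following is only a sketch of the general idea: token embeddings of the generated description attend to image patch features through cross-attention, and the fused representation is pooled for emotion classification. The encoder choices, layer sizes, choice of query modality, pooling, and the 26-way output (as in EMOTIC's category set) are assumptions for illustration, not the authors' exact design.

```python
# Sketch of stage 2: cross-attention fusion of text and image features before
# emotion classification. Dimensions and design choices are assumptions.
import torch
import torch.nn as nn

class TextImageFusionClassifier(nn.Module):
    def __init__(self, text_dim=768, image_dim=768, hidden_dim=768,
                 num_heads=8, num_classes=26):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        # Cross-attention: text tokens query the image patch features.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_classes),  # e.g. 26 emotion categories (EMOTIC)
        )

    def forward(self, text_feats, image_feats):
        # text_feats:  (B, T, text_dim)  token embeddings of the generated description
        # image_feats: (B, P, image_dim) patch embeddings of the image
        q = self.text_proj(text_feats)
        kv = self.image_proj(image_feats)
        fused, _ = self.cross_attn(q, kv, kv)
        fused = self.norm(fused + q)   # residual connection over the text stream
        pooled = fused.mean(dim=1)     # mean-pool the fused tokens
        return self.classifier(pooled)

# Example with random tensors standing in for frozen text/image encoder outputs.
model = TextImageFusionClassifier()
logits = model(torch.randn(2, 157, 768), torch.randn(2, 196, 768))
print(logits.shape)  # torch.Size([2, 26])
```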
Statistics
The generated text descriptions for the EMOTIC training set have a mean length of 157 tokens with a standard deviation of 45.52; 25% of the samples have fewer than 124 tokens, and 75% have fewer than 183 tokens.
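Such length statistics could be reproduced roughly as below, assuming the counts are measured with the LLaVA-1.5 tokenizer (the summary does not say which tokenizer was used).

```python
# Sketch for computing token-length statistics over generated descriptions;
# the choice of tokenizer is an assumption.
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llava-hf/llava-1.5-7b-hf")
descriptions = ["..."]  # generated EMOTIC training-set descriptions
lengths = np.array([len(tokenizer(d).input_ids) for d in descriptions])
print(lengths.mean(), lengths.std(), np.percentile(lengths, [25, 75]))
```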
Quotes
"Recognising emotions in context involves identifying the apparent emotions of an individual, taking into account contextual cues from the surrounding scene."
"Previous approaches to this task have involved the design of explicit scene-encoding architectures or the incorporation of external scene-related information, such as captions. However, these methods often utilise limited contextual information or rely on intricate training pipelines."
"In this work, we leverage the groundbreaking capabilities of Vision-and-Large-Language Models (VLLMs) to enhance in-context emotion classification without introducing complexity to the training process in a two-stage approach."