GPT-4V with Emotion: A Zero-shot Benchmark for Generalized Emotion Recognition

GPT-4V demonstrates strong visual understanding capabilities in Generalized Emotion Recognition (GER) tasks, but struggles with tasks that require specialized knowledge, such as micro-expression recognition.
This article evaluates GPT-4V's performance in emotion recognition tasks across various datasets. It discusses the model's ability to integrate multimodal clues and exploit temporal information. The study highlights the limitations of GPT-4V in recognizing micro-expressions and provides insights into potential future research directions.

Structure:
Introduction: Discusses the importance of emotion recognition and introduces Generalized Emotion Recognition (GER) tasks.
Related Works: Explores different tasks within GER and their distinctions.
Task Description: Details each task and dataset used for evaluation.
GPT-4V Calling Strategy: Describes the strategy designed for handling requests in GER tasks.
Results and Discussion: Presents main results, including comparisons with baselines and supervised systems.
Temporal Modeling Ability: Evaluates GPT-4V's performance based on sampling frames in dynamic facial emotion recognition.
Multimodal Fusion Ability: Examines GPT-4V's ability to integrate multimodal information in emotion recognition tasks.
System Stability: Analyzes the stability of GPT-4V predictions through multiple runs.
Class-wise Performance Analysis: Visualizes confusion matrices to analyze class-wise prediction consistency.
Robustness to Template Change: Explores how changes in prompt templates affect GPT-4V's performance.
Robustness to Color Space: Evaluates GPT-4V's robustness to color space changes using grayscale images.
Security Check: Discusses instances where security checks impact model predictions.
Case Study: Provides examples of incorrect predictions made by GPT-4V in different tasks.
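
Two of the analyses listed above, frame sampling for temporal modeling and repeated runs for system stability, can be illustrated with a short sketch. The summary does not say how frames are chosen or how runs are compared, so the uniform sampling and majority-vote agreement used here are assumptions, and all function names are hypothetical rather than taken from the study.

```python
from collections import Counter
from typing import List


def sample_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick num_samples frame indices spread evenly over a clip.

    Assumption: uniform sampling, taking the midpoint of each equal-width
    segment. The study's actual sampling scheme is not given in the summary.
    """
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        # Clip is shorter than the request: use every frame.
        return list(range(num_frames))
    # Integer arithmetic avoids floating-point rounding surprises.
    return [(2 * i + 1) * num_frames // (2 * num_samples)
            for i in range(num_samples)]


def majority_vote(labels: List[str]) -> str:
    """Return the most frequent label; ties go to the label seen first."""
    if not labels:
        raise ValueError("labels must not be empty")
    return Counter(labels).most_common(1)[0][0]


def agreement_rate(labels: List[str]) -> float:
    """Fraction of runs that agree with the majority label.

    Assumption: one simple way to quantify stability across repeated runs.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count / len(labels)


if __name__ == "__main__":
    # Three frames from a 30-frame clip.
    print(sample_frame_indices(30, 3))
    # Hypothetical predictions from four repeated runs on the same sample.
    runs = ["happy", "sad", "happy", "happy"]
    print(majority_vote(runs), agreement_rate(runs))
```

An agreement rate near 1.0 would indicate that repeated requests return the same label, while lower values would suggest the predictions vary between runs.
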
"Through experimental analysis, we observe that GPT-4V exhibits strong visual understanding capabilities in GER tasks." "GPT-4V is primarily designed for general domains and cannot recognize micro-expressions that require specialized knowledge."

