Core Concepts
Developing a small emotional vision language model to enhance emotion understanding in visual art.
Abstract
This paper introduces a small emotional vision language model (SEVLM) designed to improve emotion understanding in visual art. The model incorporates valence-arousal-dominance (VAD) knowledge, using emotional features derived through VAD modeling to align emotion vectors and generate more emotionally expressive explanation text. It also includes a contrastive head that aligns image, emotion-class, and explanation features. Experimental results show that SEVLM outperforms state-of-the-art small models in both emotion classification accuracy and the semantic relevance of its explanations. The model is computationally efficient and can be trained on a single RTX 2080 Ti GPU.
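To make the contrastive head concrete, below is a minimal NumPy sketch of a symmetric InfoNCE-style objective that pulls matched image, emotion-class, and explanation features together while pushing mismatched pairs apart. The function names, the temperature value, and the three-way averaging over pairs are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Project feature rows onto the unit sphere so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def pairwise_infonce(a, b, temperature=0.07):
    # Symmetric InfoNCE between two batches of features whose rows are matched:
    # row i of `a` should be most similar to row i of `b`, and vice versa.
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(len(a))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()         # diagonal entries are the positives

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def contrastive_alignment_loss(img_feat, emo_feat, expl_feat, temperature=0.07):
    # Average the pairwise losses over the three modality pairs
    # (image-emotion, image-explanation, emotion-explanation); an assumed combination.
    return (pairwise_infonce(img_feat, emo_feat, temperature)
            + pairwise_infonce(img_feat, expl_feat, temperature)
            + pairwise_infonce(emo_feat, expl_feat, temperature)) / 3.0

# Hypothetical usage with random feature batches (batch of 4, dimension 8):
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
emo = rng.normal(size=(4, 8))
expl = rng.normal(size=(4, 8))
print(contrastive_alignment_loss(img, emo, expl))
```

Perfectly aligned features (identical rows across modalities) drive the loss toward zero, which is the behavior a contrastive alignment head relies on during training.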
Stats
The proposed model can be trained and evaluated on a single RTX 2080 Ti.
Outperforms state-of-the-art small models.
Competitive with LLaVA 7B after fine-tuning.
Quotes
"The proposed techniques consistently improve the visual art understanding performance of baseline SEVLMs."
"Our model not only outperforms the state-of-the-art small models but is also competitive compared with LLaVA 7B after fine-tuning."