แนวคิดหลัก
The authors propose a Unified Multi-modal Image Aesthetic Assessment (UNIAA) framework, including a Multi-modal Large Language Model (MLLM) named UNIAA-LLaVA and a comprehensive benchmark named UNIAA-Bench, to align with the human aesthetic process and achieve good results in multiple aesthetic subtasks.
บทคัดย่อ
The authors propose the UNIAA framework to address the limitations of traditional IAA methods, which are typically constrained to a single dataset or task, restricting the universality and broader application.
UNIAA includes:
UNIAA-LLaVA: An MLLM baseline capable of unifying aesthetic perception, description, and assessment tasks.
UNIAA-Bench: A comprehensive aesthetic benchmark that evaluates the aesthetic capabilities of MLLMs from three aspects - Aesthetic Perception, Aesthetic Description, and Aesthetic Assessment.
To obtain the UNIAA-LLaVA, the authors establish a low-cost IAA Dataset Conversion Paradigm (IDCP) to transform existing aesthetic datasets into a format suitable for MLLM fine-tuning.
Extensive experiments validate the effectiveness of UNIAA. UNIAA-LLaVA achieves competitive performance on all levels of UNIAA-Bench, compared with existing MLLMs. Specifically, it performs better than GPT-4V in aesthetic perception and even approaches the junior-level human. The authors find MLLMs have great potential in IAA, yet there remains plenty of room for further improvement.
สถิติ
The image is aesthetically pleasing, with its unique concept and well-executed composition.
The color and lighting are well-balanced, with the cloudy sky providing a moody atmosphere.
คำพูด
"UNIAA-LLaVA achieves competitive performance on all levels of UNIAA-Bench, compared with existing MLLMs."
"UNIAA-LLaVA performs better than GPT-4V in aesthetic perception and even approaches the junior-level human."