Core Concepts
Multimodal Large Language Models need to be assessed for alignment with human values using a comprehensive dataset and evaluation strategy.
Abstract
The article introduces the Ch3Ef dataset and a unified evaluation strategy for assessing Multimodal Large Language Models' (MLLMs') alignment with human values. Evaluation is categorized into three levels: alignment in semantics, alignment in logic, and alignment with human values. The Ch3Ef dataset contains 1002 human-annotated data samples covering 12 domains and 46 tasks, organized around the principles of being helpful, honest, and harmless. The evaluation strategy supports assessment across varied scenarios and from different perspectives, providing insights into MLLM capabilities, limitations, and alignment with human values.
Structure:
Introduction
Purpose of Large Language Models
Lack of exploration of Multimodal Large Language Models' (MLLMs') alignment with human values
Evaluation Levels
Alignment in Semantics (A1)
Alignment in Logic (A2)
Alignment with Human Values (A3)
Challenges in Alignment with Human Values
Complexity and diversity of applications
Difficulty in collecting datasets reflecting real-world situations
Ch3Ef Dataset
Manually curated dataset for MLLMs based on the helpful, honest, and harmless (hhh) criteria
Taxonomy based on being helpful, honest, and harmless
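The taxonomy above suggests that each annotated sample carries its hhh principle, domain, and task. A minimal sketch of such a record follows; the field names and example values are hypothetical illustrations, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Illustrative sketch of a Ch3Ef-style annotated sample; field names and
# example values are hypothetical, not the dataset's actual schema.
@dataclass(frozen=True)
class Ch3efSample:
    principle: str   # one of "helpful", "honest", "harmless"
    domain: str      # one of the 12 domains (name below is hypothetical)
    task: str        # one of the 46 tasks (name below is hypothetical)
    image: str       # path or URL of the visual input
    prompt: str      # human-annotated instruction
    reference: str   # human-annotated reference response

sample = Ch3efSample(
    principle="harmless",
    domain="safety",                 # hypothetical domain name
    task="risk identification",      # hypothetical task name
    image="example.png",
    prompt="Describe any unsafe activity shown in the image.",
    reference="The image shows a person cycling without a helmet.",
)
assert sample.principle in {"helpful", "honest", "harmless"}
```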
Evaluation Strategy
Modular design with Instruction, Inferencer, and Metric components
Support for varied assessments from different perspectives
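The modular Instruction/Inferencer/Metric design can be sketched as three pluggable functions composed into one evaluation loop. All names and the exact-match metric below are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Sample:
    image_path: str   # multimodal input
    question: str     # text prompt
    reference: str    # human-annotated reference answer

def instruction(sample: Sample) -> str:
    """Instruction component: formats a sample into a model prompt."""
    return f"<image: {sample.image_path}>\nQuestion: {sample.question}\nAnswer:"

def inferencer(prompt: str, model: Callable[[str], str]) -> str:
    """Inferencer component: runs the (pluggable) MLLM on the prompt."""
    return model(prompt)

def metric(prediction: str, reference: str) -> float:
    """Metric component: scores a prediction (exact match as a placeholder)."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0

def evaluate(samples: List[Sample], model: Callable[[str], str]) -> float:
    """Compose the three components over a dataset and average the scores."""
    scores = [metric(inferencer(instruction(s), model), s.reference)
              for s in samples]
    return sum(scores) / len(scores)

# Usage with a stub model that always answers "yes":
samples = [Sample("img1.png", "Is the response harmless?", "yes")]
print(evaluate(samples, lambda prompt: "yes"))  # 1.0
```

Because each component is an independent function, swapping the prompt template, the model, or the scoring rule changes one piece without touching the loop, which is what enables assessment across scenarios and perspectives.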
Experimental Results
Evaluation of 11 open-source MLLMs across levels A1-A3
Key findings on trade-offs, domain-specific challenges, and alignment with human values
Conclusions
Introduction of the Ch3Ef dataset and a unified evaluation strategy for assessing MLLMs' alignment with human values
Anticipation of further research and development to enhance MLLMs' alignment with human values
Stats
Ch3Ef dataset contains 1002 human-annotated data samples.
The dataset covers 12 domains and 46 tasks based on the principles of being helpful, honest, and harmless.
Quotes
"Models that excel in accurate perception or reasoning tasks are not necessarily equipped to cater to human interests and behavior in practical applications."
"The urgency to assess whether Multimodal Large Language Models align with human values intensifies as they become more intertwined with various facets of human society."