Key Concepts
The authors introduce Peacock, a family of Arabic multimodal large language models (MLLMs), to address the scarcity of high-quality multimodal resources in languages other than English. Through qualitative and quantitative analysis, they demonstrate their models' strong performance on visual reasoning tasks and their potential for handling dialectal Arabic.
Summary
The paper introduces Peacock, a suite of Arabic MLLMs designed for visual reasoning tasks and dialectal coverage. The models integrate vision encoders with Arabic text decoders and are trained in two stages on pretraining data translated from English datasets into Arabic. Performance is showcased across tasks such as VQA and visual reasoning, where the models outperform multilingual baselines. The newly introduced Henna benchmark evaluates model capabilities related to Arabic culture. Additionally, a case study on Egyptian dialect proficiency highlights the future potential of dialectal Arabic vision-language models.
Statistics
"A wide collection of languages and dialects with a native population of more than 400 million speakers."
"SEED-Benchmark dimensions: Instance Attributes, Instance Identity, Instance Interaction, Instance Location, Instances Counting, Scene Understanding, Spatial Relation, Visual Reasoning."
"Performance comparison between Peacock models on VQAv2 dataset against mBlip baseline."
"LLaVA-Bench metrics: Conversation (Conv), Details Description (DD), Complex Reasoning (CR)."
"SEED-Bench evaluation attributes: Instance Attributes (IA), Instance Identity (II), Instance Interaction (IN), Instance Location (IL), Instances Counting (IC), Scene Understanding (SU), Spatial Relation (SR), Visual Reasoning (VR)."
Quotations
"We introduce a comprehensive family of Arabic MLLMs dubbed Peacock with strong vision and language capabilities."
"Our contributions include introducing diverse datasets for training and evaluation of Arabic MLLMs."
"The performance disparity between AraLLaMA and AceGPT highlights the impact of language model selection on task performance."