Core Concepts
This paper introduces HumanVLM, a large vision-language model designed specifically for human-scene understanding. It is trained on two newly created datasets, HumanCaption-10M and HumanCaptionHQ, to address the limitations of general-domain models in specialized fields.
Stats
The LAION-Face dataset, containing approximately 50 million image-text pairs, was used as the source of raw image data.
The final refined dataset used for training included 10 million human-scene images.
The average text lengths for CelebA-Dialog, MM-CelebA, LAION-Face, HumanCaption-10M, and HumanCaptionHQ are 25, 17, 12, 70, and 238 words, respectively.
The researchers constructed 3,950 image-caption pairs from HumanCaptionHQ as test data for evaluating caption generation.
For VQA evaluation, 5,000 human-scene images were selected, from which 18,312 question-answer pairs were created.
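The ratios above can be recomputed directly from the reported figures (for example, 18,312 question-answer pairs over 5,000 images is roughly 3.7 per image). The sketch below is not from the paper: it recomputes that ratio and shows one way the average caption word counts might be measured. The file path and JSON-lines layout with a "caption" field are hypothetical assumptions.

```python
# Minimal sketch (not the paper's code): recompute reported ratios and
# illustrate measuring average caption length in words.
import json

# Figures reported in the Stats section above.
vqa_images = 5_000
vqa_pairs = 18_312
print(f"QA pairs per image: {vqa_pairs / vqa_images:.2f}")  # ~3.66

def average_word_count(caption_file: str) -> float:
    """Average caption length in words, assuming one JSON object per line
    with a 'caption' field (hypothetical file format)."""
    total_words = 0
    total_captions = 0
    with open(caption_file, encoding="utf-8") as f:
        for line in f:
            caption = json.loads(line)["caption"]
            total_words += len(caption.split())
            total_captions += 1
    return total_words / max(total_captions, 1)

# Example usage (placeholder path, not an actual released file name):
# print(average_word_count("humancaption_10m.jsonl"))
```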
Quotes
"These general-domain VLMs often underperform in specialized fields that demand domain-specific knowledge and fine-tuning."
"This study introduces a domain-specific Large Vision-Language Model, Human-Scene Vision-Language Model (HumanVLM), designed to provide a foundation for human-scene Vision-Language tasks."
"Our experiments validate the effectiveness of HumanVLM, showing that it often outperforms other baseline models on human-scene tasks, including Qwen2-VL and GPT4o."