Key Concepts
Long captions in language-image pre-training improve model performance across tasks such as image-text retrieval, semantic segmentation, and MLLM-based image understanding.
Summary
The work highlights the importance of long captions in language-image pre-training, arguing that detailed descriptions benefit image understanding. It introduces DreamLIP, a method that exploits long captions to improve image-text retrieval, semantic segmentation, and image understanding within MLLMs. Experimental results show that DreamLIP outperforms existing alternatives.
Abstract
Language-image pre-training depends on how precisely and thoroughly the text describes its paired image.
Long captions can enrich semantic learning.
Introduction
Existing datasets rarely provide long captions that describe images in rich detail.
DreamLIP aims to leverage long captions for improved performance.
Method
Generating long captions using MLLMs.
Multi-positive contrastive learning with sub-captions.
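To make the multi-positive objective concrete, below is a minimal PyTorch sketch, an illustration under stated assumptions rather than DreamLIP's actual implementation: each long caption is split into sentence-level sub-captions, every sub-caption of an image is treated as a positive in a symmetric InfoNCE-style loss, and all other sub-captions in the batch serve as negatives. The helper names (split_into_subcaptions, multi_positive_contrastive_loss) are hypothetical, and the random embeddings stand in for real image/text encoder outputs.

import torch
import torch.nn.functional as F

def split_into_subcaptions(long_caption: str) -> list[str]:
    """Naive sub-caption extraction: treat each sentence of a long caption as one sub-caption."""
    return [s.strip() for s in long_caption.split(".") if s.strip()]

def multi_positive_contrastive_loss(image_emb, text_emb, pos_mask, temperature=0.07):
    """Symmetric InfoNCE-style loss where every sub-caption of an image counts as a positive.

    image_emb: (B, D) L2-normalized image embeddings.
    text_emb:  (T, D) L2-normalized sub-caption embeddings, T >= B (several per image).
    pos_mask:  (B, T) float, 1 where sub-caption j belongs to image i, else 0.
    """
    logits = image_emb @ text_emb.t() / temperature            # (B, T) similarity scores
    # Image -> text: average the log-likelihood over all positive sub-captions of each image.
    log_p_i2t = F.log_softmax(logits, dim=1)
    loss_i2t = -(log_p_i2t * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    # Text -> image: each sub-caption has exactly one positive image.
    log_p_t2i = F.log_softmax(logits.t(), dim=1)
    loss_t2i = -(log_p_t2i * pos_mask.t()).sum(1) / pos_mask.t().sum(1).clamp(min=1)
    return 0.5 * (loss_i2t.mean() + loss_t2i.mean())

# Toy usage: two images; the long captions would come from an MLLM prompted to
# describe each image in detail (fixed strings here, purely for illustration).
long_captions = [
    "A dog runs across a grassy field. The sky behind it is bright blue.",
    "A red car is parked under a large tree.",
]
subs = [split_into_subcaptions(c) for c in long_captions]
owner = torch.tensor([i for i, group in enumerate(subs) for _ in group])  # image index per sub-caption
B, T, D = len(long_captions), len(owner), 512
image_emb = F.normalize(torch.randn(B, D), dim=1)    # stand-in for image encoder output
text_emb = F.normalize(torch.randn(T, D), dim=1)     # stand-in for text encoder output
pos_mask = (owner.unsqueeze(0) == torch.arange(B).unsqueeze(1)).float()   # (B, T)
print(multi_positive_contrastive_loss(image_emb, text_emb, pos_mask))

In a real training loop the stand-in embeddings would be the outputs of the image and text encoders, and the long captions would be MLLM-generated descriptions; averaging the log-likelihood over positives is one straightforward way to handle multiple positives per image.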
Experiments
Image-text retrieval and semantic segmentation results.
Ablation Studies
Effectiveness of short versus long captions and of the subcaption-specific grouping loss.
Visualization
Qualitative analysis through semantic segmentation and image-text retrieval visuals.
Statistics
30M images re-captioned with detailed descriptions using an MLLM.
A model trained on these 30M pairs achieves performance comparable to or better than CLIP trained on 400M pairs.
Quotes
"Long captions unleash the potential of real-world images."
"DreamLIP demonstrates fine-grained representational capacity."