Core Concepts
Visual-enriched captions improve CLIP training efficiency and performance.
Abstract
Large-scale web-crawled datasets are crucial for pre-training vision-language models like CLIP.
Existing methods struggle with noisy and irrelevant AltTexts, hindering image-text alignment.
VeCLIP introduces Visual-enriched Captions (VeCap) for improved data diversity and model performance.
A mixed training scheme alternates between the original AltTexts and VeCap captions during training, enhancing data variety and quality (see the sketch after this list).
VeCLIP shows significant gains in image-text alignment and data efficiency.
Pre-trained models are available at https://github.com/apple/ml-veclip.
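The mixed scheme can be illustrated with a minimal sketch: for each image, the caption fed to the text encoder is sampled at random from either its original AltText or its VeCap rewrite, so both caption distributions are seen over training. The dataset class, field names, and sampling probability below (MixedCaptionDataset, alt_text, vecap_caption, p_vecap) are illustrative assumptions, not the released VeCLIP implementation.

```python
import random
from torch.utils.data import Dataset


class MixedCaptionDataset(Dataset):
    """Minimal sketch of mixed AltText/VeCap caption sampling (assumed design,
    not the released VeCLIP code)."""

    def __init__(self, samples, p_vecap=0.5, transform=None):
        # samples: list of dicts with keys "image_path", "alt_text", "vecap_caption"
        self.samples = samples
        self.p_vecap = p_vecap      # probability of choosing the VeCap caption
        self.transform = transform  # image preprocessing (e.g., CLIP transforms)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        item = self.samples[idx]
        # Alternate between caption sources to increase text diversity
        # while still exposing the model to the raw AltText distribution.
        caption = (
            item["vecap_caption"]
            if random.random() < self.p_vecap
            else item["alt_text"]
        )
        image = item["image_path"]  # a real pipeline would load and decode the image here
        if self.transform is not None:
            image = self.transform(image)
        return image, caption
```

Because the caption source is re-sampled every time an example is drawn, a given image is paired with both caption styles across epochs, which is the data variety the bullet above refers to.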
Stats
VeCLIP achieves up to a +25.2% gain on COCO and Flickr30k retrieval tasks.
VeCLIP achieves a +3% gain on COCO and Flickr30k retrieval tasks while using only 14% of the data.
Quotes
"VeCLIP achieves up to +25.2% gain in COCO and Flickr30k retrieval tasks under the 12M setting."
"VeCLIP achieves +3% gain while only using 14% of the data employed in the vanilla CLIP and 11% in ALIGN."