Improving CLIP Training with Visual-enriched Captions: VeCLIP Study
The study introduces VeCLIP, a scalable pipeline that rewrites noisy captions by incorporating visual concepts into them, producing visual-enriched captions (VeCap). A mixed training scheme that uses VeCap alongside the original captions improves image-text alignment and, in turn, model performance.
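
To make the idea of a mixed training scheme concrete, here is a minimal sketch of one way it could work: for each image, the data loader picks at random between the original caption and its VeCap rewrite. This summary does not specify the mixing rule, so the per-sample random choice, the function names `pick_caption` and `mixed_batch`, and the `p_vecap` parameter are illustrative assumptions, not the study's actual implementation.

```python
import random
from typing import List, Optional, Tuple


def pick_caption(
    original: str,
    vecap: Optional[str],
    p_vecap: float = 0.5,
    rng: Optional[random.Random] = None,
) -> str:
    """Return the caption to pair with an image for one training step.

    p_vecap is a hypothetical mixing probability, not a value from the study.
    """
    if rng is None:
        rng = random.Random()
    # Fall back to the original caption when no rewrite is available.
    if vecap is None:
        return original
    # rng.random() is in [0, 1), so p_vecap=1.0 always selects the rewrite
    # and p_vecap=0.0 always selects the original.
    return vecap if rng.random() < p_vecap else original


def mixed_batch(
    samples: List[Tuple[str, str, Optional[str]]],
    p_vecap: float = 0.5,
    seed: Optional[int] = None,
) -> List[Tuple[str, str]]:
    """Build (image_id, caption) pairs by mixing caption sources per sample."""
    rng = random.Random(seed)
    return [
        (image_id, pick_caption(original, vecap, p_vecap, rng))
        for image_id, original, vecap in samples
    ]
```

Under this assumption, each image can be paired with either caption source across epochs, so the model sees both the original wording and the visually enriched rewrite rather than only one of them.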