Core Concepts
Existing vision-language models (VLMs) struggle with compositionality, but the CLoVe framework significantly improves it while maintaining performance on other tasks.
Stats
CLIP+CLoVe w/o patching: 69.0% on SugarCrepe
NegCLIP: 70.5% on SugarCrepe
REPLACE: 71.2% on SugarCrepe
Quotes
"CLoVe significantly improves compositionality performance of pre-trained CLIP-like models."
"Synthetic captions, hard negatives, and model patching are key to enhancing VLMs."
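Of the three ingredients named in the quote above, model patching is the easiest to illustrate in isolation. In the general literature it usually means linearly interpolating the weights of the pre-trained model and its fine-tuned counterpart. The sketch below assumes that reading; the function name `patch_weights`, the parameter names, the toy values, and the choice of `alpha` are all hypothetical and are not taken from the CLoVe code.

```python
from typing import Dict, List

# Hypothetical stand-in for a checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def patch_weights(pretrained: Weights, finetuned: Weights, alpha: float) -> Weights:
    """Linearly interpolate two checkpoints that share the same parameter names.

    alpha = 0.0 returns the pre-trained weights, alpha = 1.0 the fine-tuned ones.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("checkpoints must have identical parameter names")

    patched: Weights = {}
    for name, base in pretrained.items():
        tuned = finetuned[name]
        if len(base) != len(tuned):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise convex combination of the two checkpoints.
        patched[name] = [(1.0 - alpha) * b + alpha * t for b, t in zip(base, tuned)]
    return patched


# Toy values for illustration only (not real model weights).
pretrained = {"proj.weight": [0.0, 2.0], "proj.bias": [1.0]}
finetuned = {"proj.weight": [1.0, 4.0], "proj.bias": [3.0]}

patched = patch_weights(pretrained, finetuned, alpha=0.5)
print(patched)
```

In practice `alpha` would be tuned on held-out data to trade compositionality gains against performance on the original tasks; the value 0.5 here is only a placeholder.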