Core Concepts
Language supervision and diverse training data play a crucial role in enhancing CLIP's compositional generalization abilities.
Stats
CLIP models trained on large datasets show orders-of-magnitude improvement in compositional out-of-distribution (OoD) generalization.
Models trained on LAION-400M, LAION-2B, and OpenAI's CLIP dataset exhibit markedly stronger effective compositional generalization.
Lower Normalized Mutual Information (NMI) between attribute and object labels in the training data indicates better disentanglement: attributes and objects co-occur more independently.
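To make the NMI measure concrete, here is a minimal, self-contained sketch (not from the paper) that computes NMI between two label sequences, normalized by the mean of the two entropies. The toy attribute/object labels are hypothetical; NMI is 1 when attributes and objects always co-occur (fully entangled) and 0 when every attribute appears with every object equally often (fully disentangled):

```python
import math
from collections import Counter

def normalized_mutual_info(xs, ys):
    """NMI between two label sequences, normalized by the mean entropy."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # Mutual information from the joint and marginal label frequencies
    mi = sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
             for (x, y), c in pxy.items())
    denom = (entropy(px) + entropy(py)) / 2
    return mi / denom if denom > 0 else 0.0

# Attributes that always co-occur with the same object -> NMI = 1 (entangled)
print(normalized_mutual_info(["red", "blue"], ["car", "ball"]))  # 1.0

# Every attribute paired with every object equally -> NMI = 0 (disentangled)
print(normalized_mutual_info(["red", "red", "blue", "blue"],
                             ["car", "ball", "car", "ball"]))    # 0.0
```

In practice one would run this over the (attribute, object) annotations of an entire training set; `sklearn.metrics.normalized_mutual_info_score` provides an equivalent library implementation.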
Quotes
"Our results provide evidence that the scale and diversity of training data and language supervision play a key role in unlocking the compositional generalization abilities of vision-language models." - Authors