Core Concepts
High train-test similarity alone does not explain CLIP's exceptional performance; other training data properties play a crucial role.
Abstract
Large models like GPT-4 and LLaMA achieve remarkable performance by training on vast datasets scraped from the internet. Against this backdrop, the study investigates whether CLIP's accuracy on out-of-distribution (OOD) benchmarks is primarily due to highly similar images in its training set. By pruning LAION splits to replicate ImageNet's train-test similarity, the study finds that while some benchmarks show a performance drop, CLIP largely maintains high performance overall. The results indicate that factors beyond high train-test similarity drive CLIP's ability to learn good representations. Additionally, the study identifies a 100M-sample subset of LAION on which CLIP maintains its original performance.
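The pruning step can be pictured as a nearest-neighbor filter in CLIP's embedding space: training images that are too similar to any test image are dropped. Below is a minimal sketch of that idea, assuming precomputed image embeddings; the function name, array names, and the threshold value are illustrative assumptions, not the authors' actual code.

```python
import numpy as np

def prune_by_similarity(train_embs: np.ndarray,
                        test_embs: np.ndarray,
                        threshold: float) -> np.ndarray:
    """Return indices of training samples whose highest cosine similarity
    to any test image stays below `threshold`."""
    # L2-normalize so dot products equal cosine similarities.
    train = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    test = test_embs / np.linalg.norm(test_embs, axis=1, keepdims=True)
    # For each training image, its nearest-neighbor similarity to the test set.
    max_sim = (train @ test.T).max(axis=1)
    # Keep only samples below the cutoff, e.g. a cutoff calibrated so the
    # pruned pool matches ImageNet's own train-test similarity distribution.
    return np.where(max_sim < threshold)[0]

# Hypothetical usage (embeddings and cutoff are placeholders):
# keep_idx = prune_by_similarity(laion_embs, benchmark_embs, threshold=0.8)
```

At LAION scale, the similarity matrix would not fit in memory; a real run would batch the computation or use an approximate nearest-neighbor index, but the filtering logic is the same.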
Stats
Foundation models like CLIP are trained on hundreds of millions of samples.
Out of the box, CLIP shows stellar zero-shot and few-shot capabilities.
The study retrains CLIP on pruned LAION splits that replicate ImageNet's train-test similarity.
A 100M-sample split of LAION is identified on which CLIP maintains its original performance.
Quotes
"High train-test similarity is insufficient to explain CLIP’s performance."
"Models trained on pruned datasets do not significantly lose performance."
"CLIP leverages dataset scale and diversity for generalizable features."