Core Concepts
Preserving data quality is more effective than increasing quantity for improving CLIP performance.
Abstract
The paper argues that data quality matters more than sheer quantity when pre-training models like CLIP. It introduces ClipCov, a method that selects subsets of the training data which preserve the cross-covariance between image and text representations, yielding superior generalization. Extensive experiments on ConceptualCaptions3M and ConceptualCaptions12M show significant accuracy improvements across a range of downstream tasks.
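To make the core idea concrete, here is a toy sketch of cross-covariance-preserving subset selection. This is not the authors' ClipCov algorithm (which is far more scalable); it is a minimal greedy illustration under assumed inputs: `img_emb` and `txt_emb` are hypothetical precomputed image and caption embedding matrices of shape `(n, d)`.

```python
import numpy as np

def cross_covariance(img_emb, txt_emb):
    """Cross-covariance matrix between centered image and text embeddings."""
    img_c = img_emb - img_emb.mean(axis=0)
    txt_c = txt_emb - txt_emb.mean(axis=0)
    return img_c.T @ txt_c / len(img_emb)

def greedy_subset(img_emb, txt_emb, k):
    """Greedily pick k pairs whose cross-covariance best matches the full data's.

    Toy O(n*k) illustration of the 'preserve cross-covariance' objective;
    real methods use much more efficient selection strategies.
    """
    target = cross_covariance(img_emb, txt_emb)
    selected, remaining = [], set(range(len(img_emb)))
    for _ in range(k):
        best, best_err = None, np.inf
        for i in remaining:
            idx = selected + [i]
            # Error between the candidate subset's cross-covariance and the target
            err = np.linalg.norm(cross_covariance(img_emb[idx], txt_emb[idx]) - target)
            if err < best_err:
                best, best_err = i, err
        selected.append(best)
        remaining.remove(best)
    return selected

# Usage: select 10 of 50 synthetic embedding pairs (64-dim)
rng = np.random.default_rng(0)
img = rng.standard_normal((50, 64))
txt = img + 0.1 * rng.standard_normal((50, 64))  # loosely aligned captions
subset = greedy_subset(img, txt, k=10)
```

The intuition is that a subset whose image-text cross-covariance matches the full dataset's retains the alignment statistics CLIP's contrastive objective depends on, so training on the smaller subset loses less downstream accuracy.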
Stats
"Our extensive experiments on ConceptualCaptions3M and ConceptualCaptions12M demonstrate that subsets found by ClipCov achieve over 2.7x and 1.4x the accuracy of the next best baseline on ImageNet and its shifted versions."
"Moreover, we show that our subsets obtain 1.5x the average accuracy across 11 downstream datasets, of the next best baseline."