T-MARS: Improving Visual Representations by Circumventing Text Feature Learning at ICLR 2024
Core Concepts
Large web-crawled multimodal datasets require efficient data filtering to improve visual representation learning.
Abstract:
Large web-crawled datasets power modern methods for learning visual representations.
T-MARS filters out text-dominated image-caption pairs to enhance visual feature learning.
Experimentally, T-MARS outperforms CLIP filtering on ImageNet and VTAB.
Introduction:
Shift in ML training from labeled datasets to web crawls.
Vision-language models like CLIP demonstrate exceptional zero-shot performance.
Data curation challenges at web scale necessitate innovative approaches.
Method:
T-MARS masks the text regions in each image, then filters pairs by the CLIP similarity between the masked image and its caption.
Empirical effectiveness of T-MARS demonstrated through experiments on LAION subsets.
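The filtering step described above can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes the CLIP similarities have already been recomputed on text-masked images, and the function name `t_mars_filter` and the `keep_fraction` parameter are hypothetical (the paper retains roughly the top-scoring half of the pool).

```python
import numpy as np

def t_mars_filter(masked_scores, keep_fraction=0.5):
    """Keep the image-caption pairs whose CLIP similarity, computed on the
    text-masked image, lands in the top `keep_fraction` of the pool.

    masked_scores: 1-D sequence of CLIP cosine similarities between each
    caption and its image after in-image text has been masked out.
    Returns the sorted indices of the retained pairs.
    """
    scores = np.asarray(masked_scores, dtype=float)
    n_keep = max(1, int(round(keep_fraction * len(scores))))
    # Highest masked-image similarity first: pairs whose score survives
    # text masking carry genuine visual signal and are kept; pairs that
    # only matched because of overlaid text drop toward the bottom.
    order = np.argsort(scores)[::-1]
    return np.sort(order[:n_keep])
```

For example, `t_mars_filter([0.10, 0.40, 0.30, 0.05], keep_fraction=0.5)` keeps indices 1 and 2, discarding the two pairs whose captions no longer match once on-image text is hidden.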
Related Work:
Comparison with existing baselines like C-RHO and C-SSFT for data filtering.
Vision-language pre-training models like CLIP and BASIC discussed.
Experiments:
Evaluation of various data curation strategies across different dataset sizes.
Accuracy gains from filtering grow linearly as the data pool size increases exponentially.
Results:
T-MARS consistently outperforms baselines across various downstream tasks.
Utility analysis shows that removing harmful examples matters more than adding extra samples.
Stats
T-MARS outperforms CLIP filtering by 6.5% on ImageNet and 4.7% on VTAB.
Quotes
"Data curation at web scale raises unique challenges compared to the standard classification regime."
"Our scaling trends show that good-quality data filtering holds even more significance at large scales."