Key Concepts
This paper investigates the performance of the Contrastive Language-Image Pre-training (CLIP) model when scaled down to limited computation budgets, exploring the impact of data, architecture, and training strategies.
Abstract
The paper presents a comprehensive study on scaling down the CLIP model in three key areas:
Data:
Examines the significance of training data quality, showing that a smaller set of high-quality data can outperform a larger, lower-quality dataset.
Investigates how model performance varies with dataset size, finding that smaller ViT models are better suited to smaller datasets, while larger models pay off only on larger datasets.
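The quality-over-quantity finding above boils down to ranking image-text pairs by some quality score and keeping only the top fraction (the paper's "top 40%" subset). A minimal sketch of that selection step, assuming a hypothetical per-sample quality score such as a CLIP image-text similarity (the paper's exact filtering pipeline may differ):

```python
import numpy as np

def filter_top_fraction(scores, fraction=0.4):
    """Return indices of the top `fraction` of samples by quality score.

    `scores` is assumed to be a per-sample quality metric, e.g. a
    precomputed CLIP image-text similarity (hypothetical setup).
    """
    k = max(1, int(len(scores) * fraction))
    # argsort is ascending, so the last k indices have the highest scores
    return np.argsort(scores)[-k:]

# Toy example: 5 samples, keep the top 40% (2 samples)
scores = np.array([0.1, 0.9, 0.5, 0.7, 0.3])
keep = filter_top_fraction(scores, fraction=0.4)
```

The paper's observation that top-40% beats top-20% suggests the `fraction` knob trades off data quality against data quantity, and the optimum is not at the most aggressive filter.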
Architecture:
Compares the performance of various CNN and vision transformer architectures, including ResNet, Swin Transformer, and ConvNeXt, under different computation budgets.
Finds that when the number of samples is small, CNNs outperform vision transformers, but as the dataset size increases, ViT-based CLIP models demonstrate superior performance.
Training Strategies:
Evaluates four CLIP training strategies: SLIP, FLIP, CLIP, and CLIP+Data Augmentation.
Shows that the choice of training strategy depends on the available compute resources, and that CLIP+Data Augmentation can match plain CLIP's performance using only half the training data.
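All four strategies above share CLIP's core objective: a symmetric contrastive (InfoNCE) loss over a batch of paired image and text embeddings, with matched pairs on the diagonal of the similarity matrix; CLIP+Data Augmentation simply feeds augmented image views through the same loss. A minimal numpy sketch of that loss (an illustrative reimplementation, not the paper's code; the temperature value is an assumption):

```python
import numpy as np

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss at the core of CLIP training.

    Embeddings are L2-normalized; cosine similarities scaled by the
    temperature form the logits, and the i-th image is matched to the
    i-th text (diagonal targets).
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature
    labels = np.arange(len(logits))

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions
    return (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2
```

With perfectly aligned pairs (e.g. identical image and text embeddings) the loss approaches zero; shuffled pairs drive it up, which is what the training signal exploits.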
The paper provides practical insights into effectively training and deploying CLIP models, making them more accessible and affordable for various applications.
Statistics
"When the number of samples is less than 100 million, ViT-L/16 performs the worst among the ViT family."
"Augmenting the dataset to 400M does not yield any significant improvement in zero-shot performance due to our computation limits."
"The top 40% dataset achieves the highest performance on ImageNet, whereas the top 20% dataset falls short of the top 40% in terms of performance."
"CLIP + Data Aug can bring better zero-shot performance on both ImageNet and its variants when we train multiple epochs for the dataset."
Quotes
"Our results provide guidance on how to select training data for CLIP models, a critical issue for practical applications."
"We explore the trade-offs between computational cost and performance in CLIP training, a critical issue for practical applications."
"This work provides practical insights into how to effectively train and deploy CLIP models, making them more accessible and affordable for practical use in various applications."