
TV100: A Novel TV Series Dataset Unexplored by Pre-Trained CLIP Model

Core Concepts
The pre-trained CLIP model, despite its impressive performance on many tasks, lacks comprehensive knowledge: it cannot recognize images from the newly introduced TV series dataset TV100, highlighting the need to probe the boundaries of pre-trained models.
The authors introduce TV100, a dataset of images from TV series released after 2021, designed to probe the limits of the pre-trained CLIP model (itself trained on the large and diverse LAION dataset). Data collection involves manually searching for TV series on IMDB, downloading related images from Google, and filtering out duplicated or meaningless images. The resulting dataset contains around 800 classes with a highly imbalanced distribution, making it suitable for research on long-tailed recognition.

To investigate CLIP's performance on this dataset, the authors run experiments in both zero-shot and fine-tuned settings. The pre-trained CLIP model fails to recognize any class in TV100, indicating that it lacks the knowledge to identify these new TV series. However, once fine-tuned on TV100, its performance improves significantly, showing that the dataset is learnable and separable.

The authors emphasize that the era of pre-trained models raises a crucial question: do these models possess comprehensive knowledge? TV100 is introduced as a means to evaluate the limitations of pre-trained models, particularly CLIP, and to facilitate research in areas such as novel class discovery and long-tailed learning.
The dataset contains TV series from countries worldwide, with a highly imbalanced class distribution.
"Does CLIP know everything?" "No model, including CLIP, possesses complete knowledge."

Key Insights Distilled From

by Da-Wei Zhou,... at 04-22-2024
TV100: A TV Series Dataset that Pre-Trained CLIP Has Not Seen

Deeper Inquiries

How can we effectively expand the knowledge of pre-trained models like CLIP to cover a wider range of domains and emerging data?

To expand the knowledge of pre-trained models like CLIP and broaden their coverage of diverse domains and emerging data, several strategies can be employed:

- Continual training: Incrementally updating the model with new information through continual learning techniques helps it adapt to new data and domains over time and stay effective in evolving scenarios.
- Domain adaptation: Fine-tuning pre-trained models on specific domains or datasets improves performance on tasks within those domains and addresses domain-specific challenges and biases.
- Data augmentation: Increasing the diversity of training data exposes the model to a wider range of scenarios and variations, improving its generalization and robustness.
- Transfer learning: Pre-training on related tasks or datasets lets the model transfer existing knowledge across domains and adapt quickly to new tasks and data.
- Multi-modal training: Training on multi-modal data, such as text and images, broadens the model's understanding of different modalities and improves performance on tasks involving multiple input types.
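One lightweight instance of the fine-tuning idea above is a linear probe: freeze the backbone, treat its embeddings as fixed features, and train only a linear classifier on top (this is also the spirit of the fine-tuned setting in the paper). The sketch below uses toy 2-D "embeddings" and plain gradient descent on a logistic-regression layer; all names and values are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(feats, labels, lr=0.5, epochs=200):
    """Train a logistic-regression probe on frozen (toy) embeddings via SGD."""
    w = [0.0] * len(feats[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the logistic loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return int(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5)

# Linearly separable toy "embeddings" for two classes.
feats = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.8, 1.0]]
labels = [0, 0, 1, 1]
w, b = train_probe(feats, labels)
print([predict(w, b, x) for x in feats])  # [0, 0, 1, 1]
```

The same recipe scales to real embeddings: the backbone stays untouched, so the probe is cheap to train and cannot disturb the pre-trained knowledge.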

What are the potential biases and blind spots in the data used to train CLIP, and how can we address them?

The data used to train CLIP may contain biases and blind spots that affect the model's performance and generalization. Potential issues include:

- Selection bias: The training data may not represent the entire population, introducing biases toward specific demographics, cultures, or regions and causing poor performance on data outside the training distribution.
- Label noise: Inaccurate or noisy labels introduce errors and inconsistencies that affect the model's learning process and decision-making.
- Concept drift: The data distribution may change over time, making the model less effective on new or evolving data and degrading performance in real-world applications.

To address these biases and blind spots, the following steps can be taken:

- Diverse dataset collection: Including samples from varied sources, demographics, and contexts mitigates biases and improves the model's robustness.
- Bias detection and mitigation: Applying bias-detection techniques during training and in post-training evaluation helps identify and reduce biases; adversarial training and fairness-aware learning are common tools.
- Regular model evaluation: Continuously evaluating the model on new data and monitoring for concept drift maintains its effectiveness over time; periodic retraining on updated data addresses emerging biases and blind spots.
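A simple starting point for the bias detection step above is an accuracy audit across subgroups: if accuracy differs sharply between groups (say, TV series from different regions), the training data or model may be skewed. The sketch below is illustrative only; the group labels and the 10% gap threshold are assumptions, not values from the paper.

```python
from collections import defaultdict

def per_group_accuracy(preds, labels, groups):
    """Accuracy of (pred, label) pairs, disaggregated by group tag."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        totals[g] += 1
        hits[g] += int(p == y)
    return {g: hits[g] / totals[g] for g in totals}

def flag_bias(acc_by_group, max_gap=0.10):
    """Flag if the accuracy gap between best and worst group exceeds max_gap."""
    gap = max(acc_by_group.values()) - min(acc_by_group.values())
    return gap > max_gap

# Hypothetical predictions, labels, and region tags.
preds  = [1, 0, 1, 1, 0, 0]
labels = [1, 0, 1, 0, 1, 0]
groups = ["US", "US", "US", "KR", "KR", "KR"]
acc = per_group_accuracy(preds, labels, groups)
print(acc, flag_bias(acc))  # large US/KR gap -> flagged as True
```

A flagged gap does not prove bias by itself, but it tells you where to look: at the group's sample count, label quality, and coverage in the training data.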

What other types of datasets or tasks could be used to further stress-test the capabilities and limitations of pre-trained models like CLIP?

To further stress-test the capabilities and limitations of pre-trained models like CLIP, the following types of datasets or tasks can be considered:

- Few-shot learning: Tasks with limited training data assess how well the model generalizes and adapts quickly to new tasks; benchmarks such as Mini-ImageNet or Omniglot serve this purpose.
- Cross-domain transfer: Testing transfer across different domains or modalities reveals the model's flexibility and adaptability; datasets such as Visual Genome or Conceptual Captions can be used to evaluate cross-domain performance.
- Adversarial examples: Perturbed or naturally adversarial inputs test robustness against adversarial attacks; ImageNet-A, or adversarially perturbed versions of standard benchmarks like CIFAR-10, can be used to evaluate this.
- Long-tailed recognition: Imbalanced datasets with long-tailed class distributions highlight the model's ability to handle rare or underrepresented classes; LVIS or iNaturalist can be used for such tasks.

Subjecting pre-trained models like CLIP to these challenging datasets and tasks gives researchers a deeper understanding of their strengths, weaknesses, and areas for improvement.
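For the long-tailed stress tests mentioned above, a common recipe is to subsample a balanced dataset so per-class counts decay exponentially from head to tail. The helper below sketches that recipe; the exponential-decay form and the imbalance ratio are conventional choices for constructing such benchmarks, not values taken from the paper.

```python
def long_tailed_counts(num_classes, n_max, imbalance_ratio):
    """Per-class sample counts decaying from n_max down to n_max / imbalance_ratio."""
    counts = []
    for i in range(num_classes):
        # Exponential decay across class indices 0 .. num_classes - 1.
        frac = imbalance_ratio ** (-i / (num_classes - 1))
        counts.append(max(1, round(n_max * frac)))
    return counts

# Five classes, 100 samples for the head class, 100x imbalance:
# the tail class is left with only a single sample.
print(long_tailed_counts(num_classes=5, n_max=100, imbalance_ratio=100))
```

TV100's naturally imbalanced, roughly-800-class distribution plays the same role without artificial subsampling, which is why the authors position it for long-tailed recognition research.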