Key Concepts
The paper introduces the Let’s Go Shopping (LGS) dataset and argues that e-commerce websites are an efficient source of data for vision-language tasks. The goal is a large-scale public dataset of high-quality image-caption pairs.
Summary
The Let’s Go Shopping (LGS) dataset provides 15 million image-caption pairs collected from e-commerce websites for vision and vision-language applications. It aims to address the limitations of existing datasets by providing clean, informative, and fluent data. Experiments characterize how LGS images and captions differ from existing data and point to their potential for improving image classification, reconstruction, captioning, and text-to-image generation.
Previous datasets have struggled with noisy or subjective sources such as social media alt-texts. In contrast, LGS draws on e-commerce websites, whose product images and descriptions tend to be clean and informative. Its images focus on foreground objects with less complex backgrounds, which sets it apart from general-domain datasets like ImageNet.
Experiments show that models trained on LGS outperform those trained solely on ImageNet in various tasks due to the distinct distribution of e-commerce data. Additionally, LGS serves as an effective pre-training dataset for downstream tasks in both general and fine-grained settings.
The study underscores the importance of domain-specific datasets like LGS in enhancing visual understanding through efficient data collection strategies tailored to specific applications.
Statistics
The Let’s Go Shopping (LGS) dataset consists of 15 million image-caption pairs.
Only 17.6% of concepts are shared between popular ImageNet synsets and the e-commerce corpus.
A masked autoencoder (MAE) trained on ImageNet can reconstruct LGS images well.
Linear probing accuracy improves when using self-supervised MAE models trained on both ImageNet and LGS.
Two-phase pre-training, first on ImageNet and then on LGS, improves downstream task performance.
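The shared-concept figure above (17.6%) is a set-overlap ratio. The summary does not state how concepts were extracted or which set serves as the denominator, so the following is only a minimal sketch under assumptions: concepts are compared as lowercased strings, the denominator is the reference (ImageNet-style) concept set, and the function name `shared_concept_fraction` and the toy concept lists are hypothetical, not taken from the paper.

```python
def shared_concept_fraction(reference_concepts, corpus_concepts):
    """Return the fraction of reference concepts that also occur in the corpus.

    Assumption: concepts match when their lowercased, stripped strings are equal.
    """
    reference = {c.strip().lower() for c in reference_concepts}
    corpus = {c.strip().lower() for c in corpus_concepts}
    if not reference:
        raise ValueError("reference_concepts must not be empty")
    return len(reference & corpus) / len(reference)


# Toy, made-up concept lists; not drawn from ImageNet or LGS.
imagenet_like = ["tabby cat", "sports car", "backpack", "sandal", "volcano"]
ecommerce_like = ["Backpack", "sandal", "phone case", "throw pillow"]

fraction = shared_concept_fraction(imagenet_like, ecommerce_like)
print(f"{fraction:.1%} of reference concepts appear in the corpus")
```

A low ratio of this kind is one way to quantify how far the e-commerce distribution sits from a general-domain label set.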
Quotes
"E-commerce websites provide clean images with objective descriptions."
"LGS offers high-quality images focused on foreground objects with less complex backgrounds."
"Models trained on LGS outperform those trained solely on ImageNet in various tasks."