The QUILT-1M dataset is a large-scale collection of 653,209 pathology images and 1,017,708 associated captions, created by scraping online sources. While this dataset provides valuable data diversity, the image quality and composition are highly heterogeneous, with many images containing impurities such as visible narrators, desktop environments, text overlays, and multi-panel layouts.
To address this issue, the authors manually annotated a 1% sample of QUILT-1M and found that only 21.74% of the images were free of such impurities. They then trained a multi-label impurity classifier on a ResNet50-D backbone, achieving high accuracy, recall, and specificity in detecting these artifacts.
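The filtering decision behind such a multi-label classifier can be sketched as follows. This is a minimal illustration, not the authors' code: the impurity class names and the 0.5 threshold are assumptions, and the per-class scores stand in for the sigmoid outputs of the ResNet50-D model.

```python
import numpy as np

# Hypothetical impurity classes; the paper's exact label set may differ.
IMPURITY_CLASSES = ["narrator", "desktop", "text_overlay", "multi_panel"]

def clean_mask(scores, threshold=0.5):
    """scores: (n_images, n_classes) sigmoid outputs of a multi-label
    classifier. An image counts as clean only if no impurity class fires."""
    scores = np.asarray(scores)
    return (scores < threshold).all(axis=1)

# Toy scores for three images: only the first is free of all impurities.
scores = np.array([
    [0.1, 0.2, 0.05, 0.3],   # clean
    [0.9, 0.1, 0.2, 0.1],    # visible narrator
    [0.2, 0.1, 0.8, 0.7],    # text overlay + multi-panel layout
])
mask = clean_mask(scores)
clean_fraction = mask.mean()   # fraction of images kept
```

Because the labels are not mutually exclusive (an image can show both a narrator and a text overlay), each class gets its own sigmoid rather than a shared softmax, and an image is only retained when every impurity score stays below the threshold.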
Additionally, the authors used CLIP scores computed with the CONCH model to discard the less semantically aligned half of the dataset, further improving the quality of the remaining image-text pairs.
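The CLIP-score filter amounts to a median split on image-text similarity. The sketch below uses random placeholder embeddings in place of real CONCH features; it only shows the mechanics of keeping the better-aligned half.

```python
import numpy as np

def median_split(image_emb, text_emb):
    """Keep the half of image-text pairs whose embeddings have the
    higher cosine similarity (a stand-in for a CONCH-style CLIP score)."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = (img * txt).sum(axis=1)      # per-pair cosine similarity
    keep = sims >= np.median(sims)      # retain the better-aligned half
    return keep, sims

rng = np.random.default_rng(0)
n, d = 10, 8                            # toy corpus: 10 pairs, 8-dim features
keep, sims = median_split(rng.normal(size=(n, d)), rng.normal(size=(n, d)))
```

A fixed similarity threshold would also work, but a median split guarantees exactly half the pairs are removed regardless of how the score distribution shifts between datasets.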
The authors then used the filtered dataset to fine-tune a latent diffusion model for text-conditional image synthesis. Compared to models trained on the unfiltered data, models trained on the filtered data exhibited significantly fewer artifacts and better image fidelity, as measured by the Fréchet Inception Distance (FID).
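FID compares two Gaussians fitted to feature activations of real and generated images: FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2(S1 S2)^{1/2}). A minimal numpy sketch of the formula (the eigendecomposition-based matrix square root is a simplification adequate for well-conditioned covariances; standard implementations use a dedicated sqrtm routine):

```python
import numpy as np

def _sqrtm(a):
    """Matrix square root via eigendecomposition; adequate for the
    well-conditioned covariance products used in this toy example."""
    w, v = np.linalg.eig(a)
    return (v * np.sqrt(w.astype(complex))) @ np.linalg.inv(v)

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet Inception Distance between two Gaussians fitted to
    feature activations: ||mu1-mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})."""
    diff = mu1 - mu2
    covmean = _sqrtm(sigma1 @ sigma2).real   # drop tiny imaginary noise
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Identical distributions give an FID of (numerically) zero.
mu = np.array([0.5, -0.2])
sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
score = fid(mu, sigma, mu, sigma)
```

Lower FID means the generated feature distribution sits closer to the real one, which is why reduced artifacts after filtering show up as a lower score.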
The findings of this study highlight the importance of carefully curating large-scale datasets, especially for tasks like text-to-image generation, where the quality and purity of the input data are crucial for the performance of the models.
Key insights extracted from the paper by Marc Aubrevi... at arxiv.org, 04-12-2024
https://arxiv.org/pdf/2404.07676.pdf