Core Concepts
The authors present a scalable training pipeline and comprehensive analysis for building large-scale pathology foundation models, demonstrating state-of-the-art performance on various downstream tasks.
Abstract
The authors introduce a scalable training pipeline for building large-scale pathology foundation models (FMs). Key highlights:
They developed an "Online Patching" technique that enables high-throughput loading of image patches from whole slide images (WSIs) during training, eliminating the need for offline patch storage and enabling flexible patch sampling strategies.
Using this pipeline, they trained vision transformers of several sizes and patch resolutions (ViT-S16, ViT-S8, ViT-B16, ViT-B8, DINOv2 ViT-L14) on the TCGA dataset, a commonly used collection of pathology images.
Experimental evaluation shows that their FMs reach state-of-the-art performance on several downstream tasks, including breast cancer subtyping and colorectal nuclear segmentation.
The authors also present an experimental study on the impact of various hyperparameter and design choices, such as model initialization, mixing different magnifications, and dataset size, which can guide future development of pathology FMs.
To aid the evaluation of FMs, the authors introduce an unsupervised metric called "off-diagonal correlation" and an open-source evaluation framework called "eva", which standardizes evaluation across different FMs and downstream tasks.
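The Online Patching idea mentioned in the highlights above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_patches`, its parameters, the near-white background test, and the use of an in-memory NumPy array as a stand-in for a whole slide image are all assumptions made for illustration. The point is only that patch coordinates are drawn at training time, so no patches have to be written to disk beforehand and the sampling rule can be changed freely.

```python
import numpy as np


def sample_patches(slide, patch_size, n_patches, rng,
                   min_tissue=0.5, bg_threshold=220, max_tries=1000):
    """Yield (x, y, patch) tuples drawn at random from a slide array.

    `slide` is an (H, W, 3) uint8 array standing in for a WSI region
    (an assumption; a real pipeline would read from the slide file).
    """
    height, width = slide.shape[:2]
    if patch_size > height or patch_size > width:
        raise ValueError("patch_size exceeds slide dimensions")

    produced = 0
    tries = 0
    while produced < n_patches and tries < max_tries:
        tries += 1
        # Draw the top-left corner on the fly instead of reading stored patches.
        x = int(rng.integers(0, width - patch_size + 1))
        y = int(rng.integers(0, height - patch_size + 1))
        patch = slide[y:y + patch_size, x:x + patch_size]

        # Hypothetical tissue filter: near-white pixels count as background.
        tissue_fraction = float((patch.mean(axis=-1) < bg_threshold).mean())
        if tissue_fraction >= min_tissue:
            produced += 1
            yield x, y, patch


# Toy "slide": white background with a darker square of tissue.
slide = np.full((512, 512, 3), 255, dtype=np.uint8)
slide[100:400, 100:400] = 120

rng = np.random.default_rng(0)
patches = list(sample_patches(slide, patch_size=64, n_patches=8, rng=rng))
print(len(patches), patches[0][2].shape)
```

Because sampling happens inside the generator, changing the magnification mix or the tissue filter only means changing this function, not regenerating a patch dataset.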
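The "off-diagonal correlation" metric named above can be sketched as follows. This summary does not give the exact formula, so the version below is an assumption: it computes the correlation matrix between embedding dimensions over a set of patches and averages the absolute values of its off-diagonal entries. The function name `off_diagonal_correlation` is hypothetical. Under this reading, a lower value means the embedding dimensions carry less redundant information, and no labels are needed to compute it.

```python
import numpy as np


def off_diagonal_correlation(embeddings):
    """Mean absolute off-diagonal entry of the feature correlation matrix.

    `embeddings` has shape (n_samples, dim). Assumed definition, not
    necessarily the one used by the authors. Constant feature columns
    would yield NaN correlations and should be removed beforehand.
    """
    embeddings = np.asarray(embeddings, dtype=np.float64)
    # rowvar=False: each column is one embedding dimension.
    corr = np.corrcoef(embeddings, rowvar=False)
    dim = corr.shape[0]
    off_diag = corr[~np.eye(dim, dtype=bool)]
    return float(np.abs(off_diag).mean())


rng = np.random.default_rng(0)

# Independent features: off-diagonal correlations should be close to zero.
independent = rng.normal(size=(5000, 8))

# Fully redundant features: every column is a linear function of the first.
base = rng.normal(size=(5000, 1))
redundant = np.hstack([base, 2.0 * base + 1.0, -3.0 * base])

print(off_diagonal_correlation(independent))
print(off_diagonal_correlation(redundant))
```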
Stats
The TCGA dataset contains approximately 29k hematoxylin and eosin (H&E) stained tissue slides from 32 cancer types.
The TP53 dataset, derived from TCGA metadata, contains roughly 6k tumors with functional TP53 and 3.5k tumors with non-functional TP53.
The BACH dataset contains 400 breast cancer histology images of 4 classes.
The CRC dataset contains 107,180 colorectal cancer and normal tissue images of 9 classes.
The PatchCamelyon (PCam) dataset contains 327,680 breast lymph node patches with binary labels.
The MHIST dataset contains 3,152 colorectal polyp images of 2 classes.
The CoNSeP dataset contains 41 H&E images with nucleus segmentation masks of 4 cell types.
Quotes
"Driven by the recent advances in deep learning methods and, in particular, by the development of modern self-supervised learning algorithms, increased interest and efforts have been devoted to build foundation models (FMs) for medical images."
"One of the clearest findings from the past decade of machine learning research is that increasing training dataset size and variety is a primary driver of increased model performance."