The paper proposes a multiple instance learning (MIL) framework that can be integrated into both convolutional neural network (CNN) and vision transformer (ViT) architectures for medical image classification tasks. The key idea is to force the model to use only a subset of the most relevant image patches to reach the final classification, mimicking the clinical practice where medical decisions are based on localized findings.
The authors evaluate their approach on two medical applications: skin cancer diagnosis using dermoscopy images and breast cancer diagnosis using mammography. The results show that, on in-domain data, using only a small subset of the patches does not compromise diagnostic performance relative to the baseline approaches. In addition, the MIL models are more robust to shifts in patient demographics and provide more detailed explanations of which regions contributed to the decision.
The paper first describes the patch encoder block, which can be either a CNN or a ViT, to extract patch-level representations from the input image. Then, the MIL block aggregates these patch features to predict the image-level classification. Two MIL approaches are explored: instance-level, which performs predictions on each patch, and embedding-level, which first aggregates the patch features before classification.
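The difference between the two MIL approaches can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear classifier (`w`, `b`), the top-k mean used for instance-level aggregation, the max pooling used for embedding-level aggregation, and the function names are all assumptions chosen for illustration. The patch features stand in for the output of the CNN or ViT patch encoder.

```python
import numpy as np

def instance_level_mil(patch_feats, w, b, k):
    """Instance-level MIL (illustrative): classify every patch, then
    aggregate the k highest patch scores into one image-level logit.

    patch_feats: (num_patches, feat_dim) array from a patch encoder.
    w, b: weights and bias of a hypothetical linear patch classifier.
    """
    scores = patch_feats @ w + b                 # one score per patch
    top_idx = np.argsort(scores)[-k:][::-1]      # k best patches, highest first
    image_logit = scores[top_idx].mean()         # assumed aggregation: top-k mean
    return image_logit, top_idx

def embedding_level_mil(patch_feats, w, b):
    """Embedding-level MIL (illustrative): aggregate patch features into a
    single image embedding first, then classify that embedding once."""
    bag_embedding = patch_feats.max(axis=0)      # assumed aggregation: max pooling
    return bag_embedding @ w + b

# Toy input: 4 patches with 2-dimensional features (made-up values).
feats = np.array([[1.0, 0.0],
                  [0.0, 2.0],
                  [2.0, 2.0],
                  [-1.0, -1.0]])
w = np.array([1.0, 1.0])

logit, key_patches = instance_level_mil(feats, w, b=0.0, k=2)
print("instance-level logit:", logit, "key patches:", key_patches)
print("embedding-level logit:", embedding_level_mil(feats, w, b=0.0))
```

In the instance-level variant, the returned patch indices directly identify which patches drove the prediction, which is the property the summary associates with more detailed explanations. The embedding-level variant yields a single score with no per-patch attribution unless an additional mechanism is added.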
The experimental results demonstrate that the instance-level MIL models consistently outperform their embedding-level counterparts, suggesting that the key patches identified by the instance-level approach are more clinically relevant. Additionally, the MIL models achieve comparable or better performance than the baseline CNN and ViT models, while using significantly less information (i.e., a smaller subset of patches). This highlights the potential of MIL to create more efficient, explainable, and fair medical image analysis systems.
The paper also includes visualizations of the key patches identified by the MIL models, which align well with the clinically relevant regions, further validating the approach. Overall, the work establishes MIL as a promising method to improve the robustness and interpretability of medical image analysis.