
Rigorous Validation Reveals Limitations of Recent 3D Medical Image Segmentation Methods

Core Concepts
Despite the introduction of nnU-Net, which demonstrated the importance of careful implementation over novel architectures, the field of 3D medical image segmentation continues to see a proliferation of new methods claiming superior performance. However, a systematic and comprehensive benchmark reveals that many of these recent claims do not hold up under rigorous validation, and that the recipe for state-of-the-art performance remains CNN-based U-Net models, including ResNet and ConvNeXt variants, implemented within the nnU-Net framework and scaled to modern hardware resources.
The paper starts by highlighting a concerning trend in the field of 3D medical image segmentation, where numerous new methods have been introduced in recent years, each claiming superior performance over the original nnU-Net baseline. However, the authors argue that these claims often fail to hold up under scrutiny due to common validation pitfalls.

To address this issue, the authors first identify and describe several prevalent validation pitfalls, such as coupling the claimed innovation with confounding performance boosters, lack of well-configured and standardized baselines, insufficient quantity and suitability of datasets, and inconsistent reporting practices. They provide recommendations on how to avoid these pitfalls to ensure meaningful and reliable method comparisons.

The authors then conduct a large-scale benchmark under a thorough validation protocol, covering a wide range of prevalent segmentation methods, including CNN-based, Transformer-based, and Mamba-based approaches. They find that, contrary to current beliefs, the recipe for state-of-the-art performance remains CNN-based U-Net models, including ResNet and ConvNeXt variants, implemented within the nnU-Net framework and scaled to modern hardware resources.

The authors also assess the suitability of popular datasets for benchmarking, identifying KiTS, AMOS, and ACDC as the most suitable, while BraTS, LiTS, and BTCV are found to be less suitable for this purpose. Additionally, they demonstrate that Transformer-based architectures fail to match the performance of CNNs, and that the reported gains of the U-Mamba method were due to coupling it with a residual U-Net, rather than the Mamba layers themselves.

Finally, the authors release a series of updated standardized baselines for 3D medical segmentation within the nnU-Net framework, tailored to accommodate a spectrum of hardware capabilities.
The study concludes by emphasizing the critical need for a cultural shift in the field, where the quality of validation is valued as much as the novelty of network architectures, to drive genuine progress in 3D medical image segmentation.
The paper reports average Dice Similarity Coefficient (DSC) scores for all methods and datasets, along with the Normalized Surface Distance (NSD) at a tolerance of 2 mm.
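To make the primary metric concrete, here is a minimal sketch of the Dice Similarity Coefficient as it is conventionally defined for binary segmentation masks. This is an illustrative implementation, not the paper's evaluation code; computing NSD additionally requires extracting surface voxels and distance maps, which dedicated packages handle and is omitted here.

```python
import numpy as np

def dice_coefficient(pred, gt, eps=1e-8):
    """Dice Similarity Coefficient between two binary masks.

    DSC = 2 * |pred AND gt| / (|pred| + |gt|), ranging from 0 (no
    overlap) to 1 (perfect agreement).
    """
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    intersection = np.logical_and(pred, gt).sum()
    # eps guards against division by zero when both masks are empty
    return 2.0 * intersection / (pred.sum() + gt.sum() + eps)
```

For example, a prediction that overlaps the ground truth in one of two foreground voxels each yields a DSC of 0.5.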
"Despite this, the attraction of innovative architectures from the broader computer vision domain, such as Transformers and Mamba, persists. Adaptations of these cutting-edge designs to the medical imaging domain have emerged, with claims of superior performance over the conventional CNN-based U-Net."

"Our study makes the following contributions: 1) We systematically identify validation pitfalls in the field and provide recommendations for how to avoid them. 2) We conduct a large-scale benchmark under a thorough validation protocol to scrutinize the performance of prevalent segmentation methods."

"In contrast to current beliefs, we find that the recipe for state-of-the-art performance is 1) employing CNN-based U-Net models, including ResNet and ConvNeXt variants, 2) using the nnU-Net framework, and 3) scaling models to modern hardware resources."

Deeper Inquiries

How can the field of 3D medical image segmentation incentivize and reward rigorous validation practices, beyond just novel architectural designs?

Incentivizing and rewarding rigorous validation practices in 3D medical image segmentation can be achieved through several strategies. Firstly, funding agencies and research institutions can prioritize funding for studies that emphasize thorough validation, ensuring that researchers have the resources to conduct comprehensive benchmarking and validation experiments. Additionally, journals and conferences can establish publication criteria that prioritize studies with robust validation methodologies, encouraging researchers to invest time and effort into validation.

Collaborative efforts within the research community can also play a significant role. Establishing shared benchmarking datasets and organizing challenges or competitions focused on validation can encourage researchers to compare their methods against established baselines and across multiple datasets. This collaborative approach fosters transparency and accountability, driving researchers to prioritize validation in their work.

Furthermore, recognition of researchers who excel in validation practices can serve as a powerful incentive. Awards for studies that demonstrate exceptional validation methodologies can highlight the importance of rigorous validation in the research community. By celebrating and showcasing exemplary validation efforts, the field can shift towards a culture that values thorough validation as much as innovative architectural designs.

What are the potential implications of the observed "innovation bias" towards new architectures, and how can the research community address this bias?

The observed "innovation bias" towards new architectures in 3D medical image segmentation can have several implications for the field. One significant implication is the potential diversion of resources and attention towards developing novel architectures, leading to a neglect of thorough validation and benchmarking practices. This bias can result in inflated claims of methodological superiority without robust evidence to support them, hindering scientific progress and the adoption of effective segmentation methods.

To address this bias, the research community can take several steps. Firstly, promoting a culture of transparency and reproducibility can help mitigate the bias towards new architectures. Encouraging researchers to provide detailed documentation of their methods, including hyperparameters, training procedures, and validation protocols, can enhance the credibility of their findings and facilitate comparisons with existing methods.

Collaborative efforts to establish standardized benchmarking datasets and evaluation metrics can also help. By creating a common framework for evaluating segmentation methods, researchers can compare the performance of different architectures on a level playing field, rather than relying solely on the novelty of their designs.

Additionally, education and training on the importance of validation and benchmarking can help researchers recognize the value of rigorous validation practices. Workshops, tutorials, and mentorship programs focused on validation methodologies can equip researchers with the skills and knowledge needed to conduct thorough and reliable validation studies.

How might the insights from this study on dataset suitability inform the development of more comprehensive and representative benchmarking suites for the medical imaging domain?

The insights from the study on dataset suitability can inform the development of more comprehensive and representative benchmarking suites for the medical imaging domain in several ways. Firstly, understanding which datasets are suitable for benchmarking can guide the selection of datasets that offer a diverse range of challenges and complexities, ensuring that benchmarking suites cover a broad spectrum of real-world scenarios.

By identifying datasets with low statistical noise and high inter-method variability, researchers can prioritize these datasets in benchmarking suites to facilitate meaningful comparisons between different segmentation methods. Including datasets that challenge the capabilities of segmentation algorithms can help researchers identify the strengths and limitations of various approaches, leading to more informed decisions on method selection and development.

Furthermore, the insights on dataset suitability can drive the creation of standardized evaluation protocols and metrics tailored to specific datasets and segmentation tasks. By incorporating datasets that represent different clinical scenarios and imaging modalities, benchmarking suites can provide a more comprehensive assessment of segmentation methods' performance across diverse medical imaging applications.

Overall, leveraging these insights can enhance the robustness and reliability of benchmarking suites in the medical imaging domain, ultimately advancing the development and evaluation of state-of-the-art segmentation algorithms.
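The suitability criterion described above can be made concrete with a small sketch: a dataset is informative for benchmarking when the spread between method means (inter-method variability) clearly exceeds the fold-to-fold spread within each method (statistical noise). The method names and scores below are purely illustrative, not numbers from the paper.

```python
import numpy as np

# Hypothetical per-fold DSC scores for two methods on one dataset
# (names and values are illustrative only).
fold_scores = {
    "method_a": [0.91, 0.90, 0.92, 0.91, 0.90],
    "method_b": [0.88, 0.86, 0.87, 0.88, 0.86],
}

# Mean performance of each method across folds
means = {m: np.mean(s) for m, s in fold_scores.items()}

# Statistical noise: average fold-to-fold spread within a method
noise = np.mean([np.std(s, ddof=1) for s in fold_scores.values()])

# Inter-method variability: spread of the method means
variability = np.std(list(means.values()), ddof=1)

# The dataset discriminates between methods when differences
# between method means clearly exceed the fold-to-fold noise.
suitable = bool(variability > noise)
```

Under this toy check, a dataset where every method scores within noise of every other (as the paper suggests for some saturated benchmarks) would be flagged as unsuitable for ranking methods.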