Core Concepts
The proposed masked diffusion model (MDM) is a novel self-supervised pre-training approach that replaces the conventional additive Gaussian noise in denoising diffusion probabilistic models (DDPM) with a masking mechanism, leading to improved performance on downstream semantic segmentation tasks.
Summary
The paper presents the masked diffusion model (MDM), a self-supervised representation learning approach that diverges from traditional denoising diffusion probabilistic models (DDPM).
Key highlights:
- MDM replaces the additive Gaussian noise in DDPM with a masking operation, inspired by the Masked Autoencoder (MAE). This frees the method from the theoretical machinery of diffusion models, which hinges on Gaussian noise (the masked forward process is sketched after this list).
- The authors identify a mismatch between the generative pre-training task and downstream dense prediction tasks (e.g., semantic segmentation), where high-level, low-frequency image structure matters more than exact pixel values. To address this, they replace the commonly used Mean Squared Error (MSE) loss with the Structural Similarity (SSIM) loss (also included in the sketch below).
- Extensive experiments on medical and natural image datasets show that MDM outperforms DDPM and other baselines, particularly in few-shot scenarios, demonstrating the effectiveness of the proposed masking approach and the SSIM loss.
- The results suggest that the representation ability of diffusion models does not stem solely from their generative power, and that denoising is not an indispensable component of effective self-supervised representation learning.
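A minimal PyTorch sketch of the two ideas above, under stated assumptions: patch-wise masking whose ratio grows linearly with the timestep t, a box-window SSIM (SSIM is often computed with a Gaussian window instead), and illustrative names (`mask_patches`, `ssim_loss`, `mdm_training_step`) that are not taken from the paper's code:

```python
import torch
import torch.nn.functional as F

def mask_patches(x, t, T, patch_size=16):
    """Mask a timestep-dependent fraction of patches (t=0: none masked, t=T: all masked)."""
    B, C, H, W = x.shape
    n_h, n_w = H // patch_size, W // patch_size
    ratio = t.float() / T                                    # per-sample mask ratio
    keep = torch.rand(B, n_h * n_w, device=x.device) >= ratio[:, None]
    mask = keep.float().reshape(B, 1, n_h, n_w)
    mask = mask.repeat_interleave(patch_size, dim=2).repeat_interleave(patch_size, dim=3)
    return x * mask, mask

def ssim_loss(pred, target, window_size=11, C1=0.01**2, C2=0.03**2):
    """1 - mean SSIM, using a box window for local statistics; assumes inputs in [0, 1]."""
    ch = pred.shape[1]
    kernel = torch.ones(ch, 1, window_size, window_size, device=pred.device) / window_size**2
    pad = window_size // 2
    mu_x = F.conv2d(pred, kernel, padding=pad, groups=ch)
    mu_y = F.conv2d(target, kernel, padding=pad, groups=ch)
    var_x = F.conv2d(pred * pred, kernel, padding=pad, groups=ch) - mu_x**2
    var_y = F.conv2d(target * target, kernel, padding=pad, groups=ch) - mu_y**2
    cov = F.conv2d(pred * target, kernel, padding=pad, groups=ch) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + C1) * (2 * cov + C2)) / \
           ((mu_x**2 + mu_y**2 + C1) * (var_x + var_y + C2))
    return 1.0 - ssim.mean()

def mdm_training_step(model, x, T=1000):
    """One pre-training step: mask, reconstruct, score with SSIM."""
    t = torch.randint(1, T + 1, (x.shape[0],), device=x.device)
    x_masked, _ = mask_patches(x, t, T)
    pred = model(x_masked, t)    # timestep-conditioned U-Net, as in DDPM backbones
    return ssim_loss(pred, x)
```

Note that this sketch scores the full reconstruction, visible patches included; whether MDM restricts the loss to masked regions, as MAE does, is a detail this summary does not specify.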
Statistics
"Diffusion models consist of T timesteps, each corresponding to an incremental level of corruption."
"DDPM degrades to a vanilla denoising autoencoder and MDM degrades to a vanilla masked autoencoder (with a slight difference from MAE) when t is fixed."
"DDPM pre-trained with the noise prediction strategy achieves higher accuracy in downstream segmentation tasks compared to using the image prediction strategy."
Quotes
"Denoising diffusion probabilistic models have recently demonstrated state-of-the-art generative performance and have been used as strong pixel-level representation learners."
"Fortunately, denoising's significance diminishes when one focuses on the self-supervised pre-training facet of diffusion (e.g., Baranchuk et al. (2022)), which employs intermediate activations from a trained diffusion model for downstream segmentation tasks."
"Motivated by these insights, our study diverges from conventional denoising in the diffusion framework. Inspired by the Masked Autoencoder (MAE) (He et al., 2022), we replace noise addition with a masking operation (see Fig. 1), introducing a new self-supervised pre-training paradigm for semantic segmentation named the masked diffusion model (MDM)."