Core Concepts
Diffusion models can generate synthetic training data for semantic segmentation, but the generated pseudo-masks are not always accurate. By framing training on this data as a weakly supervised learning task and combining reliability-aware robust training, prompt augmentation, and domain adaptation, the authors show how to narrow the gap between real and synthetic training for semantic segmentation.
Abstract
The paper explores the use of diffusion models, specifically Stable Diffusion, to generate synthetic training data for semantic segmentation tasks. The key insights are:
Diffusion models can generate both images and pseudo-masks for semantic segmentation by using the text-image cross-attention maps inside the diffusion model. However, the generated pseudo-masks are not always accurate.
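To make the idea of turning cross-attention maps into a pseudo-mask concrete, here is a minimal sketch. It assumes one 2D attention map per class token is already available; the min-max normalization, the `threshold` value, and the function names are illustrative assumptions, not the procedure confirmed by the paper.

```python
from typing import Dict, List

Map2D = List[List[float]]


def min_max_normalize(attn: Map2D) -> Map2D:
    """Rescale one attention map to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in attn for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in attn]
    return [[(v - lo) / (hi - lo) for v in row] for row in attn]


def attention_to_pseudo_mask(
    attn_per_class: Dict[int, Map2D],
    threshold: float = 0.5,   # hypothetical cut-off, not taken from the paper
    background_id: int = 0,
) -> List[List[int]]:
    """Assign each pixel the class with the highest normalized attention,
    or the background id when no class exceeds the threshold."""
    normalized = {c: min_max_normalize(a) for c, a in attn_per_class.items()}
    first = next(iter(normalized.values()))
    height, width = len(first), len(first[0])
    mask = [[background_id] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            best_class, best_score = background_id, threshold
            for c, a in normalized.items():
                if a[y][x] > best_score:
                    best_class, best_score = c, a[y][x]
            mask[y][x] = best_class
    return mask
```

A hard threshold like this is one plausible reason the resulting masks can be inaccurate: pixels with mid-range attention are forced into either object or background.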
The authors frame the problem of training segmentation models on the diffusion-synthetic data as a weakly supervised learning task, where the potentially inaccurate pseudo-masks are treated as weak supervision. They incorporate techniques from weakly supervised semantic segmentation (WSSS), such as reliability-aware robust training, to handle the noisy pseudo-labels.
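One common way to handle noisy pseudo-labels in WSSS is to exclude low-confidence pixels from the loss. The sketch below illustrates that generic technique; it is not confirmed to be the authors' exact reliability-aware formulation, and the ambiguous band `[low, high)`, the `IGNORE` value, and the function names are assumptions.

```python
import math
from typing import List, Sequence

IGNORE = 255  # conventional "ignore" label in segmentation datasets


def mark_unreliable(
    scores: Sequence[float],
    labels: Sequence[int],
    low: float = 0.3,    # hypothetical lower bound of the ambiguous band
    high: float = 0.7,   # hypothetical upper bound of the ambiguous band
) -> List[int]:
    """Replace labels whose confidence score falls in [low, high) with IGNORE."""
    return [IGNORE if low <= s < high else y for s, y in zip(scores, labels)]


def masked_cross_entropy(
    probs: Sequence[Sequence[float]],
    labels: Sequence[int],
) -> float:
    """Mean cross-entropy over pixels whose label is not IGNORE."""
    losses = [
        -math.log(max(p[y], 1e-12))
        for p, y in zip(probs, labels)
        if y != IGNORE
    ]
    return sum(losses) / len(losses) if losses else 0.0
```

With scores `[0.9, 0.5, 0.1]` and labels `[1, 1, 0]`, the middle pixel is ignored, so only the confident foreground and background pixels contribute to the loss.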
To scale up the diversity of the synthetic training data, the authors propose prompt augmentation, which uses synonym and hyponym replacement to generate more varied prompts for the diffusion model.
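Prompt augmentation by synonym and hyponym replacement can be sketched as follows. The lexicon, the prompt template, and the function name are hypothetical; the source does not say where the authors obtain their replacement terms.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical lexicon mapping a class name to synonyms and hyponyms.
LEXICON: Dict[str, List[str]] = {
    "dog": ["puppy", "hound", "terrier", "beagle"],
    "car": ["automobile", "sedan", "hatchback"],
}


def augment_prompts(
    template: str,
    class_name: str,
    lexicon: Dict[str, List[str]],
    n: int,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Return n (prompt, label) pairs. The prompt uses a randomly chosen
    replacement term, while the label stays the original class name."""
    rng = random.Random(seed)
    candidates = [class_name] + lexicon.get(class_name, [])
    return [(template.format(rng.choice(candidates)), class_name) for _ in range(n)]


if __name__ == "__main__":
    for prompt, label in augment_prompts("a photo of a {}", "dog", LEXICON, 5):
        print(label, "<-", prompt)
```

Keeping the label fixed to the original class is the key design point: an image generated from "a photo of a beagle" is still annotated as "dog" for segmentation training.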
For transferring the segmentation model to distant domains, the authors leverage LoRA-based domain adaptation of the Stable Diffusion model, which enables fast and stable finetuning on a small set of target-domain images.
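The general LoRA idea is to freeze a pretrained weight matrix W0 and learn only a low-rank update, so the effective weight is W0 + (alpha / r) * B A. The sketch below shows that update and the resulting reduction in trainable parameters in plain Python. The layer size and rank are illustrative assumptions and are not taken from the paper or from Stable Diffusion's actual configuration.

```python
from typing import Dict, List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x r) and b (r x n)."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_merged_weight(w0: Matrix, b: Matrix, a: Matrix, alpha: float) -> Matrix:
    """Return W0 + (alpha / r) * B @ A, where r is the number of rows of A."""
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w0, delta)
    ]


def trainable_params(d_out: int, d_in: int, rank: int) -> Dict[str, int]:
    """Compare full finetuning of one layer with training only B and A."""
    return {"full": d_out * d_in, "lora": rank * (d_out + d_in)}
```

For a hypothetical 768 x 768 layer with rank 4, `trainable_params(768, 768, 4)` gives 589824 parameters for full finetuning versus 6144 for LoRA, which suggests why adaptation on a small set of target-domain images can be fast and less prone to instability.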
The experiments show that the proposed Attn2mask method, which combines these ideas, outperforms previous diffusion-synthetic training approaches on the PASCAL VOC, ImageNet-S, and Cityscapes datasets, and achieves competitive performance compared to real-image-based WSSS methods.
Stats
No specific numerical results are given here; the focus is on the overall approach and the experimental findings.