The paper proposes a novel approach to training diffusion models in large-image domains, such as digital histopathology and remote sensing, by leveraging self-supervised representation learning. The key idea is to condition the diffusion models on self-supervised embeddings, removing the need for fine-grained human annotations.
The authors first train patch-level diffusion models conditioned on self-supervised embeddings extracted from pre-trained models such as HIPT and iBOT. These diffusion models generate high-quality image patches that closely match the semantics of the conditioning embeddings.
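A minimal PyTorch sketch of this conditioning setup is below. The toy `CondDenoiser`, the 384-d embedding width, and the frozen `ssl_encoder` callable are illustrative assumptions, not the paper's architecture; the sketch only shows how a frozen self-supervised embedding can enter a DDPM-style training step as a conditioning signal.

```python
import torch
import torch.nn as nn

class CondDenoiser(nn.Module):
    """Toy epsilon-predictor conditioned on a patch-level SSL embedding.

    Assumption: the real model is a U-Net with cross-attention; this tiny
    conv net only illustrates where the conditioning enters.
    """
    def __init__(self, img_channels=3, emb_dim=384, hidden=64):
        super().__init__()
        # Project the SSL embedding and timestep into one conditioning vector.
        self.cond_proj = nn.Linear(emb_dim + 1, hidden)
        self.net = nn.Sequential(
            nn.Conv2d(img_channels + hidden, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, img_channels, 3, padding=1),
        )

    def forward(self, x_t, t, ssl_emb):
        # Broadcast conditioning over spatial dims, concatenate channel-wise.
        cond = self.cond_proj(torch.cat([ssl_emb, t[:, None].float()], dim=1))
        cond = cond[:, :, None, None].expand(-1, -1, *x_t.shape[2:])
        return self.net(torch.cat([x_t, cond], dim=1))

def train_step(model, ssl_encoder, patches, alphas_cumprod, opt):
    """One simplified DDPM training step; `ssl_encoder` (e.g. frozen
    iBOT/HIPT mapping a patch to a 384-d vector) is an assumed interface."""
    with torch.no_grad():
        ssl_emb = ssl_encoder(patches)           # conditioning signal
    t = torch.randint(0, len(alphas_cumprod), (patches.size(0),))
    a = alphas_cumprod[t][:, None, None, None]
    noise = torch.randn_like(patches)
    x_t = a.sqrt() * patches + (1 - a).sqrt() * noise   # forward diffusion
    loss = ((model(x_t, t, ssl_emb) - noise) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```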
To synthesize large images, the authors introduce a framework that represents a large image as a grid of self-supervised embeddings. The diffusion model is then used to generate consistent patches based on the spatial arrangement of these conditioning embeddings, preserving both local properties and global structure.
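The grid-to-image step could look roughly like the following sketch, which samples one patch per grid cell with a plain DDPM sampler and tiles the results onto a canvas. The `synthesize_large_image` helper is an assumption for illustration, and the sketch deliberately omits whatever mechanism the paper uses to enforce consistency across neighboring patches.

```python
import torch

@torch.no_grad()
def synthesize_large_image(model, emb_grid, patch_size=64, steps=50):
    """Generate one patch per grid cell and tile them into a large image.

    `emb_grid` has shape (H, W, emb_dim): one SSL embedding per spatial
    position, mirroring the grid-of-embeddings representation. The reverse
    loop is a textbook DDPM sampler, not necessarily the paper's sampler.
    """
    H, W, _ = emb_grid.shape
    canvas = torch.zeros(3, H * patch_size, W * patch_size)
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1 - betas
    alphas_cum = torch.cumprod(alphas, 0)
    for i in range(H):
        for j in range(W):
            x = torch.randn(1, 3, patch_size, patch_size)
            for t in reversed(range(steps)):
                eps = model(x, torch.tensor([t]), emb_grid[i, j][None])
                a, ac = alphas[t], alphas_cum[t]
                # DDPM posterior mean, then noise injection for t > 0.
                x = (x - (1 - a) / (1 - ac).sqrt() * eps) / a.sqrt()
                if t > 0:
                    x = x + betas[t].sqrt() * torch.randn_like(x)
            canvas[:, i*patch_size:(i+1)*patch_size,
                      j*patch_size:(j+1)*patch_size] = x[0]
    return canvas
```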
The authors demonstrate the effectiveness of their approach through extensive evaluations. The generated patches and large images achieve low FID scores, comparable to or better than state-of-the-art methods. Furthermore, the authors show that the synthetic images can be used to augment training data, leading to significant improvements in downstream classification tasks, even for out-of-distribution datasets.
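For context, a standard FID comparison between real and synthetic patches can be set up as follows with `torchmetrics`; the paper's exact evaluation protocol (feature extractor, patch counts, resolutions) is not specified in this summary, so treat this as the generic recipe rather than theirs.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FID over Inception-v3 pool features; expects uint8 images (B, 3, H, W).
fid = FrechetInceptionDistance(feature=2048)

def evaluate_fid(real_loader, fake_loader):
    for real in real_loader:
        fid.update(real, real=True)
    for fake in fake_loader:
        fid.update(fake, real=False)
    return fid.compute().item()
```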
The authors also introduce a text-to-large image synthesis paradigm, where they train auxiliary models to sample self-supervised embeddings from text descriptions and use the diffusion model to generate the corresponding large images.
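A structural sketch of such an auxiliary model is below. The `TextToEmbeddingPrior` MLP and the frozen `text_encoder` (e.g. a CLIP text tower) are hypothetical placeholders: a faithful version would itself be generative, able to sample embedding grids from a text description, whereas this deterministic mapping only shows where the component sits in the pipeline.

```python
import torch
import torch.nn as nn

class TextToEmbeddingPrior(nn.Module):
    """Toy auxiliary model: map a text embedding to a grid of SSL embeddings.

    Assumption: the paper's auxiliary model samples embeddings (e.g. a small
    diffusion prior); this MLP is a deterministic structural stand-in.
    """
    def __init__(self, text_dim=512, emb_dim=384, grid=(4, 4)):
        super().__init__()
        self.grid, self.emb_dim = grid, emb_dim
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, 1024), nn.SiLU(),
            nn.Linear(1024, grid[0] * grid[1] * emb_dim),
        )

    def forward(self, text_emb):
        out = self.mlp(text_emb)
        return out.view(-1, self.grid[0], self.grid[1], self.emb_dim)

# Usage sketch: text -> embedding grid -> large image, reusing the
# synthesize_large_image() helper above. `text_encoder` is assumed.
# text_emb = text_encoder(["adenocarcinoma, moderately differentiated"])
# emb_grid = TextToEmbeddingPrior()(text_emb)[0]
# image = synthesize_large_image(model, emb_grid)
```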
Overall, the paper presents a powerful framework that leverages self-supervised representations to enable efficient large-scale image synthesis in specialized domains, overcoming the limitations of human-annotated data.