Core Concepts
A novel one-shot tuning paradigm, termed OneActor, that efficiently generates consistent images of the same character by leveraging the intrinsic cluster structure in the latent space of pre-trained text-to-image diffusion models.
Summary
The paper proposes a new cluster-guided paradigm, OneActor, for consistent character generation in text-to-image diffusion models. The key insights are:
The authors derive a cluster-based score function that increases the probability of generating images from the target character's cluster while reducing the probability of drifting into auxiliary clusters (a hedged mathematical sketch of such a score follows this list).
They construct a cluster-conditioned model that takes the generated samples as the cluster representation and transforms them into a semantic offset that steers the denoising trajectory (see the projector sketch below).
During tuning, the authors devise auxiliary components to simultaneously augment the tuning and regulate the inference, which significantly enhances the content diversity of generated images.
The authors prove that the semantic space of text-to-image diffusion models shares the same interpolation property as the latent space, which can be leveraged for fine generation control (see the interpolation sketch below).
Comprehensive experiments show that OneActor outperforms a variety of baselines in terms of character consistency, prompt conformity, and image quality, while being at least 4x faster than tuning-based baselines.
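To make the first insight concrete, one plausible way to write a cluster-guided score is in classifier-guidance style: amplify the gradient toward the target cluster and subtract gradients toward auxiliary clusters. The notation below (target condition c_tar, auxiliary conditions c_aux, guidance weights ω and η, N auxiliary clusters) is an illustrative assumption, not the paper's own formulation:

```latex
% Hedged sketch of a cluster-guided score (assumed notation).
% The target-cluster term is amplified; auxiliary clusters are repelled.
\nabla_{x_t}\log \tilde{p}(x_t \mid c)
  = \nabla_{x_t}\log p(x_t \mid c)
  + \omega\,\nabla_{x_t}\log p\big(c_{\mathrm{tar}} \mid x_t\big)
  - \frac{\eta}{N}\sum_{i=1}^{N}\nabla_{x_t}\log p\big(c_{\mathrm{aux}}^{(i)} \mid x_t\big)
```

By Bayes' rule, each classifier term can be rewritten as a difference of conditional and unconditional scores, both of which the pre-trained denoiser already approximates, so no external classifier is strictly required.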
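The cluster-conditioned guidance of the second insight can be pictured as a small projector that maps the pooled features of the generated samples to an offset added to the prompt embedding. The sketch below is a minimal illustration under assumed names and shapes (`OffsetProjector`, `d_model`, `guide_prompt`); it is not the paper's implementation:

```python
import torch
import torch.nn as nn

class OffsetProjector(nn.Module):
    """Maps a pooled cluster representation to a semantic offset
    added to the prompt embedding (hypothetical module, assumed shapes)."""
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, cluster_repr: torch.Tensor) -> torch.Tensor:
        # (1, d_model) cluster representation -> (1, d_model) semantic offset
        return self.net(cluster_repr)

def guide_prompt(prompt_emb: torch.Tensor,
                 target_feats: torch.Tensor,
                 projector: OffsetProjector) -> torch.Tensor:
    """Shift the prompt embedding toward the target character's cluster."""
    cluster_repr = target_feats.mean(dim=0, keepdim=True)  # pool target samples
    offset = projector(cluster_repr)                       # semantic offset
    return prompt_emb + offset  # broadcasts over (batch, seq, d_model)
```

The denoising U-Net then consumes the shifted embedding exactly as it would any text condition, so the base model stays frozen and only the lightweight projector is tuned.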
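Because the semantic space is shown to interpolate like the latent space, fine generation control can in principle be obtained by linearly blending semantic offsets. The blending knob `alpha` below is an assumed parameter for illustration, not one named in the paper:

```python
def interpolate_offsets(offset_a: torch.Tensor,
                        offset_b: torch.Tensor,
                        alpha: float) -> torch.Tensor:
    """Linear interpolation between two semantic offsets (0 <= alpha <= 1),
    mirroring latent-space interpolation for gradual identity blending."""
    return (1.0 - alpha) * offset_a + alpha * offset_b

# e.g. a half-way blend between a neutral offset and the character offset:
# offset = interpolate_offsets(offset_neutral, offset_char, alpha=0.5)
```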
Statistics
The paper reports an average tuning time of 5 minutes for OneActor, at least 4x faster than tuning-based baselines such as TheChosenOne (20 minutes on average).
Quotes
"We argue that a lightweight but intricate guidance is enough to function."
"We first prove that the semantic space has the same interpolation property as the latent space dose. This property can serve as another promising tool for fine generation control."