Core Concepts
Distilled text-to-image models can generate some samples that are superior to their teacher models, especially when the student samples significantly diverge from the teacher. An adaptive teacher-student collaborative approach can leverage these superior student samples to improve the overall text-to-image generation performance.
Abstract
The paper investigates the performance of distilled text-to-image diffusion models and finds that they can outperform their teacher models on a substantial fraction of generated samples, particularly when the student outputs diverge strongly from the teacher's.
The key findings are:
The distilled student models can surpass their teacher models in a substantial portion of image samples, up to 30% in some cases.
Student wins are more likely to occur when the student samples are highly distinct from the corresponding teacher samples.
Highly complex teacher samples and longer text prompts tend to lead to greater divergence between the student and teacher outputs.
Straighter trajectories of the teacher model during sampling result in more similar student and teacher samples.
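The findings above rest on measuring how far each student sample lies from its paired teacher sample. As a minimal sketch of how such a divergence could be quantified, the snippet below computes a per-sample pixel-space L2 distance between paired student and teacher outputs and ranks the most divergent pairs; the function name and the choice of pixel-space distance are illustrative assumptions, not the paper's exact metric (which may operate in a feature or latent space).

```python
import numpy as np

def pairwise_divergence(student_imgs: np.ndarray, teacher_imgs: np.ndarray) -> np.ndarray:
    """Per-sample L2 distance in pixel space between paired outputs."""
    diffs = (student_imgs - teacher_imgs).reshape(len(student_imgs), -1)
    return np.sqrt((diffs ** 2).sum(axis=1))

# Toy usage with random stand-ins for generated images:
# flag the most divergent pairs, where student wins are most likely.
rng = np.random.default_rng(0)
student = rng.random((8, 16, 16, 3))
teacher = rng.random((8, 16, 16, 3))
div = pairwise_divergence(student, teacher)
most_divergent = np.argsort(div)[::-1][:3]  # indices of the top-3 divergent pairs
```

In practice the same ranking could be used to decide which samples deserve a closer look, or as a feature for the oracle described below.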
Based on these observations, the paper proposes an adaptive teacher-student collaborative approach for text-to-image generation. The method first generates an initial sample using the distilled student model, and then an "oracle" decides whether to further improve the sample using the teacher model. This adaptive pipeline outperforms both the individual teacher and student models for various inference budgets in terms of human preference, image fidelity, and textual alignment.
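The adaptive pipeline can be sketched as follows. All components here are hypothetical stand-ins: `student_generate`, `teacher_refine`, and `oracle_score` are placeholder functions (real implementations would wrap a distilled sampler, the full teacher sampler, and a learned quality or preference estimator), and the threshold is an assumed tuning knob controlling the inference budget.

```python
import numpy as np

rng = np.random.default_rng(0)

def student_generate(prompt: str) -> np.ndarray:
    # Stand-in for a fast distilled student (e.g., a few-step sampler).
    return np.tanh(rng.standard_normal((16, 16, 3)))

def teacher_refine(sample: np.ndarray, prompt: str) -> np.ndarray:
    # Stand-in for the slower teacher further improving the student sample.
    return np.clip(sample * 0.9, -1.0, 1.0)

def oracle_score(sample: np.ndarray, prompt: str) -> float:
    # Hypothetical quality estimator; higher means the sample looks better.
    return float(np.abs(sample).mean())

def adaptive_generate(prompt: str, threshold: float = 0.5) -> np.ndarray:
    sample = student_generate(prompt)
    # Spend the teacher's extra inference budget only when the oracle
    # judges the student sample insufficient.
    if oracle_score(sample, prompt) < threshold:
        sample = teacher_refine(sample, prompt)
    return sample

img = adaptive_generate("a red fox in the snow")
```

Raising the threshold routes more samples through the teacher (higher quality, higher cost); lowering it keeps more of the cheap student outputs, which is how the pipeline trades off quality against inference budget.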
The approach is also evaluated on text-guided image editing and controllable generation tasks, demonstrating its versatility and effectiveness.