The paper proposes a training-free solution, called MaxFusion, to scale text-to-image diffusion models for multi-modal generation. The key insight is that the intermediate feature maps of diffusion models conditioned on different modalities remain spatially aligned, and that the variance of these features at each location reflects how strongly the corresponding conditioning signal influences that region.
Leveraging these observations, the authors introduce a feature fusion strategy that selectively combines aligned features based on their relative variance. This allows the diffusion model to incorporate multiple conditioning modalities, such as depth maps, segmentation masks, and edge maps, without retraining.
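A minimal sketch of this variance-based selection rule is shown below, assuming the variance is taken over the channel dimension of spatially aligned (B, C, H, W) feature maps; the function name, tensor shapes, and tie-breaking choice are illustrative assumptions rather than the authors' exact implementation.

```python
import torch

def max_variance_fusion(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Fuse two aligned feature maps by keeping, at every spatial location,
    the feature whose channel-wise variance is larger.

    feat_a, feat_b: tensors of shape (B, C, H, W) from two conditioned branches.
    """
    # Per-location variance across the channel dimension -> (B, 1, H, W)
    var_a = feat_a.var(dim=1, keepdim=True)
    var_b = feat_b.var(dim=1, keepdim=True)

    # Binary mask selecting the branch with the higher variance at each location
    mask = (var_a >= var_b).to(feat_a.dtype)

    # Broadcast the mask over channels and keep the dominant feature
    return mask * feat_a + (1.0 - mask) * feat_b


# Example: fuse features from a depth-conditioned and an edge-conditioned branch
if __name__ == "__main__":
    depth_feat = torch.randn(1, 320, 64, 64)
    edge_feat = torch.randn(1, 320, 64, 64)
    fused = max_variance_fusion(depth_feat, edge_feat)
    print(fused.shape)  # torch.Size([1, 320, 64, 64])
```

Because the rule operates only on intermediate activations, it can be applied at inference time to models that were never trained jointly, which is what makes the approach training-free.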
The proposed method is evaluated on a synthetic dataset derived from COCO, demonstrating improved performance compared to existing multi-modal conditioning approaches like ControlNet and T2I-Adapter. MaxFusion enables zero-shot multi-modal generation, where individual models trained for different tasks can be combined during inference to create composite scenes. The authors also show that the method can be extended to handle more than two conditioning modalities.
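The extension beyond two modalities can be read as applying the same per-location rule over N branches. A short sketch under that assumption follows; the helper name and the averaging of tied branches are hypothetical details, not taken from the paper.

```python
import torch

def max_variance_fusion_n(features: list[torch.Tensor]) -> torch.Tensor:
    """Fuse N spatially aligned feature maps of shape (B, C, H, W) by keeping,
    at each location, the feature whose channel-wise variance is largest."""
    stacked = torch.stack(features, dim=0)          # (N, B, C, H, W)
    variances = stacked.var(dim=2, keepdim=True)    # (N, B, 1, H, W)
    # One-hot mask over the N branches, per spatial location
    mask = (variances == variances.amax(dim=0, keepdim=True)).to(stacked.dtype)
    # If several branches tie, average their features instead of summing them
    mask = mask / mask.sum(dim=0, keepdim=True)
    return (stacked * mask).sum(dim=0)              # (B, C, H, W)
```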