Efficient Scaling of Text-to-Image Diffusion Models for Multi-Modal Generation
A novel variance-based feature fusion strategy that enables efficient scaling of text-to-image diffusion models to accommodate new conditioning modalities without retraining.
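To make the idea of variance-based fusion concrete, the following is a minimal sketch, not the method itself: the summary above does not specify the fusion rule, so the weighting used here (each modality's weight proportional to the variance of its features) is an assumption chosen only for illustration. The function name `variance_weighted_fusion` and its parameters are likewise hypothetical.

```python
from statistics import pvariance


def variance_weighted_fusion(feature_sets, eps=1e-8):
    """Fuse per-modality feature vectors with variance-proportional weights.

    ASSUMPTION (not taken from the source): weight_k = var_k / sum_j var_j,
    where var_k is the population variance of modality k's features.

    feature_sets: list of equal-length lists of floats, one per modality.
    Returns (fused_vector, weights).
    """
    if not feature_sets:
        raise ValueError("need at least one feature vector")
    length = len(feature_sets[0])
    if any(len(f) != length for f in feature_sets):
        raise ValueError("all feature vectors must have the same length")

    variances = [pvariance(f) for f in feature_sets]
    total = sum(variances)
    if total < eps:
        # Every input is (near-)constant: fall back to a plain average.
        weights = [1.0 / len(feature_sets)] * len(feature_sets)
    else:
        weights = [v / total for v in variances]

    # Weighted sum across modalities, computed element by element.
    fused = [
        sum(w * f[i] for w, f in zip(weights, feature_sets))
        for i in range(length)
    ]
    return fused, weights


# Hypothetical features from two conditioning branches (e.g. depth and pose).
depth_features = [0.0, 2.0, 0.0, 2.0]  # informative: variance 1.0
pose_features = [1.0, 1.0, 1.0, 1.0]   # constant: variance 0.0
fused, weights = variance_weighted_fusion([depth_features, pose_features])
```

Under this assumed rule, a branch whose features are constant contributes nothing to the fused output, and no learned parameters are involved, which is consistent with the claim that new modalities can be added without retraining.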