Key Concepts
The proposed MT-Diffusion model enables simultaneous modeling and generation of multi-modal data, such as images and labels, within a unified diffusion framework by integrating multi-task learning losses in a principled manner.
Summary
The paper introduces the MT-Diffusion model, a generalization of the standard diffusion model for multi-modal generative modeling. The key ideas are:
Forward Process:
- The forward diffusion process is defined to integrate information from multiple data modalities (e.g., images and labels) through a forward aggregation step.
- This allows the diffusion process to operate in a shared latent space that can capture the underlying structure across different data types.
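As a concrete illustration, here is a minimal sketch of the aggregation step in PyTorch. The function name, the per-modality `encoders`, and the weight `w_t` are assumptions for illustration, and summing encoder outputs is just one plausible aggregation, not necessarily the paper's exact scheme:

```python
import torch

def forward_aggregate(z_prev, modalities, encoders, alpha_t, w_t):
    """Sample z_t ~ N(sqrt(alpha_t) * (z_{t-1} + w_t * E(x)), (1 - alpha_t) * I).

    `encoders` are illustrative per-modality encoders mapping each modality
    into the shared diffusion space; summing their outputs is one plausible
    aggregation, not necessarily the paper's.
    """
    agg = sum(enc(x) for enc, x in zip(encoders, modalities))
    mean = alpha_t ** 0.5 * (z_prev + w_t * agg)
    std = (1.0 - alpha_t) ** 0.5
    return mean + std * torch.randn_like(z_prev)
```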
Reverse Process:
- The reverse denoising process is parameterized with a shared U-Net backbone and modality-specific decoder heads.
- This enables simultaneous generation of multi-modal data by decoding the shared diffusion latent code back to the individual data spaces.
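A minimal sketch of this parameterization, assuming a shared `backbone` network and a dictionary of per-modality `heads` (all module names are illustrative, not the paper's actual architecture):

```python
import torch.nn as nn

class MTReverseModel(nn.Module):
    """Shared U-Net backbone with one lightweight decoder head per modality."""

    def __init__(self, backbone: nn.Module, heads: dict):
        super().__init__()
        self.backbone = backbone            # shared denoising network over z_t
        self.heads = nn.ModuleDict(heads)   # e.g. {"image": ..., "label": ...}

    def forward(self, z_t, t):
        h = self.backbone(z_t, t)           # shared latent features
        # Each head decodes the shared representation into its own data space.
        decoded = {name: head(h) for name, head in self.heads.items()}
        return h, decoded
```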
Multi-Task Loss:
- A new multi-task evidence lower bound (ELBO) is derived to integrate the multi-modal generation objectives into a unified loss function.
- This allows the model to be trained end-to-end on multiple generation tasks simultaneously.
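A minimal sketch of what such a combined objective could look like; the per-modality weights and the use of MSE everywhere are illustrative stand-ins for the terms the paper derives from its ELBO:

```python
import torch.nn.functional as F

def mt_loss(z_pred, z_target, decoded, targets, weights):
    """Weighted sum of the denoising term and per-modality decoding terms."""
    # Standard diffusion (denoising) loss in the shared latent space.
    loss = F.mse_loss(z_pred, z_target)
    # One reconstruction term per modality; MSE is a stand-in for the
    # modality-appropriate likelihood (e.g. cross-entropy for labels).
    for name, pred in decoded.items():
        loss = loss + weights[name] * F.mse_loss(pred, targets[name])
    return loss
```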
The paper evaluates MT-Diffusion on several practical multi-task generative learning scenarios, including image transition, masked-image training, joint image-label generation, and joint image-representation generation. The results demonstrate the effectiveness of the proposed framework in generating diverse data types simultaneously without degrading performance on the individual tasks.
Stats
The forward diffusion process is defined as $q(\mathbf{z}_t \mid \mathbf{z}_{t-1}, \mathbf{x}_1, \dots, \mathbf{x}_N) = \mathcal{N}\big(\mathbf{z}_t;\ \sqrt{\alpha_t}\,(\mathbf{z}_{t-1} + w_t E(\mathbf{x})),\ (1 - \alpha_t)\,\mathbf{I}\big)$, where $\mathbf{x}_1, \dots, \mathbf{x}_N$ are the multi-modal data and $E(\cdot)$ encodes them into the shared diffusion space.
The reverse process is parameterized as $p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t) = \mathcal{N}\big(\mathbf{z}_{t-1};\ \mu_\theta(\mathbf{z}_t, t),\ \Sigma_\theta(\mathbf{z}_t, t)\big)$, together with decoders $p_\theta(\mathbf{x}_i \mid \mathbf{z}_t)$ that map the diffusion latent code back to the individual data spaces.
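As a worked illustration of this reverse kernel, a single sampling step might look as follows, assuming for simplicity a fixed isotropic standard deviation $\sigma_t$ in place of the learned $\Sigma_\theta$:

```python
import torch

@torch.no_grad()
def reverse_step(model, z_t, t, sigma_t):
    # The network predicts mu_theta(z_t, t); we then sample
    # z_{t-1} ~ N(mu_theta(z_t, t), sigma_t^2 * I).
    mu = model(z_t, t)
    return mu + sigma_t * torch.randn_like(z_t)
```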
Citations
"Can we generalize diffusion models with the ability of multi-modal generative training for more generalizable modeling?"
"We propose a principled way to define a diffusion model by constructing a unified multi-modal diffusion model in a common diffusion space."
"MT-Diffusion enables simultaneous modeling and generation of multi-modal data with a unified diffusion model."