Key concepts
Diffusion models learn semantically meaningful representations through three distinct learning phases, but the learned representations are not fully factorized, even when the model is trained on imbalanced datasets.
Summary
The authors conduct a controlled study on a toy conditional diffusion model that learns to generate 2D Gaussian bumps at various x- and y-positions (a minimal sketch of this setup follows the list below). They observe three distinct phases in the learning process:
- Phase A: The learned representation has no particular structure, and the generated images contain either no Gaussian bump or multiple bumps at incorrect locations.
- Phase B: The learned representation forms a disordered, quasi-2D manifold, and the generated images contain a single Gaussian bump, but at an incorrect location.
- Phase C: The learned representation forms an ordered 2D manifold, and the generated images contain a single Gaussian bump at the desired location.
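To make the toy setup concrete, here is a minimal sketch (in NumPy) of the kind of training data described above: each image contains a single isotropic Gaussian bump, and its center (x, y) serves as the conditioning label for the diffusion model. The image size, bump width, and position increments are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np

def gaussian_bump_image(x0, y0, size=32, sigma=1.5):
    """Render a single 2D Gaussian bump centered at (x0, y0) on a size x size grid."""
    ys, xs = np.mgrid[0:size, 0:size]
    return np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2.0 * sigma ** 2))

def make_dataset(x_step=1.0, y_step=2.0, size=32, sigma=1.5):
    """Build (image, condition) pairs on a grid of bump centers.

    Different step sizes along x and y mimic the imbalanced datasets in the
    study, where one coordinate is sampled with finer increments than the other.
    """
    xs = np.arange(2.0, size - 2.0, x_step)   # finer increments along x
    ys = np.arange(2.0, size - 2.0, y_step)   # coarser increments along y
    images, conds = [], []
    for x0 in xs:
        for y0 in ys:
            images.append(gaussian_bump_image(x0, y0, size, sigma))
            conds.append((x0, y0))            # conditioning label for the diffusion model
    return np.stack(images), np.array(conds)

images, conds = make_dataset()
print(images.shape, conds.shape)  # e.g. (392, 32, 32) (392, 2)
```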
The authors find that the formation of an ordered manifold (Phase C) is a strong indicator of good model performance. Datasets with smaller increments between bump positions (i.e., denser coverage of the position space) lead to faster learning of the desired representation.
However, even with imbalanced datasets, where one feature (the x- or y-position) is sampled more finely than the other, the model does not learn a fully factorized representation: the x- and y-positions are learned at coupled rates rather than independently.
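One way to quantify this coupling, echoed in the R-squared statistic reported below, is to probe the model's internal representation with separate regressions onto the x- and y-positions and track the two scores across training checkpoints. The sketch below assumes `reps` holds extracted activations (one row per sample) and `conds` the corresponding ground-truth positions; the linear probe is an illustrative choice, not necessarily the authors' exact fitting procedure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def positional_r2(reps, conds):
    """Fit the representation to each coordinate separately and return (R^2_x, R^2_y).

    reps:  (n_samples, n_features) internal activations of the diffusion model
    conds: (n_samples, 2) ground-truth (x, y) bump positions
    """
    scores = []
    for axis in range(2):
        probe = LinearRegression().fit(reps, conds[:, axis])
        scores.append(r2_score(conds[:, axis], probe.predict(reps)))
    return tuple(scores)

# Tracking these two scores across checkpoints shows whether they rise
# together (coupled) or independently (factorized).
```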
Overall, the results suggest that diffusion models can learn semantically meaningful representations, but may not achieve fully efficient, factorized representations, which could limit their compositional generalization abilities.
Statistics
The accuracy of generating Gaussian bumps at the correct x locations is generally lower than the accuracy at the correct y locations, even when the dataset has more fine-grained information about the x-positions.
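A simple per-axis scoring procedure consistent with this kind of statistic: read off the generated bump location from the brightest pixel and count a sample as correct along an axis if it lands within a tolerance of the target. The argmax localization and the tolerance value are illustrative assumptions, not the paper's exact evaluation protocol.

```python
import numpy as np

def bump_location(image):
    """Estimate the bump center as the coordinates of the brightest pixel."""
    y, x = np.unravel_index(np.argmax(image), image.shape)
    return float(x), float(y)

def per_axis_accuracy(generated, targets, tol=1.0):
    """Fraction of samples whose generated bump lands within `tol` of the target,
    computed separately for the x- and y-coordinates."""
    locs = np.array([bump_location(img) for img in generated])   # (n, 2) as (x, y)
    hits = np.abs(locs - targets) <= tol                         # targets: (n, 2) as (x, y)
    return hits[:, 0].mean(), hits[:, 1].mean()                  # (acc_x, acc_y)
```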
The R-squared values in fitting to the x- and y-positions are strongly coupled, even under imbalanced datasets.
Quotes
"Despite having more data with finer-grained information of the x-positions, the accuracy of generating Gaussian bumps at the correct y locations is generally higher than that at generating at the correct x locations."
"The R-squared values in fitting to the x- and the y-positions are strongly coupled, which could be indicative that the representations learned are coupled rather than factorized."