Core Concepts
Proposes methods that address imbalanced and ineffective training of the ADM diffusion model architecture to improve quality.
Abstract
Diffusion models dominate image synthesis on large-scale datasets.
Identifying and correcting imbalanced and ineffective factors during training improves quality.
Redesigning network layers so that activations, weights, and updates keep their expected magnitudes is key.
The proposed method improves the previous record FID for ImageNet-512 synthesis to 1.81.
A method for setting EMA parameters is also proposed, allowing precise tuning after training.
Introduction:
Diffusion models dominate data-driven image synthesis.
Training dynamics of diffusion models pose challenges due to stochastic loss function.
Goal is to understand and address imbalances in training dynamics.
Improving the training dynamics:
Preliminary changes (CONFIG B):
Tuning hyperparameters and disabling self-attention at 32x32 resolution.
Addressing loss weighting standardization issue in EDM training setup.
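The loss-weighting fix can be sketched as a learned, per-noise-level uncertainty weighting. This is a hedged sketch, not the paper's exact parameterization: the MLP size and the log-sigma input encoding are assumptions.

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Learned per-noise-level loss weighting (sketch).

    A small MLP predicts a log-uncertainty u(sigma); the training loss
    mse / exp(u) + u then equalizes the effective gradient contribution
    across noise levels instead of relying on a fixed hand-tuned weighting.
    """
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, 1))

    def forward(self, mse_per_sample, sigma):
        # mse_per_sample: (N,) denoising MSE per sample; sigma: (N,) noise levels.
        u = self.net(sigma.log().unsqueeze(-1)).squeeze(-1)
        return (mse_per_sample / u.exp() + u).mean()
```

The `u` term penalizes claiming high uncertainty everywhere, so the network learns a weighting that balances easy and hard noise levels.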
Standardizing activation magnitudes (CONFIG D):
Introducing magnitude-preserving learned layers to control activation magnitudes.
Successfully eliminating magnitude drift and improving FID from 6.96 to 3.75.
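A minimal sketch of a magnitude-preserving learned layer: the weights are re-normalized at every use so that a unit-variance input yields a unit-variance output on expectation. This is an illustrative simplification of the idea, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

def normalize_rows(w, eps=1e-4):
    # Normalize each output unit's weight vector to unit L2 norm.
    norm = torch.linalg.vector_norm(w, dim=tuple(range(1, w.ndim)), keepdim=True)
    return w / (norm + eps)

class MPLinear(nn.Module):
    """Magnitude-preserving linear layer (sketch).

    With unit-norm weight rows and i.i.d. unit-variance inputs, each output
    element also has unit variance, so activation magnitudes cannot drift
    as training changes the raw weight scale.
    """
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, x):
        return x @ normalize_rows(self.weight).t()
```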
Standardizing weights and updates (CONFIG E):
Controlling weight growth with forced weight normalization.
Unifying effective learning rate with inverse square root decay schedule.
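Both ideas can be sketched in a few lines: after each optimizer step, weights are forcibly rescaled so their norm cannot grow, and the learning rate stays constant until a reference point, then decays as an inverse square root. The `lr_ref` and `t_ref` names and values here are placeholders, not the paper's settings.

```python
import math
import torch

def inv_sqrt_lr(step, lr_ref=1e-2, t_ref=10000):
    # Constant lr_ref for the first t_ref steps, then decay proportional
    # to 1/sqrt(step), keeping the effective learning rate predictable.
    return lr_ref / math.sqrt(max(step / t_ref, 1.0))

@torch.no_grad()
def force_normalize_(params):
    # Forced weight normalization: after every optimizer step, rescale each
    # weight tensor back to unit row norm, so weight growth cannot silently
    # shrink the effective learning rate.
    for w in params:
        norm = torch.linalg.vector_norm(w, dim=tuple(range(1, w.ndim)), keepdim=True)
        w.div_(norm + 1e-4)
```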
Removing group normalizations (CONFIG F):
Eliminating data-dependent group normalization layers.
Introducing weaker pixel normalization layers for better results.
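Pixel normalization can be sketched as normalizing each spatial position's channel vector to unit magnitude: a weaker, parameter-free alternative to group normalization. A minimal sketch, assuming NCHW tensors.

```python
import torch

def pixel_norm(x, eps=1e-4):
    # Normalize each spatial position's channel vector to unit RMS magnitude.
    # Unlike group normalization there is no mean subtraction and no learned
    # scale/offset, so the operation is weaker and not data-dependent across
    # the batch, but it keeps activation magnitudes controlled.
    return x / torch.sqrt(x.square().mean(dim=1, keepdim=True) + eps)
```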
Magnitude-preserving fixed-function layers (CONFIG G):
Updating operations to maintain magnitudes on expectation.
Achieving competitive FID of 2.56 with simplified architecture.
Post-hoc EMA:
A method is presented for setting the exponential moving average (EMA) parameters after the training run has finished.
Allows precise tuning of EMA length without multiple training runs, revealing interactions with network architecture, training time, and guidance.
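The post-hoc idea can be sketched with least squares: during training, snapshots averaged under a few fixed EMA profiles are stored; afterwards, any target profile is approximated as a linear combination of the stored profiles, and because averaging is linear in the parameters, the same coefficients combine the snapshots. A simplified sketch: `power_ema_profile` here is an assumed illustrative profile, not the paper's exact formula.

```python
import numpy as np

def power_ema_profile(ts, t_end, gamma):
    # Illustrative power-function averaging profile: the weight given to the
    # checkpoint at time ts[i] grows as (t / t_end)^gamma, then normalizes.
    w = (np.asarray(ts, dtype=np.float64) / t_end) ** gamma
    return w / w.sum()

def posthoc_ema(snapshots, profiles, target):
    # snapshots[i]: parameters averaged under profiles[:, i] during training.
    # Solve least squares for coefficients expressing the target profile as a
    # combination of stored profiles, then combine the snapshots the same way.
    coeffs, *_ = np.linalg.lstsq(profiles, target, rcond=None)
    return sum(c * s for c, s in zip(coeffs, snapshots)), coeffs
```

Since averaging is linear, matching the target profile up to the least-squares residual matches the corresponding averaged weights to the same accuracy, so one training run supports many EMA lengths.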
Stats
The record FID improved from 2.41 to 1.81.
Training run of 2147M images at batch size 2048.
Quotes
"Diffusion models currently dominate the field of data-driven image synthesis."
"Our modifications improve the previous record FID of 2.41 in ImageNet-512 synthesis to 1.81."