Core Concepts
This work proposes AdvNF, an adversarially trained conditional normalizing flow that can effectively model complex multi-modal distributions, such as those encountered in physical systems, and overcome the mode collapse that plagues standard conditional normalizing flow models.
Summary
The paper systematically studies the problem of mode collapse in conditional normalizing flows (CNFs) and proposes a solution using adversarial training. The key highlights are:
- CNFs trained via reverse KL divergence (RKL) suffer from severe mode collapse, while those trained via forward KL divergence (FKL) have high variance in sample statistics.
- The authors propose AdvNF, which combines adversarial training with CNFs to mitigate mode collapse (a generic sketch of this combination follows the list). AdvNF is evaluated on synthetic 2D datasets (MOG-4, MOG-8, Rings-4) as well as the XY model and extended XY model datasets.
- Experiments show that AdvNF, especially the RKL variant, outperforms standard CNF models and other generative baselines such as GANs and VAEs on metrics including negative log-likelihood, percent overlap, earth mover's distance, and acceptance rate.
- The authors also show that AdvNF (RKL) can achieve similar performance even with a smaller ensemble size for training, reducing the dependence on expensive MCMC sampling.
- The paper provides insights into the mode-covering and mode-seeking behaviors of KL divergence-based training of normalizing flows and how adversarial training can help overcome mode collapse.
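To make the idea of pairing a flow's likelihood objective with an adversarial objective concrete, here is a minimal, self-contained PyTorch sketch. It is not the paper's model: it uses an unconditional single affine flow, an illustrative bimodal dataset standing in for an MCMC ensemble, and an FKL (maximum-likelihood) term combined with a non-saturating GAN loss; AdvNF itself is a conditional flow with its own architecture and loss formulation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class AffineFlow(nn.Module):
    """Trivial normalizing flow: x = mu + exp(log_sigma) * z with z ~ N(0, I)."""
    def __init__(self, dim=2):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(dim))
        self.log_sigma = nn.Parameter(torch.zeros(dim))

    def sample(self, n):
        z = torch.randn(n, self.mu.numel())
        return self.mu + self.log_sigma.exp() * z

    def log_prob(self, x):
        z = (x - self.mu) / self.log_sigma.exp()
        log_base = -0.5 * (z ** 2).sum(dim=1) - 0.5 * x.shape[1] * torch.log(torch.tensor(2.0 * torch.pi))
        return log_base - self.log_sigma.sum()  # change-of-variables correction

def sample_data(n):
    """Illustrative bimodal 'training ensemble' standing in for MCMC samples."""
    modes = torch.tensor([[-2.0, 0.0], [2.0, 0.0]])
    return modes[torch.randint(0, 2, (n,))] + 0.3 * torch.randn(n, 2)

flow = AffineFlow()
disc = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))  # discriminator
opt_g = torch.optim.Adam(flow.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    real, fake = sample_data(256), flow.sample(256)
    # Discriminator step: distinguish ensemble samples (label 1) from flow samples (label 0).
    d_loss = bce(disc(real), torch.ones(256, 1)) + bce(disc(fake.detach()), torch.zeros(256, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator step: non-saturating adversarial term plus an FKL (negative log-likelihood) term.
    g_loss = bce(disc(fake), torch.ones(256, 1)) - flow.log_prob(real).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```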
Statistics
The probability density function for the mixture of Gaussians (MOG) dataset is given by p(x) = Σ_i a_i N(x; μ_i, Σ_i), where N(x; μ_i, Σ_i) represents a Gaussian distribution with mean μ_i and covariance Σ_i, and a_i are the weights of the Gaussian components.
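As an illustration (not the paper's exact parameters), a short NumPy sketch that samples from such a mixture; the means, covariances, and weights below are placeholder values in the spirit of MOG-4.

```python
import numpy as np

def sample_mog(n, means, covs, weights, rng=None):
    """Draw n samples from p(x) = sum_i a_i N(x; mu_i, Sigma_i)."""
    rng = np.random.default_rng() if rng is None else rng
    means, covs = np.asarray(means), np.asarray(covs)
    comps = rng.choice(len(weights), size=n, p=weights)  # pick a component per sample
    return np.stack([rng.multivariate_normal(means[k], covs[k]) for k in comps])

# Illustrative MOG-4-style mixture: four equally weighted modes on a square.
means = [[-2, -2], [-2, 2], [2, -2], [2, 2]]
covs = [0.1 * np.eye(2)] * 4
samples = sample_mog(1000, means, covs, weights=[0.25] * 4)
```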
The probability density function for the concentric rings (Rings) dataset is given by p(r,θ) = Σ_i a_i N(r; r_i, σ_i^2) U(θ; 0, 2π), where N(r; r_i, σ_i^2) is a Gaussian distribution over the radial coordinate r with mean r_i and variance σ_i^2, and U(θ; 0, 2π) is a uniform distribution over the angular coordinate θ.
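Similarly, a NumPy sketch for the concentric-rings density: the radius is drawn from the Gaussian component and the angle uniformly, then the sample is mapped to Cartesian coordinates. The radii and widths are illustrative, not the paper's Rings-4 settings.

```python
import numpy as np

def sample_rings(n, radii, sigmas, weights, rng=None):
    """Draw n samples from p(r, theta) = sum_i a_i N(r; r_i, sigma_i^2) U(theta; 0, 2*pi)."""
    rng = np.random.default_rng() if rng is None else rng
    comps = rng.choice(len(weights), size=n, p=weights)
    r = rng.normal(np.asarray(radii)[comps], np.asarray(sigmas)[comps])  # radial coordinate
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)                        # angular coordinate
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)      # to Cartesian x, y

# Illustrative Rings-4-style target: four equally weighted concentric rings.
samples = sample_rings(1000, radii=[1.0, 2.0, 3.0, 4.0], sigmas=[0.1] * 4, weights=[0.25] * 4)
```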
The energy function for the XY model is given by E_XY(θ) = -J Σ_<i,j> cos(θ_i - θ_j), where θ_i represents the spin angle at site i, J is the coupling constant, and the sum is over nearest neighbor pairs.
The energy function for the extended XY model is given by E_EXY(θ) = -J Σ_<i,j> cos(θ_i - θ_j) - K Σ_(i,j,k,l)∈□ cos(θ_i - θ_j + θ_k - θ_l), where the second term represents the ring-exchange interaction.
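Both energies can be evaluated with a few lines of NumPy. The sketch below assumes an L x L lattice of spin angles with periodic boundary conditions, and the ordering of the four sites around each plaquette in the ring-exchange term is an assumption made for illustration.

```python
import numpy as np

def xy_energy(theta, J=1.0):
    """E_XY = -J * sum over nearest-neighbour pairs of cos(theta_i - theta_j),
    on an L x L lattice of spin angles with periodic boundaries (assumed here)."""
    right = np.roll(theta, -1, axis=1)  # neighbour to the right
    down = np.roll(theta, -1, axis=0)   # neighbour below
    return -J * (np.cos(theta - right).sum() + np.cos(theta - down).sum())

def exy_energy(theta, J=1.0, K=1.0):
    """E_EXY adds the ring-exchange term -K * sum over plaquettes of
    cos(theta_i - theta_j + theta_k - theta_l); the site ordering around
    each plaquette used below is an assumption."""
    right = np.roll(theta, -1, axis=1)
    down = np.roll(theta, -1, axis=0)
    diag = np.roll(right, -1, axis=0)            # site diagonally down-right
    ring = np.cos(theta - right + diag - down)   # one term per plaquette
    return xy_energy(theta, J) - K * ring.sum()

theta = np.random.default_rng(0).uniform(0.0, 2.0 * np.pi, size=(8, 8))
print(xy_energy(theta), exy_energy(theta))
```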
Quotes
"When the model is trained via forward KL divergence (FKL), i.e., DKL(p||q) = ∫ p(x)ln(p(x)/q(x))dx, the model has mode-covering behaviour, which implies covering all the modes and, in addition, including other regions in the sample space where the distribution assigns a very low probability mass."
"When the model is trained via reverse KL divergence (RKL), i.e., DKL(q||p) = ∫ q(x)ln(q(x)/p(x))dx, the model has mode-seeking behaviour, which could be explained as follows: If p(x) is near zero while q(x) is nonzero, then RKL →∞ penalises and forces q(x) to be near zero. When q(x) is near zero and p(x) is nonzero, KL divergence would be low and thus does not penalise it. This causes q to choose any mode when p is multimodal, resulting in a concentration of probability density on that mode and ignoring the other high-density modes."