Core Concepts
This paper proposes a new variational bound that tightly approximates the multi-modal data log-likelihood, and develops more flexible aggregation schemes, based on permutation-invariant neural networks, for encoding latent variables from different modality subsets.
Abstract
The paper addresses the problem of learning deep latent variable models for multi-modal data, where each sample has features from distinct sources. The authors consider a latent variable model setup, where the joint generative model factorizes as the product of the prior density and the decoding distributions for each modality.
The key contributions are:
A new variational bound that can tightly approximate the multi-modal data log-likelihood, avoiding a limitation of mixture-based objectives, which may not be tight lower bounds on the joint log-likelihood.
New multi-modal aggregation schemes based on permutation-invariant neural networks, such as DeepSets and attention models, which yield more expressive multi-modal encoding distributions compared to previous Mixture-of-Experts (MoE) or Product-of-Experts (PoE) approaches.
An information-theoretic perspective showing that the proposed variational objective can be interpreted as a relaxation of bounds on marginal and conditional mutual information.
Experiments illustrating the trade-offs of different variational bounds and aggregation schemes, and demonstrating the benefits of tighter variational bounds and more flexible aggregation models when approximating the true joint distribution over observed modalities and latent variables.
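As background for the aggregation schemes the paper compares against, a minimal Product-of-Experts sketch (illustrative, not the authors' code): the product of diagonal-Gaussian experts is itself Gaussian, with precision equal to the sum of the experts' precisions and mean equal to the precision-weighted average of the experts' means.

```python
import numpy as np

def poe_gaussian(means, logvars):
    """Product-of-Experts for diagonal Gaussian unimodal encoders.

    The (normalized) product of Gaussian densities is again Gaussian:
    joint precision = sum of expert precisions, joint mean = the
    precision-weighted average of expert means.
    """
    precisions = np.exp(-np.asarray(logvars))        # 1 / sigma^2 per expert
    joint_precision = precisions.sum(axis=0)
    joint_var = 1.0 / joint_precision
    joint_mean = (precisions * np.asarray(means)).sum(axis=0) * joint_var
    return joint_mean, joint_var

# Two unimodal encoders reporting (mean, log-variance) for a 1-D latent:
mean, var = poe_gaussian(means=[[0.0], [2.0]], logvars=[[0.0], [0.0]])
# Equal precisions, so the joint mean is the average of the expert means.
```

Note the fixed functional form: PoE (and likewise MoE) combines experts by a hard-coded rule, which is exactly the restriction the paper's learned permutation-invariant aggregators relax.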
Stats
The joint generative model factorizes as pθ(z, x) = pθ(z) ∏_s pθ(x_s | z), where z is the latent variable shared across modalities and x_s is the observation for modality s.
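This factorization can be illustrated with ancestral sampling: draw z from the prior, then decode each modality independently given z. A hypothetical sketch with linear-Gaussian decoders (the paper's actual decoder architectures are not specified here):

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, modality_dims = 4, {"image": 8, "text": 6}

# Hypothetical linear decoders, one weight matrix W_s per modality s.
decoders = {s: rng.normal(size=(d, latent_dim)) for s, d in modality_dims.items()}

def sample_joint():
    """Ancestral sampling from p(z) * prod_s p(x_s | z)."""
    z = rng.normal(size=latent_dim)                    # z ~ p(z) = N(0, I)
    x = {s: W @ z + 0.1 * rng.normal(size=W.shape[0])  # x_s ~ N(W_s z, 0.1^2 I)
         for s, W in decoders.items()}
    return z, x

z, x = sample_joint()
```

The modalities are conditionally independent given z, so each x_s is decoded separately once z is drawn.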
The proposed variational bound is L(x, θ, ϕ, β) = ∫ ρ(S) [L_S(x_S, θ, ϕ, β) + L_{\S}(x, θ, ϕ, β)] dS, where ρ is a distribution over modality subsets S, L_S is the marginal variational bound for the modalities in S, and L_{\S} is the conditional bound for the remaining modalities given S.
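The integral over subsets can be estimated by Monte Carlo: draw a subset S from ρ and average the marginal and conditional terms. A schematic sketch with placeholder bound terms and a uniform ρ over non-empty strict subsets (both illustrative assumptions; the real L_S and L_{\S} involve the encoders and decoders):

```python
import itertools
import random

modalities = ["image", "text", "audio"]

# Illustrative rho: uniform over non-empty strict subsets S of the modalities.
subsets = [set(c) for r in range(1, len(modalities))
           for c in itertools.combinations(modalities, r)]

def bound_estimate(L_S, L_notS, num_samples=100, seed=0):
    """Monte Carlo estimate of the expectation of L_S + L_notS under rho."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        S = rng.choice(subsets)          # S ~ rho
        total += L_S(S) + L_notS(S)      # marginal + conditional bound terms
    return total / num_samples

# Placeholder bound terms: constants, so the estimate is exact here.
est = bound_estimate(lambda S: -1.0, lambda S: -0.5)
# est == -1.5
```

In training, one would plug in stochastic estimates of L_S and L_{\S} and average over mini-batches rather than the fixed constants used here.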
The permutation-invariant aggregation schemes, such as DeepSets and Set Transformers, yield more flexible multi-modal encoding distributions than MoE or PoE aggregation.
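A minimal DeepSets-style aggregator, as a hypothetical sketch rather than the paper's model: each available modality's features pass through a shared network φ, are sum-pooled (a permutation-invariant operation), and a second network ψ maps the pooled vector to the parameters of the multi-modal encoding distribution. Random weights stand in for trained ones.

```python
import numpy as np

rng = np.random.default_rng(1)
feat_dim, hidden, latent_dim = 6, 16, 4

# Shared per-modality network phi and post-pooling network psi
# (single linear layers with random weights, for illustration only).
W_phi = rng.normal(size=(hidden, feat_dim))
W_psi = rng.normal(size=(2 * latent_dim, hidden))

def deepsets_encode(features):
    """Map a *set* of modality features to Gaussian posterior parameters.

    Sum pooling makes the output invariant to the order (and robust to the
    number) of available modalities, so any modality subset can be encoded.
    """
    pooled = sum(np.tanh(W_phi @ f) for f in features)  # permutation-invariant
    out = W_psi @ pooled
    mean, logvar = out[:latent_dim], out[latent_dim:]
    return mean, logvar

feats = [rng.normal(size=feat_dim) for _ in range(3)]
m1, _ = deepsets_encode(feats)
m2, _ = deepsets_encode(feats[::-1])  # same set, different order
# m1 and m2 agree up to floating-point rounding: order does not matter.
```

Unlike PoE's fixed product rule, φ and ψ are learned, which is what makes the resulting encoding distributions more expressive.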
Quotes
"Devising deep latent variable models for multi-modal data has been a long-standing theme in machine learning research."
"To encode latent variables from different modality subsets, Product-of-Experts (PoE) or Mixture-of-Experts (MoE) aggregation schemes have been routinely used and shown to yield different trade-offs, for instance, regarding their generative quality or consistency across multiple modalities."
"We develop more flexible aggregation schemes that generalize PoE or MoE approaches by combining encoded features from different modalities based on permutation-invariant neural networks."