The paper addresses the problem of learning deep latent variable models for multi-modal data, where each sample has features from distinct sources. The authors consider a latent variable model in which the joint generative distribution factorizes as the product of the prior density and a decoding distribution for each modality.
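Concretely, for M modalities and a latent variable z, this factorization takes the standard form (generic notation for illustration; the paper's own notation may differ):

$$ p_\theta(x_{1:M}, z) \;=\; p(z) \prod_{m=1}^{M} p_{\theta_m}(x_m \mid z), $$

where each decoder $p_{\theta_m}(x_m \mid z)$ models one modality given the shared latent variable.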
The key contributions are:
A new variational bound that can tightly approximate the multi-modal log-likelihood, avoiding a limitation of mixture-based bounds, which may not provide tight lower bounds on the joint log-likelihood (see the sketch after this list).
New multi-modal aggregation schemes based on permutation-invariant neural networks, such as DeepSets and attention models, which yield more expressive multi-modal encoding distributions than previous Mixture-of-Experts (MoE) or Product-of-Experts (PoE) approaches (a minimal code sketch follows this list).
An information-theoretic perspective showing that the proposed variational objective can be interpreted as a relaxation of bounds on marginal and conditional mutual information (illustrated after the list).
Experiments illustrating the trade-offs between different variational bounds and aggregation schemes, and demonstrating the benefits of tighter variational bounds and more flexible aggregation models when approximating the true joint distribution over observed modalities and latent variables.
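On the first contribution: with a joint encoding distribution $q_\phi(z \mid x_{1:M})$, the standard evidence lower bound reads (again in generic notation, not necessarily the paper's exact bound):

$$ \log p_\theta(x_{1:M}) \;\ge\; \mathbb{E}_{q_\phi(z \mid x_{1:M})}\Big[\log p(z) + \sum_{m=1}^{M} \log p_{\theta_m}(x_m \mid z) - \log q_\phi(z \mid x_{1:M})\Big], $$

with equality when $q_\phi(z \mid x_{1:M})$ matches the true posterior $p_\theta(z \mid x_{1:M})$. Mixture-based approaches instead build the objective from a mixture of per-modality encoders, $q_\phi(z \mid x_{1:M}) = \frac{1}{M}\sum_{m} q_{\phi_m}(z \mid x_m)$, and the resulting gap to the joint log-likelihood may not vanish even with flexible components, which is the limitation the proposed bound is designed to avoid.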
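For the second contribution, the following is a minimal sketch of a DeepSets-style permutation-invariant aggregator producing a Gaussian encoding distribution, written in PyTorch. All module names, sizes, and the shared per-modality embedding network are hypothetical illustration choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DeepSetsAggregator(nn.Module):
    """Permutation-invariant encoder q(z | x_1, ..., x_M).

    Each modality is mapped to an embedding h_m = phi(x_m); the embeddings
    are sum-pooled (a permutation-invariant operation), and rho maps the
    pooled vector to the mean and log-variance of a Gaussian over z.
    """

    def __init__(self, feature_dim: int, hidden_dim: int, latent_dim: int):
        super().__init__()
        # phi: per-modality embedding network (shared here for simplicity;
        # in practice each modality would likely get its own network).
        self.phi = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # rho: maps the pooled embedding to (mean, log-variance) of q(z | .).
        self.rho = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2 * latent_dim),
        )

    def forward(self, modality_features: list[torch.Tensor]):
        # Sum-pooling over modality embeddings makes the output invariant
        # to the ordering of modalities and well-defined for any subset.
        pooled = torch.stack(
            [self.phi(x) for x in modality_features], dim=0
        ).sum(dim=0)
        mean, log_var = self.rho(pooled).chunk(2, dim=-1)
        return mean, log_var

# Usage: three modalities, each pre-embedded to a common feature size.
agg = DeepSetsAggregator(feature_dim=16, hidden_dim=64, latent_dim=8)
xs = [torch.randn(4, 16) for _ in range(3)]  # batch of 4, three modalities
mean, log_var = agg(xs)
z = mean + torch.randn_like(mean) * (0.5 * log_var).exp()  # reparameterized sample
```

Because sum-pooling accepts any number of inputs, the encoder remains well-defined when some modalities are missing, which is one motivation for learned permutation-invariant aggregation over fixed MoE or PoE combination rules.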
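On the information-theoretic point, a standard variational lower bound on mutual information (Barber & Agakov, 2003) is shown below purely as a reference point; the paper's exact relaxation may differ:

$$ I(x_m; z) \;\ge\; H(x_m) + \mathbb{E}_{p(x_m, z)}\big[\log q(x_m \mid z)\big], $$

with an analogous bound for the conditional mutual information $I(x_m; z \mid x_{-m})$ using a conditional variational decoder. Under this view, the decoder terms of a multi-modal ELBO act as variational bounds on how much information the latent variable carries about each modality.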