The key insights and highlights of the content are:
Multimodal representation learning techniques typically rely on paired samples, which are challenging to collect in many scientific fields. This paper presents an approach for aligning unpaired samples across disparate modalities.
The authors draw an analogy between potential outcomes in causal inference and potential views in multimodal observations. This allows them to use Rubin's framework to estimate a common space in which to match samples.
The approach assumes the samples have been experimentally perturbed by known treatments. From each modality it estimates a propensity score, which encapsulates all information shared between the latent state and the treatment and can be used to define a distance between samples.
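As a minimal sketch of this idea (all data, dimensions, and model choices below are illustrative assumptions, not the paper's exact setup): fit a classifier per modality that predicts the treatment, and compare samples through their predicted treatment probabilities, which live in a shared simplex regardless of each modality's dimensionality.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical synthetic data: a shared latent state z drives both modalities,
# and a treatment t shifts that state. Sizes are arbitrary for illustration.
n, d_x, d_y = 300, 10, 15
t = rng.integers(0, 3, size=n)              # 3 treatment arms
z = rng.normal(size=(n, 4)) + t[:, None]    # latent state perturbed by treatment
x = z @ rng.normal(size=(4, d_x))           # modality A view of z
y = z @ rng.normal(size=(4, d_y))           # modality B view of z

# Fit one propensity model per modality: P(treatment | observation).
clf_x = LogisticRegression(max_iter=1000).fit(x, t)
clf_y = LogisticRegression(max_iter=1000).fit(y, t)

# Propensity scores are comparable across modalities even though x and y
# have different dimensions.
px = clf_x.predict_proba(x)   # shape (n, 3)
py = clf_y.predict_proba(y)   # shape (n, 3)

# Cross-modal pairwise distance between samples via their propensity scores.
dist = np.linalg.norm(px[:, None, :] - py[None, :, :], axis=-1)  # (n, n)
```

The key point the sketch illustrates is that the propensity score collapses each modality into the same low-dimensional space, so a distance between samples from different modalities becomes well defined.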
Two alignment techniques that leverage the propensity score distance are explored: shared nearest neighbours (SNN) and optimal transport (OT) matching. The authors find that OT matching yields significant improvements over state-of-the-art alignment approaches.
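To make the OT matching step concrete, here is a minimal sketch under simplifying assumptions (uniform marginals and a hard one-to-one coupling, in which case OT reduces to the linear assignment problem); the distance matrix is a stand-in for the propensity score distances, and this is not claimed to be the paper's exact solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(1)

# Stand-in cost matrix: dist[i, j] would be the propensity score distance
# between sample i of modality A and sample j of modality B.
n = 5
dist = rng.random((n, n))

# With uniform marginals and a hard coupling, OT matching is a linear
# assignment problem: find the permutation minimizing total transport cost.
rows, cols = linear_sum_assignment(dist)
matching = dict(zip(rows.tolist(), cols.tolist()))
total_cost = float(dist[rows, cols].sum())
```

Soft couplings (e.g. entropic OT) relax the permutation to a doubly stochastic matrix, but the hard-assignment version above already conveys the structure of the matching step.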
The proposed matching procedure can also be used for cross-modality prediction tasks by estimating a function that translates between modalities using a two-sample stochastic gradient estimator.
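Once samples are matched, cross-modality prediction reduces to supervised regression on the matched pairs. The sketch below uses a linear translator on synthetic, already-matched data purely as an illustrative stand-in; the paper's two-sample stochastic gradient estimator is not reproduced here:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)

# Hypothetical matched pairs: after alignment, x_i from modality A is treated
# as paired with y_i from modality B. Data and sizes are illustrative.
n, d_x, d_y = 200, 8, 6
z = rng.normal(size=(n, 3))                 # shared latent state
x = z @ rng.normal(size=(3, d_x))           # modality A view
y = z @ rng.normal(size=(3, d_y))           # modality B view

# A linear translator fitted on the matched pairs stands in for the learned
# cross-modality prediction function.
translator = Ridge(alpha=1.0).fit(x, y)
y_pred = translator.predict(x)
r2 = translator.score(x, y)                 # goodness of fit on the pairs
```

In practice the translator would be evaluated on held-out matched pairs rather than the training pairs; the in-sample score here only demonstrates the mechanics.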
Experiments on a synthetic multimodal setting and real-world data from the NeurIPS Multimodal Single-Cell Integration Challenge demonstrate the effectiveness of the propensity score-based matching approach.