Context-Based Multimodal Fusion: A Frugal Approach for Efficient Alignment of Pre-Trained Models
Core Concepts
Context-Based Multimodal Fusion (CBMF) is a frugal approach that aligns large pre-trained models efficiently by combining modality fusion with contrastive learning.
Abstract
Context-Based Multimodal Fusion (CBMF) integrates modality fusion and contrastive learning to align large pre-trained models efficiently. It addresses the challenges of multimodal fusion by combining modality fusion with data-distribution alignment. Because the large pre-trained encoders can remain frozen, CBMF reduces computational cost while still achieving effective alignment across modalities. Within the framework, the Deep Fusion Encoder (DFE) fuses the embeddings produced by the pre-trained models with a learnable parameter called the context, which accommodates distributional shifts across models. The resulting representations improve performance on downstream tasks and remain applicable across a variety of settings.
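To make the mechanism concrete, below is a minimal PyTorch sketch of a DFE-style fusion head that combines a frozen encoder's embedding with a learnable per-modality context vector. The class name, dimensions, and concatenation-based fusion are illustrative assumptions, not the authors' implementation.

# Minimal sketch (not the authors' code): a DFE-style head that fuses a frozen
# encoder's embedding with a learnable per-modality "context" vector.
# DeepFusionEncoder, context_dim, and concatenation-based fusion are assumptions.
import torch
import torch.nn as nn

class DeepFusionEncoder(nn.Module):
    def __init__(self, embed_dim: int, context_dim: int, out_dim: int):
        super().__init__()
        # Learnable context vector: one per modality, shared across samples.
        self.context = nn.Parameter(torch.randn(context_dim) * 0.02)
        # Small trainable head; the pre-trained encoder itself stays frozen.
        self.fuse = nn.Sequential(
            nn.Linear(embed_dim + context_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, frozen_embedding: torch.Tensor) -> torch.Tensor:
        # frozen_embedding: (batch, embed_dim), produced by a frozen pre-trained model.
        ctx = self.context.expand(frozen_embedding.size(0), -1)
        fused = torch.cat([frozen_embedding, ctx], dim=-1)
        return nn.functional.normalize(self.fuse(fused), dim=-1)

Only this small head and the context vector are trained; the pre-trained encoder producing frozen_embedding stays frozen, which is what keeps the approach frugal.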
Stats
Authors: Bilal Faye, Hanane Azzag, Mustapha Lebbah, Djamel Bouchaffra
arXiv:2403.04650v1 [cs.LG] 7 Mar 2024
CIFAR-10: 60K images in 10 classes
CIFAR-100: 60K images in 100 classes
Tiny ImageNet: 64x64-pixel images in 200 classes
Flickr8k: A collection of 8,000 images with captions
Quotes
"CBMF offers an effective and economical solution for solving complex multimodal tasks."
"In CBMF, each modality is represented by a specific context vector fused with the embedding of each modality."
"CBMF introduces a frugal approach to multimodal fusion and alignment."