Context-Based Multimodal Fusion (CBMF) is a method that integrates modality fusion and contrastive learning to align large pre-trained models efficiently. CBMF addresses the challenges of multimodal fusion by jointly performing modality fusion and data-distribution alignment. Because the large pre-trained models can remain frozen, CBMF reduces computational cost while still achieving effective alignment across modalities. Within the framework, the Deep Fusion Encoder (DFE) fuses embeddings from the pre-trained models through a learnable parameter called the context, which accommodates distributional shifts between models. The resulting fused representations are better suited to downstream tasks, and the approach applies across a variety of settings.
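To make the idea concrete, the following is a minimal sketch of how a DFE-style module with a learnable context vector and a contrastive alignment objective might look. The summary above does not specify the architecture, so the layer sizes, the concatenation-based fusion, and the InfoNCE-style loss here are illustrative assumptions, not the paper's actual design; only the fusion module and the context parameter would be trained, while the pre-trained encoders stay frozen.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeepFusionEncoder(nn.Module):
    """Hypothetical sketch of a Deep Fusion Encoder (DFE).

    Fuses frozen pre-trained embeddings from two modalities with a
    learnable "context" vector intended to absorb distributional shift
    between the source models. Dimensions and fusion strategy are
    assumptions for illustration.
    """

    def __init__(self, dim_a: int, dim_b: int, ctx_dim: int = 64, out_dim: int = 256):
        super().__init__()
        # Learnable context parameter, shared across all inputs.
        self.context = nn.Parameter(torch.randn(ctx_dim) * 0.02)
        self.fuse = nn.Sequential(
            nn.Linear(dim_a + dim_b + ctx_dim, 512),
            nn.GELU(),
            nn.Linear(512, out_dim),
        )

    def forward(self, emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
        # Broadcast the context vector to every item in the batch,
        # concatenate with both frozen embeddings, then project.
        ctx = self.context.expand(emb_a.size(0), -1)
        fused = torch.cat([emb_a, emb_b, ctx], dim=-1)
        return F.normalize(self.fuse(fused), dim=-1)


def contrastive_loss(z_fused: torch.Tensor, z_target: torch.Tensor,
                     tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss aligning fused embeddings with a target space.

    Matched pairs (row i of each batch) are positives; all other rows
    in the batch serve as negatives.
    """
    logits = z_fused @ z_target.t() / tau
    labels = torch.arange(z_fused.size(0), device=z_fused.device)
    return F.cross_entropy(logits, labels)


# Usage sketch: emb_a / emb_b come from two frozen pre-trained encoders
# (e.g. image and text); z_target is the embedding space being aligned to.
dfe = DeepFusionEncoder(dim_a=768, dim_b=512)
emb_a, emb_b = torch.randn(8, 768), torch.randn(8, 512)
z_target = torch.randn(8, 256)
loss = contrastive_loss(dfe(emb_a, emb_b), F.normalize(z_target, dim=-1))
```

Because gradients flow only through the small fusion module and the context parameter, training cost stays low relative to fine-tuning the pre-trained backbones themselves, which matches the efficiency claim above.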