Context-Based Multimodal Fusion (CBMF) combines modality fusion with contrastive learning to align large pre-trained models efficiently, jointly addressing the two core challenges of multimodal learning: fusing modalities and aligning their data distributions. Because the pre-trained encoders can remain frozen, CBMF reduces computational cost while still achieving effective cross-modal alignment. Within the framework, the Deep Fusion Encoder (DFE) fuses the embeddings produced by the frozen models through a learnable parameter called the context, which accommodates distributional shifts between models. The resulting representations improve downstream-task performance and remain applicable across a variety of settings.
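To make the mechanism concrete, below is a minimal PyTorch sketch of what DFE-style fusion with a learnable context could look like. This is not the authors' implementation: it assumes the context is a learnable per-modality vector concatenated with each frozen embedding before projection into a shared space, and that alignment uses a standard InfoNCE-style contrastive loss. The names `DeepFusionEncoder` and `info_nce`, and all dimensions, are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeepFusionEncoder(nn.Module):
    """Sketch of a DFE: fuses embeddings from frozen pre-trained models,
    using a learnable per-modality context vector to absorb distribution
    shift between the source models (hypothetical design)."""

    def __init__(self, embed_dims, fused_dim=256, context_dim=64):
        super().__init__()
        # One learnable context vector per source model (assumption).
        self.contexts = nn.ParameterList(
            [nn.Parameter(torch.randn(context_dim)) for _ in embed_dims]
        )
        # Project each (embedding, context) pair into a shared space.
        self.projections = nn.ModuleList(
            [nn.Linear(dim + context_dim, fused_dim) for dim in embed_dims]
        )
        # Combine the per-modality projections into one fused embedding.
        self.fuse = nn.Linear(fused_dim * len(embed_dims), fused_dim)

    def forward(self, embeddings):
        # embeddings: list of (batch, embed_dims[i]) tensors from frozen models.
        parts = []
        for emb, ctx, proj in zip(embeddings, self.contexts, self.projections):
            ctx_b = ctx.unsqueeze(0).expand(emb.size(0), -1)
            parts.append(proj(torch.cat([emb, ctx_b], dim=-1)))
        fused = self.fuse(torch.cat(parts, dim=-1))
        return fused, parts


def info_nce(z_a, z_b, temperature=0.07):
    """Standard InfoNCE contrastive loss: matched pairs within a batch
    are positives, all other pairings are negatives."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)


# Example: embeddings from two frozen encoders (dimensions are illustrative).
image_emb = torch.randn(8, 768)  # e.g. output of a frozen vision model
text_emb = torch.randn(8, 512)   # e.g. output of a frozen text model

dfe = DeepFusionEncoder(embed_dims=[768, 512])
fused, (z_img, z_txt) = dfe([image_emb, text_emb])
loss = info_nce(z_img, z_txt)  # pull paired image/text embeddings together
```

In a setup like this, only the DFE, its projections, and the context parameters receive gradients; the pre-trained encoders stay frozen, which is where the computational savings described above would come from.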
Key ideas extracted from the source content by Bilal Faye, H... at arxiv.org, 03-08-2024
https://arxiv.org/pdf/2403.04650.pdf