The paper introduces a novel cross-modal parameter-efficient approach named XMAdapter for vision-language models. The key highlights are:
XMAdapter constructs cache models for both text and image modalities, effectively integrating features from both domains. This is crucial for efficient transfer learning in vision-language models.
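As a rough illustration of what such a dual cache might look like, the PyTorch sketch below builds key-value caches from few-shot image features and class text features. The function name `build_caches`, the tensor shapes, and the use of one-hot labels as cache values are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def build_caches(image_feats, text_feats, labels, num_classes):
    # image_feats: (N_shot, D) visual features from a frozen image encoder
    # text_feats:  (C, D) class-prompt features from a frozen text encoder
    # labels:      (N_shot,) integer class labels for the few-shot images
    image_keys = F.normalize(image_feats, dim=-1)       # visual cache keys
    text_keys = F.normalize(text_feats, dim=-1)         # textual cache keys
    values = F.one_hot(labels, num_classes).float()     # cache values (one-hot labels)
    return image_keys, text_keys, values
```

At inference time, a normalized query feature would be matched against both sets of keys to retrieve class evidence from each modality.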
The model gathers clues for inference by retrieving bimodal information from the visual and language modalities. By dynamically adjusting the affinity ratio, it achieves cross-modal fusion and exploits differences in modality affinity to mine hard samples.
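A minimal sketch of how such retrieval and ratio-controlled fusion could be wired up is shown below, continuing the cache above. The names `alpha`, `beta`, and `ratio` are hypothetical fusion hyperparameters, and the exponential affinity follows a common Tip-Adapter-style formulation rather than XMAdapter's exact equations.

```python
import torch

def fused_logits(query, image_keys, text_keys, values, alpha, beta, ratio):
    # query: (B, D) L2-normalized test-image features
    sim_img = query @ image_keys.t()                           # (B, N_shot) visual affinities
    sim_txt = query @ text_keys.t()                            # (B, C) textual affinities
    cache_logits = torch.exp(-beta * (1.0 - sim_img)) @ values # (B, C) retrieval from the image cache
    # Blend the two branches with a dynamic affinity ratio in [0, 1].
    return ratio * alpha * cache_logits + (1.0 - ratio) * sim_txt
```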
XMAdapter enhances model performance through adaptive adjustment of sample learning intensity based on the differences in cross-modal affinity.
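One plausible way to turn the cross-modal affinity gap into a per-sample learning intensity is sketched below; the weighting rule and the `gamma` scale are hypothetical and intended only to illustrate the idea of boosting the loss on samples where the two modalities disagree.

```python
import torch
import torch.nn.functional as F

def affinity_weighted_loss(logits, cache_logits, text_logits, targets, gamma=1.0):
    # Per-sample cross-entropy, not yet reduced.
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    # Disagreement between image-cache and text predictions (L1 distance of softmaxes).
    gap = (F.softmax(cache_logits, dim=-1) - F.softmax(text_logits, dim=-1)).abs().sum(dim=-1)
    # Hard samples (large gap) receive a larger weight; gamma controls the strength.
    weights = 1.0 + gamma * gap / gap.max().clamp_min(1e-8)
    return (weights * per_sample).mean()
```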
Extensive experiments on 11 benchmark datasets demonstrate that XMAdapter significantly outperforms previous adapter-based methods in accuracy, generalization, and efficiency.
The model exhibits strong generalization capabilities, achieving average improvements of +0.61%, +0.34%, +0.27%, and +0.31% over previous methods on four cross-domain datasets.
XMAdapter meets the requirements of parameter-efficient transfer learning in terms of resource consumption and operational efficiency, while achieving promising experimental results.
Key insights distilled from the paper by Juncheng Yan... at arxiv.org, 04-22-2024: https://arxiv.org/pdf/2404.12588.pdf