The paper introduces a novel cross-modal parameter-efficient approach named XMAdapter for vision-language models. The key highlights are:
XMAdapter constructs cache models for both text and image modalities, effectively integrating features from both domains. This is crucial for efficient transfer learning in vision-language models.
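To make the cache idea concrete, below is a minimal sketch of how such a bimodal key-value cache could be built from few-shot data, assuming CLIP-like image and text encoders and one-hot labels as cache values (the Tip-Adapter-style recipe). The function name `build_caches` and its arguments are illustrative, not the authors' actual API.

```python
import torch
import torch.nn.functional as F

def build_caches(image_encoder, text_encoder,
                 few_shot_images, few_shot_texts, labels, num_classes):
    """Store L2-normalized features as cache keys and one-hot labels as cache values."""
    with torch.no_grad():
        img_keys = F.normalize(image_encoder(few_shot_images), dim=-1)  # (N, D) image keys
        txt_keys = F.normalize(text_encoder(few_shot_texts), dim=-1)    # (N, D) text keys
    values = F.one_hot(labels, num_classes).float()                     # (N, C) shared values
    return (img_keys, values), (txt_keys, values)
```

Only the caches (and any lightweight fusion parameters) are trainable, which is what keeps the method parameter-efficient.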
The model retrieves bimodal information from the visual and language modalities to gather clues for inference. By dynamically adjusting the affinity ratio, it fuses the two modalities and exploits differences in their affinities to mine hard samples.
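A hedged sketch of this retrieval-and-fusion step is shown below: a test feature queries both caches, and an affinity ratio blends the two retrieval terms with the zero-shot CLIP logits. The exponential similarity kernel and the hyperparameters `alpha`, `beta`, and `ratio` follow the common cache-adapter recipe and are assumptions, not the paper's exact formulation (in XMAdapter the ratio is adjusted dynamically rather than fixed).

```python
import torch

def cache_logits(query, keys, values, beta=5.5):
    """Retrieve cache values weighted by an exponential similarity kernel."""
    affinity = query @ keys.t()                           # (B, N) cosine similarities
    return torch.exp(-beta * (1.0 - affinity)) @ values   # (B, C) retrieved class evidence

def fused_logits(query, img_cache, txt_cache, clip_logits, alpha=1.0, ratio=0.5):
    img_keys, img_vals = img_cache
    txt_keys, txt_vals = txt_cache
    img_term = cache_logits(query, img_keys, img_vals)    # clues from the image cache
    txt_term = cache_logits(query, txt_keys, txt_vals)    # clues from the text cache
    # `ratio` balances the two modalities' retrieval clues; here it is a plain
    # scalar for illustration only.
    cross = ratio * img_term + (1.0 - ratio) * txt_term
    return clip_logits + alpha * cross
```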
XMAdapter further enhances performance by adaptively adjusting each sample's learning intensity according to the difference in its cross-modal affinities, as sketched below.
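One plausible way to realize this, assuming per-sample affinity scores from each cache, is to up-weight the training loss where the two modalities disagree. The gating function, the absolute-gap measure, and the hyperparameter `gamma` below are illustrative assumptions rather than the paper's exact loss.

```python
import torch.nn.functional as F

def hard_sample_loss(logits, targets, img_affinity, txt_affinity, gamma=1.0):
    """Weight each sample's cross-entropy by the gap between its modality affinities."""
    gap = (img_affinity - txt_affinity).abs().mean(dim=-1)  # (B,) per-sample disagreement
    weights = 1.0 + gamma * gap                             # larger gap -> harder sample -> larger weight
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    return (weights * per_sample).mean()
```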
Extensive experiments on 11 benchmark datasets demonstrate that XMAdapter significantly outperforms previous adapter-based methods in accuracy, generalization, and efficiency.
The model exhibits strong generalization, achieving average improvements of +0.61%, +0.34%, +0.27%, and +0.31% on four cross-domain datasets compared to previous methods.
XMAdapter meets the requirements of parameter-efficient transfer learning in terms of resource consumption and operational efficiency, while achieving promising experimental results.