The paper introduces a novel Aggregation-Graph-of-Thought (AGoT) mechanism for soft-prompt tuning in multi-modal representation learning. The key highlights are:
AGoT models the human thought process not only as a chain but also as a reasoning aggregation graph, capturing multiple aspects of thinking, in contrast to the single-chain reasoning of the chain-of-thought (CoT) technique.
AGoT turns the entire reasoning process into prompt aggregation and prompt flow operations. Each step in the reasoning chain is modeled as an aggregation graph, where multiple meta-prompts are aggregated and combined with visual information. A flow controller is used to dynamically adjust the degree of information flow between reasoning steps.
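The aggregation-and-flow idea can be illustrated with a minimal sketch. The function below is a hypothetical simplification (names, shapes, and the fixed gate value are assumptions, not the paper's implementation): each step scores a set of meta-prompts against the visual features, aggregates them by softmax weighting, and a flow-controller gate blends the result with the previous step's prompt.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_step(meta_prompts, visual_feat, prev_prompt, flow_gate):
    """One AGoT-style reasoning step (illustrative, not the paper's code).

    meta_prompts: (n, d) candidate prompt embeddings in this step's graph.
    visual_feat:  (d,) image embedding used to score each meta-prompt.
    prev_prompt:  (d,) aggregated prompt carried over from the previous step.
    flow_gate:    scalar in (0, 1) from the flow controller.
    """
    # Score each meta-prompt against the visual features, then aggregate.
    scores = softmax(meta_prompts @ visual_feat)   # (n,)
    aggregated = scores @ meta_prompts             # (d,)
    # The flow controller blends the previous prompt with the new aggregate,
    # adjusting how much information flows between reasoning steps.
    return flow_gate * aggregated + (1.0 - flow_gate) * prev_prompt

rng = np.random.default_rng(0)
d, n, steps = 8, 4, 3
prompt = np.zeros(d)                 # initial prompt state
visual = rng.normal(size=d)          # stand-in for a CLIP image embedding
for _ in range(steps):
    metas = rng.normal(size=(n, d))  # meta-prompts for this step's graph
    gate = 0.5  # predicted dynamically in the paper; fixed here for clarity
    prompt = aggregate_step(metas, visual, prompt, gate)
print(prompt.shape)  # (8,)
```

In the actual model the gate would be produced by a learned controller network and the meta-prompts would be trainable parameters; the sketch only shows how the aggregation and flow operations compose across steps.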
Experiments show that the AGoT-enhanced multi-modal model achieves superior performance compared to CLIP, CoCoOp, and CoT-PT in text-image retrieval, visual question answering, and cross-label/dataset/domain generalization tasks. AGoT demonstrates the effectiveness of the proposed multi-view reasoning approach for multi-modal representation learning.
Source: Juncheng Yan et al., arxiv.org, 04-09-2024
https://arxiv.org/pdf/2404.04538.pdf