Core Concepts
The proposed Aggregation-Graph-of-Thought (AGoT) mechanism models the human thought process as a chain of reasoning aggregation graphs to capture multiple aspects of thinking, outperforming existing chain-of-thought and prompt learning methods in multi-modal tasks.
Summary
The paper introduces a novel Aggregation-Graph-of-Thought (AGoT) mechanism for soft-prompt tuning in multi-modal representation learning. The key highlights are:
- AGoT models the human thought process not only as a chain but also as a reasoning aggregation graph to capture multiple aspects of thinking, in contrast to the single-step reasoning in the chain-of-thought (CoT) technique.
- AGoT turns the entire reasoning process into prompt aggregation and prompt flow operations. Each step in the reasoning chain is modeled as an aggregation graph, where multiple meta-prompts are aggregated and combined with visual information. A flow controller dynamically adjusts the degree of information flow between reasoning steps.
- Experiments show that the AGoT-enhanced multi-modal model outperforms CLIP, CoCoOp, and CoT-PT in text-image retrieval, visual question answering, and cross-label/dataset/domain generalization tasks, demonstrating the effectiveness of the proposed multi-view reasoning approach for multi-modal representation learning.
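The per-step mechanics described above can be sketched in code. The following is a minimal illustrative sketch, not the paper's implementation: it assumes attention-style aggregation of meta-prompts conditioned on a visual feature, and a sigmoid gate as the flow controller between steps. All function names, weight shapes, and the random toy inputs are assumptions for illustration; the actual AGoT operates on learnable soft prompts inside a CLIP-style encoder.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_step(meta_prompts, visual, W_q, W_k):
    """One reasoning step as an aggregation graph (sketch).

    meta_prompts: (n, d) candidate meta-prompt vectors for this step.
    visual: (d,) visual feature used to score each meta-prompt.
    Returns the aggregated prompt, shape (d,).
    """
    # Score each meta-prompt against the visual feature (attention-style).
    scores = (meta_prompts @ W_k) @ (W_q @ visual)   # (n,)
    weights = softmax(scores)                        # (n,)
    return weights @ meta_prompts                    # (d,)

def flow_controller(prev_prompt, new_prompt, gate_w):
    """Gate how much of the previous step's prompt flows into this step."""
    g = 1.0 / (1.0 + np.exp(-(gate_w @ np.concatenate([prev_prompt, new_prompt]))))
    return g * prev_prompt + (1.0 - g) * new_prompt

# Toy chain of 3 reasoning steps over random features.
rng = np.random.default_rng(0)
d, n, steps = 16, 4, 3
prompt = np.zeros(d)
for _ in range(steps):
    meta = rng.normal(size=(n, d))
    visual = rng.normal(size=d)
    W_q = rng.normal(size=(d, d)) / d ** 0.5
    W_k = rng.normal(size=(d, d)) / d ** 0.5
    gate_w = rng.normal(size=2 * d) / d ** 0.5
    step_prompt = aggregate_step(meta, visual, W_q, W_k)
    prompt = flow_controller(prompt, step_prompt, gate_w)
```

The chained calls mirror the paper's "prompt aggregation and prompt flow" framing: each step aggregates multiple views into one prompt, and the gate decides how much earlier reasoning carries forward.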
Key Statistics
The paper reports the following key metrics:
- On the Flickr30k dataset, AGoT achieved 88.70% R@1 with 2% training data, outperforming CLIP by 5.70%, CoCoOp by 3.70%, and CoT-PT by 1.70%.
- On the MSCOCO dataset, AGoT achieved 58.70% R@1 with 2% training data, outperforming CLIP by 5.40%, CoCoOp by 1.70%, and CoT-PT by 0.80%.
- On the VQAv2 dataset, AGoT achieved 31.74% R@1 with 0.75% training data, outperforming CLIP by 19.91%, CoCoOp by 1.00%, and CoT-PT by 0.88%.
Quotes
"To cope with the overlooked multiple aspects of thinking in single-step reasoning, we model the reasoning step as an aggregation graph, and turn the whole process into a prompt aggregation and prompt flow operation."
"AGoT exhibits strong multi-modal representation learning ability and achieves good results on 18 datasets such as text-image retrieval, VQA, and image classification and has good domain generalization ability."