Leveraging Scene Graphs to Enhance Compositional Visual Reasoning in Large Multimodal Models

Core Concepts
Compositional Chain-of-Thought (CCoT) is a novel zero-shot prompting method that uses scene graph representations to extract compositional knowledge from Large Multimodal Models (LMMs) without fine-tuning or annotated scene graph data.
The paper introduces Compositional Chain-of-Thought (CCoT), a zero-shot prompting method that leverages scene graph representations to enhance the compositional visual reasoning capabilities of Large Multimodal Models (LMMs). The key insights are:

- Recent studies show that even the most advanced LMMs struggle with aspects of compositional visual reasoning, such as object attributes and the relationships between objects.
- Scene graphs (SGs) provide a structured representation of visual scenes, but annotated SG data is expensive to obtain and not easily scalable.
- CCoT is a two-step prompting approach: the LMM first generates a scene graph for the image, and that scene graph is then inserted into the prompt used to produce the final response.
- Extensive experiments show that CCoT improves LMM performance on several vision-and-language compositional benchmarks, as well as on general multimodal benchmarks, without fine-tuning or annotated ground-truth SGs.
- The authors demonstrate the effectiveness of CCoT across four popular LMM architectures: InstructBLIP-13B, LLaVA-1.5-13B, SPHINX, and GPT-4V.
- Ablation studies highlight the importance of using structured SGs, enforcing a consistent JSON format, and tuning the SG size to enhance the LMMs' compositional and multimodal reasoning.
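The ablations single out a consistent JSON format for the generated scene graph. As a concrete illustration, here is a minimal sketch of what such a JSON scene graph could look like; the exact schema (object/attribute/relationship fields) is an assumption made for illustration, not the paper's specification:

```python
import json

# Hypothetical JSON scene graph for an image of a person walking a dog.
# The schema (objects with attributes, plus subject-relation-object
# triples) is an illustrative assumption; the paper only specifies that
# the generated SG is serialized in a consistent JSON format.
scene_graph = {
    "objects": [
        {"id": 0, "name": "person", "attributes": ["tall", "smiling"]},
        {"id": 1, "name": "dog", "attributes": ["small", "brown"]},
        {"id": 2, "name": "leash", "attributes": ["red"]},
    ],
    "relationships": [
        {"subject": 0, "relation": "holding", "object": 2},
        {"subject": 2, "relation": "attached to", "object": 1},
    ],
}

# Serialized form, as it would be inserted into the second prompt.
serialized = json.dumps(scene_graph, indent=2)
```

Keeping the format fixed across examples is what lets the second prompting step consume the graph reliably.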
"The combination of strong visual backbones and Large Language Model (LLM) reasoning has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range of vision and language (VL) tasks."

"Recent empirical studies [18, 28, 51] show that the best-performing VL models tend to view images as a 'bag of objects'."

"Scene graph (SG) annotations—structured graph representations of visual scenes—have been introduced as powerful VL representations, and have been extensively explored in many previous works [24, 34, 79, 80]."

"However, SG data is less readily available than textual descriptions as obtaining SGs is costly and thus not scalable."
"Comprehending the structure of visual scenes is a core issue in machine perception. Visual scenes consist not only of objects but also include relevant characteristics and relationships that are significant to understanding the scenes' compositionality better."

"To overcome this, inspired by chain-of-thought methods, we propose Compositional Chain-of-Thought (CCoT), a novel zero-shot Chain-of-Thought prompting method that utilizes SG representations in order to extract compositional knowledge from an LMM."
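The two-step procedure described in these passages can be sketched in a few lines of Python. The `generate` callable stands in for any LMM inference call (e.g., a wrapper around LLaVA-1.5 or GPT-4V), and the prompt wording is an illustrative assumption rather than the paper's exact prompts:

```python
from typing import Callable

# Illustrative step-1 prompt; the paper's exact wording may differ.
SG_PROMPT = (
    "For the provided image, generate a scene graph in JSON format that "
    "includes the objects, their attributes, and the relationships "
    "between them."
)

def ccot_answer(image, question: str,
                generate: Callable[[object, str], str]) -> str:
    """Two-step Compositional Chain-of-Thought (CCoT) prompting.

    Step 1: prompt the LMM to produce a JSON scene graph of the image.
    Step 2: feed that scene graph back as context when answering the
    question. No fine-tuning or ground-truth SG annotations are needed.
    """
    scene_graph = generate(image, SG_PROMPT)   # step 1: SG generation
    answer_prompt = (
        f"Scene graph:\n{scene_graph}\n\n"
        f"Use the image and the scene graph above as context to answer: "
        f"{question}"
    )
    return generate(image, answer_prompt)      # step 2: final response

# Usage with a stub model (a real deployment would wrap an actual LMM):
def stub_generate(image, prompt):
    return '{"objects": []}' if "scene graph in JSON" in prompt else "yes"
```

Because both steps are plain prompting, the same wrapper works unchanged across the four architectures evaluated in the paper.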

Deeper Inquiries

How can the CCoT approach be extended to handle more complex visual scenes, such as those with dynamic or occluded objects?

The CCoT approach can be extended to handle more complex visual scenes by incorporating additional modules that account for dynamic or occluded objects. One way to address dynamic objects is to introduce temporal reasoning into the scene graph generation process: analyzing consecutive frames to track object movements and changes over time allows the model to capture the dynamic nature of the scene.

For occluded objects, the CCoT approach can be enhanced with object completion techniques. By leveraging contextual information from the scene, the model can infer the presence and properties of occluded objects, filling in the missing information in the scene graph. Attention mechanisms that focus on the relevant parts of the scene can further help the model reason about what is hidden.

Finally, spatial reasoning capabilities can aid in handling complex scenes. By considering the spatial relationships between objects in the scene graph, the model can infer occlusions and interactions more accurately; techniques such as graph neural networks can capture these spatial dependencies and infer relationships between objects even in challenging scenarios.

What are the potential limitations of using generated scene graphs compared to ground-truth annotations, and how can these limitations be addressed?

Using generated scene graphs instead of ground-truth annotations introduces several limitations. The first is the potential for inaccuracies or errors in the generated scene graphs, which can hurt the model's performance on downstream tasks. Techniques such as iterative refinement or feedback mechanisms can address this: by updating the scene graph based on feedback from the model's own predictions, the accuracy and reliability of the graphs can be improved.

A second limitation is the lack of fine-grained detail compared to ground-truth annotations. Generated scene graphs may not capture all the nuances and intricacies present in the visual scene, leading to information loss. To mitigate this, a multi-stage approach can be adopted in which the model refines the scene graph at progressively finer levels of granularity; this hierarchical refinement can recover more detailed information and improve the overall quality of the representation.

Finally, scalability may be a concern when dealing with large and diverse datasets. Techniques such as data augmentation and transfer learning can help the model generalize across different scenes and improve the robustness of the generated scene graphs.
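The iterative-refinement idea above can be made concrete with a short, purely hypothetical sketch (not part of the paper's method), in which the LMM is repeatedly asked to critique and revise its own scene graph before the graph is used to answer:

```python
def refine_scene_graph(image, question, generate, rounds=2):
    """Hypothetical iterative refinement loop (an illustration, not the
    paper's method): the LMM critiques and revises its own scene graph."""
    sg = generate(image, "Generate a JSON scene graph for this image.")
    for _ in range(rounds):
        revise_prompt = (
            f"Scene graph:\n{sg}\n"
            f"Question to be answered later: {question}\n"
            "Revise the scene graph, correcting any objects, attributes, "
            "or relationships that are wrong or missing. Return JSON only."
        )
        sg = generate(image, revise_prompt)  # feedback-driven update
    return sg

# Stub model that records each prompt, to show the call pattern:
calls = []
def counting_stub(image, prompt):
    calls.append(prompt)
    return f"sg{len(calls)}"
```

Each extra round trades additional inference cost for a chance to correct SG errors before the answering step.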

How might the CCoT approach be adapted to improve the compositional reasoning of LMMs in other domains beyond vision and language, such as robotics or scientific reasoning?

The CCoT approach can be adapted to improve the compositional reasoning of LMMs in domains beyond vision and language by tailoring the scene graph generation step to the characteristics of the domain. In robotics, for example, the scene graph can represent the spatial layout of objects in a physical environment, their properties, and their relationships, enabling the robot to reason about complex manipulation tasks.

For scientific reasoning, the scene graph can capture the hierarchical structure of scientific concepts, experimental setups, and relationships between variables. By incorporating domain-specific knowledge into the scene graph generation process, LMMs can better understand and reason about scientific phenomena, aiding hypothesis generation, experimental design, and data analysis.

In both domains, the CCoT approach can also be extended to incorporate multimodal inputs such as sensor data, experimental results, or textual descriptions. Integrating these diverse sources of information into the scene graph representation allows LMMs to perform more comprehensive, context-aware reasoning. Overall, adapting CCoT to a new domain involves customizing the scene graph generation process, incorporating domain-specific knowledge, and leveraging the available modalities to enhance the model's compositional reasoning.