Basic Concepts
Large Language Models (LLMs) often struggle with structured reasoning tasks, particularly navigating and reasoning over graph representations. This paper systematically evaluates the graph reasoning capabilities of several LLMs through a series of increasingly complex graph traversal problems.
Summary
The paper explores the ability of Large Language Models (LLMs) to perform structured graph reasoning tasks. It presents a comprehensive benchmark of five LLMs (GPT-3.5, GPT-4, Claude-2, Llama-2, and PaLM-2) on 10 distinct graph traversal problems of increasing complexity.
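The paper's exact prompt format is not reproduced in this summary, but the general setup, serializing a graph as text and asking the model a traversal question, can be sketched as follows. The function name, edge-list format, and question wording here are illustrative assumptions, not the paper's own:

```python
# Hypothetical sketch: serializing a small weighted graph into a plain-text
# shortest-path question for an LLM. The exact prompt format used in the
# paper is not specified here; this only illustrates the style of task.

def graph_to_prompt(edges, source, target):
    """Render an undirected weighted edge list as a text question."""
    lines = [f"Node {u} is connected to node {v} with weight {w}."
             for u, v, w in edges]
    lines.append(f"What is the shortest path from node {source} to node {target}?")
    return "\n".join(lines)

edges = [("A", "B", 1), ("B", "C", 2), ("A", "C", 5)]
print(graph_to_prompt(edges, "A", "C"))
```

A benchmark then checks the model's free-text answer against the ground-truth path computed by a classical algorithm.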
The key findings are:
- LLMs generally perform better on tree-based graphs than grid-based graphs, indicating an inverse correlation between the average degrees of freedom per node and the reasoning capability of the models.
- Adding constraints such as weighted edges or jumbled node order significantly degrades the performance of the models, highlighting their bias towards expecting certain structures.
- K-shot prompting has a negative or insignificant effect on the reasoning accuracy of the models in the majority of the tasks, suggesting that few-shot learning is not particularly helpful for analytical tasks like graph reasoning.
- The models exhibit a positive response bias, often failing to identify the absence of a valid solution, even in few-shot settings.
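The "degrees of freedom per node" in the first finding is the average node degree: how many neighbors a solver must consider at each step. A minimal sketch (using a balanced binary tree and a square lattice as stand-ins for the paper's graph families, which are assumptions here) shows why grids score higher on this measure:

```python
# Sketch: average node degree for a balanced binary tree vs. a square grid.
# These two generators are illustrative stand-ins, not the paper's exact graphs.

def avg_degree(edges, n_nodes):
    # Each undirected edge contributes 2 to the total degree.
    return 2 * len(edges) / n_nodes

def binary_tree_edges(depth):
    # Complete binary tree: node i has children 2i+1 and 2i+2.
    n = 2 ** (depth + 1) - 1
    edges = [(i, c) for i in range(n)
             for c in (2 * i + 1, 2 * i + 2) if c < n]
    return edges, n

def grid_edges(k):
    # k x k lattice with horizontal and vertical neighbors.
    edges = []
    for r in range(k):
        for c in range(k):
            if c + 1 < k:
                edges.append(((r, c), (r, c + 1)))
            if r + 1 < k:
                edges.append(((r, c), (r + 1, c)))
    return edges, k * k

t_edges, t_n = binary_tree_edges(4)   # 31-node tree: avg degree 60/31 ~ 1.94
g_edges, g_n = grid_edges(6)          # 36-node grid: avg degree 120/36 ~ 3.33
print(avg_degree(t_edges, t_n))
print(avg_degree(g_edges, g_n))
```

A tree with n nodes always has n - 1 edges, so its average degree stays below 2, while interior grid nodes have 4 neighbors each; the finding is that reasoning accuracy drops as this branching factor grows.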
To address these limitations, the paper proposes a novel prompting technique called "PathCompare" that significantly improves the graph reasoning performance of the LLMs by prompting them to enumerate and compare multiple candidate paths. This technique outperforms both standard prompting and Chain-of-Thought (CoT) prompting in the majority of the tasks.
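The idea behind PathCompare, asking the model to list candidate paths and compare their costs before committing to an answer, can be sketched as prompt construction. The wording of the instruction below is a hypothetical paraphrase, not the paper's exact prompt:

```python
# Hypothetical sketch of a PathCompare-style prompt. The instruction text is
# an illustrative paraphrase of the technique, not the paper's exact wording.

BASE_TASK = (
    "Node A connects to B (weight 1) and C (weight 5). "
    "Node B connects to C (weight 2). "
    "Find the cheapest path from A to C."
)

PATHCOMPARE_SUFFIX = (
    "List several possible paths from the source to the target, "
    "compute the total cost of each, compare the costs, and only "
    "then state which path is cheapest."
)

prompt = f"{BASE_TASK}\n{PATHCOMPARE_SUFFIX}"
print(prompt)
```

Contrast this with CoT, which asks for generic step-by-step reasoning; PathCompare instead forces an explicit enumeration-and-comparison structure that directly counters the models' positive response bias and single-path tunnel vision.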
Overall, the paper provides a comprehensive analysis of the graph reasoning capabilities of various LLMs and introduces a novel prompting technique to enhance their performance on structured reasoning tasks.