Core Concepts
Large language models exhibit significant limitations in their logical reasoning abilities, especially for complex reasoning patterns involving negations and non-monotonic logics.
Abstract
The paper presents a comprehensive evaluation of the logical reasoning abilities of large language models (LLMs) such as GPT-4, ChatGPT, Gemini, Llama-2, and Mistral. The authors introduce LogicBench, a natural language question-answering dataset that focuses on evaluating a single inference rule at a time, covering a wide range of reasoning patterns across propositional logic, first-order logic, and non-monotonic logics.
The key findings from the evaluation are:
Existing LLMs do not perform well on LogicBench, especially on instances involving complex reasoning and negations. They also sometimes overlook contextual information that is necessary to reach the correct conclusion.
LLMs struggle more with inference rules in propositional logic than with those in first-order logic and non-monotonic logics. This is likely because pre-training data contains more examples of simple first-order and non-monotonic reasoning patterns than of propositional-logic patterns.
LLMs perform better at selecting the correct logical conclusion in multiple-choice questions than in binary (yes/no) question-answering tasks. However, their performance drops when the inference rules are longer or include negations.
Human evaluation shows that while humans can effectively comprehend single-step logical reasoning, LLMs still have significant room for improvement in their logical reasoning capabilities.
The authors believe that the LogicBench dataset and the findings from this work will facilitate future research towards enhancing the logical reasoning abilities of LLMs.
Examples
If Liam finishes his work early, then he will order pizza for dinner. He won't order pizza for dinner.
If someone consumes a significant amount of water, they will experience a state of hydration. Conversely, if excessive amounts of sugar are ingested by them, a sugar crash will ensue. It is known that at least one of the following statements is true: either Jane consumes ample water or she will not experience a sugar crash.
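The first example above instantiates modus tollens: from "if p then q" and "not q", one may infer "not p". As a hedged illustration (not part of the paper's methodology), the validity of such a propositional inference rule can be checked by brute force over all truth assignments; the function and variable names below are hypothetical.

```python
from itertools import product

def is_valid(premises, conclusion, names):
    """An inference rule is valid iff the conclusion holds in every
    truth assignment (model) that satisfies all the premises."""
    for values in product([False, True], repeat=len(names)):
        model = dict(zip(names, values))
        if all(p(model) for p in premises) and not conclusion(model):
            return False  # found a counter-model
    return True

# Modus tollens, as in the Liam example: from (p -> q) and not q, infer not p.
modus_tollens = is_valid(
    premises=[lambda m: (not m["p"]) or m["q"],  # p -> q (material implication)
              lambda m: not m["q"]],             # not q
    conclusion=lambda m: not m["p"],             # therefore not p
    names=["p", "q"],
)
print(modus_tollens)  # True
```

The same checker rejects fallacies: swapping the second premise to "q" and the conclusion to "p" (affirming the consequent) yields False, since p=False, q=True is a counter-model.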
Quotes
"Existing work investigating this reasoning ability of LLMs has focused only on a couple of inference rules (such as modus ponens and modus tollens) of propositional and first-order logic."
"Experimental results show that existing LLMs do not fare well on LogicBench; especially, they struggle with instances involving complex reasoning and negations."