Core Concepts
Large Language Models struggle to perform basic arithmetic reasoning over implicitly held numerical knowledge, despite their advances in knowledge acquisition and statistical inference.
Abstract
The paper investigates the ability of Large Language Models (LLMs) to reason about implicit numerical knowledge. The authors construct a dataset of subject-element pairs with associated numerical facts, as well as quad-tuples combining these facts to probe for entailed arithmetic relationships.
The key findings are:
Numerical Fact Probing:
LLMs, especially GPT-4, perform reasonably well at extracting numerical facts from text, but struggle with "null values", i.e., attributes that a subject does not have.
There is a clear bias towards hallucinating the most common numerical value associated with each attribute (e.g., assuming a sparrow has 4 legs).
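The fact-probing setup described above can be sketched roughly as follows. The function names, prompt wording, and the small fact table here are illustrative assumptions, not the paper's actual code or data; ground truth uses `None` to represent a "null value" for attributes a subject does not have.

```python
# Hypothetical sketch of numerical fact probing; not the paper's implementation.
# Ground truth uses None as the "null value" for attributes that do not apply.
FACTS = {
    ("sparrow", "legs"): 2,
    ("sparrow", "wheels"): None,   # null value: a sparrow has no wheels
    ("tricycle", "wheels"): 3,
    ("human", "fingers"): 10,
}

def make_prompt(subject, element):
    """Build a probing prompt for one subject-element pair (wording is illustrative)."""
    return f"How many {element} does a typical {subject} have? Answer with a number or 'none'."

def score(predictions):
    """Fraction of subject-element pairs where the model's answer matches ground truth."""
    correct = sum(predictions.get(pair) == value for pair, value in FACTS.items())
    return correct / len(FACTS)

# A model biased toward common values might answer 4 legs and 2 wheels:
biased = {("sparrow", "legs"): 4, ("sparrow", "wheels"): 2,
          ("tricycle", "wheels"): 3, ("human", "fingers"): 10}
print(score(biased))  # 0.5: both sparrow answers are hallucinated common values
```

A real probe would send `make_prompt(...)` to the model and parse its reply into `predictions`; the point of the sketch is that null values and common-value bias are scored the same way as ordinary errors.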
Entailed Arithmetic Relationship (EAR) Probing:
When asked to compare the numerical values implied by two subject-element pairs, LLMs perform poorly, often failing to infer the entailed arithmetic relationship.
The performance is unstable and depends heavily on the prompt formulation, indicating a lack of genuine reasoning ability.
Even the best-performing models, such as GPT-4, struggle to consistently combine the implicit numerical facts and reason about their relationships.
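A minimal sketch of how EAR probing can be scored, under the assumption that each quad-tuple reduces to two subject-element pairs whose ground-truth counts entail a relation. The fact table, function names, and quad list are illustrative, not taken from the paper.

```python
# Hypothetical sketch of Entailed Arithmetic Relationship (EAR) scoring;
# not the paper's implementation.
FACTS = {
    ("unicycle", "wheels"): 1,
    ("bicycle", "wheels"): 2,
    ("sparrow", "legs"): 2,
    ("tricycle", "wheels"): 3,
    ("human", "fingers"): 10,
}

def entailed_relation(pair_a, pair_b):
    """Relation ('<', '=', or '>') entailed by the two implicit numerical facts."""
    a, b = FACTS[pair_a], FACTS[pair_b]
    return "<" if a < b else ">" if a > b else "="

# Each quad combines two subject-element pairs; their counts entail a relation.
quads = [
    (("bicycle", "wheels"), ("unicycle", "wheels")),  # 2 > 1
    (("sparrow", "legs"), ("tricycle", "wheels")),    # 2 < 3
    (("human", "fingers"), ("bicycle", "wheels")),    # 10 > 2
]

def accuracy(model_answers):
    """model_answers maps each quad to the relation the model predicted."""
    correct = sum(model_answers[q] == entailed_relation(*q) for q in quads)
    return correct / len(quads)
```

The instability the authors report would show up here as `accuracy` swinging with prompt phrasing even though `entailed_relation` is fixed by the underlying facts.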
The authors argue that while LLMs can produce seemingly correct answers through statistical inference, they lack the combinatorial reasoning capabilities that arithmetic and logical tasks require. They emphasize that producing correct answers does not equate to genuine reasoning ability, and that simply making LLMs larger and training them on more data will not solve the problem.
Stats
A typical bicycle has a number of wheels that is more than the number of wheels a unicycle has.
A typical sparrow has a number of legs that is less than the number of wheels a tricycle has.
A typical human has a number of fingers that is more than the number of wheels a bicycle has.
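The three statements above can each be checked mechanically as a single integer comparison. The counts below are common knowledge, not figures reported in the paper:

```python
# Typical counts (common knowledge, not data from the paper).
wheels = {"unicycle": 1, "bicycle": 2, "tricycle": 3}
legs = {"sparrow": 2}
fingers = {"human": 10}

# Each statement reduces to one comparison over two implicit facts:
assert wheels["bicycle"] > wheels["unicycle"]   # bicycle wheels > unicycle wheels
assert legs["sparrow"] < wheels["tricycle"]     # sparrow legs < tricycle wheels
assert fingers["human"] > wheels["bicycle"]     # human fingers > bicycle wheels
print("all three entailed relations hold")
```

The contrast the paper draws is that a model must first retrieve both counts correctly and then perform this trivial comparison; the failures occur in combining the two steps, not in the arithmetic itself.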
Quotes
"It is difficult to argue that pure statistical learning can cope with the combinatorial explosion inherent in many commonsense reasoning tasks, especially once arithmetical notions are involved."
"Bigger is not always better and chasing purely statistical improvements is flawed at the core, since it only exacerbates the dangerous conflation of the production of correct answers with genuine reasoning ability."