Core Concepts

Large language models do not always perform analogical reasoning effectively; on mathematical reasoning tasks, the accuracy of self-generated examples, rather than their relevance, is the key factor determining performance.

Abstract

This paper systematically explores the ability of large language models (LLMs) to perform analogical reasoning. The authors conduct extensive experiments and analysis on a diverse set of reasoning tasks, spanning mathematical reasoning and several other reasoning types.
The key findings are:
On mathematical reasoning tasks, self-generated relevant examples do not guarantee better performance than irrelevant ones. In fact, irrelevant examples, such as randomly generated biology problems, can sometimes outperform relevant ones by a significant margin (up to 4% on the GSM8K dataset).
The key factor influencing the performance of LLMs on mathematical reasoning tasks is the accuracy of the self-generated examples, rather than their relevance. The authors demonstrate this by designing two improved methods that use manually verified self-generated examples as in-context learning demonstrations, which consistently outperform other approaches.
The authors also show that these observations hold across different LLMs, including GPT-3.5 and Llama-2-Chat, indicating the generalizability of their findings.
Further analysis reveals that while LLMs can follow instructions to generate specific types of examples, the accuracy of the generated examples is more important than their relevance for analogical reasoning performance, especially on mathematical reasoning tasks.
Overall, this work provides valuable insights into the limitations of LLMs in performing analogical reasoning and highlights the importance of example accuracy over relevance in certain reasoning tasks.
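The analogical-prompting setup the paper studies can be sketched roughly as follows. The prompt wording and the `query_llm` helper below are hypothetical placeholders for illustration, not the authors' exact implementation.

```python
# Rough sketch of analogical prompting: the model is asked to first
# self-generate worked examples, then use them as in-context
# demonstrations while solving the target problem. `query_llm` is a
# hypothetical stand-in for any chat-completion API call.

def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an actual LLM API call here")

def build_analogical_prompt(problem: str, n_examples: int = 3) -> str:
    # Single-prompt variant: generation of examples and solving the
    # target problem happen in one model call.
    return (
        f"Problem: {problem}\n"
        f"Instructions:\n"
        f"1. Recall {n_examples} relevant problems and solve each of them.\n"
        f"2. Then solve the initial problem.\n"
    )

def solve_with_self_generated_examples(problem: str) -> str:
    return query_llm(build_analogical_prompt(problem))
```

The paper's finding is that what the model recalls in step 1 matters less through its relevance than through whether those recalled solutions are actually correct.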

Stats

The second and ninth terms of an arithmetic sequence are 2 and 30, respectively.
In an arithmetic sequence, the first term is 3 and the common difference is 4.
The value of a fourth-order determinant needs to be calculated.
The value of a third-order determinant needs to be calculated.
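For reference, the first sequence problem above resolves in two steps: with a_2 = 2 and a_9 = 30, the common difference is d = (30 - 2) / (9 - 2) = 4, so the first term is a_1 = 2 - 4 = -2. A quick check:

```python
# Arithmetic sequence with second term 2 and ninth term 30.
# General term: a_n = a_1 + (n - 1) * d, so a_9 - a_2 = 7 * d.
a2, a9 = 2, 30
d = (a9 - a2) // (9 - 2)   # common difference: 4
a1 = a2 - d                # first term: -2
assert a1 + 1 * d == a2 and a1 + 8 * d == a9
```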

Quotes

"Analogical reasoning is a unique ability of humans to address unfamiliar challenges by transferring strategies from relevant past experiences."
"Coincidentally, the NLP community has also recently found that self-generating relevant examples in the context can help large language models (LLMs) better solve a given problem than hand-crafted prompts."

Key Insights Distilled From

by Chengwei Qin... at **arxiv.org** 04-22-2024

Deeper Inquiries

To enhance LLMs' performance in analogical reasoning, particularly in mathematical reasoning tasks, several strategies can be implemented:
Diverse Training Data: Providing LLMs with a more diverse set of training data that covers a wide range of mathematical concepts and problem types can help improve their ability to draw analogies between different scenarios.
Fine-tuning: Fine-tuning LLMs on specific mathematical reasoning tasks can help them develop a deeper understanding of mathematical concepts and improve their performance on such tasks.
Explicit Analogical Reasoning Training: Incorporating explicit training modules that focus on analogical reasoning can help LLMs develop the skills necessary to draw analogies between different problems and apply relevant strategies.
Feedback Mechanisms: Implementing feedback mechanisms that provide LLMs with information on the correctness of their generated examples can help them learn from their mistakes and improve their performance over time.
Hybrid Approaches: Combining analogical prompting with other advanced techniques like chain-of-thought prompting or reinforcement learning can further enhance LLMs' ability to perform analogical reasoning effectively.
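As an illustration of the hybrid idea, analogical prompting can be combined with chain-of-thought by requiring step-by-step solutions in the self-generated examples. The template below is a hypothetical sketch, not a method evaluated in the paper.

```python
def build_hybrid_prompt(problem: str, n_examples: int = 3) -> str:
    # Combine analogical prompting (recall related problems) with
    # chain-of-thought prompting (require explicit intermediate steps).
    return (
        f"Problem: {problem}\n\n"
        f"First, recall {n_examples} related problems and solve each one "
        "step by step, showing every intermediate step.\n"
        "Then solve the problem above, again reasoning step by step, "
        "and end with 'The answer is <answer>.'"
    )
```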

Apart from example accuracy, several other factors can influence the analogical reasoning capabilities of LLMs:
Model Architecture: The architecture of the LLM, including the number of layers, attention mechanisms, and memory capacity, can significantly impact its ability to perform analogical reasoning effectively.
Prompt Design: The design of prompts used to guide LLMs in generating examples and solving problems can play a crucial role in their analogical reasoning capabilities. Well-crafted prompts that encourage relevant thinking can enhance performance.
Contextual Understanding: LLMs' ability to understand and contextualize information from the examples they generate and encounter during training can influence their analogical reasoning capabilities.
Task Complexity: The complexity of the reasoning tasks LLMs are trained on can impact their analogical reasoning abilities. More complex tasks may require a deeper level of understanding and abstraction.
Domain Knowledge: LLMs with access to a broader range of domain-specific knowledge and information may have an advantage in analogical reasoning tasks that require domain expertise.

The insights from this work can be leveraged to develop more effective prompting strategies for LLMs in various domains beyond reasoning tasks:
Relevance vs. Irrelevance: Understanding the impact of relevance and irrelevance in prompting LLMs can help in designing prompts that are tailored to specific tasks and domains, ensuring that the generated examples are contextually relevant.
Accuracy Emphasis: Emphasizing the importance of example accuracy in prompting strategies can lead to the development of mechanisms that verify the correctness of generated examples before using them for inference.
Hybrid Prompting Approaches: Integrating analogical prompting with other prompting techniques, such as contrastive prompts or adversarial prompts, can enhance the diversity and quality of examples generated by LLMs.
Feedback Integration: Incorporating feedback loops into prompting strategies can enable LLMs to learn from their mistakes and continuously improve their performance across different domains.
Transfer Learning: Applying insights from analogical reasoning in one domain to prompt LLMs in unrelated domains can facilitate knowledge transfer and enhance their adaptability to diverse tasks and scenarios.
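The accuracy-emphasis point suggests filtering self-generated examples before reuse. A minimal sketch, assuming each generated example carries a proposed solution and some verifier (here a hypothetical `is_correct` callback) is available:

```python
from typing import Callable

def filter_verified_examples(
    examples: list[tuple[str, str]],
    is_correct: Callable[[str, str], bool],
) -> list[tuple[str, str]]:
    # Keep only (problem, solution) pairs whose solution passes the
    # verifier; only these are used as in-context demonstrations,
    # mirroring the paper's observation that example accuracy, not
    # relevance, drives performance.
    return [(q, s) for q, s in examples if is_correct(q, s)]
```

With a toy arithmetic checker, `filter_verified_examples([("1+1", "2"), ("2+2", "5")], lambda q, s: eval(q) == int(s))` keeps only the first pair.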
