The article presents evidence that large language models like GPT-3 and GPT-4 are capable of solving a variety of text-based analogy problems, including novel tasks designed specifically to test their reasoning abilities. This suggests an emergent capacity for analogical reasoning in these models.
The authors address the critique that the models' performance on these tasks might merely reflect similarity to the training data by presenting evidence that the models can also solve "counterfactual" letter-string tasks, which use permuted alphabets and larger interval sizes between letters. They argue that the models' ability to solve these counterfactual tasks, and to provide accurate explanations of their solutions, cannot easily be explained as simple mimicry of the training data.
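To make the structure of these tasks concrete, the sketch below constructs a hypothetical counterfactual letter-string analogy over a randomly permuted alphabet with an interval size of two. The strings, seed, and rule shown here are illustrative assumptions, not the authors' actual stimuli.

```python
import random

# A hypothetical counterfactual letter-string analogy (illustrative only,
# not the authors' actual stimuli): the alphabet is randomly permuted and
# the "successor" relation advances by an interval of 2 instead of 1.
random.seed(0)
PERMUTED = random.sample("abcdefghijklmnopqrstuvwxyz", 26)

def successor(letter, interval=2):
    """Return the letter `interval` positions after `letter` in the permuted alphabet."""
    return PERMUTED[PERMUTED.index(letter) + interval]

# Source pair: a string and the same string with its last letter advanced.
source = PERMUTED[0:3]
source_after = source[:-1] + [successor(source[-1])]

# Target: the same abstract rule must be applied to a new string.
target = PERMUTED[5:8]
expected = target[:-1] + [successor(target[-1])]

print(f"{''.join(source)} -> {''.join(source_after)}  :  "
      f"{''.join(target)} -> {''.join(expected)} (expected answer)")
```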
The authors further demonstrate that the models' difficulties on the counterfactual tasks are likely due to a specific limitation in their ability to precisely count and index items in a list, rather than a general inability to perform analogical reasoning. This is supported by the finding that a variant of GPT-4 with the ability to write and execute code was able to solve the counterfactual tasks at a level comparable to human participants.
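As a rough illustration of why explicit counting and indexing helps, the sketch below solves such an analogy by treating the permuted alphabet as an indexed list and computing the interval directly. This is an assumed, minimal reconstruction of the kind of bookkeeping a code-executing model could delegate to a program, not the code the GPT-4 variant actually produced.

```python
def solve_letter_string_analogy(a, b, c, alphabet):
    """Infer the interval that maps the last letter of `a` to the last letter
    of `b`, then apply the same interval to `c` (a minimal sketch; the real
    tasks and the model-generated code are more varied)."""
    idx = {ch: i for i, ch in enumerate(alphabet)}   # explicit indexing
    interval = idx[b[-1]] - idx[a[-1]]               # explicit counting
    return c[:-1] + alphabet[idx[c[-1]] + interval]

# Example with a hypothetical permuted alphabet and an interval of 2.
alphabet = list("jqfxkzcbwnhmpvtgleduaosiry")
a, b = "jqf", "jqk"   # f -> k is two steps forward in this permuted alphabet
print(solve_letter_string_analogy(a, b, "zcb", alphabet))   # prints "zcn"
```

Once the alphabet is represented as an explicit list, the task reduces to position arithmetic, which is exactly the step the authors suggest the base models handle imprecisely.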
The authors conclude that the core mechanisms underlying the emergent analogical reasoning capabilities in large language models may be related to the structured operations and relational representations that support few-shot learning and inference in these models. They argue that further investigation of these internal mechanisms is an important priority for future research.
Source: arxiv.org