Core Concepts
Large language models can solve a substantial proportion of Connections puzzles, but struggle with categories requiring abstract or lateral thinking.
Abstract
The Connections puzzle, published daily by the New York Times, tasks players with dividing a grid of 16 words into 4 groups of 4 related words. Solving the puzzle requires both common linguistic knowledge and abstract reasoning, as the categories increase in complexity from "simple" to "tricky".
The authors investigate the ability of sentence embedding baselines and large language models (LLMs) in the GPT family to solve Connections puzzles. They find that the best-performing sentence embedding model (MPNET) has an 11.6% success rate, while GPT-3.5-TURBO and GPT-4-TURBO achieve 6.43% and 29.2% success rates respectively.
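A sentence-embedding baseline of this kind can be sketched as: embed each of the 16 words, then greedily extract the most mutually similar group of 4 and repeat. The sketch below is a hypothetical illustration, not the authors' exact method, and uses random vectors as a stand-in for a real encoder such as MPNET.

```python
from itertools import combinations

import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def greedy_groups(words, embed, group_size=4):
    """Repeatedly pull out the group of words with the highest
    average pairwise embedding similarity."""
    vecs = {w: embed(w) for w in words}
    remaining = list(words)
    groups = []
    while remaining:
        best = max(
            combinations(remaining, group_size),
            key=lambda g: sum(cosine(vecs[a], vecs[b])
                              for a, b in combinations(g, 2)),
        )
        groups.append(list(best))
        remaining = [w for w in remaining if w not in best]
    return groups

# Stand-in embedder: random vectors. A real baseline would call a
# sentence encoder (e.g., MPNET) here instead.
rng = np.random.default_rng(0)
fake_embed = lambda w: rng.standard_normal(16)

puzzle = [f"word{i}" for i in range(16)]
groups = greedy_groups(puzzle, fake_embed)
print(groups)
```

With real embeddings, the hope is that the four intended categories surface as the four most internally coherent groups; the paper's results suggest this works for only about one puzzle in nine.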
The authors observe that the LLMs struggle particularly with categories involving non-semantic properties of words, abstract features, or usage in context. They also find that the LLMs' performance hinges on whether the initial guess is correct, with a substantial drop in success rate if the first guess is incorrect or only nearly correct (three of the four words right).
The authors further examine the impact of chain-of-thought prompting on GPT-4-TURBO, finding that it boosts the average success rate from 29.2% to 38.93%. They also evaluate a more challenging variant of the puzzle in which all 4 groups must be submitted simultaneously, observing mixed results across models.
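Chain-of-thought prompting amounts to asking the model to reason aloud before committing to groups. The template below is a hypothetical sketch of such a prompt; the paper's exact wording is not reproduced here.

```python
def cot_prompt(words):
    """Build a hypothetical chain-of-thought prompt for a Connections puzzle."""
    word_list = ", ".join(words)
    return (
        "You are solving a Connections puzzle. Divide these 16 words into "
        "4 groups of 4 related words.\n"
        f"Words: {word_list}\n"
        "Think step by step: first list candidate categories, explain which "
        "words fit each and why, then state your final four groups."
    )

prompt = cot_prompt([f"word{i}" for i in range(16)])
print(prompt)
```

The key difference from a plain prompt is the explicit instruction to enumerate and justify candidate categories before answering, which is the mechanism chain-of-thought prompting relies on.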
Overall, the authors conclude that the Connections puzzle presents a fertile ground for studying the capabilities and limitations of modern language models in encoding and retrieving semantic information, and propose it as a useful benchmark for evaluating abstract reasoning in NLP systems.
Stats
The best-performing sentence embedding baseline (MPNET) solves 11.6% of Connections puzzles on average.
GPT-3.5-TURBO achieves a 6.43% average success rate on the Connections puzzle.
GPT-4-TURBO achieves a 29.2% average success rate on the Connections puzzle.
Chain-of-thought prompting boosts GPT-4-TURBO's performance to 38.93% average success rate.
Quotes
"The Connections puzzle acts as both a test of linguistic understanding and abstract reasoning."
"We find that the LLMs are often stumped by categories which involve non-semantic properties of words, abstract features, or usage in context."
"Studying the ways in which LLM and human player behaviors differ could shed light on the differences in their underlying representations of words and meanings."