Core Concepts
SemEval-2024 Task 9 BRAINTEASER(S) evaluates the lateral thinking abilities of large language models by presenting them with a novel set of challenging puzzles that require defying commonsense associations.
Abstract
SemEval-2024 Task 9 BRAINTEASER(S) is a novel task designed to test the lateral thinking abilities of computational models. It is based on the recently introduced BRAINTEASER benchmark, which presents two types of puzzles: Sentence Puzzles and Word Puzzles.
The Sentence Puzzles require models to overwrite commonsense associations and think unconventionally to arrive at the correct answer. For example, in the puzzle "A man shaves every day, yet keeps his beard long", the model needs to infer that the man is likely a barber who shaves others rather than himself.
The Word Puzzles challenge models to think laterally about the spelling and composition of words rather than their meanings. For instance, the puzzle "What type of cheese is made backwards?" requires the model to recognize that "Edam" is the word "made" spelled backwards.
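The wordplay in the cheese puzzle can be checked mechanically. The snippet below is a minimal sketch for illustration only: the function name `is_made_backwards` and the list of answer options are assumptions, not part of the BRAINTEASER dataset or the task's tooling.

```python
def is_made_backwards(candidate: str, source: str = "made") -> bool:
    """Return True if `candidate` equals `source` reversed, ignoring case."""
    return candidate.lower() == source.lower()[::-1]


# Hypothetical answer options, invented for this example.
options = ["Mozzarella", "Edam", "Cheddar", "Brie"]

# Keep only the options that literally spell "made" backwards.
answers = [option for option in options if is_made_backwards(option)]
print(answers)
```

The check reads the question literally (the letters of "made" in reverse order) instead of following the commonsense reading about how cheese is produced, which is the shift the puzzle asks a model to make.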
The SemEval task divides the original BRAINTEASER dataset into train, trial, and test sets to support both fine-tuning and zero/few-shot evaluation settings. The task received 483 submissions from 182 participants during the competition.
The analysis of the participant results reveals several key insights:
Architecture selection: Systems that fine-tune large language models show a tighter accuracy distribution, while systems that fine-tune smaller models or rely on prompting span a wider range of performance, a range that includes some of the top-scoring systems.
Consistency of predictions: Most models struggle to maintain consistent lateral reasoning across the original puzzles and their semantic and context reconstructions, highlighting the challenges in generalizing beyond the training data.
Limitations of fine-tuning: While fine-tuning can be effective, it is prone to learning shortcuts and fails to fully capture the essence of lateral thinking, which requires models to set aside default commonsense associations.
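The consistency point above can be made concrete by comparing per-instance accuracy with a stricter score that credits a puzzle only when the original and all of its reconstructions are answered correctly. The following is a minimal sketch under stated assumptions: the record layout, the variant labels, and the toy results are invented for illustration and are not the task's official scorer or data.

```python
from collections import defaultdict

# Toy, made-up results as (puzzle_group_id, variant, is_correct).
results = [
    ("p1", "original", True),
    ("p1", "semantic", True),
    ("p1", "context", False),
    ("p2", "original", True),
    ("p2", "semantic", True),
    ("p2", "context", True),
]


def instance_accuracy(rows):
    """Fraction of individual puzzle instances answered correctly."""
    return sum(ok for _, _, ok in rows) / len(rows)


def group_accuracy(rows):
    """Fraction of puzzle groups in which every variant is answered correctly."""
    groups = defaultdict(list)
    for group_id, _, ok in rows:
        groups[group_id].append(ok)
    return sum(all(flags) for flags in groups.values()) / len(groups)


inst = instance_accuracy(results)
grp = group_accuracy(results)
print(f"instance accuracy: {inst:.3f}, group accuracy: {grp:.3f}")
```

In this toy data, five of six instances are correct, but only one of the two puzzle groups is fully consistent, so the group score is noticeably lower than the instance score. A gap of this kind is what signals inconsistent lateral reasoning across reconstructions.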
Overall, the SemEval-2024 Task 9 BRAINTEASER(S) and its analysis provide valuable insights into the current state of lateral thinking abilities in large language models and inspire future research on developing more robust and creative reasoning capabilities.
Stats
A man shaves every day, yet keeps his beard long.
I have five fingers, but I am not alive. What am I?
What type of cheese is made backwards?
Quotes
"Lateral thinking requires systems to defy commonsense associations and overwrite them through unconventional thinking."
"Lateral thinking has been shown to be challenging for current models but has received little attention."