Core Concepts
Current language models exhibit substantial performance degradation on counterfactual task variants that deviate from the default assumptions underlying standard tasks, suggesting their task-solving abilities are often specialized to specific input-output mappings rather than general and transferable.
Abstract
The paper explores the capabilities and limitations of language models (LMs) by evaluating their performance on "counterfactual" task variants that deviate from the default assumptions underlying standard tasks. The authors design a suite of 11 tasks across various domains, including arithmetic, programming, logical reasoning, spatial reasoning, drawing, music, and chess. For each task, they define a default version and one or more counterfactual variants that share the same underlying reasoning procedure but differ in their input-output mappings.
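To make the paradigm concrete, here is a minimal Python sketch of a default versus counterfactual arithmetic instance: base 10 is the default condition, and any other base is a counterfactual variant that shares the same procedure (digit-wise addition with carries) but a different input-output mapping. The prompt wording and helper names (make_prompt, to_base, check_answer) are illustrative assumptions, not the paper's actual evaluation harness.

```python
# Sketch of the default/counterfactual paradigm via two-digit addition.
# Base 10 is the default; other bases are counterfactual variants.
# Prompt wording and helper names are assumptions, not the paper's code.

DIGITS = "0123456789abcdef"  # covers the bases mentioned (9, 10, 11, 16)

def make_prompt(a: str, b: str, base: int) -> str:
    """Build an addition query; the operands are written in `base`."""
    return f"All numbers below are in base-{base}. What is {a} + {b}?"

def to_base(n: int, base: int) -> str:
    """Render a non-negative integer as a digit string in `base`."""
    digits = []
    while True:
        n, r = divmod(n, base)
        digits.append(DIGITS[r])
        if n == 0:
            return "".join(reversed(digits))

def check_answer(a: str, b: str, base: int, model_output: str) -> bool:
    """Compute the ground truth in `base` and compare to the model's answer."""
    expected = to_base(int(a, base) + int(b, base), base)
    return model_output.strip() == expected
```

For example, check_answer("57", "26", 9, "84") returns True, since 57 + 26 = 84 when all three numerals are read in base 9 (52 + 24 = 76 in decimal).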
The authors evaluate several prominent LMs, including GPT-4, GPT-3.5, Claude, and PaLM-2, on both the default and counterfactual task variants. While the LMs exhibit above-random performance on most counterfactual tasks, their performance consistently and substantially degrades relative to the default conditions. This suggests that their task-solving abilities are often specialized to the specific input-output mappings seen during pretraining rather than being general and transferable.
The authors further analyze factors that shape the default-counterfactual performance gap: the "commonness" of the counterfactual conditions, the proximity between the default and counterfactual conditions, the correlation between default and counterfactual performance, the effectiveness of zero-shot chain-of-thought prompting, and the impact of few-shot demonstrations. They also provide a qualitative analysis of the drawing task, showing that the counterfactual drawings are often simplified or of lower quality than their default counterparts.
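As context for the chain-of-thought finding, the snippet below shows what zero-shot chain-of-thought prompting amounts to in practice: appending a reasoning trigger to the query. The trigger phrase follows the common formulation from Kojima et al. (2022); the paper's exact prompt templates may differ.

```python
def with_zero_shot_cot(prompt: str) -> str:
    """Append the standard zero-shot CoT trigger, asking the model to
    reason step by step before giving its final answer."""
    return prompt + " Let's think step by step."

# e.g. with_zero_shot_cot(make_prompt("57", "26", 9)) using the sketch above
```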
Overall, the results suggest that the success of existing LMs on standard benchmarks should not be taken as sufficient evidence that they possess a fully general capacity for the target tasks. The authors argue for a more careful interpretation of LM performance, one that disentangles specialized, non-transferable behaviors from abstract, generalizable reasoning skills.
Stats
The two-digit addition accuracy decreases from 100% in base 10 to around 50% in bases 9, 11, and 16.
The spatial reasoning accuracy drops from around 90% in the default condition to around 50% in the counterfactual conditions.
The chord fingering accuracy for the default guitar tuning is around 90%, but drops to around 50% for the counterfactual tunings.
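To make the tuning stat concrete, the sketch below shows how the ground truth changes under a counterfactual tuning: the same fingering sounds different notes once the open strings change. The drop-D tuning here is an illustrative choice and not necessarily one of the paper's counterfactual tunings.

```python
# Sketch: which notes a fixed fingering sounds under a given tuning.
# Standard tuning is the default condition; any alternate tuning is a
# counterfactual one. Drop D is illustrative, not necessarily the
# paper's choice of counterfactual tuning.

NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def fretted_notes(tuning, frets):
    """Map open-string notes plus fret positions (None = muted string)
    to the note names the fingering actually sounds."""
    sounded = []
    for open_note, fret in zip(tuning, frets):
        if fret is None:
            continue
        sounded.append(NOTES[(NOTES.index(open_note) + fret) % 12])
    return sounded

STANDARD = ["E", "A", "D", "G", "B", "E"]  # default tuning, low to high
DROP_D = ["D", "A", "D", "G", "B", "E"]    # counterfactual tuning

# Open-position E major fingering, frets per string: 0-2-2-1-0-0.
e_major = [0, 2, 2, 1, 0, 0]

print(fretted_notes(STANDARD, e_major))  # ['E', 'B', 'E', 'G#', 'B', 'E']
print(fretted_notes(DROP_D, e_major))    # ['D', 'B', 'E', 'G#', 'B', 'E'] (no longer E major)
```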
Quotes
"Ideally, we expect a general-purpose LM to be able to generalize not only to unseen instances of known tasks, but to new tasks."
"We observe above-random counterfactual performance for most tasks, indicating some degree of task generalizability. However, the performance on counterfactual task variants consistently and substantially degrades relative to the performance on the default settings."
"These results also reveal several surprising relations between model behavior on default and counterfactual tasks, including correlations between default and counterfactual performance, varying effectiveness of zero-shot chain-of-thought prompting, and interactions between task- and instance-level frequency effects."