Evaluating the Generalisability of Transformer Models in Mathematical Reasoning Tasks
Transformer models, including GPT-4 and fine-tuned BERT variants, show limited generalisation under out-of-distribution perturbations of mathematical reasoning tasks, despite strong in-distribution performance.
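
One way to make this claim measurable is to compare accuracy on the original problems with accuracy on perturbed copies of the same problems. The Python sketch below is illustrative only. The toy addition task, the helpers `perturb_numbers` and `generalisation_gap`, and the lookup-table `memoriser` standing in for a model are all assumptions introduced here; they are not the evaluation protocol or the models discussed in this work.

```python
import random
import re


def solve_reference(problem: str) -> int:
    """Ground truth for the toy task: the sum of all integers in the text."""
    return sum(int(n) for n in re.findall(r"\d+", problem))


def perturb_numbers(problem: str, rng: random.Random,
                    low: int = 100, high: int = 999) -> str:
    """Replace every integer with a random one drawn from [low, high].

    The default range lies outside the values used in the toy problems,
    so each perturbed problem is out-of-distribution by construction.
    """
    return re.sub(r"\d+", lambda m: str(rng.randint(low, high)), problem)


def accuracy(model, problems) -> float:
    """Fraction of problems for which the model matches the reference."""
    correct = sum(model(p) == solve_reference(p) for p in problems)
    return correct / len(problems)


def generalisation_gap(model, problems, rng: random.Random):
    """Return (in-distribution accuracy, perturbed accuracy, difference)."""
    perturbed = [perturb_numbers(p, rng) for p in problems]
    acc_id = accuracy(model, problems)
    acc_ood = accuracy(model, perturbed)
    return acc_id, acc_ood, acc_id - acc_ood


# Hypothetical in-distribution problems (all operands below 100).
in_dist = ["What is 3 plus 4?", "What is 12 plus 7?", "What is 5 plus 9?"]

# Hypothetical stand-in model: it has memorised the answers to the
# in-distribution problems and returns -1 for anything else.
memorised = {p: solve_reference(p) for p in in_dist}


def memoriser(problem: str) -> int:
    return memorised.get(problem, -1)


acc_id, acc_ood, gap = generalisation_gap(memoriser, in_dist, random.Random(0))
print(acc_id, acc_ood, gap)
```

In this toy setting the lookup-table model scores perfectly on the original problems and fails on every perturbed one, so the gap takes its maximum value of 1.0. A model that actually computes the sum, such as `solve_reference` itself, would show a gap of zero.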