Evaluating the Generalization Capabilities of Large Language Models on Elementary Mathematical Reasoning
Many large language models exhibit significant overfitting to established mathematical reasoning benchmarks, suggesting that their benchmark scores may overstate their true reasoning ability.