Key Concepts
Many large language models exhibit significant overfitting on established mathematical reasoning benchmarks, suggesting their performance may not reflect true reasoning abilities.
Summary
The authors investigate the reasoning capabilities of large language models (LLMs) by creating a new dataset, Grade School Math 1000 (GSM1k), which is designed to mirror the style and complexity of the established GSM8k benchmark.
Key highlights:
- The authors find that several families of models, notably Mistral and Phi, show consistent overfitting across model sizes, with performance drops of up to 13% on GSM1k compared to GSM8k.
- However, the authors also find that frontier models, such as Gemini, GPT, and Claude, show minimal signs of overfitting, suggesting they have learned genuine mathematical reasoning abilities.
- Further analysis reveals a positive correlation between a model's likelihood of generating examples from GSM8k and its performance gap between GSM8k and GSM1k, suggesting that data contamination is one factor contributing to overfitting.
- Despite the overfitting observed, the authors find that even the most overfit models are still capable of solving a significant portion of the novel GSM1k problems, indicating they have learned some generalizable mathematical reasoning.
The authors conclude that while data contamination is a concern, it does not fully explain the overfitting observed, and that the frontier models have made progress in developing genuine mathematical reasoning capabilities.
Statistics
The authors find that the worst-performing model shows a 13% drop in accuracy on GSM1k relative to GSM8k.
The authors observe a Spearman's rank correlation of 0.32 between a model's probability of generating GSM8k examples and its performance gap between GSM8k and GSM1k.
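The correlation above relates two per-model quantities: how readily a model reproduces GSM8k examples and how much its accuracy falls on GSM1k. The sketch below illustrates, in principle, how such a Spearman rank correlation could be computed; it uses SciPy's spearmanr and entirely hypothetical placeholder numbers, not the authors' data or code, and the likelihood measure shown is an assumed stand-in for whatever contamination metric the paper uses.

```python
# Minimal sketch (not the authors' code): Spearman rank correlation between a
# per-model GSM8k generation likelihood and the GSM8k-GSM1k accuracy gap.
# All model names and numbers below are hypothetical placeholders.
from scipy.stats import spearmanr

# Hypothetical per-model measurements:
#   log_likelihood: mean per-token log-likelihood assigned to GSM8k problems
#   gap: accuracy on GSM8k minus accuracy on GSM1k (positive = possible overfitting)
models = {
    "model_a": {"log_likelihood": -2.10, "gap": 0.13},
    "model_b": {"log_likelihood": -2.45, "gap": 0.08},
    "model_c": {"log_likelihood": -2.90, "gap": 0.02},
    "model_d": {"log_likelihood": -3.10, "gap": -0.01},
}

likelihoods = [m["log_likelihood"] for m in models.values()]
gaps = [m["gap"] for m in models.values()]

# Spearman's rho measures monotonic association between the two rankings,
# so it is insensitive to the exact scale of either quantity.
rho, p_value = spearmanr(likelihoods, gaps)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```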
Quotations
"Several families of models, most notably Mistral and Phi, show consistent evidence of overfitting for nearly all model versions and sizes."
"All frontier models, as well as all sizes of the Llama2 family, show minimal signs of overfitting."
"Even the most overfit models are still capable of successfully generalizing to new mathematical grade school problems, albeit occasionally at lower rates than their benchmark numbers would suggest."