Core Concepts
SIKeD (Self-guided Iterative Knowledge Distillation) is an iterative knowledge distillation technique that improves the mathematical reasoning of smaller language models. It addresses a limitation of traditional one-step distillation, under which smaller models struggle to master several reasoning strategies at once, and enables them to learn multiple strategies and select among them effectively.
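A minimal sketch of the iterative loop is shown below. The helper functions, the strategy names, and the `alpha` mixing rule are illustrative placeholders (assumptions), not the paper's exact recipe; the overall shape (distill from the LLM, then repeatedly mix the student's own correct rationales back into training) follows the description above.

```python
import random

# Minimal SIKeD sketch. The helpers below are placeholders standing in for
# real fine-tuning, generation, and answer-checking code.

STRATEGIES = ("chain_of_thought", "least_to_most", "program_of_thought")  # assumed strategy set

def fine_tune(student, data):
    """Placeholder: fine-tune the student on (question, strategy, rationale) triples."""
    return student

def sample_rationales(student, question, strategy, k=4):
    """Placeholder: sample k rationales from the student for one strategy."""
    return []

def is_correct(rationale, gold_answer):
    """Placeholder: compare the rationale's final answer against the gold answer."""
    return False

def siked(student, llm_data, train_set, iterations=3, alpha=0.5):
    # Step 0: standard one-step distillation on LLM-generated rationales.
    student = fine_tune(student, llm_data)
    for _ in range(iterations):  # three iterations reported as the sweet spot
        self_data = [
            (q, s, r)
            for q, gold in train_set
            for s in STRATEGIES
            for r in sample_rationales(student, q, s)
            if is_correct(r, gold)  # keep only rationales with the right final answer
        ]
        # Self-guided data mixing: blend self-generated and LLM-distilled data.
        n_llm = min(len(llm_data), int(len(self_data) * (1 - alpha) / max(alpha, 1e-9)))
        student = fine_tune(student, self_data + random.sample(llm_data, n_llm))
    return student
```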
Stats
SIKeD achieves improvements of up to +5 points over traditional distillation strategies on four mathematical datasets (GSM8K, SVAMP, ASDiv, and MultiArith).
On GSM8K, SIKeD shows gains of +3.2 points and +2.5 points for Gemma 2B and 7B models, respectively.
The SmolLM 1.7B model exhibits the largest improvement with SIKeD, at +3.4 points.
Among the out-of-distribution datasets, SIKeD consistently improves accuracy on ASDiv, with gains of +0.8 to +1.6 points across models.
On MultiArith, SIKeD leads to substantial gains, with SmolLM showing the largest improvement of +5 points.
Gemma 7B achieves a perfect score of 100 on MultiArith using SIKeD.
Biasing SIKeD towards a specific strategy further improves accuracy, outperforming individual distillation strategies by +2 to +4 points.
Three iterations of SIKeD consistently demonstrate optimal performance across different models and datasets.
On GSM8K, on-policy SIKeD (self-generated data sampled from the current student at each iteration) outperforms off-policy training (data generated once by a fixed earlier model) by +6 points with the Gemma 2B model.
Similar trends hold on the out-of-distribution datasets, with SIKeD gaining +4 to +7 points on SVAMP and ASDiv and +2 points on MultiArith over off-policy training (see the sketch below).
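To make the on-policy vs. off-policy comparison concrete, here is a toy sketch of the two regimes. The helpers are placeholders (assumptions), as in the earlier sketch; the actual training code differs.

```python
def fine_tune(student, data):
    return student  # placeholder

def collect_correct(model, train_set):
    return []       # placeholder: self-generated rationales whose final answers are correct

def off_policy(student, llm_data, train_set, iterations=3):
    pool = collect_correct(student, train_set)           # generated once, from a fixed model
    for _ in range(iterations):
        student = fine_tune(student, pool + llm_data)
    return student

def on_policy(student, llm_data, train_set, iterations=3):
    for _ in range(iterations):
        pool = collect_correct(student, train_set)       # regenerated from the latest student
        student = fine_tune(student, pool + llm_data)
    return student
```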
Quotations
"Although smaller models have demonstrated impressive performance when distilled with a single strategy, they often struggle to master multiple strategies equally well."
"To address this challenge, we introduce our distillation methodology, SIKeD: Self-guided Iterative Knowledge Distillation."
"Our proposed method extends beyond traditional one-step distillation, as each iteration of SIKeD leads to an updated policy that better grasps new information."