Distilling the decomposition capability of large language models (LLMs) offers a cost-effective and generalizable approach to improve reasoning in smaller models, while distilling the solving capability proves less effective and less generalizable.
Strategically distilling knowledge from powerful LLMs into smaller models using task-aware curriculum planning and response refinement significantly improves the instruction-following abilities of smaller LLMs, even surpassing larger models trained on more extensive datasets.
The Bi-directional Logits Difference (BiLD) loss effectively distills large language models by filtering out long-tail noise in logits and leveraging the internal ranking information of logits.
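A minimal NumPy sketch of the idea behind a logits-difference loss of this kind: keep only the top-k logits (discarding the long-tail noise), form pairwise differences among them (which encode their internal ranking), and penalize the divergence between teacher and student difference patterns in both directions. The value of k, the KL direction, and the term weighting here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kl(p, q, eps=1e-12):
    # KL divergence between two categorical distributions
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def logit_diff_kl(ref_logits, other_logits, k):
    # top-k positions of the reference logits: filters out long-tail noise
    idx = np.argsort(ref_logits, axis=-1)[..., ::-1][..., :k]
    r = np.take_along_axis(ref_logits, idx, axis=-1)
    o = np.take_along_axis(other_logits, idx, axis=-1)
    # pairwise differences among the k logits capture their internal ranking
    r_diff = r[..., :, None] - r[..., None, :]
    o_diff = o[..., :, None] - o[..., None, :]
    # drop the zero diagonal, then compare difference patterns via KL
    mask = ~np.eye(k, dtype=bool)
    return kl(softmax(r_diff[..., mask]), softmax(o_diff[..., mask])).mean()

def bild_style_loss(teacher_logits, student_logits, k=8):
    # bi-directional: one term led by the teacher's top-k, one by the student's
    t_led = logit_diff_kl(teacher_logits, student_logits, k)
    s_led = logit_diff_kl(student_logits, teacher_logits, k)
    return t_led + s_led
```

The loss is zero when student and teacher logits agree exactly and positive otherwise; because only top-k differences enter, gradients focus on ranking the plausible tokens rather than matching probabilities on the long tail.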