Core Concepts
MINILLM distills large language models into smaller student models by minimizing the reverse KLD, KL(student || teacher), instead of the standard forward KLD used in conventional knowledge distillation. This mode-seeking objective keeps the student from overestimating low-probability regions of the teacher distribution, yielding more precise responses and lower exposure bias.
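A minimal sketch of the contrast between the two objectives, assuming NumPy; the toy distributions and the helper name kl_divergence are illustrative and not taken from the paper:

```python
# Illustrative sketch (not the authors' implementation): forward vs. reverse
# KLD on a single next-token distribution.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions over a vocabulary."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical next-token distributions over a 4-token vocabulary.
teacher = np.array([0.70, 0.20, 0.08, 0.02])   # peaked teacher distribution
student = np.array([0.40, 0.30, 0.20, 0.10])   # smoother student distribution

# Standard (forward) KD objective: KL(teacher || student).
# Pushes the student to cover every region the teacher assigns mass to.
forward_kld = kl_divergence(teacher, student)

# Reverse objective, as used in MINILLM: KL(student || teacher).
# Mode-seeking: the student is penalized for placing mass where the teacher
# assigns very little, so it concentrates on the teacher's high-probability
# regions instead of spreading thin over all of them.
reverse_kld = kl_divergence(student, teacher)

print(f"forward KL(teacher || student) = {forward_kld:.4f}")
print(f"reverse KL(student || teacher) = {reverse_kld:.4f}")
```

In the paper itself the reverse objective is optimized with a policy-gradient-style procedure rather than a closed-form loss; the snippet above only illustrates the direction of the divergence, not the training algorithm.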
Stats
"Our method is scalable for different model families with 120M to 13B parameters."
"MINILLM consistently outperforms standard KD baselines on all datasets."
"MINILLM yields lower exposure bias, better calibration, and higher long response generation performance."
Quotes
"Our method is suitable and works well for compressing large (generative) language models."
"MINILLM consistently outperforms standard KD baselines on all the datasets."