Core Concept
EMS-SD is a novel method that significantly accelerates multi-sample speculative decoding in Large Language Models by eliminating the need for padding tokens, thereby reducing computational and memory overhead.
Statistics
With a batch size of 8, EMS-SD achieved a speedup of 2.17 times, whereas the vanilla method only achieved a speedup of 1.37 times.
With a batch size of 12, the opt-13b model achieved a 1.62x speedup using EMS-SD, whereas the vanilla method exhibited no acceleration and was outperformed by the greedy decoding method.
The average padding ratio for the vanilla method with the opt-13b model and a batch size of 12 exceeds 115%.
For the opt-6.7b model with a batch size of 16, inference takes 22.6 milliseconds when processing five tokens per sample versus 16.6 milliseconds when processing a single token, so the five-token pass is only about 1.36 times slower (22.6 / 16.6) despite handling five times as many tokens.
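To make the padding-ratio figure above concrete, here is a minimal sketch of how such a ratio could be computed. It assumes the ratio is defined as padding tokens divided by real tokens when every sample in a batch is padded to the longest one. That definition is an assumption, chosen because it is consistent with a value above 100%; the summary itself does not spell it out. The function name `padding_ratio` and the example lengths are hypothetical.

```python
from typing import Sequence


def padding_ratio(lengths: Sequence[int]) -> float:
    """Padding tokens divided by real tokens (assumed definition).

    Every sample is padded up to the length of the longest sample,
    as a padded batch would require.
    """
    total_real = sum(lengths)
    if total_real == 0:
        raise ValueError("need at least one real token")
    longest = max(lengths)
    # Each sample contributes (longest - its own length) padding tokens.
    total_padding = sum(longest - n for n in lengths)
    return total_padding / total_real


# Hypothetical per-sample token counts in one batch (not from the paper).
lengths = [5, 1, 2, 1]
print(f"padding ratio: {padding_ratio(lengths):.0%}")  # prints "padding ratio: 122%"
```

In this toy batch, 11 padding tokens accompany 9 real tokens, so more than half of the computation is spent on padding. A ratio above 100% means the same thing at scale, which is the overhead that a padding-free approach avoids.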
Quotes
"We are the first to study speculative decoding in the context of multi-sample situations, and we have proposed an effective method for addressing this issue."
"Our method can be easily integrated into almost all basic speculative decoding methods."
"Extensive comparisons show that EMS-SD exhibits superior performance compared to the vanilla method in multi-sample speculative decoding."