Core Concepts
Improving the efficiency of serving large language models with a recurrent drafter for speculative decoding.
Abstract
The paper introduces the recurrent drafter (ReDrafter), a method that improves the efficiency of serving large language models by enhancing speculative decoding. The approach combines elements of classic two-model speculative decoding with the more recent single-model approach, Medusa. By using a single, lightweight draft head with a recurrent dependency design, the method simplifies inference while maintaining effectiveness. The recurrent dependency lets the drafter apply standard beam search to efficiently filter out low-quality candidates, and the tree attention used for verification is constructed dynamically at runtime from the beam search results, without relying on additional data sets. Experiments demonstrate the effectiveness of the method on popular open-source language models.
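To make the recurrent dependency concrete, here is a minimal PyTorch sketch of a draft head in this style. The class name, layer sizes, and single-layer update rule are illustrative assumptions rather than the authors' exact architecture; the point is that each draft step conditions on a recurrent state seeded from the target model's last hidden state and on the previously drafted token.

```python
import torch
import torch.nn as nn


class RecurrentDraftHead(nn.Module):
    """Lightweight draft head with a recurrent dependency (illustrative sketch)."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        # Single-layer recurrence over [state; prev-token embedding] (an assumption).
        self.update = nn.Linear(2 * hidden_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def step(self, state, prev_token):
        # state: (batch, hidden) recurrent state, seeded from the target LLM's
        # last hidden state; prev_token: (batch,) most recently drafted token.
        x = torch.cat([state, self.embed(prev_token)], dim=-1)
        new_state = torch.relu(self.update(x))
        logits = self.lm_head(new_state)  # scores for the next draft token
        return logits, new_state


# Drafting a few tokens greedily from a stand-in hidden state:
head = RecurrentDraftHead(hidden_size=64, vocab_size=100)
state = torch.randn(1, 64)         # stand-in for the LLM's last hidden state
token = torch.tensor([0])          # stand-in for the last accepted token
for _ in range(4):
    logits, state = head.step(state, token)
    token = logits.argmax(dim=-1)  # in practice, beam search over these logits
```

Because later draft positions depend on earlier ones through the state, the candidates form coherent sequences that beam search can rank and prune.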
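Since beams often share prefixes, the beam search results can be deduplicated into a prefix tree before verification. The helper below is a plain-Python illustration of that idea (the function name and token values are made up); each trie node corresponds to one position the target model must verify, rather than one slot per beam, which is the structure behind the dynamically constructed tree attention.

```python
def beams_to_prefix_tree(beams):
    """Collapse beam-search candidates that share prefixes into a trie.

    Illustrative only: shared prefixes become a single path, so the
    target model verifies each unique prefix once instead of per beam.
    """
    trie = {}
    for beam in beams:
        node = trie
        for token in beam:
            node = node.setdefault(token, {})
    return trie


# Three beams of length 3 hold 9 token slots but only 6 unique trie nodes.
beams = [(7, 3, 1), (7, 3, 9), (7, 4, 2)]
print(beams_to_prefix_tree(beams))
# {7: {3: {1: {}, 9: {}}, 4: {2: {}}}}
```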
Stats
Large language models have billions of parameters, which makes autoregressive serving costly.
Medusa's draft heads use two ResNet blocks totaling 0.74B parameters.
ReDrafter's setups are leaner, at 0.33B and 0.56B parameters.
ReDrafter achieved speed-ups of 2.67x for Vicuna 7B and 2.92x for Vicuna 13B compared to Medusa.
Quotes
"Recurrent Drafter simplifies inference while maintaining effectiveness."
"Our method combines elements from classic and recent speculative decoding approaches."
"Empirical demonstrations show the effectiveness of our methodology."