This paper introduces the recurrent drafter, a method for improving the serving efficiency of large language models by enhancing speculative decoding. The approach combines the strengths of classic two-model speculative decoding with the more recent single-model approach exemplified by Medusa. Instead of several independent draft heads, it uses a single, lightweight draft head with a recurrent dependency on previously drafted tokens, which simplifies inference while preserving draft quality. This recurrent dependency also makes it possible to apply beam search directly to the draft head, filtering out low-quality candidate sequences efficiently. In addition, a tree attention structure is constructed dynamically at runtime from the beam search results, so no predetermined tree or additional data set is required. Empirical results demonstrate the method's effectiveness on popular open-source language models.
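The paper's own implementation is not reproduced in this summary; the PyTorch sketch below is a minimal illustration of the three ingredients described above: a single lightweight draft head with a recurrent dependency, beam-search drafting, and a tree attention mask assembled at runtime from the surviving beams. All names (`RecurrentDraftHead`, `beam_search_draft`, `build_token_tree`) and architectural details (a GRU cell as the recurrent update) are assumptions made for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn


class RecurrentDraftHead(nn.Module):
    """Lightweight draft head with a recurrent dependency on previously
    drafted tokens. The GRU update rule is an illustrative assumption."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRUCell(hidden_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def step(self, state: torch.Tensor, prev_token: torch.Tensor):
        """One draft step: fold the previous token's embedding into the
        recurrent state, then score the vocabulary."""
        state = self.rnn(self.embed(prev_token), state)
        return self.lm_head(state), state


@torch.no_grad()
def beam_search_draft(head, init_state, init_token, beam_width=4, draft_len=5):
    """Draft `draft_len` tokens with beam search; low-quality candidates
    are pruned as the beams are re-ranked at every step."""
    states = init_state.expand(beam_width, -1).contiguous()  # (beam, hidden)
    tokens = init_token.expand(beam_width, 1).clone()        # (beam, 1)
    scores = torch.zeros(beam_width)
    scores[1:] = float("-inf")  # start from a single live beam
    for _ in range(draft_len):
        logits, states = head.step(states, tokens[:, -1])
        logp = torch.log_softmax(logits, dim=-1)             # (beam, vocab)
        scores, flat = (scores[:, None] + logp).view(-1).topk(beam_width)
        beam_idx = flat // logp.size(-1)
        tok_idx = flat % logp.size(-1)
        states = states[beam_idx]
        tokens = torch.cat([tokens[beam_idx], tok_idx[:, None]], dim=1)
    return tokens[:, 1:], scores  # drop the seed token


def build_token_tree(beams: torch.Tensor):
    """Collapse the beam candidates into a prefix tree so shared prefixes
    are verified only once; the tree attention mask is built at runtime
    from the beams themselves, with no offline tree or extra data set."""
    nodes, parents = [], []   # flattened tree nodes and parent pointers
    children = {}             # (parent index, token) -> node index
    for beam in beams.tolist():
        parent = -1           # root corresponds to the last accepted token
        for tok in beam:
            key = (parent, tok)
            if key not in children:
                children[key] = len(nodes)
                nodes.append(tok)
                parents.append(parent)
            parent = children[key]
    n = len(nodes)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:        # each node attends only to its ancestors
            mask[i, j] = True
            j = parents[j]
    return torch.tensor(nodes), torch.tensor(parents), mask


# Example: draft 5 tokens with beam width 4 from a dummy hidden state.
head = RecurrentDraftHead(hidden_size=64, vocab_size=1000)
state = torch.randn(1, 64)    # last hidden state from the target LLM
token = torch.tensor([42])    # last accepted token id
beams, _ = beam_search_draft(head, state, token)
tokens, parents, mask = build_token_tree(beams)
```

Because every node in the tree attends only to its ancestors, the target model can score all candidate tokens in a single forward pass and accept the longest verified prefix, which is the payoff of building the tree dynamically from the beam results.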
Key Insights Extracted From
by Aonan Zhang, ... at arxiv.org 03-18-2024
https://arxiv.org/pdf/2403.09919.pdf