The content introduces the recurrent drafter, a novel method for improving the serving efficiency of large language models by enhancing speculative decoding. The approach combines elements of classic two-model speculative decoding with the more recent single-model approach, Medusa. By using a single, lightweight draft head with a recurrent dependency design, it simplifies the inference process while remaining effective. The recurrent dependency allows beam search to be applied directly to filter out low-quality candidates, and an efficient tree attention structure is then constructed dynamically at runtime from the beam search results, without relying on additional data sets. Empirical results demonstrate the effectiveness of this methodology on popular open-source language models.
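To make the mechanism concrete, here is a minimal PyTorch sketch of a draft head with a recurrent dependency on previously drafted tokens. The class name `RecurrentDraftHead`, the layer sizes, the SiLU-based update rule, and the greedy `draft_greedy` loop are illustrative assumptions rather than the paper's exact parameterization; the actual method runs beam search over the draft distribution to filter out low-quality candidates.

```python
import torch
import torch.nn as nn

class RecurrentDraftHead(nn.Module):
    """Sketch of a single lightweight draft head whose state is
    updated recurrently from the previously drafted token.
    Hypothetical parameterization, for illustration only."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        # Recurrent update: mixes the previous draft state with the
        # embedding of the token drafted at the previous step.
        self.update = nn.Linear(2 * hidden_size, hidden_size)
        self.act = nn.SiLU()
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, state: torch.Tensor, prev_token: torch.Tensor):
        # state:      (batch, hidden) draft state, initialized from the
        #             target model's last hidden state.
        # prev_token: (batch,) id of the most recently drafted token.
        e = self.embed(prev_token)
        state = self.act(self.update(torch.cat([state, e], dim=-1)))
        return self.lm_head(state), state


def draft_greedy(head, init_state, init_token, num_steps=4):
    """Greedy drafting loop; the actual method uses beam search here."""
    state, tok = init_state, init_token
    draft = []
    for _ in range(num_steps):
        logits, state = head(state, tok)
        tok = logits.argmax(dim=-1)
        draft.append(tok)
    return torch.stack(draft, dim=-1)  # (batch, num_steps)
```

In a full pipeline, the surviving beam candidates would be packed into a dynamically built tree attention mask and verified by the target model in a single forward pass, which is what saves decoding steps.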
Key ideas extracted from the source content at arxiv.org, by Aonan Zhang et al., 03-18-2024: https://arxiv.org/pdf/2403.09919.pdf