The content introduces a novel method, the recurrent drafter, to improve the efficiency of serving large language models by enhancing speculative decoding. This approach combines elements from classic two-model speculative decoding and the more recent single-model approach, Medusa. By utilizing a single, lightweight draft head with a recurrent dependency design, the method simplifies the inference process while maintaining effectiveness. The recurrent drafter allows for direct use of beam search to filter out low-quality candidates efficiently. Additionally, an efficient tree attention algorithm based on beam search results is dynamically constructed during runtime without relying on additional data sets. Empirical demonstrations showcase the effectiveness of this methodology on popular open-source language models.
Sang ngôn ngữ khác
từ nội dung nguồn
arxiv.org
Thông tin chi tiết chính được chắt lọc từ
by Aonan Zhang,... lúc arxiv.org 03-18-2024
https://arxiv.org/pdf/2403.09919.pdfYêu cầu sâu hơn