The paper introduces the recurrent drafter, a method that improves the efficiency of serving large language models by enhancing speculative decoding. The approach combines elements of classic two-model speculative decoding with the more recent single-model approach, Medusa. It uses a single, lightweight draft head with a recurrent dependency design, which simplifies inference while remaining effective. Because of the recurrent dependency, beam search can be applied directly to the draft head to filter out low-quality candidates quickly. In addition, an efficient tree attention structure is built dynamically at runtime from the beam search results, without relying on additional data sets. The authors demonstrate the effectiveness of the method empirically on several popular open-source language models.
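To make the idea of building tree attention from beam search results more concrete, here is a minimal sketch in Python. It is not the paper's implementation: it assumes that candidates sharing a prefix are merged into one tree and that each node may attend only to itself and its ancestors. The function names `build_prefix_tree` and `tree_attention_mask` and the token ids are hypothetical, chosen only for illustration.

```python
import numpy as np


def build_prefix_tree(beams):
    """Merge beam candidates that share a prefix into a flat list of tree nodes.

    Hypothetical helper: returns the token stored at each node and the index
    of each node's parent (-1 for nodes that directly follow the context).
    """
    tokens = []   # token id stored at each node
    parents = []  # parent node index, -1 for first-level nodes
    index = {}    # (parent index, token id) -> node index
    for beam in beams:
        parent = -1
        for tok in beam:
            key = (parent, tok)
            if key not in index:
                index[key] = len(tokens)
                tokens.append(tok)
                parents.append(parent)
            parent = index[key]
    return tokens, parents


def tree_attention_mask(parents):
    """Boolean mask where node i may attend to itself and to its ancestors."""
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, marking each ancestor
            mask[i, j] = True
            j = parents[j]
    return mask


# Three made-up beam candidates; the first two share the prefix [5, 7].
beams = [[5, 7, 9], [5, 7, 2], [5, 3, 9]]
tokens, parents = build_prefix_tree(beams)
mask = tree_attention_mask(parents)
```

In this toy example the three candidates contain nine tokens in total, but the merged tree has only six nodes, so the target model would verify fewer positions in a single pass. Because the tree is derived from whatever beams are produced at each step, it can be rebuilt at runtime rather than fixed ahead of time.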
Source: arxiv.org
Deeper Inquiries