The paper proposes a novel method called dynamic-width speculative beam decoding (DSBD) that integrates speculative decoding with beam sampling to accelerate the inference process of large language models (LLMs) while maintaining high-quality outputs.
Key highlights:
Draft and Verification Scheme: DSBD leverages a smaller auxiliary model to generate multiple draft sequences (draft beams), which are then verified and refined by the larger model. This allows DSBD to maintain multiple candidate sequences throughout the decoding process, achieving better output quality than multinomial sampling.
Dynamic Beam Width Adjustment: DSBD introduces an adaptive mechanism to dynamically tune the number of beams based on the context, optimizing the balance between efficiency and effectiveness.
Forest-based Parallel Decoding: DSBD extends the tree-based parallel decoding approach to handle multiple trees (beams) simultaneously, accelerating the verification process.
Memory Cost Reduction: DSBD proposes a simple modification to mitigate the additional memory cost inherent in beam sampling, making it comparable to the memory usage of multinomial sampling and speculative decoding.
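The draft-verify loop and the width adjustment in the first two highlights can be sketched in Python. Everything below is illustrative: the toy distributions stand in for real draft and target models, the acceptance test is standard speculative sampling (accept with probability min(1, p/q), resample from the residual on rejection), and the width-adjustment thresholds are invented for the example rather than taken from the paper.

```python
import random

VOCAB_SIZE = 8       # toy vocabulary for the sketch
GAMMA = 4            # draft tokens proposed per verification round
MIN_WIDTH, MAX_WIDTH = 1, 4

def toy_dist(prefix, salt):
    """Deterministic toy next-token distribution standing in for a model."""
    rng = random.Random((hash(tuple(prefix)) ^ salt) & 0xFFFFFFFF)
    weights = [rng.random() + 1e-6 for _ in range(VOCAB_SIZE)]
    total = sum(weights)
    return [w / total for w in weights]

def draft_probs(prefix):   # stand-in for the small auxiliary (draft) model
    return toy_dist(prefix, 0xD5BD)

def target_probs(prefix):  # stand-in for the large target model
    return toy_dist(prefix, 0x7A26)

def sample(dist, rng):
    """Draw one token index from a categorical distribution."""
    r, acc = rng.random(), 0.0
    for tok, p in enumerate(dist):
        acc += p
        if r <= acc:
            return tok
    return len(dist) - 1

def verify_beam(prefix, rng):
    """One draft-then-verify round for a single beam: the draft model
    proposes GAMMA tokens; the target model accepts each with probability
    min(1, p/q) and resamples from the residual max(0, p - q) on rejection."""
    cur, drafted = list(prefix), []
    for _ in range(GAMMA):
        tok = sample(draft_probs(cur), rng)
        drafted.append(tok)
        cur.append(tok)
    out, cur, n_accepted = list(prefix), list(prefix), 0
    for tok in drafted:
        q, p = draft_probs(cur), target_probs(cur)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            cur.append(tok)
            n_accepted += 1
        else:
            resid = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            z = sum(resid)
            out.append(sample([x / z for x in resid], rng) if z > 0
                       else sample(p, rng))
            break
    return out, n_accepted

def dsbd_step(beams, width, rng):
    """Verify every beam, then adjust the beam width from the observed
    acceptance rate (an illustrative heuristic, not the paper's rule)."""
    results = [verify_beam(b, rng) for b in beams]
    rate = sum(n for _, n in results) / (len(results) * GAMMA)
    if rate > 0.6:
        width = max(MIN_WIDTH, width - 1)   # drafts agree well: fewer beams
    elif rate < 0.3:
        width = min(MAX_WIDTH, width + 1)   # drafts agree poorly: more beams
    results.sort(key=lambda r: r[1], reverse=True)  # rank by accepted tokens
    return [seq for seq, _ in results][:width], width
```

To run a few rounds, seed `random.Random`, start from a single prefix such as `[0]` with an initial width of 2, and call `dsbd_step` in a loop; each round extends every surviving beam by at least one verified token. For simplicity this sketch only prunes beams and never branches them, and it reruns the toy "models" per token, whereas the paper's forest-based parallel decoding verifies all draft trees in one batched pass of the large model.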
The experimental results demonstrate that DSBD achieves a 1.5–1.9× speed-up and 1.8–2.5× lower energy consumption compared to beam sampling, without sacrificing performance on downstream tasks. Moreover, DSBD produces significantly higher-quality outputs than speculative decoding at similar time, memory, and energy cost.
Source: arxiv.org