Core Concepts
Dynamic-width speculative beam decoding (DSBD) combines the speed advantage of speculative decoding with the accuracy and diversity benefits of beam sampling, enabling faster and higher-quality inference for large language models.
Abstract
The paper proposes a novel method called dynamic-width speculative beam decoding (DSBD) that integrates speculative decoding with beam sampling to accelerate the inference process of large language models (LLMs) while maintaining high-quality outputs.
Key highlights:
Draft and Verification Scheme: DSBD leverages a smaller auxiliary model to generate multiple draft sequences (draft beams), which are then verified and refined by the larger model. This lets DSBD maintain multiple candidate sequences throughout decoding, achieving better output quality than multinomial sampling (a minimal sketch of this loop follows this list).
Dynamic Beam Width Adjustment: DSBD introduces an adaptive mechanism that tunes the number of beams on the fly based on the context, balancing efficiency against output quality (also illustrated in the sketch below).
Forest-based Parallel Decoding: DSBD extends tree-based parallel decoding to handle multiple trees (beams) simultaneously, accelerating the verification step (see the forest-mask sketch after this list).
Memory Cost Reduction: DSBD proposes a simple modification to mitigate the additional memory cost inherent in beam sampling, making it comparable to the memory usage of multinomial sampling and speculative decoding.
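The sketch below shows how such a draft-and-verify loop with an adaptive width could look. It is a minimal Python sketch under stated assumptions, not the paper's implementation: the toy models (`toy_dist`, `small_model`, `large_model`), the two-drafts-per-beam branching, the per-token accept rule min(1, p_large/p_small) borrowed from standard speculative decoding, and the score-gap width heuristic are all illustrative stand-ins.

```python
# Illustrative sketch of a draft-and-verify loop with adaptive beam width.
# The models, accept rule, and width heuristic are assumptions for exposition.
import numpy as np

VOCAB, GAMMA, MAX_BEAMS = 16, 4, 4   # toy vocab size, draft length, width cap
rng = np.random.default_rng(0)

def toy_dist(seq, temp):
    """Deterministic stand-in for a model's next-token distribution."""
    r = np.random.default_rng(abs(hash(tuple(seq))) % 2**32)
    logits = r.standard_normal(VOCAB) / temp
    p = np.exp(logits - logits.max())
    return p / p.sum()

small_model = lambda seq: toy_dist(seq, 1.5)   # cheap draft model
large_model = lambda seq: toy_dist(seq, 1.0)   # expensive target model

def dsbd_round(beams, width):
    """One draft/verify round. `beams` is a list of (tokens, log_score)."""
    candidates = []
    for seq, score in beams:
        for _ in range(2):                     # two draft beams per survivor
            # Draft: sample GAMMA tokens from the small model.
            draft = list(seq)
            for _ in range(GAMMA):
                draft.append(int(rng.choice(VOCAB, p=small_model(draft))))
            # Verify: accept a drafted token with prob min(1, p_large/p_small),
            # the standard speculative rule; stop at the first rejection.
            new, logp = list(seq), score
            for tok in draft[len(seq):]:
                p, q = large_model(new)[tok], small_model(new)[tok]
                if rng.random() >= min(1.0, p / q):
                    break
                new.append(tok)
                logp += float(np.log(p))
            candidates.append((new, logp))
    # Keep the best candidates; toy scores are not length-normalized.
    candidates.sort(key=lambda c: -c[1])
    # Hypothetical width rule: stay wide while the top scores are close.
    gap = candidates[0][1] - candidates[1][1]
    width = MAX_BEAMS if gap < 1.0 else max(1, width - 1)
    return candidates[:width], width

beams, width = [([0], 0.0)], 1
for _ in range(4):
    beams, width = dsbd_round(beams, width)
print("width:", width, "| best beam:", beams[0][0])
```

A real implementation would batch the verification of all drafts into a single large-model forward pass with KV caching and would length-normalize beam scores; both are omitted here for brevity.

The forest idea can be sketched in the same hedged spirit: assuming all draft nodes from all beams are flattened into one token sequence with parent pointers, a block-sparse attention mask that lets each node attend only to itself and its ancestors keeps the trees independent inside a single forward pass. The `forest_mask` helper and the node layout below are hypothetical, not the paper's code.

```python
# Sketch of an attention mask for verifying a forest of draft trees at once.
import numpy as np

def forest_mask(parents):
    """parents[i] = index of node i's parent, or -1 if node i is a tree root.
    Node i may attend to node j iff j == i or j is an ancestor of i, which
    keeps the draft trees (beams) independent within one forward pass."""
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:              # walk from node i up to its root
            mask[i, j] = True
            j = parents[j]
    return mask

# Two draft trees (two beams) flattened into a single token sequence:
#   tree A: 0 -> 1 -> 2, with a branch 0 -> 3;   tree B: 4 -> 5
parents = [-1, 0, 1, 0, -1, 4]
print(forest_mask(parents).astype(int))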
The experimental results demonstrate that DSBD achieves a 1.5-1.9x speed-up and 1.8-2.5x lower energy consumption compared to beam sampling, without sacrificing performance on downstream tasks. Moreover, DSBD can produce significantly higher-quality outputs than speculative decoding, while maintaining similar time, memory, and energy costs.
Stats
The paper reports the following key metrics:
Speed-up of 1.5-1.9x compared to beam sampling
1.8-2.5x lower energy consumption compared to beam sampling
Quotes
"By verifying γ tokens in parallel with one run of the large model, speculative decoding reduces the time cost compared to calling the large model γ times."
"Our experimental results demonstrate that our approach achieves a 1.5-1.9× speed-up and 1.8-2.5× smaller energy consumption compared to beam sampling, without sacrificing performance on downstream tasks."