Core Concepts
Blockwise parallel decoding (BPD) can accelerate inference of autoregressive language models, but the independently generated draft tokens often exhibit unnatural repetition and inconsistency. This work analyzes the predictive dynamics of BPD and proposes algorithms to refine the draft tokens, improving the overall decoding efficiency.
Abstract
This paper explores ways to improve the inference speed of autoregressive language models through the use of blockwise parallel decoding (BPD). BPD is a technique that generates multiple tokens in parallel, rather than sequentially, to speed up the decoding process.
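The core of BPD is a predict-verify-accept loop: the heads propose a block of tokens, a single verification pass checks each drafted token against what the base model would have produced greedily, and the longest matching prefix is accepted. A minimal sketch of the acceptance rule (an illustration of the standard greedy-matching criterion from Stern et al., with plain token lists standing in for real model outputs):

```python
def accepted_length(draft_tokens, verify_tokens):
    """Longest prefix of the draft matching the base model's own greedy
    predictions, plus one: the first mismatching position is replaced
    by the model's prediction, so that step still yields a token."""
    matched = 0
    for d, v in zip(draft_tokens, verify_tokens):
        if d != v:
            break
        matched += 1
    return matched + 1

accepted_length([1, 2, 3], [1, 2, 4])  # accepts [1, 2] plus the fix-up token
```

Block efficiency is the average of this count per decoding step; the higher it is, the fewer sequential model calls are needed.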
The authors first analyze the properties of BPD drafts and make several key observations:
BPD drafts often contain significant consecutive token repetition, due to the independent prediction of each head.
The confidence of the BPD heads decreases as the position in the draft increases: heads predicting tokens farther into the future are systematically less confident than earlier heads.
There is significant headroom for improvement in the "oracle" block efficiency, which represents the theoretical maximum efficiency if the optimal draft could be selected.
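The repetition observation above can be quantified with a simple per-draft metric. A minimal sketch (the paper's exact measurement protocol may differ; this just counts adjacent duplicate tokens):

```python
def consecutive_repetition_rate(draft):
    """Fraction of adjacent positions in a draft where the same token
    repeats, e.g. [7, 7, 9, 9] -> 2/3."""
    if len(draft) < 2:
        return 0.0
    repeats = sum(1 for a, b in zip(draft, draft[1:]) if a == b)
    return repeats / (len(draft) - 1)
```

Averaging this over many decoding steps gives a task-level repetition statistic like the 20-75% range reported below.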
Based on these observations, the authors propose two algorithms to refine the BPD drafts and improve the overall decoding efficiency:
Local rescoring via neural models: A small neural language model is used to rescore the top-k predictions from each BPD head, favoring more coherent sequences.
Parallel BPD via global n-gram rescoring: An n-gram language model is used to efficiently rescore all draft candidates formed from the top-k predictions of each head, and the top-p most probable drafts are verified in parallel.
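The second algorithm can be sketched as follows. Note the hedges: the paper rescores the top-k lattice efficiently, whereas this illustration brute-forces the Cartesian product of per-head candidates (exponential in block length), and the `bigram_logp` scoring function and toy probability table are hypothetical stand-ins for a real n-gram LM:

```python
import heapq
import itertools

def rescore_drafts(topk_per_head, bigram_logp, prev_token, p):
    """Score every draft formed by choosing one top-k token per head
    with a bigram LM, and return the p highest-scoring drafts."""
    def score(draft):
        total, prev = 0.0, prev_token
        for tok in draft:
            total += bigram_logp(prev, tok)
            prev = tok
        return total
    return heapq.nlargest(p, itertools.product(*topk_per_head), key=score)

# Toy bigram log-probability table standing in for a real n-gram LM.
table = {(0, 1): -1.0, (0, 2): -2.0, (1, 3): -1.0,
         (1, 4): -3.0, (2, 3): -1.0, (2, 4): -1.0}
best = rescore_drafts([[1, 2], [3, 4]], lambda a, b: table[(a, b)], 0, 2)
```

Each of the returned drafts can then be placed in its own row of a batched verification pass, so the extra candidates cost parallel compute rather than sequential model calls.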
The authors evaluate these approaches across a range of tasks, including language modeling, question answering, and text summarization. They find that the proposed rescoring methods can significantly improve the block efficiency, with the neural rescoring performing best on tasks with low initial efficiency, and the n-gram rescoring providing consistent gains across all tasks.
The key insights from this work are:
Understanding the predictive dynamics of BPD drafts and their limitations
Leveraging small language models to refine BPD drafts and improve decoding efficiency
Demonstrating the potential for significant improvements in inference speed without compromising output quality
Stats
BPD drafts exhibit 20-75% consecutive token repetition across tasks.
The confidence of BPD heads decreases as the position in the draft increases.
Oracle top-k block efficiency shows significant headroom for improvement over standard BPD.
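The oracle quantity in the last line can be made concrete: it is the block efficiency one would get by always selecting, from the top-k lattice, the single draft the verifier accepts the most of. A brute-force sketch (illustrative only, and exponential in block length; `verify_step`, which maps a draft to the base model's greedy tokens for it, is a hypothetical interface standing in for a real verification pass):

```python
import itertools

def oracle_accepted_length(topk_per_head, verify_step):
    """Best accepted length over all drafts in the top-k lattice."""
    best = 0
    for draft in itertools.product(*topk_per_head):
        verified = verify_step(draft)
        matched = 0
        for d, v in zip(draft, verified):
            if d != v:
                break
            matched += 1
        best = max(best, matched + 1)
    return best
```

The gap between this oracle and the accepted length of the single greedy draft is the headroom that the rescoring algorithms above aim to close.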
Quotes
"Blockwise parallel decoding (BPD) was proposed by Stern et al. (2018) as a way to improve inference speed of language models."
"We first offer an analysis of the token distributions produced by the BPD prediction heads. Secondly, we use this analysis to inform algorithms to improve BPD inference speed by refining the BPD drafts using small n-gram or neural language models."