Core Concepts
REST is a retrieval-based speculative decoding algorithm: instead of a draft model, it retrieves draft tokens from a datastore, significantly speeding up language model generation without requiring any additional training.
Abstract
The paper introduces Retrieval-Based Speculative Decoding (REST), a novel approach to accelerate the inference of large language models (LLMs).
Key highlights:
- REST replaces the parametric draft model used in previous speculative decoding methods with a non-parametric retrieval datastore, allowing it to be easily integrated with any LLM.
- During inference, REST first retrieves relevant continuation candidates from the datastore based on the current context, then constructs a Trie to select the most probable draft tokens.
- The draft tokens are verified by the LLM using a carefully designed attention mask, ensuring efficient computation on shared prefixes across different draft sequences.
- Experiments on the HumanEval and MT-Bench benchmarks show that REST achieves a 1.62x to 2.36x speedup over standard autoregressive decoding, outperforming draft-model speculative decoding without compromising the quality of the generated output.
- The effectiveness of REST is influenced by the size and quality of the retrieval datastore, opening up opportunities for further enhancements by using larger or more specialized datastores.
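The retrieve-then-Trie step in the highlights above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the paper's implementation: the datastore here is a plain token list indexed by fixed-length n-gram contexts, and the helper names (`build_index`, `retrieve_drafts`, `build_trie`, `greedy_draft`) are hypothetical.

```python
from collections import defaultdict

def build_index(corpus_tokens, ngram=2):
    """Map each n-gram context in the datastore to the positions right after it."""
    index = defaultdict(list)
    for i in range(len(corpus_tokens) - ngram):
        index[tuple(corpus_tokens[i:i + ngram])].append(i + ngram)
    return index

def retrieve_drafts(index, corpus_tokens, context, ngram=2, span=3):
    """Retrieve continuation candidates whose stored context matches the
    current context's suffix (a stand-in for REST's suffix matching)."""
    key = tuple(context[-ngram:])
    return [corpus_tokens[p:p + span] for p in index.get(key, [])]

def build_trie(candidates):
    """Merge candidate continuations into a Trie, counting hits per node."""
    trie = {"count": 0, "children": {}}
    for cand in candidates:
        node = trie
        for tok in cand:
            node = node["children"].setdefault(tok, {"count": 0, "children": {}})
            node["count"] += 1
    return trie

def greedy_draft(trie, max_len=3):
    """Follow the most frequent branch of the Trie to pick draft tokens."""
    draft, node = [], trie
    while node["children"] and len(draft) < max_len:
        tok, node = max(node["children"].items(), key=lambda kv: kv[1]["count"])
        draft.append(tok)
    return draft
```

In this toy setting, candidates that share a prefix reinforce the same Trie branch, so the highest-count path yields the most probable draft sequence to hand to the LLM for verification.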
Stats
REST achieves a 1.62x to 2.36x speedup on language model generation compared to standard autoregressive decoding, and also outperforms draft-model speculative decoding.
The average time for retrieval, including Trie construction, is less than 1 ms, which is negligible compared to the overall generation time.
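The attention-mask verification mentioned in the highlights can be illustrated with a small sketch. Assuming the Trie's draft tokens are flattened into one sequence with parent pointers, a boolean mask can restrict each token to attend only to itself and its ancestors, so a prefix shared by several draft sequences is computed only once; the `tree_attention_mask` helper and the parent-pointer encoding are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def tree_attention_mask(parents):
    """Build a tree-causal attention mask over flattened draft tokens.

    parents[i] is the index of token i's parent in the flattened Trie,
    or -1 if token i hangs directly off the current context.
    mask[i, j] is True iff token i may attend to token j.
    """
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, i] = True          # every token attends to itself
        j = parents[i]
        while j != -1:             # walk up to the root, marking ancestors
            mask[i, j] = True
            j = parents[j]
    return mask

# Example: token 0 is a root child, tokens 1 and 2 branch off token 0,
# and token 3 continues the branch through token 1.
mask = tree_attention_mask([-1, 0, 0, 1])
```

Here token 3 attends to tokens 0, 1, and itself but not to token 2, which lies on a sibling branch; in a real decoder this mask would be combined with ordinary causal attention over the prompt.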