Key Concepts
Speculative execution, a technique originally from computer architecture, can significantly boost the inference speed of large language models by drafting token sequences cheaply and verifying them in parallel.
Summary
This paper is a comprehensive survey of speculative execution in large language models (LLMs). It first reviews the autoregressive nature of LLM decoding and explains how speculative execution can mitigate the resulting latency bottleneck.
The key components of speculative execution in LLMs are then discussed in detail:
Drafting Stage:
- The drafter, which can be a small model, predictive heads, or a retriever, generates speculative token sequences.
- Termination criteria, such as static settings, adaptive thresholding, or heuristic rules, determine when to stop drafting.
- Draft management handles multiple draft outputs, e.g., using a trie tree or beam tree.
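The drafting stage described above can be sketched in a few lines. The drafter and the static termination criterion below are toy placeholders (a random token generator and a fixed draft length), not any specific method from the survey:

```python
import random

random.seed(0)

VOCAB = list(range(100))  # toy vocabulary of integer token ids

def toy_drafter(prefix):
    """Hypothetical small drafter: cheaply guesses one next token."""
    return random.choice(VOCAB)

def draft_tokens(prefix, max_draft_len=5):
    """Drafting stage sketch: append speculative tokens until a static
    termination criterion (a fixed draft length) is met."""
    draft = []
    while len(draft) < max_draft_len:  # static termination criterion
        draft.append(toy_drafter(prefix + draft))
    return draft

print(draft_tokens([1, 2, 3]))  # a list of 5 speculative token ids
```

An adaptive criterion would replace the fixed-length check with, e.g., a confidence threshold on the drafter's output.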
Verification Stage:
- The verifier, which can be a chain-based or tree-based structure, validates the speculative tokens in parallel.
- Acceptance criteria, like exact matching, rejection sampling, or typical acceptance, determine which tokens to keep.
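Of the acceptance criteria above, rejection sampling is the one that provably preserves the target model's output distribution: a drafted token x is accepted with probability min(1, p(x)/q(x)), where p is the verifier's distribution and q the drafter's. A minimal sketch with assumed toy distributions:

```python
import random

random.seed(0)

def accept_token(p_target, q_draft, token):
    """Rejection-sampling acceptance: keep the drafted token with
    probability min(1, p/q), so the combined scheme samples from the
    target distribution p."""
    p, q = p_target[token], q_draft[token]
    return random.random() < min(1.0, p / q)

# Toy two-token vocabulary (assumed values, for illustration only).
p_target = {0: 0.9, 1: 0.1}   # verifier (target model) probabilities
q_draft  = {0: 0.5, 1: 0.5}   # drafter probabilities

# For token 0, p/q = 1.8 >= 1, so acceptance is certain.
print(accept_token(p_target, q_draft, 0))  # → True
```

Exact matching is the stricter special case where the token is kept only if it equals the verifier's own greedy choice.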
The paper also covers evaluation metrics, existing implementations, and various applications of speculative execution in LLMs.
Finally, it highlights key challenges and future research directions, such as framework design, parameter search, system integration, and objective optimization.
Statistics
The reported speed-ups of representative speculative execution methods range from 1.3x to 3.5x compared to autoregressive decoding.
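To see where speed-ups of this magnitude come from, one can use the standard simplification from the speculative decoding literature: if each drafted token is accepted independently with probability alpha and the draft length is gamma, each draft-then-verify step yields a geometric-series number of tokens per target-model pass. This back-of-the-envelope estimate (ignoring drafting overhead) is a sketch, not a figure from the survey:

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens produced per draft-then-verify step, assuming each
    of the gamma drafted tokens is accepted i.i.d. with probability alpha
    (geometric series, including the token the verifier always emits)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# With an 80% acceptance rate and drafts of 4 tokens, each target-model
# pass yields ~3.4 tokens instead of 1, before drafting costs are counted.
print(round(expected_tokens_per_step(0.8, 4), 2))  # → 3.36
```

Subtracting the drafter's own cost brings such estimates into the 1.3x-3.5x range reported above.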
Quotes
"Speculative execution is a strategy that fully utilizes spare resources to execute some speculative tasks in advance that may or may not be useful for the upcoming tasks."
"Speculative execution in LLMs primarily follows a draft-then-verify paradigm, where the drafting stage constructs a sequence of tokens quickly, and the verification stage validates the sequence in parallel."
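The draft-then-verify paradigm in the quote can be illustrated end to end. The drafter and verifier below are hypothetical toy models (a random guesser and a deterministic rule), and exact matching is used as the acceptance criterion:

```python
import random

random.seed(0)

def drafter(prefix, k):
    """Hypothetical drafter: guesses k tokens (random toy model)."""
    return [random.randrange(10) for _ in range(k)]

def verifier_next(prefix):
    """Hypothetical target model: deterministic toy next-token rule."""
    return (prefix[-1] + 1) % 10

def draft_then_verify(prefix, n_tokens, k=4):
    """Draft-then-verify sketch: draft k tokens, check them against the
    target model (exact matching), keep the accepted prefix plus one
    token from the verifier, and repeat until n_tokens are produced."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        for t in drafter(out, k):
            if t == verifier_next(out):          # exact-match acceptance
                out.append(t)
            else:
                out.append(verifier_next(out))   # verifier's correction
                break
        else:
            out.append(verifier_next(out))       # bonus token, full accept
    return out[:len(prefix) + n_tokens]

print(draft_then_verify([0], 5))  # → [0, 1, 2, 3, 4, 5]
```

Every step appends at least one verifier-approved token, so the output always matches what autoregressive decoding with the target model would produce; the speed-up comes from validating the whole draft in one parallel pass.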