PipeInfer is a novel technique that accelerates large language model inference by using asynchronous pipelined speculation, improving both latency and system utilization, especially in low-bandwidth and single-request scenarios.
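As a rough illustration of the core idea (not PipeInfer's actual multi-device pipeline), the toy below overlaps drafting of the next speculation with verification of the current one, discarding a speculation when verification invalidates it; `draft_step` and `verify_step` are hypothetical stand-ins for the draft and target models:

```python
# Sketch of asynchronous pipelined speculation: the next draft is launched
# before the current draft has been verified, so drafting and verification
# overlap. In this toy, verify_step always trims the draft, so every
# speculation goes stale and is redrafted (PipeInfer cancels stale work).
from concurrent.futures import ThreadPoolExecutor

def draft_step(ctx, k=4):
    # Stand-in for a small draft model proposing k tokens.
    return [f"tok{len(ctx) + i}" for i in range(k)]

def verify_step(ctx, draft):
    # Stand-in for the target model: accept a prefix of the draft.
    return ctx + draft[: max(1, len(draft) // 2)]

def generate(ctx, steps=4):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(draft_step, ctx)
        for _ in range(steps):
            draft = pending.result()
            speculative_ctx = ctx + draft               # assume full acceptance
            pending = pool.submit(draft_step, speculative_ctx)
            ctx = verify_step(ctx, draft)               # runs concurrently
            if ctx != speculative_ctx:                  # speculation went stale:
                pending = pool.submit(draft_step, ctx)  # ignore it and redraft
    return ctx

print(generate(["<bos>"]))
```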
SAM-Decoding is a novel retrieval-based speculative decoding method that leverages suffix automata to accelerate the inference speed of large language models (LLMs) without compromising output quality.
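A minimal sketch of the retrieval primitive: a suffix automaton built over previously seen tokens is used to find where the longest suffix of the current context occurred before, and the tokens that followed that occurrence become the draft (the position tracking and candidate scoring in SAM-Decoding itself are more elaborate):

```python
class SuffixAutomaton:
    """Online suffix automaton over the token stream (standard construction)."""
    def __init__(self):
        self.next = [{}]      # per-state transitions: token -> state
        self.link = [-1]      # suffix links
        self.length = [0]     # length of the longest string in each state
        self.endpos = [-1]    # one recorded end position per state
        self.last = 0

    def extend(self, tok, pos):
        cur = len(self.next)
        self.next.append({}); self.link.append(0)
        self.length.append(self.length[self.last] + 1); self.endpos.append(pos)
        p = self.last
        while p != -1 and tok not in self.next[p]:
            self.next[p][tok] = cur
            p = self.link[p]
        if p != -1:
            q = self.next[p][tok]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:  # split state q by cloning
                clone = len(self.next)
                self.next.append(dict(self.next[q])); self.link.append(self.link[q])
                self.length.append(self.length[p] + 1); self.endpos.append(self.endpos[q])
                while p != -1 and self.next[p].get(tok) == q:
                    self.next[p][tok] = clone
                    p = self.link[p]
                self.link[q] = self.link[cur] = clone
        self.last = cur

def retrieve(sam, tokens, context, k=4):
    # Match the longest suffix of `context` seen in `tokens`, then propose
    # the k tokens that followed that earlier occurrence.
    state = 0
    for tok in context:
        while state and tok not in sam.next[state]:
            state = sam.link[state]          # fall back to a shorter suffix
        state = sam.next[state].get(tok, 0)
    if state == 0:
        return []                            # nothing matched
    end = sam.endpos[state]
    return tokens[end + 1 : end + 1 + k]

tokens = list("abracadabra")
sam = SuffixAutomaton()
for i, t in enumerate(tokens):
    sam.extend(t, i)
print(retrieve(sam, tokens, list("xabra")))  # -> ['c', 'a', 'd', 'a']
```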
This paper introduces a novel method for accelerating multilingual LLM inference via speculative decoding with specialized drafter models, trained with a pretrain-and-finetune strategy on language-specific datasets, achieving significant speedups over existing methods.
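A toy of the routing side of the idea, assuming a trivial language heuristic, made-up drafter names, and stub draft/verify functions; the paper's actual contribution lies in how the drafters are trained, which this sketch does not cover:

```python
# Route a query to a language-specific drafter, then run a standard
# draft-then-verify loop with it.
def detect_lang(prompt):
    if any("\u3040" <= c <= "\u30ff" for c in prompt):   # kana -> Japanese
        return "ja"
    if " der " in prompt or " die " in prompt:           # crude German cue
        return "de"
    return "en"

DRAFTERS = {"ja": "drafter-ja-200m", "de": "drafter-de-200m", "en": "drafter-en-200m"}

def speculative_generate(prompt, draft_fn, verify_fn, k=5, steps=3):
    ctx = prompt.split()
    for _ in range(steps):
        draft = draft_fn(ctx, k)        # k tokens from the routed drafter
        ctx += verify_fn(ctx, draft)    # target-accepted prefix
    return " ".join(ctx)

lang = detect_lang("Wie der Wind heute weht")
print("routed to:", DRAFTERS[lang])
draft_fn = lambda ctx, k: [f"w{len(ctx) + i}" for i in range(k)]
verify_fn = lambda ctx, draft: draft[: len(draft) // 2 + 1]
print(speculative_generate("Wie der Wind heute weht", draft_fn, verify_fn))
```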
SSSD is a novel speculative decoding method that accelerates large language model inference, particularly in high-throughput scenarios, by retrieving candidate tokens on the CPU from both the prompt/self-output and a large text datastore, minimizing device overhead during verification.
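A rough sketch of the CPU-side retrieval under simplified assumptions: exact n-gram suffix matching against the prompt/self-output, with a tiny dictionary standing in for the large text datastore (`DATASTORE` and the fallback order are illustrative, not SSSD's actual data structures):

```python
# Candidate retrieval runs entirely on the CPU: match the longest suffix
# n-gram of the context against earlier context, then fall back to a
# static datastore keyed by the last two tokens.
def candidates_from_context(tokens, max_ngram=4, k=4):
    for n in range(max_ngram, 0, -1):               # longest match first
        suffix = tokens[-n:]
        for i in range(len(tokens) - n - 1, -1, -1):
            if tokens[i:i + n] == suffix:
                return tokens[i + n : i + n + k]    # what followed last time
    return []

DATASTORE = {("once", "upon"): ["a", "time", "there", "was"]}

def candidates(tokens, k=4):
    cand = candidates_from_context(tokens, k=k)
    if not cand:
        cand = DATASTORE.get(tuple(tokens[-2:]), [])[:k]
    return cand

print(candidates(["once", "upon"]))        # datastore hit
print(candidates("a b c a b".split()))     # context hit -> ['c', 'a', 'b']
```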
SuffixDecoding is a novel, model-free approach to speeding up LLM inference by using suffix trees built from previous outputs to efficiently predict and verify candidate token sequences, achieving performance competitive with model-based methods while avoiding their limitations.
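A sketch of the model-free lookup, using a plain suffix trie with frequency counts in place of SuffixDecoding's suffix trees and its principled speculation-tree expansion:

```python
# Index every suffix of previous outputs in a trie; at decode time, walk
# the trie with the longest matching suffix of the current context and
# greedily follow the most frequent continuation.
from collections import defaultdict

class Node:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = defaultdict(Node)
        self.count = 0

root = Node()

def index_output(tokens):
    for i in range(len(tokens)):            # insert every suffix
        node = root
        for tok in tokens[i:]:
            node = node.children[tok]
            node.count += 1

def speculate(context, k=4):
    for start in range(max(0, len(context) - 8), len(context)):
        node, ok = root, True
        for tok in context[start:]:         # longest suffix match first
            if tok not in node.children:
                ok = False; break
            node = node.children[tok]
        if ok:
            out = []
            while node.children and len(out) < k:
                tok, node = max(node.children.items(),
                                key=lambda kv: kv[1].count)
                out.append(tok)
            return out
    return []

index_output("the cat sat on the mat".split())
print(speculate("on the".split()))   # -> ['mat']
```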
FIRP is a new speculative decoding method that significantly speeds up large language model inference by predicting the intermediate representations (hidden states) of future tokens, enabling the generation of multiple tokens in a single forward pass.
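A toy of the central idea, assuming one trained linear projection per future position: the last token's intermediate hidden state is mapped to pseudo hidden states for upcoming positions, each decoded with the shared LM head (in FIRP the pseudo states also pass through the remaining transformer layers; dimensions and random weights here are purely illustrative):

```python
# Predict pseudo hidden states for the next k positions from the current
# hidden state, then decode each with the LM head -- k draft tokens from
# a single forward pass.
import numpy as np

d_model, vocab, k = 16, 100, 3
rng = np.random.default_rng(0)
W_future = rng.normal(size=(k, d_model, d_model)) / np.sqrt(d_model)  # trained in FIRP
W_head = rng.normal(size=(d_model, vocab)) / np.sqrt(d_model)         # shared LM head

def draft_tokens(h_last):
    # h_last: hidden state of the last real token at an intermediate layer.
    drafts = []
    for i in range(k):
        h_pseudo = h_last @ W_future[i]    # predicted future hidden state
        logits = h_pseudo @ W_head         # decode via the shared LM head
        drafts.append(int(logits.argmax()))
    return drafts

print(draft_tokens(rng.normal(size=d_model)))  # k draft token ids
```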
Dynamically selecting the most suitable smaller "draft" language model to guide a larger language model's text generation, based on the input query, can significantly improve inference speed without sacrificing output quality.
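A sketch of query-conditioned drafter selection, with hand-rolled features and acceptance-rate estimates standing in for whatever learned router a real system would use (all names and numbers below are made up for illustration):

```python
# Pick the drafter with the highest predicted token-acceptance rate for
# this query; the chosen drafter then runs the usual speculative loop.
DRAFTERS = ["draft-code-160m", "draft-chat-160m", "draft-math-160m"]

def features(query):
    return {
        "has_code": "def " in query or "{" in query,
        "has_math": any(c in query for c in "=+\u2211"),
    }

def predict_acceptance(drafter, feats):
    # Stand-in for a learned predictor (e.g., a classifier over query
    # features and per-drafter historical acceptance statistics).
    base = {"draft-code-160m": 0.4,
            "draft-chat-160m": 0.6,
            "draft-math-160m": 0.4}[drafter]
    if feats["has_code"] and "code" in drafter: base += 0.3
    if feats["has_math"] and "math" in drafter: base += 0.3
    return base

def select_drafter(query):
    feats = features(query)
    return max(DRAFTERS, key=lambda d: predict_acceptance(d, feats))

print(select_drafter("def quicksort(xs):"))   # -> draft-code-160m
```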
Ouroboros is a novel, training-free decoding framework that significantly accelerates large language model (LLM) inference via phrase-level draft generation and verification, using strategies such as phrase reuse to improve both drafting efficiency and draft length without compromising generation quality.
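A toy of phrase-level drafting with phrase reuse: verified text (and, in Ouroboros, even fragments of rejected drafts) feeds a phrase pool, and drafts are extended phrase-by-phrase rather than token-by-token. The first-token-keyed pool below is a simplification of the paper's candidate pool:

```python
# Harvest fixed-length phrases into a pool, then lengthen a draft cheaply
# by splicing in pooled phrases that start with the draft's last token.
from collections import defaultdict

phrase_pool = defaultdict(list)    # first token -> known phrases

def harvest(tokens, n=3):
    for i in range(len(tokens) - n + 1):
        phrase = tokens[i:i + n]
        if phrase not in phrase_pool[phrase[0]]:
            phrase_pool[phrase[0]].append(phrase)

def phrase_draft(next_token, max_len=6):
    draft = [next_token]               # the drafter's next token
    while len(draft) < max_len:
        options = phrase_pool.get(draft[-1])
        if not options:
            break
        draft += options[-1][1:]       # reuse the most recent matching phrase
    return draft[:max_len]

harvest("the quick brown fox jumps over".split())
print(phrase_draft("quick"))  # -> ['quick', 'brown', 'fox', 'jumps', 'over']
```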
EMS-SD is a novel method that significantly accelerates multi-sample speculative decoding in Large Language Models by eliminating the need for padding tokens, thereby reducing computational and memory overhead.
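A sketch of the padding-free packing at the heart of the idea: per-sample drafts of different lengths are flattened with per-sample offsets instead of being padded to the longest draft (a real implementation must also adjust attention masks, position ids, and the KV cache; names here are illustrative):

```python
# Pack variable-length per-sample drafts into one flat sequence, keeping
# offsets so each sample's logits can be recovered after the forward pass.
def pack_drafts(drafts):
    flat, offsets, pos = [], [], 0
    for d in drafts:
        offsets.append((pos, pos + len(d)))   # where this sample's tokens live
        flat.extend(d)
        pos += len(d)
    return flat, offsets

def unpack(flat_logits, offsets):
    return [flat_logits[a:b] for a, b in offsets]

drafts = [[11, 12, 13, 14], [21], [31, 32]]   # per-sample draft tokens
flat, offsets = pack_drafts(drafts)
print(flat)      # [11, 12, 13, 14, 21, 31, 32] -- 7 tokens vs 12 if padded
print(offsets)   # [(0, 4), (4, 5), (5, 7)]
```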