PipeInfer is a novel technique that accelerates large language model inference by using asynchronous pipelined speculation, improving both latency and system utilization, especially in low-bandwidth and single-request scenarios.
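As a rough illustration of the core idea (not PipeInfer's actual multi-device pipeline), the toy below overlaps drafting of the next speculation with verification of the current one, discarding a speculation when verification invalidates it; `draft_step` and `verify_step` are hypothetical stand-ins for the draft and target models:

```python
# Sketch of asynchronous pipelined speculation: the next draft is launched
# before the current draft has been verified, so drafting and verification
# overlap. In this toy, verify_step always trims the draft, so every
# speculation goes stale and is redrafted (PipeInfer cancels stale work).
from concurrent.futures import ThreadPoolExecutor

def draft_step(ctx, k=4):
    # Stand-in for a small draft model proposing k tokens.
    return [f"tok{len(ctx) + i}" for i in range(k)]

def verify_step(ctx, draft):
    # Stand-in for the target model: accept a prefix of the draft.
    return ctx + draft[: max(1, len(draft) // 2)]

def generate(ctx, steps=4):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(draft_step, ctx)
        for _ in range(steps):
            draft = pending.result()
            speculative_ctx = ctx + draft               # assume full acceptance
            pending = pool.submit(draft_step, speculative_ctx)
            ctx = verify_step(ctx, draft)               # runs concurrently
            if ctx != speculative_ctx:                  # speculation went stale:
                pending = pool.submit(draft_step, ctx)  # ignore it and redraft
    return ctx

print(generate(["<bos>"]))
```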
SAM-Decoding is a novel retrieval-based speculative decoding method that leverages suffix automata to accelerate the inference speed of large language models (LLMs) without compromising output quality.
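A minimal sketch of the retrieval primitive: a suffix automaton built over previously seen tokens is used to find where the longest suffix of the current context occurred before, and the tokens that followed that occurrence become the draft (the position tracking and candidate scoring in SAM-Decoding itself are more elaborate):

```python
class SuffixAutomaton:
    """Online suffix automaton over the token stream (standard construction)."""
    def __init__(self):
        self.next = [{}]      # per-state transitions: token -> state
        self.link = [-1]      # suffix links
        self.length = [0]     # length of the longest string in each state
        self.endpos = [-1]    # one recorded end position per state
        self.last = 0

    def extend(self, tok, pos):
        cur = len(self.next)
        self.next.append({}); self.link.append(0)
        self.length.append(self.length[self.last] + 1); self.endpos.append(pos)
        p = self.last
        while p != -1 and tok not in self.next[p]:
            self.next[p][tok] = cur
            p = self.link[p]
        if p != -1:
            q = self.next[p][tok]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:  # split state q by cloning
                clone = len(self.next)
                self.next.append(dict(self.next[q])); self.link.append(self.link[q])
                self.length.append(self.length[p] + 1); self.endpos.append(self.endpos[q])
                while p != -1 and self.next[p].get(tok) == q:
                    self.next[p][tok] = clone
                    p = self.link[p]
                self.link[q] = self.link[cur] = clone
        self.last = cur

def retrieve(sam, tokens, context, k=4):
    # Match the longest suffix of `context` seen in `tokens`, then propose
    # the k tokens that followed that earlier occurrence.
    state = 0
    for tok in context:
        while state and tok not in sam.next[state]:
            state = sam.link[state]          # fall back to a shorter suffix
        state = sam.next[state].get(tok, 0)
    if state == 0:
        return []                            # nothing matched
    end = sam.endpos[state]
    return tokens[end + 1 : end + 1 + k]

tokens = list("abracadabra")
sam = SuffixAutomaton()
for i, t in enumerate(tokens):
    sam.extend(t, i)
print(retrieve(sam, tokens, list("xabra")))  # -> ['c', 'a', 'd', 'a']
```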
This paper introduces a novel method for accelerating multilingual LLM inference via speculative decoding with specialized drafter models, trained with a pretrain-and-finetune strategy on language-specific datasets, achieving significant speedups over existing methods.
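A toy of the routing side of the idea, assuming a trivial language heuristic, made-up drafter names, and stub draft/verify functions; the paper's actual contribution lies in how the drafters are trained, which this sketch does not cover:

```python
# Route a query to a language-specific drafter, then run a standard
# draft-then-verify loop with it.
def detect_lang(prompt):
    if any("\u3040" <= c <= "\u30ff" for c in prompt):   # kana -> Japanese
        return "ja"
    if " der " in prompt or " die " in prompt:           # crude German cue
        return "de"
    return "en"

DRAFTERS = {"ja": "drafter-ja-200m", "de": "drafter-de-200m", "en": "drafter-en-200m"}

def speculative_generate(prompt, draft_fn, verify_fn, k=5, steps=3):
    ctx = prompt.split()
    for _ in range(steps):
        draft = draft_fn(ctx, k)        # k tokens from the routed drafter
        ctx += verify_fn(ctx, draft)    # target-accepted prefix
    return " ".join(ctx)

lang = detect_lang("Wie der Wind heute weht")
print("routed to:", DRAFTERS[lang])
draft_fn = lambda ctx, k: [f"w{len(ctx) + i}" for i in range(k)]
verify_fn = lambda ctx, draft: draft[: len(draft) // 2 + 1]
print(speculative_generate("Wie der Wind heute weht", draft_fn, verify_fn))
```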
SSSD is a novel speculative decoding method that accelerates large language model inference, particularly in high-throughput scenarios, by retrieving candidate tokens on the CPU from both the prompt/self-output and a large text datastore, minimizing device overhead during verification.
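A rough sketch of the CPU-side retrieval under simplified assumptions: exact n-gram suffix matching against the prompt/self-output, with a tiny dictionary standing in for the large text datastore (`DATASTORE` and the fallback order are illustrative, not SSSD's actual data structures):

```python
# Candidate retrieval runs entirely on the CPU: match the longest suffix
# n-gram of the context against earlier context, then fall back to a
# static datastore keyed by the last two tokens.
def candidates_from_context(tokens, max_ngram=4, k=4):
    for n in range(max_ngram, 0, -1):               # longest match first
        suffix = tokens[-n:]
        for i in range(len(tokens) - n - 1, -1, -1):
            if tokens[i:i + n] == suffix:
                return tokens[i + n : i + n + k]    # what followed last time
    return []

DATASTORE = {("once", "upon"): ["a", "time", "there", "was"]}

def candidates(tokens, k=4):
    cand = candidates_from_context(tokens, k=k)
    if not cand:
        cand = DATASTORE.get(tuple(tokens[-2:]), [])[:k]
    return cand

print(candidates(["once", "upon"]))        # datastore hit
print(candidates("a b c a b".split()))     # context hit -> ['c', 'a', 'b']
```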
SuffixDecoding is a novel, model-free approach to speeding up LLM inference by using suffix trees built from previous outputs to efficiently predict and verify candidate token sequences, achieving performance competitive with model-based methods while avoiding their limitations.
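A sketch of the model-free lookup, using a plain suffix trie with frequency counts in place of SuffixDecoding's suffix trees and its principled speculation-tree expansion:

```python
# Index every suffix of previous outputs in a trie; at decode time, walk
# the trie with the longest matching suffix of the current context and
# greedily follow the most frequent continuation.
from collections import defaultdict

class Node:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = defaultdict(Node)
        self.count = 0

root = Node()

def index_output(tokens):
    for i in range(len(tokens)):            # insert every suffix
        node = root
        for tok in tokens[i:]:
            node = node.children[tok]
            node.count += 1

def speculate(context, k=4):
    for start in range(max(0, len(context) - 8), len(context)):
        node, ok = root, True
        for tok in context[start:]:         # longest suffix match first
            if tok not in node.children:
                ok = False; break
            node = node.children[tok]
        if ok:
            out = []
            while node.children and len(out) < k:
                tok, node = max(node.children.items(),
                                key=lambda kv: kv[1].count)
                out.append(tok)
            return out
    return []

index_output("the cat sat on the mat".split())
print(speculate("on the".split()))   # -> ['mat']
```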
FIRP is a new speculative decoding method that significantly speeds up large language model inference by predicting the intermediate representations (hidden states) of future tokens, enabling the generation of multiple tokens in a single forward pass.
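A toy of the central idea, assuming one trained linear projection per future position: the last token's intermediate hidden state is mapped to pseudo hidden states for upcoming positions, each decoded with the shared LM head (in FIRP the pseudo states also pass through the remaining transformer layers; dimensions and random weights here are purely illustrative):

```python
# Predict pseudo hidden states for the next k positions from the current
# hidden state, then decode each with the LM head -- k draft tokens from
# a single forward pass.
import numpy as np

d_model, vocab, k = 16, 100, 3
rng = np.random.default_rng(0)
W_future = rng.normal(size=(k, d_model, d_model)) / np.sqrt(d_model)  # trained in FIRP
W_head = rng.normal(size=(d_model, vocab)) / np.sqrt(d_model)         # shared LM head

def draft_tokens(h_last):
    # h_last: hidden state of the last real token at an intermediate layer.
    drafts = []
    for i in range(k):
        h_pseudo = h_last @ W_future[i]    # predicted future hidden state
        logits = h_pseudo @ W_head         # decode via the shared LM head
        drafts.append(int(logits.argmax()))
    return drafts

print(draft_tokens(rng.normal(size=d_model)))  # k draft token ids
```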
Dynamically selecting the most suitable smaller "draft" language model to guide a larger language model's text generation, based on the input query, can significantly improve inference speed without sacrificing output quality.
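A sketch of query-conditioned drafter selection, with hand-rolled features and acceptance-rate estimates standing in for whatever learned router a real system would use (all names and numbers below are made up for illustration):

```python
# Pick the drafter with the highest predicted token-acceptance rate for
# this query; the chosen drafter then runs the usual speculative loop.
DRAFTERS = ["draft-code-160m", "draft-chat-160m", "draft-math-160m"]

def features(query):
    return {
        "has_code": "def " in query or "{" in query,
        "has_math": any(c in query for c in "=+\u2211"),
    }

def predict_acceptance(drafter, feats):
    # Stand-in for a learned predictor (e.g., a classifier over query
    # features and per-drafter historical acceptance statistics).
    base = {"draft-code-160m": 0.4,
            "draft-chat-160m": 0.6,
            "draft-math-160m": 0.4}[drafter]
    if feats["has_code"] and "code" in drafter: base += 0.3
    if feats["has_math"] and "math" in drafter: base += 0.3
    return base

def select_drafter(query):
    feats = features(query)
    return max(DRAFTERS, key=lambda d: predict_acceptance(d, feats))

print(select_drafter("def quicksort(xs):"))   # -> draft-code-160m
```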
Ouroboros is a novel, training-free decoding framework that significantly accelerates large language model (LLM) inference via phrase-level draft generation and verification, using strategies such as phrase reuse to improve both drafting efficiency and draft length without compromising generation quality.
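A toy of phrase-level drafting with phrase reuse: verified text (and, in Ouroboros, even fragments of rejected drafts) feeds a phrase pool, and drafts are extended phrase-by-phrase rather than token-by-token. The first-token-keyed pool below is a simplification of the paper's candidate pool:

```python
# Harvest fixed-length phrases into a pool, then lengthen a draft cheaply
# by splicing in pooled phrases that start with the draft's last token.
from collections import defaultdict

phrase_pool = defaultdict(list)    # first token -> known phrases

def harvest(tokens, n=3):
    for i in range(len(tokens) - n + 1):
        phrase = tokens[i:i + n]
        if phrase not in phrase_pool[phrase[0]]:
            phrase_pool[phrase[0]].append(phrase)

def phrase_draft(next_token, max_len=6):
    draft = [next_token]               # the drafter's next token
    while len(draft) < max_len:
        options = phrase_pool.get(draft[-1])
        if not options:
            break
        draft += options[-1][1:]       # reuse the most recent matching phrase
    return draft[:max_len]

harvest("the quick brown fox jumps over".split())
print(phrase_draft("quick"))  # -> ['quick', 'brown', 'fox', 'jumps', 'over']
```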
EMS-SD is a novel method that significantly accelerates multi-sample speculative decoding in Large Language Models by eliminating the need for padding tokens, thereby reducing computational and memory overhead.
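A sketch of the padding-free packing at the heart of the idea: per-sample drafts of different lengths are flattened with per-sample offsets instead of being padded to the longest draft (a real implementation must also adjust attention masks, position ids, and the KV cache; names here are illustrative):

```python
# Pack variable-length per-sample drafts into one flat sequence, keeping
# offsets so each sample's logits can be recovered after the forward pass.
def pack_drafts(drafts):
    flat, offsets, pos = [], [], 0
    for d in drafts:
        offsets.append((pos, pos + len(d)))   # where this sample's tokens live
        flat.extend(d)
        pos += len(d)
    return flat, offsets

def unpack(flat_logits, offsets):
    return [flat_logits[a:b] for a, b in offsets]

drafts = [[11, 12, 13, 14], [21], [31, 32]]   # per-sample draft tokens
flat, offsets = pack_drafts(drafts)
print(flat)      # [11, 12, 13, 14, 21, 31, 32] -- 7 tokens vs 12 if padded
print(offsets)   # [(0, 4), (4, 5), (5, 7)]
```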