A Comprehensive Survey of Speculative Decoding for Efficient Large Language Model Inference
Speculative decoding improves the efficiency of Large Language Model (LLM) inference by using a small, fast draft model to propose token sequences and the large target model to verify them in parallel, so that several tokens can be accepted per expensive target-model forward pass. Challenges remain in real-world deployment, however, particularly in optimizing throughput, supporting long-context generation, combining with model parallelism, coping with hardware limitations, and generalizing across tasks.
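To make the draft-and-verify loop concrete, below is a minimal Python sketch of standard speculative sampling, using the rejection rule of Leviathan et al. (2023) and Chen et al. (2023). The models are toy stand-ins: `draft_probs`, `target_probs`, `speculative_step`, the vocabulary size, and the draft length `k` are hypothetical names for illustration, not drawn from any particular implementation.

```python
import numpy as np

VOCAB = 50  # toy vocabulary size; real systems use the LLM tokenizer's vocabulary

def _toy_dist(context, temperature):
    # Deterministic toy next-token distribution keyed on the context; in
    # practice this would be a forward pass of an actual language model.
    gen = np.random.default_rng(hash(tuple(context)) % (2**32))
    logits = gen.standard_normal(VOCAB) / temperature
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def draft_probs(context):
    return _toy_dist(context, temperature=1.5)  # flatter: a weaker "small" model

def target_probs(context):
    return _toy_dist(context, temperature=1.0)  # the "large" model we must match

def speculative_step(context, k, rng):
    """One round: draft k tokens cheaply, then verify them against the target."""
    # 1. Draft k tokens autoregressively with the small model.
    drafted, draft_dists = [], []
    ctx = list(context)
    for _ in range(k):
        q = draft_probs(ctx)
        tok = int(rng.choice(VOCAB, p=q))
        drafted.append(tok)
        draft_dists.append(q)
        ctx.append(tok)

    # 2. Verify. In a real system the target model scores all k positions
    #    in one parallel forward pass instead of k sequential ones.
    out = list(context)
    for tok, q in zip(drafted, draft_dists):
        p = target_probs(out)
        # Accept with probability min(1, p(x)/q(x)); this rule makes the
        # overall output distribution identical to sampling from the target.
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
        else:
            # On rejection, resample from the residual max(0, p - q),
            # renormalized, and stop this round.
            residual = np.maximum(p - q, 0.0)
            out.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return out
    # All k drafts accepted: sample one bonus token from the target for free.
    out.append(int(rng.choice(VOCAB, p=target_probs(out))))
    return out

# Usage: each loop iteration costs one target "call" but may emit up to k+1 tokens.
rng = np.random.default_rng(0)
seq = [1, 2, 3]  # prompt token ids
while len(seq) < 24:
    seq = speculative_step(seq, k=4, rng=rng)
print(seq)
```

Real systems layer many refinements on top of this basic loop (e.g., tree-structured drafts and batched verification), which is where the deployment challenges listed above arise.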