A Comprehensive Survey of Speculative Decoding for Efficient Large Language Model Inference
Speculative decoding improves the efficiency of Large Language Model (LLM) inference by using a small, fast draft model to propose token sequences and the large target model to verify them in parallel, so that several tokens can be accepted per expensive target-model forward pass. Challenges remain in real-world deployment, however, particularly in optimizing throughput, supporting long-context generation, combining with model parallelism, coping with hardware limitations, and generalizing across tasks.
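To make the draft-and-verify loop concrete, below is a minimal Python sketch of standard speculative sampling, using the rejection rule of Leviathan et al. (2023) and Chen et al. (2023). The models are toy stand-ins: `draft_probs`, `target_probs`, `speculative_step`, the vocabulary size, and the draft length `k` are hypothetical names for illustration, not drawn from any particular implementation.

```python
import numpy as np

VOCAB = 50  # toy vocabulary size; real systems use the LLM tokenizer's vocabulary

def _toy_dist(context, temperature):
    # Deterministic toy next-token distribution keyed on the context; in
    # practice this would be a forward pass of an actual language model.
    gen = np.random.default_rng(hash(tuple(context)) % (2**32))
    logits = gen.standard_normal(VOCAB) / temperature
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def draft_probs(context):
    return _toy_dist(context, temperature=1.5)  # flatter: a weaker "small" model

def target_probs(context):
    return _toy_dist(context, temperature=1.0)  # the "large" model we must match

def speculative_step(context, k, rng):
    """One round: draft k tokens cheaply, then verify them against the target."""
    # 1. Draft k tokens autoregressively with the small model.
    drafted, draft_dists = [], []
    ctx = list(context)
    for _ in range(k):
        q = draft_probs(ctx)
        tok = int(rng.choice(VOCAB, p=q))
        drafted.append(tok)
        draft_dists.append(q)
        ctx.append(tok)

    # 2. Verify. In a real system the target model scores all k positions
    #    in one parallel forward pass instead of k sequential ones.
    out = list(context)
    for tok, q in zip(drafted, draft_dists):
        p = target_probs(out)
        # Accept with probability min(1, p(x)/q(x)); this rule makes the
        # overall output distribution identical to sampling from the target.
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
        else:
            # On rejection, resample from the residual max(0, p - q),
            # renormalized, and stop this round.
            residual = np.maximum(p - q, 0.0)
            out.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return out
    # All k drafts accepted: sample one bonus token from the target for free.
    out.append(int(rng.choice(VOCAB, p=target_probs(out))))
    return out

# Usage: each loop iteration costs one target "call" but may emit up to k+1 tokens.
rng = np.random.default_rng(0)
seq = [1, 2, 3]  # prompt token ids
while len(seq) < 24:
    seq = speculative_step(seq, k=4, rng=rng)
print(seq)
```

Real systems layer many refinements on top of this basic loop (e.g., tree-structured drafts and batched verification), which is where the deployment challenges listed above arise.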