Kangaroo is a self-speculative decoding framework that uses a fixed shallow sub-network of the large language model itself as the draft model, and adds an early-exit mechanism that stops drafting on hard tokens to reduce drafting latency, achieving significant end-to-end speedups in large language model inference.
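The draft-then-verify loop with a confidence-based early exit can be illustrated with a toy sketch; the `full_model`, `draft_model`, threshold, and token space below are stand-ins invented for illustration, not Kangaroo's actual networks or hyperparameters:

```python
import random

random.seed(0)

def full_model(prefix):
    # Toy stand-in for the full LLM: deterministic greedy next token.
    return (sum(prefix) * 7 + len(prefix)) % 10

def draft_model(prefix):
    # Toy stand-in for the shallow self-draft model: usually agrees
    # with the full model, and reports a confidence score.
    tok = full_model(prefix)
    conf = random.random()
    if conf < 0.2:               # occasionally drafts a wrong token
        tok = (tok + 1) % 10
    return tok, conf

def speculative_step(prefix, max_draft=4, exit_threshold=0.4):
    # Draft phase: propose tokens until confidence drops (early exit).
    draft, ctx = [], list(prefix)
    for _ in range(max_draft):
        tok, conf = draft_model(ctx)
        if conf < exit_threshold:  # early exit: stop drafting here
            break
        draft.append(tok)
        ctx.append(tok)
    # Verify phase: the full model checks drafted tokens in order.
    accepted, ctx = [], list(prefix)
    for tok in draft:
        if full_model(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break                  # first mismatch invalidates the rest
    # Always emit at least one token from the full model itself.
    accepted.append(full_model(ctx))
    return accepted

prefix, out = [1, 2, 3], []
while len(out) < 12:
    out.extend(speculative_step(prefix + out))
print(out[:12])
```

The key invariant, which holds regardless of draft quality, is that the output is identical to the full model's own greedy decoding; drafting only changes how many tokens each verification pass can accept at once.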
Speculative decoding with draft architectures that condition on both the target model's context vectors and the previously sampled tokens can predict high-quality n-grams, enabling 2-3x acceleration of highly optimized large language model inference in production settings.
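The idea of a draft head that conditions on both inputs can be sketched as follows; the weights, dimensions, and linear head here are hypothetical stand-ins (a real speculator would be a trained network attached to the target model):

```python
import random

random.seed(0)
D, VOCAB = 4, 8

# Hypothetical draft-head weights: map [context vector ; token embedding]
# to next-token logits. Randomly initialized purely for illustration.
W = [[random.uniform(-1, 1) for _ in range(VOCAB)] for _ in range(2 * D)]
E = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(VOCAB)]

def draft_head(context_vec, last_token):
    # Concatenate the context vector with the sampled token's embedding,
    # then apply a linear map to get logits over the next token.
    x = context_vec + E[last_token]
    return [sum(x[i] * W[i][j] for i in range(2 * D)) for j in range(VOCAB)]

context = [random.uniform(-1, 1) for _ in range(D)]  # frozen context vector
tok, gram = 2, []
for _ in range(3):  # draft a 3-gram, feeding each chosen token back in
    logits = draft_head(context, tok)
    tok = max(range(VOCAB), key=lambda j: logits[j])
    gram.append(tok)
print(gram)
```

Conditioning on the sampled token (not just the context vector) is what lets the head draft several tokens autoregressively before the target model verifies the whole n-gram in one pass.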