Oliaro, G., Jia, Z., Campos, D., & Qiao, A. (2024). SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference. arXiv preprint arXiv:2411.04975.
This paper introduces SuffixDecoding, a model-free speculative decoding method for accelerating Large Language Model (LLM) inference, and evaluates its performance against existing model-based approaches across various tasks.
SuffixDecoding constructs and dynamically updates suffix trees from previous LLM outputs and the current prompt. It uses these trees to predict candidate token sequences based on pattern matching and frequency statistics. The method employs a greedy algorithm to build speculation trees, which are then verified by the LLM in parallel. The researchers evaluated SuffixDecoding on four instruction datasets: WildChat, Magicoder, SpiderSQL, and a proprietary text-to-SQL application called AgenticSQL. They compared its performance to standard decoding and SpecInfer, a state-of-the-art model-based speculative decoding method.
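To make the mechanism concrete, here is a minimal illustrative Python sketch of the match-and-speculate idea. It is not the paper's implementation: the authors build true suffix trees and expand a full speculation tree scored by frequency statistics, whereas this sketch uses a simple suffix trie (quadratic construction) and follows a single greedy path. All names (SuffixTrie, speculate, and so on) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    count: int = 0                                # how often this path was observed
    children: dict = field(default_factory=dict)  # token -> Node


class SuffixTrie:
    """Trie over all (depth-capped) suffixes of previously seen token
    sequences, with per-node frequency counts."""

    def __init__(self, max_depth: int = 32):
        self.root = Node()
        self.max_depth = max_depth

    def insert(self, tokens: list) -> None:
        """Index every suffix of a finished output (or prompt) so that
        any recent context can be matched during speculation."""
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:start + self.max_depth]:
                node = node.children.setdefault(tok, Node())
                node.count += 1

    def _walk(self, tokens: list):
        """Follow a token path from the root; None if it does not exist."""
        node = self.root
        for tok in tokens:
            node = node.children.get(tok)
            if node is None:
                return None
        return node

    def speculate(self, context: list, max_tokens: int = 8) -> list:
        """Match the longest suffix of the current context, then greedily
        follow the most frequent continuation at each step."""
        lo = max(0, len(context) - self.max_depth)
        for start in range(lo, len(context) + 1):  # longest suffix first
            node = self._walk(context[start:])
            if node is not None and node.children:
                break
        else:
            return []                              # nothing to speculate
        out = []
        while node.children and len(out) < max_tokens:
            tok, node = max(node.children.items(), key=lambda kv: kv[1].count)
            out.append(tok)
        return out


trie = SuffixTrie()
trie.insert("SELECT name FROM users WHERE id = 1 ;".split())
trie.insert("SELECT name FROM orders WHERE id = 2 ;".split())
print(trie.speculate("FROM users".split()))  # ['WHERE', 'id', '=', '1', ';']
```

The candidate tokens returned by speculate would then be verified by the target LLM in a single parallel forward pass, as the paper describes. The same insert call also illustrates the online adaptation discussed below: each newly verified output can be folded back into the tree, so the frequency statistics track the current input distribution.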
SuffixDecoding achieves competitive speedups compared to model-based speculative decoding methods, particularly excelling in structured output tasks like SQL generation. It matches or exceeds tree-based speculative decoding on open-ended chat and code generation, even when its suffix trees are built from far smaller reference corpora. The method also adapts well to input distribution shifts, incorporating new data into its suffix trees to improve performance online.
SuffixDecoding offers a practical and efficient alternative to model-based speculative decoding for accelerating LLM inference. Its model-free nature simplifies deployment and eliminates the need for draft model training or specialized decoding heads. The method's ability to leverage large-scale reference corpora and adapt to evolving input distributions makes it suitable for diverse LLM applications.
This research contributes a novel approach to LLM inference acceleration, addressing the limitations of existing model-based methods. SuffixDecoding's efficiency and adaptability hold significant implications for improving the performance and scalability of LLM-based applications, particularly in resource-constrained environments.
The paper notes that SuffixDecoding's speculation tree scoring mechanism could be refined to improve candidate selection. Future work could incorporate additional text sources into the reference corpus and investigate how suffix tree size affects performance across LLM architectures and tasks.