Core Concepts
This survey presents a comprehensive taxonomy and analysis of techniques to enhance the efficiency of inference for large language models (LLMs), addressing challenges posed by their substantial computational and memory requirements.
Abstract
This survey provides a thorough overview of the current landscape of efficient inference techniques for large language models (LLMs). It begins by analyzing the three primary causes of inefficient LLM inference: the large model size, the quadratic-complexity attention operation, and the auto-regressive decoding approach.
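The last two causes can be made concrete with a toy decoding loop (this example is illustrative, not from the survey): each newly generated token must attend to all previously cached tokens, so per-step attention cost grows linearly with position and the total cost over a sequence grows quadratically.

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention of one query over all cached positions.
    scores = q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

d = 8  # toy head dimension
rng = np.random.default_rng(0)
K_cache, V_cache = [], []
flops_per_step = []
for t in range(1, 6):  # auto-regressive decoding: one token at a time
    q = rng.standard_normal(d)
    K_cache.append(rng.standard_normal(d))
    V_cache.append(rng.standard_normal(d))
    K, V = np.stack(K_cache), np.stack(V_cache)
    _ = attention(q, K, V)
    # Two matmuls (scores, weighted sum) over t cached positions:
    flops_per_step.append(2 * t * d)

# Per-step work grows linearly in t, so total work over n steps is
# sum_t O(t * d) = O(n^2 * d): quadratic in sequence length.
```

Because each step depends on the previous token, the steps cannot be parallelized naively, which is exactly the bottleneck that the decoding-oriented techniques surveyed below target.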
The survey then introduces a comprehensive taxonomy that organizes the existing literature into three levels of optimization: data-level, model-level, and system-level.
At the data level, the survey discusses input compression techniques, such as prompt pruning, prompt summary, soft prompt-based compression, and retrieval-augmented generation, which aim to reduce the computational and memory cost of the prefilling stage. It also covers output organization methods, which leverage the emergent ability of LLMs to plan their output structure and enable parallel decoding to improve generation latency.
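The prompt pruning idea can be sketched in a few lines. Note this is a toy: real methods in this family (e.g., LLMLingua-style compressors) score tokens with a small language model's perplexity, whereas here a simple corpus-frequency heuristic stands in for the importance score.

```python
from collections import Counter

def prune_prompt(tokens, keep_ratio=0.5):
    """Toy prompt pruning: keep the rarest (heuristically most informative)
    tokens while preserving their original order. The frequency-based score
    is a stand-in for the LM-based importance scores used in practice."""
    freq = Counter(tokens)
    k = max(1, int(len(tokens) * keep_ratio))
    # Rank positions by token rarity; ties broken by position for determinism.
    ranked = sorted(range(len(tokens)), key=lambda i: (freq[tokens[i]], i))
    keep = sorted(ranked[:k])
    return [tokens[i] for i in keep]

prompt = "the cat sat on the mat because the mat was warm".split()
pruned = prune_prompt(prompt, keep_ratio=0.5)
# Common fillers ("the", repeated "mat") are dropped; content words survive.
```

Shortening the prompt this way directly shrinks the sequence length seen during prefilling, which (per the quadratic attention cost above) yields a super-linear saving.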
At the model level, the survey examines efficient structure design, including techniques to improve the efficiency of the Feed-Forward Network (FFN) and the attention operation, as well as alternative architectures to the Transformer. It also covers model compression approaches, such as knowledge distillation and quantization, which reduce model size and memory usage.
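As a minimal sketch of the quantization idea, the example below applies symmetric per-tensor INT8 quantization to a weight matrix. This is deliberately simplified: production methods the survey covers (e.g., GPTQ, AWQ) use per-channel or per-group scales and calibration data, but the storage saving and the bounded reconstruction error carry over.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: map floats to [-127, 127]
    with a single shared scale factor (a simplified sketch)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
# INT8 storage is 4x smaller than FP32; per-element rounding error
# is bounded by half the quantization step (scale / 2).
```

The 4x memory reduction matters because LLM decoding is typically memory-bandwidth-bound: smaller weights mean less data moved per token.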
Finally, the survey discusses system-level optimizations, including inference-engine and serving-system techniques that further enhance the efficiency of LLM deployment without modifying the model itself.
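One representative serving-system technique is continuous batching, where finished sequences leave the batch immediately and waiting requests join mid-flight rather than waiting for a whole static batch to complete. The toy scheduler below illustrates the idea; the structure and names are ours, not the survey's, and real schedulers (Orca- or vLLM-style) additionally manage KV-cache memory.

```python
from collections import deque

def continuous_batching(requests, max_batch=2):
    """Toy continuous-batching loop. Each request is (id, tokens_to_generate);
    every iteration performs one decode step for all running requests."""
    waiting = deque(requests)
    running, done, steps = [], [], 0
    while waiting or running:
        # Admit waiting requests whenever a batch slot frees up.
        while waiting and len(running) < max_batch:
            running.append(list(waiting.popleft()))
        steps += 1
        for r in running:
            r[1] -= 1  # one decode step per running request
        done.extend(r[0] for r in running if r[1] == 0)
        running = [r for r in running if r[1] > 0]
    return done, steps

done, steps = continuous_batching([("a", 3), ("b", 1), ("c", 2)], max_batch=2)
```

In this example the scheduler finishes all three requests in 3 decode steps, whereas static batching ({a, b} to completion, then c) would take 5: request c fills the slot "b" vacates instead of waiting for "a".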
The survey also includes comparative experiments on representative methods within critical sub-fields to provide quantitative insights and practical recommendations. It concludes by discussing future research directions in this rapidly evolving field.
Stats
No specific numerical data or statistics are reproduced in this summary; the survey's quantitative results are confined to its comparative experiments, and its emphasis is the overall taxonomy of efficient inference techniques for large language models.
Quotes
No direct quotes from the survey are reproduced here.