Sharing the key-value (KV) cache across layers in large language models (LLMs) can significantly improve inference efficiency; various techniques exist, but their effectiveness depends on factors such as KV cache size and prompt length.
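A minimal sketch of the idea, assuming a simple fixed layer grouping (hypothetical, not the scheme of any particular paper): layers in the same group reuse the K/V projections computed by the group's first layer, so the cache holds one entry per group rather than one per layer.

```python
# Cross-layer KV sharing sketch (illustrative grouping, toy weights).
import torch

n_layers, n_groups = 8, 4                                   # assumption: 2 layers share each KV entry
group_of = [i * n_groups // n_layers for i in range(n_layers)]  # layer index -> group id
kv_cache = {}                                                # group id -> (K, V), not one entry per layer

def attend(layer_idx, x, w_q, w_k, w_v):
    """Self-attention step where layers in the same group reuse cached K/V."""
    g = group_of[layer_idx]
    q = x @ w_q
    if g not in kv_cache:                                    # only the group's first layer projects K/V
        kv_cache[g] = (x @ w_k, x @ w_v)
    k, v = kv_cache[g]
    scores = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
    return scores @ v

# Toy usage: 8 layers, but only 4 K/V pairs end up cached.
d = 16
x = torch.randn(1, 5, d)
w = [torch.randn(d, d) for _ in range(3)]
for layer in range(n_layers):
    x = attend(layer, x, *w)
print(len(kv_cache), "KV entries for", n_layers, "layers")
```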
Tailoring the arithmetic precision of large language models (LLMs) to the requirements of each inference phase, and progressively lowering precision during decoding, significantly improves efficiency without sacrificing output quality, which is particularly beneficial on resource-constrained devices.
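A minimal sketch of phase-aware, progressively lower precision, assuming a hand-written (start_step, bits) schedule and uniform fake quantization; both the schedule and the quantizer are illustrative, not the cited method's actual design:

```python
# Progressive precision lowering sketch: later decode steps use fewer bits.
import numpy as np

def fake_quantize(w, bits):
    """Uniform symmetric fake quantization of a weight matrix to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

def precision_for_step(step, schedule=((0, 16), (32, 8), (128, 4))):
    """Pick the bit-width for a decode step from a (start_step, bits) schedule."""
    bits = schedule[0][1]
    for start, b in schedule:
        if step >= start:
            bits = b
    return bits

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
x = rng.standard_normal(64).astype(np.float32)

for step in (0, 40, 200):                       # early tokens get more bits
    bits = precision_for_step(step)
    w_q = w if bits >= 16 else fake_quantize(w, bits)
    y = w_q @ x                                 # decode-step matmul at reduced precision
    print(f"step {step:>3}: {bits}-bit weights, |y|={np.linalg.norm(y):.2f}")
```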
TREACLE, a reinforcement learning-based framework, dynamically selects the optimal large language model (LLM) and prompting scheme to answer questions while respecting user-defined cost and latency constraints.
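The learned RL policy itself is beyond a short sketch, but the decision it makes can be illustrated as selection over (model, prompting-scheme) arms filtered by the user's constraints; the arm names, prices, latencies, and accuracies below are made up for illustration and are not TREACLE's actual candidates or method:

```python
# Constraint-filtered model/prompt selection sketch (illustrative numbers only).
from dataclasses import dataclass

@dataclass
class Arm:
    model: str
    prompt_scheme: str
    cost_per_query: float      # dollars, hypothetical
    latency_s: float           # seconds, hypothetical
    est_accuracy: float        # estimated answer accuracy, hypothetical

ARMS = [
    Arm("small-llm", "direct",           0.0002, 0.4, 0.55),
    Arm("small-llm", "chain-of-thought", 0.0008, 1.1, 0.68),
    Arm("large-llm", "direct",           0.0040, 1.5, 0.78),
    Arm("large-llm", "chain-of-thought", 0.0120, 3.0, 0.86),
]

def choose(cost_budget, latency_budget):
    """Return the most accurate arm that respects both budgets, if any."""
    feasible = [a for a in ARMS
                if a.cost_per_query <= cost_budget and a.latency_s <= latency_budget]
    return max(feasible, key=lambda a: a.est_accuracy) if feasible else None

print(choose(cost_budget=0.005, latency_budget=2.0))  # -> large-llm / direct
print(choose(cost_budget=0.001, latency_budget=2.0))  # -> small-llm / chain-of-thought
```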
The authors present a comprehensive survey of efficient large language model (LLM) inference, introducing a framework based on the roofline model to analyze deployment bottlenecks and offering practical insights for implementing and optimizing LLM deployments.
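A roofline analysis reduces to comparing a kernel's arithmetic intensity (FLOPs per byte of memory traffic) with the hardware ridge point (peak FLOP/s divided by peak bandwidth); the sketch below uses assumed accelerator numbers, not figures taken from the survey:

```python
# Roofline-style bound check (illustrative hardware constants, roughly A100-class).
PEAK_FLOPS = 312e12          # assumed peak compute, FLOP/s (FP16)
PEAK_BW    = 2.0e12          # assumed memory bandwidth, bytes/s
RIDGE = PEAK_FLOPS / PEAK_BW  # FLOPs/byte where the bottleneck flips

def attainable_flops(intensity):
    """Roofline: performance is capped by min(compute roof, bandwidth * intensity)."""
    return min(PEAK_FLOPS, PEAK_BW * intensity)

# GEMV in single-batch decoding: ~2 FLOPs per weight, 2 bytes per FP16 weight loaded.
decode_intensity = 2 / 2            # ~1 FLOP/byte, far below the ridge -> memory-bound
# GEMM in prefill reuses each loaded weight across many tokens (assume ~512 here).
prefill_intensity = 2 * 512 / 2     # much higher intensity -> compute-bound

for name, ai in [("decode", decode_intensity), ("prefill", prefill_intensity)]:
    bound = "memory-bound" if ai < RIDGE else "compute-bound"
    print(f"{name}: intensity {ai:.0f} FLOP/B, attainable "
          f"{attainable_flops(ai) / 1e12:.1f} TFLOP/s ({bound})")
```

This kind of back-of-the-envelope comparison is what makes the decode phase look so different from prefill: the same weights move through memory either way, but prefill amortizes each load over many tokens.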