The key insights and highlights of the content are:
Large Language Models (LLMs) such as GPT-4, PaLM, and LLaMA dominate numerous NLP tasks, but their high online inference cost poses a significant obstacle to deployment.
Memory usage during LLM inference consists mainly of model weights, activations, and the KV Cache. The KV Cache accounts for a large share of this memory and becomes the main bottleneck, especially at long context lengths and large batch sizes (a rough sizing example follows this list).
The authors propose KCache, a novel technique that can be used directly at inference time without any additional training. KCache keeps the K Cache in high-bandwidth memory (HBM) and dynamically pulls only the necessary parts of the V Cache from CPU memory based on the attention scores (a minimal sketch of this mechanism follows this list).
Experiments show that KCache improves the throughput of popular LLMs by 40% over the baseline while maintaining accuracy, and the performance advantage grows with longer input contexts.
The authors provide a detailed analysis of KCache's performance and accuracy trade-offs, showing that it effectively balances memory usage and inference latency.
KCache is flexible and scalable, and can be applied to various transformer-based pre-trained models.
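To make the memory bottleneck concrete, here is a back-of-the-envelope sizing of the KV Cache for a hypothetical 7B-class model (32 layers, 32 KV heads, head dimension 128, fp16). The configuration and resulting numbers are illustrative assumptions, not figures from the paper.

```python
# Rough KV Cache sizing for a hypothetical 7B-class model.
# Assumed config: 32 layers, 32 KV heads, head dim 128, fp16 (2 bytes).

def kv_cache_bytes(batch, seq_len, layers=32, kv_heads=32, head_dim=128, dtype_bytes=2):
    # 2x for the K and V tensors; each is [batch, kv_heads, seq_len, head_dim] per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

gib = 1024 ** 3
for batch, seq_len in [(1, 4096), (8, 4096), (32, 8192)]:
    size = kv_cache_bytes(batch, seq_len)
    print(f"batch={batch:<3} seq_len={seq_len:<6} KV cache ~= {size / gib:.1f} GiB")
# batch=1,  seq_len=4096 -> ~2 GiB; batch=8 -> ~16 GiB; batch=32, seq_len=8192 -> ~128 GiB
```

Even at moderate batch sizes the KV Cache quickly rivals or exceeds the size of the model weights, which is why offloading part of it is attractive.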
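The following is a minimal single-head sketch of the KCache idea described above: attention scores are computed against the K Cache resident in HBM, and only the V rows with the largest weights are fetched from CPU memory. The function name `kcache_attention`, the `top_k` parameter, and the renormalization of the selected weights are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def kcache_attention(q, k_hbm, v_cpu, top_k=32):
    # 1) Full attention scores against the K Cache kept in HBM.
    scores = q @ k_hbm.T / np.sqrt(q.shape[-1])   # [1, seq_len]
    probs = softmax(scores)

    # 2) Select the positions with the largest attention weights.
    top_idx = np.argsort(probs[0])[-top_k:]

    # 3) Pull only those V rows from CPU memory (in a real system,
    #    this indexed gather is a host-to-device copy).
    v_selected = v_cpu[top_idx]                   # [top_k, head_dim]

    # 4) Renormalize the selected weights and compute the output.
    w = probs[0, top_idx]
    w = w / w.sum()
    return w @ v_selected                         # [head_dim]

# Toy usage: one query vector against a cache of 1024 past tokens.
rng = np.random.default_rng(0)
seq_len, head_dim = 1024, 128
q = rng.standard_normal((1, head_dim))
k_hbm = rng.standard_normal((seq_len, head_dim))  # stays in accelerator HBM
v_cpu = rng.standard_normal((seq_len, head_dim))  # offloaded to host memory
out = kcache_attention(q, k_hbm, v_cpu)
print(out.shape)  # (128,)
```

In an actual serving stack the gather on `v_cpu` would be an asynchronous CPU-to-GPU transfer, and keeping it limited to the highest-scoring positions is what keeps the added latency small.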