Core Concepts
MatryoshkaKV is a technique for compressing the Key-Value (KV) cache in Large Language Models (LLMs) using trainable orthogonal projections. It outperforms traditional PCA-based projections, substantially reducing the cache's memory footprint while preserving model accuracy.
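A minimal sketch of the core idea, compressing cached keys through a low-rank trainable orthogonal projection, assuming PyTorch; the dimensions, rank, and use of the `orthogonal` parametrization are illustrative assumptions, not the paper's exact training recipe.

```python
import torch
from torch.nn.utils.parametrizations import orthogonal

# Illustrative sizes: keep 48 of 128 head dimensions (48/128 = 37.5%).
head_dim, rank = 128, 48

# Trainable orthogonal projection: the parametrization keeps the rows of
# the weight orthonormal throughout training, so compression stays a
# proper orthogonal projection even as gradients update the parameters.
proj = orthogonal(torch.nn.Linear(head_dim, rank, bias=False))

keys = torch.randn(1, 8, 256, head_dim)  # (batch, heads, seq_len, head_dim)

compressed = proj(keys)              # stored in the cache at (..., rank)
restored = compressed @ proj.weight  # back-projected for attention
```

The same projection-and-restoration step would apply to the value cache, with its own projection matrix.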
Stats
MatryoshkaKV retains 93.10% of LLaMA2-7B-base’s average accuracy and 92.63% of Mistral-v0.3-7B-base’s average accuracy, while utilizing only 37.5% of the original cache size.
The PCA projection baseline shows a sharp performance drop once the cache budget falls below 62.5%, retaining just 70.42% of LLaMA2-7B-base's average accuracy and 52.86% of Mistral-v0.3-7B-base's.
At a 37.5% cache budget, using heterogeneous compression rates (e.g., different budgets for the key and value caches) improves average accuracy by 1.92% over uniform compression.
For a 37.5% overall KV cache budget, the optimized allocation assigns 32.28% to the key cache and 42.72% to the value cache; the two average to 37.5%, as the sketch below verifies.
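A quick arithmetic check, assuming the key and value caches are equal in size so that the overall budget is the mean of the two per-cache budgets; the variable names are ours.

```python
# Allocation figures as reported above; keys and values occupy equal
# space at full size, so the overall budget is their mean.
key_budget, value_budget = 0.3228, 0.4272

overall = (key_budget + value_budget) / 2
print(f"overall cache budget: {overall:.2%}")  # -> 37.50%
```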