Key Concepts
Sharing dissimilar key-value (KV) caches across layers in large language models (LLMs) during inference can significantly reduce memory consumption without substantial performance loss, challenging the traditional assumption that sharing similar representations is optimal.
Yang, Y., Cao, Z., Chen, Q., Qin, L., Yang, D., Zhao, H., & Chen, Z. (2024). KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing. arXiv preprint arXiv:2410.18517.
This paper introduces KVSharer, a method for compressing the KV cache of LLMs during inference to reduce memory consumption without significantly degrading performance.
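The core idea lends itself to a short sketch. The snippet below is a minimal illustration under my own assumptions, not the authors' implementation: it ranks layer pairs by Euclidean distance between averaged per-layer KV caches from a calibration pass, then greedily lets one layer reuse another's cache, starting with the most dissimilar pairs and accepting each pair only if a (here stubbed-out) output-similarity check still passes. The names `plan_kv_sharing` and `keeps_quality`, and the toy data, are hypothetical.

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two flattened KV-cache tensors."""
    return float(np.linalg.norm(a.ravel() - b.ravel()))

def plan_kv_sharing(layer_caches, num_to_share, keeps_quality):
    """Greedily pick layer pairs whose calibration-time KV caches are most
    DISSIMILAR and let one layer reuse the other's cache, keeping a pair only
    if `keeps_quality(plan)` reports that model outputs stay close enough.

    layer_caches : list of np.ndarray, one averaged KV cache per layer
    num_to_share : how many layers should reuse another layer's cache
    keeps_quality: callable(plan) -> bool, stand-in for the output-similarity
                   check on a calibration set
    Returns a dict {consumer_layer: provider_layer}.
    """
    n = len(layer_caches)
    # Rank all layer pairs from most to least dissimilar.
    pairs = sorted(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda p: euclidean(layer_caches[p[0]], layer_caches[p[1]]),
        reverse=True,
    )
    plan = {}
    for i, j in pairs:
        if len(plan) >= num_to_share:
            break
        # Skip if layer j already shares a cache, or either layer is a provider/consumer already.
        if j in plan or j in plan.values() or i in plan:
            continue
        candidate = {**plan, j: i}        # layer j would reuse layer i's cache
        if keeps_quality(candidate):      # accept only if quality is preserved
            plan = candidate
    return plan

# Toy usage with random stand-in "caches" and a permissive quality check.
rng = np.random.default_rng(0)
caches = [rng.normal(size=(4, 16)) for _ in range(8)]   # 8 layers, fake averaged KV stats
plan = plan_kv_sharing(caches, num_to_share=3, keeps_quality=lambda p: True)
print(plan)  # e.g. {6: 2, 7: 0, 5: 1}: these layers would reuse another layer's KV cache
```

In a real setting the quality check would compare the compressed model's hidden states or outputs on calibration text against the original model, and layers in the sharing plan would simply skip KV computation at inference time, which is where the memory savings come from.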