GEAR proposes an efficient KV cache compression framework that achieves near-lossless, high-ratio compression, improving system throughput while reducing memory footprint.
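For intuition, here is a minimal numpy sketch of the quantize-then-correct idea that GEAR-style near-lossless compression builds on: a coarse quantized base plus a low-rank approximation of the quantization residual. This pairing is an illustrative reading, not code from the paper, and all function names below are hypothetical.

```python
import numpy as np

def quantize_uniform(x, bits=4):
    """Coarse per-tensor uniform quantization to 2**bits levels."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (2 ** bits - 1)
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.astype(np.float32) * scale + lo

def low_rank_residual(x, x_hat, rank=4):
    """Best rank-`rank` approximation (via SVD) of the quantization error."""
    u, s, vt = np.linalg.svd(x - x_hat, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]   # two thin factors to store

# Toy KV slab: (num_tokens, head_dim)
kv = np.random.randn(256, 64).astype(np.float32)

q, scale, lo = quantize_uniform(kv)
base = dequantize(q, scale, lo)
a, b = low_rank_residual(kv, base)
recon = base + a @ b    # quantized base + low-rank correction

print("mean abs error, base only:", np.abs(kv - base).mean())
print("mean abs error, corrected:", np.abs(kv - recon).mean())
```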
LoRC is an efficient low-rank compression technique that compresses the KV cache of pre-trained LLMs, reducing memory usage while minimizing performance degradation.
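A minimal sketch of the general low-rank family LoRC belongs to: fit a low-rank basis offline, then cache only the coefficients of KV states in that basis. This is the generic recipe rather than LoRC's specific scheme, and every name below is made up for illustration.

```python
import numpy as np

def fit_basis(calib_kv, rank):
    """Fit a rank-`rank` orthonormal basis from calibration KV states via SVD."""
    _, _, vt = np.linalg.svd(calib_kv, full_matrices=False)
    return vt[:rank].T                    # (head_dim, rank)

def compress(kv, basis):
    return kv @ basis                     # (tokens, rank): this is what gets cached

def decompress(coeffs, basis):
    return coeffs @ basis.T               # back to (tokens, head_dim)

calib = np.random.randn(1024, 64).astype(np.float32)
basis = fit_basis(calib, rank=16)         # 4x fewer numbers cached per token

kv = np.random.randn(256, 64).astype(np.float32)
cached = compress(kv, basis)
print("compression ratio:", kv.size / cached.size)
print("reconstruction shape:", decompress(cached, basis).shape)
```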
MatryoshkaKV compresses the key-value (KV) cache of large language models (LLMs) with trainable orthogonal projections, outperforming conventional PCA-based methods and substantially reducing memory footprint while preserving model accuracy.
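The "Matryoshka" name suggests the nested property sketched below: with an orthogonal projection, any prefix of its columns is itself a valid lower-rank projection, so a single matrix can serve several compression levels chosen at runtime. The matrix here is a random orthogonal stand-in (via QR), whereas MatryoshkaKV trains it; treat this purely as a mechanics demo.

```python
import numpy as np

rng = np.random.default_rng(0)
head_dim = 64

# Random orthogonal stand-in for the trained projection (QR factorization).
proj, _ = np.linalg.qr(rng.standard_normal((head_dim, head_dim)))

kv = rng.standard_normal((256, head_dim))

# Nested truncation: the leading columns form a usable projection at every rank.
for rank in (8, 16, 32, 64):
    p = proj[:, :rank]                   # keep only `rank` directions
    recon = (kv @ p) @ p.T               # cache (tokens, rank), then reconstruct
    err = np.linalg.norm(kv - recon) / np.linalg.norm(kv)
    print(f"rank={rank:2d}  relative error={err:.3f}")
```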
This paper introduces a method for compressing the key-value (KV) cache in large language models using residual vector quantization, a technique commonly employed in high-fidelity audio compression.
For KV cache compression in large language models (LLMs), residual vector quantization achieves higher compression ratios while maintaining performance comparable to conventional scalar quantization techniques.
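Since both summaries above hinge on residual vector quantization, here is a self-contained toy of the textbook RVQ loop: each stage quantizes the residual the previous stages left behind, and decoding sums the selected codewords. The codebooks are sampled from the data rather than trained with k-means, and all helper names are hypothetical.

```python
import numpy as np

def nearest(codebook, x):
    """Index of the closest codeword for each row of x."""
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def fit_codebooks(x, stages=3, size=32, rng=None):
    """Toy fit: each stage's codewords are sampled from the residuals that
    earlier stages left behind (a real system would run k-means per stage)."""
    rng = rng or np.random.default_rng(0)
    residual, books = x.copy(), []
    for _ in range(stages):
        cb = residual[rng.choice(len(residual), size, replace=False)].copy()
        books.append(cb)
        residual = residual - cb[nearest(cb, residual)]
    return books

def rvq_encode(x, books):
    residual, codes = x.copy(), []
    for cb in books:
        idx = nearest(cb, residual)
        codes.append(idx)
        residual -= cb[idx]              # hand the leftover error to the next stage
    return codes

def rvq_decode(codes, books):
    return sum(cb[idx] for cb, idx in zip(books, codes))

rng = np.random.default_rng(0)
kv = rng.standard_normal((256, 64)).astype(np.float32)

books = fit_codebooks(kv, stages=3, size=32, rng=rng)
recon = rvq_decode(rvq_encode(kv, books), books)
print("relative error:", np.linalg.norm(kv - recon) / np.linalg.norm(kv))
```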