VL-Cache is a new sparsity- and modality-aware KV cache compression method that speeds up inference in vision-language models (VLMs) by shrinking the KV cache while preserving accuracy.
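The summary above only names the idea, so here is a minimal sketch of what sparsity- and modality-aware cache pruning could look like: cached tokens are ranked by how much attention they receive, vision tokens are discounted relative to text tokens, and only the top-budget tokens are kept. The function name `prune_kv_cache`, the attention-sum scoring, and the `vision_weight` discount are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of sparsity- and modality-aware KV cache pruning,
# loosely inspired by the VL-Cache summary above (not the authors' code).
import numpy as np

def prune_kv_cache(keys, values, attn_weights, is_vision_token, total_budget,
                   vision_weight=0.8):
    """Keep the `total_budget` most-attended tokens, down-weighting vision
    tokens (assumption: vision tokens tend to be more redundant than text).

    keys, values:     (num_tokens, head_dim) cached projections
    attn_weights:     (num_queries, num_tokens) recent attention scores
    is_vision_token:  (num_tokens,) boolean mask for image-patch tokens
    """
    # Accumulated attention each cached token has received.
    token_scores = attn_weights.sum(axis=0)
    # Modality-aware re-weighting: vision tokens are discounted.
    token_scores = np.where(is_vision_token, token_scores * vision_weight, token_scores)
    keep = np.argsort(token_scores)[-total_budget:]
    keep.sort()  # preserve positional order
    return keys[keep], values[keep], keep

# Toy usage
rng = np.random.default_rng(0)
K = rng.normal(size=(128, 64))
V = rng.normal(size=(128, 64))
A = rng.random(size=(8, 128))
vis = np.arange(128) < 96            # pretend the first 96 tokens are image patches
K2, V2, idx = prune_kv_cache(K, V, A, vis, total_budget=32)
print(K2.shape, V2.shape)            # (32, 64) (32, 64)
```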
Improving LLM performance on long inputs demands substantial memory, much of it consumed by the KV cache. HeadKV, a head-level KV cache compression method, scores each attention head's contribution to retrieval and reasoning in contextual question answering and dynamically allocates the limited cache budget to the more important heads, significantly reducing memory use and improving efficiency without hurting accuracy.
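To make the head-level allocation concrete, the sketch below splits a total token budget across heads in proportion to a per-head importance score and then keeps each head's top-scoring tokens. The importance scores are taken as given, and the helper names (`allocate_head_budgets`, `compress_per_head`) and the attention-sum token scoring are assumptions for illustration, not HeadKV's actual scoring procedure.

```python
# Hypothetical sketch of head-level KV cache budgeting in the spirit of HeadKV.
import numpy as np

def allocate_head_budgets(head_importance, total_budget, min_tokens=4):
    """Distribute `total_budget` cached tokens across heads, proportional to
    importance, with a small floor so no head is starved."""
    importance = np.asarray(head_importance, dtype=float)
    shares = importance / importance.sum()
    budgets = np.maximum(min_tokens, np.floor(shares * total_budget).astype(int))
    return budgets

def compress_per_head(keys, values, attn, budgets):
    """keys/values: (heads, tokens, dim); attn: (heads, queries, tokens)."""
    kept = []
    for h, budget in enumerate(budgets):
        scores = attn[h].sum(axis=0)              # how much each token was attended
        keep = np.sort(np.argsort(scores)[-budget:])
        kept.append((keys[h, keep], values[h, keep]))
    return kept

# Toy usage: 4 heads, head 2 judged most important for retrieval/reasoning.
rng = np.random.default_rng(0)
K = rng.normal(size=(4, 256, 64)); V = rng.normal(size=(4, 256, 64))
A = rng.random(size=(4, 8, 256))
budgets = allocate_head_budgets([0.1, 0.2, 0.6, 0.1], total_budget=128)
print(budgets, [k.shape[0] for k, _ in compress_per_head(K, V, A, budgets)])
```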
Counterintuitively, sharing dissimilar KV caches across layers of a large language model (LLM) during inference reduces memory consumption more effectively than conventional intra-layer compression, with little performance loss, challenging the traditional assumption that only similar representations should be shared.
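The sketch below illustrates only the selection criterion implied by this finding: on a calibration prompt, layer pairs are ranked by how dissimilar their caches are (low cosine similarity), and the most dissimilar pairs are chosen to reuse a single cache. The pairing strategy and the `pick_sharing_pairs` helper are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of choosing which layers share a KV cache, preferring
# the *least* similar pairs, per the counterintuitive result summarized above.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def pick_sharing_pairs(layer_caches, num_shared_pairs):
    """layer_caches: list of (tokens, dim) arrays from a calibration prompt.
    Returns (donor, receiver) layer pairs, most dissimilar first."""
    flat = [c.ravel() for c in layer_caches]
    pairs = []
    for i in range(len(flat)):
        for j in range(i + 1, len(flat)):
            pairs.append((cosine(flat[i], flat[j]), i, j))
    pairs.sort()                      # ascending similarity = most dissimilar first
    chosen, used = [], set()
    for sim, i, j in pairs:
        if i not in used and j not in used:
            chosen.append((i, j))     # layer j would reuse layer i's cache
            used.update((i, j))
        if len(chosen) == num_shared_pairs:
            break
    return chosen

rng = np.random.default_rng(0)
caches = [rng.normal(size=(64, 128)) for _ in range(8)]   # toy caches for 8 layers
print(pick_sharing_pairs(caches, num_shared_pairs=2))
```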
This paper compresses the KV cache of large language models (LLMs) with residual vector quantization, a technique commonly employed in high-fidelity audio compression, achieving higher compression rates while matching the performance of existing scalar quantization methods.
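Residual vector quantization itself is standard: each stage quantizes the residual left over by the previous stage against its own codebook, so a vector is stored as a handful of small code indices. The sketch below shows those mechanics on toy key vectors; the random codebooks are a placeholder assumption (a real system would learn them, e.g. with k-means), and nothing here reproduces the paper's specific design.

```python
# Minimal residual vector quantization (RVQ) sketch on toy cached key vectors.
import numpy as np

def rvq_encode(x, codebooks):
    """x: (n, dim); codebooks: list of (codebook_size, dim) arrays.
    Returns per-stage code indices; storage cost is one small int per stage."""
    residual, codes = x.copy(), []
    for cb in codebooks:
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)                 # nearest code in this stage
        codes.append(idx)
        residual = residual - cb[idx]              # pass the residual to the next stage
    return codes

def rvq_decode(codes, codebooks):
    return sum(cb[idx] for idx, cb in zip(codes, codebooks))

rng = np.random.default_rng(0)
keys = rng.normal(size=(32, 64))                   # toy cached key vectors
codebooks = [rng.normal(size=(256, 64)) for _ in range(4)]   # 4 stages x 8 bits
codes = rvq_encode(keys, codebooks)
recon = rvq_decode(codes, codebooks)
print(np.mean((keys - recon) ** 2))                # reconstruction error
```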
MatryoshkaKV, a novel technique for compressing the Key-Value (KV) cache in Large Language Models (LLMs) by using trainable orthogonal projections, outperforms traditional PCA-based methods and achieves significant reductions in memory footprint while preserving model accuracy.
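As a rough illustration of the projection mechanics, the sketch below compresses cached vectors by rotating them with an orthogonal basis and keeping only the leading r coordinates, then restores them with the transpose. MatryoshkaKV trains these orthogonal projections; obtaining the basis from an SVD of calibration data, as done here, is a PCA-style stand-in and an assumption of this sketch.

```python
# Sketch of orthogonal-projection KV compression: store r coordinates per vector.
import numpy as np

def fit_orthogonal_basis(samples):
    """samples: (n, dim) calibration vectors -> (dim, dim) orthogonal matrix."""
    _, _, vt = np.linalg.svd(samples - samples.mean(0), full_matrices=False)
    return vt.T                                   # columns are orthonormal directions

def compress(x, basis, r):
    return x @ basis[:, :r]                       # (n, r) stored instead of (n, dim)

def restore(x_low, basis, r):
    return x_low @ basis[:, :r].T                 # approximate reconstruction

rng = np.random.default_rng(0)
mix = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 64))   # low-rank structure
calib = rng.normal(size=(1024, 64)) @ mix         # calibration activations
basis = fit_orthogonal_basis(calib)
keys = rng.normal(size=(16, 64)) @ mix            # toy cached keys
low = compress(keys, basis, r=16)                 # 4x smaller cache entries
print(low.shape, np.mean((keys - restore(low, basis, 16)) ** 2))
```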
LoRC is an efficient low-rank compression technique that shrinks the KV cache of pretrained LLMs, reducing memory usage while keeping performance degradation minimal.
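To show what low-rank KV compression can look like in practice, the sketch below factors a key projection weight with a truncated SVD, caches the small r-dimensional activations, and expands them back to full keys only at attention time. The rank choice, the on-the-fly expansion, and the helper `low_rank_factor` are illustrative assumptions rather than LoRC's actual procedure.

```python
# Sketch of low-rank KV compression: cache r-dim activations instead of full keys.
import numpy as np

def low_rank_factor(W, r):
    """W: (d_model, d_head) projection weight -> (W_a, W_b) with W ≈ W_a @ W_b."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :r] * s[:r], Vt[:r]               # (d_model, r), (r, d_head)

rng = np.random.default_rng(0)
W_k = rng.normal(size=(512, 64))                  # pretrained key projection (toy)
W_a, W_b = low_rank_factor(W_k, r=16)

x = rng.normal(size=(10, 512))                    # hidden states for 10 tokens
cache_low = x @ W_a                               # store (10, 16) instead of (10, 64)
keys_full = cache_low @ W_b                       # expand at attention time
print(cache_low.shape, np.mean((x @ W_k - keys_full) ** 2))
```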