Activation-aware Weight Quantization (AWQ) is a hardware-friendly approach to low-bit, weight-only quantization of large language models (LLMs). By identifying the small fraction of salient weight channels from the activation distribution and protecting them with per-channel scaling, it substantially reduces quantization error without relying on backpropagation or reconstruction.
Quantization of this kind is crucial for efficient deployment: it shrinks the memory footprint of LLMs while preserving accuracy, making them practical to run across diverse hardware platforms.
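
To make the mechanism concrete, below is a minimal PyTorch sketch of the core idea, not the official implementation: input channels that see large activations are scaled up before rounding, which shrinks their relative quantization error, and the inverse scale is folded back so the layer computes (almost) the same function. The function names, the fixed `alpha`, the group size, and the tensor shapes are illustrative assumptions; the actual method searches the scaling exponent per layer on calibration data.

```python
import torch

def pseudo_quantize(w: torch.Tensor, n_bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Asymmetric min-max quantize-dequantize, applied per group of weights."""
    orig_shape = w.shape
    w = w.reshape(-1, group_size)  # assumes in_features % group_size == 0
    w_min = w.amin(dim=1, keepdim=True)
    w_max = w.amax(dim=1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-5) / (2 ** n_bits - 1)
    zero = (-w_min / scale).round()
    q = torch.clamp((w / scale).round() + zero, 0, 2 ** n_bits - 1)
    return ((q - zero) * scale).reshape(orig_shape)

def awq_scale_and_quantize(w: torch.Tensor, act_mean: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Scale salient input channels up before quantization, then fold the
    inverse scale back so the layer's output is (nearly) unchanged."""
    s = act_mean.clamp(min=1e-5).pow(alpha)  # per-input-channel scale
    return pseudo_quantize(w * s) / s        # broadcasts over rows of (out, in)

# Usage sketch with hypothetical shapes: derive per-channel salience
# from calibration activations, then quantize a linear layer's weight.
x = torch.randn(512, 4096)            # calibration activations
w = torch.randn(11008, 4096)          # linear-layer weight, (out, in)
act_mean = x.abs().mean(dim=0)        # average magnitude per input channel
w_q = awq_scale_and_quantize(w, act_mean, alpha=0.5)
mse = (x @ w.T - x @ w_q.T).pow(2).mean()
```

Because the whole procedure is a scaling pass followed by rounding, it needs no gradients and no layer-by-layer reconstruction, which is what keeps the approach cheap and hardware-friendly.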