This research proposes a novel quantization method for tensor-parallel large language models (LLMs) that significantly reduces communication costs during inference by quantizing the communicated features according to their value ranges, while preserving accuracy.
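The paper's exact scheme is not reproduced here; the sketch below only illustrates the underlying idea, range-based quantization of the activation shard a tensor-parallel rank would send into an all-reduce. The function names (`quantize_for_allreduce`, `dequantize`), the bit-width, and the use of plain NumPy instead of a real communication library are illustrative assumptions.

```python
# Minimal sketch (not the paper's algorithm): derive the quantization scale
# from the observed range of the features to be communicated, quantize before
# the all-reduce, dequantize on receipt. Bit-packing and the actual collective
# call are omitted.
import numpy as np

def quantize_for_allreduce(x: np.ndarray, bits: int = 8):
    """Symmetric, range-based quantization of a communicated feature tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(x).max() / qmax, 1e-12)      # scale follows the feature range
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale                                  # integer codes + one float go over the wire

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Example: the partial output of one tensor-parallel rank before all-reduce.
partial = np.random.randn(4, 4096).astype(np.float32)
q, s = quantize_for_allreduce(partial, bits=8)
print("max abs error:", np.abs(dequantize(q, s) - partial).max())
```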
This work presents optimal quantization techniques and deployment strategies for large language models (LLMs) that improve performance while minimizing accuracy loss.
Quantizing large language models (LLMs) offers significant performance and cost benefits with minimal impact on accuracy, making it a viable approach for efficient deployment across various applications and hardware configurations.
TesseraQ is a new technique that improves post-training quantization of LLMs through block reconstruction and progressive adaptive rounding, substantially improving perplexity and downstream task accuracy over existing methods.
TesseraQ is a novel post-training quantization (PTQ) method that pushes the boundaries of LLM compression by enabling ultra-low bit quantization with minimal performance loss, achieving state-of-the-art results across various benchmarks.
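As a rough illustration of the adaptive-rounding idea the two TesseraQ summaries refer to, the toy sketch below learns a per-weight up/down rounding decision that minimizes a single layer's reconstruction error on calibration data. TesseraQ's actual block-wise, progressive hardening procedure and its calibration details are not reproduced; tensor sizes, learning rate, and step count here are arbitrary assumptions.

```python
# Hedged sketch in the spirit of adaptive rounding: a soft variable decides
# whether each 4-bit weight rounds down or up, trained against the block's
# output error, then hardened to an integer grid.
import torch

torch.manual_seed(0)
W = torch.randn(64, 64)                        # full-precision weights of one layer
X = torch.randn(256, 64)                       # calibration activations
scale = W.abs().max() / 7                      # 4-bit symmetric scale (qmax = 7)

W_floor = torch.floor(W / scale)
v = torch.zeros_like(W, requires_grad=True)    # soft up/down rounding variable
opt = torch.optim.Adam([v], lr=1e-2)

for step in range(200):
    W_q = (W_floor + torch.sigmoid(v)).clamp(-8, 7) * scale      # soft-quantized weights
    loss = ((X @ W_q.T) - (X @ W.T)).pow(2).mean()               # block reconstruction error
    opt.zero_grad(); loss.backward(); opt.step()

# Harden the decisions: sigmoid(v) > 0.5 rounds up, otherwise down.
W_int = (W_floor + (torch.sigmoid(v) > 0.5).float()).clamp(-8, 7)
print("reconstruction MSE:", ((X @ (W_int * scale).T) - (X @ W.T)).pow(2).mean().item())
```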
Pyramid Vector Quantization (PVQ) offers a novel approach to compressing large language models (LLMs) by efficiently quantizing weights and activations, achieving state-of-the-art compression rates with minimal performance loss.
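For readers unfamiliar with PVQ: its codebook is the set of integer vectors with a fixed L1 norm K, and a vector is stored as a gain plus the nearest such integer point. The sketch below is a textbook greedy projection onto that codebook, not the paper's LLM-specific pipeline; `pvq_quantize`/`pvq_dequantize` and the choice of K are illustrative.

```python
import numpy as np

def pvq_quantize(x: np.ndarray, K: int):
    """Gain-shape PVQ: return an integer codeword y with sum(|y|) == K plus a gain."""
    g = float(np.abs(x).sum())
    if g == 0.0:
        return np.zeros(x.shape, dtype=np.int64), 0.0
    z = x * (K / g)                                   # sum(|z|) == K by construction
    y = np.rint(z).astype(np.int64)
    while np.abs(y).sum() < K:                        # add magnitude where it is most missing
        i = int(np.argmax(np.abs(z) - np.abs(y)))
        y[i] += 1 if z[i] >= 0 else -1
    while np.abs(y).sum() > K:                        # remove magnitude where it is most excessive
        diff = np.where(np.abs(y) > 0, np.abs(y) - np.abs(z), -np.inf)
        i = int(np.argmax(diff))
        y[i] -= int(np.sign(y[i]))
    return y, g

def pvq_dequantize(y: np.ndarray, g: float, K: int) -> np.ndarray:
    return y * (g / K)

w = np.random.randn(128)                              # e.g. one group of weights
y, g = pvq_quantize(w, K=64)
print("codeword L1 norm:", np.abs(y).sum(), "recon error:", np.abs(pvq_dequantize(y, g, 64) - w).max())
```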
DAQ, a new two-stage post-training quantization method, improves the compression of large language models (LLMs) by aligning high-density weight regions with high-precision regions in floating-point representation and optimizing quantization parameters based on their impact on model output.
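DAQ's density-aware alignment itself is not reconstructed here; the sketch below only illustrates the second ingredient the summary mentions, choosing quantization parameters by their effect on the layer's output rather than on the weights alone. The simple clipping-ratio grid search is my own stand-in, not the paper's optimizer.

```python
import numpy as np

def output_aware_scale(W: np.ndarray, X: np.ndarray, bits: int = 4,
                       ratios=np.linspace(0.5, 1.0, 21)) -> float:
    """Pick the clipping ratio whose quantized weights best preserve X @ W.T."""
    qmax = 2 ** (bits - 1) - 1
    ref = X @ W.T                                     # full-precision layer output
    best_err, best_scale = np.inf, None
    for r in ratios:
        scale = r * np.abs(W).max() / qmax
        Wq = np.clip(np.round(W / scale), -qmax - 1, qmax) * scale
        err = ((X @ Wq.T - ref) ** 2).mean()          # impact on the output, not on W itself
        if err < best_err:
            best_err, best_scale = err, scale
    return best_scale

W = np.random.randn(256, 256) * 0.02
X = np.random.randn(64, 256)                          # calibration activations
print("chosen scale:", output_aware_scale(W, X))
```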
This work introduces COMET, a high-performance inference framework that reduces the memory footprint and cost of serving large language models (LLMs) by combining a fine-grained mixed-precision quantization algorithm (FMPQ) for activations and the KV cache with a W4Ax kernel.
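The W4Ax kernel and FMPQ's actual policy are beyond a short sketch; the snippet below only illustrates what "fine-grained mixed precision" can mean for a KV-cache tensor, keeping 8 bits for groups whose range is dominated by outliers and 4 bits elsewhere. The grouping rule, threshold, and function name are assumptions, and bit-packing is omitted.

```python
import numpy as np

def quantize_kv_groups(kv: np.ndarray, group_size: int = 64, outlier_ratio: float = 4.0):
    """Group-wise mixed-precision quantization: 8 bits for outlier-heavy groups, else 4 bits."""
    flat = kv.reshape(-1, group_size)                 # assumes size divisible by group_size
    quantized = []
    for g in flat:
        peak, mean = np.abs(g).max(), np.abs(g).mean() + 1e-12
        bits = 8 if peak > outlier_ratio * mean else 4
        qmax = 2 ** (bits - 1) - 1
        scale = max(peak / qmax, 1e-12)
        q = np.clip(np.round(g / scale), -qmax - 1, qmax).astype(np.int8)
        quantized.append((q, scale, bits))            # packing of 4-bit codes omitted
    return quantized

kv = np.random.randn(8, 16, 64).astype(np.float32)    # toy key cache: heads x tokens x dim
groups = quantize_kv_groups(kv)
print("groups kept at 8 bits:", sum(1 for _, _, b in groups if b == 8), "of", len(groups))
```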
SpinQuant is a new quantization technique that applies learned rotations to LLM weights and activations to reduce outliers, achieving performance close to the full-precision model even at 4-bit quantization.
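The rotation trick is easy to see in isolation: an orthogonal matrix folded into the weights leaves the layer output unchanged while spreading activation outliers across channels, which is what makes low-bit quantization less lossy. SpinQuant learns the rotation; the sketch below substitutes a random orthogonal matrix purely to demonstrate the invariance and the outlier reduction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
W = rng.standard_normal((d, d)) * 0.02                # a linear layer's weights
X = rng.standard_normal((128, d))
X[:, :4] *= 50                                        # inject a few outlier channels

R, _ = np.linalg.qr(rng.standard_normal((d, d)))      # random orthogonal stand-in for a learned rotation
Xr, Wr = X @ R, W @ R                                 # rotate activations, fold R into the weights

assert np.allclose(X @ W.T, Xr @ Wr.T, atol=1e-6)     # X W^T == (X R)(W R)^T, output unchanged
print(f"max |activation|: {np.abs(X).max():.1f} before, {np.abs(Xr).max():.1f} after rotation")
```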
FLUTE, a novel CUDA kernel, significantly accelerates LLM inference by enabling fast matrix multiplications for lookup table-quantized models, particularly excelling in low-bit and non-uniform quantization scenarios.
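FLUTE itself is a CUDA kernel; the NumPy reference below only spells out the computation such a kernel fuses: weights stored as low-bit integer codes that index per-group lookup tables of non-uniform values, dequantized and multiplied on the fly. The 3-bit width, group size, and table construction here are arbitrary choices for illustration, not FLUTE's actual layout.

```python
import numpy as np

rng = np.random.default_rng(0)
out_f, in_f, bits, group = 256, 512, 3, 128
levels = 2 ** bits

codes = rng.integers(0, levels, size=(out_f, in_f))                              # 3-bit weight codes
tables = np.sort(rng.standard_normal((out_f, in_f // group, levels)), axis=-1)   # non-uniform levels per group

def lut_dequant(codes: np.ndarray, tables: np.ndarray) -> np.ndarray:
    """Dequantize by table lookup: each code selects one value from its group's table."""
    rows, cols = codes.shape
    grouped = codes.reshape(rows, cols // group, group)
    return np.take_along_axis(tables, grouped, axis=-1).reshape(rows, cols)

X = rng.standard_normal((4, in_f))
Y = X @ lut_dequant(codes, tables).T                  # the real kernel fuses lookup + GEMM
print(Y.shape)                                        # (4, 256)
```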