This research proposes a novel quantization method for tensor-parallel large language models (LLMs) that significantly reduces communication costs during inference by quantizing the communicated features according to their value ranges, while preserving accuracy.
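The paper's exact scheme is not reproduced here; the sketch below only illustrates the underlying idea, range-based quantization of the activation shard a tensor-parallel rank would send into an all-reduce. The function names (`quantize_for_allreduce`, `dequantize`), the bit-width, and the use of plain NumPy instead of a real communication library are illustrative assumptions.

```python
# Minimal sketch (not the paper's algorithm): derive the quantization scale
# from the observed range of the features to be communicated, quantize before
# the all-reduce, dequantize on receipt. Bit-packing and the actual collective
# call are omitted.
import numpy as np

def quantize_for_allreduce(x: np.ndarray, bits: int = 8):
    """Symmetric, range-based quantization of a communicated feature tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(x).max() / qmax, 1e-12)      # scale follows the feature range
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale                                  # integer codes + one float go over the wire

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Example: the partial output of one tensor-parallel rank before all-reduce.
partial = np.random.randn(4, 4096).astype(np.float32)
q, s = quantize_for_allreduce(partial, bits=8)
print("max abs error:", np.abs(dequantize(q, s) - partial).max())
```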
This work presents optimal quantization techniques and deployment strategies for large language models (LLMs) that improve performance while minimizing accuracy loss.
Quantizing large language models (LLMs) offers significant performance and cost benefits with minimal impact on accuracy, making it a viable approach for efficient deployment across various applications and hardware configurations.
TesseraQ is a new technique that improves post-training quantization of LLMs through block reconstruction and progressive adaptive rounding, substantially improving perplexity and downstream task accuracy over existing methods.
TesseraQ is a novel post-training quantization (PTQ) method that pushes the boundaries of LLM compression by enabling ultra-low bit quantization with minimal performance loss, achieving state-of-the-art results across various benchmarks.
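As a rough illustration of the adaptive-rounding idea the two TesseraQ summaries refer to, the toy sketch below learns a per-weight up/down rounding decision that minimizes a single layer's reconstruction error on calibration data. TesseraQ's actual block-wise, progressive hardening procedure and its calibration details are not reproduced; tensor sizes, learning rate, and step count here are arbitrary assumptions.

```python
# Hedged sketch in the spirit of adaptive rounding: a soft variable decides
# whether each 4-bit weight rounds down or up, trained against the block's
# output error, then hardened to an integer grid.
import torch

torch.manual_seed(0)
W = torch.randn(64, 64)                        # full-precision weights of one layer
X = torch.randn(256, 64)                       # calibration activations
scale = W.abs().max() / 7                      # 4-bit symmetric scale (qmax = 7)

W_floor = torch.floor(W / scale)
v = torch.zeros_like(W, requires_grad=True)    # soft up/down rounding variable
opt = torch.optim.Adam([v], lr=1e-2)

for step in range(200):
    W_q = (W_floor + torch.sigmoid(v)).clamp(-8, 7) * scale      # soft-quantized weights
    loss = ((X @ W_q.T) - (X @ W.T)).pow(2).mean()               # block reconstruction error
    opt.zero_grad(); loss.backward(); opt.step()

# Harden the decisions: sigmoid(v) > 0.5 rounds up, otherwise down.
W_int = (W_floor + (torch.sigmoid(v) > 0.5).float()).clamp(-8, 7)
print("reconstruction MSE:", ((X @ (W_int * scale).T) - (X @ W.T)).pow(2).mean().item())
```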
Pyramid Vector Quantization (PVQ) offers a novel approach to compressing large language models (LLMs) by efficiently quantizing weights and activations, achieving state-of-the-art compression rates with minimal performance loss.
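For readers unfamiliar with PVQ: its codebook is the set of integer vectors with a fixed L1 norm K, and a vector is stored as a gain plus the nearest such integer point. The sketch below is a textbook greedy projection onto that codebook, not the paper's LLM-specific pipeline; `pvq_quantize`/`pvq_dequantize` and the choice of K are illustrative.

```python
import numpy as np

def pvq_quantize(x: np.ndarray, K: int):
    """Gain-shape PVQ: return an integer codeword y with sum(|y|) == K plus a gain."""
    g = float(np.abs(x).sum())
    if g == 0.0:
        return np.zeros(x.shape, dtype=np.int64), 0.0
    z = x * (K / g)                                   # sum(|z|) == K by construction
    y = np.rint(z).astype(np.int64)
    while np.abs(y).sum() < K:                        # add magnitude where it is most missing
        i = int(np.argmax(np.abs(z) - np.abs(y)))
        y[i] += 1 if z[i] >= 0 else -1
    while np.abs(y).sum() > K:                        # remove magnitude where it is most excessive
        diff = np.where(np.abs(y) > 0, np.abs(y) - np.abs(z), -np.inf)
        i = int(np.argmax(diff))
        y[i] -= int(np.sign(y[i]))
    return y, g

def pvq_dequantize(y: np.ndarray, g: float, K: int) -> np.ndarray:
    return y * (g / K)

w = np.random.randn(128)                              # e.g. one group of weights
y, g = pvq_quantize(w, K=64)
print("codeword L1 norm:", np.abs(y).sum(), "recon error:", np.abs(pvq_dequantize(y, g, 64) - w).max())
```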
DAQ, a new two-stage post-training quantization method, improves the compression of large language models (LLMs) by aligning high-density weight regions with high-precision regions in floating-point representation and optimizing quantization parameters based on their impact on model output.
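DAQ's density-aware alignment itself is not reconstructed here; the sketch below only illustrates the second ingredient the summary mentions, choosing quantization parameters by their effect on the layer's output rather than on the weights alone. The simple clipping-ratio grid search is my own stand-in, not the paper's optimizer.

```python
import numpy as np

def output_aware_scale(W: np.ndarray, X: np.ndarray, bits: int = 4,
                       ratios=np.linspace(0.5, 1.0, 21)) -> float:
    """Pick the clipping ratio whose quantized weights best preserve X @ W.T."""
    qmax = 2 ** (bits - 1) - 1
    ref = X @ W.T                                     # full-precision layer output
    best_err, best_scale = np.inf, None
    for r in ratios:
        scale = r * np.abs(W).max() / qmax
        Wq = np.clip(np.round(W / scale), -qmax - 1, qmax) * scale
        err = ((X @ Wq.T - ref) ** 2).mean()          # impact on the output, not on W itself
        if err < best_err:
            best_err, best_scale = err, scale
    return best_scale

W = np.random.randn(256, 256) * 0.02
X = np.random.randn(64, 256)                          # calibration activations
print("chosen scale:", output_aware_scale(W, X))
```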
This work introduces COMET, a high-performance inference framework that reduces the memory footprint and cost of serving large language models (LLMs) by combining a fine-grained mixed-precision quantization algorithm (FMPQ) for activations and the KV cache with a W4Ax kernel.
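The W4Ax kernel and FMPQ's actual policy are beyond a short sketch; the snippet below only illustrates what "fine-grained mixed precision" can mean for a KV-cache tensor, keeping 8 bits for groups whose range is dominated by outliers and 4 bits elsewhere. The grouping rule, threshold, and function name are assumptions, and bit-packing is omitted.

```python
import numpy as np

def quantize_kv_groups(kv: np.ndarray, group_size: int = 64, outlier_ratio: float = 4.0):
    """Group-wise mixed-precision quantization: 8 bits for outlier-heavy groups, else 4 bits."""
    flat = kv.reshape(-1, group_size)                 # assumes size divisible by group_size
    quantized = []
    for g in flat:
        peak, mean = np.abs(g).max(), np.abs(g).mean() + 1e-12
        bits = 8 if peak > outlier_ratio * mean else 4
        qmax = 2 ** (bits - 1) - 1
        scale = max(peak / qmax, 1e-12)
        q = np.clip(np.round(g / scale), -qmax - 1, qmax).astype(np.int8)
        quantized.append((q, scale, bits))            # packing of 4-bit codes omitted
    return quantized

kv = np.random.randn(8, 16, 64).astype(np.float32)    # toy key cache: heads x tokens x dim
groups = quantize_kv_groups(kv)
print("groups kept at 8 bits:", sum(1 for _, _, b in groups if b == 8), "of", len(groups))
```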
SpinQuant is a new quantization technique that applies learned rotations to LLM weights and activations to reduce outliers, achieving performance close to the full-precision model even at 4-bit quantization.
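The rotation trick is easy to see in isolation: an orthogonal matrix folded into the weights leaves the layer output unchanged while spreading activation outliers across channels, which is what makes low-bit quantization less lossy. SpinQuant learns the rotation; the sketch below substitutes a random orthogonal matrix purely to demonstrate the invariance and the outlier reduction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
W = rng.standard_normal((d, d)) * 0.02                # a linear layer's weights
X = rng.standard_normal((128, d))
X[:, :4] *= 50                                        # inject a few outlier channels

R, _ = np.linalg.qr(rng.standard_normal((d, d)))      # random orthogonal stand-in for a learned rotation
Xr, Wr = X @ R, W @ R                                 # rotate activations, fold R into the weights

assert np.allclose(X @ W.T, Xr @ Wr.T, atol=1e-6)     # X W^T == (X R)(W R)^T, output unchanged
print(f"max |activation|: {np.abs(X).max():.1f} before, {np.abs(Xr).max():.1f} after rotation")
```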
FLUTE, a novel CUDA kernel, significantly accelerates LLM inference by enabling fast matrix multiplications for lookup table-quantized models, particularly excelling in low-bit and non-uniform quantization scenarios.
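FLUTE itself is a CUDA kernel; the NumPy reference below only spells out the computation such a kernel fuses: weights stored as low-bit integer codes that index per-group lookup tables of non-uniform values, dequantized and multiplied on the fly. The 3-bit width, group size, and table construction here are arbitrary choices for illustration, not FLUTE's actual layout.

```python
import numpy as np

rng = np.random.default_rng(0)
out_f, in_f, bits, group = 256, 512, 3, 128
levels = 2 ** bits

codes = rng.integers(0, levels, size=(out_f, in_f))                              # 3-bit weight codes
tables = np.sort(rng.standard_normal((out_f, in_f // group, levels)), axis=-1)   # non-uniform levels per group

def lut_dequant(codes: np.ndarray, tables: np.ndarray) -> np.ndarray:
    """Dequantize by table lookup: each code selects one value from its group's table."""
    rows, cols = codes.shape
    grouped = codes.reshape(rows, cols // group, group)
    return np.take_along_axis(tables, grouped, axis=-1).reshape(rows, cols)

X = rng.standard_normal((4, in_f))
Y = X @ lut_dequant(codes, tables).T                  # the real kernel fuses lookup + GEMM
print(Y.shape)                                        # (4, 256)
```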