NEO is an online LLM inference system that raises serving throughput by strategically offloading part of the attention computation and the KV cache from the GPU to the host CPU, relieving the GPU memory bottleneck and keeping both processors well utilized.
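Below is a minimal sketch of the offloading idea, assuming a decode step in which some requests keep their KV cache on the GPU while others have it resident in host memory; the function names and batching structure are hypothetical, not NEO's actual scheduler or kernels.

```python
# Sketch: run decode attention on the GPU for GPU-resident requests and on the
# CPU for requests whose KV cache was offloaded to host memory.
import torch

def decode_attention(q, k_cache, v_cache):
    # Single-step decode attention: q is [batch, heads, 1, dim],
    # k_cache / v_cache are [batch, heads, seq, dim].
    scores = q @ k_cache.transpose(-1, -2) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v_cache

def hybrid_decode_step(gpu_batch, cpu_batch):
    """gpu_batch / cpu_batch are (q, k_cache, v_cache) tuples on GPU / host memory."""
    # GPU portion: KV cache already resides in device memory.
    gpu_out = decode_attention(*gpu_batch)

    # CPU portion: the large KV cache stays in host memory, so only the small
    # query and output tensors cross the PCIe bus.
    q_cpu = cpu_batch[0].to("cpu")
    cpu_out = decode_attention(q_cpu, cpu_batch[1], cpu_batch[2])

    # Merge results back on the GPU for the rest of the transformer layer.
    return torch.cat([gpu_out, cpu_out.to(gpu_out.device)], dim=0)
```

In a real system the CPU and GPU portions would be overlapped rather than run sequentially, so the CPU work hides behind the GPU work instead of adding to it.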
To scale LLM inference efficiently under limited compute resources, it is important to find the optimal allocation of the compute budget across sampling configurations such as model, temperature, and language; doing so improves inference accuracy and, further, helps improve more complex inference algorithms built on top of sampling.
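A toy sketch of the budget-allocation problem follows; the per-sample success probabilities, per-sample costs, and the greedy gain-per-cost rule are illustrative assumptions, not the paper's method.

```python
# Allocate a sampling budget across configurations (model, temperature, ...),
# where each configuration has an estimated per-sample success probability and
# a per-sample cost, and the goal is to maximize P(at least one sample is correct).
def combined_success(alloc, configs):
    fail = 1.0
    for c, n in alloc.items():
        fail *= (1.0 - configs[c]["p"]) ** n
    return 1.0 - fail

def allocate(configs, budget):
    """Greedy: repeatedly buy the sample with the best marginal gain per unit cost."""
    alloc = {c: 0 for c in configs}
    spent = 0.0
    while True:
        base = combined_success(alloc, configs)
        candidates = [c for c in configs if spent + configs[c]["cost"] <= budget]
        if not candidates:
            return alloc
        def gain_per_cost(c):
            alloc[c] += 1
            gain = combined_success(alloc, configs) - base
            alloc[c] -= 1
            return gain / configs[c]["cost"]
        best = max(candidates, key=gain_per_cost)
        alloc[best] += 1
        spent += configs[best]["cost"]

configs = {
    "small-model, T=1.0": {"p": 0.20, "cost": 1.0},   # cheap, lower accuracy
    "large-model, T=0.6": {"p": 0.45, "cost": 8.0},   # expensive, higher accuracy
}
print(allocate(configs, budget=16.0))
```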
POD-Attention, a novel GPU kernel, accelerates Large Language Model (LLM) inference by enabling concurrent computation of prefill and decode attention operations within hybrid batches, leading to improved GPU resource utilization and reduced latency.
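POD-Attention fuses prefill and decode attention into a single GPU kernel; the coarse sketch below only conveys the motivation by overlapping the two phases on separate CUDA streams, since prefill attention is compute-bound and decode attention over long KV caches is memory-bound. The function and its arguments are hypothetical.

```python
# Illustration: overlap compute-bound prefill attention with memory-bound
# decode attention instead of running them back-to-back.
import torch
import torch.nn.functional as F

def overlapped_hybrid_attention(prefill_qkv, decode_qkv):
    prefill_stream = torch.cuda.Stream()
    decode_stream = torch.cuda.Stream()

    with torch.cuda.stream(prefill_stream):
        # Long-sequence prefill: large matmuls, high arithmetic intensity.
        q, k, v = prefill_qkv
        prefill_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

    with torch.cuda.stream(decode_stream):
        # Single-token decode over a long KV cache: dominated by memory reads.
        q, k, v = decode_qkv
        decode_out = F.scaled_dot_product_attention(q, k, v)

    torch.cuda.synchronize()
    return prefill_out, decode_out
```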
COMET is a novel inference framework that significantly improves the efficiency of large language model (LLM) serving on GPUs by introducing a fine-grained mixed-precision quantization (FMPQ) algorithm and a highly optimized W4Ax kernel, enabling practical 4-bit quantization for both activations and KV cache with negligible accuracy loss.
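For intuition, here is a minimal sketch of group-wise 4-bit quantization, the kind of layout a W4Ax kernel consumes; the group size, symmetric-quantization choice, and helper names are assumptions for illustration, whereas COMET's FMPQ algorithm additionally picks precision at a fine granularity.

```python
# Group-wise symmetric int4 quantization of a KV-cache tensor.
import torch

def quantize_int4_groupwise(x, group_size=128):
    """Returns integer levels in [-8, 7] (stored unpacked in int8 here)
    plus one fp16 scale per group."""
    orig_shape = x.shape
    groups = x.reshape(-1, group_size)
    scale = groups.abs().amax(dim=1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(groups / scale), -8, 7).to(torch.int8)
    return q.reshape(orig_shape), scale.half()

def dequantize_int4_groupwise(q, scale, group_size=128):
    groups = q.reshape(-1, group_size).float() * scale.float()
    return groups.reshape(q.shape)

# Example: quantize a slice of KV cache and check the reconstruction error.
kv = torch.randn(4, 32, 256, 128)                 # [batch, heads, seq, head_dim]
q, s = quantize_int4_groupwise(kv)
err = (dequantize_int4_groupwise(q, s) - kv).abs().mean()
print(f"mean abs error: {err:.4f}")
```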
SwiftKV is a model transformation and distillation procedure that makes LLM inference faster and cheaper, especially for workloads whose prompts are much longer than the generated output: it restructures the model to cut prompt-processing computation and compresses KV-cache memory to enable larger batch sizes, while preserving the quality of the generated tokens.
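The sketch below illustrates the kind of transformation involved, assuming a SingleInputKV-style design in which the KV cache for later layers is projected from an earlier layer's hidden states, so prompt tokens never run the later layers during prefill; the module names and the 50% cut point are illustrative, not SwiftKV's exact architecture.

```python
# Simplified prefill path: prompt tokens run only the first `cut` layers, and
# KV entries for the remaining layers are (distilled) projections of the hidden
# state at the cut point.
import torch
import torch.nn as nn

class DummyLayer(nn.Module):
    """Stand-in for a transformer block returning new hidden states plus the
    (K, V) pair it would write to the cache."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Linear(dim, dim)
        self.kv_proj = nn.Linear(dim, 2 * dim)

    def forward(self, hidden):
        k, v = self.kv_proj(hidden).chunk(2, dim=-1)
        return hidden + self.mlp(hidden), (k, v)

class SwiftKVPrefill(nn.Module):
    def __init__(self, dim, num_layers, cut):
        super().__init__()
        self.early_layers = nn.ModuleList(DummyLayer(dim) for _ in range(cut))
        self.kv_projections = nn.ModuleList(
            nn.Linear(dim, 2 * dim) for _ in range(num_layers - cut)
        )

    def forward(self, hidden):
        kv_cache = []
        for layer in self.early_layers:
            hidden, kv = layer(hidden)
            kv_cache.append(kv)
        for proj in self.kv_projections:   # later layers: no attention/MLP on prompt tokens
            kv_cache.append(proj(hidden).chunk(2, dim=-1))
        return kv_cache

kv_cache = SwiftKVPrefill(dim=64, num_layers=8, cut=4)(torch.randn(1, 16, 64))
print(len(kv_cache))   # KV for all 8 layers, from only 4 layers of prefill compute
```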
LLM-Pilot is a novel system that characterizes and predicts the performance of LLM inference services across various GPUs, enabling cost-effective deployment while meeting performance requirements.
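The workflow can be pictured with the toy sketch below: benchmark a model on a few GPUs, fit a simple per-GPU performance model, then recommend the cheapest GPU predicted to meet the latency target. The measurements, cost figures, and the linear latency model are illustrative assumptions, not LLM-Pilot's actual predictor.

```python
import numpy as np

# Offline characterization: (requests/s, p90 latency in s) measured per GPU,
# plus an assumed hourly price.
benchmarks = {
    "A10G": {"cost_per_hour": 1.2, "load": [1, 2, 4, 8],   "p90": [0.9, 1.1, 1.8, 3.5]},
    "A100": {"cost_per_hour": 4.1, "load": [1, 4, 8, 16],  "p90": [0.4, 0.6, 0.9, 1.7]},
    "H100": {"cost_per_hour": 8.0, "load": [1, 8, 16, 32], "p90": [0.3, 0.5, 0.8, 1.4]},
}

def predict_latency(gpu, load):
    # Fit a degree-1 polynomial to the measured points and evaluate it.
    b = benchmarks[gpu]
    slope, intercept = np.polyfit(b["load"], b["p90"], deg=1)
    return slope * load + intercept

def cheapest_gpu(target_load, latency_slo):
    feasible = [g for g in benchmarks if predict_latency(g, target_load) <= latency_slo]
    return min(feasible, key=lambda g: benchmarks[g]["cost_per_hour"]) if feasible else None

print(cheapest_gpu(target_load=6, latency_slo=1.5))
```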
The authors argue that current LLM serving systems trade throughput against latency, and propose Sarathi-Serve to improve both metrics simultaneously.
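Sarathi-Serve's stall-free scheduling builds on chunked prefills; the sketch below illustrates the scheduling idea under assumed data structures and a fixed per-iteration token budget, so long prompts are processed in chunks alongside ongoing decodes instead of stalling them.

```python
# Each iteration admits all ongoing decodes (one token each) and fills the rest
# of the token budget with a chunk of a pending prompt.
from collections import deque

TOKEN_BUDGET = 512          # max tokens processed per model iteration (illustrative)

def build_batch(decode_reqs, prefill_queue: deque):
    """Return (decode requests, list of (request id, chunk length)) for one iteration."""
    batch_prefill = []
    used = len(decode_reqs)              # each decode contributes one token
    while prefill_queue and used < TOKEN_BUDGET:
        req = prefill_queue[0]
        chunk = min(req["remaining_prompt"], TOKEN_BUDGET - used)
        batch_prefill.append((req["id"], chunk))
        req["remaining_prompt"] -= chunk
        used += chunk
        if req["remaining_prompt"] == 0:
            prefill_queue.popleft()      # prompt finished; request moves to decode
        else:
            break                        # budget exhausted mid-prompt; resume next iteration
    return decode_reqs, batch_prefill

decodes = [f"req{i}" for i in range(100)]
pending = deque([{"id": "reqA", "remaining_prompt": 4000}])
print(build_batch(decodes, pending))    # decodes run every iteration; reqA prefills in chunks
```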