NEO is an online LLM inference system that raises serving throughput by strategically offloading part of the attention computation and the KV cache from the GPU to the host CPU, relieving the GPU memory bottleneck and keeping both processors well utilized.
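Below is a minimal sketch of the offloading idea, assuming a decode step in which some requests keep their KV cache on the GPU while others have it resident in host memory; the function names and batching structure are hypothetical, not NEO's actual scheduler or kernels.

```python
# Sketch: run decode attention on the GPU for GPU-resident requests and on the
# CPU for requests whose KV cache was offloaded to host memory.
import torch

def decode_attention(q, k_cache, v_cache):
    # Single-step decode attention: q is [batch, heads, 1, dim],
    # k_cache / v_cache are [batch, heads, seq, dim].
    scores = q @ k_cache.transpose(-1, -2) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v_cache

def hybrid_decode_step(gpu_batch, cpu_batch):
    """gpu_batch / cpu_batch are (q, k_cache, v_cache) tuples on GPU / host memory."""
    # GPU portion: KV cache already resides in device memory.
    gpu_out = decode_attention(*gpu_batch)

    # CPU portion: the large KV cache stays in host memory, so only the small
    # query and output tensors cross the PCIe bus.
    q_cpu = cpu_batch[0].to("cpu")
    cpu_out = decode_attention(q_cpu, cpu_batch[1], cpu_batch[2])

    # Merge results back on the GPU for the rest of the transformer layer.
    return torch.cat([gpu_out, cpu_out.to(gpu_out.device)], dim=0)
```

In a real system the CPU and GPU portions would be overlapped rather than run sequentially, so the CPU work hides behind the GPU work instead of adding to it.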
To scale LLM inference efficiently under limited compute resources, it is important to find the optimal allocation of the compute budget across sampling configurations such as model, temperature, and language; doing so improves inference accuracy and, further, helps improve more complex inference algorithms built on top of sampling.
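A toy sketch of the budget-allocation problem follows; the per-sample success probabilities, per-sample costs, and the greedy gain-per-cost rule are illustrative assumptions, not the paper's method.

```python
# Allocate a sampling budget across configurations (model, temperature, ...),
# where each configuration has an estimated per-sample success probability and
# a per-sample cost, and the goal is to maximize P(at least one sample is correct).
def combined_success(alloc, configs):
    fail = 1.0
    for c, n in alloc.items():
        fail *= (1.0 - configs[c]["p"]) ** n
    return 1.0 - fail

def allocate(configs, budget):
    """Greedy: repeatedly buy the sample with the best marginal gain per unit cost."""
    alloc = {c: 0 for c in configs}
    spent = 0.0
    while True:
        base = combined_success(alloc, configs)
        candidates = [c for c in configs if spent + configs[c]["cost"] <= budget]
        if not candidates:
            return alloc
        def gain_per_cost(c):
            alloc[c] += 1
            gain = combined_success(alloc, configs) - base
            alloc[c] -= 1
            return gain / configs[c]["cost"]
        best = max(candidates, key=gain_per_cost)
        alloc[best] += 1
        spent += configs[best]["cost"]

configs = {
    "small-model, T=1.0": {"p": 0.20, "cost": 1.0},   # cheap, lower accuracy
    "large-model, T=0.6": {"p": 0.45, "cost": 8.0},   # expensive, higher accuracy
}
print(allocate(configs, budget=16.0))
```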
POD-Attention, a novel GPU kernel, accelerates Large Language Model (LLM) inference by enabling concurrent computation of prefill and decode attention operations within hybrid batches, leading to improved GPU resource utilization and reduced latency.
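POD-Attention fuses prefill and decode attention into a single GPU kernel; the coarse sketch below only conveys the motivation by overlapping the two phases on separate CUDA streams, since prefill attention is compute-bound and decode attention over long KV caches is memory-bound. The function and its arguments are hypothetical.

```python
# Illustration: overlap compute-bound prefill attention with memory-bound
# decode attention instead of running them back-to-back.
import torch
import torch.nn.functional as F

def overlapped_hybrid_attention(prefill_qkv, decode_qkv):
    prefill_stream = torch.cuda.Stream()
    decode_stream = torch.cuda.Stream()

    with torch.cuda.stream(prefill_stream):
        # Long-sequence prefill: large matmuls, high arithmetic intensity.
        q, k, v = prefill_qkv
        prefill_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

    with torch.cuda.stream(decode_stream):
        # Single-token decode over a long KV cache: dominated by memory reads.
        q, k, v = decode_qkv
        decode_out = F.scaled_dot_product_attention(q, k, v)

    torch.cuda.synchronize()
    return prefill_out, decode_out
```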
COMET is a novel inference framework that significantly improves the efficiency of large language model (LLM) serving on GPUs by introducing a fine-grained mixed-precision quantization (FMPQ) algorithm and a highly optimized W4Ax kernel, enabling practical 4-bit quantization for both activations and KV cache with negligible accuracy loss.
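For intuition, here is a minimal sketch of group-wise 4-bit quantization, the kind of layout a W4Ax kernel consumes; the group size, symmetric-quantization choice, and helper names are assumptions for illustration, whereas COMET's FMPQ algorithm additionally picks precision at a fine granularity.

```python
# Group-wise symmetric int4 quantization of a KV-cache tensor.
import torch

def quantize_int4_groupwise(x, group_size=128):
    """Returns integer levels in [-8, 7] (stored unpacked in int8 here)
    plus one fp16 scale per group."""
    orig_shape = x.shape
    groups = x.reshape(-1, group_size)
    scale = groups.abs().amax(dim=1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(groups / scale), -8, 7).to(torch.int8)
    return q.reshape(orig_shape), scale.half()

def dequantize_int4_groupwise(q, scale, group_size=128):
    groups = q.reshape(-1, group_size).float() * scale.float()
    return groups.reshape(q.shape)

# Example: quantize a slice of KV cache and check the reconstruction error.
kv = torch.randn(4, 32, 256, 128)                 # [batch, heads, seq, head_dim]
q, s = quantize_int4_groupwise(kv)
err = (dequantize_int4_groupwise(q, s) - kv).abs().mean()
print(f"mean abs error: {err:.4f}")
```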
SwiftKV is a model transformation and distillation procedure that makes LLM inference faster and cheaper, especially for workloads whose prompts are much longer than the generated output: it restructures the model to cut prompt-processing computation and compresses KV-cache memory to enable larger batch sizes, while preserving the quality of the generated tokens.
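The sketch below illustrates the kind of transformation involved, assuming a SingleInputKV-style design in which the KV cache for later layers is projected from an earlier layer's hidden states, so prompt tokens never run the later layers during prefill; the module names and the 50% cut point are illustrative, not SwiftKV's exact architecture.

```python
# Simplified prefill path: prompt tokens run only the first `cut` layers, and
# KV entries for the remaining layers are (distilled) projections of the hidden
# state at the cut point.
import torch
import torch.nn as nn

class DummyLayer(nn.Module):
    """Stand-in for a transformer block returning new hidden states plus the
    (K, V) pair it would write to the cache."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Linear(dim, dim)
        self.kv_proj = nn.Linear(dim, 2 * dim)

    def forward(self, hidden):
        k, v = self.kv_proj(hidden).chunk(2, dim=-1)
        return hidden + self.mlp(hidden), (k, v)

class SwiftKVPrefill(nn.Module):
    def __init__(self, dim, num_layers, cut):
        super().__init__()
        self.early_layers = nn.ModuleList(DummyLayer(dim) for _ in range(cut))
        self.kv_projections = nn.ModuleList(
            nn.Linear(dim, 2 * dim) for _ in range(num_layers - cut)
        )

    def forward(self, hidden):
        kv_cache = []
        for layer in self.early_layers:
            hidden, kv = layer(hidden)
            kv_cache.append(kv)
        for proj in self.kv_projections:   # later layers: no attention/MLP on prompt tokens
            kv_cache.append(proj(hidden).chunk(2, dim=-1))
        return kv_cache

kv_cache = SwiftKVPrefill(dim=64, num_layers=8, cut=4)(torch.randn(1, 16, 64))
print(len(kv_cache))   # KV for all 8 layers, from only 4 layers of prefill compute
```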
LLM-Pilot is a novel system that characterizes and predicts the performance of LLM inference services across various GPUs, enabling cost-effective deployment while meeting performance requirements.
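The workflow can be pictured with the toy sketch below: benchmark a model on a few GPUs, fit a simple per-GPU performance model, then recommend the cheapest GPU predicted to meet the latency target. The measurements, cost figures, and the linear latency model are illustrative assumptions, not LLM-Pilot's actual predictor.

```python
import numpy as np

# Offline characterization: (requests/s, p90 latency in s) measured per GPU,
# plus an assumed hourly price.
benchmarks = {
    "A10G": {"cost_per_hour": 1.2, "load": [1, 2, 4, 8],   "p90": [0.9, 1.1, 1.8, 3.5]},
    "A100": {"cost_per_hour": 4.1, "load": [1, 4, 8, 16],  "p90": [0.4, 0.6, 0.9, 1.7]},
    "H100": {"cost_per_hour": 8.0, "load": [1, 8, 16, 32], "p90": [0.3, 0.5, 0.8, 1.4]},
}

def predict_latency(gpu, load):
    # Fit a degree-1 polynomial to the measured points and evaluate it.
    b = benchmarks[gpu]
    slope, intercept = np.polyfit(b["load"], b["p90"], deg=1)
    return slope * load + intercept

def cheapest_gpu(target_load, latency_slo):
    feasible = [g for g in benchmarks if predict_latency(g, target_load) <= latency_slo]
    return min(feasible, key=lambda g: benchmarks[g]["cost_per_hour"]) if feasible else None

print(cheapest_gpu(target_load=6, latency_slo=1.5))
```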
The authors argue that current LLM serving systems trade throughput against latency, and propose Sarathi-Serve to improve both metrics simultaneously.
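Sarathi-Serve's stall-free scheduling builds on chunked prefills; the sketch below illustrates the scheduling idea under assumed data structures and a fixed per-iteration token budget, so long prompts are processed in chunks alongside ongoing decodes instead of stalling them.

```python
# Each iteration admits all ongoing decodes (one token each) and fills the rest
# of the token budget with a chunk of a pending prompt.
from collections import deque

TOKEN_BUDGET = 512          # max tokens processed per model iteration (illustrative)

def build_batch(decode_reqs, prefill_queue: deque):
    """Return (decode requests, list of (request id, chunk length)) for one iteration."""
    batch_prefill = []
    used = len(decode_reqs)              # each decode contributes one token
    while prefill_queue and used < TOKEN_BUDGET:
        req = prefill_queue[0]
        chunk = min(req["remaining_prompt"], TOKEN_BUDGET - used)
        batch_prefill.append((req["id"], chunk))
        req["remaining_prompt"] -= chunk
        used += chunk
        if req["remaining_prompt"] == 0:
            prefill_queue.popleft()      # prompt finished; request moves to decode
        else:
            break                        # budget exhausted mid-prompt; resume next iteration
    return decode_reqs, batch_prefill

decodes = [f"req{i}" for i in range(100)]
pending = deque([{"id": "reqA", "remaining_prompt": 4000}])
print(build_batch(decodes, pending))    # decodes run every iteration; reqA prefills in chunks
```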