Key Concepts
NeuPIMs proposes a heterogeneous accelerator system for efficient batched inference of Large Language Models, pairing an NPU (suited to GEMM) with processing-in-memory (PIM) units (suited to GEMV) so that both computation patterns are served efficiently.
Summary
Abstract:
Large Language Models (LLMs) are built from stacked decoder blocks, each consisting of QKV generation, multi-head attention, and a feed-forward network.
NeuPIMs integrates NPUs and PIMs to balance GEMM and GEMV computations for improved throughput.
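The GEMM/GEMV split in a decoder block can be sketched in NumPy (sizes and tensors below are illustrative, not from the paper): batching the current tokens makes the shared-weight QKV projection a single matrix-matrix multiply, while decode-time attention remains a per-request matrix-vector multiply over each request's private KV cache.

```python
import numpy as np

# Hypothetical sizes for illustration only (not from the paper).
batch, d_model, seq_len = 8, 64, 128
rng = np.random.default_rng(0)

# QKV generation: all requests share the same weights, so stacking the
# current token of each request yields one GEMM per projection.
x = rng.standard_normal((batch, d_model))           # one new token per request
w_qkv = rng.standard_normal((d_model, 3 * d_model))
qkv = x @ w_qkv                                     # GEMM: (batch, 3*d_model)

# Multi-head attention during decode: each request attends over its OWN
# KV cache, so score computation is a GEMV per request and cannot be
# merged across the batch.
scores = []
for b in range(batch):
    q_b = qkv[b, :d_model]                          # this request's query
    k_cache_b = rng.standard_normal((seq_len, d_model))  # its private K cache
    scores.append(k_cache_b @ q_b)                  # GEMV: (seq_len,)

print(qkv.shape, scores[0].shape)
```

The asymmetry above is the motivation for a heterogeneous design: the batched projections feed the NPU, while the unbatchable attention GEMVs feed the PIM.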
Introduction:
LLMs such as GPT-4 and LLaMA demand substantial memory capacity, bandwidth, and compute, making efficient inference serving challenging.
Batching inference requests turns the shared-weight operations (QKV generation and feed-forward) into efficient GEMMs, but multi-head attention remains per-request GEMV because each request attends over its own KV cache.
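A rough arithmetic-intensity sketch (assumed fp16 operand sizes; the shapes are illustrative, not the paper's models) shows why the two operation types want different hardware: GEMM reuses each weight across many rows and is compute-bound, while GEMV touches every weight once and is memory-bound, matching PIM's in-memory bandwidth.

```python
# Arithmetic intensity = FLOPs per byte of data moved (a standard
# roofline-style estimate; bytes_per=2 assumes fp16 operands).
def gemm_intensity(m: int, k: int, n: int, bytes_per: int = 2) -> float:
    flops = 2 * m * k * n                       # multiply-accumulates
    bytes_moved = bytes_per * (m * k + k * n + m * n)  # A, B, C traffic
    return flops / bytes_moved

def gemv_intensity(k: int, n: int, bytes_per: int = 2) -> float:
    # GEMV is just GEMM with a single row.
    return gemm_intensity(1, k, n, bytes_per)

# Illustrative shapes: a batched projection vs. a single attention GEMV.
print(f"GEMM intensity: {gemm_intensity(256, 4096, 4096):.1f}")  # >> 1: compute-bound, NPU
print(f"GEMV intensity: {gemv_intensity(4096, 4096):.3f}")       # ~1: memory-bound, PIM
```

Under this estimate the batched GEMM performs hundreds of FLOPs per byte while the GEMV performs roughly one, which is the quantitative version of the NPU/PIM split.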
Challenges:
Running NPU and PIM concurrently raises a microarchitectural challenge (a single row buffer cannot serve NPU memory accesses and PIM computation at the same time) and an algorithmic challenge (dependencies between GEMM and GEMV stages within a batch); NeuPIMs addresses both.
Contributions:
NeuPIMs introduces dual row buffers, decoupling NPU memory accesses from PIM computation, and sub-batch interleaving, breaking inter-stage dependencies so the NPU and PIM can execute in parallel.
Evaluation:
NeuPIMs outperforms both an NPU-only baseline and a naively integrated NPU-PIM system in end-to-end batched-inference throughput.
Statistics
NeuPIMs achieves 2.3× and 1.6× throughput improvement compared to NPU-only and NPU-PIM integrated systems, respectively.
Quotes
"NeuPIMs achieves high utilization on both NPU and PIM accelerators, offering significant throughput improvement."