Cambricon-LLM, a novel chiplet-based hybrid architecture, enables efficient on-device inference of large language models with up to 70 billion parameters by combining a neural processing unit (NPU) with a dedicated NAND flash chip that has on-die processing capabilities.
To enable high-efficiency deployment of large language models (LLMs) on mobile-device GPUs, the paper proposes four key optimization techniques: (1) a symbolic expression-based approach for inference on models with dynamic shapes, (2) operator optimizations and execution-priority settings, (3) an FP4 quantization method that reduces dequantization overhead, and (4) a sub-tensor-based technique that eliminates the need to copy the KV cache after each inference iteration (see the sketch below).
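To make technique (4) concrete, the following is a minimal sketch of one way a sub-tensor (view-based) KV cache can avoid per-iteration copies: a single buffer is preallocated for the maximum sequence length, each decoding step writes its new key/value entries in place, and the attention computation receives zero-copy views of the valid prefix. This is an illustrative assumption, not the paper's actual implementation; all names and parameters here (e.g. MAX_SEQ_LEN, append_kv) are hypothetical.

```python
import numpy as np

# Hypothetical sizes for illustration only (not taken from the paper).
MAX_SEQ_LEN = 2048
NUM_HEADS = 8
HEAD_DIM = 64

# Pre-allocate one KV buffer large enough for the longest sequence.
# Each decoding step writes into the next slot and hands out sub-tensors
# (views) over the valid prefix, so the cache never has to be copied
# from model outputs back to model inputs between iterations.
k_cache = np.zeros((MAX_SEQ_LEN, NUM_HEADS, HEAD_DIM), dtype=np.float16)
v_cache = np.zeros((MAX_SEQ_LEN, NUM_HEADS, HEAD_DIM), dtype=np.float16)
cur_len = 0

def append_kv(new_k: np.ndarray, new_v: np.ndarray):
    """Store this step's K/V in place and return zero-copy views of the cache."""
    global cur_len
    k_cache[cur_len] = new_k          # in-place write, no reallocation
    v_cache[cur_len] = new_v
    cur_len += 1
    # Basic slicing returns views that share memory with the big buffers.
    return k_cache[:cur_len], v_cache[:cur_len]

# Usage: one decoding step produces K/V for a single new token.
k_view, v_view = append_kv(
    np.random.rand(NUM_HEADS, HEAD_DIM).astype(np.float16),
    np.random.rand(NUM_HEADS, HEAD_DIM).astype(np.float16),
)
assert k_view.base is k_cache  # the view aliases the preallocated buffer
```

The key design point this sketch assumes is that downstream attention kernels can consume strided views directly; under that assumption, the cost of maintaining the cache per token is a single in-place write rather than a full copy of the accumulated keys and values.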