Efficient Deployment of Large Language Models on Mobile Devices
Key Idea
To enable high-efficiency deployment of large language models (LLMs) on mobile device GPUs, the paper proposes four key optimization techniques: (1) a symbolic expression-based approach for dynamic shape model inference, (2) operator optimizations and execution priority setting, (3) an FP4 quantization method to reduce dequantization overhead, and (4) a sub-tensor-based technique to eliminate the need for copying KV cache after LLM inference.
Abstract
The paper focuses on enabling efficient deployment of large language models (LLMs) on mobile device GPUs. It proposes four key optimization techniques:
- Symbolic expression-based dynamic shape inference (see the shape-derivation sketch after this list):
  - Represents unknown tensor dimensions with symbolic expressions to precisely capture the relationships between dynamic tensors.
  - Derives dynamic shapes efficiently by classifying operators into shape-computing and tensor-computing categories.
  - Reuses memory by leveraging the symbolic expressions to compare tensor sizes.
  - Reduces the time spent updating shapes during inference by padding the input sequence length.
- Operator optimizations and phone lagging reduction:
  - Implements matrix multiplication kernels that operate directly on half-precision activations and 4-bit quantized weights.
  - Fuses smaller operators to reduce the time spent in non-matrix-multiplication operators.
  - Sets the execution priority of LLM operators to the lowest level to alleviate phone lagging during inference.
- M0E4 FP4 quantization (see the dequantization sketch after this list):
  - Proposes an FP4 format, termed M0E4, that uses its 4 bits only for the fraction part, so dequantization needs just two bitwise operations.
  - Integrates with existing quantization techniques such as GPTQ and AWQ to minimize dequantization overhead.
- KV cache copy optimization (see the sub-tensor sketch after this list):
  - Stores only a single instance of the KV cache tensor and uses sub-tensors to eliminate the need for copying KV cache from output to input after each inference iteration.
  - Modifies the KV cache format to ensure the variable sequence length dimension occupies the first non-1 dimension.
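To make the first technique concrete, below is a minimal Python sketch of symbolic shape derivation, with sympy standing in for the engine's own expression machinery; the operator rule, tensor names, and dimensions are illustrative rather than taken from the paper.

```python
# Illustrative only: sympy plays the role of the engine's symbolic shape expressions.
import sympy as sp

seq_len = sp.Symbol("seq_len", positive=True, integer=True)   # the dynamic dimension
hidden, heads, head_dim = 4096, 32, 128                       # example static dimensions

def matmul_shape(a_shape, b_shape):
    # Shape-computing rule for (..., m, k) x (k, n) -> (..., m, n); no tensor data needed.
    assert a_shape[-1] == b_shape[0]
    return a_shape[:-1] + (b_shape[1],)

x_shape   = (1, seq_len, hidden)                         # activations: batch 1, dynamic length
qkv_shape = matmul_shape(x_shape, (hidden, 3 * hidden))  # derived once: (1, seq_len, 12288)
q_shape   = (1, seq_len, heads, head_dim)                # after the reshape/split into heads

def numel(shape):
    return sp.Mul(*shape)                                # symbolic element count

# Memory reuse: the QKV buffer may later host Q only if it is never smaller.
surplus = sp.simplify(numel(qkv_shape) - numel(q_shape))
print(qkv_shape, surplus)   # (1, seq_len, 12288) 8192*seq_len -> nonnegative, reuse is safe
```

At inference time only the concrete value of seq_len changes, so shapes and the memory-reuse plan do not have to be recomputed from scratch.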
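The dequantization idea can also be sketched in a few lines. The exact M0E4 bit layout and kernel are defined in the paper; the constants below (a shift by 6, an OR with 0x3C00, and folding the offset into the group scale and zero point) are one plausible way to realize "all 4 bits in the fraction part, two bitwise operations" and are assumptions of this sketch, not the paper's implementation.

```python
# Hedged sketch: shift the 4-bit code into the top of the FP16 fraction field and OR in a
# fixed exponent. The reinterpreted value is 1 + code/16, an affine function of the code,
# so the usual per-group scale and zero point (e.g. from GPTQ/AWQ) absorb the offset.
import numpy as np

def dequant_fp4_sketch(codes: np.ndarray, scale: np.float16, zero: np.float16) -> np.ndarray:
    """codes: uint8 array of 4-bit values (0..15); scale/zero: per-group parameters."""
    bits = (codes.astype(np.uint16) << 6) | np.uint16(0x3C00)   # the two bitwise operations
    vals = bits.view(np.float16)                                # reinterpret: 1 + code/16, no int->float convert
    return (vals - zero) * scale                                # zero point also removes the "+1"

codes = np.array([0, 1, 8, 15], dtype=np.uint8)
# Plain affine int4 dequant W = s*(code - z) is recovered with scale = 16*s, zero = 1 + z/16.
print(dequant_fp4_sketch(codes, np.float16(16.0), np.float16(1.0)))   # [ 0.  1.  8. 15.]
```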
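Finally, the sub-tensor-based KV cache technique can be illustrated with NumPy views; the buffer shape, the append_keys helper, and the sizes are hypothetical, chosen only to show why no output-to-input copy is needed when the sequence dimension leads the layout.

```python
# Illustrative sketch: one persistent key-cache buffer per layer, never copied.
import numpy as np

MAX_SEQ, NUM_HEADS, HEAD_DIM = 2048, 32, 128

k_cache = np.empty((1, MAX_SEQ, NUM_HEADS * HEAD_DIM), dtype=np.float16)
cached_len = 0   # how many token positions are currently filled

def append_keys(new_k: np.ndarray) -> np.ndarray:
    """new_k: (1, new_len, NUM_HEADS * HEAD_DIM); returns a view covering all cached positions."""
    global cached_len
    new_len = new_k.shape[1]
    # The model's "present key" output is written directly into this sub-tensor...
    k_cache[:, cached_len:cached_len + new_len, :] = new_k
    cached_len += new_len
    # ...and the next iteration's "past key" input is just a view of the same buffer.
    return k_cache[:, :cached_len, :]

past = append_keys(np.ones((1, 8, NUM_HEADS * HEAD_DIM), dtype=np.float16))   # prefill: 8 tokens
past = append_keys(np.ones((1, 1, NUM_HEADS * HEAD_DIM), dtype=np.float16))   # one decoding step
print(past.shape, past.base is k_cache)   # (1, 9, 4096) True -> a view of the cache, not a copy
```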
The paper evaluates the proposed Transformer-Lite engine on various LLM models with different architectures and parameter sizes, demonstrating significant performance improvements over existing GPU-based (MLC-LLM) and CPU-based (FastLLM) engines. Specifically, Transformer-Lite achieves over 10x faster prefill speed and 2-3x faster decoding speed compared to the baselines.
Transformer-Lite Statistics
Transformer-Lite achieves prefill speeds of 330 token/s and 121 token/s for Gemma 2B and ChatGLM2 6B models, respectively.
Transformer-Lite achieves decoding speeds of 30 token/s and 14 token/s for Gemma 2B and ChatGLM2 6B models, respectively.
Compared to MLC-LLM and FastLLM, Transformer-Lite achieves over 10x speedup for prefill speed and 2-3x speedup for decoding speed.
On a 24GB memory phone, Transformer-Lite deploys the Qwen1.5 14B model with a prompt size of 2048, achieving 54 token/s and 5 token/s for prefill and decoding speed, respectively.
Quotes
"To facilitate high-efficiency LLM deployment on device GPUs, we propose four optimization techniques: (a) a symbolic expression-based approach to support dynamic shape model inference; (b) operator optimizations and execution priority setting to enhance inference speed and reduce phone lagging; (c) an FP4 quantization method termed M0E4 to reduce dequantization overhead; (d) a sub-tensor-based technique to eliminate the need for copying KV cache after LLM inference."
"Compared with CPU-based FastLLM and GPU-based MLC-LLM, our engine attains over 10x speedup for the prefill speed and 2~3x speedup for the decoding speed."
Further Questions
How can the prefill speed of Transformer-Lite be further improved by optimizing the matrix multiplication implementations?
To further improve the prefill speed of Transformer-Lite, the matrix multiplication kernels are the place to look, since prefill is dominated by large matrix multiplications. The two main levers are cache hit ratio and parallelism: tiling (blocking) the computation so that each loaded weight and activation tile is reused many times reduces memory traffic, while mapping independent output tiles to separate GPU work-groups keeps all compute units busy. Together these bring the achieved throughput closer to the GPU's peak FLOPS and therefore raise the prefill speed.
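As a rough illustration of the blocking idea (the tile size and the pure-Python loops are for exposition only; on a phone GPU the same structure is implemented inside a tuned kernel with tiles held in registers or local memory):

```python
# Cache-blocking (tiling) sketch: each loaded tile is reused many times, and independent
# output tiles map naturally onto parallel GPU work-groups.
import numpy as np

def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = 64) -> np.ndarray:
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % tile == 0 and n % tile == 0 and k % tile == 0
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):          # output rows of one tile
        for j in range(0, n, tile):      # output cols of one tile (independent -> parallelizable)
            for p in range(0, k, tile):  # reduction dimension, accumulated tile by tile
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c

a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-3)   # same result as the untiled product
```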
How can the proposed techniques in Transformer-Lite be extended to support the deployment of other types of deep learning models, such as computer vision and vision transformer models, on mobile devices?
The techniques proposed in Transformer-Lite can be extended to support the deployment of other types of deep learning models on mobile devices by focusing on model structure agnosticism and optimization for dynamic shape inference. By utilizing the ONNX model format, Transformer-Lite can accommodate various model architectures without the need for extensive re-description of model structures. This flexibility allows for easy integration of different model types, including computer vision models and vision transformer models.
For computer vision models, optimizations can be tailored to leverage the specific characteristics of these models, such as convolutional layers and image processing operations. By adapting the operator optimizations and memory reuse techniques to suit the requirements of computer vision tasks, Transformer-Lite can efficiently deploy these models on mobile devices.
Similarly, for vision transformer models, the symbolic expression-based dynamic shape inference and operator optimizations can be applied to support the unique structure of these models. By ensuring compatibility with dynamic input shapes and optimizing operations for transformer-based architectures, Transformer-Lite can effectively deploy vision transformer models on mobile GPUs. This approach enables a wide range of deep learning models to be deployed on mobile devices with high efficiency and performance.