HeteGen introduces a novel approach to heterogeneous parallel computing that reduces the inference latency of Large Language Models (LLMs) on resource-constrained devices, achieving significant speedups. The core idea is to leverage CPU and I/O resources alongside the GPU to improve computational efficiency and reduce the volume of parameter transfers.
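To make the idea concrete, here is a minimal sketch of CPU/GPU hybrid parallelism: each linear layer's weight matrix is split column-wise, a fraction is computed asynchronously on the CPU while the rest runs on the accelerator, so CPU work overlaps with GPU work and replaces part of the weight transfer. The function name `hybrid_linear`, the `cpu_ratio` knob, and the NumPy-simulated "devices" are illustrative assumptions, not HeteGen's actual implementation.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def hybrid_linear(x, weight, cpu_ratio, pool):
    """Compute x @ weight with the columns split across two workers.

    A sketch of heterogeneous parallelism: `cpu_ratio` of the columns
    run on a CPU worker thread while the rest run in the current
    thread (standing in for the GPU). Both shares execute concurrently.
    """
    split = int(weight.shape[1] * cpu_ratio)
    w_cpu, w_gpu = weight[:, :split], weight[:, split:]
    # Launch the CPU share asynchronously ...
    cpu_future = pool.submit(np.matmul, x, w_cpu)
    # ... while the "GPU" share runs in the current thread.
    gpu_out = x @ w_gpu
    return np.concatenate([cpu_future.result(), gpu_out], axis=-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layers = [rng.standard_normal((64, 64)) for _ in range(4)]
    x = rng.standard_normal((1, 64))
    with ThreadPoolExecutor(max_workers=1) as pool:
        for w in layers:
            x = hybrid_linear(x, w, cpu_ratio=0.25, pool=pool)
    print(x.shape)  # (1, 64)
```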
The proposed Unified Layer Skipping strategy determines the number of layers to skip based solely on the target speedup ratio, ensuring a stable and predictable acceleration effect across different input samples. Unlike existing methods that skip multiple contiguous layers, Unified Layer Skipping skips the corresponding number of intermediate layer computations in a balanced manner, minimizing the impact on the model's layer-wise representations.
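A minimal sketch of this scheduling rule follows, assuming the skip count is derived as `num_layers - round(num_layers / target_speedup)` and that the first and last layers are always kept; the function `layers_to_skip` and its rounding details are illustrative assumptions, not the paper's exact formulation.

```python
def layers_to_skip(num_layers: int, target_speedup: float) -> set[int]:
    """Pick evenly spaced intermediate layers to skip for a target speedup.

    The target speedup ratio alone fixes how many layers to execute, so
    the schedule is identical for every input sample; skipped layers are
    spread uniformly instead of dropping a contiguous block.
    """
    keep = max(2, round(num_layers / target_speedup))  # layers to execute
    candidates = list(range(1, num_layers - 1))        # keep first and last
    n_skip = min(num_layers - keep, len(candidates))
    if n_skip <= 0:
        return set()
    stride = len(candidates) / n_skip
    # Centered, evenly spaced picks among the intermediate layers.
    return {candidates[int(i * stride + stride / 2)] for i in range(n_skip)}

# Example: a 32-layer model at 2x target speedup skips 16 spread-out layers.
print(sorted(layers_to_skip(num_layers=32, target_speedup=2.0)))
```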
This survey presents a comprehensive taxonomy and analysis of techniques for efficient LLM inference, addressing the challenges posed by the models' substantial computational and memory requirements.