Key concepts
Tailoring arithmetic precision to the demands of each LLM inference phase, and progressively lowering precision as decoding proceeds, substantially improves efficiency without sacrificing output quality. This makes the approach particularly well suited to resource-constrained devices, as illustrated in the sketch below.
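For intuition, here is a minimal Python sketch of the core idea: an autoregressive loop that runs prefill at the highest precision and steps down a ladder of bit-widths as decoding progresses. The PRECISION_LADDER values, the step-based scheduler, and the forward/sample callables are illustrative assumptions, not the paper's actual implementation or precision-switching policy.

```python
# Minimal sketch of progressive mixed-precision decoding.
# All names, bit-widths, and the linear schedule below are assumptions
# for illustration; they do not reproduce the paper's method exactly.

PRECISION_LADDER = [8, 4, 3, 2]  # bit-widths, highest precision first (assumed)

def precision_for_step(step: int, total_steps: int) -> int:
    """Pick a bit-width for the current decoding step.

    Early steps (the most error-sensitive) use the highest precision;
    later steps progressively drop to lower bit-widths.
    """
    if total_steps <= 1:
        return PRECISION_LADDER[0]
    frac = step / (total_steps - 1)  # 0.0 at the first step, 1.0 at the last
    idx = min(int(frac * len(PRECISION_LADDER)), len(PRECISION_LADDER) - 1)
    return PRECISION_LADDER[idx]

def decode(weights_by_bits, prompt_ids, max_new_tokens, forward, sample):
    """Autoregressive loop that swaps in lower-precision weights over time.

    `weights_by_bits` maps bit-width -> weights quantized offline at that
    precision (assumed precomputed); `forward(weights, token_ids, state)`
    returns `(logits, state)` and `sample(logits)` returns a token id --
    both are stand-ins for a real model's forward pass and sampler.
    """
    ids = list(prompt_ids)
    # Prefill runs at the highest precision: the most error-sensitive phase.
    logits, state = forward(weights_by_bits[PRECISION_LADDER[0]], ids, None)
    ids.append(sample(logits))
    for step in range(1, max_new_tokens):
        bits = precision_for_step(step, max_new_tokens)
        logits, state = forward(weights_by_bits[bits], ids[-1:], state)
        ids.append(sample(logits))
    return ids
```

Because each decode step is dominated by weight loads in memory-bound LLM inference, dropping to a lower bit-width for later tokens directly reduces memory traffic, which is where the reported speedups come from.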
Statistics
PMPD achieves a 1.4–12.2× speedup in LLM linear layers over fp16 models on Nvidia GPUs.
PMPD delivers a throughput gain of 3.8–8.0× over fp16 models on an LLM-optimized NPU.
PMPD achieves up to 1.54× speedup over uniform quantization approaches on an LLM-optimized NPU while preserving output quality.
Quotes
"The prefill phase, as well as earlier parts of the decoding phase, are more sensitive to approximation errors from quantization, than later parts of the autoregressive generation process."
"By observing the key limitations in prior work, this paper identifies a novel insight for LLM weight quantization: The prefill phase, as well as earlier parts of the decoding phase, are more sensitive to approximation errors from quantization, than later parts of the autoregressive generation process."