QuantTune is a fine-tuning-based approach to model quantization that targets activation outliers and the dynamic-range problems they cause. The method reduces the accuracy drop of quantized models across a range of Transformer architectures. Because it integrates directly into the fine-tuning process, QuantTune is hardware-independent and supports efficient model compression and acceleration.
The study examines the challenges of post-training linear quantization for Transformer-based models. It finds that outliers amplify the dynamic range of activations, and the resulting precision loss accounts for a large share of quantization error and reduced inference accuracy. QuantTune adjusts weights according to the deviation of outlier activations, constraining dynamic ranges and improving model performance after quantization.
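To make the dynamic-range issue concrete, the sketch below implements plain linear (uniform) quantization and measures how a single activation outlier inflates the quantization error. This is an illustrative example rather than the paper's code; the function name, tensor sizes, and the injected outlier value are assumptions.

```python
import torch

def linear_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Uniform (linear) quantization over the tensor's full dynamic range.

    The step size is derived from the min/max of x, so a single outlier
    stretches the range and coarsens the resolution for all other values.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = qmin - torch.round(x_min / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale  # dequantized values

# One outlier dominates the dynamic range and inflates the mean error.
torch.manual_seed(0)
acts = torch.randn(1000)            # well-behaved activations
acts_outlier = acts.clone()
acts_outlier[0] = 50.0              # inject a single activation outlier

err_clean = (linear_quantize(acts) - acts).abs().mean().item()
err_outlier = (linear_quantize(acts_outlier) - acts_outlier).abs().mean().item()
print(f"mean abs error without outlier: {err_clean:.4f}")
print(f"mean abs error with outlier:    {err_outlier:.4f}")
```

Because the step size is tied to the tensor's min and max, one extreme value coarsens the resolution available to every other value, which is the error source QuantTune targets.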
The research shows that managing activation outliers is essential for accurate post-training quantization. By adding an outlier-driven loss function during fine-tuning, QuantTune narrows activation dynamic ranges and reduces the precision errors that outliers introduce. The approach makes models more resilient to quantization-induced errors without requiring specialized hardware or extensive calibration.
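A minimal sketch of how an outlier-driven regularizer could be attached during fine-tuning is shown below. The penalty form (a squared hinge on activation magnitude), the threshold and weight values, the hooked module types, and the `RangeAwareFineTuner` / `outlier_penalty` names are all hypothetical assumptions; the paper's exact loss formulation may differ.

```python
import torch
import torch.nn as nn

def outlier_penalty(activations: torch.Tensor, threshold: float = 6.0) -> torch.Tensor:
    """Hypothetical regularizer: squared hinge on activation magnitude,
    pulling values beyond the threshold inward during fine-tuning."""
    excess = torch.relu(activations.abs() - threshold)
    return excess.pow(2).mean()

class RangeAwareFineTuner:
    """Accumulates the activation penalty via forward hooks on selected layers.

    reg_weight trades the task loss off against dynamic-range control; the
    hook placement and penalty form are illustrative, not the paper's exact
    formulation.
    """
    def __init__(self, model: nn.Module, threshold: float = 6.0, reg_weight: float = 1e-3):
        self.model = model
        self.threshold = threshold
        self.reg_weight = reg_weight
        self._penalty = 0.0
        for module in model.modules():
            if isinstance(module, (nn.Linear, nn.GELU)):
                module.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        # Called on every forward pass; accumulate the outlier penalty.
        self._penalty = self._penalty + outlier_penalty(output, self.threshold)

    def loss(self, task_loss: torch.Tensor) -> torch.Tensor:
        # Combined objective: task loss plus the dynamic-range regularizer.
        total = task_loss + self.reg_weight * self._penalty
        self._penalty = 0.0  # reset the accumulator for the next step
        return total

# Usage sketch:
#   tuner = RangeAwareFineTuner(model)
#   logits = model(batch)
#   loss = tuner.loss(criterion(logits, targets))
#   loss.backward()
```

Passing the task loss through `loss()` lets gradients also push large activations back toward the threshold, so the fine-tuned model enters post-training quantization with a narrower dynamic range.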
QuantTune's effectiveness is validated on a range of Transformer-based models, including ViT, DeiT, Swin-Transformer, BERT, and OPT. It outperforms state-of-the-art calibration methods, reducing accuracy drops at multiple bit-widths for both vision and language models. QuantTune thus offers a cost-effective quantization solution that integrates readily with standard computing platforms.