QuantTune introduces an approach to model quantization that targets activation outliers and the dynamic-range inflation they cause. The method reduces the accuracy drop of quantized models across a range of Transformer architectures. Because it integrates directly into the fine-tuning process, QuantTune provides a hardware-independent path to efficient model compression and acceleration.
The study examines the challenges of post-training linear quantization of Transformer-based models. It finds that precision loss caused by activation outliers accounts for a large share of the quantization error and degrades inference accuracy. QuantTune adjusts weights based on outlier deviations to constrain the dynamic range of activations, which improves model performance after quantization.
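The mechanism behind this precision loss can be illustrated with a few lines of NumPy: a single large activation stretches the [min, max] range used by uniform linear quantization, which inflates the step size and hence the rounding error for every value in the tensor. The snippet below is a minimal sketch, not the paper's code; the bit-width, data distribution, and outlier magnitude are arbitrary illustrative choices.

```python
import numpy as np

def uniform_quantize(x, num_bits=8):
    """Uniformly quantize a tensor over its full dynamic range [min, max]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)   # step size grows with the range
    zero_point = qmin - x.min() / scale
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale               # dequantized values

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=10_000)          # well-behaved activations
acts_outlier = np.append(acts, 50.0)              # the same tensor plus one outlier

for name, a in [("no outlier", acts), ("with outlier", acts_outlier)]:
    err = np.abs(uniform_quantize(a, 8) - a).mean()
    print(f"{name:13s} range={a.max() - a.min():7.2f} mean |error|={err:.5f}")
```

Running this shows the mean quantization error growing by roughly the same factor as the dynamic range, even though only one value changed.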
The research demonstrates that managing activation outliers is crucial for accurate post-training quantization. By adding an outlier-driven loss function during fine-tuning, QuantTune narrows the dynamic ranges of activations and reduces the precision errors caused by outliers. The approach makes models more resilient to quantization-induced errors without specialized hardware or extensive calibration.
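The sketch below shows one way an outlier-driven penalty could be folded into a fine-tuning step. It is not the paper's implementation: the threshold rule (mean plus k standard deviations), the penalty weight `lam`, and the choice of monitored activations are all assumptions made purely for illustration.

```python
import torch

def outlier_penalty(activations, k=3.0):
    """Penalize activation magnitudes that exceed mean + k * std.

    Illustrative stand-in for an outlier-driven regularizer; the threshold
    rule and the value of `k` are assumptions, not the paper's formulation.
    """
    flat = activations.flatten().abs()
    threshold = flat.mean() + k * flat.std()
    excess = torch.relu(flat - threshold)          # only outliers contribute
    return excess.mean()

def fine_tune_step(model, batch, labels, criterion, optimizer, lam=0.1):
    """One fine-tuning step: task loss plus the outlier-driven penalty.

    `lam` and the use of the model outputs as the monitored activations are
    illustrative choices, not prescribed by the source.
    """
    optimizer.zero_grad()
    outputs = model(batch)                         # e.g. logits or hidden states
    loss = criterion(outputs, labels) + lam * outlier_penalty(outputs)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the penalty is just an extra term in the training loss, it requires no changes to the model architecture or to the deployment hardware.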
QuantTune's effectiveness is validated through experiments on Transformer-based models including ViT, DeiT, Swin-Transformer, BERT, and OPT. The method outperforms state-of-the-art calibration methods, reducing accuracy drops at multiple bit-widths for both vision and language models. It also offers a cost-effective route to model quantization that integrates with standard computing platforms.
Key insights extracted from the source by Jiun-Man Che... at arxiv.org, 03-12-2024: https://arxiv.org/pdf/2403.06497.pdf