Core Concepts
decoupleQ achieves a substantial increase in model accuracy, especially at very low bit-widths, by abandoning the traditional heuristic quantization paradigm: it decouples the model parameters into integer and floating-point parts, transforming quantization into a constrained optimization problem.
Summary
The paper proposes decoupleQ, a novel approach to post-training quantization that achieves state-of-the-art accuracy, especially at very low bit-widths.
Key highlights:
- decoupleQ abandons the traditional heuristic quantization paradigm and instead decouples the model parameters into integer and floating-point parts, transforming the quantization problem into a constrained optimization problem.
- This optimization problem is solved alternately by off-the-shelf optimization methods, without needing to handle the minutiae of traditional quantization, such as outliers and sensitive channels.
- decoupleQ achieves 2-bit post-training uniform quantization with performance close to fp16/bf16 on ByteDance's ASR model for industrial applications.
- The idea of decoupleQ can be easily extended to supervised learning to further improve model accuracy or adapt to downstream sub-tasks.
The paper first formulates the quantization problem as a constrained optimization problem in Eq. (6), where the model parameters are decoupled into integer and floating-point parts. This problem is then solved alternately by off-the-shelf optimization methods, as described in Algorithms 1 and 2.
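The alternating scheme can be illustrated with a minimal NumPy sketch. This is not the paper's actual Algorithm 1/2; it is a hedged toy version of the decoupling idea, assuming a per-output-channel affine quantizer with hypothetical names (`decouple_quantize`, `n_bits`, `n_iters`): fix the floating-point part (scale, zero) and solve the integer part by round-and-clamp, then fix the integer part and solve the floating-point part by least squares, and repeat.

```python
import numpy as np

def decouple_quantize(W, n_bits=2, n_iters=20):
    """Toy alternating minimization: decouple W into an integer part W_int
    and floating-point (scale, zero) per output channel (row), minimizing
    ||W - (scale * W_int + zero)||_F^2 with W_int in {0, ..., 2^n_bits - 1}.
    Illustrative only; the paper's constrained formulation (Eq. 6) and
    Algorithms 1-2 differ in detail."""
    qmax = 2 ** n_bits - 1
    # Initialize (scale, zero) from the per-row min-max range.
    w_min = W.min(axis=1, keepdims=True)
    w_max = W.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / qmax
    zero = w_min
    for _ in range(n_iters):
        # Step 1: fix (scale, zero); the optimal integer part under the
        # box constraint is elementwise round-and-clamp.
        safe_scale = np.where(np.abs(scale) < 1e-8, 1e-8, scale)
        W_int = np.clip(np.round((W - zero) / safe_scale), 0, qmax)
        # Step 2: fix W_int; (scale, zero) per row is an unconstrained
        # linear least-squares problem with a closed-form solution.
        for i in range(W.shape[0]):
            A = np.stack([W_int[i], np.ones_like(W_int[i])], axis=1)
            sol, *_ = np.linalg.lstsq(A, W[i], rcond=None)
            scale[i, 0], zero[i, 0] = sol
    return W_int, scale, zero
```

Each step can only decrease the reconstruction error, so the alternation converges to a local minimum; the appeal highlighted by the paper is that neither step requires heuristics for outliers or sensitive channels.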
The authors conduct extensive experiments on ImageNet, Llama, and a private ASR model from ByteDance. The results show that decoupleQ outperforms previous methods, especially at very low bit-widths, and can achieve performance close to fp16/bf16 on the 2-bit quantization of large speech models.
Statistics
The paper does not report standalone numerical data to support its key arguments; the main results are presented as tables comparing the performance of decoupleQ against other methods.
Quotes
"decoupleQ abandons the traditional heuristic quantization paradigm and instead decouples the model parameters into integer and floating-point parts, transforming the quantization problem into a traditional mathematical constrained optimization problem, which is then solved alternatively by off-the-shelf solution methods."
"decoupleQ achieves 2-bit post-training uniform quantization with performance close to fp16/bf16 for industrial applications in the ASR model in ByteDance."