Core Concepts
This empirical study comprehensively evaluates the performance of low-bit quantized LLAMA3 models across a range of post-training quantization and LoRA-finetuning techniques, revealing significant challenges in maintaining model accuracy at ultra-low bit-widths.
Summary
This study provides a comprehensive evaluation of the performance of low-bit quantized LLAMA3 models, the latest and most powerful open-source large language model series from Meta. The authors explore a wide range of post-training quantization (PTQ) and LoRA-finetuning (LoRA-FT) techniques, covering bit-widths from 1 to 8 bits and utilizing diverse evaluation datasets.
Key highlights:
- PTQ methods like RTN, GPTQ, AWQ, QuIP, PB-LLM, DB-LLM, and BiLLM are evaluated on the LLAMA3-8B and LLAMA3-70B models (a minimal RTN sketch follows this list).
- LoRA-FT methods like QLoRA and IR-QLoRA are also assessed on LLAMA3-8B (a minimal LoRA sketch follows this summary).
- The results indicate that while quantized LLAMA3 models still outperform earlier LLAMA generations, they suffer significant degradation, especially at ultra-low bit-widths (⩽2 bits).
- The LLAMA3-70B model is more robust to quantization than LLAMA3-8B.
- LoRA-FT quantization methods are unable to compensate for the performance loss caused by quantization, in contrast to previous findings on LLAMA1 and LLAMA2.
- The study highlights the need for new quantization paradigms to bridge the performance gap between original and quantized LLAMA3 models, especially at low bit-widths.
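As referenced in the PTQ bullet above, the following is a minimal sketch of the simplest baseline in that list, round-to-nearest (RTN) quantization. It performs symmetric per-output-channel fake-quantization of a weight matrix; the function name, tensor sizes, and bit-width handling are illustrative only, and methods such as GPTQ or AWQ add error compensation or activation-aware scaling on top of this idea.

```python
# Minimal sketch of round-to-nearest (RTN) per-channel weight quantization,
# the simplest PTQ baseline evaluated in the study. Symmetric quantization is
# assumed; the weights are immediately dequantized ("fake quantization") so the
# accuracy impact can be measured without custom low-bit kernels.
import torch

def rtn_quantize(weight: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Quantize a 2-D weight matrix per output channel and return the
    dequantized weights for accuracy evaluation."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4-bit symmetric
    scale = weight.abs().amax(dim=1, keepdim=True) / qmax
    scale = scale.clamp(min=1e-8)                    # avoid division by zero
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return q * scale                                 # back to floating point

# Example: quantize a random layer and inspect the reconstruction error.
w = torch.randn(1024, 1024)
w_4bit = rtn_quantize(w, bits=4)
print((w - w_4bit).abs().mean())
```

Lower bit-widths leave very few representable levels (only 4 at 2 bits), which is why the reconstruction error, and hence the accuracy drop reported in the statistics below, grows so sharply in the ultra-low-bit regime.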
The authors have made the project and quantized LLAMA3 models publicly available, aiming to advance research in LLM quantization and enable practical deployment of LLAMA3 in resource-constrained environments.
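To make the LoRA-FT setting concrete, the sketch below shows the core mechanism behind methods such as QLoRA: the (quantized) base weights stay frozen, and only a low-rank update is trained. The class and parameter names are hypothetical and do not reflect the actual QLoRA or IR-QLoRA implementations, which additionally keep the frozen base weights in a 4-bit format.

```python
# Hypothetical sketch of LoRA finetuning on top of a frozen base linear layer.
# In QLoRA-style training the frozen weights would be stored in 4-bit precision;
# here a plain nn.Linear stands in for that quantized base.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # base weights are frozen
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base path plus trainable low-rank update: Wx + (B A x) * scaling
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Example: wrap one projection layer; only lora_A / lora_B receive gradients.
layer = LoRALinear(nn.Linear(1024, 1024, bias=False), rank=16)
y = layer(torch.randn(2, 1024))
```

The study's finding is that this low-rank update alone cannot recover the accuracy LLAMA3 loses to quantization, which is why the 4-bit LoRA-FT result in the statistics below falls short of the plain 4-bit PTQ result.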
Statistics
In the original FP16 precision, the LLAMA3-8B model achieves a WikiText2 perplexity of 6.1, a C4 perplexity of 9.2, and an average accuracy of 68.6% across the common-sense evaluation datasets.
4-bit quantized LLAMA3-8B with the GPTQ method achieves a WikiText2 perplexity of 6.5, a C4 perplexity of 10.4, and an average accuracy of 67.3% on the same datasets.
2-bit quantized LLAMA3-8B with the BiLLM method degrades to a WikiText2 perplexity of 28.3, a C4 perplexity of 290.0, and an average accuracy of 37.9% on the evaluation datasets.
4-bit LoRA-FT quantized LLAMA3-8B with the QLoRA method achieves an average MMLU accuracy of 56.7%, which is lower than its 4-bit counterpart without LoRA-FT (62.5%).
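For context on the perplexity figures above, the sketch below shows the standard way such numbers are obtained: the per-token negative log-likelihood of a held-out corpus (e.g. WikiText2 or C4), averaged and exponentiated. It assumes a Hugging Face-style causal LM whose forward pass returns `.logits`; the exact evaluation pipeline and sequence length used in the study may differ.

```python
# Sketch of corpus perplexity evaluation for a causal language model.
# `model` is assumed to follow the Hugging Face convention of returning an
# object with a .logits tensor of shape (batch, seq_len, vocab_size).
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, input_ids: torch.Tensor, seq_len: int = 2048) -> float:
    nlls, n_tokens = [], 0
    for i in range(0, input_ids.size(1) - 1, seq_len):
        chunk = input_ids[:, i : i + seq_len + 1]     # inputs plus shifted targets
        logits = model(chunk[:, :-1]).logits
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),      # (tokens, vocab)
            chunk[:, 1:].reshape(-1),                 # next-token targets
            reduction="sum",
        )
        nlls.append(loss)
        n_tokens += chunk[:, 1:].numel()
    return torch.exp(torch.stack(nlls).sum() / n_tokens).item()
```

Because perplexity is the exponential of the average per-token loss, even a modest loss increase inflates it quickly, which explains the jump from 6.1 (FP16) to 28.3 (2-bit BiLLM) on WikiText2.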
Quotes
"Our experiment results indicate that LLAMA3 still suffers non-negligent degradation in these scenarios, especially in ultra-low bit-width. This highlights the significant performance gap under low bit-width that needs to be bridged in future developments."
"Despite the significant drop from quantization that cannot be compensated by finetuning, 4-bit LoRA-FT quantized LLAMA3-8B significantly outperforms LLAMA1-7B and LLAMA2-7B under various quantization methods."