Core Concepts
QLoRA combines innovations such as 4-bit NormalFloat quantization, double quantization, and paged optimizers to finetune large language models efficiently without sacrificing performance.
Abstract
The paper introduces QLoRA (Quantized Low-Rank Adaptation), a technique that enables efficient finetuning of quantized large language models. Key highlights:
Models finetuned with QLoRA outperform all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while requiring only 24 hours of finetuning on a single GPU.
QLoRA introduces several innovations to save memory without sacrificing performance:
4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights (sketched after this list).
Double quantization to reduce the average memory footprint by quantizing the quantization constants.
Paged optimizers to manage memory spikes.
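To make the NF4 idea concrete, here is a minimal sketch under simplifying assumptions: the 16 levels are taken as quantiles of a standard normal and rescaled to [-1, 1], and each block of weights is absmax-normalized before being snapped to the nearest level. The function names, the 64-weight block size, and the exact probability offset are illustrative; the paper's actual construction differs in details (for example, it reserves an exact zero level).

```python
# Hypothetical sketch of how NF4-style quantization levels can be derived:
# pick quantiles of a standard normal so the 2**k levels track where normally
# distributed weights concentrate, then rescale the levels to [-1, 1].
import numpy as np
from scipy.stats import norm

def nf4_levels(k: int = 4, offset: float = 0.9677) -> np.ndarray:
    """Approximate 2**k quantization levels for normally distributed weights."""
    n = 2 ** k
    # Evenly spaced probabilities, clipped away from 0 and 1 so the quantile
    # function stays finite (the exact offset value here is illustrative).
    probs = np.linspace(1 - offset, offset, n)
    levels = norm.ppf(probs)
    return levels / np.abs(levels).max()    # normalize levels to [-1, 1]

def quantize_block(weights: np.ndarray, levels: np.ndarray):
    """Absmax-quantize one block of weights to the nearest NF4 level."""
    absmax = np.abs(weights).max()          # per-block quantization constant
    normalized = weights / absmax           # map the block into [-1, 1]
    idx = np.abs(normalized[:, None] - levels[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), absmax     # 4-bit codes + one fp constant

levels = nf4_levels()
codes, absmax = quantize_block(np.random.randn(64).astype(np.float32), levels)
dequantized = levels[codes] * absmax        # reconstruction used at compute time
```

Double quantization then applies a second, 8-bit quantization to the per-block absmax constants themselves; with a 64-weight block size and 256 constants per second-level block, the paper reports this cuts the constants' overhead from 32/64 = 0.5 to 8/64 + 32/(64·256) ≈ 0.127 bits per parameter.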
The authors use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models).
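For context on what such finetuning runs look like in practice, below is a hedged sketch of how a QLoRA-style setup is commonly wired together with the Hugging Face stack (transformers + peft + bitsandbytes); the base model name and the LoRA hyperparameters are illustrative placeholders, not the paper's exact configuration.

```python
# Sketch of a QLoRA-style setup: 4-bit NF4 base weights with double
# quantization, plus trainable LoRA adapters on top of the frozen model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4 bits
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat data type
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
)

model_name = "huggyllama/llama-7b"          # placeholder base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05, # illustrative LoRA hyperparameters
    target_modules=["q_proj", "v_proj"],    # attention projections in LLaMA
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the LoRA adapters are trainable
model.print_trainable_parameters()
```

The paper's paged optimizers complement this by paging optimizer states to CPU-backed memory during GPU memory spikes; in recent transformers versions this is typically selected through an optimizer choice such as `optim="paged_adamw_32bit"` in `TrainingArguments` (an assumption about current library options, not something stated in the paper).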
The results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous state-of-the-art.
Stats
Models finetuned with QLoRA outperform all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT.
QLoRA only requires 24 hours of finetuning on a single GPU.