Core Concepts
The authors efficiently extended the context length of the Llama-3-8B-Instruct language model from 8K to 80K tokens through QLoRA fine-tuning on 3.5K synthetic training samples generated by GPT-4.
Abstract
The authors present an efficient approach to extend the context length of the Llama-3-8B-Instruct language model from 8K to 80K tokens. They used GPT-4 to synthesize 3.5K long-context training samples covering three tasks: single-detail QA, multi-detail QA, and biography summarization. The full training set comprised 20K instances: the 3.5K synthetic samples together with 5K instances randomly sampled from RedPajama and 12K from LongAlpaca.
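As a rough illustration of the reported data mixture (the function name, record format, and sampling seed here are hypothetical, not the authors' code), the three sources could be combined like this:

```python
import random

def build_training_mixture(synthetic, redpajama, longalpaca, seed=42):
    """Mix the three data sources as described: all 3.5K synthetic
    samples, 5K sampled from RedPajama, and 12K from LongAlpaca."""
    rng = random.Random(seed)
    mixture = list(synthetic)
    mixture += rng.sample(redpajama, 5000)
    mixture += rng.sample(longalpaca, 12000)
    rng.shuffle(mixture)
    return mixture

# Toy stand-ins for the real datasets (sizes match the reported counts).
synthetic = [{"source": "gpt4-synthetic", "id": i} for i in range(3500)]
redpajama = [{"source": "redpajama", "id": i} for i in range(20000)]
longalpaca = [{"source": "longalpaca", "id": i} for i in range(12000)]

data = build_training_mixture(synthetic, redpajama, longalpaca)
print(len(data))  # → 20500 (the ~20K instances reported)
```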
The authors fine-tuned the model using QLoRA, applying LoRA to all Q, K, V, and O projections while also training the embedding layer in full. They set the LoRA rank to 32 and alpha to 16, used a learning rate of 5e-5 with linear decay, and a batch size of 8. Gradient checkpointing was enabled, and the RoPE base (theta) was expanded from 500K to 200M. The entire training run took 8 hours on a single machine with 8×A800 (80GB) GPUs.
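A minimal sketch of how these reported hyperparameters might be expressed with Hugging Face transformers and peft — this is not the authors' training code, only a plausible rendering of the settings listed above:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization, the "Q" in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    rope_theta=200_000_000,  # RoPE base expanded from 500K to 200M
)

# LoRA on all attention projections; the embedding layer is trained in full.
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="llama3-80k-qlora",
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    per_device_train_batch_size=1,  # batch size of 8 across 8 GPUs (assumed split)
    gradient_checkpointing=True,
)
```

How the global batch size of 8 was split across the 8 GPUs is an assumption here; the paper only states the aggregate value.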
The resulting model, Llama-3-8B-Instruct-80K-QLoRA, exhibited superior performance across a range of long-context evaluation tasks, including Needle-In-A-Haystack, Topic Retrieval, LongBench, and InfiniteBench, while preserving its original capabilities on short contexts. The authors also compared the model's zero-shot performance on MMLU with that of other open-source language models.
The authors highlight that the dramatic context extension was achieved with a relatively small amount of synthetic training data, indicating the inherent potential of large language models to extend their original context length. They have publicly released all resources, including the model, training data, and code, to facilitate future research in this area.
Stats
Llama-3-8B-Instruct-80K-QLoRA achieves 100% accuracy on the Needle-In-A-Haystack task across all context lengths, including unseen positions from 80K to 128K.
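For context, a simplified sketch of the Needle-In-A-Haystack protocol: a "needle" sentence is planted at a chosen depth inside filler text, the model reads the full context, and its answer is checked for the needle fact. The filler, needle, and scoring rule below are illustrative, not the benchmark's exact materials:

```python
def build_haystack(filler: str, needle: str, depth: float, length: int) -> str:
    """Embed a 'needle' sentence at a relative depth (0.0-1.0) inside
    filler text repeated to roughly `length` characters."""
    haystack = (filler * (length // len(filler) + 1))[:length]
    pos = int(depth * length)
    return haystack[:pos] + " " + needle + " " + haystack[pos:]

def needle_recalled(model_answer: str, needle_fact: str) -> bool:
    """Score a retrieval as correct if the answer contains the fact."""
    return needle_fact.lower() in model_answer.lower()

filler = "The grass is green. The sky is blue. "
needle = "The best thing to do in San Francisco is eat a sandwich."
context = build_haystack(filler, needle, depth=0.5, length=2000)

# In the real evaluation, `context` plus a question is fed to the model,
# sweeping both the total context length and the needle's depth.
print(needle in context)  # → True
```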
Accuracy on the Topic Retrieval task remains at 100% for Llama-3-8B-Instruct-80K-QLoRA across all context lengths, while the original Llama-3-8B-Instruct fails to recall the topic once the context exceeds 9K tokens.
On the LongBench benchmark, Llama-3-8B-Instruct-80K-QLoRA significantly and consistently outperforms the baselines, including Llama-3-8B-Instruct and Llama-3-8B-Instruct-262K, except on the code completion task.
On the English Long-Book QA and Long-Book Summarization tasks from InfiniteBench, Llama-3-8B-Instruct-80K-QLoRA achieves the best performance among the evaluated models, including GPT-4.
Quotes
"The dramatic context extension is mainly attributed to merely 3.5K synthetic training samples generated by GPT-4, which indicates the LLMs' inherent (yet largely underestimated) potential to extend its original context length."
"In fact, the context length could be extended far beyond 80K with more computation resources."