The authors present an efficient approach to extending the context length of the Llama-3-8B-Instruct language model from 8K to 80K tokens. They used GPT-4 to synthesize 3.5K long-context training samples covering three tasks: single-detail QA, multi-detail QA, and biography summarization. The full training set consisted of 20K instances in total, combining these synthetic samples with 5K instances randomly chosen from RedPajama and 12K from LongAlpaca.
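The summary does not include the synthesis prompts themselves, but the data-construction step can be pictured with a minimal sketch along the following lines, assuming the OpenAI Python SDK; the function name and prompt wording are hypothetical reconstructions, not the authors' actual pipeline.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def synthesize_single_detail_qa(long_document: str) -> str:
    """Ask GPT-4 to produce a question/answer pair that hinges on one detail.

    The prompt below is a hypothetical reconstruction, not the paper's prompt.
    """
    prompt = (
        "Below is a long document. Write one question that can only be answered "
        "by locating a single specific detail in the document, then give the answer.\n\n"
        + long_document
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Analogous prompts would cover the multi-detail QA and biography summarization tasks.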
The authors fine-tuned the model with QLoRA, applying LoRA to all Q, K, V, and O projections and additionally training the embedding layer. They set the LoRA rank to 32 and alpha to 16, used a learning rate of 5e-5 with linear decay, and trained with a batch size of 8. Gradient checkpointing was enabled, and the RoPE base was expanded from 500K to 200M. The entire training process took 8 hours on a single machine with 8 A800 (80G) GPUs.
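For readers who want to reproduce a comparable setup, here is a minimal sketch of that configuration using the Hugging Face transformers, peft, and bitsandbytes stack (an assumption; the summary does not name the authors' training framework). The rank, alpha, target modules, learning rate, scheduler, and RoPE base follow the figures above, while the quantization recipe and the per-device batch-size split are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

# 4-bit quantized base model (NF4 + bf16 compute is the common QLoRA recipe; assumed here).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    rope_theta=200_000_000,  # RoPE base expanded from Llama-3's 500K to 200M
    torch_dtype=torch.bfloat16,
)
model.gradient_checkpointing_enable()

# LoRA on all attention projections, with the embedding layer trained as well.
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Optimizer settings reported above; 1 sample per GPU x 8 GPUs gives a global batch of 8
# (the exact split across devices is an assumption).
training_args = TrainingArguments(
    output_dir="llama3-8b-instruct-80k-qlora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    bf16=True,
    gradient_checkpointing=True,
)
```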
The resulting model, Llama-3-8B-Instruct-80K-QLoRA, exhibited superior performance across a range of long-context evaluation tasks, including Needle-In-A-Haystack, Topic Retrieval, LongBench, and InfBench, while also preserving the original capabilities over short contexts. The authors also compared the model's zero-shot performance on MMLU with other open-source language models.
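As an illustration of how a Needle-In-A-Haystack probe exercises the extended context, the sketch below hides a synthetic "needle" sentence inside long filler text and asks the model to retrieve it; the checkpoint identifier, needle text, and filler are hypothetical placeholders, not the authors' evaluation harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint identifier; see the authors' release for the actual one.
model_id = "Llama-3-8B-Instruct-80K-QLoRA"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hide one fact (the "needle") in the middle of long filler text (the "haystack").
needle = "The secret passphrase is cobalt-417."
filler = "The quick brown fox jumps over the lazy dog. " * 8000  # roughly 80K tokens
context = filler[: len(filler) // 2] + needle + filler[len(filler) // 2 :]

messages = [{"role": "user", "content": context + "\n\nWhat is the secret passphrase?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True))
```

Sweeping the needle's position and the context length over a grid yields the familiar Needle-In-A-Haystack heatmap.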
The authors highlight that this dramatic context extension was achieved with a relatively small amount of synthetic training data, indicating the inherent potential of large language models to extend their original context length. They have publicly released all of the resources, including the model, training data, and code, to facilitate future research in this area.
Key insights distilled from: Peitian Zhang et al., arxiv.org, 05-01-2024, https://arxiv.org/pdf/2404.19553.pdf