Core Concepts
LD-Align is a novel DPO-based approach that aligns a fine-tuned large language model with a high-quality supervised fine-tuning dataset, without requiring additional human annotations or relying on a more powerful language model.
Abstract
The paper introduces LD-Align, a novel approach for aligning large language models (LLMs) with human preferences without relying on extensive human annotations. The key ideas are:
- LD-Align utilizes a guiding model, consisting of an encoder and a decoder, to establish a latent space representation of samples from the supervised fine-tuning (SFT) dataset and those generated by the LLM.
- The distance between the latent representations of the SFT samples and the LLM-generated samples guides the alignment training. During Direct Preference Optimization (DPO) training, pairs whose generated sample lies farther from its SFT counterpart in the latent space receive higher update weights, so training concentrates on the samples that are least aligned with the SFT data.
- Comprehensive experiments show that LD-Align outperforms competing annotation-free alignment methods like SPIN and achieves notable performance improvements across various benchmarks, including truthfulness, commonsense reasoning, and multi-round dialogue.
- The authors analyze the quality of the latent space learned by the guiding model, demonstrating its effectiveness in capturing the alignment between generated samples and the high-quality SFT dataset.
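The distance-weighted training described in the bullets above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the guiding model's encoder has already produced a latent vector for each SFT (winner) sample and each generated (loser) sample, that the pair distance is Euclidean, that the normalizer is the mean distance over the dataset, and that the resulting weight simply multiplies the standard DPO loss. The function names, the `beta` value, and the toy numbers are all hypothetical.

```python
import math

def normalized_weights(latent_pairs):
    """Return one weight per (winner, loser) pair of latent vectors.

    Assumption: distance is Euclidean and the normalizer is the mean
    distance over the dataset; the summary does not specify either.
    """
    distances = [math.dist(z_w, z_l) for z_w, z_l in latent_pairs]
    mean_distance = sum(distances) / len(distances)
    if mean_distance == 0.0:
        # All pairs coincide in latent space: fall back to uniform weights.
        return [1.0] * len(distances)
    return [d / mean_distance for d in distances]

def dpo_pair_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def weighted_dpo_loss(logps, weights, beta=0.1):
    """Mean of per-pair DPO losses, each scaled by its latent-distance weight."""
    total = 0.0
    for (pw, pl, rw, rl), w in zip(logps, weights):
        total += w * dpo_pair_loss(pw, pl, rw, rl, beta)
    return total / len(logps)

# Toy data (hypothetical): two pairs of 2-D latent vectors and their
# (policy winner, policy loser, reference winner, reference loser) log-probs.
latent_pairs = [((0.0, 0.0), (3.0, 4.0)), ((1.0, 1.0), (1.0, 2.0))]
logps = [(-10.0, -12.0, -10.0, -12.0), (-5.0, -6.0, -5.0, -6.0)]

weights = normalized_weights(latent_pairs)
loss = weighted_dpo_loss(logps, weights)
```

In this sketch the first pair is farther apart in the latent space, so it receives the larger weight and contributes more to the loss. A real training loop would compute the log-probabilities from the policy and reference models and backpropagate through the weighted loss.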
Stats
The model used in the experiments is Mistral-7B, a pre-trained LLM that outperforms Llama 2 13B and Llama 1 34B on various benchmarks.
The SFT dataset used is Ultrachat200k, a high-quality dataset of 200k multi-round dialogues generated by ChatGPT.
Quotes
"We regard y as the winner sample and y' as the loser sample, and employ a DPO-based approach to train iteratively for alignment using the distance in the latent space as guidance."
"For a pair of y and y', we assign higher update weight when y' is far from y in the latent space, which is denoted by the magnitude of normalized distance sϕ(x,y,y')/Sϕ(D,pref)."