Core Concepts
This paper introduces an approach that uses LLM-driven objective discovery to automatically find new state-of-the-art preference optimization algorithms for Large Language Models (LLMs). The process yields DiscoPOP, a new algorithm that outperforms existing baselines on several tasks.
Abstract
Bibliographic Information:
Lu, C., Holt, S., Fanconi, C., Chan, A. J., Foerster, J., van der Schaar, M., & Lange, R. T. (2024). Discovering Preference Optimization Algorithms with and for Large Language Models. Advances in Neural Information Processing Systems, 37.
Research Objective:
This research aims to automate the discovery of novel and effective preference optimization algorithms for Large Language Models (LLMs) by leveraging the capabilities of LLMs themselves. The authors investigate whether LLMs can propose and evaluate new objective functions for preference optimization, potentially surpassing human-designed algorithms.
Methodology:
The researchers developed an LLM-driven objective discovery pipeline. An LLM (GPT-4) proposes new preference optimization loss functions as Python code. Each candidate loss is then evaluated by fine-tuning an LLM with it and measuring downstream performance: MT-Bench during the initial discovery phase, with AlpacaEval 2.0, Reddit TL;DR summarization, and IMDb positive-sentiment generation held out for later evaluation. Evaluation results are fed back to the proposing LLM, so it can learn from previous iterations and refine its proposals, ultimately leading to the discovery of novel and effective algorithms.
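The sketch below illustrates the structure of this propose-evaluate loop under stated assumptions; the helper names, stub bodies, and the number of generations are hypothetical placeholders rather than the authors' implementation.

```python
# Illustrative sketch of the LLM-driven objective discovery loop described above.
# propose_loss_code and fine_tune_and_score are hypothetical stand-ins.
from typing import List, Tuple

def propose_loss_code(history: List[Tuple[str, float]]) -> str:
    """Stub: prompt an LLM (GPT-4 in the paper) with previously evaluated
    candidates and their scores, and return new loss-function source code."""
    return "def loss(logits): return -logits"  # placeholder proposal

def fine_tune_and_score(loss_code: str) -> float:
    """Stub: fine-tune a base model with the proposed loss and return its
    downstream score (MT-Bench during the paper's discovery phase)."""
    return 0.0  # placeholder score

def discovery_loop(n_generations: int = 30) -> List[Tuple[str, float]]:
    history: List[Tuple[str, float]] = []
    for _ in range(n_generations):
        candidate = propose_loss_code(history)      # LLM proposes a new objective
        try:
            score = fine_tune_and_score(candidate)  # inner-loop training + evaluation
        except Exception:
            continue                                # invalid code: skip this candidate
        history.append((candidate, score))          # results are fed back on the next prompt
    return sorted(history, key=lambda pair: pair[1], reverse=True)
```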
Key Findings:
- The LLM-driven discovery pipeline successfully generated multiple novel preference optimization algorithms that outperformed existing baselines like DPO and SLiC on various tasks, including multi-turn dialogue, summarization, and controlled sentiment generation.
- One discovered algorithm, named DiscoPOP (Log Ratio Modulated Loss), consistently demonstrated state-of-the-art performance across the held-out evaluation tasks.
- DiscoPOP utilizes a dynamically weighted sum of logistic and exponential losses, adapting its behavior based on the difference in log ratios between the trained model and the reference model (see the sketch after this list).
- Analysis of DiscoPOP revealed intriguing properties, including a non-convex segment and negative gradients at the starting point, potentially contributing to its effectiveness.
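The finding above corresponds to the Log Ratio Modulated Loss. Below is a minimal PyTorch sketch of that objective as the paper describes it: a sigmoid gate blends a logistic (DPO-style) term with an exponential term, where the gate depends on the difference of log ratios between the trained policy and the reference model. The default β and temperature values here are illustrative, not definitive; the paper evaluates several β values.

```python
import torch
import torch.nn.functional as F

def discopop_loss(policy_chosen_logps: torch.Tensor,
                  policy_rejected_logps: torch.Tensor,
                  reference_chosen_logps: torch.Tensor,
                  reference_rejected_logps: torch.Tensor,
                  beta: float = 0.05,          # task-dependent; the paper sweeps several values
                  temperature: float = 0.05    # assumed gate temperature
                  ) -> torch.Tensor:
    # Log ratio of chosen vs. rejected responses under the policy and the reference.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps
    logits = pi_logratios - ref_logratios

    # Gate in (0, 1): how far the policy has already moved away from the reference.
    gate = torch.sigmoid(logits / temperature)

    logistic_term = -F.logsigmoid(beta * logits)  # DPO-style logistic loss
    exp_term = torch.exp(-beta * logits)          # exponential loss

    # Dynamically weighted sum of the two components (per-example losses).
    return (1 - gate) * logistic_term + gate * exp_term
```

In practice, this per-example loss would be averaged over a batch of preference pairs, as in DPO-style training.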
Main Conclusions:
This research demonstrates the potential of LLM-driven objective discovery for automating the development of novel and high-performing machine learning algorithms. The discovered DiscoPOP algorithm presents a promising new approach to preference optimization, surpassing existing methods in several text generation tasks.
Significance:
This work significantly contributes to the field of machine learning by introducing a novel and effective method for automated algorithm discovery. It paves the way for leveraging LLMs to design and optimize complex algorithms, potentially leading to breakthroughs in various domains.
Limitations and Future Research:
- The study primarily relied on GPT-4 for both code generation and evaluation, limiting reproducibility and potentially introducing bias. Future research could explore using open-source LLMs or alternative evaluation methods.
- The discovered DiscoPOP algorithm, while effective, relies on a single parameter (β) that influences both functional behavior and KL penalty. Further investigation into multi-parameter objectives and their optimization could lead to even more robust and adaptable algorithms.
- Future work could explore the application of this LLM-driven discovery pipeline to other areas of machine learning, such as reinforcement learning, supervised learning, and unsupervised learning.
Stats
DiscoPOP improved win rates against GPT-4 on AlpacaEval 2.0 from 11.23% (DPO) to 13.21%.
DiscoPOP outperforms other preference optimization algorithms on the length-controlled win rate against the SFT reference model in the AlpacaEval 2.0 benchmark.
In the summarization task, DiscoPOP achieves scores slightly below the top performers, especially in the length-controlled win rates.
For the IMDb positive-sentiment generation task, LRML (DiscoPOP) outperforms the DPO model on the reward/KL-divergence trade-off at low β values (0.025, 0.05, 0.1).
Quotes
"To date, all existing state-of-the-art preference optimization algorithms [Rafailov et al., 2023, Azar et al., 2023, Zhao et al., 2023] have been developed by human experts. Despite their advancements, these solutions are inherently constrained by human limitations, including creativity, ingenuity, and expert knowledge."
"In this work, we aim to address these limitations by performing LLM-driven discovery to automatically generate new state-of-the-art preference optimization algorithms without continual expert human intervention in the development process."
"After performing this automatic discovery process, we catalogue high-performing loss functions and introduce a particularly strong one we call Discovered Preference Optimization (DiscoPOP), a new algorithm."