CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences
Core Concepts
Large language models can be effectively aligned with coding preferences using the CodeUltraFeedback dataset and reinforcement learning from AI feedback (RLAIF).
Abstract
Evaluating the alignment of large language models (LLMs) with user-defined coding preferences is challenging.
Existing benchmarks fail to capture the nuances of user instructions and LLM outputs.
CodeUltraFeedback is a preference dataset for tuning LLMs toward coding preferences through AI feedback.
CODAL-Bench is a companion benchmark for assessing LLM alignment with coding preferences via an LLM-as-a-judge protocol (see the judge sketch at the end of this page).
Supervised fine-tuning (SFT) and direct preference optimization (DPO) improve LLM alignment as well as functional correctness on the HumanEval benchmarks (a minimal DPO sketch follows this abstract).
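The paper tunes CodeLlama-7B-Instruct with SFT followed by DPO on CodeUltraFeedback's preference pairs. Below is a minimal, non-authoritative sketch of what such a DPO run could look like with Hugging Face's TRL library; the toy dataset, hyperparameters, and constructor argument names (which have shifted across TRL releases, e.g. `tokenizer` vs. `processing_class`) are assumptions for illustration, not the paper's exact training setup.

```python
# A toy DPO run with Hugging Face TRL. The preference pairs below are
# invented placeholders; CodeUltraFeedback supplies the real ones.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# DPO expects (prompt, chosen, rejected) triples: the response an AI judge
# preferred versus the one it rated lower.
pairs = Dataset.from_dict({
    "prompt": ["Write a Python function that reverses a string."],
    "chosen": ["def reverse(s):\n    return s[::-1]"],
    "rejected": ["def reverse(s):\n    return s"],
})

# The paper aligns CodeLlama-7B-Instruct; any causal LM checkpoint works here.
model_name = "codellama/CodeLlama-7b-Instruct-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# beta controls how far the tuned policy may drift from the frozen
# reference model (TRL clones a reference internally when none is passed).
config = DPOConfig(
    output_dir="dpo-codellama",
    beta=0.1,
    per_device_train_batch_size=1,
    max_steps=10,  # illustrative; real runs train far longer
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=pairs,
    processing_class=tokenizer,  # named `tokenizer` in older TRL releases
)
trainer.train()
```

The key design point DPO inherits from the paper's RLAIF setup is that no reward model is trained: the judge's pairwise preferences are consumed directly as (chosen, rejected) pairs.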
Stats
"Our results show that CodeLlama-7B-Instruct, aligned through reinforcement learning from AI feedback (RLAIF) with direct preference optimization (DPO) using CodeUltraFeedback’s AI feedback data, outperforms 34B LLMs on CODAL-Bench."
"Finally, we show that preference tuning does not hinder the capability of CodeLlama-7B-Instruct in generating functionally correct code."
Quotes
"Our contributions bridge the gap in preference tuning of LLMs for code and set the stage for further advancements in model alignment and RLAIF for code intelligence."