Key Concepts
Inconsistent human evaluations of language models, particularly in pairwise comparisons, can be attributed to the difficulty in distinguishing between model outputs. The SEPARABILITY metric addresses this by quantifying the distinguishability of generations from different models on a given input, offering a measure of evaluation reliability and enabling more robust model comparisons.
Summary
This research paper introduces SEPARABILITY, a novel meta-evaluation metric designed to assess the reliability of human preference judgments in evaluating large language models (LLMs). The authors argue that traditional pairwise comparisons often suffer from inconsistencies, particularly when model outputs are very similar or exhibit high variability due to stochastic decoding.
The paper identifies two key factors behind this challenge: high cross-alignment (generations from different models are similar to one another) and low self-alignment (a single model's generations vary widely across samples). SEPARABILITY combines these factors to quantify how distinguishable two models' outputs are for a given input.
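To make the interplay of the two factors concrete, here is a minimal Python sketch. The similarity function (a token-overlap Jaccard stand-in), the averaging, and the final subtraction are illustrative assumptions, and the function names (`separability`, `token_jaccard`) are hypothetical; the paper's exact alignment measure and aggregation may differ.

```python
from itertools import combinations, product
from statistics import mean

def token_jaccard(x: str, y: str) -> float:
    """Crude stand-in similarity: token-overlap Jaccard between two strings."""
    xs, ys = set(x.split()), set(y.split())
    return len(xs & ys) / len(xs | ys) if xs | ys else 1.0

def separability(gens_a, gens_b, sim=token_jaccard) -> float:
    """Distinguishability of two sets of sampled generations for ONE input.

    gens_a, gens_b: lists of generations (>= 2 each) from models A and B.
    High when each model agrees with itself (high self-alignment) but the
    two models disagree with each other (low cross-alignment).
    """
    # Cross-alignment: average similarity between A's and B's generations.
    cross = mean(sim(a, b) for a, b in product(gens_a, gens_b))
    # Self-alignment: average similarity within each model's own generations.
    self_a = mean(sim(x, y) for x, y in combinations(gens_a, 2))
    self_b = mean(sim(x, y) for x, y in combinations(gens_b, 2))
    # Separability as the gap between self- and cross-alignment (clipped at 0).
    return max(0.0, mean([self_a, self_b]) - cross)
```

Under this sketch, near-identical outputs from both models drive cross-alignment up and the score toward zero, while stochastic decoding that makes a model disagree with itself drives self-alignment down with the same effect.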
The authors demonstrate the effectiveness of SEPARABILITY through experiments on several generation tasks and benchmarks, comparing different LLM pairs. Results show that instances with high SEPARABILITY scores receive more consistent preference ratings from both human and automated evaluators.
Furthermore, the paper explores the application of SEPARABILITY in ELO ratings, a popular method for ranking LLMs. By incorporating SEPARABILITY into the ELO update rule, the authors propose a more nuanced ranking system that accounts for the reliability of individual preference comparisons.
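As one illustration of how a per-comparison reliability weight could enter the standard Elo update, the sketch below scales the K-factor by the instance's SEPARABILITY score. The scaling scheme and the function names (`expected_score`, `separability_weighted_elo`) are assumptions for illustration, not necessarily the paper's exact update rule.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation that model A beats model B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def separability_weighted_elo(r_a, r_b, outcome_a, sep, k=32.0):
    """One pairwise rating update with the step size scaled by SEPARABILITY.

    outcome_a: 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie.
    sep: SEPARABILITY of the test instance, in [0, 1]; comparisons on
    low-separability (less reliable) instances move the ratings less.
    """
    delta = k * sep * (outcome_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

# Example: a win for model A on a moderately separable instance (sep=0.4)
# shifts the ratings less than a win on a highly separable one (sep=0.9).
print(separability_weighted_elo(1500, 1500, outcome_a=1.0, sep=0.4))
print(separability_weighted_elo(1500, 1500, outcome_a=1.0, sep=0.9))
```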
The paper concludes that SEPARABILITY provides a valuable tool for LLM developers and users to:
- Identify test instances and benchmarks that yield reliable preference judgments.
- Gain insights into the comparative performance of different LLMs.
- Develop more robust evaluation and ranking systems for LLMs.
The authors suggest future research directions, including applying SEPARABILITY to filter preference tuning data for learning from human feedback.
Statistics
When comparing five different summary pairs generated by different LLMs for the same news articles, human raters picked the same model only 46% of the time.
On the CNN/DailyMail summarization benchmark, the average SEPARABILITY score was 0.21 for GPT-3.5 vs. Vicuna 7B, indicating low distinguishability.
GPT-3.5 and FLAN-T5-XXL, two models with different architectures, produced more consistent human ratings even at lower SEPARABILITY ranges.
For SEPARABILITY scores below 0.2, the majority of human preference ratings were inconsistent.
When SEPARABILITY reached approximately 0.4, inconsistent ratings decreased to less than half for all tested model and dataset configurations.
Quotes
"We argue that some test instances might be better suited for human evaluation than others."
"SEPARABILITY, a meta-evaluation measure that determines, for a single instance, how distinguishable two sets of generations from two models are."
"Our experiments show that instances with high SEPARABILITY values yield more consistent preference ratings from both human- and auto-raters."