Key Concepts
Existing LLM evaluators are biased toward superficial quality, overlooking instruction-following ability. This work proposes systematic methods to mitigate that bias, including online calibration and offline contrastive training, effectively improving the fairness of LLM evaluation.
Abstract
The paper discusses the problem of bias in Large Language Model (LLM) evaluation, where existing evaluators tend to favor answers with better superficial quality (e.g., fluency, verbosity) over those that better follow the given instructions.
The authors propose two main methods to mitigate this bias:
- Online Mitigation by Calibration:
  - For probability-based evaluators, the authors model the superficial quality using pre-trained models and subtract it from the final evaluation score (see the first sketch after this list).
  - For generation-based evaluators, they design prompt templates to directly quantify the superficial quality and subtract it.
- Offline Mitigation by Contrastive Training:
  - The authors construct adversarial negative samples where the answer deviates from the instruction but has better superficial quality.
  - They then fine-tune the open-source judge models using contrastive training on both the original and the adversarial samples (see the second sketch after this list).
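A minimal sketch of the probability-based calibration, assuming the evaluation score is the length-normalized log-likelihood of the answer under the SFT judge and the superficial quality is approximated by the same likelihood under the pre-trained base model, so that only the SFT-vs-pre-trained difference (indicative of instruction alignment) remains. The checkpoint names and the weighting factor `alpha` are illustrative, not taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints: the judge is an SFT (instruction-tuned) model,
# the base model is its pre-trained, non-SFT counterpart.
SFT_JUDGE = "sft-judge-model"              # hypothetical checkpoint name
PRETRAINED_BASE = "pretrained-base-model"  # hypothetical checkpoint name

tokenizer = AutoTokenizer.from_pretrained(SFT_JUDGE)
sft_model = AutoModelForCausalLM.from_pretrained(SFT_JUDGE).eval()
base_model = AutoModelForCausalLM.from_pretrained(PRETRAINED_BASE).eval()

def answer_loglik(model, prompt: str, answer: str) -> float:
    """Length-normalized log-probability of `answer` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position i predict the token at position i + 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = [log_probs[i, ids[0, i + 1]].item()
                for i in range(prompt_len - 1, ids.shape[1] - 1)]
    return sum(token_lp) / len(token_lp)

def calibrated_score(instruction: str, answer: str, alpha: float = 1.0) -> float:
    """SFT evaluation score minus the superficial quality modeled by the base model."""
    prompt = f"Instruction: {instruction}\nAnswer: "
    score = answer_loglik(sft_model, prompt, answer)         # instruction-aware evaluation
    superficial = answer_loglik(base_model, prompt, answer)  # superficial-quality proxy
    return score - alpha * superficial
```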
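For the offline mitigation, a generic pairwise contrastive objective can express the intent of fine-tuning a judge on original and adversarial samples: the instruction-following answer should outscore the adversarial one that merely reads better. The scoring function and loss below are a sketch under that assumption, not necessarily the paper's exact training objective.

```python
import torch
import torch.nn.functional as F

def judge_score(judge_model, tokenizer, instruction: str, answer: str) -> torch.Tensor:
    """Scalar judge score: here, the mean log-likelihood of the answer given the instruction."""
    prompt = f"Instruction: {instruction}\nAnswer: "
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    logits = judge_model(ids).logits  # keep gradients: this score is trained
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    answer_lp = [log_probs[i, ids[0, i + 1]]
                 for i in range(prompt_len - 1, ids.shape[1] - 1)]
    return torch.stack(answer_lp).mean()

def contrastive_loss(judge_model, tokenizer, batch) -> torch.Tensor:
    """Pairwise loss over original and adversarial samples.

    Each item holds an instruction, a positive answer that follows it, and an
    adversarial negative that deviates from it but has better superficial quality.
    """
    losses = []
    for item in batch:
        s_pos = judge_score(judge_model, tokenizer, item["instruction"], item["positive"])
        s_neg = judge_score(judge_model, tokenizer, item["instruction"], item["adversarial"])
        # The judge should prefer instruction following over superficial quality.
        losses.append(-F.logsigmoid(s_pos - s_neg))
    return torch.stack(losses).mean()
```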
The authors evaluate their methods on the LLMBar benchmark, which includes a natural test set and four adversarial test sets designed to probe the evaluator's bias. The results show that both the online calibration and the offline contrastive training effectively mitigate the evaluation bias on the adversarial sets while maintaining comparable performance on the natural set.
The paper also discusses the trade-off between bias mitigation and evaluation accuracy: completely excluding superficial quality can degrade performance. The authors find that mitigation up to a certain strength improves evaluation accuracy, while excessive mitigation eventually decreases it.
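This trade-off can be made concrete by treating the amount of subtracted superficial quality as a tunable coefficient. The sweep below reuses the `calibrated_score` sketch from above; `alpha` and the `evaluate_on_llmbar` helper are hypothetical, meant only to illustrate varying the mitigation strength.

```python
# alpha = 0 keeps the original, biased score; large alpha fully excludes
# superficial quality and can hurt accuracy, per the trade-off described above.
# evaluate_on_llmbar is a hypothetical helper that scores a judge function on the benchmark.
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    acc = evaluate_on_llmbar(lambda inst, ans: calibrated_score(inst, ans, alpha=alpha))
    print(f"alpha={alpha:.2f}  accuracy={acc:.3f}")
```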
Statistics
LLMs typically gain their instruction following ability during supervised fine-tuning (SFT).
The difference between the prediction distribution of pre-trained and SFT models can be indicative of instruction alignment.
Answers with better superficial quality (e.g., fluency, verbosity, engaging tones) but misaligned with the instruction may receive higher scores in the original evaluation.
Quotes
"As the capabilities of LLMs continue to develop across various tasks, it is essential to evaluate them from a comprehensive perspective."
"Relying on external API for evaluation may introduce consideration about privacy leakage."
"Even state-of-the-art LLM evaluators struggle to provide unbiased evaluation on their benchmark."