Zhang, Q., Wang, Y., Yu, T., Jiang, Y., Wu, C., Li, L., ... & Ma, C. (2024). RevisEval: Improving LLM-as-a-Judge via Response-Adapted References. arXiv preprint arXiv:2410.05193v1.
This paper introduces REVISEVAL, a novel evaluation paradigm designed to address a key limitation of existing LLM-based text generation evaluation methods: the lack of reliable, relevant references for open-ended tasks. The authors aim to show that, by generating response-adapted references, REVISEVAL improves accuracy and reduces bias in LLM-as-a-Judge evaluations.
REVISEVAL employs a two-step process: response-adapted reference generation followed by reference-based evaluation. First, an LLM reviser refines the model-generated response according to the task instruction and evaluation rubric, producing a response-adapted reference. This reference then guides the evaluation, either with an LLM-as-a-Judge or with traditional reference-based metrics. The authors evaluate REVISEVAL on several NLG tasks (summarization, translation, data-to-text, story generation) and open-ended instruction-following benchmarks (MT-Bench, AlpacaFarm, LLMBar), comparing it against reference-free and reference-based baselines using both a proprietary LLM (GPT-4) and an open-source LLM (Llama 3.1-8B).
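The two-step pipeline can be pictured roughly as follows. This is a minimal illustrative sketch, not the authors' released implementation: the helper `call_llm` stands in for any chat-completion API (e.g., GPT-4 or Llama 3.1-8B), and the prompt wording is an assumed paraphrase of the paper's description rather than its exact templates.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a call to an LLM backend (e.g., GPT-4 or Llama 3.1-8B)."""
    raise NotImplementedError


def revise_to_reference(instruction: str, response: str, rubric: str) -> str:
    """Step 1: an LLM reviser refines the candidate response into a response-adapted reference."""
    prompt = (
        f"Task instruction:\n{instruction}\n\n"
        f"Evaluation rubric:\n{rubric}\n\n"
        f"Candidate response:\n{response}\n\n"
        "Revise the candidate response so that it fully satisfies the instruction "
        "and rubric. Return only the revised text."
    )
    return call_llm(prompt)


def judge_with_reference(instruction: str, response: str, reference: str, rubric: str) -> str:
    """Step 2: reference-based evaluation, shown here as an LLM-as-a-Judge prompt.

    The response-adapted reference could instead be passed to a traditional
    reference-based metric (e.g., BLEU or BERTScore) in place of a judge prompt.
    """
    prompt = (
        f"Task instruction:\n{instruction}\n\n"
        f"Reference answer:\n{reference}\n\n"
        f"Response to evaluate:\n{response}\n\n"
        f"Rubric:\n{rubric}\n\n"
        "Rate the response against the reference on a 1-10 scale and briefly justify the score."
    )
    return call_llm(prompt)


def reviseval(instruction: str, response: str, rubric: str) -> str:
    """Full pipeline: revise the response into a reference, then judge against it."""
    reference = revise_to_reference(instruction, response, rubric)
    return judge_with_reference(instruction, response, reference, rubric)
```

The key design point this sketch illustrates is that the reference is derived from the response under evaluation, so it stays topically and stylistically relevant even for open-ended tasks where a single fixed gold reference would be unreliable.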
REVISEVAL consistently outperforms reference-free and reference-based evaluation methods across all tested tasks and benchmarks. It shows markedly higher correlation with human judgments on the NLG tasks and higher pairwise-comparison accuracy on the instruction-following benchmarks. Notably, REVISEVAL proves particularly effective at reducing bias, as evidenced by its superior performance on LLMBar, a benchmark designed to expose superficial quality biases. Moreover, the study finds that response-adapted references substantially boost traditional reference-based metrics, which in some cases even surpass reference-free LLM-as-a-Judge evaluation.
The study highlights the importance of relevant references in LLM evaluation and demonstrates the effectiveness of leveraging LLMs' generative capabilities to create such references. REVISEVAL offers a promising way to overcome the limitations of existing evaluation paradigms, paving the way for more accurate and reliable assessment of LLM-generated text.
This research significantly contributes to the field of LLM evaluation by introducing a novel and effective paradigm that addresses the challenges posed by open-ended text generation tasks. REVISEVAL's ability to generate response-adapted references has the potential to improve the development and deployment of more robust and reliable LLMs for various applications.
While REVISEVAL shows promising results, further investigation is needed to explore its applicability to other domains and languages. Additionally, future research could explore alternative revision strategies and evaluate the impact of different LLM revisers on the overall performance of REVISEVAL.