This paper proposes a novel unsupervised evaluation approach for large language models (LLMs) called PiCO (Peer Review in LLMs based on Consistency Optimization). In this setting, both open-source and closed-source LLMs participate in a peer review process, where they answer unlabeled questions and evaluate each other's responses.
The key idea is to assign each LLM a learnable capability parameter and to optimize these parameters so that each model's capability is consistent with the final score it receives. The underlying assumption is that high-level LLMs evaluate others' answers more accurately than low-level ones, and that higher-level LLMs also receive higher scores for their own answers. The consistency optimization seeks a final score ranking that all LLMs "agree" on, thereby reducing the entropy (uncertainty) of the peer-review evaluation system.
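As a rough, non-authoritative sketch of this idea (not the authors' exact objective), the snippet below assumes the peer-review results have already been collected into an m x m score matrix `S`, where `S[i, j]` is the average score reviewer model i gives to model j's answers; the capability weights `w`, the softmax re-weighting, and the learning rate are simplifications introduced here for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pico_consistency_sketch(S, num_iters=100, lr=0.1):
    """Toy consistency optimization over a peer-review score matrix S.

    S[i, j] is the (hypothetical) average score that reviewer model i
    assigns to model j's answers. Each model carries a learnable
    capability weight w[i]; a model's final score is the w-weighted
    average of the reviews it receives, and w is nudged toward those
    final scores so that capability and obtained score stay consistent.
    """
    m = S.shape[0]
    w = np.full(m, 1.0 / m)           # start with uniform capability weights
    for _ in range(num_iters):
        G = w @ S                     # G[j]: weighted final score of model j
        Gz = (G - G.mean()) / (G.std() + 1e-8)
        # Consistency step: models with higher final scores get more reviewing weight.
        w = (1 - lr) * w + lr * softmax(Gz)
    return w @ S                      # final scores; argsort gives the ranking
```

Sorting models by the returned scores yields the learned leaderboard that is then compared against human rankings.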
The authors propose three metrics - PEN (Permutation Entropy), CIN (Count Inversions), and LIS (Longest Increasing Subsequence) - to evaluate the alignment between the learned LLM ranking and the ground-truth human preferences. Experiments on multiple crowdsourcing datasets show that the proposed PiCO framework can effectively obtain an LLM leaderboard closer to human preferences compared to existing evaluation methods.
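The three metrics are standard sequence statistics; an illustrative implementation is sketched below, computed over the models' learned rank positions when the models are listed in ground-truth human-preference order (the window size for permutation entropy and the lack of normalization are assumptions, as the paper's exact settings may differ).

```python
import bisect
from math import log

def count_inversions(rank):
    """CIN: pairs of models ordered differently from the human ranking."""
    return sum(rank[i] > rank[j]
               for i in range(len(rank))
               for j in range(i + 1, len(rank)))

def longest_increasing_subsequence(rank):
    """LIS: length of the longest run already in ground-truth order."""
    tails = []
    for x in rank:
        pos = bisect.bisect_left(tails, x)
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(tails)

def permutation_entropy(rank, order=3):
    """PEN: Shannon entropy of ordinal patterns in sliding windows.

    Follows the standard Bandt-Pompe definition; the window size `order`
    is an assumption, not necessarily the paper's exact choice.
    """
    counts = {}
    for i in range(len(rank) - order + 1):
        window = rank[i:i + order]
        pattern = tuple(sorted(range(order), key=lambda k: window[k]))
        counts[pattern] = counts.get(pattern, 0) + 1
    total = sum(counts.values())
    return -sum(c / total * log(c / total) for c in counts.values())

# `learned` lists each model's learned rank position, with models ordered by
# their ground-truth human rank; a perfectly aligned leaderboard is 0, 1, 2, ...
learned = [1, 0, 2, 4, 3, 5]
print(count_inversions(learned))                 # 2   (lower is better)
print(longest_increasing_subsequence(learned))   # 4   (higher is better)
print(permutation_entropy(learned))              # low entropy = near-ordered
```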
Source: key ideas extracted from arxiv.org, by Kun-Peng Nin..., 04-23-2024, https://arxiv.org/pdf/2402.01830.pdf