This paper proposes a novel unsupervised evaluation approach for large language models (LLMs) called PiCO (Peer Review in LLMs based on Consistency Optimization). In this setting, both open-source and closed-source LLMs participate in a peer review process, where they answer unlabeled questions and evaluate each other's responses.
The key idea is to assign each LLM a learnable capability parameter and optimize it to maximize the consistency between the LLM's capability and its final score. The underlying assumption is that stronger LLMs evaluate others' answers more accurately than weaker ones and also receive higher scores for their own responses. This consistency optimization seeks a final score ranking that all LLMs "agree" on, thereby reducing the entropy of the peer-review evaluation system.
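To make the optimization concrete, here is a minimal sketch of one way such a consistency objective could be set up. The toy score matrix, the softmax parameterization of the capability weights, and the squared-error consistency loss are illustrative assumptions, not the paper's exact formulation.

```python
import torch

# Toy peer-review score matrix: scores[i, j] is the average score that
# reviewer model i assigned to answers produced by model j (made-up values).
torch.manual_seed(0)
num_models = 5
scores = torch.rand(num_models, num_models)

# Learnable capability parameter w_i for each model (hypothetical parameterization).
w = torch.nn.Parameter(torch.zeros(num_models))
optimizer = torch.optim.Adam([w], lr=0.05)

for step in range(500):
    optimizer.zero_grad()
    confidence = torch.softmax(w, dim=0)       # normalized capability weights
    final_score = confidence @ scores          # G_j = sum_i w_i * scores[i, j]
    # Consistency loss: push each model's capability weight toward agreement
    # with its (normalized) final score, so capable models count more as reviewers.
    loss = torch.sum((confidence - torch.softmax(final_score, dim=0)) ** 2)
    loss.backward()
    optimizer.step()

ranking = torch.argsort(final_score.detach(), descending=True)
print("learned ranking (best to worst):", ranking.tolist())
```

In this toy formulation, models whose judgments are weighted more heavily contribute more to every other model's final score, and the loss drives the weights toward agreement with the scores they induce.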
The authors propose three metrics, PEN (Permutation Entropy), CIN (Count Inversions), and LIS (Longest Increasing Subsequence), to measure how well the learned LLM ranking aligns with ground-truth human preferences. Experiments on multiple crowdsourced datasets show that the PiCO framework produces an LLM leaderboard that is closer to human preferences than existing evaluation methods.
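The three metrics can be computed directly once the learned ranking is expressed relative to the human ordering. The sketch below uses standard definitions (pairwise inversion counting, patience-sorting LIS, and ordinal-pattern permutation entropy); the exact variants and the example ranking are assumptions for illustration.

```python
from bisect import bisect_left
from collections import Counter
from math import log

def count_inversions(rank):
    """CIN: number of pairs ordered oppositely to the reference (identity) ordering."""
    n = len(rank)
    return sum(1 for i in range(n) for j in range(i + 1, n) if rank[i] > rank[j])

def longest_increasing_subsequence(rank):
    """LIS: length of the longest increasing subsequence (patience sorting, O(n log n))."""
    tails = []
    for x in rank:
        pos = bisect_left(tails, x)
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(tails)

def permutation_entropy(rank, m=3):
    """PEN: Shannon entropy of ordinal patterns over sliding windows of size m
    (a common permutation-entropy definition; the paper's exact variant may differ)."""
    patterns = Counter()
    for i in range(len(rank) - m + 1):
        window = rank[i:i + m]
        patterns[tuple(sorted(range(m), key=lambda k: window[k]))] += 1
    total = sum(patterns.values())
    return -sum((c / total) * log(c / total) for c in patterns.values())

# Example: each model's position in the human ground-truth ordering (illustrative values).
learned_vs_human = [1, 3, 2, 5, 4, 6]
print("CIN:", count_inversions(learned_vs_human))                # fewer inversions = better aligned
print("LIS:", longest_increasing_subsequence(learned_vs_human))  # longer = better aligned
print("PEN:", round(permutation_entropy(learned_vs_human), 3))   # lower = better aligned
```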
Source: Kun-Peng Nin... at arxiv.org, 04-23-2024, https://arxiv.org/pdf/2402.01830.pdf