Core Concepts
Given a dataset of prompts and a set of LLMs, the models can be ranked without access to ground truth by considering triplets of models.
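A minimal sketch of how a triplet could be used, under assumptions that are not spelled out in this summary: each model in a triplet judges the other two and names the one it considers worse, and a model named by at least two members is treated as the weakest of the three. The `judge` callable, the `TOY_SKILL` scores, and the model names `A` to `D` are hypothetical stand-ins for real LLM judgments, not the authors' implementation.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, Optional, Sequence, Tuple

# Hypothetical judge signature: judge(evaluator, a, b) returns whichever of
# a or b the evaluator considers worse on the prompt dataset.
Judge = Callable[[str, str, str], str]


def weakest_in_triplet(triplet: Tuple[str, str, str], judge: Judge) -> Optional[str]:
    """Return the model named as worse by at least two triplet members, else None."""
    votes = Counter()
    for evaluator in triplet:
        # Each model evaluates the other two members of the triplet.
        a, b = (m for m in triplet if m != evaluator)
        votes[judge(evaluator, a, b)] += 1
    model, count = votes.most_common(1)[0]
    return model if count >= 2 else None


def weakest_counts(models: Sequence[str], judge: Judge) -> Dict[str, int]:
    """Count, over all triplets, how often each model is identified as weakest."""
    counts = {m: 0 for m in models}
    for triplet in combinations(models, 3):
        loser = weakest_in_triplet(triplet, judge)
        if loser is not None:
            counts[loser] += 1
    return counts


# Toy stand-in for real judgments: made-up hidden skill scores for the demo.
TOY_SKILL = {"A": 3, "B": 2, "C": 1, "D": 0}


def toy_judge(evaluator: str, a: str, b: str) -> str:
    """An evaluator stronger than the weaker candidate picks it correctly;
    an evaluator weaker than both candidates picks the wrong one (worst case)."""
    worse, better = sorted((a, b), key=TOY_SKILL.get)
    return worse if TOY_SKILL[evaluator] > TOY_SKILL[worse] else better


counts = weakest_counts(["A", "B", "C", "D"], toy_judge)
```

In this toy setup the weakest member of every triplet is outvoted even when it judges incorrectly. The two strongest models are never flagged, so these counts alone do not separate them; a complete ranking procedure would need an additional step that this sketch does not attempt.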
Statistics
"In experiments on different generative tasks (summarization, multiple-choice, and dialog), our methods reliably recover close to true rankings without reference data."
"Our method ranked the models [M1, M3, M2, PT] as compared to the human annotators [M1, M2, M3, PT]."
Quotes
"Inspired by real life where both an expert and a knowledgeable person can identify a novice, our main idea is to consider triplets of models."
"Our proposed approaches can be seen as a first pass to substantially reduce the effort needed for trustworthy evaluations of LLMs."