Analyzing the Ranking of Large Language Models with a Prediction-Powered Framework
The author introduces a statistical framework for measuring the uncertainty in rankings constructed from pairwise comparisons made by humans and by large language models. For each model under comparison, the framework provides a rank-set, a set of ranking positions the model may occupy, and the rank-sets come with a coverage guarantee for the true ranking consistent with human preferences.
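
To make the idea of a rank-set concrete, the following is a minimal Python sketch under stated assumptions, not the author's actual procedure. It assumes each model is summarized by a single win rate, that a small set of comparisons carries both a human and an LLM judgment while a larger set carries only an LLM judgment, and that a Bonferroni correction is an acceptable way to get joint coverage across models. The function names `ppi_interval` and `rank_sets` are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist, mean, variance


def ppi_interval(human, llm_paired, llm_only, alpha):
    """Prediction-powered confidence interval for a mean win rate.

    human, llm_paired: outcomes (1 = win, 0 = loss) on the same n comparisons.
    llm_only: LLM outcomes on N further comparisons with no human label.
    """
    n, N = len(human), len(llm_only)
    # The rectifier measures how far LLM judgments drift from human ones.
    rectifier = [h - f for h, f in zip(human, llm_paired)]
    estimate = mean(llm_only) + mean(rectifier)
    std_err = (variance(llm_only) / N + variance(rectifier) / n) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * std_err, estimate + z * std_err


def rank_sets(intervals):
    """Map {model: (lo, hi)} to {model: (best_rank, worst_rank)}; rank 1 is best."""
    k = len(intervals)
    result = {}
    for model, (lo, hi) in intervals.items():
        # Models whose whole interval lies above ours must rank ahead of us.
        surely_better = sum(
            1 for other, (lo_o, _) in intervals.items() if other != model and lo_o > hi
        )
        # Models whose whole interval lies below ours must rank behind us.
        surely_worse = sum(
            1 for other, (_, hi_o) in intervals.items() if other != model and hi_o < lo
        )
        result[model] = (1 + surely_better, k - surely_worse)
    return result


# Toy intervals (illustrative numbers only, not results from the source).
toy = {"a": (0.70, 0.90), "b": (0.40, 0.75), "c": (0.10, 0.30)}
print(rank_sets(toy))  # {'a': (1, 2), 'b': (1, 2), 'c': (3, 3)}
```

In this sketch, overlapping intervals yield overlapping rank-sets: models `a` and `b` cannot be separated, so each may hold rank 1 or 2, while `c` is pinned to rank 3. To aim for joint coverage over k models at level 1 - alpha, one could call `ppi_interval` with `alpha / k` for each model; that Bonferroni step is a conservative assumption of this sketch rather than something stated in the source.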