Sample-Efficient Human Evaluation of Large Language Models across Multiple Scenarios
A sample-efficient human evaluation approach based on maximum discrepancy competition is proposed to fairly assess and rank the performance of large language models across diverse scenarios, including scientific knowledge understanding, mathematical reasoning, creative writing, and code generation.
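The following is a minimal sketch of the maximum discrepancy (MAD) competition idea summarized above, under assumptions not spelled out in the text: disagreement between two models' responses is measured with a generic embedding distance, human judgments arrive as pairwise win counts, and the global ranking is aggregated with a standard Bradley-Terry maximum-likelihood fit. All function and variable names are illustrative, not taken from the paper.

```python
from itertools import combinations
import numpy as np


def select_max_discrepancy_prompts(prompts, responses_a, responses_b, embed, k=5):
    """Pick the k prompts on which two models disagree the most.

    `embed` maps a response string to a vector; discrepancy is measured as
    the Euclidean distance between the two models' response embeddings
    (an assumed proxy for behavioural disagreement).
    """
    gaps = [
        np.linalg.norm(embed(ra) - embed(rb))
        for ra, rb in zip(responses_a, responses_b)
    ]
    top = np.argsort(gaps)[::-1][:k]
    return [prompts[i] for i in top]


def bradley_terry_ranking(win_counts, n_models, iters=200):
    """Aggregate pairwise human win counts into global model scores.

    `win_counts[(i, j)]` is how often model i beat model j in the human study.
    Returns one score per model (higher = better), fitted with simple
    minorization-maximization iterations for the Bradley-Terry model.
    """
    scores = np.ones(n_models)
    for _ in range(iters):
        new = np.zeros(n_models)
        for i in range(n_models):
            wins = sum(win_counts.get((i, j), 0) for j in range(n_models) if j != i)
            denom = sum(
                (win_counts.get((i, j), 0) + win_counts.get((j, i), 0))
                / (scores[i] + scores[j])
                for j in range(n_models) if j != i
            )
            new[i] = wins / denom if denom > 0 else scores[i]
        scores = new / new.sum() * n_models
    return scores


if __name__ == "__main__":
    # Toy usage: three models with synthetic human win counts collected on
    # MAD-selected prompts (the numbers are made up for illustration only).
    wins = {(0, 1): 8, (1, 0): 2, (0, 2): 7, (2, 0): 3, (1, 2): 6, (2, 1): 4}
    print(bradley_terry_ranking(wins, n_models=3))
```

In this sketch, prompt selection concentrates the limited human-annotation budget on the inputs where each model pair disagrees most, which is the sample-efficiency argument of MAD-style competition; the specific discrepancy measure and aggregation model used in the paper may differ.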