A Panel of LLM Evaluators (PoLL) composed of diverse models correlates better with human judgments than a single large judge model such as GPT-4, while exhibiting less intra-model bias at a fraction of the cost.
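The mechanism is easy to sketch: query several smaller judge models from different families and pool their verdicts instead of relying on one judge. A minimal illustration, assuming each judge is a callable returning a numeric score (the inline judges below are placeholders, not real model calls):

```python
from statistics import mean

def poll_judges(question: str, answer: str, judges) -> float:
    """Score an answer with a panel of heterogeneous LLM judges and pool
    their scores (mean pooling here; max voting is another option for
    categorical correct/incorrect verdicts)."""
    scores = [judge(question, answer) for judge in judges]
    return mean(scores)

# Hypothetical stand-ins for judges from different model families; in practice
# each would wrap an API call that returns a 0/1 verdict or a rating.
judges = [
    lambda q, a: 1.0,
    lambda q, a: 0.0,
    lambda q, a: 1.0,
]

print(poll_judges("What is 2 + 2?", "4", judges))  # ~0.67: the panel leans "correct"
```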
Pairwise evaluation of generated text using large language models (LLMs) is susceptible to adversarial examples, highlighting the need for improved evaluation methods like the proposed PREPAIR approach.
ProtocoLLM is a novel framework for automatically evaluating the ability of large language models (LLMs) to generate executable scientific protocols, specifically focusing on biology protocols and utilizing a predefined set of lab actions and a novel LLM-based evaluation method called LLAM-EVAL.
REVISEVAL, a novel evaluation paradigm, leverages the revision capabilities of large language models (LLMs) to generate response-adapted references, thereby improving the accuracy and reliability of LLM-based text generation evaluation, surpassing traditional reference-free and reference-based methods.
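A rough sketch of the paradigm, with `llm` and `scorer` as assumed callables and a hypothetical revision prompt (not the paper's exact wording): revise the candidate into a tailored reference, then score the candidate against it.

```python
def revise_then_evaluate(task: str, response: str, llm, scorer) -> float:
    """Response-adapted reference evaluation: first revise the candidate into
    an improved reference for this specific task, then score the original
    candidate against that tailored reference."""
    revision_prompt = (
        f"Task: {task}\n"
        f"Candidate response: {response}\n"
        "Revise the candidate into the best possible response to the task. "
        "Return only the revised text."
    )
    reference = llm(revision_prompt)      # response-adapted reference
    return scorer(response, reference)    # e.g., BLEU/BERTScore or an LLM judge

# Hypothetical usage:
# score = revise_then_evaluate(task, response, llm=call_model, scorer=bert_score_f1)
```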
Auto-Arena is a novel framework that leverages LLM-powered agents for automated and reliable evaluation of large language models, achieving high alignment with human preferences through simulated peer debates and committee discussions.
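A bare-bones sketch of the debate-then-committee loop, assuming candidate models and judges are plain prompt-to-text callables; the actual framework adds a structured committee discussion before the vote.

```python
def peer_battle(question, model_a, model_b, committee, rounds=2):
    """Simplified peer battle: two candidate models answer and rebut each other
    over a few rounds, then a committee of LLM judges votes on the winner."""
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        transcript.append("A: " + model_a("\n".join(transcript)))
        transcript.append("B: " + model_b("\n".join(transcript)))
    debate = "\n".join(transcript)
    votes = [judge(debate + "\nWho argued better, A or B? Answer with A or B.").strip()
             for judge in committee]
    return max(set(votes), key=votes.count)  # majority vote of the committee
```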
This research paper introduces a novel human-AI collaborative framework for generating challenging math questions to address the saturation of existing LLM evaluation benchmarks.
A self-reflection mechanism, in particular iteratively improving the quality of generated rationales via Direct Preference Optimization, can significantly enhance the ability of large language models to act as fine-grained evaluators.
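A schematic of one round of that iterative loop, assuming a rationale-sampling function and a preference signal (e.g., agreement with gold fine-grained scores) are available; the helper names are hypothetical.

```python
def build_rationale_pairs(examples, generate, prefer):
    """Collect DPO preference data for a self-reflective evaluator: sample two
    rationales per evaluation prompt, keep the one the preference signal
    favours as `chosen` and the other as `rejected`.
    `generate(prompt)` and `prefer(r1, r2, gold)` are assumed callables."""
    pairs = []
    for ex in examples:
        r1, r2 = generate(ex["prompt"]), generate(ex["prompt"])
        chosen, rejected = (r1, r2) if prefer(r1, r2, ex["gold"]) else (r2, r1)
        pairs.append({"prompt": ex["prompt"], "chosen": chosen, "rejected": rejected})
    return pairs  # train with a DPO objective, then repeat with the improved evaluator
```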
Large language models (LLMs) struggle to consistently maintain truth and reason effectively with formal syntax, highlighting the need for dynamic, scalable, and automated evaluation benchmarks like ∀uto∃∨∧L.
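The generate-and-verify idea behind such a dynamic benchmark can be illustrated with auto-generated propositional formulas whose ground truth is computed programmatically rather than hand-labeled; the formula grammar and prompt below are illustrative, not the benchmark's actual task format.

```python
import random

def random_formula(variables, depth=2):
    """Generate a random propositional formula in Python syntax (and/or/not)."""
    if depth == 0:
        return random.choice(variables)
    op = random.choice(["and", "or", "not"])
    a = random_formula(variables, depth - 1)
    if op == "not":
        return f"(not {a})"
    b = random_formula(variables, depth - 1)
    return f"({a} {op} {b})"

vars_ = ["p", "q", "r"]
formula = random_formula(vars_)
assignment = {v: random.choice([True, False]) for v in vars_}
expected = eval(formula, {}, assignment)   # ground truth computed, not annotated
prompt = f"Under the assignment {assignment}, is the formula {formula} true or false?"
# model_answer = llm(prompt)                              # hypothetical model call
# correct = model_answer.strip().lower().startswith(str(expected).lower())
```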
Polyrating is a novel rating system for large language models (LLMs) that addresses limitations of traditional methods by incorporating bias detection, leveraging existing data to improve sample efficiency, and enabling multi-dimensional comparisons across tasks.
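A toy sketch of the additive-bias idea behind such rating systems: pairwise outcomes are modeled as a logistic function of the skill difference plus bias terms, and ratings are fit by maximizing a (penalized) likelihood over existing data. This is not Polyrating's exact formulation, only the general shape; the data and bias names are hypothetical.

```python
import math

def win_prob(delta):
    """Probability that A beats B given rating difference plus bias terms."""
    return 1.0 / (1.0 + math.exp(-delta))

def log_likelihood(outcomes, ratings, biases):
    """Log-likelihood of pairwise outcomes under an additive rating model:
    skill difference plus a per-judge bias term; maximizing this over the
    ratings and biases yields both model scores and bias estimates."""
    ll = 0.0
    for a, b, judge, a_wins in outcomes:
        delta = ratings[a] - ratings[b] + biases.get(judge, 0.0)
        p = win_prob(delta)
        ll += math.log(p if a_wins else 1.0 - p)
    return ll

# Hypothetical data: (model_a, model_b, judge, did_a_win)
outcomes = [("A", "B", "gpt_judge", True), ("A", "B", "human", False)]
print(log_likelihood(outcomes, {"A": 0.3, "B": 0.0}, {"gpt_judge": 0.2}))
```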