This paper introduces REC, a novel large language model (LLM) designed to automatically evaluate the quality of AI-generated text, producing ratings, explanations, and verifiable citations to support its assessments.
IdeaBench is a novel benchmark system designed to evaluate the ability of large language models (LLMs) to generate research ideas, moving beyond simple similarity metrics to assess quality indicators like novelty and feasibility using a personalized ranking system and the "Insight Score."
Large language models (LLMs) are evaluated for their culinary creativity in adapting recipes to different cuisines using a novel benchmark called ASH (authenticity, sensitivity, harmony), revealing strengths and limitations in their ability to understand and apply cultural nuances in recipe creation.
While large language models (LLMs) show promise in evaluating general tasks, relying on LLMs alone as judges has significant limitations when evaluating domain-specific tasks that require expert knowledge, highlighting the need to incorporate human experts into the evaluation pipeline.
Despite advancements, large language models (LLMs) fail to consistently demonstrate human-like reasoning in a simple economic game, highlighting their limitations as human surrogates in social science research.
Existing evaluations of mathematical reasoning in LLMs, relying on static datasets and final answers, are insufficient. ReasonAgain, a novel evaluation method, leverages symbolic programs to generate perturbations of math questions, revealing the fragility of reasoning abilities in state-of-the-art LLMs.
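To make the perturbation idea concrete, here is a minimal sketch (not ReasonAgain's actual code; the template, parameter names, and solver below are illustrative assumptions) of how a math question can be tied to a small symbolic program so that parameter values are resampled and the ground-truth answer is recomputed consistently for each perturbed variant.

```python
# Illustrative sketch only: a templated math question whose parameters are
# resampled, with a small program recomputing the correct answer each time.
import random

# Hypothetical question template with free symbolic parameters.
TEMPLATE = ("A store sells pencils in packs of {pack_size}. "
            "If Maya buys {packs} packs and gives away {given} pencils, "
            "how many pencils does she have left?")

def solve(pack_size: int, packs: int, given: int) -> int:
    """Ground-truth program: total pencils bought minus pencils given away."""
    return pack_size * packs - given

def perturb(n: int = 3, seed: int = 0):
    """Sample n perturbed (question, answer) pairs from the template."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n):
        pack_size = rng.randint(3, 12)
        packs = rng.randint(2, 9)
        given = rng.randint(1, pack_size * packs - 1)  # keep the answer non-negative
        question = TEMPLATE.format(pack_size=pack_size, packs=packs, given=given)
        cases.append((question, solve(pack_size, packs, given)))
    return cases

if __name__ == "__main__":
    for question, answer in perturb():
        print(question, "->", answer)
```

Evaluating an LLM on such resampled variants, rather than on a single static instance, is what exposes whether its answers reflect the underlying reasoning procedure or memorized surface patterns.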
This paper proposes best practices for human evaluation of LLM-generated spoken document summaries, highlighting the limitations of existing automated metrics and advocating for rigorous human-centered approaches.