This paper introduces REC, a novel large language model (LLM) designed to automatically evaluate the quality of AI-generated text, producing ratings, explanations, and verifiable citations to support its assessments.
IdeaBench is a novel benchmark system designed to evaluate the ability of large language models (LLMs) to generate research ideas, moving beyond simple similarity metrics to assess quality indicators like novelty and feasibility using a personalized ranking system and the "Insight Score."
Large language models (LLMs) are evaluated for their culinary creativity in adapting recipes to different cuisines using a novel benchmark called ASH (authenticity, sensitivity, harmony), revealing strengths and limitations in their ability to understand and apply cultural nuances in recipe creation.
While large language models (LLMs) show promise in evaluating general tasks, relying on LLMs alone as judges has significant limitations when evaluating domain-specific tasks that require expert knowledge, highlighting the need to incorporate human experts into the evaluation pipeline.
Despite advancements, large language models (LLMs) fail to consistently demonstrate human-like reasoning in a simple economic game, highlighting their limitations as human surrogates in social science research.
Existing evaluations of mathematical reasoning in LLMs, relying on static datasets and final answers, are insufficient. ReasonAgain, a novel evaluation method, leverages symbolic programs to generate perturbations of math questions, revealing the fragility of reasoning abilities in state-of-the-art LLMs.
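To make the perturbation idea concrete, here is a minimal sketch (not ReasonAgain's actual code; the template, parameter names, and solver below are illustrative assumptions) of how a math question can be tied to a small symbolic program so that parameter values are resampled and the ground-truth answer is recomputed consistently for each perturbed variant.

```python
# Illustrative sketch only: a templated math question whose parameters are
# resampled, with a small program recomputing the correct answer each time.
import random

# Hypothetical question template with free symbolic parameters.
TEMPLATE = ("A store sells pencils in packs of {pack_size}. "
            "If Maya buys {packs} packs and gives away {given} pencils, "
            "how many pencils does she have left?")

def solve(pack_size: int, packs: int, given: int) -> int:
    """Ground-truth program: total pencils bought minus pencils given away."""
    return pack_size * packs - given

def perturb(n: int = 3, seed: int = 0):
    """Sample n perturbed (question, answer) pairs from the template."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n):
        pack_size = rng.randint(3, 12)
        packs = rng.randint(2, 9)
        given = rng.randint(1, pack_size * packs - 1)  # keep the answer non-negative
        question = TEMPLATE.format(pack_size=pack_size, packs=packs, given=given)
        cases.append((question, solve(pack_size, packs, given)))
    return cases

if __name__ == "__main__":
    for question, answer in perturb():
        print(question, "->", answer)
```

Evaluating an LLM on such resampled variants, rather than on a single static instance, is what exposes whether its answers reflect the underlying reasoning procedure or memorized surface patterns.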
This paper proposes best practices for human evaluation of LLM-generated spoken document summaries, highlighting the limitations of existing automated metrics and advocating for rigorous human-centered approaches.