The authors introduce CPSDBench, a specialized evaluation benchmark for the Chinese public security domain, to assess Large Language Models (LLMs) across various tasks. The study aims to provide insights into the strengths and limitations of existing models in addressing public security problems.
This paper provides a comprehensive exploration of evaluation metrics for Large Language Models (LLMs), offering insights into the selection and interpretation of metrics currently in use, and showcasing their application through recently published biomedical LLMs.
Single-prompt evaluation of large language models leads to unstable and unreliable results. A multi-prompt evaluation approach is necessary to provide a more robust and meaningful assessment of model capabilities.
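The multi-prompt claim can be made concrete with a minimal sketch: score the same items under several prompt paraphrases and report the spread across templates instead of a single number. This is an illustrative aggregation only, not the cited paper's protocol; `query_model`, `PROMPT_TEMPLATES`, and the exact-match scoring are assumptions introduced here.

```python
# Minimal multi-prompt evaluation sketch: run the same items through several
# prompt paraphrases and report mean and standard deviation across templates,
# exposing the prompt sensitivity that a single-prompt score hides.
from statistics import mean, stdev

PROMPT_TEMPLATES = [
    "Answer the question: {question}",
    "Question: {question}\nAnswer:",
    "Please respond concisely to the following question.\n{question}",
]

def query_model(prompt: str) -> str:
    # Hypothetical stand-in: substitute a real API or local inference call.
    raise NotImplementedError

def accuracy_per_template(items: list[dict]) -> list[float]:
    """Return one exact-match accuracy score per prompt template."""
    scores = []
    for template in PROMPT_TEMPLATES:
        correct = 0
        for item in items:
            prediction = query_model(template.format(question=item["question"]))
            correct += int(prediction.strip() == item["answer"])
        scores.append(correct / len(items))
    return scores

# Usage (with a dataset of {"question": ..., "answer": ...} items):
# scores = accuracy_per_template(dataset)
# print(f"accuracy = {mean(scores):.3f} +/- {stdev(scores):.3f}")
```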
Domain experts, lay users, and Large Language Models (LLMs) develop distinct sets of evaluation criteria for assessing LLM outputs: domain experts provide the most detailed and specific criteria, lay users emphasize formatting and clarity, and LLMs generate more generalized criteria based on prompt keywords.
Large language models (LLMs) struggle to follow sequences of instructions, even when those instructions are logically connected, highlighting a critical area for improvement in LLM robustness.
Evaluating large language models (LLMs) is complex and often inconsistent, hindering reproducibility, reliability, and robustness; this review identifies key challenges and provides recommendations for improved evaluation practices.
ProcBench is a new benchmark designed to evaluate the ability of large language models (LLMs) to follow explicit multi-step instructions, revealing that while LLMs excel in knowledge-driven tasks, they struggle with complex procedural reasoning.
Although large language models (LLMs) perform well in fluency and diversity, they still leave substantial room for improvement in personalization and coherence, particularly when accounting for dialogue context and assigned personas.
The Psychological Depth Scale (PDS) is a novel framework for evaluating the psychological impact of stories generated by large language models, demonstrating that models like GPT-4 can match or even surpass human-level narrative depth.
Current large language models (LLMs) often rely on memorized terms and struggle to demonstrate true reasoning abilities when presented with unfamiliar symbols or concepts, highlighting the need for more robust evaluation methods like the proposed MMLU-SR benchmark.