The authors introduce CPSDBench, an evaluation benchmark tailored to the Chinese public security domain, to assess Large Language Models (LLMs) across a range of domain-specific tasks. The study aims to provide insight into the strengths and limitations of existing models in addressing public security problems.
This paper provides a comprehensive exploration of evaluation metrics for Large Language Models (LLMs), offering guidance on selecting and interpreting the metrics currently in use, and illustrating their application with recently published biomedical LLMs.
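As a concrete illustration of one metric commonly reported in such evaluations, the sketch below computes token-overlap F1, a standard score for extractive question answering. The whitespace tokenization and lowercasing are simplifying assumptions of this sketch, not a prescription from the paper.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model answer and a gold answer.

    Assumes simple whitespace tokenization and lowercase normalization;
    real harnesses typically also strip punctuation and articles.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one-sided empty is a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Three of four tokens overlap, so precision = recall = 0.75.
print(token_f1("metformin lowers blood glucose",
               "metformin reduces blood glucose"))  # 0.75
```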
Single-prompt evaluation of large language models yields unstable and unreliable results, because scores can vary substantially across semantically equivalent phrasings of the same task. A multi-prompt evaluation approach is therefore necessary to provide a more robust and meaningful assessment of model capabilities.
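A minimal sketch of what multi-prompt evaluation could look like: score the same dataset under several paraphrased prompt templates and report the mean and spread rather than a single number. The `query_model` stub and the templates here are hypothetical placeholders, not an API or prompt set from the paper.

```python
import statistics

# Hypothetical paraphrases of one task (binary sentiment classification).
PROMPT_TEMPLATES = [
    "Classify the sentiment of this review as positive or negative: {text}",
    "Is the following review positive or negative? {text}",
    "Review: {text}\nSentiment (positive/negative):",
]

def query_model(prompt: str) -> str:
    # Stand-in for any LLM call; plug in your own client here.
    raise NotImplementedError("replace with a real model call")

def multi_prompt_accuracy(dataset: list[tuple[str, str]]) -> dict:
    """Score the same (text, label) pairs under every paraphrase."""
    per_prompt = []
    for template in PROMPT_TEMPLATES:
        correct = sum(
            query_model(template.format(text=text)).strip().lower() == label
            for text, label in dataset
        )
        per_prompt.append(correct / len(dataset))
    return {
        "per_prompt": per_prompt,              # one accuracy per paraphrase
        "mean": statistics.mean(per_prompt),   # headline number to report
        "stdev": statistics.stdev(per_prompt), # large spread = unstable result
    }
```

Reporting the standard deviation alongside the mean makes prompt sensitivity visible, rather than hiding it behind a single lucky or unlucky template.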