Core Concepts
CPSDBench evaluates Large Language Models (LLMs) in the Chinese public security domain, highlighting their strengths and limitations.
Summary
CPSDBench is a specialized evaluation benchmark tailored to the Chinese public security domain. It integrates public-security-related datasets drawn from real-world scenarios and assesses LLMs across four task types: text classification, information extraction, question answering, and text generation. The benchmark also introduces innovative evaluation metrics to quantify LLM efficacy more accurately. The study aims to deepen understanding of how existing models perform on public security issues and to guide the future development of more accurate models.
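To make the evaluation setup concrete, the sketch below shows how a per-task scoring loop for such a benchmark might look. It is a minimal illustration, not CPSDBench's actual implementation: the `model_answer` callable, the dataset tuple format, and the choice of exact-match accuracy for classification versus token-overlap F1 for the open-ended tasks are all assumptions made for the example.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, a common extraction/QA metric (an assumed stand-in;
    CPSDBench's own metrics may differ)."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def evaluate(model_answer, dataset):
    """Average a task-appropriate score over (task, prompt, reference) examples.

    `model_answer(prompt) -> str` is a placeholder for any LLM call.
    """
    scores = []
    for task, prompt, reference in dataset:
        prediction = model_answer(prompt)
        if task == "text_classification":
            # Exact-match accuracy for discrete classification labels.
            scores.append(float(prediction.strip() == reference.strip()))
        else:
            # Overlap-based F1 for extraction, QA, and generation tasks.
            scores.append(token_f1(prediction, reference))
    return sum(scores) / len(scores) if scores else 0.0

if __name__ == "__main__":
    # Toy run with a dummy "model" that simply echoes the reference answer.
    toy = [("text_classification", "Classify the report: ...", "fraud"),
           ("information_extraction", "Extract the suspect's name: ...", "Zhang San")]
    answers = {prompt: ref for _, prompt, ref in toy}
    print(evaluate(lambda p: answers[p], toy))  # -> 1.0
```

Splitting the scoring rule by task type mirrors the benchmark's structure: classification admits strict label matching, while the remaining tasks need softer, overlap-based credit for partially correct answers.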
Statistics
GPT-4 exhibited outstanding performance across all evaluation tasks.
Chinese models such as ChatGLM-4 surpassed GPT-4 on the text generation and question answering tasks.
Proprietary models generally outperformed open-source models.