Core Concepts
This paper introduces Chinese SimpleQA, a new benchmark designed to evaluate the factuality of large language models (LLMs) when answering short questions in Chinese.
Statistics
Chinese SimpleQA consists of 3,000 high-quality questions.
The dataset covers 6 major topics and 99 fine-grained subtopics.
The average question length is 23.6 tokens.
The average reference answer length is 6.1 tokens.
Only o1-preview and Doubao-pro-32k achieved passing scores (63.8% and 61.9%, respectively, on the correct metric).
Doubao-pro-32k's ranking improved from 12th on SimpleQA to 2nd on Chinese SimpleQA.
GPT-4's ranking decreased from 3rd on SimpleQA to 9th on Chinese SimpleQA.
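The "correct" scores above come from grading each model answer against the reference answer. In SimpleQA-style evaluations, each answer receives one of three grades (correct, incorrect, or not attempted), and aggregate metrics are computed from those grades. A minimal sketch of that aggregation; the label names and function name here are illustrative, not from the paper:

```python
from collections import Counter

def factuality_metrics(labels):
    """Aggregate per-question grades into SimpleQA-style scores.

    labels: list of grades, each one of "correct", "incorrect",
    or "not_attempted" (label names are illustrative).
    """
    counts = Counter(labels)
    total = len(labels)
    # The "correct" metric: fraction of all questions answered correctly.
    correct = counts["correct"] / total
    # "Correct given attempted" ignores questions the model declined.
    attempted = counts["correct"] + counts["incorrect"]
    correct_given_attempted = counts["correct"] / attempted if attempted else 0.0
    # F-score: harmonic mean of the two metrics above.
    if correct + correct_given_attempted == 0:
        f_score = 0.0
    else:
        f_score = (2 * correct * correct_given_attempted
                   / (correct + correct_given_attempted))
    return {"correct": correct,
            "correct_given_attempted": correct_given_attempted,
            "f_score": f_score}

grades = ["correct", "correct", "incorrect", "not_attempted"]
print(factuality_metrics(grades))
```

With the sample grades, "correct" is 0.5 while "correct given attempted" is higher (2/3), illustrating why models that abstain on uncertain questions can rank differently across the two metrics.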
Quotes
"A significant challenge in AI development is to ensure language models generate factually accurate responses."
"Current frontier models sometimes produce false outputs or answers that are not substantiated by evidence. This is the problem known as “hallucinations”, which greatly hinders the extensive use of general AI technologies, such as large language models (LLMs)."
"Chinese SimpleQA is the first comprehensive Chinese benchmark to evaluate the factuality ability of language models to answer short questions."
"Chinese SimpleQA mainly has five properties (i.e., Chinese, Diverse, High-quality, Static, Easy-to-evaluate)."