Core Concepts
The authors introduce CDQA, a Chinese Dynamic QA benchmark designed to challenge Chinese LLMs with dynamic questions, i.e., questions whose answers change over time. Extensive experiments and analysis provide insights for enhancing LLMs' capabilities.
Abstract
The paper introduces CDQA, a Chinese Dynamic QA benchmark that challenges LLMs with dynamic questions, i.e., questions whose answers may change over time. It covers the data construction pipeline, evaluation metrics, experimental results, the impact of prompt design, a comparison of search engines, and limitations. The study aims to improve LLM-driven applications for Chinese users.
Key Findings
"We obtain high-quality data through a pipeline that combines humans and models."
"We have also evaluated mainstream and advanced Chinese LLMs on CDQA."
"Results show that GPT-4 still ranks at the top with searched results from search engines."
"GPT-4 answers with great care in vanilla prompts with lowest answer rates but high F1-recall scores."
"Deepseek-67B-Chat has shown great performance as it surpasses GPT-4 on slow-changing and never-changing questions."
Quotes
"We believe that the benchmark we provide will become the key data resource for improving LLMs’ Chinese question-answering ability in the future."
"Vanilla prompts outperform the other two kinds of prompts."
"Google consistently outperforms Bing among all baseline models."
"GPT-4 hallucinates more with more few-shot examples in CDQA."
"Verbose explanation or expansion could increase hallucination especially when without evidence."