Core Concepts
The authors propose a localized medical benchmark, CMB (Comprehensive Medical Benchmark in Chinese), to evaluate large language models (LLMs) in the Chinese medical domain. It includes both multiple-choice questions and complex clinical diagnostic cases, and aims to assess LLMs' medical knowledge and reasoning capabilities within the Chinese linguistic and cultural framework.
Abstract
The authors have developed a comprehensive medical benchmark called CMB to evaluate the performance of large language models (LLMs) in the Chinese medical domain. The benchmark consists of two parts:
CMB-Exam: This subset contains multiple-choice questions from various medical qualification exams, covering four clinical professions (physicians, nurses, medical technicians, and pharmacists) as well as undergraduate disciplines and graduate entrance exams. The questions span the entire professional journey, from basics to advanced levels.
CMB-Clin: This subset focuses on complex clinical diagnostic cases, requiring the models to synthesize knowledge and engage in reasoning to provide informed responses. The cases are derived from real-world medical records and curated by medical experts.
The authors have evaluated several prominent LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain. The key findings include:
GPT-4 and recent open-source Chinese LLMs such as Qwen-72B-Chat and Yi-34B-Chat achieve accuracy exceeding 60% on CMB-Exam, surpassing the threshold required to obtain a medical license.
Accuracy varies significantly across professional levels and knowledge areas, with notable differences between traditional Chinese medicine and Western medicine.
The effectiveness of Chain-of-Thought (CoT) and few-shot prompting varies across models with different accuracy levels, especially on knowledge-intensive tasks.
On the CMB-Clin subset, automatic evaluation using GPT-4 agrees closely with expert evaluation.
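The multiple-choice evaluation pipeline described above can be sketched as a prompt-construction and scoring loop. This is a hypothetical illustration, not the paper's exact templates: the prompt wording, field names, and the assumption of single-letter answers are all simplifications introduced here.

```python
# Hypothetical sketch of CMB-Exam-style scoring: few-shot CoT prompt
# construction, answer-letter extraction, and accuracy computation.
import re

def build_prompt(question, options, exemplars=()):
    """Assemble a few-shot chain-of-thought prompt for one MCQ.

    `exemplars` are worked examples with reasoning and a gold answer;
    an empty tuple yields a zero-shot CoT prompt.
    """
    parts = []
    for ex in exemplars:
        parts.append(f"Question: {ex['question']}")
        parts.extend(f"{k}. {v}" for k, v in sorted(ex['options'].items()))
        parts.append(f"Reasoning: {ex['reasoning']}")
        parts.append(f"Answer: {ex['answer']}\n")
    parts.append(f"Question: {question}")
    parts.extend(f"{k}. {v}" for k, v in sorted(options.items()))
    parts.append("Let's think step by step, then give the answer letter.")
    return "\n".join(parts)

def extract_answer(model_output):
    """Pull the first standalone option letter (A-E) from a model reply."""
    m = re.search(r"\b([A-E])\b", model_output)
    return m.group(1) if m else None

def accuracy(predictions, gold):
    """Fraction of questions whose extracted letter matches the key."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```

For example, `extract_answer("Step 1: ... so the answer is B.")` returns `"B"`, and `accuracy` over the extracted letters gives the per-subset score compared against the 60% licensing threshold.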
The authors hope that the CMB benchmark will provide valuable insights into the current state of LLMs in the Chinese medical domain and facilitate the widespread adoption and enhancement of medical LLMs within China.
Stats
The CMB dataset comprises 280,839 multiple-choice questions across 6 major categories and 28 subcategories.
The CMB-Clin subset includes 74 expert-curated medical record consultations, with 208 complex clinical diagnostic questions.
Quotes
"The establishment of a standardized medical benchmark becomes a fundamental cornerstone to measure progression."
"Merely translating English-based medical evaluation may result in contextual incongruities to a local region."
"The clinical diagnostic questions are based on real, intricate cases, with correct answers determined by a consensus of teaching experts."