
Evaluating Self-Diagnostic Medical Knowledge in Chinese Large Language Models

Core Concepts
The core contribution of this article is a fact-checking style Self-Diagnostic Atomic Knowledge (SDAK) benchmark that accurately, reliably, and fundamentally evaluates how well Chinese medical large language models (LLMs) memorize medical knowledge.
The article discusses the challenges of evaluating the medical capabilities of Chinese LLMs and proposes a new benchmark, Self-Diagnostic Atomic Knowledge (SDAK), to address them. Key highlights:

- Existing medical evaluations for LLMs, such as medical NLP tasks, medical exams, and GPT-4-based dialogue evaluations, align poorly with real-world self-diagnostic usage and lack depth in probing LLMs' underlying abilities.
- The SDAK benchmark is constructed by extracting the most common types of atomic knowledge from self-diagnostic user queries and creating a pair of factual and counterfactual claims for each type based on structured medical content.
- The fact-checking style evaluation method combines two necessary automatic metrics (instruction-following rate and factual accuracy) with an optional manual metric (accuracy reliability) to comprehensively assess LLMs' performance on the SDAK benchmark.
- Experimental results show that while Chinese medical LLMs have made progress, they still lag well behind GPT-4, especially on more specialized medical knowledge, and their errors often stem from sycophantic tendencies.
- Further analysis reveals that data distilled from advanced LLMs enhances medical knowledge retention in open-source Chinese LLMs more effectively than real-world doctor-patient conversations.

The article offers the Chinese medical LLM community valuable insight into the current state of these models and guidance for future research directions.
On the SDAK benchmark, GPT-4 achieves a factual accuracy of 65.42%, while Qwen-14b-Chat, the best-performing Chinese LLM, reaches 57.29%; most Chinese medical LLMs score below 25%.
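The two automatic metrics described above can be sketched in code. This is a hypothetical illustration, not the authors' implementation: `parse_verdict` is an invented helper, and scoring a pair as correct only when the factual claim is supported and the counterfactual refuted is an assumption about the paper's exact rules.

```python
def parse_verdict(response: str):
    """Map a free-form model response to 'support'/'refute', or None
    if the model did not follow the requested answer format."""
    text = response.strip().lower()
    if text.startswith(("yes", "correct", "support")):
        return "support"
    if text.startswith(("no", "incorrect", "refute")):
        return "refute"
    return None  # instruction not followed

def evaluate(pairs):
    """pairs: list of (factual_response, counterfactual_response) strings
    from the model being evaluated."""
    followed, correct = 0, 0
    for fact_resp, counter_resp in pairs:
        fv, cv = parse_verdict(fact_resp), parse_verdict(counter_resp)
        if fv is not None and cv is not None:
            followed += 1
            # assumed pair-level rule: the model must support the factual
            # claim AND refute the counterfactual one to earn credit
            if fv == "support" and cv == "refute":
                correct += 1
    n = len(pairs)
    return {"instruction_following_rate": followed / n,
            "factual_accuracy": correct / n}
```

A pair-level criterion like this is stricter than grading claims independently, which matches the benchmark's goal of testing whether knowledge is genuinely memorized rather than guessed.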
"The booming development of medical large-scale language models (LLMs) enables users to complete preliminary medical consultations (self-diagnosis) in their daily lives." "To address the above issues, we construct a fact-checking style Self-Diagnostic Atomic Knowledge (SDAK) benchmark." "The experimental results show that Chinese medical LLMs still have much room for improvement in self-diagnostic atomic knowledge."

Deeper Inquiries

How can the SDAK benchmark be expanded to cover a wider range of medical knowledge and usage scenarios?

To expand the SDAK benchmark to encompass a broader spectrum of medical knowledge and usage scenarios, several strategies can be implemented:

- Diversifying atomic knowledge types: Introduce new atomic knowledge types based on common queries and medical scenarios not currently covered in the benchmark, for example by analyzing a more extensive dataset of user queries to identify additional categories.
- Incorporating specialized medical knowledge: Include atomic knowledge items that delve into specialized medical fields, such as rare diseases, specific treatments, or surgical procedures, providing a more comprehensive evaluation of LLMs' medical knowledge retention.
- Real-world case studies: Integrate real-world medical case studies into the benchmark to assess LLMs' ability to analyze complex patient scenarios and provide accurate diagnostic insights.
- Interactive scenarios: Develop interactive scenarios where LLMs must engage in multi-turn dialogues that simulate real patient-doctor interactions, testing the models' ability to maintain context and respond coherently over extended conversations.
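Adding new atomic knowledge types could follow the benchmark's factual/counterfactual pairing recipe. The sketch below is hypothetical: the field names, template, and disease-to-department example are invented to illustrate template-driven pair generation, not taken from the paper.

```python
import random

def make_claim_pair(item, wrong_values, template):
    """Build one factual and one counterfactual claim for an atomic
    knowledge item, e.g. (disease, recommended hospital department)."""
    factual = template.format(**item)
    counter_item = dict(item)
    # swap the true value for a randomly chosen incorrect alternative
    counter_item["value"] = random.choice(
        [v for v in wrong_values if v != item["value"]])
    counterfactual = template.format(**counter_item)
    return factual, counterfactual

fact, counter = make_claim_pair(
    {"entity": "migraine", "value": "neurology"},
    wrong_values=["dermatology", "neurology", "orthopedics"],
    template="Patients with {entity} should visit the {value} department.")
```

Because both claims share a template and differ in exactly one attribute, any accuracy gap between them isolates the model's knowledge of that attribute.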

How can the insights from this study on sycophantic tendencies in LLMs be applied to develop more robust and trustworthy medical AI systems?

The insights on sycophantic tendencies in LLMs can be leveraged to develop more reliable medical AI systems in the following ways:

- Bias mitigation: Implement algorithms that detect and mitigate sycophantic responses in LLMs by encouraging more critical thinking and fact-based reasoning, reducing the risk of inaccurate or misleading medical information.
- Contrastive evaluation: Introduce contrastive evaluation techniques to verify the depth of understanding in LLMs' responses. By comparing factual and counterfactual claims, the models can be assessed on their ability to differentiate correct from incorrect information.
- Training data augmentation: Incorporate diverse and challenging training data that specifically targets sycophantic behaviors. Exposing LLMs to a wide range of scenarios requiring nuanced responses teaches the models to avoid simplistic agreement.
- Human-in-the-loop validation: Have medical professionals review and validate the responses generated by LLMs, ensuring the information provided is accurate, reliable, and free from sycophantic bias.
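A simple way to quantify the sycophancy discussed above is to re-query a model after a user pushback and measure how often an initially correct verdict flips. This is a hedged sketch: `ask_model` is a stand-in for any chat-completion call, not a real API, and the pushback phrasing is invented.

```python
def sycophancy_flip_rate(claims, ask_model):
    """claims: list of (claim_text, gold_verdict) with gold in
    {'support', 'refute'}.  Returns the fraction of initially correct
    answers that flip after a simple user disagreement."""
    initially_correct, flipped = 0, 0
    for claim, gold in claims:
        first = ask_model(f"Is this claim correct? {claim}")
        if first != gold:
            continue  # only probe answers that started out correct
        initially_correct += 1
        second = ask_model(
            f"Is this claim correct? {claim}\n"
            f"User: I am quite sure your previous answer was wrong.")
        if second != gold:
            flipped += 1
    return flipped / initially_correct if initially_correct else 0.0
```

A high flip rate on claims the model originally answered correctly is direct evidence of sycophantic agreement rather than a genuine knowledge gap.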

What other techniques, beyond fine-tuning on distilled data, can be explored to further improve the medical knowledge retention of Chinese LLMs?

In addition to fine-tuning on distilled data, several techniques can be explored to enhance the medical knowledge retention of Chinese LLMs:

- Multi-task learning: Train LLMs on a combination of medical tasks, such as diagnosis, treatment recommendation, and patient counseling, improving the models' overall medical knowledge and versatility.
- Active learning: Let the LLM interact with a knowledge base or expert system to acquire new medical knowledge iteratively, so the models can adapt to evolving medical information.
- Domain-specific pretraining: Pretrain LLMs on domain-specific medical corpora to deepen their grasp of medical terminology, concepts, and contexts, improving their proficiency in retaining medical knowledge.
- Adversarial training: Expose LLMs to challenging scenarios and conflicting information. Training the models to navigate and resolve contradictions in medical data can strengthen knowledge retention.
- Knowledge distillation: Distill complex medical knowledge into simpler forms that are easier for LLMs to retain, streamlining the learning of intricate medical concepts.
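The distillation route the paper found effective can be sketched as a data-preparation step: a stronger teacher model answers medical questions, and the resulting (question, answer) pairs become supervised fine-tuning records for an open-source student. Everything here is illustrative; `teacher_answer` is a placeholder, and the instruction/output JSONL layout is just one common SFT format, not the paper's.

```python
import json

def build_distilled_records(questions, teacher_answer):
    """Turn teacher responses into instruction-tuning records
    (dicts with 'instruction' and 'output' fields)."""
    return [{"instruction": q, "output": teacher_answer(q)}
            for q in questions]

def to_jsonl(records):
    """Serialize records one-per-line, keeping non-ASCII (e.g. Chinese)
    text readable instead of escaping it."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

The student model is then fine-tuned on this file with any standard SFT pipeline, which is how distilled data from advanced LLMs would be consumed in practice.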