This research evaluates the efficacy of four open-source large language models (LLMs), Meditron, MedAlpaca, Mistral, and Llama-2, in interpreting medical guidelines stored in PDF format, specifically the European Society of Cardiology (ESC) guidelines for hypertension in children and adolescents.
The researchers developed MedDoc-Bot, a user-friendly medical document chatbot built with Streamlit, a Python library, which allows authorized users to upload PDF files and pose questions, generating interpretive responses from the four locally stored LLMs. A pediatric expert established an evaluation benchmark by formulating questions and reference responses drawn from the ESC guidelines, then rated the model-generated responses for fidelity and relevance.
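The chatbot's core loop, answering questions against an uploaded guideline document with a local LLM, can be sketched as a simple retrieve-then-prompt pipeline. This is a minimal illustrative sketch, not the authors' MedDoc-Bot code: the function names, chunk size, and word-overlap retrieval are assumptions standing in for whatever extraction and retrieval the real tool uses.

```python
def chunk_text(text, size=80):
    """Split extracted guideline text into chunks of roughly `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def best_chunk(question, chunks):
    """Pick the chunk sharing the most words with the question
    (a crude stand-in for real retrieval, e.g. embedding similarity)."""
    q = set(question.lower().split())
    return max(chunks, key=lambda c: len(q & set(c.lower().split())))

def build_prompt(question, context):
    """Assemble the prompt sent to a locally stored LLM."""
    return (
        "Answer using only the guideline excerpt below.\n"
        f"Excerpt: {context}\n"
        f"Question: {question}\nAnswer:"
    )

# Toy guideline text standing in for the parsed ESC PDF.
guideline = (
    "Blood pressure should be measured annually in children aged three "
    "years or older. Hypertension is confirmed on repeated visits."
)
question = "How often should blood pressure be measured?"
chunks = chunk_text(guideline, size=12)
prompt = build_prompt(question, best_chunk(question, chunks))
```

In the actual tool, `prompt` would be passed to each of the four local models; a Streamlit front end would wrap this with `st.file_uploader` for the PDF and a text input for the question.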
In the automatic metric evaluation, Llama-2 and Mistral performed well, with Llama-2 achieving the best METEOR and chrF scores, particularly on clinical responses; however, Llama-2 was slower when processing text and tabular data. In the human evaluation, responses from Mistral, Meditron, and Llama-2 showed reasonable fidelity and relevance, while MedAlpaca consistently lagged behind.
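chrF, one of the metrics above, scores a model response by character n-gram overlap with the expert reference answer. The following is a simplified pure-Python sketch of the metric for intuition only; it omits word n-grams and smoothing details, and the paper's reported scores would come from a standard implementation such as sacreBLEU, not this code.

```python
from collections import Counter

def char_ngrams(text, n):
    """All character n-grams of `text` (spaces included)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(candidate, reference, max_n=6, beta=2.0):
    """Simplified chrF: mean character n-gram F-beta over n = 1..max_n,
    scaled to 0-100. Recall is weighted beta^2 times precision."""
    scores = []
    for n in range(1, max_n + 1):
        cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
        if not cand or not ref:
            continue
        overlap = sum((cand & ref).values())  # clipped n-gram matches
        p = overlap / sum(cand.values())
        r = overlap / sum(ref.values())
        if p + r == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * p * r / (beta**2 * p + r))
    return 100 * sum(scores) / len(scores) if scores else 0.0
```

An identical candidate and reference score 100, disjoint strings score 0, and a near-miss such as a single-character typo lands strictly in between, which is why character-level chrF is more forgiving of minor surface variation than exact-match metrics.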
The researchers highlight the importance of balancing response quality against efficiency, and call for further fine-tuning of the best-performing models (Llama-2 and Mistral) on a clinical dataset curated by multiple experts, enabling secure patient-record analysis on a local system.
Key insights from the original content by Mohamed Yase... on arxiv.org, 05-07-2024.
https://arxiv.org/pdf/2405.03359.pdf