The authors conducted a comprehensive evaluation of four representative large language models (LLMs), GPT-3.5, GPT-4, LLaMA 2, and PMC LLaMA, across 12 biomedical NLP datasets covering six applications: named entity recognition, relation extraction, multi-label document classification, question answering, text summarization, and text simplification.
The evaluation was performed under four settings: zero-shot, static few-shot, dynamic K-nearest few-shot, and fine-tuning. State-of-the-art fine-tuning approaches outperformed zero- and few-shot LLMs on most biomedical NLP tasks, achieving a macro-average score of 0.6531 across tasks, compared with 0.4862 for the best-performing LLM under zero-/few-shot settings.
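To illustrate the dynamic K-nearest few-shot setting, the sketch below retrieves the K training examples most similar to each test input and formats them as in-context demonstrations. TF-IDF cosine similarity and the prompt template are stand-ins chosen for illustration; the paper's actual retriever and templates are not reproduced here.

```python
# Minimal sketch of dynamic K-nearest few-shot prompting: for each test
# input, retrieve the K most similar training examples and place them in
# the prompt as demonstrations. TF-IDF cosine similarity is a simple,
# hypothetical stand-in for the paper's retriever.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_knn_prompt(test_text, train_texts, train_labels, k=3):
    """Assemble a few-shot prompt from the K nearest training examples."""
    vectorizer = TfidfVectorizer().fit(train_texts + [test_text])
    train_vecs = vectorizer.transform(train_texts)
    test_vec = vectorizer.transform([test_text])
    # Indices of the K most similar training examples, most similar first.
    sims = cosine_similarity(test_vec, train_vecs)[0]
    top_k = sims.argsort()[::-1][:k]
    demos = "\n\n".join(
        f"Input: {train_texts[i]}\nLabel: {train_labels[i]}" for i in top_k
    )
    return f"{demos}\n\nInput: {test_text}\nLabel:"

# Toy usage with relation-extraction-style examples (illustrative data):
train_texts = ["Aspirin reduces fever.", "Ibuprofen treats pain.", "Metformin lowers glucose."]
train_labels = ["treats", "treats", "treats"]
print(build_knn_prompt("Acetaminophen reduces fever.", train_texts, train_labels, k=2))
```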
However, closed-source LLMs such as GPT-3.5 and GPT-4 showed stronger zero- and few-shot performance on reasoning-related tasks such as medical question answering, where they surpassed the reported state-of-the-art results. They also achieved competitive accuracy and readability in text summarization and simplification, and performed competitively on semantic-understanding tasks such as document-level text classification.
In contrast, open-source LLMs such as LLaMA 2 did not show robust zero- and few-shot performance and required fine-tuning to close the gap on biomedical NLP tasks. The evaluation also indicated limited performance benefit from building domain-specific LLMs such as PMC LLaMA.
The qualitative evaluation revealed that LLMs frequently produced missing, inconsistent, and hallucinated responses: on one multi-label document classification dataset, over 30% of responses were hallucinated and 22% were inconsistent.
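As a rough illustration of how such error types can be counted on a multi-label classification dataset, the sketch below flags a response as missing when it is empty, hallucinated when it contains labels outside the permitted label set, and inconsistent when repeated runs on the same input disagree. The label set and helper function are hypothetical, not the authors' exact annotation protocol.

```python
# Hedged sketch of counting missing / hallucinated / inconsistent responses
# for multi-label classification. ALLOWED_LABELS is an illustrative,
# hypothetical label set, not the dataset's real schema.
ALLOWED_LABELS = {"gene_function", "disease", "chemical"}

def classify_response(runs: list[set[str]]) -> set[str]:
    """Return the error categories exhibited by one input's repeated runs."""
    errors = set()
    if any(not run for run in runs):          # empty output on some run
        errors.add("missing")
    if any(run - ALLOWED_LABELS for run in runs):  # labels outside the schema
        errors.add("hallucinated")
    if len({frozenset(run) for run in runs}) > 1:  # runs disagree
        errors.add("inconsistent")
    return errors

# Example: two runs on the same document disagree, and one invents a label.
runs = [{"disease", "made_up_label"}, {"disease"}]
print(classify_response(runs))  # e.g. {'hallucinated', 'inconsistent'}
```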
Based on these results, the authors provide specific recommendations on best practices for using LLMs in biomedical NLP applications and make all relevant data, models, and results publicly available to the community.
Key insights distilled from: Qingyu Chen et al., arxiv.org, 09-24-2024. https://arxiv.org/pdf/2305.16326.pdf