The study benchmarks eight large language models on a dataset of medical questions from Polish licensing exams. Larger models generally outperformed smaller ones, although performance patterns were similar across models. Model accuracy was influenced by question length and the probability assigned to the correct answer. The study provides insights into the performance patterns of large language models in medical applications.
Key insights distilled from
by Andrew M. Be... at arxiv.org 03-12-2024
https://arxiv.org/pdf/2310.07225.pdf