The study benchmarks eight large language models on a dataset of medical questions from Polish licensing exams. Larger models generally outperformed smaller ones, with similarities observed across models. Model accuracy was influenced by question length and the probability assigned to the correct answer. The study provides insights into the performance patterns of large language models in medical applications.
Key insights extracted from the source by Andrew M. Be... (arxiv.org, 03-12-2024):
https://arxiv.org/pdf/2310.07225.pdf
Deeper Questions