The study benchmarks eight large language models on a dataset of medical questions from Polish licensing exams. Larger models generally outperformed smaller ones, and the models' results showed similarities to one another. Model accuracy varied with question length and with the probability the model assigned to the correct answer. The study offers insight into the performance patterns of large language models in medical applications.
Key insights distilled from:
by Andrew M. Be... on arxiv.org, 03-12-2024
https://arxiv.org/pdf/2310.07225.pdf