Key Concepts
Large language models can achieve radiologist-level performance in evaluating the clinical accuracy and relevance of generated radiology reports, providing an efficient and accessible alternative to manual assessment.
Summary
This study proposes a novel evaluation framework using large language models (LLMs) to assess the quality of generated radiology reports. The key insights are:
Existing evaluation metrics, such as language-based metrics (BLEU, ROUGE) and clinical metrics (CheXpert, RadGraph), have significant limitations in capturing the clinical significance of radiology reports. They fail to properly assess partial correctness, near synonyms, and equivalent expressions.
LLMs, with their expansive knowledge and nuanced understanding of text, offer a more flexible and accurate approach to evaluating radiology reports. The authors demonstrate that GPT-4 can achieve evaluation consistency close to that of radiologists, with a Kendall's tau correlation of 0.7348.
To address the practical challenges of using GPT-4, such as high costs and slow response times, the authors develop a smaller, more efficient model through knowledge distillation. The distilled BioMistral-7B model achieves comparable performance to GPT-4, with a Kendall's tau of 0.7487, while offering significantly faster response times and lower computational costs.
The authors conduct a comprehensive analysis, including error distributions and Bland-Altman plots, to compare the performance of GPT-4 and the distilled models. The results show that the fine-tuned BioMistral-7B model not only correlates more strongly with radiologists but also exhibits an error distribution closer to normal.
The proposed LLM-based evaluation framework and the distilled model offer an accessible and efficient method for assessing radiology report generation, facilitating the development of more clinically relevant models.
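The agreement measures named in the summary (Kendall's tau and Bland-Altman analysis) can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names `kendall_tau_b` and `bland_altman` and the score lists are hypothetical, the tau-b variant (which handles ties) is an assumption about which form of Kendall's tau is meant, and the 1.96 multiplier is the conventional choice for 95% limits of agreement.

```python
from itertools import combinations
from statistics import mean, stdev


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists (handles ties)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # pair tied in both lists: excluded from every term
        if dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1  # both lists order the pair the same way
        else:
            discordant += 1  # the lists order the pair in opposite ways
    untied = concordant + discordant
    denom = ((untied + ties_x) * (untied + ties_y)) ** 0.5
    return (concordant - discordant) / denom if denom else float("nan")


def bland_altman(a, b):
    """Return (bias, lower limit, upper limit) of agreement for a - b."""
    diffs = [ai - bi for ai, bi in zip(a, b)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # conventional 95% limits of agreement
    return bias, bias - spread, bias + spread


if __name__ == "__main__":
    # Hypothetical per-report error counts, for illustration only.
    radiologist_scores = [0, 1, 2, 3, 4, 5]
    model_scores = [0, 2, 1, 3, 5, 4]
    print("tau-b:", kendall_tau_b(radiologist_scores, model_scores))
    print("bias, limits:", bland_altman(model_scores, radiologist_scores))
```

A tau near 1 means the model ranks reports almost as the radiologists do. The Bland-Altman bias and limits show whether the model systematically over- or under-counts errors and how widely its disagreements spread.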
Statistics
The cardiac silhouette is top normal.
The cardiac silhouette is enlarged.
Multifocal opacities are present, overall similar to previous study but potentially minimally improved.
There are multiple areas of increased opacity within the lungs, which appear largely consistent with the prior examination, with a slight possibility of marginal improvement.
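The two sentence pairs above appear to illustrate the limitation of language-based metrics noted in the summary: the first pair shares most of its words but reports different findings, while the second pair shares almost no words but describes the same finding. The sketch below computes clipped unigram precision, the building block of BLEU, on both pairs. It is a simplified stand-in rather than a full BLEU implementation (no brevity penalty, no higher-order n-grams), and the helper names `tokenize` and `ngram_precision` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric word tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision of candidate against a single reference."""
    cand = tokenize(candidate)
    ref = tokenize(reference)
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items())
    return overlap / total


# Sentence pairs taken from the examples above.
different_finding = ngram_precision(
    "The cardiac silhouette is enlarged.",
    "The cardiac silhouette is top normal.",
)
same_finding = ngram_precision(
    "There are multiple areas of increased opacity within the lungs, which "
    "appear largely consistent with the prior examination, with a slight "
    "possibility of marginal improvement.",
    "Multifocal opacities are present, overall similar to previous study "
    "but potentially minimally improved.",
)

if __name__ == "__main__":
    print("different finding, similar wording:", different_finding)
    print("same finding, different wording:", same_finding)
```

Under this measure the pair with different findings scores far higher than the paraphrased pair, which is the opposite of what a clinically meaningful metric should report.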
Quotes
The adage "What gets measured gets managed" highlights the importance of evaluation metrics in the task of report generation.
"Large language models, with their expansive knowledge base, offer a more nuanced and flexible understanding of text, allowing for the discernment of subtle distinctions."