This study evaluates whether the large language model ChatGPT can estimate the quality of journal articles across academia. The researchers sampled up to 200 articles from each of the 34 Units of Assessment (UoAs) in the UK's Research Excellence Framework (REF) 2021, comparing ChatGPT quality scores with the average REF scores of the submitting departments.
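For concreteness, the scoring step can be pictured roughly as below. This is a minimal sketch assuming the OpenAI chat completions API; the model name, prompt wording, and single-score output format are illustrative assumptions, not the paper's exact protocol.

```python
# Hypothetical sketch: ask a chat model to rate one article on the REF-style
# 1* (recognised nationally) to 4* (world-leading) scale, using only the
# title and abstract.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def score_article(title: str, abstract: str, model: str = "gpt-4o") -> str:
    """Return the model's raw quality rating for a title plus abstract."""
    prompt = (
        "You are an academic research assessor. Rate the following journal "
        "article on the REF quality scale from 1* to 4*, using only the "
        "title and abstract.\n\n"
        f"Title: {title}\n\nAbstract: {abstract}\n\n"
        "Reply with a single score such as 3*."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```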
The key findings are:
Spearman correlations between ChatGPT scores and departmental averages were positive for every UoA except Clinical Medicine (rho = -0.12), ranging from 0.08 (Philosophy) to 0.78 (Psychology, Psychiatry and Neuroscience); a sketch of this correlation check follows the list.
The correlations were strongest in the physical and health sciences and engineering, suggesting that large language models can provide reasonable research quality estimates in these fields, even before citation data is available.
However, ChatGPT assessments tended to be more positive for most health and physical sciences than for other fields, a potential source of bias in multidisciplinary assessments.
The ChatGPT scores are based only on titles and abstracts, so they cannot be considered full research evaluations.
The researchers discuss potential reasons for the high correlations in some fields, such as ChatGPT's preference for more technical or empirical outputs, and the possibility that higher-scoring departments may be better at making quality claims in their abstracts. They also caution against fully replacing human judgment with ChatGPT evaluations, as authors may learn to game the system by designing abstracts to produce high ChatGPT scores.
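A rough sketch of the comparison step is below, under the assumption of one ChatGPT score per article paired with the average REF score of the submitting department; the file name and column names are hypothetical.

```python
# Hypothetical sketch: per-UoA Spearman correlation between ChatGPT scores
# and departmental average REF scores.
import pandas as pd
from scipy.stats import spearmanr

# Assumed columns: "uoa", "chatgpt_score", "dept_avg_score".
articles = pd.read_csv("scored_articles.csv")

for uoa, group in articles.groupby("uoa"):
    rho, p_value = spearmanr(group["chatgpt_score"], group["dept_avg_score"])
    print(f"{uoa}: rho={rho:.2f} (p={p_value:.3f}, n={len(group)})")
```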
Source: Mike Thelwall, arXiv, 26 September 2024. https://arxiv.org/pdf/2409.16695.pdf