Core Concepts
Large language models are tested on their ability to score written essays; they show promise as sources of feedback on writing quality but fall short of dedicated scoring models in predictive accuracy.
Abstract
The study evaluates the effectiveness of Large Language Models (LLMs) such as ChatGPT and Llama in scoring essays. Different prompts were designed to test their performance across various essay categories. While the LLMs showed promise in providing feedback to enhance essay quality, they fell short of state-of-the-art models in predicting scores accurately.
The research examines the impact of prompt engineering on the LLMs' performance, highlighting the importance of matching the prompt to the task type. ChatGPT was more consistent across prompts than Llama, which was sensitive to prompt variations. Despite generating high-quality text, both LLMs struggled to reliably distinguish good essays from bad ones.
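The study's exact prompts are not reproduced here, but a zero-shot scoring prompt of the kind it varies might look like the sketch below. The rubric wording, the 0-10 score range, and the `call_llm` helper are illustrative assumptions, not the study's actual setup.

```python
# Minimal sketch of a zero-shot essay-scoring prompt, assuming a generic
# chat-style LLM behind a hypothetical call_llm(prompt) -> str helper.
# The rubric wording and 0-10 range are illustrative, not the study's prompts.

SCORING_PROMPT = """\
You are an experienced essay rater.
Score the following essay on a scale from {lo} to {hi}, considering
content, organization, and language use.
Reply with the numeric score only.

Essay:
{essay}
"""

def score_essay(essay: str, call_llm, lo: int = 0, hi: int = 10) -> int:
    """Send the scoring prompt to the LLM and parse its numeric reply."""
    reply = call_llm(SCORING_PROMPT.format(lo=lo, hi=hi, essay=essay))
    token = reply.strip().split()[0] if reply.strip() else ""
    if not token.isdigit():
        raise ValueError(f"Unparseable model reply: {reply!r}")
    return int(token)
```

As the paper's prompt-sensitivity findings suggest, small changes to the rubric text or the requested output format in such a template can shift the resulting scores, more so for Llama than for ChatGPT.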
While LLMs show potential for providing valuable feedback on writing quality, further research is needed to improve their accuracy in predicting essay scores. The study underscores that prompt design can substantially enhance or diminish LLMs' performance in automated essay scoring tasks.
Stats
The ASAP (Automated Student Assessment Prize) dataset comprises 8 tasks and 12,978 essays.
ChatGPT achieved a peak quadratic weighted kappa (QWK) score of 0.606.
Llama reached a QWK score of 0.562.
State-of-the-art (SOTA) models outperformed both ChatGPT and Llama, with average QWK scores of 0.817 and 0.695, respectively (see the QWK sketch below).
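QWK, the metric behind these numbers, measures agreement between two raters while penalizing disagreements by the square of their distance. A minimal sketch of how such a score is computed, using scikit-learn's `cohen_kappa_score`; the score lists below are toy data, not results from the paper:

```python
# Minimal sketch of computing QWK (quadratic weighted kappa) between human
# and model scores with scikit-learn. The lists are toy data; the actual
# evaluation uses ASAP's per-task score ranges.
from sklearn.metrics import cohen_kappa_score

human_scores = [2, 3, 4, 4, 1, 3, 2, 4]  # gold scores assigned by human raters
model_scores = [2, 3, 3, 4, 2, 3, 2, 3]  # scores predicted by the LLM

qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK = {qwk:.3f}")  # 1.0 = perfect agreement, 0 = chance-level
```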
Quotes
"The response demonstrates some understanding of the text but could be more comprehensive and nuanced." - Llama
"The response lacks a clear understanding of the text and the prompt." - ChatGPT