The study evaluates the effectiveness of Large Language Models (LLMs) such as ChatGPT and Llama at scoring essays. Different prompts were designed to test their performance across various essay categories. While the LLMs showed promise in providing feedback to enhance essay quality, they fell short of state-of-the-art automated essay scoring models in score prediction accuracy.
The research explores the impact of prompt engineering on LLM performance, highlighting the importance of selecting the right prompt for the task type. ChatGPT was more consistent across prompts than Llama, which was sensitive to prompt variations. Despite generating high-quality text, both LLMs struggled to reliably distinguish good essays from bad ones.
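To illustrate the kind of prompt variation the study examines, the sketch below assembles two alternative scoring prompts for the same essay. The template wording, score range, and rubric criteria are illustrative assumptions, not the paper's exact prompts.

```python
def build_scoring_prompt(essay: str, style: str = "direct") -> str:
    """Assemble an essay-scoring prompt.

    Both templates are hypothetical examples of prompt variants;
    the actual prompts used in the study may differ.
    """
    if style == "direct":
        # Minimal zero-shot prompt: ask only for a numeric score.
        return ("Score the following essay on a scale of 1-6. "
                "Reply with only the number.\n\nEssay:\n" + essay)
    if style == "rubric":
        # Richer prompt: assign a persona and rubric criteria.
        return ("You are an experienced writing instructor. Using a rubric "
                "covering organization, evidence, and grammar, assign the "
                "essay a holistic score from 1 to 6 and briefly justify it."
                "\n\nEssay:\n" + essay)
    raise ValueError(f"unknown prompt style: {style}")


sample = "Technology has changed how students learn..."
print(build_scoring_prompt(sample, "direct"))
```

Sending each variant to the same model and comparing the returned scores is one way to observe the prompt sensitivity the study reports, particularly for Llama.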
While LLMs have potential for providing valuable feedback on writing quality, further research is needed to improve their accuracy in predicting essay scores. The study emphasizes the critical role of prompt design in enhancing or diminishing LLMs' performance in automated essay scoring tasks.
Key insights extracted from arxiv.org, by Watheq Manso..., 03-12-2024
https://arxiv.org/pdf/2403.06149.pdf