
Leveraging Large Language Models for Zero-shot Automated Essay Scoring via Multi-trait Specialization


Core Concepts
Large language models can be effectively leveraged for zero-shot automated essay scoring by decomposing writing proficiency into distinct traits and generating scoring criteria for each trait, followed by step-by-step evaluation and trait aggregation.
Abstract
The paper presents a zero-shot prompting framework called Multi-trait Specialization (MTS) to elicit essay scoring capabilities in large language models (LLMs). The key insights are:
- Decomposing writing proficiency into distinct traits and generating scoring criteria for each trait using ChatGPT, which ensures consistent scoring behavior across essays.
- Engaging the LLM in step-by-step evaluation, where each conversation round focuses on scoring one specific trait against the corresponding scoring criteria. This simplifies the task and lets the LLM concentrate on one aspect at a time.
- Incorporating a quote retrieval step before the scoring task, which helps the LLM adhere to the details of the essay and produce faithful evaluations.
- Deriving the final score through trait aggregation (averaging) followed by min-max scaling with outlier clipping, which maps the predictions onto the target score range (see the sketch below).

Experiments on the ASAP and TOEFL11 datasets show that MTS consistently outperforms a straightforward prompting baseline (Vanilla) across different LLMs, with maximum gains of 0.437 on TOEFL11 and 0.355 on ASAP in average Quadratic Weighted Kappa (QWK). MTS also enables the small-sized Llama2-13b-chat to substantially outperform ChatGPT, facilitating effective deployment in real applications.
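As a rough illustration of the final aggregation step described above, the sketch below averages per-trait scores and maps them onto a target range via min-max scaling with outlier clipping. The per-trait scale, the clipping bounds, and the example values are illustrative assumptions, not numbers taken from the paper.

```python
import statistics

def aggregate_trait_scores(trait_scores, batch_min, batch_max, target_range=(0, 60)):
    """Average per-trait scores, then map the result onto the target score range
    via min-max scaling with outlier clipping. The concrete bounds passed in here
    are assumptions for illustration, not values from the paper."""
    raw = statistics.mean(trait_scores)          # trait aggregation (averaging)
    raw = min(max(raw, batch_min), batch_max)    # clip outliers to the assumed bounds
    lo, hi = target_range
    scaled = lo + (raw - batch_min) / (batch_max - batch_min) * (hi - lo)
    return round(scaled)

# Example: four hypothetical trait scores on an assumed 1-10 scale for one essay,
# mapped onto an assumed 0-60 target range.
print(aggregate_trait_scores([6, 7, 5, 6], batch_min=3.0, batch_max=8.5))
```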
Stats
The average essay length in ASAP ranges from 106 to 725 words, with score ranges varying from 0-3 to 0-60. The average essay length in TOEFL11 is around 340-361 words, with scores categorized as low, medium, or high.
Quotes
"Advances in automated essay scoring (AES) have traditionally relied on labeled essays, requiring tremendous cost and expertise for their acquisition." "Recently, large language models (LLMs) have achieved great success in various tasks, but their potential is less explored in AES." "Experimental results on two benchmark datasets demonstrate that MTS consistently outperforms straightforward prompting (Vanilla) in average QWK across all LLMs and datasets, with maximum gains of 0.437 on TOEFL11 and 0.355 on ASAP."

Deeper Inquiries

How can the MTS framework be extended to handle essays in languages other than English?

To extend the MTS framework to handle essays in languages other than English, several steps can be taken:
- Translation: Use machine translation to translate essays into English before passing them to the MTS framework, so the LLMs can process the content effectively (a minimal sketch follows this list).
- Multilingual Training: Train the LLMs on multilingual data to enhance their proficiency in understanding and scoring essays in various languages. This requires a diverse dataset with essays in different languages so the models can generalize well.
- Language-specific Traits: Develop language-specific traits and scoring criteria that capture the nuances and characteristics of each language, adapting the trait decomposition process to language-specific writing conventions.
- Cross-lingual Evaluation: Evaluate the framework on a diverse set of languages to verify its effectiveness and accuracy, and make adjustments as needed.
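As a rough illustration of the first option, the wrapper below translates an essay into English before handing it to an MTS-style scorer. Both `translate` and `score_with_mts` are hypothetical placeholders standing in for a machine translation backend and the MTS prompting pipeline, not functions from the paper.

```python
def translate(text: str, source_lang: str, target_lang: str = "en") -> str:
    """Hypothetical hook for a machine translation service or model."""
    raise NotImplementedError("plug in a translation backend")

def score_with_mts(essay_en: str) -> float:
    """Hypothetical hook for the MTS scoring pipeline operating on English text."""
    raise NotImplementedError("plug in the MTS prompting pipeline")

def score_non_english_essay(essay: str, source_lang: str) -> float:
    # Translate first so the downstream trait prompts see English text,
    # then run the unchanged MTS pipeline on the translation.
    essay_en = translate(essay, source_lang=source_lang)
    return score_with_mts(essay_en)
```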

How can the potential biases in the LLM-based essay scoring be mitigated?

Potential biases in LLM-based essay scoring can be mitigated through the following strategies:
- Bias Detection: Implement bias detection checks that analyze the model's decisions and outputs to identify and flag patterns of bias in the scoring process (a simple illustration follows this list).
- Diverse Training Data: Ensure that the training data used for the LLMs is diverse and representative of different demographics, backgrounds, and writing styles, reducing biases that arise from skewed or limited training data.
- Bias Mitigation Techniques: Employ debiasing algorithms or fairness constraints during model training to reduce biases in the model's predictions and support fairer scoring outcomes.
- Human Oversight: Incorporate human review into the essay scoring process to double-check the LLM's decisions and intervene in cases of potential bias; human-in-the-loop approaches can help catch and correct biased judgments.
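One simple form of the bias-detection idea is to compare average predicted scores across groups of writers (for example, by first language, which TOEFL11 records). The sketch below is an illustrative check only; the grouping key, threshold, and example data are assumptions, and a score gap between groups is a coarse flag that would still need human auditing, since it may reflect genuine proficiency differences rather than bias.

```python
from collections import defaultdict

def score_gap_by_group(records, gap_threshold=0.5):
    """records: iterable of (group_label, predicted_score) pairs.
    Flags the scoring run if the gap between the highest- and lowest-scoring
    groups exceeds gap_threshold (threshold in assumed score units)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for group, score in records:
        sums[group] += score
        counts[group] += 1
    means = {g: sums[g] / counts[g] for g in sums}
    gap = max(means.values()) - min(means.values())
    return means, gap, gap > gap_threshold

# Example with made-up predictions grouped by a hypothetical first-language label.
means, gap, flagged = score_gap_by_group(
    [("ES", 3.1), ("ES", 2.9), ("ZH", 2.4), ("ZH", 2.6), ("AR", 2.8)]
)
print(means, gap, flagged)
```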

How can the MTS framework be integrated with human-in-the-loop approaches to further improve the essay scoring accuracy?

Integrating the MTS framework with human-in-the-loop approaches can enhance essay scoring accuracy in the following ways:
- Human Verification: Have human reviewers validate the LLM's scores and provide feedback on the trait evaluations; this feedback can be used to refine the scoring criteria and improve the model's performance over time.
- Adaptive Scoring: Let the LLM score each essay first, then have human experts review and adjust the score based on their domain knowledge and expertise. This iterative process can lead to more accurate and reliable scores.
- Feedback Loop: Establish a feedback loop in which human reviewers explain their scoring decisions, and use those explanations to train the LLMs to mimic human judgment more closely.
- Confidence Scoring: Have the LLM assign a confidence score to its own predictions so that human reviewers focus on essays where the model's confidence is low; this targeted approach makes human intervention more efficient (a rough sketch follows this list).

By combining the strengths of LLMs with human expertise, the MTS framework can leverage human-in-the-loop approaches to achieve higher accuracy and reliability in essay scoring.
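As a rough sketch of the confidence-scoring idea, one low-cost proxy for confidence under a multi-trait setup is the spread of the per-trait scores: essays where the traits disagree strongly get routed to a human reviewer. The spread measure, threshold, and example values below are assumptions for illustration, not part of the paper.

```python
import statistics

def route_for_review(trait_scores, disagreement_threshold=1.5):
    """Return the automatic (averaged) score plus a flag saying whether a human
    should review this essay, using the standard deviation of the per-trait
    scores as a crude confidence proxy (threshold is an assumed value)."""
    auto_score = statistics.mean(trait_scores)
    disagreement = statistics.stdev(trait_scores)
    needs_human = disagreement > disagreement_threshold
    return auto_score, needs_human

# High trait agreement -> keep the automatic score; low agreement -> human review.
print(route_for_review([6, 6, 7, 6]))   # (6.25, False)
print(route_for_review([3, 8, 5, 9]))   # (6.25, True)
```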