Core Concepts
The authors introduce DREsS, a large-scale dataset for rubric-based automated essay scoring (AES) designed specifically for English as a Foreign Language (EFL) learners, aiming to improve the accuracy and practicality of AES systems in EFL writing education.
Abstract
This research paper introduces DREsS, a new dataset for rubric-based automated essay scoring (AES) designed for English as a Foreign Language (EFL) learners.
Problem:
Existing AES models often fall short in EFL writing education because they are trained on essays unrelated to EFL learners and because suitable datasets are scarce; as a result, they output a single holistic score rather than detailed, rubric-based feedback.
DREsS Dataset:
To address this gap, the researchers created DREsS, a large-scale dataset comprising three subsets:
- DREsSNew: Contains 2,279 essays written by EFL undergraduate students and scored by English education experts using three key rubrics: content, organization, and language.
- DREsSStd.: Consists of 6,515 essays from existing datasets (ASAP, ASAP++, and ICNALE EE) standardized and rescaled to align with DREsS's rubrics.
- DREsSCASE: Includes 40,185 synthetic essay samples generated using CASE (Corruption-based Augmentation Strategy for Essays), a novel augmentation technique.
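The standardization behind DREsSStd. implies mapping each source dataset's score range onto DREsS's 1-to-5 scale at 0.5 increments. The paper's exact procedure may differ; as an illustrative assumption, a min-max linear rescaling followed by snapping to the nearest half point could look like:

```python
def rescale(score, old_min, old_max, new_min=1.0, new_max=5.0, step=0.5):
    """Linearly map a score from [old_min, old_max] onto [new_min, new_max],
    then snap to the nearest `step` (0.5 in DREsS). Illustrative only."""
    frac = (score - old_min) / (old_max - old_min)
    raw = new_min + frac * (new_max - new_min)
    return round(raw / step) * step
```

For example, a score of 12 on a 0-30 scale lands at 2.5 on the DREsS scale.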
CASE Augmentation:
CASE starts with well-written essays and introduces controlled errors based on the target score for each rubric. This method addresses the scarcity of low-scoring essays and improves the model's ability to provide accurate scores across the entire range.
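To make the idea concrete, here is a minimal sketch of one rubric-specific corruption in the spirit of CASE, targeting the organization rubric by displacing sentences. The mapping from target score to corruption strength is our hypothetical choice for illustration, not the paper's exact recipe:

```python
import random

def corrupt_for_organization(essay, target_score, max_score=5.0, seed=0):
    """Shuffle a fraction of sentences in a well-written essay: lower
    target scores displace more sentences. The score-to-strength mapping
    here is a hypothetical choice, not the paper's exact procedure."""
    sentences = [s.strip() for s in essay.split(".") if s.strip()]
    frac = 1.0 - target_score / max_score  # corruption strength
    k = max(1, int(len(sentences) * frac)) if frac > 0 else 0
    rng = random.Random(seed)
    if k:
        idxs = rng.sample(range(len(sentences)), k)
        picked = [sentences[i] for i in idxs]
        rng.shuffle(picked)
        for i, s in zip(idxs, picked):
            sentences[i] = s
    return ". ".join(sentences) + "."
```

Analogous corruptions for content (e.g. deleting or replacing on-topic sentences) and language (e.g. injecting grammatical errors) would follow the same pattern of tying corruption strength to the target rubric score.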
Experiments and Results:
The researchers evaluated various AES models using DREsS, including traditional models like EASE and NPCR, as well as pre-trained language models like BERT and large language models like Llama. Their findings indicate that:
- Fine-tuned BERT and Llama models, trained on the combined DREsS dataset, outperformed other baselines, demonstrating the effectiveness of data unification and CASE augmentation.
- CASE augmentation significantly improved performance, highlighting its value in generating realistic, low-scoring essays.
- State-of-the-art LLMs like GPT-4, while powerful, did not outperform fine-tuned smaller models like BERT in this specific task.
Significance:
DREsS provides a valuable resource for advancing AES research and developing more effective AES systems tailored for EFL learners. The study also highlights the importance of data quality and augmentation techniques in building robust AES models.
Limitations and Future Work:
The authors acknowledge limitations such as the focus on English and the potential cultural bias in writing styles. Future work could involve expanding DREsS to other languages and exploring alternative augmentation strategies for generating well-written essays.
Stats
DREsSNew includes 2,279 argumentative essays on 22 prompts, averaging 313.36 words and 21.19 sentences.
Essays in DREsSNew are scored on a range of 1 to 5, with increments of 0.5, based on content, organization, and language.
DREsSNew essays were written by undergraduate students with TOEFL writing scores ranging from 15 to 21.
Eleven instructors, experts in English education or Linguistics, annotated the essays.
DREsSStd. standardizes and unifies three existing rubric-based datasets: ASAP Prompt 7-8, ASAP++ Prompt 1-2, and ICNALE EE.
CASE augmentation generated synthetic data with scores ranging from 1.0 to 5.0, addressing the imbalance found in real-classroom datasets.
The best-performing model, trained on the combined DREsS dataset, outperformed other baselines by 45.44%.
GPT-4 achieved a QWK score 0.257 lower than fine-tuned BERT models.
CASE augmentation performed best with n_aug values of 0.5, 2, and 0.125 for content, organization, and language, respectively.
Synthetic essays from GPT-4 achieved a QWK score of 0.225 on average, while CASE augmentation achieved 0.661.
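The QWK figures above refer to Quadratic Weighted Kappa, the standard agreement metric in AES. As a reference point, a minimal pure-Python implementation (DREsS scores at 0.5 increments are simply treated as ordered labels):

```python
from collections import Counter

def quadratic_weighted_kappa(y_true, y_pred, labels):
    """Quadratic Weighted Kappa: agreement between two ordinal ratings,
    penalizing disagreements by the squared distance between labels."""
    n = len(labels)
    idx = {lab: i for i, lab in enumerate(labels)}
    # Observed co-occurrence matrix of (true, predicted) labels.
    observed = [[0.0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[idx[t]][idx[p]] += 1
    total = len(y_true)
    hist_t, hist_p = Counter(y_true), Counter(y_pred)
    num = den = 0.0
    for i, li in enumerate(labels):
        for j, lj in enumerate(labels):
            w = ((i - j) ** 2) / ((n - 1) ** 2)  # quadratic weight
            expected = hist_t[li] * hist_p[lj] / total  # chance agreement
            num += w * observed[i][j]
            den += w * expected
    return 1.0 - num / den
```

Perfect agreement yields 1.0, chance-level agreement yields about 0, and systematic disagreement goes negative.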
Quotes
"To date, there is a lack of usable datasets for training rubric-based AES models, as existing AES datasets provide only overall scores and/or make use of scores annotated by non-experts."
"DREsS will enable further research to provide a more accurate and practical AES system for EFL writing education."
"We also suggest CASE, a corruption-based augmentation strategy for Essays, employing three rubric-specific strategies to augment the dataset with corruption. DREsSCASE improves the baseline result by 45.44%."