DREsS: A Dataset for Rubric-based Essay Scoring Focused on English as a Foreign Language


Core Concepts
The authors introduce DREsS, a large-scale dataset for rubric-based automated essay scoring (AES) designed specifically for English as a Foreign Language (EFL) learners, aiming to improve the accuracy and practicality of AES systems in EFL writing education.
Abstract

This research paper introduces DREsS, a new dataset for rubric-based automated essay scoring (AES) designed for English as a Foreign Language (EFL) learners.

Problem:

Existing AES models often fall short in EFL writing education: they are trained on essays irrelevant to EFL learners, suitable datasets are scarce, and they typically produce a single holistic score rather than detailed rubric-based feedback.

DREsS Dataset:

To address this gap, the researchers created DREsS, a large-scale dataset comprising three subsets:

  • DREsSNew: Contains 2,279 essays written by EFL undergraduate students and scored by English education experts using three key rubrics: content, organization, and language.
  • DREsSStd.: Consists of 6,515 essays from existing datasets (ASAP, ASAP++, and ICNALE EE), standardized and rescaled to align with DREsS's rubrics (a rescaling sketch follows this list).
  • DREsSCASE: Includes 40,185 synthetic essay samples generated using CASE (Corruption-based Augmentation Strategy for Essays), a novel augmentation technique.
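
The standardization behind DREsSStd. implies mapping scores from each source dataset's native range onto DREsS's 1-5 scale. Below is a minimal sketch of such a linear rescaling, assuming a simple min-max mapping snapped to 0.5 increments; the source range and function name are illustrative, not the authors' exact procedure.

```python
def rescale_score(score: float, src_min: float, src_max: float) -> float:
    """Linearly map a score from [src_min, src_max] onto DREsS's 1-5 scale,
    then snap it to the nearest 0.5 increment."""
    scaled = 1.0 + 4.0 * (score - src_min) / (src_max - src_min)
    return round(scaled * 2) / 2  # snap to 0.5 steps

# e.g., a rubric originally scored 0-6 (illustrative range):
print(rescale_score(4.0, 0.0, 6.0))  # -> 3.5
```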

CASE Augmentation:

CASE starts with well-written essays and introduces controlled errors based on the target score for each rubric. This method addresses the scarcity of low-scoring essays and improves the model's ability to provide accurate scores across the entire range.
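
To make the idea concrete, here is a minimal sketch of corruption-based augmentation in the spirit of CASE: it takes a well-written essay and damages a score-dependent fraction of its sentences. The linear score-to-corruption mapping and the word-shuffling corruption are simplifying assumptions; the paper uses distinct rubric-specific strategies for content, organization, and language.

```python
import random

def corrupt_for_target(sentences: list[str], target: float,
                       lo: float = 1.0, hi: float = 5.0) -> list[str]:
    """Corrupt a well-written essay (assumed to merit the top score) so that
    roughly (hi - target) / (hi - lo) of its sentences are damaged --
    a simplified stand-in for CASE's rubric-specific strategies."""
    frac = (hi - target) / (hi - lo)
    n_corrupt = round(frac * len(sentences))
    corrupted = list(sentences)
    for i in random.sample(range(len(sentences)), n_corrupt):
        words = corrupted[i].split()
        random.shuffle(words)  # placeholder corruption: scramble word order
        corrupted[i] = " ".join(words)
    return corrupted

# A target score of 3.0 corrupts about half of the sentences.
essay = ["The argument is clear.", "Evidence supports it.",
         "The conclusion follows.", "Transitions guide the reader."]
print(corrupt_for_target(essay, target=3.0))
```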

Experiments and Results:

The researchers evaluated various AES models on DREsS, including existing AES baselines such as EASE and NPCR, as well as language models such as BERT and Llama. Their findings indicate that:

  • Fine-tuned BERT and Llama models, trained on the combined DREsS dataset, outperformed other baselines, demonstrating the effectiveness of data unification and CASE augmentation (a fine-tuning sketch follows this list).
  • CASE augmentation significantly improved performance, highlighting its value in generating realistic, low-scoring essays.
  • State-of-the-art LLMs like GPT-4, while powerful, did not outperform fine-tuned smaller models like BERT in this specific task.
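
As an illustration of the fine-tuning setup above, the sketch below scores a single rubric with BERT as a regression head. The checkpoint, truncation length, and clamping are assumptions for illustration, and the model shown is untrained; it would need fine-tuning on DREsS before its outputs mean anything.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# num_labels=1 gives a single-output regression head (trained with MSE loss).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=1)
model.eval()

def score_essay(essay: str) -> float:
    """Predict one rubric score for an essay, clamped to the 1-5 range."""
    inputs = tokenizer(essay, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        pred = model(**inputs).logits.squeeze().item()
    return max(1.0, min(5.0, pred))
```

One plausible setup is a head like this per rubric (content, organization, language), or a single three-output head, fine-tuned on the combined DREsS training set.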

Significance:

DREsS provides a valuable resource for advancing AES research and developing more effective AES systems tailored for EFL learners. The study also highlights the importance of data quality and augmentation techniques in building robust AES models.

Limitations and Future Work:

The authors acknowledge limitations such as the focus on English and the potential cultural bias in writing styles. Future work could involve expanding DREsS to other languages and exploring alternative augmentation strategies for generating well-written essays.


Stats
  • DREsSNew includes 2,279 argumentative essays on 22 prompts, averaging 313.36 words and 21.19 sentences.
  • Essays in DREsSNew are scored on a range of 1 to 5, in increments of 0.5, on content, organization, and language.
  • DREsSNew essays were written by undergraduate students with TOEFL writing scores ranging from 15 to 21. Eleven instructors, experts in English education or linguistics, annotated the essays.
  • DREsSStd. standardizes and unifies three existing rubric-based datasets: ASAP Prompts 7-8, ASAP++ Prompts 1-2, and ICNALE EE.
  • CASE augmentation generated synthetic data with scores ranging from 1.0 to 5.0, addressing the score imbalance found in real-classroom datasets.
  • The best-performing model, trained on the combined DREsS dataset, outperformed other baselines by 45.44%.
  • GPT-4 achieved a QWK score 0.257 lower than fine-tuned BERT models.
  • CASE augmentation for content, organization, and language performed best with n_aug values of 0.5, 2, and 0.125, respectively.
  • Synthetic essays from GPT-4 achieved a QWK score of 0.225 on average, while CASE augmentation achieved 0.661.
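
Several of the figures above are QWK (quadratic weighted kappa) values, the standard agreement metric for AES. A minimal sketch of computing it with scikit-learn, assuming half-point scores are first doubled onto integer bins:

```python
from sklearn.metrics import cohen_kappa_score

def qwk(human: list[float], model: list[float]) -> float:
    """Quadratic weighted kappa between two raters; 1-5 scores in 0.5
    steps are doubled to integer bins (2..10) as the function expects."""
    to_bins = lambda xs: [int(round(x * 2)) for x in xs]
    return cohen_kappa_score(to_bins(human), to_bins(model),
                             weights="quadratic")

print(qwk([3.0, 4.5, 2.0, 5.0], [3.5, 4.0, 2.0, 4.5]))
```
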
Quotes
"To date, there is a lack of usable datasets for training rubric-based AES models, as existing AES datasets provide only overall scores and/or make use of scores annotated by non-experts." "DREsS will enable further research to provide a more accurate and practical AES system for EFL writing education." "We also suggest CASE, a corruption-based augmentation strategy for Essays, employing three rubric-specific strategies to augment the dataset with corruption. DREsSCASE improves the baseline result by 45.44%."

Key Insights Distilled From

by Haneul Yoo et al., arxiv.org, 11-05-2024

https://arxiv.org/pdf/2402.16733.pdf
DREsS: Dataset for Rubric-based Essay Scoring on EFL Writing

Deeper Inquiries

How can the insights from DREsS and CASE be applied to develop AES systems for other languages beyond English?

While DREsS focuses on English as a Foreign Language (EFL), the insights gleaned from its development and the CASE augmentation strategy offer valuable guidance for creating AES systems for other languages. Here's how:

  • Rubric Adaptation: The core rubrics of content, organization, and language are universally applicable to writing assessment. However, the specific criteria within each rubric need to be adapted to reflect the linguistic features and writing conventions of the target language. For example, aspects like grammar rules, cohesive devices, and rhetorical structures vary significantly across languages.
  • Dataset Collection and Annotation: Building a new DREsS-like dataset for another language would require collecting essays from L2 learners of that language. Crucially, annotation must be carried out by instructors or experts proficient in both the target language and language pedagogy, ensuring that scores accurately reflect learners' writing proficiency in the context of that specific language.
  • Language-Specific CASE: The CASE augmentation strategy can be modified to target the common errors and challenges faced by learners of the specific language. This requires identifying the typical grammatical errors, stylistic inconsistencies, and organizational weaknesses of that learner population. For instance, a CASE implementation for a Romance language might focus on corrupting verb conjugations and gender agreement, while one for a tonal language might emphasize tonal errors. A sketch of such a per-language strategy registry follows this answer.
  • Leveraging Existing Resources: Depending on the language, existing grammatical error correction (GEC) datasets and tools can be leveraged for the "language" aspect of CASE augmentation. Parallel corpora and translation resources can also help in adapting rubrics and understanding language-specific writing conventions.

In essence, the principles behind DREsS and CASE provide a robust framework, but successful adaptation to other languages hinges on carefully considering the unique linguistic properties and pedagogical context of the target language.
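
One way to operationalize language-specific CASE, as discussed in the list above, is a per-language registry mapping each rubric to a corruption strategy. The languages and strategy names below are illustrative assumptions, not implementations from the paper:

```python
# Hypothetical registry: language -> rubric -> name of a corruption routine.
# Each named routine would be implemented against the error patterns common
# to that learner population (all names here are illustrative).
CASE_STRATEGIES: dict[str, dict[str, str]] = {
    "spanish": {
        "content": "delete_supporting_sentences",
        "organization": "shuffle_paragraph_order",
        "language": "corrupt_gender_agreement_and_conjugation",
    },
    "korean": {
        "content": "delete_supporting_sentences",
        "organization": "shuffle_paragraph_order",
        "language": "corrupt_particles_and_honorifics",
    },
}
```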

What are the ethical considerations of using AI for essay scoring, particularly in educational contexts, and how can DREsS contribute to addressing these concerns?

The use of AI for essay scoring, particularly in education, raises several ethical considerations:

  • Bias and Fairness: AI models are susceptible to inheriting and amplifying biases present in the data they are trained on. If the training data reflects existing societal biases (e.g., related to gender, race, or socioeconomic background), an AES system might unfairly penalize certain groups of students. DREsS, with its focus on EFL learners and diverse essay prompts, provides a foundation for building more inclusive AES systems, but careful attention must be paid to mitigating bias during data collection, annotation, and model development.
  • Transparency and Explainability: The "black box" nature of some AI models makes it challenging to understand how they arrive at a particular score. This lack of transparency can erode trust in the system, especially if students cannot understand why they received a specific grade. DREsS, with its detailed rubric-based scoring, promotes transparency by providing insight into the specific aspects of writing being evaluated.
  • Impact on Learning and Teaching: Over-reliance on AES systems could stifle creativity and critical thinking if students focus solely on producing essays that score highly with the AI. Educators' role should not be diminished; AES should be viewed as a tool to assist, not replace, human judgment. By providing detailed rubric-level scores, DREsS can facilitate more targeted feedback and support for student learning.
  • Data Privacy and Security: Collecting and storing student essays raises concerns about data privacy and security. It is crucial to obtain informed consent, anonymize data whenever possible, and implement robust security measures to protect sensitive information.

DREsS contributes to addressing these concerns by:

  • Promoting Research on Fairness: The dataset can be used to develop and test new methods for detecting and mitigating bias in AES systems.
  • Encouraging Explainable AES: The rubric-based scoring in DREsS provides a starting point for developing more transparent and interpretable AES models.
  • Supporting Personalized Feedback: The analysis of error patterns in DREsS can enable targeted writing interventions and personalized feedback strategies.

By fostering research and development in these areas, DREsS can help ensure that AI-powered essay scoring is used ethically and responsibly in educational settings.

Could the analysis of the specific error patterns of EFL learners in DREsS be used to develop targeted writing interventions and personalized feedback strategies?

Yes, the analysis of specific error patterns in DREsS, particularly those found in DREsSNew, holds significant potential for developing targeted writing interventions and personalized feedback strategies for EFL learners. Here's how:

  • Error Pattern Identification: By analyzing the essays in DREsSNew, researchers and educators can identify common grammatical errors, stylistic inconsistencies, and organizational weaknesses specific to EFL learners. For example, the analysis highlighted tendencies toward longer essays with simpler sentences, frequent stop words, and specific grammatical errors (a sketch of such surface-level analysis follows this answer).
  • Targeted Intervention Development: Understanding these patterns allows for interventions that address specific areas of difficulty. For instance, if learners frequently struggle with articles or prepositions, exercises and lessons can be tailored to those grammatical aspects.
  • Personalized Feedback Generation: The insights from DREsS can be integrated into AES systems or other writing tools to provide personalized feedback. Instead of generic comments, the system can pinpoint the specific error patterns in a student's writing and offer tailored suggestions for improvement.
  • Adaptive Learning Platforms: DREsS data can be used to build adaptive learning platforms that adjust to individual learner needs. By tracking a learner's progress and identifying persistent error patterns, a platform can recommend specific exercises, provide targeted feedback, and personalize the learning pathway.

By leveraging the rich data on EFL learner errors in DREsS, writing instruction can become more effective and personalized. This data-driven approach helps learners improve more efficiently by focusing on their specific areas of need.
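
As a concrete illustration of the error-pattern analysis referenced above, the sketch below computes two surface statistics noted for EFL writing (short, simple sentences and heavy stop-word use) and turns them into feedback flags. The thresholds and the stop-word list are illustrative assumptions:

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "but", "is", "are",
              "of", "to", "in", "it", "that", "this"}

def flag_feedback(essay: str) -> list[str]:
    """Flag surface-level patterns in an essay; thresholds are illustrative."""
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    words = essay.lower().split()
    avg_len = len(words) / max(len(sentences), 1)
    stop_ratio = sum(w in STOP_WORDS for w in words) / max(len(words), 1)

    flags = []
    if avg_len < 12:
        flags.append("Short sentences: try combining ideas with connectives.")
    if stop_ratio > 0.45:
        flags.append("Heavy function-word use: vary vocabulary.")
    return flags
```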