Core Concepts
This paper introduces a dataset for automatic grading of Japanese-English Sentence Translation Exercises (STEs) and benchmarks BERT and GPT models on it. The task is formalized as grading each student response against rubric criteria specified by educators.
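As a purely hypothetical illustration of this task design (the field names below are invented and not the released schema), one graded item could pair a source sentence and educator-written rubric criteria with a student response annotated per criterion:

```python
# Hypothetical example of one graded STE item (illustrative field names only,
# not the dataset's actual schema).
item = {
    "question_ja": "私は昨日図書館に行きました。",  # sentence to translate
    "reference_en": "I went to the library yesterday.",
    "rubric": [
        {"criterion": "The main verb is in the past tense", "points": 1},
        {"criterion": "The time expression 'yesterday' is included", "points": 1},
    ],
    "response": {
        "text": "I go to the library yesterday.",
        "criterion_grades": [0, 1],  # annotator judgment per rubric criterion
    },
}
```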
Abstract
The study proposes automating the correction and feedback process for translation exercises to enhance language learning. Baseline models based on fine-tuned BERT and on GPT with few-shot in-context learning are evaluated; they grade correct responses accurately but struggle with incorrect ones.
Japanese-English STEs are crucial in early L2 learning, aiding the acquisition of grammar and nuances of expression. Automating the grading process can transform educational environments by providing efficient feedback.
The dataset includes 21 questions with detailed rubrics and annotated student responses. The study highlights that incorrect responses are much harder to grade than correct ones, underscoring the need for further research on automated grading systems.
Stats
Using this dataset, the authors demonstrate the performance of baselines including fine-tuned BERT models and GPT models with few-shot in-context learning (a minimal sketch of this setup follows these stats).
The fine-tuned BERT baseline classified correct responses with approximately 90% accuracy, but fell below 80% accuracy on incorrect responses.
Experimental results show that GPT models with few-shot learning perform worse than the fine-tuned BERT baseline.
The dataset comprises 21 Japanese-to-English STE questions with detailed rubrics and annotated student responses.
An average of 167 responses per question was collected from students and crowd workers.
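A minimal sketch of how such a fine-tuned BERT baseline might be set up, assuming each example encodes a rubric criterion and a student response as a sentence pair with a binary satisfied/unsatisfied label; the checkpoint name, example texts, and labels are assumptions for illustration, not the authors' exact configuration:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative checkpoint; the paper's exact BERT variant may differ.
name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

# One hypothetical training example: criterion and response as a sentence pair.
criterion = "The main verb is in the past tense."
response = "I go to the library yesterday."
label = torch.tensor([0])  # 0 = criterion not satisfied, 1 = satisfied

inputs = tokenizer(criterion, response, return_tensors="pt", truncation=True)
loss = model(**inputs, labels=label).loss
loss.backward()  # one gradient step inside a standard fine-tuning loop
```

The GPT few-shot baseline would instead place a handful of already-graded example responses in the prompt and ask the model to judge each rubric criterion of a new response in context.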
Quotes
"The contributions of this study are formulating automated grading of sentence translation exercises as a new task referencing actual operation in educational settings."
"We construct a dataset for automated STE grading according to this task design, demonstrating feasibility."
"Our newly proposed task presents a challenging issue even for state-of-the-art large language models."