
Automatic Grading Dataset for Japanese-English Sentence Translation Exercises


Core Concepts
This paper introduces a dataset for the automatic grading of Japanese-English sentence translation exercises and benchmarks BERT and GPT models on it. The task is formalized as grading student responses against rubric criteria specified by educators, mirroring actual practice in educational settings.
Abstract
The study proposes automating the correction and feedback process for sentence translation exercises (STEs) to enhance language learning. Japanese-English STEs are central to early L2 learning, aiding grammar acquisition and sensitivity to nuances of expression, and automating their grading could transform educational environments by providing efficient feedback. The dataset includes 21 questions with detailed rubrics and annotated student responses. Baselines such as fine-tuned BERT and GPT models are evaluated: they grade correct responses with high accuracy but struggle with incorrect ones, underscoring the need for further work on automated grading systems.
Stats
Using this dataset, we demonstrate the performance of baselines including fine-tuned BERT and GPT models with few-shot in-context learning. The fine-tuned BERT baseline classified correct responses with approximately 90% accuracy but fell below 80% on incorrect responses. Experimental results show that GPT models with few-shot learning perform worse than fine-tuned BERT. The dataset comprises 21 Japanese-to-English STE questions with detailed rubrics and annotated student responses, with an average of 167 responses per question collected from students and crowd workers.
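To make the BERT baseline concrete, here is a minimal sketch of fine-tuning a sequence classifier to judge whether a response satisfies a single rubric criterion. This is not the paper's exact architecture: the model name, the (criterion, response) sentence-pair encoding, and the example data are all assumptions for illustration.

```python
# Minimal sketch: fine-tune BERT to classify whether a student response
# satisfies one rubric criterion. All data and names below are hypothetical.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-multilingual-cased"  # assumption: any BERT variant

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

criterion = "Uses the present perfect tense correctly"  # hypothetical rubric item
responses = [
    ("I have lived in Tokyo for three years.", 1),   # satisfies the criterion
    ("I am living in Tokyo since three years.", 0),  # does not satisfy it
]

# Encode each (criterion, response) pair as a sentence pair, NLI-style.
batch = tokenizer(
    [criterion] * len(responses),
    [r for r, _ in responses],
    padding=True, truncation=True, return_tensors="pt",
)
labels = torch.tensor([y for _, y in responses])

model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss  # cross-entropy over satisfied / not
loss.backward()
optimizer.step()
print(f"training loss: {loss.item():.4f}")
```

Pairing the rubric criterion with the response in one input is one plausible setup; in practice the full dataset would train one such judgment per criterion per response.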
Quotes
"The contributions of this study are formulating automated grading of sentence translation exercises as a new task referencing actual operation in educational settings." "We construct a dataset for automated STE grading according to this task design, demonstrating feasibility." "Our newly proposed task presents a challenging issue even for state-of-the-art large language models."

Deeper Inquiries

How can automated systems like BERT and GPT be improved to better handle incorrect responses in language translation exercises?

Automated graders built on BERT and GPT could handle incorrect responses better if trained on more diverse data covering the common errors learners actually make. Exposing the models to a wide variety of mistakes helps them recognize, and give feedback on, a broader spectrum of inaccuracies. Fine-tuning specifically for error detection and correction within translation exercises can further improve performance on incorrect responses. A lightweight variant of the same idea is to surface typical errors directly in a few-shot prompt, as sketched below.
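The helper below assembles a few-shot grading prompt that includes both correct and incorrect exemplar responses, so common learner errors appear in context. The function name, prompt format, and labels are hypothetical, not taken from the paper.

```python
# Hypothetical few-shot prompt builder for rubric-based grading.
def build_grading_prompt(source_ja, criterion, exemplars, response):
    """exemplars: list of (response, label, note) tuples showing both correct
    and incorrect answers, so the model sees common learner errors in context."""
    lines = [
        f"Japanese source: {source_ja}",
        f"Rubric criterion: {criterion}",
        "Label each English response as SATISFIED or NOT_SATISFIED.",
        "",
    ]
    for resp, label, note in exemplars:
        lines.append(f"Response: {resp}\nLabel: {label}  ({note})")
    lines.append(f"Response: {response}\nLabel:")
    return "\n".join(lines)

prompt = build_grading_prompt(
    source_ja="私は三年間東京に住んでいます。",
    criterion="Uses the present perfect (continuous) tense",
    exemplars=[
        ("I have lived in Tokyo for three years.", "SATISFIED", "correct tense"),
        ("I live in Tokyo for three years.", "NOT_SATISFIED", "simple present error"),
    ],
    response="I am living in Tokyo since three years.",
)
print(prompt)  # send to any chat/completion API of your choice
```

Including negative exemplars with short error notes is one way to steer an in-context learner toward the failure modes the fine-tuned baseline also struggles with.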

What implications does automating the correction and feedback process have on traditional teaching methods?

Automating correction and feedback through systems like BERT and GPT has significant implications for traditional teaching. It shortens the turnaround on feedback, giving students immediate guidance on their work, and it reduces the manual grading burden on educators, freeing time for lesson planning and individualized student support. Automated systems also apply consistent evaluation criteria across all submissions, supporting fairness in assessment.

How can the findings from this study be applied to other languages or educational contexts beyond Japanese-English translations?

The findings on automated grading with BERT and GPT can be transferred to other languages by repeating the dataset-creation process with linguistic experts for each new language pair. The framework used for Japanese-English translations applies directly, with rubrics adapted to the grammar rules and vocabulary nuances of the target pair. Beyond translation, the same rubric-based grading techniques could assess writing proficiency, grammatical accuracy, or vocabulary usage in any subject where fine-grained evaluation is required. A language-agnostic data schema makes this reuse concrete, as sketched below.
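A minimal sketch of such a language-agnostic schema, assuming one boolean judgment per rubric criterion for each response; all field names are illustrative rather than the paper's actual dataset format.

```python
# Hypothetical, language-pair-agnostic record schema for one STE grading item.
from dataclasses import dataclass, field

@dataclass
class RubricCriterion:
    description: str  # e.g. "uses the present perfect tense"
    points: int = 1   # weight of this criterion in the total score

@dataclass
class STEItem:
    source_lang: str      # e.g. "ja", "de", "fr"
    target_lang: str      # e.g. "en"
    source_sentence: str  # sentence the student must translate
    rubric: list[RubricCriterion] = field(default_factory=list)
    # each response pairs a translation with per-criterion judgments
    responses: list[tuple[str, list[bool]]] = field(default_factory=list)

item = STEItem(
    source_lang="ja", target_lang="en",
    source_sentence="私は三年間東京に住んでいます。",
    rubric=[RubricCriterion("uses the present perfect (continuous) tense")],
    responses=[("I have lived in Tokyo for three years.", [True])],
)
```

Swapping `source_lang` and `target_lang`, together with a new rubric written by experts in that pair, is the only change needed to reuse the grading pipeline.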