Bibliographic Information: Gao, R., Guo, X., Li, X., Narayanan, A.B.L., Thomas, N., & Srinivasa, A.R. (2024). Towards Scalable Automated Grading: Leveraging Large Language Models for Conceptual Question Evaluation in Engineering. Proceedings of Machine Learning Research 1:1–21, 2024. FM-EduAssess Workshop at NeurIPS 2024.
Research Objective: This study investigates the feasibility of using ChatGPT (GPT-4o) for automated grading of conceptual questions in an undergraduate Mechanical Engineering course, comparing its performance to human teaching assistants (TAs).
Methodology: The researchers used ten quiz datasets from a Mechanical Engineering course at Texas A&M University, each with approximately 225 student responses. They evaluated GPT-4o's grading performance in both zero-shot and few-shot settings using Spearman's rank correlation coefficient and Root Mean Square Error (RMSE) against TA grading as the gold standard.
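The two evaluation metrics named above can be computed directly from paired score lists. The sketch below uses only the standard library and hypothetical scores (the paper's actual grading data is not reproduced here); Spearman's rank correlation is computed as Pearson correlation over average ranks, handling ties.

```python
from math import sqrt

def ranks(xs):
    """Average (tie-aware) 1-based ranks, as used by Spearman's rho."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend over a run of tied values
        avg = (i + j) / 2 + 1  # average of the tied 1-based positions
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(a, b):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sqrt(sum((x - ma) ** 2 for x in ra))
    sb = sqrt(sum((y - mb) ** 2 for y in rb))
    return cov / (sa * sb)

def rmse(a, b):
    """Root Mean Square Error between two score lists."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Hypothetical TA (gold standard) and GPT-4o scores for one quiz question.
ta_scores  = [10, 8, 9, 6, 7, 10, 5, 8]
llm_scores = [9, 8, 10, 5, 7, 9, 4, 7]

print(f"Spearman's rho: {spearman_rho(ta_scores, llm_scores):.4f}")
print(f"RMSE: {rmse(ta_scores, llm_scores):.4f}")
```

A high rho with a nonzero RMSE would indicate the model ranks students consistently with the TA but applies a systematic offset, which matches the paper's observation that GPT-4o tends to grade more stringently.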
Key Findings: GPT-4o demonstrated strong correlation with TA grading in the zero-shot setting, achieving a Spearman's rank correlation coefficient exceeding 0.6 in seven out of ten datasets and reaching a high of 0.9387. However, the model struggled with nuanced answers involving synonyms not present in the rubric and tended to grade more stringently than human TAs in ambiguous cases. The few-shot approach, incorporating example answers, did not consistently improve performance.
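The zero-shot and few-shot settings differ only in whether TA-graded example answers are included in the prompt. The following is a hypothetical sketch of such a prompt builder, not the authors' actual prompt wording; all strings and the function name are illustrative assumptions.

```python
def build_grading_prompt(question, rubric, student_answer, examples=()):
    """Assemble a grading prompt; pass TA-graded (answer, score) pairs
    in `examples` for the few-shot setting, or none for zero-shot."""
    parts = [
        "You are grading a conceptual Mechanical Engineering quiz question.",
        f"Question: {question}",
        f"Rubric: {rubric}",
    ]
    for answer, score in examples:  # few-shot examples, if any
        parts.append(f"Example answer: {answer}\nTA score: {score}")
    parts.append(f"Student answer: {student_answer}")
    parts.append("Return only a numeric score according to the rubric.")
    return "\n\n".join(parts)

# Zero-shot: no examples; few-shot: include TA-graded samples.
zero_shot = build_grading_prompt(
    "Why does a thin-walled tube fail in torsion before a solid rod?",
    "2 pts: mentions shear stress distribution; 1 pt: partial reasoning.",
    "Because the wall is thin, stress concentrates there.",
)
few_shot = build_grading_prompt(
    "Why does a thin-walled tube fail in torsion before a solid rod?",
    "2 pts: mentions shear stress distribution; 1 pt: partial reasoning.",
    "Because the wall is thin, stress concentrates there.",
    examples=[("Shear stress is carried entirely by the thin wall.", 2)],
)
```

The paper's finding that few-shot examples did not consistently help suggests that rubric quality, rather than example count, dominates grading accuracy in this setting.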
Main Conclusions: ChatGPT shows potential as a tool for grading conceptual questions in Mechanical Engineering, offering scalability and consistency, especially with well-defined rubrics. However, further research is needed to address its limitations in handling nuanced language and ambiguity.
Significance: This research contributes to the growing field of automated grading in STEM education, specifically addressing the gap in engineering disciplines. It highlights the potential of LLMs for reducing grading workload while maintaining consistency, but also emphasizes the need for careful rubric design and prompt engineering.
Limitations and Future Research: The study was limited to a single Mechanical Engineering course and a specific LLM (GPT-4o). Future research should explore the generalizability of these findings across different engineering disciplines, question types, and LLM architectures. Further investigation into rubric design, prompt engineering, and fine-tuning LLMs for domain-specific grading is also warranted.
Source: Rujun Gao et al., arxiv.org, 11-07-2024. https://arxiv.org/pdf/2411.03659.pdf