Key Concepts
AI-generated work closely approaches the quality of student work in university-level physics coding assignments but remains detectable by human evaluators.
Summary
This study compares the performance of students, GPT-3.5, and GPT-4 on physics coding assignments at Durham University. It evaluates AI-generated submissions against purely student-authored work and a mixed category, highlighting that human markers can detect AI-generated content.
Abstract:
- Compared submissions from the ChatGPT variants GPT-3.5 and GPT-4, with and without prompt engineering, to student submissions.
- Students outperformed AI submissions by a statistically significant margin.
- Blinded markers identified authorship accurately, rating submissions on a scale from 'Definitely Human' to 'Definitely AI'.
Introduction:
- Coding courses are essential in university curricula globally.
- The study focuses on AI's impact on the practical coding component of the physics degree at Durham University.
Methodology:
- Assesses the effectiveness of Large Language Models (LLMs) using a blinded marking approach.
- Physics coding assessments emphasize plot quality and code performance for simulations (an illustrative sketch follows this list).
Results:
- Students outperformed all AI categories, with GPT-4 scoring highest among AIs.
- Prompt engineering significantly improved scores for both GPT models.
Discussion:
- LLMs have not yet surpassed human proficiency in physics coding assignments.
- Unique design choices by students differentiate their work from AI-generated content.
Limitations:
- Pre-processing plays a crucial role in preparing the assignment material for the AI, and the choices made at this stage may affect output quality.
Conclusion:
- GPT models show improvement across successive versions but have not yet surpassed human capabilities.
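To make the kind of assessment concrete, the following is a minimal, hypothetical Python sketch of the sort of task described under Methodology (producing a plot from a simple physics simulation). The system, integration method, parameters, and function name are illustrative assumptions, not details taken from the study.

```python
# Hypothetical example of a physics coding task: simulate a damped harmonic
# oscillator with simple Euler integration and plot the result. The system
# and parameters are assumptions for illustration only.
import numpy as np
import matplotlib.pyplot as plt

def simulate_damped_oscillator(omega0=2.0, gamma=0.1, x0=1.0, v0=0.0,
                               dt=1e-3, t_max=20.0):
    """Integrate x'' + 2*gamma*x' + omega0^2 * x = 0 with the Euler method."""
    n_steps = int(t_max / dt)
    t = np.linspace(0.0, t_max, n_steps)
    x = np.empty(n_steps)
    v = np.empty(n_steps)
    x[0], v[0] = x0, v0
    for i in range(1, n_steps):
        a = -2.0 * gamma * v[i - 1] - omega0**2 * x[i - 1]  # acceleration
        v[i] = v[i - 1] + a * dt
        x[i] = x[i - 1] + v[i - 1] * dt
    return t, x

t, x = simulate_damped_oscillator()

# Markers in this kind of course typically assess plot quality: axis labels,
# units, a legend, and a clear title.
plt.plot(t, x, label="displacement")
plt.xlabel("time (s)")
plt.ylabel("displacement (m)")
plt.title("Damped harmonic oscillator (Euler integration)")
plt.legend()
plt.tight_layout()
plt.savefig("oscillator.png", dpi=150)
```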
Statistics
Students averaged 91.9% (SE: 0.4), surpassing the highest-performing AI submission category, GPT-4 with prompt engineering, which scored 81.1% (SE: 0.8), a statistically significant difference (p = 2.482 × 10⁻¹⁰).
Prompt engineering significantly improved scores for both GPT-4 (p = 1.661 × 10⁻⁴) and GPT-3.5 (p = 4.967 × 10⁻⁹).
Blinded markers identified authorship with an average accuracy of 85.3%.
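The summary does not state which statistical test produced the p-values above. The sketch below shows one plausible way such a comparison could be made from reported group means and standard errors (a Welch-style two-sample t-test); the sample sizes used here are placeholder assumptions, not figures from the study.

```python
# Hypothetical sketch of a two-sample comparison of the kind reported above.
# The study's actual test and sample sizes are not given in this summary;
# n_students and n_ai below are placeholder assumptions.
import math
from scipy import stats

def welch_test_from_summary(mean1, se1, n1, mean2, se2, n2):
    """Welch's t-test built from group means and standard errors of the mean."""
    t_stat = (mean1 - mean2) / math.sqrt(se1**2 + se2**2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se1**2 + se2**2) ** 2 / (se1**4 / (n1 - 1) + se2**4 / (n2 - 1))
    p_value = 2.0 * stats.t.sf(abs(t_stat), dof)
    return t_stat, dof, p_value

# Students: mean 91.9%, SE 0.4; GPT-4 with prompt engineering: mean 81.1%, SE 0.8.
n_students, n_ai = 100, 50  # placeholder sample sizes
t_stat, dof, p = welch_test_from_summary(91.9, 0.4, n_students, 81.1, 0.8, n_ai)
print(f"t = {t_stat:.2f}, dof = {dof:.1f}, p = {p:.3g}")
```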
Quotes
"Students averaged 91.9%, surpassing the highest performing AI submission category."
"Prompt engineering significantly improved scores for both GPT models."