Core Concepts
This study evaluates the performance of ChatGPT, GPT-4, and Microsoft Bing chatbots in answering questions from the Graduate Record Examination (GRE), including both verbal and quantitative reasoning sections.
Abstract
The study examines the capabilities of three AI chatbots (ChatGPT, GPT-4, and Microsoft Bing) in answering questions from the Graduate Record Examination (GRE), a standardized test used by graduate schools to assess applicants' readiness for graduate-level academic work.
The researchers analyzed the chatbots' performance on 137 quantitative reasoning questions and 157 verbal reasoning questions from the GRE. The quantitative questions covered skills in arithmetic, algebra, geometry, and data analysis, while the verbal questions tested reading comprehension, text completion, and sentence equivalence.
The results show that GPT-4 outperformed the other two chatbots on both sections, achieving an 83.21% success rate on the quantitative questions and an 87.26% success rate on the verbal questions. ChatGPT and Bing also performed reasonably well: ChatGPT scored 57.66% and 71.34% on the quantitative and verbal sections, respectively, while Bing scored 48.90% and 65.61%.
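These success rates correspond to whole-number question counts, which can be cross-checked with a short sketch. Note that the per-section correct-answer counts below are inferred by rounding rate × total; they are not reported in the study itself:

```python
# Reported success rates (%) for the 137 quantitative and 157 verbal
# GRE questions, per chatbot.
TOTALS = {"quant": 137, "verbal": 157}
RATES = {
    "GPT-4":   {"quant": 83.21, "verbal": 87.26},
    "ChatGPT": {"quant": 57.66, "verbal": 71.34},
    "Bing":    {"quant": 48.90, "verbal": 65.61},
}

def inferred_correct(rate_pct: float, total: int) -> int:
    """Round rate * total to the nearest whole question."""
    return round(rate_pct / 100 * total)

for bot, rates in RATES.items():
    for section, rate in rates.items():
        n = inferred_correct(rate, TOTALS[section])
        print(f"{bot:7s} {section:6s} ~{n}/{TOTALS[section]} correct")
```

Each reported percentage rounds cleanly to an integer number of correct answers (e.g., 83.21% of 137 ≈ 114 questions), which is consistent with the question totals given in the study.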
The researchers also evaluated the chatbots' performance on image-based quantitative questions, where GPT-4 again demonstrated the highest capability in accurately interpreting the images and providing correct solutions. Bing and ChatGPT struggled more with these questions, with ChatGPT in particular often failing to extract the necessary information from the provided images.
The findings suggest that these AI chatbots, particularly GPT-4, have the potential to be valuable tools for test preparation and personalized learning in educational settings. However, the researchers also highlight the need to ensure fair competition in online exams, as the availability of these advanced chatbots could enable academic misconduct if not properly addressed.
Stats
The study analyzed 137 quantitative reasoning questions and 157 verbal reasoning questions from the GRE.
Quotes
"GPT-4 demonstrated the highest proficiency among the chatbots when it came to answering verbal questions, with a success rate of 87.26%."
"Bing's performance in image-based quantitative questions was relatively better than ChatGPT, which struggled to interpret the external images for many questions."
"The findings suggest that these AI chatbots, particularly GPT-4, have the potential to be valuable tools for test preparation and personalized learning in educational settings."