Core Concepts
Large Language Models (LLMs) exhibit varying capabilities in answering open-ended mathematical questions from the Math Stack Exchange platform; GPT-4 outperforms the other evaluated models but still falls short of consistently providing accurate and comprehensive responses.
Abstract
The study investigates the performance of various Large Language Models (LLMs) in answering open-ended mathematical questions from the Math Stack Exchange (MSE) platform. The authors employ a two-step approach:
Answer Generation: LLMs including ToRA, LLeMa, GPT-4, MAmmoTH, and Mistral 7B generate answers to 78 MSE questions. These generated answers are then compared against the relevance-assessed answers in the ARQMath dataset using ranking metrics such as nDCG, mAP, P@10, and BPref (a simplified metric computation is sketched after this list).
Question-Answer Comparison: The authors use the LLMs to generate embeddings of the 78 questions and the candidate answers from the ARQMath dataset, then retrieve the most similar answer for each question (see the embedding-similarity sketch below).
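The exact evaluation pipeline is not reproduced here; the following is a minimal sketch, assuming a ranked list of answer IDs and ARQMath-style graded relevance judgments (0-3), of how P@10 and nDCG can be computed for one question. The IDs, grades, and cutoff are illustrative, and the sketch omits the "prime" metric variants (which drop unjudged documents) that ARQMath officially reports.

```python
# Sketch: scoring a ranked list of candidate answers with P@10 and nDCG,
# given ARQMath-style graded relevance judgments (0-3). Illustrative data only.
import math

def precision_at_k(ranked_ids, judgments, k=10):
    """Fraction of the top-k answers judged relevant (grade >= 2)."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if judgments.get(doc_id, 0) >= 2) / k

def ndcg_at_k(ranked_ids, judgments, k=10):
    """Normalized discounted cumulative gain over the top-k answers."""
    dcg = sum(judgments.get(doc_id, 0) / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k]))
    ideal_gains = sorted(judgments.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical ranking produced for one question and its judged answers.
ranking = ["A17", "A03", "A42", "A99"]
judgments = {"A17": 3, "A03": 0, "A42": 2, "A99": 1}

print(f"P@10 = {precision_at_k(ranking, judgments):.2f}")
print(f"nDCG = {ndcg_at_k(ranking, judgments):.2f}")
```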
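For the second step, a minimal sketch of embedding-based question-answer matching is shown below. It uses a generic sentence-embedding model from the sentence-transformers library purely as a stand-in (the study derives embeddings from the evaluated LLMs themselves); the model name, question, and candidate answers are illustrative assumptions.

```python
# Sketch: embed a question and candidate ARQMath answers, then rank the answers
# by cosine similarity. Model choice and texts are illustrative stand-ins.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in, not the embedding source used in the study

question = "How can I prove that the sum of two even integers is even?"
candidate_answers = [
    "Write the integers as 2a and 2b; their sum is 2(a + b), which is even.",
    "Apply the quadratic formula to find both roots of the equation.",
    "Evenness means divisibility by 2, which is preserved under addition.",
]

q_emb = model.encode([question], convert_to_tensor=True)
a_emb = model.encode(candidate_answers, convert_to_tensor=True)
similarities = util.cos_sim(q_emb, a_emb)[0].tolist()  # one score per candidate answer

# Rank candidates from most to least similar to the question.
ranked = sorted(zip(candidate_answers, similarities), key=lambda pair: -pair[1])
for answer, score in ranked:
    print(f"{score:.3f}  {answer}")
```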
The results show that GPT-4 outperforms the other models, with an nDCG score of 0.48 and a P@10 of 0.37, surpassing the current best approach on ARQMath-3 Task 1 in terms of P@10. However, the case study reveals that although GPT-4 generates relevant responses in some instances, it does not consistently answer all questions accurately, particularly those involving complex mathematical concepts and reasoning.
The study highlights the current limitations of LLMs in navigating the specialized language and precise logic of mathematics, setting the stage for future research and advancements in AI-driven mathematical reasoning.
Stats
"GPT-4 generated answers exhibited increased effectiveness over the DPR baseline, outperforming the current best approach on ArqMATH3 Task1, i.e., MABOWDOR [33] considering P@10."
"The outcome reveals that models fine-tuned on mathematical tasks underperformed relative to the DPR benchmark."
"Increasing the model size of the top performer did not yield better results."