The paper introduces CJEval, a benchmark for evaluating the educational capabilities of Large Language Models (LLMs). CJEval is built from authentic Chinese junior high school exam questions and carries a diverse set of annotations: question types, difficulty levels, knowledge concepts, and answer explanations.
The benchmark covers four core educational tasks: knowledge concept tagging, question difficulty prediction, question answering, and question generation. The authors conducted extensive experiments and analysis on a range of state-of-the-art LLMs, both proprietary and open-source, to assess their performance on these tasks.
The results highlight the strengths and limitations of current LLMs in educational applications. While models like GPT-4o and fine-tuned Qwen-14B demonstrate strong capabilities, they still struggle with certain tasks, particularly those requiring advanced reasoning and language generation skills. The paper emphasizes the need for further research and development to enhance LLMs' educational competencies.
CJEval is designed to serve as a comprehensive and robust framework for assessing LLMs in educational technology. By providing diverse annotations and a range of application-level tasks, the benchmark aims to guide the development of LLMs toward more effective and intelligent educational systems.
Source: arxiv.org
Deeper Questions