Core Concepts
AC-EVAL evaluates Large Language Models' proficiency in ancient Chinese language understanding, highlighting areas for improvement.
Abstract
AC-EVAL introduces a benchmark for assessing LLMs' understanding of ancient Chinese, spanning both historical knowledge and language comprehension. The benchmark is organized into three levels of difficulty and 13 tasks, forming a comprehensive assessment framework. The evaluation shows substantial room for improvement in current LLMs, particularly in long text comprehension. The study compares model performance under zero-shot, few-shot, and chain-of-thought settings, highlighting the trade-offs of each approach.
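To make the three evaluation settings concrete, here is a minimal Python sketch, not from the AC-EVAL authors: the MCQuestion fields, prompt wording, example content, and scoring rule are assumptions used only to illustrate how a multiple-choice item might be prompted under zero-shot, few-shot, and chain-of-thought settings and then scored for accuracy.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MCQuestion:
    question: str        # ancient Chinese passage plus the query (illustrative field names)
    choices: List[str]   # answer options, rendered as A, B, C, ...
    answer: str          # gold label, e.g. "A"


def format_prompt(item: MCQuestion,
                  exemplars: Sequence[MCQuestion] = (),
                  cot: bool = False) -> str:
    """Build a prompt: zero-shot (no exemplars), few-shot (with solved exemplars),
    or chain-of-thought (ask for reasoning before the final option letter)."""
    def render(q: MCQuestion, with_answer: bool) -> str:
        options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(q.choices))
        block = f"Question: {q.question}\n{options}\nAnswer:"
        if with_answer:
            block += f" {q.answer}"
        return block

    parts = [render(ex, with_answer=True) for ex in exemplars]
    instruction = "Think step by step, then give the final option letter.\n" if cot else ""
    parts.append(instruction + render(item, with_answer=False))
    return "\n\n".join(parts)


def accuracy(predictions: List[str], items: List[MCQuestion]) -> float:
    """Fraction of predicted letters matching the gold labels."""
    correct = sum(p.strip().upper().startswith(q.answer)
                  for p, q in zip(predictions, items))
    return correct / len(items)


# Illustrative usage with a dummy item (contents are made up for this sketch):
item = MCQuestion(
    question="「學而時習之」中「習」最接近的意思是？",
    choices=["溫習", "習慣", "學習", "練習"],
    answer="A",
)
print(format_prompt(item, cot=True))
```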
Stats
AC-EVAL comprises 3,245 multiple-choice questions spanning three levels of difficulty and thirteen tasks.
ERNIE-Bot 4.0 and GLM-4 are the top-performing models, with accuracies over 70%.
GPT-4 and GPT-3.5 outperform LLaMA-70B in handling extensive ancient Chinese content.
Quotes
"AC-EVAL aims to advance LLM application in ancient Chinese education."
"The benchmark reveals significant improvement areas for existing LLMs."