Core Concepts
OpenEval introduces a comprehensive evaluation platform for Chinese LLMs, focusing on capability, alignment, and safety.
Abstract:
Introduction of OpenEval for evaluating Chinese LLMs across capability, alignment, and safety.
Includes benchmark datasets for various tasks and dimensions.
Introduction:
Large language models have shown remarkable capabilities in NLP tasks and real-world applications.
Evaluating Chinese LLMs remains challenging due to the limitations of traditional benchmarks.
Data Pre-processing and Post-processing:
Task-specific prompts are included for each task, derived from its task description.
Around 300K questions were reformulated for the zero-shot evaluation setting.
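The reformulation step above can be sketched as follows. This is a minimal illustration, not OpenEval's actual pipeline: the template wording, field names, and function name are all assumptions.

```python
# Hedged sketch (not OpenEval's actual code): turning a raw
# multiple-choice item into a zero-shot prompt by prepending the
# task description, as the pre-processing step describes.

def to_zero_shot_prompt(item: dict) -> str:
    """Build a zero-shot prompt from a task description and one question."""
    options = "\n".join(
        f"{label}. {text}" for label, text in zip("ABCD", item["choices"])
    )
    return (
        f"{item['task_description']}\n\n"
        f"Question: {item['question']}\n"
        f"{options}\n"
        "Answer:"
    )

# Hypothetical example item.
example = {
    "task_description": "Answer the following multiple-choice question.",
    "question": "Which city is the capital of China?",
    "choices": ["Shanghai", "Beijing", "Guangzhou", "Chengdu"],
}
prompt = to_zero_shot_prompt(example)
```

In a zero-shot setting the prompt contains only the task description and the question itself, with no solved examples, which is why the template ends at the bare "Answer:" cue.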
Evaluation Taxonomy:
Three major dimensions: capability, alignment, and safety.
Each dimension is divided into sub-dimensions, each covered by specific benchmarks.
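A two-level taxonomy like this maps naturally onto a nested structure. The sketch below is only illustrative: the sub-dimension names are placeholders, not OpenEval's actual list.

```python
# Illustrative sketch of a three-dimension evaluation taxonomy.
# The sub-dimension names below are placeholders (assumptions),
# not the benchmarks OpenEval actually uses.
TAXONOMY = {
    "capability": ["nlp_tasks", "knowledge", "reasoning"],
    "alignment": ["bias", "toxicity"],
    "safety": ["risk_behaviors"],
}

def sub_dimensions(dimension: str) -> list[str]:
    """Look up the sub-dimensions registered under a top-level dimension."""
    return TAXONOMY.get(dimension, [])
```

Keeping the taxonomy as data rather than code makes it easy to attach benchmark datasets to each sub-dimension later.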
Experiments:
The first public evaluation assessed open-source and proprietary Chinese LLMs across 53 tasks.
Results reveal performance differences between open-source and proprietary LLMs across the three dimensions.
Stats
"In our first public evaluation, we have tested a range of Chinese LLMs, spanning from 7B to 72B parameters."
"Evaluation results indicate that while Chinese LLMs have shown impressive performance in certain tasks..."