Core Concepts
EvoCodeBench is a new code generation benchmark that aligns with real-world code repositories in multiple dimensions, including code distributions and dependency distributions, and pairs each sample with comprehensive annotations. It proposes a repository-level code generation task to simulate the practical coding process and uses it to evaluate the coding abilities of 10 popular Large Language Models.
Abstract
EvoCodeBench is a new code generation benchmark that addresses the limitations of existing benchmarks. It has the following key features:
Alignment with Real-world Repositories: EvoCodeBench is collected from high-quality open-source repositories, ensuring that the code distribution and dependency distribution are consistent with real-world repositories.
Comprehensive Annotations: EvoCodeBench provides detailed requirements, reference code, reference dependencies, and the complete repository for each sample, enabling comprehensive evaluation of code generation abilities.
Robust Evaluation Metrics: EvoCodeBench uses Pass@k to assess functional correctness and Recall@k to evaluate the generation of relevant dependencies.
Evolving Benchmark: EvoCodeBench is designed as an evolving benchmark, with new versions released periodically to avoid data leakage.
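The two metrics above can be sketched in code. This is a minimal, hedged sketch: Pass@k follows the standard unbiased estimator widely used for functional correctness, and Recall@k is one plausible reading of dependency recall over k generated programs; the paper's exact aggregation may differ, and all function names here are illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased Pass@k estimator.
    n: total generated samples, c: samples passing the tests, k: budget."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def recall_at_k(generated_deps, reference_deps, k: int) -> float:
    """One plausible Recall@k over dependencies (an assumption, not the
    paper's exact definition): for each of the first k generated programs,
    compute the fraction of reference dependencies it invokes, and report
    the best achieved."""
    ref = set(reference_deps)
    if not ref:
        return 1.0
    best = 0.0
    for deps in generated_deps[:k]:
        best = max(best, len(ref & set(deps)) / len(ref))
    return best
```

For example, with 2 samples of which 1 passes, `pass_at_k(2, 1, 1)` gives 0.5, and a program invoking 2 of 3 reference dependencies scores a Recall@k of 2/3.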
Based on EvoCodeBench, the paper proposes a repository-level code generation task that simulates the practical coding process. The authors evaluate 10 popular Large Language Models, including GPT-4, GPT-3.5, DeepSeek Coder, StarCoder 2, CodeLLaMa, Gemma, and Qwen 1.5, on this task. The results show that these models perform much worse on real-world repositories than on previous benchmarks, highlighting the importance of evaluating code generation in realistic scenarios.
The paper also analyzes failed cases and summarizes the shortcomings of existing Large Language Models on EvoCodeBench, providing insights for future improvements.
Stats
The average number of dependencies per program in EvoCodeBench-2403 is 3.46.
The average number of dependencies per program in 500 real-world repositories is 3.22.
Quotes
"EvoCodeBench aligns with real-world repositories in multiple dimensions, e.g., code distributions and dependency distributions."
"EvoCodeBench offers comprehensive annotations (e.g., requirements, reference code, and reference dependencies), and robust evaluation metrics (e.g., Pass@k and Recall@k)."
"EvoCodeBench is an evolving benchmark to avoid data leakage."