Core Concepts
InfiCoder-Eval is a large-scale benchmark that systematically evaluates the free-form question-answering capabilities of code large language models across 15 programming languages and 5 major areas.
Abstract
InfiCoder-Eval is a novel benchmark created to comprehensively evaluate the question-answering abilities of code large language models (code LLMs). The benchmark comprises 234 carefully selected high-quality questions from Stack Overflow, covering 15 programming languages and 5 major areas: front-end, back-end, data science and machine learning, mobile and desktop, and IT operations.
To address the challenge of evaluating free-form question responses, InfiCoder-Eval integrates four model-free metric types: keywords matching, blank filling, unit testing, and dialogue similarity. Domain experts annotate the detailed correctness criteria for each question, enabling automatic and efficient evaluation.
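The simplest of these metric types, keywords matching, can be illustrated with a short sketch. The criteria format, field names, and weighted-scoring scheme below are illustrative assumptions, not the benchmark's actual annotation schema:

```python
# Hypothetical sketch of a keyword-matching scorer in the spirit of
# InfiCoder-Eval's "keywords matching" metric. The criteria dicts and
# weighting are assumptions for illustration only.
import re

def keyword_score(response: str, criteria: list[dict]) -> float:
    """Score a free-form answer by weighted keyword/regex matches."""
    total = sum(c["weight"] for c in criteria)
    earned = 0.0
    for c in criteria:
        if c.get("regex"):
            # Expert-annotated pattern, matched as a regular expression.
            hit = re.search(c["pattern"], response) is not None
        else:
            # Plain substring match, case-insensitive.
            hit = c["pattern"].lower() in response.lower()
        if hit:
            earned += c["weight"]
    return earned / total if total else 0.0

criteria = [
    {"pattern": "useEffect", "weight": 2.0},
    {"pattern": r"dependency\s+array", "weight": 1.0, "regex": True},
]
answer = "Call useEffect with a dependency array to run the effect once."
print(keyword_score(answer, criteria))  # 1.0
```

Because the correctness criteria are annotated per question by domain experts, a scorer like this runs automatically with no reference model, which is what makes the evaluation model-free and efficient.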
The authors conduct a systematic evaluation of over 80 code LLMs on InfiCoder-Eval, leading to several insightful findings:
- GPT-4 achieves a score of 70.64%, outperforming the most capable open-source models, but is still far from perfect.
- At similar model sizes, code LLMs are usually stronger than general-purpose LLMs, and instruction-fine-tuned LLMs are usually stronger than base LLMs.
- Performance differences between model families of comparable size can be substantial, highlighting the importance of training data and training techniques.
- The scaling law is empirically verified for open-source models with fewer than 50B parameters, but not for those with more.
InfiCoder-Eval is fully open-source and continuously maintained to foster more scientific and systematic practices for evaluating code LLMs.
Stats
The authors report that GPT-4 achieves a score of 70.64% on the InfiCoder-Eval benchmark.
The best open-source model, deepseek-coder-33b-instruct, achieves a score of 62.96%.
GPT-3.5-turbo achieves a score of 56.47%.
Quotes
"GPT-4 is still far from perfect, which is in contrast to the near 90% rate in HumanEval."
"There is still a visible gap between open-source models and GPT-4."
"The training techniques and training data are equally important or even more, helping to reduce the required scale for achieving certain score by more than 10× size."
"Instruction-finetuning is critical for equipping the models with QA ability in the code domain."