The paper introduces PECC, a novel benchmark designed to evaluate the code generation capabilities of large language models (LLMs) across a spectrum of problem complexities, spanning both narrative and neutral contexts. The dataset leverages challenges from Advent of Code (AoC) and Project Euler, totaling 2,396 problems.
The key findings include:
Multi-sampling (k=3) generally improves pass@k scores over single sampling (k=1), suggesting that giving models multiple attempts raises the likelihood of producing a correct solution (see the pass@k sketch after this list).
Narrative-style problems prove better suited to models in the AoC subset than neutrally formulated ones, whereas in the mathematically oriented Project Euler subset narrative framing hurts, indicating that stories can aid or obstruct model performance depending on the problem domain.
LLMs, including widely used models like GPT-3.5-Turbo and Claude Haiku, struggle with complex coding challenges, particularly in the mathematically intensive Project Euler subset, highlighting the need for further advancements.
Error analysis reveals that syntax errors are comparatively rare, while runtime errors and wrong outputs dominate, pointing to difficulties in the logical and algorithmic aspects of problem-solving rather than in producing well-formed code (a sketch of such categorization follows this list).
Prompting models to lay out a chain-of-thought justification before answering improves performance, especially on more challenging problems, compared to relying solely on their inherent world knowledge (an illustrative prompt pair is sketched below).
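For reference, pass@k is commonly computed with the unbiased estimator of Chen et al. (2021); below is a minimal sketch, assuming n samples per problem of which c pass (PECC may instead use the simpler definition where any of k samples passing counts):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: samples generated per problem
    c: samples that passed all tests
    k: attempt budget being scored
    """
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    # 1 - P(all k drawn samples fail) = 1 - C(n-c, k) / C(n, k)
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 3 samples per problem and 1 passing, extra attempts pay off:
print(pass_at_k(n=3, c=1, k=1))  # ≈ 0.33
print(pass_at_k(n=3, c=1, k=3))  # 1.0
```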
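The error categories above can be assigned by executing each generated solution and inspecting the outcome; the harness below is an illustrative assumption, not the paper's actual evaluation code:

```python
import subprocess

def categorize(solution_path: str, stdin_data: str, expected: str,
               timeout_s: int = 60) -> str:
    """Bucket a candidate solution into the error classes discussed above."""
    # Syntax errors surface when compiling the source, before any execution.
    with open(solution_path) as f:
        source = f.read()
    try:
        compile(source, solution_path, "exec")
    except SyntaxError:
        return "syntax_error"

    try:
        result = subprocess.run(
            ["python", solution_path],
            input=stdin_data, capture_output=True, text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout"

    if result.returncode != 0:
        return "runtime_error"   # crashed while executing
    if result.stdout.strip() != expected.strip():
        return "wrong_output"    # ran cleanly, but the logic is off
    return "passed"
```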
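To make the chain-of-thought contrast concrete, here is a hypothetical prompt pair; the wording is an assumption, not the paper's exact template:

```python
def build_prompt(problem: str, chain_of_thought: bool) -> str:
    """Assemble a code-generation prompt (illustrative phrasing only)."""
    if chain_of_thought:
        instruction = (
            "First explain your approach step by step, "
            "then give a complete Python solution."
        )
    else:
        instruction = "Give a complete Python solution."
    return f"{problem}\n\n{instruction}"
```

The chain-of-thought variant simply asks the model to externalize its reasoning before writing code, which is where the reported gains on harder problems appear.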
The PECC dataset aims to serve as a comprehensive benchmark for assessing the progress of LLMs in complex coding and reasoning tasks, mirroring real-world challenges.