Core Concepts
Language models struggle to resolve real-world software engineering issues, highlighting the need for more challenging and realistic benchmarks to drive their future development.
Abstract
The paper introduces SWE-bench, a benchmark for evaluating language models on real-world software engineering tasks. SWE-bench consists of 2,294 task instances drawn from real GitHub issues and their corresponding pull requests across 12 popular Python repositories.
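To ground this, the snippet below shows one way to inspect task instances. It assumes the dataset is published on the Hugging Face Hub as princeton-nlp/SWE-bench and uses the field names shown; both details are assumptions of this sketch rather than guarantees from the paper.

```python
# Minimal sketch of loading and inspecting SWE-bench task instances.
# Assumes the Hugging Face dataset id "princeton-nlp/SWE-bench" and the
# field names below; adjust to whatever the released dataset actually uses.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench", split="test")

example = ds[0]
print(example["repo"])               # source repository, e.g. "astropy/astropy"
print(example["problem_statement"])  # the GitHub issue text given to the model
print(example["patch"])              # the gold, human-written reference patch
```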
The key highlights and insights are:
Resolving a SWE-bench task frequently requires understanding and coordinating changes across multiple functions, classes, and files, which goes beyond traditional code generation: models must interact with execution environments, process long contexts, and perform complex reasoning.
Evaluations show that state-of-the-art language models, including proprietary models and the fine-tuned SWE-Llama, can resolve only the simplest issues. The best-performing model, Claude 2, resolves a mere 1.96% of the issues.
SWE-bench offers several advantages over existing programming benchmarks, including a realistic setting, diverse inputs, robust execution-based evaluation (sketched after this list), and the ability to be continually updated with new task instances.
The authors release SWE-bench-train, a dataset of 19,000 non-testing task instances, and two fine-tuned models, SWE-Llama 7b and 13b, which are competitive with Claude 2 in certain settings.
The paper's qualitative analysis of model generations highlights that models tend to produce shorter, simpler edits than human-written solutions and struggle to leverage the full codebase context.
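To make the execution-based evaluation concrete, here is a minimal sketch of the resolve check: apply the model's generated patch to the repository at the issue's base commit, then run the issue's fail-to-pass tests. This is an illustration under simplifying assumptions, not the paper's actual harness; the evaluate_prediction signature and the direct pytest invocation are hypothetical.

```python
import subprocess

def evaluate_prediction(repo_dir: str, model_patch: str,
                        fail_to_pass: list[str]) -> bool:
    """Apply a model-generated patch and check whether the issue's
    fail-to-pass tests now succeed. Simplified sketch: the real harness
    also installs pinned dependencies and verifies that previously
    passing tests do not regress."""
    # Apply the predicted patch to the checked-out repository.
    apply = subprocess.run(
        ["git", "apply", "-"], cwd=repo_dir,
        input=model_patch, text=True, capture_output=True,
    )
    if apply.returncode != 0:
        return False  # a patch that does not apply counts as unresolved

    # Run only the tests that the gold patch flips from failing to passing.
    tests = subprocess.run(
        ["python", "-m", "pytest", *fail_to_pass],
        cwd=repo_dir, capture_output=True,
    )
    return tests.returncode == 0
```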
Stats
The average issue text length is 195.1 words, with a maximum of 4,477 words.
The average codebase has 3,010 non-test files and 438,000 non-test lines.
The average gold patch makes 32.8 line edits across 1.7 files and 3.0 functions (a sketch for computing such patch statistics follows these stats).
The average number of fail-to-pass tests is 9.1, with a maximum of 1,633.
The average total number of tests is 120.8, with a maximum of 9,459.
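As a rough illustration of where figures like "32.8 line edits across 1.7 files" come from, the sketch below counts edits in a unified diff. The counting rules (files from diff --git headers, every added or removed line as one edit) are simplifying assumptions rather than the paper's exact methodology, and the function-level counts would additionally require parsing the source.

```python
def patch_stats(patch: str) -> dict[str, int]:
    """Count touched files and edited lines in a unified diff string.
    Simplified: ignores renames and binary files, and treats each
    added or removed line as a single edit."""
    files = 0
    line_edits = 0
    for line in patch.splitlines():
        if line.startswith("diff --git"):
            files += 1  # one "diff --git" header per modified file
        elif line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
            line_edits += 1  # content lines added or removed by the patch
    return {"files": files, "line_edits": line_edits}
```

Averaging these counts over all gold patches would reproduce per-instance statistics of the kind reported above.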
Quotes
"Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities."
"Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation tasks."
"Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous."