SWE-Bench: Evaluating Language Models on Real-World Software Engineering Tasks
Language models struggle to resolve real-world software engineering issues, which highlights the need for more challenging and realistic benchmarks to drive the models' future development.