Evaluating Large Language Models' Abilities to Assist Programmers in Real-World Coding Tasks
Improvements in large language model (LLM) performance on static coding benchmarks do translate into gains in programmer productivity, most notably reduced time spent on tasks; however, these gains are not proportional to the gaps between models' benchmark scores. Moreover, human preference metrics such as suggestion acceptance rate and code copying do not reliably track actual programmer performance.