Key Concepts
Improvements in large language model (LLM) performance on static coding benchmarks translate into gains in programmer productivity, most notably reduced time spent per task, but the relationship is not proportional: further benchmark gains do not necessarily yield equivalent productivity gains. Human preference metrics such as suggestion acceptance rate and code copying do not necessarily align with actual programmer performance.
Summary
The paper introduces RealHumanEval, a web-based platform to conduct human-centric evaluation of LLMs for programming. The platform supports two forms of LLM assistance: autocomplete-based and chat-based.
The authors conducted a user study with 213 participants to understand the effect of LLM performance and the form of assistance on programmer productivity metrics. Key findings:
- Improvements in LLM benchmark performance lead to gains in human productivity, particularly in reducing time spent on tasks. This trend holds across both autocomplete and chat interactions.
- However, the gains are not proportional: further improvements in benchmark performance do not necessarily translate into equivalent gains in human productivity.
- Human preference metrics, such as suggestion acceptance rate and the likelihood of copying code from chat responses, correlate with programmer perceptions of LLM helpfulness but not with actual programmer performance.
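The preference metrics named in the last finding are easy to log in deployment but, per the paper, easy to over-trust. Below is a minimal sketch (illustrative only, not the authors' code; the record fields and helper names are hypothetical) of how acceptance rate, copy rate, and their correlation with a productivity metric such as time per task might be computed:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class ParticipantLog:               # hypothetical per-participant record
    suggestions_shown: int          # autocomplete suggestions displayed
    suggestions_accepted: int       # suggestions the programmer accepted
    chat_responses: int             # chat responses received
    chat_responses_copied: int      # responses the programmer copied code from
    mean_task_time_s: float         # productivity metric: avg seconds per task

def acceptance_rate(log: ParticipantLog) -> float:
    return log.suggestions_accepted / max(log.suggestions_shown, 1)

def copy_rate(log: ParticipantLog) -> float:
    return log.chat_responses_copied / max(log.chat_responses, 1)

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def preference_vs_productivity(logs: list[ParticipantLog]) -> float:
    # A preference metric can track perceived helpfulness while this
    # correlation with task time stays weak, which is the paper's point.
    return pearson_r([acceptance_rate(l) for l in logs],
                     [l.mean_task_time_s for l in logs])
```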
The results highlight the importance of careful evaluation to understand the nuances in programmer-LLM interactions, and the authors encourage the community to leverage RealHumanEval to evaluate new LLMs.
Statistics
Participants spent an average of 400 seconds per task in the No LLM condition.
Compared to No LLM, GPT-3.5 and CodeLlama-34b models reduced the time spent per task by 78 and 64 seconds respectively.
CodeLlama-7b models slightly increased the average time spent on a task by 10 seconds.
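For context, here is a small sketch (illustrative only; the input format and function name are assumptions) of how per-condition averages like those above could be reduced to time saved relative to the No LLM baseline:

```python
from collections import defaultdict
from statistics import mean

def time_saved_per_condition(rows: list[tuple[str, float]],
                             baseline: str = "No LLM") -> dict[str, float]:
    """Mean task time per condition, reported as seconds saved vs. the baseline.

    `rows` is a hypothetical list of (condition, task_time_seconds) records.
    """
    by_condition: dict[str, list[float]] = defaultdict(list)
    for condition, seconds in rows:
        by_condition[condition].append(seconds)
    means = {c: mean(ts) for c, ts in by_condition.items()}
    base = means[baseline]                      # ~400 s in the study
    return {c: base - m for c, m in means.items() if c != baseline}

# A positive value means time saved relative to No LLM (GPT-3.5: ~78 s,
# CodeLlama-34b: ~64 s); a negative value means extra time (CodeLlama-7b: ~-10 s).
```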
Quotes
"While a set of small-scale user studies have been conducted to primarily build a qualitative understanding of how programmers use LLM assistance, they are typically restricted to evaluations on one model, one form of LLM support, and a limited set of tasks."
"We find that improving a model's base performance on existing coding benchmarks leads to gains in human productivity, particularly in the time spent completing tasks. These trends were present across both chat and autocomplete interactions, validating the potential "generalizability" of benchmarking efforts to more realistic contexts."
"We also investigated whether human preference metrics, such as the average acceptance rate of suggestions and the likelihood of copying code from chat responses, aligned with productivity metrics. While these preference metrics are readily available in real deployments of LLM systems compared to task completion time and thus can be attractive proxy metrics, we find that they are only correlated with programmer perceptions of LLM helpfulness but not necessarily with actual programmer performance."