Core Concepts
Large Language Models (LLMs) show promise for Automated Program Repair (APR) but struggle with memory inefficiency when using beam search for patch generation. FLAMES, a novel approach combining LLM-based and search-based APR, leverages semantic feedback and a best-first search algorithm to improve both the efficiency and effectiveness of LLM-based program repair.
Summary
Bibliographic Information:
Le-Cong, T., Le, B., & Murray, T. (2024). Semantic-guided Search for Efficient Program Repair with Large Language Models. Proceedings of the ACM on Programming Languages, 1(OOPSLA1), 1–23. https://doi.org/XXXXXXX.XXXXXXX
Research Objective:
This paper investigates the memory limitations of current LLM-based Automated Program Repair (APR) techniques and proposes a novel approach, FLAMES, to enhance their efficiency and effectiveness.
Methodology:
The authors first conduct an empirical study on the impact of beam size on the memory efficiency and effectiveness of five different LLM-based APR techniques. They then introduce FLAMES, which combines LLM-based and search-based APR using semantic feedback from test validations and a best-first search algorithm (PG-TD) to guide the patch generation process. FLAMES is evaluated on the Defects4J and HumanEval-Java datasets, comparing its performance against 15 leading APR techniques in terms of the number of correctly fixed bugs, memory usage, and execution time.
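The idea of letting test results steer patch generation can be illustrated with a small sketch. This is not the PG-TD algorithm or the authors' implementation; it is a generic best-first search under stated assumptions. The `propose` function is a hypothetical stand-in for an LLM's next-token distribution, the three-token "patch" and the `TESTS` list are toy data, and `rollout` greedily completes a partial patch so that it can be scored by the fraction of tests it passes.

```python
import heapq
from typing import List, Optional, Tuple

Tokens = Tuple[str, ...]

# Toy test suite for the intended behavior a + b: ((a, b), expected).
TESTS = [((1, 2), 3), ((5, 5), 10), ((0, 7), 7)]


def propose(prefix: Tokens) -> List[Tuple[str, float]]:
    """Hypothetical stand-in for an LLM: next tokens with probabilities."""
    table = {
        0: [("a", 0.9), ("b", 0.1)],
        1: [("-", 0.6), ("+", 0.3), ("*", 0.1)],  # biased toward the bug
        2: [("b", 0.8), ("a", 0.2)],
    }
    return table.get(len(prefix), [])


def is_complete(tokens: Tokens) -> bool:
    return len(tokens) == 3


def rollout(prefix: Tokens) -> Tokens:
    """Greedily complete a partial patch with the most likely tokens."""
    tokens = prefix
    while not is_complete(tokens):
        token, _ = max(propose(tokens), key=lambda c: c[1])
        tokens += (token,)
    return tokens


def reward(tokens: Tokens) -> float:
    """Semantic feedback: fraction of tests the candidate patch passes."""
    expr = " ".join(tokens)
    passed = sum(
        eval(expr, {"__builtins__": {}}, {"a": a, "b": b}) == expected
        for (a, b), expected in TESTS
    )
    return passed / len(TESTS)


def best_first_repair(max_evaluations: int = 20) -> Tuple[Optional[Tokens], int]:
    """Expand the prefix whose completion scored best; stop when all tests pass."""
    candidate = rollout(())
    score = reward(candidate)
    evaluations = 1
    if score == 1.0:
        return candidate, evaluations
    # Min-heap on negated score; the counter breaks ties in insertion order.
    frontier = [(-score, 0, ())]
    counter = 1
    while frontier and evaluations < max_evaluations:
        _, _, prefix = heapq.heappop(frontier)
        for token, _ in propose(prefix):
            child = prefix + (token,)
            candidate = rollout(child)
            score = reward(candidate)
            evaluations += 1
            if score == 1.0:
                return candidate, evaluations
            if not is_complete(child):
                heapq.heappush(frontier, (-score, counter, child))
                counter += 1
    return None, evaluations


patch, evaluations = best_first_repair()
print(patch, evaluations)
```

With this stub, the search returns `('a', '+', 'b')` after 5 test-suite evaluations. Only one candidate is decoded and validated at a time, so the frontier holds lightweight token prefixes rather than a full beam of simultaneously decoded sequences.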
Key Findings:
- Increasing the beam size in LLM-based APR techniques leads to significant memory consumption and frequent out-of-memory crashes, hindering their effectiveness.
- FLAMES successfully repairs 133 bugs from the Defects4J dataset and 103 bugs from the HumanEval-Java dataset, outperforming the best baseline by 10 and 11 fixes, respectively.
- FLAMES significantly reduces memory consumption by up to 83% compared to conventional LLM-based APR techniques, while also accelerating the repair process.
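A rough back-of-the-envelope estimate helps explain why larger beams exhaust memory. The sketch below assumes a decoder-style transformer that keeps a key/value cache for every beam hypothesis; the model dimensions (32 layers, 32 heads, head size 128, fp16) and the 1024-token sequence length are illustrative assumptions, not figures from the paper, and the estimate ignores model weights and activations.

```python
def kv_cache_bytes(
    beam_size: int,
    seq_len: int = 1024,      # assumed sequence length (prompt + generated tokens)
    n_layers: int = 32,       # assumed model depth
    n_kv_heads: int = 32,     # assumed number of key/value heads
    head_dim: int = 128,      # assumed per-head dimension
    bytes_per_elem: int = 2,  # fp16
) -> int:
    """Estimate KV-cache size: keys and values for every layer, head, token, and beam."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * beam_size


for beam in (1, 10, 25, 100):
    gib = kv_cache_bytes(beam) / 2**30
    print(f"beam={beam:>3}: {gib:.1f} GiB")
```

Under these assumptions the cache costs 0.5 GiB per hypothesis, so it grows linearly from 5 GiB at beam size 10 to 12.5 GiB at beam size 25, on top of the memory already occupied by the model itself.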
Main Conclusions:
FLAMES offers a more efficient and effective approach to LLM-based program repair by addressing the memory limitations of conventional beam search methods. The semantic-guided patch generation strategy allows FLAMES to explore a larger search space and generate more plausible patches while consuming significantly less memory.
Significance:
This research significantly contributes to the field of Automated Program Repair by proposing a novel approach that leverages the power of LLMs while mitigating their memory constraints. FLAMES has the potential to improve the scalability and practicality of LLM-based APR techniques for real-world software development.
Limitations and Future Research:
The study focuses on single-hunk bugs and assumes perfect fault localization. Future research could explore the applicability of FLAMES to multi-hunk bugs and integrate it with fault localization techniques. Additionally, investigating the effectiveness of FLAMES with different reward functions and search algorithms could further enhance its performance.
Statistics
Increasing beam size from 10 to 25 in LLM-based APR techniques led to a 21% to 46% increase in plausible patches.
Further increasing beam size resulted in performance drops due to memory overloads, with crash rates exceeding 80% in some cases.
FLAMES reduced memory consumption by 42% to 83% across various models and configurations.
FLAMES achieved a 0% out-of-memory (OOM) crash rate across all evaluated models.
FLAMES repaired 133 out of 333 bugs in Defects4J and 103 out of 164 bugs in HumanEval-Java.
FLAMES outperformed the best baseline (RepairLlama) by 10 and 11 correct fixes in Defects4J and HumanEval-Java, respectively.
FLAMES uniquely fixed 14 bugs in Defects4J that no other baseline could address.
Quotes
"Seemingly simple solutions to reduce memory consumption are (1) to quantize LLM models... and (2) to make beam search sequential... However, we show that these approaches still do not work via both theoretical analysis and experiments."
"Our empirical evaluation on the Defects4J and HumanEval-Java datasets shows that FLAMES not only substantially reduces memory consumption by up to 83% compared to conventional LLM-based APR, but also accelerates the repair process."
"This suggests that FLAMES is not only more efficient but also outperforms state-of-the-art techniques, fixing at least 10 and 11 more bugs than SOTA baselines in the Defects4J and HumanEval-Java datasets, respectively."