Core Concepts
Large Language Models, particularly GPT-4 variants, can outperform state-of-the-art techniques in repairing faulty Alloy formal specifications, albeit with a marginal increase in runtime and token usage.
Abstract
The paper presents a systematic investigation into the capacity of Large Language Models (LLMs) to repair faulty specifications written in Alloy, a declarative formal language used for software specification. The authors propose a novel repair pipeline built on a dual-agent LLM framework comprising a Repair Agent and a Prompt Agent.
The key highlights of the study are:
Extensive empirical evaluation comparing the effectiveness of LLM-based repair with state-of-the-art Automatic Program Repair (APR) techniques for Alloy on a comprehensive set of benchmarks. The study reveals that LLMs, particularly GPT-4 variants, outperform existing techniques in terms of repair efficacy.
Investigation of the repair performance of various pre-trained LLMs, including GPT-3.5-Turbo, GPT-4-32k, and GPT-4-Turbo, across different feedback levels (No-Feedback, Generic-Feedback, and Auto-Feedback). The auto-feedback approach exhibited the best performance, surpassing traditional state-of-the-art repair tools.
Analysis of the impact of adaptive prompting, showcasing that the use of LLM-generated prompts based on Alloy Analyzer reports leads to enhanced repair performance compared to human-created generic prompts.
Characterization of failure modes encountered during the repair process, including syntax errors, counterexamples, and repetition of buggy specifications, providing insights into the limitations and challenges of LLM-based repair.
Evaluation of the repair costs associated with using the APR pipeline with various LLMs, highlighting the trade-offs between repair effectiveness and computational resources.
The research advances automatic repair for declarative specifications and highlights the potential of LLMs in this domain.
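To make the dual-agent loop and the three feedback levels more concrete, the following is a minimal sketch and not the authors' implementation. Every name in it (Feedback, Report, repair, the toy_* functions, the prompt strings) is hypothetical, and the LLM calls and the Alloy Analyzer are replaced by plain Python callables. It also assumes that No-Feedback means the initial prompt is reused unchanged, that Generic-Feedback means a fixed human-written retry message, and that Auto-Feedback means the Prompt Agent writes the next prompt from the analyzer report; the summary above does not spell out these details.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Feedback(Enum):
    NONE = "no-feedback"
    GENERIC = "generic-feedback"
    AUTO = "auto-feedback"


@dataclass
class Report:
    """Hypothetical stand-in for an Alloy Analyzer result."""
    ok: bool            # True when the candidate passes all checks
    details: str = ""   # e.g. a syntax error or a counterexample summary


# Assumed prompt texts, for illustration only.
INITIAL_PROMPT = "Repair the following faulty Alloy specification."
GENERIC_PROMPT = "The proposed fix is still incorrect. Try a different fix."


def repair(
    spec: str,
    analyze: Callable[[str], Report],
    repair_agent: Callable[[str, str], str],
    prompt_agent: Callable[[Report], str],
    level: Feedback,
    max_iters: int = 5,
) -> Optional[str]:
    """Return a candidate accepted by `analyze`, or None if the budget runs out."""
    prompt = INITIAL_PROMPT
    for _ in range(max_iters):
        candidate = repair_agent(prompt, spec)   # Repair Agent proposes a fix
        report = analyze(candidate)              # analyzer validates the fix
        if report.ok:
            return candidate
        if level is Feedback.GENERIC:
            prompt = GENERIC_PROMPT              # fixed, human-written message
        elif level is Feedback.AUTO:
            prompt = prompt_agent(report)        # Prompt Agent reads the report
        # Feedback.NONE: the prompt stays unchanged
    return None


# Toy stand-ins, rigged only to exercise the control flow.
def toy_analyze(spec: str) -> Report:
    if "fixed" in spec:
        return Report(ok=True)
    return Report(ok=False, details="counterexample found")


def toy_prompt_agent(report: Report) -> str:
    return f"The analyzer reported: {report.details}. Address it."


def toy_repair_agent(prompt: str, spec: str) -> str:
    # Pretend the agent only succeeds when told about the counterexample.
    return spec + " fixed" if "counterexample" in prompt else spec
```

With these stubs, repair("sig A {}", toy_analyze, toy_repair_agent, toy_prompt_agent, Feedback.AUTO) succeeds on the second iteration, while the other two levels exhaust the budget. That outcome is an artifact of the toy agents and says nothing about real repair rates; the sketch only shows where each feedback level plugs into the loop.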
Stats
The proposed LLM-based repair pipeline outperforms state-of-the-art Alloy APR techniques in terms of the number of successfully repaired specifications, with the auto-feedback approach achieving the highest repair rate.
Quotes
"Large Language Models, particularly GPT-4 variants, can outperform state-of-the-art techniques in repairing faulty Alloy formal specifications, albeit with a marginal increase in runtime and token usage."
"The auto-feedback approach exhibited the best performance, surpassing traditional state-of-the-art repair tools."
"The use of LLM-generated prompts based on Alloy Analyzer reports leads to enhanced repair performance compared to human-created generic prompts."