This research paper investigates the potential of OpenAI's GPT large language models (LLMs) for automatically generating verifiable specifications in VeriFast, a static verifier for C and Java programs. The authors focus on generating specifications in separation logic, which is particularly challenging because such specifications must describe heap structures as well as functional properties.
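For readers unfamiliar with VeriFast, the following is a minimal sketch of the kind of annotation the models are asked to produce, modeled on the account example from the VeriFast tutorial (the predicate and function names are illustrative, not taken from the paper's dataset). The separation-logic predicate describes ownership of a heap cell, and the `requires`/`ensures` clauses form the function contract:

```c
#include <limits.h>

struct account {
    int balance;
};

/*@
// Separation-logic predicate: account_balance(a, b) asserts exclusive
// ownership of a's balance field, currently holding the value b.
predicate account_balance(struct account *a; int b) =
    a->balance |-> b;
@*/

void account_deposit(struct account *a, int amount)
//@ requires account_balance(a, ?b) &*& 0 <= amount &*& b + amount <= INT_MAX;
//@ ensures account_balance(a, b + amount);
{
    //@ open account_balance(a, b);
    a->balance += amount;
    //@ close account_balance(a, b + amount);
}
```

Generating such annotations by hand requires expertise in both the programming language and the specification logic, which motivates the attempt to automate it with LLMs.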
Bibliographic Information: Rego, M., Fan, W., Hu, X., Dod, S., Ni, Z., Xie, D., DiVincenzo, J., & Tan, L. (2024). Evaluating the Ability of Large Language Models to Generate Verifiable Specifications in VeriFast. arXiv preprint arXiv:2411.02318v1.
Research Objective: The study aims to evaluate the effectiveness of GPT-3.5-turbo, GPT-4.0, and GPT-4-turbo in generating verifiable specifications for C code that manipulates the heap, using VeriFast as the verification tool.
Methodology: The researchers used a dataset of 150 publicly available VeriFast examples and employed two prompting techniques: traditional prompt engineering and Chain-of-Thought (CoT) prompting. They tested the models' ability to generate specifications from three input formats: Natural Language (NL) descriptions, Mathematical Proof (MP) format (pre- and postconditions only), and Weaker Version (WV) format (partial contracts). The generated specifications were then checked for correctness using VeriFast.
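To make the three input formats concrete, the sketch below shows how a single toy function might be presented to a model under each format. This is a hypothetical reconstruction for illustration only; the paper's actual prompts and benchmark programs are not reproduced here:

```c
#include <limits.h>

// NL format (hypothetical): the model receives only a prose description,
// e.g. "increment adds one to the integer that p points to."

// MP format (hypothetical): the full pre- and postcondition are given,
// and the model must produce any remaining annotations the proof needs.
void increment(int *p)
//@ requires integer(p, ?v) &*& v < INT_MAX;
//@ ensures integer(p, v + 1);
{
    (*p)++;
}

// WV format (hypothetical): only a partial, weaker contract is given,
// e.g. one that does not relate the output value to the input:
//   //@ requires integer(p, _);
//   //@ ensures integer(p, _);
// and the model must strengthen it into a verifiable full specification.
```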
Key Findings: The results indicate that while GPT models show promise in generating VeriFast specifications, they are not yet consistently reliable. GPT-4.0 outperformed the other models, with GPT-3.5-turbo and GPT-4-turbo showing similar but lower performance. NL input resulted in the highest error rates, while MP and WV formats showed improvements but still had significant errors. CoT prompting reduced syntax errors but did not significantly improve verification error rates compared to traditional prompting.
Main Conclusions: The authors conclude that while LLMs like GPT have the potential to automate specification generation for static verification, further research is needed to improve their accuracy and reliability. They suggest exploring custom LLM training, alternative LLM architectures, and refined prompting strategies as potential avenues for future work.
Significance: This research contributes to the growing field of using AI for software engineering tasks, specifically in the challenging area of formal verification. Automating specification generation could significantly reduce the effort required for formal verification and make it more accessible to developers.
Limitations and Future Research: The study is limited by its focus on OpenAI's GPT models and VeriFast. Future research could explore other LLMs, verification tools (e.g., Viper, Gillian), and specification languages. Additionally, investigating the impact of different prompting techniques and training data on LLM performance in this domain is crucial.