The paper introduces HexaCoder, a novel approach to enhance the ability of large language models (LLMs) to generate secure code. The key components of HexaCoder are:
Oracle-Guided Secure Code Synthesis: HexaCoder uses an instruction-tuned LLM (GPT-4), guided by security reports, to automatically synthesize pairs of vulnerable and fixed code samples for specific Common Weakness Enumeration (CWE) types. A security oracle (CodeQL) is used to validate the generated code.
Fine-tuning CodeLMs: The synthesized code pairs are used to fine-tune various CodeLMs (e.g., CodeGen, InCoder, DeepSeek-Coder) using the Low-Rank Adaptation (LoRA) method and a masked objective loss.
Two-step Code Generation: Based on the observation that the model often adds new libraries when it addresses a vulnerability, HexaCoder introduces a two-step generation approach. In the first step, the model generates the necessary libraries; in the second step, it completes the main code using the updated library context.
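The two-step generation described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `generate_imports` and `complete_code` are hypothetical stand-ins for calls to the fine-tuned CodeLM, and the way imports are detected and merged (top-level lines starting with `import` or `from`) is an assumption made for this sketch.

```python
from typing import Callable, List, Tuple

IMPORT_PREFIXES = ("import ", "from ")


def split_imports(prompt: str) -> Tuple[List[str], List[str]]:
    """Separate top-level import lines from the rest of a Python prompt."""
    lines = prompt.splitlines()
    imports = [ln for ln in lines if ln.startswith(IMPORT_PREFIXES)]
    rest = [ln for ln in lines if not ln.startswith(IMPORT_PREFIXES)]
    return imports, rest


def two_step_generate(
    prompt: str,
    generate_imports: Callable[[str], str],  # hypothetical: model call for step 1
    complete_code: Callable[[str], str],     # hypothetical: model call for step 2
) -> str:
    imports, rest = split_imports(prompt)

    # Step 1: let the model propose the libraries it needs.
    proposed = generate_imports(prompt)
    new_imports = [
        ln.strip()
        for ln in proposed.splitlines()
        if ln.strip().startswith(IMPORT_PREFIXES)
    ]
    # Keep the original imports and append only those not already present.
    merged = imports + [ln for ln in new_imports if ln not in imports]

    # Step 2: complete the main code with the updated library context.
    context = "\n".join(merged + rest) + "\n"
    return context + complete_code(context)
```

With stub callables in place of a real model, a prompt that only imports `yaml` could end up with an extra `from yaml import SafeLoader` line before the function body is completed, so the second step can rely on the safer loader being available.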
The evaluation shows that HexaCoder significantly reduces the number of vulnerable code samples generated by the CodeLMs compared to the baseline methods, while maintaining their performance in generating functionally correct programs.
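To make the masked objective loss used during fine-tuning more concrete, the sketch below shows one plausible reading of it. It assumes that the mask selects tokens in the fixed code that differ from the vulnerable version, and that the loss is the mean negative log-likelihood over those tokens only. The summary does not spell out the mask construction, so both the diff-based mask and the helper names here are assumptions.

```python
import difflib
from typing import List, Sequence


def changed_token_mask(
    vulnerable_tokens: Sequence[str], fixed_tokens: Sequence[str]
) -> List[int]:
    """Return a 0/1 mask over fixed_tokens, 1 where a token was inserted or replaced."""
    mask = [0] * len(fixed_tokens)
    matcher = difflib.SequenceMatcher(
        a=list(vulnerable_tokens), b=list(fixed_tokens), autojunk=False
    )
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            for j in range(j1, j2):
                mask[j] = 1
    return mask


def masked_nll(token_log_probs: Sequence[float], mask: Sequence[int]) -> float:
    """Mean negative log-likelihood over the masked (security-relevant) tokens only."""
    selected = [-lp for lp, m in zip(token_log_probs, mask) if m]
    if not selected:
        return 0.0  # nothing to learn from if no token is masked in
    return sum(selected) / len(selected)
```

For example, when the only difference between the vulnerable and fixed token sequences is `load` becoming `safe_load`, only that position contributes to the loss, which focuses the update on the security-relevant change rather than on the unchanged surrounding code.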
Key insights distilled from:
by Hoss... at arxiv.org, 09-11-2024
https://arxiv.org/pdf/2409.06446.pdf