Key Concepts
Large Language Models (LLMs) show promise in automating software vulnerability detection and patching, but their effectiveness remains unclear. This study introduces VulnLLMEval, a framework to assess the performance of LLMs in identifying and fixing vulnerabilities in real-world C code.
Abstract
The paper introduces VulnLLMEval, a framework designed to evaluate the effectiveness of Large Language Models (LLMs) in software vulnerability detection (SVD) and software vulnerability patching (SVP) tasks. The framework focuses on the C programming language, using a dataset of 307 real-world vulnerabilities extracted from the Linux kernel.
The key highlights of the study include:
- A novel automated data collection method that eliminates the need for manual labeling of vulnerable and patched code, streamlining the evaluation process.
- Comprehensive evaluation of 10 pre-trained LLMs across various SVD and SVP tasks, using metrics such as Mean Reciprocal Rank (MRR), Top-5 accuracy, ROUGE score, CodeBLEU, and Cyclomatic Complexity.
- Detailed insights into the strengths and limitations of LLMs in handling different types of vulnerabilities, including buffer overflows, use-after-free, information exposure, null pointer dereference, and improper input validation.
- Observations that LLMs often struggle to distinguish between vulnerable and patched code, and tend to oversimplify code when generating patches, requiring further refinement.
- Identification of research directions to enhance the effectiveness of LLMs in SVD and SVP tasks, such as improving their understanding of complex vulnerability patterns and incorporating specialized training on vulnerable and patched code.
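Among the ranking metrics listed above, Mean Reciprocal Rank scores how early the first correct answer appears in a model's ranked output. Assuming the paper uses the standard definition, it is:

```
MRR = (1/|Q|) * Σ_{i=1..|Q|} 1/rank_i
```

where |Q| is the number of queries and rank_i is the position of the first correct result for query i; a model that always ranks the correct answer first scores 1.0.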
The study provides a robust and extensible benchmark for evaluating LLMs in software security tasks, offering valuable insights to guide future advancements in this field.
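To make the SVD/SVP task concrete, the following is a minimal, hypothetical sketch of one vulnerability class the paper evaluates, use-after-free (CWE-416), showing a vulnerable function next to its patched form. This example is illustrative only and is not drawn from the paper's Linux kernel dataset.

```c
#include <stdlib.h>
#include <string.h>

/* Vulnerable: the buffer is freed on the error path but the
 * stale pointer is still written to afterwards (CWE-416),
 * and free() may then be called a second time. */
int process_vulnerable(const char *msg, int fail) {
    char *buf = malloc(64);
    if (buf == NULL)
        return -1;
    if (fail)
        free(buf);          /* buf freed here ...              */
    strncpy(buf, msg, 63);  /* ... but still used here (UAF)   */
    buf[63] = '\0';
    free(buf);              /* ... and possibly double-freed   */
    return 0;
}

/* Patched: bail out immediately after freeing, so the stale
 * pointer is never dereferenced or freed again. */
int process_patched(const char *msg, int fail) {
    char *buf = malloc(64);
    if (buf == NULL)
        return -1;
    if (fail) {
        free(buf);
        return -1;          /* fix: return right after free()  */
    }
    strncpy(buf, msg, 63);
    buf[63] = '\0';
    free(buf);
    return 0;
}
```

A correct patch here changes control flow rather than simplifying the function away, which relates to the paper's observation that LLM-generated patches tend to oversimplify code.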
Statistics
Over 29,000 Common Vulnerabilities and Exposures (CVEs) were addressed in 2023, up from 25,084 in 2022 and 20,153 in 2021.
The dataset includes 307 real-world vulnerabilities from the Linux kernel, covering 30 different Common Weakness Enumerations (CWEs).
Vulnerable code blocks average 202 lines, ranging from 3 to 3,581 lines, while patched blocks average 209 lines, spanning from 3 to 3,613 lines.
Quotes
"Large Language Models (LLMs) have shown remarkable capabilities in understanding programming languages, demonstrating significant potential for automating Software Vulnerability Detection (SVD) and Software Vulnerability Patching (SVP)."
"A key problem with state-of-the-art LLMs is that they are trained on billions of lines of code without distinguishing between vulnerable and non-vulnerable code, which can lead to ineffective identification and prevention of software vulnerabilities."
"Our results reveal that LLMs often struggle with distinguishing between vulnerable and patched code. Furthermore, in SVP tasks, these models tend to oversimplify the code, producing solutions that may not be directly usable without further refinement."