Core Concepts
Large Language Models (LLMs) demonstrate varying capabilities across software vulnerability tasks, including detection, assessment, localization, and description. While LLMs perform well in some aspects, they still need to improve at understanding subtle differences between code vulnerabilities and at describing vulnerabilities before they can realize their full potential.
Summary
The paper proposes a pipeline for quantitatively evaluating interactive LLMs using publicly available datasets. It carries out an extensive technical evaluation of LLMs using the Big-Vul dataset, covering four different common software vulnerability tasks: detection, assessment, localization, and description.
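As a rough illustration of what quantitative scoring for the detection task can look like, the sketch below turns free-text model replies into binary verdicts and computes precision, recall, F1, and accuracy. It is not the paper's pipeline. The yes/no reply format, the helper names (`parse_verdict`, `score_detection`), the choice of metrics, and the toy data are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DetectionScores:
    precision: float
    recall: float
    f1: float
    accuracy: float


def parse_verdict(response: str) -> int:
    """Map a free-text reply to 1 (vulnerable) or 0 (not vulnerable).

    Assumption: the prompt asks the model to begin its answer with "yes" or "no".
    Anything that does not start with "yes" is treated as "not vulnerable".
    """
    return 1 if response.strip().lower().startswith("yes") else 0


def score_detection(labels: Sequence[int], predictions: Sequence[int]) -> DetectionScores:
    """Compute binary classification metrics with 1 as the positive class."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return DetectionScores(precision, recall, f1, accuracy)


# Toy data, invented for illustration only (not taken from Big-Vul).
labels = [1, 1, 0, 0, 1]
responses = ["Yes, the buffer is unchecked.", "no", "No.", "yes", "YES"]
predictions = [parse_verdict(r) for r in responses]
scores = score_detection(labels, predictions)
print(scores)
```

Keeping reply parsing separate from scoring means the same metric code can be reused for fine-tuned and few-shot runs, whatever form the raw model output takes.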
Key findings:
Vulnerability Detection:
1. Fine-tuned LLMs perform worse than transformer-based methods but comparably to graph-based methods. LLMs in the few-shot setting perform worse than existing methods.
2. Fine-tuning improves the detection capability of every LLM except Mistral. Larger models usually perform better, but performance is also influenced by model design and pre-training data.
3. WizardCoder has the best vulnerability detection capability, while Mistral has the worst.
Vulnerability Assessment:
4. Larger parameter counts did not improve LLMs' vulnerability assessment performance, so smaller models should be prioritized for a better cost-performance balance.
5. LLMs have limited capacity to assess vulnerability severity from source code alone, but in most cases their performance improves substantially when they are given additional context information.
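The difference between code-only and context-augmented assessment can be pictured as a difference in prompt construction. The sketch below is a hypothetical prompt builder, not the prompts used in the paper. The severity label set, the wording, and the context field name are assumptions chosen for illustration.

```python
from typing import Mapping, Optional

# Hypothetical label set; the paper's actual severity categories may differ.
SEVERITY_LABELS = ("Low", "Medium", "High")


def build_assessment_prompt(code: str, context: Optional[Mapping[str, str]] = None) -> str:
    """Build a severity-assessment prompt, optionally appending extra context."""
    lines = [
        "Rate the severity of the vulnerability in the function below.",
        "Answer with one of: " + ", ".join(SEVERITY_LABELS) + ".",
        "",
        "Function:",
        code.strip(),
    ]
    if context:
        # Extra information (for example a CVE description) follows the code.
        lines.append("")
        lines.append("Additional context:")
        for name, value in context.items():
            lines.append(f"- {name}: {value}")
    return "\n".join(lines)


snippet = "int f(void) { return 0; }"  # placeholder function, not real data
code_only = build_assessment_prompt(snippet)
with_context = build_assessment_prompt(
    snippet, {"CVE description": "buffer overflow via long SSID value"}
)
print(with_context)
```

Because the context-augmented prompt only appends to the code-only prompt, the two settings can be compared on identical inputs.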
Vulnerability Localization:
6. The few-shot setting exposes LLM limitations, but fine-tuning enhances localization capability.
7. Mistral's significant improvement after fine-tuning showcases its potential.
Vulnerability Description:
8. CodeLlama, StarCoder, WizardCoder, and Mistral excel at learning from historical description data.
The evaluation pipeline provides valuable insights for further enhancing LLMs' software vulnerability handling capabilities.
Statistics
Multiple buffer overflows in net/wireless/nl80211.c in the Linux kernel before 2.6.39.2 allow local users to gain privileges by leveraging the CAP_NET_ADMIN capability during scan operations with a long SSID value.
This vulnerability has existed since version 2.6.29-rc4 of the Linux kernel.
Quotes
"Multiple buffer overflows in net/wireless/nl80211.c in the Linux kernel before 2.6.39.2 allow local users to gain privileges by leveraging the CAP_NET_ADMIN capability during scan operations with a long SSID value."
"This bug has existed since version 2.6.29-rc4 of the Linux kernel."