The paper proposes a pipeline for quantitatively evaluating interactive LLMs using publicly available datasets. Using the Big-Vul dataset, it carries out an extensive technical evaluation of LLMs across four common software vulnerability tasks: detection, assessment, localization, and description.
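As a rough illustration of what the detection step of such a pipeline might look like, here is a minimal sketch. The dataset field names ("func", "target"), the prompt wording, and the metric choices are assumptions for illustration, not the paper's exact setup.

```python
import json
from sklearn.metrics import accuracy_score, f1_score

def build_detection_prompt(func_source: str) -> str:
    # Zero-shot style prompt asking the model for a binary verdict.
    return (
        "Does the following C/C++ function contain a security vulnerability? "
        "Answer with 'yes' or 'no' only.\n\n" + func_source
    )

def evaluate_detection(samples, query_llm):
    """samples: iterable of dicts with 'func' (source code) and 'target' (1 = vulnerable).
    query_llm: callable that sends a prompt to the model under test and returns its text reply."""
    y_true, y_pred = [], []
    for sample in samples:
        reply = query_llm(build_detection_prompt(sample["func"]))
        y_pred.append(1 if "yes" in reply.lower() else 0)
        y_true.append(sample["target"])
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

if __name__ == "__main__":
    # Big-Vul is distributed as JSON/CSV; the file name and field names here are assumptions.
    with open("big_vul_test.json") as fh:
        samples = json.load(fh)
    # query_llm would wrap whichever model is being evaluated.
    # print(evaluate_detection(samples, query_llm))
```

The same loop generalizes to the other three tasks by swapping the prompt template and the scoring function (e.g., severity labels for assessment, line numbers for localization, text-similarity metrics for description).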
Key findings:
Vulnerability Detection:
Vulnerability Assessment:
4. Larger parameter counts did not improve LLMs' vulnerability assessment performance; smaller models therefore offer a better cost-performance balance.
5. LLMs have limited capacity to assess vulnerability severity from source code alone, but in most cases their performance improves substantially when additional context information is provided.
Vulnerability Localization:
6. The few-shot setting exposes LLM limitations in localization, but fine-tuning substantially enhances their capability (a sketch of how such a few-shot prompt might be assembled appears after this list).
7. Mistral's significant improvement after fine-tuning showcases its potential.
Vulnerability Description:
8. CodeLlama, StarCoder, WizardCoder, and Mistral excel at learning from historical description data.
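To make the few-shot setting in finding 6 concrete, the sketch below shows one way a few-shot localization prompt could be assembled from demonstration pairs. The instruction wording, line-number output format, and the demonstration example are illustrative assumptions, not the paper's exact prompts.

```python
def build_fewshot_localization_prompt(examples, target_func):
    """examples: list of (function_source, vulnerable_line_numbers) demonstration pairs
    drawn from the training split; target_func: the function whose flaw should be located."""
    parts = [
        "Identify the line numbers of the vulnerable statements in each function.\n"
    ]
    for src, lines in examples:
        parts.append(f"Function:\n{src}\nVulnerable lines: {', '.join(map(str, lines))}\n")
    # The target function is appended last, leaving the answer for the model to complete.
    parts.append(f"Function:\n{target_func}\nVulnerable lines:")
    return "\n".join(parts)

# Usage: with k demonstrations (k-shot), the prompt is sent to the model and the predicted
# line numbers are compared against the ground-truth vulnerable lines from the fix commit.
# demo = [("void f(char *s) {\n  char buf[8];\n  strcpy(buf, s);\n}", [3])]
# prompt = build_fewshot_localization_prompt(demo, some_target_function)
```

Fine-tuning replaces this in-context demonstration step with supervised training on the same (function, vulnerable-lines) pairs, which is where findings 6 and 7 report the larger gains.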
The evaluation pipeline provides valuable insights for further enhancing LLMs' software vulnerability handling capabilities.
Key insights distilled from the paper by Xin Yin, Chao... at arxiv.org (2024-04-03): https://arxiv.org/pdf/2404.02056.pdf