The paper proposes a pipeline for quantitatively evaluating interactive LLMs using publicly available datasets. It carries out an extensive technical evaluation of LLMs on the Big-Vul dataset, covering four common software vulnerability tasks: detection, assessment, localization, and description.
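To make the setup concrete, the following is a minimal sketch of such an evaluation loop, assuming Big-Vul is available as a CSV with func_before (pre-fix source code) and vul (label) columns; query_model is a hypothetical placeholder for whichever LLM is under test, and the prompt templates are illustrative rather than the paper's exact wording.

```python
# Minimal evaluation-loop sketch; column names and prompts are assumptions.
import csv

TASK_PROMPTS = {
    "detection": "Is the following function vulnerable? Answer yes or no.\n\n{code}",
    "assessment": "Rate the severity (LOW/MEDIUM/HIGH/CRITICAL) of the vulnerability in:\n\n{code}",
    "localization": "Identify the line(s) containing the vulnerability in:\n\n{code}",
    "description": "Describe the vulnerability in the following function:\n\n{code}",
}

def query_model(prompt: str) -> str:
    """Hypothetical placeholder; swap in the actual model or API under test."""
    raise NotImplementedError

def evaluate(dataset_path: str, task: str, limit: int = 100) -> list[tuple[str, str]]:
    """Return (ground truth, model answer) pairs for downstream scoring."""
    results = []
    with open(dataset_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            if i >= limit:
                break
            prompt = TASK_PROMPTS[task].format(code=row["func_before"])
            results.append((row.get("vul", ""), query_model(prompt)))
    return results
```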
Key findings:
Vulnerability Detection:
Vulnerability Assessment:
4. Larger parameter counts did not improve LLMs' vulnerability assessment performance; smaller models offer a better cost-performance balance.
5. LLMs have limited capacity to assess vulnerability severity from source code alone, but in most cases their performance improves substantially when additional context is provided (see the prompt sketch below).
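As a minimal sketch of what "additional context" can mean in practice, the function below augments a code-only severity prompt with CVE and CWE metadata. The field names (cve_description, cwe_id) are assumptions about the metadata available alongside each Big-Vul function, not the paper's exact prompt format.

```python
# Sketch contrasting code-only vs. context-augmented severity prompts.
def build_assessment_prompt(code: str, cve_description: str | None = None,
                            cwe_id: str | None = None) -> str:
    prompt = "Rate the severity of this vulnerability (LOW/MEDIUM/HIGH/CRITICAL).\n"
    if cve_description:  # the extra context that finding 5 says helps most
        prompt += f"CVE description: {cve_description}\n"
    if cwe_id:
        prompt += f"CWE category: {cwe_id}\n"
    return prompt + f"\nCode:\n{code}\nSeverity:"
```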
Vulnerability Localization:
6. Few-shot settings expose the limitations of LLMs, but fine-tuning enhances their localization capability.
7. Mistral's significant improvement after fine-tuning showcases its potential; a fine-tuning sketch follows this list.
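For illustration, here is a minimal LoRA fine-tuning sketch for the localization task built on the Hugging Face transformers and peft libraries; the checkpoint name, target modules, and hyperparameters are assumptions, not the paper's reported configuration.

```python
# Minimal LoRA fine-tuning sketch (assumed setup, not the paper's exact one).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the base model with low-rank adapters so only a small fraction of
# parameters is trained.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# Training would pair prompts such as "Identify the vulnerable line(s):\n<code>"
# with ground-truth line numbers from Big-Vul, then run a standard supervised
# fine-tuning loop (omitted here for brevity).
```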
Vulnerability Description:
8. CodeLlama, StarCoder, WizardCoder, and Mistral excel at learning from historical description data; one way to prepare such data is sketched below.
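A plausible data-preparation sketch for that fine-tuning: pair each Big-Vul function with its historical vulnerability description as prompt/completion records. The summary column name is an assumption about the CSV layout.

```python
# Build (code, description) fine-tuning pairs from Big-Vul; 'summary' is
# an assumed column name for the historical vulnerability description.
import csv
import json

def build_description_pairs(dataset_path: str, out_path: str) -> None:
    with open(dataset_path, newline="", encoding="utf-8") as f, \
         open(out_path, "w", encoding="utf-8") as out:
        for row in csv.DictReader(f):
            if not row.get("summary"):
                continue  # skip records without a historical description
            out.write(json.dumps({
                "prompt": f"Describe the vulnerability:\n{row['func_before']}",
                "completion": row["summary"],
            }) + "\n")
```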
The evaluation pipeline provides valuable insights for further enhancing LLMs' software vulnerability handling capabilities.
Key insights distilled from the paper by Xin Yin, Chao... (arxiv.org, 04-03-2024): https://arxiv.org/pdf/2404.02056.pdf