The paper proposes a pipeline for quantitatively evaluating interactive LLMs using publicly available datasets. It carries out an extensive technical evaluation of LLMs on the Big-Vul dataset, covering four common software vulnerability tasks: detection, assessment, localization, and description.
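To make the pipeline concrete, here is a minimal sketch of how the detection task could be scored on Big-Vul-style records. The `query_model` callable, the prompt wording, and the `func`/`target` field names are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a detection-task evaluation loop, assuming Big-Vul-style
# records with a `func` (source code) field and a binary `target` label.
# `query_model` is a hypothetical stand-in for whichever LLM API is used.
from typing import Callable, Dict, List

PROMPT_TEMPLATE = (
    "Is the following C/C++ function vulnerable? Answer YES or NO.\n\n{code}"
)

def evaluate_detection(
    records: List[Dict],
    query_model: Callable[[str], str],
) -> Dict[str, float]:
    """Run zero-shot vulnerability detection and report accuracy, precision, recall, F1."""
    tp = fp = tn = fn = 0
    for record in records:
        answer = query_model(PROMPT_TEMPLATE.format(code=record["func"]))
        predicted = answer.strip().upper().startswith("YES")
        actual = bool(record["target"])
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and not actual:
            tn += 1
        else:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / max(len(records), 1)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

The same loop structure extends to the other three tasks by swapping the prompt and the scoring rule.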
Key findings:
Vulnerability Detection:
Vulnerability Assessment:
4. Larger parameter counts did not improve LLMs' vulnerability assessment performance; smaller models should be prioritized for a better cost-performance balance.
5. LLMs have limited capacity to assess vulnerability severity from source code alone, but in most cases their performance improves substantially when additional context information is provided.
Vulnerability Localization:
6. The few-shot setting exposes LLMs' limitations in vulnerability localization, but fine-tuning enhances their capability (see the prompt-construction sketch after this list).
7. Mistral's significant improvement after fine-tuning showcases its potential.
Vulnerability Description:
8. CodeLlama, StarCoder, WizardCoder, and Mistral excel at learning from historical description data.
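As referenced in finding 6, the following is a hypothetical sketch of how a few-shot localization prompt could be assembled from Big-Vul exemplars; the `build_localization_prompt` helper, the `func`/`lines` field names, and the prompt wording are illustrative assumptions rather than the paper's exact setup.

```python
# Hypothetical few-shot prompt builder for vulnerability localization:
# each exemplar pairs a function with the line numbers flagged in Big-Vul,
# and the query function is appended for the model to annotate.
from typing import Dict, List

def build_localization_prompt(exemplars: List[Dict], query_code: str) -> str:
    parts = ["Identify the vulnerable line numbers in each function."]
    for ex in exemplars:
        parts.append(f"### Function:\n{ex['func']}\n### Vulnerable lines: {ex['lines']}")
    # Leave the answer slot empty so the model completes it.
    parts.append(f"### Function:\n{query_code}\n### Vulnerable lines:")
    return "\n\n".join(parts)
```

Under this framing, fine-tuning replaces the in-context exemplars with supervised training on the same (function, vulnerable lines) pairs.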
The evaluation pipeline provides valuable insights for further enhancing LLMs' software vulnerability handling capabilities.
Key ideas extracted from the source content by Xin Yin, Chao... at arxiv.org, 2024-04-03: https://arxiv.org/pdf/2404.02056.pdf