Core Concepts
With appropriate fine-tuning, Large Language Models can significantly outperform existing learning-based methods for automated vulnerability localization, while prompting approaches prove less effective.
Abstract
The paper presents a comprehensive study on the capabilities of Large Language Models (LLMs) for Automated Vulnerability Localization (AVL). The key findings are:
Prompting LLMs, including GPT-3.5, GPT-4, and various open-source models, struggles to match the performance of existing learning-based methods for AVL. However, prompting can still be a viable alternative in data-scarce scenarios.
Fine-tuning LLMs, particularly through discriminative approaches that cast AVL as a sequence labeling task, can significantly outperform state-of-the-art methods, with up to 36.2% improvement in F1-score (see the sequence-labeling sketch after this list). Generative fine-tuning also shows promise but is less effective.
LLMs exhibit robust performance across different vulnerability types (CWEs) and can generalize to new projects, though they face challenges with subtle vulnerabilities and novel patterns.
The authors identify key limitations of LLMs for AVL, such as input length constraints and unidirectional context, and propose effective mitigation strategies, including sliding window and right-forward embedding, which substantially enhance performance (see the sliding-window sketch after this list).
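To make the discriminative formulation concrete, here is a minimal sketch of casting AVL as sequence labeling with a Hugging Face token-classification head. The checkpoint name, the binary SAFE/VULNERABLE label scheme, and the label_function helper are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: AVL as line-level sequence labeling (discriminative fine-tuning).
# Checkpoint, label scheme, and helper names are illustrative placeholders.
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL = "microsoft/codebert-base"       # placeholder; the paper fine-tunes code LLMs
LABELS = {0: "SAFE", 1: "VULNERABLE"}   # one binary label per token

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForTokenClassification.from_pretrained(MODEL, num_labels=len(LABELS))

def label_function(code: str, vulnerable_lines: set[int]) -> dict:
    """Tokenize one function and label every token whose source line is vulnerable."""
    enc = tokenizer(code, return_offsets_mapping=True, truncation=True)
    # Character offset at which each 1-based source line starts.
    line_starts, pos = [], 0
    for line in code.splitlines(keepends=True):
        line_starts.append(pos)
        pos += len(line)

    def line_of(char_idx: int) -> int:
        return max(i + 1 for i, s in enumerate(line_starts) if char_idx >= s)

    enc["labels"] = [
        -100 if start == end                       # special tokens: ignored by the loss
        else (1 if line_of(start) in vulnerable_lines else 0)
        for start, end in enc["offset_mapping"]
    ]
    return enc

# Fine-tuning then uses the standard token-classification cross-entropy loss;
# a line is reported as vulnerable if any of its tokens is predicted VULNERABLE.
```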
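The sliding-window mitigation for functions that exceed the model's context length can be illustrated as follows. The window size, stride, max-score merge rule, and the predict_chunk hook are assumptions made for illustration; the right-forward embedding adaptation for unidirectional context is paper-specific and not reproduced here.

```python
# Sketch of sliding-window inference for functions longer than the model context.
# WINDOW, STRIDE, the max-score merge rule, and predict_chunk are assumptions.
from collections import defaultdict
from typing import Callable, Dict, List

WINDOW = 512   # tokens per chunk (illustrative)
STRIDE = 256   # overlap between consecutive chunks (illustrative)

def sliding_window_predict(
    tokens: List[str],
    predict_chunk: Callable[[List[str]], Dict[int, float]],
) -> Dict[int, float]:
    """Score every token of a long function by running `predict_chunk`
    (chunk-local index -> vulnerability score) over overlapping windows
    and keeping the maximum score observed for each token."""
    scores: Dict[int, float] = defaultdict(float)
    start = 0
    while start < len(tokens):
        chunk = tokens[start:start + WINDOW]
        for offset, score in predict_chunk(chunk).items():
            scores[start + offset] = max(scores[start + offset], score)
        if start + WINDOW >= len(tokens):
            break
        start += STRIDE
    return dict(scores)
```

The overlap between windows lets each statement be scored with surrounding context on both sides, which is what allows this strategy to compensate for the context-length limit.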
Stats
"The dataset consists of 10,811 distinct vulnerable C/C++ functions along with their vulnerability locations."
"The new dataset (SC-LOC) comprises 1,369 entries of vulnerable smart contracts written in Solidity."
Quotes
"Discriminative fine-tuning of LLMs can significantly outperform existing learning-based methods for AVL, while other paradigms prove less effective or unexpectedly ineffective for the task."
"LLMs show promising adaptability and accuracy in identifying varied vulnerability types (CWEs) owing to their extensive pre-training on diverse codebases."
"The application of these strategies has proven to significantly enhance performance, yielding improvements of up to 24.1% in F1-score."