The researchers collected a benchmark of 15 real-world one-day vulnerabilities from the Common Vulnerabilities and Exposures (CVE) database and academic papers. They developed an LLM agent using GPT-4 as the base model, along with a prompt, the ReAct agent framework, and access to various tools.
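To make the ReAct pattern concrete, here is a minimal sketch of a generic reason-act-observe loop. It is not the researchers' agent: `run_react`, `scripted_policy`, the `lookup` tool, and the `EXAMPLE-0001` key are all hypothetical names, and a scripted function stands in for the language model so the example runs offline.

```python
from typing import Callable, Dict, List, Tuple

# A policy step is (thought, action, action_input).
# In a real agent an LLM would produce it; here it is a hypothetical stand-in.
Step = Tuple[str, str, str]


def run_react(policy: Callable[[List[str]], Step],
              tools: Dict[str, Callable[[str], str]],
              max_steps: int = 5) -> Tuple[str, List[str]]:
    """Alternate reasoning and tool calls until the policy emits 'finish'."""
    transcript: List[str] = []
    for _ in range(max_steps):
        thought, action, arg = policy(transcript)
        transcript.append(f"Thought: {thought}")
        if action == "finish":
            return arg, transcript
        tool = tools.get(action)
        observation = tool(arg) if tool else f"unknown tool: {action}"
        transcript.append(f"Action: {action}[{arg}]")
        transcript.append(f"Observation: {observation}")
    # Step budget exhausted without a final answer.
    return "", transcript


def scripted_policy(transcript: List[str]) -> Step:
    """Toy policy (assumption): look one thing up, then finish with it."""
    prefix = "Observation: "
    observations = [line for line in transcript if line.startswith(prefix)]
    if not observations:
        return ("I need the advisory text first.", "lookup", "EXAMPLE-0001")
    return ("I have what I need.", "finish", observations[-1][len(prefix):])


# Hypothetical, harmless tool: a dictionary lookup.
TOOLS = {
    "lookup": lambda key: {"EXAMPLE-0001": "placeholder advisory text"}.get(key, "not found"),
}

answer, transcript = run_react(scripted_policy, TOOLS)
```

Each iteration appends the model's thought, the chosen action, and the tool's observation to a running transcript, which is then fed back to the policy on the next step.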
The key findings are:
GPT-4 achieved an 87% success rate in exploiting the one-day vulnerabilities, while every other model tested (GPT-3.5 and 8 open-source LLMs) and both open-source vulnerability scanners (ZAP and Metasploit) had a 0% success rate.
When the CVE description was removed, GPT-4's success rate dropped to 7%, suggesting that identifying the vulnerability is more challenging than exploiting it once it is known.
Without the CVE description, GPT-4 identified the correct vulnerability 33.3% of the time (55.6% for vulnerabilities past its knowledge cutoff date) but exploited only one of the vulnerabilities it correctly identified.
The average cost of using GPT-4 to exploit the vulnerabilities was $3.52 per run, which is 2.8 times cheaper than estimated human labor costs.
The results demonstrate the emergent capabilities of LLM agents, specifically GPT-4, in the realm of cybersecurity and raise important questions about the widespread deployment of such powerful agents.
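The reported percentages can be translated back into counts on the 15-vulnerability benchmark. The sketch below is a sanity check, not data from the paper: the counts are inferred from the rounded percentages, and `implied_successes` is a hypothetical helper name.

```python
BENCHMARK_SIZE = 15  # number of vulnerabilities in the benchmark


def implied_successes(rate_percent, total=BENCHMARK_SIZE):
    """Return every count k whose percentage k/total rounds to rate_percent."""
    return [k for k in range(total + 1) if round(100 * k / total) == rate_percent]


# Inferred counts (assumption: percentages were rounded to the nearest integer).
with_description = implied_successes(87)     # GPT-4 given the CVE description
without_description = implied_successes(7)   # GPT-4 without the CVE description
baselines = implied_successes(0)             # other models and scanners
```

Under that rounding assumption, 87% corresponds to a single possible count out of 15, and 7% corresponds to one success, which is consistent with the statement that GPT-4 exploited only one vulnerability without the description.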
Source: arxiv.org