Key concepts
Large Language Models like ChatGPT struggle to generate secure code when using security APIs, with around 70% of the code instances containing security API misuse across various functionalities.
Summary
The study systematically assessed the trustworthiness of ChatGPT in generating secure code for five widely used security APIs in Java (JCA, JSSE, OAuth, Biometrics, and Play Integrity), covering 16 distinct security functionalities. The researchers compiled an extensive set of 48 programming tasks and employed both automated and manual approaches to detect security API misuse in the generated code.
The key findings are:
- Validity of Generated Code:
  - JCA and JSSE functionalities yielded higher rates of valid programs (91% and 76% respectively).
  - Rates fell to 34% for Biometrics, 32% for OAuth, and 26% for Play Integrity APIs.
  - Factors contributing to invalid code include lack of training data, task complexity, and hallucination.
- API Selection:
  - ChatGPT showed high accuracy in selecting correct JCA APIs (99% of valid programs).
  - The rate of correct API selection decreased to 72% for JSSE functionalities.
  - For OAuth, Biometrics, and Play Integrity, ChatGPT often relied on deprecated APIs.
- API Usage:
  - The overall misuse rate across all 48 tasks was approximately 70%.
  - For JCA, the misuse rate was around 62%, closely reflecting the prevalence in open-source Java projects.
  - The misuse rate rose to about 85% for JSSE and reached 100% for OAuth and Biometrics APIs.
The study identified 20 distinct types of security API misuse, including the use of constant or predictable cryptographic keys, insecure encryption modes, short cryptographic keys, and improper SSL/TLS hostname and certificate validation; two of these patterns are sketched below.
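As an illustration, the following sketch (our own, not code from the study; the class and method names are invented) contrasts two of the reported JCA misuse patterns, a constant hard-coded key and the deterministic ECB mode, with a safer equivalent: a freshly generated 256-bit key and AES-GCM with a random IV.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

public class JcaMisuseSketch {

    // Misuse pattern reported by the study: a constant key hard-coded in
    // source, used with ECB mode (deterministic, leaks plaintext structure).
    static byte[] encryptInsecurely(byte[] plaintext) throws Exception {
        byte[] constantKey = "0123456789abcdef".getBytes(StandardCharsets.UTF_8);
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(constantKey, "AES"));
        return cipher.doFinal(plaintext);
    }

    // Safer equivalent: a freshly generated 256-bit key and AES-GCM with a
    // random 12-byte IV, prepended to the ciphertext so decryption can use it.
    static byte[] encryptSecurely(byte[] plaintext, SecretKey key) throws Exception {
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal(plaintext);
        byte[] out = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
        return out;
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(256);
        byte[] ct = encryptSecurely("hello".getBytes(StandardCharsets.UTF_8),
                keyGen.generateKey());
        System.out.println("IV + ciphertext bytes: " + ct.length);
    }
}
```

The improper SSL/TLS hostname and certificate validation misuse commonly takes the shape below, again as an illustrative sketch rather than the study's code: a trust-all TrustManager combined with an always-true HostnameVerifier.

```java
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

public class JsseMisuseSketch {

    // Misuse pattern: a trust-all TrustManager plus an always-true
    // HostnameVerifier, which together disable certificate and hostname
    // validation for HTTPS connections.
    static void disableTlsChecksInsecurely() throws Exception {
        TrustManager[] trustAll = { new X509TrustManager() {
            public void checkClientTrusted(X509Certificate[] chain, String authType) {}
            public void checkServerTrusted(X509Certificate[] chain, String authType) {}
            public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
        }};
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, trustAll, new SecureRandom());
        HttpsURLConnection.setDefaultSSLSocketFactory(ctx.getSocketFactory());
        HttpsURLConnection.setDefaultHostnameVerifier((hostname, session) -> true);
    }

    // The secure choice is to do none of the above: JSSE's defaults already
    // validate the certificate chain against the platform trust store and
    // verify the hostname automatically.
}
```

Because both overrides install process-wide defaults, a single generated snippet of this shape silently strips certificate and hostname checks from every HttpsURLConnection the application opens.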
The findings raise significant concerns about the trustworthiness of ChatGPT in generating secure code, particularly in security-sensitive contexts. The results highlight the need for further research to improve the security of LLM-generated code.
Statistics
Around 70% of the code instances across 30 attempts per task contain security API misuse.
The misuse rate reaches 100% for roughly half of the tasks.
Quotes
"The increasing trend of using Large Language Models (LLMs) for code generation raises the question of their capability to generate trustworthy code."
"Our findings are concerning: around 70% of the code instances across 30 attempts per task contain security API misuse, with 20 distinct misuse types identified."
"For roughly half of the tasks, this rate reaches 100%, indicating that there is a long way to go before developers can rely on ChatGPT to securely implement security API code."