מושגי ליבה
Trojans or backdoors in neural models of code can enable adversaries to intentionally insert hidden triggers that cause the model to behave in unintended or malicious ways. This work presents a comprehensive taxonomy of trigger-based trojans in large language models of code, and a critical review of recent state-of-the-art poisoning techniques.
תקציר
This work introduces a unified trigger taxonomy framework to enhance the understanding and exploration of trojan attacks within large language models of code. The taxonomy covers six key aspects of trigger design, including insertion location in the ML pipeline, number of input features, target samples, variability of trigger content, code context, and size in number of tokens.
The authors compare and analyze recent, impactful works on trojan attacks in code language models, using the proposed trigger taxonomy as a guide. Key findings include:
- Triggers are commonly introduced during the fine-tuning stage, rather than pre-training, to enable targeted attacks.
- Both targeted and untargeted triggers have been used, with targeted triggers leveraging specific properties of the input samples.
- Dynamic triggers, such as grammar-based and distribution-centric triggers, are more stealthy and powerful than fixed triggers.
- Structural triggers that change the semantics of the code, as well as semantic-preserving triggers, have both been employed in attacks.
- Partial triggers, where only a subset of the full trigger is used, can be effective while being more stealthy.
The authors also draw insights on trigger design based on findings about how code models learn, highlighting the importance of focusing on semantic triggers and the potential for partial triggers to evade detection.
סטטיסטיקה
"Large language models (LLMs) have provided a lot of exciting new capabilities in software development."
"With the growing prevalence of these models in the modern software development ecosystem, the security issues in these models have also become crucially important."
"Trojans or backdoors in neural models refer to a type of adversarial attack in which a malicious actor intentionally inserts a hidden trigger into a neural network during its training phase."
"A trigger serves as the central element and is the key design point of a trojan attack – it is the key to changing the behaviour of code models."
ציטוטים
"The opaque nature of these models makes them difficult to reason about and inspect."
"Their opacity gives rise to potential security risks, as adversaries can train and deploy compromised models to disrupt the software development process in the victims' organization."
"Given these models' widespread use, potentially in a wide range of mission-critical settings, it is important to study potential trojan attacks they may encounter."