Key Concepts
The study examines whether trojan signatures can be detected in large language models of code and finds that they are hard to detect from model weights alone.
Abstract
The study investigates trojan signatures in large language models of code for two tasks: defect detection and clone detection. Trojan signatures prove difficult to generalize to these models, which suggests that the models resist revealing trojans from their weights alone. A comparison of full-finetuned and freeze-finetuned models shows no significant shift in the weight distributions of the trojaned classes. The study also discusses techniques for detecting trojans in neural code models and the challenges posed by the size and complexity of these models.
Statistics
Fields et al. found trojan signatures in computer vision classification tasks with image models such as ResNet, WideResNet, DenseNet, and VGG.
A trojan can cause an LLM to generate malicious output when the input contains specific trigger words.
Poisoned datasets were used to generate trojaned code models for defect detection and clone detection tasks.
Fields et al. used the TrojAI dataset to detect trojaned image models with trojan signatures extracted from model parameters.
Gaussian kernel density estimation (KDE) is used to draw smoothed density plots of the classifier layer weights, which can reveal potential trojan signatures.
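The Gaussian KDE step can be illustrated with a minimal sketch. It is not the study's code: the weights are synthetic random values standing in for a real classifier layer, the names `classifier_weights`, `class_0`, and `class_1` are hypothetical, and the bandwidth uses Scott's rule of thumb, which is an assumption (the summary does not state a bandwidth). The estimator is written in plain Python to stay self-contained; `scipy.stats.gaussian_kde` provides the same kind of estimate.

```python
import math
import random
import statistics


def gaussian_kde(samples, grid, bandwidth=None):
    """Evaluate a 1-D Gaussian kernel density estimate of `samples` on `grid`."""
    n = len(samples)
    if bandwidth is None:
        # Assumption: Scott's rule of thumb for 1-D data, h = sigma * n^(-1/5).
        bandwidth = statistics.stdev(samples) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]


def linspace(lo, hi, num):
    """Return `num` evenly spaced points from `lo` to `hi` inclusive."""
    step = (hi - lo) / (num - 1)
    return [lo + i * step for i in range(num)]


# Hypothetical stand-in for a classifier layer: one weight row per output class.
# Real weights would be read from the finetuned model instead.
rng = random.Random(0)
classifier_weights = {
    "class_0": [rng.gauss(0.0, 0.02) for _ in range(768)],
    "class_1": [rng.gauss(0.0, 0.02) for _ in range(768)],
}

# Evaluate every class on one shared grid so the curves are directly comparable.
all_weights = [w for row in classifier_weights.values() for w in row]
grid = linspace(min(all_weights), max(all_weights), 200)
densities = {
    name: gaussian_kde(row, grid) for name, row in classifier_weights.items()
}

# Location of each curve's peak: a crude summary of where the weights concentrate.
peaks = {name: grid[d.index(max(d))] for name, d in densities.items()}
for name, peak in peaks.items():
    print(f"{name}: density peaks near weight {peak:+.4f}")
```

Plotting each entry of `densities` against `grid` gives one smoothed curve per class. A signature would appear as the curve of one class visibly shifted relative to the others; the study reports no significant shift of this kind for the trojaned classes in code models.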
Quotes
"Trojan signatures could not generalize to LLMs of code."
"Our results suggest that detecting such trojans only from the weights in code models is a hard problem."
"The impact of the trojan is better hidden in the larger code models compared to smaller image architectures."