The paper proposes the Nested Product of Experts (NPoE) framework to defend against multiple types of backdoor attacks in language models. Backdoor attacks can cause undesirable behaviors in large language models (LLMs) by inserting specific triggers into the training data.
The key components of NPoE are:
Mixture of Experts (MoE): Multiple shallow "trigger-only" models are trained to capture distinct backdoor triggers. These models are combined using a gating function to form the trigger-only MoE.
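The gating step above can be sketched as follows. This is a minimal numpy illustration of a softmax-gated mixture, not the paper's implementation; the function names and shapes are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def trigger_moe(expert_logits, gate_scores):
    """Combine K shallow trigger-only experts with a softmax gate.

    expert_logits: (K, C) per-expert class logits for one input
    gate_scores:   (K,)  unnormalized gate scores for the same input
    Returns mixed class probabilities of shape (C,).
    """
    gate = softmax(gate_scores)              # (K,) mixture weights
    probs = softmax(expert_logits, axis=-1)  # (K, C) per-expert distributions
    return gate @ probs                      # (C,) weighted mixture
```

Each shallow expert can then specialize on one trigger pattern while the gate routes inputs to the expert whose shortcut fires.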
Product of Experts (PoE): The main model is trained in an ensemble with the trigger-only MoE. The main model learns the trigger-free residual, while the trigger-only MoE captures the backdoor shortcuts.
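The product combination can be illustrated in a few lines: in log space, a product of experts is just a sum of logits followed by renormalization. This is a generic PoE sketch under that standard formulation, not the paper's exact training code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def poe_distribution(main_logits, moe_logits):
    """Product of experts over class logits.

    Adding logits multiplies the two probability distributions before
    renormalizing. If the training loss is applied to this product, the
    main model only needs to explain what the trigger-only MoE cannot,
    i.e. the trigger-free residual. At inference the main model is used
    alone, without the trigger-only MoE.
    """
    return softmax(main_logits + moe_logits)
```

A useful sanity check: when the trigger-only MoE is uninformative (uniform logits), the product reduces to the main model's own distribution.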
Pseudo Development Set: Since the defender has no prior knowledge about the backdoor triggers, a pseudo development set is constructed by identifying poisoned samples based on the confidence of the main model and trigger-only MoE.
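One plausible selection rule for such a pseudo development set is sketched below: flag a training sample as likely poisoned when the trigger-only MoE is highly confident on it (a shortcut fired) while the main model is not. The threshold rule and function name are hypothetical illustrations; the paper's exact criterion may differ.

```python
def split_pseudo_dev(main_conf, moe_conf, tau=0.9):
    """Return indices of samples flagged as likely poisoned.

    main_conf: per-sample confidence of the main model
    moe_conf:  per-sample confidence of the trigger-only MoE
    tau:       confidence threshold (hypothetical choice)

    A sample is flagged when the trigger-only MoE fits it with high
    confidence but the main model does not -- the signature of a
    backdoor shortcut rather than genuine task signal.
    """
    return [i for i, (m, t) in enumerate(zip(main_conf, moe_conf))
            if t >= tau and m < tau]
```

The flagged samples can then serve as a stand-in development set for tuning the defense without any prior knowledge of the triggers.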
Experiments on sentiment analysis, hate speech detection, and question classification tasks show that NPoE effectively defends against a variety of triggers, both separately and in trigger mixtures. NPoE outperforms other defense baselines, including the state-of-the-art Denoised Product of Experts (DPoE) method, especially in the mixed-trigger setting.
Key insights distilled from the source paper by Victoria Gra... at arxiv.org, 04-04-2024: https://arxiv.org/pdf/2404.02356.pdf