An Adaptive Black-Box Defense Against Trojan Attacks (TrojDef)

Guanxiong Liu, Abdallah Khreishah, Fatima Sharadgah, Issa Khalil

Research output: Contribution to journal › Article › peer-review


A Trojan backdoor is a poisoning attack against neural network (NN) classifiers in which an adversary exploits the (highly desirable) model-reuse property to implant a Trojan into the model parameters through a poisoned training process. To misclassify an input into a target class, the attacker activates the backdoor by augmenting the input with a predefined trigger known only to the attacker. Most proposed defenses against Trojan attacks assume a white-box setup, in which the defender either has access to the inner state of the NN or is able to run backpropagation through it. In this work, we propose a more practical black-box defense, dubbed TrojDef. In a black-box setup, the defender can only run forward passes of the NN. TrojDef is motivated by the Trojan poisoned training process, in which the model is trained on both benign and Trojan inputs. TrojDef identifies and filters out Trojan inputs (i.e., inputs augmented with the Trojan trigger) by monitoring the changes in prediction confidence when the input is repeatedly perturbed by random noise. We derive a function of the prediction outputs, called the prediction confidence bound, to decide whether an input example is Trojan or benign. The intuition is that the prediction of a Trojan input is more stable, since its misclassification depends only on the trigger, whereas a benign input suffers under added noise because its classification features are perturbed. Through mathematical analysis, we show that if the attacker is perfect in injecting the backdoor, the Trojan-infected model learns the appropriate prediction confidence bound, which distinguishes Trojan from benign inputs under arbitrary perturbations. However, because the attacker may not be perfect in injecting the backdoor, we introduce a nonlinear transform of the prediction confidence bound to improve detection accuracy in practical settings.
Extensive empirical evaluations show that TrojDef significantly outperforms state-of-the-art defenses and remains highly stable across different settings, even when the classifier architecture, the training process, or the hyperparameters change.
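The detection idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `model` callable, the Gaussian noise, the log transform (standing in for the paper's nonlinear transform of the prediction confidence), and the decision threshold `bound` are all assumptions chosen for clarity. The defense only calls the model's forward pass, consistent with the black-box setting.

```python
import numpy as np

def trojan_score(model, x, n_trials=50, noise_std=0.1, seed=0):
    """Score how stable the model's prediction on x is under random noise.

    model: black-box callable mapping a batch of inputs to class probabilities.
    Trojan inputs keep a high confidence under noise (the trigger alone drives
    the prediction), while benign inputs lose confidence as their features are
    perturbed, so a higher score suggests a Trojan input.
    """
    rng = np.random.default_rng(seed)
    target = int(np.argmax(model(x[None])[0]))          # class predicted on the clean input
    noisy = x[None] + rng.normal(0.0, noise_std,        # n_trials noisy copies of x
                                 size=(n_trials,) + x.shape)
    probs = model(noisy)                                # forward pass only (black-box)
    # Mean log-confidence of the original class under noise; the log is a
    # stand-in for the paper's nonlinear transform (assumption).
    return float(np.mean(np.log(probs[:, target] + 1e-12)))

def is_trojan(model, x, bound, **kw):
    """Flag x as Trojan if its confidence under noise stays above `bound`."""
    return trojan_score(model, x, **kw) >= bound
```

For example, with a toy two-class model whose confidence grows with the input's feature sum, an input dominated by a large constant "trigger" keeps near-certain confidence under noise and scores close to 0, while a borderline benign input scores well below it.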

Original language: English (US)
Pages (from-to): 1-15
Number of pages: 15
Journal: IEEE Transactions on Neural Networks and Learning Systems
State: Accepted/In press - 2022

All Science Journal Classification (ASJC) codes

  • Software
  • Computer Science Applications
  • Computer Networks and Communications
  • Artificial Intelligence


Keywords

  • Artificial neural networks
  • Black-box defense
  • Closed box
  • Feature extraction
  • neural network (NN)
  • poisoning attack
  • Predictive models
  • Strips
  • Training
  • Trojan backdoor
  • Trojan horses


