Improving Adversarial Robustness of Ensembles With Diversity Training
Georgia Institute of Technology. Correspondence to: Sanjay Kariyappa <sanjaykariyappa@gatech.edu>, Moinuddin K. Qureshi <moin@gatech.edu>.

Abstract

[…] no direct access to the model being attacked. Such attacks usually rely on the principle of transferability, whereby an attack crafted on a surrogate model tends to transfer to the target model. We show that an ensemble of models with misaligned loss gradients can provide an effective defense against transfer-based attacks. Our key insight is that an adversarial example is less likely to fool multiple models in the ensemble if their loss functions do not increase in a correlated fashion. To this end, we propose Diversity Training, a novel method to train an ensemble of models with uncorrelated loss functions. We show that our method significantly improves the adversarial robustness of ensembles and can also be combined with existing methods to create a stronger defense.

1. Introduction

Despite achieving state-of-the-art classification accuracies on a wide variety of tasks, deep neural networks can be fooled into misclassifying an input that has been adversarially perturbed (Szegedy et al., 2013; Goodfellow et al., 2015). These adversarial perturbations are small enough to go unnoticed by humans but can reliably fool deep neural networks. The existence of adversarial inputs presents a security vulnerability in the deployment of deep neural networks for real-world applications such as self-driving cars, online content moderation and malware detection. It is important to ensure that the models used in these applications are robust to adversarial inputs, as failure to do so can have severe consequences ranging from loss of revenue to loss of lives.

A number of attacks have been proposed which use the gradient information of the model to find perturbations of a benign input that make it adversarial. […] …parameters to the end user, making it harder for an adversary to attack the model. However, adversarial examples have been shown to transfer across different models (Papernot et al., 2016a), enabling adversarial attacks without knowledge of the model architecture or parameters. Attacks that work under these constraints are termed black-box attacks.

1.1. Transferability

Black-box attacks rely on the principle of transferability. In the absence of access to the target model, the adversary trains a surrogate model and crafts adversarial attacks on this model using white-box attacks. Adversarial examples generated this way can fool the target model with a high probability of success (Liu et al., 2016). Adversarial examples are known to span a large contiguous subspace of the input (Goodfellow et al., 2015). Furthermore, recent work explaining transferability (Tramèr et al., 2017) has shown that models with a high-dimensional adversarial subspace (Adv-SS) are more likely to intersect, causing adversarial examples to transfer between models. Our goal is to find an effective way to reduce the dimensionality of the Adv-SS and hence reduce the transferability of adversarial examples. We show that this can be done by using an ensemble of diversely trained models.

1.2. Ensemble as an Effective Defense

An ensemble uses a collection of different models and aggregates their outputs for classification. Adversarial robustness can potentially be improved using an ensemble, as the attack vector must now fool multiple models in the ensemble instead of just a single model. In other words, the attack vector must lie in the shared Adv-SS of the models in the ensemble. We explain this using the Venn diagrams in Figure 1. The area enclosed by the rectangle represents a set of orthogonal perturbations spanning the space of the input, and the circle represents the subset of these orthogonal perturbations that are adversarial (i.e. cause the model to misclassify the input). The shaded region represents the Adv-SS. For a single model (Figure 1a), any perturbation that lies in the subspace defined by the circle would cause […]
Figure 1. Venn diagram illustrations of the adversarial subspace (Adv-SS) of (a) a single model, (b) an ensemble of 3 models and (c) a diverse ensemble. Our goal is to reduce the overlap in the adversarial subspaces of the models in the ensemble, as shown in (c).
Let J(θ, x, y) denote the loss function of the model f, where θ represents the model parameters, x is the benign input and y is the label. The attacker's goal is to generate an adversarial example x′ with the objective of maximizing the model's loss function, such that J(θ, x′, y) > J(θ, x, y), while adhering to the constraint ‖x′ − x‖∞ ≤ ε. Several techniques have been proposed in the literature to solve this constrained optimization problem. We discuss the ones used in the evaluation of our defense.

Fast Gradient Sign Method (FGSM): FGSM (Goodfellow et al., 2015) uses a linear approximation of the loss function to find an adversarial perturbation that causes the loss function to increase. Let ∇x J(θ, x, y) denote the gradient of the loss function with respect to x. The input is modified by adding a perturbation of size ε in the direction of the gradient vector:

x′ = x + ε · sign(∇x J(θ, x, y))    (2)
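For reference, Equation (2) translates directly into a few lines of code. The following is a minimal illustrative sketch (our own, not the paper's code; the evaluation in Section 4 uses the Cleverhans implementations, and `model` here stands for any differentiable classifier):

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """FGSM (Equation 2): x' = x + eps * sign(grad_x J(theta, x, y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)        # J(theta, x, y)
    grad = torch.autograd.grad(loss, x)[0]     # gradient of the loss w.r.t. the input
    return (x + eps * grad.sign()).clamp(0, 1).detach()  # keep pixels in valid range
```

R-FGSM, I-FGSM and MI-FGSM used later are variants of this step (random start, iteration and momentum, respectively), and attacking a surrogate ensemble amounts to replacing the loss with the average cross-entropy over its members.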
[…] 2016). Similar to (Madry et al., 2017), we use Projected Gradient Descent (PGD) to maximize the loss function, which ensures that the attacked image lies within the l∞ ball around the natural image.

3. Diversity Training

3.1. Approach

The goal of our work is to improve the adversarial robustness of the model to black-box attacks by reducing the transferability of adversarial examples. This can be achieved by reducing the dimensionality of the Adv-SS of the target model using an ensemble of diverse models. Our approach to training an ensemble of diverse models is as follows (a sketch of one such measure follows the list):

1. Find a way to measure the dimensionality of the Adv-SS of the ensemble […]
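Step 1 can be made concrete with what this excerpt already defines: the alignment (coherence) of the members' loss gradients. Below is a minimal sketch of a pairwise measure; this is our illustration, and the paper's coherence metric over the set {∇x Ji} may be aggregated or normalized differently. It also quantifies the situation pictured in Figure 2 below.

```python
import itertools
import torch
import torch.nn.functional as F

def coherence(models, x, y):
    """Mean pairwise cosine similarity of per-model input gradients.

    High coherence = aligned gradients = large shared adversarial subspace.
    """
    grads = []
    for m in models:
        xi = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(m(xi), y)               # J_i(theta_i, x, y)
        grads.append(torch.autograd.grad(loss, xi)[0].flatten(1))
    sims = [F.cosine_similarity(ga, gb, dim=1).mean()  # alignment of each model pair
            for ga, gb in itertools.combinations(grads, 2)]
    return torch.stack(sims).mean()
```

Values near +1 indicate the correlated case of Figure 2a; values near zero or below indicate the misaligned case of Figure 2b.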
Figure 2. Illustration of the relationship between gradient alignment and the overlap of the adversarial subspaces of two models: (a) aligned gradients (high coherence, correlated loss functions) yield a large shared adversarial subspace; (b) misaligned gradients (low coherence, uncorrelated loss functions) yield a small shared adversarial subspace.
Table 2. Classification accuracy of models under various black-box attacks, comparing 1. Baseline-Ensemble (TBase), 2. Diverse-Ensemble (TDiv), 3. EnsAdvTrain (TEns) and 4. the EnsAdvTrain + DivTrain defense (TEns+Div). We use ε = 0.1/0.3 for MNIST and ε = 0.03/0.09 for CIFAR-10. The most successful attack for each perturbation size is highlighted in bold. TDiv improves adversarial robustness over TBase. The combined defense TEns+Div is more robust than the individual defenses (TEns / TDiv).
[…] using random shifts and crops for MNIST, and random shifts, flips and crops for CIFAR-10. In addition, we generate a dataset Dnoise by adding a perturbation drawn from a truncated normal distribution N(µ = 0, σ = ε/2) to the images in D, with ε = 0.3 for MNIST and ε = 0.09 for CIFAR-10. The combined dataset {D′, Dnoise} is used to train diverse ensembles. We care about reducing the coherence of {∇x Ji}, i = 1, …, N, primarily around natural images; adding Gaussian noise to the training images allows us to sample from a distribution that covers our region of interest in the input space.

Conv-3 is trained for 10 epochs on the MNIST dataset. Conv-4 and Resnet-20 are trained for 20 and 40 epochs respectively on CIFAR-10. In addition, we also evaluate our defense with an ensemble consisting of a mixture of different model architectures: the MIX ensemble is made up of {Resnet-26, Resnet-20, Conv-4, Conv-3, Conv-3} and is trained on CIFAR-10. We set λ = 0.5 for DivTrain (Equation 9), as we empirically found this to offer a good trade-off between clean accuracy and adversarial robustness.

Attack Configuration: We evaluate the accuracy of the target models (TBase, TDiv) on clean examples as well as on adversarial examples crafted with the black-box attacks described in Section 2.3, using the surrogate S. Since S is an ensemble of 5 models, we use the average cross-entropy (CE) loss of all the models in S as the objective function to be maximized by the attacks. We briefly describe the parameters for each attack. FGSM takes a single step of magnitude ε in the direction of the gradient. R-FGSM applies FGSM after taking a single random step sampled from the uniform distribution U(−ε, +ε). For I-FGSM and MI-FGSM, we use k = 10 steps, each of size ε/10. The decay factor µ is set to 1.0 for MI-FGSM. We evaluate PGD-CW with a confidence parameter k = 50 and 30 iterations of optimization, using the hinge-loss function from (Carlini & Wagner, 2016). Results are reported for two perturbation sizes for all attacks: ε = 0.1/0.3 for MNIST and ε = 0.03/0.09 for CIFAR-10. We use the Cleverhans library's (Papernot et al., 2018) implementation of the attacks to evaluate our defense.

4.2. Results

Our results comparing the Baseline-Ensemble (TBase) and the Diverse-Ensemble (TDiv) are shown in Table 2, with the most effective attack highlighted in bold. TDiv has significantly higher classification accuracy on adversarial examples than TBase for all the attacks considered, showing that DivTrain improves the adversarial robustness of ensembles against black-box attacks. The classification accuracy on clean examples drops slightly as a consequence of adding the regularization term (GAL) to the cost function.

Several defenses have been proposed in recent literature against black-box attacks. Since DivTrain is a defense that is generally applicable to any ensemble of models, it can be combined with existing proposals to create a stronger defense. We show that this is possible by evaluating a combination of our method with Ensemble Adversarial Training (Tramèr et al., 2017), which is the current state-of-the-art method for black-box defense.
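To make the training recipe of Section 4.1 concrete, here is a hedged sketch of a single DivTrain-style update: noise augmentation with σ = ε/2 and a combined objective of the form (average CE) + λ·(gradient-alignment penalty) with λ = 0.5. Equation 9's exact GAL term is not reproduced in this excerpt, so the mean pairwise cosine similarity below is a stand-in consistent with the description, not the paper's precise loss:

```python
import itertools
import torch
import torch.nn.functional as F

LAM, EPS = 0.5, 0.3   # lambda = 0.5 (Section 4.1); eps = 0.3 is the MNIST setting

def div_train_step(models, optimizer, x, y):
    """One DivTrain-style update; `optimizer` must cover all members' parameters."""
    # Augment the batch with noisy copies, roughly N(0, eps/2) truncated at 2 sigma.
    noise = (EPS / 2) * torch.randn_like(x).clamp(-2, 2)
    x = torch.cat([x, (x + noise).clamp(0, 1)])
    y = torch.cat([y, y])

    x = x.clone().requires_grad_(True)
    losses = [F.cross_entropy(m(x), y) for m in models]
    # Input gradients with create_graph=True so the penalty is differentiable.
    grads = [torch.autograd.grad(l, x, create_graph=True)[0].flatten(1)
             for l in losses]
    align = torch.stack([F.cosine_similarity(ga, gb, dim=1).mean()
                         for ga, gb in itertools.combinations(grads, 2)]).mean()

    loss = torch.stack(losses).mean() + LAM * align  # average CE + lambda * penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that penalizing gradient alignment requires second-order gradients, which is why the input gradients are computed with create_graph=True.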
Figure 4. Histograms of coherence values for Conv-3, Conv-4 and Resnet-20, comparing TBase, TDiv, TEns and TEns+Div (x-axis: coherence). Models trained with Diversity Training (GAL regularization) have misaligned gradient vectors with lower coherence values.
Figure 5. Gradient-aligned adversarial subspace analysis performed on Conv-4 with CIFAR-10 (x-axis: number of orthogonal directions k; y-axis: probability). Plots show the probability of finding k orthogonal adversarial directions for 3 perturbation sizes ε = 0.03/0.06/0.09. TDiv has consistently lower probabilities of finding an adversarial direction than TBase, showing that DivTrain lowers the dimensionality of the Adv-SS of an ensemble.
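The gradient-aligned adversarial subspace (GAAS) construction used for Figure 5 comes from Tramèr et al. (2017) and is not spelled out in this excerpt. As a rough substitute for intuition only, the Adv-SS can be probed with random orthonormal directions; the sketch below measures the fraction of k such directions that are adversarial at scale ε. This is a simplification, not the GAAS method behind the figure:

```python
import torch

@torch.no_grad()
def adv_subspace_probe(model, x, y, eps, k=60):
    """Fraction of k random orthogonal directions that flip the label at scale eps.

    x: a single example of shape (1, C, H, W). A crude stand-in for the
    gradient-aligned construction of Tramer et al. (2017).
    """
    d = x.numel()
    q, _ = torch.linalg.qr(torch.randn(d, k))   # k orthonormal directions in R^d
    hits = 0
    for i in range(k):
        r = q[:, i].reshape_as(x)
        r = eps * r / r.abs().max()             # rescale to l-infinity magnitude eps
        pred = model((x + r).clamp(0, 1)).argmax(dim=1)
        hits += int(pred.item() != int(y))
    return hits / k
```

Rescaling each column by a scalar preserves the mutual orthogonality of the directions, so the probe still counts independent adversarial directions.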
[…] shown to be ineffective against adaptive attacks that are tailor-made to work well against specific defenses. A recent work (Athalye et al., 2018) showed that a number of these attacks rely on some form of gradient masking (Papernot et al., 2016b), whereby the defense makes the gradient information unavailable to the attacker, making it harder to craft adversarial examples. It also proposes techniques that tackle the problem of obfuscated gradients, breaking defenses which rely on gradient masking.

Prior works have considered the use of ensembles to improve adversarial robustness. (Strauss et al., 2018) use ensembles with the intuition that adversarial examples that lead to misclassification in one model may not fool other models in the ensemble. (Liu et al., 2017) inject noise into the layers of the neural network to prevent gradient-based attacks and ensemble predictions over random noise. While both of these works benefit from the adversarial robustness offered by ensembles as discussed in Section 1.2, the contribution of our work is to further improve this robustness by explicitly encouraging the models in the ensemble to have uncorrelated loss functions.

The idea of using cosine similarity of gradients to measure the correlation between loss functions has been explored in (Du et al., 2018), in the context of gauging the usefulness of auxiliary tasks for improving data efficiency. In contrast, we develop a metric based on cosine similarity that measures the similarity among a set of gradient vectors, with the objective of measuring the overlap of adversarial subspaces between the models in an ensemble.

6. Discussion

Our paper explores the use of diverse ensembles with uncorrelated loss functions to better defend against transfer-based attacks. We briefly discuss other problem settings where our idea can potentially be used.

Adversarial Attack Detection: The objective here is to detect inputs that have been adversarially tampered with and flag such inputs. A recent work (Bagnall et al., 2017) uses ensembles to detect adversarial inputs by training them to have high disagreement for inputs outside the training distribution. Since GAL minimizes the coherence of gradient vectors, it can potentially be used for the same purpose. One possible approach would be to use a modified version of DivTrain that uses GAL as the cost function (without the cross-entropy loss) for examples outside the training distribution, so that the consensus among the members of the ensemble would be low for out-of-distribution data; a sketch of such a disagreement score is given at the end of this section.

Better Black-Box Attacks: Ensembles can also be used to generate better black-box attacks. (Dong et al., 2017; Liu et al., 2016) use an ensemble of models as the surrogate, with the intuition that adversarial examples that fool multiple models tend to be more transferable. It would be interesting to study the transferability of adversarial examples generated on diverse ensembles to see if they enable better black-box attacks. We leave the evaluation of both of these ideas for future work.
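As one concrete, hypothetical instantiation of the detection idea above (our illustration, not a method evaluated in this paper), inputs can be scored by how much the ensemble members disagree:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def disagreement_score(models, x):
    """Detection score: mean KL divergence of each member from the ensemble mean.

    High disagreement suggests an input outside the training distribution
    (or an adversarial input that only some members misclassify).
    """
    probs = torch.stack([F.softmax(m(x), dim=1) for m in models])  # (N, B, classes)
    mean = probs.mean(dim=0, keepdim=True)
    kl = (probs * (probs.clamp_min(1e-12).log()
                   - mean.clamp_min(1e-12).log())).sum(-1)
    return kl.mean(dim=0)  # (B,): flag inputs whose score exceeds a threshold
```

A modified DivTrain of the kind described, trained with a GAL-only objective on out-of-distribution examples, would aim to push this score up precisely for such inputs.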
7. Conclusion

Transfer-based attacks present an important challenge for the secure deployment of deep neural networks in real-world applications. We explore the use of ensembles to defend against this class of attacks, with the intuition that it is harder to fool multiple models in an ensemble if they have uncorrelated loss functions. We propose a novel regularization term called Gradient Alignment Loss that helps us train models with uncorrelated loss functions by minimizing the coherence among their gradient vectors. We show that this can be used to train a diverse ensemble with improved adversarial robustness by reducing the amount of overlap in the shared adversarial subspace of the models. Furthermore, our proposal can be combined with existing methods to create a stronger defense. We believe that reducing the dimensionality of the adversarial subspace is important to creating a strong defense, and Diversity Training is a technique that can help achieve this goal.
Acknowledgements

We thank our colleagues from the Memory Systems Lab for their feedback. This work was supported by a gift from Intel. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPU used for this research.

References

Athalye, A., Carlini, N., and Wagner, D. A. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. CoRR, abs/1802.00420, 2018. URL http://arxiv.org/abs/1802.00420.

Bagnall, A., Bunescu, R. C., and Stewart, G. Training ensembles to detect adversarial examples. CoRR, abs/1712.04006, 2017. URL http://arxiv.org/abs/1712.04006.

Buckman, J., Roy, A., Raffel, C., and Goodfellow, I. Thermometer encoding: One hot way to resist adversarial examples. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S18Su--CW.

Carlini, N. and Wagner, D. A. Towards evaluating the robustness of neural networks. CoRR, abs/1608.04644, 2016. URL http://arxiv.org/abs/1608.04644.

Dhillon, G. S., Azizzadenesheli, K., Lipton, Z. C., Bernstein, J., Kossaifi, J., Khanna, A., and Anandkumar, A. Stochastic activation pruning for robust adversarial defense. CoRR, abs/1803.01442, 2018. URL http://arxiv.org/abs/1803.01442.

Dong, Y., Liao, F., Pang, T., Hu, X., and Zhu, J. Discovering adversarial examples with momentum. CoRR, abs/1710.06081, 2017. URL http://arxiv.org/abs/1710.06081.

Du, Y., Czarnecki, W. M., Jayakumar, S. M., Pascanu, R., and Lakshminarayanan, B. Adapting auxiliary losses using gradient similarity. arXiv preprint arXiv:1812.02224, 2018.

Goodfellow, I., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1412.6572.

Guo, C., Rana, M., Cissé, M., and van der Maaten, L. Countering adversarial images using input transformations. CoRR, abs/1711.00117, 2017. URL http://arxiv.org/abs/1711.00117.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.

Kurakin, A., Goodfellow, I. J., and Bengio, S. Adversarial machine learning at scale. CoRR, abs/1611.01236, 2016. URL http://arxiv.org/abs/1611.01236.

Liu, X., Cheng, M., Zhang, H., and Hsieh, C. Towards robust neural networks via random self-ensemble. CoRR, abs/1712.00673, 2017. URL http://arxiv.org/abs/1712.00673.

Liu, Y., Chen, X., Liu, C., and Song, D. Delving into transferable adversarial examples and black-box attacks. CoRR, abs/1611.02770, 2016. URL http://arxiv.org/abs/1611.02770.

Ma, X., Li, B., Wang, Y., Erfani, S. M., Wijewickrema, S. N. R., Houle, M. E., Schoenebeck, G., Song, D., and Bailey, J. Characterizing adversarial subspaces using local intrinsic dimensionality. CoRR, abs/1801.02613, 2018. URL http://arxiv.org/abs/1801.02613.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. CoRR, abs/1706.06083, 2017. URL http://arxiv.org/abs/1706.06083.

Na, T., Ko, J. H., and Mukhopadhyay, S. Cascade adversarial machine learning regularized with a unified embedding. arXiv preprint arXiv:1708.02582, 2017.

Nesterov, Y. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, May 2005. doi: 10.1007/s10107-004-0552-5. URL https://doi.org/10.1007/s10107-004-0552-5.

Papernot, N., McDaniel, P. D., Wu, X., Jha, S., and Swami, A. Distillation as a defense to adversarial perturbations against deep neural networks. CoRR, abs/1511.04508, 2015. URL http://arxiv.org/abs/1511.04508.

Papernot, N., McDaniel, P. D., and Goodfellow, I. J. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. CoRR, abs/1605.07277, 2016a. URL http://arxiv.org/abs/1605.07277.

[…]

Song, Y., Kim, T., Nowozin, S., Ermon, S., and Kushman, N. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. CoRR, abs/1710.10766, 2017. URL http://arxiv.org/abs/1710.10766.

Xie, C., Wang, J., Zhang, Z., Ren, Z., and Yuille, A. Mitigating adversarial effects through randomization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Sk9yuql0Z.