1 s2.0 S0893608023001259 Main

Neural Networks 164 (2023) 177–185
Contents lists available at ScienceDirect
Neural Networks
journal homepage: www.elsevier.com/locate/neunet
Learning defense transformations for counterattacking adversarial

examples
∗
Jincheng Li a,b,c ,1 , Shuhai Zhang a,c ,1 , Jiezhang Cao a,c ,1 , Mingkui Tan a ,
a
South China University of Technology, China
b
PengCheng Laboratory, China
c
Key Laboratory of Big Data and Intelligent Robot, Ministry of Education, China
article info a b s t r a c t
Article history: Deep neural networks (DNNs) are vulnerable to adversarial examples with small perturbations.
Received 6 June 2022 Adversarial defense thus has been an important means which improves the robustness of DNNs by
Received in revised form 10 January 2023 defending against adversarial examples. Existing defense methods focus on some specific types of
Accepted 7 March 2023
adversarial examples and may fail to defend well in real-world applications. In practice, we may face
Available online 24 March 2023
many types of attacks where the exact type of adversarial examples in real-world applications can
Keywords: be even unknown. In this paper, motivated by that adversarial examples are more likely to appear
Adversarial examples near the classification boundary and are vulnerable to some transformations, we study adversarial
Classification boundary examples from a new perspective that whether we can defend against adversarial examples by pulling
Defense transformations them back to the original clean distribution. We empirically verify the existence of defense affine
Affine transformations
transformations that restore adversarial examples. Relying on this, we learn defense transformations
to counterattack the adversarial examples by parameterizing the affine transformations and exploiting
the boundary information of DNNs. Extensive experiments on both toy and real-world data sets
demonstrate the effectiveness and generalization of our defense method. The code is avaliable at
https://github.com/SCUTjinchengli/DefenseTransformer.
© 2023 Published by Elsevier Ltd.
1. Introduction of adversarial examples are even unknown in practice. Most exist-

ing methods, however, consider some specific types of adversarial
Deep neural networks (DNNs) have achieved great success in examples, making them hard to defend against various adversar-
many tasks, such as image processing (Hartmann, Jeckel, Jelli, ial examples raised from real-world applications. For example,
Singh, & Drescher, 2021; Niu, Wu, , et al., 2021, 2023), text anal- some methods (Guo, Rana, et al., 2018; Raff, Sylvester, et al.,
ysis (Yang, Liu, Tao, & Liu, 2021) and speech recognition (Baevski, 2019) use a combination of random transformations to defend
Hsu, Conneau, & Auli, 2021). However, DNNs are vulnerable to against adversarial examples. However, due to the randomness,
adversarial examples (Szegedy, Zaremba, et al., 2014) that are it is difficult for these methods to transform various types of
generated from original examples by small perturbations. This adversarial examples back to the original distribution of clean
may result in undesirable and even disastrous consequences of data (See Fig. 1(b)). In addition, the defense methods relying on
DNNs in real-world applications, such as safety accidents in au- denoising (Liao, Liang, et al., 2018; Salman, Sun, Yang, Kapoor,
tonomous navigation systems (Li & Vorobeychik, 2014) and med- & Kolter, 2020; Xie, Wu, et al., 2019; Zhou et al., 2021) seek to
ical misdiagnosis (Ozbulak, Van Messem, & De Neve, 2019). As handle adversarial examples by removing the introduced noises
deep neural networks are becoming more and more prevalent, (See Fig. 1(c)). These methods, however, have difficulty han-
it is necessary to study the problem on how to defend against dling adversarial examples that are not crafted by adding noise
adversarial examples (Goodfellow, Shlens, & Szegedy, 2015). (e.g., stAdv Xiao, Zhu, et al., 2018) or that are crafted using a wide
The key challenge of adversarial defense is how to defend range of attacks (e.g., EOT Athalye, Engstrom, Ilyas and Kwok,
against various types of adversarial attacks, where the exact type 2018). In conclusion, considering the possible various types of
unknown attacks in practice, how to learn a general model to
∗ Corresponding author. defend against various adversarial examples remains an open yet
E-mail addresses: is.lijincheng@gmail.com (J. Li),
challenging problem.
mszhangshuhai@mail.scut.edu.cn (S. Zhang), caojiezhang@gmail.com (J. Cao), In this paper, we address this by exploiting the common
mingkuitan@scut.edu.cn (M. Tan). characteristics of different types of adversarial examples. Specif-
1 Equal contribution. ically, the adversarial examples are more likely to appear near
https://doi.org/10.1016/j.neunet.2023.03.008
0893-6080/© 2023 Published by Elsevier Ltd.
J. Li, S. Zhang, J. Cao et al. Neural Networks 164 (2023) 177–185
Fig. 1. Comparisons of different defense methods, where an adversarial example is generated to cross the classification boundary with a small perturbation. Random
transformation cannot guarantee to correct the adversarial example without supervision. Denoising methods recover the adversarial example in a small range. When
the attack range is larger than the denoise range, these methods may fail to defend. Our defense method learns a defense affine transformation (e.g., rotation) to
guide the adversarial example back to the clean data distribution.
the classification boundary (Zhang, Yu, et al., 2019). Meanwhile, 2.2. Defense methods with transformations
the adversarial examples are vulnerable to some transformations,
e.g., cropping and re-scaling (Guo et al., 2018). In this sense, Defense methods based on transformations seek to handle
we seek to find whether there exist transformations that can adversarial examples through transformations. These methods
correct adversarial examples. We empirically verify the existence can be classified into two main categories: random transforma-
of effective defense affine transformations. Inspired by this, we tion methods and denoising methods. In random transformation
propose to learn defense transformations to counterattack ad- methods, cropping-ensemble (Guo et al., 2018) and BaRT (Raff
versarial examples by parameterizing the affine transformations. et al., 2019) use a combination of random transformations with
Our learned transformations are able to turn plenty of adver- different strategies to defend against adversarial examples. In
sarial examples back to the clean distribution. In this way, our addition, the mixup inference method (Pang et al., 2020) in-
defense method is able to handle various adversarial examples terpolates a sample between an adversarial example and other
effectively, especially those that are not crafted by adding noise random clean samples to make predictions. However, due to
like stAdv (Xiao et al., 2018) and EOT (Athalye, Engstrom et al., the randomness, it is hard for these methods to guide various
2018). adversarial examples back to the original distribution of clean
We summarize the main contributions as follows: data (See Fig. 1(b)). In denoising methods, HGD (Liao et al., 2018)
• To the best of our knowledge, our work is the first to empir- defends against adversarial examples by using high-level repre-
ically verify the existence of effective defense affine transfor- sentation guided denoiser. Feature denoising (Xie et al., 2019)
mations that correct adversarial examples. applies denoising operations directly on features. CAFD (Zhou
• We propose to learn defense transformations for counterat- et al., 2021) removes adversarial noise in a class activation feature
tacking adversarial examples. By parameterizing affine trans- space with a self-supervised adversarial training mechanism.
formations, our defense method is able to guide the adversarial In addition, Meng and Chen (2017) filter out the adversarial
examples back to the original distribution of clean data. examples that would be distant from the clean manifold with
• Extensive experiments on toy and real-world data sets demon-
a trained detector and then reconstruct the remaining samples
strate that our method is able to handle different types of
with an autoencoder. Nie et al. (2022) purify the samples by
adversarial examples. Also, our method is robust to attacks
iteratively removing the adversarial noise from the samples. Wu
of different intensities and is able to defend against unseen
et al. (2021) propose the Hedge, which tries to pull the attacked
attacks over different models.
samples back to the clean distribution by iteratively adding a
2. Related work small perturbation to the samples but these perturbed samples
are still constrained into a small norm ball, resulting in the
In practice, deep neural networks (DNNs) are vulnerable to limited performance when performing the Hedge to the standard
adversarial examples (Goodfellow et al., 2015; Madry, Makelov, model. Zhang et al. (2021) construct a causal graph and find that
et al., 2018; Robey, Chamon, Pappas, Hassani, & Ribeiro, 2021; adversarial vulnerability is the focus of models on spurious corre-
Zang, Xie, Chen, & Yuan, 2021; Zhang, Tian, Li, Wang, & Tao, 2020). lation. They propose to align natural and adversarial distributions
To resolve this, many researchers have proposed two main classes by eliminating the difference in specific conditional associations
of defense methods, including adversarial training methods and between natural and adversarial distributions. Compared to these
defense methods with transformations. methods, our method learns an affine transformation network
to pull the samples back to the original clean distribution based
2.1. Adversarial training methods on the classification boundary of the classifier, which has the
merit of handling adversarial examples that are not crafted by
Adversarial training methods (Croce & Hein, 2021; Madry adding noise or that are crafted using a wide range of attacks (See
et al., 2018; Mehrabi, Javanmard, Rossi, Rao, & Mai, 2021; Robey Fig. 1(c) and (d)).
et al., 2021; Sriramanan, Addepalli, Baburaj, & Babu, 2021; Zhang
et al., 2019) seek to defend against adversarial examples by 3. Problem definition and motivations
training DNNs in an adversarial manner. For example, NuAT (Sri-
ramanan et al., 2021) introduces a Nuclear-Norm regularizer to 3.1. Notation
enforce function smoothing in the vicinity of data samples to im-
prove the robustness of the models. Moreover, Robey et al. (2021) We use calligraphic letters (e.g., X ) to denote a space, and use
propose a hybrid Langevin Monte Carlo technique to improve bold lower case letters (e.g., x) to denote a sample (e.g., an image).
the performance of adversarial training. Although these methods For a multi-class classification problem, we use h to denote a
improve the robustness of DNNs to adversarial examples, they classifier. We denote the training data set by S = {(xi , yi )}Ni=1 ,
often perform very limitedly on the clean data (Pang, Xu, & Zhu, where xi is a sample, and yi is the corresponding label. To develop
2020). our method, we first define the adversarial example as follows.
178
transformation is non-trivial to be obtained in practice. Existing

studies (Guo et al., 2018) apply random affine transformations
to restore adversarial examples. Nevertheless, the random affine
transformations are independent of every adversarial example.
To address this, we define the following defense problem by
parameterizing and training affine transformations.
3.3. Defense with affine transformations
Given a pre-trained classifier h, we seek to learn a transforma-

tion function T , such that the prediction of the transformed data
T (x̃) is close to the label y, by solving the following optimization
Fig. 2. An intuitive diagram of the vulnerability on adversarial examples.
problem:
N
1 ∑
min ℓ (h (Tw (x̃i )) , yi ) , (2)
w N
Definition 1. Given a pre-trained classifier h and a sample pair i=1
(x, y), we call x̃ an adversarial example if there exists a sample where ℓ is some classification loss function, e.g., cross-entropy
x̃∈X such that h(x̃) ̸ = y and d(x, x̃) ≤ ε , where d(·, ·) is some loss, and w is the parameter of T . The defense problem has two
distance metric, and ε > 0 is a small constant. key challenges. First, how to parameterize and learn a defense
affine transformation is non-trivial. Second, the types of attacks
In real-world applications, a DNN model can be attacked by
can be various in practice. Most existing defense methods using
various types of adversarial examples. Based on the types of
random transformations or denoising only consider some specific
attacks, they can be classified into two categories: (i) linear
types of adversarial examples and thus may fail to defend against
adversarial examples (e.g., FGSM Goodfellow et al., 2015 and
various adversarial examples.
PGD Madry et al., 2018) have a small perturbation on clean data;
(ii) Non-linear adversarial examples (e.g., stAdv Xiao et al., 2018
4. Learning defense transformations
and ADef Alaifari, Alberti, & Gauksson, 2019) deform clean data by
using non-linear functions. More details of this are put in Section
To find a defense affine transformation for counterattacking
A in Supplementary. In general, these adversarial examples are
various adversarial examples, we apply a spatial transformer
more likely to appear near the classification boundary (Zhang
network (STN) (Jaderberg, Simonyan, Zisserman, & kavukcuoglu,
et al., 2019). In this sense, they may be vulnerable to some
2015) to parameterize the affine transformations and develop
transformations that turn them back to the distribution of clean our defense method. The defense transformations aims to map
data. Motivated by this, we seek to investigate the vulnerability adversarial examples back to the distribution of clean data.
of adversarial examples. The overall scheme of our method is shown in Fig. 3. To
be specific, the defense transformation network contains a U-
3.2. Vulnerability of adversarial examples Net (Ronneberger, Fischer, & Brox, 2015) and an STN, Here, the
U-Net uses a shortcut learning to perturb adversarial examples
In practice, adversarial examples are vulnerable to some trans- in a proper position, and helps STN to parameterize affine trans-
formations, e.g., cropping and re-scaling (Guo et al., 2018). With formations. We validate this in Section 6.5. Formally, given a
the help of this property, we investigate the vulnerability of ad- coordinate (u, v ) in the grid of an adversarial example x̃, then the
versarial examples to affine transformations (i.e., rotation, scaling corresponding coordinate (u′ , v ′ ) of a transformed sample T (x̃)
and translation). Formally, the affine transformation (denoted by can be obtained by a parameterized affine matrix Θ∈R2×3 , i.e.,
f ) can be defined as follows.
] [u]
θ θ12 θ13
[ ′] [
u
v ,
] [u]
s · cos α −s · sin α δu = 11 (3)
[ ′] [
u
= v , (1) v′ θ21 θ22 θ23
v ′
s · sin α s · cos α δv 1
1
where θij is the (i, j)th element of the affine matrix Θ . In practice,
where [u, v] and [u , v ] are the coordinates in x̃ and f (x̃), re-
′ ′
the affine matrix Θ can be learned by a localization network
spectively, α is a rotation angle, s is a scale factor and δu , δv g (Jaderberg et al., 2015), i.e., Θ = g(x̃). Then, we use Bilinear in-
are translation factors. Based on Eq. (1), we seek to investigate terpolation to obtain a transformed sample T (x̃). Based on Eq. (3),
whether an effective affine transformation exists for defense. To we optimize an objective function to defend against different
this end, we provide an intuitive example of the vulnerability on types of adversarial examples.
adversarial examples.
Without loss of generality, as shown in Fig. 2, we define a 4.1. Learning objective function
patch x̃ of an adversarial example and a kernel κ, where the
blue pixel is 0.5 and the black pixel is 1. From Fig. 2, we have During the training process, we consider training a classifier h
x̃∗κ=2>0, where ∗ is a convolution operation. When rotating based on the training data set S ={(xi , yi )}Ni=1 .2 Then, we generate
the patch x̃ by 45◦ , the convolution result T (x̃)∗κ=0. We have a set of adversarial examples A = {{(x̃i , yi )}m
(j)N
j=1 }i=1 by using m
the same results for other affine transformations. As a result, types of attack methods. The goal of this paper is to learn defense
the patch can be changed from active to inactive after some transformations by minimizing the following objective function:
affine transformations. For a real-world application, we verify n m
the vulnerability of an adversarial example with different affine 1 ∑ ∑ [ ( ( ( (j) )) )] 1
min ℓ h Tw x̃i yi + λ∥w∥2 , (4)
transformations in Section 5.1. w mn 2
i=1 j=1
Above analyses show that adversarial examples are vulnerable
to affine transformations. Thus, we would seek for an effective
affine transformation for defense. However, such an effective 2 We provide the details in the supplemental materials.
179
Fig. 3. The scheme of the proposed defense method. The defense transformation network consists of a U-Net and a spatial transformer network (STN). During
optimization, the defense transformation seeks to learn an affine transformation such that the input adversarial examples can be turned back to the distribution of
the clean data.
Fig. 4. The accuracy of the classifier on clean data and adversarial examples after performing random affine transformations with different maximum numbers (Left:
PGD-10; Right: stAdv) on CIFAR-10. Without any affine transformation, the accuracy for clean data is 93.51%.
Algorithm 1 Training method of the defense transformations. et al., 2015). On real-world data sets, we generate adversarial
examples by linear and non-linear attack methods. Specifically,
Input: Training data {(xi , yi )}N
i =1 ; pre-trained classifier h; the type
linear attack methods include FGSM (Goodfellow et al., 2015), I-
number of attack methods m; batch size n; learning rate γ ;
FGSM (Kurakin, Goodfellow, & Bengio, 2017), PGD (Madry et al.,
hyper-parameter λ
Output: Learned defense transformation network T 2018), DeepFool (Moosavi-Dezfooli, Fawzi, & Frossard, 2016) and
1: Apply all types of attack methods to generate a set of adversarial CW-L2 (Pang, Du, Dong, & Zhu, 2018a). Non-linear attack meth-
(j)
examples A = {{(x̃i , yi )}m N ods (e.g., stAdv Xiao et al., 2018 and ADef Alaifari et al., 2019)
j=1 }i=1
2: Initiate the parameter w of the defense transformations deform images without introducing noises, so they are unsuitable
3: while not converged do for HGD (Liao et al., 2018). For a fair comparison, we use only
(j)
4: Sample a mini-batch {{(x̃i , yi )}m n
j=1 }i=1 from the set of adversarial
linear adversarial examples to train the defense transformation
examples networks, where these adversarial ones target to attack only the
5: Freeze the classifier h and update the parameter w by gradient classification model (e.g., ResNet). In addition, we use a small
descent: { [ ∑ [ ( ( ( )) )]] } number of attacks (i.e., only FGSM) to generate the training set
(j)
w←w−γ ∇w mn 1
i,j ℓ h Tw x̃i , yi +λw on ImageNet. More results on the training set with the non-linear
6: end while attack are put in Section F in Supplementary. Note that in our
experiments, most attack methods only use the classifier to gen-
erate adversarial examples and do not acquire any information
about the defense transformation network, except for Tables 4, 6
where ℓ can be any classification loss function, w is the parameter and 10, where both the defense transformation network and the
of T , λ is a hyper-parameter, n is the batch size and n ≪ N. Here, classifier are available to the attacker.
we use the cross-entropy loss as the loss function and the regu- Implementation details. For the defense transformation net-
larizer to overcome possible overfitting issues. Based on Eq. (4), work, the architectures of the U-Net and the spatial transformer
some information about the boundary hidden in the classifier can network follow Liao et al. (2018) and Jaderberg et al. (2015),
guide the training of T . The detailed training method is shown in respectively. In the training process, we adopt an Adam opti-
Algorithm 1. mizer (Kingma & Ba, 2015) with β1 = 0.5 and β2 = 0.999,
and train the defense transformation network with 100 epochs,
5. Experiments a learning rate of 0.001, a hyper-parameter λ = 0 and the
batch size of 128. In addition, we use the pre-trained ResNet-
Compared methods. For a fair comparison, we compare our 56 (He, Zhang, Ren, & Sun, 2016) as the classifier. More Details on
method with several state-of-the-art defense methods relying experiment settings are provided in Section B in Supplementary.
on transformations, including random transformation, cropping-
ensemble (Guo et al., 2018; Raff et al., 2019), high-level rep- 5.1. Existence of defense affine transformations
resentation guided denoiser (HGD) (Liao et al., 2018), feature
denoising (Xie et al., 2019), mixup inference (Pang et al., 2020), To demonstrate the existence of defense affine transforma-
and CAFD (Zhou et al., 2021). tions, we conduct experiments on CIFAR-10 with linear adver-
Data sets. We conduct experiments on a toy data set and sarial examples by PGD in Fig. 4, while other types of adversar-
real-world data sets, i.e., CIFAR-10, CIFAR-100 (Krizhevsky, Hin- ial examples are put in the supplementary. In this experiment,
ton, et al., 2009) and ImageNet (Deng, Dong, et al., 2009). In we apply different numbers of random affine transformations
Fig. 6(a), we generate adversarial examples by FGSM (Goodfellow (i.e., 1, 10, 102 , 103 ) to both the clean and adversarial examples.
180
Fig. 5. Comparisons of different types of data (top: clean data; middle: PGD-10; bottom: stAdv) under different magnitudes of affine transformation (left: rotation;
middle: translation; right: scaling).
Fig. 6. Comparisons of defending against adversarial examples (AE) on the toy data set. (a) Given a data distribution with two classes, we generate the AE of the
class-0 using FGSM. (b) Random transformation can partially recover AE. (c) The denoising method can defend against several AE with a small range. (d) The defense
transformation maps all AE back to the distribution of the correct class.
For each sample, we will stop transforming if a defense affine random transformation fails to turn adversarial examples back
transformation is found or the number of transformations reaches to the original distribution; the denoising method cannot restore
the given maximum number. adversarial examples due to the range of denoise. Thus, it is
From Fig. 4, we draw two interesting observations. First, when difficult for random transformation and the denoising method to
conducting only one random affine transformation, only a small defend against adversarial examples well.
number of adversarial examples are corrected and the clean accu-
racy decreases a lot. This means that conducting only one random 5.3. Results on real-world data sets
transformation significantly damages the performance of models.
Second, with the increasing number of random transformations, We compare our defense method with other defense baselines
the accuracies of models on clean data and adversarial examples over ResNet-56 and VGG-16-BN on CIFAR-10 and CIFAR-100 (See
go better and finally perform very well. This implies that at least Table 1). We further verify the effectiveness of our method over
one defense affine transformation exists for defense while it is ResNet-50, ResNet-101 and DenseNet-161 on ImageNet (See Ta-
difficult to find an effective affine transformation for defense. ble 2). Note that these adversarial examples are crafted with only
We provide more experiments about this under other attacks the classification model. Moreover, we provide the results over
(e.g., FGSM, DeepFool) in Section D in Supplementary. VGG-16-BN on CIFAR-10 in Section E in Supplementary.
We further investigate the existence of defense transforma-
tions with a transformation (i.e., rotation, translation or scaling) 5.3.1. Results on CIFAR-10
under different magnitudes. In Fig. 5, at certain magnitudes of From Table 1, our defense method performs the best in han-
transformation, the clean and adversarial examples can be cor- dling different types of adversarial examples (AE). It implies
rectly classified, which further supports the existence of affine that our proposed defense transformation is able to improve
transformations. More results about this are provided in Section the model robustness against both linear and non-linear AE. In
D in Supplementary. contrast, defense methods with random transformations perform
limitedly due to the randomness. For denoising methods, HGD,
5.2. Demonstration on toy data set feature denoising and CAFD aim to remove noises in the linear
AE or their features and hence cannot perform well on non-linear
To investigate the effectiveness of our method, we plot the AE.
value surface of the pre-trained classifier in Fig. 6 on the toy data Note that our method has greater improvement than random
set. Here, the value surface depicts the output of the classifier. To affine transformations on clean data. When we apply a ran-
be specific, the classifier divides two-class training samples into dom affine transformation to clean data, the accuracy decreases
two regions (orange and green). from 93.51% to 36.98%. In this sense, the incorrect affine trans-
From Fig. 6, our proposed defense transformation is able to formations seriously degrade the accuracy of clean data. Such
transform all adversarial examples back to the green region, degraded performance can be alleviated by our method, although
which demonstrates the effectiveness of our method. In contrast, its performance is a bit worse than HGD.
181
Table 1
Performance comparisons (Top-1 accuracy (%)) of defending against various adversarial examples over ResNet-56 on CIFAR-10 and CIFAR-100. ‘‘AE’’ denotes the
adversarial examples.
Data sets Methods Clean Linear AE Non-linear AE
FGSM I-FGSM PGD-10 PGD-100 DeepFool CW-L2 stAdv ADef
Natural training 93.51 21.03 0.00 0.00 0.00 5.08 3.66 0.39 4.25
Random transformation 36.98 25.21 18.08 23.39 22.11 37.10 31.06 32.88 33.73
Cropping-ensemble 86.31 32.56 0.46 6.47 5.12 72.89 50.08 42.08 44.04
Mixup inference 81.47 42.38 16.58 28.40 25.89 79.51 66.31 64.94 74.23
CIFAR-10
Feature denoising 82.32 78.68 79.13 80.48 80.77 82.00 80.56 80.06 79.90
BaRT 81.48 79.53 82.15 83.35 82.91 82.27 81.01 81.34 81.21
HGD 93.23 90.24 87.79 89.27 90.51 90.21 90.21 44.71 82.57
CAFD 90.10 61.85 86.28 88.35 89.34 90.26 90.70 15.94 65.87
Ours 90.41 92.34 89.97 91.06 90.59 90.54 90.44 85.59 86.94
Natural training 71.08a 9.42 0.07 0.09 0.03 13.95 3.13 0.68 9.88
Random transformation 22.42 10.28 6.58 8.76 8.78 20.55 16.16 17.44 17.86
Cropping-ensemble 57.47 13.52 0.21 1.64 0.99 48.36 33.45 27.94 29.12
Mixup inference 58.42 22.50 14.06 22.48 19.65 58.11 50.59 47.08 53.95
CIFAR-100
Feature denoising 51.51 48.19 47.92 49.45 49.59 51.44 49.48 50.19 50.23
BaRT 50.75 47.08 47.80 49.97 49.93 51.18 49.50 50.86 50.31
HGD 70.48 59.66 46.61 57.00 58.60 61.22 65.03 25.94 55.73
CAFD 65.46 29.94 53.00 59.27 60.67 65.82 66.73 6.74 49.01
Ours 67.22 75.54a 63.49 67.33 66.55 66.80 65.34 59.01 61.76
a
We will discuss the interesting results separately in Section 5.3.
Table 2 Table 3
Comparisons (Top-1 accuracy (%)) of defending against adversarial examples over Comparisons (Top-1 accuracy (%)) of defending against unknown adversarial
different classifiers on ImageNet. examples over ResNet-56 on CIFAR-10/100.
Classifier Method Clean PGD-100 stAdv Data sets Method Functional AE EOT One pixel attack
HGD 73.58 73.86 59.45 HGD 9.00 20.72 53.59
ResNet-50
Ours 73.59 74.71 61.15 CIFAR-10 CAFD 49.70 45.01 41.93
HGD 74.59 74.26 60.33 Ours 63.50 50.71 64.83
ResNet-101
Ours 75.37 74.32 60.91 HGD 1.20 5.26 26.48
HGD 74.81 74.53 59.77 CIFAR-100 CAFD 9.92 21.69 15.73
DenseNet-161 Ours 16.40 18.57 37.90
Ours 74.94 74.82 61.17
Table 4
Comparisons (Top-1 accuracy (%)) of defending against adaptive attacks over
5.3.2. Results on CIFAR-100 ResNet-56 on CIFAR-10/100.
In Table 1, our method greatly outperforms other defense Data sets Method BPDA BPDA+EOT
methods, which suggests our defense method is able to learn HGD 83.96 83.98
defense affine transformations to handle various AE. Moreover, CIFAR-10 CAFD 84.97 84.88
we observe an interesting phenomenon. For FGSM, the adver- Ours 87.46 87.46
sarial accuracy even outperforms the clean accuracy without HGD 55.27 54.83
transformation (75.54% v.s. 71.08%). One possible reason is that CIFAR-100 CAFD 51.14 50.83
Ours 61.86 61.77
some misclassified clean samples distribute in the set of AE of
FGSM, and our method transforms them into the distribution of
the clean data.
6.1.1. Generalization on AE from the unknown attacks
We compare our defense method with baselines on more
5.3.3. Results on ImageNet adversarial examples (AE) from unknown attack methods, i.e.,
To further investigate the effectiveness of the proposed functional AE (Laidlaw & Feizi, 2019), expectation over trans-
method, we compare our method with HGD over different clas- formation (EOT) (Athalye, Engstrom et al., 2018) and one pixel
sifiers (i.e., ResNet-50, ResNet-101 and DenseNet-161) on Ima- attack (Su, Vargas, & Sakurai, 2019). In Table 3, our method
geNet. In Table 2, our method achieves better performance than outperforms HGD by a large margin and also achieves superior or
HGD on linear and non-linear AE. Note that although we only comparable performance compared with CAFD. It indicates that
consider FGSM attack to generate the training set, our method our defense method is able to generalize to AE from the unknown
is still able to defend against the unknown AE, i.e., PGD-100 and attackers.
stAdv.
6.1.2. Generalization on AE from adaptive attacks
6. Further experiments We provide two adaptive attacks to further evaluate the ef-
fectiveness of our method, including BPDA following Athalye,
Carlini and Wagner (2018) and BPDA+EOT following Tramer,
6.1. Generalization of proposed defense method
Carlini, Brendel, and Madry (2020). Here, the attacker knows
the defense (e.g., denoiser and transformation network) and uses
In practice, we often verify the generalization of a pre-trained BPDA to bypass it by using an identity function to approximate
model on an unseen data set, e.g., test set. Similar to this, we the defense transformation network for obtaining its gradients to
investigate the generalization of our defense method on unknown craft adversarial example. This is different from the scenario in
adversarial examples (AE) on CIFAR-10/100. Section 5.3, where the attacker is only aware of the classifier. In
182
Table 6
Comparisons (Top-1 accuracy (%)) of defending against various adversarial
examples (under white-box attacks) over ResNet-56 on CIFAR-100.
Method PGD-100 (ϵ= 255
4
) PGD-100 (ϵ= 255
8
) stAdv ADef
HGD 0.06 0.03 0.69 9.52
Ours 0.01 0.03 1.76 11.06
when evaluating the pre-trained model on different perturba-

tion intensities. We report the results in Fig. 7 and put more
results on CIFAR-10 in Section G in Supplementary. From Fig. 7,
our proposed defense method outperforms HGD under different
perturbation intensities particularly when the perturbation inten-
sities become larger, verifying that our proposed defense method
Fig. 7. Robustness w.r.t. perturbation intensities ϵ . We test our defense method is more robust than HGD. In contrast, the adversarial accuracy of
and HGD on PGD-10 with different perturbation intensities over ResNet-56 on HGD drops sharply with larger perturbation intensities. One pos-
CIFAR-100. sible reason is that the perturbation intensity (i.e., attack range)
may be beyond the denoising range of HGD (see Fig. 1).
Table 5
Performance comparisons (Top-1 accuracy (%)) of defending against the adver-
6.2.2. Robustness to AE under white-box attacks
sarial examples generated by different models on CIFAR-10. ‘‘AE’’ denotes the
adversarial examples. In this experiment, we discuss the robustness of our defense
Process Methods Linear AE Non-linear AE method under white-box attacks. Here, the white-box attacks are
aware of the exact architectures of the learned defense trans-
FGSM PGD-10 stAdv ADef
formation network and the classifier. Specifically, we apply the
HGD 90.24 89.27 44.71 82.57
ResNet-56 → ResNet-56 PGD-100 attack following Athalye and Carlini (2018) and non-
Ours 92.34 91.06 85.59 86.94
linear attacks (i.e., stAdv and ADef) to evaluate the robustness of
HGD 65.06 68.96 31.62 38.56
ResNet-56 → VGG-16-BN
Ours 76.74 81.01 80.64 75.56
the defense methods. From Table 6, HGD and our method are
vulnerable to AE when the defense models are completely leaked
HGD 90.21 90.38 73.70 68.89
VGG-16-BN → VGG-16-BN
Ours 92.74 91.49 81.63 76.94
to the attackers. Nevertheless, our method is more robust than
HGD since our method is able to defend against more adversarial
HGD 88.82 88.63 75.64 84.11
VGG-16-BN → ResNet-56
Ours 90.69 88.90 82.20 84.70 examples. It is challenging to defend against white-box attacks.
In the future, it is worth improving the robustness of our method
under the white-box attacks.
this experiment, we evaluate our method on CIFAR-10 and CIFAR- 6.3. Denoising speed of proposed defense method
100 over ResNet-56 compared to CAFD and HGD. As shown in
Table 4, we observe that our method outperforms two baselines In this experiment, we provide the computational efficiency
by about ∼3% on CIFAR-10 and +6% on CIFAR-100 in terms of the proposed method compared to several baselines (e.g., HGD
of robustness accuracy, demonstrating the effectiveness of our and CAFD). We show the denoising time (millisecond per sample)
defense mechanism against adaptive attacks. averaged over 100 FGSM adversarial examples on CIFAR-10. In
Table 9, our method is significantly faster than CAFD and achieves
6.1.3. Transferability on AE across different classifiers a comparable denoising speed to HGD.
We further study the transferability of our proposed defense
transformation network, i.e., we apply the pre-trained defense 6.4. Combining adversarially-trained models with proposed defense
transformation network of model A to model B, which we denote method
as ‘‘A → B’’. We compare our defense method with HGD on both
linear and non-linear AE across ResNet-56 and VGG-16-BN. As In this experiment, we investigate the robustness of com-
shown in Table 5, the performance of HGD is much degraded, bining the adversarially-trained models with our proposed de-
especially for non-linear AE. In contrast, our defense method fense method. To this end, we replace the classifier with the
achieves much better generalization performance than HGD on adversarial-trained model, i.e., ResNet-50, which is downloaded
both linear and non-linear AE, implying the better transferability from AutoAttack Github.3 We do not finetune it and evaluate
of our learned defense transformation network. the robustness under FGSM and AutoAttack. Here, we generate
adversarial examples over both the adversarial-trained classifier
6.2. Robustness of proposed defense method and the defense transformation model, namely white-box attack.
In Table 10, the adversarial-trained model can defend against
We further evaluate the robustness of our method on adver- white-box attacks but at the cost of severely degrading the clean
sarial examples (AE) with larger perturbation or generated by accuracy. We believe that the drop in the clean accuracy is an
some white-box attack methods. outcome of the trade-off between clean accuracy and robustness
as observed in previous works (Zhang et al., 2019).
6.2.1. Robustness to AE with larger perturbation Moreover, we find that, when the adversarially-trained model
In this experiment, we verify the robustness of our defense is equipped with the defense transformation network, the at-
method by increasing the perturbation intensities of AE on the tacker will exploit the transformation network to craft powerful
test set of CIFAR-100 by PGD-10. We evaluate the robustness adversarial examples to attack the whole model (i.e., defense
against these adversarial ones under different perturbation in- transformation network + classifier) due to the vulnerability of
tensities at the inference stage instead of the training stage. In
this way, the natural accuracy is fixed (i.e., ϵ = 0 in Fig. 7) 3 https://github.com/fra31/auto-attack.
183
Table 7
Performance comparisons (Top-1 accuracy (%)) of defending against various adversarial examples over ResNet-56 with different network architectures on CIFAR-10.
‘‘AE’’ denotes adversarial examples.
Methods Architectures Clean Linear AE Non-linear AE
FGSM I-FGSM PGD-10 PGD-100 DeepFool CW-L2 stAdv ADef
STN 78.81 57.18 54.35 64.18 64.49 76.79 70.53 73.76 52.89
Ours Auto-Encoder + STN 70.06 63.82 62.25 65.47 65.09 69.43 66.44 67.10 67.82
U-Net + STN 90.41 92.34 89.97 91.06 90.59 90.54 90.44 85.59 86.94
Table 8 defense models, and report their adversarial accuracy. From Ta-
Effect of the parameter λ of the regularizer. We report the adversarial accuracy ble 8, we obtain the best performance when λ = 0. Then, we set
over WideResNet-34-10 on CIFAR-10.
the same optimal hyperparameter in all experiments, including
λ 0 0.001 0.01 0.1 1.0
different data sets (e.g., CIFAR100 and ImageNet) and different
Acc. (%) 88.91 88.71 88.39 85.90 80.39
classifiers (e.g., ResNet-50, ResNet-101, and DenseNet-161), and
achieve significant results.
Table 9
Comparisons of computational efficiency of denoising speed (millisecond/per 7. Conclusion
image) over ResNet-56 on CIFAR-10.
Data sets Method Speed
In this paper, we have investigated the vulnerability of ad-
HGD 2.72 versarial examples and verified that there exist defense affine
CIFAR-10 CAFD 47.80
transformations to turn them back to the distribution of clean
Ours 3.49
data. Inspired by this, we learn defense transformations to defend
against different types of adversarial examples. The proposed
Table 10 defense method is robust to the attacks of different intensities
Performance comparisons (Top-1 accuracy (%)) of defending against adversarial
and can generalize to defend against unseen attacks over dif-
examples (under white-box attacks) using a pre-trained ResNet-50 with the
adversarial training strategy on CIFAR-10, where ‘‘nat.’’ and ‘‘adv.’’ denotes the ferent DNNs. Extensive experiments on both toy and real-world
naturally-trained model and the adversarially-trained model, respectively. data sets demonstrate the effectiveness and generalization of
Method Clean FGSM AutoAttack our method. In the future, we can apply some detection meth-
Ours w/ nat. 90.41 15.83 0.00 ods (Fang et al., 2022; Pang, Du, Dong, & Zhu, 2018b) to identify
Ours w/ adv. 82.05 69.52 23.30 whether the test sample is an adversarial one. In this way, we
only perform the defense transformation on the adversarial sam-
ples and do not necessarily process the natural samples, which is
a good way to avoid the compromise between natural accuracy
the transformation network under the white-box attack, result-
and adversarial accuracy.
ing into a compromised adversarial accuracy 23.30%. Although
this dilemma can be improved by training a robust joint model
Declaration of competing interest
(i.e., defense transformation network + classifier) via adversar-
ial training, it is very time-consuming (Wong, Rice, & Kolter,
The authors declare that they have no known competing finan-
2020) and significantly degrades the clean accuracy. In fact, many
cial interests or personal relationships that could have appeared
defense methods based on transformation such as HGD are chal-
to influence the work reported in this paper.
lenging to defend against white-box attacks (see Table 6), which
prompts us to strengthen the defense against such attacks in the
Data availability
future.
all data are publicly available
6.5. Ablation study
Acknowledgments
Effectiveness of U-Net. We discuss more motivations of using
U-Net as a feature representation module. The original spatial This work was partially supported by the National Key R&D
transformer network (STN) is very shallow and thus is difficult Program of China (No. 2020AAA0106900), National Natural Sci-
to learn defense affine transformations for every input data. To ence Foundation of China (NSFC) 62072190, Program for Guang-
address this, we are motivated to map the original data to a suit- dong Introducing Innovative and Entrepreneurial Teams
able position in the feature space. To this end, one can apply some 2017ZT07X183.
representation learning methods (e.g., Auto-Encoder and U-Net)
to learn feature representations. We compare these architectures Appendix A. Supplementary data
in Table 7. The defense transformation network using U-Net and
STN achieves the best performance. Particularly, our proposed Supplementary material related to this article can be found
defense model achieves much superior performance compared to online at https://doi.org/10.1016/j.neunet.2023.03.008.
the single STN method (↑ 23.88% on average). More discussion of
this are put in Section C in Supplementary. References
Impact of hyper-parameter λ. In this experiment, We briefly
investigate the impact of the hyper-parameter λ of the regu- Alaifari, R., Alberti, G. S., & Gauksson, T. (2019). ADef: an iterative algorithm to
larizer in Eq. (4). Specifically, we conduct this ablation study construct adversarial deformations. In ICLR.
over WideResNet-34-10 on CIFAR-10 in Table 8. The validation Athalye, A., & Carlini, N. (2018). On the robustness of the cvpr 2018 white-box
adversarial example defenses. arXiv.
set is the test set of CIFAR-10 and the range of the hyperpa- Athalye, A., Carlini, N., & Wagner, D. (2018). Obfuscated gradients give a false
rameter is 0, 0.001, 0.01, 0.1, and 1.0. We generate the corre- sense of security: Circumventing defenses to adversarial examples. In ICML
sponding adversarial examples (PGD-20), compare five different (pp. 274–283). PMLR.
184
Athalye, A., Engstrom, L., Ilyas, A., & Kwok, K. (2018). Synthesizing robust Ozbulak, U., Van Messem, A., & De Neve, W. (2019). Impact of adversarial
adversarial examples. In ICML. examples on deep learning models for biomedical image segmentation. In
Baevski, A., Hsu, W., Conneau, A., & Auli, M. (2021). Unsupervised speech MICCAI.
recognition. In NeurIPS. Pang, T., Du, C., Dong, Y., & Zhu, J. (2018a). Towards robust detection of
Croce, F., & Hein, M. (2021). Mind the box: l1-APGD for sparse adversarial attacks adversarial examples. In NeurIPS.
on image classifiers. In ICML. Pang, T., Du, C., Dong, Y., & Zhu, J. (2018b). Towards robust detection of
Deng, J., Dong, W., et al. (2009). Imagenet: A large-scale hierarchical image adversarial examples. Advances in Neural Information Processing Systems, 31.
database. In CVPR. Pang, T., Xu, K., & Zhu, J. (2020). Mixup inference: Better exploiting mixup to
Fang, Z., Li, Y., Lu, J., Dong, J., Han, B., & Liu, F. (2022). Is out-of-distribution defend adversarial attacks. In ICLR.
detection learnable? arXiv preprint arXiv:2210.14707. Raff, E., Sylvester, J., et al. (2019). Barrage of random transforms for adversarially
Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing robust defense. In CVPR.
adversarial examples. In ICLR. Robey, A., Chamon, L. F. O., Pappas, G. J., Hassani, H., & Ribeiro, A. (2021).
Guo, C., Rana, M., et al. (2018). Countering adversarial images using input Adversarial robustness with semi-infinite constrained learning.
transformations. In ICLR. Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks
Hartmann, R., Jeckel, H., Jelli, E., Singh, P. K., & Drescher, K. (2021). Quantitative for biomedical image segmentation. In MICCAI.
image analysis of microbial communities with biofilmq. Nature Microbiology. Salman, H., Sun, M., Yang, G., Kapoor, A., & Kolter, J. Z. (2020). Denoised
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image smoothing: A provable defense for pretrained classifiers. In NeurIPS (pp.
recognition. In CVPR. 21945–21957).
Jaderberg, M., Simonyan, K., Zisserman, A., & kavukcuoglu, k. (2015). Spatial Sriramanan, G., Addepalli, S., Baburaj, A., & Babu, R. V. (2021). Towards efficient
transformer networks. In NeurIPS. and effective adversarial training. In NeurIPS.
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In Su, J., Vargas, D. V., & Sakurai, K. (2019). One pixel attack for fooling deep neural
ICLR. networks. IEEE Transactions on Evolutionary Computation.
Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from Szegedy, C., Zaremba, W., et al. (2014). Intriguing properties of neural networks.
tiny images (Master’s Thesis), Department of Computer Science, University In ICLR.
of Toronto. Tramer, F., Carlini, N., Brendel, W., & Madry, A. (2020). On adaptive attacks to
Kurakin, A., Goodfellow, I., & Bengio, S. (2017). Adversarial examples in the adversarial example defenses. In NeurIPS (pp. 1633–1645).
physical world. In ICLR. Wong, E., Rice, L., & Kolter, J. Z. (2020). Fast is better than free: Revisiting
Laidlaw, C., & Feizi, S. (2019). Functional adversarial attacks. In NeurIPS. adversarial training. In International conference on learning representations.
Li, B., & Vorobeychik, Y. (2014). Feature cross-substitution in adversarial Wu, B., Pan, H., Shen, L., Gu, J., Zhao, S., Li, Z., et al. (2021). Attacking adversarial
classification. In NeurIPS. attacks as a defense. arXiv.
Liao, F., Liang, M., et al. (2018). Defense against adversarial attacks using Xiao, C., Zhu, J.-Y., et al. (2018). Spatially transformed adversarial examples. In
high-level representation guided denoiser. In CVPR. ICLR.
Madry, A., Makelov, A., et al. (2018). Towards deep learning models resistant to Xie, C., Wu, Y., et al. (2019). Feature denoising for improving adversarial
adversarial attacks. In ICLR. robustness. In CVPR.
Mehrabi, M., Javanmard, A., Rossi, R. A., Rao, A. B., & Mai, T. (2021). Fundamental Yang, X., Liu, W., Tao, D., & Liu, W. (2021). BESA: BERT-based simulated annealing
tradeoffs in distributionally adversarial training. In ICML. for adversarial text attacks. In IJCAI.
Meng, D., & Chen, H. (2017). Magnet: a two-pronged defense against adversarial Zang, X., Xie, Y., Chen, J., & Yuan, B. (2021). Graph universal adversarial attacks:
examples. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and A few bad actors ruin graph learning models. In IJCAI.
Communications Security (pp. 135–147). Zhang, Y., Gong, M., Liu, T., Niu, G., Tian, X., Han, B., et al. (2021). Adversarial
Moosavi-Dezfooli, S.-M., Fawzi, A., & Frossard, P. (2016). Deepfool: A simple and robustness through the lens of causality. arXiv preprint arXiv:2106.06196.
accurate method to fool deep neural networks. In CVPR. Zhang, Y., Tian, X., Li, Y., Wang, X., & Tao, D. (2020). Principal component
Nie, W., Guo, B., Huang, Y., Xiao, C., Vahdat, A., & Anandkumar, A. (2022). adversarial example. IEEE Transactions on Image Processing, 29, 4804–4815.
Diffusion models for adversarial purification. arXiv. Zhang, H., Yu, Y., et al. (2019). Theoretically principled trade-off between
Niu, S., Wu, J., et al. (2021). Adaxpert: adapting neural architecture for growing robustness and accuracy. In ICML.
data. In International Conference on Machine Learning (pp. 8184–8194). PMLR. Zhou, D., Wang, N., Peng, C., Gao, X., Wang, X., Yu, J., et al. (2021). Removing
Niu, S., Wu, J., et al. (2023). Towards stable test-time adaptation in dynamic adversarial noise in class activation feature space. In ICCV (pp. 7878–7887).
wild world. In ICLR.
185

1 s2.0 S0893608023001259 Main

Uploaded by

Copyright:

Available Formats

You might also like

1 s2.0 S0893608023001259 Main

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

1 s2.0 S0893608023001259 Main

Uploaded by

Copyright:

Available Formats

Neural Networks 164 (2023) 177–185

Contents lists available at ScienceDirect

Learning defense transformations for counterattacking adversarial

1. Introduction of adversarial examples are even unknown in practice. Most exist-

transformation is non-trivial to be obtained in practice. Existing

3.3. Defense with affine transformations

Given a pre-trained classifier h, we seek to learn a transforma-

when evaluating the pre-trained model on different perturba-

You might also like