Game-Theoretic Approaches to Adversarial Attacks and Defences in Deep Learning Models
By
Himanshu Dubey - 2021071064
Aman Yadav - 2021071006
Bhashkar Pandey - 2022072002
Supervised By
Dr. Umesh Chandra Jaiswal
Acknowledgments
I am pleased to present the “Game-Theoretic Approaches to Adversarial Attacks and Defences in Deep Learning Models” project and take this opportunity to express my profound gratitude to all those who helped me complete it. I thank my college for providing the excellent facilities that helped us complete and present this project. I express my deepest gratitude to my project guide, Dr. Umesh Chandra Jaiswal, for his valuable and timely advice during the various phases of our project. I would also like to thank him for providing all the proper facilities and support as the project coordinator, and for his support, patience, faith in our capabilities, and the flexibility he gave me regarding working and reporting schedules. I especially thank our Director, Prof. J. P. Saini, and the Head of the Department, Dr. Shiva Prakash, for providing excellent computing and other facilities, without which this work could not have achieved its quality goals. I would also like to thank all my family members, who always stood beside us and provided the utmost moral support.
Contents
1 Introduction
  1.1 Problem Description
  1.2 Objective and Goals
2 Literature Review
3 Dataset
4 Evaluation Metrics
5 Baseline Model
  5.1 Classifier
  5.2 Attacks
    5.2.1 L-BFGS algorithm
    5.2.2 Fast Gradient Sign Method
    5.2.3 Projected Gradient Descent
    5.2.4 Adversarial patch
    5.2.5 Adversarial attacks on audio and text recognition
  5.3 Adversarial defenses
    5.3.1 Adversarial training
    5.3.2 FGSM adversarial training
    5.3.3 PGD adversarial training
    5.3.4 Random input transformation
    5.3.5 GAN-based input cleansing
6 Proposed Work
  6.1 Steps of Workflow
  6.2 Flowchart of Proposed Work
7 Discussions
  7.1 White-box and black-box attacks
8 Conclusion
Abstract
In the evolving landscape of artificial intelligence, deep learning models have demonstrated remarkable performance across a variety of domains. However, their susceptibility to adversarial attacks poses a significant threat to their reliability and security. These attacks involve subtle perturbations to input data, leading to incorrect model predictions without noticeable differences to the human eye. This project explores the application of game-theoretic approaches to understand and enhance the robustness of deep learning models against such adversarial threats. We focus on two primary game-theoretic defence mechanisms: Randomized Smoothing and Defence-GAN. Randomized smoothing introduces controlled randomness into the input data, making it difficult for adversaries to generate effective attacks consistently. Defence-GAN leverages the generative capabilities of GANs to transform adversarial examples back to the original data distribution, effectively neutralizing the attack. Both methods can be framed as strategic games between the attacker and the defender, where each player optimizes their strategies to maximize their respective payoffs. In this project, we formalize these interactions within a game-theoretic framework, analyse the Nash equilibria of the resulting games, and explore the theoretical underpinnings that govern these dynamics. We implement and evaluate these defence mechanisms on benchmark datasets, demonstrating their effectiveness in enhancing model robustness. Additionally, we investigate the synergy between these methods and other defence strategies to propose a comprehensive adversarial defence framework.
1 Introduction
1.1 Problem Description
Deep neural networks are vulnerable to adversarial examples. In the image recognition and classification area, adversarial examples refer to synthesized images that look identical to the original images from a human perspective but intentionally cause the classifier to make mistakes, i.e., produce wrong recognition or classification results. The existence of such adversarial examples poses great threats to the use of DNNs in safety-critical scenarios such as autonomous driving and financial fraud detection. When adversarial attacks are performed on training data, the learning process is misguided, resulting in an inaccurate classifier. The topic of adversarial attacks and defenses is gaining more and more attention with the rapid development of deep neural networks. Currently, both attack and defense face challenges. On the attack side, there are provable defenses targeting some attack strategies, and various defense methods have their strengths in tackling different attacks. On the defense side, due to the "last mover advantage" of the attacker, there seems to be no way to defend against all possible attacks: current defense methods have different efficiencies against different attacks, and the proofs of provable defenses only apply to specific classes of attacks. To deal with these problems, several approaches have been investigated to protect deep neural networks. The methods are roughly categorized as adversarial training, randomization, denoising, provable defense, and so on. In this work, we first use adversarial training techniques under the white-box setting to form our baseline model, assuming that the attack method is already known. From this, we explore several defense methods and carry out experiments on attacks to verify our assumptions. Our defense methods are: multiple initialization defense, multiple SAP-like networks, and super-resolution-based adversarial defense [10].
1.2 Objective and Goals
We use game-theoretic approaches on the defensive side to improve the robustness of deep learning models.
2 Literature Review
As usual in DNNs, θ is the set of model parameters. Our goal is to find
the model parameters θ that minimize the risk E(x,y)∼D[L(x, y, θ)]. The goal
of the attack model is to introduce a set of tiny perturbations δ ∈ S which
make the DNN model misclassify the adversarial sample, whereas the
same DNN
model can classify the normal sample correctly. The attack and defense
can be described as the following saddle point problem:
min_θ ρ(θ),   where   ρ(θ) = E(x,y)∼D [ max_(δ∈S) L(θ, x + δ, y) ]        (1)
Provable (certified) defenses aim to guarantee a lower bound on the model's accuracy under a well-defined class of adversarial attacks. The approach is to formulate an adversarial polytope and define its upper bound using convex relaxations, which guarantees that no attack with specific limitations can surpass the certified attack success rate. Some typical provable defensive methods are developed based on semi-definite programming, the dual approach, Bayesian models, and consistency. However, the actual performance of these certified defenses is still much worse than that of adversarial training.
3 Dataset
The CIFAR-10 dataset [10] is one of the most popular datasets for machine learning and computer vision research. It is a labelled subset of the 80 Million Tiny Images dataset and consists of 60,000 32x32 colour images in 10 different classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. Each class therefore contains 6,000 images. Algorithms for image recognition and classification are often tested on this dataset. Since the images in the dataset are low-resolution, it is easy for researchers to train models and implement ideas quickly.
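For reference, a minimal sketch of loading this dataset with the torchvision package is shown below. The batch size and the per-channel normalization statistics are conventional choices for illustration and are not prescribed by this report.

# Illustrative sketch: loading CIFAR-10 with torchvision (batch_size and the
# normalization statistics are common defaults, not fixed by this project).
import torch
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.ToTensor(),  # scale pixel values to [0, 1]
    transforms.Normalize((0.4914, 0.4822, 0.4465),   # per-channel means
                         (0.2470, 0.2435, 0.2616)),  # per-channel std devs
])

train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False,
                                        download=True, transform=transform)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=128, shuffle=False)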
4 Evaluation Metrics
The accuracy we mention in this paper is the number of correctly classified images divided by the number of all evaluation images. The goal of the classifier is to classify as many images correctly as possible, i.e., to increase accuracy. The goal of the attack side is to generate images that the classifier cannot classify correctly, i.e., to decrease the classifier's accuracy. The goal of the defence side is to keep the classifier's accuracy from decreasing too much under attack, while maintaining normal performance on unchanged images.
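As a minimal sketch, assuming a PyTorch classifier `model` and a data loader such as the `test_loader` above, this accuracy can be computed as follows.

# Illustrative sketch: top-1 accuracy = correct predictions / all evaluation images.
import torch

def accuracy(model, data_loader, device="cpu"):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in data_loader:
            images, labels = images.to(device), labels.to(device)
            predictions = model(images).argmax(dim=1)  # predicted class per image
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    return correct / total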
5 Baseline Model
5.1 Classifier
Our group implemented VGG and ResNet models to perform the classification task. We chose these DNNs because they are easy to implement, take a relatively short time to train and to generate attack instances, and all of them can reach around 90% classification accuracy.
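As an illustration, one such baseline can be instantiated from torchvision; the specific choice of ResNet-18 and the small-image adaptations below are our own assumptions for a sketch, since the report does not fix the exact variant.

# Illustrative sketch: a ResNet-18 baseline adapted to CIFAR-10 (10 classes).
# The exact architecture and depth used in the project may differ.
import torch.nn as nn
import torchvision

model = torchvision.models.resnet18(num_classes=10)
# CIFAR-10 images are 32x32, so a smaller first convolution and no max-pooling
# are common adaptations (an assumption, not prescribed by this report).
model.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
model.maxpool = nn.Identity()

criterion = nn.CrossEntropyLoss()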
5.2 Attacks
5.2.1 L-BFGS algorithm
The vulnerability of deep neural networks (DNNs) to adversarial samples was first reported by Szegedy et al.; that is, hardly perceptible adversarial perturbations can be introduced to an image to mislead the DNN classification result. A method called L-BFGS was proposed to find the adversarial perturbation with the minimum Lp norm, which is formulated as follows:
min_(x′) ||x − x′||_p   subject to   f(x′) ≠ y′        (2)
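In practice, this constrained problem is usually relaxed to a penalized objective of the form c · ||x′ − x||² + L(f(x′), y_target) and solved with a box-constrained optimizer. The sketch below uses PyTorch's torch.optim.LBFGS on that relaxed objective; the constant c, the iteration count, and the target label y_target are illustrative assumptions, not settings taken from this report.

# Illustrative sketch of an L-BFGS-style attack on the relaxed objective
#   minimize  c * ||x' - x||^2 + loss(f(x'), y_target),
# where y_target is a chosen wrong label. Clamping keeps x' a valid image.
import torch
import torch.nn.functional as F

def lbfgs_attack(model, x, y_target, c=1.0, steps=20):
    x_adv = x.clone().detach().requires_grad_(True)
    optimizer = torch.optim.LBFGS([x_adv], max_iter=steps)

    def closure():
        optimizer.zero_grad()
        loss = c * torch.sum((x_adv - x) ** 2) \
               + F.cross_entropy(model(x_adv), y_target)
        loss.backward()
        return loss

    optimizer.step(closure)
    return x_adv.detach().clamp(0.0, 1.0)  # project back to a valid pixel range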
5.2.2 Fast Gradient Sign Method
The fast gradient sign method (FGSM) perturbs a benign sample in the direction of the sign of the loss gradient, x′ = x + ϵ · sign(∇_x J(θ, x, y)), where ϵ bounds the perturbation size. Moreover, it has been discovered that randomly perturbing benign samples before executing FGSM can enhance the performance and the diversity of the FGSM adversarial samples.
5.2.3 Projected Gradient Descent
Projected gradient descent (PGD) incorporates a perturbation budget (ϵ) and a step size (α) to control the amount and direction of the perturbation. The update rule for PGD is defined as
x_(t+1) = Π( x_t + α · sign(∇_x J(θ, x_t, y)) ),
where x_t is the input at iteration t, α is the step size, ∇_x J(θ, x_t, y) is the gradient of the loss with respect to the input, and Π is the projection operator that ensures the perturbed input stays within the predefined bounds [8].
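This update rule translates directly into a short loop; in the sketch below the budget ϵ, step size α, and iteration count are illustrative values, and the projection is onto the L∞ ball of radius ϵ around the original input.

# Illustrative PGD sketch: iterate the FGSM-style step and project the
# perturbation back onto the L-infinity ball of radius epsilon around x.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon=8/255, alpha=2/255, steps=10):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()       # gradient-sign step
        delta = torch.clamp(x_adv - x, -epsilon, epsilon)  # project onto the eps-ball
        x_adv = torch.clamp(x + delta, 0.0, 1.0)           # keep a valid image
    return x_adv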
5.2.4 Adversarial patch
Recent studies indicate that perturbations in specific regions of benign samples, called adversarial patches, can also deceive deep learning (DL) models. Sharif et al. [9] introduced a technique where adversarial perturbations are applied only to an eyeglasses frame on facial images, as shown in Fig. 1. By optimizing a common adversarial loss such as cross-entropy, this localized perturbation can easily trick the VGG-Face convolutional neural network (CNN) [26]. The authors demonstrated this attack in the real world by 3D-printing eyeglasses with the generated perturbations. They also presented video demonstrations showing people wearing these adversarial eyeglasses being misidentified as target individuals by a real VGG-Face CNN system.
Brown et al. [2] developed a method to create universal robust adversarial patches. The adversarial loss for optimizing the patch is defined using benign images, patch transformations, and patch locations. Universality is achieved by optimizing the patch across all benign images. Robustness to noise and transformations is ensured using the expectation over transformation (EoT) method [1], which calculates noise/transformation-insensitive gradients for the optimization.
Liu et al. [7] proposed a method to generate adversarial samples by adding a Trojan patch to benign samples. This attack first identifies neurons that significantly influence the network's outputs. The pixel values in the patch region are then initialized to maximize these selected neurons' activation. Finally, the model is retrained with both benign images and images with the Trojan patch to adjust the weights related to these neurons. Although the retrained model performs similarly to the original model on benign images, it exhibits malicious behavior on images with the adversarial patch.
Figure 2: Eyeglasses with adversarial patterns trick a facial recognition system into recognizing faces in the first row as those in the second row [9].
formulated as follows:
min_θ  max_(D(x,x′)<η)  J(θ, x′, y)        (5)
5.3.2 FGSM adversarial training
Models adversarially trained with FGSM are robust mainly against FGSM-generated adversarial samples. Consequently, further research has delved into adversarial training using stronger adversarial attacks, such as BIM/PGD attacks.
5.3.3 PGD adversarial training
Through extensive evaluations, it has been found that the PGD attack is likely a universal first-order L∞ attack [13]. This means that if a model is robust against PGD attacks, it is likely to resist a wide range of other first-order L∞ attacks. Building on this idea, Madry et al. [13] suggest training a robust network adversarially using PGD. Surprisingly, PGD adversarial training indeed enhances the resilience of CNNs and ResNets [24] against several common first-order L∞ attacks, such as FGSM, PGD, and CW∞ attacks, in both black-box and white-box scenarios.
Even the strongest known L∞ attack, DAA, can only reduce the accuracy of PGD adversarially trained models to a certain level. For instance, the accuracy of a PGD adversarially trained MNIST model drops only to 88.56%, and that of a CIFAR-10 model to 44.71%. In the recent Competition on Adversarial Attacks and Defenses (CAAD), the top-ranking defense against ImageNet adversarial samples was based on PGD adversarial training [14].
With PGD adversarial training, a baseline ResNet [23] can achieve over 50% accuracy under 20-step PGD, whereas the denoising architecture proposed in Ref. [14] only increases this accuracy by about 3%. These studies collectively suggest that PGD adversarial training is generally the most effective defense against L∞ attacks.
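A minimal sketch of one epoch of PGD adversarial training is given below, reusing the `pgd_attack` sketch from Section 5.2.3; the optimizer and hyperparameters are illustrative assumptions rather than the exact settings of the cited works.

# Illustrative PGD adversarial training sketch: at every step, replace the
# clean minibatch with PGD adversarial examples and minimize the loss on them
# (inner maximization / outer minimization of the saddle-point problem).
import torch
import torch.nn.functional as F

def train_one_epoch_pgd(model, train_loader, optimizer, device="cpu",
                        epsilon=8/255, alpha=2/255, steps=7):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        x_adv = pgd_attack(model, images, labels, epsilon, alpha, steps)  # inner max
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), labels)                      # outer min
        loss.backward()
        optimizer.step()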
However, it is worth noting that PGD adversarial training comes with significant computational costs. For example, training a simplified ResNet for CIFAR-10 with PGD adversarial training can take up to three days on a TITAN V graphics processing unit (GPU), and the top-ranking model in CAAD requires 52 hours on 128 Nvidia V100 GPUs. Additionally, while a PGD adversarially trained model is robust against L∞ attacks, it remains vulnerable to other types of attacks, such as EAD [19,59] and CW2 [8].
5.3.4 Random input transformation
Randomly transforming the input at inference time can mitigate many attacks, but it struggles with white-box attacks, especially against the EoT method [28], which uses an ensemble of resized and padded images to significantly reduce accuracy. Additionally, Guo et al. [68] employ various random image transformations, including bit-depth reduction and JPEG compression, to defend against adversarial samples. While effective against most gray-box and black-box attacks, this method is also susceptible to the EoT method [28].
6 Proposed Work
This project aims to advance the state-of-the-art in adversarial defences by:
Empirical Validation: Conducting extensive experiments on benchmark datasets to evaluate the effectiveness of the proposed defence mechanisms, comparing their performance against existing methods.
6.1 Steps of Workflow
6. Defence-GAN Implementation (a minimal reconstruction sketch is given after this list)
9. Empirical Validation
10. Scalability & Efficiency Analysis
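For step 6, a minimal sketch of the Defence-GAN reconstruction step is shown below: given a (possibly adversarial) input, gradient descent searches the latent space of a pre-trained generator for the closest reconstruction, which is then fed to the classifier instead of the raw input. The names `generator` and `classifier`, the latent dimension, and the optimization hyperparameters are placeholders and assumptions, not details specified in this report.

# Illustrative Defence-GAN sketch: project an input onto the range of a
# pre-trained generator G by minimizing ||G(z) - x||^2 over the latent z,
# then classify the reconstruction G(z*) instead of x.
import torch

def defence_gan_reconstruct(generator, x, latent_dim=128, steps=200, lr=0.05,
                            n_restarts=4):
    best_z, best_err = None, float("inf")
    for _ in range(n_restarts):                            # several random restarts
        z = torch.randn(x.size(0), latent_dim, requires_grad=True)
        optimizer = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):
            optimizer.zero_grad()
            err = torch.sum((generator(z) - x) ** 2)       # reconstruction error
            err.backward()
            optimizer.step()
        final_err = torch.sum((generator(z) - x) ** 2).item()
        if final_err < best_err:
            best_z, best_err = z.detach(), final_err
    return generator(best_z).detach()                      # cleaned input for the classifier

# Usage (assumed names): logits = classifier(defence_gan_reconstruct(generator, x_adv))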
6.2 Flowchart of Proposed Work
Figure 6: Flowchart of the proposed work.
7 Discussions
7.1 White-box and black-box attacks
When comparing white-box and black-box scenarios from the adversaries' viewpoint, the primary distinction lies in their level of access to the target model. In white-box scenarios, adversaries possess full access to the model's structure and weights. Consequently, they can compute accurate model gradients or approximate them using methods outlined in previous literature [17]. Furthermore, adversaries can adjust their attack strategies based on the defense mechanisms and parameters in place. However, many heuristic defenses are ineffective against such adaptive adversaries.
Conversely, in black-box scenarios, adversaries lack access to the model's internal details. Therefore, they must infer model gradients from limited information. Without specific knowledge about the model, unbiased estimation of model gradients can be achieved through ensembles of pre-trained models with different random seeds. Notably, a momentum gradient-based method utilizing this gradient estimation achieved top honors in the NIPS 2017 Challenge under a black-box setting [18].
Chen et al. [104] delve into another black-box setting where adversaries are granted additional query access to the target model's outputs. In this scenario, adversaries can deduce gradients from the model's responses to well-designed inputs. Utilizing a zero-order method in this setting allows for significantly improved estimation of model gradients. However, one drawback of this approach is its requirement for a large number of queries, which scales with the data dimension.
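As a minimal sketch, such a query-based gradient estimate can be formed with symmetric finite differences along randomly chosen coordinates. The function `query_loss` (returning a scalar loss obtained purely from the model's outputs), the number of sampled coordinates, and the step size are assumptions for illustration; this is a generic zero-order scheme in the spirit of the cited work, not its exact algorithm.

# Illustrative zero-order sketch: estimate the gradient of a black-box loss
# with symmetric finite differences along randomly chosen coordinates.
# Each sampled coordinate costs two queries, so the query budget grows with
# the number of coordinates (and hence with the data dimension).
import torch

def zero_order_gradient(query_loss, x, n_coords=128, h=1e-4):
    grad = torch.zeros_like(x)
    flat = x.reshape(-1)
    for idx in torch.randperm(flat.numel())[:n_coords]:
        e = torch.zeros_like(flat)
        e[idx] = h                                     # perturb one coordinate
        plus = query_loss((flat + e).view_as(x))       # f(x + h * e_i)
        minus = query_loss((flat - e).view_as(x))      # f(x - h * e_i)
        grad.view(-1)[idx] = (plus - minus) / (2 * h)  # central difference
    return grad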
8 Conclusion
In conclusion, this chapter has provided a comprehensive overview of the game-theoretical strategies employed for generating adversarial manipulations in deep learning models. Our adversaries aim to inject small changes into data distributions, leading to misclassification by deep learning models. The theoretical goal of adversarial deep learning is to determine whether such manipulations have reached a decision boundary, resulting in misclassification of labels.
Through Stackelberg games, adversaries target the misclassification performance of deep learning models, solving for optimal attack policies. Sequential game-theoretical formulations model the interaction between adversaries and deep learning models, resulting in adversarial manipulations. By extending to stochastic game-theoretical formulations, we introduce multi-player Stackelberg games with stochastic payoff functions, resolved through Nash equilibrium.
Adversaries optimize variational payoff functions via data randomization strategies, manipulating deep learning models into misclassifying labels. The generated adversarial data simulates continuous interactions with classifier learning processes. Our findings suggest that learning algorithms based on game-theoretical modeling and mathematical optimization offer a promising approach to building more secure deep learning models.
In cybersecurity applications, these game-theoretical solution concepts lead to robust deep neural networks that are resilient to subsequent data manipulations by adversaries. This underscores the significance of integrating game-theoretical approaches into the design and development of deep learning models to enhance security and robustness against adversarial attacks.
References
[1] Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. In International Conference on Machine Learning, pages 284–293. PMLR, 2018.
[2] Tom B. Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer. Adversarial patch. arXiv preprint arXiv:1712.09665, 2017.
[4] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
[5] Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang. TextBugger: Generating adversarial text against real-world applications. arXiv preprint arXiv:1812.05271, 2018.
[6] Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, and Wenchang Shi. Deep text classification can be fooled. arXiv preprint arXiv:1704.08006, 2017.
[7] Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. Trojaning attack on neural networks. In 25th Annual Network and Distributed System Security Symposium (NDSS 2018). Internet Society, 2018.
[8] Kui Ren, Tianhang Zheng, Zhan Qin, and Xue Liu. Adversarial attacks and defenses in deep learning. Engineering, 6(3):346–360, 2020.
[11] Hiromu Yakura and Jun Sakuma. Robust audio adversarial example for a physical attack. arXiv preprint arXiv:1810.11793, 2018.