
Game-Theoretic Approaches to Adversarial Attacks and Defences in Deep Learning Models

THIRD YEAR PROJECT

B.Tech INFORMATION TECHNOLOGY

By

Himanshu Dubey - 2021071064
Aman Yadav - 2021071006
Bhashkar Pandey - 2022072002

Supervised By

Dr. Umesh Chandra Jaiswal
Professor

DEPARTMENT OF INFORMATION TECHNOLOGY

MADAN MOHAN MALAVIYA UNIVERSITY OF TECHNOLOGY

Acknowledgments
I am pleased to present the “Game-Theoretic Approaches to Adversarial Attacks and Defences in Deep Learning Models” project and take this opportunity to express my profound gratitude to all those who helped me complete it. I thank my college for providing excellent facilities that helped us complete and present this project. I express my deepest gratitude to my project guide and project co-coordinator, Dr. Umesh Chandra Jaiswal, for his valuable and timely advice during the various phases of our project and for providing all the proper facilities and support. I also thank him for his support, patience, and faith in our capabilities, and for giving me flexibility regarding working and reporting schedules. I especially thank our Director, Prof. J.P. Saini, and the Head of the Department, Dr. Shiva Prakash, for providing excellent computing and other facilities without which this work could not achieve its quality goal. I would also like to thank all my family members, who always stood beside us and provided the utmost important moral support.

Contents
1 Introduction
  1.1 Problem Description
  1.2 Objective and Goals
2 Literature Review
3 Dataset
4 Evaluation Metrics
5 Baseline Model
  5.1 Classifier
  5.2 Attacks
    5.2.1 L-BFGS algorithm
    5.2.2 Fast Gradient Sign Method
    5.2.3 Projected Gradient Descent
    5.2.4 Adversarial patch
    5.2.5 Adversarial attacks on audio and text recognition
  5.3 Adversarial defenses
    5.3.1 Adversarial training
    5.3.2 FGSM adversarial training
    5.3.3 PGD adversarial training
    5.3.4 Random input transformation
    5.3.5 GAN-based input cleansing
6 Proposed Work
  6.1 Steps of Workflow
  6.2 Flowchart of Proposed Work
7 Discussions
  7.1 White-box and black-box attacks
8 Conclusion

Abstract
In the evolving landscape of artificial intelligence, deep learning models have demonstrated remarkable performance across a variety of domains. However, their susceptibility to adversarial attacks poses a significant threat to their reliability and security. These attacks involve subtle perturbations to input data, leading to incorrect model predictions without noticeable differences to the human eye. This project explores the application of game-theoretic approaches to understand and enhance the robustness of deep learning models against such adversarial threats. We focus on two primary game-theoretic defence mechanisms: Randomized Smoothing and Defence-GAN. Randomized smoothing introduces controlled randomness into the input data, making it difficult for adversaries to generate effective attacks consistently. Defence-GAN leverages the generative capabilities of GANs to transform adversarial examples back to the original data distribution, effectively neutralizing the attack. Both methods can be framed as strategic games between the attacker and the defender, where each player optimizes their strategies to maximize their respective payoffs. In this project, we formalize these interactions within a game-theoretic framework, analyse the Nash equilibria of the resulting games, and explore the theoretical underpinnings that govern these dynamics. We implement and evaluate these defence mechanisms on benchmark datasets, demonstrating their effectiveness in enhancing model robustness. Additionally, we investigate the synergy between these methods and other defence strategies to propose a comprehensive adversarial defence framework.

Keywords: Adversarial Attacks, Game Theory, Deep Learning, Randomized Smoothing, Defence-GAN, Model Robustness, Nash Equilibrium, Generative Adversarial Networks (GANs)

1 Introduction
1.1 Problem Description
Deep neural networks are vulnerable to adversarial examples. In the image recognition and classification area, adversarial examples refer to synthesized images which look identical to the original images from a human perspective but intentionally cause the classifier to make mistakes, i.e., to produce wrong recognition or classification results. The existence of such adversarial examples poses great threats to using DNNs in safety-critical scenarios like autonomous driving and financial fraud detection. When adversarial attacks are performed on training data, the learning process is misguided, resulting in an inaccurate classifier. The topic of adversarial attacks and defenses is gaining more and more attention with the rapid development of deep neural networks. Currently, both attack and defense face challenges. On the attack side, provable defenses exist that target certain attack strategies; on the defense side, different methods have their own strengths against different attacks. Due to the “last mover advantage” of the attacker, there seems to be no way to defend against all possible attacks: current defense methods have different efficiencies against different attacks, and the guarantees of provable defenses apply only to a specific class of attacks. To deal with these problems, several approaches have been investigated to protect deep neural networks. These methods are roughly categorized as adversarial training, randomization, denoising, provable defense, and so on. In this work, we first use adversarial training techniques under the white-box setting to form our baseline model, assuming that the attack method is already known. From this, we explore several defense methods and carry out experiments on attacks to verify our assumptions. Our defense methods are: multiple initialization defense, multiple SAP-like networks, and super-resolution-based adversarial defense [10].

1.2 Objective and Goals


The process of adversarial attacks and defenses can be modeled as a zero-sum two-person game. More specifically, white-box attacks and defenses can be viewed as a multi-move game with perfect information. There are two multi-move games with alternating moves by the attacker and defender in our project. First, we play the research game of finding better attacks on the current best defenses and better defenses against the current best attacks. Second, in the process of finding better attacks and defenses, we carry out experiments that play out the alternation of defense, attack, defense, and so on, on the given dataset. On the defensive side, we want to apply the von Neumann minimax theorem for zero-sum two-person games to improve the performance of adversarial defenses under grey-box attacks. Since there are numerous adversarial attacks defined under different metrics, and adversarial samples are still not well understood, it is difficult to find one defense that can effectively defend against all possible attack methods in all situations. We therefore use game-theoretical approaches on the defensive side to improve the robustness of deep-learning models.
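To make the minimax idea concrete, the sketch below computes an optimal mixed defence strategy for a small zero-sum attacker–defender game by linear programming; the 2×2 payoff matrix (classifier accuracy retained under each attack) and the SciPy-based solution are illustrative assumptions, not measured results from our experiments.

import numpy as np
from scipy.optimize import linprog

# Hypothetical zero-sum payoff matrix: rows = defender strategies,
# columns = attacker strategies, entries = accuracy kept under attack.
A = np.array([
    [0.85, 0.40],   # defence 1 (e.g., FGSM adversarial training)
    [0.55, 0.70],   # defence 2 (e.g., randomized smoothing)
])
m, n = A.shape

# Defender maximizes the game value v over mixed strategies p:
#   max v  subject to  sum_i p_i * A[i, j] >= v  for every attacker column j,
#                      sum_i p_i = 1,  p >= 0.
# linprog minimizes, so the variable vector is x = [p_1, ..., p_m, v] and c = [0, ..., 0, -1].
c = np.concatenate([np.zeros(m), [-1.0]])
A_ub = np.hstack([-A.T, np.ones((n, 1))])   # v - sum_i p_i * A[i, j] <= 0
b_ub = np.zeros(n)
A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)
b_eq = np.array([1.0])
bounds = [(0, 1)] * m + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
p, v = res.x[:m], res.x[m]
print("optimal defender mixture:", p, "guaranteed accuracy (game value):", v)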

2 Literature Review
As usual in DNNs, θ is the set of model parameters. Our goal is to find
the model parameters θ that minimize the risk E(x,y)∼D[L(x, y, θ)]. The goal
of the attack model is to introduce a set of tiny perturbations δ ∈ S which
make the DNN model misclassify the adversarial sample, whereas the
same DNN
model can classify the normal sample correctly. The attack and defense
can be described as the following saddle point problem:
min_θ ρ(θ),   where   ρ(θ) = E(x,y)∼D [ max_(δ∈S) L(θ, x + δ, y) ]        (1)

The inner maximization problem corresponds to the attack problem, and the outer minimization problem is to find a set of parameters that has the least loss for the given attack. According to the level of access to the target model, attacks can be divided into black-box attacks and white-box attacks. In the white-box case, we can directly obtain gradient information at the input point from the target model. The commonly used methods include limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS), the fast gradient sign method (FGSM), the basic iterative method (BIM), projected gradient descent (PGD), the distributionally adversarial attack (DAA), Carlini and Wagner (C&W) attacks, the Jacobian-based saliency map attack (JSMA), and DeepFool. In the baseline process, our project focuses on attack and defense for white-box models.
In black-box settings, we can only obtain information about the input sample and the prediction result; thus, gradient information cannot be directly obtained. Although these methods are designed for white-box attacks, they are also effective in many black-box settings. Meanwhile, various defensive techniques for adversarial sample detection/classification have been proposed, including heuristic and certified defenses. A heuristic defense refers to a mechanism that defends against specific attacks without theoretical accuracy guarantees. Representative heuristic defenses mainly include adversarial training, randomization-based schemes, and denoising methods. In contrast, certified defenses can always provide certifications for their lowest accuracy under a well-defined class of adversarial attacks. The typical approach is to formulate an adversarial polytope and define its upper bound using convex relaxations, which guarantees that no attack with specific limitations can surpass the certified attack success rate. Some typical provable defensive methods are developed based on semi-definite programming, the dual approach, Bayesian models, and consistency. However, the actual performance of these certified defenses is still much worse than that of adversarial training.

3 Dataset
The CIFAR-10 dataset [10] is one of the most popular datasets for machine learning and computer vision research. The dataset is a labelled subset of the 80 Million Tiny Images collection and contains 60,000 32×32 colour images in 10 classes. The class tags are airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks, so each class has 6,000 images. Algorithms for image recognition and classification are often tested on this dataset. Since the images in the dataset are low-resolution, it is easy for researchers to train models and implement ideas quickly.

4 Evaluation Metrics
The accuracy we mention in this report is the number of correctly classified images divided by the total number of evaluation images. The goal of the classifier is to classify as many images as possible correctly, i.e., to increase accuracy. The goal of the attack side is to generate images that the classifier cannot classify correctly, i.e., to decrease the classifier’s accuracy. The goal of the defence side is to keep the accuracy of the classifier from decreasing too much under attack. At the same time, the defence should maintain normal performance on unchanged images.
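As a concrete illustration of this metric, the following sketch computes accuracy as the fraction of correctly classified evaluation images; the PyTorch classifier and data loader it assumes are placeholders, since this section does not fix an implementation.

import torch

@torch.no_grad()
def accuracy(model, loader, device="cpu"):
    # Fraction of evaluation images the classifier labels correctly.
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total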

5 Baseline Model
5.1 Classifier
Our group implemented VGG and ResNet to perform the classification task. We chose these DNNs because they are easy to implement and take relatively little time to train and to generate attack instances, and both reach around 90% accuracy.
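As a minimal sketch, such classifiers can be instantiated with torchvision as shown below; the CIFAR-10 stem modification and the particular variants (ResNet-18, VGG-16) are illustrative assumptions rather than the exact configuration used in our experiments.

import torch.nn as nn
from torchvision import models

# ResNet-18 with a 10-way output layer for CIFAR-10.
resnet = models.resnet18(num_classes=10)
# CIFAR-10 images are 32x32, so a 3x3 first convolution and no initial
# max-pooling are commonly substituted for the ImageNet stem.
resnet.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
resnet.maxpool = nn.Identity()

# VGG-16 (with batch normalization), also with 10 output classes.
vgg = models.vgg16_bn(num_classes=10)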

5.2 Attacks
5.2.1 L-BFGS algorithm
The vulnerability of deep neural networks (DNNs) to adversarial samples was first reported by Szegedy et al.; that is, hardly perceptible adversarial perturbations are introduced to an image to mislead the DNN classification result. A method called L-BFGS is proposed to find the adversarial perturbation with the minimum L_p norm, which is formulated as follows:

min_(x′) ‖x − x′‖_p   subject to   f(x′) = y′        (2)

where ‖x − x′‖_p is the L_p norm of the adversarial perturbation and y′ is the adversarial target label (y′ ≠ y). However, this optimization problem is intractable. The authors propose minimizing a hybrid loss, that is, c‖x − x′‖_p + J(θ, x′, y′), where c is a hyperparameter, as an approximation of the solution to the optimization problem; an optimal value of c can be found by line search or grid search.
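A simplified PyTorch sketch of this idea is given below: it minimizes the hybrid loss c‖x − x′‖² + J(θ, x′, y′) with PyTorch's L-BFGS optimizer for one fixed, illustrative value of c, whereas the original method performs a line/grid search over c and uses a box-constrained solver.

import torch
import torch.nn.functional as F

def lbfgs_attack(model, x, y_target, c=0.1, steps=30):
    # Minimize c * ||x - x'||_2^2 + J(theta, x', y_target) over x'.
    x_adv = x.clone().detach().requires_grad_(True)
    optimizer = torch.optim.LBFGS([x_adv], max_iter=steps)

    def closure():
        optimizer.zero_grad()
        loss = c * (x_adv - x).pow(2).sum() + F.cross_entropy(model(x_adv), y_target)
        loss.backward()
        return loss

    optimizer.step(closure)
    # Keep the adversarial image in the valid pixel range (a post-hoc simplification).
    return x_adv.detach().clamp(0.0, 1.0)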

5.2.2 Fast Gradient Sign Method


Goodfellow et al. first proposed an efficient untargeted attack, called FGSM, to generate adversarial samples in the L∞ neighborhood of the benign samples, as shown in Fig. 1. FGSM is a typical one-step attack algorithm, which performs a one-step update along the direction (i.e., the sign) of the gradient of the adversarial loss J(θ, x, y) to increase the loss in the steepest direction. Formally, the FGSM-generated adversarial sample is formulated as follows:

x′ = x + ϵ · sign[∇x J(θ, x, y)]        (3)

where ϵ is the magnitude of the perturbation. FGSM can be easily extended to a targeted attack algorithm (targeted FGSM) by descending the gradient of J(θ, x, y′), in which y′ is the target label. This update procedure decreases the cross-entropy between the predicted probability vector and the target probability vector if cross-entropy is applied as the adversarial loss. The update rule for targeted FGSM can be formulated as follows:

x′ = x − ϵ · sign[∇x J(θ, x, y′)]        (4)

Moreover, it has been discovered that randomly perturbing the benign samples before executing FGSM can enhance the performance and the diversity of the FGSM adversarial samples.

Figure 1: A demonstration of an adversarial sample generated by applying FGSM to GoogLeNet. The imperceptible perturbation crafted by FGSM fools GoogLeNet into recognizing the image as a gibbon.
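A minimal PyTorch sketch of the FGSM updates in Eqs. (3) and (4) follows; the ϵ value and the [0, 1] pixel clamp are illustrative assumptions.

import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255, targeted=False):
    # One-step FGSM: move along (untargeted) or against (targeted) the sign
    # of the input gradient of the loss. For a targeted attack, y is the target label.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    step = -eps if targeted else eps
    x_adv = x + step * grad.sign()
    return x_adv.detach().clamp(0.0, 1.0)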

5.2.3 Projected Gradient Descent


At the core of machine learning optimization lies the fundamental
concept of gradient descent. This iterative algorithm fine-tunes model
parameters to minimize a given loss function. Mathematically, the update
rule is expressed
as θt+1 = θt − α · ∇J(θt), where θt represents the parameters at iteration t,
α is the learning rate, and ∇J(θt) is the gradient of the loss function.

Projected Gradient Descent (PGD) builds upon this foundation, introducing thoughtful constraints to enhance its effectiveness in crafting adversarial examples. In the context of adversarial attacks, the objective is to perturb input data to deliberately mislead the model.

PGD incorporates a perturbation budget ϵ and a step size α to control the amount and direction of the perturbation. The update rule for PGD is defined as

x_(t+1) = Π_T ( x_t + α · sign(∇x J(θ, x_t, y)) )

where x_t is the input at iteration t, α is the step size, ∇x J(θ, x_t, y) is the gradient of the loss with respect to the input, and Π_T is the projection operator ensuring that the perturbed input stays within the predefined bounds T [8].
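A minimal PyTorch sketch of the L∞ PGD attack described above is shown below; the projection is implemented as clipping to the ϵ-ball around the original input, and the step size, iteration count, and random start are illustrative choices.

import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    # Iterative L-infinity PGD with a random start inside the eps-ball.
    x_adv = x.clone().detach() + torch.empty_like(x).uniform_(-eps, eps)
    x_adv = x_adv.clamp(0.0, 1.0)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Projection step: clip back into the eps-ball and the valid pixel range.
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()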
5.2.4 Adversarial patch
Recent studies indicate that perturbations in specific regions of benign samples, called adversarial patches, can also deceive deep learning (DL) models. Sharif et al. [9] introduced a technique where adversarial perturbations are applied only to an eyeglasses frame on facial images, as shown in Fig. 2. By optimizing a common adversarial loss like cross-entropy, this localized perturbation can easily trick the VGG-Face convolutional neural network (CNN) [26]. The authors demonstrated this attack in the real world by 3D-printing eyeglasses with the generated perturbations. They also presented video demonstrations showing people wearing these adversarial eyeglasses being misidentified as target individuals by a real VGG-Face CNN system.
Brown et al. [2] developed a method to create universal robust adversarial patches. The adversarial loss for optimizing the patch is defined using benign images, patch transformations, and patch locations. Universality is achieved by optimizing the patch across all benign images. Robustness to noise and transformations is ensured using the expectation over transformation (EoT) method [1], which calculates noise/transformation-insensitive gradients for the optimization.
Liu et al.[7] proposed a method to generate adversarial samples by adding
a Trojan patch to benign samples. This attack first identifies neurons that
significantly influence the network’s outputs. The pixel values in the
patch region are then initialized to maximize these selected neurons’
activation. Finally, the model is retrained with both benign images and
images with the Trojan patch to adjust the weights related to these
neurons. Although the retrained model performs similarly to the original
model on benign images, it exhibits malicious behavior on images with
the adversarial patch.

Figure 2: Eyeglasses with adversarial patterns trick a facial recognition system into recognizing faces in the first row as those in the second row. [9]

5.2.5 Adversarial attacks on audio and text recognition


Carlini and Wagner [3] successfully created high-quality audio adversarial samples by optimizing over the C&W loss function. In their study, they found that for an audio signal, up to 50 words in the text translation can be modified by altering only 1% of the audio signal using adversarial perturbations on DeepSpeech [4]. They also discovered that these adversarial audio signals are resistant to pointwise noise and MP3 compression. However, when played over the air, the perturbed audio signals lose their adversarial properties due to the nonlinear effects of microphones and recorders. To address this, the authors in Ref. [11] proposed simulating these nonlinear effects and noise during the attack process. They modeled the received signal as a function of the transmitted signal, incorporating transformations that account for the effects of a band-pass filter, impulse response, and white Gaussian noise. They defined the adversarial loss based on the received signals rather than the transmitted signals. This approach successfully generated adversarial audio samples that could deceive audio-recognition models even after being played in the physical world.
For text recognition, Liang et al. [6] suggested three word-level perturbation strategies for text data: insertion, modification, and removal. The attack begins by identifying important text items for classification and then applies one of the perturbation methods to these items. Experiments showed that this attack could successfully deceive some advanced DNN-based text classifiers. Additionally, TextBugger employs five types of perturbation operations on text data: insertion, deletion, swapping, character substitution, and word substitution, as illustrated in Fig. 3 [5]. In a white-box setting, these five operations are applied to important words identified by the Jacobian matrix. However, in a black-box threat model, the Jacobian matrix is not available for sentences and documents. In this scenario, the adversary assumes access to the prediction confidence values. The importance of each sentence is defined by its confidence value for the predicted class. The importance of each word in the most significant sentence is determined by the difference between the confidence values of the sentence with and without the word.

Figure 3: Adversarial text generated by TextBugger [5]: A negative


comment is misclassified as a positive comment.

5.3 Adversarial defenses


In this section, we will explore the key defenses created in recent years, focusing on adversarial training, randomization-based methods, denoising techniques, provable defenses, and several other innovative strategies. Additionally, we will provide a short discussion on how effective these defenses are against various attacks in different contexts.

5.3.1 Adversarial training


Adversarial training is an intuitive defense method against adversarial samples, which attempts to improve the robustness of a neural network by training it with adversarial samples. Formally, it is a min–max game that can be formulated as follows:

min_θ  max_(D(x,x′)<η)  J(θ, x′, y)        (5)

where J(θ, x′, y) is the adversarial loss, with network weights θ, adversarial input x′, and ground-truth label y, and D(x, x′) represents a certain distance metric between x and x′. The inner maximization problem aims to find the best adversarial samples and is tackled by powerful attacks like FGSM [5] and PGD [6]. The outer minimization, on the other hand, is the standard training process aimed at reducing the loss. The resulting network should then be resilient to the adversarial attacks faced during training. Recent studies in Refs. [13,14,57,58] demonstrate that adversarial training is one of the most effective defenses against such attacks, achieving state-of-the-art accuracy on various benchmarks. Therefore, in this section, we elaborate on the best-performing adversarial training techniques of the past few years.

5.3.2 FGSM adversarial training


Goodfellow et al. [5] initially suggested improving the resilience of a neural network by training it with both regular and FGSM-generated adversarial samples. Formally, the proposed adversarial objective can be expressed as follows:

J̃(θ, x, y) = c J(θ, x, y) + (1 − c) J(θ, x + ϵ · sign[∇x J(θ, x, y)], y)        (6)

where x + ϵ · sign[∇x J(θ, x, y)] is the FGSM-generated adversarial sample for the original sample x, ϵ is a small perturbation magnitude, and the perturbation direction is determined by the gradient of the loss function J(θ, x, y). The parameter c, set as a hyperparameter, balances the accuracy between benign and adversarial samples.
Experiments conducted in Ref. [5] demonstrate that the neural network becomes somewhat resilient to FGSM-generated adversarial samples through adversarial training. Specifically, the error rate on these adversarial samples significantly decreased from 89.4% to 17.9%. However, the trained model remains vulnerable to iterative and optimization-based adversarial attacks despite its success in defending against FGSM-generated adversarial samples. Consequently, further research has delved into adversarial training using stronger adversarial attacks, such as BIM/PGD attacks.
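A small sketch of the mixed objective in Eq. (6) is given below; it reuses the hypothetical fgsm() helper sketched in Section 5.2.2, and the weighting c = 0.5 is an illustrative choice.

import torch.nn.functional as F

def fgsm_adversarial_loss(model, x, y, eps=8 / 255, c=0.5):
    # Eq. (6): weighted sum of the clean loss and the loss on FGSM adversarial samples.
    x_adv = fgsm(model, x, y, eps=eps)          # helper sketched in Section 5.2.2
    clean_loss = F.cross_entropy(model(x), y)
    adv_loss = F.cross_entropy(model(x_adv), y)
    return c * clean_loss + (1 - c) * adv_loss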

5.3.3 PGD adversarial training
Through extensive evaluations, it has been found that a PGD attack is likely a universal first-order L∞ attack [13]. This means that if a model is robust against PGD attacks, it is likely to resist a wide range of other first-order L∞ attacks. Building on this idea, Madry et al. [13] suggest training a robust network adversarially using PGD. Surprisingly, PGD adversarial training indeed enhances the resilience of CNNs and ResNets [24] against several common first-order L∞ attacks, such as FGSM, PGD, and CW∞ attacks, in both black-box and white-box scenarios.
Even the strongest known L∞ attack, DAA, can only reduce the accuracy of PGD adversarially trained models to a certain level. For instance, the accuracy of a PGD adversarially trained MNIST model drops to 88.56%, and that of a CIFAR-10 model falls to 44.71%. In the recent Competition on Adversarial Attacks and Defenses (CAAD), the top-ranking defense against ImageNet adversarial samples was based on PGD adversarial training [14].
With PGD adversarial training, a baseline ResNet [23] can achieve over 50% accuracy under 20-step PGD, whereas the denoising architecture proposed in Ref. [14] only increases this accuracy by 3%. These studies collectively suggest that PGD adversarial training is generally the most effective defense against L∞ attacks.
However, PGD adversarial training comes with significant computational costs. For example, training a simplified ResNet for CIFAR-10 with PGD adversarial training can take up to three days on a TITAN V graphics processing unit (GPU), and the top-ranking model in CAAD requires 52 hours on 128 Nvidia V100 GPUs. Additionally, while a PGD adversarially trained model is robust against L∞ attacks, it remains vulnerable to other types of attacks, such as EAD [19,59] and CW2 [8].
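A condensed sketch of a PGD adversarial training loop in the spirit of Madry et al. is shown below; it reuses the pgd_attack() helper sketched in Section 5.2.3, and the optimizer, learning rate, and attack hyperparameters are illustrative assumptions.

import torch
import torch.nn.functional as F

def pgd_adversarial_training(model, train_loader, epochs=10, lr=0.1,
                             eps=8 / 255, alpha=2 / 255, steps=7, device="cpu"):
    # Outer minimization of Eq. (5): train on PGD adversarial samples only.
    model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            x_adv = pgd_attack(model, x, y, eps=eps, alpha=alpha, steps=steps)
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x_adv), y)
            loss.backward()
            optimizer.step()
    return model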

5.3.4 Random input transformation


Xie et al. [67] use two random techniques—resizing and padding—to reduce the impact of adversarial attacks during inference. Random resizing adjusts the size of input images before they go into DNNs, while random padding adds zeros around the images in a random way. The working mechanism of this method is shown in Fig. 4. This method works well against black-box attacks and came in second in the NIPS 2017 defense challenge. However, it struggles with white-box attacks, especially against the EoT method [28], which uses an ensemble of resized and padded images to significantly reduce accuracy. Additionally, Guo et al. [68] employ various random image transformations, including bit-depth reduction and JPEG compression, to defend against adversarial samples. While effective against most gray-box and black-box attacks, this method is also susceptible to the EoT method [28].

Figure 4: Working mechanism of the random input transformation proposed by Xie et al. [67]: the input image is randomly resized first and then randomly padded.
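A minimal sketch of the random resize-and-pad transform illustrated in Fig. 4 follows; the output size, interpolation mode, and zero padding are illustrative assumptions rather than the exact settings of Ref. [67].

import random
import torch.nn.functional as F

def random_resize_and_pad(x, out_size=331):
    # x: a 4-D image batch whose spatial size is smaller than out_size.
    _, _, h, w = x.shape
    new_size = random.randint(h, out_size - 1)            # random resize
    x = F.interpolate(x, size=(new_size, new_size), mode="nearest")
    pad_total = out_size - new_size                       # random zero padding
    pad_left = random.randint(0, pad_total)
    pad_top = random.randint(0, pad_total)
    # F.pad on 4-D tensors takes (left, right, top, bottom).
    return F.pad(x, (pad_left, pad_total - pad_left, pad_top, pad_total - pad_top))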

5.3.5 GAN-based input cleansing


A GAN (generative adversarial network) is a powerful tool for learning data distributions. Many researchers use GANs to learn benign data distributions and generate benign projections for adversarial inputs. Defense-GAN and the Adversarial Perturbation Elimination GAN (APE-GAN) are two common algorithms used for this purpose.
Defense-GAN [79] trains a generator to learn the distribution of benign images, as illustrated in Fig. 5 [79]. During testing, Defense-GAN cleanses adversarial inputs by finding an image similar to the adversarial input within its learned distribution and then feeding this benign image into the classifier. This approach helps defend against various adversarial attacks. Currently, the most effective attack against Defense-GAN is based on backward pass differentiable approximation [17], which can reduce its accuracy to about 55%.
APE-GAN [80] directly trains a generator to cleanse adversarial samples by taking them as input and producing benign counterparts. Although APE-GAN performs well in the testbed of Ref. [80], it is vulnerable to the adaptive white-box CW2 attack proposed in Ref. [81], which can easily defeat APE-GAN.

Figure 5: The pipeline of Defense-GAN [79]. G: the generative model, which can generate a high-dimensional input sample from a low-dimensional vector z; R: the number of random vectors produced by the random number generator.
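A compact sketch of the Defense-GAN cleansing step shown in Fig. 5 is given below: gradient descent over R random latent vectors searches for a generated image close to the (possibly adversarial) input, and the closest reconstruction is classified instead of the input. The generator(z) interface, latent dimension, and optimizer settings are assumptions for illustration.

import torch

def defense_gan_cleanse(generator, x, latent_dim=128, n_restarts=10, steps=200, lr=0.05):
    best_z, best_err = None, float("inf")
    for _ in range(n_restarts):                      # R random restarts
        z = torch.randn(x.size(0), latent_dim, requires_grad=True)
        optimizer = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):                       # L gradient-descent steps on ||G(z) - x||^2
            optimizer.zero_grad()
            err = (generator(z) - x).pow(2).mean()
            err.backward()
            optimizer.step()
        with torch.no_grad():
            final_err = (generator(z) - x).pow(2).mean().item()
        if final_err < best_err:
            best_z, best_err = z.detach(), final_err
    return generator(best_z).detach()                # benign projection of x, fed to the classifier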

6 Proposed Work
This project aims to advance the state of the art in adversarial defences by:

Game-Theoretic Formulation: Developing a formal game-theoretic framework to model the interaction between adversarial attackers and defenders, focusing on the strategic adaptation of both parties.

Randomized Smoothing: Implementing and analysing randomized smoothing as a defence mechanism, interpreting it through a game-theoretic lens to understand its robustness properties and limitations.

Defence-GAN: Enhancing the Defence-GAN approach, where generative adversarial networks are used to transform adversarial examples back to the original data distribution. We explore the interplay between the generator (defender) and discriminator (attacker) in a game-theoretic context.

Empirical Validation: Conducting extensive experiments on benchmark datasets to evaluate the effectiveness of the proposed defence mechanisms, comparing their performance against existing methods.

Our approach differs from previous research by integrating game-theoretic insights with practical defence techniques, providing a comprehensive strategy that adapts to evolving adversarial tactics while offering provable robustness guarantees. This research not only contributes to the theoretical understanding of adversarial defence but also provides practical tools and methodologies for enhancing the security of deep learning systems.

6.1 Steps of Workflow


1. Data Collection & Preprocessing

   • Gather and preprocess the dataset, ensuring it is suitable for training deep learning models.
   • Split the data into training, validation, and testing sets.

2. Baseline Model Training

   • Train a baseline deep learning model (e.g., CNN, LSTM) on the clean training data.
   • Evaluate the baseline model’s performance on the validation set to establish a benchmark.

3. Adversarial Attack Generation

   • Generate adversarial examples using various attack techniques (e.g., FGSM, PGD) on the validation set.
   • Analyse the impact of these adversarial examples on the baseline model’s performance.

4. Game-Theoretic Framework Development

   • Develop a game-theoretic framework to model the interaction between the attacker and defender.
   • Define the strategies for both players (attacker and defender) and establish the payoff matrices.

5. Randomized-Smoothing Defence

   • Implement the randomized smoothing mechanism by adding controlled noise to the inputs (see the prediction sketch after this list).
   • Adapt the noise levels based on input characteristics and model confidence.
   • Analyse the robustness properties of the smoothed model through a game-theoretic lens.

6. Defence-GAN Implementation

   • Train a GAN to learn the distribution of clean data and use it to transform adversarial examples back to their original distribution.
   • Optimize the GAN’s training process within the game-theoretic framework, modelling the generator (defender) and discriminator (attacker) as strategic players.

7. LSTM-based Techniques Integration

   • Incorporate LSTM networks within the defence framework to model sequential data and detect temporal dependencies exploited by adversarial attacks.
   • Train the LSTM-based model on clean and adversarial examples, enhancing its robustness.

8. Hybrid Defence Mechanism

   • Combine the randomized smoothing and Defence-GAN techniques to create a hybrid defence mechanism.
   • Evaluate the hybrid defence against various adversarial attacks, analysing its performance and robustness.

9. Empirical Validation

   • Conduct extensive experiments on benchmark datasets to evaluate the effectiveness of the proposed defence mechanisms.
   • Compare the performance of the defended model against the baseline and existing defence methods.

10. Scalability & Efficiency Analysis

   • Assess the computational efficiency and scalability of the proposed defence mechanisms.
   • Ensure that the defences can be applied in real-world settings without prohibitive resource requirements.

11. Reporting & Documentation

   • Document the entire workflow, including methodology, experimental setup, results, and analysis.
   • Prepare a comprehensive project report detailing the findings and contributions of the research.
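As referenced in step 5, the following is a minimal sketch of the prediction side of randomized smoothing: the smoothed classifier labels an input by majority vote over Gaussian-noised copies. The noise level and sample count are illustrative, and the formal certification step is omitted.

import torch

@torch.no_grad()
def smoothed_predict(model, x, sigma=0.25, n_samples=100, num_classes=10):
    # Monte-Carlo estimate of the smoothed classifier's prediction.
    model.eval()
    counts = torch.zeros(x.size(0), num_classes)
    for _ in range(n_samples):
        noisy = x + sigma * torch.randn_like(x)
        preds = model(noisy).argmax(dim=1)
        counts[torch.arange(x.size(0)), preds] += 1
    return counts.argmax(dim=1)                      # majority-vote label per input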

6.2 Flowchart of Proposed Work

Figure 6: Flowchart of the proposed work.

7 Discussions
7.1 White-box and black-box attacks
When comparing white-box and black-box scenarios from adversaries’ viewpoints, the primary distinction lies in their level of access to the target model. In white-box scenarios, adversaries possess full access to the model’s structure and weights. Consequently, they can compute accurate model gradients or approximate them using methods outlined in previous literature [17]. Furthermore, adversaries can adjust their attack strategies based on the defense mechanisms and parameters in place. However, many heuristic defenses are ineffective against such adaptive adversaries.
Conversely, in black-box scenarios, adversaries lack access to the model’s internal details. Therefore, they must infer model gradients from limited information. Without specific knowledge about the model, unbiased estimation of model gradients can be achieved through ensembles of pre-trained models with different random seeds. Notably, a momentum gradient-based method utilizing this gradient estimation achieved top honors in the NIPS 2017 Challenge under a black-box setting [18].
Chen et al. [104] delve into another black-box setting where adversaries are granted additional query access to the target model’s outputs. In this scenario, adversaries can deduce gradients from the model’s responses to well-designed inputs. Utilizing a zero-order method in this setting allows for significantly improved estimation of model gradients. However, one drawback of this approach is its requirement for a large number of queries, which scales with the data dimension.
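To illustrate the query-based setting, the sketch below estimates an input gradient with central finite differences over randomly chosen coordinates; loss_fn is a hypothetical query-only wrapper returning the model’s loss (or a confidence-based score) for an input, and the coordinate budget is an illustrative assumption.

import torch

def zeroth_order_gradient(loss_fn, x, h=1e-3, n_coords=128):
    # Estimate d loss / d x along a random subset of coordinates using only queries.
    grad = torch.zeros_like(x)
    flat_grad = grad.view(-1)
    coords = torch.randperm(x.numel())[:n_coords]
    for i in coords:
        e = torch.zeros_like(x).view(-1)
        e[i] = h
        e = e.view_as(x)
        flat_grad[i] = (loss_fn(x + e) - loss_fn(x - e)) / (2 * h)   # central difference
    return grad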

8 Conclusion
In conclusion, this report has provided a comprehensive overview of the game-theoretical strategies employed for generating adversarial manipulations in deep learning models. Adversaries aim to inject small changes into data distributions, leading to misclassification by deep learning models. The theoretical goal of adversarial deep learning is to determine whether such manipulations have reached a decision boundary, resulting in misclassification of labels.
Through Stackelberg games, adversaries target the misclassification performance of deep learning models, solving for optimal attack policies. Sequential game-theoretical formulations model the interaction between adversaries and deep learning models, resulting in adversarial manipulations. By extending to stochastic game-theoretical formulations, we introduce multi-player Stackelberg games with stochastic payoff functions, resolved through Nash equilibrium.
Adversaries optimize variational payoff functions via data randomization strategies, manipulating deep learning models into misclassifying labels. The generated adversarial data simulates continuous interactions with classifier learning processes. Our findings suggest that learning algorithms based on game-theoretical modeling and mathematical optimization offer a promising approach to building more secure deep learning models.
In cybersecurity applications, these game-theoretical solution concepts lead to robust deep neural networks that are resilient to subsequent data manipulations by adversaries. This underscores the significance of integrating game-theoretical approaches into the design and development of deep learning models to enhance security and robustness against adversarial attacks.

References
[1] Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. In International Conference on Machine Learning, pages 284–293. PMLR, 2018.

[2] Tom B. Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer. Adversarial patch. arXiv preprint arXiv:1712.09665, 2017.

[3] Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW), pages 1–7. IEEE, 2018.

[4] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.

[5] Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang. TextBugger: Generating adversarial text against real-world applications. arXiv preprint arXiv:1812.05271, 2018.

[6] Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, and Wenchang Shi. Deep text classification can be fooled. arXiv preprint arXiv:1704.08006, 2017.

[7] Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. Trojaning attack on neural networks. In 25th Annual Network and Distributed System Security Symposium (NDSS 2018). Internet Society, 2018.

[8] Kui Ren, Tianhang Zheng, Zhan Qin, and Xue Liu. Adversarial attacks and defenses in deep learning. Engineering, 6(3):346–360, 2020.

[9] Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K. Reiter. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, pages 1528–1540, New York, NY, USA, 2016. Association for Computing Machinery.

[10] Shorya Sharma. Game theory for adversarial attacks and defenses. arXiv preprint arXiv:2110.06166, 2021.

[11] Hiromu Yakura and Jun Sakuma. Robust audio adversarial example for a physical attack. arXiv preprint arXiv:1810.11793, 2018.

