
Deep Learning – BCSE332L

Advanced Neural Networks

Dr. R. Bhargavi
Professor
SCOPE
VIT University


Object Localization
• Object localization involves identifying the location and extent of objects within an image or a video frame.
• The goal is to accurately determine the bounding box coordinates (usually in the form of a rectangle) that enclose the object of interest.
• Object localization typically assumes that there is only one instance of the object to be localized within the image.
• The output of object localization is the bounding box coordinates that specify the position and size of the object within the image.
• A bounding box describes the spatial location of an object. The bounding box is a rectangular box, determined by the x and y coordinates of the upper-left corner and the lower-right corner of the rectangle. Another commonly used bounding box representation is the (x, y) coordinates of the bounding box center together with the width and height of the box.
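A minimal sketch, assuming NumPy arrays, of converting between the two bounding box representations just described (corner coordinates vs. center plus width and height):

```python
import numpy as np

def corner_to_center(box):
    """(x1, y1, x2, y2) corners -> (cx, cy, w, h) center format."""
    x1, y1, x2, y2 = box
    return np.array([(x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1])

def center_to_corner(box):
    """(cx, cy, w, h) center format -> (x1, y1, x2, y2) corners."""
    cx, cy, w, h = box
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

# Example: a box with upper-left corner (10, 20) and lower-right corner (50, 80)
print(corner_to_center(np.array([10.0, 20.0, 50.0, 80.0])))  # [30. 50. 40. 60.]
```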
Object Localization (cont…)
[Figure: Classification vs. Classification with Localization. In the localization case the network also outputs a bounding box around the detected dog, described by its center (x, y), width, and height.]


Object Localization (cont…)
• Output layer: y = [Pc, Bx, By, Bh, Bw, C1, C2, C3]
• Pc indicates whether an object is present, (Bx, By, Bh, Bw) describe the bounding box, and C1, C2, C3 are the class scores.
Object Localization – Defining target Label
• Assume that we have 3 classes (Dog - C1, Cat - C2, Human - C3).
• If the image has a Dog, then Y = [1, 0.4, 0.3, 0.5, 0.8, 1, 0, 0].
• If the image has only background and no other object, then Y = [0, x, x, x, x, x, x, x] (x denotes "don't care").
• Sum squared loss L(ŷ, y):
  • When Pc = 1: $(\hat{y}_1 - y_1)^2 + (\hat{y}_2 - y_2)^2 + \dots + (\hat{y}_n - y_n)^2$, where n is the number of components of the vector y.
  • When Pc = 0: $(\hat{y}_1 - y_1)^2$.
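A small illustrative sketch of the masked sum-squared loss above (NumPy assumed; the predicted vector is made up for illustration):

```python
import numpy as np

def localization_loss(y_pred, y_true):
    """Sum-squared loss: all n components when Pc = 1, only the Pc term when Pc = 0."""
    if y_true[0] == 1:
        return float(np.sum((y_pred - y_true) ** 2))
    return float((y_pred[0] - y_true[0]) ** 2)

y_true = np.array([1, 0.4, 0.3, 0.5, 0.8, 1, 0, 0])            # the Dog example above
y_pred = np.array([0.9, 0.42, 0.28, 0.55, 0.75, 0.8, 0.1, 0.1])
print(localization_loss(y_pred, y_true))
```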


Object Detection
• Object detection involves identifying and localizing multiple objects of interest within an image.
• The task includes both locating the objects and classifying them into predefined categories or classes.
• Object detection assumes that there may be multiple instances of different objects belonging to various classes within the image.
• The output of object detection is a list of bounding boxes, each associated with a class label indicating the type of object detected within the box.


Object Detection – Sliding Window
• The sliding window technique involves moving a window of fixed size across the entire image, typically in a grid-like fashion with a predefined step size.
• At each position of the window, the content within the window is extracted as a region of interest (ROI) and fed into a classifier to determine whether it contains an object of interest or not.
• The algorithm is run with different window sizes and different strides.
• The size of the sliding window and the step size (stride) by which it moves across the image are crucial parameters.
• The window size should be chosen based on the size range of objects to be detected, and the step size determines the degree of overlap between adjacent windows.
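A minimal sketch of the sliding-window loop described above (NumPy assumed; `classify_roi` is a hypothetical classifier returning an object score):

```python
import numpy as np

def sliding_window_detect(image, classify_roi, window=(64, 64), stride=16, thresh=0.5):
    """Slide a fixed-size window over the image and keep windows the classifier accepts."""
    h, w = image.shape[:2]
    win_h, win_w = window
    detections = []
    for y in range(0, h - win_h + 1, stride):
        for x in range(0, w - win_w + 1, stride):
            roi = image[y:y + win_h, x:x + win_w]   # region of interest
            score = classify_roi(roi)               # hypothetical classifier
            if score > thresh:
                detections.append((x, y, x + win_w, y + win_h, score))
    return detections
```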
Object Detection – Sliding Window (cont…)
• Object detection with a sliding window can be computationally expensive, especially when using a dense grid of windows and/or extracting high-dimensional feature vectors for each window.
• It may also suffer from scale and aspect ratio limitations, as objects at different scales may require different window sizes, and fixed-size windows may not be able to handle significant scale variations efficiently.
• Object detection with sliding windows has been widely used in the past and has paved the way for more sophisticated object detection techniques.
• Modern approaches, such as region-based convolutional neural networks (R-CNNs), Faster R-CNN, and You Only Look Once (YOLO), have largely replaced sliding-window-based methods due to their superior performance and efficiency.


Object Detection (cont…)
• Two-Stage Object Detectors: These models provide high accuracy by leveraging a separate region proposal stage (e.g. selective search or a region proposal network, RPN) for precise object localization. They excel in complex scenes by capturing robust feature representations and multi-scale information.
  • Examples: R-CNN, Fast R-CNN, Mask R-CNN, etc.
• One-Stage Object Detectors: These models directly predict bounding boxes and class labels in a single step. They estimate the number of objects, classify them, and determine their size and positions using bounding boxes.
  • Examples: YOLO (You Only Look Once)


Intersection Over Union (IoU)
• Intersection over Union (IoU) is a metric commonly used to evaluate the performance of object detection algorithms.
• It quantifies the overlap between the predicted bounding boxes and the ground-truth bounding boxes of objects within an image.
• IoU is calculated as the ratio of the area of intersection between the predicted bounding box and the ground-truth bounding box to the area of their union.
• Mathematically, it is defined as:
  IoU = Area of Intersection / Area of Union
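A minimal sketch of the IoU computation, assuming boxes in (x1, y1, x2, y2) corner format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two corner-format boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```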


IoU (cont…)
• IoU provides a measure of how well the predicted bounding box aligns with the ground-truth bounding box. A higher IoU indicates better alignment and thus a more accurate detection.
• IoU is often used as a threshold to determine whether a detection is considered a true positive or a false positive.
• A common threshold value for IoU is 0.5, meaning that if the IoU between a predicted bounding box and a ground-truth bounding box exceeds 0.5, the detection is considered a true positive; otherwise, it is considered a false positive.
• Different threshold values can be used depending on the specific requirements of the task or application.
Non-Maximum Suppression (NMS)
• Non-Maximum Suppression (NMS) is a post-processing technique commonly used in object detection algorithms to eliminate redundant or overlapping bounding box predictions.
• Its primary goal is to select the most confident and non-overlapping detections while discarding redundant ones.
Non-Maximum Suppression (Cont…)
• NMS takes as input a set of bounding box predictions generated by the object detection algorithm, along with their corresponding confidence scores.
• The first step of NMS involves sorting the bounding box predictions based on their confidence scores in descending order. This ensures that the most confident predictions are considered first.
• Starting from the top of the sorted list, the bounding box with the highest confidence score is selected as a detection.
• All other bounding boxes that have a significant overlap (measured using IoU) with the selected bounding box are suppressed or removed from consideration.
• A threshold IoU value is used to determine whether two bounding boxes overlap significantly.
• Bounding boxes with IoU above the threshold are considered to be overlapping, and only the one with the highest confidence score is retained.


Non-Maximum Suppression (Cont…)
• The selection process is repeated iteratively for each remaining bounding box in the sorted list until no more bounding boxes remain.
• At each iteration, the selected bounding box is added to the final list of detections, and overlapping bounding boxes are suppressed.
• The output of Non-Maximum Suppression is a list of selected bounding boxes, each associated with its corresponding confidence score.
NMS (cont…)
• Step 1: Select the prediction S with the highest confidence score, remove it from P, and add it to the final prediction list keep.
• Step 2: Compare this prediction S with all the predictions remaining in P by calculating the IoU of S with every other prediction in P. If the IoU is greater than the threshold thresh_iou for any prediction T in P, remove T from P.
• Step 3: If there are still predictions left in P, go to Step 1 again; otherwise return the list keep containing the filtered predictions.
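A minimal sketch of these three steps (reusing the `iou` helper from the IoU slide; predictions are assumed to be (box, score) pairs, and `thresh_iou` matches the threshold named above):

```python
def non_max_suppression(predictions, thresh_iou=0.5):
    """predictions: list of ((x1, y1, x2, y2), score). Returns the filtered list keep."""
    P = sorted(predictions, key=lambda p: p[1], reverse=True)
    keep = []
    while P:                                   # Step 3: repeat while P is non-empty
        S = P.pop(0)                           # Step 1: highest-confidence prediction
        keep.append(S)
        # Step 2: remove every remaining prediction T whose IoU with S is too high
        P = [T for T in P if iou(S[0], T[0]) <= thresh_iou]
    return keep
```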


RCNN - Regions with CNN features
• Region-based Convolutional Neural Network (R-CNN) is a pioneering object detection framework introduced by Ross Girshick et al. in 2014.
• R-CNN: Regions with CNN features.
• It consists of three modules:
  • Region Proposal module: generates and extracts category-independent region proposals, e.g. candidate bounding boxes (~2000).
  • Feature Extractor module: extracts a feature vector from each candidate region, e.g. using a deep convolutional neural network (pretrained AlexNet).
  • Classifier module: classifies the features as one of the known classes, e.g. a linear SVM classifier (one per class).
• A bounding box regressor is used to fine-tune the localization.


RCNN (cont…)
[Figure: R-CNN pipeline, from https://arxiv.org/pdf/1311.2524.pdf]
Region Proposal – Selective search
• The first step in R-CNN is to generate region proposals, which are candidate bounding boxes that potentially contain objects.
• These region proposals serve as input to the subsequent stages of the network.
• The Selective Search algorithm is used to generate region proposals.
• This algorithm generates a large number of region proposals by grouping pixels into segments based on color, texture, or other low-level features and then combining them hierarchically.


Region Proposal (cont…)
[Figure: example region proposals generated by selective search.]
Feature Extraction
Supervised pre-training:
• AlexNet trained on the ImageNet dataset for classification with 1000 classes is used.
Domain-specific fine-tuning:
• Fine-tune the network to learn
  a) the visual features of the new types of images (distorted region proposals), and
  b) the specific target classes of the smaller dataset for the detection task.
• The final 1000-way classification layer of the CNN from pre-training is replaced with a randomly initialized (N+1)-way softmax classification layer for the N object classes and a general background class of the detection task.
• For each region proposal, the corresponding region of the input image is cropped and resized to a fixed size (227 × 227).


Feature Extraction (cont…)
• Each object proposal is mapped to the ground-truth instance with which it has maximum IoU overlap and labeled as a positive (for the matched ground-truth class) if the IoU is at least 0.5.
• The rest of the boxes are treated as the background class (negative for all classes).
Training pipeline:
• The network is trained using SGD with a learning rate of 0.001 ((1/10)th of the initial pre-training learning rate).
• In each iteration, 32 windows that are positive over all the classes and 96 windows that belong to the background class are sampled to form a mini-batch of 128, to ensure that there is enough representation from the positive classes during training.
The final output: after training, the final classification layer is removed and a 4096-dimensional feature vector is obtained from the penultimate layer of the CNN for each of the 2000 region proposals (for every image).


Object classification - SVM
• A linear SVM classifier is used for each class; it detects the presence or absence of an object belonging to that particular class.
• Input: the 4096-d feature vector for each region proposal.
Labels for training:
• The features of all region proposals that have an IoU overlap of less than 0.3 with the ground-truth bounding box are considered negatives for that class during training.
• The positives for that class are simply the features from the ground-truth bounding boxes themselves.
• All other proposals (IoU overlap greater than 0.3, but not a ground-truth bounding box) are ignored for training the SVM.


Object classification – SVM (cont…)
Test-time inference – single image:
• For the test image, a 2000 × 4096 feature matrix is generated (the 4096-d CNN feature for each of the 2000 region proposals).
• The SVM weight matrix is 4096 × N, where N is the number of classes.
• The class-specific dot products between the features and SVM weights are consolidated into a single matrix-matrix product for an image.
Output of the SVM stage: a set of positive object proposals for each class, obtained from the CNN features of the 2000 region proposals (of every image).
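A toy sketch of that consolidated matrix product (shapes as on the slide; random values stand in for real CNN features and learned SVM weights, and N = 20 classes is an assumption):

```python
import numpy as np

features = np.random.randn(2000, 4096)    # one 4096-d CNN feature per region proposal
svm_weights = np.random.randn(4096, 20)   # 4096 x N weight matrix (N = 20 assumed)
scores = features @ svm_weights           # (2000, 20): per-proposal, per-class SVM scores
print(scores.shape)
```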
Bounding Box Regression
• Input: CNN output features.
• A simple bounding-box regression is used to improve localization performance.
• The aim is to find the transformation from the predicted bounding box, defined by $P^i = (P^i_x, P^i_y, P^i_w, P^i_h)$ where $i = 1, 2, \dots, N$, to the ground truth $G = (G_x, G_y, G_w, G_h)$.
• First, the following are the regression targets (the transformations to be learned); a reconstruction of the formulas is given after the next slide.
Bounding Box Regression (cont…)
• The predicted ground truth is given by applying the learned transformations to the proposal box (see the reconstruction below).
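The formulas referenced on these two slides were figures; a hedged reconstruction, following the standard R-CNN bounding-box regression (Girshick et al., 2014), is:

```latex
% Regression targets (the transformations to be learned)
t_x = \frac{G_x - P_x}{P_w}, \quad
t_y = \frac{G_y - P_y}{P_h}, \quad
t_w = \log\frac{G_w}{P_w}, \quad
t_h = \log\frac{G_h}{P_h}

% Predicted ground truth, where d_x(P), d_y(P), d_w(P), d_h(P) are the learned functions
\hat{G}_x = P_w\, d_x(P) + P_x, \quad
\hat{G}_y = P_h\, d_y(P) + P_y, \quad
\hat{G}_w = P_w \exp\!\big(d_w(P)\big), \quad
\hat{G}_h = P_h \exp\!\big(d_h(P)\big)
```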
RCNN - Architecture
[Figure: overall R-CNN architecture.]
Disadvantages
• Object detection is very slow. For each image, the selective search algorithm proposes about 2,000 RoIs to be examined by the entire pipeline (CNN feature extractor and classifier). This is very computationally expensive.
• Training is a multi-stage pipeline (CNN feature extractor, SVM classifier, and bounding-box regressors). Thus the training process is very complex and not end-to-end.
• Training is expensive in terms of space and time.
Fast-RCNN
• Immediate descendant of R-CNN.
• The following changes were made in Fast R-CNN:
  • In Fast R-CNN, the CNN feature extractor is first applied to the entire input image, and the region proposals are then taken from the resulting feature map.
  • It extends the ConvNet's job to do the classification part as well, by replacing the traditional SVM machine learning algorithm with a softmax layer.
Fast-RCNN Architecture
[Figure: Fast R-CNN architecture.]
Changing FC to Convolution layer
[Figure: a small classifier takes a 14×14×3 input through a 5×5 conv with 16 filters (10×10×16) and a 2×2 pool (5×5×16), followed by two fully connected layers of 400 units and a softmax over 4 classes. In the convolutional version, the FC layers are replaced by a 5×5 conv with 400 filters (1×1×400), a 1×1 conv with 400 filters (1×1×400), and a final 1×1 conv giving 1×1×4 class scores.]


Changing FC to Convolution layer
[Figure: the same fully convolutional network applied to a larger 16×16×3 input: 5×5 conv @16 (12×12×16), 2×2 pool (6×6×16), 5×5 conv @400 (2×2×400), 1×1 conv @400 (2×2×400), 1×1 conv (2×2×4). A single forward pass therefore produces class scores for a 2×2 grid of window positions.]
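A minimal PyTorch-style sketch of the idea shown in these figures (the layer sizes follow the figure; this is an illustration, not the exact network from the slides):

```python
import torch
import torch.nn as nn

# Fully convolutional version of the small classifier: the former FC layers become
# a 5x5 conv and 1x1 convs, so a larger input yields a grid of class-score vectors.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5),     # 14x14x3 -> 10x10x16
    nn.MaxPool2d(2),                     # -> 5x5x16
    nn.Conv2d(16, 400, kernel_size=5),   # replaces the first FC-400 layer -> 1x1x400
    nn.ReLU(),
    nn.Conv2d(400, 400, kernel_size=1),  # replaces the second FC-400 layer -> 1x1x400
    nn.ReLU(),
    nn.Conv2d(400, 4, kernel_size=1),    # replaces the 4-way output layer -> 1x1x4
)

print(net(torch.randn(1, 3, 14, 14)).shape)  # torch.Size([1, 4, 1, 1])
print(net(torch.randn(1, 3, 16, 16)).shape)  # torch.Size([1, 4, 2, 2]): one prediction per window position
```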


YOLO
• You Only Look Once (YOLO) is a state-of-the-art, real-time object detection algorithm introduced in 2015.
• Extremely fast (45 frames per second).
• Learns generalizable representations of objects.
• Object detection is treated as a regression problem instead of a classification task.
• Single stage: from image pixels to bounding box coordinates and class probabilities in one pass.
YOLO Object Detection
• Resize the input image to 448 × 448,
• run a single convolutional network on the image, and
• threshold the resulting detections by the model's confidence.
• A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes.
• YOLO sees the entire image during training and test time, so it implicitly encodes contextual information about classes as well as their appearance.
• YOLO still lags behind state-of-the-art detection systems in accuracy.


YOLO (cont…)
• Divides the input image into an s × s grid.
• If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
• Each cell predicts B bounding boxes and an objectness score that indicates the probability that the cell contains an object, a bounding box prediction for objects whose center falls inside that particular cell, and the class prediction; i.e. each grid cell specifies y = [Po, Bx, By, Bh, Bw, C1, C2, C3] (considering 3 classes).
YOLO (cont…)
• The objectness score is calculated as follows:
  • Po = Pr(containing an object) × IoU(pred, truth)
• The output is a volume s × s × 8 (considering 3 classes and one box per cell).
• In order to allow multiple objects to be detected by a single grid cell, anchor boxes are introduced.
[Figure: Anchor Box 1 and Anchor Box 2, two anchor shapes associated with one grid cell.]
YOLO (cont…)
• Each object in the training image is assigned to the grid cell that contains the object's midpoint and to the anchor box of that grid cell with the highest IoU.
• The output is a volume s × s × (5B + K), where B is the number of anchor boxes and K is the number of classes.
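A small sketch of the output volume and of assigning an object to its responsible grid cell (the grid size, anchor count, and image size below are assumed values):

```python
S, B, K = 7, 2, 3                  # grid size, anchor boxes per cell, classes (assumed)
output_shape = (S, S, 5 * B + K)   # (7, 7, 13): [Po, Bx, By, Bh, Bw] per anchor + K class scores
print(output_shape)

def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Row and column of the grid cell containing the object's midpoint (cx, cy)."""
    col = int(cx / img_w * S)
    row = int(cy / img_h * S)
    return row, col

print(responsible_cell(cx=224, cy=100, img_w=448, img_h=448))  # (1, 3)
```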
YOLO (cont…)
• Combine the box and class label predictions.
[Figure: per-box predictions combined with class probabilities.]
YOLO (cont…)
[Figure]
YOLO (cont…)
• For each class, use non-max suppression to generate the final predictions.
YOLO Architecture
[Figure: YOLO network architecture.]
Loss Function
• $\mathbb{1}^{\text{obj}}_{i}$ denotes if an object appears in cell i.
• $\mathbb{1}^{\text{obj}}_{ij}$ denotes that the jth bounding box predictor in cell i is "responsible" for that prediction.
• $\lambda_{\text{noobj}}$ was assigned a value of 0.5 and $\lambda_{\text{coord}}$ was set to 5.
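The loss itself appeared as a figure on this slide; a hedged reconstruction following the YOLOv1 formulation (Redmon et al.) is:

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \Big[(x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2\Big] \\
 &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \Big[\big(\sqrt{w_i}-\sqrt{\hat{w}_i}\big)^2 + \big(\sqrt{h_i}-\sqrt{\hat{h}_i}\big)^2\Big] \\
 &+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \big(C_i-\hat{C}_i\big)^2
  + \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \big(C_i-\hat{C}_i\big)^2 \\
 &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \big(p_i(c)-\hat{p}_i(c)\big)^2
\end{aligned}
```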


Autoencoder
• An autoencoder is a neural network that is trained to attempt to copy its input to its output.
• Its primary purpose is to learn efficient representations of data by training the network to reconstruct its input data as accurately as possible.
• It encodes its input $x_i$ into a hidden representation h: $h = g(Wx_i + b)$
• It decodes the input again from this hidden representation: $\hat{x}_i = f(W^* h + b)$
• The model is trained to minimize a loss function which ensures that $\hat{x}_i$ is close to $x_i$.
[Figure: $x_i$ → (W) → h → (W*) → $\hat{x}_i$]


Autoencoder (cont…)
• The encoder part of an autoencoder takes the input data and compresses it into a lower-dimensional representation, also known as the "latent space" or "encoding".
• The encoder consists of multiple layers, typically composed of convolutional layers, dense layers, or a combination of both, depending on the type of autoencoder.
• The goal of the encoder is to learn a compact and meaningful representation of the input data that captures its essential features.
• It encodes its input $x_i$ into a hidden representation h: $h = g(Wx_i + b)$


Autoencoder (cont…)
• The decoder part of an autoencoder takes the encoded representation (latent space) from the encoder and reconstructs the original input data.
• Similar to the encoder, the decoder consists of multiple layers that perform the inverse operation of the encoder, gradually expanding the latent representation back to the original input dimensions.
• The goal of the decoder is to generate a reconstruction of the input data that closely matches the original input.
• It decodes the input from the hidden representation: $\hat{x}_i = f(W^* h + b)$
Autoencoder (cont…)
Training:
• During training, the autoencoder is trained using a loss function that measures the difference between the input data and the reconstructed output.
• The most common loss function used for autoencoders is the mean squared error (MSE) loss, which penalizes the difference between the input and output at each data point.
• The network adjusts its parameters (weights and biases) through backpropagation and gradient descent to minimize the reconstruction error, thus learning to encode and decode the input data effectively.
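A minimal PyTorch sketch of such an autoencoder trained with MSE loss (the layer sizes and the random batch are assumptions for illustration):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: h = g(Wx + b), compressing the input to the latent dimension
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        # Decoder: x_hat = f(W*h + b), expanding back to the input dimension
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

x = torch.rand(64, 784)               # a dummy batch standing in for real data
for _ in range(5):                    # a few illustrative gradient steps
    optimizer.zero_grad()
    loss = criterion(model(x), x)     # reconstruction error against the input itself
    loss.backward()
    optimizer.step()
```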


Autoencoder - Applications
• Data Compression: learning efficient representations for data compression tasks, such as image or audio compression.
• Anomaly Detection: detecting anomalies or outliers in data by comparing reconstruction errors.
• Feature Learning: learning meaningful features from data for downstream tasks like classification or clustering.
• Denoising: removing noise from data by training the autoencoder to reconstruct clean versions of corrupted input data.


Undercomplete Autoencoders
• An autoencoder whose code dimension is less than the input dimension (i.e. the size of the latent space is smaller than the size of the input data) is called undercomplete.
• So when dim(h) < dim($x_i$) and the autoencoder is still able to reconstruct $\hat{x}_i$ perfectly, we say that h is a loss-free encoding of $x_i$: it captures all the important characteristics of $x_i$.


Overcomplete Autoencoders
• An overcomplete autoencoder is a type of autoencoder where the size of the latent space (encoding) is larger than the size of the input data.
• So when dim(h) > dim($x_i$), the autoencoder could learn a trivial encoding by simply copying $x_i$ into h and then copying h into $\hat{x}_i$.
• Applications: feature learning, dimensionality expansion, data denoising, etc.


Choice of the Functions f and g
• When all the inputs are binary (each $x_{ij} \in \{0, 1\}$), the most apt function for the decoder is the logistic function (sigmoid), as it naturally restricts all outputs to be between 0 and 1:
  $\hat{x}_i = \text{logistic}(W^* h + b)$
• When all the inputs are real-valued, a simple linear function is apt:
  $\hat{x}_i = W^* h + b$
• In both cases the function g is chosen as the sigmoid.


Latent Space Dimension
• The dimensionality of the latent representation in an autoencoder has a significant impact on the model's performance and the quality of the learned representations.
Compression vs. Information Preservation:
• A higher-dimensional latent space allows more information to be retained from the input data. However, it may also lead to overfitting and capture noise or irrelevant details.
• On the other hand, a lower-dimensional latent space forces the model to learn a more compressed representation, potentially losing some information but focusing on the most salient features.


Latent Space Dimension (cont…)
Undercomplete vs. Overcomplete Autoencoders:
• In an undercomplete autoencoder, the model is encouraged to capture the most essential features of the data. This can lead to better generalization and robustness to noise.
• In contrast, an overcomplete autoencoder may capture more complex patterns and variations but could be prone to overfitting and learning redundant representations.
Reconstruction Quality:
• A higher-dimensional latent space can potentially lead to better reconstruction quality, as it allows the model to capture more details and nuances of the input data.
• However, a lower-dimensional latent space forces the model to prioritize the most critical features, which can also result in high-quality reconstructions if the model learns meaningful representations.
Latent Space Dimension (cont…)
Computational Complexity:
• Increasing the dimensionality of the latent space also increases the computational complexity of the autoencoder model, both in terms of memory usage and training time.
• Lower-dimensional latent spaces are computationally more efficient but may sacrifice some reconstruction quality or representation richness.
Overfitting and Generalization:
• A larger latent space in an autoencoder can lead to overfitting, especially if the model capacity is not appropriately controlled or regularized.
• Conversely, a smaller latent space promotes better generalization by encouraging the model to learn more abstract and generalizable representations.


Autoencoder - Hyperparameters
• The quality of the autoencoder latent representation depends on many factors (the autoencoder model hyperparameters), such as:
  • number of hidden layers,
  • number of nodes in each layer,
  • dimension of the latent vector,
  • type of activation function in hidden layers,
  • type of optimizer,
  • learning rate,
  • number of epochs,
  • batch size, etc.


Denoising Autoencoder
• A denoising autoencoder is a type of autoencoder designed to learn robust and noise-resistant representations of data by reconstructing clean versions of corrupted input data.
• It is particularly useful for tasks where the input data may be noisy or corrupted, such as image denoising, audio denoising, or data preprocessing in machine learning pipelines.
• The denoising autoencoder (DAE) is an autoencoder that receives a corrupted data point as input and is trained to predict the original, uncorrupted data point as its output.
• A DAE minimizes the loss $L(\boldsymbol{x}, g(f(\tilde{\boldsymbol{x}})))$, where $\tilde{\boldsymbol{x}}$ is a copy of $\boldsymbol{x}$ that has been corrupted by some form of noise.
• The DAE is an example of how overcomplete, high-capacity models may be used as autoencoders so long as care is taken to prevent them from learning the identity function.


Denoising Autoencoder
• Sample a training example $\boldsymbol{x}$ from the training data.
• Sample a corrupted version $\tilde{\boldsymbol{x}}$ from $C(\tilde{\boldsymbol{x}} \mid \boldsymbol{x})$.
• Use $(\boldsymbol{x}, \tilde{\boldsymbol{x}})$ as a training example for estimating the autoencoder reconstruction distribution $p_{\text{reconstruct}}(\boldsymbol{x} \mid \tilde{\boldsymbol{x}}) = p_{\text{decoder}}(\boldsymbol{x} \mid \boldsymbol{h})$, with $\boldsymbol{h}$ the output of the encoder $g(\tilde{\boldsymbol{x}})$ and $p_{\text{decoder}}$ typically defined by a decoder $f(\boldsymbol{h})$.
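A small sketch of this procedure, assuming Gaussian noise as the corruption process C and reusing the `Autoencoder` class from the earlier sketch:

```python
import torch
import torch.nn as nn

def corrupt(x, noise_std=0.3):
    """C(x_tilde | x): add Gaussian noise and clamp values back to [0, 1]."""
    return (x + noise_std * torch.randn_like(x)).clamp(0, 1)

model = Autoencoder()                       # from the earlier autoencoder sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

x = torch.rand(64, 784)                     # clean training batch (dummy data)
x_tilde = corrupt(x)                        # corrupted version fed to the network
loss = criterion(model(x_tilde), x)         # reconstruct the *clean* x from x_tilde
loss.backward()
optimizer.step()
```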
Variational Autoencoders (VAEs)
• The KL divergence from p to q is defined as $D_{KL}\big(p(x) \,\|\, q(x)\big) = \sum_{x \in X} p(x) \ln \frac{p(x)}{q(x)}$


VAE (cont…)
• The goal of a VAE is to learn a distribution that closely matches the true underlying distribution of the data.
• Instead of learning a single fixed latent representation for each input, VAEs learn a probability distribution over the latent space.
Loss Function:
• VAEs are trained using a combination of a reconstruction loss and a regularization term based on the Kullback-Leibler (KL) divergence.
• The reconstruction loss measures the difference between the input data and the reconstructed data.
• The KL divergence regularization term encourages the learned distribution in the latent space to match a predefined prior distribution (usually a standard Gaussian distribution).
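A hedged sketch of that loss for the usual Gaussian-prior case (the standard VAE formulation, not taken verbatim from the slides): with encoder outputs $\mu_j$ and $\sigma_j$ for each latent dimension,

```latex
\mathcal{L}(\boldsymbol{x}) =
\underbrace{\lVert \boldsymbol{x} - \hat{\boldsymbol{x}} \rVert^2}_{\text{reconstruction loss}}
\;+\;
\underbrace{D_{KL}\!\Big(\mathcal{N}(\mu, \sigma^2) \,\big\|\, \mathcal{N}(0, I)\Big)}_{\text{regularization}},
\qquad
D_{KL} = -\tfrac{1}{2} \sum_{j} \Big(1 + \ln \sigma_j^2 - \mu_j^2 - \sigma_j^2\Big)
```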
Generative Model
• Generative modeling: a type of machine learning approach where the goal is to learn and model the underlying probability distribution of a dataset.
• A model trained using generative modeling techniques can then generate new samples that are similar to the original data distribution.
• This is in contrast to discriminative modeling, where the focus is on learning the boundary between different classes or categories within the data.
Types of Generative Models:
• Autoregressive Models
• Variational Autoencoders (VAEs)
• Generative Adversarial Networks (GANs)
• Boltzmann Machines
• Restricted Boltzmann Machines (RBMs)
Generative Adversarial Networks (GANs)
• A Generative Adversarial Network (GAN) is a class of deep learning models introduced by Ian Goodfellow and his colleagues in 2014.
• GANs have a wide range of applications across various domains due to their ability to generate realistic data samples that resemble a given dataset:
  • Image generation
  • Data augmentation
  • Video generation
  • Style transfer
  • Text-to-image synthesis
  • Anomaly detection
  • Super-resolution imaging, etc.


GAN Architecture
• GANs are based on the idea of adversarial training.
• Adversarial training is a key aspect of how GANs work and leads to the generation of realistic data samples.
• The GAN architecture consists of two neural networks that compete against each other.
Generator:
• The generator is a neural network that takes random noise or latent vectors as input and generates synthetic data samples.
• Its objective is to learn the underlying data distribution of the training dataset and produce samples that resemble real data.


GAN Architecture (cont…)
Discriminator:
• The discriminator is another neural network that acts as a binary classifier.
• It takes both real data samples from the training dataset and fake data samples generated by the generator as input.
• The discriminator's goal is to distinguish between real and fake samples, assigning probabilities or scores to indicate how likely a sample is to be real.
Adversarial Training:
• The generator tries to produce samples that fool the discriminator into classifying them as real.
• The discriminator tries to correctly distinguish between real and fake samples.


GAN Architecture (cont…)
[Figure: generator and discriminator in the GAN training loop.]
GAN Architecture (cont…) - DCGAN
[Figure: DCGAN architecture.]
GAN Architecture (cont…)
• The generator takes in random numbers as input and returns an image as the output.
• This generated image is fed into the discriminator alongside a stream of images taken from the actual, ground-truth dataset.
• The discriminator takes in both real and fake images and returns probabilities: numbers between 0 and 1, with 1 representing a prediction of authenticity (real image) and 0 representing a prediction of fake.


Generator Model
• The generator takes random noise as input and converts it into complex data samples, such as text or images.
• It is commonly depicted as a deep neural network.
• The training data's underlying distribution is captured by layers of learnable parameters in its design through training.
• As it is trained, the generator adjusts its output to produce samples that closely mimic real data, using backpropagation to fine-tune its parameters.
Transposed Convolution
• Single channel, stride 1, no padding: let the input be an nh × nw tensor and the kernel kh × kw.
• Sliding the kernel window with a stride of 1 along each row and each column yields a total of nh·nw intermediate results.
• Each intermediate result is an (nh + kh − 1) × (nw + kw − 1) tensor, initialized to zeros. To compute each intermediate tensor, each element in the input tensor is multiplied by the kernel, so that the resulting kh × kw tensor replaces a portion of that intermediate tensor.
• In the end, all the intermediate results are summed to produce the output.
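A direct NumPy sketch of the procedure just described (single channel, stride 1, no padding):

```python
import numpy as np

def trans_conv(X, K):
    """Transposed convolution: each input element scales the kernel, the scaled copy is
    placed at the corresponding offset, and all contributions are summed."""
    nh, nw = X.shape
    kh, kw = K.shape
    Y = np.zeros((nh + kh - 1, nw + kw - 1))
    for i in range(nh):
        for j in range(nw):
            Y[i:i + kh, j:j + kw] += X[i, j] * K
    return Y

X = np.array([[0.0, 1.0], [2.0, 3.0]])
K = np.array([[0.0, 1.0], [2.0, 3.0]])
print(trans_conv(X, K))
# [[ 0.  0.  1.]
#  [ 0.  4.  6.]
#  [ 4. 12.  9.]]
```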
Transposed Convolution (cont…)
Single channel, stride 2:
[Figure: transposed convolution with stride 2, where each scaled copy of the kernel is placed at an offset of 2 per input element, producing a larger output.]
Discriminator
• The discriminator can be any classification network, since the goal of the discriminator is to predict whether an image is real or fake, which is a typical supervised classification problem.
Training Discriminator
• Training the discriminator is a straightforward supervised training process.
• The network is given labeled images coming from the generator (fake) and the training data (real), and it learns to classify between real and fake images with a sigmoid prediction output.
Training Generator
• The generator model cannot be trained alone like the discriminator.
• It needs the discriminator model to tell it whether it did a good job of faking images.
• So, we create a combined network to train the generator, composed of both the discriminator and generator models.
Training Discriminator
• Freeze the generator weights and compare the real samples with generated samples.
• Train the discriminator to minimize the loss of the discriminator.
Training Generator
• Freeze the discriminator weights.
• Backpropagate the error through the discriminator to update the generator weights.
• Train the generator to generate data that "fools" the discriminator.
• Maximize the loss of the discriminator when the generated data is given as input.
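A condensed PyTorch-style sketch of the alternating procedure described on these slides (`generator` and `discriminator` are hypothetical modules; binary cross-entropy loss is assumed):

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def gan_train_step(generator, discriminator, opt_g, opt_d, real, latent_dim=100):
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator step: generator effectively frozen (detach), classify real vs. fake.
    opt_d.zero_grad()
    fake = generator(torch.randn(batch, latent_dim)).detach()
    d_loss = bce(discriminator(real), ones) + bce(discriminator(fake), zeros)
    d_loss.backward()
    opt_d.step()

    # Generator step: discriminator weights are not updated; the error is
    # backpropagated through the discriminator into the generator.
    opt_g.zero_grad()
    fake = generator(torch.randn(batch, latent_dim))
    g_loss = bce(discriminator(fake), ones)   # generator wants D to say "real"
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```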
Training GAN
[Figure: GAN training procedure.]
Training GAN (cont…)
[Figure: GAN training procedure (continued).]
Objective Function
• The training of the generator and discriminator is guided by specific loss functions:
• Generator loss: the generator's loss function encourages it to generate samples that the discriminator classifies as "real." The generator aims to minimize the probability of the discriminator correctly classifying fake samples.
• Discriminator loss: the discriminator's loss function encourages it to correctly classify real and fake samples. It aims to maximize the probability of correctly classifying real samples as "real" and fake samples as "fake."

Objective Function (cont…)
• Let Gϕ be the generator and Dθ be the discriminator (ϕ and θ are the parameters of G and D, respectively).
• The generator is a neural network which takes as input a noise vector z ∼ N(0, I) and produces Gϕ(z) = X.
• The discriminator is another neural network which takes as input a real X or a generated X = Gϕ(z) and classifies the input as real/fake.
Objective Function (cont…)
Generator objective function:
• Given an image generated by the generator as Gϕ(z), the discriminator assigns a score Dθ(Gϕ(z)) to it.
• This score is between 0 and 1 and gives the probability of the image being real or fake (1 for real and 0 for fake).
• For a given z, the generator would want to maximize log Dθ(Gϕ(z)) or, equivalently, minimize log(1 − Dθ(Gϕ(z))) (this is for a single input z).
• Now, for all possible values of z (z is continuous and z ∼ N(0, I)), the equivalent objective function is
  $\min_{\phi} \; \mathbb{E}_{z \sim \mathcal{N}(0, I)}\big[\log\big(1 - D_\theta(G_\phi(z))\big)\big]$
• If z were discrete and drawn from a uniform distribution, this expectation would become the average $\frac{1}{N} \sum_{i=1}^{N} \log\big(1 - D_\theta(G_\phi(z_i))\big)$.


Objective Function (cont…)
Discriminator objective function:
• The task of the discriminator is to assign a high score to real images and a low score to fake images, and it should do this for all possible real images and all possible fake images.
• In other words, it should try to maximize the following objective function:
  $\max_{\theta} \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D_\theta(x)\big] + \mathbb{E}_{z \sim \mathcal{N}(0, I)}\big[\log\big(1 - D_\theta(G_\phi(z))\big)\big]$
Objective Function (cont…)
Overall objective function:
• Putting together the objectives of the generator and the discriminator, we get a minimax game (the standard form is reproduced below).
• The discriminator wants to maximize the second term, whereas the generator wants to minimize it (hence it is a two-player game).
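The minimax objective appeared as a figure; a hedged reconstruction following the standard GAN formulation (Goodfellow et al., 2014) is:

```latex
\min_{\phi} \; \max_{\theta} \;\;
\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D_\theta(x)\big]
+ \mathbb{E}_{z \sim \mathcal{N}(0, I)}\big[\log\big(1 - D_\theta(G_\phi(z))\big)\big]
```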


Types of GAN
Vanilla GAN:
• Vanilla GAN is the basic form of GAN introduced by Ian Goodfellow in 2014.
• It consists of a generator that generates fake samples and a discriminator that tries to distinguish between real and fake samples.
• Key features:
  • Simple architecture with two neural networks (generator and discriminator).
  • Employs adversarial training using binary cross-entropy loss.
• Key difference: while effective, vanilla GANs may suffer from stability issues such as mode collapse and training difficulties.
• Mode collapse refers to a situation where the generator produces limited and repetitive outputs, failing to capture the full diversity of the data distribution.


Types of GAN (cont…)
Deep Convolutional GAN (DCGAN):
• DCGAN is an extension of GANs introduced by Radford et al. in 2015, specifically designed for image generation tasks.
• It incorporates convolutional layers in both the generator and discriminator networks, enabling better feature extraction and spatial coherence in generated images.
• Key features:
  • Uses convolutional layers for spatial information processing.
  • Employs batch normalization and Leaky ReLU activation functions for improved stability and convergence.
• Key difference: DCGANs are tailored for image generation tasks and offer improved stability and visual quality compared to vanilla GANs.


Types of GAN (cont…)
Conditional GAN (cGAN):
• cGANs, proposed by Mirza and Osindero in 2014, extend vanilla GANs by conditioning both the generator and discriminator on additional information.
• They allow for controlled generation by providing conditional labels or information during training.
• Key features:
  • Conditional information (labels, attributes) is provided as additional input to both the generator and discriminator.
  • Enables targeted generation based on specified conditions.
• Key difference: cGANs offer control over the generated outputs, making them suitable for tasks like image-to-image translation.


Types of GAN (cont…)
Cycle GAN (CycleGAN):
• CycleGAN, introduced by Zhu et al. in 2017, is a type of GAN designed for unpaired image-to-image translation tasks.
• It learns mappings between two domains without requiring paired examples during training.
• Key features:
  • Unpaired image translation without explicit correspondences between samples.
  • Incorporates a cycle consistency loss to enforce consistency between original and translated images.
• Key difference: CycleGANs are specialized for unpaired image translation tasks, enabling style transfer, domain adaptation, and image transformation without paired data.
