Module 6
Advanced Neural Networks
Dr. R. Bhargavi
Professor
SCOPE
VIT University
Object Localization
• Object localization typically assumes that there is only one instance of the object to be localized within the image.
• The output of object localization is the bounding box coordinates that specify the position and size of the object within the image.
• A bounding box describes the spatial location of an object.
• The bounding box is a rectangular box, determined by the x and y coordinates of the upper-left corner and lower-right corner of the rectangle. Another commonly used bounding box representation is the (x, y) coordinates of the bounding box center, together with the width and height of the box.
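The two bounding box representations described above can be converted into each other. The following is a minimal sketch; the function names and the sample pixel coordinates are illustrative assumptions, not part of the slides.

```python
def corner_to_center(box):
    """Convert (x1, y1, x2, y2) corners to (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    w = x2 - x1
    h = y2 - y1
    return (x1 + w / 2, y1 + h / 2, w, h)


def center_to_corner(box):
    """Convert (cx, cy, w, h) back to (x1, y1, x2, y2) corners."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


box = (60.0, 45.0, 378.0, 516.0)  # hypothetical upper-left and lower-right corners
center = corner_to_center(box)
restored = center_to_corner(center)
```

Converting to the center form and back returns the original corners, so both forms carry the same information.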
Dr. R Bhargavi, VIT 2
Object Localization (cont…)
[Figure: classification (label "Dog") compared with classification with localization (label "Dog" plus a bounding box given by its center (x, y), width, and height)]
Output layer: [Pc, Bx, By, Bh, Bw, C1, C2, C3]
Object Localization – Defining target Label
• Assume that we have 3 classes (Dog – C1, Cat – C2, Human – C3).
• If the image contains a dog, then Y = [1, 0.4, 0.3, 0.5, 0.8, 1, 0, 0].
• If the image contains only background and no object, then Y = [0, x, x, x, x, x, x, x], where x marks a component whose value does not matter.
• Sum squared loss $L(\hat{y}, y)$:
• When Pc = 1: $L(\hat{y}, y) = (\hat{y}_1 - y_1)^2 + (\hat{y}_2 - y_2)^2 + \dots + (\hat{y}_n - y_n)^2$, where n is the number of components of the vector y.
• When Pc = 0: $L(\hat{y}, y) = (\hat{y}_1 - y_1)^2$
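The target label and the two-case loss can be sketched as follows. This is an illustration only: the predicted vector is made up, and `None` is used here as a stand-in for the "x" entries of the background label.

```python
def localization_loss(y_pred, y_true):
    """Sum squared loss for a label [Pc, Bx, By, Bh, Bw, C1, C2, C3]."""
    if y_true[0] == 1:
        # An object is present: every component contributes.
        return sum((p - t) ** 2 for p, t in zip(y_pred, y_true))
    # Background only: only the Pc component is penalised.
    return (y_pred[0] - y_true[0]) ** 2


y_dog = [1, 0.4, 0.3, 0.5, 0.8, 1, 0, 0]
y_background = [0, None, None, None, None, None, None, None]  # None = value ignored
y_pred = [0.9, 0.5, 0.3, 0.5, 0.8, 0.8, 0.1, 0.1]  # hypothetical network output

loss_dog = localization_loss(y_pred, y_dog)
loss_background = localization_loss(y_pred, y_background)
```

For the background label, the box and class components never enter the loss, which is why their values can be left unspecified.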
classifying them into predefined categories or classes.
• Object detection assumes that there may be multiple instances of different objects belonging to various classes within the image.
• The output of object detection is a list of bounding boxes, each associated with a class label indicating the type of object detected within the box.
• At each position of the window, the content within the window is extracted as a region of interest (ROI) and fed into a classifier to determine whether it contains an object of interest or not.
• The algorithm is run with different window sizes and different strides.
• The size of the sliding window and the step size (stride) by which it moves across the image are crucial parameters.
• The window size should be chosen based on the size range of objects to be detected, and the step size determines the degree of overlap between adjacent windows.
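A rough sketch of the sliding-window procedure is given below. The `classify` callback is a hypothetical stand-in for a trained classifier, and square windows are assumed for brevity.

```python
def sliding_windows(image_h, image_w, window, stride):
    """Yield (top, left, bottom, right) for every window position."""
    for top in range(0, image_h - window + 1, stride):
        for left in range(0, image_w - window + 1, stride):
            yield (top, left, top + window, left + window)


def detect(image_h, image_w, classify, window_sizes, stride):
    """Feed each ROI to the classifier, once per window size."""
    hits = []
    for window in window_sizes:
        for roi in sliding_windows(image_h, image_w, window, stride):
            if classify(roi):
                hits.append(roi)
    return hits


windows = list(sliding_windows(8, 8, window=4, stride=2))
# Stand-in classifier that fires only for windows anchored at the top-left corner.
hits = detect(8, 8, lambda roi: roi[:2] == (0, 0), window_sizes=[4, 6], stride=2)
```

Even on an 8 × 8 image, a 4 × 4 window with stride 2 already produces nine ROIs, which hints at why dense grids become expensive on real images.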
Object Detection – Sliding Window (cont…)
• Object detection with sliding window can be computationally expensive, especially when using a dense grid of windows and/or extracting high-dimensional feature vectors for each window.
• It may also suffer from scale and aspect ratio limitations, as objects at different scales may require different window sizes, and fixed-size windows may not be able to handle significant scale variations efficiently.
• Object detection with sliding window has been widely used in the past and has paved the way for more sophisticated object detection techniques.
• Modern approaches, such as region-based convolutional neural networks (R-CNNs), Faster R-CNN, and You Only Look Once (YOLO), have largely replaced sliding window-based methods due to their superior performance and efficiency.
• They excel in complex scenes by capturing robust feature representations and multi-scale information.
• Examples: R-CNN, Fast R-CNN, Mask R-CNN, etc.
• One-Stage Object Detectors: These models directly predict bounding boxes and class labels in a single step.
• They estimate the number of objects, classify them, and determine their size and positions using bounding boxes.
• Examples: YOLO (You Only Look Once)
• It quantifies the overlap between the predicted bounding boxes and the ground truth bounding boxes of objects within an image.
• IoU is calculated as the ratio of the area of intersection between the predicted bounding box and the ground truth bounding box to the area of their union.
• Mathematically, it is defined as:
IoU = Area of Intersection / Area of Union
• A higher IoU indicates better alignment and thus a more accurate detection.
• IoU is often used as a threshold to determine whether a detection is considered a true positive or a false positive.
• A common threshold value for IoU is 0.5, meaning that if the IoU between a predicted bounding box and a ground truth bounding box exceeds 0.5, the detection is considered a true positive; otherwise, it is considered a false positive.
• Different threshold values can be used depending on the specific requirements of the task or application.
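The IoU definition and the 0.5 threshold can be illustrated with a short sketch. Boxes are assumed to be in (x1, y1, x2, y2) corner form, and the two sample boxes are made up.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height of the intersection are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


predicted = (0, 0, 10, 10)
ground_truth = (5, 0, 15, 10)
score = iou(predicted, ground_truth)
is_true_positive = score > 0.5
```

Here half of each box overlaps the other, yet the IoU is only 1/3, so the detection would count as a false positive at the 0.5 threshold.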
Non-Maximum Suppression (NMS)
• Non-Maximum Suppression (NMS) is a post-processing technique commonly used in object detection algorithms to eliminate redundant or overlapping bounding box predictions.
• Its primary goal is to select the most confident and non-overlapping detections while discarding redundant ones.
Non-Maximum Suppression (Cont…)
• NMS takes as input a set of bounding box predictions generated by the object detection algorithm, along with their corresponding confidence scores.
• The first step of NMS involves sorting the bounding box predictions based on their confidence scores in descending order. This ensures that the most confident predictions are considered first.
• Starting from the top of the sorted list, the bounding box with the highest confidence score is selected as a detection.
• All other bounding boxes that have a significant overlap (measured using IoU) with the selected bounding box are suppressed or removed from consideration.
• A threshold IoU value is used to determine whether two bounding boxes overlap significantly.
• Bounding boxes with IoU above the threshold are considered to be overlapping, and only the one with the highest confidence score is retained.
• The output of Non-Maximum Suppression is a list of selected bounding boxes, each associated with its corresponding confidence score.
NMS (cont…)
• Step 1: Select the prediction S with the highest confidence score from the list of predictions P, remove it from P, and add it to the final prediction list keep.
• Step 2: Compare S with all the predictions remaining in P by calculating the IoU of S with every other prediction in P. If the IoU is greater than the threshold thresh_iou for any prediction T in P, remove T from P.
• Step 3: If there are still predictions left in P, go to Step 1 again; otherwise return the list keep containing the filtered predictions.
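Steps 1 to 3 can be sketched in a few lines. The names `P`, `keep`, and `thresh_iou` follow the slide; the prediction format (x1, y1, x2, y2, score) and the sample boxes are assumptions made for illustration.

```python
def iou(a, b):
    """IoU of two boxes whose first four entries are (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(P, thresh_iou):
    """P is a list of (x1, y1, x2, y2, score) predictions."""
    P = sorted(P, key=lambda p: p[4])  # ascending, so pop() returns the best
    keep = []
    while P:                            # Step 3: repeat while P is non-empty
        S = P.pop()                     # Step 1: most confident prediction
        keep.append(S)
        # Step 2: drop every T that overlaps S by more than the threshold.
        P = [T for T in P if iou(S, T) <= thresh_iou]
    return keep


predictions = [
    (0, 0, 10, 10, 0.9),
    (1, 1, 11, 11, 0.8),    # overlaps the first box heavily
    (20, 20, 30, 30, 0.7),  # separate object
]
kept = nms(predictions, thresh_iou=0.5)
```

The second box has an IoU of about 0.68 with the first and is suppressed, while the third box does not overlap and survives.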
• R-CNN: Regions with CNN features
• It consists of three modules:
• Region Proposal module: Generates and extracts category-independent region proposals, e.g. candidate bounding boxes (~2000).
• Feature Extractor module: Extracts features from each candidate region, e.g. using a deep convolutional neural network (pretrained AlexNet).
• Classifier module: Classifies the features as one of the known classes, e.g. using a linear SVM classifier model for each individual class.
• A bounding box regressor is used to fine-tune the localization.
https://arxiv.org/pdf/1311.2524.pdf
Region Proposal – Selective search
• The first step in R-CNN is to generate region proposals, which are candidate bounding boxes that potentially contain objects.
• These region proposals serve as input to the subsequent stages of the network.
• The Selective Search algorithm is used to generate region proposals.
• This algorithm generates a large number of region proposals by grouping pixels into segments based on color, texture, or other low-level features and then combining them hierarchically.
Feature Extraction
Supervised pre-training:
• AlexNet trained on the ImageNet dataset for classification with 1000 classes is used.
Domain-Specific Fine-Tuning:
• Fine-tune the network to learn
a) The visual features of the new types of images (distorted region proposals).
b) Specific target classes of the smaller dataset for the detection task.
• The final 1000-way classification layer of the CNN from pre-training is replaced with a randomly initialized (N+1)-way softmax classification layer for the N object classes and a general background class of the detection task.
• For each region proposal, the corresponding region of the input image is cropped and resized to a fixed size (227 x 227).
Training pipeline:
• The network is trained using SGD with a learning rate of 0.001 (1/10th of the initial pre-training rate).
• In each iteration, 32 windows that are positive over all the classes and 96 windows that belong to the background class are sampled to form a mini-batch of 128, to ensure that there is enough representation from the positive classes during training.
The final output: After training, the final classification layer is removed and a 4096-dimensional feature vector is obtained from the penultimate layer of the CNN for each of the 2000 region proposals (for every image).
• Input: The 4096-d feature vector for each region proposal.
Labels for training:
• The features of all region proposals that have an IoU overlap of less than 0.3 with the ground truth bounding box are considered negatives for that class during training.
• The positives for that class are simply the features from the ground truth bounding boxes themselves.
• All other proposals (IoU overlap greater than 0.3, but not a ground truth bounding box) are ignored for training the SVM.
• The class-specific dot products between the features and SVM weights are consolidated into a single matrix-matrix product for an image.
Output of SVM stage: Set of positive object proposals for each class, from the CNN features of the 2000 region proposals (of every image).
Bounding Box Regression
• Input: CNN output features.
• A simple bounding-box regression is used to improve localization performance.
• The aim is to find the transformation from the predicted bounding box, defined by $P^i = (P^i_x, P^i_y, P^i_w, P^i_h)$ where $i = 1, 2, \dots, N$, to the ground truth box $G = (G_x, G_y, G_w, G_h)$.
• The following are the regression targets (the transformations to be learned):
Bounding Box Regression (cont…)
• The predicted ground truth box is obtained by applying the learned transformations to the proposal box P.
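The equations on these two slides did not survive extraction. As a hedged sketch, the code below uses the parameterisation from the R-CNN paper linked earlier: targets $t_x = (G_x - P_x)/P_w$, $t_y = (G_y - P_y)/P_h$, $t_w = \log(G_w/P_w)$, $t_h = \log(G_h/P_h)$, with (x, y) the box center. The function names and sample boxes are illustrative assumptions.

```python
import math


def regression_targets(P, G):
    """Targets (t_x, t_y, t_w, t_h) that map proposal P onto ground truth G."""
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))


def apply_transform(P, d):
    """Predicted box obtained by applying transformations d to proposal P."""
    px, py, pw, ph = P
    dx, dy, dw, dh = d
    return (pw * dx + px, ph * dy + py, pw * math.exp(dw), ph * math.exp(dh))


P = (50.0, 50.0, 20.0, 40.0)  # hypothetical proposal (center x, center y, w, h)
G = (55.0, 60.0, 40.0, 40.0)  # hypothetical ground truth box
t = regression_targets(P, G)
G_hat = apply_transform(P, t)
```

Applying the exact targets to the proposal recovers the ground truth box, which is what a perfectly trained regressor would achieve.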
RCNN - Architecture
[Figure: R-CNN architecture]
Disadvantages
• Object detection is very slow. For each image, the selective search algorithm proposes about 2,000 RoIs to be examined by the entire pipeline (CNN feature extractor and classifier). This is computationally very expensive.
• Training is a multi-stage pipeline: CNN feature extractor, SVM classifier, and bounding-box regressors. Thus the training process is very complex and is not end-to-end.
• Training is expensive in terms of space and time.
Fast-RCNN
• Immediate descendant of R-CNN.
• The following changes are made in Fast R-CNN:
• In Fast R-CNN, the CNN feature extractor is first applied to the entire input image, and the region proposals are taken afterwards.
• It extends the ConvNet's job to do the classification part as well, by replacing the traditional SVM machine learning algorithm with a softmax layer.
Fast-RCNN Architecture
[Figure: Fast R-CNN architecture]
Changing FC to Convolution layer
[Figure: a network with fully connected layers and its equivalent with convolution layers]
With FC layers: 14 x 14 x 3 → Conv 5x5 @16 → 10 x 10 x 16 → 2x2 pool → 5 x 5 x 16 → FC 400 → FC 400 → Softmax (4 classes)
With convolution layers: 14 x 14 x 3 → Conv 5x5 @16 → 10 x 10 x 16 → 2x2 pool → 5 x 5 x 16 → Conv 5x5 @400 → 1 x 1 x 400 → Conv 1x1 @400 → 1 x 1 x 400 → Conv 1x1 → 1 x 1 x 4
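The layer sizes in the figure can be checked with a small shape calculation. This is a sketch under stated assumptions: no padding, stride 1 for the convolutions, and stride 2 for the 2x2 pool (which matches the 10 → 5 reduction); the 16 x 16 input is a hypothetical larger image.

```python
def conv_out(n, k, stride=1):
    """Output size along one dimension for an unpadded convolution or pool."""
    return (n - k) // stride + 1


def trace(n):
    """Spatial size after each layer of the all-convolution network."""
    sizes = [n]
    n = conv_out(n, 5)            # Conv 5x5 @16
    sizes.append(n)
    n = conv_out(n, 2, stride=2)  # 2x2 pool
    sizes.append(n)
    n = conv_out(n, 5)            # Conv 5x5 @400, replaces the first FC layer
    sizes.append(n)
    n = conv_out(n, 1)            # Conv 1x1 @400, replaces the second FC layer
    sizes.append(n)
    n = conv_out(n, 1)            # Conv 1x1 with 4 filters, replaces the softmax layer
    sizes.append(n)
    return sizes


small = trace(14)  # the 14 x 14 input from the slide
large = trace(16)  # a larger, hypothetical 16 x 16 input
```

For the 16 x 16 input the final map is 2 x 2 instead of 1 x 1: each of its four positions corresponds to one 14 x 14 window, so the convolutional form evaluates several sliding-window positions in a single pass.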
• Learns generalizable representations of objects.
• Object detection is treated as a regression problem instead of a classification task.
• Single stage: from image pixels to bounding box coordinates and class probabilities in one pass.
YOLO Object Detection
• Resize the input image to 448 × 448.
• Run a single convolutional network on the image.
• Threshold the resulting detections by the model's confidence.
• A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes.
• YOLO sees the entire image during training and test time, so it implicitly encodes contextual information about classes as well as their appearance.
• YOLO still lags behind state-of-the-art detection systems in accuracy.
probability that the cell contains an object, the bounding box prediction for an object if its center falls inside that particular cell, and the class prediction, i.e. each grid cell specifies y = [Po, Bx, By, Bh, Bw, C1, C2, C3] (considering 3 classes).
YOLO (cont…)
• The objectness score is calculated as follows:
• Po = Pr(containing an object) × IoU(pred, truth)
• The output is a volume s x s x 8 (considering 3 classes).
• In order to support multiple objects to be detected by a single grid cell, anchor boxes are introduced.
YOLO (cont…)
• Each object in the training image is assigned to the grid cell that contains the object's midpoint and to the anchor box of that grid cell with the highest IoU.
• The output is a volume s x s x (5B + K), where B is the number of boxes per grid cell and K is the number of classes.
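The single-box case (output volume s x s x 8 for 3 classes) can be sketched as a target-encoding routine. Several details are assumptions for illustration: coordinates are normalised to [0, 1), Bx and By are stored relative to the responsible cell, and the sample object is made up.

```python
def encode_target(objects, s, num_classes):
    """Build an s x s x (5 + num_classes) target volume.

    Each object is (cx, cy, h, w, class_index), normalised to the image size.
    """
    depth = 5 + num_classes
    target = [[[0.0] * depth for _ in range(s)] for _ in range(s)]
    for cx, cy, h, w, cls in objects:
        col = int(cx * s)  # grid cell that contains the object's midpoint
        row = int(cy * s)
        cell = target[row][col]
        cell[0] = 1.0                                   # Po
        cell[1:5] = [cx * s - col, cy * s - row, h, w]  # Bx, By, Bh, Bw
        cell[5 + cls] = 1.0                             # one-hot class entry
    return target


# One object of class C1 whose midpoint is at (0.5, 0.2) in the image.
target = encode_target([(0.5, 0.2, 0.3, 0.4, 0)], s=3, num_classes=3)
responsible_cell = target[0][1]
```

Only the cell containing the midpoint is marked as holding an object; every other cell keeps Po = 0.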
YOLO (cont…)
• Combine the box and class label predictions
YOLO (cont…)
• For each class use non-max suppression to generate final predictions.
YOLO Architecture
[Figure: YOLO architecture]
Loss Function
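• $\mathbb{1}_{i}^{obj}$ denotes whether an object appears in cell i.
• $\mathbb{1}_{ij}^{obj}$ denotes that the jth bounding box predictor in cell i is "responsible" for that prediction.
• $\lambda_{noobj}$ is set to 0.5 and $\lambda_{coord}$ is set to 5.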
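The loss equation itself was an image on the slide and is not present in the text. For reference, the multi-part sum-squared loss from the original YOLO paper, which uses the indicator and weighting symbols described above, has the following form (S × S is the grid size, B the number of boxes per cell, C the box confidence, and p(c) the class probability):

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} (C_i - \hat{C}_i)^2 \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} (C_i - \hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2
\end{aligned}
```

The larger weight on the coordinate terms and the smaller weight on the no-object confidence term keep the many empty cells from dominating the gradient.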
• Primary purpose is to learn efficient representations of data by training the network to reconstruct its input data as accurately as possible.
• Encodes its input $x_i$ into a hidden representation $h$: $h = g(W x_i + b)$
• Decodes the input again from this hidden representation: $\hat{x}_i = f(W^* h + b)$
• The model is trained to minimize a certain loss function.
[Figure: autoencoder mapping the input $x_i$ through weights $W$ to the hidden representation $h$, and through weights $W^*$ to the reconstruction $\hat{x}_i$]
known as the "latent space" or "encoding".
• The encoder consists of multiple layers, typically composed of convolutional layers, dense layers, or a combination of both, depending on the type of autoencoder.
• The goal of the encoder is to learn a compact and meaningful representation of the input data that captures its essential features.
• Encodes its input $x_i$ into a hidden representation $h$: $h = g(W x_i + b)$
reconstructs the original input data.
• Similar to the encoder, the decoder consists of multiple layers that perform the inverse operation of the encoder, gradually expanding the latent representation back to the original input dimensions.
• The goal of the decoder is to generate a reconstruction of the input data that closely matches the original input.
• Decodes the input from the hidden representation: $\hat{x}_i = f(W^* h + b)$
Autoencoder (cont…)
Training:
• During training, the autoencoder is trained using a loss function that measures the difference between the input data and the reconstructed output.
• The most common loss function used for autoencoders is the mean squared error (MSE) loss, which penalizes the difference between the input and output at each data point.
• The network adjusts its parameters (weights and biases) through backpropagation and gradient descent to minimize the reconstruction error, thus learning to encode and decode the input data effectively.
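The encode, decode, and MSE steps can be sketched for a single-hidden-layer autoencoder. This is an illustration only: g is taken to be sigmoid and f linear, the decoder bias is named `b_dec` to keep it apart from the encoder bias, and the weights are hand-picked rather than learned.

```python
import math


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def affine(W, x, b):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def autoencoder(x, W, b, W_star, b_dec):
    h = sigmoid(affine(W, x, b))      # encoder: h = g(W x + b)
    x_hat = affine(W_star, h, b_dec)  # decoder: x_hat = f(W* h + b), linear f
    return h, x_hat


def mse(x, x_hat):
    return sum((a - r) ** 2 for a, r in zip(x, x_hat)) / len(x)


x = [1.0, 0.0, 1.0]
W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]         # 3 inputs -> 2 hidden units
W_star = [[1.0, 1.0], [0.0, 0.0], [1.0, 1.0]]  # 2 hidden units -> 3 outputs
h, x_hat = autoencoder(x, W, [0.0, 0.0], W_star, [0.0, 0.0, 0.0])
loss = mse(x, x_hat)
```

In practice the weights would be found by backpropagation and gradient descent on this loss; the hand-picked values here only show the forward computation.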
• Anomaly Detection: Detecting anomalies or outliers in data by comparing
• Feature Learning: Learning meaningful features from data for downstream tasks like classification or clustering.
• Denoising: Removing noise from data by training the autoencoder to reconstruct clean versions of corrupted input data.
than the size of the input data) is called undercomplete.
• When $\dim(h) < \dim(x_i)$ and the autoencoder is still able to reconstruct $\hat{x}_i$ perfectly, we say that $h$ is a loss-free encoding of $x_i$. It captures all the important characteristics of $x_i$.
than the size of the input data.
• When $\dim(h) > \dim(x_i)$, the autoencoder could learn a trivial encoding by simply copying $x_i$ into $h$ and then copying $h$ into $\hat{x}_i$.
• Applications: Feature Learning, Dimensionality Expansion, Data Denoising, etc.
restricts all outputs to be between 0 and 1.
• $\hat{x}_i = W^* h + b$
• In both cases the function g is chosen as sigmoid.
representations.
Compression vs. Information Preservation:
• A higher-dimensional latent space allows for more information to be retained from the input data. However, it may also lead to overfitting and capture noise or irrelevant details.
• On the other hand, a lower-dimensional latent space forces the model to learn a more compressed representation, potentially losing some information but focusing on the most salient features.
• In contrast, an overcomplete autoencoder may capture more complex patterns and variations but could be prone to overfitting and learning redundant representations.
Reconstruction Quality:
• A higher-dimensional latent space can potentially lead to better reconstruction quality, as it allows the model to capture more details and nuances of the input data.
• However, a lower-dimensional latent space forces the model to prioritize the most critical features, which can also result in high-quality reconstructions if the model learns meaningful representations.
Latent Space Dimension (cont…)
Computational Complexity:
• Increasing the dimensionality of the latent space also increases the computational complexity of the autoencoder model, both in terms of memory usage and training time.
• Lower-dimensional latent spaces are computationally more efficient but may sacrifice some reconstruction quality or representation richness.
Overfitting and Generalization:
• A larger latent space in an autoencoder can lead to overfitting, especially if the model capacity is not appropriately controlled or regularized.
• Conversely, a smaller latent space promotes better generalization by encouraging the model to learn more abstract and generalizable representations.
• number of nodes in each layer,
• dimension of the latent vector,
• type of activation function in hidden layers,
• type of optimizer,
• learning rate,
• number of epochs,
• batch size, etc.
data.
• It is particularly useful for tasks where the input data may be noisy or corrupted, such as image denoising, audio denoising, or data preprocessing in machine learning pipelines.
• The denoising autoencoder (DAE) is an autoencoder that receives a corrupted data point as input and is trained to predict the original, uncorrupted data point as its output.
• A DAE minimizes the loss $L(\mathbf{x}, f(g(\tilde{\mathbf{x}})))$, where $\tilde{\mathbf{x}}$ is a copy of $\mathbf{x}$ that has been corrupted by some form of noise, $g$ is the encoder, and $f$ is the decoder.
• A DAE is an example of how overcomplete, high-capacity models may be used as autoencoders so long as care is taken to prevent them from learning the identity function.
• Use $(\mathbf{x}, \tilde{\mathbf{x}})$ as a training example for estimating the autoencoder reconstruction distribution $p_{reconstruct}(\mathbf{x} \mid \tilde{\mathbf{x}}) = p_{decoder}(\mathbf{x} \mid h)$, with $h$ the output of the encoder $g(\tilde{\mathbf{x}})$ and $p_{decoder}$ typically defined by a decoder $f(h)$.
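The denoising setup can be sketched as follows. Masking noise (zeroing components at random) is only one possible corruption process and is an assumption here, as are the function names; `reconstruct` stands in for the full encoder-decoder.

```python
import random


def corrupt(x, drop_prob, rng):
    """Masking noise: set each component to zero with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in x]


def dae_loss(x, x_tilde, reconstruct):
    """Squared error between the clean x and the reconstruction of x_tilde."""
    x_hat = reconstruct(x_tilde)
    return sum((a - r) ** 2 for a, r in zip(x, x_hat)) / len(x)


rng = random.Random(0)
x = [0.2, 0.9, 0.4, 0.7]
x_tilde = corrupt(x, drop_prob=0.5, rng=rng)

# An identity "autoencoder" is penalised whenever the input was corrupted.
identity_loss_fully_masked = dae_loss(x, [0.0] * len(x), lambda v: v)
```

Because the loss compares the output with the clean x rather than with the corrupted input, simply copying the input through is no longer an optimal solution.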
Variational Autoencoders (VAEs)
• KL divergence from p to q is defined as $D_{KL}(p(x) \parallel q(x)) = \sum_{x \in X} p(x) \ln \frac{p(x)}{q(x)}$
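The discrete KL divergence can be computed directly from its definition. The two distributions below are made-up examples; terms with p(x) = 0 are skipped, following the usual convention that they contribute zero.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]
q = [0.9, 0.1]
forward = kl_divergence(p, q)
reverse = kl_divergence(q, p)
```

The divergence is zero only when the two distributions match, and it is not symmetric: `forward` and `reverse` differ.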
learn a probability distribution over the latent space.
Loss Function:
• VAEs are trained using a combination of a reconstruction loss and a regularization term based on the Kullback-Leibler (KL) divergence.
• The reconstruction loss measures the difference between the input data and the reconstructed data.
• The KL divergence regularization term encourages the learned distribution in the latent space to match a predefined prior distribution (usually a standard Gaussian distribution).
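A minimal sketch of the two-term VAE loss follows. It assumes the encoder outputs a diagonal Gaussian (mean `mu` and log-variance `log_var`) and the prior is a standard Gaussian, in which case the KL term has the closed form $\frac{1}{2} \sum (\mu^2 + \sigma^2 - 1 - \ln \sigma^2)$. Squared error is used for the reconstruction term; these choices are assumptions, not stated on the slide.

```python
import math


def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def vae_loss(x, x_hat, mu, log_var):
    """Reconstruction loss plus KL regularization term."""
    reconstruction = sum((a - r) ** 2 for a, r in zip(x, x_hat))
    return reconstruction + gaussian_kl(mu, log_var)


# Hypothetical encoder outputs for one sample with a 2-dimensional latent space.
loss = vae_loss(x=[1.0, 2.0], x_hat=[1.0, 1.5], mu=[1.0, 0.0], log_var=[0.0, 0.0])
```

Here the reconstruction term contributes 0.25 and the KL term 0.5; the KL term vanishes only when the encoder output equals the prior.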
Generative Model
• Generative modeling: A type of machine learning approach where the goal is to learn and model the underlying probability distribution of a dataset.
• A model trained using generative modeling techniques can then generate new samples that are similar to the original data distribution.
• This is in contrast to discriminative modeling, where the focus is on learning the boundary between different classes or categories within the data.
Types of Generative Models:
• Autoregressive Model
ability to generate realistic data samples that resemble a given dataset.
• Image generation
• Data Augmentation
• Video generation
• Style Transfer
• Text-to-Image Synthesis
• Anomaly Detection
• Super-Resolution Imaging, etc.
• GAN architecture consists of two neural networks that compete against each other.
Generator:
• The generator is a neural network that takes random noise or latent vectors as input and generates synthetic data samples.
• Its objective is to learn the underlying data distribution of the training dataset and produce samples that resemble real data.
Discriminator:
• It takes both real data samples from the training dataset and fake data samples generated by the generator as input.
• The discriminator's goal is to distinguish between real and fake samples, assigning probabilities or scores to indicate how likely a sample is to be real.
Adversarial Training:
• The generator tries to produce samples that fool the discriminator into classifying them as real.
• The discriminator tries to correctly distinguish between real and fake samples.
GAN Architecture (cont…) - DCGAN
[Figure: DCGAN architecture]
GAN Architecture (cont…)
• This generated image is fed into the discriminator alongside a stream of images taken from the actual, ground-truth dataset.
• The discriminator takes in both real and fake images and returns probabilities: numbers between 0 and 1, with 1 representing a prediction of authenticity (real image) and 0 representing a prediction of fake.
parameters in its design through training.
• The generator adjusts its output to produce samples that closely mimic real data as it is being trained by using backpropagation to fine-tune its parameters.
Transposed Convolution
• Single channel, stride 1, no padding: Let the input be an $n_h \times n_w$ tensor and the kernel be $k_h \times k_w$.
• Sliding the kernel window with a stride of 1 along each row and each column yields a total of $n_h n_w$ intermediate results.
• Each intermediate result is an $(n_h + k_h - 1) \times (n_w + k_w - 1)$ tensor that is initialized with zeros. To compute each intermediate tensor, each element in the input tensor is multiplied by the kernel so that the resulting $k_h \times k_w$ tensor replaces a portion of the corresponding intermediate tensor.
• In the end, all the intermediate results are summed to produce the output.
Transposed Convolution (cont…)
Single channel, stride 2:
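The single-channel transposed convolution described above can be sketched in plain Python. Rather than materialising every intermediate tensor, this version adds each scaled kernel directly into the output, which gives the same sum. The `stride` parameter covers the stride 2 case; the 2 × 2 input and kernel are made-up values.

```python
def trans_conv(X, K, stride=1):
    """Single-channel transposed convolution with no padding."""
    n_h, n_w = len(X), len(X[0])
    k_h, k_w = len(K), len(K[0])
    out_h = (n_h - 1) * stride + k_h
    out_w = (n_w - 1) * stride + k_w
    Y = [[0] * out_w for _ in range(out_h)]
    for i in range(n_h):
        for j in range(n_w):
            # Each input element scales the kernel and is added at its offset.
            for a in range(k_h):
                for b in range(k_w):
                    Y[i * stride + a][j * stride + b] += X[i][j] * K[a][b]
    return Y


X = [[0, 1], [2, 3]]
K = [[0, 1], [2, 3]]
Y1 = trans_conv(X, K, stride=1)  # 3 x 3 output
Y2 = trans_conv(X, K, stride=2)  # 4 x 4 output
```

With stride 1 the output size is (n + k − 1) per dimension, matching the slide; with stride 2 the scaled kernels no longer overlap for a 2 × 2 kernel, so the output grows to 4 × 4.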
Discriminator
• The discriminator can be any classification network, since its goal is to predict whether an image is real or fake, which is a typical supervised classification problem.
Training Discriminator
• Training the discriminator is a straightforward supervised training process.
• The network is given labeled images coming from the generator (fake) and the training data (real), and it learns to classify between real and fake images with a sigmoid prediction output.
Training Generator
• The generator model cannot be trained alone like the discriminator.
• It needs the discriminator model to tell it whether it did a good job of faking images.
• So, we create a combined network to train the generator, composed of both the discriminator and generator models.
Training Discriminator
• Freeze the generator weights and compare the real samples with the generated samples.
• Train the discriminator to minimize the loss of the discriminator.
Training Generator
• Freeze the discriminator weights.
• Backpropagate the error through the discriminator to update the generator weights.
• Train the generator to generate data that "fools" the discriminator.
• Maximize the loss of the discriminator when the generated data is given as input.
Training GAN
[Figure: GAN training procedure]
Objective Function
• The training of the generator and discriminator is guided by specific loss functions:
• Generator Loss: The generator's loss function encourages it to generate samples that the discriminator classifies as "real." The generator aims to minimize the probability of the discriminator correctly classifying fake samples.
• Discriminator Loss: The discriminator's loss function encourages it to correctly classify real and fake samples. It aims to maximize the probability of correctly classifying real samples as "real" and fake samples as "fake."
• The generator is a neural network which takes as input a noise vector z ∼ N(0, I) and produces a generated sample X = Gϕ(z).
• The discriminator is another neural network which takes as input a real X or a generated X = Gϕ(z) and classifies the input as real/fake.
Objective Function (cont…)
Generator Objective function:
• Given an image generated by the generator as Gϕ(z), the discriminator assigns a score Dθ(Gϕ(z)) to it.
• This score will be between 0 and 1 and gives the probability of the image being real (1 for real and 0 for fake).
• For a given z, the generator would want to maximize log Dθ(Gϕ(z)) or minimize log(1 − Dθ(Gϕ(z))) (this is for a single input z).
• Now, for all possible values of z (z is continuous and z ∼ N(0, I)), the equivalent objective function would be
$\min_{\phi} \mathbb{E}_{z \sim N(0, I)} [\log(1 - D_\theta(G_\phi(z)))]$
• If z were discrete and drawn from a uniform distribution over N values, the expectation would reduce to $\min_{\phi} \sum_{i=1}^{N} \frac{1}{N} \log(1 - D_\theta(G_\phi(z_i)))$
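• In other words, the discriminator should try to maximize the following objective function
$\max_{\theta} \; \mathbb{E}_{x \sim p_{data}} [\log D_\theta(x)] + \mathbb{E}_{z \sim N(0, I)} [\log(1 - D_\theta(G_\phi(z)))]$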
Objective Function (cont…)
Overall Objective function:
• Putting together the objectives of the generator and discriminator we get a minimax game
$\min_{\phi} \max_{\theta} \; \mathbb{E}_{x \sim p_{data}} [\log D_\theta(x)] + \mathbb{E}_{z \sim N(0, I)} [\log(1 - D_\theta(G_\phi(z)))]$
• The discriminator wants to maximize the second term whereas the generator wants to minimize it (hence it is a two-player game).
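The minimax objective can be estimated from a batch of discriminator scores. The sketch below is illustrative: the expectations are replaced by batch averages, and the score lists are made-up discriminator outputs rather than results from a trained model.

```python
import math


def value_function(d_real, d_fake):
    """Batch estimate of E[log D(x)] + E[log(1 - D(G(z)))]."""
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


d_real = [0.9, 0.8]           # hypothetical scores D(x) on real samples
d_fake_detected = [0.1, 0.2]  # scores D(G(z)) when the fakes are spotted
d_fake_fooling = [0.6, 0.7]   # scores D(G(z)) when the generator fools D

v_detected = value_function(d_real, d_fake_detected)
v_fooled = value_function(d_real, d_fake_fooling)
```

The value is higher when the discriminator spots the fakes and lower when the generator fools it, which is the tension the minimax game expresses: the discriminator pushes this quantity up while the generator pushes it down.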
tries to distinguish between real and fake samples.
• Key features:
• Simple architecture with two neural networks (generator and discriminator).
• Employs adversarial training using binary cross-entropy loss.
• Key difference: While effective, vanilla GANs may suffer from stability issues such as mode collapse and training difficulties.
• Mode collapse refers to a situation where the generator produces limited and repetitive outputs, failing to capture the full diversity of the data distribution.
• It incorporates convolutional layers in both the generator and discriminator networks, enabling better feature extraction and spatial coherence in generated images.
• Key features:
• Uses convolutional layers for spatial information processing.
• Employs batch normalization and Leaky ReLU activation functions for improved stability and convergence.
• Key difference: DCGANs are tailored for image generation tasks and offer improved stability and visual quality compared to vanilla GANs.
• They allow for controlled generation by providing conditional labels or information during training.
• Key features:
• Conditional information (labels, attributes) is provided as additional input to both the generator and discriminator.
• Enables targeted generation based on specified conditions.
• Key difference: cGANs offer control over the generated outputs, making them suitable for tasks like image-to-image translation.
• It learns mappings between two domains without requiring paired examples during training.
• Key features:
• Unpaired image translation without explicit correspondences between samples.
• Incorporates cycle consistency loss to enforce consistency between original and translated images.
• Key difference: CycleGANs are specialized for unpaired image translation tasks, enabling style transfer, domain adaptation, and image transformation without paired data.