
1.What is descriptor or feature vector?

Feature vector

A feature vector is an ordered list of numerical properties of observed phenomena. It represents input features
to a machine learning model that makes a prediction.

Humans can analyze qualitative data to make a decision.

For example, we see the cloudy sky, feel the damp breeze, and decide to take an umbrella when going outside. Our five
senses can transform outside stimuli into neural activity in our brains, handling multiple inputs as they occur in no
particular order.

However, machine learning models can only deal with quantitative data.

As such, we must always convert features of observed phenomena into numerical values and feed them into a
machine learning model in the same order. In short, we must represent features in feature vectors.
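A minimal sketch of this idea, assuming made-up weather readings and feature names, showing how observations become an ordered numeric feature vector:

import numpy as np

# Hypothetical weather observation: qualitative impressions encoded as numbers,
# always listed in the same order before being fed to a model.
observation = {"cloud_cover": 85.0, "humidity": 70.0, "wind_speed": 12.0}

feature_order = ["cloud_cover", "humidity", "wind_speed"]
feature_vector = np.array([observation[name] for name in feature_order])

print(feature_vector)   # [85. 70. 12.] -- the input to, e.g., a "take an umbrella?" classifier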

2.Explain discriminative Model and Generative Model.

GENERATIVE MODEL vs. DISCRIMINATIVE MODEL

1. Generative models attempt to represent the actual distribution of the classes in the dataset; discriminative models model the decision boundaries between the dataset classes.

2. Generative models estimate the joint probability distribution p(x, y) using the Bayes theorem; discriminative models learn the conditional probability p(y|x).

3. Generative models require more computational resources than discriminative models; discriminative models are computationally less expensive.

4. Generative models are useful for unsupervised machine learning problems; discriminative models are helpful for supervised machine learning tasks.

5. Outliers have a greater influence on generative models than on discriminative ones; discriminative models have the advantage of being more resistant to outliers.
3.Explain feature space representation.

A feature space is just the set of all possible values for a chosen set of features from that data.

It refers to the n-dimensions where your variables live (not including a target variable, if it is present).
The term is used often in ML literature because a task in ML is feature extraction, hence we view all variables as
features.
For example, consider a data set with:

Target:
Y ≡ thickness of car tires after some testing period

Variables:
X1 ≡ distance travelled in test
X2 ≡ time duration of test
X3 ≡ amount of chemical C in tires

The feature space is the positive orthant of R3, as all the X variables can only be positive quantities. Domain knowledge about tires might suggest that the speed the vehicle was moving at is important, hence we generate another variable, X4 (this is the feature extraction part): X4 = X1/X2, the speed of the vehicle during testing. This extends our old feature space into a new one, the positive part of R4.
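A small sketch of this feature extraction step, assuming a few made-up tire-test records:

import numpy as np

# Hypothetical tire-test records: columns are X1 (distance), X2 (time), X3 (chemical C)
X = np.array([
    [1200.0, 20.0, 3.1],
    [ 800.0, 10.0, 2.7],
    [1500.0, 30.0, 3.4],
])

# Feature extraction: derive X4 = X1 / X2 (average speed) and append it,
# extending the feature space from the positive part of R^3 to R^4.
X4 = X[:, 0] / X[:, 1]
X_extended = np.column_stack([X, X4])

print(X_extended.shape)   # (3, 4)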

4.What is Bayesian Learning? Explain Bayes Minimum Error Classifier and Minimum Risk Classifier?

A learning technique that determines model parameters (such as the network weights) by maximizing the posterior
probability of the parameters given the training data.

The idea is that some parameter values are more consistent with the observed data than others.

By Bayes’ rule, maximizing the posterior probability amounts to maximizing the product of the likelihood (the conditional probability of the training data given the model parameters) and the prior probability of the parameters.

In Bayesian learning, prior knowledge is provided by asserting (i) a prior probability for each candidate hypothesis, and (ii) a probability distribution over observed data for each possible hypothesis.

Minimum risk classifier:

Let λij = λ(αi|ωj) be the loss incurred for taking action αi when the true state of nature is ωj. The conditional risk of action αi given x is R(αi|x) = Σj λij P(ωj|x). The Bayes decision rule selects, for every x, the action αi that minimizes R(αi|x); the resulting minimum overall risk R* is called the Bayes risk.

Minimum Error Classification:

With the zero-one loss (λij = 0 if i = j and 1 otherwise), the conditional risk reduces to R(αi|x) = Σj≠i P(ωj|x) = 1 − P(ωi|x). R(αi|x) is minimum for the decision i for which the posterior P(ωi|x) is maximum, which is the same decision rule as the Bayes classifier. In the two-category case, if the loss for one action is greater than the other, the decision region for that action will shrink.
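A minimal sketch of the minimum-error (maximum-posterior) decision rule, assuming a made-up two-class problem with one-dimensional Gaussian class-conditional densities:

import numpy as np

# Hypothetical two-class problem: priors P(w_i) and class-conditional likelihoods p(x|w_i)
priors = np.array([0.6, 0.4])

def likelihoods(x):
    # assumed 1-D Gaussian class-conditional densities with made-up parameters
    means, stds = np.array([0.0, 2.0]), np.array([1.0, 1.0])
    return np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))

def bayes_minimum_error_decision(x):
    # posterior P(w_i|x) is proportional to p(x|w_i) * P(w_i); picking the maximum
    # posterior minimizes the probability of error (zero-one loss).
    unnormalized_posterior = likelihoods(x) * priors
    return int(np.argmax(unnormalized_posterior))

print(bayes_minimum_error_decision(0.3))   # class 0
print(bayes_minimum_error_decision(1.8))   # class 1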

4.What is discriminant function? Explain discriminant function under multivariate normal distribution.

Discriminant functions are used to find the minimum probability of error in decision making problems. In a problem with
feature vector y and state of nature variable w, we can represent the discriminant function as:
gi(Y)=lnp(Y|wi)+lnP(wi)

We defined p(Y|wi) as the conditional probability density function for Y with wi being the state of nature,

And P(wi) is the prior probability that nature is in state wi. If we take p(Y|wi) to be a multivariate normal distribution, that is, p(Y|wi) = N(μi, Σi), then in the special case Σi = σ²I the discriminant function reduces to

gi(Y) = −||Y − μi||² / (2σ²) + ln P(wi),

where ||.|| denotes the Euclidean norm.

The multivariate normal distribution is a generalization of the univariate normal distribution to two or more
variables. It is a distribution for random vectors of correlated variables, where each vector element has a univariate
normal distribution.
As for the normal density p(x|ωi) follows the multivariate normal distribution, so our discriminant function can
be written as

gi(x) = −1/2 (x − μi)t Σi^−1 (x − μi) − d/2 ln(2π) − 1/2 ln|Σi| + ln P(wi)
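A short sketch evaluating this discriminant in numpy for a hypothetical two-class, two-dimensional problem (means, covariances and priors are made up for illustration):

import numpy as np

def discriminant(x, mean, cov, prior):
    # g_i(x) = -1/2 (x-mu_i)^T Sigma_i^-1 (x-mu_i) - d/2 ln(2*pi) - 1/2 ln|Sigma_i| + ln P(w_i)
    d = len(mean)
    diff = x - mean
    return (-0.5 * diff @ np.linalg.inv(cov) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(cov))
            + np.log(prior))

x = np.array([1.0, 1.5])
g1 = discriminant(x, np.array([0.0, 0.0]), np.eye(2), 0.5)
g2 = discriminant(x, np.array([3.0, 3.0]), np.eye(2), 0.5)
print("class 1" if g1 > g2 else "class 2")   # class 1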

5. Write a short note on the nearest neighbour rule.

K-Nearest Neighbour (K-NN) is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.

The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.

K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the Classification

problems.

K-NN is a non-parametric algorithm, which means it does not make any assumption on underlying data.

It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs the computation only at classification time.

The K-NN working can be explained on the basis of the below algorithm:

Step-1: Select the number K of the neighbors

Step-2: Calculate the Euclidean distance from the new data point to the training data points.

Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.

Step-4: Among these k neighbors, count the number of the data points in each category.
Step-5: Assign the new data points to that category for which the number of the neighbor is maximum.

Step-6: Our model is ready.
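A compact sketch of these steps, assuming a tiny made-up 2-D dataset with two categories:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Steps 2-3: Euclidean distances to all training points, take the k nearest
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: count labels among the k neighbours and pick the majority category
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5])))   # "A"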

6) What is optimization? What are three optimization variants? Explain any one.

Optimization is the process of training the model iteratively so as to minimize (or maximize) an objective function evaluation. It is one of the most important phenomena in Machine Learning for getting better results.

Optimization methods are used in many areas of study to find solutions that maximize or minimize some study
parameters, such as minimize costs in the production of a good or service, maximize profits, minimize raw material in
the development of a good, or maximize production.

The three variants of optimization techniques are:
1) Stochastic gradient descent
2) Batch optimization
3) Mini-batch optimization

Stochastic Gradient Descent

It’s a variant of Gradient Descent. It tries to update the model’s parameters more frequently. In this, the model

parameters are altered after computation of loss on each training example. So, if the dataset contains 1000 rows SGD will

update the model parameters 1000 times in one cycle of dataset instead of one time as in Gradient Descent.

θ=θ−α⋅∇J(θ;x(i);y(i)) , where {x(i) ,y(i)} are the training examples.

As the model parameters are frequently updated, the parameters have high variance and the loss function fluctuates at different intensities.

Advantages:

Frequent updates of model parameters hence, converges in less time.

Requires less memory as no need to store values of loss functions.

May discover new minima.

Disadvantages:

High variance in model parameters.


May overshoot even after achieving the global minimum.

To get the same convergence as gradient descent, the value of the learning rate needs to be reduced slowly.
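A minimal sketch of the SGD update rule above, assuming a made-up linear regression problem with squared error (the data, learning rate and model are only for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))          # 1000 training examples, 3 features
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=1000)

theta = np.zeros(3)
alpha = 0.01
for epoch in range(5):
    for i in rng.permutation(len(X)):   # one parameter update per training example
        grad = (X[i] @ theta - y[i]) * X[i]   # gradient of J on a single (x(i), y(i))
        theta = theta - alpha * grad          # theta = theta - alpha * grad J(theta; x(i); y(i))

print(np.round(theta, 2))   # close to [ 2.  -1.   0.5]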

Batch Optimization:
Batch optimization is a technique where a batch of samples is used to calculate the gradient and update the model
parameters. In other words, the model is trained using a fixed set of samples in each iteration. Batch optimization is
computationally efficient and can help in finding the optimum solution. However, it might lead to overfitting and slow
convergence if the batch size is too large or too small.

Overall, optimization techniques like gradient descent and batch optimization play a crucial role in deep learning. They
help to minimize the error and find the best set of parameters for the given problem.

7. Explain minimum distance classifier

The minimum distance classifier is used to classify unknown image data to classes which minimize the distance between the image data and the class in multi-feature space. The distance is defined as an index of similarity so that the minimum distance is identical to the maximum similarity.

Euclidean distance

It is used in cases where the variances of the population classes are different to each other. The Euclidean distance is theoretically identical to the similarity index.

Normalized Euclidean distance

The normalized Euclidean distance is proportional to the similarity index, as in the case of differing variance.

Mahalanobis distance

It is used in cases where there is correlation between the axes in feature space.

where X : vector of image data (n bands), X = [x1, x2, ...., xn]
μk : mean vector of the k-th class, μk = [m1, m2, ...., mn]
σk : variance matrix
Σk : variance-covariance matrix
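A minimal sketch of a minimum distance (nearest class mean) classifier, assuming made-up class mean vectors in a 3-band feature space:

import numpy as np

def minimum_distance_classify(x, class_means):
    # Assign x to the class whose mean vector is closest (Euclidean distance);
    # minimum distance is treated as maximum similarity.
    distances = [np.linalg.norm(x - mu) for mu in class_means]
    return int(np.argmin(distances))

# hypothetical mean vectors of two classes
class_means = [np.array([40.0, 55.0, 30.0]), np.array([90.0, 120.0, 80.0])]
pixel = np.array([45.0, 60.0, 35.0])
print(minimum_distance_classify(pixel, class_means))   # 0 (closer to the first class mean)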

Elaborate back propagation learning.

Backpropagation algorithm calculates the gradient of the error function. Backpropagation can be
written as a function of the neural network. Backpropagation algorithms are a set of methods used to efficiently
train artificial neural networks following a gradient descent approach which exploits the chain rule.

The main features of Backpropagation are the iterative, recursive and efficient method through which it calculates the updated weights to improve the network until it is able to perform the task for which it is being trained. Backpropagation requires the derivatives of the activation functions to be known at network design time.

How Backpropagation Algorithm Works

Inputs X, arrive through the preconnected path


Input is modeled using real weights W. The weights are usually randomly selected.
Calculate the output for every neuron from the input layer, to the hidden layers, to the output layer.
Calculate the error in the outputs

Error = Actual Output − Desired Output

Travel back from the output layer to the hidden layer to adjust the weights such that the error is decreased.
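A numeric sketch of one such backward step for a single output neuron, assuming a sigmoid activation, squared error and made-up inputs and weights:

import numpy as np

x = np.array([0.5, 1.0])     # inputs
w = np.array([0.2, -0.4])    # randomly selected initial weights
target = 1.0
lr = 0.5

out = 1.0 / (1.0 + np.exp(-(w @ x)))      # forward pass through the neuron
error = out - target                      # actual output - desired output
grad_w = error * out * (1 - out) * x      # chain rule: dE/dw = (out - target) * sigmoid' * x
w = w - lr * grad_w                       # travel back and adjust weights to decrease the error

print(np.round(w, 3))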

8.What are different loss functions in back propagation learning? Explain any one.
The Loss function is the difference between our predicted and actual values. We create a Loss function to find

the minima of that function to optimize our model and improve our prediction’s accuracy.

Different loss functions are: Squared Error and Cross Entropy Loss.

Cross entropy loss is a metric used to measure how well a classification model in machine learning performs. The loss (or
error) is measured as a number between 0 and 1, with 0 being a perfect model. The goal is generally to get your model as
close to 0 as possible. Cross entropy loss is often considered interchangeable with logistic loss (or log loss, and sometimes
referred to as binary cross entropy loss) but this isn't always correct. Cross entropy loss measures the difference between
the discovered probability distribution of a machine learning classification model and the predicted distribution. All possible
values for the prediction are stored so, for example, if you were looking for the odds in a coin toss it would store that
information at 0.5 and 0.5 (heads and tails). Binary cross entropy loss, on the other hand, stores only one value.

In the case of binary classification, cross-entropy is given by:

L = −(Y log(P) + (1 − Y) log(1 − P))

where:

P is the predicted probability, and

Y is the indicator (the true class label, 0 or 1).
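A short sketch of this loss averaged over a few made-up predictions:

import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # L = -(Y*log(P) + (1-Y)*log(1-P)), averaged over the examples
    p = np.clip(p_pred, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.1, 0.8, 0.6])
print(round(binary_cross_entropy(y, p), 4))   # a value near 0 indicates a well-calibrated model
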
9 .Deep Learning and its use:
Deep learning is a subset of machine learning. Deep learning algorithms emerged in an attempt to make traditional machine
learning techniques more efficient.
Deep learning algorithms are neural networks that are modeled after the human brain. For example, a human brain contains billions of interconnected neurons that work together to learn and process information. Similarly, deep learning neural networks, or
artificial neural networks, are made of many layers of artificial neurons that work together inside the computer.

Deep Learning is a part of Machine Learning used to solve complex problems and build intelligent solutions. The core concept of
Deep Learning has been derived from the structure and function of the human brain. Deep Learning uses artificial neural
networks to analyze data and make predictions.

Deep learning technology drives many AI applications used in everyday products, such as the following:

Digital assistants

Voice-activated television remotes

Fraud detection

Automatic facial recognition

It is also a critical component of emerging technologies such as self-driving cars, virtual reality, and more.

Deep learning models are computer files that data scientists have trained to perform tasks using an algorithm or a predefined set of
steps. Businesses use deep learning models to analyze data and make predictions in various applications.

10. Linear Classifier:

Linear classifiers are a type of machine learning algorithm used for classification tasks. In linear classification, the goal is to find a
hyperplane that separates the data points into different classes. This hyperplane is represented by a linear equation, usually in the
form of a straight line in two dimensions or a plane in higher dimensions.

Linear classifier with hinge loss:

Linear machines with hinge loss are a type of linear classifier that use a hinge loss function to separate the data points into
different classes. Hinge loss is a loss function used in machine learning for classification tasks that penalizes predictions that are
far from the true class value.

In the case of a binary classification problem, the goal of the linear machine with hinge loss is to find a hyperplane that separates
the positive and negative examples with the largest margin. The margin is defined as the perpendicular distance between the
hyperplane and the data points nearest to it.

The hinge loss function is defined as:

max(0, 1 - yi(w · xi + b))

where yi is the label of the ith data point, xi is the feature vector of the ith data point, w is the weight vector, and b is the bias term.
The hinge loss function penalizes predictions that are inside the margin or on the wrong side of the hyperplane.

The optimization problem for hinge loss involves minimizing the sum of the hinge loss function over all training examples along with
adding a regularization term to prevent overfitting. This problem is typically solved using gradient descent.

Linear machines with hinge loss are commonly used in applications such as image classification, text classification, and sentiment
analysis. Support vector machines (SVMs) are a popular example of a linear machine with hinge loss.
Linear Machines with Hinge Loss

The hinge loss is a specific type of cost function that incorporates a margin or distance from the classification boundary into the cost
calculation. Even if new observations are classified correctly, they can incur a penalty if the margin from the decision boundary is not large
enough. The hinge loss increases linearly.

The hinge loss is mostly associated with soft-margin support vector machines.

The x-axis represents the distance from the boundary of any single instance .

The y-axis represents the loss size, or penalty, that the function will incur depending on its distance.

A dotted line on the x-axis marks the value 1. This means that when an instance’s distance from the boundary is greater than or equal to 1, our loss size is 0.

If the distance from the boundary is 0 , then we incur a loss size of 1.

A negative distance from the boundary incurs a high hinge loss. This essentially means that we are on the wrong side of the boundary, and that

the instance will be classified incorrectly.

On the flip side, a positive distance from the boundary incurs a low hinge loss, or no hinge loss at all, and the further we are away from the boundary (and on the right side of it), the lower our hinge loss will be.
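A minimal sketch of the hinge loss objective over a training set, assuming made-up linearly separable data with labels in {-1, +1} and an L2 regularization term:

import numpy as np

def hinge_objective(w, b, X, y, reg=0.01):
    # sum of max(0, 1 - y_i (w . x_i + b)) over all examples, plus a regularization term
    margins = y * (X @ w + b)
    return np.sum(np.maximum(0.0, 1.0 - margins)) + reg * np.dot(w, w)

X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([0.5, 0.5]), 0.0

# points with margin >= 1 contribute 0 loss; points inside the margin or
# on the wrong side of the hyperplane are penalized
print(hinge_objective(w, b, X, y))
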
11.Applications of Ml:
Virtual Personal Assistants
Siri, Alexa and Google Now are some of the popular examples of virtual personal assistants. As the name suggests, they assist in finding information when asked over voice. Virtual assistants are integrated into a variety of platforms. For example:
Smart Speakers: Amazon Echo and Google Home
Smartphones: Samsung Bixby on Samsung S8
Predictions while Commuting
Traffic Predictions: We have all been using GPS navigation services. While we do that, our current locations and velocities are saved at a central server for managing traffic. This data is then used to build a map of current traffic. While this helps in predicting traffic and doing congestion analysis, the underlying problem is that fewer cars are equipped with GPS. Machine learning in such scenarios helps to estimate the regions where congestion can be found on the basis of daily experience.
Video Surveillance
Imagine a single person monitoring multiple video cameras! Certainly, a difficult job to do and boring as well. This is why the idea of training
computers to do this job makes sense.
Social Media Services
From personalizing your news feed to better ads targeting, social media platforms are utilizing machine learning for their own and user
benefits. Here are a few examples that you must be noticing, using, and loving in your social media accounts, without realizing that these
wonderful features are nothing but the applications of ML.
Email Spam and Malware Filtering
There are a number of spam filtering approaches that email clients use. To ensure that these spam filters are continuously updated, they are powered by machine learning. Rule-based spam filtering fails to track the latest tricks adopted by spammers. Multilayer Perceptron and C4.5 decision tree induction are some of the spam filtering techniques that are powered by ML.
Online Customer Support
A number of websites nowadays offer the option to chat with customer support representative while they are navigating within the site.
However, not every website has a live executive to answer your queries. In most of the cases, you talk to a chatbot.
Search Engine Result Refining
Google and other search engines use machine learning to improve the search results for you. Every time you execute a search, the algorithms at the backend keep a watch on how you respond to the results. If you open the top results and stay on the web page for long, the search engine assumes that the results it displayed were in accordance with the query.
Product Recommendations
You shopped for a product online a few days back and then you keep receiving emails with shopping suggestions. If not this, then you might have noticed that the shopping website or the app recommends some items that somehow match your taste.
Online Fraud Detection
Machine learning is proving its potential to make cyberspace a secure place and tracking monetary frauds online is one of its examples.
12. NEURAL NETWORK AND FEED FORWARD NEURAL NETWORK

Neural networks are used to mimic the basic functioning of the human brain and are inspired by how the
human brain interprets information.

It is used to solve various real-time tasks because of its ability to perform computations quickly and its fast
responses.

Artificial Neural Network model contains various components that are inspired by the biological nervous
system.

An Artificial Neural Network has a huge number of interconnected processing elements, also known as nodes.

These nodes are connected with other nodes using a connection link. The connection link contains weights,
these weights contain the information about the input signal.

Types of tasks that can be solved using an artificial neural network include Classification problems, Pattern
Matching, Data Clustering, etc

Types of Neural Networks:

(i) ANN – It is also known as an artificial neural network. It is a feed-forward neural network because the inputs
are sent in the forward direction. It can also contain hidden layers. It is used for Textual Data or Tabular Data.
A widely used real-life application is Facial Recognition. It is comparatively less powerful than CNN and RNN.
(ii) CNN– It is also known as Convolutional Neural Networks. It is mainly used for Image Data. It is used for
Computer Vision. Some of the real-life applications are object detection in autonomous vehicles. It contains a
combination of convolutional layers and neurons. It is more powerful than both ANN and RNN
(iii) RNN-It is also known as Recurrent Neural Networks. It is used to process and interpret time series data.
In this type of model, the output from a processing node is fed back into nodes in the same or previous layers.
The most known types of RNN are LSTM (Long Short Term Memory) Networks

A Feed Forward Neural Network is an artificial neural network in which the connections between nodes do not form a cycle. The opposite of a feed forward neural network is a recurrent neural network, in which certain pathways are cycled.

Describe single layer perceptron and multilayer perceptron.

A single layer perceptron is a simple Neural Network which contains only one layer. The single layer perceptron's computation is the sum of the input vector, with each value multiplied by its corresponding weight. The resulting output value becomes the input of an activation function.
The perceptron consists of 4 parts.

Input values or One input layer

Weights and Bias

Net sum

Activation Function
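A minimal sketch putting these four parts together, assuming a step activation function and made-up weights that implement a logical AND:

import numpy as np

def perceptron_output(inputs, weights, bias):
    # net sum = weighted sum of the input vector plus bias
    net = np.dot(inputs, weights) + bias
    # activation function: step function (assumed here)
    return 1 if net >= 0 else 0

weights, bias = np.array([1.0, 1.0]), -1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, perceptron_output(np.array(x), weights, bias))   # only [1, 1] gives 1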

A multilayer perceptron is a type of feed-forward artificial neural network that generates a set of outputs from

a set of inputs.

An MLP is a neural network connecting multiple layers in a directed graph, which means that the signal path through the nodes only goes one way.

The MLP network consists of input, output, and hidden layers. Each hidden layer consists of numerous perceptrons, which are called hidden units.

Multi-Layer Perceptron

It defines the most complex architecture of artificial neural networks. It is substantially formed from multiple layers of perceptrons.

MLP networks are used in a supervised learning format. A typical learning algorithm for MLP networks is the backpropagation algorithm.

A multilayer perceptron (MLP) is a feed forward artificial neural network that generates a set of outputs from a set of inputs. An MLP is characterized by several layers of input nodes connected as a directed graph between the input and output layers. MLP uses backpropagation for
training the network. MLP is a deep learning method.

This class of networks consists of multiple layers of computational units, usually interconnected in a feed-
forward way. Each neuron in one layer has directed connections to the neurons of the subsequent layer
13.Backpropagation learning

Backpropagation, or backward propagation of errors, is an algorithm that is designed to test for errors

working back from output nodes to input nodes. It is an important mathematical tool for improving the

accuracy of predictions in data mining and machine learning. Essentially, backpropagation is

an algorithm used to calculate derivatives quickly.

There are two leading types of backpropagation networks:

Static backpropagation

Recurrent backpropagation

Artificial neural networks use backpropagation as a learning algorithm to compute the gradient of the error with respect to the weight values for the various inputs. The algorithm gets its name because the weights are

updated backward, from output to input.

The advantages of using a backpropagation algorithm are as follows:

It does not have any parameters to tune except for the number of inputs.

It is highly adaptable and efficient and does not require any prior knowledge about the network.

It is a standard process that usually works well.

It is user-friendly, fast and easy to program.

Users do not need to learn any special functions.

The disadvantages of using a backpropagation algorithm are as follows:

It prefers a matrix-based approach over a mini-batch approach.

Data mining is sensitive to noise and irregularities.

Performance is highly dependent on input data.

Training is time- and resource-intensive.


What is the objective of a backpropagation algorithm?

Backpropagation algorithms are used extensively to train feedforward neural networks in areas such as deep

learning

14. Unsupervised Learning with deep network


Unsupervised learning, also known as unsupervised machine learning, uses machine learning algorithms to
analyze and cluster unlabeled datasets.
These algorithms discover hidden patterns or data groupings without the need for human intervention.
Its ability to discover similarities and differences in information make it the ideal solution for exploratory data
analysis, cross-selling strategies, customer segmentation, and image recognition.
Applications of Unsupervised Machine Learning

News Sections: Google News uses unsupervised learning to categorize articles on the same story from various online
news outlets. For example, the results of a presidential election could be categorized under their label for “US” news.
Computer vision: Unsupervised learning algorithms are used for visual perception tasks, such as object recognition.
Medical imaging: Unsupervised machine learning provides essential features to medical imaging devices, such as
image detection, classification and segmentation, used in radiology and pathology to diagnose patients quickly and
accurately.
Anomaly detection: Unsupervised learning models can comb through large amounts of data and discover atypical data
points within a dataset. These anomalies can raise awareness around faulty equipment, human error, or breaches in
security.
Customer personas: Defining customer personas makes it easier to understand common traits and business clients'
purchasing habits. Unsupervised learning allows businesses to build better buyer persona profiles, enabling
organizations to align their product messaging more appropriately.
Recommendation Engines: Using past purchase behavior data, unsupervised learning can help to discover data trends
that can be used to develop more effective cross-selling strategies. This is used to make relevant add-on
recommendations to customers during the checkout process for online retailers.

In unsupervised learning, the model is provided with a set of unlabelled data, which it is required to analyze and find patterns in.

Examples are dimensionality reduction and clustering. The machine is trained with a group of data that has not been labeled, classified, or categorized.

The objective of unsupervised learning is to restructure the input records into new features or a set of objects with similar patterns.
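A minimal clustering sketch, assuming a small made-up unlabeled dataset with two obvious groups (a plain k-means loop written for illustration only):

import numpy as np

def kmeans(X, k=2, iters=10, seed=0):
    # Discover groupings in unlabeled data without any human-provided labels.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center, then recompute the centers
        labels = np.argmin(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5.0])
labels, centers = kmeans(X, k=2)
print(centers.round(1))   # two cluster centers, one near (0, 0) and one near (5, 5)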

15 .Autoencoders
Autoencoders are a type of deep learning algorithm that are designed to receive an input and transform it
into a different representation. They play an important part in image construction.
Autoencoders are very useful in the field of unsupervised machine learning. They can compress the data and reduce its dimensionality. An Autoencoder is a type of neural network that can learn to reconstruct images, text, and other data from compressed versions of themselves.

An Autoencoder consists of three layers:

Encoder

Code

Decoder
The Encoder layer compresses the input image into a latent space representation. It encodes the input
image as a compressed representation in a reduced dimension.

The compressed image is a distorted version of the original image.

The Code layer represents the compressed input fed to the decoder layer.

The decoder layer decodes the encoded image back to the original dimension. The decoded image is reconstructed from the latent space representation and is a lossy reconstruction of the original image.

Types of Autoencoders

Under Complete Autoencoders

Sparse Autoencoders

Contractive Autoencoders

Denoising Autoencoders

Variational Autoencoders
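A minimal fully connected autoencoder sketch in PyTorch, assuming flattened 784-dimensional inputs (e.g. 28 x 28 images) and an arbitrary 32-dimensional code size:

import torch
from torch import nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Encoder: compresses the input into a latent-space representation (the code)
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))
        # Decoder: reconstructs the input from the code (a lossy reconstruction)
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim), nn.Sigmoid())

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code)

model = Autoencoder()
x = torch.rand(16, 784)                  # a batch of fake flattened images
loss = nn.MSELoss()(model(x), x)         # reconstruction error drives the training
loss.backward()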

16. What is Principal Component Analysis?

The Principal Component Analysis is a popular unsupervised learning technique for reducing the dimensionality of
data. It increases interpretability yet, at the same time, it minimizes information loss. It helps to find the most significant
features in a dataset and makes the data easy for plotting in 2D and 3D. PCA helps in finding a sequence of linear
combinations of variables.

PCA generally tries to find the lower-dimensional surface to project the high-dimensional data.

The PCA algorithm is based on some mathematical concepts such as:

Variance and Covariance

Eigenvalues and Eigenvectors


In the above figure, we have several points plotted on a 2-D plane. There are two principal components. PC1 is the
primary principal component that explains the maximum variance in the data. PC2 is another principal component that
is orthogonal to PC1.

Some real-world applications of PCA are image processing, movie recommendation system, optimizing
the power allocation in various communication channels

Steps for the PCA algorithm:

Getting the dataset
Representing data into a structure
Standardizing the data
Calculating the covariance of Z
Calculating the eigenvalues and eigenvectors
Sorting the eigenvectors
Calculating the new features or principal components
Removing less important features from the new dataset
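A short numpy sketch following these steps on a small made-up dataset (the mixing matrix and the choice of keeping two components are assumptions for illustration):

import numpy as np

X = np.random.randn(100, 3) @ np.array([[2.0, 0.0, 0.0],
                                        [0.5, 1.0, 0.0],
                                        [0.0, 0.0, 0.1]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize the data
cov = np.cov(Z, rowvar=False)                     # covariance of Z
eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues and eigenvectors
order = np.argsort(eigvals)[::-1]                 # sort eigenvectors by decreasing variance
components = eigvecs[:, order[:2]]                # keep the top 2 principal components
X_reduced = Z @ components                        # new features; less important ones dropped

print(X_reduced.shape)   # (100, 2)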

Applications of Principal Component Analysis

PCA is mainly used as the dimensionality reduction technique in various AI applications such as computer

vision, image compression, etc.

It can also be used for finding hidden patterns if data has high dimensions. Some fields where PCA is used are

Finance, data mining, Psychology, etc.


17. CNN
A CNN is a kind of network architecture for deep learning algorithms and is specifically used for image
recognition and tasks that involve the processing of pixel data. There are other types of neural networks in
deep learning, but for identifying and recognizing objects, CNNs are the network architecture of choice.
In deep learning, a convolutional neural network (CNN/ConvNet) is a class of deep neural networks, most
commonly applied to analyze visual imagery.
Convolutional neural networks are composed of multiple layers of artificial neurons.
The main advantage of using CNNs is that they do not require hand-crafted feature engineering: they learn to identify the important features in images automatically.
A convolutional neural network (CNN) typically consists of three layers: a convolutional layer, a pooling layer,
and a fully connected layer
BUILDING BLOCKS OF CNN .

Convolutional (CONV)

Activation (ACT or RELU, where we use the same or the actual activation function)

Pooling (POOL)

Fully-connected (FC)

Batch normalization (BN)

Dropout (DO)

Convolutional Layer

The CONV layer is the core building block of a Convolutional Neural Network. Conv layer is the core building

block of a CNN. All the heavy computations are performed in this layer. In this layer, the filters (also called kernels) are convolved with the input matrix. Let’s visualize this concept with a simple example.
Consider an input volume with size [32 x 32 x 3], (RGB image). If the filter size is 5x5, then each neuron in the

Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5 x 5 x 3 = 75 weights (and

+1 bias parameter). The number of filters used here will determine the depth of the output layer.

Pooling Layer - Similar to the Convolutional Layer, the Pooling layer is responsible for reducing the spatial size of the Convolved Feature. This is to decrease the computational power required to process the data by reducing the dimensions. There are two types of pooling: average pooling and max pooling.
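A short PyTorch sketch matching the example above (a 32 x 32 x 3 RGB input, 5 x 5 filters, followed by max pooling); the number of filters, 8, is an arbitrary choice:

import torch
from torch import nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5)   # 8 filters -> output depth 8
pool = nn.MaxPool2d(kernel_size=2)

x = torch.rand(1, 3, 32, 32)              # one RGB image
feature_maps = conv(x)                    # shape: (1, 8, 28, 28)
pooled = pool(feature_maps)               # shape: (1, 8, 14, 14) -- spatial size reduced

weights_per_filter = 5 * 5 * 3            # 75 weights (+1 bias) per filter, as in the text
print(feature_maps.shape, pooled.shape, weights_per_filter)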

18 .What Is Transfer Learning and It’s Working

The reuse of a pre-trained model on a new problem is known as transfer learning in machine learning. A

machine uses the knowledge learned from a prior assignment to increase prediction about a new task in

transfer learning. For example, the knowledge gained while training a classifier to predict whether an image contains food could be reused to help recognize beverages.

Transfer learning offers a number of advantages, the most important of which are reduced training

time, improved neural network performance (in most circumstances), and the absence of a large

amount of data.

To train a neural model from scratch, a lot of data is typically needed, but access to that data isn’t always

possible – this is when transfer learning comes in handy.

Because the model has already been pre-trained, a good machine learning model can be generated with fairly

little training data using transfer learning. This is especially useful in natural language processing, where huge

labelled datasets require a lot of expert knowledge. Additionally, training time is decreased because building

a deep neural network from the start of a complex task can take days or even weeks.
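A hedged sketch of the usual workflow, assuming a pre-trained torchvision ResNet-18 backbone reused for a hypothetical 5-class task (the weights argument name may vary with the torchvision version):

import torch
from torch import nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")   # pre-trained ImageNet backbone

for param in model.parameters():                   # freeze the pre-trained layers
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 5)      # replace the final layer for the new problem

# Only the new layer is trained, so far less data and training time are needed.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)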

19. Momentum optimizer

Momentum is an extension to the gradient descent optimization algorithm, often referred to as gradient
descent with momentum.

It is designed to accelerate the optimization process, e.g. decrease the number of function evaluations required
to reach the optima, or to improve the capability of the optimization algorithm, e.g. result in a better final result.

A problem with the gradient descent algorithm is that the progression of the search can bounce around the

search space based on the gradient. For example, the search may progress downhill towards the minima, but

during this progression, it may move in another direction, even uphill, depending on the gradient of specific points
(sets of parameters) encountered during the search.

This can slow down the progress of the search, especially for those optimization problems where the broader
trend or shape of the search space is more useful than specific gradients along the way.

One approach to this problem is to add history to the parameter update equation based on the gradient
encountered in the previous updates.

This change is based on the metaphor of momentum from physics where acceleration in a direction can be
accumulated from past updates.

The name momentum derives from a physical analogy, in which the negative gradient is a force moving a particle
through parameter space, according to Newton’s laws of motion.

Momentum involves adding an additional hyperparameter that controls the amount of history (momentum) to

include in the update equation, i.e. the step to a new point in the search space. The value for the hyperparameter

is defined in the range 0.0 to 1.0 and often has a value close to 1.0, such as 0.8, 0.9, or 0.99. A momentum of 0.0
is the same as gradient descent without momentum.

First, let’s break the gradient descent update equation down into two parts: the calculation of the change to the
position and the update of the old position to the new position.

The change in the parameters is calculated as the gradient for the point scaled by the step size.
 change_x = step_size * f'(x)

The new position is calculated by simply subtracting the change from the current point

 x = x – change_x

Momentum involves maintaining the change in the position and using it in the subsequent calculation of the
change in position.

If we think of updates over time, then the update at the current iteration or time (t) will add the change used at the
previous time (t-1) weighted by the momentum hyperparameter, as follows:

 change_x(t) = step_size * f'(x(t-1)) + momentum * change_x(t-1)

The update to the position is then performed as before.

 x(t) = x(t-1) – change_x(t)

The change in the position accumulates magnitude and direction of changes over the iterations of the search,
proportional to the size of the momentum hyperparameter.

For example, a large momentum (e.g. 0.9) will mean that the update is strongly influenced by the previous
update, whereas a modest momentum (0.2) will mean very little influence.

The momentum algorithm accumulates an exponentially decaying moving average of past gradients and
continues to move in their direction.
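A runnable sketch of the two-part update above, assuming the toy function f(x) = x^2 (so f'(x) = 2x) and arbitrary step size and momentum values:

def f_grad(x):
    return 2.0 * x

x, change_x = 5.0, 0.0
step_size, momentum = 0.1, 0.9

for t in range(200):
    change_x = step_size * f_grad(x) + momentum * change_x   # accumulate past updates
    x = x - change_x

print(round(x, 4))   # converges toward the minimum at x = 0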

20 RMSprop Optimizer

The RMSprop optimizer is similar to the gradient descent algorithm with momentum. The RMSprop optimizer

restricts the oscillations in the vertical direction. Therefore, we can increase our learning rate and our algorithm

could take larger steps in the horizontal direction converging faster. The difference between RMSprop and gradient

descent is on how the gradients are calculated. The following equations show how the gradients are calculated for

the RMSprop and gradient descent with momentum. The value of momentum is denoted by beta and is usually set

to 0.9. If you are not interested in the math behind the optimizer, you can just skip the following equations.
Gradient descent with momentum:

v_dw = beta * v_dw + (1 - beta) * dw
v_db = beta * v_db + (1 - beta) * db
w = w - learning_rate * v_dw
b = b - learning_rate * v_db

RMSprop optimizer:

v_dw = beta * v_dw + (1 - beta) * dw^2
v_db = beta * v_db + (1 - beta) * db^2
w = w - learning_rate * dw / (sqrt(v_dw) + epsilon)
b = b - learning_rate * db / (sqrt(v_db) + epsilon)
Sometimes the value of v_dw could be really close to 0. Then, the value of our weights could blow up. To prevent
the gradients from blowing up, we include a parameter epsilon in the denominator which is set to a small value.
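A minimal RMSprop sketch on an assumed 2-D quadratic bowl f(w) = w1^2 + 10*w2^2, where the steep direction would otherwise cause oscillations (beta, learning rate and epsilon are illustrative values):

import numpy as np

def grad(w):
    return np.array([2.0 * w[0], 20.0 * w[1]])

w = np.array([5.0, 5.0])
v = np.zeros(2)                      # running average of squared gradients
beta, lr, eps = 0.9, 0.1, 1e-8

for _ in range(200):
    g = grad(w)
    v = beta * v + (1 - beta) * g ** 2        # restrain oscillations along the steep direction
    w = w - lr * g / (np.sqrt(v) + eps)       # effective step size adapts per parameter

print(np.round(w, 3))   # moves toward the minimum at [0, 0]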

21.The Adam optimization algorithm

is an extension to stochastic gradient descent that has recently seen broader adoption for deep learning
applications in computer vision and natural language processing.

Adam is an optimization algorithm that can be used instead of the classical stochastic gradient descent

procedure to update network weights iteratively based on training data. Adam is different from classical stochastic gradient descent.

Stochastic gradient descent maintains a single learning rate (termed alpha) for all weight updates and the
learning rate does not change during training.

In contrast, Adam maintains a learning rate for each network weight (parameter) and separately adapts it as learning unfolds.

The method computes individual adaptive learning rates for different parameters from estimates of first and
second moments of the gradients.

The authors describe Adam as combining the advantages of two other extensions of stochastic gradient descent.
Specifically:
 Adaptive Gradient Algorithm (AdaGrad) that maintains a per-parameter learning rate that improves
performance on problems with sparse gradients (e.g. natural language and computer vision problems).
 Root Mean Square Propagation (RMSProp) that also maintains per-parameter learning rates that are
adapted based on the average of recent magnitudes of the gradients for the weight (e.g. how quickly it is
changing). This means the algorithm does well on online and non-stationary problems (e.g. noisy).

Adam realizes the benefits of both AdaGrad and RMSProp.

In addition to adapting the parameter learning rates based on the average of the second moments of the gradients (the uncentered variance), as in RMSProp, Adam also makes use of the average of the first moments of the gradients (the means).

Specifically, the algorithm calculates an exponential moving average of the gradient and the squared gradient,
and the parameters beta1 and beta2 control the decay rates of these moving averages.

Because the moving averages are initialized at zero and beta1 and beta2 are close to 1.0 (as recommended), the moment estimates are biased towards zero. This bias is overcome by first calculating the biased estimates and then calculating bias-corrected estimates.
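A minimal Adam update sketch on the assumed toy function f(w) = w^2, including the bias correction described above (alpha, beta1, beta2 are the commonly quoted defaults-like values used only for illustration):

import numpy as np

def grad(w):
    return 2.0 * w

w, m, v = 5.0, 0.0, 0.0
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 201):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g            # exponential moving average of the gradient
    v = beta2 * v + (1 - beta2) * g ** 2       # exponential moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)               # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)               # bias-corrected second moment
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)

print(round(w, 4))   # moves toward the minimum at w = 0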

22. Dropout optimization

Dropout layers have been the go-to method to reduce the overfitting of neural networks. It is the underworld king of regularisation in the modern era of deep learning. The term “dropout” refers to dropping out the nodes (input and hidden layer) in a neural network (as seen in Figure 1). All the forward and backward connections with a dropped node are temporarily removed, thus creating a new network architecture out of the parent network. The nodes are dropped with a dropout probability of p.

Let’s try to understand with a given input x: {1, 2, 3, 4, 5} to the fully connected layer. We have a dropout layer with probability p = 0.2 (or keep probability = 0.8). During the forward propagation (training) from the input x, 20% of the nodes would be dropped, i.e. the x could become {1, 0, 3, 4, 5} or {1, 2, 0, 4, 5} and so on. Similarly, it applies to the hidden layers. For instance, if the hidden layers have 1000 neurons (nodes) and dropout is applied with drop probability = 0.5, then 500 neurons would be randomly dropped in every iteration (batch).

Generally, for the input layers, the keep probability, i.e. 1 − drop probability, is closer to 1, 0.8 being the best as suggested by the authors. For the hidden layers, the greater the drop probability, the more sparse the model, where 0.5 is the most optimised keep probability, which means dropping 50% of the nodes.

The other way is inspired by the ensemble techniques (such as AdaBoost, XGBoost, and Random Forest) where we

use multiple neural networks of different architectures. But this requires multiple models to be trained and stored,

which over time becomes a huge challenge as the networks grow deeper.

So, we have a great solution known as Dropout Layers.


Figure 1: Dropout applied to a Standard Neural Network
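A small sketch of applying a dropout mask at training time, using the commonly used "inverted dropout" variant (the scaling by 1/(1 − p) is an assumption of that variant, not stated in the text above):

import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
p_drop = 0.2

mask = rng.random(x.shape) >= p_drop          # roughly 20% of nodes are dropped
x_train = (x * mask) / (1.0 - p_drop)         # scale at training time so expectations match

x_test = x                                    # at test time all nodes are kept, no mask applied
print(x_train)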

23. batch normalization

Batch normalization is a deep learning approach that has been shown to significantly improve the efficiency
and reliability of neural network models. It is particularly useful for training very deep networks, as it can help to
reduce the internal covariate shift that can occur during training.

 Batch normalization is a supervised learning method for normalizing the interlayer outputs of a neural
network. As a result, the next layer receives a “reset” of the output distribution from the preceding layer, allowing it
to analyze the data more effectively.

The term “internal covariate shift” is used to describe the effect that updating the parameters of the layers above it
has on the distribution of inputs to the current layer during deep learning training. This can make the optimization
process more difficult and can slow down the convergence of the model.

Since normalization guarantees that no activation value is too high or too low, and since it enables each layer to
learn independently from the others, this strategy leads to quicker learning rates.

By standardizing inputs, the “dropout” rate (the amount of information lost between processing stages) may be
decreased. That ultimately leads to a vast increase in precision across the board.

How does batch normalization work?


Batch normalization is a technique used to improve the performance of a deep learning network by first removing the batch mean and then dividing by the batch standard deviation.

Stochastic gradient descent is used to rectify this standardization if the loss function is too big, by shifting or
scaling the outputs by a parameter, which in turn affects the accuracy of the weights in the following layer.

When applied to a layer, batch normalization multiplies its output by a standard deviation parameter (gamma) and
adds a mean parameter (beta) to it as a secondary trainable parameter. Data may be “denormalized” by adjusting
just these two weights for each output, thanks to the synergy between batch normalization and gradient descents.
Reduced data loss and improved network stability were the results of adjusting the other relevant weights.
The goal of batch normalization is to stabilize the training process and improve the generalization ability
of the model. It can also help to reduce the need for careful initialization of the model’s weights and can allow the
use of higher learning rates, which can speed up the training process.

It is common practice to apply batch normalization prior to a layer’s activation function, and it is commonly used in
tandem with other regularization methods like a dropout. It is a widely used technique in modern deep learning and
has been shown to be effective in a variety of tasks, including image classification, natural language processing,
and machine translation.

Advantages of batch normalization

 Stabilize the training process. Batch normalization can help to reduce the internal covariate shift that occurs
during training, which can improve the stability of the training process and make it easier to optimize the model.

 Improves generalization. By normalizing the activations of a layer, batch normalization can help to
reduce overfitting and improve the generalization ability of the model.

 Reduces the need for careful initialization. Batch normalization can help reduce the sensitivity of the model to
the initial weights, making it easier to train the model.

 Allows for higher learning rates. Batch normalization can allow the use of higher learning rates that can speed
up the training process.

Batch normalization overfitting


While batch normalization can help to reduce overfitting, it is not a guarantee that a model will not overfit.
Overfitting can still occur if the model is too complex for the amount of training data, if there is a lot of noise in the
data, or if there are other issues with the training process. It is important to use other regularization techniques like
dropout, and to monitor the performance of the model on a validation set during training to ensure that it is not
overfitting.

Batch normalization equations


During training, the activations of a layer are normalized for each mini-batch of data using the following equations:

 Mean: mean = 1/m ∑i=1 to m xi

 Variance: variance = 1/m ∑i=1 to m (xi – mean)^2

 Normalized activations: yi = (xi – mean) / sqrt(variance + ε)

 Scaled and shifted activations: zi = γyi + β, where γ and β are learned parameters

During inference, the activations of a layer are normalized using the mean and variance of the activations
calculated during training, rather than using the mean and variance of the mini-batch:

 Normalized activations: yi = (xi – mean) / sqrt(variance + ε)

 Scaled and shifted activations: zi = γyi + β
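A minimal numpy sketch of the training-time equations above, assuming a mini-batch of activations from a fully connected layer (gamma and beta are fixed here only for illustration; in practice they are trained):

import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x: activations of one layer for a mini-batch, shape (m, features)
    mean = x.mean(axis=0)                    # mean = 1/m * sum(x_i)
    var = x.var(axis=0)                      # variance = 1/m * sum((x_i - mean)^2)
    y = (x - mean) / np.sqrt(var + eps)      # normalized activations
    return gamma * y + beta                  # scaled and shifted with gamma, beta

x = np.random.randn(32, 4) * 10 + 3          # a mini-batch of 32 activations, 4 features
out = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ~0 mean, ~1 std per feature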


24 .Instance normalization

is another term for contrast normalization, which was first coined in the StyleNet paper. Both names reveal some
information about this technique. Instance normalization tells us that it operates on a single sample. On the
other hand, contrast normalization says that it normalizes the contrast between the spatial elements of a sample.
Given a Convolution Neural Network (CNN), we can also say that IN performs intensity normalization across the
width and height of a single feature map of a single example.
To clarify how IN works, let’s consider sample feature maps that constitute an input tensor to the IN layer.
Let T be the tensor consisting of a batch of N images. Each of these images has C feature maps or channels, with height H and width W. Therefore, T is a four-dimensional tensor of shape N x C x H x W. In instance normalization, we consider one training sample and one feature map and take the mean and variance over its spatial locations (H and W).

To perform instance normalization for a single instance, that feature map is normalized using its own per-channel mean and variance.


Batch normalization

Considering the same feature maps, BN operates on one channel (one feature map) over all the training samples in the mini-batch, taking the mean and variance across the batch and spatial dimensions.
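A short numpy sketch contrasting the two, assuming a made-up batch of shape (N, C, H, W); only the axes over which the statistics are taken differ:

import numpy as np

T = np.random.randn(8, 3, 32, 32)    # batch of N=8 images, C=3 channels, H=W=32

# Instance normalization: mean/variance per sample and per channel, over the spatial axes (H, W)
in_mean = T.mean(axis=(2, 3), keepdims=True)       # shape (8, 3, 1, 1)
in_var = T.var(axis=(2, 3), keepdims=True)
T_in = (T - in_mean) / np.sqrt(in_var + 1e-5)

# Batch normalization: mean/variance per channel, over the batch and spatial axes (N, H, W)
bn_mean = T.mean(axis=(0, 2, 3), keepdims=True)    # shape (1, 3, 1, 1)
bn_var = T.var(axis=(0, 2, 3), keepdims=True)
T_bn = (T - bn_mean) / np.sqrt(bn_var + 1e-5)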


Normalization and types

Deep Learning models are creating state-of-the-art models on a number of complex tasks including speech
recognition, computer vision, machine translation, among others. However, training deep learning models such
as deep neural networks is a complex task as, during the training phase, inputs of each layer keep changing.

Normalization is an approach which is applied during the preparation of data in order to change the values of
numeric columns in a dataset to use a common scale when the features in the data have different ranges. In this
article, we will discuss the various normalization methods which can be used in deep learning models.

Let us take an example: suppose an input dataset contains data in one column with values ranging from 0 to 10 and another column with values ranging from 100,000 to 1,000,000. In this case, the input data contains a big difference in the scale of the numbers, which will eventually cause errors while combining the values as features during modelling. These issues can be mitigated by normalization, which creates new values while maintaining the general distribution of the data.

There are several approaches in normalisation which can be used in deep learning models. They are mentioned
below

Batch Normalization
Batch normalization is one of the popular normalization methods used for training deep learning models. It
enables faster and stable training of deep neural networks by stabilising the distributions of layer inputs during
the training phase. This approach is mainly related to internal covariate shift (ICS) where internal covariate shift
means the change in the distribution of layer inputs caused when the preceding layers are updated. In order to
improve the training in a model, it is important to reduce the internal co-variant shift. The batch normalization
works here to reduce the internal covariate shift by adding network layers which control the means and variances
of the layer inputs.

Advantages
The advantages of batch normalization are mentioned below:

 Batch normalization reduces the internal covariate shift (ICS) and accelerates the training of a deep
neural network
 This approach reduces the dependence of gradients on the scale of the parameters or of their initial
values which result in higher learning rates without the risk of divergence
 Batch Normalisation makes it possible to use saturating nonlinearities by preventing the network from
getting stuck in the saturated modes
Weight Normalization
Weight normalization is a process of reparameterization of the weight vectors in a deep neural network which
works by decoupling the length of those weight vectors from their direction. In simple terms, we can define weight
normalization as a method for improving the optimisability of the weights of a neural network model.

Advantages
The advantages of weight normalization are mentioned below

 Weight normalization improves the conditioning of the optimisation problem as well as speed up the
convergence of stochastic gradient descent.
 It can be applied successfully to recurrent models such as LSTMs as well as in deep reinforcement
learning or generative models
Layer Normalization
Layer normalization is a method to improve the training speed for various neural network models. Unlike batch
normalization, this method directly estimates the normalisation statistics from the summed inputs to the neurons
within a hidden layer. Layer normalization is basically designed to overcome the drawbacks of batch
normalization such as dependent on mini batches, etc.

Advantages
The advantages of layer normalization are mentioned below:

 Layer normalization can be easily applied to recurrent neural networks by computing the normalization
statistics separately at each time step
 This approach is effective at stabilising the hidden state dynamics in recurrent networks

25 Group Normalization
Group normalization can be said as an alternative to batch normalization. This approach works by dividing the
channels into groups and computes within each group the mean and variance for normalization i.e. normalising
the features within each group. Unlike batch normalization, group normalization is independent of batch sizes,
and also its accuracy is stable in a wide range of batch sizes.

Advantages
The advantages of group normalization are mentioned below:

 It has the ability to replace batch normalization in a number of deep learning tasks
 It can be easily implemented in modern libraries with just a few lines of codes
Instance Normalization
Instance normalization, also known as contrast normalization, is similar to layer normalization. Unlike batch normalization, instance normalization is applied to each single image rather than to the whole batch.

Advantages
The advantages of instance normalization are mentioned below

 This normalization simplifies the learning process of a model.


 The instance normalization can be applied at test time.

26 Deep Learning Recent Trends

1. A Residual Neural Network (a.k.a. Residual Network, ResNet) is a deep learning model in which the weight layers learn residual functions with reference to the layer inputs. A Residual Network is a network with skip connections that perform identity mappings, merged with the layer outputs by addition.

Recent trends in deep learning include using more extensive datasets and more sophisticated architectures, as well as incorporating interaction between different types of neural networks and other AI technologies, such as natural language processing and decision trees. Below, we look at several recent trends in deep learning and how they have the potential to bring about significant change.

2. Skip connections are a technique that allows convolutional neural networks (CNNs) to bypass some layers
and connect directly to deeper or shallower ones. They can improve the performance and efficiency of CNNs, but
they also have some drawbacks and limitations.
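A minimal residual block sketch in PyTorch illustrating a skip connection: the identity input is added to the layer outputs (the channel count and layer sizes are arbitrary):

import torch
from torch import nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)     # the weight layers learn a residual added to the input

x = torch.rand(1, 16, 8, 8)
print(ResidualBlock(16)(x).shape)     # torch.Size([1, 16, 8, 8])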

3. A fully connected neural network consists of a series of fully connected layers that connect every neuron in
one layer to every neuron in the other layer. The major advantage of fully connected networks is that they are
“structure agnostic” i.e. there are no special assumptions needed to be made about the input

4. Hybrid Model Integration

An application provides a mechanism for integrating hybrid models from data sources such as census, weather,

and social media into decision support tools. Moreover, it enables the creation of a new nested domain for the

location data, which can then become part of decision support systems. The results suggest that incorporating

deep learning networks into hybrid models can lead to better decisions concerning hazards and performance

measures such as growth and employment.

Hybrid models combine the benefits of symbolic AI and deep learning. Symbolic AI is a top-down approach to artificial intelligence. It is intended to endow machines with intelligence by adopting a “high-level symbolic representation of issues,” as Allen Newell and Herbert A. Simon propose in their physical symbol system theory.

5. The Vision Transformer

Commonly referred to as ViT, the Vision Transformer is an image classification model developed by researchers at Google Research. It is used in sentiment analysis, object recognition, and image captioning.

ViT consists of an input layer, a middle layer, and an output layer. The input layer contains training images that

have been labeled with one of several possible sentiments (cheerful, negative, neutral, uncertain, sad, happy,

angry). The middle layer detects the types of objects in the image. The output layer returns a confidence score

based on the kind seen by the central and input layers.


The ViT model follows several widely used deep learning architectures, including convolutional neural networks

(CNNs), with supervised learning followed by some unsupervised preprocessing, then pooling layers that blend

multiple channels into one channel before passing images to CNNs, MRFs, or other models for classification

prediction tasks.

The vision transformers allow us to design a model architecture that can deal with any input data, including

images, text, and multimedia.

6. Self-Supervised Learning

This deep, self-supervised learning approach helps in automation. Rather than depending on labeled data to train a system, it learns to categorize the raw data automatically: each part of the input can be used to predict any other part of the input. It might, for example, forecast the future based on historical records.

In a self-supervised learning system, the input is labeled either by an intelligent agent or by some external source. The output is also marked with a label that reflects the overall quality of the prediction made by the system. The algorithm used to train a self-supervised learning system is based on minimizing the error between the predicted labels and the actual labels.

A self-supervised learning system can make two basic types of errors: bias and variance.

Bias: the system tends to overestimate or underestimate the quality of its predictions.

Variance: the variation in the quality of predictions made by the system across different data instances.

A self-supervised learning algorithm will typically contain four stages (a minimal sketch follows the list below):

 Preprocessing
 Feature extraction
 Training
 Testing
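
A minimal sketch of the idea, assuming PyTorch: the labels come from the raw data itself (here, each window of a sequence predicts the next value), so no external annotation is needed. The sine-wave series is a toy stand-in for unlabeled raw data.

import torch
import torch.nn as nn

series = torch.sin(torch.linspace(0, 50, 1000))
x = series[:-1].unfold(0, 20, 1)        # preprocessing/feature extraction: windows of 20 past values
y = series[20:]                         # "label" = the next value, taken from the data itself

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):                    # training: minimize error between predicted and actual labels
    loss = nn.functional.mse_loss(model(x).squeeze(-1), y)
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))                      # testing: remaining prediction error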

7. Neuroscience Based Deep Learning

The human brain is highly complicated, with an enormous capacity for learning. In recent years, deep learning has become a prominent approach for investigating how the brain works. Neuroscience-based deep learning is a type of ML that uses data from neuroscience experiments to train artificial neural networks. It allows researchers to develop models that better explain how the brain works.

Artificial neural networks constructed on computers are comparable to those seen in human brains. As a result of this correspondence, scientists and researchers have uncovered many neurological remedies and ideas. Deep learning has given neuroscience a much-needed boost: with the deployment of progressively more robust, comprehensive, and advanced deep learning implementations and solutions, the adaptability of these models has improved significantly.

8. High-Performance NLP Models

Machine learning-based NLP is still in its early stages: there is presently no method that allows NLP systems to recognize the meanings of different words in every context and respond appropriately.

One approach to this problem is to build a model that can recognize patterns in large amounts of text (e.g., millions of documents). This is where machine learning algorithms come in, as they can automatically learn from data and train models to make predictions.
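
A minimal sketch of learning patterns from raw text, assuming scikit-learn is available; the toy sentences and labels are hypothetical, and a bag-of-words classifier stands in for a large NLP model.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great movie, loved it", "terrible plot and acting",
         "wonderful performance", "boring and predictable"]
labels = [1, 0, 1, 0]                          # 1 = positive, 0 = negative

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)                       # learn word patterns from the data
print(model.predict(["wonderful movie"]))      # likely [1]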

27. Supervised classical tasks

1. Image denoising

Denoising an image is a classical problem that researchers have been trying to solve for decades. In earlier times, researchers used filters to reduce the noise in images. These worked fairly well for images with a reasonable level of noise, but applying them would blur the image, and if the image was too noisy, the result would be so blurry that most of the critical details were lost. With the advent of deep learning techniques, it is now possible to remove blind noise from images such that the result is very close to the ground-truth image, with minimal loss of detail. Image denoising is one of the fundamental challenges in image processing and computer vision: the underlying goal is to estimate the original image by suppressing noise from a noise-contaminated version of it.
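
A minimal sketch of learning-based denoising, assuming PyTorch: a small CNN is trained to map noise-contaminated images back to their clean counterparts. Random tensors stand in for a real image dataset.

import torch
import torch.nn as nn

denoiser = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1),
)
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

clean = torch.rand(16, 1, 32, 32)               # stand-in for ground-truth images
noisy = clean + 0.1 * torch.randn_like(clean)   # noise-contaminated versions
for _ in range(100):
    loss = nn.functional.mse_loss(denoiser(noisy), clean)  # estimate the original image
    opt.zero_grad(); loss.backward(); opt.step()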

2.Semantic segmentation

Segmentation is essential for image analysis tasks. Semantic segmentation describes the process of associating
each pixel of an image with a class label (such as flower, person, road, sky, ocean, or car). Applications for
semantic segmentation include:

 Autonomous driving
 Industrial inspection
 Classification of terrain visible in satellite imagery
 Medical imaging analysis
Label Training Data for Semantic Segmentation
Large datasets enable faster and more accurate mapping to a particular input (or input aspect). Using data augmentation
provides a means of leveraging limited datasets for training: minor changes, such as translation, cropping, or flipping, yield new, distinct training images.

Train and Test a Semantic Segmentation Network


The steps for training a semantic segmentation network are as follows:
1. Analyze Training Data for Semantic Segmentation
2. Create a Semantic Segmentation Network
3. Train A Semantic Segmentation Network
4. Evaluate and Inspect the Results of Semantic Segmentation
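
The steps above can be sketched minimally as follows, assuming PyTorch; a tiny encoder-decoder network and random tensors stand in for a real segmentation network and labeled data, and per-pixel cross-entropy is used as the training loss.

import torch
import torch.nn as nn

num_classes = 3
net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                             # encoder: downsample
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2),                 # decoder: restore resolution
    nn.Conv2d(16, num_classes, 1),               # one score per class, per pixel
)
images = torch.rand(4, 3, 64, 64)                # training images
labels = torch.randint(0, num_classes, (4, 64, 64))  # a class label for every pixel

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss = nn.functional.cross_entropy(net(images), labels)  # train
opt.zero_grad(); loss.backward(); opt.step()
pred = net(images).argmax(dim=1)                 # evaluate: predicted class map per pixel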

3. Object detection

Object detection using deep learning provides a fast and accurate means to predict the location of an object in an
image. Deep learning is a powerful machine learning technique in which the object detector automatically learns
image features required for detection tasks. Several techniques for object detection using deep learning
are available such as Faster R-CNN, you only look once (YOLO) v2, YOLO v3, YOLO v4, and single
shot detection (SSD).
Applications for object detection include:

 Image classification
 Scene understanding
 Self-driving vehicles
 Surveillance
Create Training Data for Object Detection
Use a labeling app to interactively label ground truth data in a video, image sequence, image collection, or
custom data source. You can label object detection ground truth using rectangle labels, which define the position
and size of the object in the image.

Create Object Detection Network


Each object detector contains a unique network architecture. For example, the Faster R-CNN detector uses a two-stage
network for detection, whereas the YOLO v2 detector uses a single stage. Use functions
like fasterRCNNLayers or yolov2Layers to create a network. You can also design a network layer by layer using
the Deep Network Designer (Deep Learning Toolbox).

 Pretrained Deep Neural Networks (Deep Learning Toolbox)


 Design a YOLO v2 Detection Network
 Design an R-CNN, Fast R-CNN, and a Faster R-CNN Model
Train Detector and Evaluate Results
Use the trainFasterRCNNObjectDetector, trainYOLOv2ObjectDetector, trainYOLOv4ObjectDetector,
and trainSSDObjectDetector functions to train an object detector. Use
the evaluateDetectionMissRate and evaluateDetectionPrecision functions to evaluate the training results.

 Train Faster R-CNN Vehicle Detector


 Train YOLO v2 Object Detector
 Train YOLO v4 Network for Vehicle Detection
 Train SSD Object Detector
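
For comparison with the toolbox workflow above, a minimal Python sketch using a pretrained Faster R-CNN from torchvision (assuming a recent torchvision install) is shown below; it returns bounding boxes, class labels, and confidence scores for an image. The random tensor stands in for a real image.

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()  # pretrained detector
image = torch.rand(3, 480, 640)                  # stand-in for a real RGB image in [0, 1]
with torch.no_grad():
    out = detector([image])[0]                   # detections for the single image
print(out["boxes"].shape, out["labels"][:5], out["scores"][:5])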
28. LSTM network
LSTM stands for long short-term memory network, used in the field of deep learning. It is a variety of recurrent
neural network (RNN) capable of learning long-term dependencies, especially in sequence prediction
problems. LSTM has feedback connections, i.e., it is capable of processing entire sequences of data rather
than single data points such as images. This finds application in speech recognition, machine translation, etc.
LSTM is a special kind of RNN that shows outstanding performance on a large variety of problems.

LSTM Applications

LSTM networks find useful applications in the following areas:

 Language modeling

 Machine translation

 Handwriting recognition

 Image captioning

 Image generation using attention models

 Question answering

 Video-to-text conversion

 Polyphonic music modeling

 Speech synthesis

 Protein secondary structure prediction

This list gives an idea of the areas in which LSTM is employed, but not how exactly it is used. Let’s understand the types of sequence learning problems that LSTM networks are capable of addressing.

LSTM neural networks can solve numerous tasks that are not solvable by earlier learning algorithms such as plain RNNs. Long-term temporal dependencies can be captured effectively by LSTM without suffering much from optimization hurdles, which is why it is used to address these demanding problems.

The Logic Behind LSTM

The central role in an LSTM model is held by a memory cell known as the ‘cell state’ that maintains its state over time. In typical LSTM diagrams, the cell state is drawn as a horizontal line running through the top of the cell; it can be visualized as a conveyor belt through which information flows largely unchanged.

Information can be added to or removed from the cell state, and this is regulated by gates. These gates optionally let information flow into and out of the cell. Each gate consists of a sigmoid neural net layer and a pointwise multiplication operation that assist the mechanism.

The sigmoid layer gives out numbers between zero and one, where zero means ‘nothing should be let through,’ and one means ‘everything should be let through.’
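
A minimal sketch of this gate logic for one LSTM time step, written with NumPy only; the random weight matrices are stand-ins for learned parameters, and biases are omitted for brevity.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

hidden = 4
x_t = np.random.randn(3)                                 # current input
h_prev, c_prev = np.zeros(hidden), np.zeros(hidden)      # previous hidden and cell state
W = {g: np.random.randn(hidden, hidden + 3) for g in "fioc"}
z = np.concatenate([h_prev, x_t])

f = sigmoid(W["f"] @ z)         # forget gate: 0 = "let nothing through", 1 = "let everything through"
i = sigmoid(W["i"] @ z)         # input gate
o = sigmoid(W["o"] @ z)         # output gate
c_tilde = np.tanh(W["c"] @ z)   # candidate values

c_t = f * c_prev + i * c_tilde  # cell state updated via pointwise multiplication
h_t = o * np.tanh(c_t)          # new hidden state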

29. Generative modelling with DL

A generative model is a powerful way of learning any kind of data distribution using unsupervised learning, and it has achieved tremendous success in just a few years. All types of generative models aim at learning the true data distribution of the training set so as to generate new data points with some variations. But it is not always possible to learn the exact distribution of our data, either implicitly or explicitly, so we try to model a distribution that is as similar as possible to the true data distribution. For this, we can leverage the power of neural networks to learn a function that approximates the model distribution to the true distribution.

Two of the most commonly used and efficient approaches are Variational Autoencoders (VAE) and Generative Adversarial Networks (GAN).

[Figure: building blocks of a Generative Adversarial Network]


1. Variational autoencoder

The variational autoencoder was proposed in 2013 by Kingma and Welling. A variational autoencoder (VAE) provides a probabilistic manner of describing an observation in latent space. Thus, rather than building an encoder that outputs a single value to describe each latent state attribute, we formulate our encoder to describe a probability distribution for each latent attribute.

It has many applications, such as data compression and synthetic data generation.
Architecture:
Autoencoders are a type of neural network that learns data encodings from a dataset in an unsupervised way. An autoencoder basically contains two parts: the first is an encoder, which is similar to a convolutional neural network except for the last layer. The aim of the encoder is to learn an efficient data encoding from the dataset and pass it into a bottleneck architecture. The other part of the autoencoder is a decoder that uses the latent space in the bottleneck layer to regenerate images similar to those in the dataset. The reconstruction error is backpropagated through the network in the form of a loss function.

A variational autoencoder differs from a plain autoencoder in that it provides a statistical manner of describing the samples of the dataset in latent space. Therefore, in a variational autoencoder, the encoder outputs a probability distribution in the bottleneck layer instead of a single output value.
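
A minimal sketch of this idea, assuming PyTorch: the encoder outputs a mean and log-variance per latent attribute (a distribution rather than a single value), a latent sample is drawn via the reparameterization trick, and the decoder reconstructs the input. Layer sizes are illustrative only.

import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, in_dim=784, latent=8):
        super().__init__()
        self.enc = nn.Linear(in_dim, 64)
        self.mu, self.logvar = nn.Linear(64, latent), nn.Linear(64, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(),
                                 nn.Linear(64, in_dim), nn.Sigmoid())

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)                  # distribution per latent attribute
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # sample from the latent distribution
        return self.dec(z), mu, logvar

x = torch.rand(16, 784)
recon, mu, logvar = TinyVAE()(x)
kl = -0.5 * torch.sum(1 + logvar - mu**2 - logvar.exp())         # KL regularizer on the latent space
loss = nn.functional.binary_cross_entropy(recon, x, reduction="sum") + kl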

2. A Generative Adversarial Network (GAN) is a deep learning architecture that consists of two neural
networks competing against each other in a zero-sum game framework. The goal of GANs is to generate new,
synthetic data that resembles some known data distribution.
What is a Generative Adversarial Network
Generative Adversarial Networks (GANs) are a powerful class of neural networks that are used
for unsupervised learning. They were introduced by Ian J. Goodfellow and colleagues in 2014. GANs are basically
made up of a system of two competing neural network models which are able to analyze, capture, and copy the
variations within a dataset.

Generative Adversarial Networks (GANs) can be broken down into three parts:

 Generative: To learn a generative model, which describes how data is generated in terms of a
probabilistic model.
 Adversarial: The training of a model is done in an adversarial setting.
 Networks: Use deep neural networks as artificial intelligence (AI) algorithms for training purposes.
 In GANs, there is a Generator and a Discriminator. The Generator generates fake samples of
data (be it an image, audio, etc.) and tries to fool the Discriminator. The Discriminator, on the other
hand, tries to distinguish between real and fake samples. The Generator and the Discriminator
are both neural networks, and they run in competition with each other during the training phase. The
steps are repeated several times, and the Generator and Discriminator get better and better at
their respective jobs after each repetition. The work can be visualized by the diagram given below:

[Figure: Generative Adversarial Network architecture and its components]

 Here, the generative model captures the distribution of the data and is trained in such a manner that it
tries to maximize the probability of the Discriminator making a mistake. The Discriminator, on the
other hand, is based on a model that estimates the probability that the sample it receives came
from the training data and not from the Generator. The GAN is formulated as a minimax game,
where the Discriminator tries to maximize its reward V(D, G) while the Generator tries to
minimize it, in other words, to maximize the Discriminator’s loss. It can be mathematically
described by the formula below:
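
The standard form of this value function, as given in the original GAN paper, is:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]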


where,

 G = Generator
 D = Discriminator
 Pdata(x) = distribution of real data
 P(z) = distribution of the generator’s input noise
 x = sample from Pdata(x)
 z = sample from P(z)
 D(x) = Discriminator’s output for a sample x (the estimated probability that x is real)
 G(z) = Generator’s output for a noise sample z
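
A minimal sketch of this adversarial training loop, assuming PyTorch; tiny fully connected networks and a toy 2-D data distribution stand in for real images and deep models.

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))                 # Generator
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())   # Discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(32, 2) + 3.0          # samples from the real data distribution Pdata(x)
    z = torch.randn(32, 4)                   # noise samples from P(z)
    fake = G(z)

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool the Discriminator, i.e. push D(G(z)) toward 1
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()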
