Deep Learning Answers
Feature vector
A feature vector is an ordered list of numerical properties of observed phenomena. It represents input features
to a machine learning model that makes a prediction.
For example, we see the cloudy sky, feel the damp breeze, and decide to take an umbrella when going outside. Our five
senses can transform outside stimuli into neural activity in our brains, handling multiple inputs as they occur in no
particular order.
However, machine learning models can only deal with quantitative data.
As such, we must always convert features of observed phenomena into numerical values and feed them into a
machine learning model in the same order. In short, we must represent features in feature vectors.
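As a minimal sketch of the idea, the snippet below turns named measurements into a feature vector with a fixed order. The feature names (cloud_cover, humidity, wind_speed) are hypothetical and chosen only to echo the umbrella example.

```python
# Hypothetical feature names for the umbrella example; the order is fixed once.
FEATURE_ORDER = ["cloud_cover", "humidity", "wind_speed"]

def to_feature_vector(observation):
    """Turn a dict of named measurements into an ordered list of floats."""
    return [float(observation[name]) for name in FEATURE_ORDER]

# Measurements may arrive in any order; the vector always follows FEATURE_ORDER.
obs = {"humidity": 0.9, "wind_speed": 12, "cloud_cover": 0.8}
vector = to_feature_vector(obs)
print(vector)
```

Whatever order the measurements arrive in, the model always receives them in the same positions.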
Generative models:
4. Unsupervised machine learning problems benefit from the usage of generative models.
5. Outliers have a greater influence on generative models than on discriminative ones.
Discriminative models:
4. For supervised machine learning tasks, discriminative models are helpful.
5. Unlike generative models, discriminative models have the advantage of being more resistant to outliers.
3. Explain feature space representation.
A feature space is just the set of all possible values for a chosen set of features from that data.
It refers to the n-dimensions where your variables live (not including a target variable, if it is present).
The term is used often in ML literature because a task in ML is feature extraction, hence we view all variables as
features.
For example, consider the data set with:
Target
Y ≡ Thickness of car tires after some testing period
Variables
X1 ≡ distance travelled in test
X2 ≡ time duration of test
X3 ≡ amount of chemical C in tires.
The feature space is R3, or more accurately the positive part of R3, as all the X variables can only be positive quantities. Domain knowledge about tires
might suggest that the *speed* the vehicle was moving at is important, hence we generate another variable, X4 (this
is the feature extraction part): X4 = X1/X2 the speed of the vehicle during testing. This extends our old feature space
into a new one, the positive part of R4.
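The feature extraction step can be sketched in a few lines. The sample rows are made-up values for illustration only; the function name add_speed_feature is likewise hypothetical.

```python
def add_speed_feature(row):
    """Extend (X1, X2, X3) with the derived feature X4 = X1 / X2 (speed)."""
    x1, x2, x3 = row
    return (x1, x2, x3, x1 / x2)

# Made-up rows: (distance travelled, test duration, amount of chemical C).
rows = [(120.0, 2.0, 0.3), (300.0, 4.0, 0.5)]
extended = [add_speed_feature(r) for r in rows]
print(extended)
```

Each row now lives in a four-dimensional feature space instead of a three-dimensional one.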
4. What is Bayesian Learning? Explain Bayes Minimum Error Classifier and Minimum Risk Classifier.
Bayesian learning is a learning technique that determines model parameters (such as the network weights) by maximizing the posterior probability of the parameters given the training data.
The idea is that some parameter values are more consistent with the observed data than others.
By Bayes’ rule, the posterior probability is proportional to the likelihood, defined as the conditional probability of the training data given the model parameters, multiplied by the prior probability of the parameters, so maximizing the posterior amounts to maximizing this product.
In Bayesian learning, prior knowledge is provided by asserting – a prior probability for each candidate hypothesis, and –
a probability distribution over observed data for each possible hypothesis.
The minimum overall risk $R^*$ is called the Bayes risk. It is obtained by choosing, for every x, the action αi that minimizes the conditional risk $R(\alpha_i \mid x) = \sum_j \lambda_{ij} P(\omega_j \mid x)$.
Under the zero-one loss ($\lambda_{ij} = 0$ if $i = j$ and 1 otherwise) this becomes $R(\alpha_i \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$, so R(αi|x) is minimum for the decision i for which the posterior P(ωi|x) is maximum. This is the same decision rule as the Bayes minimum error classifier.
In the two-category case, if the loss for one action is greater than the other, the regions for that action will shrink.
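The decision rule can be illustrated with a small sketch. The posteriors and loss matrices below are invented numbers; loss[i][j] stands for λij, the loss of taking action i when the true class is j.

```python
def conditional_risks(loss, posteriors):
    """R(a_i|x) = sum_j loss[i][j] * P(w_j|x), computed for every action i."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, posteriors)) for row in loss]

def min_risk_action(loss, posteriors):
    """Index of the action with the smallest conditional risk."""
    risks = conditional_risks(loss, posteriors)
    return min(range(len(risks)), key=risks.__getitem__)

posteriors = [0.7, 0.3]            # illustrative P(w_0|x), P(w_1|x)
zero_one = [[0, 1], [1, 0]]        # zero-one loss: minimum error classifier
asymmetric = [[0, 5], [1, 0]]      # action 0 is costly when the truth is class 1

print(min_risk_action(zero_one, posteriors))
print(min_risk_action(asymmetric, posteriors))
```

With the zero-one loss the rule picks the class with the largest posterior. With the asymmetric loss the same posteriors lead to the other action, which is the shrinking of a decision region described above.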
4. What is a discriminant function? Explain the discriminant function under the multivariate normal distribution.
Discriminant functions are used to make decisions with the minimum probability of error. In a problem with feature vector Y and state of nature variable w, we can represent the discriminant function as:
$g_i(Y) = \ln p(Y \mid w_i) + \ln P(w_i)$
Here $p(Y \mid w_i)$ is the conditional probability density function for Y given that the state of nature is $w_i$, and $P(w_i)$ is the prior probability that nature is in state $w_i$. Suppose $p(Y \mid w_i)$ is a multivariate normal density, that is, $p(Y \mid w_i) = N(\mu_i, \Sigma_i)$. In the simplest case, where every class has covariance $\Sigma_i = \sigma^2 I$, the terms that are the same for all classes can be dropped and the discriminant function becomes:
$g_i(Y) = -\frac{\lVert Y - \mu_i \rVert^2}{2\sigma^2} + \ln P(w_i)$
The multivariate normal distribution is a generalization of the univariate normal distribution to two or more
variables. It is a distribution for random vectors of correlated variables, where each vector element has a univariate
normal distribution.
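Since the class-conditional density p(x|ωi) follows the multivariate normal distribution, our discriminant function can be written as
$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln\lvert\Sigma_i\rvert + \ln P(\omega_i)$
where d is the dimension of x.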
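A minimal sketch of a Gaussian discriminant classifier follows. It assumes the simplest special case, in which every class shares the spherical covariance σ²I, so that g_i(x) = -||x - μi||² / (2σ²) + ln P(ωi). The means and priors are invented for illustration.

```python
import math

def discriminant(x, mean, sigma, prior):
    """g_i(x) = -||x - mu_i||^2 / (2 sigma^2) + ln P(w_i), spherical case only."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mean))
    return -sq_dist / (2.0 * sigma ** 2) + math.log(prior)

def classify(x, classes, sigma):
    """Pick the class index with the largest discriminant value."""
    scores = [discriminant(x, mean, sigma, prior) for mean, prior in classes]
    return max(range(len(scores)), key=scores.__getitem__)

# Each class is (mean vector, prior probability); the values are illustrative.
classes = [((0.0, 0.0), 0.5), ((4.0, 4.0), 0.5)]
print(classify((1.0, 1.0), classes, sigma=1.0))
print(classify((3.0, 3.5), classes, sigma=1.0))
```

With equal priors this reduces to picking the nearest mean; unequal priors shift the boundary toward the less likely class.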
5. Write short note on nearest neighbour rule.
K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised Learning technique.
K-NN algorithm assumes the similarity between the new case/data and available cases and puts the new case into the category it is most similar to.
K-NN algorithm can be used for Regression as well as for Classification, but mostly it is used for Classification problems.
K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and uses it at classification time.
The working of K-NN can be explained on the basis of the below algorithm:
Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance from the new data point to each training point.
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
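The steps above can be sketched as follows. The training points and labels are invented, and math.dist requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """Classify query by majority vote among its k nearest training points."""
    # Sort the training points by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Keep the labels of the k nearest neighbors and count them per category.
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Illustrative training set: (point, category).
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B")]
print(knn_classify(train, (2, 2), k=3))
print(knn_classify(train, (6, 5), k=3))
```

No training happens before the query arrives, which is why the method is called a lazy learner.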
Optimization is the process where we train the model iteratively so that a function evaluation reaches a maximum or a minimum. It is one of the most important steps in Machine Learning to get better results.
Optimization methods are used in many areas of study to find solutions that maximize or minimize some study
parameters, such as minimize costs in the production of a good or service, maximize profits, minimize raw material in
the development of a good, or maximize production.
Stochastic Gradient Descent
It’s a variant of Gradient Descent. It tries to update the model’s parameters more frequently. In this, the model
parameters are altered after computation of loss on each training example. So, if the dataset contains 1000 rows SGD will
update the model parameters 1000 times in one cycle of dataset instead of one time as in Gradient Descent.
As the model parameters are frequently updated, they have high variance, and the loss fluctuates.
Advantages:
Disadvantages:
To get the same convergence as gradient descent, the learning rate needs to be reduced slowly.
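A minimal sketch of per-example updates with a slowly decaying learning rate is shown below. It fits a single weight w to made-up data generated from y = 2x. The decay schedule is an assumption chosen for illustration, and the examples are visited in a fixed order (not shuffled) to keep the run reproducible.

```python
def sgd_epoch(w, data, lr):
    """One pass over the data with one parameter update per training example."""
    updates = 0
    for x, y in data:
        grad = 2.0 * (w * x - y) * x   # derivative of (w*x - y)^2 with respect to w
        w -= lr * grad
        updates += 1
    return w, updates

# Ten made-up examples from y = 2x.
data = [(i / 10.0, 2.0 * i / 10.0) for i in range(1, 11)]

w = 0.0
for epoch in range(50):
    lr = 0.1 / (1 + 0.1 * epoch)       # assumed schedule: slowly reduce the learning rate
    w, updates = sgd_epoch(w, data, lr)

print(updates, w)
```

With ten rows, the weight is updated ten times per pass, whereas plain gradient descent would update it once per pass.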
Batch Optimization:
Batch optimization is a technique where a batch of samples is used to calculate the gradient and update the model
parameters. In other words, the model is trained using a fixed set of samples in each iteration. Batch optimization is
computationally efficient and can help in finding the optimum solution. However, it might lead to overfitting and slow
convergence if the batch size is too large or too small.
Overall, optimization techniques like gradient descent and batch optimization play a crucial role in deep learning. They
help to minimize the error and find the best set of parameters for the given problem.
The minimum distance classifier is used to classify unknown image data to the class that minimizes the distance between the image data and the class in multi-feature space. The distance is defined as an index of similarity, so that the minimum distance is identical to the maximum similarity.
Euclidean distance
Used in cases where the variances of the population classes are different from each other. The Euclidean distance is theoretically identical to the similarity index.
$d_k^2 = (X - \mu_k)^T (X - \mu_k)$
Mahalanobis distance
Used in cases where there is correlation between the axes in feature space.
$d_k^2 = (X - \mu_k)^T \Sigma_k^{-1} (X - \mu_k)$
where X is the vector of image data, μk is the mean of the kth class, σk is the variance matrix and Σk is the variance-covariance matrix.
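A minimal sketch of a minimum distance classifier using the Euclidean distance to each class mean is given below. The class names and sample vectors are invented, and the Mahalanobis variant is not shown.

```python
import math

def class_means(samples):
    """samples: {label: [feature vectors]} -> {label: mean feature vector}."""
    means = {}
    for label, vectors in samples.items():
        n = len(vectors)
        means[label] = [sum(column) / n for column in zip(*vectors)]
    return means

def min_distance_classify(x, means):
    """Assign x to the class whose mean is closest in Euclidean distance."""
    return min(means, key=lambda label: math.dist(x, means[label]))

# Illustrative two-band pixel values for two classes.
samples = {"water": [(0.1, 0.2), (0.2, 0.1)],
           "forest": [(0.8, 0.9), (0.9, 0.7)]}
means = class_means(samples)
print(min_distance_classify((0.2, 0.3), means))
print(min_distance_classify((0.7, 0.8), means))
```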
The backpropagation algorithm calculates the gradient of the error function with respect to the weights of the neural network. Backpropagation algorithms are a set of methods used to efficiently train artificial neural networks following a gradient descent approach which exploits the chain rule.
The main features of backpropagation are the iterative, recursive and efficient method through which it calculates the updated weights to improve the network until it is able to perform the task for which it is being trained.
Backpropagation requires the derivatives of the activation functions to be known at network design time.
The algorithm travels back from the output layer to the hidden layers, adjusting the weights so that the error decreases.
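To make the chain rule concrete, here is a minimal sketch for a tiny network with one sigmoid hidden unit and one linear output. The architecture, the squared-error loss and all values are assumptions chosen to keep the example small.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    h = sigmoid(w1 * x)      # hidden activation
    y = w2 * h               # linear output
    return h, y

def loss(w1, w2, x, t):
    """Squared error 0.5 * (y - t)^2 for a single example."""
    return 0.5 * (forward(w1, w2, x)[1] - t) ** 2

def backward(w1, w2, x, t):
    """Gradients of the loss with respect to w1 and w2 via the chain rule."""
    h, y = forward(w1, w2, x)
    d_y = y - t                       # dL/dy at the output
    d_w2 = d_y * h                    # dL/dw2
    d_h = d_y * w2                    # error propagated back to the hidden unit
    d_w1 = d_h * h * (1.0 - h) * x    # sigmoid'(z) = h * (1 - h)
    return d_w1, d_w2

d_w1, d_w2 = backward(0.5, -0.3, 1.5, 1.0)
print(d_w1, d_w2)
```

The derivative of the activation function appears explicitly in the last step, which is why it must be known when the network is designed.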
8. What are different loss functions in back propagation learning? Explain any one.
A loss function measures the difference between our predicted and actual values. We create a loss function and find its minimum to optimize our model and improve our prediction’s accuracy.
Different loss functions are: Squared Error and Cross Entropy Loss.
Cross entropy loss is a metric used to measure how well a classification model in machine learning performs. The loss (or error) is a non-negative number, with 0 being a perfect model, and it grows without bound as the model assigns lower probability to the correct class. The goal is generally to get your model as close to 0 as possible. Cross entropy loss is often considered interchangeable with logistic loss (or log loss, and sometimes referred to as binary cross entropy loss) but this isn't always correct. Cross entropy loss measures the difference between the probability distribution predicted by a machine learning classification model and the true distribution of the labels. All possible values for the prediction are stored so, for example, if you were looking for the odds in a coin toss it would store that information at 0.5 and 0.5 (heads and tails). Binary cross entropy loss, on the other hand, stores only one value.
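For a single example the cross entropy loss is $L = -\sum_i t_i \log(p_i)$, where $t_i$ is the true probability of class i (1 for the correct class and 0 otherwise) and $p_i$ is the probability the model predicts for class i.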
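A minimal sketch of the cross entropy loss L = -Σ t_i log(p_i) and its binary form, using the natural logarithm. The probabilities are illustrative.

```python
import math

def cross_entropy(true_dist, predicted_dist):
    """L = -sum_i t_i * log(p_i); classes with t_i == 0 contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(true_dist, predicted_dist) if t > 0)

def binary_cross_entropy(y, p):
    """Binary form: only the probability p of the positive class is stored."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(cross_entropy([1, 0], [0.5, 0.5]))            # coin-toss prediction
print(cross_entropy([1, 0, 0], [0.01, 0.5, 0.49]))  # confident and wrong
print(binary_cross_entropy(1, 0.5))
```

The second call shows that the loss is not limited to 1: a confident wrong prediction is penalized heavily.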
Deep Learning is a part of Machine Learning used to solve complex problems and build intelligent solutions. The core concept of
Deep Learning has been derived from the structure and function of the human brain. Deep Learning uses artificial neural
networks to analyze data and make predictions.
Deep learning technology drives many AI applications used in everyday products, such as the following:
Digital assistants
Fraud detection
It is also a critical component of emerging technologies such as self-driving cars, virtual reality, and more.
Deep learning models are computer files that data scientists have trained to perform tasks using an algorithm or a predefined set of
steps. Businesses use deep learning models to analyze data and make predictions in various applications.
Linear classifiers are a type of machine learning algorithm used for classification tasks. In linear classification, the goal is to find a hyperplane that separates the data points into different classes. This hyperplane is represented by a linear equation: a straight line in two dimensions, a plane in three dimensions, and a hyperplane in higher dimensions.
Linear machines with hinge loss are a type of linear classifier that use a hinge loss function to separate the data points into
different classes. Hinge loss is a loss function used in machine learning for classification tasks that penalizes predictions that are
far from the true class value.
In the case of a binary classification problem, the goal of the linear machine with hinge loss is to find a hyperplane that separates
the positive and negative examples with the largest margin. The margin is defined as the perpendicular distance between the
hyperplane and the data points nearest to it.
The hinge loss for the ith data point is $\max(0, 1 - y_i(w \cdot x_i + b))$, where yi is the label (+1 or -1) of the ith data point, xi is the feature vector of the ith data point, w is the weight vector, and b is the bias term.
The hinge loss function penalizes predictions that are inside the margin or on the wrong side of the hyperplane.
The optimization problem for hinge loss involves minimizing the sum of the hinge loss function over all training examples along with
adding a regularization term to prevent overfitting. This problem is typically solved using gradient descent.
Linear machines with hinge loss are commonly used in applications such as image classification, text classification, and sentiment
analysis. Support vector machines (SVMs) are a popular example of a linear machine with hinge loss.
Linear Machines with Hinge Loss
The hinge loss is a specific type of cost function that incorporates a margin or distance from the classification boundary into the cost
calculation. Even if new observations are classified correctly, they can incur a penalty if the margin from the decision boundary is not large
enough. The hinge loss increases linearly.
The hinge loss is mostly associated with soft-margin support vector machines.
In a plot of the hinge loss, the x-axis represents the distance from the boundary of any single instance.
The y-axis represents the loss size, or penalty, that the function will incur depending on its distance.
The dotted line on the x-axis represents the number 1. This means that when an instance’s distance from the boundary is greater than or equal to 1, the loss is 0.
A negative distance from the boundary incurs a high hinge loss. This essentially means that we are on the wrong side of the boundary, and that the instance is classified incorrectly.
On the flip side, a positive distance from the boundary incurs a low hinge loss, or no hinge loss at all, and the further we are away from the boundary (and on the right side of it), the lower our hinge loss will be.
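The behaviour described above can be checked with a small sketch of the hinge loss max(0, 1 - y * score), where y is +1 or -1 and score = w·x + b. The weights and inputs are invented.

```python
def score(w, b, x):
    """Signed score w.x + b of an instance relative to the boundary."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def hinge_loss(y, s):
    """Hinge loss for label y in {+1, -1} and score s."""
    return max(0.0, 1.0 - y * s)

w, b = [1.0, -1.0], 0.0   # illustrative weight vector and bias
for x in [(3.0, 1.0), (1.5, 1.0), (1.0, 2.0)]:
    s = score(w, b, x)
    print(x, s, hinge_loss(+1, s))
```

The first instance lies beyond the margin and incurs no loss, the second is correctly classified but inside the margin, and the third is on the wrong side of the boundary.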
11. Applications of ML:
Virtual Personal Assistants
Siri, Alexa, Google Now are some of the popular examples of virtual personal assistants. As the name suggests, they assist in finding information when asked over voice. Virtual assistants are integrated into a variety of platforms. For example:
Smart Speakers: Amazon Echo and Google Home
Smartphones: Samsung Bixby on Samsung S8
Predictions while Commuting
Traffic Predictions: We all have been using GPS navigation services. While we do that, our current locations and velocities are being saved at a central server for managing traffic. This data is then used to build a map of current traffic. While this helps in preventing traffic jams and supports congestion analysis, the underlying problem is that only a small number of cars are equipped with GPS. Machine learning in such scenarios helps to estimate the regions where congestion can be found on the basis of daily experiences.
Video Surveillance
Imagine a single person monitoring multiple video cameras! Certainly, a difficult job to do and boring as well. This is why the idea of training
computers to do this job makes sense.
Social Media Services
From personalizing your news feed to better ads targeting, social media platforms are utilizing machine learning for their own and user
benefits. Here are a few examples that you must be noticing, using, and loving in your social media accounts, without realizing that these
wonderful features are nothing but the applications of ML.
Email Spam and Malware Filtering
There are a number of spam filtering approaches that email clients use. To ascertain that these spam filters are continuously updated, they are
powered by machine learning. When rule-based spam filtering is done, it fails to track the latest tricks adopted by spammers. Multi Layer
Perceptron, C 4.5 Decision Tree Induction are some of the spam filtering techniques that are powered by ML
Online Customer Support
A number of websites nowadays offer the option to chat with customer support representative while they are navigating within the site.
However, not every website has a live executive to answer your queries. In most of the cases, you talk to a chatbot.
Search Engine Result Refining
Google and other search engines use machine learning to improve the search results for you. Every time you execute a search, the algorithms
at the backend keep a watch at how you respond to the results. If you open the top results and stay on the web page for long, the search
engine assumes that the results it displayed were in accordance with the query.
Product Recommendations
You shopped for a product online a few days back and then you keep receiving emails with shopping suggestions. If not this, then you might have noticed that the shopping website or the app recommends some items that somehow match your taste.
Online Fraud Detection
Machine learning is proving its potential to make cyberspace a secure place and tracking monetary frauds online is one of its examples.
12. NEURAL NETWORK AND FEED FORWARD NEURAL NETWORK
Neural networks are used to mimic the basic functioning of the human brain and are inspired by how the
human brain interprets information.
It is used to solve various real-time tasks because of its ability to perform computations quickly and its fast
responses.
Artificial Neural Network model contains various components that are inspired by the biological nervous
system.
Artificial Neural Network has a huge number of interconnected processing elements, also known as nodes.
These nodes are connected with other nodes using connection links. Each connection link carries a weight, and these weights contain the information about the input signal.
Types of tasks that can be solved using an artificial neural network include Classification problems, Pattern Matching, Data Clustering, etc.
(i) ANN: It is also known as an artificial neural network. It is a feed-forward neural network because the inputs are sent in the forward direction. It can also contain hidden layers. It is used for Textual Data or Tabular Data. A widely used real-life application is Facial Recognition. It is comparatively less powerful than CNN and RNN.
(ii) CNN: It is also known as Convolutional Neural Networks. It is mainly used for Image Data. It is used for Computer Vision. Some of the real-life applications are object detection in autonomous vehicles. It contains a combination of convolutional layers and neurons. It is more powerful than both ANN and RNN.
(iii) RNN: It is also known as Recurrent Neural Networks. It is used to process and interpret time series data. In this type of model, the output from a processing node is fed back into nodes in the same or previous layers. The best-known type of RNN is the LSTM (Long Short Term Memory) network.
A Feed Forward Neural Network is an artificial neural network in which the connections between nodes do not form a cycle. The opposite of a feed forward neural network is a recurrent neural network, in which certain pathways are cycled.
Describe single layer perceptron and multilayer perceptron.
A single layer perceptron is a simple neural network which contains only one layer. Its computation is the weighted sum of the input vector, in which each input value is multiplied by the corresponding weight. The resulting sum is the input of an activation function, which produces the output.
The perceptron consists of 4 parts.
Input values
Weights and bias
Net sum
Activation Function
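The parts listed above can be seen in a minimal sketch of a single layer perceptron. The step activation, the perceptron learning rule and the logical AND training data are choices made for this illustration, not taken from the notes.

```python
def perceptron_output(inputs, weights, bias):
    net = sum(x * w for x, w in zip(inputs, weights)) + bias   # net sum
    return 1 if net >= 0 else 0                                # step activation

def train_perceptron(samples, epochs=10, lr=1.0):
    """Perceptron learning rule: nudge the weights by the prediction error."""
    n_inputs = len(samples[0][0])
    weights, bias = [0.0] * n_inputs, 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - perceptron_output(inputs, weights, bias)
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias

# Logical AND is linearly separable, so a single layer perceptron can learn it.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND)
print(weights, bias)
```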
A multilayer perceptron (MLP) is a type of feed-forward artificial neural network that generates a set of outputs from a set of inputs.
An MLP is a neural network connecting multiple layers in a directed graph, which means that the signal path through the nodes goes only one way.
Multi-Layer perceptron
It defines a more complex architecture than the single layer perceptron and is formed from multiple layers of perceptrons.
MLP networks are used for supervised learning. The typical learning algorithm for MLP networks is the backpropagation algorithm.
An MLP is characterized by several layers of nodes connected as a directed graph between the input and output layers. MLP uses backpropagation for training the network. MLP is a deep learning method.
This class of networks consists of multiple layers of computational units, usually interconnected in a feed-forward way. Each neuron in one layer has directed connections to the neurons of the subsequent layer.
13. Backpropagation learning
Backpropagation, or backward propagation of errors, is an algorithm that is designed to test for errors working back from output nodes to input nodes. It is an important mathematical tool for improving the accuracy of predictions.
Static backpropagation
Recurrent backpropagation
Artificial neural networks use backpropagation as a learning algorithm to compute the gradient of the error with respect to the weight values, which gradient descent then uses to update them. The algorithm gets its name because the weights are updated backward, from output to input.
It does not have any parameters to tune except for the number of inputs.
It is highly adaptable and efficient and does not require any prior knowledge about the network.
Backpropagation algorithms are used extensively to train feedforward neural networks in areas such as deep learning.
News Sections: Google News uses unsupervised learning to categorize articles on the same story from various online
news outlets. For example, the results of a presidential election could be categorized under their label for “US” news.
Computer vision: Unsupervised learning algorithms are used for visual perception tasks, such as object recognition.
Medical imaging: Unsupervised machine learning provides essential features to medical imaging devices, such as
image detection, classification and segmentation, used in radiology and pathology to diagnose patients quickly and
accurately.
Anomaly detection: Unsupervised learning models can comb through large amounts of data and discover atypical data
points within a dataset. These anomalies can raise awareness around faulty equipment, human error, or breaches in
security.
Customer personas: Defining customer personas makes it easier to understand common traits and business clients'
purchasing habits. Unsupervised learning allows businesses to build better buyer persona profiles, enabling
organizations to align their product messaging more appropriately.
Recommendation Engines: Using past purchase behavior data, unsupervised learning can help to discover data trends
that can be used to develop more effective cross-selling strategies. This is used to make relevant add-on
recommendations to customers during the checkout process for online retailers.
Unsupervised learning is when the machine is provided with a set of unlabelled data, which it is required to analyze and find patterns inside.
Examples are dimension reduction and clustering. The machine is trained with a group of data that has not been labeled, classified, or categorized.
The objective of unsupervised learning is to restructure the input records into new features or a set of objects with the same patterns.
15. Autoencoders
Autoencoders are a type of deep learning algorithm designed to receive an input and transform it into a different representation. They play an important part in image construction.
Autoencoders are very useful in the field of unsupervised machine learning. They can be used to compress the data and reduce its dimensionality. An autoencoder is a type of neural network that can learn to reconstruct images, text, and other data from compressed versions of themselves.
Encoder
Code
Decoder
The Encoder layer compresses the input image into a latent space representation. It encodes the input
image as a compressed representation in a reduced dimension.
The Code layer represents the compressed input fed to the decoder layer.
The Decoder layer decodes the encoded image back to the original dimension. The decoded image is reconstructed from the latent space representation and is a lossy reconstruction of the original image.
Types of Autoencoders
Sparse Autoencoders
Contractive Autoencoders
Denoising Autoencoders
Variational Autoencoders
The Principal Component Analysis is a popular unsupervised learning technique for reducing the dimensionality of
data. It increases interpretability yet, at the same time, it minimizes information loss. It helps to find the most significant
features in a dataset and makes the data easy for plotting in 2D and 3D. PCA helps in finding a sequence of linear
combinations of variables.
PCA generally tries to find the lower-dimensional surface to project the high-dimensional data.
Some real-world applications of PCA are image processing, movie recommendation system, optimizing
the power allocation in various communication channels
PCA is mainly used as the dimensionality reduction technique in various AI applications such as computer
It can also be used for finding hidden patterns if data has high dimensions. Some fields where PCA is used are
Convolutional (CONV)
Activation (ACT or RELU, where we use the same or the actual activation function)
Pooling (POOL)
Fully-connected (FC)
Dropout (DO)
Convolutional Layer
The CONV layer is the core building block of a Convolutional Neural Network. All the heavy computations are performed in this layer. In this layer, the filters (also called kernels) are convolved with the input matrix. Let’s visualize this concept with a simple example.
Consider an input volume with size [32 x 32 x 3], (RGB image). If the filter size is 5x5, then each neuron in the
Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5 x 5 x 3 = 75 weights (and
+1 bias parameter). The number of filters used here will determine the depth of the output layer.
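The arithmetic in this example can be reproduced with a short sketch. The choice of 10 filters, stride 1 and no padding is an assumption made for illustration; only the 32x32x3 input and the 5x5 filter come from the text.

```python
def conv_layer_shape(input_hw, in_channels, filter_size, num_filters, stride=1, padding=0):
    """Output volume, weights per filter, and total parameters of a conv layer."""
    out_hw = (input_hw - filter_size + 2 * padding) // stride + 1
    weights_per_filter = filter_size * filter_size * in_channels
    params = num_filters * (weights_per_filter + 1)   # +1 bias per filter
    return (out_hw, out_hw, num_filters), weights_per_filter, params

shape, weights_per_filter, params = conv_layer_shape(32, 3, 5, num_filters=10)
print(shape, weights_per_filter, params)
```

Each filter has 75 weights plus one bias, and the depth of the output volume equals the number of filters.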
Pooling Layer - Similar to the Convolutional Layer, the Pooling layer is responsible for reducing the spatial size of the convolved feature. This decreases the computational power required to process the data by reducing its dimensions. There are two types of pooling: average pooling and max pooling.
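Both pooling types can be sketched on a small feature map. The 4x4 values and the non-overlapping 2x2 window are illustrative assumptions.

```python
def pool2d(matrix, size=2, mode="max"):
    """Non-overlapping size x size pooling over a 2D list of numbers."""
    reduce = max if mode == "max" else (lambda window: sum(window) / len(window))
    pooled = []
    for r in range(0, len(matrix) - size + 1, size):
        row = []
        for c in range(0, len(matrix[0]) - size + 1, size):
            window = [matrix[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(reduce(window))
        pooled.append(row)
    return pooled

feature_map = [[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 1],
               [3, 4, 5, 6]]
print(pool2d(feature_map, mode="max"))   # [[6, 4], [7, 9]]
print(pool2d(feature_map, mode="avg"))   # [[3.75, 2.25], [4.0, 5.25]]
```

The 4x4 map shrinks to 2x2, so later layers process a quarter of the values.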
The reuse of a pre-trained model on a new problem is known as transfer learning in machine learning. A
machine uses the knowledge learned from a prior assignment to increase prediction about a new task in
transfer learning. for example, use the information gained during training to distinguish beverages when
Transfer learning offers a number of advantages, the most important of which are reduced training time, improved neural network performance (in most circumstances), and no need for a large amount of data.
To train a neural model from scratch, a lot of data is typically needed, but access to that data isn’t always available.
Because the model has already been pre-trained, a good machine learning model can be generated with fairly
little training data using transfer learning. This is especially useful in natural language processing, where huge
labelled datasets require a lot of expert knowledge. Additionally, training time is decreased because building
a deep neural network from the start of a complex task can take days or even weeks.
Momentum is an extension to the gradient descent optimization algorithm, often referred to as gradient
descent with momentum.
It is designed to accelerate the optimization process, e.g. decrease the number of function evaluations required
to reach the optima, or to improve the capability of the optimization algorithm, e.g. result in a better final result.
A problem with the gradient descent algorithm is that the progression of the search can bounce around the
search space based on the gradient. For example, the search may progress downhill towards the minima, but
during this progression, it may move in another direction, even uphill, depending on the gradient of specific points
(sets of parameters) encountered during the search.
This can slow down the progress of the search, especially for those optimization problems where the broader
trend or shape of the search space is more useful than specific gradients along the way.
One approach to this problem is to add history to the parameter update equation based on the gradient
encountered in the previous updates.
This change is based on the metaphor of momentum from physics where acceleration in a direction can be
accumulated from past updates.
The name momentum derives from a physical analogy, in which the negative gradient is a force moving a particle
through parameter space, according to Newton’s laws of motion.
Momentum involves adding an additional hyperparameter that controls the amount of history (momentum) to
include in the update equation, i.e. the step to a new point in the search space. The value for the hyperparameter
is defined in the range 0.0 to 1.0 and often has a value close to 1.0, such as 0.8, 0.9, or 0.99. A momentum of 0.0
is the same as gradient descent without momentum.
First, let’s break the gradient descent update equation down into two parts: the calculation of the change to the
position and the update of the old position to the new position.
The change in the parameters is calculated as the gradient for the point scaled by the step size.
change_x = step_size * f'(x)
The new position is calculated by simply subtracting the change from the current point
x = x - change_x
Momentum involves maintaining the change in the position and using it in the subsequent calculation of the
change in position.
If we think of updates over time, then the update at the current iteration or time (t) will add the change used at the previous time (t-1) weighted by the momentum hyperparameter, as follows:
change_x(t) = step_size * f'(x(t-1)) + momentum * change_x(t-1)
x(t) = x(t-1) - change_x(t)
The change in the position accumulates magnitude and direction of changes over the iterations of the search,
proportional to the size of the momentum hyperparameter.
For example, a large momentum (e.g. 0.9) will mean that the update is strongly influenced by the previous
update, whereas a modest momentum (0.2) will mean very little influence.
The momentum algorithm accumulates an exponentially decaying moving average of past gradients and
continues to move in their direction.
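A minimal sketch of gradient descent with momentum on the one-dimensional function f(x) = x^2, whose derivative is 2x. The test function, the step size of 0.1 and the momentum of 0.3 are illustrative choices.

```python
def gradient_descent_momentum(derivative, x, step_size, momentum, n_iter):
    """Minimize a 1D function given its derivative, keeping a history of changes."""
    change = 0.0
    for _ in range(n_iter):
        # New change = gradient step plus a fraction of the previous change.
        change = step_size * derivative(x) + momentum * change
        x = x - change
    return x

def derivative(x):
    return 2.0 * x   # f(x) = x^2

x_plain = gradient_descent_momentum(derivative, 1.0, 0.1, 0.0, 30)
x_mom = gradient_descent_momentum(derivative, 1.0, 0.1, 0.3, 30)
print(x_plain, x_mom)
```

With a momentum of 0.0 the loop is plain gradient descent. On this function the run with momentum ends closer to the minimum at 0 after the same number of iterations.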
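20. RMSprop Optimizer
The RMSprop optimizer is similar to the gradient descent algorithm with momentum. The RMSprop optimizer restricts the oscillations in the vertical direction. Therefore, we can increase our learning rate and our algorithm could take larger steps in the horizontal direction, converging faster. The difference between RMSprop and gradient descent with momentum is in how the updates are calculated from the gradients. The following equations show the updates for gradient descent with momentum and for RMSprop, where dw and db are the gradients of the loss with respect to the weights w and the bias b, and alpha is the learning rate. The value of momentum is denoted by beta and is usually set to 0.9. If you are not interested in the math behind the optimizer, you can just skip the following equations.
Gradient descent with momentum
v_dw = beta * v_dw + (1 - beta) * dw
v_db = beta * v_db + (1 - beta) * db
w = w - alpha * v_dw
b = b - alpha * v_db
RMSprop optimizer
v_dw = beta * v_dw + (1 - beta) * dw^2
v_db = beta * v_db + (1 - beta) * db^2
w = w - alpha * dw / (sqrt(v_dw) + epsilon)
b = b - alpha * db / (sqrt(v_db) + epsilon)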
Sometimes the value of v_dw could be really close to 0. Then, the value of our weights could blow up. To prevent
the gradients from blowing up, we include a parameter epsilon in the denominator which is set to a small value.
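A minimal sketch of a single-weight RMSprop update follows: a running average of the squared gradient scales the step, with epsilon in the denominator. The test function f(w) = w^2 and the hyperparameter values are illustrative assumptions.

```python
import math

def rmsprop_step(w, dw, v_dw, lr=0.01, beta=0.9, epsilon=1e-8):
    """One RMSprop update for a single weight w with gradient dw."""
    v_dw = beta * v_dw + (1.0 - beta) * dw ** 2        # running average of dw^2
    w = w - lr * dw / (math.sqrt(v_dw) + epsilon)      # epsilon guards against v_dw near 0
    return w, v_dw

w, v = 5.0, 0.0
for _ in range(2000):
    w, v = rmsprop_step(w, 2.0 * w, v)   # gradient of w^2 is 2w
print(w)
```

Because the gradient is divided by its own running magnitude, the step size stays close to the learning rate regardless of how large or small the raw gradient is.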
Adam is an extension to stochastic gradient descent that has recently seen broader adoption for deep learning applications in computer vision and natural language processing.
Adam is an optimization algorithm that can be used instead of the classical stochastic gradient descent procedure to update network weights iteratively based on training data. Adam differs from classical stochastic gradient descent in how it handles the learning rate.
Stochastic gradient descent maintains a single learning rate (termed alpha) for all weight updates and the learning rate does not change during training.
In Adam, a learning rate is maintained for each network weight (parameter) and separately adapted as learning unfolds.
The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients.
The authors describe Adam as combining the advantages of two other extensions of stochastic gradient descent.
Specifically:
Adaptive Gradient Algorithm (AdaGrad) that maintains a per-parameter learning rate that improves
performance on problems with sparse gradients (e.g. natural language and computer vision problems).
Root Mean Square Propagation (RMSProp) that also maintains per-parameter learning rates that are
adapted based on the average of recent magnitudes of the gradients for the weight (e.g. how quickly it is
changing). This means the algorithm does well on online and non-stationary problems (e.g. noisy).
Instead of adapting the parameter learning rates based only on the average of the second moments of the gradients (the uncentered variance) as in RMSProp, Adam also makes use of the average of the first moment (the mean).
Specifically, the algorithm calculates an exponential moving average of the gradient and the squared gradient,
and the parameters beta1 and beta2 control the decay rates of these moving averages.
Because the moving averages are initialized at zero and the beta1 and beta2 values are close to 1.0 (as
recommended), the moment estimates are biased towards zero, especially in the first steps. This bias is overcome
by first calculating the biased estimates and then calculating bias-corrected estimates.
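The description above can be sketched as a single update function. This is a minimal sketch rather than a reference implementation: the function name and the default values (lr = 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are commonly quoted settings and are assumptions here, not values given in the text.

```python
import numpy as np

def adam_step(w, dw, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for step t (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * dw        # moving average of the gradient
    v = beta2 * v + (1.0 - beta2) * dw ** 2   # moving average of the squared gradient
    m_hat = m / (1.0 - beta1 ** t)            # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)            # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -3.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 4):
    dw = w                                    # gradient of the assumed toy loss 0.5 * ||w||^2
    w, m, v = adam_step(w, dw, m, v, t)
print(w)
```

Without the two bias-correction lines, the first few updates would be scaled by the still near-zero moving averages instead of by the actual gradient statistics.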
Dropout layers have been the go-to method to reduce the overfitting of neural networks. Dropout is one of the most
widely used regularisation techniques in the modern era of deep learning.
The term “dropout” refers to dropping out nodes (in the input and hidden layers) of a neural network. All the
forward and backward connections of a dropped node are temporarily removed, thus creating a new network
architecture out of the parent network. Each node is dropped with a dropout probability p.
Let’s try to understand this with a given input x: {1, 2, 3, 4, 5} to the fully connected layer. We have a dropout
layer with probability p = 0.2 (or keep probability = 0.8). During forward propagation (training), each element of x
is dropped with probability 0.2, so on average 20% of the nodes are dropped, i.e. x could become {1, 0, 3, 4, 5} or
{1, 2, 0, 4, 5} and so on. The same applies to the hidden layers. For instance, if a hidden layer has 1000 neurons
(nodes) and dropout is applied with drop probability = 0.5, then about 500 neurons are randomly dropped in every
iteration (batch).
Generally, for the input layer, the keep probability, i.e. 1 - drop probability, is closer to 1, with 0.8 being the
value suggested by the authors. For the hidden layers, the greater the drop probability, the sparser the model; a
keep probability of 0.5, i.e. dropping 50% of the nodes, is the commonly recommended value.
The other way is inspired by the ensemble techniques (such as AdaBoost, XGBoost, and Random Forest) where we
use multiple neural networks of different architectures. But this requires multiple models to be trained and stored,
which over time becomes a huge challenge as the networks grow deeper.
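To make the dropout example with x = {1, 2, 3, 4, 5} and p = 0.2 concrete, here is a small NumPy sketch. The function name is hypothetical, and the rescaling of the surviving values by 1 / keep probability (often called inverted dropout) is an assumption that the text does not discuss; it keeps the expected activation unchanged so that nothing needs to be rescaled at test time.

```python
import numpy as np

def dropout(x, drop_prob, rng, training=True):
    """Zero each element independently with probability drop_prob during training."""
    if not training or drop_prob == 0.0:
        return x                              # dropout is switched off at inference
    keep_prob = 1.0 - drop_prob
    mask = rng.random(x.shape) < keep_prob    # True for the nodes that are kept
    return x * mask / keep_prob               # assumed inverted-dropout rescaling

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(dropout(x, 0.2, rng))                   # some entries zeroed, the rest scaled by 1.25
print(dropout(x, 0.2, rng, training=False))   # unchanged
```

Because each node is dropped independently, the number of zeros varies from call to call; only the average fraction equals the drop probability.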
Batch normalization is a deep learning approach that has been shown to significantly improve the efficiency
and reliability of neural network models. It is particularly useful for training very deep networks, as it can help to
reduce the internal covariate shift that can occur during training.
Batch normalization is a technique for normalizing the outputs passed between the layers of a neural
network. As a result, the next layer receives a “reset” of the output distribution from the preceding layer, allowing it
to analyze the data more effectively.
The term “internal covariate shift” describes how updating the parameters of the preceding layers changes the
distribution of inputs to the current layer during training. This can make the optimization
process more difficult and can slow down the convergence of the model.
Since normalization keeps activation values from becoming too high or too low, and since it lets each layer
learn more independently of the others, this strategy leads to faster training.
Because batch normalization also has a mild regularizing effect, the dropout rate used alongside it can often be
decreased, which may improve accuracy.
If the standardization itself increases the loss, stochastic gradient descent can undo it by shifting or
scaling the normalized outputs with learned parameters, which in turn affects the inputs seen by the following layer.
When applied to a layer, batch normalization multiplies its normalized output by a trainable scale parameter
(gamma) and adds a trainable shift parameter (beta). The data may therefore be “denormalized” by adjusting
just these two parameters for each output, rather than by changing all the other relevant weights, which helps
keep the network stable.
The goal of batch normalization is to stabilize the training process and improve the generalization ability
of the model. It can also help to reduce the need for careful initialization of the model’s weights and can allow the
use of higher learning rates, which can speed up the training process.
It is common practice to apply batch normalization prior to a layer’s activation function, and it is commonly used in
tandem with other regularization methods such as dropout. It is a widely used technique in modern deep learning and
has been shown to be effective in a variety of tasks, including image classification, natural language processing,
and machine translation.
Stabilizes the training process. Batch normalization can help to reduce the internal covariate shift that occurs
during training, which can improve the stability of the training process and make it easier to optimize the model.
Improves generalization. By normalizing the activations of a layer, batch normalization can help to
reduce overfitting and improve the generalization ability of the model.
Reduces the need for careful initialization. Batch normalization can help reduce the sensitivity of the model to
the initial weights, making it easier to train the model.
Allows for higher learning rates. Batch normalization can allow the use of higher learning rates that can speed
up the training process.
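Scaled and shifted activations: $z_i = \gamma y_i + \beta$, where $\gamma$ and $\beta$ are learned parameters.
During inference, the activations of a layer are normalized using the estimates of the mean and variance of the
activations collected during training, rather than using the mean and variance of the mini-batch:
$y_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$, where $\mu$ and $\sigma^2$ are the training-time estimates and $\epsilon$ is a small constant.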
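The training and inference behaviour described above can be sketched as follows. The class name, the moving-average factor of 0.9 used for the running statistics and eps = 1e-5 are assumptions for this illustration; gamma and beta would normally be updated by gradient descent, which is left out here.

```python
import numpy as np

class BatchNorm1D:
    """Minimal batch normalization over inputs of shape (batch, features)."""

    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(num_features)    # trainable scale
        self.beta = np.zeros(num_features)    # trainable shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)   # mini-batch statistics
            m = self.momentum
            self.running_mean = m * self.running_mean + (1.0 - m) * mean
            self.running_var = m * self.running_var + (1.0 - m) * var
        else:
            mean, var = self.running_mean, self.running_var  # training-time estimates
        y = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * y + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(4)
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
print(bn(batch, training=True).mean(axis=0))   # close to zero for every feature
print(bn(batch[:1], training=False))           # a single sample works at inference
```

Using the stored statistics at inference is what lets the layer process a single example, for which a batch mean and variance would be meaningless.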
Instance normalization (IN) is another term for contrast normalization, which was first coined in the StyleNet paper.
Both names reveal some information about this technique. Instance normalization tells us that it operates on a
single sample. On the other hand, contrast normalization says that it normalizes the contrast between the spatial
elements of a sample. Given a Convolutional Neural Network (CNN), we can also say that IN performs intensity
normalization across the width and height of a single feature map of a single example.
To clarify how IN works, let’s consider sample feature maps that constitute an input tensor to the IN layer.
Let $x \in \mathbb{R}^{N \times C \times H \times W}$ be that tensor consisting of a batch of $N$ images. Each of these
images has $C$ feature maps or channels with height $H$ and width $W$. Therefore, $x$ is a four-dimensional tensor.
In instance normalization, we consider one training sample $n$ and feature map $c$ and take the mean
and variance over its spatial locations ($H$ and $W$):
$\mu_{nc} = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{nchw}$, $\quad \sigma_{nc}^2 = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} (x_{nchw} - \mu_{nc})^2$
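A short NumPy sketch of this computation, assuming the input is laid out as (batch, channels, height, width) and adding a small eps for numerical safety (both are assumptions of this example):

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize every (sample, channel) feature map over its spatial locations."""
    mean = x.mean(axis=(2, 3), keepdims=True)   # one mean per sample and channel
    var = x.var(axis=(2, 3), keepdims=True)     # one variance per sample and channel
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=10.0, size=(2, 3, 4, 5))
y = instance_norm(x)
print(y.mean(axis=(2, 3)))                      # close to zero for each feature map
```

Since the statistics never mix different samples, the result for one image does not depend on what else is in the batch.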
Considering the same feature maps of the previous figure, BN operates on one channel (one feature map) over
Deep learning models are achieving state-of-the-art results on a number of complex tasks including speech
recognition, computer vision and machine translation, among others. However, training deep learning models such
as deep neural networks is a complex task because, during the training phase, the inputs of each layer keep changing.
Normalization is an approach which is applied during the preparation of data in order to change the values of
numeric columns in a dataset to use a common scale when the features in the data have different ranges. In this
article, we will discuss the various normalization methods which can be used in deep learning models.
Let us take an example: suppose an input dataset contains one column with values ranging from 0 to 10
and another column with values ranging from 100,000 to 1,000,000. In this case, the input data contains a big
difference in the scale of the numbers, which can cause problems when the values are combined as
features during modelling because the larger column dominates. These issues can be mitigated by normalization,
which creates new values on a common scale while maintaining the general distribution of the data.
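The two-column example can be tried out with a few lines of NumPy. Min-max scaling and standardization are shown here as two common ways of bringing columns onto a common scale; the text does not say which one is meant, so both function names and the sample values are assumptions.

```python
import numpy as np

def min_max_scale(col):
    """Rescale a column linearly to the range [0, 1]."""
    return (col - col.min()) / (col.max() - col.min())

def standardize(col):
    """Rescale a column to zero mean and unit standard deviation."""
    return (col - col.mean()) / col.std()

small = np.array([0.0, 2.5, 5.0, 10.0])                    # values between 0 and 10
large = np.array([100000.0, 250000.0, 550000.0, 1000000.0])
print(min_max_scale(small), min_max_scale(large))          # both now lie in [0, 1]
print(standardize(small), standardize(large))
```

After either transformation the two columns have comparable magnitudes, while the relative ordering and spacing of the values inside each column are preserved.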
There are several approaches to normalisation which can be used in deep learning models. They are described
below.
Batch Normalization
Batch normalization is one of the popular normalization methods used for training deep learning models. It
enables faster and more stable training of deep neural networks by stabilising the distributions of layer inputs
during the training phase. This approach is mainly related to internal covariate shift (ICS), which
means the change in the distribution of layer inputs caused when the preceding layers are updated. In order to
improve the training of a model, it is important to reduce the internal covariate shift. Batch normalization
reduces the internal covariate shift by adding network layers which control the means and variances
of the layer inputs.
Advantages
The advantages of batch normalization are mentioned below:
Batch normalization reduces the internal covariate shift (ICS) and accelerates the training of a deep
neural network
This approach reduces the dependence of gradients on the scale of the parameters or of their initial
values, which allows higher learning rates without the risk of divergence
Batch Normalisation makes it possible to use saturating nonlinearities by preventing the network from
getting stuck in the saturated modes
Weight Normalization
Weight normalization is a process of reparameterization of the weight vectors in a deep neural network which
works by decoupling the length of those weight vectors from their direction. In simple terms, we can define weight
normalization as a method for improving the optimisability of the weights of a neural network model.
Advantages
The advantages of weight normalization are mentioned below
Weight normalization improves the conditioning of the optimisation problem and speeds up the
convergence of stochastic gradient descent.
It can be applied successfully to recurrent models such as LSTMs as well as in deep reinforcement
learning or generative models
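The decoupling of length and direction can be illustrated with a very small sketch. The names v (direction parameter) and g (length parameter) are assumptions for this example; the text only states that the two are separated.

```python
import numpy as np

def weight_norm(v, g):
    """Build a weight vector whose direction comes from v and whose length is g."""
    return g * v / np.linalg.norm(v)

v = np.array([3.0, 4.0])
w = weight_norm(v, g=2.0)
print(w, np.linalg.norm(w))          # direction of v, length exactly g
print(weight_norm(10.0 * v, g=2.0))  # rescaling v leaves w unchanged
```

The optimizer would then update g and v instead of w directly, so a change in the length of the weight vector no longer has to be expressed through every component of it.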
Layer Normalization
Layer normalization is a method to improve the training speed for various neural network models. Unlike batch
normalization, this method directly estimates the normalisation statistics from the summed inputs to the neurons
within a hidden layer. Layer normalization is designed to overcome drawbacks of batch
normalization such as its dependence on the mini-batch size.
Advantages
The advantages of layer normalization are mentioned below:
Layer normalization can be easily applied to recurrent neural networks by computing the normalization
statistics separately at each time step
This approach is effective at stabilising the hidden state dynamics in recurrent networks
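A minimal sketch of layer normalization for inputs of shape (batch, features) is given below. The gain gamma and bias beta, as well as eps, are assumptions added for completeness; the statistics are taken over the features of each sample, never over the batch.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each sample over its own features (the last axis)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
gamma, beta = np.ones(8), np.zeros(8)
full = layer_norm(x, gamma, beta)
single = layer_norm(x[:1], gamma, beta)
print(np.allclose(full[:1], single))   # the result does not depend on the batch
```

In a recurrent network the same function could be applied to the hidden pre-activations at every time step, which is why no mini-batch statistics need to be stored per step.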
Group Normalization
Group normalization can be regarded as an alternative to batch normalization. This approach works by dividing the
channels into groups and computing the mean and variance within each group for normalization, i.e. normalising
the features within each group. Unlike batch normalization, group normalization is independent of the batch size,
and its accuracy is stable over a wide range of batch sizes.
Advantages
The advantages of group normalization are mentioned below:
It has the ability to replace batch normalization in a number of deep learning tasks
It can be easily implemented in modern libraries with just a few lines of code
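As a hedged illustration of the "few lines of code" remark, group normalization can be written in NumPy roughly as follows. The (batch, channels, height, width) layout, the function name and eps are assumptions, and the learnable scale and shift found in library implementations are omitted.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Normalize x of shape (N, C, H, W) within groups of channels, per sample."""
    n, c, h, w = x.shape
    if c % num_groups != 0:
        raise ValueError("number of channels must be divisible by num_groups")
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = grouped.mean(axis=(2, 3, 4), keepdims=True)   # one mean per sample and group
    var = grouped.var(axis=(2, 3, 4), keepdims=True)
    grouped = (grouped - mean) / np.sqrt(var + eps)
    return grouped.reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 3, 3))
y = group_norm(x, num_groups=2)
print(np.allclose(group_norm(x[:1], 2), y[:1]))   # independent of the batch size
```

With one group the statistics cover all channels of a sample, and with one channel per group they reduce to per-feature-map statistics, which are the two extremes between which the group count lets you choose.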
Instance Normalization
Instance normalization, also known as contrast normalization, is similar to layer normalization. Unlike
batch normalization, instance normalization is applied to a single image instead of a whole batch of images.
Advantages
The advantages of instance normalization are mentioned below:
26 Deep Learning Recent Trends
1. A Residual Neural Network (a.k.a. Residual Network, ResNet) is a deep learning model in which the
weight layers learn residual functions with reference to the layer inputs. A Residual Network is a
network with skip connections that perform identity mappings, merged with the layer outputs by addition.
Recent trends in deep learning include using more extensive datasets and more sophisticated architectures, as
well as incorporating interaction between different types of neural networks and other AI technologies, such as
natural language processing and decision trees. In this article, we will look at 5 recent trends in deep learning and
how they have the potential to bring about significant change.
2. Skip connections are a technique that allows convolutional neural networks (CNNs) to bypass some layers
and connect directly to deeper or shallower ones. They can improve the performance and efficiency of CNNs, but
they also have some drawbacks and limitations.
3. A fully connected neural network consists of a series of fully connected layers that connect every neuron in
one layer to every neuron in the next layer. The major advantage of fully connected networks is that they are
“structure agnostic”, i.e. no special assumptions need to be made about the input.
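The three ideas in the list above (residual functions, skip connections merged by addition, and fully connected layers) can be combined in one small sketch. The two-layer shape of the residual function and the use of ReLU are assumptions made for illustration only.

```python
import numpy as np

def dense(x, w, b):
    """Fully connected layer: every input unit feeds every output unit."""
    return x @ w + b

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, b1, w2, b2):
    """Compute relu(F(x) + x), where F is two fully connected layers."""
    f = dense(relu(dense(x, w1, b1)), w2, b2)   # residual function F(x)
    return relu(f + x)                          # identity skip connection, merged by addition

x = np.array([[1.0, -2.0, 3.0]])
zeros_w, zeros_b = np.zeros((3, 3)), np.zeros(3)
print(residual_block(x, zeros_w, zeros_b, zeros_w, zeros_b))
```

With all weights at zero the block simply passes relu(x) through, which illustrates why residual layers only have to learn a correction to the identity rather than the full mapping.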
An application provides a mechanism for integrating hybrid models from data sources such as census, weather,
and social media into decision support tools. Moreover, it enables the creation of a new nested domain for the
location data, which can then become part of decision support systems. The results suggest that incorporating
deep learning networks into hybrid models can lead to better decisions concerning hazards and performance
Hybrid models combine the benefits of symbolic AI and deep learning. Symbolic AI is a top-down approach to artificial
intelligence. It is intended to endow machines with intelligence by adopting a “high-level symbolic representation
of issues,” as Allen Newell and Herbert A. Simon propose in their physical symbol system theory.
Commonly referred to as ViT, an image classification model developed by researchers at the University of
ViT consists of an input layer, a middle layer, and an output layer. The input layer contains training images that
have been labeled with one of several possible sentiments (cheerful, negative, neutral, uncertain, sad, happy,
angry). The middle layer detects the types of objects in the image. The output layer returns a confidence score
(CNNs), with supervised learning followed by some unsupervised preprocessing, then pooling layers that blend
multiple channels into one channel before passing images to CNNs, MRFs, or other models for classification
prediction tasks.
The vision transformers allow us to design a model architecture that can deal with any input data, including
6 Self-Supervised Learning
This deep and self-supervised learning module helps in automation. Rather than depending on labeled data to
train a system, it learns to categorize the raw data automatically. Each input component can predict any other
part of the input. It might, for example, forecast the future based on historical records.
In a self-supervised learning system, the input is labeled either by an intelligent agent or by some external
source. The output is also marked with a label that reflects the overall quality of the prediction made by the
system. The algorithm used to train a self-supervised learning system will be based on minimizing the error
A self-supervised learning system can make two basic types of errors: bias and variance.
Bias
Variance
It is the variation in the quality of predictions made by a system based on different data instances.
Preprocessing
Feature extraction
Training
Testing
The human brain is highly complicated, with an endless capacity for learning. Deep learning has been a
prominent approach for investigating how the brain works in recent years. Neuroscience-based deep learning is a
type of ML that uses data from neuroscience experiments to train artificial neural networks. It allows researchers
Artificial neural networks constructed on computers are comparable to those seen in human brains. As a result of
this formation, scientists and researchers have uncovered thousands of neurological remedies and ideas. Deep
learning has provided neuroscience with the much-needed boost it has long needed. With the deployment of
progressively more robust, comprehensive, and advanced deep learning implementations and solutions, the
Machine Learning-based NLP is still in the early stages. However, there is presently no method that will allow
NLP computers to recognize the meanings of different words in various contexts and respond appropriately.
One approach to solving this problem is to build a model that can recognize patterns in large amounts of text
(e.g., millions of documents). This is where machine learning algorithms come in, as they can automatically learn
1. Image denoising
Denoising an image is a classical problem that researchers have been trying to solve for
decades. In earlier times, researchers used filters to reduce the noise in images. These worked fairly well
for images with a reasonable level of noise. However, applying those filters would add blur to the image, and if
the image was too noisy, the resulting image would be so blurry that most of the critical details in the image
were lost. With the advent of deep learning techniques, it is now possible to remove noise blindly, i.e. without
knowing the noise level in advance, such that the result is very close to the ground truth image with minimal loss
of detail. One of the fundamental challenges in the field of image processing and computer vision is image
denoising, where the underlying goal is
to estimate the original image by suppressing noise from a noise-contaminated version of the image.
2. Semantic segmentation
Segmentation is essential for image analysis tasks. Semantic segmentation describes the process of associating
each pixel of an image with a class label (such as flower, person, road, sky, ocean, or car). Applications for
semantic segmentation include:
Autonomous driving
Industrial inspection
Classification of terrain visible in satellite imagery
Medical imaging analysis
Label Training Data for Semantic Segmentation
Large datasets enable faster and more accurate mapping to a particular input (or input aspect). Using data augmentation
provides a means of leveraging limited datasets for training. Minor changes, such as translation, cropping, or
3. Object detection
Object detection using deep learning provides a fast and accurate means to predict the location of an object in an
image. Deep learning is a powerful machine learning technique in which the object detector automatically learns
the image features required for detection tasks. Several techniques for object detection using deep learning
are available, such as Faster R-CNN, You Only Look Once (YOLO) v2, YOLO v3, YOLO v4, and Single
Shot Detection (SSD).
Applications for object detection include:
Image classification
Scene understanding
Self-driving vehicles
Surveillance
Create Training Data for Object Detection
Use a labeling app to interactively label ground truth data in a video, image sequence, image collection, or
custom data source. You can label object detection ground truth using rectangle labels, which define the position
and size of the object in the image.
LSTM Applications
Language modeling
Machine translation
Handwriting recognition
Image captioning
Question answering
Video-to-text conversion
Speech synthesis
This list does give an idea about the areas in which LSTM is employed but not how exactly it is used. Let’s
understand the types of sequence learning problems that LSTM networks are capable of addressing.
LSTM neural networks are capable of solving numerous tasks that are not solvable by earlier learning
algorithms such as standard RNNs. Long-term temporal dependencies can be captured effectively by LSTM, without suffering
The central role of an LSTM model is held by a memory cell known as a ‘cell state’ that maintains its state over
time. The cell state is the horizontal line that runs through the top of the below diagram. It can be visualized as a
The sigmoid layer gives out numbers between zero and one, where zero means ‘nothing
should be let through,’ and one means ‘everything should be let through.’
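The gating idea in the last sentence can be shown in isolation. This sketch is not a full LSTM cell: it only applies a sigmoid gate elementwise to a cell state, and the function names and the example values are assumptions.

```python
import numpy as np

def sigmoid(z):
    """Squash values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def apply_gate(cell_state, gate_logits):
    """Scale each cell-state entry by a gate value between zero and one."""
    return sigmoid(gate_logits) * cell_state

cell_state = np.array([2.0, -1.0, 0.5])
gate_logits = np.array([-50.0, 50.0, 0.0])   # nearly closed, nearly open, half open
print(apply_gate(cell_state, gate_logits))
```

In an actual LSTM the gate logits are computed from the current input and the previous hidden state with learned weights, so the network itself decides how much of each entry to let through.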
A generative model is a powerful way of learning any kind of data distribution using unsupervised learning, and it
has achieved tremendous success in just a few years. All types of generative models aim at learning the true data
distribution of the training set so as to generate new data points with some variations. But it is not always possible
to learn the exact distribution of our data either implicitly or explicitly, and so we try to model a distribution which
is as similar as possible to the true data distribution. For this, we can leverage the power of neural networks to
learn a function which brings the model distribution close to the true distribution.
Two of the most commonly used and efficient approaches are Variational Autoencoders (VAE) and Generative
Adversarial Networks (GAN).
The variational autoencoder (VAE) was proposed in 2013 by Kingma and Welling. A variational autoencoder
provides a probabilistic manner for describing an observation in latent space. Thus, rather than building an
encoder that outputs a single value to describe each latent state attribute, we’ll formulate our encoder to
describe a probability distribution for each latent attribute.
It has many applications such as data compression, synthetic data creation etc.
Architecture:
Autoencoders are a type of neural network that learns data encodings from the dataset in an unsupervised
way. An autoencoder basically contains two parts: the first one is an encoder, which is similar to a convolutional
neural network except for the last layer. The aim of the encoder is to learn an efficient data encoding from the
dataset and pass it into a bottleneck layer. The other part of the autoencoder is a decoder that uses the latent
space in the bottleneck layer to regenerate images similar to those in the dataset. The reconstruction error is
backpropagated through the network as the loss function.
A variational autoencoder differs from an autoencoder in that it provides a statistical manner for
describing the samples of the dataset in latent space. Therefore, in a variational autoencoder, the encoder
outputs a probability distribution in the bottleneck layer instead of a single output value.
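To show what "the encoder outputs a probability distribution" can look like in practice, the sketch below lets a toy encoder return a mean and a log-variance for each latent attribute and then draws a sample from that Gaussian. The linear encoder, the Gaussian form and the sampling step z = mu + sigma * noise are assumptions chosen for illustration; the text does not specify them.

```python
import numpy as np

def encode(x, w_mu, w_logvar):
    """Toy linear encoder returning a mean and a log-variance per latent attribute."""
    return x @ w_mu, x @ w_logvar

def sample_latent(mu, log_var, rng):
    """Draw z from N(mu, exp(log_var)) as mu + sigma * standard normal noise."""
    noise = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * noise

rng = np.random.default_rng(0)
x = np.ones((2, 3))                                # two inputs with three features each
w_mu, w_logvar = np.ones((3, 2)), np.zeros((3, 2))
mu, log_var = encode(x, w_mu, w_logvar)
z = sample_latent(mu, log_var, rng)
print(mu, log_var, z.shape)
```

A decoder would then reconstruct the input from z, so two encodings of the same input generally differ slightly, which is what allows new data points with some variations to be generated.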
2. Generative Adversarial Network (GAN) is a deep learning architecture that consists of two neural
networks competing against each other in a zero-sum game framework. The goal of GANs is to generate new,
synthetic data that resembles some known data distribution.
What is a Generative Adversarial Network
Generative Adversarial Networks (GANs) are a powerful class of neural networks that are used
for unsupervised learning. They were introduced by Ian J. Goodfellow and colleagues in 2014. GANs are basically
made up of a system of two neural network models which compete with each other and are able to
analyze, capture and copy the variations within a dataset.
Generative Adversarial Networks (GANs) can be broken down into three parts:
Generative: To learn a generative model, which describes how data is generated in terms of a
probabilistic model.
Adversarial: The training of a model is done in an adversarial setting.
Networks: Use deep neural networks as artificial intelligence (AI) algorithms for training purposes.
In GANs, there is a Generator and a Discriminator. The Generator generates fake samples of
data (be it an image, audio, etc.) and tries to fool the Discriminator. The Discriminator, on the other
hand, tries to distinguish between the real and fake samples. The Generator and the Discriminator
are both neural networks and they run in competition with each other in the training phase. The
steps are repeated several times, and the Generator and Discriminator get better and better at
their respective jobs after each repetition. The work can be visualized by the diagram given below:
Generative Adversarial Network Architecture and its Components
Here, the generative model captures the distribution of data and is trained in such a manner that it
tries to maximize the probability of the Discriminator making a mistake. The Discriminator, on the
other hand, is a model that estimates the probability that the sample it received came
from the training data and not from the Generator. GANs are formulated as a minimax game,
where the Discriminator is trying to maximize its reward V(D, G) and the Generator is trying to
minimize the Discriminator’s reward or, in other words, maximize its loss. It can be mathematically
described by the formula below:
$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim P(z)}[\log(1 - D(G(z)))]$
where,
G = Generator
D = Discriminator
Pdata(x) = distribution of real data
P(z) = distribution of the generator's input noise
x = sample from Pdata(x)
z = sample from P(z)
D(x) = Discriminator network
G(z) = Generator netw