Deep Learning: All Modules
Introduction to Artificial Intelligence
'Artificial intelligence (AI) is a field of computer science that emphasizes the creation of intelligent machines which can work and react like humans'
'Intelligence can be defined as one's capacity for understanding, self-awareness, learning, emotional knowledge, planning, creativity, and problem solving'
Artificial Intelligence
Machine Learning is a subset of Artificial Intelligence that gives a machine the ability to learn without being explicitly programmed. Data, not algorithms, is the key to machine learning success
Supervised Learning
Unsupervised Learning
Reinforcement Learning
(Figure: labeled training data, images tagged 'Apple', is shown to the model, which notes the labels; this illustrates Supervised Learning.)
In Supervised Learning, you can consider that the learning is guided by a teacher. We have a dataset which acts as a teacher, and its role is to train the model or the machine. Once the model gets trained, it can start making predictions or decisions whenever new data is given to it
(Figure: asked 'What does this image represent?', the trained model answers '97% it's an apple!')
Use Case: Spam Classifier
Most spam filtering techniques are based on text categorization methods; thus, filtering spam turns out to be a classification problem. We employ Supervised Machine Learning techniques to filter email spam messages
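As a hedged illustration (not the course's own code), this classification framing can be sketched with scikit-learn; the toy texts, labels, and pipeline choices below are assumptions:

# Hedged sketch: spam filtering as supervised text classification
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting at 10 am tomorrow"]  # toy emails
labels = [1, 0]  # 1 = spam, 0 = not spam

model = make_pipeline(TfidfVectorizer(), MultinomialNB())  # vectorize + classify
model.fit(texts, labels)
print(model.predict(["claim your free prize"]))  # likely [1]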
Here, the model learns through observation and finds structures in data. Once the model is given a dataset, it automatically finds patterns and relationships in the dataset by creating clusters in it. What it cannot do is add labels to these clusters. For example, it cannot say whether this is a group of apples, mangoes, or oranges, but it will separate all apples from mangoes and oranges
(Figure: the model separates the data into three clusters: Similar Cluster 1, Similar Cluster 2, and Similar Cluster 3.)
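A hedged sketch of this clustering idea, assuming scikit-learn; the 2-D points and the choice of three clusters are illustrative:

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1, 2], [5, 5], [5, 6], [9, 1], [9, 2]])
kmeans = KMeans(n_clusters=3, n_init=10).fit(points)
print(kmeans.labels_)  # a cluster index per point; no human-readable labels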
Use Case: Netflix Recommendation
Netflix uses Machine Learning algorithms to help break viewers' preconceived notions and find shows that they might not have initially chosen
It is the ability of an agent to interact with the environment and find out what the best outcome is. It follows the concept of the hit-and-trial method. The agent is rewarded or penalized with a point for a correct or a wrong answer, and on the basis of the positive reward points gained, the model trains itself
(Figure: the reinforcement learning loop: the agent observes, selects an action using the policy, receives a reward or penalty, and iterates the process.)
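A hedged sketch of that reward-driven loop as tabular Q-learning; the toy chain environment and hyperparameters are assumptions, not the slides' example:

import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))   # value of each (state, action) pair
alpha, gamma = 0.1, 0.9               # learning rate and discount factor

def step(state, action):
    # Toy environment: action 1 moves right, action 0 moves left
    nxt = min(state + 1, 4) if action == 1 else max(state - 1, 0)
    reward = 1.0 if nxt == 4 else 0.0  # reward only at the goal state
    return nxt, reward

state = 0
for _ in range(2000):                 # observe, act, get reward, iterate
    action = np.random.randint(n_actions)
    nxt, reward = step(state, action)
    Q[state, action] += alpha * (reward + gamma * Q[nxt].max() - Q[state, action])
    state = 0 if nxt == 4 else nxt
print(Q)                              # higher values along the rewarded path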
Use Case: Self-driving Cars
Companies such as Tesla, Google, Wayve, and more are working on such machines. These cars are powered by Reinforcement Learning, which allows machines (known as agents) to learn through experimentation
Machine Learning Algorithms
➢ https://www.autodraw.com/
➢ https://quickdraw.withgoogle.com/
➢ https://opensource.google.com/projects/explore/machine-learning
➢ https://experiments.withgoogle.com/collection/ai
➢ https://toolbox.google.com/datasetsearch
Deep Learning is a part of Machine Learning based on learning data representations, as opposed to task-specific algorithms. It teaches computers to do what comes naturally to humans: to learn by example
Most Deep Learning methods use neural network architectures, which is why Deep Learning models are often referred to as deep neural networks
A neural network is a computing model whose layered structure resembles the networked structure of neurons in the
brain, with layers of connected nodes. It can learn from data, so it can be trained to recognize patterns, classify data, and
forecast future events
Each layer extracts a higher level of abstraction, up to an output layer, with each layer using the output of the previous layer as its input
Artificial neural networks are computing systems inspired by the biological neural networks that constitute animal brains. Such systems learn (progressively improve their ability) to do tasks, such as recognizing images, by considering examples, generally without task-specific programming
Topics: Program Elements; Computational Graph in TensorFlow
Typically, artificial neural networks have a layered structure. The Input Layer picks up the input signals and passes them on to the next layer, known as the 'Hidden' Layer (there may be more than one Hidden Layer in a neural network). Last comes the Output Layer, which delivers the result
A neural network is a computer simulation of the way biological neurons work within a human brain
Dendrites: These branch-like structures extending away from the cell body receive messages from other neurons and allow those messages to travel to the cell body
Axon: An axon carries an electrical impulse from the cell body to another neuron
▪ The three arrows correspond to the three inputs coming into the network
▪ Values [0.7, 0.6, and 1.4] are weights assigned to the corresponding input
▪ Inputs get multiplied with their respective weights and their sum is taken
(Figure: a perceptron with inputs such as Humidity (x1) and Blue shirt (x2) feeding a single output, answering 'Will it rain if I wear a blue shirt?')
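A minimal NumPy sketch of the weighted sum described above; the input values are illustrative assumptions, while the weights are the ones quoted on the slide:

import numpy as np

inputs = np.array([1.0, 0.5, 2.0])    # three inputs coming into the network
weights = np.array([0.7, 0.6, 1.4])   # the weights assigned above
print(np.dot(inputs, weights))         # inputs multiplied by weights, then summed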
Activation Functions:
Binary Step
Sigmoid
Tanh
ReLU
Leaky ReLU
Softmax
Identity Function
• A straight line function where activation is proportional to input
• No matter how many layers we have, if all of them are linear in nature, the final activation function of the last
layer will be nothing but just a linear function of the input of the first layer
• Range: (−∞,∞)
f(x) = x
Binary Step
• A discontinuous function: its value is 0 for a negative argument and 1 for a positive argument

Sigmoid
• When we apply the weighted sum in place of x, the values are scaled between 0 and 1
• Large negative numbers are scaled toward 0, and large positive numbers are scaled toward 1
• Range: (0, 1)
f(x) = 1 / (1 + e^(−x))
Tanh
• The Tanh activation almost always works better than the sigmoid function, as optimization is easier with it
• The advantage of Tanh is that it can deal more easily with negative numbers
• Range: (−1, 1)
f(x) = tanh(x) = 2 / (1 + e^(−2x)) − 1
ReLU
• This function allows only positive values to pass through during forward propagation, as shown in the graph below
• Range: [0, ∞)
f(x) = 0 for x < 0; f(x) = x for x ≥ 0
Leaky ReLU
• Like ReLU, but instead of outputting 0 for negative arguments it outputs a small negative slope, so that gradients for negative inputs do not die out entirely
• Range: (−∞, ∞)
Softmax
• It is useful for finding out the class which has the maximum probability
• The Softmax function is ideally used in the Output Layer of a classifier, where we are trying to attain the probabilities of each class
• Range: (0, 1)
σ(z)_j = e^(z_j) / Σ_{k=1}^{K} e^(z_k),  for j = 1, 2, …, K
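Hedged NumPy sketches of the activations above (the test vector is an illustrative assumption):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # range (0, 1)

def relu(x):
    return np.maximum(0.0, x)          # 0 for x < 0, x otherwise

def softmax(z):
    e = np.exp(z - z.max())            # subtract max for numerical stability
    return e / e.sum()                 # probabilities summing to 1

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x), np.tanh(x), relu(x), softmax(x))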
By training a perceptron, we try to find a line, plane, or some hyperplane which can accurately separate two
classes by adjusting weights and biases
(Figure: inputs X1…X3 with weights W1…W3 and a Bias are summed and passed through an activation function; the weights are then updated, and the process repeats until training stops.)
Weight update rule: W_new = W_old − LR * (∂E/∂W)
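A hedged sketch of this training loop, assuming NumPy and a step activation; the AND-gate data and learning rate are illustrative assumptions:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)    # Boolean AND (linearly separable)

w, b, lr = np.zeros(2), 0.0, 0.1           # weights, bias, learning rate (LR)
for _ in range(50):
    for xi, target in zip(X, y):
        pred = 1.0 if xi.dot(w) + b > 0 else 0.0   # sum + step activation
        err = target - pred
        w += lr * err * xi                  # W_new = W_old - LR * dE/dW
        b += lr * err
print(w, b)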
Deep Learning Frameworks
• TensorFlow: an open-source software library for high-performance numerical computations, developed by Google; it powers products such as Google Translate (which uses a recurrent neural network)
• TensorBoard: used for visualizing TensorFlow computations and graphs
• PyTorch: 'Pythonic' in nature
• DL4J: applied to image recognition, fraud detection, text mining, parts-of-speech tagging, and natural language processing
• MXNet: developed by the Apache Software Foundation; applied to imaging, speech recognition, forecasting, and NLP
A tensor is given as an input to a neural network
Computational graph example: with a = 10, b = 20, and c = 30, the graph computes h = (a * b) + c; a Multiplication node takes a and b, and its result feeds an Addition node together with c
A Session executes the computational graph; Placeholders are graph nodes whose values (such as a, b, and c) are fed in at run time
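A hedged TensorFlow sketch of this graph; tf.compat.v1 is used here as an assumption so the placeholder/Session style also runs under TensorFlow 2:

import tensorflow as tf
tf1 = tf.compat.v1
tf1.disable_eager_execution()

a = tf1.placeholder(tf.float32)   # values are fed in at run time
b = tf1.placeholder(tf.float32)
c = tf1.placeholder(tf.float32)
h = (a * b) + c                    # multiplication node feeding an addition node

with tf1.Session() as sess:        # the Session executes the graph
    print(sess.run(h, feed_dict={a: 10.0, b: 20.0, c: 30.0}))  # 230.0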
Topics: Feedforward Neural Network; Multi-layer Perceptron
▪ A single-layer perceptron can solve only linearly separable problems (Boolean 'AND', for example, is linearly separable)
▪ If the problem is not linearly separable, the learning process will never classify all samples correctly
▪ For solving this problem, we can use a multi-layer perceptron
Complex problems that involve a lot of parameters cannot be solved by a single-layer perceptron
▪ Consider a case where you own an e-commerce firm. You have planned to increase traffic on your site by providing a special discount on the
products and services. Now, you want to create awareness among people regarding this end-season sale by marketing on different portals like:
• Google ads
• Personal emails
• Sales advertisements on relevant sites
• YouTube ads
• Ads on different sites
• LinkedIn
• Blogs and so on
▪ This task is too complex for a human to analyze, as you can see that the number of parameters is quite high
▪ Let us try to solve it using Deep Learning
▪ You can either use just one platform for publicity or use a variety of them
▪ Each of them has its own advantages and disadvantages, but lots of factors would have to be considered
▪ The increased traffic on your portal or the number of sales that would happen is dependent on different categorical inputs, their sub-categories, and their
parameters
Computing and calculating profit in terms of popularity and sales, from so many inputs and their
sub-categories, is not possible just through one perceptron
A feedforward neural network is the simplest artificial neural network: it contains multiple nodes arranged in multiple layers, and nodes in adjacent layers are connected by weighted edges
• Each node, apart from the input nodes, has a nonlinear activation function
• An MLP uses backpropagation as a supervised learning technique
(Figure: input layer → hidden layers → output layer)
MLP is widely used for solving problems that require supervised learning and research into computational
neuroscience and parallel distributed processing. Such applications include speech recognition, image recognition,
and machine translation
Input Layer:
• Outputs from the nodes in the input layer are 1 (Bias), X1, and X2, respectively
Hidden Layer:
• It also has three nodes, with the Bias node having an output of 1
• The output of the other two nodes in the hidden layer depends on the outputs from the input layer (1, X1, and X2) as well as the weights on the connections
• The figure shows the output calculation for one of the hidden nodes; the output from the other hidden node can be calculated similarly
• Here, 'f' refers to the activation function. These outputs are then fed to the nodes in the output layer
Output Layer:
• Each node in the output layer computes its output from the hidden-layer outputs in the same way a perceptron does
▪ Once all sources are fed into the system, the neural network calculates the output after the computation is done
Hours Studied | Mid-term Marks | Final Result
35            | 67             | 1
12            | 75             | 0
16            | 89             | 1
45            | 56             | 1
10            | 90             | 0
• The two input columns show the number of hours each student has studied and the mid-term marks obtained by the student, respectively
• The Final Results column can have two values, 1 or 0, indicating whether the student passed (1) or failed (0) the final term
This is a binary classification problem where a multi-layer perceptron can learn from the given examples (the
training data) and make an informed prediction when given a new data point. We will see now, how a multi-layer
perceptron learns such relationships
• The figure has two nodes in the input layer (apart from the Bias node) which take the inputs Hours Studied and Mid-term Marks
• It also has a hidden layer with two nodes (apart from the Bias node)
• The output layer has two nodes as well: the upper node outputs the probability of ‘Pass’ while the lower node outputs the probability of ‘Fail’
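A hedged Keras sketch of the network just described; the table above supplies the data, while the training settings below are illustrative assumptions:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X = np.array([[35, 67], [12, 75], [16, 89], [45, 56], [10, 90]], dtype=float)
y = np.array([1, 0, 1, 1, 0])               # 1 = pass, 0 = fail

model = Sequential([
    Dense(2, activation='sigmoid', input_shape=(2,)),   # hidden layer (2 nodes)
    Dense(2, activation='softmax'),                      # P(pass), P(fail)
])
model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy')
model.fit(X, y, epochs=200, verbose=0)
print(model.predict(np.array([[25.0, 70.0]])))           # a new student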
▪ As shown in Figure, errors at the output nodes have now reduced to [0.2, -0.2] as compared to [0.6, -0.4] earlier
▪ This means that our network has learned to correctly classify our first training example
▪ We repeat this process with all other training examples in our dataset. Then, our network will learn those examples as well
The backpropagation algorithm is a supervised learning method for multi-layer feedforward networks from the field of Artificial
Neural Networks
• The principle of this approach is to model a given function by modifying the internal weightings of input signals to produce an expected output signal
• The system is trained using a supervised learning method, where the error between the system's output and a known expected output is presented to the system and used to modify its internal state
Observe the difference between the actual output and the desired output:
Input | Desired Output | Model Output (W = 3) | Absolute Error | Square Error
0     | 0              | 0                    | 0              | 0
1     | 2              | 3                    | 1              | 1
2     | 4              | 6                    | 2              | 4
We see that when the weight is reduced, the error also decreases
We need to reach the Global Loss Minimum; this iterative weight adjustment, driven by the error gradients, is nothing but backpropagation
In order to have some numbers to work with, here are the initial weights, biases, and training inputs/outputs. We repeat this process for the output-layer neurons, using the output from the hidden-layer neurons as their input
Calculating the Total Error
We can now calculate the error for each output neuron using the squared error function and sum them to get the total error:
E_total = Σ ½(target − output)²
The target output for o1 is 0.01, but the neural network output is 0.75136507; therefore, its error is:
E_o1 = ½(target_o1 − out_o1)² = ½(0.01 − 0.75136507)² = 0.274811083
By repeating this process for o2 (remembering that the target is 0.99), we get:
E_o2 = 0.023560026
The total error for the neural network is the sum of these errors:
E_total = E_o1 + E_o2 = 0.274811083 + 0.023560026 = 0.298371109
Step 2: The Backward Pass
Our goal with backpropagation is to update each of the weights in the network so that the actual output is closer to the target output, thereby minimizing the error for each output neuron and the network as a whole
Consider w5; we will calculate the rate of change of the error w.r.t. the change in weight w5:
∂E_total/∂w5 = (∂E_total/∂out_o1) * (∂out_o1/∂net_o1) * (∂net_o1/∂w5)
Since we are propagating backwards, the first thing we need to do is calculate the change in the total error w.r.t. the outputs o1 and o2:
E_total = ½(target_o1 − out_o1)² + ½(target_o2 − out_o2)²
∂E_total/∂out_o1 = −(target_o1 − out_o1) = −(0.01 − 0.75136507) = 0.74136507
Now, we will propagate further backward and calculate the change in the output o1 w.r.t. its total net input:
out_o1 = 1/(1 + e^(−net_o1))
∂out_o1/∂net_o1 = out_o1 * (1 − out_o1) = 0.75136507 * (1 − 0.75136507) = 0.186815602
How much does the total net input of o1 change w.r.t. w5?
net_o1 = w5 * out_h1 + w6 * out_h2 + b2 * 1
∂net_o1/∂w5 = 1 * out_h1 * w5^(1−1) + 0 + 0 = out_h1 = 0.593269992
Step 2 (continued): Putting all the values together and calculating the updated weight value
We can repeat this process to get the new weights w6, w7, and w8:
w6+ = 0.408666186
w7+ = 0.511301270
w8+ = 0.561370121
We perform the actual updates in the neural network after we have the new weights leading into the hidden layer neurons
We are going to use a similar process as we did for the output layer, but slightly different, to account for the fact that the output of each hidden-layer neuron contributes to the outputs (and hence the errors) of multiple output neurons. Thus, we need to take both E_o1 and E_o2 into consideration
Starting with:
∂E_total/∂out_h1 = ∂E_o1/∂out_h1 + ∂E_o2/∂out_h1
We can calculate:
∂E_total/∂out_h1 = 0.055399425 + (−0.019049119) = 0.036350306
Now that we have ∂E_total/∂out_h1, we need to figure out ∂out_h1/∂net_h1 and ∂net_h1/∂w for each weight:
out_h1 = 1/(1 + e^(−net_h1))
∂out_h1/∂net_h1 = out_h1 * (1 − out_h1) = 0.59326999 * (1 − 0.59326999) = 0.241300709
We calculate the partial derivative of the total net input to h1 with respect to w1 the same way as we did for the output neuron:
net_h1 = w1*i1 + w3*i2 + b1*1
∂net_h1/∂w1 = i1 = 0.05
• When we fed forward 0.05 and 0.1 inputs originally, the error on the network was 0.298371109
• After this first round of backpropagation, the total error is now down to 0.291027924
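The chain-rule product above can be checked numerically; this hedged sketch uses only the values quoted in the walkthrough:

out_h1 = 0.593269992
target_o1, out_o1 = 0.01, 0.75136507

dE_dout = -(target_o1 - out_o1)        # 0.74136507
dout_dnet = out_o1 * (1 - out_o1)      # 0.186815602 (sigmoid derivative)
dnet_dw5 = out_h1                      # 0.593269992
print(dE_dout * dout_dnet * dnet_dw5)  # dE_total/dw5, about 0.0821670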
▪ Gradient descent is by far the most popular optimization strategy used in Machine Learning and Deep Learning at the moment
▪ It is used while training your model, can be combined with every algorithm, and is easy to understand and implement
▪ Gradient measures how much the output of a function changes if you change the inputs a little bit
▪ You can also think of a gradient as the slope of a function: the higher the gradient, the steeper the slope and the faster the model learns
How Does It Work?
b = a − γ∇f(a)
• b is the next value, and a is the current value
• The minus sign refers to the minimization part of gradient descent
• γ is the learning rate
• ∇f(a) is the direction of the steepest descent
Gradient Descent
b = a − γ∇f(a)
▪ This formula basically tells you the next position where you need to go, which is the direction of the steepest descent
▪ Gradient descent can be thought of as climbing down to the bottom of a valley, instead of climbing up a hill, because it is a minimization algorithm that minimizes a given function
▪ Consider the graph below where we need to find the values of w and b that correspond to the minimum of the cost function (marked with the red arrow)
▪ To start with finding the right values, we initialize the values of w and b with some random numbers, and gradient descent then starts at that point (somewhere
around the top)
▪ Then, it takes one step after the other in the steepest downside direction (e.g., from top to bottom) till it reaches the point where the cost function is as small as
possible
▪ Learning rate determines how fast or slow we will move toward the optimal weights
▪ In order for gradient descent to reach the local minimum, we have to set the learning rate to an appropriate value, which is neither too low nor too high
• If the steps it takes are too big, it might never reach the local minimum, because it just bounces back and forth in the convex function of gradient descent
• If you set the learning rate to a very small value, gradient descent will eventually reach the local minimum, but it might take too much time
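A hedged sketch of the update rule b = a − γ∇f(a) on an assumed toy function f(a) = a²:

def grad_f(a):
    return 2 * a            # gradient of f(a) = a^2

a, gamma = 5.0, 0.1         # starting point and learning rate
for _ in range(100):
    a = a - gamma * grad_f(a)   # step in the direction of steepest descent
print(a)                    # approaches the minimum at a = 0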
Understanding Epoch
▪ One epoch is when an ENTIRE dataset is passed forward and backward through the neural network only ONCE
▪ One epoch leads to underfitting of the curve in the graph
▪ As the number of epochs increases, the weights are updated more times in the neural network, and the curve goes from underfitting to optimal to overfitting
Batch Size
▪ Total number of training examples present in a single batch is referred to as the batch size
▪ Since we can’t pass the entire dataset into the neural net at once, we divide the dataset into number of batches or sets or parts
Iterations
Let’s say, we have 2,000 training examples that we are going to use. We can divide the dataset of 2,000
examples into batches of 500, and then it will take four iterations to complete one epoch
• Plot Cost versus Time: Collect and plot the cost values calculated by the algorithm for each iteration. The expectation for a well-performing gradient
descent run is a decrease in cost at every iteration. If it does not decrease, try reducing your learning rate
• Learning Rate: The learning rate value is a small real value such as 0.1, 0.001, or 0.0001. Try different values for your problem and see which works best
• Rescale Inputs: The algorithm will reach the minimum cost faster if the shape of the cost function is not skewed and distorted. You can achieve this by
rescaling all of the input variables (X) to the same range, such as [0, 1] or [-1, 1]
▪ The Adaptive Moment Estimation or Adam optimization algorithm is a combination of gradient descent with momentum and RMSprop algorithms
▪ Adam is an adaptive learning rate method, which means that it computes individual learning rates for different parameters
(Example setup: inputs x1 = 2 and x2 = 5 with target output y = 31; 'y' is reshaped and 'X' is transposed before training.)
• Keras prioritizes developer experience
• Keras is broadly adopted in the industry and among the research community
• Keras supports multiple backend engines and does not lock you into one ecosystem
• Keras development is backed by key companies in the Deep Learning ecosystem
Keras is a high-level neural networks API. It is written in Python and can run on top of Theano, TensorFlow, or CNTK. It is designed
to be modular, fast, and easy to use
'Being able to go from idea to result with the least possible delay is key to doing good research'
• So, Keras is the high-level API wrapper for the low-level APIs
You can create two types of models available in Keras, i.e., the Sequential model and the Functional model
▪ You can create a Sequential model by passing a list of layer instances to the constructor
▪ Stacking convolutional layers one above the other can be an example of a sequential model
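A hedged sketch of that constructor pattern; the layer sizes are illustrative assumptions:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential([                      # a list of layer instances
    Dense(32, activation='relu', input_shape=(784,)),
    Dense(10, activation='softmax'),
])
model.summary()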
The Keras functional API is used for defining complex models, such as multi-output models, directed acyclic graphs, or models with shared layers

Models are defined by creating instances of layers and connecting them directly to each other in pairs, then specifying the layers to act as the input and the output of the model. The steps are as follows:

• Defining the Input: unlike the Sequential model, you must create and define a standalone Input Layer that specifies the shape of the input data. In the case of one-dimensional input data, such as for a multilayer perceptron, the shape must explicitly leave room for the shape of the mini-batch size used when splitting the data. Example:

from keras.layers import Input
visible = Input(shape=(2,))

• Connecting Layers: a bracket notation is used to specify the layer from which the input is received by the current layer, after the layer is created. Example:

from keras.layers import Input
from keras.layers import Dense
visible = Input(shape=(2,))
hidden = Dense(2)(visible)
• As we move toward the right in this graph, our model tries to learn the details and the noise in the training data too well, which results in poor performance on unseen data
• In other words, while going toward the right, the complexity of the model increases such that the training error reduces but the testing error doesn't; this is overfitting
• Have you come across a situation where your model performed exceptionally well on train data but was not able to predict test data?
• Or, were you ever on top of a competition in the public leaderboard, only to fall hundreds of places in the final ranking?
• Do you know how complex neural networks are and how that makes them prone to overfitting? This is one of the most common problems Data Science practitioners face
Regularization is a technique which makes slight modifications to the learning algorithm such that the model generalizes better. This, in turn, improves the model's performance on unseen data
Let’s consider a neural network which is overfitting on the training data as shown in the above image
Assume that our regularization coefficient is so high that some of the weight matrices are nearly equal to zero
This will result in a much simpler linear network and slight underfitting of the training data
• We need to optimize the value of the regularization coefficient in order to obtain a well-fitted model as shown in the image below
• Dropout produces very good results and is consequently the most frequently used Regularization technique in the field of Deep Learning
• Let’s say our neural network structure is akin to the one shown below:
• At every iteration, it randomly selects some nodes and removes them, along with all of their incoming and outgoing connections as shown below:
• So, each iteration has a different set of nodes, and this results in a different set of outputs. It can also be thought of as an ensemble technique in
Machine Learning
• Ensemble models usually perform better than a single model as they capture more randomness. Similarly, dropout also performs better than a normal neural network model
• The probability of choosing how many nodes should be dropped out is the hyperparameter of the dropout function. As seen in the image below,
dropout can be applied to both the Hidden Layers as well as the Input Layers
• Due to these reasons, dropout is usually preferred when we have a large neural network structure in order to introduce more randomness
model = Sequential([
    Dense(output_dim=hidden1_num_units, input_dim=input_num_units, activation='relu'),
    Dropout(0.25),  # randomly drop 25% of this layer's nodes at each iteration
])
• The simplest way to reduce overfitting is to increase the size of the training data
• There are a few ways of increasing the size of the training data—rotating the image, flipping, scaling, shifting, etc.
• In the below image, some transformation has been done on the handwritten digits dataset
• This usually provides a big leap in improving the accuracy of the model
• It has a big list of arguments which you can use to pre-process your training data
• Example:
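A hedged sketch assuming the tool referred to here is Keras's ImageDataGenerator; the argument values are illustrative:

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,        # rotate images by up to 20 degrees
    width_shift_range=0.1,    # shift horizontally
    height_shift_range=0.1,   # shift vertically
    horizontal_flip=True,     # flip images
)
# datagen.flow(x_train, y_train, batch_size=32) then yields augmented batches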
• When the data is fed through a deep neural network and weights and parameters adjust those values, sometimes making the data too big or too small, it
becomes a problem. By normalizing the data in each mini-batch, this problem is largely avoided
• Batch Normalization normalizes each batch with reference to both its mean and variance
• It is just another layer that you can use to create your desired network architecture
• It is generally used between the linear and the non-linear layers in your network, because it normalizes the input to your activation function so that you are centered in its linear section
A normal Dense fully connected layer looks like this:

model.add(layers.Dense(64, activation='relu'))

To make it Batch Normalization enabled, we have to tell the Dense layer not to use bias, since it is not needed and this saves some calculation. Also, put the Activation layer after the BatchNormalization() layer:

model.add(layers.Dense(64, use_bias=False))
model.add(layers.BatchNormalization())
model.add(layers.Activation("relu"))
Let us see the 4-step workflow in developing neural networks with Keras
Here, we will try to build a Sequential Network of Dense Layers, and the dataset used is MNIST. MNIST is a classic dataset of
handwritten images, released in 1999, and has served as the basis for benchmarking classification algorithms
Here, we will be using Keras to create a simple neural network to predict, as accurately as we can, digits from handwritten images. In particular, we will be calling the Functional Model API of Keras and creating 4-layered and 5-layered neural networks.
Also, we will be experimenting with various optimizers: the plain vanilla Stochastic Gradient Descent optimizer and the Adam optimizer.
We will also introduce dropout, a regularization technique, in our neural networks to prevent overfitting
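A hedged, minimal version of the experiment described above (layer sizes, epochs, and the dropout rate are illustrative choices, not the course's exact script):

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784).astype('float32') / 255
x_test = x_test.reshape(10000, 784).astype('float32') / 255

model = Sequential([
    Dense(512, activation='relu', input_shape=(784,)),
    Dropout(0.2),                      # regularization against overfitting
    Dense(10, activation='softmax'),   # one probability per digit
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, batch_size=128)
print(model.evaluate(x_test, y_test))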
Topics: Disadvantages of Fully Connected Networks; What Is CNN?
▪ Depending on the resolution and size of the image, it will see a 32 x 32 x 3 array of numbers (3 refers to RGB values)
▪ Each of these numbers is given a value from 0 to 255 which describes the pixel intensity at that point
An image of size 200 × 200 × 3 would have 120,000 weights at the Hidden Layer; how will a fully connected network manage that?
Also, we would have several such layers of neurons, leading to even more parameters. Thus, this full connectivity would be wasteful, and the huge number of parameters would lead to overfitting!
CNNs are like neural networks (that we have already studied before) and are made up of neurons with learnable weights and biases.
Each neuron receives several inputs, takes a weighted sum of them, passes it through an activation function, and responds with an output
• The whole network has a loss function, and all the tips and tricks that we developed for neural networks still apply to CNNs
So, how are Convolutional Neural Networks different from regular Neural Networks?
▪ ConvNets pass many filters over a single image, each one picking up a different signal
▪ At a fairly early layer, you could imagine them as passing a horizontal line filter, a vertical line filter, and a diagonal line filter to create a map of the edges in the image
▪ CNNs take those filters, slices of the image's feature space, and map them one by one; i.e., they create a map of each place where the feature occurs
Convolutional Neural Networks process images using three main operations:
• Convolution
• ReLU
• Pooling
▪ A computer understands an image using numbers at each pixel as shown in the figure
▪ Here, blue pixels have −1 value, while the white pixels have a value of 1
If we just naively compare the pixel values of a normal image with another 'X' rendition, we will get a lot of mismatching pixels. This means plain pixel-by-pixel comparison is not an optimal way of image classification, since it requires exactly identical images
▪ Features are small arrays of values that match common pieces of most Xs
▪ These features will probably match the arms and the center of any image of an X
▪ When presented with a new image, the CNN doesn't know exactly where these features will match, so it tries them in every possible position
▪ In calculating the match of a feature across the whole image, they (ConvNet) act as filters
▪ The math used to perform this is called convolution, from which Convolutional Neural Networks get
their name
▪ Multiply the filter values with the corresponding pixels of the image patch, and the products will be stored in another buffer, the feature image
Convolutional Layer
▪ The final value obtained from the math that is performed in the last step is placed at the center of the filtered image as shown above
Now, move this filter around and do the same at any pixel in the image. Consider the below example:
▪ As you can see, here after performing all the steps, the value is 0.55!
▪ To complete the convolution, repeat this process, lining up the feature with every possible image patch
▪ Take the answer from each convolution and make a new 2-dimensional array from it, based on where in the image each patch is located
▪ It’s a map showing where in the image the feature can be found
▪ Values close to 1 show strong matches, values close to −1 show strong matches for the photographic negative of our feature, and values near 0 show no match of any sort
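A hedged NumPy sketch of the multiply-and-average matching step at a single position; the tiny image patch and diagonal filter are illustrative assumptions:

import numpy as np

patch = np.array([[ 1, -1, -1],
                  [-1,  1, -1],
                  [-1, -1,  1]], dtype=float)   # -1/1 pixels, as described above
feature = 2 * np.eye(3) - 1                      # diagonal-line feature

print((patch * feature).mean())   # 1.0: a perfect match at this position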
▪ This function activates a node only if the input is above a certain quantity.
▪ When the input is below zero, the output is zero, but when the input rises above a certain threshold, it has a linear relationship with the dependent variable
x  | f(x)       | F(x)
−2 | f(−2) = 0  | 0
−6 | f(−6) = 0  | 0
2  | f(2) = 2   | 2
6  | f(6) = 6   | 6
▪ We have considered a simple function with the values mentioned above: negative inputs produce 0, while positive inputs pass through unchanged
▪ Its math is also very simple—wherever a negative number occurs, swap it out for a 0
▪ This helps the CNN stay mathematically healthy by keeping learned values from getting stuck near 0 or blowing them up toward infinity
▪ Pooling is a way to take large images and shrink them down while preserving the most important information in them
▪ It consists of stepping a small window across an image and taking the maximum value from the window at each step
▪ In this case, we took the window size to be 2 and we got four values to choose from
▪ Among those four values, the maximum value is 1, so we pick 1. Also, note that we started with a 7×7 matrix, and after pooling the same matrix came down to 4×4
▪ The procedure is exactly the same as above, and we need to repeat it for the entire image
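A hedged NumPy sketch of 2×2 max pooling with stride 2 (the 4×4 input is an illustrative assumption):

import numpy as np

img = np.array([[1, 0, 2, 3],
                [4, 6, 6, 8],
                [3, 1, 1, 0],
                [1, 2, 2, 4]], dtype=float)

pooled = img.reshape(2, 2, 2, 2).max(axis=(1, 3))  # max of each 2x2 window
print(pooled)   # [[6. 8.] [3. 4.]]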
▪ The output will have the same number of images, but they will have fewer pixels
▪ You’ve probably noticed that the output of one layer is taken as the input to the other. Because of this, we can stack them like Lego bricks
▪ Raw images get filtered, rectified, and pooled to create a set of shrunken, feature-filtered images. These can be filtered and shrunken again and again
▪ Here, we take our filtered and shrunk images and put them into a single list (vector)
▪ Every value gets its own vote on whether the current image is
an X or an O
▪ Similarly, when we feed an image of an O, certain values will be higher than others
▪ Some values are much better than the others at knowing when the image is an X, and some are particularly good at knowing when the image is an O
▪ These get larger votes than the others. These votes are expressed as weights, or connection strengths, between each value and each category
▪ We’re done with training the network, and now we can begin to predict and check the working of the classifier
Fully Connected Layer
▪ To classify a new input, our network compares the obtained values with the vote lists for 'X' and 'O'
▪ We just added the values which we found to be high (1st, 4th, 5th, 10th, and 11th) from the vector table of X, and we got the sum as 5
▪ We did exactly the same thing with the input image and got a value of 4.56
▪ When we divide the values, we have a probability match of 0.91!
▪ Doing the same with the vector table of O, we get an output of 0.51
▪ Since 0.51 is less than 0.91, the input image is much more likely to be an X than an O
▪ In practice, several fully connected layers are often stacked together, with each intermediate layer voting on phantom ‘hidden’ categories
▪ In effect, each additional layer lets the network learn ever more sophisticated combinations of features that help it make better decisions
Object Recognition with Convolutional
Neural Networks in the Keras Deep
Learning Library
CIFAR-10 is an established computer-vision dataset used for object recognition. The CIFAR-10 data consists of 60,000 (32×32) color
images in 10 classes, with 6000 images per class. There are 50,000 training images and 10,000 test images in the official data. The
label classes in the dataset are:
airplane
automobile
bird
cat
deer
dog
frog
horse
ship
truck
Let's look into the full Python implementation of the object recognition task on the CIFAR-10 dataset
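As a hedged outline (not the course's exact script), a small Keras CNN for CIFAR-10 could look like this; the architecture and training settings are illustrative assumptions:

from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    MaxPooling2D((2, 2)),                 # convolution + pooling, twice
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),                            # into the fully connected layers
    Dense(64, activation='relu'),
    Dense(10, activation='softmax'),      # one output per CIFAR-10 class
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10, batch_size=64,
          validation_data=(x_test, y_test))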
In a simple MLP model with 8 neurons in the input layer, 5 neurons in the hidden layer, and 1 neuron in the output layer, what are the sizes of the weight matrices between the hidden and output layers and between the input and hidden layers?
A [1 × 5], [5 × 8]
B [8 × 5], [1 × 5]
C [8 × 5], [5 × 1]
D [5 × 1], [8 × 5]
Recurrent Neural Network
▪ RNNs are called recurrent because they perform the same task for every element of a sequence, with the output being dependent on the previous
computations
▪ Another way to think about RNNs is that they have a “memory” which captures information about what has been calculated so far
▪ An RNN typically predicts one output per time step; conceptually, the network is unrolled across time steps
▪ Then, it calculates the error at each time step and adds up all of the errors
▪ Following this, the network is rolled back up and the weights are updated
Suppose we try to predict the last word in this text:
"I've been staying in Spain for the last 10 years. I can speak fluent …"
• In this case, the network needs the context of 'Spain' to predict the last word, which is "Spanish"
• The gap between the word we want to predict and the relevant information is very large; this is known as a long-term dependency
▪ Now, if there is a really long dependency, there is a good probability that one of the gradients might approach zero, and this would lead to all the gradients vanishing: ∂E/∂W = 0
▪ Such states would no longer help the network learn anything. This is known as the vanishing gradient problem
Long Short-Term Memory (LSTM) networks are a special kind of RNN, explicitly designed to avoid the long-term dependency problem

Standard RNN
All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer

LSTM
LSTMs also have this chain-like structure, but the repeating module has a different structure: instead of having a single neural network layer, there are four, interacting in a very special way
The key to LSTMs is the cell state. The cell state is kind of like a conveyor belt: it runs straight down the entire chain, with only some minor linear interactions. It is very easy for information to just flow along it unchanged

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates

Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means "let nothing through," while a value of one means "let everything through!"
Step 1
The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This
decision is made by a sigmoid layer called the “forget gate layer”
Step 2
The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a
sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of
new candidate values, that could be added to the state
Step 3
Then, we have to update the old cell state, C(t−1), into the new cell state, C(t). We multiply the old state C(t−1) by f(t), forgetting the things we decided to forget earlier. Then we add i(t) * C̃(t), the new candidate values, scaled by how much we decided to update each state value

Step 4
Finally, we run a sigmoid layer which decides what parts of the cell state we are going to output. Then, we put the cell state through tanh and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to
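For reference, Steps 1–4 correspond to the standard LSTM gate equations (a summary added here, not taken from the slides):

f_t = σ(W_f · [h_(t−1), x_t] + b_f)       (forget gate: Step 1)
i_t = σ(W_i · [h_(t−1), x_t] + b_i)       (input gate: Step 2)
C̃_t = tanh(W_C · [h_(t−1), x_t] + b_C)   (candidate values: Step 2)
C_t = f_t * C_(t−1) + i_t * C̃_t          (cell-state update: Step 3)
o_t = σ(W_o · [h_(t−1), x_t] + b_o)       (output gate: Step 4)
h_t = o_t * tanh(C_t)                     (output: Step 4)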
Adding the LSTM layer with the output and input shape, normalizing the raw data, and fitting the model with the normalized values and the number of epochs set to 500 (the slides' code listing and normalization figure are not reproduced here; a hedged sketch follows)
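A hedged sketch of those steps, assuming Keras and scikit-learn; the toy series, window length, and layer size are illustrative assumptions:

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import LSTM, Dense

raw = np.arange(100, dtype=float).reshape(-1, 1)   # toy raw series
scaler = MinMaxScaler()
normalized = scaler.fit_transform(raw)              # normalize the raw data

# Windows of 5 past values predict the next value
X = np.array([normalized[i:i + 5] for i in range(94)])   # shape (94, 5, 1)
y = normalized[5:99]

model = Sequential([
    LSTM(32, input_shape=(5, 1)),   # LSTM layer with the input shape
    Dense(1),                       # one output value
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=500, verbose=0)   # 500 epochs, as on the slide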
Topics: Architecture of Autoencoders; Hyperparameters of Autoencoders; Collaborative Filtering Using RBMs
For Your Information!
Artificial Intelligence encompasses a wide range of technologies and techniques that enable computer systems to solve problems, such as data compression, which is used in computer vision, computer networks, computer architecture, and many other fields
An autoencoder is an Unsupervised Machine Learning algorithm that takes an image as input and tries to reconstruct it using fewer bits from the bottleneck, also known as the latent space
Autoencoders are data-specific, which means that they will only be able to compress data similar to what they have been trained on. An autoencoder trained on pictures of faces would do a rather poor job of compressing pictures of trees, because the features it would have learned would be face-specific
• Autoencoders are similar to dimensionality reduction techniques like Principal Component Analysis (PCA)
• PCA projects the data from a higher dimension to a lower dimension using a linear transformation
• Both techniques try to preserve the important features of the data while removing the non-essential parts
'The major difference between autoencoders and PCA lies in the transformation part: PCA uses linear transformations, whereas autoencoders use non-linear transformations'
Encoder
• This part of the network compresses or downsamples the input into fewer bits
• The space represented by these fewer bits is often called the latent space or the bottleneck
• The bottleneck is also called the 'maximum point of compression', since at this point the input is compressed the most
• These compressed bits that represent the original input are together called an encoding of the input

Code
• This part of the network represents the compressed input which is fed to the decoder
• The code decides which aspects of the observed data are relevant information and which aspects can be discarded

Decoder
• This part of the network tries to reconstruct the input using the encoded input
• When the decoder is able to reconstruct the input exactly as it was fed to the encoder, you can say that the encoder is able to produce the best encodings for the input
CODE SIZE
1 It represents the number of nodes in the middle layer. A smaller size results in more compression

NUMBER OF LAYERS
2 An autoencoder can consist of as many layers as we want
We will start with a single fully-connected neural layer as the encoder and as the decoder:
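The code itself is not reproduced in these slides; here is a hedged reconstruction (this walkthrough follows the well-known Keras autoencoder tutorial, where encoding_dim is 32, but that value is an assumption here):

from keras.layers import Input, Dense
from keras.models import Model

encoding_dim = 32                       # size of the code (bottleneck)
input_img = Input(shape=(784,))         # a flattened 28x28 MNIST image
encoded = Dense(encoding_dim, activation='relu')(input_img)   # encoder
decoded = Dense(784, activation='sigmoid')(encoded)           # decoder
autoencoder = Model(input_img, decoded)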
First, we'll configure our model to use a per-pixel binary cross-entropy loss and the Adadelta optimizer:
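A one-line sketch of that configuration:

autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')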
Let's prepare our input data. We're using MNIST digits, and we're discarding the labels (since we're only interested in
encoding/decoding the input images)
We will normalize all values between 0 and 1, and we will flatten 28x28 images into vectors of size 784
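A hedged sketch of this preparation step:

import numpy as np
from keras.datasets import mnist

(x_train, _), (x_test, _) = mnist.load_data()     # labels are discarded
x_train = x_train.astype('float32') / 255.        # normalize to [0, 1]
x_test = x_test.astype('float32') / 255.
x_train = x_train.reshape((len(x_train), 784))    # flatten 28x28 into 784
x_test = x_test.reshape((len(x_test), 784))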
After 50 epochs, the autoencoder seems to reach a stable train/test loss value of about 0.11. We can try to visualize the
reconstructed inputs and the encoded representations
Here's what we get as an output. The top row shows the original digits, and the bottom row shows the reconstructed digits. We are losing quite a bit of detail with this basic approach
▪ In the previous example, the representations were only constrained by the size of the hidden layer
▪ What typically happens in such situations is that the hidden layer is learning an approximation of PCA
▪ Another way to constrain the representations is to add a sparsity constraint on the activity of the hidden representations, using an activity regularizer:

from keras import regularizers

input_img = Input(shape=(784,))
# add a Dense layer with an L1 activity regularizer
encoded = Dense(encoding_dim, activation='relu',
                activity_regularizer=regularizers.l1(10e-5))(input_img)
▪ Train this model for 100 epochs (with the added regularization, the model is less likely to overfit and can be trained longer)
▪ The model ends with a train loss of 0.11 and a test loss of 0.10
▪ The difference between the two is mostly due to the regularization term being added to the loss during training (worth about 0.01)
They look pretty similar to the previous model's outputs, the only significant difference being the sparsity of the encoded representations: encoded_imgs.mean() yields a value of 3.33 (over our 10,000 test images), whereas with the previous model the same quantity was 7.30. So, our new model yields encoded representations that are about twice as sparse
Topics: Data Management Using TFLearn; Use Case 1
TFLearn is a modular and transparent Deep Learning library built on top of TensorFlow. It was designed to provide a high-level API to TensorFlow in order to facilitate and speed up experimentation, while remaining fully transparent and compatible with it.
Layers are a core feature of TFLearn. Here is a list of all currently available layers:
File      | Layers
core      | input_data, fully_connected, dropout, custom_layer, reshape, flatten, activation, single_unit, highway, one_hot_encoding, and time_distributed
conv      | conv_2d, conv_2d_transpose, max_pool_2d, avg_pool_2d, upsample_2d, conv_1d, max_pool_1d, avg_pool_1d, residual_block, residual_bottleneck, conv_3d, max_pool_3d, avg_pool_3d, highway_conv_1d, highway_conv_2d, global_avg_pool, and global_max_pool
recurrent | simple_rnn, lstm, gru, bidirectional_rnn, and dynamic_rnn
embedding | embedding
estimator | regression
Example:
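A hedged sketch of stacking the layers listed above (sizes are illustrative assumptions):

import tflearn

net = tflearn.input_data(shape=[None, 784])                # core: input_data
net = tflearn.fully_connected(net, 64, activation='relu')  # core: fully_connected
net = tflearn.dropout(net, 0.5)                            # core: dropout
net = tflearn.fully_connected(net, 10, activation='softmax')
net = tflearn.regression(net)                              # estimator: regression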
▪ Besides the layers concept, TFLearn also provides many different operations to be used when building a neural network
▪ These operations are mainly meant to be part of the above 'layers' argument, but they can also be used independently in any other TensorFlow graph for
convenience
▪ In practice, just providing the operation's name as an argument is enough (such as activation='relu' or regularizer='L2' for conv_2d), but a function can also be passed directly
File        | Operations
activations | linear, tanh, sigmoid, softmax, softplus, softsign, relu, relu6, leaky_relu, prelu, and elu
objectives  | softmax_categorical_crossentropy, categorical_crossentropy, binary_crossentropy, mean_square, hinge_loss, roc_auc_score, and weak_cross_entropy_2d
optimizers  | SGD, RMSProp, Adam, Momentum, AdaGrad, Ftrl, and AdaDelta
losses      | l1 and l2
Example:
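A hedged sketch of passing operation names as arguments (the network itself is an illustrative assumption):

import tflearn

net = tflearn.input_data(shape=[None, 32, 32, 3])
net = tflearn.conv_2d(net, 32, 3, activation='relu', regularizer='L2')
net = tflearn.fully_connected(net, 10, activation='softmax')
net = tflearn.regression(net, optimizer='adam', loss='categorical_crossentropy')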
▪ To save or restore a model, you can simply invoke the save or load method of the Deep Neural Network model class
# Save a model
model.save('my_model.tflearn')
# Load a model
model.load('my_model.tflearn')
▪ Retrieving layer variables can either be done using the layer name or, directly, by using 'W' or 'b' attributes that are supercharged to the layer's returned Tensor
▪ To get or set the value of variables, TFLearn models class implement get_weights and set_weights methods
Note: You can also directly use TensorFlow eval or assign ops to get or set the value of these variables
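A hedged sketch of both access styles (the tiny network is an illustrative assumption):

import tflearn

net = tflearn.input_data(shape=[None, 784])
fc1 = tflearn.fully_connected(net, 64, name='fc1')
model = tflearn.DNN(tflearn.regression(fc1))

w = model.get_weights(fc1.W)   # read via the layer's 'W' attribute
model.set_weights(fc1.W, w)    # write the values back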
Fine-tuning is a process that takes a network model already trained for a given task and makes it perform a second, similar task
• Assuming that the original task is similar to the new task, using a network that is already trained allows us to take advantage of the feature extraction that happens in the front layers of the network, without developing the feature-extraction part from scratch
• It replaces the output layer, originally trained to recognize (in the case of ImageNet models) 1,000 classes, with a layer that recognizes the number of classes the new task requires
• The new output layer that is attached to the model is then trained to take the lower-level features from the front of the network and map them to the desired output classes of the new task
• Once this has been done, other late layers in the model can be set as 'trainable=True' so that in further SGD epochs their weights can be fine-tuned for the new task
• So, when defining a model in TFLearn, you can specify which layer's weights you want to be restored or not (when loading the pre-trained model)
• This can be handled with the 'restore' argument of layer functions (only available for layers with weights)
• HDF5 is a data model, a library, and a file format for storing and managing data
• It supports an unlimited variety of data types and is designed for flexible and efficient I/O, as well as for high-volume and complex data
• All layers are built over 'variable_op_scope', which makes it easy to share variables among multiple layers and makes TFLearn suitable for distributed training
• All layers with inner variables support a 'scope' argument to place variables under; layers with the same scope name will then share the same weights
# Define a model builder
def my_model(x):
    x = tflearn.fully_connected(x, 32, scope='fc1')
    x = tflearn.fully_connected(x, 32, scope='fc2')
    x = tflearn.fully_connected(x, 2, scope='out')
    return x  # layers sharing a scope name will share the same weights
In this demo, we will learn to use TFLearn and TensorFlow to model the survival chance of Titanic passengers using their personal information
(such as gender, age, and so on). To tackle this classic Machine Learning task, we are going to build a Deep Neural Network classifier
• Let's take a look at the dataset (TFLearn will automatically download it for you)
• For each passenger, the following information is provided:
• There are two classes in our task: not survived (class = 0) and survived (class = 1), and the passenger data has eight features
• The Titanic dataset is stored in a CSV file, so we can use the TFLearn load_csv() function to load the data from the file into a Python list
• We specify the target_column argument to indicate that our labels (survived or not) are located in the first column (ID is 0)
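A hedged sketch of that loading step (the file name follows the TFLearn Titanic tutorial and is an assumption here):

from tflearn.data_utils import load_csv

data, labels = load_csv('titanic_dataset.csv', target_column=0,
                        categorical_labels=True, n_classes=2)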