Deep Learning: All Modules
Introduction to Artificial Intelligence
'Artificial intelligence (AI) is a field of computer science that emphasizes the creation of intelligent machines which can work and react like humans'
'Intelligence can be defined as one's capacity for understanding, self-awareness, learning, emotional knowledge, planning, creativity, and problem solving'
Artificial Intelligence
Machine Learning is a subset of Artificial Intelligence that gives a machine the ability to learn without being explicitly programmed. Data, not algorithms, is the key to machine learning success
Supervised Learning
Unsupervised Learning
Reinforcement Learning
(Figure: labeled training data, images tagged 'Apple', is shown to the model, which notes the labels; this illustrates Supervised Learning.)
In Supervised Learning, you can consider that the learning is guided by a teacher. We have a dataset which acts as a teacher, and its role is to train the model or the machine. Once the model gets trained, it can start making predictions or decisions whenever new data is given to it
(Figure: asked 'What does this image represent?', the trained model answers '97% it's an apple!')
Use Case: Spam Classifier
Most spam filtering techniques are based on text categorization methods; thus, filtering spam turns out to be a classification problem. We employ Supervised Machine Learning techniques to filter email spam messages
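As a hedged illustration (not the course's own code), this classification framing can be sketched with scikit-learn; the toy texts, labels, and pipeline choices below are assumptions:

# Hedged sketch: spam filtering as supervised text classification
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting at 10 am tomorrow"]  # toy emails
labels = [1, 0]  # 1 = spam, 0 = not spam

model = make_pipeline(TfidfVectorizer(), MultinomialNB())  # vectorize + classify
model.fit(texts, labels)
print(model.predict(["claim your free prize"]))  # likely [1]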
Here, the model learns through observation and finds structures in data. Once the model is given a dataset, it automatically finds patterns and relationships in the dataset by creating clusters in it. What it cannot do is add labels to these clusters. For example, it cannot say whether this is a group of apples, mangoes, or oranges, but it will separate all apples from mangoes and oranges
(Figure: the model separates the data into three clusters: Similar Cluster 1, Similar Cluster 2, and Similar Cluster 3.)
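A hedged sketch of this clustering idea, assuming scikit-learn; the 2-D points and the choice of three clusters are illustrative:

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1, 2], [5, 5], [5, 6], [9, 1], [9, 2]])
kmeans = KMeans(n_clusters=3, n_init=10).fit(points)
print(kmeans.labels_)  # a cluster index per point; no human-readable labels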
Use Case: Netflix Recommendation
Netflix uses Machine Learning algorithms to help break viewers' preconceived notions and find shows that they might not have initially chosen
It is the ability of an agent to interact with the environment and find out what the best outcome is. It follows the concept of the hit-and-trial method. The agent is rewarded or penalized with a point for a correct or a wrong answer, and on the basis of the positive reward points gained, the model trains itself
(Figure: the reinforcement learning loop: the agent observes, selects an action using the policy, receives a reward or penalty, and iterates the process.)
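A hedged sketch of that reward-driven loop as tabular Q-learning; the toy chain environment and hyperparameters are assumptions, not the slides' example:

import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))   # value of each (state, action) pair
alpha, gamma = 0.1, 0.9               # learning rate and discount factor

def step(state, action):
    # Toy environment: action 1 moves right, action 0 moves left
    nxt = min(state + 1, 4) if action == 1 else max(state - 1, 0)
    reward = 1.0 if nxt == 4 else 0.0  # reward only at the goal state
    return nxt, reward

state = 0
for _ in range(2000):                 # observe, act, get reward, iterate
    action = np.random.randint(n_actions)
    nxt, reward = step(state, action)
    Q[state, action] += alpha * (reward + gamma * Q[nxt].max() - Q[state, action])
    state = 0 if nxt == 4 else nxt
print(Q)                              # higher values along the rewarded path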
Use Case: Self-driving Cars
Companies such as Tesla, Google, Wayve, and more are working on such machines. These cars are powered by Reinforcement Learning, which allows machines (known as agents) to learn through experimentation
Machine Learning Algorithms
➢ https://www.autodraw.com/
➢ https://quickdraw.withgoogle.com/
➢ https://opensource.google.com/projects/explore/machine-learning
➢ https://experiments.withgoogle.com/collection/ai
➢ https://toolbox.google.com/datasetsearch
Deep Learning is a part of Machine Learning based on learning data representations, as opposed to task-specific algorithms. It teaches computers to do what comes naturally to humans: to learn by example
Most Deep Learning methods use neural network architectures, which is why Deep Learning models are often referred to as deep neural networks
A neural network is a computing model whose layered structure resembles the networked structure of neurons in the
brain, with layers of connected nodes. It can learn from data, so it can be trained to recognize patterns, classify data, and
forecast future events
Each layer extracts a higher level of abstraction, up to an output layer, with each layer using the output of the previous layer as its input
Artificial neural networks are computing systems inspired by the biological neural networks that constitute animal brains. Such systems learn (progressively improve their ability) to do tasks, such as recognizing images, by considering examples, generally without task-specific programming
Topics: Program Elements; Computational Graph in TensorFlow
Typically, artificial neural networks have a layered structure. The Input Layer picks up the input signals and passes them on to the next layer, known as the 'Hidden' Layer (there may be more than one Hidden Layer in a neural network). Last comes the Output Layer, which delivers the result
A neural network is a computer simulation of the way biological neurons work within a human brain
Dendrites: These branch-like structures extending away from the cell body receive messages from other neurons and allow those messages to travel to the cell body
Axon: An axon carries an electrical impulse from the cell body to another neuron
▪ The three arrows correspond to the three inputs coming into the network
▪ Values [0.7, 0.6, and 1.4] are weights assigned to the corresponding input
▪ Inputs get multiplied with their respective weights and their sum is taken
(Figure: a perceptron with inputs such as Humidity (x1) and Blue shirt (x2) feeding a single output, answering 'Will it rain if I wear a blue shirt?')
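A minimal NumPy sketch of the weighted sum described above; the input values are illustrative assumptions, while the weights are the ones quoted on the slide:

import numpy as np

inputs = np.array([1.0, 0.5, 2.0])    # three inputs coming into the network
weights = np.array([0.7, 0.6, 1.4])   # the weights assigned above
print(np.dot(inputs, weights))         # inputs multiplied by weights, then summed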
Activation Functions:
Binary Step
Sigmoid
Tanh
ReLU
Leaky ReLU
Softmax
Identity Function
• A straight line function where activation is proportional to input
• No matter how many layers we have, if all of them are linear in nature, the final activation function of the last
layer will be nothing but just a linear function of the input of the first layer
• Range: (−∞,∞)
f(x) = x
Binary Step
• A discontinuous function: its value is 0 for a negative argument and 1 for a positive argument

Sigmoid
• When we apply the weighted sum in place of x, the values are scaled between 0 and 1
• Large negative numbers are scaled toward 0, and large positive numbers are scaled toward 1
• Range: (0, 1)
f(x) = 1 / (1 + e^(−x))
Tanh
• The Tanh activation almost always works better than the sigmoid function, as optimization is easier with it
• The advantage of Tanh is that it can deal more easily with negative numbers
• Range: (−1, 1)
f(x) = tanh(x) = 2 / (1 + e^(−2x)) − 1
ReLU
• This function allows only positive values to pass through during forward propagation, as shown in the graph below
• Range: [0, ∞)
f(x) = 0 for x < 0; f(x) = x for x ≥ 0
Leaky ReLU
• Like ReLU, but instead of outputting 0 for negative arguments it outputs a small negative slope, so that gradients for negative inputs do not die out entirely
• Range: (−∞, ∞)
Softmax
• It is useful for finding out the class which has the maximum probability
• The Softmax function is ideally used in the Output Layer of a classifier, where we are trying to attain the probabilities of each class
• Range: (0, 1)
σ(z)_j = e^(z_j) / Σ_{k=1}^{K} e^(z_k),  for j = 1, 2, …, K
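Hedged NumPy sketches of the activations above (the test vector is an illustrative assumption):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # range (0, 1)

def relu(x):
    return np.maximum(0.0, x)          # 0 for x < 0, x otherwise

def softmax(z):
    e = np.exp(z - z.max())            # subtract max for numerical stability
    return e / e.sum()                 # probabilities summing to 1

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x), np.tanh(x), relu(x), softmax(x))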
By training a perceptron, we try to find a line, plane, or some hyperplane which can accurately separate two
classes by adjusting weights and biases
(Figure: inputs X1…X3 with weights W1…W3 and a Bias are summed and passed through an activation function; the weights are then updated, and the process repeats until training stops.)
Weight update rule: W_new = W_old − LR * (∂E/∂W)
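A hedged sketch of this training loop, assuming NumPy and a step activation; the AND-gate data and learning rate are illustrative assumptions:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)    # Boolean AND (linearly separable)

w, b, lr = np.zeros(2), 0.0, 0.1           # weights, bias, learning rate (LR)
for _ in range(50):
    for xi, target in zip(X, y):
        pred = 1.0 if xi.dot(w) + b > 0 else 0.0   # sum + step activation
        err = target - pred
        w += lr * err * xi                  # W_new = W_old - LR * dE/dW
        b += lr * err
print(w, b)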
Deep Learning Frameworks
• TensorFlow: an open-source software library for high-performance numerical computations, developed by Google; it powers products such as Google Translate (which uses a recurrent neural network)
• TensorBoard: used for visualizing TensorFlow computations and graphs
• PyTorch: 'Pythonic' in nature
• DL4J: applied to image recognition, fraud detection, text mining, parts-of-speech tagging, and natural language processing
• MXNet: developed by the Apache Software Foundation; applied to imaging, speech recognition, forecasting, and NLP
A tensor is given as an input to a neural network
Computational graph example: with a = 10, b = 20, and c = 30, the graph computes h = (a * b) + c; a Multiplication node takes a and b, and its result feeds an Addition node together with c
A Session executes the computational graph; Placeholders are graph nodes whose values (such as a, b, and c) are fed in at run time
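A hedged TensorFlow sketch of this graph; tf.compat.v1 is used here as an assumption so the placeholder/Session style also runs under TensorFlow 2:

import tensorflow as tf
tf1 = tf.compat.v1
tf1.disable_eager_execution()

a = tf1.placeholder(tf.float32)   # values are fed in at run time
b = tf1.placeholder(tf.float32)
c = tf1.placeholder(tf.float32)
h = (a * b) + c                    # multiplication node feeding an addition node

with tf1.Session() as sess:        # the Session executes the graph
    print(sess.run(h, feed_dict={a: 10.0, b: 20.0, c: 30.0}))  # 230.0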
Topics: Feedforward Neural Network; Multi-layer Perceptron
▪ A single-layer perceptron can solve only linearly separable problems (Boolean 'AND', for example, is linearly separable)
▪ If the problem is not linearly separable, the learning process will never classify all samples correctly
▪ For solving this problem, we can use a multi-layer perceptron
Complex problems that involve a lot of parameters cannot be solved by a single-layer perceptron
▪ Consider a case where you own an e-commerce firm. You have planned to increase traffic on your site by providing a special discount on the
products and services. Now, you want to create awareness among people regarding this end-season sale by marketing on different portals like:
• Google ads
• Personal emails
• Sales advertisements on relevant sites
• YouTube ads
• Ads on different sites
• LinkedIn
• Blogs and so on
▪ This task is too complex for a human to analyze, as you can see that the number of parameters is quite high
▪ Let us try to solve it using Deep Learning
▪ You can either use just one platform for publicity or use a variety of them
▪ Each of them has its own advantages and disadvantages, but lots of factors would have to be considered
▪ The increased traffic on your portal or the number of sales that would happen is dependent on different categorical inputs, their sub-categories, and their
parameters
Computing and calculating profit in terms of popularity and sales, from so many inputs and their
sub-categories, is not possible just through one perceptron
A feedforward neural network is the simplest artificial neural network: it contains multiple nodes arranged in multiple layers, and nodes in adjacent layers are connected by weighted edges
• Each node, apart from the input nodes, has a nonlinear activation function
• An MLP uses backpropagation as a supervised learning technique
(Figure: input layer → hidden layers → output layer)
MLP is widely used for solving problems that require supervised learning and research into computational
neuroscience and parallel distributed processing. Such applications include speech recognition, image recognition,
and machine translation
Input Layer:
• Outputs from the nodes in the input layer are 1 (Bias), X1, and X2, respectively
Hidden Layer:
• It also has three nodes, with the Bias node having an output of 1
• The output of the other two nodes in the hidden layer depends on the outputs from the input layer (1, X1, and X2) as well as the weights on the connections
• The figure shows the output calculation for one of the hidden nodes; the output from the other hidden node can be calculated similarly
• Here, 'f' refers to the activation function. These outputs are then fed to the nodes in the output layer
Output Layer:
• Each node in the output layer computes its output from the hidden-layer outputs in the same way a perceptron does
▪ Once all sources are fed into the system, the neural network calculates the output after the computation is done
Hours Studied | Mid-term Marks | Final Result
35            | 67             | 1
12            | 75             | 0
16            | 89             | 1
45            | 56             | 1
10            | 90             | 0
• The two input columns show the number of hours each student has studied and the mid-term marks obtained by the student, respectively
• The Final Results column can have two values, 1 or 0, indicating whether the student passed (1) or failed (0) the final term
This is a binary classification problem where a multi-layer perceptron can learn from the given examples (the
training data) and make an informed prediction when given a new data point. We will see now, how a multi-layer
perceptron learns such relationships
• The figure has two nodes in the input layer (apart from the Bias node) which take the inputs Hours Studied and Mid-term Marks
• It also has a hidden layer with two nodes (apart from the Bias node)
• The output layer has two nodes as well: the upper node outputs the probability of ‘Pass’ while the lower node outputs the probability of ‘Fail’
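A hedged Keras sketch of the network just described; the table above supplies the data, while the training settings below are illustrative assumptions:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X = np.array([[35, 67], [12, 75], [16, 89], [45, 56], [10, 90]], dtype=float)
y = np.array([1, 0, 1, 1, 0])               # 1 = pass, 0 = fail

model = Sequential([
    Dense(2, activation='sigmoid', input_shape=(2,)),   # hidden layer (2 nodes)
    Dense(2, activation='softmax'),                      # P(pass), P(fail)
])
model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy')
model.fit(X, y, epochs=200, verbose=0)
print(model.predict(np.array([[25.0, 70.0]])))           # a new student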
▪ As shown in Figure, errors at the output nodes have now reduced to [0.2, -0.2] as compared to [0.6, -0.4] earlier
▪ This means that our network has learned to correctly classify our first training example
▪ We repeat this process with all other training examples in our dataset. Then, our network will learn those examples as well
The backpropagation algorithm is a supervised learning method for multi-layer feedforward networks from the field of Artificial
Neural Networks
• The principle of this approach is to model a given function by modifying the internal weightings of input signals to produce an expected output signal
• The system is trained using a supervised learning method, where the error between the system's output and a known expected output is presented to the system and used to modify its internal state
Observe the difference between the actual output and the desired output:
Input | Desired Output | Model Output (W = 3) | Absolute Error | Square Error
0     | 0              | 0                    | 0              | 0
1     | 2              | 3                    | 1              | 1
2     | 4              | 6                    | 2              | 4
We see that when the weight is reduced, the error also decreases
We need to reach the Global Loss Minimum; this iterative weight adjustment, driven by the error gradients, is nothing but backpropagation
In order to have some numbers to work with, here are the initial weights, biases, and training inputs/outputs. We repeat this process for the output-layer neurons, using the output from the hidden-layer neurons as their input
Calculating the Total Error
We can now calculate the error for each output neuron using the squared error function and sum them to get the total error:
E_total = Σ ½(target − output)²
The target output for o1 is 0.01, but the neural network output is 0.75136507; therefore, its error is:
E_o1 = ½(target_o1 − out_o1)² = ½(0.01 − 0.75136507)² = 0.274811083
By repeating this process for o2 (remembering that the target is 0.99), we get:
E_o2 = 0.023560026
The total error for the neural network is the sum of these errors:
E_total = E_o1 + E_o2 = 0.274811083 + 0.023560026 = 0.298371109
Step 2: The Backward Pass
Our goal with backpropagation is to update each of the weights in the network so that the actual output is closer to the target output, thereby minimizing the error for each output neuron and the network as a whole
Consider w5; we will calculate the rate of change of the error w.r.t. the change in weight w5:
∂E_total/∂w5 = (∂E_total/∂out_o1) * (∂out_o1/∂net_o1) * (∂net_o1/∂w5)
Since we are propagating backwards, the first thing we need to do is calculate the change in the total error w.r.t. the outputs o1 and o2:
E_total = ½(target_o1 − out_o1)² + ½(target_o2 − out_o2)²
∂E_total/∂out_o1 = −(target_o1 − out_o1) = −(0.01 − 0.75136507) = 0.74136507
Now, we will propagate further backward and calculate the change in the output o1 w.r.t. its total net input:
out_o1 = 1/(1 + e^(−net_o1))
∂out_o1/∂net_o1 = out_o1 * (1 − out_o1) = 0.75136507 * (1 − 0.75136507) = 0.186815602
How much does the total net input of o1 change w.r.t. w5?
net_o1 = w5 * out_h1 + w6 * out_h2 + b2 * 1
∂net_o1/∂w5 = 1 * out_h1 * w5^(1−1) + 0 + 0 = out_h1 = 0.593269992
Step 2 (continued): Putting all the values together and calculating the updated weight value
We can repeat this process to get the new weights w6, w7, and w8:
w6+ = 0.408666186
w7+ = 0.511301270
w8+ = 0.561370121
We perform the actual updates in the neural network after we have the new weights leading into the hidden layer neurons
We are going to use a similar process as we did for the output layer, but slightly different, to account for the fact that the output of each hidden-layer neuron contributes to the outputs (and hence the errors) of multiple output neurons. Thus, we need to take both E_o1 and E_o2 into consideration
Starting with:
∂E_total/∂out_h1 = ∂E_o1/∂out_h1 + ∂E_o2/∂out_h1
We can calculate:
∂E_total/∂out_h1 = 0.055399425 + (−0.019049119) = 0.036350306
Now that we have ∂E_total/∂out_h1, we need to figure out ∂out_h1/∂net_h1 and ∂net_h1/∂w for each weight:
out_h1 = 1/(1 + e^(−net_h1))
∂out_h1/∂net_h1 = out_h1 * (1 − out_h1) = 0.59326999 * (1 − 0.59326999) = 0.241300709
We calculate the partial derivative of the total net input to h1 with respect to w1 the same way as we did for the output neuron:
net_h1 = w1*i1 + w3*i2 + b1*1
∂net_h1/∂w1 = i1 = 0.05
• When we fed forward 0.05 and 0.1 inputs originally, the error on the network was 0.298371109
• After this first round of backpropagation, the total error is now down to 0.291027924
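The chain-rule product above can be checked numerically; this hedged sketch uses only the values quoted in the walkthrough:

out_h1 = 0.593269992
target_o1, out_o1 = 0.01, 0.75136507

dE_dout = -(target_o1 - out_o1)        # 0.74136507
dout_dnet = out_o1 * (1 - out_o1)      # 0.186815602 (sigmoid derivative)
dnet_dw5 = out_h1                      # 0.593269992
print(dE_dout * dout_dnet * dnet_dw5)  # dE_total/dw5, about 0.0821670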
▪ Gradient descent is by far the most popular optimization strategy used in Machine Learning and Deep Learning at the moment
▪ It is used while training your model, can be combined with every algorithm, and is easy to understand and implement
▪ Gradient measures how much the output of a function changes if you change the inputs a little bit
▪ You can also think of a gradient as the slope of a function: the higher the gradient, the steeper the slope and the faster the model learns
How Does It Work?
b = a − γ∇f(a)
• b is the next value, and a is the current value
• The minus sign refers to the minimization part of gradient descent
• γ is the learning rate
• ∇f(a) is the direction of the steepest descent
Gradient Descent
b = a − γ∇f(a)
▪ This formula basically tells you the next position where you need to go, which is the direction of the steepest descent
▪ Gradient descent can be thought of as climbing down to the bottom of a valley, instead of climbing up a hill, because it is a minimization algorithm that minimizes a given function
▪ Consider the graph below where we need to find the values of w and b that correspond to the minimum of the cost function (marked with the red arrow)
▪ To start with finding the right values, we initialize the values of w and b with some random numbers, and gradient descent then starts at that point (somewhere
around the top)
▪ Then, it takes one step after the other in the steepest downside direction (e.g., from top to bottom) till it reaches the point where the cost function is as small as
possible
▪ Learning rate determines how fast or slow we will move toward the optimal weights
▪ In order for gradient descent to reach the local minimum, we have to set the learning rate to an appropriate value, which is neither too low nor too high
• If the steps it takes are too big, it might never reach the local minimum, because it just bounces back and forth in the convex function of gradient descent
• If you set the learning rate to a very small value, gradient descent will eventually reach the local minimum, but it might take too much time
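A hedged sketch of the update rule b = a − γ∇f(a) on an assumed toy function f(a) = a²:

def grad_f(a):
    return 2 * a            # gradient of f(a) = a^2

a, gamma = 5.0, 0.1         # starting point and learning rate
for _ in range(100):
    a = a - gamma * grad_f(a)   # step in the direction of steepest descent
print(a)                    # approaches the minimum at a = 0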
Understanding Epoch
▪ One epoch is when an ENTIRE dataset is passed forward and backward through the neural network only ONCE
▪ One epoch leads to underfitting of the curve in the graph
▪ As the number of epochs increases, the weights are updated more times in the neural network, and the curve goes from underfitting to optimal to overfitting
Batch Size
▪ Total number of training examples present in a single batch is referred to as the batch size
▪ Since we can’t pass the entire dataset into the neural net at once, we divide the dataset into number of batches or sets or parts
Iterations
Let’s say, we have 2,000 training examples that we are going to use. We can divide the dataset of 2,000
examples into batches of 500, and then it will take four iterations to complete one epoch
• Plot Cost versus Time: Collect and plot the cost values calculated by the algorithm for each iteration. The expectation for a well-performing gradient
descent run is a decrease in cost at every iteration. If it does not decrease, try reducing your learning rate
• Learning Rate: The learning rate value is a small real value such as 0.1, 0.001, or 0.0001. Try different values for your problem and see which works best
• Rescale Inputs: The algorithm will reach the minimum cost faster if the shape of the cost function is not skewed and distorted. You can achieve this by
rescaling all of the input variables (X) to the same range, such as [0, 1] or [-1, 1]
▪ The Adaptive Moment Estimation or Adam optimization algorithm is a combination of gradient descent with momentum and RMSprop algorithms
▪ Adam is an adaptive learning rate method, which means that it computes individual learning rates for different parameters
(Example setup: inputs x1 = 2 and x2 = 5 with target output y = 31; 'y' is reshaped and 'X' is transposed before training.)
• Keras prioritizes developer experience
• Keras is broadly adopted in the industry and among the research community
• Keras supports multiple backend engines and does not lock you into one ecosystem
• Keras development is backed by key companies in the Deep Learning ecosystem
Keras is a high-level neural networks API. It is written in Python and can run on top of Theano, TensorFlow, or CNTK. It is designed
to be modular, fast, and easy to use
'Being able to go from idea to result with the least possible delay is key to doing good research'
• So, Keras is the high-level API wrapper for the low-level APIs
You can create two types of models available in Keras, i.e., the Sequential model and the Functional model
▪ You can create a Sequential model by passing a list of layer instances to the constructor
▪ Stacking convolutional layers one above the other can be an example of a sequential model
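A hedged sketch of that constructor pattern; the layer sizes are illustrative assumptions:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential([                      # a list of layer instances
    Dense(32, activation='relu', input_shape=(784,)),
    Dense(10, activation='softmax'),
])
model.summary()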
The Keras functional API is used for defining complex models, such as multi-output models, directed acyclic graphs, or models with shared layers

Models are defined by creating instances of layers and connecting them directly to each other in pairs, then specifying the layers to act as the input and the output of the model. The steps are as follows:

• Defining the Input: unlike the Sequential model, you must create and define a standalone Input Layer that specifies the shape of the input data. In the case of one-dimensional input data, such as for a multilayer perceptron, the shape must explicitly leave room for the shape of the mini-batch size used when splitting the data. Example:

from keras.layers import Input
visible = Input(shape=(2,))

• Connecting Layers: a bracket notation is used to specify the layer from which the input is received by the current layer, after the layer is created. Example:

from keras.layers import Input
from keras.layers import Dense
visible = Input(shape=(2,))
hidden = Dense(2)(visible)
• As we move toward the right in this graph, our model tries to learn the details and the noise in the training data too well, which results in poor performance on unseen data
• In other words, while going toward the right, the complexity of the model increases such that the training error reduces but the testing error doesn't; this is overfitting
• Have you come across a situation where your model performed exceptionally well on train data but was not able to predict test data?
• Or, were you ever on top of a competition in the public leaderboard, only to fall hundreds of places in the final ranking?
• Do you know how complex neural networks are and how that makes them prone to overfitting? This is one of the most common problems Data Science practitioners face
Regularization is a technique which makes slight modifications to the learning algorithm such that the model generalizes better. This, in turn, improves the model's performance on unseen data
Let’s consider a neural network which is overfitting on the training data as shown in the above image
Assume that our regularization coefficient is so high that some of the weight matrices are nearly equal to zero
This will result in a much simpler linear network and slight underfitting of the training data
• We need to optimize the value of the regularization coefficient in order to obtain a well-fitted model as shown in the image below
• Dropout produces very good results and is consequently the most frequently used Regularization technique in the field of Deep Learning
• Let’s say our neural network structure is akin to the one shown below:
• At every iteration, it randomly selects some nodes and removes them, along with all of their incoming and outgoing connections as shown below:
• So, each iteration has a different set of nodes, and this results in a different set of outputs. It can also be thought of as an ensemble technique in
Machine Learning
• Ensemble models usually perform better than a single model as they capture more randomness. Similarly, dropout also performs better than a normal neural network model
• The probability of choosing how many nodes should be dropped out is the hyperparameter of the dropout function. As seen in the image below,
dropout can be applied to both the Hidden Layers as well as the Input Layers
• Due to these reasons, dropout is usually preferred when we have a large neural network structure in order to introduce more randomness
model = Sequential([
    Dense(output_dim=hidden1_num_units, input_dim=input_num_units, activation='relu'),
    Dropout(0.25),  # randomly drop 25% of this layer's nodes at each iteration
])
• The simplest way to reduce overfitting is to increase the size of the training data
• There are a few ways of increasing the size of the training data—rotating the image, flipping, scaling, shifting, etc.
• In the below image, some transformation has been done on the handwritten digits dataset
• This usually provides a big leap in improving the accuracy of the model
• It has a big list of arguments which you can use to pre-process your training data
• Example:
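A hedged sketch assuming the tool referred to here is Keras's ImageDataGenerator; the argument values are illustrative:

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,        # rotate images by up to 20 degrees
    width_shift_range=0.1,    # shift horizontally
    height_shift_range=0.1,   # shift vertically
    horizontal_flip=True,     # flip images
)
# datagen.flow(x_train, y_train, batch_size=32) then yields augmented batches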
• When the data is fed through a deep neural network and weights and parameters adjust those values, sometimes making the data too big or too small, it
becomes a problem. By normalizing the data in each mini-batch, this problem is largely avoided
• Batch Normalization normalizes each batch with reference to both its mean and variance
• It is just another layer that you can use to create your desired network architecture
• It is generally used between the linear and the non-linear layers in your network, because it normalizes the input to your activation function so that you are centered in its linear section
A normal Dense fully connected layer looks like this:

model.add(layers.Dense(64, activation='relu'))

To make it Batch Normalization enabled, we have to tell the Dense layer not to use bias, since it is not needed and this saves some calculation. Also, put the Activation layer after the BatchNormalization() layer:

model.add(layers.Dense(64, use_bias=False))
model.add(layers.BatchNormalization())
model.add(layers.Activation("relu"))
Let us see the 4-step workflow in developing neural networks with Keras
Here, we will try to build a Sequential Network of Dense Layers, and the dataset used is MNIST. MNIST is a classic dataset of
handwritten images, released in 1999, and has served as the basis for benchmarking classification algorithms
Here, we will be using Keras to create a simple neural network to predict, as accurately as we can, digits from handwritten images. In particular, we will be calling the Functional Model API of Keras and creating 4-layered and 5-layered neural networks.
Also, we will be experimenting with various optimizers: the plain vanilla Stochastic Gradient Descent optimizer and the Adam optimizer.
We will also introduce dropout, a regularization technique, in our neural networks to prevent overfitting
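A hedged, minimal version of the experiment described above (layer sizes, epochs, and the dropout rate are illustrative choices, not the course's exact script):

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784).astype('float32') / 255
x_test = x_test.reshape(10000, 784).astype('float32') / 255

model = Sequential([
    Dense(512, activation='relu', input_shape=(784,)),
    Dropout(0.2),                      # regularization against overfitting
    Dense(10, activation='softmax'),   # one probability per digit
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, batch_size=128)
print(model.evaluate(x_test, y_test))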
Topics: Disadvantages of Fully Connected Networks; What Is CNN?
▪ Depending on the resolution and size of the image, it will see a 32 x 32 x 3 array of numbers (3 refers to RGB values)
▪ Each of these numbers is given a value from 0 to 255 which describes the pixel intensity at that point
An image of size 200 × 200 × 3 would have 120,000 weights at the Hidden Layer; how will a fully connected network manage that?
Also, we would have several such layers of neurons, leading to even more parameters. Thus, this full connectivity would be wasteful, and the huge number of parameters would lead to overfitting!
CNNs are like neural networks (that we have already studied before) and are made up of neurons with learnable weights and biases.
Each neuron receives several inputs, takes a weighted sum of them, passes it through an activation function, and responds with an output
• The whole network has a loss function, and all the tips and tricks that we developed for neural networks still apply to CNNs
So, how are Convolutional Neural Networks different from regular Neural Networks?
▪ ConvNets pass many filters over a single image, each one picking up a different signal
▪ At a fairly early layer, you could imagine them as passing a horizontal line filter, a vertical line filter, and a diagonal line filter to create a map of the edges in the image
▪ CNNs take those filters, slices of the image's feature space, and map them one by one; i.e., they create a map of each place where the feature occurs
Convolutional Neural Networks process images using three main operations:
• Convolution
• ReLU
• Pooling
▪ A computer understands an image using numbers at each pixel as shown in the figure
▪ Here, blue pixels have −1 value, while the white pixels have a value of 1
If we just naively compare the pixel values of a normal image with another 'X' rendition, we will get a lot of mismatching pixels. This means plain pixel-by-pixel comparison is not an optimal way of image classification, since it requires exactly identical images
▪ Features are small arrays of values that match common pieces of most Xs
▪ These features will probably match the arms and the center of any image of an X
▪ When presented with a new image, the CNN doesn't know exactly where these features will match, so it tries them in every possible position
▪ In calculating the match of a feature across the whole image, they (ConvNet) act as filters
▪ The math used to perform this is called convolution, from which Convolutional Neural Networks get
their name
▪ Multiply the filter values with the corresponding pixels of the image patch, and the products will be stored in another buffer, the feature image
Convolutional Layer
▪ The final value obtained from the math that is performed in the last step is placed at the center of the filtered image as shown above
Now, move this filter around and do the same at any pixel in the image. Consider the below example:
▪ As you can see, here after performing all the steps, the value is 0.55!
▪ To complete the convolution, repeat this process, lining up the feature with every possible image patch
▪ Take the answer from each convolution and make a new 2-dimensional array from it, based on where in the image each patch is located
▪ It’s a map showing where in the image the feature can be found
▪ Values close to 1 show strong matches, values close to −1 show strong matches for the photographic negative of our feature, and values near 0 show no match of any sort
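A hedged NumPy sketch of the multiply-and-average matching step at a single position; the tiny image patch and diagonal filter are illustrative assumptions:

import numpy as np

patch = np.array([[ 1, -1, -1],
                  [-1,  1, -1],
                  [-1, -1,  1]], dtype=float)   # -1/1 pixels, as described above
feature = 2 * np.eye(3) - 1                      # diagonal-line feature

print((patch * feature).mean())   # 1.0: a perfect match at this position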
▪ This function activates a node only if the input is above a certain quantity.
▪ When the input is below zero, the output is zero, but when the input rises above a certain threshold, it has a linear relationship with the dependent variable
x  | f(x)       | F(x)
−2 | f(−2) = 0  | 0
−6 | f(−6) = 0  | 0
2  | f(2) = 2   | 2
6  | f(6) = 6   | 6
▪ We have considered a simple function with the values mentioned above: negative inputs produce 0, while positive inputs pass through unchanged
▪ Its math is also very simple—wherever a negative number occurs, swap it out for a 0
▪ This helps the CNN stay mathematically healthy by keeping learned values from getting stuck near 0 or blowing them up toward infinity
▪ Pooling is a way to take large images and shrink them down while preserving the most important information in them
▪ It consists of stepping a small window across an image and taking the maximum value from the window at each step
▪ In this case, we took the window size to be 2 and we got four values to choose from
▪ Among those four values, the maximum value is 1, so we pick 1. Also, note that we started with a 7×7 matrix, and after pooling the same matrix came down to 4×4
▪ The procedure is exactly the same as above, and we need to repeat it for the entire image
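A hedged NumPy sketch of 2×2 max pooling with stride 2 (the 4×4 input is an illustrative assumption):

import numpy as np

img = np.array([[1, 0, 2, 3],
                [4, 6, 6, 8],
                [3, 1, 1, 0],
                [1, 2, 2, 4]], dtype=float)

pooled = img.reshape(2, 2, 2, 2).max(axis=(1, 3))  # max of each 2x2 window
print(pooled)   # [[6. 8.] [3. 4.]]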
▪ The output will have the same number of images, but they will have fewer pixels
▪ You’ve probably noticed that the output of one layer is taken as the input to the other. Because of this, we can stack them like Lego bricks
▪ Raw images get filtered, rectified, and pooled to create a set of shrunken, feature-filtered images. These can be filtered and shrunken again and again
▪ Here, we take our filtered and shrunk images and put them into a single list (vector)
▪ Every value gets its own vote on whether the current image is
an X or an O
▪ Similarly, when we feed an image of an O, certain values will be higher than others
▪ Some values are much better than the others at knowing when the image is an X, and some are particularly good at knowing when the image is an O
▪ These get larger votes than the others. These votes are expressed as weights, or connection strengths, between each value and each category
▪ We’re done with training the network, and now we can begin to predict and check the working of the classifier
Fully Connected Layer
▪ To classify a new input, our network compares the obtained values with the vote lists for 'X' and 'O'
▪ We just added the values which we found to be high (1st, 4th, 5th, 10th, and 11th) from the vector table of X, and we got the sum as 5
▪ We did exactly the same thing with the input image and got a value of 4.56
▪ When we divide the values, we have a probability match of 0.91!
▪ Doing the same with the vector table of O, we get an output of 0.51
▪ Since 0.51 is less than 0.91, the input image is much more likely to be an X than an O
▪ In practice, several fully connected layers are often stacked together, with each intermediate layer voting on phantom ‘hidden’ categories
▪ In effect, each additional layer lets the network learn ever more sophisticated combinations of features that help it make better decisions
Object Recognition with Convolutional
Neural Networks in the Keras Deep
Learning Library
CIFAR-10 is an established computer-vision dataset used for object recognition. The CIFAR-10 data consists of 60,000 (32×32) color
images in 10 classes, with 6000 images per class. There are 50,000 training images and 10,000 test images in the official data. The
label classes in the dataset are:
airplane
automobile
bird
cat
deer
dog
frog
horse
ship
truck
Let's look into the full Python implementation of the object recognition task on the CIFAR-10 dataset
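As a hedged outline (not the course's exact script), a small Keras CNN for CIFAR-10 could look like this; the architecture and training settings are illustrative assumptions:

from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    MaxPooling2D((2, 2)),                 # convolution + pooling, twice
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),                            # into the fully connected layers
    Dense(64, activation='relu'),
    Dense(10, activation='softmax'),      # one output per CIFAR-10 class
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10, batch_size=64,
          validation_data=(x_test, y_test))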
In a simple MLP model with 8 neurons in the input layer, 5 neurons in the hidden layer, and 1 neuron in the output layer, what are the sizes of the weight matrices between the hidden and output layers and between the input and hidden layers?
A [1 × 5], [5 × 8]
B [8 × 5], [1 × 5]
C [8 × 5], [5 × 1]
D [5 × 1], [8 × 5]
Recurrent Neural Network
▪ RNNs are called recurrent because they perform the same task for every element of a sequence, with the output being dependent on the previous
computations
▪ Another way to think about RNNs is that they have a “memory” which captures information about what has been calculated so far
▪ An RNN typically predicts one output per time step; conceptually, the network is unrolled across time steps
▪ Then, it calculates the error at each time step and adds up all of the errors
▪ Following this, the network is rolled back up and the weights are updated
Suppose we try to predict the last word in this text:
"I've been staying in Spain for the last 10 years. I can speak fluent …"
• In this case, the network needs the context of 'Spain' to predict the last word, which is "Spanish"
• The gap between the word we want to predict and the relevant information is very large; this is known as a long-term dependency
▪ Now, if there is a really long dependency, there is a good probability that one of the gradients might approach zero, and this would lead to all the gradients vanishing: ∂E/∂W = 0
▪ Such states would no longer help the network learn anything. This is known as the vanishing gradient problem
Long Short-Term Memory (LSTM) networks are a special kind of RNN, explicitly designed to avoid the long-term dependency problem

Standard RNN
All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer

LSTM
LSTMs also have this chain-like structure, but the repeating module has a different structure: instead of having a single neural network layer, there are four, interacting in a very special way
The key to LSTMs is the cell state. The cell state is kind of like a conveyor belt: it runs straight down the entire chain, with only some minor linear interactions. It is very easy for information to just flow along it unchanged

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates

Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means "let nothing through," while a value of one means "let everything through!"
Step 1
The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This
decision is made by a sigmoid layer called the “forget gate layer”
Step 2
The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a
sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of
new candidate values, that could be added to the state
Step 3
Then, we have to update the old cell state, C(t−1), into the new cell state, C(t). We multiply the old state C(t−1) by f(t), forgetting the things we decided to forget earlier. Then we add i(t) * C̃(t), the new candidate values, scaled by how much we decided to update each state value

Step 4
Finally, we run a sigmoid layer which decides what parts of the cell state we are going to output. Then, we put the cell state through tanh and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to
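For reference, Steps 1–4 correspond to the standard LSTM gate equations (a summary added here, not taken from the slides):

f_t = σ(W_f · [h_(t−1), x_t] + b_f)       (forget gate: Step 1)
i_t = σ(W_i · [h_(t−1), x_t] + b_i)       (input gate: Step 2)
C̃_t = tanh(W_C · [h_(t−1), x_t] + b_C)   (candidate values: Step 2)
C_t = f_t * C_(t−1) + i_t * C̃_t          (cell-state update: Step 3)
o_t = σ(W_o · [h_(t−1), x_t] + b_o)       (output gate: Step 4)
h_t = o_t * tanh(C_t)                     (output: Step 4)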
Adding the LSTM layer with the output and input shape, normalizing the raw data, and fitting the model with the normalized values and the number of epochs set to 500 (the slides' code listing and normalization figure are not reproduced here; a hedged sketch follows)
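A hedged sketch of those steps, assuming Keras and scikit-learn; the toy series, window length, and layer size are illustrative assumptions:

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import LSTM, Dense

raw = np.arange(100, dtype=float).reshape(-1, 1)   # toy raw series
scaler = MinMaxScaler()
normalized = scaler.fit_transform(raw)              # normalize the raw data

# Windows of 5 past values predict the next value
X = np.array([normalized[i:i + 5] for i in range(94)])   # shape (94, 5, 1)
y = normalized[5:99]

model = Sequential([
    LSTM(32, input_shape=(5, 1)),   # LSTM layer with the input shape
    Dense(1),                       # one output value
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=500, verbose=0)   # 500 epochs, as on the slide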
Topics: Architecture of Autoencoders; Hyperparameters of Autoencoders; Collaborative Filtering Using RBMs
For Your Information!
Artificial Intelligence encompasses a wide range of technologies and techniques that enable computer systems to solve problems, such as data compression, which is used in computer vision, computer networks, computer architecture, and many other fields
An autoencoder is an Unsupervised Machine Learning algorithm that takes an image as input and tries to reconstruct it using fewer bits from the bottleneck, also known as the latent space
Autoencoders are data-specific, which means that they will only be able to compress data similar to what they have been trained on. An autoencoder trained on pictures of faces would do a rather poor job of compressing pictures of trees, because the features it would have learned would be face-specific
• Autoencoders are similar to dimensionality reduction techniques like Principal Component Analysis (PCA)
• PCA projects the data from a higher dimension to a lower dimension using a linear transformation
• Both techniques try to preserve the important features of the data while removing the non-essential parts
'The major difference between autoencoders and PCA lies in the transformation part: PCA uses linear transformations, whereas autoencoders use non-linear transformations'
Encoder
• This part of the network compresses or downsamples the input into fewer bits
• The space represented by these fewer bits is often called the latent space or the bottleneck
• The bottleneck is also called the 'maximum point of compression', since at this point the input is compressed the most
• These compressed bits that represent the original input are together called an encoding of the input

Code
• This part of the network represents the compressed input which is fed to the decoder
• The code decides which aspects of the observed data are relevant information and which aspects can be discarded

Decoder
• This part of the network tries to reconstruct the input using the encoded input
• When the decoder is able to reconstruct the input exactly as it was fed to the encoder, you can say that the encoder is able to produce the best encodings for the input
CODE SIZE
1 It represents the number of nodes in the middle layer. A smaller size results in more compression

NUMBER OF LAYERS
2 An autoencoder can consist of as many layers as we want
We will start with a single fully-connected neural layer as the encoder and as the decoder:
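The code itself is not reproduced in these slides; here is a hedged reconstruction (this walkthrough follows the well-known Keras autoencoder tutorial, where encoding_dim is 32, but that value is an assumption here):

from keras.layers import Input, Dense
from keras.models import Model

encoding_dim = 32                       # size of the code (bottleneck)
input_img = Input(shape=(784,))         # a flattened 28x28 MNIST image
encoded = Dense(encoding_dim, activation='relu')(input_img)   # encoder
decoded = Dense(784, activation='sigmoid')(encoded)           # decoder
autoencoder = Model(input_img, decoded)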
First, we'll configure our model to use a per-pixel binary cross-entropy loss and the Adadelta optimizer:
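A one-line sketch of that configuration:

autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')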
Let's prepare our input data. We're using MNIST digits, and we're discarding the labels (since we're only interested in
encoding/decoding the input images)
We will normalize all values between 0 and 1, and we will flatten 28x28 images into vectors of size 784
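A hedged sketch of this preparation step:

import numpy as np
from keras.datasets import mnist

(x_train, _), (x_test, _) = mnist.load_data()     # labels are discarded
x_train = x_train.astype('float32') / 255.        # normalize to [0, 1]
x_test = x_test.astype('float32') / 255.
x_train = x_train.reshape((len(x_train), 784))    # flatten 28x28 into 784
x_test = x_test.reshape((len(x_test), 784))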
After 50 epochs, the autoencoder seems to reach a stable train/test loss value of about 0.11. We can try to visualize the
reconstructed inputs and the encoded representations
Here's what we get as an output. The top row shows the original digits, and the bottom row shows the reconstructed digits. We are losing quite a bit of detail with this basic approach
▪ In the previous example, the representations were only constrained by the size of the hidden layer
▪ What typically happens in such situations is that the hidden layer is learning an approximation of PCA
▪ Another way to constrain the representations is to add a sparsity constraint on the activity of the hidden representations, using an activity regularizer:

from keras import regularizers

input_img = Input(shape=(784,))
# add a Dense layer with an L1 activity regularizer
encoded = Dense(encoding_dim, activation='relu',
                activity_regularizer=regularizers.l1(10e-5))(input_img)
▪ Train this model for 100 epochs (with the added regularization, the model is less likely to overfit and can be trained longer)
▪ The model ends with a train loss of 0.11 and a test loss of 0.10
▪ The difference between the two is mostly due to the regularization term being added to the loss during training (worth about 0.01)
They look pretty similar to the previous model's outputs, the only significant difference being the sparsity of the encoded representations: encoded_imgs.mean() yields a value of 3.33 (over our 10,000 test images), whereas with the previous model the same quantity was 7.30. So, our new model yields encoded representations that are about twice as sparse
Topics: Data Management Using TFLearn; Use Case 1
TFLearn is a modular and transparent Deep Learning library built on top of TensorFlow. It was designed to provide a high-level API to TensorFlow in order to facilitate and speed up experimentation, while remaining fully transparent and compatible with it.
Layers are a core feature of TFLearn. Here is a list of all currently available layers:
File      | Layers
core      | input_data, fully_connected, dropout, custom_layer, reshape, flatten, activation, single_unit, highway, one_hot_encoding, and time_distributed
conv      | conv_2d, conv_2d_transpose, max_pool_2d, avg_pool_2d, upsample_2d, conv_1d, max_pool_1d, avg_pool_1d, residual_block, residual_bottleneck, conv_3d, max_pool_3d, avg_pool_3d, highway_conv_1d, highway_conv_2d, global_avg_pool, and global_max_pool
recurrent | simple_rnn, lstm, gru, bidirectional_rnn, and dynamic_rnn
embedding | embedding
estimator | regression
Example:
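A hedged sketch of stacking the layers listed above (sizes are illustrative assumptions):

import tflearn

net = tflearn.input_data(shape=[None, 784])                # core: input_data
net = tflearn.fully_connected(net, 64, activation='relu')  # core: fully_connected
net = tflearn.dropout(net, 0.5)                            # core: dropout
net = tflearn.fully_connected(net, 10, activation='softmax')
net = tflearn.regression(net)                              # estimator: regression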
▪ Besides the layers concept, TFLearn also provides many different operations to be used when building a neural network
▪ These operations are mainly meant to be part of the above 'layers' argument, but they can also be used independently in any other TensorFlow graph for
convenience
▪ In practice, just providing the operation's name as an argument is enough (such as activation='relu' or regularizer='L2' for conv_2d), but a function can also be passed directly
File        | Operations
activations | linear, tanh, sigmoid, softmax, softplus, softsign, relu, relu6, leaky_relu, prelu, and elu
objectives  | softmax_categorical_crossentropy, categorical_crossentropy, binary_crossentropy, mean_square, hinge_loss, roc_auc_score, and weak_cross_entropy_2d
optimizers  | SGD, RMSProp, Adam, Momentum, AdaGrad, Ftrl, and AdaDelta
losses      | l1 and l2
Example:
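A hedged sketch of passing operation names as arguments (the network itself is an illustrative assumption):

import tflearn

net = tflearn.input_data(shape=[None, 32, 32, 3])
net = tflearn.conv_2d(net, 32, 3, activation='relu', regularizer='L2')
net = tflearn.fully_connected(net, 10, activation='softmax')
net = tflearn.regression(net, optimizer='adam', loss='categorical_crossentropy')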
▪ To save or restore a model, you can simply invoke the save or load method of the Deep Neural Network model class
# Save a model
model.save('my_model.tflearn')
# Load a model
model.load('my_model.tflearn')
▪ Retrieving layer variables can either be done using the layer name or, directly, by using 'W' or 'b' attributes that are supercharged to the layer's returned Tensor
▪ To get or set the value of variables, TFLearn models class implement get_weights and set_weights methods
Note: You can also directly use TensorFlow eval or assign ops to get or set the value of these variables
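A hedged sketch of both access styles (the tiny network is an illustrative assumption):

import tflearn

net = tflearn.input_data(shape=[None, 784])
fc1 = tflearn.fully_connected(net, 64, name='fc1')
model = tflearn.DNN(tflearn.regression(fc1))

w = model.get_weights(fc1.W)   # read via the layer's 'W' attribute
model.set_weights(fc1.W, w)    # write the values back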
Fine-tuning is a process that takes a network model already trained for a given task and makes it perform a second, similar task
• Assuming that the original task is similar to the new task, using a network that is already trained allows us to take advantage of the feature extraction that happens in the front layers of the network, without developing the feature-extraction part from scratch
• It replaces the output layer, originally trained to recognize (in the case of ImageNet models) 1,000 classes, with a layer that recognizes the number of classes the new task requires
• The new output layer that is attached to the model is then trained to take the lower-level features from the front of the network and map them to the desired output classes of the new task
• Once this has been done, other late layers in the model can be set as 'trainable=True' so that in further SGD epochs their weights can be fine-tuned for the new task
• So, when defining a model in TFLearn, you can specify which layer's weights you want to be restored or not (when loading the pre-trained model)
• This can be handled with the 'restore' argument of layer functions (only available for layers with weights)
• HDF5 is a data model, a library, and a file format for storing and managing data
• It supports an unlimited variety of data types and is designed for flexible and efficient I/O, as well as for high-volume and complex data
• All layers are built over 'variable_op_scope', which makes it easy to share variables among multiple layers and makes TFLearn suitable for distributed training
• All layers with inner variables support a 'scope' argument to place variables under; layers with the same scope name will then share the same weights
# Define a model builder
def my_model(x):
    x = tflearn.fully_connected(x, 32, scope='fc1')
    x = tflearn.fully_connected(x, 32, scope='fc2')
    x = tflearn.fully_connected(x, 2, scope='out')
    return x  # layers sharing a scope name will share the same weights
In this demo, we will learn to use TFLearn and TensorFlow to model the survival chance of Titanic passengers using their personal information
(such as gender, age, and so on). To tackle this classic Machine Learning task, we are going to build a Deep Neural Network classifier
• Let's take a look at the dataset (TFLearn will automatically download it for you)
• For each passenger, the following information is provided:
• There are two classes in our task: not survived (class = 0) and survived (class = 1), and the passenger data has eight features
• The Titanic dataset is stored in a CSV file, so we can use the TFLearn load_csv() function to load the data from the file into a Python list
• We specify the target_column argument to indicate that our labels (survived or not) are located in the first column (ID is 0)
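A hedged sketch of that loading step (the file name follows the TFLearn Titanic tutorial and is an assumption here):

from tflearn.data_utils import load_csv

data, labels = load_csv('titanic_dataset.csv', target_column=0,
                        categorical_labels=True, n_classes=2)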