
RNN & LSTM

VAMSI KRISHNA
B19ME023
What is RNN?
• RNNs are a class of neural networks that deal with sequential data.
• An RNN is recurrent in nature: it performs the same function for every input in the sequence, while the output for the current input depends on the previous computation.
• For making a decision, it considers the current input together with the output it has learned from the previous inputs.
• RNNs can use their internal state (memory) to process sequences of inputs.
• Applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition.
• Other neural networks: all the inputs are independent of each other.
In an RNN: all the inputs are related to each other.
How does an RNN work?
• First, it takes X0 from the sequence of inputs and outputs h0, which together with X1 is the input for the next step. Similarly, h1 together with X2 is the input for the step after that, and so on. This way, it keeps remembering the context while training.

• The formula for the current hidden state is: ht = f(ht-1, xt)


Activation Function and Output
• On applying the activation function (tanh):

ht = tanh(Whh·ht-1 + Wxh·xt)

Where,

ht → current hidden state

ht-1 → previous hidden state

Whh → weight matrix applied to the previous hidden state

Wxh → weight matrix applied to the current input

• tanh is the activation function; it implements a non-linearity that squashes the activations to the range [-1, 1].

• Output function: yt = Why·ht
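
Putting the state update and the output function together, a minimal numpy sketch of one RNN forward step (the sizes, random weight initialisation, and variable names are illustrative assumptions, not taken from the slides):

    import numpy as np

    # illustrative sizes and random weights (assumed for the sketch)
    input_size, hidden_size, output_size = 4, 3, 2
    rng = np.random.default_rng(0)
    Wxh = rng.normal(size=(hidden_size, input_size))    # weights applied to the current input
    Whh = rng.normal(size=(hidden_size, hidden_size))   # weights applied to the previous hidden state
    Why = rng.normal(size=(output_size, hidden_size))   # weights producing the output from the hidden state

    def rnn_step(x_t, h_prev):
        # ht = tanh(Whh·ht-1 + Wxh·xt), yt = Why·ht
        h_t = np.tanh(Whh @ h_prev + Wxh @ x_t)
        y_t = Why @ h_t
        return h_t, y_t

    # run the same cell over a short input sequence, carrying the hidden state forward
    h = np.zeros(hidden_size)
    for x in rng.normal(size=(5, input_size)):
        h, y = rnn_step(x, h)
    print(h, y)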


Backpropagation
• It is a way of propagating the total loss back into the neural network, to find out how much of the loss each node is responsible for.

• In an RNN, all the hidden layers share the same weights, so the total loss is found by summing up the loss at each time step.

• Total error/loss:

L = ∑t Lt = L1 + L2 + … + Lt
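
As a tiny illustration of this summation, assuming a squared-error loss per time step (the predictions, targets, and loss choice are made up for the example):

    import numpy as np

    # assumed per-step outputs and targets, just to illustrate the summation
    y_pred = np.array([[0.2, 0.8], [0.6, 0.4], [0.9, 0.1]])   # outputs y1, y2, y3
    y_true = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])   # targets for each time step

    per_step_loss = ((y_pred - y_true) ** 2).sum(axis=1)  # Lt for t = 1, 2, 3
    total_loss = per_step_loss.sum()                      # L = L1 + L2 + L3
    print(per_step_loss, total_loss)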
Gradient Descent
• The gradient of the total loss with respect to the shared weights can be calculated as the summation of the gradients at each time step:

∂L/∂W = ∑t ∂Lt/∂W

• Using the chain rule of calculus, and the fact that the output at a time step t is a function of the current hidden state of the recurrent unit, the following expression arises for the recurrent weight Whh:

∂Lt/∂Whh = ∑k=1..t (∂Lt/∂yt) · (∂yt/∂ht) · (∂ht/∂hk) · (∂hk/∂Whh)

• We finally get a long product of factors connecting distant time steps:

∂ht/∂hk = ∏j=k+1..t (∂hj/∂hj-1)

It is this product that gives rise to the exploding and vanishing gradient problems described next.
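
As a small illustration of that last product, here is a numpy sketch for the tanh RNN defined earlier, where each factor is ∂hj/∂hj-1 = diag(1 − hj²) · Whh; the sizes, weights, and inputs are assumptions made up for the example:

    import numpy as np

    hidden_size, input_size, T = 3, 4, 20
    rng = np.random.default_rng(0)
    Whh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
    Wxh = rng.normal(size=(hidden_size, input_size))
    xs = rng.normal(size=(T, input_size))

    # forward pass, storing every hidden state
    hs = [np.zeros(hidden_size)]
    for x in xs:
        hs.append(np.tanh(Whh @ hs[-1] + Wxh @ x))

    # dht/dhk = product over j of dhj/dhj-1, where dhj/dhj-1 = diag(1 - hj^2) @ Whh
    jac = np.eye(hidden_size)
    for j in range(2, T + 1):          # k = 1, with t growing from 2 to T
        jac = np.diag(1 - hs[j] ** 2) @ Whh @ jac
        print(j, np.linalg.norm(jac))  # the norm typically shrinks (or blows up) as the gap grows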
Exploding and Vanishing Gradients
• Although the basic Recurrent Neural Network is fairly effective, it can suffer from a significant problem. For deep networks, the back-propagation process can lead to the following issues:

• Exploding gradients: this occurs when the gradients become too large during back-propagation, so the network gives unreasonably high importance to some weights. The problem can be addressed by truncating (clipping) the gradients.

• Vanishing gradients: this occurs when the gradients become very small and tend towards zero. When the gradient tends towards zero the network stops learning, or takes a very long time to train for deeper networks. This limits the RNN to capturing only short-range dependencies.
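
A minimal sketch of gradient clipping by norm, one common way the "truncating the gradients" fix is implemented (the threshold and the example gradient values are assumptions):

    import numpy as np

    def clip_by_norm(grad, max_norm=5.0):
        # rescale the gradient so that its norm never exceeds max_norm
        norm = np.linalg.norm(grad)
        if norm > max_norm:
            grad = grad * (max_norm / norm)
        return grad

    g = np.array([120.0, -300.0, 45.0])  # an "exploded" gradient (made-up numbers)
    print(clip_by_norm(g), np.linalg.norm(clip_by_norm(g)))  # rescaled so the norm equals 5.0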
The problem of long-term Dependencies
• “The clouds are in the sky.” “I grew up in France… I speak fluent French.”

• In the first sentence the relevant context (“clouds”) sits right next to the word being predicted (“sky”), but in the second the clue (“France”) appears far before the word that depends on it (“French”), so the network has to carry information across a large gap.

• LSTMs are an upgraded version of RNNs which work well even with very deep networks, and also work for events with large time gaps.
LSTM
• Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN,
capable of learning long-term dependencies.

• LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information
for long periods of time is practically their default behavior, not something they struggle to learn.

• All standard RNNs have the form of a chain of repeating modules of neural network. Each repeating module has a very simple structure, such as a single tanh layer.

• In an LSTM, the repeating module has a more complicated structure, which makes it capable of keeping a long memory.
Structure of LSTM
• Instead of a single neural network layer, the repeating module in an LSTM contains four neural network layers.

• A typical LSTM network is composed of different memory blocks called cells.

• The key to LSTMs is the cell state, the horizontal line running through each cell.

• Two states are passed from one cell to the next: the hidden state and the cell state.
Gates:
• LSTMs have the ability to add or remove information to the cell state, carefully regulated by structures called gates.

• LSTM has three of such gates:

1. Forget Gate

2. Input Gate

3. Output Gate
1) Forget Gate
• This gate decides how much of the past information the network should remember.

• It looks at the previous hidden state (ht-1) and the current input xt. The sigmoid function outputs a vector with values ranging from 0 to 1, one value for each number in the cell state.

• 0 → completely remove the element, 1 → keep the element entirely.

Ex: “Ravi and Ajay play football. Ajay is good at coding.”

ht-1 → “Ravi and Ajay play football.”  xt → “Ajay is good at coding.”

Here, the forget gate compares the previous hidden state with the current input. Since the current input only talks about Ajay, we no longer need the earlier information about Ravi, so Ravi is omitted from memory by the forget gate.
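
A minimal numpy sketch of the forget gate computation; the weight names (Wf, bf), the sizes, and the concatenation of ht-1 with xt follow the usual LSTM convention and are assumptions for the example:

    import numpy as np

    hidden_size, input_size = 3, 4
    rng = np.random.default_rng(0)
    Wf = rng.normal(size=(hidden_size, hidden_size + input_size))  # forget gate weights (assumed name)
    bf = np.zeros(hidden_size)                                     # forget gate bias

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    h_prev = rng.normal(size=hidden_size)   # previous hidden state ht-1
    c_prev = rng.normal(size=hidden_size)   # previous cell state
    x_t = rng.normal(size=input_size)       # current input xt

    # f_t has one value in (0, 1) per element of the cell state:
    # close to 0 -> forget that element, close to 1 -> keep it entirely
    f_t = sigmoid(Wf @ np.concatenate([h_prev, x_t]) + bf)
    c_after_forget = f_t * c_prev
    print(f_t, c_after_forget)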
2) Input Gate
• This gate decides how much of the new information should be added to the current cell state.

• The sigmoid function decides which values to let through (0 to 1), and the tanh function gives weightage to the values that are passed, deciding their level of importance (ranging from -1 to 1).

Ex: “Ajay is good at coding. Yesterday I came to know that he is the university topper.”
• The input gate picks out the important information.

• “Ajay is good at coding” and “he is the university topper” → important.

• “Yesterday I came to know that” is not important, and hence is forgotten.
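
Continuing the sketch, the input gate and the resulting cell state update (the weight names Wi, Wc and the shapes are assumed; the forget gate f_t is recomputed here so the snippet runs on its own):

    import numpy as np

    hidden_size, input_size = 3, 4
    rng = np.random.default_rng(0)
    Wf = rng.normal(size=(hidden_size, hidden_size + input_size))  # forget gate weights (previous slide)
    Wi = rng.normal(size=(hidden_size, hidden_size + input_size))  # input gate weights (assumed name)
    Wc = rng.normal(size=(hidden_size, hidden_size + input_size))  # candidate value weights (assumed name)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    h_prev = rng.normal(size=hidden_size)
    c_prev = rng.normal(size=hidden_size)
    x_t = rng.normal(size=input_size)
    z = np.concatenate([h_prev, x_t])

    f_t = sigmoid(Wf @ z)               # forget gate, as in the previous sketch
    i_t = sigmoid(Wi @ z)               # sigmoid: which values to let through (0 to 1)
    c_tilde = np.tanh(Wc @ z)           # tanh: weightage of the candidate values (-1 to 1)
    c_t = f_t * c_prev + i_t * c_tilde  # new cell state: kept old information plus the new information
    print(c_t)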


3) Output Gate
• This gate decides which part of the current cell information makes it to the output.

• The sigmoid function decides which values to let through (0 to 1), and the tanh function gives weightage to the values that are passed, deciding their level of importance (ranging from -1 to 1); this is multiplied with the output of the sigmoid.

• Ex: “Ajay is good at coding, he is the university topper. So the merit student _______________ was awarded the University Gold Medal.”

• There could be a lot of choices for the blank; the output gate fills it in with “Ajay”.
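
Finally, a minimal numpy sketch of the output gate and the new hidden state it produces (the weight name Wo is assumed; c_t stands for the updated cell state from the input gate sketch):

    import numpy as np

    hidden_size, input_size = 3, 4
    rng = np.random.default_rng(0)
    Wo = rng.normal(size=(hidden_size, hidden_size + input_size))  # output gate weights (assumed name)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    h_prev = rng.normal(size=hidden_size)
    x_t = rng.normal(size=input_size)
    c_t = rng.normal(size=hidden_size)   # updated cell state, as produced in the input gate sketch
    z = np.concatenate([h_prev, x_t])

    o_t = sigmoid(Wo @ z)     # sigmoid: which parts of the cell state make it to the output (0 to 1)
    h_t = o_t * np.tanh(c_t)  # tanh of the cell state, multiplied with the sigmoid output
    print(h_t)                # h_t and c_t are both passed on to the next cell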
