
Understand LSTM Neural Network Model from Scratch

Have you ever seen the movies Memento or Ghajini? Our basic RNNs are like
the heroes in those movies, who suffer from short-term memory problems.
LSTM is a special kind of RNN (Recurrent Neural Network) that solves this
short-term memory issue. In this lecture, we will understand Long Short
Term Memory (LSTM) from scratch and calculate it with real numbers.

Applications of LSTM models

Nowadays, LSTMs are used to solve various tasks such as language translation,
sentiment analysis, text summarization, and time-series forecasting.

For example, Google Translate uses an LSTM-based neural machine translation
architecture to translate text from one language to another.

LSTM models are also used to improve the accuracy of speech recognition systems.
For example, Apple’s Siri and Google Assistant use LSTM-based models to
understand and respond to users’ voice commands.

LSTM models have also become popular for time-series forecasting. You can use
LSTMs to forecast stock prices, weather, energy demand, and other
time-series data. For example, the Australian Energy Market Operator (AEMO)
uses LSTM-based models to forecast electricity demand.
Why is LSTM better than RNN?
RNNs are interesting because they can use past knowledge to help with
current tasks. For example, if you want to predict the next word in a
sentence, RNNs can use the previous words to do it better. But RNN has some
limitations, let’s explore those.

In theory, RNNs (Recurrent Neural Networks) can remember things from a
long time ago. However, when it comes to practical usage, they fail to do this
very well.

RNNs (Recurrent Neural Networks) perform well for short sentences. But for
long sentences, RNNs fail to perform well because of the vanishing gradient
problem: they cannot remember things for a long time.
For example, suppose we want to predict the final word in the sentence “the stars are
in the ____“. In this case, the gap between the context word (“stars“) and where
we need it (“sky“) is small.

In this kind of case, RNN will work properly.

Now let’s take another example. This time our goal is to predict the final word in this
sentence: “Rome is a modern and cosmopolitan city that attracts millions of
visitors. It is known for its delicious thin-crust _____“
In this case, the gap between the context word (“Rome“) and where we need it
(“pizza“) is much longer.

As the gap increases, RNNs lose their ability to connect the dots and learn from
the earlier context. Though in theory an RNN can handle such a long dependency,
in practice it struggles to learn and perform effectively in this situation because
of the Vanishing Gradient problem.

LSTMs are designed specifically to tackle this long-term dependency
problem. Unlike plain RNNs, they can remember information for extended
periods of time far more reliably.
During the training process using gradient-based optimization algorithms (like
backpropagation through time), the gradients used to update the network's weights may
become extremely small (vanish) as they are backpropagated through many time steps.

This vanishing gradient issue arises due to the nature of the RNN architecture, particularly
in long sequences. When computing gradients during backpropagation, the chain rule is
applied successively through time steps. In RNNs, this involves multiplying many partial
derivatives together. If these derivatives are less than 1 (in the case of activation functions
like sigmoid or tanh), they can compound across multiple time steps and eventually
become exceedingly small. As a result, the gradients approaching earlier time steps
become close to zero, causing the network to have difficulty learning long-term
dependencies.
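
To make this concrete, here is a tiny Python/NumPy experiment (my own illustration, not part of the lecture): it repeatedly multiplies a recurrent weight by tanh derivatives, each of magnitude at most 1, and watches the gradient shrink toward zero.

import numpy as np

np.random.seed(0)
T = 50                  # number of time steps to backpropagate through
w = 0.5                 # a recurrent weight, assumed to be smaller than 1 in magnitude
grad = 1.0              # gradient flowing in from the last time step

for t in range(T):
    h = np.random.uniform(-2, 2)           # a pre-activation value at this time step
    local_derivative = 1 - np.tanh(h)**2   # derivative of tanh, always <= 1
    grad *= w * local_derivative           # chain rule across one time step

print(f"gradient after {T} steps: {grad:.2e}")   # something like 1e-20: effectively zero

With a recurrent weight larger than 1, the same loop can blow up instead, which is the related exploding gradient problem.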

Types of LSTM model


There are several types of LSTM models that differ in terms of their
architecture and the way they handle input and output.
Vanilla LSTM

This is the most basic type of LSTM model that includes a memory cell, input
gate, forget gate, and output gate. The memory cell stores information about
the sequence and the gates control the flow of information in and out of the
cell.

In this lecture, I am going to explain this type of LSTM neural network model.

Bidirectional LSTM

In this type of LSTM model, the input sequence is processed in both forward
and backward directions using two separate LSTM layers. This allows the
model to learn from both past and future contexts and is useful for tasks such
as speech recognition and language translation.

Stacked LSTM

This type of LSTM model includes multiple LSTM layers, where the output of
one layer serves as the input for the next layer. Stacked LSTMs are useful for
learning hierarchical representations of sequential data.

Convolutional LSTM

In this type of LSTM model, the input sequence is processed using
convolutional layers before being fed into the LSTM layers. This is useful for
tasks such as video analysis and motion prediction.

Attention LSTM

This LSTM model has something special called attention mechanism. It helps
the model to concentrate on important parts of the input sequence based on
what is needed for the task. This feature is very helpful for tasks like
understanding people’s feelings (sentiment analysis) and describing pictures in
words (image captioning).
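
For orientation only, here is how these variants are commonly expressed in a high-level framework. This Keras code is my own illustration with made-up sizes; it is not part of the from-scratch derivation that follows.

import tensorflow as tf
from tensorflow.keras import layers, models

vocab_size, embed_dim, hidden = 5000, 64, 128   # made-up sizes

# Vanilla LSTM: a single recurrent layer
vanilla = models.Sequential([
    layers.Embedding(vocab_size, embed_dim),
    layers.LSTM(hidden),
    layers.Dense(vocab_size, activation="softmax"),
])

# Stacked LSTM: the first layer must return the full sequence for the next layer
stacked = models.Sequential([
    layers.Embedding(vocab_size, embed_dim),
    layers.LSTM(hidden, return_sequences=True),
    layers.LSTM(hidden),
    layers.Dense(vocab_size, activation="softmax"),
])

# Bidirectional LSTM: processes the sequence forwards and backwards
bidirectional = models.Sequential([
    layers.Embedding(vocab_size, embed_dim),
    layers.Bidirectional(layers.LSTM(hidden)),
    layers.Dense(vocab_size, activation="softmax"),
])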
LSTM Network Architecture
LSTMs or Long Short Term Memory networks are a type of Recurrent Neural
Network or RNN that can learn and remember long-term
dependencies. Hochreiter & Schmidhuber (1997) first introduced this deep learning
algorithm, which gained popularity over time.

LSTMs are created to tackle the long-term dependency problem. Unlike plain RNNs,
they can easily remember information for long periods of time.

In a standard Recurrent Neural Network, there is a repeating module of a neural network.
But this module is quite simple, such as a single tanh layer.

Recurrent Neural Network (here, H is the hidden layer)

LSTMs share the same chain-like structure as other Recurrent Neural
Networks, but their repeating module has a unique structure. Rather than
having just one neural network layer, LSTMs have four, and they interact in a
highly specialized way.
Long Short Term Memory (LSTM)
These four layers are the forget gate, the input gate, the cell (candidate) gate, and the output gate.
Together, these gates control the memory of the LSTM model.

4 Gates of LSTM neural network model


Gates of LSTM

Now let me explain all the different gates or layers of the LSTM neural
network model and their uses.

1. Forget Gate
Forget Gate
The forget gate is a neural network layer that decides what information to keep
or discard from the cell state. It takes the previous hidden state (Yt-1) and the
current input (Xt) as input. Then a sigmoid function generates an output value
between 0 and 1. A value of 0 means forget the previous state, and a value
of 1 means keep it.

Forget Gate Equation

ft = sigmoid(Wf[xt, Yt-1] + bf)

If we use different sizes of weights, then this equation will look like below:

ft = sigmoid(Wf * xt + Uf * Yt-1 + bf)

In this equation,

 Wf is a weight matrix that maps the input sequence xt to the forget gate values
 Uf is a weight matrix that maps the previous hidden state Yt-1 to the forget gate
values.
 bf is a bias vector that is added to the weighted sum of Wf * xt and Uf * Yt-1
Let’s take an example of a conversation between two friends, Ria and Priya.
Ria said, “I am going to the mall to buy some clothes.” Priya replied, “That’s
great, I am working on my project at ____”

In this conversation, the forget gate helps us forget the previous context of
Ria’s sentence and focus only on Priya’s response. The value of ‘ft‘ can be
anywhere between 0 and 1, depending on how much information we want to
remember. If we need to forget completely, the value will be close to 0, and if the
complete information is needed, the value will be close to 1.
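
As a minimal sketch of this gate in NumPy, assuming a hidden size of 3, a vocabulary size of 5, and random weights of my own choosing (the same shapes the worked example uses later):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
W_f = rng.normal(size=(3, 5))        # maps the input x_t to forget-gate values
U_f = rng.normal(size=(3, 3))        # maps the previous hidden state Y_{t-1}
b_f = np.zeros(3)                    # bias vector

x_t    = np.array([1, 0, 0, 0, 0])   # one-hot word vector
Y_prev = np.zeros(3)                 # previous hidden state

f_t = sigmoid(W_f @ x_t + U_f @ Y_prev + b_f)   # each entry is between 0 and 1
print(f_t)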

2. Input Gate

Input Gate
The input gate decides what new information we can add to the current cell
state. First, it takes the current input Xt and the previous hidden state Yt-1 and
passes them into a second sigmoid function. The sigmoid function squashes
the values to between 0 and 1.

After that, the same hidden state and current state info go through the tanh
function. This helps regulate the network and creates a candidate cell vector
(gt) with all possible values between -1 and 1.

Finally, point-wise multiplication happens between it and gt. This output stores
information about the importance of each value (whether the new
information should be stored or not).

Here sigmoid function decides which values we need to update and tanh
function helps to determine how important the value is.

Input Gate Equation

it = sigmoid(Wi[xt, Yt-1] + bi)

gt = tanh(Wg[xt, Yt-1] + bg)

Similarly, for different sizes of weight matrices, we can write,

it = sigmoid(Wi*xt + Ui*Yt-1 + bi)

gt = tanh(Wg*xt + Ug*Yt-1 + bg)

Here,

 Wi is the weight matrix that maps the input sequence xt to the input gate values
 Ui is the weight matrix that maps the previous hidden state Yt-1 to the input gate
values
 bi is a bias vector that is added to the weighted sum of Wi * xt and Ui * Yt-1
 Wg is a weight matrix that maps the input sequence xt to the candidate values
 Ug is a weight matrix that maps the previous hidden state Yt-1 to the
candidate values
 bg is a bias vector that is added to the weighted sum of Wg * xt and Ug * Yt-1

During training, the values of Wi, Ui, bi, Wg, Ug, and bg are learned through
backpropagation and gradient descent, so that the model can learn to
selectively add new information to the cell state based on the input sequence
and the previous hidden state.

For example consider the sentence: “I love to eat pizza, but I am allergic to
____“

To predict the next word, an LSTM needs to selectively forget and remember
information from the previous words. The input gate is responsible for deciding
what new information to let into the cell state.

In this example, the input gate would determine how much weight to give to the
word “pizza” and how much weight to give to the word that follows “allergic to“.

If the next word is “tomatoes” for instance, the input gate would need to
remember information about “tomatoes” without overwriting the information
about “pizza” that was previously stored in the cell state.
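
A similar NumPy sketch for the input gate and the candidate vector, again with random weights of my own (hidden size 3, vocabulary size 5):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W_i, U_i, b_i = rng.normal(size=(3, 5)), rng.normal(size=(3, 3)), np.zeros(3)
W_g, U_g, b_g = rng.normal(size=(3, 5)), rng.normal(size=(3, 3)), np.zeros(3)

x_t    = np.array([0, 1, 0, 0, 0])   # one-hot vector, e.g. the word "cat"
Y_prev = np.zeros(3)                 # previous hidden state

i_t = sigmoid(W_i @ x_t + U_i @ Y_prev + b_i)   # which entries to update, between 0 and 1
g_t = np.tanh(W_g @ x_t + U_g @ Y_prev + b_g)   # candidate values, between -1 and 1
print(i_t * g_t)   # element-wise product: the new information to be added to the cell state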

3. Cell State
Cell Gate
After calculating the forget gate and the input gate, we need to update the cell state of
the LSTM neural network model. The cell state acts like a conveyor belt running
through the network. It is crucial because it allows information to flow along the
network with only minor, gate-controlled changes.

Cell State is also known as the memory of LSTM.

This is where the decisions made by the gates actually take effect. First, the
previous cell state Ct-1 is multiplied point-wise by the forget vector ft. If a resulting
value is close to 0, it means that the corresponding information is less important
and should be dropped from the cell state.

After that, the output of the input gate (it * gt) is added point-wise, which
updates the cell state and creates the new cell state (Ct).

Cell Gate Equation


Ct = ft * Ct-1 + it * gt

LSTMs have three gates, which are used to control and protect the cell state.
The last one is the output gate.
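
With toy numbers of my own (not the lecture’s), the cell state update is just two element-wise operations:

import numpy as np

f_t    = np.array([0.9, 0.1, 0.5])    # forget-gate output
C_prev = np.array([1.0, -2.0, 0.3])   # previous cell state
i_t    = np.array([0.2, 0.8, 0.5])    # input-gate output
g_t    = np.array([0.5, 0.5, -0.5])   # candidate values

C_t = f_t * C_prev + i_t * g_t        # keep part of the old memory, add the new information
print(C_t)                            # [1.0, 0.2, -0.1]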

4. Output Gate

The output gate is the third and final gate in the LSTM architecture. It
generates the output (Yt) of the hidden layer.

First, we pass the current input Xt and the previous hidden state Yt-1 into a third
sigmoid function to produce ot. Then we apply the tanh function to the
new cell state Ct.

Finally, we apply point-wise multiplication to those two values to produce the final
output Yt

Output Gate Equation


ot = sigmoid(Wo * xt + Uo * Yt-1 + bo)

Yt = ot * tanh(ct)
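
Putting the four equations together, here is a minimal NumPy sketch of one full LSTM time step. The function name lstm_step and the parameter dictionary p are my own illustrative choices; the math simply follows the gate equations above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, Y_prev, C_prev, p):
    # One LSTM time step: p holds the weight matrices and biases of all four gates.
    f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ Y_prev + p["bf"])   # forget gate
    i_t = sigmoid(p["Wi"] @ x_t + p["Ui"] @ Y_prev + p["bi"])   # input gate
    g_t = np.tanh(p["Wg"] @ x_t + p["Ug"] @ Y_prev + p["bg"])   # candidate values
    o_t = sigmoid(p["Wo"] @ x_t + p["Uo"] @ Y_prev + p["bo"])   # output gate
    C_t = f_t * C_prev + i_t * g_t                              # new cell state
    Y_t = o_t * np.tanh(C_t)                                    # new hidden state
    return Y_t, C_t

Called once per word, this single function reproduces the structure of every per-gate calculation in the worked example that follows.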

Forward Propagation of LSTM from Scratch


Now let’s calculate forward propagation of LSTM to clear up our understanding.
Let’s walk through a single iteration of an LSTM for next word prediction using
the example sentence “The cat sat on the ____“.

Input Encoding

Like other NLP models, we first encode the input sequence “The cat sat on
the” as a sequence of word vectors. You can use any technique such as one-
hot encoding or word embeddings.

For simplicity, let’s assume we are using one-hot encoding, and we have a
vocabulary size of 5 words “the”, “cat”, “sat”, “on”, and “dog”. The one-hot
encoded input sequence would look like below:

[1 0 0 0 0] # “the”

[0 1 0 0 0] # “cat”

[0 0 1 0 0] # “sat”

[0 0 0 1 0] # “on”

In any kind of encoding technique, the vocabulary should contain only unique words.
This is why we use the encoding vector of the word “the” only once, even though it
appears twice in the sentence.
We’ll assume that our LSTM has a hidden state size of 3, so our weight matrices
and biases have the following shapes:

 W_f, U_f, b_f: (3, 5), (3, 3), (3,)
 W_i, U_i, b_i: (3, 5), (3, 3), (3,)
 W_o, U_o, b_o: (3, 5), (3, 3), (3,)
 W_g, U_g, b_g: (3, 5), (3, 3), (3,)
 W_p, b_p: (5, 3), (5,)
We’ll also assume that our initial hidden state Y0 and cell state C0 are
both zero vectors of size 3.
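
To make these shapes concrete, here is a small NumPy sketch (my own setup, with fresh random values rather than the lecture’s exact numbers) that allocates the weights listed above and the zero initial states. The dictionary name params is my own choice.

import numpy as np

rng = np.random.default_rng(1)
hidden, vocab = 3, 5

params = {}
for gate in ("f", "i", "o", "g"):
    params["W" + gate] = rng.uniform(-0.4, 0.4, size=(hidden, vocab))   # (3, 5)
    params["U" + gate] = rng.uniform(-0.4, 0.4, size=(hidden, hidden))  # (3, 3)
    params["b" + gate] = rng.uniform(-0.4, 0.4, size=(hidden,))         # (3,)
params["Wp"] = rng.uniform(-0.4, 0.4, size=(vocab, hidden))             # (5, 3) prediction layer
params["bp"] = rng.uniform(-0.4, 0.4, size=(vocab,))                    # (5,)

Y0 = np.zeros(hidden)   # initial hidden state
C0 = np.zeros(hidden)   # initial cell state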

Time step = 1

At the first time step t=1, we feed the input word vector of “the” [1 0 0 0 0] into
the LSTM, along with the previous hidden state Y0 and cell state C0:

Forget Gate Calculation

We have randomly initialized the weight Wf as a 3×5 matrix, the bias bf as a
3×1 vector, and Uf as a 3×3 matrix. We initialize the first hidden state Y0 as a zero
vector of size 1×3.

So now we know the equation of the forget gate,

ft = sigmoid(Wf * xt + Uf * Yt-1 + bf)

for time step =1 the equation would be:

f1 = sigmoid(Wf * xthe + Uf * Y0 + bf)

After putting all values:


f_1 = sigmoid([[-0.1, 0.2, -0.3, 0.1, -0.2], [-0.3, 0.1, 0.2, 0.1, 0.3],
[0.2, 0.1, -0.2, -0.3, -0.1]] @ [1 0 0 0 0] + [[0.2, -0.1, 0.1], [-0.2, -
0.1, -0.2], [0.1, -0.3, 0.2]] @ [0 0 0] + [0.1, -0.2, 0.3])

= sigmoid([-0.1, 0.3, -0.1])

= [0.475, 0.574, 0.475]

Input Gate Calculation

Equation of input gate:

it = sigmoid(Wi*xt + Ui*Yt-1 + bi)

gt = tanh(Wg*xt + Ug*Yt-1 + bg)

Similarly, let’s define our weight matrices.

Weight matrices for it

Weight matrices for gt

Let’s first calculate it


it = sigmoid(Wi*xt + Ui*Yt-1 + bi)

for time step =1 we can write the equation like below:

i1 = sigmoid(Wi*xthe + Ui*Y0 + bi)

After putting all values:

i_1 = sigmoid([[-0.4, 0.2, 0.1, -0.2, 0.3], [-0.3, 0.2, 0.1, -0.3, 0.2],
[0.1, -0.3, -0.1, 0.3, 0.2]] @ [1 0 0 0 0] + [[0.2, -0.1, 0.1], [-0.2, -
0.1, -0.2], [0.1, -0.3, 0.2]] @ [0 0 0] + [-0.1, -0.2, 0.3])

= sigmoid([-0.1, 0.3, 0.3])

= [0.475, 0.574, 0.475]

Now let’s calculate the value of gt:

gt = tanh(Wg*xt + Ug*Yt-1 + bg)

for time step =1 the equation will look like below:

g1 = tanh(Wg*xthe + Ug*Y0 + bg)

After putting all values:

g_1 = tanh([[0.1, 0.3, 0.2, 0.4, -0.2], [0.2, 0.4, -0.1, -0.2, 0.1], [-
0.2, 0.1, -0.4, -0.1, 0.3]] @ [1 0 0 0 0] + [[-0.3, 0.2, -0.1], [0.2, -
0.2, 0.1], [-0.1, -0.1, -0.3]] @ [0 0 0] + [0.1, -0.1, 0.2])

= tanh([0.11, 0.26, -0.01])

= [0.110, 0.254, -0.010]

Cell Gate Calculation

As we know formula for cell gate is:

Ct = ft * Ct-1 + it * gt
for time step =1 we can write like below:

C1 = f1 * C0 + i1 * g1

In the previous steps, we have already calculated the values of f1, i1, and g1,
and we keep the initial hidden state Y0 and cell state C0 as zero vectors of size 3.

So putting values in the equation will look like below:

c_1 = f_1 * c_0 + i_1 * g_1

= [0.475, 0.574, 0.475] * [0, 0, 0] + [0.475, 0.574, 0.475] * [0.110, 0.254, -0.010]

= [0.052, 0.146, -0.005]

Output Gate Calculation

Finally, we can calculate the values of the output gate. In this gate, we need to
calculate two values:

ot = sigmoid(Wo * xt + Uo * Yt-1 + bo)

Yt = ot * tanh(ct)

Before jumping into the calculation, let’s define our weight matrices in a similar
manner to the other gates.

Output gate weight matrices


Let’s first calculate ot. For time step =1 we can write equation of ot like below:

o1 = sigmoid(Wo * xthe + Uo * Y0 + bo)

Now let’s put all the values to the equation:

o_1 = sigmoid([[0.3, 0.1, -0.2, 0.1, -0.4], [0.2, 0.1, -0.3, 0.2, -0.1],
[0.1, -0.1, 0.2, 0.1, 0.3]] @ [1 0 0 0 0] + [[0.1, 0.1, -0.1], [0.2, -0.3,
-0.2], [0.1, -0.1, 0.2]] @ [0 0 0] + [0.2, 0.3, 0.1])

= sigmoid([0.1, 0.5, 0.3])

= [0.524, 0.622, 0.574]

Now we can finally calculate the hidden layer output (for time step = 1) of the LSTM
neural network model. The equation for this output is:

Yt = ot * tanh(ct)

For time =1 we can write like this:

Y1 = o1 * tanh(c1)

We already know the values of o1 and c1. Let’s put those values into this
equation and produce the output.

Y1 = o1 * tanh(c1)

= [0.524, 0.622, 0.574] * [0.110, 0.254, -0.005]

= [0.058, 0.157, -0.003]

Time step = 2

At the second time step t=2, we feed the input vector for the word “cat” [0 1 0
0 0] into the LSTM neural network model, along with the previous hidden state
Y1 and cell state C1.
Note: the weight and bias matrices stay the same for every time step within one
iteration.

f_2 = sigmoid([[-0.1, 0.2, -0.3, 0.1, -0.2], [-0.3, 0.1, 0.2, 0.1, 0.3],
[0.2, 0.1, -0.2, -0.3, -0.1]] @ [0 1 0 0 0] + [[0.2, -0.1, 0.1], [-0.2, -
0.1, -0.2], [0.1, -0.3, 0.2]] @ [0.058, 0.157, -0.003] + [0.1, -0.2, 0.3])

= sigmoid([-0.165, -0.147, 0.137])

= [0.459, 0.463, 0.534]

i_2 = sigmoid([[-0.2, 0.3, -0.1, -0.3, 0.2], [-0.1, -0.2, -0.2, 0.2, 0.1],
[0.1, 0.3, -0.2, -0.1, -0.1]] @ [0 1 0 0 0] + [[-0.2, 0.1, 0.1], [0.1, -
0.2, -0.2], [-0.3, -0.1, 0.1]] @ [0.058, 0.157, -0.003] + [-0.2, 0.1,
0.1])

= sigmoid([-0.368, 0.088, 0.079])

= [0.409, 0.522, 0.519]

g_2 = tanh([[0.1, 0.3, 0.2, 0.4, -0.2], [0.2, 0.4, -0.1, -0.2, 0.1], [-
0.2, 0.1, -0.4, -0.1, 0.3]] @ [0 1 0 0 0] + [[-0.3, 0.2, -0.1], [0.2, -
0.2, 0.1], [-0.1, -0.1, -0.3]] @ [0.058, 0.157, -0.003] + [0.1, -0.1,
0.2])

= tanh([0.253, 0.288, -0.016])

= [0.246, 0.272, -0.016]

c_2 = f_2 * c_1 + i_2 * g_2

= [0.459, 0.463, 0.534] * [0.052, 0.146, -0.005] + [0.409, 0.522, 0.519] * [0.246, 0.272, -0.016]

= [0.222, 0.284, -0.009]

o_2 = sigmoid([[0.3, 0.1, -0.2, 0.1, -0.4], [0.2, 0.1, -0.3, 0.2, -0.1],
[0.1, -0.1, 0.2, 0.1, 0.3]] @ [0 1 0 0 0] + [[0.1, 0.1, -0.1], [0.2, -0.3,
-0.2], [0.1, -0.1, 0.2]] @ [0.058, 0.157, -0.003] + [0.2, 0.3, 0.1])
= sigmoid([0.224, 0.287, 0.206])

= [0.556, 0.570, 0.551]

Y_2 = o_2 * tanh(c_2)

= [0.556, 0.570, 0.551] * tanh([0.222, 0.284, -0.009])

= [0.159, 0.163, -0.144]

Till time step = 5

This continues until the last input word vector, which is for the word “the”. In our example,
the input words are “the”, “cat”, “sat”, “on”, and “the”.

So the total number of time steps is 5:

 “the” => time step = 1 input: xthe, y0, c0 , (calculated values: f1, i1, g1, c1, o1, and
y1)
 “cat” => time step = 2 input: xcat, y1, c1 , (calculated values: f2, i2, g2, c2, o2, and
y2)
 “sat” => time step = 3 input: xsat, y2, c2 , (calculated values: f3, i3, g3, c3, o3, and
y3)
 “on” => time step = 4 input: xon, y3, c3 , (calculated values: f4, i4, g4, c4, o4, and y4)
 “the” => time step = 5 input: xthe, y4, c4 , (calculated values: f5, i5, g5, c5, o5, and
y5)
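
As a sketch, the whole sequence can be pushed through the cell with a simple loop. This reuses the illustrative lstm_step() function and the params dictionary from the earlier sketches (both my own names, not from the lecture) and applies the same weights at every time step.

import numpy as np

one_hot = {
    "the": np.array([1, 0, 0, 0, 0]),
    "cat": np.array([0, 1, 0, 0, 0]),
    "sat": np.array([0, 0, 1, 0, 0]),
    "on":  np.array([0, 0, 0, 1, 0]),
}

Y, C = np.zeros(3), np.zeros(3)                    # Y0 and C0
for word in ["the", "cat", "sat", "on", "the"]:    # five time steps
    Y, C = lstm_step(one_hot[word], Y, C, params)  # same weights at every step

Y5, C5 = Y, C   # final hidden state and cell state after time step 5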

Next word Prediction

We can now use the output hidden state Y5 (final output) to predict the next word
in the sequence. This is typically done by applying a softmax activation function
to the output of a linear layer that takes Y5 as input.

p_5 = softmax(Wp * Y5 + bp)

Now we need a weight matrix Wp with random values of shape
(hidden_size × vocabulary_size).

For example, suppose we want to predict the next word after time step = 2.

For this, we have Y2 = [0.159, 0.163, -0.144]. The length of the Y2 vector is 3
and our vocabulary size is 5.

So the shape of Wp will be 3×5 (the transpose of the (5, 3) listed earlier, because
here we multiply Y2 from the left). And the shape of bp will be 5×1.

So now we have:

Y_2 = [0.159, 0.163, -0.144]

W_p = [[-0.007, 0.366, -0.155, -0.223, -0.110],

[ 0.126, -0.009, -0.130, 0.246, 0.155],

[-0.057, 0.170, -0.070, -0.212, -0.091]]

b_p = [0.083, -0.139, 0.017, -0.022, 0.061]

Note: Weight and bias matrix values are random values.

So now, let’s calculate the next word prediction probability after time step = 2.

p_2 = softmax(Wp * Y2 + bp)

= softmax([0.159, 0.163, -0.144] @ [[-0.007, 0.366, -0.155, -0.223, -0.110], [ 0.126, -0.009, -0.130, 0.246, 0.155], [-0.057, 0.170, -0.070, -0.212, -0.091]] + [0.083, -0.139, 0.017, -0.022, 0.061])

= softmax([-0.008, 0.173, -0.167, -0.290, 0.037])

= [0.238, 0.305, 0.163, 0.116, 0.177]

We can see that the second element (0.305) of our softmax output vector has the
highest probability. So the predicted next word will be the second word in our
vocabulary (vocabulary[2], using 1-based indexing).

Our vocabulary is: ["the", "cat", "sat", "on", "dog"]. So the next
predicted word will be vocabulary[2] = “cat”.
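
For completeness, here is a small NumPy sketch of this prediction step. The hidden state is the Y2 from the text, but Wp and bp here are freshly random values of my own, so the printed probabilities will differ from the numbers above; the point is only the mechanics of softmax followed by argmax.

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

vocabulary = ["the", "cat", "sat", "on", "dog"]

Y = np.array([0.159, 0.163, -0.144])          # hidden state from time step 2
rng = np.random.default_rng(7)
Wp = rng.uniform(-0.4, 0.4, size=(5, 3))      # prediction weights, shape (vocab, hidden)
bp = rng.uniform(-0.4, 0.4, size=(5,))

p = softmax(Wp @ Y + bp)                      # one probability per vocabulary word
print(vocabulary[int(np.argmax(p))])          # the word with the highest probability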
At this point, with only one iteration and random weights, you cannot expect a proper
prediction. I just wanted to walk you through the entire path of a long short term
memory model for a single forward propagation pass.

Limitations of LSTM model


While LSTMs are a powerful tool for modeling sequential data, they do have
some limitations. Here are some of the limitations of LSTMs:

Computationally expensive

For LSTMs to work well, they require a lot of data and computer resources,
especially for large-scale applications. They are also computationally
expensive to train.

Limited interpretability

Because LSTMs are black boxes, it is challenging to understand how and why
particular decisions are made by the model.

Overfitting

When using smaller datasets, LSTMs might be vulnerable to overfitting. This
means that the model may perform well on the training data but not on unseen
data.

Training time

Due to their complex architecture, LSTMs may require more time to train than
other models. This is because it needs to calculate all four gates (or states) for
each input sequence.
