
Lecture 10

Long Short-Term Memory (LSTM)



Recap: RNN
Chatbot:

Pretty popular nowadays.

Let’s say the chatbot can classify intentions from the user’s input text.

First, encode the sequence of text using an RNN.

Then, feed the RNN output into a feed-forward NN, which will classify the intent.



Example: Chatbot

A user types in… what time is it?

To start, we break up the sentence into individual words.

RNNs work sequentially, so we feed it one word at a time.



Example: Chatbot

The first step is to feed “What” into the RNN.

The RNN encodes “What” and produces an output.



Example: Chatbot

For the next step, we feed the word “time” and the hidden state from the previous
step.

The RNN now has information on both the words “What” and “time.”



Example: Chatbot

Repeat this process until the final step.

By the final step, the RNN has encoded information from all the words in the previous
steps.



Example: Chatbot

Since the final output was created from the rest of the sequence,
we can take that output and pass it to the feed-forward layer to
classify the intent.
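The pipeline just described (an RNN encoder followed by a feed-forward classifier) can be sketched in a few lines of Keras. This is an illustrative sketch only, not code from the lecture; the vocabulary size, sequence length, and number of intents are made-up values.

```python
# Minimal sketch of the chatbot intent classifier described above (illustrative only).
import numpy as np
from tensorflow.keras import layers, models

vocab_size, max_len, num_intents = 1000, 6, 5        # hypothetical sizes

model = models.Sequential([
    layers.Embedding(vocab_size, 32),                # word ids -> dense vectors
    layers.SimpleRNN(64),                            # encodes the sequence one word at a time
    layers.Dense(num_intents, activation="softmax")  # feed-forward layer classifies the intent
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# "what time is it" as (made-up) word ids, padded to max_len
x = np.array([[12, 45, 7, 3, 0, 0]])
print(model.predict(x).shape)                        # (1, num_intents)
```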



Long Short-Term Memory (LSTM)



Take About 30 Seconds To Stare At The Picture



Now ……..
Ø Close your eyes.

Ø Try to recall the items you saw in the picture.

Ø How many were you able to recall?

If you were able to recall all the items, then you have a pretty good working memory.



Working Memory
Ø Our brains store information in working memory and forget it after some time.

Ø It is very unlikely that you will remember all these items till tomorrow.

Ø Working memory is also sometimes called short-term memory.

Ø We use this working memory in a lot of daily tasks.



Working Memory in Humans

Ø Consider this sentence — The bus, which went to Paris, was full.
Ø To construct this sentence correctly, we need to remember that the subject (the
bus) is singular.
Ø Therefore, towards the end of the sentence, we have to use was.

Ø The same sentence with a plural subject becomes — The buses, which went to
Paris, were full.

Ø So in many cases we need to remember the context presented at the beginning of
the sentence to complete it correctly.



Working Memory in Machines
Ø Similar things happen when machines are constructing sentences.

Ø We saw how RNNs are good at sequence tasks like language modelling.

The output 𝑦̂⟨3⟩ is made not just from this time step’s input 𝑥⟨3⟩, but also from the
information received from the previous inputs 𝑥⟨1⟩ and 𝑥⟨2⟩.



RNN in Handling Short Term Memory
Ø RNNs suffer from short-term memory.

Ø They can remember context well in the short term.


Ø There are clouds in the sky.

Ø We just need the context clouds in this sentence to complete it.

Ø And the word clouds is only a few steps away from sky.



RNN in Handling Short Term Memory
Ø If a sequence is long enough, RNNs will have a hard time carrying information from
earlier time steps to later ones.

Ø The bus, which went to Paris, was full.


Ø The context is farther away.

Ø When processing a paragraph of text to make predictions, RNNs may leave out important
information from the beginning.



Vanishing Gradient
Ø During backpropagation, RNNs suffer from the
vanishing gradient problem.

Ø Short-term memory and the vanishing gradient are due to
the nature of the backpropagation algorithm used to train
and optimize NNs.

Ø The odd distribution of colours in the hidden states
illustrates the short-term memory issue with RNNs.

Ø As the RNN processes more steps, it has trouble
retaining information from previous steps.

Ø The information from the words “what” and “time” is
almost non-existent at the final time step.



Vanishing Gradient
Ø Training an NN has three major steps.

1. It does a forward pass and makes a prediction.

2. It compares the prediction to the ground truth using a
loss function. The loss function outputs an error value,
which is an estimate of how poorly the network is
performing.

3. It uses that error value to do backpropagation, which
calculates the gradients for each node in the network.



Vanishing Gradient
Ø The gradient is the value used to adjust the network’s
internal weights, allowing the network to learn.

Ø The bigger the gradient, the bigger the adjustments, and
vice versa.

Ø When doing backpropagation, each node in a layer
calculates its gradient with respect to the effects of the
gradients in the layer before it.

Ø So if the adjustments to the layers before it are small, then
adjustments to the current layer will be even smaller.



Vanishing Gradient
Ø The vanishing gradient problem is when the gradient
shrinks as it back-propagates through time.

Ø If a gradient value becomes extremely small, it doesn’t
contribute much to learning.

Ø In RNNs, layers that get a small gradient update stop
learning.

Ø Those are usually the earlier layers.

Ø Because these layers don’t learn, RNNs can forget what
they have seen in longer sequences, thus having a short-term
memory.
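A tiny numeric illustration (not from the slides) of why this happens: backpropagation through time multiplies many local gradient factors together, so if each factor is below 1 the contribution of early time steps shrinks towards zero.

```python
# Assume each time step contributes a local gradient factor of 0.5 (a made-up value).
import numpy as np

factors = np.full(20, 0.5)   # 20 time steps back in time
print(np.prod(factors))      # ~9.5e-07: the signal from the earliest step has all but vanished
```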



Vanishing Gradient
Ø Because of vanishing gradients, the RNN doesn’t learn the
long-range dependencies across time steps.

Ø That means there is a possibility that the words “what”
and “time” are not considered when trying to predict the
user’s intention.

Ø The network then has to make its best guess with “is it?”

Ø That’s pretty ambiguous and would be difficult even for a
human.

Ø So not being able to learn on earlier time steps causes the
network to have a short-term memory.
Long Short-Term Memory (LSTM)
Ø LSTMs are a special kind of RNN, capable of
learning long-term dependencies.

Ø An LSTM has a similar control flow to a
recurrent neural network.

Ø The differences are the operations within the
LSTM’s cells.

Ø These operations allow the LSTM
to keep or forget information.



RNN vs LSTM
Ø An RNN unit takes the current input (Xt) as well as
the previous state (At-1) to produce the output (Ht)
and the current state (At).

Ø However, LSTMs have different components (4
layers inside) compared to the single tanh
(activation) layer in the RNN.

Ø LSTMs also have sigmoid activation functions in
addition to the tanh activation function.



Recall: tanh and sigmoid Activation Functions

tanh squishes values to be between -1 and 1; sigmoid squishes values to be between 0 and 1.
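A quick numeric check of the two activations, for reference:

```python
import numpy as np

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(np.tanh(x))                  # squashed into (-1, 1)
print(1.0 / (1.0 + np.exp(-x)))    # sigmoid: squashed into (0, 1)
```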



LSTM: Core Concept
Ø There are 4 layers inside an LSTM block, which interact with each other.

Ø The core concepts of LSTMs are the cell state and its various gates:

1. Forget gate

2. Input gate

3. Output gate



Cell State
Ø The key to the operation of the LSTM is the top
horizontal line running from left to right (the cell state).

Ø It acts as the “memory” of the network.

Ø It runs straight down the entire chain, with only
some minor linear interactions.

Ø It’s very easy for information to just flow along it
unchanged, reducing the effects of short-term
memory.



Cell State
Ø As the cell state goes on its journey, information
gets added to or removed from the cell state via gates.

Ø Gates are different NNs (sigmoid layers) that decide
which information is allowed on the cell state.

Ø The LSTM has three of these gates to control the cell state:

1. Forget gate
2. Input gate
3. Output gate

Ø The gates can learn what information is relevant to
keep or forget during training.
Forget Gate
Ø The first step is to decide what information will be
thrown away from the cell state.

Ø The forget gate (the first sigmoid layer) decides what
information should be thrown away or kept.

Ø Information from the previous hidden state (ht−1)
and the current input (xt) is passed through the
sigmoid function.

Ø Values come out between 0 and 1.

Ø The closer to 0 means forget, and the
closer to 1 means keep.
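In the standard LSTM formulation this is ft = sigmoid(Wf · [ht−1, xt] + bf). A minimal numpy sketch (random weights and inputs, purely illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden, features = 3, 2
rng = np.random.default_rng(0)
W_f = rng.normal(size=(hidden, hidden + features))   # forget-gate weights (random here)
b_f = np.zeros(hidden)

h_prev = rng.normal(size=hidden)                     # previous hidden state h_{t-1}
x_t = rng.normal(size=features)                      # current input x_t

f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
print(f_t)                                           # values in (0, 1): ~0 forget, ~1 keep
```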



Forget Gate
Ø Bob called Carla to ask her out.
Ø The cell state might include the gender of the present
subject (Bob → his).

Ø When a new subject arrives, we need to use the
correct pronouns later.
Ø The pronoun her is based on the subject Carla
and not Bob.

Ø When it encounters the new subject Carla, the machine
wants to forget the context Bob while making its
predictions.
Input Gate
Ø The next step is to determine what we are going to store
in the cell state.

Ø To update the cell state, we have the input gate.

Ø This has two parts:

Ø First, a sigmoid layer (the input gate layer) decides which
values will be updated.
Ø Next, a tanh layer creates a vector of new candidate
values (C̃t) that could be added to the state.

Ø In the next step, these two are combined to create an
update to the state.
Input Gate
Ø First, pass the previous hidden state and current
input into a sigmoid function to decide which values
will be updated by transforming the values to be
between 0 and 1.
Ø 0 means not important, and 1 means important.
Ø Also pass the hidden state and current input into the
tanh function to squish values between -1 and 1 to
help regulate the network.
Ø Then multiply the tanh output with the sigmoid output.
Ø The sigmoid output will decide which information is
important to keep from the tanh output.
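In the standard formulation, it = sigmoid(Wi · [ht−1, xt] + bi) and C̃t = tanh(Wc · [ht−1, xt] + bc). A minimal numpy sketch (random weights and inputs, purely illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden, features = 3, 2
rng = np.random.default_rng(0)
W_i = rng.normal(size=(hidden, hidden + features))   # input-gate weights
W_c = rng.normal(size=(hidden, hidden + features))   # candidate-value weights
b_i = b_c = np.zeros(hidden)

z = np.concatenate([rng.normal(size=hidden), rng.normal(size=features)])  # [h_{t-1}, x_t]
i_t = sigmoid(W_i @ z + b_i)       # in (0, 1): 0 = not important, 1 = important
C_cand = np.tanh(W_c @ z + b_c)    # candidate values in (-1, 1)
print(i_t * C_cand)                # the filtered candidate added to the cell state later
```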



Input Gate
Ø Bob called Carla to ask her out.
Ø We want to add the gender of the new subject
(Carla) to the cell state, to replace the old one
(Bob) we’re forgetting.



Cell State Again!!!
Ø It’s now time to update the old cell state (Ct−1)
into the new cell state (Ct).

Ø The previous steps already decided what to do;
we just need to actually do it.

Ø Multiply the old state by ft, forgetting the things
we decided to forget earlier.

Ø Then, add it ∗ C̃t -- these are the new candidate
values, scaled by how much we decided to
update each state value.
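Numerically, the update is Ct = ft ∗ Ct−1 + it ∗ C̃t. A tiny sketch with made-up vectors (in a real LSTM these come from the gates above):

```python
import numpy as np

C_prev = np.array([0.8, -0.3, 0.5])    # old cell state C_{t-1}
f_t    = np.array([0.1, 0.9, 0.5])     # forget gate output: ~0 drops, ~1 keeps
i_t    = np.array([0.9, 0.2, 0.5])     # input gate output
C_cand = np.array([0.7, -0.6, 0.1])    # candidate values from the tanh layer

C_t = f_t * C_prev + i_t * C_cand
print(C_t)                             # first entry mostly replaced, second mostly kept
```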



Scaled by tanh

vector transformations without tanh

A tanh function ensures that the values stay between -1 and 1, thus regulating the
output of the NN.

vector transformations with tanh
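A small numeric illustration (made-up transformation, not from the slides) of the point above: without tanh the values blow up as the vector is transformed repeatedly, while tanh keeps every entry inside (-1, 1).

```python
import numpy as np

v = np.array([0.5, -0.2, 0.8])
w = v.copy()
for _ in range(5):
    v = v * 3.0             # stand-in for repeated transformations without tanh
    w = np.tanh(w * 3.0)    # the same transformation followed by tanh

print(v)                    # entries have exploded (scaled by 3**5 = 243)
print(w)                    # entries stay bounded in (-1, 1)
```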


Cell State Again!!!
Ø Bob called Carla to ask her out.
Ø This is where we will actually drop the information
about the old subject’s (Bob) gender and add the
new information (Carla), as we decided in the
previous steps.



Output Gate
Ø Finally, we decide what we are going to output (the
next hidden state).

Ø This output will be based on the cell state, but will
be a filtered version.

Ø Remember that the hidden state contains
information on previous inputs.

Ø The hidden state is also used for predictions.



Output Gate
Ø First, we pass the previous hidden state and the
current input into a sigmoid layer which decides
what parts of the cell state we’re going to output.

Ø Then, we pass the newly modified cell state
through tanh (to push the values to be between
−1 and 1), and multiply it by the output of the
sigmoid gate, so that we only output the parts we
decided to.

Ø The output is the new hidden state.

Ø The new cell state and the new hidden state are then
carried over to the next time step.
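Putting the three gates together, one full time step of the LSTM cell can be sketched in numpy using the standard formulation (random weights, purely illustrative; not code from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate
    C_cand = np.tanh(W["c"] @ z + b["c"])    # candidate values C~_t
    C_t = f_t * C_prev + i_t * C_cand        # new cell state
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    h_t = o_t * np.tanh(C_t)                 # new hidden state
    return h_t, C_t

hidden, features = 3, 2
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(hidden, hidden + features)) for k in "fico"}
b = {k: np.zeros(hidden) for k in "fico"}

h, C = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(4, features)):   # a toy sequence of 4 time steps
    h, C = lstm_step(x_t, h, C, W, b)
print(h, C)                                  # carried over to the next time step
```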



Output Gate
Ø Bob called Carla to ask her out.
Ø Since it just saw a subject (Carla), it might want to
output information relevant to a verb, in case
that’s what is coming next.

Ø For example, it might output whether the subject


is singular or plural, so that we know what form a
verb should be conjugated into if that’s what
follows next.



Input and Output Shape in LSTM (Keras)



Input Shape
Ø You always have to give a three-dimensional array as an input to your LSTM network.

Ø (samples/batch_size, time steps, features/units)



Input Shape
Ø (samples/batch_size, time steps, features/units)

Ø One complete sequence is one sample.


Ø A sample may refer to individual training examples.
Ø A batch is comprised of one or more samples.
Ø A “batch_size” variable indicates how many
different samples (examples) you feed at once to
the neural network.



Input Shape
Ø (samples/batch_size, time steps, features/units)




Input Shape
Ø (samples/batch_size, time steps, features/units)
Ø The number of time steps we feed to the model.
Ø It is the unit of time/data the network uses.
Ø This value is determined by how many past data points
the LSTM should look at.
Ø The length of the input sequence.
Ø For example, if you want to give the LSTM a sentence
as an input, your time steps could be either the
number of words or the number of characters,
depending on what you want.



Input Shape
Ø (samples/batch_size, time steps, features/units)




Input Shape
Ø (samples/batch_size, time steps, features/units)

Ø The number of dimensions we feed at each time step.

Ø One feature is one observation at a time step.
Ø For example:
Ø In NLP, a word could be represented by 300
features using word2vec.
Ø In the case of signal processing, let’s pretend that
the signal is 3D. That is, you have an X, a Y and a Z
signal. This means you would have 3 features sent
at each time step for each sample.
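A concrete (made-up) example of this layout: 5 recordings of a 3-axis (X, Y, Z) signal, each 8 time steps long.

```python
import numpy as np

data = np.random.rand(5, 8, 3)   # (samples/batch_size, time_steps, features)
print(data.shape)                # (5, 8, 3)
print(data[0].shape)             # one sample: (8, 3) -- 8 time steps, 3 features each
print(data[0, 0].shape)          # the 3 features (X, Y, Z) at the first time step
```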



Input Shape
Ø (samples/batch_size, time steps, features/units)




Input Shape (samples/batch_size, time steps, features/units)

Ø Though input_shape looks like a 2D array, actually a 3D array is being passed,
with a shape of (batch_size, 2, 10).
Ø This means the value of time steps is 2 and the number of input units is 10.
Ø We have the flexibility to feed any batch size at the time of fitting the data to the
network.
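The layer definition the slide refers to presumably looks something like the sketch below (Keras 2.x / tf.keras style): input_shape=(2, 10) means 2 time steps and 10 features, with the batch dimension left out.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.LSTM(3, input_shape=(2, 10)),   # expects input of shape (batch_size, 2, 10)
])
print(model.input_shape)                   # (None, 2, 10): batch size not fixed yet
```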



Input Shape (samples/batch_size, time steps, features/units)

Ø We can also give an argument called batch_input_shape instead of input_shape.

Ø Difference: the batch size is fixed at 8, so the input array shape will look like (8,
2, 10).

Ø Feeding a batch size other than 8 will give an error.
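The same sketch with batch_input_shape (Keras 2.x style), fixing the batch size to 8:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.LSTM(3, batch_input_shape=(8, 2, 10)),   # only accepts batches of exactly 8 samples
])
print(model.input_shape)                            # (8, 2, 10)
```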



Input Shape: Don’t Get Confused

Ø Units in LSTM.
Ø The dimension of the hidden state (or the output).
Ø Here, the hidden state (the red circles) has length 2.
Ø The number of units is the number of neurons
connected to the layer holding the concatenated
vector of hidden state and input (the layer holding both
red and green circles below).
Ø Here, there are 2 neurons connected to that layer.



LSTM: Output and Its Shape

Ø units = 3. So output shape is (None, 3).


Ø The first dimension of output is None, because we do not know the batch size in
advance.
Ø Here, the actual output shape will be (batch_size, 3).
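A sketch reproducing this case: units = 3 and the batch size is not fixed, so the output shape is (None, 3).

```python
from tensorflow.keras import layers, models

model = models.Sequential([layers.LSTM(3, input_shape=(2, 10))])
print(model.output_shape)   # (None, 3)
```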



LSTM: Output and Its Shape

Ø Here, batch_size is defined in advance.


Ø The output shape becomes (8, 3).



LSTM: Output and Its Shape

Ø return_sequences tells whether to return the output at each time step instead of only the final time
step.
Ø Here, with return_sequences set to True, the output shape becomes a 3D array instead of a 2D array.
Ø Now the shape of the output is (8, 2, 3).
Ø The extra dimension in between represents the number of time steps.
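A sketch reproducing this case: with return_sequences=True the output keeps the time dimension, giving (8, 2, 3).

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.LSTM(3, batch_input_shape=(8, 2, 10), return_sequences=True),
])
print(model.output_shape)   # (8, 2, 3)
```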



Summary: Output and Its Shape
Ø The input of the LSTM is always a 3D array: (batch_size, time_steps, features/units).

Ø The output of the LSTM could be a 2D array or a 3D array, depending upon
the return_sequences argument.

Ø If return_sequences is False, the output is a 2D array (batch_size, units).

Ø If return_sequences is True, the output is a 3D array (batch_size, time_steps,
units).



LSTM: 1D, 2D, 3D Arrays
Ø The input to every LSTM network layer must be 3D.
Ø Let’s consider a sequence of multiple time steps with 1 feature: a sequence of 10 values,
each value representing the close price of some stock (therefore 1 feature).
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Ø This sequence can be defined as a numpy array.

Ø The reshape() function in numpy can be used to reshape the above 1 dimensional array
into a 3 dimensional array, with 1 sample, 10 time steps, and 1 feature.

Ø This function takes one argument: a tuple that defines the new shape of the array.
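Written out in numpy, matching the description above:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])   # 10 close prices, 1D
data = data.reshape((1, 10, 1))                     # 1 sample, 10 time steps, 1 feature
print(data.shape)                                   # (1, 10, 1)
```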



LSTM: 1D, 2D, 3D Arrays



LSTM: 1D, 2D, 3D Arrays
Ø Let’s consider data that has 2 features.
Ø For example, data that includes close prices and open prices from chart data.
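A sketch of the 2-feature case: close and open prices stacked column-wise, then reshaped into 1 sample, 10 time steps, 2 features. The price values are made up.

```python
import numpy as np

close = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
open_ = np.array([1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1, 9.1, 10.1])

data = np.column_stack((close, open_)).reshape((1, 10, 2))
print(data.shape)   # (1, 10, 2)
```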

