
Lecture 10

Long Short-Term Memory (LSTM)



Recap: RNN
Chatbot:

Pretty popular nowadays.

Let’s say the chatbot can classify intentions from the user’s input text.

First, encode the sequence of text using an RNN.

Then, feed the RNN output into a feed-forward NN, which will classify the intent.



Example: Chatbot

A user types in… what time is it?

To start, we break up the sentence into individual words.

RNNs work sequentially, so we feed it one word at a time.



Example: Chatbot

The first step is to feed “What” into the RNN.

The RNN encodes “What” and produces an output.



Example: Chatbot

For the next step, we feed the word “time” and the hidden state from the previous
step.

The RNN now has information on both the words “What” and “time.”



Example: Chatbot

Repeat this process until the final step.

By the final step, the RNN has encoded information from all the words in the previous
steps.



Example: Chatbot

Since the final output was created from the rest of the sequence,
we can take that output and pass it to the feed-forward layer to
classify the intent.
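The pipeline just described (an RNN encoder followed by a feed-forward classifier) can be sketched in a few lines of Keras. This is an illustrative sketch only, not code from the lecture; the vocabulary size, sequence length, and number of intents are made-up values.

```python
# Minimal sketch of the chatbot intent classifier described above (illustrative only).
import numpy as np
from tensorflow.keras import layers, models

vocab_size, max_len, num_intents = 1000, 6, 5        # hypothetical sizes

model = models.Sequential([
    layers.Embedding(vocab_size, 32),                # word ids -> dense vectors
    layers.SimpleRNN(64),                            # encodes the sequence one word at a time
    layers.Dense(num_intents, activation="softmax")  # feed-forward layer classifies the intent
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# "what time is it" as (made-up) word ids, padded to max_len
x = np.array([[12, 45, 7, 3, 0, 0]])
print(model.predict(x).shape)                        # (1, num_intents)
```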



Long Short-Term Memory (LSTM)



Take About 30 Seconds To Stare At The Picture



Now ……..
Ø Close your eyes.

Ø Try to recall the items you saw in the picture.

Ø How many were you able to recall?

If you were able to recall all the items, then you have a pretty good working memory.



Working Memory
Ø Our brains store information in working memory and forget it after some time.

Ø It is very unlikely that you will remember all these items till tomorrow.

Ø Working memory is also sometimes called short-term memory.

Ø We use this working memory in a lot of daily tasks.



Working Memory in Humans

Ø Consider this sentence — The bus, which went to Paris, was full.
Ø To construct this sentence correctly, we need to remember that the subject (the
bus) is singular.
Ø Therefore, towards the end of the sentence, we have to use was.

Ø The same sentence with a plural subject becomes — The buses, which went to
Paris, were full.

Ø So in many cases we need to remember the context presented at the beginning of
the sentence to complete it correctly.



Working Memory in Machines
Ø Similar things happen when machines are constructing sentences.

Ø We saw how RNNs are good at sequence tasks like language modelling.

The output 𝑦̂⟨3⟩ is made not just from this time step’s input 𝑥⟨3⟩, but also from the
information received from the previous inputs 𝑥⟨1⟩ and 𝑥⟨2⟩.



RNN in Handling Short Term Memory
Ø RNNs suffer from short-term memory.

Ø They can remember context well in the short term.


Ø There are clouds in the sky.

Ø We just need the context clouds in this sentence to complete it.

Ø And the word clouds is only a few steps away from sky.



RNN in Handling Short Term Memory
Ø If a sequence is long enough, RNNs will have a hard time carrying information from
earlier time steps to later ones.

Ø The bus, which went to Paris, was full.


Ø The context is farther away.

Ø When processing a paragraph of text to make predictions, RNNs may leave out important
information from the beginning.



Vanishing Gradient
Ø During backpropagation, RNNs suffer from the
vanishing gradient problem.

Ø Short-term memory and the vanishing gradient are due to
the nature of the backpropagation algorithm used to train
and optimize NNs.

Ø The odd distribution of colours in the hidden states
illustrates the short-term memory issue with RNNs.

Ø As the RNN processes more steps, it has trouble
retaining information from previous steps.

Ø The information from the words “what” and “time” is
almost non-existent at the final time step.



Vanishing Gradient
Ø Training an NN has three major steps.

1. It does a forward pass and makes a prediction.

2. It compares the prediction to the ground truth using a
loss function. The loss function outputs an error value,
which is an estimate of how poorly the network is
performing.

3. It uses that error value to do backpropagation, which
calculates the gradients for each node in the network.



Vanishing Gradient
Ø The gradient is the value used to adjust the network’s
internal weights, allowing the network to learn.

Ø The bigger the gradient, the bigger the adjustments, and
vice versa.

Ø When doing backpropagation, each node in a layer
calculates its gradient with respect to the effects of the
gradients in the layer before it.

Ø So if the adjustments to the layers before it are small, then
adjustments to the current layer will be even smaller.



Vanishing Gradient
Ø The vanishing gradient problem is when the gradient
shrinks as it back-propagates through time.

Ø If a gradient value becomes extremely small, it doesn’t
contribute much to learning.

Ø In RNNs, layers that get a small gradient update stop
learning.

Ø Those are usually the earlier layers.

Ø Because these layers don’t learn, RNNs can forget what
they have seen in longer sequences, thus having a short-term
memory.
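A tiny numeric illustration (not from the slides) of why this happens: backpropagation through time multiplies many local gradient factors together, so if each factor is below 1 the contribution of early time steps shrinks towards zero.

```python
# Assume each time step contributes a local gradient factor of 0.5 (a made-up value).
import numpy as np

factors = np.full(20, 0.5)   # 20 time steps back in time
print(np.prod(factors))      # ~9.5e-07: the signal from the earliest step has all but vanished
```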



Vanishing Gradient
Ø Because of vanishing gradients, the RNN doesn’t learn the
long-range dependencies across time steps.

Ø That means there is a possibility that the words “what”
and “time” are not considered when trying to predict the
user’s intention.

Ø The network then has to make its best guess with “is it?”

Ø That’s pretty ambiguous and would be difficult even for a
human.

Ø So not being able to learn on earlier time steps causes the
network to have a short-term memory.
Long Short-Term Memory (LSTM)
Ø LSTMs are a special kind of RNN, capable of
learning long-term dependencies.

Ø An LSTM has a similar control flow to a
recurrent neural network.

Ø The differences are the operations within the
LSTM’s cells.

Ø These operations allow the LSTM
to keep or forget information.



RNN vs LSTM
Ø An RNN unit takes the current input (Xt) as well as
the previous state (At-1) to produce the output (Ht)
and the current state (At).

Ø However, LSTMs have different components (4
layers inside) compared to the single tanh
(activation) layer in the RNN.

Ø LSTMs also have sigmoid activation functions in
addition to the tanh activation function.



Recall: tanh and sigmoid Activation Functions

tanh squishes values to be between -1 and 1; sigmoid squishes values to be between 0 and 1.
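A quick numeric check of the two activations, for reference:

```python
import numpy as np

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(np.tanh(x))                  # squashed into (-1, 1)
print(1.0 / (1.0 + np.exp(-x)))    # sigmoid: squashed into (0, 1)
```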



LSTM: Core Concept
Ø There are 4 layers inside an LSTM block, which interact with each other.

Ø The core concepts of LSTMs are the cell state and its various gates:

1. Forget gate

2. Input gate

3. Output gate



Cell State
Ø The key to the operation of the LSTM is the top
horizontal line running from left to right (the cell state).

Ø It acts as the “memory” of the network.

Ø It runs straight down the entire chain, with only
some minor linear interactions.

Ø It’s very easy for information to just flow along it
unchanged, reducing the effects of short-term
memory.



Cell State
Ø As the cell state goes on its journey, information
gets added to or removed from the cell state via gates.

Ø Gates are different NNs (sigmoid layers) that decide
which information is allowed on the cell state.

Ø The LSTM has three of these gates to control the cell state:

1. Forget gate
2. Input gate
3. Output gate

Ø The gates can learn what information is relevant to
keep or forget during training.
Forget Gate
Ø The first step is to decide what information will be
thrown away from the cell state.

Ø The forget gate (the first sigmoid layer) decides what
information should be thrown away or kept.

Ø Information from the previous hidden state (ht−1)
and the current input (xt) is passed through the
sigmoid function.

Ø Values come out between 0 and 1.

Ø The closer to 0 means forget, and the
closer to 1 means keep.
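In the standard LSTM formulation this is ft = sigmoid(Wf · [ht−1, xt] + bf). A minimal numpy sketch (random weights and inputs, purely illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden, features = 3, 2
rng = np.random.default_rng(0)
W_f = rng.normal(size=(hidden, hidden + features))   # forget-gate weights (random here)
b_f = np.zeros(hidden)

h_prev = rng.normal(size=hidden)                     # previous hidden state h_{t-1}
x_t = rng.normal(size=features)                      # current input x_t

f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
print(f_t)                                           # values in (0, 1): ~0 forget, ~1 keep
```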



Forget Gate
Ø Bob called Carla to ask her out.
Ø The cell state might include the gender of the present
subject (Bob → his).

Ø When a new subject arrives, we need to use the
correct pronouns later.
Ø The pronoun her is based on the subject Carla
and not Bob.

Ø When it encounters the new subject Carla, the machine
wants to forget the context Bob while making its
predictions.
Input Gate
Ø The next step is to determine what we are going to store
in the cell state.

Ø To update the cell state, we have the input gate.

Ø This has two parts:

Ø First, a sigmoid layer (the input gate layer) decides which
values will be updated.
Ø Next, a tanh layer creates a vector of new candidate
values (C̃t) that could be added to the state.

Ø In the next step, these two are combined to create an
update to the state.
Input Gate
Ø First, pass the previous hidden state and current
input into a sigmoid function to decide which values
will be updated by transforming the values to be
between 0 and 1.
Ø 0 means not important, and 1 means important.
Ø Also pass the hidden state and current input into the
tanh function to squish values between -1 and 1 to
help regulate the network.
Ø Then multiply the tanh output with the sigmoid output.
Ø The sigmoid output will decide which information is
important to keep from the tanh output.
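In the standard formulation, it = sigmoid(Wi · [ht−1, xt] + bi) and C̃t = tanh(Wc · [ht−1, xt] + bc). A minimal numpy sketch (random weights and inputs, purely illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden, features = 3, 2
rng = np.random.default_rng(0)
W_i = rng.normal(size=(hidden, hidden + features))   # input-gate weights
W_c = rng.normal(size=(hidden, hidden + features))   # candidate-value weights
b_i = b_c = np.zeros(hidden)

z = np.concatenate([rng.normal(size=hidden), rng.normal(size=features)])  # [h_{t-1}, x_t]
i_t = sigmoid(W_i @ z + b_i)       # in (0, 1): 0 = not important, 1 = important
C_cand = np.tanh(W_c @ z + b_c)    # candidate values in (-1, 1)
print(i_t * C_cand)                # the filtered candidate added to the cell state later
```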



Input Gate
Ø Bob called Carla to ask her out.
Ø We want to add the gender of the new subject
(Carla) to the cell state, to replace the old one
(Bob) we’re forgetting.



Cell State Again!!!
Ø It’s now time to update the old cell state (Ct−1)
into the new cell state (Ct).

Ø The previous steps already decided what to do;
we just need to actually do it.

Ø Multiply the old state by ft, forgetting the things
we decided to forget earlier.

Ø Then, add it ∗ C̃t -- these are the new candidate
values, scaled by how much we decided to
update each state value.
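Numerically, the update is Ct = ft ∗ Ct−1 + it ∗ C̃t. A tiny sketch with made-up vectors (in a real LSTM these come from the gates above):

```python
import numpy as np

C_prev = np.array([0.8, -0.3, 0.5])    # old cell state C_{t-1}
f_t    = np.array([0.1, 0.9, 0.5])     # forget gate output: ~0 drops, ~1 keeps
i_t    = np.array([0.9, 0.2, 0.5])     # input gate output
C_cand = np.array([0.7, -0.6, 0.1])    # candidate values from the tanh layer

C_t = f_t * C_prev + i_t * C_cand
print(C_t)                             # first entry mostly replaced, second mostly kept
```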



Scaled by tanh

vector transformations without tanh

A tanh function ensures that the values stay between -1 and 1, thus regulating the
output of the NN.

vector transformations with tanh
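A small numeric illustration (made-up transformation, not from the slides) of the point above: without tanh the values blow up as the vector is transformed repeatedly, while tanh keeps every entry inside (-1, 1).

```python
import numpy as np

v = np.array([0.5, -0.2, 0.8])
w = v.copy()
for _ in range(5):
    v = v * 3.0             # stand-in for repeated transformations without tanh
    w = np.tanh(w * 3.0)    # the same transformation followed by tanh

print(v)                    # entries have exploded (scaled by 3**5 = 243)
print(w)                    # entries stay bounded in (-1, 1)
```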


Cell State Again!!!
Ø Bob called Carla to ask her out.
Ø This is where we will actually drop the information
about the old subject’s (Bob) gender and add the
new information (Carla), as we decided in the
previous steps.



Output Gate
Ø Finally, we decide what we are going to output (the
next hidden state).

Ø This output will be based on the cell state, but will
be a filtered version.

Ø Remember that the hidden state contains
information on previous inputs.

Ø The hidden state is also used for predictions.



Output Gate
Ø First, we pass the previous hidden state and the
current input into a sigmoid layer which decides
what parts of the cell state we’re going to output.

Ø Then, we pass the newly modified cell state
through tanh (to push the values to be between
−1 and 1), and multiply it by the output of the
sigmoid gate, so that we only output the parts we
decided to.

Ø The output is the new hidden state.

Ø The new cell state and the new hidden state are then
carried over to the next time step.
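Putting the three gates together, one full time step of the LSTM cell can be sketched in numpy using the standard formulation (random weights, purely illustrative; not code from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate
    C_cand = np.tanh(W["c"] @ z + b["c"])    # candidate values C~_t
    C_t = f_t * C_prev + i_t * C_cand        # new cell state
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    h_t = o_t * np.tanh(C_t)                 # new hidden state
    return h_t, C_t

hidden, features = 3, 2
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(hidden, hidden + features)) for k in "fico"}
b = {k: np.zeros(hidden) for k in "fico"}

h, C = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(4, features)):   # a toy sequence of 4 time steps
    h, C = lstm_step(x_t, h, C, W, b)
print(h, C)                                  # carried over to the next time step
```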



Output Gate
Ø Bob called Carla to ask her out.
Ø Since it just saw a subject (Carla), it might want to
output information relevant to a verb, in case
that’s what is coming next.

Ø For example, it might output whether the subject


is singular or plural, so that we know what form a
verb should be conjugated into if that’s what
follows next.



Input and Output Shape in LSTM (Keras)



Input Shape
Ø You always have to give a three-dimensional array as an input to your LSTM network.

Ø (samples/batch_size, time steps, features/units)



Input Shape
Ø (samples/batch_size, time steps, features/units)

Ø One complete sequence is one sample.


Ø A sample may refer to individual training examples.
Ø A batch is comprised of one or more samples.
Ø A “batch_size” variable indicates how many
different samples (examples) you feed at once to
the neural network.



Input Shape
Ø (samples/batch_size, time steps, features/units)




Input Shape
Ø (samples/batch_size, time steps, features/units)
Ø The number of time steps we feed to the model.
Ø It is the unit of time/data the network uses.
Ø This value is determined by how many past data points
the LSTM should look at.
Ø The length of the input sequence.
Ø For example, if you want to give the LSTM a sentence
as an input, your time steps could be either the
number of words or the number of characters,
depending on what you want.



Input Shape
Ø (samples/batch_size, time steps, features/units)




Input Shape
Ø (samples/batch_size, time steps, features/units)

Ø The number of dimensions we feed at each time step.

Ø One feature is one observation at a time step.
Ø For example:
Ø In NLP, a word could be represented by 300
features using word2vec.
Ø In the case of signal processing, let’s pretend that
the signal is 3D. That is, you have an X, a Y and a Z
signal. This means you would have 3 features sent
at each time step for each sample.
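A concrete (made-up) example of this layout: 5 recordings of a 3-axis (X, Y, Z) signal, each 8 time steps long.

```python
import numpy as np

data = np.random.rand(5, 8, 3)   # (samples/batch_size, time_steps, features)
print(data.shape)                # (5, 8, 3)
print(data[0].shape)             # one sample: (8, 3) -- 8 time steps, 3 features each
print(data[0, 0].shape)          # the 3 features (X, Y, Z) at the first time step
```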



Input Shape
Ø (samples/batch_size, time steps, features/units)




Input Shape (samples/batch_size, time steps, features/units)

Ø Though input_shape looks like a 2D array, actually a 3D array is being passed,
with a shape of (batch_size, 2, 10).
Ø This means the value of time steps is 2 and the number of input units is 10.
Ø We have the flexibility to feed any batch size at the time of fitting the data to the
network.
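The layer definition the slide refers to presumably looks something like the sketch below (Keras 2.x / tf.keras style): input_shape=(2, 10) means 2 time steps and 10 features, with the batch dimension left out.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.LSTM(3, input_shape=(2, 10)),   # expects input of shape (batch_size, 2, 10)
])
print(model.input_shape)                   # (None, 2, 10): batch size not fixed yet
```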



Input Shape (samples/batch_size, time steps, features/units)

Ø We can also give an argument called batch_input_shape instead of input_shape.

Ø Difference: the batch size is fixed at 8, so the input array shape will look like (8,
2, 10).

Ø Feeding a batch size other than 8 will give an error.
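The same sketch with batch_input_shape (Keras 2.x style), fixing the batch size to 8:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.LSTM(3, batch_input_shape=(8, 2, 10)),   # only accepts batches of exactly 8 samples
])
print(model.input_shape)                            # (8, 2, 10)
```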



Input Shape: Don’t Get Confused

Ø Units in LSTM.
Ø The dimension of the hidden state (or the output).
Ø Here, the hidden state (the red circles) has length 2.
Ø The number of units is the number of neurons
connected to the layer holding the concatenated
vector of hidden state and input (the layer holding both
red and green circles below).
Ø Here, there are 2 neurons connected to that layer.



LSTM: Output and Its Shape

Ø units = 3. So output shape is (None, 3).


Ø The first dimension of output is None, because we do not know the batch size in
advance.
Ø Here, the actual output shape will be (batch_size, 3).
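A sketch reproducing this case: units = 3 and the batch size is not fixed, so the output shape is (None, 3).

```python
from tensorflow.keras import layers, models

model = models.Sequential([layers.LSTM(3, input_shape=(2, 10))])
print(model.output_shape)   # (None, 3)
```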



LSTM: Output and Its Shape

Ø Here, batch_size is defined in advance.


Ø The output shape becomes (8, 3).



LSTM: Output and Its Shape

Ø return_sequences tells whether to return the output at each time step instead of only the final time
step.
Ø Here, with return_sequences set to True, the output shape becomes a 3D array instead of a 2D array.
Ø Now the shape of the output is (8, 2, 3).
Ø The extra dimension in between represents the number of time steps.
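A sketch reproducing this case: with return_sequences=True the output keeps the time dimension, giving (8, 2, 3).

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.LSTM(3, batch_input_shape=(8, 2, 10), return_sequences=True),
])
print(model.output_shape)   # (8, 2, 3)
```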



Summary: Output and Its Shape
Ø The input of the LSTM is always a 3D array: (batch_size, time_steps, features/units).

Ø The output of the LSTM could be a 2D array or a 3D array, depending upon
the return_sequences argument.

Ø If return_sequences is False, the output is a 2D array (batch_size, units).

Ø If return_sequences is True, the output is a 3D array (batch_size, time_steps,
units).



LSTM: 1D, 2D, 3D Arrays
Ø The input to every LSTM network layer must be 3D.
Ø Let’s consider a sequence of multiple time steps with 1 feature: a sequence of 10 values,
each value representing the close price of some stock (therefore 1 feature).
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Ø This sequence can be defined as a numpy array.

Ø The reshape() function in numpy can be used to reshape the above 1 dimensional array
into a 3 dimensional array, with 1 sample, 10 time steps, and 1 feature.

Ø This function takes one argument: a tuple that defines the new shape of the array.
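Written out in numpy, matching the description above:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])   # 10 close prices, 1D
data = data.reshape((1, 10, 1))                     # 1 sample, 10 time steps, 1 feature
print(data.shape)                                   # (1, 10, 1)
```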



LSTM: 1D, 2D, 3D Arrays



LSTM: 1D, 2D, 3D Arrays
Ø Let’s consider data that has 2 features.
Ø For example, data that includes close prices and open prices from chart data.
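A sketch of the 2-feature case: close and open prices stacked column-wise, then reshaped into 1 sample, 10 time steps, 2 features. The price values are made up.

```python
import numpy as np

close = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
open_ = np.array([1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1, 9.1, 10.1])

data = np.column_stack((close, open_)).reshape((1, 10, 2))
print(data.shape)   # (1, 10, 2)
```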

