
Recurrent Neural Networks

CS60010: Deep Learning

Abir Das

IIT Kharagpur

Mar 11, 2020



Agenda

§ Get introduced to different recurrent neural architectures, e.g., RNNs, LSTMs, GRUs, etc.
§ Get introduced to tasks involving sequential inputs and/or sequential outputs.


Resources

§ Deep Learning by I. Goodfellow, Y. Bengio and A. Courville [Link] [Chapter 10]
§ CS231n by Stanford University [Link]
§ Understanding LSTM Networks by Chris Olah [Link]


Why do we Need another NN Model?


§ So far, we focused mainly on prediction problems with fixedsize inputs
and outputs.
§ In image classification, input is fixed size image and and output is its
class, in video classification, the input is fixed size video and output is
its class, in bounding-box regression the input is fixed size region
proposal (resized/RoI pooled) and output is bounding box
coordinates.


Why do we Need another NN Model?

§ Suppose we want our model to write down the caption of this image.

Figure: Several people with umbrellas walk down a side walk on a rainy day.

Image source: COCO Dataset, ICCV 2015


Why do we Need another NN Model?

§ Will this work?

    Several people with umbrellas

§ When the model generates ‘people’, we need a way to tell the model that ‘several’ has already been generated, and similarly for the other words.


Why do we Need another NN Model?


    Several people with umbrellas

§ e.g., Image Captioning: image -> sequence of words

Slide credit: Fei-Fei Li, Justin Johnson & Serena Yeung, CS231n Lecture 10

Recurrent Neural Networks: Process Sequences

§ e.g., Image Captioning: image -> sequence of words
§ e.g., Machine Translation: sequence of words -> sequence of words
§ e.g., Sentiment Classification: sequence of words -> sentiment
§ e.g., Frame Level Video Classification: sequence of frames -> sequence of labels

Image source: CS231n from Stanford


Recurrent Neural Network


§ The fundamental feature of a Recurrent Neural Network (RNN) is that the network contains at least one feedback connection, so that activations can flow around a loop.
§ The feedback connection allows information to persist. Remember that generating ‘people’ requires remembering that ‘several’ has already been generated.
§ The simplest form of RNN has the previous set of hidden unit activations feeding back into the network along with the inputs.
[Figure: a simple RNN. The input x^t feeds the hidden units, which produce the output h^t; a delay unit feeds h^{t-1} back into the hidden units at the next time-step.]


Recurrent Neural Network

[Figure: the same simple RNN with the delay unit, repeated.]

§ Note that the concept of ‘time’ or sequential processing comes into the picture.
§ The activations are updated one time-step at a time.
§ The task of the delay unit is simply to delay the hidden layer activation until the next time-step.


Recurrent Neural Network


h^t = f(x^t, h^{t-1})
(new state = some function f of the current input vector and the old state)

§ f, in particular, can be a layer of a neural network.


§ Let's unroll the recurrent connection.

[Figure: the RNN unrolled in time. At each time-step t, the input x^t enters through W_i, the previous hidden state h^{t-1} enters through W_h to produce h^t, and the output y^t is produced from h^t through W_o.]

§ Note that all the weight matrices are the same across time-steps, i.e., the weights are shared across all time-steps.

Recurrent Neural Network: Forward Pass


𝒚" 𝒚")$ 𝒚")- 𝒚+

𝑾( 𝑾( 𝑾( 𝑾(
𝒉"#$ 𝒉" 𝒉")$ 𝒉")𝟐 𝒉+#$
𝑾& 𝑾& 𝑾& 𝑾&

𝑾' 𝑾' 𝑾' 𝑾'


" ")$ ")- +
𝒙 𝒙 𝒙 𝒙

    a^t = W_h h^{t-1} + W_i x^t    (1)
    h^t = g(a^t)                   (2)
    y^t = W_o h^t                  (3)

§ Note that we can have biases too. For simplicity these are omitted.
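
As a concrete illustration, here is a minimal NumPy sketch of this forward pass over a whole sequence (the function name rnn_forward, the choice g = tanh and the toy dimensions are my own; biases are omitted, as above):

    import numpy as np

    def rnn_forward(xs, h0, Wh, Wi, Wo, g=np.tanh):
        """Run a^t = Wh h^{t-1} + Wi x^t, h^t = g(a^t), y^t = Wo h^t over a sequence."""
        h = h0
        hs, ys = [], []
        for x in xs:                  # xs holds the input vectors x^1, ..., x^T
            a = Wh @ h + Wi @ x       # pre-activation a^t
            h = g(a)                  # new hidden state h^t
            hs.append(h)
            ys.append(Wo @ h)         # output y^t
        return hs, ys

    # toy usage: a 3-step sequence, 4-dim inputs, 5-dim hidden state, 2-dim outputs
    rng = np.random.default_rng(0)
    xs = [rng.standard_normal(4) for _ in range(3)]
    Wh = rng.standard_normal((5, 5))
    Wi = rng.standard_normal((5, 4))
    Wo = rng.standard_normal((2, 5))
    hs, ys = rnn_forward(xs, np.zeros(5), Wh, Wi, Wo)

The same Wh, Wi and Wo are reused at every time-step, which is exactly the weight sharing noted on the previous slide.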

Recurrent Neural Network: BPTT

[Figure: the unrolled RNN, as before.] Recall: a^t = W_h h^{t-1} + W_i x^t, h^t = g(a^t), y^t = W_o h^t.

§ BPTT: Backpropagation through time.
§ The total loss is L = Σ_{t=1}^{T} L^t, and we are after ∂L/∂W_o, ∂L/∂W_h and ∂L/∂W_i.
§ Let's compute ∂L/∂y^t first:

    ∂L/∂y^t = (∂L/∂L^t)(∂L^t/∂y^t) = 1 · ∂L^t/∂y^t        (4)

§ ∂L^t/∂y^t is computable depending on the particular form of the loss function (an example follows below).
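
As a concrete example (my choice of loss, not fixed by the slides): with a squared-error loss L^t = ½‖y^t − d^t‖², where d^t is the target at time-step t, eqn. (4) gives ∂L/∂y^t = (y^t − d^t)ᵀ, written as a row vector to match the convention used on the next slides; with a softmax + cross-entropy loss on y^t, the familiar "softmax minus one-hot target" expression results.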

Recurrent Neural Network: BPTT

[Figure: the unrolled RNN, as before.] Recall: a^t = W_h h^{t-1} + W_i x^t, h^t = g(a^t), y^t = W_o h^t.

§ Let's compute ∂L/∂h^t. The subtlety here is that all the losses from time-step t onwards are functions of h^t. So, let us first consider ∂L/∂h^T, where T is the last time-step.

    ∂L/∂h^T = (∂L/∂y^T)(∂y^T/∂h^T) = (∂L/∂y^T) W_o        (5)

§ ∂L/∂y^T we just computed on the last slide (eqn. (4)).
§ For a generic t, we need to compute ∂L/∂h^t. h^t affects y^t and also h^{t+1}. For this we will use something that we used while studying backpropagation for feedforward networks.

Recurrent Neural Network: BPTT

[Figure: the unrolled RNN, as before.] Recall: a^t = W_h h^{t-1} + W_i x^t, h^t = g(a^t), y^t = W_o h^t.

§ If u = f(x, y), where x = φ(t) and y = ψ(t), then ∂u/∂t = (∂u/∂x)(∂x/∂t) + (∂u/∂y)(∂y/∂t).

    ∂L/∂h^t = (∂L/∂y^t)(∂y^t/∂h^t) + (∂L/∂h^{t+1})(∂h^{t+1}/∂h^t)        (6)

§ ∂L/∂y^t we computed in eqn. (4).
§ ∂y^t/∂h^t = W_o.
§ ∂L/∂h^{t+1} is almost the same as ∂L/∂h^t; it is just for the next time-step.
§ ∂h^{t+1}/∂h^t = (∂h^{t+1}/∂a^{t+1})(∂a^{t+1}/∂h^t) = g′ · W_h.
§ Since g is an elementwise operation, g′ will be a diagonal matrix.
§ In particular, if g is tanh, then ∂h^{t+1}/∂a^{t+1} = diag(1 − (h^{t+1}_1)², 1 − (h^{t+1}_2)², ..., 1 − (h^{t+1}_m)²) (derived below).
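
The diagonal form can be seen from the scalar derivative of tanh (a step the slides take implicitly): since h^{t+1} = tanh(a^{t+1}) is applied elementwise and d tanh(a)/da = 1 − tanh²(a),

    ∂h^{t+1}/∂a^{t+1} = diag(1 − (h^{t+1}_1)², ..., 1 − (h^{t+1}_m)²),   ∂a^{t+1}/∂h^t = W_h,

and their product gives the Jacobian used above.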


Recurrent Neural Network: BPTT

[Figure: the unrolled RNN, as before.] Recall: a^t = W_h h^{t-1} + W_i x^t, h^t = g(a^t), y^t = W_o h^t.

§ Putting the pieces together: ∂L/∂h^t = (∂L/∂y^t) W_o + (∂L/∂h^{t+1}) diag(1 − (h^{t+1}_1)², ..., 1 − (h^{t+1}_m)²) W_h.
§ All the other things we can compute, but to compute ∂L/∂h^t we need ∂L/∂h^{t+1}.
§ From eqn. (5) we get ∂L/∂h^T, which gives ∂L/∂h^{T−1}, and so on backwards in time.
§ Now, ∂L/∂W_o = Σ_t ∂L^t/∂W_o = Σ_t (∂L^t/∂y^t)(∂y^t/∂W_o) = Σ_t (∂L^t/∂y^t) h^t.
§ ∂L^t/∂y^t is computable depending on the form of the loss function.
§ (Do it yourself) Similarly for ∂L/∂W_h and ∂L/∂W_i (a NumPy sketch of the full backward pass follows below).
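
A minimal NumPy sketch of this backward pass (my own illustration: it assumes g = tanh and the squared-error loss used in the earlier example, consumes the hs and ys produced by the rnn_forward sketch, and works with column-vector gradients, i.e., the transposes of the row-vector expressions on the slides):

    import numpy as np

    def rnn_backward(xs, hs, ys, targets, h0, Wh, Wi, Wo):
        """BPTT for a^t = Wh h^{t-1} + Wi x^t, h^t = tanh(a^t), y^t = Wo h^t
        under the example loss L = sum_t 0.5 * ||y^t - d^t||^2."""
        dWh, dWi, dWo = np.zeros_like(Wh), np.zeros_like(Wi), np.zeros_like(Wo)
        dh_next = np.zeros_like(hs[0])       # dL/dh^{t+1}; zero beyond the last step
        for t in reversed(range(len(xs))):
            dy = ys[t] - targets[t]          # dL^t/dy^t for the squared-error loss
            dWo += np.outer(dy, hs[t])       # dL/dWo accumulates dy (h^t)^T
            dh = Wo.T @ dy + dh_next         # eqn (6): dL/dh^t
            da = (1.0 - hs[t] ** 2) * dh     # through tanh: diag(1 - (h^t)^2) dh
            h_prev = hs[t - 1] if t > 0 else h0
            dWh += np.outer(da, h_prev)      # dL/dWh accumulates da (h^{t-1})^T
            dWi += np.outer(da, xs[t])       # dL/dWi accumulates da (x^t)^T
            dh_next = Wh.T @ da              # contribution passed to time-step t-1
        return dWh, dWi, dWo

The returned gradients can then be used in an ordinary gradient-descent update of Wh, Wi and Wo.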

Exploding or Vanishing Gradients

§ In recurrent nets (and also in very deep nets), the final output is the composition of a large number of non-linear transformations.
§ The derivatives through these compositions tend to be either very small or very large.
§ If h = (f ◦ g)(x) = f(g(x)), then h′(x) = f′(g(x)) g′(x).
§ If the individual gradients are small, their product becomes vanishingly small.
§ If the individual gradients are large, their product blows up.
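
A quick NumPy illustration of the effect (a toy example of my own, not from the slides): multiplying a gradient vector repeatedly by the same matrix shrinks or amplifies it geometrically, depending on whether the matrix's singular values are below or above 1.

    import numpy as np

    rng = np.random.default_rng(0)
    m = 20
    grad = rng.standard_normal(m)
    Q, _ = np.linalg.qr(rng.standard_normal((m, m)))   # an orthogonal matrix

    for scale in (0.5, 1.5):             # all singular values of W equal `scale`
        W = scale * Q
        g = grad.copy()
        for _ in range(50):              # backprop through 50 time-steps
            g = W.T @ g
        print(scale, np.linalg.norm(g))  # ~0.5^50 * ||grad||  vs  ~1.5^50 * ||grad||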


Exploding or Vanishing Gradients

[Figure: the unrolled RNN, as before.] Recall: a^t = W_h h^{t-1} + W_i x^t, h^t = g(a^t), y^t = W_o h^t.

§ Let us see what happens with one learnable weight matrix, θ = W_h.

    ∂L/∂θ = Σ_{t=1}^{T} ∂L^t/∂θ
    ∂L^t/∂θ = (∂L^t/∂y^t)(∂y^t/∂θ)
    ∂y^t/∂θ = (∂y^t/∂h^t)(∂h^t/∂θ)

§ But h^t is a function of h^{t−1}, h^{t−2}, ..., h^2, h^1, and each of these is in turn a function of θ.

Exploding or Vanishing Gradients

[Figure: the unrolled RNN, as before.] Recall: a^t = W_h h^{t-1} + W_i x^t, h^t = g(a^t), y^t = W_o h^t.

§ Now we will resort to our friend again: if u = f(x, y), where x = φ(t) and y = ψ(t), then ∂u/∂t = (∂u/∂x)(∂x/∂t) + (∂u/∂y)(∂y/∂t).

    ∂h^t/∂θ = Σ_{k=1}^{t−1} (∂h^t/∂h^k)(∂h^k/∂θ)

§ And, for g = tanh,

    ∂h^t/∂h^k = (∂h^t/∂h^{t−1})(∂h^{t−1}/∂h^{t−2}) ··· (∂h^{k+1}/∂h^k)
              = [diag(1 − (h^t_1)², ..., 1 − (h^t_m)²) W_h] [diag(1 − (h^{t−1}_1)², ..., 1 − (h^{t−1}_m)²) W_h] ··· [diag(1 − (h^{k+1}_1)², ..., 1 − (h^{k+1}_m)²) W_h]
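
A standard bound (not spelled out on the slides) makes the behaviour explicit. Taking norms of the product above and using the fact that each diagonal entry of the tanh derivative lies in (0, 1],

    ‖∂h^t/∂h^k‖ ≤ Π_{j=k+1}^{t} ‖diag(1 − (h^j_1)², ..., 1 − (h^j_m)²)‖ ‖W_h‖ ≤ ‖W_h‖^{t−k},

so the factor coupling time-step k to time-step t shrinks geometrically in t − k when ‖W_h‖ < 1 and can grow geometrically when ‖W_h‖ > 1: vanishing or exploding gradients.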



Exploding or Vanishing Gradients

§ Exploding gradients:
  - Easy to detect.
  - Remedy: clip the gradient at a threshold (sketched below).
§ Vanishing gradients:
  - More difficult to detect.
  - Remedy: architectures designed to combat the problem of vanishing gradients, e.g., LSTMs by Hochreiter & Schmidhuber.
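
A sketch of the clipping remedy in NumPy (an illustrative implementation; clipping by the global norm is one common variant, and the threshold of 5.0 is an arbitrary choice):

    import numpy as np

    def clip_by_global_norm(grads, threshold=5.0):
        """Rescale a list of gradient arrays so their combined L2 norm <= threshold."""
        total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        if total_norm > threshold:
            scale = threshold / (total_norm + 1e-12)
            grads = [g * scale for g in grads]
        return grads

    # e.g., clip the BPTT gradients before the parameter update:
    # dWh, dWi, dWo = clip_by_global_norm([dWh, dWi, dWo])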


Long Short Term Memory (LSTM)

§ Hochreiter & Schmidhuber(1997) solved the problem of getting an


RNN to remember things for a long time (e.g., hundreds of time
steps).
§ They designed a memory cell using logistic and linear units with
multiplicative interactions.
§ Information is handled using three gates, namely - forget, input and
output.


Recall: Vanilla RNNs

§ In a standard RNN, the repeating module has a simple structure, such as a single tanh layer.

Source: Chris Olah’s blog


LSTMs
§ LSTMs also have this chain-like structure, but the repeating module has a different structure.
§ Instead of having a single neural network layer, there are four, interacting in a very special way.
§ In the figure's notation, yellow boxes are learned neural network layers, circles are pointwise operations, lines carry vectors, merging lines denote concatenation, and forking lines denote copying.

Source: Chris Olah’s blog


LSTM Memory/Cell State


§ The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.
§ The cell state is kind of like a conveyor belt: it is very easy for information to just flow along it unchanged.
§ The LSTM does have the ability to remove or add information to the cell state, carefully regulated by gates.

Source: Chris Olah’s blog


Gate

§ A gate is composed of a sigmoid neural net layer and a pointwise multiplication operation.
§ The sigmoid outputs numbers between zero and one:
  - zero: “let nothing through”
  - one: “let everything through”

Source: Chris Olah’s blog


Gate

§ An LSTM has three such gates.

Source: Chris Olah’s blog


Forget Gate

§ The first step is to decide what information is going to be thrown away from the cell state. This decision is made by the “forget gate layer” (its equation is written out below).
§ It looks at h_{t-1} and x_t, and outputs a number between 0 and 1 for each number in the cell state C_{t-1}. A 1 represents “completely keep this”, while a 0 represents “completely get rid of this”.

Source: Chris Olah’s blog
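
In the notation of the cited post (W_f and b_f being the gate's learned weights and bias), the forget gate computes

    f_t = σ(W_f · [h_{t-1}, x_t] + b_f)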


Input Gate

§ The next step is to decide what new information is going to be stored in the cell state. This has two parts.
§ First, a sigmoid layer called the “input gate layer” decides which values to update. Next, a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state.
§ The next step combines these two to create an update to the state (see the equations below).


Source: Chris Olah’s blog
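
Again in the cited post's notation (W_i, b_i, W_C and b_C are the learned parameters):

    i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
    C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)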


Input Gate

§ It's now time to update the old cell state, C_{t-1}, into the new cell state C_t.
§ This is done by multiplying the old state by f_t, forgetting the things we decided to forget earlier, and then adding i_t ∗ C̃_t (see the update equation below).

Source: Chris Olah’s blog
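
Written out in the same notation, the cell state update is

    C_t = f_t ∗ C_{t-1} + i_t ∗ C̃_t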


Output Gate
§ Finally, we need to decide what is going to be the output. This output will be based on the cell state.
§ First, a sigmoid layer is run which decides what parts of the cell state are going to be output.
§ Then, the cell state is put through tanh (to push the values to be between −1 and 1) and this is multiplied by the output of the sigmoid gate (see the equations and sketch below).

Source: Chris Olah’s blog
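
In the cited post's notation the output gate computes (W_o and b_o here are the gate's own parameters, not the RNN output matrix used earlier):

    o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
    h_t = o_t ∗ tanh(C_t)

Putting the three gates together, a compact NumPy sketch of a single LSTM step (my own illustrative implementation of the equations above; the concatenation-based parameterization and the weight shapes are assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, C_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
        """One LSTM time-step: forget, input and output gates acting on the cell state.
        Each W has shape (hidden_dim, hidden_dim + input_dim), each b shape (hidden_dim,)."""
        z = np.concatenate([h_prev, x])     # [h_{t-1}, x_t]
        f = sigmoid(Wf @ z + bf)            # forget gate f_t
        i = sigmoid(Wi @ z + bi)            # input gate i_t
        C_tilde = np.tanh(Wc @ z + bc)      # candidate values C̃_t
        C = f * C_prev + i * C_tilde        # new cell state C_t
        o = sigmoid(Wo @ z + bo)            # output gate o_t
        h = o * np.tanh(C)                  # new hidden state h_t
        return h, C

Stepping this function over a sequence, carrying h and C forward, gives the LSTM analogue of the earlier rnn_forward sketch.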
