
Recurrent Neural Networks

CS60010: Deep Learning

Abir Das

IIT Kharagpur

Mar 11, 2020

Introduction Recurrent Neural Network LSTM


§ Get introduced to different recurrent neural architectures, e.g., RNNs, LSTMs, GRUs, etc.
§ Get introduced to tasks involving sequential inputs and/or sequential outputs.

Abir Das (IIT Kharagpur) CS60010 Mar 11, 2020 2 / 30



§ Deep Learning by I. Goodfellow, Y. Bengio, and A. Courville, Chapter 10 [Link]
§ CS231n by Stanford University [Link]
§ Understanding LSTM Networks by Chris Olah [Link]


Why do we Need another NN Model?

§ So far, we have focused mainly on prediction problems with fixed-size inputs and outputs.
§ In image classification, the input is a fixed-size image and the output is its class; in video classification, the input is a fixed-size video and the output is its class; in bounding-box regression, the input is a fixed-size region proposal (resized/RoI pooled) and the output is a bounding box.


§ Suppose, we want our model to write down the caption of this image.

Figure: Several people with umbrellas walk down a side walk on a rainy day.

Image source: COCO Dataset, ICCV 2015


§ Will this work?

Several people with umbrellas

§ When the model generates ‘people’, we need a way to tell the model that ‘several’ has already been generated, and similarly for the other words.


Recurrent Neural Networks: Process Sequences

§ e.g., Image Captioning: image -> sequence of words
§ e.g., Machine Translation: sequence of words -> sequence of words
§ e.g., Sentiment Classification: sequence of words -> sentiment
§ e.g., Frame-Level Video Classification: sequence of frames -> sequence of labels

Image source: CS231n from Stanford


Recurrent Neural Network

§ The fundamental feature of a Recurrent Neural Network (RNN) is
that the network contains at least one feedback connection so that
activation can flow in a loop.
§ The feedback connection allows information to persist. Recall that generating ‘people’ requires remembering that ‘several’ has already been generated.
§ The simplest form of RNN has the previous set of hidden unit activations feeding back into the network along with the inputs.

[Figure: hidden units whose activations feed back into the network through a delay unit]



§ Note that the concept of ‘time’ or sequential processing comes into play.

§ The activations are updated one time-step at a time.
§ The task of the delay unit is to simply delay the hidden layer
activation until the next time-step.


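$$h^t = f(x^t, h^{t-1})$$

Here $x^t$ is the input vector, $h^{t-1}$ is the old state, $h^t$ is the new state, and $f$ is the function that maps the first two to the third.

§ $f$, in particular, can be a layer of a neural network.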

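§ Let's unroll the recurrent connection.

[Figure: the RNN unrolled over timesteps $t, t+1, t+2, \dots, T$. Each input $x^t$ feeds the hidden state $h^t$ through $W_i$, each hidden state feeds the next one through $W_h$, and each output $y^t$ is read from $h^t$ through $W_o$.]

§ Note that all weight matrices are the same across timesteps, i.e., the weights are shared by all the timesteps.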

Recurrent Neural Network: Forward Pass

$$a^t = W_h h^{t-1} + W_i x^t \tag{1}$$
$$h^t = g(a^t) \tag{2}$$
$$y^t = W_o h^t \tag{3}$$

§ Note that we can have biases too. For simplicity these are omitted.
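
The forward pass of eqns. (1)-(3) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `rnn_forward`, the choice $g = \tanh$, the zero initial state and the random weights are assumptions made here, not part of the slides.

```python
import numpy as np

def rnn_forward(xs, h0, W_h, W_i, W_o, g=np.tanh):
    """Run eqns. (1)-(3) over a sequence; the same weights are reused at every timestep."""
    h, hs, ys = h0, [], []
    for x in xs:
        a = W_h @ h + W_i @ x   # eqn. (1): pre-activation from old state and input
        h = g(a)                # eqn. (2): new hidden state
        y = W_o @ h             # eqn. (3): output read from the hidden state
        hs.append(h)
        ys.append(y)
    return hs, ys

# Hypothetical sizes: hidden m=4, input d=3, output k=2, sequence length T=5.
rng = np.random.default_rng(0)
m, d, k, T = 4, 3, 2, 5
W_h = rng.normal(scale=0.5, size=(m, m))
W_i = rng.normal(scale=0.5, size=(m, d))
W_o = rng.normal(scale=0.5, size=(k, m))
xs = rng.normal(size=(T, d))
hs, ys = rnn_forward(xs, np.zeros(m), W_h, W_i, W_o)
```

Biases are left out here for the same reason as on the slide; adding them would only change the line that computes `a` and the line that computes `y`.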
Recurrent Neural Network: BPTT

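Recall the forward pass: $a^t = W_h h^{t-1} + W_i x^t$, $h^t = g(a^t)$, $y^t = W_o h^t$.

§ BPTT: Backpropagation through time.
§ The total loss is $L = \sum_{t=1}^{T} L^t$, and we are after $\frac{\partial L}{\partial W_o}$, $\frac{\partial L}{\partial W_h}$ and $\frac{\partial L}{\partial W_i}$.
§ Let's compute $\frac{\partial L}{\partial y^t}$.

$$\frac{\partial L}{\partial y^t} = \frac{\partial L}{\partial L^t}\frac{\partial L^t}{\partial y^t} = 1 \cdot \frac{\partial L^t}{\partial y^t} \tag{4}$$

§ $\frac{\partial L^t}{\partial y^t}$ is computable depending on the particular form of the loss function.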
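
As one concrete instance of eqn. (4), suppose the per-timestep loss is softmax cross-entropy on the output $y^t$ (an assumption made for illustration; the slides do not fix a loss). The derivative is then the softmax probabilities minus the one-hot target. The helper name `softmax_xent_grad` is hypothetical.

```python
import numpy as np

def softmax_xent_grad(y, target):
    """Return (L^t, dL^t/dy^t) for logits y and an integer class index target."""
    z = y - np.max(y)                    # shift logits for numerical stability
    p = np.exp(z) / np.sum(np.exp(z))    # softmax probabilities
    loss = -np.log(p[target])            # cross-entropy for the true class
    grad = p.copy()
    grad[target] -= 1.0                  # softmax(y) - one_hot(target)
    return loss, grad

y_t = np.array([2.0, 0.5, -1.0])
loss_t, dy_t = softmax_xent_grad(y_t, target=0)
```

With a different loss only this function changes; the rest of the derivation uses $\frac{\partial L^t}{\partial y^t}$ as a black box.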
§ Let's compute $\frac{\partial L}{\partial h^t}$. The subtlety here is that every loss term from timestep $t$ onward is a function of $h^t$. So, let us first consider $\frac{\partial L}{\partial h^T}$, where $T$ is the last timestep.

$$\frac{\partial L}{\partial h^T} = \frac{\partial L}{\partial y^T}\frac{\partial y^T}{\partial h^T} = \frac{\partial L}{\partial y^T} W_o \tag{5}$$

§ $\frac{\partial L}{\partial y^T}$ we just computed on the last slide (eqn. (4)).
§ For a generic $t$, we need to compute $\frac{\partial L}{\partial h^t}$. $h^t$ affects $y^t$ and also $h^{t+1}$. For this we will use something that we used while studying backpropagation for feedforward networks.
§ If $u = f(x, y)$, where $x = \phi(t)$ and $y = \psi(t)$, then $\frac{\partial u}{\partial t} = \frac{\partial u}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial u}{\partial y}\frac{\partial y}{\partial t}$.

$$\frac{\partial L}{\partial h^t} = \frac{\partial L}{\partial y^t}\frac{\partial y^t}{\partial h^t} + \frac{\partial L}{\partial h^{t+1}}\frac{\partial h^{t+1}}{\partial h^t} \tag{6}$$

§ $\frac{\partial L}{\partial y^t}$ we computed in eqn. (4).
§ $\frac{\partial y^t}{\partial h^t} = W_o$.
§ $\frac{\partial L}{\partial h^{t+1}}$ is almost the same as $\frac{\partial L}{\partial h^t}$. It is just for the next timestep.
§ $\frac{\partial h^{t+1}}{\partial h^t} = \frac{\partial h^{t+1}}{\partial a^{t+1}}\frac{\partial a^{t+1}}{\partial h^t} = g' \cdot W_h$.
§ Since $g$ is an elementwise operation, $g'$ will be a diagonal matrix.
§ In particular, if $g$ is tanh, then $g'(a) = 1 - \tanh^2(a)$ and, since $h^{t+1} = \tanh(a^{t+1})$,

$$\frac{\partial h^{t+1}}{\partial h^t} = \operatorname{diag}\left(1 - (h_1^{t+1})^2,\ 1 - (h_2^{t+1})^2,\ \cdots,\ 1 - (h_m^{t+1})^2\right) W_h$$

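§ Putting the pieces together for $g = \tanh$:

$$\frac{\partial L}{\partial h^t} = \frac{\partial L}{\partial y^t} W_o + \frac{\partial L}{\partial h^{t+1}} \operatorname{diag}\left(1 - (h_1^{t+1})^2,\ 1 - (h_2^{t+1})^2,\ \cdots,\ 1 - (h_m^{t+1})^2\right) W_h$$

§ All the other things we can compute, but to compute $\frac{\partial L}{\partial h^t}$ we need $\frac{\partial L}{\partial h^{t+1}}$.
§ From eqn. (5), we get $\frac{\partial L}{\partial h^T}$, which gives $\frac{\partial L}{\partial h^{T-1}}$ and so on.
§ Now,

$$\frac{\partial L}{\partial W_o} = \sum_t \frac{\partial L^t}{\partial W_o} = \sum_t \frac{\partial L^t}{\partial y^t}\frac{\partial y^t}{\partial W_o} = \sum_t \frac{\partial L^t}{\partial y^t}\, h^t$$

where the last product is an outer product, so that the result has the same shape as $W_o$.
§ $\frac{\partial L^t}{\partial y^t}$ is computable depending on the form of the loss function.
§ (Do it yourself) Similarly for $\frac{\partial L}{\partial W_h}$ and $\frac{\partial L}{\partial W_i}$.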
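
The whole backward recursion can be sketched in NumPy and checked against a finite-difference gradient. This is a sketch under stated assumptions, not the lecture's reference implementation: $g = \tanh$, a squared-error loss $L^t = \frac{1}{2}\lVert y^t - \text{target}^t \rVert^2$, and column-vector gradients (so the matrices appear transposed relative to the row-vector form above). All function and variable names are hypothetical.

```python
import numpy as np

def forward(xs, h0, W_h, W_i, W_o):
    """Eqns. (1)-(3) with g = tanh. Returns states [h^0, ..., h^T] and outputs [y^1, ..., y^T]."""
    hs, ys = [h0], []
    for x in xs:
        h = np.tanh(W_h @ hs[-1] + W_i @ x)
        hs.append(h)
        ys.append(W_o @ h)
    return hs, ys

def loss_fn(ys, targets):
    # Assumed loss: L = sum_t 0.5 * ||y^t - target^t||^2.
    return sum(0.5 * np.sum((y - tg) ** 2) for y, tg in zip(ys, targets))

def bptt(xs, targets, h0, W_h, W_i, W_o):
    """Backpropagation through time for the assumed loss; returns dL/dW_h, dL/dW_i, dL/dW_o."""
    hs, ys = forward(xs, h0, W_h, W_i, W_o)
    dW_h, dW_i, dW_o = np.zeros_like(W_h), np.zeros_like(W_i), np.zeros_like(W_o)
    dh_next = np.zeros_like(h0)              # contribution from timestep t+1; zero after the last step
    for t in reversed(range(len(xs))):
        h, h_prev = hs[t + 1], hs[t]
        dy = ys[t] - targets[t]              # eqn. (4) for the squared-error loss
        dW_o += np.outer(dy, h)              # one term of the sum for dL/dW_o
        dh = W_o.T @ dy + dh_next            # eqn. (6): output path plus path through the next state
        da = (1.0 - h ** 2) * dh             # through tanh: multiply by diag(1 - h^2)
        dW_h += np.outer(da, h_prev)
        dW_i += np.outer(da, xs[t])
        dh_next = W_h.T @ da                 # passed back to the previous timestep
    return dW_h, dW_i, dW_o

rng = np.random.default_rng(0)
m, d, k, T = 4, 3, 2, 5
W_h = rng.normal(scale=0.5, size=(m, m))
W_i = rng.normal(scale=0.5, size=(m, d))
W_o = rng.normal(scale=0.5, size=(k, m))
xs = [rng.normal(size=d) for _ in range(T)]
targets = [rng.normal(size=k) for _ in range(T)]
h0 = np.zeros(m)
dW_h, dW_i, dW_o = bptt(xs, targets, h0, W_h, W_i, W_o)
```

Note how the loop runs from the last timestep backwards, exactly as eqn. (5) followed by repeated use of eqn. (6) suggests, and how the three weight gradients are accumulated over all timesteps because the weights are shared.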

Exploding or Vanishing Gradients

§ In recurrent nets (also in very deep nets), the final output is the
composition of a large number of non-linear transformations.
§ The derivatives through these compositions will tend to be either very
small or very large.
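§ If $h = (f \circ g)(x) = f(g(x))$, then $h'(x) = f'(g(x))\,g'(x)$.
§ If the individual derivatives are small (magnitude below 1), the product of many of them shrinks towards zero.
§ If the individual derivatives are large (magnitude above 1), the product of many of them blows up.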
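
A scalar toy example makes the effect concrete. The factors 0.9 and 1.1 and the depth 50 are arbitrary values chosen here for illustration.

```python
import numpy as np

depth = 50
shrink = np.prod(np.full(depth, 0.9))   # 0.9 ** 50, about 0.005
grow = np.prod(np.full(depth, 1.1))     # 1.1 ** 50, about 117
```

Even factors this close to 1 change the product by several orders of magnitude after 50 compositions.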

§ Let us see what happens with one learnable weight matrix $\theta = W_h$.

$$\frac{\partial L}{\partial \theta} = \sum_{t=1}^{T} \frac{\partial L^t}{\partial \theta}$$

$$\frac{\partial L^t}{\partial \theta} = \frac{\partial L^t}{\partial y^t}\frac{\partial y^t}{\partial \theta}$$

$$\frac{\partial y^t}{\partial \theta} = \frac{\partial y^t}{\partial h^t}\frac{\partial h^t}{\partial \theta}$$

§ But $h^t$ is a function of $h^{t-1}, h^{t-2}, \cdots, h^2, h^1$, and each of these is a function of $\theta$.
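§ Now we will resort to our friend again: if $u = f(x, y)$, where $x = \phi(t)$ and $y = \psi(t)$, then $\frac{\partial u}{\partial t} = \frac{\partial u}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial u}{\partial y}\frac{\partial y}{\partial t}$.

$$\frac{\partial h^t}{\partial \theta} = \sum_{k=1}^{t} \frac{\partial h^t}{\partial h^k}\frac{\partial h^k}{\partial \theta}$$

Here $\frac{\partial h^k}{\partial \theta}$ on the right-hand side is the direct dependence of $h^k$ on $\theta$, with $h^{k-1}$ held fixed.

§ And, for $g = \tanh$,

$$\frac{\partial h^t}{\partial h^k} = \frac{\partial h^t}{\partial h^{t-1}}\frac{\partial h^{t-1}}{\partial h^{t-2}}\cdots\frac{\partial h^{k+1}}{\partial h^k} = D^t W_h\, D^{t-1} W_h \cdots D^{k+1} W_h$$

where $D^j = \operatorname{diag}\left(1 - (h_1^j)^2,\ 1 - (h_2^j)^2,\ \cdots,\ 1 - (h_m^j)^2\right)$.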
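
The size of this product of Jacobians can be probed numerically. The sketch below assumes $g = \tanh$ and, purely for illustration, uses a scaled identity for $W_h$ and constant hidden states of 0.1, so that each factor is exactly $0.99\,s\,I$ for scale $s$. The function name `jacobian_product_norm` is hypothetical.

```python
import numpy as np

def jacobian_product_norm(W_h, hs):
    """Spectral norm of the product of diag(1 - h^2) @ W_h over the given hidden states."""
    J = np.eye(W_h.shape[0])
    for h in hs:                                   # hs ordered from h^{k+1} up to h^t
        J = (np.diag(1.0 - h ** 2) @ W_h) @ J      # newest factor multiplies on the left
    return np.linalg.norm(J, 2)

m, steps = 4, 30
hs = [np.full(m, 0.1)] * steps                     # toy hidden states, identical at each step
small = jacobian_product_norm(0.5 * np.eye(m), hs)   # factors with norm below 1: vanishes
large = jacobian_product_norm(1.5 * np.eye(m), hs)   # factors with norm above 1: explodes
```

In a trained network the hidden states and $W_h$ are of course not this regular, but the same mechanism applies: the norm of $\frac{\partial h^t}{\partial h^k}$ is governed by a product of $t - k$ factors.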


§ Exploding Gradients
  ▸ Easy to detect
  ▸ Clip the gradient at a threshold
§ Vanishing Gradients
  ▸ More difficult to detect
  ▸ Architectures designed to combat the problem of vanishing gradients. Example: LSTMs by Hochreiter and Schmidhuber.
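
One common way to clip is to rescale the gradient whenever its norm exceeds the threshold. The slide does not say which variant is meant, so clipping by norm is an assumption here, and `clip_by_norm` is a hypothetical helper name.

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm is at most threshold; the direction is preserved."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

g = np.array([3.0, 4.0])            # norm 5
clipped = clip_by_norm(g, 1.0)      # rescaled to norm 1, i.e. [0.6, 0.8]
```

Gradients already below the threshold pass through unchanged, so clipping only intervenes in the exploding case.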


Long Short Term Memory (LSTM)

§ Hochreiter & Schmidhuber (1997) solved the problem of getting an RNN to remember things for a long time (e.g., hundreds of time steps).
§ They designed a memory cell using logistic and linear units with multiplicative interactions.
§ Information is handled using three gates, namely forget, input and output.


Recall: Vanilla RNNs

§ In a standard RNN the repeating module has a simple structure.

Source: Chris Olah’s blog


§ LSTMs also have this chain-like structure, but the repeating module has a different structure.
§ Instead of having a single neural network layer, there are four,
interacting in a very special way.

§ The notations mean

Source: Chris Olah’s blog


LSTM Memory/Cell State

§ The key to LSTMs is the cell state, the horizontal line running
through the top of the diagram.
§ The cell state is kind of like a conveyor belt. It's very easy for information to just flow along it unchanged.

§ The LSTM does have the ability to remove or add information to the cell state, carefully regulated by gates.

Source: Chris Olah’s blog



§ A gate is composed of a sigmoid neural net layer and a pointwise multiplication operation.

§ The sigmoid outputs numbers between zero and one:
  ▸ Zero: “let nothing through”
  ▸ One: “let everything through”

Source: Chris Olah’s blog



§ An LSTM has three such gates.

Source: Chris Olah’s blog


Forget Gate

§ The first step is to decide what information is going to be thrown away from the cell state. This decision is made by the “forget gate layer”.
§ It looks at $h_{t-1}$ and $x_t$, and outputs a number between 0 and 1 for each number in the cell state $C_{t-1}$. A 1 represents “completely keep this” while a 0 represents “completely get rid of this”.
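
In the notation of the cited blog post, the forget gate can be written as below. The weight matrix $W_f$ and bias $b_f$ are names taken from that post rather than from the text above, $\sigma$ is the logistic sigmoid, and $[h_{t-1}, x_t]$ denotes the concatenation of the two vectors.

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$$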

Source: Chris Olah’s blog


Input Gate

§ The next step is to decide what new information is going to be stored in the cell state. This has two parts.
§ First, a sigmoid layer called the “input gate layer” decides which values to update. Next, a tanh layer creates a vector of new candidate values, $\tilde{C}_t$, that could be added to the state.

§ The next step combines these two to create an update to the state.

Source: Chris Olah’s blog


§ It's now time to update the old cell state, $C_{t-1}$, into the new cell state $C_t$.
§ This is done by multiplying the old state by $f_t$, forgetting the things that were decided to be forgotten earlier, and then adding $i_t * \tilde{C}_t$.
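
Written out in the notation of the cited blog post, the input gate, the candidate values and the cell state update are as follows. The parameter names $W_i$, $b_i$, $W_C$, $b_C$ come from that post and are assumptions relative to the text above; note that this $W_i$ is the input gate's weight matrix, not the input-to-hidden matrix of the vanilla RNN. The symbol $*$ denotes pointwise multiplication.

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)$$
$$\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)$$
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$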

Source: Chris Olah’s blog


Output Gate
§ Finally, we need to decide what is going to be the output. This output will be based on the cell state.
§ First, a sigmoid layer is run which decides what parts of the cell state are going to be output.
§ Then, the cell state is put through tanh (to push the values to be between −1 and 1) and this is multiplied by the output of the sigmoid gate.
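
The three gates and the cell state update fit into one short function. The following NumPy sketch follows the description above and the formulation in the cited blog post; the function name `lstm_step`, the parameter dictionary layout, the sizes and the random initialisation are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM timestep: returns the new hidden state h_t and the new cell state C_t."""
    z = np.concatenate([h_prev, x_t])                        # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])         # forget gate: what to drop from C_{t-1}
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])         # input gate: which values to update
    C_tilde = np.tanh(params["W_C"] @ z + params["b_C"])     # candidate values
    C_t = f_t * C_prev + i_t * C_tilde                       # cell state update
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])         # output gate: what part of the cell to expose
    h_t = o_t * np.tanh(C_t)                                 # squash the cell state, then gate it
    return h_t, C_t

# Hypothetical sizes and randomly initialised parameters.
rng = np.random.default_rng(0)
n_hidden, n_input = 3, 2
params = {}
for name in ("f", "i", "C", "o"):
    params["W_" + name] = rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_input))
    params["b_" + name] = np.zeros(n_hidden)

h, C = np.zeros(n_hidden), np.zeros(n_hidden)
for x in rng.normal(size=(4, n_input)):                      # a toy sequence of 4 inputs
    h, C = lstm_step(x, h, C, params)
```

The cell state `C` is only ever scaled by `f_t` and added to, which is the "conveyor belt" behaviour described earlier, while `h` stays strictly inside (−1, 1) because it is a sigmoid output times a tanh output.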

Source: Chris Olah’s blog
