Recurrent Neural Networks: CS60010: Deep Learning
Abir Das
IIT Kharagpur
§ Suppose we want our model to write down the caption of this image.
Figure: Several people with umbrellas walk down a sidewalk on a rainy day.
§ When the model generates ‘people’, we need a way to tell the model that ‘several’ has already been generated, and similarly for the other words.
Figure: A recurrent cell: the input x^t and the previous hidden state h^{t−1} (fed back through a delay) enter the hidden units, which produce the outputs; some function maps the old state to the new state h^t.
Figure: The RNN unrolled in time: the hidden states h^{t−1}, h^t, h^{t+1}, h^{t+2}, … are connected by the same weight matrix at every timestep.
§ Note that all the weight matrices are the same across timesteps, i.e., the weights are shared across all timesteps.
a^t = W_h h^{t−1} + W_i x^t    (1)
h^t = g(a^t)    (2)
y^t = W_o h^t    (3)
§ Note that we can have biases too; for simplicity, these are omitted.
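To make eqns. (1)–(3) concrete, here is a minimal NumPy sketch of the forward pass; the dimensions, the choice of tanh for g, the random initialisation, and the function names are illustrative assumptions, not part of the slides.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): input dim 4, hidden dim 3, output dim 2
d_in, d_h, d_out = 4, 3, 2
W_i = 0.1 * rng.standard_normal((d_h, d_in))   # input-to-hidden weights
W_h = 0.1 * rng.standard_normal((d_h, d_h))    # hidden-to-hidden weights, shared across timesteps
W_o = 0.1 * rng.standard_normal((d_out, d_h))  # hidden-to-output weights

def rnn_forward(xs, h0):
    """Run eqns. (1)-(3) over a sequence xs = [x^1, ..., x^T]."""
    h, hs, ys = h0, [], []
    for x in xs:
        a = W_h @ h + W_i @ x   # eqn (1): a^t = W_h h^{t-1} + W_i x^t
        h = np.tanh(a)          # eqn (2): h^t = g(a^t), here g = tanh
        y = W_o @ h             # eqn (3): y^t = W_o h^t
        hs.append(h)
        ys.append(y)
    return hs, ys

xs = [rng.standard_normal(d_in) for _ in range(5)]   # a toy length-5 sequence
hs, ys = rnn_forward(xs, np.zeros(d_h))

Note that the same W_h, W_i, W_o are reused at every timestep, which is exactly the weight sharing noted above.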
Figure: The full unrolled RNN: inputs x^t, x^{t+1}, x^{t+2}, …, x^T enter through W_i, the hidden states h^{t−1}, h^t, h^{t+1}, … are connected through the shared W_h, and outputs are produced through W_o, following a^t = W_h h^{t−1} + W_i x^t, h^t = g(a^t), y^t = W_o h^t.
∂L/∂y^t = (∂L/∂L^t) (∂L^t/∂y^t) = 1 · ∂L^t/∂y^t    (4)
§ ∂L^t/∂y^t is computable depending on the particular form of the loss function.
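For instance (an illustrative, assumed choice of loss, not specified in the slides): if L^t = ½ ‖y^t − ŷ^t‖² is the squared error against a target ŷ^t, then ∂L^t/∂y^t = (y^t − ŷ^t)^⊤.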
§ Let us compute ∂L/∂h^t. The subtlety here is that all the losses at and after timestep t are functions of h^t. So, let us first consider ∂L/∂h^T, where T is the last timestep.
∂L/∂h^T = (∂L/∂y^T) (∂y^T/∂h^T) = (∂L/∂y^T) W_o    (5)
§ ∂L/∂y^T we just computed on the last slide (eqn. (4)).
§ For a generic t, we need to compute ∂L/∂h^t. h^t affects y^t and also h^{t+1}. For this we will use something that we used while studying backpropagation for feedforward networks.
§ If u = f(x, y), where x = φ(t), y = ψ(t), then ∂u/∂t = (∂u/∂x)(∂x/∂t) + (∂u/∂y)(∂y/∂t).
∂L/∂h^t = (∂L/∂y^t)(∂y^t/∂h^t) + (∂L/∂h^{t+1})(∂h^{t+1}/∂h^t)    (6)
§ ∂L/∂y^t we computed in eqn. (4).
§ ∂y^t/∂h^t = W_o.
§ ∂L/∂h^{t+1} is almost the same as ∂L/∂h^t; it is just the same quantity at the next timestep.
§ ∂h^{t+1}/∂h^t = (∂h^{t+1}/∂a^{t+1})(∂a^{t+1}/∂h^t) = g′ · W_h.
§ Since g is an elementwise operation, g′ will be a diagonal matrix.
§ In particular, if g is tanh, then ∂h^{t+1}/∂h^t = diag(1 − (h^{t+1}_1)², 1 − (h^{t+1}_2)², …, 1 − (h^{t+1}_m)²) W_h.
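As a sanity check, this Jacobian can be verified numerically; the sketch below (NumPy, with assumed small dimensions and random weights) compares diag(1 − (h^{t+1})²) W_h against a finite-difference Jacobian of one recurrence step.

import numpy as np

rng = np.random.default_rng(1)
d_h, d_in = 3, 4
W_h = 0.5 * rng.standard_normal((d_h, d_h))
W_i = 0.5 * rng.standard_normal((d_h, d_in))

def step(h_prev, x):
    # h^{t+1} = tanh(W_h h^t + W_i x^{t+1})
    return np.tanh(W_h @ h_prev + W_i @ x)

h_t = rng.standard_normal(d_h)
x_next = rng.standard_normal(d_in)
h_next = step(h_t, x_next)

J_analytic = np.diag(1.0 - h_next ** 2) @ W_h   # diag(1 - (h^{t+1})^2) W_h

# Finite-difference Jacobian d h^{t+1} / d h^t, built column by column
eps = 1e-6
J_numeric = np.zeros((d_h, d_h))
for j in range(d_h):
    e = np.zeros(d_h)
    e[j] = eps
    J_numeric[:, j] = (step(h_t + e, x_next) - step(h_t - e, x_next)) / (2 * eps)

print(np.max(np.abs(J_analytic - J_numeric)))   # should be tiny (around 1e-9 or less)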
§ ∂L/∂h^t = (∂L/∂y^t) W_o + (∂L/∂h^{t+1}) diag(1 − (h^{t+1}_1)², 1 − (h^{t+1}_2)², …, 1 − (h^{t+1}_m)²) W_h.
§ All the other factors we can compute, but to compute ∂L/∂h^t we need ∂L/∂h^{t+1}.
§ From eqn. (5), we get ∂L/∂h^T, which gives ∂L/∂h^{T−1}, and so on backwards in time.
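A minimal sketch of this backward recursion over the hidden states (assuming the forward-pass sketch above, a squared-error loss per timestep as an illustrative choice, and row-vector gradients):

import numpy as np

def bptt_hidden_grads(targets, hs, ys, W_h, W_o):
    """Compute dL/dh^t for t = T, ..., 1 using the recursion above.

    Assumes L^t = 0.5 * ||y^t - target^t||^2 (an illustrative loss) and
    treats each gradient dL/dh^t as a row vector.
    """
    T = len(hs)
    dL_dh = [None] * T
    for t in reversed(range(T)):
        dL_dy = ys[t] - targets[t]              # dL^t/dy^t for squared error (eqn (4))
        grad = dL_dy @ W_o                      # (dL/dy^t) W_o
        if t + 1 < T:
            # (dL/dh^{t+1}) diag(1 - (h^{t+1})^2) W_h
            grad = grad + (dL_dh[t + 1] * (1.0 - hs[t + 1] ** 2)) @ W_h
        # At t = T there is no future term, matching eqn (5).
        dL_dh[t] = grad
    return dL_dh

# Example (hypothetical): dL_dh = bptt_hidden_grads(targets, hs, ys, W_h, W_o),
# with hs, ys from the rnn_forward sketch above and targets of the same shape as ys.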
§ In recurrent nets (also in very deep nets), the final output is the
composition of a large number of non-linear transformations.
§ The derivatives through these compositions will tend to be either very
small or very large.
§ If h = (f ∘ g)(x) = f(g(x)), then h′(x) = f′(g(x)) g′(x).
§ If the gradients are small, the product is small.
§ If the gradients are large, the product is large.
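For instance (a worked illustration, not from the slides): if every factor in the product has magnitude at most 0.25 (as the derivative of the sigmoid does), then after 20 compositions the product is at most 0.25²⁰ ≈ 9.1 × 10⁻¹³, while 20 factors of magnitude 2 already give 2²⁰ ≈ 10⁶.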
§ And
∂h^t/∂h^k = (∂h^t/∂h^{t−1}) (∂h^{t−1}/∂h^{t−2}) ⋯ (∂h^{k+1}/∂h^k)
= [diag(1 − (h^t_1)², …, 1 − (h^t_m)²) W_h] [diag(1 − (h^{t−1}_1)², …, 1 − (h^{t−1}_m)²) W_h] ⋯ [diag(1 − (h^{k+1}_1)², …, 1 − (h^{k+1}_m)²) W_h],
i.e., a product of t − k such factors.
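A tiny NumPy experiment (assumed hidden size, random W_h, and random stand-in hidden states) makes this product behaviour concrete: pushing a gradient back through many factors diag(1 − h²) W_h shrinks it towards zero or blows it up, depending on the scale of W_h.

import numpy as np

rng = np.random.default_rng(2)
d_h = 10

def grad_norm_after(steps, scale):
    """Norm of a (row-vector) gradient after `steps` factors diag(1 - h^2) W_h."""
    W_h = scale * rng.standard_normal((d_h, d_h))
    v = np.ones(d_h)                              # stand-in for dL/dh^t
    for _ in range(steps):
        h = np.tanh(rng.standard_normal(d_h))     # stand-in hidden state
        v = (v * (1.0 - h ** 2)) @ W_h            # one factor of the product
    return np.linalg.norm(v)

print(grad_norm_after(50, scale=0.1))   # typically collapses towards 0 (vanishing)
print(grad_norm_after(50, scale=3.0))   # typically blows up (exploding)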
§ Exploding Gradients
▶ Easy to detect.
▶ Remedy: clip the gradient at a threshold.
§ Vanishing Gradients
▶ More difficult to detect.
▶ Remedy: architectures designed to combat the problem of vanishing gradients, e.g., LSTMs by Hochreiter and Schmidhuber.
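For the exploding case, clipping can be as simple as rescaling the gradient whenever its norm exceeds a threshold; a common sketch (the threshold value here is an arbitrary illustrative choice):

import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale grad so that its L2 norm does not exceed `threshold`."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad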
LSTMs
§ LSTMs also have this chain-like structure, but the repeating module has a different structure.
§ Instead of having a single neural network layer, there are four, interacting in a very special way.
§ The LSTM has the ability to remove or add information to the cell state, carefully regulated by gates. Source: Chris Olah’s blog
Gate
§ A gate is a sigmoid layer followed by a pointwise multiplication; the sigmoid outputs numbers between 0 and 1 describing how much of each component should be let through.
Forget Gate
§ The forget gate f_t is a sigmoid layer that looks at h_{t−1} and x_t and decides what information to throw away from the cell state C_{t−1}.
Input Gate
§ The input gate i_t (another sigmoid layer) decides which values to update, and a tanh layer produces the candidate values C̃_t to be added to the state.
§ It is now time to update the old cell state, C_{t−1}, into the new cell state C_t.
§ This is done by multiplying the old state by f_t, forgetting the things that were decided to forget earlier, and adding i_t ∗ C̃_t.
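In symbols, the update described above is C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t, where ∗ denotes elementwise multiplication.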
Output Gate
§ Finally, we need to decide what is going to be the output. This output will be based on the cell state.
§ First, a sigmoid layer is run which decides what parts of the cell state are going to be output.
§ Then, the cell state is put through tanh (to push the values to be between −1 and 1) and this is multiplied by the output of the sigmoid gate.
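Putting the four interacting layers together, a minimal NumPy sketch of one LSTM timestep (the concatenation of h_{t−1} and x_t, the sigmoid/tanh choices, and the parameter names follow the standard formulation described above; the shapes and initialisation of the weights and biases are left to the caller and are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM timestep: returns the new hidden state h_t and cell state C_t."""
    z = np.concatenate([h_prev, x_t])      # the gates look at h_{t-1} and x_t together
    f_t = sigmoid(W_f @ z + b_f)           # forget gate: what to keep from C_{t-1}
    i_t = sigmoid(W_i @ z + b_i)           # input gate: which entries to update
    C_tilde = np.tanh(W_c @ z + b_c)       # candidate values for the cell state
    C_t = f_t * C_prev + i_t * C_tilde     # the cell state update described above
    o_t = sigmoid(W_o @ z + b_o)           # output gate: which parts of the cell to expose
    h_t = o_t * np.tanh(C_t)               # squash C_t to (-1, 1), then gate it
    return h_t, C_t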