Recurrent Neural Networks: CS60010: Deep Learning
Abir Das
IIT Kharagpur
§ Suppose we want our model to write down the caption of this image.
Figure: Several people with umbrellas walk down a sidewalk on a rainy day.
§ When the model generates ‘people’, we need a way to tell the model that ‘several’ has already been generated, and similarly for the other words.
Figure: A recurrent cell: the input x^t and the previous hidden state h^{t−1} (fed back through a delay) enter the hidden units, which produce the outputs; some function maps the old state to the new state h^t.
Figure: The RNN unrolled in time: the hidden states h^{t−1}, h^t, h^{t+1}, h^{t+2}, … are connected by the same weight matrix at every timestep.
§ Note that all the weight matrices are the same across timesteps, i.e., the weights are shared across all timesteps.
a^t = W_h h^{t−1} + W_i x^t    (1)
h^t = g(a^t)    (2)
y^t = W_o h^t    (3)
§ Note that we can have biases too; for simplicity, these are omitted.
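To make eqns. (1)–(3) concrete, here is a minimal NumPy sketch of the forward pass; the dimensions, the choice of tanh for g, the random initialisation, and the function names are illustrative assumptions, not part of the slides.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): input dim 4, hidden dim 3, output dim 2
d_in, d_h, d_out = 4, 3, 2
W_i = 0.1 * rng.standard_normal((d_h, d_in))   # input-to-hidden weights
W_h = 0.1 * rng.standard_normal((d_h, d_h))    # hidden-to-hidden weights, shared across timesteps
W_o = 0.1 * rng.standard_normal((d_out, d_h))  # hidden-to-output weights

def rnn_forward(xs, h0):
    """Run eqns. (1)-(3) over a sequence xs = [x^1, ..., x^T]."""
    h, hs, ys = h0, [], []
    for x in xs:
        a = W_h @ h + W_i @ x   # eqn (1): a^t = W_h h^{t-1} + W_i x^t
        h = np.tanh(a)          # eqn (2): h^t = g(a^t), here g = tanh
        y = W_o @ h             # eqn (3): y^t = W_o h^t
        hs.append(h)
        ys.append(y)
    return hs, ys

xs = [rng.standard_normal(d_in) for _ in range(5)]   # a toy length-5 sequence
hs, ys = rnn_forward(xs, np.zeros(d_h))

Note that the same W_h, W_i, W_o are reused at every timestep, which is exactly the weight sharing noted above.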
Figure: The full unrolled RNN: inputs x^t, x^{t+1}, x^{t+2}, …, x^T enter through W_i, the hidden states h^{t−1}, h^t, h^{t+1}, … are connected through the shared W_h, and outputs are produced through W_o, following a^t = W_h h^{t−1} + W_i x^t, h^t = g(a^t), y^t = W_o h^t.
∂L/∂y^t = (∂L/∂L^t) (∂L^t/∂y^t) = 1 · ∂L^t/∂y^t    (4)
§ ∂L^t/∂y^t is computable depending on the particular form of the loss function.
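For instance (an illustrative, assumed choice of loss, not specified in the slides): if L^t = ½ ‖y^t − ŷ^t‖² is the squared error against a target ŷ^t, then ∂L^t/∂y^t = (y^t − ŷ^t)^⊤.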
§ Let us compute ∂L/∂h^t. The subtlety here is that all the losses at and after timestep t are functions of h^t. So, let us first consider ∂L/∂h^T, where T is the last timestep.
∂L/∂h^T = (∂L/∂y^T) (∂y^T/∂h^T) = (∂L/∂y^T) W_o    (5)
§ ∂L/∂y^T we just computed on the last slide (eqn. (4)).
§ For a generic t, we need to compute ∂L/∂h^t. h^t affects y^t and also h^{t+1}. For this we will use something that we used while studying backpropagation for feedforward networks.
§ If u = f(x, y), where x = φ(t), y = ψ(t), then ∂u/∂t = (∂u/∂x)(∂x/∂t) + (∂u/∂y)(∂y/∂t).
∂L/∂h^t = (∂L/∂y^t)(∂y^t/∂h^t) + (∂L/∂h^{t+1})(∂h^{t+1}/∂h^t)    (6)
§ ∂L/∂y^t we computed in eqn. (4).
§ ∂y^t/∂h^t = W_o.
§ ∂L/∂h^{t+1} is almost the same as ∂L/∂h^t; it is just the same quantity at the next timestep.
§ ∂h^{t+1}/∂h^t = (∂h^{t+1}/∂a^{t+1})(∂a^{t+1}/∂h^t) = g′ · W_h.
§ Since g is an elementwise operation, g′ will be a diagonal matrix.
§ In particular, if g is tanh, then ∂h^{t+1}/∂h^t = diag(1 − (h^{t+1}_1)², 1 − (h^{t+1}_2)², …, 1 − (h^{t+1}_m)²) W_h.
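As a sanity check, this Jacobian can be verified numerically; the sketch below (NumPy, with assumed small dimensions and random weights) compares diag(1 − (h^{t+1})²) W_h against a finite-difference Jacobian of one recurrence step.

import numpy as np

rng = np.random.default_rng(1)
d_h, d_in = 3, 4
W_h = 0.5 * rng.standard_normal((d_h, d_h))
W_i = 0.5 * rng.standard_normal((d_h, d_in))

def step(h_prev, x):
    # h^{t+1} = tanh(W_h h^t + W_i x^{t+1})
    return np.tanh(W_h @ h_prev + W_i @ x)

h_t = rng.standard_normal(d_h)
x_next = rng.standard_normal(d_in)
h_next = step(h_t, x_next)

J_analytic = np.diag(1.0 - h_next ** 2) @ W_h   # diag(1 - (h^{t+1})^2) W_h

# Finite-difference Jacobian d h^{t+1} / d h^t, built column by column
eps = 1e-6
J_numeric = np.zeros((d_h, d_h))
for j in range(d_h):
    e = np.zeros(d_h)
    e[j] = eps
    J_numeric[:, j] = (step(h_t + e, x_next) - step(h_t - e, x_next)) / (2 * eps)

print(np.max(np.abs(J_analytic - J_numeric)))   # should be tiny (around 1e-9 or less)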
§ ∂L/∂h^t = (∂L/∂y^t) W_o + (∂L/∂h^{t+1}) diag(1 − (h^{t+1}_1)², 1 − (h^{t+1}_2)², …, 1 − (h^{t+1}_m)²) W_h.
§ All the other factors we can compute, but to compute ∂L/∂h^t we need ∂L/∂h^{t+1}.
§ From eqn. (5), we get ∂L/∂h^T, which gives ∂L/∂h^{T−1}, and so on backwards in time.
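A minimal sketch of this backward recursion over the hidden states (assuming the forward-pass sketch above, a squared-error loss per timestep as an illustrative choice, and row-vector gradients):

import numpy as np

def bptt_hidden_grads(targets, hs, ys, W_h, W_o):
    """Compute dL/dh^t for t = T, ..., 1 using the recursion above.

    Assumes L^t = 0.5 * ||y^t - target^t||^2 (an illustrative loss) and
    treats each gradient dL/dh^t as a row vector.
    """
    T = len(hs)
    dL_dh = [None] * T
    for t in reversed(range(T)):
        dL_dy = ys[t] - targets[t]              # dL^t/dy^t for squared error (eqn (4))
        grad = dL_dy @ W_o                      # (dL/dy^t) W_o
        if t + 1 < T:
            # (dL/dh^{t+1}) diag(1 - (h^{t+1})^2) W_h
            grad = grad + (dL_dh[t + 1] * (1.0 - hs[t + 1] ** 2)) @ W_h
        # At t = T there is no future term, matching eqn (5).
        dL_dh[t] = grad
    return dL_dh

# Example (hypothetical): dL_dh = bptt_hidden_grads(targets, hs, ys, W_h, W_o),
# with hs, ys from the rnn_forward sketch above and targets of the same shape as ys.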
§ In recurrent nets (also in very deep nets), the final output is the
composition of a large number of non-linear transformations.
§ The derivatives through these compositions will tend to be either very
small or very large.
§ If h = (f ∘ g)(x) = f(g(x)), then h′(x) = f′(g(x)) g′(x).
§ If the gradients are small, the product is small.
§ If the gradients are large, the product is large.
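For instance (a worked illustration, not from the slides): if every factor in the product has magnitude at most 0.25 (as the derivative of the sigmoid does), then after 20 compositions the product is at most 0.25²⁰ ≈ 9.1 × 10⁻¹³, while 20 factors of magnitude 2 already give 2²⁰ ≈ 10⁶.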
§ And
∂h^t/∂h^k = (∂h^t/∂h^{t−1}) (∂h^{t−1}/∂h^{t−2}) ⋯ (∂h^{k+1}/∂h^k)
= [diag(1 − (h^t_1)², …, 1 − (h^t_m)²) W_h] [diag(1 − (h^{t−1}_1)², …, 1 − (h^{t−1}_m)²) W_h] ⋯ [diag(1 − (h^{k+1}_1)², …, 1 − (h^{k+1}_m)²) W_h],
i.e., a product of t − k such factors.
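A tiny NumPy experiment (assumed hidden size, random W_h, and random stand-in hidden states) makes this product behaviour concrete: pushing a gradient back through many factors diag(1 − h²) W_h shrinks it towards zero or blows it up, depending on the scale of W_h.

import numpy as np

rng = np.random.default_rng(2)
d_h = 10

def grad_norm_after(steps, scale):
    """Norm of a (row-vector) gradient after `steps` factors diag(1 - h^2) W_h."""
    W_h = scale * rng.standard_normal((d_h, d_h))
    v = np.ones(d_h)                              # stand-in for dL/dh^t
    for _ in range(steps):
        h = np.tanh(rng.standard_normal(d_h))     # stand-in hidden state
        v = (v * (1.0 - h ** 2)) @ W_h            # one factor of the product
    return np.linalg.norm(v)

print(grad_norm_after(50, scale=0.1))   # typically collapses towards 0 (vanishing)
print(grad_norm_after(50, scale=3.0))   # typically blows up (exploding)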
§ Exploding Gradients
▶ Easy to detect.
▶ Remedy: clip the gradient at a threshold.
§ Vanishing Gradients
▶ More difficult to detect.
▶ Remedy: architectures designed to combat the problem of vanishing gradients, e.g., LSTMs by Hochreiter and Schmidhuber.
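For the exploding case, clipping can be as simple as rescaling the gradient whenever its norm exceeds a threshold; a common sketch (the threshold value here is an arbitrary illustrative choice):

import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale grad so that its L2 norm does not exceed `threshold`."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad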
LSTMs
§ LSTMs also have this chain-like structure, but the repeating module has a different structure.
§ Instead of having a single neural network layer, there are four, interacting in a very special way.
§ The LSTM has the ability to remove or add information to the cell state, carefully regulated by gates. Source: Chris Olah’s blog
Gate
§ A gate is a sigmoid layer followed by a pointwise multiplication; the sigmoid outputs numbers between 0 and 1 describing how much of each component should be let through.
Forget Gate
§ The forget gate f_t is a sigmoid layer that looks at h_{t−1} and x_t and decides what information to throw away from the cell state C_{t−1}.
Input Gate
§ The input gate i_t (another sigmoid layer) decides which values to update, and a tanh layer produces the candidate values C̃_t to be added to the state.
§ It is now time to update the old cell state, C_{t−1}, into the new cell state C_t.
§ This is done by multiplying the old state by f_t, forgetting the things that were decided to forget earlier, and adding i_t ∗ C̃_t.
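In symbols, the update described above is C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t, where ∗ denotes elementwise multiplication.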
Output Gate
§ Finally, we need to decide what is going to be the output. This output will be based on the cell state.
§ First, a sigmoid layer is run which decides what parts of the cell state are going to be output.
§ Then, the cell state is put through tanh (to push the values to be between −1 and 1) and this is multiplied by the output of the sigmoid gate.
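Putting the four interacting layers together, a minimal NumPy sketch of one LSTM timestep (the concatenation of h_{t−1} and x_t, the sigmoid/tanh choices, and the parameter names follow the standard formulation described above; the shapes and initialisation of the weights and biases are left to the caller and are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM timestep: returns the new hidden state h_t and cell state C_t."""
    z = np.concatenate([h_prev, x_t])      # the gates look at h_{t-1} and x_t together
    f_t = sigmoid(W_f @ z + b_f)           # forget gate: what to keep from C_{t-1}
    i_t = sigmoid(W_i @ z + b_i)           # input gate: which entries to update
    C_tilde = np.tanh(W_c @ z + b_c)       # candidate values for the cell state
    C_t = f_t * C_prev + i_t * C_tilde     # the cell state update described above
    o_t = sigmoid(W_o @ z + b_o)           # output gate: which parts of the cell to expose
    h_t = o_t * np.tanh(C_t)               # squash C_t to (-1, 1), then gate it
    return h_t, C_t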