Understanding LSTM Networks


Recurrent Neural Networks


An unrolled recurrent neural network
The Problem of Long-Term Dependencies
RNN short-term dependencies
A language model trying to predict the next word based on the previous ones:

the clouds are in the sky


[Figure: unrolled RNN — inputs x_0 … x_4, repeated cell A, outputs h_0 … h_4; the relevant context ("clouds") is only a few steps away from the word being predicted ("sky")]
RNN long-term dependencies
A language model trying to predict the next word based on the previous ones:

I grew up in India… I speak fluent Hindi.

[Figure: unrolled RNN — inputs x_0 … x_{t-1}, x_t, repeated cell A, outputs h_0 … h_t; the relevant context ("India") appears many steps before the word being predicted ("Hindi")]
Standard RNN
Backpropagation Through Time (BPTT)
RNN forward pass
s_t = tanh(U x_t + W s_{t-1})

ŷ_t = softmax(V s_t)

E(y, ŷ) = ∑_t E_t(y_t, ŷ_t), with the per-step cross-entropy loss E_t(y_t, ŷ_t) = −y_t log ŷ_t

[Figure: unrolled RNN computation graph; the input weights U, recurrent weights W, and output weights V are shared across all time steps]
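A minimal NumPy sketch of this forward pass (the dimensions, the zero initial state, and the softmax implementation are illustrative assumptions; U, W, V are the shared weight matrices from the figure):

import numpy as np

def rnn_forward(X, U, W, V):
    # X: (T, input_dim) sequence of inputs
    # U: (hidden_dim, input_dim), W: (hidden_dim, hidden_dim), V: (output_dim, hidden_dim)
    T = X.shape[0]
    hidden_dim = W.shape[0]
    s = np.zeros((T + 1, hidden_dim))             # s[-1] serves as the zero initial state
    y_hat = np.zeros((T, V.shape[0]))
    for t in range(T):
        s[t] = np.tanh(U @ X[t] + W @ s[t - 1])   # s_t = tanh(U x_t + W s_{t-1})
        logits = V @ s[t]
        e = np.exp(logits - logits.max())         # numerically stable softmax
        y_hat[t] = e / e.sum()                    # ŷ_t = softmax(V s_t)
    return s, y_hat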
Backpropagation Through Time
∂E/∂W = ∑_t ∂E_t/∂W

∂E_3/∂W = (∂E_3/∂ŷ_3) (∂ŷ_3/∂s_3) (∂s_3/∂W)
But s_3 = tanh(U x_3 + W s_2), so s_3 depends on s_2, which in turn depends on W and s_1, and so on.

∂E_3/∂W = ∑_{k=0}^{3} (∂E_3/∂ŷ_3) (∂ŷ_3/∂s_3) (∂s_3/∂s_k) (∂s_k/∂W)
The Vanishing Gradient Problem
∂E_3/∂W = ∑_{k=0}^{3} (∂E_3/∂ŷ_3) (∂ŷ_3/∂s_3) (∂s_3/∂s_k) (∂s_k/∂W)

∂E_3/∂W = ∑_{k=0}^{3} (∂E_3/∂ŷ_3) (∂ŷ_3/∂s_3) (∏_{j=k+1}^{3} ∂s_j/∂s_{j-1}) (∂s_k/∂W)

● The derivative of a vector with respect to a vector is a matrix, called the Jacobian
● The 2-norm of the above Jacobian matrix ∂s_j/∂s_{j-1} has an upper bound of 1
● tanh maps all values into a range between -1 and 1, and its derivative is bounded by 1
● With multiple matrix multiplications, the gradient values shrink exponentially fast, as sketched below
● Gradient contributions from "far away" steps become zero, so those steps end up not contributing to what is learned
● Depending on the activation functions and network parameters, gradients could explode instead of vanish
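A small numeric sketch of the shrinking product of Jacobians (the hidden size, weight scale, and number of steps are arbitrary illustrative choices, not values from the slides):

import numpy as np

np.random.seed(0)
hidden_dim = 50
W = 0.1 * np.random.randn(hidden_dim, hidden_dim)   # recurrent weights
s = np.random.randn(hidden_dim)

prod = np.eye(hidden_dim)                 # running product of Jacobians ∂s_j/∂s_{j-1}
for step in range(1, 21):
    s = np.tanh(W @ s)
    jac = np.diag(1.0 - s ** 2) @ W       # Jacobian of tanh(W s + ...) w.r.t. the previous state
    prod = jac @ prod
    if step % 5 == 0:
        # the 2-norm of the accumulated product decays roughly exponentially with the number of steps
        print(step, np.linalg.norm(prod, 2))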
Activation function
Basic LSTM
Unrolling the LSTM through time
Constant error carousel
The standard RNN update

s_t = tanh(U x_t + W s_{t-1})

is replaced by a memory cell. A candidate value C̃_t is computed from the current input (and the previous state), gated by an input gate i_t, and added to the previous cell contents:

C_t = C̃_t · i_t + C_{t-1}

The cell's output is gated by an output gate o_t, giving C_t · o_t.

[Figure: memory cell — candidate C̃_t and input gate i_t (σ) feed a product node (Π); the cell has an edge from the previous time step (and current input) and an edge to the next time step whose weight is fixed at 1; the output gate o_t (σ) gates the cell output C_t · o_t]
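Why the edge with weight fixed at 1 matters (a sketch of the argument, using the update above before a forget gate is added): the cell state is updated additively, so

∂C_t/∂C_{t-1} = 1

and error carried backwards along the cell's self-recurrent edge is neither squashed by an activation function nor rescaled by a weight matrix. This is the "constant error carousel" that lets gradients survive over long time spans.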
Input gate
● Use contextual information to decide
● Store the input into memory
● Protect the memory from being overwritten by other irrelevant inputs

[Figure: the same memory cell, highlighting the input gate i_t, which gates the candidate C̃_t in the update C_t = C̃_t · i_t + C_{t-1}]
Output gate
● Use contextual information to decide
● Access information in memory
● Block irrelevant information

[Figure: the same memory cell, highlighting the output gate o_t, which gates the cell output C_t · o_t]
Forget or reset gate

The cell update becomes

C_t = C̃_t · i_t + C_{t-1} · f_t

where the forget gate f_t uses contextual information to decide when the cell should reset (forget) its stored contents.

[Figure: the same memory cell, with a forget gate f_t (σ, Π) placed on the self-recurrent edge whose weight is otherwise fixed at 1]
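For concreteness, the gates are commonly computed as sigmoid functions of the previous hidden state and the current input (a standard formulation from the referenced material; the weight names W_i, W_o, W_f and the concatenated input [h_{t-1}, x_t] are conventions, not fixed by the slides):

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)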
LSTM with four interacting layers
The cell state
Gates

● Composed of a sigmoid layer and a pointwise multiplication; the sigmoid outputs values between 0 and 1, describing how much of each component to let through
Step-by-Step LSTM Walk Through
Forget gate layer
Input gate layer
Updating the cell state
Output gate layer
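A minimal NumPy sketch of one step of this walk-through, using the equations from the colah.github.io post listed in the references (the function and weight names here are illustrative; each W_* acts on the concatenation [h_{t-1}, x_t]):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    # Each W_* has shape (hidden_dim, hidden_dim + input_dim).
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)            # forget gate layer
    i_t = sigmoid(W_i @ z + b_i)            # input gate layer
    C_tilde = np.tanh(W_C @ z + b_C)        # candidate cell values
    C_t = f_t * C_prev + i_t * C_tilde      # update the cell state
    o_t = sigmoid(W_o @ z + b_o)            # output gate
    h_t = o_t * np.tanh(C_t)                # new hidden state
    return h_t, C_t

Feeding h_t and C_t back in with the next input unrolls this step through time, as in the earlier diagrams.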
References
● http://colah.github.io/posts/2015-08-Understanding-LSTMs/
● http://www.wildml.com/
● http://nikhilbuduma.com/2015/01/11/a-deep-dive-into-recurrent-neural-networks/
● http://deeplearning.net/tutorial/lstm.html
● https://theclevermachine.files.wordpress.com/2014/09/act-funs.png
● http://blog.terminal.com/demistifying-long-short-term-memory-lstm-recurrent-neural-networks/
● Lipton, Z. C. & Berkowitz, J. (2015), 'A Critical Review of Recurrent Neural Networks for Sequence Learning'
● Hochreiter, S. & Schmidhuber, J. (1997), 'Long Short-Term Memory', Neural Computation 9 (8), 1735-1780
● Gers, F. A.; Schmidhuber, J. & Cummins, F. A. (2000), 'Learning to Forget: Continual Prediction with LSTM', Neural Computation 12 (10), 2451-2471
