Sequence Modeling - Recurrent Networks
Biplab Banerjee
Sequential learning problem
[Figure: inputs are read step by step through hidden states h1, h2, h3, h4, which build up a "memory"/"context" vector h5 that feeds a classifier. Example application: machine translation, https://translate.google.com/]
Machine translation
• Multiple input – multiple output (or sequence to sequence)
Many to many: machine translation, dialogue
Different input-output combinations
Image and video captioning
The word predicted at a given point can be inferred from the few preceding words
Recurrence: $h_t = f_W(x_t, h_{t-1})$
(new state = a function $f_W$ with parameters $W$, applied to the input at time $t$ and the old state)
$x_t$: input at time $t$, fed to the hidden layer
$h_t$: hidden representation at time $t$
$y_t$: output at time $t$, produced by a classifier on top of $h_t$
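As a minimal sketch of this recurrence (the weight names and the tanh choice anticipate the vanilla cell on the following slides; shapes are illustrative):

```python
import numpy as np

def rnn_step(W_x, W_h, x_t, h_prev):
    """One application of the recurrence h_t = f_W(x_t, h_{t-1}),
    with f_W instantiated as the tanh cell used later in the slides."""
    return np.tanh(W_x @ x_t + W_h @ h_prev)
```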
Unrolling the RNN
[Figure: the network unrolled over time. Starting from $h_0$, the same hidden layer maps $(x_1, h_0) \to h_1$ at $t=1$, $(x_2, h_1) \to h_2$ at $t=2$, and $(x_3, h_2) \to h_3$ at $t=3$; a classifier maps each $h_t$ to an output $y_t$.]
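Unrolling means applying the same cell, with the same shared weights, at every time step. A sketch matching the $t = 1, 2, 3$ picture, with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3                                  # input and hidden sizes (illustrative)
W_x = rng.normal(size=(m, n)) * 0.5          # shared across all time steps
W_h = rng.normal(size=(m, m)) * 0.5
xs = [rng.normal(size=n) for _ in range(3)]  # x1, x2, x3

h = np.zeros(m)                              # h0
for t, x in enumerate(xs, start=1):
    h = np.tanh(W_x @ x + W_h @ h)           # h1, h2, h3
    # each h here could feed a classifier to produce y_t
```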
Vanilla RNN Cell
$h_t = f_W(x_t, h_{t-1}) = \tanh\left(W \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix}\right)$
J. Elman, Finding structure in time, Cognitive Science 14(2), pp. 179–211, 1990
Vanilla RNN Cell
$h_t = f_W(x_t, h_{t-1}) = \tanh\left(W \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix}\right)$

$\tanh(a) = \dfrac{e^{a} - e^{-a}}{e^{a} + e^{-a}} = 2\sigma(2a) - 1$, where $\sigma$ is the logistic sigmoid.
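Both forms of tanh are easy to verify numerically; a quick illustrative check:

```python
import numpy as np

a = np.linspace(-3.0, 3.0, 13)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))   # logistic sigmoid
lhs = np.tanh(a)
assert np.allclose(lhs, (np.exp(a) - np.exp(-a)) / (np.exp(a) + np.exp(-a)))
assert np.allclose(lhs, 2.0 * sigma(2.0 * a) - 1.0)  # tanh(a) = 2*sigma(2a) - 1
```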
Vanilla RNN Cell
$h_t = f_W(x_t, h_{t-1}) = \tanh\left(W \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix}\right)$

$\dfrac{d}{da}\tanh(a) = 1 - \tanh^2(a)$
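A finite-difference check of this derivative (illustrative values):

```python
import numpy as np

a, eps = 0.7, 1e-6
numeric = (np.tanh(a + eps) - np.tanh(a - eps)) / (2.0 * eps)  # central difference
analytic = 1.0 - np.tanh(a) ** 2                               # 1 - tanh^2(a)
assert abs(numeric - analytic) < 1e-8
```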
Vanilla RNN Cell
$h_t = f_W(x_t, h_{t-1}) = \tanh\left(W \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix}\right) = \tanh(W_x x_t + W_h h_{t-1})$

where $x_t$ is $n$-dimensional, $h_{t-1}$ is $m$-dimensional, $W_x$ is an $m \times n$ matrix, and $W_h$ is an $m \times m$ matrix.
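The stacked form and the split form are the same computation: $W$ of shape $m \times (n+m)$ splits into $W_x$ ($m \times n$) and $W_h$ ($m \times m$). A quick check with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 3
x_t = rng.normal(size=n)                     # n-dimensional input
h_prev = rng.normal(size=m)                  # m-dimensional previous state
W = rng.normal(size=(m, n + m))              # joint weight matrix
W_x, W_h = W[:, :n], W[:, n:]                # W_x: m x n, W_h: m x m

h_stacked = np.tanh(W @ np.concatenate([x_t, h_prev]))
h_split = np.tanh(W_x @ x_t + W_h @ h_prev)
assert np.allclose(h_stacked, h_split)
```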
Formulation
$h_t = \tanh(W h_{t-1} + U x_t)$ (this slide writes $W$ for the hidden-to-hidden weights $W_h$ and $U$ for the input-to-hidden weights $W_x$)
• The weight updates are computed for each copy in the unfolded network, then accumulated and applied to the model weights and biases.
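A sketch of that accumulation for the tanh cell of the previous slide, under a toy loss $L = \frac{1}{2}\|h_T\|^2$ (variable names are mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, T = 4, 3, 5
U = rng.normal(size=(m, n)) * 0.5            # input-to-hidden weights
W = rng.normal(size=(m, m)) * 0.5            # hidden-to-hidden weights
xs = [rng.normal(size=n) for _ in range(T)]

# Forward pass: keep every hidden state of the unfolded network.
hs = [np.zeros(m)]
for x in xs:
    hs.append(np.tanh(W @ hs[-1] + U @ x))

# Backward pass for the toy loss L = 0.5 * ||h_T||^2, so dL/dh_T = h_T.
dh = hs[-1].copy()
dW, dU = np.zeros_like(W), np.zeros_like(U)
for t in reversed(range(T)):
    da = (1.0 - hs[t + 1] ** 2) * dh         # backprop through tanh
    dW += np.outer(da, hs[t])                # accumulate gradient from this copy
    dU += np.outer(da, xs[t])
    dh = W.T @ da                            # send gradient to h_{t-1}
# dW and dU are then applied once to the shared weights W and U.
```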
Forward and backward pass
Short-term dependency
Long-term dependency
RNNs fail to preserve long-term dependencies due to the vanishing gradient problem.
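The effect is easy to demonstrate: each backward step multiplies the gradient by $W^\top \operatorname{diag}(1 - h^2)$, so with moderate weights the gradient norm shrinks geometrically (illustrative numbers):

```python
import numpy as np

rng = np.random.default_rng(3)
m = 8
W = rng.normal(size=(m, m)) * 0.1            # modest weights -> vanishing regime
h = rng.normal(size=m)
dh = np.ones(m)                              # gradient arriving from the future
for t in range(1, 31):
    h = np.tanh(W @ h)
    dh = W.T @ ((1.0 - h ** 2) * dh)         # one backward step through the cell
    if t % 10 == 0:
        print(f"after {t} steps: |grad| = {np.linalg.norm(dh):.2e}")
```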
Input Gate
First, we pass the previous hidden state and the current input into a sigmoid function, which decides which values will be updated by squashing them to between 0 and 1: 0 means not important, 1 means important. We also pass the hidden state and the current input into a tanh function, which squishes values to between -1 and 1 to help regulate the network. Then we multiply the tanh output with the sigmoid output; the sigmoid output decides which information from the tanh output is important to keep.
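A sketch of that computation (the weight names W_i, W_c are illustrative; biases are omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate(W_i, W_c, h_prev, x_t):
    """Sigmoid picks what to update (0 = not important, 1 = important);
    tanh proposes candidate values in (-1, 1); their product keeps
    only the important candidates."""
    v = np.concatenate([h_prev, x_t])        # previous hidden state + current input
    i = sigmoid(W_i @ v)                     # what to update
    c_tilde = np.tanh(W_c @ v)               # candidate values
    return i * c_tilde
```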
Updating the cell state
The previous cell state is multiplied elementwise by the forget gate's output, and the input gate's contribution (sigmoid output times tanh candidates) is added; the result is the new cell state.
Output Gate
First, we pass the previous hidden state and the current input into a sigmoid function. Then we pass the newly updated cell state through the tanh function and multiply the tanh output with the sigmoid output to decide what information the hidden state should carry. The output is the new hidden state.
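A matching sketch (W_o is illustrative, biases omitted; c_t is the updated cell state from the previous step):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_gate(W_o, h_prev, x_t, c_t):
    """The sigmoid decides what to reveal; the updated cell state,
    squashed by tanh, is filtered by it to give the new hidden state."""
    o = sigmoid(W_o @ np.concatenate([h_prev, x_t]))
    return o * np.tanh(c_t)                  # h_t: the cell's output
```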
Summarizing the equations
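In the usual notation, with $\sigma$ the sigmoid, $\odot$ elementwise multiplication, and $[h_{t-1}, x_t]$ the concatenated previous hidden state and input, the standard LSTM equations are:

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{input gate} \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) && \text{candidate values} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{cell state update} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{output gate} \\
h_t &= o_t \odot \tanh(c_t) && \text{new hidden state}
\end{aligned}
$$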
Training LSTM
• Backpropagation through time
Update Gate
The update gate acts similarly to the forget and input gates of an LSTM: it decides what information to throw away and what new information to add (a sketch combining both GRU gates follows the reset gate below).
Reset Gate
The reset gate is another gate, used to decide how much past information to forget.
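A minimal sketch of one GRU step combining the two gates just described (weight names are illustrative, biases omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(W_z, W_r, W_h, h_prev, x_t):
    """One GRU step: the update gate z blends old state and candidate;
    the reset gate r controls how much past information enters the candidate."""
    v = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ v)                               # update gate
    r = sigmoid(W_r @ v)                               # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))  # candidate
    return (1.0 - z) * h_prev + z * h_tilde            # new hidden state
```

Unlike the LSTM, the GRU keeps no separate cell state: the hidden state itself carries the memory, which is why it needs only two gates.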