Chapter 10 - Sequence Modelling: Recurrent and Recursive Nets


Sequence Modelling:

Recurrent and Recursive Nets

Jen-Tzung Chien

Department of Electrical and Computer Engineering


Department of Computer Science
National Chiao Tung University, Hsinchu

September 19, 2016


Outline
• Unfolding Computational Graphs
• Recurrent Neural Networks
• Bidirectional RNNs
• Encoder-Decoder Sequence-to-Sequence Architectures
• Deep Recurrent Networks
• Recursive Neural Networks
• The Challenge of Long-Term Dependencies
• Echo State Networks
• Leaky Units and Other Strategies for Multiple Time Scales
• The Long Short-Term Memory and Other Gated RNNs
• Optimization for Long-Term Dependencies
• Explicit Memory



Unfolding Computational Graphs

• Unfolding a recursive or recurrent computation yields a flow graph that has a repetitive structure, typically corresponding to a chain of events

• s(t) = f(s(t−1); θ)  (recurrence with no external input)

• s(t) = f(s(t−1), x(t); θ)  (recurrence driven by an external signal x(t))
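As a minimal illustration of unfolding (the transition function, parameter shapes and toy data below are illustrative assumptions, not from the chapter), the unfolded graph is simply the same function f with the same shared parameters θ applied at every step:

```python
import numpy as np

def f(s_prev, x_t, theta):
    """One step of the recurrence s(t) = f(s(t-1), x(t); theta)."""
    W, U = theta                      # shared parameters, reused at every step
    return np.tanh(W @ s_prev + U @ x_t)

def unfold(s0, xs, theta):
    """Unfold the recurrence over a whole input sequence."""
    states, s = [], s0
    for x_t in xs:                    # same f and same theta at every time step
        s = f(s, x_t, theta)
        states.append(s)
    return states

# toy usage
rng = np.random.default_rng(0)
theta = (rng.normal(scale=0.1, size=(3, 3)), rng.normal(scale=0.1, size=(3, 2)))
states = unfold(np.zeros(3), [rng.normal(size=2) for _ in range(5)], theta)
```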


Design Patterns for Recurrent Neural Networks


• Recurrent networks that produce an output at each time step and have
recurrent connections between hidden units


• Recurrent networks that produce an output at each time step and have
recurrent connections only from the output at one time step to the hidden
units at the next time step

• Recurrent networks with recurrent connections between hidden units that read
an entire sequence and then produce a single output



Recurrent Neural Networks

• We now develop the forward propagation equations for the RNN

• Hyperbolic tangent activation function

• Update equations for forward propagation:

a(t) = b + W h(t−1) + U x(t)
h(t) = tanh(a(t))
o(t) = c + V h(t)
p(t) = softmax(o(t))

• Total loss is the sum of the losses over all the time steps
L({x(1), ..., x(τ)}, {y(1), ..., y(τ)}) = ∑t L(t) = −∑t log pmodel(y(t) | {x(1), ..., x(t)})
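A minimal NumPy sketch of these forward equations and the summed negative log-likelihood; the dimensions and toy data are illustrative placeholders:

```python
import numpy as np

def rnn_forward(x_seq, y_seq, U, W, V, b, c, h0):
    """Forward pass of the hidden-to-hidden RNN and the summed NLL loss."""
    h, loss, hs, ps = h0, 0.0, [], []
    for x_t, y_t in zip(x_seq, y_seq):
        a_t = b + W @ h + U @ x_t          # a(t) = b + W h(t-1) + U x(t)
        h   = np.tanh(a_t)                 # h(t) = tanh(a(t))
        o_t = c + V @ h                    # o(t) = c + V h(t)
        p_t = np.exp(o_t - o_t.max())
        p_t /= p_t.sum()                   # p(t) = softmax(o(t))
        loss += -np.log(p_t[y_t])          # L(t) = -log p_model(y(t) | x(1..t))
        hs.append(h); ps.append(p_t)
    return loss, hs, ps

# toy usage: 4 input dims, 5 hidden units, 3 output classes
rng = np.random.default_rng(0)
U, W, V = (rng.normal(scale=0.1, size=s) for s in [(5, 4), (5, 5), (3, 5)])
b, c, h0 = np.zeros(5), np.zeros(3), np.zeros(5)
x_seq = [rng.normal(size=4) for _ in range(6)]
y_seq = [t % 3 for t in range(6)]
loss, hs, ps = rnn_forward(x_seq, y_seq, U, W, V, b, c, h0)
```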


Teacher Forcing and Networks with Output Recurrence

• The network with recurrent connections only from the output at one time step
to the hidden units at the next time step is less powerful because it lacks
hidden-to-hidden recurrent connections

• Models that have recurrent connections from their outputs leading back into
the model may be trained with teacher forcing


Teacher Forcing and Networks with Output Recurrence


• A training technique that is applicable to RNNs that have connections from
their output to their hidden states at the next time step

• The conditional maximum likelihood criterion is

log p(y(1), y(2) | x(1), x(2)) = log p(y(2) | y(1), x(1), x(2)) + log p(y(1) | x(1), x(2))
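A rough sketch of teacher forcing for a model whose only recurrence runs through the output (parameter names and toy shapes are assumed for illustration): at training time the ground-truth y(t−1) is fed back in, while at test time the model's own prediction is.

```python
import numpy as np

def step(y_prev_onehot, x_t, params):
    """One step of a model whose only recurrence goes through the output."""
    W, U, V, b, c = params
    h = np.tanh(b + W @ y_prev_onehot + U @ x_t)   # hidden state from previous *output*
    o = c + V @ h
    p = np.exp(o - o.max()); p /= p.sum()
    return p

def run(x_seq, y_seq, params, n_classes, teacher_forcing=True):
    """Training-time (teacher forcing) vs. test-time (free-running) unrolling."""
    y_prev, loss = np.zeros(n_classes), 0.0
    for t, x_t in enumerate(x_seq):
        p = step(y_prev, x_t, params)
        loss += -np.log(p[y_seq[t]])
        nxt = y_seq[t] if teacher_forcing else int(p.argmax())
        y_prev = np.eye(n_classes)[nxt]            # feed ground truth or own prediction
    return loss

# toy usage
rng = np.random.default_rng(0)
params = (rng.normal(scale=0.1, size=(4, 3)), rng.normal(scale=0.1, size=(4, 2)),
          rng.normal(scale=0.1, size=(3, 4)), np.zeros(4), np.zeros(3))
x_seq = [rng.normal(size=2) for _ in range(5)]; y_seq = [0, 1, 2, 1, 0]
print(run(x_seq, y_seq, params, n_classes=3))
```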


Back-Propagation Through Time (BPTT)

∂L/∂L(t) = 1

(∇o(t) L)i = ∂L/∂oi(t) = (∂L/∂L(t)) (∂L(t)/∂oi(t)) = ŷi(t) − 1i,y(t)

∇h(τ) L = V⊤ ∇o(τ) L

For t < τ:

∇h(t) L = (∂h(t+1)/∂h(t))⊤ (∇h(t+1) L) + (∂o(t)/∂h(t))⊤ (∇o(t) L)
= W⊤ (∇h(t+1) L) diag(1 − (h(t+1))²) + V⊤ (∇o(t) L)


Back-Propagation Through Time (BPTT)

∇c L = ∑t (∂o(t)/∂c)⊤ ∇o(t) L = ∑t ∇o(t) L

∇b L = ∑t (∂h(t)/∂b(t))⊤ ∇h(t) L = ∑t diag(1 − (h(t))²) ∇h(t) L

∇V L = ∑t ∑i (∂L/∂oi(t)) ∇V oi(t) = ∑t (∇o(t) L) h(t)⊤

∇W L = ∑t ∑i (∂L/∂hi(t)) ∇W(t) hi(t) = ∑t diag(1 − (h(t))²) (∇h(t) L) h(t−1)⊤

∇U L = ∑t ∑i (∂L/∂hi(t)) ∇U(t) hi(t) = ∑t diag(1 − (h(t))²) (∇h(t) L) x(t)⊤
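These gradient equations can be implemented as a short backward loop. Below is a minimal sketch for the tanh RNN of the forward-pass sketch earlier (variable names are illustrative); dh_next carries the W⊤ diag(1 − (h(t+1))²) term propagated back from the next time step.

```python
import numpy as np

def bptt(x_seq, y_seq, U, W, V, b, c, h0):
    """Back-propagation through time for the tanh RNN sketched above."""
    # forward pass, caching hidden states and softmax outputs
    hs, ps, h = [h0], [], h0
    for x_t in x_seq:
        h = np.tanh(b + W @ h + U @ x_t)
        o = c + V @ h
        p = np.exp(o - o.max()); p /= p.sum()
        hs.append(h); ps.append(p)

    grads = {k: np.zeros_like(v) for k, v in dict(U=U, W=W, V=V, b=b, c=c).items()}
    dh_next = np.zeros_like(h0)          # W^T diag(1 - h(t+1)^2) grad_h(t+1) L, zero at t = tau
    for t in reversed(range(len(x_seq))):
        do = ps[t].copy(); do[y_seq[t]] -= 1.0      # (grad_o(t) L)_i = p_i - 1_{i, y(t)}
        dh = V.T @ do + dh_next                     # grad_h(t) L
        da = (1 - hs[t + 1] ** 2) * dh              # through tanh: diag(1 - h(t)^2) grad_h(t) L
        grads['c'] += do
        grads['b'] += da
        grads['V'] += np.outer(do, hs[t + 1])       # (grad_o(t) L) h(t)^T
        grads['W'] += np.outer(da, hs[t])           # diag(...) (grad_h(t) L) h(t-1)^T
        grads['U'] += np.outer(da, x_seq[t])        # diag(...) (grad_h(t) L) x(t)^T
        dh_next = W.T @ da                          # propagate to time step t-1
    return grads

# toy usage, matching the shapes of the forward-pass sketch
rng = np.random.default_rng(0)
U, W, V = (rng.normal(scale=0.1, size=s) for s in [(5, 4), (5, 5), (3, 5)])
b, c, h0 = np.zeros(5), np.zeros(3), np.zeros(5)
x_seq = [rng.normal(size=4) for _ in range(6)]
y_seq = [t % 3 for t in range(6)]
grads = bptt(x_seq, y_seq, U, W, V, b, c, h0)
```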


Recurrent Networks as Directed Graphical Models


P(Y) = P(y(1), ..., y(τ)) = ∏t=1..τ P(y(t) | y(t−1), y(t−2), ..., y(1))

L = ∑t L(t)

L(t) = −log P(y(t) | y(t−1), y(t−2), ..., y(1))


RNNs to Represent Conditional Probability Distributions


• RNNs allow the extension of the graphical model view to represent a conditional
distribution over y given x

• When x is a fixed-size vector, we can simply make it an extra input of the RNN that generates the y sequence

• Some common ways of providing an extra input to an RNN are:


1. as an extra input at each time step, or
2. as the initial state s0, or
3. both



Bidirectional RNNs

• In many applications we need an output which may depend on the whole input sequence

• Bidirectional RNNs were invented to address that need

• They combine an RNN that moves forward through time beginning from the start of the sequence with another RNN that moves backward through time beginning from the end of the sequence
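A minimal bidirectional-RNN sketch (toy sizes, no output layer shown): one ordinary RNN runs left to right, a second runs right to left, and the two hidden states at each step are concatenated so that an output at time t can depend on the whole input sequence.

```python
import numpy as np

def birnn(x_seq, Wf, Uf, bf, Wb, Ub, bb):
    """Bidirectional RNN: one pass forward in time, one backward, states concatenated."""
    T = len(x_seq)
    h_fwd, h = [], np.zeros(Wf.shape[0])
    for t in range(T):                       # left-to-right pass
        h = np.tanh(bf + Wf @ h + Uf @ x_seq[t])
        h_fwd.append(h)
    h_bwd, g = [None] * T, np.zeros(Wb.shape[0])
    for t in reversed(range(T)):             # right-to-left pass
        g = np.tanh(bb + Wb @ g + Ub @ x_seq[t])
        h_bwd[t] = g
    # the representation at time t now summarizes both past and future
    return [np.concatenate([h_fwd[t], h_bwd[t]]) for t in range(T)]

rng = np.random.default_rng(0)
Wf, Wb = rng.normal(scale=0.1, size=(4, 4)), rng.normal(scale=0.1, size=(4, 4))
Uf, Ub = rng.normal(scale=0.1, size=(4, 3)), rng.normal(scale=0.1, size=(4, 3))
states = birnn([rng.normal(size=3) for _ in range(5)], Wf, Uf, np.zeros(4), Wb, Ub, np.zeros(4))
```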



Encoder-Decoder Sequence-to-Sequence Architectures

• An RNN can be trained to map an input sequence to an output sequence which is not necessarily of the same length

• We often call the input to the RNN the "context". We want to produce a representation of this context, C

• Encoder-decoder or sequence-to-sequence architecture:
  1. an encoder or reader or input RNN processes the input sequence
  2. a decoder or writer or output RNN is conditioned on that fixed-length vector to generate the output sequence
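A rough encoder-decoder sketch under simplifying assumptions (the context C is used only as the decoder's initial state, and decoding is greedy over a small output vocabulary); names and sizes are illustrative:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max()); return e / e.sum()

def encode(x_seq, params):
    """Encoder (reader) RNN: read the input sequence and summarize it as a context vector C."""
    W, U, b = params
    h = np.zeros(W.shape[0])
    for x_t in x_seq:
        h = np.tanh(b + W @ h + U @ x_t)
    return h                                   # C = final encoder state

def decode(C, params, n_steps, n_classes):
    """Decoder (writer) RNN conditioned on C via its initial state; greedy decoding."""
    W, U, V, b, c = params
    h, y_prev, out = C.copy(), np.zeros(n_classes), []
    for _ in range(n_steps):
        h = np.tanh(b + W @ h + U @ y_prev)
        y_t = int(softmax(c + V @ h).argmax())
        out.append(y_t)
        y_prev = np.eye(n_classes)[y_t]        # feed back the previous output symbol
    return out

rng = np.random.default_rng(0)
enc = (rng.normal(scale=0.1, size=(4, 4)), rng.normal(scale=0.1, size=(4, 2)), np.zeros(4))
dec = (rng.normal(scale=0.1, size=(4, 4)), rng.normal(scale=0.1, size=(4, 3)),
       rng.normal(scale=0.1, size=(3, 4)), np.zeros(4), np.zeros(3))
print(decode(encode([rng.normal(size=2) for _ in range(6)], enc), dec, n_steps=4, n_classes=3))
```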



Deep Recurrent Networks

• RNNs can be decomposed into three blocks of parameters and associated transformations:
  1. from input to hidden state
  2. from previous hidden state to next hidden state
  3. from hidden state to output


Deep Recurrent Networks


• Left: the hidden state broken down into groups

• Middle: deeper computation (e.g., an MLP) introduced

• Right: the lengthened shortest path linking different time steps can be mitigated by introducing skip connections
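A minimal sketch of a stacked (deep) RNN in which the hidden state is organized into layers, each layer receiving the layer below as its input; the layer sizes are illustrative:

```python
import numpy as np

def deep_rnn_step(x_t, hs_prev, layers):
    """One time step of a stacked RNN: each layer's input is the layer below it."""
    hs_new, inp = [], x_t
    for (W, U, b), h_prev in zip(layers, hs_prev):
        h = np.tanh(b + W @ h_prev + U @ inp)   # recurrence within the layer
        hs_new.append(h)
        inp = h                                 # feed upward to the next layer
    return hs_new

rng = np.random.default_rng(0)
sizes, x_dim = [5, 4], 3
layers, prev = [], x_dim
for n in sizes:
    layers.append((rng.normal(scale=0.1, size=(n, n)),
                   rng.normal(scale=0.1, size=(n, prev)), np.zeros(n)))
    prev = n
hs = [np.zeros(n) for n in sizes]
for x_t in [rng.normal(size=x_dim) for _ in range(6)]:
    hs = deep_rnn_step(x_t, hs, layers)
```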



Recursive Neural Networks

• A recursive network has a computational graph that generalizes that of the recurrent network from a chain to a tree

• One clear advantage of recursive nets over recurrent nets is that, for a sequence of the same length τ, the depth can be drastically reduced from τ to O(log τ)
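A minimal recursive-network sketch over a binary tree (the combination function and the toy tree are illustrative): the same parameters are shared across all nodes, and a balanced tree over the sequence elements has depth O(log τ).

```python
import numpy as np

def recursive_net(node, U, W_left, W_right, b):
    """Recursive network: the same parameters are applied at every tree node."""
    if isinstance(node, tuple):                 # internal node: combine the two children
        hl = recursive_net(node[0], U, W_left, W_right, b)
        hr = recursive_net(node[1], U, W_left, W_right, b)
        return np.tanh(b + W_left @ hl + W_right @ hr)
    return np.tanh(U @ node)                    # leaf: embed the input vector

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(4, 3))
W_left, W_right = (rng.normal(scale=0.1, size=(4, 4)) for _ in range(2))
b = np.zeros(4)
leaves = [rng.normal(size=3) for _ in range(4)]
tree = ((leaves[0], leaves[1]), (leaves[2], leaves[3]))   # balanced tree: depth O(log tau)
root = recursive_net(tree, U, W_left, W_right, b)
```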



The Challenge of Long-Term Dependencies

• The basic problem is that gradients propagated over many stages tend to
either vanish or explode

• Even if we assume that the parameters are such that the recurrent network is
stable, the difficulty with long-term dependencies arises from the exponentially
smaller weights given to long-term interactions compared to short-term ones

• When composing many non-linear functions, the result is highly non-linear


The Challenge of Long-Term Dependencies

• We can think of the recurrence relation

h(t) = W⊤ h(t−1)
h(t) = (W^t)⊤ h(0)

With the eigendecomposition W = Q Λ Q⊤ (Q orthogonal),

h(t) = Q⊤ Λ^t Q h(0)

so components of h(0) along eigenvectors whose eigenvalues have magnitude less than one decay to zero, while those with magnitude greater than one explode

• If we make a non-recurrent network that has a different weight w(t) at each time step, the situation is different

• Very deep feed-forward networks with carefully chosen scaling can thus avoid
the vanishing and exploding gradient problem
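A tiny numerical illustration of the eigendecomposition argument, using a made-up symmetric W with eigenvalues 0.5 and 1.5: after t steps the components of h(t) along the eigenvectors scale as 0.5^t and 1.5^t.

```python
import numpy as np

# Symmetric W = Q Lambda Q^T with eigenvalues 0.5 and 1.5 (made-up example)
Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(2, 2)))
W = Q @ np.diag([0.5, 1.5]) @ Q.T

h0 = Q @ np.ones(2)                     # equal weight on both eigendirections
for t in [1, 10, 50]:
    ht = np.linalg.matrix_power(W.T, t) @ h0    # h(t) = (W^t)^T h(0)
    coords = Q.T @ ht                   # components along the eigenvectors: 0.5^t and 1.5^t
    print(t, coords)
# the 0.5-direction vanishes while the 1.5-direction explodes
```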



Echo State Networks

• One approach to avoiding this difficulty is to set the recurrent weights such
that the recurrent hidden units can capture the history of past inputs, and
learn only the output weights

• This can be done by viewing the recurrent net as a dynamical system and setting the input and recurrent weights such that the dynamical system is near the edge of stability

• The strategy of echo state networks is simply to fix the weights to have some
spectral radius such as 3, where information is carried forward through time
but does not explode due to the stabilizing effect of saturating non-linearities
like tanh

• The techniques used to set the weights in ESNs could be used to initialize the
weights in a recurrent network, helping to learn long-term dependencies
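A minimal echo-state-network sketch (the reservoir size, the spectral radius of 1.2 and the sine-wave task are illustrative choices, not from the chapter): the recurrent and input weights are fixed after rescaling, and only the output weights are fit, here by ridge regression.

```python
import numpy as np

def make_reservoir(n, spectral_radius=1.2, seed=0):
    """Fixed random recurrent weights rescaled to a chosen spectral radius."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n, n))
    W *= spectral_radius / max(abs(np.linalg.eigvals(W)))
    return W, rng.normal(scale=0.5, size=(n, 1))      # W (fixed), U (fixed)

def esn_fit(x_seq, y_seq, W, U, ridge=1e-6):
    """Run the fixed reservoir, then learn only the output weights by ridge regression."""
    h, H = np.zeros(W.shape[0]), []
    for x_t in x_seq:
        h = np.tanh(W @ h + U[:, 0] * x_t)            # reservoir state update (untrained)
        H.append(h)
    H, Y = np.array(H), np.array(y_seq)
    V = np.linalg.solve(H.T @ H + ridge * np.eye(H.shape[1]), H.T @ Y)
    return V, H

# toy usage: predict the next value of a sine wave
t = np.linspace(0, 20, 400)
W, U = make_reservoir(100, spectral_radius=1.2)
V, H = esn_fit(np.sin(t[:-1]), np.sin(t[1:]), W, U)
```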



Leaky Units and Other Strategies for Multiple Time Scales

• One way to obtain coarse time scales is to add direct connections from variables
in the distant past to variables in the present

• Another way to obtain paths on which the product of derivatives is close to 1 is to have units with linear self-connections and a weight near 1 on these connections (called leaky units; a minimal sketch follows this list)

• There are two basic strategies for setting the time constants used by leaky
units
1. manually fix them to values that remain constant
2. make the time constants free parameters and learn them

• Another approach is the idea of organizing the state of the RNN at multiple
time-scales, with information flowing more easily through long distances at the
slower time scales
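A minimal leaky-unit sketch, as referenced in the list above (parameter names and the leak value α are illustrative): the linear self-connection with weight α near 1 keeps the derivative along the self-loop close to 1 and gives the unit a slow time scale.

```python
import numpy as np

def leaky_rnn(x_seq, W, U, b, alpha):
    """Leaky units: a linear self-connection with weight alpha near 1 gives a slow time scale."""
    h, states = np.zeros(W.shape[0]), []
    for x_t in x_seq:
        h_tilde = np.tanh(b + W @ h + U @ x_t)       # ordinary nonlinear update
        h = alpha * h + (1.0 - alpha) * h_tilde      # leaky integration; alpha ~ 1 => long memory
        states.append(h)
    return states

rng = np.random.default_rng(0)
W, U, b = rng.normal(scale=0.1, size=(4, 4)), rng.normal(scale=0.1, size=(4, 2)), np.zeros(4)
x_seq = [rng.normal(size=2) for _ in range(10)]
slow = leaky_rnn(x_seq, W, U, b, alpha=0.95)   # slow time scale
fast = leaky_rnn(x_seq, W, U, b, alpha=0.10)   # close to an ordinary unit
```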



The Long Short-Term Memory and Other Gated RNNs

• The most effective sequence models used in practical applications are called
gated RNNs, including the long short-term memory and networks based on
the gated recurrent unit

• Like leaky units, gated RNNs are based on the idea of creating paths through
time that have derivatives that neither vanish nor explode

• Different from leaky units, gated RNNs generalize this to connection weights
that may change at each time step


The Long-Short-Term-Memory Architecture

• The clever idea of introducing self-loops to produce paths where the gradient
can flow for long durations is a core contribution of the initial long short-term
memory (LSTM) model

• LSTM networks have "LSTM cells" that have an internal recurrence, in addition to the outer recurrence of the RNN

• Each cell has the same inputs and outputs as a vanilla recurrent network, but
has more parameters and a system of gating units that controls the flow of
information


The Long-Short-Term-Memory Architecture

• The self-loop weight is controlled by a forget gate unit


fi(t) = σ(bi^f + ∑j Ui,j^f xj(t) + ∑j Wi,j^f hj(t−1))

h(t): current hidden layer vector
b^f, U^f, W^f are respectively the biases, input weights and recurrent weights for the forget gates

si(t) = fi(t) si(t−1) + gi(t) σ(bi + ∑j Ui,j xj(t) + ∑j Wi,j hj(t−1))

gi(t) = σ(bi^g + ∑j Ui,j^g xj(t) + ∑j Wi,j^g hj(t−1))

hi(t) = tanh(si(t)) qi(t)

qi(t) = σ(bi^o + ∑j Ui,j^o xj(t) + ∑j Wi,j^o hj(t−1))

gi(t): external input gate
qi(t): output gate
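A direct NumPy transcription of the gate equations above for a single layer of LSTM cells (the parameter naming and toy sizes are illustrative; a practical implementation would batch these products):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, p):
    """One LSTM step following the gate equations above (single cell layer)."""
    f = sigmoid(p['bf'] + p['Uf'] @ x_t + p['Wf'] @ h_prev)   # forget gate (self-loop weight)
    g = sigmoid(p['bg'] + p['Ug'] @ x_t + p['Wg'] @ h_prev)   # external input gate
    q = sigmoid(p['bo'] + p['Uo'] @ x_t + p['Wo'] @ h_prev)   # output gate
    s = f * s_prev + g * sigmoid(p['b'] + p['U'] @ x_t + p['W'] @ h_prev)  # internal state
    h = np.tanh(s) * q                                        # hidden state
    return h, s

rng = np.random.default_rng(0)
n_h, n_x = 4, 3
p = {}
for name in ['f', 'g', 'o', '']:
    p['b' + name] = np.zeros(n_h)
    p['U' + name] = rng.normal(scale=0.1, size=(n_h, n_x))
    p['W' + name] = rng.normal(scale=0.1, size=(n_h, n_h))
h, s = np.zeros(n_h), np.zeros(n_h)
for x_t in [rng.normal(size=n_x) for _ in range(5)]:
    h, s = lstm_step(x_t, h, s, p)
```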



Other Gated RNNs

• The main difference from the LSTM is that a single gating unit simultaneously controls the forgetting factor and the decision to update the state unit

• Update equations:

hi(t) = ui(t−1) hi(t−1) + (1 − ui(t−1)) σ(bi + ∑j Ui,j xj(t−1) + ∑j Wi,j rj(t−1) hj(t−1))

ui(t) = σ(bi^u + ∑j Ui,j^u xj(t) + ∑j Wi,j^u hj(t))

ri(t) = σ(bi^r + ∑j Ui,j^r xj(t) + ∑j Wi,j^r hj(t))

u: "update" gate, acts like a conditional leaky integrator that can linearly gate any dimension
r: "reset" gate, controls which parts of the state get used to compute the next target state
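A NumPy sketch of these update equations (parameter names and toy sizes are illustrative; the gates are computed at the current step rather than carrying the explicit t−1 indexing of the slide, and a comment marks where the slide's σ differs from the tanh used in many GRU implementations):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One gated-recurrent-unit step following the update/reset-gate equations above."""
    u = sigmoid(p['bu'] + p['Uu'] @ x_t + p['Wu'] @ h_prev)   # update gate
    r = sigmoid(p['br'] + p['Ur'] @ x_t + p['Wr'] @ h_prev)   # reset gate
    # candidate state: the slide's equations use sigma here; many GRU implementations use tanh
    h_cand = sigmoid(p['b'] + p['U'] @ x_t + p['W'] @ (r * h_prev))
    return u * h_prev + (1.0 - u) * h_cand                    # leaky-integrator-style mix

rng = np.random.default_rng(0)
n_h, n_x = 4, 3
p = {}
for name in ['u', 'r', '']:
    p['b' + name] = np.zeros(n_h)
    p['U' + name] = rng.normal(scale=0.1, size=(n_h, n_x))
    p['W' + name] = rng.normal(scale=0.1, size=(n_h, n_h))
h = np.zeros(n_h)
for x_t in [rng.normal(size=n_x) for _ in range(5)]:
    h = gru_step(x_t, h, p)
```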



Optimization for Long-Term Dependencies

• An interesting idea is that second derivatives may vanish at the same time
that first derivatives vanish

• If the second derivative shrinks at a similar rate to the first derivative, then
the ratio of first and second derivatives may remain relatively constant

• Unfortunately, second-order methods have many drawbacks, including high computational cost, the need for a large mini-batch, and a tendency to be attracted to saddle points

• These second-order approaches have largely been replaced by simply applying SGD to LSTMs, which achieves promising results

• It is often much easier to design a model that is easy to optimize than it is to design a more powerful optimization algorithm


Clipping Gradients

• Strongly non-linear functions such as those computed by a recurrent net over many time steps tend to have derivatives that can be either very large or very small in magnitude

• The difficulty that arises is that when the parameter gradient is very large, a
gradient descent parameter update could throw the parameters very far, into
a region where the objective function is larger, undoing much of the work that
had been done to reach the current solution


Clipping Gradients

• A simple type of solution has been in use by practitioners for many years:
clipping the gradient

• One option is to clip the parameter gradient from a mini-batch element-wise just before the parameter update

• Another is to clip the norm of the gradient just before the parameter update
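A minimal sketch of both options, applied to a gradient stored as a dictionary of arrays (the clipping thresholds are arbitrary examples):

```python
import numpy as np

def clip_elementwise(grads, v):
    """Clip each gradient coordinate to [-v, v] just before the parameter update."""
    return {k: np.clip(g, -v, v) for k, g in grads.items()}

def clip_by_norm(grads, threshold):
    """Rescale the whole gradient if its global norm exceeds the threshold."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    if norm > threshold:
        grads = {k: g * (threshold / norm) for k, g in grads.items()}
    return grads

grads = {'W': np.array([[3.0, -40.0], [0.2, 5.0]]), 'b': np.array([12.0, -0.1])}
print(clip_elementwise(grads, v=1.0))
print(clip_by_norm(grads, threshold=1.0))
```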


Regularizing to Encourage Information Flow

• Gradient clipping helps to deal with exploding gradients, but it does not help
with vanishing gradients

• We discussed the idea of creating paths in the computational graph of the unfolded recurrent architecture along which the product of gradients associated with arcs is near 1

• One approach is the LSTM with its self-loops and gating mechanisms; another idea is to regularize or constrain the parameters so as to encourage "information flow"
• We want (∇h(t) L) ∂h(t)/∂h(t−1) to be as large as ∇h(t) L

• Ω = ∑t ( ‖(∇h(t) L) ∂h(t)/∂h(t−1)‖ / ‖∇h(t) L‖ − 1 )²
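A sketch of how the penalty Ω could be evaluated once the back-propagated gradients ∇h(t)L and the Jacobians ∂h(t)/∂h(t−1) are available; both are filled with placeholder values here (for a tanh RNN the Jacobian is diag(1 − (h(t))²) W), so this illustrates the formula rather than a full training loop.

```python
import numpy as np

def info_flow_penalty(grads_h, jacobians):
    """Omega = sum_t (||grad_h(t)L . dh(t)/dh(t-1)|| / ||grad_h(t)L|| - 1)^2."""
    omega = 0.0
    for g, J in zip(grads_h, jacobians):
        omega += (np.linalg.norm(g @ J) / np.linalg.norm(g) - 1.0) ** 2
    return omega

# toy usage: for a tanh RNN, dh(t)/dh(t-1) = diag(1 - h(t)^2) W
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4, 4))
hs = [np.tanh(rng.normal(size=4)) for _ in range(5)]      # placeholder hidden states
grads_h = [rng.normal(size=4) for _ in range(5)]          # placeholder back-propagated gradients
jacobians = [np.diag(1 - h ** 2) @ W for h in hs]
print(info_flow_penalty(grads_h, jacobians))
```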



Explicit Memory
• Neural networks excel at storing implicit knowledge but struggle to memorize
facts

• Neural networks lack the equivalent of the working memory system that allows
human beings to explicitly hold and manipulate pieces of information that are
relevant to achieving some goal

• To resolve this difficulty, memory networks that include a set of memory cells were introduced

• These memory cells are typically augmented to contain a vector, rather than
the single scalar stored by an LSTM or GRU memory cell

• If the content of a memory cell is copied (not forgotten) at most time steps,
then the information it contains can be propagated forward in time and the
gradients propagated backward in time without either vanishing or exploding


Explicit Memory
• The task network can choose to read from or write to specific memory
addresses

• Explicit memory seems to allow models to learn tasks that ordinary RNNs or
LSTM RNNs cannot learn
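A rough sketch of content-based soft addressing in the spirit of memory networks and neural Turing machines (the similarity measure, sharpness β and erase/add vectors are generic illustrative choices, not the mechanism of any specific paper): a read takes an attention-weighted sum of memory rows, and a write softly updates the rows selected by the same weights.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max()); return e / e.sum()

def read(memory, key, beta=10.0):
    """Content-based soft read: attention weights over memory rows, weighted sum returned."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    w = softmax(beta * sims)           # larger beta -> closer to a discrete address
    return w @ memory, w

def write(memory, w, erase, add):
    """Soft write: each row is partially erased and then has new content added."""
    return memory * (1 - np.outer(w, erase)) + np.outer(w, add)

rng = np.random.default_rng(0)
M = rng.normal(size=(8, 4))                      # 8 memory cells, each a 4-d vector
value, w = read(M, key=M[3] + 0.01 * rng.normal(size=4))
M = write(M, w, erase=np.ones(4) * 0.5, add=rng.normal(size=4))
```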
