Deep Learning
NIPS 2015 Tutorial
Geoff Hinton, Yoshua Bengio & Yann LeCun

Breakthrough: Deep Learning
Machine Learning, AI & No Free Lunch
Four key ingredients for ML towards AI
1. Lots & lots of data
2. Very flexible models
3. Enough computing power
4. Powerful priors that can defeat the curse of dimensionality
Classical Symbolic AI vs Learning Distributed Representations
Two symbols are equally far from each other.
Concepts are not represented by symbols in our brain, but by patterns of activation (Connectionism, 1980s).
Geoffrey Hinton
[Figure: a layered network of input units, hidden units, and output units whose activation patterns represent concepts such as person, cat, dog.]
David Rumelhart
[Figure residue: object detectors emerging in a deep scene CNN. Segmentations of SUN-database images by pool5 units of Places-CNN, object counts in SUN, and per-class detection results with confidences based on unit activations (J = Jaccard segmentation index, AP = average precision-recall).
Figure 11: (a) Segmentation of images from the SUN database using pool5 units. (b) Precision-recall curves for discovered objects. (c) Histogram of AP for all discovered object classes. Note that there are 115 units in pool5 of Places-CNN not detecting objects, suggesting incomplete learning or a complementary texture-based or part-based representation.
Figure 9: (a) Segmentations from pool5 in Places-CNN. Many classes are encoded by several units.]
Deep Learning:
Automating
Feature Discovery
[Figure (I. Goodfellow): four paradigms compared. Rule-based systems: input → hand-designed program → output. Classic machine learning: input → hand-designed features → mapping from features → output. Representation learning: input → learned features → mapping from features → output. Deep learning: input → simplest features → most complex features → mapping from features → output.]
Logic gates, formal neurons, RBF units: each family is a universal approximator, but a shallow network may need an exponential number of units (on the order of 2^n) to represent some functions.
Backprop (modular approach)
[Diagram: a stack of modules — Linear (W1, B1) → ReLU → Linear (W2, B2) → ReLU → Linear (W3, B3) → Squared Distance against Y (desired output), with X as input.]
Loss averaged over the training set: L(Θ) = (1/p) Σ_k C(X^k, Y^k, Θ), with Θ = (W1, B1, W2, B2, W3, B3).
All major deep learning frameworks use modules (inspired by SN/Lush, 1991): Torch7, Theano, TensorFlow.
[Diagram: a classifier built from the same modules — Linear (W1, B1) → ReLU → Linear (W2, B2) → LogSoftMax → NegativeLogLikelihood cost C(X, Y, Θ), with input X and label Y.]
[Diagram: backprop through a stack of modules X_i = F_i(X_{i-1}, W_i), from input X = X_0 and desired output Y up to the cost. The backward pass propagates dC/dX_i and computes dC/dW_i at each module:
dC/dX_{i-1} = dC/dX_i · ∂F_i(X_{i-1}, W_i)/∂X_{i-1}
dC/dW_i = dC/dX_i · ∂F_i(X_{i-1}, W_i)/∂W_i ]
Running Backprop (Torch7 example)
[The slide shows Torch7 code building the same stack: Linear (W1, B1) → ReLU → Linear (W2, B2) → LogSoftMax → NegativeLogLikelihood, with input X and label Y.]
Module Classes
ReLU: Y_i = max(0, X_i)
Duplicate: Y1 = X, Y2 = X
Add: Y = X1 + X2
Max: Y = max(X1, X2)
LogSoftMax: Y_i = X_i − log Σ_j exp(X_j)
Linear: Y = W X + B
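The module abstraction is easy to make concrete. Below is a minimal numpy sketch, not the tutorial's Torch7 code, of Linear and ReLU modules with forward/backward passes chained as in the diagrams above; layer sizes and the learning rate are arbitrary.

import numpy as np

class Linear:
    """y = x W^T + b ; stores the input for the backward pass."""
    def __init__(self, n_in, n_out):
        self.W = np.random.randn(n_out, n_in) * 0.1
        self.b = np.zeros(n_out)
    def forward(self, x):
        self.x = x
        return x @ self.W.T + self.b
    def backward(self, dy):
        # Gradients w.r.t. parameters and input (chain rule)
        self.dW = dy.T @ self.x
        self.db = dy.sum(axis=0)
        return dy @ self.W

class ReLU:
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask
    def backward(self, dy):
        return dy * self.mask

def mse(pred, target):
    """Squared-distance cost and its gradient w.r.t. the prediction."""
    diff = pred - target
    return (diff ** 2).mean(), 2 * diff / diff.size

# Tiny regression example: stack Linear -> ReLU -> Linear, run backprop + SGD.
net = [Linear(3, 8), ReLU(), Linear(8, 1)]
x, y = np.random.randn(16, 3), np.random.randn(16, 1)
for step in range(100):
    h = x
    for m in net:                          # forward pass, module by module
        h = m.forward(h)
    loss, grad = mse(h, y)
    for m in reversed(net):                # backward pass in reverse order
        grad = m.backward(grad)
    for m in net:
        if isinstance(m, Linear):          # SGD update
            m.W -= 0.05 * m.dW
            m.b -= 0.05 * m.db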
Backprop in Practice
Convolutional Networks

ConvNet Architecture
[Diagram: alternating stages of Filter Bank → Non-Linearity → Feature Pooling → Norm, producing low-level, mid-level, then high-level features, followed by a trainable classifier.]
Multiple Convolutions
LeNet-5
[Architecture: input 1@32x32 → (5x5 convolution) → Layer 1: 6@28x28 → (2x2 pooling/subsampling) → Layer 2: 6@14x14 → (5x5 convolution) → Layer 3: 12@10x10 → (2x2 pooling/subsampling) → Layer 4: 12@5x5 → (5x5 convolution) → Layer 5: 100@1x1 → Layer 6: 10 outputs.]
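To make the 5x5 convolution and 2x2 pooling/subsampling operations concrete, here is a naive single-channel numpy sketch (illustrative only, not an efficient or complete LeNet-5 implementation); it reproduces the spatial sizes listed above.

import numpy as np

def conv2d_valid(img, kernel):
    """Naive 'valid' 2D convolution (correlation) of a single-channel image."""
    H, W = img.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kH, j:j + kW] * kernel)
    return out

def maxpool2x2(x):
    """2x2 max pooling / subsampling with stride 2."""
    H, W = x.shape
    return x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

# Shape check: 32x32 input, 5x5 conv -> 28x28, 2x2 pool -> 14x14,
# 5x5 conv -> 10x10, 2x2 pool -> 5x5, as in the layer sizes above.
x = np.random.randn(32, 32)
h = maxpool2x2(conv2d_valid(x, np.random.randn(5, 5)))   # 14x14
h = maxpool2x2(conv2d_valid(h, np.random.randn(5, 5)))   # 5x5
print(h.shape)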
[Matan, Burges, LeCun, Denker NIPS 1991] [LeCun, Bottou, Bengio, Haffner, Proc IEEE 1998]
[Diagram: a trainable feature hierarchy — low-level features → mid-level features → high-level features → trainable classifier.]
The ventral (recognition) pathway in the visual cortex has multiple stages:
Retina - LGN - V1 - V2 - V4 - PIT - AIT ....
Text Classification
Musical Genre Recognition
Acoustic Modeling for Speech Recognition
Time-Series Prediction
Recurrent Neural Networks
[Diagrams: an RNN with state s and input x, unfolded in time. The same parameters are reused at every step: the state update is s_t = F(s_{t-1}, x_t), e.g. s_t = tanh(W s_{t-1} + U x_t), and the output is o_t = V s_t.]
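The unfolded diagram corresponds to a simple recurrence. A minimal numpy sketch, assuming the common parameterization s_t = tanh(U x_t + W s_{t-1}) and o_t = V s_t that matches the U, W, V labels in the figure:

import numpy as np

def rnn_forward(xs, U, W, V, s0):
    """Unfold a simple RNN over a sequence: s_t = tanh(U x_t + W s_{t-1}), o_t = V s_t."""
    s, states, outputs = s0, [], []
    for x in xs:                       # one step per time index t
        s = np.tanh(U @ x + W @ s)
        states.append(s)
        outputs.append(V @ s)
    return states, outputs

n_in, n_hid, n_out, T = 4, 16, 3, 10
U = np.random.randn(n_hid, n_in) * 0.1
W = np.random.randn(n_hid, n_hid) * 0.1
V = np.random.randn(n_out, n_hid) * 0.1
xs = [np.random.randn(n_in) for _ in range(T)]
states, outputs = rnn_forward(xs, U, W, V, np.zeros(n_hid))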
Generative RNNs
An RNN can represent a fully-connected directed generative model: every variable is predicted from all previous ones.
[Diagram: the unfolded RNN with a per-step loss L_t; each output o_t parameterizes the prediction of the next input x_{t+1} given x_1, ..., x_t.]
Maximum Likelihood = Teacher Forcing
During training, the past y fed back as input comes from the training data.
At generation time, the past y fed back as input is generated by the model.
The mismatch can cause compounding errors.
[Diagram: y_t ~ P(y_t | h_t); the previous y (ground-truth or sampled) is fed back into the next hidden state.]
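A sketch of the difference, using a tiny token-level RNN defined here for illustration (none of these names come from the slides): during training the previous ground-truth token is fed back (teacher forcing = maximum likelihood), while at generation time the model's own sample is fed back, which is the source of the mismatch.

import numpy as np

rng = np.random.default_rng(0)
V, H = 5, 8                                    # vocabulary size, hidden size
Wxh, Whh, Why = (rng.normal(scale=0.1, size=s) for s in [(H, V), (H, H), (V, H)])
E = np.eye(V)                                  # one-hot embeddings

def step(y_prev, h):
    """One RNN step: consume the previous token, return the next-token distribution."""
    h = np.tanh(Wxh @ E[y_prev] + Whh @ h)
    p = np.exp(Why @ h); p /= p.sum()           # softmax over the next token
    return p, h

def teacher_forced_nll(y_seq):
    """Training: the previous token comes from the data (maximum likelihood)."""
    h, nll = np.zeros(H), 0.0
    for t in range(1, len(y_seq)):
        p, h = step(y_seq[t - 1], h)            # ground-truth history
        nll -= np.log(p[y_seq[t]])
    return nll

def generate(y0, length):
    """Generation: the previous token is the model's own sample."""
    h, y, out = np.zeros(H), y0, [y0]
    for _ in range(length):
        p, h = step(y, h)
        y = rng.choice(V, p=p)                  # feed back the sample
        out.append(y)
    return out

print(teacher_forced_nll([0, 2, 1, 3]), generate(0, 6))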
+ stacking (recurrent layers can also be stacked to form deeper RNNs)
Long-Term Dependencies
The RNN gradient is a product of Jacobian matrices, each associated with a step in the forward computation. To store information robustly in a finite-dimensional state, the dynamics must be contractive [Bengio et al 1994].
Storing bits robustly requires singular values < 1.
Problems:
singular values of Jacobians > 1: gradients explode (hence gradient clipping)
singular values < 1: gradients shrink & vanish (Hochreiter 1991)
or random variance grows exponentially
RNN Tricks
(Pascanu, Mikolov, Bengio, ICML 2013; Bengio, Boulanger & Pascanu, ICASSP 2013)
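One of the tricks referenced above is clipping exploding gradients. A minimal numpy sketch of global-norm clipping (one common variant; the threshold is arbitrary):

import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)
        grads = [g * scale for g in grads]
    return grads

# Example: an exploding gradient gets rescaled before the SGD update.
grads = [np.random.randn(100, 100) * 50, np.random.randn(100) * 50]
clipped = clip_gradients(grads)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))   # <= 5.0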
[Figure: the LSTM cell — a state with a linear self-loop, controlled by input, forget and output gates between the input, the state and the output.]
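The gated self-loop in the figure corresponds to the standard LSTM cell equations. A numpy sketch of a single step (a common formulation; the variable names are illustrative, not from the slides):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4H, X+H): input, forget, output gates and candidate."""
    z = W @ np.concatenate([x, h_prev]) + b
    H = h_prev.shape[0]
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2 * H])        # forget gate (controls the self-loop)
    o = sigmoid(z[2 * H:3 * H])    # output gate
    g = np.tanh(z[3 * H:])         # candidate state
    c = f * c_prev + i * g         # state: gated self-loop plus gated input
    h = o * np.tanh(c)             # output
    return h, c

X, H = 10, 20
W = np.random.randn(4 * H, X + H) * 0.1
b = np.zeros(4 * H)
h = c = np.zeros(H)
for x in np.random.randn(5, X):    # run a short sequence
    h, c = lstm_step(x, h, c, W, b)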
RNN Tricks
Delays and multiple time scales, El Hihi & Bengio NIPS 1996.
[Diagram: an unfolded RNN with both 1-step recurrent connections (W1) and longer skip connections across several time steps (W3), giving paths that operate at multiple time scales.]
Backprop in Practice
Other tricks: see the Deep Learning book (in preparation, online).
[Linear unit example: cost C(W) = (1/p) Σ_p (Y^p − W·X^p)², with Hessian H = (1/p) Σ_p X^p (X^p)^T.]
To avoid ill conditioning: normalize the inputs
zero mean
unit variance for all variables
[Diagram: a single unit Y with weights W0 (bias), W1, W2 on inputs X1, X2.]
Stochastic Gradient: much faster, but fluctuates near the minimum.
[Diagram: a two-layer net X → W1 → Z → W2 → Y; its objective has continua of solutions separated by saddle points.]
A deep net is a stack of linear transforms interspersed with Max operators: point-wise ReLUs and max pooling, whose switches change from one layer to the next.
The input-output function is a sum over active paths of the product of all weights along each path (e.g. a path through units 31 → 22 → 14 → 3 contributes W31,22 · W22,14 · W14,3).
Solutions are hyperbolas; the objective function is full of saddle points.
Saddle Points
Local minima dominate in low dimensions, but saddle points dominate in high dimensions.
Most local minima are close to the bottom (the global minimum error).
Rectifiers (ReLU)
Softplus: f(x) = log(1 + exp(x)); rectifier: f(x) = max(0, x).
Neuroscience motivation: the leaky integrate-and-fire model.
Glorot, Bordes and Bengio (AISTATS 2011): using a rectifier nonlinearity (ReLU) instead of tanh or softplus allowed, for the first time, training very deep supervised networks without unsupervised pre-training; the choice was biologically motivated.
Krizhevsky, Sutskever & Hinton (NIPS 2012): rectifiers were one of the crucial ingredients in the ImageNet breakthrough.
Batch Normalization (Ioffe & Szegedy ICML 2015)
Given a minibatch of m values x_{1,k}, ..., x_{m,k} of feature k, we can standardize each feature as follows:

μ_k = (1/m) Σ_{i=1..m} x_{i,k}    (1)
σ_k² = (1/m) Σ_{i=1..m} (x_{i,k} − μ_k)²    (2)
x̂_k = (x_k − μ_k) / sqrt(σ_k² + ε)    (3)

where ε is a small positive constant to improve numerical stability.
However, standardizing the intermediate activations reduces the representational power of the layer. To account for this, batch normalization introduces additional learnable parameters γ and β, which respectively scale and shift the data, leading to a layer of the form

BN(x_k) = γ_k x̂_k + β_k.    (4)

By setting γ_k to σ_k and β_k to μ_k, the network can recover the original layer representation. So, for a standard feedforward layer y = φ(W x + b), where W is the weight matrix and b is the bias vector, batch normalization is applied as

y = φ(BN(W x)).

Note that the bias vector has been removed, since its effect is cancelled by the standardization.
[Additional garbled fragments discuss applying the same normalization within recurrent / deeper architectures.]
Early Stopping
A beautiful FREE LUNCH (no need to launch many different training runs for each value of the #iterations hyper-parameter).
Monitor validation error during training (after visiting a number of training examples equal to a multiple of the validation set size).
Keep track of the parameters with the best validation error and report them at the end.
If the error does not improve enough (with some patience), stop.
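A sketch of the early-stopping loop just described; train_one_pass and validation_error are hypothetical callables standing in for a real training setup.

import copy
import numpy as np

def train_with_early_stopping(model, train_one_pass, validation_error, patience=10):
    """Keep the parameters with the best validation error; stop after `patience`
    evaluations without improvement."""
    best_err, best_params, waited = np.inf, copy.deepcopy(model), 0
    while waited < patience:
        train_one_pass(model)              # visit ~ a multiple of the validation set size
        err = validation_error(model)      # monitor validation error
        if err < best_err:                 # track the best parameters seen so far
            best_err, best_params, waited = err, copy.deepcopy(model), 0
        else:
            waited += 1
    return best_params, best_err           # report the best parameters at the end

# Toy usage: "model" is a dict with one parameter; training just perturbs it.
rng = np.random.default_rng(0)
toy = {"w": 5.0}
best, err = train_with_early_stopping(
    toy,
    train_one_pass=lambda m: m.update(w=m["w"] - 0.5 * rng.uniform()),
    validation_error=lambda m: (m["w"] - 1.0) ** 2,
)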
Distributed Training
Minibatches
Large minibatches + second-order & natural gradient methods
Asynchronous SGD (Bengio et al 2003, Le et al ICML 2012, Dean et al NIPS 2012)
Data parallelism vs model parallelism
Bottleneck: sharing weights/updates among nodes, to keep node-models from moving too far from each other
EASGD (Zhang et al NIPS 2015) works well in practice
Efficiently exploiting more than a few GPUs remains a challenge
Vision
(switch laptops)
Speech Recognition
[Figure: recognition error rate from 1990 to 2010 on a log scale (100% down to 1%), with the recent sharp improvement labeled "Using DL".]
Multilingual recognizer
Multiscale input
Large context window
End-to-End Training with Search
[Figure: graph composition. A recognition graph containing character hypotheses with scores (e.g. "c" 0.4, "u" 0.8, "a" 0.2, "p" 0.2, "t" 0.8, "x" 0.1, "o" 1.0, "d" 1.8) is composed with a grammar graph (paths spelling words such as cut, cap, cat, ...); matching arcs are combined ("match & add") to produce an interpretation graph whose paths give the interpretations: cut (2.0), cap (0.8), cat (1.4).]
Natural Language Representations
[Figure: neural language model. Each context word w(t−n+1), ..., w(t−2), w(t−1) in the input sequence is mapped by table lookup to its embedding C(w) in a shared matrix C (parameters shared across words); the concatenated embeddings feed a tanh hidden layer and a softmax output layer predicting the next word.]
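A numpy sketch of the forward pass of such a neural language model (shared embedding matrix C, tanh hidden layer, softmax over the next word); all dimensions and names here are illustrative:

import numpy as np

V, d, n_context, H = 1000, 30, 3, 50        # vocab size, embedding dim, context, hidden
rng = np.random.default_rng(0)
C = rng.normal(scale=0.1, size=(V, d))      # shared embedding matrix (table lookup)
W_h = rng.normal(scale=0.1, size=(H, n_context * d))
W_o = rng.normal(scale=0.1, size=(V, H))

def next_word_probs(context_ids):
    """P(w_t | w_{t-n+1} ... w_{t-1}) from concatenated word embeddings."""
    e = C[context_ids].reshape(-1)          # look up and concatenate embeddings
    h = np.tanh(W_h @ e)                    # hidden layer
    logits = W_o @ h
    p = np.exp(logits - logits.max())       # softmax over the vocabulary
    return p / p.sum()

p = next_word_probs(np.array([12, 7, 431]))  # indices of the previous 3 words
print(p.shape, p.sum())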
[Figure: learned word embeddings place semantically related words, e.g. Paris and Rome, near one another.]
Encoder-Decoder Framework
Intermediate representation of meaning = universal representation
Encoder: from word sequence to sentence representation
Decoder: from representation to word-sequence distribution
[Diagram: a French encoder maps a French sentence to the intermediate representation; an English decoder maps that representation to an English sentence. An English encoder and English decoder can likewise share the same lower-level representation.]
(Bahdanau, Cho & Bengio, arXiv sept. 2014) following up on (Graves 2013) and
(Larochelle & Hinton NIPS 2010)
[Figure residue: tables and bar charts of machine translation benchmark scores comparing neural MT with phrase-based and syntactic SMT systems (U. Edinburgh, Google, etc.); labels garbled in extraction. Visible values include 28.18, 30.85, 22.67, 23.42, a highlighted −26% change, and systems labeled Baseline, Stanford, PJAIT.]
Attention Mechanism
[Figure: annotation vectors h_j are scored against the current recurrent state to produce attention weights a_j with Σ_j a_j = 1; the attention-weighted combination of the h_j feeds the next recurrent state z_i and the next word sample u_i.]
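A numpy sketch of the soft attention step in the figure: each annotation vector h_j is scored against the previous recurrent state, the scores are normalized into weights a_j summing to 1, and the weighted sum of the h_j forms the context for the next state and word. The MLP scoring function used here is one common choice, not necessarily the one on the slide.

import numpy as np

def soft_attention(s_prev, H, Wa, Ua, va):
    """Return attention weights a_j over annotation vectors H (rows h_j) and the context."""
    scores = np.tanh(H @ Ua.T + s_prev @ Wa.T) @ va   # e_j = v^T tanh(Ua h_j + Wa s)
    a = np.exp(scores - scores.max())
    a = a / a.sum()                                    # sum_j a_j = 1
    context = a @ H                                    # weighted sum of annotation vectors
    return a, context

n_loc, d_ann, d_state, d_att = 14 * 14, 512, 256, 128  # e.g. conv feature-map locations
rng = np.random.default_rng(0)
H = rng.normal(size=(n_loc, d_ann))                    # annotation vectors h_j
s = rng.normal(size=d_state)                           # previous recurrent state
Wa = rng.normal(scale=0.1, size=(d_att, d_state))
Ua = rng.normal(scale=0.1, size=(d_att, d_ann))
va = rng.normal(scale=0.1, size=d_att)
a, context = soft_attention(s, H, Wa, Ua, va)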
Paying Attention to Selected Parts of the Image While Uttering Words
The Good
Memory Networks (Weston, Chopra, Bordes 2014)
http://arxiv.org/abs/1410.3916
Results on the Question Answering task.
[Diagram: a network augmented with a memory providing passive copy access.]
Representations: explanatory factors
ICML 2011 workshop on Unsupervised & Transfer Learning; NIPS 2011 Transfer Learning Challenge (paper: ICML 2012).
[Figure: challenge results for representations with 2, 3 and 4 layers.]
Multi-Task Learning
Generalizing better to new tasks (tens of thousands!) is crucial to approach AI.
Example: speech recognition, sharing across multiple languages.
[Diagram: tasks A, B, C with outputs y1, y2, y3 computed from shared intermediate factors learned from a common input.]
(IJCAI 2011, NIPS 2010, JMLR 2010, ML J 2010)
WSABIE objective function:
[Diagram: multiple prediction tasks, e.g. P(person, url, event) and P(url, words, history), sharing intermediate representations h1, h2, h3 of inputs X1, X2, X3 through a selection switch.]
Maps Between Representations
h_x = f_x(x), h_y = f_y(y)
x and y represent different modalities, e.g., image, text, sound.
Can provide 0-shot generalization to new categories (values of y).
[Diagram: encoders f_x and f_y map x-space and y-space into a shared representation space, so a test input x_test can be matched to a new y_test.]
Unsupervised Representation Learning
P(Y|X) = P(X|Y) P(Y) / P(X)
P(Y|X) is tied to P(X) and P(X|Y), and P(X) is defined in terms of P(X|Y), i.e. P(X) = Σ_y P(X|Y=y) P(Y=y).
The best possible model of X (unsupervised learning) MUST involve Y as a latent factor, implicitly or explicitly.
Representation learning SEEKS the latent variables H that explain the variations of X, making it likely to also uncover Y.
If Y is a Cause of X, Semi-Supervised Learning Works
Just observing the x-density reveals the causes y (cluster ID).
After learning p(x) as a mixture, a single labeled example per class suffices to learn p(y|x).
[Figure: a 1-D mixture model p(x) with three components corresponding to y = 1, 2, 3.]
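A small sketch of this argument using scikit-learn (a tooling choice of this sketch, not of the slides): fit a 3-component Gaussian mixture to unlabeled x, then attach a class label to each component using a single labeled example per class.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Unlabeled 1-D data drawn from 3 well-separated clusters (the "causes" y).
X = np.concatenate([rng.normal(m, 0.5, 300) for m in (2.0, 8.0, 14.0)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)   # learn p(x) as a mixture

# One labeled example per class is enough to name the components.
labeled_x = np.array([[2.1], [7.9], [14.2]])
labeled_y = np.array([1, 2, 3])
component_to_class = {c: y for c, y in zip(gmm.predict(labeled_x), labeled_y)}

# p(y|x) for new points, via the mixture's assignment of components.
x_new = np.array([[7.5], [13.0]])
pred = [component_to_class[c] for c in gmm.predict(x_new)]
print(pred)   # expected: [2, 3]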
Invariant Features
Which invariances?
Alternative: learning to disentangle factors, i.e. keep all the explanatory factors in the representation.
Good disentangling avoids the curse of dimensionality.
Emerges from representation learning (Goodfellow et al. 2009, Glorot et al. 2011).
Boltzmann Machines / Undirected Graphical Models
Boltzmann machines (Hinton 1984): observed units x and hidden units h.
Block Gibbs sampling: h ~ P(h | x), then x ~ P(x | h).
Pr(x) = Σ_h exp(−E(x, h)) / Z
[Training compares statistics on data (positive phase, X+) with statistics on model samples (negative phase, X−).]
Energy-based learning: strategies to shape the energy function (Y. LeCun)
1. Build the machine so that the volume of low-energy stuff is constant: PCA, K-means, GMM, square ICA.
2. Push down the energy of data points, push up everywhere else: max likelihood (needs a tractable partition function).
3. Push down the energy of data points, push up on chosen locations: contrastive divergence, Ratio Matching, Noise Contrastive Estimation, Minimum Probability Flow.
4. Minimize the gradient and maximize the curvature around data points: score matching.
5. Train a dynamical system so that the dynamics go to the manifold: denoising auto-encoder, diffusion inversion (nonequilibrium dynamics).
6. Use a regularizer that limits the volume of space that has low energy: sparse coding, sparse auto-encoder, PSD.
7. If E(Y) = ||Y − G(Y)||², make G(Y) as "constant" as possible: contracting auto-encoder, saturating auto-encoder.
8. Adversarial training: the generator tries to fool a real/synthetic classifier.
Auto-Encoders
[Diagram: encoder f computes Q(h|x) from input x; decoder g computes P(x|h) to produce the reconstruction r; prior P(h).]
Denoising auto-encoder: during training, the input is corrupted stochastically, and the auto-encoder must learn to guess the distribution of the missing information.
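A minimal numpy sketch of the denoising criterion: corrupt the input stochastically, encode, decode, and penalize reconstruction of the clean input. This is a tied-weight, one-hidden-layer toy with manual gradients, purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, lr, noise = 20, 10, 0.1, 0.3

W = rng.normal(scale=0.1, size=(n_hid, n_vis))   # tied weights: encoder W, decoder W.T
b_h, b_v = np.zeros(n_hid), np.zeros(n_vis)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X = (rng.uniform(size=(500, n_vis)) < 0.3).astype(float)   # toy binary data

for epoch in range(20):
    for x in X:
        x_tilde = x * (rng.uniform(size=n_vis) > noise)    # masking corruption
        h = sigmoid(W @ x_tilde + b_h)                      # encoder f
        r = sigmoid(W.T @ h + b_v)                          # decoder g (reconstruction)
        # Cross-entropy reconstruction loss against the CLEAN input x.
        dr = r - x                                          # gradient at decoder pre-activation
        dh = (W @ dr) * h * (1 - h)                         # gradient at encoder pre-activation
        W -= lr * (np.outer(dh, x_tilde) + np.outer(dr, h).T)  # tied-weight update
        b_v -= lr * dr
        b_h -= lr * dh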
[Diagram (Y. LeCun): a generative model with latent variables. Labels: Factor A, Factor B, INPUT, LATENT VARIABLE, Factor A', Encoder, Decoder, Distance costs.]
Variational auto-encoders (Kingma & Welling ICLR 2014; Gregor et al arXiv 2015)
Denoising Auto-Encoder
Learns a vector field pointing towards higher-probability directions (Alain & Bengio 2013):
reconstruction(x) − x ∝ σ² ∂ log p(x) / ∂x
Prior: examples concentrate near a lower-dimensional manifold.
[Diagram: sampling Markov chain — corrupt X_t into X̃_t, denoise to obtain X_{t+1}, corrupt again, and so on.]
[Figure: energy E(x) with minima at training points x1, x2, x3.]
For small corruption noise z:
E[||r(x + z) − x||²] ≈ E[||r(x) − x||²] + σ² ||∂r(x)/∂x||²_F
(the denoising criterion equals reconstruction error plus a contractive penalty on the Jacobian of the reconstruction function).
Encoder = inference: parametric approximate inference Q(h1|x), Q(h2|h1), Q(h3|h2), ...
Decoder = generator: P(h3), P(h2|h3), P(h1|h2), P(x|h1).
Successors of the Helmholtz machine (Hinton et al 1995).
Maximize the variational lower bound on the log-likelihood:
min_Q KL(Q(x, h) || P(x, h)), where Q(x) = the data distribution,
or equivalently
max_Q Σ_h Q(h|x) log [ P(x, h) / Q(h|x) ] = Σ_h Q(h|x) log P(x|h) − KL(Q(h|x) || P(h)).
Geometric Interpretation
Encoder: map the input to a new space where the data has a simpler distribution.
Add noise between the encoder output and the decoder input: train the decoder to be robust to the mismatch between the encoder output f(x) and the prior P(h).
[Diagram: contractive encoder f maps x to Q(h|x) near the prior P(h); decoder g maps back.]
DRAW: A Recurrent Neural Network For Image Generation
Even for a static input, the encoder and decoder are now recurrent nets, which gradually add elements to the answer and use an attention mechanism to choose where to do so.
[Diagram: at each step t the encoder RNN reads from the input (read), producing Q(z|x); a sample z_t feeds the decoder RNN, which writes to the canvas c_t (write); after T steps the final canvas defines P(x | z_{1:T}). Encoding = inference, decoding = generative model.]
Generative Adversarial Networks
[Diagram: a generator network maps a random vector to a fake image; a discriminator network receives either the fake image or a real image drawn (by random index) from the training set and must tell them apart.]
http://soumith.ch/eyescream/
LAPGAN results: 40% of samples mistaken by humans for real photos.
Convolutional GANs (Radford et al, arXiv 1511.06343)
Figure 2: Generated bedrooms after one training pass through the dataset. Theoretically, the model could learn to memorize training examples, but this is experimentally unlikely as we train with a small learning rate and minibatch SGD. We are aware of no prior empirical evidence demonstrating memorization with SGD and a small learning rate in only one epoch.
Space-Filling in Representation-Space
[Figure: linear interpolation at layer 2 (H-space) between the 9s manifold and the 3s manifold. Figure 4 caption: Top rows: interpolation between a series of 9 random points in Z show that the space ...]
[Figure residue: "what" pathways with many-to-one and one-to-many mappings between inputs, predicted "what", and reconstructions, each with an associated cost.]
Semi-supervised Learning with Ladder Network (Rasmus et al, NIPS 2015)
Jointly trained stack of denoising auto-encoders with gated lateral connections.
Corrupted encoder pass:
for l = 1 to L do
    z_pre(l) ← W(l) h̃(l−1)
    μ(l) ← batchmean(z_pre(l)) ; σ(l) ← batchstd(z_pre(l))
    z̃(l) ← batchnorm(z_pre(l)) + noise N(0, σ_noise²)
    h̃(l) ← activation(γ(l) (z̃(l) + β(l)))
end for
Decoder with lateral connections:
for l = L to 0 do
    if l = L then u(L) ← batchnorm(h̃(L))
    else u(l) ← batchnorm(V(l) ẑ(l+1))
    end if
    ∀i: ẑ_i(l) ← g(z̃_i(l), u_i(l))   # Eq. (1), gated combinator
    ∀i: ẑ_i,BN(l) ← (ẑ_i(l) − μ_i(l)) / σ_i(l)
end for
Cost: C ← C_c + Σ_{l=0..L} λ_l C_d(l), with C_d(l) = ||ẑ_BN(l) − z(l)||², where z(l) comes from a clean (noise-free) encoder pass f(l)(·).
Stacked What-Where Auto-Encoder (SWWAE)
[Diagram (Y. LeCun): the encoder maps the input to a predicted output compared with the desired output (supervised loss), while a decoder produces a reconstruction of the input (reconstruction loss).]
132
Unsupervised learning
How to evaluate?
Long-term dependencies
Natural language understanding & reasoning
More robust op4miza4on (or easier to train architectures)
Distributed training (that scales) & specialized hardware
Bridging the gap to biology
Deep reinforcement learning
133