DNN Tutorial
Kyunghyun Cho
Laboratoire d'Informatique des Systèmes Adaptatifs,
Département d'informatique et de recherche opérationnelle,
Faculté des arts et des sciences,
Université de Montréal
cho.k.hyun@gmail.com (chokyun@iro.umontreal.ca)
Machine Learning
[Diagram: the machine learning pipeline, from Data through Learning to Inference]
Data                  Inference               Query
Labeled Images        An image                Is a cat in the image?
Transcribed Speech    A speech segment        What is this person saying?
Paraphrases           A pair of sentences     Is this sentence a paraphrase?
Movie Ratings         Ratings of Y and by X   Will a user X like a movie Y?
Parallel Corpora      A Finnish sentence      What is "moi" in English?
Sensory Information
Multi-layered networks of simple functions
- (Fukushima, 1980)
- (Kohonen, 1982)
- Neural PCA (Oja, 1982)
- (Broomhead & Lowe, 1988)
- Autoencoders (Baldi & Hornik, 1989)
- Convolutional network (LeCun, 1989)
- (Neal, 1992)
- (Field, 1993)
- (Krizhevsky et al.), (Gunji et al.), (Simonyan et al.), (Perronin et al.)
[Figure: a deep network with inputs x1, . . . , xN and stacked feature layers, annotated with steps 1a, 1b, 2, and 3 (Inference)]
And Beyond. . . (Goodfellow, 2013)

Unsupervised Learning
- Generative Modeling with Neural Networks
- Density/Distribution Learning
- Learning to Infer

4. Advanced Topics
Warning!!!
Data:
  D = {(x1, y1), (x2, y2), . . . , (xN, yN)}
Assumption:
  yn = f*(xn)
Find a function f, using D, that emulates f* as well as possible on samples potentially not included in D.
Underlying distributions:
  X and Y | X
Data:
  D = {(x, y) | x ∼ X and y ∼ Y | X = x}
Find a distribution p(y | x), using D, that emulates Y | X as well as possible on new samples from p(x) potentially not included in D.
Evaluation:
  E_{x ∈ D_test} [ ‖f(x) − f*(x)‖_p ]
or
  E_{x ∈ D_test} [ KL(p(y | x) ‖ p*(y | x)) ],
where D ≠ D_test.
Why do we test the found solution f or p on Dtest , not on D?
Linear/Ridge Regression
Underlying assumption:
  y = W x + b + ε,
where ε is white Gaussian noise.
Data:
  D = {(x1, y1), (x2, y2), . . . , (xN, yN)},
where yn = W xn + b + ε.
Learning:
  Ŵ, b̂ = arg min_{W,b} (1/N) Σ_{n=1}^N ‖W xn + b − yn‖₂² + λ‖W‖_F² + λ‖b‖₂²

Cross-entropy cost (e.g., logistic regression):
  (1/N) Σ_{n=1}^N −[ yn log f(xn) + (1 − yn) log(1 − f(xn)) ] + λΩ(θ, D)
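The ridge objective above has a closed-form minimizer. A minimal NumPy sketch (the variable names and toy data are illustrative, and the bias b is omitted for brevity): setting the gradient to zero gives the regularized normal equations (XᵀX/N + λI) Wᵀ = XᵀY/N.

```python
import numpy as np

# A minimal sketch of ridge regression via the regularized normal equations.
# Toy data and names are illustrative; the bias term is omitted for brevity.

rng = np.random.default_rng(0)
N, d = 100, 3
W_true = np.array([[2.0, -1.0, 0.5]])               # underlying W* generating the data
X = rng.normal(size=(N, d))                          # inputs x_n as rows
Y = X @ W_true.T + 0.01 * rng.normal(size=(N, 1))    # y_n = W* x_n + Gaussian noise

lam = 1e-3
# Solve (X^T X / N + lam I) W^T = X^T Y / N for the minimizer of the ridge objective.
W_hat = np.linalg.solve(X.T @ X / N + lam * np.eye(d), X.T @ Y / N).T
print(np.round(W_hat, 2))
```

With small noise and a small λ, the estimate recovers the generating weights almost exactly.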
Learning as an Optimization
  min_θ (1/N) Σ_{n=1}^N c((xn, yn) | θ) + λΩ(θ, D)
Gradient Descent
Gradient-descent Algorithm:
  θ_t = θ_{t−1} − η_t ∇L(θ_{t−1})
where, in our case,
  L(θ) = (1/N) Σ_{n=1}^N l((xn, yn) | θ),
and the step sizes satisfy
  Σ_{t=1}^∞ η_t = ∞ and Σ_{t=1}^∞ η_t² < ∞.
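The update rule above can be sketched on a toy least-squares problem; a step-size schedule η_t ∝ 1/t satisfies both conditions. All names and data below are illustrative.

```python
import numpy as np

# A toy sketch of gradient descent, theta_t = theta_{t-1} - eta_t * grad L(theta_{t-1}),
# on L(theta) = (1/N) sum_n (theta^T x_n - y_n)^2, with eta_t = 0.5 / t so that
# sum_t eta_t diverges while sum_t eta_t^2 stays finite.

rng = np.random.default_rng(1)
N, d = 50, 2
theta_true = np.array([1.0, -2.0])
X = rng.normal(size=(N, d))
y = X @ theta_true                      # noiseless targets, so theta_true is the minimizer

theta = np.zeros(d)
for t in range(1, 2001):
    grad = (2.0 / N) * X.T @ (X @ theta - y)   # gradient of the average squared error
    eta = 0.5 / t                               # decreasing step size
    theta = theta - eta * grad
print(np.round(theta, 2))
```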
Almost there. . .
[Figure: a two-layer network with inputs x1, x2, hidden units h1, h2, and output f]

Forward Computation:
  f = φ₂(U⊤ φ₁(W⊤ x))

Backward Computation (chain rule):
  ∂L/∂x₁ = ∂L/∂f ( ∂f/∂h₁ ∂h₁/∂x₁ + ∂f/∂h₂ ∂h₂/∂x₁ )

Each node only needs its local rule:
  Forward: h(a₁, a₂, . . . , a_q)
  Backward: ∂L/∂a_i = (∂L/∂h)(∂h/∂a_i), for i = 1, . . . , q
Theano, Torch, . . .
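The local backward rule can be checked by hand on the two-layer network above; a NumPy sketch with sigmoid nonlinearities for both φ₁ and φ₂ (an assumption — the slides leave the nonlinearities unspecified), verified against finite differences:

```python
import numpy as np

# Manual backprop for f = sigmoid(u^T sigmoid(W^T x)) with squared-error loss,
# checked against finite differences. All names and values are illustrative.

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(2)
W = rng.normal(size=(2, 2))   # input-to-hidden weights
u = rng.normal(size=2)        # hidden-to-output weights
x = rng.normal(size=2)

# Forward pass, keeping intermediate values for the backward pass.
h = sigmoid(W.T @ x)
f = sigmoid(u @ h)
L = 0.5 * (f - 1.0) ** 2      # squared error against target y = 1

# Backward pass: propagate dL/d(node) through each local Jacobian.
dL_df = f - 1.0
dL_dh = dL_df * f * (1 - f) * u       # through the output sigmoid
dh_da = h * (1 - h)                   # derivative of the hidden sigmoid
dL_dx = W @ (dL_dh * dh_da)           # chain rule back to the input

# Finite-difference check of dL/dx.
eps = 1e-6
num = np.zeros(2)
for i in range(2):
    xp = x.copy(); xp[i] += eps
    fp = sigmoid(u @ sigmoid(W.T @ xp))
    num[i] = (0.5 * (fp - 1.0) ** 2 - L) / eps
print(np.max(np.abs(dL_dx - num)))
```

Tools such as Theano and Torch automate exactly this bookkeeping.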
Prior Distribution: θ_j ∼ N(0, λ⁻¹), j = 1, . . . , M
Maximum a posteriori:
  arg max_θ Σ_{n=1}^N log p(yn | xn, θ) − λ Σ_{j=1}^M θ_j²
Regularizing Σ_{n=1}^N ‖∂f(xn)/∂x‖² is equivalent to adding random Gaussian noise to the input (Bishop, 1995):

  arg min_θ Σ_{n=1}^N ( ‖f(xn) − yn‖² + λ ‖∂f(xn)/∂x‖² )
Model averaging over M models:
  f̄_M(x) = (1/M) Σ_{m=1}^M f_m(x)

By Jensen's inequality:
  (1/N) Σ_{n=1}^N log E_m[ p(yn, m | xn, θ) ] ≥ (1/N) Σ_{n=1}^N E_m[ log p(yn, m | xn, θ) ]
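The inequality above is easy to check numerically: averaging probabilities before taking the log never does worse than averaging log-probabilities. The member probabilities below are made-up toy values, not from the slides.

```python
import numpy as np

# A tiny numerical check of Jensen's inequality, log E_m[p] >= E_m[log p],
# over M = 3 hypothetical ensemble members. The probabilities are toy values.

p_m = np.array([0.9, 0.5, 0.1])      # p(y, m | x, theta) for each member m
log_avg = np.log(p_m.mean())         # log E_m[p]  (average, then log)
avg_log = np.log(p_m).mean()         # E_m[log p]  (log, then average)
print(log_avg >= avg_log)            # prints True
```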
[Figure: fully-connected layers mapping inputs x1, x2 to hidden units h1, h2, h3, . . . , hH]
- Language Processing
- Translation invariance

Pooling:
- Global
- Local (max)
(TeraDeep, 2013)
Convolutional Layer
1. Contrast Normalization
2. Convolution
3. Pooling
4. Nonlinearity
Location invariance
Variable Length
Compositional Structure
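The four convolutional-layer steps listed above can be sketched in 1-D. This is a minimal illustration, not any library's implementation: contrast normalization is reduced to mean removal, and the filter values are made up.

```python
import numpy as np

# A minimal 1-D sketch of the four convolutional-layer steps:
# contrast normalization, convolution, pooling, nonlinearity.

def conv_layer(x, w, pool=2):
    x = x - x.mean()                       # 1. contrast normalization (mean removal)
    k = len(w)
    conv = np.array([x[i:i + k] @ w        # 2. valid convolution (as correlation)
                     for i in range(len(x) - k + 1)])
    pooled = np.array([conv[i:i + pool].max()  # 3. non-overlapping max pooling
                       for i in range(0, len(conv) - pool + 1, pool)])
    return np.maximum(pooled, 0.0)         # 4. nonlinearity (ReLU)

x = np.array([0.0, 1.0, 0.0, -1.0, 0.0, 1.0])
w = np.array([1.0, -1.0])                  # edge-detector-like filter
out = conv_layer(x, w)
print(out)                                 # prints [1. 1.]
```

Note how max pooling makes the response report *that* an edge occurred in each window, not exactly *where* — the location invariance mentioned above.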
Further Topics
Warning!!!
Unsupervised Learning
No more labels!
D = {x1 , x2 , . . . , xN }
What can we do?
[Figure: 3-D point cloud of human gesture data]
Human Gestures: Visualization by DNN
(Cho&Chen, 2014)
Feature Extraction
f?
Underlying distribution:
  X ∼ P_D
Data:
  D = {x | x ∼ P_D}
Find a distribution p(x), using D, that emulates P_D as well as possible on new samples potentially not included in D.
  (1/N) Σ_{n=1}^N log Σ_h p(xn | h) p(h)
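For a discrete latent h, the inner sum over h is a finite mixture, so the log-likelihood above can be computed exactly. The prior and conditionals below are made-up toy values.

```python
import numpy as np

# A toy evaluation of (1/N) sum_n log sum_h p(x_n | h) p(h)
# for a latent h with two values. All probabilities are illustrative.

p_h = np.array([0.3, 0.7])              # prior p(h)
p_x_given_h = np.array([[0.9, 0.2],     # p(x_n | h) for N = 2 data points
                        [0.1, 0.8]])
marginals = p_x_given_h @ p_h           # sum_h p(x_n | h) p(h) for each n
log_lik = np.mean(np.log(marginals))
print(round(log_lik, 4))
```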
Density/Distribution Estimation
(2) Restricted Boltzmann Machines (Smolensky, 1986)

[Figure: bipartite graph between hidden units h1, . . . , hq and visible units x1, . . . , xp]

1. Joint distribution (binary units):
  p_θ(x, h) = (1/Z(θ)) exp( Σ_{i=1}^p Σ_{j=1}^q xi hj wij )

2. Marginal distribution:
  Σ_h p_θ(x, h) = (1/Z(θ)) Π_{j=1}^q ( 1 + exp( Σ_{i=1}^p wij xi ) )

3. Log-probability:
  log Σ_h p_θ(x, h) = Σ_{j=1}^q log( 1 + exp( Σ_{i=1}^p wij xi ) ) − log Z(θ)
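For an RBM small enough to enumerate, the product form of the marginal can be verified against a brute-force sum over all hidden configurations. Sizes and weights below are illustrative.

```python
import numpy as np
from itertools import product

# Brute-force check of the RBM marginal: summing exp(sum_ij x_i h_j w_ij) over
# all binary h must equal prod_j (1 + exp(sum_i w_ij x_i)), since Z(theta) is
# shared by both sides. Sizes and weights are illustrative.

rng = np.random.default_rng(3)
p, q = 3, 2
W = rng.normal(size=(p, q))
x = np.array([1.0, 0.0, 1.0])           # one binary visible configuration

brute = sum(np.exp(x @ W @ np.array(h, dtype=float))
            for h in product([0, 1], repeat=q))    # sum over all 2^q hidden states
closed = np.prod(1.0 + np.exp(W.T @ x))            # prod_j (1 + exp(sum_i w_ij x_i))
print(np.isclose(brute, closed))                   # prints True
```

The factorization is what makes the hidden units conditionally independent given x, and hence cheap to sum out.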
1. Joint distribution:
  p_θ(x, h) = p(x | h⁽¹⁾) p(h⁽¹⁾ | h⁽²⁾) · · · p(h⁽ᴸ⁾)

2. Marginal distribution:
  Σ_h p_θ(x, h) = ?

3. Learning rule: maximum likelihood
  L(θ) = E_{x∼P_d}[ log Σ_h p_θ(x, h) ]

(Neal, 1996)
1. Joint distribution:
  p_θ(x) = p_θ(x1) p_θ(x2 | x1) · · · p_θ(xd | x1, . . . , xd−1)

3. Learning rule: maximum likelihood (fixed order), order-agnostic (all orders)

(Larochelle&Murray, 2011)
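The autoregressive factorization above is automatically normalized: multiplying conditionals along the chain rule always yields probabilities that sum to one. A toy check with hypothetical logistic conditionals (this illustrates the factorization only, not the NADE architecture itself):

```python
import numpy as np
from itertools import product

# A toy check that p(x) = p(x1) p(x2 | x1) ... p(xd | x_{<d}) is normalized,
# using made-up logistic conditionals with a strictly lower-triangular weight
# matrix so that x_i depends only on x_{<i}.

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(4)
d = 4
b = rng.normal(size=d)
V = np.tril(rng.normal(size=(d, d)), k=-1)   # zero on and above the diagonal

def log_prob(x):
    logp = 0.0
    for i in range(d):
        mu = sigmoid(b[i] + V[i] @ x)        # p(x_i = 1 | x_{<i})
        logp += x[i] * np.log(mu) + (1 - x[i]) * np.log(1 - mu)
    return logp

# Sum p(x) over all 2^d binary vectors: the chain rule guarantees this is 1.
total = sum(np.exp(log_prob(np.array(xs, dtype=float)))
            for xs in product([0, 1], repeat=d))
print(round(total, 6))                       # prints 1.0
```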
Variational inference/message-passing
[Figure: iterative inference unrolled over steps k, alternating between hidden states h<k> and states x<k>, starting from x<0>]
Lesson:
- Do not maximize log-likelihood, but minimize the actual cost!
- Don't do what a model tells you to do, but do what you aim to do.
[Figure: red and blue clusters with an unlabeled dot between them]
Now, which class does the dot belong to, red or blue?
[Figure: data manifold in the data space]

Denoising criterion:
  min_{θ,θ′} ‖x − g_{θ′}(f_θ(x + ε))‖
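The denoising criterion above can be sketched with linear maps f(x) = Ax and g(h) = Bh on toy 2-D data lying along a line; the trained pair learns to project corrupted points back toward the data manifold. All names, sizes, and hyperparameters are illustrative.

```python
import numpy as np

# A minimal sketch of min ||x - g(f(x + eps))||^2 with a linear encoder
# f(x) = A x and decoder g(h) = B h, trained by gradient descent on toy
# 2-D data lying on the line y = 2x. Everything here is illustrative.

rng = np.random.default_rng(5)
N = 200
t = rng.normal(size=N)
X = np.stack([t, 2.0 * t], axis=1)       # clean data on a 1-D manifold

A = rng.normal(size=(1, 2)) * 0.1        # encoder to a 1-D code
B = rng.normal(size=(2, 1)) * 0.1        # decoder back to 2-D
lr, sigma = 0.05, 0.1

for step in range(2000):
    Xn = X + sigma * rng.normal(size=X.shape)   # corrupted input x + eps
    H = Xn @ A.T                                 # codes f(x + eps)
    R = H @ B.T                                  # reconstructions g(f(x + eps))
    G = 2.0 * (R - X) / N                        # grad of mean squared error w.r.t. R
    B -= lr * G.T @ H                            # grad w.r.t. decoder
    A -= lr * (G @ B).T @ Xn                     # grad w.r.t. encoder

# After training, clean points on the manifold reconstruct almost perfectly.
err = np.mean(np.sum((X @ A.T @ B.T - X) ** 2, axis=1))
print(round(err, 3))
```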
[Figure: layer-wise pretraining — the 1st layer h[1] is trained on x, then the 2nd layer h[2] on h[1], before supervised fine-tuning toward y]
Manifold Embedding: M ⊆ R^d → R^q, q ≪ d
If q = 2 or 3, data visualization
[Figure: 2-D embeddings (z1, z2) of digit images (R^196 → R^2, classes 0-9) and of gesture sequences (rotateArmsLBack, rotateArmsRBack, rotateArmsBBack)]
Advanced Topics
Deep Q Learning
- Action at time t: arg max_{a ∈ [1, j]} Q(s, a)
- Update Q on-the-fly

Not satisfying. . . Not really.
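The two bullets above — act greedily by arg max over Q(s, a) and update Q on-the-fly — can be sketched in tabular form before any deep network enters. The 2-state, 2-action environment below is entirely made up for illustration.

```python
import numpy as np

# A tabular sketch of Q-learning: epsilon-greedy action a = argmax_a Q(s, a),
# then the on-the-fly update Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
# The toy MDP (2 states, 2 actions, deterministic alternating transitions) is made up.

rng = np.random.default_rng(6)
n_states, n_actions = 2, 2
R = np.array([[0.0, 1.0],       # reward for (state, action): action 1 is best in state 0,
              [1.0, 0.0]])      # action 0 is best in state 1
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.2

s = 0
for step in range(2000):
    # epsilon-greedy version of a = argmax_a Q(s, a)
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
    r = R[s, a]
    s_next = (s + 1) % n_states                      # deterministic transition
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

greedy = np.argmax(Q, axis=1)
print(greedy)                                        # prints [1 0]
```

Deep Q-learning replaces the table Q with a neural network, but the update target r + γ max_a′ Q(s′, a′) is the same.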
[Figure: 3-D surface plots of the loss L(θ)]
Theoretical Analysis
Deep Rectifier Networks Fold the Space
[Figure: (a) the input space is folded along the vertical axis by the first layer; (b) regions S1, . . . , S4 of the first-layer space are folded again; (c) the second-layer space identifies the folded regions]
(Montufar et al., 2014; Pascanu et al., 2014)