DNN Tutorial

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 105

Deep Learning: Past, Present and Future (?

)
Kyunghyun Cho
Laboratoire dInformatique des Systmes Adaptatifs,
Dpartement dinformatique et de recherche oprationnelle,
Facult des arts et des sciences,
Universit de Montral
cho.k.hyun@gmail.com (chokyun@iro.umontreal.ca)

Machine Learning

?
Learning

Inference

1. Let the model M learn the data D


2. Let the model M infer unknown quantities

Machine Learning & Perception

?
Learning

Inference

Perception is the organization, identification,


and interpretation of sensory information in
order to represent and understand the
environment.
Wikipedia

(Farabet et al., 2013)

Machine Learning & Perception: Examples

?
Learning

Data
Labeled Images
Transcribed Speech
Paraphrases
Movie Ratings
Parallel Corpora

Sensory Information
An image
A speech segment
A pair of sentences
Ratings of Y and by X
A Finnish sentence

Inference

Query
Is a cat in the image?
What is this person saying?
Is this sentence a paraphrase?
Will a user X like a movie Y ?
What is moi in English?

A human possesses the best machinery for perception, called a brain.


But, how does our brain do it?

Deep Learning: Motivated from Human Learning

(Van Essen&Gallant, 1994)

Learn massive data

simple functions

Multi-layered

(Krizhevsky et al., 2012)

Boltzmann machines? I remember working on them in 80s and 90s..


Anonymous Interviewer, 2011
paraphrased

Deep Learning: History

1958 Rosenblatt proposed perceptrons


1980 Neocognitron

(Fukushima, 1980)

1982 Hopfield network, SOM


1985 Boltzmann machines

(Kohonen, 1982),

Neural PCA

(Ackley et al., 1985)

1986 Multilayer perceptrons and backpropagation


1988 RBF networks

(Broomhead&Lowe, 1988)

1989 Autoencoders

(Baldi&Hornik, 1989),

1992 Sigmoid belief network


1993 Sparse coding

(Oja, 1982)

Convolutional network

(Neal, 1992)

(Field, 1993)

(Rumelhart et al., 1986)

(LeCun, 1989)

Why all this fuss about deep learning now?

ImageNet: ILSVRC 2012 Classification Task


Top Rankers
1. SuperVision (0.153): Deep Conv. Neural Network
2. ISI (0.262): Features + FV + Linear classifier

(Krizhevsky et al.)

(Gunji et al.)

3. OXFORD_VGG (0.270): Features + FV + SVM

(Simonyan et al.)

4. XRCE/INRIA (0.271): SIFT + FV + PQ + SVM

(Perronin et al.)

5. University of Amsterdam (0.300): Color desc. + SVM

(van de Sande et al.)

(Krizhevsky et al., 2012)

ImageNet: ILSVRC 2013 Classification Task


Top Rankers
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.

Clarifi (0.117): Deep Convolutional Neural Networks (Zeiler)


NUS: Deep Convolutional Neural Networks
ZF: Deep Convolutional Neural Networks
Andrew Howard: Deep Convolutional Neural Networks
OverFeat: Deep Convolutional Neural Networks
UvA-Euvision: Deep Convolutional Neural Networks
Adobe: Deep Convolutional Neural Networks
VGG: Deep Convolutional Neural Networks
CognitiveVision: Deep Convolutional Neural Networks
decaf: Deep Convolutional Neural Networks
IBM Multimedia Team: Deep Convolutional Neural Networks
Deep Punx (0.209): Deep Convolutional Neural Networks
MIL (0.244): Local image descriptors + FV + linear classifier (Hidaka et
al.)
Minerva-MSRA: Deep Convolutional Neural Networks
Orange: Deep Convolutional Neural Networks
BUPT-Orange: Deep Convolutional Neural Networks
Trimps-Soushen1: Deep Convolutional Neural Networks
QuantumLeap: 15 features + RVM (Shu&Shu)

Youre already using deep learning!

How do you tell deep learning from not-so-deep machine learning?

Not-So-Deep Machine Learning

1. Feature engineering not learned!


2. Learning

(3)

3. Inference

x1
x2

(2)

...

(1)

features
...
...

xN

...

Separation between domain knowledge and general machine learning

Unsupervised Learning of Representation


Feature engineering
Feature/Representation learning
Learning
Inference

(1a)

x1
x2

(3)

features
...
...
...

1a.
1b.
2.
3.

xN

(1b)

...

(2)

Deep Learning: toward the Ultimate Machine Learning?

1. Jointly learn everything


2. Inference

(2)

(1)

The data decides Yoshua Bengio

Why now? Why not 20 years ago?

What has happened in last 20+ years?

We have connected the dots, e.g.,


I
I

PCA Neural PCA Probabilistic PCA Autoencoder


Autoencoder Belief network Restricted Boltzmann machine

We understand learning better


I
I
I

Model structure matters a lot


Learning is but is not optimization
No need to be scared of non-convex optimization

We understand how learning and inference interact

Exponential growth of the amount of data and computational power

And Beyond. . .

(Goodfellow, 2013)

Todays Tutorial: Introduction to Deep Learning


1. Deep Learning: Past, Present and Future
2. Supervised Neural Networks
I
I
I
I

Multilayer Perceptron and Learning


Regularization
Practical Recipe
Task-specific Neural Network

3. Unsupervised Neural Networks


I
I

Unsupervised Learning
Generative Modeling with Neural Networks
I
I

Density/Distribution Learning
Learning to Infer

Semi-Supervised Learning and Pretraining

4. Advanced Topics
I
I

Beyond Computer Vision


Advanced Topics

Supervised Neural Networks


Kyunghyun Cho
Laboratoire dInformatique des Systmes Adaptatifs,
Dpartement dinformatique et de recherche oprationnelle,
Facult des arts et des sciences,
Universit de Montral
cho.k.hyun@gmail.com, kyunghyun.cho@umontreal.ca

Warning!!!

The next 910 slides may be extremely boring.

Supervised Learning: Rough Picture

Data:
D = {(x1 , y1 ), (x2 , y2 ) . . . , (xN , yN )}
Assumption:
yn = f (xn )
Find a function f , using D, that emulates f as well as possible on
samples potentially not included in D.

Supervised Learning: Probabilistic Picture

Underlying distributions:
X and Y | X
Data:
D = {(x, y ) | x X & y Y | X = x}
Find a distribution p(y | x), using D, that emulates Y | X as well as
possible on new samples from p(x) potentially not included in D.

Supervised Learning: Evaluation and Generalization

Evaluation:
Exp(x) [kf (x) f (x)kp ]

[kf (x) f (x)kp ]

xDtest

or
Exp(x) [KL(p(y | x)kp (y | x))]

KL(pkp ),

xDtest

where D 6= Dtest
Why do we test the found solution f or p on Dtest , not on D?

Linear/Ridge Regression
Underlying assumption:
y = W x + b + ,
where  is a white Gaussian noise.
Data:
D = {(x1 , y1 ), (x2 , y2 ) . . . , (xN , yN )} ,
where yn = W xn + b + .
Learning:
W , b = arg min
W ,b

N


1 X
2
2
2
kWxn + b yn k2 + kW kF + kbk2
N n=1

Multilayer Perceptron: (Binary) Classification


Underlying assumption:
y B (p = f (x)) ,
where f is a nonlinear function parameterized with and B(p) is a
Bernoulli distribution of mean p.
Data:
D = {(x1 , y1 ), (x2 , y2 ) . . . , (xN , yN )} ,
where yn B (p = f (xn )).
Learning:
= arg min

N
1 X
yn log f (xn ) + (1 yn ) log (1 f (xn )) + (, D)
N n=1

Learning as an Optimization

Ultimately, learning is (mostly )


= arg min

N
1 X
c ((xn , yn ) | ) + (, D) ,
N n=1

where c ((x, y ) | ) is a per-sample cost function.

Gradient Descent

Gradient-descent Algorithm:
t = t1 L( t1 )
where, in our case,
L() =

Let us assume that (, D) = 0.

N
1 X
l ((xn , yn ) | ) .
N n=1

Stochastic Gradient Descent


Often, it is too costly to compute C () due to a large training set.
Stochastic gradient descent algorithm:

t = t1 t l (x 0 , y 0 ) | t1 ,
where (x 0 , y 0 ) is a randomly chosen sample from D, and

and

t=1

Let us assume that (, D) = 0.

X
t=1

2

< .

Any question so far?

Almost there. . .

How do we compute the gradient efficiently for deep neural networks?

Backpropagation Algorithm (1) Forward Pass


x1

h1
f

h2

x2
Forward Computation:

L(f (h1 (x1 , x2 , h1 ), h2 (x1 , x2 , h2 ), f ), y )


Multilayer Perceptron with a single hidden layer:
L(x, y , ) =

2
1
y U> W> x
2

Backpropagation Algorithm (2) Chain Rule

x1

x2

h2
x1

h2
x1

h1

f
h1

f
f
h2

h2

L
f

Chain rule of derivatives:


L
L f
L
=
=
x1
f x1
f

f h1
f h2
+
h1 x1
h2 x1

Backpropagation Algorithm (3) Shared Derivatives

x1

h1

f
h1

f
h2

x2

f
h2

Local derivatives are shared:




L f h1
f h2
L
=
+
x1
f h1 x1
h2 x1


L
L f h1
f h2
=
+
x2
f h1 x2
h2 x2

L
f

Backpropagation Algorithm (4) Local Computation

L
h
h
a

L
a

h
a

L
a

h
a

Each node computes


I

Forward: h(a1 , a2 , . . . , aq )

Backward:

L
a

h h
h
a1 , a2 , . . . , aq

Backpropagation Algorithm Requirements

L
h
h
a

L
a

1 Well. . . ?
2 Well. . . ?

h
a

L
a

h
a

Each node computes a


differentiable function1

Directed Acyclic Graph2

L
a

Backpropagation Algorithm Automatic Differentiation

x1

h1

f
h1

f
x2

I
I

h2

f
h2

L
f

Generalized approach to computing partial derivatives


As long as your neural network fits the requirements, you do not
need to derive the derivatives yourself!
I

Theano, Torch, . . .

Any question on backpropagation and automatic differentiation?

Regularization (1) Maximum a Posteriori

Probabilistic Perspective: Find a model M by. . .


I

Maximum Likelihood (ML): arg max p(D | M)

Maximum a Posteriori (MAP): arg max p(M | D)

What is the probability of given a current data D?


p(D | M)p(M)
p(D | M)p(M)
p(M | D) = P
M p(D, M)
P(M): What do we think a good model should be?

Regularization (2) Weight Decay

P(M): What do we think a good model should be?


Weight-Decay Regularization
I



1
Prior Distribution: j N 0, (M)

Maximum a Posteriori Estimation


= arg max

N
X
n=1

p(y n | x n , ) +

M
X
j=1

j2 .

Regularization (3) Smoothness and Noise Injection


Prior on a Model: Smoothness
I

f (x) f (x + ): the model should be insensitive to small


PN (x n ) 2
change/noise Minimize n=1 fx

PN (x n ) 2
Regularizing n=1 fx
is equivalent to adding random Gaussian
noise in the input (Bishop, 1995)
N
X



f (x n ) 2


arg min
kf (x ) y k +

x

n=1
!
N
X
X
n
n 2

p() arg min


kf (x +) y k


n 2

n=1

Regularization (4a) Ensemble Learning and Dropout


Wisdom of the Crowd: Train M classifiers and let them vote
f (x) =

M
1 X
fM (x)
M m=1 m

(Ciresan et al., 2012)

Regularization (4b) Ensemble Learning and Dropout


Dropout: Train one, but exponentially many classifiers
L() =

(Hinton et al., 2012)

N
N
1 X
1 X
log Em [p(y n , m | x n , )]
Em [log p(y n , m | x n , )]
N n=1
N n=1

x1

...

x2

h1
h2
h3

hH

Each update samples one network out of exponentially many classifiers.



t = t1 t l (x 0 , y 0 ), m | t1 ,
where mi,l B (0.5).

Regularization (4b) Ensemble Learning and Dropout


Dropout: When testing, halve the activations


N
N
1 X
1
1 X
n
n
n
n
Em [log p(y , m | x , )]
p y ,m = | x ,
L()
N n=1
N n=1
2

x1

h2

h3

2
2
2

...

x2

h1

hH

Do you see why I spent so much time on regularization?

Common Recipe for Deep Neural Networks


1. Use a piecewise linear hidden unit
I
I

Rectifier: h(x) = max {0, x} (Glorot&Bengio, 2011)


Maxout: h({x1 , . . . , xp }) = max {x1 , . . . , xp } (Goodfellow et al., 2013)

2. Preprocess data and choose features carefully


I

I
I
I

Images: Whitening? Local contrast normalization? Raw? SIFT?


HoG?
Speech: Raw? Spectrum?
Text: Characters? Words? Tree?
General: z-Normalization?

3. Use Dropout and other regularization methods


4. Unsupervised Pretraining (Hinton&Salakhutdinov, 2006)
I

Few labeled samples, but a lot of unlabeled samples

5. Carefully search for hyperparameters


I

Random search, Bayesian optimization


(Bergstra&Bengio, 2013;Bergstra et al., 2011;Snoek et al.,2012)

6. Often, deeper the better


7. Build an ensemble of neural networks

But, nobody seems to use a vanilla multilayer perceptron, right?

How to Encode Prior/Domain Knowledge?

Data Preprocessing and Feature Extraction:


I Object recognition from images
I

Lighting condition shouldnt matter Contrast normalization

Gesture recognition from skeleton


Relative positions of joints are important Relative coordinate
system w.r.t. the body center

Language Processing
I

Word counts are important Bag-of-Words representation

Model Architecture Design

Convolutional Neural Networks (1)

Suitable for Images, Videos and Speech


Prior/Domain Knowledge
I

Translation invariance

Rotation invariance: Images and Videos

Temporal invariance: Videos and Speech

Frequency invariance: Speech

Convolutional Neural Networks (2) Convolution and


Pooling
Convolution
I

Global

Pooling
I

Local

max

max

max

Convolutional Neural Networks (3)


Convolutional Neural Network

(TeraDeep, 2013)

Convolutional Layer
1. Contrast Normalization
2. Convolution
3. Pooling
4. Nonlinearity

Convolutional Neural Networks (3)


Deep Convolutional Neural Network

(Krizhevsky et al., 2012)

Recursive Neural Networks (1)

Suitable for Text and Variable-Length Sequences


Prior/Domain Knowledge
I Compositionality Tree-based Grammar (?)
I

Location invariance

Variable Length

Recursive Neural Networks (2)

Compositional Structure

A small crowd quietly enters the historic church


Small (local) pieces are glued together to form a global structure

Recursive Neural Networks (3)

Finding a good, compact representation of variable-length sequence

(Socher et al., 2011)

What other architectures can you think of?

Further Topics

Is learning solved for supervised neural networks?

Recurrent neural networks: cope with variable-length inputs/outputs

Beyond sigmoid and rectifier functions

Unsupervised Neural Networks


Kyunghyun Cho
Laboratoire dInformatique des Systmes Adaptatifs,
Dpartement dinformatique et de recherche oprationnelle,
Facult des arts et des sciences,
Universit de Montral
cho.k.hyun@gmail.com (chokyun@iro.umontreal.ca)

Warning!!!

The first half of this session can be boring.

Unsupervised Learning

No more label!

D = {x1 , x2 , . . . , xN }
What can we do?

(Exploratory) Data Analysis


The most important step in machine learning

150
100
50
z

0
150
y

50
100
100
50
150

0
200

150

100

50

50
x

100

150

200

Human Gestures

250

300

Visualization by DNN
(Cho&Chen, 2014)

Feature Extraction

With domain knowledge Engineered Features


Without domain knowledge Learned Features

f?

Generative Model: Probabilistic Picture

Underlying distribution:
X PD
Data:
D = {x | x PD }
Find a distribution p(x), using D, that emulates PD as well as possible
on new samples from p(x) potentially not included in D.

What should we do with data D = {xn }Nn=1

Example target tasks


I
I
I
I
I

Classification p(xclass | xins )


Missing value reconstruction p(xm | xo )
Denoising p(x | x)
Structured Output Prediction p(xout | xin )
Outlier Detection p(x)? >

Ultimately, it comes down to learning a distribution of x.

Density/Distribution Estimation (1)

Latent Variable Models p (x)


= arg max

X
1 X
log
p(x n | h)p(h)
N n=1
h

1. Define a parametric form of joint distribution p (x, h)


2. Derive a learning rule for

Density/Distribution Estimation
(2) Restricted Boltzmann Machines
1. Joint distribution

h1

h2

hq

x1

x2

xp

0
1
p
q
X
X
1
p (x, h) =
exp @
xi hj wij A
Z ()
i =1 j=1

2. Marginal distribution
X
h

q
p
X
1 Y
1 + exp
wi ,j xi
p (x, h) =
Z ()
j=1

!!

i =1

3. Learning rule: Maximum Likelihood


2
L() =ExPd 4log

q
Y
j=1

1 + exp

p
X

!!
wi ,j xi

3
log Z ()5

i =1

wij =hxi hj id hxi hj im

(Smolensky, 1986)

Density/Distribution Estimation (3) Belief Networks


...
...
...

1. Joint distribution
p (x, h) = p(x | h1 )p(h1 | h2 ) p(h)

2. Marginal distribution
X

p (x, h) = ?

...
3. Learning rule: Maximum Likelihood
"
L() =ExPd

#
X

p (x, h)

(Neal, 1996)

Density/Distribution Estimation (4) NADE

1. Joint distribution
p (x) = p (x1 )p (x2 | x1 ) p (xd | x1 , . . . , xd 1 )

2. Marginal distribution: no latent variable h

3. Learning rule:
3. Maximum Likelihood (fixed order ), Order-agnostic (all orders)
(Larochelle&Murray, 2011)

Density/Distribution Estimation (4) Issues


Intractability! Intractability! Intractability!
(General) Boltzmann Machines
I Normalization Constant Z ()
P
I Marginal Probability
h p(x, h)
I Posterior Probability p(h | x)
I Conditional Probability
p(xmis | xobs )
Belief Networks
I Normalization Constant Z ()
P
I Marginal Probability
h p(x, h)
I Posterior Probability p(h | x)
I Conditional Probability
p(xmis | xobs )

Restricted Boltzmann Machines


I Normalization Constant Z ()
P
I Marginal Probability
h p(x, h)
I Posterior Probability p(h | x)
I Conditional Probability
p(xmis | xobs )
NADE
I Normalization Constant Z ()
P
I Marginal Probability
h p(x, h)
I Posterior Probability p(h | x)
I Conditional Probability
p(xmis | xobs )
I Somewhat unsatisfactory
performance

Do we want to learn the distribution?

Generative Model (1) Learn to Infer

Example target tasks


I
I
I

Classification p(xclass | xins )


Missing value reconstruction p(xm | xo )
Denoising p(x | x)

Structured Output Prediction p(xout | xin )

Outlier Detection p(x)? >

All we want is to infer the conditional distribution of unknown variables


(Goodfellow et al., 2013; Brakel et al., 2013; Stoyanov et al., 2011, Raiko et al., 2014)

Generative Model (2) Learn to Infer

Approximate Inference in a Graphical Model


p(xmis | xobs ) Q(xmis | xobs )
Methods:
I

Loopy belief propagation

Variational inference/message-passing

At the end of the day. . .


Q k (xmis | xobs ) = f Q k1 (xmis | xobs )
until convergence

Generative Model (3) Learn to Infer


h

h<1>

h<2>

...

h<k>

...
x

x <0>

x <1>

x <2>

x <k>

Approximate Inference in Restricted Boltzmann Machine


p(xmis | xobs ) Q(xmis | xobs )
Mean-field Fixed-point Iteration


kx = W W> k1
+c +b
x
At the end of the day, a multilayer perceptron with k 1 hidden layers.
Use backpropagation and stochastic gradient descent!

Generative Model (4) Learn to Infer NADE-k


h

<1>

<2>

...

h<1>
[1]

<k>

...
x

<0>

<1>

<2>

<k>

<1>
h[2]
h<2>
[1]

v <0>

<2>
h[2]

v <1>

V
v <2>

Further Generalization with Deep Neural Networks


p(xmis | xobs ) Q(xmis | xobs ) = f (xobs )
Interpret the model as a mixture of NADEs with different orders of
variables
I

Exact computation of p(x) possible

Fast inference p(xmis | xobs )


Flexible

(Raiko et al., 2014)

Generative Model (5) Learn to Infer

Lesson:
- Do not maximize log-likelihood, but minimize the actual cost!

- Dont do what a model tells you to do, but do what you aim to do.

But, popular science journalists dont care about generative models..

Manifold Learning Semi-Supervised Learning (1)

???

Which class does the dot belong to, red or blue?

Manifold Learning Semi-Supervised Learning (2)

???

Now, which class does the dot belong to, red or blue?

The black dots are unlabeled

Manifold Learning New Representation (1)


Representation on the data manifold?
Hidden space

1. should reflect changes along


the manifold
(xi ) 6= (xj ), for all xi , xj D

(x)

2. should not reflect any


change orthogonal to the
manifold
(xi + ) = (xi )

Data space

Manifold Learning New Representation (2)


Denoising Autoencoder

(Vincent et al., 2011)


Hidden space

Representation that capture manifold


1. (xi ) 6= (xj ), for all xi , xj D
2. (xi + ) = (xi )
Denoising autoencoder achieves it by
(x)

min0 kx g0 (f (x + ))k
,

Data space

Semi-Supervised Learning in Action (1)


Layer-wise Pretraining

h1

h3

h2

Semi-Supervised Learning in Action (2)


Layer-wise Pretraining

y
Pretraining (2nd layer)
Pretraining

(1st

h[2]

h[2]

h[1]

h[1]

layer)
h[1]
x

(Hinton&Salakhutdinov, 2006; Bengio et al., 2007; Ranzato et al., 2007)

Manifold Embedding and Visualization (1)

Manifold Embedding: M Rd Rq , q  d

If q = 2 or 3, data visualization

Data

z1
z2
z2
z1

(Oja, 1991; Kramer, 1991; Hinton & Salakhutdinov, 2006)

Manifold Embedding and Visualization (2)

196

Handwritten Digits [0, 1]

R2

0
1
2
3
4
5
6
7
8
9

Pose Frame Data R30 R2

rotateArmsLBack
rotateArmsRBack
rotateArmsBBack

What other applications can you think of?

Advanced Topics
Kyunghyun Cho
Laboratoire dInformatique des Systmes Adaptatifs,
Dpartement dinformatique et de recherche oprationnelle,
Facult des arts et des sciences,
Universit de Montral
cho.k.hyun@gmail.com (chokyun@iro.umontreal.ca)

Is deep learning all about computer vision and speech recognition?

Deep Reinforcement Learning


Q Learning
a
h3
h2

Q(s, a): state-action function

Action at time t
= arg maxa[1,j] Q(s, a)

Update Q on-the-fly

h1 Deep Q Learning
s

Model Q with a deep neural network

Predict Q(aj , ) for all j at once

State s: visual perception


not internal states!

(Mnih et al., 2013)

Natural Language Processing

In neuropsychology, linguistics and the philosophy of language, a natural


language or ordinary language is any language which arises,
unpremeditated, in the brains of human beings.
Wikipedia

Natural Language Processing

To machine learning researchers:


Natural Language is a huge set of variable-length sequences of
high-dimensional vectors.

Natural Language Processing (1)


How should we represent a linguistic symbol?
Say, we have four symbols (words):
[EU], [3], [France], [three]
Most nave, uninformative coding:
[EU] = [1, 0, 0, 0]
[3] = [0, 1, 0, 0]
[France] = [0, 0, 1, 0]
[three] = [0, 0, 0, 1]

Not satisfying..

Natural Language Processing (2)


How should we represent a linguistic symbol?
Say, we have four symbols (words):
[EU], [3], [France], [three]
Is there a representation that preserves the similarities of meanings of
symbols?
D([EU] , [France]) < D([EU] , [3] ,
D([3] , [three]) < D([France] , [three] ,
D([3] , [three]) < 
..
.

Natural Language Processing (3)


Continuous-Space Representation
Sample sentences:
1. There are three teams left for the qualification.
2. 3 teams have passed the first round.
Task: Predict a following word given a current work [three]
Nave approach: build a table (so called n-gram)
I

(three, teams), (3, teams)

The table can grow arbitrarily.

Machine learning: compress the table into a continuous function


I

Map three and 3 to nearby points x in a continuous space

From x, map to [teams].

Natural Language Processing (4)


Continuous-Space Representation

(Cho et al., 2014)

Natural Language Processing (5)


Beyond Word Representation

(Cho et al., 2014)

NN: I am very powerful and can model anything as long as Im fed


enough computational resource.
SVM: But, you have to optimize a high-dimensional, non-convex
function which has many, many local minima!
NN: Really?

Advanced Optimization (1) Statistical Physics says. . .

Not really

Advanced Optimization (2)


Local Minima? Saddle Points?
4
3

1.0
0.5
0.0
0.5 Z
1.0
1.5
2.0
2.5

2
1
0
1
2
3
4

1.0
1.5

1.0

0.5

0.0

0.5

1.0

0.5

1.5

0.0
X

0.5

1.0

1.0

0.5
0.0
0.5 Y

1.0
0.5 Z
0.0
0.5

2
1
0
1

1.0

2
3
1.5
1.0
0.5
1.5 1.0
0.0
0.5 0.0
0.5 Y
1.0
0.5 1.0
X
1.5 1.5

1.0

0.5
0.0 Y
0.5
1.0

0.5

0.0
X

0.5

1.0

1.0

(Dauphin et al., 2014; Pascanu et al., 2014)

Advanced Optimization (3)


Beyond the 2nd-order Method
(Quasi-)Newton Method
H 1 L()
How well does the quadratic approximation hold when training neural
networks?
Saddle-Free Newton Method (very new!!)
|H|

L(),

where |H| is constructed by


|H| = U || V
when H = UV .

(Dauphin et al., 2014)

Lastly but not at all least,


is there any theoretical ground for using deep neural networks?

Theoretical Analysis
Deep Rectifier Networks Fold the Space

3.
1. Fold along the
vertical axis

2. Fold along the


horizontal axis

(a)
Input Space
First Layer Space
S10
S20

S40 S10
S30 S20

S40
S30

S40 S10

S4 S1
S3 S2
S20 S30
S10 S40

S30 S20
S30 S20
S40 S10

Second Layer
Space

(b)

(c)
(Montufar et al., 2014; Pascanu et al., 2014)

Is it the beginning of deep learning or the end of deep learning?

You might also like