DNN Tutorial

Deep Learning: Past, Present and Future (?)
Kyunghyun Cho
Laboratoire d'Informatique des Systèmes Adaptatifs,
Département d'informatique et de recherche opérationnelle,
Faculté des arts et des sciences,
Université de Montréal
cho.k.hyun@gmail.com (chokyun@iro.umontreal.ca)
Machine Learning
1. Learning: let the model M learn the data D
2. Inference: let the model M infer unknown quantities
Machine Learning & Perception
Learning
Inference
"Perception is the organization, identification, and interpretation of sensory information in order to represent and understand the environment." (Wikipedia)
(Farabet et al., 2013)
Machine Learning & Perception: Examples
Learning (data)      Inference (sensory information)   Inference (query)
Labeled Images       An image                          Is a cat in the image?
Transcribed Speech   A speech segment                  What is this person saying?
Paraphrases          A pair of sentences               Is this sentence a paraphrase?
Movie Ratings        Ratings of Y and by X             Will a user X like a movie Y?
Parallel Corpora     A Finnish sentence                What is "moi" in English?
A human possesses the best machinery for perception, called a brain.

But, how does our brain do it?
Deep Learning: Motivated from Human Learning
(Van Essen&Gallant, 1994)
Learn massive data
simple functions
Multi-layered
(Krizhevsky et al., 2012)
"Boltzmann machines? I remember working on them in the 80s and 90s..."
Anonymous Interviewer, 2011 (paraphrased)
Deep Learning: History
1958 Rosenblatt proposed perceptrons
1980 Neocognitron (Fukushima, 1980)
1982 Hopfield network, SOM (Kohonen, 1982), Neural PCA (Oja, 1982)
1985 Boltzmann machines (Ackley et al., 1985)
1986 Multilayer perceptrons and backpropagation (Rumelhart et al., 1986)
1988 RBF networks (Broomhead&Lowe, 1988)
1989 Autoencoders (Baldi&Hornik, 1989), Convolutional network (LeCun, 1989)
1992 Sigmoid belief network (Neal, 1992)
1993 Sparse coding (Field, 1993)
Why all this fuss about deep learning now?
ImageNet: ILSVRC 2012 Classification Task

Top Rankers
1. SuperVision (0.153): Deep Conv. Neural Network (Krizhevsky et al.)
2. ISI (0.262): Features + FV + Linear classifier (Gunji et al.)
3. OXFORD_VGG (0.270): Features + FV + SVM
(Simonyan et al.)
4. XRCE/INRIA (0.271): SIFT + FV + PQ + SVM
(Perronin et al.)
5. University of Amsterdam (0.300): Color desc. + SVM
(van de Sande et al.)
ImageNet: ILSVRC 2013 Classification Task

Top Rankers
1. Clarifai (0.117): Deep Convolutional Neural Networks (Zeiler)
2. NUS: Deep Convolutional Neural Networks
3. ZF: Deep Convolutional Neural Networks
4. Andrew Howard: Deep Convolutional Neural Networks
5. OverFeat: Deep Convolutional Neural Networks
6. UvA-Euvision: Deep Convolutional Neural Networks
7. Adobe: Deep Convolutional Neural Networks
8. VGG: Deep Convolutional Neural Networks
9. CognitiveVision: Deep Convolutional Neural Networks
10. decaf: Deep Convolutional Neural Networks
11. IBM Multimedia Team: Deep Convolutional Neural Networks
12. Deep Punx (0.209): Deep Convolutional Neural Networks
13. MIL (0.244): Local image descriptors + FV + linear classifier (Hidaka et al.)
14. Minerva-MSRA: Deep Convolutional Neural Networks
15. Orange: Deep Convolutional Neural Networks
16. BUPT-Orange: Deep Convolutional Neural Networks
17. Trimps-Soushen1: Deep Convolutional Neural Networks
18. QuantumLeap: 15 features + RVM (Shu&Shu)
You're already using deep learning!
How do you tell deep learning from not-so-deep machine learning?
Not-So-Deep Machine Learning
1. Feature engineering (not learned!)
2. Learning
3. Inference
[Figure: pipeline from inputs x1, x2, ..., xN through features, with stages marked (1), (2), (3)]
Separation between domain knowledge and general machine learning
Unsupervised Learning of Representation

1a. Feature engineering
1b. Feature/Representation learning
2. Learning
3. Inference
[Figure: pipeline from inputs x1, x2, ..., xN through features, with stages marked (1a), (1b), (2), (3)]
Deep Learning: toward the Ultimate Machine Learning?
1. Jointly learn everything
2. Inference
"The data decides" (Yoshua Bengio)
Why now? Why not 20 years ago?
What has happened in the last 20+ years?
We have connected the dots, e.g.,
- PCA ⇔ Neural PCA ⇔ Probabilistic PCA ⇔ Autoencoder
- Autoencoder ⇔ Belief network ⇔ Restricted Boltzmann machine
We understand learning better
- Model structure matters a lot
- Learning is, but is not, optimization
- No need to be scared of non-convex optimization
We understand how learning and inference interact
Exponential growth of the amount of data and computational power
And Beyond. . .
(Goodfellow, 2013)
Today's Tutorial: Introduction to Deep Learning
1. Deep Learning: Past, Present and Future
2. Supervised Neural Networks
   - Multilayer Perceptron and Learning
   - Regularization
   - Practical Recipe
   - Task-specific Neural Network
3. Unsupervised Neural Networks
   - Unsupervised Learning
   - Generative Modeling with Neural Networks
     - Density/Distribution Learning
     - Learning to Infer
   - Semi-Supervised Learning and Pretraining
4. Advanced Topics
   - Beyond Computer Vision
   - Advanced Topics
Supervised Neural Networks

Kyunghyun Cho
cho.k.hyun@gmail.com, kyunghyun.cho@umontreal.ca
Warning!!!
The next 9-10 slides may be extremely boring.
Supervised Learning: Rough Picture
Data:
\[ D = \{ (x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N) \} \]
Assumption:
\[ y_n = f^*(x_n) \]
Find a function $f$, using $D$, that emulates $f^*$ as well as possible on samples potentially not included in $D$.
Supervised Learning: Probabilistic Picture
Underlying distributions:
\[ X \quad \text{and} \quad Y \mid X \]
Data:
\[ D = \{ (x, y) \mid x \sim X,\; y \sim Y \mid X = x \} \]
Find a distribution $p(y \mid x)$, using $D$, that emulates $Y \mid X$ as well as possible on new samples from $p(x)$ potentially not included in $D$.
Supervised Learning: Evaluation and Generalization
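Evaluation:
\[ \mathbb{E}_{x \sim p(x)} \left[ \| f(x) - f^*(x) \|_p \right] \approx \left\langle \| f(x) - f^*(x) \|_p \right\rangle_{x \in D_{\text{test}}} \]
or
\[ \mathbb{E}_{x \sim p(x)} \left[ \mathrm{KL}\left( p(y \mid x) \,\|\, p^*(y \mid x) \right) \right] \approx \left\langle \mathrm{KL}(p \,\|\, p^*) \right\rangle_{x \in D_{\text{test}}}, \]
where $D \neq D_{\text{test}}$, and $f^*$ and $p^*$ denote the true function and the true conditional distribution.
Why do we test the found solution $f$ or $p$ on $D_{\text{test}}$, not on $D$?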
Linear/Ridge Regression
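Underlying assumption:
\[ y = W x + b + \epsilon, \]
where $\epsilon$ is a white Gaussian noise.
Data:
\[ D = \{ (x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N) \}, \]
where $y_n = W x_n + b + \epsilon$.
Learning:
\[ \hat{W}, \hat{b} = \arg\min_{W, b} \frac{1}{N} \sum_{n=1}^{N} \| W x_n + b - y_n \|_2^2 + \lambda \| W \|_F^2 + \lambda \| b \|_2^2 , \]
where $\lambda \geq 0$ sets the strength of the regularization.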
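The learning problem above has a closed-form solution. The following is a minimal NumPy sketch, assuming a single coefficient (here called `lam`) multiplies both penalty terms; the function names, the toy data and the row-vector layout (predictions are `X @ W + b`, so `W` is the transpose of the slide's W) are illustrative choices, not taken from the slides.

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Minimize (1/N) * ||Xa @ Theta - Y||_F^2 + lam * ||Theta||_F^2.

    Xa is X with a column of ones appended, so the last row of Theta is b.
    """
    N = X.shape[0]
    Xa = np.hstack([X, np.ones((N, 1))])
    # Setting the gradient to zero gives (Xa^T Xa + N * lam * I) Theta = Xa^T Y.
    A = Xa.T @ Xa + N * lam * np.eye(Xa.shape[1])
    Theta = np.linalg.solve(A, Xa.T @ Y)
    return Theta[:-1], Theta[-1]  # weights (d x k), bias (k,)

def ridge_objective_grad(X, Y, W, b, lam):
    """Gradient of the objective with respect to the stacked parameters [W; b]."""
    N = X.shape[0]
    Xa = np.hstack([X, np.ones((N, 1))])
    Theta = np.vstack([W, b[None, :]])
    return (2.0 / N) * Xa.T @ (Xa @ Theta - Y) + 2.0 * lam * Theta

# Toy data generated from the assumed model y = W x + b + noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
W_true = np.array([[1.0], [-2.0], [0.5]])
Y = X @ W_true + 0.3 + 0.01 * rng.normal(size=(200, 1))
W_hat, b_hat = ridge_fit(X, Y, lam=1e-3)
```

Setting `lam=0` recovers plain linear regression; the gradient helper is only there to check that the returned parameters are a stationary point.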
Multilayer Perceptron: (Binary) Classification

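Underlying assumption:
\[ y \sim \mathcal{B}\left( p = f_\theta(x) \right), \]
where $f_\theta$ is a nonlinear function parameterized with $\theta$ and $\mathcal{B}(p)$ is a Bernoulli distribution of mean $p$.
Data:
\[ D = \{ (x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N) \}, \]
where $y_n \sim \mathcal{B}\left( p = f_\theta(x_n) \right)$.
Learning:
\[ \hat{\theta} = \arg\min_\theta \; -\frac{1}{N} \sum_{n=1}^{N} \left[ y_n \log f_\theta(x_n) + (1 - y_n) \log\left( 1 - f_\theta(x_n) \right) \right] + \Omega(\theta, D), \]
where $\Omega(\theta, D)$ is a regularization term.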
Learning as an Optimization
Ultimately, learning is (mostly)
\[ \hat{\theta} = \arg\min_\theta \frac{1}{N} \sum_{n=1}^{N} c\left( (x_n, y_n) \mid \theta \right) + \Omega(\theta, D), \]
where $c\left( (x, y) \mid \theta \right)$ is a per-sample cost function.
Gradient Descent
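Gradient-descent Algorithm:
\[ \theta^{t} = \theta^{t-1} - \eta \nabla L(\theta^{t-1}), \]
where, in our case,
\[ L(\theta) = \frac{1}{N} \sum_{n=1}^{N} l\left( (x_n, y_n) \mid \theta \right). \]
Let us assume that $\Omega(\theta, D) = 0$.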
Stochastic Gradient Descent

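Often, it is too costly to compute $\nabla L(\theta)$ due to a large training set.
Stochastic gradient descent algorithm:
\[ \theta^{t} = \theta^{t-1} - \eta^{t} \nabla l\left( (x', y') \mid \theta^{t-1} \right), \]
where $(x', y')$ is a randomly chosen sample from $D$, and
\[ \sum_{t=1}^{\infty} \eta^{t} = \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \left( \eta^{t} \right)^2 < \infty. \]
Let us assume that $\Omega(\theta, D) = 0$.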
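The stochastic update above can be sketched in a few lines of NumPy. This is an illustrative example only: the squared-error per-sample cost, the schedule `1 / (t + t0)` (one choice whose sum diverges while the sum of squares converges) and the toy data are assumptions, not part of the slides.

```python
import numpy as np

def sgd_linear(X, y, n_steps=5000, t0=10.0, seed=0):
    """SGD on the per-sample cost l = 0.5 * (w @ x - y)^2 with eta_t = 1 / (t + t0)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, n_steps + 1):
        i = rng.integers(len(X))          # randomly chosen sample (x', y')
        grad = (w @ X[i] - y[i]) * X[i]   # gradient of the per-sample cost
        eta = 1.0 / (t + t0)              # decaying learning rate
        w -= eta * grad
    return w

# Toy regression problem with a known weight vector.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=1000)
w_sgd = sgd_linear(X, y)
```

Each step touches a single sample, so the cost per update does not grow with the size of the training set.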
Any question so far?
Almost there. . .
How do we compute the gradient efficiently for deep neural networks?
Backpropagation Algorithm (1) Forward Pass

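[Figure: computational graph in which x1 and x2 feed h1 and h2, which feed f]
Forward Computation:
\[ L\left( f\left( h_1(x_1, x_2, \theta_{h_1}),\, h_2(x_1, x_2, \theta_{h_2}),\, \theta_f \right), y \right) \]
Multilayer Perceptron with a single hidden layer:
\[ L(x, y, \theta) = \frac{1}{2} \left( y - U^\top \phi\left( W^\top x \right) \right)^2 , \]
where $\phi$ is an element-wise nonlinearity.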
Backpropagation Algorithm (2) Chain Rule
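[Figure: the computational graph with each edge labeled by its local partial derivative]
Chain rule of derivatives:
\[ \frac{\partial L}{\partial x_1} = \frac{\partial L}{\partial f} \frac{\partial f}{\partial x_1} = \frac{\partial L}{\partial f} \left( \frac{\partial f}{\partial h_1} \frac{\partial h_1}{\partial x_1} + \frac{\partial f}{\partial h_2} \frac{\partial h_2}{\partial x_1} \right) \]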
Backpropagation Algorithm (3) Shared Derivatives
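[Figure: the computational graph with the derivatives shared between x1 and x2 highlighted]
Local derivatives are shared:
\[ \frac{\partial L}{\partial x_1} = \frac{\partial L}{\partial f} \left( \frac{\partial f}{\partial h_1} \frac{\partial h_1}{\partial x_1} + \frac{\partial f}{\partial h_2} \frac{\partial h_2}{\partial x_1} \right) \]
\[ \frac{\partial L}{\partial x_2} = \frac{\partial L}{\partial f} \left( \frac{\partial f}{\partial h_1} \frac{\partial h_1}{\partial x_2} + \frac{\partial f}{\partial h_2} \frac{\partial h_2}{\partial x_2} \right) \]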
Backpropagation Algorithm (4) Local Computation
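[Figure: a single node h with inputs a, receiving the derivative of L with respect to h and passing back the derivatives of L with respect to each a]
Each node computes
- Forward: $h(a_1, a_2, \ldots, a_q)$
- Backward: $\frac{\partial h}{\partial a_1}, \frac{\partial h}{\partial a_2}, \ldots, \frac{\partial h}{\partial a_q}$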
Backpropagation Algorithm Requirements
[Figure: a single node h with inputs a and the derivatives flowing backward through it]
- Each node computes a differentiable function [1]
- Directed Acyclic Graph [2]
[1] Well...?
[2] Well...?
Backpropagation Algorithm Automatic Differentiation
[Figure: the computational graph from x1 and x2 through h1 and h2 to f, with local derivatives on the edges]
- Generalized approach to computing partial derivatives
- As long as your neural network fits the requirements, you do not need to derive the derivatives yourself!
- Theano, Torch, ...
Any question on backpropagation and automatic differentiation?
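To make the forward and backward passes concrete, here is a small NumPy sketch for the single-hidden-layer loss from the forward-pass slide. It assumes `tanh` as the hidden nonlinearity and a scalar output; those choices, the variable names and the finite-difference check are illustrative and not taken from the slides.

```python
import numpy as np

def forward_backward(x, y, W, U):
    """Loss L = 0.5 * (y - U^T tanh(W^T x))^2 and its gradients via the chain rule."""
    a = W.T @ x              # hidden pre-activations
    h = np.tanh(a)           # hidden activations
    f = U @ h                # scalar output
    loss = 0.5 * (y - f) ** 2
    # Backward pass: each node multiplies the incoming derivative by its local one.
    dL_df = f - y
    dL_dU = dL_df * h
    dL_dh = dL_df * U
    dL_da = dL_dh * (1.0 - h ** 2)   # local derivative of tanh
    dL_dW = np.outer(x, dL_da)
    return loss, dL_dW, dL_dU

def numerical_grad(fn, P, eps=1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    G = np.zeros_like(P)
    for idx in np.ndindex(P.shape):
        old = P[idx]
        P[idx] = old + eps
        loss_plus = fn()
        P[idx] = old - eps
        loss_minus = fn()
        P[idx] = old
        G[idx] = (loss_plus - loss_minus) / (2 * eps)
    return G

rng = np.random.default_rng(0)
x, y = rng.normal(size=4), 0.7
W, U = rng.normal(size=(4, 3)), rng.normal(size=3)
loss, gW, gU = forward_backward(x, y, W, U)
gW_num = numerical_grad(lambda: forward_backward(x, y, W, U)[0], W)
gU_num = numerical_grad(lambda: forward_backward(x, y, W, U)[0], U)
```

Note how `dL_df` is computed once and reused for both `dL_dU` and `dL_dW`, which is the sharing of local derivatives described earlier. An automatic differentiation library performs the same bookkeeping for arbitrary acyclic graphs.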
Regularization (1) Maximum a Posteriori
Probabilistic Perspective: Find a model M by...
- Maximum Likelihood (ML): $\arg\max_M p(D \mid M)$
- Maximum a Posteriori (MAP): $\arg\max_M p(M \mid D)$
What is the probability of M given the current data D?
\[ p(M \mid D) = \frac{p(D \mid M)\, p(M)}{\sum_M p(D, M)} \propto p(D \mid M)\, p(M) \]
p(M): What do we think a good model should be?
Regularization (2) Weight Decay
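p(M): What do we think a good model should be?
Weight-Decay Regularization
- Prior Distribution: $\theta_j \sim \mathcal{N}\left( 0, (M\lambda)^{-1} \right)$, where $M$ is the number of parameters
- Maximum a Posteriori Estimation:
\[ \hat{\theta} = \arg\max_\theta \sum_{n=1}^{N} \log p(y^n \mid x^n, \theta) - \frac{M\lambda}{2} \sum_{j=1}^{M} \theta_j^2 . \]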
Regularization (3) Smoothness and Noise Injection

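Prior on a Model: Smoothness
- $f(x) \approx f(x + \epsilon)$: the model should be insensitive to small change/noise, so minimize $\sum_{n=1}^{N} \left\| \frac{\partial f(x^n)}{\partial x} \right\|^2$
- Regularizing $\sum_{n=1}^{N} \left\| \frac{\partial f(x^n)}{\partial x} \right\|^2$ is equivalent to adding random Gaussian noise in the input (Bishop, 1995):
\[ \arg\min_\theta \sum_{n=1}^{N} \left( \left\| f(x^n) - y^n \right\|^2 + \lambda \left\| \frac{\partial f(x^n)}{\partial x} \right\|^2 \right) \approx \arg\min_\theta \sum_{n=1}^{N} \sum_{\epsilon \sim p(\epsilon)} \left\| f(x^n + \epsilon) - y^n \right\|^2 \]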
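A short first-order derivation may make this equivalence concrete. It assumes zero-mean input noise with covariance $\sigma^2 I$ and writes $J$ for the Jacobian of $f$ at $x^n$; both $\sigma$ and $J$ are notation introduced here for illustration, not taken from the slide.

```latex
\begin{aligned}
f(x^n + \epsilon) &\approx f(x^n) + J\epsilon,
  \qquad \mathbb{E}[\epsilon] = 0,\quad \mathbb{E}[\epsilon\epsilon^\top] = \sigma^2 I, \\
\mathbb{E}_\epsilon \left\| f(x^n + \epsilon) - y^n \right\|^2
  &\approx \left\| f(x^n) - y^n \right\|^2
   + 2\,\bigl(f(x^n) - y^n\bigr)^\top J\,\mathbb{E}[\epsilon]
   + \mathbb{E}\left[\epsilon^\top J^\top J\,\epsilon\right] \\
  &= \left\| f(x^n) - y^n \right\|^2 + \sigma^2 \left\| J \right\|_F^2 .
\end{aligned}
```

Under this approximation the noise variance $\sigma^2$ plays the role of the regularization coefficient; for a linear $f$ the first line is exact, so the equality holds without approximation.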
Regularization (4a) Ensemble Learning and Dropout

Wisdom of the Crowd: Train M classifiers and let them vote
\[ f(x) = \frac{1}{M} \sum_{m=1}^{M} f_m(x) \]
(Ciresan et al., 2012)
Regularization (4b) Ensemble Learning and Dropout

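Dropout: Train one, but exponentially many classifiers (Hinton et al., 2012)
\[ L(\theta) = \frac{1}{N} \sum_{n=1}^{N} \log \mathbb{E}_m \left[ p(y^n, m \mid x^n, \theta) \right] \geq \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_m \left[ \log p(y^n, m \mid x^n, \theta) \right] \]
[Figure: network with inputs x1, x2, ... and hidden units h1, h2, h3, ..., hH]
Each update samples one network out of exponentially many classifiers:
\[ \theta^{t} = \theta^{t-1} - \eta^{t} \nabla l\left( (x', y'), m \mid \theta^{t-1} \right), \]
where $m_{i,l} \sim \mathcal{B}(0.5)$.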
Regularization (4b) Ensemble Learning and Dropout

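Dropout: When testing, halve the activations
\[ L(\theta) \geq \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_m \left[ \log p(y^n, m \mid x^n, \theta) \right] \approx \frac{1}{N} \sum_{n=1}^{N} \log p\left( y^n, m = \tfrac{1}{2} \mid x^n, \theta \right) \]
[Figure: the same network with every hidden activation h1, h2, h3, ..., hH multiplied by 1/2]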
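The two dropout modes can be sketched as follows. This is a toy NumPy illustration under stated assumptions: the keep probability 0.5 follows the slide, while the function names, the activation values and the single linear unit used for the check are invented for the example.

```python
import itertools
import numpy as np

def dropout_train(h, rng, p_keep=0.5):
    """Training: sample a binary mask m ~ Bernoulli(p_keep) and drop units."""
    m = rng.random(h.shape) < p_keep
    return h * m

def dropout_test(h, p_keep=0.5):
    """Testing: keep every unit but scale activations by p_keep (halve for 0.5)."""
    return h * p_keep

h = np.array([0.2, -1.0, 3.0, 0.5])   # hidden activations (illustrative values)
w = np.array([1.0, 2.0, -0.5, 4.0])   # weights into a linear unit of the next layer

# Average the next-layer input over all 2^4 masks, each equally likely.
masks = np.array(list(itertools.product([0.0, 1.0], repeat=len(h))))
avg_over_masks = np.mean((masks * h) @ w)
halved = dropout_test(h) @ w

# One sampled sub-network, as used in a single training update.
sampled = dropout_train(h, np.random.default_rng(0))
```

For a linear unit the average over all masks matches the halved activations exactly; once nonlinearities sit between layers, halving is only an approximation of that average.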
Do you see why I spent so much time on regularization?
Common Recipe for Deep Neural Networks

1. Use a piecewise linear hidden unit
   - Rectifier: $h(x) = \max\{0, x\}$ (Glorot&Bengio, 2011)
   - Maxout: $h(\{x_1, \ldots, x_p\}) = \max\{x_1, \ldots, x_p\}$ (Goodfellow et al., 2013)
2. Preprocess data and choose features carefully
   - Images: Whitening? Local contrast normalization? Raw? SIFT? HoG?
   - Speech: Raw? Spectrum?
   - Text: Characters? Words? Tree?
   - General: z-Normalization?
3. Use Dropout and other regularization methods
4. Unsupervised Pretraining (Hinton&Salakhutdinov, 2006)
   - Few labeled samples, but a lot of unlabeled samples
5. Carefully search for hyperparameters
   - Random search, Bayesian optimization (Bergstra&Bengio, 2013; Bergstra et al., 2011; Snoek et al., 2012)
6. Often, the deeper the better
7. Build an ensemble of neural networks
But, nobody seems to use a vanilla multilayer perceptron, right?
How to Encode Prior/Domain Knowledge?
Data Preprocessing and Feature Extraction:
- Object recognition from images
  - Lighting condition shouldn't matter → Contrast normalization
- Gesture recognition from skeleton
  - Relative positions of joints are important → Relative coordinate system w.r.t. the body center
- Language Processing
  - Word counts are important → Bag-of-Words representation
Model Architecture Design
Convolutional Neural Networks (1)
Suitable for Images, Videos and Speech

Prior/Domain Knowledge
- Translation invariance
- Rotation invariance: Images and Videos
- Temporal invariance: Videos and Speech
- Frequency invariance: Speech
Convolutional Neural Networks (2) Convolution and Pooling
Convolution
- Global
Pooling
- Local max
[Figure: Convolutional Neural Network (TeraDeep, 2013)]
Convolutional Layer
1. Contrast Normalization
2. Convolution
3. Pooling
4. Nonlinearity
Deep Convolutional Neural Network
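A one-dimensional toy version of the convolution, pooling and nonlinearity steps listed above can be sketched in NumPy as follows. Contrast normalization is left out, and the filter, the signal and the pool size are illustrative assumptions rather than anything specified on the slides.

```python
import numpy as np

def conv1d_valid(x, k):
    """Slide the filter k over x (cross-correlation, no padding)."""
    return np.correlate(x, k, mode="valid")

def max_pool(z, size):
    """Non-overlapping local max pooling; a ragged tail is dropped."""
    n = (len(z) // size) * size
    return z[:n].reshape(-1, size).max(axis=1)

def relu(z):
    """Rectifier nonlinearity."""
    return np.maximum(z, 0.0)

k = np.array([1.0, 2.0, 1.0])                              # illustrative filter
x = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])    # one spike at position 2
x_shift = np.roll(x, 1)                                    # same pattern moved by one step

# Convolution, then local pooling, then nonlinearity.
feat = relu(max_pool(conv1d_valid(x, k), size=2))

# Pooling over the whole feature map gives the same response for both inputs.
global_max = conv1d_valid(x, k).max()
global_max_shift = conv1d_valid(x_shift, k).max()
```

The same filter weights are reused at every position, and the maximum over the feature map does not change when the pattern moves, which is one way the translation-invariance prior gets built into the architecture.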
Recursive Neural Networks (1)
Suitable for Text and Variable-Length Sequences

Prior/Domain Knowledge
- Compositionality → Tree-based Grammar (?)
- Location invariance
- Variable Length
Compositional Structure
"A small crowd quietly enters the historic church"
- Small (local) pieces are glued together to form a global structure
- Finding a good, compact representation of a variable-length sequence
(Socher et al., 2011)
What other architectures can you think of?
Further Topics
Is learning solved for supervised neural networks?
Recurrent neural networks: cope with variable-length inputs/outputs
Beyond sigmoid and rectifier functions
Unsupervised Neural Networks

Kyunghyun Cho
Warning!!!
The first half of this session can be boring.
Unsupervised Learning
No more labels!
\[ D = \{ x_1, x_2, \ldots, x_N \} \]
What can we do?
(Exploratory) Data Analysis

The most important step in machine learning
[Figure: Human Gestures data plotted on x, y, z axes, and its Visualization by DNN (Cho&Chen, 2014)]
Feature Extraction
With domain knowledge → Engineered Features
Without domain knowledge → Learned Features
f?
Generative Model: Probabilistic Picture
Underlying distribution:
\[ X \sim P_D \]
Data:
\[ D = \{ x \mid x \sim P_D \} \]
Find a distribution $p(x)$, using $D$, that emulates $P_D$ as well as possible on new samples from $P_D$ potentially not included in $D$.
What should we do with data $D = \{x^n\}_{n=1}^{N}$?
Example target tasks
- Classification $p(x_{\text{class}} \mid x_{\text{ins}})$
- Missing value reconstruction $p(x_m \mid x_o)$
- Denoising $p(x \mid \tilde{x})$
- Structured Output Prediction $p(x_{\text{out}} \mid x_{\text{in}})$
- Outlier Detection $p(x) \overset{?}{>} \tau$
Ultimately, it comes down to learning a distribution of x.
Density/Distribution Estimation (1)
Latent Variable Models $p_\theta(x)$
\[ \hat{\theta} = \arg\max_\theta \frac{1}{N} \sum_{n=1}^{N} \log \sum_{h} p(x^n \mid h)\, p(h) \]
1. Define a parametric form of joint distribution $p_\theta(x, h)$
2. Derive a learning rule for $\theta$
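Density/Distribution Estimation (2) Restricted Boltzmann Machines
[Figure: bipartite graph between hidden units h1, h2, ..., hq and visible units x1, x2, ..., xp]
1. Joint distribution
\[ p_\theta(x, h) = \frac{1}{Z(\theta)} \exp\left( \sum_{i=1}^{p} \sum_{j=1}^{q} x_i h_j w_{ij} \right) \]
2. Marginal distribution
\[ \sum_{h} p_\theta(x, h) = \frac{1}{Z(\theta)} \prod_{j=1}^{q} \left( 1 + \exp\left( \sum_{i=1}^{p} w_{ij} x_i \right) \right) \]
3. Learning rule: Maximum Likelihood
\[ L(\theta) = \mathbb{E}_{x \sim P_d} \left[ \log \prod_{j=1}^{q} \left( 1 + \exp\left( \sum_{i=1}^{p} w_{ij} x_i \right) \right) - \log Z(\theta) \right] \]
\[ \nabla_{w_{ij}} L(\theta) = \langle x_i h_j \rangle_d - \langle x_i h_j \rangle_m \]
(Smolensky, 1986)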
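The closed-form marginal can be checked numerically by summing the joint over every binary hidden configuration. The sketch below works with unnormalized quantities (the common factor 1/Z is omitted on both sides); the sizes, the random weights and the function names are illustrative assumptions.

```python
import itertools
import numpy as np

def unnorm_joint(x, h, W):
    """exp(sum_ij x_i h_j w_ij): the joint p(x, h) without the 1/Z factor."""
    return np.exp(x @ W @ h)

def unnorm_marginal(x, W):
    """prod_j (1 + exp(sum_i w_ij x_i)): the hidden units summed out analytically."""
    return np.prod(1.0 + np.exp(x @ W))

rng = np.random.default_rng(0)
p, q = 5, 3                                   # visible and hidden sizes (illustrative)
W = rng.normal(scale=0.5, size=(p, q))
x = rng.integers(0, 2, size=p).astype(float)

# Brute force: sum the joint over all 2^q binary hidden configurations.
brute = sum(unnorm_joint(x, np.array(h, dtype=float), W)
            for h in itertools.product([0, 1], repeat=q))
closed = unnorm_marginal(x, W)
```

The product form costs a number of operations linear in q, while the brute-force sum grows as 2^q; the normalization constant Z, which would need a sum over all visible configurations as well, is what remains expensive.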
Density/Distribution Estimation (3) Belief Networks

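[Figure: layered directed network]
1. Joint distribution
\[ p_\theta(x, h) = p(x \mid h_1)\, p(h_1 \mid h_2) \cdots p(h_L) \]
2. Marginal distribution
\[ \sum_{h} p_\theta(x, h) = \; ? \]
3. Learning rule: Maximum Likelihood
\[ L(\theta) = \mathbb{E}_{x \sim P_d} \left[ \log \sum_{h} p_\theta(x, h) \right] \]
(Neal, 1996)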
Density/Distribution Estimation (4) NADE
1. Joint distribution
\[ p_\theta(x) = p_\theta(x_1)\, p_\theta(x_2 \mid x_1) \cdots p_\theta(x_d \mid x_1, \ldots, x_{d-1}) \]
2. Marginal distribution: no latent variable h
3. Learning rule: Maximum Likelihood (fixed order), Order-agnostic (all orders)
(Larochelle&Murray, 2011)
Density/Distribution Estimation (4) Issues

Intractability! Intractability! Intractability!
(General) Boltzmann Machines
- Normalization Constant $Z(\theta)$
- Marginal Probability $\sum_h p(x, h)$
- Posterior Probability $p(h \mid x)$
- Conditional Probability $p(x_{\text{mis}} \mid x_{\text{obs}})$
Belief Networks
- $\sum_h p(x, h)$
- $p(x_{\text{mis}} \mid x_{\text{obs}})$
Restricted Boltzmann Machines
- $\sum_h p(x, h)$
- $p(x_{\text{mis}} \mid x_{\text{obs}})$
NADE
- $\sum_h p(x, h)$
- $p(x_{\text{mis}} \mid x_{\text{obs}})$
- Somewhat unsatisfactory performance
Do we want to learn the distribution?
Generative Model (1) Learn to Infer
Example target tasks
- Classification $p(x_{\text{class}} \mid x_{\text{ins}})$
- Missing value reconstruction $p(x_m \mid x_o)$
- Denoising $p(x \mid \tilde{x})$
- Structured Output Prediction $p(x_{\text{out}} \mid x_{\text{in}})$
- Outlier Detection $p(x) \overset{?}{>} \tau$
All we want is to infer the conditional distribution of unknown variables
(Goodfellow et al., 2013; Brakel et al., 2013; Stoyanov et al., 2011; Raiko et al., 2014)
Approximate Inference in a Graphical Model

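\[ p(x_{\text{mis}} \mid x_{\text{obs}}) \approx Q(x_{\text{mis}} \mid x_{\text{obs}}) \]
Methods:
- Loopy belief propagation
- Variational inference/message-passing
At the end of the day...
\[ Q^{k}(x_{\text{mis}} \mid x_{\text{obs}}) = f\left( Q^{k-1}(x_{\text{mis}} \mid x_{\text{obs}}) \right) \]
until convergence
[Figure: the iteration unrolled as a chain x<0>, h<1>, x<1>, h<2>, x<2>, ..., h<k>, x<k>]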
Approximate Inference in Restricted Boltzmann Machine

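\[ p(x_{\text{mis}} \mid x_{\text{obs}}) \approx Q(x_{\text{mis}} \mid x_{\text{obs}}) \]
Mean-field Fixed-point Iteration
\[ \mu_x^{k} = \sigma\left( W \sigma\left( W^\top \mu_x^{k-1} + c \right) + b \right) \]
At the end of the day, a multilayer perceptron with $k - 1$ hidden layers.
Use backpropagation and stochastic gradient descent!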
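Unrolling the fixed-point iteration for a few steps makes the "it is just a feedforward network" view concrete. The NumPy sketch below is a hedged illustration: clamping the observed entries after every step, starting the missing entries at 0.5, and all names and sizes are assumptions added for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_field_fill(x, observed, W, b, c, k=5):
    """Unroll k mean-field updates; observed entries stay clamped to x."""
    mu = np.where(observed, x, 0.5)           # start missing entries at 0.5
    for _ in range(k):
        h = sigmoid(W.T @ mu + c)             # hidden means given current visibles
        mu_new = sigmoid(W @ h + b)           # visible means given hidden means
        mu = np.where(observed, x, mu_new)    # only missing entries are updated
    return mu

rng = np.random.default_rng(0)
p, q = 6, 4                                   # visible and hidden sizes (illustrative)
W = rng.normal(scale=0.5, size=(p, q))
b, c = np.zeros(p), np.zeros(q)
x = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 0.0])
observed = np.array([True, True, True, False, False, False])
mu = mean_field_fill(x, observed, W, b, c)
```

Because every step is a differentiable function of W, b and c, the whole unrolled loop can be trained end to end on the reconstruction error of the missing entries instead of on the likelihood.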
Generative Model (4) Learn to Infer NADE-k

[Figure: NADE-k architecture, alternating visible states v<0>, v<1>, v<2>, ... with hidden layers h[1] and h[2] at each step]
Further Generalization with Deep Neural Networks
\[ p(x_{\text{mis}} \mid x_{\text{obs}}) \approx Q(x_{\text{mis}} \mid x_{\text{obs}}) = f(x_{\text{obs}}) \]
- Interpret the model as a mixture of NADEs with different orders of variables
- Exact computation of p(x) possible
- Fast inference $p(x_{\text{mis}} \mid x_{\text{obs}})$
- Flexible
(Raiko et al., 2014)
Lesson:
- Do not maximize log-likelihood, but minimize the actual cost!
- Don't do what a model tells you to do, but do what you aim to do.
But, popular science journalists don't care about generative models...
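Manifold Learning: Semi-Supervised Learning (1)
[Figure: labeled red and blue points and one query dot marked "???"]
Which class does the dot belong to, red or blue?
Manifold Learning: Semi-Supervised Learning (2)
[Figure: the same points with unlabeled black dots added]
Now, which class does the dot belong to, red or blue?
The black dots are unlabeled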
Manifold Learning: New Representation (1)
Representation $\phi$ on the data manifold?
1. $\phi$ should reflect changes along the manifold:
   $\phi(x_i) \neq \phi(x_j)$, for all $x_i, x_j \in D$
2. $\phi$ should not reflect any change orthogonal to the manifold:
   $\phi(x_i + \epsilon) = \phi(x_i)$
[Figure: mapping $\phi(x)$ from Data space to Hidden space]
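Manifold Learning: New Representation (2)
Denoising Autoencoder (Vincent et al., 2011)
Representation $\phi$ that captures the manifold
1. $\phi(x_i) \neq \phi(x_j)$, for all $x_i, x_j \in D$
2. $\phi(x_i + \epsilon) = \phi(x_i)$
Denoising autoencoder achieves it by
\[ \min_{\theta, \theta'} \left\| x - g_{\theta'}\left( f_\theta(x + \epsilon) \right) \right\| \]
[Figure: mapping $\phi(x)$ from Data space to Hidden space]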
Semi-Supervised Learning in Action (1)
Layer-wise Pretraining
[Figure: stacked hidden layers h1, h2, h3]
Semi-Supervised Learning in Action (2)
Layer-wise Pretraining
[Figure: pretraining of the 1st layer (x to h[1]), pretraining of the 2nd layer (h[1] to h[2]), then the full stack x, h[1], h[2], y]
(Hinton&Salakhutdinov, 2006; Bengio et al., 2007; Ranzato et al., 2007)
Manifold Embedding and Visualization (1)
Manifold Embedding: $\mathcal{M} \subset \mathbb{R}^d \to \mathbb{R}^q$, $q \ll d$
If q = 2 or 3, data visualization
[Figure: Data mapped to a 2-D embedding with coordinates z1, z2]
(Oja, 1991; Kramer, 1991; Hinton & Salakhutdinov, 2006)
Manifold Embedding and Visualization (2)
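Handwritten Digits: $[0, 1]^{196} \to \mathbb{R}^2$
[Figure: 2-D embedding with a legend for digit classes 0 to 9]
Pose Frame Data: $\mathbb{R}^{30} \to \mathbb{R}^2$
[Figure: 2-D embedding with gestures rotateArmsLBack, rotateArmsRBack, rotateArmsBBack]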
What other applications can you think of?
Advanced Topics
Kyunghyun Cho
Is deep learning all about computer vision and speech recognition?
Deep Reinforcement Learning

Q Learning
- Q(s, a): state-action function
- Action at time t = $\arg\max_{a \in [1, j]} Q(s, a)$
- Update Q on-the-fly
Deep Q Learning
- Model Q with a deep neural network
- Predict $Q(a_j, \cdot)$ for all j at once
- State s: visual perception, not internal states!
[Figure: network mapping the state s through hidden layers h1, h2, h3 to the action a]
(Mnih et al., 2013)
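A tabular toy version shows the two operations named above, picking the greedy action and updating Q on the fly. The slides do not spell out the update rule, so the sketch uses the standard Q-learning target; the step size `alpha`, the discount `gamma`, the reward and the table sizes are all illustrative assumptions.

```python
import numpy as np

def greedy_action(Q, s):
    """Action at time t: argmax over a of Q(s, a)."""
    return int(np.argmax(Q[s]))

def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """One on-the-fly update toward the target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

Q = np.zeros((3, 2))                          # 3 states, 2 actions (toy sizes)
Q = q_update(Q, s=0, a=1, r=1.0, s_next=2)    # observed transition with reward 1
best = greedy_action(Q, 0)
```

In the deep variant the table row `Q[s]` is replaced by the output layer of a network that takes the raw visual state as input and returns one value per action in a single forward pass.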
Natural Language Processing
"In neuropsychology, linguistics and the philosophy of language, a natural language or ordinary language is any language which arises, unpremeditated, in the brains of human beings." (Wikipedia)
Natural Language Processing
To machine learning researchers:

Natural Language is a huge set of variable-length sequences of
high-dimensional vectors.
Natural Language Processing (1)

How should we represent a linguistic symbol?
Say, we have four symbols (words):
[EU], [3], [France], [three]
Most naive, uninformative coding:
[EU] = [1, 0, 0, 0]
[3] = [0, 1, 0, 0]
[France] = [0, 0, 1, 0]
[three] = [0, 0, 0, 1]
Not satisfying..
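A quick check shows why this coding carries no information about meaning: every pair of distinct symbols ends up equally far apart. The sketch assumes Euclidean distance for D, which the slides do not specify.

```python
import numpy as np

words = ["EU", "3", "France", "three"]
one_hot = np.eye(len(words))                  # row i encodes words[i]

def dist(a, b):
    """Euclidean distance D between the codes of two words."""
    return np.linalg.norm(one_hot[words.index(a)] - one_hot[words.index(b)])

d_eu_france = dist("EU", "France")
d_eu_3 = dist("EU", "3")
d_3_three = dist("3", "three")
```

All three distances equal the square root of 2, so [3] is no closer to [three] than it is to [France].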

Is there a representation that preserves the similarities of meanings of symbols?
D([EU], [France]) < D([EU], [3]),
D([3], [three]) < D([France], [three]),
...

Continuous-Space Representation
Sample sentences:
1. There are three teams left for the qualification.
2. 3 teams have passed the first round.
Task: Predict the following word given the current word [three]
Naive approach: build a table (so-called n-gram)
- (three, teams), (3, teams)
- The table can grow arbitrarily.
Machine learning: compress the table into a continuous function
- Map three and 3 to nearby points x in a continuous space
- From x, map to [teams].

Continuous-Space Representation
(Cho et al., 2014)

Beyond Word Representation
(Cho et al., 2014)
NN: I am very powerful and can model anything as long as I'm fed enough computational resources.
SVM: But, you have to optimize a high-dimensional, non-convex function which has many, many local minima!
NN: Really?
Advanced Optimization (1) Statistical Physics says. . .
Not really
Advanced Optimization (2)

Local Minima? Saddle Points?
[Figure: 3-D surface plots over axes X, Y, Z contrasting local minima and saddle points]
(Dauphin et al., 2014; Pascanu et al., 2014)
Advanced Optimization (3)

Beyond the 2nd-order Method
(Quasi-)Newton Method
\[ H^{-1} \nabla L(\theta) \]
How well does the quadratic approximation hold when training neural networks?
Saddle-Free Newton Method (very new!!)
\[ |H|^{-1} \nabla L(\theta), \]
where $|H|$ is constructed by
\[ |H| = U |\Sigma| V \]
when $H = U \Sigma V$.
(Dauphin et al., 2014)
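The difference between the two directions shows up on the simplest saddle. The NumPy sketch below is a toy illustration: it assumes a symmetric Hessian, so the decomposition is an eigendecomposition with V equal to the transpose of U, and it uses the hypothetical test function f(x, y) = x^2 - y^2 with a full step of size 1.

```python
import numpy as np

def newton_step(grad, H):
    """Newton direction: -H^{-1} grad."""
    return -np.linalg.solve(H, grad)

def saddle_free_newton_step(grad, H):
    """Replace each eigenvalue of the symmetric Hessian by its absolute value."""
    eigvals, U = np.linalg.eigh(H)               # H = U diag(eigvals) U^T
    abs_H = U @ np.diag(np.abs(eigvals)) @ U.T   # |H|
    return -np.linalg.solve(abs_H, grad)

def f(v):
    """Toy saddle: f(x, y) = x^2 - y^2."""
    return v[0] ** 2 - v[1] ** 2

v = np.array([1.0, 1.0])
grad = np.array([2 * v[0], -2 * v[1]])
H = np.array([[2.0, 0.0], [0.0, -2.0]])

v_newton = v + newton_step(grad, H)              # jumps onto the saddle at the origin
v_sfn = v + saddle_free_newton_step(grad, H)     # moves away along the descending direction
```

Newton's method treats the saddle as its target because it rescales the negative-curvature direction by a negative number; taking absolute values keeps the rescaling but preserves the descent direction.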
Last but not least, is there any theoretical ground for using deep neural networks?
Theoretical Analysis
Deep Rectifier Networks Fold the Space
1. Fold along the vertical axis
2. Fold along the horizontal axis
[Figure: (a) Input Space, First Layer Space and Second Layer Space, showing regions S1 to S4 and their folded images S'1 to S'4; panels (b) and (c)]
(Montufar et al., 2014; Pascanu et al., 2014)
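The folding picture can be reproduced with two rectifiers per coordinate, since relu(x) + relu(-x) equals the absolute value of x. The sketch below is a minimal illustration with made-up points, not the construction used in the cited papers.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def fold(points):
    """Coordinate-wise |x| built from two rectifiers: relu(x) + relu(-x)."""
    return relu(points) + relu(-points)

# One point in each quadrant, mirror images of each other.
pts = np.array([[1.5, 0.5], [-1.5, 0.5], [-1.5, -0.5], [1.5, -0.5]])
folded = fold(pts)
```

All four points land on the same location after one folding layer, so whatever the following layers compute on that single region is replicated in all four input quadrants; stacking such layers multiplies the number of replicated regions.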
Is it the beginning of deep learning or the end of deep learning?
