LN - ieML DeepLearning
DEEP LEARNING
shin@ajou.ac.kr
http://www.alphaminers.net
Deep Learning Overview
1. Boltzmann Machines
4. Belief Networks
[Figure: network diagram with inputs $x_1, \dots, x_d$ and weight vectors $\boldsymbol{w}_1, \boldsymbol{w}_2, \boldsymbol{w}_3$]
Supervised learning
[LeCun89] Y. LeCun et al.: Handwritten Digit Recognition with a Back-Propagation Network. NIPS 1989
▪ Slow convergence
▪ Bad activation function: sigmoid
▪ Too many parameters
▪ Limited computing resources
ImageNet
▪ Over 15 million labeled high-resolution images
▪ Roughly 22,000 categories
▪ Collected from the web
▪ Labeled by human labelers using Amazon's Mechanical Turk crowd-sourcing tool
[Figure: ILSVRC classification error (%) by year: 2010 NEC-UIUC 28, 2011 XRCE 26, 2012 AlexNet 16, 2013 ZFNet 12, 2014 1. GoogLeNet / 2. VGGNet 7, 2015 ResNet 3.6, 2016 GoogLeNet-v4 3, 2017 SENet 2.3; human ability is about 5]
Sigmoid: $f(x) = \dfrac{1}{1 + e^{-x}}$, vs. ReLU: $f(x) = \max(0, x)$

[Figure: three panels: the sigmoid curve, the training error rate over epochs (ReLU converges much faster than sigmoid), and the ReLU curve]
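A minimal numpy sketch (mine, not from the slides) of the two activations and their derivatives; it shows why sigmoid saturates, its gradient is at most 0.25 and vanishes for large |x|, while ReLU passes gradient 1 on its active side:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # peaks at 0.25, vanishes for large |x|

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)  # gradient is exactly 1 on the active side

x = np.array([-10.0, -1.0, 0.5, 10.0])
print(sigmoid_grad(x))  # [~0.00005, 0.1966, 0.2350, ~0.00005]
print(relu_grad(x))     # [0., 0., 1., 1.]
```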
LeNet (LeCun89)
Feature hierarchy learned by a deep network:
▪ Pixels
▪ 1st layer: Edges
▪ 2nd layer: Object parts
▪ 3rd layer: Objects
Slides from Junmo Kim 2006
[Figure: traditional top-down error propagation (grammar → sentence → word → phoneme) versus bottom-up unsupervised pre-training]
General Boltzmann Machine
▪ Hidden layer $\boldsymbol{h} \in \{0,1\}^p$
▪ Visible layer $\boldsymbol{v} \in \{0,1\}^d$
▪ Parameters $\theta = \{W, L, J\}$:
  • $W$: visible-to-hidden
  • $L$: visible-to-visible, $\mathrm{diag}(L) = 0$
  • $J$: hidden-to-hidden, $\mathrm{diag}(J) = 0$
[Figure: general Boltzmann machine with visible-to-hidden couplings $W$, visible-to-visible couplings $L$, and hidden-to-hidden couplings $J$]
Energy of the Boltzmann machine:
$$E(\boldsymbol{v}, \boldsymbol{h} \mid \theta) = -\tfrac{1}{2}\,\boldsymbol{v}^T L \boldsymbol{v} - \tfrac{1}{2}\,\boldsymbol{h}^T J \boldsymbol{h} - \boldsymbol{v}^T W \boldsymbol{h}$$
Generative model with joint likelihood:
$$P(\boldsymbol{v}, \boldsymbol{h} \mid \theta) \propto \exp(-E(\boldsymbol{v}, \boldsymbol{h}; \theta))$$
Restricted Boltzmann Machine
▪ Hidden layer (no hidden-to-hidden connections) and visible layer (no visible-to-visible connections)
▪ Parameters:
  • $W$: visible-to-hidden
  • $L = 0$: visible-to-visible
  • $J = 0$: hidden-to-hidden
▪ Energy of the RBM:
$$E(\boldsymbol{v}, \boldsymbol{h} \mid \theta) = -\boldsymbol{v}^T W \boldsymbol{h} - \boldsymbol{b}^T \boldsymbol{v} - \boldsymbol{a}^T \boldsymbol{h}$$
▪ Joint likelihood:
$$P(\boldsymbol{v}, \boldsymbol{h} \mid \theta) = \frac{1}{Z(\theta)} \exp(-E(\boldsymbol{v}, \boldsymbol{h}; \theta))$$
Energy of a joint configuration ($\boldsymbol{v}$ on the visible units, $\boldsymbol{h}$ on the hidden units; $w_{ij}$ is the weight between units $i$ and $j$):
$$E(v, h) = -\sum_{i,j} v_i h_j w_{ij}, \qquad \frac{\partial E(v, h)}{\partial w_{ij}} = -v_i h_j$$
$$p(v, h) \propto e^{-E(v,h)}, \qquad p(v, h) = \frac{e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}} \quad \text{(the denominator is the partition function)}$$
$$p(v) = \frac{\sum_h e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$$
Marginal likelihood of a visible vector:
$$P(\boldsymbol{v}; \theta) = \frac{1}{Z(\theta)} \sum_{\boldsymbol{h}} \exp(-E(\boldsymbol{v}, \boldsymbol{h}; \theta)) = \frac{1}{Z(\theta)} \sum_{\boldsymbol{h}} \exp\!\left(\boldsymbol{v}^T W \boldsymbol{h} + \boldsymbol{b}^T \boldsymbol{v} + \boldsymbol{a}^T \boldsymbol{h}\right)$$
$$= \frac{1}{Z(\theta)} \exp(\boldsymbol{b}^T \boldsymbol{v}) \prod_{j=1}^{F} \sum_{h_j \in \{0,1\}} \exp\!\left(a_j h_j + \sum_{i=1}^{D} W_{ij} v_i h_j\right)$$
$$= \frac{1}{Z(\theta)} \exp(\boldsymbol{b}^T \boldsymbol{v}) \prod_{j=1}^{F} \left(1 + \exp\!\left(a_j + \sum_{i=1}^{D} W_{ij} v_i\right)\right)$$
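A small numpy check (a sketch of mine, not from the slides) that the closed-form product over hidden units matches the brute-force sum over all $2^F$ hidden configurations, using random stand-in parameters:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
D, F = 4, 3                       # visible and hidden dimensions (toy sizes)
W = rng.normal(size=(D, F))
a, b = rng.normal(size=F), rng.normal(size=D)
v = rng.integers(0, 2, size=D).astype(float)

# Brute force: sum exp(-E(v,h)) over all 2^F hidden configurations
brute = sum(np.exp(v @ W @ np.array(h) + b @ v + a @ np.array(h))
            for h in product([0.0, 1.0], repeat=F))

# Closed form: exp(b^T v) * prod_j (1 + exp(a_j + sum_i W_ij v_i))
closed = np.exp(b @ v) * np.prod(1.0 + np.exp(a + v @ W))

print(np.isclose(brute, closed))  # True; both equal Z(theta) * P(v; theta)
```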
Gradient of the log-likelihood:
$$\frac{\partial \log P(\boldsymbol{v}; \theta)}{\partial W} = E_{P_{data}}\!\left[\boldsymbol{v}\boldsymbol{h}^T\right] - E_{P_{model}}\!\left[\boldsymbol{v}\boldsymbol{h}^T\right]$$
$$\frac{\partial \log P(\boldsymbol{v}; \theta)}{\partial \boldsymbol{a}} = E_{P_{data}}[\boldsymbol{h}] - E_{P_{model}}[\boldsymbol{h}]$$
$$\frac{\partial \log P(\boldsymbol{v}; \theta)}{\partial \boldsymbol{b}} = E_{P_{data}}[\boldsymbol{v}] - E_{P_{model}}[\boldsymbol{v}]$$
The exact calculation is intractable because the expectation in $E_{P_{model}}$ takes time exponential in $\min(D, F)$. An efficient Gibbs-sampling-based approximation exists (contrastive divergence):
For $i = 1$ to $k$:
  $v^{(i)} \sim P(v \mid h^{(i-1)})$
  $h^{(i)} \sim P(h \mid v^{(i)})$
Return $(v^{(k)}, h^{(k)})$
Both conditionals factorize over units:
$$P(\boldsymbol{h} \mid \boldsymbol{v}; \theta) = \prod_{j} P(h_j \mid \boldsymbol{v}), \qquad P(\boldsymbol{v} \mid \boldsymbol{h}; \theta) = \prod_{i} P(v_i \mid \boldsymbol{h})$$
$$P(h_j = 1 \mid \boldsymbol{v}) = g\!\left(\sum_i W_{ij} v_i + a_j\right), \qquad P(v_i = 1 \mid \boldsymbol{h}) = g\!\left(\sum_j W_{ij} h_j + b_i\right)$$
where $g(\cdot)$ is the logistic sigmoid.
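Putting the sampler and the conditionals together, a minimal numpy sketch of one CD-$k$ update for a binary RBM (variable names, batch size, and learning rate are my own illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_step(v0, W, a, b, k=1, lr=0.1):
    """One contrastive-divergence update from a batch of visible vectors v0."""
    # Positive phase: P(h_j = 1 | v) = g(sum_i W_ij v_i + a_j)
    ph0 = sigmoid(v0 @ W + a)
    h = (rng.random(ph0.shape) < ph0).astype(float)
    for _ in range(k):                    # Gibbs chain: v ~ P(v|h), h ~ P(h|v)
        pv = sigmoid(h @ W.T + b)
        v = (rng.random(pv.shape) < pv).astype(float)
        ph = sigmoid(v @ W + a)
        h = (rng.random(ph.shape) < ph).astype(float)
    n = v0.shape[0]
    # Approximate E_data[v h^T] - E_model[v h^T] and the bias gradients
    W += lr * (v0.T @ ph0 - v.T @ ph) / n
    a += lr * (ph0 - ph).mean(axis=0)
    b += lr * (v0 - v).mean(axis=0)
    return W, a, b

# Toy usage: D = 6 visible units, F = 4 hidden units
D, F = 6, 4
W, a, b = 0.01 * rng.normal(size=(D, F)), np.zeros(F), np.zeros(D)
batch = rng.integers(0, 2, size=(8, D)).astype(float)
W, a, b = cd_step(batch, W, a, b)
```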
[Figure: three-layer deep generative model: visible layer $\boldsymbol{v}$ connected through $W^1$ to $\boldsymbol{h}^1$, through $W^2$ to $\boldsymbol{h}^2$, and through $W^3$ to $\boldsymbol{h}^3$; no within-layer connections are assumed]
High-level representations are built from unlabeled inputs:
$$P(\boldsymbol{v}) = \frac{1}{Z} \sum_{\boldsymbol{h}^1, \boldsymbol{h}^2, \boldsymbol{h}^3} \exp\!\left[\boldsymbol{v}^T W^1 \boldsymbol{h}^1 + \boldsymbol{h}^{1\,T} W^2 \boldsymbol{h}^2 + \boldsymbol{h}^{2\,T} W^3 \boldsymbol{h}^3\right]$$
Training the deep model:
▪ Pre-training: can (must) initialize from stacked RBMs
▪ Generative fine-tuning:
  • Positive phase: variational approximation (mean-field)
  • Negative phase: persistent chain (stochastic approximation)
▪ Discriminative fine-tuning: backpropagation
MNIST experiment:
▪ 28 × 28 pixel images; 60,000 training and 10,000 test examples
▪ Hidden layers of 500 and 1000 units (about 0.9 million parameters)
▪ Gibbs sampler run for 100,000 steps
Stochastic binary units have a state of 1 or 0, taking state 1 with probability
$$p(s_i = 1) = \frac{1}{1 + \exp\!\left(-b_i - \sum_j s_j w_{ji}\right)}$$
[Figure: $p(s_i = 1)$ as a logistic function of the unit's total input]
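A short numpy illustration (toy values of mine, not from the slides) of sampling such a stochastic binary unit:

```python
import numpy as np

rng = np.random.default_rng(0)
s = np.array([1.0, 0.0, 1.0])      # states of the other units
w = np.array([0.5, -1.0, 2.0])     # weights w_ji into unit i
b_i = -0.3                         # bias of unit i

p_on = 1.0 / (1.0 + np.exp(-b_i - s @ w))  # p(s_i = 1)
s_i = float(rng.random() < p_on)           # stochastic binary state
print(p_on, s_i)
```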
[Figure: RBMs stacked greedily, $\boldsymbol{v} \to \boldsymbol{h}^1 \to \boldsymbol{h}^2$, each layer trained as an RBM and then composed into a directed belief net]
Directed belief nets (Hinton et al., 2006)
Deep Belief Network: Greedy Layer-wise Training (1)
1. Train the first-layer RBM: construct an RBM with an input layer $\boldsymbol{v}$ and a hidden layer $\boldsymbol{h}$, learning $W^1$.
2. Stack upward: use the inferred activations $Q(\boldsymbol{h}^1 \mid \boldsymbol{v})$ as data to train the next RBM $W^2$, which gives $Q(\boldsymbol{h}^2 \mid \boldsymbol{h}^1)$, and so on for $k = 1$ to $l$; to generate, sample $\boldsymbol{h}^{k-1} \sim P(\cdot \mid \boldsymbol{h}^k)$ from the $k$th RBM, going down the stack (see the code sketch below).
3. Treat this as pre-training that finds a good initial set of weights, which can then be fine-tuned by a local search procedure.
4. For classification, add and connect a label unit to the top-level units.
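A self-contained numpy sketch of the greedy stacking loop (layer sizes, epoch count, and the CD-1 inner update are illustrative choices of mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.1):
    """Train one binary RBM with CD-1 and return (W, a, b)."""
    n, D = data.shape
    W, a, b = 0.01 * rng.normal(size=(D, n_hidden)), np.zeros(n_hidden), np.zeros(D)
    for _ in range(epochs):
        ph0 = sigmoid(data @ W + a)                        # positive phase
        h = (rng.random(ph0.shape) < ph0).astype(float)
        v = (rng.random((n, D)) < sigmoid(h @ W.T + b)).astype(float)
        ph = sigmoid(v @ W + a)                            # negative phase
        W += lr * (data.T @ ph0 - v.T @ ph) / n
        a += lr * (ph0 - ph).mean(axis=0)
        b += lr * (data - v).mean(axis=0)
    return W, a, b

# Greedy layer-wise stacking: each layer's hidden probabilities
# Q(h^k | h^{k-1}) become the training data for the next RBM.
data = rng.integers(0, 2, size=(100, 20)).astype(float)    # toy binary dataset
stack, x = [], data
for n_hidden in [15, 10, 5]:                               # illustrative layer sizes
    W, a, b = train_rbm(x, n_hidden)
    stack.append((W, a, b))
    x = sigmoid(x @ W + a)                                 # propagate up one layer
```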
[Figure: DBN for digit recognition: the label units and top-level hidden units form the associative memory; the lower layers carry upward detection weights and downward generative weights, with an RBM between the visible layer and the first hidden layer]
▪ The top two layers form an associative memory whose energy landscape models the low-dimensional manifolds of the digits
▪ The energy valleys have names: the label units
▪ The input layer could equally be the top level of another sensory pathway
Available at www.cs.toronto.edu/~hinton
[Figure: classification accuracy (%) versus number of layers, from 1 to 4, on an 85–100% scale]
Regularization Hypothesis
Pre-training constrains the parameters to a region relevant to the unsupervised dataset and therefore gives better generalization: representations that better describe unlabeled data are more discriminative for labeled data.
Optimization Hypothesis
Unsupervised pre-training initializes the lower-level parameters near better local minima than random initialization can. The initial gradients are sensible, and backpropagation only needs to perform a local search from a sensible starting point.
Overfitting problem
Given limited amounts of labeled data, training via back-propagation does not work well. This is addressed by new regularization methods: dropout, DropConnect, etc.
ReLU
▪ Fast to compute
▪ Biologically motivated
[Figure: ReLU activation, $a = z$ for $z > 0$ and $a = 0$ for $z \le 0$, compared with the sigmoid $\sigma(z)$]
[Figure: fully-connected deep network mapping inputs $x_1, \dots, x_N$ to outputs $y_1, \dots, y_N$]
In 2006, people used RBM pre-training; by 2015, people used ReLU instead.
Vanishing gradients
An intuitive way to compute the gradient is by perturbation:
$$\frac{\partial C}{\partial w} \approx \frac{\Delta C}{\Delta w}$$
Perturb a weight by $+\Delta w$ and observe the resulting change $\Delta C$ in the cost. In a deep sigmoid network, a weight near the input sees this effect strongly attenuated: a large change at the input produces only a small change at the output, so the earlier layers receive much smaller gradients than the later ones.
[Figure: deep network with inputs $x_1, \dots, x_N$, outputs $y_1, \dots, y_N$, and targets $\hat{y}_1, \dots, \hat{y}_M$; a perturbation $+\Delta w$ near the input yields only a small $\Delta C$ at the output]
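A numpy sketch (my construction, not from the slides) of this perturbation estimate on a deep sigmoid network; it typically shows a first-layer weight receiving a far smaller $\Delta C / \Delta w$ than a last-layer weight:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, Ws):
    """Deep sigmoid network; returns a scalar cost against a fixed 0.5 target."""
    for W in Ws:
        x = 1.0 / (1.0 + np.exp(-(x @ W)))
    return float(np.sum((x - 0.5) ** 2))

Ws = [rng.normal(size=(10, 10)) / np.sqrt(10) for _ in range(8)]  # 8 layers
x = rng.normal(size=(1, 10))
dw = 1e-4

for layer in (0, 7):                 # compare a first-layer vs. last-layer weight
    C0 = forward(x, Ws)
    Ws[layer][0, 0] += dw            # perturb one weight by +dw
    dC = forward(x, Ws) - C0         # observe the change in the cost
    Ws[layer][0, 0] -= dw            # restore
    print(f"layer {layer}: dC/dw is approximately {dC / dw:.2e}")
```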
[Figure: with ReLU, inactive units output exactly 0 and drop out of the computation, leaving a thinner network that is linear on its active path]
Such networks do not have smaller gradients in the earlier layers: each active ReLU passes the gradient through unchanged.
Dropout
Training: pick a mini-batch and update
$$\theta^t \leftarrow \theta^{t-1} - \eta \nabla C(\theta^{t-1})$$
With dropout, each neuron is dropped with probability $p$, and a new thinned network is sampled for every mini-batch.
Testing: why should the weights be multiplied by $(1-p)$, where $p$ is the dropout rate, at test time?
Suppose $p = 0.5$. During training, roughly half of each unit's inputs are dropped, so the full activation computed at test time, $z'$, is about twice the training-time activation: $z' \approx 2z$. Multiplying each weight $w_1, \dots, w_4$ by $0.5$ restores $z' \approx z$.
[Figure: a unit with inputs weighted by $w_1, \dots, w_4$; scaling each weight by $0.5$ at test time makes $z' \approx z$]
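A numpy sketch of both conventions (illustrative, not from the slides): classic dropout scales the weights by $(1-p)$ at test time, while the now-common inverted variant instead divides by $(1-p)$ during training so test-time code is unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                              # dropout rate

def dropout_train(x, p):
    mask = (rng.random(x.shape) >= p).astype(float)
    return x * mask                  # classic: drop units, no scaling here

def dropout_test_weights(W, p):
    return W * (1.0 - p)             # classic: scale weights at test time

def inverted_dropout_train(x, p):
    mask = (rng.random(x.shape) >= p).astype(float)
    return x * mask / (1.0 - p)      # scale during training; test time unchanged

x = rng.normal(size=(1, 1000))
# The expected activation is preserved either way:
print(x.mean(), inverted_dropout_train(x, p).mean())  # approximately equal
```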
Different networks
[Figure: training samples fed to different networks $f_1, f_2, \dots, f_J$ whose outputs are averaged into $F(x_i)$]
Training with dropout does this implicitly: with $M$ neurons there are $2^M$ possible thinned networks, and the test-time weight scaling approximates averaging their predictions.
Issues in Deep Neural Network
Overfitting problem
Given limited amounts of labeled data, training via back-propagation does not work well; addressed by regularization methods such as dropout and DropConnect.
Parameter problem
Fully-connected networks also have far too many parameters, which motivates the convolutional networks below.
Miscellaneous
MxNet (Python): pre-trained models that you can use for object recognition
DeepNet (Python): deep neural networks, deep belief networks, and restricted Boltzmann machines
Nolearn 0.6.0 (Python): a simpler package than Lasagne, a Python package for training neural networks
Python/Numpy DBN: a simple, clean, fast Python implementation of deep belief networks based on binary restricted Boltzmann machines (RBMs)
DeepLearnToolbox, 2014 (Matlab): a Matlab/Octave toolbox for deep learning; includes deep belief nets, stacked autoencoders, and convolutional neural nets
Deepmat, 2014 (Matlab): Matlab code for restricted/deep Boltzmann machines and autoencoders
http://www.teglor.com/b/deep-learning-libraries-language-cm569/
Example
• 1000 × 1000 image
• 1M hidden units
→ $10^{12}\ (= 10^6 \times 10^6)$ parameters!
Observation
• Spatial correlation is local

Example
• 1000 × 1000 image
• 1M hidden units
• Filter size: 10 × 10
Observation
• Statistics are similar at different locations
→ 10,000 parameters
We can design neural networks that are specifically adapted for these problems:
▪ Must deal with very high-dimensional inputs
  • 1000 × 1000 pixels
▪ Can exploit the 2D topology of pixels
▪ Can build in invariance to certain variations we can expect
  • Translations, etc.
Ideas
▪ Local connectivity
▪ Parameter sharing
from: https://developer.apple.com/library/ios/documentation/Performance/Conceptual/vImage/ConvolutionOperations/ConvolutionOperations.html
▪ Max pooling
▪ Average pooling
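A naive numpy sketch of the building blocks named above: a valid 2D convolution with one shared filter (local connectivity plus parameter sharing), followed by max or average pooling; loop-based for clarity, with illustrative sizes:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation with a single shared kernel (parameter sharing)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)  # local connectivity
    return out

def pool2d(fmap, size=2, mode="max"):
    """Non-overlapping max or average pooling."""
    H, W = fmap.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = fmap[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

rng = np.random.default_rng(0)
img = rng.random((28, 28))              # toy 28x28 input
fmap = conv2d(img, rng.random((5, 5)))  # 24x24 feature map
print(pool2d(fmap, 2, "max").shape)     # (12, 12)
```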
#(Parameters) = 3,274,634

Layer    C1     C2      FC1        FC2
Weight   800    51,200  3,211,264  10,240
Bias     32     64      1,024      10
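A quick Python check that the table sums to 3,274,634; the layer shapes (5×5 kernels, 32/64 channels, 1024 hidden units, 10 classes) are inferred from the counts, not stated in the slides:

```python
# Parameter count check: shapes inferred from the table above.
weights = {
    "C1":  5 * 5 * 1 * 32,        # 800
    "C2":  5 * 5 * 32 * 64,       # 51,200
    "FC1": (7 * 7 * 64) * 1024,   # 3,211,264 (7x7 feature maps after pooling)
    "FC2": 1024 * 10,             # 10,240
}
biases = {"C1": 32, "C2": 64, "FC1": 1024, "FC2": 10}
total = sum(weights.values()) + sum(biases.values())
print(total)  # 3,274,634
```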
▪ Handwritten digits
▪ A training set of 60,000 examples
▪ 28 × 28 images
[Figure: RNN input/output configurations: one-to-one, one-to-many, many-to-one, many-to-many, many-to-many]
$$W_{hh} \in \mathbb{R}^{3 \times 3}, \qquad W_{xh} \in \mathbb{R}^{3 \times 4}$$
The network emits predictions $g_\Theta(h_1), g_\Theta(h_2), \dots, g_\Theta(h_{T-1})$, and training maximizes
$$\sum_t \log p(x_t \mid x_1, x_2, \dots, x_{t-1})$$
that is, each prediction $g_\Theta(h_{t-1})$ is scored against the one-hot target vector $x_t$ (all zeros except a single 1 in the position of the observed symbol) with the cross-entropy loss.
Example: the prediction of the next word based on the previous ones, as in "the clouds are in the sky".
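A minimal numpy sketch of one vanilla-RNN step scored with cross entropy against a one-hot target; the dimensions follow the $W_{hh} \in \mathbb{R}^{3\times3}$, $W_{xh} \in \mathbb{R}^{3\times4}$ example above, and taking $g_\Theta$ to be a softmax readout is my assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
W_hh = rng.normal(size=(3, 3))   # hidden-to-hidden
W_xh = rng.normal(size=(3, 4))   # input-to-hidden (vocabulary size 4)
W_hy = rng.normal(size=(4, 3))   # hidden-to-output readout (g_Theta, assumed)

h = np.zeros(3)
x_t = np.array([0.0, 0.0, 1.0, 0.0])      # one-hot input at step t
target = np.array([0.0, 1.0, 0.0, 0.0])   # one-hot next symbol x_{t+1}

h = np.tanh(W_hh @ h + W_xh @ x_t)        # recurrent update
logits = W_hy @ h
p = np.exp(logits) / np.exp(logits).sum() # softmax over the vocabulary
loss = -np.sum(target * np.log(p))        # cross entropy = -log p(x_{t+1} | ...)
print(loss)
```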
Then we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
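In the standard LSTM notation, this output step is (added here for reference, matching the description above):
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o), \qquad h_t = o_t \odot \tanh(C_t)$$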