
Machine Learning

DEEP LEARNING

Hyunjung (Helen) Shin


Dept. of Industrial Engineering,
Artificial Intelligence,
Integrative Systems Engineering
Ajou University, Korea

shin@ajou.ac.kr
http://www.alphaminers.net
Deep Learning Overview

1 Boltzmann Machines

2 Restricted Boltzmann Machines (RBM)

3 Deep Boltzmann Machines (DBM)

4 Belief Networks

5 Deep Belief Networks (DBN)

6 Convolutional Neural Network (CNN)

7 Recurrent Neural Network (RNN)

8 Long Short Term Memory (LSTM)

Hyunjung (Helen) Shin 2


DEEP LEARNING

Hyunjung (Helen) Shin 3


Artificial Intelligence: ALPHAGO?

Hyunjung (Helen) Shin 4


Artificial Intelligence: ALPHAGO?
Searching Tree

Hyunjung (Helen) Shin


Artificial Intelligence: ALPHA ZERO

Hyunjung (Helen) Shin


Multi-Layered Neural Network

[Figure: a multi-layered neural network with input units x1, x2, …, xd and weight matrices w1, w2, w3 between successive layers]

Hyunjung (Helen) Shin 7


Deep Neural Net: Convolutional Neural Network (CNN)
Feed-forward network
Convolution
Non-linearity: rectified linear units
Pooling: (typically) local maximum

Supervised learning

Representation learning

LeNet (LeCun89)

[LeCun89] Y. LeCun et al.: Handwritten Digit Recognition with a Back-Propagation Network. NIPS 1989

Hyunjung (Helen) Shin 8


Deep Neural Net: Convolutional Neural Network (CNN)

Early CNNs had not shown impressive performance because of:

Insufficient training data

Slow convergence
  Bad activation function: the sigmoid
  Too many parameters

Limited computing resources

Lack of theory: development relied on trial and error

Hyunjung (Helen) Shin 9


Deep Neural Net: Convolutional Neural Network (CNN)

CNNs have recently drawn a lot of attention due to their great success.

Availability of larger training datasets: ImageNet


Powerful GPUs
Better model regularization strategy such as dropout
Simple activation function: ReLU

Hyunjung (Helen) Shin 10


Deep Neural Net: Revolution of Depth -ImageNet

Hyunjung (Helen) Shin


Deep Neural Net: Revolution of Depth -ImageNet

ImageNet
Over 15 million labeled high-resolution images
Roughly 22,000 categories
Collected from the web
Labeled by human labelers using Amazon’s Mechanical Turk
crowd-sourcing tool.

ILSVRC: ImageNet Large-Scale Visual Recognition Challenge


Uses a subset of ImageNet
1000 categories
1.2 million training images
50,000 validation images
150,000 test images

Report two error rates: top-1 and top-5

Hyunjung (Helen) Shin


Deep Neural Net: Revolution of Depth

Hyunjung (Helen) Shin


Deep Neural Net: Revolution of Depth

Hyunjung (Helen) Shin


Deep Neural Net: Error Rate of Winning Algorithms
Top-5 error rate (%) of the ILSVRC winning algorithms:

Year | Winner | Top-5 error (%)
2010 | NEC-UIUC | 28
2011 | XRCE | 26
2012 | AlexNet | 16
2013 | ZFNet | 12
2014 | 1. GoogLeNet, 2. VGGNet | 7
2015 | ResNet | 3.6
2016 | GoogLeNet-v4 | 3
2017 | SENet | 2.3

Human ability: roughly 5% top-5 error.

Hyunjung (Helen) Shin


Deep Neural Net: ReLU

Sigmoid function:  f(x) = 1 / (1 + e^(−x))
Rectified linear unit (ReLU):  f(x) = max(0, x)

[Figure: training error rate vs. training iterations — the ReLU network converges much faster than the sigmoid network]
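To make the contrast concrete, here is a minimal NumPy sketch (an illustration, not from the slides): the sigmoid's derivative is at most 0.25 and shrinks toward zero for large |x|, while the ReLU derivative is exactly 1 for every positive input.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # at most 0.25, ~0 for large |x|

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)  # exactly 1 for all positive inputs

xs = np.array([-10.0, -2.0, 0.5, 2.0, 10.0])
print("sigmoid'(x):", np.round(sigmoid_grad(xs), 4))  # e.g. [0. 0.105 0.235 0.105 0.]
print("relu'(x):   ", relu_grad(xs))                  # [0. 0. 1. 1. 1.]
```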

Hyunjung (Helen) Shin 16


Deep Neural Net: LeNet

LeNet (LeCun89)

Hyunjung (Helen) Shin


Deep Neural Net: AlexNet

AlexNet (Krizhevsky et al. 2012)

Hyunjung (Helen) Shin


Deep Neural Net: GoogLeNet

GoogLeNet (Szegedy et al., 2014)

Hyunjung (Helen) Shin


Deep Neural Net: Multiple Levels of Feature Representation
Visible layer (raw input) → 1st layer → 2nd layer → 3rd layer
Represents more and more abstract features of the raw input, e.g., edges, local shapes, object parts, objects, etc.

Feature representation hierarchy:
  Pixels → 1st layer: edges → 2nd layer: object parts → 3rd layer: objects

Slides from Junmo Kim 2006

Hyunjung (Helen) Shin 20


Deep Neural Net Training: Deep Learning

Deep learning is a procedure of training a deep network

① Use unsupervised pre-training (greedy layer-wise training)


Allows abstraction to develop naturally from one layer to another
② Perform supervised top-down training as final step. Refine the features
(intermediate layers) so that they become more relevant for the task

[Figure: speech example — phoneme → word → sentence/grammar; traditional top-down error propagation vs. bottom-up unsupervised pre-training]

Hyunjung (Helen) Shin 21


Deep Learning Library / Packages
Name | Base Language | Lead Developer | Features
Theano | Python | Bengio group, Univ. of Montreal | Flexible; large community; Keras, Lasagne
Caffe | C++ with Matlab/Python wrappers | UC Berkeley | Fast; good for training and fine-tuning feedforward models; Model Zoo
TensorFlow | Python, C++ | Google | Similar to Theano, still slower; TensorBoard; multi-GPU & multi-node
Torch | Lua | - | Fast; multi-GPU support
DL4J | Java | Skymind | Multi-GPU support; commercially supported by Skymind

Hyunjung (Helen) Shin 22


Boltzmann Machines

Hyunjung (Helen) Shin 23


Boltzmann Machines
A network of symmetrically coupled stochastic binary units {0,1}

  Hidden layer: h ∈ {0,1}^p
  Visible layer: v ∈ {0,1}^d

Parameters: θ = {W, L, J}
  W: visible-to-hidden
  L: visible-to-visible, diag(L) = 0
  J: hidden-to-hidden, diag(J) = 0

Energy of the (general) Boltzmann machine:
  E(v, h; θ) = −(1/2) vᵀLv − (1/2) hᵀJh − vᵀWh

Hyunjung (Helen) Shin 24


Boltzmann Machines
A network of symmetrically coupled stochastic binary units {0,1}

Energy of the Boltzmann machine:
  E(v, h; θ) = −(1/2) vᵀLv − (1/2) hᵀJh − vᵀWh

Generative model
  Joint likelihood: P(v, h | θ) ∝ exp( −E(v, h; θ) )

Probability of a visible vector v:
  P(v | θ) = Σ_h P(v, h | θ) = Σ_h exp( −E(v, h; θ) ) / Z(θ)
  Z(θ) = Σ_v Σ_h exp( −E(v, h; θ) )
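A small NumPy sketch (an illustration, not code from the course) that evaluates the energy E(v, h; θ) and the unnormalized joint probability exp(−E) for a toy Boltzmann machine; the sizes and random parameters below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 4, 3                      # visible / hidden units (toy sizes)

# Symmetric couplings with zero diagonals, as required for L and J
L = rng.normal(scale=0.1, size=(d, d)); L = (L + L.T) / 2; np.fill_diagonal(L, 0.0)
J = rng.normal(scale=0.1, size=(p, p)); J = (J + J.T) / 2; np.fill_diagonal(J, 0.0)
W = rng.normal(scale=0.1, size=(d, p))  # visible-to-hidden weights

def energy(v, h):
    """E(v, h; theta) = -1/2 v'Lv - 1/2 h'Jh - v'Wh."""
    return -0.5 * v @ L @ v - 0.5 * h @ J @ h - v @ W @ h

v = rng.integers(0, 2, size=d).astype(float)   # binary visible state
h = rng.integers(0, 2, size=p).astype(float)   # binary hidden state
E = energy(v, h)
print("E(v, h) =", E, " unnormalized P(v, h) proportional to", np.exp(-E))
```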

Hyunjung (Helen) Shin 25


Restricted Boltzmann Machines
(RBM)

Hyunjung (Helen) Shin 26


Restricted Boltzmann Machines
No hidden-to-hidden and no visible-to-visible connections

Parameters
  W: visible-to-hidden
  L = 0: no visible-to-visible connections
  J = 0: no hidden-to-hidden connections

Energy of the RBM:
  E(v, h | θ) = −vᵀWh − bᵀv − aᵀh

Joint likelihood:
  P(v, h | θ) = (1/Z(θ)) exp( −E(v, h; θ) )

[Figure: a Restricted Boltzmann Machine — a bipartite graph with a hidden layer and a visible layer connected by W]

Hyunjung (Helen) Shin 27



Restricted Boltzmann Machines

Top layer: vector of stochastic binary hidden units h


Bottom layer: a vector of stochastic binary visible variables v

Figure is taken from R. Salakhutdinov

Hyunjung (Helen) Shin 29


RBM: The Energy of a Joint Configuration
(ignoring terms to do with biases)

  E(v, h) = − Σ_{i,j} v_i h_j w_ij

where v_i is the binary state of visible unit i, h_j is the binary state of hidden unit j, and w_ij is the weight between units i and j; E(v, h) is the energy of the configuration with v on the visible units and h on the hidden units.

  ∂E(v, h) / ∂w_ij = − v_i h_j

Hyunjung (Helen) Shin 30


RBM: Weights → Energies → Probabilities

Each possible joint configuration of the visible and hidden units has an energy. The energy is determined by the weights (and biases):

  E(v, h) = − Σ_{i,j} v_i h_j w_ij

The energy of a joint configuration of the visible and hidden units determines its probability:

  p(v, h) ∝ e^(−E(v, h))

Hyunjung (Helen) Shin 31


RBM: Using Energies to Define Probabilities

The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations:

  p(v, h) = e^(−E(v, h)) / Σ_{u,g} e^(−E(u, g))        (the denominator is the partition function)

The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it:

  p(v) = Σ_h e^(−E(v, h)) / Σ_{u,g} e^(−E(u, g))

Hyunjung (Helen) Shin 32


RBM: Training
Due to the special bipartite structure of RBMs, the hidden units can be explicitly marginalized out:

  P(v; θ) = (1/Z(θ)) Σ_h exp( −E(v, h; θ) )
          = (1/Z(θ)) Σ_h exp( vᵀWh + bᵀv + aᵀh )
          = (1/Z(θ)) exp(bᵀv) ∏_{j=1}^{F} Σ_{h_j ∈ {0,1}} exp( a_j h_j + Σ_{i=1}^{D} W_ij v_i h_j )
          = (1/Z(θ)) exp(bᵀv) ∏_{j=1}^{F} ( 1 + exp( a_j + Σ_{i=1}^{D} W_ij v_i ) )

Hyunjung (Helen) Shin 33


RBM: Training
  P(v; θ) = (1/Z(θ)) exp(bᵀv) ∏_{j=1}^{F} ( 1 + exp( a_j + Σ_{i=1}^{D} W_ij v_i ) )

Gradient of the log-likelihood (used for gradient-based learning):

  ∂ log P(v; θ) / ∂W = E_{P_data}[ v hᵀ ] − E_{P_model}[ v hᵀ ]
  ∂ log P(v; θ) / ∂a = E_{P_data}[ h ] − E_{P_model}[ h ]
  ∂ log P(v; θ) / ∂b = E_{P_data}[ v ] − E_{P_model}[ v ]

The exact calculations are intractable because the expectation under P_model takes time exponential in min(D, F).
An efficient Gibbs-sampling-based approximation exists (contrastive divergence).
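A minimal sketch of one CD-1 update (contrastive divergence with a single Gibbs step), assuming a small binary RBM with arbitrary toy sizes and learning rate; it illustrates the data-dependent and reconstruction-based statistics in the gradient above, not the course's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, F = 6, 4                          # visible units, hidden units (toy sizes)
W = 0.01 * rng.normal(size=(D, F))   # visible-to-hidden weights
a = np.zeros(F)                      # hidden biases
b = np.zeros(D)                      # visible biases
lr = 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0):
    """One contrastive-divergence (CD-1) step for a batch of binary visible vectors v0."""
    global W, a, b
    # Positive phase: data-dependent statistics <v h>_data
    ph0 = sigmoid(v0 @ W + a)                       # P(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One Gibbs step: reconstruct v, then recompute hidden probabilities
    pv1 = sigmoid(h0 @ W.T + b)                     # P(v=1 | h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + a)                       # P(h=1 | v1)
    # Approximate gradient: <v h>_data - <v h>_reconstruction
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
    a += lr * (ph0.mean(axis=0) - ph1.mean(axis=0))
    b += lr * (v0.mean(axis=0) - v1.mean(axis=0))

batch = (rng.random((8, D)) < 0.5).astype(float)    # fake binary training batch
for _ in range(100):
    cd1_update(batch)
```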

Hyunjung (Helen) Shin 34


RBM: How Do We Sample from an RBM?

Due to the bipartite factorization, Gibbs sampling is easy:

  Set v₀ randomly (or, preferably, from the empirical distribution)
  Sample h₀ ~ P(h | v₀)
  For i = 1 to k:
    Sample vᵢ ~ P(v | hᵢ₋₁)
    Sample hᵢ ~ P(h | vᵢ)
  Return (v_k, h_k)

In practice, simply running it for a fixed number of steps k is good enough.

Hyunjung (Helen) Shin 35


RBM: Inference in RBM

Inference is simple in an RBM:

  P(h | v; θ) = ∏_j P(h_j | v),    P(v | h; θ) = ∏_i P(v_i | h)

  P(h_j = 1 | v) = g( Σ_i W_ij v_i + a_j )
  P(v_i = 1 | h) = g( Σ_j W_ij h_j + b_i )

where g(x) = 1 / (1 + exp(−x)) is the sigmoid function.

Hyunjung (Helen) Shin 36


RBM
Modeling handwritten digits using RBM

Training data examples Visualized weights of RBM

Hyunjung (Helen) Shin 37


Deep Boltzmann Machines
(DBM)

Hyunjung (Helen) Shin 38


Deep Boltzmann Machines (DBM)
Undirected connections between all layers: in a multi-layer model, the undirected connections between the layers form a single Boltzmann machine with several hidden layers (i.e., a Deep Boltzmann Machine).

  High-level representations are built from unlabeled inputs
  Labeled data is used only to slightly fine-tune the model

  P(v) = (1/Z) Σ_{h¹,h²,h³} exp[ vᵀW¹h¹ + h¹ᵀW²h² + h²ᵀW³h³ ]

[Figure: layers v, h¹, h², h³ connected by weights W¹, W², W³]

Salakhutdinov & Hinton, 2009


Hyunjung (Helen) Shin 39
Deep Boltzmann Machines: Two layer DBM example

[Figure: a two-layer DBM with hidden layers h¹ and h² and weights W¹, W²]

Assume no within-layer connections.

Salakhutdinov & Hinton, 2009


Hyunjung (Helen) Shin 40
Deep Boltzmann Machines (DBM)

Pre-training
  Can (must) initialize from stacked RBMs

Generative fine-tuning
  Positive phase: variational approximation (mean-field)
  Negative phase: persistent chain (stochastic approximation)

Discriminative fine-tuning
  Backpropagation

[Figure: DBM with layers v, h¹, h², h³ and weights W¹, W², W³]

Salakhutdinov & Hinton, 2009


Hyunjung (Helen) Shin 41
Deep Boltzmann Machines: Experiments
MNIST: 2-layer DBM on 28 × 28 pixel images (hidden layers of 500 and 1000 units)
  60,000 training and 10,000 test examples
  0.9 million parameters
  Gibbs sampler run for 100,000 steps

[Figure: samples from the 2-layer BM, samples from a 2-layer RBM, and real training samples]

After discriminative fine-tuning: 0.95% error rate
  Compare with SVM: 1.4%

Hyunjung (Helen) Shin 42


DBMs vs. DBNs

[Figure: both architectures stack v, h¹, h², h³ with weights W¹, W², W³ — in the Deep Boltzmann Machine all connections are undirected, whereas in the Deep Belief Network only the top two layers form an undirected RBM and the remaining connections are directed top-down]

Deep Boltzmann Machine (left) vs. Deep Belief Network (right)

Salakhutdinov & Hinton, 2009


Hyunjung (Helen) Shin 43
Belief Networks

Hyunjung (Helen) Shin 44


Belief Nets
A belief net is a directed acyclic graph composed of stochastic variables.

We get to observe some of the variables, and we would like to solve two problems:

  ① The inference problem: infer the states of the unobserved variables

  ② The learning problem: adjust the interactions between variables to make the network more likely to generate the observed data

[Figure: stochastic hidden causes connected by directed edges to visible effects]

Hyunjung (Helen) Shin 45


Belief Nets: Stochastic Binary Units (Bernoulli Variables)

  p(s_i = 1) = 1 / ( 1 + exp( −b_i − Σ_j s_j w_ji ) )

  These units have a state of 1 or 0
  The probability of turning on is determined by the weighted input from other units (plus a bias)

[Figure: p(s_i = 1) as a sigmoid function of b_i + Σ_j s_j w_ji]
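A tiny illustrative sketch (with made-up weights and states) of sampling one stochastic binary unit: its probability of turning on is the logistic function of its bias plus weighted input.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_unit(states, weights, bias):
    """Sample s_i in {0,1} with p(s_i = 1) = 1 / (1 + exp(-bias - sum_j s_j * w_ji))."""
    p_on = 1.0 / (1.0 + np.exp(-bias - states @ weights))
    return int(rng.random() < p_on), p_on

s = np.array([1.0, 0.0, 1.0])        # states of the parent units (assumed)
w = np.array([0.8, -1.2, 0.3])       # incoming weights (assumed)
state, p = sample_unit(s, w, bias=-0.5)
print(f"p(s_i = 1) = {p:.3f}, sampled state = {state}")
```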

Hyunjung (Helen) Shin 46


Deep Belief Networks
(DBN)

Hyunjung (Helen) Shin 47


Deep Belief Network
Probabilistic generative model
Deep architecture – multiple layers (stacks of RBMs)
Greedy layer-wise training algorithm (Contrastive Divergence Learning)
Supervised fine-tuning can be applied
[Figure: a DBN built by stacking RBMs — v–h¹, then h¹–h², then h²–h³]

Hinton et al., 2006


Hyunjung (Helen) Shin 48
Deep Belief Network
[Figure: in a DBN, the top two layers form an RBM (used in the generative process), while the lower layers are directed belief nets used for (approximate) inference]
Hinton et al., 2006
Hyunjung (Helen) Shin 49
Deep Belief Network: Greedy Layer-wise Training (1)

𝒉
1. Train the first layer RBM
𝑾𝟏 Construct an RBM with an input layer
𝑣 and a hidden layer ℎ
𝒗

Hinton et al., 2006


Hyunjung (Helen) Shin 50
Deep Belief Network: Greedy Layer-wise Training (2)

2. Stack another hidden layer


𝒉𝟐 on top of the first RBM to form
a new RBM
𝑾𝟐
𝒉 𝟏 Fix 𝑾𝟏 , sample ℎ1 from 𝑄(ℎ1 |𝑣) as
input. Train 𝑾𝟐 as a second RBM
𝑸(𝒉𝟏 |𝒗)
𝑾𝟏

Hinton et al., 2006


Hyunjung (Helen) Shin 51
Deep Belief Network: Greedy Layer-wise Training (3)

3. Continue to stack layers on top of the network and train each new layer as in the previous step, using samples drawn from Q(h²|h¹) as the input to the new RBM W³.

[Figure: h³ stacked on h² and h¹, with Q(h²|h¹) and Q(h¹|v) providing the bottom-up samples through W² and W¹]

Hinton et al., 2006
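A compact sketch of the greedy layer-wise procedure above, assuming a simplified CD-1 trainer and mean-field propagation Q(h|v) between layers; the layer sizes, learning rate, and epoch count are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, lr=0.1, epochs=50):
    """Train one RBM with CD-1 and return (W, hidden_bias, visible_bias)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.normal(size=(n_visible, n_hidden))
    a, b = np.zeros(n_hidden), np.zeros(n_visible)
    for _ in range(epochs):
        ph0 = sigmoid(data @ W + a)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T + b)
        ph1 = sigmoid(pv1 @ W + a)
        n = data.shape[0]
        W += lr * (data.T @ ph0 - pv1.T @ ph1) / n
        a += lr * (ph0 - ph1).mean(axis=0)
        b += lr * (data - pv1).mean(axis=0)
    return W, a, b

def greedy_pretrain(v, layer_sizes):
    """Stack RBMs: train layer 1 on v, then each next layer on Q(h | previous input)."""
    rbms, inp = [], v
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(inp, n_hidden)
        rbms.append((W, a, b))
        inp = sigmoid(inp @ W + a)      # propagate Q(h|v) upward as the next layer's input
    return rbms

v = (rng.random((100, 20)) < 0.5).astype(float)     # fake binary training data
stack = greedy_pretrain(v, layer_sizes=[16, 12, 8]) # W1, W2, W3
```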


Hyunjung (Helen) Shin 52
Deep Belief Network: Classifier

(1) Learn one layer at a time greedily

(2) Then treat this as pre-training that finds a good initial set of
weights which can be fine-tuned by a local search procedure

(3) Backpropagation can be used to fine-tune the model for better


classification. This overcomes many of the limitations of standard
backpropagation

(4) For classification, add and connect a label unit to the top level unit

Hyunjung (Helen) Shin 53


Deep Belief Network: How to sample from a DBN?

1. Sample h^(l−1) from the top-level RBM (using Gibbs sampling)
   Run multiple Gibbs steps (unlike CD-k, this is true sampling)

2. For k = l−1 down to 1:
   Sample h^(k−1) ~ P(· | h^k) from the k-th RBM

3. v = h⁰ is the final sample

The top-most layer generates a sample as a regular RBM; the rest is a top-down pass.
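A minimal sketch of this sampling procedure, assuming an already-trained stack of weight matrices with biases; here the weights are random placeholders and the Gibbs step count and layer sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_dbn(rbms, n_gibbs=200):
    """rbms = [(W1, a1, b1), ..., (Wl, al, bl)] from bottom to top.
    Step 1: run Gibbs in the top RBM; Step 2: top-down ancestral pass; Step 3: return v."""
    W_top, a_top, b_top = rbms[-1]
    # 1. Gibbs sampling in the top-level RBM to get h^(l-1)
    h_low = (rng.random(W_top.shape[0]) < 0.5).astype(float)
    for _ in range(n_gibbs):
        h_top = (rng.random(W_top.shape[1]) < sigmoid(h_low @ W_top + a_top)).astype(float)
        h_low = (rng.random(W_top.shape[0]) < sigmoid(h_top @ W_top.T + b_top)).astype(float)
    # 2. Top-down pass through the remaining (directed) layers
    sample = h_low
    for W, a, b in reversed(rbms[:-1]):
        sample = (rng.random(W.shape[0]) < sigmoid(sample @ W.T + b)).astype(float)
    # 3. The bottom-most sample is the visible vector v
    return sample

# Toy stack with random weights standing in for a trained DBN
sizes = [20, 16, 12, 8]
rbms = [(0.1 * rng.normal(size=(sizes[i], sizes[i + 1])),
         np.zeros(sizes[i + 1]), np.zeros(sizes[i])) for i in range(len(sizes) - 1)]
v = sample_dbn(rbms)
print("sampled visible vector:", v)
```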

Hyunjung (Helen) Shin 54


Deep Belief Network: Classifier - Generative Digits

The top two layers form an associative memory whose energy landscape models the low-dimensional manifolds of the digits. The energy valleys have names: the label units.

[Figure: DBN classifier — an observation vector v (e.g., a 32 × 32 image) feeds up through layers of hidden units via detection (recognition) weights; generative weights point back down; the top-level units, together with the label units, form the associative memory]

Hinton et al., 2006
http://www.cs.toronto.edu/~hinton/adi/index.htm

Hyunjung (Helen) Shin 55


Deep Belief Network: Classifier - Generative Digits

Available at www.cs.toronto.edu/~hinton

[Figure annotation: this could be the top level of another sensory pathway]

Hinton et al., 2006

Hyunjung (Helen) Shin 56


Deep Belief Network: Example - Generative Digits

There are 1000 iterations of alternating Gibbs Sampling between samples.


Hyunjung (Helen) Shin 57
Deep Belief Network: Example - Generative Digits

Hyunjung (Helen) Shin 58


Deep Belief Network: Results on Digit Classification
Effect of pre-training

[Figure: test error with and without pre-training, for 1-layer and 4-layer networks]

Erhan et al., AISTATS 2009

Hyunjung (Helen) Shin 59


Deep Belief Network: Results on Digit Classification
Effect of Depth

[Figure: results without pre-training vs. with pre-training]

Erhan et al., AISTATS 2009

Hyunjung (Helen) Shin 60


Deep Belief Network: Results on Digit Classification
Effect of depth

[Figure: handwriting digit classification accuracy (%) as a function of the number of layers (1 to 10); the accuracy axis runs from 50% to 100%]

Hyunjung (Helen) Shin 61


Deep Belief Network

Why does greedy layer-wise training work?

Regularization hypothesis
  Pre-training is "constraining" the parameters to a region relevant to the unsupervised dataset, and therefore gives better generalization
  (representations that better describe unlabeled data are more discriminative for labeled data)

Optimization hypothesis
  Unsupervised training initializes the lower-level parameters near localities of better minima than random initialization can.
  The initial gradients are sensible, and backprop only needs to perform a local search from a sensible starting point.

Bengio 2009, Erhan et al. 2009

Hyunjung (Helen) Shin 62


Miscellaneous…

Hyunjung (Helen) Shin 63


Issues in Deep Neural Network

Vanishing gradient problem
  Gradients in the lower layers are typically extremely small
  Problems with the saturating non-linear activation (sigmoid)
  Solved by a new non-linear activation, the Rectified Linear Unit (ReLU)

Overfitting problem
  Given limited amounts of labeled data, training via back-propagation does not work well
  Solved by new regularization methods: dropout, dropconnect, etc.

Getting stuck in local minima (?)

Hyunjung (Helen) Shin 64




Vanishing Gradient Problem: Rectified Linear Unit (ReLU)

Fast to compute
Biological reason

  a = z  for z > 0
  a = 0  for z ≤ 0

[Figure: the sigmoid σ(z) compared with the ReLU activation a(z)]

Xavier Glorot, 2011, AISTATS
Andrew L. Maas, 2013, ICML
Kaiming He, 2015, arXiv
Hyunjung (Helen) Shin 66
Vanishing Gradient Problem: Rectified Linear Unit (ReLU)

[Figure: a deep network with inputs x₁, x₂, …, x_N and outputs y₁, y₂, …, y_N]

In 2006, people used RBM pre-training; in 2015, people use ReLU.

Lower layers: smaller gradients → learn very slowly → remain almost random
Upper layers: larger gradients → learn very fast → already converged

Hyunjung (Helen) Shin 67


Vanishing Gradient Problem: Rectified Linear Unit (ReLU)

An intuitive way to compute the gradient: ∂C/∂w ≈ ΔC/Δw

[Figure: a deep network with inputs x₁ … x_N, outputs y₁ … y_M compared with targets ŷ₁ … ŷ_M giving cost C; a weight in a lower layer is perturbed by +Δw and the resulting change C + ΔC at the output is observed — the lower layers have smaller gradients]

Hyunjung (Helen) Shin 68


Vanishing Gradient Problem: Rectified Linear Unit (ReLU)

An intuitive way to compute the gradient: ∂C/∂w ≈ ΔC/Δw

[Figure: the same network — with sigmoid units, a large change at a unit's input produces only a small change at its output, so the effect of +Δw shrinks as it propagates forward and the lower layers get smaller gradients]

Hyunjung (Helen) Shin 69


Vanishing Gradient Problem: Rectified Linear Unit (ReLU)

[Figure: with ReLU activations, units whose output is 0 pass nothing on and can be removed from the network]

Hyunjung (Helen) Shin 70


Vanishing Gradient Problem: Rectified Linear Unit (ReLU)

The remaining active units form a thinner, linear network

[Figure: x₁, x₂ → y₁, y₂ through only the active (non-zero) ReLU units]

This thinner linear network does not have smaller gradients.

Hyunjung (Helen) Shin 71


Issues in Deep Neural Network

Vanishing gradient problem
  Gradients in the lower layers are typically extremely small
  Problems with the saturating non-linear activation (sigmoid)
  Solved by a new non-linear activation, the Rectified Linear Unit (ReLU)

Overfitting problem
  Given limited amounts of labeled data, training via back-propagation does not work well
  Solved by new regularization methods: dropout, dropconnect, etc.

Getting stuck in local minima (?)

Hyunjung (Helen) Shin 72


Overfitting Problem: Dropout
Each time before computing the gradients
  Each neuron has probability p% of dropping out

Training:
  Pick a mini-batch
  θᵗ ← θᵗ⁻¹ − η ∇C(θᵗ⁻¹)

Hyunjung (Helen) Shin 73


Overfitting Problem: Dropout
Each time before computing the gradients
  Each neuron has probability p% of dropping out
  The structure of the network is changed
  The new (thinned) network is used for training

Training:
  Pick a mini-batch
  θᵗ ← θᵗ⁻¹ − η ∇C(θᵗ⁻¹)

For each mini-batch, we resample the dropout neurons.

Hyunjung (Helen) Shin 74


Overfitting Problem: Dropout
When working in a team, if everyone expects their partner to do the work, nothing gets done in the end.
However, if you know your partner may drop out, you will do better yourself.
When testing, no one actually drops out, so we obtain good results.

Training:

Hyunjung (Helen) Shin 75


Overfitting Problem: Dropout
No dropout at testing time
  If the dropout rate at training is p%, multiply all the weights by (1 − p)
  Assume the dropout rate is 50%: if a weight is w = 1 after training, set w = 0.5 for testing

Testing:

Hyunjung (Helen) Shin 76


Overfitting Problem: Dropout

Why should the weights be multiplied by (1 − p) (p being the dropout rate) at testing time?

Training of dropout (assume a dropout rate of 50%): on average, half of a unit's inputs are dropped, giving pre-activation z.
Testing of dropout (no dropout): with the weights w₁, …, w₄ from training, all inputs contribute, so z′ ≈ 2z.
Multiplying the weights by 0.5 (i.e., by 1 − p) restores z′ ≈ z.
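A short sketch (with illustrative weight and input values, not from the slides) checking the scaling rule numerically: with dropout rate p at training time, the expected pre-activation matches the test-time pre-activation computed with the weights multiplied by (1 − p).

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                   # dropout rate (50%)
w = np.array([1.0, 2.0, -1.0, 0.5])       # weights learned with dropout (assumed values)
x = np.array([0.3, 0.7, 0.2, 0.9])        # inputs to the unit (assumed values)

# Training: each input is kept with probability (1 - p); average pre-activation over many masks
masks = (rng.random((100000, x.size)) > p).astype(float)
z_train = np.mean(masks @ (w * x))

# Testing: no dropout, but weights scaled by (1 - p)
z_test = (1 - p) * w @ x

print(f"average training z = {z_train:.3f}, test-time z = {z_test:.3f}")  # they match
```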

Hyunjung (Helen) Shin 77


Overfitting Problem: Dropout
Dropout is a kind of ensemble
  Train a bunch of networks with different structures

[Figure: the training samples are used to train many different networks f₁, f₂, …, f_J, whose outputs are averaged to give F(x)]

Hyunjung (Helen) Shin 78


Overfitting Problem: Dropout
Dropout is a kind of ensemble
  Use one mini-batch to train one (thinned) network
  Some parameters in the networks are shared

With M neurons there are 2^M possible thinned networks.

Hyunjung (Helen) Shin 79


Overfitting Problem: Dropout
Dropout is a kind of ensemble
  For testing data x, we would ideally average the outputs y₁, y₂, y₃, … of all the thinned networks
  Instead, we use the full network with all the weights multiplied by (1 − p), which approximates this average
Hyunjung (Helen) Shin 80
Issues in Deep Neural Network

Vanishing gradient problem
  Gradients in the lower layers are typically extremely small
  Problems with the saturating non-linear activation (sigmoid)
  Solved by a new non-linear activation, the Rectified Linear Unit (ReLU)

Overfitting problem
  Given limited amounts of labeled data, training via back-propagation does not work well
  Solved by new regularization methods: dropout, dropconnect, etc.

Getting stuck in local minima (?)

Hyunjung (Helen) Shin 81


Getting Stuck in Local Minima
Unsupervised pre-training may help the network initialize with good parameters.

Common wisdom: training does not work because we "get stuck in local minima".

[Figure: loss as a function of the parameters, with several local minima]
Hyunjung (Helen) Shin 82
Miscellaneous

How many layers should we use and


how wide should they be?

Deep belief nets give the creator a lot of freedom


How best to make use of that freedom depends on the task.
With enough narrow layers we can model any distribution over
binary vectors (Sutskever & Hinton, 2007)

Hyunjung (Helen) Shin 83


DBN Packages
Package | Language | Description
Darch | R | Generates neural networks with many layers (deep architectures) and trains them with the method introduced in "A fast learning algorithm for deep belief nets" (G. E. Hinton, S. Osindero, Y. W. Teh, 2006)
H2O | R | Deep feedforward neural networks and autoencoders
MxNet | R | Pre-trained models that you can use for object recognition
DeepNet | R | Deep neural networks, deep belief networks and restricted Boltzmann machines
Nolearn 0.6.0 | Python | A simpler package than Lasagne for training neural networks
– | Python | A simple, clean, fast NumPy-based Python implementation of deep belief networks built on binary restricted Boltzmann machines (RBMs)
DeepLearnToolbox, 2014 | Matlab | A Matlab/Octave toolbox for deep learning; includes deep belief nets, stacked autoencoders, and convolutional neural nets
DeepMat, 2014 | Matlab | Matlab code for restricted/deep Boltzmann machines and autoencoders
Deeplearning4j | Java | The first commercial-grade, open-source, distributed deep-learning library written for Java and Scala; designed to be used in business environments rather than as a research tool
Encog | Java | An advanced machine learning framework supporting support vector machines, artificial neural networks, genetic programming, Bayesian networks, hidden Markov models, and genetic algorithms

http://www.teglor.com/b/deep-learning-libraries-language-cm569/

Hyunjung (Helen) Shin 84


Convolutional Neural Network
(CNN)

Hyunjung (Helen) Shin 85


Fully connected neural network

Example
  • 1000×1000 image
  • 1M hidden units
  → 10¹² (= 10⁶ × 10⁶) parameters!

Observation
  • Spatial correlation is local

Hyunjung (Helen) Shin 86


Locally connected neural net

Example
  • 1000×1000 image
  • 1M hidden units
  • Filter size: 10×10
  → 10⁸ (= 10⁶ × 10 × 10) parameters!

Observation
  • Statistics are similar at different locations

Hyunjung (Helen) Shin 87


Convolution network

Share the same parameters across different locations
  • Convolution with learned kernels

Learn multiple filters
  • 1000×1000 image
  • 100 filters
  • Filter size: 10×10
  → 10,000 (= 100 × 10 × 10) parameters
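As a quick arithmetic check on the three counts above (fully connected, locally connected, and shared convolutional filters), a small Python snippet:

```python
# Parameter counts for a 1000x1000 image, following the three slides above
pixels = 1000 * 1000          # 10**6 inputs
hidden = 10**6                # 1M hidden units
filter_size = 10 * 10         # 10x10 local receptive field
n_filters = 100

fully_connected = pixels * hidden            # every hidden unit sees every pixel
locally_connected = hidden * filter_size     # every hidden unit sees a 10x10 patch
shared_filters = n_filters * filter_size     # one 10x10 kernel per filter, shared everywhere

print(f"{fully_connected:.0e}")   # 1e+12
print(f"{locally_connected:.0e}") # 1e+08
print(shared_filters)             # 10000
```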

Hyunjung (Helen) Shin 88


Convolution neural networks

We can design neural networks that are specifically adapted for these
problems
▪ Must deal with very high-dimensional inputs
• 1000x1000 pixels
▪ Can exploit the 2D topology of pixels
▪ Can build in invariance to certain variations we can expect
• Translations, etc

Ideas
▪ Local connectivity
▪ Parameter sharing

Hyunjung (Helen) Shin 89


Convolution

from: https://developer.apple.com/library/ios/documentation/Performance/Conceptual/vImage/ConvolutionOperations/ConvolutionOperations.html
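A naive "valid" 2D convolution in NumPy — a sketch for clarity, not an efficient implementation. (Strictly speaking it computes cross-correlation, which is what most CNN libraries implement under the name convolution.)

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image and take a weighted sum at each position
    (cross-correlation, as used in most CNN libraries)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)   # simple vertical-edge detector
print(conv2d_valid(image, edge_kernel))          # shape (3, 3), all entries -6
```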

Hyunjung (Helen) Shin 90


Convolution

Hyunjung (Helen) Shin 91


Convolution

Hyunjung (Helen) Shin 92


Convolution

Hyunjung (Helen) Shin 93


Pooling

Max vs Average pooling

max pooling

average pooling
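A minimal sketch of 2×2 max and average pooling with stride 2, just to make the difference concrete (toy input, no padding; not code from the slides).

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping size x size pooling (stride = size); x must divide evenly."""
    H, W = x.shape
    blocks = x.reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 5.],
              [0., 1., 7., 2.],
              [3., 2., 4., 6.]])
print(pool2d(x, mode="max"))   # [[4. 5.] [3. 7.]]
print(pool2d(x, mode="avg"))   # [[2.5 2.  ] [1.5 4.75]]
```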

Hyunjung (Helen) Shin 94


Pooling

Max vs Average pooling Drawbacks

Hyunjung (Helen) Shin 95


LeNet

Yann LeCun and his collaborators developed a recognizer for handwritten


digits by using back-propagation in a feed-forward net

Hyunjung (Helen) Shin 96


LeNet

#(Parameters) = 3,274,634

Layer   | C1  | C2     | FC1       | FC2
Weights | 800 | 51,200 | 3,211,264 | 10,240
Biases  | 32  | 64     | 1,024     | 10
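A short check that the table adds up to 3,274,634. The kernel and feature-map sizes below (5×5 kernels, 32 and 64 feature maps, a 7×7×64 input to FC1, 1024 FC1 units) are inferred from the weight counts in the table, so treat them as assumptions.

```python
# Inferred architecture: C1 = 32 filters of 5x5x1, C2 = 64 filters of 5x5x32,
# FC1 = 1024 units on a 7x7x64 input, FC2 = 10 output units.
c1_w,  c1_b  = 32 * 5 * 5 * 1,    32       # 800, 32
c2_w,  c2_b  = 64 * 5 * 5 * 32,   64       # 51,200, 64
fc1_w, fc1_b = 7 * 7 * 64 * 1024, 1024     # 3,211,264, 1,024
fc2_w, fc2_b = 1024 * 10,         10       # 10,240, 10

total = c1_w + c1_b + c2_w + c2_b + fc1_w + fc1_b + fc2_w + fc2_b
print(total)   # 3274634
```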

Hyunjung (Helen) Shin 97


MNIST Dataset

▪ handwritten digits
▪ a training set of 60,000 examples
▪ 28×28 images

Hyunjung (Helen) Shin 98


The 82 errors by LeNet5

Notice that most of


the errors are cases
that people find quite
easy

The human error rate


is probably 20 to 30
errors but nobody has
had the patience to
measure it

Hyunjung (Helen) Shin 99


Feature map results

Hyunjung (Helen) Shin 100


Learned Filters

Trained 32 filters on C1 layer

Hyunjung (Helen) Shin 101


Learned Filters

[Figure: for each of eight learned filters, the filtered result and the corresponding ReLU output]

Hyunjung (Helen) Shin 102


Recurrent Neural Network
(RNN)

Hyunjung (Helen) Shin 103


RNN

one to one one to many many to one many to many many to many

Vanilla Neural Networks

Hyunjung (Helen) Shin 104


RNN offers a lot of flexibility

e.g. Image Captioning


Image →sequence of words

Hyunjung (Helen) Shin 105


RNN offers a lot of flexibility

e.g. Sentiment Classification


sequence of words →sentiment

Hyunjung (Helen) Shin 106


RNN offers a lot of flexibility

e.g. Machine Translation


sequence of words → sequence of words

Hyunjung (Helen) Shin 107


RNN offers a lot of flexibility

e.g. Video Classification On Frame Level

Hyunjung (Helen) Shin 108


Recurrent Neural Network
We can process a sequence of vectors x by applying a recurrence formula at every time step:

  h_t = f_W(h_{t−1}, x_t)

where h_t is the new state, h_{t−1} is the old state, x_t is the input vector at time step t, and f_W is some function with parameters W. At some time steps we also predict an output vector from the state.

Notice: the same function and the same set of parameters are used at every step.

Hyunjung (Helen) Shin 109


(Vanilla) Recurrent Neural Network

Hyunjung (Helen) Shin 110


Character-level language model
Vocabulary: [h, e, l, o]
Training sequence: "hello"

We want the green numbers (the scores of the correct next characters) to be high and the red numbers to be low.

  W_hh ∈ ℝ^{3×3},  W_xh ∈ ℝ^{3×4}
  h_t = tanh( W_hh h_{t−1} + W_xh x_t + b_h )
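A minimal forward pass of the character-level model above on "hello" with vocabulary [h, e, l, o], using the recurrence h_t = tanh(W_hh h_{t−1} + W_xh x_t + b_h) and a softmax output layer. The random weights and the added output matrix W_hy are illustrative assumptions (only W_hh and W_xh appear on the slide).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ['h', 'e', 'l', 'o']
char_to_ix = {c: i for i, c in enumerate(vocab)}

hidden_size, vocab_size = 3, 4
W_hh = 0.1 * rng.normal(size=(hidden_size, hidden_size))   # 3x3, as on the slide
W_xh = 0.1 * rng.normal(size=(hidden_size, vocab_size))    # 3x4, as on the slide
W_hy = 0.1 * rng.normal(size=(vocab_size, hidden_size))    # assumed output layer
b_h = np.zeros(hidden_size)

def one_hot(i):
    v = np.zeros(vocab_size); v[i] = 1.0
    return v

h = np.zeros(hidden_size)
loss = 0.0
text = "hello"
for t in range(len(text) - 1):               # inputs h,e,l,l -> targets e,l,l,o
    x = one_hot(char_to_ix[text[t]])
    h = np.tanh(W_hh @ h + W_xh @ x + b_h)   # the recurrence from the slide
    scores = W_hy @ h
    probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary
    target = char_to_ix[text[t + 1]]
    loss += -np.log(probs[target])           # cross-entropy: we want the target prob high
print("total loss over the sequence:", loss)
```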

Hyunjung (Helen) Shin 111


Example

Hyunjung (Helen) Shin 112


RNN: Probabilistic Perspectives
Given example sequences {X_i}, let us find a probabilistic model p(X) that maximizes ∏_i p(X_i)

𝑋𝑖 is a sequence 𝑋𝑖 = (𝑥1, 𝑥2, ⋯ , 𝑥𝑇)


For example,
• 𝑥1 = ℎ, 𝑥2 =e, 𝑥3 = 𝑙, 𝑥4 = 𝑙, 𝑥5 = 𝑜
• 𝑥1 = "𝐼", 𝑥2 = "𝑎𝑚", 𝑥3 = "𝑎", 𝑥4 = "𝑠𝑡𝑢𝑑𝑒𝑛𝑡"

For the model learning, we have to specify


• Model selection
• Learning criterion
• Parameter learning method

Hyunjung (Helen) Shin 113


RNN: Probabilistic Perspectives
Model selection
  By the chain rule, we can decompose p(X) = p(x_1, x_2, …, x_T) into

    p(X) = ∏_t p(x_t | x_1, x_2, …, x_{t−1}),

  and an RNN can handle this situation well: the hidden states summarize the prefixes x_1, x_2, ⋯, x_{T−1}, and the outputs g_Θ(h_1), g_Θ(h_2), …, g_Θ(h_{T−1}) model the conditional distributions.

Hyunjung (Helen) Shin 114


RNN: Probabilistic Perspectives
Model selection
  • g_Θ(h_{t−1}) is a parametric model of p(x_t | x_1, x_2, ⋯, x_{t−1})
  • If we have an N-word dictionary, we can represent p(x_t | x_1, x_2, ⋯, x_{t−1}) with an N-dimensional vector (non-negative and summing to one):

    g_Θ(h_{t−1}) = [ p(x_t = 1st word in the dictionary | x_1, …, x_{t−1}),
                     p(x_t = 2nd word in the dictionary | x_1, …, x_{t−1}),
                     p(x_t = 3rd word in the dictionary | x_1, …, x_{t−1}),
                     ⋮
                     p(x_t = Nth word in the dictionary | x_1, …, x_{t−1}) ]

Hyunjung (Helen) Shin 115


RNN: Learning Criterion
Learning criterion: log-likelihood / cross-entropy

  log p(x_1, x_2, ⋯, x_T) = Σ_t log p(x_t | x_1, x_2, …, x_{t−1})

Each term log p(x_t | x_1, …, x_{t−1}) is (minus) the cross-entropy between the model output g_Θ(h_{t−1}) and the one-hot vector (0, …, 0, 1, 0, …, 0)ᵀ whose element corresponding to x_t is 1.

Hyunjung (Helen) Shin 116


Summary

The above RNN model learns the (conditional) probability model of a sequence, g_Θ(h_{t−1}) ≃ p(x_t | x_1, x_2, ⋯, x_{t−1}), by maximizing the log-likelihood of the training sequences.

  ▪ RNNs allow a lot of flexibility in architecture design
  ▪ Vanilla RNNs are simple but don't work very well
  ▪ Backward flow of gradients in an RNN can explode or vanish

Hyunjung (Helen) Shin 117


Long-Short Term Memory
(LSTM)

Hyunjung (Helen) Shin 118


Repeating module in RNN

Hyunjung (Helen) Shin 119


RNN: Problems of Long-term Dependencies
One of the appeals of RNNs is the idea that they might be able to
connect previous information to the present task

Example: the prediction of the next word based on the previous ones
“the clouds are in the sky,”

Hyunjung (Helen) Shin 120


RNN: Problems of Long-term Dependencies
There can be a large gap between the relevant information and the place where it is needed:
  "I grew up in France… I speak fluent French."

Hyunjung (Helen) Shin 121


LSTM: Long Short Term Memory
The most popular recurrent node type is the Long Short-Term Memory (LSTM).

LSTM also includes gates, which can turn the history on or off, and a few additional inputs.

Hyunjung (Helen) Shin 122


Repeating module in LSTM

Hyunjung (Helen) Shin 123


LSTM: Cell State
Cell state
• It runs straight down the entire chain, with linear interactions
• LSTM has the ability to remove or add information to the cell state,
regulated by gates

Hyunjung (Helen) Shin 124


LSTM: Gates
Gates are a way to optionally let information through
• A sigmoid neural net layer and a pointwise multiplication operation
• The sigmoid layer outputs numbers between zero and one, describing how
much of each component should be let through
A value of zero means “let nothing through,”
A value of one means “let everything through”

Hyunjung (Helen) Shin 125


LSTM: Forget Gate Layer
Forget gate
  • decides what information we're going to throw away from the cell state
  • this decision is made by a sigmoid layer called the "forget gate layer"

  f_t = 1: completely keep this information
  f_t = 0: completely get rid of this information

Language model example
  • When we see a new subject, we want to forget the gender information of the old subject stored in C_{t−1}

Hyunjung (Helen) Shin 126


LSTM: Input Gate Layer
Input gate
• to decide what new information we’re going to store in the cell state
• a sigmoid layer called the “input gate layer” decides which values we’ll
update

Hyunjung (Helen) Shin 127


LSTM: Cell State Update
We multiply the old state by f_t, forgetting the things we decided to forget earlier. Then we add i_t ∗ C̃_t, the new candidate values scaled by how much we decided to update each state value.
In the case of the language model, this is where we'd actually drop the information about the old subject's gender and add the new information, as we decided in the previous steps.

Hyunjung (Helen) Shin 128


LSTM: Output
We run a sigmoid layer which decides what parts of the cell state we’re going to
output

Then, we put the cell state through tanh (to push the values to be between −1 and
1) and multiply it by the output of the sigmoid gate, so that we only output the
parts we decided to
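A minimal NumPy sketch of one LSTM step using the textbook equations the preceding slides walk through (forget gate, input gate, candidate cell state, cell update, output gate); the layer sizes and random weights are placeholders, not values from the course.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
# One weight matrix and bias per gate, acting on the concatenation [h_{t-1}, x_t]
Wf, Wi, Wc, Wo = (0.1 * rng.normal(size=(n_hid, n_hid + n_in)) for _ in range(4))
bf, bi, bc, bo = (np.zeros(n_hid) for _ in range(4))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(Wf @ z + bf)              # forget gate: what to throw away from C_{t-1}
    i_t = sigmoid(Wi @ z + bi)              # input gate: which values to update
    C_tilde = np.tanh(Wc @ z + bc)          # candidate values to add
    C_t = f_t * C_prev + i_t * C_tilde      # cell state update
    o_t = sigmoid(Wo @ z + bo)              # output gate: which parts of the state to emit
    h_t = o_t * np.tanh(C_t)
    return h_t, C_t

h, C = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):        # a short random input sequence
    h, C = lstm_step(x, h, C)
print("final hidden state:", h)
```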

Hyunjung (Helen) Shin 129


Summary

▪ RNNs allow a lot of flexibility in architecture design

▪ Common to use LSTM: their additive interactions improve


gradient flow

▪ Backward flow of gradients in RNN can explode or vanish

▪ Exploding is controlled with gradient clipping. Vanishing is


controlled with additive interactions (LSTM)

Hyunjung (Helen) Shin 130
