
Machine Learning

DEEP LEARNING

Hyunjung (Helen) Shin


Dept. of Industrial Engineering,
Artificial Intelligence,
Integrative Systems Engineering
Ajou University, Korea

shin@ajou.ac.kr
http://www.alphaminers.net
Deep Learning Overview

1 Boltzmann Machines

2 Restricted Boltzmann Machines (RBM)

3 Deep Boltzmann Machines (DBM)

4 Belief Networks

5 Deep Belief Networks (DBN)

6 Convolutional Neural Network (CNN)

7 Recurrent Neural Network (RNN)

8 Long Short Term Memory (LSTM)

Hyunjung (Helen) Shin 2


DEEP LEARNING

Hyunjung (Helen) Shin 3


Artificial Intelligence: ALPHAGO?

Hyunjung (Helen) Shin 4


Artificial Intelligence: ALPHAGO?
Searching Tree

Hyunjung (Helen) Shin


Artificial Intelligence: ALPHA ZERO

Hyunjung (Helen) Shin


Multi-Layered Neural Network

[Figure: a multi-layered neural network with input units x1, x2, …, xd and weight matrices w1, w2, w3 between successive layers]

Hyunjung (Helen) Shin 7


Deep Neural Net: Convolutional Neural Network (CNN)
Feed-forward network
Convolution
Non-linearity: rectified linear units
Pooling: (typically) local maximum

Supervised learning

Representation learning

LeNet (LeCun89)

[LeCun89] Y. LeCun et al.: Handwritten Digit Recognition with a Back-Propagation Network. NIPS 1989

Hyunjung (Helen) Shin 8


Deep Neural Net: Convolutional Neural Network (CNN)

Early CNNs had not shown impressive performance because of:

Insufficient training data

Slow convergence
  Bad activation function: the sigmoid
  Too many parameters

Limited computing resources

Lack of theory: development relied on trial and error

Hyunjung (Helen) Shin 9


Deep Neural Net: Convolutional Neural Network (CNN)

CNNs have recently drawn a lot of attention due to their great success.

Availability of larger training datasets: ImageNet


Powerful GPUs
Better model regularization strategy such as dropout
Simple activation function: ReLU

Hyunjung (Helen) Shin 10


Deep Neural Net: Revolution of Depth -ImageNet

Hyunjung (Helen) Shin


Deep Neural Net: Revolution of Depth -ImageNet

ImageNet
Over 15 million labeled high-resolution images
Roughly 22,000 categories
Collected from the web
Labeled by human labelers using Amazon’s Mechanical Turk
crowd-sourcing tool.

ILSVRC: ImageNet Large-Scale Visual Recognition Challenge


Uses a subset of ImageNet
1000 categories
1.2 million training images
50,000 validation images
150,000 test images

Report two error rates: top-1 and top-5

Hyunjung (Helen) Shin


Deep Neural Net: Revolution of Depth

Hyunjung (Helen) Shin


Deep Neural Net: Revolution of Depth

Hyunjung (Helen) Shin


Deep Neural Net: Error Rate of Winning Algorithms
Top-5 error rate (%) of the ILSVRC winning algorithms:

Year | Winner | Top-5 error (%)
2010 | NEC-UIUC | 28
2011 | XRCE | 26
2012 | AlexNet | 16
2013 | ZFNet | 12
2014 | 1. GoogLeNet, 2. VGGNet | 7
2015 | ResNet | 3.6
2016 | GoogLeNet-v4 | 3
2017 | SENet | 2.3

Human ability: roughly 5% top-5 error.

Hyunjung (Helen) Shin


Deep Neural Net: ReLU

Sigmoid function:  f(x) = 1 / (1 + e^(−x))
Rectified linear unit (ReLU):  f(x) = max(0, x)

[Figure: training error rate vs. training iterations — the ReLU network converges much faster than the sigmoid network]
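To make the contrast concrete, here is a minimal NumPy sketch (an illustration, not from the slides): the sigmoid's derivative is at most 0.25 and shrinks toward zero for large |x|, while the ReLU derivative is exactly 1 for every positive input.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # at most 0.25, ~0 for large |x|

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)  # exactly 1 for all positive inputs

xs = np.array([-10.0, -2.0, 0.5, 2.0, 10.0])
print("sigmoid'(x):", np.round(sigmoid_grad(xs), 4))  # e.g. [0. 0.105 0.235 0.105 0.]
print("relu'(x):   ", relu_grad(xs))                  # [0. 0. 1. 1. 1.]
```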

Hyunjung (Helen) Shin 16


Deep Neural Net: LeNet

LeNet (LeCun89)

Hyunjung (Helen) Shin


Deep Neural Net: AlexNet

AlexNet (Krizhevsky et al. 2012)

Hyunjung (Helen) Shin


Deep Neural Net: GoogLeNet

GoogLeNet (Szegedy et al., 2014)

Hyunjung (Helen) Shin


Deep Neural Net: Multiple Levels of Feature Representation
Visible layer (raw input) → 1st layer → 2nd layer → 3rd layer
Represents more and more abstract features of the raw input, e.g., edges, local shapes, object parts, objects, etc.

Feature representation hierarchy:
  Pixels → 1st layer: edges → 2nd layer: object parts → 3rd layer: objects

Slides from Junmo Kim 2006

Hyunjung (Helen) Shin 20


Deep Neural Net Training: Deep Learning

Deep learning is a procedure of training a deep network

① Use unsupervised pre-training (greedy layer-wise training)


Allows abstraction to develop naturally from one layer to another
② Perform supervised top-down training as final step. Refine the features
(intermediate layers) so that they become more relevant for the task

[Figure: speech example — phoneme → word → sentence/grammar; traditional top-down error propagation vs. bottom-up unsupervised pre-training]

Hyunjung (Helen) Shin 21


Deep Learning Library / Packages
Name | Base Language | Lead Developer | Features
Theano | Python | Bengio group, Univ. of Montreal | Flexible; large community; Keras, Lasagne
Caffe | C++ with Matlab/Python wrappers | UC Berkeley | Fast; good for training and fine-tuning feedforward models; Model Zoo
TensorFlow | Python, C++ | Google | Similar to Theano, still slower; TensorBoard; multi-GPU & multi-node
Torch | Lua | - | Fast; multi-GPU support
DL4J | Java | Skymind | Multi-GPU support; commercially supported by Skymind

Hyunjung (Helen) Shin 22


Boltzmann Machines

Hyunjung (Helen) Shin 23


Boltzmann Machines
A network of symmetrically coupled stochastic binary units {0,1}

  Hidden layer: h ∈ {0,1}^p
  Visible layer: v ∈ {0,1}^d

Parameters: θ = {W, L, J}
  W: visible-to-hidden
  L: visible-to-visible, diag(L) = 0
  J: hidden-to-hidden, diag(J) = 0

Energy of the (general) Boltzmann machine:
  E(v, h; θ) = −(1/2) vᵀLv − (1/2) hᵀJh − vᵀWh

Hyunjung (Helen) Shin 24


Boltzmann Machines
A network of symmetrically coupled stochastic binary units {0,1}

Energy of the Boltzmann machine:
  E(v, h; θ) = −(1/2) vᵀLv − (1/2) hᵀJh − vᵀWh

Generative model
  Joint likelihood: P(v, h | θ) ∝ exp( −E(v, h; θ) )

Probability of a visible vector v:
  P(v | θ) = Σ_h P(v, h | θ) = Σ_h exp( −E(v, h; θ) ) / Z(θ)
  Z(θ) = Σ_v Σ_h exp( −E(v, h; θ) )
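A small NumPy sketch (an illustration, not code from the course) that evaluates the energy E(v, h; θ) and the unnormalized joint probability exp(−E) for a toy Boltzmann machine; the sizes and random parameters below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 4, 3                      # visible / hidden units (toy sizes)

# Symmetric couplings with zero diagonals, as required for L and J
L = rng.normal(scale=0.1, size=(d, d)); L = (L + L.T) / 2; np.fill_diagonal(L, 0.0)
J = rng.normal(scale=0.1, size=(p, p)); J = (J + J.T) / 2; np.fill_diagonal(J, 0.0)
W = rng.normal(scale=0.1, size=(d, p))  # visible-to-hidden weights

def energy(v, h):
    """E(v, h; theta) = -1/2 v'Lv - 1/2 h'Jh - v'Wh."""
    return -0.5 * v @ L @ v - 0.5 * h @ J @ h - v @ W @ h

v = rng.integers(0, 2, size=d).astype(float)   # binary visible state
h = rng.integers(0, 2, size=p).astype(float)   # binary hidden state
E = energy(v, h)
print("E(v, h) =", E, " unnormalized P(v, h) proportional to", np.exp(-E))
```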

Hyunjung (Helen) Shin 25


Restricted Boltzmann Machines
(RBM)

Hyunjung (Helen) Shin 26


Restricted Boltzmann Machines
No hidden-to-hidden and no visible-to-visible connections

Parameters
  W: visible-to-hidden
  L = 0: no visible-to-visible connections
  J = 0: no hidden-to-hidden connections

Energy of the RBM:
  E(v, h | θ) = −vᵀWh − bᵀv − aᵀh

Joint likelihood:
  P(v, h | θ) = (1/Z(θ)) exp( −E(v, h; θ) )

[Figure: a Restricted Boltzmann Machine — a bipartite graph with a hidden layer and a visible layer connected by W]

Hyunjung (Helen) Shin 27



Restricted Boltzmann Machines

Top layer: vector of stochastic binary hidden units h


Bottom layer: a vector of stochastic binary visible variables v

Figure is taken from R. Salakhutdinov

Hyunjung (Helen) Shin 29


RBM: The Energy of a Joint Configuration
(ignoring terms to do with biases)

  E(v, h) = − Σ_{i,j} v_i h_j w_ij

where v_i is the binary state of visible unit i, h_j is the binary state of hidden unit j, and w_ij is the weight between units i and j; E(v, h) is the energy of the configuration with v on the visible units and h on the hidden units.

  ∂E(v, h) / ∂w_ij = − v_i h_j

Hyunjung (Helen) Shin 30


RBM: Weights → Energies → Probabilities

Each possible joint configuration of the visible and hidden units has an energy. The energy is determined by the weights (and biases):

  E(v, h) = − Σ_{i,j} v_i h_j w_ij

The energy of a joint configuration of the visible and hidden units determines its probability:

  p(v, h) ∝ e^(−E(v, h))

Hyunjung (Helen) Shin 31


RBM: Using Energies to Define Probabilities

The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations:

  p(v, h) = e^(−E(v, h)) / Σ_{u,g} e^(−E(u, g))        (the denominator is the partition function)

The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it:

  p(v) = Σ_h e^(−E(v, h)) / Σ_{u,g} e^(−E(u, g))

Hyunjung (Helen) Shin 32


RBM: Training
Due to the special bipartite structure of RBMs, the hidden units can be explicitly marginalized out:

  P(v; θ) = (1/Z(θ)) Σ_h exp( −E(v, h; θ) )
          = (1/Z(θ)) Σ_h exp( vᵀWh + bᵀv + aᵀh )
          = (1/Z(θ)) exp(bᵀv) ∏_{j=1}^{F} Σ_{h_j ∈ {0,1}} exp( a_j h_j + Σ_{i=1}^{D} W_ij v_i h_j )
          = (1/Z(θ)) exp(bᵀv) ∏_{j=1}^{F} ( 1 + exp( a_j + Σ_{i=1}^{D} W_ij v_i ) )

Hyunjung (Helen) Shin 33


RBM: Training
  P(v; θ) = (1/Z(θ)) exp(bᵀv) ∏_{j=1}^{F} ( 1 + exp( a_j + Σ_{i=1}^{D} W_ij v_i ) )

Gradient of the log-likelihood (used for gradient-based learning):

  ∂ log P(v; θ) / ∂W = E_{P_data}[ v hᵀ ] − E_{P_model}[ v hᵀ ]
  ∂ log P(v; θ) / ∂a = E_{P_data}[ h ] − E_{P_model}[ h ]
  ∂ log P(v; θ) / ∂b = E_{P_data}[ v ] − E_{P_model}[ v ]

The exact calculations are intractable because the expectation under P_model takes time exponential in min(D, F).
An efficient Gibbs-sampling-based approximation exists (contrastive divergence).
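A minimal sketch of one CD-1 update (contrastive divergence with a single Gibbs step), assuming a small binary RBM with arbitrary toy sizes and learning rate; it illustrates the data-dependent and reconstruction-based statistics in the gradient above, not the course's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, F = 6, 4                          # visible units, hidden units (toy sizes)
W = 0.01 * rng.normal(size=(D, F))   # visible-to-hidden weights
a = np.zeros(F)                      # hidden biases
b = np.zeros(D)                      # visible biases
lr = 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0):
    """One contrastive-divergence (CD-1) step for a batch of binary visible vectors v0."""
    global W, a, b
    # Positive phase: data-dependent statistics <v h>_data
    ph0 = sigmoid(v0 @ W + a)                       # P(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One Gibbs step: reconstruct v, then recompute hidden probabilities
    pv1 = sigmoid(h0 @ W.T + b)                     # P(v=1 | h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + a)                       # P(h=1 | v1)
    # Approximate gradient: <v h>_data - <v h>_reconstruction
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
    a += lr * (ph0.mean(axis=0) - ph1.mean(axis=0))
    b += lr * (v0.mean(axis=0) - v1.mean(axis=0))

batch = (rng.random((8, D)) < 0.5).astype(float)    # fake binary training batch
for _ in range(100):
    cd1_update(batch)
```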

Hyunjung (Helen) Shin 34


RBM: How Do We Sample from an RBM?

Due to the bipartite factorization, Gibbs sampling is easy:

  Set v₀ randomly (or, preferably, from the empirical distribution)
  Sample h₀ ~ P(h | v₀)
  For i = 1 to k:
    Sample vᵢ ~ P(v | hᵢ₋₁)
    Sample hᵢ ~ P(h | vᵢ)
  Return (v_k, h_k)

In practice, simply running it for a fixed number of steps k is good enough.

Hyunjung (Helen) Shin 35


RBM: Inference in RBM

Inference is simple in an RBM:

  P(h | v; θ) = ∏_j P(h_j | v),    P(v | h; θ) = ∏_i P(v_i | h)

  P(h_j = 1 | v) = g( Σ_i W_ij v_i + a_j )
  P(v_i = 1 | h) = g( Σ_j W_ij h_j + b_i )

where g(x) = 1 / (1 + exp(−x)) is the sigmoid function.

Hyunjung (Helen) Shin 36


RBM
Modeling handwritten digits using RBM

Training data examples Visualized weights of RBM

Hyunjung (Helen) Shin 37


Deep Boltzmann Machines
(DBM)

Hyunjung (Helen) Shin 38


Deep Boltzmann Machines (DBM)
Undirected connections between all layers: in a multi-layer model, the undirected connections between the layers form a single Boltzmann machine with several hidden layers (i.e., a Deep Boltzmann Machine).

  High-level representations are built from unlabeled inputs
  Labeled data is used only to slightly fine-tune the model

  P(v) = (1/Z) Σ_{h¹,h²,h³} exp[ vᵀW¹h¹ + h¹ᵀW²h² + h²ᵀW³h³ ]

[Figure: layers v, h¹, h², h³ connected by weights W¹, W², W³]

Salakhutdinov & Hinton, 2009


Hyunjung (Helen) Shin 39
Deep Boltzmann Machines: Two layer DBM example

[Figure: a two-layer DBM with hidden layers h¹ and h² and weights W¹, W²]

Assume no within-layer connections.

Salakhutdinov & Hinton, 2009


Hyunjung (Helen) Shin 40
Deep Boltzmann Machines (DBM)

Pre-training
  Can (must) initialize from stacked RBMs

Generative fine-tuning
  Positive phase: variational approximation (mean-field)
  Negative phase: persistent chain (stochastic approximation)

Discriminative fine-tuning
  Backpropagation

[Figure: DBM with layers v, h¹, h², h³ and weights W¹, W², W³]

Salakhutdinov & Hinton, 2009


Hyunjung (Helen) Shin 41
Deep Boltzmann Machines: Experiments
MNIST: 2-layer DBM on 28 × 28 pixel images (hidden layers of 500 and 1000 units)
  60,000 training and 10,000 test examples
  0.9 million parameters
  Gibbs sampler run for 100,000 steps

[Figure: samples from the 2-layer BM, samples from a 2-layer RBM, and real training samples]

After discriminative fine-tuning: 0.95% error rate
  Compare with SVM: 1.4%

Hyunjung (Helen) Shin 42


DBMs vs. DBNs

[Figure: both architectures stack v, h¹, h², h³ with weights W¹, W², W³ — in the Deep Boltzmann Machine all connections are undirected, whereas in the Deep Belief Network only the top two layers form an undirected RBM and the remaining connections are directed top-down]

Deep Boltzmann Machine (left) vs. Deep Belief Network (right)

Salakhutdinov & Hinton, 2009


Hyunjung (Helen) Shin 43
Belief Networks

Hyunjung (Helen) Shin 44


Belief Nets
A belief net is a directed acyclic graph composed of stochastic variables.

We get to observe some of the variables, and we would like to solve two problems:

  ① The inference problem: infer the states of the unobserved variables

  ② The learning problem: adjust the interactions between variables to make the network more likely to generate the observed data

[Figure: stochastic hidden causes connected by directed edges to visible effects]

Hyunjung (Helen) Shin 45


Belief Nets: Stochastic Binary Units (Bernoulli Variables)

  p(s_i = 1) = 1 / ( 1 + exp( −b_i − Σ_j s_j w_ji ) )

  These units have a state of 1 or 0
  The probability of turning on is determined by the weighted input from other units (plus a bias)

[Figure: p(s_i = 1) as a sigmoid function of b_i + Σ_j s_j w_ji]
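A tiny illustrative sketch (with made-up weights and states) of sampling one stochastic binary unit: its probability of turning on is the logistic function of its bias plus weighted input.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_unit(states, weights, bias):
    """Sample s_i in {0,1} with p(s_i = 1) = 1 / (1 + exp(-bias - sum_j s_j * w_ji))."""
    p_on = 1.0 / (1.0 + np.exp(-bias - states @ weights))
    return int(rng.random() < p_on), p_on

s = np.array([1.0, 0.0, 1.0])        # states of the parent units (assumed)
w = np.array([0.8, -1.2, 0.3])       # incoming weights (assumed)
state, p = sample_unit(s, w, bias=-0.5)
print(f"p(s_i = 1) = {p:.3f}, sampled state = {state}")
```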

Hyunjung (Helen) Shin 46


Deep Belief Networks
(DBN)

Hyunjung (Helen) Shin 47


Deep Belief Network
Probabilistic generative model
Deep architecture – multiple layers (stacks of RBMs)
Greedy layer-wise training algorithm (Contrastive Divergence Learning)
Supervised fine-tuning can be applied
[Figure: a DBN built by stacking RBMs — v–h¹, then h¹–h², then h²–h³]

Hinton et al., 2006


Hyunjung (Helen) Shin 48
Deep Belief Network
[Figure: in a DBN, the top two layers form an RBM (used in the generative process), while the lower layers are directed belief nets used for (approximate) inference]
Hinton et al., 2006
Hyunjung (Helen) Shin 49
Deep Belief Network: Greedy Layer-wise Training (1)

𝒉
1. Train the first layer RBM
𝑾𝟏 Construct an RBM with an input layer
𝑣 and a hidden layer ℎ
𝒗

Hinton et al., 2006


Hyunjung (Helen) Shin 50
Deep Belief Network: Greedy Layer-wise Training (2)

2. Stack another hidden layer


𝒉𝟐 on top of the first RBM to form
a new RBM
𝑾𝟐
𝒉 𝟏 Fix 𝑾𝟏 , sample ℎ1 from 𝑄(ℎ1 |𝑣) as
input. Train 𝑾𝟐 as a second RBM
𝑸(𝒉𝟏 |𝒗)
𝑾𝟏

Hinton et al., 2006


Hyunjung (Helen) Shin 51
Deep Belief Network: Greedy Layer-wise Training (3)

3. Continue to stack layers on top of the network and train each new layer as in the previous step, using samples drawn from Q(h²|h¹) as the input to the new RBM W³.

[Figure: h³ stacked on h² and h¹, with Q(h²|h¹) and Q(h¹|v) providing the bottom-up samples through W² and W¹]

Hinton et al., 2006
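A compact sketch of the greedy layer-wise procedure above, assuming a simplified CD-1 trainer and mean-field propagation Q(h|v) between layers; the layer sizes, learning rate, and epoch count are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, lr=0.1, epochs=50):
    """Train one RBM with CD-1 and return (W, hidden_bias, visible_bias)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.normal(size=(n_visible, n_hidden))
    a, b = np.zeros(n_hidden), np.zeros(n_visible)
    for _ in range(epochs):
        ph0 = sigmoid(data @ W + a)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T + b)
        ph1 = sigmoid(pv1 @ W + a)
        n = data.shape[0]
        W += lr * (data.T @ ph0 - pv1.T @ ph1) / n
        a += lr * (ph0 - ph1).mean(axis=0)
        b += lr * (data - pv1).mean(axis=0)
    return W, a, b

def greedy_pretrain(v, layer_sizes):
    """Stack RBMs: train layer 1 on v, then each next layer on Q(h | previous input)."""
    rbms, inp = [], v
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(inp, n_hidden)
        rbms.append((W, a, b))
        inp = sigmoid(inp @ W + a)      # propagate Q(h|v) upward as the next layer's input
    return rbms

v = (rng.random((100, 20)) < 0.5).astype(float)     # fake binary training data
stack = greedy_pretrain(v, layer_sizes=[16, 12, 8]) # W1, W2, W3
```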


Hyunjung (Helen) Shin 52
Deep Belief Network: Classifier

(1) Learn one layer at a time greedily

(2) Then treat this as pre-training that finds a good initial set of
weights which can be fine-tuned by a local search procedure

(3) Backpropagation can be used to fine-tune the model for better


classification. This overcomes many of the limitations of standard
backpropagation

(4) For classification, add and connect a label unit to the top level unit

Hyunjung (Helen) Shin 53


Deep Belief Network: How to sample from a DBN?

1. Sample h^(l−1) from the top-level RBM (using Gibbs sampling)
   Run multiple Gibbs steps (unlike CD-k, this is true sampling)

2. For k = l−1 down to 1:
   Sample h^(k−1) ~ P(· | h^k) from the k-th RBM

3. v = h⁰ is the final sample

The top-most layer generates a sample as a regular RBM; the rest is a top-down pass.
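A minimal sketch of this sampling procedure, assuming an already-trained stack of weight matrices with biases; here the weights are random placeholders and the Gibbs step count and layer sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_dbn(rbms, n_gibbs=200):
    """rbms = [(W1, a1, b1), ..., (Wl, al, bl)] from bottom to top.
    Step 1: run Gibbs in the top RBM; Step 2: top-down ancestral pass; Step 3: return v."""
    W_top, a_top, b_top = rbms[-1]
    # 1. Gibbs sampling in the top-level RBM to get h^(l-1)
    h_low = (rng.random(W_top.shape[0]) < 0.5).astype(float)
    for _ in range(n_gibbs):
        h_top = (rng.random(W_top.shape[1]) < sigmoid(h_low @ W_top + a_top)).astype(float)
        h_low = (rng.random(W_top.shape[0]) < sigmoid(h_top @ W_top.T + b_top)).astype(float)
    # 2. Top-down pass through the remaining (directed) layers
    sample = h_low
    for W, a, b in reversed(rbms[:-1]):
        sample = (rng.random(W.shape[0]) < sigmoid(sample @ W.T + b)).astype(float)
    # 3. The bottom-most sample is the visible vector v
    return sample

# Toy stack with random weights standing in for a trained DBN
sizes = [20, 16, 12, 8]
rbms = [(0.1 * rng.normal(size=(sizes[i], sizes[i + 1])),
         np.zeros(sizes[i + 1]), np.zeros(sizes[i])) for i in range(len(sizes) - 1)]
v = sample_dbn(rbms)
print("sampled visible vector:", v)
```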

Hyunjung (Helen) Shin 54


Deep Belief Network: Classifier - Generative Digits

The top two layers form an associative memory whose energy landscape models the low-dimensional manifolds of the digits. The energy valleys have names: the label units.

[Figure: DBN classifier — an observation vector v (e.g., a 32 × 32 image) feeds up through layers of hidden units via detection (recognition) weights; generative weights point back down; the top-level units, together with the label units, form the associative memory]

Hinton et al., 2006
http://www.cs.toronto.edu/~hinton/adi/index.htm

Hyunjung (Helen) Shin 55


Deep Belief Network: Classifier - Generative Digits

Available at www.cs.toronto.edu/~hinton

[Figure annotation: this could be the top level of another sensory pathway]

Hinton et al., 2006

Hyunjung (Helen) Shin 56


Deep Belief Network: Example - Generative Digits

There are 1000 iterations of alternating Gibbs Sampling between samples.


Hyunjung (Helen) Shin 57
Deep Belief Network: Example - Generative Digits

Hyunjung (Helen) Shin 58


Deep Belief Network: Results on Digit Classification
Effect of pre-training

[Figure: test error with and without pre-training, for 1-layer and 4-layer networks]

Erhan et al., AISTATS 2009

Hyunjung (Helen) Shin 59


Deep Belief Network: Results on Digit Classification
Effect of Depth

[Figure: results without pre-training vs. with pre-training]

Erhan et al., AISTATS 2009

Hyunjung (Helen) Shin 60


Deep Belief Network: Results on Digit Classification
Effect of depth

[Figure: handwriting digit classification accuracy (%) as a function of the number of layers (1 to 10); the accuracy axis runs from 50% to 100%]

Hyunjung (Helen) Shin 61


Deep Belief Network

Why does greedy layer-wise training work?

Regularization hypothesis
  Pre-training is "constraining" the parameters to a region relevant to the unsupervised dataset, and therefore gives better generalization
  (representations that better describe unlabeled data are more discriminative for labeled data)

Optimization hypothesis
  Unsupervised training initializes the lower-level parameters near localities of better minima than random initialization can.
  The initial gradients are sensible, and backprop only needs to perform a local search from a sensible starting point.

Bengio 2009, Erhan et al. 2009

Hyunjung (Helen) Shin 62


Miscellaneous…

Hyunjung (Helen) Shin 63


Issues in Deep Neural Network

Vanishing gradient problem
  Gradients in the lower layers are typically extremely small
  Problems with the saturating non-linear activation (sigmoid)
  Solved by a new non-linear activation, the Rectified Linear Unit (ReLU)

Overfitting problem
  Given limited amounts of labeled data, training via back-propagation does not work well
  Solved by new regularization methods: dropout, dropconnect, etc.

Getting stuck in local minima (?)

Hyunjung (Helen) Shin 64




Vanishing Gradient Problem: Rectified Linear Unit (ReLU)

Fast to compute
Biological reason

  a = z  for z > 0
  a = 0  for z ≤ 0

[Figure: the sigmoid σ(z) compared with the ReLU activation a(z)]

Xavier Glorot, 2011, AISTATS
Andrew L. Maas, 2013, ICML
Kaiming He, 2015, arXiv
Hyunjung (Helen) Shin 66
Vanishing Gradient Problem: Rectified Linear Unit (ReLU)

[Figure: a deep network with inputs x₁, x₂, …, x_N and outputs y₁, y₂, …, y_N]

In 2006, people used RBM pre-training; in 2015, people use ReLU.

Lower layers: smaller gradients → learn very slowly → remain almost random
Upper layers: larger gradients → learn very fast → already converged

Hyunjung (Helen) Shin 67


Vanishing Gradient Problem: Rectified Linear Unit (ReLU)

An intuitive way to compute the gradient: ∂C/∂w ≈ ΔC/Δw

[Figure: a deep network with inputs x₁ … x_N, outputs y₁ … y_M compared with targets ŷ₁ … ŷ_M giving cost C; a weight in a lower layer is perturbed by +Δw and the resulting change C + ΔC at the output is observed — the lower layers have smaller gradients]

Hyunjung (Helen) Shin 68


Vanishing Gradient Problem: Rectified Linear Unit (ReLU)

An intuitive way to compute the gradient: ∂C/∂w ≈ ΔC/Δw

[Figure: the same network — with sigmoid units, a large change at a unit's input produces only a small change at its output, so the effect of +Δw shrinks as it propagates forward and the lower layers get smaller gradients]

Hyunjung (Helen) Shin 69


Vanishing Gradient Problem: Rectified Linear Unit (ReLU)

[Figure: with ReLU activations, units whose output is 0 pass nothing on and can be removed from the network]

Hyunjung (Helen) Shin 70


Vanishing Gradient Problem: Rectified Linear Unit (ReLU)

The remaining active units form a thinner, linear network

[Figure: x₁, x₂ → y₁, y₂ through only the active (non-zero) ReLU units]

This thinner linear network does not have smaller gradients.

Hyunjung (Helen) Shin 71


Issues in Deep Neural Network

Vanishing gradient problem
  Gradients in the lower layers are typically extremely small
  Problems with the saturating non-linear activation (sigmoid)
  Solved by a new non-linear activation, the Rectified Linear Unit (ReLU)

Overfitting problem
  Given limited amounts of labeled data, training via back-propagation does not work well
  Solved by new regularization methods: dropout, dropconnect, etc.

Getting stuck in local minima (?)

Hyunjung (Helen) Shin 72


Overfitting Problem: Dropout
Each time before computing the gradients
  Each neuron has probability p% of dropping out

Training:
  Pick a mini-batch
  θᵗ ← θᵗ⁻¹ − η ∇C(θᵗ⁻¹)

Hyunjung (Helen) Shin 73


Overfitting Problem: Dropout
Each time before computing the gradients
  Each neuron has probability p% of dropping out
  The structure of the network is changed
  The new (thinned) network is used for training

Training:
  Pick a mini-batch
  θᵗ ← θᵗ⁻¹ − η ∇C(θᵗ⁻¹)

For each mini-batch, we resample the dropout neurons.

Hyunjung (Helen) Shin 74


Overfitting Problem: Dropout
When working in a team, if everyone expects their partner to do the work, nothing gets done in the end.
However, if you know your partner may drop out, you will do better yourself.
When testing, no one actually drops out, so we obtain good results.

Training:

Hyunjung (Helen) Shin 75


Overfitting Problem: Dropout
No dropout at testing time
  If the dropout rate at training is p%, multiply all the weights by (1 − p)
  Assume the dropout rate is 50%: if a weight is w = 1 after training, set w = 0.5 for testing

Testing:

Hyunjung (Helen) Shin 76


Overfitting Problem: Dropout

Why should the weights be multiplied by (1 − p) (p being the dropout rate) at testing time?

Training of dropout (assume a dropout rate of 50%): on average, half of a unit's inputs are dropped, giving pre-activation z.
Testing of dropout (no dropout): with the weights w₁, …, w₄ from training, all inputs contribute, so z′ ≈ 2z.
Multiplying the weights by 0.5 (i.e., by 1 − p) restores z′ ≈ z.
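A short sketch (with illustrative weight and input values, not from the slides) checking the scaling rule numerically: with dropout rate p at training time, the expected pre-activation matches the test-time pre-activation computed with the weights multiplied by (1 − p).

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                   # dropout rate (50%)
w = np.array([1.0, 2.0, -1.0, 0.5])       # weights learned with dropout (assumed values)
x = np.array([0.3, 0.7, 0.2, 0.9])        # inputs to the unit (assumed values)

# Training: each input is kept with probability (1 - p); average pre-activation over many masks
masks = (rng.random((100000, x.size)) > p).astype(float)
z_train = np.mean(masks @ (w * x))

# Testing: no dropout, but weights scaled by (1 - p)
z_test = (1 - p) * w @ x

print(f"average training z = {z_train:.3f}, test-time z = {z_test:.3f}")  # they match
```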

Hyunjung (Helen) Shin 77


Overfitting Problem: Dropout
Dropout is a kind of ensemble
  Train a bunch of networks with different structures

[Figure: the training samples are used to train many different networks f₁, f₂, …, f_J, whose outputs are averaged to give F(x)]

Hyunjung (Helen) Shin 78


Overfitting Problem: Dropout
Dropout is a kind of ensemble
  Use one mini-batch to train one (thinned) network
  Some parameters in the networks are shared

With M neurons there are 2^M possible thinned networks.

Hyunjung (Helen) Shin 79


Overfitting Problem: Dropout
Dropout is a kind of ensemble
  For testing data x, we would ideally average the outputs y₁, y₂, y₃, … of all the thinned networks
  Instead, we use the full network with all the weights multiplied by (1 − p), which approximates this average
Hyunjung (Helen) Shin 80
Issues in Deep Neural Network

Vanishing gradient problem
  Gradients in the lower layers are typically extremely small
  Problems with the saturating non-linear activation (sigmoid)
  Solved by a new non-linear activation, the Rectified Linear Unit (ReLU)

Overfitting problem
  Given limited amounts of labeled data, training via back-propagation does not work well
  Solved by new regularization methods: dropout, dropconnect, etc.

Getting stuck in local minima (?)

Hyunjung (Helen) Shin 81


Getting Stuck in Local Minima
Unsupervised pre-training may help the network initialize with good parameters.

Common wisdom: training does not work because we "get stuck in local minima".

[Figure: loss as a function of the parameters, with several local minima]
Hyunjung (Helen) Shin 82
Miscellaneous

How many layers should we use and


how wide should they be?

Deep belief nets give the creator a lot of freedom


How best to make use of that freedom depends on the task.
With enough narrow layers we can model any distribution over
binary vectors (Sutskever & Hinton, 2007)

Hyunjung (Helen) Shin 83


DBN Packages
Package | Language | Description
Darch | R | Generates neural networks with many layers (deep architectures) and trains them with the method introduced in "A fast learning algorithm for deep belief nets" (G. E. Hinton, S. Osindero, Y. W. Teh, 2006)
H2O | R | Deep feedforward neural networks and autoencoders
MxNet | R | Pre-trained models that you can use for object recognition
DeepNet | R | Deep neural networks, deep belief networks and restricted Boltzmann machines
Nolearn 0.6.0 | Python | A simpler package than Lasagne for training neural networks
– | Python | A simple, clean, fast NumPy-based Python implementation of deep belief networks built on binary restricted Boltzmann machines (RBMs)
DeepLearnToolbox, 2014 | Matlab | A Matlab/Octave toolbox for deep learning; includes deep belief nets, stacked autoencoders, and convolutional neural nets
DeepMat, 2014 | Matlab | Matlab code for restricted/deep Boltzmann machines and autoencoders
Deeplearning4j | Java | The first commercial-grade, open-source, distributed deep-learning library written for Java and Scala; designed to be used in business environments rather than as a research tool
Encog | Java | An advanced machine learning framework supporting support vector machines, artificial neural networks, genetic programming, Bayesian networks, hidden Markov models, and genetic algorithms

http://www.teglor.com/b/deep-learning-libraries-language-cm569/

Hyunjung (Helen) Shin 84


Convolutional Neural Network
(CNN)

Hyunjung (Helen) Shin 85


Fully connected neural network

Example
  • 1000×1000 image
  • 1M hidden units
  → 10¹² (= 10⁶ × 10⁶) parameters!

Observation
  • Spatial correlation is local

Hyunjung (Helen) Shin 86


Locally connected neural net

Example
  • 1000×1000 image
  • 1M hidden units
  • Filter size: 10×10
  → 10⁸ (= 10⁶ × 10 × 10) parameters!

Observation
  • Statistics are similar at different locations

Hyunjung (Helen) Shin 87


Convolution network

Share the same parameters across different locations
  • Convolution with learned kernels

Learn multiple filters
  • 1000×1000 image
  • 100 filters
  • Filter size: 10×10
  → 10,000 (= 100 × 10 × 10) parameters
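As a quick arithmetic check on the three counts above (fully connected, locally connected, and shared convolutional filters), a small Python snippet:

```python
# Parameter counts for a 1000x1000 image, following the three slides above
pixels = 1000 * 1000          # 10**6 inputs
hidden = 10**6                # 1M hidden units
filter_size = 10 * 10         # 10x10 local receptive field
n_filters = 100

fully_connected = pixels * hidden            # every hidden unit sees every pixel
locally_connected = hidden * filter_size     # every hidden unit sees a 10x10 patch
shared_filters = n_filters * filter_size     # one 10x10 kernel per filter, shared everywhere

print(f"{fully_connected:.0e}")   # 1e+12
print(f"{locally_connected:.0e}") # 1e+08
print(shared_filters)             # 10000
```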

Hyunjung (Helen) Shin 88


Convolution neural networks

We can design neural networks that are specifically adapted for these
problems
▪ Must deal with very high-dimensional inputs
• 1000x1000 pixels
▪ Can exploit the 2D topology of pixels
▪ Can build in invariance to certain variations we can expect
• Translations, etc

Ideas
▪ Local connectivity
▪ Parameter sharing

Hyunjung (Helen) Shin 89


Convolution

from: https://developer.apple.com/library/ios/documentation/Performance/Conceptual/vImage/ConvolutionOperations/ConvolutionOperations.html
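A naive "valid" 2D convolution in NumPy — a sketch for clarity, not an efficient implementation. (Strictly speaking it computes cross-correlation, which is what most CNN libraries implement under the name convolution.)

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image and take a weighted sum at each position
    (cross-correlation, as used in most CNN libraries)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)   # simple vertical-edge detector
print(conv2d_valid(image, edge_kernel))          # shape (3, 3), all entries -6
```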

Hyunjung (Helen) Shin 90


Convolution

Hyunjung (Helen) Shin 91


Convolution

Hyunjung (Helen) Shin 92


Convolution

Hyunjung (Helen) Shin 93


Pooling

Max vs Average pooling

max pooling

average pooling
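A minimal sketch of 2×2 max and average pooling with stride 2, just to make the difference concrete (toy input, no padding; not code from the slides).

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping size x size pooling (stride = size); x must divide evenly."""
    H, W = x.shape
    blocks = x.reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 5.],
              [0., 1., 7., 2.],
              [3., 2., 4., 6.]])
print(pool2d(x, mode="max"))   # [[4. 5.] [3. 7.]]
print(pool2d(x, mode="avg"))   # [[2.5 2.  ] [1.5 4.75]]
```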

Hyunjung (Helen) Shin 94


Pooling

Max vs Average pooling Drawbacks

Hyunjung (Helen) Shin 95


LeNet

Yann LeCun and his collaborators developed a recognizer for handwritten


digits by using back-propagation in a feed-forward net

Hyunjung (Helen) Shin 96


LeNet

#(Parameters) = 3,274,634

Layer   | C1  | C2     | FC1       | FC2
Weights | 800 | 51,200 | 3,211,264 | 10,240
Biases  | 32  | 64     | 1,024     | 10
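A short check that the table adds up to 3,274,634. The kernel and feature-map sizes below (5×5 kernels, 32 and 64 feature maps, a 7×7×64 input to FC1, 1024 FC1 units) are inferred from the weight counts in the table, so treat them as assumptions.

```python
# Inferred architecture: C1 = 32 filters of 5x5x1, C2 = 64 filters of 5x5x32,
# FC1 = 1024 units on a 7x7x64 input, FC2 = 10 output units.
c1_w,  c1_b  = 32 * 5 * 5 * 1,    32       # 800, 32
c2_w,  c2_b  = 64 * 5 * 5 * 32,   64       # 51,200, 64
fc1_w, fc1_b = 7 * 7 * 64 * 1024, 1024     # 3,211,264, 1,024
fc2_w, fc2_b = 1024 * 10,         10       # 10,240, 10

total = c1_w + c1_b + c2_w + c2_b + fc1_w + fc1_b + fc2_w + fc2_b
print(total)   # 3274634
```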

Hyunjung (Helen) Shin 97


MNIST Dataset

▪ handwritten digits
▪ a training set of 60,000 examples
▪ 28×28 images

Hyunjung (Helen) Shin 98


The 82 errors by LeNet5

Notice that most of


the errors are cases
that people find quite
easy

The human error rate


is probably 20 to 30
errors but nobody has
had the patience to
measure it

Hyunjung (Helen) Shin 99


Feature map results

Hyunjung (Helen) Shin 100


Learned Filters

Trained 32 filters on C1 layer

Hyunjung (Helen) Shin 101


Learned Filters

[Figure: for each of eight learned filters, the filtered result and the corresponding ReLU output]

Hyunjung (Helen) Shin 102


Recurrent Neural Network
(RNN)

Hyunjung (Helen) Shin 103


RNN

one to one one to many many to one many to many many to many

Vanilla Neural Networks

Hyunjung (Helen) Shin 104


RNN offers a lot of flexibility

e.g. Image Captioning


Image →sequence of words

Hyunjung (Helen) Shin 105


RNN offers a lot of flexibility

e.g. Sentiment Classification


sequence of words →sentiment

Hyunjung (Helen) Shin 106


RNN offers a lot of flexibility

e.g. Machine Translation


sequence of words → sequence of words

Hyunjung (Helen) Shin 107


RNN offers a lot of flexibility

e.g. Video Classification On Frame Level

Hyunjung (Helen) Shin 108


Recurrent Neural Network
We can process a sequence of vectors x by applying a recurrence formula at every time step:

  h_t = f_W(h_{t−1}, x_t)

where h_t is the new state, h_{t−1} is the old state, x_t is the input vector at time step t, and f_W is some function with parameters W. At some time steps we also predict an output vector from the state.

Notice: the same function and the same set of parameters are used at every step.

Hyunjung (Helen) Shin 109


(Vanilla) Recurrent Neural Network

Hyunjung (Helen) Shin 110


Character-level language model
Vocabulary: [h, e, l, o]
Training sequence: "hello"

We want the green numbers (the scores of the correct next characters) to be high and the red numbers to be low.

  W_hh ∈ ℝ^{3×3},  W_xh ∈ ℝ^{3×4}
  h_t = tanh( W_hh h_{t−1} + W_xh x_t + b_h )
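A minimal forward pass of the character-level model above on "hello" with vocabulary [h, e, l, o], using the recurrence h_t = tanh(W_hh h_{t−1} + W_xh x_t + b_h) and a softmax output layer. The random weights and the added output matrix W_hy are illustrative assumptions (only W_hh and W_xh appear on the slide).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ['h', 'e', 'l', 'o']
char_to_ix = {c: i for i, c in enumerate(vocab)}

hidden_size, vocab_size = 3, 4
W_hh = 0.1 * rng.normal(size=(hidden_size, hidden_size))   # 3x3, as on the slide
W_xh = 0.1 * rng.normal(size=(hidden_size, vocab_size))    # 3x4, as on the slide
W_hy = 0.1 * rng.normal(size=(vocab_size, hidden_size))    # assumed output layer
b_h = np.zeros(hidden_size)

def one_hot(i):
    v = np.zeros(vocab_size); v[i] = 1.0
    return v

h = np.zeros(hidden_size)
loss = 0.0
text = "hello"
for t in range(len(text) - 1):               # inputs h,e,l,l -> targets e,l,l,o
    x = one_hot(char_to_ix[text[t]])
    h = np.tanh(W_hh @ h + W_xh @ x + b_h)   # the recurrence from the slide
    scores = W_hy @ h
    probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary
    target = char_to_ix[text[t + 1]]
    loss += -np.log(probs[target])           # cross-entropy: we want the target prob high
print("total loss over the sequence:", loss)
```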

Hyunjung (Helen) Shin 111


Example

Hyunjung (Helen) Shin 112


RNN: Probabilistic Perspectives
Given example sequences {X_i}, let us find a probabilistic model p(X) that maximizes ∏_i p(X_i)

𝑋𝑖 is a sequence 𝑋𝑖 = (𝑥1, 𝑥2, ⋯ , 𝑥𝑇)


For example,
• 𝑥1 = ℎ, 𝑥2 =e, 𝑥3 = 𝑙, 𝑥4 = 𝑙, 𝑥5 = 𝑜
• 𝑥1 = "𝐼", 𝑥2 = "𝑎𝑚", 𝑥3 = "𝑎", 𝑥4 = "𝑠𝑡𝑢𝑑𝑒𝑛𝑡"

For the model learning, we have to specify


• Model selection
• Learning criterion
• Parameter learning method

Hyunjung (Helen) Shin 113


RNN: Probabilistic Perspectives
Model selection
  By the chain rule, we can decompose p(X) = p(x_1, x_2, …, x_T) into

    p(X) = ∏_t p(x_t | x_1, x_2, …, x_{t−1}),

  and an RNN can handle this situation well: the hidden states summarize the prefixes x_1, x_2, ⋯, x_{T−1}, and the outputs g_Θ(h_1), g_Θ(h_2), …, g_Θ(h_{T−1}) model the conditional distributions.

Hyunjung (Helen) Shin 114


RNN: Probabilistic Perspectives
Model selection
  • g_Θ(h_{t−1}) is a parametric model of p(x_t | x_1, x_2, ⋯, x_{t−1})
  • If we have an N-word dictionary, we can represent p(x_t | x_1, x_2, ⋯, x_{t−1}) with an N-dimensional vector (non-negative and summing to one):

    g_Θ(h_{t−1}) = [ p(x_t = 1st word in the dictionary | x_1, …, x_{t−1}),
                     p(x_t = 2nd word in the dictionary | x_1, …, x_{t−1}),
                     p(x_t = 3rd word in the dictionary | x_1, …, x_{t−1}),
                     ⋮
                     p(x_t = Nth word in the dictionary | x_1, …, x_{t−1}) ]

Hyunjung (Helen) Shin 115


RNN: Learning Criterion
Learning criterion: log-likelihood / cross-entropy

  log p(x_1, x_2, ⋯, x_T) = Σ_t log p(x_t | x_1, x_2, …, x_{t−1})

Each term log p(x_t | x_1, …, x_{t−1}) is (minus) the cross-entropy between the model output g_Θ(h_{t−1}) and the one-hot vector (0, …, 0, 1, 0, …, 0)ᵀ whose element corresponding to x_t is 1.

Hyunjung (Helen) Shin 116


Summary

The above RNN model learns the (conditional) probability model of a sequence, g_Θ(h_{t−1}) ≃ p(x_t | x_1, x_2, ⋯, x_{t−1}), by maximizing the log-likelihood of the training sequences.

  ▪ RNNs allow a lot of flexibility in architecture design
  ▪ Vanilla RNNs are simple but don't work very well
  ▪ Backward flow of gradients in an RNN can explode or vanish

Hyunjung (Helen) Shin 117


Long-Short Term Memory
(LSTM)

Hyunjung (Helen) Shin 118


Repeating module in RNN

Hyunjung (Helen) Shin 119


RNN: Problems of Long-term Dependencies
One of the appeals of RNNs is the idea that they might be able to
connect previous information to the present task

Example: the prediction of the next word based on the previous ones
“the clouds are in the sky,”

Hyunjung (Helen) Shin 120


RNN: Problems of Long-term Dependencies
There can be a large gap between the relevant information and the place where it is needed:
  "I grew up in France… I speak fluent French."

Hyunjung (Helen) Shin 121


LSTM: Long Short Term Memory
The most popular recurrent node type is the Long Short-Term Memory (LSTM).

LSTM also includes gates, which can turn the history on or off, and a few additional inputs.

Hyunjung (Helen) Shin 122


Repeating module in LSTM

Hyunjung (Helen) Shin 123


LSTM: Cell State
Cell state
• It runs straight down the entire chain, with linear interactions
• LSTM has the ability to remove or add information to the cell state,
regulated by gates

Hyunjung (Helen) Shin 124


LSTM: Gates
Gates are a way to optionally let information through
• A sigmoid neural net layer and a pointwise multiplication operation
• The sigmoid layer outputs numbers between zero and one, describing how
much of each component should be let through
A value of zero means “let nothing through,”
A value of one means “let everything through”

Hyunjung (Helen) Shin 125


LSTM: Forget Gate Layer
Forget gate
  • decides what information we're going to throw away from the cell state
  • this decision is made by a sigmoid layer called the "forget gate layer"

  f_t = 1: completely keep this information
  f_t = 0: completely get rid of this information

Language model example
  • When we see a new subject, we want to forget the gender information of the old subject stored in C_{t−1}

Hyunjung (Helen) Shin 126


LSTM: Input Gate Layer
Input gate
• to decide what new information we’re going to store in the cell state
• a sigmoid layer called the “input gate layer” decides which values we’ll
update

Hyunjung (Helen) Shin 127


LSTM: Cell State Update
We multiply the old state by f_t, forgetting the things we decided to forget earlier. Then we add i_t ∗ C̃_t, the new candidate values scaled by how much we decided to update each state value.
In the case of the language model, this is where we'd actually drop the information about the old subject's gender and add the new information, as we decided in the previous steps.

Hyunjung (Helen) Shin 128


LSTM: Output
We run a sigmoid layer which decides what parts of the cell state we’re going to
output

Then, we put the cell state through tanh (to push the values to be between −1 and
1) and multiply it by the output of the sigmoid gate, so that we only output the
parts we decided to
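A minimal NumPy sketch of one LSTM step using the textbook equations the preceding slides walk through (forget gate, input gate, candidate cell state, cell update, output gate); the layer sizes and random weights are placeholders, not values from the course.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
# One weight matrix and bias per gate, acting on the concatenation [h_{t-1}, x_t]
Wf, Wi, Wc, Wo = (0.1 * rng.normal(size=(n_hid, n_hid + n_in)) for _ in range(4))
bf, bi, bc, bo = (np.zeros(n_hid) for _ in range(4))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(Wf @ z + bf)              # forget gate: what to throw away from C_{t-1}
    i_t = sigmoid(Wi @ z + bi)              # input gate: which values to update
    C_tilde = np.tanh(Wc @ z + bc)          # candidate values to add
    C_t = f_t * C_prev + i_t * C_tilde      # cell state update
    o_t = sigmoid(Wo @ z + bo)              # output gate: which parts of the state to emit
    h_t = o_t * np.tanh(C_t)
    return h_t, C_t

h, C = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):        # a short random input sequence
    h, C = lstm_step(x, h, C)
print("final hidden state:", h)
```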

Hyunjung (Helen) Shin 129


Summary

▪ RNNs allow a lot of flexibility in architecture design

▪ Common to use LSTM: their additive interactions improve


gradient flow

▪ Backward flow of gradients in RNN can explode or vanish

▪ Exploding is controlled with gradient clipping. Vanishing is


controlled with additive interactions (LSTM)

Hyunjung (Helen) Shin 130
