
Visual Information Interpretation: Basics on deep learning I

Ji Hui
National University of Singapore
October 25, 2021

Neural network based learning
Inspiration from brains composed of neurons at a low level
A neuron receives input from other neurons (generally thousands) through its synapses
Inputs are approximately summed
When the input exceeds a threshold, the neuron sends an electrical spike that travels from the body, down the axon, to the next neuron(s)

Feed-forward neural network

[Figure: a feed-forward network for MNIST images]
Models are called feed-forward because information flows one way, from input through the computations to the outputs
There is no feedback connection
dsdatascience.com/probability-and-statistics-explained-in-the-context-of-deep-learning-ed1509b2eb3f
Mathematical model of feedforward NN

An FNN is associated with a directed acyclic graph describing how functions are composed
For example, functions $f^{(1)}, f^{(2)}, f^{(3)}$ connected in a chain form the function
$$f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$$

$f^{(1)}$ is called the first layer of the NN, the input layer
$f^{(2)}$ is called the second layer
$f^{(3)}$ is called the output layer
These chain structures are the most commonly used structures of neural networks
The length of the chain is the depth of the model, e.g. depth = 3 in the example above
Deep learning means the depth of the NN is very large
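As a minimal illustration of this chained structure, the sketch below composes three toy layer functions in NumPy; the shapes, the ReLU activations, and the random weights are placeholder assumptions of mine, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three toy layers: each is an affine map followed by ReLU
# (identity on the output layer). All shapes are arbitrary placeholders.
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(6, 8)), np.zeros(6)
W3, b3 = rng.normal(size=(3, 6)), np.zeros(3)

def f1(x):
    return np.maximum(0.0, W1 @ x + b1)   # first layer

def f2(h):
    return np.maximum(0.0, W2 @ h + b2)   # second layer

def f3(h):
    return W3 @ h + b3                    # output layer

x = rng.normal(size=4)
y = f3(f2(f1(x)))    # f(x) = f^(3)(f^(2)(f^(1)(x))), depth = 3
print(y.shape)       # (3,)
```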

Deep neural network
A deep network is a multi-layer NN with MANY layers

A network with depth 2: one hidden layer

$$f(x) = f^{(2)}(f^{(1)}(x))$$

$f^{(1)}$ outputs an M-dim vector
$f^{(2)}$ outputs a K-dim vector
The hidden layer is typically vector-valued
The dimensionality of the hidden layer is the width of the model
Each element of the vector is viewed as a unit (neuron)
Each unit receives inputs from many other units and computes its own activation value
A brief history of neural network
First appearance

Results:
One-layer perceptron (depth=2) for binary classification
One-layer perceptron (depth=2) cannot solve XOR (i.e., exclusive or)
Multi-layer perceptrons can solve XOR
Challenges:
How to train a multi-layer perceptron?

A brief history of neural network
The thaw of NN learning

New results on training algorithms and architectures
Backpropagation for training multi-layer NNs
Recurrent long short-term memory networks
OCR with convolutional neural networks
Challenges
Kernel machines (SVM etc.) achieved similar accuracy with less complexity and a more solid theoretical foundation
A brief history of neural network
The problems with multi-layer NNs and their training

Lack of processing power
Lack of sufficiently large labeled data
Overfitting
Empirical performance did not improve with more layers
The inevitable question:
Is one or two layers the best a NN can do?

A brief history of neural network
Deep learning renaissance

Big labeled data in certain domains, e.g. computer vision
Advances in regularization and normalization of NN training
Better hardware for training NNs, e.g. GPUs
A representative deep network for image classification:
200 layers
650K neurons and 630M connections

Key factors in the popularity of deep learning

Availability of labeled big data in many application sectors
"ImageNet" in computer vision, with 15M labeled images classified into 100,000 classes

Key factors in the popularity of deep learning

Algorithms for regularizing the problem in NN training
Back-propagation, stochastic gradient descent, drop-out, batch-normalization, etc.

The advance of GPU computing for neural network training

Learning in NN
Definition
Let $f(\cdot)$ denote the unknown true mapping s.t. $y = f(x)$
Let $h(\cdot, w)$ denote the function representing an NN with weights $w$
Learning a NN
Use the training data $\{x^{(k)}, y^{(k)} = f(x^{(k)})\}$ to adjust the parameters $w$ so that $h(\cdot, w)$ approximates $f$

Perceptron + Threshold Logic Unit (TLU)
A simple NN with a perceptron and one TLU, where $\sigma(a) = 1$ if $a > 0$ and $0$ otherwise

Model it by a mapping $h$ taking input $x$ and outputting $y$:
$$y = h(x, w) = \sigma\Big(\sum_k x_k w_k + w_0\Big)$$
where $w$ are the weights of the NN and $w_0$ is the bias.
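A minimal NumPy sketch of this unit; the example weights below are made-up placeholders, not values from the slides.

```python
import numpy as np

def tlu(a):
    """Threshold logic unit: sigma(a) = 1 if a > 0, else 0."""
    return 1.0 if a > 0 else 0.0

def perceptron(x, w, w0):
    """y = sigma(sum_k x_k * w_k + w0)."""
    return tlu(np.dot(w, x) + w0)

# Toy weights: the unit fires when x1 + x2 exceeds 1.
w, w0 = np.array([1.0, 1.0]), -1.0
print(perceptron(np.array([0.3, 0.4]), w, w0))  # 0.0
print(perceptron(np.array([0.8, 0.9]), w, w0))  # 1.0
```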


Mapping for classification
The transform in $\mathbb{R}^2$ represented by a perceptron + one TLU:
$$h([x_1, x_2]) = \begin{cases} 1, & \text{if } w_1 x_1 + w_2 x_2 + w_0 \ge 0; \\ 0, & \text{if } w_1 x_1 + w_2 x_2 + w_0 < 0. \end{cases}$$
It is a binary linear classifier

[Figure: decision boundary]

Other activation functions
Sigmoid activation function
$$\sigma(a) = \mathrm{sigm}(a) = \frac{1}{1 + \exp(-a)}$$
Rectified linear (ReLU) activation function
$$\sigma(a) = \mathrm{relu}(a) = \max(0, a)$$

[Plots: sigmoid and ReLU]
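Both activations are one-liners in NumPy; a small sketch for reference (the test values are my own):

```python
import numpy as np

def sigmoid(a):
    # 1 / (1 + exp(-a)), elementwise
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    # max(0, a), elementwise
    return np.maximum(0.0, a)

a = np.array([-2.0, 0.0, 2.0])
print(sigmoid(a))   # approx [0.119, 0.5, 0.881]
print(relu(a))      # [0. 0. 2.]
```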

Simple NN with one hidden layer

The mapping of the NN with multiple nodes:
$$y = H(x; w) = \sum_i v_i\, h(x; w_i)$$
The more nodes in the hidden layer, the more complex the transform the NN can model
Binary decision boundary of simple NN
NNs with different numbers of nodes in the hidden layer; the activation is ReLU

Multi-layer neural network

Multiple hidden layers allow composition of multiple functions
[Figure: network with input layer $I_1, \ldots, I_n$, two hidden layers $H_{1,1}, \ldots, H_{1,n}$ and $H_{2,1}, \ldots, H_{2,n}$, and output layer $O_0, \ldots, O_9$]

The transform $H$ of a multi-layer NN:
$$H(\cdot, w) = h(\cdot, w_1) \circ h(\cdot, w_2) \circ h(\cdot, w_3) \circ \cdots = h(h(h(\cdots; w_{n-2}); w_{n-1}); w_n)$$

Greater complexity of a multi-layer NN is achieved by using more layers.

Binary decision boundary of multi-layer NN

NNs with different numbers of layers
The activation function is ReLU

Illustration of deep learning for classification

Definition (classifier)
A classifier $f$ is a mapping that accepts a set of features and produces a class label for them.

Training a classifier $f(\cdot, w)$ with a training dataset $\{x_i, y_i\}$:
$$f(x \in \mathbb{R}^3; w) \to y \in \{-1, 1\}$$

Table: A dataset with three attributes

Data: x (fields)    Class: y (label)
1.4  2.7  1.9       -1
3.8  3.4  3.2       -1
6.4  2.8  1.7        1
4.1  0.1  0.2        1
etc.                ...

Demo of training in NN
Iteratively refine the parameters $w$ of the NN to minimize the classification error
$$\sum_i \#\big(f(x_i, w) \ne y_i\big)$$

Initialize with random parameters: e.g., the input $[1.4, 2.7, 1.9]$ with true label 0 produces the output 0.7, so the error is 0.7
Adjust the parameters based on the classification errors: the same input now produces an output closer to 0, with a smaller error
Keep going to lower the classification error


Learning representation of shallow NN

Consider a shallow NN with 3 layers (input, hidden, output)

A summation based representation
$$H(x) = \sum_{i=1}^{N} v_i\, h(w_i^\top x + \theta_i)$$
where $h$ is the function of the nodes in the hidden layer.


A well-established strategy to construct representations (regression): Taylor expansion by polynomials, Fourier series by trigonometric polynomials
Universal approximation theorem [Cybenko 1989]. Suppose $h(\cdot)$ is a non-constant, bounded and monotonically increasing continuous function. Then for any continuous $f$ on $[0,1]^n$ and any $\epsilon > 0$, there exist $N$ and $\{v_i, w_i, \theta_i\}$ such that
$$\max_{x \in [0,1]^n} |H(x) - f(x)| \le \epsilon.$$
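The theorem is existential, but a quick numerical sketch gives the flavour: fix random inner parameters $w_i, \theta_i$ and fit only the outer weights $v_i$ by least squares. This illustration is my own (the target function, grid, and value of N are arbitrary choices), not a construction from the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
N = 50                                  # number of hidden units
x = np.linspace(0.0, 1.0, 200)          # grid on [0, 1]
f = np.sin(2 * np.pi * x)               # target function (placeholder choice)

# Random inner weights and biases; only the outer weights v are fitted.
w = rng.normal(scale=10.0, size=N)
theta = rng.normal(scale=5.0, size=N)
Phi = sigmoid(np.outer(x, w) + theta)   # Phi[j, i] = sigmoid(w_i * x_j + theta_i)

v, *_ = np.linalg.lstsq(Phi, f, rcond=None)
H = Phi @ v                             # H(x) = sum_i v_i * sigmoid(w_i x + theta_i)
print("max |H(x) - f(x)| on the grid:", np.max(np.abs(H - f)))
```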

Difference between deep NN and shallow NN
A shallow NN with one hidden layer:
$$H(\cdot) = v_1 h(\cdot, w_1) + v_2 h(\cdot, w_2) + v_3 h(\cdot, w_3) + \cdots$$
A summation-based approximation scheme
A deep NN with many hidden layers:
$$H(x) = h(h(h(\cdots; w_{n-2}); w_{n-1}); w_n)$$
A variable-substitution-based approximation scheme
The question: what is variable substitution doing?
It rapidly changes the level sets of the function:
$$\Omega_{h(\cdot)=0} = \{x \in \mathbb{R}^N : h(x) = 0\}$$
$$\Omega_{h(h(\cdot))=0} = \{x \in \mathbb{R}^N : h(x) \in \Omega_{h(\cdot)=0}\}$$

Example of variable substitution
Consider a data set $\{(x_i, y_i)\}$ where $y_i \in \{0, 1\}$. A binary classifier can be defined by thresholding a function $f$:
$$\hat{y}(x) = \begin{cases} 1 & f(x) > 1/2 \\ 0 & f(x) < 1/2 \end{cases}$$
Consider a function $h$ defined from a ReLU network:
$$h(x) = \begin{cases} 2x & \text{if } 0 \le x \le 1/2 \\ 2 - 2x & \text{if } 1/2 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$
Consecutive variable substitution by a multi-layer network:
the depth of a NN increases the number of oscillations multiplicatively
the width of a NN can only do so additively
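A small numerical check of the multiplicative claim (my own sketch, not from the slides): compose the tent function $h$ with itself and count the turning points on $[0, 1]$; each extra composition roughly doubles them.

```python
import numpy as np

def h(x):
    """Tent function from a small ReLU network: 2x on [0,1/2], 2-2x on [1/2,1], 0 elsewhere."""
    return np.where((x >= 0) & (x <= 0.5), 2 * x,
           np.where((x > 0.5) & (x <= 1), 2 - 2 * x, 0.0))

def compose(f, depth):
    """Apply f to its own output `depth` times (consecutive variable substitution)."""
    def g(x):
        for _ in range(depth):
            x = f(x)
        return x
    return g

x = np.linspace(0.0, 1.0, 10001)
for d in (1, 2, 3, 4):
    y = compose(h, d)(x)
    turns = np.sum(np.diff(np.sign(np.diff(y))) != 0)   # count slope sign changes
    print(f"depth {d}: about {turns} turning points")
```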

Deep NN vs wide NN

Expression capability
Both NNs can do the job
A wide NN is preferred, as it is easier to optimize owing to the separability of the node parameters
Expression efficiency
For functions with simple level sets, a deep NN has no advantage
For functions with complex level sets, a deep NN is preferred owing to its efficiency at rapidly changing level sets
Conclusion: a deep NN is for applications where the target transform satisfies:
the accuracy of its level sets is very important
the structure of its level sets is complex
Sample applications
Multi-label functions in classification
Value evaluation functions in AI for games
Pattern analysis in natural language processing

Outline of NN training

Prepare training data:
$$\{x_n, y_n\}_{n=1}^N$$

Choose each of these:
Decision function (transform): $\hat{y} = f(x_n; w)$
Loss function: $L(\hat{y}, y) \in \mathbb{R}$

Define the goal:
$$w^* = \mathrm{argmin}_w \sum_{n=1}^N L(f(x_n; w), y_n)$$

Train with stochastic gradient descent:
$$w_{t+1} = w_t - \eta_t \nabla_w L(f(x_n; w), y_n)$$
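A bare-bones SGD loop under these choices, as a sketch only: the data, the linear decision function, the squared loss, and the step size below are all assumptions of mine for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up regression data: y = <w_true, x> plus noise.
w_true = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(200, 3))
y = X @ w_true + 0.1 * rng.normal(size=200)

def loss_grad(w, x_n, y_n):
    """Gradient of the squared loss 0.5 * (f(x; w) - y)^2 for f(x; w) = <w, x>."""
    return (x_n @ w - y_n) * x_n

w = np.zeros(3)
eta = 0.05                                        # step size (arbitrary)
for epoch in range(20):
    for n in rng.permutation(len(y)):
        w = w - eta * loss_grad(w, X[n], y[n])    # w_{t+1} = w_t - eta * grad_w L
print(w)   # should be close to w_true
```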

Cost function
Consider a network denoted by $f(x; w)$
Gaussian error model: assume the prediction error follows a normal distribution:
$$p_{\mathrm{model}}(y|x) = \mathcal{N}(y \mid f(x; w), \sigma^2 I)$$
The likelihood function is then
$$J(w) = \prod_{n=1}^N (2\pi\sigma^2)^{-\frac{m}{2}} \exp\Big(-\frac{1}{2\sigma^2}\|y_n - f(x_n; w)\|_2^2\Big)$$
By MLE (maximum likelihood estimation), the cost function is
$$\max_w \log J(w) = \max_w \; -\,\mathbb{E}_{x,y \sim p_{\mathrm{data}}} \|y - f(x; w)\|_2^2 + \mathrm{const}$$
In other words, we are minimizing the function
$$\min_w \mathbb{E}_{x,y \sim p_{\mathrm{data}}} \|y - f(x; w)\|_2^2 \;\approx\; \sum_{n=1}^N \frac{1}{2}\|y_n - f(x_n; w)\|_2^2$$
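To make the MLE step explicit, here is the intermediate algebra (added for readability; it is standard but not spelled out on the slide):
$$\log J(w) = \sum_{n=1}^{N}\Big(-\tfrac{m}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^2}\|y_n - f(x_n; w)\|_2^2\Big) = -\tfrac{1}{2\sigma^2}\sum_{n=1}^{N}\|y_n - f(x_n; w)\|_2^2 + \mathrm{const},$$
so maximizing $\log J(w)$ over $w$ is equivalent to minimizing $\sum_{n=1}^{N}\tfrac{1}{2}\|y_n - f(x_n; w)\|_2^2$.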

Cross-entropy based cost function
Consider two distributions $p(x)$ and $q(x)$; their cross-entropy is defined as
$$H(p, q) = -\sum_x p(x) \ln q(x)$$

Consider a two-class problem $y \in \{0, 1\}$ (logistic regression); the MLE minimizes the negative log-likelihood
$$J(\theta) = -\sum_{k=1}^N \big(y_k^* \ln y_k + (1 - y_k^*) \ln(1 - y_k)\big),$$
where $y_k^*$ denotes the true label of $x_k$ and $y_k = \sigma(\omega^\top x_k)$ denotes the prediction.

Furthermore, there is an interesting connection of cross-entropy to K-L divergence:
$$H(p, q) = -\sum_x p(x) \ln q(x) = \sum_x p(x) \log \frac{p(x)}{q(x)} + H(p, p)$$
Minimizing $H(p, q)$ over $q$ is the same as minimizing $D_{\mathrm{KL}}(p \| q)$.
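A small NumPy sketch of this binary cross-entropy loss; the clipping constant is a safeguard against $\log(0)$ that I added, not something on the slide.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Negative log-likelihood for binary labels: -sum(y* ln y + (1-y*) ln(1-y))."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])
print(binary_cross_entropy(y_true, y_pred))   # about 0.84
```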


Objective function
Quadratic loss: $J = \frac{1}{2}(y - y^*)^2$
Cross entropy: $J = -\big(y^* \log(y) + (1 - y^*) \log(1 - y)\big)$
"Softmax": $y_k = \exp(b_k) \big/ \sum_\ell \exp(b_\ell)$
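A sketch of the softmax in NumPy; subtracting the max before exponentiating is a standard numerical-stability trick I added, not something stated on the slide.

```python
import numpy as np

def softmax(b):
    """y_k = exp(b_k) / sum_l exp(b_l), computed in a numerically stable way."""
    b = b - np.max(b)      # shift for stability; does not change the result
    e = np.exp(b)
    return e / np.sum(e)

print(softmax(np.array([1.0, 2.0, 3.0])))   # approx [0.09, 0.24, 0.67]
```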

Standard ML Training vs NN Training
The nonlinearity of a neural network causes the loss functions of interest to be non-convex

Difference between standard ML learning (e.g. logistic regression or SVM) and NN training:
Standard ML learning finds a global minimizer, e.g., exact linear equation solvers for linear regression, or convex optimization algorithms for logistic regression or SVMs
NN training: uses iterative gradient-based optimizers that merely lower the value of the cost function

Learning vs Pure optimization

Optimization algorithms for deep learning differ from traditional optimization in several ways
Pure optimization: minimizing J is a goal in itself
Machine learning cares more about the performance on test data than the performance on training data
We reduce a different function and hope that it will lead to the reduction of the true cost (not available)
Optimization algorithms for training deep models exploit the special structure of the cost function:
the cost function can be expressed as the average over a training set
$$J(\theta) = \mathbb{E}_{(x,y) \sim p_{\mathrm{data}}} L(f(x; \theta), y) \approx \frac{1}{N}\sum_i L(f(x_i; \theta), y_i)$$
the cost function is a summation of multiple terms with the same functional form

Training with gradient descent

Solve the optimization problem
$$w^* = \mathrm{argmin}_w \sum_{i=1}^N L(f(x_i; w), y_i)$$
with gradient descent:
$$w_{t+1} = w_t - \eta_t \nabla_w L(f(x_i; w), y_i)$$

Advantages: simple and easy
Disadvantages: first-order, slow convergence, prone to local minimizers

Is differentiability necessary?

Some hidden units are not differentiable at all input points
e.g. ReLU: $f(z) = \max\{0, z\}$
This seems to invalidate their use in gradient-based learning
In practice, gradient descent still performs well enough for these models to be used in ML
Neural network training does not usually arrive at a local minimum of the cost function
It only reduces the cost significantly
Hidden units that are not differentiable are usually non-differentiable at only a small number of points
Left and right differentiability
Differentiability at a point means the left derivative equals the right derivative
Practical implementations return one of the one-sided derivatives when the two one-sided derivatives differ
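A sketch of that convention for ReLU; choosing the left derivative (0) at $z = 0$ below is one common implementation choice, assumed here rather than taken from the slides.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Left derivative at 0 is 0, right derivative is 1; return the left one by convention.
    return 1.0 if z > 0 else 0.0

print(relu_grad(-1.0), relu_grad(0.0), relu_grad(2.0))   # 0.0 0.0 1.0
```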

How to calculate the gradient: backpropagation via the chain rule
Chain rule: given $y = g(u)$ and $u = h(x)$,
$$\frac{dy_i}{dx_k} = \sum_{j=1}^{J} \frac{dy_i}{du_j}\,\frac{du_j}{dx_k}, \quad \forall\, i, k.$$

Backpropagation: repeated application of the chain rule
Instantiate the computation as a directed acyclic graph, where each intermediate quantity is a node
At each node, store (a) the quantity computed in the forward pass and (b) the partial derivative of the output w.r.t. that intermediate quantity
Initialize all partial derivatives to 0
Visit each node in reverse topological order; at each node, add its contribution to the partial derivatives of its parents
Demonstration on function
Consider $J = \cos(\sin(x^2) + 3x^2)$
Compute the value on the forward pass and the derivative $\frac{dJ}{dx}$ on the backward pass
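A hand-written forward and backward pass for this function (my own sketch of the computation, with a finite-difference check; the test point is arbitrary):

```python
import math

def forward_backward(x):
    """Forward pass for J = cos(sin(x^2) + 3x^2) and backward pass for dJ/dx."""
    # Forward pass: store each intermediate quantity.
    a = x * x                 # a = x^2
    s = math.sin(a)           # s = sin(a)
    u = s + 3.0 * a           # u = sin(x^2) + 3x^2
    J = math.cos(u)           # J = cos(u)

    # Backward pass: visit nodes in reverse order, accumulating dJ/d(node).
    dJ_du = -math.sin(u)                        # d cos(u) / du
    dJ_ds = dJ_du * 1.0                         # u = s + 3a
    dJ_da = dJ_du * 3.0 + dJ_ds * math.cos(a)   # two paths into a: direct and via s
    dJ_dx = dJ_da * 2.0 * x                     # a = x^2
    return J, dJ_dx

x = 0.7
J, g = forward_backward(x)
# Finite-difference check of the gradient.
eps = 1e-6
num = (forward_backward(x + eps)[0] - forward_backward(x - eps)[0]) / (2 * eps)
print(J, g, num)   # the analytic and numerical gradients should agree closely
```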

Demonstration on NN
Consider the following network

Illustration of backpropagation on NN

