
Visual Information Interpretation: Basics on deep learning I

Ji Hui
National University of Singapore
October 25, 2021

Neural network based learning
Inspiration from brains composed of neurons at a low level
A neuron receives input from other neurons (generally thousands) through its synapses
Inputs are approximately summed
When the input exceeds a threshold, the neuron sends an electrical spike that travels from the body, down the axon, to the next neuron(s)

Feed-forward neural network

[Figure: a feed-forward network for MNIST images]
Models are called feed-forward because information flows one way, from input through the computations to the outputs
There is no feedback connection
dsdatascience.com/probability-and-statistics-explained-in-the-context-of-deep-learning-ed1509b2eb3f
Mathematical model of feedforward NN

An FNN is associated with a directed acyclic graph describing how functions are composed
For example, functions $f^{(1)}, f^{(2)}, f^{(3)}$ connected in a chain form the function
$$f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$$

$f^{(1)}$ is called the first layer of the NN, the input layer
$f^{(2)}$ is called the second layer
$f^{(3)}$ is called the output layer
These chain structures are the most commonly used structures of neural networks
The length of the chain is the depth of the model, e.g. depth = 3 in the example above
Deep learning means the depth of the NN is very large
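As a minimal illustration of this chained structure, the sketch below composes three toy layer functions in NumPy; the shapes, the ReLU activations, and the random weights are placeholder assumptions of mine, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three toy layers: each is an affine map followed by ReLU
# (identity on the output layer). All shapes are arbitrary placeholders.
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(6, 8)), np.zeros(6)
W3, b3 = rng.normal(size=(3, 6)), np.zeros(3)

def f1(x):
    return np.maximum(0.0, W1 @ x + b1)   # first layer

def f2(h):
    return np.maximum(0.0, W2 @ h + b2)   # second layer

def f3(h):
    return W3 @ h + b3                    # output layer

x = rng.normal(size=4)
y = f3(f2(f1(x)))    # f(x) = f^(3)(f^(2)(f^(1)(x))), depth = 3
print(y.shape)       # (3,)
```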

Deep neural network
A deep network is a multi-layer NN with MANY layers

A network with depth 2: one hidden layer

$$f(x) = f^{(2)}(f^{(1)}(x))$$

$f^{(1)}$ outputs an M-dim vector
$f^{(2)}$ outputs a K-dim vector
The hidden layer is typically vector-valued
The dimensionality of the hidden layer is the width of the model
Each element of the vector is viewed as a unit (neuron)
Each unit receives inputs from many other units and computes its own activation value
A brief history of neural network
First appearance

Results:
One-layer perceptron (depth=2) for binary classification
One-layer perceptron (depth=2) cannot solve XOR (i.e., exclusive or)
Multi-layer perceptrons can solve XOR
Challenges:
How to train a multi-layer perceptron?

A brief history of neural network
The thaw of NN learning

New results on training algorithms and architectures
Backpropagation for training multi-layer NNs
Recurrent long short-term memory networks
OCR with convolutional neural networks
Challenges
Kernel machines (SVM etc.) achieved similar accuracy with less complexity and a more solid theoretical foundation
A brief history of neural network
The problems with multi-layer NNs and their training

Lack of processing power
Lack of sufficiently large labeled data
Overfitting
Empirical performance did not improve with more layers
The inevitable question:
Is one or two layers the best a NN can do?

A brief history of neural network
Deep learning renaissance

Big labeled data in certain domains, e.g. computer vision
Advances in regularization and normalization of NN training
Better hardware for training NNs, e.g. GPUs
A representative deep network for image classification:
200 layers
650K neurons and 630M connections

Key factors in the popularity of deep learning

Availability of labeled big data in many application sectors
"ImageNet" in computer vision, with 15M labeled images classified into 100,000 classes

Key factors in the popularity of deep learning

Algorithms for regularizing the problem in NN training
Back-propagation, stochastic gradient descent, drop-out, batch-normalization, etc.

The advance of GPU computing for neural network training

Learning in NN
Definition
Let $f(\cdot)$ denote the unknown true mapping s.t. $y = f(x)$
Let $h(\cdot, w)$ denote the function representing an NN with weights $w$
Learning a NN
Use the training data $\{x^{(k)}, y^{(k)} = f(x^{(k)})\}$ to adjust the parameters $w$ so that $h(\cdot, w)$ approximates $f$

Perceptron + Threshold Logic Unit (TLU)
A simple NN with a perceptron and one TLU, where $\sigma(a) = 1$ if $a > 0$ and $0$ otherwise

Model it by a mapping $h$ taking input $x$ and outputting $y$:
$$y = h(x, w) = \sigma\Big(\sum_k x_k w_k + w_0\Big)$$
where $w$ are the weights of the NN and $w_0$ is the bias.
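A minimal NumPy sketch of this unit; the example weights below are made-up placeholders, not values from the slides.

```python
import numpy as np

def tlu(a):
    """Threshold logic unit: sigma(a) = 1 if a > 0, else 0."""
    return 1.0 if a > 0 else 0.0

def perceptron(x, w, w0):
    """y = sigma(sum_k x_k * w_k + w0)."""
    return tlu(np.dot(w, x) + w0)

# Toy weights: the unit fires when x1 + x2 exceeds 1.
w, w0 = np.array([1.0, 1.0]), -1.0
print(perceptron(np.array([0.3, 0.4]), w, w0))  # 0.0
print(perceptron(np.array([0.8, 0.9]), w, w0))  # 1.0
```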


Mapping for classification
The transform in $\mathbb{R}^2$ represented by a perceptron + one TLU:
$$h([x_1, x_2]) = \begin{cases} 1, & \text{if } w_1 x_1 + w_2 x_2 + w_0 \ge 0; \\ 0, & \text{if } w_1 x_1 + w_2 x_2 + w_0 < 0. \end{cases}$$
It is a binary linear classifier

[Figure: decision boundary]

Other activation functions
Sigmoid activation function
$$\sigma(a) = \mathrm{sigm}(a) = \frac{1}{1 + \exp(-a)}$$
Rectified linear (ReLU) activation function
$$\sigma(a) = \mathrm{relu}(a) = \max(0, a)$$

[Plots: sigmoid and ReLU]
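Both activations are one-liners in NumPy; a small sketch for reference (the test values are my own):

```python
import numpy as np

def sigmoid(a):
    # 1 / (1 + exp(-a)), elementwise
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    # max(0, a), elementwise
    return np.maximum(0.0, a)

a = np.array([-2.0, 0.0, 2.0])
print(sigmoid(a))   # approx [0.119, 0.5, 0.881]
print(relu(a))      # [0. 0. 2.]
```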

Simple NN with one hidden layer

The mapping of the NN with multiple nodes:
$$y = H(x; w) = \sum_i v_i\, h(x; w_i)$$
The more nodes in the hidden layer, the more complex the transform the NN can model
Binary decision boundary of simple NN
NNs with different numbers of nodes in the hidden layer; the activation is ReLU

Multi-layer neural network

Multiple hidden layers allow composition of multiple functions
[Figure: network with input layer $I_1, \ldots, I_n$, two hidden layers $H_{1,1}, \ldots, H_{1,n}$ and $H_{2,1}, \ldots, H_{2,n}$, and output layer $O_0, \ldots, O_9$]

The transform $H$ of a multi-layer NN:
$$H(\cdot, w) = h(\cdot, w_1) \circ h(\cdot, w_2) \circ h(\cdot, w_3) \circ \cdots = h(h(h(\cdots; w_{n-2}); w_{n-1}); w_n)$$

Greater complexity of a multi-layer NN is achieved by using more layers.

Binary decision boundary of multi-layer NN

NNs with different numbers of layers
The activation function is ReLU

Illustration of deep learning for classification

Definition (classifier)
A classifier $f$ is a mapping that accepts a set of features and produces a class label for them.

Training a classifier $f(\cdot, w)$ with a training dataset $\{x_i, y_i\}$:
$$f(x \in \mathbb{R}^3; w) \to y \in \{-1, 1\}$$

Table: A dataset with three attributes

Data: x (fields)    Class: y (label)
1.4  2.7  1.9       -1
3.8  3.4  3.2       -1
6.4  2.8  1.7        1
4.1  0.1  0.2        1
etc.                ...

Demo of training in NN
Iteratively refine the parameters $w$ of the NN to minimize the classification error
$$\sum_i \#\big(f(x_i, w) \ne y_i\big)$$

Initialize with random parameters: e.g., the input $[1.4, 2.7, 1.9]$ with true label 0 produces the output 0.7, so the error is 0.7
Adjust the parameters based on the classification errors: the same input now produces an output closer to 0, with a smaller error
Keep going to lower the classification error


Learning representation of shallow NN

Consider a shallow NN with 3 layers (input, hidden, output)

A summation based representation
$$H(x) = \sum_{i=1}^{N} v_i\, h(w_i^\top x + \theta_i)$$
where $h$ is the function of the nodes in the hidden layer.


A well-established strategy to construct representations (regression): Taylor expansion by polynomials, Fourier series by trigonometric polynomials
Universal approximation theorem [Cybenko 1989]. Suppose $h(\cdot)$ is a non-constant, bounded and monotonically increasing continuous function. Then for any continuous $f$ on $[0,1]^n$ and any $\epsilon > 0$, there exist $N$ and $\{v_i, w_i, \theta_i\}$ such that
$$\max_{x \in [0,1]^n} |H(x) - f(x)| \le \epsilon.$$
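The theorem is existential, but a quick numerical sketch gives the flavour: fix random inner parameters $w_i, \theta_i$ and fit only the outer weights $v_i$ by least squares. This illustration is my own (the target function, grid, and value of N are arbitrary choices), not a construction from the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
N = 50                                  # number of hidden units
x = np.linspace(0.0, 1.0, 200)          # grid on [0, 1]
f = np.sin(2 * np.pi * x)               # target function (placeholder choice)

# Random inner weights and biases; only the outer weights v are fitted.
w = rng.normal(scale=10.0, size=N)
theta = rng.normal(scale=5.0, size=N)
Phi = sigmoid(np.outer(x, w) + theta)   # Phi[j, i] = sigmoid(w_i * x_j + theta_i)

v, *_ = np.linalg.lstsq(Phi, f, rcond=None)
H = Phi @ v                             # H(x) = sum_i v_i * sigmoid(w_i x + theta_i)
print("max |H(x) - f(x)| on the grid:", np.max(np.abs(H - f)))
```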

Difference between deep NN and shallow NN
A shallow NN with one hidden layer:
$$H(\cdot) = v_1 h(\cdot, w_1) + v_2 h(\cdot, w_2) + v_3 h(\cdot, w_3) + \cdots$$
A summation-based approximation scheme
A deep NN with many hidden layers:
$$H(x) = h(h(h(\cdots; w_{n-2}); w_{n-1}); w_n)$$
A variable-substitution-based approximation scheme
The question: what is variable substitution doing?
It rapidly changes the level sets of the function:
$$\Omega_{h(\cdot)=0} = \{x \in \mathbb{R}^N : h(x) = 0\}$$
$$\Omega_{h(h(\cdot))=0} = \{x \in \mathbb{R}^N : h(x) \in \Omega_{h(\cdot)=0}\}$$

Example of variable substitution
Consider a data set $\{(x_i, y_i)\}$ where $y_i \in \{0, 1\}$. A binary classifier can be defined by thresholding a function $f$:
$$\hat{y}(x) = \begin{cases} 1 & f(x) > 1/2 \\ 0 & f(x) < 1/2 \end{cases}$$
Consider a function $h$ defined from a ReLU network:
$$h(x) = \begin{cases} 2x & \text{if } 0 \le x \le 1/2 \\ 2 - 2x & \text{if } 1/2 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$
Consecutive variable substitution by a multi-layer network:
the depth of a NN increases the number of oscillations multiplicatively
the width of a NN can only do so additively
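A small numerical check of the multiplicative claim (my own sketch, not from the slides): compose the tent function $h$ with itself and count the turning points on $[0, 1]$; each extra composition roughly doubles them.

```python
import numpy as np

def h(x):
    """Tent function from a small ReLU network: 2x on [0,1/2], 2-2x on [1/2,1], 0 elsewhere."""
    return np.where((x >= 0) & (x <= 0.5), 2 * x,
           np.where((x > 0.5) & (x <= 1), 2 - 2 * x, 0.0))

def compose(f, depth):
    """Apply f to its own output `depth` times (consecutive variable substitution)."""
    def g(x):
        for _ in range(depth):
            x = f(x)
        return x
    return g

x = np.linspace(0.0, 1.0, 10001)
for d in (1, 2, 3, 4):
    y = compose(h, d)(x)
    turns = np.sum(np.diff(np.sign(np.diff(y))) != 0)   # count slope sign changes
    print(f"depth {d}: about {turns} turning points")
```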

Deep NN vs wide NN

Expression capability
Both NNs can do the job
A wide NN is preferred, as it is easier to optimize owing to the separability of the node parameters
Expression efficiency
For functions with simple level sets, a deep NN has no advantage
For functions with complex level sets, a deep NN is preferred owing to its efficiency at rapidly changing level sets
Conclusion: a deep NN is for applications where the target transform satisfies:
the accuracy of its level sets is very important
the structure of its level sets is complex
Sample applications
Multi-label functions in classification
Value evaluation functions in AI for games
Pattern analysis in natural language processing

Outline of NN training

Prepare training data:
$$\{x_n, y_n\}_{n=1}^N$$

Choose each of these:
Decision function (transform): $\hat{y} = f(x_n; w)$
Loss function: $L(\hat{y}, y) \in \mathbb{R}$

Define the goal:
$$w^* = \mathrm{argmin}_w \sum_{n=1}^N L(f(x_n; w), y_n)$$

Train with stochastic gradient descent:
$$w_{t+1} = w_t - \eta_t \nabla_w L(f(x_n; w), y_n)$$
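A bare-bones SGD loop under these choices, as a sketch only: the data, the linear decision function, the squared loss, and the step size below are all assumptions of mine for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up regression data: y = <w_true, x> plus noise.
w_true = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(200, 3))
y = X @ w_true + 0.1 * rng.normal(size=200)

def loss_grad(w, x_n, y_n):
    """Gradient of the squared loss 0.5 * (f(x; w) - y)^2 for f(x; w) = <w, x>."""
    return (x_n @ w - y_n) * x_n

w = np.zeros(3)
eta = 0.05                                        # step size (arbitrary)
for epoch in range(20):
    for n in rng.permutation(len(y)):
        w = w - eta * loss_grad(w, X[n], y[n])    # w_{t+1} = w_t - eta * grad_w L
print(w)   # should be close to w_true
```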

Cost function
Consider a network denoted by $f(x; w)$
Gaussian error model: assume the prediction error follows a normal distribution:
$$p_{\mathrm{model}}(y|x) = \mathcal{N}(y \mid f(x; w), \sigma^2 I)$$
The likelihood function is then
$$J(w) = \prod_{n=1}^N (2\pi\sigma^2)^{-\frac{m}{2}} \exp\Big(-\frac{1}{2\sigma^2}\|y_n - f(x_n; w)\|_2^2\Big)$$
By MLE (maximum likelihood estimation), the cost function is
$$\max_w \log J(w) = \max_w \; -\,\mathbb{E}_{x,y \sim p_{\mathrm{data}}} \|y - f(x; w)\|_2^2 + \mathrm{const}$$
In other words, we are minimizing the function
$$\min_w \mathbb{E}_{x,y \sim p_{\mathrm{data}}} \|y - f(x; w)\|_2^2 \;\approx\; \sum_{n=1}^N \frac{1}{2}\|y_n - f(x_n; w)\|_2^2$$
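To make the MLE step explicit, here is the intermediate algebra (added for readability; it is standard but not spelled out on the slide):
$$\log J(w) = \sum_{n=1}^{N}\Big(-\tfrac{m}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^2}\|y_n - f(x_n; w)\|_2^2\Big) = -\tfrac{1}{2\sigma^2}\sum_{n=1}^{N}\|y_n - f(x_n; w)\|_2^2 + \mathrm{const},$$
so maximizing $\log J(w)$ over $w$ is equivalent to minimizing $\sum_{n=1}^{N}\tfrac{1}{2}\|y_n - f(x_n; w)\|_2^2$.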

Cross-entropy based cost function
Consider two distributions $p(x)$ and $q(x)$; their cross-entropy is defined as
$$H(p, q) = -\sum_x p(x) \ln q(x)$$

Consider a two-class problem $y \in \{0, 1\}$ (logistic regression); the MLE minimizes the negative log-likelihood
$$J(\theta) = -\sum_{k=1}^N \big(y_k^* \ln y_k + (1 - y_k^*) \ln(1 - y_k)\big),$$
where $y_k^*$ denotes the true label of $x_k$ and $y_k = \sigma(\omega^\top x_k)$ denotes the prediction.

Furthermore, there is an interesting connection of cross-entropy to K-L divergence:
$$H(p, q) = -\sum_x p(x) \ln q(x) = \sum_x p(x) \log \frac{p(x)}{q(x)} + H(p, p)$$
Minimizing $H(p, q)$ over $q$ is the same as minimizing $D_{\mathrm{KL}}(p \| q)$.
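A small NumPy sketch of this binary cross-entropy loss; the clipping constant is a safeguard against $\log(0)$ that I added, not something on the slide.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Negative log-likelihood for binary labels: -sum(y* ln y + (1-y*) ln(1-y))."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])
print(binary_cross_entropy(y_true, y_pred))   # about 0.84
```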


Objective function
Quadratic loss: $J = \frac{1}{2}(y - y^*)^2$
Cross entropy: $J = -\big(y^* \log(y) + (1 - y^*) \log(1 - y)\big)$
"Softmax": $y_k = \exp(b_k) \big/ \sum_\ell \exp(b_\ell)$
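A sketch of the softmax in NumPy; subtracting the max before exponentiating is a standard numerical-stability trick I added, not something stated on the slide.

```python
import numpy as np

def softmax(b):
    """y_k = exp(b_k) / sum_l exp(b_l), computed in a numerically stable way."""
    b = b - np.max(b)      # shift for stability; does not change the result
    e = np.exp(b)
    return e / np.sum(e)

print(softmax(np.array([1.0, 2.0, 3.0])))   # approx [0.09, 0.24, 0.67]
```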

Standard ML Training vs NN Training
The nonlinearity of a neural network causes the loss functions of interest to be non-convex

Difference between standard ML learning (e.g. logistic regression or SVM) and NN training:
Standard ML learning finds a global minimizer, e.g., exact linear equation solvers for linear regression, or convex optimization algorithms for logistic regression or SVMs
NN training: uses iterative gradient-based optimizers that merely lower the value of the cost function

Learning vs Pure optimization

Optimization algorithms for deep learning differ from traditional optimization in several ways
Pure optimization: minimizing J is a goal in itself
Machine learning cares more about the performance on test data than the performance on training data
We reduce a different function and hope that it will lead to the reduction of the true cost (not available)
Optimization algorithms for training deep models exploit the special structure of the cost function:
the cost function can be expressed as the average over a training set
$$J(\theta) = \mathbb{E}_{(x,y) \sim p_{\mathrm{data}}} L(f(x; \theta), y) \approx \frac{1}{N}\sum_i L(f(x_i; \theta), y_i)$$
the cost function is a summation of multiple terms with the same functional form

Training with gradient descent

Solve the optimization problem
$$w^* = \mathrm{argmin}_w \sum_{i=1}^N L(f(x_i; w), y_i)$$
with gradient descent:
$$w_{t+1} = w_t - \eta_t \nabla_w L(f(x_i; w), y_i)$$

Advantages: simple and easy
Disadvantages: first-order, slow convergence, prone to local minimizers

Is differentiability necessary?

Some hidden units are not differentiable at all input points
e.g. ReLU: $f(z) = \max\{0, z\}$
This seems to invalidate their use in gradient-based learning
In practice, gradient descent still performs well enough for these models to be used in ML
Neural network training does not usually arrive at a local minimum of the cost function
It only reduces the cost significantly
Hidden units that are not differentiable are usually non-differentiable at only a small number of points
Left and right differentiability
Differentiability at a point means the left derivative equals the right derivative
Practical implementations return one of the one-sided derivatives when the two one-sided derivatives differ
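A sketch of that convention for ReLU; choosing the left derivative (0) at $z = 0$ below is one common implementation choice, assumed here rather than taken from the slides.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Left derivative at 0 is 0, right derivative is 1; return the left one by convention.
    return 1.0 if z > 0 else 0.0

print(relu_grad(-1.0), relu_grad(0.0), relu_grad(2.0))   # 0.0 0.0 1.0
```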

How to calculate the gradient: backpropagation via the chain rule
Chain rule: given $y = g(u)$ and $u = h(x)$,
$$\frac{dy_i}{dx_k} = \sum_{j=1}^{J} \frac{dy_i}{du_j}\,\frac{du_j}{dx_k}, \quad \forall\, i, k.$$

Backpropagation: repeated application of the chain rule
Instantiate the computation as a directed acyclic graph, where each intermediate quantity is a node
At each node, store (a) the quantity computed in the forward pass and (b) the partial derivative of the output w.r.t. that intermediate quantity
Initialize all partial derivatives to 0
Visit each node in reverse topological order; at each node, add its contribution to the partial derivatives of its parents
Demonstration on function
Consider $J = \cos(\sin(x^2) + 3x^2)$
Compute the value on the forward pass and the derivative $\frac{dJ}{dx}$ on the backward pass
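A hand-written forward and backward pass for this function (my own sketch of the computation, with a finite-difference check; the test point is arbitrary):

```python
import math

def forward_backward(x):
    """Forward pass for J = cos(sin(x^2) + 3x^2) and backward pass for dJ/dx."""
    # Forward pass: store each intermediate quantity.
    a = x * x                 # a = x^2
    s = math.sin(a)           # s = sin(a)
    u = s + 3.0 * a           # u = sin(x^2) + 3x^2
    J = math.cos(u)           # J = cos(u)

    # Backward pass: visit nodes in reverse order, accumulating dJ/d(node).
    dJ_du = -math.sin(u)                        # d cos(u) / du
    dJ_ds = dJ_du * 1.0                         # u = s + 3a
    dJ_da = dJ_du * 3.0 + dJ_ds * math.cos(a)   # two paths into a: direct and via s
    dJ_dx = dJ_da * 2.0 * x                     # a = x^2
    return J, dJ_dx

x = 0.7
J, g = forward_backward(x)
# Finite-difference check of the gradient.
eps = 1e-6
num = (forward_backward(x + eps)[0] - forward_backward(x - eps)[0]) / (2 * eps)
print(J, g, num)   # the analytic and numerical gradients should agree closely
```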

Demonstration on NN
Consider the following network

Illustration of backpropagation on NN

