Lecture 04 (3hrs) Neural Network and Deep Learning-Part A


Neural Network and Deep Learning

Xizhao WANG
Dian ZHANG
Big Data Institute
College of Computer Science
Shenzhen University

March 2022

Outline

1. Gradient Descent Algorithm


2. BP Algorithm for Feed-Forward Neural
Network Model
3. Convolutional Neural Network
4. Deep Learning

Machine Learning Lecture – Xizhao Wang Lecture 03: Neural Network and Deep Learning
Gradient Descent Algorithm 1. Definition of Gradient
BP Algorithm for Feed-Forward Neural Network Model 2. Gradient Descent Algorithm (GDA)
Convolutional Neural Network 3. Difference between GDA and Newton's Method
Deep Learning 4. An example


Gradient Descent Algorithm


Gradient Descent Algorithm - starting from an example


Minimize: f(x) = x².
Step 1: compute the gradient, ∇f = 2x.
Step 2: move x along the negative direction of the gradient, i.e.,
x ← x − γ∇f, where γ is the learning rate.
Step 3: repeat Step 2 until the difference of f(x) between two
adjacent iterations is small enough, which indicates that f(x) has
reached a local minimum.
Step 4: output x, which is the (approximately) optimal solution.
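The four steps above can be sketched in a few lines of Python; the learning rate, tolerance, and iteration cap below are illustrative choices, not values fixed by the slide.

```python
# Minimal sketch of Steps 1-4 for f(x) = x^2, whose gradient is 2x.
def gradient_descent(x, gamma=0.1, tol=1e-12, max_iter=10_000):
    f = lambda x: x * x
    grad = lambda x: 2 * x                 # Step 1: compute the gradient
    prev = f(x)
    for _ in range(max_iter):
        x = x - gamma * grad(x)            # Step 2: move against the gradient
        if abs(f(x) - prev) < tol:         # Step 3: stop when f barely changes
            break
        prev = f(x)
    return x                               # Step 4: output the solution

print(gradient_descent(2.0))  # close to 0, the minimizer of x^2
```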


Gradient Descent Algorithm


Example

Minimize f(x) = x² by using the Gradient Descent Algorithm.

The initial value of x is 2, and the step length is 0.1. After 49
iterations, the minimum value 1.273147e-09 of the function is
obtained, and the corresponding x value is 3.568119e-05.
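The example can be reproduced directly, assuming the slide ran a fixed 49 updates of x ← x − γ·2x with γ = 0.1:

```python
# Reproducing the slide's numbers: x0 = 2, step length 0.1, 49 iterations.
x = 2.0
for _ in range(49):
    x -= 0.1 * 2 * x      # x <- x - gamma * f'(x), with f'(x) = 2x

print(x)      # ~3.568119e-05
print(x * x)  # ~1.273147e-09
```

Each update multiplies x by (1 − 2γ) = 0.8, so after 49 iterations x = 2·0.8⁴⁹, which matches the slide's figures.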


Gradient Descent Algorithm


Definition:

Directional derivative (taking a function of three variables as an example):

Suppose the function f is defined in a neighborhood of the point P0(x0, y0, z0),
l is a ray from the point P0, P(x, y, z) is a point on l contained in the
neighborhood of P0, and ρ denotes the distance between P and P0.

If lim (f(P) − f(P0))/ρ = lim Δf/ρ

exists as ρ → 0, we call this limit the directional derivative of f

at P0 along the direction of l.
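The definition can be illustrated numerically by evaluating the limit quotient for a small ρ; the function f(x, y, z) = xy + z² and the point P0 below are illustrative choices, not from the slides.

```python
import math

# Directional derivative of f at P0 along a unit direction l,
# approximated by the quotient (f(P) - f(P0)) / rho for small rho.
def f(x, y, z):
    return x * y + z ** 2

P0 = (1.0, 2.0, 3.0)
l = (1 / math.sqrt(3),) * 3        # unit vector along (1, 1, 1)

rho = 1e-6
P = tuple(p + rho * d for p, d in zip(P0, l))
numeric = (f(*P) - f(*P0)) / rho

# Analytically the directional derivative equals grad f . l
grad = (P0[1], P0[0], 2 * P0[2])   # (y, x, 2z) at P0
analytic = sum(g * d for g, d in zip(grad, l))

print(numeric, analytic)  # nearly equal
```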

Machine Learning Lecture – Xizhao Wang Lecture 03: Neural Network and Deep Learning
Gradient Descent Algorithm 1. Definition of Gradient
BP Algorithm for Feed-Forward Neural Network Model 2. Gradient Descent Algorithm (GDA)
Convolutional Neural Network 3. Difference between GDA and Newton's Method
Deep Learning 4. An example

Gradient Descent Algorithm


Generally speaking, directional derivative is the rate of change
of a function in a specified direction.

The gradient of a scalar function f (x1, x2, ∙∙∙, xn) is denoted as

∇f(X) = (∂f/∂x1, ∂f/∂x2, …, ∂f/∂xn)ᵀ.

In the three-dimensional Cartesian coordinate system with a Euclidean
metric, the gradient, if it exists, is given by

∇f = (∂f/∂x) i + (∂f/∂y) j + (∂f/∂z) k,

where i, j, k are the standard unit vectors in the directions of the
coordinate axes. For example, the gradient of the function
f(x, y, z) = 2x + 3y² − sin(z) is ∇f = 2i + 6y j − cos(z) k.
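The gradient of the example function can be checked against central finite differences; the evaluation point is an arbitrary illustrative choice.

```python
import math

# Central-difference check of grad f = (2, 6y, -cos z)
# for f(x, y, z) = 2x + 3y^2 - sin(z).
def f(x, y, z):
    return 2 * x + 3 * y ** 2 - math.sin(z)

def num_grad(f, p, h=1e-6):
    g = []
    for i in range(3):
        hi = list(p); hi[i] += h
        lo = list(p); lo[i] -= h
        g.append((f(*hi) - f(*lo)) / (2 * h))
    return g

x, y, z = 0.5, -1.0, 2.0
exact = [2.0, 6 * y, -math.cos(z)]
approx = num_grad(f, [x, y, z])
print(approx, exact)  # componentwise agreement
```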

Gradient Descent Algorithm


Geometric Meaning
The gradient specifies the direction that produces the steepest increase in the
function. The negative of the gradient therefore gives the direction of steepest
decrease.

In the above two images, the values of the function are represented in black and
white (black representing higher values), and the corresponding gradient is
represented by blue arrows.


Gradient Descent Algorithm


Geometric Meaning

The gradient of the function f(x, y) = −(cos²x + cos²y)²
is depicted as a projected vector field on the bottom plane.


Gradient Descent Algorithm


For the 2-dimensional case:
Gradient: Suppose z = f(x, y) has first-order continuous partial derivatives on
a region D. Then for every point P(x, y) in D there exists a vector

(∂f/∂x, ∂f/∂y) = fx(x, y) i + fy(x, y) j,

called the gradient of z = f(x, y) at P(x, y), written grad f(x, y) or ∇f(x, y), i.e.,

grad f(x, y) = ∇f(x, y) = (∂f/∂x) i + (∂f/∂y) j.

Along the gradient direction, the function changes most quickly.


Gradient Descent Algorithm


Suppose e = [cos α, cos β] is a unit vector in the direction of l. Then

∂f/∂l = (∂f/∂x) cos α + (∂f/∂y) cos β
      = (∂f/∂x, ∂f/∂y) · (cos α, cos β)
      = grad f(x, y) · e
      = |grad f(x, y)| |e| cos⟨grad f(x, y), e⟩.

When cos⟨grad f(x, y), e⟩ = 1, the directional derivative ∂f/∂l attains its
maximum value, which equals the norm of the gradient, i.e.,

|grad f(x, y)| = √( (∂f/∂x)² + (∂f/∂y)² ).

Thus, when the variables change along the gradient direction, the rate of change
of the function attains its maximum value, the norm of the gradient.
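This claim can be confirmed numerically by scanning unit directions e = (cos α, sin α); the function f(x, y) = x² + 3y and the point below are illustrative choices.

```python
import math

# For f(x, y) = x^2 + 3y, grad f = (2x, 3). Scan directions and check that
# the directional derivative grad . e peaks at the gradient norm.
x, y = 1.0, 2.0
grad = (2 * x, 3.0)

best = max(
    grad[0] * math.cos(a) + grad[1] * math.sin(a)
    for a in (k * 2 * math.pi / 3600 for k in range(3600))
)
norm = math.hypot(*grad)
print(best, norm)  # nearly equal
```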

Gradient Descent Algorithm

When the gradient is generalized to n-dimensional space, it can be represented as:

∇f(X) = (∂f/∂x1, ∂f/∂x2, …, ∂f/∂xn)ᵀ.

Along the gradient direction, the function changes most quickly.


Gradient Descent Algorithm


[Figure: a descent path from the initial point to the minimum value.]

The gradient descent algorithm may lead to a local optimal solution; the
global optimum is ensured only when the loss function is convex.
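A small experiment illustrates this: on a nonconvex function, different starting points can land in different local minima (the function f(x) = x⁴ − 3x² + x below is an illustrative choice).

```python
# Gradient descent on the nonconvex f(x) = x^4 - 3x^2 + x, whose
# derivative is f'(x) = 4x^3 - 6x + 1. It has two local minima; which
# one we reach depends on the initial point.
def descend(x, gamma=0.01, steps=5000):
    for _ in range(steps):
        x -= gamma * (4 * x ** 3 - 6 * x + 1)
    return x

left, right = descend(-2.0), descend(2.0)
print(left, right)  # two different local minimizers
```

For a convex loss this cannot happen: every starting point reaches the same global minimum.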


Gradient Descent Algorithm

Notes on the Gradient Descent Algorithm parameters

1. The magnitude of the gradient, epsilon, is one of the termination conditions.

2. Another termination condition is the number of iterations (time control).

3. The learning rate, alpha, controls the "walking step": too small a value
leads to slow convergence (low efficiency), while too large a value results in
oscillation (non-convergence). Its appropriate value depends on the specific
function to be minimized.
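Point 3 is easy to demonstrate on f(x) = x², where the update is x ← (1 − 2α)x: a tiny α converges slowly, a moderate α converges quickly, and α > 1 makes |1 − 2α| > 1, so the iterates oscillate with growing amplitude. The specific α values below are illustrative.

```python
# Effect of the learning rate alpha on f(x) = x^2 (gradient 2x).
def run(alpha, steps=50, x=2.0):
    for _ in range(steps):
        x -= alpha * 2 * x     # x <- (1 - 2*alpha) * x
    return x

print(abs(run(0.01)))  # slow: still far from 0 after 50 steps
print(abs(run(0.9)))   # fast: essentially 0
print(abs(run(1.1)))   # oscillates with growing amplitude (diverges)
```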


Gradient Descent Algorithm

1. Definition of gradient
2. Gradient descent algorithm (GDA)
3. Difference between GDA and Newton's method
4. An example


Newton's method
Suppose the objective function f(x) has second-order continuous partial
derivatives, and xk is an approximation of its minimum point. The second-order
Taylor polynomial approximation of f(x) near xk is:

f(x) ≈ f(xk) + ∇f(xk)ᵀ(x − xk) + (1/2)(x − xk)ᵀ H(xk)(x − xk).

Its gradient is

∇f(x) ≈ ∇f(xk) + H(xk)(x − xk).

The minimum point of the approximate function satisfies ∇f(x) = 0, then

x = xk − H(xk)⁻¹ ∇f(xk),

where H(xk) is the Hessian matrix of f(x) at the point xk.

In the minimizing process of f(x), −H(xk)⁻¹∇f(xk) is considered as the
searching direction.


Newton's method
The minimizing process of Newton's method can be represented as:

xk+1 = xk − H(xk)⁻¹ ∇f(xk), k = 0, 1, 2, ….


Newton's method

In optimization, Newton's method is applied to the derivative f′ of a
twice-differentiable function f to find the roots of the derivative (solutions
to f′(x) = 0), also known as the stationary points of f.

In the one-dimensional problem, Newton's method attempts to construct a
sequence xn from an initial guess x0 that converges towards some value x*
satisfying f′(x*) = 0. This x* is a stationary point of f.

The second-order Taylor expansion fT(x) of f around xn is:

fT(xn + Δx) = f(xn) + f′(xn) Δx + (1/2) f″(xn) Δx².


Newton's method
We want to find Δx such that xn + Δx is a stationary point. We solve the
equation that sets the derivative of this last expression with respect to Δx
equal to zero:

0 = d/d(Δx) [ f(xn) + f′(xn) Δx + (1/2) f″(xn) Δx² ] = f′(xn) + f″(xn) Δx.

For the value Δx = −f′(xn) / f″(xn), which is the solution of this equation, it
can be hoped that xn+1 = xn + Δx = xn − f′(xn) / f″(xn) will be closer to a
stationary point x*. Provided that f is a twice-differentiable function and
other technical conditions are satisfied, the sequence x1, x2, ∙∙∙ will converge
to a point x* satisfying f′(x*) = 0.

The above iterative scheme can be generalized to several dimensions by
replacing the derivative with the gradient, ∇f(x), and the reciprocal of the
second derivative with the inverse of the Hessian matrix, Hf(x). One obtains
the iterative scheme

xn+1 = xn − [Hf(xn)]⁻¹ ∇f(xn), n ≥ 0.

Gradient Descent Algorithm


Comparison of GDA and Newton's Method

A comparison of gradient descent (green) and Newton's method (red) for
minimizing a function (with small step sizes).

Newton's method uses curvature information to take a more direct route.

Essentially, Newton's method has second-order convergence while gradient
descent has first-order convergence, so Newton's method is faster. Put more
intuitively: if you want to find the shortest path to the bottom of a basin,
gradient descent at each step only picks the steepest downhill direction from
your current position, whereas Newton's method, when choosing a direction,
considers not only whether the slope is steep enough but also whether the
slope will become steeper after you take the step.


Gradient Descent Algorithm

1. Definition of gradient
2. Gradient descent algorithm (GDA)
3. Difference between GDA and Newton’s method
4. An example


Newton's method
Example

Minimize f(x) = x² by using Newton's method.

The initial value of x is 2. After 15 iterations, the minimum value
3.7253e-09 of the function is obtained, and the corresponding x value
is 6.1035e-05.


Gradient Descent Algorithm

The End.

Machine Learning Lecture – Xizhao Wang Lecture 03: Neural Network and Deep Learning
1. Brief Introduction
Gradient Descent Algorithm 2. Feedforward NN
BP Algorithm for Feed-Forward Neural Network Model 3. BP algorithm
Convolutional Neural Network 4. Notes on BP
Deep Learning 5. An application
6. Questions


BP Algorithm for Feed-Forward Neural Network Model



• Rumelhart and McClelland proposed the BP (Back Propagation)
algorithm for feed-forward neural networks.

[Photos: David Rumelhart; Geoffrey Everest Hinton]

• BP algorithm - key idea

– Use the error of the output layer to estimate the error of its
previous layer; generally, use the error of layer n to estimate
the error of layer n−1.


BP Algorithm for Feed-Forward Neural Network Model

An intuitive understanding of a feed-forward neural network

A feed-forward NN is a smooth function which can be used
to approximate an input-output system (a black box).

What is the specific form of the function in the box?
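One way to see "a smooth function" concretely is to write out the forward pass of a tiny network; the 2-2-1 architecture and all weight values below are made up for illustration.

```python
import math

# A 2-2-1 feed-forward net is just the smooth function
# y = sigma(W2 . sigma(W1 . x + b1) + b2).
def sigma(t):
    return 1.0 / (1.0 + math.exp(-t))

W1 = [[0.5, -0.3], [0.8, 0.2]]   # hidden-layer weights (illustrative)
b1 = [0.1, -0.1]
W2 = [1.0, -1.0]                 # output-layer weights (illustrative)
b2 = 0.05

def net(x):
    h = [sigma(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    return sigma(sum(w * hi for w, hi in zip(W2, h)) + b2)

y = net([1.0, 2.0])
print(y)  # a single smooth output in (0, 1)
```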


BP Algorithm for Feed-Forward Neural Network Model


• A Perceptron


BP Algorithm for Feed-Forward Neural Network Model

• Sigmoid threshold unit

The sigmoid unit computes its output as o = σ(w · x), where
σ(y) = 1 / (1 + e^(−y)).

It is easy to check that dσ(y)/dy = σ(y)(1 − σ(y)).

BP Algorithm for Feed-Forward Neural Network Model

An intuitive example

Digit “3”

Digit “8”


BP Algorithm for Feed-Forward Neural Network Model


• A Perceptron can be used to represent many Boolean
functions, such as AND, OR, NAND, and NOR.

A Perceptron cannot be used to represent functions that are not
linearly separable, such as XOR.
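A quick sketch: a threshold unit o = 1 if w0 + w1·x1 + w2·x2 > 0 (else 0) realizes AND and OR with one standard choice of weights, whereas no single choice of weights realizes XOR, since XOR is not linearly separable.

```python
# Threshold (perceptron) units for the Boolean functions AND and OR.
def perceptron(w0, w1, w2):
    return lambda x1, x2: 1 if w0 + w1 * x1 + w2 * x2 > 0 else 0

AND = perceptron(-1.5, 1.0, 1.0)   # fires only when both inputs are 1
OR = perceptron(-0.5, 1.0, 1.0)    # fires when at least one input is 1

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, AND(x1, x2), OR(x1, x2))
```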


BP Algorithm for Feed-Forward Neural Network Model


• Sigmoid function picture


BP Algorithm for Feed-Forward Neural Network Model

An intuitive understanding to a feed-forward neural network



Overview of Backprop algorithm

• Choose random weights for the network


• Feed in an example and obtain a result
• Calculate the error for each node
(starting from the last stage and propagating the error backwards)
• Update the weights
• Repeat with other examples until the network converges on the
target output
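The five steps above can be sketched as a pure-Python training loop; the 2-2-1 architecture, the learning rate, and the OR training set below are illustrative choices, not from the slides.

```python
import math, random

# Backprop sketch: random weights, forward pass, backward error pass,
# weight updates, repeated over examples.
random.seed(0)

def sigma(t):
    return 1.0 / (1.0 + math.exp(-t))

# Choose random weights: hidden layer (2 units, bias + 2 inputs),
# output unit (bias + 2 hidden inputs).
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
W2 = [random.uniform(-1, 1) for _ in range(3)]

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]   # OR

def forward(x):
    h = [sigma(w[0] + w[1] * x[0] + w[2] * x[1]) for w in W1]
    o = sigma(W2[0] + W2[1] * h[0] + W2[2] * h[1])
    return h, o

def total_error():
    return sum((t - forward(x)[1]) ** 2 for x, t in data)

eta = 0.5
before = total_error()
for _ in range(5000):
    for x, t in data:
        h, o = forward(x)
        # Error at the output node, then propagated back to hidden nodes.
        delta_o = (t - o) * o * (1 - o)
        delta_h = [delta_o * W2[j + 1] * h[j] * (1 - h[j]) for j in range(2)]
        # Update output-layer weights.
        W2[0] += eta * delta_o
        for j in range(2):
            W2[j + 1] += eta * delta_o * h[j]
        # Update hidden-layer weights.
        for j in range(2):
            W1[j][0] += eta * delta_h[j]
            W1[j][1] += eta * delta_h[j] * x[0]
            W1[j][2] += eta * delta_h[j] * x[1]

after = total_error()
print(before, after)  # the squared error drops during training
```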


Backwards pass

https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/

BP Algorithm for Feed-Forward Neural Network Model

The least squares problem Ax ≈ b has the closed-form solution
x̂ = (AᵀA)⁻¹ Aᵀ b.

An iteration method approaches the optimal solution gradually through
successive updating steps.

Gradient descent, which belongs to the iteration methods, is applicable to
least squares problems.

The Gauss-Newton method is a commonly used iteration approach for solving
nonlinear least squares problems.

Levenberg-Marquardt is another iteration method for solving nonlinear least
squares problems.
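The closed form and the iterative route can be compared on a tiny system; the 3×2 matrix A and vector b below are illustrative.

```python
# Least squares for A x ~ b: normal equations (A^T A) x = A^T b versus
# gradient descent on f(x) = ||Ax - b||^2 with grad = 2 A^T (A x - b).
A = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
b = [1.0, 2.0, 2.0]

# Closed form, solving the 2x2 normal equations directly.
AtA = [[sum(A[k][i] * A[k][j] for k in range(3)) for j in range(2)]
       for i in range(2)]
Atb = [sum(A[k][i] * b[k] for k in range(3)) for i in range(2)]
det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0]
x_closed = [(AtA[1][1] * Atb[0] - AtA[0][1] * Atb[1]) / det,
            (AtA[0][0] * Atb[1] - AtA[1][0] * Atb[0]) / det]

# Gradient descent on the same objective.
x = [0.0, 0.0]
for _ in range(20000):
    r = [sum(A[k][j] * x[j] for j in range(2)) - b[k] for k in range(3)]
    g = [2 * sum(A[k][i] * r[k] for k in range(3)) for i in range(2)]
    x = [x[i] - 0.01 * g[i] for i in range(2)]

print(x_closed, x)  # both approximate the least-squares solution
```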

BP Algorithm for Feed-Forward Neural Network Model


It is a function of the weights and its minimum
exists. The BP algorithm uses the gradient
descent technique to find the minimum by
gradually updating the weights.

It is easy to see that the weights should be updated
along the negative gradient of the error.

The remaining task is to derive a convenient expression
for the gradient of the error with respect to each weight.


BP Algorithm for Feed-Forward Neural Network Model


In summary:


BP Algorithm for Feed-Forward Neural Network Model

1. Brief introduction
2. Feedforward NN
3. BP algorithm
4. Notes on BP
5. An application
6. Questions

• Learning process :
– Stimulated by input samples, the connection weights
update gradually, such that network outputs approach
expected outputs step by step.

• Learning essence :
– Dynamically update connection weights

• Learning rule :
– It is the rule of how updating the connection weights
(What rule is followed)

• Learning type : Supervised
• Key idea :
– The output error (in a suitable form) is back-propagated
to input layer via hidden layer(s)

Assigning the error to all units (nodes) in the
layers, and updating the weight of each node
• Features :
– Signal forward-propagated
– Error back-propagated

• Forward propagation :
– Input sample → input layer → every hidden layer
→ output layer
• Judging whether to go to back-propagation :
– If the difference between the actual and expected outputs
(in the output layer) is bigger than a threshold
• Back-propagation :
– Represent the error of each layer and update the weight
of each node
• Stop if the output error falls below a predefined threshold or the
number of iterations reaches the predefined maximum.
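The forward-pass / error-check / backward-pass loop above can be sketched for a one-hidden-layer network. This is a minimal NumPy sketch; the sigmoid activation, layer sizes, squared-error threshold, and XOR data are illustrative assumptions, not fixed by the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_bp(X, T, n_hidden=4, lr=0.5, max_iter=10000, tol=1e-2, seed=0):
    """One-hidden-layer BP loop: forward pass, error check, backward pass."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])            # inputs + bias unit
    W1 = rng.normal(scale=0.5, size=(Xb.shape[1], n_hidden))
    W2 = rng.normal(scale=0.5, size=(n_hidden + 1, T.shape[1]))
    err = np.inf
    for _ in range(max_iter):
        # forward propagation: input layer -> hidden layer -> output layer
        H = np.hstack([sigmoid(Xb @ W1), np.ones((len(Xb), 1))])  # + bias
        Y = sigmoid(H @ W2)
        err = 0.5 * np.sum((T - Y) ** 2)
        if err < tol:                                    # stop: error small
            break
        # back-propagation: assign the output error back through the layers
        d_out = (Y - T) * Y * (1 - Y)                    # output error term
        d_hid = (d_out @ W2[:-1].T) * H[:, :-1] * (1 - H[:, :-1])
        W2 -= lr * H.T @ d_out
        W1 -= lr * Xb.T @ d_hid
    return W1, W2, err

# XOR: the classic task that requires the hidden layer
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
T = np.array([[0], [1], [1], [0]], float)
_, _, final_err = train_bp(X, T)
```

Each iteration is one forward sweep over all four samples followed by one backward sweep, exactly the signal-forward / error-backward pattern described above.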
Related concepts of gradient descent

1. Learning rate: in gradient descent, the function value decreases along
the negative direction of the gradient; the learning rate determines the
step size of each iteration.

2. Feature: an input of the algorithm, used to describe the samples.

3. Hypothesis function: in supervised learning, the function that is fitted
to the training samples.

4. Loss function: measures the effectiveness of the hypothesis function;
it is generally computed as the square of the difference between the actual
outputs and the predicted (fitted) values.
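As a concrete illustration of points 3 and 4, a linear hypothesis function and its squared-error loss might look like this (a hypothetical sketch; the linear form h(x) = w*x + b and the sample values are made up for illustration):

```python
def hypothesis(w, b, x):
    """A linear hypothesis function h(x) = w*x + b, fitted to the samples."""
    return w * x + b

def squared_loss(w, b, samples):
    """Sum of squared differences between outputs and fitted predictions."""
    return sum((t - hypothesis(w, b, x)) ** 2 for x, t in samples)

samples = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]   # (feature, output) pairs
loss = squared_loss(2.0, 0.0, samples)            # near-perfect fit: small loss
```

Gradient descent would then adjust w and b along the negative gradient of this loss.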

Standard Gradient Descent: as described in the Gradient Descent Algorithm,
the calculation of the gradient is based on all the training samples (x, t).

Stochastic Gradient Descent: whereas the gradient descent training rule
presented in the Gradient Descent Algorithm computes the weight updates Δwi
after summing over all the training examples, the idea behind stochastic
gradient descent is to approximate this gradient descent search by updating
the weights wi incrementally, following the calculation of the error for
each individual example (x, t).

Batch Gradient Descent: the gradient is based on a batch of the training
samples.
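The three variants differ only in which examples form each weight update, which can be contrasted in a single sketch. This is a hypothetical NumPy example for a linear unit with squared error; the data, learning rate, and batch size are illustrative choices:

```python
import numpy as np

def lms_updates(X, t, w, lr=0.01, batch_size=None):
    """One pass of gradient descent on E(w) = 1/2 * sum((t - X @ w)**2).

    batch_size=None -> standard GD: one update from ALL samples
    batch_size=1    -> stochastic GD: one update per individual sample
    batch_size=k    -> batch GD: one update per group of k samples
    """
    n = len(t)
    step = n if batch_size is None else batch_size
    w = w.copy()
    for i in range(0, n, step):
        Xb, tb = X[i:i + step], t[i:i + step]
        grad = -Xb.T @ (tb - Xb @ w)   # gradient of the summed squared error
        w -= lr * grad                  # move along the negative gradient
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([1.0, -2.0, 0.5])                # a known linear target
w0 = np.zeros(3)
w_std = lms_updates(X, t, w0)                     # standard
w_sgd = lms_updates(X, t, w0, batch_size=1)       # stochastic
w_bat = lms_updates(X, t, w0, batch_size=10)      # batches of 10
```

All three move the weights toward the target; they differ in how often the weights change per pass and in how noisy each individual update is.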

Remarks

The key differences between standard gradient descent and stochastic
gradient descent are:

• In standard gradient descent, the error is summed over all examples (x, t)
before updating the weights wi, whereas in stochastic gradient descent the
weights are updated upon examining each training example.

• Summing over multiple examples in standard gradient descent requires more
computation per weight update step. On the other hand, because it uses the
true gradient, standard gradient descent is often used with a larger step
size per weight update than stochastic gradient descent.

• In cases where there are multiple local minima with respect to the
objective function, stochastic gradient descent can sometimes avoid falling
into these local minima because it uses the various ∇Ed(w) rather than
∇E(w) to guide its search.
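The distinction in the last remark, between the individual gradients ∇Ed(w) and the true gradient ∇E(w), can be made concrete for a single-weight linear unit (a hypothetical sketch; the data values are made up):

```python
import numpy as np

# E(w) = 1/2 * sum_d (t_d - w * x_d)^2 for a single-weight linear unit
x = np.array([1.0, 2.0, 3.0])
t = np.array([2.0, 3.9, 6.1])
w = 1.0

grads_d = -(t - w * x) * x   # the various per-example gradients  dEd/dw
grad_true = grads_d.sum()    # the true gradient dE/dw sums over all d

# Standard GD takes one step along -grad_true; stochastic GD takes a
# step along each -grads_d[d] in turn, so its search path differs.
```

The per-example gradients point in different directions and with different magnitudes, which is what lets the stochastic search wander off the path the true gradient would follow.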

• A 3-layer feed-forward neural network: Neural
network learning to steer an autonomous vehicle

Questions:

1. If the features are not numerical but symbolic, i.e., the
input-output system takes symbols as input and produces a
real number as output, how would you use BP to train the
approximator?
2. In comparison with the real-valued case, how would its
performance differ?
3. In your own opinion, how should the step size in the
Gradient Descent Algorithm be selected empirically?

Feedforward NN and
BP Algorithm

The End.
