Gradient Descent

Deep Learning
By
T.K. Damodharan
Vice President, RBS
Reg.No: PC2013003013008

Under the guidance of


Dr V.Rajasekar,
Associate Professor,
Department of Computer Science & Engineering,
SRM Institute of Science and Technology-Vadapalani Campus.
Gradient Descent

 Gradient descent is by far the most popular optimization strategy used in machine learning and deep learning at the moment.
 It is used when training models, can be combined with every algorithm, and is easy to understand and implement.
 Everyone working with machine learning should understand its concept.
Gradient Descent

 Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable function.
 Gradient descent is simply used to find the values of a function's parameters (coefficients) that minimize a cost function as far as possible.
 It's based on a convex function and tweaks its parameters iteratively to minimize a given function to its local minimum.
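As a minimal sketch (in Python, not from the slides), gradient descent can minimize the convex function f(w) = (w − 3)², whose minimum is at w = 3:

```python
# Minimal gradient descent sketch: minimize f(w) = (w - 3)^2.
# The derivative f'(w) = 2 * (w - 3) points uphill, so we step against it.

def gradient_descent(start, learning_rate=0.1, steps=100):
    w = start
    for _ in range(steps):
        grad = 2 * (w - 3)            # derivative of (w - 3)^2
        w = w - learning_rate * grad  # move against the gradient
    return w

w = gradient_descent(start=0.0)
print(round(w, 4))  # converges to 3.0, the minimum
```

The starting point and learning rate here are arbitrary illustrative values; any start converges for this convex function.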
What is a Gradient

"A gradient measures how much the output of a function changes if you change the inputs a little bit." — Lex Fridman (MIT)

A gradient simply measures the change in all weights with regard to the change in error.
You can also think of a gradient as the slope of a function. The higher the gradient, the steeper the slope and the faster a model can learn.
But if the slope is zero, the model stops learning.
In mathematical terms, a gradient is the vector of partial derivatives of a function with respect to its inputs.
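To make the "partial derivative" definition concrete, here is a small sketch (an illustration, not from the slides) that checks the analytic gradient of f(x, y) = x² + y² against finite differences:

```python
# A gradient is the vector of partial derivatives. For f(x, y) = x^2 + y^2
# the analytic gradient is (2x, 2y); we verify it with central differences.

def f(x, y):
    return x ** 2 + y ** 2

def numerical_gradient(x, y, eps=1e-6):
    dfdx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    dfdy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return dfdx, dfdy

gx, gy = numerical_gradient(1.0, 2.0)
print(round(gx, 4), round(gy, 4))  # close to the analytic (2.0, 4.0)
```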
What is a Gradient

Imagine a blindfolded man who wants to climb to the top of a hill with as few steps as possible.
He might start climbing the hill by taking really big steps in the steepest direction, which he can do as long as he is not close to the top.
As he comes closer to the top, however, his steps will get smaller and smaller to avoid overshooting it.
This process can be described mathematically using the gradient.
What is a Gradient

Imagine the image below illustrates our hill from a top-down view and the red arrows are the steps of our climber.
Think of a gradient in this context as a vector that contains the direction of the steepest step the blindfolded man can take and also how long that step should be.
What is a Gradient

Note that the gradient ranging from X0 to X1 is much longer than the one reaching from X3 to X4.
This is because the steepness/slope of the hill, which determines the length of the vector, is smaller near the top.
This perfectly represents the example of the hill, because the hill gets less steep the higher it's climbed.
Therefore a reduced gradient goes along with a reduced slope and a reduced step size for the hill climber.
How Gradient Descent Works

Instead of climbing up a hill, think of gradient descent as hiking down to the bottom of a valley.
This is a better analogy because gradient descent is a minimization algorithm that minimizes a given function.

Equation: b = a − γ∇f(a)

b is the next position of our climber, while a represents his current position.
The minus sign refers to the minimization part of gradient descent.
The gamma (γ) in the middle is a weighting factor (the learning rate), and the gradient term ∇f(a) gives the direction of the steepest step.
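The update rule described above (b = a − γ∇f(a)) can be sketched in a few lines of Python (an illustration with made-up values, not from the slides):

```python
# One gradient-descent step, b = a - gamma * grad_f(a), here applied to
# f(a) = a^2, whose gradient is grad_f(a) = 2a and whose minimum is at 0.

def step(a, gamma, grad_f):
    return a - gamma * grad_f(a)

grad_f = lambda a: 2 * a  # gradient of f(a) = a^2
b = step(a=4.0, gamma=0.25, grad_f=grad_f)
print(b)  # 2.0: the climber moves from a = 4.0 toward the minimum at 0
```

Repeating this step is exactly the iterative "tweaking of parameters" described earlier.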
Gradient Descent

More details and types of gradient descent:
https://builtin.com/data-science/gradient-descent

Step-by-step video guide:
https://youtu.be/sDv4f4s2SB8
Linear Models

A strong high-bias assumption is linear separability:
 in 2 dimensions, can separate classes by a line
 in higher dimensions, need hyperplanes

A linear model is a model that assumes the data is linearly separable
Linear Regression

DATASET

inputs      outputs
x1 = 1      y1 = 1
x2 = 3      y2 = 2.2
x3 = 2      y3 = 2
x4 = 1.5    y4 = 1.9
x5 = 4      y5 = 3.1

Linear regression assumes that the expected value of the output given an input, E[y|x], is linear.
Simplest case: Out(x) = wx for some unknown w.
Given the data, we can estimate w.

Copyright © 2001, 2003, Andrew W. Moore. Neural Networks: Slide 14
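For the model Out(x) = wx, minimizing the squared error sum((y − wx)²) has the closed-form solution w = Σxy / Σx². A sketch using the dataset above (the closed form is standard least squares, not stated on the slide):

```python
# Least-squares estimate of w for Out(x) = w * x on the slide's dataset.
# Setting d/dw sum((y - w*x)^2) = 0 gives w = sum(x*y) / sum(x^2).

xs = [1.0, 3.0, 2.0, 1.5, 4.0]
ys = [1.0, 2.2, 2.0, 1.9, 3.1]

w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(round(w, 4))  # ≈ 0.8326
```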
Linear models

A linear model in m-dimensional space (i.e. m features) is defined by m+1 weights:

In two dimensions, a line:
0 = w1 f1 + w2 f2 + b   (where b = -a)

In three dimensions, a plane:
0 = w1 f1 + w2 f2 + w3 f3 + b

In m dimensions, a hyperplane:
0 = b + Σ_{j=1}^{m} wj fj
Which line will it find?

Only guaranteed to find some line that separates the data
Linear models

Perceptron algorithm is one example of a linear classifier

Many, many other algorithms that learn a line (i.e. a setting of a linear combination of weights)

Goals:
 Explore a number of linear training algorithms
 Understand why these algorithms work
Linear models in general

1. pick a model

0 = b + Σ_{j=1}^{m} wj fj

These are the parameters we want to learn

2. pick a criterion to optimize (aka objective function)
Some notation: indicator function

1[x] = { 1 if x = True
       { 0 if x = False

Convenient notation for turning T/F answers into numbers/counts:

drinks_to_bring_for_class = Σ_{x ∈ class} 1[x >= 21]
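The indicator notation maps directly onto code. A small sketch (the ages are made-up values for illustration):

```python
# Indicator function: turns a True/False answer into 1/0 so that
# conditions can be summed into counts, as in the slide's example.

def indicator(condition):
    return 1 if condition else 0

ages = [19, 23, 21, 17, 30]
drinks_to_bring_for_class = sum(indicator(age >= 21) for age in ages)
print(drinks_to_bring_for_class)  # 3
```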
Some notation: dot-product

Sometimes it is convenient to use vector notation

We represent an example f1, f2, …, fm as a single vector, x

Similarly, we can represent the weight vector w1, w2, …, wm as a single vector, w

The dot-product between two vectors a and b is defined as:

a · b = Σ_{j=1}^{m} aj bj
Linear models

1. pick a model

0 = b + Σ_{j=1}^{m} wj fj

These are the parameters we want to learn

2. pick a criterion to optimize (aka objective function)

Σ_{i=1}^{n} 1[ yi (w · xi + b) <= 0 ]

What does this equation say?
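The objective Σ 1[yi (w · xi + b) <= 0] counts the training examples the model gets wrong (with labels yi in {+1, −1}). A sketch with made-up weights and toy data:

```python
# 0/1 loss: count examples where y_i * (w·x_i + b) <= 0, i.e. the model's
# prediction disagrees with the label (or the point sits on the boundary).
# w, b, and the data are made-up values for illustration.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def zero_one_loss(w, b, data):
    return sum(1 for x, y in data if y * (dot(w, x) + b) <= 0)

data = [([1.0, 2.0], 1), ([2.0, 1.0], -1), ([0.5, 0.5], 1)]
w, b = [-1.0, 1.0], 0.0
print(zero_one_loss(w, b, data))  # 1: the third point lies on the boundary
```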


Convex functions

Convex functions look something like:

One definition: the line segment between any two points on the function is above the function
Finding the minimum

You're blindfolded, but you can see out of the bottom of the blindfold to the ground right by your feet. I drop you off somewhere and tell you that you're in a convex shaped valley and escape is at the bottom/minimum. How do you get out?
Finding the minimum

How do we do this for a function?

One approach: gradient descent

Partial derivatives give us the slope (i.e. direction to move) in that dimension

[Figure: a convex loss curve over a weight w]
One approach: gradient descent

Partial derivatives give us the slope (i.e. direction to move) in that dimension

Approach:
 pick a starting point (w)
 repeat:
 pick a dimension
 move a small amount in that dimension towards decreasing loss (using the derivative)
Gradient descent

 pick a starting point (w)
 repeat until loss doesn't decrease in all dimensions:
 pick a dimension
 move a small amount in that dimension towards decreasing loss (using the derivative)

wj = wj − η (d/dwj) loss(w)

What does this do?


Gradient descent

 pick a starting point (w)
 repeat until loss doesn't decrease in all dimensions:
 pick a dimension
 move a small amount in that dimension towards decreasing loss (using the derivative)

wj = wj − η (d/dwj) loss(w)

η: learning rate (how much we want to move in the error direction; often this will change over time)
Some maths

(d/dwj) loss = (d/dwj) Σ_{i=1}^{n} exp(−yi (w · xi + b))

             = Σ_{i=1}^{n} exp(−yi (w · xi + b)) · (d/dwj) [−yi (w · xi + b)]

             = Σ_{i=1}^{n} −yi xij exp(−yi (w · xi + b))
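The derivative above can be sanity-checked numerically (a verification sketch with made-up toy data and weights, not part of the slides):

```python
import math

# Check the derivation: d/dw_j sum_i exp(-y_i*(w·x_i + b)) should equal
# sum_i -y_i * x_ij * exp(-y_i*(w·x_i + b)). We compare the analytic
# gradient against a central finite difference.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def loss(w, b, data):
    return sum(math.exp(-y * (dot(w, x) + b)) for x, y in data)

def analytic_grad_j(w, b, data, j):
    return sum(-y * x[j] * math.exp(-y * (dot(w, x) + b)) for x, y in data)

data = [([1.0, 2.0], 1), ([2.0, 1.0], -1)]
w, b, j, eps = [0.3, -0.2], 0.1, 0, 1e-6

w_hi = w[:]; w_hi[j] += eps
w_lo = w[:]; w_lo[j] -= eps
numeric = (loss(w_hi, b, data) - loss(w_lo, b, data)) / (2 * eps)
print(abs(numeric - analytic_grad_j(w, b, data, j)) < 1e-6)  # True
```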
Gradient descent

 pick a starting point (w)
 repeat until loss doesn't decrease in all dimensions:
 pick a dimension
 move a small amount in that dimension towards decreasing loss (using the derivative)

wj = wj + η Σ_{i=1}^{n} yi xij exp(−yi (w · xi + b))

What is this doing?
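The batch update above can be sketched as a training loop (the toy dataset and learning rate are made-up values for illustration; the loss should fall as the weights move):

```python
import math

# Batch gradient descent with the exponential-loss update:
# w_j = w_j + eta * sum_i y_i * x_ij * exp(-y_i*(w·x_i + b)).

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def loss(w, b, data):
    return sum(math.exp(-y * (dot(w, x) + b)) for x, y in data)

data = [([1.0, 2.0], 1), ([2.0, 1.0], -1), ([0.0, 1.0], 1)]
w, b, eta = [0.0, 0.0], 0.0, 0.1

before = loss(w, b, data)
for _ in range(50):
    # evaluate the "how far from wrong" terms at the current weights,
    # then update every dimension with the same batch gradient
    terms = [math.exp(-y * (dot(w, x) + b)) for x, y in data]
    for j in range(len(w)):
        w[j] += eta * sum(y * x[j] * t for (x, y), t in zip(data, terms))
after = loss(w, b, data)
print(after < before)  # True: the loss decreases as training proceeds
```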


Exponential update rule

wj = wj + η Σ_{i=1}^{n} yi xij exp(−yi (w · xi + b))

for each example xi:

wj = wj + η yi xij exp(−yi (w · xi + b))


Summary

Gradient descent minimization algorithm:
 requires that our loss function is convex
 makes small updates towards lower losses
Gradient descent

 pick a starting point (w)
 repeat until loss doesn't decrease in all dimensions:
 pick a dimension
 move a small amount in that dimension towards decreasing loss (using the derivative)

wj = wj − η (d/dwj) (loss(w) + regularizer(w, b))

wj = wj + η Σ_{i=1}^{n} yi xij exp(−yi (w · xi + b)) − η λ wj
The update

wj = wj + η yi xij exp(−yi (w · xi + b)) − η λ wj

η: learning rate
yi xij: direction to update
exp(−yi (w · xi + b)): constant: how far from wrong
η λ wj: regularization

What effect does the regularizer have?


The update

wj = wj + η yi xij exp(−yi (w · xi + b)) − η λ wj

η: learning rate
yi xij: direction to update
exp(−yi (w · xi + b)): constant: how far from wrong
η λ wj: regularization

If wj is positive, the regularizer reduces wj; if wj is negative, it increases wj: in both cases it moves wj towards 0
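The shrinking effect of the −ηλwj term can be seen in isolation (a sketch with made-up η, λ, and starting weights; the data term is dropped so only the regularizer acts):

```python
# Effect of the regularization term -eta*lambda*w_j: with no data term,
# each step multiplies w_j by (1 - eta*lambda), pulling it toward 0 from
# either side.

eta, lam = 0.1, 0.5
for w0 in (4.0, -4.0):
    w = w0
    for _ in range(100):
        w = w - eta * lam * w  # positive w shrinks, negative w grows toward 0
    print(abs(w) < 0.1)  # True for both starting points
```

This is the "weight decay" view of L2 regularization: left alone, every weight decays geometrically toward zero.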
