Gradient Descent

Deep Learning
By
T.K. Damodharan
Vice President, RBS
Reg.No: PC2013003013008

Under the guidance of


Dr V.Rajasekar,
Associate Professor,
Department of Computer Science & Engineering,
SRM Institute of Science and Technology-Vadapalani Campus.
Gradient Descent

 Gradient descent is by far the most popular optimization strategy used in machine learning and deep learning at the moment.
 It is used when training models, can be combined with every algorithm, and is easy to understand and implement.
 Everyone working with machine learning should understand its concept.
Gradient Descent

 Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable function.
 Gradient descent is simply used to find the values of a function's parameters (coefficients) that minimize a cost function as far as possible.
 It's based on a convex function and tweaks its parameters iteratively to minimize a given function to its local minimum.
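As a minimal sketch (in Python, not from the slides), gradient descent can minimize the convex function f(w) = (w − 3)², whose minimum is at w = 3:

```python
# Minimal gradient descent sketch: minimize f(w) = (w - 3)^2.
# The derivative f'(w) = 2 * (w - 3) points uphill, so we step against it.

def gradient_descent(start, learning_rate=0.1, steps=100):
    w = start
    for _ in range(steps):
        grad = 2 * (w - 3)            # derivative of (w - 3)^2
        w = w - learning_rate * grad  # move against the gradient
    return w

w = gradient_descent(start=0.0)
print(round(w, 4))  # converges to 3.0, the minimum
```

The starting point and learning rate here are arbitrary illustrative values; any start converges for this convex function.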
What is a Gradient

"A gradient measures how much the output of a function changes if you change the inputs a little bit." — Lex Fridman (MIT)

A gradient simply measures the change in all weights with regard to the change in error.
You can also think of a gradient as the slope of a function. The higher the gradient, the steeper the slope and the faster a model can learn.
But if the slope is zero, the model stops learning.
In mathematical terms, a gradient is the vector of partial derivatives of a function with respect to its inputs.
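To make the "partial derivative" definition concrete, here is a small sketch (an illustration, not from the slides) that checks the analytic gradient of f(x, y) = x² + y² against finite differences:

```python
# A gradient is the vector of partial derivatives. For f(x, y) = x^2 + y^2
# the analytic gradient is (2x, 2y); we verify it with central differences.

def f(x, y):
    return x ** 2 + y ** 2

def numerical_gradient(x, y, eps=1e-6):
    dfdx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    dfdy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return dfdx, dfdy

gx, gy = numerical_gradient(1.0, 2.0)
print(round(gx, 4), round(gy, 4))  # close to the analytic (2.0, 4.0)
```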
What is a Gradient

Imagine a blindfolded man who wants to climb to the top of a hill with as few steps as possible.
He might start climbing the hill by taking really big steps in the steepest direction, which he can do as long as he is not close to the top.
As he comes closer to the top, however, his steps will get smaller and smaller to avoid overshooting it.
This process can be described mathematically using the gradient.
What is a Gradient

Imagine the image below illustrates our hill from a top-down view and the red arrows are the steps of our climber.
Think of a gradient in this context as a vector that contains the direction of the steepest step the blindfolded man can take and also how long that step should be.
What is a Gradient

Note that the gradient ranging from X0 to X1 is much longer than the one reaching from X3 to X4.
This is because the steepness/slope of the hill, which determines the length of the vector, is smaller near the top.
This perfectly represents the example of the hill, because the hill gets less steep the higher it's climbed.
Therefore a reduced gradient goes along with a reduced slope and a reduced step size for the hill climber.
How Gradient Descent Works

Instead of climbing up a hill, think of gradient descent as hiking down to the bottom of a valley.
This is a better analogy because gradient descent is a minimization algorithm that minimizes a given function.

Equation: b = a − γ∇f(a)

b is the next position of our climber, while a represents his current position.
The minus sign refers to the minimization part of gradient descent.
The gamma (γ) in the middle is a weighting factor (the learning rate), and the gradient term ∇f(a) gives the direction of the steepest step.
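The update rule described above (b = a − γ∇f(a)) can be sketched in a few lines of Python (an illustration with made-up values, not from the slides):

```python
# One gradient-descent step, b = a - gamma * grad_f(a), here applied to
# f(a) = a^2, whose gradient is grad_f(a) = 2a and whose minimum is at 0.

def step(a, gamma, grad_f):
    return a - gamma * grad_f(a)

grad_f = lambda a: 2 * a  # gradient of f(a) = a^2
b = step(a=4.0, gamma=0.25, grad_f=grad_f)
print(b)  # 2.0: the climber moves from a = 4.0 toward the minimum at 0
```

Repeating this step is exactly the iterative "tweaking of parameters" described earlier.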
Gradient Descent

More details and types of gradient descent:
https://builtin.com/data-science/gradient-descent

Step-by-step video guide:
https://youtu.be/sDv4f4s2SB8
Linear Models

A strong high-bias assumption is linear separability:
 in 2 dimensions, can separate classes by a line
 in higher dimensions, need hyperplanes

A linear model is a model that assumes the data is linearly separable
Linear Regression

DATASET

inputs      outputs
x1 = 1      y1 = 1
x2 = 3      y2 = 2.2
x3 = 2      y3 = 2
x4 = 1.5    y4 = 1.9
x5 = 4      y5 = 3.1

Linear regression assumes that the expected value of the output given an input, E[y|x], is linear.
Simplest case: Out(x) = wx for some unknown w.
Given the data, we can estimate w.

Copyright © 2001, 2003, Andrew W. Moore. Neural Networks: Slide 14
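For the model Out(x) = wx, minimizing the squared error sum((y − wx)²) has the closed-form solution w = Σxy / Σx². A sketch using the dataset above (the closed form is standard least squares, not stated on the slide):

```python
# Least-squares estimate of w for Out(x) = w * x on the slide's dataset.
# Setting d/dw sum((y - w*x)^2) = 0 gives w = sum(x*y) / sum(x^2).

xs = [1.0, 3.0, 2.0, 1.5, 4.0]
ys = [1.0, 2.2, 2.0, 1.9, 3.1]

w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(round(w, 4))  # ≈ 0.8326
```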
Linear models

A linear model in m-dimensional space (i.e. m features) is defined by m+1 weights:

In two dimensions, a line:
0 = w1 f1 + w2 f2 + b   (where b = -a)

In three dimensions, a plane:
0 = w1 f1 + w2 f2 + w3 f3 + b

In m dimensions, a hyperplane:
0 = b + Σ_{j=1}^{m} wj fj
Which line will it find?

Only guaranteed to find some line that separates the data
Linear models

Perceptron algorithm is one example of a linear classifier

Many, many other algorithms that learn a line (i.e. a setting of a linear combination of weights)

Goals:
 Explore a number of linear training algorithms
 Understand why these algorithms work
Linear models in general

1. pick a model

0 = b + Σ_{j=1}^{m} wj fj

These are the parameters we want to learn

2. pick a criterion to optimize (aka objective function)
Some notation: indicator function

1[x] = { 1 if x = True
       { 0 if x = False

Convenient notation for turning T/F answers into numbers/counts:

drinks_to_bring_for_class = Σ_{x ∈ class} 1[x >= 21]
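The indicator notation maps directly onto code. A small sketch (the ages are made-up values for illustration):

```python
# Indicator function: turns a True/False answer into 1/0 so that
# conditions can be summed into counts, as in the slide's example.

def indicator(condition):
    return 1 if condition else 0

ages = [19, 23, 21, 17, 30]
drinks_to_bring_for_class = sum(indicator(age >= 21) for age in ages)
print(drinks_to_bring_for_class)  # 3
```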
Some notation: dot-product

Sometimes it is convenient to use vector notation

We represent an example f1, f2, …, fm as a single vector, x

Similarly, we can represent the weight vector w1, w2, …, wm as a single vector, w

The dot-product between two vectors a and b is defined as:

a · b = Σ_{j=1}^{m} aj bj
Linear models

1. pick a model

0 = b + Σ_{j=1}^{m} wj fj

These are the parameters we want to learn

2. pick a criterion to optimize (aka objective function)

Σ_{i=1}^{n} 1[ yi (w · xi + b) <= 0 ]

What does this equation say?
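The objective Σ 1[yi (w · xi + b) <= 0] counts the training examples the model gets wrong (with labels yi in {+1, −1}). A sketch with made-up weights and toy data:

```python
# 0/1 loss: count examples where y_i * (w·x_i + b) <= 0, i.e. the model's
# prediction disagrees with the label (or the point sits on the boundary).
# w, b, and the data are made-up values for illustration.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def zero_one_loss(w, b, data):
    return sum(1 for x, y in data if y * (dot(w, x) + b) <= 0)

data = [([1.0, 2.0], 1), ([2.0, 1.0], -1), ([0.5, 0.5], 1)]
w, b = [-1.0, 1.0], 0.0
print(zero_one_loss(w, b, data))  # 1: the third point lies on the boundary
```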


Convex functions

Convex functions look something like:

One definition: the line segment between any two points on the function is above the function
Finding the minimum

You're blindfolded, but you can see out of the bottom of the blindfold to the ground right by your feet. I drop you off somewhere and tell you that you're in a convex shaped valley and escape is at the bottom/minimum. How do you get out?
Finding the minimum

How do we do this for a function?

One approach: gradient descent

Partial derivatives give us the slope (i.e. direction to move) in that dimension

[Figure: a convex loss curve over a weight w]
One approach: gradient descent

Partial derivatives give us the slope (i.e. direction to move) in that dimension

Approach:
 pick a starting point (w)
 repeat:
 pick a dimension
 move a small amount in that dimension towards decreasing loss (using the derivative)
Gradient descent

 pick a starting point (w)
 repeat until loss doesn't decrease in all dimensions:
 pick a dimension
 move a small amount in that dimension towards decreasing loss (using the derivative)

wj = wj − η (d/dwj) loss(w)

What does this do?


Gradient descent

 pick a starting point (w)
 repeat until loss doesn't decrease in all dimensions:
 pick a dimension
 move a small amount in that dimension towards decreasing loss (using the derivative)

wj = wj − η (d/dwj) loss(w)

η: learning rate (how much we want to move in the error direction; often this will change over time)
Some maths

(d/dwj) loss = (d/dwj) Σ_{i=1}^{n} exp(−yi (w · xi + b))

             = Σ_{i=1}^{n} exp(−yi (w · xi + b)) · (d/dwj) [−yi (w · xi + b)]

             = Σ_{i=1}^{n} −yi xij exp(−yi (w · xi + b))
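The derivative above can be sanity-checked numerically (a verification sketch with made-up toy data and weights, not part of the slides):

```python
import math

# Check the derivation: d/dw_j sum_i exp(-y_i*(w·x_i + b)) should equal
# sum_i -y_i * x_ij * exp(-y_i*(w·x_i + b)). We compare the analytic
# gradient against a central finite difference.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def loss(w, b, data):
    return sum(math.exp(-y * (dot(w, x) + b)) for x, y in data)

def analytic_grad_j(w, b, data, j):
    return sum(-y * x[j] * math.exp(-y * (dot(w, x) + b)) for x, y in data)

data = [([1.0, 2.0], 1), ([2.0, 1.0], -1)]
w, b, j, eps = [0.3, -0.2], 0.1, 0, 1e-6

w_hi = w[:]; w_hi[j] += eps
w_lo = w[:]; w_lo[j] -= eps
numeric = (loss(w_hi, b, data) - loss(w_lo, b, data)) / (2 * eps)
print(abs(numeric - analytic_grad_j(w, b, data, j)) < 1e-6)  # True
```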
Gradient descent

 pick a starting point (w)
 repeat until loss doesn't decrease in all dimensions:
 pick a dimension
 move a small amount in that dimension towards decreasing loss (using the derivative)

wj = wj + η Σ_{i=1}^{n} yi xij exp(−yi (w · xi + b))

What is this doing?
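The batch update above can be sketched as a training loop (the toy dataset and learning rate are made-up values for illustration; the loss should fall as the weights move):

```python
import math

# Batch gradient descent with the exponential-loss update:
# w_j = w_j + eta * sum_i y_i * x_ij * exp(-y_i*(w·x_i + b)).

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def loss(w, b, data):
    return sum(math.exp(-y * (dot(w, x) + b)) for x, y in data)

data = [([1.0, 2.0], 1), ([2.0, 1.0], -1), ([0.0, 1.0], 1)]
w, b, eta = [0.0, 0.0], 0.0, 0.1

before = loss(w, b, data)
for _ in range(50):
    # evaluate the "how far from wrong" terms at the current weights,
    # then update every dimension with the same batch gradient
    terms = [math.exp(-y * (dot(w, x) + b)) for x, y in data]
    for j in range(len(w)):
        w[j] += eta * sum(y * x[j] * t for (x, y), t in zip(data, terms))
after = loss(w, b, data)
print(after < before)  # True: the loss decreases as training proceeds
```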


Exponential update rule

wj = wj + η Σ_{i=1}^{n} yi xij exp(−yi (w · xi + b))

for each example xi:

wj = wj + η yi xij exp(−yi (w · xi + b))


Summary

Gradient descent minimization algorithm:
 requires that our loss function is convex
 makes small updates towards lower losses
Gradient descent

 pick a starting point (w)
 repeat until loss doesn't decrease in all dimensions:
 pick a dimension
 move a small amount in that dimension towards decreasing loss (using the derivative)

wj = wj − η (d/dwj) (loss(w) + regularizer(w, b))

wj = wj + η Σ_{i=1}^{n} yi xij exp(−yi (w · xi + b)) − η λ wj
The update

wj = wj + η yi xij exp(−yi (w · xi + b)) − η λ wj

η: learning rate
yi xij: direction to update
exp(−yi (w · xi + b)): constant: how far from wrong
η λ wj: regularization

What effect does the regularizer have?


The update

wj = wj + η yi xij exp(−yi (w · xi + b)) − η λ wj

η: learning rate
yi xij: direction to update
exp(−yi (w · xi + b)): constant: how far from wrong
η λ wj: regularization

If wj is positive, the regularizer reduces wj; if wj is negative, it increases wj: in both cases it moves wj towards 0
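The shrinking effect of the −ηλwj term can be seen in isolation (a sketch with made-up η, λ, and starting weights; the data term is dropped so only the regularizer acts):

```python
# Effect of the regularization term -eta*lambda*w_j: with no data term,
# each step multiplies w_j by (1 - eta*lambda), pulling it toward 0 from
# either side.

eta, lam = 0.1, 0.5
for w0 in (4.0, -4.0):
    w = w0
    for _ in range(100):
        w = w - eta * lam * w  # positive w shrinks, negative w grows toward 0
    print(abs(w) < 0.1)  # True for both starting points
```

This is the "weight decay" view of L2 regularization: left alone, every weight decays geometrically toward zero.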
