Linear regression with one variable
Model representation

Machine Learning
Andrew Ng
[Scatter plot: Housing Prices (Portland, OR). Price (in 1000s of dollars) vs. Size (feet²)]

Supervised Learning: given the "right answer" for each example in the data.
Regression problem: predict real-valued output.
(Classification: discrete-valued output.)
Training set of housing prices (Portland, OR):

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …

Notation:
  m   = number of training examples
  x's = "input" variable / features
  y's = "output" variable / "target" variable
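A minimal sketch of this notation in code, using the values from the table above (the variable names are mine, for illustration only):

```python
# Training set from the table: x = size in feet^2, y = price in $1000's
xs = [2104, 1416, 1534, 852]
ys = [460, 232, 315, 178]

m = len(xs)  # m = number of training examples
```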
Training Set → Learning Algorithm → h

How do we represent h? What this function is doing is predicting that y is some straight-line function of x:

  hθ(x) = θ0 + θ1x

Size of house → h → Estimated price (the estimated value of y)

This model is linear regression with one variable, also called univariate linear regression.
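A minimal sketch of the hypothesis as code (the parameter values in the example are illustrative, not from the course):

```python
# Univariate linear regression hypothesis: h(x) = theta0 + theta1 * x
def h(x, theta0, theta1):
    return theta0 + theta1 * x
```

For example, with theta0 = 50 and theta1 = 0.1, a 1000 ft² house would be estimated at h(1000) = 150, i.e. $150,000.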

Linear regression with one variable
Cost function

Machine Learning
Training Set:

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …

Hypothesis: hθ(x) = θ0 + θ1x

θi's: parameters

How do we choose the θi's?
With different choices of the parameters θ0 and θ1, we get different hypothesis functions. Examples of what h(x) can look like:

[Three example plots of h(x) for different choices of (θ0, θ1)]

[Plot of y vs. x: for each training example, the vertical gap between the prediction hθ(x) and the actual price y]

Idea: choose θ0, θ1 so that hθ(x) is close to y for our training examples (x, y).

Aim: minimize over θ0, θ1 the cost function

  J(θ0, θ1) = (1/2m) Σi=1..m (hθ(x^(i)) - y^(i))²

J is called the cost function; this particular J is the squared error function.
To break it apart, it is ½ x̄, where x̄ is the mean of the squares of hθ(x^(i)) - y^(i), the difference between the predicted value and the actual value.

This function is otherwise called the "squared error function" or "mean squared error". The mean is halved (the factor of ½) as a convenience for computing gradient descent, since differentiating the squared term produces a factor of 2 that cancels the ½.

It turns out that this squared error cost function is a reasonable choice and works well for most regression problems. Other cost functions also work, but the squared error cost function is probably the most commonly used for regression problems.

Next: what the cost function J is computing, and why we want to use it.
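A minimal sketch of this cost function in code (the function and variable names are mine, not the course's):

```python
# Squared error cost: J(theta0, theta1) = (1/2m) * sum((h(x_i) - y_i)^2)
def compute_cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)
```

A hypothesis that passes through every training point gives J = 0, the ideal case.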

Linear regression with one variable
Cost function intuition I

Machine Learning
We want to fit a straight line to our data, so we have the hypothesis hθ(x) = θ0 + θ1x; with different choices of the parameters we end up with different straight-line hypothesis functions.

Simplified hypothesis: set θ0 = 0, i.e. choose only hypothesis functions that pass through the origin, the point (0, 0).

1) Hypothesis: hθ(x) = θ1x

2) Parameter: θ1

3) Cost function: J(θ1) = (1/2m) Σi=1..m (hθ(x^(i)) - y^(i))²

Goal: minimize J(θ1) over θ1
[A sequence of paired plots. Left: hθ(x) for a fixed θ1, a function of x. Right: J(θ1), a function of the parameter θ1. Each slide plots a different value of θ1 and the corresponding point on the J(θ1) curve.]

Each value of θ1 corresponds to a different hypothesis, i.e. a different straight-line fit on the left, and for each value of θ1 we can derive a different value of J(θ1). A line that fits the data poorly gives a very high error.

[Left: hθ(x) for a fixed θ1, a function of x. Right: the point (θ1, J(θ1)) on the cost curve, a function of the parameter θ1]
Our objective is to get the best possible line: the line for which the average of the squared vertical distances of the scattered points from the line is smallest. Ideally the line would pass through every point of the training set, in which case the value of J(θ0, θ1) would be 0.

We looked at these plots to understand the cost function. To do so, we simplified the algorithm so that it had only one parameter, θ1, setting the parameter θ0 to zero.

Next we'll go back to the original problem formulation and look at visualizations involving both θ0 and θ1, that is, without setting θ0 to zero, to give an even better sense of what the cost function J is doing in the original linear regression formulation.
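The simplified setting above can be sketched in code. Here is a hypothetical tiny dataset lying exactly on y = x, so the minimum of J(θ1) is at θ1 = 1:

```python
# Sweep theta1 and evaluate J(theta1) for the simplified hypothesis h(x) = theta1 * x.
def cost(theta1, xs, ys):
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1, 2, 3], [1, 2, 3]                            # data on the line y = x
sweep = {t / 2: cost(t / 2, xs, ys) for t in range(5)}   # theta1 in {0, 0.5, 1, 1.5, 2}
```

Plotting `sweep` gives the bowl-shaped J(θ1) curve from the slides: zero at θ1 = 1 and symmetric around it.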

Linear regression with one variable
Cost function intuition II

Machine Learning

Hypothesis: hθ(x) = θ0 + θ1x

Parameters: θ0, θ1

Cost function: J(θ0, θ1) = (1/2m) Σi=1..m (hθ(x^(i)) - y^(i))²

Goal: minimize J(θ0, θ1) over θ0, θ1

[Left: hθ(x) for fixed θ0, θ1, a function of x, plotted over the housing data (Price ($) in 1000's vs. Size in feet² (x)). Right: J as a function of the parameters θ0, θ1.]

What we did last time, when we only had θ1, gave a bowl-shaped curve. But now we have two parameters, θ0 and θ1, so the plot gets a little more complicated: J(θ0, θ1) is a bowl-shaped 3-D surface plot. Instead of the surface, we are going to use a contour plot, a graph that contains many contour lines.

[Left: hθ(x) for fixed θ0, θ1, a function of x. Right: contour plot of J(θ0, θ1); the three green points on one contour line have the same value of J(θ0, θ1).]

This hypothesis h(x), with these values of θ0 and θ1, is really not a good fit to the data. Its cost is high, far from the minimum, precisely because the line fits the data poorly.

[Left: hθ(x) for fixed θ0, θ1, a function of x. Right: the corresponding point on the contour plot of J(θ0, θ1).]

When θ0 = 360 and θ1 = 0, the value of J(θ0, θ1) in the contour plot gets closer to the center, reducing the cost function's error.

[Left: hθ(x) for fixed θ0, θ1, a function of x. Right: the corresponding point on the contour plot of J(θ0, θ1).]

This is not quite at the minimum, but it is pretty close. The sum-of-squares error is the sum of the squared vertical distances between the training samples and the hypothesis.

Now of course what we really want is an efficient algorithm, an efficient piece of software, for automatically finding the values of θ0 and θ1 that minimize the cost function J.

What we don't want is to write software that plots out these points so that we can manually read off the numbers.

In fact, when we look at more complicated examples, we'll have higher-dimensional problems with more parameters, where this figure cannot really be plotted and visualization becomes much harder.

So what we want is software that finds the values of θ0 and θ1 minimizing this function. Next we start to talk about an algorithm that does exactly that: one that automatically finds the values of θ0 and θ1 minimizing the cost function J.

Linear regression with one variable
Gradient descent

Machine Learning

Have some function J(θ0, θ1).
Want min over θ0, θ1 of J(θ0, θ1).

Outline:
•  Start with some θ0, θ1
•  Keep changing θ0, θ1 to reduce J(θ0, θ1), until we hopefully end up at a minimum
In gradient descent, we spin 360 degrees around, look all around, and ask: if I were to take a little baby step in some direction, and I want to go downhill as quickly as possible, what direction do I take?

[Surface plot of J(θ0, θ1) over the (θ0, θ1) plane]

Take another step, and another, and so on until you converge to a local minimum down there. The size of each step is determined by the parameter α, called the learning rate: a smaller α results in smaller steps, and a larger α results in larger steps.
Gradient descent has an interesting property: if you had started just a couple of steps to the right, gradient descent would have taken you to a second local optimum over on the right. Starting at a slightly different location can leave you at a very different local optimum.

[Surface plot of J(θ0, θ1) with two distinct local minima marked by red arrows]

We know we have succeeded when our cost function is at the very bottom of one of the pits in the graph, i.e. when its value is a minimum. The way we get there is by taking the derivative (the slope of the tangent line) of our cost function.
Gradient descent update rule:

  θj := θj - α (∂/∂θj) J(θ0, θ1)    (for j = 0 and j = 1)

At each iteration the algorithm simultaneously updates the parameters θ0, θ1, θ2, ..., θn.

The learning rate α controls how big a step we take downhill with gradient descent. If α is very large, that corresponds to a very aggressive gradient descent procedure, taking huge steps downhill; if α is very small, we take tiny baby steps downhill.

Simultaneously update θ0 and θ1.
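A minimal sketch of one update step, emphasizing the simultaneous update; the gradient functions here are hypothetical placeholders supplied by the caller:

```python
# One gradient-descent step: theta_j := theta_j - alpha * dJ/dtheta_j,
# updating theta0 and theta1 simultaneously.
def gradient_step(theta0, theta1, grad0, grad1, alpha):
    # Both temporaries are computed from the OLD parameter values
    # before either parameter is overwritten.
    temp0 = theta0 - alpha * grad0(theta0, theta1)
    temp1 = theta1 - alpha * grad1(theta0, theta1)
    return temp0, temp1
```

Updating θ0 first and then using the new θ0 inside the θ1 update would be an incorrect, non-simultaneous implementation.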

Linear regression with one variable
Gradient descent intuition

Machine Learning

Gradient descent algorithm (one-parameter case):

  θ1 := θ1 - α (d/dθ1) J(θ1)

where α is the learning rate and (d/dθ1) J(θ1) is the derivative term.

We explore the scenario with a single parameter θ1, plotting its cost function, to build intuition for gradient descent.
Gradient descent will update θ1. Ask: what is the slope of the line that is tangent to the function at the current θ1?

Positive slope (positive derivative): the value of θ1 decreases, moving toward the minimum.

Negative slope (negative derivative): the value of θ1 increases, again moving toward the minimum.

If α is too small, gradient descent takes baby steps and can be slow.

If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.

We should adjust our parameter α to ensure that the gradient descent algorithm converges in a reasonable time; failing to converge, or taking too much time to reach the minimum, implies that our step size is wrong.
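A minimal sketch of both failure modes on the toy objective J(θ) = θ² (derivative 2θ); the α values are illustrative choices:

```python
# Gradient descent on J(theta) = theta^2: theta <- theta - alpha * 2 * theta.
def run(alpha, theta=1.0, steps=20):
    for _ in range(steps):
        theta -= alpha * 2 * theta
    return theta

small = abs(run(0.1))  # theta shrinks by a factor 0.8 per step: slow but convergent
large = abs(run(1.1))  # each step overshoots and grows by a factor 1.2: divergent
```

With α = 0.1 the iterate creeps toward the minimum at 0; with α = 1.1 it overshoots further on every step and diverges.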
What if your parameter θ1 is already at a local minimum? What do you think one step of gradient descent will do?

At a local optimum the slope of the tangent line is 0, so the derivative term is zero:

  θ1 := θ1 - α · 0
  θ1 := θ1

One step of gradient descent does absolutely nothing: it does not change your parameters, which is what you want, because it keeps your solution at the local optimum.
Gradient descent can converge to a local minimum even with the learning rate α fixed.

As we approach a local minimum, the derivative, meaning the slope, becomes less steep, getting closer and closer to zero. So gradient descent automatically takes smaller steps, and there is no need to decrease α over time.

Linear regression with one variable
Gradient descent for linear regression

Machine Learning
Gradient descent algorithm:

  repeat until convergence {
    θj := θj - α (∂/∂θj) J(θ0, θ1)    (for j = 0 and j = 1)
  }

Linear regression model:

  hθ(x) = θ0 + θ1x
  J(θ0, θ1) = (1/2m) Σi=1..m (hθ(x^(i)) - y^(i))²

Gradient descent algorithm, with the derivatives substituted in:

  repeat until convergence {
    θ0 := θ0 - α (1/m) Σi=1..m (hθ(x^(i)) - y^(i))
    θ1 := θ1 - α (1/m) Σi=1..m (hθ(x^(i)) - y^(i)) · x^(i)
  }

Update θ0 and θ1 simultaneously.
When specifically applied to the case of linear regression, a new form of the gradient descent equations can be derived: we substitute our actual cost function and our actual hypothesis function and simplify, giving the update rules above, where m is the size of the training set, θ0 is a constant that changes simultaneously with θ1, and x^(i), y^(i) are values from the given training set (data).
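A minimal runnable sketch of these update rules (the dataset, step size, and iteration count below are illustrative choices, not the course's):

```python
# Batch gradient descent for univariate linear regression.
# Each step sums over all m training examples, per the update rules above.
def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old thetas.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1
```

On data lying exactly on y = 2x + 1, e.g. xs = [0, 1, 2, 3] and ys = [1, 3, 5, 7], this converges to θ0 ≈ 1 and θ1 ≈ 2.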
One of the issues we saw with gradient descent in general is that it can be susceptible to local optima: depending on where it starts, it can settle into different minima.

[Surface plots of a non-convex J(θ0, θ1) with multiple local optima]
The cost function for linear regression, however, is always going to be a bowl-shaped function. The technical term for this is a convex function; informally, a convex function is a bowl-shaped function with no local optima other than the single global optimum. So gradient descent on this type of cost function, which is what you get whenever you use linear regression, will always converge to the global optimum.
[A sequence of slides showing gradient descent in action. Left: hθ(x) for the current fixed θ0, θ1, a function of x. Right: the contour plot of J(θ0, θ1), a function of the parameters, with the trajectory of gradient descent stepping toward the center.]

Until we finally reach the global minimum; that global minimum corresponds to the hypothesis that gives a good fit to the data.
"Batch" Gradient Descent

"Batch": each step of gradient descent uses all the training examples.
