
Week 1 – Model Representation

Our first learning algorithm will be

linear regression. In this video, you'll see what the model looks like and more

importantly you'll see what the overall process of supervised learning looks like. Let's

use as a motivating example the problem of predicting housing prices. We're going to use a data

set of housing prices from the city of Portland, Oregon. And here I'm gonna

plot my data set of a number of houses that were different sizes that were sold

for a range of different prices. Let's say that given this data set, you have a

friend that's trying to sell a house, and let's say your friend's house is 1250 square feet, and you want to tell them how much they might be able to sell the

house for. Well one thing you could do is fit a model. Maybe fit a straight line

to this data. Looks something like that and based on that, maybe you could tell your friend

that let's say maybe he can sell the house for around $220,000.

So this is an example of a supervised learning algorithm. And it's

supervised learning because we're given the, quotes, "right answer" for each of

our examples. Namely, we're told the actual price that each of the houses in our data set was sold for. And moreover, this is

an example of a regression problem where the term regression refers to the fact

that we are predicting a real-valued output namely the price. And just to remind you

the other most common type of supervised learning problem is called the

classification problem where we predict discrete-valued outputs such as if we are

looking at cancer tumors and trying to decide if a tumor is malignant or benign.

So that's a zero-one valued discrete output. More formally, in supervised learning, we have

a data set and this data set is called a training set. So for housing prices

example, we have a training set of different housing prices and our job is to

learn from this data how to predict prices of the houses. Let's define some notation

that we're using throughout this course. We're going to define quite a lot of

symbols. It's okay if you don't remember all the symbols right now but as the

course progresses it will be useful to have convenient notation. So I'm gonna use

lower case m throughout this course to denote the number of training examples. So

in this data set, if I have, you know, let's say 47 rows in this table. Then I
have 47 training examples and m equals 47. Let me use lowercase x to denote the

input variables, often also called the features. Those would be the x's here; those would be the input
features. And I'm gonna use y to denote my output variables or the

target variable which I'm going to predict and so that's the second

column here. As a piece of notation, I'm going to use (x, y) to denote a single

training example. So, a single row in this table corresponds to a single training

example and to refer to a specific training example, I'm going to use this

notation (x(i), y(i)). And, we're going to use this to refer to the ith

training example. So this superscript i over here, this is not exponentiation

right? This (x(i), y(i)), the superscript i in parentheses that's just an index into my

training set and refers to the ith row in this table, okay? So this is not x to

the power of i, y to the power of i. Instead (x(i), y(i)) just refers to the ith row of this

table. So for example, x(1) refers to the input value for the first training example so

that's 2104. That's this x in the first row. x(2) will be equal to

1416 right? That's the second x and y(1) will be equal to 460.

The first, the y value for my first training example, that's what that (1)

refers to. So as mentioned, occasionally I'll ask you a question to let you check your

understanding and a few seconds in this video a multiple-choice question

will pop up in the video. When it does, please use your mouse to select what you

think is the right answer.
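As a quick sketch of this notation in Python (the values 2104, 1416, and 460 come from the lecture's table; the other numbers here are hypothetical placeholders), note that the lecture's superscript (i) is 1-indexed while Python lists are 0-indexed:

```python
# Training set: x = house size in square feet, y = price in $1000s.
x = [2104, 1416, 1534, 852]   # input variables / features (1534, 852 are made up)
y = [460, 232, 315, 178]      # target variables (232, 315, 178 are made up)

m = len(x)  # m = the number of training examples

def training_example(i):
    """Return the ith training example (x(i), y(i)); i is 1-indexed as in the lecture."""
    return (x[i - 1], y[i - 1])

print(training_example(1))  # (2104, 460), i.e. x(1) = 2104 and y(1) = 460
print(m)                    # 4 training examples in this toy set
```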

So here's how this

supervised learning algorithm works. We saw that with a training set, like our training set of housing prices, we feed that to our learning algorithm. It's the job of the learning algorithm to then output a function which by convention is

usually denoted lowercase h, and h stands for hypothesis. And what the hypothesis is, is a function that takes as input the size of a house, like

maybe the size of the new house your friend's trying to sell so it takes in the value of

x and it tries to output the estimated value of y for the corresponding house.

So h is a function that maps from x's to y's. People often ask me, you

know, why is this function called hypothesis. Some of you may know the

meaning of the term hypothesis, from the dictionary or from science or whatever. It
turns out that in machine learning, this is a name that was used in the early days of

machine learning and it kinda stuck. It's maybe not a great name for this sort of

function, for mapping from sizes of houses to the predictions, that you know....

I think the term hypothesis, maybe isn't the best possible name for this, but this is the

standard terminology that people use in machine learning. So don't worry too much

about why people call it that. When designing a learning algorithm, the next

thing we need to decide is how do we represent this hypothesis h. For this and

the next few videos, our initial choice for representing the hypothesis will be the following. We're going to represent h as follows. And we will write this as h subscript theta of x equals theta zero plus theta one times x. And as a shorthand, sometimes instead of writing, you know, h subscript theta of x, sometimes there's a shorthand, I'll just write h of x. But more often I'll write it with a

subscript theta over there. And plotting this in the pictures, all this means is that,

we are going to predict that y is a linear function of x. Right, so that's the

data set and what this function is doing, is predicting that y is some straight

line function of x. That's h of x equals theta 0 plus theta 1 x, okay? And why a linear

function? Well, sometimes we'll want to fit more complicated, perhaps non-linear

functions as well. But since this linear case is the simple building block, we will

start with this example first of fitting linear functions, and we will build on

this to eventually have more complex models, and more complex learning

algorithms. Let me also give this particular model a name. This model is

called linear regression or this, for example, is actually linear regression

with one variable, with the variable being x. Predicting all the prices as functions

of one variable X. And another name for this model is univariate linear

regression. And univariate is just a fancy way of saying one variable. So,

that's linear regression. In the next video we'll start to talk about just how

we go about implementing this model.
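As a minimal sketch of this hypothesis in Python (the parameter values below are assumptions purely for illustration; the lectures haven't yet said how to choose them):

```python
def hypothesis(theta0, theta1, x):
    """h_theta(x) = theta0 + theta1 * x: a straight-line prediction of price from size."""
    return theta0 + theta1 * x

# Hypothetical parameters; with these, a 1250 sq ft house (the friend's
# house from the example) is predicted to sell for about $220,000.
print(hypothesis(63.75, 0.125, 1250))  # 220.0 (price in $1000s)
```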


Week 1 – Cost Function
In this video we'll define

something called the cost function, this will let us figure out how to fit the

best possible straight line to our data. In linear regression, we have a training

set that I showed here. Remember our notation: m was the number of training

examples, so maybe m equals 47. And the form of our hypothesis, which we use to make predictions

is this linear function. To introduce a little bit more

terminology, these theta zero and theta one, these are what I

call the parameters of the model. And what we're going to do in

this video is talk about how to go about choosing these two parameter

values, theta 0 and theta 1. With different choices of

the parameters theta 0 and theta 1, we get different hypotheses,

different hypothesis functions. I know some of you will probably

be already familiar with what I am going to do on the slide, but

just for review, here are a few examples. If theta 0 is 1.5 and theta 1 is 0, then the hypothesis
function

will look like this. Because your hypothesis function will

be h of x equals 1.5 plus 0 times x which is this constant value

function which is flat at 1.5. If theta0 = 0, theta1 = 0.5, then

the hypothesis will look like this, and it should pass through this point (2, 1), so that you now have h(x).
Or really h of theta(x), but sometimes

I'll just omit theta for brevity. So h(x) will be equal to just 0.5 times x,

which looks like that. And finally, if theta zero equals one,

and theta one equals 0.5, then we end up with a hypothesis

that looks like this. Let's see,

it should pass through the (2, 2) point. Like so, and this is my new h of x, or my new h subscript theta of x. And by the way, remember, I said that this is h subscript theta of x, but as a shorthand, sometimes I'll just write this as h of x.

In linear regression, we have a training

set, like maybe the one I've plotted here. What we want to do, is come up with
values for the parameters theta zero and theta one so that the straight line

we get out of this, corresponds to a straight line that somehow fits the data

well, like maybe that line over there. So, how do we come up with values, theta zero, theta one, that

corresponds to a good fit to the data? The idea is we get to choose our

parameters theta 0, theta 1 so that h of x,

meaning the value we predict on input x, that this is at least

close to the values y for the examples in our training set,

for our training examples. So in our training set, we're given a number of examples where we know the size of the house and we know the actual price it was sold for. So, let's try to choose values for the parameters so that, at least in the training set, given the x in the training set, we make reasonably accurate predictions for the y values. Let's formalize this. So in linear regression,

what we're going to do is, I'm going to want to solve

a minimization problem. So I'll write minimize over theta0 theta1. And I want this to be small, right?
I want the difference between h(x) and

y to be small. And one thing I might do is try

to minimize the square difference between the output of the hypothesis and

the actual price of a house. Okay.

So let's fill in some details. You remember that I was using

the notation (x(i),y(i)) to represent the ith training example. So what I want really is to

sum over my training set, something i = 1 to m, of the square difference between,

this is the prediction of my hypothesis when it is

input the size of house number i, right? Minus the actual price that house number i was sold for. And I want to minimize the sum over my training set, sum from i equals one through m,

of the difference of this squared error, the square difference between

the predicted price of a house, and the price that it was actually sold for. And just to remind you of notation, m here

was the size of my training set right? So my m there is my number

of training examples. Right, that hash sign is the abbreviation for "number" of training examples, okay? And to make the math a little bit easier, I'm going to actually look at 1 over m times that, so let's try to minimize my average error, and actually minimize one over 2m times that. Putting the 2, the constant one half, in front just makes some of the math a little easier. So minimizing one-half of something, right, should give you the same values of the parameters, theta 0, theta 1, as minimizing that function. And just to be sure,

this equation is clear, right? This expression in here, h subscript theta of x, this is our usual, right? That is equal to theta zero plus theta one x(i). And this notation,

minimize over theta 0 theta 1, this means you'll find me the values of theta 0 and

theta 1 that causes this expression to be minimized and this expression

depends on theta 0 and theta 1, okay? So just to recap: we're posing this problem as, find me the values of theta zero and theta one so that the average, the 1 over 2m, times the sum of squared errors between

my predictions on the training set minus the actual values of the houses

on the training set is minimized. So this is going to be my overall

objective function for linear regression. And just to rewrite this out a little bit

more cleanly, what I'm going to do is, by convention we usually

define a cost function, which is going to be exactly this,

that formula I have up here. And what I want to do is

minimize over theta0 and theta1 my function J(theta0, theta1). Let me just write this out. This is my cost function. So, this cost function is also called the squared error function, or sometimes called the squared error cost function. And you might wonder why we take the squares of the errors. It turns out that this squared error cost function is a reasonable choice and works well for most regression problems. There are other cost functions

that will work pretty well. But the squared error cost function is

probably the most commonly used one for regression problems. Later in this class we'll talk about

alternative cost functions as well, but this choice that we just had should

be a pretty reasonable thing to try for most linear regression problems. Okay. So that's the cost
function. So far we've just seen a mathematical

definition of this cost function. In case this function J of theta zero, theta one seems a little bit abstract, and you still don't have a good

sense of what it's doing, in the next video, in the next

couple videos, I'm actually going to go a little bit deeper into what

the cost function J is doing and try to give you better intuition about what it's computing and why we want to use it.
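The cost function described in this video translates directly into code. Here's a minimal sketch (the function and variable names are my own, not from the lecture):

```python
def cost(theta0, theta1, xs, ys):
    """Squared error cost: J(theta0, theta1) = (1 / 2m) * sum_i (h(x(i)) - y(i))^2."""
    m = len(xs)
    squared_errors = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return squared_errors / (2 * m)

# A hypothesis that fits the data perfectly drives the cost to zero:
print(cost(0.0, 1.0, [1, 2, 3], [1, 2, 3]))  # 0.0
```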


Week 1 – Cost Function Intuition I
In the previous video, we gave the

mathematical definition of the cost function. In this video, let's look at

some examples, to get back to intuition about what the cost function is doing, and

why we want to use it. To recap, here's what we had last time.

We want to fit a straight line to our data, so we had this form for our hypothesis, with these parameters theta zero and theta one, and with different choices of the parameters we end up with different straight-line fits to the data, like so. And there's a cost function, and that was our optimization objective:

So this video, in order to better visualize the cost function J, I'm going

to work with a simplified hypothesis function, like that shown on the right:
So I'm gonna use my simplified hypothesis, which is just theta one times X.

We can, if you want, think of this as setting the parameter theta zero equal to 0.

So I have only one parameter theta one and my cost function is similar to before

except that now H of X that is now equal to just theta one times X. And I have only

one parameter theta one and so my optimization objective is to minimize J of

theta one.

In pictures what this means is that if theta zero equals zero that

corresponds to choosing only hypothesis functions that pass through the origin,

that pass through the point (0, 0).

Using this simplified definition of the hypothesis and cost function, let's try to understand the cost function concept better.

It turns out there are two key functions we want to understand. The first is the hypothesis function, and the second is the cost function.

So, notice that the hypothesis, right, h of x, for a fixed value of theta one, is a function of x. So the hypothesis is a function of the size of the house x. In contrast, the cost function J is a function of the parameter theta one, which controls the slope of the straight line.

Let's plot these functions and try to understand them both better. Let's start with the hypothesis. On
the left, let's say here's my training set with

three points at (1, 1), (2, 2), and (3, 3). Let's pick a value theta one, so when theta one

equals one, and if that's my choice for theta one, then my hypothesis is going to

look like this straight line over here. And I'm gonna point out, when I'm plotting my hypothesis function, the x-axis, my horizontal axis, is labeled x, is labeled, you know, size of the house over here. Now, temporarily, set theta one equals one. What I want to do is figure out what is J of theta one when

theta one equals one. So let's go ahead and compute what the cost function equals for a value of one. Well, as usual, my cost function is defined as follows, right? A sum over my training set of this usual squared error term. And this is therefore equal to one over 2m times the sum of theta one x(i) minus y(i), squared. And if you simplify, this turns out to be zero squared plus zero squared plus zero squared, which is of course just equal to zero. Now, inside the cost function, it turns out each of these terms is equal to zero, because for the specific training set I have, my 3 training examples are (1, 1), (2, 2), (3, 3). If theta one is equal to one, then h of x(i) is equal to y(i) exactly. And so h of x minus y, each of these terms is equal to zero, which is why I find that J of one is equal to zero.

So, we now know that J of one is equal to zero. Let's plot that. What I'm gonna do on the right is plot my cost function J. And notice, because my cost function is a function of my parameter theta one, when I plot my cost function, the horizontal axis is now labeled with theta one. So I have J of one equals zero, so let's go ahead and plot that. I end up with an x over there. Now let's look at

some other examples. Theta-1 can take on a range of different values. Right? So

theta-1 can take on the negative values, zero, positive values. So what if theta-1

is equal to 0.5. What happens then? Let's go ahead and plot that. I'm now going to

set theta-1 equals 0.5, and in that case my hypothesis now looks like this. As a line

with slope equal to 0.5. And let's compute J of 0.5. So that is going to be one over 2m of my usual cost function. It turns out that the cost function is going to be the square of the height of this line, plus the square of the height of that line, plus the square of the height of that line, right? 'Cause it's just this vertical distance, that's the difference between, you know, y(i) and the predicted value, h of x(i), right? So the first example

is going to be 0.5 minus one squared. Because my hypothesis predicted 0.5.

Whereas, the actual value was one. For my second example, I get, one minus two
squared, because my hypothesis predicted one, but the actual housing price was two.

And then finally, plus 1.5 minus three squared. And so that's equal to one over two times three, because m, my training set size, right, I have three training examples. And then, simplifying what's inside the parentheses, it's 3.5. So that's 3.5 over six, which is about 0.58. So now we know that J of 0.5 is about 0.58. Let's go and plot that. So we plot that, which is maybe about over there. Okay? Now, let's do one more. How

about if theta one is equal to zero, what is J of zero equal to? It turns out that

if theta one is equal to zero, then H of X is just equal to, you know, this flat

line, right, that just goes horizontally like this. And so, measuring the errors.

We have that J of zero is equal to one over 2m times one squared plus two squared plus three squared, which is one sixth times fourteen, which is about 2.3. So

let's go ahead and plot as well. So it ends up with a value around 2.3
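The values of J worked out by hand above can be checked numerically. This sketch uses the lecture's three-point training set (1, 1), (2, 2), (3, 3) and the simplified hypothesis h(x) = theta one times x:

```python
xs, ys = [1, 2, 3], [1, 2, 3]

def J(theta1):
    """Cost for the simplified hypothesis h(x) = theta1 * x."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

print(J(1.0))             # 0.0   -> theta1 = 1 fits the data perfectly
print(round(J(0.5), 2))   # 0.58  (3.5 / 6)
print(round(J(0.0), 2))   # 2.33  (14 / 6)
print(J(-0.5))            # 5.25  -> the high-error negative-slope case
```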

and of course we can keep on doing this for other values of theta one. It turns

out that you can have you know negative values of theta one as well so if theta

one is negative then h of x would be equal to say minus 0.5 times x then theta

one is minus 0.5 and so that corresponds to a hypothesis with a

slope of negative 0.5. And you can actually keep on computing these errors.

This turns out, you know, for minus 0.5, to have a really high error. It works out to be something like 5.25. And so on, for the different values of theta

one, you can compute these things, right? And it turns out that as you compute this range of values, you get something like that. And by computing the range of values, you can actually slowly trace out what the function J of theta looks like, and that's what J of theta is. To recap, for each value of theta one, right? Each value

of theta one corresponds to a different hypothesis, or to a different straight

line fit on the left. And for each value of theta one, we could then derive a

different value of j of theta one. And for example, you know, theta one=1,

corresponded to this straight line straight through the data. Whereas theta

one=0.5. And this point shown in magenta corresponded to maybe that line, and theta

one=zero which is shown in blue that corresponds to this horizontal line. Right, so for each

value of theta one we wound up with a different value of J of theta one and we
could then use this to trace out this plot on the right. Now you remember, the

optimization objective for our learning algorithm is we want to choose the value

of theta one. That minimizes J of theta one. Right? This was our objective function for

the linear regression. Well, looking at this curve, the value that minimizes j of

theta one is, you know, theta one equal to one. And lo and behold, that is indeed

the best possible straight line fit through our data, by setting theta one

equals one. And just, for this particular training set, we actually end up fitting

it perfectly. And that's why minimizing j of theta one corresponds to finding a

straight line that fits the data well. So, to wrap up: in this video, we looked at some plots to understand the cost function. To do so, we simplified the algorithm so that it only had one parameter theta one, and we set the parameter theta zero to be zero. In the next video, we'll go back to the original problem

formulation and look at some visualizations involving both theta zero

and theta one. That is without setting theta zero to zero. And hopefully that will give

you, an even better sense of what the cost function j is doing in the original

linear regression formulation.
