Week 1 - Model Representation
Our first learning algorithm is linear regression. In this video, you'll see what the model looks like and, more importantly, what the overall process of supervised learning looks like. Let's use a motivating example of predicting housing prices. We're going to use a data set of housing prices from the city of Portland, Oregon. Here I'm going to plot my data set: a number of houses of different sizes that were sold for a range of different prices. Let's say that, given this data set, you have a friend who is trying to sell a house. If your friend's house is 1250 square feet, you want to tell them how much they might be able to sell the house for. Well, one thing you could do is fit a model, maybe fit a straight line to this data, which looks something like that. Based on that, maybe you could tell your friend that he can sell the house for around $220,000.
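To make this concrete, here's a minimal sketch in Python of fitting a straight line to such a data set and reading off a prediction at 1250 square feet. The sizes and prices below are hypothetical stand-ins, not the actual table from the lecture:

```python
import numpy as np

# Hypothetical (size in square feet, price in $1000s) pairs standing in
# for the Portland, Oregon data set plotted in the lecture.
sizes = np.array([852, 1416, 1534, 2104, 3000])
prices = np.array([178, 232, 315, 460, 540])

# Fit a straight line (degree-1 polynomial) to the data.
# np.polyfit returns coefficients highest degree first: [slope, intercept].
theta1, theta0 = np.polyfit(sizes, prices, deg=1)

# Predict a selling price for the friend's 1250-square-foot house.
prediction = theta0 + theta1 * 1250
print(f"Predicted price: ${prediction * 1000:,.0f}")
```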
This is an example of a supervised learning problem, because we're given the, quote, "right answer" for each of our examples; namely, we're told the actual price that each of the houses in our data set was sold for. Moreover, this is an example of a regression problem, where the term regression refers to the fact that we are predicting a real-valued output, namely the price. And just to remind you, the other most common type of supervised learning problem is called the classification problem, where we predict discrete-valued outputs, such as whether a house sells for more or less than some given price. That's a zero-one valued discrete output. More formally, in supervised learning we have a data set, and this data set is called a training set. So for the housing prices example, we have a training set of different housing prices, and our job is to learn from this data how to predict the prices of houses.
Let's define some notation that we'll be using throughout this course. We're going to define quite a lot of symbols. It's okay if you don't remember all of them right now, but as the course progresses it will be useful to have convenient notation. I'm going to use lowercase m throughout this course to denote the number of training examples. So in this data set, if I have, say, 47 rows in this table, then I have 47 training examples and m equals 47. I'll use lowercase x to denote the input variables, often also called the input features; those would be the x's here. And I'm going to use y to denote my output variable, or the target variable, which I'm going to predict; that's the second column here. To establish more notation, I'm going to use (x, y) to denote a single training example, so a single row in this table corresponds to a single training example. To refer to a specific training example, I'm going to use the notation (x(i), y(i)) to mean the ith training example. The superscript i in parentheses is just an index into my training set and refers to the ith row in this table; it is not x to the power of i or y to the power of i. So, for example, x(1) refers to the input value for the first training example, which is 2104; that's the x in the first row. x(2) is equal to 1416; that's the second x. And y(1) is equal to 460, the y value for my first training example; that's what the superscript (1) refers to. As mentioned, occasionally I'll ask you a question to let you check your understanding; a quiz will pop up in the video, and when it does, please use your mouse to select what you think is the right answer.
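Here's a rough sketch of this notation in Python. The values 2104, 1416, and 460 are the ones quoted above; the remaining numbers are hypothetical fill-ins. Note that Python indexing starts at zero, so the lecture's x(1) is training_set[0] below:

```python
# Training set as a list of (x, y) pairs: x = size in square feet,
# y = price in $1000s. Only the first two x values and y(1) come from
# the lecture; the other numbers are hypothetical.
training_set = [(2104, 460), (1416, 232), (1534, 315), (852, 178)]

m = len(training_set)      # m = number of training examples
x1, y1 = training_set[0]   # (x(1), y(1)) -- zero-based indexing in Python
x2, _ = training_set[1]    # x(2)

print(m)   # 4
print(x1)  # 2104 -- input feature of the first training example
print(y1)  # 460  -- target value of the first training example
print(x2)  # 1416
```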
Here's how this supervised learning algorithm works. We take a training set, like our training set of housing prices, and feed it to our learning algorithm. The job of the learning algorithm is to output a function, usually denoted lowercase h, where h stands for hypothesis. The hypothesis is a function that takes as input the size of a house, like maybe the size of the new house your friend is trying to sell; it takes in the value of x and tries to output the estimated value of y for the corresponding house. So h is a function that maps from x's to y's. People often ask why this function is called a hypothesis. Some of you may know the meaning of the term hypothesis from the dictionary or from science. It turns out that in machine learning, this is a name that was used in the early days of the field, and it kind of stuck. It's maybe not the best possible name for a function mapping from sizes of houses to price predictions, but it is the standard terminology that people use in machine learning, so don't worry too much about why people call it that.
When designing a learning algorithm, the next thing we need to decide is how to represent this hypothesis h. For this and the next few videos, our initial choice for representing the hypothesis will be the following: h_theta(x) = theta_0 + theta_1 * x. As a shorthand, I'll sometimes just write h(x), but more often I'll write it with the subscript theta. Plotting this in pictures, all this means is that we are fitting a straight line to the data set, and what this function is doing is predicting that y is some straight-line function of x: h(x) equals theta_0 plus theta_1 times x. And why a linear function? Well, sometimes we'll want to fit more complicated, perhaps non-linear functions as well, but since the linear case is the simple building block, we will start with this example of fitting linear functions first, and we will build on it to eventually get more complex models and more complex learning algorithms. Let me also give this particular model a name: this model is called linear regression with one variable, the variable being x, since we are predicting prices as a function of the single variable x. Another name for this model is univariate linear regression, where univariate is just a fancy way of saying one variable. So, that's linear regression. In the next video we'll start to talk about how to implement this model, and in particular about something called the cost function, which will let us figure out how to fit the best possible straight line to our data.
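Here's a minimal sketch of this hypothesis in Python. The particular parameter values passed in are hypothetical placeholders, just to show how the function is used; choosing good values for theta_0 and theta_1 is exactly what the cost function below is for:

```python
def h(x, theta0, theta1):
    """Univariate linear regression hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Hypothetical parameter values; the estimate is a price in $1000s
# for a 1250-square-foot house.
print(h(1250, 50.0, 0.13))  # 212.5, i.e. about $212,500
```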
In linear regression, we have a training set like the one I showed here; remember that in our notation m was the number of training examples, so maybe m equals 47. And the form of our hypothesis, which we use to make predictions, is this linear function: h_theta(x) = theta_0 + theta_1 * x. In this terminology, theta_0 and theta_1 are what I call the parameters of the model. What we'll do in this video is talk about how to go about choosing these two parameter values. Just for review, here are a few examples. If theta_0 is 1.5 and theta_1 is 0, then the hypothesis is a horizontal line at 1.5. If theta_0 is 0 and theta_1 is 0.5, then the hypothesis looks like a line through the origin that passes through the point (2, 1); so h(x), or really h_theta(x), though sometimes I'll just omit theta for brevity, will be equal to just 0.5 times x. And finally, if theta_0 equals 1 and theta_1 equals 0.5, the line passes through the point (2, 2), like so, and that gives me my new h(x). In linear regression, we have a training set, like maybe the one I've plotted here. What we want to do is come up with values for the parameters theta_0 and theta_1 so that the straight line we get out of this somehow fits the data well, like maybe that line over there. So, how do we come up with values of theta_0 and theta_1 that correspond to a good fit to the data? The idea is that we get to choose our parameters theta_0 and theta_1 so that h(x) is close to y for the examples where we know the actual price the house was sold for. So let's try to choose values for the parameters so that, at least on the training set, given an x in the training set we make reasonably accurate predictions for the y values. Let's formalize this.
In linear regression, we are going to solve a minimization problem. I'll write: minimize over theta_0, theta_1. And I want this to be small: I want the difference between h(x) and y to be small. One thing I'll try to do is minimize the squared difference between the output of the hypothesis and the actual price of the house. Using the notation (x(i), y(i)) to represent the ith training example, what I really want is to sum, over my training set from i equals 1 through m, the squared difference between the predicted price of house number i and the price it was actually sold for, and to minimize that sum. Just to remind you of the notation, m here is the size of my training set. And to make the math a little bit easier, I'm going to look at the average, 1 over m times that sum; in fact, I'll minimize 1 over 2m times the sum. Putting the constant one-half in front just makes some of the later math a little easier; minimizing one-half of something gives you the same values of theta_0 and theta_1 as minimizing the original expression. Just to make sure this equation is clear: the expression h_theta(x(i)) is our usual hypothesis, equal to theta_0 plus theta_1 times x(i). And the notation "minimize over theta_0, theta_1" means: find me the values of theta_0 and theta_1 that minimize this expression, which depends on theta_0 and theta_1. So, just to recap, we're posing this problem as: find the values of theta_0 and theta_1 so that the average, 1 over 2m, times the sum of squared errors between my predictions on the training set and the actual values of the houses, is minimized. This will be the objective function for linear regression. To rewrite this out a little more compactly, I'm going to define a cost function:

J(theta_0, theta_1) = (1 / (2m)) * sum from i = 1 to m of (h_theta(x(i)) - y(i))^2

and what we do is minimize J(theta_0, theta_1) over theta_0 and theta_1. This cost function is also called the squared error function, or sometimes the squared error cost function, because we take the squares of the errors. It turns out that the squared error cost function is a reasonable choice and works well for most regression problems. There are other cost functions that would work pretty well, but the squared error cost function is probably the most commonly used one for regression problems. Later in this class we'll talk about alternative cost functions as well, but this choice should be a pretty reasonable thing to try for most linear regression problems. Okay, so that's the cost function.
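Here's a minimal sketch of this cost function in Python with NumPy, evaluated on the same hypothetical housing data as before (the parameter values passed in are also hypothetical):

```python
import numpy as np

def compute_cost(x, y, theta0, theta1):
    """Squared error cost: J(theta0, theta1) = (1 / (2m)) * sum((h(x) - y)^2)."""
    m = len(x)
    predictions = theta0 + theta1 * x        # h_theta(x(i)) for every example
    squared_errors = (predictions - y) ** 2  # (h_theta(x(i)) - y(i))^2
    return squared_errors.sum() / (2 * m)

# Hypothetical data: sizes in square feet, prices in $1000s.
x = np.array([852, 1416, 1534, 2104])
y = np.array([178, 232, 315, 460])

print(compute_cost(x, y, 50.0, 0.13))  # cost for one choice of (theta0, theta1)
```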
So far we've just seen a mathematical definition of the cost function. In the next couple of videos, I'm going to go a little bit deeper into what the cost function J is doing and try to give you better intuition about what it computes and why we want to use it. Let's look at some examples to build intuition about what the cost function is doing. To recap, here's what we had last time. We want to fit a straight line to our data, so we had this form of hypothesis, with parameters theta_0 and theta_1, and with different choices of the parameters we end up with different straight-line fits to the data. And we had a cost function, which was our optimization objective.
In this video, in order to better visualize the cost function J, I'm going to work with a simplified hypothesis function, like the one shown on the right: h(x) = theta_1 * x. You can, if you want, think of this as setting the parameter theta_0 equal to 0. So I have only one parameter, theta_1, and my cost function is similar to before, except that now h(x) is equal to just theta_1 times x. In pictures, setting theta_0 equal to zero corresponds to choosing only hypothesis functions that pass through the origin. Using this simplified definition of the hypothesis and cost function, let's try to understand the cost function concept better. It turns out there are two key functions we want to understand. The first is the hypothesis function, and the second is the cost function. Notice that the hypothesis h(x), for a fixed value of theta_1, is a function of x: the hypothesis is a function of the size of the house x. In contrast, the cost function J is a function of the parameter theta_1, which controls the slope of the straight line.
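To make this distinction concrete, here's a small Python sketch of the simplified setting, using the three-point training set (1, 1), (2, 2), (3, 3) that the lecture is about to introduce: h is evaluated as a function of x with theta_1 held fixed, while J is evaluated as a function of theta_1 with the training set held fixed.

```python
import numpy as np

# The lecture's training set: (1, 1), (2, 2), (3, 3).
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def h(x, theta1):
    """Simplified hypothesis (theta0 = 0): a line through the origin."""
    return theta1 * x

def J(theta1):
    """Cost as a function of the single parameter theta1."""
    m = len(x)
    return ((h(x, theta1) - y) ** 2).sum() / (2 * m)

print(h(2.0, 1.0))  # hypothesis at x = 2 for fixed theta1 = 1 -> 2.0
print(J(1.0))       # cost at theta1 = 1 -> 0.0
```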
Let's plot these functions and try to understand them both better. Let's start with the hypothesis. On the left, here's my training set with three points at (1, 1), (2, 2), and (3, 3). Let's pick a value for theta_1: say theta_1 equals one. With that choice, my hypothesis is going to look like this straight line over here. And I'll point out that when I'm plotting my hypothesis function, the horizontal axis is labeled x, the size of the house. Now, with theta_1 temporarily set to one, what I want to do is figure out what J(theta_1) is when theta_1 equals one. So let's go ahead and compute the value of the cost function at one. As usual, my cost function is defined as the sum, over the training set, of the usual squared error term, which here is (1 / (2m)) times the sum of (theta_1 * x(i) - y(i))^2. If you simplify, this turns out to be 0 squared plus 0 squared plus 0 squared, which is of course just equal to zero. Each of the terms inside the cost function is equal to zero, because for this specific training set my three training examples are (1, 1), (2, 2), (3, 3), and if theta_1 is equal to one, then h(x(i)) is equal to y(i) exactly. So h(x(i)) minus y(i) is equal to zero for each term, which is why I find that J(1) is equal to zero.
So we now know that J(1) is equal to zero. Let's plot that. What I'm going to do on the right is plot my cost function J, and notice that, because the cost function is a function of the parameter theta_1, the horizontal axis is now labeled theta_1. So J(1) equals zero; let's go ahead and plot that point over there. Now let's look at some other examples. Theta_1 can take on a range of different values: negative values, zero, and positive values. What if theta_1 is equal to 0.5? What happens then? I'm now going to set theta_1 equal to 0.5, and in that case my hypothesis looks like a line with slope 0.5. Let's compute J(0.5). That's going to be 1 over 2m times my usual cost function, and the cost function turns out to be the sum of the squares of the heights of these vertical gaps between the line and the data points, because each vertical distance is the difference between y(i) and the predicted value h(x(i)). For the first example I get (0.5 - 1) squared, because my hypothesis predicted 0.5 whereas the actual value was one. For my second example I get (1 - 2) squared, because my hypothesis predicted one, but the actual housing price was two. And finally, plus (1.5 - 3) squared. That's equal to 1 over (2 times 3), because m, the training set size, is three, times the sum in the parentheses, which simplifies to 3.5. So that's 3.5 over 6, which is about 0.58. So now we know that J(0.5) is about 0.58; let's plot that, maybe about over there. Okay, now let's do one more. What if theta_1 is equal to zero? What is J(0)? It turns out that if theta_1 is equal to zero, then h(x) is just equal to this flat line that goes horizontally along the x-axis. So, measuring the errors, we have that J(0) is equal to 1 over 2m times (1 squared plus 2 squared plus 3 squared), which is one-sixth times 14, or about 2.33. Let's go ahead and plot that as well; it ends up with a value of around 2.33.
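Here's a brief, self-contained sketch that reproduces these numbers and then sweeps theta_1 over a range of values, which is exactly how the curve of J on the right gets traced out:

```python
import numpy as np

# Training set (1, 1), (2, 2), (3, 3) from the lecture.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)

def J(theta1):
    """Cost for the simplified hypothesis h(x) = theta1 * x."""
    return ((theta1 * x - y) ** 2).sum() / (2 * m)

print(round(J(1.0), 2))  # 0.0  -- the line fits the data exactly
print(round(J(0.5), 2))  # 0.58 -- 3.5 / 6
print(round(J(0.0), 2))  # 2.33 -- 14 / 6

# Sweeping theta1 over a range of values traces out the curve of J.
for theta1 in np.linspace(-0.5, 2.5, 7):
    print(f"theta1 = {theta1:5.2f}, J = {J(theta1):.2f}")
```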
And of course we can keep on doing this for other values of theta_1. It turns out that theta_1 can take on negative values as well: if theta_1 is negative, then h(x) would be equal to, say, minus 0.5 times x, a line going downwards with slope negative 0.5. You can keep on computing these errors; for theta_1 equal to negative 0.5, it turns out to have a really high error, something like 5.25. And so on: for different values of theta_1 you can compute these costs, and if you compute them for a range of values, you get something like that curve. By doing this, you can slowly trace out what the function J(theta_1) looks like.
And that's what J(theta_1) is. To recap: each value of theta_1 corresponds to a different straight-line fit on the left, and for each value of theta_1 we can derive a different value of J(theta_1). For example, theta_1 = 1 corresponds to this straight line passing right through the data, whereas theta_1 = 0.5, the point shown in magenta, corresponds to maybe that line, and theta_1 = 0, shown in blue, corresponds to this horizontal line. So for each value of theta_1 we wound up with a different value of J(theta_1), and we could then use this to trace out the plot on the right. Now, remember that the optimization objective for our learning algorithm is to choose the value of theta_1 that minimizes J(theta_1); this was our objective function for linear regression. Well, looking at this curve, the value that minimizes J(theta_1) is theta_1 = 1. And lo and behold, that is indeed the best possible straight-line fit through our data: by setting theta_1 = 1, for this particular training set we actually end up fitting the data perfectly. That is how minimizing J gives us a straight line that fits the data well. So, to wrap up: in this video we looked at a simplified version of the algorithm, with only the one parameter theta_1, having set the parameter theta_0 to zero. In the next video, we'll go back to the original problem formulation with both theta_0 and theta_1, that is, without setting theta_0 to zero, and hopefully that will give you an even better sense of what the cost function J is doing in the original linear regression problem.