
Week 1 – Model Representation

Our first learning algorithm will be

linear regression. In this video, you'll see what the model looks like and more

importantly you'll see what the overall process of supervised learning looks like. Let's

use as a motivating example the problem of predicting housing prices. We're going to use a data

set of housing prices from the city of Portland, Oregon. And here I'm gonna

plot my data set of a number of houses that were different sizes that were sold

for a range of different prices. Let's say that given this data set, you have a

friend that's trying to sell a house, and let's say your friend's house is 1250 square feet, and you want to tell them how much they might be able to sell the

house for. Well one thing you could do is fit a model. Maybe fit a straight line

to this data. Looks something like that and based on that, maybe you could tell your friend

that let's say maybe he can sell the house for around $220,000.

So this is an example of a supervised learning algorithm. And it's

supervised learning because we're given the, quotes, "right answer" for each of

our examples. Namely, we're told the actual price that each of the houses in our data set was sold for. And moreover, this is

an example of a regression problem where the term regression refers to the fact

that we are predicting a real-valued output namely the price. And just to remind you

the other most common type of supervised learning problem is called the

classification problem where we predict discrete-valued outputs such as if we are

looking at cancer tumors and trying to decide if a tumor is malignant or benign.

So that's a zero-one valued discrete output. More formally, in supervised learning, we have

a data set and this data set is called a training set. So for housing prices

example, we have a training set of different housing prices and our job is to

learn from this data how to predict prices of the houses. Let's define some notation

that we're using throughout this course. We're going to define quite a lot of

symbols. It's okay if you don't remember all the symbols right now but as the

course progresses it will be useful to have convenient notation. So I'm gonna use

lower case m throughout this course to denote the number of training examples. So

in this data set, if I have, you know, let's say 47 rows in this table. Then I
have 47 training examples and m equals 47. Let me use lowercase x to denote the

input variables, often also called the features. Those would be the x's here; those would be the input
features. And I'm gonna use y to denote my output variables or the

target variable which I'm going to predict and so that's the second

column here. As a piece of notation, I'm going to use (x, y) to denote a single

training example. So, a single row in this table corresponds to a single training

example and to refer to a specific training example, I'm going to use this

notation (x(i), y(i)). And, we're going to use this to refer to the ith

training example. So this superscript i over here, this is not exponentiation

right? This (x(i), y(i)), the superscript i in parentheses that's just an index into my

training set and refers to the ith row in this table, okay? So this is not x to

the power of i, y to the power of i. Instead (x(i), y(i)) just refers to the ith row of this

table. So for example, x(1) refers to the input value for the first training example so

that's 2104. That's this x in the first row. x(2) will be equal to

1416 right? That's the second x and y(1) will be equal to 460.

The first, the y value for my first training example, that's what that (1)

refers to. So as mentioned, occasionally I'll ask you a question to let you check your

understanding and a few seconds in this video a multiple-choice question

will pop up in the video. When it does, please use your mouse to select what you

think is the right answer.
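As a quick sketch of this notation in Python (the values 2104, 1416, and 460 come from the lecture's table; the other numbers here are hypothetical placeholders), note that the lecture's superscript (i) is 1-indexed while Python lists are 0-indexed:

```python
# Training set: x = house size in square feet, y = price in $1000s.
x = [2104, 1416, 1534, 852]   # input variables / features (1534, 852 are made up)
y = [460, 232, 315, 178]      # target variables (232, 315, 178 are made up)

m = len(x)  # m = the number of training examples

def training_example(i):
    """Return the ith training example (x(i), y(i)); i is 1-indexed as in the lecture."""
    return (x[i - 1], y[i - 1])

print(training_example(1))  # (2104, 460), i.e. x(1) = 2104 and y(1) = 460
print(m)                    # 4 training examples in this toy set
```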

So here's how this

supervised learning algorithm works. We saw that with a training set, like our training set of housing prices, we feed that to our learning algorithm. It's the job of the learning algorithm to then output a function which by convention is

usually denoted lowercase h, and h stands for hypothesis. And what the hypothesis is, is a function that takes as input the size of a house, like

maybe the size of the new house your friend's trying to sell so it takes in the value of

x and it tries to output the estimated value of y for the corresponding house.

So h is a function that maps from x's to y's. People often ask me, you

know, why is this function called hypothesis. Some of you may know the

meaning of the term hypothesis, from the dictionary or from science or whatever. It
turns out that in machine learning, this is a name that was used in the early days of

machine learning and it kinda stuck. It's maybe not a great name for this sort of

function, for mapping from sizes of houses to the predictions, that you know....

I think the term hypothesis, maybe isn't the best possible name for this, but this is the

standard terminology that people use in machine learning. So don't worry too much

about why people call it that. When designing a learning algorithm, the next

thing we need to decide is how do we represent this hypothesis h. For this and

the next few videos, our initial choice for representing the hypothesis will be the following. We're going to represent h as follows. And we will write this as h subscript theta of x equals theta zero plus theta one times x. And as a shorthand, sometimes instead of writing, you know, h subscript theta of x, sometimes there's a shorthand, I'll just write h of x. But more often I'll write it with a

subscript theta over there. And plotting this in the pictures, all this means is that,

we are going to predict that y is a linear function of x. Right, so that's the

data set and what this function is doing, is predicting that y is some straight

line function of x. That's h of x equals theta 0 plus theta 1 x, okay? And why a linear

function? Well, sometimes we'll want to fit more complicated, perhaps non-linear

functions as well. But since this linear case is the simple building block, we will

start with this example first of fitting linear functions, and we will build on

this to eventually have more complex models, and more complex learning

algorithms. Let me also give this particular model a name. This model is

called linear regression or this, for example, is actually linear regression

with one variable, with the variable being x. Predicting all the prices as functions

of one variable X. And another name for this model is univariate linear

regression. And univariate is just a fancy way of saying one variable. So,

that's linear regression. In the next video we'll start to talk about just how

we go about implementing this model.
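As a minimal sketch of this hypothesis in Python (the parameter values below are assumptions purely for illustration; the lectures haven't yet said how to choose them):

```python
def hypothesis(theta0, theta1, x):
    """h_theta(x) = theta0 + theta1 * x: a straight-line prediction of price from size."""
    return theta0 + theta1 * x

# Hypothetical parameters; with these, a 1250 sq ft house (the friend's
# house from the example) is predicted to sell for about $220,000.
print(hypothesis(63.75, 0.125, 1250))  # 220.0 (price in $1000s)
```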


Week 1 – Cost Function
In this video we'll define

something called the cost function, this will let us figure out how to fit the

best possible straight line to our data. In linear regression, we have a training

set that I showed here. Remember our notation: m was the number of training

examples, so maybe m equals 47. And the form of our hypothesis, which we use to make predictions

is this linear function. To introduce a little bit more

terminology, these theta zero and theta one, these are what I

call the parameters of the model. And what we're going to do in

this video is talk about how to go about choosing these two parameter

values, theta 0 and theta 1. With different choices of

the parameters theta 0 and theta 1, we get different hypotheses,

different hypothesis functions. I know some of you will probably

be already familiar with what I am going to do on the slide, but

just for review, here are a few examples. If theta 0 is 1.5 and theta 1 is 0, then the hypothesis
function

will look like this. Because your hypothesis function will

be h of x equals 1.5 plus 0 times x which is this constant value

function which is flat at 1.5. If theta0 = 0, theta1 = 0.5, then

the hypothesis will look like this, and it should pass through this point (2, 1), so that you now have h(x).
Or really h of theta(x), but sometimes

I'll just omit theta for brevity. So h(x) will be equal to just 0.5 times x,

which looks like that. And finally, if theta zero equals one,

and theta one equals 0.5, then we end up with a hypothesis

that looks like this. Let's see,

it should pass through the (2, 2) point. Like so, and this is my new h of x, or my new h subscript theta of x. And by the way, remember, I said that this is h subscript theta of x, but as a shorthand, sometimes I'll just write this as h of x.

In linear regression, we have a training

set, like maybe the one I've plotted here. What we want to do, is come up with
values for the parameters theta zero and theta one so that the straight line

we get out of this, corresponds to a straight line that somehow fits the data

well, like maybe that line over there. So, how do we come up with values, theta zero, theta one, that

corresponds to a good fit to the data? The idea is we get to choose our

parameters theta 0, theta 1 so that h of x,

meaning the value we predict on input x, that this is at least

close to the values y for the examples in our training set,

for our training examples. So in our training set, we're given a number of examples where we know the size of the house and we know the actual price it was sold for. So, let's try to choose values for the parameters so that, at least in the training set, given the x in the training set, we make reasonably accurate predictions for the y values. Let's formalize this. So in linear regression,

what we're going to do is, I'm going to want to solve

a minimization problem. So I'll write minimize over theta0 theta1. And I want this to be small, right?
I want the difference between h(x) and

y to be small. And one thing I might do is try

to minimize the square difference between the output of the hypothesis and

the actual price of a house. Okay.

So let's fill in some details. You remember that I was using

the notation (x(i),y(i)) to represent the ith training example. So what I want really is to

sum over my training set, something i = 1 to m, of the square difference between,

this is the prediction of my hypothesis when it is

input the size of house number i, right? Minus the actual price that house number i was sold for. And I want to minimize the sum over my training set, sum from i equals one through m,

of the difference of this squared error, the square difference between

the predicted price of a house, and the price that it was actually sold for. And just to remind you of notation, m here

was the size of my training set right? So my m there is my number

of training examples. Right, that hash sign is the abbreviation for "number" of training examples, okay? And to make the math a little bit easier, I'm going to actually look at 1 over m times that, so let's try to minimize my average error, and actually minimize one over 2m times that. Putting the 2, the constant one half, in front just makes some of the math a little easier. So minimizing one-half of something, right, should give you the same values of the parameters, theta 0, theta 1, as minimizing that function. And just to be sure,

this equation is clear, right? This expression in here, h subscript theta of x, this is our usual, right? That is equal to theta zero plus theta one x(i). And this notation,

minimize over theta 0 theta 1, this means you'll find me the values of theta 0 and

theta 1 that causes this expression to be minimized and this expression

depends on theta 0 and theta 1, okay? So just to recap: we're posing this problem as, find me the values of theta zero and theta one so that the average, the 1 over 2m, times the sum of squared errors between

my predictions on the training set minus the actual values of the houses

on the training set is minimized. So this is going to be my overall

objective function for linear regression. And just to rewrite this out a little bit

more cleanly, what I'm going to do is, by convention we usually

define a cost function, which is going to be exactly this,

that formula I have up here. And what I want to do is

minimize over theta0 and theta1 my function J(theta0, theta1). Let me just write this out. This is my cost function. So, this cost function is also called the squared error function, or sometimes called the squared error cost function. And you might wonder why we take the squares of the errors. It turns out that this squared error cost function is a reasonable choice and works well for most regression problems. There are other cost functions

that will work pretty well. But the squared error cost function is

probably the most commonly used one for regression problems. Later in this class we'll talk about

alternative cost functions as well, but this choice that we just had should

be a pretty reasonable thing to try for most linear regression problems. Okay. So that's the cost
function. So far we've just seen a mathematical

definition of this cost function. In case this function J of theta zero, theta one seems a little bit abstract, and you still don't have a good

sense of what it's doing, in the next video, in the next

couple videos, I'm actually going to go a little bit deeper into what

the cost function J is doing and try to give you better intuition about what it's computing and why we want to use it.
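The cost function described in this video translates directly into code. Here's a minimal sketch (the function and variable names are my own, not from the lecture):

```python
def cost(theta0, theta1, xs, ys):
    """Squared error cost: J(theta0, theta1) = (1 / 2m) * sum_i (h(x(i)) - y(i))^2."""
    m = len(xs)
    squared_errors = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return squared_errors / (2 * m)

# A hypothesis that fits the data perfectly drives the cost to zero:
print(cost(0.0, 1.0, [1, 2, 3], [1, 2, 3]))  # 0.0
```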


Week 1 – Cost Function Intuition I
In the previous video, we gave the

mathematical definition of the cost function. In this video, let's look at

some examples, to get back to intuition about what the cost function is doing, and

why we want to use it. To recap, here's what we had last time.

We want to fit a straight line to our data, so we had this form for our hypothesis, with these parameters theta zero and theta one, and with different choices of the parameters we end up with different straight-line fits to the data, like so. And there's a cost function, and that was our optimization objective:

So this video, in order to better visualize the cost function J, I'm going

to work with a simplified hypothesis function, like that shown on the right:
So I'm gonna use my simplified hypothesis, which is just theta one times X.

We can, if you want, think of this as setting the parameter theta zero equal to 0.

So I have only one parameter theta one and my cost function is similar to before

except that now H of X that is now equal to just theta one times X. And I have only

one parameter theta one and so my optimization objective is to minimize J of

theta one.

In pictures what this means is that if theta zero equals zero that

corresponds to choosing only hypothesis functions that pass through the origin,

that pass through the point (0, 0).

Using this simplified definition of the hypothesis and cost function, let's try to understand the cost function concept better.

It turns out there are two key functions we want to understand. The first is the hypothesis function, and the second is the cost function.

So, notice that the hypothesis, right, h of x, for a fixed value of theta one, is a function of x. So the hypothesis is a function of the size of the house x. In contrast, the cost function J is a function of the parameter theta one, which controls the slope of the straight line.

Let's plot these functions and try to understand them both better. Let's start with the hypothesis. On
the left, let's say here's my training set with

three points at (1, 1), (2, 2), and (3, 3). Let's pick a value theta one, so when theta one

equals one, and if that's my choice for theta one, then my hypothesis is going to

look like this straight line over here. And I'm gonna point out, when I'm plotting my hypothesis function, the x-axis, my horizontal axis, is labeled x, is labeled, you know, size of the house over here. Now, temporarily, set theta one equals one. What I want to do is figure out what is J of theta one when

theta one equals one. So let's go ahead and compute what the cost function equals for a value of one. Well, as usual, my cost function is defined as follows, right? A sum over my training set of this usual squared error term. And this is therefore equal to one over 2m times the sum of theta one x(i) minus y(i), squared. And if you simplify, this turns out to be zero squared plus zero squared plus zero squared, which is of course just equal to zero. Now, inside the cost function, it turns out each of these terms is equal to zero, because for the specific training set I have, my 3 training examples are (1, 1), (2, 2), (3, 3). If theta one is equal to one, then h of x(i) is equal to y(i) exactly. And so h of x minus y, each of these terms is equal to zero, which is why I find that J of one is equal to zero.

So, we now know that J of one is equal to zero. Let's plot that. What I'm gonna do on the right is plot my cost function J. And notice, because my cost function is a function of my parameter theta one, when I plot my cost function, the horizontal axis is now labeled with theta one. So I have J of one equals zero, so let's go ahead and plot that. I end up with an x over there. Now let's look at

some other examples. Theta-1 can take on a range of different values. Right? So

theta-1 can take on the negative values, zero, positive values. So what if theta-1

is equal to 0.5. What happens then? Let's go ahead and plot that. I'm now going to

set theta-1 equals 0.5, and in that case my hypothesis now looks like this. As a line

with slope equal to 0.5. And let's compute J of 0.5. So that is going to be one over 2m of my usual cost function. It turns out that the cost function is going to be the square of the height of this line, plus the square of the height of that line, plus the square of the height of that line, right? 'Cause it's just this vertical distance, that's the difference between, you know, y(i) and the predicted value, h of x(i), right? So the first example

is going to be 0.5 minus one squared. Because my hypothesis predicted 0.5.

Whereas, the actual value was one. For my second example, I get, one minus two
squared, because my hypothesis predicted one, but the actual housing price was two.

And then finally, plus 1.5 minus three squared. And so that's equal to one over two times three, because m, my training set size, right, I have three training examples. And then, simplifying what's inside the parentheses, it's 3.5. So that's 3.5 over six, which is about 0.58. So now we know that J of 0.5 is about 0.58. Let's go and plot that. So we plot that, which is maybe about over there. Okay? Now, let's do one more. How

about if theta one is equal to zero, what is J of zero equal to? It turns out that

if theta one is equal to zero, then H of X is just equal to, you know, this flat

line, right, that just goes horizontally like this. And so, measuring the errors.

We have that J of zero is equal to one over 2m times one squared plus two squared plus three squared, which is one sixth times fourteen, which is about 2.3. So

let's go ahead and plot as well. So it ends up with a value around 2.3
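The values of J worked out by hand above can be checked numerically. This sketch uses the lecture's three-point training set (1, 1), (2, 2), (3, 3) and the simplified hypothesis h(x) = theta one times x:

```python
xs, ys = [1, 2, 3], [1, 2, 3]

def J(theta1):
    """Cost for the simplified hypothesis h(x) = theta1 * x."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

print(J(1.0))             # 0.0   -> theta1 = 1 fits the data perfectly
print(round(J(0.5), 2))   # 0.58  (3.5 / 6)
print(round(J(0.0), 2))   # 2.33  (14 / 6)
print(J(-0.5))            # 5.25  -> the high-error negative-slope case
```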

and of course we can keep on doing this for other values of theta one. It turns

out that you can have you know negative values of theta one as well so if theta

one is negative then h of x would be equal to say minus 0.5 times x then theta

one is minus 0.5 and so that corresponds to a hypothesis with a

slope of negative 0.5. And you can actually keep on computing these errors.

This turns out, you know, for minus 0.5, to have a really high error. It works out to be something like 5.25. And so on, for the different values of theta

one, you can compute these things, right? And it turns out that as you compute this range of values, you get something like that. And by computing the range of values, you can actually slowly trace out what the function J of theta looks like, and that's what J of theta is. To recap, for each value of theta one, right? Each value

of theta one corresponds to a different hypothesis, or to a different straight

line fit on the left. And for each value of theta one, we could then derive a

different value of j of theta one. And for example, you know, theta one=1,

corresponded to this straight line straight through the data. Whereas theta

one=0.5. And this point shown in magenta corresponded to maybe that line, and theta

one=zero which is shown in blue that corresponds to this horizontal line. Right, so for each

value of theta one we wound up with a different value of J of theta one and we
could then use this to trace out this plot on the right. Now you remember, the

optimization objective for our learning algorithm is we want to choose the value

of theta one. That minimizes J of theta one. Right? This was our objective function for

the linear regression. Well, looking at this curve, the value that minimizes j of

theta one is, you know, theta one equal to one. And lo and behold, that is indeed

the best possible straight line fit through our data, by setting theta one

equals one. And just, for this particular training set, we actually end up fitting

it perfectly. And that's why minimizing j of theta one corresponds to finding a

straight line that fits the data well. So, to wrap up: in this video, we looked at some plots to understand the cost function. To do so, we simplified the algorithm so that it only had one parameter theta one, and we set the parameter theta zero to be zero. In the next video, we'll go back to the original problem

formulation and look at some visualizations involving both theta zero

and theta one. That is without setting theta zero to zero. And hopefully that will give

you, an even better sense of what the cost function j is doing in the original

linear regression formulation.
