07 09 P3-En

So right now we will start talking about collaborative filtering. This is actually the main part of today's lecture. So let's write it down: collaborative filtering.

So now I want to start thinking about this problem in the same terms as what we did when we were talking about supervised learning. You remember what we did there: we always said we have some objective, and we want to optimize this objective. So what I would like to do is say, OK, let's just think about the problem of recommending products as a standard objective-driven algorithm.
So, as I said earlier, we have our matrix Y, and I'll redraw it here so you can see it better. The matrix Y is of size n by m: the rows are our users, the columns are our movies. And I'm going to write here, so that you don't forget: this is what is given to me. It is a matrix with very few entries filled in; it's a sparse matrix.
Now, the goal of my computation is to build another matrix, which I will call X, of the same size n by m. This is the matrix that will be the output of my algorithm, and here I want every single entry to be filled; everything has to be completed. So you are given some initial entries, and the algorithm needs to output the matrix X, which will contain a prediction for every single user and movie. This is your goal.
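To make this setup concrete, here is a minimal sketch in NumPy. The toy numbers are made up, and using NaN to mark the unobserved entries is just one convenient encoding, not something fixed by the lecture:

```python
import numpy as np

# A toy 4-user x 3-movie rating matrix Y (n = 4, m = 3).
# np.nan marks entries that were never observed.
Y = np.array([
    [5.0,    np.nan, 1.0],
    [np.nan, 4.0,    np.nan],
    [np.nan, np.nan, 2.0],
    [3.0,    np.nan, np.nan],
])

# D: all pairs (a, i) such that Y[a, i] is given.
D = [(int(a), int(i)) for a, i in zip(*np.where(~np.isnan(Y)))]
print(D)  # [(0, 0), (0, 2), (1, 1), (2, 2), (3, 0)]

# The goal is to output a dense matrix X of the same n x m shape,
# with a prediction in every single entry.
```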
Now, if I want, again, to think about it in terms of objectives, which is what we were discussing before, I can say that we should look at the empirical risk that we are trying to minimize. In other words, for each such possible matrix, we need to know how good this matrix is.

So the first thing that we can do, and you will see now something very similar to what you have already seen in regression, because it has the same spirit, is to say that our objective should record the following fact: if for this user I have some prediction, and we look at what is in the given matrix for the same user and movie, the two should be similar, or ideally maybe the same, because for these filled entries we actually know the answer. So the first thing we are going to write: we go through all the pairs of users and movies that belong to D, where D is the set of all pairs (a, i) such that the entry Y_{a,i} is given.

In other words, we're saying that for all the cases where you already know the answer, because the answer was given to you, the loss should be minimal; you want the two values to be very close to each other. So we are going to write Y_{a,i} minus X_{a,i}, and I'm going to use here, again, the same squared loss as in regression. So this term tells you: I want to be very close to what was given to me. Plus, in addition to it, I want to do regularization, the same way as we did in linear regression, where we don't want our parameters to become really unruly large and we keep their norm close to 0. So in this case we have our hyperparameter, lambda, and we go through all the entries of the matrix X, look at their squares, and want this norm to be minimal.
So if it helps you, you can also write it like this:

J(X) = \sum_{(a,i) \in D} \frac{(Y_{a,i} - X_{a,i})^2}{2} + \frac{\lambda}{2} \sum_{a,i} X_{a,i}^2

This is the objective that we will try to minimize. So we go through all possible matrices X, and we want to find the one which actually makes this empirical risk the smallest.
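As a small illustration, this objective can be written as a function of X. This is a sketch under the assumptions of the toy example above (NaN encodes the missing entries, and the function name is ours):

```python
import numpy as np

def empirical_risk(X, Y, lam):
    # First term: squared loss over the observed pairs (a, i) in D.
    observed = ~np.isnan(Y)
    data_term = 0.5 * np.sum((Y[observed] - X[observed]) ** 2)
    # Second term: lambda/2 times the sum of squares of ALL entries of X.
    reg_term = 0.5 * lam * np.sum(X ** 2)
    return data_term + reg_term
```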
So how can we do it? Again, exactly the same way as we've done in linear regression. First of all, I want to notice that, the way I formulated the problem, every single entry here is independent of every other. Whenever I'm deciding about the preference of the first user for the first movie and for the second movie, there is no connection between them. So actually I don't have to keep this sum; I can independently estimate each one of those X_{a,i}'s. And the difference would be the following: whether this X_{a,i} is actually part of this set D or not. Because for all those that are part of the set D, we will need to look at both of these terms; for those which are not in D, we only look at the regularization term.
So let's just start. Looking again, we will make the assumption that the particular X_{a,i} I am looking at belongs to D. So, in this case, as I've said, we will take the part of J(X) corresponding to this a, i and do the same thing we've done previously, which is just to differentiate with respect to this X_{a,i}. And there is no sum here, again, because we are looking at one X_{a,i} at a time. So we will take

\frac{(Y_{a,i} - X_{a,i})^2}{2} + \frac{\lambda}{2} X_{a,i}^2

So now we take this derivative, and you will do it as an exercise: you take this derivative, set it equal to 0, do the whole computation, and you discover something really funny here. What you will discover is that X_{a,i}, in this case, will be equal to Y_{a,i} divided by 1 + \lambda. If you finish your computation, so you take this derivative, it's a very simple derivative, and set it equal to 0, that is what you're going to get.
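For completeness, the little derivation that the exercise asks for can be written out like this:

```latex
\frac{\partial}{\partial X_{a,i}}
\left[ \frac{(Y_{a,i} - X_{a,i})^2}{2} + \frac{\lambda}{2} X_{a,i}^2 \right]
= -(Y_{a,i} - X_{a,i}) + \lambda X_{a,i} = 0
\quad\Longrightarrow\quad
(1 + \lambda)\, X_{a,i} = Y_{a,i}
\quad\Longrightarrow\quad
X_{a,i} = \frac{Y_{a,i}}{1 + \lambda}.
```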
If you do exactly the same thing for the case where (a, i) is not part of D, which means that in this case you only need to look at the derivative of the regularization term, and do exactly the same computation, which you will do as part of your exercise, you will find that X_{a,i} is equal to 0.
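A quick numeric sanity check of this closed-form solution, continuing the toy Y from the sketch above (the value of lam is arbitrary):

```python
lam = 1.0
observed = ~np.isnan(Y)

# Observed entries get Y/(1 + lam); unobserved entries get exactly 0.
X = np.where(observed, Y / (1.0 + lam), 0.0)
print(X)
# [[2.5 0.  0.5]
#  [0.  2.  0. ]
#  [0.  0.  1. ]
#  [1.5 0.  0. ]]
```

With lam = 1, every known rating is simply halved and every unknown rating is predicted as 0, which is exactly the degenerate behavior described next.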

So see what happened here; this is something really funny. When we did our estimation, we said: for all the entries where you didn't know what the value should be, you just put zeros everywhere. And for those where you knew what the value should be, you actually corrupted the value. Instead of taking Y_{a,i}, which is the correct answer, you take Y_{a,i} and divide it by 1 + \lambda. And the more regularization you do, the more corruption you introduce. So the solution that you got here is even worse than what you had in the beginning; if you just trivially filled in zeros and kept the given values as they were, you would get better numbers. So it doesn't make any sense. So there is something wrong, correct, in this particular way of thinking about it, which we exported directly from our supervised learning problems?
And the answer comes from the fact that, as you remember, when I started taking the derivatives, I told you we were going to take them for each entry independently. How many parameters do I have here? I have n multiplied by m parameters. And, again, if you need to make it concrete for yourself, think about half a million users and 18,000 movies: that is on the order of nine billion parameters, which we are estimating from a very small number of given entries. There is no dependence whatsoever between the different assignments that we are making; we're treating every choice independently. And it is not surprising that, since we're not modeling dependency in any way, we're really losing the important connections. This was the first reason why we decided to look at this problem differently: to find the hidden connections between different users and between different products.
