
Matrix Factorisation — Part 7


Rakesh4real

Jul 29, 2019 · 6 min read

Table of Contents:
1. Introduction and Recommendation Framework

2. Evaluating Recommendation Systems

3. Content Based Recommendations

4. Neighborhood Based Collaborative Filtering

5. User and Item Based Collaborative Filtering

6. KNN Recommendations

7. Matrix Factorisation

8. Deep Learning — Introduction

9. Restricted Boltzmann Machines

10. AutoRecs

11. Amazon DSSTNE and Sage Maker

12. Real-World Challenges and Solutions


Collaborative filtering is a good method. But if it is that good, why are we seeking an
alternative approach?

It is sensitive to NOISY DATA and SPARSE DATA

Good results are obtained only if both of the conditions below are satisfied.

— A large data set is available

— The data is nice and clean

MODEL BASED METHODS


Instead of finding items or users similar to each other, model-based methods apply Machine Learning and Data Science. The basic steps involved are:

Training our model with data

Predicting using the model

MATRIX FACTORISATION
It has many subcategories of techniques.

CREEPY! It manages to find broader features of users/items on its own (e.g. Action or Romantic). The math doesn't know what to call the newly found features; they are simply described by matrices.

MAIN IDEA:

Describe users and items as combinations of different amounts of underlying features.

For example, we may say Bob likes action movies as well as comedy movies. The same can be expressed as

Bob = 50% Action + 20% Comedy


Matching movies to users based on these attributes is a good basis for predictions.
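As a toy illustration (the numbers here are hypothetical, not from any real dataset), both a user's taste and a movie's makeup can be written as vectors over the same latent genres, and a predicted affinity is simply their dot product:

```python
import numpy as np

# Hypothetical latent-genre weights: [Action, Comedy, Romance]
bob = np.array([0.50, 0.20, 0.05])      # Bob = 50% Action + 20% Comedy + ...
movie = np.array([0.90, 0.10, 0.00])    # a mostly-action movie

# How well the movie's makeup matches Bob's taste
affinity = np.dot(bob, movie)
print(f"Predicted affinity: {affinity:.2f}")   # 0.47
```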

PRINCIPAL COMPONENT ANALYSIS (PCA) ON MOVIE RATINGS


PCA is a statistical procedure which reduces the dimensionality of our user-item matrix without losing any important information!

It is used as:

A dimensionality reduction tool (we can't say what the newly reduced dimensions mean)

A feature extraction tool

Note: Often, the dimensions it finds correspond to features humans have learnt to associate with items (e.g. how 'action-y' a movie is)

Whatever it is about movies that causes individuals to rate them differently, PCA extracts those 'latent features'.
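A minimal sketch of this, assuming a tiny, fully filled ratings matrix (plain PCA cannot handle missing entries) and using scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA

# Rows = users, columns = movies; hypothetical ratings with no missing values
R = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 2.0, 1.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

# Compress each user's 4-movie rating vector down to 2 latent features
pca = PCA(n_components=2)
user_factors = pca.fit_transform(R)

print(user_factors.shape)              # (4, 2): each user described by 2 numbers
print(pca.explained_variance_ratio_)   # variance each latent feature preserves
```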


If R (the original matrix) has some missing values, we can reconstruct R by filling in the blank values!

SIGMA MATRIX
It is a simple diagonal matrix used only to scale the values we get to the proper magnitude.

We can multiply this sigma matrix into M or U-Transpose, and R will still be just the product of two matrices.

Reconstruct R to get ratings of all users for all items!

We can even predict ratings using dot products, via the reconstruction formula described above.
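A minimal NumPy sketch of this point, with hypothetical hand-picked factors: folding the diagonal sigma matrix into M leaves R as the product of just two matrices, and any single rating is then a dot product.

```python
import numpy as np

# Hypothetical factors: R is defined as the product M @ Sigma @ Ut
M = np.array([[0.8, 0.1],
              [0.7, 0.3],
              [0.1, 0.9]])
Sigma = np.diag([5.0, 3.0])
Ut = np.array([[0.9, 0.4, 0.1],
               [0.1, 0.5, 0.9]])

R = M @ Sigma @ Ut

# Folding Sigma into M: R is still just the product of two matrices
print(np.allclose(R, (M @ Sigma) @ Ut))    # True

# One user's rating for one item is a single dot product
print((M @ Sigma)[0] @ Ut[:, 2])           # equals R[0, 2]
```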

SINGULAR VALUE DECOMPOSITION (SVD)


A way of computing M, Sigma and U-Transpose all at once, efficiently getting all three factors in one shot!

Very accurate results

Used widely during the Netflix Prize competition
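A sketch of what "one shot" means in practice, using NumPy's SVD routine on a tiny hypothetical matrix with no missing values (real SVD requires a complete matrix, as the next section discusses):

```python
import numpy as np

R = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 2.0, 1.0],
    [1.0, 2.0, 5.0, 4.0],
])

# One call returns all three factors at once
M, sigma, Ut = np.linalg.svd(R, full_matrices=False)

# Keep only the top-2 latent features (a rank-2 approximation of R)
k = 2
R_approx = M[:, :k] @ np.diag(sigma[:k]) @ Ut[:k, :]

# A single reconstructed rating is the dot product of a user row and an item column
print(R_approx[0, 2])   # low-rank estimate of user 0's rating for item 2
```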



But wait! Just before applying PCA to the original matrix R, how do we deal with missing values?

You may fill missing values with averages of some sort… but there is a better way!

Assume we have some ratings for any given row/column in M and U-Transpose

We can treat this as an error-minimisation problem

Find the values of those complete rows and columns that best minimise the errors in the known ratings of R

Use Stochastic Gradient Descent (Just one way to do it)

Apache Spark uses a different technique called ALS (Alternating Least Squares)

Note: You might be confused here, because we are talking about learning the values of the matrices M and U.

Actually, we are predicting ratings rather than computing them directly, which is what SVD does. We are not doing real SVD recommendations, because SVD cannot operate on missing data. Hence, it is an 'SVD-inspired algorithm', not SVD itself. The winner of the Netflix Prize was a variation of SVD called SVD++.
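A bare-bones sketch of this 'SVD-inspired' learning loop (not the surpriselib implementation; the learning rate, regularisation strength and data are all hypothetical): only the known ratings ever enter the update, so missing values are no problem.

```python
import numpy as np

# Known ratings as (user, item, rating) triples; missing entries simply never appear
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 2.0), (2, 2, 5.0)]
n_users, n_items, k = 3, 3, 2

rng = np.random.default_rng(42)
P = rng.normal(scale=0.1, size=(n_users, k))   # user factor matrix
Q = rng.normal(scale=0.1, size=(n_items, k))   # item factor matrix

lr, reg = 0.02, 0.02                           # hypothetical hyperparameters
for epoch in range(200):
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]                  # error on one known rating
        # Simultaneous SGD step on both factor vectors
        P[u], Q[i] = (P[u] + lr * (err * Q[i] - reg * P[u]),
                      Q[i] + lr * (err * P[u] - reg * Q[i]))

# Any (user, item) rating, observed or not, is now predicted by a dot product
print(P[0] @ Q[2])
```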

TIPS:
You can see the source code for the SVD algorithm used in surpriselib on GitHub, but it is too complex to understand easily.

Never write your own algorithm. The odds are too high that you will mess up somewhere. Use a third-party library that has been used and validated by others.


SVD vs. SVD++


[Figure: the actual loss function used during SGD]

In SVD++, merely rating an item at all is an indication of some sort of interest in the item, no matter what the rating was!

SVD++ outperformed SVD

Worked really well!!

It is spooky, because we don't understand how the latent features work.
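For reference, a sketch of the SVD++ prediction rule as published by Koren (all the arrays here are hypothetical inputs, not values from the article): on top of the usual biases and factor dot product, every item the user has rated at all contributes an implicit-feedback vector.

```python
import numpy as np

def svdpp_predict(mu, b_u, b_i, p_u, q_i, y_rated):
    """SVD++ prediction (Koren, 2008):
    global mean + user bias + item bias
    + q_i . (p_u + implicit feedback from ALL items the user rated).

    y_rated: array of shape (|N(u)|, k), one implicit factor vector per item
    the user has rated, regardless of the score given."""
    implicit = y_rated.sum(axis=0) / np.sqrt(len(y_rated))  # |N(u)|^-1/2 * sum of y_j
    return mu + b_u + b_i + q_i @ (p_u + implicit)
```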

GIVEN THE USAGE OF SVD IN THE NETFLIX COMPETITION AND ITS SUCCESS, A LOT OF RESEARCH FOCUSED ON SVD. THE ONES LISTED BELOW ARE JUST SOME OF THE ALGORITHMS THAT WERE DEVELOPED.

I. Factorisation Machines
Well suited for predicting ratings/clicks in Recommendation System

More 'general purpose'

Handles sparse data with ease, unlike SVD, which has to be shoehorned into the problem

Available in Amazon SageMaker

Only downside — WORKS ONLY WITH CATEGORICAL DATA!
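A sketch of the factorisation machine model equation (Rendle, 2010), with every array hypothetical: pairwise feature interactions are modelled through dot products of low-dimensional latent vectors, which is exactly what lets it handle sparse, one-hot encoded categorical data.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorisation machine: bias + linear terms + all pairwise interactions,
    where the weight of interaction (i, j) is the dot product <V[i], V[j]>."""
    linear = w @ x
    # O(k*n) identity for sum_{i<j} <V[i], V[j]> * x[i] * x[j]
    interactions = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))
    return w0 + linear + interactions
```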

II. TimeSVD++
Used for predicting the next item in a series of events.

III. Factorized Personalized Markov Chains

Same as above: used for predicting the next item in a series of events.

IV. PLSA (Promising!)

Extracts latent features from the content itself

Use PLSA on movie titles and descriptions and match them up with users

As it is content based, it doesn't do well on its own; combine it with user behaviour data.

HYPERPARAMETER TUNING (SVD)
Hyperparameters are parameters chosen by the ML engineer, rather than learned from the data, in order to make effective predictions. Finding the optimal hyperparameters is called hyperparameter tuning.

Big problem in ML in general

A lot of algorithms are very sensitive to parameters such as learning rates

Hyperparameters vary with datasets

In the case of SVD, the hyperparameters are —

How many latent factors to extract?

How many dimensions to reduce to?

Note: Both of them depend on the nature of the data we have.

Example Code
The code below helps us understand how we can tune hyperparameters in surpriselib using the GridSearchCV package.
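A representative snippet (the article's original screenshot is not reproduced here; this follows surpriselib's documented GridSearchCV API, with hypothetical parameter values):

```python
from surprise import SVD, Dataset
from surprise.model_selection import GridSearchCV

data = Dataset.load_builtin('ml-100k')   # MovieLens 100k ratings

# 'param_grid' defines what to test: every combination will be tried
param_grid = {'n_epochs': [20, 30], 'lr_all': [0.005, 0.010], 'n_factors': [50, 100]}

# 'measures' defines how to test; 'cv' sets the number of cross-validation folds
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

print(gs.best_score['rmse'])    # best average RMSE across the folds
print(gs.best_params['rmse'])   # the combination of hyperparameters that achieved it
```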


The GridSearchCV package:

Helps in hyperparameter tuning

Defines a grid of the different parameter values you want to try (every possible combination is tested)

'param_grid' defines what to test

'measures' defines how to test

'cv' sets the number of cross-validation folds

SLIM — Sparse Linear Methods (2011)


MAIN IDEA: A 'sparse aggregation' of the other things the user has rated, multiplied by a 'sparse aggregation' of weights (magic!)

Exciting results!

Outperformed all other recommenders (hit rates in top-N recommendations)

Applicable not only to movies but also to books, music, credit cards and much more!


Because the user has rated only some items, weights exist only for those items and only those terms can be calculated; hence it is a 'sparse' aggregation.

This extends to the entire user-item rating matrix with the formula below.
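For reference, the formula from the SLIM paper (Ning and Karypis, 2011), since it is not reproduced here as an image: the full prediction matrix is A-hat = A · W, where A is the user-item rating matrix and W is a learned sparse item-item weight matrix with a zero diagonal.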

HOW ARE THE WEIGHTS COMPUTED?


Complicated

So complicated that many people can't implement it in code

Similar to SVD, but requires more advanced mathematical concepts like

— Matrix Frobenius Norm

— L1 Norm Regularisation
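A toy sketch of how those pieces fit together, using scikit-learn's ElasticNet as a stand-in for the paper's per-column optimisation (the regularisation strengths are hypothetical, and the exact mapping to the paper's beta and lambda is glossed over):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def slim_weights(A, alpha=0.001, l1_ratio=0.5):
    """Sketch of SLIM: per the paper, each column w_j of W minimises
    1/2 * ||a_j - A w_j||_2^2 + beta/2 * ||w_j||_2^2 + lambda * ||w_j||_1
    subject to w_j >= 0 and w_jj = 0 (an item may not predict itself)."""
    n_items = A.shape[1]
    W = np.zeros((n_items, n_items))
    for j in range(n_items):
        X = A.copy()
        X[:, j] = 0.0                       # enforce the zero-diagonal constraint
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                           positive=True, fit_intercept=False)
        model.fit(X, A[:, j])
        W[:, j] = model.coef_
    return W                                # predictions: A_hat = A @ W
```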

If SLIM comes your way, give it a second look — Promising Stuff!

NEXT>>>

