
Matrix Factorization Methods for

Recommender Systems
Shameem Ahamed Puthiya Parambath
June 20, 2013

Master’s Thesis in Computing Science, 15 credits


Under the supervision of:
Prof. Dr. Patrik Eklund, Umeå University, Sweden
Examined by:
Dr. Jerry Eriksson, Umeå University, Sweden

Umeå University
Department of Computing Science
SE-901 87 UMEÅ
SWEDEN

Abstract

This thesis is a comprehensive study of matrix factorization methods used


in recommender systems. We study and analyze the existing models, specifically
probabilistic models used in conjunction with matrix factorization methods, for
recommender systems from a machine learning perspective. We implement two
different methods suggested in scientific literature and conduct experiments on
the prediction accuracy of the models on the Yahoo! Movies rating dataset.
Contents

Abstract

List of Figures

List of Tables

1 Introduction
1.1 Introduction

2 Recommender Systems
2.1 Introduction
2.2 Recommendation System Models
2.3 Summary

3 Bayesian Matrix Factorization
3.1 Factor Analysis
3.2 Bayesian Model
3.3 Summary

4 Experiments
4.1 Dataset
4.2 Experiments
4.3 Summary

5 Related Work and Conclusion
5.1 Related Work
5.2 Conclusion

6 Acknowledgements

Bibliography

List of Figures

4.1 Probabilistic Matrix Factorization
4.2 Bayesian Probabilistic Matrix Factorization

List of Tables

2.1 User-Item Rating Matrix
4.1 Dataset Field Description



Chapter 1

Introduction

1.1 Introduction
The enormous growth of data has resulted in the foundation of new research areas
within the field of computer science. Recommender systems, completely automated
systems that analyze user preferences and predict user behavior, are one of them.
The research interest in this area is still very high, mainly due to the practical signifi-
cance of the problem. As stated by Adomavicius and Tuzhilin [1], "recommender sys-
tem helps to deal with information overload and provide personalized recommenda-
tions, content and service". Most online companies have already incorporated
recommender systems into their services. Examples of such systems include product
recommendations by Amazon, product advertisements shown by Google based on the
search history, and movie recommendations by Netflix, Yahoo! Movies and MovieLens.

Recommender systems fall under the broad research field called collabo-
rative filtering. In the scientific literature the terms collaborative filtering and
recommender systems are used interchangeably. A collaborative filtering system con-
sists of a single task, filtering user data according to the user preferences, whereas a recom-
mender system consists of two tasks: first, predicting the ratings, and second,
ranking the predictions. According to Resnick and Varian [2], a recommender system
differs from collaborative filtering systems in two aspects. First, the recommenders
may not explicitly collaborate with end users. Second, recommenders suggest par-
ticularly interesting items in addition to filtering out noise.

Matrix factorization methods have recently received greater exposure, mainly
as unsupervised learning methods for latent variable decomposition. They have been suc-
cessfully applied in spectral data analysis and text mining [3]. Most matrix
factorization models are based on the linear factor model. In a linear factor model,
the rating matrix is modeled as the product of a user coefficient matrix and an item factor
matrix. During the training step, a low rank approximation matrix is fitted under a
given loss function. One of the most commonly used loss functions is the sum-squared
error. The low rank approximation that minimizes the sum-squared error can be found using
Singular Value Decomposition (SVD) or QR factorization.
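As a quick sketch of this idea (using NumPy and a small made-up matrix, not data from this thesis), the best rank-k approximation in the sum-squared error sense can be obtained by truncating the SVD:

```python
import numpy as np

# Small synthetic, fully observed "rating" matrix (made-up values)
R = np.array([[5., 4., 1., 1., 2.],
              [4., 5., 1., 2., 1.],
              [1., 1., 5., 4., 4.],
              [2., 1., 4., 5., 4.]])

# Full SVD: R = U diag(s) Vt, singular values sorted in decreasing order
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keeping the k largest singular values yields the best rank-k
# approximation under the Frobenius (sum-squared error) norm.
k = 2
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("rank:", np.linalg.matrix_rank(R_k))
print("residual error:", np.linalg.norm(R - R_k))
```

The residual Frobenius error of the truncated fit equals the discarded singular values combined, which is why dropping the smallest singular values gives the optimal low rank fit.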

SVD and QR factorization have been successfully employed in information
retrieval systems [4]. Latent Semantic Indexing (LSI) [5] is a latent model based on
Singular Value Decomposition that finds hidden semantic content in a given text cor-
pus. A probabilistic approach to the LSI model called Probabilistic Latent Semantic
Indexing (PLSI) was suggested by Hofmann [6, 7], and it is the basis for more complex
latent generative models including Latent Dirichlet Allocation (LDA) [8, 9].

Even though SVD and QR factorization methods have been successfully applied to
solve practical problems, they suffer from severe overfitting when applied to sparse ma-
trices. Most real-world datasets are very sparse, i.e. the density ratio can be very low. The
density ratio is one of the measures of the sparseness of a data matrix. It is defined as

density ratio δ = |R| / (|U| ∗ |I|)

where |R| is the number of known entries in the rating matrix, i.e. the number of
observed ratings, |U| is the total number of users and |I| is the total number of items.
The density ratio of real-world datasets can be less than 0.25%. When SVD and QR
factorization are applied to sparse data, the error function has to be modified to consider
only the observed ratings by setting non-observed entries to zero. This minor
modification results in a non-convex optimization problem.

Factor based models have been used extensively in recommender systems
research, as they provide a simple framework for modeling user preference. The fun-
damental idea behind factor models is that user preference can be described by
latent factors. In linear factor models, the preference ratings are decomposed into
two matrices, one representing the user preferences and the other representing the item
factors. As it turns out, matrix factorization methods provide one of the simplest and
most effective approaches to recommender systems [10, 11]. In this thesis we explore
the probabilistic matrix factorization methods used for recommender systems.

We begin with a general discussion of recommender systems from a machine
learning perspective. We briefly explain two of the commonly used models for rec-
ommendation systems. Since we will not be able to cover the mathematical theory
behind all of the methods, an in-depth study is not presented in this thesis, but
proper references are provided for the enthusiastic reader. The formulation of the problem
and recommendation system approaches based on clustering and dimensionality reduction
are briefly given in Chapter 2.

The probabilistic models for factor analysis used in recommender systems are
discussed in Chapter 3. This chapter is organized in a step-wise manner. We start
with a simple probabilistic model for matrix factorization and develop it into
a proper Bayesian model as we go along. We cover some of the mathematical
theory behind Bayesian methods there.

In Chapter 4, we discuss the dataset used for the implementation. We elaborate
on the experimental results we obtained with two different implementations. Discussion
and analysis of the results are also given in that chapter. We conclude with a brief
description of related work in Chapter 5.
Chapter 2

Recommender Systems

2.1 Introduction
In this chapter, we describe different models used to build recommender systems.
We first define the formal problem statement associated with recommender systems.
A brief discussion of the limitations of each approach is also
provided at the end of each model description.

According to Resnick and Varian [2], "A recommender system assist and aug-
ment the natural social process of word of mouth recommendation provided by
friends, recommendation letters, book reviews etc for different products and ser-
vices". In the research community the term collaborative filtering is used as a syn-
onym for recommendation system. But as Resnick and Varian pointed out [2], a rec-
ommender system need not be a collaborative one, and in addition to filtering out
information, it can surface interesting hidden items. In this thesis, we use the
generic term recommender systems, and collaborative filtering is described as a spe-
cific model of recommender system.

Recommender systems are a very active research field in the data mining com-
munity due to their large industrial potential. Every e-commerce company implements
recommendation systems for the products it is trying to sell. The online retailer
Amazon provides customers with "Customers Who Bought This Item Also Bought"
recommendations for each product viewed. Yahoo! News provides two kinds of rec-
ommendations: "News For You", a user centric news recommendation
for each user, and "Explore Related Contents", a news centric recommendation for
each viewed news item. Other major applications of recommender systems are in rec-
ommending movies, as provided by Netflix, Yahoo! Movies and MovieLens, and
music, as in Yahoo! Music and Last.fm.
2.1.1 Formal Definition
According to Adomavicius and Tuzhilin [1], the recommendation problem is the prob-
lem of estimating ratings for the items that have not been seen by a user in the
user-item setting. The rating estimation is based on the ratings given by the user
to other items and on the meta information associated with the users and items.

The recommendation problem is formally defined in the user-item context and
can be formulated as follows. Let C
be the set of all users and let S be the set of all possible items, such as music, movies,
books, electronic items etc. The set S of possible items and the set C of possible users
can be very large, millions in most cases. The recommendation system takes the set
of users, the set of items and the set of partial ratings for some users and some items,
and outputs the items with top ratings for a selected user. Intuitively, this can be
decomposed into two sub-problems:

• Finding the unknown ratings associated with users and items.

• Sorting the ratings to select the top k items.

The second problem is a trivial one of sorting a hash structure or dictionary,
which can be solved using any efficient sorting algorithm given in any algorithms
textbook, as in Johnsonbaugh and Schaefer [12]. The rating of an item given by a
user can be expressed as a utility function. Let u be a utility function that measures the
usefulness of item s to user c, i.e., u : C × S → R, where R is a totally ordered set of
integer or real numbers, as we consider only numeric values for ratings. A recom-
mender system finds the item s′ ∈ S for each user c ∈ C that maximizes the associated
utility function. Mathematically,

∀c ∈ C, s′ = arg max_{s ∈ S} u(c, s)

As mentioned earlier, in the context of recommender systems the utility function
is the rating, which is either a positive integer valued or real valued number that
indicates how much a particular user liked or disliked an item. One example of this util-
ity function can be given here: the online entertainment database IMDB [13] allows
users to rate movies, tv shows etc. Suppose a user Aybuke gave the movie "Pride and
Prejudice" a rating of 9 (out of 10), which clearly indicates that she liked the movie
very much and would prefer similar movies.
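Continuing the decomposition above, once the unknown ratings have been predicted, the second sub-problem is an ordinary sort. A minimal sketch (the utility values here are hypothetical, standing in for the output of any rating predictor):

```python
# Predicted utilities u(c, s) for one user c over unrated items s
# (hypothetical values standing in for the output of a rating predictor)
predicted = {"Looper": 7.8, "Memento": 8.9, "Atonement": 6.1,
             "Pride & Prejudice": 9.2}

def top_k(utilities, k):
    """Sort items by predicted utility and return the k best."""
    return sorted(utilities, key=utilities.get, reverse=True)[:k]

print(top_k(predicted, 2))  # ['Pride & Prejudice', 'Memento']
```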

The utility value u, i.e. the rating, can be acquired in two ways, depending on the
application. It can be specified explicitly by the user or computed automatically by
the application from the user profiles on the fly. Each user in the user space C is
associated with a profile. A profile is a representation of a user that contains various
characteristics such as age, gender, occupation, country etc. In addition, each user
profile is associated with a unique user id. Similarly, each item is associated with cor-
responding metadata. For example, in the case of the IMDB database mentioned above,
the item metadata can be the movie/show name, genre, director name, actor/actress
names etc.

The user ratings are represented as a matrix called the user-item rating matrix. A sam-
ple user-item rating matrix for the movie preferences of three users is given in
Table 2.1. Unknown values in the matrix are represented using the symbol ⊥. Matrix
representation is one of the most common approaches as it provides an intuitive way
for manipulation and prediction. As we mentioned earlier, the underlying problem
of recommender systems is to predict the ratings of each user for each item given the
limited rating values available. This means the utility u needs to be extrapolated to
the whole user-item space given by C × S. The above statement gives an idea about
the sparsity of this matrix. In most real cases the user-item matrix sparsity rate is
more than 99%, i.e. less than 1% of the entries in the user-item matrix are filled.

User       Looper   Memento   Atonement   Pride & Prejudice
Aybuke     8        ⊥         7           ⊥
Sham       ⊥        7         8           9
Shani      ⊥        ⊥         ⊥           6

Table 2.1: User-Item Rating Matrix

The ratings of the not-yet-rated items can be estimated in many different
ways using techniques from machine learning, data mining, approximation theory
and artificial intelligence.
2.2 Recommendation System Models
In this section we provide a brief survey of the methods suggested in the scientific
literature for extrapolating the user-rating matrix. The most commonly accepted and
implemented methods are classified into two groups:

• Content Based Recommendation

• Collaborative Recommendation

2.2.1 Content Based Recommendation

In content based recommendation systems, extrapolation of the ratings is carried out
based on the past rating values and user profile characteristics. In formal settings,
the utility u(c, s) of an item s for a user c is estimated based on the utilities u(c, si)
assigned by user c to items si ∈ S that are "similar" to item s [1]. The notion of "simi-
larity" above can be quantitatively derived based on the metadata of the items. For
example, in the case of movie ratings, if the directors of two movies were the same,
a person who likes one movie 'may' like the other one as well.

Most content based recommendation techniques root back to profile
based information retrieval methods. In profile based approaches each user is associ-
ated with a user profile which contains information about the tastes and preferences
of the user. This information is elicited from the user explicitly through questionnaires
while signing up for the service, or learned implicitly from the transactional behav-
ior. In a similar way, each item is associated with an item profile which is gathered from
the item descriptions or item features. The item profiles are usually described us-
ing keywords k, which represent a set of the most important words associated with the
item. The importance of a keyword k_i in an item d_j is determined using some weight-
ing measure w_ij.

The weighting measure w_ij can be defined in many different ways. One of
the most common measures used in the information retrieval community is the tf∗idf
(term frequency ∗ inverse document frequency) measure [14]. The tf∗idf represents the
importance of a term (keyword) in the set. It is the product of two measures: the first one,
called term frequency, represents the number of times a term appears for the
corresponding item, i.e. the frequency of the term in the item description, and the second
one, called inverse document frequency, represents the relevancy of the term in the whole
set of items. The idf is defined as the logarithm of the ratio of the total number of items
to the number of items containing the term:

tf∗idf(s, k) = tf(s, k) ∗ log(|S| / df(k))
Here tf(s, k) is the term frequency of the term k in item s, |S| is the number
of items in the set and df(k) is the number of items in the set in which the term k
appears. The above equation is a general equation that can be found in any standard
textbook on information retrieval. Normalization is carried out on the tf∗idf measure
to offset different factors. Normalization reduces the advantage of well described
items over shorter ones. There exist different normalization techniques and motives;
a brief overview and comparison can be found in [15, 16]. A detailed description of
tf∗idf can be found in [9]. Other applications and methods for calculating tf∗idf
can be found in [17].
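The tf∗idf weight above can be sketched over a toy set of item descriptions (all keywords made up; no normalization applied):

```python
import math
from collections import Counter

# Toy item descriptions; keywords are made up for illustration
items = {
    "movie_a": ["romantic", "comedy", "london", "comedy"],
    "movie_b": ["thriller", "memory", "noir"],
    "movie_c": ["romantic", "drama", "england"],
}

def tf_idf(item, term):
    """tf*idf(s, k) = tf(s, k) * log(|S| / df(k))."""
    tf = Counter(items[item])[term]                    # frequency in this item
    df = sum(term in desc for desc in items.values())  # items containing the term
    return tf * math.log(len(items) / df)

print(tf_idf("movie_a", "comedy"))    # frequent in the item, rare across items
print(tf_idf("movie_a", "romantic"))  # lower: appears in two of three items
```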

As stated above, content based systems recommend items similar to the ones pre-
ferred by a user. A vector of keyword weights for the past preferences of a user is con-
structed by the keyword analysis techniques mentioned earlier. This vector is matched
with the item vectors of unrated items to compute the utility score u(c, s), and the most
similar items are chosen as recommendations. The notion of similarity is mea-
sured based on association coefficients and correlation coefficients used in cluster
analysis and information retrieval [18]. Association coefficients can be used with
qualitative features and are in widespread use. The most commonly used measure is
the cosine coefficient. The cosine coefficient between a user profile vector a and an item
profile vector b is calculated as given below:

Cos(t_a, t_b) = (t_a · t_b) / (|t_a| |t_b|)
Here t_a and t_b are the tf∗idf vectors representing the user vector a and the item vector
b respectively. One important point to be noted is that the cosine similarity mea-
sure is one of many variants of the inner product similarity measure; other variants
include the plain inner product measure, the pseudo-cosine measure and the dice measure.
All of the variants of the inner product similarity measure come under association coefficients.
The cosine measure is the Euclidean length normalized version of the inner product similar-
ity. A detailed study of inner product based similarity measures, with advantages and
disadvantages, is given in [19].

The correlation coefficient is based on the Pearson product-moment correlation co-
efficient used widely in statistics. The standard formula for calculating the correlation
coefficient is

ρ_{a,b} = Cov(a, b) / (σ_a σ_b)

where σ_a and σ_b are the standard deviations of the data sets a and b, and Cov(a, b)
is the covariance between a and b.

In the case of profile vectors, we can define the correlation coefficient based on the tf∗idf
vectors explained earlier. The numeric value of the correlation coefficient varies between -1
and 1, unlike the other similarity measures; 1 indicates identical profiles, -1 perfectly
anti-correlated profiles, and 0 distinct (uncorrelated) profiles [20]. A more rigorous
mathematical treatment of the notion of similarity can be found in Eklund et al. [21].
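Both similarity measures can be sketched over plain weight vectors (the numbers are made up; in practice these would be tf∗idf profile vectors):

```python
import math

def cosine(a, b):
    """Inner product normalized by the Euclidean lengths of both vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def pearson(a, b):
    """Covariance of a and b divided by the product of their std deviations."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    sa = math.sqrt(sum((x - ma) ** 2 for x in a) / n)
    sb = math.sqrt(sum((y - mb) ** 2 for y in b) / n)
    return cov / (sa * sb)

user = [2.0, 0.0, 1.0, 3.0]   # hypothetical profile vector
item = [4.0, 0.0, 2.0, 6.0]   # proportional to `user`, so both measures give 1
print(cosine(user, item), pearson(user, item))
```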

Limitations

Even though content based models are in widespread use, they have several limitations.
According to [1], some of the major limitations are:

• Limited Content Analysis

• Over-specialization

• New user problem

As the success of content based techniques is limited by the effectiveness of the
keywords or features associated with objects, it is of vital importance to acquire or
generate explicit keywords for each object. The explicit acquisition of features for a user
profile is an interactive task, requiring the users to select from a list of the most commonly
used tags. This is discouraging from a user point of view, as most customers
will not be interested in providing more than the minimum required data. Most
automatic information extraction techniques work very well with text based objects,
but outside the text domain these techniques have not been employed as successfully.
One example is multimedia data, including images, audio and video files.

The second major limitation is over-specialization. It is similar to the overfitting
error in the context of statistical modeling of data. Since content based sys-
tems recommend items with a high rating score against past preferred items, the user
is limited to being recommended items similar to the ones preferred in the past. In
the movie rating context, a user who rated romantic comedies in the past will be
recommended more romantic comedies and no other genres. This can be addressed
by adding some randomness to the profile, under the assumption that each profile
contains some intrinsic error.

The new user problem, also known as the cold-start problem, is associated with all kinds
of recommendation systems. A recommender system should be capable of making
non-trivial recommendations to a user without sufficient prior ratings in his profile.

2.2.2 Collaborative Recommendation

Collaborative recommendation systems try to predict the ratings of items based on
the ratings of other users, instead of the similarity between new items and the past
preferred items. Formally, the utility u(c, s) of item s for user c is calculated based on
the utilities u(ci, s) of item s assigned by users ci ∈ C who are similar to user c. An
example of 'peer' based recommendation, where the term peer indicates similar users, is
the product recommendation by Amazon.

Based on the algorithms used for collaborative recommendation, these systems can be
grouped into two general classes [1]:

• memory based

• model based
In memory based systems, heuristic functions are run over the entire collection
of past ratings to derive new unknown ratings. The heuristic function calculates prop-
erly normalized aggregate values for the ratings, i.e. the new rating r_{c,s} for an item s and
a user c is calculated by aggregating the ratings of the other users for the same item
s [1]. The aggregate function can be the mean (average), median, mode or some other
complex aggregate function, including a standard normalized value. Cosine similarity
and the Pearson correlation coefficient, explained in the content based recommendation
section, can also be used to weight the aggregate.
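A minimal memory based sketch, reusing the ratings from Table 2.1: the unknown rating r_{c,s} is an aggregate, here a similarity-weighted average, of the other users' ratings for the same item.

```python
# Ratings from Table 2.1; missing entries are simply absent from the dicts
ratings = {
    "Aybuke": {"Looper": 8, "Atonement": 7},
    "Sham":   {"Memento": 7, "Atonement": 8, "Pride & Prejudice": 9},
    "Shani":  {"Pride & Prejudice": 6},
}

def predict(user, item, sim):
    """Similarity-weighted average of the other users' ratings for `item`."""
    num = den = 0.0
    for other, user_ratings in ratings.items():
        if other != user and item in user_ratings:
            num += sim(user, other) * user_ratings[item]
            den += abs(sim(user, other))
    return num / den if den else None

# With a uniform similarity the aggregate reduces to a plain mean
uniform = lambda a, b: 1.0
print(predict("Aybuke", "Pride & Prejudice", uniform))  # mean of 9 and 6
```

Any of the similarity measures from Section 2.2.1 could be plugged in for `sim` in place of the uniform weighting used here.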

Model based systems are among the most successful methods implemented by
many large online companies, including Google and Yahoo!. The fundamental idea
behind model based systems is to learn a model, commonly a statistical data model,
from the given (observed) data and predict values using the best-fit parameters
of the model. Bayesian network models and cluster based models are the most im-
portant kinds of models in this class. A detailed study of cluster models can be found
in [9]. In Bayesian network based models, a Bayesian network is constructed from
the observed data with a proper prior distribution assumption on the model parameters.
Most of these methods are based on the principles of data mining and machine learn-
ing. The main theme of this thesis also falls under model based recommendation
systems. We discuss Bayesian learning methods in more detail in Chapter 3.

Limitations

Collaborative recommendation systems suffer from almost the same types of limitations as
content based recommendation systems. There are three types of limitations reported
in the scientific literature [1]:

• New user problem

• New item problem

• Sparsity

The new user problem, also known as the cold-start problem, is the same as the one discussed
in the context of content based recommendation systems. The new item problem arises
mainly due to the fact that collaborative systems rely on the ratings of similar users on an
item for prediction. If an item is not rated by enough users, the results of a collaborative
system can be heavily biased. As explained for the content based methods, the sparsity
of the user-item rating matrix poses a big modeling problem in collaborative recom-
mendation systems as well. The sparsity of the data may lead to severe over-fitting
and result in very poor prediction accuracy.

2.2.3 Other Models

In addition to the models specified above, some scientific articles divide rec-
ommendation models based on the mathematical models behind them. Even though
most of these models can be classified as either content based or collaborative mod-
els, we briefly discuss them for the sake of completeness.

Clustering Models

Clustering based models are used mainly as a pre-processing step for recommenda-
tion. The users and items can be clustered to identify groups of users with similar
preferences and items with similar features, thereby reducing the dimensions of the
user space and item space. This reduction in dimensionality means that collaborative
recommendation methods can be applied on the clusters of users or items to achieve
better prediction ratings. The most commonly used clustering schemes are k-means and
k-median. Other variations of clustering, such as fuzzy c-means clustering and soft
k-means clustering, can also be used for this purpose. In these clustering methods, each
item or user is a member of different clusters with an associated grade of membership.
The grades for a specific user/item should sum to 1.
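The membership-grade constraint can be sketched as follows (the cluster centers and preference vectors are made up, and a real implementation would iterate the k-means or fuzzy c-means updates rather than use fixed centers):

```python
import numpy as np

# Hypothetical user preference vectors and two fixed cluster centers
users = np.array([[5.0, 1.0], [4.5, 1.5], [1.0, 5.0]])
centers = np.array([[5.0, 1.0], [1.0, 5.0]])

# Soft membership: inverse-distance weights, normalized per user so that
# each user's grades across the clusters sum to 1.
dist = np.linalg.norm(users[:, None, :] - centers[None, :, :], axis=2)
weights = 1.0 / (dist + 1e-9)
membership = weights / weights.sum(axis=1, keepdims=True)

print(membership.sum(axis=1))  # each row sums to 1
```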

Dimensionality Reduction

Dimensionality reduction methods are another class of mathematical models used for recom-
mender systems. Dimensionality reduction techniques are analogous to clustering
schemes, but the basic principle behind them is the
assumption that the data given by the user-item rating matrix is generated by a fixed
number of latent variables rather than a single latent variable. A special case of di-
mensionality reduction is matrix factorization, where a data matrix is decomposed
into the product of two low rank matrices. The most commonly used matrix decomposi-
tion methods are Singular Value Decomposition (SVD), QR factorization, factor anal-
ysis (FA) and principal components analysis (PCA). A detailed study of SVD and QR
can be found in [3, 5]. A tutorial on PCA can be found in [22]. More details about
factor analysis will be provided in Chapter 3.

2.3 Summary
We discussed the two most popular models used to implement recommender sys-
tems: 1) content based models and 2) collaborative models. In content based models,
ratings for unknown items are estimated based on the past ratings of similar
items. In collaborative models, we estimate the ratings based on a model
fitted on the observed rating values of similar users. We concluded each model description
with a discussion of the advantages and disadvantages of the model.
Chapter 3

Bayesian Matrix Factorization

In this chapter, we introduce the recommendation model based on factor analysis.
We build a Bayesian matrix factorization model for recommender systems step by
step, starting from a simple probabilistic matrix factorization model.

3.1 Factor Analysis

Factor analysis is a simple, constrained linear Gaussian latent variable model. In
factor analysis, the conditional distribution of the observed variable x given the latent
variable z is assumed to be a Gaussian distribution with a diagonal covariance matrix.
The model is specified as

p(x|z) = N(x | Wz + µ, Ψ)

where x is the D-dimensional observed variable, z is the M-dimensional latent variable,
W is a D × M matrix, µ is the D-dimensional mean vector of x and Ψ is the diagonal
D × D covariance matrix. The basic assumptions in factor analysis models are given
below:
• The observed variables x_1, x_2, ..., x_D are independent given the latent variable;
in our context this indicates that user ratings for individual items are independent
of each other.

• The observed variables conditioned on the latent variable follow a Gaussian distribution.

• The latent variable follows a Gaussian distribution with zero mean and spherical (isotropic)
covariance matrix.

The matrix W captures the correlations between the observed variables, and Ψ captures
the variance unique to each variable [23]. The factor analysis model
explains the observed covariance structure of the data through W and Ψ. The matrix
W is called the factor loading matrix and Ψ is called the uniqueness matrix.

The corresponding generative viewpoint associated with factor analysis can be
explained as follows: a sample of the observed variable is obtained by first choosing a value for
the latent variable by sampling from the zero mean Gaussian distribution p(z),

p(z) = N(z | 0, I)

where I is the identity matrix, and then sampling the observed variable conditioned
on this latent value. The D-dimensional observed variable x is defined as a linear
transformation of the M-dimensional latent variable z plus additive Gaussian 'noise',
so that

x = Wz + µ + e

In the above, z is an M-dimensional Gaussian latent variable, and e is a D-dimensional
zero mean Gaussian noise variable with covariance Ψ. Factor analysis
can be seen as a generalized version of probabilistic principal component analysis. In
probabilistic principal component analysis, the covariance matrix of the distribution
of the observed variable is assumed to be spherical or isotropic (a diagonal matrix with all
diagonal elements equal), whereas in factor analysis the covariance matrix is assumed
to be a general diagonal matrix.
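The generative procedure just described can be sketched directly in NumPy (the dimensions and parameter values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

D, M = 4, 2                            # observed and latent dimensionality
W = rng.normal(size=(D, M))            # factor loading matrix (made up)
mu = np.zeros(D)                       # mean of the observed variable
psi = np.array([0.1, 0.2, 0.1, 0.3])   # diagonal of the uniqueness matrix

# Generative sampling: z ~ N(0, I), e ~ N(0, Psi), x = W z + mu + e
z = rng.normal(size=M)
e = rng.normal(size=D) * np.sqrt(psi)
x = W @ z + mu + e

# Marginal covariance of x implied by the model: W W^T + Psi
cov_x = W @ W.T + np.diag(psi)
print(x.shape, cov_x.shape)
```

The last line makes explicit how the model explains the observed covariance structure through W and Ψ together.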

3.1.1 Simple Probabilistic Model

The simple factor analysis model we discuss here is called Probabilistic Matrix
Factorization (PMF). Consider a data set of N users and M items; an N × M preference
matrix R is given by the product of an N × D user coefficient matrix U and a D × M
factor matrix V^T. Training such a model amounts to finding the best rank-D
approximation to the observed N × M target matrix R. Here D is the assumed
dimensionality of the latent feature space.

Suppose we are given a set of M items and N users. Let R be the rating matrix,
where the entry R_ij represents the rating for item j by user i. Let U be the N × D latent
user feature matrix and V the M × D movie feature matrix, with rows U_i and V_j. We
define the conditional distribution of the observed ratings as

P(R | U, V, σ²) = ∏_{i=1}^{N} ∏_{j=1}^{M} [N(R_ij | U_i^T V_j, σ²)]^{I_ij}    (3.1)

where N(x | µ, σ²) is the Gaussian distribution with mean µ and variance σ². I_ij is
the indicator function, which is equal to 1 if user i rated movie j in the dataset R and
0 otherwise. We place zero-mean spherical (isotropic) Gaussian prior distributions on
the user and movie feature vectors:

P(U | σ_U²) = ∏_{i=1}^{N} N(U_i | 0, σ_U² I)    (3.2)

P(V | σ_V²) = ∏_{j=1}^{M} N(V_j | 0, σ_V² I)    (3.3)
The posterior distribution of the model can be found as given below:

P(U, V | R, σ², σ_U², σ_V²) = P(R | U, V, σ²) P(U, V | σ_U², σ_V²) / P(R | σ²)

Since we assume that the user and item vectors are independent,

P(U, V | σ_U², σ_V²) = P(U | σ_U²) P(V | σ_V²)

and this implies that

P(U, V | R, σ², σ_U², σ_V²) = P(R | U, V, σ²) P(U | σ_U²) P(V | σ_V²) / P(R | σ²)

The parameters of the above model can be estimated by maximizing the log posterior.
Taking the log of the posterior distribution gives

ln P(U, V | R, σ², σ_U², σ_V²) = − 1/(2σ²) ∑_{i=1}^{N} ∑_{j=1}^{M} I_ij (R_ij − U_i^T V_j)²
    − 1/(2σ_U²) ∑_{i=1}^{N} U_i^T U_i − 1/(2σ_V²) ∑_{j=1}^{M} V_j^T V_j
    − 1/2 ( (∑_{i=1}^{N} ∑_{j=1}^{M} I_ij) ln σ² + ND ln σ_U² + MD ln σ_V² ) + C
(3.4)

In Eq 3.4, the constant C does not depend on the parameters; it also absorbs the
normalization term P(R | σ²). Maximizing Eq 3.4 with the hyperpa-
rameters kept fixed is equivalent to minimizing the sum-of-squared-errors objective
function with quadratic regularization terms [11]:
E = (1/2) ∑_{i=1}^{N} ∑_{j=1}^{M} I_ij (R_ij − U_i V_j^T)² + (λ_U/2) ∑_{i=1}^{N} ||U_i||²_Fro + (λ_V/2) ∑_{i=1}^{M} ||V_i||²_Fro        (3.5)

In the above, || · ||²_Fro is the squared Frobenius norm, and λ_U and λ_V are
regularization parameters. A local minimum of Eq. 3.5 can be found by running
gradient descent on U and V. The gradient descent updates for U and V are given
below.

 
U ← U − ε ( −(R − UV^T) V + λU )        (3.6)

V ← V − ε ( −(R − UV^T)^T U + λV )        (3.7)

where ε is the learning rate and λ is the regularization parameter (we use a
symmetric regularization parameter for both U and V).
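As a concrete sketch, the update rules in Eq. 3.6 and 3.7 can be written in a few lines of NumPy. The matrix sizes, learning rate, and regularization value below are illustrative toys, not the settings used in our experiments, and the updates are restricted to observed entries via the indicator matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, D = 6, 5, 2                      # users, items, latent dimension (toy sizes)
R = rng.integers(1, 6, size=(N, M)).astype(float)
I = rng.random((N, M)) < 0.7           # indicator of observed ratings
R = R * I                              # unobserved entries set to 0

U = 0.1 * rng.standard_normal((N, D))
V = 0.1 * rng.standard_normal((M, D))
eps, lam = 0.01, 0.1                   # learning rate and regularization (toys)

def objective(R, I, U, V, lam):
    """Sum-of-squared-errors objective of Eq. 3.5 over observed entries."""
    err = I * (R - U @ V.T)
    return 0.5 * np.sum(err ** 2) + 0.5 * lam * (np.sum(U ** 2) + np.sum(V ** 2))

before = objective(R, I, U, V, lam)
for _ in range(200):
    err = I * (R - U @ V.T)            # residuals on observed entries only
    U += eps * (err @ V - lam * U)     # Eq. 3.6
    V += eps * (err.T @ U - lam * V)   # Eq. 3.7
after = objective(R, I, U, V, lam)
assert after < before                  # gradient descent decreases the objective
```

For a sufficiently small learning rate, running the loop drives the regularized objective of Eq. 3.5 down at each pass.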

Rating prediction is carried out by multiplying the corresponding user and item
vectors, U_i and V_j. Ideally, since we have a probabilistic model, the prediction
is expressed in terms of the predictive distribution over R, which is Gaussian,
rather than as a simple point estimate. The individual recommendation value for
item j by user i can then be read off as R_ij. Further details about the
implementation, test runs and results are given in the chapter Experiments.

3.1.2 Maximum a Posteriori Model


Maximum a posteriori (MAP) estimation is a partial Bayesian approach. MAP estima-
tion gives a point estimate rather than a predictive distribution. In MAP estimation,
we introduce a prior distribution over the parameters of the user vectors U_i and
item vectors V_j. We consider zero-mean spherical Gaussian distributions with
variances σ_U and σ_V, and denote the sets of parameters as Θ_U and Θ_V. We can
then find the point estimate of the parameters and hyperparameters (Θ_U, Θ_V) by
maximizing the posterior given by

ln P(U, V, σ², Θ_U, Θ_V | R) = ln P(R | U, V, σ²) + ln P(U | Θ_U) + ln P(V | Θ_V)
                               + ln P(Θ_U) + ln P(Θ_V) + C        (3.8)

A local maximum can be found by running gradient descent on the four variables
U_i, V_j, Θ_U, Θ_V.

3.2 Bayesian Model
In a fully Bayesian approach, we would integrate or sum over all values of the
parameters and hyperparameters. In most cases this integration is too complex to
be evaluated exactly using analytical techniques. Here we use Gibbs sampling, a
Markov chain Monte Carlo (MCMC) method. Gibbs sampling allows us to sample from a
distribution that asymptotically follows the conditional distribution of the
parameters given the data, without having to explicitly calculate the integrals.
In the first section we explain the Bayesian model for matrix factorization, and
in the second section we discuss Gibbs sampling, the parameter estimation method
employed in the implementation.

The likelihood of the observed data in the Bayesian setting is the same as in the
simple probabilistic model, given in Eq. 3.1. The prior distributions over the
user and item vectors are assumed to be Gaussian with non-zero mean and full
precision matrix. In the simple probabilistic model we assumed the vectors to be
Gaussian with zero mean and spherical covariance matrix; here we generalize the
Gaussian prior as given below.

P(U | µ_U, Λ_U) = ∏_{i=1}^{N} N(U_i | µ_U, Λ_U^{-1})

P(V | µ_V, Λ_V) = ∏_{i=1}^{M} N(V_i | µ_V, Λ_V^{-1})

In the above model, we parametrize the Gaussians with the precision matrix Λ
rather than the covariance matrix, as the prior over the hyperparameters of the
user and item vectors is easier to represent using the precision matrix. The
precision matrix is defined as the inverse of the covariance matrix. For a fully
Bayesian approach, we now introduce a Gaussian-Wishart prior distribution over
the hyperparameters. The Gaussian-Wishart prior is the conjugate prior for the
joint distribution of the mean and precision of a Gaussian likelihood function.

P(Θ_U | Θ_0) = N(µ_U | µ_0, (β_0 Λ_U)^{-1}) W(Λ_U | W_0, ν_0)        (3.9)

P(Θ_V | Θ_0) = N(µ_V | µ_0, (β_0 Λ_V)^{-1}) W(Λ_V | W_0, ν_0)        (3.10)


In the above, β_0 is a constant, and W(Λ | W_0, ν_0) denotes the Wishart
distribution with D × D scale matrix W_0 and ν_0 degrees of freedom. One thing to
keep in mind is that the joint distribution is not the product of two independent
distributions: the covariance of the Gaussian distribution over the mean is a
function of the precision matrix.

 
W(Λ | W_0, ν_0) = B(W_0, ν_0) |Λ|^{(ν_0 − D − 1)/2} exp( −(1/2) Tr(W_0^{-1} Λ) )        (3.11)

The above equation represents the Wishart distribution. Here Tr denotes the trace
of a matrix, i.e. the sum of its diagonal elements, W_0 is the D × D scale matrix,
and B is the normalization constant, defined as

B(W_0, ν_0) = |W_0|^{−ν_0/2} ( 2^{Dν_0/2} π^{D(D−1)/4} ∏_{i=1}^{D} Γ((ν_0 + 1 − i)/2) )^{−1}
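For illustration, a single draw from the Gaussian-Wishart prior of Eq. 3.9 can be generated with SciPy's `wishart` distribution: first the precision matrix is drawn from the Wishart, then the mean is drawn from a Gaussian whose covariance depends on that precision draw. The values of D, β_0, W_0 and ν_0 below are toy choices, not the ones used later in the experiments.

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(1)
D = 3
mu_0 = np.zeros(D)
beta_0 = 2.0
W_0 = np.eye(D)        # D x D scale matrix (toy value)
nu_0 = D               # degrees of freedom; must be >= D for the Wishart

# Draw the precision matrix from the Wishart distribution of Eq. 3.11 ...
Lam = wishart(df=nu_0, scale=W_0).rvs(random_state=42)
# ... then draw the mean from a Gaussian whose covariance (beta_0 * Lam)^{-1}
# depends on Lam, which is why the joint is not a product of independent factors.
cov = np.linalg.inv(beta_0 * Lam)
mu = rng.multivariate_normal(mu_0, cov)

assert Lam.shape == (D, D)
assert np.all(np.linalg.eigvalsh(Lam) > 0)   # precision draw is positive definite
```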

P(R*_ij | R, Θ_0) = ∫∫ P(R*_ij | U_i, V_j) P(U, V | R, Θ_U, Θ_V) P(Θ_U, Θ_V | Θ_0) d{U, V} dΘ_U dΘ_V
                                                                         (3.12)
The resulting predictive distribution over an unobserved rating R*_ij is given by
Equation 3.12. An exact solution for this equation cannot be found using
analytical methods. In practice, such equations are solved using approximate
inference techniques. There are two types of approximate inference methods
generally used in the scientific literature.

• Variational Bayesian methods

• Sampling methods

In variational Bayesian methods, the posterior distribution is approximated by a
factorized product over the constituent parameters, with each factor following
its own distribution. In sampling methods, we generate samples from the
distribution without calculating the probability density explicitly. In this
thesis, we used a Markov chain Monte Carlo sampling algorithm called Gibbs
sampling.

3.2.1 Gibbs Sampling


The Gibbs sampler is a technique for generating random variables from a marginal
probability distribution indirectly, without calculating the density. Consider a
joint distribution P(x_1, x_2, x_3, ..., x_k) from which we wish to generate
samples. In each step of the Gibbs sampling algorithm, we replace one of the
variables with a value drawn from the distribution of that variable conditioned
on the remaining variables. The variables are initialized to arbitrary values,
and the sequence of states generated by the algorithm forms a Markov chain. The
Gibbs algorithm is given in Algorithm 1.

Algorithm 1 Gibbs Sampling
1: Initialize z_i, i = 1, 2, ..., k
2: for τ = 1, ..., T do
3:     for i = 1, ..., k do
4:         Sample z_i^{τ+1} ∼ p(z_i | z_1^{τ+1}, z_2^{τ+1}, ..., z_{i−1}^{τ+1}, z_{i+1}^{τ}, ..., z_k^{τ})
5:     end for
6: end for

The conditional distribution specified in Algorithm 1 can be found from the basic
rules of probability as

p(z_i | z_1^{τ+1}, ..., z_{i−1}^{τ+1}, z_{i+1}^{τ}, ..., z_k^{τ}) = p(z_1^{τ+1}, ..., z_{i−1}^{τ+1}, z_i, z_{i+1}^{τ}, ..., z_k^{τ}) / p(z_1^{τ+1}, ..., z_{i−1}^{τ+1}, z_{i+1}^{τ}, ..., z_k^{τ})

The only difference between the numerator and the denominator is that the
numerator is the joint distribution of all variables, whereas the denominator is
the joint distribution of all variables except z_i. One full execution of the
inner loop in Algorithm 1 produces a complete new sample (z_1^{τ+1}, z_2^{τ+1},
..., z_k^{τ+1}). A more detailed explanation of Gibbs sampling theory with
examples can be found in [24, 25].
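A standard toy illustration of Algorithm 1 is sampling from a bivariate Gaussian with zero means, unit variances, and correlation ρ, for which both conditionals are univariate Gaussians, x | y ∼ N(ρy, 1 − ρ²) and y | x ∼ N(ρx, 1 − ρ²). This sketch only demonstrates the algorithm and is not part of the recommender model.

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.8
T = 20000
x, y = 0.0, 0.0                 # arbitrary initialization
samples = np.empty((T, 2))
for t in range(T):
    # Each step replaces one variable with a draw from its conditional,
    # exactly as in the inner loop of Algorithm 1.
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))   # sample x | y
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))   # sample y | x
    samples[t] = (x, y)

burn = samples[1000:]            # discard burn-in before the chain mixes
emp_rho = np.corrcoef(burn[:, 0], burn[:, 1])[0, 1]
assert abs(emp_rho - rho) < 0.05   # empirical correlation is close to rho
```

Although no step ever evaluates the joint density, the chain of states asymptotically follows the target bivariate Gaussian.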

In our model, we sample from the conditional distribution of each user vector
conditioned on all other variables and parameters, and the same method is used
for the item vectors. Since we use conjugate priors for the parameters and
hyperparameters, the conditional distributions are easy to derive. Writing α for
the observation precision (α = 1/σ²), the conditional distribution takes the form

p(U_i | R, V, Θ_U, α) = N(U_i | µ_i*, [Λ_i*]^{-1}) ∝ ∏_{j=1}^{M} [N(R_ij | U_i V_j^T, α^{-1})]^{I_ij} p(U_i | µ_U, Λ_U),

where

Λ_i* = Λ_U + α ∑_{j=1}^{M} [V_j^T V_j]^{I_ij}

µ_i* = [Λ_i*]^{-1} ( α ∑_{j=1}^{M} [V_j R_ij]^{I_ij} + Λ_U µ_U )

The conditional distributions of the user and item hyperparameters follow a
Gaussian-Wishart distribution, as given below:

p(µ_U, Λ_U | U, Θ_0) = N(µ_U | µ_0*, (β_0* Λ_U)^{-1}) W(Λ_U | W_0*, ν_0*),

where

µ_0* = (β_0 µ_0 + N Ū) / (β_0 + N),    β_0* = β_0 + N,    ν_0* = ν_0 + N

[W_0*]^{-1} = W_0^{-1} + N S̄ + (β_0 N / (β_0 + N)) (µ_0 − Ū)(µ_0 − Ū)^T

Ū = (1/N) ∑_{i=1}^{N} U_i,    S̄ = (1/N) ∑_{i=1}^{N} U_i U_i^T

The update formulas for the item vectors V_j follow the same pattern. In the
Gibbs sampling procedure, we initialize the model parameters, first sample the
hyperparameters conditioned on the current values of the user and item vectors,
and then sample the user and item vectors conditioned on the hyperparameters.
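A minimal sketch of one such Gibbs update for a single user vector U_i, drawn from the conditional N(U_i | µ_i*, [Λ_i*]^{-1}) derived above. All matrices and hyperparameter values here are toy stand-ins for the actual training state.

```python
import numpy as np

rng = np.random.default_rng(3)
D, M = 2, 8
alpha = 2.0                                   # observation precision (toy value)
mu_U = np.zeros(D)                            # current hyperparameter draws
Lam_U = np.eye(D)
V = rng.standard_normal((M, D))               # current item matrix
R_i = rng.normal(size=M)                      # ratings of user i (toy values)
I_i = np.array([True, False, True, True, False, True, False, True])  # rated items

V_obs = V[I_i]                                # rows V_j for rated items only
# Lam_i* = Lam_U + alpha * sum_j V_j^T V_j over observed j
Lam_star = Lam_U + alpha * V_obs.T @ V_obs
# mu_i* = [Lam_i*]^{-1} (alpha * sum_j V_j R_ij + Lam_U mu_U)
rhs = alpha * V_obs.T @ R_i[I_i] + Lam_U @ mu_U
mu_star = np.linalg.solve(Lam_star, rhs)
# Draw the new user vector from its conditional Gaussian
U_i = rng.multivariate_normal(mu_star, np.linalg.inv(Lam_star))

assert U_i.shape == (D,)
```

One full Gibbs sweep repeats this draw for every user, performs the analogous draws for every item vector, and then resamples the hyperparameters from their Gaussian-Wishart conditionals.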

3.3 Summary
We covered model based recommender systems, in particular the one based on factor
analysis, and described the two probabilistic algorithms we implemented: a simple
probabilistic matrix factorization model and a Bayesian matrix factorization
model.

Chapter 4

Experiments

4.1 Dataset
We used the movie ratings provided by Yahoo! Research, which are free to use in
academic research and can be accessed through the Yahoo! Sandbox [26]. According
to Yahoo!, the data was gathered on or before November 2003, with some
modifications and additions by Yahoo! Research, through the movie ratings and
recommendation portal Yahoo! Movies. All user ids in the dataset are anonymized
to preserve the privacy of the users. The fields in the dataset are delimited
with tab ("\t") characters and are given in Table 4.1.

Field Description
1 Anonymized User ID
2 Movie ID
3 User Rating
4 Normalized User Rating
Table 4.1: Dataset Field Description

The dataset is divided into two sets, a training set and a test set. The training
data contains 7,642 users (|U|), 11,915 movies (|M|) and 211,231 ratings (|R|).
The average user rating, defined as R̄_U = (∑_u R_u) / |U|, is 9.64, and the
average item rating, defined as R̄_M = (∑_m R_m) / |M|, is 9.32. The average
number of ratings per user, defined as R_avg^U = |R| / |U|, is 27.64, and the
average number of ratings per item, defined as R_avg^M = |R| / |M|, is 17.73. In
the training dataset, all users have rated at least 10 items and all items are
rated by at least one user. The density ratio, defined as δ = |R| / (|U| · |M|),
is 0.0023, meaning that only 0.23% of the entries in the user-item matrix are
filled. Please refer to the discussion about the sparsity of the rating matrix in
Chapters 2 and 3.
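These summary statistics are straightforward to compute from the raw (user, movie, rating) triples. The following sketch uses a small hand-made toy sample rather than the actual Yahoo! data.

```python
import numpy as np

# Toy stand-in for the tab-delimited rating triples of Table 4.1
ratings = np.array(
    [(0, 0, 4.0), (0, 1, 5.0), (1, 0, 3.0), (2, 2, 2.0), (2, 0, 5.0)],
    dtype=[("user", int), ("movie", int), ("rating", float)],
)

n_users = len(np.unique(ratings["user"]))      # |U|
n_movies = len(np.unique(ratings["movie"]))    # |M|
n_ratings = len(ratings)                       # |R|

avg_ratings_per_user = n_ratings / n_users     # R_avg^U = |R| / |U|
avg_ratings_per_movie = n_ratings / n_movies   # R_avg^M = |R| / |M|
density = n_ratings / (n_users * n_movies)     # delta = |R| / (|U| * |M|)

assert (n_users, n_movies, n_ratings) == (3, 3, 5)
assert abs(density - 5 / 9) < 1e-12
```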

The ratings come in two forms: normalized ratings and un-normalized ratings.
Un-normalized ratings range from 1 to 13, where 1 denotes a rating of 'F' and 13
denotes 'A+'. We used the normalized ratings, which range from 1 to 5, 5 being
the best. The test data contains a small sample of Yahoo! users' ratings of
movies, gathered chronologically after the training data. The data fields are the
same as in the training data, given in Table 4.1. The test data contains 2,309
users, 2,380 items, and 10,136 ratings. There are no test users/movies that do
not also appear in the training data. The average user rating is 9.66 and the
average item rating is 9.54. The average number of ratings per user is 4.39 and
the average number of ratings per item is 4.26. All users have rated at least one
movie and all items have been rated by at least one user.

4.1.1 Implementation

We used NumPy [27] and SciPy [28], scientific computing libraries for Python, to
implement Probabilistic Matrix Factorization, and MATLAB [29] was used for the
implementation of Bayesian Matrix Factorization, as the Gibbs sampler is easy to
implement in MATLAB. The graphs were created using the statistical computing
language R [30].

4.2 Experiments
For Probabilistic Matrix Factorization, we ran gradient descent on the parameter
matrices U and V for two different rating sets, the normalized user ratings and
the un-normalized user ratings. The parameter matrices were updated according to
Equations 3.6 and 3.7, where λ is the regularization factor, which was set to
100, and ε is the learning rate, which was set to 2. We set the feature size to
50 in both the PMF and BPMF models. As explained in Chapters 2 and 3, the feature
space is the latent space which captures different characteristics of users and
items. Since the rating matrix was quite large (7,642 × 11,915, with 211,231
observed entries), we ran the updates sequentially, 1,000 entries per iteration.
On a 64GB RAM machine, the PMF algorithm took 5 hours to converge.

[Figure: plot of RMSE (Root Mean Squared Error) against the number of iterations]

Figure 4.1: RMSE for the Probabilistic Matrix Factorization model on the training
data with learning rate 3 and regularization parameter 100

For the Bayesian Matrix Factorization setup, we initialized ν_0, the degrees of
freedom, to D (the number of factors), µ_0 to zero, and W_0 to the D × D identity
matrix. We initialized the user and item matrices to the ones obtained from the
PMF model. We took 50 samples of the variables, and the parameter estimation took
8 hours to finish on a 4GB machine. The RMSE with Gibbs sampling was 0.27473624
on the test data.
[Figure: plot of RMSE (Root Mean Squared Error) against the number of iterations]

Figure 4.2: RMSE for the Bayesian Matrix Factorization model on the training data
as a function of the number of samples generated

4.2.1 Results
The cost function of the gradient descent algorithm is plotted in Figure 4.1.
Here the cost function is the Root Mean Squared Error (RMSE) between the observed
ratings and the predicted ratings. The RMSE on the test data, using the optimized
U and V matrices for the normalized ratings, is 0.588912961491.
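The RMSE figures reported here are computed over the observed entries only, which the following short helper makes explicit; the rating matrices below are toy values, not the actual test data.

```python
import numpy as np

def rmse(R_true, R_pred, I):
    """Root mean squared error over the entries where the indicator I is 1."""
    diff = (R_true - R_pred)[I.astype(bool)]
    return np.sqrt(np.mean(diff ** 2))

R_true = np.array([[4.0, 0.0], [2.0, 5.0]])
R_pred = np.array([[3.5, 1.0], [2.0, 4.0]])
I = np.array([[1, 0], [1, 1]])          # entry (1, 2) is unobserved

print(rmse(R_true, R_pred, I))           # sqrt((0.25 + 0 + 1) / 3) ≈ 0.6455
```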

For Bayesian Matrix Factorization, the RMSE value for each generated sample is
plotted against the number of iterations in Figure 4.2. The RMSE on the test data
was estimated as 0.27473624. As we can see, the results of Gibbs sampling are far
better than those obtained by the PMF model: the RMSE of the Bayesian Matrix
Factorization model is roughly half that of the PMF model.

4.3 Summary
We described the dataset and the experiments used to measure the validity of the
models we implemented. The result obtained using Bayesian Matrix Factorization is
far superior to the result obtained using simple Probabilistic Matrix
Factorization. In both cases, the results obtained are superior to results
published by researchers on experiments conducted on a similar dataset, the
Netflix dataset.
Chapter 5

Related Work and Conclusion

5.1 Related Work


Recommender systems are a very active research field, and a number of works have
already been mentioned in Chapter 2. Most online retail and travel companies
conduct active research in the field. Amer-Yahia, Yu, et al. describe a content
based recommender system for travel recommendation in [31]. A similar content
based recommender system, with algorithms to find the most similar items in the
context of online shopping, is described in [32].

Model based recommender systems are studied in detail in [33]. This PhD thesis
covers recommender systems from a machine learning perspective. It discusses
different algorithms used in recommender systems, including clustering,
dimensionality reduction, Naïve Bayes, etc. One of the most advanced algorithms
used in recommender systems, based on topic modeling, is explained in detail in
[8].

A detailed probabilistic treatment of latent Gaussian models and factor analysis
can be found in the well-known textbook by Bishop [23]. Other parameter
estimation techniques for probabilistic models, such as the Expectation
Maximization (EM) algorithm, can be found in the same textbook, which also covers
the k-means algorithm, soft k-means, etc. A study of similarity measures in the
fuzzy setting can be found in [21].

5.2 Conclusion
The results we obtained by applying both Probabilistic Matrix Factorization and
Bayesian Probabilistic Matrix Factorization are very satisfying. They compare
favourably to results obtained by applying similar techniques to other datasets,
such as the Netflix movie rating data [11, 10].

In the future, we would like to do more work on the statistical modeling of data
and machine learning algorithms for approximate inference. Initially we planned
to use variational Bayesian inference to approximate the posterior distribution
in the Bayesian analysis.

Chapter 6

Acknowledgements

It is always a pleasure to finish a research project on a high note. It gives
extra pleasure to look back on working with kind, humble and wise people.

First of all, I would like to thank the Almighty God for helping me to
successfully finish this project. His blessings guided me through some of the
hard times in my life and showed me the correct path. I would like to express my
heartfelt gratitude to my supervisor Prof. Dr. Patrik Eklund for allowing me to
work with him and helping me to see it through to completion. He was very
supportive, encouraging and patient during the whole life-cycle of the project
and the final thesis preparation. He took the pain of reading through the initial
drafts of my thesis during his busy schedule, and I am greatly thankful to him
for the suggestions regarding the structure and content of the final paper. I
thank my examiner Dr. Jerry Eriksson for reading through the final paper and
making the necessary arrangements for the successful presentation.

Finally, I would like to thank my family, colleagues and friends, Ms. Aybüke
Öztürk, Mr. Sujith, Mr. Nishanth, Mr. Tony, Mr. Awad and Mr. Harishankar for all
the support given during the project implementation stage.
Bibliography

[1] Gediminas Adomavicius and Alexander Tuzhilin. Toward the next generation of
recommender systems: A survey of the state-of-the-art and possible extensions.
IEEE Trans. on Knowl. and Data Eng., 17(6):734–749, June 2005.

[2] Paul Resnick and Hal R. Varian. Recommender systems. Commun. ACM,
40(3):56–58, March 1997.

[3] Michael W. Berry, Zlatko Drmač, and Elizabeth R. Jessup. Matrices, vector
spaces, and information retrieval. SIAM Review, 41:335–362, 1999.

[4] Michael W. Berry, Murray Browne, Amy N. Langville, V. Paul Pauca, and
Robert J. Plemmons. Algorithms and applications for approximate nonnegative
matrix factorization. In Computational Statistics and Data Analysis, pages
155–173, 2006.

[5] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and
Richard Harshman. Indexing by latent semantic analysis. Journal of the American
Society for Information Science, 41(6):391–407, 1990.

[6] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the
22nd annual international ACM SIGIR conference on Research and development in
information retrieval, SIGIR '99, pages 50–57, New York, NY, USA, 1999. ACM.

[7] Kathryn B. Laskey and Henri Prade, editors. UAI '99: Proceedings of the
Fifteenth Conference on Uncertainty in Artificial Intelligence, Stockholm,
Sweden, July 30 - August 1, 1999. Morgan Kaufmann, 1999.

[8] David M. Blei and J. Lafferty. Topic models. Text mining: classification,
clustering, and applications, 10:71, 2009.

[9] Shameem Ahamed Puthiya Parambath. Topic extraction and bundling of related
scientific articles. Master's thesis, Umeå University, Department of Computing
Science, 2012.

[10] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix
factorization using Markov chain Monte Carlo. In Proceedings of the 25th
international conference on Machine learning, ICML '08, pages 880–887, New York,
NY, USA, 2008. ACM.

[11] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization.
In NIPS, 2007.

[12] Richard Johnsonbaugh and Marcus Schaefer. Algorithms, volume 2. Pearson
Education, 2004.

[13] Internet movie database. http://www.imdb.com. Accessed: 2013-05-12.

[14] G. Salton and M.J. McGill. Introduction to modern information retrieval.
McGraw-Hill computer science series. McGraw-Hill, 1983.

[15] Amit Singhal, Chris Buckley, and Mandar Mitra. Pivoted document length
normalization. In SIGIR, pages 21–29, 1996.

[16] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information
Retrieval. Cambridge University Press, 2008.

[17] Aybüke Öztürk. Textual Summarization of Scientific Publications and Usage
Patterns. PhD thesis, Umeå University, 2012.

[18] Anil K. Jain, M. Narasimha Murty, and Patrick J. Flynn. Data clustering: A
review. ACM Comput. Surv., 31(3):264–323, 1999.

[19] William P. Jones and George W. Furnas. Pictures of relevance: a geometric
analysis of similarity measures. J. Am. Soc. Inf. Sci., 38(6):420–442, November
1987.

[20] A. Huang. Similarity measures for text document clustering. pages 49–56,
2008.

[21] Patrik Eklund, Maria A. Galán, Jesús Medina, Manuel Ojeda-Aciego, and
Agustín Valverde. Similarities between powersets of terms. Fuzzy Sets and
Systems, 144(1):213–225, 2004.

[22] Jonathon Shlens. A tutorial on principal component analysis. In Systems
Neurobiology Laboratory, Salk Institute for Biological Studies, 2005.

[23] Christopher M. Bishop et al. Pattern recognition and machine learning,
volume 1. Springer New York, 2006.

[24] Philip Resnik and Eric Hardisty. Gibbs sampling for the uninitiated.
Technical report, DTIC Document, 2010.

[25] George Casella and Edward I. George. Explaining the Gibbs sampler. The
American Statistician, 46(3):167–174, 1992.

[26] Yahoo! research. http://webscope.sandbox.yahoo.com/. Accessed: 2013-05-12.

[27] NumPy, scientific computing tools for Python. http://www.numpy.org/.
Accessed: 2013-05-12.

[28] SciPy, open source library of scientific tools. http://www.scipy.org/.
Accessed: 2013-05-12.

[29] MATLAB, the language of technical computing. http://www.mathworks.se/
products/matlab/. Accessed: 2013-05-12.

[30] The R project for statistical computing. http://www.r-project.org/.
Accessed: 2013-05-12.

[31] Munmun De Choudhury, Moran Feldman, Sihem Amer-Yahia, Nadav Golbandi, Ronny
Lempel, and Cong Yu. Automatic construction of travel itineraries using social
breadcrumbs. In Proceedings of the 21st ACM conference on Hypertext and
hypermedia, HT '10, pages 35–44, New York, NY, USA, 2010. ACM.

[32] Senjuti Basu Roy, Sihem Amer-Yahia, Ashish Chawla, Gautam Das, and Cong Yu.
Constructing and exploring composite items. In Proceedings of the 2010 ACM
SIGMOD International Conference on Management of data, SIGMOD '10, pages
843–854, New York, NY, USA, 2010. ACM.

[33] Benjamin Marlin. Collaborative filtering: A machine learning perspective.
PhD thesis, University of Toronto, 2004.

[34] Ruth Gracia, Shameem A. Puthiya Parambath, Aybuke Ozturk, and Sihem
Amer-Yahia. Crowd sourcing literature review in sunflower. Technical report,
2012.

[35] Google scholar. http://www.scholar.google.com. Accessed: 2013-05-12.

[36] Yahoo! movies. http://www.movies.yahoo.com. Accessed: 2013-05-12.

[37] Yahoo! music. http://www.music.yahoo.com. Accessed: 2013-05-12.

[38] Netflix inc. http://www.netflix.com. Accessed: 2013-05-12.

[39] Lindsay I. Smith. A tutorial on principal components analysis. Cornell
University, USA, 51:52, 2002.