Functional Matrix Factorizations for Cold-Start Recommendation
ABSTRACT

A key challenge in recommender system research is how to effectively profile new users, a problem generally known as cold-start recommendation. Recently, the idea of progressively querying user responses through an initial interview process has been proposed as a useful new-user preference elicitation strategy. In this paper, we present functional matrix factorization (fMF), a novel cold-start recommendation method that solves the problem of initial interview construction within the context of learning user and item profiles. Specifically, fMF constructs a decision tree for the initial interview with each node being an interview question, enabling the recommender to query a user adaptively according to her prior responses. More importantly, we associate latent profiles with each node of the tree, in effect restricting the latent profiles to be a function of possible answers to the interview questions, which allows the profiles to be gradually refined through the interview process based on user responses. We develop an iterative optimization algorithm that alternates between decision tree construction and latent profile extraction, as well as a regularization scheme that takes into account the tree structure. Experimental results on three benchmark recommendation data sets demonstrate that the proposed fMF algorithm significantly outperforms existing methods for cold-start recommendation.

Categories and Subject Descriptors

H.3.3 [Information Search and Retrieval]: Information filtering; I.2.6 [Artificial Intelligence]: Learning

General Terms

Algorithms, Performance, Experimentation

Keywords

Recommender systems, Cold-start problem, Collaborative filtering, Decision tree, Functional matrix factorization

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGIR'11, July 24-28, 2011, Beijing, China.
Copyright 2011 ACM 978-1-4503-0757-4/11/07 ...$10.00.

1. INTRODUCTION

Recommendation systems have become a core component of today's online business world. While e-commerce giants such as Amazon have benefited greatly from the ability of their online recommenders to deliver the right items to the right people, online startups are increasingly becoming aware of this advantage and are aggressively sharpening their own tools in order to compete. A key challenge in building an effective recommender system is the well-known cold-start problem: how to provide recommendations to new users? While existing collaborative filtering (CF) approaches to recommendation perform quite satisfactorily for warm-start users (e.g., users purchasing a lot from an online retailer), they can fail completely for fresh users, simply because the system knows very little about those users' preferences.

Providing effective cold-start recommendations is of fundamental importance to a recommender system. Offering bad recommendations severely risks a shop's revenue, market share and overall reputation, because disappointed customers may turn to competitors and never come back. This problem is more acute for newly launched recommender systems (e.g., online startups), because their existing user base is small and attracting new customers is crucial to their survival. Even for relatively mature recommender systems, a vast majority of the (potential) customers are actually cold-start users; in principle, the population of customers typically follows a power-law distribution w.r.t. participation frequency, such that most customers lie in the long-tail part.

A natural approach to solving the cold-start problem is to elicit new users' preferences by querying their responses progressively through an initial interview process [19]. Specifically, at the visit of a new user, the recommender initiates several trials. At each trial, the recommender presents a seed item as a question and asks the user for her opinion; based on the user's responses to the questions, the recommender gradually refines its characterization (i.e., profile) of the user so that it can provide more satisfactory recommendations to the user in the future. A good initial interview process should not be time-consuming: the recommender can only ask a very limited number of questions, or otherwise the user will become impatient and leave the system. Equally important, the process should also be effective, i.e., the answers collected from the user should be useful for constructing at least a rough profile of the user. It has been convincingly argued that an effective model for designing the interview
process should ask the interview questions adaptively, taking into account the user's responses when asking the next question in the interview process, and decision trees have been found to work especially well for this purpose [7, 19, 20]. As an illustration for movie recommendation, in the interview process depicted in Figure 1 the system asks a new user the question "Do you like the movie Lethal Weapon 4?". The user is allowed to answer with either "Like", "Dislike" or "Unknown". The recommender refines its impression of the user, directs the user to one of the child nodes, and then presents the next interview question according to her previous response. For example, if a user chooses "Like" for "Lethal Weapon 4" at the root node of Figure 1, she will be directed to the left child node and will be asked the question "Do you like the movie Taxi Driver?"

In this paper, we propose functional matrix factorization (fMF), a novel method for the construction of such interview decision trees. We argue that it is more effective to combine the matrix factorization model for collaborative filtering and the decision-tree-based interview model into a single framework. Our proposed method is based on low-rank factorization of the incomplete user-item rating matrix with an important twist: it restricts the user profiles to be a function of the answers to the interview questions in the form of a decision tree, hence the name functional matrix factorization. The function, playing the role of the initial interview, is a decision tree with each node being an interview question [7, 20]. Both the decision tree and the item profiles need to be learned; for that we propose an iterative optimization algorithm that alternates between decision tree construction and latent profile extraction.

Our proposed method tends to exploit the correlation between different items more effectively. In particular, the low-dimensional user profiles at each node of the decision tree and the item profiles adapt to each other to better fit the training data. Thus, correlation between different items is captured by the low-dimensional profiles. As a result, the prediction of ratings can be enhanced by making use of the ratings of different items. Experimental results on three benchmark recommendation data sets, the MovieLens data set, the EachMovie data set and the Netflix data set, demonstrate that the proposed fMF algorithm significantly outperforms existing methods in cold-start recommendation.

Outline: In Section 2, we briefly review existing studies on cold-start collaborative filtering. In Section 3, we first introduce matrix factorization for collaborative filtering and then present functional matrix factorization, which constructs the interview process for cold-start collaborative filtering by restricting the user profiles to be a function in the form of a decision tree. The learning algorithm based on alternating optimization is then proposed with detailed derivations. We then evaluate the proposed method on three data sets and analyze the results in Section 4. Finally, we conclude our work and present several future directions in Section 5.

2. RELATED WORK

2.1 Collaborative Filtering

Many studies on recommender systems have focused on collaborative filtering approaches. These methods can be categorized into memory-based and model-based. The reader is referred to the survey papers [1, 26] for a comprehensive summary of collaborative filtering algorithms.

Recently, matrix factorization has become a popular direction for collaborative filtering [12, 14, 16, 21, 24]. These methods have been shown to be effective in many applications. Specifically, matrix factorization methods usually seek to associate both users and items with latent profiles, represented by vectors in a low-dimensional space, that capture their characteristics. In [22], a convex relaxation for low-rank matrix factorization is proposed and applied to collaborative filtering. A probabilistic model for extracting low-dimensional profiles is studied in [12], in which latent variables are introduced to capture the users and items and the EM algorithm is used to estimate the parameters. More recently, fueled by the Netflix competition, several improvements have been proposed, including the use of regularized SVD [16] and the idea of matrix factorization combined with neighborhood-based methods [14].

2.2 Cold-Start Collaborative Filtering

There have been several studies on elicitation strategies for new users' preferences. The work [18] surveys several guidelines for user preference elicitation. Several models such as [9, 19, 20] constructed the interview process with a static seed set selected based on measures such as informativeness, popularity and coverage. The recent work of [7] proposed a greedy seed selection algorithm that optimizes prediction performance. Such seed-based methods are not completely satisfactory, because the seeds are selected statically in batch and thus do not reflect the user's responses during the interview process. In [20], while discussing the IGCN algorithm, the idea was mentioned of adaptively selecting the interview questions by fitting a decision tree to predefined user clusters. In particular, each node represents an interview question and the user is directed to one of the child nodes according to her response to the parent question. In [8], the authors proposed an algorithm that fits the decision tree to the users' ratings. This seems to provide a more disciplined approach than that based on the predefined user clusters discussed in IGCN. Our proposed framework goes one step further by integrating the decision tree construction into the matrix factorization framework of collaborative filtering. We should also mention that the decision tree structure used in those methods exhibits an interesting resemblance to the Bayesian network approach in [5].

A complementary line of research for solving the cold-start problem is to utilize the features of users or items. The content features can be used to capture the similarities between users and items and thus reduce the amount of data required to make accurate predictions [2, 3, 10, 15, 23, 25]. For example, the work of [2] utilizes features of items and users as the prior distribution for latent profiles in matrix factorization. Our method is based on the interview process and does not solely rely on features of users and items.

We also want to mention active learning for collaborative filtering [4, 11, 13, 17]. These methods usually select questions that are optimal with respect to some selection criterion, such as the expected value of information or the distance to the true user model. These methods are generally not suitable for the interview process, since the selection step usually involves optimizing the selection criterion, which is not efficient enough for an online interview.

[Figure 1 here. Visible node labels include Lethal Weapon 4 at the root, with "Like", "Dislike" and "Unknown" branches leading to questions about Jingle All the Way, Out to Sea, The Day the Earth Stood Still, The Ice Storm, Tin Men, Snow White and the Seven Dwarfs, Stop Making Sense, Paris, Texas, Star Trek V: The Final Frontier and Air Force One.]

Figure 1: An example of decision trees: top three levels of a model of depth 6 for the MovieLens data set.

3. FUNCTIONAL MATRIX FACTORIZATION

In this section, we describe the functional matrix factorization (fMF) method for cold-start collaborative filtering, which exploits the well-known matrix factorization methods for constructing the interview process. The key innovation is that we parameterize user profiles as a function of the responses to the possible questions of the interview process and use matrix factorization to compute the profiles.

3.1 Low-Rank Matrix Factorization

Before describing fMF, we first review the matrix factorization method for collaborative filtering. Let $r_{ij}$ denote the rating of user $i$ for item $j$, where $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, M$. In practice, only a small subset of ratings is observed, denoted by $K = \{r_{ij} \mid (i,j) \in O\}$. The goal of collaborative filtering is to predict the unknown ratings based on the ratings in $K$. Collaborative filtering exploits the basic heuristic that similar users tend to rate similar items in a similar way. One important class of methods for collaborative filtering is based on matrix factorization [12, 24]. Specifically, we associate a $K$-dimensional vector $u_i \in \mathbb{R}^K$ with each user $i$ and $v_j \in \mathbb{R}^K$ with each item $j$. The vectors $u_i$ and $v_j$ are usually called user profiles and item profiles, since they are intended to capture the characteristics of users and items. The rating $r_{ij}$ of user $i$ for item $j$ can be approximated by a similarity function of $u_i$ and $v_j$ in the low-dimensional space, for example, $r_{ij} = u_i^T v_j$.

Given the set of known ratings $K$, the parameters $u_i$ and $v_j$ can be estimated by fitting the training data, i.e., by solving the following optimization problem:

$$\min_{u_i, v_j} \sum_{(i,j) \in O} (r_{ij} - u_i^T v_j)^2.$$

Regularization terms such as the Frobenius norms of the profile vectors can be introduced to avoid overfitting. The problem can be solved by existing numerical optimization methods such as alternating minimization and stochastic gradient descent. In our implementation, we use alternating optimization for its amenability to the cold-start setting. Specifically, the optimization process performs the following two updates alternately.

First, for $i = 1, 2, \ldots, N$, we minimize with respect to $u_i$ with all $u_{i'}$, $i' \neq i$, and all $v_j$ fixed:

$$u_i = \operatorname{argmin}_{u_i} \sum_{(i,j) \in O} (r_{ij} - u_i^T v_j)^2,$$

which is a linear regression problem with squared loss. The closed-form solution can be expressed as

$$u_i = \Big( \sum_{(i,j) \in O} v_j v_j^T \Big)^{-1} \sum_{(i,j) \in O} r_{ij} v_j.$$

Then, for $j = 1, 2, \ldots, M$, we minimize with respect to $v_j$ with all $v_{j'}$, $j' \neq j$, and all $u_i$ fixed:

$$v_j = \operatorname{argmin}_{v_j} \sum_{(i,j) \in O} (r_{ij} - u_i^T v_j)^2,$$

which is also a linear regression problem with a similar closed-form solution.

3.2 Functional Matrix Factorization

Now we consider constructing the interview process for cold-start collaborative filtering. Assume that a new user registers at the recommendation system and nothing is known about her. To capture the user's preferences, the system initiates several interview questions and queries the responses from the user. Based on the responses, the system constructs a profile for the user and provides recommendations accordingly.

In the plain matrix factorization model described in Section 3.1, the user profile $u_i$ is estimated by optimizing the $\ell_2$ loss on the historical ratings $r_{ij}$. This model does not directly apply to cold-start settings, because no rating is observed for the new user prior to the interview process. To build user profiles adaptively according to the user's responses in the course of the interview process, we propose to parameterize the user profile $u_i$ in such a way that it is tied to user $i$'s responses in the form of a function, hence the name functional matrix factorization (fMF). More precisely, assume there are $P$ possible interview questions. We assume that an answer to a question takes values in the finite set $\{0, 1, \text{Unknown}\}$, representing "Dislike", "Like" and "Unknown", respectively. Furthermore, let $a_i$ denote the $P$-dimensional vector representing the answers of user $i$ to the $P$ questions. We tie the profile to the answers by assuming $u_i = T(a_i)$, where $T$ is a function that maps the responses $a_i$ to the user profile $u_i \in \mathbb{R}^K$. To make recommendations for user $i$, we simply use $r_{ij} = v_j^T T(a_i)$.

Our goal is to learn both $T$ and $v_j$ from the observed ratings $K$. To this end, substituting $u_i = T(a_i)$ into the low-rank matrix factorization model, we have the following optimization problem:

$$T, V = \operatorname{argmin}_{T \in \mathcal{H},\, V} \sum_{(i,j) \in O} (r_{ij} - v_j^T T(a_i))^2 + \lambda \|V\|^2, \qquad (1)$$

where $V = (v_1, \ldots, v_M)$ is the matrix of all item profiles, $\mathcal{H}$ is the space from which the function $T(a)$ is selected, and the second term is the regularization term.

Several issues need to be addressed in order to construct the interview process by the above functional matrix factorization. First, the number of all possible interview questions can be quite large (e.g., up to millions of items in movie recommendation), yet a user is only patient enough to answer a few interview questions. Second, the interview process should be adaptive to the user's responses; in other words, a follow-up question should be selected based on the user's responses to the previous questions.
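As an aside on the plain alternating updates of Section 3.1, the two closed-form steps can be sketched as follows. This is an illustrative NumPy sketch under our own simplifications, not the authors' implementation: the ratings are held in a dense matrix with a boolean mask of observed entries, and a small ridge term `lam` (our addition) keeps the normal matrices invertible.

```python
import numpy as np

def als(R, O, K=10, n_iters=20, lam=0.1):
    """Alternating least squares for R ~ U^T V on observed entries.

    R: (N, M) rating matrix; O: (N, M) boolean mask of observed entries.
    Returns U (K, N) user profiles and V (K, M) item profiles.
    """
    N, M = R.shape
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(K, N))
    V = rng.normal(scale=0.1, size=(K, M))
    I = np.eye(K)
    for _ in range(n_iters):
        # u_i = (sum_j v_j v_j^T + lam I)^{-1} sum_j r_ij v_j, over observed j
        for i in range(N):
            js = np.flatnonzero(O[i])
            if js.size:
                Vj = V[:, js]                     # profiles of items i rated
                U[:, i] = np.linalg.solve(Vj @ Vj.T + lam * I, Vj @ R[i, js])
        # symmetric closed-form update for each item profile v_j
        for j in range(M):
            us = np.flatnonzero(O[:, j])
            if us.size:
                Ui = U[:, us]                     # profiles of users who rated j
                V[:, j] = np.linalg.solve(Ui @ Ui.T + lam * I, Ui @ R[us, j])
    return U, V
```

On noiseless low-rank data this alternation drives the squared error on the observed entries toward zero within a few iterations, which is the behavior the cold-start variant below builds on.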
Therefore, the selection process should be efficient enough to generate interview questions in real time once the function $T(a)$ is constructed. In addition, since we allow a user to answer "Unknown" to the interview questions, we need to deal with such missing values as well.

Following the prior works [8, 20], we use a ternary decision tree to represent $T(a)$. Specifically, each node of the decision tree corresponds to an interview question and has three child nodes. When the user answers the interview question, she is directed to one of the three child nodes according to her answer. As a result, each user follows a path from the root node to a leaf node during the interview process. A user profile is estimated at each leaf node based on the users' responses, i.e., $T(a)$. The number of interview questions presented to any user is bounded by the depth of the decision tree, generally a small number determined by the system. Also, non-responses to a question are handled easily in the decision tree through the introduction of an "Unknown" branch.

3.3 Alternating Optimization

The objective function defined in Eq. (1) can be optimized through an alternating minimization process. Specifically, we alternate between the following two steps:

1. Given $T(a)$, we compute $v_j$ by regularized least squares regression. Formally, for each $j$, we find $v_j$ such that

$$v_j = \operatorname{argmin}_{v_j} \sum_{(i,j) \in O} (r_{ij} - v_j^T T(a_i))^2 + \lambda \|v_j\|^2.$$

This problem has a closed-form solution given by

$$v_j = \Big( \sum_{(i,j) \in O} T(a_i) T(a_i)^T + \lambda I \Big)^{-1} \sum_{(i,j) \in O} r_{ij} T(a_i), \qquad (2)$$

where $I$ is the identity matrix of appropriate size.

2. Given $v_j$, we fit a decision tree $T$ such that

$$T = \operatorname{argmin}_{T \in \mathcal{H}} \sum_{(i,j) \in O} (r_{ij} - T(a_i)^T v_j)^2. \qquad (3)$$

A critical challenge is that the number of possible trees grows exponentially with their depth, so the search space can be extremely large even for trees of limited depth. It is therefore computationally expensive to obtain a globally optimal solution to the above problem. We address this problem by proposing an efficient greedy algorithm for finding an approximate solution.

3.4 Decision Tree Construction

Traditional decision tree algorithms such as C4.5 or CART [6] usually minimize objective functions such as classification error or squared loss for regression. In our scenario, the objective function is defined in Eq. (3), and we now describe how to build a decision tree in a greedy and recursive fashion to minimize it. Specifically, at each node, we select the best interview question by optimizing the objective defined in Eq. (3) based on the responses to this question; the decision tree then splits the user set into three subsets corresponding to the child nodes (i.e., "Like", "Dislike" and "Unknown"). We carry out this procedure recursively until the tree reaches a certain depth. In our experiments, the depth is usually set to a small number between 3 and 7.

Formally, starting from the root node, the set of users at the current node is partitioned into three disjoint subsets $R_L(p)$, $R_D(p)$ and $R_U(p)$, corresponding to "Like", "Dislike" and "Unknown" responses to the interview question $p$:

$$R_L(p) = \{i \mid a_{ip} = \text{"Like"}\},$$
$$R_D(p) = \{i \mid a_{ip} = \text{"Dislike"}\},$$
$$R_U(p) = \{i \mid a_{ip} = \text{"Unknown"}\}.$$

To find the optimal question $p$ that leads to the best split, we minimize the following objective:

$$\min_p \sum_{i \in R_L(p)} \sum_{(i,j) \in O} (r_{ij} - u_L^T v_j)^2 + \sum_{i \in R_D(p)} \sum_{(i,j) \in O} (r_{ij} - u_D^T v_j)^2 + \sum_{i \in R_U(p)} \sum_{(i,j) \in O} (r_{ij} - u_U^T v_j)^2, \qquad (4)$$

where $u_L$, $u_D$ and $u_U$ are the optimal profiles for users in the child nodes corresponding to the answers "Like", "Dislike" and "Unknown", respectively:

$$u_L = \operatorname{argmin}_u \sum_{i \in R_L(p)} \sum_{(i,j) \in O} (r_{ij} - u^T v_j)^2,$$
$$u_D = \operatorname{argmin}_u \sum_{i \in R_D(p)} \sum_{(i,j) \in O} (r_{ij} - u^T v_j)^2,$$
$$u_U = \operatorname{argmin}_u \sum_{i \in R_U(p)} \sum_{(i,j) \in O} (r_{ij} - u^T v_j)^2.$$

There also exist closed-form solutions for $u_L$, $u_D$ and $u_U$, as follows:

$$u_L = \Big( \sum_{i \in R_L(p)} \sum_{(i,j) \in O} v_j v_j^T \Big)^{-1} \sum_{i \in R_L(p)} \sum_{(i,j) \in O} r_{ij} v_j, \qquad (5)$$

$$u_D = \Big( \sum_{i \in R_D(p)} \sum_{(i,j) \in O} v_j v_j^T \Big)^{-1} \sum_{i \in R_D(p)} \sum_{(i,j) \in O} r_{ij} v_j, \qquad (6)$$

$$u_U = \Big( \sum_{i \in R_U(p)} \sum_{(i,j) \in O} v_j v_j^T \Big)^{-1} \sum_{i \in R_U(p)} \sum_{(i,j) \in O} r_{ij} v_j. \qquad (7)$$

After the root node is constructed, its child nodes can be constructed in a similar way, recursively. We summarize our algorithms for functional matrix factorization and decision tree construction in Algorithm 1 and Algorithm 2, respectively.

Implementation and Computational Complexity. The time complexity of computing $v_j$ with Equation (2) is $O(MK^3)$, where $M$ is the number of items and $K$ is the dimension of the latent space. In order to compute $u_L$, $u_D$ and $u_U$ at each split, we need to compute the inverse of a square matrix of size $K$, which takes $O(K^3)$ time. The brute-force approach for generating the matrix itself requires $O(UMK^2)$ time. In order to reduce the time complexity, we observe that the coefficient matrix can be computed from the sum of ratings, the sum of squares of ratings and the number of ratings for each item within time $O(MK^2)$. With a method similar to that proposed in [8], the time complexity of computing these statistics for all items is $O(\sum_i N_i^2)$ for each level of the decision tree, where $N_i$ is the number of ratings by user $i$. Thus, the computational complexity of constructing the decision tree is $O(D \sum_i N_i^2 + LMK^3 + LM^2K^2)$, where $D$ is the depth of the tree and $L$ represents the number of nodes in the tree.
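One split of the greedy construction above can be sketched as follows: for each candidate question $p$, partition the users at the node, fit $u_L$, $u_D$ and $u_U$ in closed form as in Eqs. (5)-(7), and keep the question minimizing the objective of Eq. (4). This is an illustrative sketch, not the authors' code; all names and data layouts are ours, and the tiny ridge term `lam` is our addition to keep empty or degenerate groups solvable (the paper's Eqs. (5)-(7) are unregularized).

```python
import numpy as np

def best_split(users, answers, rated, R, V, lam=1e-6):
    """Pick the interview question minimizing the objective of Eq. (4).

    users:   user indices at the current node.
    answers: answers[i][p] in {"Like", "Dislike", "Unknown"}.
    rated:   rated[i] -> array of item indices rated by user i.
    R:       R[i, j] observed ratings; V: (K, M) item profile matrix.
    Returns the best question p and the child profiles u_L, u_D, u_U.
    """
    K, M = V.shape
    P = len(answers[users[0]])
    best_p, best_err, best_profiles = None, np.inf, None
    for p in range(P):
        groups = {"Like": [], "Dislike": [], "Unknown": []}
        for i in users:
            groups[answers[i][p]].append(i)
        err, profiles = 0.0, {}
        for ans, grp in groups.items():
            # closed form: u = (sum v v^T + lam I)^{-1} sum r v over the group
            A = lam * np.eye(K)
            b = np.zeros(K)
            for i in grp:
                Vj = V[:, rated[i]]
                A += Vj @ Vj.T
                b += Vj @ R[i, rated[i]]
            u = np.linalg.solve(A, b)
            profiles[ans] = u
            # accumulate the squared error this child profile leaves behind
            for i in grp:
                err += np.sum((R[i, rated[i]] - u @ V[:, rated[i]]) ** 2)
        if err < best_err:
            best_p, best_err, best_profiles = p, err, profiles
    return best_p, best_profiles
```

Recursing on the three returned groups until the depth bound, and caching the per-item rating statistics as described above, recovers the overall shape of the greedy tree builder.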
The computation time can be reduced by selecting a subset of items based on some criterion, such as the rating variance, and using only the selected items as candidates for the interview questions.

Algorithm 1 Alternating Optimization for Functional Matrix Factorization
Require: The training data K = {r_ij | (i,j) ∈ O}.
Ensure: Estimated decision tree T(a) and item profiles v_j, j = 1, 2, ..., M.
1:  Initialize v_j randomly for j = 1, ..., M.
2:  while not converged do
3:    Fit a decision tree T(a) using Algorithm 2.
4:    Update v_j with Equation (2).
5:  end while
6:  return T(a) and v_j, j = 1, 2, ..., M.

Algorithm 2 Greedy Decision Tree Construction
1:  function FitTree(UserSet)
2:    // UserSet: the set of users at the current node.
3:    Calculate u_L, u_D, u_U by Equations (5), (6) and (7) for p = 1, ..., P.
4:    Compute the split criterion L_p by Equation (4) for p = 1, ..., P.
5:    Find the best interview question p = argmin_p L_p.
6:    Split the users into three groups R_L(p), R_D(p) and R_U(p).
7:    if the squared error decreases after the split and depth < maxDepth then
8:      Call FitTree(R_L(p)), FitTree(R_D(p)) and FitTree(R_U(p)) to construct the child nodes.
9:    end if
10:   return T(a).
11: end function

3.5 Hierarchical Regularization

In the above section, $u_L$, $u_D$ and $u_U$ for each node are estimated by linear regression. When the amount of training data is small, the estimates may overfit the training data. The problem becomes especially severe for the nodes close to the leaves of the decision tree, because as the splitting process progresses, the users belonging to those nodes become fewer and fewer. To avoid overfitting, regularization terms need to be introduced. Although traditional $\ell_2$ regularization can be applied in this case [16], such an approach does not take into account the rich structure of the decision tree. Here, we propose to apply hierarchical regularization that utilizes the structure of the decision tree. Specifically, we shrink the coefficient $u$ of a node toward the one at its parent node. For example:

$$u_L = \operatorname{argmin}_u \sum_{i \in R_L(p)} \sum_{(i,j) \in O} (r_{ij} - u^T v_j)^2 + \lambda_h \|u - u_C\|^2,$$

where $u_C$ is the estimate at the current node, $u_L$ is the estimate at its child node corresponding to the answer "Like", and the parameter $\lambda_h$ controls the trade-off between training error and regularization. We use $\lambda_h = 0.03$ in our experiments, which seems to give good results. A similar idea of using the parent node as the prior for the child node is also studied in [8], but there the ratings at the parent node are used to improve the prediction at the child node.

Table 1: Statistics for Data Sets

Dataset     No. Users   No. Items   No. Ratings
MovieLens   6040        3900        1,000,209
EachMovie   72,916      1628        2,811,983
Netflix     480,189     17,770      104,706,033

4. EXPERIMENTS

In order to evaluate the proposed method for cold-start collaborative filtering, we carry out a set of controlled experiments on three widely used benchmark data sets: MovieLens, EachMovie and Netflix.

4.1 Data Sets and Evaluation Metrics

We first briefly describe the data sets.

• The MovieLens¹ data set contains 3900 movies, 6040 users and about 1 million ratings. In this data set, about 4% of the user-movie dyads are observed. The ratings are integers ranging from 1 (bad) to 5 (good).
• The EachMovie data set, collected by HP/Compaq Research, contains about 2.8 million ratings for 1628 movies by 72,916 users. The ratings are integers ranging from 1 to 6.
• The Netflix² data set is one of the largest test beds for collaborative filtering. It contains over 100 million ratings and was split into a training and a test subset. The training set consists of 100,480,507 ratings for 17,770 movies by 480,189 users. The test set contains about 2.8 million ratings. All ratings range from 1 to 5.

The performance of a collaborative filtering algorithm is evaluated in terms of the widely used root mean square error (RMSE) measure, defined as follows:

$$\text{RMSE} = \Big( \frac{1}{|T|} \sum_{(i,j) \in T} (r_{ij} - \hat{r}_{ij})^2 \Big)^{1/2},$$

where $T$ represents the set of test ratings, $r_{ij}$ is the ground-truth rating for movie $j$ by user $i$, and $\hat{r}_{ij}$ is the value predicted by a collaborative filtering model. We can also normalize the error of each user by the number of their ratings. Our initial results show that the two metrics are consistent in general, so we only report RMSE in our experiments.

4.2 Deriving Interview Responses from User Ratings

The responses $a_i$ to the interview process for user $i$ are required in order to train and evaluate the interview process. Following the standard settings [8, 20], we restrict the format of interview questions to be "Do you like item $j$?" For example, for a movie recommendation system, we can ask questions like "Do you like the movie Independence Day?" Then we can infer the responses of existing users from their ratings for the corresponding items. For instance, with ratings on a 1-5 scale, we assume that the response $a_{ij}$ of user $i$ to the question "Do you like item $j$?" can be inferred from her rating $r_{ij}$ as follows:

$$a_{ij} = \begin{cases} 0, & \text{if } (i,j) \in O \text{ and } r_{ij} \le 3 \\ 1, & \text{if } (i,j) \in O \text{ and } r_{ij} > 3 \\ \text{"Unknown"}, & \text{if } (i,j) \notin O \end{cases}$$

¹ http://www.grouplens.org/node/73
² http://www.netflixprize.com/
Intuitively, we assume that the user will respond with 0 (dislike) or 1 (like) for items she has rated with a rating of at most 3 or above 3, respectively. If the user did not rate an item selected for the interview process, we assume that she will respond with "Unknown". For the EachMovie data set, we use a similar method, except that we set the threshold to 4, since its ratings range from 1 to 6.
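The mapping above can be sketched in a few lines. This is a minimal illustrative sketch; the function name and the use of `None` for unobserved ratings are our own conventions, not the paper's.

```python
def rating_to_answer(rating, threshold=3):
    """Map an observed rating to a simulated interview answer.

    rating:    the user's rating if the item was rated, else None.
    threshold: 3 for 1-5 scales (MovieLens, Netflix), 4 for
               EachMovie's 1-6 scale.
    """
    if rating is None:
        # item not rated: the user is assumed to answer "Unknown"
        return "Unknown"
    return "Like" if rating > threshold else "Dislike"
```

For example, `rating_to_answer(5)` yields "Like", while `rating_to_answer(4, threshold=4)` yields "Dislike" under the EachMovie convention.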
RMSE
(a) Interview process 0.9
1.1
RMSE
RMSE
0.92 0.9433
0.91 0.9432
0.9 0.9431
0.89 0.943
0.88 0.9429
0.87
0.86 0.9428
10 20 30 40 50 60
0 5 10 15 20 25 30
K
The Depth of The Decision Tree
RMSE
fMF 0.9240 0.9216 0.9190 0.9134 0.9098 0.955
Tree 0.9458 0.9439 0.9409 0.9321 0.9329 0.95
TreeU 0.9906 0.9850 0.9771 0.9706 0.9728
0.945
MF 0.8702
0.94
0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
Table 8: RMSE on EachMovie data set in warm-start λh
setting
No. Questions 3 4 5 6 7
fMF 1.2548 1.2499 1.2449 1.2366 1.2226 Figure 7: The performance of hierarchical regular-
Tree 1.2709 1.2614 1.2664 1.2570 1.2549 ization. The performance is reported by setting the
TreeU 1.28742 1.2813 1.2790 1.2772 1.2751 depth of the decision tree to be 6.
MF 1.1790
that the optimal performance is achieved when a moderate
Table 9: RMSE on Netflix data set in warm-start set- λ is used. Especially, the model has a high risk of overfitting
ting if λ is too small.
No. Questions 3 4 5 6 7 The learning algorithm of the proposed model is an itera-
fMF 0.9770 0.9749 0.9721 0.9715 0.9703 tive process. In Figure 9, we plot the both training and test
Tree 0.9772 0.9761 0.9726 0.9717 0.9713 performance measured by RMSE with respect to the num-
TreeU 0.9914 0.9879 0.9842 0.9792 0.9752 ber of iterations. We can see that the algorithm converges
very quickly, usually within 5 iterations.
MF 0.9399
5. CONCLUDING REMARKS
the RMSE first decreases, and reaches the optima around The main focus of this paper is on the cold-start problem
K = 20; thereafter, the model becomes increasingly overparameterized and the performance in turn starts to degrade. For example, the performance with K = 30 is worse than that with K = 20. In principle, the optimal K should provide the best trade-off between fitting bias and model complexity.

One of our contributions is the use of hierarchical regularization to avoid overfitting. The impact of the hierarchical regularization is controlled by the parameter λh. We evaluate fMF with different values of λh on the MovieLens data set; the performance obtained by ℓ2 regularization is also reported for comparison. We can see from Figure 7 that hierarchical regularization always performs better than the basic ℓ2 regularization. This observation suggests that hierarchical regularization, by exploiting the structure of the decision tree, provides better regularization for the functional matrix factorization model.

Another parameter of interest is the regularization weight λ for the item profiles vj. We vary the value of λ and report the performance measured by RMSE in Figure 8.

in recommender systems. We have presented functional matrix factorization, a framework for simultaneously learning a decision tree for the initial interview and latent factors for user/item profiling. The proposed fMF algorithm seamlessly integrates matrix factorization for collaborative filtering and a decision-tree-based interview into one unified framework by reparameterizing the latent profiles as a function of user responses to the interview questions. We have established learning algorithms based on alternating minimization and demonstrated the effectiveness of fMF on real-world recommendation benchmarks.

The current fMF model is based on a basic matrix factorization formulation and does not take into account content features such as the demographic information of users. For future work, we plan to investigate how to utilize such content features to further enhance the performance of fMF in cold-start recommendations. Moreover, we also plan to conduct an in-depth examination of the missing-value problem, which seems to introduce considerable bias into the learning process, as revealed by our experiments.
Figure 9: Performance measured by RMSE with respect to the number of iterations on the MovieLens data set.
6. ACKNOWLEDGEMENTS
We would like to thank the reviewers for their constructive and insightful comments. Part of this work was supported by NSF Grant IIS-1049694, a Yahoo! faculty grant and 111 Project B07022.