
Decision Support Systems 51 (2011) 519–531


Collaborative error-reflected models for cold-start recommender systems


Heung-Nam Kim a,⁎, Abdulmotaleb El-Saddik a,c, Geun-Sik Jo b

a School of Information Technology and Engineering, University of Ottawa, Canada
b Department of Information Engineering, Inha University, Korea
c Faculty of Engineering, New York University Abu Dhabi, UAE

Article history:
Received 16 April 2010
Received in revised form 8 December 2010
Accepted 27 February 2011
Available online 4 March 2011

Keywords:
Collaborative filtering
Cold start problems
Recommender systems

Abstract: Collaborative Filtering (CF), one of the most successful technologies among recommender systems, is a system assisting users to easily find useful information. One notable challenge in practical CF is the cold start problem, which can be divided into cold start items and cold start users. Traditional CF systems are typically unable to make good quality recommendations in the situation where users and items have few opinions. To address these issues, in this paper, we propose a unique method of building models derived from explicit ratings and we apply the models to CF recommender systems. The proposed method first predicts actual ratings and subsequently identifies prediction errors for each user. From this error information, pre-computed models, collectively called the error-reflected model, are built. We then apply the models to new predictions. Experimental results show that our approach obtains significant improvement in dealing with cold start problems, compared to existing work.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

The prevalence of digital devices and the development of Web 2.0 technologies and services enable end-users to be producers as well as consumers of media content. Even in a single day, an enormous amount of content including digital video, blogging, photography and wikis is generated on the Web. It is getting more difficult to make a recommendation to a user about what he/she will prefer among those items automatically, not only because of the huge amount of data, but also because of the difficulty of automatically grasping the meanings of such data. Recommender systems, which have emerged in response to the above challenges, provide users with recommendations of items that are likely to fit their needs [2].

One of the most successful technologies among recommender systems is Collaborative Filtering (CF). Numerous on-line companies (e.g., Amazon.com, Netflix.com, and Last.fm) apply CF to provide recommendations to their customers. CF has an advantage over content-based filtering, which is the ability to filter any type of item, such as text, music, videos and photos [11]. Because the filtering process is based only on historical information about whether or not a given target user has preferred an item before, analysis of the actual content itself is not necessarily required. However, despite its success and popularity, CF encounters serious limitations with quality evaluation, namely the cold start problems.

The cold start problems, which can be divided into cold start items and cold start users, occur when available data is insufficient to enable the user to grasp such data [2]. A cold start item is caused by a new item. In a CF-based recommender system, an item cannot be recommended until a number of users have previously rated it. This is known as the cold start item problem [27]. Such an item is hardly ever recommended to users due to insufficient user opinions. Another notable challenge in recommender systems is the cold start user problem [1,30,31]. A cold start user is a new user who has joined a CF-based recommender system and has presented few opinions. In this situation, it is often the case that there is no intersection at all between two users, and hence the similarity is not computable. Even when the computation of similarity is possible, it may not be very reliable because of the insufficient processed information. Accordingly, the system is generally unable to make high quality recommendations [2]. These problems, particularly for the cold start items, can be partially alleviated by content-based technologies, because they provide recommendations by comparing the properties or content contained in an item to those of a user's interest items. Therefore, a number of studies have attempted to incorporate content-based techniques into collaborative filtering [15,20,27]. Although such systems give promise of overcoming the problems, their main drawback is that the filtering processes generally depend on the type of items (e.g., articles, images, music, and videos); consequently, a system working in a particular application domain cannot be directly applied to different domains without modifications. Moreover, for some domains it is hard to automatically analyze the underlying content, and a user's interest cannot always be characterized by the content properties contained in an item [6,7].

⁎ Corresponding author. Tel.: +1 613 562 5800x6248; fax: +1 613 562 5664. E-mail address: hnkim@mcrlab.uottawa.ca (H.-N. Kim).


In this paper, we address the above issues by introducing a unique method of building models that can be applied to CF recommender systems. Our aim is to build a recommender system derived from explicit ratings so that it can be flexible for any type of item. The proposed method is divided into two phases, an offline phase and an online phase. The offline phase is a pre-computed model building phase in which most tasks can be conducted. The online phase is either a prediction or recommendation phase in which the models are used. In the model building phase, we first determine pre-predicted ratings and subsequently identify the pre-prediction errors for each user. From this error information, error-reflected models are built. The error-reflected models, which reflect the average pre-prediction errors of user neighbors and of item neighbors, can make accurate predictions in the situation where users or items have few opinions. In addition, in order to reduce re-building tasks, the error-reflected models are designed such that the models are effectively updated and users' new opinions are incrementally reflected, even when users present new rating feedback.

The subsequent sections are organized as follows: Section 2 briefly discusses previous studies related to collaborative filtering. In Section 3, we describe a detailed method of building models. In Section 4, we then provide a description of how the system uses the models for item predictions. In Section 5, an experimental evaluation is presented comparing our approach with existing work. Finally, we present the conclusions and future work.

2. Background and preliminaries

In this section, we briefly explain the concepts used in our research on recommender systems, especially those related to user-based CF and item-based CF. CF is based on the fact that "word of mouth" opinions of other people have considerable influence on buyers' decision making [14,28]. If advisors have similar preferences to the buyer, the buyer is much more likely to be affected by their opinions. The most common way to obtain users' opinions is to use rating information given explicitly or observed implicitly [6,12]. In the case of explicit ratings indicating how relevant or interesting a specific item is to a user, users are required to explicitly evaluate an item using like/dislike, thumbs up/down (a binary scale), or numerical values (e.g., a scale of 1–5 points). On the other hand, in the case of implicit ratings, information observed implicitly from users' behaviors is treated as a preference indicator, e.g., a user viewed, accessed, listened to, or bought an item [21]. The best use of implicit ratings has been seen in Amazon.com, where a user's past purchased items are used to make product recommendations [18].

Although the field of CF research covers a large number of information filtering problems, in this paper, we focus on explicit numerical ratings that can be represented as an m × n user–item rating matrix R [26].

Definition 1. User–item rating matrix, R.

If there is a list of m users U = {u1, u2, …, um} and a list of n items I = {i1, i2, …, in} mapping between the user–item pairs and the explicit ratings, then the m × n user–item data can be represented as a rating matrix. This matrix is called a user–item rating matrix, R. The matrix rows represent users, the columns represent items, and Ru,j represents the rating of user u of item j. Some of the entries are not filled, as there are items not rated by some users.

In matrix R, an element Ru,j either exists as a numerical ordinal scale between Rmin and Rmax, or is empty. If user u rates item j with Rmin, it implies he/she does not have any preference for item j. On the contrary, if user u rates item j with Rmax, it means the item is suited to his/her preference. If user u has not previously rated item j (i.e., a blank element), we assign ∅ to the value of Ru,j (i.e., Ru,j = ∅), and ultimately those items (i.e., Ru,* = ∅) which have not yet been rated by the target user can be considered for recommendation to the target user.

In CF-based recommendation schemes, two approaches have mainly been developed: memory-based CF (also known as user-based CF) and model-based CF [5]. Following the proposal of GroupLens [24], the first system to generate automated recommendations, user-based CF approaches have seen the widest use in recommender systems. User-based CF uses a similarity measurement between neighbors and a target user to learn and predict preferences toward new items or unrated products by the target user. However, despite the popularity of user-based CF algorithms, they have some serious problems relating to the increasing computational complexity of recommendations as the number of users and items increases. In addition, problems of sparsity due to the insufficiency of users' historical information should be seriously considered [25]. In order to improve the scalability and real-time performance of large applications, a variety of model-based recommendation techniques have been developed. Model-based approaches, such as our algorithm, provide item recommendations by first developing a pre-computed model [13]. In comparison to user-based approaches, model-based approaches are typically faster in terms of recommendation time, though they may have an expensive learning or model building process. A new class of model-based CF, called item-based CF, has been proposed [9,25] and applied to commercial recommender systems such as Amazon.com [18]. Instead of computing the similarities between users, an item-based CF reviews a set of items that the target user has rated and selects the most similar items based on the similarities between the items.

Usually, user-based and item-based CF systems involve two steps. First, the neighbor group, consisting of users who have a similar preference to the target user (for user-based CF) or of the items similar to the items rated by the target user (for item-based CF), should be determined by using a variety of similarity computing methods. Based on the group of neighbors, we obtain the prediction values of particular items, estimating how much the target user is likely to prefer the items, and then the top-N items with the highest predicted values of interest to the target user are identified.

2.1. Neighborhood formation

The important task in CF-based recommendations is neighborhood formation, because different neighbor users or items lead to different recommendations. Here, neighbors simply mean a group of like-minded users similar to a target user, or a set of items similar to those that were previously identified as being preferred by the target user. The number of neighbors may vary depending on the characteristics of the domain and the application. Since this number also has a significant impact on the quality of results from the CF, recommender systems should determine the size of the neighborhood in order to compute the prediction results effectively [26].

2.1.1. User neighborhood

The main goal of neighborhood formation for a user-based CF is to identify the set of user neighbors, which is defined as the group of users exhibiting preferred items similar to those of the target user. For finding the nearest neighbors, a variety of similarity methods have been researched, such as the widely used Pearson correlation [24], cosine similarity, weight amplification, inverse user frequency, and default rating [5], including probability-based approaches. According to the results of the selected similarity measure, the particular k users with the highest similarity are identified as neighbors. Fig. 1 shows an example of calculating the similarity of two users, u and v, from a user–item rating matrix R. Finally, for m users, the similarity of users can be represented as an m × m user–user similarity matrix A, where both rows and columns represent users. In matrix A, Au,v, which relates the u-th user to the v-th user, is set to the similarity value between the pair of users u and v if the corresponding similarity value is greater than the k highest similarity value in the u-th row of A, and 0 otherwise. The non-zero entries of each row, often called the k nearest neighbors (KNN), are used to recommend items for each user of the row.

Fig. 1. A user–user similarity matrix A used in user-based CF.
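To make the neighborhood formation step concrete, the following sketch builds the user–user similarity matrix A with plain cosine similarity over co-rated items and keeps only the k highest entries per row. It is a minimal illustration of Section 2.1.1 with our own function names, not the authors' implementation; the weighting refinements listed above (inverse user frequency, default rating, etc.) are omitted.

```python
import numpy as np

def user_similarity_matrix(R, k):
    """Build the user-user similarity matrix A of Section 2.1.1.

    R is an m x n rating matrix with np.nan marking unrated cells.
    Only the k highest similarities per row (the k nearest neighbors)
    are kept; all other entries are set to 0.
    """
    m = R.shape[0]
    A = np.zeros((m, m))
    for u in range(m):
        for v in range(m):
            if u == v:
                continue
            co = ~np.isnan(R[u]) & ~np.isnan(R[v])  # co-rated items
            if not co.any():
                continue  # no intersection: similarity not computable
            ru, rv = R[u, co], R[v, co]
            denom = np.linalg.norm(ru) * np.linalg.norm(rv)
            if denom > 0:
                A[u, v] = float(ru @ rv) / denom    # cosine similarity
    for u in range(m):                              # keep top-k per row
        nz = A[u][A[u] > 0]
        if nz.size > k:
            A[u][A[u] < np.sort(nz)[-k]] = 0.0
    return A
```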

2.1.2. Item neighborhood

Instead of computing similarities between users, an item neighborhood for an item-based CF is generated by computing similarities between items. The main use of the neighborhood is to identify, for each item, the set of items that is most likely to be preferred by users. For capturing the similarity relationships between pairs of items, Sarwar et al. [25] proposed several similarity measures between pairs of items, such as cosine-based similarity, correlation-based similarity and adjusted cosine similarity. The basic idea of computing the similarity between two items is to first look into the users who have rated both of these two items and then to apply one of the similarity measures to calculate a similarity value between the two items [25].

Fig. 2 illustrates the computation of the similarity for a pair of items i and j, corresponding to the i-th and j-th columns of the rating matrix R. Similar to the user–user similarity matrix A, for n items, the similarity of items can be represented as an n × n item–item similarity matrix D, where the i-th row stores the k′ most similar items to item i. In matrix D, Di,j is set to the similarity value between two items i and j if the corresponding similarity value is greater than the k′ highest similarity value in the i-th row of D, and 0 otherwise. The non-zero entries of each row, often called the k′ most similar items (MSI), are used to recommend items for the target user [9].

Fig. 2. An item–item similarity matrix D used in item-based CF.
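Under the same assumptions, the item neighborhood of Section 2.1.2 can be sketched by applying the routine above to the transposed rating matrix, so that item rating vectors play the role of rows. This reuse is our simplification and stands in for the cosine, correlation and adjusted-cosine variants of Sarwar et al. [25].

```python
def item_similarity_matrix(R, k_prime):
    """Build the item-item similarity matrix D of Section 2.1.2.

    Transposing R turns item rating vectors into rows, so the user-side
    routine above directly yields D with the k' most similar items (MSI)
    kept per row, mirroring the KNN selection for users.
    """
    return user_similarity_matrix(R.T, k_prime)
```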

2.2. Predictions and recommendations

Once the neighborhood is generated, various methods can be used to combine the ratings of neighbors to compute a prediction value on unrated items for the target user. The preference rating of each neighbor is usually weighted by the similarity value, which is computed when the neighbors are determined. The more similar a neighbor is to a target user or item, the more influence he/she has in calculating a prediction value. After predicting how much a target user will like particular items not previously rated by him/her, the top-N item set, the set of ordered items with the highest predicted values, is identified and recommended. The target user can present feedback on whether he/she actually likes the recommended top-N items, or how much he/she prefers those items as scaled ratings.

2.2.1. User-based prediction

In a user-based CF, we can predict the target user's interest in the target item based on its ratings from other similar users. The main idea is that ratings by more similar users contribute more to predicting the target item rating [11]. Formally, the measurement of how much target user u prefers item j is given by:

$$\check{R}_{u,j} = \bar{R}_u + \frac{\sum_{v \in KNN} \left(R_{v,j} - \bar{R}_v\right) \cdot sim(u,v)}{\sum_{v \in KNN} \left|sim(u,v)\right|} \quad (1)$$

where KNN is the set of k nearest neighbors of user u, and Rv,j is the rating of user v on item j. In addition, R̄u and R̄v refer to the average ratings of users u and v, and sim(u,v) represents the similarity between users u and v, which can be calculated by a number of different methods, as discussed in Section 2.1.1.

2.2.2. Item-based prediction

Essentially, item-based prediction tries to capture how the target user has rated similar items [25]. For predicting a particular item in an item-based CF, we can calculate a weighted average of the user's ratings Ru,i using the similarity between two items as the weight. Formally, we can calculate the predicted rating of target user u for target item j using the following formula:

$$\check{R}_{u,j} = \frac{\sum_{i \in MSI} sim(i,j) \cdot R_{u,i}}{\sum_{i \in MSI} \left|sim(i,j)\right|} \quad (2)$$

where MSI is the set of k′ most similar items to item j, and sim(i, j) represents the similarity between items i and j, which can be calculated in the manner mentioned in Section 2.1.2. The main idea behind this prediction is that the weighted rating of items that are similar to the target item is a good estimate of the rating for that item.

3. Building collaborative models using pre-prediction errors

In this section, we describe our method of building models, collectively called an error-reflected model, in detail. According to the type of neighbors used in building the model, the error-reflected model is divided into three classes: a user-based model, an item-based model, and a hybrid model.

In this paper, prior to generating rating predictions for the items that users have not yet rated, we predict values of the items that the users have previously rated. That is, we validate how accurately the rating of a given user for an item is predicted, compared to the actual rating given by him/her for that item. In this sense, a prediction can be divided into two cases: a prediction of a target user on items that have already been rated by the target user, and a prediction of a target user on items that have not yet been rated by the target user. To differentiate the former from the latter, we label the former case a pre-prediction.

3.1. Computing pre-predictions

Prior to generating a pre-prediction, we first identify the k nearest neighbors of each user by using cosine similarity with the inverse user frequency [5] and the k′ most similar items of each item by using cosine similarity with the inverse item frequency. For the pre-prediction of the items, we withhold a single selected item for a target user within the entire selection of items he/she rated, and then try to predict its value. In order to compute the pre-prediction value of target user u for item j, we consider not only the rating propensity of users who have similar tastes to user u but also the past rating propensity of user u for items similar to item j. Formally, the measurement of the pre-prediction is given by:

$$P_{u,j} = \bar{R}^{j}_{knn(u)} + \frac{\sum_{i \in MSI_u(j)} \left(R_{u,i} - \bar{R}^{i}_{knn(u)}\right) \times sim(i,j)}{\sum_{i \in MSI_u(j)} sim(i,j)} \quad (3)$$

where Pu,j is the pre-predicted value of user u on item j and MSIu(j) is the set of most similar items of item j. R̄^i_knn(u) and R̄^j_knn(u), respectively, refer to the average rating of the nearest neighbors of user u for items i and j. Note that if the average rating of the user neighborhood for a certain item is unavailable, we always use the average rating of the item rated by all users instead. sim(i, j) represents the similarity between items i and j, which can be calculated using diverse similarity algorithms such as cosine-based similarity, correlation-based similarity and adjusted cosine similarity [25]. However, we also consider the number of users' ratings of items in generating item-to-item similarities, namely the inverse item frequency. When the inverse item frequency is applied to the cosine similarity technique, the similarity between two items i and j is measured by Eq. (4):

$$sim(i,j) = \frac{\sum_{u \in (U_i \cap U_j)} \left(R_{u,i} \times \log(n/f_u)\right) \times \left(R_{u,j} \times \log(n/f_u)\right)}{\sqrt{\sum_{u \in U_i} \left(R_{u,i} \times \log(n/f_u)\right)^2} \times \sqrt{\sum_{u \in U_j} \left(R_{u,j} \times \log(n/f_u)\right)^2}} \quad (4)$$

where Ui and Uj refer to the sets of users who rated items i and j, respectively. Ru,i is the rating of user u on item i, whereas Ru,j is the rating of user u on item j. The inverse item frequency of user u is defined as log(n/fu), where fu is the number of items rated by user u and n is the total number of items in the system. If user u rated all items, then the value of the inverse item frequency is 0. As in the inverse user frequency [5], the main concept of the inverse item frequency dictates that users rating numerous items contribute less with regard to similarity than users rating a smaller number of items [9].

To illustrate a simple example of computing a pre-prediction, consider the user–item rating matrix R shown in Table 1. Alice has already rated the movies "Seven," "JFK," "Shrek" and "Godzilla," but she has not yet seen "Titanic" and "AI". For Alice, the movies for the pre-prediction are those movies that have already been rated by her, IAlice = {Seven, JFK, Shrek, Godzilla}.

Table 1
An example of a user–item rating matrix, R.

         Seven   JFK   Titanic   Shrek   AI   Godzilla
Alice    3       5     –         1       –    5
Bob      4       4     5         4       3    –
John     2       5     4         –       –    4
Dannis   –       1     2         –       5    4

Assume that we calculate the prediction value for "JFK" by using Eq. (3), and suppose that KNN(Alice), the similar user neighborhood of Alice, and MSI(JFK), the similar item neighborhood of "JFK," are as follows:

− KNN(Alice) = {John, Bob}
− MSI(JFK) = {(Seven, 0.95), (Godzilla, 0.88), (Titanic, 0.71)}.

Analyzing the ratings for "JFK" of the neighbors of Alice, John rated it with 5 points and Bob rated it with 4 points. Hence, the average rating of the neighborhood, R̄^JFK_knn(Alice), becomes 4.5. In addition, analyzing the ratings of Alice for items that are similar to "JFK," the value is calculated as follows:

$$\frac{(3-3) \times 0.95 + (5-4) \times 0.88}{0.95 + 0.88} = 0.48.$$

In this calculation, "Titanic" is excluded because Alice has not yet rated it. Finally, we can calculate the pre-predicted value of Alice for "JFK" as follows:

$$P_{Alice,JFK} = 4.5 + 0.48 = 4.98.$$

This implies that the movie "JFK" is pre-predicted as 4.98, even though the actual rating of Alice on it is 5.

Definition 2. User–item pre-prediction matrix, P.

For an m × n user–item rating matrix, R, the pre-predictions can be represented as an m × n user–item pre-prediction matrix, P. The matrix rows represent users, the columns represent items, and Pu,j, with Rmin ≤ Pu,j ≤ Rmax, represents the pre-predicted rating of user u on item j.

Analogous to the rating matrix R, the value Pu,j may be assigned ∅, indicating that user u has not previously rated item j. Occasionally, we cannot compute the pre-prediction, due to the following cases: i) the user neighborhood of user u does not exist, or ii) the item neighborhood of item j does not exist. If these cases happen, the average rating value of user u is used as Pu,j.

3.2. Computing pre-prediction errors

Once the predictions for users on items are represented in the pre-prediction matrix, the error of each prediction can be computed by subtracting the pre-predicted value from the actual rating. Given the set of actual and pre-predicted rating pairs ⟨Ru,j, Pu,j⟩ for every actual rating value in the rating matrix R and the corresponding pre-predicted value in the pre-prediction matrix P, a prediction error is calculated as:

$$E_{u,j} = R_{u,j} - P_{u,j}. \quad (5)$$

Fig. 3 illustrates the process of computing the pre-prediction error. For example, the error of the pre-predicted value of Alice for "JFK," EAlice,JFK, as mentioned in the previous section, becomes 0.02. Formally, from matrices R and P, the prediction errors can be represented by a user–item error matrix.

Fig. 3. The process of computing a pre-prediction error. The pre-prediction error can be calculated by subtracting the pre-prediction value from the actual rating.

Definition 3. User–item error matrix, E.

From the given set of actual rating and pre-predicted value pairs ⟨Ru,j, Pu,j⟩ for all the data in matrices R and P, a user–item error matrix, E, can be filled with error entries. Each entry Eu,j in E represents the pre-prediction error of the u-th user for the j-th item. Eu,j lies between (Rmin − Rmax) and (Rmax − Rmin). Some of the entries are not filled, as there are items that are not rated by some users.

In the case that the pre-prediction was overestimated, the pre-prediction error value Eu,j becomes negative; on the contrary, in the case that the value was underestimated, Eu,j becomes positive (Fig. 4). If Eu,j = 0, it means that the algorithm exactly estimated the actual rating. The closer the value approaches 0, the higher the accuracy of a pre-prediction value. A pre-prediction error can be analyzed as follows:

− Ru,j < Pu,j (overestimation): In the case of overestimation, the pre-predicted value is estimated as being higher than the actual rating value of the user. This is the result of the prediction based on the rating tendency of users similar to target user u and the past rating tendency of target user u for items that are similar to item j. Considering this point, with respect to a new prediction of item j for a certain user similar to the target user u, it may be necessary to slightly decrease the predicted value. In addition, in the case of a new prediction of the target user u for items similar to item j, it may also be necessary to slightly decrease the predicted value.

− Ru,j > Pu,j (underestimation): Contrary to the overestimation, it may be necessary to slightly increase the predicted value with respect to a new prediction in the case of underestimation.

Fig. 4. Overestimation and underestimation of a pre-prediction value.

3.3. Building error-reflected models

As mentioned in the previous section, the pre-prediction error of a target user is the result that reflects the opinions of like-minded users and the target user's own rating tastes. Therefore, pre-prediction errors of users similar to a target user on a certain item may contain valuable information for making a prediction of the target user for that item. Likewise, pre-prediction errors of a target user for items that are similar to a certain item may be helpful in estimating the rating of the target user for that item. In fact, there are recent studies [4,10,16,22] that have made attempts to apply pre-predictions to CF recommender systems. Similar to their motivation, the fundamental assumption of our study is that there are systematic, and thus exploitable, pre-prediction errors for predicting items that have not yet been rated by the target user.

Theoretically, the error matrix E itself can be used for a new prediction. Intuitively, however, if a pre-prediction value is accurate, it implies that the pre-prediction process accurately reflects the tendency of the user's past ratings on similar items and the tendency of similar users' ratings on the same items. On the contrary, if a pre-prediction value shows a high deviation from the corresponding actual rating, there may be noise information used in making the prediction. Therefore, to avoid not only increasing unnecessary computation cost but also including unnecessary noise information, the model is built using only the data within a predetermined threshold (θ). Note that the prediction error value becomes negative in the case of overestimation whereas it becomes positive in the case of underestimation. Therefore, we select the elements of the error matrix E that satisfy the following condition:

$$\left|E_{u,j}\right| < \theta, \quad \text{for } R_{u,j} \neq \emptyset. \quad (6)$$
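Continuing the sketch, Eqs. (5) and (6) translate directly into elementwise matrix operations; this is illustrative only, with NaN standing in for ∅.

```python
import numpy as np

def error_matrix(R, P):
    """Eq. (5): E = R - P elementwise; NaN (unrated) cells propagate."""
    return R - P

def usable_errors(E, theta):
    """Eq. (6): mask of entries eligible for model building (|E| < theta).
    NaN comparisons are False, so unrated cells drop out automatically."""
    return np.abs(E) < theta
```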

The method of building models that are reflected by the prediction errors can be divided into three approaches: a user-based approach, an item-based approach, and a hybrid approach.

3.3.1. The user-based error-reflected model

The user-based error-reflected model is built by utilizing the pre-prediction errors, for a certain item, of the similar user neighbors of the target user u. The built model can be represented as an m × n user–item matrix Ê(θ). The matrix rows represent users, the columns represent items, and the j-th column of the u-th row holds the average pre-prediction error of the similar users of the target user u for item j, as defined in Eq. (7):

$$\hat{E}^{(\theta)}_{u,j} = \frac{\sum_{v \in KNN^{\theta}_{j}(u)} E_{v,j}}{card\left(KNN^{\theta}_{j}(u)\right)} \quad (7)$$

where KNNθj(u) denotes the set of users whose absolute value of the pre-prediction error for item j is less than θ among the similar neighbors of the target user u. In addition, card(KNNθj(u)) refers to the cardinality of the set KNNθj(u) (i.e., the number of elements in the set).

For example, assume that the pre-prediction errors for Table 1 are as shown in Table 2. If the error threshold value θ is 0.8 (θ = 0.8) and the size of the user neighborhood is 2 (k = 2), then the average of the pre-prediction errors of Alice's neighbors for "Titanic" can be calculated from the given values EBob,Titanic = 0.7 and EJohn,Titanic = −0.4:

$$\hat{E}^{(0.8)}_{Alice,Titanic} = (0.7 - 0.4)/2 = 0.15.$$

Table 2
An example of a user–item prediction error.

         Seven    JFK    Titanic   Shrek   AI      Godzilla
Alice    −0.3     0.02   –         −0.4    –       0.9
Bob      0.1      −2     0.7       2       −0.02   –
John     −0.15    0.2    −0.4      –       –       −0.3
Dannis   –        −2     0.5       –       −0.03   0.6

In an analogous fashion, Ê(0.8)Alice,AI of Alice for "AI" is calculated as −0.02. In the case of Bob, it is possible to compute Ê(0.8)Bob,Godzilla for "Godzilla" from KNN(Bob) = {Alice, Dannis}. However, the prediction error value of Alice for "Godzilla" is 0.9, which is greater than the threshold value 0.8. Accordingly, Ê(0.8)Bob,Godzilla is 0.6, as only the prediction error value of Dannis is selected. Finally, the user-based error-reflected model of Table 2 can be built as shown in Table 3.

Table 3
An example of the user-based error-reflected model.

         Seven    JFK   Titanic   Shrek   AI      Godzilla
Alice    0        0     0.15      0       −0.02   0
Bob      0        0     0         0       0       0.6
John     0        0     0         −0.4    −0.02   0
Dannis   −0.05    0     0         0.3     0       0

3.3.2. The item-based error-reflected model

The item-based error-reflected model is built using the pre-prediction errors of the target user u for the items that are similar to the target item. This method is similar to the method of building the user-based error-reflected model; the difference is merely that the similar item neighborhood is used instead of the user neighborhood. The item-based error-reflected model can also be represented as an m × n user–item matrix Ě(θ). The matrix rows represent users, the columns represent items, and the j-th column of the u-th row holds the average of the pre-prediction errors of user u for items similar to item j, as defined in Eq. (8):

$$\check{E}^{(\theta)}_{u,j} = \frac{\sum_{i \in MSI^{\theta}_{u}(j)} E_{u,i}}{card\left(MSI^{\theta}_{u}(j)\right)} \quad (8)$$

where MSIθu(j) denotes the set of items whose absolute value of the pre-prediction error for user u is less than θ among the items similar to the target item j, and card(MSIθu(j)) is the number of elements in the set.

Let us calculate the average of the prediction errors of Alice for items similar to "Titanic" in Table 2. The prediction errors of Alice for "Seven" and "JFK," which are similar to "Titanic," are EAlice,Seven = −0.3 and EAlice,JFK = 0.02, respectively. Therefore, Ě(0.8)Alice,Titanic can be calculated as follows:

$$\check{E}^{(0.8)}_{Alice,Titanic} = (-0.3 + 0.02)/2 = -0.14.$$

In the case of MSI(AI) = {Titanic, Shrek, Godzilla}, Ě(0.8)Alice,AI of Alice for "AI" is calculated as −0.4. Since the prediction error of Alice for "Godzilla" is 0.9, EAlice,Godzilla = 0.9, it is not reflected in the calculation. In the case of John for "AI," Ě(0.8)John,AI is calculated as Ě(0.8)John,AI = −0.35 from EJohn,Titanic = −0.4 and EJohn,Godzilla = −0.3. Finally, the item-based error-reflected model of Table 2 can be built as Table 4.

Table 4
An example of the item-based error-reflected model.

         Seven   JFK   Titanic   Shrek    AI      Godzilla
Alice    0       0     −0.14     0        −0.4    0
Bob      0       0     0         0        0       0.4
John     0       0     0         −0.12    −0.35   0
Dannis   0.6     0     0         0.5      0       0

3.3.3. The hybrid error-reflected model

The hybrid error-reflected model, which is represented as an m × n user–item matrix Ĥ(θ), is built by unifying the user-based model and the item-based model. The entry in Ĥ(θ) is filled with the value that is closer to 0, but is not 0, among the values of the j-th column of the u-th row in Ê(θ) and Ě(θ). Formally, Ĥu,j is defined as in Eq. (9):

$$\hat{H}_{u,j} = \begin{cases} \check{E}_{u,j} & \text{if } \left(|\hat{E}_{u,j}| \geq |\check{E}_{u,j}| \text{ and } \check{E}_{u,j} \neq 0\right) \text{ or } \hat{E}_{u,j} = 0 \\ \hat{E}_{u,j} & \text{if } \left(|\hat{E}_{u,j}| < |\check{E}_{u,j}| \text{ and } \hat{E}_{u,j} \neq 0\right) \text{ or } \check{E}_{u,j} = 0 \\ 0 & \text{otherwise.} \end{cases} \quad (9)$$

From the two examples in Tables 3 and 4, the hybrid model unifying the two models can be built as Table 5.

Table 5
An example of the hybrid error-reflected model.

         Seven    JFK   Titanic   Shrek    AI      Godzilla
Alice    0        0     −0.14     0        −0.02   0
Bob      0        0     0         0        0       0.4
John     0        0     0         −0.12    −0.02   0
Dannis   −0.05    0     0         0.3      0       0
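The three model-building rules of Eqs. (7)–(9) can be sketched as follows, assuming the error matrix and the neighborhoods are given as dictionaries (our own layout, not the paper's data structures). With the Table 2 errors, KNN(Alice) = {John, Bob} and MSI(Titanic) = {Seven, JFK}, it yields Ê = 0.15, Ě = −0.14 and Ĥ = −0.14 for Alice and "Titanic," as in the examples above.

```python
def build_models(E, knn, msi, theta):
    """Build the user-based (Eq. 7), item-based (Eq. 8) and hybrid
    (Eq. 9) error-reflected models from pre-prediction errors.

    E:   dict user -> {item: pre-prediction error}
    knn: dict user -> list of neighbor users
    msi: dict item -> list of similar items
    """
    def avg(errs):
        kept = [e for e in errs if abs(e) < theta]    # Eq. (6) filter
        return sum(kept) / len(kept) if kept else 0.0

    items = {i for errs in E.values() for i in errs}
    E_user, E_item, H = {}, {}, {}
    for u in E:
        E_user[u], E_item[u], H[u] = {}, {}, {}
        for j in items:
            # Eq. (7): average error of u's neighbors on item j
            a = avg([E[v][j] for v in knn.get(u, []) if j in E[v]])
            # Eq. (8): average error of u on items similar to j
            b = avg([E[u][i] for i in msi.get(j, []) if i in E[u]])
            E_user[u][j], E_item[u][j] = a, b
            # Eq. (9): keep the non-zero value closer to zero
            if b != 0 and (abs(a) >= abs(b) or a == 0):
                H[u][j] = b
            elif a != 0:
                H[u][j] = a
            else:
                H[u][j] = 0.0
    return E_user, E_item, H
```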

3.3.4. Error-reflected models for cold start problems

As clearly discussed in [27], for cold start users and cold start items, recommender systems are generally unable to provide high quality recommendations. With respect to cold start users, they should be encouraged to continuously provide their opinions, because they do not have enough rating information. However, inaccurate predictions arising from the insufficiency of the users' historical information lead them to doubt the credibility of the system, and thus cause their defection from the system. Likewise, cold start items can hardly be recommended compared to items which have sufficient users' ratings. Considering these points, a differentiated strategy is necessary to generate the prediction for both cold start users and cold start items.

Since a cold start item has few ratings given by users, we take into consideration all users who rated the cold start item when building the user-based model for such an item. Analogously, with respect to a cold start user, because he/she has rated insufficient items, most of the items similar to a target item may not have been rated by him/her. Rather than similar items, therefore, we consider all items rated by the cold start user when building the item-based model for that user. Formally, the error-reflected models of cold start users and items are built by revising Eqs. (7) and (8); that is, the user-based model and the item-based model use Eqs. (10) and (11), respectively:

$$\hat{E}^{(\theta)}_{u,j} = \frac{\sum_{v \in U_j} E_{v,j}}{card\left(U_j\right)} \quad \text{if } j \in CSI \quad (10)$$

$$\check{E}^{(\theta)}_{u,j} = \frac{\sum_{i \in I_u} E_{u,i}}{card\left(I_u\right)} \quad \text{if } u \in CSU \quad (11)$$

where CSU and CSI are the sets of cold start users and cold start items, respectively. In addition, Uj is the set of users who have rated item j and Iu is the set of items that have been rated by user u; thus, card(Uj) and card(Iu) are the numbers of elements of the sets Uj and Iu, respectively.

4. Applying collaborative models to recommender systems

Fig. 5 illustrates our method with two phases: an offline phase and an online phase. The offline phase is the model building phase explained in Section 3, and the online phase is a prediction phase using the error-reflected models.

Fig. 5. An overview of the proposed approach for item recommendations.

4.1. Generating a prediction

The final step in collaborative filtering is the process of generating the prediction by attempting to guess the rating that a user would provide for an item. In collaborative recommender systems, it is crucial to accurately predict how much a certain user prefers a certain item based on historical information, as this directly influences the decision making of the users for purchasing, selecting or watching. In addition, the processing time required to generate a prediction is also an important issue. As noted previously, the proposed CF approach constructs the error-reflected models, which can be accomplished offline, prior to online prediction or recommendation. Since most tasks can be conducted in the offline phase, the system can achieve fast online performance. The salient concept behind our prediction scheme is that prediction errors derived from similar users and similar items can help in predicting, for a user similar to those users, an item similar to those items. The online prediction applied for each constructed model can be divided into three methods.

The first approach is a method of applying the user-based error-reflected model to the prediction. The basic concept of this method is to reflect whether the pre-predictions of similar neighbors for the target item are overestimated or underestimated. Formally, the value of target user u for item j, Řu,j, is computed by:

$$\check{R}_{u,j} = \bar{R}^{j}_{knn(u)} + \hat{E}_{u,j} \quad (12)$$

where Êu,j is the value of the u-th row of the j-th column in the user-based error-reflected model Ê, and R̄^j_knn(u) refers to the average rating of the neighborhood of user u for item j. If the average rating of the neighborhood for item j is unavailable, the average rating value of item j rated by all users is used instead. For example, the rating value of Alice for "Titanic," ŘAlice,Titanic, in Table 1 is predicted as 4.65 from the average rating of similar users, R̄^Titanic_knn(Alice) = 4.5, and the value in the user-based error-reflected model, Ê(0.8)Alice,Titanic = 0.15, as described in Table 3.

For the second approach, we apply the item-based error-reflected model to the online prediction. This approach reflects whether the pre-predictions of the target user for items similar to the target item are overestimated or underestimated. Formally, the measurement of how much the target user u prefers item j is given by:

$$\check{R}_{u,j} = \bar{R}^{j}_{knn(u)} + \check{E}_{u,j} \quad (13)$$

where Ěu,j is the value of the u-th row for the j-th column in the item-based error-reflected model, Ě. Applied this way, the predicted value of Alice for "Titanic," ŘAlice,Titanic, is calculated as 4.36 from the average rating of similar users, R̄^Titanic_knn(Alice) = 4.5, and the value in the item-based error-reflected model, Ě(0.8)Alice,Titanic = −0.14, as described in Table 4.

Finally, the measurement using the hybrid error-reflected model is defined as:

$$\check{R}_{u,j} = \bar{R}^{j}_{knn(u)} + \hat{H}_{u,j} \quad (14)$$

where Ĥu,j is the value of the u-th row for the j-th column in the hybrid model, Ĥ. Since |Ê(0.8)Alice,Titanic| > |Ě(0.8)Alice,Titanic| in the previous example, in this case ŘAlice,Titanic can be predicted as 4.36.
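The online step of Eqs. (12)–(14) then reduces to one addition per prediction. The sketch below assumes the neighborhood average rating is computed (or cached) elsewhere; it is an illustration, not the authors' code.

```python
def predict(model, user, item, knn_avg):
    """Eqs. (12)-(14): the neighborhood's average rating for the item,
    shifted by the stored average pre-prediction error of the chosen
    model (E_user for UErrorCF, E_item for IErrorCF, H for HErrorCF).

    knn_avg: average rating of `user`'s neighbors for `item`, falling
             back to the item's all-user average when unavailable.
    """
    return knn_avg + model[user].get(item, 0.0)

# Alice / "Titanic" with a neighborhood average of 4.5:
#   user-based model entry  0.15 -> 4.65
#   item-based model entry -0.14 -> 4.36, as in the examples above
```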

4.2. Model incremental updates

Model-based CF is generally faster in recommendation time compared to memory-based CF due to the advantage of prior use of the pre-computed model [13]. However, this approach tends to require expensive learning time for building a model. Moreover, once the model is built, it is difficult to immediately reflect users' feedback, despite its significance in the recommender system [7]. In other words, new information about user preferences for items is difficult to reflect until the model is re-built. Generally, building, renewing or rebuilding the model is not frequently considered due to the time consuming process. Accordingly, an efficient method of rebuilding the model is required.

In order to alleviate the weak points of model-based CF, the proposed approach is designed so that the model is updated effectively, as illustrated in Fig. 6. In addition, users' new opinions are reflected incrementally whenever users present explicit feedback. For example, assume that the system predicted the rating of Alice for the movie "Titanic" as 4.36 and recommended it to Alice. If Alice provided 4.0 as explicit feedback of her actual rating after watching the movie, then the new prediction error can be calculated from the actual rating and the predicted value as follows:

$$R_{Alice,Titanic} - \check{R}_{Alice,Titanic} = 4.0 - 4.36 = -0.36.$$

The basic concept of measuring the new prediction error is the same as that of measuring the pre-prediction error. In the case where users present explicit feedback about the prediction, the models can easily update the error value, which is computed by subtracting the predicted value from the feedback rating. Therefore, the proposed method can use the updated information in the process of any further new predictions, as well as enhance the quality of recommendations regarding user preferences. We believe our incremental update of the models is much more attractive, and particularly efficient for cold start users, before the rebuilding process of the models.

Fig. 6. Updating the error models incrementally by using user feedback.

5. Experimental evaluations

In this section, we empirically evaluate the proposed prediction methods using the error-reflected models and compare their performances against those of the benchmark algorithms. To this end, we implemented a user-based CF algorithm wherein the similarity is computed by the well-known Pearson correlation coefficient (denoted as UserCF) [5], and an item-based CF approach which employs cosine-based similarity (denoted as ItemCF) [25]. The performances of CF applying the user-based model (denoted as UErrorCF), the item-based model (denoted as IErrorCF), and the hybrid model (denoted as HErrorCF) were evaluated in comparison with the benchmark algorithms.

5.1. The dataset and evaluation metric

Experimental data comes from MovieLens, a web-based research recommender system (www.movielens.org). The dataset used in this paper is the 100k ratings dataset in MovieLens, containing 100,000 ratings of 1682 movies by 943 users in the system (943 rows and 1682 columns of a user–item matrix R). This dataset is publicly available.¹

We used two different training datasets: a full training dataset and a cold start training dataset. First, for the full dataset that includes all available ratings of users, the entire data was divided into two groups; 80% of the data (80,000 ratings) was used as a training set and 20% of the data (20,000 ratings) was used as a test set. A five-fold cross validation scheme was used. This dataset is used to examine the quality of the prediction regardless of whether users or items have sufficient ratings. Second, for the cold start dataset, we artificially generated two groups that satisfy cold start conditions, because the original dataset contains a minimum of 10 ratings per user. The first group contains 100 users who have three ratings per user (the number of items rated by each user) and the second group contains 100 users who have five ratings per user.

In order to measure the accuracy of the predictions, we adopted the mean absolute error (MAE), which has been widely used for statistical accuracy measurement of diverse algorithms [12]. The mean absolute error of user u for N items in the test data is defined as:

$$MAUE(u) = \frac{\sum_{j=1}^{N} \left|R_{u,j} - \check{R}_{u,j}\right|}{N} \quad (15)$$

where ⟨Ru,j, Řu,j⟩ are the actual/predicted rating pairs of user u in the test data. Finally, the MAE of all M users in the test set is computed as:

$$MAE = \frac{\sum_{u=1}^{M} MAUE(u)}{M}. \quad (16)$$

¹ The dataset can be downloaded from http://www.grouplens.org/node/73.

5.2. Parameter tuning experiments

In this section, we present detailed experimental results according to three parameters: the size of the user neighborhood k, the size of the item neighborhood k′ and the error threshold θ.

5.2.1. Accuracy of the pre-prediction according to neighborhood size

The pre-prediction is influenced by the size of the user neighborhood, KNN, and the size of the item neighborhood, MSI. Accordingly, for building accurate prediction error models, we should first determine proper sizes of the user neighborhood k and the item neighborhood k′, respectively. Hence, in this section, we examined the accuracy of the pre-prediction in order to choose optimal values for the number of nearest neighbors and most similar items.

First, we measured the MAE of the pre-prediction according to the variation of the item neighborhood k′. Following previous studies, we set the size of the user neighborhood k to 50 (k = 50). The experimental result is depicted in Fig. 7 (left graph). It can be observed from the graph that the size of the item neighborhood affects the prediction quality. The quality of the pre-prediction improved as the k′ value was increased from 10 to 60, and after this value, the curve tends to become flat. We also observed that the MAE value increases after an item neighborhood size of 80. These results indicate that when the item neighborhood size is too small, the accuracy of the pre-prediction is remarkably decreased. In addition, too large a size can also negatively impact the accuracy.

In the subsequent experiment, we continued to examine the accuracy by changing the number of user neighbors k. During this experiment, k′ was set to 60 according to the previous result. As shown in Fig. 7 (right graph), the number of neighbor users also affected the pre-prediction. However, unlike the size of the item neighborhood, the curve of the graph tends to be flat at relatively small sizes of the user neighborhood. For example, MAE considerably decreases as the size of the user neighborhood increases from 10 to 20; beyond this point, any further increase of the neighborhood size did not affect the accuracy, even though slight variations of the MAE values appeared. When the neighborhood size was 10, we found many cases in which the nearest users of a target user had not rated a target item while the pre-prediction of the target user on the target item was generated. This fact might explain why the accuracy of the pre-prediction becomes worse when the size of the user neighborhood is small.

Fig. 7. MAE according to variation of user neighbor size and item neighbor size used in generating a pre-prediction.

5.2.2. Experiments with the error threshold

In this section, we investigate the effect of the error threshold on the performance of the prediction. As described in Section 3.3, we expected that the threshold θ could be a significant factor affecting the quality of the prediction in our study, because different error-reflected models (i.e., Ê(θ) and Ě(θ)) are built depending on the threshold. Hence, we measured the MAE of the prediction according to the variation of the θ value from 0.2 to 2.0. Based on the previous experiment, the sizes of the user neighborhood and the item neighborhood were set to 50 and 60, respectively (k = 50, k′ = 60).

Fig. 8 illustrates the variation of MAE for UErrorCF, IErrorCF and HErrorCF. It can be observed from the graph that the three methods demonstrate similar types of curves. In the case of UErrorCF, the downward curve appeared until the θ value became 1.2; in the case of IErrorCF, until the θ value became 1.6; and in the case of HErrorCF, until the θ value became 1.4, respectively. After those values, upward curves gradually appeared in the graph. That is, a low threshold value discarded more pre-prediction errors, and thus the three methods obtained poor prediction quality because the remaining pre-prediction errors were not sufficient to build the error-reflected models. Contrarily, a high threshold value included unnecessary noise information that could negatively influence accuracy. When the threshold is 1.4, the models can be built by eliminating approximately 10,000 pieces of superfluous information; consequently, UErrorCF, IErrorCF and HErrorCF can provide enhanced prediction quality.

Fig. 8. MAE according to variation of the error threshold.

Examining the best prediction quality of the three methods, UErrorCF, IErrorCF and HErrorCF obtain an MAE of 0.7584 (θ = 1.2), 0.7556 (θ = 1.6), and 0.7543 (θ = 1.4), respectively. Based on this experimental result, in the subsequent experiments we selected 1.2, 1.6, and 1.4 as the error thresholds of UErrorCF, IErrorCF and HErrorCF, respectively. That is, for UErrorCF, model Ê(1.2) was used, whereas model Ě(1.6) was used for IErrorCF. In the case of HErrorCF, we used the model unifying Ê(1.4) and Ě(1.4).
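Eqs. (15)–(16) and the threshold sweep just described can be sketched as follows; evaluate_at is a hypothetical helper (not from the paper) that rebuilds the models at a given θ and returns each test user's (actual, predicted) pairs.

```python
def maue(pairs):
    """Eq. (15): mean absolute error of a single user's test ratings."""
    return sum(abs(r - p) for r, p in pairs) / len(pairs)

def mae(per_user_pairs):
    """Eq. (16): MAUE averaged over all M test users.
    per_user_pairs: dict user -> list of (actual, predicted) pairs."""
    return sum(maue(p) for p in per_user_pairs.values()) / len(per_user_pairs)

# threshold sweep over 0.2, 0.4, ..., 2.0 (Section 5.2.2):
# best_theta = min((mae(evaluate_at(t)), t)
#                  for t in [round(0.2 * s, 1) for s in range(1, 11)])[1]
```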

5.3. Comparison with other methods

In this section, we present detailed experimental results in comparison with the benchmark methods. The performance comparison is divided into three dimensions. The accuracy of the prediction is evaluated first; then, the accuracy of the prediction for the cold start problems is evaluated. Finally, we compare computational complexity with related studies.

5.3.1. Comparison of the prediction accuracy

As noted in a number of previous studies, the number of neighbors has a significant impact on the prediction accuracy of neighborhood-based algorithms [25,26]. Therefore, different numbers of user or item neighbors, from 10 to 100, were used for the prediction generation.

Fig. 9(a) illustrates the MAE of UserCF and UErrorCF with respect to the variation in the user neighborhood size. In UErrorCF, the user neighborhood size denotes the number of nearest neighbors k that is exploited for building Ê(θ) in Eq. (7). In the experimental results, at most neighborhood sizes, the overall prediction accuracy of UserCF appears to be better than that of UErrorCF. However, we found that the prediction accuracy of UErrorCF is superior to that of UserCF when the neighborhood size is small (e.g., k = 10). This result implies that UErrorCF can provide more accurate prediction performance than UserCF when the information data is sparse or the available data for users is relatively insufficient.

We continued by examining the prediction accuracy of ItemCF and IErrorCF. In IErrorCF, the item neighborhood size denotes the number of most similar items k′ that is exploited for building Ě(θ) in Eq. (8). In the rating prediction for IErrorCF, we set the user neighborhood to 50 in order to calculate the average rating of the user neighborhood for a certain item in Eq. (13). Fig. 9(b) shows the MAE obtained by ItemCF and IErrorCF with respect to the variation of the item neighborhood size. The result demonstrates that, at all neighborhood size levels, IErrorCF provides more accurate predictions than ItemCF.

ItemCF improves the prediction accuracy as the neighborhood size increases from 10 to 50; after this value, the accuracy decreases slightly. On the contrary, in the case of IErrorCF, after the size passes a certain level (k′ = 40–50), the accuracy hardly varies. Comparing the MAE obtained by the user-based approaches and the item-based approaches at a neighborhood size of 10, the accuracy of IErrorCF and ItemCF is remarkably worse than that of UserCF and UErrorCF. These results may be explained by the fact that the item-based approaches essentially attempt to capture how the target user has rated similar items. In the case of a too-small item neighborhood, the items similar to a certain item were not rated by the target user, and thus it was more difficult to predict the rating of the item for him/her.

Fig. 9. (a) A comparison of MAE achieved by UserCF and UErrorCF as the user neighborhood size (k) grows; (b) a comparison of MAE achieved by ItemCF and IErrorCF as the item neighborhood size (k′) grows.

Table 6 summarizes the comparison of the best results achieved by the five methods. The comparison of MAE results shows that the methods based on the proposed models (UErrorCF, IErrorCF, and HErrorCF) provide slightly worse accuracy than UserCF; however, the difference appears insignificant in a comparative fashion. Our methods obtain nearly 7% improvement in prediction accuracy compared to ItemCF. To analyze statistical significance, we also conducted two-tailed paired t-tests (per user) on the MAUE results of HErrorCF and those of the benchmark algorithms [8]. As a result, we observed that the p-value obtained from the t-test on HErrorCF and UserCF was 0.5074 (t[942] = 0.66), indicating there was no significant

Table 6 Table 8
A comparison of the best results achieved by the five methods. A comparison of MAE for cold start users.

Method: UserCF ItemCF Error-reflected models Test user Cold start users Original
(k = 60) (k′ = 50)
UErrorCF IErrorCF HErrorCF # of ratings for each user: 3 5 Average
(k = 60) (k = 50, k′ = 60) (k = 50, k′ = 60)
UserCF 1.2360 1.0730 1.1545 0.7942
MAE 0.7534 0.8230 0.7572 0.7556 0.7543 ItemCF 1.2234 1.0365 1.1299 0.832
UErrorCF 0.9907 1.0374 1.0140 0.7915
IErrorCF 0.9846 0.9726 0.9786 0.8052
HErrorCF 1.1312 1.0395 1.0853 0.7897
difference. However, the difference between HErrorCF and ItemCF is
statistically significant (t[942] = −8.34, p b 0.001).
that have five ratings, the obtained p-values were also less than 0.01
(p b 0.01) between IErrorCF and the other two methods. The results
5.3.2. Comparison of cold start users and items
indicate that the differences in MAE are significantly different from
In this section, we investigate the prediction accuracy of the proposed
zero. We conclude from these experiments that the proposed CF
models for cold start problems in comparison with the benchmark
utilizing the error-reflected models can improve the prediction
methods. First, in order to analyze the prediction accuracy of cold start
quality of the cold start items and the cold start users.
items, we analyzed MAE of the previous prediction results according to
the number of users' ratings that items contained in the training set.
5.4. Discussion of computational complexities
Table 7
A comparison of MAE for cold start items.

Test item     Cold start items (< 5)   < 10     < 15     < 20
# of items    333                      530      650      743
UserCF        0.9661                   0.9199   0.8734   0.8412
ItemCF        1.0670                   1.0110   0.9976   0.9912
UErrorCF      0.9405                   0.8878   0.8771   0.8632
IErrorCF      0.9542                   0.9015   0.8908   0.8870
HErrorCF      0.9352                   0.8864   0.8528   0.8465
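To make the bucketing behind Table 7 concrete, the following sketch (our illustration, not the authors' code) groups test predictions by how many training ratings each item has and computes MAE per cumulative bucket; predictions and train_counts are hypothetical input structures.

# Sketch of the Table 7 analysis: MAE per cumulative popularity bucket
# (fewer than 5, 10, 15, and 20 training ratings). The two arguments
# are hypothetical inputs, not the paper's data structures.
from collections import defaultdict

def mae_by_item_popularity(predictions, train_counts,
                           thresholds=(5, 10, 15, 20)):
    """predictions: iterable of (item_id, predicted, actual) test triples.
    train_counts: dict mapping item_id -> # of training ratings."""
    errors = defaultdict(list)
    for item_id, predicted, actual in predictions:
        count = train_counts.get(item_id, 0)
        for t in thresholds:
            if count < t:  # cumulative buckets, as in Table 7
                errors[t].append(abs(predicted - actual))
    return {t: sum(errs) / len(errs) for t, errs in errors.items() if errs}

Because the buckets are cumulative, an item with three training ratings contributes to all four columns, which is consistent with the increasing item counts (333, 530, 650, 743) in Table 7.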
We carried out a further experiment with the cold start dataset to examine prediction accuracy for cold start users. For this experiment, we considered different subsets of users who had few ratings in the training dataset (e.g., users who have three ratings and users who have five ratings). Table 8 summarizes the MAE for the cold start users. As expected, we observed that the quality of prediction for the cold start users is considerably lower than that obtained with the original dataset. Such results were caused by the fact that it was hard to analyze these users' propensity to rate items. Nevertheless, CF based on the error-reflected models significantly outperforms the benchmark methods. Comparing the MAE achieved by UErrorCF, IErrorCF, and HErrorCF, interesting results were observed. Higher prediction accuracy is achieved by the independent models, UErrorCF and IErrorCF, than by the hybrid model HErrorCF, particularly when the users have only three ratings. We identified that, on average, IErrorCF improves prediction performance by 17.5% over UserCF and by 15.1% over ItemCF. In addition, UErrorCF obtains 14.1% and 11.6% improvements in MAE compared to UserCF and ItemCF, respectively. We continued to compute two-tailed paired t-tests (per user). First, for the cold start users who have three ratings, we observed significant differences between IErrorCF and UserCF (t[99] = −3.74, p < 0.01) and between IErrorCF and ItemCF (t[99] = −3.56, p < 0.01). Second, for the cold start users that have five ratings, the obtained p-values were also less than 0.01 (p < 0.01) between IErrorCF and the other two methods. The results indicate that the differences in MAE are significantly different from zero. We conclude from these experiments that the proposed CF utilizing the error-reflected models can improve the prediction quality for both the cold start items and the cold start users.
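For illustration, the cold start user subsets of Table 8 could be drawn as below. This is a hypothetical helper under the simplest reading of the protocol (users whose training profiles contain exactly n ratings); the paper's cold start dataset construction may differ in detail.

# Hypothetical helper: select cold start users whose training profiles
# contain exactly n_ratings ratings (e.g., 3 or 5), as in Table 8.
def cold_start_users(train_profiles, n_ratings):
    """train_profiles: dict mapping user_id -> list of (item_id, rating)."""
    return [user for user, profile in train_profiles.items()
            if len(profile) == n_ratings]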
5.4. Discussion of computational complexities

In this section, we discuss the computational complexity of previous studies in comparison with that of our approach. High computational complexity is often demanded to enhance the quality of predictions and recommendations, and the scalability of CF is a critical challenge in practical recommender systems with huge numbers of users and items.

We first analyzed the computational complexities of our methods in terms of the number of users m, the number of items n, the number of rating values v, the number of similar users k, and the number of similar items k′. From the model-based point of view, the computational complexity can be divided into an offline phase and an online phase. The former can be accomplished offline, prior to actual recommendations for a given user, whereas the latter has to be done online and often in real time [13]. The offline computation is dominated by the time required to build the error-reflected model based on pre-prediction errors. To calculate a pre-predicted value, the set of user neighbors and the set of item neighbors must be determined; the upper bounds on the complexity of this step are O(m²n) and O(mn²), respectively. Additionally, O(kmn) and O(k′mn) time is spent building the user-based model and the item-based model, respectively. Therefore, the total offline complexity becomes approximately O(m²n + mn² + kmn) ≅ O(m²n + mn²) for the user-based model, O(m²n + mn² + k′mn) ≅ O(m²n + mn²) for the item-based model, and O(m²n + mn² + kmn + k′mn) ≅ O(m²n + mn²) for the hybrid model. However, in practice, since the user–item rating matrix is very sparse, the actual complexity of computing the pre-prediction errors can be reduced to approximately O(mv + nv). In the online phase, the complexity required to predict a certain item j for a target user u is O(k), because we need to compute the average rating of the k users most similar to user u. Accordingly, the complexity of predicting all items becomes approximately O(kn) ≅ O(n). In fact, when we measured the server response time in an Apache Web server environment, the average time required to predict a single item was 0.0042 s, and generating predictions of all items for a user took, on average, 2.79 s.
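The O(k) online step can be pictured with the sketch below. It is schematic only: it assumes the k nearest neighbors and the error-reflected ratings have already been materialized in the offline phase (the names neighbors and reflected_ratings are ours, not the paper's), and it simply averages the neighbors' available values for the target item, in line with the description above.

# Schematic O(k) online prediction: average the error-reflected ratings
# of user u's k pre-computed nearest neighbors for item j. Both lookup
# tables are assumed to have been built in the offline phase.
def predict(u, j, neighbors, reflected_ratings, default=None):
    """neighbors: dict user -> list of the k most similar users.
    reflected_ratings: dict (user, item) -> error-reflected rating."""
    values = [reflected_ratings[(v, j)]
              for v in neighbors[u]
              if (v, j) in reflected_ratings]
    return sum(values) / len(values) if values else default

Since the loop touches at most k neighbors, predicting one item costs O(k), and scoring all n items for a user costs O(kn), matching the analysis above.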

As for the complexity of previous studies, a memory-based CF such as UserCF provides the advantage of easily taking new data into account, because it utilizes the entire data set in real time when generating recommendations in the online phase. However, as both the number of users m and the number of items n grow, the computation cost grows rapidly as well. In the case of UserCF [5], O(mn) is required to determine the k nearest neighbors of a target user, and O(kn) is additionally required to predict all items; the online complexity therefore becomes O(mn + kn). For most recommender systems, however, the online complexity, i.e., the cost of predicting ratings for the items that a user has not yet rated, is more important than the offline cost [13]. Similar to our models, diverse models that reduce the online complexity, such as the item–item similarity model [9,25], the Aspect model [13], the User Rating Profile (URP) model [19], the Unified Relevance (UR) model [30], and the Weighted Low Rank Approximations (WLRA) model [29], have been proposed. The aim of such models is to support fast online recommendations by first developing a pre-computed model in which the most time-consuming tasks are conducted offline. If a user–user similarity model for UserCF is built offline in advance [17], the online cost is reduced to O(kn), whereas the time required to build the model becomes O(m²n). Similarly, for an item–item similarity model for ItemCF, O(mn²) time is needed offline, and the online prediction complexity is O(kn). Therefore, when the number of users is considerably larger than the number of items (m ≫ n), or when the set of users changes more dynamically than the set of items, the pre-computed item–item similarity is practically more efficient than the pre-computed user–user similarity [25].
In probabilistic approaches to building models, such as WLRA, URP, UR, and Aspect, the Expectation Maximization (EM) algorithm is generally used to estimate the models for CF. Hence, the offline complexity is divided into an E-step and an M-step, and the number of iterations required to estimate stable parameters also affects the complexity. For each iteration, WLRA, which uses Singular Value Decomposition (SVD), needs O(mn² + m³) to build the model. In the cases of URP and Aspect, the complexity of building the model is O(kmnv); in the worst case this becomes O(km²n²), because the total number of ratings v becomes mn at worst (v = mn). With respect to the UR model, the complexity is O(m²n + mn² + k²mn), although an iterative process is not required to build the model.
Table 9 summarizes the comparison of computational complexities in terms of the offline and online phases. Although the complexity of the item–item model and the user–user model is much lower than that of our models, in the experiments we observed that UserCF and ItemCF performed worse for cold start users and items. That is, our approach provides advantages both in improving the prediction quality and in delivering fast recommendation times. In comparison with the probabilistic models (WLRA, URP, UR, and Aspect), our approach does not involve the iterative building processes required by the probabilistic approaches. In addition, we support incremental updates of the models, as presented in Section 4.2.

Table 9
Comparison of computational complexities.

Collaborative filtering algorithms         Offline              Online
                                           (model building)     (prediction)
Memory-based      UserCF                   –                    O(mn + kn)
Model-based       User–user model          O(m²n)               O(kn)
                  Item–item model          O(mn²)               O(kn)
                  WLRA model               O(mn² + m³)          O(kn)
                  URP model                O(kmnv)              O(knv)
                  UR model                 O(mn² + m²n + mnk²)  O(k²n)
                  Aspect model             O(kmnv)              O(knv)
Proposed models   UErrorCF                 O(mn² + m²n + kmn)   O(kn)
(model-based)     IErrorCF                 O(mn² + m²n + kmn)   O(kn)
                  HErrorCF                 O(mn² + m²n + 2kmn)  O(kn)

m: # of total users, n: # of total items, v: # of ratings, k: model size.
6. Conclusions and future work

In this paper, we have proposed a unique method of building models derived from explicit ratings. The proposed method first determines a pre-predicted rating and subsequently identifies prediction errors for each user. Pre-computed models, namely the error-reflected models, are built by reflecting these prediction errors. A major advantage of the proposed models is that they support incremental updating using explicit user feedback. We also presented a new method of applying the proposed models to CF recommender systems that enhances the accuracy of prediction with respect to the cold start problem. As noted in the experimental results, our models obtained significantly better prediction accuracy in dealing with both cold start users and cold start items, compared to the benchmark methods.

In future work, we plan to exploit social networks to build our model and generate item predictions, which is an emerging research area in recommender systems. We expect that a model incorporating reliable social friends may offer more trustworthy items relevant to users' needs. Another interesting direction is the problem of manipulated ratings by unreliable users, often called shilling attacks [23]. We intend to detect unreliable user ratings by analyzing diverse types of attack models, and we will investigate the possible uses of pre-prediction errors for recommender systems that are robust against shilling attacks. Finally, we plan to examine stability, i.e., how consistently the proposed models provide predictions over a period of time even when new ratings are added to the system before the models are rebuilt [3].

Acknowledgments

The authors would like to acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) and of Universidad Carlos III de Madrid and Banco Santander through a Cátedra de Excelencia.

References

[1] H.J. Ahn, A new similarity measure for collaborative filtering to alleviate the new user cold-starting problem, Information Sciences 178 (1) (2008) 37–51.
[2] G. Adomavicius, A. Tuzhilin, Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions, IEEE Transactions on Knowledge and Data Engineering 17 (6) (2005) 734–749.
[3] G. Adomavicius, J. Zhang, On the stability of recommendation algorithms, Proceedings of the 4th ACM Conference on Recommender Systems, 2010, pp. 47–54.
[4] G. Bogdanova, T. Georgieva, Using error-correcting dependencies for collaborative filtering, Data & Knowledge Engineering 66 (3) (2008) 402–413.
[5] J.S. Breese, D. Heckerman, C. Kadie, Empirical analysis of predictive algorithms for collaborative filtering, Proceedings of the 14th Annual Conference on Uncertainty in Artificial Intelligence, 1998, pp. 43–52.
[6] K.-W. Cheung, J.T. Kwok, M.H. Law, K.-C. Tsui, Mining customer product ratings for personalized marketing, Decision Support Systems 35 (2003) 231–243.
[7] A. Das, M. Datar, A. Garg, S. Rajaram, Google news personalization: scalable online collaborative filtering, Proceedings of the 16th International World Wide Web Conference, 2007, pp. 271–280.
[8] J. Demsar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006) 1–30.
[9] M. Deshpande, G. Karypis, Item-based top-n recommendation algorithms, ACM Transactions on Information Systems 22 (1) (2004) 143–177.
[10] S. Ding, S. Zhao, Q. Yuan, X. Zhang, R. Fu, L. Bergman, Boosting collaborative filtering based on statistical prediction errors, Proceedings of the 2nd ACM Conference on Recommender Systems, 2008, pp. 3–10.
[11] J.L. Herlocker, J.A. Konstan, A. Borchers, J. Riedl, An algorithmic framework for performing collaborative filtering, Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 230–237.
[12] J.L. Herlocker, J.A. Konstan, L.G. Terveen, J.T. Riedl, Evaluating collaborative filtering recommender systems, ACM Transactions on Information Systems 22 (1) (2004) 5–53.
[13] T. Hofmann, Latent semantic models for collaborative filtering, ACM Transactions on Information Systems 22 (1) (2004) 89–115.
[14] Y. Jiang, J. Shang, Y. Liu, Maximizing customer satisfaction through an online recommendation system: a novel associative classification model, Decision Support Systems 48 (2010) 470–479.
[15] C.Y. Kim, J.K. Lee, Y.H. Cho, D.H. Kim, VISCORS: a visual-content recommender for the mobile Web, IEEE Intelligent Systems 19 (6) (2004) 32–39.
[16] H.-N. Kim, A.-T. Ji, H.-J. Kim, G.-S. Jo, Error-based collaborative filtering algorithm for top-n recommendation, Proceedings of the Joint 9th Asia-Pacific Web and 8th International Conference on Web-Age Information Management Conference on Advances in Data and Web Management, 2007, pp. 594–605.
[17] J.A. Konstan, B.N. Miller, D. Maltz, J.L. Herlocker, L.R. Gordon, J. Riedl, GroupLens: applying collaborative filtering to Usenet news, Communications of the ACM 40 (1997) 77–87.
[18] G. Linden, B. Smith, J. York, Amazon.com recommendations: item-to-item collaborative filtering, IEEE Internet Computing 7 (1) (2003) 76–80.
[19] B. Marlin, Modeling user rating profiles for collaborative filtering, Proceedings of the 7th Annual Conference on Neural Information Processing Systems, 2003.
[20] P. Melville, R.J. Mooney, R. Nagarajan, Content-boosted collaborative filtering for improved recommendations, Proceedings of the 18th National Conference on Artificial Intelligence, 2002.
[21] D.M. Nichols, Implicit rating and filtering, Proceedings of the 5th DELOS Workshop on Filtering and Collaborative Filtering, 1997, pp. 31–36.
[22] J. O'Donovan, B. Smyth, Mining trust values from recommendation errors, International Journal on Artificial Intelligence Tools 15 (6) (2006) 945–962.
[23] M. O'Mahony, N. Hurley, N. Kushmerick, G. Silvestre, Collaborative recommendation: a robustness analysis, ACM Transactions on Internet Technology 4 (4) (2004) 344–377.
[24] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, J. Riedl, GroupLens: an open architecture for collaborative filtering of Netnews, Proceedings of the ACM 1994 Conference on Computer Supported Cooperative Work, 1994, pp. 175–186.
[25] B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Item-based collaborative filtering recommendation algorithms, Proceedings of the 10th International World Wide Web Conference, 2001, pp. 285–295.
[26] B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Analysis of recommendation algorithms for e-commerce, Proceedings of the ACM Conference on Electronic Commerce, 2000, pp. 158–167.
[27] A.I. Schein, A. Popescul, L.H. Ungar, Methods and metrics for cold-start recommendations, Proceedings of the 25th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2002, pp. 253–260.
[28] U. Shardanand, P. Maes, Social information filtering: algorithms for automating word of mouth, Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1995, pp. 210–217.
[29] N. Srebro, T. Jaakkola, Weighted low-rank approximations, Proceedings of the 20th International Conference on Machine Learning, 2003, pp. 720–727.
[30] J. Wang, A.P. de Vries, M.J.T. Reinders, Unified relevance models for rating prediction in collaborative filtering, ACM Transactions on Information Systems 26 (3) (2008) 1–42.
[31] Y. Zhen, W.-J. Li, D.-Y. Yeung, TagiCoFi: tag informed collaborative filtering, Proceedings of the 3rd ACM Conference on Recommender Systems, 2009, pp. 69–76.

Heung-Nam Kim is a postdoctoral fellow in the Multimedia Communications Research Laboratory (MCRLab) at the University of Ottawa, Canada. His research interests include collaborative filtering, recommender systems, the semantic Web, data mining, and social networking applications. He received a PhD in Computer and Information Engineering from Inha University, Korea.

Abdulmotaleb El Saddik is University Research Chair and Professor, SITE, University of Ottawa, and a recipient of the Professional of the Year Award (2008), the Friedrich Wilhelm Bessel Research Award from Germany's Alexander von Humboldt Foundation (2007), the Premier's Research Excellence Award (PREA, 2004), and the National Capital Institute of Telecommunications (NCIT) New Professorship Incentive Award (2004). He is the director of the Multimedia Communications Research Laboratory (MCRLab) and was Director of the Information Technology Cluster, Ontario Research Network on Electronic Commerce (2005–2008). He is Associate Editor of the ACM Transactions on Multimedia Computing, Communications and Applications (ACM TOMCCAP), IEEE Transactions on Multimedia (TMM), and IEEE Transactions on Computational Intelligence and AI in Games (IEEE TCIAIG), and Guest Editor for several IEEE Transactions and Journals. Dr. El Saddik has served on the technical program committees of numerous IEEE and ACM events, and has been the General Chair and/or Technical Program Chair of more than 25 international conferences, symposia, and workshops on collaborative hapto-audio-visual environments, multimedia communications, and instrumentation and measurement. He is a leading researcher in haptics, service-oriented architectures, collaborative environments, and ambient interactive media and communications. He has authored and co-authored two books and more than 280 publications, has received research grants and contracts totaling more than $12 million, and has supervised more than 90 researchers. His research has been selected for the Best Paper Award three times. Dr. El Saddik is a Distinguished Member of ACM, an IEEE Distinguished Lecturer, a Fellow of the Canadian Academy of Engineering, a Fellow of the Engineering Institute of Canada, and a Fellow of IEEE.

Geun-Sik Jo is a Professor of Computer and Information Engineering at Inha University, Korea, and the chairman of the School of Computer and Information Engineering at Inha University. He received a B.S. degree in Computer Science from Inha University in 1982, and M.S. and Ph.D. degrees in Computer Science from the City University of New York in 1985 and 1991, respectively. His research interests include knowledge-based scheduling, ontology, the semantic web, intelligent E-Commerce, constraint-directed scheduling, knowledge-based systems, decision support systems, and intelligent agents. He has authored and coauthored five books and more than 200 publications.
