Incorporating User Control into Recommender Systems Based on Naive Bayesian Classification


Verus Pronk
Philips Research Europe
High Tech Campus 34
5656 AE Eindhoven, The Netherlands
verus.pronk@philips.com

Wim Verhaegh
Philips Research Europe
High Tech Campus 11
5656 AE Eindhoven, The Netherlands
wim.verhaegh@philips.com

Adolf Proidl
Philips Research Europe
Gutheil-Schoder-Gasse 10
A-1100 Vienna, Austria
adolf.proidl@philips.com

Marco Tiemann
Philips Research Europe
High Tech Campus 34
5656 AE Eindhoven, The Netherlands
marco.tiemann@philips.com

ABSTRACT

Recommender systems are increasingly being employed to personalize services, such as on the web, but also in electronics devices, such as personal video recorders. These recommenders learn a user profile, based on rating feedback from the user on, e.g., books, songs, or TV programs, and use machine learning techniques to infer the ratings of new items.

The techniques commonly used are collaborative filtering and naive Bayesian classification, and they are known to have several problems, in particular the cold-start problem and slow adaptivity to changing user preferences. These problems can be mitigated by allowing the user to set up or manipulate his profile.

In this paper, we propose an extension to the naive Bayesian classifier that enhances user control. We do this by maintaining and flexibly integrating two profiles for a user, one learned from rating feedback and one created by the user. In particular, we show how the cold-start problem is mitigated.

Categories and Subject Descriptors

G.3 [Probability and Statistics]: Probabilistic algorithms (including Monte Carlo); H.3.3 [Information Search and Retrieval]: Information filtering

General Terms

Algorithms

Keywords

classification, machine learning, naive Bayes, recommender, user profile, user control, multi-valued features

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
RecSys'07, October 19–20, 2007, Minneapolis, Minnesota, USA.
Copyright 2007 ACM 978-1-59593-730-8/07/0010 ...$5.00.

1. INTRODUCTION

The use of recommender technology is steadily being introduced into the market. Web sites such as Yahoo! Movies and MovieLens offer a recommender to aid users in finding the movies they like, and personal video recorders like the TiVo box use a recommender for automatic recording of the user's favorite movies.

A recommender learns the taste of a user, based on ratings that the user supplies on items, such as movies. These ratings are typically a positive–negative classification, indicating like and dislike, respectively, or a more elaborate classification into a range of like-degrees. As such, this is the interface by which the user teaches the recommender about his taste. This learning process is inherently slow in the sense that the user has to rate a considerable number of items before the recommender can make sensible suggestions. This leads to the well-known cold-start problem and to a slow adaptability of the recommender to changing user preferences. The former problem pertains to the situation that the user has only rated a few items.

Although the cold-start problem can be mitigated by setting up a rating session wherein the user rates a sufficient number of items, the burden this places upon the user is generally considered unacceptable. Furthermore, this is not an elegant way to deal with a sudden change of interest. For example, if the user discovers that he likes a particular actor or director, having to rate many movies before the recommender has learned this clearly indicates that this rating interface is inappropriate for such situations. The user should have more direct control over the recommender to be able to resolve these situations.

Naive Bayesian classification (NBC) lends itself well to this type of user control because the profile it uses is often quite intuitive. A movie is described by a number of feature-value pairs, such as (channel, Hallmark Movie Channel), (actor, Clint Eastwood), or (start time, 20:00), and the profile consists of positive and negative counts for individual feature-value pairs. From these figures, like-degrees can be computed at the level of feature-value pairs rather than at the level of individual movies. The user can similarly create a profile by defining like-degrees for feature-value pairs, so that these profiles can be combined.

We propose to extend NBC by incorporating a user-defined profile and flexibly combining it with the learned profile.
This flexibility is such that, initially, the recommender operates based solely on the user-defined profile and, as the recommender learns, the learned profile gradually replaces the user-defined profile. However, when the user adapts his profile, the adapted part temporarily takes over again.

The remainder of this paper is organized as follows. After briefly describing related work in Section 2, we revisit naive Bayesian classification in Section 3. Then, in Section 4, we explain how to incorporate a user-defined profile into the classifier. In Section 5, we generalize this to multi-valued features, where a feature can attain a number of values simultaneously. We report on performance results, providing a proof of concept for the recommender, in Section 6. We make some concluding remarks in Section 7.

2. RELATED WORK

Several approaches have been proposed in the literature to mitigate the cold-start problem. In [6], the authors propose to use precomputed stereotypes from which a user can choose some to jump-start a recommender. A stereotype is a possibly large set of items, TV programs in their paper, that are similar to each other, where similarity is defined using the modified value-difference metric (see [2], [10]). The precomputation uses rating histories of a number of users and a standard clustering algorithm.

An alternative approach is to combine a user-defined recommender with a learning recommender by combining their outputs. These are called hybrid recommender systems. In [12], the authors propose to use a neural network for fusing the outputs. User control beyond incorporating a user-defined recommender is, however, lacking.

In [9], the author suggests an alternative to MovieLens, called DynamicLens, to aid the user in providing user-defined ratings to a meta-recommender system. The interface is in particular geared towards enhancing user control in hybrid recommender systems.

The aim of this paper is to integrate a user-defined profile and a learned profile into a single recommender, rather than using multiple recommenders, offering direct user control over recommended items.

3. NAIVE BAYESIAN CLASSIFICATION

We next describe NBC in detail, starting with some notation. An instance x is described by f feature values xi ∈ Di, for each i = 1, 2, ..., f, where Di is the domain of feature i. Its class is denoted by c(x) ∈ C, where C is the set of classes. For the moment, we do not consider missing features, but return to this shortly.

Given is a non-empty set X of training instances and, for each instance x ∈ X, its class cx = c(x). Let y be an instance to be classified. The approach in NBC is to express Pr(c(y) = j), for each j ∈ C, in terms of the training data.

Let x be a random variable on the domain U of instances. Using Bayes' rule and assuming conditional independence of feature values for a given class, we can rephrase Pr(c(y) = j) as follows.

  Pr(c(y) = j) = Pr(c(x) = j | x = y)
               = Pr(c(x) = j) Pr(x = y | c(x) = j) / Pr(x = y)
               = Pr(c(x) = j) ∏_{i=1}^{f} Pr(xi = yi | c(x) = j) / Pr(x = y).   (1)

As the denominator can alternatively be written as the sum over all j of the numerator, it serves as a normalization constant. When comparing probabilities, this constant can be omitted.

The factors Pr(c(x) = j) are called prior probabilities, the factors Pr(xi = yi | c(x) = j) conditional probabilities, and the expressions Pr(c(y) = j) posterior probabilities.

The general approach in NBC is that the prior and conditional probabilities are estimated using the training data to obtain estimates of the posterior probabilities. We define the learned profile as follows. For each feature i, each value v ∈ Di, and each class j,

  N(j) = |{x ∈ X | cx = j}|  and
  N(i, v, j) = |{x ∈ X | xi = v ∧ cx = j}|,   (2)

where |S| denotes the cardinality of a set S. By assuming, without loss of generality, that N(j) > 0 for each j, we estimate the probabilities as

  Pr(c(x) = j) ≈ N(j)/|X|  and   (3)
  Pr(xi = yi | c(x) = j) ≈ N(i, yi, j)/N(j).   (4)

By substituting these estimates into Equation 1, we obtain an estimate of the probability that y belongs to class j in terms of the training data.

In case N(i, yi, j) = 0 in Equation 4, a technique called Laplace correction, see [1], is used to prevent the conditional probability estimate and the corresponding posterior probability estimate from becoming 0. For now, we assume that there are no such zero counts.

The naive Bayes classification c̃(y) of y is defined as the value of j that maximizes the estimate. Ties are broken arbitrarily. Formally, c̃(y) is defined as

  c̃(y) = arg max_{j ∈ C} (N(j)/|X|) ∏_{i=1}^{f} N(i, yi, j)/N(j).   (5)

If c̃(y) ≠ c(y), then we speak of a classification error. The classification error rate E, or error rate for short, is defined as

  E = Pr(c̃(x) ≠ c(x)),   (6)

and is a measure of the performance of the classifier. Here, x is again a randomly chosen instance. The classification accuracy is defined as 1 − E. The definition of error rate can be refined by considering class-conditional error rates. Given a class j, we define

  Ej = Pr(c̃(x) ≠ c(x) | c(x) = j)   (7)

as the class-j error rate. The class-conditional classification accuracy is given by 1 − Ej.

This summarizes the classical approach to naive Bayesian classification, see also [7]. Some remarks, however, are worth making. First, Equations 3 and 4 can indeed be used as estimates of the prior and conditional probabilities, respectively, if the training set consists of instances chosen randomly from the instance space U. In practice, this may not be the case. For example, if the prior probabilities are heavily skewed, then a relatively large training set would be required to obtain a sufficient number of class-j instances, for those values of j for which the prior probabilities are relatively small, and hence sufficiently reliable estimates of the conditional probabilities. It therefore seems reasonable to keep the values of N(j) approximately equal. In this case, however, Equation 3 does not generally hold, and proper values for the prior probabilities should be obtained in a different way. Also in other cases, such as in a recommender system, where it may be largely up to the user which instances end up in the training set, Equation 3 may be invalid.
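The learned profile and decision rule of Equations 2–5 can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the data layout (feature dictionaries, string class labels) is an assumption, and, as in the text so far, no Laplace correction is applied.

```python
from collections import defaultdict

def train_profile(instances):
    """Build the learned profile of Equation 2: class counts N(j) and
    feature-value counts N(i, v, j). `instances` is a list of
    (feature_dict, cls) pairs; this layout is illustrative."""
    N_class = defaultdict(int)   # N(j)
    N_fv = defaultdict(int)      # N(i, v, j)
    for features, cls in instances:
        N_class[cls] += 1
        for i, v in features.items():
            N_fv[(i, v, cls)] += 1
    return N_class, N_fv

def classify(y, N_class, N_fv, total):
    """Decision rule of Equation 5: maximize the estimated posterior,
    with the prior estimated as N(j)/|X| (Equation 3) and the
    conditionals as N(i, v, j)/N(j) (Equation 4)."""
    best, best_score = None, -1.0
    for j, Nj in N_class.items():
        score = Nj / total                     # prior estimate
        for i, v in y.items():
            score *= N_fv[(i, v, j)] / Nj      # conditional estimate
        if score > best_score:
            best, best_score = j, score
    return best
```

A zero count simply drives the corresponding score to 0 here; a production version would apply the Laplace correction mentioned in the text.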
For a two-class recommender, [5] suggests setting the prior probabilities so as to balance the class-conditional error rates, an approach that is essentially the same as that suggested in [4], where the author uses a cost-based approach to classification. We adopt the same approach, in that the priors are set to predefined values pj for each j ∈ C, based on criteria related to the class-conditional error rates.

The second remark pertains to missing features and features with multiple values. We return to the latter in Section 5. Whenever a feature is missing in an instance to be classified, we omit the corresponding factor from Equation 5. To deal with missing features in instances from the training set, we adapt the estimate of the conditional probability, given in Equation 4, to

  Pr(xi = yi | c(x) = j) ≈ N(i, yi, j)/N(i, j),

where N(i, j) = ∑_{v ∈ Di} N(i, v, j). Thus, N(i, j) counts, for each j, the number of instances in the training set that do have a value for feature i.

The definition of c̃ in Equation 5 is thus replaced by

  c̃(y) = arg max_{j ∈ C} pj ∏_{i=1}^{f} N(i, yi, j)/N(i, j).   (8)

4. INCORPORATING A USER-DEFINED PROFILE

The goal in this section is to integrate a user-defined profile with the learned profile, given by Equation 2. We do this by defining like-degrees for individual feature values.

We assume that there are two classes, i.e., positive (+) and negative (−). We also simplify notation by omitting, where possible, explicit reference to the random variable x. In particular, we abbreviate 'c(x) = j' to 'j' and 'x = y' to 'y'. Using Equation 1 and predefined prior probabilities p+ and p−, we derive that

  Pr(+ | y)/Pr(− | y) = (p+/p−) ∏_{i=1}^{f} Pr(xi = yi | +)/Pr(xi = yi | −).   (9)

Corresponding to the definition of c̃ in Equation 8, if the estimate of the right-hand side of this equation is larger than 1, y is classified as positive; if it is smaller than 1, y is classified as negative; and if it equals 1, a random choice is made.

We next define

  r(i, v) = Pr(xi = v | +)/Pr(xi = v | −)   (10)
          = [Pr(+ | xi = v)/Pr(− | xi = v)] / [Pr(+)/Pr(−)]   (11)

as the skewing factor for feature-value pair (i, v). As shown by Equation 11, it indicates the relative skew that v causes for feature i in the prior probabilities. Using Equation 10, Equation 9 can be rephrased as

  Pr(+ | y)/Pr(− | y) = (p+/p−) ∏_{i=1}^{f} r(i, yi).   (12)

The involved skewing factors are thus multiplied together. Note that a non-zero skewing factor r is canceled by its inverse 1/r.

Where r(i, v) ranges from 0 to ∞, the derived quantity l(i, v), defined as

  l(i, v) = r(i, v)/(1 + r(i, v)),   (13)

ranges from 0 to 1, 1 itself excluded. This l(i, v) can be interpreted as a like-degree for feature-value pair (i, v), where 0.5 corresponds to neutral, as it leads to the neutral skewing factor of 1. Furthermore, two non-zero like-degrees l and 1 − l lead to skewing factors l/(1 − l) and (1 − l)/l, respectively, which cancel each other when multiplied together.

This forms the basis for the incorporation of a user-defined profile l^u. On the one hand, the like-degree l^l(i, v) is learned from the training set using the learned skewing factor r^l(i, v), defined as

  r^l(i, v) = [N(i, v, +)/N(i, +)] / [N(i, v, −)/N(i, −)],   (14)

which estimates the skewing factor given by Equation 10. On the other hand, for each feature-value pair (i, v), there is a user-defined like-degree l^u(i, v). Its default value is 0.5, i.e., neutral, but the user can set this like-degree to any value in the range [0, 1). Using this default value enables the user to easily create a profile by setting only a few like-degrees.

The user-defined and learned like-degrees are integrated as follows. We define the integrated like-degree l^int(i, v) as

  l^int(i, v) = α l^u(i, v) + (1 − α) l^l(i, v),   (15)

with α ∈ [0, 1]. This l^int(i, v) is next used to calculate its corresponding integrated skewing factor r^int(i, v) using the function x/(1 − x), the inverse of the function x/(1 + x) used in Equation 13. This skewing factor replaces r(i, v) in Equation 12.

To obtain a recommender that starts off as one based on a user-defined profile and gradually turns into a learning recommender based only on the learned profile, α should preferably be made dependent on the size of the training set. The thick, solid line in Figure 1 illustrates a possible definition, where the horizontal axis denotes the training set size.

Figure 1: Possible dependency of α on the training set size S (thick, solid line) and a possible trajectory of α when an update at S0 in the user-defined profile and two pruning actions are performed (thin, grey lines).

Initially, that is, until the training set has reached a specific size K, only the user-defined profile should be active, to allow the learned profile to mature somewhat. Otherwise, the integrated profile would immediately be contaminated with unreliable data. Then, the learned profile gradually takes over as the training set size increases, until it is sufficiently large, indicated by the limit L, at which point the learned profile has completely taken over. There is, of course, ample freedom in choosing how α depends on S. The linear relationship is chosen here for simplicity.

This can be refined by making α dependent on i, the feature under consideration in Equation 15.
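The translations between skewing factors and like-degrees (Equations 13 and 14), the convex combination of Equation 15, and a linear α schedule of the kind shown in Figure 1 are simple enough to sketch directly. The function and argument names are illustrative, and the default K and L are taken from the simulation settings reported later in the paper.

```python
def like_degree(r):
    """Equation 13: map a skewing factor r in [0, inf) to a like-degree in [0, 1)."""
    return r / (1.0 + r)

def skewing_factor(l):
    """Inverse mapping l/(1 - l), used to turn an integrated like-degree
    back into a skewing factor."""
    return l / (1.0 - l)

def learned_skewing_factor(N, i, v):
    """Equation 14, estimated from the learned profile. N is assumed to
    hold the counts N(i, v, j) under keys (i, v, j) and N(i, j) under
    keys (i, j); this layout is illustrative."""
    return (N[(i, v, "+")] / N[(i, "+")]) / (N[(i, v, "-")] / N[(i, "-")])

def alpha(n_min, K=10, L=50):
    """Weight of the user-defined profile as a function of the training
    set size measure: 1 at or below K, 0 at or above L, linear in between
    (the thick, solid line in Figure 1)."""
    return min(1.0, max(0.0, L - n_min) / (L - K))

def integrated_like_degree(l_user, l_learned, a):
    """Equation 15: convex combination of the user-defined and learned
    like-degrees with weight a = alpha in [0, 1]."""
    return a * l_user + (1.0 - a) * l_learned
```

Note that `like_degree` and `skewing_factor` are mutual inverses, so a profile can be stored in either representation.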

In particular, Nmin(i), defined as

  Nmin(i) = min(N(i, −), N(i, +)),   (16)

provides a measure for the size of the training data pertaining to feature i. Using this feature-dependent measure is especially relevant for features that are often not assigned a value, which results in relatively unreliable estimates of the conditional probabilities. Hence, we propose to define αi as follows:

  αi = min(1, (L − Nmin(i))+ / (L − K)),   (17)

where, for an expression E, (E)+ stands for max(0, E). This definition corresponds to the thick, solid line in Figure 1, whereby S = Nmin(i).

4.1 On the user regaining control

As the training set size increases, the role of the user-defined profile generally decreases. Hence, if the user updates his profile, say the like-degree of feature-value pair (i, v), while the learned profile has nearly or completely taken over, then this has little effect, if any.

A possible solution to this problem is to prune the rating history such that Nmin(i) decreases, but this is a rather drastic approach. It is more elegant to instead incorporate an offset D(i, v) by which Nmin(i) is decreased. Instead of using αi, we use αiv, defined as

  αiv = min(1, (L − (Nmin(i) − D(i, v)))+ / (L − K)).

Now, when the user updates the like-degree of feature-value pair (i, v), the corresponding offset is set to, e.g., Nmin(i) or Nmin(i) − K, so that, for this pair, the recommender starts off using only the user-defined profile again. Once αiv has become 0 again, D(i, v) can be reset to 0 as well. See Figure 1 for a possible trajectory of αiv.

4.2 Pruning the rating history

To retain flexibility in the learned profile to adapt to user preferences that change slowly over time, the rating history should be pruned regularly, e.g., by disregarding the oldest ratings. This is complicated by the presence of non-zero offsets D(i, v).

There are various ways to deal with this, but an obvious and simple solution is to try to keep αiv constant, until the size of the training set becomes so small that an increase in αiv is necessary. This results in the following update for D(i, v) when Nmin(i) changes from N to N′ < N:

  D(i, v) := (D(i, v) − N + N′)+.

The thin, grey lines in Figure 1 illustrate a possible trajectory, indicated by the arrowheads, of the value of α after an update in the user-defined profile at S = S0 and two pruning actions.

5. MULTI-VALUED FEATURES

It is not uncommon for features to have multiple values, such as the cast of a movie or a list of genres describing a movie in general terms. A possible approach is to consider a list of values as independent and to incorporate them as such in the computation of the posterior probabilities. In particular, if a feature i has a set of values V, then all feature-value pairs (i, v), with v ∈ V, serve as separate features, each contributing a factor to the product in Equation 8. Especially when the values in V are dependent, such as with a fixed cast, this approach results in an over-representation of this feature.

We provide an alternative to this approach, based on [11], where the author considers probabilistic feature values: the value of a feature i is given by a probability distribution on its domain Di.

Let y be an instance with probabilistic feature values, i.e., for each i = 1, 2, ..., f, yi is randomly distributed on Di according to probability distribution py(i, v). Let Vy(i) ⊆ Di denote the set of possible values of yi. In the case that |Vy(i)| = 1, yi degenerates to a deterministic value. Let the set Vy = {v | vi ∈ Vy(i), i = 1, 2, ..., f} denote the set of values that y can attain. We assume that, for each v ∈ Vy,

  Pr(y = v) = ∏_{i=1}^{f} py(i, vi).

This independence assumption is similar to the usual conditional independence assumption.

We are again interested in Pr(c(y) = j) for each class j. Let x be a random instance, uniformly distributed on the instance space. We proceed along the same lines as in Section 3 and partially follow Störr [11]. Note that we use short-hand notation, i.e., we abbreviate 'c(x) = j' to 'j', 'x = v' to 'v', and 'x = y' to 'y' in the probability function.

  Pr(c(y) = j)
    = ∑_{v ∈ Vy} Pr(c(y) = j | y = v) Pr(y = v)
    = ∑_{v ∈ Vy} Pr(c(v) = j) Pr(y = v)
    = ∑_{v ∈ Vy} Pr(j | v) Pr(y = v)
    = ∑_{v ∈ Vy} [Pr(v | j) Pr(j) / Pr(v)] Pr(y = v)
    = Pr(j) ∑_{v ∈ Vy} [∏_{i=1}^{f} Pr(xi = vi | j) py(i, vi)] / Pr(y)
    = Pr(j) ∏_{i=1}^{f} [∑_{v ∈ Vy(i)} Pr(xi = v | j) py(i, v)] / Pr(y).   (18)

Note that Pr(v) = Pr(y), as x is uniformly distributed on the instance space.

As before, the right-hand side of Equation 18 can be estimated using the training data, which may also contain instances with probabilistic feature values. To this end, we generalize the definition of the user profile N(i, v, j) as follows:

  N(i, v, j) = ∑_{x ∈ Xj} px(i, v),   (19)

where Xj = {x ∈ X | cx = j}, i.e., the set of class-j training instances. For an instance y, the classification c̃(y) is generalized to

  c̃(y) = arg max_{j ∈ C} pj ∏_{i=1}^{f} ∑_{v ∈ Vy(i)} [N(i, v, j)/N(i, j)] py(i, v).

Multi-valued features are modeled as uniformly distributed probabilistic features, i.e., for each i, all values v ∈ Vy(i) are equally probable: py(i, v) = 1/|Vy(i)|. In this case, all parameters py(i, v) can be omitted from the formula above, as is easily verified. Equation 19 can in this case be interpreted as if each instance with a multi-valued feature i, say with m values, is subdivided into m sub-instances, one corresponding to each value, each of which is counted 1/m times.
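As a sketch of this generalization: with uniform px(i, v) = 1/m for a multi-valued feature with m values, the counts of Equation 19 and the generalized decision rule can be written as follows. The data layout (multi-valued features given as value lists) is an assumption, and the common factors 1/|Vy(i)| are dropped since they cancel in the arg max; the sketch also assumes every class has seen every feature at least once.

```python
from collections import defaultdict

def train_probabilistic(instances):
    """Generalized profile of Equation 19: N(i, v, j) accumulates
    probability mass rather than unit counts, and N(i, j) sums it over v.
    A multi-valued feature with m values contributes 1/m per value."""
    N_ivj = defaultdict(float)
    N_ij = defaultdict(float)
    for features, cls in instances:
        for i, values in features.items():
            for v in values:                       # uniform p_x(i, v) = 1/m
                N_ivj[(i, v, cls)] += 1.0 / len(values)
                N_ij[(i, cls)] += 1.0 / len(values)
    return N_ivj, N_ij

def classify_multivalued(y, N_ivj, N_ij, priors):
    """Generalized decision rule after Equation 19, with uniform
    p_y(i, v) = 1/|V_y(i)| omitted as it does not depend on j."""
    best, best_score = None, -1.0
    for j, pj in priors.items():
        score = pj
        for i, values in y.items():
            score *= sum(N_ivj[(i, v, j)] for v in values) / N_ij[(i, j)]
        if score > best_score:
            best, best_score = j, score
    return best
```

The 1/m bookkeeping in `train_probabilistic` is exactly the sub-instance interpretation described in the text.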
Figure 2: Graphical views of the integration steps: (a) without and (b) with multi-valued features.

Now that we have generalized naive Bayesian classification to deal with multi-valued features, we next explain how to incorporate a user-defined profile. Analogously to Equation 9, we derive, based on Equation 18,

  Pr(+ | y)/Pr(− | y) = (p+/p−) ∏_{i=1}^{f} [∑_{v ∈ Vy(i)} Pr(xi = v | +)] / [∑_{v ∈ Vy(i)} Pr(xi = v | −)].

Using Equation 10, we can rephrase this as

  Pr(+ | y)/Pr(− | y) = (p+/p−) ∏_{i=1}^{f} [∑_{v ∈ Vy(i)} Pr(xi = v | −) r(i, v)] / [∑_{v ∈ Vy(i)} Pr(xi = v | −)],   (20)

from which it is easily seen that, for each feature i, a weighted average of the involved skewing factors is calculated, with weights depending on the conditional probabilities of the involved values v. As these weights are not available in the user-defined profile, we suggest replacing them by 1/|Vy(i)|, i.e., calculating the ordinary average.

Both average skewing factors must next be translated to like-degrees in order to take a convex combination of these like-degrees. The final issue to resolve is how to take a convex combination as in Equation 15, where the like-degrees have been replaced by the translation of an average of skewing factors. An obvious solution is to use the maximum of αiv over v ∈ Vy(i), as this results in a contribution from the user-defined profile corresponding to a value for which the contribution is maximal. In this way, whenever the like-degree for a value is updated by the user, its role will maximally permeate the results. The procedure is illustrated in Figure 2(b). For the learning part of the recommender, we estimate the weighted average of the learned skewing factors in Equation 20 using the training data, resulting in a single learned skewing factor, which is translated to a learned like-degree. For the user-defined part, we take the ordinary average of the involved user-defined skewing factors, based on the user-defined like-degrees, resulting in a user-defined skewing factor, which is translated back to a user-defined like-degree. Then, these two like-degrees are combined as in Equation 15, but with an alternative value α′ for α as explained above, and the result is translated back to an integrated skewing factor.

6. PERFORMANCE RESULTS

This section consists of three parts. We describe the simulation setup in the first part. In the second, we report on initial results obtained while experimenting with an actual implementation, which led to a number of improvements. In the third part, we provide a proof of concept by using a specially prepared user-defined profile.

6.1 Simulation setup

For a thorough assessment of the performance of the recommender, it is necessary to integrate the recommender into an application or system, such as a personal video recorder, to involve users, and to define appropriate performance measures. In particular, when the recommender is used to guide the automatic recording of broadcast movies on a hard disk, the total average value to the user of the content on disk would be an appropriate measure. Another issue that is of relevance, especially in the current context, is user satisfaction pertaining to the possibility to control the recommender. A user may initially set and tune a simple profile, based on immediate feedback in terms of recommended items. We currently circumvent these complex assessments and instead focus on a proof of concept, wherein we show that, given an appropriately chosen user-defined profile, the proposed solution indeed mitigates the cold-start problem. For reasons of space, we do not investigate sudden changes of interest by users, so that the offsets D(i, v) do not play a role and the αi defined in Equation 17, which is independent of the feature values, can be used.

We consider the rating histories of seven users A, ..., G that we collected. For each user, we use part of his rating history to create an appropriate user-defined profile; how this is done is explained shortly. We use another part of the rating history to train the recommender and to obtain values for the prior probabilities. We use yet another part to test it. Figure 3 and Table 1 illustrate the details. The complete rating history of size Shi is shown as a line that loosely corresponds to a time line. It alternatingly contains positively and negatively rated programs. On this line are indicated the data used for the user-defined profile, Sud in size, the training set of size Str, and the test set of size Ste. The variable S indicates the starting position of the three adjacent parts, and this variable runs over the rating history to collect sufficient statistics.

Figure 3: Subdivision of part of the rating history of a user into a user-defined-profile, training, and test part.

Referring to the table, the parameters K and L pertain to the definition of αi in Equation 17. They were chosen based on some experience with the recommender. Note that a value of 50 for L implies that a rating history of at least 100 programs is required for the recommender to become completely based on the learned profile. The parameter d is used in the construction of the user-defined profile and is discussed below. For each value of Str, which runs from 0 to 100 with a step size of 5, the variable j, which controls S, runs from 0 to jmax, where jmax is the largest integer satisfying the inequality (jmax + 1)Sud + Str + Ste ≤ Shi.
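The per-feature integration step of Figure 2(b), described in Section 5, can be sketched as follows. All names are illustrative: `r_learned` maps each value to its learned skewing factor, `p_neg` to the estimate of Pr(xi = v | −) used as weight in Equation 20, and `l_user` to the user-defined like-degree (the neutral-value refinement reported in Section 6.2 is not applied here).

```python
def integrate_feature(values, r_learned, p_neg, l_user, a):
    """One multi-valued feature's contribution: weighted average of the
    learned skewing factors (weights as in Equation 20), ordinary average
    of the user-defined skewing factors, conversion to like-degrees,
    convex combination (Equation 15, weight a = alpha'), and conversion
    back to an integrated skewing factor."""
    # Learned part: weighted average with weights Pr(xi = v | -).
    total_w = sum(p_neg[v] for v in values)
    r_l = sum(p_neg[v] * r_learned[v] for v in values) / total_w
    # User-defined part: ordinary average of the skewing factors l/(1 - l).
    r_u = sum(l_user[v] / (1.0 - l_user[v]) for v in values) / len(values)
    # Translate both to like-degrees, combine, and translate back.
    l_l = r_l / (1.0 + r_l)
    l_u = r_u / (1.0 + r_u)
    l_int = a * l_u + (1.0 - a) * l_l
    return l_int / (1.0 - l_int)
```

With a = 0 the result is just the weighted average of the learned skewing factors; with a = 1 and an all-neutral user profile it is the neutral skewing factor 1.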

almost the entire rating history to be used for each value of Str , each Although it is tempting not to use small training sets to generate
time with an independent, user-defined profile. These jmax + 1 re- the priors, it has a detrimental effect on the classification accuracy.
sults are averaged to compute an error rate. Hence, unless there is no training data in either class, in which case
the priors are set to 0.5, the training set is used to generate the
Table 1: Parameter settings for the simulations. priors.
The sometimes necessary Laplace correction needs the domain
sizes of the features. For some features, this domain is very large,
parameter value(s) such as for actors, and in general these are not known. We instead
K 10 use as the domain size for feature i the number of feature-value
L 50 pairs (i, v) that exist in the training set, with a lower bound of 1 to
d 7 prevent domains of size 0.
Sud 50 In the used rating histories, some features are often not present,
Str 0 (5)100 leading to relatively small values of Nmin (i), defined in Equation 16
Ste 100 for large training sets. This consequently leads to a slow take-over
S jSud by the learning recommender. And besides this, the features that
are often not present generally do not contribute significantly to
the classification accuracy. Hence, all features that were largely
The user-defined profile is constructed as follows. Using the absent in the rating histories were discarded. For simplicity, we
designated history part, skewing factors are defined for all feature- also discarded the feature ‘description’.
value pairs, according to Equation 14, using Laplace correction if
necessary. These are next converted to like-degrees and rounded to
6.3 Proof of concept
the nearest of a limited number of like-degrees l_k, k = 1, 2, . . . , d, defined as

    l_k = (k − 0.5)/d.

A value of d = 7 seems reasonable. Hence, we mimic a user who has meticulously defined a profile that closely matches a learned profile based on 50 ratings, but is restricted to only a limited number of equidistant like-degrees. Although this is not realistic in practice, it does provide proper input for a proof of concept.

As already mentioned, the training set is also used to generate values for the prior probabilities. This is done as follows. Using the leave-one-out cross-validation method, see [7], positive posterior probability estimates are calculated for all training instances, using uniform priors. Next, a decision threshold in the interval (0, 1) is determined that results in a minimal difference between the positive and negative classification error rates on the classified instances. The decision threshold determines which instances are classified as positive and which ones as negative, based on a comparison with the positive posterior probability estimates. The decision threshold can be translated into prior probabilities. In particular, it can be shown that, if uniform priors are used initially, then for a decision threshold γ, the positive prior probability should be set to 1 − γ. Once the priors have been set, the arg-max operator can be used, effectively resulting in a decision threshold of 0.5.

6.2 Initial results
In practice, a user will not generally set a large number of like-degrees, so the user-defined profile will contain many neutral values of 0.5, and thus skewing factors equal to 1. These neutral values have a detrimental effect on the classification accuracy. To mitigate this, the approach we take is to compute the ordinary average over the user-defined skewing factors only over those unequal to 1. If there are no such skewing factors, the average is set to 1. This change correspondingly influences the definition of α0 in the general case that the offsets may be unequal to 0. The maximum is then taken only over those values whose user-defined skewing factors are unequal to 1. If there are no such skewing factors, all α-values for the multi-valued feature are the same, and this value should be chosen. For the latter, it is required that the offset D(i, v) be reset to 0 whenever the user resets the corresponding like-degree to neutral. This, however, is of no concern for the current simulations.

The performance measure of interest we use is the classification error rate. Figures 4–10 compare the classification error rates of the integrated recommender with those of a learning recommender for each of the users.

For each of the users, the results show that the cold-start problem has been significantly reduced. Depending on the user, the integrated recommender has an advantage over the learning recommender for a longer or shorter period of time. For each user, the curves converge quite quickly to a difference of less than 0.05. This may be explained by the fact that the tastes of the users are evidently not that difficult to learn.

Apparently, choosing priors based on a small training set, and using an ordinary average instead of a weighted average as explained in Section 5, do not have a devastating effect.

[Figure 4: Results for user A. Classification error rate versus training set size, learning vs. integrated recommender.]

7. CONCLUDING REMARKS
In this paper, we have proposed to seamlessly integrate a user-defined and a learned profile into a recommender based on naive Bayesian classification, in order to mitigate the cold-start problem and to enhance user control.

The results reported upon provide a proof of concept using a limited amount of ground-truth data.

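The neutral-value mitigation of Section 6.2, averaging only over skewing factors unequal to 1, can be sketched as follows. This is a minimal illustration of ours, with made-up factor values; the function name is not from the paper.

```python
def average_skewing(factors):
    """Ordinary average over user-defined skewing factors, ignoring the
    neutral value 1; if all factors are neutral, fall back to 1."""
    informative = [f for f in factors if f != 1]
    if not informative:
        return 1.0
    return sum(informative) / len(informative)

# Only the non-neutral factors 2.0 and 0.5 contribute to the average.
print(average_skewing([1, 2.0, 1, 0.5, 1]))  # 1.25
print(average_skewing([1, 1, 1]))            # 1.0
```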
[Figures 5–8: classification error rate versus training set size, learning vs. integrated recommender. Figure 5: Results for user B. Figure 6: Results for user C. Figure 7: Results for user D. Figure 8: Results for user E.]

The experiments could be extended with more data, using, for instance, the Duine data set, see [3]. In addition to these experiments, user tests should be performed to assess the use and usefulness of the proposed solution, including sudden changes of interest of the user. To test sudden changes of interest, a possible approach is to combine the rating histories of two users by appending one to the other. Upon the change from the first to the second rating history, a user-defined profile is built and incorporated, using some data from the second rating history, in a similar way as described in the paper.

The integration of a user-defined profile with a learned profile also allows the user-defined profile to come from another source, or even from various sources. For example, the learned profile of another user can be used to jump-start the learning recommender, while still allowing ample freedom to change this profile.

The proposed solution may be refined by incorporating the results of Pronk, Gutta, and Verhaegh [8], who extend the naive Bayesian classifier by adding confidence levels to the posterior probability estimates. These confidence levels can provide additional control over how fast the learned profile takes over from the user-defined profile. This is a topic for further research.

A further refinement is to utilize feature-selection algorithms to optimize the performance of the recommender by excluding certain features. An additional issue here is that the user might still want to exert control on the recommender using one of the excluded features. This, too, is a topic for further research.

8. REFERENCES
[1] Cestnik, B. [1990]. Estimating probabilities: a crucial task in machine learning, Proceedings of the European Conference on Artificial Intelligence, Stockholm, Sweden, 147–149.
[2] Cost, S., & Salzberg, S. [1993]. A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning 10, 57–78.
[3] Duine [2004]. Duine Project, http://www.telin.nl/project/Home.cfm?id=387&language=en.
[4] Elkan, C. [2001]. The foundations of cost-sensitive learning, Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, Seattle, Washington, 973–978.
[5] Gärtner, T., Wu, S., & Flach, P.A. [2001]. Data mining on the Sisyphus dataset: evaluation and integration of results, in: C. Giraud-Carrier, N. Lavrac, & S. Moyle (Eds.), Proceedings of the Workshop Integrating Aspects of Data Mining, Decision Support and Meta-Learning, Freiburg, Germany, 69–80.
[6] Kurapati, K., & Gutta, S. [2002]. Instant personalization via clustering TV viewing patterns, Proceedings of the Sixth International Conference on Artificial Intelligence and Soft Computing, Banff, Canada.
[7] Mitchell, T.M. [1997]. Machine Learning. McGraw-Hill.

[8] Pronk, V., Gutta, S.V.R., & Verhaegh, W.F.J. [2005]. Incorporating confidence in a naive Bayesian classifier, in: L. Ardissono, P. Brna, & A. Mitrovic (Eds.), Lecture Notes in Artificial Intelligence 3538: Proceedings of the Tenth International Conference on User Modeling, Edinburgh, UK, 317–326.
[9] Schafer, J.B. [2005]. DynamicLens: a dynamic user interface for a meta-recommendation system, Proceedings of the Tenth International Conference on Intelligent User Interfaces, San Diego, California.
[10] Stanfill, C., & Waltz, D. [1986]. Toward memory-based reasoning, Communications of the ACM 29:12, 1213–1228.
[11] Störr, H.-P. [2002]. A compact fuzzy extension of the naive Bayesian classification algorithm, Proceedings of the Third International Conference on Intelligent Technologies and Vietnam-Japan Symposium on Fuzzy Systems and Applications, Hanoi, Vietnam, 172–177.
[12] Zimmerman, J., Kurapati, K., Buczak, A.L., Schaffer, D., Martino, J., & Gutta, S. [2004]. TV personalization system: design of a TV show recommender engine and interface, in: L. Ardissono, A. Kobsa, & M.T. Maybury (Eds.), Targeting Programs to Individual Viewers, Human-Computer Interaction Series 6. Springer.

[Figures 9–10: classification error rate versus training set size, learning vs. integrated recommender. Figure 9: Results for user F. Figure 10: Results for user G.]

