On the use of LSH for Privacy Preserving Personalization

Armen Aghasaryan, Makram Bouzid, Dimitre Kostadinov, Mohit Kothari, and Animesh Nandi
Bell Labs Research, Alcatel-Lucent
Email: {name.surname}@alcatel-lucent.com
Abstract. The Locality Sensitive Hashing (LSH) technique
for scalably finding nearest neighbors can be adapted to enable
discovering similar users while preserving their privacy. The
key idea is to compute the user profile on the end-user device,
apply LSH to the local profile, and use the LSH cluster
identifier as the interest group identifier of the user. By the properties
of LSH, the interest group comprises other users with similar
interests. The collective behavior of the members of the interest
group is anonymously collected at an aggregation node to
generate recommendations for the group members.
The quality of recommendations depends on the efficiency of
the LSH clustering algorithm, i.e. its capability of gathering
similar users. In contrast with the conventional usage of LSH (for
scalability and not privacy), in our framework one cannot
perform a linear search over the cluster members to identify
the nearest neighbors and to prune away false positives. A
good clustering quality is therefore of functional importance for
our system. We report in this work how changing the nature
of LSH inputs, which in our case correspond to the user
profile representations, impacts the performance of LSH-based
clustering and the final quality of recommendations. We present
extensive performance evaluations of the LSH-based
privacy-preserving recommender system using two large datasets of
MovieLens ratings and Delicious bookmarks, respectively.
Keywords: Privacy, Personalization, LSH
I. INTRODUCTION
A broad class of applications, such as StumbleUpon or
iGoogle URL recommendations, Netflix movie recommendations,
or IPTV content recommender systems, confront users
with the dilemma of either disclosing their sensitive profile
information to benefit from personalized services or concealing
their data to preserve their privacy. Today, users have
no option but to trust the service provider with their profile
data in return for personalized services.
One approach to preserving privacy is to build a user profile
locally on the end-user device, communicate the meta-level
semantic categories to the service provider, and then apply
fine-grained content-based local filtering of the received
recommendations. Variants of this approach exist, but they support
only content-based recommendations, and not collaborative
filtering [11], [13], [22]. To benefit from the collaborative
filtering (CF) approach, i.e. recommendations derived from
like-minded users, the user profile must be exchanged with the
content provider, or with other users in a peer-to-peer fashion,
in order to identify those similar users and to compute the
top-rated items of the like-minded community. Doing this
in a privacy-preserving way is a challenging problem, and
existing solutions either rely on heavyweight computations
employing cryptographic operations (e.g. [4]), or rely on
perturbations in users' profiles (e.g. [2], [18]) which may
degrade the recommendation quality.
In our previous work [16] we proposed the basic design
principles of a distributed privacy-preserving personalization
(P3) approach where the user profile is built locally, and then
used in a lightweight, practical, and yet privacy-preserving
way to get both content-based and collaborative-filtering
recommendations. The P3 approach relies on two main
concepts (Figure 1). First, it adapts the Locality-Sensitive
Hashing (LSH) technique for scalable nearest-neighbor
search [5], [12] so as to discover similar users while
preserving their privacy. In fact, the LSH approach allows
end-users with similar interests to be assigned to the same
interest group (cluster) with high probability, without
requiring the local profiles to be revealed to any central or
trusted entity. Second, the P3 approach relies on a distributed
pool of non-colluding nodes and anonymous communication
channels. Each node acts on behalf of an interest group
and is in charge of anonymously aggregating the group's
behavior, generating recommendations, and anonymously
delivering them to group members.
The quality of recommendations in the P3 approach
depends on the efficiency of the LSH clustering algorithm,
i.e. its capability of gathering similar users within interest
groups. Although LSH is a well-studied algorithm in the
literature, the conventional approaches for improving its
performance cannot be applied in this privacy-preserving
framework. Namely, this framework does not permit a
linear search over cluster members to identify the nearest
neighbors; here, for the sake of privacy protection, individual
user profiles are indistinguishable within interest groups. We
therefore apply a transformation of the original input space of
user profiles and show that it can improve the clustering
performance, especially in the case of an extremely sparse dataset.
The main contributions of this paper are the following.
- We analyze the consequences of using LSH for privacy
as opposed to the conventional use for scalability. LSH
provides the possibility of computing cluster membership
locally, avoiding the pairwise profile comparisons
needed in centralized CF approaches. This allows building
a privacy-preserving recommender system, but
brings additional challenges for maintaining a good
recommendation quality, close to that of conventional CF.
2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications
978-0-7695-5022-0/13 $26.00 2013 IEEE
DOI 10.1109/TrustCom.2013.46
Figure 1. Key idea of P3
- We describe in this work how changing the nature
of LSH inputs, which in our case correspond to the
user profile representations, impacts the performance of
clustering. We analyze item-based and tag-based profiling
strategies and their consequences on the quality
of recommendations.
The rest of the paper is structured as follows. Section II
describes the distributed LSH-based approach for
privacy-preserving personalization and summarizes the main
building blocks of P3. Then, Section III introduces the
profiling strategies applied to our experimental datasets, and
Section IV presents evaluations of clustering properties and
the recommendation quality. We contrast our approach with
prior works in Section V, and conclude in Section VI.
II. THE P3 APPROACH
This section presents the main building blocks of the
P3 architecture divided into three layers: (1) local clients
responsible for local profiling, interest-group identification,
and local filtering of recommendations, (2) aggregator nodes
responsible for group profiling and recommendations, and
(3) a communication infrastructure providing anonymous
channels between local clients and aggregator nodes. We then
outline the privacy-preserving properties of the P3 approach.
A. P3 Local Client Layer
Profiling and Interest Group Assignment. This component
runs on the end-user's device (e.g. PC, smartphone,
or set-top box) and carries out two main functions: user
profiling and assignment to interest groups. The first is
achieved by collecting and analyzing local traces of user
activities, and building a consolidated user profile. This can
be done in an application-specific way, e.g. see [13], [22]
for targeted ad applications in web browsers. However,
independently of the applied approach, we assume that the
resulting profile is represented as a list of <item, value>
pairs, where items can refer to consumed item references
or their semantic characteristics (e.g. tags, categories), and
values are the corresponding interest degrees such as an
explicit user-given rating or a derived implicit interest value.
The user assignment to interest groups is achieved by
locally computing an LSH code for each profile. An LSH
scheme is defined as a probability distribution over a family
H of hash functions such that two similar objects, x and y,
hash to the same value with high probability,

$$\Pr_{h \in H}\,[h(x) = h(y)] \sim \mathrm{sim}(x, y) \tag{1}$$

Such schemes are used to assemble candidate nearest neighbors
in hash buckets (or clusters), in order to solve approximate
nearest neighbor problems in high-dimensional
spaces [1]. The output of several randomly selected hash
functions is concatenated in order to amplify the discriminating
power of hash buckets: similar points are expected to
collide on all hash values (reducing false positives). Then,
the sequence of hash values can be used as the label (or key)
of a cluster to which the hashed objects belong. The cluster
key size, k, corresponds to the number of hash functions used
and defines a configuration parameter of the algorithm.
Furthermore, in order to reduce the false negatives, one can
compute multiple LSH sequences and subscribe each user
to a number of clusters. This defines another parameter of
the algorithm, the number of clustering iterations, L.
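
The effect of the two parameters can be made concrete with the standard LSH amplification argument. The following is a sketch under stated assumptions, not a result from the paper: p denotes the probability that a single hash function assigns the same value to two given profiles, and the hash functions are assumed to be drawn independently.

$$P_{\text{same key}} = p^{k}, \qquad P_{\text{share at least one of } L \text{ clusters}} = 1 - \left(1 - p^{k}\right)^{L}$$

For illustration, with k = 7 and L = 5, two profiles with p = 0.9 share at least one cluster with probability $1 - (1 - 0.9^{7})^{5} \approx 0.96$, whereas two profiles with p = 0.6 do so with probability $1 - (1 - 0.6^{7})^{5} \approx 0.13$. Increasing k sharpens the separation between similar and dissimilar pairs, and increasing L recovers some of the similar pairs that a single key would miss.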
Depending on the chosen similarity metric (or the corresponding
distance dist(x, y) = 1 - sim(x, y)), an appropriate
LSH scheme should be found, e.g. Charikar offers a
solution to implement LSH for cosine similarity [5], while
the MinHash algorithm [6] is known for Jaccard similarity.
Since in the general case the profile representation needs to
support non-binary interest values (e.g. ratings on movies),
our approach relies on the LSH variant for cosine similarity
based on random hyperplanes, see Figure 2. The idea is to
generate k random profile vectors as pivots, shared among
the users, and to construct a signature of the user profile
based on its discretized similarity with these pivots; the
signature is obtained by concatenating the k sign bits of the dot
products between the user profile and each of the random
vectors. The users with identical signatures belong to the
same cluster (with the cluster key defined by the signature
of its members). This process is repeated L times and each
user is assigned to L clusters.
It is easy to see that, for a given dataset, using a higher k
value results in a larger number of clusters in each iteration
(bounded by 2^k), with a smaller average cluster size and
a higher intra-cluster similarity (reducing false positives).
Note that this procedure requires sharing only random
seeds among the end-user clients, while all the profiles are
processed locally without compromising the user privacy.
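
A minimal sketch of the signature computation described above, assuming a sparse profile stored as a Python dict. The function names (`hyperplane_component`, `lsh_cluster_keys`), the way hyperplane components are derived from a shared seed string, and the iteration prefix in the key are illustrative assumptions, not the authors' implementation.

```python
import random


def hyperplane_component(shared_seed, iteration, j, item):
    """Component of the j-th random hyperplane (in a given iteration) for one item.

    Assumption: every client derives the same Gaussian value from the shared
    seed, so only the seed has to be distributed, never the profiles.
    """
    rng = random.Random(f"{shared_seed}|{iteration}|{j}|{item}")
    return rng.gauss(0.0, 1.0)


def lsh_cluster_keys(profile, k, L, shared_seed):
    """Return L cluster keys, each made of k sign bits, for a sparse profile.

    profile: dict mapping item -> interest value (missing items count as 0).
    """
    keys = []
    for iteration in range(L):
        bits = []
        for j in range(k):
            dot = sum(
                value * hyperplane_component(shared_seed, iteration, j, item)
                for item, value in profile.items()
            )
            # One bit per hyperplane: on which side of it does the profile lie?
            bits.append("1" if dot >= 0 else "0")
        # Hypothetical convention: prefix with the iteration index so that
        # keys from different clustering iterations never collide.
        keys.append(f"{iteration}:" + "".join(bits))
    return keys


if __name__ == "__main__":
    ratings = {"movie_12": 5.0, "movie_40": 3.0, "movie_77": 1.0}
    print(lsh_cluster_keys(ratings, k=5, L=3, shared_seed="demo-seed"))
```

Because only the sign of each dot product is kept, rescaling a profile by a positive constant leaves its keys unchanged, which matches the use of cosine similarity.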
Figure 2. Using LSH for cosine similarity; one clustering iteration with
cluster key size k = 5 explained.

Profile slicing. We have shown above that cluster assignments
can be done while keeping user profiles local.
Nevertheless, to get CF recommendations (i.e. items from
similar users) one still needs to share the profile data in some form.
Here, we assume that no Personally Identifiable
Information (PII, see footnote 1) is available in profiles, e.g. the client
employs some mechanism to filter out the URLs which
themselves act as PII. Then, our approach consists in slicing
each user profile into smaller segments (slices), composed
of one or several <item, value> pairs, and associating them
with the cluster key computed from the entire profile. Now,
the slices of a given profile can be uploaded anonymously
and independently to an aggregator node, so that the latter
cannot pick them out and reassemble entire user profiles from
the mass of slices of its interest-group members. This unlinkability
property [17] prevents the aggregator node from
inferring user identities and/or their sensitive profile data.
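
The slicing step can be sketched as follows. This is an illustration under assumptions: the message layout (`cluster_key`, `pairs`), the `slice_size` parameter and the shuffling are hypothetical choices, and the anonymous, independent upload of each message is not modeled here.

```python
import random


def slice_profile(profile, cluster_key, slice_size=1, rng=None):
    """Split a profile into independent upload messages.

    Each message carries only the cluster key and a few <item, value> pairs;
    it contains no user identifier and no reference to the other slices.
    """
    rng = rng or random.Random()
    pairs = list(profile.items())
    # Shuffle so that the upload order does not mirror the profile order.
    rng.shuffle(pairs)
    slices = [pairs[i:i + slice_size] for i in range(0, len(pairs), slice_size)]
    return [{"cluster_key": cluster_key, "pairs": s} for s in slices]


if __name__ == "__main__":
    profile = {"movie_12": 5, "movie_40": 3, "movie_77": 4}
    for message in slice_profile(profile, "0:10110", slice_size=2):
        print(message)
```

Smaller slices make it harder to link pairs back to one user, at the cost of more upload messages.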
Local recommendations selection. At the final stage of the
personalization process, each client retrieves recommendations
from its L interest groups, and merges them into a
single list where each item rating is computed as a weighted
average of the respective ratings in the received lists. Here, the
weights correspond to the user-to-group profile similarities.
Alternatively, one can consider only the recommendations
issued from the user's top L' clusters (according to their
weight) as a trade-off to limit the network throughput. In
practice, even considering only the best cluster produces
roughly the same quality of recommendations, and this is
the approach we have adopted in this paper.
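
The merging step can be sketched as below. The function name and data layout are hypothetical; in addition, the text does not say how an item missing from some of the lists is handled, so this sketch assumes the average is taken only over the lists that contain the item.

```python
def merge_recommendations(group_lists, weights):
    """Merge per-group recommendation lists into one ranked list.

    group_lists: list of dicts mapping item -> group rating.
    weights: user-to-group profile similarities, one per list.
    Returns (item, merged_rating) pairs sorted by decreasing rating.
    """
    totals, weight_sums = {}, {}
    for recs, w in zip(group_lists, weights):
        for item, rating in recs.items():
            totals[item] = totals.get(item, 0.0) + w * rating
            weight_sums[item] = weight_sums.get(item, 0.0) + w
    merged = {
        item: totals[item] / weight_sums[item]
        for item in totals
        if weight_sums[item] > 0
    }
    return sorted(merged.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    lists = [{"a": 4.0, "b": 2.0}, {"a": 2.0}]
    print(merge_recommendations(lists, [0.75, 0.25]))
```

The best-cluster variant adopted in the paper corresponds to passing only the list with the largest weight.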
B. P3 Aggregation Layer
This layer is composed of a set of independent group-wise
aggregator nodes, each one responsible for an interest group.
Each node operates on the collected group consumption
(profile slices), and computes the primary interests of the
group in terms of Top-N items (group recommendations).
This is achieved by aggregating the consumption elements of
group members, <item, value> pairs, where the aggregated
value (estimated rating) for each item, <item, value*>,
can be obtained by using a combined score of the item's
frequency and its ratings; in this paper, we consider the
average ratings within the group.

Footnote 1: http://en.wikipedia.org/wiki/Personally_identifiable_information
Aggregator nodes also act as P3's points of contact for
third parties such as content providers, allowing the injection
of external recommendations into the Top-N lists of interest
groups. To that end, the aggregator nodes first need to
build their semantic group profiles, by using the uploaded
semantic characteristics of member profiles (if available),
or by carrying out a semantic analysis of the uploaded items
(URLs, movies, etc.). Then, the group profiles can be
exposed to external content providers, and semantic matching
techniques can be applied to generate additional content-based
recommendations. Note that this procedure does not
violate the users' privacy, as no individual profile data is
delivered to a third party.
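
The Top-N computation performed by an aggregator node, using the average rating within the group as described in this layer, might look like the sketch below. The function name, the slice layout (lists of `(item, value)` pairs) and the tie-breaking rule are assumptions made for illustration.

```python
from collections import defaultdict


def top_n_group_items(slices, n):
    """Compute Top-N group recommendations from anonymous profile slices.

    slices: iterable of lists of (item, value) pairs, as uploaded by members.
    Returns up to n (item, average_value) pairs, best first. Ties are broken
    by item identifier (an arbitrary choice for this sketch).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pairs in slices:
        for item, value in pairs:
            sums[item] += value
            counts[item] += 1
    averages = {item: sums[item] / counts[item] for item in sums}
    return sorted(averages.items(), key=lambda pair: (-pair[1], pair[0]))[:n]


if __name__ == "__main__":
    uploaded = [[("m1", 5)], [("m2", 3)], [("m1", 4)], [("m3", 4.5)]]
    print(top_n_group_items(uploaded, n=2))
```

The node never needs to know which slices came from the same user, which is what makes the averaging compatible with unlinkable uploads.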
C. P3 Network Infrastructure Layer
This layer ensures anonymous communication between
local clients and aggregator nodes: the anonymous upload
of profile slices from user devices to the respective interest
groups and, in return, the anonymous delivery of recommendations
from aggregator nodes to local devices.
Distributed hosting of aggregator nodes. The aggregator
nodes are hosted by a small number of physical nodes (hosts)
interconnected via a Distributed Hash Table (DHT) overlay,
a well-known failure-resilient and scalable distributed system
providing a lookup mechanism similar to hash tables [8].
This provides an elegant addressing scheme to identify the
destination aggregator node for each interest group: the cluster
identifiers (i.e. the LSH cluster keys) computed locally at
end-user devices can be used as DHT lookup keys, so that
the profile slices are efficiently forwarded by the overlay
network to the hosts of destination aggregator nodes.
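
To illustrate how a cluster key can double as a lookup key, the sketch below maps keys to hosts with a simple consistent-hashing ring. This is only an assumption-laden stand-in: the paper does not specify the DHT protocol, and the `HostRing` class, the use of SHA-1 and the host names are hypothetical.

```python
import bisect
import hashlib


def ring_position(label):
    """Map a string to a position on the hash ring."""
    return int.from_bytes(hashlib.sha1(label.encode("utf-8")).digest(), "big")


class HostRing:
    """Toy key-to-host mapping in the spirit of a DHT lookup."""

    def __init__(self, hosts):
        if not hosts:
            raise ValueError("at least one host is required")
        self._ring = sorted((ring_position(h), h) for h in hosts)
        self._positions = [pos for pos, _ in self._ring]

    def host_for(self, cluster_key):
        """Return the first host clockwise from the key's ring position."""
        idx = bisect.bisect_right(self._positions, ring_position(cluster_key))
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    ring = HostRing(["host-a", "host-b", "host-c"])
    print(ring.host_for("0:10110"))
```

Every client that computes the same cluster key reaches the same host without any coordination, which is the property the addressing scheme relies on.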
Anonymous communication channel. To hide the
clients' IP addresses in their communications with the (hosts
of) aggregator nodes, we rely on an anonymizing network
like Tor [10], which implements the onion-routing paradigm.
The recommendations generated at aggregator nodes can
be looked up anonymously by their group members.
Alternatively, they can be anonymously pushed to the group
members using a publish-subscribe mechanism [19], which
can be implemented via the hidden service feature of Tor.
D. Privacy-preserving features of P3
Here, we summarize the privacy-preserving features of
the P3 approach that result from our usage of the LSH-based
local clustering algorithm, the profile slicing approach, and
the choices made in the infrastructure design. We consider
the case of an honest-but-curious attacker taking control of
a group-wise aggregator node. Such an attacker will not
perturb the protocols to generate false data and behaviors,
but will try to infer sensitive information from the data under
his control.
- Profiles kept local: the local LSH-based algorithm
removes the need for user profiles to be communicated
outside of the client device at the stage of discovering
like-minded users and identifying the interest groups.
- Unlinkability: the profile slicing approach, combined
with the usage of an anonymous communication channel
between the clients and aggregator nodes, achieves
the unlinkability of profile slices from the aggregator
nodes' perspective.
- Indistinguishability: as a consequence of the unlinkability
of profile slices, the members of a cluster become
indistinguishable at aggregator nodes. The higher the
cardinality of a cluster, the higher the number of possible
slice combinations, and the stronger the indistinguishability
of cluster members. The average cluster size
depends on the LSH key size parameter, k; a larger key
size results in a higher number of clusters and a smaller
average cluster size for a given user population. This
parameter needs, however, to be considered in relation
to the achieved quality of recommendations.
- Decentralization: the distributed pool of non-colluding
physical nodes reduces the risk of a single attacker
simultaneously accessing a large number of interest
groups; the higher the number of physical nodes, the lower
the number of interest groups hosted by each of them.
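
The dependence of the average cluster size on the key size can be illustrated with a short calculation. It is a back-of-the-envelope sketch, not a measurement from the paper: since one clustering iteration produces at most 2^k clusters, a population of N users yields an average size of the non-empty clusters of at least

$$\bar{s} \;\ge\; \frac{N}{2^{k}}$$

For the MovieLens population of N = 71,567 users, k = 7 gives $\bar{s} \ge 71{,}567 / 128 \approx 559$, while a hypothetical k = 10 would lower the bound to $71{,}567 / 1024 \approx 70$. The actual averages can be larger, because some of the 2^k possible keys may remain unused.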
III. PROFILING STRATEGIES
The P3 approach of using LSH for privacy, in contrast to the
conventional usage of LSH for scalability, comes with additional
limitations. When using LSH for scalability, a linear
search over the cluster members can be done to identify
the nearest neighbors and build recommendations primarily
influenced by those members [9]. In contrast, when using
LSH for privacy, the intentional unlinkability of profile slices
implies that the cluster has no notion of complete user
profiles. The aggregator node can, for example, compute the
popular items within the mashed-up profile slices gathered
by the cluster, but is not able to identify the nearest neighbors
and prioritize their consumption. Furthermore, errors of the
LSH-based randomized clustering cannot be removed and
may degrade the quality of recommendations.
Thus, it is legitimate to question the performance of such
a privacy-preserving clustering method, which happens to be
especially weakened for high-dimensional sparse datasets
with overall very low similarities between the points, see
Figure 3. In fact, it follows from the definition of LSH
(Eq. 1) that the probability of hash collision is smaller for
lower similarity values. Then, in a dataset where even the
nearest neighbors often have low pairwise similarities, the
probability of grouping them together is diminished.
Figure 3. Empirical probability densities indicating high proportions of
very low similarities.
Figure 4. Empirical probability densities indicating the presence of
moderate and high similarity values.
In such cases, a conventional LSH algorithm can improve
the chances of finding the nearest neighbors by increasing
the number of clustering iterations, L, and therefore by
considering more candidates for the nearest neighborhood.
This does not necessarily help when using LSH for privacy,
because the new members cannot be sorted according to
their similarity with the current user and therefore would
not help in identifying the nearest neighbors. This is the reason
why we suggest a transformation of the profile space, by
means of semantic enrichment of profiles, which changes
the properties of similarity distributions and improves the
performance of the P3 approach.
A. Datasets and Proling strategies
Large MovieLens dataset (ML). The large MovieLens
dataset (see footnote 2) contains 10M ratings and 95,580 tags applied
to 10,681 movies by 71,567 users. All users have rated at
least 20 movies, and their ratings are split into a training set
(80%) and a test set (20%). We have built two types of user
profiles on the basis of movie ratings and tag assignments.
Item-based profiles represent the movie ratings in the training
set, in the form of <movie, rating> pairs. Figure 5 shows
the popularity curve of movies in terms of the number of ratings
they have received, where about 80% of movies received
more than 20 ratings.
Footnote 2: http://www.grouplens.org/system/files/ml-10m-README.html
Figure 5. Item popularity curves.
Tag-based profiles are built from item-based profiles by
using the tags assigned to movies by different MovieLens
users. After some clean-up, removing stopwords and stemming,
each movie is associated with a set of weighted tags,
where the weight of a tag corresponds to the normalized
number of its assignments to the given movie. Then, a tag-based
profile is a set of <tag, value*> pairs obtained by
aggregating the tags of those movies that are present in the
user's item-based profile, while additionally weighting by
the respective movie ratings.
Delicious dataset (DL). We made a random selection from
the Delicious dataset [23] involving 7,088,524 bookmarks to
3,270,509 web pages saved by 50,014 users, where each web
page contains at least some textual data and each user has at least
20 bookmarks. As previously, the bookmarks are split into a
training set (80%) and a test set (20%).
We extracted tags from the web pages by analyzing their
HTML content, and then applied standard techniques for
stopword removal, stemming, and pruning of infrequent tags.
This resulted in a set of about 50,000 tags. The bookmarks
in the training set with their associated tags were then used
to generate two types of user profiles.
Item-based profiles contain the bookmarks of users in the
training set, in the form of <url, value=1> pairs. Figure 5
shows a much flatter popularity curve for the DL dataset,
with only 0.05% of URLs bookmarked more than 20 times.
Tag-based profiles contain sets of weighted tags, <tag,
value>, where the weight of a tag is the sum of its
normalized weights in the URLs of the user's item-based
profile.
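
The construction of tag-based profiles from item-based ones can be sketched as follows. The function name and the dict layout are hypothetical. Summing the weighted tag contributions follows the DL description (where every item value is 1); for ML, the text only says that tags are aggregated and weighted by ratings, so using a rating-weighted sum there is an assumption of this sketch.

```python
def tag_profile(item_profile, item_tags):
    """Derive a tag-based profile from an item-based one.

    item_profile: dict item -> interest value (rating, or 1 for a bookmark).
    item_tags: dict item -> dict tag -> normalized tag weight for that item.
    Returns a dict tag -> aggregated value.
    """
    profile = {}
    for item, value in item_profile.items():
        # Items without any known tag simply contribute nothing.
        for tag, weight in item_tags.get(item, {}).items():
            profile[tag] = profile.get(tag, 0.0) + value * weight
    return profile


if __name__ == "__main__":
    ratings = {"m1": 4, "m2": 2}
    tags = {"m1": {"space": 0.5, "robot": 0.5}, "m2": {"space": 1.0}}
    print(tag_profile(ratings, tags))
```

Two users with no item in common can still obtain overlapping tag profiles, which is how this transformation raises the similarities in a sparse item space.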
B. Similarity Distribution Properties
We observe in Figures 3 and 4 a significant difference
in similarity distribution patterns corresponding to the two
types of profiling strategies. While the item-based strategies
indicate high proportions of very low similarities (especially
for the DL dataset), the application of tag-based strategies
yields a much wider range of frequently occurring
similarity values. In addition, as reported in Table I, the DL
dataset is much sparser than the ML dataset. We
measured the mean distance ratio, D_max/D_min, between the
farthest and the nearest neighbors of each user. Small values
of this ratio indicate that the users tend to be equally distant
from each other (the curse of dimensionality [14]), and
therefore one can expect a good clustering performance to be
harder to achieve. We examine next how these differences
translate into the clustering and recommendations quality.

Table I
SPARSITY AND THE D_max/D_min RATIOS.

              DL item    DL tag    ML item    ML tag
Sparsity      4.33e-5    0.0253    0.0125     0.1018
D_max/D_min   1.0653     2.4705    1.8875     3.8796
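
A mean distance ratio of the kind reported in Table I could be computed along the following lines. This is a sketch under assumptions: it uses the cosine distance 1 - sim(x, y) on sparse dict profiles, and it skips users whose nearest neighbor is at distance zero, a case the text does not discuss. The function names are hypothetical.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two sparse profiles (dicts item -> value)."""
    dot = sum(value * q.get(item, 0.0) for item, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def mean_distance_ratio(profiles):
    """Average, over users, of farthest-neighbor over nearest-neighbor distance."""
    ratios = []
    for u, p in profiles.items():
        dists = [1.0 - cosine_similarity(p, q) for v, q in profiles.items() if v != u]
        if dists and min(dists) > 0:
            ratios.append(max(dists) / min(dists))
    return sum(ratios) / len(ratios) if ratios else float("nan")


if __name__ == "__main__":
    users = {"a": {"x": 1.0}, "b": {"x": 1.0, "y": 1.0}, "c": {"y": 1.0}}
    print(mean_distance_ratio(users))
```

A ratio close to 1 means that a user's nearest and farthest neighbors are almost equally distant, which is the situation reported for the DL item-based profiles.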
IV. PERFORMANCE EVALUATIONS
In this section, we measure the impact of profiling strategies
on the clustering performance, on the cluster size
distribution, and on the quality of recommendations.
A. Clustering properties
We evaluate the clustering performance in terms of the
capability of grouping similar members together. We compute
for each profile u its average similarity with the n
nearest neighbors (NN) within the cluster domain C(u) as
provided by the clustering algorithm. We denote this quantity
by NNS(u, n) and refer to NNS(n) as its average over the
set of all users.
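
The NNS(u, n) quantity defined above can be sketched as follows, assuming cosine similarity on sparse dict profiles. The function names and the `domain_of` callback, which returns the candidate set C(u) for a user, are hypothetical.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two sparse profiles (dicts item -> value)."""
    dot = sum(value * q.get(item, 0.0) for item, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def nns(u, n, profiles, candidates):
    """Average similarity of user u with its n nearest neighbors among candidates."""
    sims = sorted(
        (cosine_similarity(profiles[u], profiles[v]) for v in candidates if v != u),
        reverse=True,
    )
    top = sims[:n]
    return sum(top) / len(top) if top else 0.0


def mean_nns(n, profiles, domain_of):
    """NNS(n): average of NNS(u, n) over all users, with C(u) = domain_of(u)."""
    return sum(nns(u, n, profiles, domain_of(u)) for u in profiles) / len(profiles)


if __name__ == "__main__":
    users = {"a": {"x": 1.0}, "b": {"x": 1.0, "y": 1.0}, "c": {"y": 1.0}}
    # Exhaustive search: the candidate domain is the entire set of users.
    print(mean_nns(1, users, lambda u: users))
```

Restricting `domain_of(u)` to the co-members of a cluster can only remove candidates, so the exhaustive search gives the largest value.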
Compared to an exhaustive NN-search, using a clustering
algorithm to limit the NN-search to cluster co-members
provides a gain in scalability, but at the cost of possibly
missing some of the true nearest neighbors; we refer to
this as the scalability cost. Furthermore, applying a
privacy-preserving clustering method such as the LSH-based P3
approach brings an additional privacy cost due to the
unlinkability and indistinguishability properties of the cluster's
member profiles. To account for these cost components,
we define three indicators of clustering performance based
on different ways of selecting the nearest neighbors.
Centralized. This indicator provides an upper bound on
NNS(n) for any clustering-based approach; it uses an exhaustive
NN-search to choose the true n nearest neighbors.
Here, the domain of returned results, C(u), is the entire set
of users.
Clustering-for-scalability. This indicator makes use of
a conventional clustering approach to select the n nearest
neighbors in a scalable way, within the clusters. The
specificity of multi-clustering approaches such as LSH is
that each user is affiliated to L clusters. Here, the domain
of returned results corresponds to the union of clusters,
$C_L(u) = \bigcup_{i=1}^{L} C_i(u)$. Note that increasing L brings new
candidates to the nearest neighborhood and will therefore
improve the resulting NNS(n).
Clustering-for-privacy. This indicator applies to a
privacy-preserving clustering approach where the cluster
members are indistinguishable, e.g. clusters containing only
mashed-up profile slices. Here, the response to the NN-search
is, strictly speaking, the entire set of cluster members.
We limit, however, the computation of NNS(u, n) to those
members of C(u) whose similarity with the cluster centroid
(averaged group profile) is above the median value. The
reason is that the measurement of clustering performance
should not be penalized by outlier cluster members, which
have practically no impact on the top-N group recommendations.
Rather, we can reasonably assume that these above-median
members of the cluster approximate the group
profile well and deliver the top-N recommendations (see footnote 3).
Finally, for the multi-clustering approach we have adopted
the best-cluster method where, for each profile u, its best
cluster is selected among the L interest-group assignments
according to the highest user-to-group profile similarity.
This choice resulted from our experimental observation that
selecting only the best-cluster recommendations provided
roughly the same results as using all the clusters. The
clustering-for-privacy indicator as well as the recommendations
are then computed on the basis of the best cluster's
members, $C^{*}(u)$.
Results. Figures 6 and 7 show the impact of different
profiling strategies on the clustering performance indicators,
NNS(n = 6), applied to the ML and the DL datasets,
respectively. In addition to the LSH-based clustering, we
also evaluated these indicators for a centralized clustering
(k-means), centralized random, and random clustering
algorithms. While the centralized random approach picks n
arbitrary neighbors from the entire population, the random
clustering algorithm mimics the behavior of the multi-clustering
approach; it generates 2^k random clusters in each
of the L clustering iterations, and then selects the best
cluster for each user.
A general trend for all the strategies in both datasets is
that with an increasing number of clusters (for LSH, this
means increasing the cluster key size, k), the clustering-for-scalability
indicator decreases, while the clustering-for-privacy
indicator increases. In fact, for a given dataset, the higher
the number of clusters, the lower their average size. The
clustering-for-scalability indicator benefits from larger cluster
sizes due to the possibility of doing a linear search over a
larger number of cluster members. As the cluster size |C(u)|
decreases, the chances of some of the nearest neighbors
of u being assigned to a cluster different from C(u)
increase, which explains the performance decline. However,
this decline is fairly gradual when the clustering algorithm
succeeds in keeping most of the nearest neighbors
within the C(u) set.
Footnote 3: With this definition, the clustering-for-scalability indicator is still an
upper bound on the clustering-for-privacy indicator, given that n is chosen to be
sufficiently small, n < |C(u)|/2 for all u. In the sequel, we have used n = 6,
which guarantees this property for more than 99% of the population; the
remaining part has been removed from the clustering evaluations.
Decreasing the average cluster size has the opposite effect
on the clustering-for-privacy indicator, where no linear
search is possible within the cluster and where the group
profile is all we can compute irrespective of the target
number of nearest neighbors, n. The average similarity with
the above-median representative members is quite low when
the clusters are large. As the number of clusters increases,
the cluster members become closer to the average group
profile, and thereby NNS(n) increases and approaches the
corresponding value of the clustering-for-scalability indicator.
Another general trend for all the evaluated cases is the
impact of the number of clustering iterations, L. We draw
for each of the multi-clustering approaches (LSH-based and
random clustering) two curves, with L = 1 and L = 5.
Increasing the number of iterations increases $|C_L(u)|$, and
therefore improves the clustering-for-scalability indicator. It also
improves the clustering-for-privacy results thanks to the
best-cluster selection mechanism: a higher number of
clustering iterations allows selecting $C^{*}(u)$ among a higher
number of choices.
For the ML dataset, Figure 6 indicates a considerable
improvement in the values of performance indicators due
to the application of the tag-based profiling strategy (right side)
as compared to the item-based one (left side). This is
not surprising given the similarity distribution patterns we
have observed in Figure 4 and Figure 3 (right side).
The impact of the tag-based profiling strategies is even
more important in the DL dataset, see Figure 7, where the
item-based NNS curves (left side) are drawn in a log scale. It
is noteworthy that here the item-based strategy for the LSH-based
approach manifests extremely high performance gaps
with respect to the k-means clustering and the centralized
NN-search, both for privacy and scalability indicators.
Furthermore, in relative terms, the item-based LSH-for-privacy
indicators are much closer to the random approaches than
to their centralized clustering counterpart. The tag-based
strategy significantly changes this picture in favor of the
LSH-based approach, see Figure 7 (right side).
B. Quality of recommendations
To evaluate the usefulness of a recommender system we
measure the quality of top-N recommendations as suggested
in [15], [20]. Let $R_u(N)$ designate the top N items
recommended to a user u, and $TP_u$ be the subset of items
in the test set that the user has liked (true positives). We
considered the 5-star ratings in the ML test set as reliable
indicators of users' interests. For the DL dataset, the only
choice to identify the true positives is to consider all the
existing bookmarks in the test set. The recall and precision
are then defined as follows:

$$\mathrm{Recall}_u(N) = \frac{|R_u(N) \cap TP_u|}{|TP_u|}, \qquad \mathrm{Precision}_u(N) = \frac{|R_u(N) \cap TP_u|}{|R_u(N)|}$$

Figure 6. Quality of clustering (ML): the cost of the privacy-preserving approach and the impact of clustering iterations.
Figure 7. Quality of clustering (DL): the cost of the privacy-preserving approach and the impact of clustering iterations.
As noticed by [20], the Top-N recall defined above is
proportional to the precision, i.e. the recommender system
with larger recall also has larger precision on fixed data
and fixed N. For that reason, we build our evaluations
only on the recall measure. The global performance of
the recommender system is evaluated by the average of
$\mathrm{Recall}_u(N)$ over all tested users, Recall(N). Instead of
explicitly referring to a ranking position, N can also be
normalized within the interval [0, 1] and will represent in
that case a fraction of top items relative to their total number.
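
The two measures and the global Recall(N) can be sketched as below. The function names and data layout are hypothetical, and skipping users without any true positive in the test set is an assumption of this sketch (their recall is undefined).

```python
def recall_at_n(recommended, liked, n):
    """Fraction of the user's liked test items found in the top-N list."""
    top = set(recommended[:n])
    return len(top & liked) / len(liked) if liked else 0.0


def precision_at_n(recommended, liked, n):
    """Fraction of the top-N list that the user actually liked."""
    top = recommended[:n]
    return len(set(top) & liked) / len(top) if top else 0.0


def mean_recall(recs_by_user, liked_by_user, n):
    """Recall(N): average of Recall_u(N) over users with at least one true positive."""
    users = [u for u, liked in liked_by_user.items() if liked]
    if not users:
        return 0.0
    total = sum(recall_at_n(recs_by_user.get(u, []), liked_by_user[u], n) for u in users)
    return total / len(users)


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d"]
    liked = {"b", "d", "z"}
    print(recall_at_n(ranked, liked, 2), precision_at_n(ranked, liked, 2))
```

Both measures share the same numerator, so for a fixed user and a full list of N items, precision equals recall multiplied by |TP_u| / N, which is the proportionality noted above.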
For each dataset, we evaluate the top-100 recall obtained
by the LSH-based P3 algorithm congured with k = 7
and L = 5 (P3 7x5) and subject to different proling
strategies. We compare the achieved recall curves with
other recommendation approaches: 1/ a classical Collab-
orative Filtering algorithm (NNS-CF) based on a user-
neighbourhood model [21] with cosine similarity measure,
2/ Best Seller (BS) list where the ranking of items is built
according to their average rating over all users , and 3/ a
Content-based (CB) recommender system based on user-to-
368
Figure 8. ML: Top-100 (1%) Recall.
Figure 9. DL: Top-100 (0.003%) Recall.
item semantic matching that exploits tag-based user profiles
and item descriptions. In addition, we evaluate the
performance of a hypothetical recommender system using the
random multi-clustering and centralized random clustering
algorithms described in Section IV-A.
Results. For the ML dataset (Figure 8), the tag-based
strategy shows an improvement over the item-based one.
Both provide a much better recall than
the best seller list, as well as the content-based and random
approaches. Although the performance of the LSH-based
privacy-preserving approach is inferior to the centralized
NNS-CF, this can be interpreted as the price to pay for
the privacy achieved with the proposed approach, given the
selected LSH configuration parameters.
For the DL dataset (Figure 9), as expected, the tag-based
strategy improves over the item-based one. The extremely
poor clustering performance in the item-based space explains
why the recommendation quality with this profiling strategy
is barely distinguishable from the best seller list and is
only slightly better than the random clustering.
With regard to the upper bound, here we observe a larger
gap between the best LSH-based strategy and the classical
NNS-CF. Moreover, this gap is not overcome even when
using a centralized k-means clustering approach applied in
both privacy-preserving settings (i.e. using the average group
profile for recommendations) and scalability settings (i.e.
using the individual pairwise similarities to weight each
cluster member's contribution to the recommendations). The
figure also shows the recall curve of the centralized NNS-CF
evaluated in privacy settings, which indicates an upper
bound for a privacy-preserving clustering approach. The
extremely sparse nature of the DL dataset and its very skewed
item popularity curve (Figure 5), where more than 80% of
items have a single occurrence in the train set, can explain
the difficulty for the clustering-based recommendations to
approach the classical CF, a difficulty which is, however,
not captured by the clustering performance indicators.
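The difference between the privacy and scalability settings mentioned above comes down to how the profiles of cluster members are combined. The following is a rough sketch under stated assumptions, not the authors' implementation: profiles are represented as hypothetical item-to-rating dictionaries, missing ratings count as zero, and the helper names `cosine`, `group_scores` and `top_n` are invented for this illustration.

```python
import math


def cosine(p, q):
    """Cosine similarity between two sparse profiles (item -> rating)."""
    dot = sum(v * q.get(item, 0.0) for item, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def group_scores(members, target=None):
    """Score items from the profiles of one cluster.

    target is None : privacy setting, a plain average group profile.
    target given   : scalability setting, each member's contribution is
                     weighted by its cosine similarity to the target user.
    """
    totals, weight_sum = {}, 0.0
    for profile in members:
        weight = 1.0 if target is None else cosine(target, profile)
        weight_sum += weight
        for item, rating in profile.items():
            totals[item] = totals.get(item, 0.0) + weight * rating
    if weight_sum == 0:
        return {}
    return {item: total / weight_sum for item, total in totals.items()}


def top_n(scores, n, exclude=()):
    """Return the n best-scored items, skipping those in `exclude`."""
    candidates = [item for item in scores if item not in exclude]
    return sorted(candidates, key=lambda item: (-scores[item], item))[:n]
```

In the privacy setting the aggregator never sees a linkable individual profile, so only the unweighted branch is available; the weighted branch requires access to the target user's full profile.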
Finally, note that for the DL dataset the Top-100 recom-
mendations correspond only to 0.003% of items, as opposed
to 1% in ML, which explains why we observe overall low
absolute values of recall in Figure 9.
C. Cluster size distribution
This section presents a discussion on cluster sizes, which
impact the user indistinguishability property within interest
groups. For instance, if the group has very few members,
then despite the profile slice unlinkability the aggregator
node (acting as an attacker) can rebuild a relatively small
number of possible slice combinations and infer the original
profiles with some likelihood. Increasing the cluster size
exponentially increases the number of possible slice
combinations and reduces the likelihood of each of them.
Although the cluster sizes cannot be controlled
deterministically, they depend on several factors: the total
number of clusters, the population size, the properties of the
profile space, etc. In the LSH-based clustering approach,
increasing the cluster key size, k, increases the total number
of clusters in each iteration (up to $2^k$) and results in a smaller
average cluster size.
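The effect of the key size k on the number of clusters can be sketched as follows. This is an illustration under assumptions: it uses a sign-of-random-projection (random hyperplane) LSH in the style of [5], dense profile vectors, and hypothetical helper names; the configuration used in the paper's experiments may differ.

```python
import random


def make_hyperplanes(k, dim, seed=0):
    """Draw k random hyperplanes (Gaussian normal vectors) of size dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]


def lsh_key(profile, hyperplanes):
    """Build a k-bit cluster key: one sign bit per hyperplane."""
    key = 0
    for plane in hyperplanes:
        dot = sum(w * x for w, x in zip(plane, profile))
        key = (key << 1) | (1 if dot >= 0 else 0)
    return key


def cluster_sizes(profiles, k, seed=0):
    """Map each occupied cluster key to its number of members."""
    planes = make_hyperplanes(k, len(profiles[0]), seed)
    sizes = {}
    for profile in profiles:
        key = lsh_key(profile, planes)
        sizes[key] = sizes.get(key, 0) + 1
    return sizes
```

A k-bit key can take at most 2^k values, so the average size of an occupied cluster, `len(profiles) / len(cluster_sizes(profiles, k))`, cannot grow when k is increased with the same seed, since longer keys only split existing clusters.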
We define an empirical probability $p(\omega)$ of a random
user being affiliated, in a single clustering iteration,
to a cluster with a cardinality superior to $\omega$:
$p(\omega) = \mathrm{card}(\{u \in U : |C(u)| \geq \omega\}) / \mathrm{card}(U)$.
Figure 10 estimates
this cluster size measure for different profiling strategies
when the cluster key size is fixed to k = 7. In the DL dataset,
the probability, $p(\omega = 100)$, of a user being classified into a
cluster with more than 100 members is almost 1 for the item-based
strategy and about 0.968 for the tag-based strategy.
Similar values are obtained for the ML dataset.
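The cluster size measure defined above is simple to compute from a single clustering iteration. The sketch below assumes a hypothetical `assignments` dictionary mapping each user to a cluster key, and reads "cardinality superior to" the threshold as "at least" the threshold.

```python
from collections import Counter


def cluster_size_probability(assignments, threshold):
    """Fraction of users whose cluster has at least `threshold` members.

    assignments: dict mapping user id -> cluster key (one iteration).
    """
    if not assignments:
        return 0.0
    sizes = Counter(assignments.values())   # cluster key -> member count
    in_large = sum(1 for key in assignments.values()
                   if sizes[key] >= threshold)
    return in_large / len(assignments)
```

Evaluating the function over a range of thresholds gives a curve of the kind plotted in Figure 10.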
With multiple clustering iterations, the probability of
being classified into at least one smaller cluster increases
as $1 - p(\omega)^L$. Although in both datasets we see a slightly
smaller $p(\omega)$ for the tag-based strategies, with
L = 5 we still obtain a reasonably low chance of getting
Figure 10. Empirical probability of a user being assigned to a cluster with
cardinality superior to a given value on x-axis.
into a cluster with fewer than 100 members: 0.13 for the DL
and 0.12 for the ML tag-based strategies.
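The growth of this probability with the number of iterations can be checked with a one-line computation. The sketch assumes that the L clustering iterations are independent, so that the chance of landing in a large cluster every time is the single-iteration probability raised to the power L; the function name is hypothetical.

```python
def small_cluster_probability(p_large, iterations):
    """Chance of falling into at least one small cluster over L iterations.

    p_large:    single-iteration probability of joining a large cluster.
    iterations: number of independent clustering iterations (L).
    """
    return 1.0 - p_large ** iterations
```

With the rounded single-iteration value of about 0.968 quoted above and L = 5, this gives roughly 0.15, which is of the same order as the empirical values reported in the text; an exact match is not expected because the quoted probability is itself approximate.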
V. RELATED WORK
This section compares our work to existing approaches
with respect to the use of local profiling, the privacy-preserving
properties, and the scalability.
Local profiling has been used to categorize users into audience
segments based on observed browsing or content consumption
history [13], [22]. The general principle consists
in broadcasting ads/content to each user device, together
with a specification of the intended audience segments. Then,
the local content filtering algorithm displays only items
corresponding to the user's segment. Such systems are restricted
to providing only content-based recommendations as
they do not suggest any way to discover the top-rated items
of similar users. We showed that the P3-based CF approach
performs much better than a content-based approach even
when the latter has access to the full content catalog without
considering any resource limitations on the local device.
Some approaches to preserve privacy in CF systems
consist in computing local aggregations of individual profiles
for/by a community of users. By using homomorphic
encryption, the computed aggregate can be made publicly
available without exposing individual user profiles [4]. Another
solution [18] is to use randomized perturbation to
disguise the real users' consumption data before sending it
to a server. However, this approach does not maintain the
sparsity patterns in user profiles and requires uploading
a complete, full-dimensional, perturbed user profile, which
can lead to scalability and network overload issues in high
dimensional spaces. Other approaches combine obfuscation
of a part of the user profile with the distribution of profiles among
multiple repositories to offer a privacy-preserving CF [2]. To
maintain the accuracy of similarity computations between
users, the profile of one of them (the one requesting
recommendations) is not obfuscated, which can lead to a privacy
leak. In another approach, the Gossple [3] system enables end-users
to be associated with a network of anonymous like-minded
acquaintances, allowing applications like personalized
query-expansion. To achieve anonymization, each node
is associated with a proxy gossiping profile information on
its behalf. The end-user's identity is hidden from the proxies
using encryption. Although the profiles are pseudonymized,
their content is exchanged between users to allow computing
their similarities. This is open to linkability attacks
which make it possible to infer the user's identity from
the overall set of his consumed items. In contrast with
these approaches, our LSH-based P3 recommender system
relies on a lightweight mechanism of local interest-group
assignment, and then it uses profile slicing and anonymous
communication to achieve the unlinkability property while
uploading the relevant profile items without encryption.
LSH has already been used to develop scalable recommendation
systems, like Yoda [7] and the Google-News personalization
system [9]. Yoda uses a hybrid approach combining
collaborative filtering and content-based querying to
provide real-time recommendations. The authors introduced
the FLSH filtering mechanism (an extension of LSH) which
performs item clustering to reduce the potential solution
space when responding to recommendation queries. In
Yoda, LSH was used neither to cluster users nor to ensure
their privacy. Finally, the Google-News personalization
system [9] investigated how scalability can be achieved
in a recommender system that needs to deal with frequent
item churn and a large set of users. Their first technique
to get scalability is the use of model-based algorithms
for collaborative filtering wherein the user is mapped to
one (or more) cluster(s) of like-minded users. They used
the MinHash [6] algorithm (an LSH scheme for set-similarity)
and PLSI (Probabilistic Latent Semantic Indexing). Their
second technique to achieve scalability consists in using the
MapReduce paradigm. In contrast to the P3 approach, they
use a centralized system where user profiles are built and
exploited at the server side, which assumes access to all the
users' consumption history without any privacy protection.
VI. CONCLUSION
The Privacy-Preserving Personalization (P3) approach
builds on the idea of using Locality Sensitive Hashing
techniques combined with a distributed infrastructure of
aggregation nodes. LSH provides the possibility of computing
cluster membership in a distributed way, avoiding
the need for pair-wise profile comparisons in centralized
collaborative filtering approaches. This gives a solution for
a privacy-preserving recommender system, but brings
additional challenges for maintaining a good recommendation
quality, close to that of the centralized approach.
In this paper, we have analyzed the consequences of using
LSH for privacy as opposed to the conventional use for scalability.
Namely, when using LSH for privacy, the intentional
unlinkability of profile segments prevents selecting the
nearest neighbors within the formed clusters, and to offer
CF recommendations one can only use an average group
profile. A good clustering performance is therefore necessary
to ensure the quality of delivered recommendations. We
then described how changing the nature of LSH inputs can
improve the performance of the system. We analyzed item-based
and tag-based profiling strategies and experimentally
explored their impact on the clustering performance and
the quality of recommendations in two large datasets. The
semantic tag-based strategy results in a clear improvement
in the clustering performance expressed in terms of
performance indicators which take into account the respective
costs of scalability and privacy introduced by the clustering
approach. The clustering performance indicators, however,
do not completely explain the gap we observe in the highly
sparse dataset of Delicious bookmarks between the quality
of the best profiling strategy and a centralized collaborative
filtering system. This indicates the need for additional
investigations in order to extend these indicators and cover other
factors impacting the quality of recommendations, e.g. the
properties of the cluster size distribution, the diversity of items
in the group profiles, etc.
Finally, another direction of our future work will target
the automatic fine-tuning of LSH parameters in order to
optimize the quality of recommendations while preserving
measurable privacy guarantees with regard to the cluster size
distribution.
REFERENCES
[1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for
approximate nearest neighbor in high dimensions. Commun.
ACM, 51(1), 2008.
[2] S. Berkovsky, Y. Eytani, T. Kuflik, and F. Ricci. Enhancing
privacy and preserving accuracy of distributed collaborative
filtering. In Proc. of ACM Recommender Systems, 2007.
[3] M. Bertier, D. Frey, R. Guerraoui, A. Kermarrec, and
V. Leroy. The gossple anonymous social network. In Proc.
of ACM/IFIP/USENIX Int. Middleware Conference, 2010.
[4] J. Canny. Collaborative filtering with privacy. In Proc. of
IEEE Symposium on Security and Privacy, 2002.
[5] M. Charikar. Similarity estimation techniques from rounding
algorithms. In Proc. of ACM Symposium on Theory of
Computing, 2002.
[6] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Mot-
wani, J. Ullman, and C. Yang. Finding interesting associations
without support pruning. In Proc. of Int. Conf. on Data
Engineering, 2000.
[7] C. Shahabi, F. Banaei-Kashani, Y.-S. Chen, and
D. McLeod. Yoda: An accurate and scalable web-based
recommendation system. In Proceedings of the Sixth
International Conference on Cooperative Information Systems.
Springer-Verlag, Sept. 2001.
[8] F. Dabek, B. Zhao, P. Druschel, J. Kubiatowicz, and I. Stoica.
Towards a common API for structured P2P overlays. In Proc.
of Int. Workshop on P2P Systems, 2003.
[9] A. Das, M. Datar, A. Garg, and S. Rajaram. Google news
personalization: Scalable online collaborative filtering. In
Proc. of Int. World Wide Web Conference, 2007.
[10] R. Dingledine, N. Mathewson, and P. Syverson. TOR: The
second generation onion router. In Proc. of USENIX Security
Symposium, 2004.
[11] M. Fredrikson and B. Livshits. RePriv: Re-imagining content
personalization and in-browser privacy. In Proc. of IEEE
Symposium on Security and Privacy, 2011.
[12] A. Gionis, P. Indyk, and R. Motwani. Similarity search in
high dimensions via hashing. In Proc. of Int. Conference on
Very Large Data Bases, 1999.
[13] S. Guha, B. Cheng, and P. Francis. Privad: Practical privacy
in online advertising. In Proc. of USENIX Symposium on
Networked Systems Design and Implementation, 2011.
[14] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When
is nearest neighbor meaningful? In Proc. 7th International
Conference on Database Theory (ICDT'99), LNCS 1540: 217-235, 1999.
[15] Y. Koren. Factorization meets the neighborhood: a
multifaceted collaborative filtering model. In Proc. of the 14th
ACM SIGKDD int. conf. on Knowledge discovery and data
mining, pages 426-434, 2008.
[16] A. Nandi, A. Aghasaryan, and M. Bouzid. P3: A privacy
preserving personalization middleware for recommendation
based services. In Proc. of 4th Hot Topics in Privacy
Enhancing Technologies Symposium, 2011.
[17] A. Pfitzmann and M. Köhntopp. Anonymity, unobservability,
and pseudonymity - a proposal for terminology. In Int. workshop
on Designing privacy enhancing technologies, 2001.
[18] H. Polat and W. Du. SVD-based collaborative filtering with
privacy. In Proc. of ACM Symp. on Applied Computing, 2005.
[19] Publish subscribe.
http://en.wikipedia.org/wiki/Publish-subscribe_pattern.
[20] H. Steck. Training and testing of recommender systems on
data missing not at random. In Proc. KDD'10, 2010.
[21] X. Su and T. M. Khoshgoftaar. A survey of collaborative
filtering techniques. Advances in Artificial Intelligence,
Vol. 2009, 2009.
[22] V. Toubiana, A. Narayanan, D. Boneh, H. Nissenbaum, and
S. Barocas. Adnostic: Privacy preserving targeted advertis-
ing. In Proc. of Network and Distributed Systems Security
Symposium, 2010.
[23] R. Wetzker, C. Zimmermann, and C. Bauckhage. Analyzing
social bookmarking systems: A del.icio.us endeavour. In
International Journal of Data Warehousing and Mining, Vol. 6,
Number 1, 2010.