On The Use of LSH For Privacy Preserving Personalization
Armen Aghasaryan, Makram Bouzid, Dimitre Kostadinov, Mohit Kothari, and Animesh Nandi
Bell Labs Research, Alcatel-Lucent
Email: {name.surname}@alcatel-lucent.com
Abstract—The Locality Sensitive Hashing (LSH) technique
of scalably finding nearest-neighbors can be adapted to enable
discovering similar users while preserving their privacy. The
key idea is to compute the user profile on the end-user device,
apply LSH on the local profile, and use the LSH cluster
identifier as the interest group identifier of a user. By properties
of LSH, the interest group comprises other users with similar
interests. The collective behavior of the members of the interest
group is anonymously collected at some aggregation node to
generate recommendations for the group members.
The quality of recommendation depends on the efficiency of
the LSH clustering algorithm, i.e. its capability of gathering
similar users. In contrast with conventional usage of LSH (for
scalability and not privacy), in our framework one cannot
perform a linear search over the cluster members to identify
the nearest neighbors and to prune away false positives. A
good clustering quality is therefore of functional importance for
our system. We report in this work how changing the nature
of LSH inputs, which in our case corresponds to the user
profile representations, impacts the performance of LSH-based
clustering and the final quality of recommendations. We present
extensive performance evaluations of the LSH-based privacy-
preserving recommender system using two large datasets of
MovieLens ratings and Delicious bookmarks, respectively.
Keywords—Privacy, Personalization, LSH
I. INTRODUCTION
A broad class of applications such as StumbleUpon or
iGoogle URL recommendations, Netflix movie recommen-
dations, or IPTV content recommender systems face users
with the dilemma of either disclosing their sensitive profile
information to benefit from personalized services or con-
cealing their data to preserve their privacy. Today, users have
no option but to trust the service provider with their profile
data in return for the personalized services.
One approach to preserve privacy is to build a user profile
locally on the end-user device, communicate the meta-level
semantic categories to the service provider, and then apply
fine-grained content-based local filtering of received recom-
mendations. Variants of this approach exist, but they support
only content-based recommendations, and not collaborative
filtering [11], [13], [22]. To benefit from the collaborative-
filtering (CF) approach, i.e. recommendations derived from
like-minded users, the user profile must be exchanged with
the content provider, or with other users in a peer-to-peer
fashion, to be able to identify those similar users and to compute
the top-rated items of the like-minded community. Doing this
in a privacy-preserving way is a challenging problem, and
existing solutions either rely on heavyweight computations
employing cryptographic operations (e.g. [4]), or rely on
perturbations of users' profiles (e.g. [2], [18]) which may
degrade the recommendation quality.
In our previous work [16] we proposed the basic design
principles of a distributed privacy-preserving personalization
(P3) approach where the user profile is built locally, and then
used in a lightweight, practical, and yet privacy-preserving
way to get both content-based and collaborative-filtering
recommendations. The P3 approach relies on two main
concepts (Figure 1). First, it adapts the Locality-Sensitive
Hashing (LSH) technique of doing scalable nearest-neighbor
search [5], [12], in a way to discover similar users while
preserving their privacy. In fact, the LSH approach allows
end-users with similar interests to get assigned to the same
interest group (cluster) with high probability, without re-
quiring the local profiles to be revealed to any central or
trusted entity. Second, the P3 approach relies on a distributed
pool of non-colluding nodes and anonymous communication
channels. Each node acts on behalf of an interest group
and is in charge of anonymously aggregating the group's
behavior, generating recommendations, and anonymously
delivering them to group members.
The quality of recommendations in the P3 approach
depends on the efficiency of the LSH clustering algorithm,
i.e. its capability of gathering similar users within interest
groups. Although LSH is a well studied algorithm in the
literature, the conventional approaches for improving its
performance cannot be applied in this privacy-preserving
framework. Namely, this framework does not permit a
linear search over cluster members to identify the nearest
neighbors; here, for the sake of privacy protection, individual
user profiles are indistinguishable within interest groups. We
then apply a transformation of the original input space of
user profiles and show that it can improve the clustering per-
formance, especially in the case of an extremely sparse dataset.
The main contributions of this paper are the following.
- We analyze the consequences of using LSH for privacy
as opposed to the conventional use for scalability. LSH
provides the possibility of computing cluster mem-
bership locally, avoiding the pair-wise profile comparisons
needed in centralized CF approaches. This allows build-
ing a privacy-preserving recommender system, but it
brings in additional challenges for maintaining good
recommendation quality, close to that of conventional CF.
2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications
978-0-7695-5022-0/13 $26.00 2013 IEEE
DOI 10.1109/TrustCom.2013.46
Figure 1. Key idea of P3
- We describe in this work how changing the nature
of LSH inputs, which in our case corresponds to the
user profile representations, impacts the performance of
clustering. We analyze item-based and tag-based pro-
filing strategies and their consequences on the quality
of recommendations.
The rest of the paper is structured as follows. Sec-
tion II describes the distributed LSH-based approach for
privacy-preserving personalization and summarizes the main
building blocks of P3. Then, Section III introduces the
profiling strategies applied to our experimental datasets, and
Section IV presents evaluations of clustering properties and
the recommendation quality. We contrast our approach with
prior works in Section V, and conclude in Section VI.
II. THE P3 APPROACH
This section presents the main building blocks of the
P3 architecture, divided into three layers: 1/ local clients
responsible for local profiling, interest-group identification,
and local recommendation filtering, 2/ aggregator nodes
responsible for group profiling and recommendations, and
3/ a communication infrastructure providing anonymous chan-
nels between local clients and aggregator nodes. We then
outline the privacy-preserving properties of the P3 approach.
A. P3 Local Client Layer
Profiling and Interest Group Assignment. This compo-
nent runs on the end-user's device (e.g. PC, smart-phone,
or set-top-box) and carries out two main functions: user
profiling and assignment to interest groups. The first is
achieved by collecting and analyzing local traces of user
activities, and building a consolidated user profile. This can
be done in an application-specific way, e.g. see [13], [22]
for targeted-ad applications on URL browsers. However,
independently from the applied approach, we assume that the
resulting profile is represented as a list of <item, value>
pairs, where items can refer to consumed item references
or their semantic characteristics (e.g. tags, categories), and
values are the corresponding interest degrees, such as an
explicit user-given rating or a derived implicit interest value.
The user assignment to interest groups is achieved by
locally computing an LSH code for each profile. An LSH
scheme is defined as a probability distribution over a family
H of hash functions such that two similar objects, x and y,
hash to the same value with high probability,

Pr_{h ∈ H}[h(x) = h(y)] = sim(x, y)    (1)
Such schemes are used to assemble candidate nearest neigh-
bors in hash buckets (or clusters), in order to solve ap-
proximate nearest-neighbor problems in high-dimensional
spaces [1]. The output of several randomly selected hash
functions is concatenated in order to amplify the discrimi-
nating power of hash buckets: similar points are expected to
collide on all hash values (reducing false positives). Then,
the sequence of hash values can be used as the label (or key)
of the cluster to which the hashed objects belong. The cluster
key size, k, corresponds to the number of hash functions used
and defines a configuration parameter of the algorithm.
Furthermore, in order to reduce false negatives, one can
compute multiple LSH sequences and subscribe each user
to a number of clusters. This defines another parameter of
the algorithm, the number of clustering iterations, L.
Depending on the chosen similarity metric (or the cor-
responding distance dist(x, y) = 1 − sim(x, y)), an appro-
priate LSH scheme should be found; e.g. Charikar offers a
solution to implement LSH for cosine similarity [5], while
the MinHash algorithm [6] is known for Jaccard similarity.
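For binary profiles (plain item sets), the MinHash alternative can be sketched as follows; this is an illustrative toy implementation with our own parameter choices (function name, hash family, fixed prime), not code from the paper:

```python
import random

def minhash_key(item_set, k, seed=0):
    """One MinHash cluster key for a binary profile (a set of item ids).
    For a single hash function, Pr[min-hash values agree] equals the
    Jaccard similarity of the two sets; concatenating k values sharpens
    the discrimination, as with the cosine variant."""
    rng = random.Random(seed)   # a shared seed yields a shared hash family
    prime = (1 << 31) - 1       # Mersenne prime for the affine hash family
    key = []
    for _ in range(k):
        a = rng.randrange(1, prime)
        b = rng.randrange(0, prime)
        # Min over the set of a random affine hash of each (integer) item.
        key.append(min((a * item + b) % prime for item in item_set))
    return tuple(key)
```

Since the key depends only on the set and the shared seed, identical sets always fall into the same bucket, which is the property the cluster assignment relies on.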
As in the general case the profile representation needs to
support non-binary interest values (e.g. ratings on movies),
our approach relies on the LSH variant for cosine similarity
based on random hyperplanes, see Figure 2. The idea is to
generate k random profile vectors as pivots, shared among
the users, and to construct a signature of the user profile
based on its discretized similarity with these pivots; the
signature is obtained by concatenating the k sign bits of dot
products between the user profile and each of the random
vectors. The users with identical signatures belong to the
same cluster (with the cluster key defined by the signature
of its members). This process is repeated L times and each
user is assigned to L clusters.
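A minimal sketch of this signature computation, in our own illustrative code (the paper's actual profile encoding and random-vector generation may differ):

```python
import random

def lsh_signature(profile, dim, k, seed):
    """One clustering iteration: the cluster key is the k sign bits of
    dot products between the profile and k shared random hyperplanes
    (the random-hyperplane LSH scheme for cosine similarity [5]).
    `profile` is a sparse vector {dimension_index: interest_value}.
    Every client derives identical hyperplanes from the shared seed,
    so no profile ever leaves the device."""
    rng = random.Random(seed)
    bits = []
    for _ in range(k):
        hyperplane = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        dot = sum(value * hyperplane[i] for i, value in profile.items())
        bits.append('1' if dot >= 0 else '0')
    return ''.join(bits)

def interest_groups(profile, dim, k, L, base_seed=0):
    """Repeat the signature L times (one seed per iteration), assigning
    the user to L clusters to reduce false negatives."""
    return [lsh_signature(profile, dim, k, base_seed + i) for i in range(L)]
```

Profiles pointing in the same direction (cosine similarity 1) always receive the same key, since positive scaling never flips the sign of a dot product, while dissimilar profiles are spread across up to 2^k clusters per iteration.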
It is easy to see that for a given dataset using a higher k
value results in a larger number of clusters in each iteration
(bounded by 2^k), with smaller average cluster size and
higher intra-cluster similarity (reducing false positives).
Note that this procedure requires sharing only random
seeds among the end-user clients, while all the profiles are
processed locally without compromising the user privacy.
Figure 2. Using LSH for cosine similarity; one clustering iteration with
cluster key size k = 5 explained.

Profile slicing. We have shown above that cluster as-
signments can be done while keeping user profiles local.
Nevertheless, to get CF recommendations (i.e. items from
similar users) one still needs to somehow share the pro-
file data. Here, we assume that no Personally Identifiable
Information (PII) is available in profiles, e.g. the client
employs some mechanism to filter out the URLs which act
themselves as PII. Then, our approach consists in slicing
each user profile into smaller segments (slices), composed
of one or several <item, value> pairs, and associating them
with the cluster key computed from the entire profile. Now,
the slices of a given profile can be uploaded anonymously
and independently to an aggregator node so that the latter
cannot pick them up and observe entire user profiles within
the mass of its interest-group members. This unlinkability
property [17] prevents the aggregator node from intelligently
inferring user identities and/or their sensitive profile data.
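The slicing step might look like the following sketch; the helper name, the record layout, and the default slice size are our own assumptions, as the paper does not fix a wire format:

```python
def slice_profile(profile_pairs, cluster_key, slice_size=2):
    """Cut a profile (list of <item, value> pairs) into small slices and
    tag each slice with only the cluster key, so that slices uploaded
    anonymously and independently cannot be relinked into one profile."""
    return [
        {"cluster": cluster_key, "pairs": profile_pairs[i:i + slice_size]}
        for i in range(0, len(profile_pairs), slice_size)
    ]
```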
Local recommendations selection. At the final stage of the
personalization process, each client retrieves recommenda-
tions from its L interest-groups, and merges them into a
single list where each item rating is computed as a weighted
average of the respective ratings in the received lists. Here, the
weights correspond to the user-to-group profile similarities.
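The merge described above can be sketched as follows (illustrative only; the item identifiers and similarity weights are placeholders):

```python
def merge_recommendations(group_lists, weights):
    """Merge the recommendation lists received from the L interest
    groups into a single list; each item's rating is the weighted
    average of its group ratings, weighted by the user-to-group
    profile similarity of the group that recommended it."""
    num, den = {}, {}
    for recs, w in zip(group_lists, weights):
        for item, rating in recs.items():
            num[item] = num.get(item, 0.0) + w * rating
            den[item] = den.get(item, 0.0) + w
    return {item: num[item] / den[item] for item in num}
```

For example, an item rated 2.0 by a group with weight 3.0 and 4.0 by a group with weight 1.0 ends up with rating (3.0·2.0 + 1.0·4.0)/4.0 = 2.5.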
Alternatively, one can consider only recommendations is-
sued from the user's top clusters within C_L(u) = ∪_{i=1}^{L} C_i(u).
Note that increasing L brings new
candidates to the nearest neighborhood and will therefore
improve the resulting NNS(n).
Clustering-for-privacy. This indicator applies to a
privacy-preserving clustering approach where the cluster
members are indistinguishable, e.g. clusters containing only
mashed-up profile slices. Here, the response to the NN-
search is, strictly speaking, the entire set of cluster members.
We limit, however, the computation of NNS(u, n) to those
members of C(u) whose similarity with the cluster centroid
(averaged group profile) is above the median value. The
reason is that the measurement of clustering performance
should not be penalized by outlier cluster members which
have practically no impact on the top-N group recommen-
dations. Rather, we can reasonably assume that these above-
median members of the cluster well approximate the group
profile and deliver the top-N recommendations³.
Finally, for the multi-clustering approach we have adopted
the best-cluster method where for each profile u its best
cluster is selected, among the L interest-group assignments,
according to the highest user-to-group profile similarity.
This choice resulted from our experimental observation that
selecting only the best-cluster recommendations provided
roughly the same results as using all the clusters. The
clustering-for-privacy indicator as well as the recommen-
dations are then computed on the basis of the best cluster's
members, C*(u).
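The best-cluster selection can be sketched as follows; this is our own illustration, assuming cosine similarity between the user profile and each group centroid (the paper does not spell out the computation):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse profiles {item: value}."""
    dot = sum(value * v.get(item, 0.0) for item, value in u.items())
    nu = math.sqrt(sum(value * value for value in u.values()))
    nv = math.sqrt(sum(value * value for value in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def best_cluster(user_profile, centroids):
    """Among the L interest-group assignments, keep the cluster whose
    centroid (averaged group profile) is most similar to the user."""
    return max(centroids, key=lambda key: cosine(user_profile, centroids[key]))
```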
Results. Figures 6 and 7 show the impact of the different
profiling strategies on the clustering performance indicators,
NNS(n = 6), applied to the ML and the DL datasets,
respectively. In addition to the LSH-based clustering, we
evaluated these indicators also for a centralized clustering
(k-means), centralized random, and random clustering al-
gorithms. While the centralized random approach picks n
arbitrary neighbors from the entire population, the random
clustering algorithm mimics the behavior of the multi-
clustering approach; it generates 2^k random clusters in each
of the L clustering iterations, and then selects the best
cluster for each user.
A general trend for all the strategies in both datasets is
that with an increasing number of clusters (for LSH, this
means increasing the cluster key size, k), the clustering-
for-scalability indicator decreases, while the clustering-for-
privacy one increases. In fact, for a given dataset the higher
the number of clusters, the lower their average size. The
clustering-for-scalability indicator benefits from larger clus-
ter sizes due to the possibility of doing a linear search over a
larger number of cluster members. As the cluster size |C(u)|
decreases, the chances of some of the nearest neighbors
of u getting assigned to a cluster different from C(u)
increase, which explains the performance decline. However,
this decline is fairly gradual when the clustering algorithm
succeeds in maintaining most of the nearest neighbors
within the C(u) set.
³ With this definition, the clustering-for-scalability indicator is still an
upper bound on the clustering-for-privacy one, given that n is chosen to be
sufficiently small, n < |C(u)|/2, ∀u. In the sequel, we have used n = 6,
which guarantees this property for more than 99% of the population; the
remaining part has been removed from the clustering evaluations.
Decreasing the average cluster size has an opposite ef-
fect on the clustering-for-privacy indicator, where no linear
search is possible within the cluster and where the group
profile is all we can compute, irrespective of the target
number of nearest neighbors, n. The average similarity with
the above-median representative members is quite low when
the clusters are large. As the number of clusters increases,
the cluster members become closer to the average group
profile, and thereby NNS(n) increases and approaches the
corresponding value of the clustering-for-scalability indicator.
Another general trend for all the evaluated cases is the
impact of the number of clustering iterations, L. We draw
for each of the multi-clustering approaches (LSH-based and
random clustering) two curves, with L = 1 and L = 5.
Increasing the number of iterations increases |C_L(u)|, and
therefore improves the clustering-for-scalability. It also improves
the clustering-for-privacy results thanks to the best-
cluster selection mechanism. In fact, a higher number of
clustering iterations allows selecting C