2. Mining Massive DataSets
Finding Similar Items
Outline
▪ Introduction
▪ Shingling
▪ Minhashing
▪ Locality-Sensitive Hashing
Finding Similar Items Introduction
Goals
▪ Many Web-mining problems can be expressed
as finding “similar” sets:
1. Pages with similar words, e.g., for classification by
topic.
2. Netflix users with similar tastes in movies, for
recommendation systems.
3. Dual: movies with similar sets of fans.
4. Images of related things.
Example Problem:
Comparing Documents
▪ Goal: find pairs of documents with a large amount of common text.
The Big Picture

Document → Shingling → Minhashing → Locality-Sensitive Hashing → Candidate pairs

▪ Shingling produces the set of strings of length k that appear in the document.
▪ Minhashing produces signatures: short integer vectors that represent the sets, and reflect their similarity.
▪ Locality-Sensitive Hashing produces candidate pairs: those pairs of signatures that we need to test for similarity.
Finding Similar Items Shingling
Shingles
▪ A k-shingle (or k-gram) for a document is a
sequence of k characters that appears in the
document.
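As a minimal sketch (the function name `shingles` is illustrative, not from the slides), character-level k-shingles can be extracted with a set comprehension:

```python
def shingles(doc: str, k: int) -> set[str]:
    """Set of all k-character substrings (k-shingles) of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

# "abcab" with k = 2 has shingles ab, bc, ca (the repeated "ab" counts once).
print(shingles("abcab", 2))
```

Note that the result is a set: a shingle that occurs many times in the document appears only once.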
Working Assumption
▪ Documents that have lots of shingles in common
have similar text, even if the text appears in different
order.
Finding Similar Items Minhashing
Jaccard Similarity
▪ The Jaccard similarity of two sets is the size of their intersection divided by the size of their union.
▪ Example: 3 elements in the intersection, 8 in the union ⇒ Jaccard similarity = 3/8.
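The computation is a one-liner over Python sets. This sketch (names are illustrative) uses two sets constructed so the intersection has 3 elements and the union has 8, matching the example:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of intersection / size of union."""
    return len(a & b) / len(a | b)

s1 = {1, 2, 3, 4, 5}
s2 = {3, 4, 5, 6, 7, 8}
print(jaccard(s1, s2))  # 3 in intersection, 8 in union -> 0.375 = 3/8
```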
Aside
▪ We might not really represent the data by a
Boolean matrix.
▪ Sparse matrices are usually better represented by
the list of places where there is a non-zero value.
▪ But the matrix picture is conceptually useful.
Warnings
1. Comparing all pairs of signatures may take too
much time, even if not too much space.
▪ A job for Locality-Sensitive Hashing.
Signatures
▪ Key idea: “hash” each column C to a small
signature Sig(C), such that:
1. Sig(C) is small enough that we can fit a signature in main
memory for each column.
2. Sim(C1, C2) is the same as the “similarity” of Sig(C1) and
Sig(C2).
Minhashing
▪ Imagine the rows permuted randomly.
▪ Define the “hash” function h(C) = the number of the first row (in the permuted order) in which column C has a 1.
Minhashing Example

Permutations   Input matrix    Signature matrix M
 1  4  3        1 0 1 0         2 1 2 1
 3  2  4        1 0 0 1         2 1 4 1
 7  1  7        0 1 0 1         1 2 1 2
 6  3  6        0 1 0 1
 2  6  1        0 1 0 1
 5  7  2        1 0 1 0
 4  5  5        1 0 1 0

▪ Each of the three left-hand columns is a permutation of the 7 rows; the rows of M are the minhash values of each input column under the third, second, and first permutation, respectively.
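The example can be checked mechanically. This sketch (variable names are mine) encodes the input matrix and the three permutations, then recomputes the signature matrix by taking, for each permutation and column, the smallest permuted position of a row holding a 1:

```python
# Input matrix: 7 rows x 4 columns (1 = the row's element is in the column's set).
M = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
]
# perms[p][r] = position of row r under permutation p,
# ordered so perms[p] produces row p of the signature matrix.
perms = [
    [3, 4, 7, 6, 1, 2, 5],
    [4, 2, 1, 3, 6, 7, 5],
    [1, 3, 7, 6, 2, 5, 4],
]

def minhash(perm, col):
    """Smallest permuted position among rows where the column has a 1."""
    return min(perm[r] for r in range(len(M)) if M[r][col] == 1)

sig = [[minhash(p, c) for c in range(4)] for p in perms]
print(sig)  # [[2, 1, 2, 1], [2, 1, 4, 1], [1, 2, 1, 2]]
```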
Surprising Property
▪ Classify each row by its values in columns C1 and C2: type a = (1, 1), type b = (1, 0), type c = (0, 1), type d = (0, 0); let a, b, c be the numbers of rows of each type.
▪ The probability (over all permutations of the
rows) that h(C1) = h(C2) is the same as Sim(C1, C2).
▪ Both are a/(a + b + c)!
▪ Why?
▪ Look down the permuted columns C1 and C2 until we
see a 1.
▪ If it’s a type-a row, then h(C1) = h(C2). If a type-b or
type-c row, then not.
▪ Column/column and signature/signature similarities for the example:

            1-3    2-4    1-2    3-4
Col/Col     0.75   0.75   0      0
Sig/Sig     0.67   1.00   0      0
Minhash Signatures
▪ Pick (say) 100 random permutations of the rows.
▪ Think of Sig(C) as a column vector.
▪ Let Sig(C)[i] = the number of the first row, according to the ith permutation, that has a 1 in column C.
Implementation – (1)
▪ Suppose 1 billion rows: picking, storing, and sorting by truly random permutations is impractical.
Implementation – (2)
▪ A good approximation to permuting rows: pick (say) 100 hash functions; hash function hi plays the role of the ith permutation.
Implementation – (3)

Initialize M(i, c) to ∞ for all i and c
for each row r
  for each column c
    if c has 1 in row r
      for each hash function hi do
        if hi(r) is a smaller value than M(i, c) then
          M(i, c) := hi(r)
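A direct Python translation of this row-scan pseudocode, as a sketch (function and variable names are mine). It is applied here to the small instance from the worked example below, with hash functions h(x) = x mod 5 and g(x) = (2x + 1) mod 5 on 1-indexed rows:

```python
import math

def minhash_signatures(matrix, hash_funcs):
    """Row-scan minhash: sig[i][c] = min of h_i(r) over rows r
    where column c has a 1, exactly as in the pseudocode."""
    n_cols = len(matrix[0])
    sig = [[math.inf] * n_cols for _ in hash_funcs]
    for r, row in enumerate(matrix):
        hashes = [h(r) for h in hash_funcs]  # compute each h_i(r) once per row
        for c, bit in enumerate(row):
            if bit == 1:
                for i, hr in enumerate(hashes):
                    if hr < sig[i][c]:
                        sig[i][c] = hr
    return sig

# Rows x = 1..5 (0-indexed r = x - 1); C1 = {1, 3, 4}, C2 = {2, 3, 5}.
def h(r): return (r + 1) % 5          # h(x) = x mod 5
def g(r): return (2 * (r + 1) + 1) % 5  # g(x) = (2x + 1) mod 5

matrix = [[1, 0], [0, 1], [1, 1], [1, 0], [0, 1]]
print(minhash_signatures(matrix, [h, g]))  # [[1, 0], [2, 0]]
```

Computing every hi(r) once per row, rather than once per (row, column) pair, is the point of scanning row by row.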
Example

▪ Rows 1–5; C1 = {1, 3, 4}, C2 = {2, 3, 5}.
▪ Hash functions: h(x) = x mod 5, g(x) = (2x + 1) mod 5.

Row  C1  C2                  Sig1    Sig2
 1    1   0     h(1) = 1      1       -
                g(1) = 3      3       -
 2    0   1     h(2) = 2      1       2
                g(2) = 0      3       0
 3    1   1     h(3) = 3      1       2
                g(3) = 2      2       0
 4    1   0
 5    0   1
Implementation – (4)
▪ Often, data is given by column, not row.
▪ E.g., columns = documents, rows = shingles.
Finding Similar Items Locality-Sensitive Hashing
▪ Example: comparing all pairs of 1 million signatures is about 5 × 10^11 comparisons; at 1 microsecond/comparison: 6 days.
Locality-Sensitive Hashing
▪ General idea: use a function f(x, y) that tells whether
or not x and y are a candidate pair: a pair of
elements whose similarity must be evaluated.
▪ Divide the signature matrix M into b bands of r rows per band; each column of M is one signature.
▪ For each band, hash its portion of each column to a hash table with many buckets.
▪ Columns that hash to the same bucket for some band (e.g., columns 2 and 6 in the figure) are probably identical in that band, and become a candidate pair.
Simplifying Assumption
▪ There are enough buckets that columns are unlikely
to hash to the same bucket unless they are identical
in a particular band.
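Under this assumption, exact agreement within a band can stand in for bucket hashing: a Python dict keyed by the band's r signature values acts as the hash table. A minimal sketch with illustrative names:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(sig, b, r):
    """Band the signature matrix (b bands of r rows) and report column
    pairs that agree exactly in at least one band."""
    n_cols = len(sig[0])
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)          # one hash table per band
        for c in range(n_cols):
            key = tuple(sig[band * r + i][c] for i in range(r))
            buckets[key].append(c)
        for cols in buckets.values():        # all pairs in a shared bucket
            candidates.update(combinations(cols, 2))
    return candidates

# Toy signature matrix: 4 hash functions, b = 2 bands of r = 2 rows.
# Columns 0 and 1 agree in band 0; columns 0 and 2 agree in band 1.
sig = [
    [1, 1, 5],
    [2, 2, 6],
    [3, 7, 3],
    [4, 8, 4],
]
print(sorted(candidate_pairs(sig, b=2, r=2)))  # [(0, 1), (0, 2)]
```

Note that a pair only needs to agree in one band to become a candidate, which is what makes the method forgiving of a few differing signature entries.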
What We Want

▪ Ideally: the probability of sharing a bucket = 1 if the similarity s > t, and no chance of sharing a bucket if s < t (a step function at the threshold t).

One Row (r = 1)

▪ Remember: the probability of equal hash values is the similarity s, so with a single row per band the probability of sharing a bucket grows only linearly in s.

b Bands of r Rows

▪ A band agrees with probability s^r, so no band agrees with probability (1 - s^r)^b.
▪ Probability of sharing at least one bucket: 1 - (1 - s^r)^b.
▪ The threshold is roughly t ≈ (1/b)^(1/r).

Example: b = 20, r = 5

 s     1 - (1 - s^r)^b
.2     .006
.3     .047
.4     .186
.5     .470
.6     .802
.7     .975
.8     .9996
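The S-curve is easy to tabulate; this sketch (the function name is mine) reproduces the b = 20, r = 5 column above:

```python
def share_prob(s, b, r):
    """Probability that two columns with similarity s share at least
    one bucket: 1 - (1 - s**r)**b."""
    return 1 - (1 - s ** r) ** b

for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"{s:.1f}  {share_prob(s, b=20, r=5):.3f}")
# The approximate threshold (1/b)**(1/r) = 0.05**0.2 is about 0.55,
# which is where the curve rises most steeply.
```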
LSH Summary
▪ Tune b and r to get almost all pairs with similar
signatures, but eliminate most pairs that do not
have similar signatures.
Acknowledgement
▪ Slides are from
▪ Prof. Jeffrey D. Ullman
▪ Dr. Anand Rajaraman
▪ Dr. Jure Leskovec