New mantra
Gather whatever data you can whenever and
wherever possible.
Expectations
Gathered data will have value either for the purpose collected or for a purpose not envisioned.
(Examples: traffic patterns, social networking data such as Twitter.)
• Feature Extraction
• Bayes Nets
• Frequent Itemsets
• Market-basket problem
• Similar items
• Collaborative filtering: making automatic predictions (filtering) about the interests of a
user by collecting preferences or taste information from many users (collaborating)
Meaningfulness of Analytic Answers
• A risk with “Data mining” is that an analyst can “discover” patterns
that are meaningless
• Statisticians call it Bonferroni’s principle:
• Roughly, if you look in more places for interesting patterns than your amount
of data will support, you are bound to find crap
Meaningfulness of Analytic Answers
Example:
• We want to find (unrelated) people who at least twice have stayed at the
same hotel on the same day
• 10^9 people being tracked
• 1,000 days
• Each person stays in a hotel 1% of time (1 day out of 100)
• Hotels hold 100 people (so 10^5 hotels)
• If everyone behaves randomly (i.e., no terrorists) will the data mining detect
anything suspicious?
• Expected number of “suspicious” pairs of people:
• 250,000
• … too many combinations to check – we need to have some additional evidence to
find “suspicious” pairs of people in some more efficient way
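As a sanity check on the 250,000 figure, the back-of-the-envelope calculation with the numbers above is:

```latex
\begin{align*}
&\Pr[\text{two given people visit a hotel on a given day}] = 0.01 \times 0.01 = 10^{-4}\\
&\Pr[\text{they pick the same hotel}] = 10^{-4} / 10^{5} = 10^{-9}\\
&\Pr[\text{same hotel on each of two given days}] = 10^{-9} \times 10^{-9} = 10^{-18}\\
&\#\text{pairs of people} = \tbinom{10^{9}}{2} \approx 5\times 10^{17},
\qquad \#\text{pairs of days} = \tbinom{1000}{2} \approx 5\times 10^{5}\\
&\text{Expected ``suspicious'' pairs} \approx 5\times 10^{17} \cdot 5\times 10^{5} \cdot 10^{-18} = 250{,}000
\end{align*}
```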
What matters when dealing with data?
Usage
Quality
Context
Streaming
Scalability
Data Mining: Cultures
• Data mining overlaps with:
• Databases: Large-scale data, simple queries
• Machine learning: Small data, Complex models
• CS Theory: (Randomized) Algorithms
• Different cultures:
• To a DB person, data mining is an extreme form of analytic processing – queries that examine large amounts of data
• Result is the query answer
• To a ML person, data mining is the inference of models
• Result is the parameters of the model
• In this class we will do both!
[Venn diagram: Data Mining at the intersection of CS Theory, Machine Learning, and Database systems]
MMD Course
• This class overlaps with machine learning, statistics, artificial intelligence, databases, but with more stress on
• Scalability (big data)
• Algorithms
• Computing architectures
• Automation for handling large data
[Venn diagram: Data Mining at the intersection of Statistics, Machine Learning, and Database systems]
What will we learn?
• We will learn to mine different types of data:
• Data is high dimensional
• Data is a graph
• Data is infinite/never-ending
• Data is labeled
• We will learn to use different models of computation:
• MapReduce
• Streams and online algorithms
• Single machine in-memory
How It All Fits Together
• High dim. data: Locality sensitive hashing, Dimensionality reduction
• Graph data: PageRank, SimRank, Spam Detection
• Infinite data: Filtering data streams, Queries on streams
• Machine learning: SVM, Perceptron, kNN
• Apps: Recommender systems, Duplicate document detection
Single Node Architecture
[Figure: a single node with CPU and Memory – the setting for classical Machine Learning and Statistics]
MAP:
Reads input and produces a set of key-value pairs
Group by key:
Collects all pairs with the same key
(Hash merge, Shuffle, Sort, Partition)
Reduce:
Collects all values belonging to the key and outputs the result
Map-Reduce: An Example
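The example in the original deck is a figure; as a rough single-machine sketch (not a distributed implementation), the canonical word-count example looks like this, with the three phases marked:

```python
from collections import defaultdict

def map_fn(document):
    # MAP: read input and produce a set of (key, value) pairs
    for word in document.split():
        yield (word, 1)

def reduce_fn(key, values):
    # REDUCE: collect all values belonging to the key and output
    yield (key, sum(values))

def mapreduce(documents):
    # Group by key: collect all pairs with the same key (shuffle/sort)
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    out = []
    for key, values in sorted(groups.items()):
        out.extend(reduce_fn(key, values))
    return out

print(mapreduce(["the quick brown fox", "the lazy dog", "the fox"]))
# [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```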
Map-Reduce: In Parallel
Map-Reduce
• Programmer specifies: the Map and Reduce functions and the input files
• Not suitable for every problem, not even for every problem that uses
many compute nodes operating in parallel.
Matrix-Vector Multiplication
• Multiply an n × n matrix M by a vector v of length n: x_i = Σ_{j=1..n} m_ij · v_j
• Reducer
• Pass each key-value pair to the output.
• Identity function
Projection (π_S) in MapReduce
• Projection may cause the same tuple to appear several times
• In this version, we want to eliminate duplicates
• Map
• For each tuple t in R, construct a tuple t’ by eliminating those components whose
attributes are not in S
• Emit a key/value pair (t’, t’)
• Reduce
• For each key t′ produced by any of the Map tasks, receive (t′, [t′, ···, t′])
• Emit a key/value pair (t′, t′)
• This operation is associative and commutative so combiner can be used.
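A minimal single-process sketch of this duplicate-eliminating projection (the relation, attribute names, and driver function are illustrative, not from the original):

```python
from collections import defaultdict

def project_mapreduce(R, S_attrs):
    """Projection pi_S(R) with duplicate elimination, simulated on one machine.
    R is a list of dict tuples; S_attrs is the list of attributes to keep."""
    # Map: build t' by dropping attributes not in S, emit (t', t')
    pairs = []
    for t in R:
        t_proj = tuple(sorted((a, t[a]) for a in S_attrs))
        pairs.append((t_proj, t_proj))
    # Group by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: one output tuple per key, which eliminates duplicates
    return [dict(key) for key in groups]

R = [{"A": 1, "B": 2, "C": 3}, {"A": 1, "B": 2, "C": 4}, {"A": 5, "B": 6, "C": 7}]
print(project_mapreduce(R, ["A", "B"]))
# [{'A': 1, 'B': 2}, {'A': 5, 'B': 6}]
```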
Union in MapReduce
• Assume input relations R and S have the same schema
• Map tasks will be assigned chunks from either R or S
• Mappers don't do much, just pass tuples on to the reducers
• Reducers do duplicate elimination
• Map
• For each tuple t, emit a key/value pair (t, t)
• Reduce
• For each key t, emit a key/value pair (t, t)
[How many values each key can have?]
Intersection in MapReduce
• Map
• For each tuple t, emit a key/value pair (t, t)
• Reduce
• For each key t, if it has a value list [t, t], emit a key/value pair (t, t)
[How many values each key can have?]
Difference in MapReduce
• Find R − S
• A tuple t appears in the output if it is in R but not in S.
• Need a way to know which relation a tuple is coming from!
• Map
• For a tuple t in R, emit a key/value pair (t, 'R')
• For a tuple t in S, emit a key/value pair (t, 'S')
• Reduce
• If key t has value list ['R'], emit a key/value pair (t, t)
Natural Join by MapReduce
• Assume we have two relations: R(A, B) and S(B, C)
• We must find tuples that agree on their B components
• Map
• For a tuple (a, b) in R, emit a key/value pair (b, ('R', a))
• For a tuple (b, c) in S, emit a key/value pair (b, ('S', c))
• Reduce
• Gets a value list corresponding to key b. The value list consists of pairs either from R or from S.
• For a key b, emit a key/value pair for each combination of values, one from R and one from S. For example, for values ('R', a) and ('S', c), emit the key/value pair (b, (a, b, c)).
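A small single-process sketch of the join, assuming relations R(A, B) and S(B, C) as above (the sample tuples are made up for illustration):

```python
from collections import defaultdict

def natural_join_mapreduce(R, S):
    """Natural join of R(A, B) and S(B, C), simulated in one process.
    R is a list of (a, b) tuples, S a list of (b, c) tuples."""
    groups = defaultdict(list)
    # Map: key every tuple by its join attribute b, remembering its relation
    for a, b in R:
        groups[b].append(("R", a))
    for b, c in S:
        groups[b].append(("S", c))
    # Reduce: for each key b, combine every R-value with every S-value
    out = []
    for b, values in groups.items():
        r_vals = [v for tag, v in values if tag == "R"]
        s_vals = [v for tag, v in values if tag == "S"]
        for a in r_vals:
            for c in s_vals:
                out.append((a, b, c))
    return out

R = [("u1", "x"), ("u2", "x"), ("u3", "y")]
S = [("x", "p"), ("y", "q"), ("z", "r")]
print(natural_join_mapreduce(R, S))
# [('u1', 'x', 'p'), ('u2', 'x', 'p'), ('u3', 'y', 'q')]
```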
Join Example
• Relation 'Links' captures the structure of the Web
• A tuple (url-X, url-Y) indicates existence of a link from url-X to url-Y

Links:
From  | To
url-1 | url-2
url-1 | url-3
url-2 | url-4
…     | …
• Question
• Find paths of length two in the web.
Grouping and Aggregation in MapReduce
• Given a relation R, partition its tuples according to their values in one
set of attributes G
• The set G is called the grouping attributes
• Then, for each group, aggregate the values in certain other attributes
• Aggregation functions: SUM, COUNT, AVG, MIN, MAX, ...
• Map
• For a tuple (a,b,c) emit a key/value pair (a, b)
• Reduce
• Each key a represents a group, with values [b1, b2, …, bn]
• Apply θ to the list [b1, b2, …, bn]
• Emit the key/value pair (a,x), where x = θ([b1, b2, …, bn])
Example
• A relation 'Friends' having two attributes
• Each tuple is a pair (a, b) such that b is a friend of a.

Friends:
User | Friend
u1   | f1
u1   | f2
u2   | f3
…    | …

• Question
• Compute the number of friends that each user has.
• γ_{User, COUNT(Friend)}(Friends)
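A minimal sketch of this grouping-and-aggregation query (COUNT per user), simulated in one process; the sample tuples are illustrative:

```python
from collections import defaultdict

def count_friends_mapreduce(friends):
    """gamma_{User, COUNT(Friend)}(Friends): number of friends per user."""
    # Map: for each tuple (user, friend) emit (user, 1)
    pairs = [(user, 1) for user, _friend in friends]
    # Group by key, then Reduce: apply the aggregation (COUNT) to each group
    groups = defaultdict(list)
    for user, one in pairs:
        groups[user].append(one)
    return {user: sum(ones) for user, ones in groups.items()}

friends = [("u1", "f1"), ("u1", "f2"), ("u2", "f3"), ("u2", "f1"), ("u3", "f2")]
print(count_friends_mapreduce(friends))  # {'u1': 2, 'u2': 2, 'u3': 1}
```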
Matrix Multiplication
• P = M N, where p_ik = Σ_j m_ij · n_jk
• Requirement: #columns of M is the same as #rows of N
• Two relations: M(I, J, V) and N(J, K, W), with tuples (i, j, m_ij) and (j, k, n_jk) respectively
• Good representation for large and sparse matrix!
• Solution
• Natural join followed by grouping and aggregation
Matrix Multiplication
• Map
• For each tuple (i, j, m_ij) from M, emit the key/value pair (j, ('M', i, m_ij))
• For each tuple (j, k, n_jk) from N, emit the key/value pair (j, ('N', k, n_jk))
• Reduce
• For each key j, for each value that comes from M and for each value that comes from N, emit a key/value pair ((i, k), m_ij · n_jk)
• Map 2
• Acts as the identity function
• Reduce 2
• For each key (i, k), produce the sum of the list of values associated with the key.
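A compact single-process sketch of the two-round computation above, using dictionaries keyed by (row, column) to hold sparse matrices (the sample matrices are illustrative):

```python
from collections import defaultdict

def matmul_two_mr(M, N):
    """P = M x N via two MapReduce rounds, using sparse tuples
    M = {(i, j): m_ij} and N = {(j, k): n_jk}."""
    # --- Round 1: natural join on j ---
    # Map 1: (i, j, m_ij) -> (j, ('M', i, m_ij));  (j, k, n_jk) -> (j, ('N', k, n_jk))
    by_j = defaultdict(list)
    for (i, j), m in M.items():
        by_j[j].append(("M", i, m))
    for (j, k), n in N.items():
        by_j[j].append(("N", k, n))
    # Reduce 1: for each j, emit ((i, k), m_ij * n_jk) for every M/N combination
    partial = []
    for j, vals in by_j.items():
        ms = [(i, m) for tag, i, m in vals if tag == "M"]
        ns = [(k, n) for tag, k, n in vals if tag == "N"]
        for i, m in ms:
            for k, n in ns:
                partial.append(((i, k), m * n))
    # --- Round 2: grouping and aggregation ---
    # Map 2 is the identity; Reduce 2 sums the values for each key (i, k)
    P = defaultdict(float)
    for (i, k), v in partial:
        P[(i, k)] += v
    return dict(P)

M = {(0, 0): 1.0, (0, 1): 2.0, (1, 1): 3.0}   # sparse 2x2 matrix
N = {(0, 0): 4.0, (1, 0): 5.0, (1, 1): 6.0}   # sparse 2x2 matrix
print(matmul_two_mr(M, N))  # {(0, 0): 14.0, (0, 1): 12.0, (1, 0): 15.0, (1, 1): 18.0}
```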
Matrix Multiplication
Matrix Multiplication: One MR
• Map
• For each tuple (i, j, m_ij) from M, emit a set of key/value pairs ((i, k), ('M', j, m_ij)) for k = 1, 2, …, K.
• For each tuple (j, k, n_jk) from N, emit a set of key/value pairs ((i, k), ('N', j, n_jk)) for i = 1, 2, …, I.
• Reduce
• For each key (i, k), the list will have 2J values, half coming from M and the other half from N.
• Multiply each pair ('M', j, m_ij) and ('N', j, n_jk) having the same j, and take the sum of all such products (i.e., Σ_j m_ij · n_jk; let it be p_ik).
• Emit for each key (i, k) a key/value pair ((i, k), p_ik).
Exercise
Given a very large file of integers produce as output:
(a) The largest integer.
(b) The average of all the integers.
(c) The same set of integers, but with each integer appearing only once.
(d) The count of the number of distinct integers in the input.
(a) Produce the largest integer
• Map
• For each integer i, emit a key/value pair (k, i), for a single fixed key k
• Reduce
• Gets the key with a list of values of the form [i1, i2, …]
• Find the maximum from the list of values; let it be m
• Emit a key/value pair (m, m)
• NOTE: the key is the result
• Similarity
• Finding sets with a relatively large intersection!
• What if “similarity” can not be expressed in terms of intersection of sets?
Applications
• Similarity of Documents
• Finding textually similar documents (Character-level similarity and not similar
meaning!)
• Focus is on ‘similar’ not ‘identical’
• Plagiarism
• Mirror Pages
• Articles from the same source
• Collaborative Filtering
• Recommend items to users that are liked by other users having similar tastes.
• Online Purchases
• Movie Ratings
3 Essential Steps for Similar Docs
• Shingling: Convert documents to sets
• Optimal k?
Compressing Shingles
• To compress long shingles, we can hash them to (say) 4 bytes
• Represent a document by the set of hash values of its k-shingles
• Idea: Two documents could (rarely) appear to have shingles in common,
when in fact only the hash-values were shared
• Example: k=2; document D1 = abcab
Set of 2-shingles: S(D1) = {ab, bc, ca}
Hash the shingles: h(D1) = {1, 5, 7}
Similarity Metric for Shingles
• Document D1 is a set of its k-shingles C1=S(D1)
• Equivalently, each document is a 0/1 vector in the space of k-shingles
• Each unique shingle is a dimension
• Vectors are very sparse
• A natural similarity measure is the
Jaccard similarity:
sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|
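A small sketch combining the last two slides: computing k-shingles, optionally hashing them to integers, and measuring Jaccard similarity (the example documents are illustrative; zlib.crc32 stands in for "some 4-byte hash"):

```python
import zlib

def shingles(doc, k):
    """Set of character-level k-shingles of a document."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def hashed_shingles(doc, k, buckets=2**32):
    """Compress each k-shingle to a (say) 4-byte integer via a hash function."""
    return {zlib.crc32(s.encode()) % buckets for s in shingles(doc, k)}

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)

D1, D2 = "abcab", "abcb"
print(shingles(D1, 2))                                       # {'ab', 'bc', 'ca'}
print(jaccard(shingles(D1, 2), shingles(D2, 2)))             # 0.5
print(jaccard(hashed_shingles(D1, 2), hashed_shingles(D2, 2)))
```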
Challenging Task!
• Suppose we need to find near-duplicate documents among millions of documents
• Solution: MinHashing!
MinHashing
• Convert large sets to short signatures, while preserving similarity
[Pipeline: Document → the set of strings of length k that appear in the document → Signatures: short integer vectors that represent the sets, and reflect their similarity]
Encoding Sets as Bit Vectors
• Many similarity problems can be formalized as finding subsets that
have significant intersection
• Encode sets using 0/1 (bit, boolean) vectors
• One dimension per element in the universal set
• Interpret set intersection as bitwise AND, and
set union as bitwise OR
• Example: C1 = 10111; C2 = 10011
• Size of intersection = 3; size of union = 4,
• Jaccard similarity (not distance) = 3/4
• Distance: d(C1,C2) = 1 – (Jaccard similarity) = 1/4
From Sets to Boolean Matrices
• Rows = elements (shingles)
• Columns = sets (documents)
• 1 in row e and column s if and only if e is a member of s
• Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)
• Typical matrix is sparse!
• Each document is a column.
• Example: sim(C1, C2) = ?

Documents (columns C1..C4), shingles (rows):
C1 C2 C3 C4
 1  1  1  0
 1  1  0  1
 0  1  0  1
 0  0  0  1
 1  0  0  1
 1  1  1  0
 1  0  1  0

• Size of intersection = 3; size of union = 6,
Jaccard similarity (not distance) = 3/6
• d(C1, C2) = 1 – (Jaccard similarity) = 3/6
Outline: Finding Similar Columns
• Next Goal: Find similar columns, Small signatures
• Naïve approach:
• 1) Signatures of columns: small summaries of columns
• 2) Examine pairs of signatures to find similar columns
• Essential: Similarities of signatures and columns are related
• 3) Optional: Check that columns with similar signatures are really similar
• Warnings:
• Comparing all pairs may take too much time: Job for LSH
• These methods can produce false negatives, and even false positives (if the optional
check is not made)
Hashing Columns (Signatures)
• Key idea: “hash” each column C to a small signature h(C), such that:
• (1) h(C) is small enough that the signature fits in RAM
• (2) sim(C1, C2) is the same as the “similarity” of signatures h(C1) and h(C2)
• Hash docs into buckets. Expect that “most” pairs of near duplicate
docs hash into the same bucket!
Min-Hashing
• Goal: Find a hash function h(·) such that:
• if sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
• if sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)
• Define a "hash" function h_π(C) = the index of the first (in the permuted order π) row in which column C has value 1:
h_π(C) = min_π π(C)

Example (three permutations π1, π2, π3 of the 7 rows of the input matrix):

π1 π2 π3 | Input matrix (columns C1..C4) | Signature matrix
 2  4  3 |  1 0 1 0                      | π1: 2 1 2 1
 3  2  4 |  1 0 0 1                      | π2: 2 1 4 1
 7  1  7 |  0 1 0 1                      | π3: 1 2 1 2
 6  3  2 |  0 1 0 1
 1  6  6 |  0 1 0 1
 5  7  1 |  1 0 1 0
 4  5  5 |  1 0 1 0

(e.g., the signature entry 4 for column C3 under π2: the 4th element of that permutation is the first to map to a 1)
The Min-Hash Property
• Choose a random permutation π
• Claim: Pr[h_π(C1) = h_π(C2)] = sim(C1, C2)
• Why?
• Let X be a doc (set of shingles), y ∈ X a shingle
• Then: Pr[π(y) = min(π(X))] = 1/|X|
• It is equally likely that any y ∈ X is mapped to the min element
• Let y be such that π(y) = min(π(C1 ∪ C2))
• Then either: π(y) = min(π(C1)) if y ∈ C1, or
  π(y) = min(π(C2)) if y ∈ C2
  (one of the two columns had to have a 1 at position y)
• So the prob. that both are true is the prob. that y ∈ C1 ∩ C2
• Hence Pr[h_π(C1) = h_π(C2)] = |C1 ∩ C2| / |C1 ∪ C2| = sim(C1, C2)
Example (same input matrix, permutations, and signature matrix as above):

Similarities of column pairs vs. signature pairs:
          1-3   2-4   1-2   3-4
Col/Col   0.75  0.75  0     0
Sig/Sig   0.67  1.00  0     0
Min-Hash Signatures
• Pick K=100 random permutations of the rows
• Think of sig(C) as a column vector
• sig(C)[i] = according to the i-th permutation, the index of the first row
that has a 1 in column C
sig(C)[i] = min π_i(C)
• Note: The sketch (signature) of document C is small: e.g., with K = 100 permutations and 4-byte entries, about 400 bytes!

Computing the signature matrix S (one pass over the rows of M):

initialize S[i,c] = ∞ for all i, c
for each row r of M:
  for each column c:
    if M[r,c] = 1 then
      for each hash function h_i:
        if h_i(r) < S[i,c] then
          S[i,c] = h_i(r)
Example
Hash functions (simulating permutations):
h1(r) = (r + 1) mod 5
h2(r) = (3r + 1) mod 5

Input matrix M (rows 0–4, columns C1..C4):
row 0: 1 0 0 1
row 1: 0 0 1 0
row 2: 0 1 0 1
row 3: 1 0 1 1
row 4: 0 0 1 0

Processing the rows in order updates the signature matrix S from all-∞ to:
after row 0 (h1=1, h2=1):  [1 ∞ ∞ 1] / [1 ∞ ∞ 1]
after row 1 (h1=2, h2=4):  [1 ∞ 2 1] / [1 ∞ 4 1]
after row 2 (h1=3, h2=2):  [1 3 2 1] / [1 2 4 1]
after row 3 (h1=4, h2=0):  [1 3 2 1] / [0 2 0 0]
after row 4 (h1=0, h2=3):  [1 3 0 1] / [0 2 0 0]

Final signature matrix S:
1 3 0 1
0 2 0 0
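A direct transcription of the row-by-row algorithm into Python; run on the example above, it reproduces the final signature matrix:

```python
def minhash_signatures(M, hash_funcs):
    """Compute the MinHash signature matrix S from a 0/1 input matrix M.
    M[r][c] = 1 if shingle r appears in document c; hash_funcs simulate
    row permutations. Follows the row-by-row update rule above."""
    n_rows, n_cols = len(M), len(M[0])
    INF = float("inf")
    S = [[INF] * n_cols for _ in hash_funcs]
    for r in range(n_rows):
        hashed = [h(r) for h in hash_funcs]
        for c in range(n_cols):
            if M[r][c] == 1:
                for i, hr in enumerate(hashed):
                    if hr < S[i][c]:
                        S[i][c] = hr
    return S

# The example above: h1(r) = (r+1) mod 5, h2(r) = (3r+1) mod 5
M = [[1, 0, 0, 1],
     [0, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 0, 1, 1],
     [0, 0, 1, 0]]
def h1(r): return (r + 1) % 5
def h2(r): return (3 * r + 1) % 5
print(minhash_signatures(M, [h1, h2]))   # [[1, 3, 0, 1], [0, 2, 0, 0]]
```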
Locality Sensitive Hashing
• Focus on pairs of signatures likely to be from similar documents
[Pipeline: Document → the set of strings of length k that appear in the document → Signatures: short integer vectors that represent the sets, and reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity]
LSH: First Cut
• Goal: Find documents with Jaccard similarity at least s (for some
similarity threshold, e.g., s=0.8)
• LSH – General idea: Use a function f(x,y) that tells whether x and y is
a candidate pair: a pair of elements whose similarity must be
evaluated
[Figure: the signature matrix M divided into b bands of r rows each; one signature = one column]
Partition M into b bands
• Divide matrix M into b bands of r rows
• For each band, hash its portion of each column to a hash table with k
buckets
• Make k as large as possible
• Candidate column pairs are those that hash to the same bucket for ≥
1 band
• Tune b and r to catch most similar pairs, but few non-similar pairs
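A minimal sketch of the banding step: hash each band of each signature column and report columns that share a bucket in at least one band (the toy signature matrix and parameters are illustrative; a Python dict plays the role of a hash table with very many buckets):

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(S, b, r):
    """Find candidate column pairs from a signature matrix S (rows = hash
    functions, columns = documents) by hashing b bands of r rows each."""
    n_cols = len(S[0])
    assert len(S) == b * r, "signature length must equal b * r"
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for c in range(n_cols):
            # The band's portion of column c is the bucket key
            chunk = tuple(S[band * r + i][c] for i in range(r))
            buckets[chunk].append(c)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))   # same bucket in >= 1 band
    return candidates

# Toy signature matrix with 4 columns and 4 rows (b=2 bands of r=2 rows)
S = [[1, 1, 2, 1],
     [3, 3, 4, 0],
     [0, 2, 0, 2],
     [5, 6, 5, 6]]
print(lsh_candidate_pairs(S, b=2, r=2))  # {(0, 1), (0, 2), (1, 3)}
```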
Hashing Bands
[Figure: for one band (r rows) of matrix M, each column is hashed into buckets.
Columns 2 and 6 are probably identical (candidate pair).
Columns 6 and 7 are surely different.]
Analysis of the banding technique
• Columns C1 and C2 have similarity t
• Pick any band (r rows)
• Prob. that all rows in the band are equal = t^r
• Prob. that some row in the band is unequal = 1 − t^r
• Prob. that no band is identical = (1 − t^r)^b
• Prob. that at least one band is identical, i.e., that C1 and C2 share a bucket and become a candidate pair = 1 − (1 − t^r)^b
• As a function of the similarity t = sim(C1, C2), this probability is an S-curve whose threshold (where the rise is steepest) is s ≈ (1/b)^(1/r)
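A two-line check of the S-curve, with an illustrative choice of b = 20 bands of r = 5 rows:

```python
def p_candidate(t, b, r):
    """Probability that two columns with similarity t share a bucket
    in at least one of b bands of r rows: 1 - (1 - t^r)^b."""
    return 1 - (1 - t ** r) ** b

# Illustrative parameters b=20, r=5 (signature of 100 rows);
# the threshold is roughly (1/b)^(1/r) = (1/20)^(1/5) ~ 0.55
for t in (0.2, 0.4, 0.6, 0.8):
    print(t, round(p_candidate(t, b=20, r=5), 3))
# approx. 0.006, 0.186, 0.802, 1.000 -- low below the threshold, high above it
```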
Analysis of LSH – What We Want
• Ideally we would like:
• Probability of sharing a bucket = 1 if t > s
• No chance of sharing a bucket if t < s
(s = the similarity threshold)
[Plot: probability of sharing a bucket vs. similarity t on [0, 1].
Blue area: false negative rate; Green area: false positive rate]
Summary: 3 Steps
• Shingling: Convert documents to sets
• We used hashing to assign each shingle an ID
• Min-Hashing: Convert large sets to short signatures, while preserving
similarity
• We used similarity preserving hashing to generate signatures with property
Pr[h(C1) = h(C2)] = sim(C1, C2)
• We used hashing to get around generating random permutations
• Locality-Sensitive Hashing: Focus on pairs of signatures likely to be
from similar documents
• We used hashing to find candidate pairs of similarity ≥ s
Finding Similar Items
Dr. Subrat K Dash
Distance Measures
Definition
• A function d(x, y) in a space of points that takes two points in this space and produces a real number satisfying the following:
• 𝑑 𝑥, 𝑦 ≥0
• 𝑑 𝑥, 𝑦 = 0 if and only if 𝑥 = 𝑦
• 𝑑 𝑥, 𝑦 = 𝑑 𝑦, 𝑥 (symmetric)
• 𝑑 𝑥, 𝑦 ≤ 𝑑 𝑥, 𝑧 + 𝑑 𝑧, 𝑦 (triangle inequality)
Distance Measures
• Euclidean Distance:
• L2-norm: d([x1, x2, …, xn], [y1, y2, …, yn]) = √( Σ_{i=1..n} (xi − yi)² )
• Cosine Distance
• The angle between two vectors, in spaces that have dimensions (components).
• Range of values: [0, 180] degrees
Distance Measures
• Edit Distance
• Defined for strings.
• The edit distance between two strings 𝑥 = 𝑥1 𝑥2 … 𝑥𝑛 and 𝑦 = 𝑦1 𝑦2 … 𝑦𝑚 is
• the smallest number of insertions and deletions of single characters that will convert x to
y.
• Length of x + length of y – 2 * length of LCS(x,y)
• LCS(x, y) = the longest string obtained by deleting positions from x and y.
• Example:
• 𝑥 = 𝑎𝑏𝑐𝑑𝑒 and 𝑦 = 𝑎𝑐𝑓𝑑𝑒𝑔 then edit distance between x and y is 3.
• 𝐿𝐶𝑆 𝑎𝑏𝑐𝑑𝑒, 𝑎𝑐𝑓𝑑𝑒𝑔 = 𝑎𝑐𝑑𝑒
• 𝐿𝐶𝑆 𝑎𝑏𝑎, 𝑏𝑎𝑏 = 𝑎𝑏 or 𝑏𝑎
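A short sketch of this insert/delete-only edit distance computed via the LCS formula above:

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y (dynamic programming)."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def edit_distance(x, y):
    """Insertions/deletions only: |x| + |y| - 2 * |LCS(x, y)|."""
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(edit_distance("abcde", "acfdeg"))  # 3  (LCS = "acde")
print(edit_distance("aba", "bab"))       # 2  (LCS = "ab" or "ba")
```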
Distance Measures
• Hamming Distance
• It is the number of components in which two vectors x and y differ.
• Defined mostly for Boolean vectors
• For x=10101 and y=11110, d(x,y) = 3
Methods for High Degrees of Similarity
• Find sets that are almost identical
• Find every pair of items with the desired degree of similarity.
• No false negative
Finding Identical Items
• Find Web pages that are identical
• But avoid comparing every pair of documents!
• Approach 1:
• Use hash function on entire document
• Compare documents that fall in the same bucket
• Drawback: we must still examine every character of every document (to compute the hash)
• Approach 2:
• Hash each page based on first few characters.
• Compare those documents that fall in the same bucket
• Fails if documents have common prefix like HTML documents
• Approach 3:
• Pick some fixed random positions for all documents and hash on corresponding
characters.
Sets as Strings
• Represent set as an ordered sequence of values (string).
• Set {d,a,b} is represented as abd
• Example:
• Suppose the required Jaccard similarity is J = 0.9 and s is a string of length 9; then s need only be compared with strings of length at most 9/0.9 = 10.
Prefix Indexing
• Index each string s under each of its first p symbols.
• i.e., add string s to the index bucket for each of its first p symbols.
• Compare strings that fall in the same bucket.
• Any other string t such that SIM(s, t) ≥ J will have at least one symbol in its prefix that also appears in the prefix of s.
• p = ⌊(1 − J)·Ls⌋ + 1
• Example:
• If J = 0.9 and Ls = 9, then p = 1: s needs to be indexed under only its first symbol.
Using position Information
• Motivation
• Let s=acdefghijk and t=bcdefghijk and J=0.9
• As per prefix indexing, both s and t will be indexed using their first p = ⌊(1 − J)·Ls⌋ + 1 = 2 characters. Thus, both are in the same bucket because of indexing on character 'c'.
• But SIM(s, t) = 9/11 < 0.9
• Can we avoid comparing s and t?
• Solution
• Build index using character as well as position of the character within the
string.
Using position Information
• Index is based on a pair (x,i) where x is a symbol at position i in the
prefix of a string s.
• Given a string s and minimum desired Jaccard similarity J, consider the prefix of s (the first p symbols – positions 1 through ⌊(1 − J)·Ls⌋ + 1). If the symbol in position i of the prefix of s is x, add s to the index bucket for (x, i).
[Figure: the data-stream model. Streams entering, e.g.
 … 1, 5, 2, 7, 0, 9, 3
 … a, r, v, t, y, h, b
 … 0, 0, 1, 0, 1, 1, 0      (time →)
Each stream is composed of elements/tuples. A stream processor answers standing queries, sends results to an output, and has only limited working storage plus archival storage.]
Hash table with b buckets, pick the tuple if its hash value is at most a.
How to generate a 30% sample?
Hash into b=10 buckets, take the tuple if it hashes to one of the first 3 buckets
As the stream grows, the sample is of
fixed size
Problem 2: Fixed-size sample
Suppose we need to maintain a random
sample S of size exactly s tuples
E.g., main memory size constraint
Why? Don’t know length of stream in advance
Suppose at time n we have seen n items
Each item is in the sample S with equal prob. s/n
How to think about the problem: say s = 2
Stream: a x c y z k c d e g…
At n= 5, each of the first 5 tuples is included in the sample S with equal prob.
At n= 7, each of the first 7 tuples is included in the sample S with equal prob.
Impractical solution would be to store all the n tuples seen
so far and out of them pick s at random
Algorithm (a.k.a. Reservoir Sampling)
Store all the first s elements of the stream to S
Suppose we have seen n-1 elements, and now
the nth element arrives (n > s)
With probability s/n, keep the nth element, else discard it
If we picked the nth element, then it replaces one of the
s elements in the sample S, picked uniformly at random
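A minimal sketch of reservoir sampling as described above (the seed and the sample stream are illustrative):

```python
import random

def reservoir_sample(stream, s, seed=None):
    """Maintain a uniform random sample S of exactly s elements from a stream."""
    rng = random.Random(seed)
    S = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            S.append(item)                 # store the first s elements
        elif rng.random() < s / n:         # keep the n-th element with prob. s/n
            S[rng.randrange(s)] = item     # it replaces a uniformly chosen element
    return S

print(reservoir_sample("axcyzkcdeg", s=2, seed=42))
```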
Publish-subscribe systems
You are collecting lots of messages (news articles)
People express interest in certain sets of keywords
Determine whether each message matches user’s
interest
• Hash function h. In our case:
• Targets = bits/buckets
• Darts = hash values of items
• Throwing m darts at n targets:
• Probability some target X is not hit by a dart: (1 − 1/n)^m = ((1 − 1/n)^n)^(m/n) ≈ e^(−m/n)
• Probability at least one dart hits target X: 1 − (1 − 1/n)^m ≈ 1 − e^(−m/n)
Obvious approach:
Maintain the set of elements seen so far
That is, keep a hash table of all the distinct
elements seen so far
• Prob. of NOT finding a tail of length r among m elements:
(1 − 2^(−r))^m = (1 − 2^(−r))^(2^r · (m·2^(−r))) ≈ e^(−m·2^(−r))
• If m << 2^r, then the prob. tends to 1:
(1 − 2^(−r))^m ≈ e^(−m·2^(−r)) → 1 as m/2^r → 0
So, the probability of finding a tail of length r tends to 0
• If m >> 2^r, then the prob. tends to 0:
(1 − 2^(−r))^m ≈ e^(−m·2^(−r)) → 0 as m/2^r → ∞
So, the probability of finding a tail of length r tends to 1
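A rough single-hash-function sketch of the Flajolet–Martin idea: track the longest tail of zeros R and estimate the number of distinct elements as 2^R. (The hash function and test stream are illustrative; real implementations combine many hash functions to reduce the variance.)

```python
import random

def trailing_zeros(x):
    """Tail length r(x): number of trailing zeros in the binary representation of x."""
    if x == 0:
        return 0
    r = 0
    while x & 1 == 0:
        x >>= 1
        r += 1
    return r

def fm_estimate(stream, seed=0):
    """Single-hash Flajolet-Martin sketch: estimate #distinct elements as 2**R,
    where R is the largest tail length of h(x) seen in the stream."""
    rng = random.Random(seed)
    p = (1 << 31) - 1                      # a Mersenne prime for a simple hash
    a, b = rng.randrange(1, p), rng.randrange(p)
    R = 0
    for x in stream:
        h = (a * hash(x) + b) % p          # pseudo-random hash of the element
        R = max(R, trailing_zeros(h))
    return 2 ** R

stream = [random.randrange(10_000) for _ in range(100_000)]   # ~10,000 distinct values
print(fm_estimate(stream))   # a rough power-of-two estimate of the distinct count
```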
For example:
Finding communities in graphs (e.g., Twitter)
• Support for itemset I: the number of baskets containing all items in I
(e.g., Support of {Beer, Bread} = 2)
• Given a support threshold s, sets of items that appear in at least s baskets are called frequent itemsets
Items = {milk, coke, pepsi, beer, juice}
Support threshold = 3 baskets
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
Frequent itemsets: {m}, {c}, {b}, {j},
{m,b} , {b,c} , {c,j}.
conf(I → j) = support(I ∪ {j}) / support(I)
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf
thresholds
Computationally prohibitive!
Total number of possible rules over d items:
R = Σ_{k=1..d−1} [ C(d, k) · Σ_{j=1..d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
Step 1: Find all frequent itemsets I
(we will explain this next)
Step 2: Rule generation
For every subset A of I, generate a rule A → I \ A
Since I is frequent, A is also frequent
Variant 1: Single pass to compute the rule confidence
confidence(A,B→C,D) = support(A,B,C,D) / support(A,B)
Variant 2:
Observation: If A,B,C→D is below confidence, so is A,B→C,D
Can generate “bigger” rules from smaller ones!
Output the rules above the confidence threshold
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, c, b, n} B4= {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
Support threshold s = 3, confidence c = 0.75
1) Frequent itemsets:
{b,m} {b,c} {c,m} {c,j} {m,c,b}
2) Generate rules:
b→m: c=4/6 b→c: c=5/6 b,c→m: c=3/5
m→b: c=4/5 … b,m→c: c=3/4
b→c,m: c=3/6
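A small sketch of the rule-generation step, run on the baskets and frequent itemsets of this example; it recovers, among others, the rules m→b and b,m→c shown above and keeps only rules with confidence ≥ 0.75:

```python
from itertools import combinations

baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "c", "b", "n"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]

def support(itemset):
    """Number of baskets containing all items of the itemset."""
    return sum(1 for basket in baskets if set(itemset) <= basket)

def rules(frequent_itemsets, min_conf):
    """For every frequent itemset I and nonempty proper subset A, test A -> I \\ A."""
    out = []
    for I in frequent_itemsets:
        for k in range(1, len(I)):
            for A in combinations(sorted(I), k):
                conf = support(I) / support(A)   # conf(A -> I\A) = support(I)/support(A)
                if conf >= min_conf:
                    out.append((set(A), set(I) - set(A), round(conf, 2)))
    return out

frequent = [{"b", "m"}, {"b", "c"}, {"c", "m"}, {"c", "j"}, {"m", "c", "b"}]
for A, B, conf in rules(frequent, min_conf=0.75):
    print(A, "->", B, conf)
```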
[Figure: the lattice of itemsets over {A, B, C, D, E} (from the null set up through ABCDE) with support counts attached to the nodes; in this example, # closed frequent itemsets = 9 and # maximal frequent itemsets = 4.]
Itemset | Support | Maximal (s=3) | Closed | Note
A       | 4       | No            | No     |
B       | 5       | No            | Yes    | Frequent, but superset BC also frequent
C       | 3       | No            | No     | Superset BC has same count
AB      | 4       | Yes           | Yes    | Frequent, and its only superset, ABC, not frequent
ABC     | 2       | No            | Yes    |
[Figure: main memory layout of A-Priori — Pass 1: item counts; Pass 2: counts of pairs of frequent items (candidate pairs), e.g. stored as triples.]
• Trick: re-number frequent items 1, 2, … and keep a table relating the new numbers to the original item numbers
For each k, we construct two sets of
k-tuples (sets of size k):
Ck = candidate k-tuples = those that might be
frequent sets (support > s) based on information
from the pass for k–1
Lk = the set of truly frequent k-tuples
[Figure: the A-Priori flow — count all items → L1 (frequent items) → C2 = all pairs of items from L1 → count the pairs → L2 (frequent pairs) → … (to be explained)]
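A minimal sketch of the first two A-Priori passes (frequent items, then frequent pairs), run on the earlier 8-basket example with s = 3:

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    """First two A-Priori passes: frequent items L1, then frequent pairs L2.
    Only pairs whose two items are both in L1 are candidates (C2)."""
    # Pass 1: count individual items
    item_counts = Counter(item for basket in baskets for item in basket)
    L1 = {item for item, cnt in item_counts.items() if cnt >= s}
    # Pass 2: count only candidate pairs (both items frequent)
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket) & L1), 2):
            pair_counts[pair] += 1
    L2 = {pair for pair, cnt in pair_counts.items() if cnt >= s}
    return L1, L2

baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]
print(apriori_pairs(baskets, s=3))
# frequent items {b, c, j, m}; frequent pairs {('b','c'), ('b','m'), ('c','j')}
```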
One pass for each k (itemset size)
Needs room in main memory to count
each candidate k–tuple
For typical market-basket data and reasonable
support (e.g., 1%), k = 2 requires the most memory
Many possible extensions:
Association rules with intervals:
For example: Men over 65 have 2 cars
Association rules when items are in a taxonomy
Bread, Butter → FruitJam
BakedGoods, MilkProduct → PreservedGoods
Lower the support s as itemset gets bigger
Observation:
In pass 1 of A-Priori, most memory is idle
We store only individual item counts
Can we use the idle memory to reduce
memory required in pass 2?
Pass 1 of PCY: In addition to item counts,
maintain a hash table with as many
buckets as fit in memory
Keep a count for each bucket into which
pairs of items are hashed
For each bucket just keep the count, not the actual
pairs that hash to the bucket!
Pass 1 of PCY (the loop over pairs is new relative to A-Priori):

FOR (each basket):
  FOR (each item in the basket):
    add 1 to item's count;
  FOR (each pair of items):            /* new in PCY */
    hash the pair to a bucket;
    add 1 to the count for that bucket;

Pass 2:
• Only count pairs that hash to frequent buckets
Replace the buckets by a bit-vector:
1 means the bucket count exceeded the support s
(call it a frequent bucket); 0 means it did not
[Figure: PCY main memory — Pass 1: item counts + hash table for pairs; Pass 2: frequent-items information + bitmap (1 bit per bucket) + counts of candidate pairs.]
[Figures: refinements of PCY.
(a) Adding a second hash table on an extra pass: Pass 1 — item counts + first hash table; next pass — Bitmap 1 + second hash table; final pass — Bitmap 1 + Bitmap 2 + counts of candidate pairs.
(b) Using two hash tables on the first pass: Pass 1 — item counts + first and second hash tables; Pass 2 — Bitmap 1 + Bitmap 2 + counts of candidate pairs.]
[Figure: points in two dimensions forming a cluster, with one outlier.]
Point assignment:
Maintain a set of clusters
Points belong to “nearest” cluster
Key operation:
Repeatedly combine
two nearest clusters
Data:
o … data point
x … centroid
Dendrogram
What about the Non-Euclidean case?
The only “locations” we can talk about are the
points themselves
i.e., there is no “average” of two points
Approach 1:
(1) How to represent a cluster of many points?
clustroid = (data)point “closest” to other points
(2) How do you determine the “nearness” of
clusters? Treat clustroid as if it were centroid, when
computing inter-cluster distances
(1) How to represent a cluster of many points?
clustroid = point “closest” to other points
Possible meanings of “closest”:
Smallest maximum distance to other points
Smallest average distance to other points
Smallest sum of squares of distances to other points
For a distance metric d, the clustroid c of cluster C can be chosen as:
argmin_c Σ_{x ∈ C} d(x, c)²
[Figures: point-assignment clustering over successive rounds — clusters after round 1, after round 2, and at the end; x = data point, ⊕ = centroid.]
[Plot: average distance to centroid vs. number of clusters k; the "elbow" of the curve indicates the best value of k.]
[Figure: a BFR cluster whose points are all in the discard set (DS), with its centroid, and compressed sets whose points are in the CS.]
2d + 1 values represent any size cluster
d = number of dimensions
Average in each dimension (the centroid)
can be calculated as SUMi / N
SUMi = ith component of SUM
Variance of a cluster’s discard set in
dimension i is: (SUMSQi / N) – (SUMi / N)2
And standard deviation is the square root of that
Next step: Actual clustering
Note: Dropping the “axis-aligned” clusters assumption would require
storing full covariance matrix to summarize the cluster. So, instead of
SUMSQ being a d-dim vector, it would be a d x d matrix, which is too big!
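A small sketch of the 2d + 1 summary (N, SUM, SUMSQ) and the derived centroid and per-dimension variance; the class name and sample points are illustrative:

```python
class BFRCluster:
    """Summary of a BFR discard-set cluster: 2d + 1 values (N, SUM, SUMSQ)."""
    def __init__(self, d):
        self.N = 0
        self.SUM = [0.0] * d       # per-dimension sum of coordinates
        self.SUMSQ = [0.0] * d     # per-dimension sum of squared coordinates

    def add(self, point):
        self.N += 1
        for i, x in enumerate(point):
            self.SUM[i] += x
            self.SUMSQ[i] += x * x

    def centroid(self):
        return [s / self.N for s in self.SUM]

    def variance(self):
        # Variance in dimension i: SUMSQ_i / N - (SUM_i / N)^2
        return [sq / self.N - (s / self.N) ** 2 for s, sq in zip(self.SUM, self.SUMSQ)]

c = BFRCluster(d=2)
for p in [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]:
    c.add(p)
print(c.centroid())   # [2.0, 4.0]
print(c.variance())   # approx. [0.667, 2.667]
```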
Processing the “Memory-Load” of points (1):
1) Find those points that are “sufficiently
close” to a cluster centroid and add those
points to that cluster and the DS
These points are so close to the centroid that
they can be summarized and then discarded
2) Use any main-memory clustering algorithm
to cluster the remaining points and the old RS
Clusters go to the CS; outlying points to the RS
Discard set (DS): Close enough to a centroid to be summarized.
Compression set (CS): Summarized, but not assigned to a cluster
Retained set (RS): Isolated points
Processing the “Memory-Load” of points (2):
3) DS set: Adjust statistics of the clusters to
account for the new points
Add Ns, SUMs, SUMSQs
4) Consider merging compressed sets in the CS
5) If this is the last round, merge all compressed
sets in the CS and all RS points into their nearest
cluster
Discard set (DS): Close enough to a centroid to be summarized.
Compression set (CS): Summarized, but not assigned to a cluster
Retained set (RS): Isolated points
[Figure: points in the retained set (RS) and compressed sets whose points are in the CS.]
[Figures: an example dataset of people plotted by age (x-axis) and salary (y-axis), with points labeled h and e.
Step: pick (say) 4 remote points for each cluster.
Step: move those points (say) 20% toward the centroid.]
Online Algorithms
You get to see the input one piece at a time, and
need to make irrevocable decisions along the way
Similar to the data stream model
[Figure: a bipartite graph with Boys {1, 2, 3, 4} on one side and Girls {a, b, c, d} on the other.]
• M = {(1,a), (2,b), (3,d)} is a matching; cardinality of the matching = |M| = 3
• M = {(1,c), (2,b), (3,d), (4,a)} is a perfect matching
• Perfect matching … all vertices of the graph are matched
• Maximum matching … a matching that contains the largest possible number of matches
Problem: Find a maximum matching for a
given bipartite graph
A perfect one if it exists
Example of application:
Assigning tasks to servers
Competitive ratio = min over all possible inputs I of (|M_greedy| / |M_opt|)
(what is greedy's worst performance over all possible inputs I)
Interesting problem:
What ads to show for a given query?
(Today’s lecture)
Advertiser A: bid $1.00, CTR 1% → expected revenue per impression = 1 cent
Query stream: x x x x y y y y
BALANCE choice: A B A B B B _ _
Optimal: A A A A B B B B
Two advertisers A1 and A2, each with budget B
• Optimal revenue = 2B
• Assume Balance gives revenue = 2B − x = B + y
• Unassigned queries should be assigned to A2
  (if we could assign to A1 we would, since we still have the budget)
• Goal: Show we have y ≥ x
• Case 1) ≤ ½ of A1's queries got assigned to A2: then y ≥ B/2
• Case 2) > ½ of A1's queries got assigned to A2: then x ≤ B/2 and x + y = B
• Balance revenue is minimum for x = y = B/2
• Minimum Balance revenue = 3B/2
• Competitive Ratio = 3/4
[Figure: the budgets of A1 and A2 with the x and y amounts and the "not used" portion marked; BALANCE exhausts A2's budget.]
In the general case, worst competitive ratio
of BALANCE is 1–1/e = approx. 0.63
Interestingly, no online algorithm has a better
competitive ratio!
[Figure: the sums S_k, ranging from S_k = B down to S_k = 1, reached at about ln(N) − 1.]
S_k = 1 implies: H_{N−k} = ln(N) − 1 = ln(N/e)
We also know: H_{N−k} = ln(N − k)
So: N − k = N/e          (N terms sum to ln(N); the last k terms sum to 1)
Then: k = N(1 − 1/e)      (the first N − k terms sum to ln(N − k), but also to ln(N) − 1)
So after the first k = N(1 − 1/e) rounds, we cannot allocate a query to any advertiser
Need to re-compute
betweenness at
every step
Questions:
How can we define a “good” partition of 𝑮?
How can we efficiently identify such a partition?
What makes a good partition?
Maximize the number of within-group
connections
Minimize the number of between-group
connections
[Figure: an example graph on nodes 1–6, partitioned into groups A = {1, 2, 3} and B = {4, 5, 6} with cut(A, B) = 2.]
Problem:
Only considers external cluster connections
Does not consider internal cluster connectivity
[Shi-Malik]
Degree matrix D (n × n diagonal matrix, d_ii = degree of node i) for the example graph on nodes 1–6:
 3 0 0 0 0 0
 0 2 0 0 0 0
 0 0 3 0 0 0
 0 0 0 3 0 0
 0 0 0 0 3 0
 0 0 0 0 0 2
Laplacian matrix L = D − A (n × n symmetric matrix):
  3 -1 -1  0 -1  0
 -1  2 -1  0  0  0
 -1 -1  3 -1  0  0
  0  0 -1  3 -1 -1
 -1  0  0 -1  3 -1
  0  0  0 -1 -1  2
x^T L x = Σ_{(i,j)∈E} (x_i² + x_j² − 2·x_i·x_j) = Σ_{(i,j)∈E} (x_i − x_j)²
Node i has degree d_i, so the value x_i² needs to be summed up d_i times.
But each edge (i, j) has two endpoints, so we need x_i² + x_j².
Details!
λ2 = min_x (x^T L x) / (x^T x)
Remember:
λ2 = min over all labelings of nodes i with Σ_i x_i = 0 of
     [ Σ_{(i,j)∈E} (x_i − x_j)² ] / [ Σ_i x_i² ]
• We want to assign values x_i to nodes i such that few edges cross 0
  [number line: x_i … 0 … x_j]
  (we want x_i and x_j to subtract each other); balance to minimize
Back to finding the optimal cut
Express partition (A,B) as a vector
y_i = +1 if i ∈ A
y_i = −1 if i ∈ B
We can minimize the cut of the partition by finding a non-trivial vector x that minimizes Σ_{(i,j)∈E} (x_i − x_j)²
Proof (continued):
1) Let's set:
   x_i = −1/a if i ∈ A
   x_i = +1/b if i ∈ B
   (a = |A|, b = |B|)
   Let's quickly verify that Σ_i x_i = 0:  a·(−1/a) + b·(1/b) = 0
2) Then:
   [ Σ_{(i,j)∈E} (x_i − x_j)² ] / [ Σ_i x_i² ]
   = [ Σ_{i∈A, j∈B, (i,j)∈E} (1/a + 1/b)² ] / [ a·(1/a)² + b·(1/b)² ]
   = e · (1/a + 1/b)² / (1/a + 1/b)
   = e · (1/a + 1/b)
   ≤ e · (1/a + 1/a) = (2/a)·e = 2α
   which proves that the cost achieved by spectral is better than twice the OPT cost
   (e … number of edges between A and B)
Details!
Spectral Clustering
1) Build the Laplacian matrix L of the graph:
  3 -1 -1  0 -1  0
 -1  2 -1  0  0  0
 -1 -1  3 -1  0  0
  0  0 -1  3 -1 -1
 -1  0  0 -1  3 -1
  0  0  0 -1 -1  2
2) Eigenvalue decomposition of L: eigenvalues Λ and eigenvectors X.
   [Figure: Λ = diag(0.0, 1.0, 3.0, 3.0, 4.0, 5.0) and the matrix X of eigenvectors.]
3) Map vertices to the corresponding components of the second eigenvector x2:
   1 → 0.3, 2 → 0.6, 3 → 0.3, 4 → −0.3, 5 → −0.3, 6 → −0.6
   How do we now find the clusters?
   [Plots: value of x2 vs. rank in x2; components of x3 vs. rank in x2.]
How do we partition a graph into k clusters?
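For the two-cluster case, a compact numpy sketch of spectral bisection — build L = D − A, take the eigenvector of the second-smallest eigenvalue, and split the nodes by its sign (the edge list encodes the 6-node example graph above, renumbered 0–5):

```python
import numpy as np

def spectral_bisection(edges, n):
    """Partition a graph by the sign of the second eigenvector (Fiedler vector)
    of the Laplacian L = D - A. Nodes are 0..n-1; edges are undirected pairs."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    L = np.diag(A.sum(axis=1)) - A          # Laplacian matrix
    eigvals, eigvecs = np.linalg.eigh(L)    # ascending eigenvalues for symmetric L
    x2 = eigvecs[:, 1]                      # eigenvector of the 2nd smallest eigenvalue
    return ([i for i in range(n) if x2[i] >= 0],
            [i for i in range(n) if x2[i] < 0])

# The 6-node example graph used above (nodes renumbered 0..5)
edges = [(0, 1), (0, 2), (0, 4), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
print(spectral_bisection(edges, 6))
# two groups: [0, 1, 2] and [3, 4, 5], in one order or the other (eigenvector sign is arbitrary)
```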