Mining of Massive Datasets

Dr. Subrat K Dash

Acknowledgement: http://www.mmds.org, Introduction to Data Mining by Vipin Kumar et al.
Large-scale Data is Everywhere!
 There has been enormous data growth in both commercial and scientific databases due to advances in data generation and collection technologies
 Examples: E-Commerce, Cyber Security, Traffic Patterns, Social Networking (Twitter), Sensor Networks, Computational Simulations
 New mantra
 Gather whatever data you can, whenever and wherever possible.
 Expectations
 Gathered data will have value, either for the purpose collected or for a purpose not envisioned.
 Great opportunities to improve productivity in all walks of life
 Data contains value and knowledge
Big Data
• Big data refers to data that is so
large, fast or complex that it's
difficult or impossible to process
using traditional methods.
• Characteristics
• Volume
• Variety
• Velocity
• Veracity
• Value
Data Mining
• Non-trivial extraction of implicit, previously unknown and potentially
useful information from data
• Exploration & analysis, by automatic or semi-automatic means, of
large quantities of data in order to discover
meaningful patterns
Demand for Data Mining:
Data Mining
• But to extract the knowledge, data needs to be (i) Stored, (ii) Managed,
and (iii) ANALYZED
Data Mining ≈ Big Data ≈ Predictive Analytics ≈ Data Science

• Data Mining Tasks
• Descriptive Methods: Find human-interpretable patterns that describe the data
• Example: Clustering
• Predictive Methods: Use some variables to predict unknown or future values of other variables
• Example: Recommender Systems
Data Mining: Discovery of “models” for data
• Statistical Modeling
• Construct a statistical model: an underlying distribution from which the visible
data is drawn.
• Machine Learning
• Use the data as a training set to train an algorithm
• Little idea about what we are looking for in the data
• Netflix challenge
• What if goal of mining can be described more directly?
• Case study: Use ML to locate people’s resumes on the web (WhizBang! Labs)
Computational approaches to Modeling
• Summarization
• Example: PageRank, Clustering

• Feature Extraction
• Bayes Nets
• Frequent Itemsets
• Market-basket problem
• Similar items
• Collaborative filtering: making automatic predictions (filtering) about the interests of a
user by collecting preferences or taste information from many users (collaborating)
Meaningfulness of Analytic Answers
• A risk with “Data mining” is that an analyst can “discover” patterns
that are meaningless
• Statisticians call it Bonferroni’s principle:
• Roughly, if you look in more places for interesting patterns than your amount
of data will support, you are bound to find crap
Meaningfulness of Analytic Answers
Example:
• We want to find (unrelated) people who at least twice have stayed at the
same hotel on the same day
• 10^9 people being tracked
• 1,000 days
• Each person stays in a hotel 1% of the time (1 day out of 100)
• Hotels hold 100 people (so 10^5 hotels)
• If everyone behaves randomly (i.e., no terrorists), will the data mining detect
anything suspicious?
• Expected number of “suspicious” pairs of people:
• 250,000
• … too many combinations to check – we need some additional evidence to
find “suspicious” pairs of people in some more efficient way
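A short worked derivation of the 250,000 figure under the assumptions listed above (a sketch, not from the original slides):

P(two given people visit some hotel on a given day) = 0.01 × 0.01 = 10^-4
P(they visit the same hotel that day, with 10^5 hotels) = 10^-4 / 10^5 = 10^-9
P(the same pair shares a hotel on two given days) = (10^-9)^2 = 10^-18
#pairs of people ≈ C(10^9, 2) ≈ 5×10^17;  #pairs of days ≈ C(1000, 2) ≈ 5×10^5
E[#suspicious pairs] ≈ 5×10^17 · 5×10^5 · 10^-18 = 250,000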
What matters when dealing with data?
• Usage
• Quality
• Context
• Streaming
• Scalability
Data Mining: Cultures
• Data mining overlaps with:
• Databases: Large-scale data, simple queries
• Machine learning: Small data, complex models
• CS Theory: (Randomized) algorithms
• Different cultures:
• To a DB person, data mining is an extreme form of analytic processing – queries that examine large amounts of data
• Result is the query answer
• To a ML person, data mining is the inference of models
• Result is the parameters of the model
• In this class we will do both!
[Venn diagram: Data Mining at the intersection of Machine Learning, CS Theory, and Database Systems]
MMD Course
• This class overlaps with machine learning, statistics, artificial intelligence, and databases, but with more stress on
• Scalability (big data)
• Algorithms
• Computing architectures
• Automation for handling large data
[Venn diagram: Data Mining at the intersection of Statistics, Machine Learning, and Database Systems]
What will we learn?
• We will learn to mine different types of data:
• Data is high dimensional
• Data is a graph
• Data is infinite/never-ending
• Data is labeled
• We will learn to use different models of computation:
• MapReduce
• Streams and online algorithms
• Single machine in-memory
How It All Fits Together
• High dim. data: Locality sensitive hashing, Clustering, Dimensionality reduction
• Graph data: PageRank/SimRank, Community Detection, Spam Detection
• Infinite data: Filtering data streams, Web advertising, Queries on streams
• Machine learning: SVM, Decision Trees, Perceptron/kNN
• Apps: Recommender systems, Association Rules, Duplicate document detection
Single Node Architecture
[Diagram: a single node with CPU, Memory, and Disk – the setting for Machine Learning, Statistics, and “Classical” Data Mining]
Motivation: Google Example
• 20+ billion web pages x 20KB = 400+ TB
• 1 computer reads 30-35 MB/sec from disk
• ~4 months to read the web
• ~1,000 hard drives to store the web
• Takes even more to do something useful
with the data!
Idea and Solution
• Issue: Copying data over a network takes time
• Idea:
• Bring computation close to the data
• Store files multiple times for reliability
• Map-reduce addresses these problems
• Google’s computational/data manipulation model
• Elegant way to work with big data
• Storage Infrastructure – File system
• Google: GFS. Hadoop: HDFS
• Programming model
• Map-Reduce
MapReduce
• Much of the course will be devoted to
large scale computing for data mining
• Challenges:
• How to distribute computation?
• Distributed/parallel programming is hard

• Map-reduce addresses all of the above


• Google’s computational/data manipulation model
• Elegant way to work with big data
Map-Reduce: A diagram
[Diagram: a big document flows through three phases]
• MAP: Read input and produce a set of key-value pairs
• Group by key: Collect all pairs with the same key (hash merge, shuffle, sort, partition)
• Reduce: Collect all values belonging to the key and output
Map-Reduce: An Example
Map-Reduce: In Parallel
Map-Reduce
• Programmer specifies:
• Map and Reduce and input files

• Workflow:
• Read inputs as a set of key-value pairs
• Map transforms input kv-pairs into a new set of k'v'-pairs
• Sorts & shuffles the k'v'-pairs to output nodes
• All k'v'-pairs with a given k' are sent to the same reducer
• Reduce processes all k'v'-pairs grouped by key into new k''v''-pairs
• Write the resulting pairs to files

• All phases are distributed, with many tasks doing the work
[Diagram: Input 0/1/2 → Map 0/1/2 → Shuffle → Reduce 0/1 → Out 0/1]
Data Flow
• Input and final output are stored on a distributed file system (DFS):
• Scheduler tries to schedule map tasks “close” to the physical storage location of input data

• Intermediate results are stored on the local FS of Map and Reduce workers

• Output is often input to another MapReduce task
Coordination: Master
• Master node takes care of coordination:
• Task status: (idle, in-progress, completed)
• Idle tasks get scheduled as workers become available
• When a map task completes, it sends the master the location and sizes of its R
intermediate files, one for each reducer
• Master pushes this info to reducers

• Master pings workers periodically to detect failures


Dealing with Failures
• Map worker failure
• Map tasks completed or in-progress at worker are reset to idle
• Reduce workers are notified when task is rescheduled on another worker
• Reduce worker failure
• Only in-progress tasks are reset to idle
• Reduce task is restarted
• Master failure
• MapReduce task is aborted and client is notified
How many Map and Reduce jobs?
• M map tasks, R reduce tasks
• Rule of thumb:
• Make M much larger than the number of nodes in the cluster
• One DFS chunk per map is common
• Improves dynamic load balancing and speeds up recovery from worker
failures
• Usually R is smaller than M
• Because output is spread across R files
Task Granularity & Pipelining
• Fine granularity tasks: map tasks >> machines
• Minimizes time for fault recovery
• Can do pipeline shuffling with map execution
• Better dynamic load balancing
Refinement: Combiners
• Often a Map task will produce many pairs of the form (k,v1), (k,v2), …
for the same key k
• E.g., popular words in the word count example
• Can save network time by pre-aggregating values in the mapper:
• combine(k, list(v1)) → v2
• Combiner is usually the same as the reduce function
• Works only if the reduce function is commutative and associative
Refinement: Combiners
• Back to our word counting example:
• Combiner combines the values of all keys of a single mapper (single machine):

• Much less data needs to be copied and shuffled!
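A minimal, single-process sketch of the word-count Map-Reduce job with a combiner (plain Python standing in for a real MapReduce framework; the function names are illustrative, not from any specific library):

from collections import defaultdict
from itertools import groupby

def map_fn(doc_id, text):
    # MAP: emit (word, 1) for every word in the document
    for word in text.split():
        yield (word, 1)

def combine_fn(key, values):
    # COMBINER: pre-aggregate counts produced by one mapper (same as reduce here,
    # which is valid because addition is commutative and associative)
    yield (key, sum(values))

def reduce_fn(key, values):
    # REDUCE: sum all partial counts for a word
    yield (key, sum(values))

def run_job(docs):
    # Map + combine per "mapper", then shuffle (group by key), then reduce
    mapped = []
    for doc_id, text in docs.items():
        local = defaultdict(list)
        for k, v in map_fn(doc_id, text):
            local[k].append(v)
        for k, vs in local.items():
            mapped.extend(combine_fn(k, vs))
    mapped.sort(key=lambda kv: kv[0])                       # shuffle & sort
    out = {}
    for k, group in groupby(mapped, key=lambda kv: kv[0]):
        for key, total in reduce_fn(k, [v for _, v in group]):
            out[key] = total
    return out

print(run_job({"d1": "the quick brown fox", "d2": "the lazy dog the end"}))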


Algorithms using MapReduce
• Files are very large and are rarely updated in place.

• Not suitable for every problem, not even for every problem that uses
many compute nodes operating in parallel.
Matrix-Vector Multiplication
• Compute x = M v, where M is an n×n matrix and v is a vector of length n: x_i = Σ_j m_ij v_j

• Each map task operates on a chunk of M

• Assume v fits in main memory
• Map Function:
• For each element m_ij of M
• Produce the key/value pair (i, m_ij v_j)
• Reduce Function:
• Sums all the values associated with a given key i.
• Produces the pair (i, x_i)
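A small sketch of the map and reduce functions above, assuming the whole vector v fits in memory (hypothetical helper name, shuffle simulated with a dictionary):

from collections import defaultdict

def mat_vec_mapreduce(m_entries, v):
    """m_entries: iterable of (i, j, m_ij) for the nonzero elements of M."""
    # Map: for each element m_ij emit (i, m_ij * v[j]); v is assumed to fit in memory
    mapped = ((i, m_ij * v[j]) for i, j, m_ij in m_entries)
    # Shuffle: group values by key i
    groups = defaultdict(list)
    for i, val in mapped:
        groups[i].append(val)
    # Reduce: sum values for each key i, producing (i, x_i)
    return {i: sum(vals) for i, vals in groups.items()}

M = [(0, 0, 2.0), (0, 1, 1.0), (1, 1, 3.0)]   # sparse 2x2 matrix
v = [1.0, 2.0]
print(mat_vec_mapreduce(M, v))                 # {0: 4.0, 1: 6.0}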
Matrix-Vector Multiplication: x_i = Σ_j m_ij v_j

• If v cannot fit in main memory

• Divide v into p horizontal stripes v_1, …, v_p of equal size, such that each stripe fits in main memory
• Divide M into p vertical stripes M_1, …, M_p of equal width
• Each M_k and v_k is saved as a file.
• Each Map task is assigned a chunk of some M_k and the corresponding stripe v_k
Relational Algebra Operations
• A procedural query language that takes relations as input and generates relations as output.
• Selection
• Projection
• Union, Intersection and Difference
• Natural Join
• Grouping and Aggregation
Selection (σ_C) in MapReduce
• Map
• For each tuple t in R, test if t satisfies C. If so, emit a key/value pair (t, t)

• Reduce
• Pass each key/value pair to the output.
• Identity function
Projection (π_S) in MapReduce
• Projection may cause the same tuple to appear several times
• In this version, we want to eliminate duplicates

• Map
• For each tuple t in R, construct a tuple t' by eliminating those components whose attributes are not in S
• Emit a key/value pair (t', t')
• Reduce
• For each key t' produced by any of the Map tasks, fetch the value list [t', ···, t']
• Emit a single key/value pair (t', t')
• This operation is associative and commutative, so a combiner can be used.
Union in MapReduce
• Assume input relations R and S have the same schema
• Map tasks will be assigned chunks from either R or S
• Mappers don't do much, just pass tuples on to reducers
• Reducers do duplicate elimination

• Map
• For each tuple t, emit a key/value pair (t, t)

• Reduce
• For each key t, emit a key/value pair (t, t)
[How many values can each key have?]
Intersection in MapReduce

• Map
• For each tuple t, emit a key/value pair (t, t)

• Reduce
• For each key t, if it has the value list [t, t], emit a key/value pair (t, t)
[How many values can each key have?]
Difference in MapReduce
• Find R − S
• A tuple t appears in the output if it is in R but not in S.
• Need a way to know which relation t is coming from!

• Map
• For a tuple t in R, emit a key/value pair (t, 'R')
• For a tuple t in S, emit a key/value pair (t, 'S')
• Reduce
• If key t has value list ['R'], emit a key/value pair (t, t)
Natural Join by MapReduce
• Assume we have two relations: R(A, B) and S(B, C)
• We must find tuples that agree on their B components

• Map
• For a tuple (a, b) in R, emit a key/value pair (b, ('R', a))
• For a tuple (b, c) in S, emit a key/value pair (b, ('S', c))
• Reduce
• Gets a value list corresponding to key b. The value list consists of pairs either from R or from S.
• For a key b, emit a key/value pair for each combination of values from R
and S. For example, for values ('R', a) and ('S', c), emit the key/value pair (b, (a, c)).
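A compact sketch of the natural join R(A,B) ⋈ S(B,C) in the map/shuffle/reduce style described above (plain Python, illustrative only; the joined result is emitted as (a, b, c) tuples):

from collections import defaultdict

def natural_join(R, S):
    """R: list of (a, b) tuples, S: list of (b, c) tuples; returns (a, b, c) tuples."""
    # Map: key every tuple by its B component and tag it with its relation of origin
    mapped = [(b, ('R', a)) for a, b in R] + [(b, ('S', c)) for b, c in S]
    # Shuffle: group tagged values by the join key b
    groups = defaultdict(list)
    for b, tagged in mapped:
        groups[b].append(tagged)
    # Reduce: pair every R-value with every S-value sharing the same key b
    out = []
    for b, values in groups.items():
        r_side = [a for tag, a in values if tag == 'R']
        s_side = [c for tag, c in values if tag == 'S']
        out.extend((a, b, c) for a in r_side for c in s_side)
    return out

R = [('a1', 'b1'), ('a2', 'b1'), ('a3', 'b2')]
S = [('b1', 'c1'), ('b2', 'c2'), ('b2', 'c3')]
print(natural_join(R, S))   # [('a1','b1','c1'), ('a2','b1','c1'), ('a3','b2','c2'), ('a3','b2','c3')]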
Join Example
• Relation Links(From, To) captures the structure of the Web
• A tuple (url-X, url-Y) indicates the existence of a link from url-X to url-Y
  Example tuples: (url-1, url-2), (url-1, url-3), (url-2, url-4), …

• Question
• Find paths of length two in the web.
Grouping and Aggregation in MapReduce
• Given a relation R, partition its tuples according to their values in one
set of attributes G
• The set G is called the grouping attributes
• Then, for each group, aggregate the values in certain other attributes
• Aggregation functions: SUM, COUNT, AVG, MIN, MAX, ...

• In the notation γ_X(R), X is a list of elements that can be:
• A grouping attribute
• An expression θ(A), where θ is one of the (five) aggregation functions and A is
an attribute NOT among the grouping attributes
Grouping and Aggregation in MapReduce
• Let R(A, B, C) be a relation to which we apply γA,θ(B)(R)
• The map operation prepares the grouping
• The grouping is done by the framework
• The reducer computes the aggregation
• Simplifying assumptions: one grouping attribute and one aggregation function

• Map
• For a tuple (a,b,c) emit a key/value pair (a, b)
• Reduce
• Each key a represents a group, with values [b1, b2, …, bn]
• Apply θ to the list [b1, b2, …, bn]
• Emit the key/value pair (a,x), where x = θ([b1, b2, …, bn])
Example
• A relation Friends(User, Friend) having two attributes
• Each tuple is a pair (a, b) such that b is a friend of a
  Example tuples: (u1, f1), (u1, f2), (u2, f3), …

• Question
• Compute the number of friends that each user has.
• γ_{User, COUNT(Friend)}(Friends)
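A sketch of γ_{A, θ(B)}(R) in the same map/shuffle/reduce style, covering the Friends example above (illustrative helper names; θ = COUNT is passed in as Python's len):

from collections import defaultdict

def group_and_aggregate(R, theta):
    """R: list of (a, b, ...) tuples; computes gamma_{A, theta(B)}(R)."""
    # Map: for a tuple (a, b, ...) emit the key/value pair (a, b)
    mapped = [(t[0], t[1]) for t in R]
    # Shuffle: group the b-values by key a
    groups = defaultdict(list)
    for a, b in mapped:
        groups[a].append(b)
    # Reduce: apply theta to each group's value list and emit (a, theta([b1, ..., bn]))
    return {a: theta(bs) for a, bs in groups.items()}

friends = [('u1', 'f1'), ('u1', 'f2'), ('u2', 'f3')]
# Number of friends per user: theta = COUNT
print(group_and_aggregate(friends, theta=len))   # {'u1': 2, 'u2': 1}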
Matrix Multiplication
• Compute P = M N, where p_ik = Σ_j m_ij n_jk
• Requirement: #columns of M is the same as #rows of N
• Two relations: M(I, J, V) and N(J, K, W), with tuples (i, j, m_ij) and (j, k, n_jk)
respectively
• Good representation for a large and sparse matrix!
• Solution
• Natural join followed by grouping and aggregation
Matrix Multiplication
• Map
• For each tuple (i, j, m_ij) from M, emit the key/value pair (j, ('M', i, m_ij))
• For each tuple (j, k, n_jk) from N, emit the key/value pair (j, ('N', k, n_jk))
• Reduce
• For each key j, for each value ('M', i, m_ij) that comes from M and each value ('N', k, n_jk) that comes from N, emit the key/value pair ((i, k), m_ij · n_jk)
• Map 2
• Acts like an identity function
• Reduce 2
• For each key (i, k), produce the sum of the list of values associated with the key: this is the element p_ik.
Matrix Multiplication
Matrix Multiplication : One MR
• Map
• For each tuple (i, j, m_ij) from M, emit the set of key/value pairs ((i, k), ('M', j, m_ij)) for k = 1, 2, …, K.
• For each tuple (j, k, n_jk) from N, emit the set of key/value pairs ((i, k), ('N', j, n_jk)) for i = 1, 2, …, I.
• Reduce
• For each key (i, k), the list will have 2J values, half coming from M and the other half from N.
• Multiply each pair m_ij and n_jk having the same j and take the sum of all such products (i.e. p_ik = Σ_j m_ij n_jk).
• Emit for each key a key/value pair ((i, k), p_ik).
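A sketch of the one-pass scheme above with dense replication of each element to every output cell it contributes to (toy sizes only; a sketch under the stated assumptions, not a production implementation):

from collections import defaultdict

def matmul_one_mr(M, N, I, K):
    """M: dict {(i,j): m_ij}, N: dict {(j,k): n_jk}; returns dict {(i,k): p_ik}."""
    # Map: replicate each matrix element to every (i, k) cell it contributes to
    mapped = []
    for (i, j), m_ij in M.items():
        for k in range(K):
            mapped.append(((i, k), ('M', j, m_ij)))
    for (j, k), n_jk in N.items():
        for i in range(I):
            mapped.append(((i, k), ('N', j, n_jk)))
    # Shuffle: group by the output cell (i, k)
    groups = defaultdict(list)
    for key, val in mapped:
        groups[key].append(val)
    # Reduce: pair up M- and N-values that share the same j, multiply, and sum
    P = {}
    for (i, k), values in groups.items():
        m_vals = {j: v for tag, j, v in values if tag == 'M'}
        n_vals = {j: v for tag, j, v in values if tag == 'N'}
        P[(i, k)] = sum(m_vals[j] * n_vals[j] for j in m_vals if j in n_vals)
    return P

M = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}   # 2x2
N = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}   # 2x2
print(matmul_one_mr(M, N, I=2, K=2))               # {(0,0): 19, (0,1): 22, (1,0): 43, (1,1): 50}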
Exercise
Given a very large file of integers produce as output:
(a) The largest integer.
(b) The average of all the integers.
(c) The same set of integers, but with each integer appearing only once.
(d) The count of the number of distinct integers in the input.
(a) Produce the largest integer
• Map
• For each integer i, emit a key/value pair (1, i) – all integers share a single key
• Reduce
• Gets the single key with the list of all integers as values
• Find the maximum from the list of values. Let it be v_max
• Emit a key/value pair (v_max, v_max)
• NOTE: the key is the result

It is correct but not a good solution! Find a better one!
(b) The average of all the integers.
• Map
• Let sum = 0 and count = 0
• For each integer i, sum += i and count += 1
• Emit a key/value pair (1, (sum, count))
• Reduce
• Gets the key 1 with a list of values of the form (sum, count)
• Let S = 0 and C = 0
• For each value (sum, count) in the list, S += sum and C += count
• Emit a key/value pair (avg, avg), where avg = S / C
• NOTE: the key is the result
Finding Similar Items
Dr. Subrat K Dash
Problem Statement
• Find “Similar” items
• Example:
• Given a collection of web pages find near-duplicate pages.

• Similarity
• Finding sets with a relatively large intersection!
• What if “similarity” can not be expressed in terms of intersection of sets?
Applications
• Similarity of Documents
• Finding textually similar documents (Character-level similarity and not similar
meaning!)
• Focus is on ‘similar’ not ‘identical’
• Plagiarism
• Mirror Pages
• Articles from the same source
• Collaborative Filtering
• Recommend items to users that are liked by other users having similar tastes.
• Online Purchases
• Movie Ratings
3 Essential Steps for Similar Docs
• Shingling: Convert documents to sets

• Min-Hashing: Convert large sets to short signatures, while preserving similarity

• Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
• Candidate pairs!
The Big Picture
[Pipeline: Document → Shingling → the set of strings of length k that appear in the document → Min-Hashing → Signatures: short integer vectors that represent the sets and reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity]
Shingling of Documents
• How to represent documents as sets?
• A k-shingle (or k-gram) for a document is a sequence of k tokens that
appears in the doc
• Tokens can be characters, words or something else, depending on the application
• Assume tokens = characters for examples

• Example: k=2; document D1 = abcab


Set of 2-shingles: S(D1) = {ab, bc, ca}
• Option: Shingles as a bag (multiset), count ‘ab’ twice: S’(D1) = {ab, bc, ca, ab}

• Optimal k?
Compressing Shingles
• To compress long shingles, we can hash them to (say) 4 bytes
• Represent a document by the set of hash values of its k-shingles
• Idea: Two documents could (rarely) appear to have shingles in common,
when in fact only the hash-values were shared
• Example: k=2; document D1 = abcab
Set of 2-shingles: S(D1) = {ab, bc, ca}
Hash the shingles: h(D1) = {1, 5, 7}
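A tiny sketch of character k-shingling and hashed shingles (Python's built-in hash() stands in here for a fixed ~4-byte hash function):

def shingles(doc: str, k: int = 2) -> set:
    # All k-shingles (character k-grams) of the document
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def hashed_shingles(doc: str, k: int = 2, buckets: int = 2**32) -> set:
    # Represent the document by hash values of its shingles (compressed representation)
    return {hash(s) % buckets for s in shingles(doc, k)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

d1, d2 = "abcab", "abcde"
print(shingles(d1, 2))                              # {'ab', 'bc', 'ca'}
print(jaccard(shingles(d1, 2), shingles(d2, 2)))    # 0.4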
Similarity Metric for Shingles
• Document D1 is represented by the set of its k-shingles C1 = S(D1)
• Equivalently, each document is a 0/1 vector in the space of k-shingles
• Each unique shingle is a dimension
• Vectors are very sparse
• A natural similarity measure is the Jaccard similarity:
sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|
Challenging Task!
• Suppose we need to find near-duplicate documents among N = 1 million documents

• Naïvely, we would have to compute pairwise Jaccard similarities for every pair of docs
• N(N−1)/2 ≈ 5·10^11 comparisons
• At 10^5 secs/day and 10^6 comparisons/sec, it would take 5 days

• For 10 million documents, it takes more than a year…

• Solution: MinHashing!
MinHashing
• Convert large sets to short signatures, while preserving similarity
[Pipeline: Document → the set of strings of length k that appear in the document → Signatures: short integer vectors that represent the sets and reflect their similarity]
Encoding Sets as Bit Vectors
• Many similarity problems can be formalized as finding subsets that
have significant intersection
• Encode sets using 0/1 (bit, boolean) vectors
• One dimension per element in the universal set
• Interpret set intersection as bitwise AND, and
set union as bitwise OR
• Example: C1 = 10111; C2 = 10011
• Size of intersection = 3; size of union = 4,
• Jaccard similarity (not distance) = 3/4
• Distance: d(C1,C2) = 1 – (Jaccard similarity) = 1/4
From Sets to Boolean Matrices
• Rows = elements (shingles)
• Columns = sets (documents)
• 1 in row e and column s if and only if e is a member of s
• Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)
• Typical matrix is sparse!
• Each document is a column. Example (shingles × documents):
  C1 C2 C3 C4
   1  1  1  0
   1  1  0  1
   0  1  0  1
   0  0  0  1
   1  0  0  1
   1  1  1  0
   1  0  1  0
• Example: sim(C1, C2) = ?
• Size of intersection = 3; size of union = 6,
Jaccard similarity (not distance) = 3/6
• d(C1, C2) = 1 – (Jaccard similarity) = 3/6
Outline: Finding Similar Columns
• Next Goal: Find similar columns, Small signatures
• Naïve approach:
• 1) Signatures of columns: small summaries of columns
• 2) Examine pairs of signatures to find similar columns
• Essential: Similarities of signatures and columns are related
• 3) Optional: Check that columns with similar signatures are really similar
• Warnings:
• Comparing all pairs may take too much time: Job for LSH
• These methods can produce false negatives, and even false positives (if the optional
check is not made)
Hashing Columns (Signatures)
• Key idea: “hash” each column C to a small signature h(C), such that:
• (1) h(C) is small enough that the signature fits in RAM
• (2) sim(C1, C2) is the same as the “similarity” of signatures h(C1) and h(C2)

• Goal: Find a hash function h(·) such that:


• If sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
• If sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)

• Hash docs into buckets. Expect that “most” pairs of near duplicate
docs hash into the same bucket!
Min-Hashing
• Goal: Find a hash function h(·) such that:
• if sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
• if sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)

• Clearly, the hash function depends on the similarity metric:


• Not all similarity metrics have a suitable hash function

• There is a suitable hash function for the Jaccard similarity: It is called


Min-Hashing
Min-Hashing
• Imagine the rows of the boolean matrix permuted under a random permutation π

• Define a “hash” function h_π(C) = the index of the first (in the permuted order π) row in which column C has value 1:
h_π(C) = min_π π(C)

• Use several (e.g., 100) independent hash functions (that is, permutations) to create a signature of a column
Note: Another (equivalent) way is to store row indexes.
Min-Hashing Example
Input matrix (shingles × documents) and three permutations π1, π2, π3 (given as columns):

  π1 π2 π3 |  C1 C2 C3 C4
   2  4  3 |   1  0  1  0
   3  2  4 |   1  0  0  1
   7  1  7 |   0  1  0  1
   6  3  2 |   0  1  0  1
   1  6  6 |   0  1  0  1
   5  7  1 |   1  0  1  0
   4  5  5 |   1  0  1  0

Signature matrix M (one row per permutation):
   2  1  2  1     (e.g. under π1, the 2nd element of the permutation is the first to map to a 1 in C1)
   2  1  4  1     (e.g. under π2, the 4th element of the permutation is the first to map to a 1 in C3)
   1  2  1  2
The Min-Hash Property
• Choose a random permutation π
• Claim: Pr[h_π(C1) = h_π(C2)] = sim(C1, C2)
• Why?
• Let X be a doc (set of shingles), y ∈ X is a shingle
• Then: Pr[π(y) = min(π(X))] = 1/|X|
• It is equally likely that any y ∈ X is mapped to the min element
• Let y be such that π(y) = min(π(C1 ∪ C2))
• Then either: π(y) = min(π(C1)) if y ∈ C1, or π(y) = min(π(C2)) if y ∈ C2
  (one of the two columns had to have a 1 at position y)
• So the prob. that both are true is the prob. that y ∈ C1 ∩ C2

• Pr[min(π(C1)) = min(π(C2))] = |C1 ∩ C2| / |C1 ∪ C2| = sim(C1, C2)


Four Types of Rows
• Given cols C1 and C2, rows may be classified as:
      C1 C2
   A   1  1
   B   1  0
   C   0  1
   D   0  0
• a = # rows of type A, etc.
• Note: sim(C1, C2) = a / (a + b + c)
• Then: Pr[h(C1) = h(C2)] = sim(C1, C2)
• Look down the cols C1 and C2 until we see a 1
• If it's a type-A row, then h(C1) = h(C2); if it's a type-B or type-C row, then not
Similarity for Signatures
• We know: Pr[h(C1) = h(C2)] = sim(C1, C2)

• Now generalize to multiple hash functions

• The similarity of two signatures is the fraction of the hash functions


in which they agree

• Note: Because of the Min-Hash property, the similarity of columns is


the same as the expected similarity of their signatures
Min-Hashing Example
Input matrix, permutations, and signature matrix M as in the earlier Min-Hashing example.

Similarities:
            1-3   2-4   1-2   3-4
  Col/Col   0.75  0.75  0     0
  Sig/Sig   0.67  1.00  0     0
Min-Hash Signatures
• Pick K = 100 random permutations of the rows
• Think of sig(C) as a column vector
• sig(C)[i] = according to the i-th permutation, the index of the first row that has a 1 in column C
sig(C)[i] = min (π_i(C))
• Note: The sketch (signature) of document C is small – on the order of 100 bytes!

• We achieved our goal! We “compressed” long bit vectors into short signatures
Implementation Trick
• Permuting rows even once is prohibitive
• Row hashing!
• Pick K (say 100) hash functions h_i (i = 1 to 100)
• Ordering under h_i gives a random row permutation!
• One-pass implementation
• For each column C and hash function h_i, keep a “slot” for the min-hash value
• Initialize all sig(C)[i] = ∞
• Scan rows looking for 1s
• Suppose row j has 1 in column C
• Then for each h_i:
• If h_i(j) < sig(C)[i], then sig(C)[i] ← h_i(j)

How to pick a random hash function h(x)? Universal hashing:
h_{a,b}(x) = ((a·x + b) mod p) mod N, where a, b are random integers and p is a prime number (p > N)
Algorithm: Computing Minhash Signatures
Initialize the signature matrix S to ∞ (i.e. S[i,c] = ∞)

for each row r of M
    for each hash function h_i
        compute h_i(r)
    for each column c of M
        if M[r,c] = 1 then
            for each hash function h_i
                if h_i(r) < S[i,c] then
                    S[i,c] = h_i(r)

Input matrix M (example):        Signature matrix S (initialized):
  1 0 0 1                          ∞ ∞ ∞ ∞
  0 0 1 0                          ∞ ∞ ∞ ∞
  0 1 0 1
  1 0 1 1
  0 0 1 0
Example
Input matrix M (5 rows × 4 columns), rows numbered 0–4:
  row 0: 1 0 0 1
  row 1: 0 0 1 0
  row 2: 0 1 0 1
  row 3: 1 0 1 1
  row 4: 0 0 1 0

Hash functions: h1(r) = (r+1) mod 5, h2(r) = (3*r+1) mod 5

Signature matrix S after processing each row:
  row 0 (h1=1, h2=1):  1 ∞ ∞ 1 / 1 ∞ ∞ 1
  row 1 (h1=2, h2=4):  1 ∞ 2 1 / 1 ∞ 4 1
  row 2 (h1=3, h2=2):  1 3 2 1 / 1 2 4 1
  row 3 (h1=4, h2=0):  1 3 2 1 / 0 2 0 0
  row 4 (h1=0, h2=3):  1 3 0 1 / 0 2 0 0
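A direct Python sketch of the algorithm above, reproducing the example's final signature matrix:

INF = float('inf')

def minhash_signatures(M, hash_funcs):
    """M: boolean matrix as a list of rows; returns the signature matrix S."""
    n_cols = len(M[0])
    # Initialize every signature slot to infinity
    S = [[INF] * n_cols for _ in hash_funcs]
    for r, row in enumerate(M):                 # scan rows looking for 1s
        h_vals = [h(r) for h in hash_funcs]     # compute h_i(r) once per row
        for c, bit in enumerate(row):
            if bit == 1:
                for i, hv in enumerate(h_vals):
                    if hv < S[i][c]:            # keep the minimum hash value seen
                        S[i][c] = hv
    return S

M = [[1, 0, 0, 1],
     [0, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 0, 1, 1],
     [0, 0, 1, 0]]
h1 = lambda r: (r + 1) % 5
h2 = lambda r: (3 * r + 1) % 5
print(minhash_signatures(M, [h1, h2]))   # [[1, 3, 0, 1], [0, 2, 0, 0]]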
Locality Sensitive Hashing
• Focus on pairs of signatures likely to be from similar documents
[Pipeline: Document → the set of strings of length k that appear in the document → Signatures: short integer vectors that represent the sets and reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity]
LSH: First Cut
• Goal: Find documents with Jaccard similarity at least s (for some
similarity threshold, e.g., s=0.8)

• LSH – General idea: Use a function f(x,y) that tells whether x and y are
a candidate pair: a pair of elements whose similarity must be
evaluated

• For Min-Hash matrices:


• Hash columns of signature matrix M to many buckets
• Each pair of documents that hashes into the same bucket is a candidate pair
Candidates from Min-Hash
• Pick a similarity threshold s (0 < s < 1)
• Columns x and y of M are a candidate pair if their signatures agree on at least
fraction s of their rows:
M (i, x) = M (i, y) for at least frac. s values of i
• We expect documents x and y to have the same (Jaccard) similarity as their signatures

• Big idea: Hash columns of signature matrix M several times


• Arrange that (only) similar columns are likely to hash to the same bucket, with
high probability
• Candidate pairs are those that hash to the same bucket
Partition M into b bands
[Diagram: the signature matrix M is divided into b bands of r rows each; one column of M is one signature]
Partition M into b bands
• Divide matrix M into b bands of r rows

• For each band, hash its portion of each column to a hash table with k
buckets
• Make k as large as possible

• Candidate column pairs are those that hash to the same bucket for ≥
1 band

• Tune b and r to catch most similar pairs, but few non-similar pairs
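A minimal sketch of the banding step: hash each band of each signature column to buckets and collect columns that collide in at least one band as candidate pairs (plain Python dictionaries stand in for the per-band hash tables):

from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, b, r):
    """signatures: dict {doc_id: signature list of length b*r}; returns candidate pairs."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)              # one hash table per band
        for doc_id, sig in signatures.items():
            chunk = tuple(sig[band * r:(band + 1) * r])
            buckets[hash(chunk)].append(doc_id)  # hash the band's portion of the column
        for _bucket, docs in buckets.items():
            # every pair that shares a bucket in >= 1 band becomes a candidate
            candidates.update(combinations(sorted(docs), 2))
    return candidates

sigs = {"d1": [1, 3, 0, 1, 2, 2], "d2": [1, 3, 0, 1, 9, 9], "d3": [7, 7, 7, 7, 7, 7]}
print(lsh_candidate_pairs(sigs, b=2, r=3))       # {('d1', 'd2')}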
Hashing Bands
[Diagram: each band (r rows) of the signature matrix M is hashed into buckets. Columns 2 and 6 land in the same bucket for this band, so they are probably identical (candidate pair); columns 6 and 7 land in different buckets, so they are surely different in this band.]
Analysis of the banding technique
• Columns C1 and C2 have similarity t
• Pick any band (r rows)
• Prob. that all rows in the band are equal = t^r
• Prob. that some row in the band is unequal = 1 − t^r

• Prob. that no band is identical = (1 − t^r)^b

• Prob. that at least 1 band is identical = 1 − (1 − t^r)^b


What b Bands of r Rows Gives You
[Plot: probability of sharing a bucket as a function of the similarity t = sim(C1, C2) of two sets. The curve is 1 − (1 − t^r)^b, and the threshold where it rises steeply is s ≈ (1/b)^(1/r). Below the threshold: some row of every band is unequal (no band identical); above it: all rows of at least one band are equal (at least one band identical).]
Analysis of LSH – What We Want
[Plot: the ideal step function – probability of sharing a bucket = 1 if t > s, and no chance of sharing a bucket if t < s, where t = sim(C1, C2) and s is the similarity threshold.]
Example of Bands
Assume the following case:
• Suppose 100,000 columns of M (100k docs)
• Signatures of 100 integers (rows)
• Therefore, signatures take 40MB
• Choose b = 20 bands of r = 5 integers/band

• Goal: Find pairs of documents that are at least s = 0.8 similar
C1, C2 are 80% Similar
• Find pairs of ≥ s = 0.8 similarity, set b = 20, r = 5
• Assume: sim(C1, C2) = 0.8
• Since sim(C1, C2) ≥ s, we want C1, C2 to be a candidate pair: We want them to
hash to at least 1 common bucket (at least one band is identical)
• Probability C1, C2 are identical in one particular band: (0.8)^5 = 0.328
• Probability C1, C2 are not identical in any of the 20 bands: (1 − 0.328)^20 = 0.00035
• i.e., about 1/3000th of the 80%-similar column pairs
are false negatives (we miss them)
• We would find 99.965% of the pairs of truly similar documents
C1, C2 are 30% Similar
• Find pairs of ≥ s = 0.8 similarity, set b = 20, r = 5
• Assume: sim(C1, C2) = 0.3
• Since sim(C1, C2) < s, we want C1, C2 to hash to NO
common buckets (all bands should be different)
• Probability C1, C2 are identical in one particular band: (0.3)^5 = 0.00243
• Probability C1, C2 are identical in at least 1 of 20 bands: 1 − (1 − 0.00243)^20 = 0.0474
• In other words, approximately 4.74% of pairs of docs with similarity 0.3 end up
becoming candidate pairs
• They are false positives since we will have to examine them (they are candidate pairs)
but then it will turn out their similarity is below threshold s
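A few lines of Python reproducing the two calculations above via the S-curve formula 1 − (1 − t^r)^b:

def prob_candidate(t: float, b: int, r: int) -> float:
    # Probability that two columns with similarity t share >= 1 bucket under b bands of r rows
    return 1 - (1 - t ** r) ** b

print(prob_candidate(0.8, b=20, r=5))   # ~0.99965 -> ~0.035% false negatives at t = 0.8
print(prob_candidate(0.3, b=20, r=5))   # ~0.0474  -> ~4.74% false positives at t = 0.3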
Picking r and b: The S-curve
• Picking r and b to get the best S-curve
• 50 hash-functions (r = 5, b = 10)
[Plot: probability of sharing a bucket vs. similarity, for r = 5 and b = 10. Blue area: false negative rate; green area: false positive rate.]
Summary: 3 Steps
• Shingling: Convert documents to sets
• We used hashing to assign each shingle an ID
• Min-Hashing: Convert large sets to short signatures, while preserving similarity
• We used similarity-preserving hashing to generate signatures with the property
Pr[h(C1) = h(C2)] = sim(C1, C2)
• We used hashing to get around generating random permutations
• Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
• We used hashing to find candidate pairs of similarity ≥ s
Finding Similar Items
Dr. Subrat K Dash
Distance Measures
Definition
• A distance measure 𝑑(𝑥, 𝑦) is a function on pairs of points in a space that
produces a real number satisfying the following:
• 𝑑(𝑥, 𝑦) ≥ 0
• 𝑑(𝑥, 𝑦) = 0 if and only if 𝑥 = 𝑦
• 𝑑(𝑥, 𝑦) = 𝑑(𝑦, 𝑥) (symmetric)
• 𝑑(𝑥, 𝑦) ≤ 𝑑(𝑥, 𝑧) + 𝑑(𝑧, 𝑦) (triangle inequality)
Distance Measures
• Euclidean Distance:
• L2-norm (the conventional Euclidean distance): 𝑑([𝑥1, …, 𝑥𝑛], [𝑦1, …, 𝑦𝑛]) = sqrt(Σ_{i=1}^{n} (𝑥𝑖 − 𝑦𝑖)^2)

• Lr-norm: 𝑑([𝑥1, …, 𝑥𝑛], [𝑦1, …, 𝑦𝑛]) = (Σ_{i=1}^{n} |𝑥𝑖 − 𝑦𝑖|^r)^(1/r)

• L∞-norm: the limit as r approaches ∞. The maximum of |𝑥𝑖 − 𝑦𝑖| over all dimensions i.
Distance Measures
• Jaccard Distance: (between two sets)
• 1 − 𝐽𝑆𝐼𝑀(𝑥, 𝑦)

• Cosine Distance
• The angle between two vectors in any space that has dimensions (e.g., Euclidean space).
• Range of values: [0, 180] degrees
Distance Measures
• Edit Distance
• Defined for strings.
• The edit distance between two strings 𝑥 = 𝑥1𝑥2…𝑥𝑛 and 𝑦 = 𝑦1𝑦2…𝑦𝑚 is
• the smallest number of insertions and deletions of single characters that will convert x to y.
• = Length of x + length of y − 2 · length of LCS(x, y)
• LCS(x, y) = the longest string obtained by deleting positions from x and y.
• Example:
• 𝑥 = 𝑎𝑏𝑐𝑑𝑒 and 𝑦 = 𝑎𝑐𝑓𝑑𝑒𝑔, then the edit distance between x and y is 3.
• 𝐿𝐶𝑆(𝑎𝑏𝑐𝑑𝑒, 𝑎𝑐𝑓𝑑𝑒𝑔) = 𝑎𝑐𝑑𝑒
• 𝐿𝐶𝑆(𝑎𝑏𝑎, 𝑏𝑎𝑏) = 𝑎𝑏 or 𝑏𝑎
Distance Measures
• Hamming Distance
• It is the number of components in which two vectors x and y differ.
• Defined mostly for Boolean vectors
• For x=10101 and y=11110, d(x,y) = 3
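A small illustrative sketch (not from the slides) computing the measures above; the LCS-based edit distance counts insertions and deletions only, as defined here:

from math import acos, degrees, sqrt

def lr_norm(x, y, r=2):
    # Lr-norm distance; r=2 gives the conventional Euclidean distance
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

def jaccard_distance(a: set, b: set) -> float:
    return 1 - len(a & b) / len(a | b)

def cosine_distance_deg(x, y) -> float:
    dot = sum(a * b for a, b in zip(x, y))
    return degrees(acos(dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))))

def lcs_len(x: str, y: str) -> int:
    # Classic dynamic program for the longest common subsequence
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, cx in enumerate(x, 1):
        for j, cy in enumerate(y, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if cx == cy else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def edit_distance(x: str, y: str) -> int:
    # Insert/delete-only edit distance = |x| + |y| - 2*|LCS(x, y)|
    return len(x) + len(y) - 2 * lcs_len(x, y)

def hamming(x: str, y: str) -> int:
    return sum(a != b for a, b in zip(x, y))

print(edit_distance("abcde", "acfdeg"))       # 3
print(hamming("10101", "11110"))              # 3
print(cosine_distance_deg([1, 0], [0, 1]))    # 90.0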
Methods for High Degrees of Similarity
• Find sets that are almost identical
• Find every pair of items with the desired degree of similarity.
• No false negative
Finding Identical Items
• Find Web pages that are identical
• But avoid comparing every pair of documents!
• Approach 1:
• Use hash function on entire document
• Compare documents that fall in the same bucket
• Examine every character of every document
• Approach 2:
• Hash each page based on first few characters.
• Compare those documents that fall in the same bucket
• Fails if documents have common prefix like HTML documents
• Approach 3:
• Pick some fixed random positions for all documents and hash on corresponding
characters.
Sets as Strings
• Represent set as an ordered sequence of values (string).
• Set {d,a,b} is represented as abd

• Many-one problem: Given a string s, find all strings in a database that


match with s.
• Many-Many problem: Find all strings that are identical.

• Goal: Limit the number of comparisons that must be made to identify


all pairs of similar strings
Length-based Filtering
• Sort the strings by length
• Each string s is compared with those strings
t that follow s in the list if 𝐽 ≤ 𝐿𝑠 /𝐿𝑡 i.e.
𝐿𝑡 ≤ 𝐿𝑠 /𝐽

• Example:
• Suppose s is a string of length 9 then compare
it with strings that have length at most 9/0.9 =
10.
Prefix Indexing
• Index each string s by the first p symbols of s.
• Add string s to the index bucket for each of its first p symbols.
• Compare strings that fall in the same bucket.
• Any other string t such that SIM(s, t) ≥ J will have at least one symbol
in its prefix that also appears in the prefix of s.
• 𝑝 = ⌊(1 − 𝐽) 𝐿𝑠⌋ + 1
• Example:
• If J = 0.9 and Ls = 9, then p = 1. s needs to be indexed under only its first symbol.
Using Position Information
• Motivation
• Let s = acdefghijk and t = bcdefghijk and J = 0.9
• As per prefix indexing, both s and t will be indexed using their first 2
(𝑝 = ⌊(1 − 𝐽) 𝐿𝑠⌋ + 1) characters. Thus, both are in the same bucket because of
indexing on character ‘c’.
• But SIM(s, t) = 9/11 < 0.9
• Can we avoid comparing s and t?
• Solution
• Build the index using the character as well as the position of the character within the string.
Using Position Information
• The index is based on a pair (x, i), where x is the symbol at position i in the prefix of a string s.
• Given a string s and minimum desired Jaccard similarity J, consider the
prefix of s (first p symbols – positions 1 through ⌊(1 − 𝐽) 𝐿𝑠⌋ + 1). If
the symbol in position i of the prefix of s is x, add s to the index
bucket for (x, i).

• Consider a many-one problem where s is a probe string. With what
buckets must it be compared?
Using Position Information
• Given s as a probe string:
• Consider the prefix of s (first p symbols)
• If the i-th symbol of s (in the prefix) is x, then look in the bucket (x, j) for certain
small values of j.
• Do not find a string t which has already been compared with.
• Find an upper bound on j:
• Highest possible value of SIM(s, t) = (𝐿𝑠 − 𝑖 + 1) / (𝐿𝑠 + 𝑗 − 1) ≥ 𝐽
• Therefore 𝑗 ≤ (𝐿𝑠(1 − 𝐽) − 𝑖 + 1 + 𝐽) / 𝐽
Using Position Information
• Let s = acdefghijk and t = bcdefghijk and J = 0.9
• The prefix of s is “ac”. So, i can be 1 or 2.
• If i = 1: 𝑗 ≤ (10×0.1 − 1 + 1 + 0.9) / 0.9 = 2.11. Thus j can be 1 or 2.
• So, compare the symbol a with strings in the buckets (a, j)
• If i = 2: 𝑗 ≤ (10×0.1 − 2 + 1 + 0.9) / 0.9 = 1. Thus, j can be 1.
• So, compare the symbol c with strings in the bucket (c, j)
• Thus, for probe string s, look in the buckets (a,1), (a,2), (c,1).
Using Position and Length in indexes
• Use index buckets corresponding to a symbol, a position, and the
suffix length, that is, the number of symbols following the position in
question.
• The string s = acdefghijk, with J = 0.9, would be indexed in the
buckets for (a, 1, 9) and (c, 2, 8).
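A small sketch of computing the prefix length p and the (symbol, position, suffix-length) index keys described above (illustrative only):

from math import floor

def prefix_length(s: str, J: float) -> int:
    # p = floor((1 - J) * Ls) + 1  (round() guards against floating-point error, e.g. (1-0.9)*10)
    return floor(round((1 - J) * len(s), 9)) + 1

def index_keys(s: str, J: float):
    # One bucket per prefix position: (symbol, position, number of symbols after that position)
    p = prefix_length(s, J)
    return [(s[i], i + 1, len(s) - (i + 1)) for i in range(p)]

print(prefix_length("acdefghijk", 0.9))   # 2
print(index_keys("acdefghijk", 0.9))      # [('a', 1, 9), ('c', 2, 8)]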
Note to other teachers and users of these slides: We would be delighted if you found this our
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org

Mining of Massive Datasets


Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
High dim. data: Locality sensitive hashing, Clustering, Dimensionality reduction
Graph data: PageRank/SimRank, Community Detection, Spam Detection
Infinite data: Filtering data streams, Queries on streams, Web advertising
Machine learning: SVM, Decision Trees, Perceptron/kNN
Apps: Recommender systems, Association Rules, Duplicate document detection


 In many data mining situations, we do not
know the entire data set in advance

 Stream Management is important when the


input rate is controlled externally:
 Google queries
 Twitter or Facebook status updates
 We can think of the data as infinite and
non-stationary (the distribution changes
over time)
 Input elements enter at a rapid rate,
at one or more input ports (i.e., streams)
 We call elements of the stream tuples

 The system cannot store the entire stream


accessibly

 Q: How do you make critical calculations


about the stream using a limited amount of
(secondary) memory?



 Stochastic Gradient Descent (SGD) is an
example of a stream algorithm
 In Machine Learning we call this: Online Learning
 Allows for modeling problems where we have
a continuous stream of data
 We want an algorithm to learn from it and
slowly adapt to the changes in data
 Idea: Do slow updates to the model
 SGD (SVM, Perceptron) makes small updates
 So: First train the classifier on training data.
 Then: For every example from the stream, we slightly
update the model (using small learning rate)
[Diagram: the general stream-processing model. Multiple streams enter the Processor (e.g. “… 1, 5, 2, 7, 0, 9, 3”, “… a, r, v, t, y, h, b”, “… 0, 0, 1, 0, 1, 1, 0”), each stream composed of elements/tuples arriving over time. The Processor answers Standing Queries and Ad-Hoc Queries, produces Output, and has access to Limited Working Storage and Archival Storage.]


 Types of queries one wants to answer on
a data stream: (we’ll do these today)
 Sampling data from a stream
 Construct a random sample
 Queries over sliding windows
 Number of items of type x in the last k elements
of the stream


 Types of queries one wants to answer on
a data stream: (we’ll do these next time)
 Filtering a data stream
 Select elements with property x from the stream
 Counting distinct elements
 Number of distinct elements in the last k elements
of the stream
 Estimating moments
 Estimate avg./std. dev. of last k elements
 Finding frequent elements


 Mining query streams
 Google wants to know what queries are
more frequent today than yesterday

 Mining click streams


 Yahoo wants to know which of its pages are
getting an unusual number of hits in the past hour

 Mining social network news feeds


 E.g., look for trending topics on Twitter, Facebook



 Sensor Networks
 Many sensors feeding into a central controller
 Telephone call records
 Data feeds into customer bills as well as
settlements between telephone companies
 IP packets monitored at a switch
 Gather information for optimal routing
 Detect denial-of-service attacks



As the stream grows, the sample
also gets bigger
 Since we cannot store the entire stream,
one obvious approach is to store a sample
 Two different problems:
 (1) Sample a fixed proportion of elements
in the stream (say 1 in 10)
 (2) Maintain a random sample of fixed size
over a potentially infinite stream
 At any “time” k we would like a random sample
of s elements
 What is the property of the sample we want to maintain?
For all time steps k, each of the k elements seen so far has
equal prob. of being sampled
 Problem 1: Sampling fixed proportion
 Scenario: Search engine query stream
 Stream of tuples: (user, query, time)
 Answer questions such as: How often did a user
run the same query in a single day?
 Have space to store 1/10th of the query stream
 Naïve solution:
 Generate a random integer in [0..9] for each query
 Store the query if the integer is 0, otherwise
discard
 Simple question: What fraction of queries by an
average search engine user are duplicates?
 Suppose each user issues x queries once and d queries
twice (total of x+2d queries) and no search queries
more than twice.
 Correct answer: d/(x+d)
 Proposed solution: We keep 10% of the queries
 Sample will contain x/10 of the singleton queries and
2d/10 of the duplicate queries at least once
 But only d/100 pairs of duplicates
 d/100 = 1/10 ∙ 1/10 ∙ d
 Of d “duplicates” 18d/100 appear exactly once
 18d/100 = ((1/10 ∙ 9/10)+(9/10 ∙ 1/10)) ∙ d
 So the sample-based answer is
(d/100) / (x/10 + d/100 + 18d/100) = d / (10x + 19d)
Solution:
 Pick 1/10th of users and take all their
searches in the sample

 Use a hash function that hashes the


user name or user id uniformly into 10
buckets



 Stream of tuples with keys:
 Key is some subset of each tuple’s components
 e.g., tuple is (user, search, time); key is user
 Choice of key depends on application

 To get a sample of a/b fraction of the stream:


 Hash each tuple’s key uniformly into b buckets
 Pick the tuple if its hash value is at most a

Hash table with b buckets, pick the tuple if its hash value is at most a.
How to generate a 30% sample?
Hash into b=10 buckets, take the tuple if it hashes to one of the first 3 buckets
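A sketch of key-based sampling of an a/b fraction of the stream (hashing the key, not the whole tuple, so that all of a user's queries are kept or dropped together; hashlib is used as an illustrative uniform hash):

import hashlib

def keep_tuple(key: str, a: int, b: int) -> bool:
    # Hash the key uniformly into b buckets; keep the tuple if it lands in the first a buckets
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % b
    return bucket < a

stream = [("alice", "q1"), ("bob", "q2"), ("alice", "q1"), ("carol", "q3")]
sample = [t for t in stream if keep_tuple(t[0], a=3, b=10)]   # ~30% of users
print(sample)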
As the stream grows, the sample is of
fixed size
 Problem 2: Fixed-size sample
 Suppose we need to maintain a random
sample S of size exactly s tuples
 E.g., main memory size constraint
 Why? Don’t know length of stream in advance
 Suppose at time n we have seen n items
 Each item is in the sample S with equal prob. s/n
How to think about the problem: say s = 2
Stream: a x c y z k c d e g…
At n= 5, each of the first 5 tuples is included in the sample S with equal prob.
At n= 7, each of the first 7 tuples is included in the sample S with equal prob.
Impractical solution would be to store all the n tuples seen
so far and out of them pick s at random
 Algorithm (a.k.a. Reservoir Sampling)
 Store all the first s elements of the stream to S
 Suppose we have seen n-1 elements, and now
the nth element arrives (n > s)
 With probability s/n, keep the nth element, else discard it
 If we picked the nth element, then it replaces one of the
s elements in the sample S, picked uniformly at random

 Claim: This algorithm maintains a sample S


with the desired property:
 After n elements, the sample contains each
element seen so far with probability s/n
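A sketch of reservoir sampling as described above:

import random

def reservoir_sample(stream, s: int):
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            sample.append(item)                     # store the first s elements
        elif random.random() < s / n:               # keep the n-th element with prob. s/n
            sample[random.randrange(s)] = item      # it replaces a uniformly chosen element
    return sample

print(reservoir_sample(range(1, 101), s=5))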
 We prove this by induction:
 Assume that after n elements, the sample contains
each element seen so far with probability s/n
 We need to show that after seeing element n+1
the sample maintains the property
 Sample contains each element seen so far with
probability s/(n+1)
 Base case:
 After we see n=s elements the sample S has the
desired property
 Each out of n=s elements is in the sample with
probability s/s = 1



 Inductive hypothesis: After n elements, the sample
S contains each element seen so far with prob. s/n
 Now element n+1 arrives
 Inductive step: For elements already in S, the
probability that the algorithm keeps them in S is:
(1 − s/(n+1)) + (s/(n+1)) · ((s−1)/s) = n/(n+1)
(first term: element n+1 is discarded; second term: element n+1 is not discarded, but the element already in the sample is not the one replaced)
 So, at time n, tuples in S were there with prob. s/n
 Time n→n+1, the tuple stayed in S with prob. n/(n+1)
 So prob. the tuple is in S at time n+1 = (s/n) · (n/(n+1)) = s/(n+1)
 Each element of data stream is a tuple
 Given a list of keys S
 Determine which tuples of stream are in S

 Obvious solution: Hash table


 But suppose we do not have enough memory to
store all of S in a hash table
 E.g., we might be processing millions of filters
on the same stream



 Example: Email spam filtering
 We know 1 billion “good” email addresses
 If an email comes from one of these, it is NOT
spam

 Publish-subscribe systems
 You are collecting lots of messages (news articles)
 People express interest in certain sets of keywords
 Determine whether each message matches user’s
interest



 Given a set of keys S that we want to filter
 Create a bit array B of n bits, initially all 0s
 Choose a hash function h with range [0, n)
 Hash each member s ∈ S to one of
n buckets, and set that bit to 1, i.e., B[h(s)] = 1
 Hash each element a of the stream and
output only those that hash to a bit that was
set to 1
 Output a if B[h(a)] == 1


[Diagram: an item is hashed by h into the bit array B (e.g. 0010001011000). If it hashes to a bit set to 1, output the item, since it may be in S – it hashes to a bucket that at least one of the items in S hashed to. If it hashes to a bit set to 0, drop the item: it is surely not in S.]

 Creates false positives but no false negatives
 If the item is in S we surely output it; if not, we may
still output it
 |S| = 1 billion email addresses
|B| = 1GB = 8 billion bits
 If the email address is in S, then it surely
hashes to a bucket that has the bit set to 1,
so it always gets through (no false negatives)
 Approximately 1/8 of the bits are set to 1, so
about 1/8th of the addresses not in S get
through to the output (false positives)
 Actually, less than 1/8th, because more than one
address might hash to the same bit
 More accurate analysis for the number of
false positives

 Consider: If we throw m darts into n equally


likely targets, what is the probability that
a target gets at least one dart?

 In our case:
 Targets = bits/buckets
 Darts = hash values of items



 We have m darts, n targets
 What is the probability that a target gets at
least one dart?

P(target X not hit by any dart) = (1 − 1/n)^m = ((1 − 1/n)^n)^(m/n) ≈ e^(−m/n)
(using (1 − 1/n)^n → 1/e as n → ∞)

P(at least one dart hits target X) = 1 − (1 − 1/n)^m ≈ 1 − e^(−m/n)


 Fraction of 1s in the array B =
= probability of false positive = 1 − e^(−m/n)

 Example: 10^9 darts, 8·10^9 targets

 Fraction of 1s in B = 1 − e^(−1/8) = 0.1175
 Compare with our earlier estimate: 1/8 = 0.125


 Consider: |S| = m, |B| = n
 Use k independent hash functions h1, …, hk
 Initialization:
 Set B to all 0s
 Hash each element s ∈ S using each hash function hi,
set B[hi(s)] = 1 (for each i = 1, .., k) (note: we have a
single array B!)
 Run-time:
 When a stream element with key x arrives
 If B[hi(x)] = 1 for all i = 1, ..., k, then declare that x is in S
 That is, x hashes to a bucket set to 1 for every hash function hi(x)
 Otherwise discard the element x
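A compact Bloom filter sketch with k hash functions (a Python list of 0/1 values stands in for the bit array, and salted MD5 digests are an illustrative choice of hash family; real implementations use a packed bit vector):

import hashlib

class BloomFilter:
    def __init__(self, n_bits: int, k: int):
        self.n, self.k = n_bits, k
        self.bits = [0] * n_bits                    # single bit array B shared by all k hashes

    def _hashes(self, key: str):
        # Derive k hash values from salted digests
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.n

    def add(self, key: str):
        for pos in self._hashes(key):
            self.bits[pos] = 1                      # set B[h_i(key)] = 1 for each i

    def __contains__(self, key: str) -> bool:
        # Declare key "in S" only if every hash position is 1 (may be a false positive)
        return all(self.bits[pos] for pos in self._hashes(key))

bf = BloomFilter(n_bits=1000, k=6)
for addr in ["good@example.com", "ok@example.com"]:
    bf.add(addr)
print("good@example.com" in bf, "spam@example.com" in bf)   # True, (almost surely) False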
 What fraction of the bit vector B are 1s?
 Throwing k·m darts at n targets
 So the fraction of 1s is (1 − e^(−km/n))

 But we have k independent hash functions
and we only let the element x through if all k
hash element x to a bucket of value 1

 So, false positive probability = (1 − e^(−km/n))^k


 m = 1 billion, n = 8 billion
 k = 1: (1 − e^(−1/8)) = 0.1175
 k = 2: (1 − e^(−1/4))^2 = 0.0493

 What happens as we keep increasing k?
[Plot: false positive probability vs. number of hash functions k]

 “Optimal” value of k: (n/m) ln(2)
 In our case: Optimal k = 8 ln(2) = 5.54 ≈ 6
 Error at k = 6: (1 − e^(−6/8))^6 ≈ 0.0216
 Bloom filters guarantee no false negatives,
and use limited memory
 Great for pre-processing before more
expensive checks
 Suitable for hardware implementation
 Hash function computations can be parallelized

 Is it better to have 1 big B or k small Bs?
 It is the same: (1 − e^(−km/n))^k vs. (1 − e^(−m/(n/k)))^k
 But keeping 1 big B is simpler


 Problem:
 Data stream consists of a universe of elements
chosen from a set of size N
 Maintain a count of the number of distinct
elements seen so far

 Obvious approach:
Maintain the set of elements seen so far
 That is, keep a hash table of all the distinct
elements seen so far



 How many different words are found among
the Web pages being crawled at a site?
 Unusually low or high numbers could indicate
artificial pages (spam?)

 How many different Web pages does each


customer request in a week?

 How many distinct products have we sold in


the last week?



 Real problem: What if we do not have space
to maintain the set of elements seen so far?

 Estimate the count in an unbiased way

 Accept that the count may have a little error,


but limit the probability that the error is large



 Pick a hash function h that maps each of the
N elements to at least log2 N bits

 For each stream element a, let r(a) be the
number of trailing 0s in h(a)
 r(a) = position of first 1 counting from the right
 E.g., say h(a) = 12, then 12 is 1100 in binary, so r(a) = 2
 Record R = the maximum r(a) seen
 R = max_a r(a), over all the items a seen so far

 Estimated number of distinct elements = 2^R
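A minimal sketch of the Flajolet-Martin estimate with a single hash function (real deployments combine many estimates, as described later; MD5 is an illustrative choice of hash):

import hashlib

def trailing_zeros(x: int) -> int:
    # r(a): number of trailing 0s in the binary representation (0 maps to 0 here for simplicity)
    if x == 0:
        return 0
    return (x & -x).bit_length() - 1

def fm_estimate(stream) -> int:
    R = 0
    for a in stream:
        h = int(hashlib.md5(str(a).encode()).hexdigest(), 16)
        R = max(R, trailing_zeros(h))       # R = max_a r(a)
    return 2 ** R                           # estimated number of distinct elements

print(fm_estimate(["x", "y", "z", "x", "y"] * 100))   # rough estimate of 3 distinct items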
 Very rough and heuristic intuition why
Flajolet-Martin works:
 h(a) hashes a with equal prob. to any of N values
 Then h(a) is a sequence of log2 N bits,
where a 2^(−r) fraction of all a's have a tail of r zeros
 About 50% of a's hash to ***0
 About 25% of a's hash to **00
 So, if we saw the longest tail of r=2 (i.e., item hash
ending *100), then we have probably seen
about 4 distinct items so far
 It takes hashing about 2^r items before we
see one with a zero-suffix of length r
 Now we show why Flajolet-Martin works

 Formally, we will show that the probability of
finding a tail of r zeros:
 Goes to 1 if 𝒎 ≫ 𝟐^𝒓
 Goes to 0 if 𝒎 ≪ 𝟐^𝒓
where 𝒎 is the number of distinct elements
seen so far in the stream
 Thus, 2^R will almost always be around m!


 The probability that a given h(a) ends
in at least r zeros is 2^(−r)
 h(a) hashes elements uniformly at random
 The probability that a random number ends in
at least r zeros is 2^(−r)
 Then, the probability of NOT seeing a tail
of length r among m elements:
(1 − 2^(−r))^m
(each h(a) ends in fewer than r zeros with probability 1 − 2^(−r); all m elements must do so)


 Note: (1 − 2^(−r))^m = ((1 − 2^(−r))^(2^r))^(m·2^(−r)) ≈ e^(−m·2^(−r))
 Prob. of NOT finding a tail of length r is:
 If m << 2^r, then the prob. tends to 1
 (1 − 2^(−r))^m ≈ e^(−m·2^(−r)) → 1 as m/2^r → 0
 So, the probability of finding a tail of length r tends to 0
 If m >> 2^r, then the prob. tends to 0
 (1 − 2^(−r))^m ≈ e^(−m·2^(−r)) → 0 as m/2^r → ∞
 So, the probability of finding a tail of length r tends to 1

 Thus, 2^R will almost always be around m!


 E[2^R] is actually infinite
 Probability halves when R → R+1, but the value doubles
 Workaround involves using many hash
functions hi and getting many samples of Ri
 How are the samples Ri combined?
 Average? What if there is one very large value 2^(Ri)?
 Median? All estimates are a power of 2
 Solution:
 Partition your samples into small groups
 Take the median of each group
 Then take the average of the medians
Note to other teachers and users of these slides: We would be delighted if you found this our
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org

Mining of Massive Datasets


Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
Supermarket shelf management – Market-basket
model:
 Goal: Identify items that are bought together by
sufficiently many customers
 Approach: Process the sales data collected with
barcode scanners to find dependencies among
items
 A classic rule:
 If someone buys diaper and milk, then he/she is
likely to buy beer
 Don’t be surprised if you find six-packs next to diapers!
Input:
 A large set of items
 e.g., things sold in a supermarket
 A large set of baskets
 Each basket is a small subset of items
 e.g., the things one customer buys on one day

   TID  Items
   1    Bread, Coke, Milk
   2    Beer, Bread
   3    Beer, Coke, Diaper, Milk
   4    Beer, Bread, Diaper, Milk
   5    Coke, Diaper, Milk

Output:
 Want to discover association rules
 People who bought {x,y,z} tend to buy {v,w}
 Amazon!
 Rules discovered from the data above:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}


 Items = products; Baskets = sets of products
someone bought in one trip to the store
 Real market baskets: Chain stores keep TBs of
data about what customers buy together
 Tells how typical customers navigate stores, lets
them position tempting items
 Suggests tie-in “tricks”, e.g., run sale on diapers
and raise the price of beer
 Need the rule to occur frequently, or no $$’s
 Amazon’s people who bought X also bought Y
 Baskets = sentences; Items = documents
containing those sentences
 Items that appear together too often could
represent plagiarism
 Notice items do not have to be “in” baskets

 Baskets = patients; Items = drugs & side-effects


 Has been used to detect combinations
of drugs that result in particular side-effects
 But requires extension: Absence of an item
needs to be observed as well as presence
 A general many-to-many mapping
(association) between two kinds of things
 But we ask about connections among “items”,
not “baskets”

 For example:
 Finding communities in graphs (e.g., Twitter)



 Finding communities in graphs (e.g., Twitter)
 Baskets = nodes; Items = outgoing neighbors
 Searching for complete bipartite subgraphs Ks,t of a
big graph
 How?
 View each node i as a
basket Bi of the nodes i points to
 Ks,t = a set Y of size t that
occurs in s buckets Bi
 Looking for Ks,t → set the
support to s and look at layer t –
all frequent sets of size t
[Diagram: a dense 2-layer graph with s nodes on one side and t nodes on the other]
First: Define
Frequent itemsets
Association rules:
Confidence, Support, Interestingness
Then: Algorithms for finding frequent itemsets
Finding frequent pairs
A-Priori algorithm
PCY algorithm + 2 refinements
 Simplest question: Find sets of items that
appear together “frequently” in baskets
 Support for itemset I: Number of baskets
containing all items in I
 (Often expressed as a fraction
of the total number of baskets)
 Given a support threshold s,
then sets of items that appear
in at least s baskets are called
frequent itemsets

   TID  Items
   1    Bread, Coke, Milk
   2    Beer, Bread
   3    Beer, Coke, Diaper, Milk
   4    Beer, Bread, Diaper, Milk
   5    Coke, Diaper, Milk

   Support of {Beer, Bread} = 2
 Items = {milk, coke, pepsi, beer, juice}
 Support threshold = 3 baskets
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
 Frequent itemsets: {m}, {c}, {b}, {j},
{m,b} , {b,c} , {c,j}.

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 10
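To make the definitions concrete, here is a minimal Python sketch (not part of the original slides; the variable names are illustrative) that counts support over the toy baskets above and reports the itemsets meeting the threshold s = 3:

```python
from itertools import combinations
from collections import Counter

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
s = 3  # support threshold

# Count the support of every itemset of size 1 and 2 by brute force
support = Counter()
for basket in baskets:
    for size in (1, 2):
        for itemset in combinations(sorted(basket), size):
            support[itemset] += 1

frequent = {itemset for itemset, count in support.items() if count >= s}
print(sorted(frequent, key=lambda t: (len(t), t)))
# Expected: singletons (b), (c), (j), (m) and the pairs (b,c), (b,m), (c,j)
```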


 Association Rules:
If-then rules about the contents of baskets
 {i1, i2,…,ik} → j means: “if a basket contains
all of i1,…,ik then it is likely to contain j”
 In practice there are many rules, want to find
significant/interesting ones!
 Confidence of this association rule is the
probability of j given I = {i1,…,ik}
conf(I → j) = support(I ∪ {j}) / support(I)
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 11
 Not all high-confidence rules are interesting
 The rule X → milk may have high confidence for
many itemsets X, because milk is just purchased very
often (independent of X) and the confidence will be
high
 Interest of an association rule I → j:
difference between its confidence and the
fraction of baskets that contain j
Interest(I → j) = conf(I → j) − Pr[j]
 Interesting rules are those with high positive or
negative interest values (usually above 0.5)
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 12
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4= {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}

 Association rule: {m, b} →c


 Confidence = 2/4 = 0.5
 Interest = |0.5 – 5/8| = 1/8
 Item c appears in 5/8 of the baskets
 Rule is not very interesting!

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 13
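The confidence and interest computations on this example can be reproduced with a small sketch (assuming the same toy baskets as before; the function names are ours, not the slides'):

```python
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def support(itemset):
    """Number of baskets containing every item in the itemset."""
    return sum(1 for basket in baskets if itemset <= basket)

def confidence(I, j):
    return support(I | {j}) / support(I)

def interest(I, j):
    # conf(I -> j) minus the fraction of baskets that contain j
    return confidence(I, j) - support({j}) / len(baskets)

I, j = {"m", "b"}, "c"
print(confidence(I, j))   # 2/4 = 0.5
print(interest(I, j))     # 0.5 - 5/8 = -0.125  (magnitude 1/8)
```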


 Problem: Find all association rules with
support ≥s and confidence ≥c
 Note: Support of an association rule is the support
of the set of items on the left side
 Hard part: Finding the frequent itemsets!
 If {i1, i2,…, ik} → j has high support and
confidence, then both {i1, i2,…, ik} and
{i1, i2,…,ik, j} will be “frequent”

conf(I → j) = support(I ∪ {j}) / support(I)
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 14
 List all possible association rules
 Compute the support and confidence for each rule
 Prune rules that fail the minsup and minconf
thresholds
 Computationally prohibitive!
 Given d unique items:
 Total number of itemsets = 2^d
 Total number of possible association rules:

   R = Σ_{k=1}^{d−1} [ C(d, k) · Σ_{j=1}^{d−k} C(d−k, j) ]
     = 3^d − 2^(d+1) + 1

 If d = 6, R = 602 rules
TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset:
  {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
  can have different confidence
• Thus, we may decouple the support and confidence requirements
 Step 1: Find all frequent itemsets I
 (we will explain this next)
 Step 2: Rule generation
 For every subset A of I, generate a rule A → I \ A
 Since I is frequent, A is also frequent
 Variant 1: Single pass to compute the rule confidence
 confidence(A,B→C,D) = support(A,B,C,D) / support(A,B)
 Variant 2:
 Observation: If A,B,C→D is below confidence, so is A,B→C,D
 Can generate “bigger” rules from smaller ones!
 Output the rules above the confidence threshold
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 17
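As a hedged illustration of Step 2, the sketch below generates rules A → I \ A from one frequent itemset given a table of support counts; `generate_rules` and the toy counts are our illustrative names, chosen to be consistent with the example that follows:

```python
from itertools import combinations

def generate_rules(frequent_itemset, support, min_conf):
    """Emit rules A -> (I \\ A) whose confidence clears min_conf.

    `support` maps frozensets to their basket counts; every subset of a
    frequent itemset is itself frequent, so all lookups succeed."""
    I = frozenset(frequent_itemset)
    rules = []
    for r in range(1, len(I)):
        for A in map(frozenset, combinations(I, r)):
            conf = support[I] / support[A]
            if conf >= min_conf:
                rules.append((set(A), set(I - A), conf))
    return rules

# Toy counts consistent with the example below (b:6, c:6, m:5, {b,c}:5, ...)
support = {
    frozenset("b"): 6, frozenset("c"): 6, frozenset("m"): 5,
    frozenset("bc"): 5, frozenset("bm"): 4, frozenset("cm"): 3,
    frozenset("bcm"): 3,
}
for lhs, rhs, conf in generate_rules("bcm", support, min_conf=0.75):
    print(lhs, "->", rhs, round(conf, 2))
```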
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, c, b, n} B4= {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
 Support threshold s = 3, confidence c = 0.75
 1) Frequent itemsets:
 {b,m} {b,c} {c,m} {c,j} {m,c,b}
 2) Generate rules:
 b→m: c=4/6 b→c: c=5/6 b,c→m: c=3/5
 m→b: c=4/5 … b,m→c: c=3/4
 b→c,m: c=3/6
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 18
[Figure: the itemset lattice for items A–E, from the empty set (null) at the
top down to ABCDE at the bottom, with all singletons, pairs, triples, and
quadruples in between.]

Given d items, there are 2^d possible candidate itemsets
 To reduce the number of rules we can
post-process them and only output:
 Maximal frequent itemsets:
No immediate superset is frequent
 Gives more pruning
or
 Closed itemsets:
No immediate superset has the same count (> 0)
 Stores not only frequent information, but exact counts

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 20


Example (minimum support = 2):

TID | Items
1   | ABC
2   | ABCD
3   | BCE
4   | ACDE
5   | DE

[Figure: the itemset lattice with each itemset annotated by the transactions
that contain it; nodes are marked “closed but not maximal”, “closed and
maximal”, or neither. For this database, # closed frequent = 9 and
# maximal frequent = 4.]

A separate support-count example (s = 3 for the maximality column):

Itemset | Support | Maximal (s=3) | Closed | Why
A       | 4       | No            | No     |
B       | 5       | No            | Yes    | Frequent, but its superset BC is also frequent
C       | 3       | No            | No     | Superset BC has the same count
AB      | 4       | Yes           | Yes    | Frequent, and its only superset, ABC, is not frequent
AC      | 2       | No            | No     |
BC      | 3       | Yes           | Yes    | Its only superset, ABC, has a smaller count
ABC     | 2       | No            | Yes    |
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 22


 Back to finding frequent itemsets
 Typically, data is kept in flat files
rather than in a database system:
 Stored on disk
 Stored basket-by-basket
 Baskets are small but we have
many baskets and many items
 Expand baskets into pairs, triples, etc.
as you read baskets
 Use k nested loops to generate all
sets of size k

[Figure: the file is laid out as a sequence of items, basket by basket. Items
are positive integers, and boundaries between baskets are –1.]

Note: We want to find frequent itemsets. To find them, we
have to count them. To count them, we have to generate them.
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 24
 The true cost of mining disk-resident data is
usually the number of disk I/Os

 In practice, association-rule algorithms read


the data in passes – all baskets read in turn

 We measure the cost by the number of


passes an algorithm makes over the data

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 25


 For many frequent-itemset algorithms,
main-memory is the critical resource
 As we read baskets, we need to count
something, e.g., occurrences of pairs of items
 The number of different things we can count
is limited by main memory
 Swapping counts in/out is a disaster (why?)

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 26


 The hardest problem often turns out to be
finding the frequent pairs of items {i1, i2}
 Why? Freq. pairs are common, freq. triples are rare
 Why? Probability of being frequent drops exponentially
with size; number of sets grows more slowly with size
 Let’s first concentrate on pairs, then extend to
larger sets
 The approach:
 We always need to generate all the itemsets
 But we would only like to count (keep track) of those
itemsets that in the end turn out to be frequent
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 27
 Naïve approach to finding frequent pairs
 Read file once, counting in main memory
the occurrences of each pair:
 From each basket of n items, generate its
n(n-1)/2 pairs by two nested loops
 Fails if (#items)2 exceeds main memory
 Remember: #items can be
100K (Wal-Mart) or 10B (Web pages)
 Suppose 10^5 items, counts are 4-byte integers
 Number of pairs of items: 10^5(10^5 − 1)/2 ≈ 5·10^9
 Therefore, 2·10^10 bytes (20 gigabytes) of memory needed
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 28
Two approaches:
 Approach 1: Count all pairs using a matrix
 Approach 2: Keep a table of triples [i, j, c] =
“the count of the pair of items {i, j} is c.”
 If integers and item ids are 4 bytes, we need
approximately 12 bytes for pairs with count > 0
 Plus some additional overhead for the hashtable
Note:
 Approach 1 only requires 4 bytes per pair
 Approach 2 uses 12 bytes per pair
(but only for pairs with count > 0)
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 29
[Figure: memory layout of the two approaches – the triangular matrix uses
4 bytes per pair; the triples method uses 12 bytes per occurring pair.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 30


 Approach 1: Triangular Matrix
 n = total number of items
 Count pair of items {i, j} only if i<j
 Keep pair counts in lexicographic order:
 {1,2}, {1,3},…, {1,n}, {2,3}, {2,4},…,{2,n}, {3,4},…
 Pair {i, j} is at position (i–1)(n – i/2) + j – i
 Total number of pairs n(n–1)/2; total bytes ≈ 2n^2
 Triangular Matrix requires 4 bytes per pair
 Approach 2 uses 12 bytes per occurring pair
(but only for pairs with count > 0)
 Beats Approach 1 if less than 1/3 of
possible pairs actually occur
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 31
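As an illustration of the indexing scheme, here is a small Python sketch (ours, not the slides') of the triangular-matrix layout; the names are illustrative:

```python
def pair_index(i, j, n):
    """1-based position of pair {i, j}, 1 <= i < j <= n, in the lexicographic
    order {1,2},{1,3},...,{1,n},{2,3},...:  k = (i-1)(n - i/2) + j - i."""
    assert 1 <= i < j <= n
    return int((i - 1) * (n - i / 2) + j - i)

n = 5
counts = [0] * (n * (n - 1) // 2)         # one count slot per possible pair

def add_pair(i, j):
    a, b = min(i, j), max(i, j)
    counts[pair_index(a, b, n) - 1] += 1  # shift to a 0-based array index

add_pair(2, 4)
print(pair_index(1, 2, n), pair_index(2, 3, n), pair_index(4, 5, n))  # 1 5 10
```

With n = 5 the ten possible pairs map to positions 1 through 10, in lexicographic order.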
 Approach 1: Triangular Matrix
 n = total number of items
 Count pair of items {i, j} only if i<j
 Keep pair counts in lexicographic order:
 {1,2}, {1,3},…, {1,n}, {2,3}, {2,4},…,{2,n}, {3,4},…
 Pair {i, j} is at position (i–1)(n – i/2) + j – i
 Total number of pairs n(n–1)/2; total bytes ≈ 2n^2
 Triangular Matrix requires 4 bytes per pair
 Approach 2 uses 12 bytes per pair
(but only for pairs with count > 0)
 Beats Approach 1 if less than 1/3 of
possible pairs actually occur

Problem is if we have too many items, so the pairs
do not fit into memory. Can we do better?
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 32
 A two-pass approach called
A-Priori limits the need for
main memory
 Key idea: monotonicity
 If a set of items I appears at
least s times, so does every subset J of I
 Contrapositive for pairs:
If item i does not appear in s baskets, then no
pair including i can appear in s baskets

 So, how does A-Priori find freq. pairs?


J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 34
 Pass 1: Read baskets and count in main memory
the occurrences of each individual item
 Requires only memory proportional to #items

 Items that appear ≥ 𝒔 times are the frequent items


 Pass 2: Read baskets again and count in main
memory only those pairs where both elements
are frequent (from Pass 1)
 Requires memory proportional to square of frequent
items only (for counts)
 Plus a list of the frequent items (so you know what must
be counted)
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 35
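A minimal Python sketch of these two passes for pairs, assuming the baskets fit in a list and using illustrative names, might look like this:

```python
from itertools import combinations
from collections import Counter

def apriori_pairs(baskets, s):
    """Two passes over the baskets: Pass 1 counts items, Pass 2 counts only
    those pairs whose two elements are both frequent."""
    # Pass 1: count individual items
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent_items = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: count candidate pairs (both elements frequent)
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        for pair in combinations(kept, 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}

baskets = [{"m","c","b"}, {"m","p","j"}, {"m","b"}, {"c","j"},
           {"m","p","b"}, {"m","c","b","j"}, {"c","b","j"}, {"b","c"}]
print(apriori_pairs(baskets, s=3))  # {('b','c'): 4, ('b','m'): 4, ('c','j'): 3}
```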
[Figure: main-memory layout for A-Priori. Pass 1 holds the item counts;
Pass 2 holds the frequent items from Pass 1 plus the counts of pairs of
frequent items (the candidate pairs).]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 36


 You can use the triangular matrix
method with n = number of frequent items
 May save space compared
with storing triples
 Trick: re-number the frequent items 1,2,…
and keep a table relating
new numbers to original item numbers

[Figure: Pass 1 holds the item counts; Pass 2 holds the old item numbers, the
frequent-item renumbering, and the counts of pairs of frequent items.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 37
 For each k, we construct two sets of
k-tuples (sets of size k):
 Ck = candidate k-tuples = those that might be
frequent sets (support > s) based on information
from the pass for k–1
 Lk = the set of truly frequent k-tuples
[Figure: the A-Priori pipeline –
C1 (all items) → Filter (count the items) → L1 → Construct (all pairs of items
from L1) → C2 → Filter (count the pairs) → L2 → Construct → C3 → … ]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 38


** Note here we generate new candidates by
generating Ck from Lk-1 and L1.
But that one can be more careful with candidate
generation. For example, in C3 we know {b,m,j}
cannot be frequent since {m,j} is not frequent

 Hypothetical steps of the A-Priori algorithm


 C1 = { {b} {c} {j} {m} {n} {p} }
 Count the support of itemsets in C1
 Prune non-frequent: L1 = { b, c, j, m }
 Generate C2 = { {b,c} {b,j} {b,m} {c,j} {c,m} {j,m} }
 Count the support of itemsets in C2
 Prune non-frequent: L2 = { {b,m} {b,c} {c,m} {c,j} }
 Generate C3 = { {b,c,m} {b,c,j} {b,m,j} {c,m,j} }
 Count the support of itemsets in C3 **

 Prune non-frequent: L3 = { {b,c,m} }


J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 39
Worked example (Supmin = 2):

Database TDB
TID | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
Prune → L1: {A}:2, {B}:3, {C}:3, {E}:3

Construct C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
Prune → L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

Construct C3: {B,C,E}
3rd scan → L3: {B,C,E}:2
 One pass for each k (itemset size)
 Needs room in main memory to count
each candidate k–tuple
 For typical market-basket data and reasonable
support (e.g., 1%), k = 2 requires the most memory
 Many possible extensions:
 Association rules with intervals:
 For example: Men over 65 have 2 cars
 Association rules when items are in a taxonomy
 Bread, Butter → FruitJam
 BakedGoods, MilkProduct → PreservedGoods
 Lower the support s as itemset gets bigger
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 41
 Observation:
In pass 1 of A-Priori, most memory is idle
 We store only individual item counts
 Can we use the idle memory to reduce
memory required in pass 2?
 Pass 1 of PCY: In addition to item counts,
maintain a hash table with as many
buckets as fit in memory
 Keep a count for each bucket into which
pairs of items are hashed
 For each bucket just keep the count, not the actual
pairs that hash to the bucket!
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 43
FOR (each basket) :
    FOR (each item in the basket) :
        add 1 to item’s count;
    FOR (each pair of items) :            ← New in PCY
        hash the pair to a bucket;
        add 1 to the count for that bucket;
 Few things to note:


 Pairs of items need to be generated from the input
file; they are not present in the file
 We are not just interested in the presence of a pair,
but we need to see whether it is present at least s
(support) times
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 44
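A rough Python sketch of Pass 1 of PCY and of the Pass-2 candidate test (the names are illustrative; a real implementation would use a compact array of small integer counts rather than a Python list):

```python
from itertools import combinations
from collections import Counter

def pcy_pass1(baskets, s, num_buckets):
    """Pass 1 of PCY (a sketch): count items and, in the otherwise idle
    memory, hash every pair to a bucket and count the buckets."""
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        items = sorted(set(basket))
        for item in items:
            item_counts[item] += 1
        for pair in combinations(items, 2):
            bucket_counts[hash(pair) % num_buckets] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= s}
    # Between the passes, replace bucket counts by a bitmap of frequent buckets
    bitmap = [count >= s for count in bucket_counts]
    return frequent_items, bitmap

def pcy_candidate(pair, frequent_items, bitmap, num_buckets):
    """Pass-2 test: count {i,j} only if both items are frequent and the pair
    hashes to a frequent bucket."""
    i, j = pair
    return (i in frequent_items and j in frequent_items
            and bitmap[hash(pair) % num_buckets])
```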
 Observation: If a bucket contains a frequent pair,
then the bucket is surely frequent
 However, even without any frequent pair,
a bucket can still be frequent 
 So, we cannot use the hash to eliminate any
member (pair) of a “frequent” bucket
 But, for a bucket with total count less than s,
none of its pairs can be frequent 
 Pairs that hash to this bucket can be eliminated as
candidates (even if the pair consists of 2 frequent items)

 Pass 2:
Only count pairs that hash to frequent buckets
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 45
 Replace the buckets by a bit-vector:
 1 means the bucket count exceeded the support s
(call it a frequent bucket); 0 means it did not

 4-byte integer counts are replaced by bits,


so the bit-vector requires 1/32 of memory

 Also, decide which items are frequent


and list them for the second pass

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 46


 Count all pairs {i, j} that meet the
conditions for being a candidate pair:
1. Both i and j are frequent items
2. The pair {i, j} hashes to a bucket whose bit in
the bit vector is 1 (i.e., a frequent bucket)

 Both conditions are necessary for the


pair to have a chance of being frequent

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 47


[Figure: main-memory layout for PCY. Pass 1 holds the item counts and the
hash table of bucket counts; Pass 2 holds the frequent items, the bitmap
derived from the hash table, and the counts of candidate pairs.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 48


 Buckets require a few bytes each:
 Note: we do not have to count past s
 #buckets is O(main-memory size)

 On second pass, a table of (item, item, count)


triples is essential (we cannot use triangular
matrix approach, why?)
 Thus, hash table must eliminate approx. 2/3
of the candidate pairs for PCY to beat A-Priori

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 49


 Limit the number of candidates to be counted
 Remember: Memory is the bottleneck
 Still need to generate all the itemsets but we only
want to count/keep track of the ones that are frequent
 Key idea: After Pass 1 of PCY, rehash only those
pairs that qualify for Pass 2 of PCY
 i and j are frequent, and
 {i, j} hashes to a frequent bucket from Pass 1
 On middle pass, fewer pairs contribute to
buckets, so fewer false positives
 Requires 3 passes over the data
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 50
[Figure: main-memory layout for Multistage.
Pass 1: item counts + first hash table; count the items and hash pairs {i,j}
into the first hash table.
Pass 2: frequent items + Bitmap 1 + second hash table; hash pairs {i,j} into
the second hash table iff i and j are frequent and {i,j} hashes to a frequent
bucket in B1.
Pass 3: frequent items + Bitmap 1 + Bitmap 2 + counts of candidate pairs;
count pairs {i,j} iff i and j are frequent, {i,j} hashes to a frequent bucket
in B1, and {i,j} hashes to a frequent bucket in B2.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 51
 Count only those pairs {i, j} that satisfy these
candidate pair conditions:
1. Both i and j are frequent items
2. Using the first hash function, the pair hashes to
a bucket whose bit in the first bit-vector is 1
3. Using the second hash function, the pair hashes to
a bucket whose bit in the second bit-vector is 1

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 52


1. The two hash functions have to be
independent
2. We need to check both hashes on the
third pass
 If not, we would end up counting pairs of
frequent items that hashed first to an
infrequent bucket but happened to hash
second to a frequent bucket

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 53


 Key idea: Use several independent hash
tables on the first pass
 Risk: Halving the number of buckets doubles
the average count
 We have to be sure most buckets will still not
reach count s

 If so, we can get a benefit like multistage,


but in only 2 passes

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 54


[Figure: main-memory layout for Multihash. Pass 1 holds the item counts plus
two smaller hash tables (together occupying the space of the single PCY
table); Pass 2 holds the frequent items, Bitmap 1, Bitmap 2, and the counts
of candidate pairs.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 55


 Either multistage or multihash can use more
than two hash functions

 In multistage, there is a point of diminishing


returns, since the bit-vectors eventually
consume all of main memory

 For multihash, the bit-vectors occupy exactly


what one PCY bitmap does, but too many
hash functions makes all counts > s

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 56


 A-Priori, PCY, etc., take k passes to find
frequent itemsets of size k

 Can we use fewer passes?

 Use 2 or fewer passes for all sizes,


but may miss some frequent itemsets
 Random sampling
 SON (Savasere, Omiecinski, and Navathe)
 Toivonen (see textbook)

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 58


 Take a random sample of the market baskets
 Run A-Priori or one of its improvements
in main memory
 So we don’t pay for disk I/O each
time we increase the size of itemsets
 Reduce the support threshold
proportionally to match the sample size

[Figure: main memory holds a copy of the sample baskets plus space for counts.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 59


 Optionally, verify that the candidate pairs are
truly frequent in the entire data set by a
second pass (avoid false positives)

 But you don’t catch sets frequent in the whole


but not in the sample
 Smaller threshold, e.g., s/125, helps catch more
truly frequent itemsets
 But requires more space

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 60


 Repeatedly read small subsets of the baskets
into main memory and run an in-memory
algorithm to find all frequent itemsets
 Note: we are not sampling, but processing the
entire file in memory-sized chunks

 An itemset becomes a candidate if it is found


to be frequent in any one or more subsets of
the baskets.

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 61


 On a second pass, count all the candidate
itemsets and determine which are frequent in
the entire set

 Key “monotonicity” idea: an itemset cannot


be frequent in the entire set of baskets unless
it is frequent in at least one subset.

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 62


 SON lends itself to distributed data mining

 Baskets distributed among many nodes


 Compute frequent itemsets at each node
 Distribute candidates to all nodes
 Accumulate the counts of all candidates

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 63
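A non-distributed sketch of the two SON passes over memory-sized chunks (assuming some in-memory frequent-itemset routine is passed in, e.g. the A-Priori sketch shown earlier; the names are illustrative):

```python
from collections import Counter

def son(baskets, s, chunk_size, in_memory_frequent):
    """SON sketch: Pass 1 finds itemsets frequent in any chunk (candidates),
    at a proportionally lowered threshold; Pass 2 counts the candidates over
    the full data and keeps those with true support >= s."""
    n = len(baskets)
    candidates = set()
    # Pass 1: process the file in memory-sized chunks
    for start in range(0, n, chunk_size):
        chunk = baskets[start:start + chunk_size]
        lowered = max(1, int(s * len(chunk) / n))
        candidates |= set(in_memory_frequent(chunk, lowered))
    # Pass 2: count every candidate over the entire data set
    true_counts = Counter()
    for basket in baskets:
        for itemset in candidates:
            if set(itemset) <= basket:
                true_counts[itemset] += 1
    return {c: cnt for c, cnt in true_counts.items() if cnt >= s}
```

By monotonicity, any itemset frequent in the whole file must be frequent in at least one chunk, so the candidate set produced in Pass 1 cannot miss a truly frequent itemset.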


 Phase 1: Find candidate itemsets
 Map?
 Reduce?

 Phase 2: Find true frequent itemsets


 Map?
 Reduce?

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 64


Note to other teachers and users of these slides: We would be delighted if you found this our
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org

Mining of Massive Datasets


Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
[Figure: course map –
High-dim. data: Locality sensitive hashing, Clustering, Dimensionality reduction
Graph data: PageRank & SimRank, Community detection, Spam detection
Infinite data: Filtering data streams, Web advertising, Queries on streams
Machine learning: SVM, Decision trees, Perceptron & kNN
Apps: Recommender systems, Association rules, Duplicate document detection]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 2


 Given a cloud of data points we want to
understand its structure

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 3


 Given a set of points, with a notion of distance
between points, group the points into some
number of clusters, so that
 Members of a cluster are close/similar to each other
 Members of different clusters are dissimilar
 Usually:
 Points are in a high-dimensional space
 Similarity is defined using a distance measure
 Euclidean, Cosine, Jaccard, edit distance, …

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 4


[Figure: a 2-D cloud of points forming three groups, with one isolated point
labeled “Outlier” and one group labeled “Cluster”.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 5


J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 6
 Clustering in two dimensions looks easy
 Clustering small amounts of data looks easy
 And in most cases, looks are not deceiving

 Many applications involve not 2, but 10 or


10,000 dimensions
 High-dimensional spaces look different:
Almost all pairs of points are at about the
same distance

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 7


 A catalog of 2 billion “sky objects” represents
objects by their radiation in 7 dimensions
(frequency bands)
 Problem: Cluster into similar objects, e.g.,
galaxies, nearby stars, quasars, etc.
 Sloan Digital Sky Survey

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 8


 Intuitively: Music divides into categories, and
customers prefer a few categories
 But what are categories really?

 Represent a CD by a set of customers who


bought it:

 Similar CDs have similar sets of customers,


and vice-versa

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 9


Space of all CDs:
 Think of a space with one dim. for each
customer
 Values in a dimension may be 0 or 1 only
 A CD is a point in this space (x1, x2,…, xk),
where xi = 1 iff the i th customer bought the CD

 For Amazon, the dimension is tens of millions

 Task: Find clusters of similar CDs

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 10


Finding topics:
 Represent a document by a vector
(x1, x2,…, xk), where xi = 1 iff the i th word
(in some order) appears in the document
 It actually doesn’t matter if k is infinite; i.e., we
don’t limit the set of words

 Documents with similar sets of words


may be about the same topic

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 11


 As with CDs we have a choice when we
think of documents as sets of words or
shingles:
 Sets as vectors: Measure similarity by the
cosine distance
 Sets as sets: Measure similarity by the
Jaccard distance
 Sets as points: Measure similarity by
Euclidean distance

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 12


 Hierarchical:
 Agglomerative (bottom up):
 Initially, each point is a cluster
 Repeatedly combine the two
“nearest” clusters into one
 Divisive (top down):
 Start with one cluster and recursively split it

 Point assignment:
 Maintain a set of clusters
 Points belong to “nearest” cluster
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 13
 Key operation:
Repeatedly combine
two nearest clusters

 Three important questions:


 1) How do you represent a cluster of more
than one point?
 2) How do you determine the “nearness” of
clusters?
 3) When to stop combining clusters?

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 14


 Key operation: Repeatedly combine two
nearest clusters
 (1) How to represent a cluster of many points?
 Key problem: As you merge clusters, how do you
represent the “location” of each cluster, to tell which
pair of clusters is closest?
 Euclidean case: each cluster has a
centroid = average of its (data)points
 (2) How to determine “nearness” of clusters?
 Measure cluster distances by distances of centroids

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 15


[Figure: agglomerative clustering in 2-D.
Data points (o): (0,0), (1,2), (2,1), (4,1), (5,0), (5,3).
Centroids (x) created as clusters merge: (1,1), (1.5,1.5), (4.5,0.5), (4.7,1.3).
The sequence of merges is recorded in a dendrogram.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 16
What about the Non-Euclidean case?
 The only “locations” we can talk about are the
points themselves
 i.e., there is no “average” of two points

 Approach 1:
 (1) How to represent a cluster of many points?
clustroid = (data)point “closest” to other points
 (2) How do you determine the “nearness” of
clusters? Treat clustroid as if it were centroid, when
computing inter-cluster distances
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 17
 (1) How to represent a cluster of many points?
clustroid = point “closest” to other points
 Possible meanings of “closest”:
 Smallest maximum distance to other points
 Smallest average distance to other points
 Smallest sum of squares of distances to other points
 For a distance metric d, the clustroid c of cluster C is:

    c = argmin_c Σ_{x∈C} d(x, c)^2

[Figure: a cluster of 3 datapoints with its centroid (X) and clustroid marked.]
Centroid is the avg. of all (data)points in the cluster. This means the
centroid is an “artificial” point.
Clustroid is an existing (data)point that is “closest” to all other points in
the cluster.
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 18
 (2) How do you determine the “nearness” of
clusters?
 Approach 2:
Intercluster distance = minimum of the distances
between any two points, one from each cluster
 Approach 3:
Pick a notion of “cohesion” of clusters, e.g.,
maximum distance from the clustroid
 Merge clusters whose union is most cohesive

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 19


 Approach 3.1: Use the diameter of the
merged cluster = maximum distance between
points in the cluster
 Approach 3.2: Use the average distance
between points in the cluster
 Approach 3.3: Use a density-based approach
 Take the diameter or avg. distance, e.g., and divide
by the number of points in the cluster

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 20


 Naïve implementation of hierarchical
clustering:
 At each step, compute pairwise distances
between all pairs of clusters, then merge
 O(N3)

 Careful implementation using priority queue


can reduce time to O(N2 log N)
 Still too expensive for really big datasets
that do not fit in memory

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 21
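For reference, a naive sketch of agglomerative clustering in the Euclidean case (repeatedly merge the two clusters with the closest centroids until k remain); this is illustrative only and has the O(N^3)-style cost described above, not the priority-queue optimization:

```python
def hierarchical(points, k):
    """Merge the two clusters with the closest centroids until k clusters remain."""
    clusters = [[p] for p in points]

    def centroid(cluster):
        dim = len(cluster[0])
        return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dim))

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    while len(clusters) > k:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: dist2(centroid(clusters[ab[0]]), centroid(clusters[ab[1]])),
        )
        clusters[i] += clusters.pop(j)   # j > i, so index i stays valid
    return clusters

print(hierarchical([(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)], k=2))
```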


 Assumes Euclidean space/distance

 Start by picking k, the number of clusters

 Initialize clusters by picking one point per


cluster
 Example: Pick one point at random, then k-1
other points, each as far away as possible from
the previous points

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 23


 1) For each point, place it in the cluster
whose current centroid is nearest
 2) After all points are assigned, update the
locations of centroids of the k clusters
 3) Reassign all points to their closest centroid
 Sometimes moves points between clusters

 Repeat 2 and 3 until convergence


 Convergence: Points don’t move between clusters
and centroids stabilize
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 24
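A plain k-means sketch of steps 1–3 (random initialization for brevity; the "pick points as far apart as possible" initialization described earlier would replace `random.sample`):

```python
import random

def kmeans(points, k, iters=100):
    """Assign each point to its nearest centroid, recompute centroids,
    and repeat until the assignments stop changing."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centroids = random.sample(points, k)        # naive initialization
    assignment = [None] * len(points)
    for _ in range(iters):
        new_assignment = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:        # convergence: no point moved
            break
        assignment = new_assignment
        for c in range(k):                      # update the centroid locations
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment
```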
[Figure: data points (x) and centroids – clusters after round 1.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 25


[Figure: data points (x) and centroids – clusters after round 2.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 26


[Figure: data points (x) and centroids – clusters at the end.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 27


How to select k?
 Try different k, looking at the change in the
average distance to centroid as k increases
 Average falls rapidly until right k, then
changes little

[Figure: average distance to centroid vs. k. The curve falls rapidly and then
flattens; the “elbow” marks the best value of k.]

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 28


Too few clusters: many long distances to centroid.
[Figure: the example point cloud partitioned into too few clusters.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 29


Just right: distances are rather short.
[Figure: the example point cloud partitioned into the right number of clusters.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 30


Too many clusters: little improvement in the average distance.
[Figure: the example point cloud partitioned into too many clusters.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 31


Extension of k-means to large data
 BFR [Bradley-Fayyad-Reina] is a
variant of k-means designed to
handle very large (disk-resident) data sets

 Assumes that clusters are normally distributed


around a centroid in a Euclidean space
 Standard deviations in different
dimensions may vary
 Clusters are axis-aligned ellipses
 Efficient way to summarize clusters
(want memory required O(clusters) and not O(data))
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 33
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 34
 Points are read from disk one main-memory-
full at a time
 Most points from previous memory loads are
summarized by simple statistics
 To begin, from the initial load we select the
initial k centroids by some sensible approach:
 Take k random points
 Take a small random sample and cluster optimally
 Take a sample; pick a random point, and then
k–1 more points, each as far from the previously
selected points as possible
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 35
3 sets of points which we keep track of:
 Discard set (DS):
 Points close enough to a centroid to be summarized
 (Update the metadata of the corresponding cluster and discard the points)
 Compression set (CS):
 Groups of points that are close together but
not close to any existing centroid
 These points are summarized, but not
assigned to a cluster
 Retained set (RS):
 Isolated points waiting to be assigned to a
compression set
 (Not close to anything; they are like outliers)
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 36
[Figure: the “Galaxies” picture – a cluster whose points are in the DS, with
its centroid marked; compressed sets whose points are in the CS; and isolated
points in the RS.]

Discard set (DS): Close enough to a centroid to be summarized
Compression set (CS): Summarized, but not assigned to a cluster
Retained set (RS): Isolated points
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 37
For each cluster, the discard set (DS) is
summarized by:
 The number of points, N
 The vector SUM, whose ith component is the
sum of the coordinates of the points in the
ith dimension
 The vector SUMSQ: ith component = sum of
squares of coordinates in ith dimension

[Figure: a cluster, all of whose points are in the DS, with its centroid marked.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 38
 2d + 1 values represent any size cluster
 d = number of dimensions
 Average in each dimension (the centroid)
can be calculated as SUMi / N
 SUMi = ith component of SUM
 Variance of a cluster’s discard set in
dimension i is: (SUMSQi / N) – (SUMi / N)^2
 And standard deviation is the square root of that
 Next step: Actual clustering
Note: Dropping the “axis-aligned” clusters assumption would require
storing full covariance matrix to summarize the cluster. So, instead of
SUMSQ being a d-dim vector, it would be a d x d matrix, which is too big!
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 39
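A small sketch of the (N, SUM, SUMSQ) bookkeeping; note that folding new points into a cluster (or merging two CS subclusters) is just component-wise addition of the summaries:

```python
def summarize(points):
    """(N, SUM, SUMSQ) summary of a set of points: 2d+1 values per cluster."""
    d = len(points[0])
    N = len(points)
    SUM = [sum(p[i] for p in points) for i in range(d)]
    SUMSQ = [sum(p[i] ** 2 for p in points) for i in range(d)]
    return N, SUM, SUMSQ

def centroid_and_variance(N, SUM, SUMSQ):
    centroid = [s / N for s in SUM]
    variance = [sq / N - (s / N) ** 2 for s, sq in zip(SUM, SUMSQ)]
    return centroid, variance

def merge(summary_a, summary_b):
    """Combine two summaries by adding N, SUM, and SUMSQ component-wise."""
    Na, SUMa, SQa = summary_a
    Nb, SUMb, SQb = summary_b
    return (Na + Nb,
            [x + y for x, y in zip(SUMa, SUMb)],
            [x + y for x, y in zip(SQa, SQb)])
```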
Processing the “Memory-Load” of points (1):
 1) Find those points that are “sufficiently
close” to a cluster centroid and add those
points to that cluster and the DS
 These points are so close to the centroid that
they can be summarized and then discarded
 2) Use any main-memory clustering algorithm
to cluster the remaining points and the old RS
 Clusters go to the CS; outlying points to the RS
Discard set (DS): Close enough to a centroid to be summarized.
Compression set (CS): Summarized, but not assigned to a cluster
Retained set (RS): Isolated points
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 40
Processing the “Memory-Load” of points (2):
 3) DS set: Adjust statistics of the clusters to
account for the new points
 Add Ns, SUMs, SUMSQs
 4) Consider merging compressed sets in the CS
 5) If this is the last round, merge all compressed
sets in the CS and all RS points into their nearest
cluster
Discard set (DS): Close enough to a centroid to be summarized.
Compression set (CS): Summarized, but not assigned to a cluster
Retained set (RS): Isolated points
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 41
[Figure: the “Galaxies” picture again – DS clusters with their centroids,
compressed sets (CS), and retained points (RS).]

Discard set (DS): Close enough to a centroid to be summarized
Compression set (CS): Summarized, but not assigned to a cluster
Retained set (RS): Isolated points
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 42
 Q1) How do we decide if a point is “close
enough” to a cluster that we will add the
point to that cluster?

 Q2) How do we decide whether two


compressed sets (CS) deserve to be
combined into one?

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 43


 Q1) We need a way to decide whether to put
a new point into a cluster (and discard)

 BFR suggests two ways:


 The Mahalanobis distance is less than a threshold
 High likelihood of the point belonging to
currently nearest centroid

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 44


 Normalized Euclidean distance from centroid

 For point (x1, …, xd) and centroid (c1, …, cd)


1. Normalize in each dimension: yi = (xi − ci) / σi
2. Take the sum of the squares of the yi
3. Take the square root

    d(x, c) = sqrt( Σ_{i=1}^{d} ((xi − ci) / σi)^2 ) = sqrt( Σ_{i=1}^{d} yi^2 )

σi … standard deviation of points in
the cluster in the ith dimension
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 45
 If clusters are normally distributed in d
dimensions, then after transformation, one
standard deviation = √d
 i.e., 68% of the points of the cluster will
have a Mahalanobis distance < √d

 Accept a point for a cluster if


its M.D. is < some threshold,
e.g. 2 standard deviations

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 46
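A sketch of the Mahalanobis test for admitting a point to a cluster, using a threshold of 2 standard deviations (i.e. 2√d after normalization); the names are illustrative:

```python
import math

def mahalanobis(point, centroid, std):
    """Normalized Euclidean distance of `point` from `centroid`, where std[i]
    is the standard deviation of the cluster in dimension i."""
    return math.sqrt(sum(((x - c) / s) ** 2
                         for x, c, s in zip(point, centroid, std)))

def close_enough(point, centroid, std, threshold_sds=2.0):
    # Accept the point for the cluster if its M.D. is within the threshold,
    # e.g. 2 standard deviations, which is 2 * sqrt(d) after normalization.
    d = len(point)
    return mahalanobis(point, centroid, std) < threshold_sds * math.sqrt(d)
```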


 Euclidean vs. Mahalanobis distance
[Figure: contours of equidistant points from the origin – uniformly
distributed points with Euclidean distance, normally distributed points with
Euclidean distance, and normally distributed points with Mahalanobis distance.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 47


Q2) Should 2 CS subclusters be combined?
 Compute the variance of the combined
subcluster
 N, SUM, and SUMSQ allow us to make that
calculation quickly
 Combine if the combined variance is
below some threshold

 Many alternatives: Treat dimensions


differently, consider density

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 48


Extension of k-means to clusters
of arbitrary shapes
Vs.
 Problem with BFR/k-means:
 Assumes clusters are normally
distributed in each dimension
 And axes are fixed – ellipses at
an angle are not OK

 CURE (Clustering Using REpresentatives):


 Assumes a Euclidean distance
 Allows clusters to assume any shape
 Uses a collection of representative
points to represent clusters
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 50
[Figure: salary vs. age scatter plot with two natural clusters, one of “h”
points and one of “e” points; the clusters are elongated and not axis-aligned.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 51


2 Pass algorithm. Pass 1:
 0) Pick a random sample of points that fit in
main memory
 1) Initial clusters:
 Cluster these points hierarchically – group
nearest points/clusters
 2) Pick representative points:
 For each cluster, pick k (e.g. 4) representative
points, as dispersed as possible
 Move each representative point a fixed fraction
(e.g. 20%) toward the centroid of the cluster
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 52
[Figure: the salary vs. age data after hierarchically clustering the sample
(Pass 1, step 1).]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 53


[Figure: the salary vs. age clusters with (say) 4 remote representative points
picked for each cluster.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 54


[Figure: the representative points moved (say) 20% toward the centroid of
their cluster.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 55


Pass 2:
 Now, rescan the whole dataset and
visit each point p in the data set

 Place it in the “closest cluster”


 Normal definition of “closest”:
Find the closest representative to p and
assign it to representative’s cluster

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 56


 Clustering: Given a set of points, with a notion
of distance between points, group the points
into some number of clusters
 Algorithms:
 Agglomerative hierarchical clustering:
 Centroid and clustroid
 k-means:
 Initialization, picking k
 BFR
 CURE
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 57
Note to other teachers and users of these slides: We would be delighted if you found this our
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org

Mining of Massive Datasets


Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
 Classic model of algorithms
 You get to see the entire input, then compute
some function of it
 In this context, “offline algorithm”

 Online Algorithms
 You get to see the input one piece at a time, and
need to make irrevocable decisions along the way
 Similar to the data stream model

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 2


[Figure: a bipartite graph with boys {1, 2, 3, 4} on the left, girls
{a, b, c, d} on the right, and edges representing preferences.]

Nodes: Boys and Girls; Edges: Preferences
Goal: Match boys to girls so that the maximum
number of preferences is satisfied
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 4


[Figure: the same bipartite graph with the matching edges highlighted.]

M = {(1,a), (2,b), (3,d)} is a matching
Cardinality of matching = |M| = 3
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 5


[Figure: the same bipartite graph with a perfect matching highlighted.]

M = {(1,c), (2,b), (3,d), (4,a)} is a perfect matching

Perfect matching … all vertices of the graph are matched
Maximum matching … a matching that contains the largest possible number of matches
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 6
 Problem: Find a maximum matching for a
given bipartite graph
 A perfect one if it exists

 There is a polynomial-time offline algorithm


based on augmenting paths (Hopcroft & Karp 1973,
see http://en.wikipedia.org/wiki/Hopcroft-Karp_algorithm)

 But what if we do not know the entire


graph upfront?

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 7


 Initially, we are given the set boys
 In each round, one girl’s choices are revealed
 That is, girl’s edges are revealed
 At that time, we have to decide to either:
 Pair the girl with a boy
 Do not pair the girl with any boy

 Example of application:
Assigning tasks to servers

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 8


[Figure: the online version of the example – as girls a, b, and d reveal their
edges, the pairs (1,a), (2,b), and (3,d) are chosen.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 9


 Greedy algorithm for the online graph
matching problem:
 Pair the new girl with any eligible boy
 If there is none, do not pair girl

 How good is the algorithm?

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 10
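A sketch of the greedy rule, with girls arriving as a stream of (girl, acceptable boys) pairs; the choice of "any eligible boy" is simply the first one in list order here, and the names are illustrative:

```python
def greedy_online_matching(boys, girl_stream):
    """As each girl arrives with her list of acceptable boys, pair her with
    any eligible (still unmatched) boy, or leave her unmatched."""
    matched_boys = set()
    matching = []
    for girl, acceptable_boys in girl_stream:
        for boy in acceptable_boys:
            if boy in boys and boy not in matched_boys:
                matched_boys.add(boy)
                matching.append((boy, girl))
                break
    return matching

# Girls arrive one at a time, each revealing her edges
stream = [("a", [1, 2]), ("b", [1, 2]), ("c", [1]), ("d", [2, 4])]
print(greedy_online_matching({1, 2, 3, 4}, stream))  # [(1,'a'), (2,'b'), (4,'d')]
```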


 For input I, suppose greedy produces
matching Mgreedy while an optimal
matching is Mopt

Competitive ratio =
minall possible inputs I (|Mgreedy|/|Mopt|)
(what is greedy’s worst performance over all possible inputs I)

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 11


[Figure: an instance where Mgreedy ≠ Mopt, with the boys B and girls G marked.]

 Consider a case: Mgreedy ≠ Mopt
 Consider the set G of girls
matched in Mopt but not in Mgreedy
 Then every boy B adjacent to girls
in G is already matched in Mgreedy
 If there existed a non-matched
(by Mgreedy) boy adjacent to a non-matched
girl, then greedy would have matched them
 Since the boys B are already matched in Mgreedy:
(1) |Mgreedy| ≥ |B|
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 12
 Summary so far:
 Girls G matched in Mopt but not in Mgreedy
 (1) |Mgreedy| ≥ |B|
 There are at least |G| such boys
(|G| ≤ |B|), otherwise the optimal
algorithm couldn’t have matched all the girls in G
 So: |G| ≤ |B| ≤ |Mgreedy|
 By the definition of G also: |Mopt| ≤ |Mgreedy| + |G|
 Worst case is when |G| = |B| = |Mgreedy|
 |Mopt| ≤ 2|Mgreedy|, so |Mgreedy|/|Mopt| ≥ 1/2
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 13
[Figure: a worst-case instance for greedy – after matching (1,a) and (2,b),
greedy’s matching has half the size of the optimal one.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 14


 Banner ads (1995-2001)
 Initial form of web advertising
 Popular websites charged
X$ for every 1,000
“impressions” of the ad
 Called “CPM” rate
CPM…cost per mille
(Cost per thousand impressions) Mille…thousand in Latin
 Modeled similar to TV, magazine ads
 From untargeted to demographically targeted
 Low click-through rates
 Low ROI for advertisers
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 16
 Introduced by Overture around 2000
 Advertisers bid on search keywords
 When someone searches for that keyword, the
highest bidder’s ad is shown
 Advertiser is charged only if the ad is clicked on

 Similar model adopted by Google with some


changes around 2002
 Called Adwords

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 17


J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 18
 Performance-based advertising works!
 Multi-billion-dollar industry

 Interesting problem:
What ads to show for a given query?
 (Today’s lecture)

 If I am an advertiser, which search terms


should I bid on and how much should I bid?
 (Not focus of today’s lecture)

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 19


 Given:
 1. A set of bids by advertisers for search queries
 2. A click-through rate for each advertiser-query pair
 3. A budget for each advertiser (say for 1 month)
 4. A limit on the number of ads to be displayed with
each search query
 Respond to each search query with a set of
advertisers such that:
 1. The size of the set is no larger than the limit on the
number of ads per query
 2. Each advertiser has bid on the search query
 3. Each advertiser has enough budget left to pay for
the ad if it is clicked upon
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 20
 A stream of queries arrives at the search
engine: q1, q2, …
 Several advertisers bid on each query
 When query qi arrives, search engine must
pick a subset of advertisers whose ads are
shown
 Goal: Maximize search engine’s revenues
 Simple solution: Instead of raw bids, use the
“expected revenue per click” (i.e., Bid*CTR)
 Clearly we need an online algorithm!
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 21
Advertiser | Bid   | CTR (click-through rate) | Bid * CTR (expected revenue)
A          | $1.00 | 1%                       | 1 cent
B          | $0.75 | 2%                       | 1.5 cents
C          | $0.50 | 2.5%                     | 1.25 cents
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 22


Advertiser | Bid   | CTR  | Bid * CTR
B          | $0.75 | 2%   | 1.5 cents
C          | $0.50 | 2.5% | 1.25 cents
A          | $1.00 | 1%   | 1 cent
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 23


 Two complications:
 Budget
 CTR of an ad is unknown

 Each advertiser has a limited budget


 Search engine guarantees that the advertiser
will not be charged more than their daily budget

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 24


 CTR: Each ad has a different likelihood of
being clicked
 Advertiser 1 bids $2, click probability = 0.1
 Advertiser 2 bids $1, click probability = 0.5
 Clickthrough rate (CTR) is measured historically
 Very hard problem: Exploration vs. exploitation
Exploit: Should we keep showing an ad for which we have
good estimates of click-through rate
or
Explore: Shall we show a brand new ad to get a better
sense of its click-through rate

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 25


 Our setting: Simplified environment
 There is 1 ad shown for each query
 All advertisers have the same budget B
 All ads are equally likely to be clicked
 Value of each ad is the same (=1)

 Simplest algorithm is greedy:


 For a query pick any advertiser who has
bid 1 for that query
 Competitive ratio of greedy is 1/2

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 26


 Two advertisers A and B
 A bids on query x, B bids on x and y
 Both have budgets of $4
 Query stream: x x x x y y y y
 Worst case greedy choice: B B B B _ _ _ _
 Optimal: A A A A B B B B
 Competitive ratio = ½
 This is the worst case!
 Note: Greedy algorithm is deterministic – it always
resolves draws in the same way

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 27


 BALANCE Algorithm by Mehta, Saberi,
Vazirani, and Vazirani
 For each query, pick the advertiser with the
largest unspent budget
 Break ties arbitrarily (but in a deterministic way)

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 28
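A sketch of BALANCE in the simplified setting (unit bids, unit ad value), using the two-advertiser example that follows; the tie-breaking rule here (lexicographically larger advertiser name) is our choice, and any deterministic rule works:

```python
def balance(budgets, query_stream, bids):
    """For each query, give the ad slot to the eligible bidder with the
    largest unspent budget; ties broken deterministically by name."""
    remaining = dict(budgets)          # advertiser -> unspent budget
    revenue = 0
    for q in query_stream:
        eligible = [a for a in bids.get(q, []) if remaining[a] >= 1]
        if not eligible:
            continue
        chosen = max(eligible, key=lambda a: (remaining[a], a))
        remaining[chosen] -= 1
        revenue += 1                   # each ad is worth 1 in this setting
    return revenue, remaining

# A bids on x; B bids on x and y; both have budgets of 4
rev, left = balance({"A": 4, "B": 4}, list("xxxxyyyy"),
                    {"x": ["A", "B"], "y": ["B"]})
print(rev)   # 6  ->  competitive ratio 6/8 = 3/4
```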


 Two advertisers A and B
 A bids on query x, B bids on x and y
 Both have budgets of $4

 Query stream: x x x x y y y y

 BALANCE choice: A B A B B B _ _
 Optimal: A A A A B B B B

 In general: For BALANCE on 2 advertisers


Competitive ratio = ¾
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 29
 Consider simple case (w.l.o.g.):
 2 advertisers, A1 and A2, each with budget B (1)
 Optimal solution exhausts both advertisers’ budgets
 BALANCE must exhaust at least one
advertiser’s budget:
 If not, we can allocate more queries
 Whenever BALANCE makes a mistake (both advertisers bid
on the query), advertiser’s unspent budget only decreases
 Since optimal exhausts both budgets, one will for sure get
exhausted
 Assume BALANCE exhausts A2’s budget,
but allocates x queries fewer than the optimal
 Revenue: BAL = 2B - x
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 30
[Figure: the budgets of A1 and A2 (each of size B), showing the queries
allocated to each advertiser in the optimal solution and by BALANCE.]

 Optimal revenue = 2B
 Assume BALANCE gives revenue = 2B − x = B + y
 Unassigned queries should be assigned to A2
(if we could assign to A1 we would, since we still have the budget)
 Goal: Show that y ≥ x
 BALANCE exhausts A2’s budget
 Case 1) ≤ ½ of A1’s queries got assigned to A2:
then y ≥ B/2
 Case 2) > ½ of A1’s queries got assigned to A2:
then x ≤ B/2 and x + y = B
 BALANCE revenue is minimized for x = y = B/2
 Minimum BALANCE revenue = 3B/2
 Competitive ratio = 3/4
used J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 31
 In the general case, worst competitive ratio
of BALANCE is 1–1/e = approx. 0.63
 Interestingly, no online algorithm has a better
competitive ratio!

 Let’s see the worst case example that gives


this ratio

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 32


 N advertisers: A1, A2, … AN
 Each with budget B > N
 Queries:
 N∙B queries appear in N rounds of B queries each
 Bidding:
 Round 1 queries: bidders A1, A2, …, AN
 Round 2 queries: bidders A2, A3, …, AN
 Round i queries: bidders Ai, …, AN
 Optimum allocation:
Allocate round i queries to Ai
 Optimum revenue N∙B
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 33
[Figure: advertisers A1 … AN. BALANCE spreads the B queries of round 1 over
all N advertisers (B/N each), the round-2 queries over A2,…,AN (B/(N−1) each),
the round-3 queries over A3,…,AN (B/(N−2) each), and so on.]

 After k rounds, the sum of the allocations to each of the advertisers
Ak,…,AN is

    Sk = Sk+1 = … = SN = Σ_{i=1}^{k−1} B / (N − (i − 1))

 If we find the smallest k such that Sk ≥ B, then after k rounds
we cannot allocate any queries to any advertiser
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 34
[Figure: the allocations B/1, B/2, B/3, …, B/N and, below, the normalized
terms 1/1, 1/2, 1/3, …, 1/N, with the partial sums S1, S2, … marked; Sk = B in
the top row corresponds to Sk = 1 in the normalized row.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 35


 Fact: Hn = Σ_{i=1}^{n} 1/i ≈ ln n for large n
 Result due to Euler

[Figure: the normalized terms 1/1 … 1/N. All N terms sum to ln(N); the last k
terms sum to 1; the first N−k terms sum to ln(N−k), which also equals ln(N)−1.]

 Sk = 1 implies: H_{N−k} = ln(N) − 1 = ln(N/e)
 We also know: H_{N−k} = ln(N − k)
 So: N − k = N/e
 Then: k = N(1 − 1/e)
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 36
 So after the first k=N(1-1/e) rounds, we
cannot allocate a query to any advertiser

 Revenue = B∙N (1-1/e)

 Competitive ratio = 1-1/e

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 37


 Arbitrary bids and arbitrary budgets!
 Consider we have 1 query q, advertiser i
 Bid = xi
 Budget = bi
 In a general setting BALANCE can be terrible
 Consider two advertisers A1 and A2
 A1: x1 = 1, b1 = 110
 A2: x2 = 10, b2 = 100
 Consider we see 10 instances of q
 BALANCE always selects A1 and earns 10
 Optimal earns 100
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 38
 Arbitrary bids: consider query q, bidder i
 Bid = xi
 Budget = bi
 Amount spent so far = mi
 Fraction of budget left over fi = 1-mi/bi
 Define ψi(q) = xi(1 − e^(−fi))

 Allocate query q to the bidder i with the largest
value of ψi(q)
 Same competitive ratio (1-1/e)

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 39
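A sketch of the generalized rule: compute ψi(q) for every advertiser who bid on q and can still afford the bid (the affordability check is our addition for the sketch), and pick the maximizer:

```python
import math

def generalized_balance_choice(query_bids, spent, budgets):
    """Pick the bidder with the largest psi_i(q) = x_i * (1 - exp(-f_i)),
    where f_i is the fraction of advertiser i's budget left unspent."""
    def psi(advertiser, bid):
        f = 1 - spent[advertiser] / budgets[advertiser]   # fraction left over
        return bid * (1 - math.exp(-f))

    affordable = {a: x for a, x in query_bids.items()
                  if budgets[a] - spent[a] >= x}
    if not affordable:
        return None
    return max(affordable, key=lambda a: psi(a, affordable[a]))

# The earlier bad case for plain BALANCE: A1 bids 1 (budget 110), A2 bids 10 (budget 100)
print(generalized_balance_choice({"A1": 1, "A2": 10},
                                 spent={"A1": 0, "A2": 0},
                                 budgets={"A1": 110, "A2": 100}))   # -> "A2"
```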


Note to other teachers and users of these slides: We would be delighted if you found this our
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org

Mining of Massive Datasets


Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
 We often think of networks being organized
into modules, cluster, communities:

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 2


J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 3
 Find micro-markets by partitioning the
query-to-advertiser graph:
query

advertiser

[Andersen, Lang: Communities from seed sets, 2006]


J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 4
 Clusters in Movies-to-Actors graph:

[Andersen, Lang: Communities from seed sets, 2006]


J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 5
 Discovering social circles, circles of trust:

[McAuley, Leskovec: Discovering social circles in ego networks, 2012]


J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 6
How to find communities?

We will work with undirected (unweighted) networks


J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 7
 Edge betweenness: Number of
shortest paths passing over the edge b=16
 Intuition: b=7.5

Edge strengths (call volume) Edge betweenness


in a real network in a real network
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 8
[Girvan-Newman ‘02]
 Divisive hierarchical clustering based on the
notion of edge betweenness:
the number of shortest paths passing through the edge
 Girvan-Newman Algorithm:
 Undirected unweighted networks
 Repeat until no edges are left:
 Calculate betweenness of edges
 Remove edges with highest betweenness
 Connected components are communities
 Gives a hierarchical decomposition of the network (a code sketch follows below)
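A minimal sketch of this loop, assuming the networkx library is available (networkx also ships its own girvan_newman routine, so this is purely illustrative):

import networkx as nx

def girvan_newman_levels(G):
    # Yield the connected components after each round of edge removal.
    G = G.copy()
    while G.number_of_edges() > 0:
        # Re-compute betweenness at every step: removals change the shortest paths
        betweenness = nx.edge_betweenness_centrality(G)
        max_b = max(betweenness.values())
        for edge, b in list(betweenness.items()):
            if b == max_b:
                G.remove_edge(*edge)    # remove the edge(s) with highest betweenness
        yield [sorted(c) for c in nx.connected_components(G)]

# Example on a graph with two obvious communities joined by a single edge
G = nx.barbell_graph(5, 0)
for communities in girvan_newman_levels(G):
    print(communities)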
(Example figure: a small network whose edges have betweenness values 12, 1, 33, and 49)
 Need to re-compute betweenness at every step
(Figure: Step 1, Step 2, and Step 3 of edge removal, and the resulting hierarchical network decomposition)
(Figure: communities in physics collaborations)
 Zachary’s Karate club:
Hierarchical decomposition
1. How to compute betweenness?
2. How to select the number of
clusters?
 Want to compute betweenness of paths starting at node A
 Breadth-first search starting from A (A is at level 0)
 Count the number of shortest paths from
𝑨 to all other nodes of the network:
 Compute betweenness by working up the
tree: if there are multiple paths, count them fractionally
The algorithm:
 Add edge flows: node flow = 1 + ∑(flows of child edges)
 Split the flow up based on the parent value (the number of shortest paths to each parent)
 Repeat the BFS procedure for each starting node U
(Figure annotations: 1+1 paths to H, split evenly; 1+0.5 paths to J, split 1:2; 1 path to K, split evenly)
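A sketch of the single-source step just described, assuming the graph is given as a dict mapping each node to a set of neighbors (an illustrative layout, not from the slides). Summing these per-source edge credits over all starting nodes, and dividing by 2 for an undirected graph, gives the edge betweenness:

from collections import deque

def single_source_edge_credit(graph, source):
    # graph: dict node -> set of neighbours (illustrative layout)
    # BFS from `source`: level of each node and number of shortest paths to it
    level = {source: 0}
    num_paths = {source: 1}
    order = [source]
    q = deque([source])
    while q:
        v = q.popleft()
        for w in graph[v]:
            if w not in level:
                level[w] = level[v] + 1
                num_paths[w] = 0
                order.append(w)
                q.append(w)
            if level[w] == level[v] + 1:
                num_paths[w] += num_paths[v]

    # Work up the tree: each node gets credit 1 plus the flow of its child edges,
    # and splits it among its parents in proportion to their shortest-path counts
    node_credit = {v: 1.0 for v in level}
    node_credit[source] = 0.0
    edge_credit = {}
    for w in reversed(order):
        parents = [v for v in graph[w] if level.get(v) == level[w] - 1]
        total = sum(num_paths[v] for v in parents)
        for v in parents:
            c = node_credit[w] * num_paths[v] / total
            edge_credit[tuple(sorted((v, w)))] = c
            node_credit[v] += c
    return edge_credit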
1. How to compute betweenness?
2. How to select the number of
clusters?
 Communities: sets of tightly connected nodes
 Define: Modularity Q
 A measure of how well a network is partitioned into communities
 Given a partitioning of the network into groups s ∈ S:
Q ∝ ∑_{s∈S} [ (# edges within group s) − (expected # edges within group s) ]
Need a null model!
 Given a real graph G on n nodes and m edges, construct a rewired network G’
 Same degree distribution but random connections
 Consider G’ as a multigraph
 The expected number of edges between nodes i and j of degrees k_i and k_j equals: k_i · k_j / (2m)
 The expected number of edges in (multigraph) G’:
 (1/2) ∑_{i∈N} ∑_{j∈N} k_i·k_j/(2m) = (1/2)·(1/(2m)) · ∑_{i∈N} k_i · ∑_{j∈N} k_j
 = (1/(4m)) · 2m · 2m = m
 Note: ∑_{u∈N} k_u = 2m
 Modularity of partitioning S of graph G:
 Q ∝ ∑_{s∈S} [ (# edges within group s) − (expected # edges within group s) ]
 Q(G, S) = (1/(2m)) ∑_{s∈S} ∑_{i∈s} ∑_{j∈s} ( A_ij − k_i·k_j/(2m) )
 where A_ij = 1 if (i, j) is an edge and 0 otherwise, and 1/(2m) is a normalizing constant so that −1 < Q < 1
 Modularity values take range [−1, 1]
 It is positive if the number of edges within groups exceeds the expected number
 Q in the range 0.3–0.7 means significant community structure
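A small sketch of this formula, assuming an undirected graph given as an edge list and a partition given as a list of node sets (function and variable names are illustrative):

def modularity(edges, groups):
    # edges: list of undirected edges; groups: list of node sets (a partition)
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    edge_set = {frozenset(e) for e in edges}
    q = 0.0
    for s in groups:
        for i in s:
            for j in s:
                a_ij = 1.0 if i != j and frozenset((i, j)) in edge_set else 0.0
                q += a_ij - degree.get(i, 0) * degree.get(j, 0) / (2 * m)
    return q / (2 * m)

# Two triangles joined by a single edge form two clear communities
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
print(modularity(edges, [{1, 2, 3}, {4, 5, 6}]))   # roughly 0.357, clearly positive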
 Modularity is useful for selecting the number of clusters
(Figure: modularity Q as a function of the number of clusters)
Next time: Why not optimize Modularity directly?
 Undirected graph G(V, E)
(Example figure: a graph on the six nodes 1–6)
 Bi-partitioning task:
 Divide vertices into two disjoint groups A, B
(Example figure: the same graph split into A = {1, 2, 3} and B = {4, 5, 6})
 Questions:
 How can we define a “good” partition of G?
 How can we efficiently identify such a partition?
 What makes a good partition?
 Maximize the number of within-group
connections
 Minimize the number of between-group
connections
 Express partitioning objectives as a function
of the “edge cut” of the partition
 Cut: Set of edges with only one vertex in a
group:
(Example figure: for the partition A = {1, 2, 3}, B = {4, 5, 6} of the example graph, cut(A,B) = 2)
 Criterion: Minimum-cut
 Minimize weight of connections between groups: arg min_{A,B} cut(A,B)
 Degenerate case:
(Figure: a graph where the minimum cut splits off a single node, far from the “optimal cut”)
 Problem:
 Only considers external cluster connections
 Does not consider internal cluster connectivity
[Shi-Malik]
 Criterion: Normalized-cut [Shi-Malik, ’97]
 Connectivity between groups relative to the density of each group:
ncut(A, B) = cut(A,B)/vol(A) + cut(A,B)/vol(B)
 vol(A): total weight of the edges with at least one endpoint in A: vol(A) = ∑_{i∈A} k_i
 Why use this criterion?
 Produces more balanced partitions
 How do we efficiently find a good partition?
 Problem: Computing the optimal cut is NP-hard
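A small sketch of these two criteria on the six-node example graph (its edge list is read off the adjacency matrix shown later in these slides; the function name is illustrative):

def cut_and_ncut(edges, A, B):
    # cut(A,B): edges with one endpoint in each group; vol(X) = sum of degrees in X
    A, B = set(A), set(B)
    cut = sum(1 for u, v in edges if (u in A) != (v in A))
    vol_A = sum((u in A) + (v in A) for u, v in edges)
    vol_B = sum((u in B) + (v in B) for u, v in edges)
    return cut, cut / vol_A + cut / vol_B

# The six-node example graph
edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
print(cut_and_ncut(edges, {1, 2, 3}, {4, 5, 6}))   # (2, 0.5): cut = 2, ncut = 2/8 + 2/8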
 A: adjacency matrix of undirected G
 A_ij = 1 if (i, j) is an edge, else 0
 x is a vector in ℝ^n with components (x_1, …, x_n)
 Think of it as a label/value of each node of G
 What is the meaning of y = A·x?
 Entry y_i = ∑_j A_ij·x_j is the sum of the labels x_j of the neighbors of i
 j-th coordinate of A·x: the sum of the x-values of the neighbors of j
 Make this a new value at node j; when the new values are proportional to the old ones we have an eigenvector: A·x = λ·x
 Spectral Graph Theory:
 Analyze the “spectrum” of the matrix representing G
 Spectrum: eigenvectors x_i of a graph, ordered by the magnitude (strength) of their corresponding eigenvalues λ_i
 Suppose all nodes in G have degree d and G is connected
 What are some eigenvalues/eigenvectors of G?
 A·x = λ·x — what is λ? What is x?
 Let’s try: x = (1, 1, …, 1)
 Then: A·x = (d, d, …, d) = λ·x. So: λ = d
 We found an eigenpair of G: x = (1, 1, …, 1), λ = d
 Remember the meaning of y = A·x: each entry is the sum of the x-values of the node’s neighbors
Details!
 G is d-regular and connected; A is its adjacency matrix
 Claim:
 d is the largest eigenvalue of A
 d has multiplicity 1 (there is only one eigenvector associated with eigenvalue d)
 Proof: Why is there no eigenvalue d′ > d?
 To obtain eigenvalue d we needed x_i = x_j for every i, j
 This means x = c·(1, 1, …, 1) for some constant c
 Define: S = the set of nodes i with the maximum possible value of x_i
 Then consider some vector y which is not a multiple of (1, …, 1). So not all nodes i (with labels y_i) are in S
 Consider some node j ∈ S with a neighbor i ∉ S: then node j gets a new value (A·y)_j strictly less than d·y_j
 So y is not an eigenvector with eigenvalue d or larger, and so d is the largest eigenvalue!
 What if G is not connected?
 G has 2 components, each d-regular (call them A and B)
 What are some eigenvectors?
 x = put all 1s on A and 0s on B, or vice versa
 x′ = (1, …, 1, 0, …, 0) with 1s on the |A| nodes of A; then A·x′ = (d, …, d, 0, …, 0)
 x′′ = (0, …, 0, 1, …, 1) with 1s on the |B| nodes of B; then A·x′′ = (0, …, 0, d, …, d)
 And so in both cases the corresponding λ = d
 A bit of intuition: for the disconnected graph λ_n = λ_{n−1}, and for a nearly disconnected graph λ_n − λ_{n−1} ≈ 0, i.e., the 2nd largest eigenvalue λ_{n−1} has a value very close to λ_n
 More intuition:
(Figure: a disconnected graph, where λ_n = λ_{n−1}, versus a nearly disconnected graph, where λ_n − λ_{n−1} ≈ 0 and the 2nd largest eigenvalue λ_{n−1} is very close to λ_n)
 If the graph is connected (right example) then we already know that x_n = (1, …, 1) is an eigenvector
 Since eigenvectors are orthogonal, the components of x_{n−1} sum to 0
 Why? Because 0 = x_n · x_{n−1} = ∑_i x_n[i] · x_{n−1}[i] = ∑_i x_{n−1}[i]
 So we can look at the eigenvector of the 2nd largest eigenvalue and declare nodes with positive label to be in A and nodes with negative label to be in B
 But there is still lots to sort out
 Adjacency matrix (A):
 n × n matrix
 A = [a_ij], a_ij = 1 if there is an edge between nodes i and j
 For the six-node example graph:
     1  2  3  4  5  6
  1  0  1  1  0  1  0
  2  1  0  1  0  0  0
  3  1  1  0  1  0  0
  4  0  0  1  0  1  1
  5  1  0  0  1  0  1
  6  0  0  0  1  1  0
 Important properties:
 Symmetric matrix
 Eigenvectors are real and orthogonal
 Degree matrix (D):
 n × n diagonal matrix
 D = [d_ii], d_ii = degree of node i
 For the six-node example graph:
     1  2  3  4  5  6
  1  3  0  0  0  0  0
  2  0  2  0  0  0  0
  3  0  0  3  0  0  0
  4  0  0  0  3  0  0
  5  0  0  0  0  3  0
  6  0  0  0  0  0  2
 Laplacian matrix (L): L = D − A
 n × n symmetric matrix
 For the six-node example graph:
     1  2  3  4  5  6
  1  3 -1 -1  0 -1  0
  2 -1  2 -1  0  0  0
  3 -1 -1  3 -1  0  0
  4  0  0 -1  3 -1 -1
  5 -1  0  0 -1  3 -1
  6  0  0  0 -1 -1  2
 What is the trivial eigenpair?
 x = (1, …, 1); then L·x = 0 and so λ = λ_1 = 0
 Important properties:
 Eigenvalues are non-negative real numbers
 Eigenvectors are real and orthogonal
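A small numpy check of these definitions on the six-node example graph (edge list read off the adjacency matrix above):

import numpy as np

edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1   # adjacency matrix of the example graph
D = np.diag(A.sum(axis=1))                  # degree matrix
L = D - A                                   # Laplacian

print(L @ np.ones(6))                    # the zero vector: (1,...,1) is an eigenvector with lambda_1 = 0
print(np.linalg.eigvalsh(L).round(2))    # eigenvalues are real and non-negative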
Details!
 (a) All eigenvalues are ≥ 0
 (b) x^T L x = ∑_{ij} L_ij x_i x_j ≥ 0 for every x
 (c) L = N^T · N for some matrix N
 That is, L is positive semi-definite
 Proof:
 (c)⇒(b): x^T L x = x^T N^T N x = (Nx)^T (Nx) ≥ 0, as it is just the squared length of the vector Nx
 (b)⇒(a): Let λ be an eigenvalue of L with eigenvector x. Then by (b), x^T L x ≥ 0, and x^T L x = x^T λ x = λ x^T x, so λ ≥ 0
 (a)⇒(c): is also easy! Do it yourself.
 Fact: For a symmetric matrix M:
 λ_2 = min_{x ⊥ w_1} (x^T M x) / (x^T x), where the minimum is over unit vectors x orthogonal to the first eigenvector w_1
 What is the meaning of min x^T L x on G?
 x^T L x = ∑_{i,j=1}^{n} L_ij x_i x_j = ∑_{i,j=1}^{n} (D_ij − A_ij) x_i x_j
 = ∑_i D_ii x_i² − ∑_{(i,j)∈E} 2 x_i x_j
 = ∑_{(i,j)∈E} (x_i² + x_j² − 2 x_i x_j) = ∑_{(i,j)∈E} (x_i − x_j)²
 Node i has degree d_i, so the value x_i² is summed up d_i times; but each edge (i, j) has two endpoints, so we need x_i² + x_j²
Details!
 λ_2 = min_{x ⊥ w_1} (x^T M x) / (x^T x)
 Write x in the axes of the eigenvectors w_1, w_2, …, w_n of M. So, x = ∑_{i=1}^{n} α_i w_i
 Then we get: M x = ∑_i α_i M w_i = ∑_i α_i λ_i w_i
 So, what is x^T M x? (Note: w_i · w_j = 0 if i ≠ j, and 1 otherwise)
 x^T M x = (∑_i α_i w_i)·(∑_j α_j λ_j w_j) = ∑_{ij} α_i λ_j α_j (w_i · w_j) = ∑_i λ_i α_i²
 To minimize this over all unit vectors x orthogonal to w_1: minimize over choices of (α_1, …, α_n) such that ∑ α_i² = 1 (unit length) and α_1 = 0 (orthogonal to w_1)
 To minimize this, set α_2 = 1 (and all other α_i = 0), and so ∑_i λ_i α_i² = λ_2
 What else do we know about x?
 x is a unit vector: ∑_i x_i² = 1
 x is orthogonal to the 1st eigenvector (1, …, 1), thus: ∑_i x_i · 1 = ∑_i x_i = 0
 Remember:
 λ_2 = min over all labelings x of the nodes with ∑_i x_i = 0 of [ ∑_{(i,j)∈E} (x_i − x_j)² ] / [ ∑_i x_i² ]
 We want to assign values x_i to nodes i such that few edges cross 0
 (we want x_i and x_j to subtract from each other; balance the positive and negative values to minimize the sum)
 Back to finding the optimal cut
 Express partition (A, B) as a vector:
 y_i = +1 if i ∈ A, and y_i = −1 if i ∈ B
 We can minimize the cut of the partition by finding a non-trivial vector y that minimizes:
 f(y) = ∑_{(i,j)∈E} (y_i − y_j)²
 Can’t solve exactly. Let’s relax y and allow it to take any real value.
 λ_2 = min_y f(y): the minimum value of f(y) is given by the 2nd smallest eigenvalue λ_2 of the Laplacian matrix L
 x = arg min_y f(y): the optimal solution for y is given by the corresponding eigenvector x, referred to as the Fiedler vector
Details!
 Suppose there is a partition of G into A and B where |A| ≤ |B|, such that α = (# edges from A to B) / |A|; then 2α ≥ λ_2
 This is the approximation guarantee of spectral clustering: the cut spectral clustering finds is at most a factor of 2 away from the optimal one of score α
 Proof:
 Let: a = |A|, b = |B| and e = # edges from A to B
 It is enough to choose some x_i based on A and B such that:
 λ_2 ≤ [ ∑_{(i,j)∈E} (x_i − x_j)² ] / [ ∑_i x_i² ] ≤ 2α (while also ∑_i x_i = 0)
 (λ_2 is only smaller than the quotient achieved by this particular x)
Details!
 Proof (continued):
 1) Let’s set: x_i = −1/a if i ∈ A, and x_i = +1/b if i ∈ B
 Let’s quickly verify that ∑_i x_i = 0: a·(−1/a) + b·(1/b) = 0
 2) Then:
 [ ∑_{(i,j)∈E} (x_i − x_j)² ] / [ ∑_i x_i² ]
 = [ ∑_{i∈A, j∈B, (i,j)∈E} (1/a + 1/b)² ] / [ a·(1/a)² + b·(1/b)² ]
 = e·(1/a + 1/b)² / (1/a + 1/b)
 = e·(1/a + 1/b) ≤ e·(1/a + 1/a) = e·(2/a) = 2α
 Which proves that the cost achieved by spectral clustering is better than twice the OPT cost
 (e … number of edges between A and B)
Details!
 Putting it all together:
 2α ≥ λ_2 ≥ α² / (2·k_max)
 where k_max is the maximum node degree in the graph
 Note we only proved the 1st part: 2α ≥ λ_2
 We did not prove λ_2 ≥ α² / (2·k_max)
 Overall this certifies that λ_2 always gives a useful bound on the quality of the cut
 How to define a “good” partition of a graph?
 Minimize a given graph cut criterion
 How to efficiently identify such a partition?
 Approximate using information provided by the eigenvalues and eigenvectors of a graph
 Spectral Clustering
 Three basic stages:
 1) Pre-processing
 Construct a matrix representation of the graph
 2) Decomposition
 Compute eigenvalues and eigenvectors of the matrix
 Map each point to a lower-dimensional representation
based on one or more eigenvectors
 3) Grouping
 Assign points to two or more clusters, based on the new
representation
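A minimal numpy sketch of these three stages on the six-node example graph (splitting the second eigenvector at 0 is the naïve grouping rule discussed just below):

import numpy as np

edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
n = 6

# 1) Pre-processing: build the Laplacian L = D - A
A = np.zeros((n, n))
for i, j in edges:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1
L = np.diag(A.sum(axis=1)) - A

# 2) Decomposition: eigenvalues/eigenvectors of L (eigh returns them in ascending order)
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]          # eigenvector of the 2nd smallest eigenvalue

# 3) Grouping: naive split of the Fiedler vector at 0
cluster_A = [i + 1 for i in range(n) if fiedler[i] >= 0]
cluster_B = [i + 1 for i in range(n) if fiedler[i] < 0]
print(round(eigvals[1], 2), cluster_A, cluster_B)   # expect {1,2,3} vs {4,5,6}, up to sign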
 1) Pre-processing:
 Build the Laplacian matrix L of the graph (the 6×6 matrix L = D − A shown earlier)
 2) Decomposition:
 Find the eigenvalues λ and eigenvectors x of the matrix L
 For the example graph the eigenvalues are Λ = (0.0, 1.0, 3.0, 3.0, 4.0, 5.0)
 Map vertices to the corresponding components of λ_2’s eigenvector x_2:
 node 1: 0.3, node 2: 0.6, node 3: 0.3, node 4: −0.3, node 5: −0.3, node 6: −0.6
 How do we now find the clusters?
 3) Grouping:
 Sort components of the reduced 1-dimensional vector
 Identify clusters by splitting the sorted vector in two
 How to choose a splitting point?
 Naïve approaches:
 Split at 0 or at the median value
 More expensive approaches:
 Attempt to minimize normalized cut in 1-dimension (sweep over the ordering of nodes induced by the eigenvector)
 Example, split at 0: with x_2 = (0.3, 0.6, 0.3, −0.3, −0.3, −0.6) for nodes 1–6
 Cluster A (positive components): nodes 1, 2, 3
 Cluster B (negative components): nodes 4, 5, 6
(Example figures: the value of x_2 plotted against the rank of each node in x_2; the components of x_2; and the components of x_1 and x_3)
 How do we partition a graph into k clusters?
 Two basic approaches:
 Recursive bi-partitioning [Hagen et al., ’92]
 Recursively apply the bi-partitioning algorithm in a hierarchical divisive manner
 Disadvantages: inefficient, unstable
 Cluster multiple eigenvectors [Shi-Malik, ’00]
 Build a reduced space from multiple eigenvectors
 Commonly used in recent papers
 A preferable approach… (a code sketch follows below)
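A sketch of the multiple-eigenvector approach, assuming numpy and scikit-learn are available; for simplicity it uses the unnormalized Laplacian and the eigenvectors of the k smallest non-trivial eigenvalues, whereas Shi-Malik use a normalized variant:

import numpy as np
from sklearn.cluster import KMeans

def spectral_k_clusters(A, k):
    # A: symmetric 0/1 adjacency matrix (numpy array); returns one cluster label per node
    L = np.diag(A.sum(axis=1)) - A              # unnormalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    embedding = eigvecs[:, 1:k + 1]             # skip the trivial eigenvector, keep k of them
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)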
 Approximates the optimal cut [Shi-Malik, ’00]
 Can be used to approximate optimal k-way normalized
cut
 Emphasizes cohesive clusters
 Increases the unevenness in the distribution of the data
 Associations between similar points are amplified,
associations between dissimilar points are attenuated
 The data begins to “approximate a clustering”
 Well-separated space
 Transforms data to a new “embedded space”,
consisting of k orthogonal basis vectors
 Multiple eigenvectors prevent instability due to
information loss