Efficient Parallel Set-Similarity Joins Using MapReduce
Tilani Gunawardena
Content
• Introduction
• Preliminaries
• Self-Join case
• R-S Join case
• Handling insufficient memory
• Experimental evaluation
• Conclusions
Introduction
• Jaccard similarity = 3/6 = 0.5
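As a small illustration of how a value like 3/6 arises, the sketch below computes Jaccard similarity as the size of the intersection divided by the size of the union. The two token lists are hypothetical, chosen only so that 3 tokens are shared out of 6 distinct ones; the treatment of two empty sets is an assumption.

```python
def jaccard(x, y):
    """Jaccard similarity |x ∩ y| / |x ∪ y| of two token collections."""
    a, b = set(x), set(y)
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


# Hypothetical records: 3 shared tokens, 6 distinct tokens overall.
r1 = ["I", "will", "call", "back"]
r2 = ["I", "will", "call", "you", "soon"]
print(jaccard(r1, r2))  # 0.5
```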
Set-similarity with MapReduce
• Why Hadoop ?
– Large amounts of data, shared-nothing architecture
Record 1: 1 | A B D A A | … …
Record 2: 2 | B B D A E | … …
Global ordering (based on frequency):
Token E D B A
Frequency 1 2 3 4
Basic Token Ordering (BTO)
• 2 MapReduce cycles:
– 1st: compute token frequencies
– 2nd: sort the tokens by their frequencies
Basic Token Ordering – 1st MapReduce cycle
map:
• tokenize the join value of each record
• emit each token with no. of occurrences 1
reduce:
• for each token, compute total count (frequency)
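The first cycle can be sketched in plain Python, with a dictionary standing in for the shuffle between map and reduce. This is a single-process illustration, not Hadoop code; the function names are hypothetical, and the two records assume the join values "A B D A A" and "B B D A E" with whitespace-separated tokens.

```python
from collections import defaultdict


def map_tokens(record):
    """Map: tokenize the join value and emit (token, 1) per occurrence."""
    rid, join_value = record
    for token in join_value.split():
        yield token, 1


def reduce_count(token, ones):
    """Reduce: sum the occurrences of one token."""
    return token, sum(ones)


def run_cycle1(records):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for record in records:
        for token, one in map_tokens(record):
            groups[token].append(one)
    return dict(reduce_count(t, v) for t, v in groups.items())


records = [(1, "A B D A A"), (2, "B B D A E")]
freq = run_cycle1(records)
print(freq)  # {'A': 4, 'B': 3, 'D': 2, 'E': 1}
```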
Basic Token Ordering – 2nd MapReduce cycle
map:
• swap each (token, frequency) pair to (frequency, token), so that the pairs are sorted by frequency
reduce:
• use a single reducer to output the tokens in increasing order of frequency
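A minimal sketch of the second cycle, which sorts the tokens by their frequencies. It assumes the map step swaps key and value so that sorting happens on the frequency, and that one reducer collects the sorted result; breaking ties by token name is an additional assumption. The helper `reorder` is hypothetical and shows how a record's tokens could then be arranged in the global order.

```python
def map_swap(token, count):
    """Map: swap key and value so the sort key is the frequency."""
    return count, token


def run_cycle2(freq):
    swapped = [map_swap(t, c) for t, c in freq.items()]
    swapped.sort()  # stands in for the framework's sort; ties broken by token
    # A single reducer would see the pairs in this order and emit the tokens.
    return [token for _, token in swapped]


freq = {"A": 4, "B": 3, "D": 2, "E": 1}
ordering = run_cycle2(freq)
print(ordering)  # ['E', 'D', 'B', 'A']

# Position of each token in the global ordering (rarest first).
rank = {token: i for i, token in enumerate(ordering)}


def reorder(join_value):
    """Arrange a record's tokens according to the global ordering."""
    return sorted(join_value.split(), key=rank.__getitem__)


print(reorder("B B D A E"))  # ['E', 'D', 'B', 'B', 'A']
```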
Stage II: RID-Pair Generation
• Basic Kernel (BK)
• Indexed Kernel (PK)
RID-Pair Generation
• scans the original input data (records)
• outputs the pairs of RIDs corresponding to records
satisfying the join predicate (sim)
• consists of only one MapReduce cycle
Example: record with RID 1 and join value “A B C”
=> prefix of length 2: A, B
=> generate/emit 2 (key, value) pairs:
• (A, (1,A B C))
• (B, (1,A B C))
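The map side of this example can be sketched as follows. The prefix length uses the usual prefix-filtering formula for a Jaccard threshold, |x| - ceil(t * |x|) + 1; the threshold 0.5 is an assumption picked because it yields the prefix length 2 shown above, and the function names are hypothetical. Tokens are assumed to be already sorted by the global ordering.

```python
import math


def prefix_length(size, threshold):
    """Prefix length for a Jaccard threshold (standard prefix filter)."""
    return size - math.ceil(threshold * size) + 1


def map_individual(rid, tokens, threshold):
    """Map: emit one (token, record) pair per prefix token.

    `tokens` must already be sorted by the global token ordering.
    """
    k = prefix_length(len(tokens), threshold)
    for token in tokens[:k]:
        yield token, (rid, tokens)


# Assumed threshold of 0.5 gives a prefix of length 2 for a 3-token record.
pairs = list(map_individual(1, ["A", "B", "C"], 0.5))
for key, value in pairs:
    print(key, value)
# A (1, ['A', 'B', 'C'])
# B (1, ['A', 'B', 'C'])
```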
Grouping/Routing: using individual tokens
• Advantage:
– high quality of grouping of candidates (pairs of
records that have no chance of being similar are
never routed to the same reducer)
• Disadvantage:
– high replication of data (same records might be
checked for similarity in multiple reducers, i.e.
redundant work)
Routing: Using Grouped Tokens
• Multiple tokens mapped to one synthetic key
(different tokens can be mapped to the same key)
• For each record, generate one (key, value) pair for each
group that contains at least one of its prefix tokens:
Example:
• Given the global ordering:
Token A B E D G C F
Frequency 10 10 22 23 23 40 48
• Tokens assigned to 3 groups in round-robin order:
Group 1: A D F
Group 2: B G
Group 3: E C
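The grouping in this example is consistent with a round-robin assignment of the ordered tokens to three groups, which is what the sketch below assumes. The function names, the sample record, and its prefix length of 3 are hypothetical; groups are numbered from 0 in the code.

```python
def assign_groups(ordering, num_groups):
    """Assign tokens to groups round-robin along the global ordering."""
    return {token: i % num_groups for i, token in enumerate(ordering)}


def map_grouped(rid, tokens, prefix_len, group_of):
    """Map: emit one (group, record) pair per distinct group in the prefix."""
    seen = set()
    for token in tokens[:prefix_len]:
        group = group_of[token]
        if group not in seen:
            seen.add(group)
            yield group, (rid, tokens)


ordering = ["A", "B", "E", "D", "G", "C", "F"]
group_of = assign_groups(ordering, 3)
# group 0: A D F, group 1: B G, group 2: E C

# Hypothetical record, tokens sorted by the global ordering, prefix length 3.
pairs = list(map_grouped(1, ["A", "B", "D", "C"], 3, group_of))
print([key for key, _ in pairs])  # [0, 1]
```

The prefix tokens A and D fall into the same group, so the record is emitted twice rather than three times, which illustrates the lower replication compared with routing on individual tokens.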
• Disadvantage:
– Lower quality of grouping (records that have no
chance of being similar may be sent to the same reducer,
which then checks their similarity)
(Figure: bucket of candidates)
RID-Pair Generation: Reduce Phase
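A rough sketch of what a reducer could do with one bucket of candidates, assuming a plain nested-loop comparison in the spirit of the Basic Kernel. The function names, the bucket contents, and the threshold 0.5 are hypothetical.

```python
from itertools import combinations


def jaccard(x, y):
    """Jaccard similarity of two token collections."""
    a, b = set(x), set(y)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def reduce_bk(key, bucket, threshold):
    """Reduce: compare every pair in the bucket, emit qualifying RID pairs."""
    for (rid1, t1), (rid2, t2) in combinations(bucket, 2):
        sim = jaccard(t1, t2)
        if sim >= threshold:
            yield (min(rid1, rid2), max(rid1, rid2)), sim


# Hypothetical bucket: all records routed to the reducer for key "A".
bucket = [
    (1, ["A", "B", "C"]),
    (2, ["A", "B", "D"]),
    (3, ["A", "E", "F"]),
]
results = list(reduce_bk("A", bucket, 0.5))
print(results)  # [((1, 2), 0.5)]
```

Because a record is replicated once per prefix token or group, the same RID pair can be produced by more than one reducer, so duplicates would have to be removed afterwards.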
• Scales up nicely
Thank You!