Efficient Parallel Set-Similarity Joins Using MapReduce: Tilani Gunawardena


Efficient Parallel Set-Similarity Joins Using MapReduce

Tilani Gunawardena
Content
• Introduction
• Preliminaries
• Self-Join case
• R-S Join case
• Handling insufficient memory
• Experimental evaluation
• Conclusions
Introduction

• Vast amount of data:


– Google N-gram database: ~1 trillion records
– GenBank: 100 million records, size = 416 GB
– Facebook: 400 million active users

• Detecting similar pairs of records becomes a challenging problem
Examples
• Detecting near-duplicate web pages in web crawling
• Document clustering
• Plagiarism detection
• Master data management
– “John W. Smith” , “Smith, John” , “John William Smith”
• Making recommendations to users based on their similarity to other users
• Query refinement
• Mining in social networking sites
– Users [1,0,0,1,1,0,1,0,0,1] & [1,0,0,0,1,0,1,0,1,1] have similar interests
• Identifying coalitions of click fraudsters in online advertising
Preliminaries
• Problem statement: Given two collections of objects/items/records, a similarity metric sim(o1, o2), and a threshold λ, find the pairs of objects/items/records satisfying sim(o1, o2) ≥ λ
Set-similarity functions
• Jaccard or Tanimoto coefficient
– Jaccard(x, y) = |x ∩ y| / |x ∪ y|

• "I will call back" = [I, will, call, back]
• "I will call you soon" = [I, will, call, you, soon]
• Jaccard similarity = 3/6 = 0.5
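A minimal Python sketch of this computation (whitespace tokenization and the helper name are illustrative assumptions, not the paper's code):

```python
def jaccard(s1, s2):
    """Jaccard similarity of two strings, treating each as a set of whitespace tokens."""
    x, y = set(s1.split()), set(s2.split())
    return len(x & y) / len(x | y)

# 3 shared tokens (I, will, call) out of 6 distinct tokens -> 0.5
print(jaccard("I will call back", "I will call you soon"))  # 0.5
```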
Set-similarity with MapReduce
• Why Hadoop?
– Large amounts of data, shared-nothing architecture

• map (k1, v1) -> list(k2, v2)
• reduce (k2, list(v2)) -> list(k3, v3)

• Problem:
– Too much data to transfer
– Too many pairs to verify (two similar sets share at least one token)
Set-Similarity Filtering
• Efficient set-similarity join algorithms rely on
effective filters

• string s = "I will call back"
• global token ordering: {back, call, will, I}
• prefix of length 2 of s = [back, call]

• The prefix filtering principle states that similar strings need to share at least one common token in their prefixes.
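A small sketch of prefix extraction under a global ordering; the prefix length len(tokens) - T + 1 for a required overlap of T tokens is the standard prefix-filtering bound and is assumed here:

```python
def prefix(tokens, overlap_threshold, global_order):
    """Prefix of a token set under a global ordering (rarest tokens first).

    For a required overlap of T tokens, the prefix length is
    len(tokens) - T + 1 (standard prefix-filtering bound).
    """
    ordered = sorted(set(tokens), key=global_order.index)
    return ordered[: len(ordered) - overlap_threshold + 1]

order = ["back", "call", "will", "I"]                 # global ordering from the slide
print(prefix("I will call back".split(), 3, order))   # ['back', 'call']
```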
Prefix filtering: example

• Record 1 and Record 2 each have 5 tokens
• "Similar": they share at least 4 tokens
• Prefix length: 2
Parallel Set-Similarity Joins
• Stage I: Token Ordering
– Compute data statistics for good signatures
• Stage II: RID-Pair Generation
– Output the pairs of RIDs of records satisfying the join predicate
• Stage III: Record Join
– Generate actual pairs of joined records
Input Data
• RID = Row ID
• a: join column
• "A B C" is a string, e.g.:
– Address: "14th Saarbruecker Strasse"
– Name: "John W. Smith"
Stage I: Token Ordering
• Basic Token Ordering (BTO)
• One-Phase Token Ordering (OPTO)
Token Ordering

• Creates a global ordering of the tokens in the join column, based on their frequency

  RID   a           b   c
  1     A B D A A   …   …
  2     B B D A E   …   …

Global ordering (based on frequency): E D B A
Basic Token Ordering (BTO)

• 2 MapReduce cycles:
– 1st : compute token frequencies
– 2nd: sort the tokens by their frequencies
Basic Token Ordering – 1st MapReduce cycle
map:
• tokenize the join value of each record
• emit each token with a count of 1

reduce:
• for each token, compute the total count (frequency)
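A minimal sketch of this cycle, with plain Python generators standing in for Hadoop map and reduce tasks (the (RID, join value) record layout follows the Input Data slide; names are illustrative):

```python
def bto_map(record):
    """Map: tokenize the join value and emit (token, 1) per occurrence."""
    rid, join_value = record
    for token in join_value.split():
        yield token, 1

def bto_reduce(token, counts):
    """Reduce: sum the occurrences to obtain the token's frequency."""
    yield token, sum(counts)

# e.g. list(bto_map((1, "A B D A A"))) -> [('A', 1), ('B', 1), ('D', 1), ('A', 1), ('A', 1)]
```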
Basic Token Ordering – 2nd MapReduce cycle

map:
• interchange key with value (the frequency becomes the key)

reduce (use only 1 reducer):
• emit the value (the token); with a single reducer the output is globally sorted by frequency
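The second cycle can be sketched the same way; with a single reducer, the framework's key sorting yields the global ordering (simulated here with an explicit sort):

```python
def swap_map(token, frequency):
    """Map: interchange key and value so tokens can be sorted by frequency."""
    yield frequency, token

def order_reduce(pairs):
    """Single reducer: emit the tokens in increasing order of frequency."""
    for frequency, token in sorted(pairs):
        yield token

# e.g. list(order_reduce([(4, 'A'), (1, 'E'), (3, 'B'), (2, 'D')])) -> ['E', 'D', 'B', 'A']
```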
One-Phase Token Ordering (OPTO)
• alternative to Basic Token Ordering (BTO):
– Uses only one MapReduce Cycle (less I/O)
– In-memory token sorting, instead of using a
reducer
OPTO – Details
map:
• tokenize the join value of each record
• emit each token with a count of 1

reduce:
• for each token, compute the total count (frequency)
• use the tear_down method to order the tokens in memory
Stage II: RID-Pair Generation

 Basic Kernel (BK)
 Indexed Kernel (PK)
RID-Pair Generation
• scans the original input data (records)
• outputs the pairs of RIDs corresponding to records satisfying the join predicate (sim)
• consists of only one MapReduce cycle
• uses the global ordering of tokens obtained in the previous stage
RID-Pair Generation: Map Phase

• scan input records and for each record:
– project it onto RID & join attribute
– tokenize it
– extract the prefix according to the global ordering of tokens obtained in the Token Ordering stage
– route tokens to the appropriate reducer
Grouping/Routing Strategies

• Goal: distribute candidates to the right reducers to minimize the reducers' workload
• Like hashing (projected) records to the corresponding candidate buckets
• Each reducer handles one or more candidate buckets
• 2 routing strategies:
– Using Individual Tokens
– Using Grouped Tokens


Routing: using individual tokens

• Treat each token as a key
• For each record, generate a (key, value) pair for each of its prefix tokens

Example:
• Given the global ordering:

  Token:      A   B   E   D   G   C   F
  Frequency:  10  10  22  23  23  40  48

• "A B C" => prefix of length 2: A, B
  => generate/emit 2 (key, value) pairs:
  • (A, (1, A B C))
  • (B, (1, A B C))
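A sketch of this map-side routing, reusing the prefix helper from the filtering section (the overlap threshold and value layout are assumptions):

```python
def route_by_token(record, overlap_threshold, global_order):
    """Emit one (prefix-token, projected-record) pair per prefix token."""
    rid, join_value = record
    for token in prefix(join_value.split(), overlap_threshold, global_order):
        yield token, (rid, join_value)

order = ["A", "B", "E", "D", "G", "C", "F"]
# list(route_by_token((1, "A B C"), 2, order)) -> [('A', (1, 'A B C')), ('B', (1, 'A B C'))]
```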
Grouping/Routing: using individual tokens

• Advantage:
– high quality of grouping of candidates (pairs of records that have no chance of being similar are never routed to the same reducer)
• Disadvantage:
– high replication of data (the same records might be checked for similarity in multiple reducers, i.e., redundant work)
Routing: Using Grouped Tokens
• Multiple tokens are mapped to one synthetic key (different tokens can be mapped to the same key)
• For each record, generate a (key, value) pair for each of the groups of its prefix tokens

Example:
• Given the global ordering:

  Token:      A   B   E   D   G   C   F
  Frequency:  10  10  22  23  23  40  48

• "A B C" => prefix of length 2: A, B
• Suppose A, B belong to group X and C belongs to group Y
  => generate/emit 2 (key, value) pairs:
  • (X, (1, A B C))
  • (Y, (1, A B C))
Grouping/Routing: Using Grouped Tokens

• The groups of tokens (X, Y) are formed by assigning tokens to groups in a round-robin manner

  Token:      A   B   E   D   G   C   F
  Frequency:  10  10  22  23  23  40  48

  Group 1: A, D, F    Group 2: B, G    Group 3: E, C
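A sketch of the round-robin assignment, assuming the tokens arrive already sorted by the Stage I global ordering:

```python
def round_robin_groups(ordered_tokens, num_groups):
    """Assign tokens to synthetic group keys in round-robin order."""
    return {token: i % num_groups for i, token in enumerate(ordered_tokens)}

groups = round_robin_groups(["A", "B", "E", "D", "G", "C", "F"], 3)
# group 0: A, D, F    group 1: B, G    group 2: E, C
```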


Grouping/Routing: Using Grouped Tokens
• Advantage:
– less replication of record projections
• Disadvantage:
– the quality of grouping is not as high (records having no chance of being similar may be sent to the same reducer, which then checks their similarity)

– "A B C D" (A, B belong to Group X; C belongs to Group Y)
  • output: (X, _) & (Y, _)
– "E F G" (E belongs to Group Y)
  • output: (Y, _)
RID-Pair Generation: Reduce Phase

• This is the core of the entire method
• Each reducer processes one or more buckets of candidates
• In each bucket, the reducer looks for pairs of join-attribute values satisfying the join predicate
• If the similarity of the two candidates >= threshold, output their RIDs along with their similarity
RID-Pair Generation: Reduce Phase

• Computing the similarity of the candidates in a bucket comes in 2 flavors:

• Basic Kernel: uses 2 nested loops to verify each pair of candidates in the bucket
• Indexed Kernel: uses a PPJoin+ index
RID-Pair Generation: Basic Kernel

• Straightforward method for finding candidates satisfying the join predicate
• Quadratic complexity: O(#candidates²)
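A sketch of the nested-loop verification over one bucket, reusing the jaccard helper from the preliminaries as the similarity function (the bucket is assumed to hold (RID, join value) projections):

```python
def basic_kernel(bucket, threshold):
    """Verify every pair of candidates in a bucket with two nested loops."""
    for i in range(len(bucket)):
        rid1, value1 = bucket[i]
        for j in range(i + 1, len(bucket)):
            rid2, value2 = bucket[j]
            sim = jaccard(value1, value2)
            if sim >= threshold:
                yield (rid1, rid2), sim
```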
RID-Pair Generation: PPJoin+ Indexed Kernel
• Uses a special index data structure
• Not so straightforward to implement
• map() - same as in the BK algorithm
• Much more efficient
Stage III: Record Join
• Until now we have only pairs of RIDs, but we need the actual records
• Use the RID pairs generated in the previous stage to join the actual records
• Main idea:
– bring in the rest of each record (everything except the RID, which we already have)
• 2 approaches:
– Basic Record Join (BRJ)
– One-Phase Record Join (OPRJ)
Record Join: Basic Record Join

• Uses 2 MapReduce cycles
– 1st cycle: fills in the record information for each half of each pair
– 2nd cycle: brings together the previously filled-in records
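A rough sketch of the first BRJ cycle as a reduce-side join keyed on RID; the tagging of the two inputs ("pair" vs. "record") is an illustrative assumption:

```python
def brj_map(tag, payload):
    """Map: key both inputs by RID so each half of a pair meets its full record."""
    if tag == "pair":                       # (rid1, rid2, sim) from Stage II
        rid1, rid2, sim = payload
        yield rid1, ("pair", (rid2, sim))
        yield rid2, ("pair", (rid1, sim))
    else:                                   # full record: (rid, record_fields)
        rid, fields = payload
        yield rid, ("record", fields)

def brj_reduce(rid, values):
    """Reduce: attach the full record to every pair half that references this RID."""
    values = list(values)
    fields = next(v for t, v in values if t == "record")
    for t, v in values:
        if t == "pair":
            other_rid, sim = v
            # canonical pair key so the 2nd cycle groups both halves together
            yield (min(rid, other_rid), max(rid, other_rid), sim), fields
```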
Record Join: One Phase Record Join

• Uses only one MapReduce cycle
– the similar-RID pairs from the previous stage are loaded into memory by each map task, and the records are joined on the fly
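A sketch of the one-phase variant, assuming (as the summary slide notes) that the similar-RID pairs are small enough to hold in each map task's memory; a real implementation would index the pairs by RID rather than scan them:

```python
def oprj_map(record, rid_pairs):
    """Map: rid_pairs is the in-memory list of (rid1, rid2, sim) from Stage II.
    Emit the pair as key so the two matching records meet in one reduce call."""
    rid, fields = record
    for rid1, rid2, sim in rid_pairs:       # illustrative linear scan
        if rid in (rid1, rid2):
            yield (rid1, rid2, sim), fields

def oprj_reduce(pair, records):
    """Reduce: the two full records for a pair arrive together; emit them joined."""
    yield pair, tuple(records)
```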
R-S Join

• Challenge: we now have 2 different record sources => 2 different input streams

• MapReduce can work on only 1 input stream

• The 2nd and 3rd stages are affected

• Solution: extend the (key, value) pairs so that they include a relation tag for each record
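A minimal sketch of the tagging idea applied to the Stage II routing (the field layout and tag values are assumptions):

```python
def rs_route_by_token(record, relation, overlap_threshold, global_order):
    """Same prefix routing as the self-join case, but the value carries a
    relation tag ("R" or "S") so reducers can tell the two inputs apart."""
    rid, join_value = record
    for token in prefix(join_value.split(), overlap_threshold, global_order):
        yield token, (relation, rid, join_value)
```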
Handling Insufficient Memory
• Map-Based Block Processing
• Reduce-Based Block Processing
Evaluation

• Cluster: 10-node IBM x3650, running Hadoop

• Data sets:
– DBLP: 1.2M publications
– CITESEERX: 1.3M publications
– Consider only the header of each paper (i.e., author, title, date of publication, etc.)
– Data size synthetically increased (by various factors)
• Measures:
– Absolute running time
– Speedup
– Scaleup
Self-Join running time

• Best algorithm: BTO-PK-OPRJ
• Most expensive stage: the RID-pair generation
Self-Join Speedup

• Fixed data size, vary the cluster size
• Best time: BTO-PK-OPRJ
Self-Join Scaleup

• Increase data size and cluster size together by the same factor
• Best time: BTO-PK-OPRJ
Self-Join Summary
• Stage I: BTO was the best choice
• Stage II: PK was the best choice
• Stage III: the best choice depends on the amount of data and the size of the cluster
– OPRJ was somewhat faster, but the cost of loading the similar-RID pairs into memory stayed constant as the cluster size increased and grew as the data size increased. For these reasons, we recommend BRJ as a good alternative.
• Best scaleup was achieved by BTO-PK-BRJ
R-S Join Performance
Speedup
• Stage I: R-S join performance was identical to the first stage of the self-join case
• Stage II: a similar (almost perfect) speedup as in the self-join case
• Stage III: the OPRJ approach was initially the fastest (for the 2- and 4-node cases), but it eventually became slower than the BRJ approach
Conclusions

• For both the self-join and R-S join cases, we recommend BTO-PK-BRJ as a robust and scalable method

• Useful in many data cleaning scenarios

• SSJoin and MapReduce: one solution for huge datasets

• Very efficient when based on prefix filtering and PPJoin+

• Scales up nicely
Thank You!
