Efficient Parallel Set-Similarity Joins Using MapReduce: Tilani Gunawardena


Efficient Parallel Set-Similarity Joins Using MapReduce

Tilani Gunawardena
Content
• Introduction
• Preliminaries
• Self-Join case
• R-S Join case
• Handling insufficient memory
• Experimental evaluation
• Conclusions
Introduction

• Vast amount of data:


– Google N-gram database: ~1 trillion records
– GenBank: 100 million records, size = 416 GB
– Facebook: 400 million active users

• Detecting similar pairs of records becomes a challenging problem
Examples
• Detecting near-duplicate web pages in web crawling
• Document clustering
• Plagiarism detection
• Master data management
– “John W. Smith” , “Smith, John” , “John William Smith”
• Making recommendations to users based on their similarity to other users
• Query refinement
• Mining in social networking sites
– Users [1,0,0,1,1,0,1,0,0,1] & [1,0,0,0,1,0,1,0,1,1] have similar interests
• Identifying coalitions of click fraudsters in online advertising
Preliminaries
• Problem statement: Given two collections of objects/items/records, a similarity metric sim(o1, o2), and a threshold λ, find the pairs of objects/items/records satisfying sim(o1, o2) ≥ λ
Set-similarity functions
• Jaccard or Tanimoto coefficient
– Jaccard(x, y) = |x ∩ y| / |x ∪ y|

• "I will call back" = [I, will, call, back]
• "I will call you soon" = [I, will, call, you, soon]
• Jaccard similarity = 3/6 = 0.5
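A minimal Python sketch of this computation (whitespace tokenization and the helper name are illustrative assumptions, not the paper's code):

```python
def jaccard(s1, s2):
    """Jaccard similarity of two strings, treating each as a set of whitespace tokens."""
    x, y = set(s1.split()), set(s2.split())
    return len(x & y) / len(x | y)

# 3 shared tokens (I, will, call) out of 6 distinct tokens -> 0.5
print(jaccard("I will call back", "I will call you soon"))  # 0.5
```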
Set-similarity with MapReduce
• Why Hadoop?
– Large amounts of data, shared-nothing architecture

• map (k1, v1) -> list(k2, v2)
• reduce (k2, list(v2)) -> list(k3, v3)

• Problem:
– Too much data to transfer
– Too many pairs to verify (two similar sets share at least one token)
Set-Similarity Filtering
• Efficient set-similarity join algorithms rely on
effective filters

• string s = "I will call back"
• global token ordering: {back, call, will, I}
• prefix of length 2 of s = [back, call]

• The prefix filtering principle states that similar strings need to share at least one common token in their prefixes.
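A small sketch of prefix extraction under a global ordering; the prefix length len(tokens) - T + 1 for a required overlap of T tokens is the standard prefix-filtering bound and is assumed here:

```python
def prefix(tokens, overlap_threshold, global_order):
    """Prefix of a token set under a global ordering (rarest tokens first).

    For a required overlap of T tokens, the prefix length is
    len(tokens) - T + 1 (standard prefix-filtering bound).
    """
    ordered = sorted(set(tokens), key=global_order.index)
    return ordered[: len(ordered) - overlap_threshold + 1]

order = ["back", "call", "will", "I"]                 # global ordering from the slide
print(prefix("I will call back".split(), 3, order))   # ['back', 'call']
```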
Prefix filtering: example

• Record 1 and Record 2 each have 5 tokens
• "Similar": they share at least 4 tokens
• Prefix length: 2
Parallel Set-Similarity Joins
• Stage I: Token Ordering
– Compute data statistics for good signatures
• Stage II: RID-Pair Generation
– Output the pairs of RIDs of records satisfying the join predicate
• Stage III: Record Join
– Generate actual pairs of joined records
Input Data
• RID = Row ID
• a: join column
• "A B C" is a string, e.g.:
– Address: "14th Saarbruecker Strasse"
– Name: "John W. Smith"
Stage I: Token Ordering
• Basic Token Ordering (BTO)
• One-Phase Token Ordering (OPTO)
Token Ordering

• Creates a global ordering of the tokens in the join column, based on their frequency

  RID   a           b   c
  1     A B D A A   …   …
  2     B B D A E   …   …

Global ordering (based on frequency): E D B A
Basic Token Ordering (BTO)

• 2 MapReduce cycles:
– 1st : compute token frequencies
– 2nd: sort the tokens by their frequencies
Basic Token Ordering – 1st MapReduce cycle
map:
• tokenize the join value of each record
• emit each token with a count of 1

reduce:
• for each token, compute the total count (frequency)
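A minimal sketch of this cycle, with plain Python generators standing in for Hadoop map and reduce tasks (the (RID, join value) record layout follows the Input Data slide; names are illustrative):

```python
def bto_map(record):
    """Map: tokenize the join value and emit (token, 1) per occurrence."""
    rid, join_value = record
    for token in join_value.split():
        yield token, 1

def bto_reduce(token, counts):
    """Reduce: sum the occurrences to obtain the token's frequency."""
    yield token, sum(counts)

# e.g. list(bto_map((1, "A B D A A"))) -> [('A', 1), ('B', 1), ('D', 1), ('A', 1), ('A', 1)]
```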
Basic Token Ordering – 2nd MapReduce cycle

map:
• interchange key with value (the frequency becomes the key)

reduce (use only 1 reducer):
• emit the value (the token); with a single reducer the output is globally sorted by frequency
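The second cycle can be sketched the same way; with a single reducer, the framework's key sorting yields the global ordering (simulated here with an explicit sort):

```python
def swap_map(token, frequency):
    """Map: interchange key and value so tokens can be sorted by frequency."""
    yield frequency, token

def order_reduce(pairs):
    """Single reducer: emit the tokens in increasing order of frequency."""
    for frequency, token in sorted(pairs):
        yield token

# e.g. list(order_reduce([(4, 'A'), (1, 'E'), (3, 'B'), (2, 'D')])) -> ['E', 'D', 'B', 'A']
```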
One-Phase Token Ordering (OPTO)
• alternative to Basic Token Ordering (BTO):
– Uses only one MapReduce Cycle (less I/O)
– In-memory token sorting, instead of using a
reducer
OPTO – Details
map:
• tokenize the join value of each record
• emit each token with a count of 1

reduce:
• for each token, compute the total count (frequency)
• use the tear_down method to order the tokens in memory
Stage II: RID-Pair Generation

 Basic Kernel (BK)
 Indexed Kernel (PK)
RID-Pair Generation
• scans the original input data (records)
• outputs the pairs of RIDs corresponding to records satisfying the join predicate (sim)
• consists of only one MapReduce cycle
• uses the global ordering of tokens obtained in the previous stage
RID-Pair Generation: Map Phase

• scan input records and for each record:
– project it onto RID & join attribute
– tokenize it
– extract the prefix according to the global ordering of tokens obtained in the Token Ordering stage
– route tokens to the appropriate reducer
Grouping/Routing Strategies

• Goal: distribute candidates to the right reducers to minimize the reducers' workload
• Like hashing (projected) records to the corresponding candidate buckets
• Each reducer handles one or more candidate buckets
• 2 routing strategies:
– Using Individual Tokens
– Using Grouped Tokens


Routing: using individual tokens

• Treat each token as a key
• For each record, generate a (key, value) pair for each of its prefix tokens

Example:
• Given the global ordering:

  Token:      A   B   E   D   G   C   F
  Frequency:  10  10  22  23  23  40  48

• "A B C" => prefix of length 2: A, B
  => generate/emit 2 (key, value) pairs:
  • (A, (1, A B C))
  • (B, (1, A B C))
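A sketch of this map-side routing, reusing the prefix helper from the filtering section (the overlap threshold and value layout are assumptions):

```python
def route_by_token(record, overlap_threshold, global_order):
    """Emit one (prefix-token, projected-record) pair per prefix token."""
    rid, join_value = record
    for token in prefix(join_value.split(), overlap_threshold, global_order):
        yield token, (rid, join_value)

order = ["A", "B", "E", "D", "G", "C", "F"]
# list(route_by_token((1, "A B C"), 2, order)) -> [('A', (1, 'A B C')), ('B', (1, 'A B C'))]
```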
Grouping/Routing: using individual tokens

• Advantage:
– high quality of grouping of candidates (pairs of records that have no chance of being similar are never routed to the same reducer)
• Disadvantage:
– high replication of data (the same records might be checked for similarity in multiple reducers, i.e., redundant work)
Routing: Using Grouped Tokens
• Multiple tokens are mapped to one synthetic key (different tokens can be mapped to the same key)
• For each record, generate a (key, value) pair for each of the groups of its prefix tokens

Example:
• Given the global ordering:

  Token:      A   B   E   D   G   C   F
  Frequency:  10  10  22  23  23  40  48

• "A B C" => prefix of length 2: A, B
• Suppose A, B belong to group X and C belongs to group Y
  => generate/emit 2 (key, value) pairs:
  • (X, (1, A B C))
  • (Y, (1, A B C))
Grouping/Routing: Using Grouped Tokens

• The groups of tokens (X, Y) are formed by assigning tokens to groups in a round-robin manner

  Token:      A   B   E   D   G   C   F
  Frequency:  10  10  22  23  23  40  48

  Group 1: A, D, F    Group 2: B, G    Group 3: E, C
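A sketch of the round-robin assignment, assuming the tokens arrive already sorted by the Stage I global ordering:

```python
def round_robin_groups(ordered_tokens, num_groups):
    """Assign tokens to synthetic group keys in round-robin order."""
    return {token: i % num_groups for i, token in enumerate(ordered_tokens)}

groups = round_robin_groups(["A", "B", "E", "D", "G", "C", "F"], 3)
# group 0: A, D, F    group 1: B, G    group 2: E, C
```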


Grouping/Routing: Using Grouped Tokens
• Advantage:
– less replication of record projections
• Disadvantage:
– the quality of grouping is not as high (records having no chance of being similar may be sent to the same reducer, which then checks their similarity)

– "A B C D" (A, B belong to Group X; C belongs to Group Y)
  • output: (X, _) & (Y, _)
– "E F G" (E belongs to Group Y)
  • output: (Y, _)
RID-Pair Generation: Reduce Phase

• This is the core of the entire method
• Each reducer processes one or more buckets of candidates
• In each bucket, the reducer looks for pairs of join-attribute values satisfying the join predicate
• If the similarity of the two candidates >= threshold, output their RIDs along with their similarity
RID-Pair Generation: Reduce Phase

• Computing the similarity of the candidates in a bucket comes in 2 flavors:

• Basic Kernel: uses 2 nested loops to verify each pair of candidates in the bucket
• Indexed Kernel: uses a PPJoin+ index
RID-Pair Generation: Basic Kernel

• Straightforward method for finding candidates satisfying the join predicate
• Quadratic complexity: O(#candidates²)
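A sketch of the nested-loop verification over one bucket, reusing the jaccard helper from the preliminaries as the similarity function (the bucket is assumed to hold (RID, join value) projections):

```python
def basic_kernel(bucket, threshold):
    """Verify every pair of candidates in a bucket with two nested loops."""
    for i in range(len(bucket)):
        rid1, value1 = bucket[i]
        for j in range(i + 1, len(bucket)):
            rid2, value2 = bucket[j]
            sim = jaccard(value1, value2)
            if sim >= threshold:
                yield (rid1, rid2), sim
```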
RID-Pair Generation: PPJoin+ Indexed Kernel
• Uses a special index data structure
• Not so straightforward to implement
• map() - same as in the BK algorithm
• Much more efficient
Stage III: Record Join
• Until now we have only pairs of RIDs, but we need the actual records
• Use the RID pairs generated in the previous stage to join the actual records
• Main idea:
– bring in the rest of each record (everything except the RID, which we already have)
• 2 approaches:
– Basic Record Join (BRJ)
– One-Phase Record Join (OPRJ)
Record Join: Basic Record Join

• Uses 2 MapReduce cycles
– 1st cycle: fills in the record information for each half of each pair
– 2nd cycle: brings together the previously filled-in records
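A rough sketch of the first BRJ cycle as a reduce-side join keyed on RID; the tagging of the two inputs ("pair" vs. "record") is an illustrative assumption:

```python
def brj_map(tag, payload):
    """Map: key both inputs by RID so each half of a pair meets its full record."""
    if tag == "pair":                       # (rid1, rid2, sim) from Stage II
        rid1, rid2, sim = payload
        yield rid1, ("pair", (rid2, sim))
        yield rid2, ("pair", (rid1, sim))
    else:                                   # full record: (rid, record_fields)
        rid, fields = payload
        yield rid, ("record", fields)

def brj_reduce(rid, values):
    """Reduce: attach the full record to every pair half that references this RID."""
    values = list(values)
    fields = next(v for t, v in values if t == "record")
    for t, v in values:
        if t == "pair":
            other_rid, sim = v
            # canonical pair key so the 2nd cycle groups both halves together
            yield (min(rid, other_rid), max(rid, other_rid), sim), fields
```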
Record Join: One Phase Record Join

• Uses only one MapReduce cycle
– the similar-RID pairs from the previous stage are loaded into memory by each map task, and the records are joined on the fly
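A sketch of the one-phase variant, assuming (as the summary slide notes) that the similar-RID pairs are small enough to hold in each map task's memory; a real implementation would index the pairs by RID rather than scan them:

```python
def oprj_map(record, rid_pairs):
    """Map: rid_pairs is the in-memory list of (rid1, rid2, sim) from Stage II.
    Emit the pair as key so the two matching records meet in one reduce call."""
    rid, fields = record
    for rid1, rid2, sim in rid_pairs:       # illustrative linear scan
        if rid in (rid1, rid2):
            yield (rid1, rid2, sim), fields

def oprj_reduce(pair, records):
    """Reduce: the two full records for a pair arrive together; emit them joined."""
    yield pair, tuple(records)
```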
R-S Join

• Challenge: we now have 2 different record sources => 2 different input streams

• MapReduce can work on only 1 input stream

• The 2nd and 3rd stages are affected

• Solution: extend the (key, value) pairs so that they include a relation tag for each record
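A minimal sketch of the tagging idea applied to the Stage II routing (the field layout and tag values are assumptions):

```python
def rs_route_by_token(record, relation, overlap_threshold, global_order):
    """Same prefix routing as the self-join case, but the value carries a
    relation tag ("R" or "S") so reducers can tell the two inputs apart."""
    rid, join_value = record
    for token in prefix(join_value.split(), overlap_threshold, global_order):
        yield token, (relation, rid, join_value)
```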
Handling Insufficient Memory
• Map-Based Block Processing
• Reduce-Based Block Processing
Evaluation

• Cluster: 10-node IBM x3650, running Hadoop

• Data sets:
– DBLP: 1.2M publications
– CITESEERX: 1.3M publications
– Consider only the header of each paper (i.e., author, title, date of publication, etc.)
– Data size synthetically increased (by various factors)
• Measures:
– Absolute running time
– Speedup
– Scaleup
Self-Join running time

• Best algorithm: BTO-PK-OPRJ
• Most expensive stage: the RID-pair generation
Self-Join Speedup

• Fixed data size, vary the cluster size
• Best time: BTO-PK-OPRJ
Self-Join Scaleup

• Increase data size and cluster size together by the same factor
• Best time: BTO-PK-OPRJ
Self-Join Summary
• Stage I: BTO was the best choice
• Stage II: PK was the best choice
• Stage III: the best choice depends on the amount of data and the size of the cluster
– OPRJ was somewhat faster, but the cost of loading the similar-RID pairs into memory stayed constant as the cluster size increased and grew as the data size increased. For these reasons, we recommend BRJ as a good alternative.
• Best scaleup was achieved by BTO-PK-BRJ
R-S Join Performance
Speedup
• Stage I: R-S join performance was identical to the first stage of the self-join case
• Stage II: a similar (almost perfect) speedup as in the self-join case
• Stage III: the OPRJ approach was initially the fastest (for the 2- and 4-node cases), but it eventually became slower than the BRJ approach
Conclusions

• For both the self-join and R-S join cases, we recommend BTO-PK-BRJ as a robust and scalable method

• Useful in many data cleaning scenarios

• SSJoin and MapReduce: one solution for huge datasets

• Very efficient when based on prefix filtering and PPJoin+

• Scales up nicely
Thank You!
