Jaccard Similarity Join: The Code
The Code:
entity_resolution.py
amazon-google-sample.zip
Output:
The program should output the following when running on the provided data:
Before filtering: 256 pairs in total
After Filtering: 79 pairs left
After Verification: 5 similar pairs
(precision, recall, fmeasure) = (1.0, 0.3125, 0.47619047619047616)
Since Jaccard similarity takes two sets as input, your first job is to preprocess the DataFrames by
transforming each record into a set of tokens. Please implement the following function.
Hints.
If you mastered UDFs and withColumn in Assignment 3, you
should have no problem finishing this task. One small hint: take a look at
concat_ws.
For testing, you can compare your outputs with newDF1 and
newDF2, which can be found in the test folder of the Amazon-Google-Sample dataset.
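The per-record logic of this step can be sketched in plain Python, outside Spark. This is only an illustration under stated assumptions: the tokenization rule (lowercase, split on non-word characters), the column names, and the sample record are hypothetical and are not taken from the assignment. In Spark, the same logic would typically live in a UDF applied with withColumn to the result of concat_ws.

```python
import re

def tokenize(text):
    """Assumed rule: lowercase, split on runs of non-word characters, drop empties."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def make_join_key(record, cols):
    """Join the chosen columns with a space, skipping missing values
    (the behaviour concat_ws has for nulls), then return the token set."""
    joined = " ".join(str(record[c]) for c in cols if record.get(c) is not None)
    return set(tokenize(joined))

# Hypothetical record and column names, for illustration only.
record = {"id": "a1", "title": "Adobe Photoshop CS3", "manufacturer": "Adobe"}
join_key = make_join_key(record, ["title", "manufacturer"])
print(sorted(join_key))
```

Returning a set removes duplicate tokens, so "Adobe" appearing in both columns counts once.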
Hints.
You need to construct an inverted index for each of df1 and df2. The inverted index is a
DataFrame with two columns, token and id, which maps each token to the id of a record
that contains it. You might need to use flatMap to obtain the inverted index.
For testing, you can compare your output with candDF, which can be found in
the test folder of the Amazon-Google-Sample dataset.
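The filtering idea can be sketched in plain Python as follows. It is a minimal illustration, not the Spark implementation: each inverted index is a list of (token, id) pairs, and two records become a candidate pair when they share at least one token. The record ids and tokens below are hypothetical.

```python
from collections import defaultdict

def inverted_index(records):
    """Emit one (token, id) pair per distinct token of each record,
    which is what a flatMap over (id, joinKey) rows would produce."""
    return [(tok, rid) for rid, key in records for tok in set(key)]

def candidate_pairs(records1, records2):
    """Join the two inverted indexes on token and keep distinct (id1, id2) pairs."""
    index2 = defaultdict(set)
    for tok, rid in inverted_index(records2):
        index2[tok].add(rid)
    pairs = set()
    for tok, rid1 in inverted_index(records1):
        for rid2 in index2.get(tok, ()):
            pairs.add((rid1, rid2))
    return pairs

# Hypothetical data: (id, joinKey) rows for the two sides.
records1 = [("a1", {"adobe", "photoshop"}), ("a2", {"norton", "antivirus"})]
records2 = [("g1", {"photoshop", "cs3"}), ("g2", {"quicken"})]
cands = candidate_pairs(records1, records2)
print(len(records1) * len(records2), "pairs before filtering;", len(cands), "left")
```

Pairs that share no token have Jaccard similarity 0, so dropping them cannot lose a match for any positive threshold.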
Hints.
You need to implement a function that computes the Jaccard similarity between two
joinKeys. Since the function will be called many times, think about
the most efficient way to implement it. You also need
to handle some edge cases in the function.
For testing, you can compare your output with resultDF, which can be
found in the test folder of the Amazon-Google-Sample dataset.
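One possible shape for such a function is sketched below. It assumes a joinKey is any iterable of tokens, and it returns 0.0 when both keys are empty; that convention is an assumption made here, not something the assignment specifies.

```python
def jaccard(key1, key2):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two token collections."""
    s1, s2 = set(key1), set(key2)      # sets also drop duplicate tokens
    if not s1 and not s2:
        return 0.0                     # assumed convention; avoids 0 / 0
    inter = len(s1 & s2)
    union = len(s1) + len(s2) - inter  # no need to build the union set
    return inter / union

print(jaccard(["a", "b", "c"], ["b", "c", "d"]))  # 2 shared / 4 distinct = 0.5
```

Deriving the union size from the two set sizes and the intersection size avoids materializing a second set on every call.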
Hints. |R|, |A|, or Precision + Recall may be equal to zero, so pay attention to these edge
cases.
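A minimal sketch of the evaluation with those edge cases guarded is shown below. It assumes R is the set of pairs returned by the join and A is the set of ground-truth matching pairs, and it returns 0.0 for any metric whose denominator is zero; both are assumptions, since the handout does not define them here. The pair identifiers are hypothetical, with sizes chosen to be consistent with the sample output above (5 returned pairs, all correct, recall 0.3125).

```python
def evaluate(result, ground_truth):
    """Return (precision, recall, fmeasure), using 0.0 when a denominator is zero."""
    r, a = set(result), set(ground_truth)
    true_matches = len(r & a)
    precision = true_matches / len(r) if r else 0.0
    recall = true_matches / len(a) if a else 0.0
    if precision + recall == 0:
        fmeasure = 0.0
    else:
        fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure

# Hypothetical pairs: 16 true matches, of which the join returned 5.
truth = [("a%d" % i, "g%d" % i) for i in range(16)]
found = truth[:5]
print(evaluate(found, truth))
```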