Jaccard Similarity Join: The Code

This document describes four tasks: (1) preprocess DataFrames by transforming each record into a token set; (2) filter out obviously non-matching pairs using inverted indexes of tokens; (3) compute the Jaccard similarity of the surviving pairs and drop pairs below a threshold; (4) evaluate the entity resolution result by computing precision, recall, and F-measure against the true matches.

Uploaded by

Prasanna Kumar

Read the attached code first, and then implement the remaining four functions: preprocessDF(), filtering(), verification(), and evaluate().

The Code:

entity_resolution.py

The Data set:

amazon-google-sample.zip

Output:
The program should output the following when running on the provided data:
Before filtering: 256 pairs in total
After Filtering: 79 pairs left
After Verification: 5 similar pairs
(precision, recall, fmeasure) = (1.0, 0.3125, 0.47619047619047616)

Jaccard Similarity Join

Task A. Data Preprocessing (Record --> Token Set)

Since Jaccard needs to take two sets as input, your first job is to preprocess DataFrames by
transforming each record into a set of tokens. Please implement the following function.

Hints.

If you have mastered the use of UDF and withColumn by doing Assignment 3, you
should have no problem finishing this task. One small hint: take a look at
concat_ws.

For the purpose of testing, you can compare your outputs with newDF1 and
newDF2, which can be found in the test folder of the Amazon-Google-Sample dataset.
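The record-to-token-set step can be sketched in plain Python as follows. This is only an illustration, not the assignment's reference solution: the tokenization rule (lowercase, split on non-word characters, drop empty strings), the helper name tokenize, and the sample record are all assumptions. In Spark, a function like this would typically be wrapped in a UDF and applied with withColumn after joining the columns with concat_ws.

```python
import re

def tokenize(record, cols, sep=" "):
    """Join the selected fields of a record and return its set of tokens."""
    # Like concat_ws, skip missing (None) fields instead of failing on them.
    joined = sep.join(str(record[c]) for c in cols if record.get(c) is not None)
    # Assumption: a token is a lowercase run of word characters.
    return {t for t in re.split(r"\W+", joined.lower()) if t}

# Hypothetical record, for illustration only.
record = {"id": "a1", "title": "Adobe Photoshop CS3", "manufacturer": None}
join_key = tokenize(record, ["title", "manufacturer"])
```

Here join_key holds the three tokens adobe, photoshop, and cs3; the missing manufacturer field contributes nothing.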

Task B. Filtering Obviously Non-matching Pairs

Hints.

You need to construct an inverted index for df1 and df2, respectively. The inverted index is a
DataFrame with two columns: token and id, which stores a mapping from each token to a record
that contains the token. You might need to use flatMap to obtain the inverted index.

For the purpose of testing, you can compare your output with candDF, which can be found in
the test folder of the Amazon-Google-Sample dataset.
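The inverted-index idea from the hints can be sketched in plain Python. The sketch assumes that "obviously non-matching" means two records share no token at all; the function names and the toy data are hypothetical. In Spark, the two indexes would be DataFrames with columns token and id, and the candidate pairs would come from joining them on token and removing duplicates.

```python
from collections import defaultdict

def inverted_index(records):
    """Map each token to the set of ids of the records that contain it."""
    index = defaultdict(set)
    for rid, tokens in records.items():
        for tok in set(tokens):
            index[tok].add(rid)
    return index

def candidate_pairs(records1, records2):
    """Return the (id1, id2) pairs that share at least one token."""
    idx1 = inverted_index(records1)
    idx2 = inverted_index(records2)
    pairs = set()
    # Only tokens present in both indexes can produce a candidate pair.
    for tok in idx1.keys() & idx2.keys():
        for id1 in idx1[tok]:
            for id2 in idx2[tok]:
                pairs.add((id1, id2))  # the set removes duplicate pairs
    return pairs

# Hypothetical token sets keyed by record id.
df1 = {"a1": {"adobe", "photoshop"}, "a2": {"norton", "antivirus"}}
df2 = {"g1": {"photoshop", "elements"}, "g2": {"quicken", "deluxe"}}
cands = candidate_pairs(df1, df2)
```

Of the four possible pairs in this toy input, only (a1, g1) survives, because it is the only pair with a token in common.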

Task C. Computing Jaccard Similarity for Surviving Pairs

In the second phase of the filtering-and-verification framework, you need to compute the Jaccard
similarity for each surviving pair and return those pairs whose Jaccard similarity is no
smaller than the specified threshold.
In Task C, your job is to implement the verification function. This task looks simple, but there are
a few small "traps" (see the hints below).

Hints.

You need to implement a function for computing the Jaccard similarity between two
joinKeys. Since the function will be called many times, you have to think about
the most efficient way to implement it. Furthermore, you also need
to consider some edge cases in the function.

For the purpose of testing, you can compare your output with resultDF, which can be
found in the test folder of the Amazon-Google-Sample dataset.
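A minimal sketch of the similarity function and the threshold test follows. It uses the usual definition, intersection size over union size, and derives the union size from the two set sizes so that no union set has to be built. Returning 0.0 when both token sets are empty is an assumption made here to avoid dividing by zero; the assignment may expect a different convention. The names jaccard and verify are hypothetical.

```python
def jaccard(tokens1, tokens2):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two token collections."""
    s1, s2 = set(tokens1), set(tokens2)
    if not s1 and not s2:
        return 0.0  # assumption: two empty sets count as dissimilar
    inter = len(s1 & s2)
    # |A ∪ B| = |A| + |B| - |A ∩ B|, so the union is never materialized.
    return inter / (len(s1) + len(s2) - inter)

def verify(pairs, threshold):
    """Keep the pairs whose similarity is no smaller than the threshold."""
    kept = []
    for id1, key1, id2, key2 in pairs:
        sim = jaccard(key1, key2)
        if sim >= threshold:
            kept.append((id1, id2, sim))
    return kept

# Hypothetical candidate pairs: (id1, joinKey1, id2, joinKey2).
pairs = [
    ("a1", ["adobe", "photoshop", "cs3"], "g1", ["adobe", "photoshop", "elements"]),
    ("a2", ["norton", "antivirus"], "g2", ["norton", "utilities", "2007", "suite"]),
]
result = verify(pairs, 0.5)
```

With a threshold of 0.5, the first pair is kept (2 shared tokens out of 4 distinct ones) and the second is dropped (1 out of 5).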

Task D. Evaluating an ER result

Hints. It is possible for |R|, |A|, or Precision + Recall to be equal to zero, so please pay attention to
these edge cases.
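The evaluation step can be sketched as below. The formulas are not given in the text above, so the standard definitions are assumed: with R the set of pairs returned and A the set of true matching pairs, precision is |R ∩ A| / |R|, recall is |R ∩ A| / |A|, and the F-measure is their harmonic mean. Each ratio with a zero denominator is defined as 0.0 here, which is one possible convention rather than a confirmed requirement.

```python
def evaluate(result, ground_truth):
    """Return (precision, recall, fmeasure) for a list of matching pairs."""
    r, a = set(result), set(ground_truth)
    true_positives = len(r & a)
    # Guard every division: |R|, |A|, or precision + recall may be zero.
    precision = true_positives / len(r) if r else 0.0
    recall = true_positives / len(a) if a else 0.0
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return precision, recall, fmeasure

# Hypothetical ids: 5 returned pairs, all correct, out of 16 true matches.
result = [(i, i) for i in range(5)]
truth = [(i, i) for i in range(16)]
precision, recall, fmeasure = evaluate(result, truth)
```

These toy inputs reproduce the shape of the expected output listed earlier: precision 1.0, recall 5/16 = 0.3125, and an F-measure of about 0.476.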
