Matching Introduction

10/2020
ODD

OFFICIAL
Matching two datasets

• Create extracts

• Consistently clean the datasets that you are going to match

• Match the data
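
As a rough illustration of what "consistently clean" could mean in practice, the sketch below applies the same normalisation to names and postcodes from both datasets. The function names and the specific rules (folding accents, lower-casing, stripping punctuation, collapsing whitespace) are assumptions for illustration, not the cleaning scripts used in this work.

```python
import re
import unicodedata


def clean_name(name: str) -> str:
    """Normalise a business name so both datasets are compared like for like."""
    # Fold accented characters to their base letters (e.g. "é" -> "e").
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    # Drop apostrophes entirely so "Harry's" and "Harrys" agree.
    text = re.sub(r"['’]", "", text)
    # Replace any other punctuation with a space, then collapse whitespace.
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_postcode(postcode: str) -> str:
    """Upper-case a postcode and remove all whitespace."""
    return re.sub(r"\s+", "", postcode).upper()


print(clean_name("Harry’s Takeaway"))
print(clean_postcode(" ip5  6jw "))
```

The important property is that the same functions are applied to both extracts, so differences in the scores reflect the data rather than the formatting.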

Unified View data 

• Matched to the OS data

• Contains historic information: 


– date an establishment was first uploaded 

– establishments that are no longer in the FHRS

• We have picked out the columns required for matching

We have scripts to test the API

https://internal-data.food.gov.uk/business/id/establishment.json
?_projection=*,premises(*,givenAddress(*),reconciledAddress(*))
&establishmentRegistration.enrolmentStatus=active
&establishmentRegistration.enrolmentAuthority.country=http://data.food.gov.uk/codes/geographies/countries/GB-ENG
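
A request URL like the one above could be assembled from its parts as follows. This is a minimal sketch using only the standard library; the variable names are illustrative, the script makes no network request, and whether the endpoint expects these characters unescaped is an assumption based on how the URL is written on the slide.

```python
from urllib.parse import urlencode

BASE_URL = "https://internal-data.food.gov.uk/business/id/establishment.json"

# Query parameters copied from the slide.
params = {
    "_projection": "*,premises(*,givenAddress(*),reconciledAddress(*))",
    "establishmentRegistration.enrolmentStatus": "active",
    "establishmentRegistration.enrolmentAuthority.country":
        "http://data.food.gov.uk/codes/geographies/countries/GB-ENG",
}

# Leave the projection syntax and the country URI unescaped so the result
# reads like the URL on the slide (assumption: the API accepts this form).
query = urlencode(params, safe="*(),:/")
url = f"{BASE_URL}?{query}"
print(url)
```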

All data has limitations

• is not completely trustworthy (establishments marked as inactive due to an upload error etc.)

• is not always up to date

• is not exhaustive (England: selling food to the final consumer)

• contains some redacted addresses

Matching premises

• Find premises in the two datasets with the same postal sector
• Score how well they match on their different features
• Decide what scores are required for a match we trust
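
The first step, restricting comparisons to premises in the same postal sector, could be sketched as below. The record layout (dictionaries with `name` and `postcode` keys), the function names and the "NR1 1AA" record are assumptions for illustration; the postal sector is taken to be the outward code plus the first character of the inward code.

```python
from collections import defaultdict


def postal_sector(postcode: str) -> str:
    """Return the postal sector, e.g. 'IP5 6JW' -> 'IP5 6'.

    Assumes a complete UK postcode whose inward code is the last three
    characters once spaces are removed.
    """
    compact = postcode.replace(" ", "").upper()
    outward, inward = compact[:-3], compact[-3:]
    return f"{outward} {inward[0]}"


def candidate_pairs(left, right):
    """Yield (left_record, right_record) pairs that share a postal sector."""
    by_sector = defaultdict(list)
    for record in right:
        by_sector[postal_sector(record["postcode"])].append(record)
    for record in left:
        for other in by_sector.get(postal_sector(record["postcode"]), []):
            yield record, other


left = [{"name": "Harrys takeaway", "postcode": "IP5 6JW"}]
right = [
    {"name": "Tesco", "postcode": "IP5 6JT"},
    {"name": "Harry’s", "postcode": "IP5 6JW"},
    {"name": "Elsewhere Ltd", "postcode": "NR1 1AA"},  # hypothetical record
]
pairs = list(candidate_pairs(left, right))
print(len(pairs))
```

Grouping first keeps the number of name comparisons manageable, because only premises in the same sector are ever scored against each other.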

Harrys takeaway IP5 6JW

Required scores: Name > 0.8, Postcode = 1

Candidate     Postcode   Name score   Postcode score   Result
Tesco         IP5 6JT    0.4          0                FAIL
Harry’s       IP5 6JW    0.9          1                PASS
Fresh Café    IP5 6JW    0.5          1                FAIL
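
The worked example above can be reproduced by applying the two thresholds to each candidate's scores. The scores are copied from the slide; the variable and function names are illustrative.

```python
NAME_THRESHOLD = 0.8     # from the slide: Name > 0.8
POSTCODE_REQUIRED = 1    # from the slide: Postcode = 1

# (candidate name, postcode, name score, postcode score)
candidates = [
    ("Tesco", "IP5 6JT", 0.4, 0),
    ("Harry’s", "IP5 6JW", 0.9, 1),
    ("Fresh Café", "IP5 6JW", 0.5, 1),
]


def passes(name_score: float, postcode_score: int) -> bool:
    """A candidate passes only if both requirements are met."""
    return name_score > NAME_THRESHOLD and postcode_score == POSTCODE_REQUIRED


results = {
    name: ("PASS" if passes(name_score, postcode_score) else "FAIL")
    for name, _postcode, name_score, postcode_score in candidates
}
print(results)
```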

Incomplete postcode?

• Find premises in the two datasets with the same postal district
• Score how well they match on their different features
• Decide what scores are required for a match we trust
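
One way to fall back from postal sector to postal district when a postcode is incomplete is sketched below. The regular expressions are a simplified approximation of UK postcode formats, and the function name and return shape are assumptions for illustration.

```python
import re

# Simplified patterns: outward code (e.g. "IP5") and inward code (e.g. "6JW").
FULL_POSTCODE = re.compile(r"^([A-Z]{1,2}[0-9][A-Z0-9]?)\s*([0-9][A-Z]{2})$")
OUTWARD_ONLY = re.compile(r"^([A-Z]{1,2}[0-9][A-Z0-9]?)\b")


def blocking_key(postcode: str):
    """Return ('sector', key) for complete postcodes, ('district', key) otherwise.

    Returns None when not even a postal district can be recognised.
    """
    text = postcode.strip().upper()
    full = FULL_POSTCODE.match(text)
    if full:
        outward, inward = full.groups()
        return ("sector", f"{outward} {inward[0]}")
    partial = OUTWARD_ONLY.match(text)
    if partial:
        return ("district", partial.group(1))
    return None


print(blocking_key("IP5 6JW"))
print(blocking_key("IP5"))
```

A partial postcode written without a space (for example "IP56") is ambiguous under these patterns, so real data would likely need extra handling.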

We used two different methods to calculate the
similarity between names

Levenshtein
minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other. Sensitive to string length.

Jaro–Winkler
similarity based on the number of matching characters and transpositions between the two strings, plus a scale which gives more favourable ratings to strings that match from the beginning for a set prefix length.

'superveg' vs 'super veg'

Levenshtein: 89% 

Jaro–Winkler: 98% 
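
The two percentages above can be reproduced with a plain-Python sketch of both measures. It assumes the Levenshtein score is normalised as 1 minus the edit distance divided by the longer string's length, and that Jaro–Winkler uses the conventional prefix scale of 0.1 over at most 4 characters; the library and settings actually used in this work are not stated, so these are assumptions.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance scaled to 0..1 by the longer string (assumed normalisation)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein_distance(a, b) / longest


def jaro_similarity(a: str, b: str) -> float:
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, char in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_matched[j] and b[j] == char:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    a_seq = [c for c, m in zip(a, a_matched) if m]
    b_seq = [c for c, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler_similarity(a: str, b: str,
                            prefix_scale: float = 0.1,
                            max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    jaro = jaro_similarity(a, b)
    prefix = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return jaro + prefix * prefix_scale * (1 - jaro)


lev = levenshtein_similarity("superveg", "super veg")
jw = jaro_winkler_similarity("superveg", "super veg")
print(f"Levenshtein: {lev:.0%}  Jaro-Winkler: {jw:.0%}")
```

The inserted space costs one edit out of nine characters for Levenshtein, while Jaro–Winkler still finds all eight characters in order and rewards the shared "supe" prefix.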

We used sample data to choose matching rules
that gave a high precision match

We also considered a match with higher recall and lower precision

97% of the matches we made with the adjusted set of rules were correct. We found 83% of the matches in the sample dataset
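
The two figures above correspond to precision (the share of matches made that were correct) and recall (the share of true matches that were found). A minimal sketch of how they could be computed against a labelled sample follows; the pairs are made-up placeholders, not the sample data from this work.

```python
def precision_recall(predicted: set, actual: set):
    """Return (precision, recall) for a set of predicted match pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical labelled sample: (id in dataset A, id in dataset B).
actual = {(1, "a"), (2, "b"), (3, "c"), (4, "d")}
predicted = {(1, "a"), (2, "b"), (3, "x")}

precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Tightening the rules tends to raise precision and lower recall, which is the trade-off described on this slide.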

OUR RULES

For two premises with complete postcodes to
match they must fulfil one of the following criteria*

• names that score above 30% on Levenshtein and 90% on Jaro–Winkler and the same postcode
• names that score above 70% on Levenshtein and the same postcode
• names that score above 30% on Levenshtein and 90% on Jaro–Winkler and addresses that score over 70% and the same postal sector

Matches were discounted if the two premises had very different house numbers.
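
The criteria above can be expressed as a single predicate. This is a sketch, not the contents of match_function.py: the function and parameter names are assumptions, scores are taken to be on a 0 to 1 scale, and the house-number check is passed in as a flag because the slide does not say how "very different" is measured.

```python
def is_match_complete_postcode(lev: float, jw: float, address_score: float,
                               same_postcode: bool, same_sector: bool,
                               house_numbers_differ: bool = False) -> bool:
    """Apply the three complete-postcode rules; any one of them is enough."""
    if house_numbers_differ:
        # Matches are discounted when house numbers are very different.
        return False
    names_agree = lev > 0.30 and jw > 0.90
    rule_1 = names_agree and same_postcode
    rule_2 = lev > 0.70 and same_postcode
    rule_3 = names_agree and address_score > 0.70 and same_sector
    return rule_1 or rule_2 or rule_3


# Similar names at the same postcode: accepted by the first rule.
print(is_match_complete_postcode(0.89, 0.98, 0.0, True, True))
```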

For two premises without complete postcodes to
match they must fulfil one of the following criteria*

• names that score above 30% on Levenshtein and 90% on Jaro–Winkler and the same postal district

Matches were discounted if the two premises had very different house numbers.
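
A sketch of the incomplete-postcode rule, together with one possible house-number check, is shown below. The slide does not define "very different", so the `tolerance` of 2 and the approach of comparing the first number found in each address are purely hypothetical; all function names are illustrative.

```python
import re


def first_house_number(address: str):
    """Return the first run of digits in an address as an int, or None."""
    found = re.search(r"\d+", address)
    return int(found.group()) if found else None


def house_numbers_differ(address_a: str, address_b: str,
                         tolerance: int = 2) -> bool:
    """Hypothetical check: numbers differ by more than `tolerance`."""
    a = first_house_number(address_a)
    b = first_house_number(address_b)
    if a is None or b is None:
        return False  # nothing to compare, so do not discount the match
    return abs(a - b) > tolerance


def is_match_incomplete_postcode(lev: float, jw: float, same_district: bool,
                                 numbers_differ: bool = False) -> bool:
    """Apply the single rule used when a postcode is incomplete."""
    if numbers_differ:
        return False
    return lev > 0.30 and jw > 0.90 and same_district


differ = house_numbers_differ("12 High Street", "98 High Street")
print(is_match_incomplete_postcode(0.89, 0.98, True, numbers_differ=differ))
```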

62% of respondents to the RaFB FBO survey gave
'change of food business operator' as their reason
for registration (1st April to 30th June)

It is difficult to tell if two establishments have the
same operator
no data, franchises, employee vs business name, parent companies...

Identifying inactive establishments


• Only matched to an active establishment: operator information is up to date
• Only matched to an inactive establishment: establishment no longer exists
• Matched to both: we are not confident in assigning a status
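
The three cases above amount to a small decision table, sketched here. The function name and the returned labels are illustrative, and the "unmatched" case is an assumption added so that the function is total; it is not on the slide.

```python
def establishment_status(matched_active: bool, matched_inactive: bool) -> str:
    """Map the kinds of match found to a status for the establishment."""
    if matched_active and matched_inactive:
        return "uncertain"  # not confident in assigning a status
    if matched_active:
        return "operator information up to date"
    if matched_inactive:
        return "establishment no longer exists"
    return "unmatched"  # assumption: no match of either kind


print(establishment_status(matched_active=True, matched_inactive=False))
```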

Matching to RaFB
• Took matches with closest first upload date and registration date
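
Choosing the match whose first upload date is closest to the registration date could look like the sketch below. The record layout, field names and dates are assumptions for illustration only.

```python
from datetime import date


def closest_by_date(registration_date: date, candidates: list) -> dict:
    """Return the candidate whose first upload date is nearest the registration."""
    return min(
        candidates,
        key=lambda c: abs((c["first_upload_date"] - registration_date).days),
    )


# Hypothetical candidate matches for one registration.
candidates = [
    {"id": "A", "first_upload_date": date(2019, 11, 2)},
    {"id": "B", "first_upload_date": date(2020, 5, 20)},
    {"id": "C", "first_upload_date": date(2020, 9, 1)},
]
best = closest_by_date(date(2020, 5, 1), candidates)
print(best["id"])
```

If two candidates are equally close, `min` keeps the first one it encounters, so a deliberate tie-break rule may be needed in practice.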

Our repository

Create UV extracts: 
• api/make_requests.ipynb 

Clean extracts: 
• data_pipelines/clean_ … .R

Match extracts: 
• matching/project_folder/matching_file_ … .ipynb
• match_function.py

Testing new match rules with sample data:


• matching/matching_model_calibration.ipynb