Matching Introduction
10/2020
ODD
OFFICIAL
Matching two datasets
• Create extracts
Unified View data
We have scripts to test the API
https://internal-data.food.gov.uk/business/id/establishment.json?
_projection=*,premises(*,givenAddress(*),reconciledAddress(*))
&establishmentRegistration.enrolmentStatus=active
&establishmentRegistration.enrolmentAuthority.country=http://data.food.gov.uk/codes/geographies/countries/GB-ENG
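As an illustration of how the request above is put together, here is a minimal Python sketch that assembles the same URL from its parts. The function name `build_establishment_url` is hypothetical and this is not necessarily how the scripts in the repository do it. The sketch only builds the string and does not call the API.

```python
from urllib.parse import urlencode

# Endpoint and parameter values are copied from the request shown above.
BASE_URL = "https://internal-data.food.gov.uk/business/id/establishment.json"


def build_establishment_url(country_uri: str) -> str:
    """Return the request URL for active establishments in one country."""
    params = {
        "_projection": "*,premises(*,givenAddress(*),reconciledAddress(*))",
        "establishmentRegistration.enrolmentStatus": "active",
        "establishmentRegistration.enrolmentAuthority.country": country_uri,
    }
    # Leave the projection syntax and the country URI unencoded so the
    # result reads the same as the URL written out above.
    return BASE_URL + "?" + urlencode(params, safe="*(),:/")


url = build_establishment_url(
    "http://data.food.gov.uk/codes/geographies/countries/GB-ENG"
)
print(url)
```

Keeping the country URI as an argument makes it easy to request a different country code, assuming the API accepts the other country URIs in the same form.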
All data has limitations
Matching premises
Harrys takeaway IP5 6JW
Incomplete postcode?
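One way to decide whether a postcode is complete is to check that it has both an outward part (such as IP5) and an inward part (such as 6JW). The sketch below assumes a simplified pattern for UK postcodes. It is not the check used in the project, and the name `is_complete_postcode` is hypothetical.

```python
import re

# Simplified pattern (an assumption, not the full postcode specification):
# outward code, optional space, inward code.
OUTWARD = r"[A-Z]{1,2}[0-9][A-Z0-9]?"
INWARD = r"[0-9][A-Z]{2}"
COMPLETE_POSTCODE = re.compile(rf"^{OUTWARD} ?{INWARD}$")


def is_complete_postcode(value: str) -> bool:
    """Return True if value looks like a full outward + inward postcode."""
    # Upper-case and collapse runs of whitespace before matching.
    cleaned = " ".join(value.upper().split())
    return bool(COMPLETE_POSTCODE.match(cleaned))


print(is_complete_postcode("IP5 6JW"))  # full postcode
print(is_complete_postcode("IP5"))      # outward code only
```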
We used two different methods to calculate the
similarity between names
Levenshtein
minimum number of single-
character edits (insertions, deletions
or substitutions) required to change
one word into the other. Sensitive to
string length.
Jaro–Winkler
similarity based on the number of
matching characters and the number
of transpositions among them, plus
a scale which gives more favourable
ratings to strings that match from
the beginning for a set prefix length
'superveg' vs 'super veg'
Levenshtein: 89%
Jaro–Winkler: 98%
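To show where figures like these can come from, here is a minimal pure-Python sketch of both measures. It rests on some assumptions, because the slides do not say which library or settings were used: the Levenshtein distance is turned into a percentage by dividing by the length of the longer string, and Jaro–Winkler uses the commonly quoted prefix scale of 0.1 over at most 4 characters. With those choices the sketch gives 89% and 98% for this pair.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Distance rescaled to 0..1 using the longer string (an assumed normalisation)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein_distance(a, b) / longest


def jaro_similarity(a: str, b: str) -> float:
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters count as matching only if they are within this distance.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_matched[j] and b[j] == ca:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order, counted in pairs.
    a_seq = [c for c, m in zip(a, a_matched) if m]
    b_seq = [c for c, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler_similarity(a: str, b: str,
                            prefix_scale: float = 0.1,
                            max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (assumed default settings)."""
    jaro = jaro_similarity(a, b)
    prefix = 0
    for ca, cb in zip(a[:max_prefix], b[:max_prefix]):
        if ca != cb:
            break
        prefix += 1
    return jaro + prefix * prefix_scale * (1 - jaro)


print(round(levenshtein_similarity("superveg", "super veg") * 100))   # 89
print(round(jaro_winkler_similarity("superveg", "super veg") * 100))  # 98
```

The single inserted space costs one edit out of nine characters for Levenshtein, while Jaro–Winkler still finds all eight characters in order and rewards the shared prefix, which is why it scores the pair higher.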
We used sample data to choose matching rules
that gave a high precision match
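Precision here means the share of proposed matches that are correct. The sketch below shows how it could be measured on a hand-checked sample. The identifiers and pairs are made up for illustration and are not taken from the project data.

```python
def precision(predicted_matches: set, true_matches: set) -> float:
    """Share of predicted matches that are also in the hand-checked set."""
    if not predicted_matches:
        return 0.0
    true_positives = len(predicted_matches & true_matches)
    return true_positives / len(predicted_matches)


# Hypothetical sample: pairs of (record id in dataset A, record id in dataset B).
predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
checked = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}

print(precision(predicted, checked))  # 2 of the 3 predicted pairs are correct
```

Comparing this figure across candidate rule sets on the same sample is one way to favour rules that rarely produce a wrong match, at the cost of possibly missing some true ones.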
OUR RULES
For two premises with complete postcodes to
match they must fulfil one of the following criteria*
For two premises without complete postcodes to
match they must fulfil one of the following criteria*
62% of respondents to the RaFB FBO survey gave
'change of food business operator' as their reason
for registration (1st April to 30th June)
It is difficult to tell if two establishments have the
same operator
no data, franchises, employee vs business name, parent companies...
Matching to RaFB
• Took matches with closest first upload date and registration date
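The closest-date step can be sketched as follows: when one establishment has several candidate RaFB registrations, keep the one whose registration date is nearest to the first upload date. The field names (`id`, `registration_date`) and the example records are hypothetical, and the real matching code may break ties or filter candidates differently.

```python
from datetime import date
from typing import Optional, Sequence


def closest_registration(first_upload: date,
                         registrations: Sequence[dict]) -> Optional[dict]:
    """Return the candidate whose registration date is nearest to first_upload."""
    if not registrations:
        return None
    return min(
        registrations,
        key=lambda r: abs((r["registration_date"] - first_upload).days),
    )


# Hypothetical candidates for one establishment first uploaded on 1 May 2020.
candidates = [
    {"id": "r1", "registration_date": date(2020, 4, 2)},
    {"id": "r2", "registration_date": date(2020, 4, 28)},
    {"id": "r3", "registration_date": date(2020, 6, 20)},
]
best = closest_registration(date(2020, 5, 1), candidates)
print(best["id"])  # r2, three days before the upload
```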
Our repository
Create UV extracts:
• api/make_requests.ipynb
Clean extracts:
• data_pipelines/clean_ … .R
Match extracts:
• matching/project_folder/matching_file_ … .ipynb
• match_function.py