Solution For The Data Challenge

Solution for the data challenge.
The 5 data sets given are:
1. Corporate Directory - a list of Canadian corporations with 23 columns
and 50 rows. It contains details about each company.
The most important columns are:
'corporation_number', 'business_number', 'corporate_name',
'alternate_name', 'full_address_source', 'street_address',
'neighborhood', 'city', 'postal_code', 'country', 'formatted_address',
'last_annual_meeting', 'modified_at', 'retrieved_at'
2. SEDAR companies - a list of public Canadian companies. This has 48
columns and 50 rows, all describing companies.
The most important columns are:
'company_name', 'mailing_address_source', 'street_address_mailing',
'neighborhood_mailing', 'city_mailing',
'administrative_area_level_2_mailing',
'administrative_area_level_1_mailing', 'postal_code_mailing',
'country_mailing', 'formatted_address_mailing', 'geometry_mailing'
3. Create a new table for head office addresses (a derived table, not one
of the 5 given data sets). The relevant columns are:
'head_office_address_source', 'street_address_head_office',
'neighborhood_head_office', 'city_head_office',
'administrative_area_level_3_head_office',
'administrative_area_level_2_head_office',
'administrative_area_level_1_head_office', 'postal_code_head_office',
'country_head_office', 'formatted_address_head_office',
'geometry_head_office', 'email', 'telephone', 'fax', 'contact_name',
'cusip', 'industry_classification', 'formation_date',
'original_jurisdiction', 'reporting_jurisdictions', 'principal_regulator',
'financial_year_end', 'stock_exchange', 'documents_number',
'documents_size_in_mb', 'source_url', 'retrieved_at'
4. Canada Procurement - federal government contracts. This has 46
columns with details about a contract, the contract giver, and the
contractor.
The most important columns are:
'supplier_standardized_name', 'supplier_operating_name',
'supplier_legal_name', 'supplier_address_city',
'supplier_address_prov_state', 'supplier_address_postal_code',
'supplier_address_country', 'contracting_entity_office_name_en',
'contracting_address_street_1', 'contracting_address_street_2',
'contracting_address_city', 'contracting_address_prov_state',
'contracting_address_postal_code', 'contracting_address_country'
5. Awarded Government Contracts - municipal government contracts.
This has 16 columns about a bid item and the contractor who won the bid:
'bid_number', 'bid_name', 'bid_status', 'published_date', 'closing_date',
'question_deadline', 'bid_pricing', 'bid_description', 'company_names',
'company_contact_names', 'company_contact_address_1',
'company_contact_address_2', 'company_contact_postal_codes',
'company_contact_emails', 'municipality', 'url'
6. Bills of lading - This has 31 columns with details about a ship
carrying cargo from one place to another.
The most important columns are:
'vessel_name', 'port_of_unlading', 'estimated_arrival_date',
'foreign_port_of_lading', 'record_status_indicator', 'place_of_receipt',
'port_of_destination', 'actual_arrival_date', 'consignee_name',
'consignee_address', 'consignee_comm_number', 'shipper_party_name',
'shipper_address', 'shipper_contact_name'
From the data sets above we can create a unique name column. The element
common to all of them is a business organization, with its details and, in
some cases, its type of business. The bills of lading data set also has
consignee and shipper names, which can relate back to a business name or a
city name. Similarly, the contract data sets have contractor and supplier
names, which can correlate with a business name or organization.
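Creating an entity resolution model on the Name:
Since most of the business names are free text, we can build string
similarity models on them, such as cosine similarity over character
bigrams or n-grams, or graph-based matching. Blocking algorithms keep the
number of name pairs that have to be compared manageable.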
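As a rough illustration of the name matching described above, the sketch below scores pairs of names with cosine similarity over character bigrams and only compares names that fall into the same block. The blocking key (first character of the normalized name), the list of legal suffixes, and the 0.8 threshold are assumptions chosen for the example, not values taken from the challenge.

```python
import math
import re
from collections import Counter, defaultdict
from itertools import combinations

# Assumed list of legal suffixes to ignore; extend as the data requires.
LEGAL_SUFFIXES = {"inc", "ltd", "corp", "co", "limited",
                  "incorporated", "corporation"}


def normalize_name(name):
    """Lowercase, replace punctuation with spaces, drop legal suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    tokens = [t for t in cleaned.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)


def char_ngrams(text, n=2):
    """Count character n-grams of the text, padded with one space per side."""
    padded = f" {text} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_pairs(names, threshold=0.8):
    """Return (name_a, name_b, score) for similar names in the same block."""
    blocks = defaultdict(list)
    for name in names:
        norm = normalize_name(name)
        if norm:
            # Blocking: only names sharing a first character are compared.
            blocks[norm[0]].append((name, char_ngrams(norm)))
    matches = []
    for members in blocks.values():
        for (name_a, grams_a), (name_b, grams_b) in combinations(members, 2):
            score = cosine_similarity(grams_a, grams_b)
            if score >= threshold:
                matches.append((name_a, name_b, round(score, 3)))
    return matches


example_names = ["Acme Widgets Inc.", "ACME Widgets Ltd", "Borealis Mining Corp"]
example_matches = candidate_pairs(example_names)
```

A first-character block is cheap but misses names that differ in their first letter; a real pipeline would likely combine several blocking keys.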
Creating a standard representation for Addresses:
Addresses are the next most common field across these data sets, and most
of them are 3-level addresses. We can reduce them to a standard
representation of country, locality, postal code, etc., and then match each
address to its respective business name.
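One possible shape for such a standard address is sketched below. The `StandardAddress` fields, the country alias table, and the choice to uppercase everything are assumptions for illustration; the postal code pattern is the Canadian letter-digit-letter digit-letter-digit form.

```python
import re
from dataclasses import dataclass

# Assumed alias table; a real one would cover every spelling in the data.
COUNTRY_ALIASES = {"canada": "CA", "ca": "CA", "can": "CA",
                   "united states": "US", "usa": "US", "us": "US"}
POSTAL_CODE_RE = re.compile(r"^([A-Z]\d[A-Z])\s*(\d[A-Z]\d)$")


@dataclass(frozen=True)
class StandardAddress:
    street: str
    city: str
    postal_code: str
    country: str


def clean_text(value):
    """Uppercase and collapse whitespace; None becomes an empty string."""
    return " ".join((value or "").upper().split())


def standardize_postal_code(value):
    """Format Canadian postal codes as 'A1A 1A1'; leave others cleaned only."""
    cleaned = clean_text(value).replace("-", " ")
    match = POSTAL_CODE_RE.match(cleaned)
    return f"{match.group(1)} {match.group(2)}" if match else cleaned


def standardize_address(street, city, postal_code, country):
    """Build a comparable address from raw column values."""
    country_key = clean_text(country).lower()
    return StandardAddress(
        street=clean_text(street),
        city=clean_text(city),
        postal_code=standardize_postal_code(postal_code),
        country=COUNTRY_ALIASES.get(country_key, clean_text(country)),
    )


raw = standardize_address("  100 main  st ", "ottawa", "k1a0b1", "Canada")
clean = standardize_address("100 Main St", "Ottawa", "K1A 0B1", "CA")
```

Because the dataclass is frozen, standardized addresses are hashable and can be used directly as join keys or set members when matching against business names.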
Creating an ER model for those columns which do not have a standard
representation:
In some records in the above data sets the phone number is missing but the
address is given, or the address is missing but the phone number is given.
In such cases we can infer the respective locality from the phone number,
and vice versa. This can help us reduce the duplicates even more.
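A minimal sketch of the phone-to-locality inference follows. The lookup table is a tiny illustrative subset that maps each area code to its largest city; real area codes cover wider regions, so an inferred value is flagged rather than treated as fact. The record keys `telephone` and `city`, and the `city_inferred` flag, are assumed names.

```python
import re

# Illustrative subset only: each code is mapped to its largest city.
AREA_CODE_TO_CITY = {
    "416": "Toronto",
    "514": "Montreal",
    "604": "Vancouver",
    "613": "Ottawa",
    "403": "Calgary",
}


def area_code(phone):
    """Return the 3-digit area code of a North American number, else None."""
    digits = re.sub(r"\D", "", phone or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    return digits[:3] if len(digits) == 10 else None


def fill_missing_city(record):
    """Return a copy with 'city' inferred from 'telephone' when it is empty."""
    if record.get("city"):
        return record
    city = AREA_CODE_TO_CITY.get(area_code(record.get("telephone")))
    if city is None:
        return record
    # Mark the value as inferred so later steps can weight it lower.
    return {**record, "city": city, "city_inferred": True}


filled = fill_missing_city({"telephone": "+1 (416) 555-0199", "city": ""})
```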
We can look for more correlations in the data using pairwise comparisons,
clustering and de-duplication. We can create a graph model on the standard
entities, map it to all the possible entities of the given data sets, and
then reduce it using Spark.
For the names of business owners, we can use MapReduce to collect, across
the different data sets, the businesses that each person may own.
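The map and reduce steps for owner names can be sketched in plain Python as below; on a cluster the same two functions would run under a MapReduce or Spark job. The record keys `person_name`, `source`, and `business` are hypothetical, since the data sets expose people under different column names.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit (normalized person name, (source, business))."""
    key = " ".join(record["person_name"].lower().split())
    return key, (record["source"], record["business"])


def reduce_group(values):
    """Reduce step: distinct (source, business) pairs for one person."""
    return sorted(set(values))


def businesses_by_person(records):
    """Group mapped records by key, then reduce each group."""
    grouped = defaultdict(list)
    for record in records:
        key, value = map_record(record)
        grouped[key].append(value)
    return {key: reduce_group(values) for key, values in grouped.items()}


owner_records = [
    {"person_name": "Jane  Doe", "source": "sedar", "business": "Acme Widgets"},
    {"person_name": "jane doe", "source": "contracts", "business": "Acme"},
    {"person_name": "John Roe", "source": "sedar", "business": "Borealis"},
]
owners = businesses_by_person(owner_records)
```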
The general approach would be to:
ingest the data and clean it;
create metrics to find similarities in the desired parameters, such as
owner names and business names;
make combinations and represent them in a standard and a non-standard
form;
build a graph linking all the records;
extract all the vertices which have a connection and save them in a new
set, which is our unique set here;
combine the result into a single big data source.
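Extracting the connected vertices amounts to finding connected components of the match graph. The sketch below does this with a union-find structure; the record identifiers and the list of matched pairs are made up, and in practice the pairs would come from the name and address comparisons.

```python
from collections import defaultdict


def find(parent, item):
    """Return the root of item's component, compressing the path."""
    parent.setdefault(item, item)
    root = item
    while parent[root] != root:
        root = parent[root]
    while parent[item] != root:
        parent[item], item = root, parent[item]
    return root


def resolve_entities(record_ids, matched_pairs):
    """Group record ids into clusters, one cluster per real-world entity."""
    parent = {}
    for record_id in record_ids:
        find(parent, record_id)  # registers records that have no match
    for a, b in matched_pairs:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two components
    clusters = defaultdict(list)
    for record_id in parent:
        clusters[find(parent, record_id)].append(record_id)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical ids of the form "<data set>:<row>".
ids = ["corp:1", "sedar:7", "proc:3", "bid:9"]
pairs = [("corp:1", "sedar:7"), ("sedar:7", "proc:3")]
entities = resolve_entities(ids, pairs)
```

Each resulting cluster is one entry of the unique set; records with no match form a cluster of their own.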
To track which entities we have changed or modified, we can use hash
sets.
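One way to do this, sketched under the assumption that each record is a dictionary keyed by a record id, is to fingerprint every record before and after cleaning and take the set difference. The function names are illustrative.

```python
import hashlib
import json


def record_fingerprint(record):
    """Stable SHA-256 hash of a record's content, independent of key order."""
    canonical = json.dumps(record, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_record_ids(before, after):
    """Ids that are new in `after` or whose content differs from `before`."""
    before_hashes = {(rid, record_fingerprint(rec)) for rid, rec in before.items()}
    after_hashes = {(rid, record_fingerprint(rec)) for rid, rec in after.items()}
    return {rid for rid, _ in after_hashes - before_hashes}


snapshot_before = {"a": {"city": "ottawa"}, "b": {"city": "Toronto"}}
snapshot_after = {"a": {"city": "OTTAWA"}, "b": {"city": "Toronto"}}
changed = changed_record_ids(snapshot_before, snapshot_after)
```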
