
Solution for the data challenge.

The 5 data sets given are

1. Corporate Directory - a list of Canadian corporations with 23 columns
and 50 rows. It contains details about each company.

The most important columns are:
'corporation_number', 'business_number', 'corporate_name',
'alternate_name', 'full_address_source', 'street_address', 'neighborhood',
'city', 'postal_code', 'country', 'formatted_address',
'last_annual_meeting', 'modified_at', 'retrieved_at'

2. SEDAR companies - a list of public Canadian companies. This has 48
columns and 50 rows, all about companies.
The most important columns are:
'company_name', 'mailing_address_source', 'street_address_mailing',
'neighborhood_mailing', 'city_mailing',
'administrative_area_level_2_mailing',
'administrative_area_level_1_mailing', 'postal_code_mailing',
'country_mailing', 'formatted_address_mailing', 'geometry_mailing'

3. Create a new table for head office addresses.

'head_office_address_source', 'street_address_head_office',
'neighborhood_head_office', 'city_head_office',
'administrative_area_level_3_head_office',
'administrative_area_level_2_head_office',
'administrative_area_level_1_head_office', 'postal_code_head_office',
'country_head_office', 'formatted_address_head_office',
'geometry_head_office', 'email', 'telephone', 'fax', 'contact_name',
'cusip', 'industry_classification', 'formation_date',
'original_jurisdiction', 'reporting_jurisdictions', 'principal_regulator',
'financial_year_end', 'stock_exchange', 'documents_number',
'documents_size_in_mb', 'source_url', 'retrieved_at'

4. Canada Procurement - federal government contracts. This has 46
columns with details about a contract, the contract giver and the
contractor.
The most important columns are:
'supplier_standardized_name', 'supplier_operating_name',
'supplier_legal_name', 'supplier_address_city',
'supplier_address_prov_state', 'supplier_address_postal_code',
'supplier_address_country', 'contracting_entity_office_name_en',
'contracting_address_street_1', 'contracting_address_street_2',
'contracting_address_city', 'contracting_address_prov_state',
'contracting_address_postal_code', 'contracting_address_country'.

5. Awarded Government Contracts - municipal government contracts.
This has 16 columns about a bid item and the contractor who won the bid.
The most important columns are:
'bid_number', 'bid_name', 'bid_status', 'published_date', 'closing_date',
'question_deadline', 'bid_pricing', 'bid_description', 'company_names',
'company_contact_names', 'company_contact_address_1',
'company_contact_address_2', 'company_contact_postal_codes',
'company_contact_emails', 'municipality', 'url'

6. Bills of lading - This has 31 columns with details about a ship
carrying cargo from one place to another.
The most important columns are:
'vessel_name', 'port_of_unlading', 'estimated_arrival_date',
'foreign_port_of_lading', 'record_status_indicator', 'place_of_receipt',
'port_of_destination', 'actual_arrival_date', 'consignee_name',
'consignee_address', 'consignee_comm_number', 'shipper_party_name',
'shipper_address', 'shipper_contact_name'.

From the data sets above we can create a unique name column. What all the
data sets have in common is a business organization, with its details and,
in some cases, its type of business. The bills of lading data set also has
consignee and shipper names, which can relate back to a business name or a
city name. Similarly, the contract data sets have supplier and contractor
names, which can correlate with a business name or organization.

Creating an entity resolution model on the Name:

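Since most of the business names are free text, we can score candidate
pairs with string similarity measures, such as cosine similarity over
character bigrams or other n-grams, or graph-based measures, and use
blocking algorithms to limit how many pairs have to be compared.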
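
As a minimal sketch of this idea, the Python below scores pairs of business names by cosine similarity over character trigrams and only compares names that share a block. The blocking rule (first three characters of the normalized name), the 0.8 threshold and the sample names are assumptions made for illustration; none of them come from the data sets.

```python
import math
import re
from collections import Counter, defaultdict
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(cleaned.split())


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count the overlapping character n-grams of a space-padded string."""
    padded = f" {text} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def blocking_key(name: str) -> str:
    """Assumed blocking rule: first three characters of the normalized name."""
    return normalize(name)[:3]


def candidate_matches(names, threshold=0.8):
    """Compare names only within a block; keep pairs at or above the threshold."""
    blocks = defaultdict(list)
    for name in names:
        blocks[blocking_key(name)].append(name)

    matches = []
    for block in blocks.values():
        for left, right in combinations(block, 2):
            score = cosine_similarity(
                char_ngrams(normalize(left)), char_ngrams(normalize(right))
            )
            if score >= threshold:
                matches.append((left, right, round(score, 3)))
    return matches


# Made-up sample names, not rows from the data sets.
sample = ["Acme Holdings Inc.", "ACME Holdings Inc", "Acme Holding Inc", "Maple Leaf Foods"]
for match in candidate_matches(sample):
    print(match)
```

Blocking keeps the number of comparisons manageable, at the cost of missing pairs whose names differ in the first few characters. A production blocking key would likely be more forgiving.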

Creating a standard representation for Addresses:

Addresses are the next most common field across these data sets, and most
of them are 3-level addresses, so we can reduce them to a standard
representation for country, locality, postal code etc. and then match each
address to its respective business name.
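
One possible shape for such a standard representation is sketched below in Python. The `StandardAddress` type, the tiny country lookup and the cleaning rules are assumptions for illustration. The only format rule relied on is that Canadian postal codes alternate letter and digit (A1A 1A1).

```python
import re
from dataclasses import dataclass
from typing import Optional

# Illustrative and deliberately incomplete lookup of country spellings.
COUNTRY_ALIASES = {"canada": "CA", "ca": "CA", "can": "CA"}

# Canadian postal codes alternate letter and digit, e.g. A1A 1A1.
POSTAL_CODE_PATTERN = re.compile(r"^[A-Z]\d[A-Z]\d[A-Z]\d$")


@dataclass(frozen=True)
class StandardAddress:
    """Hypothetical standard form shared by all data sets."""
    street: str
    city: str
    postal_code: Optional[str]
    country: Optional[str]


def clean_text(value: Optional[str]) -> str:
    """Uppercase, drop commas, and collapse repeated whitespace."""
    return " ".join((value or "").upper().replace(",", " ").split())


def standardize_postal_code(value: Optional[str]) -> Optional[str]:
    """Return the code as 'A1A 1A1', or None if it does not fit the pattern."""
    compact = re.sub(r"[\s-]", "", (value or "").upper())
    if POSTAL_CODE_PATTERN.match(compact):
        return f"{compact[:3]} {compact[3:]}"
    return None


def standardize_country(value: Optional[str]) -> Optional[str]:
    """Map a known country spelling to a two-letter code, else None."""
    return COUNTRY_ALIASES.get((value or "").strip().lower())


def standardize_address(street, city, postal_code, country) -> StandardAddress:
    return StandardAddress(
        street=clean_text(street),
        city=clean_text(city),
        postal_code=standardize_postal_code(postal_code),
        country=standardize_country(country),
    )


# Two made-up spellings of the same address reduce to one standard form.
a = standardize_address("100 Main St.,", " toronto", "m5v2t6", "Canada")
b = standardize_address("100  main st.", "Toronto", "M5V 2T6", "CA")
print(a == b)
```

Because the standard form is frozen and hashable, it can be used directly as a dictionary key when matching addresses to business names.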

Creating an ER model for those columns which do not have a standard
representation:

Some records in the above data sets have an address but no phone number,
or a phone number but no address. In such cases we can infer the locality
from the phone number, and vice versa. This can help us reduce the
duplicates even further.
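
The phone-to-locality direction could look roughly like the Python sketch below. The `province` field name and the four-entry area code table are assumptions for illustration. A real lookup would need the full list of area codes and would have to allow for numbers that have moved.

```python
import re
from typing import Optional

# Tiny illustrative sample of Canadian area codes; not a complete table.
AREA_CODE_TO_PROVINCE = {"416": "ON", "514": "QC", "604": "BC", "403": "AB"}


def extract_area_code(phone: Optional[str]) -> Optional[str]:
    """Return the 3-digit area code of a 10-digit North American number."""
    digits = re.sub(r"\D", "", phone or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading country code
    return digits[:3] if len(digits) == 10 else None


def infer_province(record: dict) -> Optional[str]:
    """Prefer the stated province; fall back to the telephone area code.

    'province' is a hypothetical field name; 'telephone' is a listed column.
    """
    if record.get("province"):
        return record["province"]
    return AREA_CODE_TO_PROVINCE.get(extract_area_code(record.get("telephone")))


# Made-up record with a phone number but no address.
print(infer_province({"telephone": "+1 (604) 555-0199"}))
```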

We can look for more correlations in the data using pairwise comparisons,
clustering and de-duplication. We can create a graph model on the standard
entities, map it to all the possible entities of the given data sets, and
then reduce it using Spark.

For the names of business owners, we can use MapReduce to find possible
matches across the different data sets for the businesses a person owns.
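
The map, shuffle and reduce steps can be sketched in plain Python before moving to a cluster framework. The record layout (`dataset`, `person`, `business`) and the sample rows are assumptions for illustration. In the real data the person would come from a column such as 'contact_name' or 'company_contact_names'.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit (normalized person name, (data set, business name))."""
    key = " ".join(record["person"].lower().split())
    return key, (record["dataset"], record["business"])


def shuffle(pairs):
    """Group the mapped values by key, as the framework would between steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_group(key, values):
    """Reduce step: keep people who appear in more than one data set."""
    datasets = {dataset for dataset, _ in values}
    return (key, sorted(set(values))) if len(datasets) > 1 else None


def possible_owner_matches(records):
    grouped = shuffle(map_record(record) for record in records)
    reduced = (reduce_group(key, values) for key, values in grouped.items())
    return dict(result for result in reduced if result is not None)


# Made-up rows, not taken from the data sets.
records = [
    {"dataset": "sedar", "person": "Jane  Doe", "business": "Acme Holdings Inc"},
    {"dataset": "awarded_contracts", "person": "jane doe", "business": "Acme Paving Ltd"},
    {"dataset": "sedar", "person": "John Roe", "business": "Maple Leaf Foods"},
]
print(possible_owner_matches(records))
```

Exact matching on a normalized name is only a starting point, since two different people can share a name. The matches are candidates to be confirmed with other fields such as address.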

The general approach would be to ingest the data, clean it, and create
metrics to find similarities in the desired parameters, such as owner names
and business names. We then make combinations, represent them in a standard
and a non-standard form, and build a graph linking all the records. Finally
we extract all the vertices which have a connection, save them in a new set,
which is our unique set here, and combine it into a single big data source.
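
The graph step amounts to finding connected components: each record is a vertex, each accepted match is an edge, and every component becomes one unique entity. The sketch below uses a small union-find structure in Python. The record identifiers and edges are made up for illustration, and on a cluster a graph library's connected-components routine would take the place of this code.

```python
def find(parent, node):
    """Follow parent links to the root, shortening the path on the way."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def connected_components(record_ids, edges):
    """Group record ids that are linked, directly or indirectly, by an edge."""
    parent = {record_id: record_id for record_id in record_ids}
    for left, right in edges:
        root_left, root_right = find(parent, left), find(parent, right)
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two groups

    groups = {}
    for record_id in record_ids:
        groups.setdefault(find(parent, record_id), []).append(record_id)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical record ids, prefixed by the data set they came from.
record_ids = ["corp:1", "sedar:7", "proc:3", "bol:9"]
edges = [("corp:1", "sedar:7"), ("sedar:7", "proc:3")]
for entity in connected_components(record_ids, edges):
    print(entity)
```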

To track which entities we have changed or modified, we can use hash
sets.
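
One way to read this is to fingerprint every record before processing, keep the fingerprints in a set, and flag any processed record whose fingerprint is not in it. The Python sketch below takes that reading. The choice of SHA-256 over a canonical JSON form and the sample records are assumptions for illustration.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Hash a canonical JSON form so that key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_records(before, after):
    """Return the records in 'after' that have no identical copy in 'before'."""
    seen = {fingerprint(record) for record in before}
    return [record for record in after if fingerprint(record) not in seen]


# Made-up records: the second one is modified during cleaning.
before = [
    {"corporate_name": "Acme Holdings Inc", "city": "Toronto"},
    {"corporate_name": "maple leaf foods", "city": "Toronto"},
]
after = [
    {"city": "Toronto", "corporate_name": "Acme Holdings Inc"},
    {"corporate_name": "Maple Leaf Foods", "city": "Toronto"},
]
print(changed_records(before, after))
```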
