Solution For The Data Challenge
6. Bills of lading - This dataset has 31 columns describing a ship carrying cargo from one place to another. The most important columns are:
'vessel_name', 'port_of_unlading', 'estimated_arrival_date', 'foreign_port_of_lading', 'record_status_indicator', 'place_of_receipt', 'port_of_destination', 'actual_arrival_date', 'consignee_name', 'consignee_address', 'consignee_comm_number', 'shipper_party_name', 'shipper_address', 'shipper_contact_name'.
From the datasets above we can create a unique name column. The element common to all the datasets is a business organization, in some cases with full details and business types; in the bills-of-lading dataset we also have consignee and shipper names, which can be linked back to a business name or a city name. Similarly, that dataset has contractor and supplier names, which can correlate with a business name or organization.
Since most of the business names are free text, we can build string-similarity models (cosine similarity over bigrams/n-grams, graph-based matching) combined with blocking algorithms to keep the number of pairwise comparisons manageable.
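As a minimal, stdlib-only sketch of this idea (the bigram size, the first-token blocking key, and the example names are assumptions, not from the datasets), character n-gram cosine similarity plus blocking could look like:

```python
from collections import Counter
from math import sqrt


def ngrams(s, n=2):
    """Bag of character n-grams for a lowercased, stripped string."""
    s = s.lower().strip()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine_sim(a, b):
    """Cosine similarity between the n-gram vectors of two names."""
    va, vb = ngrams(a), ngrams(b)
    dot = sum(va[g] * vb[g] for g in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def block_key(name):
    # Blocking: only names sharing the same first token are ever compared.
    return name.lower().split()[0] if name.split() else ""


def candidate_pairs(names):
    """Yield only within-block pairs instead of all O(n^2) pairs."""
    blocks = {}
    for name in names:
        blocks.setdefault(block_key(name), []).append(name)
    for group in blocks.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                yield group[i], group[j]
```

Blocking on the first token means "Acme Corp" is never compared against "Globex Inc", which is what keeps the comparison count tractable at scale.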
Addresses are the next most common field across these datasets, and most of them are three-level addresses, so we can normalize them using a standard representation for country, locality, postal code, etc., and then match each address to its respective business name.
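A minimal sketch of such a reduction, assuming comma-separated "locality, region postal, country" style addresses with a 5-digit postal code (a production pipeline would use a dedicated address parser such as libpostal instead):

```python
import re


def normalize_address(raw):
    """Reduce a free-text address to a (country, locality, postal_code) key.

    Assumptions (sketch only): parts are comma-separated, the last part is
    the country, the first part is the locality, and the postal code is the
    first 5-digit run anywhere in the string.
    """
    parts = [p.strip().upper() for p in raw.split(",") if p.strip()]
    m = re.search(r"\b(\d{5})\b", raw)
    postal = m.group(1) if m else None
    country = parts[-1] if parts else None
    locality = parts[0] if len(parts) > 1 else None
    return (country, locality, postal)
```

Two differently formatted strings for the same address then collapse to the same key, which can be matched back to a business name.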
There are records in the above datasets where the phone number is missing but the address is given, or the address is missing but the phone number is given. In such cases we can infer the locality from the phone number's area code, and vice versa; this helps reduce duplicates even further.
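A sketch of the phone-to-locality direction, with a tiny hypothetical area-code table standing in for a full NANPA (or country-specific) dataset:

```python
import re

# Hypothetical lookup table for illustration only; a real pipeline
# would load a complete area-code-to-locality dataset.
AREA_CODE_TO_LOCALITY = {
    "212": "NEW YORK",
    "312": "CHICAGO",
    "415": "SAN FRANCISCO",
}


def infer_locality_from_phone(phone):
    """Best-effort locality from a US-style phone number's area code."""
    digits = re.sub(r"\D", "", phone)
    if digits.startswith("1"):  # strip the US country code if present
        digits = digits[1:]
    return AREA_CODE_TO_LOCALITY.get(digits[:3])
```

The reverse direction (address to expected area codes) can use the same table inverted, to fill the missing phone locality on records that only carry an address.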
We can look for more correlations in the data using pairwise comparisons, clustering, and deduplication. We can build a graph model over the standardized entities, map it to all the candidate entities in the given datasets, and then reduce it using Spark.
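The graph reduction can be sketched with union-find: each matched pair of records is an edge, and each connected component is one unique entity. This mirrors what a distributed connectedComponents job (e.g. in Spark GraphX/GraphFrames) would do at scale, here on a single machine:

```python
def dedupe_clusters(pairs):
    """Union-find over matched record pairs.

    Each connected component of the match graph is returned as one
    cluster, i.e. one unique entity.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairs:
        union(a, b)

    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())
```

Transitive matches merge automatically: if A matches B and B matches C, all three land in one cluster even though A and C were never directly compared.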
For the names of business owners, we can use MapReduce to collect, across the different datasets, the possible matches for the businesses a person owns.
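A single-machine sketch of that MapReduce job (the record layout and sample names are assumptions): the map phase emits key-value pairs keyed on a normalized owner name, and the reduce phase groups all businesses under each owner.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit (normalized_owner, (dataset, business)) pairs.

    Assumed record layout: (dataset_name, owner_name, business_name).
    """
    for dataset, owner, business in records:
        yield owner.strip().upper(), (dataset, business)


def reduce_phase(kv_pairs):
    """Reduce: group all (dataset, business) values by owner key."""
    grouped = defaultdict(list)
    for owner, value in kv_pairs:
        grouped[owner].append(value)
    return dict(grouped)
```

Normalizing the key in the map phase is what lets "John Doe" and "john doe" from different datasets land in the same reduce group.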
The general approach would be to ingest the data, clean it, and create similarity metrics over the desired parameters (owner names, business names, etc.); generate candidate combinations, represented in both a standard and a non-standard form; build a graph linking all the records; extract every set of connected vertices; and save them in a new set, our unique set, combined into a single big data source.
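The steps above can be sketched end to end in one small function (the threshold, canonical-name rule, and sample names are assumptions; a real pipeline would distribute the blocking and component extraction with Spark):

```python
from collections import Counter, defaultdict
from math import sqrt


def _sim(a, b, n=2):
    # Character-bigram cosine similarity between two names.
    va = Counter(a.lower()[i:i + n] for i in range(len(a) - n + 1))
    vb = Counter(b.lower()[i:i + n] for i in range(len(b) - n + 1))
    dot = sum(va[g] * vb[g] for g in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def unique_entities(names, threshold=0.7):
    """Ingest -> clean -> block -> link -> connected components -> unique set.

    Assumes non-empty name strings; threshold is illustrative.
    """
    # 1. Clean: collapse whitespace.
    names = [" ".join(x.split()) for x in names]

    # 2. Block on the first token to limit comparisons.
    blocks = defaultdict(list)
    for x in names:
        blocks[x.lower().split()[0]].append(x)

    # 3. Link similar pairs into a graph via union-find.
    parent = {x: x for x in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for group in blocks.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                if _sim(group[i], group[j]) >= threshold:
                    parent[find(group[i])] = find(group[j])

    # 4. Each connected component yields one canonical (shortest) name.
    comps = defaultdict(list)
    for x in parent:
        comps[find(x)].append(x)
    return [min(c, key=len) for c in comps.values()]
```

The output is the "unique set" the text describes: one canonical record per connected component, ready to be combined into a single consolidated source.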
To track which entities we have changed or modified, we can use hash sets.
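A sketch of that hash-set tracking: each record is hashed over its fields, so a modified record produces a new hash and is flagged as changed, while an exact duplicate is not (the field-joining scheme here is an illustrative assumption).

```python
import hashlib


def record_hash(record):
    """Stable digest of a record's fields, used to detect modifications."""
    return hashlib.sha256("|".join(str(f) for f in record).encode()).hexdigest()


seen = set()


def is_new_or_changed(record):
    """True the first time a record appears in this exact form."""
    h = record_hash(record)
    if h in seen:
        return False
    seen.add(h)
    return True
```

Re-processing an unchanged record is a set-membership check, so tracking stays O(1) per record regardless of dataset size.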