
Identifying Problems in Data :

What kinds of problems can be found in a particular data set, and how can they be fixed (or at least be expected to be fixed)?

Incomplete Data :
This is by far the most common data quality issue in data sets: key columns are missing information, which impacts downstream analytics.
How to fix this issue?
The best way to fix this is to put a reconciliation control in place. The control checks the number of records passing through your analytical layers and alerts when records have gone missing.
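
As a minimal, hedged sketch of such a reconciliation check in Python with pandas (the file names are assumptions, not part of the original text):

    import pandas as pd

    # Hypothetical extracts: one from the source system, one from the analytical layer
    source = pd.read_csv("source_transactions.csv")      # assumed file name
    target = pd.read_csv("warehouse_transactions.csv")   # assumed file name

    source_count = len(source)
    target_count = len(target)

    # Alert when records have gone missing between layers
    if target_count < source_count:
        missing = source_count - target_count
        print(f"ALERT: {missing} records lost between source ({source_count}) "
              f"and analytical layer ({target_count})")
    else:
        print("Reconciliation passed: no records lost.")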

Default Values :
Ever analysed your data and found 01/01/1891 as a date for a transaction? Unless your
customer base comprises 130-year-old individuals, this is likely a case of using default
values. This is especially a problem if there is a lack of documentation.
How to fix this issue?
The best way to fix this is to profile the data and understand why the default values were used. Usually, engineers insert a default when the real date is unavailable.
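
As a minimal sketch of such profiling with pandas (the transaction_date column and file name are assumptions), counting the most frequent dates quickly surfaces suspicious defaults such as 01/01/1891:

    import pandas as pd

    # Hypothetical transactions table with an assumed transaction_date column
    df = pd.read_csv("transactions.csv", parse_dates=["transaction_date"])

    # Genuine dates are spread out; default values cluster heavily at the top
    print(df["transaction_date"].value_counts().head(10))

    # Flag rows that use a known placeholder value
    suspect_default = pd.Timestamp("1891-01-01")
    defaults = df[df["transaction_date"] == suspect_default]
    print(f"{len(defaults)} rows use the default date {suspect_default.date()}")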

Data Format Inconsistencies :
String columns predominantly suffer from this problem, where the same data can be stored in many formats. For example, a customer's first and last name may be stored in different cases, or an email address may be stored without the correct formatting. It occurs when multiple systems store information without an agreed data format.
How to fix this issue?
To fix this, data needs to be homogenized (standardized) across the source systems, or at least in the data pipeline when it is fed to the data lake or warehouse.
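
A minimal sketch of such standardization in a pandas pipeline, assuming hypothetical first_name, last_name and email columns:

    import pandas as pd

    df = pd.DataFrame({
        "first_name": ["alice", "BOB", "  Carol "],
        "last_name":  ["SMITH", "jones", "De Vries"],
        "email":      ["Alice.Smith@Example.COM", "bob@example.com ", "carol@@example"],
    })

    # Standardize name casing and trim whitespace
    for col in ["first_name", "last_name"]:
        df[col] = df[col].str.strip().str.title()

    # Lower-case emails and flag values that fail a simple email pattern
    df["email"] = df["email"].str.strip().str.lower()
    df["email_valid"] = df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    print(df)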

Duplicate Data :
Reasonably straightforward to spot, but quite tricky to fix. If a critical attribute is populated with dirty, duplicated data, it will break all the key downstream processes and can also cause other data quality issues.
How to fix this issue?
To fix this, a master data management control needs to be implemented, even one as basic as a uniqueness check. This control checks for exact duplicates of records and purges all but one. It can also send a notification about the other records to the data engineer or steward for investigation.
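
A minimal sketch of such a uniqueness check with pandas (the customer_id key and file names are assumptions):

    import pandas as pd

    df = pd.read_csv("customers.csv")  # assumed file name

    # Uniqueness check on an assumed business key
    dupes = df[df.duplicated(subset=["customer_id"], keep=False)]
    if not dupes.empty:
        print(f"Found {len(dupes)} rows sharing a customer_id; notifying the data steward.")
        dupes.to_csv("duplicates_for_review.csv", index=False)

    # Purge exact duplicates, keeping the first occurrence
    df = df.drop_duplicates(subset=["customer_id"], keep="first")
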
Cross System Inconsistencies :
Very common in large organizations that have grown through acquisitions and mergers. Multiple legacy source systems each have a slightly different view of the world: customer name, address or DOB all carry inconsistent or incorrect information.
How to fix this issue?
As above, a master data management solution must be implemented to ensure all the different pieces of information are matched into a single record. This matching doesn't need to be exact; it can be fuzzy, based on a threshold match percentage.
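
A minimal sketch of threshold-based fuzzy matching using Python's standard library (the names and the 0.85 threshold are illustrative assumptions):

    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        """Return a ratio in [0, 1] describing how similar two strings are."""
        return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

    # Records for (possibly) the same customer from two legacy systems
    crm_name = "Jonathan A. Smith"
    billing_name = "Jon Smith"

    THRESHOLD = 0.85  # assumed match-percentage threshold
    score = similarity(crm_name, billing_name)
    if score >= THRESHOLD:
        print(f"Match ({score:.0%}): merge into a single golden record.")
    else:
        print(f"No confident match ({score:.0%}): route to a data steward for review.")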

Orphaned Data :
This data quality issue relates to data inconsistency problems where data exists in one
system and not the other. A customer exists in table A, but their account doesn’t exist in
table B. It would be classed as an orphan customer. On the other hand, if an account
exists in table B but has no associated customer, it would be classed as an orphan
account. A data quality rule that checks for consistency each time data is ingested in
tables A and B will help spot the issue.
How to fix this issue?
To remediate this, the source system team needs to investigate the underlying cause of the inconsistency.
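
A minimal sketch of a consistency rule that spots orphans between two hypothetical tables (customers and accounts) with pandas:

    import pandas as pd

    customers = pd.DataFrame({"customer_id": [1, 2, 3]})
    accounts  = pd.DataFrame({"account_id": [10, 11], "customer_id": [1, 4]})

    # Orphan accounts: accounts whose customer_id has no matching customer
    orphan_accounts = accounts[~accounts["customer_id"].isin(customers["customer_id"])]

    # Orphan customers: customers with no account at all
    orphan_customers = customers[~customers["customer_id"].isin(accounts["customer_id"])]

    print("Orphan accounts:\n", orphan_accounts)
    print("Orphan customers:\n", orphan_customers)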

Irrelevant Data :
Nothing is more frustrating than capturing ALL the available information. Besides the regulatory requirement of data minimization, capturing all the available data is more expensive and less sustainable.
How to fix this issue?
To fix this, data capturing principles need to be agreed upon: each data attribute should serve an end goal; otherwise, it should not be captured.

Redundant Data :
Multiple teams across the organization capture the same data repeatedly. In an organization with both an online and a high-street presence, capturing the same information numerous times leads to the data being held in various systems, i.e. data redundancy. Not only is this bad for the company's bottom line, it also makes for a poor customer experience.
How to fix this issue?
To fix this, a single base system should be used from which all the organization's agents receive their data; yet again, a master data management implementation.
Old & Stale Data :
Storing data beyond a certain period adds no value to your data stack. It costs more money, confuses the engineers, impacts your ability to conduct analytics, and makes the data irrelevant.
Unclear Data Definitions :
Speak to Sam in Finance and Jess in Customer Services and you will find both interpreting the same data point differently; sound familiar? Clarity is a data quality dimension that is not discussed much; in the modern data stack, it is covered by the business glossary or data catalogue.
How to fix this issue?
Fixing this requires aligning data definitions each time a new metric or data point is created.

Dysfunctional History Management :


History maintenance is critical for any data warehousing implementation. Suppose data is received chronologically and history is maintained using Type 2 slowly changing dimensions. If the wrong rows are opened and closed, the latest valid record is misrepresented, which in turn breaks the history maintenance method and downstream processes.
How to fix this issue?
To fix this, ensure the correct date column is used to drive history maintenance.
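
As a hedged sketch of the fix, the pandas example below derives the open/close dates of a Type 2 dimension from the business effective date of each change rather than, say, a load date (all column names and the 2099-12-31 high-date are assumptions):

    import pandas as pd

    # Hypothetical SCD2 source rows for one customer, one row per address change
    changes = pd.DataFrame({
        "customer_id": [1, 1, 1],
        "address": ["Old Street", "New Street", "Newest Street"],
        "effective_date": pd.to_datetime(["2020-01-01", "2021-06-15", "2022-03-01"]),
    })

    # Open and close history rows using the business effective date column
    changes = changes.sort_values(["customer_id", "effective_date"])
    changes["valid_from"] = changes["effective_date"]
    changes["valid_to"] = (
        changes.groupby("customer_id")["effective_date"].shift(-1) - pd.Timedelta(days=1)
    ).fillna(pd.Timestamp("2099-12-31"))
    changes["is_current"] = changes["valid_to"] == pd.Timestamp("2099-12-31")

    print(changes)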

Data Received too late :


Data needs to arrive in time to support the critical decisions of that period. If your marketing campaigns run weekly, you must receive the required data by the set day of the week to trigger them; otherwise, late-arriving data can lead to poor campaign responses.
How to fix this issue?
To fix this, you must agree an appropriate time window with the engineering team and work backwards to ensure your source systems can adhere to those Service Level Agreements.

Data Exploration:

‘a way to get to know data before working with it’

This is the initial step of analyzing and visualizing data to uncover insights from the very start.

The quality of your input decides the quality of your output.

Data exploration, cleaning and preparation can take up to 70% of your time.

The first step in preparing data is exploration.

Q) What does data exploration provide us with?

Exploring data helps us characterize it in terms of size, quality and accuracy.

Q) Why is data exploration important?

A better understanding of the nature of the data makes it easier to navigate and use.

Visuals are always easier to interpret than text.

It helps identify anomalies and relationships that might otherwise go undetected.

It builds more understanding of the ground truth in less time.

Data Mining Vs Data Exploration

There are two primary methods for retrieving relevant data from large, unorganized
pools: data exploration, which is the manual method, and data mining, which is the
automatic method. Data mining, a field of study within machine learning, refers to the
process of extracting patterns from data with the application of algorithms. Data
exploration and visualization provide guidance in applying the most effective further
statistical and data mining treatment to the data.

• Data mining is also named knowledge discovery in databases, extraction, data/pattern analysis, and information harvesting. Data exploration is used interchangeably with web exploration, web scraping, web crawling, data retrieval, data harvesting, etc.
• Data mining studies are mostly on structured data. Data exploration usually retrieves data out of unstructured or poorly structured data sources.
• Data mining aims to make available data more useful for generating insights. Data exploration is to collect data and gather it into a place where it can be stored or further processed.
• Data mining is based on mathematical methods to reveal patterns or trends. Data exploration is based on programming languages or data exploration tools to crawl the data sources.
• The purpose of data mining is to find facts that are previously unknown or ignored. Data exploration deals with existing information.
• Data mining is much more complicated and requires large investments in staff training. Data exploration can be extremely easy and cost-effective when conducted with the right tool.

Data Exploration Vs Data Examining


Data examination and data exploration are effectively the same process. Data
examination assesses the internal consistency of the data as a whole for the purpose of
confirming the quality of the data for subsequent analysis. Internal consistency
reliability is an assessment based on the correlations between different items on the
same test. This assessment gauges the reliability of a test or survey that is designed to
measure the same construct for different items.

Steps of Data Exploration and Preparation


Data exploration typically follows three steps:
1) Understanding your variables
2) Detecting any outliers
3) Identifying correlations (examining patterns and relationships)
Variable Identification

First, identify Predictor (Input) and Target (output) variables. Next, identify the data
type and category of the variables.

Let’s understand this step more clearly by taking an example.

Example: Suppose we want to predict whether students will play cricket or not (refer to the data set below). Here you need to identify the predictor variables, the target variable, the data type of each variable and the category of each variable.

Below, the variables have been grouped into their different categories:
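
Since the original data set table is not reproduced here, the snippet below is a minimal, hypothetical pandas sketch of how one might inspect such a student data set; all column names and values are assumptions:

    import pandas as pd

    # Hypothetical student data set
    df = pd.DataFrame({
        "student_id": [1, 2, 3],
        "gender": ["M", "F", "M"],
        "height_cm": [170.2, 158.4, 181.0],
        "class": [9, 10, 9],
        "plays_cricket": ["Yes", "No", "Yes"],   # target variable
    })

    predictors = ["gender", "height_cm", "class"]   # input variables
    target = "plays_cricket"                        # output variable

    # Data type of each variable (object ~ categorical, float/int ~ continuous)
    print(df[predictors + [target]].dtypes)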


Univariate Analysis

At this stage, we explore variables one by one. The method used to perform univariate analysis depends on whether the variable is categorical or continuous. Let's look at these methods and statistical measures for categorical and continuous variables individually:

Continuous variables: In the case of continuous variables, we need to understand the central tendency and spread of the variable. These are measured using various statistical metrics and visualization methods.
Categorical variables: For categorical variables, we use a frequency table to understand the distribution of each category. We can also read it as the percentage of values under each category. It can be measured using two metrics, Count and Count%, against each category. A bar chart can be used for visualization.
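
A minimal sketch of both cases with pandas (the column names and values are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({
        "height_cm": [170.2, 158.4, 181.0, 165.5, 172.8],
        "gender": ["M", "F", "M", "F", "M"],
    })

    # Continuous variable: central tendency and spread
    print(df["height_cm"].describe())        # count, mean, std, min, quartiles, max

    # Categorical variable: frequency table with Count and Count%
    freq = df["gender"].value_counts()
    freq_pct = df["gender"].value_counts(normalize=True) * 100
    print(pd.DataFrame({"Count": freq, "Count%": freq_pct.round(1)}))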

Bi-variate Analysis

Bi-variate Analysis finds out the relationship between two variables. Here, we look
for association and disassociation between variables at a pre-defined significance level.
We can perform bi-variate analysis for any combination of categorical and continuous
variables. The combination can be: Categorical & Categorical, Categorical &
Continuous and Continuous & Continuous. Different methods are used to tackle these
combinations during analysis process.

Let’s understand the possible combinations in detail:

Continuous & Continuous: While doing bi-variate analysis between two continuous variables, we should look at a scatter plot. It is a nifty way to find out the relationship between two variables; the pattern of the scatter plot indicates the relationship, which can be linear or non-linear.
Categorical & Categorical: To find the relationship between two categorical variables, we can use the following methods:
• Two-way table: We can start analyzing the relationship by creating a two-way table of count and count%. The rows represent the categories of one variable and the columns represent the categories of the other variable. We show the count or count% of observations available in each combination of row and column categories.
• Stacked column chart: This method is more of a visual form of the two-way table.
Categorical & Continuous: While exploring the relationship between a categorical and a continuous variable, we can draw box plots for each level of the categorical variable. If the number of levels is small, this will not show statistical significance. To assess statistical significance we can perform a Z-test, T-test or ANOVA.

Example: Suppose we want to test the effect of five different exercises. For this, we recruit 20 men and assign one type of exercise to each group of 4 men (5 groups). Their weights are recorded after a few weeks. We need to find out whether the effect of these exercises on them is significantly different or not, which can be done by comparing the weights of the 5 groups of 4 men each.
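
A minimal sketch of this comparison as a one-way ANOVA with SciPy (the recorded weights below are made-up illustration data, not from the original example):

    from scipy import stats

    # Hypothetical recorded weights (kg) for the 5 exercise groups of 4 men each
    group_weights = [
        [78, 82, 80, 79],   # exercise 1
        [75, 74, 77, 76],   # exercise 2
        [85, 83, 86, 84],   # exercise 3
        [79, 81, 80, 78],   # exercise 4
        [72, 74, 73, 75],   # exercise 5
    ]

    f_stat, p_value = stats.f_oneway(*group_weights)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
    if p_value < 0.05:
        print("The effect of the exercises differs significantly between groups.")
    else:
        print("No significant difference between the exercise groups.")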

Missing Value Treatment :


Why is missing value treatment required?

Missing data in the training data set can reduce the power / fit of a model or can lead to a biased model, because we have not analyzed the behavior and relationships with other variables correctly. It can lead to wrong predictions or classifications.

Notice the missing values in the image shown above: in the left scenario, the missing values have not been treated, and the inference from that data set is that the chances of playing cricket are higher for males than for females. On the other hand, if you look at the second table, which shows the data after treatment of missing values (based on gender), we can see that females have a higher chance of playing cricket than males.

Why does my data have missing values?

We looked at the importance of treating missing values in a data set. Now, let's identify the reasons these missing values occur. They may occur at two stages:
1. Data Extraction: It is possible that there are problems with the extraction process. In such cases, we should double-check the data with the data guardians. Hashing procedures can also be used to make sure the extraction is correct. Errors at the data extraction stage are typically easy to find and can be corrected easily.
2. Data Collection: These errors occur at the time of data collection and are harder to correct. They can be categorized into four types:
• Missing completely at random: This is the case when the probability of a value being missing is the same for all observations. For example, respondents in a data collection process decide whether they will declare their earnings by tossing a fair coin: if heads comes up, the respondent declares his or her earnings, and vice versa. Here every observation has an equal chance of having a missing value.
• Missing at random: This is the case when a variable is missing at random and the missing ratio varies for different values / levels of other input variables. For example, when collecting data on age, females have a higher missing-value rate than males.
• Missing that depends on unobserved predictors: This is the case when the missing values are not random and are related to an unobserved input variable. For example, in a medical study, if a particular diagnostic test causes discomfort, there is a higher chance of dropping out of the study. This missingness is not random unless we have included “discomfort” as an input variable for all patients.
• Missing that depends on the missing value itself: This is the case when the probability of a missing value is directly correlated with the missing value itself. For example, people with higher or lower incomes are more likely not to respond about their earnings.

What are the methods to treat missing values?

1. Deletion: It is of two types: listwise deletion and pairwise deletion.

• In listwise deletion, we delete observations where any of the variables is missing. Simplicity is one of the major advantages of this method, but it reduces the power of the model because it reduces the sample size.
• In pairwise deletion, we perform the analysis with all cases in which the variables of interest are present. The advantage of this method is that it keeps as many cases as possible available for analysis; one of its disadvantages is that it uses different sample sizes for different variables.
Deletion methods are used when the nature of the missing data is “missing completely at random”; otherwise, non-random missing values can bias the model output.
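
A minimal sketch of both deletion variants with pandas (the columns and values are made up for illustration):

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        "age":    [25, np.nan, 31, 40],
        "income": [50_000, 62_000, np.nan, 71_000],
        "gender": ["M", "F", "F", np.nan],
    })

    # Listwise deletion: drop any row that has a missing value in any column
    listwise = df.dropna()

    # Pairwise deletion: each analysis uses only the rows where its own variables are
    # present, e.g. the age-income correlation ignores rows missing either column
    pairwise_corr = df[["age", "income"]].dropna().corr()

    print(listwise)
    print(pairwise_corr)
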
2. Mean / Mode / Median Imputation: Imputation is a method to fill in the missing values with estimated ones. The objective is to employ known relationships that can be identified in the valid values of the data set to assist in estimating the missing values. Mean / mode / median imputation is one of the most frequently used methods. It consists of replacing the missing data for a given attribute with the mean or median (quantitative attribute) or the mode (qualitative attribute) of all known values of that variable. It can be of two types:
• Generalized imputation: In this case, we calculate the mean or median of all non-missing values of the variable and then replace the missing values with it. In the table above, the variable “Manpower” has missing values, so we take the average of all non-missing values of “Manpower” (28.33) and replace the missing values with it.
• Similar case imputation: In this case, we calculate the average of the non-missing values for each gender individually, “Male” (29.75) and “Female” (25), and then replace the missing values based on gender: for “Male” we replace missing Manpower values with 29.75 and for “Female” with 25.
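
A minimal sketch of both imputation variants with pandas; the Gender and Manpower columns follow the example above, and the toy values are made up so that the group means match the 29.75 and 25 figures mentioned:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        "Gender":   ["Male", "Male", "Female", "Male", "Female"],
        "Manpower": [30, 29.5, 25, np.nan, np.nan],
    })

    # Generalized imputation: overall mean of the non-missing values
    df["Manpower_generalized"] = df["Manpower"].fillna(df["Manpower"].mean())

    # Similar case imputation: mean within each gender group
    df["Manpower_similar"] = df["Manpower"].fillna(
        df.groupby("Gender")["Manpower"].transform("mean")
    )

    print(df)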

Prediction Model: A prediction model is one of the more sophisticated methods for handling missing data. Here, we create a predictive model to estimate the values that will substitute for the missing data. In this case, we divide the data set into two sets: one with no missing values for the variable and another with missing values. The first set becomes the training data set of the model, while the second set, with missing values, becomes the test data set, and the variable with missing values is treated as the target variable. Next, we create a model to predict the target variable based on the other attributes of the training data set and populate the missing values of the test data set. We can use regression, ANOVA, logistic regression and various other modeling techniques for this. There are two drawbacks to this approach:
1. The model-estimated values are usually better behaved than the true values.

2. If there are no relationships between the other attributes in the data set and the attribute with missing values, the model will not be precise in estimating the missing values.
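
A minimal sketch of regression-based imputation with scikit-learn (the columns and values are illustrative assumptions):

    import pandas as pd
    import numpy as np
    from sklearn.linear_model import LinearRegression

    df = pd.DataFrame({
        "experience_years": [1, 3, 5, 7, 9, 4],
        "age":              [22, 26, 30, 34, 38, 28],
        "salary":           [30_000, 40_000, 52_000, 61_000, 72_000, np.nan],
    })

    # Split: rows with the target present (training) vs. rows where it is missing
    train = df[df["salary"].notna()]
    missing = df[df["salary"].isna()]

    features = ["experience_years", "age"]
    model = LinearRegression().fit(train[features], train["salary"])

    # Populate the missing values with the model's predictions
    df.loc[df["salary"].isna(), "salary"] = model.predict(missing[features])
    print(df)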

Techniques of Outlier Detection and Treatment


What is an Outlier?

Outlier is a term commonly used by analysts and data scientists; outliers need close attention, or else they can result in wildly wrong estimations. Simply speaking, an outlier is an observation that appears far away from, and diverges from, the overall pattern in a sample.

Let's take an example: we do customer profiling and find that the average annual income of our customers is $0.8 million. But there are two customers with annual incomes of $4 and $4.2 million. These two customers' annual incomes are much higher than the rest of the population's, so these two observations will be treated as outliers.

What are the types of Outliers?

Outliers can be of two types: univariate and multivariate. Above, we discussed an example of a univariate outlier; such outliers can be found when we look at the distribution of a single variable. Multivariate outliers are outliers in an n-dimensional space; in order to find them, you have to look at distributions in multiple dimensions.

Let us understand this with an example. Say we are studying the relationship between height and weight, and we have both the univariate and bivariate distributions for height and weight. Looking at the box plots alone, there is no outlier (above or below 1.5*IQR, the most common rule). Looking at the scatter plot, however, there are two values below and one above the average in a specific segment of weight and height.
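
A minimal sketch of the 1.5*IQR rule mentioned above, applied to a single variable with pandas (the income values are made up to mirror the earlier example):

    import pandas as pd

    income = pd.Series([0.6, 0.7, 0.8, 0.75, 0.85, 0.9, 4.0, 4.2])  # annual income, $M

    q1, q3 = income.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    outliers = income[(income < lower) | (income > upper)]
    print(f"Bounds: [{lower:.2f}, {upper:.2f}]")
    print("Outliers:", outliers.tolist())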

What causes Outliers?

Whenever we come across outliers, the ideal way to tackle them is to find out the reason
of having these outliers. The method to deal with them would then depend on the reason
of their occurrence. Causes of outliers can be classified in two broad categories:
1.Artificial (Error) / Non-natural

2.Natural.

Let’s understand various types of outliers in more detail:


• Data Entry Errors: Human errors, such as errors made during data collection, recording or entry, can cause outliers in data. For example, the annual income of a customer is $100,000 and the data entry operator accidentally adds an extra zero, making it $1,000,000, which is 10 times higher. Evidently, this will be an outlier value when compared with the rest of the population.
• Measurement Error: This is the most common source of outliers and is caused when the measurement instrument turns out to be faulty. For example, there are 10 weighing machines, 9 of which are correct and 1 of which is faulty. Weights measured by people on the faulty machine will be higher or lower than those of the rest of the group, and can lead to outliers.
• Experimental Error: Another cause of outliers is experimental error. For example, in a 100m sprint of 7 runners, one runner missed the ‘Go’ call and started late; hence his run time was longer than the other runners’, and his total run time can be an outlier.
• Intentional Outlier: This is commonly found in self-reported measures that involve sensitive data. For example, teens typically under-report the amount of alcohol they consume, and only a fraction of them report the actual value; the actual values then look like outliers because the rest of the teens are under-reporting their consumption.
• Data Processing Error: Whenever we perform data mining, we extract data from multiple sources. It is possible that manipulation or extraction errors lead to outliers in the dataset.
• Sampling Error: For instance, we have to measure the height of athletes, and by mistake we include a few basketball players in the sample. This inclusion is likely to cause outliers in the dataset.
• Natural Outlier: When an outlier is not artificial (due to error), it is a natural outlier. For instance, in my last assignment with a renowned insurance company, I noticed that the performance of the top 50 financial advisors was far higher than the rest of the population. Surprisingly, this was not due to any error, so whenever we performed data mining on advisor data, we treated this segment separately.

What is the impact of Outliers on a dataset?

Outliers can drastically change the results of data analysis and statistical modeling. There are numerous unfavourable impacts of outliers in a data set:
• They increase the error variance and reduce the power of statistical tests.
• If the outliers are non-randomly distributed, they can decrease normality.
• They can bias or influence estimates that may be of substantive interest.
• They can also violate the basic assumptions of regression, ANOVA and other statistical models.
Data Discovery

Data discovery is the business-user-oriented data science process of visually navigating data
and applying advanced analytics in order to detect patterns, gain insight, answer highly specific
business questions, and derive value from business data.
The purpose of data discovery is to reveal relevant data insights, communicate these insights to
business users in a way that is accessible to non-technical users, and ultimately improve
business processes. The data discovery process is accomplished with visual data discovery
tools and business intelligence tools that extract data from various sources and consolidate the
data into a single location, where users can have a “big picture” view of their data.
The most integral component of data discovery is the interactive visual analysis, which enables
users to interactively and clearly view their data from every angle, and drill down to hyper-
specific events. Visual data discovery is the subsequent step after data exploration has first
refined the data sets.
Data Discovery Categories
The three main categories of visual data discovery include data preparation, data visualization,
and advanced analytics and reporting:
• Data Preparation: This preprocessing step uses statistical techniques to merge
unstructured, raw data from disparate sources, then clean, transform, and eliminate noise
from the data so that quality is consistent and formatting is usable.
• Data Visualization: The human brain instantly recognizes patterns and relationships in
visualizations. Data visualization is a critical process for analyzing big data, in which
important insights and messages would otherwise be lost, and for displaying the results
of machine learning and predictive analytics.
• Advanced Analytics and Reporting: Descriptive statistics organizes, summarizes, and
breaks down data into a simple, intelligible report that is easy to understand and helps
businesses make data-driven decisions.

Types of Data Discovery


There are two main data discovery processes: Manual and Smart:
• Manual Data Discovery: Manual data discovery is the manual management of data by
a highly technical, human data steward. Before advancements in machine learning, data
specialists would manually map and prioritize data, monitor and categorize metadata,
document rules and standards, and conceptualize all available data using critical
thinking.
• Smart Data Discovery: Smart data discovery solutions provide an automated experience. With the advancement of machine learning, smart data discovery software has been developed that uses artificial intelligence to automate data preparation, conceptualization, integration, and the presentation of hidden patterns and insights through data discovery visualizations.
Data Governance
Data governance, as defined by the Data Management Association (DAMA), is the
exercise of authority and control (planning, monitoring, and enforcement) over the
management of data assets.

(1) Planning: Creating rules that datasets need to conform to (e.g. standardizing the format of date fields to YYYY-MM-DD)
(2) Monitoring: Measuring compliance with the rules defined in the Planning phase (e.g. tracking the percentage of ‘Date’ fields with missing values); a minimal check is sketched below
(3) Enforcement: Remediation actions in the event of data rule breaches (e.g. immediate corrective action if critical data fields such as social security number are found to contain major discrepancies)
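
A minimal, hedged sketch of such a monitoring rule in pandas; the date_of_birth column and the YYYY-MM-DD rule follow the examples above, while the 5% escalation threshold is an assumption:

    import pandas as pd

    df = pd.DataFrame({"date_of_birth": ["1990-04-01", "01/04/1990", None, "1985-12-31"]})

    # Planning rule: dates must follow YYYY-MM-DD
    valid_format = df["date_of_birth"].str.match(r"^\d{4}-\d{2}-\d{2}$", na=False)

    # Monitoring: measure compliance and missing values
    missing_pct = df["date_of_birth"].isna().mean() * 100
    invalid_pct = (~valid_format & df["date_of_birth"].notna()).mean() * 100
    print(f"Missing: {missing_pct:.1f}%, wrong format: {invalid_pct:.1f}%")

    # Enforcement (illustrative): escalate if breaches exceed the assumed threshold
    if missing_pct + invalid_pct > 5:
        print("Rule breach: raise a remediation ticket with the data steward.")
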
While there are different data governance frameworks available, they all revolve around
the same key elements of people (i.e. data stewards and stakeholders overseeing data
governance programs), processes (i.e. data policies and procedures that make up the
rules datasets need to conform to), the datasets in question, and tools (i.e. technology
and infrastructure that allow the monitoring and enforcement of policies).
Why is Data Governance Important?
For data to be useful as a strategic asset, it must be kept compliant, clean, and accessible
through formal procedures and accountability. The purpose of data governance is to
establish the methods, processes, and responsibilities that enable organizations to ensure
the quality, compliance, and usability of their data.
Data quality is crucial for the democratization of data science because organization-wide
trust in data is of utmost importance. Just as consumers would not purchase or use
products from retailers they do not trust, business stakeholders and data practitioners
will be hesitant in using data (or insights gleaned from data) for decision-making if they
lack trust in it.
Ultimately, the most successful businesses are the ones who can effectively manage data
as a trusted asset and are able to marry the best data science capabilities with the best
data quality.

What does Good Data Quality look like?


Data governance programs ensure the scalability and maintenance of data quality but
what exactly does good data quality look like? Atlan nicely summarized it with the
following seven characteristics:
1. Accurate
2. Available and accessible
3. Complete
4. Relevant and reliable
5. Timely
6. Gives the right degree of granularity
7. Helps you in decision-making

What are the benefits of data governance?


When properly implemented, a solid data governance framework can help companies
reap numerous significant benefits:
• Common Data Language - A common verbiage around metrics and datasets
creates a common data language and consistent terminology that is easily used
and understood across the organization.
• Increased confidence in data quality - Improved data quality from good data
governance promotes trust and confidence in the data and its corresponding
documentation, thereby leading to greater utilization of the data while making
decisions.
• Improved reusability of data - Data governance creates standardization and
mapping of enterprise-wide data. This greatly enhances efficiency through the
reuse of data across different business units and varying use cases within the
enterprise.
• Better data management mechanisms - Central control mechanisms establish clear
policies and rules that guide best practices and codes of conduct around data
usage. This provides a bird’s eye view for the consistent management of key areas
like compliance and security, while also minimizing costs in overall data
management.
• Single Source of Truth - Data governance facilitates the integration of different
data sources such that data is no longer siloed. This establishes a consistent and
complete single source of truth of the customer profiles (or profiles of other key
entities) that business stakeholders can agree upon to make better cooperative
decisions.
• Increased Agility from Data Democratization - The high quality of data created by
good governance makes it readily available even for non-technical units to utilize
for their business goals. This democratization of data will then enable business
units to be more agile through self-service analytics.
All these benefits will in turn enable companies to reduce costs through improved efficiency, minimize risks through prudent use of data, stay compliant with data regulations, and maximize business performance through consistent data-driven decision making.
