Analytics
Data Preparation
Andy Oh
03 May 2021
Learning Objectives:
1 Data Preparation
Data preparation is the process of taking raw data and getting it ready for ingestion into an analytics platform. To achieve the final stage of preparation, the data must
be cleansed, formatted, and transformed into something digestible by analytics tools. These
are broad strokes, and the actual process can include a wide range of steps, such as
consolidating/separating fields and columns, changing formats, deleting unnecessary or junk
data, and making corrections to data.
Consider how a chef uses the technique “mise en place,” which translated from French means
“everything in place.” For the process of cooking, this means getting all ingredients measured
and ready, mixing bowls and tools out and available, and vegetables all cut up. Only when that
work has been completed is the chef ready to start with step 1 of the recipe without having to
pause and look for any ingredient or tool.
Data preparation is the equivalent of mise en place, but for analytics projects.
source: https://blogs.oracle.com/analytics/what-is-data-preparation-and-why-is-it-important
Importance of Data Preparation
Data preparation is necessary because real-world data is rarely clean or well formatted. Most real-world data contains:
• Inaccurate data (missing data): Missing data has many causes, such as data not being collected continuously, mistakes during data entry, technical problems with biometrics, and more, all of which call for proper data preparation.
• Noisy data (erroneous data and outliers): Noisy data can arise from a technical fault in the device that gathers the data, human mistakes during data entry, and more.
• Inconsistent data: Inconsistencies arise from causes such as duplication within the data and mistakes in manually entered codes or names (i.e., violations of data constraints), and they likewise necessitate data preparation and analysis.
Data preparation broadly involves four tasks:
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
2 Missing data
Missing data (or missing values) is defined as the data value that is not stored for a variable in
the observation of interest. The problem of missing data is relatively common in almost all
research and can have a significant effect on the conclusions that can be drawn from the data.
Missing data present various problems:
• First, the absence of data reduces statistical power, which refers to the probability that
the test will reject the null hypothesis when it is false.
• Second, the lost data can cause bias in the estimation of parameters.
• Third, it can reduce the representativeness of the samples.
• Fourth, it may complicate the analysis of the study. Each of these distortions may
threaten the validity of the trials and can lead to invalid conclusions.
source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3668100/
Multiple approaches can be used in treating missing data. Let’s look at each of them:
1. Removing the training example: You can ignore a training example if its output label is missing (in a classification problem). This is usually discouraged, as it leads to loss of data: you are also removing attribute values that could add value to the data set.
2. Filling in missing value manually: This approach is time consuming, and not
recommended for huge data sets.
3. Using a standard value to replace the missing value: The missing value can be
replaced by a global constant such as ‘N/A’ or ‘Unknown’. This is a simple approach,
but not foolproof.
4. Using central tendency (mean, median, mode) for the attribute to replace the missing value: Based on the data distribution, the mean (for a normal distribution) or the median (for a non-normal distribution) can be used to fill in the missing value.
5. Using central tendency (mean, median, mode) for attribute belonging to same
class to replace the missing value: This is the same as method 4, except that the
measures of central tendency are specific to each class.
6. Using the most probable value to fill in the missing value: Using algorithms like
regression and decision tree, the missing values can be predicted and replaced.
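As a quick illustration of methods 3 to 5 outside Azure, here is a minimal pandas sketch; the frame and the column names (is_fraud, monthly_income, job_title) are invented for illustration and are not the actual fraud dataset used later in this topic.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for a fraud dataset
df = pd.DataFrame({
    "is_fraud": [0, 0, 0, 1, 1],
    "monthly_income": [3000.0, np.nan, 5000.0, 12000.0, np.nan],
    "job_title": ["clerk", None, "engineer", "director", "director"],
})

# Method 3: replace the missing value with a standard constant
df["job_title"] = df["job_title"].fillna("Unknown")

# Method 4: replace with a measure of central tendency
# (median here, assuming a non-normal income distribution)
median_fill = df["monthly_income"].fillna(df["monthly_income"].median())

# Method 5: central tendency computed within each class label
class_mean_fill = df["monthly_income"].fillna(
    df.groupby("is_fraud")["monthly_income"].transform("mean")
)

print(median_fill.tolist())
print(class_mean_fill.tolist())
```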
Let’s explore and examine how Azure can handle missing data. This experiment will demonstrate how to replace missing data.
3 Noisy Data
3.1 Outliers
An outlier is an unusually large or small observation. Outliers can have a disproportionate effect
on statistical results, such as the mean, which can result in misleading interpretations.
Machine learning algorithms are sensitive to the range and distribution of attribute values. Data
outliers can spoil and mislead the training process, resulting in longer training times, less accurate models, and ultimately poorer results.
source: https://datascience.foundation/sciencewhitepaper/knowing-all-about-outliers-in-machine-learning
source: https://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/
The most common causes of outliers in a data set:
• Data Entry Errors: Human errors such as errors caused during data collection,
recording, or entry can cause outliers in data.
• Measurement Error (instrument errors): It is the most common source of outliers. This
is caused when the measurement instrument used turns out to be faulty.
• Experimental errors: Data extraction or experiment planning/executing errors
• Intentional: Dummy outliers made to test detection methods
• Data processing errors: Data manipulation or data set unintended mutations
• Sampling errors: Extracting or mixing data from wrong or various sources
• Natural Outlier (not an error; novelties in data): When an outlier is not artificial (due to error), it is a natural outlier. Most outliers in real-world data belong to this category.
source: https://medium.com/analytics-vidhya/how-to-remove-outliers-for-machine-learning-24620c4657e8
There are several approaches to treating outliers; we will cover only two methods in this course:
1. Examine each column with a numerical feature using a boxplot. Using the fraud dataset, observe an outlier of $200,000 in the monthly income column.
2. Connect a Summarize Data instance to get descriptive statistics for each column. From the fraud dataset, observe that the $200,000 is picked up as the maximum value in the monthly income column.
3. Connect and set up a Clip Values instance in Azure to clip the outlier values with a constant value, such as the value of the 3rd quartile, or based on a percentile.
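The effect of Clip Values can be sketched outside Azure with pandas; a minimal example on invented income values, with the $200,000 outlier as the extreme point:

```python
import pandas as pd

# Invented monthly incomes containing one extreme outlier (200,000)
income = pd.Series([3_200, 4_100, 4_800, 5_200, 6_100, 7_400, 200_000])

# Clip with a constant value: here the 3rd quartile (75th percentile)
q3 = income.quantile(0.75)
clipped_at_q3 = income.clip(upper=q3)

# Or clip based on a percentile, e.g. the 99th
p99 = income.quantile(0.99)
clipped_at_p99 = income.clip(upper=p99)

print(f"Q3 = {q3}, P99 = {p99:.0f}")
print(clipped_at_q3.max(), clipped_at_p99.max())
```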
4 Feature scaling
Feature scaling is a technique often applied as part of data preparation for machine learning.
The goal of feature scaling is to change the values of numeric columns in the dataset to a
common scale, without distorting differences in the ranges of values.
Feature scaling is required in some machine learning algorithms:
Gradient Descent Based Algorithms: Machine learning algorithms, such as neural networks, that use gradient descent as an optimization technique require data to be scaled.
Distance-Based Algorithms: Distance algorithms like KNN, K-means, and SVM are most
affected by the range of features. This is because behind the scenes they are using distances
between data points to determine their similarity.
Tree-Based Algorithms: Tree-based algorithms, on the other hand, are fairly insensitive to the
scale of the features. Think about it, a decision tree is only splitting a node based on a single
feature. The decision tree splits a node on a feature that increases the homogeneity of the node.
This split on a feature is not influenced by other features. So, there is virtually no effect of the
remaining features on the split. This is what makes them invariant to the scale of the features!
source: https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/
In the numerical columns of the fraud dataset, the units of measurement for each column belong to a different scale.
4.1 Normalisation
Normalisation is a scaling technique in which values are shifted and rescaled so that they end
up ranging between 0 and 1. It is also known as Min-Max scaling.
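For a feature X, each value x is rescaled with the standard min-max formula:

x' = (x - min(X)) / (max(X) - min(X))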
4.2 Standardisation
Standardisation is another scaling technique, in which the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resulting distribution has a unit standard deviation. The standardised value is also known as the z-score.
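For a feature with mean μ and standard deviation σ, each value x is standardised with the standard z-score formula:

z = (x - μ) / σ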
Observe the difference in the scaled results between using ZScore and MinMax as the transformation method.
Standardisation with ZScore
Standardisation with MinMax
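The same two transformation methods can be reproduced outside Azure with scikit-learn; a minimal sketch on invented values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One numeric column on an arbitrary scale
X = np.array([[3000.0], [4500.0], [5000.0], [12000.0]])

# MinMax: rescales values into the [0, 1] range
print(MinMaxScaler().fit_transform(X).ravel())

# ZScore: centers on zero with unit standard deviation
print(StandardScaler().fit_transform(X).ravel())
```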
5.1 Skewness
Skewness refers to the distortion or asymmetry in a symmetrical bell curve. More often than not, collected data are not normally distributed.
The skewness for a normal distribution is zero, and any symmetric data should have a
skewness near zero. Negative values for the skewness indicate data that are skewed left and
positive values for the skewness indicate data that are skewed right.
source: www.fromthegenesis.com
5.2 Kurtosis
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal
distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets
with low kurtosis tend to have light tails, or lack of outliers. A uniform distribution would be the
extreme case.
The standard normal distribution has a kurtosis of zero (this uses the “excess kurtosis” convention, which subtracts 3 so that a normal distribution scores zero). Positive kurtosis indicates a “heavy-tailed” distribution and negative kurtosis indicates a “light-tailed” distribution.
source: https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
source: https://www.excelr.com/skewness-and-kurtosis
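Both statistics are easy to inspect with pandas; a minimal sketch on simulated data (note that pandas reports excess kurtosis, so a normal distribution scores near zero):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Roughly normal data versus right-skewed, heavy-tailed data
normal = pd.Series(rng.normal(size=10_000))
skewed = pd.Series(rng.lognormal(size=10_000))

print(normal.skew(), normal.kurtosis())  # both near zero
print(skewed.skew(), skewed.kurtosis())  # both clearly positive
```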
6.1 SMOTE
SMOTE (Synthetic Minority Over-sampling Technique) is specifically designed for learning from
imbalanced datasets.
SMOTE is one of the most adopted approaches to deal with class imbalance due to its
simplicity and effectiveness. It is a combination of oversampling and undersampling.
• The majority class is first under-sampled by randomly removing samples from the
majority class population until the minority class becomes some specified percentage of
the majority class.
• Oversampling is then used to build up the minority class, not by simply replicating data
instances of the minority class but, instead, it constructs new minority class data
instances via an algorithm.
The module returns a dataset that contains the original samples, plus an additional number of
synthetic minority samples, depending on a percentage that you specify. As previously
described, these new instances are created by taking samples of the feature space for each
target class and its nearest neighbors, and generating new examples that combine features of
the target case with features of its neighbors.
Source: Predictive Analytics with Microsoft Azure Machine Learning 2015 by Roger Barga,
Valentine Fontama, and Wee Hyong Tok. Springer Science+Business Media New York
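Outside Azure, the same technique is available in the imbalanced-learn library; a minimal sketch on synthetic data (the 9:1 class ratio and parameters are illustrative, not taken from the course experiment):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic binary problem with a roughly 9:1 class imbalance
X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1],
                           random_state=42)
print("before:", Counter(y))

# SMOTE creates new minority examples by interpolating between each
# minority sample and its nearest minority-class neighbours
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))
```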
To compare the performance of a model without addressing class imbalance against one that uses the SMOTE method, set up the instances in a new experiment in Azure as shown below. Follow the settings in the video and run the experiment to view the output. The experiment in the video had been run beforehand.
Model Performance with SMOTE
The model performance when evaluated on the test data (20%) shows that the SMOTE method has improved the recall rate over the model without class-imbalance handling. This is because the model with SMOTE has more synthetic training examples of the >50k class, so the model is less biased towards better classification of the majority class.