
Advanced Applied Business Analytics
Data Preparation
Andy Oh
03 May 2021
Learning Objectives:

1. Understand the importance of data preparation before building an analytical model.
2. Differentiate between data preprocessing and data wrangling.
3. Perform basic data preprocessing, including handling missing data, treating outliers, and feature scaling.
4. Perform basic data wrangling, including applying logarithm transformation and feature extraction.

1 Data Preparation
Data preparation is the process of taking raw data and getting it ready for ingestion in an
analytics platform. To achieve the final stage of preparation, the data must
be cleansed, formatted, and transformed into something digestible by analytics tools. These
are broad strokes, and the actual process can include a wide range of steps, such as
consolidating/separating fields and columns, changing formats, deleting unnecessary or junk
data, and making corrections to data.
Consider how a chef uses the technique “mise en place,” which translated from French means
“everything in place.” For the process of cooking, this means getting all ingredients measured
and ready, mixing bowls and tools out and available, and vegetables all cut up. Only when that
work has been completed is the chef ready to start with step 1 of the recipe without having to
pause and look for any ingredient or tool.
Data preparation is the equivalent of mise en place, but for analytics projects.
source: https://blogs.oracle.com/analytics/what-is-data-preparation-and-why-is-it-important
Importance of Data Preparation
Data preparation is necessary because real-world data is rarely in a usable format. Most real-world data contains:

• Inaccurate data (missing data): There are many reasons for missing data, such as data that is not continuously collected, mistakes in data entry, technical problems with biometrics, and more, all of which require proper data preparation.
• Noisy data (erroneous data and outliers): Noisy data can arise from a technical problem with the device that gathers the data, human mistakes during data entry, and more.
• Inconsistent data: Inconsistencies arise from duplication within the data, human data entry errors, and mistakes in codes or names (i.e., violations of data constraints), all of which necessitate data preparation before analysis.

1.1 Data Preprocessing


Data preprocessing is a technique used to convert raw data into a clean data set. Whenever data is gathered from different sources, it is collected in a raw format that is not feasible for analysis.
Therefore, certain steps are executed to convert the data into a clean data set before the iterative analysis begins. This set of steps is known as data preprocessing. It includes:

• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction

Importance of Data Preprocessing


To achieve better results from a machine learning model, the data must be in a proper format. Some machine learning models need information in a specific format; for example, the Random Forest algorithm does not support null values, so null values have to be managed in the original raw data set before the algorithm can be executed.

1.2 Data Wrangling


Data wrangling is a technique executed when building an interactive model. In other words, it is used to convert raw data into a format that is convenient for data consumption.
This technique is also known as data munging. It also follows certain steps: after extracting the data from different data sources, the data is sorted using certain algorithms, decomposed into a different structured format, and finally stored in another database.
Importance of Data Wrangling
Data wrangling is an important aspect of implementing a model, so the data is converted into a feasible format before any model is applied to it. By filtering, grouping, and selecting appropriate data, the accuracy and performance of the model can be increased.
source: https://www.xenonstack.com/blog/data-preparation/
There are 6 basic steps one needs to follow:
1. Data Discovery: This is an all-encompassing term that describes understanding what
your data is all about. In this first step, you get familiar with your data
2. Data Structuring: When you collect raw data, it initially is in all shapes and sizes, and
has no definite structure. Such data needs to be restructured to suit the analytical model
that your enterprise plans to deploy
3. Data Cleaning: Raw data comes with some errors that need to be fixed before data is
passed on to the next stage. Cleaning involves the tackling of outliers, making
corrections or deleting bad data completely
4. Data Enriching: By this stage, you have become familiar with the data at hand. Now is the time to ask yourself: do you need to embellish the raw data? Do you want to augment it with other data?
5. Data Validating: This activity surfaces data quality issues, which have to be addressed with the necessary transformations. Validation rules require repetitive programming steps to check the authenticity and quality of your data.
6. Data Publishing: Once all the above steps are completed, the final output of your data wrangling efforts is pushed downstream for your analytics needs.
Data wrangling is a core iterative process that throws up the cleanest, most usable data
possible before you start your actual analysis.
source: https://expressanalytics.com/blog/what-is-data-wrangling-what-are-the-steps-in-data-wrangling/

2 Missing data
Missing data (or missing values) is defined as the data value that is not stored for a variable in
the observation of interest. The problem of missing data is relatively common in almost all
research and can have a significant effect on the conclusions that can be drawn from the data.
Missing data present various problems:

• First, the absence of data reduces statistical power, which refers to the probability that
the test will reject the null hypothesis when it is false.
• Second, the lost data can cause bias in the estimation of parameters.
• Third, it can reduce the representativeness of the samples.
• Fourth, it may complicate the analysis of the study. Each of these distortions may
threaten the validity of the trials and can lead to invalid conclusions.

source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3668100/
Multiple approaches can be used in treating missing data. Let's look at each of them (a short code sketch follows the list):
1. Removing the training example: You can ignore a training example if its output label is missing (in a classification problem). This is usually discouraged as it leads to loss of data, since you also remove attribute values that could add value to the data set.
2. Filling in missing value manually: This approach is time consuming, and not
recommended for huge data sets.
3. Using a standard value to replace the missing value: The missing value can be
replaced by a global constant such as ‘N/A’ or ‘Unknown’. This is a simple approach,
but not foolproof.
4. Using central tendency (mean, median, mode) for attribute to replace the missing
value: Based on data distribution, mean (in case of normal distribution) or median (for
non-normal distribution) can be used to fill in for the missing value.
5. Using central tendency (mean, median, mode) for attribute belonging to same
class to replace the missing value: This is the same as method 4, except that the
measures of central tendency are specific to each class.
6. Using the most probable value to fill in the missing value: Using algorithms like
regression and decision tree, the missing values can be predicted and replaced.
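The constant and central-tendency replacements above can be sketched outside Azure with pandas and scikit-learn. A minimal sketch, assuming a small data frame with hypothetical MonthlyIncome and Gender columns:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical fraud-style data with missing values
df = pd.DataFrame({
    "MonthlyIncome": [4200, 5100, None, 3800, None],
    "Gender": ["M", None, "F", "F", "M"],
})

# Method 3: replace with a standard constant value
df["Gender_filled"] = df["Gender"].fillna("Unknown")

# Method 4: replace with a measure of central tendency
df["Income_mean"] = df["MonthlyIncome"].fillna(df["MonthlyIncome"].mean())
df["Income_median"] = df["MonthlyIncome"].fillna(df["MonthlyIncome"].median())

# The same idea with scikit-learn's SimpleImputer (median strategy)
imputer = SimpleImputer(strategy="median")
df[["MonthlyIncome"]] = imputer.fit_transform(df[["MonthlyIncome"]])
print(df)
```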

2.1 Missing Data: Experiment 1


2.1.1 Fraud dataset : fraud_data1.csv
A fictitious dataset fraud_data1.csv with Fraud as the target variable will be used to explore data preparation techniques using Azure Machine Learning.

Let’s explore and examine how Azure can handle missing data. This experiment will demonstrate
how to replace missing data using:

1. Replace with mode


2. Replace with median

• Use fraud_data1.csv dataset and create a new experiment.


• Set up the necessary instances in the workspace as shown in the video.
• Follow the settings and run the experiment to view the output.

2.2 Missing Data: Experiment 2


This experiment will demonstrate how to replace missing data using:
1. Replace using MICE
2. Replace using Probabilistic PCA

• Save and rename the first experiment as another experiment.
• Edit the settings and run the experiment to view the output as shown in the video.

To learn more about the options of the Clean Missing Data instance, refer to the documentation:
source: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/clean-missing-data
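Azure's "Replace using MICE" option has a close analogue in scikit-learn's experimental IterativeImputer, which models each feature with missing values from the other features. A minimal sketch, assuming a purely numeric feature matrix (the values are made up for illustration):

```python
import numpy as np
# IterativeImputer is still experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical numeric features (age, weight, height, monthly income) with gaps
X = np.array([
    [25, 60.0, 170.0, 4200.0],
    [34, np.nan, 165.0, 5100.0],
    [41, 82.0, np.nan, np.nan],
    [29, 70.0, 180.0, 3900.0],
])

# Each feature with missing values is regressed on the other features,
# iterating until the imputed values stabilise (MICE-style)
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```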

2.3 Activity: Imputing Missing Data


Perform cleaning of missing data on fraud_data2.csv with the Clean Missing Data instance.
Compare the results between the two cleaning modes and comment on your observations.
Cleaning Mode 1: Replace using MICE. Cleaning Mode 2: Replace using Probabilistic PCA.
fraud_data2.csv is the same data as fraud_data1.csv with a few more data points (row-wise) removed at row 12.

3 Noisy Data
3.1 Outliers
An outlier is an unusually large or small observation. Outliers can have a disproportionate effect
on statistical results, such as the mean, which can result in misleading interpretations.
Machine learning algorithms are sensitive to the range and distribution of attribute values. Data
outliers can spoil and mislead the training process resulting in longer training times, less
accurate models and ultimately poorer results.
source: https://datascience.foundation/sciencewhitepaper/knowing-all-about-outliers-in-machine-learning
source: https://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/
Most common causes of outliers on a data set:

• Data Entry Errors: Human errors such as errors caused during data collection,
recording, or entry can cause outliers in data.
• Measurement Error (instrument errors): It is the most common source of outliers. This
is caused when the measurement instrument used turns out to be faulty.
• Experimental errors: Data extraction or experiment planning/executing errors
• Intentional: Dummy outliers made to test detection methods
• Data processing errors: Data manipulation or data set unintended mutations
• Sampling errors: Extracting or mixing data from wrong or various sources
• Natural Outlier (not an error; novelties in data): When an outlier is not artificial (due to error), it is a natural outlier. Most real-world data belongs to this category.

source: https://medium.com/analytics-vidhya/how-to-remove-outliers-for-machine-learning-24620c4657e8
There are several approaches to treat outliers; we will only cover two methods in this course:

3.1.1 Delete rows of data with outliers


Some considerations to delete data include:
• Size of the dataset: Deleting data with outliers is generally not preferred when the dataset is small (fewer rows).
• Percentage of outliers in the dataset: Deleting a high percentage of outliers may mean deleting important information.

3.1.2 Clip Outlier Values


Clipping outlier values is generally preferred over deleting data as it retains the important information of each record.
Steps to clip the values of outliers:

1. Examine each column with a numerical feature using a boxplot. Using the fraud dataset, observe an outlier of $200,000 in monthly income.
2. Connect the Summarize Data instance for descriptive statistics of each column. From the fraud dataset, observe that the $200,000 is picked up as the maximum value in the monthly income column.
3. Connect and set up the Clip Values instance in Azure to clip the outlier values with a constant value, such as the value of the 3rd quartile, or based on a percentile. (A pandas sketch of the same idea follows these steps.)
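For reference, the same clipping logic can be sketched with pandas; this assumes a hypothetical MonthlyIncome column and clips at the 3rd quartile (75th percentile), mirroring the Clip Values settings described above:

```python
import pandas as pd

# Hypothetical monthly incomes, including one extreme value
df = pd.DataFrame({"MonthlyIncome": [3800, 4200, 5100, 6000, 17000, 200000]})

# Examine the spread before clipping
print(df["MonthlyIncome"].describe())

# Clip everything above the 75th percentile (3rd quartile) to that value
upper = df["MonthlyIncome"].quantile(0.75)
df["MonthlyIncome_clipped"] = df["MonthlyIncome"].clip(upper=upper)
print(df)
```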

3.2 Azure demonstration: Experiment 3


• Save and rename the first experiment in Missing Data.
• Follow the steps in the video to check for outliers and clip their values.
• The video will show how to clip an outlier in the monthly income column ($200,000) to a constant value using the 3rd quartile of the dataset ($17,000).

After clipping the outlier, its value is changed to $17,000 instead of $200,000, as shown.

4 Feature scaling
Feature scaling is a technique often applied as part of data preparation for machine learning.
The goal of feature scaling is to change the values of numeric columns in the dataset to a
common scale, without distorting differences in the ranges of values.
Feature scaling is required in some machine learning algorithms:
Gradient Descent Based Algorithms: Machine learning algorithms such as neural network that
use gradient descent as an optimization technique require data to be scaled.
Distance-Based Algorithms: Distance algorithms like KNN, K-means, and SVM are most
affected by the range of features. This is because behind the scenes they are using distances
between data points to determine their similarity.
Tree-Based Algorithms: Tree-based algorithms, on the other hand, are fairly insensitive to the
scale of the features. Think about it, a decision tree is only splitting a node based on a single
feature. The decision tree splits a node on a feature that increases the homogeneity of the node.
This split on a feature is not influenced by other features. So, there is virtually no effect of the
remaining features on the split. This is what makes them invariant to the scale of the features!
source: https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-
normalization-standardization/
In the numerical columns of the fraud dataset, the units of measurement for each column belong to a different scale:

• Age: units in years
• Weight: units in kilograms
• Height: units in centimetres
• Monthly Salary: units in dollars
Two common techniques for feature scaling are:

4.1 Normalisation
Normalisation is a scaling technique in which values are shifted and rescaled so that they end
up ranging between 0 and 1. It is also known as Min-Max scaling.

4.2 Standardisation
Standardisation is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation. The standardised value is also known as the z-score.
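Both scaling techniques can be sketched with scikit-learn's MinMaxScaler and StandardScaler; the feature matrix below uses made-up values for the age, weight, height, and monthly salary columns:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical columns: age (years), weight (kg), height (cm), monthly salary ($)
X = np.array([
    [25, 60, 170, 4200],
    [34, 75, 165, 5100],
    [41, 82, 178, 17000],
    [29, 70, 180, 3900],
], dtype=float)

# Normalisation (min-max): rescale each column to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Standardisation (z-score): centre each column at 0 with unit standard deviation
X_zscore = StandardScaler().fit_transform(X)

print(X_minmax.round(2))
print(X_zscore.round(2))
```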

4.3 Azure Demonstration: Experiment 4


• Save and rename the third experiment in clipping outliers.
• Follow the steps in the video to scale numeric features in the fraud dataset.
• In this experiment, we use two Normalize Data instances to compare the results of
standardisation (using z-score) and normalisation (using min max).

Observe the difference in the scaled results between using ZScore and MinMax as the transformation method.
Standardisation with ZScore
Normalisation with MinMax

5 Skewed data distribution


Numerical data that are continuous in nature, such as income and property prices, often display skewness in their distribution due to extreme values.
When the distribution of data is skewed, the area under the curve is no longer symmetrical. This means that the probability of falling on either side is not equal. For example, a positively skewed distribution has a larger number of points on the left side of the distribution than on the right side. Predictive models built using positively skewed data will not be able to capture data on the right side of the curve as well as the left side.

5.1 Skewness
Skewness refers to the distortion or asymmetry in a symmetrical bell curve. More often than not,
data collected are usually not normally distributed.
The skewness for a normal distribution is zero, and any symmetric data should have a
skewness near zero. Negative values for the skewness indicate data that are skewed left and
positive values for the skewness indicate data that are skewed right.
source : www.fromthegenesis.com

5.2 Kurtosis
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or a lack of outliers. A uniform distribution would be an extreme case of light tails.
The standard normal distribution has an (excess) kurtosis of zero. Positive kurtosis indicates a “heavy-tailed” distribution and negative kurtosis indicates a “light-tailed” distribution.
source:https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
source:https://www.excelr.com/skewness-and-kurtosis
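Both statistics can be computed directly with pandas; a minimal sketch on a hypothetical, right-skewed price column (log-normal draws):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices generated from a log-normal distribution
rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=13, sigma=0.5, size=1000))

# Positive skewness -> right-skewed; positive (excess) kurtosis -> heavy tails
print("Skewness:", round(price.skew(), 2))
print("Kurtosis:", round(price.kurt(), 2))  # pandas reports excess kurtosis (normal = 0)
```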

5.3 Power Transformation


The use of power transformation on skewed data can lead to an improvement of the predictive
power in regression models.
Power transformation on variables with skewness can help move the data towards a more normal distribution, improving the model’s ability to estimate both the left and right sides of the curve.
Nicholas J. Cox, who wrote the article ‘Transformation: An Introduction’, provides four reasons for transforming data:
• Reduce skewness : A transformation may be used to reduce skewness. A distribution
that is symmetric or nearly so is often easier to handle and interpret than a skewed
distribution. More specifically, a normal or Gaussian distribution is often regarded as
ideal as it is assumed by many statistical methods.
• Homoscedasticity : A transformation may be used to produce approximately equal
spreads, despite marked variations in level, which again makes data easier to handle
and interpret. Each data set or subset having about the same spread or variability is a
condition called homoscedasticity: its opposite is called heteroscedasticity.
• Linear relationship : When looking at relationships between variables, it is often far
easier to think about patterns that are approximately linear than about patterns that are
highly curved. This is vitally important when using linear regression, which amounts to
fitting such patterns to data.
• Additive relationship : Relationships are often easier to analyse when additive rather
than (say) multiplicative. So y = a + bx in which two terms a and bx are added is easier
to deal with than y = ax^b, in which two terms a and x^b are multiplied.
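One way to sketch a power transformation in code is scikit-learn's PowerTransformer (here a Yeo-Johnson transform; this is an illustrative choice, not necessarily the operation used in the Azure demonstration later):

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed prices (log-normal draws)
rng = np.random.default_rng(0)
price = rng.lognormal(mean=13, sigma=0.5, size=1000).reshape(-1, 1)

# Yeo-Johnson power transform; standardize=True also centres and scales the result
pt = PowerTransformer(method="yeo-johnson", standardize=True)
price_t = pt.fit_transform(price)

print("Skewness before:", round(skew(price.ravel()), 2))
print("Skewness after: ", round(skew(price_t.ravel()), 2))
```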

5.3.1 Logarithm transformation


Applying a logarithm can help to reduce positively skewed data. Generally, the more severe the right skewness, the higher the logarithm base needed. Common logarithm bases include (see the sketch after this list):

• Natural log (base e, Euler’s number)
• Log base 2
• Log base 10
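A minimal numpy/pandas sketch of the three bases applied to a hypothetical price column:

```python
import numpy as np
import pandas as pd

# Hypothetical, right-skewed house prices
df = pd.DataFrame({"price": [180000, 250000, 320000, 450000, 2500000]})

# Logarithm transforms with different bases (values must be positive)
df["price_ln"] = np.log(df["price"])       # natural log, base e
df["price_log2"] = np.log2(df["price"])    # base 2
df["price_log10"] = np.log10(df["price"])  # base 10

print(df.round(2))
```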

5.3.2 Azure Demonstration: Experiment 5


We will use a fictitious dataset kc_data.csv to apply a logarithm transformation using Azure Machine Learning.
This experiment will demonstrate how to use the Apply Math Operation instance to transform the price column:
• Use the kc_data.csv dataset and create a new experiment.
• Set up the necessary instances in the workspace as shown in the video.
• Follow the settings and run the experiment to view the output.
Histogram of price data in units of dollars
Observe that the data distribution of the price column is highly right-skewed.
Histogram of price data after log base 10 transformation
Observe that the data distribution of the price column is more normally distributed after applying log base 10.

Skewness and Kurtosis


Observe that the skewness and kurtosis values are closer to zero after the transformation.
6 Class Imbalance
Class imbalance in machine learning is where the total number of a class of data (class 1) is far
less than the total number of another class of data (class 2). This problem is extremely common
in practice.
Most machine learning algorithms work best when the number of instances of each class is roughly equal. When the number of instances of one class far exceeds the other, the result is usually a biased classifier with higher predictive accuracy on the majority class(es) but poorer predictive accuracy on the minority class. A quick check of the class proportions, as sketched below, reveals whether this is a concern.
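A minimal pandas sketch of such a check, using a made-up target similar to the Adult Census income variable:

```python
import pandas as pd

# Hypothetical target column: 76% majority class, 24% minority class
income = pd.Series(["<=50K"] * 760 + [">50K"] * 240)

# Class proportions reveal the majority / minority split
print(income.value_counts(normalize=True))
```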

6.1 SMOTE
SMOTE (Synthetic Minority Over-sampling Technique) is specifically designed for learning from
imbalanced datasets.
SMOTE is one of the most adopted approaches to deal with class imbalance due to its
simplicity and effectiveness. It is a combination of oversampling and undersampling.

• The majority class is first under-sampled by randomly removing samples from the
majority class population until the minority class becomes some specified percentage of
the majority class.
• Oversampling is then used to build up the minority class, not by simply replicating data
instances of the minority class but, instead, it constructs new minority class data
instances via an algorithm.

The module returns a dataset that contains the original samples, plus an additional number of
synthetic minority samples, depending on a percentage that you specify. As previously
described, these new instances are created by taking samples of the feature space for each
target class and its nearest neighbors, and generating new examples that combine features of
the target case with features of its neighbors.
Source: Predictive Analytics with Microsoft Azure Machine Learning 2015 by Roger Barga,
Valentine Fontama, and Wee Hyong Tok. Springer Science+Business Media New York
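Outside Azure, the same technique is available in the imbalanced-learn package. A minimal sketch, assuming numeric features and that imbalanced-learn is installed (pip install imbalanced-learn):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced data: roughly 90% majority class, 10% minority class
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)
print("Before SMOTE:", Counter(y))

# New minority examples are synthesised between each minority sample
# and its nearest minority-class neighbours
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("After SMOTE: ", Counter(y_res))
```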

6.2 Azure Demonstration: Experiment 6


We will use an Azure sample dataset, Adult Census Income Binary Classification, with income as the target variable to explore the SMOTE technique for addressing class imbalance.
From the dataset, notice that the target variable income has a high proportion of class imbalance between <=50k and >50k (majority class: <=50k; minority class: >50k).

To compare the performance of a model without addressing class imbalance against one which uses the SMOTE method, set up the instances in a new experiment in Azure as shown below.
Follow the settings in the video and run the experiment to view the output. The experiment in the video had been run beforehand.
Model Performance with SMOTE
The model performance, when evaluated on the test data (20%), shows that the SMOTE method has improved the recall rate over the model that did not address class imbalance. This is because the model with SMOTE has more synthetic training examples of >50k, so it is less biased towards classifying the majority class well.
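For reference, recall on a held-out test set can be computed with scikit-learn's recall_score; the labels below are made up for illustration (the actual evaluation in this experiment is done by Azure's Evaluate Model instance):

```python
from sklearn.metrics import recall_score

# Hypothetical test labels and predictions (1 = '>50K' minority class)
y_true = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0, 1, 0]

# Recall = true positives / (true positives + false negatives)
print("Recall:", recall_score(y_true, y_pred, pos_label=1))
```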
