Ch5: Data Preprocessing

Ms. Duha Qutishat
1st Semester 2022-2023 Introduction to Data Science Course
Data Preprocessing

The data preprocessing phase is the most challenging and time-consuming part of data
science, but it’s also one of the most important parts.
If you fail to clean and prepare the data, it could compromise the model.

“If 80 percent of our work is data preparation, then ensuring data quality is the important
work of a machine learning team.”
– Andrew Ng
When dealing with real-world data, Data Scientists will always need to apply some
preprocessing techniques in order to make the data more usable.

These techniques will facilitate its use in machine learning (ML) algorithms, reduce the
complexity to prevent overfitting, and result in a better model.

What is Data Preprocessing?
• After understanding the nuances of your dataset and the main issues in the data through Exploratory Data Analysis, data preprocessing comes into play by preparing your dataset for use in the model.
• In an ideal world, your dataset would be perfect and without any problems.
Unfortunately, real-world data will always present some issues that you’ll need to
address. Consider, for instance, the data you have in your company. Can you think of
any inconsistencies such as typos, missing data, different scales, etc.?
• These examples often happen in the real world and need to be adjusted in order to
make the data more useful and understandable. 
• This process, where we clean and solve most of the issues in the data, is what we call
the data preprocessing step.

What is data preprocessing?

• Data preprocessing, a component of data preparation, describes any type of processing performed on raw data to prepare it for another data processing procedure. It has traditionally been an important preliminary step for the data mining process. More recently, data preprocessing techniques have been adapted for training machine learning and AI models and for running inferences against them.
• Data preprocessing transforms the data into a format that is more easily
and effectively processed in data mining, machine learning and other data
science tasks. The techniques are generally used at the earliest stages of
the machine learning and AI development pipeline to ensure accurate
results.

Why is Data Preprocessing Important?
• If you skip the data preprocessing step, it will affect your work later on when applying this dataset to a machine learning model.
• Most models can't handle missing values, and some are affected by outliers, high dimensionality, and noisy data.
• By preprocessing the data, you'll make the dataset more complete and accurate. This phase is critical for making the necessary adjustments to the data before feeding the dataset into your machine learning model.

Why is Data Preprocessing Important?

• Virtually any type of data analysis, data science or AI development requires some type of data preprocessing to provide reliable, precise and robust results for enterprise applications.
• Real-world data is messy and is often created, processed and stored by a
variety of humans, business processes and applications. As a result, a
data set may be missing individual fields, contain manual input errors, or
have duplicate data or different names to describe the same thing.
Humans can often identify and rectify these problems in the data they use
in the line of business, but data used to train machine learning or deep
learning algorithms needs to be automatically preprocessed.

Why is Data Preprocessing Important?

• Machine learning and deep learning algorithms work best when data is presented in a
format that highlights the relevant aspects required to solve a problem. Feature
engineering practices that involve data wrangling, data transformation, data reduction,
feature selection and feature scaling help restructure raw data into a form suited for
particular types of algorithms. This can significantly reduce the processing power and
time required to train a new machine learning or AI algorithm or run an inference against
it.
• One caution that should be observed in preprocessing data: the potential for reencoding
bias into the data set. Identifying and correcting bias is critical for applications that help
make decisions that affect people, such as loan approvals. Although data scientists may
deliberately ignore variables like gender, race or religion, these traits may be correlated
with other variables like zip codes or schools attended, generating biased results.
• Most modern data science packages and services now include various preprocessing
libraries that help to automate many of these tasks.

Important Data Preprocessing Techniques
• Now that you know more about the data preprocessing phase and why it’s important, let’s
look at the main techniques to apply in the data, making it more usable for our future
work. The techniques that we’ll explore are: 
1. Data Quality Assessment
2. Data Cleaning
3. Dimensionality Reduction
4. Sampling Data 
5. Data Transformation
6. Imbalanced Data 

1.Data Quality Assessment

• Take a good look at your data and get an idea of its overall quality, relevance to your project, and
consistency. There are a number of data anomalies and inherent problems to look out for in almost
any data set, for example:
• Mismatched data types: When you collect data from many different sources, it may come to you in different
formats. While the ultimate goal of this entire process is to reformat your data for machines, you still need to
begin with similarly formatted data. For example, if part of your analysis involves family income from multiple
countries, you’ll have to convert each income amount into a single currency.
• Mixed data values: Perhaps different sources use different descriptors for features – for example, man or male.
These value descriptors should all be made uniform.
• Data outliers: Outliers can have a huge impact on data analysis results. For example if you're averaging test
scores for a class, and one student didn’t respond to any of the questions, their 0% could greatly skew the
results.
• Missing data: Take a look for missing data fields, blank spaces in text, or unanswered survey questions. This
could be due to human error or incomplete data. To take care of missing data, you’ll have to perform data
cleaning.

Data Quality Assessment (cont.)

• Measures for data quality: a multidimensional view
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, dangling, …
• Timeliness: is the data updated in a timely manner?
• Believability: how trustworthy is the data?
• Interpretability: how easily can the data be understood?

2.Data Cleaning

• One of the most important aspects of the data preprocessing phase is detecting and fixing bad and inaccurate observations from your dataset in order to improve its quality.
• This technique refers to identifying incomplete, inaccurate, duplicated,
irrelevant or null values in the data.
• After identifying these issues, you will need to either modify or delete them.
• The strategy that you adopt depends on the problem domain and the goal
of your project.

Noisy Data
• Usually, noisy data refers to meaningless data in your dataset, incorrect records, or duplicated observations.
• For example, imagine there is a column in your database for ‘age’ that has negative values. In this case, the observation doesn’t make sense, so you could delete it or set the value to null.
• Another case is when you need to remove unwanted or irrelevant data. For example, say you need to predict
whether a woman is pregnant or not. You don’t need the information about their hair color, marital status or
height, as they are irrelevant for the model.
• An outlier can be considered noise, even though it might be a valid record, depending on the outlier. You’ll need
to determine if the outlier can be considered noise data and if you can delete it from your dataset or not.
• Solution: 
• A common technique for noise data is the binning approach, where you first sort the values, then divide them
into “bins” (buckets with the same size), and then apply a mean/median in each bin, smoothing it.
• Note: a way to group numbers of more-or-less continuous values into a smaller number of "bins". For example, if
you have data about a group of people, you might want to arrange their ages into a smaller number of age
intervals (for example, grouping every five years together).
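• A minimal pandas sketch of bin-mean smoothing (the "age" column, its values, and the choice of 3 equal-width bins are illustrative assumptions, not from the slide):

```python
import pandas as pd

# Hypothetical noisy "age" column
df = pd.DataFrame({"age": [3, 7, 12, 18, 22, 25, 31, 44, 58, 63]})

# Divide the values into 3 equal-width bins ("buckets")
df["age_bin"] = pd.cut(df["age"], bins=3)

# Replace each value with the mean of its bin (bin-mean smoothing)
df["age_smoothed"] = df.groupby("age_bin", observed=True)["age"].transform("mean")
print(df)
```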

Missing Data
• Another common issue that we face in real-world data is the absence of data points.
• Most machine learning models can’t handle missing values in the data, so you need to intervene and adjust the
data to be properly used inside the model.
• There are different approaches you can take to handle it (usually called imputation):
Solution 1: 
The simplest solution is to remove that observation. However, this is only recommended if:
1) You have a large dataset and a few missing records, so removing them won’t impact the distribution of your
dataset. 
2) Most of the attributes of that observation are null, so the observation itself is meaningless. 
Solution 2: 
• Another solution is to use a global constant to fill that gap, like “NA” or 0, but only if it’s difficult to predict the
missing value. An alternative option is to use the mean or median of that attribute to fill the gap.
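• A small pandas sketch of Solution 2, assuming a hypothetical "income" column with missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical "income" attribute with missing values
df = pd.DataFrame({"income": [2500.0, np.nan, 3100.0, 2800.0, np.nan]})

# Fill with a global constant (only if the missing value is hard to predict)
df["income_const"] = df["income"].fillna(0)

# Fill with the mean of the attribute instead
df["income_mean"] = df["income"].fillna(df["income"].mean())
print(df)
```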

Missing Data(cont.)
Solution 3: 
• Using the backward/forward fill method is another approach that can be applied, where you either take the
previous or next value to fill the missing value. 
Solution 4: 
A more robust approach is the use of machine learning algorithms to fill these missing data points. For example: 
• Using KNN, first find the k instances closest to the instance with the missing value, and then fill the gap with the mean of that attribute across those k nearest neighbors.
• Using regression, for each missing attribute, learn a regressor that can predict this missing value based on the
other attributes.
• It’s not easy to choose a specific technique to fill the missing values in our dataset, and the approach you use
strongly depends on the problem you are working on and the type of missing value you have. 
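• A hedged sketch of Solutions 3 and 4, using pandas forward fill and scikit-learn's KNNImputer on made-up "age" and "salary" columns:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical dataset with missing values in both columns
df = pd.DataFrame({
    "age":    [23.0, 31.0, np.nan, 45.0, 52.0],
    "salary": [2000.0, np.nan, 3500.0, 4200.0, 5000.0],
})

# Solution 3: forward fill -- take the previous value to fill the gap
df_ffill = df.ffill()

# Solution 4: KNN imputation -- each missing value is replaced by the mean of
# that attribute over the k nearest neighbors
imputer = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_knn)
```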

Structural Errors

• Structural errors usually refer to some typos and inconsistencies in the values of the data. 
• For example, say that there is a marketplace and we sell shoes on our website.
• The data about the same product can be written in different ways by different sellers that sell the same shoes.
Imagine that one of the attributes we have is the brand of the shoes, and aggregating the name of the brand for
the same shoes we have: Nike, nike, NIKE.
• We need to fix this issue before giving this data to the model, otherwise, the model may treat them as different
things.
• In this case, it’s an easy fix: just transform all the words to lowercase. It may require more complex changes to
fix inconsistencies and typos in other scenarios, though. 
• This issue generally requires manual intervention rather than applying some automated techniques.
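• A minimal pandas sketch of the lowercase fix, assuming a hypothetical "brand" column:

```python
import pandas as pd

# Hypothetical "brand" attribute where the same brand is written in different ways
df = pd.DataFrame({"brand": ["Nike", "nike", "NIKE", "Adidas"]})

# Transform all values to lowercase (and strip spaces) so the model treats them as one value
df["brand"] = df["brand"].str.strip().str.lower()
print(df["brand"].unique())  # ['nike' 'adidas']
```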

3.Dimensionality Reduction 

• Dimensionality reduction is concerned with reducing the number of input features in the training data.
The Curse of Dimensionality in Your Dataset
• With a real-world dataset, there are usually tons of attributes, and if we don’t reduce this number, it may affect
the model’s performance later when we feed it this dataset. Reducing the number of features while keeping as
much variation in the dataset as possible will have a positive impact in many ways, such as: 
• Requiring less computational resources 
• Increasing the overall performance of the model
• Preventing overfitting (when the model becomes too complex and the model memorizes the training data,
instead of learning, so in the test data the performance decreases a lot)
• Avoiding multicollinearity (high correlation between independent variables). Applying this technique will also reduce noise in the data.
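• As an illustration, one common dimensionality reduction technique is PCA (mentioned again in the pipeline section). The sketch below applies it to the iris dataset purely as an example; in practice you would usually scale the features first:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Reduce the 4 iris features to 2 components while keeping most of the variance
X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (150, 2)
print(pca.explained_variance_ratio_.sum())   # share of variance retained
```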

Data reduction

• The more data you’re working with, the harder it will be to analyze, even after cleaning and transforming it.
Depending on your task at hand, you may actually have more data than you need. Especially when working with
text analysis, much of regular human speech is superfluous or irrelevant to the needs of the researcher. Data
reduction not only makes the analysis easier and more accurate, but cuts down on data storage.
• It will also help identify the most important features to the process at hand.
• Attribute selection: Similar to discretization, attribute selection can fit your data into smaller pools. It,
essentially, combines tags or features, so that tags like male/female and professor could be combined into male
professor/female professor.
• Numerosity reduction: This will help with data storage and transmission. You can use a regression model, for
example, to use only the data and variables that are relevant to your analysis.
• Dimensionality reduction: This, again, reduces the amount of data used to help facilitate analysis and
downstream processes. Algorithms like K-nearest neighbors use pattern recognition to combine similar data and
make it more manageable.

Feature Selection
• Feature selection refers to the process of selecting the most important variables (features) related to your
prediction variable, in other words, selecting the attributes which contribute most to your model. Here are some
techniques for this approach that you can apply either automatically or manually:
• Correlation Between Features: This is the most common approach, which drops some features that have a
high correlation with others. 
• Statistical Tests: Another alternative is to use statistical tests to select the features, checking the relationship of
each feature individually with the output variable. There are many examples in the scikit-learn library like 
SelectKBest, SelectPercentile, chi2, f_classif, f_regression. 
• Recursive Feature Elimination (RFE): The Recursive Feature Elimination, also known as Backward
Elimination, where the algorithm trains the model with all features in the dataset, calculating the performance of
the model, and then drops one feature at a time, stopping when the performance improvement becomes
negligible. 
• Variance Threshold: Another feature selection method is the variance threshold, which measures the variability within each column and selects the features whose variance exceeds the threshold. The premise of this approach is that features with low variability within themselves have little influence on the output variable.
• Also, some models automatically apply a feature selection during the training. The decision-tree-based models
can provide information about the feature importance, giving you a score for each feature of your data. The
higher the value, the more relevant it is for your model. For more algorithms implemented in sklearn, consider
checking the feature_selection module.
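• A short scikit-learn sketch of statistical-test-based selection with SelectKBest and f_classif (the iris dataset is used only as a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the strongest statistical relationship to the target
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)   # one score per original feature
print(X_selected.shape)   # (150, 2)
```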

Feature Engineering: Using Domain Knowledge
to Create Features 
• The feature engineering approach is used to create better features for your dataset that
will increase the model’s performance. We mainly use domain knowledge to create those
features, which we manually generate from the existing features by applying some
transformation to them. Here are some basic examples you can easily apply to your
dataset to potentially increase your model’s performance:
1)Decompose Categorical Attributes
• The first example is decomposing categorical attributes from your dataset. Imagine that
you have a feature in your data about hair color and the values are brown, blonde and
unknown. In this case, you can create a new column called “has color” and assign 1 if
you get a color and 0 if the value is unknown.
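• A minimal pandas sketch of this "has color" feature, with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical "hair_color" attribute with an "unknown" category
df = pd.DataFrame({"hair_color": ["brown", "blonde", "unknown", "brown"]})

# New feature: 1 if a color was recorded, 0 if the value is unknown
df["has_color"] = np.where(df["hair_color"] == "unknown", 0, 1)
print(df)
```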

Feature Engineering: Using Domain Knowledge
to Create Features 
2)Decompose a DateTime
• Another example would be decomposing a datetime feature, which contains useful information, but it’s difficult
for a model to benefit from the original form of the data. So if you think that your problem has time
dependencies, and you may find some relationship between the datetime and the output variable, then spend
some time trying to convert that datetime column into a more understandable feature for your model, like “period
of day,” “day of the week,” and so on.
3)Reframe Numerical Quantities
• This last example is more about handling numerical data. Let’s say that you have a dataset about some
purchases of clothes for a specific store. Besides the absolute number of purchases, you may find interest in
creating new features regarding the seasonality of that purchase. So you may end up adding four more columns
to your dataset about purchases in summer, winter, fall, and spring. Depending on the problem you are trying to
solve it may help you and increase the quality of your dataset.
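• A small pandas sketch combining examples 2 and 3; the purchase timestamps and the period/season boundaries are assumptions chosen for illustration:

```python
import pandas as pd

# Hypothetical purchase timestamps
df = pd.DataFrame({"purchased_at": pd.to_datetime(
    ["2022-01-15 09:30", "2022-07-03 21:10", "2022-10-22 14:05"])})

# Decompose the datetime into features the model can use more easily
df["day_of_week"] = df["purchased_at"].dt.day_name()
df["period_of_day"] = pd.cut(df["purchased_at"].dt.hour,
                             bins=[0, 6, 12, 18, 24], right=False,
                             labels=["night", "morning", "afternoon", "evening"])

# Reframe the date as seasonality (0 = winter, 1 = spring, 2 = summer, 3 = fall)
df["season"] = df["purchased_at"].dt.month % 12 // 3
print(df)
```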

4.Handling a Large Amount of Data
(Sampling Data)
• Even though the more data you have, the greater the model’s accuracy tends to be, some machine learning algorithms can have difficulty handling a large amount of data and run into issues like memory saturation, increased computation when adjusting the model parameters, and so on. To address this problem, here are some of the sampling techniques we can use:
• Sampling without replacement. This approach avoids having the same data repeated in the sample, so if the
record is selected, it’s removed from the population.
• Sampling with replacement. With this approach, the object is not removed from the population and can be
repeated multiple times for the sample data since it can be picked up more than once.
• Stratified sampling. This method is more complex and refers to splitting the data into many partitions and
getting random samples for each partition. In cases where the classes are disproportional, this approach keeps
the proportional number of classes according to the original data.
• Progressive sampling. This last technique starts with a small size and keeps increasing the dataset until a
sufficient sample size is acquired.
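• An illustrative sketch of the first three techniques, using pandas sampling plus scikit-learn's train_test_split with the stratify option as one way to do stratified sampling (the iris dataset is only an example):

```python
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)
df = X.assign(target=y)

# Sampling without replacement: each record can appear at most once
sample_no_repl = df.sample(n=30, replace=False, random_state=42)

# Sampling with replacement: the same record may be drawn several times
sample_repl = df.sample(n=30, replace=True, random_state=42)

# Stratified sampling: keep the class proportions of the original data
_, stratified = train_test_split(df, test_size=0.2,
                                 stratify=df["target"], random_state=42)
print(Counter(stratified["target"]))
```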

5.Data Transformation: Converting the Data to the Same
Structure 
• One of the most critical steps in the preprocessing phase is data transformation, which converts the
data from one format to another. Some algorithms expect that the input data is transformed, so if
you don’t complete this process, you may get poor model performance or even create bias. 
• For example, the KNN model uses distance measures to compute the neighbors that are closer to a
given record. If you have a feature whose scale is very high compared with other features in your
model, then your model will tend to use more of this feature than the others, creating a bias in your
model.

Data transformation

• With data cleaning, we’ve already begun to modify our data, but data transformation will begin the process of
turning the data into the proper format(s) you’ll need for analysis and other downstream processes.

• This generally happens in one or more of the following:
1. Aggregation
2. Normalization
3. Feature selection
4. Discretization
5. Concept hierarchy generation

1. Aggregation: 

• Data aggregation combines all of your data together in a uniform format.
• It is the process of summarizing a large pool of data for high-level analysis, usually using sums, averages/means, or medians. It is important to note that the data to be aggregated does not need to be specifically numeric; non-numeric data elements can be counted as well.
• Some other specific examples of aggregated data can include things like:
• Voting information, where individual records are not relayed, just the total votes for each candidate in a specific region
• Average consumer age, where for each product the average age of the consumer is recorded rather than each individual customer
• In today’s world of big data and the internet of things, data aggregation is an important component of successful business strategy.

2. Normalization: 
• Normalization scales your data into a regularized range so that you can compare it more
accurately.
• For example, if you’re comparing employee loss or gain within a number of companies
(some with just a dozen employees and some with 200+), you’ll have to scale them within
a specified range, like -1.0 to 1.0 or 0.0 to 1.0.
• Normalization is the process of organizing data so that it is consistent across all records and fields.
• Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling

Min-max Normalization

• Min-max normalization (usually called feature scaling) performs a linear transformation on the original data. This technique gets all the scaled data into the range (0, 1). The formula to achieve this is the following:

v' = (v − min) / (max − min)

where min and max are the minimum and maximum values of the feature.

• Min-max normalization preserves the relationships among the original data values.
• Min-max normalization has one fairly significant downside: it does not handle outliers very well.
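• A minimal scikit-learn sketch using MinMaxScaler on made-up salary values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Scale a single hypothetical feature into the range (0, 1)
X = np.array([[10_000.0], [25_000.0], [8_000.0], [15_000.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)
print(X_scaled.ravel())   # 0.0 for the minimum (8,000), 1.0 for the maximum (25,000)
```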

Z-Score Normalization
• It will return a normalized value (z-score) based on the mean and standard deviation.
• Z-score normalization is a strategy of normalizing data that avoids this outlier issue. The formula for Z-score normalization is below:

z = (x − μ) / σ

• Here, μ is the mean value of the feature and σ is the standard deviation of the feature. If a value is exactly equal to the mean of all the values of the feature, it will be normalized to 0. If it is below the mean, it will be a negative number, and if it is above the mean it will be a positive number.
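• A minimal scikit-learn sketch using StandardScaler (which implements z-score normalization) on the same made-up values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Z-score normalization of one feature: subtract the mean, divide by the std
X = np.array([[10_000.0], [25_000.0], [8_000.0], [15_000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.ravel())   # values below the mean are negative, above are positive
```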

Min-max normalization vs Z-score
normalization
• Min-max normalization: Guarantees all features will have the exact
same scale but does not handle outliers well.
• Z-score normalization: Handles outliers, but does not produce
normalized data with the exact same scale.

Normalization by decimal scaling

• Decimal scaling is a data normalization technique. In this technique, we move the decimal point of the values of the attribute.
• It works by scaling each value down to a small decimal number.
• How far the decimal point moves depends entirely on the maximum value among all values in the attribute.

V' = V / 10^j

• Here:
• V' is the new value after applying the decimal scaling
• V is the respective value of the attribute
• The integer j defines how far the decimal point moves. How is it defined? It is equal to the number of digits present in the maximum value in the data table.

Normalization by decimal scaling
• Here is an example: Suppose a company wants to compare the salaries of the new joiners. Here are the data values:

Employee Name    Salary
ABC              10,000
XYZ              25,000
PQR               8,000
MNO              15,000

• Now, look for the maximum value in the data. In this case, it is 25,000. Now count the number of digits in this value. In this case, it is 5, so j is equal to 5, i.e., the divisor is 100,000. This means the V (value of the attribute) needs to be divided by 100,000 here.
• Thus, decimal scaling can tone down big numbers into easy-to-understand smaller decimal values. Also, data attributed to different units becomes easier to read and understand once it is converted into smaller decimal values.

Name    Salary    Salary after Decimal Scaling
ABC     10,000    0.1
XYZ     25,000    0.25
PQR      8,000    0.08
MNO     15,000    0.15
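• A small pandas sketch of the salary example above, reproducing the table's numbers:

```python
import pandas as pd

# Decimal-scaling sketch for the salary example above
df = pd.DataFrame({"name": ["ABC", "XYZ", "PQR", "MNO"],
                   "salary": [10_000, 25_000, 8_000, 15_000]})

# j = number of digits in the maximum value (25,000 has 5 digits -> divide by 10^5)
j = len(str(df["salary"].abs().max()))
df["salary_scaled"] = df["salary"] / (10 ** j)
print(df)   # 0.10, 0.25, 0.08, 0.15
```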

3. Feature selection: 

• Feature selection is the process of deciding which variables (features, characteristics, categories, etc.) are most important to your analysis. These features will be used to train ML models. It’s important to remember that the more features you choose to use, the longer the training process and, sometimes, the less accurate your results, because some feature characteristics may overlap or be less present in the data.

4.Discretization

• Discretization is a process of converting continuous data attribute values into a finite set of intervals and associating a specific data value with each interval.
• Discretization pools data into smaller intervals.
• For example, when calculating average daily exercise, rather than using
the exact minutes and seconds, you could join together data to fall into 0-
15 minutes, 15-30, etc.
Ex: Suppose we have an attribute Age with the given values:
Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77

Age values                  After discretization
1, 5, 4, 9, 7               Child
11, 14, 17, 13, 18, 19      Young
31, 33, 36, 42, 44, 46      Mature
70, 74, 77, 78              Old
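• A minimal pandas sketch of this age example using pd.cut; the bin boundaries (0, 10, 30, 60, 100) are assumptions chosen to reproduce the groups above:

```python
import pandas as pd

ages = pd.Series([1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19,
                  31, 33, 36, 42, 44, 46, 70, 74, 78, 77])

# Discretize the continuous ages into labelled intervals
groups = pd.cut(ages, bins=[0, 10, 30, 60, 100],
                labels=["Child", "Young", "Mature", "Old"])
print(pd.concat([ages, groups], axis=1, keys=["age", "group"]))
```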
5. Data discretization and concept hierarchy
generation
• Concept hierarchy generation can add a hierarchy
within and between your features that wasn’t present in
the original data.
• The term hierarchy represents an organizational structure or
mapping in which items are ranked according to their levels
of importance
• For example, in computer science, there are different types of hierarchical systems. A document placed in a folder in Windows, at a specific place in the tree structure, is a good example of a hierarchical tree model. There are two types of hierarchy mapping: top-down and bottom-up.
Preprocessing Techniques:
6. Handling Data with Unequal Distribution of Classes
(Imbalanced Data)
• One of the most common problems we face when dealing with real-world data classification is that
the classes are imbalanced (one of the classes has more examples than the other), creating a
strong bias for the model. 
• Imagine that you want to predict if a transaction is fraudulent. Based on your training data, 95% of
your dataset contains records about normal transactions, and only 5% of your data is about
fraudulent transactions. Based on that, your model most likely will tend to predict the majority class,
classifying fraudulent transactions as normal ones. 
• There are three main techniques that we can use to address this deficiency in the dataset:
1.Oversampling 
2.Undersampling 
3.Hybrid 

Imbalanced Data
• Oversampling
• The oversampling approach is the process of increasing your dataset with synthetic data of the
minority class. The most popular technique used for this is the Synthetic Minority Oversampling
Technique (SMOTE). Briefly, it takes a random example from the minority class. Then another
random data point is selected through k-nearest neighbors of the first observation, and a new record
is created between these two selected data points. You can find this technique in the 
imbalanced-learn library in Python.
• Undersampling
• The undersampling technique, in contrast, is the process of reducing your dataset and removing
real data from your majority class. The main algorithms used in this approach are the TomekLinks,
which removes the observation based on the nearest neighbor, and the Edited Nearest Neighbors (
ENN), which uses the k-nearest neighbor instead of only one as in Tomek.
• Hybrid
• The hybrid approach combines the oversampling and undersampling techniques in your dataset.
One of the algorithms that are used in this method is the SMOTEENN, which makes use of the
SMOTE algorithm for oversampling in the minority class and ENN for undersampling in the majority
class.
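• A minimal sketch of SMOTE oversampling with the imbalanced-learn library, on a synthetic 95%/5% dataset (the synthetic data is only for illustration):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Create an imbalanced toy dataset: roughly 95% majority class, 5% minority class
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print(Counter(y))        # imbalanced classes

# Oversample the minority class with synthetic records
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))    # both classes now have the same number of records
```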
The Data Preprocessing Pipeline

1. Step 1: Start by analyzing and treating the correctness of attributes, like identifying noisy data and any structural errors in the dataset.
2. Step 2: Analyze missing data, along with the outliers, because filling missing values depends on the outlier analysis. After completing this step, go back to the first step if necessary, rechecking redundancy and other issues.
3. Step 3: Add domain knowledge to create new features for your dataset. If you don’t get any useful new features for your project, don’t worry; just avoid creating useless features.

The Data Preprocessing Pipeline

4. Step 4: Use this step to transform the features onto the same scale/unit. If you are unsure whether the model you will be using needs the data on the same scale, apply it anyway; it won’t negatively affect models that don’t need data transformation.
5. Step 5: This stage avoids the curse of dimensionality, so if you think you’re facing this problem, apply this step in your pipeline. It comes after data transformation because some of the techniques (e.g., PCA) need transformed data.
6. Step 6: The last part before moving to the model phase is to handle the imbalanced data. Note that there are specific metrics for evaluating the model’s performance when you have this issue in your data.

References

• https://www.scalablepath.com/data-science/data-preprocessing-phase
• https://monkeylearn.com/blog/data-preprocessing/
• https://www.techtarget.com/searchdatamanagement/definition/data-preprocessing
• https://www.codecademy.com/article/normalization
• https://www.upgrad.com/blog/normalization-in-data-mining/

