Week 2 Data Pre-Processing
Data Pre-processing
Week 2
Informatics Engineering Study Program
Faculty of Engineering – Universitas Surabaya
What is data pre-processing
• Data pre-processing is a step in machine learning that transforms the
original data into a new form to facilitate the next steps.
• Types:
– Scaling
– Standardization
– Normalizing observation
– Generating polynomial features
– Encoding categorical features
– Detecting outliers
– Handling missing values
Background
• Data can be recorded in different forms
• Data can be recorded in different scales or units
• Data may contain outliers
• Data may contain missing values
Scaling
• Some machine learning algorithms assume that all features are on the
same scale.
• For example, all features may need to fall within $[0, 1]$ or $[-1, 1]$.
• A common technique for rescaling a feature so that it falls within
$[0, 1]$ is min-max scaling.
• Suppose a feature $x$ has values $x_1, \dots, x_n$. To rescale $x$ to $x'$
such that $x'_i \in [0, 1]$, use the following equation:
$$x'_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$
• Example feature values: $x = (15, 57, 36, 88, 37)$
Scaling using scikit-learn’s MinMaxScaler
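The original code for this slide did not survive extraction, so the following is only a minimal sketch of how min-max scaling might look with `MinMaxScaler`. It uses the five example values from the slides (15, 57, 36, 88, 37) and assumes the target range is $[0, 1]$; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Example feature from the slides, shaped as one column: (n_samples, n_features)
x = np.array([[15], [57], [36], [88], [37]], dtype=float)

# Rescale the feature to [0, 1] (assumed target range)
scaler = MinMaxScaler(feature_range=(0, 1))
x_scaled = scaler.fit_transform(x)

# The same result from the min-max formula applied by hand
x_manual = (x - x.min()) / (x.max() - x.min())

print(x_scaled.ravel())
print(x_manual.ravel())
```

In practice the scaler is usually fitted on the training data only and then reused, via `scaler.transform`, on validation or test data.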
Standardization
• Standardization is used to transform a feature to a new form such
that it has a mean of 0 and a standard deviation of 1.
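• Suppose a feature $x$ has values $x_1, \dots, x_n$. To standardize $x$ to
$x'$ such that $\bar{x}' = 0$ and $\sigma_{x'} = 1$, use the following equation:
$$x'_i = \frac{x_i - \bar{x}}{\sigma_x}$$
• $\bar{x}$: mean of $x$
• $\sigma_x$: standard deviation of $x$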
Standardization : example
• Let $x = (15, 57, 36, 88, 37)$ be a feature; standardize $x$ to mean 0 and
standard deviation 1.
Standardization using scikit-learn’s StandardScaler
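The code shown on this slide was lost, so here is a hedged sketch of standardization with `StandardScaler`, applied to the example values 15, 57, 36, 88, 37. Note that `StandardScaler` divides by the population standard deviation, which matches NumPy's default `std()`; variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Example feature from the slides, shaped as one column
x = np.array([[15], [57], [36], [88], [37]], dtype=float)

# Standardize to mean 0 and standard deviation 1
scaler = StandardScaler()
x_std = scaler.fit_transform(x)

# The same result from the formula (x - mean) / std applied by hand
x_manual = (x - x.mean()) / x.std()

print(x_std.ravel())
print("mean:", x_std.mean(), "std:", x_std.std())
```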
Normalizing observation
• Unlike scaling and standardization, which work on each feature,
normalizing an observation works on an individual sample across all of
its features.
• This transformation is a common operation in text processing.
• Normalizing observation is used to transform an individual sample to
a new form such that it has unit norm (vector with length one).
• Suppose $x$ is a nonzero individual sample consisting of $m$ features. To
normalize $x$ to a unit vector $x'$, use the following equation:
$$x' = \frac{x}{\lVert x \rVert}$$
• $\lVert x \rVert$ is the norm of $x$
Norm
• Three norms can be used when normalizing an observation: the
Manhattan norm, the Euclidean norm, and the maximum norm.
• Suppose $x = (x_1, \dots, x_m)$ is an individual sample consisting of $m$
features.
• Manhattan norm or L1 norm: $\lVert x \rVert_1 = \sum_{j=1}^{m} |x_j|$
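As an illustration of normalizing observations under the three norms named above, the sketch below uses scikit-learn's `Normalizer`, whose `norm` parameter accepts `"l1"`, `"l2"`, and `"max"`. The data matrix `X` is hypothetical and not taken from the slides.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical data: each row is one observation with two features
X = np.array([[3.0, 4.0],
              [1.0, 1.0]])

# Each row is divided by its own norm
X_l1 = Normalizer(norm="l1").fit_transform(X)    # Manhattan norm
X_l2 = Normalizer(norm="l2").fit_transform(X)    # Euclidean norm
X_max = Normalizer(norm="max").fit_transform(X)  # maximum norm

# The Euclidean case by hand: x / ||x|| for every row
X_l2_manual = X / np.linalg.norm(X, ord=2, axis=1, keepdims=True)

print(X_l1)
print(X_l2)
print(X_max)
```

After L2 normalization every row has length one; after L1 normalization the absolute values in every row sum to one.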
• LB:
• UB:
• Observations below LB or above UB are considered outliers
Standard deviation method using numpy
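The slide's code and its exact bounds were lost in extraction. The sketch below assumes the common convention of placing the lower bound (LB) and upper bound (UB) at $k = 3$ standard deviations from the mean; the multiplier, the function name `std_outliers`, and the data are assumptions, not values from the slides.

```python
import numpy as np

def std_outliers(x, k=3.0):
    """Return (lb, ub, outliers) using mean +/- k standard deviations.

    k = 3 is an assumed, commonly used multiplier.
    """
    x = np.asarray(x, dtype=float)
    mean, std = x.mean(), x.std()
    lb = mean - k * std   # lower bound
    ub = mean + k * std   # upper bound
    mask = (x < lb) | (x > ub)
    return lb, ub, x[mask]

# Hypothetical feature: twenty values near 10 and one extreme value
values = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10,
          10, 11, 9, 10, 12, 8, 10, 11, 9, 10, 50]

lb, ub, outliers = std_outliers(values)
print("LB:", lb, "UB:", ub, "outliers:", outliers)
```

This method suits roughly normally distributed features; with very small samples, no value can lie three standard deviations from the mean, so nothing would be flagged.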
Detecting outliers: Interquartile range method
• The interquartile range (IQR) method can be applied to a feature from
any distribution.
• IQR is the difference between the third quartile ($Q_3$) and the first
quartile ($Q_1$): $\mathrm{IQR} = Q_3 - Q_1$
[Figure: regions labeled "Outliers"]
Interquartile range method: example
• Let
• LB:
• UB:
• Outliers: 1.27 and 8.58
Interquartile range method using numpy
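The original code is missing here, so this is a minimal sketch of the IQR method with NumPy. It assumes the conventional fences $LB = Q_1 - 1.5 \cdot \mathrm{IQR}$ and $UB = Q_3 + 1.5 \cdot \mathrm{IQR}$; the factor 1.5, the function name `iqr_outliers`, and the data are assumptions rather than values from the slides.

```python
import numpy as np

def iqr_outliers(x, factor=1.5):
    """Return (lb, ub, outliers) using fences at factor * IQR.

    factor = 1.5 is the conventional choice, assumed here.
    """
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lb = q1 - factor * iqr   # lower bound
    ub = q3 + factor * iqr   # upper bound
    mask = (x < lb) | (x > ub)
    return lb, ub, x[mask]

# Hypothetical feature values
values = [1, 10, 11, 12, 12, 13, 14, 15, 30]

lb, ub, outliers = iqr_outliers(values)
print("LB:", lb, "UB:", ub, "outliers:", outliers)
```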
Handling missing values
• Sometimes the dataset used to build a machine learning model contains
missing values in the input features or the desired output.
• Most machine learning algorithms cannot handle missing values.
• Missing values therefore need to be handled during pre-processing.
• Types of missing values:
– Missing Completely At Random (MCAR): the probability that a value is missing
is independent of everything.
– Missing At Random (MAR): the probability that a value is missing is not
completely random but depends on the information captured in other features.
– Missing Not At Random (MNAR): the probability that a value is missing is not
random and depends on information not captured in our features.
Handling missing values
• Some methods for handling missing values:
– Deleting observations with missing values
– Imputing missing values:
• Using interpolation
• Using a machine learning model
• Using the mean, median, or mode.
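As one illustration of the last option, the sketch below imputes missing values with the mean, median, or mode using pandas' `fillna`. The DataFrame and its column names (`age`, `city`) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing value per column
df = pd.DataFrame({
    "age": [20.0, np.nan, 30.0, 70.0],
    "city": ["A", "B", None, "A"],
})

# Numeric feature: fill with the mean or the median of the observed values
age_mean = df["age"].fillna(df["age"].mean())
age_median = df["age"].fillna(df["age"].median())

# Categorical feature: fill with the most frequent value (mode)
city_mode = df["city"].fillna(df["city"].mode()[0])

print(age_mean.tolist())
print(age_median.tolist())
print(city_mode.tolist())
```

The median is less sensitive to outliers than the mean, which is why the two fills differ for this skewed column.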
Deleting observations with missing values
• If an observation contains a missing value of type MCAR or MAR, it
is sometimes acceptable to delete the observation.
[Figure: delete the observation with a missing value]
Deleting observations with missing values using pandas’s DataFrame.dropna
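The slide's code was not preserved; the following is a small sketch of how `DataFrame.dropna` might be used. The DataFrame and its column names (`x1`, `x2`) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: rows 1 and 2 each contain a missing value
df = pd.DataFrame({
    "x1": [1.0, 2.0, np.nan, 4.0],
    "x2": [10.0, np.nan, 30.0, 40.0],
})

# Drop every row (observation) that has at least one missing value
df_clean = df.dropna()

# Drop rows only when a specific column is missing
df_x1 = df.dropna(subset=["x1"])

print(df_clean)
print(df_x1)
```

`dropna` returns a new DataFrame by default and leaves `df` unchanged.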
Imputing missing values using interpolation
• Interpolation is a method for estimating unknown data points within the
range of a discrete set of known data points.
• Linear interpolation is the simplest interpolation method: it places the
unknown data point on the straight line between its two nearest known
data points (for a point midway between them, this is their average).
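A minimal sketch of linear interpolation with pandas' `Series.interpolate` follows. The series is hypothetical, and `method="linear"` treats the values as equally spaced.

```python
import numpy as np
import pandas as pd

# Hypothetical feature with gaps between known values
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

# Fill each gap along the straight line between its known neighbours
s_filled = s.interpolate(method="linear")

print(s_filled.tolist())
```

The single gap between 1 and 3 is filled with their average, while the two-point gap between 3 and 6 is filled in equal steps.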