Week 2 Data Pre-Processing


1604C055 - Machine Learning

Data Pre-processing

Week 2
Informatics Engineering Study Program
Faculty of Engineering – Universitas Surabaya
What is data pre-processing
• Data pre-processing is a step in machine learning that transforms the
original data into a new form to facilitate the subsequent steps.
• Types:
– Scaling
– Standardization
– Normalizing observation
– Generating polynomial features
– Encoding categorical features
– Detecting outliers
– Handling missing values
Background
• Data can be recorded in different forms
• Data can be recorded in different scales or units
• Data may contain outliers
• Data may contain missing values
Scaling
• Some machine learning algorithms assume that all features are on the
same scale.
• For example, all features may be required to fall in [0, 1] or in [−1, 1].
• A common technique used to rescale a feature so that it falls in [0, 1]
is min-max scaling.
• Suppose a feature x has values x_1, x_2, …, x_n; to rescale x to x' such
that 0 ≤ x'_i ≤ 1, use the following eq.:

    x'_i = (x_i − min(x)) / (max(x) − min(x))

• min(x): the minimum value of x; max(x): the maximum value of x
Scaling: example
• Let x = (15, 57, 36, 88, 37) be a feature; rescale x to [0, 1].
• min(x) = 15, max(x) = 88, so x'_i = (x_i − 15) / 73:

    x      x'
    15    0.000
    57    0.575
    36    0.288
    88    1.000
    37    0.301
Scaling using scikit-learn’s MinMaxScaler
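A minimal sketch of min-max scaling with scikit-learn's MinMaxScaler, using the feature values from the example above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Feature values from the example slide, as a single-column matrix
x = np.array([[15], [57], [36], [88], [37]], dtype=float)

# Rescale the feature to the range [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
x_scaled = scaler.fit_transform(x)

print(x_scaled.ravel())  # 15 -> 0.0, 88 -> 1.0
```

`feature_range` can be changed, e.g. to `(-1, 1)`, when a different target range is required.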
Standardization
• Standardization is used to transform a feature to a new form such
that it has a mean of 0 and a standard deviation of 1.
• Suppose a feature x has values x_1, x_2, …, x_n; to standardize x to z
such that mean(z) = 0 and std(z) = 1, use the following eq.:

    z_i = (x_i − μ) / σ

• μ: mean of x
• σ: standard deviation of x
Standardization: example
• Let x = (15, 57, 36, 88, 37) be a feature; standardize x to 0 mean and
std. dev. 1.
• μ = 46.6, σ ≈ 24.60 (population standard deviation):

    x      z
    15    −1.285
    57     0.423
    36    −0.431
    88     1.683
    37    −0.390
Standardization using scikit-learn’s
StandardScaler
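A minimal sketch of standardization with scikit-learn's StandardScaler, using the feature values from the example above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Feature values from the example slide, as a single-column matrix
x = np.array([[15], [57], [36], [88], [37]], dtype=float)

# Subtract the mean and divide by the (population) standard deviation
scaler = StandardScaler()
z = scaler.fit_transform(x)

print(z.ravel())          # standardized values
print(z.mean(), z.std())  # approximately 0 and 1
```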
Normalizing observation
• Different from scaling and standardization, which work on each feature,
normalizing observation works on an individual sample across all
features.
• This transformation is a common operation in text processing.
• Normalizing observation is used to transform an individual sample to
a new form such that it has unit norm (a vector with length one).
• Suppose x = (x_1, x_2, …, x_n) is a nonzero individual sample consisting
of n features; to normalize x to a unit vector x', use the following eq.:

    x' = x / ‖x‖

• ‖x‖ is the norm of x
Norm
• There are three possible norms that can be used in normalizing
observation: the Manhattan norm, the Euclidean norm, and the
maximum norm.
• Suppose x = (x_1, x_2, …, x_n) is an individual sample consisting of n
features.
• Manhattan norm or L1 norm:

    ‖x‖_1 = |x_1| + |x_2| + … + |x_n|

• Euclidean norm or L2 norm:

    ‖x‖_2 = √(x_1² + x_2² + … + x_n²)

• Maximum norm or L-inf norm:

    ‖x‖_inf = max(|x_1|, |x_2|, …, |x_n|)

Normalizing observation: example
• Let x be an individual sample.
• Normalize x using the L2 norm.
– Normalized individual sample: x' = x / ‖x‖_2
• Normalize x using the L1 norm.
– Normalized individual sample: x' = x / ‖x‖_1
Normalizing observation: example
• Normalize x using the L-inf norm.
– Normalized individual sample: x' = x / ‖x‖_inf
Normalizing observation using scikit-learn’s
Normalizer
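A minimal sketch using scikit-learn's Normalizer with all three norms; the sample values (3, 4) are assumed for illustration:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# One individual sample with 2 features; the values (3, 4) are assumed
x = np.array([[3.0, 4.0]])

x_l1 = Normalizer(norm="l1").transform(x)    # divide by |3| + |4| = 7
x_l2 = Normalizer(norm="l2").transform(x)    # divide by sqrt(3^2 + 4^2) = 5
x_max = Normalizer(norm="max").transform(x)  # divide by max(|3|, |4|) = 4

print(x_l1, x_l2, x_max)
```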
Generating polynomial features
• If the relationship between the input features and the desired output
is nonlinear, then generating polynomial features is one solution to
this problem.
• Suppose x_1, x_2, …, x_n are input features. Generating polynomial
features is a process that transforms each individual feature to a
higher-order form, such as x_1², x_1³, etc., or includes interaction
features that are products of individual features, such as x_1·x_2, etc.
Generating polynomial features: examples
• Let x = (x_1, x_2) be a sample with 2 features.
• Polynomial features of degree 2 from x:

    1, x_1, x_2, x_1², x_1·x_2, x_2²

• Polynomial features of degree 3 from x:

    1, x_1, x_2, x_1², x_1·x_2, x_2², x_1³, x_1²·x_2, x_1·x_2², x_2³
Generating polynomial features using scikit-learn’s
PolynomialFeatures
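A minimal sketch using scikit-learn's PolynomialFeatures; the sample values (2, 3) are assumed for illustration:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with 2 features; the values (2, 3) are assumed
x = np.array([[2.0, 3.0]])

# Degree 2: 1, x1, x2, x1^2, x1*x2, x2^2
poly2 = PolynomialFeatures(degree=2).fit_transform(x)
# Degree 3 adds x1^3, x1^2*x2, x1*x2^2, x2^3
poly3 = PolynomialFeatures(degree=3).fit_transform(x)

print(poly2)  # [[1. 2. 3. 4. 6. 9.]]
print(poly3)
```

`PolynomialFeatures(interaction_only=True)` keeps only the interaction terms, and `include_bias=False` drops the leading constant 1.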
Encoding categorical features
• Sometimes the measurement of an object is performed in terms of
quality.
• Quality measurement usually produces categorical features rather
than numerical features.
• Most machine learning algorithms require inputs to be numerical
values.
• Encoding categorical features is a process to transform categorical
features into numerical features.
Encoding categorical features using pandas’s
DataFrame.replace
Detecting outliers
• An outlier is an observation that is significantly different from other
observations.
• Some machine learning algorithms are sensitive to outliers.
• Outliers can be caused by:
– Measurement error
– Data corruption
– True outlier observations
• Some methods for outlier detection:
– Standard deviation method
– Interquartile range method
Detecting outliers: Standard deviation method
• Assume the distribution of the feature is Gaussian or close to Gaussian.
• Suppose x is a feature that comes from a Gaussian distribution N(μ, σ²),
where μ and σ are the mean and the standard deviation of the feature,
resp. Then x satisfies the following properties:

    P(μ − σ ≤ x ≤ μ + σ) ≈ 68%
    P(μ − 2σ ≤ x ≤ μ + 2σ) ≈ 95%
    P(μ − 3σ ≤ x ≤ μ + 3σ) ≈ 99.7%

• An observation can be considered an outlier if it falls outside

    [μ − 3σ, μ + 3σ]
Standard deviation method: example
• Let x be a feature with 1000 observations from a Gaussian distribution.
• LB: μ − 3σ
• UB: μ + 3σ
• Observations below LB or above UB are considered outliers.
Standard deviation method using numpy
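A minimal sketch of the standard deviation method with numpy; the distribution parameters (mean 50, std 5) and the random seed are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
# Assumed: 1000 observations drawn from a Gaussian with mean 50, std 5
x = rng.normal(loc=50, scale=5, size=1000)

mu, sigma = x.mean(), x.std()
lb = mu - 3 * sigma   # lower bound
ub = mu + 3 * sigma   # upper bound

# Observations below the lower bound or above the upper bound
outliers = x[(x < lb) | (x > ub)]
print(lb, ub, len(outliers))
```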
Detecting outliers: Interquartile range method
• The Interquartile Range (IQR) method can be applied to a feature from
any distribution.
• IQR is the difference between the third quartile (Q3) and the first
quartile (Q1):

    IQR = Q3 − Q1

• An observation can be considered an outlier if it falls outside

    [Q1 − 1.5·IQR, Q3 + 1.5·IQR]
Interquartile range method: example
• Let x be a feature.
• LB: Q1 − 1.5·IQR
• UB: Q3 + 1.5·IQR
• Outliers: 1.27 and 8.58
Interquartile range method using numpy
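A minimal sketch of the IQR method with numpy; the sample values are assumed and contain two obvious outliers:

```python
import numpy as np

# Assumed sample containing two obvious outliers (1.0 and 12.0)
x = np.array([4.2, 4.8, 5.1, 5.3, 5.6, 5.9, 6.1, 6.4, 1.0, 12.0])

q1, q3 = np.percentile(x, [25, 75])  # first and third quartiles
iqr = q3 - q1
lb = q1 - 1.5 * iqr   # lower bound
ub = q3 + 1.5 * iqr   # upper bound

# Observations outside [lb, ub] are flagged as outliers
outliers = x[(x < lb) | (x > ub)]
print(outliers)
```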
Handling missing values
• Sometimes the dataset used to build a machine learning model contains
missing values in the input features or the desired output.
• Most machine learning algorithms cannot handle missing values.
• Handling missing values needs to be performed during pre-processing.
• Types of missing values:
– Missing Completely At Random (MCAR): the probability that a value is missing
is independent of everything.
– Missing At Random (MAR): the probability that a value is missing is not
completely random but depends on the information captured in other features.
– Missing Not At Random (MNAR): the probability that a value is missing is not
random and depends on information not captured in our features.
Handling missing values
• Some methods to handle missing values:
– Deleting observations with missing values
– Imputing missing values:
• Using interpolation
• Using a machine learning model
• Using the mean, median, or mode
Deleting observations with missing values
• If an observation contains a missing value of type MCAR or MAR, it
is sometimes acceptable to delete the observation.

Deleting observations with missing values using
pandas’s DataFrame.dropna
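A minimal sketch of deleting observations with missing values using pandas's dropna; the dataset is assumed for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values (np.nan) in both features
df = pd.DataFrame({"f1": [1.0, 2.0, np.nan, 4.0],
                   "f2": [10.0, np.nan, 30.0, 40.0]})

# Drop every observation (row) that contains at least one missing value
df_clean = df.dropna()
print(df_clean)
```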
Imputing missing values using interpolation
• Interpolation is a method to estimate unknown data points within the
range of a discrete set of known data points.
• Linear interpolation is the simplest interpolation method; it uses the
average of the two known data points adjacent to the unknown data point.

Imputing missing values using interpolation with
pandas’s DataFrame.interpolate
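A minimal sketch of linear interpolation with pandas's interpolate; the feature values are assumed for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical feature with one missing value
s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])

# Linear interpolation: the missing value becomes the average of its
# two neighbours, (2.0 + 4.0) / 2 = 3.0
s_filled = s.interpolate(method="linear")
print(s_filled)
```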
Imputing missing values using mean, median, or
mode
• The mean, median, or mode of a feature can be used to replace that
feature’s missing values.
Imputing missing values using mean, median
or mode with pandas’s DataFrame.fillna
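A minimal sketch of mean imputation with pandas's fillna; the dataset is assumed for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with a missing value in column f2
df = pd.DataFrame({"f1": [1.0, 2.0, 3.0, 4.0],
                   "f2": [10.0, np.nan, 30.0, 20.0]})

# Replace each missing value with the mean of its column;
# df.median() or df.mode().iloc[0] can be passed instead of df.mean()
df_filled = df.fillna(df.mean())
print(df_filled)
```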
