Lecture 05: Feature Engineering: Ms. Mehroz Sadiq
1. Incompatible with most of the Python libraries used in Machine Learning: Yes, you
read that right. Most ML libraries (the most common being scikit-learn) have no
provision for handling missing data automatically, and feeding it to them can lead
to errors (see the sketch after this list).
2. Distortion in the dataset: A large amount of missing data can distort the variable
distribution, i.e., it can inflate or deflate the share of a particular category in
the dataset.
3. Affects the final model: Missing data can introduce bias into the dataset and can
lead the model to a faulty analysis.
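A minimal sketch of the first point, assuming a toy array with one NaN; most scikit-learn estimators validate their input and reject missing values with a ValueError:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1.0], [2.0], [np.nan]])  # one missing value
    y = np.array([1.0, 2.0, 3.0])

    try:
        LinearRegression().fit(X, y)  # input validation rejects the NaN
    except ValueError as err:
        print("scikit-learn refused the data:", err)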
Types of Imputation
• Numerical imputation: Imputation is preferable to dropping because it
preserves the data size. However, the choice of what to impute in place
of the missing values is important; a good starting point is a sensible
default value for the column
• Categorical imputation: Replacing the missing values with the most
frequently occurring value in a column is a good option for handling
categorical columns
• Random sample imputation: This consists of taking random
observations from the dataset and using them to replace
the NaN values (see the sketch after this list)
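A minimal sketch of the three imputation types on a hypothetical DataFrame with a numerical column age and a categorical column city (all names illustrative):

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        "age": [25, np.nan, 40, 31, np.nan],
        "city": ["Lahore", "Karachi", None, "Lahore", "Karachi"],
    })

    # Numerical imputation: a default value (0) or the column median
    df["age_default"] = df["age"].fillna(0)
    df["age_median"] = df["age"].fillna(df["age"].median())

    # Categorical imputation: the most frequently occurring value (the mode)
    df["city_mode"] = df["city"].fillna(df["city"].mode()[0])

    # Random sample imputation: draw observed values at random for the NaNs
    df["age_sampled"] = df["age"].copy()
    missing = df["age_sampled"].isna()
    sample = df["age"].dropna().sample(missing.sum(), replace=True, random_state=0)
    sample.index = df.index[missing]
    df.loc[missing, "age_sampled"] = sample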
2. Handling Outliers
• Outlier in terms of Standard Deviation
If a value's distance from the average is higher than x * the standard
deviation, it can be assumed to be an outlier
• Outlier in terms of Percentiles
A common mistake is to take percentiles according to the range of
the data. In other words, if your data ranges from 0 to 100, your
top 5% is not the values between 96 and 100. The top 5% here means
the values that fall beyond the 95th percentile of the data (see the
sketch after this list)
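A minimal sketch of both rules on a hypothetical Series; the factor x = 3 and the 5%/95% cut-offs are illustrative choices, not fixed rules:

    import pandas as pd
    import numpy as np

    s = pd.Series(np.append(np.random.normal(50, 5, 100), [120, -30]))

    # Standard-deviation rule: flag values farther than x * std from the mean
    x = 3
    std_outliers = s[(s - s.mean()).abs() > x * s.std()]

    # Percentile rule: flag values beyond the 5th/95th percentile,
    # not values near the ends of the raw data range
    low, high = s.quantile(0.05), s.quantile(0.95)
    pct_outliers = s[(s < low) | (s > high)]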
3. Binning
• Binning can be applied to both categorical and numerical data.
• The main motivation of binning is to make the model more robust
and prevent overfitting. However, it comes at a cost to performance:
every time you bin something, you sacrifice information and make
your data more regularized (see the sketch after this list)
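A minimal sketch of numerical and categorical binning; the bin edges, labels, and the country-to-continent mapping are illustrative assumptions:

    import pandas as pd

    df = pd.DataFrame({
        "value": [2, 45, 7, 85, 28],
        "country": ["Spain", "Chile", "Australia", "Italy", "Brazil"],
    })

    # Numerical binning: cut a continuous column into labelled ranges
    df["value_bin"] = pd.cut(df["value"], bins=[0, 30, 70, 100],
                             labels=["Low", "Mid", "High"])

    # Categorical binning: collapse related categories into broader groups
    continent = {"Spain": "Europe", "Italy": "Europe",
                 "Chile": "South America", "Brazil": "South America"}
    df["continent"] = df["country"].map(continent).fillna("Other")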
4. Log Transform
• Logarithm transformation (or log transform) is one of the most
commonly used mathematical transformations in feature engineering.
Here are the benefits of using log transform:
• It helps to handle skewed data; after transformation, the
distribution becomes closer to normal
• It also decreases the effect of outliers by normalizing magnitude
differences, which makes the model more robust
• The data you apply a log transform to must have only positive values;
otherwise you receive an error (see the sketch after this list)
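A minimal sketch, assuming a hypothetical column value that contains negatives; shifting the data so the smallest value maps to 1 is a common trick to guarantee a positive argument:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({"value": [2, 45, -23, 85, 28, 2, 35, -12]})

    # A naive log fails on non-positive values: log(0) is -inf and
    # log(negative) is NaN
    # df["log"] = np.log(df["value"])

    # Shift so the smallest value maps to 1, then take the log
    df["log_shifted"] = np.log(df["value"] - df["value"].min() + 1)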
5. One-Hot Encoding
• One-hot encoding is one of the most common encoding methods in
machine learning. This method spreads the values in a column into
multiple flag columns and assigns 0 or 1 to them. These binary values
express the relationship between the grouped and encoded columns.
• This method changes your categorical data, which is challenging for
algorithms to understand, into a numerical format, and enables you to
group your categorical data without losing any information (see the
sketch after this list)
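A minimal sketch using pandas' built-in get_dummies, with a hypothetical city column:

    import pandas as pd

    df = pd.DataFrame({"city": ["Lahore", "Karachi", "Lahore", "Islamabad"]})

    # Spread the column into one flag column per category, valued 0/1
    encoded = pd.get_dummies(df["city"], prefix="city", dtype=int)
    df = df.join(encoded)
    # df now has city_Islamabad, city_Karachi, city_Lahore flag columns;
    # e.g. the first row ("Lahore") is encoded as 0, 0, 1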
6. Grouping Operations
Imputation for missing values
• Datasets contain missing values, often encoded as blanks, NaNs, or other
placeholders
• Ignoring rows and/or columns with missing values is possible, but at the price of
losing data which might be valuable.
• A better strategy is to infer them from the known part of the data.
• Strategies (see the sketch after this list)
• Mean: Basic approach
• Median: More robust to outliers
• Mode: Most frequent value
• Using a model: Can expose algorithmic bias
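A minimal sketch of the first three strategies with scikit-learn's SimpleImputer (model-based imputation, e.g. scikit-learn's experimental IterativeImputer, is left out here):

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[1.0], [2.0], [np.nan], [4.0]])

    for strategy in ("mean", "median", "most_frequent"):
        imputer = SimpleImputer(strategy=strategy)
        print(strategy, imputer.fit_transform(X).ravel())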
Types of Outliers
• Contextual outlier (or conditional outlier)
• An object is a contextual outlier if it deviates significantly with respect to a selected context
• Ex.: 80°F in Urbana: an outlier? (depends on whether it is summer or winter)
• Attributes of data objects should be divided into two groups
• Contextual attributes: defines the context, e.g., time & location
• Behavioral attributes: characteristics of the object, used in outlier evaluation,
e.g., temperature
• Can be viewed as a generalization of local outliers (objects whose density significantly deviates
from that of their local area)
• Issue: How to define or formulate meaningful context?
Types of Outliers
• Collective Outliers
• A subset of data objects collectively deviates significantly
from the whole data set, even if the individual data
objects may not be outliers
• Applications: e.g., intrusion detection:
• When a number of computers keep sending
denial-of-service packets to each other
Detection of collective outliers:
• Consider not only the behavior of individual objects, but also that of
groups of objects
• Requires background knowledge of the relationship among data objects,
such as a distance or similarity measure on objects
• A data set may have multiple types of outliers
• One object may belong to more than one type of outlier
Reasons for Outliers
• Variability in the data
• Experimental measurement error
How to Visually Identify an Outlier?
• Using box plots
• Using scatter plots
• Using the Z-score (see the sketch after this list)
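A minimal plotting sketch, assuming matplotlib is available; box plots mark points beyond the whiskers as fliers, and scatter plots let isolated points stand out by eye:

    import numpy as np
    import matplotlib.pyplot as plt

    data = np.append(np.random.normal(50, 5, 200), [95, 100, -10])

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.boxplot(data)          # points beyond the whiskers are drawn as fliers
    ax1.set_title("Box plot")
    ax2.scatter(range(len(data)), data, s=8)  # isolated points stand out
    ax2.set_title("Scatter plot")
    plt.show()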
Handling Outliers
• Trimming: Simply removing the outliers from our dataset.
• Imputing: We treat outliers as missing data, and we apply missing
data imputation techniques.
• Discretization: We place outliers in edge bins with higher or lower
values of the distribution.
• Censoring: Capping the variable distribution at maximum and
minimum values (see the sketch after this list).
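A minimal sketch of censoring with pandas' clip, using the 1st and 99th percentiles as illustrative caps, contrasted with trimming:

    import pandas as pd
    import numpy as np

    s = pd.Series(np.append(np.random.normal(50, 5, 200), [120, -40]))

    # Censoring: cap everything outside the chosen minimum/maximum values
    lower, upper = s.quantile(0.01), s.quantile(0.99)
    capped = s.clip(lower=lower, upper=upper)

    # Trimming, by contrast, drops the outliers entirely
    trimmed = s[(s >= lower) & (s <= upper)]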
Outlier Treatment
• Interquartile Range (IQR) method
• A data point is treated as an outlier if it falls more than 1.5 times the
interquartile range above the third quartile (Q3) or below the first
quartile (Q1) (see the sketch below)
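A minimal sketch of the IQR fences on a hypothetical Series:

    import pandas as pd
    import numpy as np

    s = pd.Series(np.append(np.random.normal(50, 5, 200), [120, -40]))

    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1

    # Fences at 1.5 * IQR below Q1 and above Q3
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = s[(s < lower_fence) | (s > upper_fence)]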
Outlier Treatment
• Z-score method
A Z-score is a numerical measurement that describes a value's relationship to the
mean of a group of values. Z-score is measured in terms of standard deviations
from the mean. If a Z-score is 0, it indicates that the data point's score is identical
to the mean score. A Z-score of 1.0 would indicate a value that is one standard
deviation from the mean. Z-scores may be positive or negative, with a positive
value indicating the score is above the mean and a negative score indicating it is
below the mean.
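A minimal sketch using scipy's zscore; the cut-off of 3 standard deviations is a common illustrative choice:

    import numpy as np
    from scipy import stats

    data = np.append(np.random.normal(50, 5, 200), [120, -40])

    z = stats.zscore(data)          # (x - mean) / std for every point
    outliers = data[np.abs(z) > 3]  # flag points more than 3 std from the mean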