
Lecture 05: Feature Engineering

Ms. Mehroz Sadiq

11/10/2020, Bahria University Islamabad


Learning
Objectives
• Data Munging
• Feature Engineering
• Feature Selection



Definition
• “Feature engineering is the process of transforming raw data into
features that better represent the underlying problem to the
predictive models, resulting in improved model accuracy on unseen
data." – Jason Brownlee
• “Coming up with features is difficult, time-consuming, requires expert
knowledge. 'Applied machine learning' is basically feature
engineering.” – Andrew Ng
Some Feature Engineering
Techniques
• Imputation
• Handling Outliers
• Binning
• Log Transform
• One-Hot Encoding
• Grouping Operations
• Scaling
1. Imputation
Imputation is a technique for replacing missing data with substitute values so that most of the data/information in the dataset is retained. It is used because removing rows with missing data every time is not feasible: it can shrink the dataset to a large extent, which not only raises concerns about biasing the dataset but also leads to incorrect analysis.
Why imputation is important

1. Incompatibility with most Python libraries used in machine learning: the common libraries (the most widely used being scikit-learn) do not automatically handle missing data, so unhandled missing values lead to errors.
2. Distortion in the dataset: a large amount of missing data can distort the variable distribution, i.e. it can increase or decrease the frequency of a particular category in the dataset.
3. Effect on the final model: missing data can introduce bias into the dataset and lead to a faulty analysis by the model.
Types of Imputation
• Numerical imputation: imputation is usually preferable to dropping rows because it preserves the data size. However, the choice of what to impute for the missing values matters; a sensible default value for the column is a good starting point.
• Categorical imputation: replacing the missing values with the most frequently occurring value in the column is a good option for handling categorical columns.
• Random sample imputation: take random observations from the dataset and use them to replace the NaN values.
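
A minimal sketch of these three strategies with pandas and scikit-learn's SimpleImputer; the column names and values below are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 30, 28],
                   "city": ["Lahore", np.nan, "Karachi", "Lahore"]})

# numerical imputation: fill with the mean (the median is more robust to outliers)
df["age_mean"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()

# categorical imputation: fill with the most frequent value in the column
df["city_mode"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()

# random sample imputation: replace NaNs with values drawn at random from the observed data
missing = df["age"].isna()
df.loc[missing, "age"] = df["age"].dropna().sample(missing.sum(), replace=True,
                                                   random_state=0).to_numpy()
```
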
2. Handling
Outliers
• Outliers in terms of standard deviation
If a value's distance from the mean is greater than x times the standard deviation, it can be treated as an outlier (values of x between 2 and 4 are common).
• Outliers in terms of percentiles
Percentiles are taken according to the range of the data. In other words, if your data ranges from 0 to 100, the top 5% is not the values between 96 and 100; it is the values that lie beyond the 95th percentile of the data.
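
A small sketch of both rules with NumPy, reusing the age example that appears later in the deck; the choice of x = 2 is an assumption:

```python
import numpy as np

ages = np.array([21, 23, 24, 25, 26, 28, 30, 45])

# standard-deviation rule: flag points farther than x standard deviations from the mean
x = 2  # assumed factor; values between 2 and 4 are common
std_outliers = ages[np.abs(ages - ages.mean()) > x * ages.std()]

# percentile rule: flag points beyond the 95th percentile
pct_outliers = ages[ages > np.percentile(ages, 95)]
```
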
3. Binning
• Binning can be applied to both categorical and numerical data.
• The main motivation for binning is to make the model more robust and prevent overfitting. However, it comes at a cost to performance: every time you bin something, you sacrifice information and make your data more regularized.
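
As an illustration, a fixed-width binning sketch with pandas; the bin edges and labels are arbitrary choices:

```python
import pandas as pd

ages = pd.Series([21, 23, 24, 25, 26, 28, 30, 45])
# fixed-width bins: every value is replaced by a coarser bin ID
age_bins = pd.cut(ages, bins=[20, 30, 40, 50], labels=["20-29", "30-39", "40-49"])
```
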
4. Log
Transform
• Logarithm transformation (or log transform) is one of the most
commonly used mathematical transformations in feature engineering.
Here are the benefits of using log transform:
• It helps to handle skewed data: after the transformation, the distribution becomes closer to normal.
• It also decreases the effect of outliers by normalizing magnitude differences, so the model becomes more robust.
• The data you apply a log transform to must contain only positive values; otherwise you will receive an error.
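
A minimal sketch with NumPy on an illustrative skewed column; log1p computes log(1 + x), which tolerates zeros, but the values should still be non-negative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [20_000, 35_000, 50_000, 1_200_000]})  # skewed, illustrative
# log1p = log(1 + x): tolerates zeros, but the column should still be non-negative
df["income_log"] = np.log1p(df["income"])
```
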
5. One-Hot
Encoding
• One-hot encoding is one of the most common encoding methods in machine learning. This method spreads the values in a column across multiple flag columns and assigns 0 or 1 to them. These binary values express the relationship between the grouped and encoded columns.
• This method changes your categorical data, which is difficult for algorithms to understand, into a numerical format, and enables you to group your categorical data without losing any information.
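
A minimal one-hot encoding sketch with scikit-learn; the city column is illustrative, and get_feature_names_out assumes scikit-learn 1.0 or newer:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"city": ["Islamabad", "Lahore", "Karachi", "Lahore"]})
encoder = OneHotEncoder(handle_unknown="ignore")        # unseen categories become all-zero rows
flags = encoder.fit_transform(df[["city"]]).toarray()   # one 0/1 flag column per category
encoded = pd.DataFrame(flags, columns=encoder.get_feature_names_out(["city"]))
```
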
6. Grouping
Operations
Imputation for missing
values
• Datasets contain missing values, often encoded as blanks, NaNs or other
placeholders
• Ignoring rows and/or columns with missing values is possible, but at the price of losing data which might be valuable.
• A better strategy is to infer them from the known part of the data.
• Strategies
• Mean: Basic approach
• Median: More robust to outliers
• Mode: Most frequent value
• Using a model: Can expose algorithmic bias



Handling
Outliers
What is an
Outlier?
• An outlier is a data point in a data set that is distant from all other observations.
Extreme Value vs
Outlier
• An extreme value is just a minimum or a maximum; it need not be very different from the rest of the data. A point that is far away from the other points is called an outlier.
• Example: -Age of employees
Age = 21, 23, 24, 25, 26, 28, 30, 45
Where:
Extreme value = 30
Outlier = 45
Types of Outliers
• Point or global outliers: observations anomalous with respect to the majority of observations in a feature. In short, a data point is considered a global outlier if its value is far outside the entirety of the data set in which it is found.
• Contextual (conditional) outliers: observations considered anomalous in a specific context. A data point is considered a contextual outlier if its value deviates significantly from the rest of the data points in the same context.
• Collective outliers: a collection of observations that is anomalous as a group; the points appear close to one another because they share a similar anomalous value.
Types of Outliers
• Global outlier (or point anomaly)
• An object is a global outlier (Og) if it significantly deviates from the rest of the data set
• Ex. Intrusion detection in computer networks
• Issue: Find an appropriate measurement of deviation

Types of
Outliers
• Contextual outlier (or conditional outlier)
• Object is Oc if it deviates significantly based on a selected context
• Ex. 80°F in Urbana: outlier? (depends on whether it is summer or winter)
• Attributes of data objects should be divided into two groups
• Contextual attributes: defines the context, e.g., time & location
• Behavioral attributes: characteristics of the object, used in outlier evaluation,
e.g., temperature
• Can be viewed as a generalization of local outliers—whose density significantly deviates
from its local area
• Issue: How to define or formulate meaningful context?
Types of
Outliers
• Collective Outliers
• A subset of data objects collectively deviate significantly
from the whole data set, even if the individual data
objects may not be outliers
• Applications: e.g., intrusion detection
• When a number of computers keep sending denial-of-service packets to each other
• Detection of collective outliers
• Consider not only the behavior of individual objects, but also that of groups of objects
• Requires background knowledge of the relationships among data objects, such as a distance or similarity measure on objects
• A data set may have multiple types of outliers
• One object may belong to more than one type of outlier
Reasons for Outliers
• Variability in the data
• Experimental measurement error
How to Visually Identify an
outlier?
• Using Box plots
• Using Scatter plot
• Using Z score
Handling
Outliers
• Trimming: Simply removing the outliers from our dataset.
• Imputing: We treat outliers as missing data, and we apply missing
data imputation techniques.
• Discretization: We place outliers in edge bins with higher or lower
values of the distribution.
• Censoring: Capping the variable distribution at the maximum and
minimum values.
Outlier
Treatment
• Interquartile Range (IQR) method
• A data point is treated as an outlier if it falls more than 1.5 times the interquartile range above the third quartile (Q3) or below the first quartile (Q1)
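
A small IQR sketch with NumPy, using the deck's age example:

```python
import numpy as np

ages = np.array([21, 23, 24, 25, 26, 28, 30, 45])
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]   # points outside the 1.5 * IQR fences
```
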
Outlier
Treatment
• Z Score method
A Z-score is a numerical measurement that describes a value's relationship to the mean of a group of values. The Z-score is measured in terms of standard deviations from the mean. A Z-score of 0 indicates that the data point is identical to the mean; a Z-score of 1.0 indicates a value one standard deviation from the mean. Z-scores may be positive or negative: a positive value indicates the score is above the mean and a negative value indicates it is below the mean.

• A Z-score of 1 is 1 standard deviation above the mean.
• A Z-score of 2 is 2 standard deviations above the mean.
• A Z-score of -1.8 is 1.8 standard deviations below the mean.
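
A small Z-score sketch with NumPy; the cut-off of 2 is an assumption (2, 2.5 or 3 are typical choices):

```python
import numpy as np

ages = np.array([21, 23, 24, 25, 26, 28, 30, 45])
z = (ages - ages.mean()) / ages.std()   # standard deviations from the mean
outliers = ages[np.abs(z) > 2]          # assumed cut-off; 2, 2.5 or 3 are common
```
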
Standard Deviation / Variance
• The standard deviation is also a measure of the spread of your
observations, a statement of how much your data deviates from a
typical data point.
• The standard deviation is the positive square root of the variance.
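
A two-line check of this relationship with NumPy, on the deck's age example:

```python
import numpy as np

x = np.array([21, 23, 24, 25, 26, 28, 30, 45])
variance = np.mean((x - x.mean()) ** 2)    # average squared deviation from the mean
std_dev = np.sqrt(variance)                # positive square root of the variance
assert np.isclose(std_dev, np.std(x))
```
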
Rounding
• Form of lossy compression: retain most significant features of the data.
• Sometimes too much precision is just noise
• Rounded variables can be treated as categorical variables
• Example: some models, like association rules, work only with categorical features. It is possible to convert a percentage into a categorical feature this way.
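
A small pandas sketch of rounding a percentage and treating the result as a category; the column name and values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"completion_pct": [12.7, 48.2, 49.9, 91.3]})
# round to the nearest 10 percent, then treat the rounded value as a categorical feature
df["completion_bucket"] = df["completion_pct"].round(-1).astype(int).astype("category")
```
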



Binarization
• Transform discrete or continuous numeric features into binary features. Example: number of user views of the same document.
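
A minimal sketch with scikit-learn's Binarizer for the document-views example:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

views = np.array([[0], [3], [0], [12]])                 # views of the same document per user
viewed = Binarizer(threshold=0).fit_transform(views)    # 1 if views > 0, else 0
```
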



Binning
• Split numerical values into bins and encode with a bin ID.
• Bins can be set arbitrarily or based on the distribution.
• Fixed-width binning



Binning
• Adaptive or quantile binning
• Divides the data into equal-sized portions (e.g. by median, quartiles, deciles)
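
A quantile-binning sketch with pandas (quartiles here; deciles would use q=10):

```python
import pandas as pd

ages = pd.Series([21, 23, 24, 25, 26, 28, 30, 45])
# each bin receives roughly the same number of observations
age_quartile = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])
```
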



Feature Transformation
and Scaling
MinMaxScaler
• Transforms the scale so that all values in the features range from 0 to 1.
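
A minimal sketch; the single feature is illustrative:

```python
from sklearn.preprocessing import MinMaxScaler

X = [[20.0], [35.0], [50.0], [100.0]]
X_scaled = MinMaxScaler().fit_transform(X)   # every value now lies in [0, 1]
```
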
Standard
Scaler
• For each feature, the Standard Scaler scales the values so that the mean is 0 and the standard deviation (and hence the variance) is 1.
• The Standard Scaler assumes that the variable is normally distributed. If the variables are not normally distributed, we
• either choose a different scaler,
• or first convert the variables to a normal distribution and then apply this scaler.
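
A minimal sketch:

```python
from sklearn.preprocessing import StandardScaler

X = [[20.0], [35.0], [50.0], [100.0]]
X_std = StandardScaler().fit_transform(X)    # each column now has mean 0 and standard deviation 1
```
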
MaxAbsScaler
• MaxAbs scaler takes the absolute maximum value of each column and
divides each value in the column by the maximum value.
• This operation scales the data between the range [-1, 1].
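
A minimal sketch with one column containing a negative value:

```python
from sklearn.preprocessing import MaxAbsScaler

X = [[-40.0], [10.0], [25.0], [50.0]]
X_scaled = MaxAbsScaler().fit_transform(X)   # divided by max |value| (50), so the result is in [-1, 1]
```
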
Robust
Scaler
• Robust Scaler is not sensitive to outliers.
• This scaler:
• removes the median from the data
• scales the data by the interquartile range (IQR)
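
A minimal sketch; 45 plays the role of the outlier:

```python
from sklearn.preprocessing import RobustScaler

X = [[21.0], [23.0], [25.0], [28.0], [45.0]]
X_scaled = RobustScaler().fit_transform(X)   # subtracts the median, divides by the IQR
```
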
Quantile Transformer
Scaler
• Quantile Transformer Scaler converts the variable distribution to a
normal distribution and scales it accordingly.
• Since it makes the variable normally distributed, it also deals with the
outliers
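
A minimal sketch on synthetic skewed data; n_quantiles is reduced to match the small sample size:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

X = np.random.exponential(size=(100, 1))                        # skewed input
qt = QuantileTransformer(output_distribution="normal", n_quantiles=100)
X_normal = qt.fit_transform(X)                                  # approximately Gaussian output
```
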
Log
Transform
• Used to convert a skewed distribution into a normal or less-skewed distribution.
• In this transform, we take the log of the values in a column and use those values in place of the original column.
Power Transformer
Scaler
• The Power Transformer also changes the distribution of the variable, making it more Gaussian (normal).
• Other similar power transforms include the square root, cube root, and log transforms.
• The Power Transformer automates the decision making by introducing a parameter called lambda.
• It decides on a generalized power transform by finding the best value of lambda using either the:
• Box-Cox transformation
• Yeo-Johnson power transformation
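
A minimal sketch; Yeo-Johnson is used because it also accepts zero and negative values, while Box-Cox requires strictly positive data:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

X = np.random.exponential(size=(100, 1))          # skewed, strictly positive synthetic data
pt = PowerTransformer(method="yeo-johnson")       # or method="box-cox" for positive-only data
X_gauss = pt.fit_transform(X)
print(pt.lambdas_)                                # the fitted lambda for each feature
```
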
Unit Vector
Scaler/Normalizer
• The Normalizer converts the values to lie between 0 and 1 (or between -1 and 1 when there are negative values in the data).
• If we use the L1 norm, the values in each row are scaled so that the sum of their absolute values equals 1.
• If we use the L2 norm, the values in each row are scaled so that the sum of their squares equals 1.
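
A minimal sketch showing both norms on a single row:

```python
from sklearn.preprocessing import Normalizer

X = [[4.0, 1.0, 2.0, 2.0]]
X_l1 = Normalizer(norm="l1").fit_transform(X)   # absolute values in the row sum to 1
X_l2 = Normalizer(norm="l2").fit_transform(X)   # squared values in the row sum to 1
```
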
Feature
Encoding
Feature
Encoding
• Turn categorical features into numeric features to provide more fine-
grained information
• Help explicitly capture non-linear relationships and interactions
between the values of features
• Most machine learning tools only accept numbers as their input
Categorical Data
Encoding
Ordinal Data: The categories have an inherent order
• In Ordinal data, while encoding, one should retain the information
regarding the order in which the category is provided.
• For example, the highest degree a person possesses gives vital information about their qualification. The degree is an important feature in deciding whether a person is suitable for a post.
Nominal Data: The categories do not have an inherent order
• While encoding Nominal data, we have to consider the presence or
absence of a feature. In such a case, no notion of order is present.
• For example, the city a person lives in. For the data, it is important to retain where a person lives, but here there is no order or sequence among the categories.
Label Encoding or Ordinal Encoding or Integer
Encoding
• We use this categorical data encoding technique when the categorical
feature is ordinal.
• In this case, retaining the order is important. Hence encoding should
reflect the sequence.
• In Label encoding, each label is converted into an integer value.
• Example: Encoding Education Level
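
A minimal sketch with scikit-learn's OrdinalEncoder; the education levels and their assumed ordering are for illustration only:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"education": ["Bachelors", "Masters", "PhD", "Bachelors"]})
order = [["Matric", "Intermediate", "Bachelors", "Masters", "PhD"]]   # assumed ordering
df["education_code"] = OrdinalEncoder(categories=order).fit_transform(df[["education"]]).ravel()
```
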
Label/Integer/ Ordinal
Encoding
• Advantages of integer (label) encoding
• Straightforward to implement.
• Does not expand the feature space.
• Can work well enough with tree-based algorithms.
• Allows agile benchmarking of machine learning models.
• Limitations of integer (label) encoding
• Does not add extra information while encoding.
• Not suitable for linear models.
• Does not handle new categories in the test set
automatically.
• Creates an order relationship between the categories.
One Hot Encoding
• One-hot encoding is one of the most common encoding methods in machine learning.
• This method spreads the values in a column to multiple flag columns and assigns 0 or 1 to them.
• The newly created binary features are known as dummy variables.
• The number of dummy variables depends on the levels present in the categorical variable.
• The binary values express the relationship between the grouped and encoded columns.
Dummy
Encoding
• Dummy coding scheme is similar to one-hot encoding.
• This categorical data encoding method transforms the categorical
variable into a set of binary variables (also known as dummy
variables).
• In the case of one-hot encoding, for N categories in a variable, it uses
N binary variables.
• The dummy encoding is a small improvement over one-hot-encoding.
Dummy encoding uses N-1 features to represent N labels/categories.
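
A small pandas sketch contrasting the two; the city column is illustrative:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Islamabad", "Lahore", "Karachi", "Lahore"]})
one_hot = pd.get_dummies(df["city"])                   # N binary columns
dummy = pd.get_dummies(df["city"], drop_first=True)    # N-1 binary columns (dummy encoding)
```
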
Issues with OHE and
DE
• A large number of levels may be present in the data. If a feature variable has many categories, we need a similar number of dummy variables to encode it. For example, a column with 30 different values will require 30 new variables for coding.
• If we have multiple categorical features in the dataset, a similar situation occurs, and we again end up with many binary features, each representing a categorical feature and its multiple categories, e.g. a dataset having 10 or more categorical columns.
Effect
Encoding
• This encoding technique is also known as Deviation Encoding or Sum
Encoding.
• Effect encoding is very similar to dummy encoding, with a small difference.
• In dummy coding, we use 0 and 1 to represent the data, but in effect encoding we use three values, i.e. 1, 0, and -1.
• The row containing only 0s in dummy encoding is encoded as -1 in effect encoding.
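
A hand-rolled pandas sketch of effect (sum/deviation) coding built on top of dummy encoding; dedicated encoders exist in third-party libraries, but this shows the idea:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Islamabad", "Lahore", "Karachi", "Islamabad"]})
effect = pd.get_dummies(df["city"], drop_first=True).astype(int)   # start from dummy encoding
# the dropped (reference) category shows up as an all-zero row; recode it as -1
effect.loc[effect.sum(axis=1) == 0] = -1
```
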
Hash
Encoding
• The hash encoder represents categorical features using a new, fixed number of dimensions.
• The user can fix the number of dimensions after transformation using the n_components argument.
• A feature with 5 categories can be represented using N new features; similarly, a feature with 100 categories can also be transformed using N new features.
• It may lead to information loss or collisions.
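
A sketch with scikit-learn's FeatureHasher; the n_components argument mentioned above belongs to the category_encoders HashingEncoder, while FeatureHasher calls the same idea n_features. The city values are illustrative:

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({"city": ["Islamabad", "Lahore", "Karachi", "Quetta", "Lahore"]})
hasher = FeatureHasher(n_features=8, input_type="string")       # fixed number of output dimensions
hashed = hasher.transform([[c] for c in df["city"]]).toarray()  # collisions are possible
```
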
Binary
Encoding
• Binary encoding is a combination of Hash encoding and one-hot
encoding.
• In this encoding scheme, the categorical feature is first converted into a numerical one using an ordinal encoder.
• Then the numbers are converted to binary, and the binary digits are split into separate columns.
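
A hand-rolled pandas sketch of the two steps (ordinal encode, then split the binary digits into columns); dedicated BinaryEncoder implementations exist in third-party libraries:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Islamabad", "Lahore", "Karachi", "Quetta", "Lahore"]})
codes = df["city"].astype("category").cat.codes          # step 1: ordinal encoding
n_bits = int(codes.max()).bit_length()                   # digits needed for the largest code
for i in range(n_bits):                                  # step 2: one column per binary digit
    df[f"city_bin_{i}"] = (codes >> i) & 1
```
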
Base N
Encoding
• When there are many categories and binary encoding cannot handle the dimensionality, we can use a larger base such as 4 or 8.
Scaling
• Models that are smooth functions of input features are sensitive to the scale of
the input (e.g. Linear Regression)
• Scale numerical variables into a certain range, dividing values by a normalization
constant (no changes in single-feature distribution)
• Popular techniques
• Min-Max Scaling
• Standard (Z) Scaling



Min-max
scaling
• Squeezes (or stretches) all values into the range [0, 1], adding robustness to very small standard deviations and preserving zeros for sparse data.



Standard (Z)
Scaling
• After Standardization, a feature has mean of 0 and variance of 1 (assumption of
many learning algorithms)



Normalization
• Scales individual samples (rows) to unit norm, dividing values by the vector's L2 norm, a.k.a. the Euclidean norm
• Useful for quadratic form (like dot-product) or any other kernel to quantify
similarity of pairs of samples. This assumption is the base of the Vector Space
Model often used in text classification and clustering contexts





Temporal
Features
• Time Zone conversion:
• Factors to consider:
• Multiple time zones in some countries
• Daylight Saving Time (DST)
• Start and end DST dates
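
A minimal pandas sketch of a time-zone conversion; the source and target zones are assumptions:

```python
import pandas as pd

timestamps = pd.to_datetime(pd.Series(["2020-11-10 08:00", "2020-11-10 20:30"]))
# DST rules are handled by the time-zone database behind tz_convert
local = timestamps.dt.tz_localize("UTC").dt.tz_convert("Asia/Karachi")
```
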



Time
binning
• Apply binning to time data to make it categorical and more general.
• For example, bin a timestamp into hours or periods of the day, as in the sketch below.
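
A minimal sketch binning the hour of day into periods; the bin edges and labels are arbitrary choices:

```python
import pandas as pd

ts = pd.to_datetime(pd.Series(["2020-11-10 07:15", "2020-11-10 13:40", "2020-11-10 22:05"]))
period = pd.cut(ts.dt.hour, bins=[0, 6, 12, 18, 24], right=False,
                labels=["night", "morning", "afternoon", "evening"])
```
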



Trendlines
• Instead of encoding: total spend, encode things like: Spend in last week, spend in
last month, spend in last year.
• Gives a trend to the algorithm: two customers with equal spend, can have wildly
different behavior — one customer may be starting to spend more, while the
other is starting to decline spending.
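
A small pandas sketch computing spend over several trailing windows from a hypothetical transaction log; all names, dates, and amounts are illustrative:

```python
import pandas as pd

tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "date": pd.to_datetime(["2020-10-01", "2020-11-05", "2019-12-01", "2020-11-09"]),
    "amount": [50.0, 20.0, 80.0, 10.0],
})
ref = pd.Timestamp("2020-11-10")

features = {}
for name, days in [("spend_last_week", 7), ("spend_last_month", 30), ("spend_last_year", 365)]:
    recent = tx[tx["date"] >= ref - pd.Timedelta(days=days)]
    features[name] = recent.groupby("customer_id")["amount"].sum()

trend = pd.DataFrame(features).fillna(0)   # one row per customer, one column per window
```
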



Closeness to major
events
• Hardcode categorical features from dates
• Example: Factors that might have major influence on spending behavior
• Proximity to major events (holidays, major sports events)
• Eg. date_X_days_before_holidays
• Proximity to wages payment date (monthly seasonality)
• Eg. first_saturday_of_the_month
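
A small pandas sketch of two such hard-coded date features; the holiday date is an assumption:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2020-12-20", "2020-12-24", "2021-01-02"]))
holiday = pd.Timestamp("2020-12-25")                               # assumed major holiday

days_before_holiday = (holiday - dates).dt.days                    # negative once the holiday has passed
is_first_saturday = (dates.dt.weekday == 5) & (dates.dt.day <= 7)  # Saturday within the first week
```
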



Time
Differences
• Differences between dates might be relevant
• Examples:
• user_interaction_date - published_doc_date
• To model how recent the ad was when the user viewed it.
• Hypothesis: a user's interest in a topic may decay over time
• last_user_interaction_date - user_interaction_date
• To model how old a given user interaction was compared to the user's last interaction
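
A minimal pandas sketch of the first difference; the dates are illustrative:

```python
import pandas as pd

published_doc_date = pd.to_datetime(pd.Series(["2020-11-01", "2020-11-08"]))
user_interaction_date = pd.to_datetime(pd.Series(["2020-11-10", "2020-11-09"]))
ad_age_days = (user_interaction_date - published_doc_date).dt.days   # how old the ad was when viewed
```
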



Spatial
Features
• Spatial Variables:
• Spatial variables encode a location in space, like:
• GPS-coordinates (lat. / long.) - sometimes require projection to a different
coordinate system
• Street Addresses - require geocoding
• Zip Codes, Cities, States, Countries - usually enriched with the centroid coordinate
of the polygon (from external GIS data)
• Derived features
• Distance between a user location and searched hotels (Expedia competition)
• Impossible travel speed (fraud detection)
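
A sketch of a derived distance feature using the haversine formula; the coordinates below are approximate city centres, used only as an example:

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# e.g. distance between a user in Islamabad and a hotel in Lahore
user_to_hotel_km = haversine_km(33.6844, 73.0479, 31.5204, 74.3587)
```
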



Spatial
Enrichment
• Usually useful to enrich with external geographic data (eg. Census demographics)
