What is Feature Engineering?

Feature engineering is the process of selecting, transforming, extracting,
combining, and manipulating raw data to generate the desired variables
for analysis or predictive modeling. It is a crucial step in developing
a machine learning model.

What is a Feature?

A feature refers to one unique attribute or variable in our data set. Since
data is often stored in rows and columns, a feature can often be defined
as a single column.

Why Do We Engineer Features?

The objective of a supervised machine learning model is to predict the
value of a target variable using a set of predictor variables. Feature
engineering improves the performance of the model by selecting the right
features and preparing them in a form the model can use.

For example, if we would like to predict the price of a car, the target
variable would be the Market Value. The predictor variables start as a long
list of attributes that, through feature engineering, is slimmed down and
manipulated to produce a set of effective predictor variables.
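
The split between target and predictor variables can be sketched in a few
lines of Python. This is only an illustration: the column names
(`seats`, `horsepower`, `market_value`), the values, and the helper
`split_features_target` are hypothetical and not taken from any real data
set.

```python
# Hypothetical car records; every key is one feature (one column).
cars = [
    {"seats": 5, "horsepower": 150, "market_value": 21000},
    {"seats": 2, "horsepower": 300, "market_value": 48000},
]

TARGET = "market_value"  # assumed name of the target column


def split_features_target(rows, target):
    """Separate the predictor columns from the target column."""
    predictors = [{k: v for k, v in row.items() if k != target} for row in rows]
    targets = [row[target] for row in rows]
    return predictors, targets


X, y = split_features_target(cars, TARGET)
feature_names = sorted(X[0])  # the candidate predictor variables
```

Feature engineering then operates on the predictor columns in `X`, while
`y` stays untouched as the quantity to predict.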

The process of feature engineering would involve questions like “Is
number of seats a good predictor?” It would also involve more exploratory
questions like:

• Should there be a predictor variable for the shoulder width of every
  seat?
• Should the average shoulder width act as a single predictor variable?
• Should horsepower and torque be separate predictor variables, or
  do they provide similar information, so that only one of them is needed?
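
One way to probe the questions above is to measure how strongly two
candidate predictors move together, and to try the averaged feature
directly. The sketch below uses made-up numbers, a hand-written Pearson
correlation, and an arbitrary 0.9 cutoff; all of these are assumptions for
illustration, not a rule.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values for five hypothetical cars.
horsepower = [110, 150, 200, 250, 300]
torque = [140, 190, 250, 320, 370]

r = pearson(horsepower, torque)
# Illustrative threshold: a very high |r| hints that one variable may suffice.
likely_redundant = abs(r) > 0.9

# Per-seat shoulder widths (cm) for two hypothetical cars, collapsed into
# a single average-shoulder-width feature per car.
shoulder_widths = [[55.0, 55.0, 52.0, 52.0], [58.0, 58.0, 54.0, 54.0]]
avg_shoulder_width = [sum(row) / len(row) for row in shoulder_widths]
```

A high correlation only suggests redundancy. Whether to drop a variable
still depends on the model and on domain knowledge.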

As the above example conveys, feature engineering is highly dependent on
the dataset and the target variable. As a result, there is no single
correct method of conducting feature engineering; it relies heavily on
the experience and expertise of the data scientists conducting the
analysis.

5 Steps to Feature Engineering

While there is no formula for effective feature engineering, the following
five steps can guide the decisions you make when engineering your
features.

1. Data Cleansing

Data cleansing is the process of dealing with errors or inconsistencies in
the data. This step involves identifying incorrect, missing, duplicated,
and irrelevant data, and then deleting, replacing, or modifying that data
to remove outliers and incorrect values.

Data cleansing prepares the data to be readable by the model; this means
that all missing values are appropriately handled and that all features
have the correct data type. A typical data cleansing decision concerns
outliers. In some cases, removing outliers will result in the best model,
while in other cases the outliers should be kept because they provide the
model with valuable information about edge cases.
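
A minimal sketch of these cleansing decisions, using only the Python
standard library, might look like the following. The records, the field
names, the choice of median imputation, and the 1.5 × IQR outlier rule
are all assumptions chosen for illustration; other choices can be equally
valid.

```python
from statistics import median, quantiles

# Hypothetical raw records: prices arrive as strings, one value is
# missing, and one row is duplicated.
raw = [
    {"id": 1, "price": "12000"},
    {"id": 2, "price": "13500"},
    {"id": 2, "price": "13500"},  # duplicated row
    {"id": 3, "price": ""},       # missing value
    {"id": 4, "price": "12800"},
    {"id": 5, "price": "14100"},
    {"id": 6, "price": "98000"},  # possible outlier
]


def cleanse(rows):
    """Drop duplicate ids, cast prices to float, fill gaps with the median."""
    seen = set()
    cleaned = []
    for row in rows:
        if row["id"] in seen:
            continue
        seen.add(row["id"])
        price = float(row["price"]) if row["price"] else None
        cleaned.append({"id": row["id"], "price": price})
    known = [r["price"] for r in cleaned if r["price"] is not None]
    fill = median(known)
    for r in cleaned:
        if r["price"] is None:
            r["price"] = fill
    return cleaned


def flag_outliers(values):
    """Mark values outside 1.5 * IQR of the quartiles (flag, do not delete)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v < low or v > high for v in values]


cleaned = cleanse(raw)
flags = flag_outliers([r["price"] for r in cleaned])
```

Flagging rather than deleting outliers keeps the decision open: the
flagged rows can be dropped, or kept as edge cases, once their effect on
the model is known.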

2. Data Transformation

Data transformation is the process of transforming the data from one
layout to another. Transformation needs to occur in a way that does not
change the meaning of the original data. There are several techniques to
transform the data depending on the desired outcome:

• Transformation: the application of a mathematical function (for
  example, a logarithm) to every data point. Transformation is a useful
  way to handle highly skewed data.
• Standardization: the process of converting the data into a uniform
  format or scale. Standardization is a useful way of handling data
  measured in different units.
• Data Encoding: the process of converting categorical variables to
  numerical variables. Data encoding is a useful way of handling nominal
  and ordinal variables.
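
The three techniques above can be sketched with the standard library
alone. Two interpretations here are assumptions rather than something the
text prescribes: standardization is read as z-score scaling (zero mean,
unit standard deviation), and encoding is shown as one-hot for nominal
values and integer ranks for ordinal values. The function names are
hypothetical.

```python
from math import log1p
from statistics import mean, pstdev


def log_transform(values):
    """Apply log(1 + x) to every data point to compress a long right tail."""
    return [log1p(v) for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Encode a nominal variable as one 0/1 indicator per category."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]


def ordinal_encode(values, order):
    """Encode an ordinal variable as its rank in a caller-supplied order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]


z = standardize([2, 4, 6])
body = one_hot(["suv", "sedan", "suv"])
condition = ordinal_encode(["good", "poor", "fair"], ["poor", "fair", "good"])
```

Note that `standardize` divides by the standard deviation, so a constant
column (standard deviation of zero) would need to be handled separately.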

3. Feature Extraction

Feature extraction is the process of extracting new features from the
existing attributes. This process is primarily concerned with reducing the
number of features in the model. Feature extraction can be a lengthy
process that requires the use of advanced analytics techniques
(e.g., Principal Component Analysis).

However, in its essence, feature extraction answers the following
question: Are the available features necessary to explain the behavior of the
target variable, or can these features be aggregated and grouped in a way that
maintains the effect on the target variable while reducing the number of
features?

Feature extraction does not have to be complicated; it can simply be the
grouping of multiple variables into a feature that measures the average of
these variables.
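
The simple averaging case can be sketched as follows. The helper
`average_feature` and the column names are hypothetical, and the sketch
assumes the grouped columns share the same unit so that their mean is
meaningful.

```python
def average_feature(rows, columns, new_name):
    """Replace several columns with one column holding their mean."""
    result = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in columns}
        kept[new_name] = sum(row[c] for c in columns) / len(columns)
        result.append(kept)
    return result


# Hypothetical records with two shoulder-width columns (cm).
cars = [
    {"model": "A", "front_shoulder_width": 56.0, "rear_shoulder_width": 52.0},
    {"model": "B", "front_shoulder_width": 60.0, "rear_shoulder_width": 54.0},
]

reduced = average_feature(
    cars, ["front_shoulder_width", "rear_shoulder_width"], "avg_shoulder_width"
)
```

Each record now carries one feature where it previously carried two,
which is the reduction in feature count that feature extraction aims for.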