
Feature Engineering in Machine Learning

Ayush Singh1

Antern Department of Artificial Intelligence


ayush@antern.co

Abstract. This document covers data preparation in machine learning, including several of its components such as feature engineering, feature selection, and dimensionality reduction. We provide readers with several traditional and modern techniques for handling complicated data tasks.

Key words: data preparation, feature engineering, feature selection,


data cleansing, data transformation, dimensionality reduction

1 Introduction

The process of producing, changing, or choosing features (sometimes referred to


as variables or attributes) from raw data in order to enhance the performance
of machine learning algorithms is known as feature engineering. It entails the
extraction of pertinent data and the development of fresh features that can
aid algorithms in better comprehending the data and producing more precise
predictions.

2 Examples

– Age from date of birth: If your dataset includes an individual’s date of birth, you can compute the person’s age by subtracting the date of birth from the current date (see the sketch after this list). Particularly in tasks like forecasting health risks, insurance premiums, or customer segmentation, this feature may be more relevant to a machine learning system than the raw date of birth.
– Text length: The length of a text may prove to be a helpful feature when solving a text categorisation challenge. For instance, you can add a new feature called "text length" to categorise movie reviews as favourable or unfavourable based on how many words or characters are contained in each review. This feature can help the model capture the connection between a review’s length and its sentiment.
– Average purchase amount: Assume you have a dataset of client transactions, each of which includes the date, item, and amount spent, and you want to predict client attrition from spending patterns. By computing the typical purchase amount for each customer, you can add a new feature named "average purchase amount". With this new feature, a machine learning system may be better able to identify client spending habits and forecast customer attrition.
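A minimal sketch of the first example in pandas, assuming a hypothetical date_of_birth column:

import pandas as pd

# Hypothetical customer data with a date-of-birth column
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "date_of_birth": pd.to_datetime(["1990-05-01", "1985-11-23", "2000-02-14"]),
})

# Approximate age in whole years: subtract the date of birth from today's date
today = pd.Timestamp.today()
df["age"] = ((today - df["date_of_birth"]).dt.days // 365).astype(int)

print(df[["customer_id", "age"]])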

3 How Does Feature Engineering Help?

Take into account a dataset of fruit samples with the attributes weight and colour
(red or green). Predicting whether a particular fruit is an apple or a watermelon
is the objective. The dataset appears as follows:

Fruit Color Weight


Apple Red 150g
Apple Red 170g
Apple Green 160g
Watermelon Green 4,000g
Watermelon Green 4,500g
Watermelon Red 4,200g

Apples and watermelons can both be red or green, so a machine learning method that uses these two features directly might have trouble telling them apart.
Here, feature engineering may be useful. By dividing each fruit’s weight by the total weight of all the fruits in the dataset, you can produce a new feature called "weight ratio". The new dataset would look as follows:

Fruit Color Weight Weight Ratio

Apple Red 150g 0.0114
Apple Red 170g 0.0129
Apple Green 160g 0.0121
Watermelon Green 4,000g 0.3035
Watermelon Green 4,500g 0.3414
Watermelon Red 4,200g 0.3187

Thanks to this new feature, the machine learning algorithm can now easily distinguish between apples and watermelons based on their weight ratios. Apples have far smaller weight ratios than watermelons, which makes accurate classification easier.
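A minimal sketch of this transformation with pandas, using the fruit table above:

import pandas as pd

# Fruit dataset from the example above (weights in grams)
df = pd.DataFrame({
    "Fruit": ["Apple", "Apple", "Apple", "Watermelon", "Watermelon", "Watermelon"],
    "Color": ["Red", "Red", "Green", "Green", "Green", "Red"],
    "Weight": [150, 170, 160, 4000, 4500, 4200],
})

# New feature: each fruit's weight divided by the total weight of all fruits
df["Weight Ratio"] = df["Weight"] / df["Weight"].sum()

print(df)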

4 Feature Selection and How It Helps

The process of selecting a subset of the most pertinent and instructive features
from the initial collection of features in a dataset is known as feature selection.
This is done to simplify the model, prevent overfitting, boost training effective-
ness, and make the model easier to understand.
With feature selection, you can:

– Reduce overfitting: using only the most pertinent features limits the model’s ability to fit noise in the data, which improves generalisation to new data.
– Improve training efficiency: the training procedure is quicker and uses fewer computational resources when there are fewer features.
– Improve interpretability: in areas where explainability is critical, a model with fewer features is simpler to comprehend and interpret.

Working illustration
Consider a dataset that contains details on homes, such as their age, location,
square footage, number of rooms, and proximity to the city centre. Predicting
housing prices is the objective.

House Rooms Sq. Footage Age Location Distance from City Center Price
1 3 1,200 10 Urban 2.0 250k
2 4 1,800 5 Suburban 5.5 300k
...
You discover after examining the dataset that "Rooms" and "Sq. Footage" have a strong correlation (houses with more rooms generally have more square footage). You also find that the variable "Location" has little effect on the prices of houses in your dataset. You therefore decide to perform feature selection and remove the "Sq. Footage" and "Location" features in order to streamline your model and boost its performance.

House Rooms Age Distance from City Center Price


1 3 10 2.0 250k
2 4 5 5.5 300k
...

With fewer features to take into account, your model may be less overfitted,
train more quickly, and produce predictions that are simpler to understand.
Many feature selection techniques exist, including filter methods (such as correlation and mutual information), wrapper methods (such as forward selection and backward elimination), and embedded methods (e.g., LASSO, Ridge Regression). Each has its own strengths and weaknesses, and the choice of method depends on the particular problem and dataset.
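A minimal sketch of a simple filter-style selection on a housing-like dataset, using pandas correlations (the values below are made up purely for illustration):

import pandas as pd

# Small illustrative housing dataset
df = pd.DataFrame({
    "Rooms": [3, 4, 2, 5, 4],
    "SqFootage": [1200, 1800, 900, 2200, 1700],
    "Age": [10, 5, 30, 2, 8],
    "DistanceFromCityCenter": [2.0, 5.5, 1.0, 8.0, 4.5],
    "Price": [250_000, 300_000, 180_000, 390_000, 290_000],
})

# Inspect pairwise correlations among predictors; for a highly correlated
# pair (e.g. Rooms and SqFootage), keep only one of the two features
print(df.drop(columns="Price").corr().abs().round(2))

# Rank predictors by absolute correlation with the target and keep the strongest
target_corr = df.corr()["Price"].abs().drop("Price").sort_values(ascending=False)
print("Feature ranking:\n", target_corr)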

5 Introduction to Categorical Variables

In contrast to continuous numerical values, categorical variables are a sort of


data that reflect discrete values or categories. They are frequently employed to
represent qualitative data from a dataset, like brand, gender, or colour. Since
numerical inputs are frequently required by machine learning algorithms, it is
vital to encode categorical variables into numerical values in a coherent and
understandable manner.
The two primary categories of categorical variables are:
– Ordinal categorical variables: these have a natural rank or order. The relative ranking of the categories is significant, and the sequence of the categories conveys meaningful information. For example, the variable "Education Level," with the categories "High School," "Bachelor’s Degree," "Master’s Degree," and "Ph.D.," has an intrinsic order, since a Ph.D. represents a higher level of education than a Bachelor’s Degree.
– Nominal categorical variables: these lack any inherent hierarchy or order. The order in which the categories appear is arbitrary; they are essentially different labels. For instance, the variable "Color" with the values red, blue, and green is nominal, because there is no intrinsic order between the hues.

Why proper categorical variable encoding is crucial

Correct categorical variable encoding is essential for a machine learning model to succeed for a number of reasons:
– Compatibility: several machine learning methods, including neural networks, support vector machines, and linear regression, require numerical inputs. Encoding categorical variables allows these algorithms to process the data.
– Interpretability: correct encoding ensures that the relationships between categories are maintained, making the model’s predictions easier to understand and more meaningful.
– Performance: effective encoding methods can help capture the underlying structure in the data, improving the performance of the model.

5.1 Label Encoding

Label encoding is a quick method for transforming categorical data into numerical values. It entails giving each category in the variable a different integer. The assigned integers are normally ordered sequentially, beginning with 0 or 1. Label encoding is especially useful for ordinal variables, since it can maintain the categories’ natural order.
When to Employ Label Encoding and Why:
– Label encoding works best for ordinal variables because the encoded integers
can accurately reflect the categories’ natural order. Hence, the ordinal relation-
ship between the categories may be captured by machine learning techniques.
– Label encoding can be used for ordinal and nominal variables with some tree-
based algorithms, such as decision trees and random forests, because they
can handle the encoded values without assuming any hierarchy between the
categories.
Consider a dataset with the variable ’Size’ representing T-shirt sizes:

Size
Small
Medium
Large
Small
Large
Using label encoding, we can assign a unique integer to each category:

– Small: 0
– Medium: 1
– Large: 2
The encoded dataset will look like this:

Encoded Size
0
1
2
0
2

However, it’s important to exercise caution when employing label encoding for nominal variables, since the encoded integers can introduce an artificial hierarchy that may not accurately reflect the relationships between the categories.
Example:
Let’s consider a dataset with information about cars, including a nominal
categorical variable ’Color’:

Car Color
A Red
B Blue
C Green
D Red
E Green
If we use label encoding for the ’Color’ variable, we might assign integers like
this:
– Red: 0
– Blue: 1
– Green: 2
The encoded dataset will look like this:

Car Encoded Color


A 0
B 1
C 2
D 0
E 2
However, this encoding creates an artificial order among the colors: Red <
Blue < Green. This order might not reflect any true relationship between the
categories and could lead to incorrect assumptions by the machine learning algo-
rithm. In such cases, it’s better to use encoding techniques like one-hot encoding
or dummy encoding, which do not impose an order on nominal variables.
To perform label encoding in Python, you can use the LabelEncoder class from
the scikit-learn library:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Create a sample dataset


data = {'Size': ['Small', 'Medium', 'Large', 'Small', 'Large']}
df = pd.DataFrame(data)

# Initialize the LabelEncoder


encoder = LabelEncoder()

# Apply label encoding to the 'Size' column

df['Encoded Size'] = encoder.fit_transform(df['Size'])

# Display the encoded dataset


print(df)

The output of this program is:

Size Encoded Size


Small 0
Medium 1
Large 2
Small 0
Large 2
Table 1. Output

5.2 One-Hot Encoding

For each distinct category in a nominal categorical variable, binary (0/1) features
are created using the one-hot encoding technique. According to this method, a
new binary column is created for each distinct category, with the existence of
the category in an observation being represented by 1 and the absence by 0.
Advantages and Disadvantages:

Advantages:
– Interpretability: one-hot encoding generates a binary feature for each category, making the connections between categories and the target variable simple to understand.
– No artificial order: unlike label encoding, one-hot encoding does not impose an artificial order on the categories, which makes it well suited to nominal categorical variables.

Disadvantages:
– Increased dimensionality: when the categorical variable has a large number of distinct categories, one-hot encoding can greatly increase the dimensionality of the dataset. This may result in the "curse of dimensionality" and increased computational complexity.

Worked Example: Consider a dataset with the variable ’Animal’ representing


different animal species:

Animal
Dog
Cat
Elephant
Dog
Elephant
Using one-hot encoding, we create a new binary column for each unique cat-
egory:

Dog Cat Elephant


1 0 0
0 1 0
0 0 1
1 0 0
0 0 1
To perform one-hot encoding in Python, you can use the get_dummies function from the pandas library:

import pandas as pd

# Create a sample dataset


data = {'Animal': ['Dog', 'Cat', 'Elephant', 'Dog', 'Elephant']}
df = pd.DataFrame(data)

# Apply one-hot encoding to the 'Animal' column

encoded_df = pd.get_dummies(df, columns=['Animal'])

# Display the encoded dataset


print(encoded_df)

Animal_Cat Animal_Dog Animal_Elephant


0 1 0
1 0 0
0 0 1
0 1 0
0 0 1
Table 2. Output

5.3 Dummy Encoding

To demonstrate the multicollinearity problem with one-hot encoding in linear regression models, let's look at a straightforward example. We have a dataset with details about houses: their square footage, the neighbourhood they are located in, and their prices.

Size Neighborhood Price


1000 A 200,000
1500 B 250,000
2000 C 300,000
1200 A 220,000
1800 B 280,000

We want to build a linear regression model to predict the house prices based
on their size and neighborhood.
First, let’s apply one-hot encoding to the ’Neighborhood’ column:

Size A B C Price
1000 1 0 0 200,000
1500 0 1 0 250,000
2000 0 0 1 300,000
1200 1 0 0 220,000
1800 0 1 0 280,000

Now, we build a linear regression model using ’Size’, ’A’, ’B’, and ’C’ as
independent variables:

Price = Intercept + β1 · Size + β2 · A + β3 · B + β4 · C    (1)


For every row, the values in columns "A," "B," and "C" sum to exactly 1, because every house must belong to exactly one neighbourhood. The constant term (the intercept) in the linear regression model also represents a constant value of 1 for each observation. Since the constant term and the binary columns are perfectly multicollinear, the model cannot reliably estimate the specific effect of each neighbourhood on the price of a home.

If we use dummy encoding instead, we can remove one neighborhood (e.g., ’A’)
as the reference category:

Size B C Price
1000 0 0 200,000
1500 1 0 250,000
2000 0 1 300,000
1200 0 0 220,000
1800 1 0 280,000

Price = β̂0 + β̂1 · Size + β̂2 · B + β̂3 · C    (2)
In this model, there is no perfect multicollinearity between the binary columns ('B' and 'C') and the constant term. The intercept now represents the baseline price for neighborhood 'A', and the coefficients β̂2 and β̂3 represent the price difference of neighborhoods 'B' and 'C' relative to neighborhood 'A'. This prevents multicollinearity issues and allows for more stable and interpretable estimates.
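A minimal sketch of this setup using pandas and scikit-learn on the small housing table above (fitting on such a tiny dataset is only for illustration):

import pandas as pd
from sklearn.linear_model import LinearRegression

# Housing data from the example above
df = pd.DataFrame({
    "Size": [1000, 1500, 2000, 1200, 1800],
    "Neighborhood": ["A", "B", "C", "A", "B"],
    "Price": [200_000, 250_000, 300_000, 220_000, 280_000],
})

# Dummy encoding: drop 'A' so it becomes the reference category
X = pd.get_dummies(df[["Size", "Neighborhood"]], columns=["Neighborhood"], drop_first=True)
y = df["Price"]

model = LinearRegression().fit(X, y)

# Intercept is the baseline (neighborhood 'A'); the Neighborhood_B and
# Neighborhood_C coefficients are price differences relative to 'A'
print("Intercept:", model.intercept_)
print(dict(zip(X.columns, model.coef_)))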

When and Why to Use Dummy Encoding: Dummy encoding is partic-


ularly useful when working with linear regression models or other linear models
that assume no multicollinearity among independent variables. By omitting one
category and using it as a reference, dummy encoding eliminates the linear de-
pendence between the created binary features, ensuring that multicollinearity
does not adversely impact the model’s estimates.
To perform dummy encoding in Python, you can use the get_dummies function from the pandas library with the drop_first parameter set to True:

import pandas as pd

# Create a sample dataset


data = {'Animal': ['Dog', 'Cat', 'Elephant', 'Dog', 'Elephant']}
df = pd.DataFrame(data)

# Apply dummy encoding to the 'Animal' column

encoded_df = pd.get_dummies(df, columns=['Animal'], drop_first=True)

# Display the encoded dataset


print(encoded_df)
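Note that get_dummies sorts the categories alphabetically, so with drop_first=True the category 'Cat' is dropped here and becomes the reference: the printed frame contains only the Animal_Dog and Animal_Elephant columns, and a row with zeros in both columns represents a cat.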

How to choose the reference category (the category to be dropped)?

– Frequency: the most frequent category is often chosen as the reference, so that the remaining coefficients describe differences from the most common case.
– Interpretability: choose the reference that makes the comparisons you care about easiest to read off the coefficients.
– Domain knowledge: use subject-matter knowledge to pick a natural baseline category (for example, a control group or a standard product).

5.4 Mean Encoding

Mean encoding, also referred to as target encoding, is a method for transforming categorical variables into numerical values by replacing each category with the mean of the target variable for that category. Because it reduces the dimensionality of the data without discarding much information, this technique can be especially useful when working with categorical variables of high cardinality.
Advantages
– Reduces the dimensionality of the problem, which can enhance model efficiency
and lower computational demands.
– Captures the relationship between the categorical variable and the target variable in a single numerical value.
Disadvantages
– Potential leakage of target information if not implemented correctly, leading
to overfitting.
– Not suitable for cases where the relationship between the categorical variable
and the target variable is not monotonic.
Worked Example Consider a small dataset containing information about cus-
tomers, including their age group and their spending score (target variable):

Age Group Spending Score


Youth 75
Adult 55
Youth 80
Senior 40
Adult 60

To perform target encoding, we replace each ’Age Group’ category with the
mean spending score for that category:
Youth: (75 + 80) / 2 = 77.5
Adult: (55 + 60) / 2 = 57.5
Senior: 40

The encoded dataset will look like this:


Encoded Age Group Spending Score
77.5 75
57.5 55
77.5 80
40.0 40
57.5 60
Target encoding must be done independently for each fold during cross-
validation in order to prevent target leakage. This stops data from the validation
set from entering the encoding process by only using the training data for each
fold to compute the mean of the target variable.

If the encoding procedure uses data from the entire dataset, including the validation or test set, and we then train a model on this encoded dataset and use the same data for validation or testing, we run the risk of overfitting. The model would probably perform well on this dataset, but it would not generalise well to new, unseen data.
We can prevent target leakage and guarantee that the model is trained
on data that is independent of the validation or test set by employing cross-
validation and carrying out target encoding independently for each fold. The
model’s ability to generalise to new data is improved through this procedure,
which also offers a more precise assessment of the model’s performance.
Here’s how we do it using CV:
To demonstrate how to avoid target leakage with target encoding, let’s use
a small dataset and perform k-fold cross-validation. In this example, we’ll use a
3-fold cross-validation.
Dataset:
Age Group Spending Score
Youth 75
Adult 55
Youth 80
Senior 40
Adult 60
Senior 45
We’ll first split the dataset into 3 folds:

Fold 1:
Age Group Spending Score
Youth 75
Adult 55

Fold 2:
Age Group Spending Score
Youth 80
Senior 40

Fold 3:
Age Group Spending Score
Adult 60
Senior 45
Now, for each fold, we’ll perform target encoding using only the training data
for that fold:
Fold 1 (train on Fold 2 and Fold 3, validate on Fold 1):
Training data:
Age Group Spending Score
Youth 80
Senior 40
Adult 60
Senior 45
Target encoding:
Age Group Target Encoding
Youth 80
Adult 60
Senior (40 + 45) / 2 = 42.5
Encoded validation data:
Encoded Age Group Spending Score
80 75
60 55

Fold 2 (train on Fold 1 and Fold 3, validate on Fold 2):


Training data:
Age Group Spending Score
Youth 75
Adult 55
Adult 60
Senior 45
Target encoding:
Age Group Target Encoding
Youth 75
Adult (55 + 60) / 2 = 57.5
Senior 45
Encoded validation data:

Encoded Age Group Spending Score


75 80
45 40

Fold 3 (train on Fold 1 and Fold 2, validate on Fold 3):


Training data:
Age Group Spending Score
Youth 75
Adult 55
Youth 80
Senior 40
Target encoding:
Age Group Target Encoding
Youth (75 + 80) / 2 = 77.5
Adult 55
Senior 40
Encoded validation data:
Encoded Age Group Spending Score
55 60
40 45
By performing target encoding separately for each fold during cross-validation,
we ensure that the target variable’s mean is calculated only using the training
data for each fold, preventing information from the validation set from leaking
into the encoding process.
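A minimal sketch of fold-wise target encoding with pandas and scikit-learn's KFold, following the same idea as the worked example (the training-part global mean is used as a fallback for categories unseen in a fold's training data):

import pandas as pd
from sklearn.model_selection import KFold

# Dataset from the worked example
df = pd.DataFrame({
    "Age Group": ["Youth", "Adult", "Youth", "Senior", "Adult", "Senior"],
    "Spending Score": [75, 55, 80, 40, 60, 45],
})

df["Encoded Age Group"] = 0.0
kf = KFold(n_splits=3, shuffle=False)

for train_idx, val_idx in kf.split(df):
    train, val = df.iloc[train_idx], df.iloc[val_idx]
    # Means computed ONLY on the training part of this fold
    means = train.groupby("Age Group")["Spending Score"].mean()
    # Map onto the validation part; unseen categories fall back to the training mean
    encoded = val["Age Group"].map(means).fillna(train["Spending Score"].mean())
    df.loc[df.index[val_idx], "Encoded Age Group"] = encoded.values

print(df)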

Further Research

As a student, it’s important to explore and research various encoding techniques


to deepen your understanding of their use cases and implementation. We en-
courage you to research the following encoding techniques on your own:
– Binary Encoding
– Base-N Encoding
– Hashing

By studying these techniques, you will develop a stronger foundation in cat-


egorical variable encoding and be better prepared to select the most appropriate
encoding method for your specific machine learning problems.

6 Choosing the Right Encoding Technique


Selecting the appropriate encoding technique for categorical variables is essential
for the success of your machine learning models. The choice depends on the
problem, dataset, and algorithm. Here are some guidelines and recommendations
to help you decide which encoding technique to use:
1. Ordinal vs. Nominal Variables: Determine if the categorical variable is
ordinal (having a natural order) or nominal (having no order). For ordinal
variables, label encoding is a suitable choice, as it preserves the order. For
nominal variables, consider one-hot encoding, dummy encoding, or other
advanced techniques like binary encoding, base-N encoding, or hashing.
2. Cardinality: High cardinality categorical variables (i.e., those with many
unique categories) can lead to a large number of columns when using one-hot
encoding or dummy encoding. In such cases, consider using binary encoding,
base-N encoding, or hashing to reduce dimensionality.
3. Algorithm Sensitivity: Some machine learning algorithms, like decision
trees and random forests, can handle categorical variables directly, while
others, like linear regression and support vector machines, require numeri-
cal input. Consider the algorithm’s sensitivity to categorical variables when
choosing an encoding technique.
4. Memory and Computational Resources: If memory and computational
resources are limited, consider using encoding techniques like binary encod-
ing, base-N encoding, or hashing that reduce the number of columns com-
pared to one-hot encoding.

7 Engineering Numerical Features

Scaling is important because many machine learning algorithms are sensitive to the scale of the input features. If some features have much larger values than others, the algorithm may focus too much on those features, leading to less accurate results.
Consider a dataset with two features, ’Age’ and ’Income’:
Age Income
20 2000
25 2500
30 3000
35 3500
40 4000
In this dataset, the ’Income’ feature has a much larger magnitude than the
’Age’ feature. A machine learning algorithm might give more importance to
’Income’, even though ’Age’ might also be a crucial factor.
To fix this issue, we can think of a technique that transforms the features to
a common scale. One such technique is Min-Max scaling.
Min-Max scaling is a technique that scales the numerical features to a specific
range, usually [0, 1]. The formula for Min-Max scaling is:
Xscaled = (X − Xmin) / (Xmax − Xmin)    (3)
Dataset:

Age Height
20 150
25 155
30 160
35 165
40 170

Min-Max Scaling:
First, we calculate the minimum and maximum values for both features:
– Age: min=20, max=40
– Height: min=150, max=170
Next, we apply Min-Max scaling to the dataset:
Scaled dataset:
Scaled Age Scaled Height
0 0
0.25 0.25
0.5 0.5
0.75 0.75
1 1

Euclidean distance (between the first two rows):

Original distance:
Distance = √((20 − 25)² + (150 − 155)²) = √(25 + 25) = √50 ≈ 7.07

Scaled distance:
Distance = √((0 − 0.25)² + (0 − 0.25)²) = √(0.0625 + 0.0625) = √0.125 ≈ 0.354
By applying Min-Max scaling, we can see that the distance calculation is
more balanced and gives equal importance to both ’Age’ and ’Height’. This will
help the machine learning algorithm to make better predictions by considering
both features fairly.
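A minimal sketch of Min-Max scaling with scikit-learn on the Age/Height table above:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Dataset from the example above
df = pd.DataFrame({"Age": [20, 25, 30, 35, 40],
                   "Height": [150, 155, 160, 165, 170]})

# Scale each feature to the [0, 1] range: (x - min) / (max - min)
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=["Scaled Age", "Scaled Height"])

print(scaled)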

7.1 Several Transformations

The following Colab notebook contains a detailed explanation of several common transformations:
https://colab.research.google.com/drive/1D5N7EDT5KtuwKsr4aptNw866Boh0C4nC?usp=sharing

8 Interaction Effects
Interactions in prediction:
– Occur when the combined effect of two or more features on the outcome is
different from their individual effects.
– Can improve predictions by considering the combined effects of features.
– Can occur between numerical, categorical, or mixed features.
Example 1 - Water and fertilizer on crop yield:

– No water + some fertilizer = no yield (water is essential).
– Sufficient water + no fertilizer = some yield (not optimal).
– Sufficient water + sufficient fertilizer = optimal yield (the combined effect is greater than the individual effects).
Example 2 - Ames housing data (age of house, air conditioning,
and sale price):
– Houses with air conditioning: positive relationship between age and sale price.
– Houses without air conditioning: no relationship between age and sale price.
– Interaction between age of house and presence of air conditioning, as their
combined effect on sale price is different from their individual effects.
Importance of interactions:
– Help improve model performance and accuracy.
– Identify and incorporate interactions to better understand the relationships
between features and outcomes.
Interaction representation in a simple linear model:
– Equation: y = β0 + β1x1 + β2x2 + β3x1x2 + error
– β0: overall average response
– β1 and β2: average rates of change due to x1 and x2, respectively
– β3: incremental rate of change due to the combined effect of x1 and x2
Estimating parameters:
– Use methods like linear regression (for a continuous response) or logistic regression (for a categorical response) to estimate the parameters from data.
Evaluating interaction usefulness:
– After estimating the parameters, determine whether the interaction term (β3x1x2) is useful for explaining variation in the response.
– This helps in understanding the significance of the interaction between predictors.
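A minimal sketch of fitting a model with an explicit interaction term, using made-up data for two predictors x1 and x2:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Made-up data where the response depends on x1, x2 and their product
df = pd.DataFrame({"x1": rng.uniform(0, 1, 200), "x2": rng.uniform(0, 1, 200)})
df["y"] = 2 + 3 * df["x1"] + 1.5 * df["x2"] + 4 * df["x1"] * df["x2"] + rng.normal(0, 0.1, 200)

# Engineer the interaction feature x1 * x2
df["x1x2"] = df["x1"] * df["x2"]

model = LinearRegression().fit(df[["x1", "x2", "x1x2"]], df["y"])
print(dict(zip(["beta1", "beta2", "beta3"], model.coef_)), "intercept:", model.intercept_)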
Common types of interaction, with examples:
– Additive (exercise and a healthy diet on weight loss): the combined effect on weight loss is the sum of their individual effects.
– Antagonistic (sleep and caffeine intake on alertness): the combined effect on alertness is less than the sum of their individual effects, as caffeine reduces the effectiveness of sleep on alertness.
– Synergistic (sunscreen and wearing a hat on preventing sunburn): the combined effect on preventing sunburn is greater than the sum of their individual effects, providing better protection.
– Atypical (medication effect on pain relief in acute vs. chronic pain patients): the effect of medication on pain relief depends on the pain type (acute or chronic), but the main effect of one or both predictors on the response is not significant.

How to find interaction terms?

– Expert knowledge: a nutritionist’s knowledge of the impact of different nutrients on health.
– Experimental design: designing a study to assess the effects of different types of exercise on weight loss.
– Interaction hierarchy: in a pizza satisfaction study, pairwise interactions (crust-sauce, crust-cheese) should be considered before higher-order interactions (crust-sauce-cheese).
– Effect sparsity: in the pizza satisfaction study, only a few factors (e.g., crust, cheese) and interactions (e.g., crust-sauce) might significantly impact customer satisfaction.

8.1 Heredity Principle

This principle is inspired by genetic heredity and states that an interaction term
should only be considered if the preceding terms are effective in explaining the
response variation.
Strong Heredity Example
Suppose you are studying the effect of three factors on plant growth: sunlight
(x1), water (x2), and fertilizer (x3). You find that both sunlight (x1) and wa-
ter (x2) have significant main effects on plant growth. According to the strong
heredity principle, you can consider the interaction between sunlight and water
(x1 × x2) in your model. However, if only sunlight (x1) had a significant main
effect, you would not consider any interaction terms in the model, as strong
heredity requires all lower-level preceding terms to be significant.
Weak Heredity Example
Using the same plant growth example with factors sunlight (x1), water (x2),
and fertilizer (x3), let’s say you find that only sunlight (x1) has a significant
main effect on plant growth. According to the weak heredity principle, you can
consider the interactions between sunlight and water (x1 × x2) and sunlight
and fertilizer (x1 × x3) in your model, even though water (x2) and fertilizer
(x3) don’t have significant main effects. However, the interaction between water
and fertilizer (x2 × x3) would not be considered, as neither of the main effects
is significant.

9 Identifying Potential Interaction Terms


Imagine you are studying the effect of five factors on the sales of a product:
price (x1), advertising (x2), packaging (x3), product quality (x4), and customer
support (x5). You want to find the most important pairwise interactions that
affect sales.

9.1 Brute-Force Approach

With the brute-force approach, you evaluate all possible pairwise interactions
for an association with the response (in this case, sales). For five factors, there
are 10 possible pairwise interactions: (x1 × x2), (x1 × x3), (x1 × x4), (x1 × x5),
(x2 × x3), (x2 × x4), (x2 × x5), (x3 × x4), (x3 × x5), and (x4 × x5).
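A minimal sketch of enumerating all pairwise interaction candidates with itertools:

from itertools import combinations

# The five candidate predictors from the sales example
factors = ["price", "advertising", "packaging", "product_quality", "customer_support"]

# All possible pairwise interaction terms
pairs = list(combinations(factors, 2))
for a, b in pairs:
    print(f"{a} x {b}")
print("Total pairs:", len(pairs))  # 10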

9.2 Drawbacks

As the number of evaluated interaction terms increases, the probability of iden-


tifying an interaction associated with the response due to random chance also
increases. These terms, which are statistically significant only due to random
chance and not because of a true relationship, are called false positive find-
ings.
False positive findings can lead to overfitting and decrease a model’s predic-
tive performance. To protect against selecting these types of findings, an entire
sub-field of statistics is devoted to developing methodology for controlling the
chance of false positive findings.

9.3 Simple Screening

In the context of Simple Screening, let’s consider an example where you want
to predict house prices based on two factors: square footage (x1) and the age
of the house (x2).
Main Effects Model:

y = β0 + β1 x1 + β2 x2 + error    (4)
Interaction Model:

y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + error    (5)
These two models are called "nested" since the first model is a subset of the second. When models are nested, a statistical comparison can be made regarding the amount of additional information that is captured by the interaction term.
For linear regression, the residual error is compared between these two models
and the hypothesis test evaluates whether the improvement in error, adjusted
for degrees of freedom, is sufficient to be considered real. The statistical test
results in a p-value which reflects the probability that the additional information
captured by the term is due to random chance. Small p-values, say less than 0.05,
would indicate that there is less than a 5% chance that the additional information
captured is due to randomness. It should be noted that the 5% is the rate of false
positive findings, and is a historical rule-of-thumb. However, if one is willing to
take on more risk of false positive findings for a specific problem, then the cut-off
can be set to a higher value.
For linear regression, the objective function used to compare models is the statistical likelihood (the residual error, in this case). For other models, such as logistic regression, the objective function used to compare nested models would be the binomial likelihood.
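A minimal sketch of this nested comparison using statsmodels, with made-up data for square footage (x1) and house age (x2); anova_lm performs the F-test between the two models:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)

# Made-up housing data with a genuine x1*x2 interaction in the price
df = pd.DataFrame({"x1": rng.uniform(800, 3000, 150),   # square footage
                   "x2": rng.uniform(1, 50, 150)})      # age of house
df["price"] = (50_000 + 100 * df["x1"] - 500 * df["x2"]
               + 0.5 * df["x1"] * df["x2"] + rng.normal(0, 10_000, 150))

# Main-effects model and the nested interaction model
main_effects = smf.ols("price ~ x1 + x2", data=df).fit()
interaction = smf.ols("price ~ x1 + x2 + x1:x2", data=df).fit()

# F-test comparing the nested models; a small p-value suggests the
# interaction term captures real additional information
print(anova_lm(main_effects, interaction))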
