
Heteroscedasticity and Multicollinearity
Heteroscedasticity
• Consider the following Residual plot:

• Is the assumption of equal error variance satisfied?

• What happens if assumption checking and correction are ignored?
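A minimal sketch of the kind of residual plot referred to above; the data are simulated here (names and numbers are illustrative, not from the slides), with error spread that grows with the predictor so the plot fans out:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 200)
# Simulated data: the noise standard deviation grows with x (heteroscedastic).
y = 2 + 3 * x + rng.normal(0, 0.5 * x)

model = sm.OLS(y, sm.add_constant(x)).fit()

plt.scatter(model.fittedvalues, model.resid, s=10)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted: a funnel shape suggests unequal error variance")
plt.show()
```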
Reasons why you want Homoscedasticity

• Heteroscedasticity causes no bias in the regression parameter coefficients, but it reduces the precision of the coefficient estimates and distorts their estimated standard errors.
• Lower precision increases the likelihood that the coefficient estimates are further from the correct population values.
• Heteroscedasticity tends to produce p-values that are smaller than they should be.
• This effect occurs because heteroscedasticity increases the variance of the coefficient estimates, while the usual OLS formulas do not account for this increase and therefore underestimate it.
• Hence, t-values and F-values computed with an underestimated amount of variance may lead to the conclusion that a model term is statistically significant when it is actually not.
What causes Heteroscedasticity?

• It often occurs in datasets that have a large range between the largest and smallest observed values.
• There are numerous reasons why heteroscedasticity exists.
• The most common explanation is that the error variance changes proportionally with a variable that might be a component in the model.
An Example of Heteroscedasticity

• Suppose you model household consumption based on household income.
• You will find that the variability in consumption increases as income increases.
• Lower-income households are less variable in absolute terms because they need to focus on necessities, leaving less room for different spending habits.
• Higher-income households can purchase a wide variety of luxury items, or not, which results in a broader spread of spending habits.
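A small simulated illustration of this pattern, assuming consumption noise that grows with income (all numbers are made up for the sketch):

```python
import numpy as np

rng = np.random.default_rng(42)
income = rng.uniform(20_000, 200_000, 5_000)
# Consumption rises with income, and so does its spread (noise scales with income).
consumption = 5_000 + 0.6 * income + rng.normal(0, 0.05 * income)

# Deviations from the mean relationship isolate the heteroscedastic noise.
resid = consumption - (5_000 + 0.6 * income)
print(resid[income < 60_000].std())    # small spread among lower-income households
print(resid[income > 150_000].std())   # much larger spread among higher-income households
```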
Eradicate Heteroscedasticity

1. Change of variables.
2. Use the Weighted Least Squares method.
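A minimal sketch of the weighted least squares remedy, assuming the error standard deviation is roughly proportional to a predictor x (the data and the weight choice are illustrative, not prescribed by the slides):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 200)
y = 2 + 3 * x + rng.normal(0, 0.5 * x)   # error variance grows with x
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()
# If Var(error_i) is assumed proportional to x_i**2, weight each observation by 1/x_i**2.
wls = sm.WLS(y, X, weights=1.0 / x**2).fit()

print(ols.bse)   # OLS standard errors (misleading under heteroscedasticity)
print(wls.bse)   # WLS standard errors under the assumed variance structure
```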
Assumption: Uncorrelated Predictor Variables

• Another assumption: the predictor variables are uncorrelated.
Illustration: Multicollinearity

M. Savings (in $100)   Y. Income (in $1000)   M. Consumption Expenditure (in $)
X1                     X2                     X3
10                     50                     520
15                     75                     750
18                     90                     970
24                     120                    1290
30                     150                    1520

● Clearly, X2 = 5 * X1.
● There is perfect collinearity between X1 and X2 since the coefficient of correlation is 1.
● There is no perfect collinearity between X2 and X3; however, the correlation is very high, about 0.99.
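The collinearity in the table can be checked directly; a small sketch using the values above:

```python
import numpy as np

x1 = np.array([10, 15, 18, 24, 30])         # monthly savings (in $100)
x2 = np.array([50, 75, 90, 120, 150])       # yearly income (in $1000)
x3 = np.array([520, 750, 970, 1290, 1520])  # consumption expenditure (in $)

print(np.corrcoef(x1, x2)[0, 1])  # exactly 1.0, since X2 = 5 * X1
print(np.corrcoef(x2, x3)[0, 1])  # about 0.99: very high, but not perfect
```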
Multicollinearity: Visualization

Multicollinearity
• Also called "intercorrelation".
• Refers to the situation when the covariates are related to each other and to the outcome of interest.
• Like confounding, but it has its own statistical terminology because of the effects it has on regression modeling.
Sales Prediction Model for Mitlised Retail Company

Effects of Multicollinearity
Mitlised, a retail company, is developing a sales prediction model to identify key factors influencing sales performance. Promotion has short-term effects and aims to push short-term sales. Advertising has long-term effects and is used to build brand image as well as sales.

The company gathers data on various independent variables such as the amount spent on advertising (AdvertisingSpend) and promotional activities (PromotionalActivity). Consider two situations captured in the datasets Mitlised 1 and Mitlised 2, each comprising the variables "AdvertisingSpend", "PromotionalActivity", and "Sales" for 50 countries. Sales is the dependent variable. The goal is to understand the relative importance of these variables and make informed decisions to optimize sales strategies.
Effects of Multicollinearity
• Mitlised1
• Mitlised2

Effects of Multicollinearity: Mitlised1
• Sales on Advertisement spending and Promotion Activity
• Sales on Advertisement spending

Effects of Multicollinearity: Mitlised2
• Sales on Advertisement spending and Promotion Activity
• Sales on Advertisement spending
Effects of Multicollinearity
• Estimates of the slopes are unstable.
• The t-statistics of one or more coefficients tend to be statistically insignificant.
• R-square can be very high and therefore misleading.
• The estimators and their standard errors can be sensitive to small changes in the data.
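To illustrate the instability, here is a small sketch (with made-up data, not the Mitlised datasets) in which two highly correlated predictors give slope estimates that shift noticeably when a handful of observations is dropped, while their combined effect stays fairly stable:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 50
ad = rng.normal(100, 10, n)
promo = ad + rng.normal(0, 1, n)          # almost a copy of `ad`: severe collinearity
sales = 5 + 0.3 * ad + 0.3 * promo + rng.normal(0, 5, n)

X = sm.add_constant(np.column_stack([ad, promo]))
full = sm.OLS(sales, X).fit()

# Refit after dropping just five observations: the individual slopes typically shift
# a lot (and tend to look insignificant), even though their sum barely moves.
sub = sm.OLS(sales[5:], X[5:]).fit()
print(full.params, full.pvalues)
print(sub.params, sub.pvalues)
```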
Sources of Multicollinearity

• The data collection method employed, for example, sampling over a limited range of the values taken by the regressors in the population.
• Constraints on the model or in the population being sampled. For example, in the regression of electricity consumption on income (X2) and house size (X3), a high X2 almost always means a high X3.
• Model specification, for example, adding polynomial terms to a regression model, especially when the range of the X variable is small.
• An overdetermined model. This happens when the model has more explanatory variables than the number of observations.
Why Multicollinearity Occurs?

• Psychological factor: Consider a situation in which customer loyalty to a coffee shop belonging to a popular chain is modelled using the following predictors:
  – Age
  – Frequency of visits in a month
  – Location of shop: Residential Area / Commercial Area / University Area
  – Satisfaction with quality of product: 1-5
  – Satisfaction with the chain: 1-5
• The two satisfaction ratings are likely to be strongly correlated, since they reflect the same underlying attitude towards the brand.
Why Multicollinearity Occurs?

• By strategy/design: Consider the problem of the impact of advertisement on increasing the sales of fashion retail stores belonging to a popular chain. Predictors include advertising and volume, among other shop-level predictors.
  – Advertising: Amount of money spent
  – Volume: Size of the store
• Upon analysis and further examination, you observe that the predictors advertising and volume are correlated because a high ad budget is allocated to cities with smaller stores and a low ad budget to cities with larger stores.
Detect Multicollinearity

• Find the Variance Inflation Factor (VIF) for each variable using
  VIFᵢ = 1 / (1 − Rᵢ²)
  where Rᵢ² is the squared coefficient of determination obtained by regressing the i-th independent variable on the other independent variables.
• When Rᵢ² is equal to 0, and therefore VIF is 1, the i-th independent variable is not correlated with the remaining ones, meaning that multicollinearity does not exist.
Detect Multicollinearity

• VIF equal to 1: the variables are not correlated.
• VIF between 1 and 5: the variables are moderately correlated.
• VIF greater than 5: the variables are highly correlated.
• The higher the VIF, the higher the possibility that multicollinearity exists, and further investigation is required. When VIF is higher than 10, there is significant multicollinearity that needs to be corrected.
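A minimal sketch of the VIF calculation using statsmodels' variance_inflation_factor; the data frame and column names below are hypothetical stand-ins for the Mitlised-style variables:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
df = pd.DataFrame({"AdvertisingSpend": rng.normal(100, 10, 50)})
df["PromotionalActivity"] = 0.9 * df["AdvertisingSpend"] + rng.normal(0, 2, 50)
df["StoreSize"] = rng.normal(500, 50, 50)

# VIF_i = 1 / (1 - R_i^2); include a constant so each auxiliary regression has an intercept.
X = sm.add_constant(df)
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)   # AdvertisingSpend and PromotionalActivity should show large VIFs
```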
Reduce Multicollinearity

• Dropping a variable(s), at the risk of specification bias
• Transformation of variables
• Additional or new data
• Principal component analysis
Reduce Multicollinearity: Using PCA

• Multicollinearity occurs when there is a strong linear relationship between 2 or more predictors in a model.
• It is a problem because it increases the standard errors of the regression coefficients, leading to noisy estimates.
• Principal Components Analysis (PCA): Replace the correlated variables with a set of equal or fewer uncorrelated variables that represent their shared part, called principal components, and use these as predictors in the regression model.
PCA

• Principal components are linear combinations of the variables in the model.
• After finding the principal components, either
  • retain the initial principal components that cumulatively explain almost 90% of the variability, or
  • retain all of them (if the number of variables is low).
• Then run MLR on the retained components.
• More on PCA later.
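A minimal sketch of principal components regression with scikit-learn, retaining enough components to explain about 90% of the variance; the data and variable names are illustrative assumptions, not from the slides:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # highly correlated with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1 + 2 * x1 + 3 * x3 + rng.normal(size=n)

# Standardize, keep components explaining ~90% of the variance, then regress on them.
pcr = make_pipeline(StandardScaler(), PCA(n_components=0.90), LinearRegression())
pcr.fit(X, y)
print(pcr.named_steps["pca"].n_components_)            # number of components retained
print(pcr.named_steps["pca"].explained_variance_ratio_)
print(pcr.score(X, y))                                  # R^2 of the PCR fit
```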
Steps

• Divide the data into Train and Test sets in a 70:30 ratio.
• On the Train data: use MLR (if applicable) with all predictors.
• Use the Stepwise method to weed out insignificant variables in one go. This uses AIC to find the set of important predictor variables.
• Check the MLR assumptions.
• If one or more assumptions are not met, use an appropriate method so that the MLR assumptions are satisfied.
• Use the final MLR model from the Train data to predict the response variable of the Test data to find the prediction accuracy of your model.
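A rough sketch of this workflow in Python. The stepwise step is shown as a simple backward elimination by AIC, since the slides do not prescribe a particular implementation, and the dataset and column names are placeholders:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

def backward_aic(y, X):
    """Drop one predictor at a time as long as doing so lowers the AIC."""
    cols = list(X.columns)
    best = sm.OLS(y, sm.add_constant(X[cols])).fit()
    improved = True
    while improved and len(cols) > 1:
        improved = False
        for c in list(cols):
            trial_cols = [k for k in cols if k != c]
            trial = sm.OLS(y, sm.add_constant(X[trial_cols])).fit()
            if trial.aic < best.aic:
                best, cols, improved = trial, trial_cols, True
                break
    return best, cols

# Placeholder data; in practice, load your own dataset here.
rng = np.random.default_rng(5)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["x1", "x2", "x3", "x4"])
df["y"] = 1 + 2 * df["x1"] - df["x3"] + rng.normal(size=200)

train, test = train_test_split(df, test_size=0.30, random_state=0)  # 70:30 split
model, kept = backward_aic(train["y"], train[["x1", "x2", "x3", "x4"]])

# Check assumptions on the training residuals (e.g., residual plots), then score on test data.
pred = model.predict(sm.add_constant(test[kept]))
rmse = np.sqrt(np.mean((test["y"] - pred) ** 2))
print(kept, rmse)
```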
