
Heteroscedasticity and Multicollinearity
Heteroscedasticity
• Consider the following Residual plot:

• Is the assumption of equal error variance satisfied?

• What happens if assumption checking and correction are ignored?
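A minimal sketch of the kind of residual plot referred to above; the data are simulated here (names and numbers are illustrative, not from the slides), with error spread that grows with the predictor so the plot fans out:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 200)
# Simulated data: the noise standard deviation grows with x (heteroscedastic).
y = 2 + 3 * x + rng.normal(0, 0.5 * x)

model = sm.OLS(y, sm.add_constant(x)).fit()

plt.scatter(model.fittedvalues, model.resid, s=10)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted: a funnel shape suggests unequal error variance")
plt.show()
```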
Reasons why you want Homoscedasticity

• Heteroscedasticity causes no bias in the regression parameter coefficients, but it reduces the precision of the coefficient estimates and distorts their estimated standard errors.
• Lower precision increases the likelihood that the coefficient estimates are further from the correct population values.
• Heteroscedasticity tends to produce p-values that are smaller than they should be.
• This effect occurs because heteroscedasticity increases the variance of the coefficient estimates, while the usual OLS formulas do not account for this increase and therefore underestimate it.
• Hence, t-values and F-values computed with an underestimated amount of variance may lead to the conclusion that a model term is statistically significant when it is actually not.
What causes Heteroscedasticity?

• It often occurs in datasets that have a large range between the largest and smallest observed values.
• There are numerous reasons why heteroscedasticity exists.
• The most common explanation is that the error variance changes proportionally with a variable that might be a component in the model.
An Example of Heteroscedasticity

• Suppose you model household consumption based on household income.
• You will find that the variability in consumption increases as income increases.
• Lower-income households are less variable in absolute terms because they need to focus on necessities, leaving less room for different spending habits.
• Higher-income households can purchase a wide variety of luxury items, or not, which results in a broader spread of spending habits.
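A small simulated illustration of this pattern, assuming consumption noise that grows with income (all numbers are made up for the sketch):

```python
import numpy as np

rng = np.random.default_rng(42)
income = rng.uniform(20_000, 200_000, 5_000)
# Consumption rises with income, and so does its spread (noise scales with income).
consumption = 5_000 + 0.6 * income + rng.normal(0, 0.05 * income)

# Deviations from the mean relationship isolate the heteroscedastic noise.
resid = consumption - (5_000 + 0.6 * income)
print(resid[income < 60_000].std())    # small spread among lower-income households
print(resid[income > 150_000].std())   # much larger spread among higher-income households
```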
Eradicate Heteroscedasticity

1. Change of variables.
2. Use the Weighted Least Squares method.
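A minimal sketch of the weighted least squares remedy, assuming the error standard deviation is roughly proportional to a predictor x (the data and the weight choice are illustrative, not prescribed by the slides):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 200)
y = 2 + 3 * x + rng.normal(0, 0.5 * x)   # error variance grows with x
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()
# If Var(error_i) is assumed proportional to x_i**2, weight each observation by 1/x_i**2.
wls = sm.WLS(y, X, weights=1.0 / x**2).fit()

print(ols.bse)   # OLS standard errors (misleading under heteroscedasticity)
print(wls.bse)   # WLS standard errors under the assumed variance structure
```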
Assumption: Uncorrelated Predictor Variables

• Another assumption: the predictor variables are uncorrelated.
Illustration: Multicollinearity

M. Savings (in $100)   Y. Income (in $1000)   M. Consumption Expenditure (in $)
X1                     X2                     X3
10                     50                     520
15                     75                     750
18                     90                     970
24                     120                    1290
30                     150                    1520

● Clearly, X2 = 5 * X1.
● There is perfect collinearity between X1 and X2 since the coefficient of correlation is 1.
● There is no perfect collinearity between X2 and X3; however, the correlation is very high, about 0.99.
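The collinearity in the table can be checked directly; a small sketch using the values above:

```python
import numpy as np

x1 = np.array([10, 15, 18, 24, 30])         # monthly savings (in $100)
x2 = np.array([50, 75, 90, 120, 150])       # yearly income (in $1000)
x3 = np.array([520, 750, 970, 1290, 1520])  # consumption expenditure (in $)

print(np.corrcoef(x1, x2)[0, 1])  # exactly 1.0, since X2 = 5 * X1
print(np.corrcoef(x2, x3)[0, 1])  # about 0.99: very high, but not perfect
```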
Multicollinearity: Visualization

Multicollinearity
• Also called "intercorrelation".
• Refers to the situation when the covariates are related to each other and to the outcome of interest.
• Like confounding, but it has its own statistical terminology because of the effects it has on regression modeling.
Sales Prediction Model for Mitlised Retail Company

Effects of Multicollinearity
Mitlised, a retail company, is developing a sales prediction model to identify key factors influencing sales performance. Promotion has short-term effects and aims to push short-term sales. Advertising has long-term effects and is used to build brand image as well as sales.

The company gathers data on various independent variables such as the amount spent on advertising (AdvertisingSpend) and promotional activities (PromotionalActivity). Consider two situations captured in the datasets Mitlised 1 and Mitlised 2, each comprising the variables "AdvertisingSpend", "PromotionalActivity", and "Sales" for 50 countries. Sales is the dependent variable. The goal is to understand the relative importance of these variables and make informed decisions to optimize sales strategies.
Effects of Multicollinearity
• Mitlised1
• Mitlised2

Effects of Multicollinearity: Mitlised1
• Sales on Advertisement spending and Promotion Activity
• Sales on Advertisement spending

Effects of Multicollinearity: Mitlised2
• Sales on Advertisement spending and Promotion Activity
• Sales on Advertisement spending
Effects of Multicollinearity
• Estimates of the slopes are unstable.
• The t-statistics of one or more coefficients tend to be statistically insignificant.
• R-square can be very high and therefore misleading.
• The estimators and their standard errors can be sensitive to small changes in the data.
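To illustrate the instability, here is a small sketch (with made-up data, not the Mitlised datasets) in which two highly correlated predictors give slope estimates that shift noticeably when a handful of observations is dropped, while their combined effect stays fairly stable:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 50
ad = rng.normal(100, 10, n)
promo = ad + rng.normal(0, 1, n)          # almost a copy of `ad`: severe collinearity
sales = 5 + 0.3 * ad + 0.3 * promo + rng.normal(0, 5, n)

X = sm.add_constant(np.column_stack([ad, promo]))
full = sm.OLS(sales, X).fit()

# Refit after dropping just five observations: the individual slopes typically shift
# a lot (and tend to look insignificant), even though their sum barely moves.
sub = sm.OLS(sales[5:], X[5:]).fit()
print(full.params, full.pvalues)
print(sub.params, sub.pvalues)
```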
Sources of Multicollinearity

• The data collection method employed, for example, sampling over a limited range of the values taken by the regressors in the population.
• Constraints on the model or in the population being sampled. For example, in the regression of electricity consumption on income (X2) and house size (X3), a high X2 almost always means a high X3.
• Model specification, for example, adding polynomial terms to a regression model, especially when the range of the X variable is small.
• An overdetermined model. This happens when the model has more explanatory variables than the number of observations.
Why Multicollinearity Occurs?

• Psychological factor: Consider a situation in which customer loyalty to a coffee shop belonging to a popular chain is modelled using the following predictors:
  – Age
  – Frequency of visits in a month
  – Location of shop: Residential Area / Commercial Area / University Area
  – Satisfaction with quality of product: 1-5
  – Satisfaction with the chain: 1-5
• The two satisfaction ratings are likely to be strongly correlated, since they reflect the same underlying attitude towards the brand.
Why Multicollinearity Occurs?

• By strategy/design: Consider the problem of the impact of advertisement on increasing the sales of fashion retail stores belonging to a popular chain. Predictors include advertising and volume, among other shop-level predictors.
  – Advertising: Amount of money spent
  – Volume: Size of the store
• Upon analysis and further examination, you observe that the predictors advertising and volume are correlated because a high ad budget is allocated to cities with smaller stores and a low ad budget to cities with larger stores.
Detect Multicollinearity

• Find the Variance Inflation Factor (VIF) for each variable using
  VIFᵢ = 1 / (1 − Rᵢ²)
  where Rᵢ² is the squared coefficient of determination obtained by regressing the i-th independent variable on the other independent variables.
• When Rᵢ² is equal to 0, and therefore VIF is 1, the i-th independent variable is not correlated with the remaining ones, meaning that multicollinearity does not exist.
Detect Multicollinearity

• VIF equal to 1: the variables are not correlated.
• VIF between 1 and 5: the variables are moderately correlated.
• VIF greater than 5: the variables are highly correlated.
• The higher the VIF, the higher the possibility that multicollinearity exists, and further investigation is required. When VIF is higher than 10, there is significant multicollinearity that needs to be corrected.
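A minimal sketch of the VIF calculation using statsmodels' variance_inflation_factor; the data frame and column names below are hypothetical stand-ins for the Mitlised-style variables:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
df = pd.DataFrame({"AdvertisingSpend": rng.normal(100, 10, 50)})
df["PromotionalActivity"] = 0.9 * df["AdvertisingSpend"] + rng.normal(0, 2, 50)
df["StoreSize"] = rng.normal(500, 50, 50)

# VIF_i = 1 / (1 - R_i^2); include a constant so each auxiliary regression has an intercept.
X = sm.add_constant(df)
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)   # AdvertisingSpend and PromotionalActivity should show large VIFs
```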
Reduce Multicollinearity

• Dropping a variable(s), at the risk of specification bias
• Transformation of variables
• Additional or new data
• Principal component analysis
Reduce Multicollinearity: Using PCA

• Multicollinearity occurs when there is a strong linear relationship between 2 or more predictors in a model.
• It is a problem because it increases the standard errors of the regression coefficients, leading to noisy estimates.
• Principal Components Analysis (PCA): Replace the correlated variables with a set of equal or fewer uncorrelated variables that represent their shared part, called principal components, and use these as predictors in the regression model.
PCA

• Principal components are linear combinations of the variables in the model.
• After finding the principal components, either
  • retain the initial principal components that cumulatively explain almost 90% of the variability, or
  • retain all of them (if the number of variables is low).
• Then run MLR on the retained components.
• More on PCA later.
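A minimal sketch of principal components regression with scikit-learn, retaining enough components to explain about 90% of the variance; the data and variable names are illustrative assumptions, not from the slides:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # highly correlated with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1 + 2 * x1 + 3 * x3 + rng.normal(size=n)

# Standardize, keep components explaining ~90% of the variance, then regress on them.
pcr = make_pipeline(StandardScaler(), PCA(n_components=0.90), LinearRegression())
pcr.fit(X, y)
print(pcr.named_steps["pca"].n_components_)            # number of components retained
print(pcr.named_steps["pca"].explained_variance_ratio_)
print(pcr.score(X, y))                                  # R^2 of the PCR fit
```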
Steps

• Divide the data into Train and Test sets in a 70:30 ratio.
• On the Train data: use MLR (if applicable) with all predictors.
• Use the Stepwise method to weed out insignificant variables in one go. This uses AIC to find the set of important predictor variables.
• Check the MLR assumptions.
• If one or more assumptions are not met, use an appropriate method so that the MLR assumptions are satisfied.
• Use the final MLR model from the Train data to predict the response variable of the Test data to find the prediction accuracy of your model.
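A rough sketch of this workflow in Python. The stepwise step is shown as a simple backward elimination by AIC, since the slides do not prescribe a particular implementation, and the dataset and column names are placeholders:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

def backward_aic(y, X):
    """Drop one predictor at a time as long as doing so lowers the AIC."""
    cols = list(X.columns)
    best = sm.OLS(y, sm.add_constant(X[cols])).fit()
    improved = True
    while improved and len(cols) > 1:
        improved = False
        for c in list(cols):
            trial_cols = [k for k in cols if k != c]
            trial = sm.OLS(y, sm.add_constant(X[trial_cols])).fit()
            if trial.aic < best.aic:
                best, cols, improved = trial, trial_cols, True
                break
    return best, cols

# Placeholder data; in practice, load your own dataset here.
rng = np.random.default_rng(5)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["x1", "x2", "x3", "x4"])
df["y"] = 1 + 2 * df["x1"] - df["x3"] + rng.normal(size=200)

train, test = train_test_split(df, test_size=0.30, random_state=0)  # 70:30 split
model, kept = backward_aic(train["y"], train[["x1", "x2", "x3", "x4"]])

# Check assumptions on the training residuals (e.g., residual plots), then score on test data.
pred = model.predict(sm.add_constant(test[kept]))
rmse = np.sqrt(np.mean((test["y"] - pred) ** 2))
print(kept, rmse)
```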
