
CHAPTER 6

MULTICOLLINEARITY

6.1 Introduction
A basic assumption in the multiple linear regression model is that the rank of the matrix of observations on the
explanatory variables equals the number of explanatory variables; in other words, this matrix is of full
column rank. This in turn implies that the explanatory variables are independent, i.e., there is no linear
relationship among the explanatory variables. In such a case the explanatory variables are said to be orthogonal.
In many practical situations, the explanatory variables may not remain independent, for various reasons.
The situation where the explanatory variables are highly intercorrelated is referred to as multicollinearity. The
problem of multicollinearity exists when two or more regressor variables are strongly correlated or linearly
dependent.
Example 6.1:
Can we use the given data set to fit the model y = Xβ + ε?

Here the matrix X'X is singular, that is, (X'X)⁻¹ does not exist, so the least squares estimator (X'X)⁻¹X'y cannot be computed.

6.2 Sources of Multicollinearity
(a) Method of data collection:
It is expected that the data are collected over the whole cross-section of the explanatory variables. It may happen that the data
are collected only over a subspace of the explanatory variables in which the variables are nearly linearly dependent. For
example, sampling is done only over a limited range of the explanatory variables in the population.
 
(b) Model and population constraints:
There may exist some constraints on the model or on the population from which the sample is drawn. The
sample may be generated from that part of the population in which the explanatory variables are nearly linear combinations of one another.
(c) Existence of identities or definitional relationships:
There may exist some relationships among the variables which are due to the definition of the variables or
an identity relation among them. For example, if data are collected on income, saving and
expenditure, then income = saving + expenditure. Such a relationship will not change even when the sample
size increases.
(d) Imprecise formulation of the model:
The formulation of the model may be unnecessarily complicated. For example, quadratic (or
polynomial) terms or cross-product terms may appear as explanatory variables. For instance, let there be three
explanatory variables x1, x2 and x3, and suppose their cross-product terms x1x2, x1x3 and x2x3 are also added. Then the number of explanatory variables rises to 6.
(e) An over-determined model:
Sometimes, due to over-enthusiasm, a large number of variables is included in the model to make it more
realistic, and consequently the number of observations becomes smaller than the number of explanatory
variables. Such a situation can arise in medical research, where the number of patients may be small but
information is collected on a large number of variables. As another example, if there is time-series data for
50 years on consumption patterns, it cannot be expected that the consumption pattern remains the same for
50 years. So the better option is to choose a smaller number of variables.

6.3 Effect of Multicollinearity (Problems due to multicollinearity)


(a) Strong multicollinearity between regressors results in large variances and covariances of the regression
coefficients (see the simulation sketch after this list).

(b) Multicollinearity tends to produce least squares estimates that are far from the true parameter values.

(c) A model coefficient may have a negative sign when a positive sign is expected.

(d) The global F-test may be highly significant while none of the regressors is significant in the partial F-
tests.

(e) Different model selection procedures yield different models.
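The following minimal Python sketch (not part of the notes; the sample size, the correlation value 0.99 and the coefficient values are illustrative assumptions) demonstrates effects (a) and (b): when the two regressors are strongly correlated, the least squares estimate of one coefficient varies far more from sample to sample than when the regressors are nearly orthogonal.

# Simulation sketch: sampling spread of a least squares coefficient under
# nearly orthogonal regressors versus strongly correlated regressors.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 30, 2000
beta = np.array([1.0, 2.0, 3.0])          # intercept, beta1, beta2 (assumed values)

def sd_of_b1(rho):
    """Empirical standard deviation of the LSE of beta1 for regressor correlation rho."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    b1_hats = []
    for _ in range(reps):
        X = rng.multivariate_normal(np.zeros(2), cov, size=n)
        X = np.column_stack([np.ones(n), X])
        y = X @ beta + rng.normal(size=n)
        b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        b1_hats.append(b_hat[1])
    return np.std(b1_hats)

print("sd of beta1-hat, rho = 0.0 :", round(sd_of_b1(0.0), 3))   # small
print("sd of beta1-hat, rho = 0.99:", round(sd_of_b1(0.99), 3))  # much larger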


Example 6.1:
Consider the data in the following table

 x1    x2     y
  1     8     6
  4     2     8
  9    -8     1
 11   -10     0
  3     6     5
  8    -6     3
  5     0     2
 10   -12    -4
  2     4    10
  7    -2    -3
  6    -4     5

6.4 Techniques of Detecting Multicollinearity


(a) Examination of the correlation matrix
A simple measure of multicollinearity is inspection of the off-diagonal elements r_ij of X'X in correlation form.
If regressors x_i and x_j are nearly linearly dependent, |r_ij| will be close to unity, which indicates a multicollinearity problem.
Examining the correlation matrix is helpful in detecting linear dependence between pairs of
regressors.
Examining the correlation matrix is not helpful in detecting a multicollinearity problem arising from
linear dependence among more than two regressors.
Example 6.2

   y     x1   x2   x3   x4     x5      x6
10.006    8    1    1    1    0.541  -0.009
 9.737    8    1    1    0    0.130   0.070
15.087    8    1    1    0    2.116   0.115
 8.422    0    0    9    1   -2.397   0.252
 8.625    0    0    9    1   -0.046   0.017
16.289    0    0    9    1    0.365   1.509
 5.958    2    7    0    1    1.996  -0.865
 9.313    2    7    0    1    0.228  -0.055
12.960    2    7    0    1    1.380   0.502
 5.541    0    0    0   10   -0.798  -0.399
 8.756    0    0    0   10    0.257   0.101
10.937    0    0    0   10    0.440   0.432
The X'X matrix in correlation form for these data shows no off-diagonal element |r_ij| close to unity:
none of the pairwise correlations is suspiciously large, so we have no indication of the near-linear
dependence (see the sketch below).
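A Python sketch of this check, assuming (as labelled in the table above) that the first column is the response y and the remaining six columns are the regressors x1, ..., x6:

# Pairwise correlations of the regressors in Example 6.2: X'X in correlation
# form has no off-diagonal entry close to 1, despite the near-linear
# dependence present in the data.
import numpy as np

data = np.array([
    [10.006, 8, 1, 1,  1,  0.541, -0.009],
    [ 9.737, 8, 1, 1,  0,  0.130,  0.070],
    [15.087, 8, 1, 1,  0,  2.116,  0.115],
    [ 8.422, 0, 0, 9,  1, -2.397,  0.252],
    [ 8.625, 0, 0, 9,  1, -0.046,  0.017],
    [16.289, 0, 0, 9,  1,  0.365,  1.509],
    [ 5.958, 2, 7, 0,  1,  1.996, -0.865],
    [ 9.313, 2, 7, 0,  1,  0.228, -0.055],
    [12.960, 2, 7, 0,  1,  1.380,  0.502],
    [ 5.541, 0, 0, 0, 10, -0.798, -0.399],
    [ 8.756, 0, 0, 0, 10,  0.257,  0.101],
    [10.937, 0, 0, 0, 10,  0.440,  0.432],
])
X = data[:, 1:]                       # the six regressors
R = np.corrcoef(X, rowvar=False)      # X'X in correlation form
print(np.round(R, 3))                 # no off-diagonal |r_ij| near 1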
(b) Multicollinearity can also be detected from the eigenvalues of X'X in correlation form.
For a model with p regressors there are p eigenvalues λ_1, λ_2, ..., λ_p.
If there are one or more near-linear dependences in the data, then one or more of the eigenvalues will be small.
Define the condition number of X'X as

κ = λ_max / λ_min

As a general rule:
κ < 100 indicates no serious problem with multicollinearity.
100 ≤ κ ≤ 1000 indicates moderate to strong multicollinearity.
κ > 1000 indicates a severe problem with multicollinearity.
The condition indices of X'X are

κ_j = λ_max / λ_j,   j = 1, 2, ..., p

Clearly, the largest condition index equals the condition number κ.


The number of large condition indices is a useful measure of the number of near-linear dependences in X'X.

Among the eigenvalues of the Webster data of Example 6.2, there is only one small eigenvalue.


The condition number κ = λ_max / λ_min for these data exceeds 1000, which indicates a severe problem with multicollinearity.

Of the condition indices κ_j = λ_max / λ_j, only one exceeds 1000, so we conclude that there is only one near-linear
dependence in the data.
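A self-contained Python sketch of these eigenvalue diagnostics. The data are hypothetical (three regressors in which x3 is almost x1 + x2), not the Webster data, so the printed values are only illustrative:

# Eigenvalues, condition number and condition indices of the correlation
# matrix for a hypothetical case with one near-linear dependence (x3 ~ x1 + x2).
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.01, size=n)   # near-linear dependence
X = np.column_stack([x1, x2, x3])

R = np.corrcoef(X, rowvar=False)          # X'X in correlation form
lam = np.linalg.eigvalsh(R)               # eigenvalues in ascending order
kappa = lam.max() / lam.min()             # condition number
indices = lam.max() / lam                 # condition indices kappa_j
print("eigenvalues      :", np.round(lam, 5))
print("condition number :", round(kappa))          # far above 1000 here
print("condition indices:", np.round(indices, 1))  # only one exceeds 1000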
Eigensystem analysis can also be used to identify the nature of the near-linear dependences in the data.

(c) The correlation matrix X'X may be decomposed as

X'X = T Λ T'

where Λ = diag(λ_1, λ_2, ..., λ_p) is a diagonal matrix of the eigenvalues of X'X, and T is a p × p orthogonal
matrix whose columns t_1, t_2, ..., t_p are the eigenvectors associated with the eigenvalues λ_1, λ_2, ..., λ_p.
If an eigenvalue λ_j is close to zero, the elements of the associated eigenvector t_j describe the nature of the
near-linear dependence.

For the Webster data, the smallest eigenvalue is close to zero, and the elements of its eigenvector identify the
regressors involved in the near-linear dependence (a sketch follows).
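A Python sketch of this eigenvector interpretation, again using the hypothetical three-regressor example (x3 ≈ x1 + x2) rather than the Webster data:

# The eigenvector associated with the near-zero eigenvalue of the correlation
# matrix reveals the near-linear dependence x1 + x2 - x3 ~ 0 (in standardized form).
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.01, size=n)   # near-linear dependence
R = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)

lam, T = np.linalg.eigh(R)            # columns of T are the eigenvectors t_j
t_min = T[:, 0]                       # eigenvector of the smallest eigenvalue
print("smallest eigenvalue:", lam[0])
print("its eigenvector    :", np.round(t_min, 3))
# All three elements are sizable, and the x3 element has the opposite sign to
# the x1 and x2 elements, pointing to the dependence among x1, x2 and x3.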


(d) Variance inflation factors

The variance of the least squares estimate of the jth regression coefficient is

Var = σ² C_jj,   where C_jj = 1 / (1 − R_j²)

is the jth diagonal element of (X'X)⁻¹ in correlation form, and R_j² is the coefficient of multiple determination
obtained when x_j is regressed on the remaining regressors.


If x_j is nearly orthogonal to the remaining regressors, R_j² is small and C_jj is close to unity.
If x_j is nearly linearly dependent on some subset of the remaining regressors, R_j² is close to unity and C_jj is large.

C_jj can be viewed as the factor by which the variance of the jth coefficient estimate is increased due to near-linear
dependence among the regressors.

The variance inflation factor associated with regressor x_j is therefore defined by

VIF_j = C_jj = 1 / (1 − R_j²)

A large value of VIF_j indicates possible multicollinearity associated with x_j.

In general:
VIF_j > 5 indicates a possible multicollinearity problem.
VIF_j > 10 indicates that multicollinearity is almost certainly a problem.
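A Python sketch of the VIF computation, using the fact that VIF_j equals the jth diagonal element of the inverse of the regressors' correlation matrix; the same hypothetical three-regressor data are reused, so the printed values are only illustrative:

# VIF_j = [R^(-1)]_jj = 1 / (1 - R_j^2): diagonal of the inverse correlation matrix.
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.01, size=n)   # near-linear dependence
X = np.column_stack([x1, x2, x3])

R = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(R))
print("VIFs:", np.round(vif, 1))   # all far above 10 for this construction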
 
6.5 Dealing with Multicollinearity
(a) Collect additional data
Collecting additional data has been suggested as the best method of dealing with multicollinearity.
The additional data should be collected in a manner designed to break up the multicollinearity in the existing data.
(b) Remove a regressor from the model
If two regressors are linearly dependent, they contain redundant information.
Thus we can pick one regressor to keep in the model and discard the other.
If x_i and x_j are nearly linearly dependent, eliminating one of them may help to reduce the effect of
multicollinearity.
Eliminating regressors to reduce multicollinearity may, however, damage the predictive power of the model.

(c) Collapse variables

Combine two or more variables which are nearly linearly dependent into a single composite variable, as in the sketch below.
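One way to do this (a hypothetical sketch, not a method prescribed in these notes) is to replace two nearly dependent regressors with their first principal component after standardizing:

# Collapse two nearly dependent regressors x1, x2 into one composite variable
# (their first principal component on the standardized scale).
import numpy as np

rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
x2 = 2 * x1 + rng.normal(scale=0.05, size=n)     # x2 nearly proportional to x1

Z = np.column_stack([x1, x2])
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)         # standardize both regressors
lam, vecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
composite = Z @ vecs[:, -1]                      # first principal component scores
print("share of variance carried by the composite:", round(lam[-1] / lam.sum(), 3))
# Fit the regression model with `composite` in place of x1 and x2.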
