Chapter 4
SELECTING THE BEST REGRESSION MODEL
4.1 Introduction
In multiple linear regression, the number of regression variables is more than one, but some of them may be irrelevant and can be removed from the regression equation.
The basic idea behind finding the best regression model is that we need to find an appropriate subset of regressors that explains the variability in the response variable well. The problem of finding this subset of regression variables is called the variable selection problem.
While choosing a subset of explanatory variables, there are two possible options:
1. In order to make the model as realistic as possible, the analyst may include as many explanatory variables as possible.
2. In order to make the model as simple as possible, one may include only a few explanatory variables.
Accordingly, there can be two types of incorrect model specification: including irrelevant variables or omitting relevant ones.
After selecting subsets of candidate variables for the model, a question arises: how do we judge which subset yields the better regression model? Various criteria have been proposed in the literature to evaluate and compare subset regression models.
4.2 Coefficient of Multiple Determination (R^2)
The coefficient of determination is the square of the multiple correlation coefficient between the study variable y and the set of explanatory variables, denoted as R_p^2 for a subset model with p terms. The coefficient of determination based on such a model is

    R_p^2 = SS_reg(p) / SS_T = 1 - SS_res(p) / SS_T,

where SS_reg(p) and SS_res(p) are the sums of squares due to regression and residuals, respectively, in a subset model with p terms, and SS_T is the total sum of squares. Since there are k explanatory variables available and we select only p of them, there are C(k, p) possible choices of subsets.
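As a concrete illustration, R^2 for any subset model can be computed by ordinary least squares. The sketch below is in Python with NumPy; the function name r_squared is a naming choice made here, not part of the notes.

```python
import numpy as np

def r_squared(X, y):
    """Coefficient of determination R^2 = 1 - SS_res / SS_T
    for a least-squares fit that includes an intercept term."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X])          # prepend intercept column
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)  # OLS coefficients
    ss_res = float(((y - Z @ beta) ** 2).sum())   # residual sum of squares
    ss_tot = float(((y - y.mean()) ** 2).sum())   # total sum of squares
    return 1.0 - ss_res / ss_tot
```

Evaluating r_squared over itertools.combinations of the available columns enumerates all C(k, p) subset models of a given size p.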
So proceed as follows:
- Note that R_p^2 cannot decrease as more regressors are added. If the increase in R_p^2 from adding a variable is small, then stop and choose that value of p for the subset regression.
- If the increase in R_p^2 is still large, then keep on adding variables up to the point where an additional variable does not produce a large change in R_p^2, i.e., the increment in R_p^2 becomes small.
- To find such a value of p, create a plot of R_p^2 versus p.
The curve will look like the one in the following figure.
The adjusted coefficient of determination has certain advantages over the usual coefficient of determination. The adjusted coefficient of determination based on a p-term model is

    Adj R_p^2 = 1 - ((n - 1) / (n - p)) (1 - R_p^2).

A model is said to have a better fit if its residuals are small. The residual mean square based on a p-variable subset regression model is defined as

    MS_res(p) = SS_res(p) / (n - p).

Similarly, as p increases, MS_res(p) initially decreases, then stabilizes, and finally may increase if the reduction in SS_res(p) from adding a regressor is not sufficient to compensate for the loss of one degree of freedom in the factor (n - p).
When MS_res(p) is plotted versus p, the curve looks like the one in the following figure.
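A minimal sketch of these two quantities (Python/NumPy; fit_stats is a hypothetical helper name chosen here): for a subset model with p estimated coefficients, intercept included, it returns R_p^2, Adj R_p^2, and MS_res(p).

```python
import numpy as np

def fit_stats(X, y):
    """Return (R^2, adjusted R^2, MS_res) for a subset model with intercept.
    Here p counts all estimated coefficients, intercept included."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X])
    p = Z.shape[1]
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    ss_res = float(((y - Z @ beta) ** 2).sum())
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (n - 1) / (n - p) * (1 - r2)   # penalizes extra regressors
    ms_res = ss_res / (n - p)                    # residual mean square
    return r2, adj_r2, ms_res
```

Because (n - 1)/(n - p) > 1 whenever p > 1, the adjusted value is always below R_p^2 for an imperfect fit, which is exactly the penalty that prevents it from increasing automatically with p.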
So:
- Plot MS_res(p) versus p.
- Choose p corresponding to the minimum value of MS_res(p); such a minimum value of MS_res(p) will produce an Adj R_p^2 with maximum value.
- Alternatively, choose p near the point where the smallest value of MS_res(p) turns upward.
Mallows' C_p statistic for a subset model with p terms is defined as

    C_p = SS_res(p) / sigma_hat^2 - n + 2p,

where sigma_hat^2 is usually estimated by the residual mean square of the full model. When different subset models are considered, the models with smaller C_p are considered better than those with higher C_p, so a lower C_p is preferable.
In a plot of C_p versus p, the line C_p = p is a straight line passing through the origin; regression equations with little bias have C_p values falling near this line. The C_p plot looks as in the following figure.
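The C_p computation can be sketched as follows (Python/NumPy; the names mallows_cp and ss_res are chosen here, and sigma_hat^2 is taken from the full model's residual mean square as described above). Note that the full model always satisfies C_p = p by construction.

```python
import numpy as np

def ss_res(X, y):
    """Residual sum of squares of an OLS fit with intercept."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return float(((y - Z @ beta) ** 2).sum())

def mallows_cp(X_sub, X_full, y):
    """C_p = SS_res(p) / sigma_hat^2 - n + 2p, where sigma_hat^2 is the
    residual mean square of the full model."""
    n = len(y)
    p = X_sub.shape[1] + 1                 # coefficients incl. intercept
    k = X_full.shape[1] + 1
    sigma2 = ss_res(X_full, y) / (n - k)   # MS_res of the full model
    return ss_res(X_sub, y) / sigma2 - n + 2 * p
```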
4.6 Akaike’s Information Criterion (AIC)
The Akaike information criterion balances the goodness of fit of a model against its complexity. For a linear regression model based on p terms, it is defined as

    AIC = n ln(SS_res(p) / n) + 2p,

where n is the number of observations. Among candidate subset models, the one with the smallest AIC is preferred.
Similar to AIC, the Bayesian information criterion is based on maximizing the posterior distribution of the model given the observations. In the case of the linear regression model, it is defined as

    BIC = n ln(SS_res(p) / n) + p ln(n).

Since ln(n) > 2 for n ≥ 8, BIC penalizes additional regressors more heavily than AIC.
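Both criteria can be computed from the residual sum of squares of the fitted subset model. A brief sketch (Python/NumPy; aic_bic is an illustrative name, and additive constants common to all candidate models are dropped, as is usual when comparing subsets):

```python
import numpy as np

def aic_bic(X, y):
    """AIC = n ln(SS_res/n) + 2p and BIC = n ln(SS_res/n) + p ln(n)
    for an OLS model with intercept; constants common to all models dropped."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X])
    p = Z.shape[1]
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    ss = float(((y - Z @ beta) ** 2).sum())
    aic = n * np.log(ss / n) + 2 * p
    bic = n * np.log(ss / n) + p * np.log(n)
    return aic, bic
```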
Choose a suitable criterion for model selection and evaluate each fitted regression equation against the chosen criterion.
4.9.2 Stepwise Regression Techniques
This methodology chooses the explanatory variables of the subset model in steps, either adding one variable at a time or deleting one variable at a time. Based on this, there are three procedures:
- Forward selection
- Backward elimination
- Stepwise regression
These are basically computer-intensive procedures and are executed using software.
4.9.3 Forward Selection Procedure:
This methodology starts with no explanatory variable in the model except an intercept term. It adds variables one by one and tests the fitted model at each step using a suitable criterion. It has the following steps:
1. All possible models with one regressor are considered and the F-statistic for each regressor is computed. The regressor having the highest F-statistic value is added to the model if this value exceeds a preselected threshold F_IN (the "F-to-enter" value).
2. Partial F-statistics are computed for all of the remaining regressors in the presence of the previously selected regressors, and the one yielding the highest partial F is added to the model if it exceeds F_IN.
3. Forward selection terminates when the highest partial F-statistic at a particular stage does not exceed F_IN, or when the last candidate regressor is added.
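The steps above can be sketched as follows (Python/NumPy; the names forward_select and partial_f, and the default threshold f_in = 4.0, roughly a tabulated F value, are illustrative choices, not part of the notes):

```python
import numpy as np

def partial_f(X_cur, x_new, y):
    """Partial F-statistic for adding column x_new to the model that
    already contains the columns in the list X_cur (plus an intercept)."""
    n = len(y)

    def ss_res(cols):
        Z = np.column_stack([np.ones(n)] + cols)
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        return float(((y - Z @ beta) ** 2).sum())

    ss_old = ss_res(list(X_cur))
    ss_new = ss_res(list(X_cur) + [x_new])
    df = n - (len(X_cur) + 2)            # residual df of the larger model
    return (ss_old - ss_new) / (ss_new / df)

def forward_select(X, y, f_in=4.0):
    """Add, one at a time, the regressor with the largest partial F,
    stopping when no candidate exceeds the threshold f_in."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        f_best, j_best = max(
            (partial_f([X[:, j] for j in selected], X[:, i], y), i)
            for i in remaining)
        if f_best <= f_in:
            break
        selected.append(j_best)
        remaining.remove(j_best)
    return selected
```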
4.9.4 Backward Elimination Procedure:
The backward elimination methodology begins with all explanatory variables and keeps deleting one variable at a time until a suitable model is obtained. Its steps are:
1. Compute the partial F-statistic for each regressor in the presence of the other regressors in the model.
2. The regressor with the smallest partial F-statistic is removed if this value does not exceed a preselected threshold F_OUT (the "F-to-remove" value).
3. Partial F-statistics are computed for the reduced model and the process repeats, terminating when the smallest partial F exceeds F_OUT.
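A matching sketch for this procedure (Python/NumPy; backward_eliminate and the threshold f_out = 4.0 are illustrative choices made here):

```python
import numpy as np

def ss_res(cols, y):
    """Residual sum of squares of an intercept model on the given columns."""
    Z = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return float(((y - Z @ beta) ** 2).sum())

def backward_eliminate(X, y, f_out=4.0):
    """Start from the full model and repeatedly drop the regressor with the
    smallest partial F, stopping when every partial F exceeds f_out."""
    n = len(y)
    selected = list(range(X.shape[1]))
    while selected:
        df = n - (len(selected) + 1)     # residual df of the current model
        ss_full = ss_res([X[:, j] for j in selected], y)
        f_min, j_min = min(
            ((ss_res([X[:, i] for i in selected if i != j], y) - ss_full)
              / (ss_full / df), j)
            for j in selected)
        if f_min > f_out:
            break
        selected.remove(j_min)
    return selected
```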
4.9.5 Stepwise Regression Procedure:
Stepwise regression combines forward selection and backward elimination: at each step a variable may be added, and the variables already in the model are re-examined and may be removed. Its steps are:
1. All possible models with one regressor are considered and the F-statistic for each regressor is computed. The regressor having the highest F-statistic value is added to the model.
2. Partial F-statistics are computed for all of the remaining regressors in the presence of the previously selected regressors, and the one yielding the highest partial F is added to the model if it exceeds the threshold F_IN.
3. All variables in the model are then evaluated with partial F-tests to see if each one is still significant. At this step, any regressor that is no longer significant, i.e., whose partial F falls below the threshold F_OUT, is dropped from the model.
The stepwise selection terminates when no remaining regressor yields a partial F greater than F_IN and all regressors in the model remain significant.
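A compact sketch of the alternation between forward and backward steps (Python/NumPy; the names stepwise and ss_res and the thresholds f_in = f_out = 4.0 are choices made here for illustration):

```python
import numpy as np

def ss_res(cols, y):
    """Residual sum of squares for an intercept model on the given columns."""
    Z = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return float(((y - Z @ beta) ** 2).sum())

def stepwise(X, y, f_in=4.0, f_out=4.0):
    """Alternate a forward step (largest partial F > f_in) with backward
    checks (drop any variable whose partial F falls below f_out)."""
    n = len(y)
    selected, remaining = [], list(range(X.shape[1]))

    def pf(base, j):
        # partial F for regressor j on top of the columns listed in base
        ss0 = ss_res([X[:, k] for k in base], y)
        ss1 = ss_res([X[:, k] for k in base] + [X[:, j]], y)
        df = n - (len(base) + 2)
        return (ss0 - ss1) / (ss1 / df)

    while remaining:
        f_best, j_best = max((pf(selected, j), j) for j in remaining)
        if f_best <= f_in:
            break                        # no candidate enters the model
        selected.append(j_best)
        remaining.remove(j_best)
        # backward check: re-test every variable already in the model
        for j in list(selected):
            others = [k for k in selected if k != j]
            if pf(others, j) < f_out:
                selected.remove(j)
                remaining.append(j)
    return selected
```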
General comments:
1. None of the methods among forward selection, backward elimination and stepwise regression guarantees the best subset model.
2. The order in which the explanatory variables enter or leave the model does not indicate their order of importance.
3. In forward selection, no explanatory variable can be removed once it has entered the model. Similarly, in backward elimination, no explanatory variable can be added back once it has been removed.
The following data set of n = 13 observations (the classical Hald cement data; the column labels are added here for readability) can be used to illustrate these procedures. The columns are four explanatory variables x1-x4 and the response y:

 x1  x2  x3  x4      y
  7  26   6  60   78.5
  1  29  15  52   74.3
  1  56   8  20  104.3
 11  31   8  47   87.6
  7  52   6  33   95.9
 11  55   9  22  109.2
  3  71  17   6  102.7
  1  31  22  44   72.5
  2  54  18  22   93.1
 21  47   4  26  115.9
  1  40  23  34   83.8
 11  66   9  12  113.3
 10  68   8  12  109.4