Unit 5. Model Selection
María José Olmo Jiménez
Econometrics
1 Introduction
• When we estimate a linear regression model we assume that the regression model used
in the analysis is “correctly” specified.
• If the model is not “correctly” specified, we encounter the problem of model specification error or model specification bias.
• In this unit we take a close and critical look at this assumption, because searching for
the correct model is not trivial. In particular we examine the following questions:
1. How does one go about finding the “correct” model? In other words, what are the
criteria in choosing a model for empirical analysis?
2. What types of model specification errors is one likely to encounter in practice?
3. What are the consequences of specification errors?
4. How does one detect specification errors? In other words, what are some of the
diagnostic tools that one can use?
5. Having detected specification errors, what remedies can one adopt and with what
benefits?
6. How does one evaluate the performance of competing models?
• Be encompassing and as simple as possible; that is, other models cannot be an improvement over the chosen model.
• Be consistent with theory, that is, it must make economic sense (Do the coefficients
have the expected signs?)
• Be jointly relevant; that is, the null hypothesis of the overall significance test is rejected.
• Have all regressors individually relevant; that is, the null hypothesis of each individual significance test is rejected.
• Exhibit parameter constancy; that is, the values of the parameters should be stable.
Otherwise, forecasting will be difficult.
• Exhibit data coherency, that is, the residuals estimated from the model must be
coherent with the assumptions of the model about the error terms.
2 Model specification errors
Strictly speaking, a specification error is committed when some of the assumptions of the
linear regression model are violated. The violation of the assumptions related to the error terms
(homoscedasticity, no autocorrelation and normality) will be studied in the following units. So,
in this unit we focus on the specification errors related to the explanatory variables and the
functional form.
Remark: The model can also present problems related to the sample information, such as
multicollinearity, measurement errors, and outliers and influential observations.
Example
Yt = β0 + β1 X1t + β2 X2t + ut
where Yt is the actual inflation rate at time t (in %), X1t the actual unemployment rate
prevailing at time t (in %) and X2t the expected inflation rate at time t (in %).
Suppose that for some reason we fit the model without X2 , which is a relevant variable:
• The estimates of the regression coefficients change, in fact, βb1 changes from negative
to positive.
So, the results obtained would lead us to wrong conclusions.
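The sign flip described above can be reproduced with a small simulation. This is an illustrative sketch (not the inflation data themselves): the true model has two correlated regressors, and omitting the relevant X2 biases the slope on X1 from negative to positive, in line with the omitted-variable bias formula plim α̂1 = β1 + β2·cov(X1, X2)/var(X1).

```python
import numpy as np

# Simulate a true model Y = b0 + b1*X1 + b2*X2 + u with X1, X2 correlated.
rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)   # X2 correlated with X1
u = rng.normal(scale=0.5, size=n)
y = 1.0 - 2.0 * x1 + 3.0 * x2 + u               # true b1 = -2, b2 = 3

# OLS with both regressors recovers b1 close to -2.
X_full = np.column_stack([np.ones(n), x1, x2])
b_full = np.linalg.lstsq(X_full, y, rcond=None)[0]

# OLS omitting X2: the slope on X1 absorbs b2 * cov(X1, X2)/var(X1).
X_short = np.column_stack([np.ones(n), x1])
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]

print(b_full[1])    # close to -2
print(b_short[1])   # biased: roughly -2 + 3*0.8 = 0.4, the sign has flipped
```

With these (assumed) parameter values the estimated slope changes from negative to positive, exactly the kind of misleading result described in the example.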
• Subjective tools: If in fact there are such errors, a plot of the residuals will exhibit a
noticeable pattern (trend, cycles...).
• Objective tools:
Possible solutions
The solution seems very simple: introduce the omitted variables into the model. However,
when we specify an econometric model we follow economic theory and our common sense,
so we are not aware of having omitted any variable. Moreover, the omission of relevant variables
may be confused with other violations of the basic assumptions of the LRM that require different
treatments.
Thus, if we know what the omitted variable is
• and we have data about this variable, we can include it in the model.
• but we do not have data about this variable:
– we can replace it by a proxy variable, that is, a variable highly correlated with
the omitted one. In this way, the bias caused by the omitted variable is reduced.
The lagged dependent variable is often used as a proxy variable.
– we can use panel data, since they allow for reducing the bias caused by the omitted
variable when this variable is constant over time.
• Now let us assume that Y = β0 + β1 X1 + u is the true model, but we fit
Y = α0 + α1 X1 + α2 X2 + u instead. Then we commit the specification error of including
an unnecessary variable in the model; that is, we are overfitting the model. This error
usually happens because the researcher is not sure about the role of X2 in the model.
– The OLS estimators of the regression coefficients are unbiased and consistent.
– The error variance σ 2 is correctly estimated.
– The usual confidence intervals and hypothesis-testing procedures remain valid.
However,
– the estimators of the αj will generally be inefficient; that is, their variances will
generally be larger than those of the estimators of the βj in the true model.
– Therefore, the only penalty we pay for the inclusion of the superfluous variable
is that the estimated variances of the coefficients are larger, and as a result our
probability inferences about the parameters are less precise.
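The efficiency loss from overfitting can also be checked by Monte Carlo. In this sketch (with assumed, illustrative parameters) X2 is irrelevant but collinear with X1: both estimators of the slope on X1 are centred on the true value, yet the one from the overfitted model has a visibly larger variance.

```python
import numpy as np

# Monte Carlo: variance of the X1 slope with and without a superfluous X2.
rng = np.random.default_rng(1)
n, reps = 100, 2000
est_true, est_over = [], []
for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.9 * x1 + rng.normal(scale=0.3, size=n)  # irrelevant but collinear
    y = 2.0 + 1.5 * x1 + rng.normal(size=n)        # true model has only X1
    Xt = np.column_stack([np.ones(n), x1])
    Xo = np.column_stack([np.ones(n), x1, x2])
    est_true.append(np.linalg.lstsq(Xt, y, rcond=None)[0][1])
    est_over.append(np.linalg.lstsq(Xo, y, rcond=None)[0][1])

# Both estimators are unbiased (centred on 1.5) ...
print(np.mean(est_true), np.mean(est_over))
# ... but the overfitted slope is noisier.
print(np.var(est_over) > np.var(est_true))   # True
```

The variance inflation grows with the correlation between the superfluous regressor and the relevant ones, which is why the penalty for overfitting is precision rather than bias.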
Example 3 of Unit 3
In Unit 3 we studied the demand for a product in terms of its price per unit, considering a
third-degree polynomial regression model:
Y = β0 + β1 X + β2 X 2 + β3 X 3 + u
Let us consider a quadratic model since the term X 3 is not individually relevant. The results
are now:
Coefficient Estimate s.e. texp p−value
β0 1330.41 179.565 7.40905 0.0001
β1 -155.467 27.8676 -5.57878 0.0005
β2 4.86612 1.0303 4.72163 0.0015
X 3 was clearly an irrelevant variable since:
• None of the slopes was significant at the 5% significance level, because their standard
errors were much larger than in the quadratic model.
So, the results of the cubic model would lead us to wrong conclusions.
• If the elimination of this variable helps to clarify the model, we can remove it.
• In practice, if eliminating a variable was not a mistake, the estimates of the
regression coefficients will be more precise, that is, their standard errors will be smaller.
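Since the Unit 3 data are not reproduced here, the cubic-versus-quadratic comparison can be sketched with simulated demand data (all numbers below are assumptions): the true relationship is quadratic, so in the cubic fit the X³ term is individually irrelevant, and dropping it shrinks the standard error of the other slopes.

```python
import numpy as np

# Simulated demand data with a genuinely quadratic relationship.
rng = np.random.default_rng(2)
n = 40
x = rng.uniform(5, 25, size=n)
y = 1300 - 150 * x + 4.5 * x**2 + rng.normal(scale=30, size=n)

def ols_se(X, y):
    """OLS estimates and their standard errors."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return b, se

X_cubic = np.column_stack([np.ones(n), x, x**2, x**3])
X_quad = np.column_stack([np.ones(n), x, x**2])
b3, se3 = ols_se(X_cubic, y)
b2, se2 = ols_se(X_quad, y)

print(abs(b3[3] / se3[3]))   # |t| for X^3, usually small since beta3 = 0
print(se2[1] < se3[1])       # the slope of X is estimated more precisely without X^3
```

The inflated standard errors in the cubic fit come from the strong collinearity between X, X² and X³ over this range, mirroring the penalty for including a superfluous variable discussed above.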
Summarising
• A multiple regression model suffers from functional form misspecification when it does
not properly account for the relationship between the dependent and the observed
explanatory variables.
• Sometimes the economic theory does not provide information about the functional form,
so we specify a linear model since it is the simplest one.
• However, there are many possible functional forms, so the chosen one may be incorrect.
• The consequences of this specification error are the same as those of omitting a relevant
variable. In addition, some of the assumptions related to the error terms may be violated.
Example 1
The following table contains data about Y , U.S. expenditure on imported goods (in billions
of 1982 dollars), and X, personal disposable income (in billions of 1982 dollars), from 1968 to
1987:
Year Yi Xi Year Yi Xi
1968 135.7 1551.3 1978 274.1 2167.4
1969 144.6 1599.8 1979 277.9 2112.6
1970 150.9 1668.1 1980 253.6 2214.3
1971 166.2 1728.4 1981 258.7 2248.6
1972 190.7 1797.4 1982 249.5 2261.5
1973 218.2 1916.3 1983 282.2 2331.9
1974 211.8 1896.6 1984 351.1 2469.8
1975 187.9 1931.7 1985 367.9 2542.8
1976 229.3 2001.0 1986 412.3 2640.9
1977 259.4 2066.6 1987 439.0 2686.3
We consider two alternative specifications. The first is a linear model with a trend:
Yt = β0 + β1 Xt + β2 T + ut
where T is a trend variable (1 for the first year, 2 for the second...). The second is a log-log model:
ln Yt = β0 + β1 ln Xt + β2 T + ut ,
The variables are individually and jointly significant in both models and the predictive power is
high (nevertheless, the R2 cannot be used to compare them). So, both models seem adequate.
Then, which model does one prefer? Is the functional form of any of them incorrect?
• Objective tools: Ramsey's RESET (Regression Specification Error Test).
Steps:
1. From the fitted model, obtain the fitted values Ŷi .
2. Rerun the model introducing Ŷi in some form as additional regressor(s). The squared
(Ŷi²) and cubed (Ŷi³) terms have proven to be useful in most applications.
3. Test the significance of the added terms; if they are significant, the model is misspecified.
One advantage of RESET is that it is easy to apply, for it does not require one to specify
the alternative model. But that is also its disadvantage, because knowing that a model
is misspecified does not necessarily help us choose a better alternative.
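The RESET steps above can be sketched directly: refit with powers of the fitted values added and compare restricted and unrestricted residual sums of squares with an F statistic. The data below are synthetic (a nonlinear truth fitted with a linear model), so the large F value is expected; statsmodels also offers a ready-made version of this test.

```python
import numpy as np

def reset_F(X, y, powers=(2, 3)):
    """Ramsey RESET: F statistic for adding powers of the fitted values."""
    n, k1 = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ b
    rss_r = np.sum((y - yhat) ** 2)                 # restricted RSS
    Xa = np.column_stack([X] + [yhat ** p for p in powers])
    ba, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    rss_u = np.sum((y - Xa @ ba) ** 2)              # unrestricted RSS
    q = len(powers)
    return ((rss_r - rss_u) / q) / (rss_u / (n - k1 - q))

# Synthetic example: the true relationship is nonlinear, the fitted model linear.
rng = np.random.default_rng(3)
n = 200
x = rng.uniform(1, 10, size=n)
y = np.exp(0.1 + 0.3 * np.log(x)) + rng.normal(scale=0.02, size=n)

X = np.column_stack([np.ones(n), x])
F = reset_F(X, y)
print(F)   # compare with the F(2, n-4) critical value, about 3.04 at the 5% level
```

A value of F above the critical value leads to rejecting correct specification, exactly the conclusion drawn for the linear model in Example 1.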
Example 1 (continuation)
Returning to Example 1 we apply RESET to the two models (version with squares).
1. For the linear regression model, after calculating the fitted values Ŷi we fit the model
Y = β0 + β1 X + β2 T + β3 Ŷ² + u
obtaining the following results:
Since the term Ŷi² is individually significant at the 5% level, one can conclude that the linear
regression model is misspecified.
2. For the log-log model we proceed analogously, fitting
ln Y = β0 + β1 ln X + β2 T + β3 [^(ln Y)]² + u,
where ^(ln Y) denotes the fitted values of ln Y.
• In regression models involving cross-sectional data, a similar structural change may happen
between two groups of observations.
• How do we find out whether a structural change has in fact occurred? The Chow test.
• Suppose a sample of size n divided into two independent sub-samples of sizes n1 and n2 ,
respectively (n1 + n2 = n). The hypotheses of the test are:
H0 : βj(1) = βj(2) , j = 0, . . . , k (the coefficients are the same in both sub-samples)
H1 : βj(1) ≠ βj(2) for at least one j
• Estimate the model with all the observations (supposing that there is no parameter
instability) by the OLS method and obtain RSS.
• Estimate the model with the observations of the first sub-sample by the OLS method
and obtain RSS1 .
• Estimate the model with the observations of the second sub-sample by the OLS method
and obtain RSS2 .
• Now the idea behind the Chow test is that if in fact there is no structural change (i.e.,
the two regressions are essentially the same), then the RSS and RSS1 + RSS2 should
not be statistically different. Therefore, if we form the following ratio:
F = {[RSS − (RSS1 + RSS2 )]/(k + 1)} / {(RSS1 + RSS2 )/[n − 2(k + 1)]} ∼ F (k + 1, n − 2(k + 1)) under H0
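The three estimation steps and the F ratio above translate directly into code. This sketch uses simulated data with an intercept shift between the sub-samples (the numbers are assumptions, not the food-consumption data of Example 2), so the test should reject parameter stability.

```python
import numpy as np

def chow_F(X1, y1, X2, y2):
    """Chow test F statistic for structural change between two sub-samples."""
    def rss(X, y):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.sum((y - X @ b) ** 2)
    X = np.vstack([X1, X2])
    y = np.concatenate([y1, y2])
    k1 = X.shape[1]                        # k + 1 parameters incl. intercept
    n = len(y)
    rss_p = rss(X, y)                      # pooled regression (no break assumed)
    rss_s = rss(X1, y1) + rss(X2, y2)      # separate regressions
    return ((rss_p - rss_s) / k1) / (rss_s / (n - 2 * k1))

# Illustrative data: the intercept changes in the second sub-sample.
rng = np.random.default_rng(4)
x1 = rng.normal(size=40)
x2 = rng.normal(size=40)
y1 = 1.0 + 0.5 * x1 + rng.normal(scale=0.3, size=40)
y2 = 3.0 + 0.5 * x2 + rng.normal(scale=0.3, size=40)
F = chow_F(np.column_stack([np.ones(40), x1]), y1,
           np.column_stack([np.ones(40), x2]), y2)
print(F)   # large, so H0 of parameter stability is rejected
```

Under H0 the ratio follows F(k + 1, n − 2(k + 1)); here the intercept shift makes RSS of the pooled regression far exceed RSS1 + RSS2.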
Example 2
We wish to analyse the per capita food consumption, Y , as a function of the price, X1 , and
the per capita income, X2 (both adjusted by the CPI), in the periods 1927-1941 and 1948-1962,
using the data in Table 1 (the years of World War II have been omitted). We suspect that
there could be a structural change in the relationship between Y and X1 , X2 due to the
effect of World War II. We apply the Chow test:
Year Yi X1i X2i Year Yi X1i X2i
1927 88.9 91.7 57.7 1948 96.7 105.3 82.1
1928 88.9 92 59.3 1949 96.7 102 83.1
1929 89.1 93.1 62 1950 98 102.4 88.6
1930 88.7 90.9 56.3 1951 96.1 105.4 88.3
1931 88 82.3 52.7 1952 98.1 105 89.1
1932 85.9 76.3 44.4 1953 99.1 102.6 92.1
1933 86 78.3 43.8 1954 99.1 101.9 91.7
1934 87.1 84.3 47.8 1955 99.8 100.8 96.5
1935 85.4 88.1 52.1 1956 101.5 100 99.8
1936 88.5 88 58 1957 99.9 99.8 99.9
1937 88.4 88.4 59.8 1958 99.1 101.2 98.4
1938 88.6 83.5 55.9 1959 101 98.8 101.8
1939 91.7 82.4 60.3 1960 100.7 98.4 101.8
1940 93.3 83 64.1 1961 100.8 98.8 103.1
1941 95.1 86.2 73.7 1962 101 98.4 105.5
Table 1: Data on per capita food consumption, price and per capita income before and after
World War II
1. All possible regressions: This procedure requires that the analyst fit all the regression
equations involving one candidate regressor, two candidate regressors, and so on. These
equations are evaluated according to some suitable criterion and the “best” regression
model selected.
2. Stepwise regression methods: Because evaluating all possible regressions can be
computationally burdensome, various methods have been developed that evaluate only a
small number of subset regression models by either adding or deleting regressors one at
a time. They can be classified into three broad categories:
• Forward selection
• Backward elimination
• Stepwise regression.
Forward selection
This procedure begins with the assumption that there are no regressors in the model
other than the intercept. An effort is made to find an optimal subset by inserting regressors
into the model one at a time.
Steps:
1. The first regressor selected for entry into the equation is the one that has the largest
simple correlation with the response variable Y .
2. We carry out all the possible regressions adding a new regressor to the previous model.
3. The second regressor chosen for entry is the one that now has the lowest p-value (lower
than α) in the corresponding individual significance test, that is, the most relevant
regressor among those that are relevant.
4. Steps 2 and 3 are repeated, adding one regressor at a time.
5. The procedure terminates either when all the p-values at a particular step are greater
than α or when the last candidate regressor is added to the model.
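The steps above can be sketched as follows. As an assumption standing in for "p-value below α", the entry rule here is |t| above a fixed threshold (equivalent in spirit, since a lower p-value means a larger |t|); the data and variable names are illustrative.

```python
import numpy as np

def ols_t(X, y):
    """t statistics of the OLS coefficients."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])
    return b / np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

def forward_select(candidates, y, t_crit=2.0):
    """Forward selection: add, one at a time, the candidate with the largest
    significant |t| when appended to the current model."""
    n = len(y)
    chosen = []
    while True:
        best, best_t = None, t_crit
        for name, xj in candidates.items():
            if name in chosen:
                continue
            X = np.column_stack([np.ones(n)]
                                + [candidates[c] for c in chosen] + [xj])
            t = abs(ols_t(X, y)[-1])      # t statistic of the new regressor
            if t > best_t:
                best, best_t = name, t
        if best is None:                  # no remaining candidate is relevant
            return chosen
        chosen.append(best)

rng = np.random.default_rng(5)
n = 200
cand = {name: rng.normal(size=n) for name in ["x1", "x2", "x3"]}
y = 2.0 + 1.0 * cand["x1"] + 0.8 * cand["x2"] + rng.normal(size=n)  # x3 irrelevant
print(forward_select(cand, y))   # typically ['x1', 'x2']
```

With uncorrelated regressors the first variable entered is also the one most correlated with Y, matching step 1 of the procedure.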
Backward elimination
Forward selection begins with no regressors in the model and attempts to insert variables
until a suitable model is obtained. Backward elimination attempts to find a good model by
working in the opposite direction.
Steps:
1. We fit the full model containing all k candidate regressors.
2. We first remove from the model the regressor which has the greatest p-value (greater
than α) in the corresponding individual significance test, that is, the most irrelevant
regressor among those that are irrelevant.
3. Now a regression model with k − 1 regressors is fit and the procedure repeats.
4. The backward elimination algorithm terminates when all the p-values corresponding to
the individual significance tests are less than α, that is, all the regressors in the model
are relevant.
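A matching sketch of backward elimination, under the same assumption as before (an |t| threshold standing in for the p-value rule, with illustrative data): start from the full model and repeatedly drop the least relevant regressor until every remaining one is significant.

```python
import numpy as np

def ols_t(X, y):
    """t statistics of the OLS coefficients."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])
    return b / np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

def backward_eliminate(candidates, y, t_crit=2.0):
    """Backward elimination: drop, one at a time, the regressor with the
    smallest |t| until all remaining regressors are significant."""
    chosen = list(candidates)
    n = len(y)
    while chosen:
        X = np.column_stack([np.ones(n)] + [candidates[c] for c in chosen])
        t = np.abs(ols_t(X, y)[1:])       # skip the intercept
        worst = int(np.argmin(t))
        if t[worst] >= t_crit:            # every regressor is relevant: stop
            return chosen
        chosen.pop(worst)
    return chosen

rng = np.random.default_rng(6)
n = 200
cand = {name: rng.normal(size=n) for name in ["x1", "x2", "x3"]}
y = 2.0 + 1.0 * cand["x1"] + 0.8 * cand["x2"] + rng.normal(size=n)  # x3 irrelevant
print(backward_eliminate(cand, y))   # typically ['x1', 'x2']
```

On well-behaved data the two sketches agree, but as the notes warn below, forward and backward procedures need not select the same model in general.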
Stepwise regression
• The two procedures described above suggest a number of possible combinations. One
of the most popular is the stepwise regression algorithm of Efroymson.
• Stepwise regression is a modification of forward selection in which, at each step, all
regressors previously entered into the model are reassessed via the p-values of the
corresponding individual significance tests.
• A regressor added at an earlier step may now be redundant because of the relationships
between it and regressors now in the equation.
• If the p−value is greater than αout , that variable is dropped from the model.
• Some analysts prefer to choose the same α for adding and removing a regressor, that is,
αin = αout . Frequently we choose αin < αout , making it relatively more difficult to add
a regressor than to delete one.
• None of the procedures generally guarantees that the best subset regression model of
any size will be identified.
• Since all the stepwise-type procedures terminate with one final equation, inexperienced
analysts may conclude that they have found a model that is in some sense optimal.
• Part of the problem is that there is likely not one best subset model, but several
equally good ones.
• The analyst should also keep in mind that the order in which the regressors enter or leave
the model does not necessarily imply an order of importance to the regressors. It is not
unusual to find that a regressor inserted into the model early in the procedure becomes
negligible at a subsequent step.
• Note that forward selection, backward elimination, and stepwise regression do not
necessarily lead to the same choice of final model.
• Some users have recommended that all the procedures be applied in the hopes of either
seeing some agreement or learning something about the structure of the data that might
be overlooked by using only one selection procedure. Furthermore, there is not necessarily
any agreement between any of the stepwise-type procedures and all possible regressions.
• For these reasons, stepwise-type variable selection procedures should be used with
caution.
AIC = 2(k + 1) − 2 ln L
BIC = (k + 1) ln n − 2 ln L
HQC = 2(k + 1) ln(ln n) − 2 ln L
The model with the lowest AIC, BIC or HQC is preferred. They can be used for nested and
non-nested models.
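For an OLS model with normal errors the maximised log-likelihood is ln L = −(n/2)[ln(2π) + ln(RSS/n) + 1], so the criteria above can be computed directly from the residual sum of squares. A sketch with simulated data (the variables and parameter values are assumptions):

```python
import numpy as np

def info_criteria(X, y):
    """AIC and BIC for an OLS fit, via the Gaussian log-likelihood."""
    n, k1 = X.shape                          # k1 = k + 1 parameters
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ b) ** 2)
    lnL = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    aic = 2 * k1 - 2 * lnL
    bic = k1 * np.log(n) - 2 * lnL
    return aic, bic

rng = np.random.default_rng(7)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                      # irrelevant regressor
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

aic_small, bic_small = info_criteria(np.column_stack([np.ones(n), x1]), y)
aic_big, bic_big = info_criteria(np.column_stack([np.ones(n), x1, x2]), y)
print(aic_small, bic_small)
print(aic_big, bic_big)    # the superfluous regressor usually worsens both criteria
```

Because BIC's penalty, ln n per parameter, exceeds AIC's penalty of 2 once n > e², BIC tends to favour the smaller model more strongly.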
Remark: We have discussed several model selection criteria, but one should look at these
criteria as an adjunct to the various specification tests we have discussed in this unit. Some
of the criteria discussed above are purely descriptive and may not have strong theoretical
properties. Nonetheless, they are so frequently used by the practitioner that one should be
aware of them. No one of these criteria is necessarily superior to the others.
Example 3
We will apply the stepwise-type procedures to the Hald cement data given in the file
Hald_cement.xls.
Hald (1952) presents data concerning the heat evolved in calories per gram of cement (Y )
as a function of the amount of each of four ingredients in the mix: tricalcium aluminate (X1 ),
tricalcium silicate (X2 ), tetracalcium alumino ferrite (X3 ) and dicalcium silicate (X4 ).