
MODEL SELECTION CRITERIA AND TESTS

Chapter 7
THE ATTRIBUTES OF A GOOD MODEL

 Whether a model used in empirical analysis is good, appropriate, or the "right" model
cannot be determined without some reference criteria, or guidelines.

 Parsimony: A model can never completely capture reality; some amount of
simplification is inevitable in any model building. The principle of parsimony
suggests that a model be kept as simple as possible.
 Parameter constancy: The values of the parameters should be stable; otherwise,
forecasting will be difficult. The only relevant test of a model is comparison of its
predictions with experience, and in the absence of parameter constancy such
predictions will not be reliable.
 Goodness of fit: Since the basic thrust of regression analysis is to explain as much
of the variation in the dependent variable as possible by the explanatory variables
included in the model, a model is judged to be good if this explanation, as
measured, say, by the adjusted R², is as high as possible.
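
For reference, the adjusted R² referred to throughout penalizes the addition of regressors; with n observations and k estimated parameters it is the standard

```latex
\bar{R}^{2} = 1 - \left(1 - R^{2}\right)\frac{n-1}{n-k}
```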
…THE ATTRIBUTES OF A GOOD MODEL

 Theoretical consistency: No matter how high the goodness-of-fit measures, a
model may not be judged good if one or more coefficients have the wrong signs:
for example, a price coefficient with a positive sign in the demand function for a
commodity (a positively sloping demand curve!). We must check that the
coefficients have the right signs, in line with theory, even if the R² of the
model is high.
 It is one thing to list criteria of a "good" model and quite another to actually
develop it, for in practice one is likely to commit various model specification
errors, which we discuss next.
 One of the assumptions of the classical linear regression model (CLRM) is that
the regression model used in the analysis is "correctly" specified. If the model is
not "correctly" specified, we encounter the problem of model specification error,
or model specification bias.
TYPES OF SPECIFICATION ERRORS
SPECIFICATION BIAS/ERRORS
 Knowing the consequences of specification errors is one thing but finding out whether one has
committed such errors is quite another, for we do not deliberately set out to commit such errors.
 Very often specification errors arise inadvertently, perhaps from our inability to formulate the
model as precisely as possible because the underlying theory is weak or because we do not have
the right kind of data to test the model.
 Because of the non-experimental nature of economics, we are never sure how the observed data
were generated.
 In such a setup, the practical question then is not why specification errors are made, but how to
detect them. Once it is found that specification errors have been made, appropriate remedies can be applied.

We will concentrate on the following biases:

#1. Omission of relevant X variables
#2. Inclusion of unnecessary X variable(s)
#3. Wrong functional form (e.g., linear or non-linear)
#4. Measurement error
BIAS DUE TO OMISSION OF RELEVANT X VARIABLE(S): UNDERFITTING A MODEL
 The estimation of a regression model without relevant explanatory variable(s) may
introduce bias into the estimates. For instance, assume that the correct function
explaining the variation in Y is:
Yᵢ = β₁ + β₂X₂ᵢ + β₃X₃ᵢ + uᵢ [1]
In mean-deviation form (lowercase letters denote deviations from sample means), EQ.[1]
can be written as:
yᵢ = β₂x₂ᵢ + β₃x₃ᵢ + uᵢ [2]
 Now, assume that, either due to ignorance about the true relation or unavailability of
relevant data on X₃, the following regression equation is estimated instead:
yᵢ = b₂x₂ᵢ + vᵢ [3]
 It can be shown that E(b₂) is different from β₂.

See Gujarati and Porter, "Basic Econometrics," for the proof.
…BIAS DUE TO EXCLUSION OF RELEVANT X VARIABLE(S)
 On applying OLS to the misspecified regression model in EQ.[3], we obtain:
b₂ = Σx₂ᵢyᵢ / Σx₂ᵢ² [4]
 On the other hand, the normal equations of the correctly specified regression model
in EQ.[2] are:
Σx₂ᵢyᵢ = β̂₂Σx₂ᵢ² + β̂₃Σx₂ᵢx₃ᵢ, and [5]
Σx₃ᵢyᵢ = β̂₂Σx₂ᵢx₃ᵢ + β̂₃Σx₃ᵢ² [6]
 Dividing EQ.[5] by Σx₂ᵢ², we obtain:
b₂ = β̂₂ + β̂₃b₃₂ [7]
where b₃₂ (= Σx₂ᵢx₃ᵢ / Σx₂ᵢ²) is the slope coefficient in the regression of the omitted
variable X₃ on the included variable X₂.
…BIAS DUE TO EXCLUSION OF RELEVANT X VARIABLE(S)
 Taking expectations of EQ.[7] gives E(b₂) = β₂ + β₃b₃₂. Therefore, b₂ is an unbiased
estimator of β₂ iff the term β₃b₃₂ = 0.
 Now, from EQ.[7], we can obtain the bias introduced by the exclusion of a relevant
explanatory variable as:
Specification bias = E(b₂) − β₂ = β₃b₃₂ [8]
 The above exposition tells us that the bias due to the exclusion of relevant explanatory
variable(s) depends on two terms:
(a) β₃, the regression coefficient of the explanatory variable(s) excluded from the
fitted model.
(b) The covariance/correlation between the explanatory variable(s) dropped and
kept in the fitted regression model, that is, b₃₂ = Σx₂ᵢx₃ᵢ / Σx₂ᵢ².

NOTE: We work with data in mean-deviation form and suppress the intercept term to simplify the derivations.
We already know that in econometrics, in most, but not all, regressions the intercept term has no
economic interpretation.
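
To make EQ.[7] concrete, here is a minimal simulation sketch in Python with NumPy (the parameter values and variable names are illustrative, not from the slides). It checks that the slope from the short regression of y on x₂ alone lands on β₂ + β₃·b₃₂ rather than on β₂:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta2, beta3 = 2.0, 1.5

# Correlated regressors: x3 depends on x2 (so b32 is nonzero, about 0.8 here)
x2 = rng.normal(size=n)
x3 = 0.8 * x2 + rng.normal(size=n)
u = rng.normal(size=n)
y = beta2 * x2 + beta3 * x3 + u        # true model in mean-deviation form, EQ.[2]

# Short (misspecified) regression of y on x2 alone, EQ.[3]
b2_short = (x2 @ y) / (x2 @ x2)

# Auxiliary slope of the omitted x3 on the included x2
b32 = (x2 @ x3) / (x2 @ x2)

print(f"short-regression slope b2 : {b2_short:.3f}")
print(f"beta2 + beta3*b32         : {beta2 + beta3 * b32:.3f}")  # matches, per EQ.[7]
```

With b₃₂ ≈ 0.8 here, the short-regression slope settles near 2 + 1.5 × 0.8 = 3.2 instead of the true β₂ = 2.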
EXCLUSION OF RELEVANT X VARIABLE(S): CONSEQUENCES

 If the left-out, or omitted, variable X₃ is correlated with the included variable X₂,
the slope coefficient b₂ of the fitted regression will be biased.
 The disturbance variance σ² is incorrectly estimated.
 The variance of b₂ (= σ²/Σx₂ᵢ²) is a biased estimator of the variance of the true
estimator β̂₂; the estimated variance of b₂ will, on average, overestimate the true
variance of β̂₂.
 In consequence, the usual confidence interval and hypothesis-testing
procedures are likely to give misleading conclusions about the statistical
significance of the estimated parameters.
 As another consequence, forecasts based on the incorrect model and their
forecast (confidence) intervals will be unreliable.
INCLUSION OF UNNECESSARY/IRRELEVANT X VARIABLES: CONSEQUENCES
 Another type of specification bias may arise when the set of explanatory variables is
enlarged by the inclusion of one or more irrelevant variables.
 The philosophy is that so long as you include the theoretically relevant variables,
inclusion of one or more unnecessary or "nuisance" variables will not hurt—
unnecessary in the sense that there is no solid theory that says they should be
included.
 In that case, inclusion of such variables will certainly increase R² (and increase the
adjusted R² whenever the added variable's |t| exceeds 1).
 This is called overfitting a model. But if the variables are not economically
meaningful and relevant, such a strategy is not recommended.
INCLUSION OF UNNECESSARY/IRRELEVANT X VARIABLES: CONSEQUENCES

 Suppose the correctly specified model is:
Yᵢ = β₁ + β₂X₂ᵢ + uᵢ (original/correct model) [9]
 But a researcher adds the superfluous variable X₃ and estimates the following model:
Yᵢ = α₁ + α₂X₂ᵢ + α₃X₃ᵢ + vᵢ (estimated/incorrect model) [10]
 The OLS estimators of the "incorrect" model are unbiased (as well as consistent). That
is, E(α̂₁) = β₁, E(α̂₂) = β₂, and E(α̂₃) = 0: if X₃ does not belong to the model, α₃ is
expected to be zero.
 Also, the estimator of the error variance σ² obtained from the over-fitted regression is
correctly estimated.
INCLUSION OF UNNECESSARY/IRRELEVANT X VARIABLES: CONSEQUENCES
 The standard confidence interval and hypothesis-testing procedures based on the t
and F tests remain valid.
 However, the α̂'s estimated from the over-fitted regression are inefficient: their
variances will generally be larger than those of the β̂'s estimated from the true model.
 As a result, the confidence intervals based on the standard errors of the α̂'s will be
wider than those based on the standard errors of the β̂'s of the true model.
 However, we can use the F-test statistic (pls. see the lecture on multivariate regression)
to choose the right variable(s) for our model. A simulation illustrating both results
follows below.
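
A minimal simulation sketch of these two results (Python/NumPy; parameter values are illustrative): the slope estimate from the over-fitted model stays centered on the true β₂ (unbiasedness), but its sampling variance is larger when the superfluous regressor is correlated with the relevant one (inefficiency).

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 5_000
beta1, beta2 = 1.0, 2.0   # true model: Y = beta1 + beta2*X2 + u

a2_true_model, a2_overfit = [], []
for _ in range(reps):
    x2 = rng.normal(size=n)
    x3 = 0.9 * x2 + 0.3 * rng.normal(size=n)   # irrelevant but correlated with x2
    y = beta1 + beta2 * x2 + rng.normal(size=n)

    X_true = np.column_stack([np.ones(n), x2])
    X_over = np.column_stack([np.ones(n), x2, x3])
    a2_true_model.append(np.linalg.lstsq(X_true, y, rcond=None)[0][1])
    a2_overfit.append(np.linalg.lstsq(X_over, y, rcond=None)[0][1])

# Both estimators are centered on beta2 = 2 (unbiased), but the over-fitted
# one has a visibly larger sampling variance (inefficiency).
print(np.mean(a2_true_model), np.var(a2_true_model))
print(np.mean(a2_overfit), np.var(a2_overfit))
```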
IS IT BETTER TO INCLUDE IRRELEVANT VARIABLES THAN TO EXCLUDE THE RELEVANT ONES?

 The addition of unnecessary variables will lead to a loss in the efficiency of the
estimators (i.e., larger standard errors) and may also lead to the problem of
multicollinearity (Why?), not to mention the loss of degrees of freedom.

 The best approach is to include only explanatory variables that, on theoretical
grounds, directly influence the dependent variable and are not accounted for by
other included variables.

WRONG FUNCTIONAL FORM (LINEAR OR NON-LINEAR).
 Sometimes researchers mistakenly do not account for the nonlinear nature of the
variables in a model.
 Moreover, some dependent variables (such as wage, which tends to be skewed to
the right) are more appropriately entered in natural log form.
 Consider the following true marginal cost model, quadratic in output Q:
MCᵢ = β₁ + β₂Qᵢ + β₃Qᵢ² + uᵢ [11]
 Instead, the econometrician estimated the following linear model:
MCᵢ = α₁ + α₂Qᵢ + vᵢ [12]

[Figure: true (quadratic) vs. fitted (linear) marginal cost curves. Between points P and Q,
the linear marginal cost curve will consistently overestimate the true marginal cost;
outside that range, it will consistently underestimate it.]
TESTS FOR OMITTED VARIABLE AND FUNCTIONAL FORM OF REGRESSION EQUATION

RESET Test:
 Consider the original model: Yᵢ = β₁ + β₂X₂ᵢ + β₃X₃ᵢ + uᵢ [13]
 RESET adds polynomials in the OLS fitted values to the above EQ. to detect general kinds of
functional form misspecification.
 To implement RESET, we need to decide how many functions of the fitted values to include in the
expanded regression. There is no right or wrong answer; squares and cubes are a common choice.
 Let Ŷ denote the OLS fitted value from EQ.[13].
 Consider the expanded model: Yᵢ = β₁ + β₂X₂ᵢ + β₃X₃ᵢ + δ₁Ŷᵢ² + δ₂Ŷᵢ³ + errorᵢ [14]
 We use this equation to test whether the original equation has missed important nonlinearities.
 H₀: δ₁ = δ₂ = 0, that is, the original model is correctly specified. To test this hypothesis, use the
F test statistic for the joint significance of Ŷᵢ² and Ŷᵢ³.
 A significant F statistic suggests some sort of functional form problem; a sketch of the
procedure follows.
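
A sketch of the RESET procedure using Python and statsmodels (the data-generating process below is invented purely for illustration):

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data; in practice use your own y, X2, X3.
rng = np.random.default_rng(2)
n = 500
x2 = rng.uniform(1, 10, n)
x3 = rng.uniform(1, 10, n)
y = 1 + 0.5 * x2 + 0.2 * x3 + 0.05 * x2**2 + rng.normal(size=n)  # truth is nonlinear

# Step 1: estimate the original (linear) model, EQ.[13]
X = sm.add_constant(np.column_stack([x2, x3]))
restricted = sm.OLS(y, X).fit()
yhat = restricted.fittedvalues

# Step 2: expanded regression with yhat^2 and yhat^3, EQ.[14]
X_exp = np.column_stack([X, yhat**2, yhat**3])
unrestricted = sm.OLS(y, X_exp).fit()

# Step 3: F test of H0: delta1 = delta2 = 0
f_stat, p_value, df_diff = unrestricted.compare_f_test(restricted)
print(f"RESET F = {f_stat:.2f}, p-value = {p_value:.4f}")  # small p => misspecification
```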
…TESTS FOR OMITTED VARIABLE AND FUNCTIONAL FORM OF REGRESSION EQUATION

 MacKinnon-White-Davidson (MWD) test
 Consider the following two models:
Linear model: Yᵢ = β₁ + β₂X₂ᵢ + β₃X₃ᵢ + uᵢ
Log-linear model: ln Yᵢ = α₁ + α₂ ln X₂ᵢ + α₃ ln X₃ᵢ + εᵢ
 To illustrate the use of the MWD test to identify which functional form is correct, we
specify the hypotheses as follows:
 H₀ (linear model): Y is a linear function of the regressors, the X's.
 H₁ (log-linear model): ln Y is a linear function of the logs of the regressors, the ln X's.
…TESTS FOR OMITTED VARIABLE AND FUNCTIONAL FORM OF REGRESSION EQUATION

The MWD test involves the following steps:
 Estimate the linear model and obtain the estimated Y values; call them Yf.
 Estimate the log-linear model and obtain the estimated ln Y values; call them ln f.
 Create a new variable Z₁ᵢ = ln(Yfᵢ) − ln fᵢ.
 Regress Y on the X's and Z₁:
Yᵢ = β₁ + β₂X₂ᵢ + β₃X₃ᵢ + β₄Z₁ᵢ + uᵢ
 Reject H₀ if the coefficient of Z₁ is statistically significant by the usual t test.
 Obtain Z₂ᵢ = antilog(ln fᵢ) − Yfᵢ.
 Regress ln Y on the ln X's and Z₂:
ln Yᵢ = α₁ + α₂ ln X₂ᵢ + α₃ ln X₃ᵢ + α₄Z₂ᵢ + εᵢ
 Reject H₁ if the coefficient of Z₂ in the preceding equation is statistically significant.

►The idea behind the MWD test is simple. If the linear model is in fact the correct model, the
constructed variable Z₁ should not be significant, because in that case the estimated Y values
from the linear model and those estimated from the log-linear model (after taking their antilog
values for comparative purposes) should not be different. The same comment applies to the
alternative hypothesis H₁. A worked sketch follows.
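
A sketch of the MWD steps in Python/statsmodels (again with an invented, illustrative data-generating process; note that the fitted Y values must be positive for the log in step 3):

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data; replace with your own Y and regressors.
rng = np.random.default_rng(3)
n = 500
x2 = rng.uniform(1, 10, n)
x3 = rng.uniform(1, 10, n)
y = np.exp(0.5 + 0.4 * np.log(x2) + 0.3 * np.log(x3) + 0.1 * rng.normal(size=n))

X_lin = sm.add_constant(np.column_stack([x2, x3]))
X_log = sm.add_constant(np.column_stack([np.log(x2), np.log(x3)]))

# Steps 1-2: fitted values from the linear and log-linear models
yf = sm.OLS(y, X_lin).fit().fittedvalues           # estimated Y (must be > 0 here)
lnf = sm.OLS(np.log(y), X_log).fit().fittedvalues  # estimated ln Y

# Steps 3-4: Z1 = ln(Yf) - lnf; t test on Z1 decides H0 (linear)
z1 = np.log(yf) - lnf
res_h0 = sm.OLS(y, np.column_stack([X_lin, z1])).fit()
print("t on Z1:", res_h0.tvalues[-1], "p:", res_h0.pvalues[-1])

# Steps 5-6: Z2 = antilog(lnf) - Yf; t test on Z2 decides H1 (log-linear)
z2 = np.exp(lnf) - yf
res_h1 = sm.OLS(np.log(y), np.column_stack([X_log, z2])).fit()
print("t on Z2:", res_h1.tvalues[-1], "p:", res_h1.pvalues[-1])
```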
…TESTS FOR OMITTED VARIABLE AND FUNCTIONAL FORM OF REGRESSION EQUATION
 LM Test:
 This is an alternative to the RESET test. Estimate the model in EQ.[13] and obtain the
estimated residuals, ûᵢ.
 If EQ.[13] is in fact the correct model, then the residuals obtained from it
should not be related to the regressors omitted from the model.
 We now regress ûᵢ on the regressors in the original model and the candidate omitted
variables (e.g., squared terms of the regressors):
ûᵢ = a₁ + a₂X₂ᵢ + a₃X₃ᵢ + a₄X₂ᵢ² + a₅X₃ᵢ² + εᵢ [15]
 If the sample size is large, it can be shown that n (the sample size) times the R²
obtained from this auxiliary regression follows a chi-square distribution with degrees of
freedom equal to the number of omitted regressors; symbolically, nR² ~ χ²(df).
 If the computed nR² exceeds the critical χ² value at the chosen level of significance, we
reject the null of no misspecification. A sketch of the procedure follows.
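
A sketch of the LM procedure in Python/statsmodels, using squared regressors as the candidate omitted terms (the data are invented for illustration):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Illustrative data; replace with your own y and regressors.
rng = np.random.default_rng(4)
n = 500
x2 = rng.uniform(1, 10, n)
x3 = rng.uniform(1, 10, n)
y = 1 + 0.5 * x2 + 0.2 * x3 + 0.05 * x2**2 + rng.normal(size=n)

# Step 1: estimate the original model, EQ.[13], and get residuals
X = sm.add_constant(np.column_stack([x2, x3]))
u_hat = sm.OLS(y, X).fit().resid

# Step 2: auxiliary regression of residuals on original + candidate omitted terms
X_aux = sm.add_constant(np.column_stack([x2, x3, x2**2, x3**2]))
aux = sm.OLS(u_hat, X_aux).fit()

# Step 3: LM statistic = n * R^2 ~ chi2 with df = number of added terms (2 here)
lm = n * aux.rsquared
p_value = stats.chi2.sf(lm, df=2)
print(f"LM = {lm:.2f}, p-value = {p_value:.4f}")  # small p => misspecification
```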
ERRORS OF MEASUREMENT

 So far we have assumed implicitly that the dependent variable Y and the
explanatory variables, the X's, are measured without any errors.
 Although not explicitly spelled out, this presumes that the values of the regressand as well
as the regressors are accurate. That is, they are not guess estimates, extrapolated, interpolated,
or rounded off in any systematic manner, or recorded with errors.
 Consequences of Errors of Measurement in the Regressand:
1. The OLS estimators are still unbiased.
2. The variances and standard errors of the OLS estimators are still unbiased.
3. But the estimated variances, and ipso facto the standard errors, are larger than in the
absence of such errors.

In short, errors of measurement in the regressand do not pose a very serious threat to OLS
estimation.
…ERRORS OF MEASUREMENT

Consequences of Errors of Measurement in the Regressor:
1. The OLS estimators are biased as well as inconsistent.
2. Errors in a single regressor can lead to biased and inconsistent estimates of the
coefficients of the other regressors in the model.
 It is not easy to establish the size and direction of the bias in the estimated
coefficients.
 It is often suggested that we use instrumental or proxy variables for the variables
suspected of having measurement errors.
 The proxy variables must satisfy two requirements: they must be highly correlated with the
variables for which they are a proxy, and they must be uncorrelated with the usual equation
error as well as the measurement error.
 But such proxies are not easy to find.
 We should thus be very careful in collecting the data and making sure that some
obvious errors are eliminated. A simulation of this bias follows.
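
A minimal simulation sketch of consequence 1 (Python/NumPy; parameter values are illustrative): with classical measurement error in the regressor, the OLS slope converges not to β₂ but to β₂·σ²ₓ/(σ²ₓ + σ²ₑ), the well-known attenuation result.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
beta2 = 2.0
sigma_x, sigma_e = 1.0, 0.5   # true-regressor and measurement-error std devs

x_true = rng.normal(0, sigma_x, n)
y = beta2 * x_true + rng.normal(size=n)
x_obs = x_true + rng.normal(0, sigma_e, n)   # regressor recorded with error

b2 = (x_obs @ y) / (x_obs @ x_obs)           # OLS slope on the mismeasured regressor

# Classical errors-in-variables: plim b2 = beta2 * sigma_x^2 / (sigma_x^2 + sigma_e^2)
attenuation = sigma_x**2 / (sigma_x**2 + sigma_e**2)
print(f"OLS slope: {b2:.3f}  vs.  beta2 * lambda = {beta2 * attenuation:.3f}")
```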
OUTLIERS, LEVERAGE, AND INFLUENCE DATA
 Observations or data points that are not "typical" of the rest of the sample are known
as outliers, leverage points, or influence points.
 Outliers: In the context of regression analysis, an outlier is an observation with a
large residual (eᵢ), large in comparison with the residuals of the rest of the
observations.
 Leverage: An observation is said to exert (high) leverage if it is disproportionately
distant from the bulk of the sample observations. Such observation(s)
can pull the regression line towards themselves, which may distort the slope of the
regression line.
 Influential point: If a high-leverage observation in fact pulls the regression line toward
itself, it is called an influential point. The removal of such data point(s) from the
sample can dramatically change the slope of the estimated regression line. These
diagnostics can be computed as sketched below.
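
A sketch of these diagnostics using statsmodels' OLSInfluence (data invented for illustration; the cutoffs 2k/n for leverage and |t| > 2 for studentized residuals are common rules of thumb, not from the slides):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

# Illustrative data with one influential point appended at the end.
rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 50)
y = 1 + 2 * x + rng.normal(size=50)
x = np.append(x, 25.0)   # far from the bulk of x: high leverage
y = np.append(y, 10.0)   # and far off the line: influential

res = sm.OLS(y, sm.add_constant(x)).fit()
infl = OLSInfluence(res)

lev = infl.hat_matrix_diag              # leverage (hat values)
stud = infl.resid_studentized_internal  # studentized residuals (outlier measure)
cooks = infl.cooks_distance[0]          # Cook's distance (influence measure)

# Flag the usual suspects
k, n = 2, len(y)
for i in np.where((lev > 2 * k / n) | (np.abs(stud) > 2))[0]:
    print(f"obs {i}: leverage={lev[i]:.3f}, stud. resid={stud[i]:.2f}, "
          f"Cook's D={cooks[i]:.2f}")
```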
…OUTLIERS, LEVERAGE, AND INFLUENCE DATA

How do we handle outliers? Should we just drop them and confine our attention to the
remaining data points? NOTE that automatic rejection of outliers is not always a wise
procedure. Sometimes an outlier provides information that other data points cannot,
because it arises from an unusual combination of circumstances that may be of vital
interest and requires further investigation rather than elimination. As a general rule,
outliers should be dropped only if they can be traced to causes such as errors in recording
the observations or in setting up the apparatus [in a physical experiment]. Otherwise,
careful investigation is in order.

[Figure: In each subfigure, the solid line gives the OLS line for all the data and the broken
line gives the OLS line without the outlier, denoted by ◙.
• In (a), the outlier is near the mean value of X and has low leverage and little influence
on the regression coefficients.
• In (b), the outlier is far away from the mean value of X and has high leverage as well as
substantial influence on the regression coefficients.
• In (c), the outlier has high leverage but low influence on the regression coefficients
because it is in line with the rest of the observations.]
