Ecmc 8

Model Specification and Data Problems
Part VIII
As of Oct 18, 2018

Seppo Pynnönen Econometrics I
1 Model Specification and Data Problems

Functional Form Misspecification
RESET test
Non-nested alternatives
Using proxies for unobserved explanatory variables
Outliers

Functional Form Misspecification

A functional form misspecification generally means that the model does not account for some important nonlinearities.
Recall that omitting an important variable is also a model misspecification.
Generally, functional form misspecification causes bias in the remaining parameter estimators.

Example 1
Suppose that the correct specification of the wage equation is
log(wage) = β0 + β1 educ + β2 exper + β3 exper^2 + u. (1)

Then the return for an extra year of experience is
∂ log(wage)/∂ exper = β2 + 2β3 exper. (2)
If the second order term is dropped from (1), use of the resulting biased
estimate of β2 can be misleading.
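
The point of Example 1 can be illustrated with a small simulation. The sketch below is only an illustration on artificial data: the sample size, the coefficient values and the distributions of educ and exper are assumptions made here, not estimates from any real wage data.

```r
# Simulated illustration of Example 1 (all numbers are assumed, not real data)
set.seed(1)
n     <- 2000
educ  <- round(runif(n, 8, 18))
exper <- round(runif(n, 0, 40))

# Hypothetical "true" coefficients of equation (1)
b0 <- 0.5; b1 <- 0.08; b2 <- 0.04; b3 <- -0.0007
lwage <- b0 + b1 * educ + b2 * exper + b3 * exper^2 + rnorm(n, sd = 0.3)

full    <- lm(lwage ~ educ + exper + I(exper^2))  # correct specification (1)
reduced <- lm(lwage ~ educ + exper)               # squared term dropped

# Estimated marginal return (2) at a given level of experience
ret_full <- function(x) {
  unname(coef(full)["exper"] + 2 * coef(full)["I(exper^2)"] * x)
}

ret_full(c(5, 30))        # return differs between 5 and 30 years of experience
coef(reduced)["exper"]    # one constant "return", whatever the experience level
```

With these assumed values the true return falls from about 3.3% at 5 years to roughly zero at 30 years, whereas the misspecified model reports a single slope that describes neither group well.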


RESET test


Ramsey (1969)² proposed a general functional form misspecification test, the Regression Specification Error Test (RESET), which has proven to be useful.
Estimate
y = β0 + β1 x1 + · · · + βk xk + u, (3)
get the fitted values ŷ, and estimate the augmented model
y = β0 + β1 x1 + · · · + βk xk + δ1 ŷ^2 + δ2 ŷ^3 + e. (4)
Test the null hypothesis
H0 : δ1 = δ2 = 0 (5)
with the F-test with numerator df1 = 2 and denominator df2 = n − k − 3.
² Ramsey, J.B. (1969). Tests for specification errors in classical linear least-squares regression analysis, Journal of the Royal Statistical Society, Series B, 31, 350–371.
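
The steps (3) to (5) can be carried out by hand with two lm() fits and anova(). The following is a minimal sketch on simulated data; the data-generating process (y depends on log(x1)) and all variable names are assumptions chosen only to make the example self-contained.

```r
# Manual RESET test on simulated data (assumed data-generating process)
set.seed(42)
n  <- 200
x1 <- runif(n, 1, 10)
x2 <- rnorm(n)
y  <- 1 + 2 * log(x1) + 0.3 * x2 + rnorm(n, sd = 0.2)  # nonlinear in x1

base <- lm(y ~ x1 + x2)                          # model (3) with k = 2
yhat <- fitted(base)
aug  <- lm(y ~ x1 + x2 + I(yhat^2) + I(yhat^3))  # augmented model (4)

reset  <- anova(base, aug)                  # F-test of H0: delta1 = delta2 = 0
F_stat <- reset$F[2]
p_val  <- reset$`Pr(>F)`[2]
df     <- c(reset$Df[2], reset$Res.Df[2])   # df1 = 2, df2 = n - k - 3
```

Here df should equal 2 and n − k − 3 = 195, and because the linear model ignores the logarithmic relation, a small p-value is to be expected.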
Example 2
Consider the house price data (Exercise 3.1) and estimate
price = β0 + β1 lotsize + β2 sqrft + β3 bdrms + u. (6)
Estimation results are:

Dependent Variable: PRICE
Method: Least Squares
Sample: 1 88
Included observations: 88
==========================================================
Variable Coefficient Std. Error t-Statistic Prob.
----------------------------------------------------------
C -21.77031 29.47504 -0.738601 0.4622
LOTSIZE 0.002068 0.000642 3.220096 0.0018
SQRFT 0.122778 0.013237 9.275093 0.0000
BDRMS 13.85252 9.010145 1.537436 0.1279
==========================================================
============================================================
R-squared 0.672362 Mean dependent var 293.5460
Adjusted R-squared 0.660661 S.D. dependent var 102.7134
S.E. of regression 59.83348 Akaike info criterion 11.06540
Sum squared resid 300723.8 Schwarz criterion 11.17800
Log likelihood -482.8775 F-statistic 57.46023
Durbin-Watson stat 2.109796 Prob(F-statistic) 0.000000
============================================================
Estimate next (6) augmented with the squared and cubed fitted values of price, ŷ^2 and ŷ^3, as in (4).
The F-statistic for the null hypothesis (5) becomes F = 4.67 with 2 and 82 degrees of freedom. The p-value is 0.012, so we reject the null hypothesis at the 5% level.
Thus, there is some evidence of non-linearity.

Estimate next
log(price) = β0 + β1 log(lotsize) + β2 log(sqrft) + β3 bdrms + u. (7)
Estimation results:
Dependent Variable: LOG(PRICE)
Method: Least Squares
Date: 10/19/06 Time: 00:01
Sample: 1 88
Included observations: 88
============================================================
Variable Coefficient Std. Error t-Statistic Prob.
============================================================
C -1.297042 0.651284 -1.991517 0.0497
LOG(LOTSIZE) 0.167967 0.038281 4.387714 0.0000
LOG(SQRFT) 0.700232 0.092865 7.540306 0.0000
BDRMS 0.036958 0.027531 1.342415 0.1831
============================================================
==============================================================
R-squared 0.642965 Mean dependent var 5.633180
Adjusted R-squared 0.630214 S.D. dependent var 0.303573
S.E. of regression 0.184603 Akaike info criterion -0.496833
Sum squared resid 2.862563 Schwarz criterion -0.384227
Log likelihood 25.86066 F-statistic 50.42374
Durbin-Watson stat 2.088996 Prob(F-statistic) 0.000000
==============================================================

Applying the RESET test, the F -statistic for the null hypothesis (5) is
now F = 2.56 with p-value 0.084, which implies that the hypothesis is
not rejected at the 5% level.
Thus overall, on the basis of the RESET test the log-log model (7) is
preferred.


Non-nested alternatives

Suppose, for example, that the model choices are
y = β0 + β1 x1 + β2 x2 + u (8)
and
y = β0 + β1 log(x1) + β2 log(x2) + u. (9)
Because the models are non-nested, the usual F-test of one model against the other does not apply.
A common approach is to estimate a combined model
y = γ0 + γ1 x1 + γ2 x2 + γ3 log(x1) + γ4 log(x2) + u. (10)
H0 : γ3 = γ4 = 0 is a hypothesis for (8) and H0 : γ1 = γ2 = 0 is a hypothesis for (9). Within the combined model (10) the usual F-test applies again.
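
As a rough sketch, the two F-tests within the combined model (10) can be computed as follows. The data are simulated under the assumption that the logarithmic form (9) is the true one; the coefficient values and names are illustrative only.

```r
# Combined-model F-tests for (8) versus (9) on simulated data (assumed setup)
set.seed(7)
n  <- 300
x1 <- runif(n, 1, 10)
x2 <- runif(n, 1, 10)
y  <- 1 + 2 * log(x1) + 1 * log(x2) + rnorm(n, sd = 0.3)  # generated from form (9)

combined <- lm(y ~ x1 + x2 + log(x1) + log(x2))  # model (10)
m_levels <- lm(y ~ x1 + x2)                      # model (8): gamma3 = gamma4 = 0
m_logs   <- lm(y ~ log(x1) + log(x2))            # model (9): gamma1 = gamma2 = 0

p_levels <- anova(m_levels, combined)$`Pr(>F)`[2]  # test of (8)
p_logs   <- anova(m_logs, combined)$`Pr(>F)`[2]    # test of (9)
c(levels = p_levels, logs = p_logs)
```

A small p_levels is evidence against the level model (8). Whether (9) is also rejected depends on the sample, which is the situation Remark 8.1 warns about.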

Davidson and MacKinnon (1981)³ procedure:
For example, to test (8), estimate first
y = β0 + β1 x1 + β2 x2 + θ1 ŷˆ + v, (11)
where ŷˆ is the fitted value of (9). A significant t-value of the θ1-estimate is a rejection of (8).
Similarly, if ŷ denotes the fitted values of (8), the test of (9) is the t-statistic of the θ1-estimate from
y = β0 + β1 log(x1) + β2 log(x2) + θ1 ŷ + v. (12)
³ Davidson, R. and J.G. MacKinnon (1981). Several tests for model specification in the presence of alternative hypotheses, Econometrica 49, 781–793.
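
The Davidson and MacKinnon regressions (11) and (12) need only the fitted values of the rival model. The sketch below uses simulated data in which the logarithmic form (9) is assumed to be the true one; the variable names yhat_logs and yhat_levels are introduced here for illustration.

```r
# Davidson-MacKinnon type t-tests on simulated data (assumed setup)
set.seed(7)
n  <- 300
x1 <- runif(n, 1, 10)
x2 <- runif(n, 1, 10)
y  <- 1 + 2 * log(x1) + 1 * log(x2) + rnorm(n, sd = 0.3)

m_levels <- lm(y ~ x1 + x2)              # model (8)
m_logs   <- lm(y ~ log(x1) + log(x2))    # model (9)

yhat_logs   <- fitted(m_logs)            # fitted values of (9), used in (11)
yhat_levels <- fitted(m_levels)          # fitted values of (8), used in (12)

j_levels <- lm(y ~ x1 + x2 + yhat_logs)              # regression (11)
j_logs   <- lm(y ~ log(x1) + log(x2) + yhat_levels)  # regression (12)

# t-statistics of the theta1-estimates
t_levels <- coef(summary(j_levels))["yhat_logs", "t value"]
t_logs   <- coef(summary(j_logs))["yhat_levels", "t value"]
c(test_of_8 = t_levels, test_of_9 = t_logs)
```

A large absolute t_levels rejects the level model (8); t_logs is the corresponding statistic for (9).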
Remark 8.1: A clear winner need not emerge. Both models may be rejected or neither may be rejected. In the latter case the adjusted R-square can be used to select the better fitting one. If both models are rejected, more work is needed.⁴
⁴ For more complicated cases, see Wooldridge, J.M. (1994). A simple specification test for the predictive ability of transformation models, Review of Economics and Statistics 76, 59–65.

Using proxies for unobserved explanatory variables

As discussed earlier, an important source of bias in OLS is omitted variables that are correlated with the included explanatory variables.
Often the reason for omission is that these variables are unobservable.
A way to mitigate the problem is to collect data on proxy variables.
Consider the following regression
y = β0 + β1 x1 + β2 x2∗ + u, (13)
where x2∗ is an unobservable variable (e.g. human ability).

Suppose that the primary interest is to estimate β1, so that x2∗ is a control variable.
However, as we know, the simple regression y = β0 + β1 x1 + v results in a biased and inconsistent OLS estimator of β1 such that plim β̂1 = β1 + γ1 β2, where γ1 is the slope coefficient of the regression
x2∗ = γ0 + γ1 x1 + error.
Suppose that we have a 'good' proxy x2 for x2∗ such that
(i) E[x2∗ |x2, x1] = E[x2∗ |x2], i.e., given the proxy x2, x1 does not help in predicting the unobserved variable x2∗;
(ii) E[u|x2] = 0 for the error term in regression (13).
These imply that in the regression x2∗ = δ0 + δ1 x2 + θx1 + e we have θ = 0, so that only the proxy x2 is related to the unobserved variable x2∗, and that the proxy x2 is not correlated with the error term of the true regression in equation (13).
With this kind of a good proxy, instead of (13) the model to be estimated becomes
y = α0 + β1 x1 + α2 x2 + w. (14)
This follows by substituting x2∗ = δ0 + δ1 x2 + e into (13), which gives w = u + β2 e.
Now OLS is an unbiased and consistent estimator of β1, the parameter we are primarily interested in (also the OLS estimators of α0 and α2 are unbiased and consistent for these parameters, but α0 = β0 + β2 δ0 and α2 = δ1 β2 differ from β0 and β2).
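
A simulation can make the proxy argument concrete. In the sketch below every number (β0 = 1, β1 = 0.5, β2 = 1, δ0 = 0.5, δ1 = 0.8, and the way x1 depends on x2) is an assumption made for illustration; the data are constructed so that the two proxy conditions hold.

```r
# Omitted-variable bias and the proxy solution on simulated data (assumed values)
set.seed(123)
n      <- 5000
x2     <- rnorm(n)                               # observed proxy
x2star <- 0.5 + 0.8 * x2 + rnorm(n, sd = 0.5)    # unobserved: delta0 = 0.5, delta1 = 0.8, theta = 0
x1     <- 0.6 * x2 + rnorm(n)                    # related to x2star only through x2
u      <- rnorm(n, sd = 0.5)
y      <- 1 + 0.5 * x1 + 1.0 * x2star + u        # model (13): beta1 = 0.5, beta2 = 1

short <- lm(y ~ x1)        # x2star omitted: slope converges to beta1 + gamma1 * beta2
proxy <- lm(y ~ x1 + x2)   # model (14) with the proxy

b1_short <- unname(coef(short)["x1"])
b1_proxy <- unname(coef(proxy)["x1"])
a2_proxy <- unname(coef(proxy)["x2"])   # estimates alpha2 = delta1 * beta2, not beta2
c(short = b1_short, proxy = b1_proxy, alpha2 = a2_proxy)
```

Under these assumptions γ1 = 0.48/1.36 ≈ 0.35, so the short regression should give a slope near 0.85, while the proxy regression should recover β1 ≈ 0.5 and α2 ≈ 0.8.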

Example 3
Consider the return to education in wages (monthly) for men (wage2
data set).
lm(formula = log(wage) ~ educ + exper + tenure + married + south +
urban + black, data = wdf)
Residuals:
Min 1Q Median 3Q Max
-1.98069 -0.21996 0.00707 0.24288 1.22822
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.395497 0.113225 47.653 < 2e-16 ***
educ 0.065431 0.006250 10.468 < 2e-16 ***
exper 0.014043 0.003185 4.409 1.16e-05 ***
tenure 0.011747 0.002453 4.789 1.95e-06 ***
married 0.199417 0.039050 5.107 3.98e-07 ***
south -0.090904 0.026249 -3.463 0.000558 ***
urban 0.183912 0.026958 6.822 1.62e-11 ***
black -0.188350 0.037667 -5.000 6.84e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3655 on 927 degrees of freedom

Multiple R-squared: 0.2526,Adjusted R-squared: 0.2469
F-statistic: 44.75 on 7 and 927 DF, p-value: < 2.2e-16

The estimated return to education is 6.5%. However, if the omitted ability is positively correlated with educ, the estimate is too high.
Adding IQ as a proxy for ability into the equation reduces the estimate to 5.4%, which is consistent with the omitted variable bias assumption.
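lm(formula = log(wage) ~ educ + exper + tenure + married + south +
urban + black + iq, data = wdf)
Residuals:
Min 1Q Median 3Q Max
-2.01203 -0.22244 0.01017 0.22951 1.27478
Coefficients:
Estimate Std. Error t value Pr(>|t|)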
(Intercept) 5.1764391 0.1280006 40.441 < 2e-16 ***
educ 0.0544106 0.0069285 7.853 1.12e-14 ***
exper 0.0141459 0.0031651 4.469 8.82e-06 ***
tenure 0.0113951 0.0024394 4.671 3.44e-06 ***
married 0.1997644 0.0388025 5.148 3.21e-07 ***
south -0.0801695 0.0262529 -3.054 0.002325 **
urban 0.1819463 0.0267929 6.791 1.99e-11 ***
black -0.1431253 0.0394925 -3.624 0.000306 ***
iq 0.0035591 0.0009918 3.589 0.000350 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Test whether the interaction of ability and education affects wages.

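lm(formula = log(wage) ~ educ + exper + tenure + married + south +
urban + black + iq + iq:educ, data = wdf) # iq:educ adds the interaction (product) iq × educ
Residuals:
Min 1Q Median 3Q Max
-2.00733 -0.21715 0.01177 0.23456 1.27305
Coefficients:
Estimate Std. Error t value Pr(>|t|)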
(Intercept) 5.6482478 0.5462963 10.339 < 2e-16 ***
educ 0.0184560 0.0410608 0.449 0.653192
exper 0.0139072 0.0031768 4.378 1.34e-05 ***
tenure 0.0113929 0.0024397 4.670 3.46e-06 ***
married 0.2008658 0.0388267 5.173 2.82e-07 ***
south -0.0802354 0.0262560 -3.056 0.002308 **
urban 0.1835758 0.0268586 6.835 1.49e-11 ***
black -0.1466989 0.0397013 -3.695 0.000233 ***
iq -0.0009418 0.0051625 -0.182 0.855290
educ:iq 0.0003399 0.0003826 0.888 0.374564
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The interaction iq × educ is not only insignificant; adding it also renders educ and iq insignificant!
This is due to the high correlation of the interaction term with its components:
> with(wdf, cor(cbind(educ, iq, educ*iq)))
educ iq educ*iq
educ 1.0000000 0.5156970 0.8880035
iq 0.5156970 1.0000000 0.8453237
educ*iq 0.8880035 0.8453237 1.0000000
The implied collinearity can be materially reduced by defining the interaction term in terms of demeaned variables:
> with(wdf, cor(cbind(educ, iq, (educ - mean(educ))*(iq - mean(iq)))))
educ iq (e-m(e))*(i-m(i))
educ 1.0000000 0.5156970 0.1864668
iq 0.5156970 1.0000000 -0.0133327
(e-m(e))*(i-m(i)) 0.1864668 -0.0133327 1.0000000
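
The same effect can be reproduced without the wage2 data. The sketch below simulates educ and iq with means far from zero; the means, standard deviations and the dependence between the two variables are assumptions chosen only to resemble the scale of such data.

```r
# Why demeaning reduces the collinearity of an interaction term (simulated data)
set.seed(2018)
n    <- 1000
iq   <- rnorm(n, mean = 100, sd = 15)
educ <- 13.5 + 0.08 * (iq - 100) + rnorm(n, sd = 1.8)

raw_int <- educ * iq                                 # plain product
dm_int  <- (educ - mean(educ)) * (iq - mean(iq))     # product of demeaned variables

cor_raw <- cor(cbind(educ, iq, raw_int))
cor_dm  <- cor(cbind(educ, iq, dm_int))
round(cor_raw, 3)
round(cor_dm, 3)
```

Because both variables have large means, the plain product is close to a linear combination of educ and iq, which is what drives the high correlations; after demeaning this linear part is removed.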

The interaction term of the demeaned components also leads to a meaningful interpretation of the implied model.
Writing the original model with the interaction term as
log(wage) = β0 + β1 educ + β2 iq + β12 (educ × iq) + other factors (15)
an equivalent representation in terms of the demeaned interaction becomes
log(wage) = γ0 + γ1 educ + γ2 iq + β12 (educ − mean(educ)) × (iq − mean(iq)) + other factors (16)
where educ − mean(educ) and iq − mean(iq) are the demeaned educ and iq.
The coefficients of the original model (15) and model (16) are related by β0 = γ0 + β12 mean(educ) mean(iq), β1 = γ1 − β12 mean(iq), and β2 = γ2 − β12 mean(educ).
For example, at the mean IQ, iq − mean(iq) = 0, so that γ1 indicates the return to education for a person with average ability.
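
The coefficient relations between (15) and (16) can be checked numerically. The sketch below uses simulated data with assumed coefficient values; it only verifies the algebra and says nothing about the wage2 estimates themselves.

```r
# Numerical check of the relation between models (15) and (16) (simulated data)
set.seed(2018)
n    <- 1000
iq   <- rnorm(n, mean = 100, sd = 15)
educ <- 13.5 + 0.08 * (iq - 100) + rnorm(n, sd = 1.8)
lwage <- 5 + 0.05 * educ + 0.004 * iq + 0.0003 * educ * iq + rnorm(n, sd = 0.35)

educ_c <- educ - mean(educ)   # demeaned educ
iq_c   <- iq - mean(iq)       # demeaned iq

m15 <- lm(lwage ~ educ + iq + educ:iq)           # model (15)
m16 <- lm(lwage ~ educ + iq + I(educ_c * iq_c))  # model (16)

b <- unname(coef(m15))   # beta0, beta1, beta2, beta12
g <- unname(coef(m16))   # gamma0, gamma1, gamma2, beta12

# Right-hand sides of the relations stated in the text
b_from_g <- c(g[1] + g[4] * mean(educ) * mean(iq),
              g[2] - g[4] * mean(iq),
              g[3] - g[4] * mean(educ),
              g[4])
cbind(model15 = b, implied_by_16 = b_from_g)
```

The two columns should agree up to rounding error, and the interaction coefficient is identical in both fits; only the intercept and the coefficients of educ and iq change their meaning.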

Estimating the model, however, gives β̂12 = .00034 with p-value .37, which is not statistically significant. Hence there is no evidence that the return to education varies with IQ.
Dependent variable: log(wage)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.1846286 0.1283466 40.396 < 2e-16
educ 0.0528786 0.0071406 7.405 2.94e-13
exper 0.0139072 0.0031768 4.378 1.34e-05
tenure 0.0113929 0.0024397 4.670 3.46e-06
married 0.2008658 0.0388267 5.173 2.82e-07
south -0.0802354 0.0262560 -3.056 0.002308
urban 0.1835758 0.0268586 6.835 1.49e-11
black -0.1466989 0.0397013 -3.695 0.000233
iq 0.0036357 0.0009957 3.652 0.000275
(iq - mean(iq)) x (educ - mean(educ)) 0.0003399 0.0003826 0.888 0.374564


Outliers

Particularly in small data sets, OLS estimates may be influenced by one or several observations (see figure).
Generally such observations are called outliers or influential observations.
Loosely, an observation is an outlier if dropping it changes the estimation results materially.
In detecting outliers, a usual practice is to investigate standardized (or "studentized") residuals.
If an outlier is an obvious mistake in recording the data, it can be corrected. A usual practice is also to eliminate such observations.
Data transformations, like taking logarithms, often narrow the range of data and hence may alleviate outlier problems, too.
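
A minimal sketch of the studentized-residual check in base R follows. The data are simulated, the contamination of the last observation is artificial, and the cutoff of 3 is a rule of thumb assumed here rather than a value taken from the lecture.

```r
# Detecting an outlier with studentized residuals (simulated data, assumed cutoff)
set.seed(99)
n <- 30
d <- data.frame(x = runif(n, 0, 10))
d$y <- 2 + 0.5 * d$x + rnorm(n, sd = 0.5)
d$y[n] <- d$y[n] + 8            # artificial recording error in the last observation

fit  <- lm(y ~ x, data = d)
stud <- rstudent(fit)           # studentized residuals
flag <- which(abs(stud) > 3)    # observations flagged as potential outliers

fit_drop <- lm(y ~ x, data = d[-flag, ])   # re-estimate without flagged observations
rbind(all_obs = coef(fit), dropped = coef(fit_drop))
```

Comparing the two coefficient rows shows how much the flagged observations move the estimates, which is the loose definition of an outlier given above.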
