
Model Specification and Data Problems

Part VIII

As of Oct 18, 2018


Seppo Pynnönen Econometrics I

1 Model Specification and Data Problems


Functional Form Misspecification
RESET test
Non-nested alternatives

Using proxies for unobserved explanatory variables


Outliers

Functional Form Misspecification

A functional form misspecification generally means that the model does not account for some important nonlinearities.
Recall that omitting an important variable is also a model misspecification.
Generally, functional form misspecification causes bias in the remaining parameter estimators.

Example 1
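Suppose that the correct specification of the wage equation is

$$\log(\text{wage}) = \beta_0 + \beta_1\,\text{educ} + \beta_2\,\text{exper} + \beta_3\,\text{exper}^2 + u. \tag{1}$$

Then the return for an extra year of experience is

$$\frac{\partial \log(\text{wage})}{\partial\,\text{exper}} = \beta_2 + 2\beta_3\,\text{exper}. \tag{2}$$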

If the second order term is dropped from (1), use of the resulting biased
estimate of β2 can be misleading.
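
The effect of dropping the quadratic term can be illustrated with a small simulation. The sketch below is only an illustration under assumed parameter values (the coefficients, sample size and distributions are invented, not taken from any wage data). It fits the correct specification (1) and the short regression without exper^2, then compares the implied returns to experience.

```r
# Simulated data; all parameter values below are assumptions for illustration.
set.seed(1)
n     <- 2000
educ  <- round(runif(n, 8, 18))
exper <- round(runif(n, 0, 40))
lwage <- 1.0 + 0.08 * educ + 0.04 * exper - 0.0007 * exper^2 + rnorm(n, sd = 0.3)

full  <- lm(lwage ~ educ + exper + I(exper^2))  # correct specification (1)
short <- lm(lwage ~ educ + exper)               # quadratic term dropped

b2 <- unname(coef(full)["exper"])
b3 <- unname(coef(full)["I(exper^2)"])

# Return to experience from (2), evaluated at 0, 10 and 30 years
ret_full    <- b2 + 2 * b3 * c(0, 10, 30)
beta2_short <- unname(coef(short)["exper"])     # single "return" from the short model

print(round(c(at0 = ret_full[1], at10 = ret_full[2], at30 = ret_full[3],
              short = beta2_short), 4))
```

In this setup the short regression reports one constant return, which matches the true return only near the middle of the experience range and misstates it for both early-career and late-career workers.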

RESET test

Ramsey (1969)² proposed a general functional form misspecification test, the Regression Specification Error Test (RESET), which has proven to be useful.
Estimate

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u, \tag{3}$$

get the fitted values $\hat{y}$, and estimate the augmented model

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \delta_1 \hat{y}^2 + \delta_2 \hat{y}^3 + e. \tag{4}$$

Test the null hypothesis

$$H_0: \delta_1 = \delta_2 = 0 \tag{5}$$

with the F-test with numerator df1 = 2 and denominator df2 = n − k − 3.

² Ramsey, J.B. (1969). Tests for specification errors in classical linear least-squares analysis, Journal of the Royal Statistical Society, Series B, 31, 350–371.
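
The steps of the RESET procedure can be sketched in base R as follows. This is a minimal illustration on simulated data (the data-generating process and all variable names are assumptions), not a reproduction of any result in these notes. The base model corresponds to (3) with k = 2 and the augmented model to (4).

```r
# Simulated data in which the true relation is nonlinear in x1 (assumption).
set.seed(42)
n   <- 200
x1  <- runif(n, 1, 5)
x2  <- runif(n, 1, 5)
y   <- 1 + 0.5 * x1^2 + x2 + rnorm(n)
dat <- data.frame(y, x1, x2)

base <- lm(y ~ x1 + x2, data = dat)                  # model (3), k = 2
dat$yhat2 <- fitted(base)^2                          # squared fitted values
dat$yhat3 <- fitted(base)^3                          # cubed fitted values
aug  <- lm(y ~ x1 + x2 + yhat2 + yhat3, data = dat)  # augmented model (4)

reset  <- anova(base, aug)       # F-test of H0: delta1 = delta2 = 0
f_stat <- reset$F[2]
df1    <- reset$Df[2]            # numerator df
df2    <- reset$Res.Df[2]        # denominator df, n - k - 3
p_val  <- reset$`Pr(>F)`[2]
print(reset)
```

Packaged implementations exist as well (for example `resettest()` in the lmtest package), but the explicit version makes the degrees of freedom in the F-test easy to check.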

Example 2
Consider the house price data (Exercise 3.1) and estimate

$$\text{price} = \beta_0 + \beta_1\,\text{lotsize} + \beta_2\,\text{sqrft} + \beta_3\,\text{bdrms} + u. \tag{6}$$

Estimation results are:

Dependent Variable: PRICE
Method: Least Squares
Sample: 1 88
Included observations: 88
=============================================================
Variable      Coefficient   Std. Error   t-Statistic    Prob.
-------------------------------------------------------------
C               -21.77031     29.47504     -0.738601   0.4622
LOTSIZE          0.002068     0.000642      3.220096   0.0018
SQRFT            0.122778     0.013237      9.275093   0.0000
BDRMS            13.85252     9.010145      1.537436   0.1279
=============================================================

===================================================================
R-squared             0.672362    Mean dependent var       293.5460
Adjusted R-squared    0.660661    S.D. dependent var       102.7134
S.E. of regression    59.83348    Akaike info criterion    11.06540
Sum squared resid     300723.8    Schwarz criterion        11.17800
Log likelihood       -482.8775    F-statistic              57.46023
Durbin-Watson stat    2.109796    Prob(F-statistic)        0.000000
===================================================================

Estimate next (6) augmented with $\widehat{\text{price}}^2$ and $\widehat{\text{price}}^3$ as in (4).
The F-statistic for the null hypothesis (5) becomes F = 4.67 with 2 and 82 degrees of freedom. The p-value is 0.012, so we reject the null hypothesis at the 5% level.
Thus, there is some evidence of non-linearity.

Estimate next

$$\log(\text{price}) = \beta_0 + \beta_1 \log(\text{lotsize}) + \beta_2 \log(\text{sqrft}) + \beta_3\,\text{bdrms} + u. \tag{7}$$

Estimation results:

Dependent Variable: LOG(PRICE)
Method: Least Squares
Date: 10/19/06 Time: 00:01
Sample: 1 88
Included observations: 88
===============================================================
Variable        Coefficient   Std. Error   t-Statistic    Prob.
---------------------------------------------------------------
C                 -1.297042     0.651284     -1.991517   0.0497
LOG(LOTSIZE)       0.167967     0.038281      4.387714   0.0000
LOG(SQRFT)         0.700232     0.092865      7.540306   0.0000
BDRMS              0.036958     0.027531      1.342415   0.1831
===============================================================

===================================================================
R-squared             0.642965    Mean dependent var       5.633180
Adjusted R-squared    0.630214    S.D. dependent var       0.303573
S.E. of regression    0.184603    Akaike info criterion   -0.496833
Sum squared resid     2.862563    Schwarz criterion       -0.384227
Log likelihood        25.86066    F-statistic              50.42374
Durbin-Watson stat    2.088996    Prob(F-statistic)        0.000000
===================================================================

Applying the RESET test, the F-statistic for the null hypothesis (5) is now F = 2.56 with p-value 0.084, which implies that the hypothesis is not rejected at the 5% level.

Thus overall, on the basis of the RESET test the log-log model (7) is
preferred.
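
The two reported p-values can be checked from the F-distribution. The short sketch below assumes that both RESET tests use 2 and 82 degrees of freedom, which follows from df2 = n − k − 3 with n = 88 and k = 3. The F-statistics are the rounded values quoted in the text, so small rounding differences are to be expected.

```r
# Upper-tail probabilities of the quoted RESET F-statistics, df = (2, 82)
p_level  <- pf(4.67, df1 = 2, df2 = 88 - 3 - 3, lower.tail = FALSE)  # level model (6)
p_loglog <- pf(2.56, df1 = 2, df2 = 88 - 3 - 3, lower.tail = FALSE)  # log-log model (7)
print(round(c(level = p_level, loglog = p_loglog), 3))
```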

Non-nested alternatives

Suppose, for example, that the model choices are

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u \tag{8}$$

and

$$y = \beta_0 + \beta_1 \log(x_1) + \beta_2 \log(x_2) + u. \tag{9}$$

Because the models are non-nested, the usual F-test does not apply to a direct comparison of (8) and (9).
A common approach is to estimate a combined model

$$y = \gamma_0 + \gamma_1 x_1 + \gamma_2 x_2 + \gamma_3 \log(x_1) + \gamma_4 \log(x_2) + u. \tag{10}$$

$H_0: \gamma_3 = \gamma_4 = 0$ is a hypothesis for (8) and $H_0: \gamma_1 = \gamma_2 = 0$ is a hypothesis for (9). Both (8) and (9) are nested in (10), so the usual F-test applies again here.
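
A minimal sketch of the combined-model approach, using simulated data in which the logarithmic model (9) is the true one (the data-generating process is an assumption made for illustration):

```r
# Simulated data generated from the log specification (9) (assumption).
set.seed(7)
n  <- 300
x1 <- runif(n, 1, 10)
x2 <- runif(n, 1, 10)
y  <- 2 + 1.5 * log(x1) + 0.8 * log(x2) + rnorm(n, sd = 0.5)

m_lin  <- lm(y ~ x1 + x2)                      # model (8)
m_log  <- lm(y ~ log(x1) + log(x2))            # model (9)
m_comb <- lm(y ~ x1 + x2 + log(x1) + log(x2))  # combined model (10)

test_lin <- anova(m_lin, m_comb)  # H0: gamma3 = gamma4 = 0, i.e. (8) is adequate
test_log <- anova(m_log, m_comb)  # H0: gamma1 = gamma2 = 0, i.e. (9) is adequate
p_lin <- test_lin$`Pr(>F)`[2]
p_log <- test_log$`Pr(>F)`[2]
print(round(c(p_linear = p_lin, p_log = p_log), 4))
```

Each `anova()` call compares one of the candidate models with the combined model, so each F-test has two numerator degrees of freedom.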

Davidson and MacKinnon (1981)³ procedure:

For example, to test (8), estimate first

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \theta_1 \hat{\hat{y}} + v, \tag{11}$$

where $\hat{\hat{y}}$ is the fitted value of (9). A significant t-value of the $\theta_1$-estimate is a rejection of (8).
Similarly, if $\hat{y}$ denotes the fitted values of (8), the test of (9) is the t-statistic of the $\theta_1$-estimate from

$$y = \beta_0 + \beta_1 \log(x_1) + \beta_2 \log(x_2) + \theta_1 \hat{y} + v. \tag{12}$$

³ Davidson, R. and J.G. MacKinnon (1981). Several tests for model specification in the presence of alternative hypotheses, Econometrica 49, 781–793.
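
The Davidson-MacKinnon regressions (11) and (12) can be sketched in the same way. The data are again simulated from the log model (9), which is an assumption for illustration, and the object names are arbitrary.

```r
# Simulated data generated from the log specification (9) (assumption).
set.seed(11)
n  <- 300
x1 <- runif(n, 1, 10)
x2 <- runif(n, 1, 10)
y  <- 2 + 1.5 * log(x1) + 0.8 * log(x2) + rnorm(n, sd = 0.5)

m_lin   <- lm(y ~ x1 + x2)            # model (8)
m_log   <- lm(y ~ log(x1) + log(x2))  # model (9)
fit_lin <- fitted(m_lin)
fit_log <- fitted(m_log)

j_lin <- lm(y ~ x1 + x2 + fit_log)            # regression (11): test of (8)
j_log <- lm(y ~ log(x1) + log(x2) + fit_lin)  # regression (12): test of (9)

# t-statistic and p-value of theta1 in each auxiliary regression
t_lin <- coef(summary(j_lin))["fit_log", c("t value", "Pr(>|t|)")]
t_log <- coef(summary(j_log))["fit_lin", c("t value", "Pr(>|t|)")]
print(rbind(test_of_8 = t_lin, test_of_9 = t_log))
```

A significant coefficient on `fit_log` in the first regression is evidence against (8). The second regression plays the same role for (9).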

Remark 8.1: A clear winner need not emerge. Both models may be rejected or neither may be rejected. In the latter case the adjusted R-squared can be used to select the better fitting one. If both models are rejected, more work is needed.⁴

⁴ For more complicated cases, see Wooldridge, J.M. (1994). A simple specification test for the predictive ability of transformation models, Review of Economics and Statistics 76, 59–65.

Using proxies for unobserved explanatory variables

As discussed earlier, an important source of bias in OLS is omitted variables that are correlated with the included explanatory variables.
Often the reason for omission is that these variables are unobservable.
A way to mitigate the problem is to collect data on proxy variables.

Consider the following regression

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2^* + u, \tag{13}$$

where $x_2^*$ is an unobservable variable (e.g. human ability).

Suppose that the primary interest is to estimate $\beta_1$, so that $x_2^*$ is a control variable.
However, as we know, the simple regression $y = \beta_0 + \beta_1 x_1 + v$ results in a biased and inconsistent OLS estimator of $\beta_1$ such that $\operatorname{plim} \hat{\beta}_1 = \beta_1 + \gamma_1 \beta_2$, where $\gamma_1$ is the slope coefficient of the regression $x_2^* = \gamma_0 + \gamma_1 x_1 + \text{error}$.
Suppose that we have a 'good' proxy $x_2$ for $x_2^*$ such that

$E[x_2^* \mid x_2, x_1] = E[x_2^* \mid x_2]$, i.e., given the proxy $x_2$, $x_1$ does not help in predicting the unobserved variable $x_2^*$.
$E[u \mid x_2] = 0$ for the error term in regression (13).

These imply that in the regression $x_2^* = \delta_0 + \delta_1 x_2 + \theta x_1 + e$ we have $\theta = 0$, so that only the proxy $x_2$ is related to the unobserved variable $x_2^*$, and that the proxy $x_2$ is not correlated with the error term of the true regression in equation (13).

With this kind of a good proxy, instead of (13) the model to be estimated becomes

$$y = \alpha_0 + \beta_1 x_1 + \alpha_2 x_2 + w. \tag{14}$$

Now OLS is an unbiased and consistent estimator of $\beta_1$, the parameter we are primarily interested in (the OLS estimators of $\alpha_0$ and $\alpha_2$ are also unbiased and consistent for these parameters, but $\alpha_0 = \beta_0 + \beta_2 \delta_0$ and $\alpha_2 = \beta_2 \delta_1$ differ from $\beta_0$ and $\beta_2$).
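
A small simulation can make the proxy argument concrete. The sketch below is built on assumed values (β0 = 1, β1 = 2, β2 = 1.5, δ0 = 0.5, δ1 = 0.8, θ = 0, all invented for illustration) and constructs x1 so that it is related to the unobserved variable only through the proxy, as the 'good proxy' conditions require.

```r
# Simulated data satisfying the good-proxy conditions (all values are assumptions).
set.seed(123)
n      <- 5000
x2     <- rnorm(n)                               # observed proxy
x2star <- 0.5 + 0.8 * x2 + rnorm(n, sd = 0.5)    # unobserved; delta0 = 0.5, delta1 = 0.8
x1     <- 0.6 * x2 + rnorm(n)                    # related to x2star only via x2
y      <- 1 + 2 * x1 + 1.5 * x2star + rnorm(n)   # model (13)

b1_omit <- unname(coef(lm(y ~ x1))["x1"])        # x2star omitted: biased for beta1
b_proxy <- coef(lm(y ~ x1 + x2))                 # model (14) with the proxy

print(round(c(beta1_omitted = b1_omit, b_proxy), 3))
# Implied values under the assumptions: alpha0 = 1 + 1.5 * 0.5, alpha2 = 1.5 * 0.8
```

With the proxy included, the coefficient on x1 should be close to the assumed β1 = 2, while the intercept and the coefficient on x2 estimate α0 and α2 rather than β0 and β2.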

Example 3
Consider the return to education in wages (monthly) for men (wage2 data set).

lm(formula = log(wage) ~ educ + exper + tenure + married + south +
    urban + black, data = wdf)

Residuals:
     Min       1Q   Median       3Q      Max
-1.98069 -0.21996  0.00707  0.24288  1.22822

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.395497   0.113225  47.653  < 2e-16 ***
educ         0.065431   0.006250  10.468  < 2e-16 ***
exper        0.014043   0.003185   4.409 1.16e-05 ***
tenure       0.011747   0.002453   4.789 1.95e-06 ***
married      0.199417   0.039050   5.107 3.98e-07 ***
south       -0.090904   0.026249  -3.463 0.000558 ***
urban        0.183912   0.026958   6.822 1.62e-11 ***
black       -0.188350   0.037667  -5.000 6.84e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3655 on 927 degrees of freedom
Multiple R-squared:  0.2526,  Adjusted R-squared:  0.2469
F-statistic: 44.75 on 7 and 927 DF,  p-value: < 2.2e-16

The estimated return to education is 6.5%. However, if the omitted ability is positively correlated with educ, the estimate is too high.
Adding IQ as a proxy for ability into the equation reduces the estimate to 5.4%, which is consistent with the omitted variable bias assumption.

lm(formula = log(wage) ~ educ + exper + tenure + married + south +
    urban + black + iq, data = wdf)

Residuals:
     Min       1Q   Median       3Q      Max
-2.01203 -0.22244  0.01017  0.22951  1.27478

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.1764391  0.1280006  40.441  < 2e-16 ***
educ         0.0544106  0.0069285   7.853 1.12e-14 ***
exper        0.0141459  0.0031651   4.469 8.82e-06 ***
tenure       0.0113951  0.0024394   4.671 3.44e-06 ***
married      0.1997644  0.0388025   5.148 3.21e-07 ***
south       -0.0801695  0.0262529  -3.054 0.002325 **
urban        0.1819463  0.0267929   6.791 1.99e-11 ***
black       -0.1431253  0.0394925  -3.624 0.000306 ***
iq           0.0035591  0.0009918   3.589 0.000350 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3632 on 926 degrees of freedom
Multiple R-squared:  0.2628,  Adjusted R-squared:  0.2564
F-statistic: 41.27 on 8 and 926 DF,  p-value: < 2.2e-16

Test whether the interaction of ability and education affects wages.

lm(formula = log(wage) ~ educ + exper + tenure + married + south +
    urban + black + iq + iq:educ, data = wdf)  # iq:educ introduces interaction iq*educ

Residuals:
     Min       1Q   Median       3Q      Max
-2.00733 -0.21715  0.01177  0.23456  1.27305

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.6482478  0.5462963  10.339  < 2e-16 ***
educ         0.0184560  0.0410608   0.449 0.653192
exper        0.0139072  0.0031768   4.378 1.34e-05 ***
tenure       0.0113929  0.0024397   4.670 3.46e-06 ***
married      0.2008658  0.0388267   5.173 2.82e-07 ***
south       -0.0802354  0.0262560  -3.056 0.002308 **
urban        0.1835758  0.0268586   6.835 1.49e-11 ***
black       -0.1466989  0.0397013  -3.695 0.000233 ***
iq          -0.0009418  0.0051625  -0.182 0.855290
educ:iq      0.0003399  0.0003826   0.888 0.374564
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3632 on 925 degrees of freedom
Multiple R-squared:  0.2634,  Adjusted R-squared:  0.2563
F-statistic: 36.76 on 9 and 925 DF,  p-value: < 2.2e-16

Adding iq × educ is not only insignificant, but it also renders educ and iq insignificant!
This is due to the high correlation of the interaction term with its components:

> with(wdf, cor(cbind(educ, iq, educ*iq)))
             educ        iq   educ*iq
educ    1.0000000 0.5156970 0.8880035
iq      0.5156970 1.0000000 0.8453237
educ*iq 0.8880035 0.8453237 1.0000000

The implied collinearity can be materially reduced by defining the interaction term in terms of demeaned variables:

> with(wdf, cor(cbind(educ, iq, (educ - mean(educ))*(iq - mean(iq)))))
                       educ         iq (e-m(e))*(i-m(i))
educ              1.0000000  0.5156970         0.1864668
iq                0.5156970  1.0000000        -0.0133327
(e-m(e))*(i-m(i)) 0.1864668 -0.0133327         1.0000000
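
The same effect can be reproduced without the wage2 data. The sketch below uses simulated stand-ins for educ and iq (the means, spreads and the link between them are assumptions, so the numbers will differ from the matrices above) and compares the correlation of the raw and the demeaned product with their components.

```r
# Simulated stand-ins for educ and iq; not the wage2 data (assumption).
set.seed(2018)
n    <- 935
iq   <- rnorm(n, mean = 100, sd = 15)
educ <- 13.5 + 0.07 * (iq - 100) + rnorm(n, sd = 1.9)

raw      <- educ * iq                               # raw interaction
centered <- (educ - mean(educ)) * (iq - mean(iq))   # demeaned interaction

cor_raw      <- cor(cbind(educ, iq, raw))
cor_centered <- cor(cbind(educ, iq, centered))
print(round(cor_raw, 3))
print(round(cor_centered, 3))
```

Because both variables have means far from zero, the raw product moves almost one-for-one with each component, whereas the demeaned product does not.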

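The interaction term of the demeaned components also leads to a meaningful interpretation of the implied model.
Writing the original model with the interaction term as

$$\log(\text{wage}) = \beta_0 + \beta_1\,\text{educ} + \beta_2\,\text{iq} + \beta_{12}(\text{educ} \times \text{iq}) + \text{other factors}, \tag{15}$$

an equivalent representation in terms of the demeaned interaction becomes

$$\log(\text{wage}) = \gamma_0 + \gamma_1\,\text{educ} + \gamma_2\,\text{iq} + \beta_{12}(\widetilde{\text{educ}} \times \widetilde{\text{iq}}) + \text{other factors}, \tag{16}$$

where $\widetilde{\text{educ}} = \text{educ} - \overline{\text{educ}}$ and $\widetilde{\text{iq}} = \text{iq} - \overline{\text{iq}}$ are demeaned educ and iq.
The relations between the coefficients of the original model (15) and model (16) are $\beta_0 = \gamma_0 + \beta_{12}(\overline{\text{educ}} \times \overline{\text{iq}})$, $\beta_1 = \gamma_1 - \beta_{12}\,\overline{\text{iq}}$, and $\beta_2 = \gamma_2 - \beta_{12}\,\overline{\text{educ}}$.
For example, at the mean IQ, $\widetilde{\text{iq}} = 0$, so that $\gamma_1$ indicates the return to education for a person with average ability.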
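
The coefficient relations between (15) and (16) hold exactly in any sample, which can be verified numerically. The sketch below uses simulated data (all parameter values are assumptions) and reconstructs the coefficients of the raw-interaction model from those of the demeaned-interaction model.

```r
# Simulated data; parameter values are assumptions for illustration only.
set.seed(99)
n     <- 500
iq    <- rnorm(n, mean = 100, sd = 15)
educ  <- 13.5 + 0.07 * (iq - 100) + rnorm(n, sd = 1.9)
lwage <- 5 + 0.05 * educ + 0.004 * iq + 0.0003 * educ * iq + rnorm(n, sd = 0.35)

ce <- educ - mean(educ)                        # demeaned educ
ci <- iq - mean(iq)                            # demeaned iq

m15 <- lm(lwage ~ educ + iq + I(educ * iq))    # model (15)
m16 <- lm(lwage ~ educ + iq + I(ce * ci))      # model (16)

b <- unname(coef(m15))   # beta0, beta1, beta2, beta12
g <- unname(coef(m16))   # gamma0, gamma1, gamma2, beta12

# Coefficients of (15) rebuilt from (16) using the stated relations
rebuilt <- c(g[1] + g[4] * mean(educ) * mean(iq),
             g[2] - g[4] * mean(iq),
             g[3] - g[4] * mean(educ),
             g[4])
print(cbind(model15 = b, rebuilt_from_16 = rebuilt))
```

The interaction coefficient and the fitted values are identical in the two parameterizations. Only the interpretation and the standard errors of the main effects change.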

Estimating the model, however, gives $\hat{\beta}_{12} = 0.00034$ with p-value 0.37, which is far from statistically significant. Thus there is no evidence that the return to education depends on IQ.

Dependent variable: log(wage)

Coefficients:
                                        Estimate Std. Error t value Pr(>|t|)
(Intercept)                            5.1846286  0.1283466  40.396  < 2e-16
educ                                   0.0528786  0.0071406   7.405 2.94e-13
exper                                  0.0139072  0.0031768   4.378 1.34e-05
tenure                                 0.0113929  0.0024397   4.670 3.46e-06
married                                0.2008658  0.0388267   5.173 2.82e-07
south                                 -0.0802354  0.0262560  -3.056 0.002308
urban                                  0.1835758  0.0268586   6.835 1.49e-11
black                                 -0.1466989  0.0397013  -3.695 0.000233
iq                                     0.0036357  0.0009957   3.652 0.000275
(iq - mean(iq)) x (educ - mean(educ))  0.0003399  0.0003826   0.888 0.374564

Residual standard error: 0.3632 on 925 degrees of freedom
Multiple R-squared:  0.2634,  Adjusted R-squared:  0.2563
F-statistic: 36.76 on 9 and 925 DF,  p-value: < 2.2e-16

Outliers

Particularly in small data sets, OLS estimates may be influenced by one or several observations (see figure).
Generally such observations are called outliers or influential observations.
Loosely, an observation is an outlier if dropping it changes estimation results materially.
In detecting outliers, a usual practice is to investigate standardized (or "studentized") residuals.
If an outlier is an obvious mistake in recording the data, it can be corrected. A usual practice is also to eliminate such observations.
Data transformations, like taking logarithms, often narrow the range of data and hence may alleviate outlier problems, too.
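
As an illustration of the residual-based check, the sketch below plants one gross error in simulated data and inspects standardized and studentized residuals from base R. The data, the position of the planted error, and the cut-off of 3 are assumptions chosen for the example, not rules from these notes.

```r
# Simulated data with one planted recording error (assumption).
set.seed(5)
n   <- 30
x   <- runif(n, 0, 10)
y   <- 1 + 0.5 * x + rnorm(n, sd = 0.5)
y[17] <- y[17] + 6                      # observation 17 is the planted outlier
dat <- data.frame(x, y)

fit   <- lm(y ~ x, data = dat)
r_std <- rstandard(fit)                 # standardized residuals
r_stu <- rstudent(fit)                  # studentized (leave-one-out) residuals
flagged <- which(abs(r_stu) > 3)        # rule-of-thumb cut-off (assumption)

coef_all  <- coef(fit)
coef_drop <- coef(lm(y ~ x, data = dat[-17, ]))  # refit without the flagged point
print(flagged)
print(round(rbind(all = coef_all, dropped = coef_drop), 3))
```

Comparing the two coefficient rows shows how much the estimates move when the suspect observation is dropped, which is the loose definition of an outlier given above.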
