ADVANCED ECONOMETRICS - PART 1

Christophe Croux
christophe.croux@econ.kuleuven.be

Contents
1 Basic Econometrics 4
1.1 Basic Statistical Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.1 Why Use Statistics? . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.2 Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.3 Estimators of means and variances . . . . . . . . . . . . . . . . . . 10
1.1.4 Joint and Conditional Distributions . . . . . . . . . . . . . . . . . . 12
1.1.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2 Econometrics in Practice: Introduction to the Use of Eviews . . . . . . . . 16
1.2.1 Hands on with Eviews . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.2.2 Basic Principles of Eviews . . . . . . . . . . . . . . . . . . . . . . . 19
1.2.3 Descriptive Univariate Statistics . . . . . . . . . . . . . . . . . . . . 19
1.2.4 Correlation Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.2.5 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.3 The Linear Regression Model . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.3.1 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.3.2 The Ordinary Least Squares Estimator . . . . . . . . . . . . . . . . 27
1.3.3 Properties of OLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.3.4 Summary Statistics in Regression Analysis . . . . . . . . . . . . . . 29
1.3.5 Residual Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.3.6 Using Dummy Variables . . . . . . . . . . . . . . . . . . . . . . . . 32
1.3.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.4 Model Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.4.1 Running Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.4.2 Coefficient Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.4.3 Omitted and Redundant Variables . . . . . . . . . . . . . . . . . . 37
1.4.4 Residual Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.4.5 Consistent Estimation of Covariance . . . . . . . . . . . . . . . . . 39

2 Review: Multivariate Random Variables 41


2.1 Distribution and Density Function . . . . . . . . . . . . . . . . . . . . . . . 41
2.2 Expected Values and Covariances . . . . . . . . . . . . . . . . . . . . . . . 41
2.3 The Multivariate Normal Distribution . . . . . . . . . . . . . . . . . . . . . 42
2.4 Asymptotic Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3 Non-linear Regression 44
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2 Least Squares Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.1 How to Compute θ̂ . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.2 Asymptotic Distribution . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2.3 Example (Green, 1993) . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.4 The Delta-Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3 Homework: Non-Linear Regression . . . . . . . . . . . . . . . . . . . . . . 51
3.4 Homework: Some Theoretical Exercises . . . . . . . . . . . . . . . . . . . . 53

4 Non-parametric regression 54
4.1 Ingredients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.1.1 1st Ingredient: Polynomial Fit . . . . . . . . . . . . . . . . . . . . . 54
4.1.2 2nd ingredient: Weighted OLS . . . . . . . . . . . . . . . . . . . . . 55
4.2 Local fit around x0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3 Non-parametric regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

5 Econometrics of Panel Data 58


5.1 Overview of Topics Discussed . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2 Basics and Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3 The Generalized Least Squares estimator . . . . . . . . . . . . . . . . . . . 61
5.4 The Fixed Effects Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.4.1 Least Squares Dummy Variable (LSDV) Estimation . . . . . . . . . 70
5.4.2 Within Groups Estimator . . . . . . . . . . . . . . . . . . . . . . . 70
5.4.3 Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.5 The Random Effects Model . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.6 Homework: The General Linear Model . . . . . . . . . . . . . . . . . . . . 79

6 Binary Choice Models 81


6.1 Overview of topics discussed . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.2 Binary Dependent Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.3 Probit and Logit Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.4 Maximum Likelihood Estimation (MLE) . . . . . . . . . . . . . . . . . . . 85
6.5 Estimating Binary Models in Eviews . . . . . . . . . . . . . . . . . . . . . 87
6.6 Measures of Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.7 Other Limited Dependent Variable Models . . . . . . . . . . . . . . . . . . 92
6.8 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

7 The Poisson Regression Model 96


7.1 The Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.2 The Poisson Regression Model . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.3 The Maximum Likelihood Estimator . . . . . . . . . . . . . . . . . . . . . 98
7.4 Overdispersion and the Negative Binomial Model . . . . . . . . . . . . . . 99
7.5 Homework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

8 Solutions to Exercises 103


8.1 Exercises from Course Notes . . . . . . . . . . . . . . . . . . . . . . . . . . 103
8.1.1 Exercise 2, page 13 . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
8.1.2 Exercise page 32 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
8.2 Homework General Linear Model, page 79 . . . . . . . . . . . . . . . . . . 107
8.3 Homework Non-Linear Regression, page 51 . . . . . . . . . . . . . . . . . . 113
8.4 Homework Some Theoretical Exercises, page 53 . . . . . . . . . . . . . . . 118
8.5 Homework: The Poisson Regression Model, page 101 . . . . . . . . . . . . 122

1 Basic Econometrics

1.1 Basic Statistical Concepts

1.1.1 Why Use Statistics?

By economic theory or by intuitive thinking one deduces relationships between variables.


Statistics (or econometrics) can help us with

1. Measuring these relationships.

2. Testing the theory behind these relationships.

3. Using these relationships for quantitative predictions/forecasts.

For this, collected/experimental or observed/historical data will be used. Typically, only a sample of the whole population of interest is available, inducing uncertainty. Statistics helps us to quantify this uncertainty.

Example: Demand for money


M = demand for money; Y = (real) national income; r = interest rate; P = general price level.

Theory says that M depends on Y and r only.

• How to specify and estimate this function?

• How to measure M , Y ,P and r?

Possible specifications:

1. Linear form: M = α + βY + γr. Note that ∂M/∂Y = β (marginal effect), which means that if Y increases by one unit, then M increases by β units, ceteris paribus.

2. Linear after log-transformation: log(M) = α + β log(Y) + γr. Note that (∂M/M)/(∂Y/Y) = β (elasticity), which means that if Y increases by 1%, then M increases by β%, ceteris paribus.

3. Non-linear forms, e.g. log(M) = α + β(log(Y))^δ + γr, with non-constant elasticities.

After specification of the form, we still need to estimate the unknown parameters (e.g.
by Least Squares or Maximum Likelihood). We could be interested in testing whether P
has a significant effect on M or not. According to theory, it should not. Specify:

M = α + βY + γr + δP

Test the null hypothesis H0 : δ = 0. If we reject H0 , then we say that P is a significant


variable in the model. If we do not reject H0 , then P is said to have no significant effect.

Remark: The relationships above are not exact/deterministic. Otherwise M would be perfectly predictable. We need to add a disturbance term ε:

log(M) = α + β log(Y) + γr + ε        (1)

This disturbance term captures

1. measurement errors

2. (small) influences of omitted variables

3. unpredictable behavior/events

4. deviation from long run equilibrium

5. etc.

Equation (1) is supposed to hold at any time t. We may write for t = 1, 2, 3, . . .,

log(Mt) = α + β log(Yt) + γrt + εt.

1.1.2 Probability Distributions

In statistics and econometrics we deal with variables whose values are (partly) determined
by some chance mechanism. Such variables are called stochastic or random variables. In
principle, we denote stochastic variables with X, Y, Z, ... and their outcomes/realizations
by x, y, z, . . .. Although we cannot predict which values a random variable will take,
we are often able to attach certain probabilities to these values, that is, to derive its
(probability) distribution.

Discrete stochastic variables A discrete stochastic variable X takes values in a finite set {0, 1, 2, . . . , N} or in {0, 1, 2, 3, . . .}. Examples are passing or failing an exam, the number of matches won in a championship, or the number of car accidents of a person during one year.

The probability distribution of X is given by listing all

pk = P(X = k) for k = 0, 1, 2, . . .

The expected value of X is defined as

E[X] = Σ_k k pk.

It is also called the (population) mean of X. Similarly, the expected value of Z = g(X) is defined as

E[Z] = Σ_k g(k) pk.

Example 1: Let X be the number of boys in a family with 3 children. Then

k      0      1      2      3
pk   0.125  0.375  0.375  0.125

It follows that E[X] = 0 · 0.125 + 1 · 0.375 + 2 · 0.375 + 3 · 0.125 = 1.5.

Example 2: Let X be the number of children in a family. Here we do not know the
exact probabilities and we need to estimate them. We therefore collect a random sample
of n = 60 families and record the number of children in each family. The results are
pictured in the barplot below:
How could you obtain an estimate for the expected value of X?

[Barplot: frequency (0 to 20) of the number of children (0 to 6) in the sample of 60 families.]
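One way to estimate E[X] here: estimate each pk by the relative frequency nk/n and plug these into the definition, which amounts to taking the sample average. A minimal Python sketch (the counts below are hypothetical, read off a barplot like the one above):

    # Hypothetical counts of families with k children, k = 0,...,6 (n = 60 in total).
    counts = [10, 18, 20, 7, 3, 1, 1]
    n = sum(counts)
    # Estimate p_k by n_k / n; then E[X] is estimated by sum over k of k * p_k,
    # which equals the sample average of the 60 recorded values.
    e_hat = sum(k * c for k, c in enumerate(counts)) / n
    print(e_hat)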

Continuous stochastic variables If X can take any value in a certain interval of real values, then we say that X is a continuous stochastic variable. The probability distribution is characterized by a density function f(x), which has the properties

f(x) ≥ 0 and ∫_{-∞}^{∞} f(x) dx = 1.

The density function allows us to compute probabilities:

P(a ≤ X ≤ b) = ∫_a^b f(x) dx.

Alternatively, we can characterize the probability distribution by the cumulative distribution function F, defined as

F(x) = P(X ≤ x).

We see that F(x) = ∫_{-∞}^{x} f(y) dy and that the derivative of F equals f. The expected value of X is defined as

E[X] = ∫ x f(x) dx.

It is also called the (population) mean of X. The expected value of Z = g(X) is defined as

E[Z] = ∫ g(x) f(x) dx.

The variance of a stochastic variable X (discrete or continuous) is defined as

Var(X) = E[(X − E[X])²],

and the standard deviation of X equals

SD(X) = √Var(X).

Note that the standard deviation has the same units of measurement as X.

Example: The standard normal distribution has density function

f(x) = exp(−x²/2)/√(2π),

[Plot: the standard normal density function f(x), for x from −4 to 4.]

and its cumulative distribution function,

[Plot: the standard normal cumulative distribution function F(x), for x from −4 to 4.]

For X a standard normal random variable, we have E[X] = 0 and SD(X) = 1. Furthermore: P(−1 ≤ X ≤ 1) ≈ 0.68, P(−2 ≤ X ≤ 2) ≈ 0.95, P(−3 ≤ X ≤ 3) ≈ 0.997.

A random variable Y is said to follow a normal distribution with parameters µ and σ if

(Y − µ)/σ

has a standard normal distribution. Notation: Y ∼ N(µ, σ). We have E[Y] = µ and SD(Y) = σ.

Note that the interval

[mean − 2·SD, mean + 2·SD]

contains about 95% of the possible outcomes of Y. It is called the 2σ-interval. For example, if the average IQ of the Belgian population is 100 with SD = 15, then about 95% of the Belgians have an IQ in the interval [70; 130], under the assumption of normality.

1.1.3 Estimators of means and variances

The (population) mean µ = E[X] and standard deviation σ = SD(X) are (almost always)
unknown. We call them population parameters. We can only compute them exactly when
the probability distribution is completely known, which is rarely the case. However, it
is possible to estimate them. If we have a random sample X1, . . . , Xn, then the sample average is

µ̂ = (1/n) Σ_{i=1}^{n} Xi,

and the sample standard deviation is

σ̂ = √[(1/n) Σ_{i=1}^{n} (Xi − µ̂)²].

When estimating µ by µ̂ (or σ by σ̂) we will make an estimation error. This error is quantified by the standard error (SE): the magnitude of a "typical" error. An (approximate) rule to construct a confidence interval for an estimated parameter is given by

[estimator − 2·SE, estimator + 2·SE].

This rule is valid for "most" estimators and is based on the fact that the distribution of "most" estimators is close to a normal distribution for large sample sizes (Central Limit Theorem).
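A minimal numerical illustration of this rule in Python (simulated data, so the numbers themselves carry no meaning):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=10, scale=2, size=200)     # a random sample of size n = 200

    mu_hat = x.mean()                             # sample average, estimator of E[X]
    se = x.std() / np.sqrt(len(x))                # standard error of the sample average
    # np.std uses the 1/n definition, matching the sample standard deviation above.
    ci = (mu_hat - 2 * se, mu_hat + 2 * se)       # approximate 95% confidence interval
    print(mu_hat, se, ci)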

Example: Let X be the monthly income of a factory worker (in euro). We want to
estimate the population mean E[X]. We take the income for a random sample of n = 300
workers, given in the following histogram,
[Histogram: monthly incomes of the 300 sampled workers, roughly between 1000 and 1400 euro.]

The sample average equals 1201, with standard error SE = 6.47. The sample standard deviation is 112. We conclude that a 95% confidence interval for the population mean is given by [1189; 1215]. Moreover, since the distribution of X seems to be close to a normal distribution, we may say that about 95% of the monthly incomes are in the interval [978; 1427].

An additional property, useful when working with a function of a variable X, is as follows: let a and b be two real numbers; then

E[aX + b] = aE[X] + b

and

SD(aX + b) = |a| SD(X).

1.1.4 Joint and Conditional Distributions

Let X and Y be two stochastic variables which possibly interact. To simplify the formulas, we consider them as discrete.

The joint distribution of (X, Y ) is given by the probabilities of the form

P (X = k, Y = l).

The conditional distribution of Y given X = k is given by probabilities of the form

P(Y = l | X = k) = P(Y = l and X = k) / P(X = k),

for each possible outcome k of X.

The marginal distributions of X and Y are simply the distributions of X and Y considered separately.

Definition: When the conditional distribution of Y given X equals the marginal distribu-
tion of Y , then X and Y are statistically independent.

Properties:

- E[X+Y]=E[X]+E[Y]

- Var(X+Y)=Var(X)+Var(Y), if X and Y are statistically independent.

Example: Let X be the gender of the first child (0=girl, 1=boy) and Y the gender of the
second child of a family. The joint distribution of (X, Y ) is,

P((X, Y) = (0, 0)) = 0.25
P((X, Y) = (0, 1)) = 0.25
P((X, Y) = (1, 0)) = 0.25
P((X, Y) = (1, 1)) = 0.25

Marginal distribution of Y :
P (Y = 0) = 0.5 and P (Y = 1) = 0.5.
Conditional distribution of Y given X = 1:
P (Y = 1|X = 1) = 0.5 and P (Y = 0|X = 1) = 0.5.
Conditional distribution of Y given X = 0:
P (Y = 1|X = 0) = 0.5 and P (Y = 0|X = 0) = 0.5.
We see that X and Y are statistically independent.

1.1.5 Exercises

1. Below we see a graphical representation of the total number of bankruptcies in Belgium over the last 4 months. We clearly see a significant increase, indicating the start of a recession. Comment on this.

[Line chart: number of bankruptcies per month, ranging from about 1510 to 1560 over the 4 months.]

2. Let X be the total amount of money (cash) that a family keeps at home. From a
sample of size n = 100 we obtain a sample mean of 800 euro with standard error 50
euro. The sample standard deviation is given by 500 euro.

(a) Construct a 95% confidence interval for E[X].

(b) Construct a 99% confidence interval for E[X].

(c) Construct an interval that will contain approximately 95% of the outcomes of
X.

(d) Do you think that the distribution of X is normal?

(e) Would it be a good idea to construct my sample by interviewing 100 persons waiting at the railway station? Would you suggest another way of selecting a sample? Which one?

3. Let X be the return of a stock A, and suppose that it follows a distribution N(2,1).
Let Y be the return of a stock B, and suppose that it follows a distribution N(3,2).
Suppose that X and Y are independent. I buy 10 stocks of A and 10 of B. Compute
the expected return and the standard deviation of the total return. How can I
increase the expected return? Comment on this.

4. The next 4 histograms are all based on 100 observations. Comment on their forms.
[Four histograms, for samples x1, x2, x3, and x4, each based on 100 observations.]

As an alternative to histograms, kernel density estimates can be computed. The latter can be considered as a kind of smoothed histogram. Compare the kernel density estimates below with the previous histograms.

[Four kernel density estimates, for the same samples x1, x2, x3, and x4.]

5. Let Y be the hourly wage of a person and X the number of years of schooling.
Compare the marginal distribution of Y with the conditional distribution of Y given
X = 16 and with the conditional distribution of Y given X = 12.

1.2 Econometrics in Practice: Introduction to the Use of Eviews

1.2.1 Hands on with Eviews

The following steps will give you a first impression of the power of Eviews:

1. We will analyse data in the file demo.xls. Let us first have a look at this file.
The variables or series have been put into columns and represent aggregate money
demand (M 1), income (GDP ), the price level (P R) and the short term interest
rates (RS). We see that the data are quarterly. Record the number of variables and
the time span. Close the .xls file and start Eviews.

2. Create a workfile (/File/new/workfile/). Import the data with /Procs/Import/Read Text-Lotus-Excel/. Do not forget to specify the number of series or the names of the series. Have a look at the objects in the workfile.

3. Open the object “GDP” by clicking on it. Using the View menu, try out: (1)
/graph/line, (2) /descriptive statistics/histogram and stats/.

4. Note that "GDP" is a non-stationary series. Construct the series in differences by using the menu /Genr/. Use the expression "dGDP=d(GDP)". Is this series stationary?

5. Generate now the series in log-differences using "growth=dlog(GDP)". Is this series stationary? Using the View menu, try out /distribution/: (1) /quantile-quantile graph/, and (2) /kernel density graph/.

[Let Xt be a time series associated to X. Eviews computes

- the lagged series X(−1) ≡ Xt−1,

- the series in differences d(X) = X − X(−1) ≡ Xt − Xt−1,

- the series in log-differences dlog(X) = d(log(X)) ≡ log(Xt) − log(Xt−1).

Note that log(Xt) − log(Xt−1) = log(Xt/Xt−1) ≈ (Xt − Xt−1)/Xt−1.]

6. Select now in the workfile the series M1, GDP, PR, and RS. By clicking on the
right mouse button, you can open them as a group. Try /View/Graphs/Lines/ and
/View/Multiple Graph/Lines/. Compute the correlation matrix of these variables,
by /View/Correlations. Are these correlations spurious?

[Time series showing trends over time will always be highly correlated. The reason
is that they are both driven by time. The high correlation does not imply a causal
relationship, it may be spurious.]

7. Use /Quick/Estimate Equation/ to estimate the equation:

log(M1) = α + β log(GDP) + γRS + ε.

The equation specification in the dialog window is simply "log(M1) c log(GDP) RS." Have a look at the estimates of the unknown parameters of the equation.

8. Select now the menu /estimate/ to estimate the equation

log(M1) = α + β log(GDP) + γRS + δ log(PR) + ε.

Have a look at the output. Construct a confidence interval for δ. Do you think that
PR is a significant variable?

The test statistic for testing H0 : δ = 0 is given by

T = (δ̂ − 0)/SE(δ̂).

The statistic T follows approximately a standard normal distribution (under H0). The value that it takes is called the "t-stat."

The P-value is the probability that the test statistic takes values more extreme than
the observed one (under H0 ).

The following rule applies:

P-value > α ⇒ do not reject H0 at significance level α
P-value < α ⇒ reject H0 at significance level α

The default choice for the significance level is α = 0.05. This level gives the type I
error, i.e. the probability of rejecting H0 when it holds. The smaller the choice of α,
the more conservative we are towards H0 . If P-value< 0.05, then the corresponding
variable is said to be significant (for explaining Y ). If P-value< 0.01, then it is
highly significant.

It is often better to interpret the P-value on a continuous scale (e.g. P = 0.049 and P = 0.051 are almost identical). The smaller the P-value, the more evidence in the data against the null hypothesis.

Some authors prefer to report only the t-stats. A variable is significant if the t-stat
is larger than 2 in absolute value.
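The rule is easy to reproduce numerically. A sketch, assuming the normal approximation for T under H0 and assuming SciPy is available (the estimate and standard error are hypothetical):

    from scipy.stats import norm

    delta_hat, se = 0.8, 0.3                      # hypothetical estimate and standard error
    t_stat = (delta_hat - 0) / se                 # the "t-stat" for H0: delta = 0
    p_value = 2 * (1 - norm.cdf(abs(t_stat)))     # two-sided P-value under N(0, 1)
    print(t_stat, p_value)                        # reject H0 at the 5% level if p_value < 0.05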

9. There is a problem with the regression model estimated above. The error term
still contains too much “structure”. Save the residuals, which can be considered as
estimates of the error terms, by selecting /Procs/make residual series/ within the
equation window. Make then the correlogram of the residuals and comment.

10. Take the model in differences:

∆log(M1) = α′ + β∆log(GDP) + γ∆RS + δ∆log(PR) + ε′,

where ∆ is the mathematical notation for the difference operator. In Eviews, you
can specify “dlog(m1) c dlog(gdp) d(rs) dlog(pr).”

(a) Is the constant term significant?

(b) Test H0 : δ = 0.

(c) Make the correlogram of the residuals.

1.2.2 Basic Principles of Eviews

- Eviews is a windows-oriented econometric software package.

- For every new data set, a workfile needs to be created.

- Workfiles are characterized by a frequency and a range.

- A workfile contains different objects.

- Objects may be of different types like series, groups, equations, graphs, . . . .

- The available toolbars/menus of an object window depend on the type of the object.

- The same Eviews instruction can be given in several ways.

- It is possible to write programs in Eviews.

1.2.3 Descriptive Univariate Statistics

Given a univariate sample x1, . . . , xn, we can compute

- location measures: mean x̄, median, . . .

- spread/dispersion measures: standard deviation σ̂, range = maximum − minimum, . . .

- a measure of asymmetry: the skewness coefficient

Sk = (1/n) Σ_{i=1}^{n} ((xi − x̄)/σ̂)³.

Positive skewness means a long right tail.

- a measure of "heavy tails": the kurtosis coefficient

κ = (1/n) Σ_{i=1}^{n} ((xi − x̄)/σ̂)⁴.

At normal distributions κ ≈ 3. If κ > 3, the distribution is said to be peaked or heavy-tailed (leptokurtic). If κ < 3, the distribution is said to be flat or light-tailed (platykurtic) w.r.t. the normal distribution.

The distribution of the data can be pictured by a histogram or a kernel density plot.

A quantile-quantile plot (QQ-plot) can be used as a visual check for normality of the data. If the distribution is normal, the points in the QQ-plot should closely follow a linear pattern.

A formal test of normality is the Jarque-Bera test, which is based on skewness and kurtosis. If the associated P-value is smaller than 5%, then we reject the null hypothesis of normality.

Before starting an econometric analysis, it is important to screen the data for outliers. Making boxplots is helpful.
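The same quantities can be computed outside Eviews; a sketch in Python with SciPy (simulated data; note that scipy.stats.kurtosis returns excess kurtosis by default, so fisher=False is passed to match the κ ≈ 3 convention used here):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    x = rng.standard_normal(500)                  # simulated sample

    print(np.mean(x), np.median(x), np.std(x))    # location and spread measures
    print(stats.skew(x))                          # skewness Sk, close to 0 for normal data
    print(stats.kurtosis(x, fisher=False))        # kurtosis kappa, close to 3 for normal data
    stat, pval = stats.jarque_bera(x)             # Jarque-Bera normality test
    print(stat, pval)                             # pval < 0.05 would reject normality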

1.2.4 Correlation Coefficients

Let X and Y be two stochastic variables. The covariance between X and Y is defined as

Cov(X, Y ) = E[(X − E(X))(Y − E(Y ))].

The correlation between X and Y is defined as

Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)).

We have that

* −1 ≤ Corr(X, Y ) ≤ 1

* |Corr(aX + b, cY + d)| = |Corr(X, Y )|

* Corr(X, Y ) = 1 (respectively = -1) if and only if there exist a > 0 (resp. a < 0)
and b such that Y = aX + b.

* If Corr(X, Y) = 0, then we say that X and Y are uncorrelated. If (X, Y) follows a normal distribution, then uncorrelatedness implies independence.

From a random sample (X1, Y1), . . . , (Xn, Yn), we can estimate ρ = Corr(X, Y) by the correlation coefficient

ρ̂ = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / √[Σ_{i=1}^{n} (xi − x̄)² · Σ_{i=1}^{n} (yi − ȳ)²].

The correlation coefficient is used as a measure of the strength of the linear association between two variables. It tells us to what extent two variables "move together", and has nothing to say about causal relations.

Exercise: For 6 datasets, visualized by their scatterplots, we computed correlation coefficients and obtained: −0.70, 0.01, 0.74, 0.74, 0.95, 0.99. Match the computed correlation coefficients with the respective scatterplots below.

[Six scatterplots of y versus x, in three rows of two panels.]

Serial Correlation If the data are a random sample X1, X2, . . . , Xn, then Corr(Xi, Xj) = 0 for 1 ≤ i ≠ j ≤ n. But if the data form a time series, then autocorrelation or serial correlation may be present. If Xt is a stationary time series, then this autocorrelation can be quantified.

Definition: Xt is a stationary series if

1. E[Xt] = µ for all t,

2. Var(Xt) = σ² for all t,

3. Corr(Xt, Xt−k) = ρk for all t, and for k = 1, 2, . . .

We call ρk the autocorrelation at lag k. It can be estimated as

ρ̂k = Σ_{t=k+1}^{n} (xt − x̄)(xt−k − x̄) / Σ_{t=1}^{n} (xt − x̄)².

The graph of the function k → ρ̂k is called the correlogram. A correlogram also indicates critical bounds. If ρ̂k exceeds these bounds, it is significantly different from zero (H0 : ρk = 0 is rejected at the 5% level).

[Two time series plots (series 1 and series 2, 100 observations each) with their correlograms (ACF up to lag 20, with critical bounds).]

The correlograms above show that there is much more serial correlation in the first series than in the second. One says that there is more persistence in the first series.
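The estimator ρ̂k is straightforward to compute directly. A numpy sketch on a simulated persistent AR(1) series (the data are illustrative only):

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.zeros(100)
    for t in range(1, 100):                       # simulate x_t = 0.8 x_{t-1} + noise
        x[t] = 0.8 * x[t - 1] + rng.standard_normal()

    xbar = x.mean()
    denom = np.sum((x - xbar) ** 2)
    rho = [np.sum((x[k:] - xbar) * (x[:-k] - xbar)) / denom for k in range(1, 21)]
    print(np.round(rho, 2))                       # sample autocorrelations at lags 1..20
    print(2 / np.sqrt(len(x)))                    # approximate critical bound +/- 2/sqrt(n)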

1.2.5 Exercise

In the file spurious.txt you find annual data from 1971 to 1990 for the variables,

- X: average disposable income per capita (1980 prices)

- Y : average consumption per capita (1980 prices)

- Z: number of professional football players

We are interested in the correlations between these variables. Note that we have time series data here, not cross-sectional data.

1. Create a new workfile and import the data (/File/Import/Text/).

2. Open X and Y as a group (select them and use the right mouse button).

3. Make a graph of the 2 series (View/Graph/Line).

4. Make a scatterplot of Y versus X (View/Graph/Scatter/). Besides a simple scatter plot, you can also add a fitted line (scatter with regression) or a fitted curve (scatter with kernel fit).

5. Compute the covariance and the correlation between X and Y (view/correlations and view/covariances).

6. Use Quick/Estimate equation to estimate the coefficients of the regression equation Y = α + βX + ε. (Type "Y c X", where c represents the constant term α.)

7. Open now X, Y and Z as a group.

(a) Make a graph of the 3 series together.

(b) Make a scatter matrix (/View/Multiple Graphs/Scatter/).

(c) Compute the Correlation Matrix.

8. Create by using /Genr/ the 3 series in differences. (Use d(X), d(Y), d(Z)). Compute
now the correlation matrix. Comment on the results.

1.3 The Linear Regression Model

1.3.1 The Model

In a regression problem a dependent/response variable Y is explained using explanatory/explicative variables X1, . . . , Xp. If p = 1, then we have simple regression, otherwise multiple regression. The regression model is given by

Y = α + β1 X1 + . . . + βp Xp + ε,

with α the intercept parameter and β1, . . . , βp the slope parameters. The disturbance term is ε. We observe a sample {(Xi, Yi) | 1 ≤ i ≤ n} of size n, and suppose that every observation follows the model

Yi = α + β1 Xi1 + . . . + βp Xip + εi,

for 1 ≤ i ≤ n. Conditions on the error terms εi are:

H1. The explicative variables are independent of the error terms.

H2. E(εi) = 0.

H3. Cov(εi, εj) = 0 for all i ≠ j (uncorrelated error terms).

H4. Var(εi) = σ² for all i (homoscedasticity).

H5. The εi follow a normal distribution.

Conditions H1 and H2 are crucial and always needed. The condition of uncorrelatedness of the error terms (H3) is often violated for time series data. Condition H4 says that the error terms need to be homoscedastic and not heteroscedastic. Conditions H1 and H2 imply that the conditional mean function or regression function is given by

E[Y | X1, . . . , Xp] = α + β1 X1 + . . . + βp Xp.

This is the function of main interest in linear regression: it allows us to estimate conditional expectations. Note that for any 1 ≤ j ≤ p,

∂E[Y | X1, . . . , Xp]/∂Xj = βj.

The interpretation of βj is as follows:

"If Xj changes by one unit, then Y changes by βj units, on average, all other variables kept constant."

If Xj and Y are measured in logarithms:

"If Xj changes by one percent, then Y changes by βj percent, on average, all other variables kept constant."

If also H4 holds, then the conditional variance function is given by

Var[Y | X1, . . . , Xp] = σ².

1.3.2 The Ordinary Least Squares Estimator

Let α̂ and β̂1, . . . , β̂p be estimators of the regression parameters. The fitted values of the dependent variable are then

Ŷi = α̂ + β̂1 Xi1 + . . . + β̂p Xip.

The residuals are defined as

ri = Yi − Ŷi.

The ordinary least squares (OLS) estimators of α and β1, . . . , βp are such that

Σ_{i=1}^{n} ri²

is minimized. The estimated regression function is Ŷ = α̂ + β̂1 X1 + . . . + β̂p Xp. If we have a new observation with values X01, . . . , X0p for the explicative variables, then we predict the associated Y0 by Ŷ0 = α̂ + β̂1 X01 + . . . + β̂p X0p. While the error terms are not observable, the residuals are, and they can be used as estimates of the εi. The residuals will be used later on to check conditions H3 to H5.

The parameter σ² is estimated by

σ̂² = (1/(n − k)) Σ_{i=1}^{n} ri²,

where k = p + 1 is the number of estimated regression parameters, and ri = Yi − Ŷi.
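For reference, the OLS computations can be mimicked outside Eviews. A numpy sketch for simple regression (p = 1) on simulated data:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 100
    x = rng.uniform(0, 10, n)
    y = 1.0 + 0.5 * x + rng.normal(0, 1, n)        # true alpha = 1, beta = 0.5

    X = np.column_stack([np.ones(n), x])           # design matrix with intercept column
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizes the sum of squared residuals
    y_fit = X @ coef                               # fitted values
    r = y - y_fit                                  # residuals
    k = X.shape[1]                                 # number of estimated parameters, k = p + 1
    sigma2_hat = np.sum(r ** 2) / (n - k)          # estimate of sigma squared
    print(coef, sigma2_hat)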

1.3.3 Properties of OLS

Let β stand for one of the regression parameters, then

• Under H1+H2, the OLS-estimators are unbiased, that is

E[β̂] = β and E[α̂] = α.

• Expressions for the standard errors of α̂ and β̂ are known and valid under H1+H2+H3+H4.
If also H5 is true, then

α̂ ∼ N (α, SE(α̂)) and β̂ ∼ N (β, SE(β̂)).

If H5 is not true, then the OLS estimators are only asymptotically normal.

• A test statistic for testing H0 : β = 0 is given by

T = β̂/SE(β̂).

Under H1-H5, this test statistic follows a Student t-distribution with n − (p + 1) degrees of freedom. The value that T takes is called the "t-stat." It is convenient to consider the P-value, defined as P-value = P(|T| ≥ |t|), where t is the "t-stat". We can apply the rule:

P-value > δ ⇒ do not reject H0 at significance level δ
P-value < δ ⇒ reject H0 at significance level δ

The default choice for the significance level is 0.05. If we reject H0 , then we say that
X is a significant variable. If P-value< 0.01, then X is a highly significant variable.
Recall that the P-value needs to be interpreted on a continuous scale. The smaller
the P-value, the more evidence in the data against the null hypothesis.

1.3.4 Summary Statistics in Regression Analysis

R-squared The R-squared statistic is the fraction of the variance of the dependent variable explained by the independent variables:

R² = Var(Ŷ)/Var(Y) = 1 − Var(residuals)/Var(Y).

It measures the predictive power of the regression equation,

- R² = 1 if and only if Yi = Ŷi for all i

- R² = 0 if and only if Ŷi = Ȳ for all i

We also call R² the squared multiple correlation coefficient. Do not use R² to compare models with different dependent variables.

Adjusted R-squared A problem with using R² as a measure of goodness of fit is that it never decreases if you add more regressors. The adjusted R² penalizes for the addition of regressors which do not contribute to the explanatory power of the model:

Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k).

F-statistic The F-statistic tests the hypothesis that all of the slope coefficients (exclud-
ing the intercept) in a regression are zero,

H0 : β1 = . . . = βp = 0.

An accompanying P-value is given by the software. The F-test is a joint test, keeping the
joint type I error under control. Note that even if all the t-statistics are insignificant, it
is not excluded that the F-statistic is highly significant.

Durbin-Watson Statistic The Durbin-Watson (DW) statistic measures the serial correlation (of order one) in the residuals. The statistic is computed as

DW = Σ_{t=2}^{n} (rt − rt−1)² / Σ_{t=1}^{n} rt²,

with the following properties,

- 0 ≤ DW ≤ 4,

- uncorrelated error terms (H3) → DW ≈ 2.

- DW << 2 indicates positive autocorrelation.

However, there are better tests for serial correlation in the error terms.
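These three statistics are simple functions of the observations, fitted values and residuals. A self-contained numpy sketch (the series are stand-ins, only the formulas matter):

    import numpy as np

    rng = np.random.default_rng(4)
    y = rng.normal(10, 2, 50)                      # hypothetical dependent variable
    y_fit = y + rng.normal(0, 1, 50)               # stand-in fitted values
    r = y - y_fit                                  # residuals
    n, k = len(y), 2                               # k = number of estimated parameters

    r2 = 1 - r.var() / y.var()                     # R-squared
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)      # adjusted R-squared
    dw = np.sum(np.diff(r) ** 2) / np.sum(r ** 2)  # Durbin-Watson, ~2 if no autocorrelation
    print(r2, adj_r2, dw)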

1.3.5 Residual Plots

Residual plots help to check the model assumptions. The residuals ri or standardized
residuals ri /σ̂ are plotted versus their index (best for time series data) or versus Ŷi (best
for cross-sectional data). These plots should,

- show no particular structure (to check linearity assumption),

- have homogeneous dispersion (to check for homoscedasticity),

- if residuals have the tendency to remain “close” to each other, this might indicate
positive serial correlation. This is also the case when sequences of residuals with the
same sign are observed. Additionally, a correlogram of the residuals can be made.

- Values of ri/σ̂ larger than 3 may indicate outliers.

- To check for normality of error terms, use QQ-plots. If the error terms deviate
strongly from normality and the sample size is not too large, then results are in
doubt.

Exercise: Comment on the residual plots below:

[Six residual plots: residuals versus time (t = 0, . . . , 30), in three rows of two panels.]
1.3.6 Using Dummy Variables

Y needs to be a continuous variable. The explicative variables may be continuous or categorical. In the latter case, one should replace X by (K − 1) dummy variables in the model equation, where K is the number of categories.

Example: Let Y be the income of a person. We want to relate it to the person's work experience X (in years), sex (male/female) and educational level (primary/secondary/higher education). We consider the model

Y = α + β1 X + β2 M + β3 E1 + β4 E2 + ε,

where

• M = 1 if person is male and M = 0 if not.

• E1 = 1 if primary school and E1 = 0 if not.

• E2 = 1 if secondary school and E2 = 0 if not.

• We call higher education the “reference level”.

The expected income of a female with secondary school education and 10 years of work experience is therefore α + 10β1 + β4.
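A sketch of how the dummies enter a prediction (Python; the coefficient values are hypothetical, not estimates from any data):

    # Hypothetical coefficients for Y = alpha + b1*X + b2*M + b3*E1 + b4*E2 + error:
    alpha, b1, b2, b3, b4 = 15.0, 1.2, 3.0, -8.0, -4.0

    def expected_income(experience, male, primary, secondary):
        # male, primary, secondary are 0/1 dummies; higher education is the reference level
        return alpha + b1 * experience + b2 * male + b3 * primary + b4 * secondary

    # Female, secondary school, 10 years of experience: alpha + 10*b1 + b4
    print(expected_income(10, male=0, primary=0, secondary=1))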

Exercise: During 7 years we measure every season the total number of cars sold in a certain country (denoted by Yt, in units of 10000). Let Xt be the price of oil in real terms. We want to regress Yt on Xt, a time trend and "seasonal dummies." So the model is

Yt = α + βXt + γt + δ1 St1 + δ2 St2 + δ3 St3 + εt.

The data can be found below.

1. Estimate and interpret the regression parameters. How does the interpretation change if X and Y are taken in logarithms?

2. Comment on the values of R2 , Durbin-Watson-statistic, and the F-statistic.

3. Make the residual plot and comment.

Season Yt Xt
Spring 1990 16.30 39.04
Summer 1990 15.20 31.22
Autumn 1990 19.40 34.47
Winter 1990 15.60 31.44
Spring 1991 17.60 28.95
.. 15.98 29.59
.. 19.10 29.15
.. 16.53 22.99
.. 17.66 25.50
.. 16.17 28.71
.. 20.05 29.77
.. 16.12 25.86
.. 17.23 32.33
.. 16.41 27.88
.. 19.67 31.07
.. 15.35 27.80
.. 18.81 27.28
.. 18.64 28.32
.. 18.74 30.71
.. 15.24 27.28
.. 17.44 37.83
.. 17.30 30.74
.. 20.79 25.11
.. 16.03 28.52
.. 18.91 25.28
.. 16.52 30.01
.. 19.07 31.37
.. 12.83 37.65

1.3.7 Exercises

1. We have annual data for the UK economy, for the years 1953-1964, on the percentage
change in wages W and the percentage of the labor force unemployed U :
W 4.4 5.4 7.1 6.2 4.2 3.1 2.6 3.3 3.8 3.6 4.1 4.4
U 1.5 1.3 1.1 1.2 1.4 2.1 2.2 1.6 1.5 2.0 2.1 1.6

(a) Create a workfile and enter the data (use /quick/empty group (edit series)/).

(b) Make a scatterplot of W versus U. Does the relation between them look linear? What is their correlation?

(c) Estimate the regression equation W = α + βU + ε. Interpret the estimated value of β. Is U a significant variable?

(d) Have a look at the residual series. Do the assumptions H1-H5 seem to be
plausible? Make a QQ-plot of the residual series to check for normality.

2. For 25 households we have data (in “households.wf1”) on their total consumption


expenditure (X) and on their food expenditure (Y ).

(a) Estimate the regression equation Y = α + βX + ε. Predict the value of Y for X = 200 and for X = 1000.

(b) Estimate the regression equation log(Y) = α + β log(X) + ε. Predict the value of Y for X = 200 and for X = 1000.

(c) Which of the 2 models do you prefer? Make a scatterplot of Y versus X.

3. Ten experts make a prediction for the economic growth in the EU for next year, and ten other experts do the same for the US:

EU US
2.1 2.6
2.5 2.4
2.3 3.2
1.4 0.8
1.5 1.3
1.5 2.1
2.4 1.6
2.7 3.2
2.8 3.1
1.1 1.4

(a) Test whether the predictions for US and Europe are on average the same.

(b) Test for normality of the error terms, given the small sample size.

(c) How does your answer change if the 10 experts making predictions are the
same?

1.4 Model Specification

Model specification refers to the following decisions: Which model to use? Which variables
to include? Which kind of disturbance terms? How to specify a model is a difficult task.
Diagnostic tests help us to check for the validity of the model specification. Also residual
plots can serve as diagnostics for the model assumptions. If the diagnostic tests reject the
validity of the model, then it is misspecified and another model needs to be proposed.

1.4.1 Running Example

Demand for food in the USA, yearly data (1963-1992, file: “food.wmf”), with,

- Q: the demand for food in constant prices

- X: total expenditure in current prices

- P : a price index for food

- G: a general price index

Economic theory suggests Q = f (X, P, G).

1. Make line graphs and some descriptive statistics of the series Q, X, P and G.

2. Estimate the model,

log(Q) = α + β1 log(X) + β2 log(P) + β3 log(G) + ε.

3. Interpret the signs and the magnitude of the estimated regression parameters.
Which variables are significant?

4. Interpret the values of R² and adjusted R², the Durbin-Watson statistic and the F-statistic.

5. Make a graph of the actual and the fitted series log(Q). Make a residual plot (use
/View/Actual, fitted,residuals). Make a QQ-plot and a correlogram of the residuals.
Comment.

1.4.2 Coefficient Tests

The Wald test is the one most often used for testing restrictions on the coefficients. If k restrictions are tested, then the Wald test statistic follows asymptotically a chi-squared distribution with k degrees of freedom. If we test "H0 : g(parameters) = 0", the Wald test rejects the null hypothesis if "g(estimated parameters)" is too far from 0. In case all restrictions are linear, the F-statistic can be used. Use the same example as in the previous paragraph and answer the following questions.

1. Use /View/representations/ to know how Eviews labels the coefficients.

2. Test H0 : β2 = 0. Are you surprised by this outcome? (Use /View/coefficient tests/)

3. Test

(a) H0 : β2 = β3 = 0,

(b) H0 : β1 + β2 + β3 = 0,

(c) H0 : β2² = β3·β4.

4. Do you propose to drop log(P ) from the model?

1.4.3 Omitted and Redundant Variables

Models are preferably compact or parsimonious. Clearly non-significant variables should not be included in the model. But it is dangerous to delete all variables having a P-value larger than, for example, 5%. Indeed, we have the following important consequences:

* Omitting important variables from the regression equation yields biased estimators.

* Adding non-significant variables increases the variability of the estimator (larger standard errors), but the estimates remain valid (i.e. unbiased and consistent).

If variables are important for economic reasons, they should remain in the model. Tests
for redundant/omitted variables are based on a comparison of the original model and the
model with/without the redundant/omitted variables. They compare

* the values of R² for the two models (F-test),

* the values of the log-likelihood of the two models (Likelihood-Ratio test).

Test the following questions on the example of the demand for food in the USA,

1. Use Eviews to test whether log(P ) is a redundant variable. Are you surprised by
this result? Delete log(P ) from the equation and estimate the model again using
/Estimate/.

2. Generate a variable representing a time trend (use “@trend”).

3. Test whether this trend is an omitted variable.

4. Test whether trend and log(P ) together are omitted variables.

Remark: If two (or more) explicative variables are highly correlated, then we have the
problem of multicollinearity. In this case, the estimates remain valid, but we have in-
creased variability. A simple solution to avoid multicollinearity is to drop one of the two
highly correlated variables. This is even mandatory in case of perfect correlation between
the two variables.

1.4.4 Residual Tests

There exist many tests for misspecification. Most of them are based on the residuals.

* White heteroscedasticity test: with r denoting the residuals, we estimate the equation

r² = c + γ1 X1 + γ2 X1² + γ3 X2 + γ4 X2² + . . .

Under the null hypothesis of no heteroscedasticity, none of these estimated gammas should be significantly different from zero. White uses the value of nR² of the test equation, which should not be too big. It needs to be compared with the critical value of a chi-square distribution with as many degrees of freedom as there are slope parameters in the above test equation.

* Breusch-Godfrey LM test for serial correlation: with r denoting the residuals, we estimate the equation

r = c + γ1 X1 + . . . + γp Xp + γp+1 r(−1) + . . . + γp+k r(−k) + ε,

with r(−1), . . . , r(−k) lagged versions of the residuals. Under the null hypothesis of no serial correlation, none of these estimated gammas should be significantly different from zero. The test uses the value of nR² of the test equation, which should not be too big. It needs to be compared with the critical value of a chi-square distribution with k degrees of freedom.
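Both tests are also available outside Eviews; a sketch using statsmodels (an assumed library choice, with simulated homoscedastic data, so neither null should be rejected):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_white, acorr_breusch_godfrey

    rng = np.random.default_rng(5)
    x = rng.uniform(1, 10, 100)
    y = 2 + 0.5 * x + rng.normal(0, 1, 100)        # i.i.d. homoscedastic errors

    X = sm.add_constant(x)
    res = sm.OLS(y, X).fit()

    lm, lm_pval, _, _ = het_white(res.resid, X)    # White test: H0 = homoscedasticity
    bg, bg_pval, _, _ = acorr_breusch_godfrey(res, nlags=2)  # BG test: H0 = no serial corr.
    print(lm_pval, bg_pval)                        # small P-values would reject the null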

1.4.5 Consistent Estimation of Covariance

The Least Squares estimator is still consistent (meaning that it converges to the true value of the parameter as the sample size tends to infinity) under mild forms of serial correlation or heteroscedasticity. However, the theoretical expression for the covariance matrix of this estimator is derived under hypotheses H1-H5, and becomes invalid if H3 or H4 no longer holds. In this case, one could use

* the White formula for heteroscedasticity-consistent covariance estimation,

* the Newey-West formula for heteroscedasticity and autocorrelation consistent (HAC) covariance estimation.

Remarks:

* It is also possible to correct for heteroscedasticity by specifying the form of the conditional variance and using weighted least squares.

* Serial correlation can be corrected for by specifying a dynamic model for the disturbance terms. In some cases adding lagged versions of the dependent (and independent) variables can solve the problem of serial correlation.
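In statsmodels (again an assumed library choice; Eviews offers the same corrections as estimation options), the robust covariance formulas are requested when fitting:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(6)
    x = rng.uniform(1, 10, 100)
    y = 2 + 0.5 * x + rng.normal(0, 1, 100)
    X = sm.add_constant(x)

    ols = sm.OLS(y, X)
    res_white = ols.fit(cov_type='HC0')            # White heteroscedasticity-consistent SEs
    res_hac = ols.fit(cov_type='HAC', cov_kwds={'maxlags': 4})  # Newey-West (HAC) SEs
    print(res_white.bse, res_hac.bse)              # compare with the default ols.fit().bse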

Perform the following tasks with the same example as in the previous paragraphs:

1. Apply the White heteroscedasticity test with and without cross-terms. What is the difference between both tests? Do you need to reject the assumption of homoscedasticity?

2. Apply the serial correlation LM test. What is an ARCH LM test doing?

3. Look at how the standard errors of your estimates change when you use the White estimator for the covariance matrix of the estimator (use the options when estimating the model equation in Eviews). Same question for Newey-West.

2 Review: Multivariate Random Variables

2.1 Distribution and Density Function

The cumulative distribution of a p-dimensional random variable X = (X1, . . . , Xp)t is a function FX : Rp → [0, 1] defined as

FX(u) = FX(u1, . . . , up) = P(X1 ≤ u1, . . . , Xp ≤ up)

for every u = (u1, . . . , up)t ∈ Rp. If the distribution of X is "continuous" (more rigorously, absolutely continuous w.r.t. the Lebesgue measure, i.e. an extension of the classical notions of length and area to more complicated sets), then there exists a density function fX : Rp → R+ such that

FX(u1, . . . , up) = ∫_{-∞}^{u1} · · · ∫_{-∞}^{up} fX(x1, . . . , xp) dxp · · · dx1.

Note that, for f almost everywhere continuous,

∂^p F(u) / (∂u1 · · · ∂up) = f(u).

2.2 Expected Values and Covariances

For a continuous p-dimensional random variable X, and a (measurable) function h : Rp → Rk, the expected value of h(X) is defined as

E[h(X)] = ∫_{Rp} h(u) f(u) du.

Note that E[h(X)] ∈ Rk. This expected value is properly defined if ∫_{Rp} |h(u)| f(u) du is finite. The expected value of X is given by E[X] (if existing) and the covariance matrix or variance-covariance matrix (if existing) by

Cov(X) = E[(X − E(X))(X − E(X))t].

For a p-dimensional random variable X, we have the following properties:

* Cov(X) is symmetric and positive semi-definite.

* E[X]j = E[Xj ] for j = 1 . . . p.

* Cov(X)ij = E[(X − E[X])i (X − E[X])j ] := Cov(Xi , Xj )

* Cov(X)ii = E[(X − E[X])2i ] := Var(Xi )

* For any matrix A ∈ Rm×p and for every vector b ∈ Rm :

E[AX + b] = AE[X] + b and Cov(AX + b) = ACov(X)At .

The correlation matrix of X is defined as

Cor(X) = D−1/2 Cov(X)D−1/2 with D = diag(Var(X1 ), . . . , Var(Xp ))

Exercise:

1. Prove that E[trace(Y)] = trace(E[Y]) for every p × p-dimensional random variable Y.

2. Prove that trace(AB) = trace(BA) for any two matrices A and B for which the matrix multiplications make sense.

3. Prove that trace(Cor(X)) = p for every p-dimensional random variable X possessing a correlation matrix.

2.3 The Multivariate Normal Distribution

A p-dimensional random variable X follows a normal distribution with parameters µ and Σ (notation: X ∼ Np(µ, Σ)), where µ ∈ Rp and Σ is a symmetric positive definite matrix of size p, if the density function of X equals

fX(u) = exp(−(u − µ)t Σ−1 (u − µ)/2) / ((2π)^{p/2} det(Σ)^{1/2}).

In particular,

X ∼ Np(0, I) ⇔ fX(u1, . . . , up) = φ(u1) · · · φ(up),

with

φ(t) = exp(−t²/2)/√(2π).

We have that

* X ∼ Np(µ, Σ) ⇒ E[X] = µ and Cov(X) = Σ.

* X ∼ Np(µ, Σ) ⇒ AX + b ∼ Nm(Aµ + b, AΣAt) for every matrix A ∈ Rm×p with rank(A) = m (≤ p) and any vector b ∈ Rm.

* X ∼ Np(µ, Σ) ⇒ Σ−1/2(X − µ) ∼ Np(0, I), where Σ1/2 is a matrix B having the property that BBt = Σ.

* X ∼ Np(µ, Σ) ⇒ Xi ∼ N1(µi, Σii). (Note that the reverse is not true.)

* X ∼ Np(µ, Σ) and Σij = 0 ⇒ Xi and Xj are independent.

* X ∼ Np(µ, Σ) ⇒ (X − µ)t Σ−1 (X − µ) ∼ χ² with p degrees of freedom.
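These properties are easy to verify by simulation. A numpy sketch for p = 2 (the particular µ, Σ, A and b are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(7)
    mu = np.array([0.0, 1.0])
    Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
    X = rng.multivariate_normal(mu, Sigma, size=100_000)  # draws from N2(mu, Sigma)

    print(X.mean(axis=0))                # close to mu
    print(np.cov(X.T))                   # close to Sigma

    A = np.array([[1.0, 2.0]])           # a 1 x 2 matrix of rank 1
    b = np.array([0.5])
    Y = X @ A.T + b                      # AX + b should be N1(A mu + b, A Sigma A^t)
    print(Y.mean(), Y.var())             # close to A mu + b = 2.5 and A Sigma A^t = 8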

2.4 Asymptotic Normality

A p-dimensional statistic Tn is said to have an asymptotically normal distribution if there exists a θ such that

√n (Tn − θ) →d Np(0, V).

This implies that the distribution of Tn can be approximated by a Np(θ, V/n). The matrix V is called the asymptotic covariance matrix of Tn.

Using the "Delta-method" we have, for a smooth function g : Rp → Rk,

√n (g(Tn) − g(θ)) →d Nk(0, (dg(θ)/dθ)t V (dg(θ)/dθ)),

where dg(θ)/dθ ∈ Rp×k is the matrix of partial derivatives of g. Therefore, g(Tn) is approximately Nk(g(θ), (dg(θ)/dθ)t V (dg(θ)/dθ)/n).

3 Non-linear Regression

3.1 Introduction

In this section we define

- y = endogenous/dependent/response/. . . variable

- x = exogenous/explanatory/. . . variable

with y ∈ R and x ∈ Rp.

Example: y = hourly wage, x = education.

Our model is based on the following assumptions:

y = m(x) + ε,    ε ⊥ x,    E[ε|x] = 0,

with an additional condition of homoscedasticity,

E[ε²|x] = σ²(x) = σ².

The conditional mean function m(x) is then defined by

E[y|x] = m(x).

Note that E[y] = E[m(x)]. Sometimes, however, one may be interested in the whole conditional distribution [y|x]. Most of the time we will work with parametric models,

m(x) = f(x, θ),

with f a specified function and θ an unknown parameter. The linear model, with f(x, θ) = xtθ, is a special case. Here we can interpret the parameter:

θ = ∂f(x, θ)/∂x = marginal effect,

θ = ∂ log ỹ / ∂ log x̃ = elasticity of ỹ w.r.t. x̃, if x = log x̃ and y = log ỹ.

In a non-parametric setup, we only require that the function m is smooth, i.e. that it has derivatives of all orders.

3.2 Least Squares Estimation

To estimate θ of a parametric regression model, using observations (xi, yi) ∈ Rp × R following

yi = f(xi, θ) + εi, with i = 1, . . . , n,

we can use the Least Squares Estimator

θ̂ = argmin_θ Σ_{i=1}^{n} (yi − f(xi, θ))².

3.2.1 How to Compute θ̂

The first order condition yields

Σ_{i=1}^{n} (yi − f(xi, θ)) ∂f(xi, θ)/∂θ = 0.

In the case of a linear regression model ∂f(xi, θ)/∂θ = xi, which yields

θ̂ = (Σ_{i=1}^{n} xi xit)−1 (Σ_{i=1}^{n} xi yi).

Note that with

X = (x1, . . . , xn)t and Y = (y1, . . . , yn)t,

and when supposing that rank(X) = p, we can rewrite

θ̂ = (XtX)−1 XtY.

In the case of a non-linear regression model, for an initial value θ̂0, we can write

f(xi, θ) ≈ f(xi, θ̂0) + (θ − θ̂0)t ∂f(xi, θ)/∂θ |_{θ̂0} + . . . ,

which yields the first order condition

Σ_{i=1}^{n} x̃i (ỹi − θt x̃i) = 0,

with x̃i = ∂f(xi, θ)/∂θ |_{θ̂0} and ỹi = yi − f(xi, θ̂0) + θ̂0t x̃i, and therefore

θ̂1 = (Σ_{i=1}^{n} x̃i x̃it)−1 (Σ_{i=1}^{n} x̃i ỹi).

Using a Gauss-Newton algorithm, we can iterate θ̂1 ⇒ . . . ⇒ θ̂k ⇒ . . ., until convergence is reached (a small sketch follows the list below). However, we face the following problems:

- How to define convergence?

- How to choose the initial value θ̂0? Close to the value to be estimated?

- Is convergence guaranteed?

- The first order condition is only necessary, not sufficient.
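A minimal Gauss-Newton sketch in Python, for a model of the same form as the example in section 3.2.3, f(x, θ) = a + b·x^c (the data are simulated and the starting value is chosen by hand; a real implementation would also guard against divergence):

    import numpy as np

    def f(x, th):                                  # model: f(x, theta) = a + b * x**c
        a, b, c = th
        return a + b * x ** c

    def jac(x, th):                                # rows are the x_tilde_i
        a, b, c = th
        return np.column_stack([np.ones_like(x), x ** c, b * x ** c * np.log(x)])

    rng = np.random.default_rng(8)
    x = rng.uniform(1, 5, 200)
    y = f(x, (1.0, 2.0, 1.5)) + rng.normal(0, 0.1, 200)

    th = np.array([0.5, 1.0, 1.0])                 # initial value theta_0
    for it in range(50):
        r = y - f(x, th)                           # current residuals
        J = jac(x, th)
        step = np.linalg.lstsq(J, r, rcond=None)[0]  # solves the linearized LS problem
        th = th + step
        if np.linalg.norm(step) < 1e-10:           # convergence criterion on the step size
            break
    print(it, th)                                  # should end close to (1, 2, 1.5)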

3.2.2 Asymptotic Distribution

One can show, e.g. by using results from Generalized Method of Moments estimation, that

θ̂ ≈ N(θ0, σ²(X̃tX̃)−1),

with X̃ = (x̃1, . . . , x̃n)t, x̃i = ∂f(xi, θ)/∂θ |_{θ0}, and θ0 the true parameter. In practice, use

x̃i ≈ ∂f(xi, θ)/∂θ |_{θ̂} and σ̂² = (1/n) Σ_{i=1}^{n} (yi − f(xi, θ̂))².

Corollary: for the construction of a 95% confidence interval for θ0,i, we take

θ̂i ± 2 σ̂ √[(X̃tX̃)−1]_{ii}.

For the null hypothesis H0 : θ0,i = 0 against the alternative HA : θ0,i ≠ 0, do not reject H0 if

|θ̂i| / (σ̂ √[(X̃tX̃)−1]_{ii}) < 2.

3.2.3 Example (Green, 1993)

Take C = consumption and Y = income; then

C = α + βY^γ + ε.

Using the data from Table 1, the estimated coefficients, which are stable after only six iterations, are given in Table 2, and the results for both the linear and the non-linear regression are shown in Table 3. The marginal propensity to consume (MPC) is defined as

∂C/∂Y = βγY^{γ−1}.

The following null hypotheses are tested. Firstly, take H0 : γ = 1, meaning that the MPC is assumed to be constant, which corresponds to a linear model. We reject the null hypothesis if

|γ̂ − 1| / SE(γ̂) > 1.96.

Table 1: Aggregate Income (Y) and Consumption (C)
1950 1951 1952 1953 1954 1955 1956 1957 1958
Y 791.8 819.0 844.3 880.0 894.0 944.5 989.4 1012.1 1028.8
C 733.2 748.7 771.4 802.5 822.7 873.8 899.8 919.7 932.9
1959 1960 1961 1962 1963 1964 1965 1966 1967
Y 1067.2 1091.1 1123.2 1170.2 1207.3 1291.0 1365.7 1431.3 1493.2
C 979.4 1005.1 1025.2 1069.0 1108.4 1170.6 1236.4 1298.9 1337.7
1968 1969 1970 1971 1972 1973 1974 1975 1976
Y 1551.3 1599.8 1688.1 1728.4 1797.4 1916.3 1896.6 1931.7 2001.0
C 1405.9 1456.7 1492.0 1538.8 1621.9 1689.6 1674.0 1711.9 1803.9
1977 1978 1979 1980 1981 1982 1983 1984 1985
Y 2066.6 2167.4 2212.6 2214.3 2248.6 2261.5 2334.6 2468.4 2509.0
C 1883.8 1961.0 2004.4 2000.4 2024.2 2050.7 2145.9 2239.9 2312.6

Table 2: Iterations and Estimated Values

α̂ β̂ γ̂
1 11.1458 0.898534 1.0
2 209.825 -0.237125 1.15139
3 187.786 0.246078 1.14613
4 187.710 0.245692 1.15699
5 187.915 0.245968 1.15641
6 187.899 0.246004 1.15640

Table 3: Outcomes for Linear and Non-Linear Model and Estimated Asymptotic Covari-
ance Matrix
Linear Non-Linear
Estimate Std. Error Estimate Std. Error
α 11.1458 9.64 187.899 38.946
β 0.898530 0.00586 0.246004 0.07947
γ 1.00000 - 1.15640 0.03927
e′e 12.068 8420
σ 18.309 15.294
R2 0.99856 0.99899
α̂ β̂ γ̂
α̂ 1516.8 - -
β̂ -3.0533 0.006315 -
γ̂ 1.5045 -0.003120 0.001542

An asymptotic t-test, based on the standard normal distribution, was carried out:

z = |γ̂ − 1| / SE(γ̂) = (1.1564 − 1)/0.03927 = 3.983.

This is larger than the critical value of 1.96 for a 5 percent significance level, and thus the linear model is rejected in favor of the non-linear regression. Secondly, take H0 : γ = 1 and β = 1, to test whether the MPC is constant and equal to one. Since, under the given null hypothesis,

(γ̂, β̂)t ≈ N((1, 1)t, Σγβ), where Σγβ is the covariance matrix of (γ̂, β̂)t, with entries Var(γ̂), Cov(γ̂, β̂) and Var(β̂),

it follows that

(γ̂ − 1, β̂ − 1) Σγβ−1 (γ̂ − 1, β̂ − 1)t ∼ χ² with 2 degrees of freedom,

and we reject the null hypothesis if the above quantity exceeds the (1 − α)-quantile of that distribution. To test the hypothesis that the MPC is one, we require a particular value of Y. Therefore, take

H0 : βγ y0^{γ−1} = 1 for y0 = Y1985 = 2509. Using the Delta-method (see section 3.2.4), we obtain under H0:

β̂γ̂y0^{γ̂−1} − 1 ≈ N(0, SE²(β̂γ̂y0^{γ̂−1})),

with

SE²(β̂γ̂y0^{γ̂−1}) = (∂MPC/∂β, ∂MPC/∂γ) Σβγ (∂MPC/∂β, ∂MPC/∂γ)t,

where ∂MPC/∂β = γy0^{γ−1} and ∂MPC/∂γ = βy0^{γ−1}(1 + γ ln(y0)).

Following the method described above yields a standard error of 0.018423. We reject H0 if

|β̂γ̂y0^{γ̂−1} − 1| / SE > 1.96,

so for testing the hypothesis that in 1985 the MPC was equal to one, we compute the following test statistic:

z = |β̂γ̂y0^{γ̂−1} − 1| / SE = |0.9676 − 1| / 0.018423 = 1.7587,

which is not larger than the critical value of 1.96, and therefore we cannot reject the null hypothesis that the MPC in 1985 was equal to one.

3.2.4 The Delta-Method

The Delta-method is used to derive an expression for the standard error of a function of an asymptotically normal estimator β̂. Assume that Cov(β̂) is known, and that g is a real-valued function. Then

SE(g(β̂)) = √Var(g(β̂)) = √Var[g(β0) + (dg(β)/dβ |_{β0})t (β̂ − β0) + . . .],

with β0 the true value of the parameter. A Taylor expansion was used in the equation above; it works well if β̂ is close to the true value β0, so that (β̂ − β0) is small and (β̂ − β0)², (β̂ − β0)³, . . . are an order of magnitude smaller. As a consequence, with A = (dg(β)/dβ |_{β̂})t,

SE(g(β̂)) = √Var(Aβ̂) = √(A Cov(β̂) At) ≈ √[(dg(β)/dβ |_{β̂})t Cov(β̂) (dg(β)/dβ |_{β̂})].

A special case, for g : R → R and β̂ ∈ R, is

SE(g(β̂)) = √[Var(β̂) (g′(β̂))²] = SE(β̂) |g′(β̂)|.
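A quick numerical check of this special case, for g(β) = β² (Python; the true value and standard error are arbitrary):

    import numpy as np

    beta0, se_beta = 2.0, 0.1                      # hypothetical true value and SE
    se_delta = abs(2 * beta0) * se_beta            # SE(g(beta_hat)) = |g'(beta)| * SE(beta_hat)

    rng = np.random.default_rng(9)
    beta_hat = rng.normal(beta0, se_beta, 1_000_000)   # simulated draws of the estimator
    print(se_delta, (beta_hat ** 2).std())         # the two numbers should nearly agree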

3.3 Homework: Non-Linear Regression

Take the data in the Eviews file "nlhome.wf1". Let yt be the share of total US steel production produced by a certain type of technology, measured from 1970 up to 1997. Consider the following model:

yt = α / (1 + e^{−β−γt}) + εt,

where α, β and γ > 0 are the parameters to estimate. Here, t simply stands for time. The above model is an example of a diffusion model; it describes how a certain new type of technology is adopted by its users. This model is also used to describe which percentage of a population of potential customers has already purchased a new type of product, e.g. a flat screen television.

1. The only explicative variable here is the time. The general formulation of the non-
linear regression model we saw in class was yt = f (xt , θ) + ²t , for t = 1, . . . , n. What
are xt , θ and f for this example?

2. Formulate explicitly, as a system of 3 equations, the first order conditions for the
LS estimator.

3. Plot the series. The value of α gives the diffusion level for t tending to infinity. It is
called the saturation level. Do you think that we are already close to the saturation
level?

4. Estimate the model parameters, by specifying in Eviews the equation

Y=c(1)/(1+exp(-c(2)-c(3)*@trend))

5. Comment on the number of iterations needed to compute the LS-estimator. As


starting value for the iterative algorithm, Eviews takes the values in the constant
vector c, displayed in the workfile window. [When starting up Eviews, the constant
vector is initialized at zero. Be careful, after having estimated an equation, the
constant vector c contains the estimated parameters.]. Set the value of the constant
vector c back to zero, by clicking on it and using the edit option afterwards. Change
the value of the convergence criterion to 10−10 and see how this affects the number
of iterations.

6. Forecast the values of yt up to 2020. For this, you need to change the range of
the workfile (via ‘/Procs/Change workfile range’ in the workfile window) as well as
the sample range (via ‘/Sample’ in the workfile window) to 1970-2020. You can
make the forecast within the equation window via the Forecast menu by setting the
forecast range as 1970-2020. When are we getting close to the saturation level?

7. Another important quantity is the point of inflection tI of the diffusion curve. It
gives the point in time after which the growth rate of diffusion declines. In other
words, after tI the number of new users/customers is decreasing. This inflection
point is given by tI = −β/γ (the root of the second derivative of the diffusion
curve). What is your estimate of tI? What do we learn from this? Test whether
the hypothesis that the point of inflection occurred in 1990 holds. A sketch of the
whole fitting exercise in Python is given below.
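For readers who want to try the estimation outside Eviews, the following Python sketch
reproduces the main steps under stated assumptions: the actual series sits in the Eviews
file “nlhome.wf1” and is not reproduced here, so simulated placeholder data stand in for
it, and all numerical values are illustrative only.

```python
import numpy as np
from scipy.optimize import curve_fit

def diffusion(t, alpha, beta, gamma):
    """Logistic diffusion curve: alpha / (1 + exp(-beta - gamma*t))."""
    return alpha / (1.0 + np.exp(-beta - gamma * t))

# Placeholder data: 28 yearly observations, t = 0 corresponding to 1970
# (this mimics Eviews' @trend variable).
rng = np.random.default_rng(0)
t = np.arange(28.0)
y = diffusion(t, 0.5, -0.8, 0.12) + 0.01 * rng.standard_normal(28)

theta_hat, cov_theta = curve_fit(diffusion, t, y, p0=[1.0, 0.0, 0.1])
alpha_hat, beta_hat, gamma_hat = theta_hat
print("saturation level alpha:", alpha_hat)
print("inflection point t_I = -beta/gamma:", -beta_hat / gamma_hat)
# A test of H0: t_I = 20 (inflection in 1990) would combine cov_theta
# with the gradient of -beta/gamma, exactly as in section 3.2.4.
```

curve_fit runs an iterative least squares routine (Levenberg-Marquardt by default),
close in spirit to the iterations Eviews performs for the specified equation.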

3.4 Homework: Some Theoretical Exercises

1. Let X be a bivariate random vector with mean µ and covariance matrix Σ, where

µ = (1, 0)ᵗ   and   Σ = [ 3  1 ; 1  1 ].

Define
Y = bᵗX + ε,

where ε follows a standard normal distribution and is independent of X. Finally,
let b = (−1, 2)ᵗ.

(a) Compute E(Y )

(b) Compute Var(Y )

(c) Compute the correlation between Y and ε.

2. Consider the regression model

log(yi) = α + βxi + γzi + εi,

for 1 ≤ i ≤ n and with εi the error terms. Suppose that β̂ = 1.5 and SE(β̂) = 0.75.
Suppose that γ̂ = 4 and SE(γ̂) = 0.75. Moreover, Cov(β̂, γ̂) = 0.2.

(a) Is the variable xi significant?

(b) Is β̂ significantly different from 5?

(c) Using the delta-method, compute SE(β̂ 2 ).

(d) Test whether H0 : β = γ holds.

(e) Compute SE(β̂γ̂)

(f) Complete the next sentence:
“If x increases by 1 unit, then we expect that y increases by . . . ”

4 Non-parametric regression
In non-parametric regression or “scatterplot smoothing” we try to draw a smooth line
through a scatterplot. Important here is that we take only one explanatory variable
x ∈ IR; the approach works less well in higher dimensions.

4.1 Ingredients

4.1.1 1st Ingredient: Polynomial Fit

[Figure: a window of bandwidth h around x0, covering the interval from x0 − h to
x0 + h on the x-axis.]

Fix x0. Then,

yi = α + β1(xi − x0) + β2(xi − x0)² + . . . + βp(xi − x0)^p + εi,

where zi1 = 1, zi2 = (xi − x0), zi3 = (xi − x0)², . . . , z_{i,p+1} = (xi − x0)^p, and the
regression function m(xi) = α + β1(xi − x0) + . . . + βp(xi − x0)^p can be written as
m(xi) = ziᵗγ, with

γ = (α, β1, . . . , βp)ᵗ   and   zi = (1, zi2, . . . , z_{i,p+1})ᵗ.
This model can be considered as a linear regression model. Note that m(x0) = α. We
want to estimate α; the first component of γ̂ gives α̂ = m̂(x0):

γ̂ = ( Σ_i zi ziᵗ )^(−1) ( Σ_i zi yi ).

Over the whole range of the data a polynomial fit might not be good, but if we look only
at one window, with bandwidth h, it can be adequate.

4.1.2 2nd ingredient: Weighted OLS

γ̂_WLS = argmin_γ Σ_{i=1}^n wi (yi − ziᵗγ)²,

with wi the weight attached to observation i. We give a weight of zero to the observations
outside the window, since the local polynomial model is not meant to describe them.
Within the window, the polynomial approximation should hold. This local fit within the
window is characteristic of nonparametric regression. By giving different weights to
different observations, we control how much each observation influences the fit. So,

γ̂_WLS = argmin_γ Σ_{i=1}^n wi (yi − ziᵗγ)²
       = argmin_γ Σ_{i=1}^n ( √wi yi − √wi ziᵗγ )²
       = ( Σ_i z̃i z̃iᵗ )^(−1) ( Σ_i z̃i ỹi ),   with z̃i = √wi zi and ỹi = √wi yi,
       = ( Σ_i wi zi ziᵗ )^(−1) ( Σ_i wi zi yi )
       = (Zᵗ W Z)^(−1) (Zᵗ W Y),

with

W = diag(w1, w2, . . . , wn),

the diagonal matrix with the weights on its diagonal.

4.2 Local fit around x0

Estimate m(x0) using weighted polynomial regression, i.e.

argmin_{α,β1,...,βp} Σ_i wi [ yi − α − β1(xi − x0) − . . . − βp(xi − x0)^p ]²,

with wi the weights, which become larger the closer xi lies to the center of the window.
The smaller the bandwidth h, the more local the fit; the larger the bandwidth h, the more
global the fit. The weights wi are defined as

wi = K( (xi − x0) / h ),

with K the kernel function, which is defined over the interval [−1, 1] with its maximum
at 0. For xi = x0 − h, we have K((xi − x0)/h) = K(−1); for xi = x0, we have
K((xi − x0)/h) = K(0); and for xi = x0 + h, we have K((xi − x0)/h) = K(1). So for
non-parametric regression we need to define:
we need to define:

• x0 , depending on what you want to test, and the width of the window h

• the degree of polynomial fit p

• the Kernel function K

4.3 Non-parametric regression

Compute, over a grid of x values, the respective non-parametric estimates m̂(x). You
can take more grid points than observations. A compact sketch of the procedure follows.
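To make the procedure concrete, here is a Python sketch of local polynomial regression
(not from the original notes; the quartic kernel and the simulated data are illustrative
choices, not prescriptions):

```python
import numpy as np

def kernel(u):
    """Quartic (biweight) kernel on [-1, 1], maximal at 0."""
    return np.where(np.abs(u) <= 1, (15 / 16) * (1 - u**2) ** 2, 0.0)

def local_poly_fit(x, y, x0, h, p=1):
    """Weighted LS fit of a degree-p polynomial in (x - x0);
    the intercept estimates m(x0)."""
    Z = np.vander(x - x0, N=p + 1, increasing=True)   # columns 1, (x-x0), ...
    w = kernel((x - x0) / h)
    ZtW = Z.T * w                                     # Z' W
    gamma = np.linalg.solve(ZtW @ Z, ZtW @ y)
    return gamma[0]                                   # alpha_hat = m_hat(x0)

# Illustrative data, and a grid finer than the sample
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + 0.3 * rng.standard_normal(200)
grid = np.linspace(0.5, 9.5, 300)
m_hat = [local_poly_fit(x, y, x0, h=1.0, p=1) for x0 in grid]
```

Varying h in this sketch makes the bias/smoothness trade-off discussed in the comments
below directly visible.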

Comments:

- Why do we only work with one explanatory variable? ...

- The choice of the bandwidth h, for p = 1 (as shown in the figures on the next page).

* If h → ∞: very smooth (if h increases, the smoothness increases), but large


bias (it doesn’t capture the true regression well).

* If h → 0: irregular, not smooth at all, but a very small bias.

Conclusion: the choice of h has to be somewhere in between.

5 Econometrics of Panel Data

5.1 Overview of Topics Discussed

* Basics and Examples

* The Generalized Least Squares Estimator

* Fixed Effects model

* Random Effects model

5.2 Basics and Examples

We observe variables for N units, called the cross-sections, for T
consecutive periods:

(Yit, Xit)

* i = 1, . . . , N , with N the cross-sectional dimension.

* t = 1, . . . , T , with T the temporal dimension.

→ panel of size N × T .

• Yit is the income of family i during year t, for 1 ≤ i ≤ 1000,
and observed in years 2000, 2001, 2002, so T = 3.

• Yit is the unemployment rate for EU-country i, (1 ≤ i ≤ 15),


observed monthly from 1998:01 up to 2001:12, so T = 48.

Note that:

T large, N small → multiple time series

T small, N large → survey data on individuals/firms for a


small number of waves.

Example 1: South American countries

For 8 South-American countries we want to model the Real GDP per capita in 1985 prices
(=Rgdpl) as a function of the following explicative variables.

• Population in 1000’s (Pop)

• Real Investment share of GDP, in % (I)

• Real Government share of GDP, in % (G)

• Exchange Rate with U.S. dollar (XR)

• Measure of Openness of the Economy (Open)

You find the data in the file ”penn.wmf”, already in Eviews format. We are in particular

interested in the effect of openness on economic growth.

1. Create a“pool” object in Eviews (‘/Object/New object’). Give it a name and define
the cross-section identifiers. These identifiers are those parts of the names of the
series identifying the cross-section.

2. Open the XR-variables as a group and make a plot of them. Compute them in
log-difference, using the PoolGenr menu of the pool object and “logdifXR?=
dlog(XR?)”. The “?” will be substituted by every cross-section identifier. Plot the
transformed variables.

3. Compute the medians of the variable I? for the different countries (use View/
descriptive statistics within the Pool object). Compute now the average value of I?
for every year.

4. Estimate the regression model for Brazil, using ‘/Quick/Estimate equation’ and
specifying in Eviews the equation
dlog(rgdpl_bra) c dlog(pop_bra) i_bra g_bra dlog(xr_bra) open_bra

5. Now we want to pool the data of all countries, to increase the sample size. Use,
within the pooled object, ‘/Estimate’, and specify: dependent variable=
dlog(rgdpl?); common coefficients=c dlog(pop?) i? g? dlog(xr?) open?. This is a
pooled regression model.

6. Pooling the data ignores the fact that the data originate from different countries.
Dummy variables for the different countries need to be added. This can be done by
specifying the constant term as a “cross section specific coefficient.” We obtain a
fixed effect panel data model. Discuss the regression output.

7. The fixed effect panel data model assumes that the effect of openness is the same
for all countries. How could you relax this assumption?

8. Test whether all country effects are equal (to know how Eviews labels the coefficients,
use View/Representation), using a Wald test. The country effects are called the fixed
effects, and if they are significantly different, then there is unobserved heterogeneity.

5.3 The Generalized Least Squares estimator

Standard linear regression model:

Yi = Xi′β + εi   (i = 1, . . . , n)

with

* Var(εi) = σ² constant ⇒ homoscedastic errors

* Cov(εi, εj) = 0 for i ≠ j ⇒ uncorrelated errors

For the standard model, the Ordinary Least Squares (OLS) esti-
mator

* is consistent, meaning that β̂ → β for n tending to infinity;

* has the smallest variance among all estimators (for normal
errors) and the smallest variance among all linear estimators.

One has that

β̂_OLS = ( Σ_{i=1}^n Xi Xi′ )^(−1) ( Σ_{i=1}^n Xi Yi ).

What if the errors are not homoscedastic and uncorrelated?

E.g. for panel data:

* Cross-sectional heteroscedasticity

* Correlation among cross sections

* Serial correlation within and across cross-sections

* etc.

The Ordinary Least Squares (OLS) estimator is still consistent,


but not optimal anymore.

General linear regression model:

Yi = Xi′β + εi   (i = 1, . . . , n)

with

* Var(εi) = σi² ⇒ heteroscedastic errors

* Cov(εi, εj) = σij for i ≠ j ⇒ correlated errors.

One can still use OLS (not even a bad idea), if one uses

* White standard errors (if heteroscedasticity).

* Newey-West standard errors (if correlated errors + heteroscedas-


ticity).

The Generalized Least Squares (GLS) estimator will be
consistent and optimal and is given by

β̂_GLS = ( Σ_{i=1}^n Σ_{j=1}^n wij Xi Xj′ )^(−1) ( Σ_{i=1}^n Σ_{j=1}^n wij Xi Yj ),

where the weights depend on the values of σij.

More precisely: let Σ be the n × n matrix with elements σij; then

wij = (Σ^(−1))ij.

Unfortunately, the values in Σ are unknown.

The Feasible Generalized Least Squares (FGLS) estimator proceeds
in 2 steps:

1. Compute β̂_OLS and the residuals

ri^OLS = Yi − Xi′β̂_OLS.

2. Use the above residuals to estimate the σij. [This will require
some additional assumptions on the structure of Σ.] Then compute
the GLS estimator with the estimated weights ŵij.

The above scheme can be iterated → fully iterated GLS estimator.

Theoretical Example
Our sample of size n = 20 consists of two groups of equal size (e.g.
men and women). There is no correlation among the observations,
but we think that the variances of the error terms for men and
women might be of different size.

[The error terms contain the omitted and unobserved variables.


We might indeed think that their size is different for women than
for men, e.g. when regressing salary on individual characteristics.]

* σi² = σii = σM² for i = 1, . . . , 10

* σi² = σii = σF² for i = 11, . . . , 20

* σij = 0 for i ≠ j.

Computation of the (Feasible) GLS estimator:

1. Compute the OLS estimator and the residuals ri^OLS.

2. Estimate

σ̂M² = (1/10) Σ_{i=1}^{10} (ri^OLS)²   and   σ̂F² = (1/10) Σ_{i=11}^{20} (ri^OLS)².

Due to the simple structure of the matrix Σ, we have

ŵi = 1/σ̂M²   (i = 1, . . . , 10)   and   ŵi = 1/σ̂F²   (i = 11, . . . , 20)

⇒ β̂_GLS = ( Σ_{i=1}^n ŵi Xi Xi′ )^(−1) ( Σ_{i=1}^n ŵi Xi Yi ).
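A sketch of this two-step computation in Python (simulated data, illustrative numbers
only; note that statsmodels' WLS takes weights proportional to 1/σ̂i², which is exactly
the ŵi above):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 20                                              # 10 "men", 10 "women"
X = sm.add_constant(rng.normal(size=n))             # intercept + one regressor
sigma = np.r_[np.full(10, 1.0), np.full(10, 3.0)]   # group error std devs
y = X @ np.array([2.0, 0.5]) + sigma * rng.standard_normal(n)

# Step 1: OLS and its residuals
r = sm.OLS(y, X).fit().resid

# Step 2: groupwise variance estimates, then GLS = WLS with w_i = 1/sigma_i^2
s2_M = np.mean(r[:10] ** 2)
s2_F = np.mean(r[10:] ** 2)
w = np.r_[np.full(10, 1 / s2_M), np.full(10, 1 / s2_F)]
print(sm.WLS(y, X, weights=w).fit().params)
```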

Application to panel data regression
Let εit be the error term of a panel data regression model, with
1 ≤ i ≤ n, and 1 ≤ t ≤ T.
Three different specifications are common:

1. Var(εit) = σ² and all covariances between error terms are
zero. OLS can be applied (no weighting).

2. Var(εit) = σi² and all covariances between error terms are
zero. We have cross-sectional heteroscedasticity. GLS can be
applied (cross-section weights).

3. Var(εit) = σi², Cov(εit, εjt) = σij, all other covariances zero.
We allow now for contemporaneous correlation between cross-
sections. GLS can be applied (SUR weights).

Example South American (continued)

1. Have a look at the residuals (View/residuals/Graphs within
the pool object). Compute the covariance and the correlation
matrix of the residuals. (i) Is there cross-sectional heteroscedas-
ticity? (ii) Is there contemporaneous correlation?

2. Estimate now the model with the appropriate GLS estimator.


Are the results depending a lot on the weighting scheme?

3. Is there still serial correlation present in the residuals, i.e.


(cross)-correlation at leads and lags? Hence, is the model cap-
turing the dynamics in the data?

5.4 The Fixed Effects Model

Fixed Effects Model:

Yit = Xit′β + αi + εit

with t = 1, . . . , T time periods and i = 1, . . . , N cross-sectional
units.

* The αi contain the omitted variables, constant over time, for


every unit i.

* The αi are called the fixed effects, and induce unobserved


heterogeneity in the model.

* The Xit are the observed part of the heterogeneity. The εit
contain the remaining omitted variables.

Testing for unobserved heterogeneity: test for redundant fixed
effects
H0 : α1 = . . . = αN := α

In case H0 holds, there is no unobserved heterogeneity, and the


model reduces to the pooled regression model:

Yit = Xit′β + α + εit

Ignoring unobserved heterogeneity may lead to severe bias of the


estimated β, see figure:

[Figure: y plotted against x for three cross-sections. Each cross-section lies on its own
parallel line (its own intercept), while the single pooled regression line cuts across them
with a badly biased slope.]

5.4.1 Least Squares Dummy Variable (LSDV) Estimation

Rewrite the model as

Yit = α1 Di1 + . . . + αN DiN + Xit′β + εit,

with Dij = 1 if i = j and zero if i ≠ j.

Estimate model by OLS or GLS (weighting).

If necessary, use White/Newey West type of standard errors (also


if GLS is used, see later).

5.4.2 Within Groups Estimator

Compute averages of Xit and Yit within each “group” of cross-
sectional units → X̄i. and Ȳi.

Yit = Xit′β + αi + εit
Ȳi. = X̄i.′β + αi + ε̄i.

⇒ (Yit − Ȳi.) = (Xit − X̄i.)′β + (εit − ε̄i.)

Regress the centered Yit on the centered Xit by OLS.
By centering, the fixed effects are eliminated! One can show that
the within group estimator is identical to LSDV; the sketch below
checks this numerically.
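The following Python sketch (simulated data, plain numpy, all numbers illustrative)
compares the within-groups slope with the LSDV slope:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, beta = 4, 30, 1.5
alpha = rng.normal(size=N)                       # fixed effects
x = rng.normal(size=(N, T))
y = beta * x + alpha[:, None] + 0.5 * rng.standard_normal((N, T))

# Within-groups: demean within each cross-section, then plain OLS
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
beta_within = (xd * yd).sum() / (xd ** 2).sum()

# LSDV: regress y on x plus N cross-section dummies
D = np.kron(np.eye(N), np.ones((T, 1)))          # dummy matrix, NT x N
Z = np.column_stack([x.reshape(-1), D])
coef = np.linalg.lstsq(Z, y.reshape(-1), rcond=None)[0]
print(beta_within, coef[0])                      # identical up to rounding
```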
5.4.3 Comments

1. If a variable Xit is constant in time for all cross-sections, the


FE model cannot be estimated.

Why?

2. The fixed effects model can be rewritten with a common in-


tercept included as

Yit = Xit′β + α + µi + εit,

and
µ1 + µ2 + . . . + µN = 0.

Obviously, we have αi = α + µi, and α is the average of the


fixed effects.

3. One can add time effects (or period effects) in the model:

Yit = Xit′β + αi + δt + εit,

The δt contain the omitted variables, constant over cross-


sections, at every time point t.

The time effects capture the business cycle.

4. If we think that the cross-sectional units are an i.i.d. sample
(typical for micro-applications), but serial correlation or period
heteroscedasticity is present (within each unit), then OLS can
be made more precise/efficient:

(a) Var(εit) = σt² and all covariances between error terms are zero.
We have period heteroscedasticity. GLS can be applied (period
weights).

(b) Var(εit) = σt², Cov(εit, εis) = σts, all other covariances zero. We
allow for serial correlation. GLS can be applied (SUR weights).

Example: Grunfeld data


We consider investment data for 10 American firms from 1935-
1954, and consider the model

Iit = βi1 Fit + βi2 Kit + αi + εit

for 1 ≤ i ≤ N = 10, and 1 ≤ t ≤ T = 20. The variables are

* Gross investment for the firm (I)

* Value of the firm (F)

* Real Value of the Capital stock (plant and equipment) (K)

The data are in the excel file “grunfeld2.xls.”

1. Have a look at the data in the Excel File. Write up the number of
observations, the number of variables, and the upper left cell of the
data matrix. Close the Excel file, create an unstructured Workfile and
read in the data (Proc/Import/Read Text Lotus Excel).

2. To apply a panel structure, double click on the “Range:” line at the


top of the workfile window, or select Proc/Structure/Resize Current
Page. Select Dated Panel, and enter the appropriate variables as
“Date Series” and as “Cross Section ID series.”

3. Open the investment series. Explore the “Descriptive Statistics and


tests menu.”

4. Use View/Graph to (i) Make a line plot of the time series for every
cross section (ii) Make boxplots of the distribution of investment over
the different cross sections and over time.

5. Use Quick/Estimate Equation to estimate the fixed effects model.


Specify the equation “inv c cap value” and use Panel Options to in-
dicate that you use fixed effects.

6. Interpret your outcome. Would it be useful to add period effects?
Test whether this is necessary with View/Fixed Random
Effects testing.

7. Select an appropriate weighting scheme within Panel Options. Inter-

pret your outcome.

5.5 The Random Effects Model

Model
Yit = c + Xit′β + εit

where the error term is decomposed as

εit = αi + vit.

* αi is a random effect ∼ N(0, σα²).
It is the permanent component of the error term.

* vit is a noise term ∼ N(0, σv²).
It is the idiosyncratic component of the error term.

(The vit are uncorrelated among cross-sections, are serially uncor-


related at all leads and lags, within and across cross sections. The
random effects are uncorrelated among cross-sections.)

• At the price of one extra parameter σα², the random effects model
allows for correlation within cross-section units:
For every i and t ≠ s:

Cov(εit, εis) = Cov(αi + vit, αi + vis) = σα²

• The following variance decomposition holds:

Var(εit) = Var(αi + vit) = σα² + σv².

⇒ Within groups/cross sections correlation:

ρ = Corr(εit, εis) = σα² / (σα² + σv²).

The larger the value of ρ, the more unobserved heterogeneity.

• One estimates β by Generalized Least Squares, and obtains the


RE-estimator. Different methods exist to make GLS feasible.

• Testing for correlated random effects:


The random effect αi needs to be uncorrelated with the X-variables.
This is a strong assumption. If not, there is an endogeneity prob-
lem, and the RE-estimator is inconsistent.

H0 : Corr(αi, Xit) = 0

The Hausman test compares two estimators: the FE (always con-


sistent) and the RE estimator (consistent under H0).

One rejects H0 if the difference between the two estimators is large.

Using fixed or random effects?

* In econometrics, the fixed effects model seems to be the most
appropriate (the hypothesis H0 above is not needed).

* If N is large, and T is small, and the cross-sectional units are a


random sample from a population, then random effects model
becomes attractive: it is a parsimonious model, that captures
within group-correlation. (For N large, FE requires estimation
of many parameters.)

* Random effects models are popular for modeling grouped data:

– Sample of 1000 children coming from 30 different schools

– Sample of 1000 persons from 20 different villages

– etc.

Robust Standard Errors: For RE no weighted versions are avail-
able. Using robust standard errors (or coefficient covariance) might
be appropriate. This only affects the SE, not the estimators.

1. White cross section: robust to Var(εit) = σi² and
Cov(εit, εjt) = σij.

[robust to cross-section heteroscedasticity and contemporaneous cor-
relation among cross sections; appropriate if N << T.]

2. White period: robust to Var(εit) = σt² and
Cov(εit, εis) = σts.

[robust to serial correlation within cross-section and changing vari-
ances over time; appropriate if the cross-sections are a random sample
and T << N.]

3. White diagonal: robust to Var(εit) = σit²

[robust to all forms of heteroscedasticity, but not robust to any type
of correlation over time or across cross-sections.]

Can also be used for FE.

Exercise Consider the Grunfeld data in “grunfeld2.wf1.” The
model was:
Iit = βi1 Fit + βi2 Kit + αi + εit

1. Estimate the model as a random effects model.

2. What is the “within-group” correlation?

3. Perform the Hausman test. (View/Fixed random effects test-


ing/Correlated random effects)

4. Compute different types of robust SE. How is this affecting the


results?

5.6 Homework: The General Linear Model

We are interested in knowing whether persons having a loyalty card spend more in a
supermarket or not. Take the data in the Eviews file “glsex.wf1”. We have data for 2
supermarkets, and for each of them we have for 20 clients the amount spent (AMOUNT),
the size of the household the person belongs to (HHS), and a binary variable (CARD)
indicating whether the person owns a loyalty card or not. Denote yij the amount spent
by customer i in supermarket j, and xij the personal characteristics of the client (in this
case, only consisting of HHS). The proposed model is:

yij = β′xij + αj + δ CARD + εij,

where αj , for j = 1, 2 are fixed effects for each supermarket. Our main interest is to know
whether δ is significant or not.

1. Estimate the parameters of the above model by OLS. Interpret briefly the param-
eters estimates. (Hint: since the β and δ parameters are supposed to be the same
for the 2 supermarkets, it will be necessary to pool the data. You will also need
to create yourself the appropriate dummy variables STORE1 and STORE2 to take
the fixed effects into account.)

2. We are afraid that there is groupwise heteroscedasticity in the error terms, i.e.
Var(²ij ) = σj2 for j = 1, 2.

(a) Estimate the variances σ12 and σ22 using the OLS-residuals.

(b) Estimate now the parameters by GLS. Write down an expression for Σ, the co-
variance matrix of the error terms, and show that GLS boils down to Weighted
Least Squares (WLS) here. Create the series of weights to be used, and carry
out the WLS estimation (in Eviews, take estimation method LS with option
Weighted LS).

(c) What is the advantage of WLS over OLS?

3. We are also afraid that there might be interaction between the variable CARD and
the supermarket. In particular, the effect of the loyalty card might differ among

different supermarkets. The model becomes now

yij = β′xij + αj + δj CARD + εij.

(a) Estimate the above model by OLS. Do you think there might be interaction?
(Hint: creating the variables STORE1*CARD and STORE2*CARD might be
useful.)

(b) Test whether the interaction is significant or not.

4. Economists would say that there is a serious endogeneity problem here. There
probably exists a feed-back relation from AMOUNT to CARD. Could you explain
why this might be the case? Explain in words why it might indeed be that the error
terms are correlated with the variable CARD.

6 Binary Choice Models

6.1 Overview of topics discussed

• Binary Dependent Variables

• Probit and Logit Regression

• Maximum Likelihood estimation

• Estimation Binary Models in Eviews

• Measures of Goodness of Fit

• Other Limited Dependent Variable Models

• Exercise

6.2 Binary Dependent Variables

The variable of interest Y is binary. The two possible outcomes


are labeled as 0 and 1. We want to model Y as a function of
explanatory variables X = (X1, . . . , Xp).

Example: Y =employed (1) or unemployed (0); X=educational


level, age, marital status, ...
Example: Y =expansion (1) or recession (0); X=unemployment
level, inflation, ...
Can we still use linear regression?
Then
E[Y|X] = β0 + β1X1 + . . . + βpXp

and the OLS fitted values are given by

Ŷ = β̂0 + β̂1X1 + . . . + β̂pXp.

! Problem: the left hand side of the above equations takes values
between 0 and 1, while the right hand side may take any value on
the real line.
Note that

E[Y|X] = 0 · P(Y = 0|X) + 1 · P(Y = 1|X) = P(Y = 1|X)

The conditional expected values are conditional probabilities.


data cloud

0.8 S−shaped fit

0.6
y

Linear Fit
0.4

0.2

−0.2
0 1 2 3 4 5 6 7 8 9 10
x

6.3 Probit and Logit Regression

Binary regression model:

P(Y = 1|X) = F(β0 + β1X1 + . . . + βpXp)

with

• F(u) = 1 / (1 + exp(−u)) ⇒ Logit

• F(u) = Φ(u), the standard normal cumulative distribution
function ⇒ Probit

• 0 < F(u) < 1 and F increasing

• etc.

[Figure: Probit and Logit fits to the same binary data cloud. The difference is small;
the Probit function is steeper.]

Interpretation of parameters

dP(Y = 1|X) / dX1 = β1 f(β0 + β1X1 + . . . + βpXp)

with f = F′ always positive.

• Marginal effects are not constant, different for each value of


X.

• Sign of β1 ≡ Sign of marginal effect

• Marginal effects can be summarized by evaluating at the av-


erage value X = X̄, or by averaging the marginal effects over
the sample values of X.

For the Logit model: interpretation in terms of the Odds Ratio (OR)

log OR = log( P(Y = 1|X) / P(Y = 0|X) ) = β0 + β1X1 + . . . + βpXp
Prediction
For an observation xi = (xi1, . . . , xip) we predict the probability
of success as

P̂(Y = 1|X = xi) = F(β̂0 + β̂1xi1 + . . . + β̂p xip).

Set ŷi = 1 if P̂(Y = 1|X = xi) > 0.5 and zero otherwise.
(Other cut-off values than 0.5 = 50% are sometimes taken.)

6.4 Maximum Likelihood Estimation (MLE)

General principle: let L(β) be the likelihood of β, or equivalently,
the joint density of the observations y1, . . . , yn, depending on an
unknown parameter β (assuming independent observations):

L(β) = Π_{i=1}^n f(yi, β)

Then the maximum likelihood estimator β̂ is the β maximizing
L(β):

β̂ = argmax_β log L(β) = argmax_β Σ_{i=1}^n log f(yi, β)

Denote L_max = L(β̂).

MLE for Bernoulli Variables

Let yi be the outcome of a 0/1 (failure/success) experiment, with p
the probability of success. Then f(1, p) = p and f(0, p) = 1 − p,
hence
f(yi, p) = p^(yi) (1 − p)^(1−yi)

The MLE p̂ maximizes

Σ_{i=1}^n { yi log(p) + (1 − yi) log(1 − p) }.

It is not difficult to check that p̂ = (1/n) Σ_{i=1}^n yi, the fraction of
successes in the sample.

MLE for Probit Model

We will condition on the explanatory variables; hence keep them
fixed.

f(yi, pi) = pi^(yi) (1 − pi)^(1−yi)   with pi = Φ(β0 + β1Xi1 + . . . + βpXip)

The MLE β̂ = (β̂0, β̂1, . . . , β̂p) maximizes

Σ_{i=1}^n { yi log Φ(β0 + β1Xi1 + . . . + βpXip) + (1 − yi) log(1 − Φ(β0 + β1Xi1 + . . . + βpXip)) }.

The MLE needs to be computed using a numerical algorithm on
the computer (similar for the Logit model); a sketch is given at
the end of this subsection.

If the model is correctly specified, then

1. MLE is consistent and asymptotically normal.

2. MLE is asymptotically the most precise estimator, hence effi-


cient (smallest variance)

3. Inference (testing, confidence intervals) can be done.

If the model is misspecified, then the MLE may lose the above
properties.
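To illustrate what such a numerical algorithm does, the sketch below hands the Probit
log-likelihood above to a generic optimizer (simulated data; in practice a canned routine,
e.g. statsmodels' Probit, performs the same maximization with its own Newton
iterations):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # constant + one regressor
beta_true = np.array([-0.5, 1.0])
y = (rng.uniform(size=n) < norm.cdf(X @ beta_true)).astype(float)  # Probit DGP

def neg_loglik(beta):
    """Minus the Probit log-likelihood of this section."""
    p = norm.cdf(X @ beta)
    p = np.clip(p, 1e-12, 1 - 1e-12)       # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

fit = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
print(fit.x)          # MLE of (beta0, beta1)
print(fit.hess_inv)   # rough approximation of the covariance of the MLE
```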

6.5 Estimating Binary Models in Eviews

Example: Deny = application for mortgage denied (1) or accepted
(0), for a sample of 2380 applicants, Boston.

Explanatory variables:

* black: dummy race variable, 1 if applicant is black, 0 otherwise

* pi_rat: ratio of monthly loan payments to monthly income

* married: 1 if married, 0 otherwise

* ltv_med: 1 if loan to value ratio is medium (between 80% and
95%).

* ltv_high: 1 if loan to value ratio is high (above 95%).

(loan to value ratio below 80% is the reference category)

(1) We first regress “deny” on a constant, “black” and “pi_rat.” In
Eviews, we specify within the equation specification, Estimation
Settings: Method: BINARY - Binary Choice, and select Logit.

Both explanatory variables are highly significant. They have a
positive effect on the probability of deny, as expected. They are also
jointly highly significant (LR stat =152, P<0.001). The pseudo R-
squared is pretty low (R2 = 0.08).

Below some descriptives (Categorical regressor stats):

Predictive accuracy (Expectation-prediction table):

88% is correctly classified, with a sensitivity of only 4.2% and a


specificity of 99.7%. The gain is only 0.25 percentage points w.r.t.
a majority forecast (i.e. ‘all applications are accepted’).

(2) Repeat the analysis, now with all predictor variables.

6.6 Measures of Fit

Pseudo R-squared (also called McFadden R-squared)

Compare the value of the likelihood of the full model with an empty
model:
M(full): P(Y = 1|X) = F(β0 + β1X1 + . . . + βpXp)
M(empty): P(Y = 1|X) = F(β0)

Pseudo R-squared = 1 − log L_max(Full) / log L_max(Empty)
Likelihood ratio test
The Likelihood Ratio (LR) statistic for H0 : β1 = . . . = βp = 0 is

LR = 2{ log L_max(Full) − log L_max(Empty) }

We reject H0 for large values of LR. The LR statistic can be used
to compare any pair of nested models. Suppose that M1 is a
submodel of M2, and we want to test H0 : M1 = M2. Then, under
H0:

LR = 2{ log L_max(M2) − log L_max(M1) } ∼ χ²_k,

where k is the number of restrictions (i.e. the difference in the number
of parameters between M2 and M1).

In practice, we work with the P-value. For example, if k = 4 and
LR = 7.8
[Figure: density of the chi-squared distribution with 4 degrees of freedom. The area to
the right of LR = 7.8 is the P-value ≈ 0.099.]
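In software the P-value is just the upper-tail probability of the χ²_k distribution, e.g. in
Python:

```python
from scipy.stats import chi2

# P-value of LR = 7.8 against a chi-squared with k = 4 degrees of freedom
p_value = chi2.sf(7.8, df=4)
print(round(p_value, 3))   # about 0.099
```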

Percentage correctly predicted

This is a measure of predictive accuracy, and defined as

(1/n) Σ_{i=1}^n I(yi = ŷi)

Its complement is an estimate of the error rate of the prediction
rule.

[This estimate is overly optimistic, since based on the estimation


sample. It is better to compute it using an out-of-sample predic-
tion.]

6.7 Other Limited Dependent Variable Models

Censored regression models: Mixture of a discrete (at 0) and a


continuous variable (Tobit models).

Examples: Car expenditures, Income of Females, ... The value


zero will often be observed.

Truncated regression models: Data above or below a certain thresh-


old are unobserved or censored. These data are not available. We
have a selected sample.

Count data: Poisson type models (see chapter 7)


Examples: number of strikes in a firm, number of car acci-
dents, number of children

Multiple choice data: Multinomial logit/probit.


Examples: mode of transport

Ordered response data: Ordered probit/logit


Examples: educational level, credit ratings (B/A/AA/...)

6.8 Exercise

We will analyse the data in the file “grade.wf1”. We have a sample


of students and we want to study the effect of the introduction
of a new teaching method, called PSI. The dependent variable is
GRADE, indicating whether students’ grade improved or not after
the introduction of the new method. The explanatory variables
are

• PSI: a binary variable indicating whether the student was ex-


posed to the new teaching method or not.

• TUCE: the score on a pretest that indicates entering knowl-


edge of the material to be taught.

We will now run a LOGIT-regression of GRADE on a constant,


PSI and TUCE.

1. Why do we add TUCE to the regression model, if we are only inter-
ested in the effect of PSI?

2. Interpret the estimated regression coefficients.

3. Take a student with TUCE=20.

(a) Estimate the probability that he will increase his grade if he fol-
lows the PSI-method.

(b) What is this probability to increase his grade if he will not follow
this PSI-method?

(c) Will this student improve his grade, if PSI=1?

(d) Compute the log odds-ratio (for improving the grade or not) for
this student once for PSI=1 and once for PSI=0. Compute the
difference between these two log-odds ratios. Compare with the
regression coefficient of PSI.

4. Compute the percentage of correctly classified observations and com-


ment (you can use View/Expectation-Prediction table).

5. The output shows the value “LR statistic”. How is this value computed?

6. Run now a PROBIT regression. Is there much difference between the

estimates? And for the percentage of correctly classified observations?

References:

- Greene, W.H., Econometric Analysis, 5th edition (2003), Prentice Hall.

- Stock, J.H., Watson, M.W., Introduction to Econometrics, 2nd edition (2007), Pearson.

7 The Poisson Regression Model
The Poisson regression model aims at modeling a counting variable Y, counting the num-
ber of times that a certain event occurs during a given time period. We observe a sample
Y1, . . . , Yn. Here, Yi can stand for the number of car accidents that person i has during
the last 5 years; the number of children of family i; the number of strikes in company i
over the last 3 years; the number of patents filed by firm i during the last year (as a
measure of innovation); .... The Poisson regression model aims to explain this counting
variable Yi using explicative variables xi, for 1 ≤ i ≤ n. This p-dimensional variable xi
contains characteristics for the ith observation.

7.1 The Poisson Distribution

By definition, Y follows a Poisson distribution with parameter λ if and only if

P(Y = k) = exp(−λ) λ^k / k!,   (1)

for k = 0, 1, 2, . . . . We recall that for a Poisson variable:

E[Y] = λ and Var[Y] = λ.   (2)

The Poisson distribution is a discrete distribution. The shape of its distribution is shown
in Figure 1, for several values of λ. In Figure 1, the distribution is visualized by plotting
P(Y = k) versus k. For low values of λ, the distribution is highly skewed. For large
values of λ, the distribution of Y looks more normal. In the examples given above, Yi
counts rather rare events, so that the value of λ will be rather small. For example, we
have high probabilities of having no or one car accident, but the probabilities of having
several car accidents decay exponentially fast. The Poisson distribution is the simplest
distribution for modeling counting data, but it is not the only one.

7.2 The Poisson Regression Model

Like in a linear regression model, we will model the conditional mean function using a
linear combination β t xi of the explicative variables:

E[Yi |xi ] = exp(β t xi ). (3)

Figure 1: The Poisson distribution for different values of λ. [Four panels, for λ = 0.5,
λ = 1, λ = 3 and λ = 10, each plotting P(Y = k) against k for k = 0, . . . , 20.]

The use of the exponential function in (3) assures that the right hand side in the above
equation is always positive, as is the expected value of the counting variable Yi in the left
hand side of the above equation. The choice for this exponential “link” function is mainly
for reasons of simplicity. In principle, other “link” functions returning only positive values
could be used, but then we can’t refer to it as a Poisson regression model.

Moreover, to be able to use the Maximum Likelihood framework, we will specify a distri-
bution for Yi , given the explicative variables xi . We ask that every Yi , conditional on xi ,
follows a Poisson distribution with parameter λi . Equations (2) and (3) give

E[Yi |xi ] = λi = exp(β t xi ).

The aim is then to estimate β, the unknown parameter in the model. Note that estimation
of β induces an estimate of the whole conditional distribution of Yi given xi. This will
allow us to estimate quantities like P(Yi = 0|xi), P(Yi > 5|xi), .... So we will be able to
answer questions like “What is the probability that somebody will have not a single car
accident during a 5 year period, given the person's characteristics xi?”, or “What is the
probability that a family, given its characteristics xi, has more than 5 children?”, ...

Interpretation of the parameters:

Information on β allows us to analyze the influence of an explicative variable on the


expected value of Yi . Suppose for example that we have xi = (xi1 , xi2 , 1)t . Then the
Poisson regression model gives

E[Yi |xi ] = exp(β1 xi1 + β2 xi2 + β3 ).

The marginal effect of the first explicative variable on the expected value of Yi, keeping
the other variables constant, is given by

∂E[Yi|xi] / ∂xi1 = β1 exp(β1 xi1 + β2 xi2 + β3).
We see that β1 has the same sign as this marginal effect, but the numerical value of the
effect depends on the value of xi. We could summarize the marginal effects by replacing,
in the above equation, xi1 and xi2 by average values of the explicative variables over the
whole sample. It is also possible to interpret β1 as a semi-elasticity:

∂ log E[Yi|xi] / ∂xi1 = β1.

7.3 The Maximum Likelihood Estimator

We observe data {(xi, yi) | 1 ≤ i ≤ n}. The number yi is a realization of the random
variable Yi. The total log-likelihood is, using independence, given by

LogL(y1, . . . , yn | β, x1, . . . , xn) = Σ_{i=1}^n log P(Yi = yi | β, xi),

with, according to (1),

P(Yi = yi | β, xi) = exp(−λi) λi^(yi) / yi!   (4)

and λi = exp(βᵗxi). Write now LogL(β) as shorthand notation for the total log-likelihood.
Then it follows that

LogL(β) = Σ_{i=1}^n { −exp(βᵗxi) + yi (βᵗxi) − log(yi!) }.   (5)

The maximum likelihood (ML) estimator is then of course defined as

β̂_ML = argmax_β LogL(β).

It is instructive to compute the first order condition that the ML-estimator needs to
fulfill. Differentiating (5) yields

Σ_{i=1}^n (yi − ŷi) xi = 0,

with ŷi = exp(β̂_MLᵗ xi) the fitted value of yi. The predicted/fitted value has as usual been
taken as the estimated value of E[Yi|xi]. This first order condition tells us that the vector
of residuals is orthogonal to the vectors of explicative variables.

The advantage of the Maximum Likelihood framework is that a formula for cov(β̂_ML) is
readily available:

cov(β̂_ML) = ( Σ_{i=1}^n xi xiᵗ ŷi )^(−1)

Also, hypothesis tests can now be carried out by Wald tests, Lagrange Multiplier tests or
Likelihood Ratio tests.
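The following sketch illustrates the estimator on simulated data (statsmodels' Poisson
model maximizes exactly the log-likelihood (5); the data-generating values are
arbitrary):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 1000
X = sm.add_constant(rng.normal(size=n))    # x_i = (1, x_i1)
beta_true = np.array([0.3, 0.5])
y = rng.poisson(np.exp(X @ beta_true))     # Y_i | x_i ~ Poisson(exp(beta' x_i))

res = sm.Poisson(y, X).fit(disp=0)         # numerical maximization of (5)
print(res.params)                          # ML estimate of beta
print(res.cov_params())                    # estimated covariance of the MLE

# First order condition: residuals orthogonal to the regressors
yhat = np.exp(X @ res.params)
print(X.T @ (y - yhat))                    # numerically ~ 0
```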

7.4 Overdispersion and the Negative Binomial Model

If we deal with a Poisson regression model, then we have

E[Yi |xi ] = Var[Yi |xi ],

implying that the conditional mean function equals the conditional variance function.
This is very restrictive. If E[Yi |xi ] < Var[Yi |xi ], respectively E[Yi |xi ] > Var[Yi |xi ], we
speak of overdispersion and underdispersion respectively. The Poisson model does not
allow for over- or underdispersion. A richer model is obtained by using the negative
binomial distribution instead of the Poisson distribution. Instead of (4), we use
P(Yi = yi | β, xi) = [ Γ(θ + yi) / (Γ(yi + 1) Γ(θ)) ] · ( λi/(λi + θ) )^(yi) · ( 1 − λi/(λi + θ) )^θ.

The symbol Γ refers to the Gamma function, which is discussed in the appendix to this
section. The negative binomial distribution above has conditional mean λi and conditional
variance λi(1 + η²λi), with η² := 1/θ. Note that the parameter η² is not allowed to vary
over the observations. As before, the conditional mean function is modeled as

E[Yi|xi] = λi = exp(βᵗxi).

The conditional variance function is then given by

Var[Yi|xi] = exp(βᵗxi) (1 + η² exp(βᵗxi)).

Using maximum likelihood, we can then estimate the regression parameter β and the extra
parameter η as well. The parameter η measures the degree of over- or under- dispersion.
The limit case η = 0 corresponds to the Poisson model.
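The pmf above corresponds to scipy's nbinom distribution under the mapping n = θ and
p = θ/(λi + θ). This mapping is an observation about scipy's parametrization, not part
of the original notes; the sketch below verifies the conditional mean and variance
formulas for arbitrary illustrative values of λ and θ:

```python
from scipy.stats import nbinom

lam, theta = 2.0, 4.0                  # arbitrary illustrative values
n, p = theta, theta / (lam + theta)    # scipy's (n, p) parametrization

mean, var = nbinom.stats(n, p, moments="mv")
eta2 = 1.0 / theta
print(mean)                            # equals lam
print(var)                             # equals lam * (1 + eta2 * lam)
print(lam * (1 + eta2 * lam))
```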

Appendix: The Gamma function

The Gamma function is defined as

Γ(x) = ∫₀^∞ s^(x−1) exp(−s) ds

for every x > 0. Its most important properties are

1. Γ(k + 1) = k! for every k = 0, 1, 2, 3, . . .

2. Γ(x + 1) = xΓ(x) for every x > 0.

3. Γ(0.5) = √π

The Gamma function can be seen as an extension of the factorial function k → k! =
k(k − 1)(k − 2) . . . to all real positive numbers. The Gamma function increases faster
to infinity than any polynomial function or even the exponential function.

7.5 Homework

We are interested in the number of accidents per service month for a sample of ships. The
data can be found in the file “ships.wmf”. The endogenous variable is called ACC. The
explicative variables are:

• TYPE: there are 5 types of ships, labeled as A-B-C-D-E or 1-2-3-4-5. TYPE is a


categorical variable, so 5 dummy variables can be created: TA, TB, TC, TD, TE.

• CONSTRUCTION YEAR: the ships are constructed in one of four periods, leading
to the dummy variables T6064, T6569, T7074, and T7579.

• SERVICE: a measure for the amount of service that the ship has already carried
out.

Questions:

1. Make a histogram of the variable ACC. Comment on its form. Is this the histogram
for the conditional or unconditional distribution of ACC?

2. Estimate the Poisson regression model, including all explicative variables and a
constant term. (Use estimation method: COUNT- integer counting data.)

3. Comment on the coefficient for the variable SERVICE. Is it significant?

4. Perform a Wald test to test for the joint significance of the construction year dummy
variables.

5. Consider a ship of category A, constructed in the period 65-69, with SERVICE=1000.
Predict the number of accidents per service month. Also estimate (a) the probability
that no accident will occur for this ship, and (b) the probability that at most one
accident will occur.

6. The computer output mentions: “Convergence achieved after 9 iterations”. What


does this mean?

7. What do we learn from the value of “Probability(LR stat)”? What is the corre-
sponding null hypothesis?

8. Estimate now a Negative Binomial Model. EViews reports the log(η 2 ) as the mixture
parameter in the estimation output. (a) Compare the estimates of β given by the
two models. (b) Compare the pseudo R2 values of the two models.

9. Estimate now the Poisson model with only a constant term, so without explicative
variables (empty model). Derive mathematically a formula for this estimate of the
constant term (in the empty model), using the first order condition of the ML-
estimator.

8 Solutions to Exercises
In this section the solutions to some exercises in these course notes and to the exercises
of the homework can be found.

8.1 Exercises from Course Notes

8.1.1 Exercise 2, page 13

Let X be the total amount of money (cash) that a family keeps at home. From a sample
of size n = 100 we obtain a sample mean of 800 Euro with a standard error of 50 Euro. The
sample standard deviation is given by 500 Euro.

(i) Construct a 95% confidence interval for E[X].

Solution

Using the central limit theorem, and given the relatively large number of observa-
tions, we can assume that the sample mean µ̂ is normally distributed. Hence, the
95%-CI for the sample mean is

µ̂ ± 1.96 · SE(µ̂) = [702; 898].

(ii) Construct a 99% confidence interval for E[X].

Solution

We make the same assumption on the sample mean µ̂ as in (i). Hence, the 99%-CI
for the sample mean is

µ̂ ± 2.58 · SE(µ̂) = [671; 929].

(iii) Construct an interval that will contain approximately 95% of the outcomes of X.

Solution

To construct this interval, we assume that the distribution of X is normal. Hence,
the 95%-CI for X, the total amount of (cash) money a family keeps at home, is

µ̂ ± 1.96 · σ̂_X = [−180; 1780].

Since it is impossible to have a negative amount of cash money at home, we can cut
this interval at 0, and arrive at [0; 1780].

(iv) Do you think that the distribution of X is normal?

Solution

No. First of all, it is impossible to have less than 0 Euro in cash at home. Second,
the distribution is strongly skewed to the right, with a long tail towards the right,
since most people keep only moderate amounts of money (< 800 Euro) at
home. On the other hand, there is a very small segment of the population which
keeps very large amounts of money at home, tens of thousands of Euro and more.

Important: Please keep in mind that the central limit theorem only makes a
statement about the distribution of a mean of a large number of random variables
(for example a sample mean), and does not say anything about the distribution of
the variable X itself.

(v) Would it be a good idea to construct my sample by interviewing 100 persons waiting
at the railway station? Do you have a better idea?

Solution

Restricting your sample to people waiting at the railway station isn’t a good idea,
because this introduces selection bias. Since not all types of people take the train
with the same frequency, you will not end up with a representative sample. For
example, around 17.00h, on a Friday, in the main station of Leuven, your sample will
consist mainly of students, and although students are a very interesting population
to study, they don’t represent the entire population well.

A better idea is, for example, to use stratified sampling: randomly picking out 20
towns/cities, and selecting 5 families from each town or city. Alternatively, you
could go into the national register, and select 100 households at random.

8.1.2 Exercise page 32

During 7 years we measure every season the total number of cars sold in a certain country
(in logarithms Yt ). Let Xt be the (log of the) price of oil in real terms. We want to regress
Yt on Xt + time trend + “seasonal dummies.” So the model is

Yt = α + βXt + γ t + δ1 St1 + δ2 St2 + δ3 St3 + εt

(a) Estimate and interpret the regression parameters.

Solution

Estimating the proposed model yields the following values for the estimates of the
parameters.

     Coefficient   SE          t-stat      p-value
C    22.80577      1.232618    18.50190    0.0000
Xt   -0.171243     0.037971    -4.509837   0.0002
t    0.015881      0.017537    0.905557    0.3750
S1   -1.358439     0.400261    -3.393885   0.0026
S2   1.695257      0.398694    4.252026    0.0003
S3   -2.727948     0.407349    -6.696834   0.0000

Here, the C corresponds to the constant or intercept term, and t is the trend variable
starting at 0, and which increases by 1 each season. S1 , S2 , and S3 correspond to the
effects of respectively summer, autumn, and winter, relative to spring, the reference
category, on the (log of the) total number of cars sold Yt .

Analysing the table, we find that all variables, except the trend variable, are highly
significant (p < 0.01). We see that a higher oil price lowers the number of cars
sold during the quarter. More precisely, if the oil price increases by 1%, the total
number of cars sold will decrease by 0.171%, assuming all other variables remain
constant (ceteris paribus).

In the same spirit, the number of cars sold increases (non-significantly) by 1.588%
each season. Note that interpreting this variable has little merit, since it is not
significant, and in essence, a negligible effect.

Finally, relative to spring, more cars are sold in autumn, and less during the summer
and the winter. For a precise interpretation, you could be tempted to use the rule
that if x increases by 1, y increases on average and ceteris paribus by β · 100% in a
model of the form
log(Y ) = α + βX + ε.

However, this leads to inaccurate and sometimes nonsensical conclusions if β is large


(|β| > 0.3). For example, δ1 , the coefficient associated with S1 , is –1.358. This would
mean that the number of cars sold decreases by 135.8% in summer, with respect to
spring! Clearly, this is impossible since that would imply a negative number of cars
being sold. In this case, a better rule would be the following:

If x increases by 1, then y increases on average and ceteris paribus by (eβ − 1) · 100%


if β > 0, and y decreases by (1 − eβ ) · 100% if β < 0.

Note that for small values of β, |β| < 0.3, the approximations eβ − 1 ≈ β (for β > 0)
and 1 − eβ ≈ β (for β < 0) still hold fairly well.

Using the rule above, we interpret the coefficients corresponding to the seasonal
dummies as: on average and ceteris paribus, the number of cars sold decreases in
summer by 1 − e^(−1.358) = 74.28%, in winter by 1 − e^(−2.728) = 93.47%, and increases
in autumn by e^(1.695) − 1 = 444.66%, with respect to spring.
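The percentages quoted above follow directly from the estimated coefficients, as a quick
computation confirms:

```python
import numpy as np

d1, d2, d3 = -1.358, 1.695, -2.728     # seasonal coefficients from the table
print(1 - np.exp(d1))   # summer: 0.7428 -> cars sold drop by 74.28%
print(np.exp(d2) - 1)   # autumn: 4.4466 -> increase of 444.66%
print(1 - np.exp(d3))   # winter: 0.9347 -> drop of 93.47%
```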

(b) Comment on the values of the R2 , DW, and the F-statistic.

Solution

The R2 -value of 0.8647 shows us that 86.47% of the total variance in the response
variable Yt is explained by the explanatory variables, which is usually a very high
percentage. As such, we can conclude that we have a model with good predictive
power.

The value of the Durbin-Watson statistic, 1.773, shows that there is a very slight
positive autocorrelation of lag one (or serial autocorrelation) between the residuals
of the estimated model. However, as this value is still close to 2, it is reasonable to
assume that this serial autocorrelation is insignificant. A more conclusive
statistical test would have to be carried out to verify this, however.

The F-statistic of 28.13 and its associated p-value (< 5·10⁻⁷) show that the variables
in the proposed model are jointly significant, i.e. that the null hypothesis

H0 : β = γ = δ1 = δ2 = δ3 = 0

can be strongly rejected. Note however that the F-statistic does not test for the
significance of the intercept α.

(c) Make the residual plot and comment.

Solution
[Figure: left panel, time series plot of the residuals (RESID) over 1990-1996; right
panel, theoretical quantile-quantile plot of the residuals against the normal quantiles.]

There doesn’t seem to be much structure within the residuals and they look like
noise. Furthermore, the spread of the residuals does not appear to be changing,
although there aren’t many entries to base this observation on. We conclude that
there doesn’t seem to be much risk for autocorrelation and heteroscedasticity among
the residuals.

The right plot is a QQ plot, showing the residuals plotted against the quantiles of a
standard normal distribution, or vice versa. Looking at this plot we see that these
points lie approximately on a line, indicating that the residuals are approximately
normally distributed (a Jarque-Bera test could confirm this).

8.2 Homework General Linear Model, page 79

We are interested in knowing whether persons having a loyalty card spend more in a
supermarket or not. Take the data in the Eviews file “glsex.wf1”. We have data for 2

supermarkets, and for each of them we have for 20 clients the amount spent (AMOUNT),
the size of the household the person belongs to (HHS), and a binary variable (CARD)
indicating whether the person owns a loyalty card or not. Denote yij the amount spent
by customer i in supermarket j, and xij the personal characteristics of the client (in this
case, only consisting of HHS). The proposed model is:

yij = β′xij + αj + δ CARD + εij,

where αj , for j = 1, 2 are fixed effects for each supermarket. Our main interest is to know
whether δ is significant or not.

1. Estimate the parameters of the above model by OLS. Interpret briefly the param-
eters estimates. (Hint: since the β and δ parameters are supposed to be the same
for the 2 supermarkets, it will be necessary to pool the data. You will also need
to create yourself the appropriate dummy variables STORE1 and STORE2 to take
the fixed effects into account.)

Solution

Estimating the model with OLS yields the following estimates:

Coefficient Std. Error t-Statistic Prob.


HHS (β) 31.01160 1.969869 15.74298 0.0000
CARD (δ) 9.433054 4.905733 1.922863 0.0624
STORE1 (α1 ) 11.23020 8.110628 1.384628 0.1747
STORE2 (α2 ) -2.138706 7.939190 -0.269386 0.7892

As can be seen, the only variable that has a truly significant effect on the amount
spent is the size of the customer’s household, where each additional member yields
an average increase of 31.01 currency units on the amount spent (given that all
other variables remain fixed). However, the possession of a loyalty card seems to
have an effect as well, which, although not significant, should not be dismissed out of
hand (p = 0.0624). It appears that customers with a loyalty card tend to spend, on
average, 9.43 currency units more than customers without a loyalty card (assuming
the other variables are the same).

2. We are afraid that there is groupwise heteroscedasticity in the error terms, i.e.
Var(²ij ) = σj2 for j = 1, 2.

(a) Estimate the variances σ12 and σ22 using the OLS-residuals.

(b) Estimate now the parameters by GLS. Write down an expression for Σ, the co-
variance matrix of the error terms, and show that GLS boils down to Weighted
Least Squares (WLS) here. Create the series of weights to be used, and carry
out the WLS estimation (in Eviews, take estimation method LS with option
Weighted LS).

(c) What is the advantage of WLS over OLS?

Solution

(a) From the OLS-residuals it is found that

σ̂₁² = 93.052 and σ̂₂² = 327.002.

(b) From these estimates of the groupwise residual variances, an expression for Σ̂
is found,

Σ̂ = [ σ̂₁² I₂₀   0 ; 0   σ̂₂² I₂₀ ],

where I_n is the n × n identity matrix. From this, the GLS-estimator can be
computed:

θ̂_GLS = (Xᵗ Σ̂⁻¹ X)⁻¹ (Xᵗ Σ̂⁻¹ Y)

where X is the 40 × 4 matrix whose rows are

(HHS_{i,j}, CARD_{i,j}, STORE1_{i,j}, STORE2_{i,j}),

first for customers i = 1, . . . , 20 of supermarket j = 1 and then for customers
i = 1, . . . , 20 of supermarket j = 2,
and Y = (y_{1,1}, . . . , y_{20,1}, y_{1,2}, . . . , y_{20,2})ᵗ, θ = (β, δ, α₁, α₂)ᵗ. Working out the
expression for the estimator, it is found that, due to the block-diagonal form of Σ̂⁻¹,

θ̂_GLS = (Xᵗ Σ̂⁻¹ X)⁻¹ (Xᵗ Σ̂⁻¹ Y)
      = ( (1/σ̂₁²) Σ_{i=1}^{20} X_{i,1,·}ᵗ X_{i,1,·} + (1/σ̂₂²) Σ_{i=1}^{20} X_{i,2,·}ᵗ X_{i,2,·} )⁻¹
        × ( (1/σ̂₁²) Σ_{i=1}^{20} X_{i,1,·}ᵗ y_{i,1} + (1/σ̂₂²) Σ_{i=1}^{20} X_{i,2,·}ᵗ y_{i,2} )
      = ( Σ_{j=1,2} Σ_{i=1}^{20} w_{i,j} X_{i,j,·}ᵗ X_{i,j,·} )⁻¹ ( Σ_{j=1,2} Σ_{i=1}^{20} w_{i,j} X_{i,j,·}ᵗ y_{i,j} ),

where X_{i,j,·} denotes the row of X for customer i in supermarket j, and w_{i,1} = 1/σ̂₁²
and w_{i,2} = 1/σ̂₂² for i = 1, . . . , 20. As such, the GLS problem
has been reduced to a WLS problem. Using Eviews to estimate the parameters
yields the following:

Coefficient Std. Error t-Statistic Prob.


HHS (β) 31.08443 1.533161 20.27473 0.0000
CARD (δ) 17.62158 3.730573 4.723558 0.0000
STORE1 (α1 ) 7.305034 5.943001 1.229183 0.2270
STORE2 (α2 ) -6.056593 6.676884 -0.907099 0.3704

The size of the household is still a very significant variable, but, as can be seen,
the possession of a loyalty card has also become a very significant influencing
factor on the amount of money spent by a customer, with a customer having
such a card spending, on average, 17.62 currency units more than a customer
without loyalty card (assuming the other variables are the same).

(c) OLS assigned the same weight to the observations of supermarket 1 as to the
observations of supermarket 2, whereas the observations of supermarket 2 are
less useful, due to the higher variance, than those of supermarket 1. This
caused the effect of owning a loyalty card to be partially masked. Hence, WLS
is more efficient.

3. We are also afraid that there might be interaction between the variable CARD and
the supermarket. In particular, the effect of the loyality card might differ among
different supermarkets. The model becomes now

yij = β′xij + αj + δj CARD + εij.

(a) Estimate the above model by OLS. Do you think there might be interaction?
(Hint: creating the variables STORE1*CARD and STORE2*CARD might be
useful.)

(b) Test whether the interaction is significant or not.

Solution

(a) Estimating the model by OLS yields the following output:

Coefficient Std. Error t-Statistic Prob.


HHS 29.64042 1.736519 17.06887 0.0000
STORE1*CARD 23.89327 5.767522 4.142727 0.0002
STORE2*CARD -6.827699 6.110978 -1.117284 0.2715
STORE1 9.247987 7.004234 1.320342 0.1953
STORE2 9.566397 7.538886 1.268940 0.2128

The household size is still a highly significant variable, and owning a loyalty
card still has a significant effect on the amount purchased, but only for super-
market 1! For supermarket 2, there is only a non-significant, negative effect.
From this, the conclusion can be drawn that there is an interaction between
the supermarket and owning a loyalty card for the amount of money expended.

(b) Testing whether the effect is significant or not amounts to performing a Wald
test for:
H0 : δ1 = δ2

The test returns a Chi-square value of 13.55827, and a p-value of 0.000231.


As such, there is a strongly significant interaction between loyalty card and
supermarket for the effect on amount spent.

4. Economists would say that there is a serious endogeneity problem here. There
probably exists a feed-back relation from AMOUNT to CARD. Could you explain
why this might be the case? Explain in words why it might indeed be that the error
terms are correlated with the variable CARD.

Solution

Most likely, the loyalty card will mostly be promoted to customers who spend much
but don’t have a card yet. Since for those big spenders, the benefits of the card are
most clear, they will be more inclined to get one for themselves.

Vice versa, for people already owning a card, the benefits of it are higher if they
spend more, so they might be inclined to naturally spend more than a person without
a card, leading to a correlation between owning a loyalty card (CARD) and the error
term in the model.

Remark: because the number of observations was the same for both groups, this
problem could also have been solved like a panel data problem (for panel data
models, see chapter 5). However, this is NOT true in general, as different groups
will contain a different number of observations.

8.3 Homework Non-Linear Regression, page 51

1. The only explicative variable here is the time. The general formulation of the non-linear regression model we saw in class was $y_t = f(x_t, \theta) + \epsilon_t$, for $t = 1, \ldots, n$. What are $x_t$, $\theta$ and $f$ for this example?

Solution

For the model under consideration, it is found that
$$x_t = t, \qquad \theta = (\alpha, \beta, \gamma)^t, \qquad f(x_t, \theta) = \frac{\alpha}{1 + e^{-\beta - \gamma t}}.$$

2. Formulate explicitly, as a system of 3 equations, the first order conditions for the
LS estimator.

Solution

In general, the first order condition for the model $y_t = f(x_t, \theta) + \epsilon_t$ is
$$\frac{d}{d\theta} \sum_{t=1}^{n} \left( y_t - f(x_t, \theta) \right)^2 \Big|_{\theta = \hat\theta} = 0,$$
or equivalently
$$\sum_{t=1}^{n} \left( y_t - f(x_t, \theta) \right) \frac{d}{d\theta} f(x_t, \theta) \Big|_{\theta = \hat\theta} = 0.$$

For the example under consideration, this yields the following system of equations:
$$\begin{cases} \sum_{t=1}^{n} (y_t - f(x_t, \theta))\, \frac{\partial}{\partial \alpha} f(x_t, \theta) \big|_{\theta = \hat\theta} = 0 \\ \sum_{t=1}^{n} (y_t - f(x_t, \theta))\, \frac{\partial}{\partial \beta} f(x_t, \theta) \big|_{\theta = \hat\theta} = 0 \\ \sum_{t=1}^{n} (y_t - f(x_t, \theta))\, \frac{\partial}{\partial \gamma} f(x_t, \theta) \big|_{\theta = \hat\theta} = 0 \end{cases}$$
Substituting $f(x_t, \theta) = \alpha/(1 + e^{-\beta - \gamma t})$ and computing the partial derivatives gives
$$\begin{cases} \sum_{t=1}^{n} (y_t - f(x_t, \hat\theta))\, \dfrac{1}{1 + e^{-\hat\beta - \hat\gamma t}} = 0 \\ \sum_{t=1}^{n} (y_t - f(x_t, \hat\theta))\, \dfrac{\hat\alpha\, e^{-\hat\beta - \hat\gamma t}}{(1 + e^{-\hat\beta - \hat\gamma t})^2} = 0 \\ \sum_{t=1}^{n} (y_t - f(x_t, \hat\theta))\, \dfrac{\hat\alpha\, t\, e^{-\hat\beta - \hat\gamma t}}{(1 + e^{-\hat\beta - \hat\gamma t})^2} = 0 \end{cases}$$

Note that it is not necessary to make a first order Taylor approximation in order to
write down the first order conditions.

3. Plot the series. The value of α gives the diffusion level for t tending to infinity. It is
called the saturation level. Do you think that we are already close to the saturation
level?

Solution

A plot of the series is found in Figure 2 below.

Figure 2: The series $y_t$ w.r.t. the year

A visual inspection of the plot does not give an indication that the steel production
by that type of technology is close to the saturation level.

4. Estimate the model parameters, by specifying in Eviews the equation

Y = c(1) / (1 + exp(- c(2) - c(3)*@trend))

Solution

Performing the model estimation yields the following output:

Dependent Variable: Y
Method: Least Squares
Sample: 1970 1997
Included observations: 28
Convergence achieved after 8 iterations

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C(1)        0.462303     0.018176     25.43471     0.0000
C(2)       -0.794178     0.056648    -14.01944     0.0000
C(3)        0.116835     0.010961     10.65954     0.0000
All three parameters $\alpha$, $\beta$ and $\gamma$ are highly significant, and their estimates are equal to
$$\hat\alpha = 0.462303, \qquad \hat\beta = -0.794178, \qquad \hat\gamma = 0.116835.$$
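For readers who want to reproduce this outside Eviews, the following is a minimal Python sketch of the same non-linear LS fit using scipy's curve_fit. The actual steel-diffusion series is not reproduced here, so the snippet simulates a stand-in series with parameters close to the estimates above; only the structure of the computation is meant to carry over.

import numpy as np
from scipy.optimize import curve_fit

def logistic(t, alpha, beta, gamma):
    # Diffusion curve f(t, theta) = alpha / (1 + exp(-beta - gamma*t))
    return alpha / (1.0 + np.exp(-beta - gamma * t))

t = np.arange(28)                        # t = 0 corresponds to 1970
rng = np.random.default_rng(0)
# Simulated stand-in for the (unavailable) steel-diffusion series:
y = logistic(t, 0.46, -0.79, 0.12) + rng.normal(0, 0.02, size=28)

theta_hat, cov_hat = curve_fit(logistic, t, y, p0=[0.5, -1.0, 0.1])
alpha_hat, beta_hat, gamma_hat = theta_hat
se = np.sqrt(np.diag(cov_hat))           # standard errors, as in the output table

# First-order-condition check (question 2): at the LS estimate, the residuals
# must be (numerically) orthogonal to the three columns of the gradient of f.
resid = y - logistic(t, *theta_hat)
z = np.exp(-beta_hat - gamma_hat * t)
grad = np.column_stack([1/(1+z),
                        alpha_hat*z/(1+z)**2,
                        alpha_hat*t*z/(1+z)**2])
print(grad.T @ resid)                    # numerically close to zero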

5. Comment on the number of iterations needed to compute the LS estimator. As
starting value for the iterative algorithm, Eviews takes the values in the constant
vector c, displayed in the workfile window. [When starting up Eviews, the constant
vector is initialized at zero. Be careful: after having estimated an equation, the
constant vector c contains the estimated parameters.] Set the value of the constant
vector c back to zero, by clicking on it and using the edit option afterwards. Change
the value of the convergence criterion to $10^{-10}$ and see how this affects the number
of iterations.

Solution

As can be seen, the algorithm converges after 8 iterations, which is an indication


that the chosen starting values for C(1), C(2), and C(3), all zero in this case, were
good starting values. Inspecting the convergence criterion, it is found to be equal
to 10−4 , something which can also explain the swift convergence.

Changing the convergence criterion to 10−10 , and resetting the starting values for the
parameter estimates back to zero, 10 iterations are needed to achieve convergence,
which is not surprising because of the nature of the algorithm used (Gauss-Newton,
quadratic convergence, i.e. the number of correct digits doubles in each iteration
step when the algorithm is “close enough” to the true solution).
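The Gauss-Newton step itself is easy to write down explicitly. Below is a hedged sketch (not EViews' exact implementation) that reuses logistic, t and y from the previous snippet; the tol argument plays the role of the convergence criterion discussed above.

def gauss_newton(theta, t, y, tol=1e-10, max_iter=100):
    """Plain Gauss-Newton for the logistic diffusion model; stops when the
    parameter update is smaller than tol (the convergence criterion)."""
    for it in range(1, max_iter + 1):
        alpha, beta, gamma = theta
        z = np.exp(-beta - gamma * t)
        f = alpha / (1 + z)
        J = np.column_stack([1/(1+z), alpha*z/(1+z)**2, alpha*t*z/(1+z)**2])
        step = np.linalg.lstsq(J, y - f, rcond=None)[0]  # solves min ||(y-f) - J d||
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            return theta, it
    return theta, max_iter

theta_gn, n_iter = gauss_newton(np.array([0.5, -1.0, 0.1]), t, y)
print(n_iter)   # only a few iterations once the starting point is reasonable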

6. Forecast the values of yt up to 2020. For this, you need to change the range of the
workfile (via ‘/Procs/Change workfile range’ in the workfile window) as well as the
sample range (via ‘/Procs/Sample’ in the workfile window) to 1970-2020. You can
make the forecast within the equation window via the Forecast menu by setting the
forecast range as 1970-2020. When are we getting close to the saturation level?

Solution

A plot of the forecast of $y_t$ over the range 1970-2020, with confidence bounds, is
found in Figure 3 below.

Figure 3: A forecast of the series $y_t$ in the range 1970-2020, with confidence bounds in dashed lines

As expected, the saturation level for this specific method of producing steel is almost
reached in the year 2020. One can say that we are getting "close" to the saturation
level as soon as the estimated value of $\alpha$, $\hat\alpha = 0.462303$, lies within the confidence
bounds of the forecast, which happens somewhere around the year 2000.
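As an illustration, the forecast and an approximate confidence band for the mean curve can be computed from theta_hat and cov_hat of the earlier sketch via the delta method. Note that this is a band for the fitted curve, not a prediction interval for new observations, so the year found below is only indicative.

t_new = np.arange(51)                                   # 1970 ... 2020
fit = logistic(t_new, *theta_hat)
z = np.exp(-beta_hat - gamma_hat * t_new)
G = np.column_stack([1/(1+z),
                     alpha_hat*z/(1+z)**2,
                     alpha_hat*t_new*z/(1+z)**2])       # gradient of f at theta_hat
band = 1.96 * np.sqrt(np.einsum('ij,jk,ik->i', G, cov_hat, G))
inside = fit + band >= alpha_hat        # years where alpha_hat is inside the band
print(1970 + t_new[inside].min())       # first year "close" to saturation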

7. Another important quantity is the point of inflection $t_I$ of the diffusion curve. It
gives the point in time after which the growth rate of diffusion declines. In other
words, after $t_I$ the number of new users/customers is decreasing. This inflection
point is given by $t_I = -\beta/\gamma$ (the root of the second derivative of the diffusion
curve). What is your estimate of $t_I$? What do we learn from this? Test the
hypothesis that the point of inflection occurred in 1990.

Solution

An approximate, visual estimation would place the point of inflection somewhere
between 1975 and 1980. However, a more precise measurement is of course needed.
Using the previously found estimates for $\beta$ and $\gamma$, the estimated point of inflection is
$$\hat t_I = -\frac{-0.794178}{0.116835} = 6.797,$$
and thus the year of inflection is 1977, since 1970 is the reference year ($t = 0$).
This means that, roughly, after the year 1977 the number of US steel producers
switching to the new technology is decreasing.

To check the hypothesis that the point of inflection occurred in 1990, a Wald test
needs to be performed with the null hypothesis
$$H_0: -\frac{\beta}{\gamma} = 1990 - 1970 = 20.$$
Performing the Wald test yields the following output:

F-statistic   75.02388   Probability   0.000000
Chi-square    75.02388   Probability   0.000000

from which the conclusion can be drawn that the year of inflection is very significantly
different from 1990.
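The same Wald statistic can be assembled by hand with the delta method, reusing beta_hat, gamma_hat and cov_hat from the estimation sketch above.

b_hat, g_hat = beta_hat, gamma_hat
V = cov_hat[1:, 1:]                          # covariance of (beta_hat, gamma_hat)
tI_hat = -b_hat / g_hat                      # estimated inflection (years after 1970)
grad_tI = np.array([-1/g_hat, b_hat/g_hat**2])   # gradient of -beta/gamma
W = (tI_hat - 20)**2 / (grad_tI @ V @ grad_tI)   # chi-square(1) under H0
print(tI_hat, W)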

8.4 Homework: Some Theoretical Exercises, page 53

1. Let $X$ be a bivariate random vector with mean $\mu$ and covariance matrix $\Sigma$, where
$$\mu = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} 3 & 1 \\ 1 & 1 \end{pmatrix}.$$
Define
$$Y = b^t X + \epsilon,$$
where $\epsilon$ follows a standard normal distribution and is independent of $X$. Finally,
let $b = (-1, 2)^t$.

(a) Compute $\mathrm{E}(Y)$

Solution
$$\mathrm{E}[Y] = \mathrm{E}[b^t X + \epsilon] = b^t \mathrm{E}[X] + \mathrm{E}[\epsilon] = b^t \mu$$
Filling in the values for $b$ and $\mu$ leads to
$$\mathrm{E}[Y] = \begin{pmatrix} -1 & 2 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = -1.$$

(b) Compute $\mathrm{Var}(Y)$

Solution
$$\mathrm{Var}(Y) = \mathrm{Var}(b^t X + \epsilon) = b^t \mathrm{Cov}(X)\, b + \mathrm{Var}(\epsilon) = b^t \Sigma b + 1,$$
where the second equality holds because $X$ and $\epsilon$ are independent. Filling in the values for $b$ and $\Sigma$ leads to
$$\mathrm{Var}(Y) = \begin{pmatrix} -1 & 2 \end{pmatrix} \begin{pmatrix} 3 & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} -1 \\ 2 \end{pmatrix} + 1 = 4.$$
(c) Compute the correlation between $Y$ and $\epsilon$.

Solution
$$\mathrm{Cov}(Y, \epsilon) = \mathrm{Cov}(b^t X + \epsilon,\, \epsilon) = \mathrm{Cov}(\epsilon, \epsilon) = \mathrm{Var}(\epsilon) = 1,$$
again using the independence of $X$ and $\epsilon$. From this, $\rho_{Y,\epsilon}$, the correlation between $Y$ and $\epsilon$, can be computed:
$$\rho_{Y,\epsilon} = \frac{\mathrm{Cov}(Y, \epsilon)}{\sqrt{\mathrm{Var}(Y)\,\mathrm{Var}(\epsilon)}} = \frac{1}{\sqrt{4}} = \frac{1}{2}.$$
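These three moment computations are easy to sanity-check by simulation. The sketch below draws $X$ from a normal distribution with the given $\mu$ and $\Sigma$ for concreteness (the exercise only fixes the first two moments, so any distribution with these moments would do).

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 0.0])
Sigma = np.array([[3.0, 1.0], [1.0, 1.0]])
b = np.array([-1.0, 2.0])

X = rng.multivariate_normal(mu, Sigma, size=1_000_000)
eps = rng.standard_normal(1_000_000)
Y = X @ b + eps
# Should print values near -1, 4 and 0.5 respectively:
print(Y.mean(), Y.var(), np.corrcoef(Y, eps)[0, 1])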

2. Consider the regression model
$$\log(y_i) = \alpha + \beta x_i + \gamma z_i + \epsilon_i,$$
for $1 \le i \le n$ and with $\epsilon_i$ the error terms. Suppose that $\hat\beta = 1.5$ and $\mathrm{SE}(\hat\beta) = 0.75$.
Suppose that $\hat\gamma = 4$ and $\mathrm{SE}(\hat\gamma) = 0.75$. Moreover, $\mathrm{Cov}(\hat\beta, \hat\gamma) = 0.2$.

(a) Is the variable $x_i$ significant?

Solution

In other words, the null hypothesis $H_0: \beta = 0$ has to be tested.

First, determine the 95% confidence interval for $\beta$:
$$\mathrm{CI} = \hat\beta \pm 1.96 \cdot \mathrm{SE}(\hat\beta) = 1.5 \pm 1.96 \cdot 0.75 = [0.03;\ 2.97].$$
Since $0 \notin [0.03;\ 2.97]$, $H_0$ can, in principle, be rejected and the variable $x_i$
is statistically significant. Computing the p-value, it is found that
$$p = 0.0455,$$
meaning that the significance is not very strong, i.e. it is a borderline case.

(b) Is $\hat\beta$ significantly different from 5?

Solution

The hypothesis $H_0: \beta = 5$ must be tested. The 95% confidence interval is the
same as in (a), and $5 \notin [0.03;\ 2.97]$. Moreover, computing the p-value, it is found that
$$p < 10^{-5},$$
which means that $\hat\beta$ is very significantly different from 5.

(c) Using the Delta-method (for more information see section 3.2.4), compute $\mathrm{SE}(\hat\beta^2)$.

Solution

First, define
$$g : \mathbb{R} \to \mathbb{R} : \theta = \beta \mapsto \beta^2.$$
From this, it is found that
$$\frac{dg(\beta)}{d\beta} = \frac{d\beta^2}{d\beta} = 2\beta.$$
Here, $V = \mathrm{Var}(\hat\beta)$, and hence
$$\sqrt{n}\,(\hat\beta^2 - \beta^2) \xrightarrow{d} N(0,\ 4\beta^2 V).$$
From this, it is inferred that
$$\mathrm{Var}(\hat\beta^2) = 4\hat\beta^2\, \mathrm{Var}(\hat\beta)$$
and, filling in the values,
$$\mathrm{Var}(\hat\beta^2) = 4 \cdot (1.5)^2 (0.75)^2 = 5.0625,$$
or
$$\mathrm{SE}(\hat\beta^2) = 2.25.$$

(d) Test whether $H_0: \beta = \gamma$ holds.

Solution

Testing the hypothesis $H_0: \beta = \gamma$ is equivalent to testing $H_0: \beta - \gamma = 0$. A
side computation reveals that
$$\mathrm{Var}(\hat\beta - \hat\gamma) = \mathrm{Var}(\hat\beta) + \mathrm{Var}(\hat\gamma) - 2\,\mathrm{Cov}(\hat\beta, \hat\gamma) = 0.75^2 + 0.75^2 - 2 \cdot 0.2 = 0.725,$$
which leads to the following value for the Wald statistic:
$$z = \frac{\hat\beta - \hat\gamma}{\sqrt{\mathrm{Var}(\hat\beta - \hat\gamma)}} = \frac{-2.5}{\sqrt{0.725}} = -2.936.$$
Since $|z| > 1.96$, the null hypothesis can be rejected at the 5% significance
level, and thus $\hat\beta$ and $\hat\gamma$ are significantly different. The p-value ($p \approx 0.0033$)
shows that the difference is in fact significant even at the 1% level.

(e) Compute $\mathrm{SE}(\hat\beta\hat\gamma)$

Solution

First, define
$$g : \mathbb{R}^2 \to \mathbb{R} : \theta = \begin{pmatrix} \beta \\ \gamma \end{pmatrix} \mapsto \beta\gamma.$$
From this, it is found that
$$\frac{dg(\theta)}{d\theta} = \frac{d\,\beta\gamma}{d\theta} = \begin{pmatrix} \gamma \\ \beta \end{pmatrix}.$$
Here,
$$V = \begin{pmatrix} \mathrm{SE}(\hat\beta)^2 & \mathrm{Cov}(\hat\beta, \hat\gamma) \\ \mathrm{Cov}(\hat\beta, \hat\gamma) & \mathrm{SE}(\hat\gamma)^2 \end{pmatrix},$$
and hence
$$\sqrt{n}\,(\hat\beta\hat\gamma - \beta\gamma) \xrightarrow{d} N\!\left(0,\ \begin{pmatrix} \gamma & \beta \end{pmatrix} V \begin{pmatrix} \gamma \\ \beta \end{pmatrix}\right).$$
From this, it is inferred that
$$\mathrm{Var}(\hat\beta\hat\gamma) = \begin{pmatrix} \hat\gamma & \hat\beta \end{pmatrix} V \begin{pmatrix} \hat\gamma \\ \hat\beta \end{pmatrix}$$
and, filling in the values,
$$\mathrm{Var}(\hat\beta\hat\gamma) = \begin{pmatrix} 4 & 1.5 \end{pmatrix} \begin{pmatrix} (0.75)^2 & 0.2 \\ 0.2 & (0.75)^2 \end{pmatrix} \begin{pmatrix} 4 \\ 1.5 \end{pmatrix} = 12.67,$$
or
$$\mathrm{SE}(\hat\beta\hat\gamma) = \sqrt{12.67} = 3.56.$$
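The arithmetic of parts (c)-(e) can be reproduced in a few lines of Python, using only the numbers given in the exercise.

import numpy as np

beta, gamma, se = 1.5, 4.0, 0.75
V = np.array([[se**2, 0.2], [0.2, se**2]])     # Cov matrix of (beta_hat, gamma_hat)

se_beta_sq = np.sqrt(4 * beta**2 * V[0, 0])    # (c): SE(beta^2) = 2.25
z = (beta - gamma) / np.sqrt(V[0, 0] + V[1, 1] - 2*V[0, 1])   # (d): z = -2.936
grad = np.array([gamma, beta])                 # delta-method gradient of beta*gamma
se_prod = np.sqrt(grad @ V @ grad)             # (e): SE(beta*gamma) = 3.56
print(se_beta_sq, z, se_prod)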

(f) Complete the next sentence:

"If $x$ increases by 1 unit, then we expect that $y$ increases by . . . "

Solution

Starting from the model $\log(y) = \alpha + \beta x + \gamma z + \epsilon$, it is easily seen that, if $x$
increases by 1 unit (ceteris paribus), $\log(y)$ is expected to increase by $\beta$ units. Since
$$d\log(y) = \frac{dy}{y},$$
an increase of $\beta$ in $\log(y)$ corresponds to a relative increase in $y$ of approximately $\beta$,
i.e. of $(\beta \times 100)\%$.

Filling in $\hat\beta = 1.5$ for $\beta$, it is found that if $x$ increases by 1 unit, ceteris
paribus, $y$ is expected to increase by $1.5 \cdot 100 = 150\%$. (Note that this first-order
approximation is rough for a coefficient this large: the exact relative change is
$e^{1.5} - 1 \approx 348\%$.)

8.5 Homework: The Poisson Regression Model, page 101

We are interested in the number of accidents per service month for a sample of ships. The
data can be found in the file “ships.wmf”. The endogenous variable is called ACC. The
explicative variables are:

• TYPE: there are 5 types of ships, labeled as A-B-C-D-E or 1-2-3-4-5. TYPE is a
categorical variable, so 5 dummy variables can be created: TA, TB, TC, TD, TE.

• CONSTRUCTION YEAR: the ships are constructed in one of four periods, leading
to the dummy variables T6064, T6569, T7074, and T7579.

• SERVICE: a measure for the amount of service that the ship has already carried
out.

Questions:

1. Make a histogram of the variable ACC. Comment on its form. Is this the histogram
for the conditional or the unconditional distribution of ACC?

Solution

The histogram can be found in Figure 4 below:

Figure 4: A histogram of the variable ACC

At first sight, the shape of the histogram is more or less that of a Poisson
distributed variable with a small parameter $\lambda$ (around 2 or 3). This is
the histogram of the unconditional distribution of ACC.

2. Estimate the Poisson regression model, including all explicative variables and a
constant term. (Use estimation method: COUNT - integer count data.)

Solution

Estimation of the full model, according to a Poisson Model, yields the following
values for the coefficients (TA was chosen to be the reference category for TYPE,
and T6064 was chosen to be the reference category for CONSTRUCTION YEAR):

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C           0.867530     0.241285     3.595456     0.0003
TB          0.989658     0.212252     4.662659     0.0000
TC         -1.219122     0.327417    -3.723450     0.0002
TD         -0.858781     0.287597    -2.986056     0.0028
TE         -0.242659     0.236351    -1.026689     0.3046
T6569       0.950927     0.176265     5.394863     0.0000
T7074       1.266906     0.227427     5.570601     0.0000
T7579       0.719230     0.277312     2.593581     0.0095
SERVICE     4.48E-05     7.42E-06     6.042115     0.0000

Note that if TA and/or T6064 were included in the model as well, an error message
would be generated and no estimation would be performed: together with the
constant term, a full set of dummies is perfectly collinear (the dummy variable trap).
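The course uses EViews' COUNT estimator, but the same Poisson regression can be sketched in Python with statsmodels, assuming the ship data have been exported to a hypothetical CSV file ("ships.csv") with the variable names used above.

import pandas as pd
import statsmodels.formula.api as smf

ships = pd.read_csv("ships.csv")   # hypothetical CSV export of ships.wmf
pm = smf.poisson("ACC ~ TB + TC + TD + TE + T6569 + T7074 + T7579 + SERVICE",
                 data=ships).fit()
print(pm.summary())                # TA and T6064 serve as reference categories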

3. Comment on the coefficient for the variable SERVICE. Is it significant?

Solution

The estimated value of the coefficient for the variable SERVICE is very small, as is its
standard error. This is most likely due to the large scale of the variable. Calling
up the descriptive statistics for the variable supports this ($\hat\sigma = 9644.166$). However,
despite the small absolute value of the coefficient, it is still highly significant ($p \le 5 \cdot 10^{-5}$).

4. Perform a Wald test to test for the joint significance of the construction year dummy
variables.

Solution

Performing the Wald test of the hypothesis
$$H_0: \beta_{T6569} = \beta_{T7074} = \beta_{T7579} = 0$$
yields a value of 40.56575 for the $\chi^2$-statistic, and hence a p-value smaller than
$5 \cdot 10^{-7}$, indicating that the construction year dummy variables are jointly strongly
significant.

5. Consider a ship of category A, constructed in the period 65-69, with SERVICE = 1000.
Predict the number of accidents per service month. Also estimate (a) the probability
that no accident will occur for this ship, and (b) the probability that at most one
accident will occur.

Solution

Denote by $x = (0, 0, 0, 0, 1, 0, 0, 1000)^t$ the encoding of the ship (the first four numbers
are the dummies for TYPE, where type A is the reference category as before, the
next three are for CONSTRUCTION YEAR, with 60-64 the reference category, and
the last entry is the value for SERVICE), and let $\hat\lambda(x)$ be the predicted number of
accidents per service month. Then
$$\hat\lambda(x) = \hat{\mathrm{E}}[\mathrm{ACC} \mid x] = e^{x^t \hat\beta + \hat\beta_0},$$
where $\hat\beta_0$ is the estimate of the intercept. Filling in the estimated values for $\hat\beta$, it
is found that $\hat\lambda(x) = 6.444693$. From this, the following two probabilities can be
estimated:
estimated:

(a)
$$\hat P(\mathrm{ACC} = 0 \mid x) = e^{-\hat\lambda(x)} = 0.001588932$$

(b)
$$\hat P(\mathrm{ACC} \le 1 \mid x) = \hat P(\mathrm{ACC} = 0 \mid x) + \hat P(\mathrm{ACC} = 1 \mid x) = e^{-\hat\lambda(x)} + e^{-\hat\lambda(x)}\, \hat\lambda(x) = 0.01182911$$
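These predictions are quick to verify numerically; the sketch below plugs the reported coefficient values into the Poisson probabilities via scipy.

import numpy as np
from scipy.stats import poisson

# intercept + T6569 dummy + SERVICE contribution, from the estimates above
lam = np.exp(0.867530 + 0.950927 + 1000 * 4.48e-05)
print(lam)                     # ~ 6.44 accidents per service month
print(poisson.pmf(0, lam))     # P(ACC = 0 | x)  ~ 0.00159
print(poisson.cdf(1, lam))     # P(ACC <= 1 | x) ~ 0.0118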

6. The computer output mentions: "Convergence achieved after 9 iterations". What
does this mean?

Solution

This indicates that, at the 9th step of the iterative algorithm used to compute the
ML estimators, the change relative to the previous solution was smaller than the
convergence criterion, so the algorithm stopped.

7. What do we learn from the value of "Probability(LR stat)"? What is the corresponding
null hypothesis?

Solution

The value of "Probability(LR stat)" is very low here (smaller than $5 \cdot 10^{-7}$), meaning
that the variables used to construct the model are jointly extremely significant.
The corresponding null hypothesis is
$$H_0: \beta = 0,$$
with $\beta$ the vector of parameters (excluding the intercept), and as explained before,
this hypothesis can be strongly rejected.

8. Estimate now a Negative Binomial Model. EViews reports $\log(\eta^2)$ as the mixture
parameter in the estimation output. (a) Compare the estimates of $\beta$ given by the
two models. (b) Compare the pseudo $R^2$ values of the two models.

Solution

Estimation of the full model, according to a Negative Binomial Model, yields the
following values for the coefficients (TA was chosen to be the reference category for
TYPE, and T6064 was chosen to be the reference category for CONSTRUCTION
YEAR):
Variable    Coefficient   Std. Error   t-Statistic   Prob.
C            0.380217     0.469879     0.809181     0.4184
TB           0.997461     0.539695     1.848192     0.0646
TC          -1.112612     0.477577    -2.329703     0.0198
TD          -0.882003     0.444638    -1.983644     0.0473
TE          -0.147957     0.418851    -0.353246     0.7239
T6569        0.982162     0.419928     2.338883     0.0193
T7074        1.857623     0.443221     4.191192     0.0000
T7579        1.097159     0.498048     2.202918     0.0276
SERVICE      6.41E-05     2.42E-05     2.645928     0.0081
log(η²)     -1.104076     0.454214    -2.430740     0.0151


(a) As can be seen, some of the estimated coefficients have changed only slightly
between the models, while others (the intercept, $\hat\beta_{TE}$, $\hat\beta_{T7074}$, $\hat\beta_{T7579}$, $\hat\beta_{SERVICE}$) have
changed (much) more.

(b) For the two estimated models, the Poisson Model (PM) and the Negative Binomial
Model (NBM), the following pseudo $R^2$ values are found:

Model   Pseudo R²
PM      0.713334
NBM     0.757713

This indicates that the Negative Binomial Model is slightly preferable, in terms
of explicative power, to the Poisson Model.
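A hedged statsmodels counterpart of the Negative Binomial fit, reusing the hypothetical ships data frame and the Poisson fit pm from the earlier sketch. Note that statsmodels' NB2 overdispersion parameterization differs from EViews' log(η²) mixture parameter, so those estimates are not directly comparable.

nbm = smf.negativebinomial(
    "ACC ~ TB + TC + TD + TE + T6569 + T7074 + T7579 + SERVICE",
    data=ships).fit()
print(nbm.summary())
print(pm.prsquared, nbm.prsquared)   # McFadden pseudo R^2 of the two models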

9. Estimate now the Poisson model with only a constant term, so without explicative
variables (empty model). Derive mathematically a formula for this estimate of the
constant term (in the empty model), using the first order condition of the ML-
estimator.

Solution

Estimation of the empty model, according to a Poisson Model, yields the following
value for the estimated intercept:

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C           2.348570     0.053000    44.31276      0.0000
Analytically, this coefficient can be found as follows. Given the data $\{acc_i\}$ (since
no explicative variables are included), the total log-likelihood can be written as (using
the independence of the observations)
$$\log L(acc_1, \ldots, acc_n \mid \beta_0) = \sum_{i=1}^{n} \log P(\mathrm{ACC} = acc_i \mid \beta_0),$$
where $\beta_0$ is the intercept. Assuming a Poisson Model, and taking $\lambda_i = \lambda = e^{\beta_0}$, it
is found that
$$P(\mathrm{ACC} = acc_i \mid \beta_0) = \frac{e^{-\lambda} \lambda^{acc_i}}{acc_i!} = \frac{e^{-e^{\beta_0}}\, e^{\beta_0 \, acc_i}}{acc_i!},$$
and hence
$$\log L(\beta_0) = \sum_{i=1}^{n} \left\{ -e^{\beta_0} + \beta_0 \, acc_i - \log(acc_i!) \right\}.$$
The maximum likelihood estimator of $\beta_0$ is then defined as
$$\hat\beta_{0,\mathrm{ML}} = \operatorname*{argmax}_{\beta_0}\ \log L(\beta_0).$$
To get an analytic expression for this estimator, write down the first-order condition
of the estimator:
$$\frac{d}{d\beta_0} \log L(\beta_0) \Big|_{\beta_0 = \hat\beta_0} = \sum_{i=1}^{n} \left\{ -e^{\hat\beta_0} + acc_i \right\} = -n e^{\hat\beta_0} + \sum_{i=1}^{n} acc_i = n\left(\overline{acc} - e^{\hat\beta_0}\right) = 0,$$
or
$$e^{\hat\beta_0} = \overline{acc} \;\Leftrightarrow\; \hat\beta_0 = \log \overline{acc},$$
where $\overline{acc}$ is the sample mean of ACC.
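This closed form is easy to check numerically: the intercept reported above should equal the log of the sample mean of ACC (reusing the hypothetical ships frame from the earlier sketch).

import numpy as np
print(np.log(ships["ACC"].mean()))   # should reproduce the reported 2.348570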

