
INTRODUCTION TO ECONOMETRICS
(ECON. 352)

HASSEN A. (M.Sc.)
JIMMA UNIVERSITY
2008/09

CHAPTER ONE
INTRODUCTION
1.1 The Econometric Approach
1.2 Models, Economic Models & Econometric Models
1.3 Types of Data for Econometric Analysis

1.1 The Econometric Approach

WHAT IS ECONOMETRICS?
• Econometrics literally means "economic measurement".
• In simple terms, econometrics deals with the application of statistical methods to economics.
• More fully: the application of mathematical and statistical techniques to data in order to provide evidence on questions of interest to economics.

• Unlike economic statistics, which mainly collects and summarizes statistical data, econometrics combines economic theory, mathematical economics, economic statistics and mathematical statistics:
  - economic theory: provides the theory, i.e., imposes a logical structure on the form of the question (e.g., when price goes up, quantity demanded goes down);
  - mathematical economics: expresses economic theory using mathematics (in mathematical form);
  - economic statistics: data presentation and description;
  - mathematical statistics: estimation and testing techniques.

Goals/uses of econometrics:
• Estimation/measurement of economic parameters or relationships, which may be needed for policy- or decision-making;
• Testing (and possibly refining) economic theory;
• Forecasting/prediction of future values of economic magnitudes; and
• Evaluation of policies/programs.

1.2 Models, Economic Models & Econometric Models

• Model: a simplified representation of real-world phenomena.
• Econometric model: combines the economic model with assumptions about the random nature of the data.
• (Diagram: nested boxes, MODEL ⊃ ECONOMIC MODEL ⊃ ECONOMETRIC MODEL.)

The stages of an econometric study (flowchart):
1. Economic theory or model
2. Econometric model: a statement of the economic theory in an empirically testable form
3. Data
4. Some a priori information
5. Estimation of the model
6. Tests of any hypothesis suggested by the economic model
7. Interpreting results and using the model for prediction and policy

1. Statement of theory or hypothesis:
   e.g. Theory: people increase consumption as income increases, but not by as much as the increase in their income.
2. Specification of the mathematical model:
   C = α + βY;  0 < β < 1,
   where C = consumption, Y = income, β = slope = MPC = ΔC/ΔY, and α = intercept.

3. Specification of the econometric (statistical) model:
   C = α + βY + ε;  0 < β < 1,
   where α = intercept = autonomous consumption, and ε = the error/stochastic/disturbance term. It captures several factors:
   • omitted variables,
   • measurement error in the dependent variable and/or wrong functional form,
   • randomness of human behavior.

4. Obtain data.
5. Estimate the parameters of the model. How? Three methods (to be discussed)!
   Suppose Ĉi = 184.08 + 0.8Yi.
6. Hypothesis testing: is 0.8 statistically less than 1?
7. Interpret the results and use the model for policy or forecasting:
   • A 1 Br. increase in income induces an 80 cent rise in consumption, on average.
   • If Y = 0, then average C = 184.08.

   • Predict the level of C for a given Y.
   • Pick the value of the control variable (Y) to get a desired value of the target variable (C).

1.3 Types of Data for Econometric Analysis

• Time series data: a set of observations on the values that a variable takes at different times, e.g. the money supply or the unemployment rate over a period of years.
• Cross-sectional data: data on one or more variables collected at the same point in time.
• Pooled data: cross-sectional observations collected over time, where the units don't have to be the same.
• Longitudinal/panel data: a special type of pooled data in which the same cross-sectional unit (say, a family or a firm) is surveyed over time.

CHAPTER TWO
SIMPLE LINEAR REGRESSION
2.1 The Concept of Regression Analysis
2.2 The Simple Linear Regression Model
2.3 The Method of Least Squares
2.4 Properties of Least-Squares Estimators and the Gauss-Markov Theorem
2.5 Residuals and Goodness of Fit
2.6 Confidence Intervals and Hypothesis Testing in Regression Analysis
2.7 Prediction with the Simple Linear Regression

2.1 The Concept of Regression Analysis

• Origin of the word "regression"!
• Our objective in regression analysis is to find out how the average value of the dependent variable (the regressand) varies with the given values of the explanatory variable(s) (the regressor/s).
• Compare regression and correlation (dependence vs. association).
• The key concept underlying regression analysis is the conditional expectation function (CEF), or population regression function (PRF):
  E[Y | Xi] = f(Xi)

• For empirical purposes, it is the stochastic PRF that matters:
  Yi = E[Y | Xi] + εi
• The stochastic disturbance term εi plays a critical role in estimating the PRF.
• The PRF is an idealized concept, since in practice one rarely has access to the entire population; usually one has just a sample of observations.
• Hence, we use the stochastic sample regression function (SRF), Ŷi = f(Xi), to estimate the PRF, i.e., we use Yi = f(Ŷi, ei) to estimate Yi = f(E[Y | Xi], εi).

2.2 The Simple Linear Regression Model

• We assume linear PRFs, i.e., regressions that are linear in the parameters (α and β). They may or may not be linear in the variables (Y or X).
  E[Y | Xi] = α + βXi  ⇒  Yi = α + βXi + εi
• "Simple" because we have only one regressor (X).
• Accordingly, we use Ŷi = α̂ + β̂Xi to estimate E[Y | Xi] = α + βXi.
  ⇒ α̂, β̂ and ei from a sample are estimates of α, β and εi, respectively.

• Using the theoretical relationship between X and Y, Yi can be decomposed into its non-stochastic component α + βXi and its random component εi.
• This is a theoretical decomposition because we do not know the values of α and β, or the values of ε.
• An operational decomposition of Y (used for practical purposes) is with reference to the fitted line: the actual value of Y equals the fitted value Ŷi = α̂ + β̂Xi plus the residual ei.
• The residuals ei serve a similar purpose as the stochastic term εi, but the two are not identical.

• From the PRF:
  Yi = E[Yi | Xi] + εi  ⇒  εi = Yi − E[Yi | Xi];
  but E[Yi | Xi] = α + βXi, so εi = Yi − α − βXi.
• From the SRF:
  Yi = Ŷi + ei  ⇒  ei = Yi − Ŷi;
  but Ŷi = α̂ + β̂Xi, so ei = Yi − α̂ − β̂Xi.

(Figure: the PRF E[Y|Xi] = α + βXi drawn in the (X, Y) plane with intercept α; observations O1-O4 at X1-X4 lie off the line, and the vertical distances ε1-ε4 from each observation to the PRF are the disturbances.)

(Figure: the SRF Ŷi = α̂ + β̂Xi (intercept α̂) plotted together with the PRF Yi = α + βXi (intercept α); for each observation the residual ei, measured from the SRF, and the disturbance εi, measured from the PRF, generally differ, e.g. ε1 < e1, ε2 = e2, ε3 < e3, ε4 > e4, so εi and ei are not identical.)

2.3 The Method of Least Squares

• Remember that our sample is only one of a large number of possibilities.
• Implication: the SRF line in the figure above is just one of many possible such lines, and each SRF line has its own unique α̂ and β̂ values.
• Then, which of these lines should we choose?
• Generally, we look for the SRF which is as close as possible to the PRF.
• But how can we devise a rule that makes the SRF as close as possible to the PRF? Equivalently, how can we choose the best technique to estimate the parameters of interest (α and β)?

• Generally speaking, there are three methods of estimation:
  - the method of least squares,
  - the method of moments, and
  - maximum likelihood estimation.
• The most common method for fitting a regression line is the method of least squares. We will use least-squares estimation, specifically Ordinary Least Squares (OLS), in Chapters 2 and 3.
• What does OLS do?

• A line gives a good fit to a set of data if the points (actual observations) are close to it. That is, the predicted values obtained by using the line should be close to the values that were actually observed; in other words, the residuals should be small. Therefore, when assessing the fit of a line, the vertical distances of the points to the line are the only distances that matter, because errors are measured as vertical distances.
• The OLS method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line (the residual sum of squares, RSS).

• Minimize RSS = Σ ei² (the sum over i = 1, ..., n of the squared residuals).
• We could think of minimizing the RSS by successively choosing pairs of values for α̂ and β̂ until the RSS is made as small as possible.
• But we will use differential calculus instead (which turns out to be a lot easier).
• Why the squares of the residuals? Why not just minimize the sum of the residuals?
• To prevent negative residuals from cancelling positive ones: because the deviations are first squared and then summed, there are no cancellations between positive and negative values.

• If we used Σ ei instead, all the error terms ei would receive equal importance no matter how closely or how widely the individual observations are scattered about the SRF.
• A consequence of this is that the algebraic sum of the ei can be small (even zero) although the ei are widely scattered about the SRF.
• Besides, the OLS estimates possess desirable properties of estimators under some assumptions.
• OLS technique: choose α̂ and β̂ to
  minimize Σ ei² = Σ (Yi − Ŷi)² = Σ (Yi − α̂ − β̂Xi)².

First-order condition (1): differentiate with respect to α̂ and set equal to zero.
  ∂(Σ ei²)/∂α̂ = 0  ⇒  ∂[Σ (Yi − α̂ − β̂Xi)²]/∂α̂ = 0
  ⇒ 2·[Σ (Yi − α̂ − β̂Xi)]·(−1) = 0
  ⇒ Σ (Yi − α̂ − β̂Xi) = 0
  ⇒ ΣYi − nα̂ − β̂ΣXi = 0
  ⇒ Ȳ − α̂ − β̂X̄ = 0  ⇒  α̂ = Ȳ − β̂X̄

First-order condition (2): differentiate with respect to β̂ and set equal to zero.
  ∂(Σ ei²)/∂β̂ = 0  ⇒  ∂[Σ (Yi − α̂ − β̂Xi)²]/∂β̂ = 0
  ⇒ 2·[Σ (Yi − α̂ − β̂Xi)]·(−Xi) = 0
  ⇒ Σ [(Yi − α̂ − β̂Xi)·Xi] = 0
  ⇒ ΣYiXi − α̂ΣXi − β̂ΣXi² = 0
  ⇒ ΣYiXi = α̂ΣXi + β̂ΣXi²

Solve α̂ = Ȳ − β̂X̄ and ΣYiXi = α̂ΣXi + β̂ΣXi² (called the normal equations) simultaneously:
  ΣYiXi = (Ȳ − β̂X̄)(ΣXi) + β̂ΣXi²
  ⇒ ΣYiXi = ȲΣXi − β̂X̄ΣXi + β̂ΣXi²
  ⇒ ΣYiXi − ȲΣXi = β̂ΣXi² − β̂X̄ΣXi
  ⇒ ΣYiXi − ȲΣXi = β̂(ΣXi² − X̄ΣXi)
  ⇒ ΣYiXi − nX̄Ȳ = β̂(ΣXi² − nX̄²),
  because X̄ = ΣXi/n ⇔ ΣXi = nX̄.

Thus:
1. β̂ = (ΣYiXi − nX̄Ȳ) / (ΣXi² − nX̄²)
Alternative expressions for β̂ (easier to recall):
2. β̂ = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² = Σxy / Σx²,  where xi = Xi − X̄ and yi = Yi − Ȳ.
3. β̂ = Cov(X, Y) / Var(X)
4. β̂ = [nΣYiXi − (ΣXi)(ΣYi)] / [nΣXi² − (ΣXi)²]

For α̂ just use α̂ = Ȳ − β̂X̄. Or, if you wish, substitute the expression for β̂:
  α̂ = Ȳ − X̄·[(ΣYiXi − nX̄Ȳ) / (ΣXi² − nX̄²)]
  ⇒ α̂ = {[ΣXi² − nX̄²]Ȳ − [X̄ΣYiXi − nX̄²Ȳ]} / (ΣXi² − nX̄²)
  ⇒ α̂ = (ȲΣXi² − nX̄²Ȳ − X̄ΣYiXi + nX̄²Ȳ) / (ΣXi² − nX̄²)
  ⇒ α̂ = (ȲΣXi² − X̄ΣYiXi) / (ΣXi² − nX̄²) = [(ΣYi)(ΣXi²) − (ΣXi)(ΣYiXi)] / [n(ΣXi² − nX̄²)]

Previously, we came across the following two normal equations:
1. Σ (Yi − α̂ − β̂Xi) = 0, which is equivalent to Σ ei = 0;
2. Σ [(Yi − α̂ − β̂Xi)·Xi] = 0, which is equivalent to Σ eiXi = 0.
Note also the following property: the mean of the fitted values Ŷ equals Ȳ.
  Yi = Ŷi + ei  ⇒  ΣYi/n = ΣŶi/n + Σei/n  ⇒  Ȳ = mean(Ŷ),  since ē = 0 ⇔ Σei = 0.

• The facts that Ŷ and Y have the same average, and that this average value is achieved at the average value of X (i.e., mean(Ŷ) = Ȳ and Ȳ = α̂ + β̂X̄), together imply that the sample regression line passes through the sample mean values of X and Y.
• (Figure: the fitted line Ŷi = α̂ + β̂Xi passing through the point (X̄, Ȳ).)

Assumptions Underlying the Method of Least Squares
• To obtain the estimates of α and β, it suffices to assume that our model is correctly specified and that the systematic and the stochastic components in the equation are independent.
• But the objective in regression analysis is not only to obtain α̂ and β̂ but also to draw inferences about the true α and β. For example, we'd like to know how close α̂ and β̂ are to α and β, or how close Ŷi is to E[Y | Xi].
• To that end, we must not only specify the functional form of the model, but also make certain assumptions about the manner in which the Yi are generated.

• The PRF Yi = α + βXi + εi shows that Yi depends on both Xi and εi.
• Therefore, unless we are specific about how Xi and εi are created or generated, there is no way we can make any statistical inference about the Yi, or about α and β.
• Thus, the assumptions made about the X variable and the error term are extremely critical to the valid interpretation of the regression estimates.

THE ASSUMPTIONS:
1. Zero mean value of the disturbance εi: E(εi|Xi) = 0, or equivalently E[Yi|Xi] = α + βXi.
2. Homoscedasticity, or equal variance of εi: given the value of X, the variance of εi is the same (a finite positive constant σ²) for all observations. That is,
   var(εi|Xi) = E[εi − E(εi|Xi)]² = E(εi)² = σ².
   By implication, var(Yi|Xi) = σ²:
   var(Yi|Xi) = E{α + βXi + εi − (α + βXi)}² = E(εi)² = σ² for all i.

3. No autocorrelation between the disturbance terms: each random error term εi has zero covariance with, or is uncorrelated with, each and every other random error term εs (for s ≠ i).
   cov(εi, εs|Xi, Xs) = E{[εi − E(εi)]|Xi}{[εs − E(εs)]|Xs} = E(εi|Xi)(εs|Xs) = 0.
   Equivalently, cov(Yi, Ys|Xi, Xs) = 0 for all s ≠ i.
4. The disturbance ε and the explanatory variable X are uncorrelated: cov(εi, Xi) = 0.
   cov(εi, Xi) = E[εi − E(εi)][Xi − E(Xi)] = E[εi(Xi − E(Xi))] = E(εiXi) − E(Xi)E(εi) = E(εiXi) = 0.

5. The error terms are normally and independently distributed: εi ~ NID(0, σ²).
   Assumptions 1 to 3 together imply that εi ~ IID(0, σ²).
   The normality assumption enables us to derive the sampling distributions of the OLS estimators (α̂ and β̂). This simplifies the task of establishing confidence intervals and testing hypotheses.
6. X is assumed to be non-stochastic, and must take at least two different values.
7. The number of observations n must be greater than the number of parameters to be estimated: n > 2 in this case.

Numerical example: explaining sales as a function of advertising, Sales = f(Advertising). Sales are in thousands of Birr and advertising expenses are in hundreds of Birr.

Firm (i) | Sales (Yi) | Advertising Expense (Xi)
    1    |    11      |    10
    2    |    10      |     7
    3    |    12      |    10
    4    |     6      |     5
    5    |    10      |     8
    6    |     7      |     8
    7    |     9      |     6
    8    |    10      |     7
    9    |    11      |     9
   10    |    10      |    10

Computation of deviations (Ȳ = ΣYi/n = 96/10 = 9.6;  X̄ = ΣXi/n = 80/10 = 8):

  i | Yi | Xi | yi = Yi − Ȳ | xi = Xi − X̄ | xi·yi
  1 | 11 | 10 |     1.4     |      2      |  2.8
  2 | 10 |  7 |     0.4     |     −1      | −0.4
  3 | 12 | 10 |     2.4     |      2      |  4.8
  4 |  6 |  5 |    −3.6     |     −3      | 10.8
  5 | 10 |  8 |     0.4     |      0      |  0
  6 |  7 |  8 |    −2.6     |      0      |  0
  7 |  9 |  6 |    −0.6     |     −2      |  1.2
  8 | 10 |  7 |     0.4     |     −1      | −0.4
  9 | 11 |  9 |     1.4     |      1      |  1.4
 10 | 10 | 10 |     0.4     |      2      |  0.8
  Σ | 96 | 80 |      0      |      0      | 21

  i |  yi  |  yi²  | xi | xi²
  1 |  1.4 |  1.96 |  2 |  4
  2 |  0.4 |  0.16 | −1 |  1
  3 |  2.4 |  5.76 |  2 |  4
  4 | −3.6 | 12.96 | −3 |  9
  5 |  0.4 |  0.16 |  0 |  0
  6 | −2.6 |  6.76 |  0 |  0
  7 | −0.6 |  0.36 | −2 |  4
  8 |  0.4 |  0.16 | −1 |  1
  9 |  1.4 |  1.96 |  1 |  1
 10 |  0.4 |  0.16 |  2 |  4
  Σ |  0   | 30.4  |  0 | 28

β̂ = Σxiyi / Σxi² = 21/28 = 0.75
α̂ = Ȳ − β̂X̄ = 9.6 − 0.75(8) = 3.6

Fitted values and residuals (Ŷi = 3.6 + 0.75Xi;  ei = Yi − Ŷi):

  i |  Ŷi   |   ei  |  ei²
  1 | 11.10 | −0.10 | 0.01
  2 |  8.85 |  1.15 | 1.3225
  3 | 11.10 |  0.90 | 0.81
  4 |  7.35 | −1.35 | 1.8225
  5 |  9.60 |  0.40 | 0.16
  6 |  9.60 | −2.60 | 6.76
  7 |  8.10 |  0.90 | 0.81
  8 |  8.85 |  1.15 | 1.3225
  9 | 10.35 |  0.65 | 0.4225
 10 | 11.10 | −1.10 | 1.21
  Σ | 96    |  0    | 14.65

Thus Σei² = 14.65, Σŷi² = 15.75, Σyi² = 30.4, and Σyi = Σxi = Σŷi = Σei = 0.
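The computations in the tables above can be reproduced with a minimal Python/numpy sketch (an illustration only; the data are exactly the ten observations in the table):

```python
import numpy as np

# Sales (Y, in thousands of Birr) and advertising expense (X, in hundreds of Birr)
Y = np.array([11, 10, 12, 6, 10, 7, 9, 10, 11, 10], dtype=float)
X = np.array([10, 7, 10, 5, 8, 8, 6, 7, 9, 10], dtype=float)

x = X - X.mean()                              # deviations from the mean
y = Y - Y.mean()

beta_hat = (x * y).sum() / (x ** 2).sum()     # = 21/28 = 0.75
alpha_hat = Y.mean() - beta_hat * X.mean()    # = 9.6 - 0.75*8 = 3.6

Y_fit = alpha_hat + beta_hat * X              # fitted values
e = Y - Y_fit                                 # residuals

print(beta_hat, alpha_hat)                    # 0.75  3.6
print((e ** 2).sum())                         # RSS = 14.65
print(((Y_fit - Y.mean()) ** 2).sum())        # ESS = 15.75
```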

2.4 Properties of OLS Estimators and the Gauss-Markov Theorem

• Given the assumptions of the classical linear regression model, the least-squares estimators possess some ideal or optimum properties.
• These statistical properties are extremely important because they provide criteria for choosing among alternative estimators.
• These properties are contained in the well-known Gauss-Markov Theorem.

• Gauss-Markov Theorem: under the above assumptions of the linear regression model, the estimators α̂ and β̂ have the smallest variance of all linear and unbiased estimators of α and β. That is, the OLS estimators are the Best Linear Unbiased Estimators (BLUE) of α and β.
• The Gauss-Markov Theorem does not depend on the assumption of normality (of the error terms).
• Let us prove that β̂ is the BLUE of β!

Linearity of β̂ (in the stochastic variable Yi, or εi):
  β̂ = Σxiyi / Σxi² = Σxi(Yi − Ȳ) / Σxi²
    = ΣxiYi / Σxi² − ȲΣxi / Σxi²
    = ΣxiYi / Σxi²   (since Σxi = 0)
  ⇒ β̂ = ΣkiYi,  where ki = xi / Σxi²
  ⇒ β̂ = k1Y1 + k2Y2 + ... + knYn,
i.e., β̂ is a linear combination of the Yi.

Note that:
(1) Σxi² is a constant;
(2) because xi is non-stochastic, ki is also non-stochastic;
(3) Σki = Σ(xi / Σxi²) = Σxi / Σxi² = 0;
(4) Σkixi = Σ(xi / Σxi²)·xi = Σxi² / Σxi² = 1;
(5) Σki² = Σ(xi / Σxi²)² = Σxi² / (Σxi²)² = 1 / Σxi²;
(6) ΣkiXi = Σ(xi / Σxi²)·Xi = Σ(xi / Σxi²)·(xi + X̄) = Σxi²/Σxi² + X̄·Σxi/Σxi² = 1.
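A short numerical check of properties (3)-(6), and of β̂ = ΣkiYi, on the example data (illustrative Python sketch):

```python
import numpy as np

Y = np.array([11, 10, 12, 6, 10, 7, 9, 10, 11, 10], dtype=float)
X = np.array([10, 7, 10, 5, 8, 8, 6, 7, 9, 10], dtype=float)

x = X - X.mean()
k = x / (x ** 2).sum()                                   # the OLS weights k_i = x_i / sum(x_i^2)

print(np.isclose(k.sum(), 0))                            # (3) sum k_i = 0
print(np.isclose((k * x).sum(), 1))                      # (4) sum k_i x_i = 1
print(np.isclose((k ** 2).sum(), 1 / (x ** 2).sum()))    # (5) sum k_i^2 = 1/sum x_i^2
print(np.isclose((k * X).sum(), 1))                      # (6) sum k_i X_i = 1
print((k * Y).sum())                                     # beta_hat = sum k_i Y_i = 0.75
```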

Unbiasedness of β̂:
  β̂ = ΣkiYi = Σki(α + βXi + εi) = αΣki + βΣkiXi + Σkiεi
  ⇒ β̂ = β + Σkiεi   [because Σki = 0 and ΣkiXi = 1]
  E(β̂) = E(β) + E(k1ε1 + k2ε2 + ... + knεn)
       = β + (Σki)·E(εi) = β + (Σki)·(0)
  ⇒ E(β̂) = β

Efficiency of β̂:
Suppose β̃ is another unbiased linear estimator of β. Then var(β̂) ≤ var(β̃).
Proof:
  var(β̂) = var(ΣkiYi) = var(k1Y1 + k2Y2 + ... + knYn)
         = var(k1Y1) + var(k2Y2) + ... + var(knYn)
           {since the covariance between Yi and Ys (for all i ≠ s) is 0}
         = k1²var(Y1) + k2²var(Y2) + ... + kn²var(Yn)
         = k1²σ² + k2²σ² + ... + kn²σ²

  var(β̂) = σ²Σki²,  or  var(β̂) = σ² / Σxi².
Now suppose β̃ = ΣwiYi, where the wi are (non-stochastic) coefficients. Then
  β̃ = Σwi(α + βXi + εi) = αΣwi + βΣwiXi + Σwiεi
  E(β̃) = (Σwi)·α + (ΣwiXi)·β + (Σwi)·E(εi) = (Σwi)·α + (ΣwiXi)·β.
For β̃ to be an unbiased estimator of β, we need Σwi = 0 and ΣwiXi = 1.

  var(β̃) = var(ΣwiYi) = var(w1Y1 + w2Y2 + ... + wnYn)
         = var(w1Y1) + var(w2Y2) + ... + var(wnYn)
           {since the covariance between Yi and Ys (for all i ≠ s) is 0}
         = w1²var(Y1) + w2²var(Y2) + ... + wn²var(Yn)
         = w1²σ² + w2²σ² + ... + wn²σ²
  ⇒ var(β̃) = σ²Σwi²

• Let us now compare var(β̃) and var(β̂).
• Suppose wi ≠ ki, and let the relationship between them be given by di = wi − ki.
• Because both Σwi and Σki equal zero:  Σdi = Σwi − Σki = 0.
• Because both Σwixi and Σkixi equal one:  Σdixi = Σwixi − Σkixi = 1 − 1 = 0.
• (wi)² = (ki + di)²  ⇒  wi² = ki² + di² + 2kidi
  ⇒ Σwi² = Σki² + Σdi² + 2Σdi·(xi / Σxi²)

  ⇒ Σwi² = Σki² + Σdi² + 2·(1/Σxi²)·(Σdixi)
  ⇒ Σwi² = Σki² + Σdi² + 2·(1/Σxi²)·(0)
  ⇒ Σwi² = Σki² + Σdi²
  ⇒ Σwi² > Σki²   (given wi ≠ ki, not all the di are zero, and thus Σdi² > 0)
  ⇒ σ²Σwi² > σ²Σki²
  ⇒ var(β̃) > var(β̂),
with var(β̃) = var(β̂) if and only if all the di are zero (and thus Σdi² = 0).

Linearity of α̂:  α̂ = Ȳ − β̂X̄
  ⇒ α̂ = Ȳ − X̄·(ΣkiYi) = Ȳ − X̄·(k1Y1 + k2Y2 + ... + knYn)
  ⇒ α̂ = (1/n)(Y1 + Y2 + ... + Yn) − (X̄k1Y1 + X̄k2Y2 + ... + X̄knYn)
  ⇒ α̂ = (1/n − X̄k1)Y1 + (1/n − X̄k2)Y2 + ... + (1/n − X̄kn)Yn
  ⇒ α̂ = f1Y1 + f2Y2 + ... + fnYn,  where fi = 1/n − X̄ki,
i.e., α̂ is also a linear combination of the Yi.

Unbiasedness of α̂:  α̂ = Ȳ − β̂X̄, and Ȳ = α + βX̄ + ε̄
  ⇒ α̂ = (α + βX̄ + ε̄) − X̄·{Σki(α + βXi + εi)}
  ⇒ α̂ = (α + βX̄ + ε̄) − X̄·{αΣki + βΣkiXi + Σkiεi}
  ⇒ α̂ = (α + βX̄ + ε̄) − X̄·{β + Σkiεi}
  ⇒ α̂ = α + ε̄ − X̄Σkiεi
  ⇒ E(α̂) = E(α) + E(ε̄) − X̄(Σki)·E(εi)
  ⇒ E(α̂) = α

Efficiency of α̂:
Suppose α̃ is another unbiased linear estimator of α. Then var(α̂) ≤ var(α̃).
Proof:
  var(α̂) = var(ΣfiYi) = var(f1Y1 + f2Y2 + ... + fnYn)
         = var(f1Y1) + var(f2Y2) + ... + var(fnYn)   {since cov(Yi, Ys) = 0 for all i ≠ s}
         = f1²var(Y1) + f2²var(Y2) + ... + fn²var(Yn)
         = f1²σ² + f2²σ² + ... + fn²σ² = σ²Σfi²

  var(α̂) = σ²Σfi² = σ²Σ(1/n − X̄ki)²
         = σ²Σ(1/n² + X̄²ki² − 2X̄ki/n)
         = σ²{1/n + X̄²Σki² − (2X̄/n)Σki}
         = σ²{1/n + X̄²Σki²}            (since Σki = 0)
         = σ²(1/n + X̄²/Σxi²),
which can also be written as var(α̂) = σ²ΣXi² / (nΣxi²).
Note also that Σfi = Σ(1/n − X̄ki) = 1 − X̄Σki = 1, and Σfi² = 1/n + X̄²/Σxi².

Now suppose α̃ = ΣziYi, where the zi are coefficients. Then
  α̃ = Σzi(α + βXi + εi) = αΣzi + βΣziXi + Σziεi
  E(α̃) = (Σzi)·α + (ΣziXi)·β + (Σzi)·E(εi) = (Σzi)·α + (ΣziXi)·β.
For α̃ to be an unbiased estimator of α, we need Σzi = 1 and ΣziXi = 0.
Then:
  var(α̃) = var(ΣziYi) = var(z1Y1 + z2Y2 + ... + znYn)

  var(α̃) = var(z1Y1) + var(z2Y2) + ... + var(znYn)   {since cov(Yi, Ys) = 0 for all i ≠ s}
         = z1²var(Y1) + z2²var(Y2) + ... + zn²var(Yn)
         = z1²σ² + z2²σ² + ... + zn²σ²
  ⇒ var(α̃) = σ²Σzi²
• Let us now compare var(α̂) and var(α̃).
• Suppose zi ≠ fi, and let the relationship between them be given by di = zi − fi.

• Because ΣziXi = 0 and Σzi = 1:
  Σzixi = Σzi(Xi − X̄) = ΣziXi − X̄Σzi = 0 − X̄(1) = −X̄.
• From di = zi − fi, with fi = 1/n − X̄·(xi/Σxi²):
  Σdi² = Σzi² + Σfi² − 2Σzifi
       = Σzi² + Σfi² − 2Σ[zi·(1/n − X̄xi/Σxi²)]
       = Σzi² + Σfi² − 2{(1/n)Σzi − (X̄/Σxi²)(Σzixi)}
       = Σzi² + Σfi² − 2{1/n − (X̄/Σxi²)(−X̄)}

  ⇒ Σdi² = Σzi² + Σfi² − 2{1/n + X̄²/Σxi²}
  ⇒ Σdi² = Σzi² + Σfi² − 2Σfi²          (since Σfi² = 1/n + X̄²/Σxi²)
  ⇒ Σdi² = Σzi² − Σfi²
  ⇒ Σzi² = Σdi² + Σfi²
  ⇒ Σzi² > Σfi²
  ⇒ σ²Σzi² > σ²Σfi²
  ⇒ var(α̃) > var(α̂),
with var(α̃) = var(α̂) if and only if all the di (and hence Σdi²) are zero.

2.5 Residuals and Goodness of Fit

Decomposing the variation in Y:
(Figure: decomposition of the variation in Y.)

• One measure of the variation in Y is the sum of its squared deviations around its sample mean, often described as the Total Sum of Squares, TSS.
• TSS, the total sum of squares of Y, can be decomposed into ESS, the 'explained' sum of squares, and RSS, the residual ('unexplained') sum of squares:
  TSS = ESS + RSS
  Σ(Yi − Ȳ)² = Σ(Ŷi − Ȳ)² + Σei²

Derivation:
  Yi = Ŷi + ei  ⇒  Yi − Ȳ = (Ŷi − Ȳ) + ei
  (Yi − Ȳ)² = (Ŷi − Ȳ + ei)²
  Σ(Yi − Ȳ)² = Σ(Ŷi − Ȳ + ei)²
  Σyi² = Σ(ŷi + ei)² = Σŷi² + Σei² + 2Σŷiei
The last term equals zero:
  Σŷiei = Σ(Ŷi − Ȳ)ei = ΣŶiei − ȲΣei = Σ(α̂ + β̂Xi)ei − ȲΣei

  ⇒ Σŷiei = α̂Σei + β̂ΣXiei − ȲΣei = 0   (since Σei = 0 and ΣXiei = 0)
Hence:
  Σyi² = Σŷi² + Σei²
  TSS = ESS + RSS
  30.4 = 15.75 + 14.65   (in our numerical example)
Coefficient of Determination (R²): the proportion of the variation in the dependent variable that is explained by the model.

1. R² = ESS/TSS = Σŷi² / Σyi²
2. R² = ESS/TSS = Σ(β̂xi)² / Σyi² = β̂²Σxi² / Σyi²
• The OLS regression coefficients are chosen in such a way as to minimize the sum of the squares of the residuals. Thus it automatically follows that they maximize R².
• Since TSS = ESS + RSS,  1 = ESS/TSS + RSS/TSS,  so ESS/TSS = 1 − RSS/TSS, and hence
3. R² = 1 − Σei² / Σyi²

Coefficient of Determination (R²), continued:
4. R² = ESS/TSS = β̂·Σxy / Σyi²   [since ESS = β̂²Σxi² = β̂·Σxy];
   in our example, R² = 15.75/30.4 = 0.5181.
5. R² = (Σxy)² / (Σx²·Σy²)
6. R² = [cov(X, Y)]² / [var(X)·var(Y)]
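A brief check that these expressions for R² agree on the example data (illustrative Python sketch; the residuals are computed in deviation form):

```python
import numpy as np

Y = np.array([11, 10, 12, 6, 10, 7, 9, 10, 11, 10], dtype=float)
X = np.array([10, 7, 10, 5, 8, 8, 6, 7, 9, 10], dtype=float)
x, y = X - X.mean(), Y - Y.mean()

beta_hat = (x * y).sum() / (x ** 2).sum()
e = y - beta_hat * x                                             # residuals in deviation form

R2_ess = (beta_hat ** 2) * (x ** 2).sum() / (y ** 2).sum()       # ESS/TSS
R2_rss = 1 - (e ** 2).sum() / (y ** 2).sum()                     # 1 - RSS/TSS
R2_cor = (x * y).sum() ** 2 / ((x ** 2).sum() * (y ** 2).sum())  # squared correlation

print(R2_ess, R2_rss, R2_cor)                                    # all approximately 0.5181
```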

• A natural criterion of goodness of fit is the correlation between the actual and fitted values of Y. The least-squares principle also maximizes this.
• In fact, R² = (r_ŷ,y)² = (r_x,y)²,
  where r_ŷ,y and r_x,y are the coefficients of correlation between Ŷ & Y and between X & Y, defined as
  r_ŷ,y = cov(Ŷ, Y)/(σ_Ŷ·σ_Y)  and  r_x,y = cov(X, Y)/(σ_X·σ_Y), respectively.
• Note: RSS = (1 − R²)Σy².

To sum up:
• Use Ŷi = α̂ + β̂Xi to estimate E[Y | Xi] = α + βXi.
• OLS: choose α̂ and β̂ to minimize Σei² = Σ(Yi − Ŷi)² = Σ(Yi − α̂ − β̂Xi)².
• β̂ = Σxy / Σx²  and  α̂ = Ȳ − β̂X̄.
• Given the assumptions of the linear regression model, the estimators α̂ and β̂ have the smallest variance of all linear and unbiased estimators of α and β.
• var(β̂) = σ² / Σxi²  and  var(α̂) = σ²(1/n + X̄²/Σxi²) = σ²ΣXi² / (nΣxi²).

To sum up (continued), with the numerical example:
• Σyi² = Σŷi² + Σei², i.e., TSS = ESS + RSS.
• R² = ESS/TSS = Σŷi² / Σyi², with Σŷi² = β̂Σxy = β̂²Σx², and RSS = (1 − R²)Σy².
• var(β̂) = σ²/Σxi² = σ²/28 ≈ 0.0357σ².
• var(α̂) = σ²(1/n + X̄²/Σxi²) = σ²(1/10 + 64/28) ≈ 2.3857σ².
• But what is σ²?

An unbiased estimator for σ²
• It can be shown that E(RSS) = E(Σei²) = (n − 2)σ².
• Thus, if we define σ̂² = Σei² / (n − 2), then
  E(σ̂²) = [1/(n − 2)]·E(Σei²) = [1/(n − 2)]·(n − 2)σ² = σ²,
  so σ̂² = Σei² / (n − 2) is an unbiased estimator of σ².

2.6 Confidence Intervals and Hypothesis Testing in Regression Analysis

Why is the Error Normality Assumption Important?
• The normality assumption permits us to derive the functional form of the sampling distributions of α̂, β̂ and σ̂².
• Knowing the form of the sampling distributions enables us to derive feasible test statistics for the OLS coefficient estimators.
• These feasible test statistics enable us to conduct statistical inference, i.e.,
  1) to construct confidence intervals for α, β and σ²;
  2) to test hypotheses about the values of α, β and σ².

Sampling distributions:
• εi ~ N(0, σ²)  ⇒  Yi ~ N(α + βXi, σ²).
• β̂ ~ N(β, σ²/Σxi²)  and  α̂ ~ N(α, σ²ΣXi²/(nΣxi²)).
• Standardizing: (β̂ − β)/(σ/√Σxi²) ~ N(0, 1).
• Replacing σ by its estimate σ̂, where σ̂² = Σei²/(n − 2), gives t-distributed statistics:
  (β̂ − β)/sê(β̂) ~ t(n−2),  with sê(β̂) = σ̂/√Σxi²;
  (α̂ − α)/sê(α̂) ~ t(n−2),  with sê(α̂) = σ̂·√[ΣXi²/(nΣxi²)].

Confidence intervals for α and β:
  P{−t(α/2, n−2) ≤ (α̂ − α)/sê(α̂) ≤ t(α/2, n−2)} = 1 − α
  ⇒ the 100(1 − α)% two-sided CI for α is:  α̂ ± t(α/2, n−2)·sê(α̂).
Similarly, the 100(1 − α)% two-sided CI for β is:  β̂ ± t(α/2, n−2)·sê(β̂).

Confidence interval for σ²:
• (n − 2)σ̂²/σ² ~ χ²(n−2).
• P{χ²(1−α/2, df) ≤ χ² ≤ χ²(α/2, df)} = 1 − α
  ⇒ P{χ²(1−α/2, n−2) ≤ (n − 2)σ̂²/σ² ≤ χ²(α/2, n−2)} = 1 − α
  ⇒ P{1/χ²(1−α/2, n−2) ≥ σ²/[(n − 2)σ̂²] ≥ 1/χ²(α/2, n−2)} = 1 − α
  ⇒ P{1/χ²(α/2, n−2) ≤ σ²/[(n − 2)σ̂²] ≤ 1/χ²(1−α/2, n−2)} = 1 − α

CI for σ² (continued):
  ⇒ P{(n − 2)σ̂²/χ²(α/2, n−2) ≤ σ² ≤ (n − 2)σ̂²/χ²(1−α/2, n−2)} = 1 − α
  ⇒ the 100(1 − α)% two-sided CI for σ² is
    [(n − 2)σ̂²/χ²(α/2, n−2),  (n − 2)σ̂²/χ²(1−α/2, n−2)],
    or equivalently [RSS/χ²(α/2, n−2),  RSS/χ²(1−α/2, n−2)].

Let us continue with our earlier example. We have n = 10, α̂ = 3.6, β̂ = 0.75, R² = 0.5181,
var(α̂) ≈ 2.3857σ², var(β̂) ≈ 0.0357σ², and Σei² = 14.65.
• σ² is estimated by σ̂² = Σei²/(n − 2) = 14.65/8 = 1.83125  ⇒  σ̂ = √1.83125 ≈ 1.3532.
• Thus vâr(α̂) ≈ 2.3857(1.83125) ≈ 4.3688  ⇒  sê(α̂) ≈ √4.3688 ≈ 2.09.
• vâr(β̂) ≈ 0.0357(1.83125) ≈ 0.0654  ⇒  sê(β̂) ≈ √0.0654 ≈ 0.256.
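These estimates can be reproduced as follows (illustrative Python sketch on the example data):

```python
import numpy as np

Y = np.array([11, 10, 12, 6, 10, 7, 9, 10, 11, 10], dtype=float)
X = np.array([10, 7, 10, 5, 8, 8, 6, 7, 9, 10], dtype=float)
n = len(Y)
x = X - X.mean()

beta_hat = (x * (Y - Y.mean())).sum() / (x ** 2).sum()
alpha_hat = Y.mean() - beta_hat * X.mean()
e = Y - (alpha_hat + beta_hat * X)

sigma2_hat = (e ** 2).sum() / (n - 2)                                   # 14.65/8 = 1.83125
se_beta = np.sqrt(sigma2_hat / (x ** 2).sum())                          # about 0.256
se_alpha = np.sqrt(sigma2_hat * (X ** 2).sum() / (n * (x ** 2).sum()))  # about 2.09

print(sigma2_hat, se_alpha, se_beta)
```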

• 95% CIs for α and β:  1 − α = 0.95 ⇒ α = 0.05 ⇒ α/2 = 0.025, and t(0.025, 8) = 2.306.
• 95% CI for α:  3.6 ± (2.306)(2.09) = 3.6 ± 4.8195  ⇒  [−1.2195, 8.4195].
• 95% CI for β:  0.75 ± (2.306)(0.256) = 0.75 ± 0.5903  ⇒  [0.1597, 1.3403].
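The same intervals via scipy's t quantiles (illustrative sketch; small differences come only from rounding the standard errors):

```python
from scipy import stats

alpha_hat, se_alpha = 3.6, 2.09
beta_hat, se_beta = 0.75, 0.256
df = 8                                      # n - 2 = 10 - 2

t_crit = stats.t.ppf(0.975, df)             # about 2.306
ci_alpha = (alpha_hat - t_crit * se_alpha, alpha_hat + t_crit * se_alpha)
ci_beta = (beta_hat - t_crit * se_beta, beta_hat + t_crit * se_beta)

print(ci_alpha)                             # about (-1.22, 8.42)
print(ci_beta)                              # about (0.16, 1.34)
```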

• 95% CI for σ²:  [(n − 2)σ̂²/χ²(α/2, n−2),  (n − 2)σ̂²/χ²(1−α/2, n−2)], with σ̂² = 1.83125.
  χ²(0.025, 8) = 17.5  and  χ²(0.975, 8) = 2.18.
  ⇒ 95% CI for σ²:  [14.65/17.5, 14.65/2.18] = [0.84, 6.72].
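The same interval via scipy's chi-square quantiles (illustrative sketch; the table values 17.5 and 2.18 are rounded):

```python
from scipy import stats

rss, df = 14.65, 8
chi2_hi = stats.chi2.ppf(0.975, df)      # about 17.53
chi2_lo = stats.chi2.ppf(0.025, df)      # about 2.18

print(rss / chi2_hi, rss / chi2_lo)      # about 0.84 and 6.72
```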

• The confidence intervals we have constructed for α, β and σ² are two-sided intervals.
• Sometimes we want either the upper or the lower limit only, in which case we construct one-sided intervals.
• For instance, let us construct a one-sided (upper-limit) 95% confidence interval for β.
• From the t-table, t(0.05, 8) = 1.86. Hence
  β̂ + t(0.05, 8)·sê(β̂) = 0.75 + 1.86(0.256) = 0.75 + 0.48 = 1.23,
  and the confidence interval is (−∞, 1.23].

• Similarly, for the lower limit:
  β̂ − t(0.05, 8)·sê(β̂) = 0.75 − 1.86(0.256) = 0.75 − 0.48 = 0.27.
• Hence, the one-sided 95% CI is [0.27, ∞).

Hypothesis Testing:
• Use our example to test the following hypotheses.
• Result:  Ŷi = 3.6 + 0.75Xi, with standard errors (2.09) and (0.256) in parentheses.
1. Test the claim that sales do not depend on advertising expense (at the 5% level of significance).

• H0: β = 0 against Ha: β ≠ 0.
• Test statistic:  tc = (β̂ − β)/sê(β̂)  ⇒  tc = (0.75 − 0)/0.256 = 2.93.
• Critical value (tt = t-tabulated): α = 0.05 ⇒ α/2 = 0.025, so tt = t(0.025, 8) = 2.306.
• Since |tc| > tt, we reject the null (the alternative is supported). That is, the slope coefficient is statistically significantly different from zero: advertising has a significant influence on sales.
2. Test whether the intercept is greater than 3.5.

• H0: α = 3.5 against Ha: α > 3.5.
• Test statistic:  tc = (α̂ − α)/sê(α̂)  ⇒  tc = (3.6 − 3.5)/2.09 = 0.1/2.09 = 0.05.
• Critical value (tt = t-tabulated): at the 5% level of significance (α = 0.05), tt = t(0.05, 8) = 1.86.
• Since tc < tt, we do not reject the null (the null is supported). That is, the intercept is not statistically significantly greater than 3.5.

3. Can you reject the claim that a unit increase in advertising expense raises sales by one unit? If so, at what level of significance?
• H0: β = 1 against Ha: β ≠ 1.
• Test statistic:  tc = (β̂ − β)/sê(β̂) = (0.75 − 1)/0.256 = −0.25/0.256 = −0.98.
• At α = 0.05, t(0.025, 8) = 2.306, and thus H0 can't be rejected.
• Similarly, at α = 0.10, t(0.05, 8) = 1.86: H0 can't be rejected.
• At α = 0.20, t(0.10, 8) = 1.397, and thus H0 can't be rejected.
• At α = 0.50, t(0.25, 8) = 0.706: H0 is rejected.

• For what level of significance (probability) is the tabulated t value for 8 df as extreme as tc = 0.98?
  i.e., find P{|t| > 0.98} = P{t > 0.98 or t < −0.98}.
• From the table, P{t > 0.706} = 0.25 and P{t > 1.397} = 0.10.
• 0.98 lies between the two numbers (0.706 and 1.397), so P{t > 0.98} is somewhere between 0.25 and 0.10.
• Interpolating: 1.397 − 0.706 = 0.691, and 0.98 is 0.98 − 0.706 = 0.274 units above 0.706. Thus the P-value for 0.98, P{t > 0.98}, is about (0.274/0.691)·(0.25 − 0.10) units below 0.25.

• That is, the P-value for 0.98 is about 0.06 units below 0.25: P{t > 0.98} ≈ 0.25 − 0.06 ≈ 0.19.
• Hence P{|t| > 0.98} = 2·P{t > 0.98} ≈ 0.38.
• For our H0 to be rejected, the minimum level of significance (the probability of Type I error) would have to be as high as 38%. To conclude, H0 is retained!
• The p-value associated with the calculated sample value of the test statistic is defined as the lowest significance level at which H0 can be rejected.
• Small p-values constitute strong evidence against H0.
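An illustrative sketch of tests 1 and 3 with exact p-values from scipy (the exact two-sided p-value for tc = −0.98 is about 0.36, close to the 0.38 obtained above by linear interpolation in the t-table):

```python
from scipy import stats

beta_hat, se_beta, df = 0.75, 0.256, 8

# Test 1: H0: beta = 0 vs Ha: beta != 0
t1 = (beta_hat - 0) / se_beta                 # about 2.93
p1 = 2 * stats.t.sf(abs(t1), df)              # two-sided p-value, about 0.019 < 0.05 -> reject H0

# Test 3: H0: beta = 1 vs Ha: beta != 1
t3 = (beta_hat - 1) / se_beta                 # about -0.98
p3 = 2 * stats.t.sf(abs(t3), df)              # about 0.36 -> retain H0

print(t1, p1)
print(t3, p3)
```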

• There is a correspondence between the confidence intervals derived earlier and tests of hypotheses.
• For instance, the 95% CI we derived earlier for β is (0.16 < β < 1.34).
• Any hypothesis that says β = c, where c is in this interval, will not be rejected at the 5% level in a two-sided test.
• For instance, the hypothesis β = 1 was not rejected, but the hypothesis β = 0 was.
• For one-sided tests we consider one-sided confidence intervals.

2.7 Prediction with the Simple Linear Regression

• The estimated regression equation Ŷi = α̂ + β̂Xi is used for predicting the value (or the average value) of Y for given values of X.
• Let X0 be the given value of X. Then we predict the corresponding value YP of Y by: ŶP = α̂ + β̂X0.
• The true value YP is given by: YP = α + βX0 + εP.
• Hence the prediction error is:
  ŶP − YP = (α̂ − α) + (β̂ − β)X0 − εP,
  and E(ŶP − YP) = E(α̂ − α) + E(β̂ − β)X0 − E(εP) = 0.
• ŶP = α̂ + β̂X0 is therefore an unbiased predictor of YP (indeed, the Best Linear Unbiased Predictor, BLUP!).

• The variance of the prediction error is:
  var(ŶP − YP) = var(α̂ − α) + X0²·var(β̂ − β) + 2X0·cov(α̂ − α, β̂ − β) + var(εP)
              = σ²ΣXi²/(nΣxi²) + σ²X0²/Σxi² − 2X0σ²X̄/Σxi² + σ²
              = σ²[1 + 1/n + (X0 − X̄)²/Σxi²]
• Thus, the variance increases the farther away the value of X0 is from X̄, the mean of the observations on the basis of which α̂ and β̂ have been computed.

• That is, prediction is more precise for values nearer to the mean (as compared to extreme values).
• Within-sample prediction (interpolation): X0 lies within the range of the sample observations on X.
• Out-of-sample prediction (extrapolation): X0 lies outside the range of the sample observations. Not recommended!
• Sometimes we would be interested in predicting the mean of Y, given X0. We use ŶP = α̂ + β̂X0 to predict E[Y | X0] = α + βX0 (the same predictor as before!).
• The prediction error is now: ŶP − E[Y | X0] = (α̂ − α) + (β̂ − β)X0.

• The variance of this prediction error is:
  var(ŶP − E[Y | X0]) = var(α̂ − α) + X0²·var(β̂ − β) + 2X0·cov(α̂ − α, β̂ − β)
  ⇒ var(ŶP − E[Y | X0]) = σ²[1/n + (X0 − X̄)²/Σxi²]
• Again, the variance increases the farther away the value of X0 is from X̄.
• The variance (and the standard error) of the prediction error is smaller in this case (predicting the average value of Y given X) than when predicting an individual value of Y given X.

Example: predict (a) the value of sales, and (b) the average value of sales, for a firm with an advertising expense of six hundred Birr.
a. From Ŷi = 3.6 + 0.75Xi, at Xi = 6: Ŷi = 3.6 + 0.75(6) = 8.1.
   Point prediction: [sales value | advertising of 600 Birr] = 8,100 Birr.
   Interval prediction (95% CI, t(0.025, 8) = 2.306):
   sê(ŶP) = √{σ̂²[1 + 1/n + (X0 − X̄)²/Σxi²]} = 1.3532·√(1 + 1/10 + (6 − 8)²/28) = 1.3532(1.115) ≈ 1.508

   Hence the 95% prediction interval is 8.1 ± (2.306)(1.508) = [4.62, 11.58].
b. From Ŷi = 3.6 + 0.75Xi, at Xi = 6: Ŷi = 3.6 + 0.75(6) = 8.1.
   Point prediction: [average sales | advertising of 600 Birr] = 8,100 Birr.
   Interval prediction (95% CI):
   sê(ŶP*) = √{σ̂²[1/n + (X0 − X̄)²/Σxi²]} = 1.3532·√(1/10 + (6 − 8)²/28) = 1.3532(0.493) ≈ 0.667
   Hence the 95% CI is 8.1 ± (2.306)(0.667) = [6.56, 9.64].
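Both prediction intervals can be reproduced as follows (illustrative Python sketch on the example data):

```python
import numpy as np
from scipy import stats

Y = np.array([11, 10, 12, 6, 10, 7, 9, 10, 11, 10], dtype=float)
X = np.array([10, 7, 10, 5, 8, 8, 6, 7, 9, 10], dtype=float)
n = len(Y)
x = X - X.mean()

beta_hat = (x * (Y - Y.mean())).sum() / (x ** 2).sum()
alpha_hat = Y.mean() - beta_hat * X.mean()
e = Y - (alpha_hat + beta_hat * X)
sigma_hat = np.sqrt((e ** 2).sum() / (n - 2))

X0 = 6.0
Y0_hat = alpha_hat + beta_hat * X0                                          # point prediction = 8.1
t_crit = stats.t.ppf(0.975, n - 2)

# (a) predicting an individual value of Y at X0
se_ind = sigma_hat * np.sqrt(1 + 1/n + (X0 - X.mean())**2 / (x**2).sum())   # about 1.508
# (b) predicting the mean of Y at X0
se_mean = sigma_hat * np.sqrt(1/n + (X0 - X.mean())**2 / (x**2).sum())      # about 0.667

print(Y0_hat - t_crit * se_ind, Y0_hat + t_crit * se_ind)                   # about [4.62, 11.58]
print(Y0_hat - t_crit * se_mean, Y0_hat + t_crit * se_mean)                 # about [6.56, 9.64]
```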

Notes on interpreting the coefficient of X in simple linear regression

1. Y = α + βX + ε  ⇒  dY = β·dX  ⇒  β = dY/dX = slope.
   β is the (average) change in Y resulting from a unit change in X.
2. Y = e^(α + βX + ε)  ⇒  ln Y = α + βX + ε
   ⇒ d(ln Y) = β·dX  ⇒  (1/Y)·dY = β·dX  ⇒  β = (dY/Y)/dX = relative ∆ in Y / absolute ∆ in X
   ⇒ β(×100) = (%∆ in Y)/dX, i.e., β(×100) is the (average) percentage change in Y resulting from a unit change in X, and %∆ in Y = β·dX·(×100).

3. e^Y = A·X^β·E  ⇒  Y = α + β·ln X + ε, with α = ln(A) and ε = ln(E).
   ⇒ β = dY/d(ln X) = dY/[(1/X)dX] = absolute ∆ in Y / relative ∆ in X
   ⇒ dY = (β/100)·[(dX/X)·100] = (0.01β)·(%∆ in X),
   i.e., β(×0.01) is the (average) change in Y resulting from a one-percent change in X.
4. Y = A·X^β·e^ε  ⇒  ln Y = α + β·ln X + ε, with α = ln A.
   ⇒ β = d(ln Y)/d(ln X) = (dY/Y)/(dX/X) = %∆ in Y / %∆ in X = elasticity,
   i.e., β is the (average) percentage change in Y resulting from a one-percent change in X.

STATA SESSION

CHAPTER THREE

THE MULTIPLE LINEAR REGRESSION
3.1 Introduction: The Multiple Linear Regression
3.2 Assumptions of the Multiple Linear Regression
3.3 Estimation: The Method of OLS
3.4 Properties of OLS Estimators
3.5 Partial Correlations and Coefficients of Multiple Determination
3.6 Statistical Inferences in Multiple Linear Regression
3.7 Prediction with Multiple Linear Regression
3.1 Introduction: The Multiple Linear Regression

• A multiple linear regression expresses the relationship between a dependent variable and two or more independent variables as a linear function.
• Population regression (Y-intercept β0, population slopes β1, ..., βK, and random error εi):
  Yi = β0 + β1X1i + β2X2i + ··· + βKXKi + εi
• Sample regression (with residual ei), where Y is the dependent (response) variable and the Xj are the independent (explanatory) variables:
  Yi = β̂0 + β̂1X1i + β̂2X2i + ··· + β̂KXKi + ei
What changes as we move from simple to multiple regression?
1. Potentially more explanatory power with more variables.
2. The ability to control for other variables (and the interaction of the various explanatory variables: correlations and multicollinearity).
3. It is harder to visualize drawing a line through three or more (n)-dimensional space.
4. The R² is no longer simply the square of the correlation coefficient between Y and X.

• Slope (βj): ceteris paribus, Y changes by βj for every 1-unit change in Xj, on average.
• Y-intercept (β0): the average value of Y when all the Xj are zero (may not be meaningful all the time).
• A multiple linear regression model is defined to be linear in the regression parameters rather than in the explanatory variables.
• Thus, the definition of multiple linear regression includes polynomial regression, e.g.
  Yi = β0 + β1X1i + β2X2i + β3X1i² + β4X1iX2i + εi
3.2 Assumptions of the Multiple Linear Regression

• Assumptions 1-7 carry over from Chapter Two:
1. E(εi | Xji) = 0 (for all i = 1, 2, ..., n; j = 1, ..., K).
2. var(εi | Xji) = σ² (homoscedastic errors).
3. cov(εi, εs | Xji, Xjs) = 0 for i ≠ s (no autocorrelation).
4. cov(εi, Xji) = 0: errors are orthogonal to the Xs.
5. Xj is non-stochastic, and must assume different values.
6. n > K + 1 (number of observations > number of parameters to be estimated). The number of parameters is K + 1 in this case (β0, β1, ..., βK).
7. εi ~ N(0, σ²): normally distributed errors.
• Additional assumption:
8. No perfect multicollinearity: no exact linear relation exists between any subset of the explanatory variables.
• In the presence of a perfect (deterministic) linear relationship between/among any set of the Xj, the impact of a single variable (βj) cannot be identified.
• More on multicollinearity in a later chapter!

3.3 Estimation: The Method of OLS

The Case of Two Regressors (X1 and X2)
• Fitted model: Ŷi = β̂0 + β̂1X1i + β̂2X2i, or in deviation form ŷi = β̂1x1i + β̂2x2i.
• Residual: ei = Yi − Ŷi = (Yi − Ȳ) − (Ŷi − Ȳ) = yi − ŷi.
• RSS = Σei² = Σ(yi − β̂1x1i − β̂2x2i)².
• Minimize the RSS with respect to β̂1 and β̂2:
  ∂(RSS)/∂β̂j = −2Σ(yi − β̂1x1i − β̂2x2i)xji = 0,  j = 1, 2  ⇒  Σ ei xji = 0.
• This gives the two normal equations:
  1. Σ(yi − β̂1x1i − β̂2x2i)x1i = 0  ⇒  Σyix1i = β̂1Σx1i² + β̂2Σx1ix2i
  2. Σ(yi − β̂1x1i − β̂2x2i)x2i = 0  ⇒  Σyix2i = β̂1Σx1ix2i + β̂2Σx2i²

• In matrix form the normal equations are F = A·β̂:
  [ Σyix1i ]   [ Σx1i²    Σx1ix2i ] [ β̂1 ]
  [ Σyix2i ] = [ Σx1ix2i  Σx2i²   ] [ β̂2 ]
• Solve for the coefficients by Cramer's rule. The determinant is
  |A| = Σx1i²·Σx2i² − (Σx1ix2i)².
• To find β̂1, replace the first column of A by the elements of F, compute |A1|, and take |A1|/|A|:
  β̂1 = [(Σyix1i)(Σx2i²) − (Σx1ix2i)(Σyix2i)] / [(Σx1i²)(Σx2i²) − (Σx1ix2i)²]
• Similarly, to find β̂2, replace the second column of A by the elements of F, compute |A2|, and take |A2|/|A|:
  β̂2 = [(Σyix2i)(Σx1i²) − (Σx1ix2i)(Σyix1i)] / [(Σx1i²)(Σx2i²) − (Σx1ix2i)²]
• Finally, β̂0 = Ȳ − β̂1X̄1 − β̂2X̄2.
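An illustrative Python sketch of these formulas on made-up data (the Y, X1, X2 values below are hypothetical, not from the slides), cross-checked against the general matrix formula derived further below:

```python
import numpy as np

# Hypothetical data for illustration only
Y = np.array([10., 12., 9., 15., 13., 11., 14., 16.])
X1 = np.array([2., 3., 1., 5., 4., 2., 4., 6.])
X2 = np.array([8., 7., 9., 4., 5., 7., 5., 3.])

y, x1, x2 = Y - Y.mean(), X1 - X1.mean(), X2 - X2.mean()

# Cramer's-rule solution of the two normal equations
det_A = (x1**2).sum() * (x2**2).sum() - ((x1 * x2).sum())**2
b1 = ((y*x1).sum() * (x2**2).sum() - (x1*x2).sum() * (y*x2).sum()) / det_A
b2 = ((y*x2).sum() * (x1**2).sum() - (x1*x2).sum() * (y*x1).sum()) / det_A
b0 = Y.mean() - b1 * X1.mean() - b2 * X2.mean()

# Cross-check against the matrix formula beta_hat = (X'X)^(-1) X'Y
X = np.column_stack([np.ones_like(Y), X1, X2])
print(np.array([b0, b1, b2]))
print(np.linalg.solve(X.T @ X, X.T @ Y))   # should agree
```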

The Case of K Explanatory Variables
• The number of parameters to be estimated is K + 1 (β0, β1, β2, ..., βK).
• Writing out the fitted equation for each observation:
  Y1 = β̂0 + β̂1X11 + β̂2X21 + ... + β̂KXK1 + e1
  Y2 = β̂0 + β̂1X12 + β̂2X22 + ... + β̂KXK2 + e2
  Y3 = β̂0 + β̂1X13 + β̂2X23 + ... + β̂KXK3 + e3
  ...
  Yn = β̂0 + β̂1X1n + β̂2X2n + ... + β̂KXKn + en
⎡ Y1 ⎤ ⎡1 X 11 X 21 X 31 … X K1 ⎤ ⎡ βˆ 0 ⎤ ⎡e 1 ⎤
⎢Y ⎥ ⎢1 ⎥ ⎢ˆ ⎥ ⎢ ⎥
⎢ 2⎥ ⎢ X12 X 22 X 32 … X K2 ⎥ ⎢ β 1 ⎥ ⎢e 2 ⎥
⎢Y3 ⎥ = ⎢1 X13 X 23 X 33 … X K3 ⎥ • ⎢ βˆ 2 ⎥ + ⎢e 3 ⎥
⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥
⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥
⎢⎣Yn ⎥⎦ ⎢⎣1 X1n X 2n X 3n … X Kn ⎥⎦ ⎢⎣βˆ K ⎥⎦ ⎢⎣e n ⎥⎦
.
n ×1

Y = Xβ + e
ˆ
n × ( K + 1) (K +1) ×1 n ×1
JIMMA UNIVERSITY HASSEN A. CHAPTER 3 - 12
2008/09
3.3
3.3 Estimation:
Estimation: The
The Method
Method of
of OLS
OLS
.
⎡e 1 ⎤ ⎡ Y1 ⎤ ⎡1 X 11 X 21 X 31 … X K1 ⎤ ⎡ βˆ 0 ⎤
⎢e ⎥ ⎢Y ⎥ ⎢1 ⎥ ⎢ˆ ⎥
⎢ 2⎥ ⎢ 2⎥ ⎢ X 12 X 22 X 32 … X K2 ⎥ ⎢ β 1 ⎥
⎢e 3 ⎥ = ⎢Y3 ⎥ − ⎢1 X 13 X 23 X 33 … X K3 ⎥ * ⎢ βˆ 2 ⎥
⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥
⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥
⎢⎣e n ⎥⎦ ⎢⎣Yn ⎥⎦ ⎢⎣1 X 1n X 2n X 3n … X Kn ⎥⎦ ⎢⎣βˆ K ⎥⎦
. e = Y − Xβ̂
3.3
3.3 Estimation: The
Estimation: The Method
Method of
of OLS
JIMMA UNIVERSITY HASSEN A. CHAPTER 3 - 13
2008/09
OLS
• RSS = Σei² = e1² + e2² + ... + en² = e'e.
• RSS = (Y − Xβ̂)'(Y − Xβ̂) = Y'Y − Y'Xβ̂ − β̂'X'Y + β̂'X'Xβ̂.
  Since Y'Xβ̂ is a scalar, Y'Xβ̂ = (Y'Xβ̂)' = β̂'X'Y, so
  RSS = Y'Y − 2β̂'X'Y + β̂'(X'X)β̂.
• F.O.C.: ∂(RSS)/∂β̂ = 0  ⇒  −2X'Y + 2X'Xβ̂ = 0  ⇒  −2X'(Y − Xβ̂) = 0.

• The condition X'e = 0 (X' times the residual vector equals the zero vector) implies:
  1. Σei = 0,   2. ΣeiXji = 0 for j = 1, 2, ..., K.
• X'e = X'(Y − Xβ̂) = 0  ⇒  X'Xβ̂ = X'Y, so
  β̂ = (X'X)⁻¹X'Y

• Written out, with X'X a (K+1)×(K+1) matrix and X'Y a (K+1)×1 vector:
  X'X = [ n     ΣX1     ...   ΣXK
          ΣX1   ΣX1²    ...   ΣX1XK
          ...
          ΣXK   ΣXKX1   ...   ΣXK²  ]
  X'Y = [ ΣY,  ΣYX1,  ...,  ΣYXK ]'
• Hence β̂ = (X'X)⁻¹(X'Y):
  [ β̂0 ]   [ n     ΣX1     ΣX2     ...   ΣXK   ]⁻¹ [ ΣY   ]
  [ β̂1 ]   [ ΣX1   ΣX1²    ΣX1X2   ...   ΣX1XK ]   [ ΣYX1 ]
  [ β̂2 ] = [ ΣX2   ΣX2X1   ΣX2²    ...   ΣX2XK ]   [ ΣYX2 ]
  [ ...]   [ ...                                ]   [ ...  ]
  [ β̂K ]   [ ΣXK   ΣXKX1   ΣXKX2   ...   ΣXK²  ]   [ ΣYXK ]
  where β̂ is (K+1)×1, (X'X) is (K+1)×(K+1), and (X'Y) is (K+1)×1.
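A minimal numpy sketch of β̂ = (X'X)⁻¹X'Y on simulated data (the sample size, regressors and true coefficients below are assumptions made only for this demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 50, 3                                    # hypothetical sample size and number of regressors
X_vars = rng.normal(size=(n, K))
eps = rng.normal(scale=0.5, size=n)
beta_true = np.array([1.0, 2.0, -1.5, 0.7])     # [beta0, beta1, ..., betaK], assumed for the demo
X = np.column_stack([np.ones(n), X_vars])       # first column of ones for the intercept
Y = X @ beta_true + eps

beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ Y)   # beta_hat = (X'X)^(-1) X'Y
e = Y - X @ beta_hat                            # residuals

print(beta_hat)
print(np.allclose(X.T @ e, 0))                  # the normal equations X'e = 0 hold
```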

3.4 Properties of OLS Estimators

• Given the assumptions of the classical linear regression model (in Section 3.2), the OLS estimators of the partial regression coefficients are BLUE: linear, unbiased, and of minimum variance in the class of all linear unbiased estimators (the Gauss-Markov Theorem).
• In cases where the small-sample desirable properties (BLUE) may not be available, we look for asymptotic (or large-sample) properties, like consistency and asymptotic normality (via the CLT).
• The OLS estimators are consistent:
  plim (β̂ − β) = 0 and lim var(β̂) = 0 as n → ∞.

3.5 Partial Correlations and Coefficients of Determination

• In the multiple regression equation with two regressors (X1 and X2), Yi = β̂0 + β̂1X1i + β̂2X2i + ei, we can talk of:
  - the joint effect of X1 and X2 on Y, and
  - the partial effect of X1 or X2 on Y.
• The partial effect of X1 is measured by β̂1 and the partial effect of X2 is measured by β̂2.
• Partial effect: holding the other variable constant, or after eliminating the effect of the other variable.
• Thus, β̂1 is interpreted as measuring the effect of X1 on Y after eliminating the effect of X2 on X1.

• Similarly, β̂2 measures the effect of X2 on Y after eliminating the effect of X1 on X2.
• Thus, we can derive the estimator β̂1 of β1 in two steps (by estimating two separate regressions):
• Step 1: Regress X1 on X2 (an auxiliary regression to eliminate the effect of X2 from X1). Let the regression equation be X1 = a + b12X2 + e12, or, in deviation form, x1 = b12x2 + e12. Then
  b12 = Σx1x2 / Σx2².
• e12 is the part of X1 which is free from the influence of X2.

• Step 2: Regress Y on e12 (the residualized X1). Let the regression equation be y = b_ye·e12 + v in deviation form. Then
  b_ye = Σy·e12 / Σe12².
• b_ye is the same as β̂1 in the multiple regression y = β̂1x1 + β̂2x2 + e, i.e., b_ye = β̂1.
• Proof (you may skip the proof!): substituting e12 = x1 − b12x2,
  b_ye = Σy(x1 − b12x2) / Σ(x1 − b12x2)²
       = (Σyx1 − b12Σyx2) / (Σx1² + b12²Σx2² − 2b12Σx1x2),
  where b12 = Σx1x2 / Σx2².

• Substituting b12 = Σx1x2 / Σx2² and simplifying:
  b_ye = [Σyx1 − (Σx1x2/Σx2²)·Σyx2] / [Σx1² + (Σx1x2)²/Σx2² − 2(Σx1x2)²/Σx2²]
       = [Σx2²·Σyx1 − Σx1x2·Σyx2] / [Σx1²·Σx2² − (Σx1x2)²]
• This is exactly the expression obtained earlier for β̂1, so b_ye = β̂1.

• Alternatively, we can derive the estimator β̂1 of β1 as follows:
• Step 1: regress Y on X2, and save the residuals ey2:
  1. y = by2x2 + ey2   [ey2 = residualized Y]
• Step 2: regress X1 on X2, and save the residuals e12:
  2. x1 = b12x2 + e12   [e12 = residualized X1]
• Step 3: regress ey2 (the part of Y cleared of the influence of X2) on e12 (the part of X1 cleared of the influence of X2):
  3. ey2 = α12·e12 + u12
• Then α12 in regression (3) equals β̂1 in y = β̂1x1 + β̂2x2 + e! (See the sketch below.)
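An illustrative sketch of this three-step (residualizing) procedure on simulated data (all numbers below are hypothetical), confirming that the coefficient from step 3 equals β̂1 from the full regression:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X2 = rng.normal(size=n)
X1 = 0.6 * X2 + rng.normal(size=n)             # X1 correlated with X2
Y = 1.0 + 2.0 * X1 - 1.0 * X2 + rng.normal(size=n)

x1, x2, y = X1 - X1.mean(), X2 - X2.mean(), Y - Y.mean()

# Coefficient on X1 from the full multiple regression (matrix formula)
X = np.column_stack([np.ones(n), X1, X2])
beta_full = np.linalg.solve(X.T @ X, X.T @ Y)[1]

# Step 1: residualize Y on X2;  Step 2: residualize X1 on X2
ey2 = y - ((y * x2).sum() / (x2**2).sum()) * x2
e12 = x1 - ((x1 * x2).sum() / (x2**2).sum()) * x2

# Step 3: regress the residualized Y on the residualized X1
alpha12 = (ey2 * e12).sum() / (e12**2).sum()

print(beta_full, alpha12, np.isclose(beta_full, alpha12))   # equal
```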

• Suppose we have a dependent variable Y and two regressors X1 and X2.
• Suppose also that r²y1 and r²y2 are the squares of the simple correlation coefficients between Y & X1 and Y & X2, respectively.
• Then:
  r²y1 = the proportion of the TSS that X1 alone explains;
  r²y2 = the proportion of the TSS that X2 alone explains.
• On the other hand, R²y·12 is the proportion of the variation in Y that X1 and X2 jointly explain.
• We would also like to measure something else.

For instance:
a) How much does X2 explain after X1 is already
included in the regression equation? Or,
b) How much does X1 explain after X2 is included?
) These are measured by the coefficients of partial
determination: ry 2•1 and r y21• 2 , respectively.
2

) Partial correlation coefficients of the first order:

. ry1•2 & ry 2•1.


) Order = number of X's already in the model.

ry1•2 =
ry1 − ry2r12
(1− r )(1− r )
2
y2
2
12
ry2•1 =
ry2 − ry1r12
(1− r )(1− r )
2
y1
2
12
On Simple and Partial Correlation Coefficients
1. Even if ry1 = 0, ry1·2 will not be zero unless ry2 or r12 or both are zero.
2. If ry1 = 0, and ry2 ≠ 0 & r12 ≠ 0 are of the same sign, then ry1·2 < 0, whereas if they are of opposite signs, ry1·2 > 0.
Example: Let Y = crop yield, X1 = rainfall, X2 = temperature. Assume: ry1 = 0 (no association between crop yield and rainfall); ry2 > 0 & r12 < 0. Then, ry1·2 > 0, i.e., holding temperature constant, there is a positive association between yield and rainfall.
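Illustration only (not in the slides): the first-order partial correlation formula above can be evaluated directly; the zero-order correlations plugged in below are hypothetical values chosen to mimic this crop-yield example (ry1 = 0, ry2 > 0, r12 < 0).

  import numpy as np

  def partial_corr(r_y1, r_y2, r_12):
      # First-order partial correlation of Y and X1, controlling for X2
      return (r_y1 - r_y2 * r_12) / np.sqrt((1 - r_y2**2) * (1 - r_12**2))

  print(partial_corr(0.0, 0.6, -0.5))   # positive, as the example suggests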
3. Since temperature affects both yield & rainfall, in order to find out the net relationship between crop yield and rainfall, we need to remove the influence of temperature. Thus, the simple coefficient of correlation (CC) is misleading.
4. ry1·2 & ry1 need not have the same sign.
5. Interrelationship among the 3 zero-order CCs:
   0 ≤ r²y1 + r²y2 + r²12 − 2·ry1·ry2·r12 ≤ 1
6. ry2 = r12 = 0 does not mean that ry1 = 0.
   That Y & X2 and X1 & X2 are uncorrelated does not mean that Y and X1 are uncorrelated.
The partial r², r²y2·1, measures the (square of the) mutual relationship between Y and X2 after the influence of X1 is eliminated from both Y and X2.
Partial correlations are important in deciding whether or not to include more regressors.
e.g. Suppose we have: two regressors (X1 & X2); r²y2 = 0.95; and r²y2·1 = 0.01.
To explain Y, X2 alone can do a good job (high simple correlation coefficient between Y & X2).
But after X1 is already included, X2 does not add much – X1 has done the job of X2 (very low partial correlation coefficient between Y & X2).
If we regress Y on X1 alone, then we would have: RSS_SIMP = (1 − R²y·1)·∑y²
i.e., of the total variation in Y, an amount = (1 − R²y·1)·∑y²i remains unexplained (by X1 alone).
If we regress Y on X1 and X2, the variation in Y (TSS) that would be left unexplained is: RSS_MULT = (1 − R²y·12)·∑y²
Adding X2 to the model reduces the RSS by:
RSS_SIMP − RSS_MULT = (1 − R²y·1)·∑y² − (1 − R²y·12)·∑y² = (R²y·12 − R²y·1)·∑y²
If we now regress that part of Y freed from the effect of X1 (residualized Y) on the part of X2 freed from the effect of X1 (residualized X2), we will be able to explain the following proportion of the RSS_SIMP:
(R²y·12 − R²y·1)·∑y²i / [(1 − R²y·1)·∑y²i] = (R²y·12 − R²y·1) / (1 − R²y·1) = r²y2·1
This is the Coefficient of Partial Determination (square of the coefficient of partial correlation).
We include X2 if the reduction in RSS (or the increase in ESS) is significant.
But, when exactly? We will see later!
The amount (R²y·12 − R²y·1)·∑y²i represents the incremental contribution of X2 in explaining the TSS.
(R²y·12 − R²y·1) = (1 − R²y·1)·r²y2·1
In this identity: R²y·12 is the proportion of ∑y²i explained by X1 & X2 jointly; R²y·1 is the proportion of ∑y²i explained by X1 alone; (1 − R²y·1) is the proportion of ∑y²i that X1 leaves unexplained; and r²y2·1 is the proportion of that unexplained part accounted for by the incremental contribution of X2.
Coefficient of Determination (in Simple Linear Regression):
R² = β̂²·∑x² / ∑y²   or   R² = β̂·∑xy / ∑y²
Coefficient of Multiple Determination:
R²y·12 = (β̂1·∑x1y + β̂2·∑x2y) / ∑y²; in general, R² = R²y·12...K = [∑_{j=1..K} β̂j·(∑_{i=1..n} x_ji·y_i)] / ∑y²i
Coefficients of Partial Determination:
r²y2·1 = (R²y·12 − R²y·1) / (1 − R²y·1)
r²y1·2 = (R²y·12 − R²y·2) / (1 − R²y·2)
The coefficient of multiple determination (R²) measures the proportion of the variation in the dependent variable explained by (the set of all the regressors in) the model.
However, the R² can be used to compare the goodness-of-fit of alternative regression equations only if the regression models satisfy two conditions.
1) The models must have the same dependent variable.
Reason: TSS, ESS, and RSS depend on the units in which the regressand Yi is measured.
For instance, the TSS for Y is not the same as the TSS for log(Y).
2) The models must have the same number of regressors and parameters (the same value of K).
Reason: Adding a variable to a model will never raise the RSS (or, will never lower the ESS or R²) even if the new variable is not very relevant.
The adjusted R-squared, R̄², attaches a penalty to adding more variables.
It is modified to account for changes/differences in degrees of freedom (df): due to differences in the number of regressors (K) and/or sample size (n).
If adding a variable raises R̄² for a regression, then this is a better indication that it has improved the model than if it merely raises R².
R² = ∑ŷ² / ∑y² = 1 − ∑e² / ∑y²
R̄² = 1 − [∑e² / (n − (K + 1))] / [∑y² / (n − 1)]   (dividing the RSS and TSS by their df)
K + 1 represents the number of parameters to be estimated.
R̄² = 1 − [∑e² / ∑y²]·[(n − 1) / (n − K − 1)]
R̄² = 1 − (1 − R²)·[(n − 1) / (n − K − 1)]   ⇔   1 − R̄² = (1 − R²)·[(n − 1) / (n − K − 1)]
As long as K ≥ 1:  1 − R̄² > 1 − R²  ⇒  R̄² < R². In general, R̄² ≤ R².
As n grows larger (relative to K), R̄² → R².
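Illustration only (not in the slides): a one-function Python version of the R̄² formula above; the numbers plugged in come from the salary example worked out later in this chapter.

  def adjusted_r2(r2, n, k):
      # R-bar-squared = 1 - (1 - R^2)(n - 1)/(n - K - 1)
      return 1 - (1 - r2) * (n - 1) / (n - k - 1)

  # Salary example used later: R^2 = 270.5/272, n = 5, K = 2
  print(adjusted_r2(270.5 / 272, n=5, k=2))   # approximately 0.9890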
1. While R² is always non-negative, R̄² can be positive or negative.
2. R̄² can be used to compare the goodness-of-fit of two regression models only if the models have the same regressand.
3. Including more regressors reduces both the RSS and the df; R̄² rises only if the former effect dominates.
4. R̄² should never be the sole criterion for choosing between/among models:
   Consider expected signs & values of coefficients,
   Look for results consistent with economic theory or reasoning (possible explanations), ...
Numerical Example:
  Y (Salary in '000 Dollars)   X1 (Years of post high school Education)   X2 (Years of Experience)
  30                           4                                           10
  20                           3                                           8
  36                           6                                           11
  24                           4                                           9
  40                           8                                           12
  ƩY = 150                     ƩX1 = 25                                    ƩX2 = 50
Working sums (n = 5):
  ƩX1Y = 812, ƩX2Y = 1552, ƩX1² = 141, ƩX2² = 510, ƩX1X2 = 262, ƩY² = 4772
β̂ = (X'X)⁻¹X'Y, where
X'X = [[n, ƩX1, ƩX2], [ƩX1, ƩX1², ƩX1X2], [ƩX2, ƩX1X2, ƩX2²]]  and  X'Y = [ƩY, ƩYX1, ƩYX2]'
Here: X'X = [[5, 25, 50], [25, 141, 262], [50, 262, 510]],  X'Y = [150, 812, 1552]'
(X'X)⁻¹ = [[40.825, 4.375, −6.25], [4.375, 0.625, −0.75], [−6.25, −0.75, 1]]
β̂ = (X'X)⁻¹X'Y = [β̂0, β̂1, β̂2]' = [−23.75, −0.25, 5.5]'
⇒ Ŷ = −23.75 − 0.25·X1 + 5.5·X2
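Illustration only (not part of the slides): a minimal numpy check of the matrix computation above, using the data of the numerical example.

  import numpy as np

  # Data from the numerical example
  Y  = np.array([30, 20, 36, 24, 40], dtype=float)
  X1 = np.array([ 4,  3,  6,  4,  8], dtype=float)
  X2 = np.array([10,  8, 11,  9, 12], dtype=float)

  X = np.column_stack([np.ones_like(Y), X1, X2])   # columns: constant, X1, X2
  beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ Y)
  print(beta_hat)                                  # approximately [-23.75, -0.25, 5.5]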




One more year of experience, after controlling for years of education, results in a $5,500 rise in salary, on average.
Or, if we consider two persons with the same level of education, the one with one more year of experience is expected to have a higher salary of $5,500.
Similarly, for two people with the same level of experience, the one with an education of one more year is expected to have a lower annual salary of $250.
Experience looks far more important than education (which has a negative sign).
The constant term −23.75 is the salary one would get with no experience and no education.
But, a negative salary is impossible.
Then, what is wrong?
1. The sample must have been drawn from a subgroup. We have persons with experience ranging from 8 to 12 years (and post high school education ranging from 3 to 8 years). So we cannot extrapolate the results too far out of this sample range.
2. Model specification: is our model correctly specified (variables, functional form); does our data set meet the underlying assumptions?
1. TSS = ∑y² = ∑Y² − n·Ȳ²
   TSS = 4772 − 5(30)²  ⇒  TSS = 272
2. ESS = ∑ŷ² = ∑(β̂1x1 + β̂2x2)² = β̂1²∑x1² + β̂2²∑x2² + 2β̂1β̂2∑x1x2
   ESS = β̂1²(∑X1² − nX̄1²) + β̂2²(∑X2² − nX̄2²) + 2β̂1β̂2(∑X1X2 − nX̄1X̄2)
   ESS = (−0.25)²[141 − 5(5)²] + (5.5)²[510 − 5(10)²] + 2(−0.25)(5.5)[262 − 5(5)(10)]
   ⇒ ESS = 270.5
   OR: ESS = β̂1∑yx1 + β̂2∑yx2 = β̂1(∑YX1 − nX̄1Ȳ) + β̂2(∑YX2 − nX̄2Ȳ)
   ⇒ ESS = −0.25(62) + 5.5(52) = 270.5
3. RSS = TSS − ESS  ⇒  RSS = 272 − 270.5  ⇒  RSS = 1.5
4. R² = ESS/TSS = 270.5/272  ⇒  R² = 0.9945
   Our model (education and experience together) explains about 99.45% of the wage differential.
5. R̄² = 1 − [RSS/(n − K − 1)] / [TSS/(n − 1)] = 1 − (1.5/2)/(272/4)  ⇒  R̄² = 0.9890
Regressing Y on X1:
β̂y1 = ∑yx1 / ∑x1² = (∑YX1 − nX̄1Ȳ) / (∑X1² − nX̄1²) = 62/16 = 3.875
6. R²y·1 = ESS_SIMP / TSS = β̂y·1·∑yx1 / ∑y² = (3.875 × 62) / 272 = 0.8833
   RSS_SIMP = (1 − 0.8833)(272) = 0.1167(272) = 31.75
   X1 (education) alone explains about 88.33% of the differences in wages, and leaves about 11.67% (= 31.75) unexplained.
7. R²y·12 − R²y·1 = 0.9945 − 0.8833 = 0.1112
   (R²y·12 − R²y·1)·∑y² = 0.1112(272) = 30.25
   X2 (experience) enters the wage equation with an extra (marginal) contribution of explaining about 11.12% (= 30.25) of the total variation in wages.
   Note that this is the contribution of the part of X2 which is not related to (free from the influence of) X1.
8. r²y2·1 = (R²y·12 − R²y·1) / (1 − R²y·1) = (0.9945 − 0.8833) / (1 − 0.8833) = 0.9528
   Or, X2 (experience) explains about 95.28% (= 30.25) of the wage differential that X1 has left unexplained (= 31.75).
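Illustration only (not in the slides): the following Python sketch reproduces R²y·12, R²y·1 and the partial r²y2·1 reported above from the raw data.

  import numpy as np

  Y  = np.array([30, 20, 36, 24, 40], dtype=float)
  X1 = np.array([ 4,  3,  6,  4,  8], dtype=float)
  X2 = np.array([10,  8, 11,  9, 12], dtype=float)

  def r_squared(y, *regressors):
      # R-squared of an OLS regression of y on a constant plus the given regressors
      X = np.column_stack([np.ones_like(y)] + list(regressors))
      resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
      return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

  R2_full, R2_x1 = r_squared(Y, X1, X2), r_squared(Y, X1)
  print(R2_full, R2_x1)                      # about 0.9945 and 0.8833
  print((R2_full - R2_x1) / (1 - R2_x1))     # partial r^2 (y2.1), about 0.9528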
3.6 Statistical Inferences in Multiple Linear Regression
The case of two regressors (X1 & X2):
εi ~ N(0, σ²)
β̂1 ~ N(β1, var(β̂1));  var(β̂1) = σ² / [∑x1i²·(1 − r12²)]
β̂2 ~ N(β2, var(β̂2));  var(β̂2) = σ² / [∑x2i²·(1 − r12²)]
β̂0 ~ N(β0, var(β̂0));  var(β̂0) = σ²/n + X̄1²·var(β̂1) + X̄2²·var(β̂2) + 2X̄1X̄2·cov(β̂1, β̂2)
cov(β̂1, β̂2) = −r12²·σ² / [∑x1ix2i·(1 − r12²)],  where r12² = (∑x1ix2i)² / (∑x1i²·∑x2i²)
∑x1i²·(1 − r12²) is the RSS from regressing X1 on X2.
∑x2i²·(1 − r12²) is the RSS from regressing X2 on X1.
σ̂² = RSS/(n − 3) is an unbiased estimator of σ².
In general:
var–cov(β̂) = σ²(X'X)⁻¹, where X'X = [[n, ∑X1, …, ∑XK], [∑X1, ∑X1², …, ∑X1XK], …, [∑XK, ∑XKX1, …, ∑XK²]]
The estimated variance–covariance matrix is: vâr–côv(β̂) = σ̂²(X'X)⁻¹
Note that:
(a) (X'X)⁻¹ is the same matrix we use to derive the OLS estimates, and
(b) σ̂² = RSS/(n − 3) in the case of two regressors.
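Illustration only (not in the slides): σ̂² and the estimated variance–covariance matrix for the salary example can be reproduced with numpy as follows.

  import numpy as np

  Y = np.array([30, 20, 36, 24, 40], dtype=float)
  X = np.column_stack([np.ones(5), [4, 3, 6, 4, 8], [10, 8, 11, 9, 12]])

  beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
  resid = Y - X @ beta_hat
  sigma2_hat = resid @ resid / (5 - 2 - 1)          # RSS/(n-3), about 0.75
  vcov = sigma2_hat * np.linalg.inv(X.T @ X)        # estimated var-cov matrix
  print(sigma2_hat)
  print(vcov)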

In the general case of K explanatory variables, σ̂² = RSS/(n − K − 1) is an unbiased estimator of σ².
Note:
Ceteris paribus, the higher the correlation coefficient between X1 & X2 (r12), the less precise will the estimates β̂1 & β̂2 be, i.e., the CIs for the parameters β1 & β2 will be wider.
Ceteris paribus, the higher the degree of variation of the Xj's (the more the Xj's vary in our sample), the more precise will the estimates be – narrow CIs for the population parameters.
The above two points are contained in:
β̂j ~ N(βj, σ²/RSSj);  ∀ j = 1, 2, …, K,
where RSSj is the RSS from an auxiliary regression of Xj on all other (K–1) X's and a constant.
We use the t test to test hypotheses about single parameters and single linear functions of parameters.
To test hypotheses about & construct intervals for individual βj use:
(β̂j − βj*) / sê(β̂j) ~ t(n−K−1);  ∀ j = 0, 1, …, K.
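Illustration only (not in the slides): using the numbers from the salary example (β̂2 = 5.5, vâr(β̂2) = 0.75, n − K − 1 = 2), the t statistic and its p-value can be computed with scipy.

  import numpy as np
  from scipy import stats

  beta2_hat, var_beta2, df = 5.5, 0.75, 2
  t_stat = beta2_hat / np.sqrt(var_beta2)            # about 6.35
  p_value = 2 * stats.t.sf(abs(t_stat), df)
  print(t_stat, p_value, stats.t.ppf(0.975, df))     # critical value about 4.30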
Tests about and interval estimation of the error variance σ² are based on:
RSS/σ² = (n − K − 1)·σ̂²/σ² ~ χ²(n−K−1)
Tests of several parameters and several linear functions of parameters are F-tests.
Procedures for Conducting F-tests:
1. Compute the RSS from regressing Y on all Xj's (URSS = Unrestricted Residual Sum of Squares).
2. Compute the RSS from the regression with the hypothesized/specified values of the parameters (β's) (RRSS = Restricted RSS).
3. Under H0 (if the restriction is correct):
   [(RRSS − URSS)/J] / [URSS/(n − K − 1)] ~ F(J, n−K−1),  equivalently  [(R²U − R²R)/J] / [(1 − R²U)/(n − K − 1)] ~ F(J, n−K−1),
   where J is the number of restrictions imposed.
If the F-calculated is greater than the F-tabulated, then the RRSS is (significantly) greater than the URSS, and thus we reject the null.
A special F-test of common interest is to test the null that none of the X's influence Y (i.e., that our regression is useless!):
Test H0: β1 = β2 = … = βK = 0 vs. H1: H0 is not true.
URSS = (1 − R²)·∑y²i = ∑y²i − ∑_{j=1..K} β̂j·(∑_{i=1..n} x_ji·y_i);  RRSS = ∑y²i.
⇒ [(RRSS − URSS)/K] / [URSS/(n − K − 1)] = [R²/K] / [(1 − R²)/(n − K − 1)] ~ F(K, n−K−1)
With reference to our example on wages, test the following at the 5% level of significance:
a) β1 = 0;  b) β2 = 0;  c) β0 = 0;
d) the overall significance of the model; and
e) β1 = β2.
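Illustration only (not in the slides): the overall-significance F statistic for the salary example, computed from R², with scipy supplying the critical value.

  from scipy import stats

  R2, K, n = 270.5 / 272, 2, 5
  F = (R2 / K) / ((1 - R2) / (n - K - 1))      # about 180 (the slides report 180.82 using R^2 rounded to 0.9945)
  print(F, stats.f.ppf(0.95, K, n - K - 1))    # critical value F(2,2) = 19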
var–côv(β̂) = σ̂²(X'X)⁻¹
(X'X)⁻¹ = [[40.825, 4.375, −6.25], [4.375, 0.625, −0.75], [−6.25, −0.75, 1]]
σ² is estimated by: σ̂² = RSS/(n − K − 1) = 1.5/2 = 0.75
var–côv(β̂) = 0.75·(X'X)⁻¹ = [[30.61875, 3.28125, −4.6875], [3.28125, 0.46875, −0.5625], [−4.6875, −0.5625, 0.75]]
The diagonal elements are vâr(β̂0), vâr(β̂1) and vâr(β̂2); the off-diagonal elements are the corresponding côv's.
a) tc = (β̂1 − 0)/sê(β̂1) = −0.25/√0.46875 ≈ −0.37;  ttab = t0.025(2) ≈ 4.30
   |tcal| ≤ ttab  ⇒  we do not reject the null.
b) tc = (β̂2 − 0)/sê(β̂2) = 5.5/√0.75 ≈ 6.35;  |tcal| > ttab  ⇒  reject the null.
c) tc = (β̂0 − 0)/sê(β̂0) = −23.75/√30.61875 ≈ −4.29;  |tcal| ≤ ttab  ⇒  we do not reject the null!!!
d) Fc = [R²/K] / [(1 − R²)/(n − K − 1)] = (0.9945/2)/(0.0055/2) ≈ 180.82
   Ftab = F0.05(2, 2) ≈ 19;  Fcal > Ftab  ⇒  reject the null.
e) From Ŷi = β̂0 + β̂1X1i + β̂2X2i:  URSS = 1.5
   Now impose β1 = β2 and run Ŷi = β̂0 + β̂(X1i + X2i)  ⇒  RRSS = 12.08
   Fc = [(RRSS − URSS)/J] / [URSS/(n − K − 1)] = [(12.08 − 1.5)/1] / (1.5/2) ≈ 14.11
   Ftab = F0.05(1, 2) ≈ 18.51;  Fcal ≤ Ftab  ⇒  we do not reject the null.
Note that we can also use a t-test to test the single restriction that β1 = β2 (equivalently, β1 − β2 = 0):
(β̂1 − β̂2 − 0) / sê(β̂1 − β̂2) = (β̂1 − β̂2) / √[vâr(β̂1) + vâr(β̂2) − 2côv(β̂1, β̂2)] ~ t(n−K−1)
tc = −5.75 / √[0.46875 + 0.75 − 2(−0.5625)] ≈ −3.76;  ttab = t0.025(2) ≈ 4.30
|tcal| < ttab  ⇒  do not reject the null.
The same result as the F-test, but the F-test is easier to handle.
To sum up:
Assuming that our model is correctly specified and all the assumptions are satisfied,
Education (after controlling for experience) doesn't have a significant influence on wages.
In contrast, experience (after controlling for education) is a significant determinant of wages.
The intercept parameter is also insignificant (though at the margin). Less important!
Overall, the model explains a significant portion of the observed wage pattern.
We cannot reject the claim that the coefficients of the two regressors are equal.
3.7 Prediction with Multiple Linear Regression
In Chapter 2, we used the estimated simple linear regression model for prediction: (i) mean prediction (i.e., predicting the point on the population regression function (PRF)), and (ii) individual prediction (i.e., predicting an individual value of Y), given the value of the regressor X (say, X = X0).
The formulas for prediction are also similar to those in the case of simple regression except that, to compute the standard error of the predicted value, we need the variances and covariances of all the regression coefficients.
Note:
Even if the R² for the SRF is very high, it does not necessarily mean that our forecasts are good.
The accuracy of our prediction depends on the stability of the coefficients between the period used for estimation and the period used for prediction.
More care must be taken when the values of the regressors (X's) themselves are forecasts.
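Illustration only: the slides do not restate the prediction formulas, so the sketch below uses the standard textbook expression var(Ŷ0) = σ̂²·x0'(X'X)⁻¹x0 for a mean prediction, evaluated at a hypothetical point x0 for the salary example.

  import numpy as np

  Y = np.array([30, 20, 36, 24, 40], dtype=float)
  X = np.column_stack([np.ones(5), [4, 3, 6, 4, 8], [10, 8, 11, 9, 12]])
  beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
  sigma2_hat = np.sum((Y - X @ beta_hat) ** 2) / (5 - 2 - 1)

  x0 = np.array([1.0, 5.0, 10.0])                    # hypothetical point: X1 = 5, X2 = 10
  y0_hat = x0 @ beta_hat                             # point (mean) prediction
  se_mean = np.sqrt(sigma2_hat * x0 @ np.linalg.inv(X.T @ X) @ x0)
  print(y0_hat, se_mean)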
CHAPTER FOUR
VIOLATING THE ASSUMPTIONS OF THE CLASSICAL LINEAR REGRESSION MODEL (CLRM)

4.1 Introduction

) The estimates derived using OLS techniques


and the inferences based on those estimates are
valid only under certain conditions.
) In general, these conditions amount to the
regression model being "well-specified".
) A regression model is statistically well-specified

. for an estimator (say, OLS) if all of the


assumptions required for the optimality of the
estimator are satisfied.
) The model will be statistically misspecified if
one/more of the assumptions are not satisfied.
)Before we proceed to testing for violations of (or


relaxing) the assumptions of the CLRM
sequentially, let us recall: (i) the basic steps in a
scientific enquiry & (ii) the assumptions made.
I. The Major Steps Followed in a Scientific Study:
Study
1. Specifying a statistical model consistent with

. theory (or a model representing the theoretical


relationship between a set of variables).
)This involves at least two choices to be made:
A.The choice of variables to be included into
the model, and
B.The choice of the functional form of the link


(linear in variables, linear in logarithms of
the variables, polynomial in regressors, etc.)
2. Selecting an estimator with certain desirable
properties (provided that the regression model
in question satisfies a given set of conditions).

. 3. Estimating the model. When can one estimate a


model? (sample size? perfect multicollinearity?)
4. Testing for the validity of assumptions made.
5. a) If there is no evidence of misspecification, go
on to conducting statistical inferences.
5. b) If the tests show evidence of misspecification


in one or more relevant forms, then there are
two possible courses of action implied:
)If the precise form of model misspecification
can be established, then it may be possible to
find an alternative estimator that is optimal

. under the particular sort of misspecification.


)Regard statistical misspecification as an
indication of a defective model. Then, search
an alternative, well-specified regression
model, and start over (return to Step 1).
II. The Assumptions of the CLRM:
A1: n > K+1. Otherwise, estimation is not possible.
A2: No perfect multicollinearity among the X's.
    Implication: any X must have some variation.
A3: ɛi|Xji ~ IID(0, σ²), i.e., E(εsεt|Xj) = σ² for s = t and 0 for s ≠ t.
    A3.1: var(ɛi|Xj) = σ² (0 < σ² < ∞).
    A3.2: cov(ɛi, ɛs|Xj) = 0, for all i ≠ s; s = 1, …, n.
A4: ɛi's are normally distributed: ɛi|Xj ~ N(0, σ²).
A5: E(ɛi|Xj) = E(ɛi) = 0; i = 1, …, n & j = 1, …, K.
    A5.1: E(ɛi) = 0 and X's are non-stochastic, or
    A5.2: E(ɛiXji) = 0 or E(ɛi|Xj) = E(ɛi) with stochastic X's.
    Implication: ɛ is independent of Xj & thus cov(ɛ, Xj) = 0.
)Generally speaking, the several tests for the


violations of the assumptions of the CLRM are
tests of model misspecification.
)The values of the test statistics for testing
particular H0's tend to reject these H0's when
the model is misspecified in some way.

. e.g., tests for heteroskedasticity or autocorrelation


are sensitive to omission of relevant variables.
)A significant test statistic may indicate hetero-
skedastic (or autocorrelated) errors, but it may
also reflect omission of relevant variables.
Outline:
1. Small Samples (A1?)
2. Multicollinearity (A2?)
3. Non-Normal Errors (A4?)
4. Non-IID Errors (A3?):
A. Heteroskedasticity (A3.1?)
B. Autocorrelation (A3.2?)
5. Endogeneity (A5?):

. A. Stochastic Regressors and Measurement Error


B. Model Specification Errors:
a. Omission of Relevant Variables
b. Wrong Functional Form
c. Inclusion of Irrelevant Variables (?XXX)
d. Stability of Parameters
C. Simultaneity (or Reverse Causality)
4.2 Sample Size: Problems with Few Data Points
Requirement for estimation: n > K+1.
If the number of data points (n) is small, it may be difficult to detect violations of assumptions.
With small n, it is hard to detect heteroskedasticity or nonnormality of the ɛi's even when present.
Though none of the assumptions is violated, a linear regression with small n may not have sufficient power to reject βj = 0, even if βj ≠ 0.
If [(K+1)/n] > 0.4, it will often be difficult to fit a reliable model.
Rule of thumb: aim to have n at least 6 times (and ideally at least 10 times) the number of regressors.
4.3 Multicollinearity

) Many social research studies use a large


number of predictors.
) Problems arise when the various predictors are
highly and linearly related (highly collinear).
) Recall that, in a multiple regression, only the
independent variation in a regressor (an X) is

. used in estimating the coefficient of that X.


) If two X's (X1 & X2) are highly correlated with
each other, then the coefficients of X1 & X2 will
be determined by the minority of cases where
they don’t vary together (or overlap).
) Perfect multicollinearity: occurs when one (or


more) of the regressors in a model (e.g., XK) is a
linear function of other/s (Xi, i = 1, 2, …, K-1).
) For instance, if X2 = 2X1, then there is a perfect
(an exact) multicollinearity between X1 & X2.
Suppose, PRF: Y = β0 + β1X1 + β2X2, & X2 = 2X1.
The OLS technique yields 3 normal equations:
∑Yi = nβ̂0 + β̂1∑X1i + β̂2∑X2i
∑YiX1i = β̂0∑X1i + β̂1∑X1i² + β̂2∑X1iX2i
∑YiX2i = β̂0∑X2i + β̂1∑X1iX2i + β̂2∑X2i²
But, substituting 2X1 for X2 in the 3rd equation yields the 2nd equation.
That is, one of the normal equations is in fact redundant.
Thus, we have only 2 independent equations (1 & 2 or 1 & 3) but 3 unknowns (β's) to estimate.
As a result, the normal equations will reduce to:
∑Yi = nβ̂0 + [β̂1 + 2β̂2]∑X1i
∑YiX1i = β̂0∑X1i + [β̂1 + 2β̂2]∑X1i²
In matrix form:
(∑Yi, ∑YiX1i)' = [[n, ∑X1i], [∑X1i, ∑X1i²]]·(β̂0, β̂1 + 2β̂2)'
The number of β's to be estimated is greater than the number of independent equations.
So, if two or more X's are perfectly correlated, it is not possible to find the estimates for all β's.
i.e., we cannot find β̂1 & β̂2 separately, but only β̂1 + 2β̂2:
α̂ = β̂1 + 2β̂2 = (∑YiX1i − nX̄1Ȳ) / (∑X1i² − nX̄1²)   &   β̂0 = Ȳ − [β̂1 + 2β̂2]·X̄1
) High, but not perfect, multicollinearity: two or


more regressors in a model are highly (but
imperfectly) correlated. e.g. X1 = 3 – 5XK + ui.
) This makes it difficult to isolate the effect of
each of the highly collinear X's on Y.
) If there is inexact but strong multicollinearity:

. * The collinear regressors (X's) explain the


same variation in the regressand (Y).
* Estimated coefficients change dramatically,
depending on the inclusion/exclusion of
other predictor/s into (or out of) the model.
* .β̂' s tend to be very shaky from one sample to


another.
* Standard errors of β̂' s will be inflated.
* As a result, t-tests will be insignificant & CIs
wide (rejecting H0: βj = 0 becomes very rare).
* We get low t-ratios but high R2 (or F): there

. is not enough individual variation in the X's,


but a lot of common variation.
Yet, the OLS estimators are BLUE.
BLUE – a property of repeated-sampling – says nothing about estimates from a single sample.
) But, multicollinearity is not a problem if the


principal aim is prediction, given that the same
pattern of multicollinearity persists into the
forecast period.
Sources of Multicollinearity:
) Improper use of dummy variables. (Later!)

.) Including the same (or almost the same)


variable twice (e.g. different operationaliaztions
of a single concept used together).
) Method of data collection used (e.g. sampling
over a limited range of X values).
)Including a variable computed from other
variables in the model (e.g. using family income,
mother’s income & father’s income together).
)Adding many polynomial terms to a model,
especially if the range of the X variable is small.
)Or, it may just happen that variables are highly
correlated (without any fault of the researcher).

. Detecting Multicollinearity:
)The classic case of multicollinearity occurs
when R2 is high (& significant), but none of X's
is significant (some of the X's may even have
wrong sign).
) Detecting the presence of multicollinearity is


more difficult in the less clear-cut cases.
) Sometimes, simple or partial coefficients of
correlation among regressors are used.
) However, serious multicollinearity may exist
even if these correlation coefficients are low.

A statistic commonly used for detecting multicollinearity is the VIF (Variance Inflation Factor).
From a simple linear regression of Y on Xj we have: var(β̂j) = σ² / ∑x²ji
From the multiple linear regression of Y on all the X's:
var(β̂j) = [σ² / ∑x²ji]·[1/(1 − R²j)] = VIFj·σ² / ∑x²ji,
where R²j is the R² from regressing Xj on all other X's.
The difference between the variance of β̂j in the two cases arises from the correlation between Xj and the other X's, and is captured by: VIFj = 1/(1 − R²j)
If Xj is not correlated with the other X's, R²j = 0, VIFj = 1 and the two variances will be identical.
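Illustration only (not in the slides): VIFs can be computed by hand from the auxiliary regressions described above; the data below are hypothetical.

  import numpy as np

  def vif(X):
      # X: (n, k) array of regressors (no constant column)
      # VIF_j = 1/(1 - R^2_j), with R^2_j from regressing X_j on the other X's plus a constant
      n, k = X.shape
      out = []
      for j in range(k):
          others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
          fitted = others @ np.linalg.lstsq(others, X[:, j], rcond=None)[0]
          r2_j = 1 - np.sum((X[:, j] - fitted) ** 2) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
          out.append(1 / (1 - r2_j))
      return out

  rng = np.random.default_rng(1)
  x1 = rng.normal(size=100)
  x2 = 0.95 * x1 + 0.1 * rng.normal(size=100)    # nearly collinear with x1
  x3 = rng.normal(size=100)
  print(vif(np.column_stack([x1, x2, x3])))      # large VIFs for x1 and x2, about 1 for x3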
) As Rj2 increases, VIFj rises.
) If Xj is perfectly correlated with the other X's,
VIFj = ∞. Implication for precision (or CIs)???
) Thus, a large VIF is a sign of serious/severe (or
“intolerable”) multicollinearity.
) There is no cutoff point on VIF (or any other

. measure) beyond which multicollinearity is


taken as intolerable.
) A rule of thumb: VIF > 10 is a sign of severe
multicollinearity.
# In stata (after regression): vif
Solutions to Multicollinearity:
Solutions depend on the sources of the problem.
The formula below is indicative of some solutions:
vâr(β̂j) = σ̂² / [∑x²ji·(1 − R²j)] = ∑ei² / [(n − K − 1)·∑x²ji·(1 − R²j)], where σ̂² = ∑ei²/(n − K − 1)
More precision is attained with lower variances of coefficients. This may result from:
a) Smaller RSS (or variance of error term) –
less “noise”, ceteris paribus (cp);
b) Larger sample size (n) relative to the
number of parameters (K+1), cp;
c) Greater variation in values of each Xj, cp;
d) Less correlation between regressors, cp.
)Thus, serious multicollinearity may be solved by
using one/more of the following:
1. “Increasing sample size” (if possible). ???
2. Utilizing a priori information on parameters

. (from theory or prior research).


3. Transforming variables or functional form:
a) Using differences (ΔX) instead of levels (X)
in time series data where the cause may be
X's moving in the same direction over time.
b) In polynomial regressions, using deviations


of regressors from their means ((Xj–X̅j)
instead of Xj) tends to reduce collinearity.
c) Usually, logs are less collinear than levels.
4. Pooling cross-sectional and time-series data.
5. Dropping one of the collinear predictors. ???

. However, this may lead to the omitted variable


bias (misspecification) if theory tells us that the
dropped variable should be incorporated.
6. To be aware of its existence and employing
cautious interpretation of results.
4.4 Non-normality of the Error Term
) Normality is not required to get BLUE of β's.


) The CLRM merely requires errors to be IID.
) Normality of errors is required only for valid
hypothesis testing, i.e., validity of t- and F-tests.
) In small samples, if the errors are not normally
distributed, the estimated parameters will not

. follow normal distribution, which complicates


inference.
) NB: there is no obligation on X's to be normally
distributed.
# In stata (after regression): kdensity residual, normal
)A formal test of normality is the Shapiro-Wilk


test [H0: errors are normally distributed].
)Large p-value shows that H0 cannot be rejected.
#In stata: swilk residual
)If H0 is rejected, transforming the regressand or
re-specifying (the functional form of) the model

.may help.
)With large samples, thanks to the central limit
theorem, hypothesis testing may proceed even if
distribution of errors deviates from normality.
)Tests are generally asymptotically valid.
4.5 Non-IID Errors
)The assumption of IID errors is violated if a


(simple) random sampling cannot be assumed.
)More specifically, the assumption of IID errors
fails if the errors:
1) are not identically distributed, i.e., if var(εi|Xji)
varies with observations – heteroskedasticity.

. 2) are not independently distributed, i.e., if errors


are correlated to each other – serial correlation.
3) are both heteroskedastic & autocorrelated.
This is common in panel & time series data.
4.5.1 Heteroskedasticity
) One of the assumptions of the CLRM is homo-


skedasticity, i. e., var(εi|X) = var(εi) = σ2.
) This will be true if the observations of the error
term are drawn from identical distributions.
) Heteroskedasticity is present if var(εi)=σi2≠σ2:
different variances for different segments of the
population (segments by the values of the X's).

.e.g.: Variability of consumption rises with rise in


income, i.e., people with higher incomes display
greater variability in consumption.
) Heteroskedasticity is more likely in cross-
sectional than time-series data.
)With a correctly specified model (in any other


aspect), but heteroskedastic errors, the OLS
coefficient estimators are unbiased & consistent
but inefficient.
)Reason: OLS estimator for σ2 (and thus for the
standard errors of the coefficients) are biased.

.)Hence, confidence intervals based on biased


standard errors will be wrong, and the t & F
tests will be misleading/invalid.
NB: Heteroskedasticity could be a symptom of
other problems (e.g. omitted variables).
) If heteroskedasticity is a result (or a reflection)


of specification error (say, omitted variables),
OLS estimators will be biased & inconsistent.
) In the presence of heteroskedasticity, OLS is
not optimal as it gives equal weight to all
observations, when, in fact, observations with

. larger error variances (σi2) contain less


information than those with smaller σi2 .
) To correct, give less weight to data points with
greater σi2 and more weight to those with
smaller σi2. [i.e., use GLS (WLS or FGLS)].
Detecting Heteroskedasticity:
A. Graphical Method
) Run OLS and plot squared residuals versus
fitted value of Y (Ŷ) or against each X.
# In stata (after regression): rvfplot
) The graph may show some relationship (linear,

. quadratic, …), which provides clues as to the


nature of the problem and a possible remedy.
e.g. let, the plot of ũ2 (from Y = α + βX + u) against
X signifies that var(ui) increases proportional
to X2; (var(ui)=σi2 =cXi2). What is the Solution?
Now, transform the model by dividing Y, α, X and u by X:
Y/X = α·(1/X) + β·(X/X) + u/X  ⇒  y* = α·x* + β + u*
Now, u* is homoskedastic: var(ui*) = c; i.e., using WLS solves heteroskedasticity!
WLS yields BLUE for the transformed model.
If the pattern of heteroskedasticity is unknown, log transformation of both sides (compressing the scale of measurement of variables) usually solves heteroskedasticity.
This cannot be used with 0 or negative values.
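Illustration only (not in the slides): a minimal sketch of the transformation just described, assuming var(ui) = c·Xi² and using hypothetical data.

  import numpy as np

  rng = np.random.default_rng(2)
  X = rng.uniform(1, 10, size=200)
  u = rng.normal(scale=0.5 * X)                  # heteroskedastic errors, sd proportional to X
  Y = 2.0 + 3.0 * X + u

  # WLS via the transformed model: Y/X = alpha*(1/X) + beta + u/X
  y_star = Y / X
  Z = np.column_stack([1 / X, np.ones_like(X)])  # columns: x* = 1/X, constant
  alpha_hat, beta_hat = np.linalg.lstsq(Z, y_star, rcond=None)[0]
  print(alpha_hat, beta_hat)                     # close to 2 and 3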
B. A Formal Test:
) The most-often used test for heteroskedasticity
is the Breusch-Pagan (BP) test.
H0: homoskedasticity vs. Ha: heteroskedasticity
) Regress ũ2 on Ŷ or ũ2 on the original X's, X2's
and, if enough data, cross-products of the X's.

.) H0 will be rejected for high values of the test


statistic [n*R2~χ2q] or for low p-values.
) n & R2 are obtained from the auxiliary
regression of ũ2 on q (number of) predictors.
# In stata (after regression): hettest or hettest, rhs
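Illustration only (not in the slides): a hand-coded version of the auxiliary regression behind the B-P test, on hypothetical heteroskedastic data; in practice the Stata commands above (or an equivalent canned routine) would be used.

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(3)
  n = 200
  x = rng.uniform(1, 10, size=n)
  y = 1.0 + 2.0 * x + rng.normal(scale=x)                 # heteroskedastic errors

  X = np.column_stack([np.ones(n), x])
  resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

  # Auxiliary regression of squared residuals on the regressors
  u2 = resid ** 2
  fitted = X @ np.linalg.lstsq(X, u2, rcond=None)[0]
  R2_aux = 1 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)

  LM = n * R2_aux                                         # compare with chi-square, q = 1 here
  print(LM, stats.chi2.sf(LM, df=1))                      # small p-value -> heteroskedasticity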
The B-P test as specified above:
- uses the regression of ũ² on Ŷ or on the X's;
- and thus consumes fewer degrees of freedom;
- but tests for linear heteroskedasticity only;
- and has problems when the errors are not normally distributed.
# Alternatively, use: hettest, iid or hettest, rhs iid
This doesn't need the assumption of normality.
) If you want to include squares & cross products
of X's, generate these variables first and use:
# hettest varlist or hettest varlist, iid
) The hettest varlist, iid version of B-P test is the
same as White’s test for heteroskedasticity:
# In stata (after regression): imtest, white


Solutions to (or Estimation with) Heteroskedasticity
) If heteroskedasticity is detected, first check for
some other specification error in the model
(omitted variables, wrong functional form, …).
) If it persists even after correcting for other

. specification errors, use one of the following:


1. Use better method of estimation (WLS/FGLS);
2. Stick to OLS but use robust (heteroskedasticity
consistent) standard errors.
# In stata: reg Y X1 … XK, robust
This is OK even with homoskedastic errors.
4.5.2 Autocorrelation
) Error terms are autocorrelated if error terms


from different (usually adjacent) time periods
(cross-sectional units) are correlated, E(εiεj)≠0.
) Autocorrelation in cross-sectional data is called
spatial autocorrelation (in space, not over time).
) However, spatial autocorrelation is uncommon

. since cross-sectional data do not usually have


some ordering logic, or economic interest.
) Serial correlation occurs in time-series studies
when the errors associated with a given time
period carry over into future time periods.
) et are correlated with lagged values: et-1, et-2, …


) Effects of autocorrelation are similar to those
of heteroskedasticity:
) OLS coefficients are unbiased and consistent,
but inefficient; the estimate of σ2 is biased, and
thus inferences are invalid.
Detecting Autocorrelation

Whenever you work with time series data, set up


your data as a time-series (i.e., identify the
variable that represents time or the sequential
order of observations).
# In stata: tsset varname
) Then, plotting OLS residuals against the time


variable, or a formal test could be used to check
for autocorrelation.
# In stata (after regression and predicting residuals):
scatter residual time
The Breusch-Godfrey Test
) Commonly-used general test of autocorrelation.

.) It tests for autocorrelation of first or higher


order, and works with stochastic regressors.
Steps:
1. Regress OLS residuals on X's and lagged
residuals: et = f(X1t,...,XKt, et-1,…,et-j)
2. Test the joint hypothesis that all the estimated


coefficients on lagged residuals are zero. Use the
test statistic: jFcal ~ χ2j ;
3. Alternatively, test the overall significance of the
auxiliary regression using nR2 ~ χ2(k+j).
4. Reject H0: no serial correlation for high values

. of the test statistic or for small p-values.


# In stata (after regression): bgodfrey, lags(#)
Eg. bgodfrey, lags(2) tests for 2nd order auto in error
terms (et's up to 2 periods apart) like et, et-1, et-2;
while bgodfrey, lags(1/4) tests for 1st, 2nd, 3rd & 4th
order autocorrelations.
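Illustration only (not in the slides): the steps above coded by hand for a first-order test on hypothetical AR(1) errors; in practice bgodfrey (or an equivalent routine) would be used.

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(4)
  n = 300
  x = rng.normal(size=n)
  e = np.zeros(n)
  for t in range(1, n):
      e[t] = 0.6 * e[t - 1] + rng.normal()        # AR(1) errors
  y = 1.0 + 2.0 * x + e

  X = np.column_stack([np.ones(n), x])
  resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

  # Auxiliary regression: residual_t on X_t and residual_{t-1}
  Z = np.column_stack([X[1:], resid[:-1]])
  fitted = Z @ np.linalg.lstsq(Z, resid[1:], rcond=None)[0]
  R2_aux = 1 - np.sum((resid[1:] - fitted) ** 2) / np.sum((resid[1:] - resid[1:].mean()) ** 2)

  LM = (n - 1) * R2_aux                           # roughly n*R^2, chi-square with 1 df under H0
  print(LM, stats.chi2.sf(LM, df=1))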
Estimation in the Presence of Serial Correlation:


)Solutions to autocorrelation depend on the
sources of the problem.
)Autocorrelation may result from:
)Model misspecification (e.g. Omitted
variables, a wrong functional form, …)

. )Misspecified dynamics (e.g. static model


estimated when dependence is dynamic), …
)If autocorrelation is significant, check for model
specification errors, & consider re-specification.
) If the revised model passes other specification


tests, but still fails tests of autocorrelation, the
following are the key solutions:
1. FGLS: Prais-Winston regression, ….
# In stata: prais Y X1 … XK

.2. OLS with robust standard errors:


# In stata: newey Y X1 … XK, lags(#)
4.6 Endogenous Regressors: E(ɛi|Xj) ≠ 0
A key assumption maintained in the previous lessons is that the model, E(Y|X) = Xβ or E(Y|X) = β0 + ∑_{i=1..K} βiXi, was correctly specified.
The model Y = Xβ + ε is correctly specified if:
1. ε is orthogonal to the X's, enters the model additively (a separable effect on Y), and this effect equals zero on average; and,
2. E(Y|X) is linear in stable parameters (β's).
If the assumption E(εi|Xj) = 0 is violated, the OLS estimators will be biased & inconsistent.
)Assuming exogenous regressors (orthogonal


errors & X's) is unrealistic in many situations.
)The possible sources of endogeneity are:
1. stochastic regressors & measurement error;
2. specification errors: omission of relevant
variables or using a wrong functional form;
3. nonlinearity in & instability of parameters; and

. 4. bidirectional link between the X's and Y


(simultaneity or reverse causality);
)Recall two versions of exogeneity assumption:
1. E(ɛi) = 0 and X’s are fixed (non-stochastic),
2. E(ɛiXj) = 0 or E(ɛi|Xj) = 0 with stochastic X’s.
) The assumption E(εi) = 0 amounts to: “We do


not systematically over- or under-estimate the
PRF,” or the overall impact of all the excluded
variables is random/unpredictable.
) This assumption cannot be tested as residuals
will always have zero mean if the model has an
intercept.
. ) If there is no intercept, some information can
be obtained by plotting the residuals.
) If E(ɛi) = μ (a constant ≠ 0) & X's are fixed, the
estimators of all β's, except β0, will be OK!
) But, can we assume non-stochastic regressors?
4.6.1 Stochastic Regressors and Measurement Error
A. Stochastic Regressors
) Many economic variables are stochastic, and it
is only for ease that we assumed fixed X's.
) For instance, the set of regressors may include:
* a lagged dependent variable (Yt-1), or
* an X characterized by a measurement error.

. ) In both of these cases, it is not reasonable to


assume fixed regressors.
) As long as no other assumption is violated, OLS
retains its desirable properties even if X's are
stochastic.
) In general, stochastic regressors may or may
not be correlated with the model error term.
1. If X & ɛ are independently distributed, E(ɛ|X)
= 0, OLS retains all its desirable properties.
2. If X & ɛ are not independent but are either
contemporaneously uncorrelated, [E(ɛi|Xi±s) ≠
0 for s = 1, 2, … but E(ɛi|Xi) = 0], or ɛ & X are

. asymptotically uncorrelated, OLS retains its


large sample properties: estimators are biased,
but consistent and asymptotically efficient.
) The basis for valid statistical inference remains
but inferences must be based on large samples.
3. If X & ɛ are not independent and are
correlated even asymptotically, then OLS
estimators are biased and inconsistent.
) SOLUTION: IV/2SLS REGRESSION!
) Thus, it is not the stochastic (or fixed) nature of
regressors by itself that matters, but the nature
of the correlation between X's & ɛ.

. B. Measurement Error
) Measurement error in the regressand (Y) only
does not cause bias in OLS estimators as long
as the measurement error is not systematically
related to one or more of the regressors.
) If the measurement error in Y is uncorrelated


with X's, OLS is perfectly applicable (though
with less precision or higher variances).
) If there is a measurement error in a regressor
and this error is correlated with the measured
variable, then OLS estimators will be biased

. and inconsistent.
) SOLUTION: IV/2SLS REGRESSION!
4.6.2 Specification Errors
) Model misspecification may result from:


) omission of relevant variable/s,
) using a wrong functional form, or
) inclusion of irrelevant variable/s.
1. Omission of relevant variables: when one/more
relevant variables are omitted from a model.
) Omitted-variable bias: bias in parameter

. estimates when the assumed specification is


incorrect in that it omits a regressor that must
be in the model.
) e.g. estimating Y=β0+β1X1+β2X2+u when the
correct model is Y=β0+β1X1+β2X2+β3Z+u.
) Wrongly omitting a variable (Z) is equivalent


to imposing β3 = 0 when in fact β3 ≠ 0.
) If a relevant regressor (Z) is missing from a
model, OLS estimators of β's (β0, β1 & β2) will
be biased, except if cov(Z,X1) = cov(Z,X2) = 0.
) Even if cov(Z,X1) = cov(Z,X2) = 0, the estimate
for β0 is biased.
. ) The OLS estimators for σ2 and for the
standard errors of the β̂'s are also biased.
) Consequently, t- and F-tests will not be valid.
) In general, OLS estimators will be biased,
inconsistent and the inferences will be invalid.
) These consequences of wrongly excluding


variables are clearly very serious and thus,
attempt should be made to include all the
relevant regressors.
) The decision to include/exclude variables
should be guided by economic theory and

. reasoning.
2. Error in the algebraic form of the relationship:


a model that includes all the appropriate
regressors may still be misspecified due to
error in the functional form relating Y to X's.
) e.g. using a linear functional form when the
true relationship is logarithmic (log-log) or
semi-logarithmic (lin-log or log-lin).

.) The effects of functional form misspecification


are the same as those of omitting of relevant
variables, plus misleading inferences.
) Again, rely on economic theory, and not just on
statistical tests.
Testing for Omitted Variables and Functional


Form Misspecification
1. Examination of Residuals
) Most often, we use the plot of residuals versus
fitted values to have a quick glance at problems
like nonlinearity.

. ) Ideally, we would like to see residuals rather


randomly scattered around zero.
# In stata (after regression): rvfplot, yline(0)
) If in fact there are such errors as omitted
variables or incorrect functional form, a plot of
the residuals will exhibit distinct patterns.
2. Ramsey’s Regression Equation Specification


Error Test (RESET)
) It tests for misspecification due to omitted
variables or a wrong functional form.
) Steps:
1. Regress Y on X's, and get Ŷ & ũ.

.2. Regress: a) Y on X's Ŷ2 & Ŷ3, or


b) ũ on X's, Ŷ2 & Ŷ3, or
c) ũ on X's, X2's, Xi*Xj's (i ≠ j).
3. If the new regressors (Ŷ2 & Ŷ3 or X2's, Xi*Xj's)
are significant (as judged by F test), then reject
H0, and conclude that there is misspecification.
# In stata (after regression): ovtest or ovtest, rhs


) If the original model is misspecified, then try
another model: look for some variables which
are left out and/or try a different functional
form like log-linear (but based on some theory).
) The test (by rejecting the null) does not suggest

. an alternative specification.
3. Inclusion of irrelevant variables: when one/more
irrelevant variables are wrongly included in the
model. e.g. estimating Y=β0+β1X1+β2X2+β3X3+u
when the correct model is Y=β0+β1X1+β2X2+u.
) The consequence is that the OLS estimators will


remain unbiased and consistent but inefficient
(compared to OLS applied to the right model).
) σ2 is correctly estimated, and the conventional
hypothesis-testing methods are still valid.
) The only penalty we pay for the inclusion of the

. superfluous variable/s is that the estimated


variances of the coefficients are larger.
) As a result, our probability inferences about the
parameters are less precise, i.e., precision is lost
if the correct restriction β3 = 0 is not imposed.
)To test for the presence of irrelevant variables,


use F-tests (based on RRSS & URSS) if you
have some ‘correct’ model in your mind.
)Do not eliminate variables from a model based
on insignificance implied by t-tests.
)In particular, do not drop a variable with |t| > 1.
)Do not drop two or more variables at once (on

. the basis of t-tests) even if each has |t| < 1.


)The t statistic corresponding to an X (Xj) may
radically change once another (Xi) is dropped.
)A useful tool in judging the extra contribution
of regressors is the added variable plot.
) The added variable plot shows the (marginal)


effect of adding a variable to the model after all
other variables have been included.
) In a multiple regression, the added variable plot
for a predictor, say Xj, is the plot showing the
residuals of Y on all predictors except Xj

. against the residuals of Xj on all other X's.


# In stata (after regression): avplots or avplot varname
) In general, model misspecification due to the
inclusion of irrelevant variables is less serious
than that due to omission of relevant variable/s.
) Taking bias as a more undesirable outcome


than inefficiency, if one is in doubt about which
variables to include in a regression model, it is
better to err by including irrelevant variables.
) This is one reason behind the advocacy of
Hendry’s “general-to-specific” methodology.

.) This preference is reinforced by the fact that


standard errors are incorrect if variables are
wrongly excluded, but not if variables are
wrongly included.
) In general, the specification problem is less


serious when the research task/aim is model
comparison (to see which has a better fit to the
data) as opposed to when the task is to justify
(and use) a single model and assess the relative
importance of the independent variables.

4.6.3 Stability of Parameters and the Dummy Variables Regression (DVR)
) So far we assumed that the intercept and all the
slope coefficients (βj's) are the same/stable for
the whole set of observations. Y = Xβ + e
) But, structural shifts and/or group differences
are common in the real world. May be:
) the intercept differs/changes, or

. ) the (partial) slope differs/changes, or


) both the intercept and slope differ/change
across categories or time period.
) Two methods for testing parameter stability:
(i) Using Chow tests, or (ii) Using DVR.
A. The Chow Tests
) Using an F-test to determine whether a single
regression is more efficient than two (or more)
separate regressions on sub-samples.
) The stages in running the Chow test are:
1. Run two separate regressions on the data (say,

. before and after war or policy reform, …) and


save the RSS's: RSS1 & RSS2.
) RSS1 has n1–(K+1) df & RSS2 has n2–(K+1) df.
) The sum RSS1 + RSS2 gives the URSS with
n1+n2–2(K+1) df.
2. Estimate the pooled/combined model (under
H0: no significant change/difference in β's).
) The RSS from this model is the RRSS with
n–(K+1) df; where n = n1+n2.
3. Then, under H0, the test statistic will be:
   Fcal = [(RRSS − URSS)/(K + 1)] / [URSS/(n − 2(K + 1))]
4. Find the critical value: FK+1,n-2(K+1) from table.
5. Reject the null of stable parameters (and favor
Ha: that there is structural break) if Fcal > Ftab.
Example: Suppose we have the following results
from the OLS Estimation of real consumption
on real disposable income:
i. For the period 1974-1991: consi = α1+β1*inci+ui
Consumption = 153.95 + 0.75*Income
p-value: (0.000) (0.000)

. RSS = 4340.26114; R2 = 0.9982


ii. For the period 1992-2005: consi = α2+ β2*inci+ui
Consumption = 1.95 + 0.806*Income
p-value: (0.975) (0.000)
RSS = 10706.2127; R2 = 0.9949
iii. For the period 1974-2005: consi = α+ β*inci+ui
Consumption = 77.64 + 0.79*Income
t-ratio: (4.96) (155.56)
RSS = 22064.6663; R2 = 0.9987
1. URSS = RSS1 + RSS2 = 15046.474
2. RRSS = 22064.6663
   K = 1 and K + 1 = 2; n1 = 18, n2 = 15, n = 33.
3. Thus, Fcal = [(22064.6663 − 15046.474)/2] / [15046.474/29] = 6.7632981
4. p-value = Prob(F > 6.7632981) = 0.003883
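Illustration only (not in the slides): the Chow statistic above can be reproduced directly from the three reported RSS values.

  from scipy import stats

  RSS1, RSS2, RRSS = 4340.26114, 10706.2127, 22064.6663
  n, K = 33, 1                        # n1 + n2, one regressor (income)
  URSS = RSS1 + RSS2
  F = ((RRSS - URSS) / (K + 1)) / (URSS / (n - 2 * (K + 1)))
  print(F, stats.f.sf(F, K + 1, n - 2 * (K + 1)))    # about 6.763 and 0.0039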


5. So, reject the null that there is no structural
break at 1% level of significance.
) The pooled consumption model is inadequate
specification and thus we should run separate
regressions for the two periods.
) The above method of calculating the Chow test

. breaks down if either n1 < K+1 or n2 < K+1.


) Solution: use Chow’s second (predictive) test!
) If, for instance, n2 < K+1, then the F-statistic
will be altered as follows.
) Replace URSS by RSS1 and use the statistic:
Fcal = [(RRSS − RSS1)/n2] / [RSS1/(n1 − (K + 1))]
* The Chow test tells if the parameters differ on
average, but not which parameters differ.
* The Chow test requires that all groups have the
same error variance.

. ) This assumption is questionable: if parameters


can be different, then so can the variances be.
) One method of correcting for unequal error
variances is to use the dummy variable
approach with White's Robust Standard Errors.
B. The Dummy Variables Regression
I. Introduction:
Not all information can easily be quantified. So, we need to incorporate qualitative information.
e.g. 1. Effect of belonging to a certain group:
   - Gender, location, status, occupation
   - Beneficiary of a program/policy
2. Ordinal variables:
   - Answers to yes/no (or scaled) questions ...
Effect of some quantitative variable may differ between groups/categories:
   - Returns to education may differ between sexes or between ethnic groups …
Interest in determinants of belonging to a group:
   - Determinants of being poor … (dummy dependent variable: logit, probit, …)
Dummy Variable: a variable devised to use qualitative information in regression analysis.
A dummy variable takes 2 values: usually 0/1.
e.g. Yi = β0 + β1·D + u;  D = 1 for i ϵ group 1, and D = 0 for i ∉ group 1.
If D = 0: E(Y) = E(Y|D = 0) = β0
If D = 1: E(Y) = E(Y|D = 1) = β0 + β1
Thus, the difference between the two groups (in mean values of Y) is: E(Y|D=1) – E(Y|D=0) = β1.
 So, the significance of the difference between the groups is tested by a t-test of β1 = 0.
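As a quick illustration (toy data with hypothetical variable names; the wage example from the notes follows next), the t-statistic on β1 from the dummy regression is exactly the pooled-variance two-sample t-statistic comparing the two group means:

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(0)
    y0 = rng.normal(10, 3, 60)                       # outcomes for the D = 0 group
    y1 = rng.normal(12, 3, 50)                       # outcomes for the D = 1 group
    y = np.concatenate([y0, y1])
    D = np.concatenate([np.zeros(60), np.ones(50)])

    res = sm.OLS(y, sm.add_constant(D)).fit()
    t_dummy = res.tvalues[1]                         # t-statistic on beta1
    t_groups = stats.ttest_ind(y1, y0, equal_var=True).statistic
    print(t_dummy, t_groups)                         # identical up to rounding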
e.g.: Wage differential between male and female workers.
 Two possible ways: a male dummy or a female dummy.
1. Define a male dummy (male = 1 & female = 0).
   In Stata: reg wage male
   Result:  Yi = 9.45 + 172.84*D + ûi
   p-value:     (0.000)  (0.000)
 Interpretation: the monthly wage of a male worker is, on average, $172.84 higher than that of a female worker.
 This difference is significant at the 1% level.
2. Define a female dummy (female = 1 & male = 0).
   In Stata: reg wage female
   Result:  Yi = 182.29 – 172.84*D + ûi
   p-value:      (0.000)   (0.000)
 Interpretation: the monthly wage of a female worker is, on average, $172.84 lower than that of a male worker.
 This difference is significant at the 1% level.
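A hedged sketch of the two codings (simulated wage data whose group means roughly mimic the figures above; the column names are mine): fitting both regressions shows the slope merely changes sign while the implied group means are identical.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    male = rng.integers(0, 2, 300)
    wage = 9.45 + 172.84 * male + rng.normal(0, 20, 300)   # toy data, not the course dataset
    df = pd.DataFrame({"wage": wage, "male": male, "female": 1 - male})

    m = smf.ols("wage ~ male", data=df).fit()        # intercept: mean wage when male = 0
    f = smf.ols("wage ~ female", data=df).fit()      # intercept: mean wage when female = 0
    print(m.params)
    print(f.params)                                  # slope equals the male-dummy slope with opposite sign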
II. Using the DVR to Test for Structural Break:
 Recall the example of the consumption function:
   period 1: consi = α1 + β1*inci + ui  vs.
   period 2: consi = α2 + β2*inci + ui
 Let’s define a dummy variable D1, where:
   D1 = 1 for the period 1974-1991, and
   D1 = 0 for the period 1992-2005.
 Then: consi = α0 + α1*D1 + β0*inci + β1(D1*inci) + ui
   For period 1: consi = (α0 + α1) + (β0 + β1)*inci + ui
     Intercept = α0 + α1; Slope (= MPC) = β0 + β1.
   For period 2 (base category): consi = α0 + β0*inci + ui
     Intercept = α0; Slope (= MPC) = β0.
 Regressing cons on inc, D1 and (D1*inc) gives:
   cons = 1.95 + 152*D1 + 0.806*inc – 0.056(D1*inc)
   p-value: (0.968) (0.010)  (0.000)     (0.002)
 Substituting D1 = 1 for i ∈ period 1 and D1 = 0 for i ∈ period 2:
   period 1 (1974-1991): cons = 153.95 + 0.75*inc
   period 2 (1992-2005): cons = 1.95 + 0.806*inc
 The Chow test is equivalent to testing α1 = β1 = 0 in:
   cons = 1.95 + 152*D1 + 0.806*inc – 0.056(D1*inc)
 In Stata (after the regression), test that the coefficients on D1 and (D1*inc) are jointly zero.
 This gives F(2, 29) = 6.76; p-value = 0.0039.
 Then, reject H0! There is a structural break!
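A minimal sketch of this DVR test in Python (simulated series built to resemble the two sub-period equations; it will not reproduce the exact F of 6.76 because the original data are not available here):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    year = np.arange(1974, 2006)
    inc = 100 + 10 * (year - 1974) + rng.normal(0, 5, year.size)
    D1 = (year <= 1991).astype(float)                # 1 for 1974-1991, 0 for 1992-2005
    cons = np.where(D1 == 1, 154 + 0.75 * inc, 2 + 0.806 * inc) + rng.normal(0, 4, year.size)
    df = pd.DataFrame({"cons": cons, "inc": inc, "D1": D1, "D1_inc": D1 * inc})

    res = smf.ols("cons ~ inc + D1 + D1_inc", data=df).fit()
    print(res.params)
    print(res.f_test("D1 = 0, D1_inc = 0"))          # joint test of alpha1 = beta1 = 0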
 Comparing the two methods, it is preferable to use the dummy variables regression (DVR).
 This is because with the DVR:
   1. we run only one regression; and
   2. we can test whether the change is in the intercept only, in the slope only, or in both.
 In our example, the change is in both. Why? Because the coefficients on D1 (p = 0.010) and on (D1*inc) (p = 0.002) are each individually significant.
 For a total of m categories, use m – 1 dummies!
 Including m dummies (one for each group) results in perfect multicollinearity (the dummy variable trap).
   e.g.: 2 groups & 2 dummies. With columns [constant, X, D1, D2]:

         ⎡ 1  X11  1  0 ⎤
     X = ⎢ 1  X12  1  0 ⎥
         ⎣ 1  X13  0  1 ⎦

    constant = D1 + D2: an exact linear relationship among the regressors!
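A small numeric check of the trap (illustrative Python): with a constant plus one dummy per group, the regressor matrix loses full column rank because constant = D1 + D2.

    import numpy as np

    n = 10
    x = np.linspace(1.0, 10.0, n)                    # some ordinary regressor
    D1 = (np.arange(n) < 6).astype(float)            # group-1 dummy
    D2 = 1.0 - D1                                     # group-2 dummy
    const = np.ones(n)

    X = np.column_stack([const, x, D1, D2])
    print(np.linalg.matrix_rank(X))                  # 3, not 4: perfect multicollinearity
    print(np.allclose(const, D1 + D2))               # True: the exact linear dependence

Dropping either D1 or D2 (the base category) restores full column rank.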
4.6.4 Simultaneity Bias
 Simultaneity occurs when an equation is part of a simultaneous equations system, such that causation runs from Y to X as well as X to Y.
 In such a case, cov(X, ε) ≠ 0 and the OLS estimators are biased and inconsistent.
 Such situations are pervasive in economic models, so simultaneity bias is a vital issue.
e.g. The Simple Keynesian Consumption Function
 Structural form model: consists of the national accounts identity and a basic consumption function, i.e., a pair of simultaneous equations:
   Yt = Ct + It
   Ct = α + βYt + Ut
 Yt & Ct are endogenous (simultaneously determined) and It is exogenous.
 Reduced form: expresses each endogenous variable as a function of exogenous variables (and/or predetermined variables – lagged endogenous variables, if present) and random error term(s).
 The reduced form is:
   Yt = [1/(1 − β)][α + It + Ut]
   Ct = [1/(1 − β)][α + βIt + Ut]
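For reference, the Yt equation is obtained by substituting the consumption function into the identity and solving for Yt:

   Yt = Ct + It = (α + βYt + Ut) + It
   ⟹ (1 − β)Yt = α + It + Ut
   ⟹ Yt = [1/(1 − β)](α + It + Ut)

Substituting this Yt back into Ct = α + βYt + Ut gives the reduced form for Ct.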
 The reduced form equation for Yt shows that:
   cov(Yt, Ut) = cov[(1/(1 − β))(α + It + Ut), Ut]
               = (1/(1 − β))[cov(α, Ut) + cov(It, Ut) + cov(Ut, Ut)]
               = (1/(1 − β))var(Ut) = σU²/(1 − β) ≠ 0,
   since α is a constant and It is exogenous, so cov(α, Ut) = cov(It, Ut) = 0.
 Thus Yt, in Ct = α + βYt + Ut, is correlated with Ut.
 The OLS estimators for β (MPC) & α (autonomous consumption) are biased and inconsistent.
 Solution: IV/2SLS (instrumental variables / two-stage least squares).
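A minimal Monte Carlo sketch of the problem and the fix (illustrative parameter values only): OLS on the structural consumption function overstates β, while 2SLS using It as the instrument for Yt recovers it.

    import numpy as np

    rng = np.random.default_rng(0)
    alpha, beta = 10.0, 0.8
    n, reps = 200, 2000
    b_ols, b_2sls = [], []

    for _ in range(reps):
        I = rng.normal(50, 10, n)                    # exogenous investment
        U = rng.normal(0, 5, n)
        Y = (alpha + I + U) / (1 - beta)             # reduced form for income
        C = alpha + beta * Y + U                     # structural consumption function

        X = np.column_stack([np.ones(n), Y])
        b_ols.append(np.linalg.lstsq(X, C, rcond=None)[0][1])

        # 2SLS: first stage Y on (1, I); second stage C on (1, Y_hat)
        Z = np.column_stack([np.ones(n), I])
        Y_hat = Z @ np.linalg.lstsq(Z, Y, rcond=None)[0]
        X2 = np.column_stack([np.ones(n), Y_hat])
        b_2sls.append(np.linalg.lstsq(X2, C, rcond=None)[0][1])

    print(np.mean(b_ols), np.mean(b_2sls))           # OLS mean exceeds 0.8; 2SLS mean is close to 0.8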
… THE END …
GOOD LUCK!