
Econ 3044: Introduction to Econometrics

Chapter-3: Multiple Linear Regression

Lemi Taye

Addis Ababa University


lemi.taye@aau.edu.et

December 14, 2019

Overview

1 Motivation for Multiple Regression

2 Mechanics and Interpretation of Ordinary Least Squares

3 The Gauss-Markov Assumptions for Multiple Regression

4 Expected Value and Variances of the OLS Estimators

5 The Gauss-Markov Theorem for Multiple Linear Regression

6 Multiple Regression Analysis: Inference

Motivation for Multiple Regression

The primary drawback in using simple regression analysis for empirical work is that it is very difficult to draw ceteris paribus conclusions about how x affects y.
The key assumption, SLR.4—that all other factors affecting y are
uncorrelated with x—is often unrealistic.
Multiple regression analysis is more amenable to ceteris paribus
analysis because it allows us to explicitly control for many other
factors that simultaneously affect the dependent variable.
Naturally, if we add more factors to our model that are useful for
explaining y, then more of the variation in y can be explained.
An additional advantage of multiple regression analysis is that it can
incorporate fairly general functional form relationships.

The Model with Two Independent Variables

We begin with the following simple example to show how multiple regression analysis can be used to solve problems that cannot be solved by simple regression.
The example is a simple variation of the wage equation introduced in
Chapter 2 for obtaining the effect of education on hourly wage:

wage = β0 + β1 educ + β2 exper + u, (1)

where exper is years of labor market experience.


Thus, wage is determined by the two explanatory or independent
variables, education and experience, and by other unobserved factors,
which are contained in u.
We are still primarily interested in the effect of educ on wage, holding
fixed all other factors affecting wage; that is, we are interested in the
parameter β1 .

The Model with Two Independent Variables

Compared with a simple regression analysis relating wage to educ, equation (1) effectively takes exper out of the error term and puts it explicitly in the equation.
Not surprisingly, just as with simple regression, we will have to make
assumptions about how u in (1) is related to the independent
variables, educ and exper.
However, there is one thing of which we can be confident: because (1)
contains experience explicitly, we will be able to measure the effect of
education on wage, holding experience fixed.
In a simple regression analysis—which puts exper in the error
term—we would have to assume that experience is uncorrelated with
education, a tenuous assumption.

The Model with Two Independent Variables

Generally, we can write a model with two independent variables as

y = β0 + β1 x1 + β2 x2 + u, (2)

where
β0 is the intercept.
β1 measures the change in y with respect to x1 , holding other factors
fixed.
β2 measures the change in y with respect to x2 , holding other factors
fixed.

The Model with Two Independent Variables

Multiple regression analysis is also useful for generalizing functional relationships between variables.
As an example, suppose family consumption (cons) is a quadratic
function of family income (inc):

cons = β0 + β1 inc + β2 inc2 + u, (3)

where u contains other factors affecting consumption.


In this model, consumption depends on only one observed factor,
income; so it might seem that it can be handled in a simple regression
framework.
But the model falls outside simple regression because it contains two
functions of income, inc and inc2 (and therefore three parameters, β0 ,
β1 , and β2 ).

The Model with Two Independent Variables

Mechanically, there will be no difference in using the method of ordinary least squares (introduced in the next section) to estimate equations as different as (1) and (3).
Each equation can be written as (2), which is all that matters for
computation.
There is, however, an important difference in how one interprets the
parameters.

The Model with Two Independent Variables

In the model with two independent variables, the key assumption about how u is related to x1 and x2 is

E(u|x1 , x2 ) = 0. (4)

The interpretation of condition (4) is similar to the interpretation of Assumption SLR.4 for simple regression analysis.
It means that, for any values of x1 and x2 in the population, the
average of the unobserved factors is equal to zero.
As with simple regression, the important part of the assumption is
that the expected value of u is the same for all combinations of x1
and x2 ; that this common value is zero is no assumption at all as long
as the intercept β0 is included in the model.

The Model with Two Independent Variables

In equation (1), the assumption is E(u|educ, exper) = 0.


This implies that other factors affecting wage are not related on
average to educ and exper.
Therefore, if we think innate ability is part of u, then we will need
average ability levels to be the same across all combinations of
education and experience in the working population. This may or may
not be true.
When applied to the quadratic consumption function in (3), the zero
conditional mean assumption has a slightly different interpretation.
Written literally, equation (4) becomes E(u|inc, inc2 ) = 0. Since inc2
is known when inc is known, including inc2 is redundant:
E(u|inc, inc2 ) = 0 is the same as E(u|inc) = 0.

The Model with k Independent Variables

Multiple regression analysis allows many observed factors to affect y.


In the wage example, we might also include amount of job training,
years of tenure with the current employer, measures of ability, and
even demographic variables like the number of siblings or mother’s
education.
The general multiple linear regression (MLR) model (also called the multiple regression model) can be written in the population as

y = β0 + β1 x1 + β2 x2 + β3 x3 + · · · + βk xk + u, (5)

where
β0 is the intercept.
β1 is the parameter associated with x1 .
β2 is the parameter associated with x2 , and so on.

The Model with k Independent Variables

Since there are k independent variables and an intercept, equation (5) contains k + 1 (unknown) population parameters.
Just as in simple regression, the variable u is the error term or
disturbance. It contains factors other than x1 , x2 , . . . , xk that affect y.
No matter how many explanatory variables we include in our model,
there will always be factors we cannot include, and these are
collectively contained in u.

The Model with k Independent Variables

When applying the general multiple regression model, we must know how to interpret the parameters.
Suppose that CEO salary (salary) is related to firm sales (sales) and
CEO tenure (ceoten) with the firm by

log(salary) = β0 + β1 log(sales) + β2 ceoten + β3 ceoten2 + u.

This fits into the multiple regression model (with k = 3) by defining y = log(salary), x1 = log(sales), x2 = ceoten, and x3 = ceoten2.
As we know from Chapter 2, the parameter β1 is the ceteris paribus
elasticity of salary with respect to sales.

The Model with k Independent Variables

If β3 = 0, then 100β2 is approximately the ceteris paribus percentage increase in salary when ceoten increases by one year.
When β3 ≠ 0, the effect of ceoten on salary is more complicated.
The term “linear” in a multiple linear regression model means that
equation (5) is linear in the parameters, βj .
The key assumption for the general multiple regression model is easy
to state in terms of a conditional expectation:

E(u|x1 , x2 , . . . , xk ) = 0. (6)

The Model with k Independent Variables

At a minimum, equation (6) requires that all factors in the unobserved error term be uncorrelated with the explanatory variables.
It also means that we have correctly accounted for the functional
relationships between the explained and explanatory variables.
Any problem that causes u to be correlated with any of the
independent variables causes (6) to fail.
We will show that assumption (6) implies that OLS is unbiased and
will derive the bias that arises when a key variable has been omitted
from the equation.

Obtaining the OLS Estimates

In the general case with k independent variables, we seek estimates β̂0, β̂1, . . . , β̂k in the equation

ŷ = β̂0 + β̂1 x1 + β̂2 x2 + · · · + β̂k xk. (7)

The OLS estimates, k + 1 of them, are chosen to minimize the sum of squared residuals:

∑(yi − β̂0 − β̂1 xi1 − · · · − β̂k xik)²,

where xij means the ith observation on the jth independent variable.
This minimization problem can be solved using multivariable calculus.
This leads to k + 1 linear equations in k + 1 unknowns β̂0, β̂1, . . . , β̂k:

Obtaining the OLS Estimates

∑(yi − β̂0 − β̂1 xi1 − · · · − β̂k xik) = 0
∑ xi1(yi − β̂0 − β̂1 xi1 − · · · − β̂k xik) = 0
∑ xi2(yi − β̂0 − β̂1 xi1 − · · · − β̂k xik) = 0          (8)
⋮
∑ xik(yi − β̂0 − β̂1 xi1 − · · · − β̂k xik) = 0.

These are often called the OLS first order conditions.

Obtaining the OLS Estimates

For even moderately sized n and k, solving the equations in (8) by hand calculations is tedious.
Nevertheless, modern computers running standard statistics and
econometrics software can solve these equations with large n and k
very quickly.
As in simple regression analysis, equation (7) is called the OLS
regression line or the sample regression function (SRF).
We call βb0 the OLS intercept estimate and βb1 , βb2 , . . . , βbk the OLS
slope estimates (corresponding to the independent variables x1 , x2 ,
. . . , xk ).
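To make the mechanics concrete, here is a minimal sketch (not from the original slides) of solving the k + 1 first order conditions numerically with NumPy; the data are simulated purely for illustration.

```python
# Minimal sketch: solving the OLS first order conditions in (8) with NumPy.
# The data below are simulated for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 2
x = rng.normal(size=(n, k))
u = rng.normal(size=n)
y = 1.0 + 0.5 * x[:, 0] - 0.3 * x[:, 1] + u      # "true" betas: 1.0, 0.5, -0.3

X = np.column_stack([np.ones(n), x])             # add the intercept column
# The k+1 equations in (8) are X'(y - Xb) = 0, i.e. (X'X) b = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                                  # OLS estimates b0, b1, b2
```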

Interpreting the OLS Regression Equation

More important than the details underlying the computation of the β̂j is the interpretation of the estimated equation.
We begin with the case of two independent variables:

ŷ = β̂0 + β̂1 x1 + β̂2 x2. (9)

The intercept β̂0 in equation (9) is the predicted value of y when x1 = 0 and x2 = 0.
Sometimes, setting x1 and x2 both equal to zero is an interesting
scenario; in other cases, it will not make sense.
Nevertheless, the intercept is always needed to obtain a prediction of y
from the OLS regression line, as (9) makes clear.

Interpreting the OLS Regression Equation
The estimates β̂1 and β̂2 have partial effect, or ceteris paribus, interpretations. From equation (9), we have

∆ŷ = β̂1 ∆x1 + β̂2 ∆x2,

so we can obtain the predicted change in y given the changes in x1 and x2.
In particular, when x2 is held fixed, so that ∆x2 = 0, then

∆ŷ = β̂1 ∆x1,

holding x2 fixed.
The key point is that, by including x2 in our model, we obtain a coefficient on x1 with a ceteris paribus interpretation. This is why multiple regression analysis is so useful. Similarly,

∆ŷ = β̂2 ∆x2,

holding x1 fixed.
Interpreting the OLS Regression Equation
Example (Determinants of College GPA)
Using the data in GPA1 we obtain the following OLS regression line to
predict college GPA from high school GPA and achievement test score:

colGPA-hat = 1.29 + .453 hsGPA + .0094 ACT
n = 141.

How do we interpret this equation? First, the intercept 1.29 is the predicted college GPA if hsGPA and ACT are both set as zero. Since no one who attends college has either a zero high school GPA or a zero on the achievement test, the intercept in this equation is not, by itself, meaningful.

More interesting estimates are the slope coefficients on hsGPA and ACT. As expected, there is a positive partial relationship between colGPA and hsGPA: Holding ACT fixed, another point on hsGPA is associated with .453 of a point on the college GPA, or almost half a point.
Interpreting the OLS Regression Equation

Example (Determinants of College GPA (continued ))


In other words, if we choose two students, A and B, and these students
have the same ACT score, but the high school GPA of Student A is one
point higher than the high school GPA of Student B, then we predict
Student A to have a college GPA .453 higher than that of Student B. (This
says nothing about any two actual people, but it is our best prediction.)

The sign on ACT implies that, while holding hsGPA fixed, a change in the ACT score of 10 points affects colGPA by less than one-tenth of a
point. This is a small effect, and it suggests that, once high school GPA is
accounted for, the ACT score is not a strong predictor of college GPA.

Interpreting the OLS Regression Equation

The case with more than two independent variables is similar. The OLS regression line in terms of changes is

∆ŷ = β̂1 ∆x1 + β̂2 ∆x2 + · · · + β̂k ∆xk.

The coefficient on x1 measures the change in ŷ due to a one-unit increase in x1, holding all other independent variables fixed. That is,

∆ŷ = β̂1 ∆x1,

holding x2, x3, . . . , xk fixed.
Thus, we have controlled for the variables x2, x3, . . . , xk when estimating the effect of x1 on y. The other coefficients have a similar interpretation.

Interpreting the OLS Regression Equation

Example (Hourly Wage Equation)


Using the 526 observations on workers in WAGE1, we include educ (years
of education), exper (years of labor market experience), and tenure (years
with the current employer) in an equation explaining log(wage). The
estimated equation is

log(wage)-hat = .284 + .092 educ + .0041 exper + .022 tenure
n = 526.

As in the simple regression case, the coefficients have a percentage


interpretation. The only difference here is that they also have a ceteris
paribus interpretation. The coefficient .092 means that, holding exper and
tenure fixed, another year of education is predicted to increase log(wage)
by .092, which translates into an approximate 9.2% [100(.092)] increase in
wage.

Interpreting the OLS Regression Equation

Example (Hourly Wage Equation (continued ))


Alternatively, if we take two people with the same levels of experience and job tenure, the coefficient on educ is the proportionate difference in predicted wage when their education levels differ by one year. This measure of the return to education at least keeps two important productivity factors fixed; whether it is a good estimate of the ceteris paribus return to another year of education requires us to study the statistical properties of OLS.

Changing More Than One Independent Variable
Simultaneously

Sometimes, we want to change more than one independent variable at the same time to find the resulting effect on the dependent variable.
For instance, in the previous example, we can obtain the estimated
effect on wage when an individual stays at the same firm for another
year: exper (general workforce experience) and tenure both increase
by one year. The total effect (holding educ fixed) is

∆log(wage)-hat = .0041 ∆exper + .022 ∆tenure = .0041 + .022 = .0261,

or about 2.6%.
Since exper and tenure each increase by one year, we just add the
coefficients on exper and tenure and multiply by 100 to turn the
effect into a percentage.
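As an illustration only, the following sketch shows how an equation like this could be estimated with the statsmodels formula interface and how the combined exper/tenure effect is computed from the coefficients. It assumes the WAGE1 data are available locally as a CSV file with columns wage, educ, exper, and tenure; the file name is a placeholder.

```python
# Sketch: estimating log(wage) = b0 + b1*educ + b2*exper + b3*tenure + u
# and the combined effect of one more year of exper and tenure.
# Assumes a DataFrame `df` with columns wage, educ, exper, tenure
# (e.g. read from a local copy of the WAGE1 data; the path is a placeholder).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("wage1.csv")                      # placeholder path
res = smf.ols("np.log(wage) ~ educ + exper + tenure", data=df).fit()
print(res.params)                                  # should be close to the slide values

# Effect of staying one more year at the same firm: exper and tenure both rise by 1.
combined = res.params["exper"] + res.params["tenure"]
print(f"approx. {100 * combined:.1f}% higher wage")
```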

OLS Fitted Values and Residuals

After obtaining the OLS regression line, we can obtain a fitted or predicted value for each observation. For observation i, the fitted value is simply

ŷi = β̂0 + β̂1 xi1 + β̂2 xi2 + · · · + β̂k xik,

which is just the predicted value obtained by plugging the values of the independent variables for observation i into equation (7).
Normally, the actual value yi for any observation i will not equal the predicted value, ŷi.
The residual for observation i is defined just as in the simple regression case,

ûi = yi − ŷi.

OLS Fitted Values and Residuals

The OLS fitted values and residuals have some important properties that are immediate extensions from the single variable case:

1 The sample average of the residuals is zero, and so ȳ equals the sample average of the fitted values ŷi.
2 The sample covariance between each independent variable and the OLS residuals is zero. Consequently, the sample covariance between the OLS fitted values and the OLS residuals is zero.
3 The point (x̄1 , x̄2 , . . . , x̄k , ȳ) is always on the OLS regression line:
ȳ = β̂0 + β̂1 x̄1 + β̂2 x̄2 + · · · + β̂k x̄k.
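These properties are easy to verify numerically. A minimal sketch, using simulated data rather than any dataset from the slides:

```python
# Sketch: checking the algebraic properties of OLS fitted values and residuals.
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2 + 1.5 * x1 - 0.8 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ b
u_hat = y - y_hat

print(np.isclose(u_hat.mean(), 0))                    # property 1: residuals average to zero
print(np.allclose(X[:, 1:].T @ u_hat, 0))             # property 2: zero covariance with each x_j
print(np.isclose(y.mean(), b[0] + b[1]*x1.mean() + b[2]*x2.mean()))  # property 3
```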

OLS Fitted Values and Residuals

The first two properties are immediate consequences of the set of equations used to obtain the OLS estimates.
The first equation in (8) says that the sum of the residuals is zero.
The remaining equations are of the form ∑ xij ûi = 0, which implies that each independent variable has zero sample covariance with ûi.
Property (3) follows immediately from property (1).

Goodness-of-Fit

As with simple regression, we can define the total sum of squares (SST), the explained sum of squares (SSE), and the residual sum of squares or sum of squared residuals (SSR).
Using the same argument as in the simple regression case, we can show that

SST = SSE + SSR.

Just as in the simple regression case, the R-squared is defined to be

R² ≡ SSE/SST = 1 − SSR/SST,

and it is interpreted as the proportion of the sample variation in yi that is explained by the OLS regression line.

Goodness-of-Fit: Adjusted R-Squared

An important property of R² is that it is a nondecreasing function of the number of explanatory variables or regressors present in the model.
As the number of regressors increases, R² almost invariably increases and never decreases.
To see this, recall the definition of the coefficient of determination:

R² = 1 − SSR/SST = 1 − ∑ûi² / ∑(yi − ȳ)²          (10)

Now ∑(yi − ȳ)² is independent of the number of x variables in the model.
The SSR, ∑ûi², however, depends on the number of regressors present in the model.

Goodness-of-Fit: Adjusted R-Squared

Intuitively, it is clear that as the number of x variables increases, ∑ûi² is likely to decrease; hence R² as defined in (10) will increase.
In view of this, in comparing two regression models with the same dependent variable but differing number of x variables, one should be very wary of choosing the model with the highest R².
To compare two R² terms, one must take into account the number of x variables present in the model.
This can be done readily if we consider an alternative coefficient of determination, which is as follows:

R̄² = 1 − [SSR/(n − k − 1)] / [SST/(n − 1)]          (11)

where k is the number of independent variables in the model.

The R² thus defined is known as the adjusted R², denoted by R̄².

Goodness-of-Fit: Adjusted R-Squared

The term adjusted means adjusted for the df associated with the sums
of squares entering into (10).
The primary attractiveness of R̄2 is that it imposes a penalty for
adding additional independent variables to a model.
It is easy to see that R̄² and R² are related because, substituting (10) into (11), we obtain

R̄² = 1 − (1 − R²)(n − 1)/(n − k − 1).          (12)
It is immediately apparent from (12) that
1 for k > 1, R̄2 < R2 which implies that as the number of x variables
increases, the adjusted R2 increases less than the unadjusted R2 ; and
2 R̄2 can be negative, although R2 is necessarily nonnegative. In case
R̄2 turns out to be negative in an application, its value is taken as
zero.
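A numerical sketch of (10) and (11), again with simulated data: adding an irrelevant regressor never lowers R², while the adjusted R̄² typically falls.

```python
# Sketch: R-squared never falls when a regressor is added; adjusted R-squared can.
import numpy as np

def ols_r2(X, y):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    ssr = np.sum((y - X @ b) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    n, kp1 = X.shape                       # kp1 = k + 1 (regressors plus intercept)
    r2 = 1 - ssr / sst
    r2_adj = 1 - (ssr / (n - kp1)) / (sst / (n - 1))
    return r2, r2_adj

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
junk = rng.normal(size=n)                  # unrelated to y
y = 1 + 0.5 * x1 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([X_small, junk])
print(ols_r2(X_small, y))
print(ols_r2(X_big, y))                    # R2 weakly higher, adjusted R2 typically lower
```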
The Gauss-Markov Assumptions for Multiple Regression

Assumption MLR.1 (Linear in Parameters)


The model in the population can be written as

y = β0 + β1 x1 + β2 x2 + · · · + βk xk + u,

where β0 , β1 , . . . , βk are the unknown parameters (constants) of interest and u is an unobserved random error or disturbance term.

Assumption MLR.2 (Random Sampling)


We have a random sample of n observations,
{(xi1 , xi2 , . . . , xik , yi ) : i = 1, 2, . . . , n}, following the population model in
Assumption MLR.1.

The Gauss-Markov Assumptions for Multiple Regression
Assumption MLR.3 (No Perfect Collinearity)
In the sample (and therefore in the population), none of the independent variables is constant, and there are no exact linear relationships among the independent variables.

Assumption MLR.4 (Zero Conditional Mean)

The error u has an expected value of zero given any values of the independent variables. In other words,

E(u|x1 , x2 , . . . , xk ) = 0.

Assumption MLR.5 (Homoskedasticity)

The error u has the same variance given any value of the explanatory variables. In other words,

Var(u|x1 , x2 , . . . , xk ) = σ².


The Gauss-Markov Assumptions for Multiple Regression

Assumption MLR.3 is more complicated than its counterpart for simple regression because we must now look at relationships between all independent variables.
If an independent variable is an exact linear combination of the other
independent variables, then we say the model suffers from perfect
collinearity, and it cannot be estimated by OLS.
It is important to note that Assumption MLR.3 does allow the
independent variables to be correlated; they just cannot be perfectly
correlated.
If we did not allow for any correlation among the independent
variables, then multiple regression would be of very limited use for
econometric analysis.

The Gauss-Markov Assumptions for Multiple Regression

The simplest way that two independent variables can be perfectly correlated is when one variable is a constant multiple of another. This can happen when a researcher inadvertently puts the same variable measured in different units into a regression equation.
For example, in estimating a relationship between consumption and
income, it makes no sense to include as independent variables income
measured in dollars as well as income measured in thousands of
dollars. One of these is redundant.
What sense would it make to hold income measured in dollars fixed
while changing income measured in thousands of dollars?

Unbiasedness of the OLS Estimators

UNBIASEDNESS OF OLS
Under Assumptions MLR.1 through MLR.4,

E(β̂j) = βj , j = 0, 1, . . . , k,

for any values of the population parameter βj . In other words, the OLS
estimators are unbiased estimators of the population parameters.

The Variance of the OLS Estimators

SAMPLING VARIANCES OF THE OLS SLOPE ESTIMATORS


Under Assumptions MLR.1 through MLR.5, conditional on the sample values of the independent variables,

Var(β̂j) = σ² / [SSTj (1 − Rj²)],          (13)

for j = 1, 2, . . . , k, where SSTj = ∑(xij − x̄j)² is the total sample variation in xj , and Rj² is the R-squared from regressing xj on all other independent variables (and including an intercept).

The Components of the OLS Variances

Equation (13) shows that the variance of βbj depends on three factors:
σ 2 , SSTj , and Rj2 .
The Error Variance, σ 2
From equation (13), a larger σ 2 means larger sampling variances for
the OLS estimators.
More “noise” in the equation (a larger σ 2 ) makes it more difficult to
estimate the partial effect of any of the independent variables on y,
and this is reflected in higher variances for the OLS slope estimators.
Because σ 2 is a feature of the population, it has nothing to do with
the sample size.

The Components of the OLS Variances

The Total Sample Variation in xj , SSTj


From equation (13), we see that the larger the total variation in xj is,
the smaller is Var(βbj ).
Thus, everything else being equal, for estimating βj we prefer to have
as much sample variation in xj as possible.

The Linear Relationships among the Independent Variables, Rj2


The term Rj2 in equation (13) is obtained from a regression involving
only the independent variables in the original model, where xj plays
the role of a dependent variable.

The Components of the OLS Variances

Consider first the k = 2 case: y = β0 + β1 x1 + β2 x2 + u. Then, Var(β̂1) = σ²/[SST1 (1 − R1²)], where R1² is the R-squared from the simple regression of x1 on x2 (and an intercept).
Because the R-squared measures goodness-of-fit, a value of R1² close to one indicates that x2 explains much of the variation in x1 in the sample. This means that x1 and x2 are highly correlated.
As R1² increases to one, Var(β̂1) gets larger and larger. Thus, a high degree of linear relationship between x1 and x2 can lead to large variances for the OLS slope estimators.
High (but not perfect) correlation between two or more independent
variables is called multicollinearity.
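The role of Rj² can be seen directly by running the auxiliary regression of one regressor on the others and computing 1/(1 − Rj²), the variance inflation factor. A minimal sketch with simulated, deliberately correlated regressors:

```python
# Sketch: computing R_1^2 from the auxiliary regression of x1 on x2,
# and the resulting variance inflation factor 1/(1 - R_1^2).
import numpy as np

rng = np.random.default_rng(3)
n = 300
x2 = rng.normal(size=n)
x1 = 0.9 * x2 + 0.2 * rng.normal(size=n)       # x1 highly correlated with x2

Z = np.column_stack([np.ones(n), x2])          # auxiliary regression: x1 on x2
g = np.linalg.lstsq(Z, x1, rcond=None)[0]
resid = x1 - Z @ g
r2_1 = 1 - resid.var() / x1.var()
print(f"R_1^2 = {r2_1:.3f}, variance inflation 1/(1 - R_1^2) = {1/(1 - r2_1):.1f}")
```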

Estimating σ 2

The unbiased estimator of σ² in the general multiple regression case is

σ̂² = ∑ûi² / (n − k − 1) = SSR / (n − k − 1).          (14)

The term n − k − 1 in (14) is the degrees of freedom (df) for the general OLS problem with n observations and k independent variables.
Since there are k + 1 parameters in a regression model with k independent variables and an intercept, we can write

df = n − (k + 1)
= (number of observations) − (number of estimated parameters).

Standard Errors of the OLS Estimators

The positive square root of σ̂², denoted σ̂, is called the standard error of the regression (SER).
For constructing confidence intervals and conducting tests, we will need to estimate the standard deviation of β̂j , which is just the square root of the variance:

sd(β̂j) = σ / [SSTj (1 − Rj²)]^(1/2).

Since σ is unknown, we replace it with its estimator, σ̂. This gives us the standard error of β̂j :

se(β̂j) = σ̂ / [SSTj (1 − Rj²)]^(1/2).
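A sketch tying the last two results together: compute σ̂² = SSR/(n − k − 1) and the standard error of one slope from the formula above, and compare with what statsmodels reports. The data and variable names are simulated and illustrative only.

```python
# Sketch: sigma-hat^2 = SSR/(n-k-1) and se(beta_1_hat) = sigma-hat / sqrt(SST_1 (1 - R_1^2)).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 400
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)
y = 1 + 0.5 * x1 - 0.3 * x2 + rng.normal(size=n)
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

res = smf.ols("y ~ x1 + x2", data=df).fit()
sigma2_hat = res.ssr / res.df_resid                 # SSR / (n - k - 1)

aux = smf.ols("x1 ~ x2", data=df).fit()             # auxiliary regression for R_1^2
sst1 = np.sum((x1 - x1.mean()) ** 2)
se_b1 = np.sqrt(sigma2_hat / (sst1 * (1 - aux.rsquared)))
print(se_b1, res.bse["x1"])                         # the two should agree
```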

Efficiency of OLS: The Gauss-Markov Theorem

The Gauss-Markov Theorem justifies the use of the OLS method rather than using a variety of competing estimators.
We know one justification for OLS already: under Assumptions MLR.1
through MLR.4, OLS is unbiased.
However, there are many unbiased estimators of the βj under these
assumptions.
If we limit the class of competing estimators appropriately, then we
can show that OLS is best within this class.
For the current theorem, best is defined as having the smallest
variance. Given two unbiased estimators, it is logical to prefer the one
with the smallest variance.

Efficiency of OLS: The Gauss-Markov Theorem

THEOREM 3.1: GAUSS-MARKOV THEOREM


Under Assumptions MLR.1 through MLR.5, βb0 , βb1 , . . . , βbk are the best
linear unbiased estimators (BLUEs) of β0 , β1 , . . . , βk , respectively.

The Gauss-Markov Theorem says that, under Assumptions MLR.1


through MLR.5, for any estimator β̃j that is linear and unbiased,
Var(βbj ) ≤ Var(β̃j ), and the inequality is usually strict.
In other words, in the class of linear unbiased estimators, OLS has the
smallest variance (under the five Gauss-Markov assumptions).

Sampling Distributions of the OLS Estimators

Knowing the expected value and variance of the OLS estimators is useful for describing the precision of the OLS estimators.
However, in order to perform statistical inference, we need to know
more than just the first two moments of βbj ; we need to know the full
sampling distribution of the βbj .
When we condition on the values of the independent variables in our
sample, it is clear that the sampling distributions of the OLS
estimators depend on the underlying distribution of the errors.
To make the sampling distributions of the βbj tractable, we now
assume that the unobserved error is normally distributed in the
population. We call this the normality assumption.

Sampling Distributions of the OLS Estimators

Assumption MLR.6 (Normality)


The population error u is independent of the explanatory variables
x1 , x2 , . . . , xk and is normally distributed with zero mean and variance σ 2 :
u ∼ Normal(0, σ 2 ).

Assumption MLR.6 is much stronger than any of our previous assumptions.
In fact, since u is independent of the xj under MLR.6,
E(u|x1 , . . . , xk ) = E(u) = 0 and Var(u|x1 , . . . , xk ) = Var(u) = σ 2 .
Thus, if we make Assumption MLR.6, then we are necessarily
assuming MLR.4 and MLR.5.

Sampling Distributions of the OLS Estimators

For cross-sectional regression applications, Assumptions MLR.1 through MLR.6 are called the classical linear model (CLM) assumptions.
Thus, we will refer to the model under these six assumptions as the
classical linear model.
Under the CLM assumptions, the OLS estimators βb0 , βb1 , . . . , βbk have
a stronger efficiency property than they would under the
Gauss-Markov assumptions.
A succinct way to summarize the population assumptions of the CLM
is

y|x ∼ Normal(β0 + β1 x1 + β2 x2 + β3 x3 + · · · + βk xk , σ 2 ),

where x is shorthand for (x1 , . . . , xk ).

Sampling Distributions of the OLS Estimators

Thus, conditional on x, y has a normal distribution with mean linear in x1 , . . . , xk and a constant variance.
For a single independent variable x, this situation is shown in Figure 1.

Sampling Distributions of the OLS Estimators

Figure 1: The homoskedastic normal distribution with a single explanatory variable.
Sampling Distributions of the OLS Estimators

Normality of the error term translates into normal sampling distributions of the OLS estimators:

THEOREM 3.2: NORMAL SAMPLING DISTRIBUTIONS


Under the CLM assumptions MLR.1 through MLR.6, conditional on the
sample values of the independent variables,

β̂j ∼ Normal[βj , Var(β̂j)],          (15)

where Var(β̂j) was given in (13). Therefore,

(β̂j − βj) / sd(β̂j) ∼ Normal(0, 1).

Sampling Distributions of the OLS Estimators

The second part of this theorem follows immediately from the fact
that when we standardize a normal random variable by subtracting off
its mean and dividing by its standard deviation, we end up with a
standard normal random variable.
In addition to (15), any linear combination of the β̂0 , β̂1 , . . . , β̂k is also normally distributed. These facts underlie the testing results in the remainder of this chapter.
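Theorem 3.2 can be checked by simulation: under a model satisfying MLR.1 through MLR.6, the standardized estimator (β̂1 − β1)/sd(β̂1) behaves like a standard normal draw across repeated samples. A minimal Monte Carlo sketch (simulated data, one regressor for simplicity):

```python
# Sketch: Monte Carlo check that (beta_1_hat - beta_1) / sd(beta_1_hat) ~ Normal(0, 1)
# when the errors are normal (Assumption MLR.6).
import numpy as np

rng = np.random.default_rng(5)
n, reps, beta1, sigma = 50, 5000, 0.5, 1.0
x = rng.normal(size=n)                         # regressor held fixed across replications
X = np.column_stack([np.ones(n), x])
sst1 = np.sum((x - x.mean()) ** 2)
sd_b1 = sigma / np.sqrt(sst1)                  # true sd with one regressor (R_1^2 = 0)

z = np.empty(reps)
for r in range(reps):
    y = 1.0 + beta1 * x + sigma * rng.normal(size=n)
    b = np.linalg.solve(X.T @ X, X.T @ y)
    z[r] = (b[1] - beta1) / sd_b1

print(z.mean(), z.std())                       # should be close to 0 and 1
```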

Testing Hypotheses: The t Test

For a full understanding of hypothesis testing, one must remember that the βj are unknown features of the population, and we will never know them with certainty.
Nevertheless, we can hypothesize about the value of βj and then use
statistical inference to test our hypothesis.

Testing Hypotheses: The t Test

In order to construct hypotheses tests, we need the following result:

THEOREM 3.3: t DISTRIBUTION FOR THE STANDARDIZED


ESTIMATORS
Under the CLM assumptions MLR.1 through MLR.6,

(β̂j − βj) / se(β̂j) ∼ tn−k−1 = tdf ,          (16)

where k + 1 is the number of unknown parameters in the population model y = β0 + β1 x1 + β2 x2 + β3 x3 + · · · + βk xk + u (k slope parameters and the intercept β0 ) and n − k − 1 is the degrees of freedom (df).

Testing Hypotheses: The t Test

This result differs from Theorem 3.2 in some notable respects.


Theorem 3.2 showed that, under the CLM assumptions, (β̂j − βj)/sd(β̂j) ∼ Normal(0, 1).
The t distribution in (16) comes from the fact that the constant σ in sd(β̂j) has been replaced with the random variable σ̂.
Essentially, (16) can be written as the ratio of the standard normal random variable (β̂j − βj)/sd(β̂j) over the square root of σ̂²/σ².
These random variables can be shown to be independent, and (n − k − 1)σ̂²/σ² ∼ χ²n−k−1. The result then follows from the definition of a t random variable.

Testing Hypotheses: The t Test

Theorem 3.3 is important in that it allows us to test hypotheses involving the βj .
In most applications, our primary interest lies in testing the null
hypothesis
H0 : βj = 0, (17)
where j corresponds to any of the k independent variables.
Since βj measures the partial effect of xj on (the expected value of)
y, after controlling for all other independent variables, (17) means
that, once x1 , x2 , . . . , xj−1 , xj+1 , . . . , xk have been accounted for,
xj has no effect on the expected value of y.
We cannot state the null hypothesis as “xj does have a partial effect
on y” because this is true for any value of βj other than zero.

Testing Hypotheses: The t Test

As an example, consider the wage equation

log(wage) = β0 + β1 educ + β2 exper + β3 tenure + u.

The null hypothesis H0 : β2 = 0 means that, once education and tenure have been accounted for, the number of years in the workforce (exper) has no effect on hourly wage.
This is an economically interesting hypothesis. If it is true, it implies
that a person’s work history prior to the current employment does not
affect wage.
If β2 > 0, then prior work experience contributes to productivity, and
hence to wage.

Testing Hypotheses: The t Test

The statistic we use to test (17) (against any alternative) is called “the” t statistic or “the” t ratio of β̂j and is defined as

tβ̂j ≡ β̂j / se(β̂j).          (18)

For particular applications, it is helpful to index t statistics using the name of the independent variable; for example, teduc would be the t statistic for β̂educ .
The t statistic for β̂j is simple to compute given β̂j and its standard error. In fact, most regression packages do the division for you and report the t statistic along with each coefficient and its standard error.
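For example, with statsmodels the division in (18) can be reproduced by hand and compared with the package's reported t values. A minimal sketch with simulated data and illustrative variable names:

```python
# Sketch: t statistics are just coefficient / standard error, as in (18).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 200
educ = rng.normal(12, 2, size=n)
exper = rng.normal(10, 5, size=n)
lwage = 0.3 + 0.09 * educ + 0.004 * exper + rng.normal(scale=0.4, size=n)
df = pd.DataFrame({"lwage": lwage, "educ": educ, "exper": exper})

res = smf.ols("lwage ~ educ + exper", data=df).fit()
print(res.params / res.bse)     # t statistics computed by hand
print(res.tvalues)              # the same numbers reported by the package
```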

Testing Hypotheses: The t Test

Before discussing how to use (18) formally to test H0 : βj = 0, it is useful to see why tβ̂j has features that make it reasonable as a test statistic to detect βj ≠ 0.
First, since se(β̂j) is always positive, tβ̂j has the same sign as β̂j : if β̂j is positive, then so is tβ̂j , and if β̂j is negative, so is tβ̂j .
Second, for a given value of se(β̂j), a larger value of β̂j leads to larger values of tβ̂j . If β̂j becomes more negative, so does tβ̂j .
Since we are testing H0 : βj = 0, it is only natural to look at our unbiased estimator of βj , β̂j , for guidance.
In any interesting application, the point estimate β̂j will never exactly be zero, whether or not H0 is true. The question is: How far is β̂j from zero?

Testing Hypotheses: The t Test

A sample value of β̂j very far from zero provides evidence against H0 : βj = 0.
However, we must recognize that there is a sampling error in our estimate β̂j , so the size of β̂j must be weighed against its sampling error.
Since the standard error of β̂j is an estimate of the standard deviation of β̂j , tβ̂j measures how many estimated standard deviations β̂j is away from zero.
Values of tβ̂j sufficiently far from zero will result in a rejection of H0 . The precise rejection rule depends on the alternative hypothesis and the chosen significance level of the test.

Testing Hypotheses: The t Test

Determining a rule for rejecting (17) at a given significance level—that is, the probability of rejecting H0 when it is true—requires knowing the sampling distribution of tβ̂j when H0 is true.
From Theorem 3.3, we know this to be tn−k−1 . This is the key
theoretical result needed for testing (17).
It is important to remember that we are testing hypotheses about the
population parameters. We are not testing hypotheses about the
estimates from a particular sample.

Testing against One-Sided Alternatives

To determine a rule for rejecting H0 , we need to decide on the relevant alternative hypothesis. First, consider a one-sided alternative of the form
H1 : βj > 0. (19)
How should we choose a rejection rule? We must first decide on a
significance level (“level” for short) or the probability of rejecting H0
when it is in fact true.
For concreteness, suppose we have decided on a 5% significance level,
as this is the most popular choice. Thus, we are willing to mistakenly
reject H0 when it is true 5% of the time.
Now, while tβbj has a t distribution under H0 —so that it has zero
mean—under the alternative βj > 0, the expected value of tβbj is
positive.

Testing against One-Sided Alternatives

Thus, we are looking for a “sufficiently large” positive value of tβ̂j in order to reject H0 : βj = 0 in favor of H1 : βj > 0. Negative values of tβ̂j provide no evidence in favor of H1 .
The definition of “sufficiently large,” with a 5% significance level, is
the 95th percentile in a t distribution with n − k − 1 degrees of
freedom; denote this by c.
In other words, the rejection rule is that H0 is rejected in favor of H1
at the 5% significance level if

tβ̂j > c.          (20)

By our choice of the critical value c, rejection of H0 will occur for 5% of all random samples when H0 is true.

Testing against One-Sided Alternatives

The rejection rule in (20) is an example of a one-tailed test. To obtain c, we only need the significance level and the degrees of freedom. (Refer to a t table.)
For example, for a 5% level test and with n − k − 1 = 28 degrees of
freedom, the critical value is c = 1.701.
If tβbj ≤ 1.701, then we fail to reject H0 in favor of (19) at the 5%
level.
Note that a negative value for tβbj , no matter how large in absolute
value, leads to a failure in rejecting H0 in favor of (19). (See Figure
2.)
As the degrees of freedom in the t distribution get large, the t
distribution approaches the standard normal distribution. Thus, for
degrees of freedom greater than 120, one can use the standard normal
critical values.
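Rather than a printed t table, the critical value c can be obtained from the t distribution directly, for example with scipy; a minimal sketch reproducing the values quoted above:

```python
# Sketch: one-sided critical values from the t distribution (no t table needed).
from scipy import stats

print(stats.t.ppf(0.95, df=28))    # 5% one-sided, 28 df -> about 1.701
print(stats.t.ppf(0.99, df=28))    # 1% one-sided, 28 df
print(stats.norm.ppf(0.95))        # large-df (standard normal) approximation -> about 1.645
```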

Testing against One-Sided Alternatives

Figure 2: 5% rejection rule for the alternative H1 : βj > 0 with 28 df .

Testing against One-Sided Alternatives

Example (Hourly Wage Equation)


Using the data in WAGE1 gives the estimated equation

log(wage)-hat = .284 + .092 educ + .0041 exper + .022 tenure
                (.104)   (.007)       (.0017)       (.003)
n = 526, R² = .316,

where standard errors appear in parentheses below the estimated coefficients. This equation can be used to test whether the return to exper, controlling for educ and tenure, is zero in the population, against the alternative that it is positive. Write this as H0 : βexper = 0 versus H1 : βexper > 0.

Testing against One-Sided Alternatives

Example (Hourly Wage Equation (continued ))


Since we have 522 degrees of freedom, we can use the standard normal
critical values. The 5% critical value is 1.645, and the 1% critical value is
2.326. The t statistic for β̂exper is

texper = .0041/.0017 ≈ 2.41,

and so β̂exper, or exper, is statistically significant even at the 1% level. We also say that “β̂exper is statistically greater than zero at the 1% significance level.”

Testing against One-Sided Alternatives

The one-sided alternative that the parameter is less than zero,

H1 : βj < 0, (21)

also arises in applications.


The rejection rule for alternative (21) is just the mirror image of the
previous case.
Now, the critical value comes from the left tail of the t distribution. In
practice, it is easiest to think of the rejection rule as

tβbj < −c,

where c is the critical value for the alternative H1 : βj > 0.


For simplicity, we always assume c is positive, since this is how critical
values are reported in t tables, and so the critical value −c is a
negative number.
Testing against One-Sided Alternatives

For example, if the significance level is 5% and the degrees of freedom is 18, then c = 1.734, and so H0 : βj = 0 is rejected in favor of H1 : βj < 0 at the 5% level if tβ̂j < −1.734.
It is important to remember that, to reject H0 against the negative
alternative (21), we must get a negative t statistic.
A positive t ratio, no matter how large, provides no evidence in favor
of (21). The rejection rule is illustrated in Figure 3.

Testing against One-Sided Alternatives

Figure 3: 5% rejection rule for the alternative H1 : βj < 0 with 18 df .


Testing against One-Sided Alternatives

Example (Student Performance and School Size)


The file MEAP93 contains data on 408 high schools in Michigan for the
year 1993. We can use these data to test the null hypothesis that school
size has no effect on standardized test scores against the alternative that
size has a negative effect. Performance is measured by the percentage of
students receiving a passing score on the Michigan Educational Assessment
Program (MEAP) standardized tenth-grade math test (math10). School
size is measured by student enrollment (enroll). The null hypothesis is
H0 : βenroll = 0, and the alternative is H1 : βenroll < 0.

We will control for two other factors, average annual teacher compensation (totcomp) and the number of staff per one thousand students (staff). Teacher compensation is a measure of teacher quality, and staff size is a rough measure of how much attention students receive.

Testing against One-Sided Alternatives

Example (Student Performance and School Size (continued ))


The estimated equation, with standard errors in parentheses, is

math10-hat = 2.274 + .00046 totcomp + .048 staff − .00020 enroll
             (6.113)  (.00010)         (.040)      (.00022)
n = 408, R² = .0541.

The coefficient on enroll, −.00020, is in accordance with the conjecture that larger schools hamper performance. Since n − k − 1 = 408 − 4 = 404, we use the standard normal critical value. At the 5% level, the critical value is −1.65; the t statistic on enroll must be less than −1.65 to reject H0 at the 5% level.

Testing against One-Sided Alternatives

Example (Student Performance and School Size (continued ))


The t statistic on enroll is −.00020/.00022 ≈ −.91, which is larger than
−1.65; we fail to reject H0 in favor of H1 at the 5% level. In fact, the
15% critical value is −1.04, and since −.91 > −1.04, we fail to reject H0
even at the 15% level. We conclude that enroll is not statistically
significant at the 15% level.

Testing against Two-Sided Alternatives

In applications, it is common to test the null hypothesis H0 : βj = 0 against a two-sided alternative; that is,

H1 : βj ≠ 0.          (22)

Under this alternative, xj has a ceteris paribus effect on y without


specifying whether the effect is positive or negative.
This is the relevant alternative when the sign of βj is not well
determined by theory (or common sense).
When the alternative is two-sided, we are interested in the absolute
value of the t statistic.
The rejection rule for H0 : βj = 0 against (22) is

|tβ̂j | > c.

Testing against Two-Sided Alternatives

To find c, we again specify a significance level, say 5%.


For a two-tailed test, c is chosen to make the area in each tail of the t
distribution equal 2.5%. In other words, c is the 97.5th percentile in
the t distribution with n − k − 1 degrees of freedom.
When n − k − 1 = 25, the 5% critical value for a two-sided test is
c = 2.060. Figure 4 provides an illustration of this distribution.

Testing against Two-Sided Alternatives

Figure 4: 5% rejection rule for the alternative H1 : βj ≠ 0 with 25 df.

Testing against Two-Sided Alternatives

When a specific alternative is not stated, it is usually considered to be two-sided.
Usually, the default will be a two-sided alternative, and 5% will be the
default significance level.
If H0 is rejected in favor of (22) at the 5% level, we usually say that
“xj is statistically significant, or statistically different from zero, at
the 5% level.”
If H0 is not rejected, we say that “xj is statistically insignificant at
the 5% level.”

Testing against Two-Sided Alternatives

Example (Determinants of College GPA)


We use the data in GPA1 to estimate a model explaining college GPA (colGPA), with the average number of lectures missed per week (skipped) as an additional explanatory variable. The estimated model is

colGPA-hat = 1.39 + .412 hsGPA + .015 ACT − .083 skipped
             (.33)   (.094)       (.011)     (.026)
n = 141, R² = .234.

We can easily compute t statistics to see which variables are statistically significant, using a two-sided alternative in each case. The 5% critical value is about 1.96, since the degrees of freedom (141 − 4 = 137) is large enough to use the standard normal approximation. The 1% critical value is about 2.58.

Testing against Two-Sided Alternatives

Example (Determinants of College GPA (continued ))


The t statistic on hsGPA is 4.38 (≈ .412/.094), which is significant at very small significance levels. Thus, we say that “hsGPA is statistically significant at any conventional significance level.” The t statistic on ACT is 1.36 (≈ .015/.011), which is not statistically significant at the 10% level against a two-sided alternative.

The coefficient on skipped has a t statistic of −.083/.026 ≈ −3.19, so skipped is statistically significant at the 1% significance level (3.19 > 2.58). This coefficient means that another lecture missed per week lowers predicted colGPA by about .083. Thus, holding hsGPA and ACT fixed, the predicted difference in colGPA between a student who misses no lectures per week and a student who misses five lectures per week is about .42.

Testing Other Hypotheses about βj

Although H0 : βj = 0 is the most common hypothesis, we sometimes want to test whether βj is equal to some other given constant. Two common examples are βj = 1 and βj = −1.
Generally, if the null is stated as

H0 : βj = aj ,

where aj is our hypothesized value of βj , then the appropriate t statistic is

t = (β̂j − aj) / se(β̂j).

As before, t measures how many estimated standard deviations β̂j is away from the hypothesized value of βj .

Testing Other Hypotheses about βj

The general t statistic is usefully written as

t = (estimate − hypothesized value) / standard error.
Under the null hypothesis, this t statistic is distributed as tn−k−1 .
We can use the general t statistic to test against one-sided or
two-sided alternatives.
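A minimal sketch of this more general test, using simulated data and a hypothesized value aj = 1 (compare the R&D elasticity example later in the chapter); the p-value uses the tn−k−1 distribution:

```python
# Sketch: t statistic for H0: beta_j = a_j, i.e. (estimate - hypothesized value) / se.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(7)
n = 60
x = rng.normal(size=n)
y = 2 + 1.1 * x + rng.normal(size=n)
df = pd.DataFrame({"y": y, "x": x})

res = smf.ols("y ~ x", data=df).fit()
a_j = 1.0                                            # hypothesized value
t_stat = (res.params["x"] - a_j) / res.bse["x"]
p_two_sided = 2 * stats.t.sf(abs(t_stat), df=res.df_resid)
print(t_stat, p_two_sided)
```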

Computing p-Values for t Tests

Committing to a significance level ahead of time can hide useful information about the outcome of a hypothesis test.
Rather than testing at different significance levels, it is more
informative to answer the following question: Given the observed value
of the t statistic, what is the smallest significance level at which the
null hypothesis would be rejected?
This level is known as the p-value for the test.
For example, suppose that we wish to test the null hypothesis that a
parameter is zero against a two-sided alternative, and with 40 degrees
of freedom we obtain a t statistic equal to 1.85.
We obtain the actual p-value by computing the probability that a t
random variable, with 40 df, is larger than 1.85 in absolute value.

Computing p-Values for t Tests

That is, the p-value is the significance level of the test when we use
the value of the test statistic, 1.85 in the above example, as the
critical value for the test.
This p-value is shown in Figure 5.
Because a p-value is a probability, its value is always between zero and
one.

Computing p-Values for t Tests

Figure 5: Obtaining the p-value against a two-sided alternative, when t = 1.85 and df = 40.
Computing p-Values for t Tests

In the example with df = 40 and t = 1.85, the p-value for testing the
null hypothesis H0 : βj = 0 against the two-sided alternative is
computed as

p-value = P (|T | > 1.85) = 2P (T > 1.85) = 2(.0359) = .0718,

where P (T > 1.85) is the area to the right of 1.85 in a t distribution with 40 df.
This means that, if the null hypothesis is true, we would observe an
absolute value of the t statistic as large as 1.85 about 7.2 percent of
the time.
This provides some evidence against the null hypothesis, but we would
not reject the null at the 5% significance level.
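The p-value in this example can be reproduced directly from the t distribution; a one-line sketch with scipy:

```python
# Sketch: two-sided p-value for t = 1.85 with 40 degrees of freedom.
from scipy import stats

p_value = 2 * stats.t.sf(1.85, df=40)   # 2 * P(T > 1.85)
print(p_value)                          # roughly 0.072, as on the slide
```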

Computing p-Values for t Tests

The previous example illustrates that once the p-value has been
computed, a classical test can be carried out at any desired level.
If α denotes the significance level of the test (in decimal form), then
H0 is rejected if p-value < α; otherwise, H0 is not rejected at the
100 ·α% level.

The p-value rule for testing hypotheses


Reject the null hypothesis when the p-value is less than, or equal to, the level of significance α. That is, if p ≤ α, then reject H0 ; if p > α, then do not reject H0 .

Computing p-Values for t Tests

Example (Effect of Job Training on Firm Scrap Rates)


The scrap rate for a manufacturing firm is the number of defective
items—products that must be discarded—out of every 100 produced. Thus, for a
given number of items produced, a decrease in the scrap rate reflects higher
worker productivity.

We can use the scrap rate to measure the effect of worker training on
productivity. Using the data in JTRAIN, but only for the year 1987 and for
nonunionized firms, we obtain the following estimated equation:

log(scrap)-hat = 12.46 − .029 hrsemp − .962 log(sales) + .761 log(employ)
                 (5.69)   (.023)        (.453)            (.407)
n = 29, R² = .262.

The variable hrsemp is annual hours of training per employee, sales is annual
firm sales (in dollars), and employ is the number of firm employees.

Computing p-Values for t Tests

Example (Effect of Job Training on Firm Scrap Rates (continued ))


The main variable of interest is hrsemp. One more hour of training per
employee lowers log(scrap) by .029, which means the scrap rate is about
2.9% lower.

What about the statistical significance of the training variable? The t statistic on hrsemp is −.029/.023 ≈ −1.26, and with 29 − 4 = 25 degrees of freedom for the one-sided alternative, H1 : βhrsemp < 0, the 5% critical value is about −1.71. Thus, using a strict 5% level test, we must conclude that hrsemp is not statistically significant, even using a one-sided alternative.

The p-value is easily computed as P (T25 < −1.26) = .110. Thus, hrsemp
is not statistically significant, even at the 10% level.

Confidence Intervals

Under the CLM assumptions, we can easily construct a confidence interval (CI) for the population parameter βj .
Confidence intervals are also called interval estimates because they
provide a range of likely values for the population parameter, and not
just a point estimate.
Using the fact that (β̂j − βj)/se(β̂j) has a t distribution with n − k − 1 degrees of freedom, simple manipulation leads to a CI for the unknown βj : a 95% confidence interval, given by

β̂j ± c · se(β̂j),          (23)

where the constant c is the 97.5th percentile in a tn−k−1 distribution.

Confidence Intervals

More precisely, the lower and upper bounds of the confidence interval are given by

β̲j ≡ β̂j − c · se(β̂j)

and

β̄j ≡ β̂j + c · se(β̂j),

respectively.
What is the meaning of a confidence interval? If random samples were obtained over and over again, with β̲j and β̄j computed each time, then the (unknown) population value βj would lie in the interval (β̲j , β̄j) for 95% of the samples.
Three quantities are needed to construct a confidence interval: β̂j , se(β̂j), and c.

Confidence Intervals

As an example, for df = n − k − 1 = 25, a 95% confidence interval for any βj is given by [β̂j − 2.06 · se(β̂j), β̂j + 2.06 · se(β̂j)].
It is easy to construct confidence intervals for any other level of confidence. For example, a 90% CI is obtained by choosing c to be the 95th percentile in the tn−k−1 distribution.
When df = n − k − 1 = 25, c = 1.71, and so the 90% CI is β̂j ± 1.71 · se(β̂j), which is necessarily narrower than the 95% CI.
Many modern regression packages save us from doing any calculations
by reporting a 95% CI along with each coefficient and its standard
error.
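In Stata, for instance, the constant c itself is one function call away; a small do-file sketch for the df = 25 case used above:

* percentiles of the t distribution with 25 df
display invttail(25, 0.025)   // 97.5th percentile, c for a 95% CI (about 2.06)
display invttail(25, 0.05)    // 95th percentile, c for a 90% CI (about 1.71)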
Once a confidence interval is constructed, it is easy to carry out
two-tailed hypothesis tests. If the null hypothesis is H0 : βj = aj ,
then H0 is rejected against H1 : βj ≠ aj at (say) the 5% significance
level if, and only if, aj is not in the 95% confidence interval.

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 92 / 129
Confidence Intervals

Example (Model of R&D Expenditures)


Economists studying industrial organization are interested in the
relationship between firm size—often measured by annual sales—and
spending on research and development (R&D). Using the data in RDCHEM
on 32 U.S. firms in the chemical industry, we estimate the following
equation (with standard errors in parentheses below the coefficients):

\widehat{log(rd)} = −4.38 + 1.084 log(sales) + .0217 profmarg
                    (.47)    (.060)             (.0128)
n = 32, R2 = .918.

The estimated elasticity of R&D spending with respect to firm sales is
1.084, so that, holding profit margin fixed, a 1% increase in sales is
associated with a 1.084% increase in R&D spending.

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 93 / 129
Confidence Intervals

Example (Model of R&D Expenditures (continued ))


We can construct a 95% confidence interval for the sales elasticity once we
note that the estimated model has n − k − 1 = 29 degrees of freedom.
From the t table, we find the 97.5th percentile in a t29 distribution:
c = 2.045. Thus, the 95% confidence interval for βlog(sales) is
1.084 ± .060(2.045), or about (.961, 1.21). That zero is well outside this
interval is hardly surprising: we expect R&D spending to increase with firm
size. More interesting is that unity is included in the 95% confidence
interval for βlog(sales) , which means we cannot reject H0 : βlog(sales) = 1
against H1 : βlog(sales) ≠ 1 at the 5% significance level. In other words, the
estimated R&D-sales elasticity is not statistically different from 1 at the
5% level.
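The interval endpoints are easy to reproduce; the following sketch simply redoes the arithmetic above in Stata (point estimate 1.084, standard error .060, 29 df):

display invttail(29, 0.025)                 // c, about 2.045
display 1.084 - invttail(29, 0.025)*0.060   // lower bound, about .961
display 1.084 + invttail(29, 0.025)*0.060   // upper bound, about 1.207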

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 94 / 129
Confidence Intervals

Example (Model of R&D Expenditures (continued ))


The estimated coefficient on profmarg is also positive, and the 95%
confidence interval for the population parameter, βprofmarg , is
.0217 ± .0128(2.045), or about (−.0045, .0479). In this case, zero is
included in the 95% confidence interval, so we fail to reject
H0 : βprofmarg = 0 against H1 : βprofmarg ≠ 0 at the 5% level.
Nevertheless, the t statistic is about 1.70, which gives a two-sided p-value
of about .10, and so we would conclude that profmarg is statistically
significant at the 10% level against the two-sided alternative, or at the 5%
level against the one-sided alternative H1 : βprofmarg > 0.

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 95 / 129
Testing Hypotheses about a Single Linear Combination of
the Parameters
In applications, we must often test hypotheses involving more than
one of the population parameters.
In this section, we show how to test a single hypothesis involving more
than one of the βj .
To illustrate the general approach, we will consider a simple model to
compare the returns to education at junior colleges and four-year
colleges; for simplicity, we refer to the latter as “universities.”
The population includes working people with a high school degree,
and the model is
log(wage) = β0 + β1 jc + β2 univ + β3 exper + u, (24)
where
jc = number of years attending a two-year college.
univ = number of years at a four-year college.
exper = months in the workforce.
Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 96 / 129
Testing Hypotheses about a Single Linear Combination of
the Parameters

The hypothesis of interest is whether one year at a junior college is
worth one year at a university: this is stated as

H0 : β1 = β2 (25)

Under H0 , another year at a junior college and another year at a
university lead to the same ceteris paribus percentage increase in
wage.
For the most part, the alternative of interest is one-sided: a year at a
junior college is worth less than a year at a university. This is stated as

H1 : β1 < β2 (26)

The hypotheses in (25) and (26) concern two parameters, β1 and β2 ,
a situation we have not faced yet.
Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 97 / 129
Testing Hypotheses about a Single Linear Combination of
the Parameters

We cannot simply use the individual t statistics for βb1 and βb2 to test
H0 .
However, conceptually, there is no difficulty in constructing a t
statistic for testing (25).
To do so, we rewrite the null and alternative as H0 : β1 − β2 = 0 and
H1 : β1 − β2 < 0, respectively.
The t statistic is based on whether the estimated difference βb1 − βb2 is
sufficiently less than zero to warrant rejecting (25) in favor of (26).
To account for the sampling error in our estimators, we standardize
this difference by dividing by the standard error:

t = (βb1 − βb2 )/se(βb1 − βb2 ).     (27)

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 98 / 129
Testing Hypotheses about a Single Linear Combination of
the Parameters

Once we have the t statistic in (27), testing proceeds as before.


We choose a significance level for the test and, based on the df, obtain
a critical value.
Because the alternative is of the form in (26), the rejection rule is of
the form t < −c, where c is a positive value chosen from the
appropriate t distribution.
Or we compute the t statistic and then compute the p-value.
The only thing that makes testing the equality of two different
parameters more difficult than testing about a single βj is obtaining
the standard error in the denominator of (27).

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 99 / 129
Testing Hypotheses about a Single Linear Combination of
the Parameters

To find se(βb1 − βb2 ), we first obtain the variance of the difference.


Using the results on variances, we have

Var(βb1 − βb2 ) = Var(βb1 ) + Var(βb2 ) − 2Cov(βb1 , βb2 )

The standard deviation of βb1 − βb2 is just the square root of
Var(βb1 − βb2 ).
We have not displayed a formula for Cov(βb1 , βb2 ).
Some regression packages have features that allow one to obtain
\widehat{Cov}(βb1 , βb2 ), an estimate of Cov(βb1 , βb2 ), in which case one can
compute the standard error and then the t statistic in (27).
Some of the more sophisticated econometrics programs include special
commands that can be used for testing hypotheses about linear
combinations (check out help lincom in Stata).
Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 100 / 129
Testing Hypotheses about a Single Linear Combination of
the Parameters

Example
Using the data in TWOYEAR, which comes from Kane and Rouse (1995), we
estimate equation (24):

\widehat{log(wage)} = 1.472 + .0667 jc + .0769 univ + .0049 exper
                      (.021)   (.0068)    (.0023)      (.0002)
n = 6,763, R2 = .222.

It is clear from this result that jc and univ have both economically and
statistically significant effects on wage. This is certainly of interest, but we are
more concerned about testing whether the estimated difference in the coefficients
is statistically significant. The difference is estimated as βb1 − βb2 = −.0102, which
is the numerator of the t statistic in (27). Unfortunately, the regression results do
not contain enough information to obtain the standard error of βb1 − βb2 .

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 101 / 129
Testing Hypotheses about a Single Linear Combination of
the Parameters

Example (continued )
The standard error in the denominator of the t statistic is the square root of

    \widehat{Var}(βb1 − βb2 ) = \widehat{Var}(βb1 ) + \widehat{Var}(βb2 ) − 2 \widehat{Cov}(βb1 , βb2 )

If you are using Stata you can compute this value manually using the estimated
covariance matrix of the least squares estimates, obtained post-estimation using
estat vce.
. estat vce
Covariance matrix of coefficients of regress model
e(V) jc univ exper _cons

jc .00004663
univ 1.928e-06 5.330e-06
exper -1.718e-08 3.933e-08 2.480e-08
_cons -.00001741 -.00001573 -3.105e-06 .00044353
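From this matrix the standard error can be computed by hand. The sketch below assumes the regression of lwage on jc, univ, and exper has just been run, so that jc and univ occupy the first two rows and columns of e(V):

* se(b_jc - b_univ) = sqrt(Var(b_jc) + Var(b_univ) - 2*Cov(b_jc, b_univ))
matrix V = e(V)
display sqrt(V[1,1] + V[2,2] - 2*V[1,2])            // about .0069
display -.0102 / sqrt(V[1,1] + V[2,2] - 2*V[1,2])   // t statistic, about -1.47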

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 102 / 129
Testing Hypotheses about a Single Linear Combination of
the Parameters

Example (continued )
Using lincom we can estimate a linear combination such as c1 β1 + c2 β2 , and test
the more general form of linear hypotheses quite easily. For this particular
example, c1 = 1 and c2 = −1. In the case of the command lincom we can
simply refer to the variable names, such as jc - univ. The command will
compute the value of the expression and its standard error, and produce a t
statistic and an interval estimate.
. lincom jc-univ
( 1) jc - univ = 0

lwage Coef. Std. Err. t P>|t| [95% Conf. Interval]

(1) -.0101795 .0069359 -1.47 0.142 -.0237761 .003417

The t statistic for testing (25) is −1.47. Against the one-sided alternative (26),
the p-value is about .071, so there is some, but not strong, evidence against (25).
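Note that lincom reports a two-sided p-value (0.142); for the one-sided alternative in (26) it is cut in half. A one-line sketch, using df = 6,763 − 4 = 6,759:

display ttail(6759, 1.47)   // about .071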

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 103 / 129
Testing Multiple Linear Restrictions: The F Test

So far, we have only covered hypotheses involving a single restriction.


Frequently, we wish to test multiple hypotheses about the underlying
parameters β0 , β1 , . . . , βk .
We begin with the leading case of testing whether a set of
independent variables has no partial effect on a dependent variable.

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 104 / 129
The F Test: Testing Exclusion Restrictions

We already know how to test whether a particular variable has no
partial effect on the dependent variable: use the t statistic.
Now, we want to test whether a group of variables has no effect on
the dependent variable.
More precisely, the null hypothesis is that a set of variables has no
effect on y, once another set of variables has been controlled for.
As an illustration of why testing significance of a group of variables is
useful, we consider the following model that explains sales of Burger
Barn:

sales = β0 + β1 price + β2 advert + β3 advert2 + u (28)

Suppose now we wish to test whether sales is influenced by
advertising.

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 105 / 129
The F Test: Testing Exclusion Restrictions

Since advertising appears in (28) as both a linear term advert and as
a quadratic term advert2 , advertising will have no effect on sales if
β2 = 0 and β3 = 0; advertising will have an effect if β2 ≠ 0 or β3 ≠ 0
or if both β2 and β3 are nonzero.
Thus, for this test our null and alternative hypotheses are

H0 : β2 = 0, β3 = 0
H1 : β2 ≠ 0 or β3 ≠ 0 or both are nonzero

Relative to the null hypothesis H0 : β2 = 0, β3 = 0, the model in (28)
is called the unrestricted model; the restrictions in the null
hypothesis have not been imposed on the model.
It contrasts with the restricted model, which is obtained by assuming
the parameter restrictions in H0 are true.

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 106 / 129
The F Test: Testing Exclusion Restrictions

When H0 is true, β2 = 0 and β3 = 0, and advert and advert2 drop
out of the model. It becomes

sales = β0 + β1 price + u (29)

The F test for the hypothesis H0 : β2 = 0, β3 = 0 is based on a
comparison of the sums of squared residuals (SSR) from the
unrestricted model in (28) and the restricted model in (29).
Our shorthand notation for these two quantities is SSRu and SSRr ,
respectively.
Adding variables to a regression reduces the sum of squared
errors—more of the variation in the dependent variable becomes
attributable to the variables in the regression and less of its variation
becomes attributable to the error.
In terms of our notation, SSRr − SSRu ≥ 0.

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 107 / 129
The F Test: Testing Exclusion Restrictions

Using the data in the file ANDY to estimate (28) and (29), we find
that SSRu = 1532.084 and SSRr = 1896.391.
Adding advert and advert2 to the equation reduces the sum of
squared errors from 1896.391 to 1532.084.
What the F test does is to assess whether this reduction is sufficiently
large to be significant.
If adding the extra variables has little effect on the sum of squared
errors, then those variables contribute little to explaining variation in
the dependent variable, and there is support for a null hypothesis that
drops them.
On the other hand, if adding the variables leads to a big reduction in
the sum of squared errors, those variables contribute significantly to
explaining the variation in the dependent variable, and we have
evidence against the null hypothesis.

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 108 / 129
The F Test: Testing Exclusion Restrictions

In the general case, write the unrestricted model with k independent
variables as

    y = β0 + β1 x1 + · · · + βk xk + u;     (30)

the number of parameters (including the intercept) in the unrestricted
model is k + 1.
Suppose that we have j exclusion restrictions to test: that is, the null
hypothesis states that j of the variables in (30) have zero coefficients.
For notational simplicity, assume that it is the last j variables in the
list of independent variables: xk−j+1 , . . . , xk .
The null hypothesis is stated as

H0 : βk−j+1 = 0, . . . , βk = 0, (31)

which puts j exclusion restrictions on the model (30).

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 109 / 129
The F Test: Testing Exclusion Restrictions

The alternative to (31) is simply that it is false; this means that at
least one of the parameters listed in (31) is different from zero.
When we impose the restrictions under H0 , we are left with the
restricted model:

y = β0 + β1 x1 + · · · + βk−j xk−j + u.

We assume that both the unrestricted and restricted models contain
an intercept, since that is the case most widely encountered in
practice.
Earlier, we suggested that looking at the relative decrease in the SSR
when moving from the restricted to the unrestricted model should be
informative for testing the hypothesis (31).

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 110 / 129
The F Test: Testing Exclusion Restrictions

The F statistic determines what constitutes a large reduction or a
small reduction in the sum of squared residuals. It is given by

    F = [(SSRr − SSRu )/j] / [SSRu /(n − k − 1)],     (32)

where j is the number of restrictions, and n − k − 1 is the residual
degrees of freedom.
If the null hypothesis is true, then the statistic F has what is called an
F distribution with j numerator degrees of freedom and n − k − 1
denominator degrees of freedom. We write this as

F ∼ F(j, n−k−1)

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 111 / 129
The F Test: Testing Exclusion Restrictions

If the null hypothesis is not true, then the difference between SSRr
and SSRu becomes large, implying that the restrictions placed on the
model by the null hypothesis significantly reduce the ability of the
model to fit the data.
A large value for SSRr − SSRu means that the value of F tends to
be large, so that we reject the null hypothesis if the value of the F
test statistic becomes too large.
What is too large is decided by comparing the value of F to a critical
value c, which leaves a probability α in the upper tail of the F
distribution with j and n − k − 1 degrees of freedom.
The 10%, 5%, and 1% critical values for the F distribution are
tabulated.

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 112 / 129
The F Test: Testing Exclusion Restrictions

The rejection rule is simple. Once c has been obtained, we reject H0
in favor of H1 at the chosen significance level if

    F > c.

With a 5% significance level, j = 3, and n − k − 1 = 60, the critical
value is c = 2.76. We would reject H0 at the 5% level if the computed
value of the F statistic exceeds 2.76.
The 5% critical value and rejection region are shown in Figure 6. For
the same degrees of freedom, the 1% critical value is 4.13.
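These critical values can also be obtained from Stata rather than from a statistical table; a sketch for the F(3, 60) case mentioned above:

display invFtail(3, 60, 0.05)   // 5% critical value, about 2.76
display invFtail(3, 60, 0.01)   // 1% critical value, about 4.13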

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 113 / 129
The F Test: Testing Exclusion Restrictions

Figure 6: The 5% critical value and rejection region in an F(3,60) distribution.

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 114 / 129
The F Test: Testing Exclusion Restrictions

If H0 is rejected, then we say that xk−j+1 , . . . , xk are jointly
statistically significant at the appropriate significance level.
If the null is not rejected, then the variables are jointly insignificant,
which often justifies dropping them from the model.
For the sales example discussed earlier, we estimate the unrestricted
model. This gives,

\widehat{sales} = 109.72 − 7.64 price + 12.15 advert − 2.77 advert2
                  (6.80)    (1.04)       (3.56)         (.941)
n = 75, SSR = 1532.084, R2 = .5082.

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 115 / 129
The F Test: Testing Exclusion Restrictions

When we estimate the restricted model, we obtain

\widehat{sales} = 121.9 − 7.83 price
                  (6.53)   (1.14)
n = 75, SSR = 1896.391, R2 = .3913.

Using the SSRs in the unrestricted and restricted model, we have

    F = [(1896.391 − 1532.084)/1532.084] · [(75 − 4)/2] ≈ 8.44
The 5% critical value with two numerator degrees of freedom and 71
denominator degrees of freedom is 3.126, and the 1% critical value is
4.912.

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 116 / 129
The F Test: Testing Exclusion Restrictions

Since F = 8.44 is well above the 1% critical value, we reject the null
hypothesis that both β2 = 0 and β3 = 0, and conclude that at least
one of them is not zero. Advertising does have a significant effect
upon sales revenue.
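The whole test takes only a few lines in Stata. The do-file sketch below assumes the ANDY data are loaded with variables named sales, price, and advert, as in the text:

* unrestricted model with the quadratic advertising term
gen advert2 = advert^2
regress sales price advert advert2    // SSR_u is stored in e(rss)
* joint F test of H0: beta2 = 0, beta3 = 0
test advert advert2
* critical values of the F(2, 71) distribution
display invFtail(2, 71, 0.05)         // about 3.13
display invFtail(2, 71, 0.01)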

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 117 / 129
The R-Squared Form of the F Statistic

For testing exclusion restrictions, it is often more convenient to have a
form of the F statistic that can be computed using the R-squareds
from the restricted and unrestricted models.
Using the fact that SSRr = SST (1 − Rr2 ) and
SSRu = SST (1 − Ru2 ), we can substitute into (32) to obtain

    F = [(Ru2 − Rr2 )/j] / [(1 − Ru2 )/(n − k − 1)]     (33)

This is called the R-squared form of the F statistic.


For the sales model example, we can use (33) to obtain the F statistic:

    F = [(.5082 − .3913)/(1 − .5082)] · [(75 − 4)/2] ≈ 8.438

which is very close to what we obtained before.


Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 118 / 129
Computing p-Values for F Tests

For reporting the outcomes of F tests, p-values are especially useful.


In the F testing context, the p-value is defined as

    p-value = P (Fj,n−k−1 > F ),

where, for emphasis, Fj,n−k−1 denotes an F random variable with
(j, n − k − 1) degrees of freedom, and F is the actual value of the test
statistic.
A small p-value is evidence against H0 .
As with t testing, once the p-value has been computed, the F test can
be carried out at any significance level. For example, if the p-value
= .024, we reject H0 at the 5% significance level but not at the 1%
level.
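For instance, the p-value for the advertising test above (F = 8.44 with 2 and 71 df) is obtained directly from the F tail function; a one-line sketch:

display Ftail(2, 71, 8.44)   // far below .01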

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 119 / 129
The F Statistic for Overall Significance of a Regression

In the model with k independent variables, we can write the null
hypothesis as

    H0 : x1 , x2 , . . . , xk do not help to explain y.

This null hypothesis states that none of the explanatory variables has
an effect on y.
Stated in terms of the parameters, the null is that all slope parameters
are zero:
H0 : β1 = β2 = · · · = βk = 0, (34)
and the alternative is that at least one of the βj is different from zero.

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 120 / 129
The F Statistic for Overall Significance of a Regression

There are k restrictions in (34), and when we impose them, we get the
restricted model
y = β0 + u; (35)
all independent variables have been dropped from the equation.
Now, the R-squared from estimating (35) is zero; none of the variation
in y is being explained because there are no explanatory variables.
Therefore, the F statistic for testing (34) can be written as

    (R2 /k) / [(1 − R2 )/(n − k − 1)],     (36)

where R2 is just the usual R-squared from the regression of y on
x1 , x2 , . . . , xk .

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 121 / 129
The F Statistic for Overall Significance of a Regression

Most regression packages report the F statistic in (36) automatically,
which is valid only for testing joint exclusion of all independent
variables.
This is sometimes called determining the overall significance of the
regression.
If we fail to reject (34), then there is no evidence that any of the
independent variables help to explain y.
This usually means that we must look for other variables to explain y.
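As a numerical illustration using values already reported for the unrestricted sales model (R2 = .5082, k = 3, n − k − 1 = 71), formula (36) gives:

display (.5082/3) / ((1 - .5082)/71)   // overall F statistic, about 24.5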

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 122 / 129
Testing General Linear Restrictions

Testing exclusion restrictions is by far the most important application
of F statistics.
Sometimes, however, the restrictions implied by a theory are more
complicated than just excluding some independent variables.
It is still straightforward to use the F statistic for testing.

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 123 / 129
Testing General Linear Restrictions

As an example, consider the following equation:

    log(price) = β0 + β1 log(assess) + β2 log(lotsize)
                    + β3 log(sqrft) + β4 bdrms + u,     (37)

where

price = house price.


assess = the assessed housing value (before the house was sold).
lotsize = size of the lot, in square feet.
sqrft = square footage.
bdrms = number of bedrooms.

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 124 / 129
Testing General Linear Restrictions

Now, suppose we would like to test whether the assessed housing price
is a rational valuation.
If this is the case, then a 1% change in assess should be associated
with a 1% change in price; that is, β1 = 1.
In addition, lotsize, sqrft, and bdrms should not help to explain
log(price), once the assessed value has been controlled for.
Together, these hypotheses can be stated as

H0 : β1 = 1, β2 = 0, β3 = 0, β4 = 0. (38)

Four restrictions have to be tested; three are exclusion restrictions, but
β1 = 1 is not.

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 125 / 129
Testing General Linear Restrictions

As in the exclusion restriction case, we estimate the unrestricted
model, (37) in this case, and then impose the restrictions in (38) to
obtain the restricted model.
To implement the second step, we simply plug in the restrictions. If
we write (37) as

y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + u, (39)

then the restricted model is y = β0 + x1 + u.


Now, to impose the restriction that the coefficient on x1 is unity, we
must estimate the following model:

y − x1 = β0 + u. (40)

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 126 / 129
Testing General Linear Restrictions

This is just a model with an intercept (β0 ) but with a different
dependent variable than in (39).
The procedure for computing the F statistic is the same: estimate
(40), obtain the SSR (SSRr ), and use this with the unrestricted SSR
from (39) in the F statistic (32).
We are testing j = 4 restrictions, and there are n − 5 df in the
unrestricted model.
The F statistic is simply [(SSRr − SSRu )/SSRu ] · [(n − 5)/4].
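In Stata, the test can be done either with the built-in test command after the unrestricted regression, or by estimating the restricted model (40) directly and using the SSR form. A sketch, assuming the HPRICE1 variables are named lprice, lassess, llotsize, lsqrft, and bdrms:

* unrestricted model (37)
regress lprice lassess llotsize lsqrft bdrms
* joint F test of the four restrictions in (38)
test (lassess = 1) (llotsize = 0) (lsqrft = 0) (bdrms = 0)
* alternatively: restricted model (40), a regression on a constant only
gen diff = lprice - lassess
regress diff                   // gives SSR_r for the F statistic in (32)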

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 127 / 129
Testing General Linear Restrictions

The estimated unrestricted model using the data in HPRICE1 is

\widehat{log(price)} = .264 + 1.043 log(assess) + .0074 log(lotsize)
                       (.570)  (.151)             (.0386)
                     − .1032 log(sqrft) + .0338 bdrms
                       (.1384)            (.0221)
n = 88, SSR = 1.822, R2 = .773.

The SSR from the restricted model turns out to be SSRr = 1.880,
and so the F statistic is [(1.880 − 1.822)/1.822](83/4) = .661.
The 5% critical value in an F distribution with (4, 83) df is about
2.50, and so we fail to reject H0 .
There is essentially no evidence against the hypothesis that the
assessed values are rational.
Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 128 / 129
************* End of Chapter Three *************

Lemi Taye (AAUSC) Ch 3: Multiple Linear Regression December 14, 2019 129 / 129
