
Introduction to Econometrics

Topic 4: Multiple Regression Analysis: Inference

Prof. Dr. Michael Kvasnicka


Chair of Applied Economics

Otto-von-Guericke-Universität Magdeburg

1 / 100
Outline

Topics we cover:
Introduction

4.1 Sampling Distributions of the OLS Estimator

4.2 Hypothesis Testing: The t Test

4.3 Confidence Intervals

4.4 Testing Hypotheses about a Single Linear Combination of the Parameters

4.5 Testing Multiple Linear Restrictions: The F Test

4.6 Reporting Regression Results

Reading: Wooldridge (2013), Introductory Econometrics, Chapter 4.

2 / 100
Outline

Introduction

4.1 Sampling Distributions of the OLS Estimator

4.2 Hypothesis Testing: The t Test

4.3 Confidence Intervals

4.4 Testing Hypotheses about a Single Linear Combination of the Parameters

4.5 Testing Multiple Linear Restrictions: The F Test

4.6 Reporting Regression Results

3 / 100
Introduction

I This topic continues our treatment of multiple regression analysis.

⇒ We now turn to the problem of testing hypotheses about the parameters in
the population regression model.

I We begin by finding the distributions of the OLS estimators under the added
assumption that the population error is normally distributed (Section 4.1).
I Sections 4.2 and 4.3 cover hypothesis testing about individual parameters.
I Section 4.4 discusses how to test a single hypothesis involving more than
one parameter.
I In Section 4.5, we focus on testing multiple restrictions and pay particular
attention to determining whether a group of independent variables can be
omitted from a model.

4 / 100
Outline

Introduction

4.1 Sampling Distributions of the OLS Estimator

4.2 Hypothesis Testing: The t Test

4.3 Confidence Intervals

4.4 Testing Hypotheses about a Single Linear Combination of the Parameters

4.5 Testing Multiple Linear Restrictions: The F Test

4.6 Reporting Regression Results

5 / 100
4.1 Sampling Distributions of the OLS Estimators

I What we have done so far:


I We have formed a set of assumptions under which OLS is unbiased; we have
also derived and discussed the bias caused by omitted variables.
I In Section 3.4, we obtained the variances of the OLS estimators under the
Gauss-Markov assumptions.
I In Section 3.5, we showed that this variance is smallest among linear
unbiased estimators (Gauss-Markov Theorem: under the Gauss-Markov
assumptions, the OLS estimators are BLUE).

I Knowing the expected value and variance of the OLS estimators is useful
for describing the precision of the OLS estimators.
I However, in order to perform statistical inference, we need to know more
than just the first two moments of β̂j ; for statistical inference, we need to
know the full sampling distribution of the β̂j .
I Note that even under the Gauss-Markov assumptions, the distribution of β̂j
can have virtually any shape.

6 / 100
4.1 Sampling Distributions of the OLS Estimators

I To obtain the sampling distributions of the β̂j , we make use of the fact that
when we condition on the values of the independent variables in our sample
(the X ’s), the sampling distributions of the OLS estimators depend on the
underlying distribution of the errors.

I So, to obtain the sampling distributions of the β̂j , we will proceed in the
following steps:
1. First, we specify the distribution of the error term u.
2. Then, we derive the resulting distribution of y (holding x constant).
3. This in turn determines the distribution of (β̂0 , β̂1 , ..., β̂k ).
I Why? Because the OLS estimator is a linear function of the y -observations,
so the distribution of y carries over to β̂.

7 / 100
4.1 Sampling Distributions of the OLS Estimators
Assumption MLR.6 (Normality)

I To make the sampling distributions of the β̂j tractable, we now assume


that the unobserved error is normally distributed in the population. We call
this the normality assumption:

Assumption MLR.6 (Normality)


The population error u is:
I independent of the explanatory variables x1 , x2 , ..., xk , and
I normally distributed with zero mean and variance σ 2 : u ∼ Normal(0, σ 2 ).

8 / 100
4.1 Sampling Distributions of the OLS Estimators
Assumption MLR.6 (Normality)

I Assumption MLR.6 is much stronger than any of our previous assumptions.


In fact, since u is independent of the xj under MLR.6, it follows that:
1. E (u|x1 , ..., xk ) = E (u) = 0
⇒ Thus MLR.4 is satisfied.
2. Var (u|x1 , ..., xk ) = Var (u) = σ 2
⇒ Thus MLR.5 is satisfied.
Thus, if we make Assumption MLR.6, then necessarily assumptions MLR.4
and MLR.5 will hold.

I To emphasize that we are assuming more than before, we will refer to the
full set of Assumptions MLR.1 through MLR.6.

9 / 100
4.1 Sampling Distributions of the OLS Estimators
MLR.1–MLR.6: The Classical Linear Model

I For cross-sectional regression applications, Assumption MLR.1 through


MLR.6 are called the classical linear model (CLM) assumptions.
I = the Gauss-Markov assumptions plus the assumption of a normally
distributed error term.
I We will refer to the model under these six assumptions as the classical
linear model.

I Under the CLM assumptions, the OLS estimators β̂0 , β̂1 , ..., β̂k have a
stronger efficiency property than under the Gauss-Markov assumptions:
I It can be shown that the OLS estimators are the minimum variance
unbiased estimators.
I i.e. OLS has the smallest variance among unbiased estimators, no longer
only among those estimators that are linear in the yi .
(for further discussion of this property, see Appendix E in textbook)

10 / 100
4.1 Sampling Distributions of the OLS Estimators
MLR.1–MLR.6: The Classical Linear Model

I A succinct way to summarize the population assumptions of the CLM is:

y |x ∼ Normal(β0 + β1 x1 + β2 x2 + ... + βk xk , σ 2 )

where x is again a shorthand for (x1 , ..., xk ).


⇒ Thus, conditional on x, y has a normal distribution with:
I a linear combination of the x1 , ..., xk as mean.
I a constant variance σ 2 .

I For a single independent variable x, this situation is shown in the figure on


the next slide.

11 / 100
4.1 Sampling Distributions of the OLS Estimators
MLR.1–MLR.6: The Classical Linear Model
I The CLM Model with a single explanatory variable: y = β0 + β1 x + u

Source: Wooldridge, Figure 4.1.


(The homoskedastic normal distribution with a single explanatory variable.)
I Normally distributed error term, and homoskedastic error term.
12 / 100
4.1 Sampling Distributions of the OLS Estimators
MLR.1–MLR.6: The Classical Linear Model
Justification of Normal Distribution for the Errors

I How can we justify the normal distribution for the errors?


⇒ Since the error u is the sum of many different unobserved factors affecting
y , we can invoke the Central Limit Theorem (CLT) to conclude that u has
an approximate normal distribution.
(For the CLT, see Appendix C in textbook.)

I This argument has some merit, but it is not without weaknesses:


1. The factors in u can have very different distributions in the population.
(E.g. ability and quality of schooling in the error in a wage equation.)
Although the CLT can still hold in such cases, the normal approximation can be
poor depending on how many factors appear in u and how different their
distributions are.
2. The factors in u can affect y in a fashion that is not separate and additive
[= more serious problem].
If u is a complicated function of the unobserved factors, then the CLT argument
does not really apply.
13 / 100
4.1 Sampling Distributions of the OLS Estimators
MLR.1–MLR.6: The Classical Linear Model
Reasonability of Normality Assumption in Practical Applications

I Economic theory usually offers little guidance:


I E.g., there is no theorem that says wage conditional on educ, exper, and
tenure is normally distributed.
⇒ So, in any application, whether normality of u can be assumed is really an
empirical matter.

I Normal error terms imply that y can take any value on the real line:
Sometimes, simple reasoning suggests y cannot have a normal distribution,
e.g.:
I Example 1: hourly wages: y ≥ 0.
I Example 2: hourly wages with minimum wage floor: y ≥ ymin .
I Example 3: number of children born in a particular family: y ≥ 0

I Past empirical evidence suggests that normality is not a good assumption


for wages (i.e. the conditional wage distribution is not “close” to being
normal).
14 / 100
4.1 Sampling Distributions of the OLS Estimators
MLR.1–MLR.6: The Classical Linear Model
Reasonability of Normality Assumption in Practical Applications

I Often, a transformation, especially taking the log, yields a distribution that


is closer to normal:
I E.g. log (price) tends to have a distribution that looks more normal than the
distribution of price.
I E.g. the unobservables in a wage equation are typically not normally distributed, but
the distribution of the unobservables in a log(wage) equation is typically much closer to normal.

... so, again, this is an empirical issue. We will discuss the consequences of
nonnormality for statistical inference in Topic 5.
I Normality can therefore be a reasonable approximation, even if it does not
hold exactly.

15 / 100
4.1 Sampling Distributions of the OLS Estimators
MLR.1–MLR.6: The Classical Linear Model
Reasonability of Normality Assumption in Practical Applications

I There are some examples where MLR.6 is clearly false.
Whenever y takes on just a few values, it cannot have anything close to a
normal distribution:
I The dependent variable in Example 3.5. provides a good example: The
variable narr86, i.e. the number of times a young man was arrested in 1986,
takes on a small range of integer values and is zero for most men. Thus,
narr86 is far from being normally distributed.

I What can be done in these cases?


I As we will see in Topic 5 - and this is important - nonnormality of the errors
is not a serious problem with large sample sizes.
I For now, we just make the normality assumption.

16 / 100
4.1 Sampling Distributions of the OLS Estimators
MLR.1–MLR.6: The Classical Linear Model
I Normality of the error term translates into normal sampling distributions of
the OLS estimators:

Theorem (Normal Sampling Distributions)


Under the CLM assumptions MLR.1 through MLR.6, conditional on the sample
values of the independent variables,
β̂j ∼ Normal(βj , Var(β̂j)),

where Var(β̂j) was given in Topic 3. Therefore:

(β̂j − βj) / sd(β̂j) ∼ Normal(0, 1),

where sd(β̂j) denotes the standard deviation of β̂j (again, see previous topic).
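One informal way to see this theorem at work is a small Monte Carlo experiment: with normal, homoskedastic errors, the standardized OLS slope estimates should look standard normal. The Python sketch below is purely illustrative (none of it comes from the slides; the parameter values are invented, and numpy is assumed to be available):

```python
# Illustrative Monte Carlo check of the Normal Sampling Distributions theorem.
# Assumed (not from the slides): beta0 = 1, beta1 = 0.5, sigma = 2, n = 50.
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma, n, reps = 1.0, 0.5, 2.0, 50, 5000

x = rng.uniform(0, 10, n)          # fixed regressor values (we condition on x)
sxx = np.sum((x - x.mean()) ** 2)
sd_beta1 = sigma / np.sqrt(sxx)    # true sd(beta1_hat) under MLR.1-MLR.6

z = np.empty(reps)
for r in range(reps):
    u = rng.normal(0, sigma, n)            # MLR.6: normal, homoskedastic errors
    y = beta0 + beta1 * x + u
    b1 = np.sum((x - x.mean()) * y) / sxx  # OLS slope estimate
    z[r] = (b1 - beta1) / sd_beta1         # standardized estimator

# Under the theorem, z should be (approximately) standard normal:
print(np.mean(z), np.std(z))               # close to 0 and 1
```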
17 / 100
4.1 Sampling Distributions of the OLS Estimators
MLR.1–MLR.6: The Classical Linear Model
Proof of Theorem

The proof of the theorem is not that difficult, given the properties of normally
distributed random variables (for these properties, see Appendix B in textbook):
→ We sketch the proof for the first part of the theorem in 5 steps:
1. Each β̂j can be written as:

β̂j = βj + (Σi r̂ij·ui) / (Σi r̂ij²) = βj + Σi wij·ui   (sums over i = 1, ..., n)

where wij = r̂ij / SSRj, r̂ij is the i-th residual from the regression of xj on all the other
independent variables, and SSRj is the sum of squared residuals from this regression.

I Since the wij depend only on the independent variables (because of the way
the rˆij are obtained), they can be treated as nonrandom.
⇒ So, β̂j is just a linear combination of the errors in the sample
{ui : 1, 2, ..., n}.

18 / 100
4.1 Sampling Distributions of the OLS Estimators
MLR.1–MLR.6: The Classical Linear Model
Proof of Theorem

2. Under Assumption MLR.6 (and the random sampling Assumption MLR.2),


the errors are independent, identically distributed (i.i.d.) Normal(0,σ 2 )
random variables.
3. ... and (which is an important fact about independent normal variables) a
linear combination of such random variables is itself normally distributed
(see Appendix B).
⇒ This basically completes the proof because:
4. Now we only need to find the mean and variance for this normal
distribution.
5. But we already know the mean and variance of this distribution from the
previous topic: In Section 3.3, we showed that E(β̂j) = βj , and we derived
Var(β̂j) in Section 3.4; there is no need to re-derive these facts.
19 / 100
4.1 Sampling Distributions of the OLS Estimators
MLR.1–MLR.6: The Classical Linear Model
Proof of Theorem

I The second part of the theorem follows immediately from the fact that
when we standardize a normal random variable by subtracting off its mean
and dividing by its standard deviation, we end up with a standard normal
random variable.

I Note that the theorem also implies that:


I any linear combination of the (β̂0 , β̂1 , ..., β̂k ) is also normally distributed.
I any subset of the β̂j has a joint normal distribution.
⇒ These facts underlie the testing results in the remainder of this topic.

I In Topic 5, we will show that the normality of the OLS estimators is still
approximately true in large samples even without normality of the errors.

20 / 100
Outline

Introduction

4.1 Sampling Distributions of the OLS Estimator

4.2 Hypothesis Testing: The t Test

4.3 Confidence Intervals

4.4 Testing Hypotheses about a Single Linear Combination of the Parameters

4.5 Testing Multiple Linear Restrictions: The F Test

4.6 Reporting Regression Results

21 / 100
4.2 Hypothesis Testing: The t Test

I We now turn to the very important topic of testing a hypothesis about a


single population parameter in the population regression function.
I Starting point of our analysis:
I The population model: y = β0 + β1 x1 + ... + βk xk + u
I Assumption that population model satisfies the CLM assumptions.

I We already know that OLS produces unbiased estimators of the βj .


I Now we study how to test hypotheses about a particular βj :
I For a full understanding of hypothesis testing, one must remember that the
βj are unknown features of the population, and we will never know their
value with certainty.
I Nevertheless, we can hypothesize about the value of βj and then use
statistical inference to test our hypothesis, i.e. make probabilistic statements
about it.

22 / 100
4.2 Hypothesis Testing: The t Test

I In order to construct hypotheses tests, we need the following result:

Theorem (t Distribution for the Standardized Estimators)


Under the CLM assumptions MLR.1 through MLR.6,

(β̂j − βj) / se(β̂j) ∼ tn−k−1 = tdf ,

where k + 1 is the number of unknown parameters in the population model


y = β0 + β1 x1 + ... + βk xk + u (k slope parameters and the intercept β0 ) and
n − k − 1 is the degrees of freedom (df).

23 / 100
4.2 Hypothesis Testing: The t Test

I This result differs from the last theorem (Normal Sampling Distributions)
in some notable respects:
I The last theorem (Normal Sampling Distributions) showed that under the
CLM assumptions, (β̂j − βj) / sd(β̂j) ∼ Normal(0, 1).
I The t distribution in this new theorem (t Distribution for the Standardized
Estimators) comes from the fact that the constant σ in sd(β̂j) has been
replaced with the random variable σ̂.
(The proof that this leads to a statistic which has a t distribution with n − k − 1
degrees of freedom is difficult and not very instructive - so we skip it here.)
I Accounting for this additional estimation step, the statistic in the new
theorem contains the term se(β̂j) (the standard error), which replaces the
sd(β̂j) in the last theorem.

24 / 100
4.2 Hypothesis Testing: The t Test

I The new Theorem (t Distribution for the Standardized Estimators) is


important in that it allows us to test hypotheses involving the βj .
I In most applications, our primary interest lies in testing the null hypothesis
on a population parameter:

H 0 : βj = 0

I Interpretation: Once x1 , x2 , ..., xj−1 , xj+1 , ..., xk have been accounted for, xj
has no effect on (the expected value of) y .
I Note: We cannot state the null hypothesis as “xj does have a partial effect
on y ” because this is true for any value of βj other than zero. Classical
testing is suited for testing simple hypotheses like the null hypothesis above.

25 / 100
4.2 Hypothesis Testing: The t Test

I The statistic we use to test this null hypothesis (against any alternative) is
called “the” t-statistic or “the” t-ratio of β̂j and is defined as:

t-statistic or t-ratio:   tβ̂j ≡ β̂j / se(β̂j)

I tβ̂j measures how many estimated standard deviations β̂j is away from zero.
⇒ Values of tβ̂j sufficiently far from zero will result in a rejection of H0 .

I Why does tβ̂j have features that make it reasonable as a test statistic to
detect βj ≠ 0?
1. Since se(β̂j) is always positive, tβ̂j has the same sign as β̂j : if β̂j is positive,
then so is tβ̂j , and if β̂j is negative, so is tβ̂j .
2. For a given value of se(β̂j), a larger value of β̂j leads to larger values of tβ̂j .
If β̂j becomes more negative, so does tβ̂j .
26 / 100
4.2 Hypothesis Testing: The t Test

I Note that in any interesting application, the point estimate β̂j will never
be exactly zero, whether or not H0 is true. The relevant question is: How
far is β̂j from zero?
I A sample value of β̂j very far from zero provides evidence against
H0 : βj = 0.
I However, we must recognize that there is sampling error in our estimate β̂j ,
so the size of β̂j must be weighed against its sampling error.
I Since the standard error of β̂j is an estimate of the standard deviation of β̂j ,
tβ̂j measures how many estimated standard deviations β̂j is away from zero.
I This is precisely what we do in testing whether the mean of a population is
zero, using the standard t-statistic from introductory statistics.
I Values of tβ̂ sufficiently far from zero will result in a rejection of H0 .
I The precise rejection rule depends on the alternative hypothesis and the
chosen significance level of the test.

27 / 100
4.2 Hypothesis Testing: The t Test

I Determining a rule for rejecting H0 : βj = 0 at a given significance level –


that is, the probability of rejecting H0 when it is true – requires knowing
the sampling distribution of tβ̂j when H0 is true.
I From the last Theorem (t Distribution for the Standardized Estimators), we
know this to be tn−k−1 .
I This is the key theoretical result needed for testing H0 : βj = 0.

I To determine a rule for rejecting H0 , we need to decide on the relevant


alternative hypothesis. In the following, we consider a number of
alternatives ...

28 / 100
4.2 Hypothesis Testing: The t Test
Testing against One-Sided Alternatives: Case 1 (H1 : βj > 0)

I First, we consider a one-sided alternative of the form:


H 1 : βj > 0

I Note that when we state the alternative this way, we are really saying that
the null hypothesis is H0 : βj ≤ 0:
I For example, if βj is the coefficient on education in a wage regression, we
only care about detecting that βj is different from zero when βj is actually
positive.
I Recall from introductory statistics that the null value that is hardest to
reject in favor of H1 : βj > 0 is βj = 0. In other words, if we reject the null
βj = 0 then we automatically reject βj < 0.
⇒ Therefore, it suffices to act as if we are testing H0 : βj = 0 against
H1 : βj > 0, effectively ignoring βj < 0, and that is the approach we take
here.

29 / 100
4.2 Hypothesis Testing: The t Test
Testing against One-Sided Alternatives: Case 1 (H1 : βj > 0)

I How should we choose a rejection rule?


I We must first decide on a significance level (“level” for short) or the
probability of rejecting H0 when it is in fact true.

I For concreteness, suppose we have decided on a 5% significance level, as


this is the most popular choice:
I Thus, we are willing to mistakenly reject H0 when it is true 5% of the time.
I Now, while tβ̂j has a t-distribution under H0 – so it has a zero mean – under
the alternative βj > 0, the expected value of tβ̂j is positive.
⇒ Thus, we are looking for a “sufficiently large” positive value of tβ̂j in order
to reject the null H0 : βj = 0 in favor of the alternative H1 : βj > 0.
Negative values of tβ̂j provide no evidence in favor of H1 .

30 / 100
4.2 Hypothesis Testing: The t Test
Testing against One-Sided Alternatives: Case 1 (H1 : βj > 0)

I What does “sufficiently large” mean?:


I The definition of “sufficiently large”, with a 5% significance level, is the 95th
percentile in a t distribution with n − k − 1 degrees of freedom: we call this
the critical value and denote it by c:

Pr(t ≤ c) = 0.95 for t ∼ tn−k−1

I In other words, the rejection rule is that H0 is rejected in favour of H1 at


the 5% significance level if:

tβ̂j > c

I This rejection rule is an example of a one-tailed test.


I By our choice of the critical value c, rejection of H0 will occur for 5% of all
random samples when H0 is true.

31 / 100
4.2 Hypothesis Testing: The t Test
Testing against One-Sided Alternatives: Case 1 (H1 : βj > 0)
Step-By-Step: How Do we Test for Significance of a Coefficient?

⇒ Consider one-sided test, where H0 : βj = 0 and H1 : βj > 0:

1. Estimate parameter β̂j and corresponding standard error se(β̂j).
2. Compute t-statistic corresponding to the null hypothesis:

   tβ̂j = β̂j / se(β̂j)

3. Compute the degrees of freedom n − k − 1.
4. For small samples, use the tabulated t-distribution to find critical value c,
corresponding to the significance level.
5. Apply rejection rule: H0 is rejected in favour of H1 at the 5% significance
level if:

   tβ̂j > c
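As a compact illustration of these five steps, here is a minimal Python sketch (not part of the slides; scipy is assumed to be available, and the estimate and standard error in the example call are made-up numbers):

```python
# Minimal sketch of the one-sided t test H0: beta_j = 0 vs. H1: beta_j > 0.
from scipy.stats import t as t_dist

def one_sided_t_test(beta_hat, se, df, alpha=0.05):
    t_stat = beta_hat / se                 # step 2: t statistic
    c = t_dist.ppf(1 - alpha, df)          # step 4: critical value
    reject = t_stat > c                    # step 5: rejection rule
    return t_stat, c, reject

# Hypothetical estimate and standard error; with df = 28 and alpha = 0.05,
# the critical value is 1.701, as on the next slide.
print(one_sided_t_test(beta_hat=0.5, se=0.2, df=28))
```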

32 / 100
4.2 Hypothesis Testing: The t Test
Testing against One-Sided Alternatives: Case 1 (H1 : βj > 0)
Step-By-Step: How Do we Test for Significance of a Coefficient?

I Ad 4 (finding critical value c):


I To obtain the critical value c, we only need the significance level and the
degrees of freedom.
I For example, for a 5% level test and with n − k − 1 = 28 degrees of
freedom, the critical value is c = 1.701.
(For a 10% level test with df = 21, we have a critical value c of 1.323.)
(For a 1% level test with df = 21, we have a critical value c of 2.518.)

I Ad 5 (apply rejection rule):


I If tβ̂j ≤ 1.701, then we fail to reject H0 in favor of H1 : βj > 0 at the 5%
level.
(Note that a negative value for tβ̂j , no matter how large in absolute value, leads to
a failure in rejecting H0 in favor of H1 : βj > 0.)

33 / 100
4.2 Hypothesis Testing: The t Test
Testing against One-Sided Alternatives: Case 1 (H1 : βj > 0)
Step-By-Step: How Do we Test for Significance of a Coefficient?

34 / 100
4.2 Hypothesis Testing: The t Test
Testing against One-Sided Alternatives: Case 1 (H1 : βj > 0)
Step-By-Step: How Do we Test for Significance of a Coefficient?

I There is a pattern in the critical values (see Table “Critical values of the t
distribution” on the next slide):
I As the significance level falls, the critical value increases.
I So we require a larger and larger value of tβ̂j in order to reject H0 .
I Thus, if H0 is rejected at, say, the 5% level, then it is automatically rejected
also at the 10% level.
(It makes no sense to reject the null hypothesis at, say, the 5% level and then redo
the test to determine the outcome at the 10% level.)

I Note that as the degrees of freedom in the t distribution get larger, the t
distribution approaches the standard normal distribution:
I E.g., when n − k − 1 = 120, the 5% critical value for the one-sided
alternative is c = 1.658, compared with the standard normal value of 1.645.
These are close enough for practical purposes.
⇒ So, for degrees of freedom greater than 120, one can use the standard
normal critical values.
35 / 100
4.2 Hypothesis Testing: The t Test
Testing against One-Sided Alternatives: Case 1 (H1 : βj > 0)
Step-By-Step: How Do we Test for Significance of a Coefficient?

36 / 100
4.2 Hypothesis Testing: The t Test
Testing against One-Sided Alternatives: Case 1 (H1 : βj > 0)
Example

Example (Wage effects of education (continued))


I Suppose we wanted to perform a one-sided test:
H0 : βschooling = 0 versus H1 : βschooling > 0
I Question: can we reject the null hypothesis in this test at a five
percent level of significance?
I We assume for this application that MLR.1-MLR.6 hold.
(see earlier discussion on MLR.6!)

Data (n = 10):

id   wage    schooling
1    6       8
2    5.3     12
3    8.75    16
4    11.25   18
5    5       12
6    3.6     12
7    18.18   17
8    6.25    16
9    8.13    13
10   8.77    12

37 / 100
4.2 Hypothesis Testing: The t Test
Testing against One-Sided Alternatives: Case 1 (H1 : βj > 0)
Example

Example (Wage effects of education (continued))


1. Recall our previous estimates:

   wage-hat = −3.569 + 0.8597 schooling,   n = 10, R² = 0.395
              (5.23)    (0.376)

2. Compute the t-statistic as tschooling = 0.8597/0.376 = 2.29.
3. Compute degrees of freedom: n − k − 1 = 10 − 1 − 1 = 8.
4. Find critical value from Table G.2: 1-tailed test, 5%-significance level, 8
degrees of freedom: c = 1.860.
5. Apply rejection rule: We reject H0 since tschooling > c.
I Conclusion:
In the one-sided test, we reject the null hypothesis H0 : βschooling = 0
at the 5% level of significance.
38 / 100
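The ten observations listed above are enough to reproduce these numbers. The following Python sketch (illustrative only, not from the slides; it uses the simple-OLS formulas directly rather than a regression package) recomputes the estimates, the standard error, and the one-sided test:

```python
# Reproduce the n = 10 wage/schooling example by hand (simple OLS formulas).
import numpy as np
from scipy.stats import t as t_dist

wage = np.array([6, 5.3, 8.75, 11.25, 5, 3.6, 18.18, 6.25, 8.13, 8.77])
school = np.array([8, 12, 16, 18, 12, 12, 17, 16, 13, 12], dtype=float)

n = len(wage)
sxx = np.sum((school - school.mean()) ** 2)
b1 = np.sum((school - school.mean()) * (wage - wage.mean())) / sxx   # about 0.8597
b0 = wage.mean() - b1 * school.mean()                                # about -3.569

resid = wage - (b0 + b1 * school)
df = n - 2
sigma2_hat = np.sum(resid ** 2) / df
se_b1 = np.sqrt(sigma2_hat / sxx)                                    # about 0.376

t_stat = b1 / se_b1                    # about 2.29
c = t_dist.ppf(0.95, df)               # 1.860 for a one-sided 5% test, df = 8
print(b1, se_b1, t_stat, t_stat > c)   # reject H0 in favor of beta_schooling > 0
```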
4.2 Hypothesis Testing: The t Test
Testing against One-Sided Alternatives: Case 2 (H1 : βj < 0)

I The other one-sided alternative that the parameter is less than zero, i.e.:

H 1 : βj < 0

also arises in applications.


I The rejection rule for this alternative is just the mirror image of the
previous case:
I Now, the critical value comes from the left tail of the t distribution.

I In practice, it is easiest to think of the rejection rule as follows: H0 is


rejected in favour of H1 if:

tβ̂j < −c

where c is the critical value for the alternative H1 : βj > 0.

39 / 100
4.2 Hypothesis Testing: The t Test
Testing against One-Sided Alternatives: Case 2 (H1 : βj < 0)

I Note that for simplicity, we always assume c is positive:


I Why? Because this is how critical values are reported in t tables.
I So, the critical value −c is a negative number.

I Illustrative example:
I If the significance level is 5% and the degrees of freedom is 18, then c is
1.734.
I H0 : βj = 0 is therefore rejected in favor of H1 : βj < 0 at the 5% level if
tβ̂j < −1.734.
I It is important to remember that, to reject H0 against the negative
alternative H1 : βj < 0, we must get a negative t statistic:
I A positive t ratio, no matter how large, provides no evidence in favor of
H1 : βj < 0.

40 / 100
4.2 Hypothesis Testing: The t Test
Testing against One-Sided Alternatives: Case 2 (H1 : βj < 0)
I The rejection rule is illustrated in the following figure:

41 / 100
4.2 Hypothesis Testing: The t Test
Two-Sided Alternatives: Case 3 (H1 : βj 6= 0)

I In applications, it is common to test the null hypothesis H0 : βj = 0 against


a two-sided alternative (also called two-tailed test), i.e.:

H1 : βj 6= 0

I So, under this alternative, xj has a ceteris paribus effect on y without


specifying whether the effect is positive or negative.
I This is the relevant alternative when the sign of βj is not well determined by
theory (or common sense).

I Even when we know whether βj is positive or negative under the


alternative, a two-sided test is often prudent:
I At a minimum, using a two-sided alternative prevents us from looking at the
estimated equation and then basing the alternative on whether β̂j is positive
or negative.
I Using regression estimates to help us formulate the null or alternative
hypotheses is not allowed as classical statistical inference presumes that we
state the null and alternative about the population before looking at the
data.
42 / 100
4.2 Hypothesis Testing: The t Test
Two-Sided Alternatives: Case 3 (H1 : βj 6= 0)

I When the alternative is two-sided, we are interested in the absolute value of


the t statistic. The rejection rule for H0 : βj = 0 against H1 : βj 6= 0 is:

|tβ̂j | > c

where | · | denotes the absolute value and c is an appropriately chosen


critical value.
I To find c, we again specify a significance level, say 5%.
I With a two-tailed test at the 5% level of significance, c is chosen to make
the area in each tail of the t distribution equal 2.5%.
I In other words, c is the 97.5th percentile in the t distribution with n − k − 1
degrees of freedom.
I When n − k − 1 = 25, the 5% critical value for a two-sided test is c = 2.060.
I The figure on the next slide provides an illustration of this distribution.

43 / 100
4.2 Hypothesis Testing: The t Test
Two-Sided Alternatives: Case 3 (H1 : βj 6= 0)

44 / 100
4.2 Hypothesis Testing: The t Test
Two-Sided Alternatives: Case 3 (H1 : βj 6= 0)

I If H0 is rejected in favour of H1 at the 5% level, we say that “xj is


statistically significant, or statistically different from zero, at the 5% level.”
I If H0 is not rejected, we say that “xj is statistically insignificant, or not
statistically different from zero, at the 5% level.”

I Note: In applications, we normally test against a two-sided alternative:


I One-sided tests are rarely used in practice.
I So, unless there is a very specific case for a one-sided test, the two-sided
test would be the standard test to carry out.

45 / 100
4.2 Hypothesis Testing: The t Test
Two-Sided Alternatives: Case 3 (H1 : βj 6= 0)

Example (Wage effects of education (continued))


I Suppose we now wanted to perform a two-sided test of
H0 : βschooling = 0 versus H1 : βschooling 6= 0.
I t-statistic is the same as before: tschooling = 0.8597/0.376 = 2.29.
I Same degrees of freedom: n − k − 1 = 10 − 1 − 1 = 8.
I Find critical value from Table G.2: 2-tailed test, 5%-significance level, 8
degrees of freedom: c = 2.306.
I Apply rejection rule: We do not reject H0 since |tschooling | < c.
I Conclusion:
In the two-sided test, we cannot reject the null hypothesis that
H0 : βschooling = 0.
Schooling is not statistically significant at 5%-level in this regression.

46 / 100
4.2 Hypothesis Testing: The t Test
Testing Other Hypotheses about βj
I Sometimes we want to test whether βj is equal to some given constant
other than zero. In this case, the null hypothesis is:

H0 : βj = aj

where aj is our hypothesized value of βj .


I The appropriate t statistic is:

t = (β̂j − aj) / se(β̂j) = (estimate − hypothesized value) / standard error

As before, t measures how many estimated standard deviations β̂j is away from the
hypothesized value of βj . The usual t statistic is obtained when aj = 0.

I Under H0 , we know that the t statistic is distributed as:


t ∼ tn−k−1
47 / 100
4.2 Hypothesis Testing: The t Test
Testing Other Hypotheses about βj
I We can use the general t statistic to test against one- or two-sided
alternatives exactly as before – the only difference is in how we compute
the t statistic:
I Example 1 (a one-sided alternative): E.g. H0 : βj = 1, H1 : βj > 1:
I The rejection rule is the usual one for a one-sided test: reject H0 in favor of
H1 if t > c, where c is a one-sided critical value (found exactly as before).
I If H0 is rejected, we say that “β̂j is statistically greater than one” at the
appropriate significance level.
I Example 2 (a two-sided alternative): E.g. H0 : βj = −1, H1 : βj 6= −1:
I We still compute the t statistic as usual:

t = (β̂j − aj) / se(β̂j) = (β̂j − (−1)) / se(β̂j) = (β̂j + 1) / se(β̂j)
I The rejection rule is the usual one for a two-sided test: reject H0 if |t| > c,
where c is a two-tailed critical value (found exactly as before).
I If H0 is rejected, we say that “β̂j is statistically different from negative one”
at the appropriate significance level.
48 / 100
4.2 Hypothesis Testing: The t Test
Testing Other Hypotheses about βj

Example (Wage effects of education (continued))


I Suppose someone claims that the effect of an additional year of schooling
on hourly wages equals only 0.1.
I Question: Can we reject this claim at 1% level with a two-sided test?
I t statistic:

t = (0.8597 − 0.1) / 0.376 = 2.02
I Critical value from Table G.2: 1% level of significance, two-sided, 8 degrees
of freedom: c = 3.355.
I Rejection rule: We cannot reject since |tβ̂j | < c.
I Conclusion: Although our estimate is much higher than the claimed value,
we cannot reject this claim at a significance level of 1%.

49 / 100
4.2 Hypothesis Testing: The t Test
Testing Other Hypotheses about βj
Example 4.4: Campus Crime and Enrollment

I Consider a simple model relating the annual number of crimes on college


campuses (crime) to student enrollment (enroll):
log (crime) = β0 + β1 log (enroll) + u
(This is a constant elasticity model. β1 is the elasticity of crime w.r.t. enrollment.)

I Note that it makes no sense to test H0 : β1 = 0, as we expect crime to


increase as the size of the campus increases.
I (A more interesting) Hypothesis: The elasticity of crime with respect to
enrollment is one (H0 : β1 = 1):
I The hypothesis implies that a 1% increase in enrollment leads to, on
average, a 1% increase in crime.
I (A noteworthy) Alternative hypothesis: The elasticity of crime with respect
to enrollment is greater than one (H1 : β1 > 1):
I The alternative hypothesis implies that a 1% increase in enrollment
increases campus crime by more than 1%.
50 / 100
4.2 Hypothesis Testing: The t Test
Testing Other Hypotheses about βj
Example 4.4: Campus Crime and Enrollment

I If β1 > 1, then, in a relative sense – not just an absolute sense – crime is


more of a problem on larger campuses.
I One way to see this is to take the exponential of the equation
log (crime) = β0 + β1 log (enroll) + u, which gives:

crime = exp(β0) · enroll^β1 · exp(u)

(See Appendix A for properties of the natural logarithm and exponential functions.)

I For β0 and u = 0, this equation is graphed in the figure on the next slide
for β1 < 1, β1 = 1, and β1 > 1.

51 / 100
4.2 Hypothesis Testing: The t Test
Testing Other Hypotheses about βj
Example 4.4: Campus Crime and Enrollment

52 / 100
4.2 Hypothesis Testing: The t Test
Testing Other Hypotheses about βj
Example 4.4: Campus Crime and Enrollment

I We test β1 = 1 against β1 > 1 using data on 97 colleges and universities in


the United States for the year 1992 (CAMPUS.dta), provided by the FBI’s
Uniform Crime Reports. The estimated equation is:

log(crime)-hat = −6.63 + 1.27 log(enroll),   n = 97, R² = 0.585.
                 (1.03)    (0.11)

I The estimated elasticity of crime with respect to enroll is 1.27, which is in


the direction of the alternative β1 > 1.
I The correct t statistic is: t = (1.27 − 1)/0.11 = 0.27/0.11 ≈ 2.45.
(Note: The correct t statistic is not t = β̂1/se(β̂1) = 1.27/0.11!)
I The one-sided 5% critical value for a t distribution with 97 − 2 = 95 df is:
c ≈ 1.66 (using df = 120).
⇒ Apply rejection rule: We do reject H0 since tlog(enroll) > c.
I.e. we reject β1 = 1 in favor of β1 > 1 at the 5% level.
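A short sketch of this calculation (illustrative only; the slide's reported estimate and standard error are plugged in directly rather than re-estimated from CAMPUS.dta, and scipy is assumed):

```python
# t test of H0: beta1 = 1 against H1: beta1 > 1 using the reported estimates.
from scipy.stats import t as t_dist

beta1_hat, se_beta1 = 1.27, 0.11       # from the estimated crime equation
df = 97 - 2                            # n - k - 1 with one regressor

t_stat = (beta1_hat - 1) / se_beta1    # about 2.45, not 1.27/0.11
c = t_dist.ppf(0.95, df)               # about 1.66
print(t_stat, c, t_stat > c)           # reject beta1 = 1 in favor of beta1 > 1
```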
53 / 100
4.2 Hypothesis Testing: The t Test
Summary: Critical Values for large Number of Degrees of Freedom
I Recall that if the number of degrees of freedom (n − k − 1) is large, we can
approximate the t-distribution well using the standard normal distribution.
I In practice, the approximation is close enough for a t-test with more than
120 degrees of freedom. The resulting rejection rules are summarized in the
table:
                                      Significance Level
                                      10%     5%      1%
2-Sided test (≠): reject if |t| >     1.64    1.96    2.58
1-Sided test (>): reject if t >       1.28    1.64    2.33
1-Sided test (<): reject if t <      −1.28   −1.64   −2.33
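These large-sample critical values are simply standard normal percentiles; a small illustrative Python sketch (assuming scipy, not part of the slides) reproduces the table:

```python
# Large-sample (standard normal) critical values for t tests.
from scipy.stats import norm

for alpha in (0.10, 0.05, 0.01):
    two_sided = norm.ppf(1 - alpha / 2)   # 1.64, 1.96, 2.58
    one_sided = norm.ppf(1 - alpha)       # 1.28, 1.64, 2.33
    print(alpha, round(two_sided, 2), round(one_sided, 2))
```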
54 / 100
4.2 Hypothesis Testing: The t Test
Summary: Critical Values for large Number of Degrees of Freedom

Example (Wage effects of education (continued))


I Suppose we wanted to again perform a two-sided test of
H0 : βschooling = 0 versus H1 : βschooling 6= 0 at a 1% level of
significance.
I Suppose we now have a bigger sample of n = 526, and obtained the
following results, using this larger sample:

wage-hat = −0.905 + 0.541 schooling,   n = 526, R² = 0.1648
           (0.685)   (0.053)

I We continue to assume that MLR.1–MLR.6 hold.

55 / 100
4.2 Hypothesis Testing: The t Test
Summary: Critical Values for large Number of Degrees of Freedom

Example (Wage effects of education (continued))


I We compute t= 0.541/0.053=10.2.
I Degrees of freedom: n − k − 1 = 526 − 1 − 1 = 524, so that the large
sample approximation works well.
I Critical value (see last but one slide), two-sided, 1% level of significance:
c = 2.58.
I Apply rejection rule: |tschooling | > c
I Conclusion: We can reject H0 at 1% significance. The regressor schooling
is statistically significant at the 1% level (in this larger sample).

56 / 100
4.2 Hypothesis Testing: The t Test
Computing p-Values for t Tests

I So far, we talked about how to test hypotheses using a classical approach,


in which one proceeds as follows:
1. State the alternative hypothesis.
2. Choose the significance level, which then determines the critical value.
3. Calculate the t statistic.
4. Compare the t statistic with the critical value to reject/not reject the null at
ex ante chosen significance level.

I However, the choice of a particular significance level ahead of time has


potential drawbacks:
I It involves some arbitrariness (there is no “correct” significance level, and
different researchers may prefer different significance levels).
I It can hide useful information about the outcome of a hypothesis test (e.g.
failure to reject H0 at 5% level, but rejection of H0 at 10% level).

57 / 100
4.2 Hypothesis Testing: The t Test
Computing p-Values for t Tests

I Rather than testing at different significance levels, it is more informative to


answer the following question:
I Given the observed value of the t statistic, what is the smallest significance
level at which the null hypothesis would be rejected?

I The smallest significance level at which the null hypothesis would be


rejected is called the p-value for the test (see Appendix C in the textbook):
I The p-value is the significance level of the test when we use the value of the
test statistic (t) as the critical value for the test.
I The p-value for testing the null hypothesis H0 : βj = 0 against the two-sided
alternative is:
p-value = P(|T | > |t|) = 2P(T > |t|)
where:
- T is a t distributed random variable with n − k − 1 d.f..
- t denotes the numerical value of the test statistic.
- P(T > |t|) is the area to the right of |t| in a t distribution with n − k − 1 d.f..

58 / 100
4.2 Hypothesis Testing: The t Test
Computing p-Values for t Tests

I Interpretation: The p-value is the probability of observing a t statistic as
extreme (as large in absolute value) as we actually did if the null hypothesis is true. So:
⇒ Small p-values are evidence against H0 .
⇒ Large p-values provide little evidence against H0 .
E.g., if the p-value = 0.5 (reported always as a decimal, not a percentage),
then we would observe a value of the t statistic as large as we did in 50% of
all random samples when the null hypothesis is true; this is pretty weak
evidence against H0 .

I Note that, once the p-value has been computed, a classical test can be
carried out at any desired level:
I If α denotes the significance level of the test (in decimal form), then H0 is
rejected if p-value< α; otherwise, H0 is not rejected at the 100 × α% level.

59 / 100
4.2 Hypothesis Testing: The t Test
Computing p-Values for t Tests

Example (Wage effects of education (continued))


I In the two-sided test with n = 10, we are looking for:

P(|T | > 2.29) = 2 × P(T > 2.29), where T ∼ t8

I We can use a statistics program (e.g. Stata) to compute p=0.052.


I Interpretation:
I Schooling is not statistically significant at e.g. 5% level.
I Schooling is statistically significant at e.g. 10% level.
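For reference, the p-value quoted in this example can be reproduced with a short sketch (illustrative; scipy stands in here for the Stata computation mentioned above):

```python
# Two-sided p-value for the schooling coefficient (df = 8).
from scipy.stats import t as t_dist

t_stat = 0.8597 / 0.376                # about 2.29
p_value = 2 * t_dist.sf(t_stat, df=8)  # sf(t) = P(T > t)
print(p_value)                         # about 0.05 (the slides report 0.052)
```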

60 / 100
4.2 Hypothesis Testing: The t Test
Computing p-Values for t Tests
Two Caveats

1. A Reminder of the Language of Classical Hypothesis Testing:


I When H0 is not rejected, we say “We fail to reject H0 at the x% level”, and
not “H0 is accepted at the x% level”.
I Why? Because there are many other values for βj that cannot be rejected.

2. Economic (or Practical) versus Statistical Significance:


I The statistical significance of a variable xj is determined entirely by the size
of tβ̂j .
I The economic significance or practical significance of a variable xj is related
to the size (and sign) of β̂j .
I Recall that the t statistic for testing H0 : βj = 0 is defined as tβ̂j = β̂j / se(β̂j),
so tβ̂j can indicate statistical significance either because β̂j is “large” or
because se(β̂j) is “small”.

61 / 100
Outline

Introduction

4.1 Sampling Distributions of the OLS Estimator

4.2 Hypothesis Testing: The t Test

4.3 Confidence Intervals

4.4 Testing Hypotheses about a Single Linear Combination of the Parameters

4.5 Testing Multiple Linear Restrictions: The F Test

4.6 Reporting Regression Results

62 / 100
4.3 Confidence Intervals

I Under the CLM assumptions, we can construct a confidence interval (CI)


for the population parameter βj :
I Confidence intervals are also called interval estimates because they provide a
range of likely values for the population parameter, and not just a point
estimate.
   
I Using the fact that (β̂j − βj) / se(β̂j) has a t distribution with n − k − 1
d.f., simple manipulation leads to a CI for the unknown βj , e.g. a 95% CI
given by:

β̂j ± c × se(β̂j)

where the constant c is the 97.5th percentile in a tn−k−1 distribution.
I More precisely, the lower and upper bounds of the confidence interval are
given by, respectively:

β_j ≡ β̂j − c × se(β̂j)   (lower bound)
and
β̄j ≡ β̂j + c × se(β̂j)   (upper bound)
63 / 100
4.3 Confidence Intervals

I What does a confidence interval mean?


I If random samples were obtained over and over again, with β_j and β̄j
computed each time, then the (unknown) population value βj would lie in
the interval [β_j , β̄j ] for 95% of the samples.
I So, for the single sample that we use to construct the CI, we do not know
whether βj is actually contained in the interval. We hope we have obtained
a sample that is one of the 95% of all samples where the interval estimate
contains βj , but we have no guarantee.
I Two remarks:
I Note again that, when n − k − 1 > 120, the tn−k−1 distribution is close
enough to normal to use the 97.5th percentile in a standard normal
distribution for constructing a 95% CI: β̂j ± 1.96 × se(β̂j).
I Once a CI is constructed, it is easy to carry out two-tailed hypothesis tests:
If the null hypothesis is H0 : βj = aj , then H0 is rejected against H1 : βj 6= aj
at (say) the 5% significance level, if, and only if, aj is NOT in the 95% CI
itself.
64 / 100
4.3 Confidence Intervals

Example (Wage effects of education (continued))


I The 95% confidence interval corresponding to our n = 10 regression would
be:
 
CI = β̂schooling ± c × se(β̂schooling)
   = 0.8597 ± 2.306 × 0.376
   = [−0.007; 1.727]
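A minimal sketch of this interval calculation (illustrative, assuming scipy):

```python
# 95% CI for beta_schooling in the n = 10 regression (df = 8).
from scipy.stats import t as t_dist

b, se, df = 0.8597, 0.376, 8
c = t_dist.ppf(0.975, df)             # 2.306
print(b - c * se, b + c * se)         # about [-0.007, 1.727]
```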

65 / 100
4.3 Confidence Intervals

I It is important to remember that a confidence interval is only as good as


the underlying assumptions used to construct it:
I Omitted variables: If we have omitted factors that are correlated with the
explanatory variables, then the coefficient estimates are not reliable: OLS is
biased.
I Heteroskedasticity: If heteroskedasticity is present, then the standard error is
not valid as an estimate of sd(β̂j), and the confidence interval computed
using these standard errors will not truly be a 95% CI.
I Nonnormality of errors: We have also used the normality assumption on the
errors in obtaining these CIs, but, as we will see in Topic 5, this is not as
important for applications involving hundreds of observations.

66 / 100
Outline

Introduction

4.1 Sampling Distributions of the OLS Estimator

4.2 Hypothesis Testing: The t Test

4.3 Confidence Intervals

4.4 Testing Hypotheses about a Single Linear Combination of the Parameters

4.5 Testing Multiple Linear Restrictions: The F Test

4.6 Reporting Regression Results

67 / 100
4.4 Testing Hypotheses about a Single Linear Combination
of the Parameters

I The previous two sections have shown how to use classical hypothesis
testing (the t-test) or confidence intervals to test hypotheses about a single
parameter βj at a time.
I In applications, however, we often must test hypotheses involving more
than one population parameter.
I We will cover two cases:
1. In Section 4.4: Testing a single restriction, involving several parameters
(a single hypothesis): a modified t-test.
2. In Section 4.5: Testing several restrictions jointly
(multiple hypotheses): the F -test.

68 / 100
4.4 Testing Hypotheses about a Single Linear Combination
of the Parameters
Example (Returns to education)
I Consider a simple model to compare the returns to education at (two-year)
junior colleges (jc) and four-year colleges (univ ):

log (wage) = β0 + β1 jc + β2 univ + β3 exper + u

where jc (univ ) is # years attending a 2-year college (4-year college), and exper is
months in workforce. The population are working people with a high school degree.

I The hypothesis of interest is whether another year at junior college and


another year at university lead to the same ceteris paribus percentage
increase in wages:
H 0 : β1 = β2
I For the most part, the alternative of interest is one-sided: a year at a junior
college is worth less than a year at a university:
H1 : β1 < β2
69 / 100
4.4 Testing Hypotheses about a Single Linear Combination
of the Parameters
I How do we proceed?
I The two hypotheses (H0 , H1 ) concern two parameters, β1 and β2 , a situation
we have not faced yet.
I We cannot simply use the individual t statistics for β̂1 and β̂2 to test H0 .
I However, conceptually there is no difficulty in constructing a t statistic for
testing such a H0 .

⇒ We can rewrite the null and the alternative as:


H 0 : β1 − β2 = 0
H 1 : β1 − β2 < 0
and apply the familiar t statistic to the difference:
t = (β̂1 − β̂2) / se(β̂1 − β̂2)

(i.e. the t-statistic is based on whether the estimated difference β̂1 − β̂2 is sufficiently
less than zero to warrant rejecting H0 in favor of H1 . To account for the sampling
error in our estimators, we standardize this difference by dividing by the standard error.)
70 / 100
4.4 Testing Hypotheses about a Single Linear Combination
of the Parameters
I Once we have this t-statistic, testing proceeds as before:
I We choose a significance level for the test, and, based on the df , obtain a
critical value.
I Because the alternative is of the form H1 : β1 − β2 < 0, the rejection rule is
of the form: Reject H0 if t < −c, where c is a positive value chosen from
the appropriate t distribution (or we compute the t statistic and then
compute the p-value).

I The only thing that makes testing equality of two different parameters
more difficult than testing about a single βj is obtaining the standard error
in the denominator of the t statistic:
I Obtaining the numerator, β̂1 − β̂2 , in contrast, is trivial once we have
performed the OLS regression.
 
⇒ So, how do we compute the denominator of the t statistic, se(β̂1 − β̂2)?
I Note: se(β̂1 − β̂2) ≠ se(β̂1) − se(β̂2)!

71 / 100
4.4 Testing Hypotheses about a Single Linear Combination
of the Parameters
 
I To compute se(β̂1 − β̂2), we first obtain the variance of the difference.
Using the results on variances in Appendix B in the textbook, we have:

Var(β̂1 − β̂2) = Var(β̂1) + Var(β̂2) − 2 Cov(β̂1, β̂2)

I The standard deviation of β̂1 − β̂2 is just the square root of this, and since
[se(β̂1)]² is an unbiased estimator of Var(β̂1), and similarly for [se(β̂2)]², we have:

se(β̂1 − β̂2) = { [se(β̂1)]² + [se(β̂2)]² − 2 s12 }^(1/2)

where s12 is an estimate of Cov(β̂1, β̂2).
I However, while this approach is feasible, it requires us to estimate the
covariance between the two slope estimators, Cov(β̂1, β̂2).
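As a tiny numerical illustration of this formula (all three inputs below are made-up values, not estimates from any model in these slides):

```python
# se(b1_hat - b2_hat) from the two standard errors and their estimated covariance.
import numpy as np

se_b1, se_b2, s12 = 0.0069, 0.0023, 0.0000012   # hypothetical values
se_diff = np.sqrt(se_b1**2 + se_b2**2 - 2 * s12)
print(se_diff)
```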
72 / 100
4.4 Testing Hypotheses about a Single Linear Combination
of the Parameters
 
I However, rather than trying to compute se(β̂1 − β̂2) from the above
equation, it is much easier to estimate instead a different model that
directly gives us the standard error of interest:
I Define a new parameter as the difference between β1 and β2 : θ1 = β1 − β2 .
I Then, we want to test:
H0 : θ1 = 0 against H1 : θ1 < 0
I The t statistic from before, i.e. t = (β̂1 − β̂2)/se(β̂1 − β̂2), in terms of θ̂1
is just t = θ̂1 /se(θ̂1).
⇒ So, the challenge is now finding se(θ̂1).

I We can do this by rewriting the original model so that θ1 = β1 − β2
appears directly as the coefficient on one of the independent variables:
I As a consequence, we can assess the null hypothesis directly by testing (by
way of a standard t-test) whether the coefficient on this independent
variable, θ1 , is zero.
73 / 100
4.4 Testing Hypotheses about a Single Linear Combination
of the Parameters
Example (Returns to education (continued))
I Because θ1 = β1 − β2 , we can write β1 = θ1 + β2 . Plugging this into our
original model equation and rearranging gives the equation:

log (wage) = β0 + (θ1 + β2 )jc + β2 univ + β3 exper + u


→ log (wage) = β0 + θ1 jc + β2 (jc+univ) + β3 exper + u
→ log (wage) = β0 + θ1 jc + β2 totcoll + β3 exper + u
I The parameter we are interested in testing hypotheses about, θ1 , now
multiplies the variable jc.
I So, when we estimate this transformed equation, we obtain directly
estimates of both θ1 and its standard error, i.e. se(θ̂1).
I Note that we must construct a new variable totcoll = jc + univ , capturing
total years of college, and include it in the regression model in place of
univ .
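A sketch of this reparametrization with simulated data (everything here is illustrative: the data-generating coefficients are invented and numpy's least-squares routine stands in for whatever regression software one actually uses):

```python
# Estimate log(wage) = b0 + theta1*jc + b2*totcoll + b3*exper + u, where
# totcoll = jc + univ, so the coefficient on jc is theta1 = beta1 - beta2.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
jc = rng.integers(0, 3, n).astype(float)       # years of junior college
univ = rng.integers(0, 5, n).astype(float)     # years of university
exper = rng.uniform(0, 20, n)
u = rng.normal(0, 0.3, n)
logwage = 1.5 + 0.05 * jc + 0.08 * univ + 0.01 * exper + u   # invented betas

totcoll = jc + univ
X = np.column_stack([np.ones(n), jc, totcoll, exper])
coef, *_ = np.linalg.lstsq(X, logwage, rcond=None)
theta1_hat = coef[1]      # estimate of beta1 - beta2, near the true -0.03
print(theta1_hat)
```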
74 / 100
Outline

Introduction

4.1 Sampling Distributions of the OLS Estimator

4.2 Hypothesis Testing: The t Test

4.3 Confidence Intervals

4.4 Testing Hypotheses about a Single Linear Combination of the Parameters

4.5 Testing Multiple Linear Restrictions: The F Test

4.6 Reporting Regression Results

75 / 100
4.5 Testing Multiple Linear Restrictions: The F Test

I So far, we have only covered hypotheses involving a single restriction:


1. Hypotheses about one parameter:
The t statistic associated with any OLS coefficient can be used to test
whether the corresponding unknown parameter in the population is equal to
any given constant (which is usually, but not always, zero).
2. Hypotheses about a single linear combination of the βj :
To test such hypotheses, one can rearrange the equation and run a
regression using transformed variables.

I Now, we want to test multiple hypotheses about the underlying parameters


β0 , β1 , ..., βk .
I We begin with the leading case of testing whether a set of independent
variables has no partial effect on a dependent variable:
I I.e. we want to test whether a group of variables has no effect on the
dependent variable, once another set of variables has been controlled for.

76 / 100
4.5 Testing Multiple Linear Restrictions: The F Test
Testing Exclusion Restrictions
I We begin with an example to illustrate why testing significance of a group
of variables can be useful:
Example (Major league baseball players’ salaries)
I The following model explains major league baseball players’ salaries:
log (salary) =β0 + β1 years + β2 gamesyr + β3 bavg
+ β4 hrunsyr + β5 rbisyr + u
where: salary is the 1993 total salary, years is years in the league, gamesyr is average
games played per year, bavg is the career batting average, hrunsyr is home runs per year,
and rbisyr is runs batted in per year.
I Suppose we want to test the null hypothesis that, once years in the league
and games per year have been controlled for, the statistics measuring
performance (bavg, hrunsyr, rbisyr) have no effect on salary, i.e.:
H0 : β3 = 0, β4 = 0, β5 = 0
(I.e., essentially, H0 states that productivity as measured by baseball statistics has no
effect on salary.)
77 / 100
4.5 Testing Multiple Linear Restrictions: The F Test
Testing Exclusion Restrictions

I The full model in the example is called unrestricted model:


log (salary) =β0 + β1 years + β2 gamesyr + β3 bavg +β4 hrunsyr + β5 rbisyr + u

I The null hypothesis in the example constitutes three exclusion restrictions:


H0 : β3 = 0, β4 = 0, β5 = 0
I A test of multiple restrictions is called a multiple hypotheses test or a joint
hypotheses test.
I Imposing these restrictions leads to a second model, called restricted model:
log(salary ) = β0 + β1 years + β2 gamesyr + u
I The appropriate alternative hypothesis is that at least one of β3 , β4 , or β5
is different from zero (i.e. any or all could be different from zero), so:
H1 : H0 is not true
I The test we study is hence constructed to detect any violation of H0 .
78 / 100
4.5 Testing Multiple Linear Restrictions: The F Test
Testing Exclusion Restrictions

I How should we proceed in testing H0 against H1 ?


I We need a way to test the exclusion restrictions jointly!
[Note: We can’t use the t statistics on the variables bavg, hrunsyr and
rbisyr, as a particular t statistic tests a hypothesis that puts no restrictions
on the other parameters.]
⇒ The F -test provides a way to test the exclusion restrictions jointly.

I The F -test is based on the sum of squared residuals (SSR) of the models:
I Intuition of the test: If relevant variables are dropped, we should see a
substantial increase in the sum of squared residuals SSR.
I In our example: Does the SSR increase significantly, when we drop the

variables bavg, hrunsyr and rbisyr?


I This motivates paying attention to the difference in SSR values:

SSRr − SSRur .
⇒ Thus, we want to reject the null hypothesis if the increase in the SSR in
going from the unrestricted model to the restricted model is large.
79 / 100
4.5 Testing Multiple Linear Restrictions: The F Test
Testing Exclusion Restrictions

I Because it is no more difficult, we will derive the F -test directly for the
general case:
I The unrestricted model with k independent variables (and hence k + 1
parameters) is:
y = β0 + β1 x1 + ... + βk xk + u
I Suppose that we have q exclusion restrictions to test; say, the null
hypothesis states that the last q variables have zero coefficients:

H0 : βk−q+1 = 0, ..., βk = 0

I The alternative hypothesis is simply that H0 is false (i.e. at least one of the
parameters listed in H0 is different from zero).
I When we impose the restrictions under H0 , we get the restricted model:

y = β0 + β1 x1 + ... + βk−q xk−q + u

80 / 100
4.5 Testing Multiple Linear Restrictions: The F Test
Testing Exclusion Restrictions

I The F statistic (or F ratio) measures the relative increase in SSR when
moving from the unrestricted to the restricted model. It is defined by:
F ≡ [(SSRr − SSRur)/q] / [SSRur/(n − k − 1)]
I SSRr = the sum of squared residuals from the restricted model.
I SSRur = the sum of squared residuals from the unrestricted model.
! F statistic is always nonnegative (as SSRr can be no smaller than SSRur ).
! Denominator of F is just the unbiased estimator of σ 2 = Var (u) in the
unrestricted model.
I Numerator and denominator degrees of freedom:
I numerator degrees of freedom = dfr − dfur = q,
i.e. the number of restrictions imposed in moving from the unrestricted to
the restricted model (q independent variables are dropped).
I Why? The df in each case equals n minus the number of estimated parameters;
n is identical in both models, and the number of estimated parameters differs by q.
I denominator degrees of freedom = dfur = n − k − 1.
81 / 100
4.5 Testing Multiple Linear Restrictions: The F Test
Testing Exclusion Restrictions

I To use the F statistic, we must know its sampling distribution under the
null in order to choose critical values and rejection rules.
I It can be shown that, under H0 (and assuming the CLM assumptions hold),
F is distributed as an F random variable with (q, n − k − 1) degrees of
freedom. We write this as:

F ∼ Fq,n−k−1

I Why?
I It can be shown that F ≡ [(SSRr − SSRur)/q] / [SSRur/(n − k − 1)] is actually the
ratio of two independent chi-square random variables, divided by their
respective degrees of freedom, i.e. q and n − k − 1.
I This is the definition of an F distributed random variable (see Appendix B).
I This result allows us to use the tabulated F -distribution to find critical
values.
82 / 100
4.5 Testing Multiple Linear Restrictions: The F Test
Testing Exclusion Restrictions

I It is pretty clear from the definition of F that we will reject H0 in favor of


H1 when F is sufficiently ”large”. How large depends on our chosen level of
significance.
I So actual testing proceeds as before:
I We choose a significance level for the test, and we obtain a critical value c
based on the d.f.
I Once c has been obtained, we reject H0 in favor of H1 at the chosen
significance level if:
F >c
⇒ If H0 is rejected, we say xk−q+1 , ..., xk are jointly statistically significant
(or just jointly significant) at the appropriate significance level.
[Note: This test alone does not allow us to say which of the variables has a partial
effect on y ; they may all affect y or maybe only one affects y .]
⇒ If H0 is not rejected, we say that the variables are jointly insignificant,
which often justifies dropping them from the model.
83 / 100
4.5 Testing Multiple Linear Restrictions: The F Test
Testing Exclusion Restrictions
I Example: With a 5% significance level, q = 3, and n − k − 1 = 60, the
critical value is c = 2.76:

⇒ We would reject H0 at the 5% level if the computed value of the F statistic


exceeds c = 2.76 (Note: for the same df , the 1% critical value is 4.13).
84 / 100
4.5 Testing Multiple Linear Restrictions: The F Test
Testing Exclusion Restrictions

I The F statistic in practical applications (1/2):


I In most applications, the numerator df (q) will be notably smaller than the
denominator df (n − k − 1).
I Applications where n − k − 1 is small are unlikely to be successful because
the parameters in the unrestricted model will probably not be precisely
estimated.
I When the denominator df reaches about 120, the F distribution is no longer
sensitive to it.
[This is entirely analogous to the t distribution being well approximated by the
standard normal distribution as the df gets large.]
[Thus, there is an entry in the table for the denominator df = ∞, and this is what
we use with large samples (because n − k − 1 is then large).]

85 / 100
4.5 Testing Multiple Linear Restrictions: The F Test
Testing Exclusion Restrictions
Step-By-Step Approach to F Test

1. Estimate the unrestricted model, and collect SSRur.
2. Estimate the restricted model, and collect SSRr.
3. Compute the degrees of freedom (dfr − dfur) and (n − k − 1).
4. Compute the F statistic.
5. Find the critical value from the tables, corresponding to the chosen significance level α.
6. Apply the rejection rule and interpret the results.

Statistics software can compute the F statistic for us.
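For instance, a minimal sketch of the six steps in Python with statsmodels might look
as follows; the file name wage_data.csv and its column names are hypothetical
placeholders, not part of the lecture material:

import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

df = pd.read_csv("wage_data.csv")                      # hypothetical data set

# Steps 1 and 2: estimate the unrestricted and the restricted model.
ur = smf.ols("lwage ~ schooling + exper + expersq", data=df).fit()
r = smf.ols("lwage ~ schooling", data=df).fit()

# Step 3: degrees of freedom.
q = 2                                                  # number of restrictions
df_den = int(ur.df_resid)                              # n - k - 1

# Step 4: F statistic from the two SSRs.
F = ((r.ssr - ur.ssr) / q) / (ur.ssr / df_den)

# Steps 5 and 6: 5% critical value and rejection rule.
c = stats.f.ppf(0.95, dfn=q, dfd=df_den)
print(F, c, "reject H0" if F > c else "do not reject H0")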

86 / 100
4.5 Testing Multiple Linear Restrictions: The F Test
Testing Exclusion Restrictions

Example (Returns to education)


Question: Are the experience effects jointly significant in our wage regression?
1. Estimate the unrestricted model and collect SSRur:
   log(wage)-hat = 0.128 + 0.0904 schooling + 0.0410 exper − 0.000714 expersq
   R² = 0.300, n = 526, SSR = 103.790
2. Estimate the restricted model and collect SSRr:
   log(wage)-hat = 0.584 + 0.0827 schooling
   R² = 0.186, n = 526, SSR = 120.769
3. Compute the degrees of freedom: numerator df = 2, denominator df = 522.
4. Compute the F statistic:
   F = (16.979/2) / (103.79/522) = 42.70
5. Find the critical value for the chosen α (say 5%): c = 3.00.
6. Apply the rejection rule and interpret the results: since F > c, we reject
   the null hypothesis. The experience effects are jointly significant.
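As a quick arithmetic check (our own, using the numbers reported above):

# F statistic from the reported SSRs of the restricted and unrestricted models.
F = ((120.769 - 103.790) / 2) / (103.790 / 522)
print(round(F, 2))   # 42.70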
87 / 100
4.5 Testing Multiple Linear Restrictions: The F Test
Testing Exclusion Restrictions

I The F statistic in practical applications (2/2):


I It is possible that two (or more) variables that each have insignificant t
statistics are nevertheless jointly highly significant.
I The reason is that the variables may be highly correlated, and this
multicollinearity makes it difficult to uncover the partial effect of each
variable (which shows up as imprecise estimates and small individual t statistics).
I The F statistic is useful because it can be used to test whether such
variables are jointly significant as a group (and multicollinearity between
variables is much less relevant for testing this hypothesis).
I Example: CEO salary and firm performance:
I There might be many measures of firm performances, and these measures
might be highly correlated.
I Hoping to find individually significant measures might be asking too much
due to multicollinearity.
I But an F test can be used to find out whether, as a group, the firm
performance variables affect salary.
88 / 100
4.5 Testing Multiple Linear Restrictions: The F Test
Relationship between F and t Statistics

I We have seen how the F statistic can be used to test whether a group of
variables should be included in a model.
I But what happens if we apply the F statistic to the case of testing
significance of a single independent variable?
[e.g. H0 : βk = 0, q = 1 to test the single exclusion restriction that xk can be excluded
from the model]
I It can be shown that the F statistic for testing exclusion of a single
variable is equal to the square of the corresponding t statistic.
I Since the square of a tn−k−1 random variable has an F1,n−k−1 distribution, the
two approaches lead to the same outcome, provided the alternative hypothesis
is two-sided (a numerical check follows at the end of this slide).
I As t statistics are more flexible for testing a single hypothesis (they can be
directly used to test against one-sided alternatives) and are easier to obtain
than F statistics, there is no reason to use an F statistic to test hypotheses
about a single parameter.
I Warning:
I If we group a significant variable together with a bunch of insignificant
variables, the F test may fail to reject, leading us to conclude that the
entire set of variables is jointly insignificant.
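A quick numerical check of the t²/F equivalence mentioned above (our own
illustration, using 522 degrees of freedom as in the wage example):

from scipy import stats

# The squared two-sided 5% critical value of t(522) equals the 5% critical
# value of F(1, 522).
t_crit = stats.t.ppf(0.975, df=522)
f_crit = stats.f.ppf(0.95, dfn=1, dfd=522)
print(round(t_crit**2, 3), round(f_crit, 3))   # both approx. 3.86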
89 / 100
4.5 Testing Multiple Linear Restrictions: The F Test
The R-Squared Form of the F Statistic

I For testing exclusion restrictions, it is often more convenient to have a form


of the F statistic that can be computed using the R 2 from the restricted
and unrestricted models.
I The R 2 is always between zero and one, whereas the SSRs can be very large
depending on the unit of measurement of y , making the calculation based
on the SSRs tedious.
I Using the fact that SSRr = SST(1 − Rr²) and SSRur = SST(1 − Rur²), and
substituting these expressions into our equation for the F statistic, we get:

F = [(Rur² − Rr²)/q] / [(1 − Rur²)/(n − k − 1)] = [(Rur² − Rr²)/q] / [(1 − Rur²)/dfur]

which is called the R-squared form of the F statistic.

90 / 100
4.5 Testing Multiple Linear Restrictions: The F Test
The R-Squared Form of the F Statistic

I Note:
I The R 2 is reported with almost all regressions (the SSR is not), so it is easy
to use the R 2 s from the unrestricted and restricted models to test for
exclusion of some variables.
I In the numerator, the unrestricted R 2 comes first. In the SSR-based version
of the F statistic, the SSR of the restricted model (SSRr ) comes first.

Example (Returns to education - continued)


I To compute the F -statistic, we can alternatively use the R 2 statistics.
I Using the information on slide 87, we obtain:
F = [(Rur² − Rr²)/q] / [(1 − Rur²)/dfur] = [(0.300 − 0.186)/2] / [(1 − 0.300)/522] ≈ 42.70
i.e. the same value for the F statistic as we obtain when using the SSRs of
the restricted and unrestricted models. (With the R² values rounded to three
decimals as shown, the calculation gives about 42.5; the small difference is
due to rounding.)
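The same calculation in code (our own check); with the rounded R² values the result
is about 42.5, which equals the SSR-based value of 42.70 up to rounding:

# R-squared form of the F statistic, using the rounded R² values shown above.
F = ((0.300 - 0.186) / 2) / ((1 - 0.300) / 522)
print(round(F, 2))   # approx. 42.51; matches 42.70 once unrounded R² are used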
91 / 100
4.5 Testing Multiple Linear Restrictions: The F Test
Computing p-Values for F Tests
I For reporting the outcomes of F tests, p-values are especially useful.
[Why? Since the F distribution depends on the numerator and denominator df , it is difficult to get a
feel for how strong or weak the evidence is against H0 simply by looking at the value of the F
statistic and one or two critical values. As with t testing, once the p-value has been computed, the F
test can be carried out at any significance level. E.g. if the p-value = 0.024, we reject H0 at the 5%
significance level but not at the 1% level.]

I The p-value for an F test is defined as:

p-value = Pr(𝓕 > F)

where 𝓕 denotes an F-distributed random variable with (q, n − k − 1) degrees of
freedom, and F denotes the actual value of the test statistic we obtain
(a computational sketch follows below).

⇒ Same interpretation as before:


I The p-value is the probability of
finding a realization of F at least as
large as we did, given that the null
hypothesis is true.
I A small p-value is evidence against
H0 .
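In practice the p-value is computed by software; a small sketch with scipy, using the
F statistic and degrees of freedom from the wage example:

from scipy import stats

# p-value = Pr(F(2, 522) > 42.70): area to the right of the observed statistic.
p_value = stats.f.sf(42.70, dfn=2, dfd=522)
print(p_value)   # essentially zero, so H0 is rejected at any conventional level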
92 / 100
4.5 Testing Multiple Linear Restrictions: The F Test
The F Statistic for Overall Significance of a Regression
I The null hypothesis in this case states that none of the explanatory
variables has an effect on y :
H0 : β1 = β2 = ... = βk = 0
Another way of stating the null hypothesis is:
H0 : E (y |x1 , x2 , ..., xk ) = E (y )
[I.e. knowing the values of x1 , ..., xk does not affect the expected value of y .]
I The alternative hypothesis is that at least one βj is different from zero.
I Imposing the k restrictions in H0 , the restricted model is:
y = β0 + u
I As Rr² = 0, the F statistic is simply (denoting Rur² by R²):

F = [R²/k] / [(1 − R²)/(n − k − 1)]
This special form of the F statistic is valid for testing the joint exclusion of all the k
independent variables, or determining the overall significance of the regression.
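As an illustration (our own calculation, using the figures from column (3) of the
wage regression reported at the end of this topic: R² = 0.300, n = 526, k = 3):

# Overall significance: F = [R^2/k] / [(1 - R^2)/(n - k - 1)].
R2, n, k = 0.300, 526, 3
F = (R2 / k) / ((1 - R2) / (n - k - 1))
print(round(F, 1))   # approx. 74.6

Regression software usually reports this statistic (and its p-value) automatically;
in statsmodels, for example, it is available as the fvalue attribute of a fitted OLS
model.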
93 / 100
4.5 Testing Multiple Linear Restrictions: The F Test
The F Statistic for Overall Significance of a Regression

I If we fail to reject H0 , then there is no evidence that any of the


independent variables help to explain y .
I This usually means that we must look for other variables to explain y .

I Occasionally, the F statistic for the hypothesis that all independent


variables are jointly insignificant is the focus of a study:
I E.g. assume you want to test whether stock returns over a four-year horizon
are predictable based on information known only at the beginning of the
period.
I Under the efficient markets hypothesis, the returns should not be
predictable; this null hypothesis of no predictability is precisely the null
hypothesis for determining the overall significance of the regression.

94 / 100
4.5 Testing Multiple Linear Restrictions: The F Test
Testing General Linear Restrictions
I Testing exclusion restrictions is by far the most important application of F
statistics.
I Sometimes, however, the restrictions implied by a theory are more
complicated than just excluding some independent variables.
I In such cases, it is still straightforward to use the F statistic for testing.

Example (Housing prices - actual and assessed)


I Consider the following equation:

log (price) = β0 + β1 log (assess) + β2 log (lotsize) + β3 log (sqrft) + β4 bdrms + u

I Suppose you want to test whether the assessed house price (assess) is a rational
valuation. If it is, then a 1% change in assess should be associated with a
1% change in price, i.e. β1 = 1. In addition, lotsize, sqrft, and bdrms should not
help to explain log (price), once the assessed value has been controlled for.
I The unrestricted model is:
y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + u
95 / 100
4.5 Testing Multiple Linear Restrictions: The F Test
Testing General Linear Restrictions
Example (Housing prices - actual and assessed (continued))
I The various hypotheses together imply the null hypothesis:
H0 : β1 = 1, β2 = 0, β3 = 0, β4 = 0
i.e. 4 restrictions, but only 3 of them are exclusion restrictions.
I The restricted model is therefore:
y = β0 + x1 + u
I In order to impose the restrictions, we estimate the following model:
y − 1 · x1 = β0 + u
I We first compute a new dependent variable, y − x1 .
I And then we regress this on a constant only.
I We compute the F statistic as before, now with q = 4 restrictions and
n − k − 1 = n − 5 denominator degrees of freedom (a code sketch follows at the
end of this example):
F = [(SSRr − SSRur )/SSRur ] · [(n − 5)/4]
I Note: we cannot use the R 2 form of the F statistic here:
I Our regression for the restricted model now has a different dependent variable.
Thus, the total sum of squares will be different. We can no longer re-write the test
in the R 2 -form.
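A sketch of how this particular test could be carried out in Python (the file
hprice_data.csv and the data frame are hypothetical; the variable names follow the
example above):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

df = pd.read_csv("hprice_data.csv")                    # hypothetical data set

# Unrestricted model: log(price) on log(assess), log(lotsize), log(sqrft), bdrms.
ur = smf.ols("np.log(price) ~ np.log(assess) + np.log(lotsize)"
             " + np.log(sqrft) + bdrms", data=df).fit()

# Restricted model under H0: regress log(price) - log(assess) on a constant only.
df["diff"] = np.log(df["price"]) - np.log(df["assess"])
r = smf.ols("diff ~ 1", data=df).fit()

# F statistic with q = 4 restrictions and n - 5 denominator degrees of freedom.
q, df_den = 4, int(ur.df_resid)
F = ((r.ssr - ur.ssr) / q) / (ur.ssr / df_den)
print(F, stats.f.sf(F, dfn=q, dfd=df_den))             # F statistic and its p-value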
96 / 100
Outline

Introduction

4.1 Sampling Distributions of the OLS Estimator

4.2 Hypothesis Testing: The t Test

4.3 Confidence Intervals

4.4 Testing Hypotheses about a Single Linear Combination of the Parameters

4.5 Testing Multiple Linear Restrictions: The F Test

4.6 Reporting Regression Results

97 / 100
4.6 Reporting Regression Results

I We conclude by considering a few guidelines on how to report multiple


regression results for relatively complicated empirical projects.
I This helps you to read published works in the applied social sciences and
prepares you to write your own empirical papers.

I Which statistics are usually reported?


I the coefficient estimates (β̂)
I the corresponding standard errors (se(β̂))
I the number of observations
I the R 2 -measure

I How are the results usually reported?


I If we report a single equation, we can write the results in “equation form”.
I If we report different specifications, we can summarize the results in a table,
with one column per specification (see the example on the next slide and the
code sketch below).
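One way to build such a table in Python is statsmodels' summary_col helper; the
sketch below uses synthetic data purely for illustration (in practice m1–m3 would be
the three wage regressions):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col

# Synthetic stand-in data; column names are placeholders.
rng = np.random.default_rng(0)
df = pd.DataFrame({"educ": rng.normal(12, 2, 200), "exper": rng.normal(10, 5, 200)})
df["lwage"] = 0.1 * df["educ"] + 0.02 * df["exper"] + rng.normal(0, 0.5, 200)

m1 = smf.ols("lwage ~ educ", data=df).fit()
m2 = smf.ols("lwage ~ educ + exper", data=df).fit()
m3 = smf.ols("lwage ~ educ + exper + I(exper**2)", data=df).fit()

# One column per specification, standard errors in parentheses, stars for significance.
print(summary_col([m1, m2, m3], stars=True, float_format="%.3f",
                  info_dict={"Observations": lambda res: f"{int(res.nobs)}"}))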

98 / 100
4.6 Reporting Regression Results

Example (Returns to education)

                        (1)        (2)        (3)
Education             0.083      0.098      0.090
                    (0.008)    (0.008)    (0.007)
Experience                       0.010      0.041
                               (0.002)    (0.005)
Experience²                                -0.001
                                          (0.000)
Constant              0.584      0.217      0.128
                    (0.097)    (0.109)    (0.106)
Observations            526        526        526
R-squared             0.186      0.249      0.300
Note: Dependent variable is log(wage).

99 / 100
4.6 Reporting Regression Results
I We can use stars to indicate the result of a simple t-test on each coefficient.
I Possibly report additional relevant tests/statistics at bottom of table.

Example (Returns to education)

                        (1)        (2)        (3)
Education          0.083***   0.098***   0.090***
                    (0.008)    (0.008)    (0.007)
Experience                    0.010***   0.041***
                               (0.002)    (0.005)
Experience²                              -0.001***
                                          (0.000)
Constant           0.584***    0.217**      0.128
                    (0.097)    (0.109)    (0.106)
Observations            526        526        526
R-squared             0.186      0.249      0.300
Mean                  1.623      1.623      1.623
Standard deviation    0.532      0.532      0.532
Note: Dependent variable is log(wage).
** indicates significance at the 5% level, *** indicates significance at the 1% level.
100 / 100
