
OLS with multiple regressors


Joonhyung Lee
University of Memphis

Econ 7810/8810

Contents
1 Omitted Variable Bias 2
1.1 Alternative expression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 EXAMPLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Multiple Regressors 7
2.1 Matrix Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 OLS Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Collinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Properties of Multivariate OLS  . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Goodness of Fit (adjusted $R^2$, $\bar{R}^2$) . . . . . . . . . . . . . . . . . . . . . 11
2.6 Conditional Mean Independence . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.7 Selection of regressors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.7.1 Controlling for too many factors . . . . . . . . . . . . . . . . . . . . . . 13
2.7.2 Adding regressors to reduce the error variance . . . . . . . . . . . . . . 14
2.8 Data scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3 Tests in the Multivariate Regression 15


3.1 Single Coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 More Than One Coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.1 F-Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

4 Dummy Variables for Multiple Categories 17


4.1 Using Dummy Variables to Incorporate Ordinal Information . . . . . . . . . . 18

5 Stata 19

6 Homework 21


1 Omitted Variable Bias


• The methodology we’ve covered so far has (at least) one big limitation: there’s only one
RHS variable “explaining” Y .

• In the test score example, what if STR is picking up something besides just the student-teacher ratio? In other words, what if something else is driving test scores? It could be:

– percent of English learners, teacher quality, richer school, richer neighborhood, par-
ent’s education,...

• Why do we care? We’d like to establish a causal effect.

– We don't want STR getting credit (or blame) for the effect of something else.

• Worse, what if STR is significant only because other variables are correlated with both STR and TESTSCR?

• Definition: If a regressor is correlated with a variable that has been omitted from the
analysis but that determines (in part) the dependent variable, then the OLS estimator
will have omitted variable bias.

• Omitted variable bias (OVB) occurs when two conditions hold:

1. The omitted variable is correlated with the included regressor


2. The omitted variable is a determinant of the dependent variable

• Examples:

– Percentage of English learners: meets both conditions, so it needs to be included in the regression
– Time of day of the test: may meet 2, not 1
– Teachers' parking lot space per pupil: may meet 1, not 2
– Education and Wages
$$Wage = \beta_0 + \beta_1 Educ + u$$
∗ Omitting ability may overestimate the importance of schooling.

• Formally, omitted variable bias occurs when we don’t include in our regression all the
variables that are correlated with Y and one or more of the regressors (X’s).

• Let's see what happens when we omit a relevant variable from our analysis. Suppose the true model is:
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i \quad (1)$$
with $E[u_i \mid X_{1i}, X_{2i}] = 0$.

• So $\beta_1$ is the true slope of $X_{1i}$.


• It's useful to note that for any variables $P$ and $Q$:
$$\sum \left(P_i - \bar{P}\right)\left(Q_i - \bar{Q}\right) = \sum \left(P_i - \bar{P}\right) Q_i$$

• In particular, this means we can write the OLS estimator for $\beta_1$ as
$$\hat{\beta}_1 = \frac{\sum \left(X_{1i} - \bar{X}_1\right)\left(Y_i - \bar{Y}\right)}{\sum \left(X_{1i} - \bar{X}_1\right)^2} = \frac{\sum \left(X_{1i} - \bar{X}_1\right) Y_i}{\sum \left(X_{1i} - \bar{X}_1\right)^2}$$

• Now suppose we (incorrectly) assume
$$Y_i = \alpha_0 + \alpha_1 X_{1i} + v_i \quad (2)$$
where $v_i \equiv \beta_2 X_{2i} + u_i$. In most cases, we omit $X_2$ because we cannot collect data on it.

• If we estimate the slope ($\alpha_1$) using equation (2), we get
$$\hat{\alpha}_1 = \frac{\sum (X_{1i} - \bar{X}_1) Y_i}{\sum (X_{1i} - \bar{X}_1)^2},$$
but is $E(\hat{\alpha}_1) = \beta_1$?

• So what is $E(\hat{\alpha}_1)$?
$$E(\hat{\alpha}_1) = E\left[\frac{\sum (X_{1i} - \bar{X}_1) Y_i}{\sum (X_{1i} - \bar{X}_1)^2}\right] = E\left[\frac{\sum (X_{1i} - \bar{X}_1)(\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i)}{\sum (X_{1i} - \bar{X}_1)^2}\right]$$
$$= E\left[\frac{\beta_0 \sum (X_{1i} - \bar{X}_1)}{\sum (X_{1i} - \bar{X}_1)^2}\right] + E\left[\frac{\beta_1 \sum (X_{1i} - \bar{X}_1) X_{1i}}{\sum (X_{1i} - \bar{X}_1)^2}\right] + E\left[\frac{\beta_2 \sum (X_{1i} - \bar{X}_1) X_{2i}}{\sum (X_{1i} - \bar{X}_1)^2}\right] + E\left[\frac{\sum (X_{1i} - \bar{X}_1) u_i}{\sum (X_{1i} - \bar{X}_1)^2}\right]$$
$$= 0 + E\left[\frac{\beta_1 \sum (X_{1i} - \bar{X}_1)^2}{\sum (X_{1i} - \bar{X}_1)^2}\right] + \beta_2 E\left[\frac{\sum (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{\sum (X_{1i} - \bar{X}_1)^2}\right] + E\left[\frac{\sum (X_{1i} - \bar{X}_1) u_i}{\sum (X_{1i} - \bar{X}_1)^2}\right]$$
$$= \beta_1 + \beta_2 E\left[\frac{\sum (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{\sum (X_{1i} - \bar{X}_1)^2}\right] + 0 = \beta_1 + \beta_2 E\left[\frac{s_{X_1 X_2}}{s^2_{X_1}}\right],$$
where $s_{X_1 X_2}$ and $s^2_{X_1}$ are the sample covariance of $X_1$ and $X_2$ and the sample variance of $X_1$.

• So the second term will only equal 0 in the case where $X_{1i}$ and $X_{2i}$ are uncorrelated or $\beta_2 = 0$; that is, $\hat{\alpha}_1$ will be unbiased only if $Cov(X_1, X_2) = 0$ and/or $X_2$ is irrelevant (the two OVB conditions).

• Otherwise, $E(\hat{\alpha}_1) = \beta_1 + \beta_2 E\left[\frac{s_{X_1 X_2}}{s^2_{X_1}}\right]$.

• Thus, if $Cov(X_1, X_2) \neq 0$ and $\beta_2 \neq 0$ (the two conditions stated earlier), $\hat{\alpha}_1$ will give you a biased estimate of $X_{1i}$'s expected impact on $Y$.

• If we ignore the problem, we will reach very misleading conclusions.
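• To make the bias concrete, here is a minimal Stata simulation sketch (all variable names and parameter values here are invented for illustration). The short regression that omits x2 overstates the slope on x1, while the long regression recovers it:

    clear
    set seed 12345
    set obs 1000
    gen x2 = rnormal()                    // the relevant variable we will omit
    gen x1 = 0.5*x2 + rnormal()           // Corr(x1, x2) > 0 by construction
    gen y  = 1 + 2*x1 + 3*x2 + rnormal()  // true beta1 = 2, beta2 = 3
    reg y x1                              // short regression: slope near 3.2 (biased up)
    reg y x1 x2                           // long regression: slope near 2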


• Omitted variable bias means that the OLS Assumption 4 (E (vi | Xi ) = 0 in model (2))
is incorrect.

• Why? Consider again equation (1) and (2)

• Since $v_i = \beta_2 X_{2i} + u_i$, it is easy to show that in (2)
$$E(v_i \mid X_{1i}) = E(\beta_2 X_{2i} + u_i \mid X_{1i}) = \beta_2 E(X_{2i} \mid X_{1i}) + E(u_i \mid X_{1i}) = \beta_2 E(X_{2i} \mid X_{1i}),$$
which will not equal 0 in general.

1.1 Alternative expression


 
• $E(\hat{\alpha}_1) = \beta_1 + \beta_2 E\left[\frac{s_{X_1 X_2}}{s^2_{X_1}}\right] = \beta_1 + \beta_2 \tilde{\delta}_1$, where $\tilde{\delta}_1$ is the slope coefficient in the auxiliary regression of $x_{i2}$ on $x_{i1}$.

• Therefore,
$$Bias(\hat{\alpha}_1) = \beta_2 \tilde{\delta}_1$$

• If $\beta_2 \neq 0$ and $Corr(x_{i1}, x_{i2}) \neq 0$, then $\hat{\alpha}_1$ is generally biased. The direction follows from the signs:

Bias in the Simple Regression Estimator of $\alpha_1$

                    Corr(x1, x2) > 0    Corr(x1, x2) < 0
    β2 > 0          Positive bias       Negative bias
    β2 < 0          Negative bias       Positive bias
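• Continuing the simulated data from the sketch in Section 1 (hypothetical names x1, x2, y), the auxiliary regression delivers $\tilde{\delta}_1$ directly, and $\beta_2 \tilde{\delta}_1$ matches the gap between the short- and long-regression slopes:

    reg x2 x1          // auxiliary regression: the slope is delta1 (about 0.4 in that design)
    display 3*_b[x1]   // beta2 * delta1: the bias, about 1.2
    * so the short-regression slope on x1 should be near 2 + 1.2 = 3.2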

• Determining the bias is more difficult in situations with more than two explanatory variables. The following simplification often works. Suppose
$$lwage = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 abil + u$$
and we omit ability. So we regress lwage on educ and exper. Call these estimators $\tilde{\beta}_j$, $j = 0, 1, 2$.

• Even if exper is uncorrelated with abil, $\tilde{\beta}_2$ is generally biased, just as $\tilde{\beta}_1$ is. This is because educ and exper are usually correlated.

• In the general case, it should be remembered that correlation of any $x_j$ with an omitted variable generally causes bias in all of the OLS estimators, not just in $\tilde{\beta}_j$. (Exception: conditional mean independence.)

1.2 EXAMPLES
• Wage equation with omitted ability bias:
$$lwage = \beta_0 + \beta_1 educ + \beta_2 abil + u$$
where abil is "ability." Essentially by definition, $\beta_2 > 0$.


– We also think
Corr(educ, abil) > 0
so that higher ability people get more education, on average.
– In this scenario,
$$E(\hat{\alpha}_1) > \beta_1,$$
so there is an upward bias in simple regression. Our failure to control for ability
leads to (on average) overestimating the return to education. We attribute some
of the effect of ability to education because ability and education are positively
correlated.

• Effects of a Tutoring Program on Student Performance
$$GPA = \beta_0 + \beta_1 tutor + \beta_2 abil + u$$
where tutor is hours spent in tutoring. Again, $\beta_2 > 0$.

– Suppose that students with lower ability tend to use more tutoring:

Corr(tutor, abil) < 0

– Then
$$E(\hat{\alpha}_1) = \beta_1 + \beta_2 \tilde{\delta}_1 = \beta_1 + (+)(-) < \beta_1,$$
so our failure to account for ability leads us to underestimate the effect of tutoring.
– In fact, it could happen that $\beta_1 > 0$ but $E(\hat{\alpha}_1) \le 0$, so we tend to find no effect or even a negative effect.
– (Note) If it turns out that $\hat{\alpha}_1 > 0$ while $\beta_2 \tilde{\delta}_1 < 0$, we know for sure that $\beta_1 > 0$, and vice versa.

• Final Exam Score and Missed Classes (ATTEND.DTA)

– Sample of 680 students in introductory microeconomics. Attendance was recorded


electronically, monitored by teaching assistants.
– We first regress final (out of 40 points) on missed (lectures missed out of 32) in a simple regression analysis. Then we add priGPA (prior GPA) as a control for student quality.

$$\widehat{final} = 26.60 - 0.121\, missed, \quad n = 680$$

– So the simple regression estimate implies that, say, 10 more missed classes reduces the predicted score by about 1.2 points (out of 40, where final has a mean of 25.9 and sd = 4.7).


– Now we control for prior GPA:

$$\widehat{final} = 17.42 + 0.017\, missed + 3.24\, priGPA, \quad n = 680$$

– The coefficient on missed actually becomes positive, but it is very small.


– The coefficient on priGPA means that one more point of prior GPA (for example, from 2.5 to 3.5) predicts a final exam score that is 3.24 points higher.
– Note that missed and priGPA are negatively correlated: $Corr(missed_i, priGPA_i) = -0.427$. In other words, on average, students with higher previous GPAs miss fewer lectures.

• WAGE2.DTA

– The estimated return to a year of education falls from about 5.9% to about 3.9%
when we control for differences in IQ.
– To interpret the multiple regression equation, we do this thought experiment: Take
two people, A and B, with the same IQ score. Suppose person B has one more year
of schooling than person A. Then we predict B to have a wage that is 3.9% higher.
– Simple regression does not allow us to compare people with the same IQ score. The
larger estimated return from simple regression is because we are attributing part of
the IQ effect to education.
– Not surprisingly, there is a nontrivial, positive correlation between educ and IQ: $Corr(educ_i, IQ_i) = .573$.
– Multiple regression "partials out" the other independent variables when looking at the effect of, say, educ. One can show that $\hat{\beta}_1$ measures the effect of educ on lwage once the correlation between educ and IQ has been partialled out.
– reg educ IQ, r
– predict vhat, residual
– reg lwage vhat, r
– reg lwage educ IQ, r
– Compare the coefficient on vhat with the coefficient on educ: they are identical, as the annotated sketch below shows.
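• An annotated version of these steps (a sketch assuming the WAGE2.DTA variables lwage, educ, and IQ):

    use WAGE2.DTA, clear
    reg educ IQ, r            // first stage: educ on IQ
    predict vhat, residual    // vhat = the part of educ uncorrelated with IQ
    reg lwage vhat, r         // the slope on vhat ...
    reg lwage educ IQ, r      // ... equals the educ slope here (partialling out)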

1.3 Consistency
• Furthermore, when you have OVB, OLS is not just biased, it is also inconsistent. Let’s
see why.

• From the previous proof, we see that
$$\hat{\alpha}_1 = \beta_1 + \beta_2 \frac{\frac{1}{n}\sum (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{\frac{1}{n}\sum (X_{1i} - \bar{X}_1)^2} + \frac{\frac{1}{n}\sum (X_{1i} - \bar{X}_1)(u_i - \bar{u})}{\frac{1}{n}\sum (X_{1i} - \bar{X}_1)^2} = \beta_1 + \beta_2 \frac{s_{X_1 X_2}}{s^2_{X_1}} + \frac{\tilde{s}_{X_1 u}}{s^2_{X_1}}$$


• We know $s_{X_1 X_2} \xrightarrow{p} \sigma_{X_1 X_2}$, $s^2_{X_1} \xrightarrow{p} \sigma^2_{X_1}$, and $\tilde{s}_{X_1 u} \xrightarrow{p} \sigma_{X_1 u} = 0$.

• Therefore:
$$\hat{\alpha}_1 \xrightarrow{p} \beta_1 + \beta_2 \frac{\sigma_{X_1 X_2}}{\sigma^2_{X_1}} \neq \beta_1 \quad \text{unless } \sigma_{X_1 X_2} = 0 \text{ (or } \beta_2 = 0)$$

• The same point holds more generally: whenever one (or more) variables have been omitted from the regression, X is likely to be correlated with the error term u, so OLS is inconsistent as well as biased.

2 Multiple Regressors
• The solution to the omitted variable bias problem is to add (if you can) the other relevant
variables to the regression.

• Examples:
$$Wage = \beta_0 + \beta_1 Educ + \beta_2 Ability + u$$
$$TestScr = \beta_0 + \beta_1 STR + \beta_2 ElPct + u$$

• Of course, the interpretation of the coefficients changes.


• For example, $\hat{\beta}_1$ is now the expected change in TestScr associated with a unit change in STR, holding the percent of English learners constant.

– Now we are estimating the pure impact of STR on TestScr, controlling for this other variable.
– Another phrase used to describe $\hat{\beta}_1$: the partial effect on Y of $X_1$, holding $X_2$ fixed (the ceteris paribus effect).

• In the multiple regressor model, we assume that the population regression line (the relationship that holds between Y and the X's on average) is given by
$$E(Y_i \mid X_{1i}, \ldots, X_{ki}) = \beta_0 + \beta_1 X_{1i} + \ldots + \beta_k X_{ki}$$

• The concepts of heteroskedasticity and homoskedasticity also carry over to the multiple regressor model.

2.1 Matrix Notation


• The formulas are most easily written using matrix algebra, but the intuition is the same as in the univariate case:
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad U = \begin{pmatrix} U_1 \\ U_2 \\ \vdots \\ U_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & X_{11} & \cdots & X_{k1} \\ 1 & X_{12} & \cdots & X_{k2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{1n} & \cdots & X_{kn} \end{pmatrix} = \begin{pmatrix} X_1' \\ X_2' \\ \vdots \\ X_n' \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}$$
• So, Y is n × 1, X is n × (k + 1), U is n × 1, and β is (k + 1) × 1

• In this notation, Y is the n × 1 dimensional vector of n observations on the dependent variable.

• X is the n×(k+1) dimensional matrix of n observations on the k+1 regressors (including


the constant regressor for the intercept).

• U is the n × 1 dimensional vector of the n error terms.

• β is the (k + 1) × 1 dimensional vector of the k + 1 unknown regression coefficients.

• As in the univariate case, β 0 is the intercept and β k is the slope coefficient of Xk .

• β 0 (the intercept) is the expected value of Yi when all the regressors (Xki ’s) are zero.

• β 1 (the slope coefficient of X1 ) is the effect on Y (the expected change in Y ) of a unit


change in X1 , holding all other variables constant (or “controlling for all other variables”).

• β 1 may also be described as the partial effect on Y of X1 , holding all other variables
constant.

• The population regression is then given by

$$Y = X\beta + U,$$
where (by definition) $U = Y - E(Y \mid X)$.


• OLS minimizes the sum of these squared errors, $U'U$, yielding explicit formulas for the estimators $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$:
$$\min_{\beta}\, (Y - X\beta)'(Y - X\beta) \;\Rightarrow\; -2X'(Y - X\hat{\beta}) = 0 \;\Rightarrow\; \hat{\beta} = (X'X)^{-1}(X'Y)$$

2.2 OLS Assumptions


• As in the univariate case, in order to have unbiasedness, consistency and asymptotic
normality, we must make some assumptions.

• Recall that the population regression model is

Yi = β 0 + β 1 X1i + ... + β k Xki + ui

1. Linear in parameters.
2. $(X_i, Y_i)$, $i = 1, \ldots, n$, are a simple random sample.
3. X has full column rank (no perfect collinearity)
4. E (ui | X1i , ..., Xki ) = E (ui | Xi ) = 0


(a) This is the most important assumption


(b) If u is correlated with any of the xi , this assumption is violated. This is usually
a good way to think about the problem.
(c) EXAMPLE: Effects of Class Size on Student Performance
i. Suppose, for a standardized test score,

score = β 0 + β 1 classize + β 2 income + u

ii. Even at the same income level, families differ in their interest and concern
about their children’s education. Family support and student motivation
are in u. Are these correlated with class size even though we have included
income? Probably.
(d) When this assumption holds, we say x1 , ..., xk are exogenous explanatory
variables. If xj is correlated with u, we often say xj is an endogenous ex-
planatory variable.
5. Homoskedasticity: $Var(u_i \mid X_i) = \sigma^2$

2.3 Collinearity
• Assumption 3 is new. Why is it important?

• Suppose $X_{2i} = a + bX_{1i}$.

• Then $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i = (\beta_0 + \beta_2 a) + (\beta_1 + \beta_2 b) X_{1i} + u_i$.

• This is equivalent to the univariate regression, which we were able to estimate because
the FOCs gave us two equations and two unknowns.

• Now there are three unknowns, but still only two equations, so we can’t identify the
parameters

• This is not usually a problem in practice, since if you include perfectly collinear variables,
your software will either give you back an error message, or drop as many regressors as
necessary to make the remaining variables non-collinear.
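• For instance, the following sketch (with an artificially constructed regressor) shows what Stata does when one regressor is an exact linear function of another:

    sysuse auto, clear
    gen weight2 = 2*weight + 5     // perfectly collinear with weight
    reg price weight weight2       // Stata notes the collinearity and omits weight2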

• But what if two or more of the regressors are highly (but not perfectly) collinear?

• In other words, what if there’s a linear function of the regressors that is highly correlated
with another regressor?

– Example: Macro variables like wage rate, per capita GDP, etc.

• This situation, called imperfect multicollinearity, does not pose any problem for the
theory of the OLS estimators.

• In fact, a purpose of OLS is to sort out the independent effects of the various regressors
when they are potentially correlated.


• However, with a high degree of imperfect multicollinearity among the regressors, $\sigma_{\hat{\beta}_j}$ will tend to be high, which can make it difficult to get precise estimates of the separate effects.

• Why? Recall that a coefficient is an estimate of the partial effect of one regressor, holding
the other ones constant. If the regressors tend to move together, this effect will be hard
to estimate precisely.

2.4 Properties of Multivariate OLS


• Assumptions A1-A4 are sufficient to prove that OLS is an unbiased and consistent esti-
mator.

• They are also sufficient to prove that it is asymptotically normal.

• In this case, this means that for each coefficient
$$\hat{\beta}_j \overset{a}{\sim} N\left(\beta_j, var(\hat{\beta}_j)\right)$$

• $var(\hat{\beta}) = E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)'] = E[(X'X)^{-1} X'UU'X (X'X)^{-1}] = \sigma^2 (X'X)^{-1} X' \Omega X (X'X)^{-1}$, where $\sigma^2 \Omega = E(UU')$ is the $n \times n$ variance-covariance matrix of the errors (so $\Omega = I$ under homoskedasticity and no serial correlation).

• All the estimated coefficients are jointly normal.¹

• If one further assumes no serial correlation between errors and homoskedasticity (i.e., the iid assumption), then $var(\hat{\beta}) = \sigma^2 (X'X)^{-1}$. In that case OLS is BLUE (the Gauss-Markov Theorem).

• Simple proof of the Gauss-Markov Theorem:

– Suppose there exists another linear estimator that is better than the OLS estimator, of the form
$$\hat{\beta}^* = (D + (X'X)^{-1} X') Y$$
For this estimator to compete with OLS, it needs to be unbiased as well ($E(\hat{\beta}^*) = \beta$):
$$\hat{\beta}^* = (D + (X'X)^{-1} X')(X\beta + e) = DX\beta + \beta + De + (X'X)^{-1} X'e$$
Taking the expectation,
$$E(DX\beta + \beta + De + (X'X)^{-1} X'e) = DX\beta + \beta$$
In order to have unbiasedness, we need $DX = 0$. Hence,
$$\hat{\beta}^* = \beta + De + (X'X)^{-1} X'e$$

¹ Recall that if a group of random variables is distributed joint normal, then each of them is normally distributed as well.


Now, let's compare the variance of this new estimator with that of the OLS estimator:
$$var(\hat{\beta}^*) = E[(\hat{\beta}^* - \beta)(\hat{\beta}^* - \beta)']$$
$$= E[(De + (X'X)^{-1} X'e)(e'D' + e'X(X'X)^{-1})]$$
$$= E[Dee'D' + Dee'X(X'X)^{-1} + (X'X)^{-1} X'ee'D' + (X'X)^{-1} X'ee'X(X'X)^{-1}]$$
$$= \sigma^2 DD' + \sigma^2 DX(X'X)^{-1} + \sigma^2 (X'X)^{-1} X'D' + \sigma^2 (X'X)^{-1}$$
$$= \sigma^2 DD' + \sigma^2 (X'X)^{-1} \quad \text{(the middle terms vanish because } DX = 0)$$
We know
$$var(\hat{\beta}) = \sigma^2 (X'X)^{-1}.$$
Since $DD'$ is positive semidefinite, in the end $var(\hat{\beta}) \le var(\hat{\beta}^*)$.
2.5 Goodness of Fit (adjusted $R^2$, $\bar{R}^2$)

• As before, the regression $R^2$ is the fraction of the sample variance of $Y_i$ explained by the regressors:
$$R^2 = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$

• However, with multiple regressors, the $R^2$ increases whenever a regressor is added.

• So one might be tempted to keep adding regressors to inflate $R^2$.

• One way to adjust for this is to deflate the $R^2$ by some factor, which is what the adjusted $R^2$ (written $\bar{R}^2$) does:
$$\bar{R}^2 = 1 - \frac{n-1}{n-k-1} \cdot \frac{RSS}{TSS} = 1 - \frac{s^2_{\hat{u}}}{s^2_Y}$$

• Notice that (the short Stata check below illustrates points 1 and 2):
1. $\bar{R}^2$ is always less than $R^2$: $\frac{n-1}{n-k-1} > 1 \implies \bar{R}^2 < R^2$
2. Adding a regressor has two effects on $\bar{R}^2$: (i) RSS falls, which increases $\bar{R}^2$; (ii) $\frac{n-1}{n-k-1}$ increases, which decreases it. So the total effect on $\bar{R}^2$ depends on which effect is bigger.
3. $\bar{R}^2$ can be negative.
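• A minimal sketch using the bundled auto dataset (the noise regressor is invented): regress stores both measures in e(r2) and e(r2_a), so adding a junk variable shows $R^2$ creeping up while $\bar{R}^2$ can fall:

    sysuse auto, clear
    set seed 42
    reg price mpg weight
    display "R2 = " e(r2) "   adjusted R2 = " e(r2_a)
    gen noise = runiform()          // irrelevant regressor
    reg price mpg weight noise
    display "R2 = " e(r2) "   adjusted R2 = " e(r2_a)   // R2 up, adjusted R2 may fall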
• Some caveats about using $R^2$ and $\bar{R}^2$ in practice:

– An increase in $R^2$ or $\bar{R}^2$ does not mean that an added variable is statistically significant.
– $R^2$ always increases when we add regressors. $\bar{R}^2$ might not, but even if it does, that does not mean the added variable is statistically significant. You need a hypothesis test to establish this.
– A high $R^2$ or $\bar{R}^2$ does not mean that the regressors are a true cause of the dependent variable.
– A high $R^2$ or $\bar{R}^2$ does not mean that there is no omitted variable bias.
– A high $R^2$ or $\bar{R}^2$ does not necessarily mean that you have the most appropriate set of regressors, nor does a low $R^2$ or $\bar{R}^2$ necessarily mean that you have a bad set of regressors.

2.6 Conditional Mean Independence


• Control variables are not the object of interest in the study; rather, they are regressors included to hold constant factors that, if neglected, could lead the estimated causal effect of interest to suffer from omitted variable bias.

• The conditional mean independence assumption (an alternative to the zero conditional mean assumption) guarantees that the OLS estimator of the effect of interest is unbiased, but the OLS coefficients on the control variables are in general biased and do not have a causal interpretation. That is, the coefficient on $X_{1i}$ has a causal interpretation but the coefficient on $X_{2i}$ does not.

• In most cases, the reason why we need other controls is to have this conditional mean
independence property.

• This assumption implies that, once we control for the other variables, we can argue that the variable of interest is as good as exogenous (random). For example, the mean of the error term does not depend on the student-teacher ratio, given the English-learner percentage.

• The question is: is class size "as if" randomly assigned among schools with the same English-learner percentage?

• Mathematically, the conditional expectation of $u_i$ given $X_{1i}$ and $X_{2i}$ does not depend on (is independent of) $X_{1i}$, although it can depend on $X_{2i}$. Once $X_{2i}$ is controlled for, $X_{1i}$ becomes independent of $u_i$:
$$E(u_i \mid X_{1i}, X_{2i}) = E(u_i \mid X_{2i})$$


Then,
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$$
$$= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + E(u_i \mid X_{1i}, X_{2i}) + v_i, \quad \text{where } v_i = u_i - E(u_i \mid X_{1i}, X_{2i})$$
$$= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + E(u_i \mid X_{2i}) + v_i$$
$$= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \gamma_0 + \gamma_2 X_{2i} + v_i \quad \text{(using the linearity assumption)}$$
$$= (\beta_0 + \gamma_0) + \beta_1 X_{1i} + (\beta_2 + \gamma_2) X_{2i} + v_i = \delta_0 + \beta_1 X_{1i} + \delta_2 X_{2i} + v_i$$

• Example: consider an experiment to study the effect on grades in an econometrics class of mandatory vs. optional homework. The instructor assigns 50% of students to the mandatory group and the other 50% to the optional group. The problem is that the distribution of economics and non-economics majors differs across the groups, i.e., the mandatory group might end up having more economics majors. For example, among economics majors ($X_{2i} = 1$), 75% are assigned to the treatment group (mandatory homework: $X_{1i} = 1$), while among non-economics majors ($X_{2i} = 0$), only 25% are assigned to the treatment group. Given the assumption that economics majors are better at econometrics, the test score of the mandatory group is higher. So, if one does not control for major, the score differences come not from mandatory homework but from major. Treatment is random within majors and within non-majors, but not random after pooling.

• This means
$$E(u_i \mid X_{1i}, X_{2i}) \neq 0 \quad \text{but} \quad E(u_i \mid X_{1i}, X_{2i}) = E(u_i \mid X_{2i}),$$
because some other factors, such as math skill, may determine both major and econometrics scores, i.e., the coefficient on $X_{2i}$ will have OVB.

• Including majors in the regression eliminates this omitted variable bias (treatment is
random given major), making the OLS estimator on X1i an unbiased estimator of the
causal effect.

2.7 Selection of regressors


2.7.1 Controlling for too many factors
• We need to remember the ceteris paribus interpretation of regression. Sometimes it does not make sense to hold other factors fixed when studying the effect of a particular variable; in some cases, certain variables should not be held fixed. For example, in a regression of traffic fatalities on state beer taxes (and other factors), controlling for beer consumption rules out the key channel through which the beer tax affects fatalities. As another example, in a regression of GPA on alcohol consumption, controlling for attendance rules out the possibility of attendance affecting GPA. On the other hand, if attendance is excluded, the estimate on alcohol consumption will be the combined direct (alcohol to GPA) and indirect (alcohol to GPA through attendance) impact.

• Sometimes, the issue of whether or not to control for certain factors is not clear-cut. If other X variables could be the outcome or the cause of the variable of interest, you should be careful about the interpretation. In practice, you should include the variables that the literature has included.

• In our example of str on testscr, one might want to control for total spending per student.
The problem is that part of spending goes toward lowering str. Once we hold spending
fixed, the role for str is limited (evidently).
• It is tempting to over-control because $\bar{R}^2$ often increases substantially.

• EXAMPLE: Effects of spending on math pass rates. (MEAP93.DTA)


– If we want the effect of total spending per student (the idea being that we let districts decide how to allocate resources), the estimated effect is nontrivial. Spending is in logs, so
$$\widehat{\Delta math4} = (6.22/100)\, \%\Delta spend \approx .062\, (\%\Delta spend),$$
so, if spending increases by 1%, holding lunch fixed, $\widehat{math4}$ increases by about 0.062 percentage points.
– Now suppose we add the log of the average teacher salary, lsalary.
– The coefficient on lspend is not statistically different from zero. Does spending no
longer matter?
– The problem is that part of spending goes toward increasing salaries. Once we hold
those fixed, the role for spending is limited (evidently). We seem to be finding
that spending other than to increase salary and lunch program has no effect on
performance. But this does not mean total spending has no effect.

2.7.2 Adding regressors to reduce the error variance


• Trade-off: adding a new independent variable to a regression can exacerbate the multi-
collinearity problem. On the other hand, since we are taking something out of the error
term, adding a variable generally reduces the error variance. Generally, we cannot know
which effect will dominate.

• However, there is one clear case: we should always include independent variables that affect y and are uncorrelated with the variables of interest. Adding such a variable does not induce multicollinearity, and it reduces the error variance, so the standard errors of all the OLS estimators will be reduced. However, such uncorrelated variables may be hard to find.

• Note that the issue here is not unbiasedness, because the included variable is not correlated with the error; the issue is getting an estimator with a smaller sampling variance, i.e., more precise estimates.

2.8 Data scaling


• If the X variables are measured in different units (e.g., dollars, years, ratios), one may wonder what a coefficient really means, and how one might compare the strength of the coefficients.

• To address this problem, one can report beta coefficients:
$$\hat{b}_j = (\hat{\sigma}_j / \hat{\sigma}_y)\, \hat{\beta}_j$$

• In Stata, one can add the beta option to the regress command, which reports the standardized regression coefficients. Beta coefficients are used to compare the relative strength of the various predictors within the model: because they are all measured in standard deviations, instead of the units of the variables, they can be compared to one another. In other words, the beta coefficients are the coefficients you would obtain if the outcome and predictor variables were all transformed to standard scores, also called z-scores, before running the regression.

• (ex) reg testscr str el_pct meal_pct lnavginc, robust beta

• str has a beta coefficient of −0.072, and the English-learner percentage one of −0.168. Thus, a one standard deviation increase in str leads to a 0.072 standard deviation decrease in predicted testscr, with the other variables held constant. A one standard deviation increase in the English-learner percentage, in turn, leads to a 0.168 standard deviation decrease in predicted testscr, with the other variables in the model held constant.
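• Equivalently, one can standardize by hand; a sketch assuming the course test-score data (variables testscr, str, el_pct) are in memory:

    egen z_testscr = std(testscr)
    egen z_str     = std(str)
    egen z_el      = std(el_pct)
    reg z_testscr z_str z_el       // these slopes equal the beta coefficients from
                                   // reg testscr str el_pct, beta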

3 Tests in the Multivariate Regression


• Since each coefficient is asymptotically normal, tests and CIs related to one coefficient
proceed just as before.

3.1 Single Coefficient


• To test the hypothesis $H_0: \beta_j = \beta_{j,0}$ against the alternative $H_A: \beta_j \neq \beta_{j,0}$:

– Compute the standard error of $\hat{\beta}_j$, $SE(\hat{\beta}_j)$
– Compute the t-statistic
$$t = \frac{\hat{\beta}_j - \beta_{j,0}}{SE(\hat{\beta}_j)}$$
– Compute the p-value
$$p\text{-value} = 2\Phi\left(-|t^{act}|\right)$$

• When the sample size is large, a 95% confidence interval for $\beta_j$ can be constructed as $\hat{\beta}_j \pm 1.96 \cdot SE(\hat{\beta}_j)$, or:
$$\left(\hat{\beta}_j - 1.96 \cdot SE(\hat{\beta}_j),\; \hat{\beta}_j + 1.96 \cdot SE(\hat{\beta}_j)\right)$$
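• In Stata these pieces are available after any regression; a sketch using the test-score example (variable names as in Section 5):

    reg testscr str el_pct, robust
    test str = 0                             // Wald test of H0: beta_str = 0
    display _b[str]/_se[str]                 // the t-statistic by hand
    display _b[str] - 1.96*_se[str] "  " _b[str] + 1.96*_se[str]  // approximate 95% CI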

3.2 More Than One Coefficient


• But what if we want to test a more complicated H0 ?

• For example, what if the null is related to more than one coefficient?

– Example 1: H0 : β 1 = β 2
– Example 2: H0 : β 1 + β 2 = 1
– Example 3: H0 : β 1 = 0 & β 2 = 0

• Suppose we want to regress test score on the student-teacher ratio, expenditure per student, and the English-learner percentage. (reg testscr str expn_stu el_pct, robust)


• We want to test the null hypothesis that both the coefficient on STR and the coefficient on EXPN_STU are zero.

• This is a joint hypothesis since we are imposing two restrictions on the regression model
(β 1 = 0 and β 2 = 0)

• Can we just use the two t-statistics to construct two t-tests?

• $H_0: \beta_1 = 0$ and $\beta_2 = 0$

– Compare $t_{\beta_1}$ and $t_{\beta_2}$ to the 5% critical value (1.96) and reject if either one is bigger? No.

• Why not? Even when you cannot reject either individual null, the combined effect can be greater than zero. With two separate t-tests you are likely not to reject the null even though the null is wrong.

3.2.1 F-Statistic
• The F-statistic exploits the fact that the t-statistics of individual coefficients are normally distributed.

• Recall:

– $\chi^2_m$ is the sum of $m$ squared independent standard normals.
– $F_{n,m} = \dfrac{\chi^2_n / n}{\chi^2_m / m}$ (where the $\chi^2$ are independent).
– $F_{n,\infty} = \chi^2_n / n$ (the average of $n$ squared normals).

• (ex) The F-statistic can be used to test complicated hypotheses like
$$H_0: \beta_1 = \beta_2 \text{ and } \beta_3 = 2\beta_1,$$
which has 2 restrictions:
$$\begin{pmatrix} 1 & -1 & 0 \\ -2 & 0 & 1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

• $H_0: R\beta = r$ vs. $H_1: R\beta \neq r$

– $var(R\hat{\beta}) = R \cdot var(\hat{\beta}) \cdot R' = \sigma^2 R (X'X)^{-1} X' \Omega X (X'X)^{-1} R'$
– $(R\hat{\beta} - r)' \left[\sigma^2 R (X'X)^{-1} X' \Omega X (X'X)^{-1} R'\right]^{-1} (R\hat{\beta} - r) \sim \chi^2(q)$ (note: if $w \sim N(0, \Omega)$, then $w' \Omega^{-1} w \sim \chi^2$)
– $(n-k)\hat{\sigma}^2 / \sigma^2 \sim \chi^2(n-k)$ (note $\hat{u}_i \sim N(0, \sigma^2)$, so that $\frac{\sum (\hat{u}_i - 0)^2}{\sigma^2} \sim \chi^2(n-k)$ and $\frac{1}{n-k} \sum \hat{u}_i^2 = \hat{\sigma}^2$)
– $F_{q,n-k} = \dfrac{\left[(R\hat{\beta} - r)' \left[\sigma^2 R (X'X)^{-1} X' \Omega X (X'X)^{-1} R'\right]^{-1} (R\hat{\beta} - r)\right]/q}{\left[(n-k)\hat{\sigma}^2 / \sigma^2\right]/(n-k)} = \dfrac{(R\hat{\beta} - r)' \left[R (X'X)^{-1} X' \Omega X (X'X)^{-1} R'\right]^{-1} (R\hat{\beta} - r)}{q\, \hat{\sigma}^2}$


• The formula for the F-statistic with q restrictions is quite complicated. However, the
good news is the F-Statistic is automatically computed by statistical packages (like Stata)
with simple commands.

• The F-Statistic can be used to test any linear restriction or set of linear restrictions.

• (example)

– reg math lexpend lsalary lnchprg, r
– test (_b[lexpend] = _b[lsalary]) (_b[lnchprg] = 2*_b[lexpend])
– If one is specifically interested in $\beta_3 - 2\beta_1$ and its significance: lincom _b[lnchprg] - 2*_b[lexpend]

4 Dummy Variables for Multiple Categories


• Suppose in the wage example we have two qualitative variables, gender and marital status. Call these female and married.

• We can define four exhaustive and mutually exclusive groups: married males (marrmale), married females (marrfem), single males (singmale), and single females (singfem).

• Note that we can define each of these dummy variables in terms of female and married:
$$marrmale = married \cdot (1 - female)$$
$$marrfem = married \cdot female$$
$$singmale = (1 - married) \cdot (1 - female)$$
$$singfem = (1 - married) \cdot female$$

• We can allow each of the four groups to have a different intercept by choosing a base group and then including dummies for the other three groups.

• So, if we choose single males as the base group, we include marrmale, marrfem, and singfem in the regression, as sketched below. The coefficients on these variables are measured relative to single men.

• With lwage as the dependent variable, we can give them a percentage change interpre-
tation.

• Use WAGE1.DTA. Control for education, experience (as a quadratic), and tenure (as a
quadratic).
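• A sketch of this specification, assuming the WAGE1.DTA variable names used here (with expersq and tenursq the squared terms, already in the file or generated first):

    use WAGE1.DTA, clear
    gen marrmale = married*(1 - female)
    gen marrfem  = married*female
    gen singfem  = (1 - married)*female   // single males are the base group
    reg lwage marrmale marrfem singfem educ exper expersq tenure tenursq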

• Using the usual approximation based on differences in logarithms – and holding fixed
education, experience, and tenure – a married man is estimated to earn about 21.3%
more than a single man. Remember, this compares two men with the same level of
schooling, general workforce experience, and tenure with the current employer.


• This marriage premium for men has long been noted by labor economists. Does marriage
make men more productive? Is being married a signal to employers (say, of stability and
reliability)? Is there a selection issue in that more productive men are likely to be
married, on average? The regression cannot tell us which explanation is correct.

• A married woman, at given levels of the other variables, earns about 19.8% less than a single man. A single woman earns about 11.0% less than a comparable single man. (Stata reports $t_{singfem} = -1.93$ with p-value = .054, so significant at the 10% level but just short of the 5% level.)

• What if we want to compare married women and single women? Just plug in the correct set of zeros and ones:

Intercept for married women = .321 − .198
Intercept for single women = .321 − .110
Difference = −.198 − (−.110) = −.088

so married women earn about 8.8% less than single women (controlling for other factors).

• We cannot tell from the previous output whether this difference is statistically significant. Note how the intercept for single men gets differenced away.

• Two approaches: (1) use the lincom command in Stata; (2) choose, say, married females as the base group and reestimate the model (including the dummies marrmale, singmale, and singfem).

• reg lwage i.female#i.married educ exper tenure expersq tenursq

• lincom _b[1.female#0.married] - _b[1.female#1.married]

• The t statistic for the estimated difference −.088 is −1.68, which is significant at the 10% level (but not at much lower levels).

• Unlike for men, the marriage "premium" for women is either nonexistent or in fact negative. (8.8% is not a small economic effect; it is a bit more than the return to one year of schooling.)

4.1 Using Dummy Variables to Incorporate Ordinal Information


• The data set BEAUTY.DTA includes a ranking of physical attractiveness of each man
or woman, on a scale of 1 to 5, with 5 being “strikingly beautiful or handsome.” This is a
subset of the data used in Hamermesh and Biddle (1994, American Economic Review ).

• As we move up the scale from 1 to 5, why should a one-unit increase mean the same
amount of “beauty”?

• The “looks” variable is what we call an ordinal variable: we know that the order of
outcomes conveys information (5 is better than 4, and 2 is better than 1) but we do not
know that the difference between 5 and 4 is the same as 2 and 1.


• In fact, very few people are at the extreme values 1 and 5 (less than 1% each). It makes sense to combine them into three categories: below average (belavg), average, and above average (abvavg).

• Take average as the base group.

• 12.3% of people are “below average,” 30.4% are “above average,” and everyone else
(57.3%) has looks = 3 (labeled “average”).

• With “average” as the base group, we include belavg and abvavg in a regression:
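• A sketch of that regression, assuming BEAUTY.DTA contains looks and lwage (skip the gen lines if the file already ships with belavg and abvavg):

    use BEAUTY.DTA, clear
    gen belavg = (looks <= 2)      // looks of 1 or 2
    gen abvavg = (looks >= 4)      // looks of 4 or 5
    reg lwage belavg abvavg        // looks == 3 ("average") is the base group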

• Controlling for no other factors, those with below average looks earn about 20.9% less
than those with average looks. The t statistic is very significant (p-value is zero to three
decimal places).

• Those with above average looks are estimated to earn about 4.5% less than those with
average looks, but the p-value is .228. So there is little evidence the effect is different
from zero.

• Now control for some other factors, including gender and education.

• The effect of having below average looks is now about 15% lower salary (on average).
Above average looks is still statistically insignificant and gets smaller in magnitude.

• Good practice to look at all coefficients to see if the signs and magnitudes make sense.
They do, although the premium for males is very large.

• Putting in the variable looks itself instead means that better looks always has to have a positive, linear effect. It is not as significant and fits slightly less well (using the adjusted $R^2$).

• One shortcoming in the previous analysis is that it ignores occupation. Maybe we should
allow people to sort into occupation (perhaps partly based on looks) and see if there is a
“looks premium” in a given occupation. Biddle and Hamermesh (1998, Journal of Labor
Economics) study lawyers’ looks and earnings and find similar results.

• Variables such as credit ratings, or any variables asked on a scale, are ordered variables.
For example, someone may be assigned a credit rating on a scale from 1 to 7, or someone
may be asked to rate their “happiness” on a scale of 1 to 5.

5 Stata
• EXAMPLE: Major League Baseball Salaries (MLB1.DTA)
$$lsalary = \beta_0 + \beta_1 years + \beta_2 gamesyr + \beta_3 bavg + \beta_4 hrunsyr + \beta_5 rbisyr + u$$

– $H_0$: Once we control for experience (years) and amount played (gamesyr), actual performance has no effect on salary:
$$H_0: \beta_3 = 0, \beta_4 = 0, \beta_5 = 0$$


– An example of exclusion restrictions: the three variables bavg, hrunsyr, and rbisyr can be excluded from the equation.
– To test $H_0$, we need a joint (multiple) hypothesis test (see the sketch at the end of this example):
$$H_1: H_0 \text{ is not true}$$
– So, $H_1$ means at least one of $\beta_3$, $\beta_4$, and $\beta_5$ is different from zero.
– In the Stata output that follows, something curious occurs: none of the three performance variables is statistically significant, even though the coefficient estimates are all positive.
– The estimated effect of bavg is not very large: 10 more points on lifetime batting average means salary is estimated to be 0.97% higher. Another RBI (without hitting a home run) is worth about 1.07%.
– We can use this equation to answer questions such as: How much is a solo HR
worth? That would be adding the two coefficients, so about .025, or 2.5%.
– On the basis of the three insignificant t statistics, should we conclude that none of
bavg, hrunsyr, and rbisyr affects baseball player salaries? No. This would be a
mistake.
– A hint at the problem is severe multicollinearity between hrunsyr and rbisyr: the
correlation is about .89. In fact, one cannot hit a home run without getting at least
one RBI. Generally, home run hitters also tend to produce lots of RBIs.
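• A sketch of the joint test in Stata (MLB1.DTA variable names as above):

    use MLB1.DTA, clear
    reg lsalary years gamesyr bavg hrunsyr rbisyr
    test bavg hrunsyr rbisyr       // joint F-test of H0: beta3 = beta4 = beta5 = 0
    * the joint test can reject even when none of the individual t-tests do,
    * a symptom of the multicollinearity between hrunsyr and rbisyr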

• EXAMPLE: test score

– reg testscr str el_pct, robust
– For a one-student increase in STR, we expect test scores to decrease by 1.1 points, holding all other variables constant.
– The expected decrease was 2.28 before. It has gone down because el_pct and str are positively correlated and $\beta_2 < 0$. Recall that
$$\hat{\alpha}_1 \xrightarrow{p} \beta_1 + \beta_2 \frac{\sigma_{X_1 X_2}}{\sigma^2_{X_1}}$$
– Intuitively, str was getting some of the blame that really belongs to el_pct.
– Given a 1% increase in EL_PCT, we expect TESTSCR to decrease by .65 points, holding all other variables constant.
– reg testscr str el_pct meal_pct, robust
– Selection of regressors: let's add expenditure per pupil.
– reg testscr str el_pct meal_pct expn_stu, robust
– Is this an imperfect collinearity case? pwcorr testscr str el_pct meal_pct expn_stu, star(1)
– Interpretation? School expenditure might be a bad control, since str could be the outcome of this variable. Another interpretation: once the budget is efficiently allocated, str has little impact.


– Beta coefficients
– gen lnavginc = ln(avginc)
– reg testscr str el_pct meal_pct lnavginc, robust
– reg testscr str el_pct meal_pct lnavginc, robust beta
– F-statistic
– reg testscr str expn_stu el_pct, robust
– Now let's test an interesting null hypothesis: $H_0: \beta_1 = 0$ and $\beta_2 = 0$
– test str expn_stu
– Let's consider yet another null: $H_0: \beta_1 = 0$ and $\beta_2 = 0$ and $\beta_3 = 0$
– How about a single restriction? Let's test $H_0: \beta_1 = 0$
– Let's test $H_0: \beta_1 = \beta_2$ and $\beta_3 = 2\beta_1$
– test (str = expn_stu) (el_pct = 2*str)
– lincom (to estimate a single linear combination of the coefficients and its standard error)

6 Homework
• 8, 10 in Chapter 3

– Hint for 10(vi) in Chapter 3


∗ reg educ motheduc fatheduc c.abil c.abil#c.abil, r
∗ margins, at(abil=(-5(0.5)6) moth=12.18 fath=12.45)
∗ marginsplot

• 9 (extension of #8 in Chapter 3), 10, 11 (extension of #10 in Chapter 3) in Chapter 4

• 11, 12 (i-iv) in Chapter 6

• 9 (not vii), 12 (not vi) in Chapter 7
