2 Regression With Multiple Regressors
Econ 7810/8810
Contents
1 Omitted Variable Bias
1.1 Alternative expression
1.2 Examples
1.3 Consistency
2 Multiple Regressors
2.1 Matrix Notation
2.2 OLS Assumptions
2.3 Collinearity
2.4 Properties of Multivariate OLS
2.5 Goodness of Fit (adjusted R²)
2.6 Conditional Mean Independence
2.7 Selection of regressors
2.7.1 Controlling for too many factors
2.7.2 Adding regressors to reduce the error variance
2.8 Data scaling
5 Stata
6 Homework
• In the test score example, what if STR is picking up something besides just the student-teacher ratio? In other words, what if something else is driving test scores? It could be:
– percent of English learners, teacher quality, richer schools, richer neighborhoods, parents' education, ...
– We don't want STR getting credit (or blame) for the effect of something else.
• Worse, what if STR is significant only because other variables are correlated with both STR and TESTSCR?
• Definition: If a regressor is correlated with a variable that has been omitted from the
analysis but that determines (in part) the dependent variable, then the OLS estimator
will have omitted variable bias.
• Formally, omitted variable bias occurs when we don’t include in our regression all the
variables that are correlated with Y and one or more of the regressors (X’s).
• Let's see what happens when we omit a relevant variable from our analysis. Suppose the true model is
Yi = β0 + β1 X1i + β2 X2i + ui    (1)
with E[ui | X1i, X2i] = 0, but we instead estimate the short regression
Yi = α0 + α1 X1i + vi    (2)
where vi ≡ β2 X2i + ui. In most cases, we omit X2 because we cannot collect data on it.
• One can show that
E(α̂1) = β1 + β2 E[ sX1X2 / s²X1 ],
where sX1X2 is the sample covariance of X1 and X2 and s²X1 is the sample variance of X1. The second term equals 0 only when X1i and X2i are uncorrelated or β2 = 0, so α̂1 will be unbiased only if Cov(X1, X2) = 0 and/or X2 is irrelevant (the OVB conditions).
• Omitted variable bias means that OLS Assumption 4 (E(vi | Xi) = 0 in model (2)) is violated.
• Therefore,
Bias(α̂1) = β2 δ̃1,
where δ̃1 is the slope from a regression of X2 on X1.
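The bias formula can be checked with a small simulation. The sketch below is pure Python with made-up parameter values (β1 = 2, β2 = 3, and δ1 = 0.5, so β2 δ1 = 1.5); it fits the short regression of Y on X1 alone and shows the slope landing near β1 + β2 δ1 rather than β1. All names and numbers are illustrative, not from the notes.

```python
import random

def slope(x, y):
    # OLS slope from a simple regression of y on x: s_xy / s_x^2
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

random.seed(42)
n = 20_000
b1, b2, d1 = 2.0, 3.0, 0.5   # true beta1, beta2, and delta1 = Cov(X1,X2)/Var(X1)

x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [d1 * a + random.gauss(0, 1) for a in x1]   # X2 correlated with X1
y = [1.0 + b1 * a + b2 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

a1_hat = slope(x1, y)   # short regression omits X2
print(a1_hat)           # close to b1 + b2*d1 = 3.5, not to b1 = 2
```

Rerunning with d1 = 0 or b2 = 0 makes the short-regression slope center on β1 again, matching the OVB conditions above.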
• Determining the bias is more difficult in situations with more than two explanatory variables. The following simplification often works. Suppose the true model is
lwage = β0 + β1 educ + β2 exper + β3 abil + u
and we omit ability, so we regress lwage on educ and exper. Call these estimators β̃j, j = 0, 1, 2.
• In the general case, it should be remembered that correlation of any xj with an omitted variable generally causes bias in all of the OLS estimators, not just in β̃j. (Exception: conditional mean independence.)
1.2 Examples
• Wage equation: omitted ability bias. Suppose the true model is
lwage = β0 + β1 educ + β2 abil + u
with β2 > 0 (higher ability raises wages).
– We also think
Corr(educ, abil) > 0,
so that higher-ability people get more education, on average.
– In this scenario,
E(α̂1) > β1,
so there is an upward bias in the simple regression. Our failure to control for ability leads to (on average) overestimating the return to education: we attribute some of the effect of ability to education because ability and education are positively correlated.
• Tutoring example. Suppose the true model is
GPA = β0 + β1 tutor + β2 abil + u.
– Suppose that students with lower ability tend to use more tutoring, so δ̃1 < 0 (while β2 > 0).
– Then
E(α̂1) = β1 + β2 δ̃1 = β1 + (+)(−) < β1,
so that our failure to account for ability leads us to underestimate the effect of tutoring.
– In fact, it could happen that β1 > 0 but E(α̂1) ≤ 0, so we tend to find no effect or even a negative effect.
– (Note) If it turns out that α̂1 > 0 and β2 δ̃1 < 0, we know for sure that β1 > 0, and vice versa.
• Missed classes and final exam scores:
final̂ = 26.60 − .121 missed,  n = 680
– So the simple regression estimate implies that, say, 10 more missed classes reduces the predicted score by about 1.2 points (out of 40, where final has a mean of 25.9 and sd = 4.7).
– Adding prior GPA as a control:
final̂ = 17.42 + .017 missed + 3.24 priGPA,  n = 680
– Once priGPA is controlled for, the coefficient on missed is essentially zero (even slightly positive): the simple regression was largely attributing the effect of prior GPA to missed classes.
• WAGE2.DTA
– The estimated return to a year of education falls from about 5.9% to about 3.9%
when we control for differences in IQ.
– To interpret the multiple regression equation, we do this thought experiment: Take
two people, A and B, with the same IQ score. Suppose person B has one more year
of schooling than person A. Then we predict B to have a wage that is 3.9% higher.
– Simple regression does not allow us to compare people with the same IQ score. The
larger estimated return from simple regression is because we are attributing part of
the IQ effect to education.
– Not surprisingly, there is a nontrivial, positive correlation between educ and IQ: Corr(educi, IQi) = .573.
– Multiple regression "partials out" the other independent variables when looking at the effect of, say, educ. One can show that β̂1 measures the effect of educ on lwage once the correlation between educ and IQ has been partialled out:
– reg educ IQ, r
– predict vhat, residual
– reg lwage vhat, r
– reg lwage educ IQ, r
– Compare the coefficient on vhat (from the third regression) with the coefficient on educ (from the fourth): they coincide.
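The partialling-out claim can be illustrated numerically. The sketch below mirrors the Stata steps above in pure Python with simulated data (the coefficients and the stand-ins for the WAGE2 variables lwage, educ, and IQ are made up): it regresses educ on IQ, saves the residuals vhat, and checks that the slope of lwage on vhat equals the educ coefficient from the multiple regression.

```python
import random

def slope(x, y):
    # simple-regression slope of y on x: s_xy / s_x^2
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sum((a - mx) ** 2 for a in x)

def residuals(x, z):
    # residuals from regressing x on z (with an intercept)
    b = slope(z, x)
    a = sum(x) / len(x) - b * sum(z) / len(z)
    return [xi - (a + b * zi) for xi, zi in zip(x, z)]

def ols2(y, x1, x2):
    # OLS of y on (1, x1, x2) by solving the normal equations X'X b = X'y
    n = len(y)
    X = [[1.0, a, b] for a, b in zip(x1, x2)]
    XtX = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(3)]
           for r in range(3)]
    Xty = [sum(X[i][r] * y[i] for i in range(n)) for r in range(3)]
    A = [row[:] + [t] for row, t in zip(XtX, Xty)]   # augmented matrix
    for col in range(3):                             # Gauss-Jordan elimination
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(3):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[r][3] / A[r][r] for r in range(3)]

# simulated stand-ins for the WAGE2 variables (made-up coefficients)
random.seed(0)
n = 2000
iq = [random.gauss(100, 15) for _ in range(n)]
educ = [8 + 0.05 * q + random.gauss(0, 2) for q in iq]
lwage = [1.0 + 0.06 * e + 0.005 * q + random.gauss(0, 0.3)
         for e, q in zip(educ, iq)]

b = ols2(lwage, educ, iq)    # multiple regression: lwage on educ and IQ
vhat = residuals(educ, iq)   # "reg educ IQ" + "predict vhat, residual"
fwl = slope(vhat, lwage)     # "reg lwage vhat"
# fwl and b[1] (the educ coefficient) agree
```

This equality is exact, not approximate: it is the Frisch-Waugh result behind the phrase "partialling out."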
1.3 Consistency
• Furthermore, when you have OVB, OLS is not just biased, it is also inconsistent. Let’s
see why.
• We know sX1X2 →p σX1X2, s²X1 →p σ²X1, and sX1u →p σX1u = 0.
• Therefore,
α̂1 →p β1 + β2 (σX1X2 / σ²X1) ≠ β1 unless σX1X2 = 0 (or β2 = 0).
• Suppose that one (or more) variables have been omitted from the regression, meaning
that X is now likely to be correlated with the error term u.
2 Multiple Regressors
• The solution to the omitted variable bias problem is to add (if you can) the other relevant
variables to the regression.
• Examples:
– Now we are estimating the pure impact of STR on TestScr, controlling for this other variable (say, the percent of English learners).
– Another phrase used to describe β̂1 is the partial effect on Y of X1, holding X2 fixed (the ceteris paribus effect).
• In the multiple regressors model, we assume that the population regression line (the relationship that holds between Y and the X's on average) is given by
Yi = β0 + β1 X1i + β2 X2i + · · · + βk Xki + ui,  i = 1, . . . , n.
• The concepts of heteroskedasticity and homoskedasticity also carry over to the multiple regressors model.
• In matrix notation, collect the coefficients into the (k + 1) × 1 vector
β = (β0, β1, . . . , βk)′.
• So, Y is n × 1, X is n × (k + 1), U is n × 1, and β is (k + 1) × 1
• β0 (the intercept) is the expected value of Yi when all the regressors (the X's) are zero.
• β1 may also be described as the partial effect on Y of X1, holding all other variables constant.
2.2 OLS Assumptions
• Write the model as Y = Xβ + U. We assume:
1. Linear in parameters.
2. (Xi, Yi), i = 1, . . . , n, are a simple random sample.
3. X has full column rank (no perfect collinearity).
4. E(ui | X1i, . . . , Xki) = E(ui | Xi) = 0.
ii. Even at the same income level, families differ in their interest and concern
about their children’s education. Family support and student motivation
are in u. Are these correlated with class size even though we have included
income? Probably.
(d) When this assumption holds, we say x1, . . . , xk are exogenous explanatory variables. If xj is correlated with u, we often say xj is an endogenous explanatory variable.
5. Homoskedasticity: Var(ui | Xi) = σ².
2.3 Collinearity
• Assumption 3 is new. Why is it important?
• Suppose one regressor is an exact linear function of another, say X2i = 2X1i for all i. Then β1 X1i + β2 X2i = (β1 + 2β2) X1i, so the model is equivalent to the univariate regression, which we were able to estimate because the FOCs gave us two equations and two unknowns.
• Now there are three unknowns (β0, β1, β2), but still only two equations, so we can't identify the parameters separately.
• This is not usually a problem in practice, since if you include perfectly collinear variables,
your software will either give you back an error message, or drop as many regressors as
necessary to make the remaining variables non-collinear.
• But what if two or more of the regressors are highly (but not perfectly) collinear?
• In other words, what if there’s a linear function of the regressors that is highly correlated
with another regressor?
– Example: Macro variables like wage rate, per capita GDP, etc.
• This situation, called imperfect multicollinearity, does not pose any problem for the
theory of the OLS estimators.
• In fact, a purpose of OLS is to sort out the independent effects of the various regressors
when they are potentially correlated.
• However, with a high degree of imperfect multicollinearity among the regressors, σ_β̂j will tend to be high, which can make it difficult to get precise estimates of the separate effects.
• Why? Recall that a coefficient is an estimate of the partial effect of one regressor, holding
the other ones constant. If the regressors tend to move together, this effect will be hard
to estimate precisely.
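A common way to quantify this is the variance inflation factor (VIF), a standard diagnostic not derived in these notes: it uses the R² from regressing one regressor on all the others. The numbers below are made up for illustration.

```python
import math

# hypothetical R^2 from regressing regressor X_j on the other regressors
r2_j = 0.95

# variance inflation factor: Var(beta_j-hat) is inflated by this factor
# relative to the case where X_j is uncorrelated with the other regressors
vif = 1.0 / (1.0 - r2_j)
se_inflation = math.sqrt(vif)   # the standard error is inflated by sqrt(VIF)

print(vif, se_inflation)        # roughly 20 and 4.5
```

So a regressor that the others explain 95% of is estimated with a standard error about 4.5 times larger than it would be without the collinearity, which is exactly the imprecision described above.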
2.4 Properties of Multivariate OLS
• If one further assumes no serial correlation between the errors and homoskedasticity (i.e., the iid assumption), then Var(β̂) = σ²(X′X)⁻¹ and OLS is BLUE, the best linear unbiased estimator (Gauss-Markov Theorem).¹
– To see why, suppose there existed a better linear estimator than OLS, say
β̂* = (D + (X′X)⁻¹X′)Y.
To compete with OLS, this estimator must also be unbiased (E(β̂*) = β), which requires DX = 0. Substituting Y = Xβ + U gives
β̂* = β + DU + (X′X)⁻¹X′U.
– Now, let's compare the variance of this new estimator with that of OLS. We know
Var(β̂) = σ²(X′X)⁻¹,
and, using DX = 0, Var(β̂*) = σ²DD′ + σ²(X′X)⁻¹. Since DD′ is positive semidefinite, in the end Var(β̂) ≤ Var(β̂*).
¹ Recall that if a group of random variables is jointly normally distributed, then each of them is normally distributed as well.
2.5 Goodness of Fit (adjusted R²)
• As before, the regression R² is the fraction of the sample variance of Yi explained by the regressors:
R² = Σ(Ŷi − Ȳ)² / Σ(Yi − Ȳ)² = ESS / TSS = 1 − RSS / TSS
• R² never decreases when a regressor is added. One way to adjust for this is to deflate R² by some factor, which is what the adjusted R² (written R̄²) does:
R̄² = 1 − [(n − 1) / (n − k − 1)] (RSS / TSS) = 1 − s²û / s²Y
• Notice that:
1. R̄² is always less than R²: (n − 1)/(n − k − 1) > 1 =⇒ R̄² < R².
2. Adding a regressor has two effects on R̄²: (i) RSS falls, which increases R̄²; (ii) (n − 1)/(n − k − 1) increases, which lowers R̄². The total effect on R̄² depends on which effect is bigger.
3. R̄² can be negative.
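These mechanics can be illustrated with a short simulation (pure Python; the data-generating process and the pure-noise regressor z are made up). It computes R² and R̄² for a regression of y on x, then again after adding z, using the partialling-out trick from Section 1.2 to obtain the two-regressor RSS without matrix algebra.

```python
import random

def fit_resid(x, y):
    # residuals from a simple regression of y on x (with an intercept)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b = sxy / sum((a - mx) ** 2 for a in x)
    a0 = my - b * mx
    return [yi - (a0 + b * xi) for xi, yi in zip(x, y)]

def r2_stats(rss, tss, n, k):
    # plain and adjusted R^2 for a model with k regressors
    r2 = 1 - rss / tss
    adj = 1 - (n - 1) / (n - k - 1) * rss / tss
    return r2, adj

random.seed(1)
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]
z = [random.gauss(0, 1) for _ in range(n)]   # pure noise, unrelated to y

my = sum(y) / n
tss = sum((yi - my) ** 2 for yi in y)

e1 = fit_resid(x, y)                  # residuals from y on x alone
rss1 = sum(e ** 2 for e in e1)

# full-model residuals via partialling out: regress the y-on-x residuals
# on the z-on-x residuals
e2 = fit_resid(fit_resid(x, z), e1)
rss2 = sum(e ** 2 for e in e2)

r2_1, adj_1 = r2_stats(rss1, tss, n, 1)
r2_2, adj_2 = r2_stats(rss2, tss, n, 2)
print(r2_1, adj_1, r2_2, adj_2)
```

Adding the noise regressor mechanically weakly raises R² (RSS cannot increase), while R̄² is pulled in both directions and will often fall when the added variable is irrelevant.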
• Some caveats about using R² and R̄² in practice:
– An increase in R² or R̄² does not mean that an added variable is statistically significant.
– R² always increases when we add regressors. R̄² might not, but even if it does, that does not mean the added variable is statistically significant. You need a hypothesis test to establish this.
– A high R² or R̄² does not mean that the regressors are a true cause of the dependent variable.
– A high R² or R̄² does not mean that there is no omitted variable bias.
– A high R² or R̄² does not necessarily mean that you have the most appropriate set of regressors, nor does a low R² or R̄² necessarily mean that you have a bad set of regressors.
2.6 Conditional Mean Independence
• In most cases, the reason why we need other controls is to have this conditional mean independence property.
• This assumption implies that, once we control for the other variables, we can argue that the variable of interest is as good as exogenous (random). For example, the mean of the error term does not depend on the student-teacher ratio once we condition on the percent of English learners.
• The question is: is class size "as if" randomly assigned among schools with the same share of English learners?
• Mathematically, conditional mean independence requires E(ui | X1i, X2i) = E(ui | X2i): the conditional expectation of ui given X1i and X2i does not depend on X1i, although it can depend on X2i. Once X2i is controlled for, X1i is effectively independent of ui.
Yi = β0 + β1 X1i + β2 X2i + ui
= β0 + β1 X1i + β2 X2i + E(ui | X1i, X2i) + vi,  where vi = ui − E(ui | X1i, X2i)
= β0 + β1 X1i + β2 X2i + E(ui | X2i) + vi
= β0 + β1 X1i + β2 X2i + γ0 + γ2 X2i + vi  (using a linearity assumption)
= (β0 + γ0) + β1 X1i + (β2 + γ2) X2i + vi
= δ0 + β1 X1i + δ2 X2i + vi
so the coefficient on X1i still identifies β1, even though the coefficient on X2i no longer identifies β2.
• Example: suppose students in an econometrics class are split into two homework groups, one mandatory and the other optional. The thing is that the distribution of economics and non-economics majors differs by group, i.e., the mandatory group might end up having more economics majors. For example, among economics majors (X2i = 1), 75% are assigned to the treatment group (mandatory homework, X1i = 1), while among non-economics majors (X2i = 0), only 25% are assigned to the treatment group. Given the assumption that economics majors are better at econometrics, the test scores of the mandatory group are higher. So, if one does not control for major, the score difference comes not from mandatory homework but from major. Treatment is random within majors and within non-majors, but not random after pooling.
• This means E(ui | X1i, X2i) = E(ui | X2i): within each major, assignment to the mandatory-homework group is unrelated to the error term.
• Including majors in the regression eliminates this omitted variable bias (treatment is
random given major), making the OLS estimator on X1i an unbiased estimator of the
causal effect.
2.7 Selection of regressors
2.7.1 Controlling for too many factors
• Sometimes, the issue of whether or not to control for certain factors is not clear-cut. If other X variables could be the outcome or the cause of the variable of interest, you should be careful about the interpretation. In practice, you should include the variables that the literature has included.
• In our example of str on testscr, one might want to control for total spending per student.
The problem is that part of spending goes toward lowering str. Once we hold spending
fixed, the role for str is limited (evidently).
• It is tempting to over-control because R̄² often increases substantially.
– If we want the effect of total spending per student – the idea being that we let districts decide how to allocate resources – the estimated effect is nontrivial. Spending is in logs, so
∆math4̂ = (6.22/100) %∆spend ≈ .062 (%∆spend)
so, if spending increases by 1%, holding lunch fixed, predicted math4 increases by about 0.062 percentage points.
– Now suppose we add the log of the average teacher salary, lsalary.
– The coefficient on lspend is not statistically different from zero. Does spending no
longer matter?
– The problem is that part of spending goes toward increasing salaries. Once we hold
those fixed, the role for spending is limited (evidently). We seem to be finding
that spending other than to increase salary and lunch program has no effect on
performance. But this does not mean total spending has no effect.
2.7.2 Adding regressors to reduce the error variance
• However, there is one clear case: we should always include independent variables that affect y and are uncorrelated with the variables of interest. Adding such a variable does not induce multicollinearity, but it does reduce the error variance, so the standard errors of all the OLS estimators will be reduced. However, such uncorrelated variables may be hard to find.
• Note that the issue here is not unbiasedness, because the included variable is not correlated with the error. The issue is getting an estimator with a smaller sampling variance, i.e., more precise estimates.
2.8 Data scaling
• In Stata, one can add an option to the regress command called beta, which will give us the standardized regression coefficients. The beta coefficients are used to compare the relative strength of the various predictors within the model. Because the beta coefficients are all measured in standard deviations, instead of the units of the variables, they can be compared to one another. In other words, the beta coefficients are the coefficients that you would obtain if the outcome and predictor variables were all transformed to standard scores, also called z-scores, before running the regression.
• (ex) reg testscr str el pct meal pct lnavginc , robust beta
• str has a beta coefficient of −0.072; the English-learner percentage, −0.168 (larger in absolute value). Thus, a one-standard-deviation increase in str leads to a 0.072-standard-deviation decrease in predicted testscr, with the other variables held constant. And a one-standard-deviation increase in the English-learner percentage, in turn, leads to a 0.168-standard-deviation decrease in predicted testscr, with the other variables in the model held constant.
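A beta coefficient is just the raw slope rescaled by the ratio of standard deviations, b·sd(x)/sd(y); equivalently, it is the slope you get after z-scoring both variables. A minimal pure-Python check on toy data (the numbers are illustrative only):

```python
import statistics

def slope(x, y):
    # simple-regression slope of y on x
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sum((a - mx) ** 2 for a in x)

def zscore(v):
    # standard scores: subtract the mean, divide by the (population) sd
    m, s = statistics.mean(v), statistics.pstdev(v)
    return [(vi - m) / s for vi in v]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]

b = slope(x, y)                                         # raw slope
beta = b * statistics.pstdev(x) / statistics.pstdev(y)  # beta coefficient
bz = slope(zscore(x), zscore(y))                        # slope on z-scored data
# beta and bz agree
```

With several regressors, the same rescaling b_j·sd(x_j)/sd(y) applies coefficient by coefficient, which is what Stata's beta option reports.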
• When the sample size is large, a 95% confidence interval for βj can be constructed as
β̂j ± 1.96 · SE(β̂j)
or:
[ β̂j − 1.96 · SE(β̂j) , β̂j + 1.96 · SE(β̂j) ]
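As a quick numerical illustration of the interval (the estimate and standard error below are made-up numbers, not from any regression in these notes):

```python
# hypothetical coefficient estimate and robust standard error
b_hat, se = -2.28, 0.52

# large-sample 95% confidence interval
lo, hi = b_hat - 1.96 * se, b_hat + 1.96 * se
print(round(lo, 4), round(hi, 4))
# zero lies outside the interval, so H0: beta_j = 0 would be
# rejected at the 5% level
```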
• For example, what if the null is related to more than one coefficient?
– Example 1: H0 : β 1 = β 2
– Example 2: H0 : β 1 + β 2 = 1
– Example 3: H0 : β 1 = 0 & β 2 = 0
• Suppose we want to regress test score on the student-teacher ratio, expenditure per student, and the English-learner share. (reg testscr str expn stu el pct, robust)
• We want to test the null hypothesis that both the coefficient on STR and the coefficient on EXPN STU are zero.
• This is a joint hypothesis, since we are imposing two restrictions on the regression model (β1 = 0 and β2 = 0).
• H0: β1 = 0 and β2 = 0
– Should we compare t_β1 and t_β2 to the 5% critical value (1.96) and reject if either one is bigger? No.
• Why not? Even when you cannot reject either individual null, the combined effect can be greater than zero: you may well fail to reject the joint null even though it is wrong.
3.2.1 F-Statistic
• The F-statistic exploits the fact that the t-statistics of the individual coefficients are (jointly) normally distributed in large samples.
• Recall the example
H0: β1 = β2 and β3 = 2β1.
Any such set of q linear restrictions can be written as
H0: Rβ = r vs. H1: Rβ ≠ r
for a suitable q × (k + 1) matrix R and q × 1 vector r.
• The formula for the F-statistic with q restrictions is quite complicated. However, the good news is that the F-statistic is automatically computed by statistical packages (like Stata) with simple commands.
• The F-statistic can be used to test any linear restriction or set of linear restrictions.
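Under homoskedasticity there is a simple version of the statistic, built from the restricted and unrestricted sums of squared residuals. This homoskedasticity-only formula is a special case of what Stata computes, and the RSS values, n, and k below are made up for illustration.

```python
# hypothetical sums of squared residuals from the restricted and
# unrestricted regressions
rss_r, rss_u = 152.0, 140.0
n, k, q = 420, 3, 2   # k regressors in the unrestricted model, q restrictions

# homoskedasticity-only F-statistic
F = ((rss_r - rss_u) / q) / (rss_u / (n - k - 1))
print(round(F, 2))    # compare to the F(q, n - k - 1) critical value
```

The numerator measures how much the fit deteriorates when the restrictions are imposed, per restriction; a large F means the restrictions cost too much fit to be plausible.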
• Example: wage differences by gender and marital status (WAGE1.DTA).
• We can define four exhaustive and mutually exclusive groups: married males (marrmale), married females (marrfem), single males (singmale), and single females (singfem).
• Note that we can define each of these dummy variables in terms of female and married:
marrmale = married·(1 − female),  marrfem = married·female,
singmale = (1 − married)·(1 − female),  singfem = (1 − married)·female.
• We can allow each of the four groups to have a different intercept by choosing a base group and then including dummies for the other three groups.
• So, if we choose single males as the base group, we include marrmale, marrfem, and singfem in the regression. The coefficients on these variables are measured relative to single men.
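The four group dummies can be built directly from the two 0/1 indicators; a minimal sketch (pure Python, function name hypothetical):

```python
def group_dummies(female, married):
    # four exhaustive, mutually exclusive groups from two 0/1 indicators
    marrmale = married * (1 - female)
    marrfem = married * female
    singmale = (1 - married) * (1 - female)
    singfem = (1 - married) * female
    return marrmale, marrfem, singmale, singfem

# every person falls into exactly one group
for female in (0, 1):
    for married in (0, 1):
        assert sum(group_dummies(female, married)) == 1
```

With single males as the base group, only marrmale, marrfem, and singfem enter the regression; singmale must be dropped to avoid perfect collinearity with the intercept.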
• With lwage as the dependent variable, we can give them a percentage-change interpretation.
• Use WAGE1.DTA. Control for education, experience (as a quadratic), and tenure (as a
quadratic).
• Using the usual approximation based on differences in logarithms – and holding fixed
education, experience, and tenure – a married man is estimated to earn about 21.3%
more than a single man. Remember, this compares two men with the same level of
schooling, general workforce experience, and tenure with the current employer.
• This marriage premium for men has long been noted by labor economists. Does marriage
make men more productive? Is being married a signal to employers (say, of stability and
reliability)? Is there a selection issue in that more productive men are likely to be
married, on average? The regression cannot tell us which explanation is correct.
• A married woman, at given levels of the other variables, earns about 19.8% less than a single man. A single woman earns about 11.0% less than a comparable single man. (Stata reports t_singfem = −1.93, with p-value = .054, so just short of significance at the 5% level.)
• What if we want to compare married women and single women? Just plug in the correct set of zeros and ones: the difference in intercepts is (−.198) − (−.110) = −.088, so married women earn about 8.8% less than single women (controlling for other factors).
• We cannot tell from the previous output whether this difference is statistically significant.
Note how the intercept for single men gets differenced away.
• Two approaches: (1) Use the lincom command in Stata. (2) Choose, say, married females as the base group and reestimate the model (including the dummies marrmale, singmale, and singfem).
• The t statistic for the estimated difference −.088 is −1.68, which is significant at the 10%
level (but not much lower than that).
• Unlike for men, the marriage "premium" for women is either nonexistent or in fact negative. (8.8% is not a small economic effect; it is a bit more than one year of schooling.)
• As we move up the scale from 1 to 5, why should a one-unit increase mean the same
amount of “beauty”?
• The “looks” variable is what we call an ordinal variable: we know that the order of
outcomes conveys information (5 is better than 4, and 2 is better than 1) but we do not
know that the difference between 5 and 4 is the same as 2 and 1.
• In fact, very few people are at the extreme values 1 and 5 (less than 1% each). It
makes sense to combine into three categories: below average (belavg), average, and above
average (abvavg).
• 12.3% of people are “below average,” 30.4% are “above average,” and everyone else
(57.3%) has looks = 3 (labeled “average”).
• With “average” as the base group, we include belavg and abvavg in a regression:
• Controlling for no other factors, those with below average looks earn about 20.9% less
than those with average looks. The t statistic is very significant (p-value is zero to three
decimal places).
• Those with above average looks are estimated to earn about 4.5% less than those with
average looks, but the p-value is .228. So there is little evidence the effect is different
from zero.
• Now control for some other factors, including gender and education.
• The effect of having below average looks is now about 15% lower salary (on average).
Above average looks is still statistically insignificant and gets smaller in magnitude.
• Good practice to look at all coefficients to see if the signs and magnitudes make sense.
They do, although the premium for males is very large.
• Putting in the variable looks directly means that better looks always has to have a positive (and linear) effect. That specification is not as significant and fits slightly less well (compare adjusted R²).
• One shortcoming in the previous analysis is that it ignores occupation. Maybe we should
allow people to sort into occupation (perhaps partly based on looks) and see if there is a
“looks premium” in a given occupation. Biddle and Hamermesh (1998, Journal of Labor
Economics) study lawyers’ looks and earnings and find similar results.
• Variables such as credit ratings, or any variables asked on a scale, are ordered variables.
For example, someone may be assigned a credit rating on a scale from 1 to 7, or someone
may be asked to rate their “happiness” on a scale of 1 to 5.
5 Stata
• EXAMPLE: Major League Baseball Salaries (MLB1.DTA)
– H0 : Once we control for experience (years) and amount played (gamesyr), actual
performance has no effect on salary.
H0 : β 3 = 0, β 4 = 0, β 5 = 0
H1 : H0 is not true
• EXAMPLE: Test scores and the student-teacher ratio
– Intuitively, str was getting some of the blame that really belongs to el pct.
– Given a 1-percentage-point increase in el pct, we expect testscr to decrease by .65 points, holding all other variables constant.
– reg testscr str el pct meal pct , robust
– Selection of regressors. Let’s add expenditure per pupil.
– reg testscr str el pct meal pct expn stu , robust
– Is this a case of imperfect collinearity? pwcorr testscr str el pct meal pct expn stu, star(1)
– Interpretation? School expenditure might be a bad control, since str could be an outcome of this variable. Another interpretation: once the budget is efficiently allocated, str has little impact.
– Beta coefficient
– gen lnavginc=ln( avginc)
– reg testscr str el pct meal pct lnavginc , robust
– reg testscr str el pct meal pct lnavginc , robust beta
– F-statistic
– reg testscr str expn stu el pct , robust
– Now let's test an interesting null hypothesis: H0 : β1 = 0 and β2 = 0
– test str expn stu
– Let’s consider yet another null: H0 : β 1 = 0 and β 2 = 0 and β 3 = 0
– How about a single restriction? Let’s test H0 : β 1 = 0
– Let’s test H0 : β 1 = β 2 and β 3 = 2β 1
– test (str=expn stu) (el pct=2*str)
– lincom
6 Homework
• 8, 10 in Chapter 3