Chapter 6: Linear Regression with Multiple Regressors


Introduction to Econometrics

Fourth Edition, Global Edition

Chapter 6
Linear Regression with
Multiple Regressors

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Outline
1. Omitted variable bias
2. Using regression to estimate causal effects
3. Multiple regression and OLS
4. Measures of fit
5. Sampling distribution of the OLS estimator with multiple
regressors
6. Control variables

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Omitted Variable Bias
• In the class size example, β1 is the causal effect on test scores of
a change in the STR by one student per teacher
• When β1 is a causal effect, the first least squares assumption for
causal inference must hold: E(u|X) = 0.
• The error u arises because of factors, or variables, that influence
Y but are not included in the regression function. There are
always omitted variables!
• If the omission of those variables results in E(u|X) ≠ 0, then the
OLS estimator will be biased

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Omitted Variable Bias
The bias in the OLS estimator that occurs as a result of an omitted
factor, or variable, is called omitted variable bias. For omitted
variable bias to occur, the omitted variable “Z” must satisfy two
conditions:
The two conditions for omitted variable bias
1. Z is a determinant of Y (i.e. Z is part of u); and
2. Z is correlated with the regressor X (i.e. corr(Z,X) ≠ 0)

Both conditions must hold for the omission of Z to result in omitted
variable bias

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Omitted Variable Bias
In the test score example: English language learners (immigrants)
1. English language ability (whether the student has English as a
second language) plausibly affects standardized test scores: Z is
a determinant of Y
2. Immigrant communities tend to be less affluent and thus have
smaller school budgets and higher STR: Z is correlated with X

Accordingly, ˆ1 is biased. What is the direction of this bias?


– What does common sense suggest?
– If common sense fails you, there is a formula…

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Omitted Variable Bias and the 1st LSA
Omitted variable bias means that the 1st LSA, E(ui|Xi) = 0,
does not hold
• The error term ui in the simple regression represents all factors,
other than Xi, that influence (are determinants of) Yi
• If one of these factors is correlated with Xi, then the
error term ui is correlated with Xi
• As a result, since ui and Xi are correlated, E(ui|Xi) ≠ 0 (this
correlation violates the 1st LSA)
➔The consequence is serious: the OLS estimator is biased
• This bias does not vanish even when 𝑛 is large → OLS estimator
is inconsistent
Copyright © 2019 Pearson Education Ltd. All Rights Reserved.
Omitted Variable Bias and the 1st LSA
A formula for omitted variable bias: recall the equation,

β̂1 − β1 = [ (1/n) Σi vi ] / [ ((n − 1)/n) sX² ]

where vi = (Xi − X̄)ui ≈ (Xi − μX)ui and the sums run over i = 1, …, n.

Under Least Squares Assumption #1, E[(Xi − μX)ui] = cov(Xi, ui) = 0 →
E(β̂1) = β1 (unbiased)

But what if E[(Xi − μX)ui] = cov(Xi, ui) = σXu ≠ 0?

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Omitted Variable Bias and the 1st LSA
The OLS estimator is biased → E(β̂1) is no longer equal to the
true value β1 (E(β̂1) ≠ β1)

β̂1 − β1 = Σi (Xi − X̄)ui / Σi (Xi − X̄)²

β̂1 = β1 + Σi (Xi − X̄)ui / Σi (Xi − X̄)²

As n grows, the sample moments in this ratio converge to the population
covariance σXu and the population variance σX², so

β̂1 ≈ β1 + σXu / σX²

Note that the correlation coefficient is corr(X, u) = ρXu = σXu / (σX σu)

Thus, σXu = ρXu × σX σu

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Omitted Variable Bias and the 1st LSA
β̂1 ≈ β1 + σXu / σX²   and   σXu = ρXu × σX σu

→ β̂1 ≈ β1 + (ρXu × σX σu) / σX² = β1 + ρXu × (σu / σX)

• If the 1st LSA is satisfied → ρXu = 0 → β̂1 is unbiased
(E(β̂1) = β1)
• If the 1st LSA is not satisfied → ρXu ≠ 0 → β̂1 is biased
(E(β̂1) ≠ β1)

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


The omitted variable bias formula

β̂1  →p  β1 + ρXu × (σu / σX)      (→p denotes convergence in probability as n → ∞)

• If an omitted variable Z is both:
  1. A determinant of Y (that is, it is contained in u); and
  2. Correlated with X, then ρXu ≠ 0
➔ The OLS estimator β̂1 is biased and inconsistent

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


The omitted variable bias formula

β̂1  →p  β1 + ρXu × (σu / σX)

Note:
- The term ρXu × (σu / σX) is the bias in β̂1, and it persists even in large
samples.
- Whether this bias is large or small depends on the correlation
(ρXu) between the regressor and the error term: the larger |ρXu|
is, the larger the bias.
- The direction of the bias in β̂1 depends on whether X and u are
positively or negatively correlated.
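To make this formula concrete, here is a small simulation sketch (not from the textbook; Python with made-up coefficients rather than the Stata/California data used elsewhere in these slides). Z is a determinant of Y and is correlated with X, and omitting it leaves the OLS slope centered near β1 + ρXu(σu/σX) rather than at β1:

# Illustrative sketch: with Z omitted, the OLS slope is centered near
# beta1 + rho_Xu * (sigma_u / sigma_X), not at beta1.
import numpy as np

rng = np.random.default_rng(0)
n = 500_000                        # large n: the bias does not vanish
beta0, beta1, beta2 = 1.0, 2.0, 3.0

Z = rng.normal(size=n)             # omitted variable
X = 0.5 * Z + rng.normal(size=n)   # X is correlated with Z
eps = rng.normal(size=n)
Y = beta0 + beta1 * X + beta2 * Z + eps

u = beta2 * Z + eps                # error term of the short regression of Y on X
Xmat = np.column_stack([np.ones(n), X])
ols_slope = np.linalg.lstsq(Xmat, Y, rcond=None)[0][1]

rho_Xu = np.corrcoef(X, u)[0, 1]
formula = beta1 + rho_Xu * u.std() / X.std()

print(f"OLS slope with Z omitted: {ols_slope:.3f}")        # about 3.2, not 2.0
print(f"beta1 + rho_Xu*(sigma_u/sigma_X): {formula:.3f}")   # also about 3.2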
Copyright © 2019 Pearson Education Ltd. All Rights Reserved.
The omitted variable bias formula

What can we do about omitted variable bias?


• In the test score example, class size is correlated with the
fraction (percentage) of English learners.
• One way to address this problem is to select a subset of districts
that have the same % of English learners but have different class
sizes (STR).
• For each subset of districts, class size cannot be picking up the
English learner effect because the fraction of English learners is
held constant.
• More generally, this observation suggests estimating the effect of
STR on test scores, holding constant the % of English learners.
Copyright © 2019 Pearson Education Ltd. All Rights Reserved.
The omitted variable bias formula
TABLE 6.1 Differences in Test Scores for California School Districts with Low and High
Student–Teacher Ratios, by the Percentage of English Learners in the District
                                 Student–Teacher Ratio < 20   Student–Teacher Ratio ≥ 20   Difference in Test Scores, Low vs. High STR
                                 Average Test Score      n    Average Test Score      n    Difference    t-statistic
All districts                          657.4           238          650.0           182       7.4           4.04
Percentage of English learners:
  < 1.9%                               664.5            76          665.4            27      −0.9          −0.30
  1.9–8.8%                             665.2            64          661.8            44       3.3           1.13
  8.8–23.0%                            654.9            54          649.7            50       5.2           1.72
  > 23.0%                              636.7            44          634.8            61       1.9           0.68

• Districts with lower % of English learners have higher test scores


• Districts with lower % of English learners have smaller classes (small STR)
• Among districts with comparable % of English learners, the effect of class
size (STR) is small (recall overall “test score gap” = 7.4)
Copyright © 2019 Pearson Education Ltd. All Rights Reserved.
Using regression to estimate causal effects
• The Test Score/STR/Fraction of English Learners example shows that, if
an omitted variable satisfies the two conditions for omitted variable
bias, then the OLS estimator is biased and inconsistent (even if n is
large, β̂1 will not be close to β1).
• We have distinguished between two uses of regression: for prediction,
and to estimate causal effects.
• In the class size application, we clearly are interested in a causal effect:
what do we expect to happen to test scores if the superintendent reduces
the class size?

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


What, precisely, is a causal effect?
• “Causality” is a complex concept!
• In this course, we take a practical approach to defining causality:
A causal effect is defined to be the effect measured in an ideal
randomized controlled experiment.

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Ideal Randomized Controlled Experiment
• Ideal: subjects all follow the treatment protocol – perfect
compliance, no errors in reporting, etc.!
• Randomized: subjects from the population of interest are
randomly assigned to a treatment or control group (so there are
no confounding factors).
• Controlled: having a control group permits measuring the
differential effect of the treatment.
• Experiment: the treatment is assigned as part of the experiment:
the subjects have no choice, so there is no “reverse causality” in
which subjects choose the treatment they think will work best.

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Back to class size
Imagine an ideal randomized controlled experiment for measuring
the effect on Test Score of reducing STR…
• In that experiment, students would be randomly assigned to
classes, which would have different sizes.
• Because they are randomly assigned, all student characteristics
(and thus ui) would be distributed independently of STRi.
• Thus, E(ui|STRi) = 0 – that is, LSA #1 holds in a randomized
controlled experiment.

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


How does our observational data differ from
this ideal?
• The treatment is not randomly assigned.
• Considering the % English learners in the district: It
plausibly satisfies the two criteria for omitted variable bias:
Z = % of English learners is:
1. a determinant of Y; and
2. correlated with the regressor X

• Thus, the “control” and “treatment” groups differ in a
systematic way, so corr(STR, %EL) ≠ 0

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


How does our observational data differ from
this ideal?
• Randomization implies that any differences between the
treatment and control groups are random – not systematically
related to the treatment.
• We can eliminate the difference in the % of English learners
between the large class (control) and small class (treatment)
groups by examining the effect of class size among districts with
the same % of English learners.
• This is one way to “control” for the effect of % of English
learners when estimating the effect of STR.

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Return to omitted variable bias
Three ways to overcome omitted variable bias
1. Run a randomized controlled experiment in which treatment (STR) is
randomly assigned: then % of English learners is still a determinant of
TestScore, but it is uncorrelated with STR (This solution to omitted
variable bias is rarely feasible).
2. Adopt the “cross tabulation” approach, with finer gradations of STR
and % of English learners – within each group, all classes have the
same % of English learners, so we control for % of English learners.
3. Use a regression in which the omitted variable (% of English learners)
is no longer omitted: include % of English learners as an additional
regressor in a multiple regression.

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


The Population Multiple Regression Model
• Consider the case of two regressors:
Yi = β0 + β1X1i + β2X2i + ui, i = 1,…,n
• Y is the dependent variable
• X1, X2 are the two independent variables (regressors)
• (Yi, X1i, X2i) denote the ith observation on Y, X1, and X2
• β0 = unknown population intercept
• β1 = effect on Y of a change in X1, holding X2 constant
• β2 = effect on Y of a change in X2, holding X1 constant
• ui = the regression error (omitted factors)
Copyright © 2019 Pearson Education Ltd. All Rights Reserved.
Interpretation of coefficients in multiple
regression
Yi = β0 + β1X1i + β2X2i + ui, i = 1,…,n
Consider the difference in the expected value of Y for two values
of X1 holding X2 constant:
Population regression line when X1 = X1,0:

Y = β0 + β1X1,0 + β2X2

Population regression line when X1 = X1,0 + ΔX1:

Y + ΔY = β0 + β1(X1,0 + ΔX1) + β2X2

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Interpretation of coefficients in multiple
regression
Before:      Y = β0 + β1X1,0 + β2X2
After:       Y + ΔY = β0 + β1(X1,0 + ΔX1) + β2X2
Difference:  ΔY = β1ΔX1
So:
β1 = ΔY/ΔX1, holding X2 constant
β2 = ΔY/ΔX2, holding X1 constant
β0 = predicted value of Y when X1 = X2 = 0.
Copyright © 2019 Pearson Education Ltd. All Rights Reserved.
Homoscedasticity and Heteroscedasticity in
multiple regression
• The definitions of homoscedasticity and heteroscedasticity in the
multiple regression model extend their definitions in the single-
regressor model.
• The error term 𝑢𝑖 is homoscedastic if the variance of the
conditional distribution of 𝑢𝑖 given 𝑋1𝑖 , 𝑋2𝑖 , … , 𝑋𝑘𝑖 is constant:
𝑣𝑎𝑟 𝑢𝑖 𝑋1𝑖 , 𝑋2𝑖 , … , 𝑋𝑘𝑖 is constant for 𝑖 = 1,2, … , 𝑛.
• Otherwise, the error term is heteroscedastic.

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


The OLS Estimator in Multiple Regression
• With two regressors, the OLS estimator solves:
min b0, b1, b2   Σi [Yi − (b0 + b1X1i + b2X2i)]²,   with the sum over i = 1, …, n

• The OLS estimator minimizes the average squared difference
between the actual values of Yi and the predicted values based
on the estimated line
• This minimization problem is solved using calculus
• This yields the OLS estimators of β0, β1 and β2
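As a minimal sketch (not part of the slides; Python with simulated data in place of the California data set), the minimization above is exactly the least-squares problem a numerical routine solves:

# Sketch: OLS with two regressors chooses (b0, b1, b2) to minimize the
# sum of squared residuals; np.linalg.lstsq solves that problem directly.
import numpy as np

rng = np.random.default_rng(1)
n = 200
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 5.0 + 1.5 * X1 - 2.0 * X2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), X1, X2])      # include the constant regressor
b0, b1, b2 = np.linalg.lstsq(X, Y, rcond=None)[0]
print(b0, b1, b2)                               # close to 5.0, 1.5, -2.0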

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Example: The California test score data
Regression of TestScore against STR:
TestScore = 698.9 − 2.28 × STR,    R² = 0.051
• From the perspective of the researcher looking for a way to
predict test scores, this relation is not very satisfying because 𝑅2
is only 0.051, that is, the STR explains only 5.1% of the
variation in test scores
• Can this prediction be made more precise by including additional
regressors?

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Example: The California test score data
Now include percent English Learners in the district (𝑃𝑐𝑡𝐸𝐿):
𝑇𝑒𝑠𝑡𝑆𝑐𝑜𝑟𝑒 = 686.0 − 1.10 × 𝑆𝑇𝑅 − 0.65 × 𝑃𝑐𝑡𝐸𝐿
• What happens to the coefficient on STR when we use 2
regressors? The coefficient on STR is half as large as when the
STR was the only regressor
• Why? Because the coefficient on STR in the multiple regression
holds constant (or controls for) PctEL, whereas in the simple
regression PctEL is not held constant (a small worked example follows below)
• Has the quality of this prediction improved? The measures of fit,
discussed shortly, answer this.
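A quick worked example of “holding PctEL constant,” using the estimates above (a tiny Python sketch; any calculator would do):

# Predicted change in test scores from reducing STR by 2 students per
# teacher, holding PctEL constant: Delta TestScore = beta_STR * Delta STR
beta_STR = -1.10
delta_STR = -2.0
print(beta_STR * delta_STR)    # 2.2: test scores are predicted to rise by about 2.2 points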

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Multiple regression in STATA
reg testscr str pctel, robust;

Regression with robust standard errors Number of obs = 420


F( 2, 417) = 223.82
Prob > F = 0.0000
R-squared = 0.4264
Root MSE = 14.464

-----------------------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------+--------------------------------------------------------------------------
str | −1.101296 .4328472 −2.54 0.011 −1.95213 −.2504616
pctel | −.6497768 .0310318 −20.94 0.000 −.710775 −.5887786
_cons | 686.0322 8.728224 78.60 0.000 668.8754 703.189
-----------------------------------------------------------------------------------------

TestScore = 686.0 − 1.10 × STR − 0.65 × PctEL


More on this printout later…
Copyright © 2019 Pearson Education Ltd. All Rights Reserved.
Measures of Fit in Multiple Regression
• SER = std. deviation of ûi (with d.f. correction)
• RMSE = std. deviation of ûi (without d.f. correction)
• R² = fraction of the variance of Y explained by X
• R̄² = adjusted R² = R² with a degrees-of-freedom correction
that adjusts for estimation uncertainty (R̄² < R²)

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


SER and RMSE
As in regression with a single regressor, the SER and the RMSE are
measures of the spread of the Ys around the regression line:
SER = √[ (1/(n − k − 1)) Σi ûi² ]

RMSE = √[ (1/n) Σi ûi² ]

• Σi ûi² = SSR (sum of squared residuals), with the sums over i = 1, …, n


• Using 𝑛 − 𝑘 − 1 rather than 𝑛 is called a degrees of freedom
adjustment (correction), with 𝑘 the number of regressors
• When 𝑛 is large, the effect of the d.f. adjustment is negligible
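A minimal sketch (Python, hypothetical residual values) of how the SER and RMSE are computed from the OLS residuals, with k the number of regressors:

# Sketch: SER and RMSE from a vector of residuals u_hat (hypothetical data)
import numpy as np

def ser_and_rmse(u_hat, k):
    n = len(u_hat)
    ssr = np.sum(u_hat ** 2)           # sum of squared residuals
    ser = np.sqrt(ssr / (n - k - 1))   # with the degrees-of-freedom correction
    rmse = np.sqrt(ssr / n)            # without the correction
    return ser, rmse

u_hat = np.array([1.2, -0.8, 0.3, -0.5, 0.9, -1.1])
print(ser_and_rmse(u_hat, k=2))        # SER > RMSE; the gap shrinks as n grows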
Copyright © 2019 Pearson Education Ltd. All Rights Reserved.
R² and R̄² (adjusted R²)
The R2 is the fraction of the variance of 𝑌𝑖 explained by the
regressors – same definition as in simple regression:
R² = ESS/TSS = 1 − SSR/TSS

where ESS = Σi (Ŷi − Ȳ)², SSR = Σi ûi², TSS = Σi (Yi − Ȳ)²

• The R² always increases (never decreases) whenever a regressor
is added. Why? Adding a regressor can only lower (or leave
unchanged) the SSR.
• That is a bit of a problem for a measure of “fit”.

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


R² and R̄² (adjusted R²)
• The R² increases when a new regressor is added, but this does
not mean that adding a regressor actually improves the fit of the
model.
• The R̄² corrects this problem by “penalizing” us for including
another regressor → R̄² does not necessarily increase when
another regressor is added:

R̄² = 1 − [(n − 1)/(n − k − 1)] × (SSR/TSS)

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


R² and R̄² (adjusted R²)
• (n − 1)/(n − k − 1) is always greater than 1, so R̄² is always less than R².
• Adding a regressor has two opposite effects on the R̄²:
  – SSR falls, which increases R̄².
  – (n − 1)/(n − k − 1) increases, which decreases R̄².
  – Whether the R̄² increases or decreases depends on which of these
    two effects is stronger.
• When n is large, R² and R̄² will be very close.
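An illustrative sketch (Python, simulated data, not the California data set) of the two opposite effects: adding a pure-noise regressor never lowers R², but the adjusted R̄² can fall because of the (n − 1)/(n − k − 1) penalty:

# Sketch: R^2 vs adjusted R^2 when an irrelevant regressor is added
import numpy as np

def r2_and_adj_r2(Y, X):
    n, cols = X.shape
    k = cols - 1                            # number of regressors (X includes a constant column)
    beta = np.linalg.lstsq(X, Y, rcond=None)[0]
    ssr = np.sum((Y - X @ beta) ** 2)
    tss = np.sum((Y - Y.mean()) ** 2)
    return 1 - ssr / tss, 1 - (n - 1) / (n - k - 1) * ssr / tss

rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
noise_reg = rng.normal(size=n)              # unrelated to Y
Y = 1 + 2 * x1 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, noise_reg])
print(r2_and_adj_r2(Y, X_small))
print(r2_and_adj_r2(Y, X_big))              # R^2 rises slightly; adjusted R^2 may fall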

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Measures of Fit for Multiple Regression
Test score example:
(1) TestScore = 698.9 − 2.28 × STR,    R² = 0.05,  SER = 18.6
(2) TestScore = 686.0 − 1.10 × STR − 0.65 × PctEL,    R² = 0.426,  R̄² = 0.424,  SER = 14.5
• What does this tell you about the fit of regression (2) compared
with regression (1)? Including the % of English Learners improves
the fit of the regression.
• Why are the R² and the R̄² so close in (2)? Because n is large and
we used only two regressors.
• The SER falls from 18.6 to 14.5 when the second regressor is
included → the predictions about test scores become more precise.
Copyright © 2019 Pearson Education Ltd. All Rights Reserved.
The Least Squares Assumptions for Causal
Inference in Multiple Regression
Let 𝛽1 , 𝛽2 , … , 𝛽𝑘 be the causal effects
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑋2𝑖 + ⋯ + 𝛽𝑘 𝑋𝑘𝑖 + 𝑢𝑖 , 𝑖 = 1, … , 𝑛
1. The conditional distribution of u given the X's has mean zero,
that is, E(ui | X1i, X2i, …, Xki) = 0
2. (X1i, …, Xki, Yi) are i.i.d. for all i = 1, …, n
3. Large outliers are unlikely: X1, …, Xk, and Y have nonzero
finite fourth moments (finite kurtosis):
E(X1i⁴) < ∞, …, E(Xki⁴) < ∞, E(Yi⁴) < ∞
4. There is no perfect multicollinearity

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Assumption #1: the conditional mean of u
given the included Xs is zero
E(u|X1 = x1,…, Xk = xk) = 0
• This has the same interpretation as in regression with a single regressor.
• Failure of this condition leads to omitted variable bias, specifically, if an
omitted variable:
1. is a determinant of Y and
2. is correlated with an included X

• Then this condition fails (the 1st LSA is violated) and there is OV bias.
• The best solution, if possible, is to include the omitted variable in the
regression.
• A second, related solution is to include a variable that controls for the
omitted variable (discussed shortly).
Copyright © 2019 Pearson Education Ltd. All Rights Reserved.
Assumption #2 and #3
Assumption #2: (X1i,…,Xki,Yi), i =1,…,n, are i.i.d.
This is satisfied automatically if the data are collected by simple
random sampling.

Assumption #3: large outliers are rare (finite fourth moments)


This is the same assumption as we had before for a single
regressor. As in the case of a single regressor, OLS can be sensitive
to large outliers, so you need to check your data (scatterplots!) to
make sure there are no crazy values (typos or coding errors).

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Assumption #4: There is No Perfect
Multicollinearity
The regressors are said to exhibit perfect multicollinearity (or to
be perfectly multicollinear) if one of the regressors is a perfect
linear function of the other regressors.
• Why does perfect multicollinearity make it impossible to
compute the OLS estimator?
➢The coefficient on one of the regressors is the effect of a change
in that regressor, holding the other regressors constant.
➢If two regressors are perfectly multicollinear → the question is
illogical: if 𝑋1𝑖 is a perfect linear function of 𝑋2𝑖 , how to hold
𝑋2𝑖 constant when computing the coefficient on 𝑋1𝑖 ?

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Assumption #4: There is No Perfect
Multicollinearity
Example: Suppose you accidentally include STR twice:
regress testscr str str, robust
Regression with robust standard errors Number of obs = 420
F( 1, 418) = 19.26
Prob > F = 0.0000
R-squared = 0.0512
Root MSE = 18.581
------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------+---------------------------------------------------------------
str | -2.279808 .5194892 -4.39 0.000 -3.300945 -1.258671
str | (dropped)
_cons | 698.933 10.36436 67.44 0.000 678.5602 719.3057
------------------------------------------------------------------------
Depending on how the software handles perfect multicollinearity, it will either
automatically drop one of the two STR regressors, or it will refuse to calculate
the OLS estimates and report an error message
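A small sketch (Python rather than Stata) of why OLS cannot be computed here: with STR entered twice the regressor matrix has linearly dependent columns, so the cross-product matrix X'X is singular and the normal equations have no unique solution; the software must drop a column or stop with an error:

# Sketch: a duplicated regressor makes X'X singular (perfect multicollinearity)
import numpy as np

rng = np.random.default_rng(3)
str_ = 14 + 6 * rng.random(50)                      # hypothetical class sizes
X = np.column_stack([np.ones(50), str_, str_])      # constant, STR, and STR again

print(np.linalg.matrix_rank(X))                     # 2, not 3: columns are dependent
print(np.linalg.det(X.T @ X))                       # (numerically) zero
# Solving the normal equations (X'X) b = X'Y would fail or give meaningless results.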
Copyright © 2019 Pearson Education Ltd. All Rights Reserved.
Assumption #4: There is No Perfect
Multicollinearity
• In the previous regression, β1 is the effect on TestScore of a unit
change in STR, holding STR constant (???)
• We will return to perfect (and imperfect) multicollinearity
shortly, with more examples…
• With these least squares assumptions in hand, we now can derive
the sampling distribution of β̂1, β̂2, …, β̂k

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


The Sampling Distribution of the OLS
Estimator
Under the four Least Squares Assumptions:
• The sampling distribution of β̂1 has mean β1 (E(β̂1) = β1)
• var(β̂1) is inversely proportional to n
• Other than its mean and variance, the exact distribution of β̂1 is very
complicated, but for large n, β̂1 is well approximated by a normal distribution
• These statements hold for all the OLS estimators (β̂0, β̂1, …, β̂k)

Conceptually, there is nothing new here!
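A Monte Carlo sketch (Python, simulated data; illustrative only, with made-up coefficients) of the two statements above: β̂1 is centered at β1, and its variance shrinks roughly in proportion to 1/n:

# Sketch: simulate the sampling distribution of beta1_hat for two sample sizes
import numpy as np

rng = np.random.default_rng(4)

def simulate_beta1(n, reps=2000):
    est = np.empty(reps)
    for r in range(reps):
        X1 = rng.normal(size=n)
        X2 = rng.normal(size=n)
        Y = 1 + 2 * X1 + 3 * X2 + rng.normal(size=n)
        Xmat = np.column_stack([np.ones(n), X1, X2])
        est[r] = np.linalg.lstsq(Xmat, Y, rcond=None)[0][1]
    return est

for n in (100, 400):
    b = simulate_beta1(n)
    print(f"n={n}: mean={b.mean():.3f}, var={b.var():.5f}")
# The mean is about 2 (the true beta1) at both sample sizes; the variance at
# n = 400 is roughly one quarter of the variance at n = 100 (proportional to 1/n).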

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Perfect Multicollinearity
Perfect multicollinearity is when one of the regressors is an exact
linear function of the other regressors
• To discuss the perfect multicollinearity, we will examine three
additional hypothetical regressions
• In each regression, a third regressor is added to the regression of
TestScore on STR and PctEL

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Perfect Multicollinearity (Example 1)
Let FracEL be the fraction of English learners (varies between 0 and 1)
• If FracEL is included as a third regressor in addition to the STR and PctEL,
the regressors would be perfectly multicollinear
• The reason is that PctEL is the percentage of English learners, so that:
𝑃𝑐𝑡𝐸𝐿𝑖 = 100 × 𝐹𝑟𝑎𝑐𝐸𝐿𝑖 for every district
• Thus, one of the regressors (PctEL) can be written as a perfect linear
function of another regressor (FracEL)
• As a result, it is impossible to compute the OLS estimates of this regression
because of this perfect multicollinearity
• OLS fails because we are asking: what is the effect of a unit change in the
percentage of English learners, holding constant the fraction of English
learners →since the two move together in a perfect linear relationship, the
question makes no sense
Copyright © 2019 Pearson Education Ltd. All Rights Reserved.
Perfect Multicollinearity (Example 2)
Let NVS be a binary variable that equals 1 if STR ≥ 12 and
equals 0 if STR < 12

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Perfect Multicollinearity (Example 2)
• If NVS is included as a third regressor in addition to the STR and PctEL,
the regressors would be perfectly multicollinear
• The reason is that there are no districts in our data set with a 𝑆𝑇𝑅 < 12
(the smallest value of STR is 14)
• Thus, NVS is equal to 1 for all i
• Note that the linear regression model with an intercept can equivalently
be thought of as including a regressor 𝑋0𝑖 that equals 1 for all 𝑖
• Thus, 𝑁𝑉𝑆𝑖 = 1 × 𝑋0𝑖 for all 𝑖 → 𝑁𝑉𝑆 can be written as a perfect linear
function of another regressor (𝑋0𝑖 )
• As a result, it is impossible to compute the OLS estimates of this
regression because of this perfect multicollinearity
Copyright © 2019 Pearson Education Ltd. All Rights Reserved.
Perfect Multicollinearity (Example 2)
This example illustrates two important points about perfect
multicollinearity:
1. When the regression includes an intercept, then one of the
regressors that can be implicated in perfect multicollinearity is
the constant regressor 𝑋0𝑖
2. Perfect multicollinearity is a statement about the data set you
have on hand (while it is possible to imagine a school district
with a 𝑆𝑇𝑅 < 12, there are no such districts in our data set, so
we cannot analyze them in our regression)

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Perfect Multicollinearity (Example 3)
Let PctES be the percentage of English Speakers (the percentage
of students who are not English learners)
• If PctES is included as a third regressor in addition to the STR
and PctEL, the regressors would be perfectly multicollinear
• The reason is that PctEL is the percentage of English learners, so
that: 𝑃𝑐𝑡𝐸𝐿𝑖 = 100 − 𝑃𝑐𝑡𝐸𝑆𝑖 for every district
• Thus, one of the regressors (PctEL) can be written as a perfect
linear function of another regressor (PctES)
• As a result, it is impossible to compute the OLS estimates of this
regression, because we are asking: what is the effect of a unit
change in the percentage of English learners, holding constant
the percentage of English speakers
Copyright © 2019 Pearson Education Ltd. All Rights Reserved.
The dummy variable trap
Suppose that a set of multiple binary (dummy) variables are used as
regressors
• Suppose we divide the school districts into two categories: Small class size
(𝑆𝑇𝑅 ≤ 20) and large class size (𝑆𝑇𝑅 > 20)
• Each district falls into one (and only one) category
• Let these dummy variables be:
- Smalli, which equals 1 if STR ≤ 20 and 0 otherwise
- Largei, which equals 1 if STR > 20 and 0 otherwise

• If we include the two binary variables in the regression along with a
constant, the regressors will be perfectly multicollinear because each
district belongs to one and only one category: Smalli + Largei = 1 = X0i
• This situation is called the dummy variable trap
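A short sketch (Python; hypothetical STR values) of the dummy variable trap and one of its fixes, dropping one of the dummies:

# Sketch: Small + Large = 1 = the constant regressor, so including both
# dummies and an intercept is perfectly multicollinear; dropping one fixes it.
import numpy as np

rng = np.random.default_rng(5)
str_ = 14 + 12 * rng.random(100)            # hypothetical student-teacher ratios
small = (str_ <= 20).astype(float)
large = (str_ > 20).astype(float)

X_trap = np.column_stack([np.ones(100), small, large])
X_ok = np.column_stack([np.ones(100), small])   # drop one dummy (Large is the omitted category)

print(np.linalg.matrix_rank(X_trap))        # 2 < 3 columns: perfect multicollinearity
print(np.linalg.matrix_rank(X_ok))          # 2 = number of columns: no problem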
Copyright © 2019 Pearson Education Ltd. All Rights Reserved.
The dummy variable trap
• In general, when there are G binary variables that are included as
regressors, if:
- Each observation falls into one and only one category, and
- There is an intercept in the regression

➔The regression will fail because of perfect multicollinearity

• Solutions to the dummy variable trap:


- Exclude one of the binary variables from the regression
- Exclude the intercept from the regression

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Solutions to Perfect Multicollinearity
• Perfect multicollinearity usually reflects a mistake in the
definitions of the regressors, or an oddity in the data
• Sometimes it is easy to spot the mistake, but sometimes it is not
• However, if you have perfect multicollinearity, your statistical
software will let you know – either by crashing or giving an error
message or by “dropping” one of the variables arbitrarily
• The solution to perfect multicollinearity is to modify your list of
regressors so that you no longer have perfect multicollinearity

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Imperfect multicollinearity
Imperfect and perfect multicollinearity are quite different despite
the similarity of the names.
• Imperfect multicollinearity occurs when two or more regressors
are very highly correlated

• Why the term “multicollinearity”? If two regressors are very
highly correlated, then their scatterplot will pretty much look like
a straight line – they are “co-linear” – but unless the correlation
is exactly ±1, that collinearity is imperfect
• Imperfect multicollinearity does not pose any problems for the
theory of the OLS estimators

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Imperfect multicollinearity
Imperfect multicollinearity implies that one or more of the regression
coefficients will be imprecisely estimated
• The idea: the coefficient on X1 is the effect of X1 holding X2 constant;
but if X1 and X2 are highly correlated, there is very little variation in X1
once X2 is held constant – so the data don’t contain much information
about what happens when X1 changes but X2 doesn’t. If so, the
variance of the OLS estimator of the coefficient on X1 will be large.
• Imperfect multicollinearity results in large standard errors for one or
more of the OLS coefficients
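A simulation sketch (Python, illustrative parameter values) of this point: as corr(X1, X2) approaches 1, the OLS estimator of the coefficient on X1 stays centered at the truth, but its sampling spread (its standard error) grows sharply:

# Sketch: the spread of beta1_hat grows as the regressors become more collinear
import numpy as np

rng = np.random.default_rng(6)

def sd_of_beta1(rho, n=200, reps=2000):
    est = np.empty(reps)
    for r in range(reps):
        X1 = rng.normal(size=n)
        X2 = rho * X1 + np.sqrt(1 - rho**2) * rng.normal(size=n)   # corr(X1, X2) = rho
        Y = 1 + 2 * X1 + 3 * X2 + rng.normal(size=n)
        Xmat = np.column_stack([np.ones(n), X1, X2])
        est[r] = np.linalg.lstsq(Xmat, Y, rcond=None)[0][1]
    return est.std()

for rho in (0.0, 0.9, 0.99):
    print(f"corr = {rho}: sd of beta1_hat = {sd_of_beta1(rho):.3f}")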

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Control variables
• Distinction between a regressor for which we want to estimate a
causal effect (that is a variable of interest) and control variables
• A control variable is not the object of interest in the study
• A control variable is a regressor included to hold constant factors
that, if neglected, could lead the estimated causal effect of
interest to suffer from omitted variable bias
• This distinction leads to a modification in the 1st LSA, in which
some of the variables are control variables
• If this alternative assumption (discussed shortly) holds, the OLS
estimator of the variable of interest is unbiased
• However, the OLS coefficients on control variables are, in
general, biased and do not have a causal interpretation
Copyright © 2019 Pearson Education Ltd. All Rights Reserved.
Control variables (test score example)
• Consider the potential omitted variable bias arising from
omitting "outside learning opportunities" from a test score
regression
• Outside learning opportunities is a broad concept that is difficult
to measure
• However, those opportunities are correlated with the student’s
economic background, which can be measured
• Thus, a measure of economic background can be included in a
test score regression to control for omitted income-related
determinants of test scores, like outside learning opportunities

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Control variables (test score example)
• Therefore, we augment the regression of test scores on STR and
PctEL with the percentage of students receiving a free or
subsidized school lunch (LchPct)
• Students are eligible for this program if their family income is
less than a certain threshold
• Thus, LchPct measures the fraction of economically
disadvantaged children in the district

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Control variables (test score example)
• If we could run an experiment, we would randomly assign
students (and teachers) to different sized classes
• Then STRi would be independent of all the other factors that go
into ui, so E(ui|STRi) = 0
• Thus, the OLS slope estimator in the regression of TestScorei on
STRi would be an unbiased estimator of the desired causal effect

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Control variables (test score example)
But with observational data, ui depends on additional factors
(parental involvement, knowledge of English, access in the
community to learning opportunities outside school, etc.)
• If you can observe those factors (e.g. PctEL), then include them
in the regression
• But usually, you can’t observe all these omitted causal factors
(e.g. parental involvement in homework) → In this case, you can
include “control variables” which are correlated with these
omitted causal factors, but which themselves are not causal

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Control variables: an example from the
California test score data
TestScore = 700.2 − 1.00 × STR − 0.122 × PctEL − 0.547 × LchPct,    R² = 0.773
(standard errors: 5.6, 0.27, 0.033, 0.024)
PctEL = percent English Learners in the school district
LchPct = percent of students receiving a free/subsidized lunch
(only students from low-income families are eligible)
• Including the control variable LchPct does not substantially
change any conclusions about the class size effect → the
coefficient on STR changed slightly (from -1.10 to -1)
• The coefficient of LchPct is large → the difference in test scores
between a district with LchPct = 0% and one with LchPct = 50%
is estimated to be 27.35 points
Copyright © 2019 Pearson Education Ltd. All Rights Reserved.
Control variables: an example from the
California test score data
TestScore = 700.2 − 1.00 × STR − 0.122 × PctEL − 0.547 × LchPct,    R² = 0.773
(standard errors: 5.6, 0.27, 0.033, 0.024)
• STR is the variable of interest
• PctEL probably has a direct causal effect (school is tougher if you are
learning English!). But it is also a control variable: immigrant
communities tend to be less affluent and often have fewer outside
learning opportunities, and PctEL is correlated with those omitted
causal variables → PctEL is both a possible causal variable and a
control variable
• LchPct might have a causal effect (eating lunch helps learning); it also
is correlated with and controls for income-related outside learning
opportunities → LchPct is both a possible causal variable and a
control variable
Copyright © 2019 Pearson Education Ltd. All Rights Reserved.
Control variables: an example from the
California test score data
TestScore = 700.2 − 1.00 × STR − 0.122 × PctEL − 0.547 × LchPct,    R² = 0.773
(standard errors: 5.6, 0.27, 0.033, 0.024)
• Does the coefficient on LchPct have a causal interpretation?
• Suppose that the superintendent proposed eliminating the
reduced-price lunch program so that, for her district, LchPct
would immediately drop to 0
• Would eliminating the lunch program boost her district’s test
scores?
• Common sense suggests that the answer is no
• By leaving some students hungry, eliminating the reduced-price
lunch program might well have the opposite effect
Copyright © 2019 Pearson Education Ltd. All Rights Reserved.
Control variables: an example from the
California test score data
1. Three interchangeable statements about what makes an
effective control variable:
i. An effective control variable is one which, when included in
the regression, makes the error term uncorrelated with the
variable of interest
ii. Holding constant the control variable(s), the variable of
interest is “as if ” randomly assigned
iii. Among individuals (entities) with the same value of the
control variable(s), the variable of interest is uncorrelated with
the omitted determinants of Y

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Control variables: an example from the
California test score data
2. Control variables need not be causal, and their coefficients
generally do not have a causal interpretation. For example:

TestScore = 700.2 − 1.00 × STR − 0.122 × PctEL − 0.547 × LchPct,    R² = 0.773
(standard errors: 5.6, 0.27, 0.033, 0.024)

• Does the coefficient on LchPct have a causal interpretation? If
so, then we should be able to boost test scores (by a lot! Do the
math!) by simply eliminating the school lunch program, so that
LchPct = 0! (Eliminating the school lunch program has a well-
defined causal effect: we could construct a randomized
experiment to measure the causal effect of this intervention)

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Control variables and conditional mean
independence
• To distinguish between variables of interest and control
variables, we modify the notation of the linear regression
model to include k variables of interest, denoted by X, and r
control variables, denoted by W

𝑌𝑖 = 𝛽0 + 𝛽1 𝑋1𝑖 + ⋯ + 𝛽𝑘 𝑋𝑘𝑖 + 𝛽𝑘+1 𝑊1𝑖 + ⋯ + 𝛽𝑘+𝑟 𝑊𝑟𝑖 + 𝑢𝑖


• The coefficients on the X's are the causal effects of interest
• The reason for including control variables in multiple
regression is to make the variable of interest no longer
correlated with the error term (once the control variables are
held constant)
Copyright © 2019 Pearson Education Ltd. All Rights Reserved.
Control variables and conditional mean
independence
• Because a control variable is correlated with an omitted causal
factor, LSA #1 (E(ui|X1i,…,Xki) = 0) should be replaced
For example, LchPct is correlated with unmeasured determinants of test
scores such as "outside learning opportunities", so its coefficient is subject
to OV bias. But the fact that LchPct is correlated with these omitted
variables is precisely what makes it a good control variable!

• If LSA #1 doesn’t hold, then what does?


• We replace LSA #1 with an assumption called conditional
mean independence

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Control variables and conditional mean
independence
• Conditional mean independence requires that the conditional
expectation of 𝑢𝑖 given the variable of interest and the control
variables does not depend on the variable of interest (however,
it can depend on control variables)
• Wi is an effective control variable if conditional mean
independence holds: E(ui | Xi, Wi) = E(ui | Wi)
• This assumption (that is relevant for control variables) replaces
the 1st LSA assumption

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Control variables and conditional mean
independence
• The idea of conditional mean independence is that once you
control for the W's, the X's can be treated as if they were
randomly assigned, because controlling for W makes the X's
uncorrelated with the error term
• The control variables, however, remain correlated with the
error term → the coefficients on the control variables are
subject to omitted variable bias and do not have a causal
interpretation

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


LSA for causal inference in the multiple
regression with control variables
Yi = β0 + β1X1i + … + βkXki + βk+1W1i + … + βk+rWri + ui
1. ui has a conditional mean that does not depend on the X's given the W's:
E(ui | X1i, …, Xki, W1i, …, Wri) = E(ui | W1i, …, Wri)
(conditional mean independence)
2. (X1i, …, Xki, W1i, …, Wri, Yi), i = 1, …, n, are i.i.d.
3. Large outliers are unlikely
4. There is no perfect multicollinearity

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Control variables: Summary
Consider the regression model,
Y = β0 + β1X + β2W + u
where X is the variable of interest, β1 is its causal effect, and W is an
effective control variable so that conditional mean independence
holds: E(ui|Xi, Wi) = E(ui|Wi)
In addition, suppose that LSA #2, #3 and #4 hold, then:
1. β̂1 has a causal interpretation
2. β̂1 is unbiased
3. In general, β̂2:
   - does not estimate a causal effect
   - is biased (this bias does not pose a problem, because we are interested in the
     coefficients on the X's, not on the W's)
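A simulation sketch (Python, made-up coefficients; not from the textbook) of this summary: when conditional mean independence holds, the coefficient on the variable of interest X is unbiased for its causal effect, while the coefficient on the control variable W absorbs the omitted factor and is biased away from W's causal effect (set to zero here):

# Sketch: X is "as-if randomly assigned" given W; W proxies for an omitted factor
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

omitted = rng.normal(size=n)                # e.g. outside learning opportunities
W = omitted + rng.normal(size=n)            # control variable, correlated with it
X = 0.5 * W + rng.normal(size=n)            # X varies randomly given W -> E(u|X,W) = E(u|W)
Y = 1 + 2 * X + 0 * W + 3 * omitted + rng.normal(size=n)   # true causal effect of W is 0

Xmat = np.column_stack([np.ones(n), X, W])
b = np.linalg.lstsq(Xmat, Y, rcond=None)[0]
print(f"coefficient on X: {b[1]:.2f}")      # about 2: the causal effect of X
print(f"coefficient on W: {b[2]:.2f}")      # about 1.5, not 0: biased, not causal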
Copyright © 2019 Pearson Education Ltd. All Rights Reserved.
