
Lecture 14: Simple Linear Regression

Ordinary Least Squares (OLS)

Consider the following simple linear regression model

Yi = α + βXi + εi
where, for each unit i,
• Yi is the dependent variable (response).

• Xi is the independent variable (predictor).

• εi is the error between the observed Yi and what the model predicts.

This model describes the relation between Xi and Yi using an intercept and a slope parameter.
The error εi = Yi − α − βXi is also known as the residual because it measures the amount of
variation in Yi not explained by the model.

We saw last class that there exist α̂ and β̂ that minimize the sum of squared errors Σ εi². Specifically, we wish to find α̂ and β̂ such that

\[
\sum_{i=1}^{n} \big( Y_i - (\hat\alpha + \hat\beta X_i) \big)^2
\]

is the minimum. Using calculus, it turns out that

\[
\hat\alpha = \bar Y - \hat\beta \bar X, \qquad
\hat\beta = \frac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}{\sum_i (X_i - \bar X)^2}.
\]
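
As a quick numerical illustration, the closed-form estimates can be computed directly; the sketch below uses numpy on simulated data (the variable names and simulated values are illustrative, not part of the lecture) and checks the result against np.polyfit.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 1.5 * x + rng.normal(0, 1, size=50)   # true alpha = 2, beta = 1.5

# Closed-form OLS estimates
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

# np.polyfit(x, y, 1) returns (slope, intercept) and should agree
slope, intercept = np.polyfit(x, y, 1)
print(alpha_hat, beta_hat)
print(intercept, slope)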

OLS and Transformation

If we center the predictor, X̃i = Xi − X̄, then X̃i has mean zero. Therefore,

\[
\hat\alpha^{*} = \bar Y, \qquad
\hat\beta^{*} = \frac{\sum_i \tilde X_i (Y_i - \bar Y)}{\sum_i \tilde X_i^{2}}.
\]

Horizontally shifting the values of Xi leaves the slope unchanged, β̂* = β̂, but the intercept becomes the overall average of Yi.

Consider the linear transformation Zi = a + bXi with Z¯ = a + bX¯ . Consider the linear model

Yi = α∗ + β∗Zi + εi

Let αˆ and βˆ be the estimates using X as the predictor. Then the intercept and slope using Z
as the predictor are
\[
\begin{aligned}
\hat\beta^{*}
&= \frac{\sum_i (Z_i - \bar Z)(Y_i - \bar Y)}{\sum_i (Z_i - \bar Z)^2} \\
&= \frac{\sum_i (a + bX_i - a - b\bar X)(Y_i - \bar Y)}{\sum_i (a + bX_i - a - b\bar X)^2}
 = \frac{\sum_i (bX_i - b\bar X)(Y_i - \bar Y)}{\sum_i (bX_i - b\bar X)^2} \\
&= \frac{b \sum_i (X_i - \bar X)(Y_i - \bar Y)}{b^2 \sum_i (X_i - \bar X)^2}
 = \frac{1}{b}\,\hat\beta
\end{aligned}
\]

and

\[
\hat\alpha^{*} = \bar Y - \hat\beta^{*} \bar Z
 = \bar Y - \frac{\hat\beta}{b}(a + b\bar X)
 = \bar Y - \hat\beta \bar X - \frac{a\hat\beta}{b}
 = \hat\alpha - \frac{a}{b}\hat\beta .
\]

We can get the same results by substituting Xi = (Zi − a)/b into the linear model:

\[
\begin{aligned}
Y_i &= \alpha + \beta X_i + \varepsilon_i \\
Y_i &= \alpha + \beta\,\frac{Z_i - a}{b} + \varepsilon_i \\
Y_i &= \alpha - \frac{a\beta}{b} + \frac{\beta}{b} Z_i + \varepsilon_i \\
Y_i &= \alpha^{*} + \beta^{*} Z_i + \varepsilon_i
\end{aligned}
\]
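
A short numerical check of these transformation identities (a sketch; the constants a and b and the simulated data are arbitrary choices, not from the lecture):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(5, 2, size=100)
y = 1.0 + 0.8 * x + rng.normal(0, 0.5, size=100)

def ols(x, y):
    """Return (alpha_hat, beta_hat) for simple linear regression of y on x."""
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    return y.mean() - b * x.mean(), b

a, b = 3.0, 2.0
alpha_x, beta_x = ols(x, y)
alpha_z, beta_z = ols(a + b * x, y)                    # regress on Z = a + bX

print(np.isclose(beta_z, beta_x / b))                  # slope scales by 1/b
print(np.isclose(alpha_z, alpha_x - a * beta_x / b))   # intercept shifts by -a*beta_hat/b

# Centering: X~ = X - Xbar leaves the slope unchanged and the intercept becomes Ybar
alpha_c, beta_c = ols(x - x.mean(), y)
print(np.isclose(beta_c, beta_x), np.isclose(alpha_c, y.mean()))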

Properties of OLS

Given the estimates α̂ and β̂, we can define (1) the predicted (fitted) value Ŷi and (2) the estimated residual ε̂i:

\[
\hat Y_i = \hat\alpha + \hat\beta X_i, \qquad
\hat\varepsilon_i = Y_i - \hat Y_i = Y_i - \hat\alpha - \hat\beta X_i .
\]

The least squares estimates have the following properties.


1. Σ ε̂i = 0.

\[
\begin{aligned}
\sum_{i=1}^{n} \hat\varepsilon_i
&= \sum_{i=1}^{n} (Y_i - \hat\alpha - \hat\beta X_i)
 = \sum_{i=1}^{n} Y_i - n\hat\alpha - \hat\beta \sum_{i=1}^{n} X_i \\
&= n\bar Y - n\hat\alpha - n\hat\beta \bar X
 = n(\bar Y - \hat\alpha - \hat\beta \bar X) \\
&= n\big(\bar Y - (\bar Y - \hat\beta \bar X) - \hat\beta \bar X\big) = 0
\end{aligned}
\]
2. The point (X̄, Ȳ) is always on the fitted line.

\[
\hat\alpha + \hat\beta \bar X = (\bar Y - \hat\beta \bar X) + \hat\beta \bar X = \bar Y
\]

3. β̂ = rXY × sY/sX, where sX and sY are the sample standard deviations of X and Y, and rXY is the correlation between X and Y. Note that the sample correlation is given by:

\[
r_{XY} = \frac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}{(n-1)\, s_X s_Y}.
\]

Therefore,

\[
\hat\beta
= \frac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}{\sum_i (X_i - \bar X)^2}
= \frac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}{(n-1)\, s_X s_Y}
  \times \frac{(n-1)\, s_X s_Y}{(n-1)\, s_X^2}
= r_{XY} \times \frac{s_Y}{s_X}.
\]

4. β̂ is a weighted sum of the Yi.

\[
\begin{aligned}
\hat\beta
&= \frac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}{\sum_i (X_i - \bar X)^2} \\
&= \frac{\sum_i (X_i - \bar X) Y_i - \bar Y \sum_i (X_i - \bar X)}{\sum_i (X_i - \bar X)^2} \\
&= \frac{\sum_i (X_i - \bar X) Y_i}{\sum_i (X_i - \bar X)^2}
 \qquad \text{since } \sum_i (X_i - \bar X) = 0 \\
&= \sum_i w_i Y_i, \qquad w_i = \frac{X_i - \bar X}{\sum_j (X_j - \bar X)^2}
\end{aligned}
\]
5. α̂ is a weighted sum of the Yi.

\[
\hat\alpha = \bar Y - \hat\beta \bar X
= \sum_i \frac{1}{n} Y_i - \bar X \sum_i w_i Y_i
= \sum_i v_i Y_i, \qquad
v_i = \frac{1}{n} - \frac{\bar X (X_i - \bar X)}{\sum_j (X_j - \bar X)^2}
\]

6. Σ Xi ε̂i = 0.
\[
\begin{aligned}
\sum_i X_i \hat\varepsilon_i
&= \sum_i (Y_i - \hat Y_i) X_i
 = \sum_i (Y_i - \hat\alpha - \hat\beta X_i) X_i \\
&= \sum_i \big(Y_i - (\bar Y - \hat\beta \bar X) - \hat\beta X_i\big) X_i \\
&= \sum_i \big((Y_i - \bar Y) - \hat\beta (X_i - \bar X)\big) X_i \\
&= \sum_i X_i (Y_i - \bar Y) - \hat\beta \sum_i X_i (X_i - \bar X) \\
&= \hat\beta \sum_i (X_i - \bar X)^2 - \hat\beta \sum_i X_i (X_i - \bar X) \\
&= \hat\beta \sum_i \big( (X_i - \bar X)^2 - X_i (X_i - \bar X) \big) \\
&= \hat\beta \sum_i \big( X_i^2 - 2 X_i \bar X + \bar X^2 - X_i^2 + X_i \bar X \big) \\
&= \hat\beta \sum_i \big( \bar X^2 - X_i \bar X \big)
 = \hat\beta \bar X \sum_i (\bar X - X_i) = 0
\end{aligned}
\]

Note that

\[
\hat\beta = \frac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}{\sum_i (X_i - \bar X)^2}
= \frac{\sum_i X_i (Y_i - \bar Y)}{\sum_i (X_i - \bar X)^2},
\]

which justifies replacing Σ Xi(Yi − Ȳ) with β̂ Σ (Xi − X̄)² in the derivation above.

7. Partition of total variation.

\[
(Y_i - \bar Y) = Y_i - \bar Y + \hat Y_i - \hat Y_i = (\hat Y_i - \bar Y) + (Y_i - \hat Y_i)
\]

• (Yi − Ȳ): total deviation of the observed Yi from a null model (β = 0).

• (Ŷi − Ȳ): deviation of the fitted value Ŷi (β ≠ 0) from a null model (β = 0).

• (Yi − Ŷi): deviation of the observed value from the fitted value.

Total Variations (Sum of Squares)


• Σ (Yi − Ȳ)²: total variation; equals (n − 1) × V(Y).

• Σ (Ŷi − Ȳ)²: variation explained by the model.

• Σ (Yi − Ŷi)²: variation not explained by the model; the residual sum of squares (SSE), Σ ε̂i².

Partition of Total Variation


\[
\sum_i (Y_i - \bar Y)^2 = \sum_i (\hat Y_i - \bar Y)^2 + \sum_i (Y_i - \hat Y_i)^2
\]

Proof:

\[
\begin{aligned}
\sum_i (Y_i - \bar Y)^2
&= \sum_i \big[ (\hat Y_i - \bar Y) + (Y_i - \hat Y_i) \big]^2 \\
&= \sum_i (\hat Y_i - \bar Y)^2 + \sum_i (Y_i - \hat Y_i)^2
 + 2 \sum_i (\hat Y_i - \bar Y)(Y_i - \hat Y_i) \\
&= \sum_i (\hat Y_i - \bar Y)^2 + \sum_i (Y_i - \hat Y_i)^2
\end{aligned}
\]

where

\[
\begin{aligned}
\sum_i (\hat Y_i - \bar Y)(Y_i - \hat Y_i)
&= \sum_i \big[ (\hat\alpha + \hat\beta X_i) - (\hat\alpha + \hat\beta \bar X) \big] \hat\varepsilon_i \\
&= \sum_i \big[ \hat\beta (X_i - \bar X) \big] \hat\varepsilon_i
 = \hat\beta \sum_i X_i \hat\varepsilon_i - \hat\beta \bar X \sum_i \hat\varepsilon_i = 0 + 0 = 0 .
\end{aligned}
\]

8. Alternative formula for Σ ε̂i².

\[
\begin{aligned}
\sum_i (Y_i - \hat Y_i)^2
&= \sum_i (Y_i - \bar Y)^2 - \sum_i (\hat Y_i - \bar Y)^2 \\
&= \sum_i (Y_i - \bar Y)^2 - \sum_i (\hat\alpha + \hat\beta X_i - \bar Y)^2 \\
&= \sum_i (Y_i - \bar Y)^2 - \sum_i (\bar Y - \hat\beta \bar X + \hat\beta X_i - \bar Y)^2 \\
&= \sum_i (Y_i - \bar Y)^2 - \sum_i \big[ \hat\beta (X_i - \bar X) \big]^2 \\
&= \sum_i (Y_i - \bar Y)^2 - \hat\beta^2 \sum_i (X_i - \bar X)^2
\end{aligned}
\]

or

\[
\sum_i (Y_i - \hat Y_i)^2
= \sum_i (Y_i - \bar Y)^2
- \frac{\big[ \sum_i (X_i - \bar X)(Y_i - \bar Y) \big]^2}{\sum_i (X_i - \bar X)^2}.
\]

9. Variation Explained by the Model:

An easy way to calculate Σ (Ŷi − Ȳ)² is

\[
\begin{aligned}
\sum_i (\hat Y_i - \bar Y)^2
&= \sum_i (Y_i - \bar Y)^2 - \sum_i (Y_i - \hat Y_i)^2 \\
&= \sum_i (Y_i - \bar Y)^2
 - \left[ \sum_i (Y_i - \bar Y)^2
 - \frac{\big[ \sum_i (X_i - \bar X)(Y_i - \bar Y) \big]^2}{\sum_i (X_i - \bar X)^2} \right] \\
&= \frac{\big[ \sum_i (X_i - \bar X)(Y_i - \bar Y) \big]^2}{\sum_i (X_i - \bar X)^2}
\end{aligned}
\]

10. Coefficient of Determination r²:

\[
r^2 = \frac{\text{variation explained by the model}}{\text{total variation}}
= \frac{\sum_i (\hat Y_i - \bar Y)^2}{\sum_i (Y_i - \bar Y)^2}
\]

(a) r² is related to the residual variance.

\[
\begin{aligned}
\sum_i (Y_i - \hat Y_i)^2
&= \sum_i (Y_i - \bar Y)^2 - \sum_i (\hat Y_i - \bar Y)^2 \\
&= \sum_i (Y_i - \bar Y)^2
 \times \left[ 1 - \frac{\sum_i (\hat Y_i - \bar Y)^2}{\sum_i (Y_i - \bar Y)^2} \right] \\
&= (1 - r^2) \times \sum_i (Y_i - \bar Y)^2
\end{aligned}
\]

Dividing the left-hand side by n − 2 and the right-hand side by n − 1, for large n,

\[
s^2 \approx (1 - r^2)\,\mathrm{Var}(Y).
\]

(b) r² is the square of the sample correlation between X and Y.

\[
\begin{aligned}
r^2
&= \frac{\sum_i (\hat Y_i - \bar Y)^2}{\sum_i (Y_i - \bar Y)^2}
 = \frac{\big[ \sum_i (X_i - \bar X)(Y_i - \bar Y) \big]^2}
        {\sum_i (X_i - \bar X)^2 \,\sum_i (Y_i - \bar Y)^2} \\
&= \left[ \frac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}
               {\sqrt{\sum_i (X_i - \bar X)^2}\,\sqrt{\sum_i (Y_i - \bar Y)^2}} \right]^2
 = r_{XY}^2
\end{aligned}
\]

(All of these properties are easy to verify numerically; see the sketch after this list.)
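
The following is a minimal sketch on simulated data (illustrative values, not from the lecture) that checks several of the properties above numerically: the residual identities (1, 2, 6), the slope-correlation relation (3), the partition of variation (7), and r² = rXY² (10b).

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=80)
y = 0.5 - 1.2 * x + rng.normal(0, 1, size=80)

beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()
fitted = alpha + beta * x
resid = y - fitted

print(np.isclose(resid.sum(), 0))                                   # property 1
print(np.isclose(alpha + beta * x.mean(), y.mean()))                # property 2: (Xbar, Ybar) on line
print(np.isclose(beta, np.corrcoef(x, y)[0, 1] * y.std(ddof=1) / x.std(ddof=1)))  # property 3
print(np.isclose((x * resid).sum(), 0))                             # property 6

sst = np.sum((y - y.mean()) ** 2)
ssr = np.sum((fitted - y.mean()) ** 2)
sse = np.sum(resid ** 2)
print(np.isclose(sst, ssr + sse))                                   # property 7
print(np.isclose(ssr / sst, np.corrcoef(x, y)[0, 1] ** 2))          # property 10(b)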

Statistical Inference for OLS Estimates

The estimates α̂ and β̂ can be computed for any given sample of data. We also need to consider their sampling distributions, because each sample of (Xi, Yi) pairs will result in different estimates of α and β.
Consider the following distributional assumption on the error:

\[
Y_i = \alpha + \beta X_i + \varepsilon_i, \qquad
\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2).
\]

The above is now a statistical model that describes the distribution of Yi given Xi. Specifically, we assume the observed Yi is error-prone but centered around the linear model for each value of Xi:

\[
Y_i \overset{ind}{\sim} N(\alpha + \beta X_i,\ \sigma^2).
\]

The error distribution results in the following assumptions.


1. Homoscedasticity: variance of Yi is the same for any Xi.

V (Yi) = V (α + βXi + εi) = V (εi) = σ2

2. Linearity: the mean of Yi is linear with respect to Xi

E(Yi) = E(α + βXi + εi) = α + βXi + E(εi) = α + βXi

3. Independence: Yi are independent of each other.

Cov(Yi, Yj) = Cov(α + βXi + εi, α + βXj + εj) = Cov(εi, εj) = 0

We can now investigate the bias and variance of OLS estimators αˆ and βˆ. We assume that Xi
are known constants.
• β̂ is unbiased.

\[
\begin{aligned}
E[\hat\beta]
&= E\Big[ \sum_i w_i Y_i \Big], \qquad
 w_i = \frac{X_i - \bar X}{\sum_j (X_j - \bar X)^2} \\
&= \sum_i w_i E[Y_i] \\
&= \sum_i w_i (\alpha + \beta X_i) \\
&= \alpha \sum_i w_i + \beta \sum_i w_i X_i = \beta
\end{aligned}
\]

Because

\[
\sum_i w_i = \frac{1}{\sum_j (X_j - \bar X)^2} \sum_i (X_i - \bar X) = 0
\]

and

\[
\begin{aligned}
\sum_i w_i X_i
&= \frac{1}{\sum_j (X_j - \bar X)^2} \sum_i X_i (X_i - \bar X) \\
&= \frac{\sum_i X_i (X_i - \bar X)}
        {\sum_i X_i (X_i - \bar X) - \bar X \sum_i (X_i - \bar X)}
 = \frac{\sum_i X_i (X_i - \bar X)}{\sum_i X_i (X_i - \bar X) - 0} = 1 .
\end{aligned}
\]

• α̂ is unbiased.

\[
\begin{aligned}
E[\hat\alpha]
&= E[\bar Y - \hat\beta \bar X] = E[\bar Y] - \bar X E[\hat\beta] \\
&= E\Big[ \frac{1}{n} \sum_i Y_i \Big] - \beta \bar X
 = \frac{1}{n} \sum_i E[\alpha + \beta X_i] - \beta \bar X \\
&= \frac{1}{n} (n\alpha + n\beta \bar X) - \beta \bar X \\
&= \alpha + \beta \bar X - \beta \bar X = \alpha
\end{aligned}
\]

• β̂ has variance σ²/Σ(Xi − X̄)² = σ²/((n − 1)s_x²), where σ² is the residual variance and s_x² is the sample variance of x1, ..., xn.

\[
\begin{aligned}
V[\hat\beta]
&= V\Big[ \sum_i w_i Y_i \Big], \qquad
 w_i = \frac{X_i - \bar X}{\sum_j (X_j - \bar X)^2} \\
&= \sum_i w_i^2 V[Y_i] = \sigma^2 \sum_i w_i^2
\end{aligned}
\]

where

\[
\sum_i w_i^2
= \frac{\sum_i (X_i - \bar X)^2}{\big[ \sum_i (X_i - \bar X)^2 \big]^2}
= \frac{1}{\sum_i (X_i - \bar X)^2}.
\]

• α̂ has variance σ²(1/n + X̄²/Σ(Xi − X̄)²).

\[
\begin{aligned}
V[\hat\alpha]
&= V\Big[ \sum_i v_i Y_i \Big], \qquad
 v_i = \frac{1}{n} - \frac{\bar X (X_i - \bar X)}{\sum_j (X_j - \bar X)^2} \\
&= \sum_i v_i^2 V[Y_i]
 = \sigma^2 \sum_i \left[ \frac{1}{n} - \frac{\bar X (X_i - \bar X)}{\sum_j (X_j - \bar X)^2} \right]^2 \\
&= \sigma^2 \left[ \sum_i \frac{1}{n^2}
 + \frac{\bar X^2 \sum_i (X_i - \bar X)^2}{\big[ \sum_j (X_j - \bar X)^2 \big]^2}
 - \frac{2 \bar X}{n} \frac{\sum_i (X_i - \bar X)}{\sum_j (X_j - \bar X)^2} \right] \\
&= \sigma^2 \left[ \frac{1}{n} + \frac{\bar X^2}{\sum_i (X_i - \bar X)^2} - 0 \right]
 = \sigma^2 \left( \frac{1}{n} + \frac{\bar X^2}{\sum_i (X_i - \bar X)^2} \right)
\end{aligned}
\]

• α̂ and β̂ are negatively correlated (for X̄ > 0).

\[
\begin{aligned}
\mathrm{Cov}(\hat\alpha, \hat\beta)
&= \mathrm{Cov}\Big( \sum_i v_i Y_i,\ \sum_i w_i Y_i \Big), \qquad
 v_i = \frac{1}{n} - \frac{\bar X (X_i - \bar X)}{\sum_j (X_j - \bar X)^2}, \quad
 w_i = \frac{X_i - \bar X}{\sum_j (X_j - \bar X)^2} \\
&= \sum_i v_i w_i \,\mathrm{Cov}(Y_i, Y_i)
 = \sigma^2 \sum_i v_i w_i \\
&= \sigma^2 \sum_i \left[ \frac{1}{n} - \frac{\bar X (X_i - \bar X)}{\sum_j (X_j - \bar X)^2} \right]
 \frac{X_i - \bar X}{\sum_j (X_j - \bar X)^2} \\
&= \sigma^2 \left[ \frac{\sum_i (X_i - \bar X)}{n \sum_j (X_j - \bar X)^2}
 - \frac{\bar X \sum_i (X_i - \bar X)^2}{\big[ \sum_j (X_j - \bar X)^2 \big]^2} \right] \\
&= \sigma^2 \left[ 0 - \frac{\bar X}{\sum_i (X_i - \bar X)^2} \right]
 = -\frac{\sigma^2 \bar X}{\sum_i (X_i - \bar X)^2}
\end{aligned}
\]
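
These moment formulas can be checked with a small Monte Carlo simulation under the model assumptions. The sketch below treats the design points as fixed and uses arbitrary true parameter values (all names and numbers are illustrative).

import numpy as np

rng = np.random.default_rng(3)
n, alpha, beta, sigma = 30, 1.0, 2.0, 1.5
x = rng.uniform(0, 5, size=n)                  # fixed design, treated as known constants
sxx = np.sum((x - x.mean()) ** 2)

estimates = []
for _ in range(20000):
    y = alpha + beta * x + rng.normal(0, sigma, size=n)
    b = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    a = y.mean() - b * x.mean()
    estimates.append((a, b))
est = np.array(estimates)

print(est.mean(axis=0))                                              # ~ (alpha, beta): unbiased
print(est[:, 1].var(), sigma**2 / sxx)                               # Var(beta_hat)
print(est[:, 0].var(), sigma**2 * (1/n + x.mean()**2 / sxx))         # Var(alpha_hat)
print(np.cov(est[:, 0], est[:, 1])[0, 1], -sigma**2 * x.mean() / sxx)  # Cov(alpha_hat, beta_hat)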

Therefore, the sampling distributions for the regression parameters are:

\[
\hat\alpha \sim N\!\left( \alpha,\ \sigma^2 \Big( \frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2} \Big) \right),
\qquad
\hat\beta \sim N\!\left( \beta,\ \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar X)^2} \right).
\]

However, in most cases we don't know σ² and will need to estimate it from the data. One unbiased estimator for σ² is obtained from the residuals:

\[
s^2 = \frac{1}{n-2} \sum_{i=1}^{n} (Y_i - \hat Y_i)^2 .
\]

By replacing σ² with s², we have the following sampling distributions for α̂ and β̂:

\[
\hat\alpha \sim t_{n-2}\!\left( \alpha,\ s^2 \Big( \frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2} \Big) \right),
\qquad
\hat\beta \sim t_{n-2}\!\left( \beta,\ \frac{s^2}{\sum_{i=1}^n (X_i - \bar X)^2} \right).
\]

Note that our degrees of freedom are n − 2 because we estimated two parameters (α and β) in order to calculate the residuals. Similarly, recall that Σ (Yi − Ŷi) = 0.
Σ
ˆ
Knowing that αˆ and βˆ follow a t-distribution with degrees of freedom n-2, we are able to
construct confidence interval and carry out hypothesis test as before. For example, a 90% confidence
interval for β is given by
s
β ± t5%,n−2 × Σ
ˆ s2
n
i=1 (Xi
— )2 .
X¯ 1
Similarly, consider testing the hypotheses H0: β = β0 versus HA: β ≠ β0. The corresponding t-test statistic is given by

\[
\frac{\hat\beta - \beta_0}{\sqrt{ s^2 / \sum_{i=1}^n (X_i - \bar X)^2 }} \sim t_{n-2} .
\]
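
A sketch of the interval and the test in code, using scipy for the t quantiles (the simulated data and the null value β0 = 0 are illustrative choices, not from the lecture):

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 40
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.3 * x + rng.normal(0, 2, size=n)

sxx = np.sum((x - x.mean()) ** 2)
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
alpha_hat = y.mean() - beta_hat * x.mean()
resid = y - alpha_hat - beta_hat * x
s2 = np.sum(resid ** 2) / (n - 2)            # unbiased estimate of sigma^2
se_beta = np.sqrt(s2 / sxx)

# 90% confidence interval: beta_hat +/- t_{5%, n-2} * se
t_crit = stats.t.ppf(0.95, df=n - 2)
ci = (beta_hat - t_crit * se_beta, beta_hat + t_crit * se_beta)

# t-test of H0: beta = 0 against a two-sided alternative
t_stat = (beta_hat - 0) / se_beta
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(ci, t_stat, p_value)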

Prediction

Based on the sampling distributions of α̂ and β̂, we can also investigate the uncertainty associated with Ŷ0 = α̂ + β̂X0 on the fitted line for any X0. Specifically,

\[
\hat Y_0 \sim N\!\left( \alpha + X_0 \beta,\
 \sigma^2 \Big( \frac{1}{n} + \frac{(X_0 - \bar X)^2}{\sum_i (X_i - \bar X)^2} \Big) \right).
\]

Note that α + X0β is the mean of Y given X0; the textbook uses the notation µ0. Replacing the unknown parameters with our estimates, we have

\[
\hat Y_0 \sim t_{n-2}\!\left( \hat\alpha + X_0 \hat\beta,\
 s^2 \Big( \frac{1}{n} + \frac{(X_0 - \bar X)^2}{\sum_i (X_i - \bar X)^2} \Big) \right).
\]

• The fitted value Ŷ0 = α̂ + β̂X0 has mean α + X0β and variance σ²(1/n + (X0 − X̄)²/Σ(Xi − X̄)²).

\[
\begin{aligned}
E(\hat Y_0) &= E(\hat\alpha + \hat\beta X_0) = E(\hat\alpha) + X_0 E(\hat\beta) = \alpha + X_0 \beta \\
V(\hat Y_0) &= V(\hat\alpha + \hat\beta X_0)
 = V(\hat\alpha) + X_0^2 V(\hat\beta) + 2 X_0 \,\mathrm{Cov}(\hat\alpha, \hat\beta) \\
&= \sigma^2 \Big( \frac{1}{n} + \frac{\bar X^2}{\sum_i (X_i - \bar X)^2} \Big)
 + \frac{X_0^2 \sigma^2}{\sum_i (X_i - \bar X)^2}
 - \frac{2 X_0 \bar X \sigma^2}{\sum_i (X_i - \bar X)^2} \\
&= \sigma^2 \Big( \frac{1}{n} + \frac{(X_0 - \bar X)^2}{\sum_i (X_i - \bar X)^2} \Big)
\end{aligned}
\]
We can also consider the uncertainty of a new observation Y0* at X0. In other words, what values of Y0* do we expect to see at X0? Note that this is different from Ŷ0, which is the fitted value that represents the mean of Y at X0. Since

Y0∗ = α + βX0 + ε, ε ∼ N (0, σ2) ,


by replacing the unknown parameters with our estimates, we obtain

\[
Y_0^{*} \sim t_{n-2}\!\left( \hat\alpha + \hat\beta X_0,\
 s^2 \Big( 1 + \frac{1}{n} + \frac{(X_0 - \bar X)^2}{\sum_i (X_i - \bar X)^2} \Big) \right).
\]

Again, the above distribution describes the actual observable data instead of model parameters; the corresponding interval estimates are known as prediction intervals, instead of confidence intervals.
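
A sketch contrasting the interval for the mean response at X0 with the prediction interval for a new observation (simulated data; the value X0 = 7 is an arbitrary choice for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=n)

sxx = np.sum((x - x.mean()) ** 2)
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
alpha_hat = y.mean() - beta_hat * x.mean()
s2 = np.sum((y - alpha_hat - beta_hat * x) ** 2) / (n - 2)

x0 = 7.0
y0_hat = alpha_hat + beta_hat * x0
t_crit = stats.t.ppf(0.975, df=n - 2)

# 95% confidence interval for the mean of Y at x0
se_mean = np.sqrt(s2 * (1/n + (x0 - x.mean())**2 / sxx))
ci_mean = (y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean)

# 95% prediction interval for a new observation at x0 (adds the extra s^2 term)
se_pred = np.sqrt(s2 * (1 + 1/n + (x0 - x.mean())**2 / sxx))
pi = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)
print(ci_mean, pi)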
