
Multiple Linear Regression

Estimation and Inference


Indicator Variables
Summary and Remark

Lecture 4 Linear Regression II


STAT 441: Statistical Methods for Learning and Data Mining

Jinhan Xie

Department of Mathematical and Statistical Sciences


University of Alberta

Winter, 2023

Jinhan Xie (University of Alberta) Lecture 4 LR II Winter, 2023 1/23



Outline

1 Multiple Linear Regression

2 Estimation and Inference

3 Indicator Variables

4 Summary and Remark




Multiple Linear Regression

Multiple linear regression has more than one covariate:

Y = β0 + β1 X1 + · · · + βp Xp + ε,

where usually ε ∼ N(0, σ²).


We interpret βj as the average effect on Y of a one unit increase in Xj ,
while holding all the other covariates fixed.
In the advertising example, the model becomes

Sales = β0 + β1 × TV + β2 × Radio + β3 × Newspaper + ε.




Coefficient Interpretation

The ideal scenario is when the predictors are uncorrelated — a balanced design.
Each coefficient can then be estimated and tested separately.
Interpretations such as "a unit change in Xj is associated with a βj change in Y, while all the other variables stay fixed" are possible.
Correlations amongst predictors cause problems.
The variances of all coefficients tend to increase, sometimes dramatically.
Interpretations become hazardous — when Xj changes, everything else changes.
Claims of causality should be avoided for observational data.




The woes of regression coefficients

Data Analysis and Regression, Mosteller and Tukey 1977


A regression coefficient βj estimates the expected change in Y per unit change in Xj, with all other predictors held fixed. But predictors usually change together!
Example: Y = total amount of change in your pocket; X1 = # of coins; X2 = # of pennies, nickels and dimes. By itself, the regression coefficient of Y on X2 will be > 0. But how about with X1 in the model?
Y = number of tackles by a football player in a season; W and H are his weight and height. The fitted regression model is Ŷ = β̂0 + 0.50W − 0.10H. How do we interpret β̂2 = −0.10 < 0?
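Mosteller and Tukey's pocket-change example can be made concrete with a small numerical sketch. The data below are entirely made up for illustration: each pocket holds only pennies (1¢) and quarters (25¢), X1 counts all coins, and X2 counts the small coins (here just pennies). Assuming NumPy is available:

```python
import numpy as np

# Hypothetical pockets: (pennies, quarters) per pocket.
pennies = np.array([0, 1, 2, 3, 5])
quarters = np.array([1, 1, 2, 2, 3])

y = 1 * pennies + 25 * quarters  # Y: total value in cents
x1 = pennies + quarters          # X1: total number of coins
x2 = pennies                     # X2: number of small coins

# Regressing Y on X2 alone gives a positive slope, because pockets with
# more pennies also tend to hold more coins overall.
slope_alone = np.polyfit(x2, y, 1)[0]

# With X1 in the model, the X2 coefficient flips sign: holding the total
# number of coins fixed, swapping a quarter for a penny costs 24 cents.
X = np.column_stack([np.ones_like(x1), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Here y = 25·X1 − 24·X2 exactly, so the multiple regression recovers a coefficient of −24 on X2 even though the simple regression slope on X2 is positive.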




Two quotes by famous Statisticians

Essentially, all models are wrong, but some are useful.
George Box (1919–2013)

The only way to find out what will happen when a complex system is disturbed is to disturb the system, not merely to observe it passively.
Fred Mosteller and John Tukey, paraphrasing George Box

Coefficient estimation

Given the estimates β̂0, β̂1, · · · , β̂p, the estimated regression line is

ŷ = β̂0 + β̂1 x1 + · · · + β̂p xp.

We estimate the coefficients βj, j = 0, 1, · · · , p, as the values that minimize the sum of squared residuals

RSS = Σi (yi − ŷi)²,

where ŷi = β̂0 + β̂1 xi1 + · · · + β̂p xip is the i-th predicted value.
This is done using standard statistical software. The values β̂0, β̂1, · · · , β̂p that minimize RSS are the multiple least squares regression coefficient estimates.
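As a quick illustration (not from the slides) that off-the-shelf software returns exactly the RSS-minimizing coefficients, the sketch below fits a two-predictor model with NumPy on made-up data and checks that perturbing the estimate can only increase the RSS:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix with an intercept column; lstsq minimizes ||y - Xb||^2.
X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def rss(b):
    r = y - X @ b
    return float(r @ r)

best = rss(beta_hat)
# Any perturbation away from the least squares estimate increases the RSS.
worse = rss(beta_hat + np.array([0.1, 0.0, 0.0]))
```

Because β̂ minimizes RSS, `best` is no larger than the RSS at the true coefficients (2, 3, −1) that generated the data.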




Estimation Example

(Figure: data points and the fitted least squares regression surface, plotted against the two predictors X1 and X2.)



Observations: yi, xij, for i = 1, . . . , n and j = 1, . . . , p. Noise: ε1, . . . , εn.

Unknown parameters: β0, β1, . . . , βp, which we estimate from the data.

y1 = β0 + β1 x11 + · · · + βp x1p + ε1
y2 = β0 + β1 x21 + · · · + βp x2p + ε2
⋮
yn = β0 + β1 xn1 + · · · + βp xnp + εn




We write

Y = Xβ + ε,

where Y = (y1, . . . , yn)′ is n×1, β = (β0, β1, . . . , βp)′ is (p+1)×1, ε = (ε1, . . . , εn)′ is n×1, and X is the n×(p+1) design matrix whose i-th row is (1, xi1, xi2, . . . , xip). This matrix form is elegant and equivalent to the n equations above.




Least squares estimation

RSS = Σj (yj − b0 − b1 xj1 − · · · − bp xjp)²  (sum over j = 1, . . . , n)
    = (y − Xb)′(y − Xb).

When X has full column rank p + 1 ≤ n, the least squares estimate of β is

β̂ = (X′X)⁻¹X′y.

Derivation:
Let x ∈ Rᵐ, let A, B be constant matrices, and let c be a constant vector. Then
Dx⟨Ax, c⟩ = A′c,   Dx⟨Ax, Bx⟩ = B′Ax + A′Bx.
Therefore Db RSS = Db⟨Xb − y, Xb − y⟩ = 2X′(Xb − y).
Setting Db RSS = 0 gives the normal equations X′Xb = X′y, and solving them yields the estimate above.
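The stopping point of the derivation, Db RSS = 2X′(Xb − y) = 0, says the residual vector is orthogonal to every column of X. A small numerical check on made-up data, assuming NumPy:

```python
import numpy as np

# Made-up design: intercept column plus two predictors.
X = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [1., 2., 2.],
              [1., 3., 1.],
              [1., 4., 3.]])
y = np.array([1., 2., 6., 7., 11.])

# Solve the normal equations X'X b = X'y directly.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# At the solution the gradient vanishes: X'(y - X beta) = 0,
# i.e. the residuals are orthogonal to the columns of X.
grad_check = X.T @ (y - X @ beta)
```

In practice one would use a QR-based solver such as `np.linalg.lstsq` rather than forming X′X, but the normal-equations solve mirrors the formula on this slide.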

Example
Fit a straight-line regression model

Y = β0 + β1 X1 + ε,

with data

X1: 0 1 2 3 4
Y : 1 4 3 8 9

Solution.
X has a column of 1's and the column (0, 1, 2, 3, 4)′, y = (1, 4, 3, 8, 9)′, and

X′X = [  5 10 ]
      [ 10 30 ].




Recall that if

A = [ a11 a12 ]
    [ a21 a22 ]

is invertible, then |A| = a11 a22 − a12 a21 and

A⁻¹ = (1/|A|) [  a22 −a12 ]
              [ −a21  a11 ],

since A⁻¹A = I. So

(X′X)⁻¹ = 1/(5 × 30 − 10 × 10) [  30 −10 ] = [  3/5 −1/5 ]
                               [ −10   5 ]   [ −1/5 1/10 ].




On the other hand,

X′y = (25, 70)′,

so

β̂ = (X′X)⁻¹X′y = [  3/5 −1/5 ] [ 25 ] = [ 1 ]
                 [ −1/5 1/10 ] [ 70 ]   [ 2 ].

The fitted line is therefore ŷ = 1 + 2x1.
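The hand computation can be verified numerically; the sketch below (assuming NumPy is available) reproduces β̂ = (1, 2)′ for the lecture's five data points:

```python
import numpy as np

# Design matrix: intercept column and X1 = 0, 1, 2, 3, 4.
X = np.column_stack([np.ones(5), np.arange(5)])
y = np.array([1., 4., 3., 8., 9.])

XtX = X.T @ X                          # [[5, 10], [10, 30]], as computed by hand
beta = np.linalg.solve(XtX, X.T @ y)   # beta = (1, 2): y-hat = 1 + 2 x1
```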




Inference

Is at least one predictor useful?

H0 : β1 = · · · = βp = 0  vs.  H1 : at least one βj ≠ 0.
Insight: this tests whether there is any linear relationship between Y and X1, . . . , Xp.
Test statistic:

F = [(TSS − RSS)/p] / [RSS/(n − p − 1)],

where

RSS = Σi (yi − ŷi)²,   TSS = Σi (yi − ȳ)²,   with sums over i = 1, . . . , n.

Under H0, F ∼ Fp,n−p−1, the F distribution with p and n − p − 1 degrees of freedom. One rejects H0 if F is too large (depending on the significance level of the test), in which case the regression is significant.
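For the five-point example fitted earlier (β̂ = (1, 2)′, so ŷ = 1 + 2x), the F statistic can be computed from RSS and TSS directly. A sketch assuming NumPy:

```python
import numpy as np

x = np.arange(5)
y = np.array([1., 4., 3., 8., 9.])
y_hat = 1.0 + 2.0 * x   # fitted values from the worked example

n, p = len(y), 1
rss = np.sum((y - y_hat) ** 2)       # residual sum of squares: 6.0
tss = np.sum((y - y.mean()) ** 2)    # total sum of squares: 46.0

# Overall F statistic, to be compared against the F_{1,3} distribution.
F = ((tss - rss) / p) / (rss / (n - p - 1))   # 20.0
```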




Inference

What about an individual coefficient, say βj : is it useful?

H0 : βj = 0  vs.  H1 : βj ≠ 0.
Test statistic:

t = (β̂j − 0)/SE(β̂j) ∼ tn−p−1,

where tn−p−1 is the t-distribution with n − p − 1 degrees of freedom.
SE(β̂j) is called the estimated standard error, or simply the standard error, of the coefficient β̂j:

SE(β̂j) = √( RSS/(n − p − 1) · [(X′X)⁻¹]jj ),

where [(X′X)⁻¹]jj is the (j, j)-entry of the matrix (X′X)⁻¹.
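Continuing the five-point example: s² = RSS/(n − p − 1) = 2 and [(X′X)⁻¹] has 1/10 in the slope position, so the slope's standard error and t statistic follow directly. A NumPy sketch (not part of the slides):

```python
import numpy as np

X = np.column_stack([np.ones(5), np.arange(5)])
y = np.array([1., 4., 3., 8., 9.])
n, p = X.shape[0], X.shape[1] - 1

beta = np.linalg.solve(X.T @ X, X.T @ y)   # (1, 2)
rss = np.sum((y - X @ beta) ** 2)          # 6.0
s2 = rss / (n - p - 1)                     # 2.0
XtX_inv = np.linalg.inv(X.T @ X)

# Standard error and t statistic for the slope (j = 1),
# compared against the t distribution with n - p - 1 = 3 df.
se_slope = np.sqrt(s2 * XtX_inv[1, 1])     # sqrt(0.2)
t_slope = beta[1] / se_slope
```

With a single predictor, t² equals the overall F statistic (here 20), as it must.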


 




R² and Adjusted R²

R² = Σi (ŷi − ȳ)² / Σi (yi − ȳ)².

R² ∈ [0, 1] is called the coefficient of determination. It gives the proportion of the total variation in the yi's that is "explained" by, or attributable to, the predictor variables X1, . . . , Xp.
R² = 1 if the fitted equation passes through all the data points (perfect fit, ŷi = yi for every i).
R² = 0 if β̂0 = ȳ and β̂1 = · · · = β̂p = 0, in which case X1, . . . , Xp have no influence on the response Y.



R² and Adjusted R²
The adjusted R² statistic is

R²Adj = 1 − [RSS/(n − p − 1)] / [TSS/(n − 1)].

Since TSS = RSS + Σi (ŷi − ȳ)² (we skip the proof),

R² = 1 − RSS/TSS = Σi (ŷi − ȳ)² / TSS.

In general, R² never decreases when one more predictor variable (e.g. Xp+1) is added to the model, even if the new predictor contributes nothing.
Σi (yi − ŷi)²/(n − p − 1) is the residual mean square. When one more predictor is added, R²Adj increases only if the residual mean square decreases.
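The claim that R² never decreases while R²Adj penalizes useless predictors can be checked numerically. In the sketch below (assuming NumPy), a made-up "junk" predictor that carries no extra information about y is added to the five-point example:

```python
import numpy as np

def fit_r2(X, y):
    """Return (R^2, adjusted R^2) for a least squares fit; X includes the intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    n, p = X.shape[0], X.shape[1] - 1
    return 1 - rss / tss, 1 - (rss / (n - p - 1)) / (tss / (n - 1))

x = np.arange(5.0)
y = np.array([1., 4., 3., 8., 9.])
junk = np.array([1., 0., 0., 0., -1.])   # made-up predictor, unrelated to y here

X1 = np.column_stack([np.ones(5), x])
X2 = np.column_stack([np.ones(5), x, junk])

r2_small, adj_small = fit_r2(X1, y)
r2_big, adj_big = fit_r2(X2, y)
# r2_big >= r2_small always; adj_big drops when the junk predictor
# fails to reduce the residual mean square.
```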

Estimating the regression function at x0

For the linear regression model Y = Xβ + ε, x0′β̂ = β̂0 + β̂1 x01 + · · · + β̂p x0p is the unbiased linear estimator of E[Y | x0]. If the errors ε1, . . . , εn are independently distributed and all have the N(0, σ²) distribution, then a 100(1 − α)% confidence interval for E[Y | x0] = x0′β is given by

x0′β̂ ± tn−p−1(α/2) √( x0′(X′X)⁻¹x0 · s² ),

where s² = RSS/(n − p − 1) = (1/(n − p − 1)) Σi (yi − ŷi)².
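For the five-point example, a 95% confidence interval for the mean response at x0 = (1, 2)′ (i.e. predicting at x1 = 2) can be sketched as follows, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy import stats

X = np.column_stack([np.ones(5), np.arange(5)])
y = np.array([1., 4., 3., 8., 9.])
n, p = X.shape[0], X.shape[1] - 1

beta = np.linalg.solve(X.T @ X, X.T @ y)
s2 = np.sum((y - X @ beta) ** 2) / (n - p - 1)

x0 = np.array([1.0, 2.0])   # predict the mean response at x1 = 2
fit = x0 @ beta             # 5.0

# Half-width: t quantile times the standard error of x0' beta-hat.
half = stats.t.ppf(1 - 0.05 / 2, n - p - 1) * np.sqrt(x0 @ np.linalg.inv(X.T @ X) @ x0 * s2)
lwr, upr = fit - half, fit + half
```

This mirrors `predict(..., interval="c")` in the R output on the next slide; a prediction interval for a new observation would add an extra s² inside the square root.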




Advertising example

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.938889 0.311908 9.422 <2e-16 ***
TV 0.045765 0.001395 32.809 <2e-16 ***
Radio 0.188530 0.008611 21.893 <2e-16 ***
Newspaper -0.001037 0.005871 -0.177 0.86
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.686 on 196 degrees of freedom


Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16

> predict(TVadlm, newdata, interval="c", level=0.95)


fit lwr upr
1 20.52397 19.99627 21.05168
> predict(TVadlm, newdata, interval="p", level=0.95)
fit lwr upr
1 20.52397 17.15828 23.88967




Indicator Variables

Some predictors are not quantitative but are qualitative, taking a discrete
set of values.
These are also called categorical predictors or factor variables.
Example: investigate difference in credit card balance between males
and females, ignoring the other variables. We create a new variable,
xi = 1 if the i-th person is female,
     0 if the i-th person is male.

Resulting model:

yi = β0 + β1 xi + εi = β0 + β1 + εi if the i-th person is female,
                       β0 + εi     if the i-th person is male.

Interpretation and more than two levels (categories)?
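With a 0/1 indicator, the least squares fit reproduces the two group means: β̂0 is the mean for the baseline (male) group and β̂1 is the female-minus-male difference. A sketch with made-up balances, assuming NumPy:

```python
import numpy as np

# Made-up credit card balances: two males, then two females.
balance = np.array([500., 520., 600., 640.])
female = np.array([0., 0., 1., 1.])   # indicator variable x_i

X = np.column_stack([np.ones(4), female])
beta, *_ = np.linalg.lstsq(X, balance, rcond=None)
# beta[0] = male mean (510), beta[1] = female mean - male mean (110)
```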



Indicator Variables

In general, if we have k levels, we need (k − 1) indicator variables.


For example, we have 3 levels — A, B, and C — for a covariate x:

xA = 1 if x is A, 0 if x is not A;   xB = 1 if x is B, 0 if x is not B.

If x is C, then xA = xB = 0. We call C the baseline.


βA is the contrast between A and C and βB is the contrast between B and
C.
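With C as baseline, the fitted intercept is the C mean and β̂A, β̂B are the contrasts with C. A sketch with made-up group data, assuming NumPy:

```python
import numpy as np

# Made-up responses, two observations per level: A, A, B, B, C, C.
y = np.array([20., 22., 14., 18., 10., 12.])
xA = np.array([1., 1., 0., 0., 0., 0.])
xB = np.array([0., 0., 1., 1., 0., 0.])   # C is the baseline: xA = xB = 0

X = np.column_stack([np.ones(6), xA, xB])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# intercept = mean(C) = 11, beta_A = mean(A) - mean(C) = 10,
# beta_B = mean(B) - mean(C) = 5
```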




Summary and Remark

Multiple linear regression


Estimation and inference
Indicator variables
Read textbook Chapter 3

