
STATS 330: Lecture 5

The Multiple Regression Model

29.07.2015
Housekeeping

- Contact details

  Name            Office    Email (@auckland.ac.nz)   Office hours
  Steffen Klaere  303.219   s.klaere                  Thu+Fri 9:30-10:30
  Arden Miller    303.229C  a.miller                  Wed 9-10, Thu 12-13

- Class representatives

  Name              Course  Email (@aucklanduni.ac.nz)
  Jessica Courtney  330     jcou608
  Monica Hill       330     mhil084
  Ben Wilson        762     bwil003
Tutor Office Hours

- Blake Seers
  - Tuesday, 9-11
  - Thursday, 10-11
  - Friday, 14-15
- Hongbin Guo
  - Monday, 9-11
  - Tuesday, 15-16
  - Wednesday, 14-15
- Stage 3 Assistance Room: 303S.294
Assignments

- Assignment 1 is due August 10.

- It focusses on data cleaning and exploratory analysis.

- Submit it to the Student Resource Centre by 2pm.

- Use the cover page provided on the webpage.


Getting RStudio

http://www.rstudio.com/

http://r-project.org/
Mosaicplot: The Titanic
> data(Titanic)
> mosaicplot(Titanic,main="Survival on the Titanic",
+ col=c("tomato","steelblue"))
[Mosaic plot: "Survival on the Titanic", showing survival (Yes/No) by Class (1st, 2nd, 3rd, Crew), Age (Child, Adult), and Sex (Male, Female).]
Today's lecture: The multiple regression model

Aims of the lecture:

- Describe the type of data suitable for multiple regression.

- Review the multiple regression model.

- Describe some exploratory tools for checking the suitability of the model.


Two uses for regression

"Many texts give the impression that there is a true regression model, and that the aim of model building is always to find this model. This is almost completely unhelpful. Regression is useful for two main tasks: prediction and causal inference. Model building for these two aims is very different. In neither case is the aim to find the true model."
-- Thomas Lumley
Suitable data

Suppose our data frame contains

- a numeric response variable Y,
- one or more numeric explanatory variables X₁, …, Xₖ (aka covariates, independent variables),

and we want to explain (model, predict) Y in terms of the explanatory variables, e.g.,

- explain the volume of cherry trees in terms of height and diameter;
- explain nitrogen oxide emissions in terms of compression ratio and equivalence ratio.
The regression model

The model assumes

- The responses are normally distributed, each response having its own mean μ and all having the same constant variance σ².

- The mean response of a typical observation depends on the covariates through a linear relationship:

  μ = β₀ + β₁X₁ + ⋯ + βₖXₖ

- The responses are independent!

The regression plane

- The relationship

  μ = β₀ + β₁X₁ + ⋯ + βₖXₖ

  can be visualised as a plane.

- The plane is determined by the coefficients β₀, β₁, …, βₖ.

- E.g. for k = 2 (two covariates):

[Figure: 3-D plot of the regression plane, Y plotted against X1 and X2.]
Non-deterministic form of model

Y = μ + ε
  = β₀ + β₁x₁ + ⋯ + βₖxₖ + ε,

where ε is normally distributed with mean zero and variance σ².

[Figure: 3-D scatterplot of observations scattered about the regression plane in (X1, X2, Y) space.]
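A minimal simulation from a model of this form (a sketch only; the coefficient values, sample size, and error standard deviation are illustrative choices, not from the course):

# Simulate n = 100 observations from Y = 1 + 2*x1 - x2 + eps, eps ~ N(0, 0.5^2)
set.seed(1)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
y <- 1 + 2*x1 - x2 + rnorm(n, mean = 0, sd = 0.5)
fit <- lm(y ~ x1 + x2)
coef(fit)  # estimates should be close to (1, 2, -1)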
Checking the data

Task: How can we tell when this model is suitable?

Suitable: the data are randomly scattered about a plane ("planar" for short).

- k = 1: Draw a scatterplot of the response y vs. the covariate x.
- k = 2: Use a spinner (a rotating 3-D plot) and look for the edge.
- k > 2: Use coplots. If the data are planar, the panels should show parallel trends (see the sketch below):

  coplot(y ~ x1 | x2 * x3)
[Figure: coplot of y vs. x1 given x2 and x3; the panels show parallel linear trends, consistent with planar data.]
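A sketch of this check on simulated planar data (the variable names match the coplot call above; the data-generating values are illustrative assumptions):

# Planar data: y depends linearly on x1 and x2; x3 is pure noise
set.seed(2)
n <- 200
x1 <- runif(n, 4, 12)
x2 <- runif(n, 4, 12)
x3 <- sample(1:5, n, replace = TRUE)
y <- 2 + 3*x1 - 2*x2 + rnorm(n, sd = 3)
coplot(y ~ x1 | x2 * x3)  # panels should show roughly parallel trends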
Interpretation of coefficients

- The regression coefficient β₀ gives the average response when all covariates are zero.

- The regression coefficient βⱼ, j = 1, …, k,

  - gives the slope of the plane in the xⱼ direction;
  - measures the average increase in the response for a unit increase in xⱼ when all other covariates are held constant;
  - is the slope of y vs. xⱼ in a coplot;
  - is not necessarily the slope of y vs. xⱼ in a scatterplot.
Interpretation of coefficients
Assume covariates x₁ and x₂ are highly correlated, with ρ = 0.95, and the response y is obtained from the model

y = 5 − 2x₁ + 4x₂ + ε.

A regression of y on x₁ alone has estimated slope approximately 1.77, not −2.


[Figure: coplot of y vs. x1 given x2; within each panel the slope is close to −2, even though the marginal slope of y vs. x1 is about 1.77.]
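A sketch reproducing this effect by simulation (the construction of x2 to give correlation about 0.95 is an illustrative choice; the exact slopes depend on the random draw):

set.seed(3)
n <- 500
x1 <- rnorm(n)
x2 <- 0.95*x1 + sqrt(1 - 0.95^2)*rnorm(n)  # cor(x1, x2) is about 0.95
y <- 5 - 2*x1 + 4*x2 + rnorm(n)
coef(lm(y ~ x1 + x2))  # close to (5, -2, 4)
coef(lm(y ~ x1))       # x1 slope close to -2 + 4*0.95 = 1.8, not -2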
Points to note

- The regression coefficients β₁ and β₂ refer to the conditional mean of the response, given the covariates (in this case conditional on both x₁ and x₂).

- β₁ is the slope of y vs. x₁ in the coplot, conditional on x₂.

- The coefficient in the plot of y vs. x₁ alone refers to a different conditional distribution (conditional on x₁ only).

- The two are the same if and only if x₁ and x₂ are uncorrelated.


Estimation of coefficients

- We estimate the unknown regression plane by the least squares plane (the best-fitting plane).

- The best-fitting plane minimises the sum of squared vertical deviations from the plane.

- That is, it minimises the least squares criterion

  Σᵢ₌₁ⁿ (yᵢ − b₀ − b₁xᵢ₁ − ⋯ − bₖxᵢₖ)²
Estimation of coefficients

- The R command lm calculates the coefficients of the best-fitting plane.

- This function solves the normal equations, a set of linear equations derived by differentiating the least squares criterion with respect to the coefficients.
Math Stuff: Matrix calculus

Arrange the data on the response and covariates into a vector y and a matrix X:

y = (y₁, …, yₙ)′,
X = the n × (k+1) matrix whose ith row is (1, xᵢ₁, …, xᵢₖ), i.e. a first column of 1s followed by the covariate values.
Math Stuff: Normal equations

The least squares estimates b are the solutions of

X′Xb = X′y.
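A sketch of this calculation done directly in R, using simulated y, x1, x2 as above (lm itself uses a more numerically stable QR decomposition rather than solving the normal equations explicitly):

X <- cbind(1, x1, x2)               # design matrix: column of 1s plus covariates
b <- solve(t(X) %*% X, t(X) %*% y)  # solve X'Xb = X'y
drop(b)
coef(lm(y ~ x1 + x2))               # same values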
Calculation of best fitting plane

> summary(lm(y~x1+x2))

Call:
lm(formula = y ~ x1 + x2)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.9863 0.0304 164.01 <2e-16 ***
x1 -2.0098 0.1015 -19.80 <2e-16 ***
x2 3.9800 0.1015 39.22 <2e-16 ***
---

Fitted plane: ŷ = 4.9863 − 2.0098x₁ + 3.9800x₂.


How well does the plane fit?

Judge by examining the residuals and fitted values. Each observation has a fitted value:

- yᵢ is the response for observation i; xᵢ₁, xᵢ₂ are the values of the explanatory variables for observation i.

- The fitted plane is

  ŷ = b₀ + b₁x₁ + b₂x₂.

- The fitted value for observation i is

  ŷᵢ = b₀ + b₁xᵢ₁ + b₂xᵢ₂,

  the height of the fitted plane at (xᵢ₁, xᵢ₂).


How well does the plane fit?

The residual is the difference between the response and the fitted value:

rᵢ = yᵢ − (b₀ + b₁xᵢ₁ + b₂xᵢ₂)
   = yᵢ − ŷᵢ.
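In R, the fitted values and residuals can be extracted from the fitted lm object; a sketch, continuing with the simulated y, x1, x2 from above:

fit <- lm(y ~ x1 + x2)
head(fitted(fit))  # fitted values
head(resid(fit))   # residuals
all.equal(resid(fit), y - fitted(fit))  # TRUE: r_i = y_i - yhat_i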
How well does the plane fit?

- For a good fit, the residuals must

  - be small relative to the yᵢ's, and
  - show no pattern (not depend on the fitted values, the x's, etc.).

- For the model to be useful, there should be a strong relationship between the response and the explanatory variables.
Measuring goodness of fit

- How can we measure the relative size of the residuals and the strength of the relationship between y and the x's?

- The ANOVA identity provides a way of quantifying both.

Measuring goodness of fit

The residual sum of squares

RSS = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

is an overall measure of the size of the residuals.
Measuring goodness of fit

The total sum of squares

TSS = Σᵢ₌₁ⁿ (yᵢ − ȳ)²
is an overall measure of the variability of the data.
Math stuff: ANOVA identity

Σᵢ₌₁ⁿ (yᵢ − ȳ)² = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)² + Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

TSS = RegSS + RSS

Since all sums of squares are non-negative, RSS must be less than or equal to TSS, so RSS/TSS lies between 0 and 1.
Interpretation

- The smaller RSS is compared to TSS, the better the fit.

- If RSS = 0 and TSS is positive, we have a perfect fit.

- If RegSS = 0, then RSS = TSS and the fitted plane is flat (all coefficients except the constant are zero), so the x's do not help predict y.
A more familiar interpretation: Using R²

- R² is defined as

  R² = 1 − RSS/TSS = RegSS/TSS

- R² is always between 0 and 1:

  - 0 means a flat plane (the x's explain nothing);
  - 1 means a perfect fit.

- R² is the square of the correlation between the observations and the fitted values.
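A sketch computing RSS, TSS, and R² by hand and checking them against lm's output (fit and y as above):

RSS <- sum(resid(fit)^2)
TSS <- sum((y - mean(y))^2)
1 - RSS/TSS             # R-squared computed by hand
cor(y, fitted(fit))^2   # squared correlation gives the same value
summary(fit)$r.squared  # as reported by lm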
Estimate of the residual variance σ²

- Recall that σ² controls the scatter of the observations about the regression plane:

  - the bigger σ², the more scatter;
  - the bigger σ², the smaller R².

- σ² is estimated by

  s² = RSS / (n − k − 1).

- s is also known as the residual standard error.
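A sketch of this estimate by hand, for the fit above with k = 2 covariates:

n <- length(y)
k <- 2
s2 <- sum(resid(fit)^2) / (n - k - 1)  # estimate of sigma^2
sqrt(s2)                               # s, the residual standard error
summary(fit)$sigma                     # matches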


Calculations for cherry trees

cherry.lm <- lm(volume ~ diameter + height, data = cherry.df)
summary(cherry.lm)

- The lm function produces an lm object that contains all the information from fitting the regression.

- lm stands for "linear model".

Calculations for cherry trees

Call:
lm(formula = volume ~ diameter + height, data = cherry.df)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
diameter 4.7082 0.2643 17.816 < 2e-16 ***
height 0.3393 0.1302 2.607 0.0145 *
---

Residual standard error: 3.882 on 28 degrees of freedom


Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
Calculations for cherry trees

Hence, we get the fitted model

V = 0.3393h + 4.7082d − 57.9877.
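A sketch of using the fitted model for prediction (the diameter and height values here are made up for illustration):

# Predicted volume for a new tree
predict(cherry.lm, newdata = data.frame(diameter = 11, height = 75))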



Is this the true model?
