STATS 330: Lecture 6: Inference For The Multiple Regression Model

Inference for the regression model

Aim of today's lecture

I To discuss how we assess the significance of variables in the
regression.

I Key concepts:

I Standard errors
I Confidence intervals for the coefficients
I Tests of significance
Variability of the regression coefficients

I Imagine that we keep the xs fixed, but resample the errors
and refit the plane. How much would the plane (estimated
coefficients) change?

I This gives us an idea of the variability (accuracy) of the
estimated coefficients as estimates of the coefficients of the
true regression plane.
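The thought experiment above can be tried directly by simulation. The following is a minimal sketch on made-up data: the sample size, the xs, the "true" coefficients in `beta` and the error standard deviation `sigma` are all illustrative assumptions, not values from the lecture.

```r
set.seed(330)                       # arbitrary seed, for reproducibility

# Hypothetical fixed design and true regression plane.
n     <- 50
x1    <- runif(n, 0, 10)
x2    <- runif(n, 0, 10)
beta  <- c(1, 2, -1)                # assumed true intercept and slopes
sigma <- 2                          # assumed error standard deviation
true.mean <- beta[1] + beta[2] * x1 + beta[3] * x2

# Keep the xs fixed, draw fresh errors, refit the plane.
refit <- function() {
  y <- true.mean + rnorm(n, sd = sigma)
  coef(lm(y ~ x1 + x2))
}
sim.coefs <- t(replicate(2000, refit()))   # one row per refitted plane

# Centre and spread of the estimated coefficients over the refits.
sim.mean <- colMeans(sim.coefs)
sim.sd   <- apply(sim.coefs, 2, sd)
print(rbind(true = beta, mean = sim.mean, sd = sim.sd))
```

The `sd` row is the quantity that a standard error tries to estimate from a single sample.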

I Variability depends on

I The arrangement of the xs (the more correlation, the more
the fitted plane changes)

I The error variance (the more scatter about the true plane, the
more the fitted plane changes)

I Measure variability by the standard error of the coefficients
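Both effects can be seen numerically. The sketch below uses the standard least-squares result that the coefficient standard deviations are sigma times the square roots of the diagonal of (X'X)^-1. That formula is not stated in these slides, and the data, the helper `coef.sd` and the correlation of 0.95 are illustrative assumptions.

```r
set.seed(330)                       # arbitrary seed, for reproducibility
n  <- 100
x1 <- rnorm(n)
z  <- rnorm(n)
x2.uncor <- z                                   # roughly uncorrelated with x1
x2.cor   <- 0.95 * x1 + sqrt(1 - 0.95^2) * z    # highly correlated with x1

# Hypothetical helper: standard deviations of the least-squares coefficients
# for a given design and error standard deviation.
coef.sd <- function(x1, x2, sigma) {
  X <- cbind(Intercept = 1, x1 = x1, x2 = x2)   # design matrix with constant
  sigma * sqrt(diag(solve(t(X) %*% X)))
}

sd.uncor <- coef.sd(x1, x2.uncor, sigma = 1)
sd.cor   <- coef.sd(x1, x2.cor,   sigma = 1)    # same errors, correlated xs
sd.noisy <- coef.sd(x1, x2.uncor, sigma = 2)    # same xs, more scatter

print(rbind(uncorrelated = sd.uncor, correlated = sd.cor, noisy = sd.noisy))
```

In practice sigma is unknown and is replaced by the residual standard error, which gives the standard errors reported by `summary`.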

Example: Cherries

lm(formula = volume ~ diameter + height)

Residuals:
    Min      1Q  Median      3Q     Max
-6.4065 -2.6493 -0.2876  2.2003  8.4847

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
diameter     56.4979     3.1712  17.816  < 2e-16 ***
height        0.3393     0.1302   2.607   0.0145 *

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Confidence intervals

CI: estimated coefficient $\pm$ standard error $\times\ t$

t: 97.5% point of the t distribution with df degrees of freedom.

df: $n - k - 1$.

n: number of observations.

k: number of covariates (assuming we have a constant term).

Example: Cherries

Use the confint function from the stats package

> confint(cherry.lm)
                   2.5 %      97.5 %
(Intercept) -75.68226247 -40.2930554
diameter     50.00206788  62.9937842
height        0.07264863   0.6058538
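The interval formula can be checked by hand against the confint output. A minimal sketch, using the rounded estimates and standard errors printed in the cherry summary; n = 31 is inferred from the 28 residual degrees of freedom with k = 2, and the variable names are illustrative.

```r
# Estimates and standard errors copied (rounded) from the summary output.
est <- c("(Intercept)" = -57.9877, diameter = 56.4979, height = 0.3393)
se  <- c("(Intercept)" =   8.6382, diameter =  3.1712, height = 0.1302)

n  <- 31                 # inferred: 28 residual df + 2 covariates + 1 constant
k  <- 2                  # number of covariates
df <- n - k - 1          # residual degrees of freedom

t.crit <- qt(0.975, df)  # 97.5% point of the t distribution

# Estimated coefficient plus or minus standard error times t.
ci <- cbind("2.5 %" = est - t.crit * se, "97.5 %" = est + t.crit * se)
print(round(ci, 4))
```

Small differences from confint in the later decimal places come from using the rounded summary values.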
Hypothesis test

I Often we ask "do we need a particular variable, given the
others are in the model?"

I Note that this is not the same as asking "is a particular
variable related to the response?"

I Can test the former by examining the ratio of the coefficient
to its standard error.
I This is the t-statistic t.

I The bigger |t|, the more evidence that we need the variable.

I Equivalently, the smaller the p-value, the more evidence that we
need the variable.
Example: Cherries

Recall: p-value

[Figure: density for t with df=28. The p-value, 0.0145, is the total
area in the two tails beyond -2.607 and 2.607.]
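The t-statistic and p-value for height can be recomputed from the summary output. A minimal sketch using the rounded estimate and standard error from the cherry summary; the variable names are illustrative.

```r
# Rounded values for height from the cherry summary output.
est <- 0.3393
se  <- 0.1302
df  <- 28                             # residual degrees of freedom

t.stat  <- est / se                   # ratio of coefficient to standard error
p.value <- 2 * pt(-abs(t.stat), df)   # area in both tails beyond |t|

print(c(t = t.stat, p = p.value))
```

The result agrees with the printed t value and p-value up to rounding of the inputs.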
Other hypotheses

I Overall significance of the regression: do none of the variables
have a relationship with the response?

I Use the F statistic: the bigger F, the more evidence that at
least one variable has a relationship.

I Equivalently, the smaller the p-value, the more evidence that
at least one variable has a relationship.
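For a model with a constant term, the overall F statistic can be recovered from R-squared through the standard identity F = (R^2 / k) / ((1 - R^2) / (n - k - 1)). This identity is not stated in the slides. The sketch below applies it to the rounded R-squared from the cherry summary, with n = 31 inferred from the 28 residual degrees of freedom.

```r
r2 <- 0.948     # Multiple R-squared from the cherry summary (rounded)
n  <- 31        # inferred from 28 residual df with k = 2
k  <- 2         # number of covariates

# Overall F statistic and its upper-tail p-value.
f.stat  <- (r2 / k) / ((1 - r2) / (n - k - 1))
p.value <- pf(f.stat, k, n - k - 1, lower.tail = FALSE)

print(c(F = f.stat, p = p.value))
```

The small difference from the printed F-statistic of 255 comes from rounding R-squared to three decimals.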
Example: Cherries

F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16

Testing if a subset is required

I Often we want to test if a subset of variables is unnecessary.

I Terminology

Full model: Model containing all variables.

Submodel: Model with a set of variables removed.

I Test is based on comparing the RSS of the submodel with the
RSS of the full model. The full model RSS is never larger than
the submodel RSS.
I If the full model RSS is not much smaller than the submodel
RSS, the submodel is adequate: we do not need the extra
variables.

I To do the test, we

I fit both models, get RSS for both;

I calculate test statistic;
I If the test statistic is large, and equivalently the p-value is
small, the submodel is not adequate.
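I The test statistic is

$$F = \frac{\mathrm{RSS}_{\mathrm{sub}} - \mathrm{RSS}_{\mathrm{full}}}{s^2\,(\mathrm{df}_{\mathrm{full}} - \mathrm{df}_{\mathrm{sub}})}$$

I $\mathrm{df}_{\mathrm{full}} - \mathrm{df}_{\mathrm{sub}}$ is the number of variables dropped.

I $s^2$ is the estimate of $\sigma^2$ from the full model (the residual
mean square).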

I R has a function anova to do the calculation.


I If the submodel is correct, the test statistic has an
F-distribution with $\mathrm{df}_{\mathrm{full}} - \mathrm{df}_{\mathrm{sub}}$ and $n - k - 1$ degrees of
freedom, where k is the number of covariates in the full model.

I We assess if the value of F calculated from the sample is a
plausible value from this distribution by means of a p-value.

I If the p-value is too small, we have evidence against the
hypothesis that the submodel is ok.
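When only one variable is dropped, the F test agrees with the t test for that variable (F equals t squared). The sketch below illustrates this with `anova`. It assumes that R's built-in `trees` data, with `Girth` converted from inches to feet, corresponds to the cherry data; the names `cherry.df`, `cherry.full` and `cherry.sub` are illustrative.

```r
# Assumption: built-in `trees` stands in for the lecture's cherry data.
cherry.df <- data.frame(diameter = trees$Girth / 12,   # inches to feet
                        height   = trees$Height,
                        volume   = trees$Volume)

cherry.full <- lm(volume ~ diameter + height, data = cherry.df)
cherry.sub  <- lm(volume ~ diameter, data = cherry.df)  # height removed

# Compare the RSS of the submodel with the RSS of the full model.
tab <- anova(cherry.sub, cherry.full)
print(tab)

# With one variable dropped, F is the square of that variable's t-statistic.
f.stat   <- tab$F[2]
t.height <- summary(cherry.full)$coefficients["height", "t value"]
print(c(F = f.stat, t.squared = t.height^2))
```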
[Figure: density for F with 2 and 16 degrees of freedom]

Example: Free fatty acid data

I Use physical measures to model a biochemical parameter in
overweight children.

I Variables are

FFA: Free fatty acid level in blood (response variable)

Age: months

Weight: pounds

Skinfold thickness: inches


lm(formula = ffa ~ age + weight + skinfold, data = fatty.df)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.95777    1.40138   2.824  0.01222 *
age         -0.01912    0.01275  -1.499  0.15323
weight      -0.02007    0.00613  -3.274  0.00478 **
skinfold    -0.07788    0.31377  -0.248  0.80714

This suggests
I age is not required if weight and skinfold are retained

I skinfold is not required if weight and age are retained

I Can we get away with just weight?


> model.full <- lm(ffa ~ age + weight + skinfold, data = fatty.df)
> model.sub <- lm(ffa ~ weight, data = fatty.df)
> anova(model.sub, model.full)
Analysis of Variance Table

Model 1: ffa ~ weight
Model 2: ffa ~ age + weight + skinfold
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     18 0.91007
2     16 0.79113  2   0.11895 1.2028 0.3261
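The F and p-value in this table can be reproduced by hand from the two RSS values. A minimal sketch using only the numbers printed in the anova table; the variable names are illustrative.

```r
# Values copied from the anova table.
rss.sub       <- 0.91007    # RSS of ffa ~ weight
rss.full      <- 0.79113    # RSS of ffa ~ age + weight + skinfold
df.resid.sub  <- 18         # residual df of the submodel
df.resid.full <- 16         # residual df of the full model

n.dropped <- df.resid.sub - df.resid.full   # age and skinfold: 2 variables
s2        <- rss.full / df.resid.full       # residual mean square, full model

# Test statistic and upper-tail p-value from the F distribution.
f.stat  <- (rss.sub - rss.full) / (s2 * n.dropped)
p.value <- pf(f.stat, n.dropped, df.resid.full, lower.tail = FALSE)

print(c(F = f.stat, p = p.value))
```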

I Small F and large p-value suggest weight alone is adequate.

I But the test should be interpreted with caution: could there be
confounding?
I Non-causal relation due to a missing variable.

I Effect can be checked by comparing coefficients in the full model
and the submodel (if available).
> summary(model.sub)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.01651    0.37578   5.366 4.23e-05 ***
weight      -0.02162    0.00608  -3.555  0.00226 **
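One informal way to make the comparison is to look at how far the weight coefficient moves between the two fits, relative to its size and its standard error. This is a sketch of one possible check, not a formal test, using only the printed estimates; the variable names are illustrative.

```r
# Weight coefficient and standard error, copied from the two summaries.
full <- c(estimate = -0.02007, se = 0.00613)   # ffa ~ age + weight + skinfold
sub  <- c(estimate = -0.02162, se = 0.00608)   # ffa ~ weight

change       <- unname(sub["estimate"] - full["estimate"])
rel.change   <- change / abs(unname(full["estimate"]))   # as a proportion
change.in.se <- change / unname(full["se"])              # in standard errors

print(c(change = change, relative = rel.change, in.se = change.in.se))
```

A shift that is small compared with the standard error, as here, suggests that dropping age and skinfold has little effect on the estimated weight coefficient.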

