
Chapter 2

Simple Linear Regression

2.1 Inheritance of height


1. To motivate the study of regression we shall consider the heights of mothers and daugh-
ters. This data set, collected by the famous statistician E. S. Pearson, recorded the
heights of 1375 mothers under the age of 65 and one of their adult daughters over the
age of 18. Please see page 2 of the Weisberg book for more regarding this data set.

2. The interest here is to see if a mother’s height can be used to predict her daughter’s
height.

3. For the ith pair of mother and daughter, let Yi denote the response variable, the daugh-
ter’s height (which we want to predict), and Xi the observed covariate or explanatory
variable, the mother’s height (which we want to use in the prediction), i = 1, . . . , n.
The problem is to model the relationship between Yi and Xi so that we can use X to
predict Y .

4. The first step of any statistical modelling exercise is to examine the data graphically,
or by calculating informative summaries.

5. The scatter plot of the data is provided in Figure 2.1. What conclusions can we draw
from it?

6. Write some important characteristics of the scatter plot.

7. ...

8. ...

9. To interpret a scatterplot, think of dividing the scatterplot into narrow vertical strips
so that each vertical strip contains the observations on samples which have nearly the
same mother's height. Within each vertical strip, we see a distribution of daughter's
height, which is the conditional distribution of daughter's height given that the mother's
height is in the strip.

Figure 2.1: The heights data set – scatterplot of daughter's height (hts$Dheight) against mother's height (hts$Mheight)

10. We need to describe how these distributions change as we move across the vertical
strips. The most effective graphical way of seeing how the distributions change is to
draw a boxplot for the data in each strip. Examining these boxplots shows the features
we noted above.

11. Interpreting a boxplot.

(a) The boxplots show the median as the middle line.


(b) The horizontal edges, also called hinges, are where the lower and upper quartiles
are. The box covers the 'middle half' of the observations.
(c) The dotted lines are called the whiskers, which extend from the hinges as far as
the most extreme observation which lies within a distance 1.5 × (Q3 − Q1) of the
hinges.
(d) Any observations beyond the ends of the whiskers (further than 1.5 × IQR from
the hinges) are outliers and are each marked on the plot as individual points at
the appropriate values.
Figure 2.2: Boxplots for the heights data – daughter's height (Dheight) within strips of mother's height (Mheight)

12. What can you conclude from these side by side boxplots?

13. ...

14. ...

15. In practice, we learn to interpret scatterplots without going through this formal exer-
cise, but it does give useful insight into the conditional nature of the interpretation.

16. Return to the original scatterplot. If there were no variability, the points in the
plot would fall exactly on a curve describing the exact relationship between Y and X;
the variability is reflected in the scatter of points about this curve.

17. An interesting question is: what is the source of the variability? One explanation is that
the variability is due to the measurement process and stochastic inaccuracies which
arise in it. Alternatively, the daughter's height may be affected by other variables
which we have not held constant and their hidden variation induces variation in the
data. An entirely pragmatic view is that the source of the variability does not matter;
variability is simply there and we must deal with it.

2.2 The model


1. From the scatterplot, a plausible model for the conditional distribution of Y given X
has

E(Yi | Xi) = β0 + β1 Xi  and  Var(Yi | Xi) = σ².

There are three parameters in this model.

2. Conditional Mean Parameters

• the intercept β0 is the mean value of daughter's height when mother's height is
0 (not really a practical thing to interpret here though).
• the slope β1 is the effect on the mean daughter's height of a unit change in
mother's height (e.g. how many inches do we expect daughter's height to increase
if the mother is an inch taller).

3. Conditional Variance σ², sometimes called a nuisance parameter, as its value does
not affect predictions (but does affect uncertainty concerning predictions).

4. To complete the model, we need to say more about the random variation.

• Provided the observations have been made from unrelated experiments, it seems
reasonable to begin by assuming independence.
• It is difficult to say much about the shape of the conditional distribution of Yi
given Xi : we usually start by making the optimistic assumption that it is normal.
The discrepant points may make this somewhat implausible – it is important to
check later.

5. Putting the four assumptions together, we have

Yi | Xi ∼ind N(β0 + β1 Xi, σ²).

The structure of the underlying population can then be viewed graphically.

Figure 2.3: The structure of the underlying population

6. A useful alternative way to write the model is as

Yi = β0 + β1 Xi + εi

with εi ∼ind N(0, σ²) independent of Xi. Here εi is the random difference between the
observation Yi and its conditional mean β0 + β1 Xi. Thus εi is a random variable
called an error (not meaning a mistake, though) with a distribution called the error
distribution. In this form, the model is a decomposition of the observation into a
structural part (the regression line) plus random variation (the error).


Figure 2.4: The structural and random components of the underlying population – the regression line E(Y|X) = β0 + β1X, with the error εi measuring the vertical distance of an observation Y from the line

7. Here and throughout our treatment of regression models, we proceed conditionally
given the covariates Xi. No attempt is made to model the variability in X, and hence
the Xi are not considered to be random variables – they are treated as constants, fixed
at the observed values xi. Henceforth, we shall use the notation xi to emphasise that
these are known constants.

8. Therefore, given a pair of variables, it matters which we choose as the response and
which as the explanatory variables. Often, this choice is clear, as for example when

• the aim of the modelling exercise is to predict one of the variables using the other
(as in the current example).
• one of the variables is controlled (fixed by the experimenter) while the variability
of the other is observed.

2.3 Model Fitting


1. The choice of model fitting method depends on what we know about the error distri-
bution. We don’t actually know much at this stage, but we have made some optimistic
assumptions, so we fit the model under these assumptions and then review their plau-
sibility.

2. Least Squares Estimation works well when the error distribution is normal. This
entails estimating (β0, β1) by (β̂0, β̂1) which minimises the sum S of squared vertical
distances between the observed response values yi and the regression line β0 + β1 xi, at
the corresponding observed values of the explanatory variable. Clearly

S = Σ_{i=1}^{n} (yi − β0 − β1 xi)²

so we have

∂S/∂β0 = −2 Σ_{i=1}^{n} (yi − β0 − β1 xi)

and

∂S/∂β1 = −2 Σ_{i=1}^{n} xi (yi − β0 − β1 xi).

To minimise S with respect to β0 and β1 we set

∂S/∂β0 = 0,  ∂S/∂β1 = 0.

The above two equations are called the normal equations. So the least squares estimates
(β̂0, β̂1) satisfy the normal equations

ȳ − β̂0 − β̂1 x̄ = 0,  or  β̂0 = ȳ − β̂1 x̄,

and

Σ_{i=1}^{n} xi yi − n x̄ β̂0 − β̂1 Σ_{i=1}^{n} xi² = 0.

3. The first equation shows that the fitted line passes through the point (x̄, ȳ).

4. It follows that

0 = (Σ_{i=1}^{n} xi yi − n ȳ x̄) − β̂1 (Σ_{i=1}^{n} xi² − n x̄²)
  = Σ_{i=1}^{n} (xi − x̄) yi − β̂1 Σ_{i=1}^{n} (xi − x̄)²
  = Σ_{i=1}^{n} (xi − x̄) yi − β̂1 (n − 1) s²_X,

say, where s²_X = (n − 1)⁻¹ Σ_{i=1}^{n} (xi − x̄)², and hence

β̂1 = Σ_{i=1}^{n} (xi − x̄) yi / [(n − 1) s²_X] = Σ_{i=1}^{n} w1i yi,

where

w1i = (xi − x̄) / [(n − 1) s²_X].

Substituting back into the first equation, we obtain

β̂0 = ȳ − β̂1 x̄ = Σ_{i=1}^{n} w0i yi

where

w0i = 1/n − x̄ (xi − x̄) / [(n − 1) s²_X].

5. Practical method to calculate β̂0 and β̂1.

• Note that (or satisfy yourself that):

(n − 1) s²_X = SXX = Σ_{i=1}^{n} (xi − x̄)² = Σ_{i=1}^{n} xi² − n x̄².

• Similarly

SXY = Σ_{i=1}^{n} (xi − x̄) yi = Σ_{i=1}^{n} xi yi − n x̄ ȳ.

• Hence, β̂1 can be calculated by the formula

β̂1 = SXY / SXX = (Σ_{i=1}^{n} xi yi − n x̄ ȳ) / (Σ_{i=1}^{n} xi² − n x̄²).

• Once we have β̂1, β̂0 can simply be obtained from:

β̂0 = ȳ − β̂1 x̄.

6. For the following data show that β̂0 = −0.916, β̂1 = 0.215.

y x
4 30
10 50
15 70
21 90
21 100
22 120
29 140
34 160
39 180
41 200
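As a check on the estimates above, the following R sketch (in the same style as the code in Section 2.17) computes β̂0 and β̂1 for this small data set, both directly from the SXY/SXX formulas and with lm(). The object names x, y, n, SXX, SXY, beta0hat, beta1hat and fit are chosen here purely for illustration and are reused in later sketches.

# Enter the small numerical example
y <- c(4, 10, 15, 21, 21, 22, 29, 34, 39, 41)
x <- c(30, 50, 70, 90, 100, 120, 140, 160, 180, 200)
n <- length(y)

# Least squares estimates from the practical formulas
SXX <- sum(x^2) - n * mean(x)^2
SXY <- sum(x * y) - n * mean(x) * mean(y)
beta1hat <- SXY / SXX                      # about 0.215
beta0hat <- mean(y) - beta1hat * mean(x)   # about -0.916

# The same estimates from R's built-in least squares fit
fit <- lm(y ~ x)
coef(fit)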

2.4 Estimating the variance


1. In addition to the parameters β0 and β1 which describe the conditional mean relation-
ship, it is also useful to be able to estimate the error variance σ². We usually use the
estimator

S² = (1/(n − 2)) Σ_{i=1}^{n} (Yi − β̂0 − β̂1 xi)² = (1/(n − 2)) Σ_{i=1}^{n} Ri²,

where division by n − 2 instead of n or n − 1 is used since two degrees of freedom have
been 'lost' in estimating the conditional mean parameters. It can be shown that

(n − 2) S² / σ² ∼ χ²_{n−2}

independent of β̂0 and β̂1. Hence, S² is an unbiased estimator of σ². As usual, we use
s² to denote the actual value of S² for our observed data.

2. The sampling distribution of S² can be used to construct confidence intervals for σ² in
the same way as described for the simple normal models in Chapter 1.

3. In practice we use

(n − 2) s² = Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi)² = SYY − β̂1² SXX,

where SYY = Σ_{i=1}^{n} yi² − n ȳ² and, similarly, SXX = Σ_{i=1}^{n} xi² − n x̄².

4. Show that s² = 2.644 for the data set given above.
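Continuing the R sketch started in Section 2.3 (reusing y, n, SXX, beta1hat and fit defined there), the variance estimate can be checked as follows.

# Residual sum of squares and s^2, dividing by n - 2
SYY <- sum(y^2) - n * mean(y)^2
s2 <- (SYY - beta1hat^2 * SXX) / (n - 2)   # about 2.644
s2
# Agrees with the squared residual standard error from the fitted model
summary(fit)$sigma^2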

2.5 Fitted Values and the Residuals


1. We can construct two useful further estimates:

• Fitted Values are the estimates ŷi = β̂0 + β̂1 xi of the conditional means at the
observed values of the explanatory variable.

• Residuals are the differences ri = yi − ŷi between the observed responses and
the fitted values.

2. The decomposition

yi = β̂0 + β̂1 xi + ri

is analogous to the model decomposition into a structural part (regression line) plus
random variation (error) and shows that the residuals can be regarded as estimates of
the errors.

3. Note that the least squares estimates given above are single observations of correspond-
ing random variables (estimators) given by

β̂0 = Σ_{i=1}^{n} w0i Yi,  β̂1 = Σ_{i=1}^{n} w1i Yi.

4. These are linear functions of the responses so it will be straightforward to write down
their sampling distributions. Similarly, the fitted values ŷi and residuals ri are observed
values of random variables

Ŷi = Σ_{j=1}^{n} w0j Yj + xi Σ_{j=1}^{n} w1j Yj,  i = 1, . . . , n

and

Ri = Yi − Ŷi,  i = 1, . . . , n,

respectively.

Residual = Observed − Fitted
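In R (continuing with the fit object from the earlier sketch), the fitted values and residuals are available directly from the fitted model, and the decomposition above can be checked numerically.

# Fitted values yhat_i = beta0hat + beta1hat * x_i and residuals r_i = y_i - yhat_i
yhat <- fitted(fit)
r <- resid(fit)
# Residual = Observed - Fitted (difference should be essentially zero)
max(abs(r - (y - yhat)))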

2.6 Diagnostics
Once we have fitted the model and computed the residuals from our fit, we can use the
residuals to explore the quality of the fit and the properties of the error distribution. The
most effective methods for carrying out this exploration are graphical. These are collectively
known as diagnostic methods.

2.6.1 The (Anscombe) residual plot


1. The (Anscombe) residual plot is a plot of the residuals ri against the fitted values
ŷi = β̂0 + β̂1 xi.
If the model is correct, the points in the residual plot should be randomly distributed
within a horizontal strip centred about the horizontal axis. Significant deviations from
the desirable shape indicate problems with the assumptions of the model.

(a) Curvature: the band of points does not remain horizontal but curves as X
changes. This indicates that the conditional means are not linearly related to the
explanatory variable.

Figure 2.5: Curvature in a residual plot

Figure 2.6: Heteroscedasticity in a residual plot

(b) Heteroscedasticity: a fan shaped residual scatter indicates that the variability
in the error term εi in the model is not constant.
(c) Discrepant responses (outliers): Y-values which do not follow the same linear
model as the rest of the sample Y-values. The effect of such values can be to bias
the sample estimates away from the parameters β0 and β1 of the linear model
satisfied by the rest of the sample values and to inflate the error.

2. How can you interpret the Anscombe plot for the heights data set? Is it a random
scatter?

3. ...

4. The information in the residual plot (ŷ, y − ŷ) is also contained in the scatterplot (x, y).
Residual plots are graphically more effective than scatterplots because the dominating
linear relationship has been removed and it is easier to look for deviations from the
horizontal rather than from the fitted regression line.

5. Residual plots are also useful in problems involving more than two variables where
scatterplots of pairs of variables are ineffective. In such cases, non-linearity of the
relationship between the response and an explanatory variable may not be obvious
from the Anscombe plot, and this plot should be supplemented by plotting the residuals
against each of the individual explanatory variables – more details later.

Figure 2.7: Discrepant points in a residual plot

Figure 2.8: The Anscombe residual plot for the heights data set – residuals against fitted values
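For the small numerical example of Section 2.3, an Anscombe residual plot takes only a couple of lines of R (a sketch in the style of the code in Section 2.17, reusing the fit object defined earlier).

# Residuals against fitted values: should look like a random horizontal band about zero
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0)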

2.6.2 Normal Probability Plots


1. The normality of the error distribution is checked using a normal probability plot of
the residuals.
2. Order the residuals r1, . . . , rn in increasing size so that r1^ord ≤ . . . ≤ rn^ord. Suppose that
Ri ∼ N(μ, σ²).

3. It can be shown that:

E(Ri^ord) ≈ μ + σ Φ⁻¹((i − 3/8)/(n + 1/4)),  i = 1, . . . , n,

where Φ is the standard normal distribution function (Φ(x) = P(Z ≤ x) where
Z ∼ N(0, 1)). Hence, a plot of ri^ord against Φ⁻¹((i − 3/8)/(n + 1/4)) for i = 1, . . . , n should be
approximately a straight line, if the normal model holds. Departures from linearity
correspond to departures from the model.

[NB: Always check which axis the observed residuals are on since this affects interpreta-
tion. The R software puts the observed residuals, ri^ord, on the y-axis and the theoretical
quantiles, Φ⁻¹((i − 3/8)/(n + 1/4)), on the x-axis. Thus, from a plot in R we can estimate σ as the
slope and estimate μ as the intercept.]

Figure 2.9: Normal probability plots illustrating typical departures – shorter tails, longer tails, outlier, asymmetry, rounding, clustering

Figure 2.10: The normal probability plot of the residuals for the heights data set – sample quantiles against theoretical quantiles

4. How can you interpret the normal probability plot for the heights data set? Do the
points lie on a straight line?

5. ...
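The corresponding plot is produced in R with qqnorm(), which uses plotting positions of essentially the form described above, and qqline(), which adds a reference line; a sketch, again reusing the fit object from the earlier example:

# Normal probability (Q-Q) plot of the residuals; points should lie close to the line
qqnorm(resid(fit))
qqline(resid(fit))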

2.6.3 Dependence Plots


1. Dependence may be visible in the residual plot; it is more likely to be seen as patterns
in plots of residuals against time and/or space variables or of residuals against adjacent
residuals. For data collected sequentially in time, a plot of the residual Ri against the
previous residual Ri−1 (called the lagged residual) often exhibits a linear relationship
when the observations are dependent. Of course, in the absence of the required
spatio-temporal information, we cannot construct such plots.
2. It is always important to think about the way in which the data were collected. Learn-
ing, repeated use of the same material, and arrangements in time and space can all in-
duce dependence between the errors. Dependence-inducing structure in the data which
is a result of the data collection process ordinarily needs to be reflected in the model.
Failing to account for dependence, where it is present, typically leads to overconfident
conclusions, which understate the amount of uncertainty. Models for dependent data
are outside the scope of this course.

2.6.4 Discrepant Explanatory Variables


1. An X-value which is separated from the rest of the sample X-values is also important
in modelling. The effect of such points is to attract the fitted line towards themselves
and to increase the precision of the fitted line (see later). This means that discrepant
explanatory variables can have a beneficial or a detrimental effect according to whether
they are accurately observed or not.
2. Leverage Points. In the simple regression framework, X-outliers are sometimes called
leverage points. In more general models, leverage points have a more complicated
definition than X-outliers but in the case of a simple regression model, a point (xi, yi)
is a leverage point if its leverage

hii = 1/n + (xi − x̄)² / ((n − 1) s²_X)

is large, that is, when xi is far from x̄. This is basically a definition of an X-outlier if
we use x̄ to represent the bulk of the X-values.
3. Influential Points. Finally, X- and Y-outliers are sometimes referred to as influential
points. Generally, influential points (xi, yi) are observations that have an excessive
'influence' on the analysis in the sense that deleting them (or a set of them) results in
large changes to the analysis. Clearly, large outliers are influential points and in the
simple regression framework, there are no other kinds of influential points. However,
when we work with more complicated models, it is possible to have influential points
which are not also outliers. A measure of the influence of a point is Cook's distance

di = Σ_{j=1}^{n} (ŷj^(i) − ŷj)² / (k σ̂²),

where ŷj^(i) is the fitted value for observation j, calculated using the least squares es-
timates obtained from the modified data set with the ith observation deleted. In the
denominator k is the number of linear parameters estimated (2 for simple regression)
and σ̂² is an estimate of the error variance. A rule of thumb is that values of di greater
than 1 indicate influential points.
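Both leverages and Cook's distances are available in R for a fitted lm object; a brief sketch, continuing with the fit object from the earlier example:

# Leverages h_ii for each observation
hatvalues(fit)
# Cook's distances; values much greater than 1 deserve a closer look
cooks.distance(fit)
# R's built-in diagnostics include a residuals-versus-leverage display
plot(fit, which = 5)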

2.6.5 Remedies
1. Transformation: If the residual plot shows nonlinearity and/or heteroscedasticity or
the normal probability plot shows a long-tailed distribution, we may be able to find
a transformation which linearises the relationship, stabilises the variability and makes
the variability more normal. The most widely used transformations (in increasing
order of strength) are the cube root, square root (for count data), logarithmic (for
multiplicative relationships) or negative reciprocal (for time and rates). However, there
is no guarantee that a transformation can be found to remedy any of these defects,
never mind all simultaneously. Also, when a transformation has been used, the final
model fitted is no longer in terms of the original variables. This latter objection is
often unimportant but there are calibration problems in which it may matter.
2. Reacting to Discrepant observations: First try to establish whether the points
are mistakes, observations from a distinct, identifiable subgroup in the population or
due to changes in the data collection process. Often, unless we have access to the
people who collected the data, there is no additional information available. On the
basis of whatever information we can collect, we can correct mistakes, build more
complicated models (which either change the distributional assumptions or treat the
subsets separately) or exclude the subsets from the analysis. The choice between these
depends on whether we want to try to mix the possibly disparate structures into one
model or we are interested in the pattern followed by the bulk (majority) of the data.
We can exclude points in two ways.
• Robust methods: Use robust fitting methods to fit the model to the bulk of
the data.
• Exclusion: Remove the points and then refit using least squares. (The disad-
vantage of this approach is that the identification of the points is not always
straightforward and the inferences do not then reflect this facet of our analysis.)
3. Multiple analyses can be presented to show the range of possible scientific outcomes.
It is important to identify discrepant points and then explain and justify the actions
we take in relation to them.

2.7 Correlation and Goodness-of-fit


1. Recall that if Y is a random variable with mean μY and variance σ²_Y and X is another
random variable with mean μX and variance σ²_X, then the correlation ρ between Y and
X is defined as

ρ = E[(Y − μY)(X − μX)] / (σY σX).

2. We have

• |ρ| ≤ 1;
• |ρ| ≈ 1 indicates a strong linear relationship;
• |ρ| ≈ 0 indicates a weak linear relationship.

3. Positive (negative) values of ρ indicate that Y tends to increase (decrease) with increas-
ing X. The quantity ρ is a population parameter (a property of the joint probability
distribution of Y and X) which can be estimated, given a representative sample from
the population, by

ρ̂ = sXY / ((n − 1) sY sX)

where s²_X and s²_Y are the sample variances, and

sXY = Σ_{i=1}^{n} (yi − ȳ)(xi − x̄)

is the sample covariance. The statistic ρ̂ is called the Pearson or product-moment
correlation coefficient.

4. Regression and correlation operate in different contexts.

• The regression model represents the error distribution as a normal distribution.


This implies that the response has a normal distribution. The explanatory variable
is treated as fixed and not modelled.
• Correlation on the other hand treats both variables as random and is most mean-
ingful when the joint distribution of the response and explanatory variables is the
bivariate normal distribution.

5. Nevertheless, there is a simple algebraic relationship between the estimate β̂1 of the
slope parameter in a simple linear regression model and the correlation coefficient ρ̂:

β̂1 = Σ_{i=1}^{n} w1i yi
   = Σ_{i=1}^{n} (xi − x̄) yi / ((n − 1) s²_X)
   = sXY / ((n − 1) s²_X)
   = [sXY / ((n − 1) sY sX)] (sY / sX)
   = ρ̂ sY / sX,

whatever the nature of the explanatory variable. The algebraic relationship ignores
differences in context, interpretation etc. so should not be overinterpreted.
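This relationship is easy to verify numerically for the small example in R (continuing with x, y and fit from the earlier sketch).

# beta1hat equals rho_hat * s_Y / s_X
rho_hat <- cor(x, y)
rho_hat * sd(y) / sd(x)
coef(fit)["x"]    # same value, about 0.215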

6. The correlation coefficient is sometimes used in conjunction with a regression model
to summarise the quality of the fit. A low value of s² implies small estimated error
variance, and hence a good fit. It is more meaningful to look at s² relative to how
well we could fit the data if we ignored the explanatory variable than to compare s² to
the theoretical but unattainable 'ideal value' of zero. Without information about the
explanatory variable, we would have to model the response values as identically
distributed about a common mean, and the error variance estimate would be s²_Y, the
sample variance of Y, as in Chapter 1. We can therefore measure the fit by

Adj R² = (s²_Y − s²) / s²_Y = 1 − s² / s²_Y,

the proportionate reduction in error variance achieved by including the explanatory
variable in the model.

7. The linear relationship is strong if Adj R² ≈ 1 and weak if Adj R² ≈ 0. Using the
analysis of variance decomposition

(n − 1) s²_Y = β̂1² (n − 1) s²_X + (n − 2) s²

and the relationship between β̂1 and ρ̂, we can write

s² = ((n − 1)/(n − 2)) (s²_Y − β̂1² s²_X) = ((n − 1)/(n − 2)) (1 − ρ̂²) s²_Y

and hence

Adj R² = ((n − 1)/(n − 2)) ρ̂² − 1/(n − 2).

8. It follows that when n is large Adj R² ≈ ρ̂². It is common to denote ρ̂² by R², and
then the relationship

R² = ((n − 2)/(n − 1)) Adj R² + 1/(n − 1) = 1 − (n − 2) s² / ((n − 1) s²_Y)

defines a form of squared correlation coefficient which generalises to more complex
regression models.
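For the small numerical example these quantities can be checked in R (a sketch, continuing with x, y and fit from earlier); the 'Multiple R-squared' and 'Adjusted R-squared' reported by summary() correspond to ρ̂² and Adj R² as defined here.

# R^2 = rho_hat^2 and Adj R^2 = 1 - s^2 / s_Y^2
cor(x, y)^2
1 - summary(fit)$sigma^2 / var(y)
# Compare with the values reported by R
summary(fit)$r.squared
summary(fit)$adj.r.squared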

9. The sample correlation coefficient ρ̂ (and consequently Adj R² and R²) has limitations
as a summary of the strength of relationships. In particular,

• ρ̂ is sensitive to outliers,
• ρ̂ is sensitive to skewed data distributions,
• ρ̂ takes the same numerical value for very different scatterplots, and
• a correlation of zero means no linear relationship but does not mean no
relationship at all.

Figure 2.11: Two examples of correlation ≈ 0

10. Thus the summary provided by a correlation coefficient can only be meaningfully in-
terpreted if we also have a scatterplot of the data. This problem can be overcome by
taking the initial steps in the modelling process (plotting the data, choosing trans-
formations etc.) but this is rarely done as the correlation coefficient is widely, and
incorrectly, viewed as a statistic which alleviates the need to examine the data. In any
case, once we have begun the modelling process, we may as well complete it properly.

11. A full linear regression model is far more useful in that it

• provides a complete description,


• can be used to draw conclusions about how the response changes (or can be
expected to change) as the explanatory variable changes,
• can be used to predict new values and,
• can be generalised in a variety of useful ways.

12. Calculation of a correlation coefficient complements this process, by summarising one
particular aspect of goodness-of-fit, but the diagnostic plots provide a far more com-
plete summary of the quality of the fit than ρ̂ or Adj R² ever can.

2.8 Quantifying uncertainty


1. Consider the least squares estimator β̂1 of the slope parameter β1. (This is usually of
greater interest than the intercept, but the following concepts apply to any estimator
of any model parameter.)

2. An estimate β̂1 of β1 is never exactly equal to β1 because different samples give dif-
ferent estimates. There is therefore uncertainty associated with an estimate. We can
describe this uncertainty by the sampling distribution of the estimator, which gives
the distribution of the estimator across all possible samples of the same size from the
population. The sampling distribution is the distribution of the estimator under the
assumed model.

3. Since Yi | Xi ∼ind N(β0 + β1 Xi, σ²) and β̂1 = Σ_{i=1}^{n} w1i Yi, we have:

Shape: β̂1 is normally distributed, as β̂1 is a linear combination of normally distributed
random variables Y1, . . . , Yn.

Mean:

E(β̂1) = Σ_{i=1}^{n} w1i E(Yi) = Σ_{i=1}^{n} w1i (β0 + β1 xi) = β1

as Σ_{i=1}^{n} w1i = 0 and Σ_{i=1}^{n} w1i xi = 1, so β̂1 is an unbiased estimator of β1.

Variance:

Var(β̂1) = Σ_{i=1}^{n} w1i² Var(Yi) = σ² Σ_{i=1}^{n} w1i² = σ² / ((n − 1) s²_X)

as Σ_{i=1}^{n} w1i² = 1 / ((n − 1) s²_X).

4. Putting everything together,

β̂1 ∼ N(β1, σ² / ((n − 1) s²_X)).

5. The variance of the slope estimator is made as small as possible for a fixed n by making
s²_X as large as possible. This can be achieved by putting the Xs at either end of their
feasible range. Note however that this optimal design leaves no possibility of checking
the linearity of the model. The effect of discrepant explanatory variables is to increase
s²_X and hence decrease Var(β̂1). This will result in a misleadingly small variance if the
corresponding Y value is incorrectly measured.

6. From the variance above, we obtain

s.d.(β̂1) = [σ² / ((n − 1) s²_X)]^{1/2}

as a measure of precision of the estimator β̂1. As this standard deviation depends on
the unknown parameter σ, we use the standard error

s.e.(β̂1) = [s² / ((n − 1) s²_X)]^{1/2},

where σ has been replaced by the estimate s (see Section 2.4), to summarise the
precision of β̂1.

7. Return to the numerical example introduced before. Show that s.e.(β̂0) = 1.214
and s.e.(β̂1) = 0.00964.
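These values can be verified in R either from the formulas or from the coefficient table of the fitted model (a sketch, continuing with x, n, s2, SXX and fit from the earlier sketches).

# Standard errors from the formulas
se_beta1 <- sqrt(s2 / SXX)                         # about 0.00964
se_beta0 <- sqrt(s2 * (1 / n + mean(x)^2 / SXX))   # about 1.214
c(se_beta0, se_beta1)
# The same values appear in the "Std. Error" column of the summary
summary(fit)$coefficients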

8. The treatment of β̂0 is entirely analogous. We have

E(β̂0) = Σ_{i=1}^{n} w0i E(Yi) = Σ_{i=1}^{n} w0i (β0 + β1 xi) = β0

as Σ_{i=1}^{n} w0i = 1 and Σ_{i=1}^{n} w0i xi = x̄ − x̄ = 0,

and

Var(β̂0) = Σ_{i=1}^{n} w0i² Var(Yi) = σ² Σ_{i=1}^{n} w0i² = σ² [1/n + x̄² / ((n − 1) s²_X)]

as

Σ_{i=1}^{n} w0i² = Σ_{i=1}^{n} [1/n² − (2/n) x̄ (xi − x̄) / ((n − 1) s²_X) + x̄² (xi − x̄)² / ((n − 1)² s⁴_X)]
               = 1/n + x̄² / ((n − 1) s²_X).

9. It follows that

β̂0 ∼ N(β0, σ² [1/n + x̄² / ((n − 1) s²_X)])

and

s.e.(β̂0) = s [1/n + x̄² / ((n − 1) s²_X)]^{1/2}.

10. Finally,

Σ_{i=1}^{n} w0i w1i = Σ_{i=1}^{n} [1/n − x̄ (xi − x̄) / ((n − 1) s²_X)] (xi − x̄) / ((n − 1) s²_X) = −x̄ / ((n − 1) s²_X)

so

Cov(β̂0, β̂1) = Cov(Σ_{i=1}^{n} w0i Yi, Σ_{i=1}^{n} w1i Yi)
            = Σ_{i=1}^{n} w0i w1i Var(Yi)
            = σ² Σ_{i=1}^{n} w0i w1i
            = −σ² x̄ / ((n − 1) s²_X).

2.9 Confidence Intervals


1. Recall from Section 1.3.1 that a confidence interval summarises the information jointly
conveyed by a parameter estimate and its standard error, by providing an interval of
values of the parameter which are plausible given the observed data.

2. In Section 1.3.1, we derived confidence intervals for a parameter of a simple normal
model by constructing a function involving the response variables Y1, . . . , Yn and the
parameter of interest (but no other parameter) whose distribution was known.

3. For regression models, the sampling distribution of S² depends only on σ², so we can
construct a confidence interval for σ² in a way which exactly parallels the construction
of the confidence interval for the variance σ² in Section 1.3.1.

4. However, it is usually of greater interest to evaluate a confidence interval for one of the
regression parameters β0 or (particularly) β1.

5. The sampling distributions of β̂0 and β̂1 derived in Section 2.8 cannot be used directly
in confidence interval construction, as they both depend on the unknown variance σ²,
as well as the parameter of interest. However, because we have the result

(n − 2) S² / σ² ∼ χ²_{n−2}

independent of β̂0 and β̂1 (see §1.6), it follows that

t(β̂1) = (β̂1 − β1) / s.e.(β̂1) ∼ t_{n−2}

and

t(β̂0) = (β̂0 − β0) / s.e.(β̂0) ∼ t_{n−2}.

6. As, for any given n, the t_{n−2} distribution is known, we can find c such that

P(−c ≤ t(β̂1) ≤ c) = 1 − α

for any given confidence level 1 − α.

Figure 2.12: t-density function

7. Hence, as

−c ≤ t(β̂1) ≤ c  ⇒  β̂1 − c s.e.(β̂1) ≤ β1 ≤ β̂1 + c s.e.(β̂1),

we have constructed a confidence interval for β1. Exactly the same argument can be
applied for β0. This is of the general form:

Estimate ± Critical Value × Standard Error.

8. Interpretation: The random interval will contain the true value of β1 with probability
1 − α.

9. Any particular interval does or does not contain β1; the probability is over the set of
possible intervals.

10. Practical consequence: 100(1 − α)% of confidence intervals contain β1 and 100α%
do not.

11. Return to the numerical example introduced before. Show that the 95% con-
fidence interval for β1 is (0.193, 0.237).

12. Find the 95% confidence interval for β0.
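In R the interval can be computed from the general formula or, more directly, with confint() (a sketch, continuing with fit, beta1hat and se_beta1 from the earlier sketches).

# 95% CI for beta1: estimate +/- critical value * standard error
cval <- qt(0.975, df = 8)
beta1hat + c(-1, 1) * cval * se_beta1    # about (0.193, 0.237)
# Built-in intervals for both beta0 and beta1
confint(fit, level = 0.95)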

2.10 Estimating the conditional mean


1. We sometimes need to estimate the conditional mean β0 + β1 x0 at a value x0 of X
[which may, or may not, be one of the values of X in our observed data]. A natural
estimator of the conditional mean is Ŷ0 ≡ β̂0 + β̂1 x0.

2. It follows from the sampling distributions of β̂0 and β̂1 that

E(Ŷ0) = β0 + β1 x0

so this estimator is unbiased, and

Var(Ŷ0) = Var(β̂0) + x0² Var(β̂1) + 2 x0 Cov(β̂0, β̂1)
        = σ² [1/n + x̄²/((n − 1) s²_X) + x0²/((n − 1) s²_X) − 2 x0 x̄/((n − 1) s²_X)]
        = σ² [1/n + (x0 − x̄)²/((n − 1) s²_X)]
        = σ² h00,

with h00 = 1/n + (x0 − x̄)²/((n − 1) s²_X), so that

β̂0 + β̂1 x0 ∼ N(β0 + β1 x0, σ² h00)

and

[β̂0 + β̂1 x0 − (β0 + β1 x0)] / (s h00^{1/2}) ∼ t_{n−2}.

3. Hence, we can obtain 100(1 − α)% confidence intervals of the form

[β̂0 + β̂1 x0 − c s h00^{1/2}, β̂0 + β̂1 x0 + c s h00^{1/2}]

where P(t_{n−2} < −c) = α/2. This is of the general form:

Estimate ± Critical Value × Standard Error.

4. Notice that the width of these confidence intervals depends on x0 and increases as x0
gets further from the sample mean x̄. That is, the best estimates are made in the
centre of the observed explanatory variables.
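In R, predict() with interval = "confidence" gives such an interval for the conditional mean at a chosen x0; the value x0 = 100 below is purely illustrative (a sketch, continuing with fit from the earlier sketches).

# Confidence interval for the conditional mean beta0 + beta1 * x0 at x0 = 100
new <- data.frame(x = 100)
predict(fit, newdata = new, interval = "confidence", level = 0.95)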

5. The fitted values are estimates of the conditional mean function at the observed values
of the covariates, so

E(Ŷi) = β0 + β1 xi,  Var(Ŷi) = σ² hii

and

Cov(Ŷi, Ŷj) = Cov(β̂0 + β̂1 xi, β̂0 + β̂1 xj)
           = Var(β̂0) + (xj + xi) Cov(β̂0, β̂1) + xi xj Var(β̂1)
           = σ² [1/n + x̄²/((n − 1) s²_X) − (xj + xi) x̄/((n − 1) s²_X) + xi xj/((n − 1) s²_X)]
           = σ² [1/n + (xi − x̄)(xj − x̄)/((n − 1) s²_X)]
           = σ² hij.
2.11 Predicting observations

1. Suppose that we want to predict the observation Y0 = β0 + β1 x0 + ε0 at a value x0
of X, where ε0 is independent of the errors in the sample data observations. This is
different from the inference problems we have considered so far because in those we
estimated a fixed but unknown constant parameter whereas here we are 'estimating' a
random variable. Nevertheless, the basic approach only needs minor modification.

2. Notice that Ŷ0 = β̂0 + β̂1 x0 satisfies

E(Ŷ0 − Y0) = 0

so Ŷ0 is an unbiased predictor of Y0. Making use of the fact that Y0, and hence ε0, is
independent of Y1, . . . , Yn, the prediction variance is

E(Ŷ0 − Y0)² = E(β̂0 + β̂1 x0 − β0 − β1 x0 − ε0)²
            = E(β̂0 + β̂1 x0 − β0 − β1 x0)² + E(ε0²)
            = σ² (1 + h00).

3. Using a similar approach to that in Section 2.10, we can obtain 100(1 − α)% prediction
intervals of the form

[β̂0 + β̂1 x0 − c s (1 + h00)^{1/2}, β̂0 + β̂1 x0 + c s (1 + h00)^{1/2}]

where P(t_{n−2} < −c) = α/2.

4. This is of the general form:

Estimate ± Critical Value × Standard Error.


5. The difference between estimating the conditional mean and prediction is important.
If we are interested in predicting daughter's height for a person with mother's height
x0 = 60, we are trying to predict an individual Y0. If, on the other hand, we are
going to examine a number of daughters with mother's height x0 = 60, we may be
interested in their average daughter's height or conditional mean. Since it is easier
to predict average properties than it is to predict individual properties, the intervals
for Y0 (a point on the scatterplot) are wider than those for β0 + β1 x0 (the position of
the regression line). Specifically, this is due to the fact that the prediction variance
incorporates the additional uncertainty in ε0. In practice, we are most often interested
in the more difficult problem of making predictions for individuals.

6. Prediction can be thought of either as interpolation (prediction within the range of
observed explanatory variables) or extrapolation (prediction beyond the range of ob-
served explanatory variables). The above equations apply equally to both cases but
it is important to note that extrapolation involves the implicit additional assumption
that the model holds in a global sense beyond the range of the observed explanatory
variable. This is a strong assumption which cannot be tested empirically and is often
known to be untrue. It is therefore best to view the model as a local approximation
and to avoid extrapolation unless there are special reasons to believe that the fitted
model holds 'globally'.
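The corresponding R calculation uses interval = "prediction"; comparing it with the confidence interval at the same x0 shows the extra width coming from the ε0 term (a sketch, continuing with fit and new from the previous sketch).

# Prediction interval for a single future observation Y0 at x0 = 100
predict(fit, newdata = new, interval = "prediction", level = 0.95)
# Narrower interval for the conditional mean at the same x0, for comparison
predict(fit, newdata = new, interval = "confidence", level = 0.95)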

2.12 Residuals

1. We can also write down the properties of the residuals

Ri = Yi − Ŷi = Yi − β̂0 − β̂1 xi.

2. This looks like but is not the same as the prediction problem because the observation
Yi is in the sample and hence cannot be independent of (β̂0, β̂1), which depend explicitly
on Y1, . . . , Yn. The mean and variance are

E(Ri) = 0

and

Var(Ri) = Var(Yi) + Var(β̂0 + β̂1 xi) − 2 Cov(Yi, β̂0 + β̂1 xi).

3. Now

Cov(Yi, β̂0 + β̂1 xj) = Cov(Yi, Σ_{k=1}^{n} (w0k + w1k xj) Yk)
                    = (w0i + w1i xj) Var(Yi)
                    = σ² [1/n − x̄ (xi − x̄)/((n − 1) s²_X) + xj (xi − x̄)/((n − 1) s²_X)]
                    = σ² [1/n + (xi − x̄)(xj − x̄)/((n − 1) s²_X)]
                    = σ² hij

so

Var(Ri) = σ² (1 + hii − 2 hii) = σ² (1 − hii)

and, for i ≠ j,

Cov(Ri, Rj) = Cov(Yi − β̂0 − β̂1 xi, Yj − β̂0 − β̂1 xj)
           = −Cov(Yi, β̂0 + β̂1 xj) − Cov(Yj, β̂0 + β̂1 xi) + Cov(β̂0 + β̂1 xi, β̂0 + β̂1 xj)
           = σ² (−hij − hij + hij)
           = −σ² hij.

4. Note that, in general, hij ≠ 0, which implies that possibly there is correlation between
Ri and Rj, even though we assumed that εi and εj are independent. This is because
Ri and Rj are both dependent on β̂0 and β̂1. The correlation between Ri and Rj is
defined by:

Corr(Ri, Rj) = Cov(Ri, Rj) / [Var(Ri) Var(Rj)]^{1/2}
            = −σ² hij / [σ²(1 − hii) σ²(1 − hjj)]^{1/2}
            = −hij / [(1 − hii)(1 − hjj)]^{1/2}.

5. See the general definition of correlation in Section 2.7.

6. The residuals are useful for examining patterns and have nice simple distributional
properties. It is sometimes useful to scale them, giving the standardised residuals

Ri / s.

We expect about 5% of the |Ri/s| to exceed 2 and very few to exceed 3 in normally
distributed data. The fact that the variance of the residuals depends on i does not
usually matter (and this dependence diminishes as n increases). When this is a concern,
we may use the studentised residuals

Ri / [s (1 − hii)^{1/2}]

instead.
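Note that R's naming differs slightly from the terminology used here: rstandard() divides by s(1 − hii)^{1/2}, so it returns what these notes call studentised residuals. A brief sketch, continuing with fit from the earlier sketches:

# Residuals scaled by s only (the 'standardised residuals' of these notes)
resid(fit) / summary(fit)$sigma
# Residuals scaled by s * sqrt(1 - h_ii) (the 'studentised residuals' of these notes);
# R calls these standardized residuals
rstandard(fit)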

2.13 Hypothesis tests


1. Suppose that we conjecture that one of the parameters β0 or β1 takes a particular
specified value. We can formally examine whether our conjectured value is consistent
with the observed data by performing a hypothesis test. The null hypothesis (denoted
by H0) corresponds to the conjecture of interest which here, without loss of generality,
we shall assume involves β1. Hence, we write

H0 : β1 = β1*,

where β1* is the conjectured value.

2. The basic idea of testing is to take our estimator of β1 and see how close it is to β1*. If
it is close to β1*, then we conclude that the data support H0; if it is far from β1*, then
we conclude there is evidence against H0. As described in Section 1.3.2, we summarise
the evidence against H0 by computing a p-value, which is the probability under (the
model and) H0 of observing a value of β̂1 as far from β1* or further than that actually
observed.

3. In order to calculate a p-value, we require a test statistic involving β̂1, the distribution
of which is known under the null hypothesis. From Section 2.9, we know that

t(β̂1) = (β̂1 − β1) / s.e.(β̂1) ∼ t_{n−2}.

4. Hence, the test statistic

t*(β̂1) = (β̂1 − β1*) / s.e.(β̂1)

has a t_{n−2} distribution if β1 = β1*, that is, when H0 is true. Note that t*(β̂1) is a
statistic, as it can be calculated using data values alone. No knowledge of any unknown
parameter is required.

5. We compute the p-value for the test by examining how probable it is that a t_{n−2}
distribution would generate a value of t*(β̂1) as, or more, extreme than the observed
data value. We can write the calculation succinctly as

p = P(|T| ≥ |t*(β̂1)|) = 2 P(T ≤ −|t*(β̂1)|),

where T denotes a random variable with a t_{n−2} distribution, and t*(β̂1) is the observed
value of the test statistic.

Figure 2.13: The p-value calculation.

6. This a two-sided test (we are checking for departures from H0 in either direction).
Directional tests are possible but relatively uncommon in practice, and we will not
develop them here.

7. The p-value is a way of trying to present "evidence" regarding H0. We sometimes
need to make a decision to accept or reject H0. This is more complicated theoretically
(requiring specification of costs which change for each problem) but in practice can be
done by setting agreed cut-off levels for p. Conventionally, in scientific problems:

0.1 < p: no evidence against H0,
0.05 < p ≤ 0.1: weak evidence against H0,
0.01 < p ≤ 0.05: moderate evidence against H0 (reject at the 5% level),
p ≤ 0.01: strong evidence against H0 (reject at the 1% level).
8. Note that we reject H0 when the evidence against is strong, in the sense that the
observed data are unlikely to have been generated by a model consistent with H0 .
When H0 has not been rejected, that is exactly what has happened – it has not been
rejected (and neither has it been ‘accepted’).

9. Return to the numerical example introduced before. We showed that β̂1 =
0.215, s.e.(β̂0) = 1.214 and s.e.(β̂1) = 0.00964. Suppose we want to test H0 : β1 = 0.
Then we have

t*(β̂1) = (β̂1 − β1*) / s.e.(β̂1) = (0.215 − 0) / 0.00964 = 22.30.

Here the p-value will be 2 P(T ≤ −|22.30|), where T follows the standard t distribution
with n − 2 = 8 degrees of freedom. Using the R command 2 * pt(-22.30, df=8) we
calculate the p-value to be 1.73 × 10⁻⁸, which shows that there is significant evidence
against H0.

10. Find the p-value for testing H0 : β0 = 0.
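The t-statistics and p-values for H0 : β1 = 0 and H0 : β0 = 0 are exactly what R reports in the coefficient table of summary(); the test for β0 can also be done by hand (a sketch, continuing with beta0hat, se_beta0 and fit from the earlier sketches).

# Coefficient table: estimates, standard errors, t values and p-values
summary(fit)$coefficients
# The test of H0: beta0 = 0 by hand
tobs <- beta0hat / se_beta0
2 * pt(-abs(tobs), df = 8)   # a large p-value, so no real evidence against H0: beta0 = 0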

2.14 Model comparison

1. The most common test of interest for a simple linear regression model is the test of the
hypothesis that β1 = 0. The reason that this test is so important is that it provides a
formal way of comparing the linear regression model

Yi ∼ind N(β0 + β1 xi, σ²),  i = 1, . . . , n,

where the distribution of the response variable Y explicitly depends on the explanatory
variable X through a linear relationship in its conditional mean, with the simpler (more
parsimonious) model

Yi ∼ind N(β0, σ²),  i = 1, . . . , n,

where there is no relationship between the variables Y and X.
2. These models can be compared by testing the hypothesis H0 : β1 = 0 within the linear
regression model. If this hypothesis is not rejected, then we can draw the conclusion
that the simpler model is adequate for the data, and there is no reason to entertain
the more complex model – there is no apparent relationship between Y and X (or at
least not one which can be explained by a simple linear regression).
3. Here, and more generally when using a hypothesis test to compare two models, the
null hypothesis corresponds to the simpler model. Furthermore, the null hypothesis has
special status in a hypothesis test – it is not rejected unless the data provide significant
evidence against it. This is consistent with a desire for parsimony in modelling. We
do not reject the simpler model in favour of the more complex model unless the data
provide significant evidence that we should (that the simpler model is a poor fit to the
data in comparison with the more complex model).
4. The two models above can be compared from a different perspective which will be
useful later (where we want to perform more general model comparisons).

• When we fit the model Yi ∼ind N(β0 + β1 Xi, σ²), i = 1, . . . , n, the quality of the fit is
reflected by the residual sum of squares (RSS) (n − 2)s² = Σ_{i=1}^{n} ri².

• Under the simpler model (which we still denote H0), Yi ∼ind N(β0, σ²), i = 1, . . . , n,
the least squares estimator of β0 is β̂0 = Ȳ, so the residual sum of squares (RSS)
is Σ_{i=1}^{n} (yi − ȳ)² = (n − 1)s²_Y.

– Although we use the same symbol β0 in both models, we have β0 = E(Yi | Xi =
0), or β0 is the intercept, in the simple regression model and β0 = E(Yi), or β0
is the mean, in the reduced model. This raises a point which is very important
in general: the interpretation of a parameter depends on the context, and this
is not reflected in the notation.

• We can compare the two models by examining

RSS under H0 − RSS under regression model = (n − 1)s²_Y − (n − 2)s²
                                          = Σ_{i=1}^{n} (yi − ȳ)² − Σ_{i=1}^{n} ri².

Now we have

Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} (yi − ŷi + ŷi − β̂0 − β̂1 x̄)²
                     = Σ_{i=1}^{n} (ri + β̂1 (xi − x̄))²
                     = Σ_{i=1}^{n} ri² + β̂1² Σ_{i=1}^{n} (xi − x̄)² + 2 β̂1 Σ_{i=1}^{n} ri (xi − x̄)
                     = β̂1² (n − 1) s²_X + Σ_{i=1}^{n} ri²

as the normal equations imply that ȳ = β̂0 + β̂1 x̄ and

Σ_{i=1}^{n} ri (xi − x̄) = 0.

Hence,

(n − 1) s²_Y = β̂1² (n − 1) s²_X + (n − 2) s²,

which is sometimes referred to as the analysis of variance decomposition.

5. For the linear regression model, we call the first term the total sum of squares, the
second the regression sum of squares and the third the residual sum of squares. Hence,
we have

Total Sum of Squares = Regression Sum of Squares + Residual Sum of Squares.

6. Now, as suggested above, we can reformulate the comparison of the two models by
considering a test statistic based on the improvement in fit provided by the regression
model over the simpler model:

Σ_{i=1}^{n} (yi − ȳ)² − Σ_{i=1}^{n} ri² = β̂1² (n − 1) s²_X = Regression SS.

If the regression model fits the data significantly better than the simple (null) model
H0, then the regression sum of squares will be large. But how do we determine if the
regression sum of squares is so large that we should reject H0?

7. It is clear that, under H0 : β1 = 0, the regression sum of squares β̂1² (n − 1) s²_X, divided
by σ², has a χ²_1 distribution. Furthermore, as we shall prove later, it is independent of
S². Hence, we consider the test statistic

F = Regression Mean Square / Residual Mean Square
  = β̂1² (n − 1) s²_X / s²
  = [β̂1² (n − 1) s²_X / σ²] / [s² / σ²]
  = (X²_1 / 1) / (X²_{n−2} / (n − 2))

where X²_1 has a χ²_1 distribution and X²_{n−2} has a χ²_{n−2} distribution. We call the distribu-
tion of such a ratio an F distribution. The F distribution depends on two parameters
corresponding to the degrees of freedom in the numerator and denominator. Hence,
we denote the F distribution above by F_{1,n−2}. It is a 'known' distribution, properties
of which can be obtained from statistical software, or statistical tables.

8. The comparison of the null model and regression model based on the F-ratio is iden-
tical to a test of the same hypothesis based on the t-statistic because

• F = β̂1² (n − 1) s²_X / s² = (β̂1 / [s / ((n − 1) s²_X)^{1/2}])² = (β̂1 / s.e.(β̂1))²,

• F_{1,k} ∼ (χ²_1 / 1) / (χ²_k / k) ∼ N(0, 1)² / (χ²_k / k) ∼ (N(0, 1) / [χ²_k / k]^{1/2})² ∼ t²_k.

9. The analysis of variance decomposition and the associated F-test are often presented
in an ANOVA table. The ANOVA table for comparing the simple linear regression
model (Yi = β0 + β1 xi + εi) with the null model (Yi = β0 + εi) is given by:

Source       Df     Sum of Squares        Mean SS               F Value                 P Value
Regression   1      β̂1²(n−1)s²_x          β̂1²(n−1)s²_x          β̂1²(n−1)s²_x / s²       Pr(F > F Value)
Residuals    n−2    (n−2)s²               s²
Total        n−1    Σ_{i=1}^{n}(yi−ȳ)²

10. Sometimes, the ANOVA table is equivalently written as:

Source       Df     Sum of Squares    Mean Squares           F Value                                 P Value
Regression   1      Regression SS     Regression SS / 1      Mean Regression SS / Mean Residual SS   Pr(F > Fobs)
Residuals    n−2    Residual SS       Residual SS / (n−2)
Total        n−1    Total SS

where

Regression SS = β̂1² Σ_{i=1}^{n} (xi − x̄)²,

Residual SS = Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi)²

and Total SS is simply the sum of the Regression SS and Residual SS and can otherwise
be obtained as (n − 1)s²_y = Σ_{i=1}^{n} (yi − ȳ)².
11. Satisfy yourself that the ANOVA table for the numerical example is given by:

            Df   Sum Sq    Mean Sq   F value   P-value
x            1   1315.24   1315.24    497.28    0.0000
Residuals    8     21.16      2.64
==============================================
Total        9   1336.40

12. The P-value, observed to be 0.0000, says that the linear regression model is significant.
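The table can be reproduced with anova() (a sketch, continuing with fit from the earlier sketches); R prints the Regression (labelled x) and Residuals rows and omits the Total row, which is simply their sum.

# ANOVA table for the simple linear regression fit
anova(fit)
# Total SS = Regression SS + Residual SS, about 1336.4
sum(anova(fit)[["Sum Sq"]])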

2.15 The paradigm


The basis for regression modelling is a strategy based on a clear sequence of steps. While
these may be obvious from the example presented so far, it is helpful to make the strategy
explicit. The steps are as follows:

1. Propose a plausible model

(a) Look at the data


(b) Use a scatterplot to try to choose scales (transformations) so a simple linear
regression model is appropriate.

2. Ensure that the model is adequate

(a) Fit the model (estimate the parameters)


(b) Assess the quality of fit using the two basic diagnostic plots
(c) If necessary in the light of 3, modify the model and repeat (a-b) until an appro-
priate model is obtained.

3. Present, interpret and use the model to make inferences and predictions, remembering
always to appropriately summarise uncertainty.

2.16 Bodyfat Example


1. Knowledge of the fat content of the human body is physiologically and medically
important. The fat content may influence susceptibility to disease, the outcome of
disease, the effectiveness of drugs (especially anaesthetics) and the ability to withstand
adverse conditions including exposure to cold and starvation. In practice, fat content
is difficult to measure directly – one way is by measuring body density which requires
subjects to be weighed underwater! For this reason, it is useful to try to relate simpler
measures such as skinfold thicknesses (which are readily measured using calipers) to
body fat content and then use these to estimate the body fat content.

2. Dr R. Telford collected skinfold (the sum of four skinfold measurements) and percent
body fat measurements on 102 elite male athletes training at the Australian Institute
of Sport.

3. The scatterplot of the data shows that the percent body fat increases linearly with the
skinfold measurement and that the variation in percent body fat also increases with
the skinfold measurement. There do not seem to be any unusual data points and there
are no other obvious patterns to note.

4. It is often the case that we can simplify the relationships exhibited by data by trans-
forming one or more of the variables. In the example, we want to preserve the simple
linear relationship for the conditional mean but stabilise the variability (i.e. make it
constant) so that it is easier to describe. We therefore try transforming both vari-
ables. There is no theoretical argument to suggest a transformation here but empirical
experience suggests that we try the log transformation.

5. The scatterplot on the log scale shows that the log transformation of both variables
preserves the linear conditional location relationship and stabilises the conditional vari-
ability.

6. Let Yi denote the log percent body fat measurement and Xi the log skinfold measure-
ment on athlete i, where i = 1, . . . , 102. Then we have

E(Yi | xi) = β0 + β1 xi  and  Var(Yi | xi) = σ².

7. After examining the diagnostics, we conclude that the linear regression model provides
a good approximation to the underlying population. Specifically,

Ŷ = −1.25 + 0.88x
    (0.097) (0.025)

(standard errors in parentheses), with a variance estimate of S² = 0.0067 on 100 degrees
of freedom.

Parameter Estimates
Term Estimate Std Error t Ratio Prob>|t|
Intercept –1.250 0.097 –12.940 0.000
log skinfold 0.882 0.024 35.590 0.000

Analysis of Variance
Source DF SS MS F Prob>F
Model 1 8.5240 8.5240 1266.3 0.0000
Error 100 0.6731 0.0067
C Total 101 9.1971

8. A 95% confidence interval for the population slope parameter β1 is (0.83, 0.93).

9. A common test of interest is the test of the hypothesis that β1 = 0, i.e. that skinfold is not
useful for predicting bodyfat. The 95% confidence interval for β1 does not contain zero so
we can conclude that there is evidence against the hypothesis and that skinfold is a useful
variable in the model for predicting body fat. A formal hypothesis test can be carried out by
computing the t-ratio 35.59, which has p-value zero (to 4 decimal places). We conclude that
β1 is significantly different from zero.

10. A test of the hypothesis β1 = 1 (bodyfat is proportional to skinfold) is also of interest.
The test statistic for this hypothesis is (0.88225 − 1)/0.02479 = −4.75, so we also reject the
proportionality hypothesis.

11. A 95% prediction interval for the log body fat percentage of a single male athlete with a
skinfold of 70 is (2.334, 2.663).

12. Although in our body fat example we have worked on the simpler log scale, it may be useful
to make predictions on the raw scale. Note that we can transform the fitted model to show
that on the raw scale, the fitted model is

Fitted % Body fat = exp(−1.25) × (Skinfold)^0.88,

a multiplicative model which is often biologically interpretable. Since a 95% prediction
interval for the log body fat percentage of a single male athlete with a skinfold of 70 is
(2.334, 2.663), we can convert this interval to the raw scale by exponentiating the endpoints.
That is, an approximate 95% prediction interval for the body fat percentage of a single male
athlete with a skinfold of 70 is (exp(2.334), exp(2.663)) = (10.32, 14.34). In general, if we
want to set prediction intervals on the raw scale, we can transform the endpoints of the
prediction intervals calculated on the log scale to obtain approximate prediction intervals on
the raw scale.

13. The validity of the inferences depends on the representativeness of the sample and the validity
of the model. It is vital when collecting data to ensure that it is representative of the
population of interest and before making inferences to ensure that the model is appropriate.
Strictly speaking, model assessment is a form of informal inference and has an impact on
other inferences but it is not sensible to simply hope that a model is valid when empirical
verification is available.

2.17 Code for analysing the bodyfat example


bodyfat <- read.table("R:/math2010/Data/bodyfatm.txt", head=T)

plot(bodyfat$Skinfold, bodyfat$Bodyfat, xlab="Skin", ylab="Fat")


plot(bodyfat$Skinfold, log(bodyfat$Bodyfat), xlab="Skin", ylab="log Fat")
plot(log(bodyfat$Skinfold), log(bodyfat$Bodyfat), xlab="log Skin", ylab="log Fat")

par(mfrow=c(2,2)) # draws four plots in a graph

plot(bodyfat$Skinfold, bodyfat$Bodyfat, xlab="Skin", ylab="Fat")


plot(bodyfat$Skinfold, log(bodyfat$Bodyfat), xlab="Skin", ylab="log Fat")
plot(log(bodyfat$Skinfold), log(bodyfat$Bodyfat), xlab="log Skin", ylab="log Fat")

plot(1:5, 1:5, axes=F, xlab="", ylab="", type="n")


text(2, 2, "Log both X and Y")
text(2, 1, "To have the best plot")

#Keep the transformed variables in the data set

bodyfat$logskin <- log(bodyfat$Skinfold)


bodyfat$logbfat <- log(bodyfat$Bodyfat)

bfat.lm <- lm(logbfat ~ logskin, data=bodyfat)

# Check the diagnostics

plot(bfat.lm$fit, bfat.lm$res, xlab="Fitted values", ylab = "Residuals")


abline(h=0)
# Should be a random scatter

qqnorm(bfat.lm$res)
qqline(bfat.lm$res)

# All Points should be on the straight line


summary(bfat.lm)
anova(bfat.lm)

plot(bodyfat$logskin, bodyfat$logbfat, xlab="log Skin", ylab="log Fat")


abline(bfat.lm, col=7)
title("Scatter plot with the fitted Linear Regression line")

# 95% CI for beta(1)


0.88225 + c(-1, 1) * qt(0.975, df=100) * 0.02479
round(0.88225 + c(-1, 1) * qt(0.975, df=100) * 0.02479, 2)

# To test H0: beta1 = 1.


tstat <- (0.88225 -1)/0.02479
pval <- 2 * (1- pt(abs(tstat), df=100))

par(mfrow=c(1,1))
x <- seq(from=-5, to=5, length=500)
y <- dt(x, df=100)
plot(x, y, xlab="", ylab="", type="l")
title("T-density with df=100")
abline(v=abs(tstat))
abline(h=0)

x1 <- seq(from=abs(tstat), to=10, length=100)


y1 <- rep(0, length=100)
x2 <- x1
y2 <- dt(x1, df=100)
segments(x1, y1, x2, y2)
abline(h=0)

# Predict at a new value of Skinfold=70

# Create a new data set called new

new <- data.frame(logskin=log(70))

a <- predict(bfat.lm, newdata=new, se.fit=T)

# Confidence interval for the mean of log bodyfat at skinfold=70


a <- predict(bfat.lm, newdata=new, interval="confidence")
# a
# fit lwr upr
# [1,] 2.498339 2.474198 2.52248

# Prediction interval for a future log bodyfat at skinfold=70


a <- predict(bfat.lm, newdata=new, interval="prediction")
# a
# fit lwr upr
# [1,] 2.498339 2.333783 2.662895

#prediction intervals for the mean


pred.bfat.clim <- predict(bfat.lm, data=bodyfat, interval="confidence")

#prediction intervals for future observation


pred.bfat.plim <- predict(bfat.lm, data=bodyfat, interval="prediction")

plot(bodyfat$logskin, bodyfat$logbfat, xlab="log Skin", ylab="log Fat")


abline(bfat.lm, col=5)

lines(log(bodyfat$Skinfold), pred.bfat.clim[,2], lty=2, col=2)


lines(log(bodyfat$Skinfold), pred.bfat.clim[,3], lty=2, col=2)
lines(log(bodyfat$Skinfold), pred.bfat.plim[,2], lty=4, col=3)
lines(log(bodyfat$Skinfold), pred.bfat.plim[,3], lty=4, col=3)
title("Scatter plot with the fitted line and prediction intervals")

symb <- c("Fitted line", "95% CI for mean", "95% CI for observation")
cat("Please click on an empty space on the graphics window\n")
legend(locator(1), legend = symb, lty = c(1, 2, 4), col = c(5, 2, 3))

# Shows where we predicted earlier


abline(v=log(70))

# R output

> summary(bfat.lm)

Call:

lm(formula = logbfat ~ logskin, data = bodyfat)

Residuals:
Min 1Q Median 3Q Max
-2.234e-01 -5.566e-02 5.226e-05 5.639e-02 1.530e-01

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.24988 0.09661 -12.94 <2e-16 ***
logskin 0.88225 0.02479 35.59 <2e-16 ***
---

Residual standard error: 0.08205 on 100 degrees of freedom


Multiple R-squared: 0.9268, Adjusted R-squared: 0.9261
F-statistic: 1266 on 1 and 100 DF, p-value: < 2.2e-16

> anova(bfat.lm)
Analysis of Variance Table

Response: logbfat
Df Sum Sq Mean Sq F value Pr(>F)
logskin 1 8.5240 8.5240 1266.3 < 2.2e-16 ***
Residuals 100 0.6731 0.0067
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 0.1 1

2.18 Chapter Recap
1. The simple linear regression model is given by:

Yi = β0 + β1 xi + εi,  εi ∼ind N(0, σ²),  i = 1, . . . , n.

2. The least squares estimates are given by: β̂0 = ȳ − β̂1 x̄ and

β̂1 = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)² = Σ_{i=1}^{n} (xi − x̄) yi / Σ_{i=1}^{n} (xi − x̄)² = (Σ_{i=1}^{n} xi yi − n x̄ ȳ) / (Σ_{i=1}^{n} xi² − n x̄²).

All these formulae are equivalent for β̂1. We use the form which is most convenient for
the purpose, e.g. for proving theorems we use the form in the middle, Σ_{i=1}^{n} w1i yi, and
for numerical calculation we use the last form.

3. Fitted values and residuals are defined as: ŷi = β̂0 + β̂1 xi, and ri = yi − ŷi.

4. We use residual plots to check model assumptions.

(a) We plot the residuals against the fitted values to check for homoscedasticity. Plot
should look like a random scatter if this assumption holds.
(b) We order and plot the residuals against the normal order statistics to check for
normality. This plot should be a straight line if the normality assumption holds.

5. Define notations:

s²_x = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)²,  s²_y = (1/(n − 1)) Σ_{i=1}^{n} (yi − ȳ)²,

sxy = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ),  sxx = Σ_{i=1}^{n} (xi − x̄)²,  syy = Σ_{i=1}^{n} (yi − ȳ)².

6. Define the sample correlation coefficient r = sxy / ((n − 1) sx sy). Now,

β̂1 = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)² = sxy / ((n − 1) s²_x) = [sxy / ((n − 1) sx sy)] (sy / sx) ≡ r sy / sx.

7. We have:

Var(β̂1) = σ² / Σ_{i=1}^{n} (xi − x̄)²,  Var(β̂0) = σ² [1/n + x̄² / Σ_{i=1}^{n} (xi − x̄)²],  Cov(β̂0, β̂1) = −σ² x̄ / Σ_{i=1}^{n} (xi − x̄)².

8. It can be shown that:

Residual SS = Σ_{i=1}^{n} ri² = Σ_{i=1}^{n} (yi − ȳ)² − β̂1² Σ_{i=1}^{n} (xi − x̄)².

9. Residual df = number of observations − number of parameters estimated = n − 2.

10. Estimate σ² by s² = Residual SS / Residual df = (1/(n − 2)) {Σ_{i=1}^{n} (yi − ȳ)² − β̂1² Σ_{i=1}^{n} (xi − x̄)²}.

11. The residual standard error is σ̂, which is given by s.

12. Estimate the conditional mean of Y0 for a new x0 by β̂0 + β̂1 x0. Its variance is
σ² [1/n + (x0 − x̄)² / Σ_{i=1}^{n} (xi − x̄)²] = σ² h00.

13. Predict a future observation Y0 for a new x0 by β̂0 + β̂1 x0. Its variance is σ² (1 + h00).

14. The standard error of an estimate of a parameter is the estimated value of the square
root of the variance of the corresponding estimator.

15. To construct a confidence interval, use the general formula: Estimate ± Critical
Value × Standard Error. Use this to find intervals for β0 and β1, the conditional
mean, and the future observation Y0 for a new x0.

16. The critical value is an appropriate quantile of the t-distribution with n 2 df. For
95% intervals, the R command is qt(0.975, df=n-2).

17. The critical value can be found from the table in the Formula Sheet. Choose the row
using the residual degrees of freedom and the column using the confidence level. For
example, when n = 12, for a 95% confidence interval use the value 2.23, which is at the
row for k = 10 and at the column for α = 0.025.

18. To test H0 : β1 = 0 against H1 : β1 ≠ 0,

(a) obtain tobs = β̂1 / Standard Error(β̂1).
(b) Calculate P-value = 2 P(t_{n−2} > |tobs|). P-values are reported by R in the last
column of the summary table for a fitted linear model.
(c) Reject H0 if the P-value is less than α, where α is the level of significance,
usually 0.05.

19. The ANOVA table for comparing the simple linear regression model with the null
model (Yi = β0 + εi) is given by:

Source       Df     Sum of Squares        Mean SS               F Value                 P Value
Regression   1      β̂1²(n−1)s²_x          β̂1²(n−1)s²_x          β̂1²(n−1)s²_x / s²       Pr(F > F Value)
Residuals    n−2    (n−2)s²               s²
Total        n−1    Σ_{i=1}^{n}(yi−ȳ)²

20. We often report the R²:

R² = 1 − (n − 2)s² / ((n − 1)s²_y) = 1 − Residual SS / Total SS = Model SS / Total SS = β̂1² (n − 1)s²_x / ((n − 1)s²_y) = β̂1² s²_x / s²_y.

21. In this case of linear regression:

R² = β̂1² s²_x / s²_y = (r sy / sx)² (s²_x / s²_y) = r² = (Sample Correlation Coefficient)².

22. Adjusted R² is the proportionate reduction in error variance achieved by the model:

Adj R² = 1 − s² / s²_y.
