Generalised Linear Models - Sheffield (2017-2018)


MAS367/MAS467/MAS6003 Linear and Generalised Linear Models


Module Information

1. This module is taught by Dr Kostas Triantafyllopoulos. The module has a MOLE page, and
all course materials will be available there. There is more detailed information about the
structure of the module in MOLE. Below is a brief summary.

2. You can use the discussion boards on MOLE for asking questions, and you can post questions
anonymously.

3. There are three sets of non-assessed exercises, which will be available in due course on MOLE.
Deadlines for these will be given on the discussion board. You can ask for help with these
exercises at any time.

4. There is a project, worth 30% of the assessment for MAS467. More details and deadlines will
be posted on MOLE.

5. There are exercises (“Tasks”) distributed throughout the notes. You should attempt these as
we reach them during the semester. Solutions are not given (typically the tasks involve working
through examples in the notes), but you may ask for help with the tasks at any time.

6. You can contact me and arrange to see me if you want to discuss anything about the module.
Contents

1 The Linear Model, Estimation and Hypothesis Testing 6


1.1 Model definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.2 Matrix formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.3 The linear model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.4 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.5 An example of the design matrix X - polynomial regression . . . . . . . . . . . . . 8
1.2 Estimating model parameters using the least squares method . . . . . . . . . . . . . . . . 8
1.2.1 Fitted values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.2 Residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.3 LS estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Fitting linear models in R: the tractor data . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Properties of the LS estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.1 Gauss-Markov conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.2 LS estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4.3 Covariance of parameter estimates - the variance-covariance matrix . . . . . . . . . 12
1.4.4 Estimating the error variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4.5 Further application to the tractor data . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5 Distributions of estimators and residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.6 Maximum likelihood estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.6.1 The likelihood function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.6.2 MLEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.7 Model fit: coefficient of determination R2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.8 Constructing intervals when predicting the response at a new explanatory value . . . . . . 16
1.9 Confidence intervals for components of β . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.10 Confidence intervals for model parameters in the tractor data . . . . . . . . . . . . . . . . 18
1.11 Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.11.1 The linear hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.11.2 Testing the effect of a subset of regressor variables . . . . . . . . . . . . . . . . . . 21
1.11.3 Testing nested models using F tests . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.11.4 An example of the anova command in R . . . . . . . . . . . . . . . . . . . . . . . . 24

1.11.5 Limitations of using sequential sums of squares as a variable selection tool . . . . . 25
1.11.6 Using anova tables to perform tests on subsets of variables . . . . . . . . . . . . . 26
1.12 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.12.1 Random vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.12.2 The multivariate normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2 Residuals, Transformations, Factors and Variable Selection 31


2.1 Checking model assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.1.1 Residuals for model checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.1.2 Standardized residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.1.3 Residual plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.1.4 Checking normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.1.5 Checking independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.1.6 Checking homoscedasticity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2 Formal tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.1 Correcting for multiple testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.2 Standardized residuals for the tractor data . . . . . . . . . . . . . . . . . . . . . . 34
2.3 Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.1 Transforming to restore homoscedasticity . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.2 Estimating a transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.3 Simulated data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4 Factors and interaction plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.1 Introduction to factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.2 Design matrices for factor variables - the corrosion data . . . . . . . . . . . . . . . 41
2.4.3 Calculation by R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.4.4 Factors coded as quantitative variables . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.4.5 Multicollinearity - when X is not full rank . . . . . . . . . . . . . . . . . . . . . . . 45
2.4.6 Resolving over-parametrization using constraints . . . . . . . . . . . . . . . . . . . 45
2.4.7 Factors in the tractor data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.4.8 More R manipulation of the tractor data . . . . . . . . . . . . . . . . . . . . . . . . 49
2.4.9 Two factor variables - the crop data . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.4.10 Interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.5 Variable selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.5.1 Variable selection in the corrosion data . . . . . . . . . . . . . . . . . . . . . . . . 55
2.5.2 Penalized likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.5.3 Akaike’s information criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.5.4 The Schwarz criterion/Bayesian information criterion . . . . . . . . . . . . . . 58
2.5.5 What pairs of models can we compare by testing? . . . . . . . . . . . . . . . . . . 58
2.5.6 Non-nested models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

2.5.7 Stepwise automated selection in R . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

3 Generalized Linear Models (GLMs): Basic Facts 64


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.1.1 Standard linear model assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.1.2 Why we need to extend linear models theory . . . . . . . . . . . . . . . . . . . . . 65
3.1.3 Reminder - inverse functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.1.4 Generalized linear model assumptions . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.1.5 Example: toxicity trial in beetles . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.2 Fitting a GLM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.2.1 Calculating fitted values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.3 Distributional properties in GLMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3.1 Allowed distributions in GLMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3.2 Background on the Score Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.3.3 Mean and variance for GLM distributions . . . . . . . . . . . . . . . . . . . . . . . 73
3.3.4 Common GLM distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.3.5 The Canonical Link . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.3.6 Why is the canonical link useful? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.4 Estimation of parameters - general theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.4.1 Covariance matrix for the mles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.5 Estimation of parameters - Iteratively reweighted least squares . . . . . . . . . . . . . . . 76
3.6 Scaled deviance and (residual) deviance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.6.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.6.2 Testing model fit using deviance (approximately) . . . . . . . . . . . . . . . . . . . 80
3.6.3 Testing model fit using pseudo R2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.6.4 Deviances for common distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.7 Comparing nested models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.7.1 Applying GLRT to GLM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.7.2 Model building - Analysis of Deviance . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.7.3 Analysis of deviance in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.7.4 Hypothesis tests on a single parameter . . . . . . . . . . . . . . . . . . . . . . . 82
3.7.5 Remember that p-values are only approximate . . . . . . . . . . . . . . . . . . . . 83
3.8 Residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.8.1 Distribution of residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.8.2 Pearson residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.8.3 Deviance residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.9 Estimating the scale parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.10 Quasi-likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

4 Binary Response Data 86

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.2 Binomial likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.3 Links for binomial responses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.3.1 Choosing a link function for the beetle data of chapter 3 . . . . . . . . . . . . . . . 87
4.4 Model building, assessing model fit and residuals . . . . . . . . . . . . . . . . . . . . . . . 88
4.4.1 Model building for the beetle data . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.4.2 Analysis of deviance for the beetle data . . . . . . . . . . . . . . . . . . . . . . . . 91
4.4.3 Calculating pseudo R2 for the beetle quadratic model . . . . . . . . . . . . . . 91
4.4.4 Calculating residuals for the beetle quadratic model . . . . . . . . . . . . . . . . . 92
4.4.5 Example 2 of model building: plant anthers . . . . . . . . . . . . . . . . . . . . . . 92
4.5 Over-dispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.5.1 An example of the use of over-dispersion: grain beetle deaths . . . . . . . . . . . . 94
4.6 Odds and odds ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.6.1 Xi is a factor with 2 levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.6.2 Xi is a factor with 3 levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.6.3 Xi is continuous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.6.4 Multiple explanatory variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.6.5 Confidence intervals for odds ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

5 Poisson Regression and Models for Non-Negative Data 99


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2 Modelling counts using the Poisson distribution . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2.1 Example: AIDS deaths over time . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.3 Adjusting for exposure: offsets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.3.1 Example: Smoking and heart disease . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.4 Non-negative data with variance ∝ mean . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

6 Two-way Contingency Tables 104


6.1 Types of 2-way tables - response/controlled variables . . . . . . . . . . . . . . . . . . . . . 104
6.1.1 Case(a): Skin cancer (melanoma) data - 2 response variables . . . . . . . . . . . . 104
6.1.2 Case(b): Flu vaccine data - 1 response and 1 controlled variable . . . . . . . . . . 104
6.2 Notation for two-way tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.2.1 Association, Independence and Homogeneity . . . . . . . . . . . . . . . . . . . . . 105
6.3 Distributions for two-way tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.3.1 Case (a): two response variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.3.2 Case (b): one response variable. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.3.3 Case (c): independent Poissons (no fixed margins). . . . . . . . . . . . . . . . . . . 106
6.3.4 Expected values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.4 GLMs and two-way contingency tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.4.1 Natural hypotheses are log-linear models . . . . . . . . . . . . . . . . . . . . . . . . 107

6.4.2 Poisson log-linear modelling for two-way tables . . . . . . . . . . . . . . . . . . . . 107
6.4.3 Maximum likelihood estimation for πij in case(a) . . . . . . . . . . . . . . . . . . . 108
6.5 Interaction plots and examination of residuals . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.6 Analysis of the skin cancer data (case(a)) using log-linear models . . . . . . . . . . . . . . 109
6.6.1 Fitted values for the skin cancer data (case(a)) . . . . . . . . . . . . . . . . . . . . 109
6.6.2 Log-linear modelling in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.6.3 Skin cancer data (case(a)) revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.7 Flu vaccine data (case(b)) revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.7.1 Fitted values for the A + B model for the flu data (case(b)) . . . . . . . . . . . . . 111

7 Three-way Tables 113


7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.2 Types of three-way table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.3 General approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.4 One response variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.4.1 Complete homogeneity of the response (A+B*C) . . . . . . . . . . . . . . . . . . . 114
7.4.2 Homogeneity of the response over one other factor (A*B+B*C) . . . . . . . . . . . 114
7.4.3 Example: ulcers and aspirin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.4.4 Estimates of E(Yijk ) for different log-linear models . . . . . . . . . . . . . . . . . . 115
7.4.5 Fitted values using the parameter estimates and the link function . . . . . . . . . . 116
7.4.6 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.5 One factor a binary response and the others explanatory . . . . . . . . . . . . . . . . . . . 117

Chapter 1

The Linear Model, Estimation and Hypothesis Testing

This chapter gives a rapid review of the basic theory for linear models. Technical details are
provided for reference, but do not worry if you do not understand all the theory. The important
points you should know are

• how to represent a linear model in matrix notation;

• how to fit a linear model in R with the lm command and how to interpret the output;

• how to use an F-test for nested linear models, using the anova command.

Some technical details are provided, but the focus is on a broad understanding of linear models
and their implementation in R. By the end of the chapter you should have a basic understanding
of linear models and their structure in matrix form, and be able to fit and interpret them in R.

1.1 Model definitions

1.1.1 Motivation

Suppose we want to see how y relates to several explanatory variables x1 , . . . , xp . We have data

(xi1 , . . . , xip , yi ), i = 1, . . . , n

Linear models have two parts: a linear predictor and a random error. The linear predictor
formalizes the idea that the response is a linear combination of terms involving the explana-
tory variables. The error term allows for variation in the response for identical values of the
explanatory variables. With the variables above we could propose the statistical model

yi = β0 + β1 xi1 + · · · + βp xip + εi                              (1.1)

to explain how the response depends on the explanatory variables (analogous to the simple
regression model). This is an example of multiple regression. β0 + β1 xi1 + · · · + βp xip is the
linear predictor, εi is the error term. We later impose some conditions on εi . For the model
yi = β0 + β1 xi1 + β2 xi2 + εi we are fitting a plane to the data. For this multiple regression model,
as the number of parameters increases we fit increasingly higher-dimensional hyper-planes.
Alternatively, reverting to a single explanatory variable, we might wish to express the relation-
ship between y and x in a quadratic way:

yi = β0 + β1 xi + β2 xi^2 + εi .                              (1.2)

In this model β0 + β1 xi + β2 xi^2 is the linear predictor. Although the relationship is
quadratic, we would still call this a linear model because it is linear in the parameters β0 ,
β1 and β2 . It is a common misconception that linear models require the relationship between
the response and explanatory variables to be linear. This is not the case; we just need to be
able to form a linear predictor that is linear in the parameters. So a linear model is linear in
the parameters, not the explanatory variables.

1.1.2 Matrix formulation

In this module vectors are written in bold and scalars in normal text so y is a vector and y
is a single value. It is convenient to express such models using vectors and matrices. We can
always collect the observed values of the variable y into a vector y = (y1 , . . . , yn )T and we can
similarly define x = (x1 , . . . , xn )T and ε = (ε1 , . . . , εn )T . If we also define 1n = (1, 1, . . . , 1)T to
be a vector of n ones, we can write the simple regression model as

y = β0 1n + β1 x + ε.                              (1.3)

Even more neatly, if we define the parameter vector β = (β0 , β1 )T and the n × 2 matrix X is
defined to have 1n as its first column and x as its second, then

y = Xβ + ε.                              (1.4)

For a suitable definition of the parameter vector β and the matrix X (also called the design matrix)
we can express both equations (1.1) and (1.2) in this form.

1.1.3 The linear model

The linear model is expressed by (1.4). In its general form,

• y is an n × 1 vector of observed random variables.


• X is an n × p matrix of known coefficients (perhaps observed or controlled values of
explanatory variables, but also perhaps functions of such explanatory variables, or just
constants).
• β is a p × 1 vector of unknown parameters, ε is an n × 1 vector of unobserved random
variables, assumed to have zero mean, to be uncorrelated and to have common variance
σ 2.
• σ 2 is another unknown parameter.
• The distribution of ε is multivariate normal.

The number of components of β is allowed to be whatever we wish. In the multiple regression
model (1.1) it would be p + 1, while in the quadratic regression model (1.2) it would be 3. The
model is completed by specifying the distribution of the εi ’s, the elements of the vector ε.

1.1.4 Terminology

We will refer to the y variable as the dependent variable or the response variable. The vector y
consists of n observations of this dependent or response variable.
Similarly, we can think of each column of the X matrix as comprising n values of a regressor
variable, also called an independent variable. So we have in general one response variable and p
regressor variables.

Sometimes this terminology may seem strange, because in the simple linear regression model
the X matrix has two columns, the first of which is a column of ones. We still would say that
this column defines a regressor variable. This “variable” is a constant.
Regressor variables are also sometimes called explanatory variables, but you will have noticed
that in this course I give this term a slightly different meaning. Consider the quadratic regression
(1.2). Here there are three regressor variables (the constant, the xi column and the x2i column),
but there is really only one x variable. The three regressor variables are all functions of the x
variable. In this course we will refer to the x variable as the single explanatory variable in the
quadratic regression model. In general, the regressor variables will always be functions of the
explanatory variables.

1.1.5 An example of the design matrix X - polynomial regression

The general polynomial regression with one explanatory variable is

yi = β0 + β1 xi + · · · + βr xi^r + εi

which is turned into the general linear model form by taking

y = (y1 , y2 , . . . , yn )T ,   β = (β0 , β1 , . . . , βr )T ,   ε = (ε1 , ε2 , . . . , εn )T ,

and X the n × (r + 1) design matrix whose i-th row is (1, xi , xi^2 , . . . , xi^r ),

and p = r + 1.
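For illustration only (this example is not one of the data sets used in these notes), the design matrix that lm would construct for a polynomial regression can be inspected in R with model.matrix. The sketch below uses a made-up explanatory variable x and a cubic (r = 3):

# A minimal sketch with made-up data: the design matrix for a cubic in x.
x <- c(0.5, 1, 2, 3, 4.5)                  # hypothetical explanatory variable
X <- model.matrix(~ x + I(x^2) + I(x^3))   # columns: 1, x, x^2, x^3
X
dim(X)                                     # n rows and p = r + 1 = 4 columns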

1.2 Estimating model parameters using the least squares method

1.2.1 Fitted values

Suppose we are performing simple linear regression so that we are estimating an intercept β0
and gradient term β1 . Suppose we have estimates of these two parameters β̂0 and β̂1 . Then
the fitted value for observation i is β̂0 + β̂1 xi . Suppose we have an estimate β̂ of the parameter
vector β. We define the fitted values to be X β̂ (this gives the vector of fitted values). See
section 1.3 for an example using the tractor data. Note the difference between β and β̂: β is
the vector of true parameter values, β̂ is the estimate of the vector of true parameter values.

1.2.2 Residuals

Residuals play a key role in linear models. In later chapters we will see how they can be used to
assess whether the statistical model we use to describe the relationship between our variables
fits the assumptions underpinning linear model theory. We have seen that the linear model has
several parameters (the βi contained in the vector of parameters β). We need to use our observed
data to estimate these parameters. The parameters that we choose are those that minimize the
sums of squares of the residuals. This is called the method of least squares.

The residual for observation i (usually denoted ei ) is defined to be ei = yi − (β̂0 + β̂1 xi ) (i.e.
observed value - fitted value). The vector of residuals will be

e = (e1 , . . . , en )T = y − X β̂

and the sum of squares of these residuals is nicely expressed as
Sr = S(β̂) = Σ_{i=1}^n ei^2 = eT e = (y − X β̂)T (y − X β̂).                              (1.5)

This sum is known as the residual sum of squares and plays an important role in the analysis
of linear models.
The least squares estimator is that β̂ which minimizes S(β̂). Intuitively, a value of β̂ that makes
the residuals small “fits” the data well. So the least squares estimator is generally described as
giving the best fit.

1.2.3 LS estimators

Estimator for β

Theorem 1 Assume that X has rank p. Then the least squares estimator of β is

β̂ = (X T X)−1 X T y. (1.6)

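Theorem 1 can be checked numerically. The sketch below uses simulated (made-up) data, computes (X T X)−1 X T y directly with solve(), and compares the result with the coefficients that lm returns; the variable names and true parameter values are purely illustrative.

# Minimal sketch: least squares "by hand" versus lm(), using simulated data.
set.seed(1)
n <- 20
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n, sd = 1.5)     # true beta0 = 2, beta1 = 3 (made up)

X <- cbind(1, x)                        # design matrix: column of ones and x
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta.hat                                # should agree with coef(lm(y ~ x))
coef(lm(y ~ x))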
1.3 Fitting linear models in R: the tractor data

The tractor data in Table 1.1 show the ages in years of 17 tractors and their maintenance costs
per 6 months. Is a straight-line relationship between age and cost reasonable? This is an example
of simple linear regression, i.e. when we have only one regressor variable (p = 2 as there are two
parameters to estimate, gradient and intercept).

• Scatter plot. The scatter plot in Figure 1.1 shows that as age increases so does main-
tenance cost, and it looks like a straight-line relationship describes the data reasonably
well. Remember that we could include quadratic and higher order terms in age
and still have a linear model. What we need to check is that the residuals are roughly
normally distributed.

• Fitting the regression. With a small dataset like this it is practical to do the computations
by hand, but even for such a small example it is quicker to use a computer package like R
(even in this case when there are only 17 observations this leads to a 17 by 2 matrix for
X). To fit a linear model we use the command lm, i.e.

> tractor.linear.lm <- lm(Maint~Age, data=tractor.data)

This tells R to fit the regression model yi = β0 + β1 xi + εi for the data set tractor.data (you will
have to name the data set tractor.data using tractor.data<-read.table(...)).

• Fitted values. If we use the estimates of β̂0 and β̂1 from fitting the model we can cal-
culate the fitted values. These are the values of the response that we would obtain
based on our estimated model parameters and observed explanatory variable (xi ). The
fitted values are β̂0 + β̂1 xi for each value xi . Obviously these lie in a straight line (be-
cause the linear predictor is linear in xi ). We can add the fitted line to the plot using
abline(tractor.linear.lm); this line is shown in figure 1.1 and the actual fitted values
are shown as crosses. For each xi , the residuals are the differences between the observed
response and the fitted response. Some residuals will be negative and some will be posi-
tive. It can be shown that they add up to zero. To apply the linear model theory, these
residuals should be normally distributed.
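As a quick illustration (assuming the model object tractor.linear.lm fitted above), the fitted values and residuals can be extracted directly, and the residuals can be checked to sum to zero (up to rounding error), as stated above:

# Fitted values, residuals, and a check that the residuals sum to zero.
fitted(tractor.linear.lm)      # the fitted values beta0.hat + beta1.hat * Age
resid(tractor.linear.lm)       # the residuals: observed minus fitted values
sum(resid(tractor.linear.lm))  # essentially zero, up to rounding error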

Table 1.1: Tractor data

Age    0.5   0.5   1     1     1     4     4     4     4.5   4.5
Maint  182   163   978   466   549   495   723   681   619   1049
Age    4.5   5     5     5     5.5   6     6
Maint  1033  890   1522  1194  987   764   1373

Figure 1.1: Fitted values for the tractor data (Maint plotted against Age): circles are observed
values; crosses are fitted values.

To see a first output we can type tractor.linear.lm in the command line, i.e.

> tractor.linear.lm
Call: lm(formula = Maint ~ Age, data = tractor.data)
Coefficients:
(Intercept) Age
323.6 131.7

This gives the estimates of the β parameters only, i.e.

β̂0 = 323.6,    β̂1 = 131.7.

The β̂1 parameter (the gradient) tells us that our ‘best’ estimate is that as age goes up by one
year, the expected maintenance cost increases by 131.72. Note that this is only the expected
value - because of the error term in our model any actual cost will vary around this expected
value. A more detailed summary is available with the command summary(tractor.linear.lm),
i.e.

> summary(tractor.linear.lm)
Call:
lm(formula = Maint ~ Age, data = tractor.data)
Residuals:
Min 1Q Median 3Q Max
-355.49 -207.48 -61.06 132.65 539.80
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 323.62 146.94 2.202 0.04369 *
Age 131.72 35.61 3.699 0.00214 **

Residual standard error: 283.5 on 15 degrees of freedom


Multiple R-squared: 0.4771, Adjusted R-squared: 0.4422
F-statistic: 13.68 on 1 and 15 DF, p-value: 0.002143

Descriptive statistics of the residuals are given. Typically for a good fit we expect the median
to be close to 0, the min and max to have approximately the same absolute value and also the
lower and upper quartile to have approximately the same absolute value. Deviation from these
can indicate that the residuals are large or that they are positively or negatively skewed though
it is difficult to tell with small data sets. The quartiles of the residuals can easily be calculated.

1.4 Properties of the LS estimators

1.4.1 Gauss-Markov conditions

So far we have not imposed any statistical assumptions on model (1.4). Although the LS
estimator β̂ is still valid with no further assumptions, for the full development of the linear
model we need to impose distributional assumptions on ε = y − Xβ. These assumptions, known
as the Gauss-Markov conditions, can be expressed as

E(ε) = 0 and V ar(ε) = σ 2 In .                              (1.7)

This effectively says that the mean of εi is zero, the variance of εi is σ 2 and that εi is uncorrelated
with εj for i ≠ j, where ε = (ε1 , . . . , εn )T . Later we will also discuss the distribution of ε.

1.4.2 LS estimators

From the above conditions, it follows that

E(y) = Xβ and V ar(y) = σ 2 In .

The expected value of β̂ is

E(β̂) = E((X T X)−1 X T y) = (X T X)−1 X T Xβ = β

and so β̂ is an unbiased estimator for β.

1.4.3 Covariance of parameter estimates - the variance-covariance matrix

The variance properties of β̂ are contained in its covariance matrix. Using result 2 of the section
labeled ‘vector covariance’ in the Chapter 1 Appendix ,

V ar(β̂) = V ar((X T X)−1 X T y)


= (X T X)−1 X T V ar(y)((X T X)−1 X T )T
= (X T X)−1 X T σ 2 In X(X T X)−1
= σ 2 (X T X)−1

See Section 1.4.5 for how to obtain the variance-covariance matrix in R and how the values in it
relate to the standard errors of the parameter estimates.

1.4.4 Estimating the error variance

We have already defined the residuals as

e = y − X β̂ = y − X(X T X)−1 X T y = M y,

where M = In − X(X T X)−1 X T . The matrix M plays an important role, as we will see later.
The residuals have mean and variance given by

E(e) = M E(y) = M Xβ = 0,

since M X = 0 (you should check this). We can show that M satisfies M 2 = M , i.e. M is an
idempotent matrix. Then

V ar(e) = M V ar(y)M T = M σ 2 In M = σ 2 M 2 = σ 2 M. (1.8)

Note:
It is important to understand the difference between the error (i ) and the residual (ei ) for
observation i. The error is part of the statistical model that we fit to our data. The residual
is the observed value of the difference between the response and the fitted value. The observed
residuals, the ei , allow us to estimate the variability of the error terms, the i . Our estimate of
the standard deviation of the error term is called the residual standard error (i.e we estimate
σ by σ̂ and call σ̂ the residual standard error).
Thus M is also related to the variance-covariance matrix of the residuals. The variance of an
individual residual is σ 2 times the corresponding diagonal element of M .
Every idempotent matrix except the identity is singular (non-invertible). Its rank is equal to its
trace, and we have (using the result that tr(AB) = tr(BA) from the chapter 1 appendix, point
11)

tr(M ) = tr(In − X(X T X)−1 X T ) = n − tr((X T X)−1 X T X) = n − tr(Ip ) = n − p.
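These properties are easy to verify numerically. The sketch below, which assumes the data frame tractor.data from Section 1.3, builds M for the simple regression design matrix and checks that M is idempotent with trace n − p:

# Sketch: check that M = I - X (X'X)^{-1} X' is idempotent with trace n - p.
X <- model.matrix(~ Age, data = tractor.data)     # 17 x 2 design matrix
n <- nrow(X); p <- ncol(X)
M <- diag(n) - X %*% solve(t(X) %*% X) %*% t(X)
max(abs(M %*% M - M))    # essentially zero, so M^2 = M
sum(diag(M))             # equals n - p = 15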

Since E(ei ) = 0 (i = 1, . . . , n), we can observe that the diagonal elements of the covariance matrix
V ar(e) are E(e1^2 ), . . . , E(en^2 ) and from (1.8) we have that E(ei^2 ) = σ 2 mii , where m11 , . . . , mnn
are diagonal elements of M , writing M = (mij ). Then we can see

E( Σ_{i=1}^n ei^2 ) = Σ_{i=1}^n E(ei^2 ) = σ 2 Σ_{i=1}^n mii = σ 2 tr(M ) = σ 2 (n − p)

This means that if we propose

σ̂ 2 = (1/(n − p)) Σ_{i=1}^n ei^2 = Sr /(n − p)

as an estimator for σ 2 , then σ̂ 2 will be an unbiased estimator for σ 2 , i.e. E(σ̂ 2 ) = σ 2 .


As we have already seen, the quantity Sr defined by

Sr = Σ_{i=1}^n ei^2 = eT e = (y − X β̂)T (y − X β̂)

is referred to as the residual sum of squares.

1.4.5 Further application to the tractor data

Referring back to the tractor.linear.lm analysis in section 1.3, we had the following partial
output:

> summary(tractor.linear.lm)
Call:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 323.62 146.94 2.202 0.04369 *
Age 131.72 35.61 3.699 0.00214 **

Residual standard error: 283.5 on 15 degrees of freedom


Multiple R-squared: 0.4771, Adjusted R-squared: 0.4422
F-statistic: 13.68 on 1 and 15 DF, p-value: 0.002143

The ‘residual standard error’ is given by

σ̂ = √( (1/(n − p)) Σ_{i=1}^n ei^2 )

which in the output is shown to be 283.5; it is calculated in R by

> sqrt(sum(tractor.linear.lm$resid^2)/15)
[1] 283.4792

Note that the degrees of freedom are n − p = 15. In R, the variance-covariance matrix can
be obtained using vcov(). For the model yi = β0 + β1 xi + εi fitted to the tractor data,
vcov(tractor.linear.lm) produces the following output:

(Intercept) Age
(Intercept) 21591.05 -4623.990
Age -4623.99 1267.868

so that the variance of our estimate of the intercept (β̂0 ) is 21591.05. It follows that the standard
error of the intercept is √21591.05 = 146.94 as given in the summary command output above.
Similarly the standard error of the gradient (β̂1 ) is √1267.868 = 35.61. The estimate of the
covariance between β̂0 and β̂1 is -4623.990; this means that the correlation between the parameter
estimates, which is given by

corr(β̂0 , β̂1 ) = cov(β̂0 , β̂1 ) / ( s.e.(β̂0 ) × s.e.(β̂1 ) ),

is -4623.99/(146.94 × 35.61) = −0.8837.
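The same correlation can be obtained in one step by converting the variance-covariance matrix into a correlation matrix with the base R function cov2cor; a sketch, assuming tractor.linear.lm from Section 1.3:

# Correlation matrix of the parameter estimates; the off-diagonal entry
# should be approximately -0.88.
cov2cor(vcov(tractor.linear.lm))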

1.5 Distributions of estimators and residuals

In the linear model it is assumed that the elements of  are independent with zero means and
variances σ 2 . It is further assumed that they are normally distributed. We therefore have

ε ∼ Nn (0, σ 2 In )

where Nn represents an n-dimensional multivariate normal distribution (see Appendix 1.12.2 for
further details).
Since y = Xβ + ε, it follows (from the linear transformation property in Appendix 1.12.2) that

y ∼ Nn (Xβ, σ 2 In )

Now both β̂ and e are linear functions of y and therefore they also have multivariate normal
distributions. From section 1.4.2 we have

β̂ ∼ Np (β, σ 2 (X T X)−1 )                              (1.9)

e ∼ Nn (0, σ 2 M )

Furthermore, β̂ and e are independent.


To prove this we use the linear transformation property with A the (p + n) × n matrix formed
by stacking (X T X)−1 X T on top of M ,

and find that the covariance between β̂ and e is zero because M X = 0. Therefore they are also
independent.

1.6 Maximum likelihood estimators

1.6.1 The likelihood function

The full version of our model, in which we include the assumption of normality, simply says
that y ∼ Nn (Xβ, σ 2 In ). The likelihood therefore comes directly from the definition of the
multivariate normal density.

L(β, σ 2 ; y) = f (y|β, σ 2 )
             = (2πσ 2 )−n/2 exp( −(1/(2σ 2 )) (y − Xβ)T (y − Xβ) )
             ∝ σ −n exp( −(1/(2σ 2 )) (y − Xβ)T (y − Xβ) )

In this derivation we have used the fact that |σ 2 In | = (σ 2 )n .

1.6.2 MLEs

To maximize this likelihood with respect to β, we obviously must minimize (y − Xβ)T (y − Xβ).
But we already know the result of this. It yields the least squares estimates, so the MLE of β
is the same β̂ that we have already obtained using both least squares and the Gauss-Markov
theorem.
After we have done this first stage maximization of the likelihood, we have

L(β̂, σ 2 ; y) ∝ σ −n exp( −(1/(2σ 2 )) (y − X β̂)T (y − X β̂) )
and we now need to maximize this with respect to σ 2 . This is easily done (by differentiating
the log-likelihood with respect to σ 2 and setting the result equal to zero - see review exercises
2) and produces the MLE, n−1 (y − X β̂)T (y − X β̂) = Sr /n, where Sr was introduced as the
residual sum of squares in Chapter 1.
However, this is easily seen to be a biased estimator. We have just shown that Sr ∼ σ 2 χ2n−p ,
and hence E(Sr ) = σ 2 (n − p). In fact, we obtain an unbiased estimator by dividing Sr by n − p
instead of n. This is the standard estimator of σ 2 :
σ̂ 2 = Sr /(n − p).
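For the tractor regression the two estimates of σ 2 differ only in the divisor; a sketch comparing them, assuming tractor.linear.lm from Section 1.3:

# Sr / n (the MLE, biased) versus Sr / (n - p) (the unbiased estimator).
Sr <- sum(resid(tractor.linear.lm)^2)
n <- length(resid(tractor.linear.lm))   # 17 observations
p <- 2                                  # intercept and gradient
Sr / n                                  # MLE of sigma^2
Sr / (n - p)                            # unbiased estimate; its square root is 283.5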

1.7 Model fit: coefficient of determination R2

When we are fitting a linear model, the estimate of β is obtained by minimizing the sum of
squared residuals. The estimate β̂ minimizes the sum of squares and thereby provides the best
possible fit to the data using the model (specified by the design matrix), and the residual sum
of squares Sr (section 1.2) is a measure of fit.
A model that achieves a lower residual sum of squares could be considered as giving a better
fit to the data. So we could consider using Sr directly as a measure of model fit. There are
two drawbacks to doing this. The first is that Sr depends on the scale of the observations: a
model fitted to data in which the response variable is measured in hundreds of some unit could
have a larger Sr than a model fitted to data in which the response variable is measured in single
units, even though it has a much better fit. The second is that we might prefer a measure that
increases as fit improves (whereas Sr decreases). Based on the first drawback, it would make
sense to relate Sr to the total variation in the response so that the scale is taken into account.
Taking these two factors into consideration, a better measure of model fit is
R2 = (SSTotal − Sr )/SSTotal

where SSTotal is the total sum of squares (this will generally be (y − ȳ)T (y − ȳ) = y T y − nȳ 2 ,
but becomes just y T y if there is no constant term required). Remember that we use bold when
a symbol represents a column vector so that ȳ is a vector in which each element is identical and
equal to the overall mean of the y values; whereas ȳ is a single value representing the overall
mean of the y values. R2 is sometimes called the coefficient of determination. R2 can be
thought of as the proportion of the total sum of squares that the regression model explains. The
residual sum of squares Sr is the unexplained part of the total sum of squares of y.
Notice that in the case of the simple linear regression model, R2 equals r2 , the squared sample
correlation coefficient between y and x. So we can see R2 as the generalization of r2 to the
more general linear model. R2 is also called the squared multiple correlation coefficient and it
always lies between 0 and 1. In the lm output in R we can see the R2 as multiple R-squared.
Referring back to the summary(tractor.linear.lm) output in section 1.4.5 we can verify the
Multiple R-squared value in R using:

> y=tractor.data$Maint
> total.ss=sum(y^2)-17*(mean(y)^2)
> res.ss=sum(tractor.linear.lm$resid^2)
> (total.ss-res.ss)/total.ss
[1] 0.4770564

We see in the output that there is also an ‘Adjusted R-squared’ value. This value takes into
account the number of explanatory variables used in the model. It is defined as

R2 (adj) = 1 − [ Sr /(n − p) ] / [ SSTotal /(n − 1) ]

We can check it in R using

> 1-(res.ss/15)/(total.ss/16)
[1] 0.4421935

We will talk a lot more about model selection in the next chapter but it does not make sense
to pick the model with the largest R2 as this necessarily increases as the number of explanatory
variables increases. However, the adjusted R-squared value does not necessarily increase as
the number of explanatory variables increases, so using the adjusted R-squared value in model
selection is more sensible.

1.8 Constructing intervals when predicting the response at a new explanatory value

In this section we imagine that we have used the techniques in this course to determine a
statistical model and we now want to make predictions based on this chosen model. As well as
point estimates we also want to construct intervals that we think are likely to contain the true
population value (i.e. confidence intervals). If xi is a column vector containing the values of the
explanatory/regressor variables for a new observation i (so xi T is a row vector similar to rows
of X but with the new values of the explanatory variables), there are two intervals we may be
interested in estimating: an interval for the expected value (mean value) and an interval for a
future observation itself. For estimation of the mean we use the fact that

β̂ ∼ Np (β, σ 2 (X T X)−1 )

to yield
xTi β̂ ∼ N (xTi β, σ 2 xTi (X T X)−1 xi )
so that
( xTi β̂ − xTi β ) / √( σ 2 xTi (X T X)−1 xi ) ∼ N (0, 1).

We cannot calculate the confidence interval from this relationship (and base our CI on the
Normal distribution) because we do not know the value of σ 2 , the true variance of the error
term. Instead we have to replace it with its estimate σ̂ 2 and use a bit more distributional
theory. We use the fact that

T = ( xTi β̂ − xTi β ) / √( σ̂ 2 xTi (X T X)−1 xi )
  = [ ( xTi β̂ − xTi β ) / √( σ 2 xTi (X T X)−1 xi ) ] / √( σ̂ 2 /σ 2 )
Figure 1.2: Prediction intervals for the tractor data (predicted maintenance cost plotted against age).

where, for the right hand side, the numerator has a N (0, 1) distribution and the denominator
involves σ̂ 2 /σ 2 ∼ χ2n−p /(n − p). Standard distributional theory tells us that if X ∼ N (0, 1), Y ∼ χ2ν
and X and Y are independent then X/√(Y /ν) ∼ tν . Therefore T ∼ tn−p so that a 100(1 − α)%
confidence interval for the mean of the predicted value is given by

xTi β̂ ± tn−p,1−α/2 √( σ̂ 2 xTi (X T X)−1 xi ).

A similar argument leads to a 100(1 − α)% prediction interval (note the distinction between
prediction and confidence interval) for the future observation y. It is given by

xTi β̂ ± tn−p,1−α/2 √( σ̂ 2 (1 + xTi (X T X)−1 xi ) ).

The prediction interval for an observation is wider than the confidence interval for the expected
value of the observations because of the extra uncertainty coming from the error term (σ̂ 2 ).

Prediction and confidence intervals are easily calculated in R using the predict.lm command.
If you want to fit a confidence interval use interval="confidence" as an argument. If you want
to fit a prediction interval then use interval="prediction" as an argument. See ?predict.lm
for help.
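A sketch of the two calls, predicting the maintenance cost at the made-up age of 3 years (any age within the observed range could be used):

# Confidence interval for the mean response at Age = 3, and a prediction
# interval for a single new observation at the same age (default level 95%).
new.tractor <- data.frame(Age = 3)
predict(tractor.linear.lm, newdata = new.tractor, interval = "confidence")
predict(tractor.linear.lm, newdata = new.tractor, interval = "prediction")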
Figure 1.2 shows the confidence and prediction intervals for a simple regression model fitted to
the tractor data (with age as the sole explanatory variable). The central solid line represents
the predicted values (which all lie on the least squares line). The long dashed line shows the
confidence intervals for the expected value of the response and the short dashed line shows the
prediction interval for a new single observation. Note two things: the first is that as a result
of the extra σ̂ 2 term in the prediction interval, the prediction interval is always wider than the
confidence interval; the second is that the intervals get wider towards the ends of the age range
and this is particularly so for the confidence interval for the mean value.

1.9 Confidence intervals for components of β

We have shown that β̂ ∼ Np (β, σ 2 (X T X)−1 ). From the marginal distributions property of the
multivariate normal distribution, if we let β̂i be the i-th element of β̂, then

β̂i ∼ N (βi , σ 2 gii )

where βi is the i-th element of β and gii is the i-th diagonal element of G = (X T X)−1 .
Also, β̂ is independent of Sr = (n − p)σ̂ 2 ∼ σ 2 χ2n−p (the independence was shown to be true in
section 1.5). It follows that if we use σ̂ instead of σ to standardise β̂i we get

( β̂i − βi ) / ( σ̂ √gii ) ∼ tn−p .

From this we immediately derive a 100(1 − α)% confidence interval. Let tm,γ denote the
100γ% quantile of the tm distribution (the point with probability γ below it). Then

1 − α = P( −tn−p,1−α/2 ≤ ( β̂i − βi )/( σ̂ √gii ) ≤ tn−p,1−α/2 )
      = P( −tn−p,1−α/2 σ̂ √gii ≤ β̂i − βi ≤ tn−p,1−α/2 σ̂ √gii )
      = P( β̂i − tn−p,1−α/2 σ̂ √gii ≤ βi ≤ β̂i + tn−p,1−α/2 σ̂ √gii ).

Therefore we have the interval

β̂i ± tn−p,1−α/2 σ̂ √gii
for βi . This is the two-sided interval. One-sided intervals are also easily constructed. We could
also similarly devise a confidence interval for any linear function of the βi ’s.
It should be noted that these intervals are for individual parameters. The simultaneous confi-
dence region for two parameters βj and βk is not a rectangle formed from individual confidence
intervals: it is an ellipse where the orientation of the axes is related to the correlation of the
estimates of βj and βk . Hence, although the corresponding estimates bj0 may not look im-
probable for βj and bk0 not improbable for βk , the point (bj0 , bk0 ) may be quite improbable for
(βj , βk ).
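In R these individual intervals are produced by the confint command; a sketch for the simple regression fitted to the tractor data:

# 95% confidence intervals for beta0 and beta1, based on the t_{n-p} distribution.
confint(tractor.linear.lm, level = 0.95)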

1.10 Confidence intervals for model parameters in the tractor data

Recall the tractors dataset considered in Section 1.3. There we fitted a simple linear regression
of maintenance cost on age. Now consider the possibility that the relationship may not be linear
in age. We suppose a quadratic regression model

yi = β0 + β1 xi + β2 xi^2 + εi .

In R, if the data set is saved with the name tractor.data, we get the following output

> tractor.quadratic <- lm(Maint ~ Age + I(Age^2), data = tractor.data)


> summary(tractor.quadratic)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 348.088 214.816 1.620 0.127
Age 103.003 181.969 0.566 0.580

I(Age^2) 4.713 29.248 0.161 0.874

Residual standard error: 293.2 on 14 degrees of freedom


Multiple R-squared: 0.478, Adjusted R-squared: 0.4035
F-statistic: 6.411 on 2 and 14 DF, p-value: 0.01056

Notice first that it is necessary to use I(Age^2) instead of Age^2 as this allows the ^ symbol to
be interpreted as an arithmetic operator (exponent) rather than as a formula operator. See ?I
for more help in R.
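As an aside, a raw polynomial can also be specified with poly(..., raw = TRUE); the sketch below should give the same fitted model as the Age + I(Age^2) formula (the default, orthogonal polynomials poly(Age, 2), gives the same fitted values but differently parameterised coefficients):

# An equivalent way of fitting the quadratic model using raw polynomial terms.
tractor.quad2 <- lm(Maint ~ poly(Age, 2, raw = TRUE), data = tractor.data)
coef(tractor.quad2)   # same estimates as (Intercept), Age and I(Age^2) above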

1.11 Hypothesis testing

1.11.1 The linear hypothesis

A general testing problem

The most general linear null hypothesis takes the form


H0 : Cβ = c
where C is a q × p matrix and c is a q × 1 vector of known constants. This hypothesis simul-
taneously asserts specific values for q linear functions of β. Without loss of generality, we can
assume that C has rank q.

As an example consider the quadratic model fitted to the tractor data. The model is
yi = β0 + β1 xi + β2 xi^2 + εi .

How can we test H0 : β1 = 1, β2 = 2? In this case we take C to be the 2 × 3 matrix with rows
(0, 1, 0) and (0, 0, 1), and c = (1, 2)T .

In this general framework it is rarely of interest to pose a one-sided alternative, and indeed there
is no unique way to define such a thing.
The usual alternative is simply
H1 : Cβ ≠ c
i.e. that at least one of the q linear functions does not take its hypothesized value.

Constructing a test

An obvious way to build a test is to consider the estimator C β̂. If the null hypothesis is true,
we have that E(C β̂) = Cβ = c, V ar(C β̂) = CV ar(β̂)C T = σ 2 C(X T X)−1 C T and so
C β̂ ∼ Nq (c, σ 2 C(X T X)−1 C T )
Therefore, again if H0 is true, we find that

( C β̂ − c )T [ C(X T X)−1 C T ]−1 ( C β̂ − c ) / ( q σ̂ 2 ) ∼ Fq,n−p                              (1.10)

Otherwise, we would expect ( C β̂ − c )T [ C(X T X)−1 C T ]−1 ( C β̂ − c ) to be larger.
So for a test at the 100α% significance level we will reject H0 if the test statistic exceeds Fq,n−p,α ,
the upper 100α% point of the Fq,n−p distribution. This is a one-sided test, even though the
original alternative was two-sided. It is one-sided because we are using sums of squares, so that
any failure of the null hypothesis leads to a larger value of the test statistic.

Special cases

Testing the i-th element βi of β is just a special case of this general hypothesis, where q = 1
(only 1 hypothesis) and C is a row vector of zeroes except for a single 1 in the i-th position.
Thus, we have H0 : βi = ci . Using (1.10) it is straightforward to show that

( β̂i − ci )^2 / ( σ̂ 2 gii ) ∼ F1,n−p
We can now formulate the test as a t-test, since if H0 is true

( β̂i − ci ) / ( σ̂ √gii ) ∼ tn−p

using the standard result that X ∼ t(ν) ⇔ X 2 ∼ F (1, ν) (remember that gii is the i-th diagonal
element of G = (X T X)−1 - see section 1.9). We can clearly use it in this form to test one-sided
alternatives if desired.
In the quadratic linear model of Section 1.10, it is noticeable that the part of the output giving
the individual parameter estimates shows that they all have p-values greater than 0.05 so they
would not lead to rejecting the null hypothesis that the parameter value is zero (see the last
column of the R output in Section 1.10). These p-values are obtained by tests which use the
statistic

( β̂ − β0 ) / se(β̂)

where β0 is the value of β under the null hypothesis. Here we have ci = β0 and se(β̂) = σ̂ √g11 ,
where g11 is the top left element of the matrix G = (X T X)−1 , or, written out,

              ( g11  g12  g13 )
(X T X)−1 =   ( g12  g22  g23 )
              ( g13  g23  g33 ).

The case c = 0 is the most common, and when C = I this amounts to testing H0 : β = 0. The
null hypothesis then says that the coefficients of all the regressor variables are zero, and hence
that none of the so-called explanatory variables has any explanatory power. The expectation
of y is zero, no matter what values the explanatory variables might take. Usually the minimal
model we would consider is the one containing just the constant term β0 . So we would not
normally test whether β0 = 0.
As an example let us consider again the quadratic model fitted to the tractor data. The model
is
yi = β0 + β1 xi + β2 xi^2 + εi .
and part of the R output is

> tractor.quadratic <- lm(Maint ~ Age + I(Age^2), data = tractor.data)


> summary(tractor.quadratic)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 348.088 214.816 1.620 0.127
Age 103.003 181.969 0.566 0.580
I(Age^2) 4.713 29.248 0.161 0.874

Residual standard error: 293.2 on 14 degrees of freedom


Multiple R-squared: 0.478, Adjusted R-squared: 0.4035
F-statistic: 6.411 on 2 and 14 DF, p-value: 0.01056

How can we test H0 : β1 = β2 = 0 (i.e. just allowing the constant term β0 ) against the alternative
hypothesis that at least one of them is non-zero? In this case C is the 2 × 3 matrix with rows
(0, 1, 0) and (0, 0, 1), c = (0, 0)T , β̂ = (348.088, 103.003, 4.713)T , X is as given in section 1.10,
q = 2 and σ̂ 2 = 293.2^2 . We can use equation (1.10) to calculate the F-statistic:

F = ( C β̂ )T [ C(X T X)−1 C T ]−1 ( C β̂ ) / ( 2 × 293.2^2 )

with C β̂ = (103.00, 4.71)T and

              (  0.540   −0.376    0.0522 )
(X T X)−1 =   ( −0.376    0.389   −0.0613 )
              (  0.0522  −0.0613   0.0101 ),

giving F = 6.41.

Notice that this is the test that R performs in the last line of summary(tractor.quadratic)

F-statistic: 6.411 on 2 and 14 DF, p-value: 0.01056

So the final line is testing whether all the coefficients (except the intercept) are zero.
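The F statistic can also be reproduced directly from equation (1.10). The sketch below assumes the data frame tractor.data and the fitted object tractor.quadratic from Section 1.10, and rebuilds C, c, β̂, (X T X)−1 and σ̂ 2 in R:

# Sketch: the F statistic of (1.10) for H0: beta1 = beta2 = 0 in the quadratic model.
X <- model.matrix(~ Age + I(Age^2), data = tractor.data)
beta.hat <- coef(tractor.quadratic)
sigma2.hat <- summary(tractor.quadratic)$sigma^2
C <- rbind(c(0, 1, 0),
           c(0, 0, 1))
cc <- c(0, 0)
q <- nrow(C)
Fstat <- t(C %*% beta.hat - cc) %*%
  solve(C %*% solve(t(X) %*% X) %*% t(C)) %*%
  (C %*% beta.hat - cc) / (q * sigma2.hat)
Fstat   # should reproduce the F-statistic 6.411 from summary(tractor.quadratic)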

1.11.2 Testing the effect of a subset of regressor variables

We now concentrate on the case


C = (0 Iq ), c=0
where the 0 in C is a q × (p − q) matrix of zeroes, and the other 0 is a q × 1 vector of zeroes.
Then if we define β = (β 1T , β 2T )T ,
such that β 1 is (p − q) × 1 and β 2 is q × 1, then Cβ = β 2 and H0 just specifies that β 2 = 0.
The null hypothesis therefore asserts that the last q regressor variables all have no effect on the
dependent variable.
Notice that the order of elements in β is arbitrary. We can always shuffle them, provided we
simultaneously shuffle the columns of X in the same way. So there is no restriction in assuming
that the q variables whose effect we wish to test are the last q.
We can express these sums of squares, and corresponding tests, in an analysis of variance table.

Source of variation      SS       df       MS              MSR
Due to X1 if β 2 = 0     S1       p − q    S1 /(p − q)     F1
Due to X2                S2       q        S2 /q           F2
Residual                 Sr       n − p    σ̂ 2
Total                    y T y    n

In the table:

• SS refers to sums of squares

• df refers to degrees of freedom

• MS refers to the mean square

• MSR refers to the mean square ratio

The two mean square ratios are F2 = S2 /(qσ̂ 2 ) and F1 = S1 /((p − q)σ̂ 2 ). Notice that the mean
square ratio (MSR) always has a denominator corresponding to the mean square (MS) of the
residuals.
We have two tests that can be performed here. First, F2 is just the test criterion on the left hand
side of (1.10), and so we reject the null hypothesis that β 2 = 0, at the α% level of significance,
if F2 > Fq,n−p,α .
The second test uses F1 and tests whether β 1 = 0 where we also assume that β 2 = 0 (i.e. we
assume β = 0 under the null hypothesis). We reject the null hypothesis that β 1 = 0, at the
α% level of significance, if F1 > Fp−q,n−p,α .
It is usual to perform these tests in order. First, use F2 to test whether β 2 = 0. Then if this test
does not reject this null hypothesis we use F1 to test whether we also have β 1 = 0 (assuming
that β 2 = 0).
Notice that this is not a completely correct procedure, because the fact that we fail to reject the
hypothesis that β 2 = 0 does not mean that this hypothesis is true. It is nevertheless accepted
practice.

Adjusting for the constant term - the minimal model

We remarked before that the case c = 0, C = I, amounts to testing the hypothesis that none of
the so-called explanatory variables has any explanatory power.
Nevertheless, this is rarely what we want. When we say that the explanatory variables have
no relationship with the dependent variable (i.e. they explain nothing), the implication would
usually be that the expectation of y is constant, no matter what the values of the explanatory
variables. We would not usually think that the constant value would have to be zero. Thus,
linear models nearly always include a constant term, corresponding to a column of ones in X.
This model is often called the minimal model (or null model).
The null hypothesis that the explanatory variables have no effect on y is usually that all the
other elements of β are zero. We can formulate this as in the analysis of variance above. β1 = β0
is just the first element of β (and the first column of X is X1 = 1n ), while β 2 is the remaining
p − 1 elements of β, corresponding to the p − 1 regressor variables in X.
The model with just β0 in it is what we have called the simplest of all linear models, the
case of a single normal sample with mean β0 and variance σ 2 . In the present notation, S1 =
y T X1 (X1T X1 )−1 X1T y = (Σi yi )^2 /n = nȳ 2 , Sr = y T y − β̂ T X T X β̂, and this means that S2 =
y T y − S1 − Sr = β̂ T X T X β̂ − nȳ 2 .
After testing whether all but β0 are zero we usually have no interest in testing whether β0 = 0,
so it is usual to omit the S1 row of the analysis of variance table. So to test whether none of
the explanatory variables has any effect on y (but allowing a constant term), we would use the
same analysis of variance table as before with q = p − 1 and with the S1 row omitted

Source of variation    SS                            df       MS             MSR
Due to regressors      S2 = β̂ T X T X β̂ − nȳ 2       p − 1    S2 /(p − 1)    F
Residual               Sr = y T y − β̂ T X T X β̂      n − p    σ̂ 2
Total (adjusted)       Syy                           n − 1

Notice that we now regard the total sum of squares as what would have been the residual
sum of squares after fitting the model with just the constant term β0 , which is by definition Syy . Now Syy =
(y − ȳ)T (y − ȳ) = y T y − nȳ 2 and so Syy is indeed the sum of the SS column. The df column
obviously sums correctly too. The MSR is F = S2 /((p − 1)σ̂ 2 ), which we refer to tables of
significance points of the Fp−1,n−p distribution.
This approach extends to when we want to test for the effect of a subset of the regressor variables,
when we also have a constant term. We would automatically adjust sums of squares to allow
for the constant term. The resulting ANOVA table is

Source of variation      SS                df             MS                  MSR
Due to X1 if β 2 = 0     S1                p − q − 1      S1 /(p − q − 1)     F1
Due to X2                S2                q              S2 /q               F2
Residual                 Sr                n − p          σ̂ 2
Total                    y T y − nȳ 2      n − 1

The details of the construction of this table are as follows.

• We assume that the number of regressor variables in β 2 is still q. That means that β 1
contains the constant term β0 plus the remaining p − q − 1 regressor variables.

• The residual SS and df are unchanged, because they relate to the residuals after fitting
the constant plus all the other p − 1 regressor variables.

• The SS due to regressor variables X1 is given by S1 = y T X1 (X1T X1 )−1 X1T y − nȳ 2 =
β̂ 1T X1T X1 β̂ 1 − nȳ 2 .
• The SS due to the regressor variables X2 is still S2 = y T M2 y = β̂ 2T (Gqq )−1 β̂ 2 , but the
alternative expression for it becomes S2 = (β̂ T X T X β̂ − nȳ 2 ) − S1 , where in both expressions
we correct for the constant term.

1.11.3 Testing nested models using F tests

Equation (1.10) allows us to perform an important test in linear models. This is where one of
the models is nested within the other and we want to know whether adding the extra parameters
increases the regression sums of squares enough to warrant the inclusion of the extra terms in
the model. For example we might have two models given by:

yi = β0 + β1 xi

and
yi = β0 + β1 xi + β2 xi²
and we want to know whether β2 = 0 so that the quadratic term does not increase the regression
sums of squares enough to warrant inclusion in the model. These two models are nested because
one model is the same as the other if we impose some constraints on the parameters (β2 = 0
in the above example). These tests are very important in model building as they allow us
to determine which explanatory variables we might need in our statistical model - they are a
variable selection tool. These tests of nested models can be carried out using equation (1.10), and
this is equivalent to performing the following test (which is easier to implement).
Define the following:

• n is the number of observations

• pf is the number of parameters in the full model

• pr is the number of parameters in the reduced model

• RSSf is the residual sums of squares in the full model

• RSSr is the residual sums of squares in the reduced model

Then under the null hypothesis (that the coefficients of the extra parameters are all zero)

[(RSSr − RSSf)/(pf − pr)] / [RSSf/(n − pf)] ∼ F_{pf−pr, n−pf}.

These tests are performed in R using the anova command.

1.11.4 An example of the anova command in R

We will apply this theory again to the tractors dataset. Remember the quadratic model is

yi = β0 + β1 xi + β2 xi² + εi.

We can obtain the relevant sums of squares using the anova command.

> anova(tractor.linear.lm,tractor.quad.lm)
Analysis of Variance Table

Model 1: Maint ~ Age


Model 2: Maint ~ Age + I(Age^2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 15 1205407
2 14 1203176 1 2231 0.026 0.8743

Using the anova command with two nested linear models performs a test in which the null
hypothesis is that the coefficients of the ‘extra’ least squares parameters in the larger model are
all zero (in this case H0 : β2 = 0 against H1 : β2 ≠ 0). In the R output above we can see that
there is no significant evidence to reject the null hypothesis H0 : β2 = 0 (p = 0.8743).

We can use the F-test for nested models directly giving

[(RSSr − RSSf)/(pf − pr)] / [RSSf/(n − pf)] = [2231/(3 − 2)] / [1203176/(17 − 3)] = 0.0260,

as given in the R output.
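As a check, the same statistic and p-value can be computed directly in R from the two residual sums of squares; the following is a minimal sketch of the arithmetic only, using the RSS values shown in the anova output above.

# Nested F-test computed directly from the residual sums of squares
# shown in the anova() output above (a sketch of the arithmetic only).
RSS_reduced <- 1205407   # Maint ~ Age
RSS_full    <- 1203176   # Maint ~ Age + I(Age^2)
n <- 17; p_full <- 3; p_reduced <- 2

F_stat <- ((RSS_reduced - RSS_full) / (p_full - p_reduced)) /
          (RSS_full / (n - p_full))
p_value <- 1 - pf(F_stat, df1 = p_full - p_reduced, df2 = n - p_full)
F_stat    # approximately 0.026
p_value   # approximately 0.874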


This type of test fits into the framework of equation (1.10) so we can use it to do the same test.
Working through the details (see review exercises 2), we find that equation (1.10) reduces to

F = β̂3² / (g33 σ̂²),

which, using the values in section 1.10, gives a value of 4.71²/(0.010054 × 293.2²) = 0.026, matching
the value above and the value calculated by R. Note that we use i = 3 here because we are
interested in β2, the third parameter in the vector of parameters. To get the output in the form
of the tables we looked at in the last section we use the anova command with the full model as
argument. For example, consider the following command:

> anova(tractor.quad.lm)
Analysis of Variance Table
Response: Maint
Df Sum Sq Mean Sq F value Pr(>F)
Age 1 1099635 1099635 12.795 0.003034 **
I(Age^2) 1 2231 2231 0.026 0.874295
Residuals 14 1203176 85941

Rewriting it produces the easier to read ANOVA table below.

Source of variation             SS        df    MS        MSR     p
(S1) Linear term if β2 = 0      1099635   1     1099635   12.8    0.003
(S2) Quadratic term             2231      1     2231      0.02    0.874
(Sr) Residual                   1203176   14    85941
     Total                      2305042   16

It is clear from the R output that, as p = 0.87, there is no significant evidence to reject H0 : β2 = 0
(the quadratic term does not significantly improve the fit), agreeing with the previous R output.
The other test in this table is whether H0 : β1 = 0 given that β2 = 0. Assuming that β2 = 0,
there is significant evidence (p = 0.003) to reject H0 : β1 = 0 (the linear term does significantly
improve the fit).
We can clarify what these terms mean using R. The following output reproduces the sums of squares
values in the ANOVA table above.

> attach(tractor.data)
> tractor.lin.lm<-lm(Maint~Age)
> tractor.quad.lm<-lm(Maint~Age+I(Age^2))
> y<-Maint
> sum((tractor.lin.lm$fitted)^2)-17*(mean(y))^2
[1] 1099635
> sum((tractor.quad.lm$fitted)^2)-17*(mean(y))^2-1099635
[1] 2231
> sum((y-tractor.quad.lm$fitted)^2)
[1] 1203176
> sum(Maint^2)-17*mean(Maint)^2
[1] 2305042
We can also verify that S2 = β̂2^T G_{qq}^{-1} β̂2 using R with the following code:

> X=cbind(rep(1,17),Age,Age^2)
> G=solve(t(X)%*%X)
> G22=t(c(0,0,1))%*%G%*%c(0,0,1)
> solve(G22)*sum(tractor.quad.lm$coefficients[3]^2)
[,1]
[1,] 2231

as required. Remember that solve is the R command that returns the inverse of a non-singular
matrix.

1.11.5 Limitations of using sequential sums of squares as a variable selection tool

So we have seen that the anova table can be used as a way of comparing nested models by
performing a series of hypothesis tests. So is this the end of the story as far as variable selection
goes? Remember what we want is to be able to build a statistical model. To do this we have
to select from a set of explanatory variables those that we want in our statistical model because
they help to explain or account for the observed variation in our data. Can we always use anova
to do this? The answer to this is that it depends. One of the problems with the sequential sums
of squares method is that it is a sequential procedure. The sums of squares in subsequent rows of
the anova table represent extra sums of squares assuming the previous explanatory variables are
in the model. So the order of the explanatory variables in the table might affect our conclusions.
The only time when the order of terms in the anova table is not important is when the ex-
planatory variables are uncorrelated. Generally this only happens in designed experiments
where observations are made at carefully chosen levels of factor variables. In studies without
this specific design the order of terms in the table is important.
Consider the linear model yi = β0 + β1 xi + β2 xi² + εi fitted to the tractor data. The order in which
the terms appear in the anova table in R is determined by the order in which they enter the linear
model formula in R. Entering age followed by age² produces the following anova table.

> anova(lm(Maint~Age+I(Age^2),data=tractor.data))
Analysis of Variance Table

Response: Maint
Df Sum Sq Mean Sq F value Pr(>F)
Age 1 1099635 1099635 12.795 0.003034 **
I(Age^2) 1 2231 2231 0.026 0.874295
Residuals 14 1203176 85941

Entering age2 followed by age produces the following anova table

> anova(lm(Maint~I(Age^2)+Age,data=tractor.data))
Analysis of Variance Table

Response: Maint
Df Sum Sq Mean Sq F value Pr(>F)
I(Age^2) 1 1074330 1074330 12.5008 0.003294 **
Age 1 27536 27536 0.3204 0.580323
Residuals 14 1203176 85941

We see that the sums of squares for Age in the two tables are different. In the second case the
sums of squares for Age is the extra sums of squares after Age2 has been fitted whereas the
sums of squares in the first case is the sums of squares with just Age in the linear predictor. If
Age and Age2 were uncorrelated then the sums of squares for Age would be the same in both
anova tables and the conclusions of the hypothesis tests would be the same; but clearly Age
and Age2 are correlated. Based on the first anova table we would include Age in the model
but not Age2 . In the second anova table we would include Age2 but not Age. With two highly
correlated variables, the second variable would nearly always be rejected using an anova table
based on sequential sums of squares.
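As a quick numerical check (a sketch, assuming the tractor.data frame used throughout this chapter), the correlation between the two regressors can be computed directly, which confirms why the order of the terms matters so much here.

# Correlation between the two regressors in the quadratic model.
# A value close to 1 explains why the sequential sums of squares
# depend so strongly on the order of the terms.
with(tractor.data, cor(Age, Age^2))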

1.11.6 Using anova tables to perform tests on subsets of variables

Suppose we want to perform a hypothesis test on a subset of variables. We have seen that one
way to do this is to define the C matrix and c vector and to use equation (1.10). This is rather
tedious and actually we can use the sequential sums of squares anova table to do it, but as we
shall see we have to be very careful of the order in which terms enter the table. This is because
the sums of squares for a variable in the table are in fact the excess sums of squares a variable
contributes beyond that of the earlier variables in the table. Consider fitting the following model
to the tractor data:
yi = β0 + β1 xi + β2 xi² + β3 xi³ + β4 xi⁴ + εi
where as usual xi is the age of the i-th tractor and yi the maintenance cost. The anova table is
obtained as follows:

> trac.quart.lm=lm(Maint~Age+I(Age^2)+I(Age^3)+I(Age^4))
> anova(trac.quart.lm)
Analysis of Variance Table

Response: Maint
Df Sum Sq Mean Sq F value Pr(>F)
Age 1 1099635 1099635 17.2249 0.001346 **
I(Age^2) 1 2231 2231 0.0350 0.854823
I(Age^3) 1 31195 31195 0.4886 0.497860
I(Age^4) 1 405902 405902 6.3581 0.026833 *
Residuals 12 766079 63840

Suppose we want to test the null hypothesis that H0 : β3 = β4 = 0. We have seen that we could
do this using equation (1.10) with

    C = ( 0 0 0 1 0 )       c = ( 0 )
        ( 0 0 0 0 1 ),          ( 0 )

and by specifying the design matrix appropriately. But how can we use the anova table to test
H0 : β3 = β4 = 0? Because of the order of the terms in the anova table above, the sums of
squares for Age3 and Age4 are the excess sums of squares ‘explained’ beyond that ‘explained’
by Age and Age2 . Remember that for nested F-tests we are looking at whether the excess
sums of squares (the difference in the sums of squares for the larger and smaller models) for the
parameters in the null hypothesis is large enough for us to reject the null. So the sums in the
table are exactly the sums of squares that we want to do our F-test.

Our F test would take the form F = [(31195 + 405902)/2] / [766079/12] = 3.423, with p = 0.067
(using 1-pf(3.423,2,12)), so there is no significant evidence to reject H0 : β3 = β4 = 0.
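The same test can also be carried out with a nested-model anova call; the following is a sketch that reuses the quadratic and quartic model objects defined in this and the previous subsection.

# Testing H0: beta3 = beta4 = 0 by comparing the quadratic and quartic
# fits directly (equivalent to the hand calculation above).
anova(tractor.quad.lm, trac.quart.lm)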

So could we use the same anova table to test the null hypothesis that H0 : β1 = β3 = 0? What
we need is the excess sums of squares that Age and Age3 explain beyond that explained by Age2
and Age4 . Clearly the table doesn’t provide this information (the sums of squares for Age is the
sums of squares explained without any other terms in the model). So can we do the test using
an anova table? The key is to remember that you can determine the order of the terms in the
anova table by changing the order in which they enter the model formula in R. We want Age² and Age⁴
to come before Age and Age³; this can be achieved using

> quart=lm(Maint~I(Age^2)+I(Age^4)+Age+I(Age^3),data=tractor.data)
> anova(quart)
Analysis of Variance Table

Response: Maint
Df Sum Sq Mean Sq F value Pr(>F)
I(Age^2) 1 1074330 1074330 16.8285 0.001466 **
I(Age^4) 1 18524 18524 0.2902 0.599968
Age 1 16373 16373 0.2565 0.621727
I(Age^3) 1 429736 429736 6.7315 0.023462 *
Residuals 12 766079 63840

Our F test would take the form F = [(16373 + 429736)/2] / [766079/12] = 3.494, with p = 0.064
(since 1-pf(3.494,2,12) = 0.064), so there is no significant evidence to reject H0 : β1 = β3 = 0.
So the key to using sequential sums of squares to test these kinds of hypotheses is to ensure that
the terms you are testing come after all the other terms in the anova table.

1.12 Appendix

1.12.1 Random vectors

Vector mean

If we have a collection of random variables x1, . . . , xp, we can form them into a vector
x = (x1, . . . , xp)^T. The joint distribution of the collection of random variables (x1, . . . , xp) defines
the distribution of the random vector x.
We define the mean of x to be the vector of means of its components:

E(x) = (E(x1), E(x2), . . . , E(xp))^T.

Covariance matrix

The obvious way to define the variance of a vector random variable would be to make it the
vector of variances of its components. But it is useful also to recognize the covariances between
the various components of x, too. Accordingly, we define it to be the p × p matrix

              ( Var(x1)       Cov(x1, x2)   · · ·   Cov(x1, xp) )
    Var(x) =  ( Cov(x2, x1)   Var(x2)       · · ·   Cov(x2, xp) )
              ( ...           ...           ...     ...         )
              ( Cov(xp, x1)   Cov(xp, x2)   · · ·   Var(xp)     )

The above is usually referred to as the variance-covariance matrix of x, or just as the covariance
matrix of x.
Notice that since covariance is a symmetric relation, Cov(xi , xj ) = Cov(xj , xi ), this is a sym-
metric matrix. Thus we can write V ar(x)T = V ar(x). Also we note that Cov(xi , xi ) = V ar(xi ).

Vector covariance

More generally, if x = (x1, . . . , xp)^T is a p × 1 random vector and y = (y1, . . . , yq)^T is a q × 1
random vector, we can define
                ( Cov(x1, y1)   Cov(x1, y2)   · · ·   Cov(x1, yq) )
    Cov(x, y) = ( Cov(x2, y1)   Cov(x2, y2)   · · ·   Cov(x2, yq) )
                ( ...           ...           ...     ...         )
                ( Cov(xp, y1)   Cov(xp, y2)   · · ·   Cov(xp, yq) )

This is not a symmetric matrix. In fact, since it is p × q, it is not even square unless p = q.
Notice, however, that Cov(y, x) = Cov(x, y)T . Also we have Cov(x, x) = V ar(x).
Sometimes Cov(x, y) is referred to as the covariance of x and y.

Some useful results

We know many useful results about expectations, such as E(aX + b) = aE(X) + b, when a and
b are constants. Here are some vector generalizations.

1. E(Ax + b) = A E(x) + b, when A is a q × p matrix of constants and b is a q × 1 vector of constants. Notice that this expresses the mean of the q × 1 random vector in terms of that of the p × 1 random vector x.

2. Var(Ax + b) = A Var(x) A^T.

3. Cov(Ax + b, Cy + d) = A Cov(x, y) C^T, where the dimensions of the matrices A and C and the vectors b and d are as required to allow the matrix multiplications.
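These identities are easy to check numerically. The following is a small simulation sketch of result 2; the particular A, b and Σ are illustrative choices only.

# Simulation check of Var(Ax + b) = A Var(x) A^T.
set.seed(1)
Sigma <- matrix(c(2, 1, 0,
                  1, 3, 1,
                  0, 1, 1), nrow = 3)          # a chosen covariance matrix
C <- t(chol(Sigma))                            # Sigma = C %*% t(C)
x <- C %*% matrix(rnorm(3 * 10000), nrow = 3)  # 10000 draws with mean 0, Var Sigma

A <- matrix(c(1, 0,  2,
              0, 1, -1), nrow = 2, byrow = TRUE)
b <- c(5, -3)
y <- A %*% x + b

cov(t(y))             # empirical Var(Ax + b)
A %*% Sigma %*% t(A)  # theoretical A Var(x) A^T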

1.12.2 The multivariate normal distribution

Definition

The normal distribution is the most important distribution for a single random variable, and its
extension to a vector random variable is equally important in Statistics.
Let us denote the p-dimensional random vector

xT = (x1 , . . . , xp )

where x1, . . . , xp are univariate random variables.
A p-dimensional random vector x is said to have a multivariate normal distribution if and only if every linear combination a^T x of its components has a univariate normal distribution (in particular, each component of x is then univariate normal).
In the univariate case the probability density function (p.d.f.) is

f(x) = (1/(√(2π) σ)) exp{ −(x − µ)²/(2σ²) },

i.e. it depends on two parameters: µ and σ. Note that this formula can also be written as

f(x) = (1/(√(2π) σ)) exp{ −(1/2)(x − µ)^T Σ^{-1} (x − µ) }        (1.11)

where Σ = σ². When x is a p-dimensional random vector it can be shown that the joint p.d.f. is

f(x) = (1/((2π)^{p/2} |Σ|^{1/2})) exp{ −(1/2)(x − µ)^T Σ^{-1} (x − µ) },
where Σ is the p × p covariance matrix of the random variables and |Σ| is its determinant.
This equation reduces to the previous one when p = 1. The p.d.f. of the MVN also has two
parameters: the vector µ and the covariance matrix Σ. In statistical notation, we would write
x ∼ Np (µ, Σ)

Linear transformation

The key results about multivariate normal distributions are these.

• Linear transformation. If A and b are constants, of dimensions q × p and q × 1 respectively, then

  Ax + b ∼ Nq(Aµ + b, A Σ A^T).

• Standardization. If x is non-degenerate, we can derive a standardizing transformation that produces a vector with zero mean and identity variance. Since Σ is a variance-covariance matrix it is positive definite. One of the properties of positive definite matrices is that we can find a p × p matrix C such that Σ = CC^T. Then it follows immediately that

  C^{-1}(x − µ) ∼ Np(0p, Ip),

  where 0p is the p × 1 vector of zeroes and Ip is the p × p identity matrix. C is not unique: there are many possible choices for C, each of which can be described as a square root of Σ. So standardization produces a vector of random variables that are independent and identically distributed as N(0, 1), i.e. independent standard normal random variables (see the sketch after this list).

• If Σ is diagonal, i.e.

      ( σ11  0    · · ·  0   )
  Σ = ( 0    σ22  · · ·  0   )
      ( ...  ...  ...    ... )
      ( 0    0    · · ·  σpp )

  then the random variables x1, . . . , xp are independent and of course they are uncorrelated too.
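The standardization result can be illustrated with a short simulation; this is a sketch only, with an arbitrary choice of µ and Σ, using the Cholesky factor as the square root C.

# Standardizing a multivariate normal vector: C^{-1}(x - mu) ~ N_p(0, I_p).
set.seed(2)
mu    <- c(1, -2)
Sigma <- matrix(c(4, 1,
                  1, 2), nrow = 2)
C <- t(chol(Sigma))                                 # lower-triangular square root, Sigma = C C^T

x <- mu + C %*% matrix(rnorm(2 * 5000), nrow = 2)   # 5000 draws from N_2(mu, Sigma)
z <- solve(C) %*% (x - mu)                          # standardized vectors

rowMeans(z)   # both approximately 0
cov(t(z))     # approximately the 2x2 identity matrix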

Chapter 2

Residuals, Transformations, Factors and Variable Selection

2.1 Checking model assumptions

2.1.1 Residuals for model checking

An important feature of any statistical analysis should be to check, as far as it is possible to do so,
that the assumptions made in the model are valid. In the case of linear models, our assumptions
concern the error terms ε. For observation i, the model error is εi = yi − xi^T β.
We assume the εi's to:

• have zero mean (which is equivalent to assuming that we have got the systematic part of
the model represented by Xβ right);
• be independent;
• have common variance (homoscedasticity);
• be normally distributed.

In principle, if we knew the actual values of the εi's then we could check these assumptions.
We could look to see if ε1, ε2, . . . , εn looked in all relevant respects like a sample from a normal
distribution with zero mean.
We do not know the εi's and cannot deduce them from the observed yi's unless we know β (the
true parameter vector rather than β̂, the estimate of it). In practice, we only have estimates of
the εi's, which are the residuals ei,

ei = yi − xi^T β̂.
Therefore, the main strategy for checking the assumptions of linear models is to perform various
checks using residuals.

2.1.2 Standardized residuals

Technically, we should recognize that the residuals have different properties from the errors. If
the model is correct, they are normally distributed and have zero mean. However, they are not
independent and do not have a common variance.
We can correct for the unequal variances by defining the standardized residuals, which are given by

si = ei / √Var(ei).
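In R the standardized residuals of a fitted lm object can be obtained with rstandard(), or built by hand from the raw residuals, the estimated error standard deviation and the leverages; a minimal sketch, using the simple linear fit to the tractor data, is:

# Standardized residuals: s_i = e_i / sqrt(Var(e_i)),
# where Var(e_i) = sigma^2 (1 - h_ii) and h_ii is the i-th leverage.
a <- lm(Maint ~ Age, data = tractor.data)

e     <- residuals(a)
sigma <- summary(a)$sigma          # residual standard error
h     <- hatvalues(a)              # leverages h_ii

s_manual <- e / (sigma * sqrt(1 - h))
cbind(s_manual, rstandard(a))      # the two columns agree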

[Figure 2.1 appears here, with four panels: Plot 1: Index plot; Plot 2: QQ-plot; Plot 3: Histogram; Plot 4: Residuals versus fitted values.]
Figure 2.1: Residual plots for the Tractor data.

However, there seems little extra to gain from this correction: the standardized residuals no longer
have normal distributions; their distribution will be approximately tn−p.

2.1.3 Residual plots

The most common diagnostic use for residuals is to plot them in various ways. In R we use
several plotting commands: plot, qqnorm, hist and par. For the Tractor data, the commands
are:

> a<-lm(Maint ~ Age,data=tractor.data)


> par(mfrow=c(2,2))
> plot(a$resid,xlab="index",ylab="residuals",main="Index plot")
> qqnorm(a$resid,main="QQ - plot")
> hist(a$resid,xlab="Residuals",main="Histogram")
> plot(a$fit,a$resid,xlab="Fitted values",ylab="Residuals",
+ main="Residuals versus fitted values")

which result in the plots in Figure 2.1. The par command changes the arrangements of plots on
the screen. In this example we arrange them in a 2 by 2 grid.

2.1.4 Checking normality

Plot 2 in Figure 2.1 shows a Q-Q plot. This plot is useful as it allows a visual inspection of the
normality of the residuals. It plots the quantiles of the observed data against quantiles from a
normal distribution with identical mean and variance. If a sample is genuinely from a normal
distribution with constant variance, then the residuals will lie close to a straight line. (For
this purpose it would be better to use the scaled or standardized residuals than the ordinary
residuals, if the sample size is relatively small.)
If the plot is clearly bowed, it suggests a skew distribution (whereas the normal distribution
is symmetric). If the plotted points curve down at the left and up at the right, it suggests
a distribution with heavier tails than the normal (and therefore one that is prone to produce
outliers). In our case, the graph is as nearly linear as we could expect with such a small sample.
The histogram (plot 3 in Figure 2.1) is another way of looking at the underlying assumption of
normality. In a large sample, it can be a useful addition to the normal plot, but in this case
there are not enough residuals to allow meaningful conclusions.

2.1.5 Checking independence

Plot 1 in Figure 2.1 is an index plot or I-chart. It plots residual against observation number, and
indicates whether the observations might be correlated. Specifically, it helps to assess whether
adjacent observations are correlated. Such a correlation would show up by observations that are
adjacent in the sequence having very similar residuals. The plot would then tend to move slowly
up and down. Like most diagnostics, it is not very sensitive for a sample as small as the tractor
data, and in this case there is no reason to suspect this kind of correlation. However, sometimes
we might suspect that observations made adjacently in time or space would show correlation,
and the index plot will show this if we order the observations appropriately.

2.1.6 Checking homoscedasticity

Plot 4 in Figure 2.1 is of residuals against fitted values, and its primary function is to look for
heteroscedasticity (i.e. non-homoscedasticity). The most common form of heteroscedasticity
arises when larger observations are also more variable. This often happens when the response
variable is necessarily positive. In this case, if the response is just above 0 its variance is likely
to be less than if the response was much larger since the response is bounded below by zero.
We would observe this kind of heteroscedasticity in the plot by seeing the residuals appearing
to fan out as we move from left to right. Other patterns might indicate other ways in which the
response variance is related to its mean.

2.2 Formal tests

2.2.1 Correcting for multiple testing

The above residual-based plots offer an informal way of checking the fit of the linear model.
We may construct more formal tests (note the difference between check and test) based on
confidence intervals or hypothesis tests. Here, we just provide the simplest test, which is
based on the assumption that if the model fit is good, then approximately the standardised
residuals si follow a tn−p distribution. We can then propose a test that compares |si| with the
quantile t_{n−p, 1−α/2}, for some value of the significance level α as discussed below. If
|si| > t_{n−p, 1−α/2}, we would consider si to be too large for a tn−p distribution and so we would
consider the observation yi to be an outlier.
Note that if we do 20 tests at the 5% level, then on average one of them will lead to rejection of
the null when the null is true. There have been many suggestions of methods to overcome the
issue of multiple testing. A simple correction is the Šidák correction. This type of correction
controls the probability of making at least one false positive claim in a family of tests; it is
easy to implement and works as follows: suppose we are performing n tests and in each test we
specify the probability of making a type I error to be β. Then, if the tests are independent, the
probability of at least one false positive claim in the n tests is given by 1 − (1 − β)^n. If we want
to control this overall rate and fix it at α then

1 − (1 − β)^n = α  ⇔  β = 1 − (1 − α)^{1/n}.

So instead of using a type I error of α in each test, we use a type I error of 1 − (1 − α)^{1/n} in
each test. The overall family-wise error rate is then α. This type of correction is similar to the
familiar Bonferroni correction (which uses α/n in each test) but gives a slightly less stringent
per-test threshold and so has slightly greater statistical power.
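As an illustration, the following sketch uses the numbers from the tractor example in the next subsection (n = 17 residuals, overall level α = 0.05 and n − p = 15 degrees of freedom).

# Sidak-corrected per-test significance level and the corresponding
# t quantile used as the outlier threshold for standardized residuals.
alpha <- 0.05
n     <- 17                      # number of tests (one per observation)
beta  <- 1 - (1 - alpha)^(1/n)   # per-test type I error, about 0.003

qt(1 - beta/2, df = 15)          # threshold, about 3.5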

2.2.2 Standardized residuals for the tractor data

After fitting a simple linear regression model to the tractor data, the residuals can be obtained
by the commands

> a <- lm(Maint~Age,data=tractor.data)


> a$resid
1 2 3 ... 12 13 14 15 16 17
-297.35 132.65 116.65 ... 259.08 522.66 10.66 93.66 211.80 -226.48

After loading the MASS library (library(MASS)), we can obtain the standardized residuals
with the command

> stdres(a)
1 2 3 ... 12 13 14 15 16 17
-1.09 0.49 0.43 ... 0.99 2.02 0.04 0.36 0.78 -0.90

We can see that observation 13 has a large standardized residual of 2.02.


Without applying the Šidák correction we would compare |si| with t_{1−0.05/2, 17−2} = 2.13; although
s13 = 2.02 is large it does not exceed the threshold.
If we apply the Šidák correction we would have to use a significance level

1 − (1 − 0.05)^{1/17} = 0.003

which gives a new quantile t_{1−0.003/2, 15} = 3.535. Again s13 is not considered too high to be an
outlier.

2.3 Transformations

2.3.1 Transforming to restore homoscedasticity

Theory

We usually transform y when we suspect, either by doing diagnostic checks on residuals or for
some other reason, that the model assumption of homoscedasticity does not hold if we model y
as the dependent variable. Suppose that y is a random variable whose variance depends on its
mean (true of some Generalized Linear Models). So if E(y) = µ then V ar(y) = g(µ), and the
function g(.) is known. We seek a transformation from y to z = f (y) such that the variance of
z is (approximately) constant. This is known as a variance-stabilising transformation.
Consider expanding f in a Taylor series about its mean. Keeping only the first order term, we
have
z = f(y) ≈ f(µ) + (y − µ) f′(µ).

Now taking expectations,

E(z) ≈ E( f(µ) + (y − µ) f′(µ) )
     = f(µ) + (E(y) − µ) f′(µ)
     = f(µ),

and since f′(µ) is a scalar,

Var(z) ≈ Var( y f′(µ) )
       = Var(y) (f′(µ))²
       = g(µ) (f′(µ))².

We want this to be a constant, say k², so that k² = g(µ) (f′(µ))². Therefore we seek a function f such that

f′(µ) = k / √g(µ).

Therefore the solution is

f(µ) = k ∫ (g(µ))^{-1/2} dµ.

Example 1. Suppose that the variance is proportional to the mean, so that g(µ) = aµ. Then

f(µ) = k a^{-1/2} ∫ µ^{-1/2} dµ = 2 k a^{-1/2} √µ.

Since k and a are arbitrary constants, the variable z = √y should have approximately constant
variance. Whether it does depends on how accurate our estimation of the function g is (i.e. our
assessment of how the variance of the response depends on the mean). If we are assessing this
relationship by eye it may not be very accurate.
Example 2. Suppose the standard deviation of y is proportional to µ, so that g(µ) = aµ², then

f(µ) = k a^{-1/2} ∫ µ^{-1} dµ = k a^{-1/2} ln µ.

The transformed variable z = ln y should have approximately constant variance.
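A small simulation illustrates Example 2; everything here (the group means, sample sizes and the constant of proportionality) is an arbitrary choice made just for the sketch.

# Simulate responses whose standard deviation is proportional to the mean,
# then compare the spread of y and of log(y) across groups.
set.seed(3)
mu <- c(5, 20, 80)                            # three group means
y  <- lapply(mu, function(m) rnorm(2000, mean = m, sd = 0.2 * m))

sapply(y, sd)                        # standard deviations grow with the mean
sapply(y, function(v) sd(log(v)))    # roughly constant after the log transform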

Determining how the variance depends on the mean

It is primarily through plotting ei against ŷi that we will see indications of non-homoscedasticity.
If we see a funnel shape with the spread of the residuals seeming to increase linearly with ŷi ,
then this suggests that the standard deviation of yi is proportional to its mean. Then, example
2 above suggests that we should use a log transformation.
If we see signs of increasing standard deviation, but the growth flattens out as yi increases,
we may think that a standard deviation proportional to the square root of the mean is more
appropriate. Then example 1 above suggests using a square root transformation of the dependent
variable.
Other patterns might suggest other kinds of g(µ), although in practice it is hard to distinguish
anything more subtle unless we have a great deal of data.
Remember that in terms of the transformed variable, what may have been a linear relationship
between y and the explanatory variables becomes a nonlinear relationship between the new z
and the same explanatory variables. This can mean that although a linear model for z better
satisfies the assumption of homoscedasticity we cannot represent z so well through linear com-
binations of explanatory variables. So removing a problem of heteroscedasticity can introduce
other problems!
Fortunately in practice, the reverse is often true. We frequently find that after a suitable
variance-stabilizing transformation the relationship between the dependent variable and explana-
tory variables actually simplifies. Indeed, the transformation often also improves normality.

2.3.2 Estimating a transformation

Families of transformations

Suppose we cannot easily determine how the variance of the response depends on the mean by
eye; a formal statistical approach to transformations is to try to estimate what transformation
to use.
Suppose that we have a family of transformations fλ indexed by a parameter λ. So for any given
λ we could define the new dependent variable zλ = fλ (y). Suppose that we believe that there is
some value of λ for which zλ follows a linear model in which all the assumptions are true. Then
we have a model
fλ(yi) = xi^T β + εi
where the i ’s are independent N (0, σ 2 ). This is a model with parameters β, σ 2 and λ.

Maximum likelihood

We can estimate λ by maximum likelihood. To calculate the likelihood we have to remember
that we do not observe fλ(yi), which depends on the unknown parameter λ. We observe the
yi's, and the likelihood should be the joint probability of the yi's given the parameters, and this
entails bringing in the Jacobian of the transformation from fλ(yi) to yi. The likelihood is

L(βλ, σ², λ; y) ∝ σ^{-n} exp{ −(1/(2σ²)) Σ_{i=1}^n ( fλ(yi) − xi^T β )² } Π_{i=1}^n f′λ(yi).

We already know how to maximize the likelihood with respect to β and σ², for fixed λ.
We first choose β̂ to minimize the sum of squares in the exponent. The minimal value is the
residual sum of squares, which we now denote by Sλ to show that it depends on λ. Then the
maximization with respect to σ 2 yields σ̂λ2 = Sλ /n.
Remember that the MLE has the divisor n, which we modify to n − p when we want an unbiased
estimator.
We now have

L(β̂λ, σ̂λ², λ; y) ∝ Sλ^{-n/2} Π_{i=1}^n f′λ(yi).

The MLE of λ is now obtained by maximising L(β̂ λ , σ̂λ2 , λ; y). In practice it is usually easier to
maximise its logarithm

ℓ(β̂λ, σ̂λ², λ; y) = c − (n/2) log Sλ + Σ_{i=1}^n log f′λ(yi),

where c is a constant (not depending on λ) and it is usually ignored, i.e. c = 0.

The Box-Cox family

Whilst the above theory works for any family of transformations, by far the most popular family
is the family of power transformations, also known as the Box-Cox family. Box and Cox (1964)
proposed a family of transformations on the response variable to correct non-normality and/or
non-constant variance. If the response is positive, the transformation is given by

fλ(y) = (y^λ − 1)/λ   if λ ≠ 0,
fλ(y) = log y         if λ = 0.

It can be shown that fλ(y) is a continuous function of λ. We assume that the transformed
values of y are normally distributed about a linear combination of the covariates with constant
variance, that is fλ(yi) ∼ N(xi^T β, σ²) for some unknown value of λ. Hence, the problem is to
estimate λ as well as the parameter vector β.
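The family is easy to code directly; the following helper function is an illustrative sketch (not part of any package) that applies the transformation for a given λ.

# Box-Cox power transformation of a positive response y.
boxcox_transform <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

boxcox_transform(c(1, 2, 10), lambda = 0.5)  # the square-root case, 2*(sqrt(y) - 1)
boxcox_transform(c(1, 2, 10), lambda = 0)    # the log case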

2.3.3 Simulated data set

Here is an example, comprising data that were simulated in order to demonstrate some of the
features we have been looking at in this chapter. There are two explanatory variables, X and
Z on 100 observations.
Here is the R output from fitting a regression with a constant term and the x and z variables
included.

> a <- lm(Y~X+Z, data=sim2.data)


> summary(a)
lm(formula = Y ~ X + Z, data = sim2.data)
Residuals:
Min 1Q Median 3Q Max
-36.3876 -5.8166 -0.2866 4.9869 51.3028
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -31.0994 6.9300 -4.488 1.98e-05 ***
X 0.6631 0.4393 1.509 0.134
Z 18.1289 1.5849 11.438 < 2e-16 ***
---
Residual standard error: 12.59 on 97 degrees of freedom
Multiple R-squared: 0.5744, Adjusted R-squared: 0.5656
F-statistic: 65.46 on 2 and 97 DF, p-value: < 2.2e-16

The individual p-value of X seems to be insignificant and the R2 and adjusted R2 are low. The
null hypothesis H0 of both coefficients of X and Z being equal to 0 is rejected. However, there
are other problems with the fit. The 4-panel residual plot is shown in Figure 2.2.

Variance stabilizing

The plot of ei against ŷi suggested a standard deviation proportional to the mean, so our first
transformation idea is to take logarithms. Here is some R output obtained using the same model
as before but with log(y) defined as the dependent variable. The X variable is not important
and the analysis suggests it is not needed in the final statistical model.

> a <- lm(log(Y)~X+Z, data=sim2.data)


> summary(a)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.29441 0.32007 4.044 0.000105 ***
X 0.02918 0.02029 1.438 0.153561
Z 0.54117 0.07320 7.393 5.11e-11 ***
---
Residual standard error: 0.5816082 on 97 degrees of freedom

[Figure 2.2 appears here: index plot, QQ-plot, histogram and residuals-versus-fitted-values plot for the untransformed fit.]

Figure 2.2: Plot of residuals for simulated data set.

The number of observations with high standardized residuals (|si| > 3.6; this threshold is obtained
by using the Šidák correction, see Section 2.2.1) is reduced from 3 to 1. Numbers 23, 35 and 74
(which were previously in the tails of the t97 distribution) are no longer flagged as outliers (based
on a threshold of ±3.6). However, observation 43, which had a moderately high standardized
residual of -2.91 in the untransformed analysis, has a standardized residual of -7.97 in the
log-transformed analysis. So some observations become outliers that were not outliers previously.
The 4-panel residual plot shows considerable improvement. Figure 2.3 shows the 4 diagnostic plots
for the simulated data with the log-transformed response.
The normal plot and histogram support the normality assumption except for one or two ap-
parent outliers. The plot of residuals versus fitted values suggests that homoscedasticity is now
reasonable (observation 43 is clear on this plot with si = −8). Figure 2.4 shows the standardized
residuals versus fitted value plot with the two lowest residuals (observations 43 and 61) removed
and Figure 2.5 shows the corresponding histogram with these observation removed. We are not
suggesting removing these observations from the analysis but plots without them help to assess
heteroscedasticity and non-normality more effectively because we can zoom in on the area of
interest.

Obtaining a Box-Cox transformation

Note that each value of λ corresponds to a different transformation. In this section we are trying
to estimate the value of λ that best ‘fits’ our data and we use a maximum likelihood method to
do so. We apply the methods to the simulated data set.
Calculations by hand
We have fλ(y) = (y^λ − 1)/λ, so that f′λ(y) = y^{λ−1} and the log likelihood takes the form

ℓ(β̂λ, σ̂λ², λ; y) = c − (n/2) log Sλ + Σ_{i=1}^n log f′λ(yi) = c − (n/2) log Sλ + (λ − 1) n ȳL,

where n ȳL is the sum of the logarithms of the observations (n ȳL = Σ_{i=1}^n log yi). For simulated
data set 2, n ȳL = 323.94. Note that n ȳL depends only on the values of the response.

[Figure 2.3 appears here, with four panels: I Chart; QQ plot; Histogram of St. residuals; St. residuals vs fitted values.]
Figure 2.3: Various residual plots for log-transformed simulated data set

In R, the value of Sλ for a particular transformation (i.e. a particular value of λ) can be obtained
using the anova command: Sλ is the residual sums of squares. Using the output in Section 2.3.3,
we see that S0 = 32.812 (λ = 0 corresponds to a log transformation). λ = 1 corresponds to
subtracting 1 from each response value and can be thought of as the ‘no change’ transformation.
We can use an anova command on the linear model based on the untransformed data to show
that S1 = 15381.66 which is left as an exercise (note that subtracting 1 from every yi changes
the estimated intercept by 1 but does not alter the residual sum of squares). We therefore get
log likelihoods (ignoring c) of

c − (n/2) log Sλ + (λ − 1) n ȳL = −50 log 15382 = −482.0466   (λ = 1)

and

c − (n/2) log Sλ + (λ − 1) n ȳL = −50 log 32.81153 − 323.94 = −498.479   (λ = 0).
Comparing the log-likelihoods shows that doing no transformation provides a higher log-likelihood
than a log transformation. This may seem somewhat surprising given the violation of the model
assumptions in the untransformed data.

If we consider instead a square root transformation f_{0.5}(y) = 2(√y − 1), the residual sum of
squares from fitting with dependent variable √y is 111.111, and doubling the dependent variable
would multiply this by 4 to give S_{0.5} = 444.444. Or you could fit the model directly using

lm(2*sqrt(Y)~X+Z, data=sim2.data)

The log likelihood is −50 log 444.444 − 161.97 = −466.81. Therefore, this method suggests
that the square root transformation is better than either doing nothing or taking logs of the
response. But what we really want to do is to find the maximum likelihood estimate of λ rather
than compare a few different values of λ. Once we have a maximum likelihood estimate of λ
(with a 95% confidence interval), we can see which transformations are likely to be the most
sensible according to the available data.
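The hand calculation above can be turned into a simple grid search; a minimal sketch (defining its own helper function and reusing the sim2.data frame from this section) is:

# Profile log-likelihood for the Box-Cox parameter, computed by hand
# (ignoring the constant c), over a grid of lambda values.
profile_loglik <- function(lambda, data) {
  y <- data$Y
  data$z <- if (lambda == 0) log(y) else (y^lambda - 1) / lambda
  fit <- lm(z ~ X + Z, data = data)
  S_lambda <- sum(residuals(fit)^2)
  n <- length(y)
  -(n / 2) * log(S_lambda) + (lambda - 1) * sum(log(y))
}

lambdas <- seq(-1, 2, by = 0.05)
ll <- sapply(lambdas, profile_loglik, data = sim2.data)
lambdas[which.max(ll)]   # close to the boxcox() estimate of about 0.52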

[Figure 2.4 appears here: St. residuals vs fitted values for the log-transformed fit, with the two lowest residuals removed.]
Figure 2.4: Plot of standardized residuals for log-transformed simulated data set

Calculations by R

To obtain the Box-Cox transformations in R we first load the MASS library (library(MASS));
we then use the command boxcox. If a is an lm object then we can plot the log likelihood
profile by:

> a <- lm(Y~X+Z, data=sim2.data)


> boxcox(a,lambda=seq(-2,2,1/20),plotit=TRUE,
xlab=expression(lambda),ylab="Log-likelihood")

This produces a plot of the Log-likelihood function vs λ, given in Figure 2.6. We can zoom in
a little using a narrower set of λ values.

boxcox(a,lambda=seq(0.35,0.7,1/1000),plotit=TRUE,
xlab=expression(lambda),ylab="Log-likelihood")

This plot is provided in Fig 2.7. The plots of Figures 2.6 and 2.7 confirm the calculations done
by hand, that is λ = 1 (the original data) produces a larger log-likelihood compared with that
of λ = 0 (log transformation). Further it shows that the log-likelihood is maximized at -466.8
and this is achieved at λ = 0.517. Thus the transformation z = (y 0.517 − 1)/0.517 is the optimal
variance-stabilizing transform based on our data. However based on the 95% confidence interval
for λ (0.375,0.675), our square-root transformation (λ = 0.5) is more easily interpreted so we
might suggest this as the transformation to use.

2.4 Factors and interaction plots

2.4.1 Introduction to factors

In this section we discuss factor variables, their parametrization and how to work with them in R.
Figure 2.5: Histogram of standardized residuals for log-transformed simulated data set

So far we have been dealing with continuous variables, such as height or weight. Multiple linear
regression is sufficiently flexible that qualitative (or categorical) variables, such as smoking status
(Smoker/Non-smoker) or blood group (A/B/AB/O), can also be included in the regression as
explanatory variables for the response y. Qualitative variables in linear models are usually called
factor variables. Inclusion of factor variables in linear models is achieved through the use of
‘dummy’ variables. R does all the coding internally so all you have to do is to declare a variable to
be a factor variable and the ‘dummy’ variables are created automatically within R. But we have
to understand what models R is actually fitting to be able to interpret the output. Each level of
a factor variable will have a dummy variable associated with it. This indicator variable can take
one of two values: 1 if the observation is associated with this level of the categorical variable, 0
if it is not. For instance, if the categorical variable is blood group (which has categories A, B, AB
and O) and a subject has blood group AB, then the values of the dummy variables I_A, I_B, I_AB,
I_O for that subject are 0, 0, 1, and 0, respectively.

2.4.2 Design matrices for factor variables - the corrosion data

We look at a simple example to illustrate what design matrices might look like for linear models
with factor variables. It is proposed to use a particular metal alloy for components that have
to survive high temperatures and aggressive gases. There is a rare corrosive reaction that can
take place in this alloy, and an experiment is performed to see how rapidly the reaction would
corrode the metal in the particular temperature and atmospheric conditions. Two pieces of the
alloy are found that are already affected by this corrosion. Each piece is cut into eight equal-
sized segments and the sixteen segments are put in an autoclave (a kind of ‘pressure cooker’ to
simulate the required conditions). One segment from each of the two original pieces is taken
out every twelve hours and weighed to determine the amount of corrosion; the segment is not used
again. Note that corrosion should add weight.
The following model is proposed for yij , which is the measurement on segment j from piece i:

yij = αi + β xij + εij,   i = 1, 2,   j = 1, 2, . . . , 8.

Here xij is the time when the measurement is made. Notice that it is assumed that corrosion
takes place at the same constant rate in both pieces, but the model allows different αi values to
account for the different sizes and corrosion levels in the two original pieces of alloy.

Figure 2.6: Plot of the log-likelihood function of the Box-Cox family for simulated data.

In this chapter we sometimes have to use subscripts on our parameters to indicate which group
they correspond to. Previously we have been using β0 , β1 , β2 , ... to represent the parameters in
our linear model. To avoid having to write double subscripts (i.e. β0i ) we sometimes vary the
convention and use µ, α, β etc to represent the parameters in the linear models. The meaning is
obviously the same.
We have a linear model with the following response vector and design matrix:

y = (13.74, 12.97, 10.78, 10.53, 10.16, 9.38, 9.23, 9.20, 17.24, 18.00, 15.12, 16.08, 13.71, 13.81, 12.61, 14.03)^T

        ( 1 0 96 )
        ( 1 0 84 )
        ( 1 0 72 )
        ( 1 0 60 )
        ( 1 0 48 )
        ( 1 0 36 )
        ( 1 0 24 )
        ( 1 0 12 )
    X = ( 0 1 96 )
        ( 0 1 96 )
        ( 0 1 72 )
        ( 0 1 60 )
        ( 0 1 48 )
        ( 0 1 36 )
        ( 0 1 24 )
        ( 0 1 12 )

and β = (α1, α2, β)^T. Note that the operator forgot to remove the second segment of piece 2 at
84 hours, so it was left to the end of the experiment and weighed at 96 hours. Note that in this
design matrix there is no intercept term (no column of ones).

Figure 2.7: zoomed plot of the log-likelihood function of the Box-Cox family for simulated data.

2.4.3 Calculation by R

We tell R that a variable is a factor by use of the factor command as follows:

> attach(corrosion.data)
> a<-lm(Weight~-1+factor(Piece)+Time)
> summary(a)

Call:
lm(formula = Weight ~ -1 + factor(Piece) + Time)

Residuals:
Min 1Q Median 3Q Max
-0.9595 -0.5978 -0.1177 0.7209 1.3069

Coefficients:
Estimate Std. Error t value Pr(>|t|)
factor(Piece)0 12.074346 0.470012 25.689 1.58e-12 ***
factor(Piece)1 7.829195 0.461723 16.956 3.02e-10 ***
Time 0.054066 0.006857 7.884 2.62e-06 ***
---
Residual standard error: 0.7801 on 13 degrees of freedom
Multiple R-squared: 0.9972, Adjusted R-squared: 0.9965
F-statistic: 1523 on 3 and 13 DF, p-value: < 2.2e-16

Weights are in milligrams, so we estimate that the increased weight due to corrosion per hour
is 0.0541 milligrams.
Note that we had to specify that R should not fit a constant term in this model, and that is
shown by the ‘-1’ argument in the lm function. If there had been a constant term as well then
we would have had a design matrix that was over-parameterized (not of full rank). Different
parameterizations are discussed further in section 2.4.6.

2.4.4 Factors coded as quantitative variables

We have seen that factors are usually qualitative variables acting at 2 or more levels. Sometimes
there is an implicit order among the levels, for example grade of cancer, and sometimes not, for
example gender or eye colour.
As discussed in section 2.4.1, the explanatory variable (that we will label U ) associated with
factors is usually a set of dummy variables. Each dummy variable represents a level of the factor
and it can take one of two values: 1 if the observation is associated with this level of the factor,
0 if it is not. Consider the following data set containing information on the heights of children
in 8 school classes:

Y (height, cm)   U (class)
119              1
105              1
...              ...
122              2
147              2
...              ...
147              8
152              8

In the one-way analysis of variance model, in which yij is the height of child j in class i, we want
to fit a model of the form:

yij = αi + εij        (2.1)
This is a statistical model that allows a different mean for each class and within each class,
observations are normally distributed with a common mean and variance (and the variance
is identical in all classes). We don’t want to impose any numerical relationships between the
groups, i.e. we don’t want to force them to be in ascending order or to force the difference in
means between groups 7 and 6 to be the same as that between groups 3 and 2. If we choose
to use U directly as an explanatory variable in our linear model (lm(Y~U)) then both of these
conditions are imposed on the model. The linear model will be:

yij = µ + β ui + εij

The X matrix would now have the form

        ( 1n1   1n1   )
    X = ( 1n2   2·1n2 )
        ( ...   ...   )
        ( 1nk   k·1nk )

where 1ni is a vector of 1s of length ni . Such a model is much too restrictive. It makes the
unreasonable assumption that the mean for group i is µ + iβ, in other words that the mean is
a linear function of group number. There is generally no reason to suppose such a model. The
group numbers are arbitrary and have no numerical significance in their own right. Any model
we propose must not assume even that the order in which the groups are numbered is relevant.
This is why we propose the model yij = αi + ij . We have k group means, and since we cannot
assume any structure for them we need k parameters to describe them. When undertaking
statistical analyses you have to decide whether there is any order among the levels of the cat-
egorical variable. If there isn’t or you don’t want to force there to be, then you can use the
factor command within the lm command.

So is it just a case of allowing a separate dummy variable for each level of the categorical
variable? This is certainly technically allowed but in most uses of linear models, we want to
have a constant term that corresponds to a column of ones in the X matrix. We generally want
to have a constant term for two reasons:

• First, it is intuitive to think of there being a common baseline mean µ, from which the
group means deviate according to the level of the parameters.

• Second, the hypothesis that the factor has no effect on the dependent variable should mean
that the group means then take a common value µ, and so this hypothesis should be that
all the level parameters defining deviations from this common mean are zero.

In the corrosion data we had two different constants for the two different pieces of original
alloy, so the X matrix does not require a column of ones but if we want an intercept term that
represents the overall mean of the y values, can we just include it in the design matrix?

2.4.5 Multicollinearity - when X is not full rank

Suppose we have the general situation of multiple regression and that one or more regressor
variables are linear combinations of some other regressor variables. For example consider the
model
yi = β0 + β1 xi1 + β2 xi2 + εi,   i = 1, . . . , n
where now xi2 is a linear combination of xi1 , or

xi2 = λxi1

for some known λ.


Now what does this mean for the design matrix X? We have

        ( 1  x11  x12 )     ( 1  x11  λx11 )
    X = ( 1  x21  x22 )  =  ( 1  x21  λx21 )
        ( ...  ...  ... )   ( ...  ...  ...  )
        ( 1  xn1  xn2 )     ( 1  xn1  λxn1 )

and so the rank of X cannot be 3, because there are only 2 linearly independent columns,
(1, . . . , 1)^T and (x11, . . . , xn1)^T. In fact the rank of X is at most 2, and so the matrix X^T X is
singular.
The above phenomenon is called multicollinearity, indicating that the regressor variable x2 is
linearly related to x1: we have regressor variables that contain identical information. X^T X is
singular and therefore we cannot invert it to compute β̂.
A popular approach to overcoming multicollinearity is to apply enough constraints on the
parameters β themselves to obtain a unique solution.
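A tiny numerical illustration of the problem (an illustrative sketch with made-up numbers, where the second regressor is exactly twice the first):

# When one column of X is a multiple of another, X^T X is singular.
x1 <- c(1, 2, 3, 4, 5)
x2 <- 2 * x1
X  <- cbind(1, x1, x2)

qr(X)$rank               # rank 2, not 3
try(solve(t(X) %*% X))   # fails: the matrix is (numerically) singular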

2.4.6 Resolving over-parametrization using constraints

For the corrosion data, if we specified the model yij = αi + εij, the X matrix would be

    X = ( 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 )^T        (2.2)
        ( 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 )

This is technically correct but doesn’t include a constant term. If we also included a constant
term in the design matrix so that the model is yij = µ + αi + εij, the design matrix becomes

    X = ( 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 )^T
        ( 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 )        (2.3)
        ( 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 )

Now the X matrix is no longer of full rank (the first column is the sum of the other two);
it is over-parameterized. This means that (X T X) is not invertible meaning the least squares
estimators β̂ = (X T X)−1 X T y do not exist. We cannot have a parameter for each factor level
and a constant term. This kind of over parametrization is common with factors.
Therefore we need a constraint to remove the over-parameterization and make the parameters
of the yij = µ + αi + εij model identifiable. There are several parameterizations available but
we focus on two common choices of constraint:

1. Level one zero constraints otherwise known as corner point constraints. The first and
simplest to implement (and hence found often in software packages) is to say that α1 = 0.
This immediately removes one parameter, and also removes the corresponding column of
X. The interpretation of µ in the one factor model now is that it is the mean of group 1.
The interpretation of αi , for i = 2, . . . , k, is that this is the deviation of the mean of group
i from that of group 1.
In a more general linear model containing a factor together with other regressor variables,
the level one zero constraint means that the parameter associated with factor level i, again
for i = 2, . . . , k, is the increment (or decrement if negative) in the mean y when we move
from level 1 of this factor to level i. To perform an analysis in R using the level one zero
constraints use the command

options(contrasts=c(factor="contr.treatment",ordered="contr.treatment"))

contr.treatment is what R calls the level one zero constraints. This will set the intercept
to be the mean of the first level of the factors. This command only needs entering in R
once. It will remain in force until overwritten with a new contrast statement.
Unless otherwise specified we will use the corner point constraints.
2. Sum to zero constraints. Another possible constraint is to insist that the αi ’s sum to
zero. The interpretation then of µ in the one factor model is that it is the average mean,
averaged over the k groups. The αi ’s are then deviations of individual group means from
the average. This is attractive in that it does not entail an arbitrary choice of one αi to
eliminate.
On the other hand, it may be equally arbitrary in the sense that it depends on the actual
levels of the factor that are present in the data (because it is usually the case that other
levels of the factor can also exist).
In a more general linear model containing a factor, the sum to zero constraint gives the
level parameters the interpretations of deviations from the average, in the same way.
Technically, the sum to zero constraint is implemented by removing one level parameter,
usually the last, and setting it equal to minus the sum of all the others.
So the constraint says that Σ_{i=1}^k αi = 0, and this implies that αk = −Σ_{i=1}^{k−1} αi. So wherever αk appears in the model it stands for −Σ_{i=1}^{k−1} αi, instead of representing a separate unknown parameter. In this case the model becomes

    yij = µ + αi + εij,                 i = 1, 2, . . . , k − 1
    yij = µ − Σ_{i=1}^{k−1} αi + εij,   i = k

This removes the column of X corresponding to αk , but also modifies the columns corre-
sponding to all the other αi ’s in rows where i = k.

Consider the one-factor model. We want to fit a model of the form yij = αi + εij but also include
a constant so that we have yij = µ + αi + εij. For this model, the level one zero constraint results in

        ( 1n1  0n1  · · ·  0n1 )            ( µ  )
    X = ( 1n2  1n2  · · ·  0n2 )   ,   β =  ( α2 )
        ( ...  ...  ...    ... )            ( ...)
        ( 1nk  0nk  · · ·  1nk )            ( αk )

while the sum to zero constraint produces

        ( 1n1     1n1     0n1     · · ·  0n1     )            ( µ    )
    X = ( 1n2     0n2     1n2     · · ·  0n2     )   ,   β =  ( α1   )
        ( ...     ...     ...     ...    ...     )            ( ...  )
        ( 1nk−1   0nk−1   0nk−1   · · ·  1nk−1   )            ( αk−1 )
        ( 1nk     −1nk    −1nk    · · ·  −1nk    )

In both cases the X matrix has full rank, k.
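In R the two parameterizations correspond to the built-in contrast functions contr.treatment (level one zero / corner point) and contr.sum (sum to zero); a short sketch with a hypothetical factor of k = 3 levels:

# Columns of the design matrix contributed by a k = 3 level factor
# under the two constraint systems (lm() adds the intercept column itself).
contr.treatment(3)   # level one zero (corner point) coding
contr.sum(3)         # sum to zero coding

Sum to zero constraints can be requested in the same way as the corner point constraints shown earlier, by supplying "contr.sum" as the first (unordered) entry of the options(contrasts=...) setting.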

2.4.7 Factors in the tractor data

Let us look in detail at the tractor data again to illustrate the difference between the commands

> tractor.linear.lm<-lm(Maint ~ Age, data = tractor.data)

and

> tractor.linear.factor.lm<-lm(Maint ~ factor(Age), data = tractor.data)

Figure 2.8 shows the tractor data with a line representing the overall mean. The first command
(non-factor version) forces the age of the tractor to be a covariate where the value of age is
meaningful (not just a label). In this command, X takes the form
    X = ( 1   1   1 1 1 4 4 4 1   1   1   1 1 1 1   1 1 )^T  ← first row: 1's
        ( 0.5 0.5 1 1 1 4 4 4 4.5 4.5 4.5 5 5 5 5.5 6 6 )    ← second row: Age

that is, a column of ones and a column containing the 17 tractor ages,

yielding β̂ = (X^T X)^{-1} X^T y = (323.6, 131.7)^T. The parameter estimates are confirmed by:

> tractor.linear.lm
Coefficients:
(Intercept) Age
323.6 131.7

So the output says that the maintenance cost goes up by 131.70 for every extra year of age.
Figure 2.8: A plot of the tractor data, with the overall mean indicated by the horizontal line

The second command forces Age to be a factor variable. R then uses the constraint to estimate
the coefficients of the dummy variables. Using the corner point constraint, the X matrix is:

         ( 1 0 0 0 0 0 0 )
         ( 1 0 0 0 0 0 0 )
         ( 1 1 0 0 0 0 0 )
         ( 1 1 0 0 0 0 0 )
         ( 1 1 0 0 0 0 0 )
         ( 1 0 1 0 0 0 0 )
         ( 1 0 1 0 0 0 0 )
         ( 1 0 1 0 0 0 0 )
    X =  ( 1 0 0 1 0 0 0 )
         ( 1 0 0 1 0 0 0 )
         ( 1 0 0 1 0 0 0 )
         ( 1 0 0 0 1 0 0 )
         ( 1 0 0 0 1 0 0 )
         ( 1 0 0 0 1 0 0 )
         ( 1 0 0 0 0 1 0 )
         ( 1 0 0 0 0 0 1 )
         ( 1 0 0 0 0 0 1 )

yielding β̂ = (X^T X)^{-1} X^T y = (172.5, 491.8, 460.5, 727.8, 1029.5, 814.5, 896.0)^T. The R output is

> tractor.linear.factor.lm<-lm(Maint ~ factor(Age))
> summary(tractor.linear.factor.lm)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 172.5 185.0 0.932 0.37316
factor(Age)1 491.8 238.9 2.059 0.06651 .
factor(Age)4 460.5 238.9 1.928 0.08274 .
factor(Age)4.5 727.8 238.9 3.047 0.01232 *
factor(Age)5 1029.5 238.9 4.310 0.00154 **
factor(Age)5.5 814.5 320.5 2.541 0.02929 *
factor(Age)6 896.0 261.7 3.424 0.00650 **

confirming the direct matrix calculations. The output means for example that the mean of the
observations with an Age of 4 is 172.5+460.5=633. This can be verified in R using:

> mean(tractor.data$Maint[tractor.data$Age==4])
[1] 633

So the parameter estimate for class i is the deviation of that class mean from the mean of class
1, where class 1 represents an age of 0.5 years.
This is obtained directly using the design matrix via the R commands:

> x=cbind(rep(1,17),
c(0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0),c(0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0),
c(0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0),c(0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0),
c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0),c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1))

> y=tractor.data$Maint
> solve(t(x)%*%x)%*%t(x)%*%y
[1,] 172.5000
[2,] 491.8333
[3,] 460.5000
[4,] 727.8333
[5,] 1029.5000
[6,] 814.5000
[7,] 896.0000

It is also possible to exclude certain levels from the analysis. For example, if we wanted to exclude
Ages 5.5 and 6, we could do this using the exclude argument of the factor command.

> a.lm<-lm(Maint~factor(Age,exclude=c(5.5,6)),contrast="contr.treatment",
+ data=tractor.data)
> summary(a.lm)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 172.5 166.6 1.036 0.32736
factor(Age, exclude = c(5.5, 6))1 491.8 215.0 2.287 0.04798 *
factor(Age, exclude = c(5.5, 6))4 460.5 215.0 2.142 0.06085 .
factor(Age, exclude = c(5.5, 6))4.5 727.8 215.0 3.385 0.00806 **
factor(Age, exclude = c(5.5, 6))5 1029.5 215.0 4.788 0.00099 ***

2.4.8 More R manipulation of the tractor data

We now return to the analysis treating age as a factor which is reproduced below. The R output
is

> tractor.linear.factor.lm<-lm(Maint ~ factor(Age))
> summary(tractor.linear.factor.lm)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 172.5 185.0 0.932 0.37316
factor(Age)1 491.8 238.9 2.059 0.06651 .
factor(Age)4 460.5 238.9 1.928 0.08274 .
factor(Age)4.5 727.8 238.9 3.047 0.01232 *
factor(Age)5 1029.5 238.9 4.310 0.00154 **
factor(Age)5.5 814.5 320.5 2.541 0.02929 *
factor(Age)6 896.0 261.7 3.424 0.00650 **

Remember that, using corner point constraints, the intercept is the mean of the observations
with an age of 0.5, and the other parameter estimates are deviations of the other group means
from this group mean. Ages 1 and 4 have similar deviations, and ages 4.5, 5, 5.5 and 6 also have
similar deviations. It looks like there are three distinct groups: age 0.5; ages 1 and 4; and ages
4.5, 5, 5.5 and 6. So it would seem sensible to merge those ages with similar means. One way to
do this is to create a new data frame in which the ages are recoded. Since the group labels are
arbitrary we can choose them as we like; we just need to make sure that observations in the same
group receive the same label. We can recode the tractor data with the following commands

> tractor.data.merged=tractor.data
> tractor.data.merged[9:17,1]=0
> tractor.data.merged[3:8,1]=1

The existing and modified data frames are:

> tractor.data > tractor.data.merged


Age Maint Age Maint
10 0.5 182 10 0.5 182
17 0.5 163 17 0.5 163
13 1.0 978 13 1.0 978
14 1.0 466 14 1.0 466
15 1.0 549 15 1.0 549
4 4.0 495 4 1.0 495
5 4.0 723 5 1.0 723
6 4.0 681 6 1.0 681
1 4.5 619 1 0.0 619
2 4.5 1049 2 0.0 1049
3 4.5 1033 3 0.0 1033
7 5.0 890 7 0.0 890
8 5.0 1522 8 0.0 1522
16 5.0 1194 16 0.0 1194
9 5.5 987 9 0.0 987
11 6.0 764 11 0.0 764
12 6.0 1373 12 0.0 1373

> a.lm<-lm(Maint~factor(Age),contrast="contr.treatment",data=tractor.data.merged)
> summary(a.lm)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1047.89 81.03 12.932 3.56e-09 ***
factor(Age)0.5 -875.39 190.04 -4.606 0.000407 ***
factor(Age)1 -399.22 128.12 -3.116 0.007589 **

So it looks like, if we treat Age as a factor, there are three distinct groups: age 0.5; ages 1 and 4;
and the remaining ages taken together. The advantage of merging levels is that it increases the
group sizes and can reduce the standard errors of the parameter estimates (which measure our
uncertainty about the parameter estimates).
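As a quick check on these estimates, the three group means under the merged coding can be
computed directly; a small sketch (using the merged data frame created above):

> # mean maintenance cost for each merged level of Age
> # (0 = ages 4.5 to 6, 0.5 = age 0.5, 1 = ages 1 and 4)
> tapply(tractor.data.merged$Maint, tractor.data.merged$Age, mean)

These means should reproduce the estimates above: 1047.89 for the baseline group,
1047.89 − 875.39 = 172.5 for age 0.5, and 1047.89 − 399.22 ≈ 648.67 for the merged ages 1 and 4.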

Whether to treat explanatory variables as factors or covariates depends on whether it makes
sense in the context of the problem posed. If you treat the variable as a covariate, you are
imposing a strict numerical relationship between the values that the variable takes. If this doesn't
necessarily make sense, then make the variable into a factor. Obviously in the tractor data,
it makes sense to treat age as a covariate (since age can be considered a continuous variable)
rather than a factor and we forced age to be a factor just to illustrate how factors work in R.
The use of factors requires a little thought about the nature of the problem and the variables
that you are working with.

2.4.9 Two factor variables - the crop data

Consider the observations y_ij (i = 1, 2, . . . , m, and j = 1, 2, . . . , n) arranged in a two-way table
in which the rows and columns represent two different classifications, blocks (with subscript i)
and treatments (with subscript j). Suppose that the observations have true variance σ².
We assume the effects of using a particular treatment in a particular block are to contribute
fixed amounts to an observation, i.e. the following additive model

y_ij = µ + α_i + β_j + ε_ij        (2.4)

holds for an observation, where µ is the overall mean, α_i is due to the ith block, β_j is due to
the jth treatment and ε_ij is the random error. The assumptions about the errors are the same
as in the one-way case.

A grower was interested in the effect of three organic fertilizers (A,B and C) on the yield of
blackcurrants. The soil was quite different in different parts of the growing space; indeed there
appeared to be 2 distinct soil types (clay and loam). It was quite probable that the two different
soil types had different levels of natural fertility, though the grower wasn’t sure which soil type
was the most fertile. In order to take such differences into account, the grower decided to divide
the area into two parts based on the soil type and to randomly allocate each fertilizer to 3 bushes
within each part of the growing space. The resulting yields (in lbs.) from the plots are given
below:

Fertilizer
Soil A B C
clay 7.9, 7.6, 6.3 16.1, 17.7, 19.2 16.6, 15.7, 15.2
loam 13.1, 13.3, 15.1 21.1, 24.3, 22.9 26.3, 28.5, 26.9

How do fertilizer and soil type affect yield?
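The R analyses that follow use a data frame twofactor.data with columns soil, fertiliser and
yield. As a guide, a minimal sketch of how such a data frame could be entered by hand from the
table above is given below; the column names match those used in the later commands, but the
row order is an assumption (chosen so that, for example, rows 13-15 are the loam bushes given
fertilizer B).

> yield <- c(7.9, 7.6, 6.3, 16.1, 17.7, 19.2, 16.6, 15.7, 15.2,    # clay: A, B, C
+            13.1, 13.3, 15.1, 21.1, 24.3, 22.9, 26.3, 28.5, 26.9) # loam: A, B, C
> soil <- rep(c("clay", "loam"), each = 9)
> fertiliser <- rep(rep(c("A", "B", "C"), each = 3), times = 2)
> twofactor.data <- data.frame(soil, fertiliser, yield)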

2.4.10 Interactions

In addressing the above question we may be interested in interactions between soil type and
fertilizer. By this we mean: do the yields obtained with the three fertilizers on clay soil differ
(non-additively) from those obtained on loam soil? R has a built-in command interaction.plot
that provides plots that help us to address this question. We can plot the yield against soil type
for each different fertilizer using

> interaction.plot(soil,fertiliser,yield)


Figure 2.9: Interaction plot of crop data, by fertilizer

which produces Figure 2.9. The plot shows that the difference in yield between clay and loam
soil for fertilizer C looks much bigger than the difference in yield for fertilizers A and B. So there
appears to be (visual) evidence of an interaction. We can change the order of the variables in
the interaction.plot command to look at how yield varies against fertilizer type by soil type
using

> interaction.plot(fertiliser,soil,yield)

which produces Figure 2.10. Again we see an additive relationship between yield and soil type
for fertilizers A and B, but not for fertilizer C. These visual plots help assess interactions but
we can use formal statistical tests (F-tests) on nested models (that we encountered in Chapter
2) to assess the statistical evidence for interactions. Let’s start fitting some linear models to
the data. A first model would be the minimal model in which there is only an intercept. The
statistical model is y_ij = µ + ε_ij, where i represents soil type and j represents fertilizer type.

> model0.lm<-lm(yield~1,data=twofactor.data)
> summary(model0.lm)
Residuals:
Min 1Q Median 3Q Max
-11.30 -3.85 -1.25 4.85 11.70

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.600 1.626 10.82 4.81e-09 ***
---
Residual standard error: 6.9 on 17 degrees of freedom

The next step would be to assess the change in yield by fertilizer and soil separately. We start
with fertilizer; the statistical model is y_ij = µ + β_j + ε_ij.

> model1.lm<-lm(yield~factor(fertiliser),data=twofactor.data)


Figure 2.10: Interaction plot of crop data, by soil type

> summary(model1.lm)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.550 1.978 5.333 8.36e-05 ***
factor(fertiliser)B 9.667 2.797 3.456 0.003532 **
factor(fertiliser)C 11.483 2.797 4.105 0.000937 ***

Residual standard error: 4.845 on 15 degrees of freedom


Multiple R-squared: 0.5649, Adjusted R-squared: 0.5069
F-statistic: 9.738 on 2 and 15 DF, p-value: 0.001947

> anova(model0.lm,model1.lm)
Analysis of Variance Table

Model 1: yield ~ 1
Model 2: yield ~ factor(fertiliser)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 17 809.38
2 15 352.16 2 457.22 9.7376 0.001947 **

So the model uses the corner point constraint, in which the intercept is the mean yield for fertilizer
A, the coefficient for fertilizer B is the difference between the mean yields of fertilizers B and A,
and the coefficient for fertilizer C is the difference between the mean yields of fertilizers C and A.
The F-test here has null hypothesis β_2 = β_3 = 0 (since β_1 = 0 by the corner point constraint).
The p-value of 0.002 shows that there is clear evidence to reject the null, and so yield appears to
vary by type of fertilizer. So fertilizer is needed; what about soil type? The statistical model is
y_ij = µ + γ_i + ε_ij

> model2.lm<-lm(yield~factor(soil),data=twofactor.data)
> anova(model0.lm,model2.lm)
Analysis of Variance Table

Model 1: yield ~ 1
Model 2: yield ~ factor(soil)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 17 809.38
2 16 519.78 1 289.6 8.9146 0.008735 **

We observe clear evidence that yield varies by soil type: the null hypothesis that γ2 = 0 is clearly
rejected (γ1 = 0 by the corner point constraint). If we want to fit both soil and fertilizer type
we can use the update command.

> model3.lm<-update(model2.lm, .~.+factor(fertiliser))

This is the same as

> model3.lm<-lm(yield~factor(soil)+factor(fertiliser),data=twofactor.data)

Model 3 is y_ij = µ + γ_i + β_j + ε_ij. To see whether model 3 is an improvement over model 1,
we use

> anova(model1.lm,model3.lm)
Analysis of Variance Table

Model 1: yield ~ factor(fertiliser)


Model 2: yield ~ factor(soil) + factor(fertiliser)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 15 352.16
2 14 62.55 1 289.6 64.814 1.272e-06 ***
---

Model 1 is y_ij = µ + β_j + ε_ij and Model 3 is y_ij = µ + γ_i + β_j + ε_ij. The null hypothesis
that γ_2 = 0 is convincingly rejected. To see whether model 3 is an improvement over model 2,
we use

> anova(model2.lm,model3.lm)
Analysis of Variance Table

Model 1: yield ~ factor(soil)


Model 2: yield ~ factor(soil) + factor(fertiliser)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 16 519.78
2 14 62.55 2 457.22 51.164 3.657e-07 ***
---

So model 3 is an improvement over model 1 and model 2: both fertilizer and soil are needed in
the linear model. We can now add interaction terms

> model4.lm<-lm(yield~factor(soil)*factor(fertiliser),data=twofactor.data)

The * between two factor variables fits the explanatory variables and interaction terms between
all the levels of the factor variables. So the same model could be fitted using

> model4.lm<-lm(yield~factor(soil)+factor(fertiliser)+factor(soil):factor(fertiliser),
data=twofactor.data)

The nested F-test yields

> anova(model3.lm,model4.lm)
Analysis of Variance Table

Model 1: yield ~ factor(soil) + factor(fertiliser)


Model 2: yield ~ factor(soil) + factor(fertiliser) + factor(soil):factor(fertiliser)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 14 62.554
2 12 17.820 2 44.734 15.062 0.0005344 ***

So there is strong evidence to reject the null hypothesis that the coefficients of the interaction
terms are zero. (Note that if * is used between two covariates it simply acts as the multiplication
symbol; the product of the two variables is the usual form of interaction considered for covariates.)

> summary(model4.lm)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.2667 0.7036 10.328 2.52e-07 ***
factor(soil)loam 6.5667 0.9950 6.600 2.54e-05 ***
factor(fertiliser)B 10.4000 0.9950 10.452 2.22e-07 ***
factor(fertiliser)C 8.5667 0.9950 8.610 1.76e-06 ***
factor(soil)loam:factor(fertiliser)B -1.4667 1.4071 -1.042 0.31781
factor(soil)loam:factor(fertiliser)C 5.8333 1.4071 4.146 0.00136 **

So the final model is

yield_i = 7.3 + 6.6 I(L)_i + 10.4 I(B)_i + 8.6 I(C)_i − 1.5 I(L, B)_i + 5.8 I(L, C)_i + ε_i

where I(B)_i is an indicator variable taking the value 1 if the fertilizer given to bush i was B and is
zero otherwise. The other single-factor indicator variables have similar meanings. The indicator
variable I(L, C)_i takes the value 1 if bush i was on loam soil and was given fertilizer C, and is
zero otherwise (both conditions have to be true). We can check these parameter values
by considering means of the appropriate subsets of yield. For example the mean yield of the
bushes growing in loam soil and given fertilizer B can be calculated from the data using

> mean(twofactor.data$yield[13:15])
[1] 22.76667

and can be verified from the output of summary(model4.lm) as 7.267 + 6.567 + 10.400 − 1.467 = 22.767,
in agreement with the value calculated directly from the data. Of course we would need to check
plots of the residuals before deciding that the fit of the model is good. Although the R² of 97.8%
suggests a very good fit, we should still check for outliers, influential observations, departures
from normality, heteroscedasticity, etc.
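The same check can be made directly from the fitted model object; a brief sketch (assuming, as
in the data frame layout used above, that rows 13-15 are the loam bushes given fertilizer B):

> fitted(model4.lm)[13:15]   # fitted values for the loam, fertilizer B bushes

Each of these fitted values should equal the group mean of 22.767, because with the full interaction
model the fitted value for each soil-fertilizer combination is simply that combination's sample mean.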

2.5 Variable selection

2.5.1 Variable selection in the corrosion data

Consider the corrosion data (see Section 2.4.2) and suppose that the most complex model we
want to consider is one in which the relationship between weight and time is quadratic and is a

different quadratic for each of the two original pieces.
What we want to know is whether we can remove some of the regressor variables. The more
complex model is
w_i = α_0 + α_1 t_i + α_2 t_i² + (β_0 + β_1 t_i + β_2 t_i²) p_i + ε_i
where wi is the weight of the i-th segment (and now we are using just a single index, so that i
ranges from 1 to 16), pi = 1 if the i-th segment is from piece 2, otherwise pi = 0, and ti is the
time when the i-th segment was weighed.
Notice that in this model we have included a constant term. We are now treating the piece
number as a factor, including a constant term and removing the resulting over-parametrization
by imposing a level one zero constraint. When we fit this model with R, the following is a
selection from the output.

> a <- lm(Weight~factor(Piece)*Time+ I(Time^2)*factor(Piece),data=corrosion.data)

> summary(a)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.4826786 0.9097389 10.424 1.09e-06 ***
factor(Piece)1 4.0419669 1.2767269 3.166 0.0101 *
Time -0.0295387 0.0386519 -0.764 0.4624
I(Time^2) 0.0007792 0.0003494 2.230 0.0498 *
factor(Piece)1:Time 0.0144004 0.0534689 0.269 0.7932
factor(Piece)1:I(Time^2) -0.0001728 0.0004728 -0.365 0.7224

Residual standard error: 0.6521 on 10 degrees of freedom


Multiple R-squared: 0.9647, Adjusted R-squared: 0.9471
F-statistic: 54.73 on 5 and 10 DF, p-value: 6.109e-07

> anova(a)
Analysis of Variance Table

Response: Weight
Df Sum Sq Mean Sq F value Pr(>F)
factor(Piece) 1 74.866 74.866 176.0711 1.129e-07 ***
Time 1 37.831 37.831 88.9718 2.708e-06 ***
I(Time^2) 1 3.533 3.533 8.3089 0.01631 *
factor(Piece):Time 1 0.070 0.070 0.1637 0.69433
factor(Piece):I(Time^2) 1 0.057 0.057 0.1335 0.72243
Residuals 10 4.252 0.425

The sequential sums of squares are the excess sums of squares for each variable, given any
previously entered variables. The anova command output shows us that neither of the two
interaction terms is needed (their excess sums of squares are both very small). The p-value of
0.01631 for Time² tells us that we may need this term in our model (its excess sum of squares is
relatively large compared to the residual sum of squares). We cannot properly test whether we
need Time in the model, because this test assumes that the parameter for Time² is zero, which is
questionable. What is clear is that these tests depend on the order in which the variables enter
the ANOVA table (unless the study design takes a particular orthogonal form as discussed in
Chapter 2). If we knew how to order the terms in the model, so that those that are not needed
appear at the end of the table, a single analysis would show us what the final model should be.
However, this is usually not possible. ANOVA tables are thus of limited use.
We could use the t-statistics from the summary command as a model selection tool. The t-
statistics test the null hypothesis that each coefficient is zero, given that all other variables are
present in the model. Based on the output from the summary command, we could remove the
regressor variable with the smallest t-value (in absolute value). In the corrosion data model above,
we would remove the Piece:Time interaction, since its t-value is the smallest in absolute value
and its p-value is not below the commonly used threshold of 0.05. However, the removal of a
term from a model will usually change the contribution of the remaining terms. Therefore we
should not automatically move on to the term with the next smallest |t| and, if its p-value is
above the threshold of 0.05, exclude it from the model as well. A better procedure is to refit the
model after removing a term, because the remaining parameter estimates will change when the
smaller model is fitted to the data. These considerations have led to the development of a class
of numerical procedures for model selection. We illustrate the use of these methods on the
corrosion data in section 2.5.7.
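As an illustration of this refitting idea, a single step can be carried out by hand with update
before turning to the automated procedures; a sketch (using the fitted object a from above):

> # drop the Piece:Time interaction (the term with the smallest |t|) and refit;
> # the remaining estimates and their standard errors will change
> a2 <- update(a, . ~ . - factor(Piece):Time)
> summary(a2)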

2.5.2 Penalized likelihood

We have already identified the residual sum of squares Sr as a measure of fit, in the sense that
smaller Sr implies a better fitting model.
This is closely related to the idea of maximizing the likelihood. We saw when considering
Box-Cox transformations that the maximized likelihood is

L(β̂, σ̂², y) ∝ (S_r/n)^(−n/2)

where n is the number of observations in the study. A model which achieves a smaller Sr
therefore achieves a higher likelihood, which is another reason for thinking of Sr as a tool for
model choice.
However Sr will always decrease when we add more regressor variables, so the best fitting model
is always the full model which contains all the possible regressor variables - so Sr is not a good
model selection tool. For the same reason, the coefficient of determination R2 is not a useful
measure in model selection. R2 will not decrease as the number of parameters increases (i.e. it is
a non-decreasing function of the number of parameters in the model). We need some method of
model selection that takes account of the number of parameters used in the model. The adjusted
R2 is more useful but there are others based on the likelihood that have other advantages.
Several methods have been proposed for penalizing models for having more regressor variables.
The simplest way to compare these methods is first to work with minus twice the log likelihood,
which is n log Sr . On this scale, a smaller value denotes a better-fitting model. Then we add a
penalty, which is a function of p (or of p − 1, since we generally do not count the constant term
if it is in the model), so that a more complex model is not necessarily better. Therefore, for
some function z(·), the general criterion is

C = −2 log L(β̂, σ̂², y) + z(p) = n log(S_r/n) + z(p)        (2.5)

after omitting a constant. Then we declare that the optimal model is that which minimizes C.
We can think of z as an ad hoc adjustment that tries to give simpler models credit for having
fewer regressor variables.

2.5.3 Akaike’s information criterion

The most famous of these penalized likelihood methods is Akaike’s Information Criterion, AIC.
It uses the penalty function z(p) = 2p. AIC was derived using an argument based on information
theory that need not concern us in this course. It is advocated and used in all problems of model
choice in statistics, i.e. not just for variable selection in linear models.

2.5.4 The Schwarz criterion/Bayesian information criterion

An alternative, z(p) = p log n, is based on interpreting the penalty as deriving from prior informa-
tion. It is called the Schwarz criterion or the Bayesian Information Criterion, BIC. Like AIC,
the BIC can be used for all problems of model choice.
The two methods can produce quite different results when n is large, since the BIC then applies
a much larger penalty to complex models, and hence leads to simpler models than the AIC.
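In R, a BIC-style search can be obtained from the step function described in section 2.5.7 simply
by changing the penalty multiplier k from its default of 2 to log n; a sketch for the corrosion data
(assuming the same starting model and scope used there):

> n <- nrow(corrosion.data)
> a <- lm(Weight~1, data=corrosion.data)
> step(a, scope=list(upper=Weight~factor(Piece)*Time + I(Time^2)*factor(Piece)),
+      direction="forward", k=log(n))   # k=log(n) gives the BIC penalty; the default k=2 gives AIC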

2.5.5 What pairs of models can we compare by testing?

In any practical problem, we have a (typically large) set of possible regressor variables that we
might consider including in the model. For instance, if we have measured 4 different continuous
explanatory variables x1 , x2 , x3 , x4 , we might consider polynomial terms in each of these, together
with interactions (i.e. products). If we allow terms up to the third order, such as x_1, x_1 x_3 and
x_1² x_3, there are already 35 possible terms in the model. Practical applications of linear models (in
human genetics for example) often have hundreds or thousands of possible regressor variables
(columns in the X matrix).
The key question is: which subset of these regressor variables should we include in the model?
This is usually known as the problem of variable selection.
We have so far been considering the approach of testing hypotheses. That is, we propose a
model with some or all of the regressor variables in it, and then test whether we can exclude
one or more of those variables. This is effectively a comparison between two subsets of regressor
variables, and the hypothesis test is a formal way of deciding which model is better, in some
sense. This approach does not retain regressor variables unless the data suggest that they are
needed.
It is important to recognize that hypothesis testing (based on F-tests) cannot compare any
arbitrary pair of models. It is only possible to compare nested models, that is models in which
the regressor variables in one model are a subset of those in the other. As a reminder, the models
M0 through M3 below are examples of nested models. This point is important because we often
wish to compare non-nested models.

M0 : E(y) = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4
M1 : E(y) = β0 + β2 x2 + β3 x3
M2 : E(y) = β0 + β2 x2
M3 : E(y) = β0

2.5.6 Non-nested models

The advantage of using AIC or BIC is that they can be used to compare non-nested models.
If there are two competing models then the one with the lowest AIC/BIC is selected. It is
important to note that AIC/BIC cannot be used to compare models applied to different data
sets. This is because it is based on the likelihood, which depends on the values of the response
variable. Different data sets will have different values of the response variable so the likelihood
will be different in both data sets. So if we use AIC/BIC on different data sets we will have
different likelihoods to penalize. We won’t be able to tell whether the lower AIC/BIC is due to
the model really being better or whether the likelihood is just a lot lower in one of the data sets
because of the particular sample of response values.

2.5.7 Stepwise automated selection in R

Automated variable selection methods attempt to choose the “best” model by iteratively adding
in and removing regressor variables based on some measure of fit. If they are used without care
they can lead to poor model choice. Once a final model has been reached, the usual checking
of model assumptions should be performed and if these assumptions are not met, you should
investigate why and take appropriate action (consider transformations, for example). In this
section we only discuss a popular automated method known as stepwise selection, which is fully
implemented in R.
There are three approaches to stepwise automated selection: forward selection, backwards elim-
ination and full stepwise selection. Forward selection starts with the smallest model and adds
variables in. Backwards elimination starts with the largest model and removes variables. Full
stepwise selection is a hybrid of these two methods, both adding and removing variables to see
which models provide the best fit.

In R all three stepwise selection methods can be carried out with the function step (the closely
related function stepAIC in the MASS library, loaded using library(MASS), behaves very similarly).
The command step performs variable selection based on the AIC. Which method is used is
controlled by the direction argument within the step command. It is also possible to tell R what the smallest model and largest
models considered should be. This could be useful if, for example, there is a particular variable
that you must include in the model (common when adjusting for age or sex in epidemiological
studies) or if you do not want to consider particular interactions or combinations of variables.
The smallest and largest models considered are controlled by the scope argument. Finally the
trace argument controls the amount of output that is printed to the screen. The larger the
number, the more output is printed. You’ll need to experiment with this to see how high to go.
This is particularly pertinent when there are a large number of variables as the output will soon
grow into an unmanageable length. In the following three sections we illustrate the three stepwise
procedures by applying them to the corrosion data of section 2.4.2.

Forward selection

Forward selection uses the following algorithm:

1. Start with the minimal model; this is usually the model containing just the constant term
(if required), otherwise nothing. This is the baseline model.

2. Consider adding a single variable to the current baseline model and identify the model
with the lowest AIC. If the AIC for a new model is smaller than the baseline model then
choose the new model (we have added a variable to the baseline model). This new model
becomes the baseline model. Go to step 4

3. If a new variable isn’t added in step 2 stop the search and exit the algorithm.

4. Go back to step 2, and continue applying step 2 until no more terms can be added or you
have reached a model that can’t be improved on.

We now apply it to the corrosion data. The minimal model here contains just a constant term
and R adds in the 3 explanatory variables separately.

> a <- lm(Weight~1,data=corrosion.data)


> step(a,scope=list(upper=Weight~factor(Piece)*Time + I(Time^2)*factor(Piece)),
+ direction="forward")
Start: AIC=34.32

Weight ~ 1
Df Sum of Sq RSS AIC
+ factor(Piece) 1 74.866 45.742 20.807
+ I(Time^2) 1 45.796 74.812 28.678
+ Time 1 40.662 79.946 29.740
<none> 120.608 34.319

The AIC for the minimal model is 34.32; adding Piece reduces the AIC to 20.807, which is an
improvement in fit over the intercept-only model. So Piece is added to the linear predictor. R
then tries adding in the other 2 explanatory variables.

Step: AIC=20.81
Weight ~ factor(Piece)
Df Sum of Sq RSS AIC
+ I(Time^2) 1 41.106 4.637 -13.818
+ Time 1 37.831 7.911 -5.269
<none> 45.742 20.807

Adding Time² gives the lowest AIC (-13.818), which is lower than that of the Weight~factor(Piece)
model (20.81), so Time² is added to the linear predictor.

Step: AIC=-13.82
Weight ~ factor(Piece) + I(Time^2)
Df Sum of Sq RSS AIC
<none> 4.6365 -13.8180
+ Time 1 0.2581 4.3784 -12.7346
+ factor(Piece):I(Time^2) 1 0.0648 4.5717 -12.0431

R tries to add Time. The AIC is not lower than that of the baseline model. Now that there are two
explanatory variables in the linear predictor, R tries adding interaction terms (only interactions
between terms already included in the linear predictor are allowed). Adding the Piece:Time²
interaction doesn't reduce the AIC. So the model selected by forward selection is:

Call:
lm(formula = Weight ~ factor(Piece) + I(Time^2))
Coefficients:
(Intercept) factor(Piece)1 I(Time^2)
8.9256397 4.1921978 0.0004965

Backward selection

The procedure for backwards elimination follows the same process of trying to minimize the
AIC but starts from the biggest model (which is user-specified) and removes variables. The
algorithm is:

1. Fit the full model (which is user-specified). This is the current baseline model.

2. Remove one variable at a time from the baseline model. If the AIC for any new model is
smaller than the baseline this becomes the new baseline model. We have removed a term
from the previous model and so go to step 4.

3. If no variables were removed in step 2 then stop and exit the algorithm.

4. Go back to step 2, and continue applying step 2 until no more terms can be deleted or you
have reached a model that can’t be improved on.

Using backward selection from the full model (which includes interactions):

> a <- lm(Weight~factor(Piece)*Time + I(Time^2)*factor(Piece),data=corrosion.data)


> step(a,scope=list(lower=Weight~1),direction="backward")
Start: AIC=-9.2
Weight ~ factor(Piece) * Time + I(Time^2) * factor(Piece)

Df Sum of Sq RSS AIC


- factor(Piece):Time 1 0.0308 4.2829 -11.0875
- factor(Piece):I(Time^2) 1 0.0568 4.3088 -10.9909
<none> 4.2520 -9.2031

The starting model has an AIC of -9.2 and we try removing the two interaction terms; removing
the Piece:Time interaction results in the lowest AIC of -11.0875, so this term gets removed. Note
that R doesn't try to remove the Time, Piece or Time² variables, because this could lead to
interaction terms being present without the corresponding main effects being in the model.

Step: AIC=-11.09
Weight ~ factor(Piece) + Time + I(Time^2) + factor(Piece):I(Time^2)
Df Sum of Sq RSS AIC
- factor(Piece):I(Time^2) 1 0.0955 4.3784 -12.7346
- Time 1 0.2889 4.5717 -12.0431
<none> 4.2829 -11.0875

The baseline model has an AIC of -11.09 and removing the Piece:Time² interaction results in
the lowest AIC of -12.7346, so this interaction is removed. Why does R not try to remove Piece
at this stage?

Step: AIC=-12.73
Weight ~ factor(Piece) + Time + I(Time^2)
Df Sum of Sq RSS AIC
- Time 1 0.258 4.637 -13.818
<none> 4.378 -12.735
- I(Time^2) 1 3.533 7.911 -5.269
- factor(Piece) 1 69.296 73.674 30.433

Time is then removed as its removal further reduces the AIC (to -13.818).

Step: AIC=-13.82
Weight ~ factor(Piece) + I(Time^2)
Df Sum of Sq RSS AIC
<none> 4.637 -13.818
- I(Time^2) 1 41.106 45.742 20.807
- factor(Piece) 1 70.175 74.812 28.678

Call:
lm(formula = Weight ~ factor(Piece) + I(Time^2), data = corrosion.data)

Coefficients:
(Intercept) factor(Piece)1 I(Time^2)
8.9256397 4.1921978 0.0004965

No more terms are removed as they don’t reduce the AIC. An important point is that there is
no good reason to expect backward elimination and forward selection to lead to the same model.
This is particularly to be expected when there are very many possible explanatory variables and
interactions to be considered.

Full stepwise selection

The natural way to elaborate these two approaches is to mix them together. A full stepwise
approach will typically alternate the two methods. The algorithm is:

1. Start with a current baseline model

2. Consider adding a single variable to the current baseline model. Also consider removing
all allowed terms from the current baseline model (taking into account the need for main
effects to be included whenever an interaction involving them is). Identify the candidate
model with the lowest AIC. If this AIC is smaller than that of the current baseline model
then choose that model (we have added or removed a variable/term). This new model
becomes the baseline model. Go to step 4

3. If a new variable isn’t added or a variable removed in step 2 stop the search and exit the
algorithm.

4. Go back to step 2, and continue applying step 2 until no more terms can be added or
removed or you have reached a model that can’t be improved on.

This more elaborate method could produce yet another answer, different from the backwards
and forwards methods. Here you could start from the minimal model or the full model. We
illustrate the process of full stepwise selection starting from the minimal model.

> a <- lm(Weight~1,data=corrosion.data)


> step(a,scope=list(upper=Weight~factor(Piece)*Time + I(Time^2)*factor(Piece)),
+ direction="both")
Start: AIC=34.32
Weight ~ 1
Df Sum of Sq RSS AIC
+ factor(Piece) 1 74.866 45.742 20.807
+ I(Time^2) 1 45.796 74.812 28.678
+ Time 1 40.662 79.946 29.740
<none> 120.608 34.319

Try adding in the 3 explanatory variables: Piece is added as it reduces the AIC from 34.32 to
20.807

Step: AIC=20.81
Weight ~ factor(Piece)
Df Sum of Sq RSS AIC
+ I(Time^2) 1 41.106 4.637 -13.818
+ Time 1 37.831 7.911 -5.269
<none> 45.742 20.807
- factor(Piece) 1 74.866 120.608 34.319

Try adding the other two explanatory variables and removing Piece from the model. The lowest
AIC is obtained with Time² added.

Step: AIC=-13.82
Weight ~ factor(Piece) + I(Time^2)
Df Sum of Sq RSS AIC
<none> 4.637 -13.818
+ Time 1 0.258 4.378 -12.735

+ factor(Piece):I(Time^2) 1 0.065 4.572 -12.043
- I(Time^2) 1 41.106 45.742 20.807
- factor(Piece) 1 70.175 74.812 28.678

Try adding in all available terms (including allowed interactions) and removing terms in the
linear predictor: none of these reduces the AIC so the model doesn’t change at this stage and
the algorithm terminates.

Call:
lm(formula = Weight ~ factor(Piece) + I(Time^2),data=corrosion.data)

Coefficients:
(Intercept) factor(Piece)1 I(Time^2)
8.9256397 4.1921978 0.0004965

So all three methods agree on the final model. This is probably because of the small number of
explanatory variables; it’s less likely with data sets that have many more explanatory variables.

Calculating the AIC for a linear model

Consider the final model selected by full stepwise selection: Weight ~ factor(Piece) + I(Time^2).
The AIC for this model is -13.82; how is this AIC calculated? The AIC used is

AIC = 2p + n log(S_r/n)

where p is the number of parameters (including the constant), n is the number of observations and
S_r is the residual sum of squares. So for this model AIC = 2(3) + 16 × log(4.637/16) = -13.82.
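This value can be checked in R with extractAIC, which computes exactly this criterion (and is
what step uses internally); a quick sketch, assuming the corrosion data frame used above:

> final.lm <- lm(Weight ~ factor(Piece) + I(Time^2), data=corrosion.data)
> extractAIC(final.lm)   # returns the equivalent degrees of freedom (here 3) and n*log(Sr/n) + 2p

This should reproduce the value of about -13.82 shown in the stepwise output. Note that the
AIC function in R includes additional constant terms from the full Normal log-likelihood, so
AIC(final.lm) gives a different number, although the difference is the same constant for every
model fitted to the same data.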

Chapter 3

Generalized Linear Models (GLMs): Basic Facts

3.1 Introduction

This chapter and those following are concerned with the other extension of the basic linear model
considered in this course. We return to considering models only with fixed effects, and now relax
the normality assumption of the basic linear model and the assumption that the expectation of
the response variable is linear in the parameters. The resulting models are called Generalized
Linear Models (GLMs). (They are to be distinguished from general linear models, meaning
standard normal-theory linear models.)
Generalized linear models (GLMs) provide a widely useful extension of the usual Normal linear
model. The extensions allow, for example, data about proportions or data about counts to be
modelled directly as Binomial or Poisson, respectively – there is no need for transformations of
the data or for assuming approximate Normality. In addition, the use of a link function between
the linear predictor and the mean (so the model is non-linear) can ensure that estimated means
are always within bounds (proportions in [0, 1], counts non-negative). A simple device means
that Multinomial data can be regarded as Poisson data, and hence that many models for the
analysis of contingency tables also fall within the GLM framework.
Model testing is carried out in a similar way to that already seen for the standard linear model.
One difference is that parameter estimates usually have to be found by numerical maximization.
Also, relevant distributions of estimators and test statistics are all approximate and many tests
use χ2 distributions.
In this chapter, the general theory of the models and tests is given. Subsequent chapters show
how the models can be used in particular areas – for proportions in Chapter 4, non-negative
data in Chapter 5, and contingency tables in Chapters 6 and 7. It will probably be best to read
this chapter in conjunction with Chapters 4 and 5 rather than try to understand everything here
on a first pass. At various points in this chapter you are referred to related examples in later
chapters. You are strongly advised to look carefully at these examples.

3.1.1 Standard linear model assumptions

The standard Normal-theory linear model is usually expressed as Y ∼ N(Xβ, Iσ²). However,
for developments here, we separate the modelling of the mean from assumptions about the
distribution of Y . To this end, let E(Yi ) = µi and let xi be the vector of explanatory variables
for observation i, i.e. xTi is the ith row of the design matrix X. Then the model can be written
as:

(i-G) E(Y_i) = µ_i = x_i^T β = Σ_j β_j x_ij = β_0 + β_1 x_i1 + . . . , for the ith observation
(thus the mean is linear in β);
(ii-G) var(Y_i) = σ² (the variance is constant);
(iii-G) the Y_i are independent and Normal.

3.1.2 Why we need to extend linear models theory


Figure 3.1: Proportion of beetle deaths by log concentration

If we have independent observations, as in the linear models case, in which situations do we
need to extend the linear model framework?
The following data, from Dobson (1990, Example 8.1, pp. 108–111; 2002, §7.3.1, pp. 119–121),
give the number of beetles exposed (ni ) and killed (di ) after 5 hours exposure to gaseous carbon
disulphide (CS2 ) at various concentrations (xi in log10 CS2 mg/l).

x_i   1.6907  1.7242  1.7552  1.7842  1.8113  1.8369  1.8610  1.8839
n_i       59      60      62      56      63      59      62      60
d_i        6      13      18      28      52      53      61      60

The data is in the ‘beetle’ dataframe on the .RData workspace on MOLE. Figure 3.1 shows
a plot of the proportion of beetles dead by log concentration of CS2 . The number of beetles
used in the experiment at each log dose is fixed and it is reasonable to assume that, at a given
dose, each beetle dies with a certain probability, independently of the other beetles. If we want
to predict the proportion of deaths at a log concentration of 1.77 we need to construct a model
that relates the proportion of deaths to the log concentration. One way to do this would be to
try to fit a linear model. Looking at the plot there is clear curvature and it is not of a simple
form. We may be able to find regressor variables (functions of the log concentration) that enable
us to fit a linear model reasonably well but this seems a rather ad hoc way of doing it. A more
important issue is that the response varies only between 0 and 1, using a linear model we could

not easily ensure this is the case in the model. For example, using a linear model for the beetles
data, the proportion of deaths at a log concentration of 1.1 would almost certainly be negative.
Generalized linear modeling (GLM) offers a flexible way of fitting such models.

We saw in linear models how, in some situations, heteroscedasticity can be dealt with by find-
ing a transformation of the response that approximately stabilizes the variance (by a Box-Cox
method for example). However in some situations generalized linear models are a more natural
choice. Importantly they can have error structures that are not normally distributed (unlike
linear models) and so do not require finding transformations of the response.

Another important example is the case of a binary response variable. In linear models we must
have a response variable that is normally distributed. If we have a binary response then clearly
we can’t use linear model theory. GLM theory copes perfectly well with a binary response and
indeed this is probably the most common application of it (certainly in the medical literature
where it is used extensively for risk modelling in case control studies).

3.1.3 Reminder - inverse functions

Suppose a function f is defined such that f : x ↦ f(x) (i.e. the function maps x to f(x)). Then the
inverse of f, denoted f^(−1), is the function defined such that f^(−1)(f(x)) = f(f^(−1)(x)) = x
(i.e. it is the function that maps f(x) back to x).

Example: Suppose you have the function f(x) = e^(x−1) + 7 and you want to find the inverse func-
tion. The method is to rearrange the equation y = e^(x−1) + 7 to make x the subject of the equation:

y = e^(x−1) + 7
y − 7 = e^(x−1)
ln(y − 7) = x − 1
1 + ln(y − 7) = x

So the inverse function is f^(−1)(x) = 1 + ln(x − 7), and you can check that f^(−1)(f(x)) = 1 +
ln(e^(x−1) + 7 − 7) = x.

3.1.4 Generalized linear model assumptions

For the generalized linear model (GLM) the assumptions above are generalized as follows.

(i-G) An assumption about the mean.


Let ηi = xTi β for some (vector) parameter β; η is called the linear predictor.
Let g be a known monotonic function and let h be its inverse, neither depending on i, and
suppose that the mean µi = E(Yi ) is represented as

µ_i = h(η_i) = h(x_i^T β), and so g(µ_i) = η_i = x_i^T β.
Here g is called the link (function), and h the inverse link. Evidently g links the mean
of the response to the linear predictor and hence the explanatory variables.
(This provides a particular class of means that are non-linear in the parameters – except,
of course, when g is the identity)

(ii-G) An assumption about the variance:

var(Y_i) = (φ / w_i) V(µ_i),
where wi are known weights, the (scalar) parameter φ is called the dispersion param-
eter or scale parameter and the function V is called the variance function. φ will be
different depending on the distribution used for the response Yi . It is known for the Bino-
mial and Poisson cases but has to estimated for the Normal and Gamma cases. However,
because it directly affects the variance of the response Yi , it can, even for the Binomial
and Poisson distributions be allowed to be a free parameter that has to be estimated. This
allows for what is called over-dispersion and is discussed further in section 3.10. Note that
V is just a function of the mean that occurs in the variance; it is not itself the variance of
anything. Different forms of function V are used for different distributions.
Key point: The response is now allowed to have a non-constant variance. The form of
this non-constant variance depends on the distribution assumed for the response. Note
that the variance does not depend on the choice of link function.

(iii-G) Assumptions about the distribution.


The Yi are independent.
The pdf (probability ‘density’ function) for Yi has a special form (explored and explained
in §3.3). The density (with variable y and parameters θi and φ) can be written as:
 
f_i(y : θ_i, φ) = exp{ w_i [yθ_i − b(θ_i)]/φ + c(y, φ) }.        (3.1)

Here ‘density’ is in quotes because some examples are discrete and then ‘density’ should be interpreted
as ‘probability function’.

3.1.5 Example: toxicity trial in beetles

We revisit the beetle data described in section 3.1.2. With the notation defined earlier, the number
of deaths (D_i) at log dose x_i can therefore be modelled as D_i ∼ Bi(n_i, µ_i). It follows that
E(Di ) = ni µi and var(Di ) = ni µi (1 − µi ). Here ni is the number of beetles exposed at a certain
dose and µi is the probability of death. How does the probability of death µi relate to the log
dose xi ? In a GLM we allow a general relationship of the form µi = h(β0 + β1 xi ) where we can
choose the inverse link function (h) that best fits. The choice of h is one of the decisions to be
made when fitting a GLM.
Instead of considering the number of beetles dying we can deal with the proportion of beetles
dying: Y_i = D_i/n_i. Then E(Y_i) = E(D_i)/n_i = µ_i = h(β_0 + β_1 x_i) and var(Y_i) =
(1/n_i²) var(D_i) = µ_i(1 − µ_i)/n_i. If we do this then we have a GLM with:

i) inverse link h

ii) E(Yi ) = µi = h(β0 + β1 xi )

iii) var(Yi ) = µi (1 − µi )/ni

Notice that var(Yi ) is the same as that given in point (ii-G) of section 3.1.4 with φ = 1 and
weights wi = ni .

Key point The variance of Yi is allowed to depend on the mean value µi (which in turn depends
on the linear predictor). This is not allowed in linear models but because we have widened the
choice of probability distributions for Yi (beyond the Normal distribution), they can now have

variances that depend on their expected values in some well-defined way.

Link function
In this example, the mean of the response µi is a probability. To be sure that µi = h(β0 + β1 x)
gives a probability whatever value x takes, the function h must take values in [0, 1] and, to
have an inverse, it must be monotone, so that (assuming, without any loss of generality, it is
increasing) mathematically h is in fact a distribution function. If h(u) = Φ(u), where Φ is the
distribution of the standard Normal, this gives rise to the probit link (and Φ−1 (u) is called the
probit of u); if h(u) = eu /(1 + eu ), then g(u) = log(u/(1 − u)), which is often called logit(u),
leading to the logit link. Note that 0 < h(u) < 1 ∀u ∈ R as required.
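Both of these inverse links are easy to evaluate in R; a small sketch (eta below is just an
illustrative value of the linear predictor):

> eta <- 0.5
> pnorm(eta)                  # probit inverse link, Phi(eta)
> exp(eta)/(1 + exp(eta))     # logit inverse link, e^eta/(1 + e^eta)

Both results lie strictly between 0 and 1, as required for a probability.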

3.2 Fitting a GLM

The R function glm fits generalized linear models. Basic usage is

glm(response variable ~ explanatory variables, family = family.name, data=...)

where family.name is the name of the distribution for the response and data is the data frame
being used. The response data must be appropriate for the distribution being fitted. For a
Gamma model, for example, observations must be non-negative. For a binomial model there
are several ways to fit the model (using proportions with weights or numbers of successes and
failures - see chapter 4 for more details). The formula

response variable ~ explanatory variables

is a model formula of the same type as used by other linear modelling functions such as lm and
lme. The usual formula syntax applies.
The default link function used by glm is the canonical link function for the specified distribution
(see section 3.3.5). A different link would be specified by an argument within the family object:
for example

glm(response ~ explanatory variables, family = family.name(link = link.name), data=...)

Here link.name is the name of the link being used. The glm function produces an object of class
glm (as lm and lme produce objects of class lm and lme respectively) from which information
about the fit may be extracted as usual.
We illustrate these commands by fitting a GLM to the beetle data. The beetles data is a data
frame containing the following:

> beetle
conc number dead propn.dead
1 1.6907 59 6 0.1016949
2 1.7242 60 13 0.2166667
3 1.7552 62 18 0.2903226
4 1.7842 56 28 0.5000000
5 1.8113 63 52 0.8253968
6 1.8369 59 53 0.8983051
7 1.8610 62 61 0.9838710
8 1.8839 60 60 1.0000000

For the beetles data we have a binomial distribution for the response. If we choose to use the
probit link function then we would use the command

glm(propn.dead ~ conc, family = binomial(link=probit), weights=number, data=beetle)

This yields the output

> first.beetle.glm=glm(propn.dead~conc,family=binomial(link=probit),
+ weights=number,data=beetle)
> first.beetle.glm

Call:glm(formula=propn.dead~conc,family= binomial(link=probit),
+ data=beetle,weights=number)

Coefficients:
(Intercept) conc
-34.94 19.73

Degrees of Freedom: 7 Total (i.e. Null); 6 Residual


Null Deviance: 284.2
Residual Deviance: 10.12 AIC: 40.32

Parameter estimates
As for linear models we obtain estimates of the parameters in the linear predictor; β0 + β1 x =
−34.94 + 19.73x. We will talk much more about deviance later but it is used in model building
(deciding which terms go in the linear predictor), a bit like using the anova command for nested
linear models. As for linear models, the summary command gives us a little more information:

> summary(first.beetle.glm)

Call:
glm(formula = propn.dead ~ conc, family = binomial(link = probit),
data = beetle, weights = number)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.5714 -0.4703 0.7501 1.0632 1.3449

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -34.935 2.648 -13.19 <2e-16 ***
conc 19.728 1.487 13.27 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 284.202 on 7 degrees of freedom


Residual deviance: 10.120 on 6 degrees of freedom
AIC: 40.318

Number of Fisher Scoring iterations: 4

Fitted values
For now we shall just look at the fitted values.

> round(first.beetle.glm$fitted,digits=3)
1 2 3 4 5 6 7 8
0.057 0.179 0.379 0.604 0.788 0.904 0.962 0.987


Figure 3.2: Observed (circles) and fitted (crosses) values for the beetle data

Figure 3.2 shows the fitted values along with the observed values. Circles are observed, crosses
are fitted. The fit looks to be reasonably good.

3.2.1 Calculating fitted values

To further clarify what is being fitted, consider the 4th observation. The observed proportion
of dead beetles is 0.5 (look at the data frame shown above). The fitted value is 0.604 to
3 d.p. How is this value obtained? To answer this we need the relationship stated above
E(Y_i) = µ_i = h(β_0 + β_1 x_i). Remember that our response is Y_i = D_i/n_i where D_i ∼ Bin(n_i, µ_i)
and E(Y_i) = µ_i. The fitted value µ̂_i is h(β̂_0 + β̂_1 x_i). Remember that h is the inverse link
function (in this case we used the probit inverse link, h = Φ), so

µ̂_i = h(β̂_0 + β̂_1 x_i) = Φ(β̂_0 + β̂_1 x_i) = Φ(−34.935 + 19.728 x_i).

So

µ̂_4 = Φ(β̂_0 + β̂_1 x_4) = Φ(−34.935 + 19.728 × 1.7842) = Φ(0.2637) = 0.604,
confirming the fitted value that R produces. Note that the Φ function is pnorm in R:

> pnorm(-34.935 + 19.728*1.7842)


[1] 0.6039935
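The same calculation can be carried out for all eight observations at once; a short sketch (using
the fitted object first.beetle.glm and the beetle data frame from above):

> eta.hat <- coef(first.beetle.glm)[1] + coef(first.beetle.glm)[2]*beetle$conc
> round(pnorm(eta.hat), digits=3)   # apply the inverse (probit) link: should match $fitted above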

Key Points:

• We use R to calculate the parameter estimates.

• We use the parameter estimates to calculate the value of the linear predictor for an obser-
vation.

• In linear models this is the expected (fitted) value.

• In a GLM, we apply the inverse link function to the linear predictor to get the fitted value.
So we have an extra stage (applying the inverse link function) in the calculation for the
fitted values in GLMs compared with linear models.

Binomial modeling like this is discussed further in Chapter 4.


Exercise:

• For the beetle mortality data, fit a binomial GLM with the logit link function;

• Using R, find the fitted values;

• Using the logit link, verify the result for the 6th fitted value.

• The complementary log-log link is a link function for a response that is a proportion. The
link function g is g(µi ) = log(−log(1 − µi )) = ηi

• Derive h, the inverse link function

• This link is fitted to the beetle data with the following output

> cloglog<-glm(propn.dead~conc,binomial(cloglog),
+ weights=number,data=beetle)
> cloglog
Coefficients:
(Intercept) conc
-39.57 22.04

• Show that the fitted value for a log concentration of 1.7242 is 0.188

3.3 Distributional properties in GLMs

Throughout this section (i.e. §3.3) Y is a scalar, not a vector.

3.3.1 Allowed distributions in GLMs

For GLMs, the pdf (in scalar y) of a single observation has the form

f(y : θ, φ) = exp{ [yθ − b(θ)]/(φ/w) + c(y, φ) },        (3.2)

where both θ and φ are parameters. This general form involves parameters θ and φ, which are
related to the mean and variance of Y. We see how next, in §3.3.2 and §3.3.3; the result is that

E(Y) = b'(θ)   and   var(Y) = (φ/w) b''(θ).
Various common and important distributions are included in the family (3.2): Normal, Poisson,
Binomial and Gamma, with the appropriate correspondences recorded in §3.3.4.

In the usual formulation where there are several independent observations Yi , i = 1, 2, . . ., the
values of θ and w are allowed to depend on i, giving θi and wi , while the value of φ and the
functions b, c are assumed not to depend on i. Thus Yi is taken from the density
 
exp{ w_i [yθ_i − b(θ_i)]/φ + c(y, φ) },        (3.3)

as given in equation (3.1).


When φ is known, so that there is only one parameter, θ, (3.2) has the form of equation (3.5) below with
q = 1 and so is from the exponential family, but it may not be from the exponential family when φ is unknown.
Unfortunately, there is no standard notation for the density in equation (3.2).

Background on exponential families

The exponential family is a class of distributions useful for modelling many kinds of data and whose members
enjoy particularly tractable properties. The family includes the Normal, Binomial, Poisson, Gamma and
Multinomial distributions.
A distribution with a one-dimensional parameter θ belongs to an exponential family if its pdf is of the form

f (y : θ) = exp(a(y)b(θ) + c(y) + d(θ)) (3.4)

for suitably smooth functions a, b, c and d.


More generally for a vector parameter θ a distribution belongs to a q-parameter exponential family if its pdf
is of the form
 
f(y : θ) = exp{ Σ_{j=1}^{q} a_j(y) b_j(θ) + c(y) + d(θ) }        (3.5)

for q ≥ 1 and suitable functions aj , bj , c and d.

3.3.2 Background on the Score Statistic

Let l(θ, ψ, y) (= log f (y : θ, ψ)) be the log-likelihood corresponding to a pdf f (not necessarily
of the form (3.2); the results below are more general). The score statistic is defined as:

∂l(θ, ψ, y)/∂θ.        (3.6)

Result. For any pdf (provided it is regular in θ in a sense indicated in the proof) the score
statistic, (3.6), has mean 0 and variance

−E( ∂²l/∂θ² )

when θ, ψ are the parameter values in the distribution from which Y was generated.
Proof. If f = f(y : θ, ψ) is any pdf (which depends on parameters θ, ψ), then

∫ f(y : θ, ψ) dy = 1,

so that, differentiating both sides, and assuming that everything is sufficiently well-behaved that
the differentiation and integration can be interchanged (which is the regularity needed),

0 = ∂1/∂θ = (∂/∂θ) ∫ f dy = ∫ (∂f/∂θ) dy = ∫ ∂f(y, θ, ψ)/∂θ dy.

This can be rewritten as

0 = ∫ (1/f)(∂f/∂θ) f dy = ∫ (∂ log f/∂θ) f dy = ∫ (∂l/∂θ) f dy,

which is exactly

0 = E( ∂l/∂θ ),

(i.e. the expected value of the score statistic is 0).
Differentiating again, assuming that differentiation and integration can be interchanged, and
using ∂f/∂θ = f ∂l/∂θ,

0 = (∂/∂θ) ∫ (∂l/∂θ) f dy = ∫ (∂/∂θ)[ (∂l/∂θ) f ] dy = ∫ [ (∂²l/∂θ²) f + (∂l/∂θ)(∂f/∂θ) ] dy
  = ∫ [ (∂²l/∂θ²) f + (∂l/∂θ)(∂l/∂θ) f ] dy
  = E( ∂²l/∂θ² ) + E( (∂l/∂θ)² ).

The first part of the result established that ∂l/∂θ has mean zero, hence the second term here
must be its variance; and so, because the two terms sum to zero, the variance is also given by
minus the first term, which is the required result. (Also, the second term, and hence minus the
first term, is the expected information on θ.)
The same proof works for multivariate Y and vector θ.

3.3.3 Mean and variance for GLM distributions

Now we apply the results on the score statistic to a single observation from the density given by
equation (3.2), for which

l(θ, φ, y) = [yθ − b(θ)]/(φ/w) + c(y, φ).        (3.7)

Differentiating l(θ, φ, y),

∂l/∂θ = [y − b'(θ)]/(φ/w)   and   ∂²l/∂θ² = −b''(θ)/(φ/w).

The first of these is the score statistic, which has mean zero, and so

0 = E( ∂l/∂θ ) = E( [Y − b'(θ)]/(φ/w) ),

which implies

µ = E(Y) = b'(θ).        (3.8)

This is a simple formula for the mean, µ, making µ and θ functions of each other. However, the
form of the density in (3.7) would lose much of its simplicity if written in terms of µ instead of θ.
Note that ∂²l/∂θ² does not depend on Y and so is unchanged by taking its expectation (over
Y). Now the result on the variance of the score statistic gives, after a little calculation,

var(Y) = (φ/w) b''(θ).        (3.9)

This shows that var(Y) is a product of a term depending on θ and another depending on φ. In
the language of (ii-G) in §3.1.4, b''(θ), when written in terms of µ, gives the variance function,
V(µ).

Task 1 Verify formula (3.9) for var(Y ).

3.3.4 Common GLM distributions

The table below summarizes characteristics of the most important GLM distributions.

                                    φ    w    b(θ)          c(y, φ)                             µ = b'(θ)        b''(θ)
Normal    Y ∼ N(θ, φ)               φ    1    θ²/2          −(y²/φ + log(2πφ))/2                θ                1
Poisson   Y ∼ Po(e^θ)               1    1    e^θ           −log(y!)                            e^θ              µ
Binomial  nY ∼ Bi(n, e^θ/(1+e^θ))   1    n    log(1+e^θ)    log(n choose ny)                    e^θ/(1+e^θ)      µ(1−µ)
Gamma     Y ∼ Ga(ν, λ) †            φ    1    −log(−θ)      ν log ν + (ν−1) log y − log Γ(ν)    −1/θ             µ²

† pdf f(y) = λ^ν y^(ν−1) e^(−λy) / Γ(ν), where θ = −λ/ν and φ = 1/ν.

As illustrated in Example 3.1.5, for the Binomial, Yi = Si /ni , the observed proportion of suc-
cesses, is taken as the response variable.

Task 2 Use equations (3.8) and (3.9) to verify the entries in the table above.

3.3.5 The Canonical Link

In a GLM, by (iii-G) in §3.1.4, Y_i has the density function

f(y : θ_i, φ) = exp{ [yθ_i − b(θ_i)]/(φ/w_i) + c(y, φ) },

where E(Y_i) = µ_i = h(x_i^T β). We have also shown in equation (3.8) that E(Y_i) = b'(θ_i), and so,
equating the two expressions for E(Y_i), b'(θ_i) = h(x_i^T β), i.e.

x_i^T β = h^(−1)(b'(θ_i)) = g(b'(θ_i)).        (3.10)

The link function g for which θ_i = x_i^T β (that is, for which g = (b')^(−1)) is called the canonical
link. In the Binomial case, the canonical inverse link is h = b'(θ_i) = e^(θ_i)/(1 + e^(θ_i)). Rearranging
yields the canonical link function g as

g(µ_i) = log( µ_i/(1 − µ_i) ),

commonly called the logit link. For the Normal, Poisson and Gamma, the canonical links are
the identity, log and reciprocal, respectively.

Task 3 Verify the canonical links for the Normal, Binomial, Poisson and Gamma.

3.3.6 Why is the canonical link useful?


The key point with the canonical link is that θ_i = Σ_j x_ij β_j; for other links θ will not be a linear
combination of the parameters β_1, ..., β_p. We have seen that for a GLM, the log-likelihood can
be written l = Σ_i { w_i(y_i θ_i − b(θ_i))/φ + c(y_i, φ) }. If φ is known then the likelihood only depends
on the data through the terms w_i y_i θ_i. For the canonical link

Σ_i w_i y_i θ_i = Σ_i Σ_j w_i y_i x_ij β_j = Σ_j β_j Σ_i x_ij w_i y_i = β^T X^T W Y,

where W = diag(w_1, w_2, ...). So if φ is known, the log-likelihood depends on the data only
through the p statistics Σ_i x_ij w_i y_i, j = 1, ..., p.

The canonical link is usually the default link in packages but this should not influence your
choice of link function. The choice should be made on the grounds of model fit.

3.4 Estimation of parameters - general theory

In a linear model, parameter estimation is easy. If X is the design matrix, y the vector of
response values, and σ 2 the variance of the error term, then the least squares estimator of β is

β̂ = (X T X)−1 X T y

and the covariance matrix is


σ(X T X)−1 .
These are easily calculated and only require specification of X and y. Things get more compli-
cated in a generalized linear model. In a GLM the log-likelihood l = l(θ, φ, y) is given by
l = Σ_i { [y_i θ_i − b(θ_i)]/(φ/w_i) + c(y_i, φ) },        (3.11)

where the θi are functions of the unknown β, see equation (3.10), and the other unknown
parameter is φ. Hence, for maximum likelihood (ML) estimation, one equation is ∂l/∂β = 0,
and the other is ∂l/∂φ = 0. Because neither φ nor c depends on θ or, as a consequence of this,
on β,
∂l ∂ X
= 0 =⇒ {wi [yi θi − b(θi )]} = 0.
∂β ∂β
i
These equations involve β only, not φ. They are solved to provide an estimate of β, which will
be independent of φ or the estimate of φ when it is unknown. When φ is unknown it is usually
estimated by a moment method rather than by maximum likelihood — see §3.9.
In virtually all cases except the Normal linear model the equation for estimating β is non-
linear, and has to be solved numerically, by iteration; the most popular method is a version of
Iteratively Reweighted Least Squares (weighted least squares in which the weights depend
on β, and hence, along with β, are iteratively estimated). The non-linear part in ∂l/∂β is b'(θi) (see section 3.3.4), which is usually non-linear in θi.

3.4.1 Covariance matrix for the mles

When φ is known, the asymptotic variance-covariance matrix var(β̂) can be obtained as the inverse of the information matrix for β; note that the information matrix is the negative of the Hessian matrix of the log-likelihood. Here the observed information matrix has (j, k)th entry

    −∂²l/(∂βj ∂βk) = −(∂²/∂βj ∂βk) Σi [yi θi − b(θi)]/(φ/wi) = −(1/φ) (∂²/∂βj ∂βk) Σi { wi [yi θi − b(θi)] }.

The same formulae are usually used when φ is unknown, but now with φ replaced by its estimator.
Estimation of φ is discussed briefly in §3.9.

Task 4 For the canonical link (see, in particular, equation (3.10)) show that

    ∂²l/(∂βj ∂βk) = −(1/φ) Σi xij wi h'(xi^T β) xik.
Standardizing (subtract the hypothesized mean and divide by the e.s.e.) gives the Wald test statistic, which allows simple tests on a parameter or on a linear combination of parameters c^T β (using var(c^T β̂) = c^T [var(β̂)] c). The usual Wald test that βj = 0 uses t = β̂j / e.s.e.(β̂j). The asymptotic theory leads to a N(0, 1) distribution, but using a t_{n−p} distribution may be better (the residual degrees of freedom are n − p if the regression matrix X, with ith row xi^T, has n rows and is of rank p). For an illustration see Example 4.4.1(10).
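For reference, these Wald quantities can be read straight off a fitted model in R; the line below is a generic sketch for some fitted glm object, here called fit (a hypothetical name).

coef(summary(fit))   # Estimate, Std. Error, the Wald z (or t) statistic and its p-value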

3.5 Estimation of parameters - Iteratively reweighted least squares

The objective is to obtain estimates of the parameters in the GLM along with the covariance
matrix. The process, known as iteratively reweighted least squares proceeds as follows:

1 Specify an initial vector of parameters b(1) = (β0 , β1 , ..., βp )T ;

2 Specify a weight matrix W that depends on the current parameter estimates;

3 Specify a vector z that depends on the current parameter estimates and response values;

4 Calculate recursively a new vector of parameter estimates b(m) using the relationship b(m) =
(X T W X)−1 X T W z where W and z are calculated using b(m−1) , the previous iteration
values of (β0 , β1 , ..., βp )T ;

5 Continue until the parameter values converge within a given tolerance.

The matrix W is a diagonal matrix with diagonal elements wii given by

    wii = [1/var(Yi)] (∂µi/∂ηi)²

and the vector z has elements given by

    zi = ηi + (yi − µi) (∂ηi/∂µi).

Remember that ηi is the linear predictor. We illustrate this by applying it to the beetle data
with a binomial distribution for the response and a logit link function. We first do the analysis
in R to see what we should be getting.

second.beetle.glm=glm(propn.dead~conc,family=binomial(link=logit),
weights=number,data=beetle)

> summary(second.beetle.glm)

Call:
glm(formula = propn.dead ~ conc, family = binomial(link = logit),
data = beetle, weights = number)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.5941 -0.3944 0.8329 1.2592 1.5940

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -60.717 5.181 -11.72 <2e-16 ***
conc 34.270 2.912 11.77 <2e-16 ***

---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 284.202 on 7 degrees of freedom


Residual deviance: 11.232 on 6 degrees of freedom
AIC: 41.43

Number of Fisher Scoring iterations: 4

The covariance matrix is

> vcov(second.beetle.glm)
(Intercept) conc
(Intercept) 26.83966 -15.082090
conc -15.08209 8.480525

So for the linear predictor β0 + β1 xi, the estimated value of β0 is −60.72 with standard error √26.84 = 5.18, and the estimated value of β1 is 34.27 with standard error √8.481 = 2.91; the covariance between the estimates is −15.082 to 3 d.p. The last line of the summary command gives the number of iterations used before the estimates converged between iterations (4 in this example). We will show how to verify the parameter estimates and standard errors using iteratively reweighted least squares. For the logit link, we know that the relationship between E(Yi) = µi and the linear predictor ηi is µi = e^{ηi}/(1 + e^{ηi}), so

    ∂µi/∂ηi = [(1 + e^{ηi})(e^{ηi}) − (e^{ηi})²]/(1 + e^{ηi})² = e^{ηi}/(1 + e^{ηi})²

by the quotient rule. We know that var(Yi) = µi(1 − µi)/ni = e^{ηi}/[ni(1 + e^{ηi})²], so that applying

    wii = [1/var(Yi)] (∂µi/∂ηi)²

yields

    wii = ni e^{ηi}/(1 + e^{ηi})².
These are the diagonal elements in the weight matrix (where ηi = xi^T β). For the zi we have that ηi = log[µi/(1 − µi)], so that

    ∂ηi/∂µi = [(1 − µi)/µi] × [(1 − µi) + µi]/(1 − µi)² = 1/[µi(1 − µi)]

by the chain rule. So applying zi = ηi + (yi − µi)(∂ηi/∂µi) yields

    zi = ηi + (yi − µi)/[µi(1 − µi)],

where µi = e^{ηi}/(1 + e^{ηi}). Now that we have calculated the diagonal elements of the matrix W and the vector z we can use the iterative procedure to obtain the iteratively reweighted least squares estimates of the parameters.

Taking initial values of β0 and β1 as −60 and 35, we obtain

    w11 = 59 e^{(−60+35×1.6907)} / [1 + e^{(−60+35×1.6907)}]² = 12.497.

If we repeat this for the other seven weights we get the weight matrix

    W = diag(12.497, 14.557, 9.648, 4.106, 1.977, 0.786, 0.361, 0.158).
We next calculate z. This is easier if we calculate the expected values (µ values) first.

    µ1 = e^{η1}/(1 + e^{η1}) = e^{(−60+35×1.6907)} / [1 + e^{(−60+35×1.6907)}] = 0.3045974

Then

    z1 = η1 + (y1 − µ1)/[µ1(1 − µ1)] = −60 + 35×1.6907 + (0.1017 − 0.3046)/[0.3046×(1 − 0.3046)] = −1.7834.

If we repeat this for the other z values we get

    z = (−1.783, −1.175, −1.889, −3.287, −1.134, −2.331, 3.369, 6.939)^T.
With the design matrix X given by

    X = ( 1  1.6907
          1  1.7242
          1  1.7552
          1  1.7842
          1  1.8113
          1  1.8369
          1  1.8610
          1  1.8839 )

we get that

    X^T W X = (  44.08992   76.4819
                 76.48190  132.7406 )

and

    X^T W z = (  −72.87857
                −126.34615 )

so that the next estimates of the parameters are

    (X^T W X)^{-1} X^T W z = ( −3.535201
                                1.085069 ).

If we let b(j) be the estimated parameter vector at iteration j, then our initial values give
b(1) = (−60, 35)T and the next iteration provides b(2) = (−3.535, 1.085)T . If we continue doing
this we find that
b(1) = (−60, 35)T
b(2) = (−3.535, 1.085)T
b(3) = (−63.17875, 36.00085)T
b(4) = (−55.9871, 31.56474)T
.. ..
. .
b(7) = (−60.71745, 34.27033)T
b(8) = (−60.71745, 34.27033)T

so that we get convergence after 7 iterations. R finds a smarter way to do it in just 4 iterations.
Our parameter estimates obtained by iteratively reweighted least squares are β̂0 = −60.71 and
β̂1 = 34.27 in agreement with the R output above.

Using the converged estimates of the parameters, the covariance matrix is given by

    (X^T W X)^{-1} = (  26.840  −15.082
                       −15.082    8.481 )

again in agreement with the vcov output in R. Note that these estimates are maximum likelihood estimates. Standard theory states that the sampling distribution of the maximum likelihood estimates is asymptotically normal, so we can find confidence intervals for the parameters in the usual way. For example, for the beetle data using the logit link, a 95% confidence interval for β1 is 34.27 ± 1.96√8.48 = (28.6, 40.0). Note that because this is a numerical maximization technique, care must be taken to avoid obtaining a local rather than a global mle.
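The hand calculation above can be reproduced with a few lines of R. The following is a minimal sketch of the scheme for the beetle data (columns conc, propn.dead and number as used earlier); it is intended only to verify the calculation, not to replace glm().

X <- cbind(1, beetle$conc)
y <- beetle$propn.dead
n <- beetle$number
b <- c(-60, 35)                           # initial values as in the hand calculation
for (m in 1:25) {
  eta <- as.vector(X %*% b)
  mu  <- exp(eta)/(1 + exp(eta))
  W   <- diag(n * mu * (1 - mu))          # w_ii = n_i mu_i (1 - mu_i)
  z   <- eta + (y - mu)/(mu * (1 - mu))   # working response
  b   <- solve(t(X) %*% W %*% X, t(X) %*% W %*% z)
}
b                                         # approximately (-60.717, 34.270)
solve(t(X) %*% W %*% X)                   # approximately the vcov() matrix above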

3.6 Scaled deviance and (residual) deviance

3.6.1 Definitions

The model contains parameters θi , or alternatively µi , for each i, where µi = b0 (θi ); but the
important point is that the n values {µi } are expressed as functions of the parameter β with,
say, p components, via the inverse link function — µi = h(ηi ) = h(xTi β). The likelihood of the
sample is a function of the parameters β and φ, but it may be thought of instead as a function
of µ and φ, where µ is the vector of length n with components µi . The likelihood will be denoted
by L(µ, φ, y) and the log-likelihood by l(µ, φ, y) — but the true parameter remains β, and the
components of the vector µ remain constrained by the relationships µi = h(ηi ) = h(xTi β).

The maximum value of L (given φ) is then L(µ̂, φ, y), where µ̂i = h(xi^T β̂) derives from the ML estimate β̂ discussed in §3.4; the corresponding θ̂i make up the vector θ̂, resulting from equation (3.8).
The model that allows the mean of each observation to be a separate parameter is called the full or saturated or maximal model. Denote the means for this model by µ̃, with θ̃ the corresponding θ. Estimation is easy. The maximum possible value of L as the components of µ vary without any restrictions (or, essentially equivalently, when there is an element of β for each observation, so that µi = h(ηi) = h(βi)) occurs when µi = yi, hence µ̃i = yi for all i. Thus, for the saturated model the estimate of the means is µ̃ = y, which is independent of the particular link being used. Denote the corresponding estimate of θ by θ̃, given by µ̃ = b'(θ̃).

Task 5 Show that, for the maximal model, yi is the maximum likelihood estimate of µi . (Hint:
the likelihood equation for θi is yi = b0 (θi ), and we already know that b0 (θi ) = µi .)

Suppose a proposed model (µi = h(xi^T β)) has corresponding estimated means µ̂ and associated θ̂, linked via equation (3.8). The scaled deviance S(y, µ̂) for that model is defined as twice the difference in the log-likelihoods of the maximal model and that model:

    S(y, µ̂) = −2 { l(µ̂, φ, y) − l(µ̃, φ, y) }
            = 2 Σi [ yi(θ̃i − θ̂i) − b(θ̃i) + b(θ̂i) ] / (φ/wi)
            = (2/φ) Σi wi [ yi(θ̃i − θ̂i) − b(θ̃i) + b(θ̂i) ].

Hence, φS(y, µ̂) does not depend on φ and so is a function only of the data (including the wi). Then the deviance (also called the residual deviance) for the model with estimated means µ̂ is defined by

    D(y, µ̂) = φS(y, µ̂) = 2 Σi wi [ yi(θ̃i − θ̂i) − b(θ̃i) + b(θ̂i) ].     (3.12)

Both S and D are non-negative, and the maximal model has S = D = 0.


In R output, D is called the residual deviance and deviance is used for the difference between
the residual deviances of two models where one is nested in the other; see §3.7.2.
N.B. Some authors use G or G2 for D. Unfortunately, Dobson uses the terms deviance and scaled deviance
the opposite way round to that above. The definitions above are a bit different from those actually used in R.
Whenever you see these terms used you will have to check out the author’s preferred definition. Probably, in
most other contexts, deviance is used for twice the difference in log-likelihood, which here is called the scaled
deviance.

3.6.2 Testing model fit using deviance (approximately)

Under rather strong assumptions (and if the model is correct) S(y, µ̂) has, asymptotically, a χ²_{n−p} distribution, for reasons indicated in §3.7.1 – but this may not be a very good approximation. When φ is unknown – in particular for Normal and Gamma models – an estimate of φ is needed (see §3.9). See section 4.4.1 for an example.

3.6.3 Testing model fit using pseudo R2

For linear models we can use R2 as a measure of fit (with the caveat that it cannot decrease
as more parameters are added). Remember that for a linear model R2 is the proportion of the
variation of the response that is explained by the model (the variation here being measured by
the sums of squares of the response). In GLMs there is a version of this called pseudo R2 . The
pseudo R2 is given by
l(µ̃, φ; y) − l(µ̂, φ; y)
l(µ̃, φ; y)
where l(µ̃, φ; y) is the log-likelihood of the minimal model (in which the linear predictor contains
just a constant term) and l(µ̂, φ; y) is the log-likelihood of the model under consideration. So
the pseudo R2 represents the proportional improvement in the log-likelihood due to the model
under consideration compared to the minimal model.
R produces an AIC value for each model and we can use this to get the likelihood. Remember
that
AIC = −2l + 2p
where l is the log-likelihood and p is the number of parameters. We can rearrange this to get
l = p − 1/2AIC. We can therefore use the AIC output from R for the minimal model and any
given model to estimate the pseudo R2 value. See 4.4.3 for an example.
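The log-likelihoods can also be obtained directly in R with logLik(), avoiding the AIC rearrangement. A small sketch, assuming fitted glm objects called null.glm (the minimal model) and my.glm (the model of interest; both names are illustrative here):

l.min <- as.numeric(logLik(null.glm))   # log-likelihood of the minimal model
l.mod <- as.numeric(logLik(my.glm))     # log-likelihood of the model of interest
(l.min - l.mod)/l.min                   # pseudo R^2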

3.6.4 Deviances for common distributions

For the common four distributions we mainly use, the (residual) deviances D(y, µ̂) are as follows.

    Normal     Σi (yi − µ̂i)²

    Poisson    2 Σi [ yi log(yi/µ̂i) − (yi − µ̂i) ]

    Binomial   −2 Σi ni [ yi log(µ̂i/yi) + (1 − yi) log((1 − µ̂i)/(1 − yi)) ]

    Gamma      −2 Σi [ log(yi/µ̂i) − (yi − µ̂i)/µ̂i ]

For Poisson and Gamma, Σi (yi − µ̂i) = 0 under a power or log link when the model has a constant term, and so the deviance usually simplifies accordingly.

Task 6 Confirm these using the table in §3.3.4 and equation (3.12).

In all four cases the form of D(y, µ̂) supports its use as a measure of fit, since it can be interpreted as summarising how near the fitted values µ̂ are to the observed values y.
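As a quick numerical check of the Poisson entry, the sketch below compares the formula with R's deviance() for some fitted Poisson glm, here called fit (a hypothetical name).

y  <- fit$y                                              # observed counts stored in the glm object
mu <- fitted(fit)                                        # fitted means
D  <- 2*sum(ifelse(y == 0, 0, y*log(y/mu)) - (y - mu))   # y*log(y/mu) taken as 0 when y = 0
all.equal(D, deviance(fit))                              # should be TRUE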

3.7 Comparing nested models

The Pearson X 2 statistic is often used as a measure of fit. It could therefore be used to
compare the fit of two models but D is usually preferred when looking at changes between
models as it is exactly additive when a hypothesis is partitioned into orthogonal components.

3.7.1 Applying GLRT to GLM

Suppose two models for the mean (i.e. linear predictors η* and η°) are to be compared, and that in Model 1 η* = Xβ, and in Model 2 η° = Xβ + Zγ – so that the first is a special case of the second (i.e. is nested within the second). Then, when φ is known, η* can be tested against η° (i.e. testing the null hypothesis γ = 0) by using the Generalized Likelihood Ratio Test (GLRT), with the test statistic

    −2 log(L*/L°) = −2(l* − l°) = −2 [ l(µ̂*, φ, y) − l(µ̂°, φ, y) ]
                  = S(y, µ̂*) − S(y, µ̂°)
                  = [ D(y, µ̂*) − D(y, µ̂°) ] / φ.

Note that, if φ is unknown, the true Generalised Likelihood Ratio Test would use maximum likelihood estimates of φ under both Model 1 (in l*) and Model 2 (in l°). Instead, in this context, it is common to compute φ̂ from Model 2 and then treat it as known.

Under the null hypothesis that γ = 0 (i.e. that the model η* is appropriate) and under suitable assumptions of regularity, the distribution of the test statistic

    [ D(y, µ̂*) − D(y, µ̂°) ] / φ

(which is just the difference of scaled deviances) is asymptotically χ²_r when γ has r linearly independent components (i.e. rank(Z) = r). If we are just testing whether adding an extra explanatory variable improves the fit then r = 1; this is the test we usually use.

The claim that S(y, µ̂) has, asymptotically, a χ²_{n−p} distribution, made in section 3.6.2, is just a particular case of this result, with Model 1 giving µ̂, based on p parameters, and Model 2 being the saturated model, with n parameters.
Of course all of this theory presumes that Model 2 and the associated distributional assumptions
are suitable. Hence, before doing any testing based on the theory above one should first check
that Model 2 is adequate with residual plots (§3.8) and any other devices that are appropriate.

3.7.2 Model building - Analysis of Deviance

On the basis of the theory just discussed it is clear that approximate tests of nested models can
be made by using the deviance. Comparing nested GLMs using analysis of deviance is a very
similar process to using F-tests for nested linear models. A convenient way of presenting these is
in an Analysis of Deviance table, which is similar to an Analysis of Variance table, and reduces
to it in the Normal linear case. The general structure splits the variation for the simpler model
(model 1), measured by its deviance, into two components, one representing the improvement
of the more complicated model (model 2) over the simpler one and the other representing the
variation left after fitting the more complicated model.

    Source                              Deviance                  d.f.
    Model 2 after fitting Model 1       D(y, µ̂*) − D(y, µ̂°)       r = p° − p*
    Model 2                             D(y, µ̂°)                  n − p°
    Model 1                             D(y, µ̂*)                  n − p*

where rank([X, Z]) = p°, rank(X) = p*, and so rank(Z) = p° − p* = r.


Typically an Analysis of Deviance involves a series of nested models, not just two. In the table this amounts to decomposing the Model 2 term further into ‘Model 3’ and ‘Model 3 after fitting Model 2’, in just the same way, and so on. Also, when extended tables of this form are constructed, the simplest model considered, and used as ‘Model 1’, is usually the one where all the means are assumed the same; this is called the null or minimal model and its deviance is called the null deviance. In the null model all the entries in η* are the same; ηi* = β0 for all i. If W = Σi wi, the estimate of µ in the minimal model is easily seen to be W^{-1} Σi wi yi, and does not depend on the link used.
For an example of using analysis of deviance in model building, see section 4.4.1.

Task 7 Use the results in §3.6.4 to obtain simple expressions for the null deviances for Poisson
(weights must be one) and Gamma (assume weights are one).

You can see examples of these extended Analysis of Deviance tables, in various formats, produced by packages. In R, D(y, µ̂°) is called a residual deviance, and differences between these, D(y, µ̂*) − D(y, µ̂°), are called deviances.

3.7.3 Analysis of deviance in R

To compare two nested models in R use the anova command with the test="Chi" option. So anova(model1,model2,test="Chi"), where model1 is nested in model2, will produce an analysis of deviance table like that in section 3.7.2. See section 4.4.1 for an example.

3.7.4 Hypothesis tests on a single parameter

The Wald t−test can sometimes be badly misleading. Hence, although comparing estimates
with their estimated standard errors can be a useful guide in model simplification, the change

in deviance test is preferred when actually performing a test on a single parameter.

3.7.5 Remember that p-values are only approximate

Note that, apart from the Normal linear model, even if all the assumptions hold precisely, the
deviance tests and the Wald t−tests for r = 1 in §3.4 are based on asymptotic theory, and hence
are approximate in practice. How close the null distribution is to the assumed one depends on
many things, not just the size of n. Thus all the tests should be viewed as guidelines only. It is
not realistic, for example with a 1 df χ2 -test (χ21,0.95 = 3.84), to say that 3.85 shows a significant
effect (p < 0.05), and 3.83 does not show one (p > 0.05); in both cases all you know is that
p ≈ 0.05.

3.8 Residuals

There are a number of different types of residuals which can be defined for generalized linear
models; in the Normal linear model they all reduce to variants of the basic (raw) least squares
residual ei = yi − µbi . Here two are defined. It is not clear which type of residual is more useful in
different situations and for different purposes, and the default may differ in different packages.

3.8.1 Distribution of residuals

Clearly when the response variable is not normally distributed, we wouldn’t expect the residuals
to be normally distributed - but in some cases they might be, although often only very approx-
imately. Consider the case of a response that has a binomial distribution. We would expect the
errors to be binomially distributed, but we know that, provided n is large and p is not too close to 0 or 1, a Bin(n, p) distribution is not too different from a N(np, np(1 − p)) distribution. So for
GLMs, it might at first seem reasonable to assume that, providing certain conditions are met,
we might expect the residuals to be roughly normally distributed. But there is also the added
problem that we have to estimate the expectation. Bearing all this in mind we shouldn’t put too
much emphasis on a QQ plot or histogram. These are often omitted from the list of diagnostic
checks. As we did for linear models, we could perhaps check for autocorrelation (correlation be-
tween successive observations), outliers, non-linearity and homoscedasticity (constant variance).
It may at first seem counter-intuitive to expect homoscedasticity for Pearson residuals in a GLM, as by definition the variance of the response is allowed to depend on the expected value. However, the Pearson residuals are scaled by an estimate of the standard deviation of the response, so their variance should be approximately constant for all values (the variance of Yi is dictated by the assumed distribution of Yi). Also,
be aware that for a Poisson distribution when counts are small, there may well be patterns in
the residuals resulting from this aspect of the data, which should not be construed as indicating
model inadequacy.

3.8.2 Pearson residuals


    e_{P,i} = √wi (yi − µ̂i)/√V(µ̂i)

Note that yi − µ̂i has been divided by the square root of the estimate of var(Yi)/φ, not by its full estimated standard deviation √var(Yi).

The Pearson X² statistic was briefly discussed as a possible measure of fit in §3.6.2. It is related to the Pearson residuals by the relationship

    X² = X²(y, µ̂) = Σi e²_{P,i}.
Note that the Pearson X² statistic is asymptotically equivalent to the deviance D for the model.

Task 8 For the Poisson distribution, verify that D and X 2 are asymptotically equivalent. (Hint:
use that for small |x − y|, x log(x/y) ≈ (x − y) + (x − y)2 /(2y).)

Note that N (µi , V (µi )/wi ) is the usual large n approximate distribution for Yi under the Bi-
nomial and Poisson (where φ = 1). Estimating µi and V (µi ), and standardizing, gives eP,i for
these two cases. Even with this large n approximation we see that the Pearson residuals will at
best have only an approximate N (0, 1) distribution (if n is large) since we have to estimate µi
and V (µi ).

Task 9 Obtain the form of the Pearson residual for the Binomial, Poisson and Gamma. (These
are given in Chapters 4 and 5)

3.8.3 Deviance residuals

The deviance arises from the difference in log-likelihoods. In a log-likelihood for independent
observations each observation
P causes the addition of a non-negative term. Hence the deviance
has the form D = di , where di arises from the ith observation. In section 3.6.4 the residual
deviances for some common distributions are defined. The di defined in this section is the
contribution of the ith observation to this residual deviance. So referring to section 3.6.4, if Yi
has a binomial distribution then

    di = −2 ni [ yi log(µ̂i/yi) + (1 − yi) log((1 − µ̂i)/(1 − yi)) ].

The above form for di is only true for the binomial distribution. For other distributions you need to consult section 3.6.4 for the specific form.
The ith deviance residual is defined in terms of di as

    e_{D,i} = sgn(yi − µ̂i) × √di,

where sgn(x) = 1 if x > 0, −1 if x < 0. Note that Σi e²_{D,i} = D.

Note: If φ is not 1, the Pearson or deviance residuals may be scaled by φ or its estimate to give scaled
Pearson residuals or scaled deviance residuals.

See section 4.4.4 for an example of how to calculate Pearson and deviance residuals for a binomial
response.

3.9 Estimating the scale parameter

A common (‘moment’) estimate of the scale parameter φ is

    X²(y, µ̂)/(n − p)   or   D(y, µ̂)/(n − p),

where n is the number of observations and p is the number of parameters (including the intercept) in the model. The estimator of φ used in R is the X² one. Note that in the binomial case, n is the number of proportions observed, not the total number of observations these proportions are based on; see 4.5.1 for an example. This is not a maximum likelihood estimate, but then
neither is the usual estimate of the error variance in Normal linear model theory.
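A one-line sketch of this moment estimate in R, for some fitted glm called fit (a hypothetical name):

sum(resid(fit, "pearson")^2)/df.residual(fit)   # X^2/(n - p)
# For quasi families, summary(fit)$dispersion reports this same Pearson-based estimate.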

3.10 Quasi-likelihood

If a GLM has φ fixed, as for the Binomial and Poisson (which have φ = 1), a simple extension,
which may be appropriate for some data sets, is to assume the variance has the form for that
distribution but scaled by φ with φ unknown. The model can be fitted in the usual way for the
originating distribution, and then φ can be estimated (see §3.9).
For the Binomial and Poisson, this allows a general model for over-dispersion (or conceiv-
ably under-dispersion) to be fitted. These models can also be used for data restricted to [0, 1]
(Binomial) or [0, ∞) (Poisson) even if the underlying variable is not integer valued.
Since the only assumptions are about the mean E(Y ) and the variance var(Y ), the estimation
cannot be a maximum likelihood procedure, and the generalized likelihood ratio test theory does
not apply. However, an extension, called quasi-likelihood theory, shows that asymptotically,
the generalized likelihood ratio test described in §3.7.1 applies.
An illustration is given in Example 4.5.1.

Chapter 4

Binary Response Data

4.1 Introduction

Binary data occur when there are just two possible responses, such as dead or alive. The usual
generic terms are ‘failure’ and ‘success’. These can be coded so that the variables Yi take only
the values 0 or 1, respectively. The probabilities are given by:

P (Yi = 1) = µi and P (Yi = 0) = 1 − µi .

Then E(Yi ) = µi , which in a GLM, given explanatory variables xi , is assumed to equal h(xTi β).
Frequently, as in a toxicity trial (see Example 3.1.5), there are ni replicates for the same xi giving rise to ni binary observations Yij, say. Then summing the observations with the same xi, Si = Σj Yij gives the number of successes, which, providing that the assumptions of independence and constant probability are satisfied, contains all information from the experiment about µ. (Technically, S is a sufficient statistic for µ.) Hence the variables Si, whose distributions are Bi(ni, µi), are usually analysed instead of the original Yij. In the context of generalized linear modelling it is more natural to deal with Yi = Si/ni = Σj Yij/ni (the observed proportion of successes). This has E(Yi) = µi ∈ [0, 1].
Then the outcome is a GLM, see §3.1.5 and then §3.3.1, and all the results in Chapter 3 apply.
Since φ = 1 is known, the (residual) deviance is, for a correct distribution and model, also
asymptotically χ2n−p , although in practice, as noted at the end of §3.6.1, the χ2n−p distribution
may not be a good approximation (particularly if the Yi are binary). It should certainly be good
if the Normal approximation to Bi(ni , µi ) is good for each i.

4.2 Binomial likelihood

The maximum likelihood approach discussed in Chapter 3 requires the likelihood to be written
as a function of the parameters in the linear predictor. For the binomial case and the logit link
function (e.g. the beetle data) in which yi ∼ Bin(ni , µi ) and with m independent observations,
the likelihood is
m  
ni
µyi i (1 − µi )ni −yi
Y
L(µi , yi ; xi ) =
yi
i=1
m  yi  ni −yi
exp(xTi β)

Y ni 1
L(β, yi ; xi ) =
i=1
yi 1 + exp(xTi β) 1 + exp(xTi β)
m  
Y ni
= (exp(xTi β))yi (1 + exp(xTi β))−ni
yi
i=1

86
and the log-likelihood as
m
X  T
yi xi β − ni log(1 + exp(xTi β)) + constant

l(β, yi ; xi ) =
i=1
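As an aside (not part of the original example), this log-likelihood can be maximized directly in R for the beetle data and compared with glm(); a minimal sketch, using the column names conc, propn.dead and number as before:

X <- cbind(1, beetle$conc)
s <- beetle$propn.dead * beetle$number          # numbers killed
n <- beetle$number
negll <- function(beta) {
  eta <- as.vector(X %*% beta)
  -sum(s*eta - n*log(1 + exp(eta)))             # minus the log-likelihood, up to a constant
}
optim(c(-60, 35), negll, method = "BFGS")$par   # should be close to the glm() estimates -60.72, 34.27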

4.3 Links for binomial responses

There are three commonly used link functions (applied to µi , the expected proportion of suc-
cesses) – the first two were noted in §3.1.5.
    Logit                    logit(µi) = log{µi/(1 − µi)} = xi^T β
    Probit                   probit(µi) = Φ^{-1}(µi) = xi^T β
    Complementary log-log    cloglog(µi) = log(−log(1 − µi)) = xi^T β

Task 10 How does a GLM with identity link relate to a linear model? Under which situations
does it make sense to fit a linear model instead of a GLM with identity link?

For a single explanatory variable x, a rough check on the link, g, and linearity of the predictor
in x can be obtained by plotting g(yi∗ ) against xi , where, to avoid problems if yi = 0 or yi = 1,
yi∗ = (ni yi + 1/2)/(ni + 1) = (si + 1/2)/(ni + 1). Using a less appropriate link can lead to needing
extra polynomial terms in x to account for the extra curvilinearity – see §4.4.1.
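A sketch of these empirical link checks in R, with generic names s, n and x for the successes, group sizes and explanatory variable (the names are illustrative only):

ystar <- (s + 0.5)/(n + 1)           # adjusted proportions, avoiding 0 and 1
plot(x, log(ystar/(1 - ystar)))      # empirical logit
plot(x, qnorm(ystar))                # empirical probit
plot(x, log(-log(1 - ystar)))        # empirical complementary log-log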

4.3.1 Choosing a link function for the beetle data of chapter 3

The beetle data used extensively in chapter 3 (repeated in the table below) reports the number
of beetles exposed (ni ) and killed (si ) after 5 hours exposure to gaseous carbon disulphide (CS2 )
at various concentrations (xi in log10 CS2 mg/l).
Note the shape of the top left plot in Figure 4.1. It has the shape of a distribution function:
tending to zero at −∞ and tending to 1 at +∞ which is a common shape for data that naturally
give rise to proportions. How can we determine which link function to use to model the data?
We can check whether plotting g(yi) against xi gives a straight line. Clearly, the form that g(yi) takes depends on the link function being considered. For the logit link we plot xi against

    log[yi/(1 − yi)] = log[(si/ni)/(1 − si/ni)] = log[si/(ni − si)]

(or use yi* = (ni yi + 1/2)/(ni + 1) = (si + 1/2)/(ni + 1) if yi = 0 or yi = 1). Note that log[yi/(1 − yi)] is termed the empirical logit. For the probit and cloglog links we plot xi against Φ^{-1}(yi*) and log(−log(1 − yi*)) respectively (see Figure 4.1). All three plots look fairly linear, though there is possibly some curvature in the logit plot.
More formally we could assess the fit using the scaled deviance (or residual deviance since
φ = 1 for the binomial). The scaled deviance for the beetle data (obtained from the summary
command in R - see section 3.2 for the probit output), fitting an intercept and linear term, for
the probit, logit and cloglog links are:

• probit link, S = 10.12 on 6 df (n = 8, p = 2) (1-pchisq(10.12,6)=0.12)


• logit link, S = 11.23 on 6 df (1-pchisq(11.23,6)=0.08)
• cloglog link S = 3.45 on 6 df (1-pchisq(3.45,6)=0.751)

All links provide reasonable fits (at the 5% level) where the null is that the model is ‘adequate’.
The cloglog link has the lowest scaled deviance of the three, but inspection of residuals should
also be done before making any definitive conclusions.
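These three fits can be reproduced with a short loop; a sketch, assuming the beetle data frame as before:

for (lnk in c("probit", "logit", "cloglog"))
  print(deviance(glm(propn.dead ~ conc, binomial(link = lnk),
                     weights = number, data = beetle)))   # about 10.12, 11.23, 3.45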

xi 1.6907 1.7242 1.7552 1.7842 1.8113 1.8369 1.8610 1.8839
ni 59 60 62 56 63 59 62 60
si 6 13 18 28 52 53 61 60

Figure 4.1: Proportion of beetle deaths by log concentration (panels: observed proportions dead, and the empirical logit, probit and cloglog plots, each against log concentration)

4.4 Model building, assessing model fit and residuals

The deviance is given by (§3.6.4)

    D = −2 Σi ni [ yi log(µ̂i/yi) + (1 − yi) log((1 − µ̂i)/(1 − yi)) ].

The null deviance (constant mean) does not depend on the link.
The Pearson residual (see §3.8.2) is

    e_{P,i} = (yi − µ̂i) / √[ µ̂i(1 − µ̂i)/ni ].

Some calculation shows that, with si = ni yi,

    X² = Σi e²_{P,i} = Σi { [si − ni µ̂i]²/(ni µ̂i) + [(ni − si) − ni(1 − µ̂i)]²/(ni(1 − µ̂i)) }.
Thus, given the ‘expected values’, E (i.e. the estimated means), X² is the usual sum of (O − E)²/E, with ‘observed value’ O, over the 2 × n table with one row for ‘success’ and one for ‘failure’ – in the ith column, the observed number of successes is si and the observed number of failures is ni − si.
Similarly, D is the sum over the 2 × n table of 2 × O × log(O/E).
As mentioned in §3.7, marginal Wald t-ratios on coefficients can be seriously misleading and this
can happen for Binomial data. Thus the test based on the difference in (residual) deviance is
preferred.
Note that to fit data as Binomial, it is necessary to specify both the response yi and the weights
ni (or some other function of these).

Task 11 Confirm that

    Σi e²_{P,i} = Σi { [si − ni µ̂i]²/(ni µ̂i) + [(ni − si) − ni(1 − µ̂i)]²/(ni(1 − µ̂i)) }.

Hint: use 1/[p(1 − p)] = 1/p + 1/(1 − p).

4.4.1 Model building for the beetle data

Consider again the beetle mortality data of Example 4.3.1. This data set will be used to
demonstrate some of the model-fitting and testing theory of Chapter 3. In this example we use
the logit link function. Using the other link functions appears as an exercise in Exercise 2.

1. Fitting the minimal model

null.glm<-glm(propn.dead~1,binomial(logit),weights=number)

The null deviance (η = β0 ) is 284.2 (on 7 df). Looking at the left hand plot in Figure 4.2,
the proportion of deaths clearly depends on log concentration.

Figure 4.2: Fitted and observed values for various models applied to the beetle data (panels: minimal, linear and quadratic models); circles = observed values, crosses = fitted values

2. Adding a log concentration term to the model

linear.glm<-glm(propn.dead~1+conc,binomial(logit),weights=number)

η = β0 + β1x gives a (residual) deviance of 11.232 on 6 df. The Pearson X² value (see §3.8.2) is 10.026 (roughly comparable to the (residual) deviance). This is obtained in R using the command

sum(resid(linear.glm,"pearson")^2)

3. The change in deviance from the null model to the linear model is 284.2 − 11.232 = 272.97
on 1 df. χ21,0.95 = 3.84 so overwhelming evidence to reject the null hypothesis H0 : β1 = 0.
This is analysis of deviance for nested models described in section 3.7.2.
4. Since χ26,0.95 = 12.59, the fit appears acceptable (11.232 is less than 12.59). This is using
the (residual) deviance to assess model fit – see the end of §3.6.1.
Looking at the middle plot in Figure 4.2, we can see considerable improvement in the fit
compared to the minimal model.
5. The left hand figure in Figure 4.3 shows the Pearson residuals against fitted value for the
model with a linear term for log concentration. It suggests considerable curvilinearity and
it is sensible to consider adding a quadratic term into the linear predictor.
6. Adding a quadratic log concentration term to the model

quad.glm<-glm(propn.dead~1+conc+I(conc^2),binomial(logit),weights=number)

Fitting a quadratic in x (η = β0 + β1x + β2x²) gives a (residual) deviance of 3.195 on 5 df.
7. This again is a significant improvement since the change in deviance is 8.037 (χ21,0.95 = 3.84,
χ21,0.99 = 6.64) so there is strong evidence to reject the null hypothesis H0 : β2 = 0. This is
analysis of deviance for nested models described in section 3.7.2. Also since χ25,0.95 = 11.07,
the (residual) deviance is well below this so the fit is good. This is a further use of (residual)
deviance to assess model fit - see the end of §3.6.1.

Figure 4.3: Pearson residuals versus fitted values for linear.glm and quad.glm

The right hand plot in Figure 4.2 shows the fitted values for the quadratic model - showing
a considerable improvement in the fitted values compared to the model with just a linear
term. The right hand figure in Figure 4.3 shows the Pearson residuals versus fitted values
for the quadratic model. The curvature in the residuals has gone.
8. Finally, the (residual) deviance is so low that there is now essentially no scope for further improvement by adding another parameter (the cubic term, for example): the (residual) deviance from the quadratic model is less than χ²_{1,0.95} = 3.84, so no change in (residual) deviance can exceed 3.84 and hence no added term can be significant.

9. Hence, using the logit link we get a good fit - residual deviance is low and residual plots
are satisfactory.
10. As an example of the Wald t-test (§3.4 and §3.7.4), consider the quadratic fit (using a logit link). Obtaining the estimates for the parameters in the linear predictor using the summary command, we have β̂2 = 156.41 with e.s.e.(β̂2) = 57.853, so the t-value for testing β2 = 0 is 2.703. This gives a p-value of 0.00687, which comes from 2×(1 − Φ(2.703)) = 0.007 using Neave's tables, or 2*pnorm(-2.703) in R.

4.4.2 Analysis of deviance for the beetle data

As noted in section 3.7.2, we can perform an analysis of deviance in R using the anova command.
Using anova on a single GLM gives the residual deviances for all the univariate models nested
within the GLM under consideration. So for the quadratic model we would obtain the deviances
for the null, linear and quadratic models:

> anova(quad.glm)
Analysis of Deviance Table
Model: binomial, link: logit
Response: propn.dead
Terms added sequentially (first to last)

Df Deviance Resid. Df Resid. Dev


NULL 7 284.202
conc 1 272.970 6 11.232
I(conc^2) 1 8.037 5 3.195

We can also use the anova command to compare the residual deviance of two GLMs.

> anova(linear.glm,quad.glm,test="Chi")
Analysis of Deviance Table

Model 1: propn.dead ~ 1 + conc


Model 2: propn.dead ~ 1 + conc + I(conc^2)
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 6 11.2322
2 5 3.1949 1 8.0373 0.004582 **

reproducing the change in deviance found in point 7 above.

4.4.3 Calculating pseudo R2 for the beetle quadratic model

We can get the AIC values using the ‘$aic’ attribute as follows:

> quad.glm$aic
[1] 35.39294

There are 3 parameters for the quadratic model. This gives a log-likelihood of 3 − 35.39/2 = −14.70. The log-likelihood of the minimal model, obtained in the same way from its AIC, is −155.2. So the pseudo R² for the quadratic model is (−155.2 − (−14.70))/(−155.2) = 0.91, which means that there is a 91% proportional improvement in the log-likelihood for the quadratic model compared to the minimal model (where the linear predictor contains just an intercept term).

4.4.4 Calculating residuals for the beetle quadratic model

In section 4.4 it was stated that the Pearson residual with a binomial response is

    e_{P,i} = (yi − µ̂i) / √[ µ̂i(1 − µ̂i)/ni ],

where µ̂i is the ith fitted value. We can get these from R using

> round(resid(quad.glm,"pearson"),digits=3)
1 2 3 4 5 6 7 8
-0.412 0.843 -0.275 -0.524 0.852 -0.872 0.165 0.511

Other residuals are available. We can check these manually using the fitted values from R.

> y=beetle$propn.dead
> fitted=quad.glm$fitted
> n=beetle$number
> round((y-fitted)/sqrt(fitted*(1-fitted)/n),digits=3)
1 2 3 4 5 6 7 8
-0.412 0.843 -0.275 -0.524 0.852 -0.872 0.165 0.511

In section 3.6.4 the contribution of the ith observation to the residual deviance for the binomial distribution is

    di = −2 ni [ yi log(µ̂i/yi) + (1 − yi) log((1 − µ̂i)/(1 − yi)) ].

So the √di are obtained by

> round(sqrt(-2*n*(y*log(fitted/y)+(1-y)*log((1-fitted)/(1-y+0.000001)))),
digits=3)
1 2 3 4 5 6 7 8
0.422 0.819 0.277 0.523 0.875 0.825 0.170 0.722

We then make these values negative for those observations with yi < µ̂i. These can be checked in R using

> round(resid(quad.glm,"deviance"),digits=3)
1 2 3 4 5 6 7 8
-0.422 0.819 -0.277 -0.523 0.875 -0.825 0.170 0.722

Task 12 Using the logit fitted line in x in Example 4.4.1, η̂ = −60.72 + 34.27x, show by direct substitution that µ̂1 ≈ 0.059, eP,1 ≈ 1.41, eD,1 ≈ 1.28.

4.4.5 Example 2 of model building: plant anthers

The data below, from Dobson (1990, pp. 113–115, Example 8.2; 2002, §7.4.1, pp. 123–125), are
the number of embryogenic anthers of a plant species prepared (nij ) and obtained (sij ) under 2
different storage conditions and 3 values of centrifuging force (xj ). Interest is in comparing the
effects of the storage conditions, one of which is a control, and the other is a new treatment (3◦ C
for 48 hours). The table entries are (nij , sij ), i = 1, 2, j = 1, 2, 3. The data is in the ‘anthers’
dataframe on the .RData workspace on MOLE.

Centrifuging force (g)
40 150 350
Control (102, 55) (99, 52) (108, 57)
Treatment (76, 55) (81, 50) (90, 50)

Let M represent an indicator variable which is zero for the control group and 1 for the treatment
group.

1. The plot of yi = si /ni against zi = log(xi ) (log force) suggests differences between the
treatments, with no strong dependence on z (or x), but possibly an interaction between
treatment and z (different slopes). The plot against x is very similar.
2. Plotting g(yi∗ ) against z or x shows no real difference between the various links. Consider
the logit link.
3. The minimal deviance (Model 1: η = α) is 10.452 (on 5 df), which is not large (χ25,0.95 =
11.07).
4. Fitting two groups with no dependence on z (Model 2: ηij = α + α1 M ) gives a deviance
of 5.173 (on 4 df), showing that there is strong evidence to reject H0 : αi = 0 (deviance
change from Model 1 to Model 2 is 5.3 on 1 df).
5. Fitting two parallel lines (Model 3: ηij = α + α1 M + βzj , gives a deviance of 2.619 (on 3
df), showing that there is not significant evidence to reject H0 : β = 0 (deviance change
from Model 2 to Model 3 is 2.55 on 1 df).
6. Fitting two non-parallel lines (Model 4: ηij = α + α1 M + βzj + β1 M zj ) gives a deviance
of 0.028 (on 2 df) showing there is not significant evidence to reject the null hypothesis
H0 : β1 = 0 (deviance change from Model 2 to Model 4 is 5.145 on 2 df - where χ22,0.95 =
5.99). Note that in R this is obtained using qchisq(0.95,2).
7. Using other links gives similar results which is not surprising as all sample proportions are
between 0.52 and 0.73 and the different links are all similar for this range of y values.
8. Thus, the fits provide evidence of differences between the treatments, but not much evi-
dence for a change with force.

Note that all the model comparisons using the change in (residual) deviance only compare nested
models. Comparing non-nested models using this method is not valid.
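A sketch of these four fits in R is given below; the column names s, n, M and x in the anthers data frame are assumed here for illustration and may differ from those on the MOLE workspace.

anthers$y <- anthers$s/anthers$n       # observed proportions
anthers$z <- log(anthers$x)            # log centrifuging force
m1 <- glm(y ~ 1,     binomial, weights = n, data = anthers)   # Model 1
m2 <- glm(y ~ M,     binomial, weights = n, data = anthers)   # Model 2
m3 <- glm(y ~ M + z, binomial, weights = n, data = anthers)   # Model 3
m4 <- glm(y ~ M * z, binomial, weights = n, data = anthers)   # Model 4
anova(m1, m2, m3, m4, test = "Chi")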

Task 13 How do the conclusions in example 4.4.5 change if the analysis is performed using xi instead of zi = log xi as an explanatory variable?

4.5 Over-dispersion

If the trials aggregated to form yi are, in fact, not homogeneous – i.e. if the success probabilities differ even though the xi are the same – the distribution of Si is no longer Binomial, and instead of

    var(Yi) = µi(1 − µi)/ni

a possible assumption, which can be fitted using quasi-likelihood (see §3.10), is

    var(Yi) = φ µi(1 − µi)/ni,

with, usually, φ > 1, which is over-dispersion. A similar situation may arise if the observations are not independent. The extra parameter φ is estimated in the manner suggested in §3.9.

Also, several models for over-dispersion, e.g. Beta-Binomial, can be fitted directly. Note that an apparent
need for assuming φ > 1 can be due to the omission of necessary covariates. It has been suggested that
over-dispersion should not be assumed unless φ̂ > 2.

4.5.1 An example of the use of over-dispersion: grain beetle deaths

D.J. Finney (1971, Probit Analysis, Example 8, pp. 72–76) considers the following data from an
experiment on the toxicity of ethylene oxide (in mg/100 ml: x is log10 of this) to grain beetles,
the number killed (si ) out of ni was found 1 hour after exposure. The data is in the ‘grain’
dataframe on the .RData workspace on MOLE.

xi .394 .391 .362 .322 .314 .260 .225 .199 .167 .033
ni 30 30 31 30 26 27 31 30 31 24
si 23 30 29 22 23 7 12 17 10 0

Figure 4.4: Grain beetle data plots. Left plot: fitted values (crosses) and observed values (circles)
for the model with a linear term. Right plot: empirical probit plot for model with a linear term.

The circles in the left plot of Figure 4.4 are the observed proportion of grain beetle deaths by
the log of ethylene oxide concentration. The plot looks reasonably linear but we should not fit a
linear model to the data because a log concentration of 0.5 looks like it would yield an estimated
probability exceeding 1. The link functions for the binomial response ensures that the response
remains in the [0, 1] interval.
In parts 1-4 below we fit a standard GLM with probit link function.

1. As usual, consider the proportion of deaths; yi = si /ni ; yi∗ = (si + 0.5)/(ni + 1).

2. The null deviance (η = β0 ) is 123.4 (on 9 df).

3. The right hand plot in Figure 4.4 shows a probit plot: g(yi*) = Φ^{-1}(yi*) against xi. It suggests a line predictor (η = β0 + β1x).

4. The linear predictor η = β0 + β1 x (using a probit link) has a deviance (D) of 29.44 and
X 2 = 28.95 on 8 df. The change in deviance (94 on 1 df) is clearly very high showing an
improvement of the model with a linear term over the null model (overwhelming evidence
to reject H0 : β1 = 0). But the actual deviance for the model is high (29.44) providing

evidence against this Binomial model (χ28,0.9995 = 27.87). Other links give very similar
results. There is no evidence of systematic curvilinearity, and no other possible explanatory
variables were available. Note that the X 2 value is obtained as the sum of the squares of
the Pearson residuals; see section 3.8.2. The crosses in the left hand plot of Figure 4.4
represent the fitted values for this model.

5. This lack of fit with no apparent departure from linearity and very high Pearson residuals suggests fitting an over-dispersed Binomial model, estimating φ as X²(y, µ̂)/(n − p) = 28.95/8 = 3.62 (this estimate of φ can be obtained directly using the quasibinomial argument).

6. Standard errors for parameter estimates are then multiplied by √φ̂ = √3.62 ≈ 1.90.

7. We can do the usual model-fit comparison of η = β0 against η = β0 + β1x taking into account this estimate φ̂. The change in scaled deviance is 94/3.62 = 26 on 1 df, which is still very high, showing an improvement of the model with a linear term over the null model. The scaled deviance for this model is now 29.44/3.62 = 8.13, a much better fit than before (χ²_{7,0.95} = 14.1). Using the summary command we see that β̂1 = 8.64, with e.s.e.(β̂1) = 0.980 × 1.90 = 1.86, so the Wald test statistic is t = β̂1/e.s.e.(β̂1) = 4.7. The slope is highly significant.

Note that the fitted values, and hence the residuals, do not change in the over-dispersed model, but the model fits the data much better after scaling by the scale parameter φ. Care must be exercised in doing this, as it is possible to over-fit the data. In this case the value of φ̂ is convincingly large and there is good evidence for an over-dispersed model. If φ̂ is less than 2 and your data set is small then you should be wary of fitting an over-dispersed model. Of course the degrees of freedom are reduced to account for estimation of this extra parameter, and this penalizes its inclusion. Note that the command to perform the over-dispersed binomial fit is

quasi.grain.glm<-glm(adj.prop.dead~1+x,family=quasibinomial,weights=n)

Task 14 Reproduce this analysis; in particular, obtain the estimate of φ directly from R using the quasibinomial argument.

4.6 Odds and odds ratios

If ‘success’ has probability µ, then the odds of (or in favour of, or on) ‘success’ are

    λ = µ/(1 − µ),   0 ≤ λ ≤ ∞,   and so   µ = λ/(1 + λ).

The log-odds are then

    log λ = log µ − log(1 − µ) = logit(µ),   −∞ ≤ log λ ≤ ∞.

Interest is often in the risk of some event in individuals exposed to a risk factor compared to the risk in those not exposed to the risk factor, a comparison summarised by the odds ratio. In medical studies the event of interest is often death or having some disease.
If Yi is binary {0, 1} with µi = p(Yi = 1), then the odds of Yi = 1 are given by

    p(Yi = 1)/[1 − p(Yi = 1)] = µi/(1 − µi).

But the logit link gives

    log[µi/(1 − µi)] = α + βxi.

So it follows that

    p(Yi = 1)/[1 − p(Yi = 1)] = e^{α+βxi},

i.e. the odds of Yi = 1 are given by e^{α+βxi}. This relationship allows us to calculate the quantity of interest, the odds ratio, from parameter estimates from logistic regression. As the name suggests, the odds ratio is just the ratio of two odds at different values of xi. We now consider
several different cases according to the form that Xi takes, factor or continuous variable. We
illustrate the different cases with some data relating saturated fat intake, gender and age to
coronary heart disease (CHD). The data set consists of 120 individuals collected as part of a
case-control study. The data is in the ‘CHD’ dataframe on the .RData workspace on MOLE.
The variables available are

• Status - whether the individual has CHD or not;

• Sat.fat - whether the individual has a high fat diet (coded as 2), a medium fat diet (coded
as 1) or a low fat diet (coded as 0);

• Gender - female (coded as 0) or male (coded as 1);

• Age - the age of the individual in years.

The response variable (Status) is binary (affected by CHD or not).

4.6.1 Xi is a factor with 2 levels

Assume initially that Xi has only two levels {0, 1}.
When Xi = 1 the odds of Yi = 1 are p(Yi = 1)/[1 − p(Yi = 1)] = e^{α+β}, and when Xi = 0 the odds of Yi = 1 are p(Yi = 1)/[1 − p(Yi = 1)] = e^{α}. So the ratio of the odds of Yi = 1 when Xi = 1 compared to the odds of Yi = 1 when Xi = 0 is e^{α+β}/e^{α} = e^{β}.
This is the odds ratio of Yi = 1 when Xi = 1 compared to when Xi = 0. So, if Xi is a factor with 2 levels, the odds ratio of Yi = 1 is e^{β}, where β is the coefficient of Xi in the logistic regression analysis.
Fitting a logistic model (using the logit link) we get:

> gender.glm<-glm(status~gender,family=binomial,data=CHD.data)
> summary(gender.glm)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.03509 0.26495 -0.132 0.8946
gender 0.80056 0.37875 2.114 0.0345 *

The linear predictor is β0 + β1xi where xi is 1 for males and 0 for females. The odds of CHD (Yi = 1) for females (xi = 0) are e^{β0}. The odds of CHD (Yi = 1) for males (xi = 1) are e^{β0+β1}. So the estimated odds ratio for being affected by CHD in males compared to females is e^{β̂1} = e^{0.801} = 2.23 to 2 d.p. This is generally interpreted as meaning that, based on this study, the risk of CHD for males is about 2.2 times that for females.

4.6.2 Xi is a factor with 3 levels

If we code the three levels as (0, 1, 2), then when Xi = 2

    p(Yi = 1)/[1 − p(Yi = 1)] = e^{α+2β},

when Xi = 1

    p(Yi = 1)/[1 − p(Yi = 1)] = e^{α+β},

and when Xi = 0

    p(Yi = 1)/[1 − p(Yi = 1)] = e^{α}.

So the odds ratio of Yi = 1 comparing Xi = 1 to Xi = 0 is e^{β}, and the odds ratio of Yi = 1 comparing Xi = 2 to Xi = 0 is e^{2β} = (e^{β})². So with Xi coded in this way, the odds ratios behave in a multiplicative manner.

> satfat.glm<-glm(status~sat.fat,family=binomial,data=CHD.data)
> summary(satfat.glm)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.3623 0.2970 -1.220 0.22251
sat.fat 0.8181 0.2644 3.095 0.00197 **

The linear predictor is β0 + β1xi where Xi is 2 for a high fat diet, 1 for a medium fat diet and 0 for a low fat diet. The estimated odds ratio of Yi = 1 comparing

• Xi = 1 to Xi = 0 is e^{β̂1} = e^{0.818} = 2.27;

• Xi = 2 to Xi = 1 is e^{β̂1} = e^{0.818} = 2.27;

• Xi = 2 to Xi = 0 is e^{2β̂1} = (e^{0.818})² = 5.13 to 2 d.p.

This coding of the variable assumes that the risk of CHD is multiplicative across group levels.
With this kind of variable coding, the lowest level (0 by convention) is taken to be the baseline
level and odds ratios are usually calculated comparing the odds in other levels to the baseline. It
is possible to allow a non-multiplicative model of disease risk. This requires setting up dummy
variables so that each level of the factor has a different parameter.

4.6.3 Xi is continuous

When Xi = t + 1,

    p(Yi = 1)/[1 − p(Yi = 1)] = e^{α+β(t+1)},

and when Xi = t,

    p(Yi = 1)/[1 − p(Yi = 1)] = e^{α+βt}.

So e^{β} is the odds ratio of Yi = 1 comparing Xi = t + 1 to Xi = t, i.e. e^{β} is the odds ratio of Yi = 1 for a unit increase in Xi.

> age.glm<-glm(status~age,family=binomial,data=CHD.data)
> summary(age.glm)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.20096 1.11289 -1.978 0.0480 *
age 0.05459 0.02340 2.333 0.0196 *

So the estimated odds ratio of Yi = 1 comparing Xi = t + 1 to Xi = t is e^{0.055} = 1.06: for every year older, the risk of CHD increases by 6%. For a 10 year difference in age, the estimated odds ratio is e^{10×0.055} = (e^{0.055})^{10} = (1.06)^{10} = 1.80; an 80% increase in CHD risk for a 10 year difference in age.

4.6.4 Multiple explanatory variables

Suppose we have k variables X1, X2, ..., Xk. How do we interpret the parameters of the logistic regression model? Suppose that X1 = x1 + 1, X2 = x2, ..., Xk = xk. Then the odds are

    p(Yi = 1)/[1 − p(Yi = 1)] = e^{α+β1(x1+1)+β2x2+...+βkxk}.

Suppose instead that X1 = x1, X2 = x2, ..., Xk = xk. Then the odds are

    p(Yi = 1)/[1 − p(Yi = 1)] = e^{α+β1x1+β2x2+...+βkxk}.

So e^{β1} is the odds ratio of Yi = 1 for a unit increase in X1, holding all other values in the model fixed. This is often called ‘adjusting’ for another variable and is common in epidemiological studies. The variable you wish to adjust for is simply included as an explanatory variable in the linear predictor.
Suppose we want to calculate the odds ratio of Yi = 1 comparing males to females but adjusting
for age. We would use the estimates from:

> both.glm<-glm(status~gender+age,family=binomial,data=CHD.data)
> summary(both.glm)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.59535 1.13704 -2.283 0.0225 *
gender 0.80587 0.38835 2.075 0.0380 *
age 0.05436 0.02347 2.317 0.0205 *

So the estimated odds ratio of CHD comparing males to females, adjusting for age, is e^{0.806} = 2.24; hardly any change from the unadjusted odds ratio of 2.23 in the previous example, but in some cases adjusting can lead to very different estimates.

4.6.5 Confidence intervals for odds ratios

Using the output above we can calculate a 95% confidence interval for the log odds ratio of
CHD comparing males to females but adjusting for age as 0.806 ± 1.96 × 0.388 (estimate ± 1.96
standard errors) which gives (0.05,1.57). This gives a 95% CI for the odds ratio as (1.05,4.79).
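The same interval can be obtained directly from the fitted object; a short sketch:

est <- coef(both.glm)["gender"]
se  <- sqrt(vcov(both.glm)["gender", "gender"])
exp(est + c(-1, 1)*1.96*se)                       # roughly (1.05, 4.79)
# exp(confint.default(both.glm)["gender", ]) gives the same Wald-type interval.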

Chapter 5

Poisson Regression and Models for


Non-Negative Data

5.1 Introduction

The Poisson distribution is often appropriate for count data when a Binomial is not (because
there is no upper bound, Bernoulli trials are inappropriate, etc). Its relationship to the Poisson
process also makes it a natural first choice when modelling counts of events, including rare events.
It has a further indirect use in connection with contingency tables, to be explored in Chapters 6
and 7. Here we review direct Poisson regression and similar regression models for non-negative
data, not necessarily discrete, in which the variance depends on the mean. Non-negative and
positively skew data, for which these models could be suitable, arise in many applications, such
as lifetimes, survival times, recovery times, counts and lengths. Some other models for such data
are briefly described.

5.2 Modelling counts using the Poisson distribution

The Poisson GLM for non-negative integer data is

    Yi ~ Po(µi)   with   µi = h(xi^T β).

The canonical link function is the log link (which ensures µi > 0):

    log µi = xi^T β,   i.e.   µi = exp(xi^T β) = Πj exp(xij βj),

giving a multiplicative structure for µi (for example, a term for age times one for height).
For a single explanatory variable x, a rough check on this link and linearity can be obtained by
plotting log(yi + 1/2) (the 1/2 is to avoid problems if yi = 0) against xi . Other links can be
used as long as non-negativity of the means is ensured.
The deviance is, as given in §3.6.4,

    D = 2 Σi [ yi log(yi/µ̂i) − (yi − µ̂i) ],

with the second sum usually 0, and the Pearson residual is

    e_{P,i} = (yi − µ̂i)/√µ̂i,


Figure 5.1: AIDS data plots - left: AIDS deaths per quarter; right: log(AIDS deaths + 0.5) per quarter

with

    X² = Σi (yi − µ̂i)²/µ̂i.
Thus D and X 2 are, as in §4.4, the sums of 2 × O × log(O/E) and (O − E)2 /E, respectively.
This is a GLM with V (µ) = µ, and all the results in Chapter 3 apply. Since φ = 1 is known, the
(residual) deviance is, for a correct distribution and model, also asymptotically χ2n−p , although
in practice the χ2n−p distribution may not be a good approximation. (It should be good when
all the µi are fairly large.)

5.2.1 Example: AIDS deaths over time

Dobson (1990, Example 3.3 and Exercise 4.1) gives the data below for the numbers of deaths
from AIDS in Australia per quarter from the first quarter of 1983 to the second quarter of 1986
(numbered 1 to 14). The data is in the ‘AIDS’ dataframe on the .RData workspace on MOLE.

yi 0 1 2 3 1 4 9 18 23 31 20 25 37 45
xi 1 2 3 4 5 6 7 8 9 10 11 12 13 14

The data are shown in the left plot of Figure 5.1.

1. The plot of log(yi + 1/2) against xi shown in the right of Figure 5.1 suggests a straight
line (with a hint of downwards curvature).
2. Fitting a Poisson model with log link and a linear predictor that is a straight line in x gives
a rather poor fit (D = 29.65 on 12 df, compared with χ212,0.95 = 21.03).
3. Adding a quadratic term (x²) is an improvement (difference 13.01 on 1 df), and the fit is
adequate (D = 16.37 on 11 df).
4. A linear predictor with a log(x) term instead has D = 17.09 on 12 df, which appears adequate.
Adding a quadratic term (in log x) does not improve the fit (D = 16.64 on 11 df).
5. Thus possible simple models are a line in log x or a quadratic in x, but there are reservations
about both.

Task 15 Verify the analysis in Example 5.2.1.
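
A minimal R sketch for Task 15, assuming the AIDS dataframe has columns deaths and quarter corresponding to the table above:

fit.lin  <- glm(deaths ~ quarter,                family = poisson(log), data = AIDS)
fit.quad <- glm(deaths ~ quarter + I(quarter^2), family = poisson(log), data = AIDS)
fit.log  <- glm(deaths ~ log(quarter),           family = poisson(log), data = AIDS)
sapply(list(fit.lin, fit.quad, fit.log), deviance)      # compare with 29.65, 16.37, 17.09
sapply(list(fit.lin, fit.quad, fit.log), df.residual)   # 12, 11, 12
qchisq(0.95, 12)                                        # reference value 21.03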

5.3 Adjusting for exposure: offsets

Suppose yi is the number of events in time interval ti . Then, modelling events by Poisson
process, with possibly different rates in different intervals ti , it is natural to suppose that yi is
the observed value of a random variable Yi for which

Yi ∼ P o(µi ) = P o(λi ti ),

where λi is the rate per unit time in interval ti . Suppose we are interested in how the rates λi
depend on covariates x:

log µi = log(λi ti ) = log λi + log ti = xTi β + log ti .

Here log ti looks like a covariate, but with a known coefficient (equal to 1). Note that β models
how rates vary, which is usually what really matters, rather than how means vary.
Of course one could include log ti as a regular explanatory variable and see whether its regression coefficient is
near 1 before using it as an offset. This is a check on the model, not a way of indicating that an offset should
be used; the desire to use a variable as an offset arises from the modelling, not from looking at the data.

In general when data on counts arise from different degrees of exposure it is natural, as above,
to want to model and to draw conclusions about rates per unit exposure. For example, suppose
the data are on death counts taken over different large populations, perhaps different health
authorities, and some other explanatory variables. Then, the logarithm of the population size
would be used as an offset, since it is death rates per person (or, more usually, per 1000 people)
that are of interest.
An offset for a Poisson regression model is specified in R by means of the offset function. The
following example illustrates.

5.3.1 Example: Smoking and heart disease

Dobson (2002, §9.2.1, pp. 154–7) gives the following (historically important) data on deaths from
coronary heart disease after 10 years among British male doctors (from Doll et al.). Age groups
are coded as 1 for ages 35–44; 2 for 45–54; 3 for 55–64; 4 for 65–74; and 5 for 75–84. Exposure is
measured in person-years; a person who lived throughout the period would contribute 10 person-
years, but someone who died, either of heart disease or of something else, would contribute
fewer. The data is in the ‘smoking’ dataframe on the .RData workspace on MOLE.

Age Smokers Non-smokers


Group Deaths Person-years Deaths Person-years
1 32 52407 2 18790
2 104 43248 12 10673
3 206 28612 28 5710
4 186 12663 28 2585
5 102 5317 31 1462

1. The age group codes (1, 2, . . . ) are easily translated to a numerical variable ‘mid-point of
age group’ giving 40, 50, 60, 70 and 80 years.
2. Over the age groups, the rounded death rates (deaths per 100,000 person years) are 61,
240, 720, 1469 and 1918 for smokers, and 11, 112, 490, 1083 and 2120 for non-smokers.
Death rates for smokers are higher except for age group 5. Their ratios are 5.5, 2.1, 1.5,
1.4 and 0.90. The death rates for the two groups are shown in the top left plot of Figure
5.2.

Figure 5.2: Plots for the smoking data. In the top-row plots, squares are smokers and triangles
are non-smokers; in the bottom-left plot, triangles are observed values and dots are fitted values

3. The plot of log death rate against age suggests that there are differences between the two
groups, and that, with a log link, straight lines will not be adequate. The curves appear
to be approaching a maximum or asymptote as age increases suggesting quadratic terms
may be needed. The log death rates for the two groups are shown in the top right plot of
Figure 5.2.
4. The numbers of deaths are counts, so it is natural to think of a Poisson model. However,
the number of deaths will be influenced by the number of people at risk. In other words,
person-years must be taken into account; it is really death rates per person-year of exposure
that we wish to model. Accordingly a Poisson model with log link and linear predictor

η = log(person-years) + β0 + β1 x1 + β2 x2 + β3 x2^2 + β4 x1 x2

was fitted, where x1 is 1 for smokers and 0 for non-smokers, and x2 is age (the mid-point of
the age group from point 1, used as a numerical covariate). This fits separate quadratics in age
for the two groups, but with the same coefficient for the x2^2 term.
This model has fixed coefficient 1 for log(person-years), and so an offset is needed. See
below for the command to include it.
5. The fit is extremely good, with D = 1.635 on 5 df, and with all terms appearing necessary.
The fitted values for this model are shown in the bottom-left plot of Figure 5.2. The fitted
coefficients are −19.70, 2.36, 0.36, −0.0020 and −0.031 respectively. Thus for non-smokers η̂ is
−19.7 + 0.36x2 − 0.002x2^2, and for smokers it is −17.34 + 0.33x2 − 0.002x2^2. The residuals for
this model are shown in the bottom-right plot of Figure 5.2.

Task 16 Verify the analysis in Example 5.3.1, using the command

glm(deaths~offset(log(person.years))+smoke*age+I(age^2),poisson,data=smoking)
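
A minimal sketch of how the resulting fit might be inspected (column names as in the command above):

fit <- glm(deaths ~ offset(log(person.years)) + smoke*age + I(age^2),
           family = poisson, data = smoking)
summary(fit)                        # coefficients quoted in point 5 above
c(deviance(fit), df.residual(fit))  # residual deviance 1.635 on 5 df
# fitted death rates per 100,000 person-years, for comparison with point 2
round(100000 * fitted(fit) / smoking$person.years)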

5.4 Non-negative data with variance ∝ mean

Using quasi-likelihood, the model in which, for Yi > 0, var(Yi ) = φ × µi , can be fitted for any
non-negative data by fitting the Poisson distribution and then (if unknown) estimating φ. The
Y variable does not need to be a count for this to be acceptable.

Task 17 Compare the output from fitting a Poisson model with log link and a linear predictor that
is a straight line in x to the data in Example 5.2.1 with that obtained using the log link and
assuming that the variance is proportional to the mean.
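
A minimal sketch for Task 17, using R's quasipoisson family and the assumed AIDS column names from Task 15:

fit.pois  <- glm(deaths ~ quarter, family = poisson,      data = AIDS)
fit.quasi <- glm(deaths ~ quarter, family = quasipoisson, data = AIDS)
summary(fit.pois)$coefficients    # same estimates in both fits
summary(fit.quasi)$coefficients   # standard errors inflated by the square root of the dispersion
summary(fit.quasi)$dispersion     # estimate of phi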

Chapter 6

Two-way Contingency Tables

6.1 Types of 2-way tables - response/controlled variables

Data often arise in the form of counts, cross-classified by various factors (often referred to as
categorical variables). The table of cross-classified counts is called a contingency table. An
important distinction in the analysis of contingency table data is between response variables
and controlled variables. This distinction depends on what totals are known in advance of
collecting any data, i.e. the method of sampling as illustrated by the following 2 examples,
referred to as Case(a) and Case(b) for the rest of this chapter.

6.1.1 Case(a): Skin cancer (melanoma) data - 2 response variables

Dobson (1990, Example 9.1; 2002, Example 9.3.1) gives the following table from a cross-sectional
study of malignant melanoma. The sample size was 400, and site and type (labelled T1 to T4
here, T1 : Hutchinson’s melanotic freckle; T2 : Superficial spreading melanoma; T3 : Nodular; T4 :
Indeterminate) were recorded. Both tumour type and site are response variables because none
of the row or column totals were fixed in advance of the data collection. The data is in the
‘Mela’ dataframe on the .RData workspace on MOLE.

Site
Tumour Type Head and Neck Trunk Extremities Total
T1 22 2 10 34
T2 16 54 115 185
T3 19 33 73 125
T4 11 17 28 56
Total: 68 106 226 400

6.1.2 Case(b): Flu vaccine data - 1 response and 1 controlled variable

Dobson (1990, Example 9.2; 2002, Example 9.3.2) gives the following table from a prospective
controlled trial on a new influenza vaccine. Patients were randomly assigned to the two groups
(Placebo, Vaccine), and the response (levels of an antibody found in the blood six weeks after
vaccination) was determined. Antibody level is the response and vaccine group is a controlled
variable (with totals fixed by experimental design). Note that a large response is good. The
data is in the ‘vaccine’ dataframe on the .RData workspace on MOLE.

Response
Small Moderate Large Total
Placebo 25 8 5 38
Vaccine 6 18 11 35
Total: 31 26 16 73

6.2 Notation for two-way tables

Denote the row factor by A, with levels i, where 1 ≤ i ≤ I, and the column factor by B, with
levels j, where 1 ≤ j ≤ J; let the observations be yij and suppose they are the observed values
of random variables Yij . We write Σij yij for Σi Σj yij , and a ‘.’ in the subscript of a variable
means summation over that subscript, so that
• yi. means Σj yij
• y.. means Σi Σj yij

Note that where it might cause confusion with this notation, I have omitted full stops at the
ends of sentences. Conventions for the observed values and probabilities are given in the tables
below. In case(b) we always assume that the rows represent the levels of the controlled variable.

Observed values:

            col 1   col 2   ...   col J   Total
row 1       y11     y12     ...   y1J     y1.
row 2       y21     y22     ...   y2J     y2.
...
row I       yI1     yI2     ...   yIJ     yI.
Total       y.1     y.2     ...   y.J     y..

Probabilities:

            col 1   col 2   ...   col J   Total
row 1       π11     π12     ...   π1J     π1.
row 2       π21     π22     ...   π2J     π2.
...
row I       πI1     πI2     ...   πIJ     πI.
Total                                     π..

6.2.1 Association, Independence and Homogeneity

Usually the interest is in whether the response variables are associated (in which case the re-
sponse variables are treated symmetrically, as in correlation).
In case (a), the skin cancer data with 2 response variables, the probabilities of interest, πij , are
the joint probabilities πij = P (A = i, B = j), where π.. = 1. Only the total sample size n = y..
is fixed. Independence implies P (A = i, B = j) = P (A = i) × P (B = j) for all i and j, where
πi = P (A = i) and πj = P (B = j) are the marginal probabilities of row i and column j.
In case (b), the flu vaccine data with 1 response and 1 controlled variable, the probabilities of
interest, πij , are conditional probabilities: πij = P (B = j|A = i). It follows that πi. = 1 for
1 ≤ i ≤ I. The row totals yi. are fixed, not random. The interest is in whether the probability
distribution of the response (antibody level) is the same at each level of the controlled variable
(the treatment group). If πij does not depend on i then we can write πij = πj . This is known as
homogeneity.
The analysis here is similar to that which you may have encountered previously in testing for
independence or homogeneity, except that the deviance is used here rather than Pearson’s X 2 ,
and the present approach allows a richer set of models.

6.3 Distributions for two-way tables

Natural models for the Yij in cases (a) and (b) are described below. A third model, labelled (c),
which at first sight appears unnatural since the total sample size is random, is also given for use
in §6.4 and later. We next need the Multinomial distribution M n(n, (πi )), where Σi πi = 1.
Its probability function is

$$P(y_1, y_2, \ldots, y_r) = \begin{cases} \dfrac{n!}{y_1!\,y_2!\cdots y_r!}\,\prod_i \pi_i^{y_i} & \text{if } \sum_i y_i = n,\\[4pt] 0 & \text{otherwise.} \end{cases}$$

6.3.1 Case (a): two response variables.

The only fixed quantity is n = y.. , and the IJ observations have a multinomial distribution

{Yij } ∼ M(n, {πij }).

The likelihood is

$$n! \prod_{i,j} \frac{\pi_{ij}^{y_{ij}}}{y_{ij}!}.$$

6.3.2 Case (b): one response variable.

Now the distribution of {Yij } is the product multinomial: a product of multinomials, one for
each row,
{Yij } ∼ M(ni , {πij }), independent, for i = 1, . . . , I
so that the likelihood is

$$\prod_{i}\left( n_i! \prod_{j} \frac{\pi_{ij}^{y_{ij}}}{y_{ij}!}\right),$$

where ni = yi. , the row totals, are fixed.

6.3.3 Case (c): independent Poissons (no fixed margins).

If the Yij are independent Poisson variables, with mean µij , then the likelihood is

$$\prod_{i,j} e^{-\mu_{ij}}\,\frac{\mu_{ij}^{y_{ij}}}{y_{ij}!}.$$

This seems unnatural, since in cases (a) and (b) the Yij are not independent (the total number of
observations is fixed). However this case is needed since it forms the basis of log-linear modelling.

6.3.4 Expected values

It is more convenient in modelling to work with expected values (of the Yij ) rather than prob-
abilities: E(Yij ) = µij . So in case (a) µij = nπij , and in case (b) µij = ni πij , where
Σij πij = 1 for case (a) and Σj πij = 1 for case (b).

6.4 GLMs and two-way contingency tables

As discussed in section 6.3, the distributions that arise in connection with contingency tables
are multinomial or product-multinomial, which cannot be expressed directly as a GLM (because
the observations are dependent). However, it is possible to analyse such data with GLMs: use
a Poisson distribution and a log link, and include all the controlled variables and their
interactions in the linear predictor. The justification for this, which applies to higher-way tables
too, is developed in the document ‘Treat Product-Multinomial as Poisson’ on MOLE. A Poisson
model of this kind with a linear predictor and a log link is often called a log-linear model.

6.4.1 Natural hypotheses are log-linear models

For case (a), the hypothesis of independence is that the joint probability can be written as
πij = πi πj , so that, in terms of expected values, µij = nπi πj . Taking logs gives
ηij = log µij = µ + αi + βj   (i.e. A + B),
where µ = log n, αi = log πi and βj = log πj .
Similarly for case (b): the hypothesis of homogeneity is that the conditional probability πij =
P (B = j|A = i) can be written πij = πj (that is, P (B = j|A = i) does not depend on the row i), so
µij = ni πj , and again
ηij = log µij = µ + αi + βj   (i.e. A + B),
where now µ + αi = log ni and βj = log πj .
In both cases, the hypothesis of interest implies additivity (A + B) for the log mean. Moreover,
in both cases the saturated model (unrestricted πij ) can be written

log µij = µ + αi + βj + (αβ)ij (i.e. A ∗ B) .

In this case the (αβ)ij terms are referred to as interaction terms, and are zero under the hy-
potheses of independence or homogeneity.
So in both cases we wish to see whether the model A + B is adequate. If it is, then the data
will be deemed consistent with independence in case (a) or homogeneity in case (b).
Key point reiterated: Multinomial and product-multinomial data may be analysed as though
they were independent Poisson data, with a log-link, provided terms corresponding to the fixed
margins (including all interactions) are included in the model. Both the estimators and the
deviance will be correct.

6.4.2 Poisson log-linear modelling for two-way tables

In all cases the Yij are modelled as independent Poisson with the log link.
Case(a)
Here there are no controlled variables. The linear predictor including just the intercept is
the minimal model. The log-linear models that are allowed, along with the corresponding
estimates of π̂ij (derived in the next section) are:

• A ∗ B : the non-independence saturated model, π̂ij = yij /n (D = X 2 = 0, df = 0)

• A + B : independence (probabilities the same in each row), π̂ij = yi. y.j /n²

• A : independence and equal probabilities for each column π̂ij = yi. /(nJ)

• B : independence and equal probabilities for each row π̂ij = y.j /(nI)

• 1 : independence and equal probabilities for all rows and columns π̂ij = 1/(IJ)

Case(b)
Here row (labelled as factor A and indexed by i) is the controlled variable. So the linear predictor
including just A (and the intercept) is the minimal model. Now the only (log-linear) models
that are allowed are:

• A ∗ B : the non-homogeneity, saturated model, π̂ij = yij /ni (D = X 2 = 0, df = 0)

• A + B : homogeneity (probabilities the same in each row, π̂ij = y.j /n)

• A : homogeneity and equal probabilities for each column (π̂ij = 1/J)

6.4.3 Maximum likelihood estimation for πij in case(a)

The mles for πij in section 6.4.2 were not justified. Here we derive one of them. Consider the
linear predictor containing just A (and the intercept) for case(a). Since µij = nπij we have that
log(µij ) = log(n) + log(πij ). If A (indexed by i) is the only term in the linear predictor then πij
can only depend on i, i.e. πij = πi

Since we model Y11 , ..., YIJ as independent Poisson random variables with mean µij , it follows
that Yij ∼ Po(µij ) = Po(nπi ). The likelihood L is therefore

$$L = \prod_{i=1}^{I}\prod_{j=1}^{J} \frac{\exp(-n\pi_i)\,(n\pi_i)^{y_{ij}}}{y_{ij}!},$$
$$\ell = \log L = \sum_{i=1}^{I}\sum_{j=1}^{J} \left\{ -n\pi_i + y_{ij}\log(n\pi_i) - \log(y_{ij}!) \right\},$$
$$\frac{\partial \ell}{\partial \pi_i} = \sum_{j=1}^{J}\left( -n + \frac{y_{ij}}{\pi_i} \right),$$
$$\frac{\partial \ell}{\partial \pi_i} = 0 \;\Rightarrow\; \hat{\pi}_i = \frac{y_{i.}}{nJ}.$$

Task 18 Verify the maximum likelihood estimate for πij for the A + B model for case(a).

Task 19 Verify the maximum likelihood estimate for πij for the A + B model for case(b).

6.5 Interaction plots and examination of residuals

Just as with ordinary linear models, it is possible to investigate interactions by means of suitable
plots. If there is no interaction, the independence and homogeneity models have log(µij ) =
µ + αi + βj , so that if log µij is plotted against j for different i, the graphs will be parallel;
exactly the same will be true for plots of log µij against i for different j. If there is interaction,
they will not be parallel. However, if I > 2 or J > 2, additivity may hold over some of the rows
and/or columns, even if not over all.
The best estimate of log µij (under the saturated model) is log(yij ), so that it is graphs of
log(yij ), or log(yij + 3/8), that are used, and these may show the cause of the interaction — for
example that one level of A behaves quite differently from the others. In R, interaction.plot
produces these graphs.
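
For instance, a minimal sketch for the melanoma data of §6.1.1 (assuming the Mela dataframe has columns number, type and site, coded as in the layout of §6.6.2):

with(Mela, interaction.plot(x.factor = site, trace.factor = type,
                            response = log(number + 3/8)))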

Another approach is through the examination of residuals. Large (absolute) values of the deviance
or Pearson residuals eD,ij or eP,ij can show why a hypothesis is not acceptable, since their
squares make large contributions to D or X², and perhaps suggest sub-tables over which the
hypothesis may hold. Note that the deviance is exactly additive when a hypothesis is partitioned
into orthogonal components (mentioned in §3.8.2); this is illustrated in Example 6.6.3.

6.6 Analysis of the skin cancer data (case(a)) using log-linear models

From now on we will refer to modelling multinomial or product-multinomial responses using a
Poisson response with a log link as log-linear modelling (making sure that all controlled variables
are included in the linear predictor). In this section we look at fitted values for various log-linear
models applied to the skin cancer data. We show how to enter the data into R and how to use
the methods developed in earlier chapters to choose a suitable model linking the counts
to the potential explanatory variables.

6.6.1 Fitted values for the skin cancer data (case(a))

With two response variables, the general form of µij is µij = nπij . The form of πij depends on
which terms are entered into the linear predictor.

Key Point
Essentially when a variable is included in the linear predictor, it fixes the marginal totals of
that variable. So in the skin cancer data, if tumour type (row variable) is included in the linear
predictor then the marginal (row) totals of the fitted values must be the same as the marginal
(row) totals of the observed values.
The fitted values for this two response case are given below using the results of section 6.4.2.

• A ∗ B: no restrictions on πij ; µ̂ij = yij

Site
Tumour Type Head and Neck Trunk Extremities Total
T1 22 2 10 34
T2 16 54 115 185
T3 19 33 73 125
T4 11 17 28 56
Total: 68 106 226 400

• A + B: independence; µ̂ij = nπ̂ij = yi. y.j /n

Site
Tumour Type Head and Neck Trunk Extremities Total
T1 5.8 9.0 19.2 34
T2 31.5 49.0 104.5 185
T3 21.3 33.1 70.6 125
T4 9.5 14.8 31.7 56
Total: 68 106 226 400

• A : independence and the same probability for each column category: µ̂ij = nπ̂ij = yi. /J

Site

Tumour Type Head and Neck Trunk Extremities Total
T1 11.33 11.33 11.33 34
T2 61.67 61.67 61.67 185
T3 41.67 41.67 41.67 125
T4 18.67 18.67 18.67 56
Total: 133.33 133.33 133.33 400

Task 20 Calculate the table of fitted values for the linear predictor containing B for case(a)

6.6.2 Log-linear modelling in R

To read the data into R, the skin cancer data should be presented in the following way

count type site


22 1 1
2 1 2
. . .
. . .
28 4 3

Remembering that the explanatory variables are factors, we can perform the analysis in R us-
ing commands of the form glm(count~factor(type)*factor(site),poisson(log)). This
command fits the saturated (maximal) model.
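
A minimal sketch of how the independence model might then be fitted and checked (the count column of the workspace dataframe is assumed to be called number, as used in §6.6.3; adjust the name if your layout uses count):

fit.ind <- glm(number ~ factor(type) + factor(site), family = poisson(log), data = Mela)
fit.sat <- glm(number ~ factor(type) * factor(site), family = poisson(log), data = Mela)
c(deviance(fit.ind), df.residual(fit.ind))   # D = 51.795 on 6 df, as in §6.6.3
fitted(fit.ind)                              # the A + B fitted values tabulated above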

6.6.3 Skin cancer data (case(a)) revisited

Consider the skin cancer data.

1. The test of independence based on the log-linear model A + B has D = 51.795 on 6 df.
2. The usual Pearson χ² test for independence gives X² = 65.813 on 6 df (use the chisq.test
function in R applied to the table of frequencies, matrix(Mela$number,nrow=4)).
3. Each test provides overwhelming evidence of dependence. There are now several ways to
examine the data to see whether the dependence can be described fairly simply.
4. Examine the row or column proportions. Note the different pattern for T1 and for ‘Head
and Neck’.
5. Examine the residuals from the independence model; plotting them against one factor and
labelling by the other again shows that T1 and ‘Head and Neck’ look unusual. It is seen
that the largest (in modulus) residual is eP,11 = 6.75 (Pearson) or eD,11 = 5.14 (deviance);
squaring these shows that this cell makes a very large contribution to X 2 and D.
6. The interaction plot of log(yij ) for tumour type and tumour site again shows that the
(1, 1) cell looks like the source of the dependence.
7. In summary so far, all of these show that the first row and the first column differ from
the others, and the (1, 1) cell has a much larger value than expected under independence
(y11 = 22, µ̂11 = 5.78). A few other residuals in the first row or first column are larger
than 2 in modulus.
8. Removing Type T1 , the test of independence on the remaining 3 × 3 table has D = 6.509
(X 2 = 6.562) on 4 df, showing that independence is acceptable. Similarly, removing Head
& Neck, the test of independence on the remaining 4 × 2 table has D = 2.165 (X 2 = 2.025)
on 3 df, showing that independence is acceptable.

9. The difference of the first row or first column from the rest can be confirmed by comparing
these with the pooled values from the rest. For rows, pooling gives the 2 × 3 table

H&N Tr Ex
T1 22 2 10
T2 ,T3 ,T4 46 104 216

The test of independence on this table has D = 45.286 (X 2 = 60.532) on 2 df, showing
strong evidence against independence. Note that this deviance (45.286) added to the
deviance from the test for independence on types T2 , T3 , T4 alone (6.509) exactly gives the
deviance for all 4 rows (51.795) (the corresponding contrasts are orthogonal).
10. Similarly, comparing column 1 with the pooled frequencies from columns 2 and 3 gives
a 4 × 2 table with D = 49.630 (X 2 = 64.548) on 3 df, showing strong evidence against
independence. Again, this D (49.630) added to the D from the test on columns 2 and 3
(2.165) exactly gives D for all 3 columns (51.795).
11. To see if it is just the (1, 1) cell which is causing the association, a log-linear model can be
fitted which is additive in the factors, but includes a term for the (1, 1) cell — an indicator
variable for that cell (that is, treats it as an outlier). This has D = 8.002 (X 2 = 7.882) on
5 df, showing that the evidence against independence is attributable to the high count in
the (1, 1) cell.
12. Overall, there is overwhelming evidence of association between tumour type and site; there
is no reason to doubt that types T2 , T3 and T4 have a similar distribution over the body, but
the distribution of T1 over the body is different, with a higher concentration on Head and
Neck. (Note that, even in sampling situation (a) conclusions may be easiest to understand
when phrased as statements about conditional distributions.)

Task 21 Verify the analyses in Example 6.6.3.
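
A minimal sketch to get started on Task 21 (column names and numeric codes as in the sketch of §6.6.2):

fit.ind <- glm(number ~ factor(type) + factor(site), family = poisson(log), data = Mela)
deviance(fit.ind)                            # 51.795
chisq.test(matrix(Mela$number, nrow = 4))    # X^2 = 65.813, as suggested in point 2
residuals(fit.ind, type = "pearson")         # e_P,11 = 6.75 for the (1,1) cell
# Point 11: the additive model plus an indicator for the (1,1) cell
fit.out <- glm(number ~ factor(type) + factor(site) + I(type == 1 & site == 1),
               family = poisson(log), data = Mela)
c(deviance(fit.out), df.residual(fit.out))   # 8.002 on 5 df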

Task 22 For the 4 × 3 table in Example 6.1.1, and the independence model, show directly (with-
out fitting a log-linear model) that µ̂11 = 5.780, eP,11 = 6.747, and eD,11 = 5.135.

Task 23 Consider the 2 × 3 two-way table {yij , i = 1, 2, j = 1, 2, 3}, with factors A, B. Let the
deviance for the additive model A + B be D for the 2 × 3 table, D1 be for the 2 × 2 sub-table
{yij , i = 1, 2, j = 1, 2}, and D2 for the 2 × 2 table {zij , i = 1, 2, j = 1, 2}, where zi1 = yi1 + yi2 ,
zi2 = yi3 for i = 1, 2. Verify that D = D1 + D2 .

6.7 Flu vaccine data (case(b)) revisited

1. The minimal model A has D = 23.68 on 4 df; χ24,0.95 = 9.49, so this is not a good fit.
2. The homogeneity model (A + B) has D = 18.643 on 2 df, which is also a poor fit.
3. ∆D = 5.04 on 2 df, so adding B is not much of an improvement (χ22,0.95 = 5.99).
4. Since even the homogeneity model fits poorly, the saturated model A ∗ B is needed: this
analysis strongly suggests that the groups differ in their response. The observed row
proportions show that the vaccine group has a much better (larger) response.

6.7.1 Fitted values for the A + B model for the flu data (case(b))

π̂11 = y.1 /n = 31/73 and µ̂11 = n1 π̂11 = 38 × 31/73 = 16.14


The table of estimated values for πij and µij is

Small Moderate Large Total

Placebo π̂11 = 31/73 π̂12 = 26/73 π̂13 = 16/73 1
Vaccine π̂21 = 31/73 π̂22 = 26/73 π̂23 = 16/73 1
Placebo µ̂11 = 16.14 µ̂12 = 13.53 µ̂13 = 8.33 38
Vaccine µ̂21 = 14.86 µ̂22 = 12.47 µ̂23 = 7.67 35
Total y.1 = 31 y.2 = 26 y.3 = 16
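
A minimal sketch of this calculation in R, entering the observed table directly:

y <- matrix(c(25, 8, 5,
              6, 18, 11), nrow = 2, byrow = TRUE,
            dimnames = list(c("Placebo", "Vaccine"), c("Small", "Moderate", "Large")))
pi.hat <- colSums(y) / sum(y)         # (31, 26, 16) / 73
mu.hat <- outer(rowSums(y), pi.hat)   # fitted values under A + B
round(mu.hat, 2)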

Task 24 Verify the analysis in Example 6.7.

Task 25 What is the largest Pearson residual for the A + B model?

Chapter 7

Three-way Tables

7.1 Introduction

For three-way tables there are three different types of sampling situations, described in §7.2,
and for each type there is more than one hypothesis of potential interest. The situation is thus
more complex than the two-way case. In the two-way case the main hypotheses are those of
independence (case (a)) or homogeneity (case (b)), but with three factors there are several more
possibilities. The main purpose of the chapter is to review these possibilities and their analysis
for just one case. As in the previous chapter, only models which can be fitted as Poisson log-
linear models are considered. The log-linear model must include all terms corresponding to any
fixed margins (i.e. controlled variables) and all interactions between them.

7.2 Types of three-way table

Suppose there are three factors A, B, C, with levels i, j, k, where 1 ≤ i ≤ I, 1 ≤ j ≤ J, and
1 ≤ k ≤ K, and data yijk . This time there are three cases.

(a) The data form a single sample of size n = y... , and all three factors are responses.
(b) One factor C, say, has fixed margins nk = y..k for k = 1, . . . , K. There are K samples of
sizes n1 , . . . , nK , and in each sample the responses are factors A and B.
(c) Two factors, B and C say, have fixed margins njk = y.jk for j = 1, . . . , J, k = 1, . . . , K.
There are JK samples of sizes n11 , . . . , nJK , and in each sample the response is factor A.
Notice that, to describe this situation accurately, it would be better to say that B ∗ C has
a fixed margin.

We only consider case (c).

7.3 General approach

As in §6.4.2, we consider log-linear models, i.e. treat the observations as Poisson and we model
ηijk = log µijk using linear models with main effects and interactions in the factors A, B, C. As
before, for a correct model, the (residual) deviance and X 2 have asymptotically a χ2 distribution
on the (residual) df, and for nested models the differences of deviances have asymptotically χ2
distributions on the differences in df.

7.4 One response variable

Suppose that the response variable is A. Every model must include B ∗ C (the minimal model).
In this case there are JK independent multinomial samples: for each j = 1, . . . , J, k = 1, . . . , K,
the (yijk : 1 ≤ i ≤ I) are (assumed) independent M n(njk , (πijk )), where π.jk = 1 for all j and
k. The (i, j, k) cell mean is µijk = njk πijk for all i, j, and k.
The maximal log-linear model is A ∗ B ∗ C with D = 0.

7.4.1 Complete homogeneity of the response (A+B*C)

The log-linear model corresponding to A + B ∗ C is:

ηijk = intercept + αi + βj + γk + (βγ)jk A+B∗C

From µijk = njk πijk it follows that log(µijk ) = log(njk ) + log(πijk ). The log(njk ) term corre-
sponds to the βj + γk + (βγ)jk so πijk only depends on i. Therefore the probability distribution
of A is the same over all the JK levels of (B, C):

πijk = πi for all i, j, k.

We next derive the mle for πijk for the A + B ∗ C model. The derivation of the mle is similar
to that in section 6.4.3.

Since we model Y111 , ..., YIJK as independent Poisson random variables with mean µijk , it follows
that Yijk ∼ Po(njk πijk ) = Po(njk πi ). The likelihood L is therefore

$$L = \prod_{i=1}^{I}\prod_{j=1}^{J}\prod_{k=1}^{K} \frac{\exp(-n_{jk}\pi_i)\,(n_{jk}\pi_i)^{y_{ijk}}}{y_{ijk}!},$$
$$\ell = \log L = \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} \left\{ -n_{jk}\pi_i + y_{ijk}\log(n_{jk}\pi_i) - \log(y_{ijk}!) \right\},$$
$$\frac{\partial \ell}{\partial \pi_i} = \sum_{j=1}^{J}\sum_{k=1}^{K}\left( -n_{jk} + \frac{y_{ijk}}{\pi_i} \right),$$
$$\frac{\partial \ell}{\partial \pi_i} = 0 \;\Rightarrow\; \hat{\pi}_i = \frac{y_{i..}}{n}.$$

The only easily interpretable log-linear models intermediate between this one and the saturated
one are A ∗ B + B ∗ C and A ∗ C + B ∗ C, which we now consider.

7.4.2 Homogeneity of the response over one other factor (A*B+B*C)

The distribution of A varies as the level of B changes but not as the level of C changes (with B
at a fixed level). In terms of probabilities:

πijk = πij for all i, j, k.

The mean is µijk = njk πij , and the corresponding log-linear model is:

ηijk = µ + αi + βj + γk + (αβ)ij + (βγ)jk A∗B+B∗C

The other possibility (A ∗ C + B ∗ C) is homogeneity with respect to B.

7.4.3 Example: ulcers and aspirin

Dobson (1990, Example 9.3; 2002, Example 9.3.3) gives the following data for a case-control
study on patients with peptic ulcers (gastric or duodenal) and controls (matched for known
confounding factors), where aspirin use was noted. The data is in the ‘aspirin’ dataframe on the
.RData workspace on MOLE.

Aspirin use
non-user user Total
Gastric ulcer Cases 39 25 64
Gastric ulcer Controls 62 6 68
Duodenal ulcer Cases 49 8 57
Duodenal ulcer Controls 53 8 61
203 47 250

7.4.4 Estimates of E(Yijk ) for different log-linear models

To better understand what the different models represent, we consider their fitted values. We
consider the models:

• B*C+A

• B*C+A*B

• B*C+A*C

We label Aspirin use as i(A), type of ulcer as j(B) and case/control status as k(C). The cells
corresponding to πijk are shown in the table below.

Aspirin use(i)
non-user(i = 1) user(i = 2) Total
Gastric ulcer(j = 1) Case(k = 1) π111 π211 1
Gastric ulcer(j = 1) Control(k = 2) π112 π212 1
Duodenal ulcer(j = 2) Case(k = 1) π121 π221 1
Duodenal ulcer(j = 2) Control(k = 2) π122 π222 1

We show how to calculate the estimate of the expected value of the response for a gastric ulcer
control aspirin user, (µ̂212 ), for the different models.
Model B*C+A: π̂ijk = yi.. /n
The table of fitted values is

Aspirin use
non-user user Total
Gastric ulcer Case 51.97 12.03 64
Gastric ulcer Control 55.22 12.78 68
Duodenal ulcer Case 46.28 10.72 57
Duodenal ulcer Control 49.53 11.47 61
203 47

µ̂212 = n12 × π̂212 = n12 × y2.. /n = 68 × 47/250 = 12.78


Note that row totals and column totals are preserved.
Model B*C+A*B: π̂ijk = yij . /nj .
The table of fitted values is

Aspirin use
non-user user Total
Gastric ulcer Case 48.97 15.03 64
Gastric ulcer Control 52.03 15.97 68
Duodenal ulcer Case 49.27 7.73 57
Duodenal ulcer Control 52.73 8.27 61

µ̂212 = n12 × π̂212 = n12 × y21. /n1. = 68 × (25 + 6)/[(25 + 6) + (39 + 62)] = 15.97
Note that row totals are preserved as are marginal totals of aspirin use and ulcer type (i.e.
48.97+52.03=39+62).
Meaning: the distribution of aspirin use (A) varies with the level of ulcer type (B) but not
with case/control status (C).
Model B*C+A*C
This is left as an exercise that will be covered in lectures. Fill in the table of fitted values (µ̂212 has
been done for you). If you are unsure you can get them from R (using the fitted() function)
and then try to work out how they are obtained.

Aspirin use
non-user user Total
Gastric ulcer Case 64
Gastric ulcer Control 7.38 68
Duodenal ulcer Case 57
Duodenal ulcer Control 61

Show how to calculate µ̂212 and state which marginal totals are fixed in this model.

7.4.5 Fitted values using the parameter estimates and the link function

Of course fitted values can be obtained as before using the parameter estimates and knowledge
of the link function. We show how to obtain the fitted value of 7.38 for the exercise above. The
data are read in as follows:

count aspirin type status


39 1 1 1
25 2 1 1
. . . .
. . . .
8 2 2 2

The B*C+A*C model is implemented using

glm(count~factor(type)*factor(status)+factor(aspirin)*factor(status),poisson(log))

The linear predictor fitted (ηijk ) is

β0 + β1 I(type) + β2 I(status) + β3 I(aspirin) + β4 I(type : status) + β5 I(status : aspirin)

The I is an indicator variable which for the single variables is zero if the level is 1, and is 1 if
the level is 2. The indicator variable for the interaction terms is 1 if both variables are at level
2 and is zero otherwise. A more conventional representation is

µ + αi + βj + γk + (βγ)jk + (αγ)ik

The parameter estimates are

Coefficients:
(Intercept) 3.840429
factor(type)2 -0.115832
factor(status)2 0.264198
factor(aspirin)2 -0.980829
factor(type)2:factor(status)2 0.007198
factor(status)2:factor(aspirin)2 -1.125046

For fitted value µ̂212 we have that i(aspirin) = k(status) = 2 and j(type) = 1 so
η̂212 = 3.84 + 0.264 − 0.980 − 1.125 = 1.999
µ̂212 = exp(η̂212 ) = exp(1.999) = 7.38
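
A minimal sketch of the same calculation from the fitted model object (the aspirin dataframe and its columns count, aspirin, type and status are as laid out above; the coefficient names are those shown in the output):

fit <- glm(count ~ factor(type)*factor(status) + factor(aspirin)*factor(status),
           family = poisson(log), data = aspirin)
b <- coef(fit)
eta212 <- b["(Intercept)"] + b["factor(status)2"] +
          b["factor(aspirin)2"] + b["factor(status)2:factor(aspirin)2"]
exp(eta212)     # 7.38, as in the hand calculation above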

7.4.6 Model selection

1. Ulcer site (B) and case/control (C) have fixed margins. There were 2 × 2 samples, and one
response: aspirin use (A). So all models must include B ∗C, and interest is in homogeneity
of aspirin use.
2. The row (aspirin user) proportions of 0.39 (B1 C1 ), 0.09 (B1 C2 ), 0.14 (B2 C1 ), 0.13 (B2 C2 )
suggest that aspirin usage is different for the gastric ulcer cases (B1 C1 ) from the other
three categories.
3. The (residual) deviances for the models B ∗ C + A ∗ B, B ∗ C + A ∗ C, B ∗ C + A are,
respectively, 17.697 (on 2 df), 10.538 (on 2 df), 21.789 (on 3 df).
4. B ∗ C + A is nested within both B ∗ C + A ∗ B and B ∗ C + A ∗ C. The changes in scaled
deviance are 4.098 and 11.26 respectively, each on 1 df. Comparing these with χ21,0.95 = 3.84,
both models are an improvement. However, neither of them is satisfactory, as their scaled
deviances are still too high.
5. Consider now the 4 × 2 two-way table formed by treating B and C together as one 4-level
factor. The two-way test for homogeneity is the same as the B ∗ C + A test above. The
Pearson residuals (to 1 d.p) are, by row, −1.8, 3.7; 0.9, −1.9; 0.4, −0.8; 0.5, −1.0; they
show the poor fit for A2 B1 C1 .
6. When the gastric ulcer cases are removed, the test for the 3 × 2 table has deviance 0.985
(on 2 df). The test of gastric ulcer cases against the rest (pooled) has deviance 20.804
= 21.789 − 0.985 (on 1 df). This confirms that gastric ulcer cases have a higher use of
aspirin, and that there is no evidence that usage differs among the other three groups.

Task 26 Verify the analysis in Example 7.4.3.
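
A minimal sketch of the model comparisons in points 3 and 5 (column names as in §7.4.5):

m.a  <- glm(count ~ factor(type)*factor(status) + factor(aspirin),
            family = poisson(log), data = aspirin)                    # B*C + A
m.ab <- glm(count ~ factor(type)*factor(status) + factor(aspirin)*factor(type),
            family = poisson(log), data = aspirin)                    # B*C + A*B
m.ac <- glm(count ~ factor(type)*factor(status) + factor(aspirin)*factor(status),
            family = poisson(log), data = aspirin)                    # B*C + A*C
sapply(list(m.ab, m.ac, m.a), deviance)      # 17.697, 10.538, 21.789
sapply(list(m.ab, m.ac, m.a), df.residual)   # 2, 2, 3
residuals(m.a, type = "pearson")             # the residuals quoted in point 5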

7.5 One factor a binary response and the others explanatory

B 1 C1 B1 C2 ··· Bj Ck ···
A1 y111 y112 ··· y1jk ···
A2 y211 y212 ··· y2jk ···
Total y.11 y.12 ··· y.jk ···

If A is a response factor, and B, C are explanatory factors, the distribution of (yijk : 1 ≤ i ≤ I)
is M n(y.jk , (πijk )). However, in the special case in which A has only two levels (as in the
ulcers and aspirin data), this is equivalent to saying that y1jk is Binomial Bi(y.jk , π1jk ), and
since nothing is then lost by considering y1jk alone the data may also be analysed as though
they were Binomial. This would cause problems if the answers might be different, but in fact
a log-linear analysis of the original table and a Binomial logistic regression analysis are exactly
equivalent, provided the appropriate models are used, as follows.

• logistic model C ↔ log-linear model A ∗ C + B ∗ C
• logistic model B ↔ log-linear model A ∗ B + B ∗ C
• logistic model 1 (i.e. mean only) ↔ log-linear model A + B ∗ C
(Note that B ∗ C is always included in the log-linear model.)

An advantage of using the Binomial is that other links can be considered.

Task 27 By fitting the models in R, show that the binomial (η ∼ 1) analysis gives the same
residual deviance (and hence model fit) as the log-linear (η ∼ A + B ∗ C) analysis for the
ulcer/aspirin data.
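
A minimal sketch for Task 27 (column names as in §7.4.5; the binomial fit assumes the rows of the aspirin dataframe pair off users and non-users in the same (type, status) order, as in the layout above):

# Log-linear form of the homogeneity model, A + B*C
m.ll <- glm(count ~ factor(aspirin) + factor(type)*factor(status),
            family = poisson(log), data = aspirin)
c(deviance(m.ll), df.residual(m.ll))     # 21.789 on 3 df
# Binomial (intercept-only logistic) version, one line per (type, status) combination
users    <- aspirin$count[aspirin$aspirin == 2]
nonusers <- aspirin$count[aspirin$aspirin == 1]
m.bin <- glm(cbind(users, nonusers) ~ 1, family = binomial)
c(deviance(m.bin), df.residual(m.bin))   # the same residual deviance, on the same df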
