
Introduction to Statistical Modelling and Inference


The complexity of large-scale data sets (“Big Data”) has stimulated the development of advanced
computational methods for analysing them. Two different kinds of methods address this task.
The model-based method uses probability models and likelihood and Bayesian theory, while the
model-free method does not require a probability model, likelihood or Bayesian theory. These two
approaches are based on different philosophical principles of probability theory, espoused by the
famous statisticians Ronald Fisher and Jerzy Neyman.
Introduction to Statistical Modelling and Inference covers simple experimental and survey designs,
and probability models up to and including generalised linear (regression) models and some
extensions of these, including finite mixtures. A wide range of examples from different application
fields are also discussed and analysed. No special software is used, beyond that needed for
maximum likelihood analysis of generalised linear models. Students are expected to have a basic
mathematical background in algebra, coordinate geometry and calculus.

Features

• Probability models are developed from the shape of the sample empirical cumulative
distribution function (cdf) or a transformation of it.
• Bounds for the value of the population cumulative distribution function are obtained from
the Beta distribution at each point of the empirical cdf.
• Bayes’s theorem is developed from the properties of the screening test for a rare condition.
• The multinomial distribution provides an always-true model for any randomly sampled
data.
• The model-free bootstrap method for finding the precision of a sample estimate has a
model-based parallel – the Bayesian bootstrap – based on the always-true multinomial
distribution.
• The Bayesian posterior distributions of model parameters can be obtained from the
maximum likelihood analysis of the model.

This book is aimed at students in a wide range of disciplines including Data Science. The book
is based on the model-based theory, used widely by scientists in many fields, and compares it, in
less detail, with the model-free theory, popular in computer science, machine learning and official
survey analysis. The development of the model-based theory is accelerated by recent developments
in Bayesian analysis.
Murray Aitkin earned his BSc, PhD and DSc from Sydney University, Australia, in Mathematical
Statistics. Dr Aitkin completed his post-doctoral work at the Psychometric Laboratory, University
of North Carolina, Chapel Hill. He has held teaching/lecturing positions at Virginia Polytechnic
Institute, the University of New South Wales and Macquarie University along with research
professor positions at Lancaster University (three years, UK Social Science Research Council)
and the University of Western Australia (five years, Australian Research Council). He has been a
Professor of Statistics at Lancaster University, Tel Aviv University and the University of Newcastle.
He has been a visiting researcher and also held consulting positions at the Educational Testing
Service (Fulbright Senior Fellow 1971–1972 and Senior Statistician 1988–1989). He was the Chief
Statistician from 2000 to 2002 at the Education Statistics Services Institute, American Institutes
for Research, Washington DC, and advisor to the National Center for Education Statistics, US
Department of Education.
He is a Fellow of the American Statistical Association, an Elected Member of the International
Statistical Institute, and an Honorary Member of the Statistical Modelling Society.
He is an Honorary Professorial Associate at the University of Melbourne (Department of
Psychology 2004–2008, Department [now School] of Mathematics and Statistics 2008–present).
Introduction to Statistical Modelling and Inference

Murray Aitkin
University of Melbourne, Australia
First edition published 2023
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742

and by CRC Press


2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

© 2023 Murray Aitkin

CRC Press is an imprint of Taylor & Francis Group, LLC

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume respon-
sibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the
copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify
in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any
form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming,
and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright
Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on
CCC please contact mpkbookspermissions@tandf.co.uk

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and
explanation without intent to infringe.

ISBN: 9781032105710 (hbk)


ISBN: 9781032105734 (pbk)
ISBN: 9781003216025 (ebk)

DOI: 10.1201/9781003216025

Typeset in CMR
by Deanta Global Publishing Services, Chennai, India
Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 What is Statistical Modelling? . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 What is Statistical Analysis? . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 What is Statistical Inference? . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 What is (or are) Big Data? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

3 Data and research studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5


3.1 Lifetimes of radio transceivers . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.2 Clustering of V1 missile hits in South London . . . . . . . . . . . . . . . . 5
3.3 Court case on vaccination risk . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.4 Clinical trial of Depepsen for the treatment of duodenal ulcers . . . . . . . 6
3.5 Effectiveness of treatments for respiratory distress in newborn babies . . . 7
3.6 Vitamin K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.7 Species counts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.8 Toxicology in small animal experiments . . . . . . . . . . . . . . . . . . . . 9
3.9 Incidence of Down’s syndrome in four regions . . . . . . . . . . . . . . . . 9
3.10 Fish species in lakes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.11 Absence from school . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.12 Hostility in husbands of suicide attempters . . . . . . . . . . . . . . . . . . 11
3.13 Tolerance of racial intermarriage . . . . . . . . . . . . . . . . . . . . . . . . 12
3.14 Hospital bed use . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.15 Dugong growth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.16 Simulated motorcycle collision . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.17 Global warming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.18 Social group membership . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4 The StatLab database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19


4.1 Types of variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2 StatLab population questions . . . . . . . . . . . . . . . . . . . . . . . . . 21

5 Sample surveys – should we believe what we read? . . . . . . . . . . . . . . . . . . . . . 23


5.1 Women and Love . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.2 Would you have children? . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.3 Representative sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.4 Bias in the Newsday sample . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.5 Bias in the Women and Love sample . . . . . . . . . . . . . . . . . . . . . 25


6 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.1 Relative frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.2 Degree of belief . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.3 StatLab dice sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.4 Computer sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.4.1 Natural random processes . . . . . . . . . . . . . . . . . . . . . . 31
6.5 Probability for sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.5.1 Extrasensory perception . . . . . . . . . . . . . . . . . . . . . . . 31
6.5.2 Representative sampling . . . . . . . . . . . . . . . . . . . . . . . 33
6.6 Probability axioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.6.1 Dice example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.6.2 Coin tossing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.7 Screening tests and Bayes’s theorem . . . . . . . . . . . . . . . . . . . . . 36
6.8 The misuse of probability in the Sally Clark case . . . . . . . . . . . . . . 39
6.9 Random variables and their probability distributions . . . . . . . . . . . . 42
6.9.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.10 Sums of independent random variables . . . . . . . . . . . . . . . . . . . . 45

7 Statistical inference I – discrete distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 47


7.1 Evidence-based policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
7.2 The basis of statistical inference . . . . . . . . . . . . . . . . . . . . . . . . 47
7.3 The survey sampling approach . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.4 Model-based inference theories . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.5 The likelihood function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
7.6 Binomial distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.6.1 The binomial likelihood function . . . . . . . . . . . . . . . . . . . 53
7.6.1.1 Sufficient and ancillary statistics . . . . . . . . . . . . . 54
7.6.1.2 The maximum likelihood estimate (MLE) . . . . . . . . 57
7.7 Frequentist theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.7.1 Parameter transformations . . . . . . . . . . . . . . . . . . . . . . 60
7.7.2 Ambiguity of notation . . . . . . . . . . . . . . . . . . . . . . . . 63
7.8 Bayesian theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
7.8.1 Bayes’s theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.8.2 Summaries of the posterior distribution . . . . . . . . . . . . . . . 65
7.8.3 Conjugate prior distributions . . . . . . . . . . . . . . . . . . . . . 67
7.8.4 Improving frequentist interval coverage . . . . . . . . . . . . . . . 67
7.8.5 The bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.8.6 Non-informative prior rules . . . . . . . . . . . . . . . . . . . . . . 68
7.8.7 Frequentist objections to flat priors . . . . . . . . . . . . . . . . . 69
7.8.8 General prior specifications . . . . . . . . . . . . . . . . . . . . . . 69
7.8.9 Are parameters really just random variables? . . . . . . . . . . . 70
7.9 Inferences from posterior sampling . . . . . . . . . . . . . . . . . . . . . . 70
7.9.1 The precision of posterior draws . . . . . . . . . . . . . . . . . . . 71
7.10 Sample design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
7.11 Parameter transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.12 The Poisson distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.12.1 Poisson likelihood and ML . . . . . . . . . . . . . . . . . . . . . . 77
7.12.2 Bayesian inference . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.12.3 Prediction of a new Poisson value . . . . . . . . . . . . . . . . . . 79
7.12.4 Side effect risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

7.12.4.1 Frequentist analysis . . . . . . . . . . . . . . . . . . . . 80


7.12.4.2 Bayesian analysis . . . . . . . . . . . . . . . . . . . . . 81
7.12.5 A two-parameter binomial distribution . . . . . . . . . . . . . . . 81
7.12.5.1 Frequentist analysis . . . . . . . . . . . . . . . . . . . . 82
7.12.5.2 Bayesian analysis . . . . . . . . . . . . . . . . . . . . . 84
7.13 Categorical variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.13.1 The multinomial distribution . . . . . . . . . . . . . . . . . . . . . 85
7.14 Maximum likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.15 Bayesian analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.15.1 Posterior sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.15.2 Sampling without replacement . . . . . . . . . . . . . . . . . . . . 89

8 Comparison of binomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.2 Example – RCT of Depepsen for the treatment of duodenal ulcers . . . . . 92
8.2.1 Frequentist analysis: confidence interval . . . . . . . . . . . . . . . 93
8.2.2 Bayesian analysis: credible interval . . . . . . . . . . . . . . . . . 94
8.3 Monte Carlo simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
8.4 RCT continued . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
8.5 Bayesian hypothesis testing/model comparison . . . . . . . . . . . . . . . . 97
8.5.1 The null and alternative hypotheses, and the two models . . . . . 97
8.6 Other measures of treatment difference . . . . . . . . . . . . . . . . . . . . 100
8.6.1 Frequentist analysis: hypothesis testing . . . . . . . . . . . . . . . 102
8.6.2 How are the hypothetical samples to be drawn? . . . . . . . . . . 103
8.6.3 Conditional testing . . . . . . . . . . . . . . . . . . . . . . . . . . 104
8.7 The ECMO trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
8.7.1 The first trial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
8.7.2 Frequentist analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 106
8.7.3 The likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
8.7.3.1 Bayesian Analysis . . . . . . . . . . . . . . . . . . . . . 108
8.7.4 The second ECMO study . . . . . . . . . . . . . . . . . . . . . . . 109

9 Data visualisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111


9.1 The histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
9.2 The empirical mass and cumulative distribution functions . . . . . . . . . 113
9.3 Probability models for continuous variables . . . . . . . . . . . . . . . . . . 113

10 Statistical inference II – the continuous exponential, Gaussian and uniform distributions . . . . . . . . 117
10.1 The exponential distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 117
10.2 The exponential likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
10.3 Frequentist theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
10.3.1 Parameter transformations . . . . . . . . . . . . . . . . . . . . . . 119
10.3.2 Frequentist asymptotics . . . . . . . . . . . . . . . . . . . . . . . 121
10.4 Bayesian theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
10.4.1 Conjugate priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
10.5 The Gaussian distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
10.6 The Gaussian likelihood function . . . . . . . . . . . . . . . . . . . . . . . 125
10.7 Frequentist inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
10.8 Bayesian inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
10.8.1 Prior arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

10.9 Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128


10.10 Frequentist hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . 129
10.10.1 µ1 vs µ2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
10.10.2 µ0 vs µ ≠ µ0 . . . . . . . . . . . . . . . . . . . . . . . . . 130
10.11 Bayesian hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . 130
10.11.1 µ1 vs µ2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
10.11.2 µ0 vs µ ≠ µ0 . . . . . . . . . . . . . . . . . . . . . . . . . 131
10.11.2.1 Use the credible interval . . . . . . . . . . . . . . . . . 132
10.11.2.2 Use the likelihood ratio . . . . . . . . . . . . . . . . . . 132
10.11.2.3 The integrated likelihood . . . . . . . . . . . . . . . . . 133
10.12 Pivotal functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
10.13 Conjugate priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
10.14 The uniform distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
10.14.1 The location-shifted uniform distribution . . . . . . . . . . . . . . 136

11 Statistical Inference III – two-parameter continuous distributions . . . . . . 137


11.1 The Gaussian distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
11.2 Frequentist analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
11.3 Bayesian analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
11.3.1 Inference for σ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
11.3.2 Inference for µ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
11.3.2.1 Simulation marginalisation . . . . . . . . . . . . . . . . 140
11.3.3 Parametric functions . . . . . . . . . . . . . . . . . . . . . . . . . 140
11.3.4 Prediction of a new observation . . . . . . . . . . . . . . . . . . . 142
11.4 The lognormal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
11.4.1 The lognormal density . . . . . . . . . . . . . . . . . . . . . . . . 143
11.5 The Weibull distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
11.5.1 The Weibull likelihood . . . . . . . . . . . . . . . . . . . . . . . . 145
11.5.2 Frequentist analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 146
11.5.3 Bayesian analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
11.5.4 The extreme value distribution . . . . . . . . . . . . . . . . . . . 148
11.5.5 Median Rank Regression (MRR) . . . . . . . . . . . . . . . . . . 148
11.5.6 Censoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
11.6 The gamma distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
11.7 The gamma likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
11.7.1 Frequentist analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 151
11.7.2 Bayesian analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

12 Model assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157


12.1 Gaussian model assessment . . . . . . . . . . . . . . . . . . . . . . . . . . 157
12.2 Lognormal model assessment . . . . . . . . . . . . . . . . . . . . . . . . . . 158
12.3 Exponential model assessment . . . . . . . . . . . . . . . . . . . . . . . . . 158
12.4 Weibull model assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
12.5 Gamma model assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

13 The multinomial distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167


13.1 The multinomial likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
13.2 Frequentist analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
13.3 Bayesian analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
13.4 Criticisms of the Haldane prior . . . . . . . . . . . . . . . . . . . . . . . . 171

13.4.1 The Dirichlet process prior . . . . . . . . . . . . . . . . . . . . . . 173


13.4.2 Posterior sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 173
13.5 Inference for multinomial quantiles . . . . . . . . . . . . . . . . . . . . . . 175
13.6 Dirichlet posterior weighting . . . . . . . . . . . . . . . . . . . . . . . . . . 176
13.7 The frequentist bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
13.7.1 Two-category sample . . . . . . . . . . . . . . . . . . . . . . . . . 179
13.8 Stratified sampling and weighting . . . . . . . . . . . . . . . . . . . . . . . 180

14 Model comparison and model averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183


14.1 Comparison of two fully specified models . . . . . . . . . . . . . . . . . . . 183
14.2 General model comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
14.2.1 Known parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 185
14.2.2 Unknown parameters . . . . . . . . . . . . . . . . . . . . . . . . . 185
14.3 Posterior distribution of the likelihood . . . . . . . . . . . . . . . . . . . . 186
14.4 The deviance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
14.5 Asymptotic distribution of the deviance . . . . . . . . . . . . . . . . . . . 190
14.6 Nested models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
14.7 Model choice and model averaging . . . . . . . . . . . . . . . . . . . . . . . 192

15 Gaussian linear regression models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195


15.1 Simple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
15.1.1 Vitamin K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
15.2 Model assessment through residual examination . . . . . . . . . . . . . . . 197
15.3 Likelihood for the simple linear regression model . . . . . . . . . . . . . . 199
15.4 Maximum likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
15.4.1 Vitamin K example . . . . . . . . . . . . . . . . . . . . . . . . . . 203
15.5 Bayesian and frequentist inferences . . . . . . . . . . . . . . . . . . . . . . 203
15.6 Model-robust analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
15.6.1 The robust variance estimate . . . . . . . . . . . . . . . . . . . . . 206
15.7 Correlation and prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
15.7.1 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
15.7.2 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
15.7.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
15.7.4 Prediction as a model assessment tool . . . . . . . . . . . . . . . . 209
15.8 Probability model assessment . . . . . . . . . . . . . . . . . . . . . . . . . 209
15.9 “Dummy variable” regression . . . . . . . . . . . . . . . . . . . . . . . . . 210
15.10 Two-variable models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
15.11 Model assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
15.12 The p-variable linear model . . . . . . . . . . . . . . . . . . . . . . . . . . 215
15.13 The Gaussian multiple regression likelihood . . . . . . . . . . . . . . . . . 215
15.13.1 Absence from school . . . . . . . . . . . . . . . . . . . . . . . . . 216
15.14 Interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
15.14.1 ANOVA, ANCOVA and MR . . . . . . . . . . . . . . . . . . . . . 217
15.14.1.1 ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . 218
15.14.1.2 Backward elimination . . . . . . . . . . . . . . . . . . . 218
15.14.1.3 ANCOVA . . . . . . . . . . . . . . . . . . . . . . . . . 219
15.15 Ridge regression, the Lasso and the “elastic net” . . . . . . . . . . . . . . 219
15.16 Modelling boy birthweights . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
15.17 Modelling girl intelligence at age ten and family income . . . . . . . . . . . 222
15.18 Modelling of the hostility data . . . . . . . . . . . . . . . . . . . . . . . . . 226

15.18.1 Data structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227


15.18.1.1 Replication and variance heterogeneity . . . . . . . . . 232
15.19 Principal component regression . . . . . . . . . . . . . . . . . . . . . . . . 232

16 Incomplete data and their analysis with the EM and DA algorithms. . . . 235
16.1 The general incomplete data model . . . . . . . . . . . . . . . . . . . . . . 235
16.2 The EM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
16.3 Missingness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
16.4 Lost data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
16.5 Censoring in the exponential distribution . . . . . . . . . . . . . . . . . . . 239
16.6 Randomly missing Gaussian observations . . . . . . . . . . . . . . . . . . . 240
16.7 Missing responses and/or covariates in simple and multiple regression . . . 242
16.7.1 Missing values in the single covariate in simple linear regression . . . 242
16.7.2 Modelling the covariate distribution – Gaussian . . . . . . . . . . 242
16.7.3 Modelling the covariate distribution – multinomial . . . . . . . . 244
16.7.4 Multiple covariates missing . . . . . . . . . . . . . . . . . . . . . . 246
16.8 Mixture distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
16.8.1 The two-component Gaussian mixture model . . . . . . . . . . . . 247
16.9 Bayesian analysis and the Data Augmentation algorithm . . . . . . . . . . 251
16.9.1 The galaxy recession velocity study . . . . . . . . . . . . . . . . . 251
16.9.2 The Dirichlet process prior . . . . . . . . . . . . . . . . . . . . . . 263

17 Generalised linear models (GLMs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265


17.1 The exponential family . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
17.2 Maximum likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
17.3 The GLM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
17.4 Bayesian package development . . . . . . . . . . . . . . . . . . . . . . . . . 267
17.5 Bayesian analysis from ML . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
17.6 Binary response models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
17.6.1 Probit or logit analysis? . . . . . . . . . . . . . . . . . . . . . . . 268
17.6.2 Other binomial link functions and their origins . . . . . . . . . . . 268
17.6.3 The Racine data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
17.6.4 Maximum likelihood . . . . . . . . . . . . . . . . . . . . . . . . . 271
17.6.5 Bayesian analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
17.6.6 The beetle data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
17.7 The menarche data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
17.7.1 Down’s syndrome analysis . . . . . . . . . . . . . . . . . . . . . . . 283
17.7.1.1 BC analysis . . . . . . . . . . . . . . . . . . . . . . . . 283
17.7.1.2 Four regions analysis . . . . . . . . . . . . . . . . . . . 283
17.7.2 The Finney vasoconstriction data . . . . . . . . . . . . . . . . . . 287
17.7.3 Cross-classifications with binary data . . . . . . . . . . . . . . . . 292
17.7.3.1 Region 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 295
17.7.3.2 Region 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 295
17.7.3.3 Region 3 . . . . . . . . . . . . . . . . . . . . . . . . . . 295
17.7.3.4 Region 4 . . . . . . . . . . . . . . . . . . . . . . . . . . 295
17.7.3.5 Observed and (fitted) proportions, all regions . . . . . 296
17.8 Poisson regression – fish species frequency . . . . . . . . . . . . . . . . . . 296
17.8.1 Gaussian approximation . . . . . . . . . . . . . . . . . . . . . . . 299
17.8.2 The Bayesian bootstrap and posterior weighting . . . . . . . . . . 299
17.8.3 Omitted variables, overdispersion and the negative binomial
model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305

17.8.4 Conjugate W . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305


17.9 Gamma regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306

18 Extensions of GLMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307


18.1 Double GLMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
18.2 Maximum likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
18.3 Bayesian analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
18.3.1 Hospital beds and patients . . . . . . . . . . . . . . . . . . . . . . 309
18.3.2 The absence from school data . . . . . . . . . . . . . . . . . . . . 312
18.3.3 The fish species data . . . . . . . . . . . . . . . . . . . . . . . . . 313
18.3.4 Sea temperatures . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
18.3.5 Model assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
18.4 Segmented or broken-stick regressions . . . . . . . . . . . . . . . . . . . . . 320
18.4.1 Nile flood volumes . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
18.4.2 Modelling the break . . . . . . . . . . . . . . . . . . . . . . . . . . 323
18.4.3 Down’s syndrome . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
18.5 Heterogeneous regressions . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
18.6 Highly non-linear functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
18.7 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
18.8 Social networks and social group membership . . . . . . . . . . . . . . . . 338
18.8.1 History of network structures . . . . . . . . . . . . . . . . . . . . 338
18.8.2 The Natchez women network . . . . . . . . . . . . . . . . . . . . . 339
18.8.3 Statistical models . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
18.8.3.1 The “null” random graph model . . . . . . . . . . . . . 340
18.8.3.2 The “saturated” model . . . . . . . . . . . . . . . . . . 340
18.8.3.3 The Rasch model . . . . . . . . . . . . . . . . . . . . . 340
18.8.4 The Exponential Random Graph Model (ERGM) . . . . . . . . . 341
18.8.4.1 The latent class or mixed Rasch model . . . . . . . . . 341
18.9 The motorcycle data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342

19 Appendix 1 – length-biased sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349

20 Appendix 2 – two-component Gaussian mixture . . . . . . . . . . . . . . . . . . . . . . . . 351

21 Appendix 3 – StatLab variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353


21.1 Child variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
21.2 Family variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
21.3 Mother variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
21.4 Father variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354

22 Appendix 4 – a short history of statistics from 1890 . . . . . . . . . . . . . . . . . . . . 355


22.1 Karl Pearson (1857–1936) . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
22.2 Ronald Fisher (1890–1962) . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
22.3 Jerzy Neyman (1894–1981) . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
22.4 Harold Jeffreys (1891–1989) . . . . . . . . . . . . . . . . . . . . . . . . . . 357

23 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
Preface

This book develops a new introductory course in statistics, aimed at students in a wide
range of programmes including Data Science. It is suitable for students in a general Statistics
programme, for whom Data Science can be one of many application areas.
The book has two aims: to clarify for students the confusion between the several infer-
ential frameworks used in different fields of application of statistics, and their philosophical
bases, and to serve as a textbook for students new to statistics and statistical modelling. Stu-
dents are expected to have a basic mathematical background of algebra, coordinate geometry
and calculus.

0.1 Why this book?


The book presents both the Bayesian and the frequentist approaches to statistical analysis.
This may sound confusing, but these approaches, while apparently different, have very close
connections and in many cases use the same data quantities to draw conclusions which are
very similar. The frequentist approach has two conflicting possible interpretations, which we
discuss at length. One is based on the idea of hypothetical replications of the sample data,
and the relation between the observed data and the hypothetical replications. The other does
not depend on hypothetical replications, but is generally based on a quadratic assumption
about the fundamental evidence function of statistical theory.
Bayesian theory does not rely on replications or the quadratic assumption, and is
more generally relevant to the analysis of the complex data structures which are in-
creasingly common, especially in Data Science. The frequentist analysis is still relevant,
however, for one major class of models, those for Gaussian (“normal”)-based regression
and ANOVA (analysis of variance), important applications in many fields. The frequen-
tist analysis is also the basis, in this book, for the Bayesian analysis of generalised lin-
ear models, the extension of regression models to response distributions in the exponential
family.

0.2 Why the focus on the Bayesian approach?


By focussing on the Bayesian approach we are able to travel much faster through the neces-
sary introduction to statistical inference. We give several relevant quotes:
The ability to simplify means to eliminate the unnecessary so that the necessary may
speak.
(Hans Hoffman, quoted in Efron and Tibshirani 1993, p. xiv.)


Efron and Tibshirani were not referring to Bayesian theory here, but to nonparametric
bootstrapping, without an “unnecessary” probability model.
We call this streamlining of statistical methodology minimalist statistics. As the
field of Statistics finds itself increasingly intertwined with other disciplines, and as
required models become more complicated, we believe that such minimalism is of
critical importance. There is only so much time available to educate interdisciplinary
researchers and practitioners in statistical theory and methodology.
(Ruppert, Wand and Carroll 2003, pp. 320–321, emphasis in original.)
Ruppert et al. were not referring to Bayesian theory here, but to a high-level simplification of
statistical modelling in two-level models, which is beyond the scope of this book. Nevertheless
their comment refers equally well to the simplification of the Bayesian theory relative to the
frequentist theory.
The Bayesian approach has developed very rapidly in the last 20 years, greatly assisted
by the dramatic increases in speed and memory of personal computers. It is now practical for
very complex modelling problems, for which the frequentist approach struggles to provide
an analysis.

0.3 Recent changes in technology


Four recent major changes in the statistics profession and its technology have emphasised the
need for a change in the current paradigm at the undergraduate level, and have stimulated us
to develop an undergraduate programme following the Bayesian approach, but also providing
a discussion of the frequentist approach.
These changes are:
• the very rapid developments since 1996 in fast Markov chain Monte Carlo Bayesian
methods for all kinds of incomplete data structures;
• the increasing availability of multiple-processor computers allowing parallel programming
for large-scale complex analysis;
• the enormous increase in data – the “Big Data” revolution – following the great increase
in sensed data of all kinds;
• the recognition by the professional statistical societies in the US and Australia, and
social science research funding agencies in the UK, that the current form of university
statistics courses was seriously outdated in many universities and needed modernisation
and transformation.
These changes have led to the new and continuing development of graduate programmes
in Data Science, Data Analytics, or just Analytics. Many of these have not been developed by
statistics departments, and these programmes have had to spend precious time introducing
the Bayesian approach to data analysis (used almost universally in graduate Data Science
programmes) for students trained in the frequentist paradigm, at the expense of the time
for development of data handling and data management tools.
We are coming from the preparatory end of such programmes, to provide in this book a
full Bayesian and frequentist introductory statistics course, on which can be built very ad-
vanced Bayesian methods in the Data Science programme, or other undergraduate statistics

programmes. The course is equally suitable for statistics students not intending to study
Data Science, as it gives a broad coverage of applications.
An important philosophical position we endorse is the probability model and likelihood
basis for statistical analysis and inference. This basis underlies both the (model-based) fre-
quentist and the Bayesian analyses of sample data. An important group of statisticians (more
common in survey fields), mathematicians and computer scientists regard this approach as
limiting and unnecessary, and argue that the choice of estimators for a population quantity
of interest need not, and should not, be restricted to those based on a formal probability
model.
The classical theory of survey sampling is based on this argument, as we discuss in the
book. Some well-known and highly regarded methods in machine learning – bootstrapping is
an example – are not based on probability models. A common view outside statistics is
that statistical analysis is a form of optimisation – maximisation or minimisation – and that
the important question is the choice of the criterion or objective function to be optimised.
The sum of squares or weighted sum of squares of residuals from the fitted function is often
used as the criterion.
The difficulty of general optimisation methods is that without a model basis, we cannot
evaluate their quality for statistical inference without very detailed performance simulation
studies. Such studies tend to rely on asymptotic – large-sample – behaviour, supplemented
by some finite-sample behaviour. If these methods are based on a probability model, then
we can assess their performance relative to the likelihood-based analysis. Fisher pointed this
out in his revolutionary proposal of the likelihood.
In the development of this book we have taken advantage of recent developments in
Bayesian theory, which have important applications in survey sampling and frequentist max-
imum likelihood theory, and this allows us to give a unified presentation of these areas even
in the first course.

0.4 Acknowledgements
This programme has developed from courses based on this approach given or prepared for
presentation at a number of universities, including the universities of Lancaster and New-
castle UK, and Melbourne Australia. We are grateful to Brian Francis, Göran Kauermann
and Adriano de Carli for invitations to give courses: their invitations have spurred the de-
velopment of this programme.
The present form of the book owes much to discussions with Sunil Rao and Steve Fien-
berg, and we have benefitted from extensive discussions with the staff at the Department of
Statistics, University of Colorado at Fort Collins, the Department of Social Statistics, School
of Social Sciences at the University of Manchester, and the Griffith Social and Behavioural
Research College, Griffith University Queensland.
I am grateful for the support over many years from the UK Social Science Research
Council and Economic and Social Research Council, the Australian Research Council, the
US National Center for Education Statistics and the US Institute of Education Sciences, for
research into the development of new model-based and Bayesian data analysis methods.
I am particularly grateful to the many historical and current statistical scientists who have
influenced my philosophical and computational views, including George Barnard, Al Beaton,
Jim Berger, David Cox, Noel Cressie, Art Dempster, Anthony Edwards, William Ericson,
Steve Fienberg, Ronald Fisher, Harold Jeffreys, Jim Kalbfleisch, Nan Laird, Jim Lindsey,

Rod McDonald, Jon Rao, Richard Royall, Don Rubin, David Sprott, Martin Tanner and
Matt Wand. In the finalisation of the book text, we have had much help from Rob Calver
and Lillian Woodall at Taylor and Francis, and Andrew Robinson at CEBRA, University of
Melbourne.

0.5 Grammatical note


I use the plural subject “we” rather than “I” when writing as the author, except when the
reference is direct. Throughout this book we use the spelling “generalised” in preference to
“generalized”. Both spellings are widely used; since other forms of the word, like “generali-
sation” use s, not z, for consistency we use s in all forms. We use the word data as plural:
its singular is datum, which is rarely used or needed.
The expressions some Bayesians or many Bayesians are not quantified, nor are those
referred to identified. Bayesians are a diverse group.
1
Introduction

We first define some common terms.

1.1 What is Statistical Modelling?


As Wikipedia describes it:
A statistical model is a mathematical model that embodies a set of statistical assump-
tions concerning the generation of some sample data and similar data from a larger
population. A statistical model represents, often in considerably idealized form, the
data-generating process.
(Emphasis in original)
A statistical model, in the sense we use it in this book, has some but not all of the charac-
teristic features of a model, described by Wikipedia as:
a three-dimensional representation of a person or thing or of a proposed structure,
typically on a smaller scale than the original.
Statistical models can be of much higher dimension than three, which makes them hard to
visualise. Their essential property is that they represent some, but not all, of the features
of the larger population. These features are represented by mathematical structures in the
models, which are the population quantities of interest to the data analyst or research in-
vestigator. The remaining variability, sometimes called noise, or random variation, which is
not represented by an explicit structure, is modelled by a probability distribution.

1.2 What is Statistical Analysis?


Statistical analysis, sometimes called Data Analysis or just Analytics, is the analysis of
sample data from a research or other investigation to address the question which led to
the investigation. Analysis is generally, though not always, based on an assumed statistical
model, even if the analyst does not realise that there is a model underlying the analysis.

1.3 What is Statistical Inference?


Statistical inference is the process of drawing conclusions about important aspects of the
population from aspects of the observed sample data. The inference is generally based on
the statistical model, but it can be based on conceptual or actual resampling from the
population without a model.
We develop analysis and inference progressively. We begin in Chapter 2 with Big Data, a
term which has become popular though undefined. Chapter 3 gives a variety of “small data”
sets from a wide range of scientific and social studies, with the research questions which have
led to these studies. These data sets are analysed progressively through the book. We also
make use in Chapter 4 of the StatLab database of a large public health study to investigate
important social questions at the time of the study. Before the technical discussions of models
and modelling, in Chapter 5 we discuss the problems of interpretation in sample surveys,
illustrated by a small survey and two larger surveys. The design of these studies is critical
to their interpretation and value.
Chapter 6 gives a discussion of the two aspects of probability, as the relative frequency
of some event in repeated “trials”, and the degree of belief in the outcome of some event.
These two aspects of probability are the foundations of the two approaches to statistical
inference, which are developed in the subsequent chapters. Chapter 7 develops the principles
of inference for the two aspects of probability in two simple discrete probability models: the
binomial and the Poisson. The fundamental evidence function of the model-based school is
introduced: the likelihood function, with its different usage by Bayesians and non-Bayesians.
Chapter 8 extends the discussion of the binomial model to the very important randomised
clinical trial.
Chapter 9 is a short discussion of visualisation of data and statistical analyses.
Chapter 10 extends the discussion of Chapter 7 to one-parameter continuous distribu-
tions: the exponential, Gaussian and uniform. Chapter 11 extends this discussion to two-
parameter distributions: the Gaussian, lognormal, Weibull and gamma.
The possibility of having several possible distributions for sample data requires ways of
assessing the models we are considering for suitability for the data. Chapter 12 on model
assessment shows how to do this with a data set of operating lifetimes of mobile phones.
Chapter 13 extends the binomial distribution to the multinomial, with more than two dis-
crete outcome categories. Chapter 14 continues assessing and comparing competing models
for the same data, extending Chapter 12 to the possibility of model averaging several models
to produce a composite analysis. Chapter 15 introduces regression models with a Gaussian
response variable, and their maximum likelihood and Bayesian analysis. Common machine
learning extensions to ridge regression, the lasso and principal component regression (PCA)
are discussed in the framework of statistical modelling.
Chapter 16 develops the analysis of incomplete data of many kinds, through the maximum
likelihood EM algorithm and the Bayesian Data Augmentation algorithm. Chapters 17 and
18 extend the Gaussian multiple regression framework to generalised linear models (GLMs),
and further extensions of GLMs.
Four appendices discuss length-biased sampling, two-component Gaussian mixtures, the
StatLab variables, and a short history of the major figures in the development of statistical
modelling and analysis since 1890.
2
What is (or are) Big Data?

Apart from Big Data meaning a lot of data, the use of this term implies a need for analysis
methods for large-scale data, for which the traditional methods for the statistical analysis of
“small” samples presumably fail. The problems with the analysis of large-scale data are not
new: they have been with us since surveys and experiments began to generate large samples
with complex data structures.
A simple way of quantifying the problem is through the size of the data matrix. In tradi-
tional social surveys, the data matrix X was an array of dimension n × p, of measurements
or recorded values of p “variables” on n “cases” or individuals. Simple tabulation procedures
in early computers for variable means and variances and their intercorrelations could handle
large numbers of individuals, but for any kind of regression modelling the covariance matrix
of the variables, of dimension p × p, had to be inverted.
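
The book itself uses no special software, but a small numerical sketch may make the scale of this
computation concrete. The Python fragment below is an editorial illustration only, with simulated
data (the sizes echo the bank example in the next paragraph): it sets up an n × p data matrix and
solves the least-squares normal equations, which require inverting, or at least factorising, the
p × p matrix X'X.

```python
# Illustrative sketch only (simulated data, not from the book): the least-squares
# normal equations for a p-variable regression involve the p x p matrix X'X.
import numpy as np

rng = np.random.default_rng(0)
n, p = 650, 35                    # sizes echoing the bank example in the next paragraph
X = rng.normal(size=(n, p))       # hypothetical data matrix: p variables on n cases
y = X @ rng.normal(size=p) + rng.normal(size=n)   # hypothetical response

XtX = X.T @ X                                     # p x p matrix to be inverted/factorised
beta_hat = np.linalg.solve(XtX, X.T @ y)          # solves (X'X) beta = X'y
print(beta_hat[:3])
```
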
When I began (1959) a summer research assistantship with the Chief Accountant’s Office
of the Commonwealth Bank in Sydney, the bank wanted to relate the total bank branch
salaries and number of staff at their 650 branches to 35 “activity” variables, the routine
activities of a branch staff member. The aim of this analysis was to establish which branches
were under- or overstaffed for the level of business, so that staff could be moved between
branches appropriately. To determine the relation between salaries or staff numbers and the
35 activity items through a regression model would require the inversion of a 36 × 36 matrix,
far beyond the capacity of an electric calculator, or of computers of that time. By the early
1960s, regression programs were available on an IBM mainframe which could do this. By
the early 1970s, regression programs could handle 250 variables, with very large numbers of
cases.
The major expansion of Big Data sizes came from several application areas: the recording
of detailed supermarket transactions on identifiable individuals, through their use of super-
market loyalty cards; the very fast and heavy stock-market transactions; and the genotyping
of individuals and its relation to disease occurrence, especially cancers. An early example
of the latter was gene expression data for p = 6,033 genes on 52 prostate cancer patients –
“cases” – and 50 normal subjects – “controls”. The research question was how to identify the
genes which were important for differentiating cases from controls in their expressions: these
might be related to genetic differences in the two populations. Here p >> n (n = 102), so the
covariance and correlation matrices of the gene expressions, while they could be computed,
could not be inverted as they were of rank 100. Traditional statistical methods could not be
used without drastic changes.
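
Again purely as an illustration (the book uses no special software), the following Python sketch,
with a randomly generated stand-in for the gene expression matrix, shows why traditional methods
break down when p >> n: the p × p covariance matrix has rank at most n − 1, so it is singular and
cannot be inverted.

```python
# Illustrative sketch only (random stand-in data): with p >> n the sample
# covariance matrix is singular, so it cannot be inverted.
import numpy as np

rng = np.random.default_rng(1)
n, p = 102, 6033                    # 52 cases + 50 controls; 6,033 gene expressions
X = rng.normal(size=(n, p))         # stand-in for the gene expression data matrix
Xc = X - X.mean(axis=0)             # centre each gene

# The p x p covariance matrix S = Xc'Xc/(n-1) has the same rank as Xc,
# at most n - 1 = 101 for this stand-in -- far below p -- so S has no inverse.
print(np.linalg.matrix_rank(Xc))
```
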
With further increases in the size of genotyped data sets, and especially detailed astro-
nomical surveys with very high resolution telescopes, even holding the data set in computer
memory was impossible for even the largest computers. Statistical analysis would have to be
adapted to be usable.
In courses above the introductory level, the student will meet a variety of procedures
developed for handling very large data sets. One obvious one is to split the full data set into
multiple subsets, analyse each subset separately and then combine the analyses appropriately.
We do not pursue how to do this or other procedures further in this book: they depend on
more complex modelling. In this book, we establish the principles for statistical analysis for
the traditional (and important) small and medium-sized data sets, covering a wide range of
applications.
An important aspect of preliminary analysis is data visualisation. In the next chapter we
give data tables for a number of different research studies, and graphs for a few. Data tables
are not helpful in understanding the message of the data, and in later chapters we discuss
at some length historical and more informative modern ways of visualising data and fitted
models.
3
Data and research studies

We analyse and discuss a number of studies in this book. Research studies and collections
of administrative data have data of different types. Several small examples from research
studies follow: we will analyse them in later chapters.

3.1 Lifetimes of radio transceivers


The data come from a study of the operating lifetimes of field telephones (radio transceivers)
operating in dusty conditions, which examined the lifetimes of a sample of 88 telephones
(subsequently called “phones”), all of the same make and type.1 The purpose of the study
was to establish a cleaning and repair schedule for the phones, to reduce the chance of their
failing in operation.
The phones were used by workers on eight-hour shifts. The number of shifts during which
the telephones were operating correctly was recorded for each of the phones in the study.
The telephones were switched off at the end of a shift, placed in the battery charger and
switched on again at the beginning of the next shift. Some telephones failed to switch on
correctly at the beginning of a shift; others failed during a shift. The lifetimes of the phones
were recorded as a fraction of the shift time; they are converted to operating lifetime hours
in Table 3.1. The lifetimes are grouped into 56 distinct lifetimes in hours t_i, i = 1, . . . , 56.
The number of phones at each lifetime value t_i is n_i.
It was decided that phones should be recalled for maintenance when 80% of their ex-
pected lifetimes had elapsed. When should this be done? We discuss this study in §8.2 and
Chapter 9.
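
The book's answer comes from the model-based analyses developed in later chapters. Purely as a
model-free, back-of-the-envelope illustration (Python is not used in the book), the sketch below
computes the sample mean lifetime from the grouped data of Table 3.1 and 80% of that mean.

```python
# Crude illustration only: sample mean lifetime from Table 3.1 and 80% of it.
# The book's later analysis of these data is model-based, not this simple mean.
lifetimes = [  # (hours t_i, number of phones n_i)
    (8, 1), (16, 4), (32, 2), (40, 4), (56, 3), (60, 1), (64, 1), (72, 5),
    (80, 4), (96, 2), (104, 1), (108, 1), (112, 2), (114, 1), (120, 1), (128, 1),
    (136, 1), (152, 3), (156, 1), (160, 1), (168, 5), (176, 1), (184, 3), (194, 1),
    (208, 2), (216, 1), (224, 4), (232, 1), (240, 1), (246, 1), (256, 1), (264, 2),
    (272, 1), (280, 1), (288, 1), (304, 1), (308, 1), (328, 2), (340, 1), (352, 1),
    (358, 1), (360, 1), (384, 1), (392, 1), (400, 1), (424, 1), (438, 1), (448, 1),
    (464, 1), (480, 1), (536, 1), (552, 1), (576, 1), (608, 1), (656, 1), (716, 1),
]
N = sum(n for t, n in lifetimes)                   # 88 phones
mean_life = sum(t * n for t, n in lifetimes) / N   # sample mean operating hours
print(N, round(mean_life, 1), round(0.8 * mean_life, 1))
```
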

3.2 Clustering of V1 missile hits in South London


During the Second World War in 1944, V1 missiles (pilotless aircraft powered by a pulse-jet
engine) were launched from Germany at London. For an area of South London divided into
squares of size 0.25 km² (0.5 × 0.5 km), Table 3.2 gives the number of squares receiving each
number of missile hits (Shaw and Shaw 2019). The question of interest was whether these
numbers reflected intentional
targeting of the missiles at certain areas, or whether they fell randomly. This was important
for understanding the guidance system of the missiles.
We discuss this study in §6.10.

1 The origin of this data set has been lost.

TABLE 3.1
Lifetimes and numbers of radio transceivers
i t n i t n i t n i t n i t n i t n i t n

1 8 1 2 16 4 3 32 2 4 40 4 5 56 3 6 60 1 7 64 1
8 72 5 9 80 4 10 96 2 11 104 1 12 108 1 13 112 2 14 114 1
15 120 1 16 128 1 17 136 1 18 152 3 19 156 1 20 160 1 21 168 5
22 176 1 23 184 3 24 194 1 25 208 2 26 216 1 27 224 4 28 232 1
29 240 1 30 246 1 31 256 1 32 264 2 33 272 1 34 280 1 35 288 1
36 304 1 37 308 1 38 328 2 39 340 1 40 352 1 41 358 1 42 360 1
43 384 1 44 392 1 45 400 1 46 424 1 47 438 1 48 448 1 49 464 1
50 480 1 51 536 1 52 552 1 53 576 1 54 608 1 55 656 1 56 716 1

TABLE 3.2
V1 hits
Number of V1 hits Number of squares

0 237
1 189
2 115
3 28
4 6
5 1
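
As a preview of the kind of reasoning taken up in §6.10 (the sketch below is an editorial
illustration in Python, not the book's analysis), one simple check of "falling randomly" is to
compare the observed counts of squares in Table 3.2 with the counts expected if hits followed a
Poisson distribution with the same mean number of hits per square.

```python
# Illustrative Poisson check for Table 3.2 (not the book's analysis).
from math import exp, factorial

observed = {0: 237, 1: 189, 2: 115, 3: 28, 4: 6, 5: 1}    # hits per square: number of squares
squares = sum(observed.values())                          # total squares
hits = sum(k * n for k, n in observed.items())            # total hits
lam = hits / squares                                      # mean hits per square

for k, n in observed.items():
    expected = squares * exp(-lam) * lam ** k / factorial(k)
    print(k, n, round(expected, 1))                       # observed vs Poisson-expected
```
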

3.3 Court case on vaccination risk


In a court case assessing a side effect risk after vaccination (Aitkin 1992), the question of
concern was:
• given the occurrence of the side effect in the current year of four cases in 300,533 vacci-
nations;
• and given that the international standard rate of the side effect is one in 310,000 cases;
• has the rate of occurrence increased in this year?
How do we answer this question? See §6.10.3 for details.
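
One simple way to frame the question (an editorial illustration only; the analysis the book
develops is in §6.10.3) is to treat the number of cases as Poisson with mean given by the
international standard rate, and ask how probable four or more cases would then be. A short
Python sketch:

```python
# Illustration only: Poisson tail probability of 4 or more cases at the standard rate.
from math import exp, factorial

expected = 300_533 / 310_000      # expected cases if the standard rate (1 per 310,000) held
p_ge_4 = 1 - sum(exp(-expected) * expected ** k / factorial(k) for k in range(4))
print(round(expected, 3), round(p_ge_4, 4))
```
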

3.4 Clinical trial of Depepsen for the treatment of duodenal ulcers


A study was carried out at the Royal North Shore hospital in Sydney by Professor D.W. Piper
and his co-workers. The drug Depepsen (a trade name for sodium amylosulphate) had been
found effective in the treatment of gastric (stomach) ulcers, and it was believed that because
of its known physiological action in the treatment of this condition, and the similarity of the
two conditions, it should also be effective for duodenal ulcers (see Figure 3.1). The criterion
for “success” of the treatment was taken as the complete healing of the ulcer within a period
of eight weeks after the beginning of treatment. The existence of an ulcer, and its healing,
were positively identified by fibre-optic duodenoscopy, in which a flexible tube is swallowed
by the patient, and the lining of the duodenum examined visually through the optical tube.

FIGURE 3.1
Peptic ulcers, stomach and duodenal

TABLE 3.3
Clinical trial of Depepsen
Depepsen Placebo Total

Healed 13 10 23
Not healed 5 7 12
Total 18 17 35

To assess the value of Depepsen, a randomised clinical trial was carried out with 35
patients, in which 13 of the 18 patients receiving Depepsen healed, while ten of the 17
receiving an inert placebo healed. Does this indicate a real superiority in healing of Depepsen
over placebo? Classifying the patients by treatment and recovery, we have Table 3.3.
We discuss this study at length in Chapter 7.

3.5 Effectiveness of treatments for respiratory distress in newborn babies
This study assessed a new treatment, called ECMO (Extra-Corporeal Membrane Oxygenation),
for life-threatening breathing difficulties in newborn babies. It was compared in a randomised
trial with the current medical treatment, called CMT, which supplied oxygen under pressure
to the baby's lungs.
TABLE 3.4
Babies surviving or dying under
CMT and ECMO
Response
Treat Survived Died Total

CMT 0 1 1
ECMO 11 0 11
Total 11 1 12

TABLE 3.5
Concentration y and dose x
x 1.6 2.2 2.8 3.0 3.7 4.4 4.8
y 500 162 178 136 78 47 62
x 6.1 6.8 7.2 8.2 9.9 10.2 11.3 14.8
y 39 40 21 19 12 10 9 8

TABLE 3.6
Number of species observed
by five independent observers

16 18 22 25 27

The results of the first trial are shown in Table 3.4.


Was ECMO a more effective treatment than CMT? The curious features of this trial,
and of a second trial, are discussed in §7.7.

3.6 Vitamin K
In a study of the role of Vitamin K in blood clotting, 15 chickens were deprived of Vitamin
K and then fed dried liver (a source of Vitamin K) for three days at a (varying) dose of x
mg per gram weight of chick per day. At the end of this period, the response of each chicken
was measured as the concentration y of a clotting agent needed to clot samples of its blood
in three minutes. The data from the 15 chickens are given in Table 3.5.
What can we say about any relationship between the dose of Vitamin K and the concen-
tration of the clotting agent? We discuss this data set in §14.1.1.

3.7 Species counts


In a study of species abundance in a sampled area, the counts in Table 3.6 of the number of
different species in the area were obtained by five observers operating independently. How
many species were there in the sampled area?
Assuming that all the observers were reliable, it is clear that there were at least 27, but
what more can be said? See §6.10.4.
TABLE 3.7
Bioassay data from Racine et al
Dose x i Number of Number of
(log gm/ml) animals, n i deaths, y i

−0.86 5 0
−0.30 5 1
−0.05 5 3
0.73 5 5

TABLE 3.8
Counts of Down’s babies and baby populations in four
regions
Region BC Mass NY Sweden
age r n r n r n r n

15.562 0 0 1 1,364 1 5,142 0 383


16.543 0 0 2 3,959 4 12,524 1 1,979
17.527 16 13,555 10 9,848 7 27,701 3 5,265
...
46.425 0 0 9 258 10 514 7 82
47.419 7 249 7 103 3 183 1 35
48.411 0 0 2 41 1 65 2 19
49.410 0 0 0 13 1 22 0 7

3.8 Toxicology in small animal experiments


In the assessment of acute toxicity of a compound to a strain of laboratory mice, five mice
were assigned randomly to each of four dose levels of the compound. The binary response
was death of the mouse, and the dose levels (on the natural log scale) and the number
of deaths at each dose level are given in Table 3.7, adapted from Gelman, Carlin, Stern,
Dunson, Vehtari and Rubin (2014, §3.7). The original description of the study and analysis
was in Racine, Grieve, Flühler and Smith (1986).
What can be said about the dose level which would give 90% mortality with this strain
of mice? See §16.5.

3.9 Incidence of Down’s syndrome in four regions


Down syndrome or Down’s syndrome, also known as trisomy 21, is a genetic disorder caused
by the presence of all or part of a third copy of chromosome 21. It causes a distinct facial
appearance, intellectual disability and developmental delays. A large cross-national study
of the syndrome was carried out in four major regions: the Canadian province of British
Columbia, the US states of Massachusetts and New York, and the country of Sweden. The
counts r of babies born with Down’s syndrome out of the n born were tabulated by the
average age (within one-year intervals) of the mother at pregnancy (Geyer 1991). A small
FIGURE 3.2
Incidence of Down's syndrome in British Columbia, 1991 report (observed Down's rate p plotted against mother's age, 20–45 years)

subset of the data is shown here. It is well known that the mother’s age is important, with
the observed incidence of Down's syndrome (r/n) increasing for mothers over 30. Figure 3.2
shows the relation for British Columbia.
There are several research questions:
• Is the incidence of the syndrome low and constant up to some mother age?
• How is the increase best presented?
• Is the increase pattern consistent over the four regions? If not, how do they differ?
We address these questions in §16.6.

3.10 Fish species in lakes


A study of the relation between lake size (surface area) and number of different fish species
living in the lake was carried out by Barbour and Brown (1974) in 70 lakes. The data on
lake area and number of species identified are given in Table 3.9. The question of interest
was how to relate the number of species cohabiting in the lake to the area of the lake. We
discuss this in §16.9.

3.11 Absence from school


A study of absence from school and its relation to school and home variables was carried out
by Sydney University sociologist Dr Sue Quine in a NSW school. Table 3.10 gives the number
TABLE 3.9
Number Y of fish species in lakes of area X
Y 10 37 60 113 99 13 30 114 112 17
X 5 41 171 25,719 59,596 1 44 58,016 19,477 10
Y 10 14 39 14 14 67 36 30 19 46
X 85 1 174 3 548 2,414 36 1 5 5,346
Y 68 93 13 53 17 245 88 24 37 22
X 2,072 17,500 673 2,150 2,370 2 8,490 4 413 29 9,065 3,302
Y 18 214 177 17 50 5 22 156 74 13
X 3,626 32,893 69,484 64,500 31,500 1 85,00 1 125 423,488 436,000 165
Y 11 48 14 28 17 17 21 13 14 21
X 6,206 18,400 24 10,340 23 8,000 221 4,650 231 7,154
Y 24 12 26 13 19 19 22 15 9 23
X 616 31,153 27,195 406 399 1,425 60 71 15 98
Y 48 21 46 14 7 5 40 18 20 17
X 684 212 676 1,080 111 8 8,264 9,065 357 347

TABLE 3.10
Absence, IQ and number of family dependents, Aboriginal girls
days 14 11 2 5 5 35 22 20 13
IQ 60 60 70 86 86 81 86 93 93
deps 11 11 5 9 10 9 7 4 12
days 7 14 27 6 20 4 15 13 6 6
IQ 96 90 82 79 65 64 66 76 73 74
deps 6 5 11 6 10 7 8 3 11 9
days 5 16 17 46 43 40 16 14 32
IQ 70 98 84 100 84 91 105 104 76
deps 7 14 3 10 7 7 4 11 11
days 57 6 53 23 8 34 36 38 23 28
IQ 83 73 92 93 99 95 84 106 89 103
deps 10 9 9 13 7 9 10 11 9 9

of days absent from school, the IQ (intelligence quotient) and the number of dependents in
the family for 38 Aboriginal girl children. What can be said about the relation between
absence and the other variables? See §14.10.

3.12 Hostility in husbands of suicide attempters


Sydney psychiatrist Dr Melvin Bennett carried out a study in married couples of the effect
of an unsuccessful suicide attempt by the wife on the emotional state of the husband. The
basis for this study was that the psychiatric literature on suicide gave conflicting opinions
on this question. One was that the husbands were distressed, caring and supportive of the
wife. The other was that the husbands were angry, critical and unsupportive. These reports
were based on small numbers of cases seen by individual psychiatrists.
Dr Bennett, as part of his MD doctoral thesis, was given access to married women patients
admitted to hospital after a suicide attempt and to married women patients admitted to
hospital with critical organic (non-psychological) abdominal conditions. When the recovery
of the wives was established, Bennett interviewed the husbands and recorded their responses
to three questions about the state of the marriage. The responses were analysed for affection
and hostility content using the Gottschalk-Gleser scales (Gottschalk and Gleser 1969, from
now abbreviated to GG). The analysis scored the husbands’ responses for affect (emotional
response) on several forms of negative affect (anger, guilt, . . .), and one of positive affect –
affection. The analysis provides scale scores which can be taken as approximately Gaussian
(the scales use a transformation of the count of affective words). The psychiatric question
was: how are the levels of affect, on the several GG scales, related to the nature of the
event (suicide attempt or organic abdominal condition), and do personal factors influence
the affect level?
In §14.17 we discuss how the data (not given here) were analysed and the conclusion
about affection.

3.13 Tolerance of racial intermarriage


Table 3.11 comes from a UK three-year large-scale national survey of attitudes of 16-year-
olds to racial intermarriage. The study results are published as a cross-classification of the
sample proportion p tolerant of racial intermarriage, out of the n (in parentheses) 16-year-
olds interviewed. The proportion is cross-classified by
• geographical region: 1 – South, 2 – Central, 3 – North East, 4 – West.
• level of education: 3 – more than high school, 2 – completed high school, 1 – did not
complete high school.
• year of survey: (19)72, 73, 74.

TABLE 3.11
Proportion tolerant of racial intermarriage
Region Ed 72 73 74

South 3 0.704 (27) 0.729 (48) 0.860 (57)


2 0.438 (137) 0.584 (178) 0.568 (176)
1 0.258 (155) 0.222 (162) 0.274 (146)
Central 3 0.783 (60) 0.873 (55) 0.862 (65)
2 0.699 (219) 0.701 (211) 0.732 (231)
1 0.374 (155) 0.389 (131) 0.422 (135)
North East 3 0.893 (56) 0.949 (59) 0.929 (70)
2 0.740 (196) 0.729 (193) 0.764 (240)
1 0.504 (403) 0.488 (84) 0.443 (79)
West 3 0.966 (29) 0.903 (31) 1.0 (27)
2 0.714 (91) 0.839 (87) 0.827 (81)
1 0.578 (45) 0.556 (45) 0.620 (50)

It is clear that there are large differences in proportions across educational levels, and smaller
differences across regions; it is less clear what changes have occurred over time. How do we
summarise the variation in the proportions with the three cross-classifying factors? See §16.8.

3.14 Hospital bed use


Royall and Cumberland (1981) discussed a population of short-stay hospitals for which data
were available for each hospital on the number of patients Y discharged in one year and the
number of hospital beds X in that year. The data in Figure 3.3 came from the hospital data
frame in Valliant, Dorfman and Royall (2000, Appendix B2). This frame is based on the
NCHS Hospital Discharge Survey, a national sample of short-stay hospitals with fewer than
1,000 beds (Herson 1976).
How does Figure 3.4 summarise the relation between patient numbers and hospital bed
numbers? Note that the vertical scales are different in the two figures. See §18.1.1.

3.15 Dugong growth


The dugong is a medium-sized marine mammal, one of four living species of the order
Sirenia, which includes three species of manatees. A study of the development of the dugong
recorded ages and lengths of a sample of 27 captured and released dugongs (data source
dugongs.data.R). What can be said about the length of fully grown dugongs? The data are
graphed in Figure 3.5.

FIGURE 3.3
Numbers of patients treated and hospital beds (patients discharged plotted against number of beds)

FIGURE 3.4
Joint model ML fit with 95% variability bounds (patients plotted against beds)

FIGURE 3.5
Ages and lengths of dugongs (length in metres plotted against age in years)

3.16 Simulated motorcycle collision


In a simulated motorcycle collision designed to test the safety of helmets, readings were taken
of the acceleration of the rider’s helmet at very short time intervals in milliseconds, before
and after the simulated collision. Figure 3.6 plots acceleration in scaled g units against time
in milliseconds.
How can this relation be modelled? Where does Figure 3.7 come from?

3.17 Global warming


Figure 3.8 shows the annual global sea temperature anomalies over the period 1880–2015
(from the US EPA site). The word “anomaly” is used in the sense of a deviation of the
annual temperature from the average temperature over the period. The anomaly is a simple
location change in the temperature variable.
There is a clear decline over the period 1880–1910, and a steady increase over 1910–2015,
apart from a very sudden increase and then decline in the period 1940–1945, and a much
smaller drop and increase in the period 1908–1911, together with a great deal of variation,
which appears to be decreasing over time. How do we analyse such data?

FIGURE 3.6
Acceleration of motorcycle helmet (acceleration in scaled g units plotted against time in milliseconds)

FIGURE 3.7
Acceleration of motorcycle helmet and ML fitted polynomial model (acceleration plotted against time in ms)

FIGURE 3.8
Sea temperature anomaly by year, 1880–2015
TABLE 3.12
Event attendance
W \E 1 2 3 4 5 6 7 8 9 10 11 12 13 14 T

1 1 1 1 1 1 1 1 0 1 1 0 0 0 0 8
2 1 1 1 0 1 1 1 1 0 0 0 0 0 0 7
3 0 1 1 1 1 1 1 1 1 0 0 0 0 0 8
4 1 0 1 1 1 1 1 1 0 0 0 0 0 0 7
5 0 0 1 1 1 0 1 0 0 0 0 0 0 0 4
6 0 0 1 0 1 1 0 1 0 0 0 0 0 0 4
7 0 0 0 0 1 1 1 1 0 0 0 0 0 0 4
8 0 0 0 0 0 1 0 1 1 0 0 0 0 0 3
9 0 0 0 0 1 0 1 1 1 0 0 0 0 0 4
10 0 0 0 0 0 0 1 1 1 0 0 1 0 0 4
11 0 0 0 0 0 0 0 1 1 1 0 1 0 0 4
12 0 0 0 0 0 0 0 1 1 1 0 1 1 1 6
13 0 0 0 0 0 0 1 1 1 1 0 1 1 1 7
14 0 0 0 0 0 1 1 0 1 1 1 1 1 1 8
15 0 0 0 0 0 0 1 1 0 1 1 1 0 0 5
16 0 0 0 0 0 0 0 1 1 0 0 0 0 0 2
17 0 0 0 0 0 0 0 0 1 0 1 0 0 0 2
18 0 0 0 0 0 0 0 0 1 0 1 0 0 0 2
T 3 3 6 4 8 8 10 14 12 5 4 6 3 3 89

3.18 Social group membership


A study of social group membership was carried out in the 1930s in Natchez, Mississippi,
reported in Davis, Gardner and Gardner (1941). One aspect of the study was to assess the
formation or existence of “cliques”, defined by the joint participation of groups of women in
attending common events.
Table 3.12 shows the attendance (1) or non-attendance (0) of each woman at each event.
Marginal totals (T) have been added to the table, giving the total number of events attended
by each woman (W), and the total number of women attending each event (E).
Can we identify well-connected groups of women in this small data set? See §17.9.
4
The StatLab database

The StatLab database is a “small” population of 1,296 families. The database was freely
available on the Web for some years. It is described in pp. 318–319 of the StatLab book
(Hodges, Krech and Crutchfield 1975):

The StatLab population [called Census in the book] covers 1296 member families of
the Kaiser Foundation Health Plan (a prepaid medical care programme) living in
the San Francisco Bay area during the years 1961–72. These families were partici-
pating members of the Child Health and Development Study conceived and directed
by Professor Jacob Yerushalmy, in the School of Public Health at the University of
California, Berkeley.
On her first visit to the Oakland hospital of the Health Plan after pregnancy
was diagnosed, each woman was interviewed intensively on a wide range of medical
and socioeconomic matters relating both to herself and to her husband. In addi-
tion, various physical and physiological measures were made. When her child was
born, further data about her and her newborn baby were recorded. Approximately
10 years later the child and mother were called in for follow-up testing, interview-
ing and measurement. In some instances, the husband was also interviewed and
measured.
The 1296 families of the population are divided into two equal subpopulations:
648 families consisting of a mother, father and female child; and 648 families of a
mother, father and male child. The children were all born in the Kaiser Foundation
Hospital, Oakland California, between 1 April 1961 and 15 April 1963. The popu-
lation does not include any other children who may have existed in these families.

More than 10,000 families took part in the Child Health and Development Study. To make
the data more widely available and suitable for student training in statistics, the study
authors prepared a subsample of 1,296 families from the full study as a statistical population
– the StatLab population – to provide for dice sampling, discussed in the following.
From the available data recorded in the Child Health and Development Study, 32 variables
were selected for the StatLab book. The 36 pages of the population listed each of these 32
variables for each of the 1,296 families. The first 18 pages covered the families with girls;
the second 18 pages covered the families with boys. Within each of these two sets of pages
the families were listed in order of mother’s age, with the youngest mothers first and the
oldest last.
The population list consisted of printouts numbered in consecutive “dice numbers” (i.e.
the population pages were numbered 11, 12, 13, 14, 15, 16, 21, 22, 23, . . ., 65, 66). Similarly,
the 36 families on each page were designated in consecutive dice numbers from 11 to 66. The
identification number (ID no.) for any given family consisted of two pairs of dice numbers, the
first pair indicating the page and the second pair indicating the family on the page. To select a
family purely at random from the population of 1,296, it was necessary to throw a pair of dice
twice. (The StatLab book was sold with a pair of dice, one red, one green.) If, for example,

the first throw gave a red 2 and a green 6, this selected page 26. If the second throw gave a
red 5 and a green 4, this selected family 54 on that page. Thus the ID number for this family
was 26-54.
The 32 variables for each family were grouped by child, mother, father and family.
Part of the data were collected at the time of birth (1961–1963) and the other data at
the time of test (1971–1972). A full description of variables collected in the survey is given in
Appendix 1.

4.1 Types of variables


In social survey work, variables are often described as binary, categorical, count and con-
tinuous. The StatLab variables can be described by slightly more detailed types, set out in
Table 4.1. They are assumed for the present to be measured without error and recorded
without missing values. In fact missing values do occur in some of the variables, as they do
in almost all databases, and we will discuss this at a later point in the book.
Quotation marks are used around “continuous” above because all such variables are in
practice measured to a finite precision, so they are actually discrete variables with a large
number of numerical values. For example, height may be measured to the nearest half-inch
or centimetre, survival time to the nearest day, week or month, and Stanford–Binet IQ may
be given as an integer, though it is defined as the ratio of mental age to chronological age
multiplied by 100 and rounded to an integer. Models for “continuous” data usually ignore
this measurement precision, but this leads to unnecessary difficulties with statistical theory.
Our formulation of models explicitly recognises this aspect of “continuous” data.

TABLE 4.1
Types of variables
Variable type                                     Examples

Categorical, two categories (binary)              Male/female, alive/dead, presence/absence of a disease, inoculated/not inoculated

Categorical, more than two unordered categories   Blood group, cause of death, type of cancer, political party vote, religious affiliation

Categorical with ordered categories               Severity of symptoms or illness, strength of agreement or disagreement, class of university degree

Discrete count                                    Number of children in a family, number of accidents at an intersection, number of ship collisions in a year

“Continuous”                                      Height, weight, response time, survival time, Stanford–Binet IQ

4.2 StatLab population questions


Students will be using the database to examine a number of questions about the study
population from which it is drawn. At this point we will restrict consideration to three
questions:

• Do mothers who were smoking at the diagnosis of pregnancy have babies with lower
birthweight than mothers who were not smoking at diagnosis? (Low birthweight increases
risk for babies.)
• More generally, how is the baby’s birthweight related to mother’s weight, mother’s age,
mother’s blood group, mother’s smoking, father’s smoking and family income?

• How does the child’s intelligence at age ten, as assessed by the Peabody and Raven
tests, relate to birthweight, family income and mother’s and father’s age, education and
occupation?
There are many other social/economic/public health questions which could be addressed
using this database – that is one of the values of such databases. We do not have the full
study population data to check the inferences, but that is the reality of all research studies.
We are assuming, in using the database to answer these questions, that the database is
a random sample from the full study population. But why is such an assumption necessary?
Would it matter if it was a systematic sample of some kind? In the worst case, the analysis
might be relevant only to the database itself: it might not generalise to any larger population.
This becomes clear when we consider some notorious studies, in the next chapter.
5
Sample surveys – should we believe what we read?

5.1 Women and Love


A major survey of sexual relationships of 4,500 US women was reported in the book Women
and Love – A Cultural Revolution in Progress, by Shere Hite (1987). Hite was a distinguished
and controversial researcher trained in social history who became a major investigator in the
sexual behaviour of women.
Some of the statistics reported in her book were staggering. For example:

• 95% of women reported emotional or psychological harassment from their husband or


lover;
• 84% of women were not emotionally satisfied by their relationships;
• 70% of women married five years or more were having sex outside the marriage;

• 39% of women married 25 years or more had been struck or beaten by a husband or
lover;
• 39% of women never married had been struck or beaten by a husband or lover.
Alarming as these “statistics” were, there seemed some strange inconsistencies. For example,

• 95% of women reported emotional or psychological harassment from their husband or


lover, but only 84% were not emotionally satisfied by their relationships.
• Does this mean that the other 11% who were emotionally or psychologically harassed by
their husband or lover were nevertheless emotionally satisfied by their relationships?

• If it does, what does this say about “emotional or psychological harassment”?


• If it does not, what does it mean?
• What percentage of women married less than 25 years had been struck or beaten by a
husband or lover?

What are we to make of these “statistics”? That depends on how they were obtained. An
important issue is the sampling method used in this study, and in other studies. The sampling
method has a critical effect on what we think about the study, and illustrates the fundamental
concept of “randomness” and its use to achieve a representative sample. This requires an
understanding of probability concepts.


5.2 Would you have children?


The book Statistics: Concepts and Controversies, by David S. Moore, gave a simpler survey
example. Newsday, a Long Island (NY) newspaper, ran an advice column by Ann Landers
until 2002. In one of her columns, in response to a letter she had received, readers were asked
to answer this question:
• If you had your life to live over again, would you have children?
Readers were asked to send Newsday a postcard, with a Yes or No answer. Newsday
received nearly 10,000 responses, of which almost 70% said No. What should we conclude?
Newsday decided to commission a professional nationwide random sample of parents to ask
the same question. The sample polled 1,373 parents, and found that 91% said Yes. What
should we conclude?
How can we be sure that a sample is “representative” of the population we
aim to investigate?

5.3 Representative sampling


“Representative” sampling, as this term is used in the statistics profession, requires that
• every member of the population has a known chance of being included in the sample, and
• the chance of inclusion is not related to the response being measured.
The chance is frequently the same, but this is not essential. A sampling method that satisfies
these requirements is called unbiased; one that does not is called biased.
How is an unbiased sampling method achieved? In the most formal way, by constructing
a list of all the members of the population, and then drawing a sample of the required size
using a random mechanism to guarantee the known chance of inclusion.
What do we conclude about the two sampling methods used to answer the Newsday
question? The second nationwide survey closely approximates this requirement. Survey or-
ganisations maintain population lists of families and residences based on the Census or on
voter registration, and have standard random mechanisms for selecting the sample from the
list.
The Newsday poll fails this requirement because:
• We do not know that people submitting the postcards are actually members of the target
population – parents and readers of Newsday.
• We do not know the chance of inclusion of each population member in the voluntary
sample.
• We do not know what population has been sampled – is it readers of Newsday, or Long
Island residents, or New York City residents?
This is a characteristic feature of voluntary response in general. On emotional issues, it is
common to find that the chance of responding, and hence being included in a voluntary
survey, depends on the strength of feeling on the issue being surveyed, and may be different
for those with different views on the issue.

5.4 Bias in the Newsday sample


We can investigate the question of bias in the Newsday sample because we have a national
random sample of parents answering the same question. This does not give a definitive
answer, because the Long Island Newsday reader (sub)population may be quite different
from the national population of parents. Nevertheless we can ask “what if . . .” questions.
Suppose that the population about which we wish to make statements consists of the
parent readers of Newsday (how would non-readers get to hear about the question?). Let us
suppose its size is 2,000,000. This assumption is not necessary – we would get the same result
with any other population size assumption – but it makes the calculation simple. Suppose,
following the survey organisation sample result, that there are really 10% No parents and
90% Yes parents in the Newsday readers population. Then this population has 200,000 No
and 1.8 million Yes parents.
Suppose the voluntary postcard sample of 10,000 parents contained exactly 7,000 No
and 3,000 Yes respondents. Then the chance of being included in the sample for the No
parents was 7,000/200,000 = 1/30 approximately, but for the Yes parents it was 3,000/1.8
million = 1/600.
So a No parent had 20 times the chance of a Yes parent of being included in the sample.
So the voluntary response “postal survey” has oversampled the No responders relative to
the Yes responders by a factor of 20. The sample is grossly biased towards No responders.
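The over-sampling arithmetic can be checked directly. A minimal Python sketch, using only the assumed numbers from the hypothetical calculation above (the variable names are our choices, not the book's):

    # "What if" check of the Newsday voluntary-response bias, using the assumed
    # reader population of 2,000,000 (10% No, 90% Yes) and the supposed postcard
    # sample split of 7,000 No and 3,000 Yes from the text.
    pop_no, pop_yes = 200_000, 1_800_000
    sample_no, sample_yes = 7_000, 3_000

    p_incl_no = sample_no / pop_no        # chance a No parent is in the sample
    p_incl_yes = sample_yes / pop_yes     # chance a Yes parent is in the sample

    print(p_incl_no, p_incl_yes)          # 0.035 (about 1/30) and 1/600
    print(p_incl_no / p_incl_yes)         # over-sampling factor, about 20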
As we noted, there is another possible explanation: that the postal survey was in fact a
representative sample of Newsday reader parents, who are vastly different from the national
population. The sample is however biased, since we cannot give the probability of inclusion
in the sample for each member of the population.
It is worth repeating that voluntary response samples are always biased, in the technical
sense. The belief that they are representative requires very strong external evidence, which
is rarely available. A large sample is not a guarantee of representativeness.

5.5 Bias in the Women and Love sample


What sampling method was used for Women and Love? Hite sent out questionnaires to (we
quote from her p. 777):

• church groups in 34 states,

• women’s voting and political groups in nine states,


• women’s rights organisations in 39 states,
• professional women’s groups in 22 states,
• counselling and walk-in centres for women or families in 43 states,

and a wide range of other organisations, such as senior citizens’ homes and disabled
people’s organisations, in various states.
In addition, individual women wrote for copies of the questionnaire . . .
All in all, 100,000 questionnaires were distributed, and 4,500 returned . . .

Those receiving the questionnaire were asked to pass it along to relevant people they
knew, if they were not themselves interested in responding. If they were interested in re-
sponding, they were permitted to omit any questions which were not relevant to them. The
response rate in this survey was 4.5%.
What should we conclude? What population is being sampled? The implication of the
book is that the target – the intended – population is the US adult female population. But
the sampling method is very likely to oversample groups with higher proportions of women
with relational difficulties. Since the questionnaires were sent to groups, we have no idea
who actually answered them (they were anonymous). The sample is clearly biased because
we cannot give the chance of any woman in the US population being included in the sample.
Hite’s questionnaire introduced a further difficulty. It directed respondents:

It is not necessary to answer every question! There are seven headings; feel free to
skip around and answer only those sections or questions you choose ...
(p. 787, emphasis in the original)

So the sample size may be different for different questions (these sample sizes were not given
in the book) – even the percentage responding to different questions may not be comparable
within the study, as we mentioned earlier:

• 95% reported emotional or psychological harassment from their husband or lover; but
• 84% of women were not emotionally satisfied by their relationships.

These numbers are surely a consequence of non-response to one or both questions. We gain
a misleading impression even within the survey by comparing percentages based on different
subsets of respondents.
If the sample is biased, what conclusions can we draw? Since the sampled population is
undefined, we can only regard the sample as the population. Hite had information from
4,500 women, and the percentages reported are the percentages in her sample, which is her
population.
These results have no knowable connection with percentages of the US female population,
and cannot be used to refer to this population. To make such a connection, it is not sufficient
to have a voluntary sample, no matter how large – we need a probability sample, in which
every woman in the population would be included with a known probability.
A little more can be said about the US female population. For the year 1986, the US
female population between the ages of 25 and 69 was 66,538,000 (rounded to the nearest
1,000). We take this as the target population for Hite’s survey. (It makes little difference if
we extend the age range.)
We know that, of the 4,500 women in the Hite sample, 95% – 4,275 – reported emotional
or psychological harassment by a husband or lover. So in the US population of women,
the proportion of those harassed is at least 4,275/66,538,000 = 0.000064, or 0.0064% of the
population. On the other hand, the proportion of women not harassed in the population is
at least 225/66,538,000 = 0.000003382, or 0.0003382%. So the most that we can say with
certainty about the population proportion of women who suffered emotional or psychological
harassment by a husband or lover is that it must be between 0.0064% and 99.9996618%. This
is so close to 0% and 100% that we have learnt almost nothing about the population. That
is the consequence of the survey design.
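A minimal Python sketch of this bound calculation (the variable names are ours; the numbers are those given above):

    # Bounds on the US population proportion harassed, using only the Hite
    # sample counts and the population size stated in the text.
    population = 66_538_000                 # US women aged 25-69 in 1986
    harassed, not_harassed = 4_275, 225     # 95% and 5% of the 4,500 respondents
    lower = harassed / population           # lower bound on proportion harassed
    upper = 1 - not_harassed / population   # upper bound on proportion harassed
    print(f"between {lower:.4%} and {upper:.5%}")   # between about 0.0064% and 99.99966%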
Hite responded vigorously to criticisms of the study. Some of her responses were emo-
tional: “the stories told by these women are heartbreaking” as indeed they must have been.

On the study design she persisted in the claim that 4,500 was a large sample and allowed gen-
eralisation to the population. Further comment on the sampling method and Hite’s responses
can be found at http://davidstreitfeld.com/archive/controversies/hite01.html.
An important issue is that in observational studies we may not have measured or recorded
some important variables. This issue is described pungently in Wainer (2016):

Controlled experimental studies are typically regarded as the gold standard for
which all investigators should strive, and observational studies as their polar oppo-
site, pejoratively described as “some data we found lying on the street”.
(p. 29)
6
Probability

The word “probability” is used in everyday language to represent uncertainty, and is often
used interchangeably with the word “likelihood”. However these words have quite different
and specific meanings in statistical inference, which is the basis of drawing conclusions from
data. Probability is used in two different ways:
• as a measure of relative frequency of occurrence of some event;
• as a measure of degree of belief in the occurrence of some event.
The “event” referred to is in the future – the probability statement is a predictive statement
about the not-yet-observed event.

6.1 Relative frequency


A simple example of the first use is the face showing on the next throw of a regular (cubic)
die. Most people would regard the faces as “equally probable” and so the probability of
each of the faces 1, 2, . . . , 6 would be given as 1/6 (summing to 1 over the faces). The usual
justification given for this assignment is that much experience has shown that when well-
made “fair” dice are thrown repeatedly, each face appears approximately equally often.
Of course in practice the dice in use are not going to be thrown so often that the relative
frequencies of each face can be established. So the assignment of equal probability to each
face is an assumption – a model – for the properties and behaviour of the dice. Large-scale
throws of dice have been performed, however. In 1894, Weldon rolled a set of 12 dice 26,306
times, to determine whether small variations from an expected “law” could be identified.
Karl Pearson (1900) analysed Weldon’s data and found that the frequencies of the numbers
1–6 departed from the expected 1/6 by more than could reasonably be attributed to chance. This was a very early
example of a test of significance, invented by Pearson.
The difference from the expected 1/6 was because the manufacture of the dice involved
removing small amounts of material for the number spots from all the faces. The spot holes
were then painted. The different numbers of spots for each face moved the mass centre of
the die slightly away from its physical centre, giving slightly different frequencies of showing
each face. However, in normal dice games this discrepancy could not be observed.
Much later, the physicist and probabilist Edwin Jaynes developed a theory for this de-
parture based on the machine production of the dice. The plastic dice material was manu-
factured in a long strip, with a very accurate square cross-section, and then cut into cubes.
The length of the cut piece could be slightly different from the cross-section dimension, re-
sulting in larger or smaller length in one dimension than in the other two. He showed that
the departures from equal frequency in Weldon’s results were consistent with this model for
the construction of the dice. Interest in Weldon’s data continued, with a re-examination of
models by Kemp and Kemp (1991).


6.2 Degree of belief


An example of degree of belief is the use of probability statements in weather forecasting.
“Tomorrow there will be a 70% chance of rain in the metropolitan area.”
This might appear to be a frequency-based use of probability: for the large number
of past days with weather like that of today, and assuming the same development of the
weather system as on those days, 70% of the metropolitan area had rain on the following
day. Past data are of course a guide to future data, if the conditions are similar. In fact all
such predictive statements are based on a set of conditions, stated or unstated, which are
necessary for the prediction to be appropriate. In weather forecasting such conditions are
generally not made clear, and even if they are, the forecaster is using judgement – expressing a
high degree of personal belief in the necessary conditions holding – in making the prediction.
So at least some personal degree of belief, as well as past data, plays a role in the forecast.
In forecasting the track of catastrophic events like hurricanes or major bushfires, much
more caution is used, since tracking of the hurricane or bushfire is strongly dependent on
weather conditions which may change suddenly and drastically or may not be measurable
accurately, and the cost of a mis-prediction of the track may be disastrous.
In this book we will need only very simple forms of degree of belief probability, for prior
distributions of probability model parameters. We postpone further discussion of degree of
belief aspects of probability until then.

6.3 StatLab dice sampling


In 1975, computer sampling was in its infancy, and the students then threw the red and the
green dice twice to draw a single family from the population. (This was always regarded
as the most amusing part of the course.) They then recorded the variables to be discussed
in the course, and repeated the sampling process 39 times, to obtain a sample of size 40.
This sample was used in subsequent analyses, sometimes as it was, sometimes split into two
samples of size 20, and sometimes split further into four samples of size ten. This allowed
for an assessment of decreasing variability with increasing sample size.
A question you might ask is: what happens if we draw the same family twice? If the
sample size compared to the population size – the sampling fraction – is “small”, this is very
unlikely to happen. If it does happen, we discard the repeated family and draw again. If the
sampling fraction is large, then we draw the sample without replacement of the drawn family
back into the population, so that it cannot be drawn again.

6.4 Computer sampling


We can use the computer for sampling families as well as for the analysis of the data from
the sampled families: we will frequently be using the StatLab database as a population from
which we draw random samples. To draw a family at random, we need four independent
random integers between 1 and 6. Computer routines are well established for drawing
uniform random numbers between 1 and any given integer N , using modular arithmetic: we
draw integers “modulo N ”. Depending on the implementation of the routine, these may be
numbers from 1 to N or from 0 to N − 1. If the latter, we add 1 to each random number. (We
could avoid this altogether by renumbering the families from 1 to 1,296, and drawing ran-
dom integers from this range. For consistency with the StatLab book, we retain the original
database numbering.)
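A minimal Python sketch of this dice-number sampling (the function name and the use of Python's random module are our choices, not the book's):

    # Draw a StatLab family ID by simulating four independent dice values in 1-6:
    # the first pair selects the page, the second pair the family on the page.
    import random

    def draw_family_id(rng=random):
        d = [rng.randint(1, 6) for _ in range(4)]   # four "dice" values from 1 to 6
        return f"{d[0]}{d[1]}-{d[2]}{d[3]}"         # an ID such as "26-54"

    random.seed(2024)                               # any seed; for reproducibility
    sample_ids = [draw_family_id() for _ in range(40)]   # a sample of 40 families
    print(sample_ids[:5])

As in the dice version, this samples with replacement; a repeated ID would simply be discarded and redrawn.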

6.4.1 Natural random processes


Some students (and others) do not trust a computer black box to provide “truly random”
values. For many such people dice throwing is immediately convincing, though they might
want to examine the dice carefully before accepting the throws as random. There are sev-
eral natural random mechanisms which can be used (after some post-processing) to provide
random numbers.
The best known is the Geiger counter or Geiger-Müller counter. This is a device to test
for the presence of radiation (alpha or beta particles and gamma or X-rays) from a possible
radioactive source. If we place a radioactive source near the counter, the counter beeps when
it senses a particle emitted from the breakdown of an atom of the radioactive material. If
multiple atoms break down in quick succession – a burst of radiation – the counter emits a
string of beeps in quick succession. If the radioactive material is a “pure” specimen, with
atoms of the same kind, these have a characteristic breakdown rate, determined by the half-
life of the material – the time required, probabilistically, for half of the unstable radioactive
atoms in the material to undergo radioactive decay. The times between successive beeps are
independent random values from a theoretical probability distribution – the exponential. The
breakdown process is called exponential decay.

6.5 Probability for sampling


6.5.1 Extrasensory perception

Extrasensory perception (ESP) involves reception of information not gained


through the recognized physical senses but sensed with the mind. The term was
adopted by Duke University botanist and parapsychologist J.B. Rhine to denote
psychic abilities such as telepathy, clairaudience, and clairvoyance, and their trans-
temporal operation as precognition or retrocognition. ESP is also sometimes re-
ferred to as a sixth sense. The term implies acquisition of information by means
external to the basic limiting assumptions of science, such as that organisms can
only receive information from the past to the present. . . .
The scientific community rejects ESP due to the absence of an evidence base, the
lack of a theory which would explain ESP, and the lack of experimental techniques
which can provide reliably positive results, and considers ESP a pseudoscience. . . .
Rhine worked largely in the laboratory, carefully defining terms such as ESP
. . . and designing experiments to test them. A simple set of cards was developed,
originally called Zener cards – now called ESP cards. They bear the symbols circle,
cross, wavy lines, square and star; there are five cards of each in a pack of 25. . . . In
a telepathy experiment, the “sender” looks at a series of cards while the “receiver”
guesses the symbols.
(Wikipedia)

A class experiment with an ESP card deck is useful for understanding the basic ideas
of probability. The “sender” stands in the classroom behind an impermeable screen and
reads through the shuffled deck, one card at a time. At each card the sender pauses, con-
centrates on the symbol, and “projects” the card symbol to the student audience, which
cannot see him or her, only hear the “Next card” announcements. The students write down
the card symbols they “receive”. At the end of the 25 cards the sender asks the students
to switch their records of the card sequence with the students next to them, to prevent
“cheating”. The sender then reads through the sequence of cards, and the student markers
tick off the correct answers, sum the total number of correct identifications, and return the
records.
The sender then asks for a show of hands at each number correct, starting from 25
and working backwards. This requires a group of hand-counters to assist the sender.1 Af-
ter describing the experiment, but before it begins, the sender’s hand-counters count the
number of students present, and the sender puts up on the board the numbers of stu-
dents who are predicted to identify each possible number of cards correctly. The sender
emphasises that this is not a prediction for individual students, but for the class as a group.
In a class of 200 students, we would expect to see (rounded to integers) the numbers2 in
Table 6.1.
Only one student in a class of 200 would be expected to score zero, or more than ten
correct.3
Students are always puzzled by the closeness of the results to those predicted. “How does
he/she know that? How can he/she say that?”
The prediction is based on a simple statistical model. The analyst assumes (or believes)
that no student can learn anything about the cards being projected, so the student answers
are random guesses. The simplest model for this is that if the five card symbols are regarded
as equally likely under guessing, then the probability of a correct guess with any card is 1/5.
If the guesses are made independently and with the same probability, then the number r
correctly “identified” (guessed) has the binomial distribution b(r | 25, 0.2), shown in Table 6.2
to four decimal places.4
We develop this distribution in the framework of StatLab sampling.

TABLE 6.1
Expected number in 200 for ESP card guessing
r 0 1 2 3 4 5
Expected # 1 5 14 27 37 39
r 6 7 8 9 10 11
Expected # 33 22 12 6 2 1

1 Anyone getting zero correct always shows embarrassment.


2 The numbers add to 199 because of the rounding.
3 The rare high scorers are intriguing!
4 This has mean 5 and standard deviation 2, but the distribution is very skewed.
TABLE 6.2
Binomial distribution for ESP card guessing
r 0 1 2 3 4 5
b(r | 25, 0.2) 0.0038 0.0236 0.0708 0.1358 0.1867 0.1960
r 6 7 8 9 10 11
b(r | 25, 0.2) 0.1633 0.1108 0.0623 0.0294 0.0118 0.0040
r 12 13 14 15
b(r | 25, 0.2) 0.0012 0.0003 0.0001 0
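Tables 6.1 and 6.2 can be reproduced directly from the binomial model; a minimal Python sketch (our code, not the book's):

    # b(r | 25, 0.2) and the expected counts in a class of 200 guessing students.
    from math import comb

    n, p, students = 25, 0.2, 200
    for r in range(12):
        prob = comb(n, r) * p**r * (1 - p)**(n - r)   # binomial probability b(r | 25, 0.2)
        print(r, round(prob, 4), round(students * prob))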

6.5.2 Representative sampling


In the dice-sampling days, many students were surprised to find that in their sample of 40
families they did not get 20 boys and 20 girls. Some thought they had made a mistake in the
dice throwing. Since we know that the population contains equal numbers of boy and girl
families, it seemed reasonable, or perhaps “obvious”, that a “representative” sample should
also have equal numbers of boys and girls. (The assumption that small samples should have
the same properties as the population is sometimes called the “law of small numbers”. There
is no such law!)
Our “representative” guarantee of “equal chance” for each family to be included in the
sample does not guarantee an exact match in proportions of boys and girls between the
sample and the population. An exact match requires a stratified sample, in which we draw
a random sample of 20 from each of the boy and girl strata – sub-populations.
In our own four samples of 10 families, we found 4, 3, 5 and 6 boys. In the pooled sam-
ples of 20 families, there were 7 and 11 boys, with 18 boys in the complete sample of
40 families. This kind of random variation is constantly encountered in dealing with ran-
dom samples from populations. To describe this variation, we need to develop probability
models.
There are many formal ways to introduce probability. We follow an old and simple5 one,
mentioned earlier, based on the idea of equally likely cases. To develop a probability model,
we consider the sampling process step by step. We need some notation. We denote by B the
event of drawing a boy family in the throw of the two dice, and by G the event of drawing a
girl family in the throw of the two dice. (It is actually only the first die that matters, since
the girl families are 11–36 and the boys 41–66.) We take it as given6 that, when we throw
one die, the probabilities that it shows the faces 1–6 are equal – the faces are equally likely
to appear at any throw.
Since all pages in the database are equally likely to be selected, the probability p that
a boy family is selected is the number of boy-family pages divided by the total number of
pages: 18/36 = 1/2. Formally, if there are N equally likely possible outcomes, and R of these
correspond to the event B of interest, then the probability p of the event B, written Pr[B]
is p = R/N . So we model the probability of the event B1 , that we draw a boy family at the
first throw of the dice, by
p = Pr[B1 ] = 1/2,
and correspondingly
Pr[G1 ] = 1 − p = 1/2.

5 (and much criticised)


6 That is, this is a model assumption.

Now we throw the dice again to draw a second family. The family we chose at the first draw
remains in the population and could be drawn again, though that would be very unlikely –
its probability, by the same argument, would be 1/1,296. Sampling the population in this
way is called sampling with replacement. What would happen if we did draw the same family
again? We would set it aside and draw another one – repeating the same family does not
give any more information. Practical surveys are always drawn without replacement, but
the two methods have very similar properties if the population is large compared to the
sample.
Since the population has not changed, and the design of the throws makes the outcome
of the second throw independent of that at the first throw, the probability of a boy family
at the second throw is again p = 1/2 = Pr[B2 ], and Pr[G2 ] = 1 − p = 1/2. Clearly this will
be true for all the successive throws. This is an example of an axiom of probability theory:
that if the events B1 and B2 are independent, then the probability that both occur is the
product of their separate probabilities:

Pr[B1 and B2] = Pr[B1 ∩ B2] = Pr[B1] · Pr[B2] = (1/2)² = 1/4.

Here the notation ∩ – commonly called “cap” – stands for intersection – the joint event.
The event of drawing a boy family and a girl family in the first two draws can occur in
two different ways: [B1 ∩ G2 ] and [G1 ∩ B2 ]. The probability of each of these is p(1 − p),
so the probability that a sample of two families contains one boy and one girl family is
2p(1 − p) = 2 · (1/2)² = 1/2.
We can extend this argument indefinitely. In general, the probability that we obtain r
boy families in n throws of the two dice is given by the binomial (“two names”) distribution:
 
Pr[r boys | n, p = 1/2] = \binom{n}{r} (1/2)^n,

where \binom{n}{r} = n!/(r!(n − r)!) is the binomial coefficient representing the number of
arrangements of the r boy families and n − r girl families – the number of distinct orderings
of the r B and n − r G symbols.
For the sample sizes n = 10, 20 and 40, and p = 1/2, the binomial distributions are shown
in Tables 6.3, 6.4 and 6.5 (probabilities less than 0.001 are omitted).
We now state the probability axioms more formally. The next section can be omitted or
postponed without loss.

TABLE 6.3
binomial distribution, n = 10, p = 1/2
r 0 1 2 3 4 5 6 7 8 9 10
Pr[r] .001 .010 .044 .117 .205 .246 .205 .117 .044 .010 .001

TABLE 6.4
binomial distribution, n = 20, p = 1/2
r 3 4 5 6 7 8 9 10
Pr[r] .001 .005 .015 .037 .074 .120 .160 .176
r 11 12 13 14 15 16 17
Pr[r] .160 .120 .074 .037 .015 .005 .001
TABLE 6.5
binomial distribution, n = 40, p = 1/2
r 12 13 14 15 16 17 18 19 20
Pr[r] .005 .011 .021 .037 .057 .081 .103 .119 .125
r 21 22 23 24 25 26 27 28
Pr[r] .119 .103 .081 .057 .037 .021 .011 .005

6.6 Probability axioms


We specialise the general axioms of Kolmogorov (1933) to a setting which is sufficiently
general for our purposes. We write Ω (Omega) for a finite set of events A1 , . . . , AN, which
are mutually exclusive and exhaustive, so that no two or more events can occur together,
but one event must occur. In set theory notation, with ∅ denoting the empty set with no
members, and ∀ meaning “for all”,
• Ai ∩ Aj = ∅ ∀ i ≠ j (any two events have no common members);
• A1 ∪ A2 ∪ · · · ∪ AN = Ω (the union of all the events is the entire set).
Each event Ai has a probability pi which may be zero, assigned by a model, so 0 ≤ pi ≤ 1.
We write Pr for probabilities of arbitrary events. As the events Ai are mutually exclusive
and exhaustive,
• Pr[Ai ∩ Aj ] = 0;
• Pr[A1 ∪ A2 ∪ · · · ∪ AN ] = p1 + p2 + · · · + pN = 1.
Other events may involve the union of several events Ai . If event B = A1 ∪ A2 then

Pr[B] = Pr[A1 ] + Pr[A2 ] = p1 + p2 .

Elementary probability examples generally involve dice throwing and coin tossing.

6.6.1 Dice example


Define Ai to be the event that a “fair” die shows the number i when thrown.7 Let event
B be the throwing of an even number, and the event C be the throwing of a number less
than 3. Then B = [A2 ∪ A4 ∪ A6 ] and C = [A1 ∪ A2 ], and Pr[B] = 3/6, Pr[C] = 2/6, so
[B ∩ C] = A2 , Pr[B ∩ C] = 1/6. Of course we could obtain the same answer without the
formal probability notation.
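The same answer can be obtained by enumerating the six equally likely faces; a minimal Python sketch (our illustration):

    # Events for a single throw of a fair die: B = even number, C = number less than 3.
    faces = range(1, 7)
    B = {i for i in faces if i % 2 == 0}       # {2, 4, 6}
    C = {i for i in faces if i < 3}            # {1, 2}
    print(len(B) / 6, len(C) / 6)              # Pr[B] = 3/6, Pr[C] = 2/6
    print(len(B & C) / 6)                      # Pr[B ∩ C] = 1/6, the event A2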

6.6.2 Coin tossing


The probability that a coin when tossed shows Head is p, and shows Tail is q = 1 − p.
These two events are the only possibilities – we exclude the coin landing on an edge and not
toppling. The coin is tossed three times. What is the probability that two Heads result (and
one Tail)?
7 The term “fair” means we assume that all faces are equally probable. This is a model assumption: the

probability axioms do not depend in any way on how the model probabilities are assigned, or what they are.

We have to assume that the throws are independent – that the face shown on one throw
does not affect the probability of the same face on the next throw. (In gambling dice games
the dice are thrown together to ensure this.)
Then the two heads have probability p² and the one tail probability q, but the sequence of
H and T is arbitrary: HHT, HTH, THH. So the probability of two heads and one tail in any
order is 3p²q. We use these kinds of results repeatedly in statistical inference.
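A minimal Python check of this enumeration, with an arbitrary illustrative value p = 0.6 (our example, not from the book):

    # Probability of exactly two Heads in three independent tosses.
    from itertools import product

    p = 0.6                                     # illustrative head probability
    q = 1 - p
    sequences = [''.join(s) for s in product('HT', repeat=3)]   # HHH, HHT, ...
    prob = sum(p**s.count('H') * q**s.count('T')
               for s in sequences if s.count('H') == 2)
    print(prob, 3 * p**2 * q)                   # both equal 0.432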

6.7 Screening tests and Bayes’s theorem


We discuss a very common misuse, or misunderstanding, of probability. A common public
health procedure is to screen apparently healthy people for the possible presence of a serious
medical condition. The screening test is simple and easily administered, while the full medical
diagnosis of the condition is complicated and expensive, and stressful. The screening test is
not completely reliable: it may fail to show the condition when it is actually present – a
false negative result – or it may show that the condition is present when it is actually absent
– a false positive result. The first “error” may be very serious, as it falsely reassures the
person that he or she is free of the condition. The second error is also serious, as it arouses
unnecessary anxiety in the person, and also requires further detailed medical assessment
which may be inconvenient and distressing.
For these reasons, screening tests aim to have very high “true positive” and “true neg-
ative” rates. The true positive rate is generally called the sensitivity, and the true negative
rate is called the specificity of the test. We now consider the properties of such tests.

RT-PCR tests to detect severe acute respiratory syndrome coronavirus 2 (SARS-


CoV-2) RNA are the operational gold standard for detecting COVID-19 disease in
clinical practice. RT-PCR assays in the UK have analytical sensitivity and speci-
ficity of greater than 95%, but no single gold standard assay exists.
(Surkova, Nikolayevskyy and Drobniewski 2020)

We take as an example a condition C present in 2% of the population. For those people


who have the condition, the screening test gives a true positive result – it shows correctly
that they have the condition – 95% of the time. For those people who do not have the
condition, the screening test gives a false positive result – it shows incorrectly that they have
the condition – 5% of the time. We test a population of 1,000 people in a small town.8 The
true positive and false positive rates apply to this town population. What do we conclude
from the results of the screening test?
We can construct a simple table (Table 6.6) which shows the results of the test. First we
need notation for the test result. We write P if the test result is positive, and N if the test
result is negative. We write C+ if the person tested has the condition, and C− if the person
tested does not have the condition.
How do we arrive at this table? First, along the bottom margin, we have 20 people in the
population (2%) with the condition, C+, and 980 without the condition, C−. In the first
column of the 20 people with the condition, 19 (95% of the 20) are correctly identified as
having the condition (true positives) while 1 is a false negative – he or she has the condition
but the test is negative. In the second column of the 980 people without the condition,
49 (5% of the 980) are incorrectly identified as having the condition (false positives). The
remaining 931 are true negatives. Adding across the rows, we have 68 people testing positive
and 932 testing negative.

TABLE 6.6
True and false positives and negatives, screening test, 2% incidence

            C+      C−     Total
P           19      49        68
N            1     931       932
Total       20     980     1,000

8 Any other population size will give the same conclusion, but this number simplifies the
calculation. The table – a contingency table – is a form of Venn diagram.
What does it mean if you test positive? What is the probability that you have the
condition? Of the 68 people who tested positive, only 19 actually had the condition. The rest
were false positives. So given that you had a positive test, your probability of having the
condition is only 19/68 = 0.279, about 28%.
What does it mean if you test negative? What is the probability that you have the
condition? Of the 932 people who tested negative, only one actually had the condition, a
false negative. The rest were true negatives. So given that you had a negative test, your
probability of having the condition is only 1/932 = 0.0011: the probability that you do not
have the condition is 0.9989.
So the test is very accurate at identifying people who do not have the condition. It is
much less accurate at identifying people who do have the condition. Of those testing positive,
only 28% actually have the condition.
This answer shocks many students – how is it possible that a test with such a high rate
of detection of people with the condition gives so little confidence in their identification
from a positive result? The answer is closely connected to the prevalence – frequency – of
the condition in the population; the condition is quite rare (only 2% have it), and so most
positive tests will be false positives.
To represent this result formally, we need the probability calculus. We have Pr[C−] =
1 − Pr[C+]. We usually have a good idea of the probability Pr[C+], that is, the proportion
of people in the population being tested who have the condition. We use the value stated
earlier of 0.02; 2% of people being tested would be expected to have the condition.
We suppose as before that the screening test correctly identifies 95% of people who have
the condition. We express this as a conditional probability: given that the person has condition
C+, the probability of a true positive response P is 0.95. We express this by the notation
Pr[P | C+] = 0.95. We suppose also that the test incorrectly identifies (by a positive test
result) only 5% of people who do not have the condition: the probability of a false positive
response is Pr[P | C−] = 0.05.
The first person tested, a man, gives a positive test result. What advice do we give him?
He could be truly positive, or truly negative.
We know the probability of a positive result from people with the condition is 0.95, and
is 0.05 for people without the condition. But these numbers are known prior to the test; we
want to update – revise – them given the positive result of the test. For this we need the
conditional probabilities given the data – the probabilities of C+ and C− given P.
Formally, we express the conclusions using a standard probability result. Given two events
A and B, the probability that they both occur, Pr[A and B], written formally as Pr[A ∩ B]
(“A cap B”) can be expressed in two different ways, as the product of the probability of one
event and the conditional probability of the other event given the first:

Pr[A ∩ B] = Pr[A|B] Pr[B] = Pr[B|A] Pr[A].



So
Pr[A|B] = Pr[B|A] Pr[A]/ Pr[B].
This result was derived by Thomas Bayes in 1763 and is known as Bayes’s theorem. It shows
us how to update the probability of an event A when we are given new information that the
event B has occurred.
We apply this to our events C and P : we are given

Pr[C+] = 0.02, Pr[P |C+] = 0.95, Pr[P |C−] = 0.05,

so
Pr[C+ | P ] = Pr[P |C+] Pr[C+]/ Pr[P ].
Now
Pr[P ] = Pr[P ∩ C+] + Pr[P ∩ C−]
since one of C+ and C− must be true, so

Pr[P] = Pr[P|C+] Pr[C+] + Pr[P|C−] Pr[C−]
      = 0.95 × 0.02 + 0.05 × 0.98
      = 0.019 + 0.049 = 0.068,

and hence
Pr[C+ | P ] = Pr[P |C+] Pr[C+]/ Pr[P ] = 0.019/0.068 = 0.279.
Now we see that the high true positive rate does not guarantee a high probability of the
condition given a positive test: we have to consider as well the chance of a positive test when
the person is free of the condition, and combine these probabilities with the prevalence of
the condition.
The misinterpretation of the high true positive rate as the probability of the condition
given the positive test used to be endemic in law court discussions of probability, and is
known as the prosecutor’s fallacy. We will give a court case example later.
We use Bayes’s theorem (generally but incorrectly abbreviated to Bayes’ or Bayes theo-
rem) – the inversion of the sequence of events A and B – repeatedly in this book. It is the
foundation of statistical inference in the Bayesian paradigm, and all the methods developed
later in the book are extensions of this result.
The results of the screening test are potentially seriously misleading to those screened.
How can this be improved? There are two possible approaches (for a fixed prevalence rate
of the condition):
(1) Increase the true positive rate;
(2) Increase the true negative rate.
Small improvements in these already high rates will not change the conclusions much. But
now suppose the prevalence of the condition in the population is higher – 20% instead of
2% and the condition is common. With the previous rates of 0.95 and 0.05 for the true and
false positive rates of the test, we have

Pr[P] = Pr[P|C+] Pr[C+] + Pr[P|C−] Pr[C−]
      = 0.95 × 0.2 + 0.05 × 0.8
      = 0.190 + 0.040 = 0.230,

and hence

Pr[C+|P] = Pr[P|C+] Pr[C+]/Pr[P] = 0.190/0.230 = 0.826,
and more than 80% of those testing positive do have the condition. But those testing negative
are less certain of their negative state:

Pr[N] = Pr[N|C+] Pr[C+] + Pr[N|C−] Pr[C−]
      = 0.05 × 0.2 + 0.95 × 0.8
      = 0.010 + 0.760 = 0.770,

and hence

Pr[C+|N] = Pr[N|C+] Pr[C+]/Pr[N] = 0.010/0.770 = 0.013.
This is a factor of 10 larger than for 2% incidence, but it is still very low. A negative test is
still a very strong indication that the person does not have the condition.
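The screening-test calculations above can be checked directly. The following is a minimal Python sketch (not from the book, which uses no special software); the function name screen and its arguments are illustrative only.

# Bayes's theorem for a screening test: Pr[C+ | P] and Pr[C+ | N] from the
# prevalence, the true positive rate and the false positive rate.
def screen(prevalence, true_positive=0.95, false_positive=0.05):
    pr_c = prevalence                   # Pr[C+]
    pr_not_c = 1 - prevalence           # Pr[C-]
    pr_p = true_positive * pr_c + false_positive * pr_not_c              # Pr[P]
    pr_n = (1 - true_positive) * pr_c + (1 - false_positive) * pr_not_c  # Pr[N]
    return true_positive * pr_c / pr_p, (1 - true_positive) * pr_c / pr_n

print(screen(0.02))   # (0.279..., 0.0011...): the 2% incidence case
print(screen(0.20))   # (0.826..., 0.013...): the 20% incidence case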
The properties of the screening test depend strongly on the population incidence of the
condition – the prior probability, before the test is done, that a randomly selected person
has the condition. In terms of the previous table, we now have Table 6.7.

TABLE 6.7
True and false positives and negatives, screening test, 20% incidence

            C+      C−     Total
P          190      40       230
N           10     760       770
Total      200     800     1,000

6.8 The misuse of probability in the Sally Clark case


We quote from Wikipedia:

Sally Clark (August 1964–15 March 2007) was a British solicitor who became the
victim of an infamous miscarriage of justice when she was wrongly convicted of the
murder of her two sons in 1999.
Clark’s first son died suddenly within a few weeks of his birth in 1996. After
her second son died in a similar manner, she was arrested in 1998 and tried for the
murder of both sons. Her prosecution was controversial due to statistical evidence
presented by pediatrician Professor Sir Roy Meadow, who testified that the chance
of two children from an affluent family suffering sudden infant death syndrome was
1 in 73 million, which was arrived at by squaring 1 in 8500 for the likelihood of
a cot death in similar circumstances. The Royal Statistical Society later issued a
public statement expressing its concern at the “misuse of statistics in the courts”
and arguing that there was “no statistical basis” for Meadow’s claim.
Clark was convicted in November 1999. The [two] convictions were upheld at
appeal in October 2000 but overturned in a second appeal in January 2003, after
it emerged that the prosecutor’s pathologist had failed to disclose microbiological


reports that suggested one of her sons had died of natural causes. She was released
from prison having served more than three years of her sentence. The journalist
Geoffrey Wansell called Clark’s experience “one of the great miscarriages of justice
in modern British legal history”. As a result of her case, the Attorney-General
ordered a review of hundreds of other cases, and two other women convicted of
murdering their children had their convictions overturned.

What is the basis of the statistical evidence against Clark? We quote from the expert
witness testimony by Professor Phil Dawid:

The SUDI [Sudden Unexplained Death in Infancy] study was conducted between
February 1993 and March 1996 in a study area consisting of five regions of the
country [UK], having a total population of nearly 18 million. During the study
period there were around 470,000 live births in the study area. 456 of these babies
suffered sudden [unexplained] death in infancy, 363 of these deaths being classified
as cases of SIDS [Sudden Infant Death Syndrome]. Of these, 325 were subjected to
further analysis. For each of the 325 “index cases”, four “control” babies, born at
around the same time but not suffering SIDS, were identified by the health visitor.
For both index and control cases, a number of possibly relevant characteristics of
the family, baby, etc. were measured. Statistical analyses were conducted with the
aim of discovering differences between such characteristics, which might distinguish
the index (SIDS) babies from the control (non-SIDS) babies.
The figures presented in court were based on Table 3.58 of the SUDI report,
which purported to classify the risk of SIDS according to “the three prenatal factors
with the highest predictive value”:
• Anybody smokes in the household
• No waged income in household
• Mother less than 27 years and this child not her first.
In the case of Sally Clark, none of the above factors was present. For such a case,
the table gave a rate of 0.117 SIDS cases per 1000 live births, i.e. 1 in 8,543 live
births. The figure of 1 in 73 million mentioned . . . above was calculated by squaring
this (8,543 times 8,543 = 73 million, approximately).
(Dawid)

Meadow compared this probability to the chances of backing an 80–1 outsider in the Grand
National [a British horse race] four years running, and winning each
time. Clark was convicted by a 10–2 majority verdict on 9 November 1999, and
given the mandatory sentence of life imprisonment. She was widely reviled in the
press as the murderer of her children.
(Wikipedia)

What was wrong with Meadow’s statistical argument? There were two main issues:
• The deaths of the two children were treated as independent events, and their probabilities
multiplied.
If a genetic or environmental factor contributed to the deaths of both children, this would
mean that the occurrence of the death of the first child would increase the probability
of the death of the second child, if the genetic or environmental factor was unchanged.

The assumption of independence was not based on evidence and changed substantially
the stated probability of two deaths.
• The question for the jury was never expressed correctly, in terms of inferring the proba-
bility of a hypothesis from the probability of events under the hypotheses.
The inference drawn by Meadow was based on the argument that if a child death event
occurs which is extremely unlikely under normal family circumstances, then the circum-
stances must have been abnormal, which he took to be murder by a parent or parents
(Clark’s husband was initially charged with murder as well, but the charge was dropped).

So there were only two hypotheses: SIDS deaths by chance (event C) or deaths by murder (ac-
tually event C̄ – “not C”). The probability of C was, by Meadow's calculation, 1/73,000,000.
This was “too small” for C to be believable, and therefore C̄ must be true. The jury, and
many commentators, were left with the impression that the probability of 1/73,000,000 was
the probability that Clark was innocent. This misinterpretation is so common in law that it
has been named by statisticians “the prosecutor’s fallacy”.
It should be clear that something is missing here. We express the problem through Bayes’s
theorem. We assign prior probabilities to C and C̄, and calculate the likelihood – the prob-
ability of the two deaths – under C and C̄. We have the likelihood under C as Meadow
claimed, but what is the probability that a mother will murder two very young children?
The implication of Meadow’s argument is that this must be large, or at least much larger
than 1/73,000,000.
In his evidence, Professor Dawid searched the UK crime database for relevant information:

In 1996 there were 649,489 live births in England and Wales. Of these babies, 14
were later classified as having been murdered in the first year of life. If we were to
take the ratio 14/649,489 as our estimate of the probability that a single baby will
be murdered in the first year of life, and manipulate it in exactly the same way
as he did the SIDS rate, we would calculate that the probability of two babies in
one family both being murdered is [(14/649,489)²], which gives 1 in 2,152,224,291.
On this basis, the “logic” [this tiny probability of C̄] would imply that we could
essentially exclude the possibility that Sally Clark’s two babies were murdered!

However, Dawid does not for a moment accept the Meadow argument, or his own. The
relevant calculation is of the likelihood ratio. If we assume the child murder rate to be
relevant to this case, as the SIDS rate was assumed to be relevant, the likelihood ratio for
C to C̄ would be

(1/8543²) / (4/649,489)² = 19² = 361.
With equal prior probabilities on C and C̄, the posterior probability of C would be 0.9972,
or 361/362.
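As a check of the arithmetic only, a minimal Python sketch (not from the book), using the figures in the likelihood-ratio calculation above:

# Likelihood ratio for C (two SIDS deaths) to C-bar (two murders), and the
# posterior probability of C under equal prior probabilities.
lik_c = (1 / 8543) ** 2
lik_cbar = (4 / 649_489) ** 2
lr = lik_c / lik_cbar
print(round(lr), round(lr / (lr + 1), 4))   # about 361 and 0.9972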
As Dawid concluded, we could essentially exclude the possibility that Sally Clark’s two
babies were murdered. However the statistical evidence was less compelling than the revela-
tion of the suppressed pathological evidence that one child had died of natural causes, and
it was the combination of these two independent sources of information that secured the
successful appeal.

Meadow was struck off the medical register by the General Medical Council in
2005 for serious professional misconduct, but he was reinstated in 2006 after he
appealed and the court ruled that his misconduct was not serious enough to warrant him
being struck off. In June 2005, Alan Williams, the pathologist who conducted
the postmortem examinations on both the Clark babies, was banned from Home
Office pathology work and coroners’ cases for three years after the General Medical
Council found him guilty of serious professional misconduct in the Clark case. This
decision was upheld by the High Court in November 2007.
Clark was permanently affected by the accusation, wrongful imprisonment, and
persecution by other prisoners. She never recovered from the experience, developed
a number of serious psychiatric problems including serious alcohol dependency, and
died of acute alcohol poisoning in her home in March 2007.
(Wikipedia)

Further details of the statistical argument can be found in Professor Phil Dawid’s expert
witness testimony, at www.statslab.cam.ac.uk/∼apd/SallyClark report.doc.
A simple version of this argument comes from early uses of the likelihood function,
in which the design was often omitted, or was implicit in the data to be analysed. The
importance of the design is made clear in an old example sometimes used to attempt to
discredit the use of the likelihood ratio:

A 52-card deck is shuffled, and a card drawn from it at random. It is the King of
Spades. The card dealer says:
“You are a likelihoodist – what is the probability of this deck being a deck
entirely made up of Kings of Spades (KOSs)?
If the deck is a regular deck, the probability of drawing the KOS is 1/52. If the
deck is entirely KOSs, the probability is 1. So the likelihood ratio of KOS to regular
deck is 52:1. You must be almost certain that this is the KOS deck. Of course this
is nonsense.”

What is missing here is the design of the study. Consider this design, of two card decks.
The first deck is made up of 52 KOSs. The second is a regular deck. One deck is chosen
at random, shuffled well and a card is drawn from it. The card is the KOS. What do we
conclude about the chosen deck?
The implicit assumption of the card dealer’s argument is that the prior probability of his
deck being regular is very high, like 1: decks of KOSs don’t occur in regular games. Wainer’s
quote applies here:

[S]ome data we found lying on the street.

The likelihoodist could resolve the matter very quickly by drawing a second card from the
deck. Of course the card dealer cannot allow this, a sure sign that this is a meaningless
example.

6.9 Random variables and their probability distributions


We now need to define formally random variables and their probability distributions.

6.9.1 Definitions
A random variable is a variable which is the outcome of a random process of some kind
whose properties are not deterministic. The variable may be a count, a binary, a category
or a variable measured on a scale. The usual examples of coin tossing and dice throwing are
simple cases. The outcomes of the throw of a die or a coin are uncertain, and cannot be
predicted deterministically.
An argument is sometimes made that this uncertainty is simply a consequence of igno-
rance of the precise details of the construction of the die or coin, the throwing or tossing, the
surface on which the die or coin is landing and the air movement in the throwing or tossing
environment. If these were known then the physical equations of motion under gravity and
friction would determine the face showing.
Since the precision necessary to remove this uncertainty is unknown or unavailable, this
argument is hypothetical. As a consequence we can specify only the probability of each
outcome, from a suitable probability model.
However, this argument provides the basis for a general approach to statistical modelling.
The model has two components: a systematic part which can be specified from the design
of the study, and a random part which cannot be specified except through a probability
distribution.
In traditional statistics courses, a distinction is made between discrete random variables,
which can take only a finite set of possible values, like the die faces, and continuous random
variables, like the phone lifetimes, which can take any value in a continuous interval. This
distinction was denied by the distinguished Australian statistician, Edwin Pitman, in his
last book (Pitman 1979, p. 1):

All actual sample spaces are discrete, and all observable random variables have
discrete distributions. The continuous distribution is a mathematical construction,
suitable for mathematical treatment, but not practically observable.

The distinction is artificial, since every “continuous” variable is recorded with finite measure-
ment precision on a discrete scale, whether of days, shifts, hours, minutes or seconds, or km,
metres, cm, mm or µm (micrometres) etc, and so can take only a finite set of measurable
values, though the number of values may be very large. Removing this distinction allows us
to develop a unified treatment of discrete and continuous random variables.
We allow, in the definition of a random variable, for a countably infinite set of values of
the variable, that is, a set which can be put in 1:1 correspondence with the non-negative
integers. This allows for count random variables for which there may be no physical upper
limit, though the observed values are always finite.
So a random variable Y takes a finite or countably infinite set of ordered distinct values
Y1 < Y2 < · · · < YI < · · ·
in a discrete variable space Y, with a corresponding set of non-negative probabilities pI which
sum to 1 over I.9 The set of possible values YI and corresponding probabilities pI define the
probability distribution of the random variable Y . The probabilities pI as a function of I are
called the probability mass function of Y, sometimes abbreviated to pmf, and the cumulative
probabilities qI = Σ_{J=1}^{I} pJ are called the cumulative distribution function of Y, usually
abbreviated to cdf.
We are often interested in the mean and variance of random variables. The mean is
denoted by µ and the variance by σ 2 ; σ (positive) is called the standard deviation of the
random variable. They are defined by
µ = Σ_I pI YI ,   σ² = Σ_I pI (YI − µ)² ,   σ = √σ².

9 In other treatments of random variables, they are defined over a sample space, which is confusing, because

the definition does not involve sampling in any way.



The term expectation or expected value of a function g(y) of a random variable Y is used
more generally to describe the mean value of that function, and is denoted by E[g(y)].
An example of a random variable with a countably infinite set of possible values is the
Poisson random variable which takes the non-negative integer values YI = 0, 1, 2, . . . with
probabilities pI = e^(−λ) λ^(YI) / YI!, where λ is the mean of the probability distribution:

µ = Σ_{I=0}^{∞} YI pI
  = Σ_{I=0}^{∞} YI e^(−λ) λ^(YI) / YI!
  = λ Σ_{I=1}^{∞} e^(−λ) λ^(YI − 1) / (YI − 1)!
  = λ Σ_{K=0}^{∞} e^(−λ) λ^(YK) / YK!
  = λ,
where K = I − 1. An example of a random variable with a finite set of possible values is the
birthweights of the StatLab boy babies. The birthweight in pounds is recorded in the database
to one decimal place. A graph of the counts of the 648 boys at each distinct value of birth-
weight is given in Figure 6.1. The probability mass function is just a rescaling of the count
axis to give a total scaled count of 1.0. The cumulative probabilities are shown in Figure 6.2.
The sloping S-shape is characteristic of many variables with symmetric or near-symmetric
distributions. We discuss this much later with the Gaussian distribution in Chapter 9.
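The definitions of the mean and variance of a discrete random variable, and the Poisson mean derived above, can be illustrated numerically. A minimal Python sketch (not from the book); the value λ = 3.7 and the truncation of the Poisson values at 200 are arbitrary choices for illustration.

import math

# Mean and variance of a discrete random variable from its pmf:
# mu = sum_I pI YI,  sigma^2 = sum_I pI (YI - mu)^2.
def mean_var(values, probs):
    mu = sum(p * y for y, p in zip(values, probs))
    var = sum(p * (y - mu) ** 2 for y, p in zip(values, probs))
    return mu, var

lam = 3.7
values = list(range(200))   # truncation of the countably infinite support
probs = [math.exp(-lam) * lam ** y / math.factorial(y) for y in values]

print(sum(probs))                # essentially 1
print(mean_var(values, probs))   # mean essentially lambda = 3.7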

FIGURE 6.1: Boy birthweight counts (count plotted against birthweight).
FIGURE 6.2: Boy birthweight cumulative proportions (cumulative proportion plotted against birthweight).

6.10 Sums of independent random variables


In many models we encounter sums of the observations in the analysis. It is useful to have
some general results on the properties of sums of independent random variables. The distri-
bution of a sum T = Y1 + · · · + Yn of independent random variables depends in general on
the forms of the distributions f (yi | θ), but the means and variances of the sum have simple
properties.
Whatever the means µi of the individual Yi, the mean of the sum T is the sum of the means:
E[T] = Σ_{i=1}^{n} µi, and the variance of the sum T is the sum of the variances σi²:
Var[T] = Σ_{i=1}^{n} σi². The variance of cYi, where c is a constant and Yi has variance σi²,
is c²σi².
Several stronger results occur in special cases:

• If the Yi are Gaussianly distributed, so is their sum T .


• For arbitrary distributions for the Yi , so long as they have finite variances, the distri-
bution of T approaches the Gaussian distribution as n increases; we say the asymptotic
distribution of T is Gaussian.

The second property is the famous Central Limit Theorem, commonly abbreviated to CLT.
We do not prove any of these properties: they are central to the frequentist theory but are
largely irrelevant to the Bayesian theory which we will develop.
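These properties of sums can be illustrated by simulation. A minimal Python sketch (not from the book); the exponential components, with mean 2 and variance 4, are an arbitrary choice.

import random
import statistics

# The sum of n independent exponential variables has mean n*2 and variance n*4,
# and its distribution is approximately Gaussian for moderate n (the CLT).
random.seed(1)
n, reps = 30, 20_000
sums = [sum(random.expovariate(0.5) for _ in range(n)) for _ in range(reps)]

print(statistics.mean(sums))       # close to 30 * 2 = 60
print(statistics.variance(sums))   # close to 30 * 4 = 120

# Crude check of approximate Gaussian shape: about 95% of the simulated sums
# should lie within 1.96 standard deviations of their mean.
m, s = statistics.mean(sums), statistics.stdev(sums)
print(sum(abs(x - m) < 1.96 * s for x in sums) / reps)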
7
Statistical inference I – discrete distributions

7.1 Evidence-based policy


It seems obvious that policy decisions in all fields should be based on the best evidence
available about the possible alternative policies. If not enough evidence is available, then
more should be sought through observational or experimental studies.
Before developing the statistical theory and data analyses used in this book, we need
to understand the background to evidence-based policy – the development of public policy
based on evidence obtained through carefully designed studies, including controlled clinical
trials, sample surveys and experimental designs and their careful statistical analyses.
In discussing population studies, samples or other data sets in this book, questions which
have to be addressed are:
• why do we have these data?

• to whom is the study important?


• how was the study designed?

7.2 The basis of statistical inference


The theory of probability deals with the properties of unobserved samples which can be
drawn from known populations. The theory of statistical inference deals with the properties
of unknown populations from observed samples which have been drawn from them.
This inversion of sample and population is critical to statistical inference, and is expressed
through Bayes’s theorem, which we discuss in §3.5. The theorem is based on the evidence
function – the likelihood function which conveys the evidence provided by the design, the
probability model and the sample data about the research question. In this chapter we
consider two special models – the binomial and Poisson distributions. In later chapters we
extend this analysis to other one-parameter models, and to two-parameter models.1
First we give a short discussion of a major division in statistical theory, and in the statis-
tics profession and its practice. This concerns the role of probability theory and probability
models in statistical inference. A detailed history of this development is given in Aitkin
(2010), and a short list of the main statisticians contributing to the development is given in
Appendix 2.

1 In some science and technology fields the term parameters is used for what we call variables.


7.3 The survey sampling approach


Opinions of this field have varied over time:

We are thus entering a rather narrow area of statistical theory, but it is an area
which has been intensively cultivated, and this on the grounds of its practical
importance rather than of its mathematical attractiveness.
(Kendall and Stuart 1966, p. 166)

Among the branches of statistics, survey sampling is notable for its public impor-
tance and its theoretical isolation. It should perhaps be an important component
of every statistician’s education, but by and large is neglected, and when not ne-
glected, found to be an alien subject having its own rules and orientation at odds
with standard methods of statistical inference. Students of statistics catch a glimpse,
shudder, and pass on.
(Valliant, Dorfman and Royall 2000, p. xv)

Survey sampling is a major field of application of statistics and it is one of the most
satisfying and useful fields of statistics where both the target of inference is solid
and observable and the range of models and associated methods used in modern
statistics can be applied.
(Chambers and Clark 2012, Preface summary)

The survey sampling approach is used almost universally by National Statistical Offices or
Central Bureaus of Statistics. Much of the routine work of these institutions (though less so
now than in the past) focuses on population and sub-population means, or totals, of important
variables. The simplest problem of inference is how to relate the sample mean ȳ of a variable
Y, from a sample of size n, to the population mean µ in the finite population of size N. It
makes the repeated sampling principle central to the analysis:

Statistical procedures should be evaluated on the basis of their behaviour in hypothetical
repetitions of the experiment that generated the original data.

Under this principle the variability in a parameter estimator, like the sample mean, is assessed
from its sampling distribution in conceptual – hypothetical – repeated samples of the same
size drawn from the same population. It does not require any population probability model
for the response variable Y, as such models make assumptions which cannot be verified
from the sample data, and which could lead to misleading conclusions if the probability
assumptions were incorrect. So the sampling process is not viewed as giving sample values of
a random variable Y : it gives sample values of a random variable U, the selection indicator.
The randomness in the sample values is a consequence of the random selection process for
the sampling of the finite population.
This selection process defines a set of N binary selection indicators u1 , . . . , uN, with
uI = 1 if population member I is selected into the sample, and uI = 0 otherwise. Then
the sample mean ȳ of the n sampled values y1 , . . . , yn can be expressed as a weighted linear
combination, with weights YI /n, of the randomly generated uI :
ȳ = Σ_{i=1}^{n} yi / n = Σ_{I=1}^{N} uI YI / n.

The observed sample is referred to a set of hypothetical other samples of the same size
which might have been drawn, but were not, by the selection process. In this hypothetical
repeated sampling, the uI and the resulting selected YI – the sample values yi – will vary.
The repeated sampling distribution of ȳ is that of a weighted linear function of (correlated)
binary Bernoulli random variables UI (correlated because their sum is the fixed sample size
n). The Central Limit Theorem can be used to obtain the asymptotic Gaussian distribution
of this linear function. Theories of inference which rely on the repeated sampling principle
for inference, like the survey sampling theory, are generally called frequentist.
For simple random sampling with Pr[UI = 1] = p = n/N for all I, the mean and variance
of the sampling distribution of ȳ can be easily found, in terms of the population mean µ and
variance σ 2 of the variable Y :
E[ȳ] = Σ_{I=1}^{N} p YI / n = Nµp/n = µ;

Var[ȳ] = [ Σ_{I=1}^{N} YI² Var[UI] + Σ_{I=1}^{N} Σ_{J≠I} YI YJ Cov[UI, UJ] ] / n².

The sample mean is an unbiased estimator of the population mean. Here, the word unbiased
in an estimator means that the expected value of the estimator is equal to the parameter it
is meant to be estimating. If its expected value is not equal to the parameter, the estimator
is biased. For the variance of ȳ we need the joint distribution of pairs of the UI . These are
not independent: their joint inclusion probability is

Pr[UI = 1, UJ = 1] = Pr[UI = 1] Pr[UJ = 1 | UI = 1]
                   = (n/N) · (n − 1)/(N − 1)
                   = π · (π − 1/N)/(1 − 1/N),

where π = n/N is called the sample fraction – the proportion of the population which is
included in the sample. Then

Cov[UI, UJ] = n(n − 1)/[N(N − 1)] − (n/N)²
            = −[1/(N − 1)] (n/N)(1 − n/N)
            = −π(1 − π)/(N − 1),
Corr[UI, UJ] = −1/(N − 1).

Var[ȳ] = Σ_I YI² Var[UI] / n² + Σ_I Σ_{J≠I} YI YJ Cov[UI, UJ] / n²
       = [(1 − n/N)/(nN(N − 1))] [ (N − 1) Σ_I YI² − Σ_I Σ_{J≠I} YI YJ ]
       = [(1 − n/N)/(n(N − 1))] Σ_I (YI − µ)²
       = (1 − n/N) σ²/n
       = (1 − π) σ²/n

if the population variance σ² is defined as Σ_I (YI − µ)²/(N − 1). The first term (1 − n/N)
is called the finite population correction and is written (1 − π). For a small sample fraction,
Var[ȳ] ≃ σ 2 /n, but as π → 1, n → N , and Var[ȳ] → 0, since the sample exhausts the
population. In regression models, the foundation of all complex analyses, the least squares
principle (the Gauss-Markov theorem) is used to estimate regression coefficients.
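The mean and variance of the sample mean under simple random sampling without replacement can be checked by simulation. A minimal Python sketch (not from the book); the population values are generated arbitrarily for illustration.

import random
import statistics

# Simple random sampling without replacement from a fixed finite population:
# E[ybar] = mu and Var[ybar] = (1 - n/N) * sigma^2 / n, with sigma^2 defined
# with divisor N - 1.
random.seed(2)
N, n = 500, 50
Y = [random.gauss(10, 3) for _ in range(N)]   # a fixed finite population
mu = statistics.mean(Y)
sigma2 = statistics.variance(Y)               # divisor N - 1, as in the text

ybars = [statistics.mean(random.sample(Y, n)) for _ in range(50_000)]

print(statistics.mean(ybars), mu)                            # agree closely
print(statistics.variance(ybars), (1 - n / N) * sigma2 / n)  # agree closely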
This theory has been extensively developed for very complicated survey designs. Without
a model for the YI , there are no optimal procedures (in the model-based sense of §7.4):
competing procedures have to be evaluated by their biases and variances. As Little (2004,
p. 547) described it sadly in his review paper, it is all a matter of judgement. We change his
notation I to our U:

For inference about a finite population quantity Q = Q(Y), the following steps are
involved:
1. Choosing an estimator q̂ = q̂(Yinc, U), a function of the observed part Yinc of
Y, that is unbiased or approximately unbiased for Q with respect to the distribution
of U. . . .
2. Choosing a variance estimator v̂ = v̂(q̂(Yinc, U)) that is unbiased or approximately
unbiased for the variance of q̂ with respect to the distribution of U.

As Valliant, Dorfman and Royall (2000, p. 11) put it:

As sampling plans and estimators become more complicated, so do the technical problems.
It becomes harder to calculate the repeated sampling bias or discover when a
particular approximation to the bias is useful. It becomes even harder to find and
evaluate estimates of the sampling standard error and to determine whether the
normal approximation to the sampling distribution is adequate.

So this inference procedure is not a science or a technology – art plays an important role. In
a much earlier review paper, Smith (1976) wrote in his conclusion, with the same frustration:

The basic question to ask is why should finite population inference be different from
inferences made in the rest of statistics? I have yet to find a satisfactory answer.
My view is that survey statisticians should accept their responsibility for providing
stochastic models for finite populations in the same way as statisticians in the
experimental sciences. These models can then be treated within the framework of
conventional theories of inference. The problems with the Neyman approach then
disappear to be replaced by disputes between frequentists, Bayesians, empirical
Bayesians, fiducialists and so on. But at least these disputes are common to all
branches of statistics and sample surveys are no longer seen as an outlier.

What is remarkable is that survey statisticians had already done this (Hartley and Rao 1968;
Ericson 1969), but their work was ignored for many years, and is still not taken seriously.
Further details are not given here; they can be found in Aitkin (2010, Chapter 4), and will be
revisited in Chapter 12, where the multinomial distribution for the YI plays the role Smith
wished for: a stochastic always true model for any finite population.
Models are, however, used in survey sampling for two specific classes of problems: small-
area estimation (SAE) and incomplete or missing data.

• SAE: involves two-stage sampling, with small numbers of secondary sampling units.
These small samples give unreliable population estimates for their areas, and need to be
strengthened by borrowing strength from a distributional model for the area means. We
do not discuss two-stage sampling in this book.

• Incomplete data: incomplete records involving missing response variables or covariates
are common in surveys of all kinds. We discuss this in Chapter 15.

7.4 Model-based inference theories


The model-based theory of inference set out first is the frequentist likelihood theory which
was, for more than 60 years, the most useful one, but is now struggling to deal with in-
creasingly complex data and models. It was developed by Fisher (1912, 1922, 1925) and
extended by Neyman and Egon Pearson (1932, 1933). It is followed by the currently most
useful theory, the Bayesian theory, due to Bayes (1763), Laplace (19th century) and Jeffreys
(1961).
Both theories are based on
• a set of scientific (in the wide sense) research questions to be answered, or at least
investigated, by a research study;

• a design D for the study; and


• a probability model f (Y | θ) for the random variable Y , which is used to represent the
data y produced by the study.
The probability model depends in this chapter, and the next, on a single parameter – an
unknown constant – denoted here by θ whose value controls the location of the data of the
study. The parameter θ can take any (real) value in a parameter space Θ. The value of the
parameter is relevant to the scientific questions which led to the study. In later chapters the
model is extended to include covariates x, often called explanatory variables (we will not use
this term). We will write the general probability model as f (Y | x, θ), where the parameter
θ may have high dimension.
The aim of the statistical inference is to determine what can be learnt about the parame-
ter, the scientific questions, and the appropriateness of the probability model, from the data.
The information necessary for this learning is provided by the likelihood function.
In the Data Science vocabulary, the term Machine Learning is widely used. It refers to
the analysis of data of some kind by a computer program written to perform an analysis
of some kind. This may be fitting a model by maximum likelihood, or a Bayesian analysis,
or an analysis by a computational algorithm without a model or a likelihood. It may be
performing many analyses with different models or algorithms, and choosing the best, or
averaging them, or listing them with the preference order of the models. Machines do not
learn: they are programmed to carry out the designer’s or the programmer’s analysis, and we
learn what can be learnt from what is programmed.

7.5 The likelihood function


The likelihood function is the probability of the observed data, as a function of the design and
the model parameters θ. The design is the way in which the data were collected or generated,
through a random process of some kind.
Data collected by a non-random process cannot be analysed if we do not know the process
design which generated them.
As we discussed in Chapter 4, samples generated by voluntary response are always biased,
as are data found lying on the street.
In survey sampling a specific survey design is always used to guarantee a random sample
from the population of size N . We express the likelihood L(θ | D, y), given the design D
and the data y, through two random processes:
• first, drawing population member I with probability πI ;

• second, obtaining from drawn member I the response YI with probability pI .


As we described earlier, it is convenient to define an indicator variable UI = 1 if member I
is drawn into the sample, and UI = 0 if member I is not drawn. Then for the observed sample
L(θ | D, y) = Π_{I=1}^{N} [Pr(UI) Pr(yI | UI = 1)].
We now assume that in the design the two processes are independent across the drawing
of the sample, that is, that the probability model for YI does not depend on I. The joint
processes are unrelated, and are identical for each population member drawn.2 The response
values vary by I according to the probability model f (y), but the form of f does not. We
say that the values yI drawn are independent and identically distributed. (In models with
covariates x, the values of the covariates and the response variable both vary with I. The
values drawn are then independent but non-identically distributed.)
Then under the assumption of independence,

L(θ | D, y) = Π_{I=1}^{N} [Pr(UI) Pr(yI | UI = 1)]
            = Π_{I=1}^{N} [Pr(UI) Pr(yI)]
            = Π_{I=1}^{N} (πI pI)^(uI)
            = [ Π_{I=1}^{N} πI^(uI) ] · [ Π_{I=1}^{N} f(YI | θ)^(uI) ].

Under this assumption of independence between the sample selection process and the re-
sponse variable distribution (called a non-informative or ignorable sample design), the first
term in the likelihood is a constant, depending only on the design of the data collection.
(For example, a simple random sample of size n has selection probabilities πI = 1/N for all
population members.)
2 In the next section we give an example in which this independence does not hold.

The second term depends on the probability model, and can be (and generally is) written
in terms of the observed sample values yi , i = 1, . . . , n rather than the partly observed
population values. So the likelihood is generally written
n
Y
L(θ | y) = c · f (yi | θ),
i=1

where c is a constant, not involving the model parameters θ. (Many treatments of likelihood
omit any constant, but we retain the constant, for reasons which will become clear later.)
The first research question to be investigated in this chapter is the proportion of StatLab
mothers who were smoking at the diagnosis of their pregnancy.

7.6 Binomial distribution


In this section we concentrate on very small samples, to illustrate the difficulties of frequentist
theory which have made this application often confusing for students. We draw four simple
random samples of ten, and combine the first two and the last two into two subsamples of
20, and then pool these to give a single sample of 40.
What proportion p of mothers were smoking at the diagnosis of pregnancy? The database
does not contain any indication of whether a smoking mother continued to smoke, or stopped
smoking, after the diagnosis, though it does contain information about the mothers’ smoking
at the ten-year follow-up.
To assess the proportion smoking at diagnosis we define, for each mother I in the popu-
lation, a binary indicator (random variable) YI = 1 if the mother was a smoker, and YI = 0
if the mother was a non-smoker, at the diagnosis of pregnancy. Then the proportion p of
smoking mothers in the population is p = Σ_{I=1}^{N} YI / N.
For a randomly drawn mother I, Pr[YI = 1] = p, Pr[YI = 0] = 1−p, and for each random
variable YI , we have E[YI ] = p, Var[YI ] = p(1 − p).

7.6.1 The binomial likelihood function


We develop this function for the mothers’ smoking variable. For the first sample we drew ten
times (with replacement)3 a random family from the population of 1,296, each time with the
same probability 1/1,296, and then observed the value of mother’s smoking (treated as either
currently smoking – a “failure” – or non-smoking or quit – a “success”). The probability of
a failure is the proportion of failures in the population, that is, the proportion p of mothers
smoking at the diagnosis of pregnancy. In the first sample of 10 we obtained 4 smoking
mothers, in the sequence FSSSFFSSSF, corresponding to the binary indicators 1000110001.
The likelihood function is the probability, for the ten families drawn, of the joint events
“family number I was drawn” and “the smoking status of the mother was YI ”, either 1 for
F or 0 for S.
Since the population was unchanged by the successive draws, the probability of the full
sample is the product across the sample members of the probabilities of being drawn, and
giving the smoking response. So, writing L(p) for the likelihood, we can express it as
L(p) = Π_{I∈s} [(1/1296) · p^(YI) · (1 − p)^(1−YI)],

3 Practical sampling is done without replacement, discussed in §6.13.



where the notation I ∈ s means that population member I was drawn in the sample. (This
is a standard notation in survey sampling.)
It is convenient to have a similar index notation for the sample members drawn, as well
as for the full population. We write the sample index as i, ranging from 1 to n, and the
smoking “indicator variable” (1 for smokers, 0 for non-smokers) for the i-th sample member
as yi . Then equivalently
L(p) = Π_{i=1}^{n} [(1/1296) · p^(yi) · (1 − p)^(1−yi)]
     = c · p^r (1 − p)^(n−r),

where r = Σ_i yi = 4, Σ_i (1 − yi) = n − r = 6, and c is the constant (1/1,296)^10.
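The equivalence of the product form of the likelihood and the form c · p^r (1 − p)^(n−r) can be checked numerically. A minimal Python sketch (not from the book):

# Likelihood for the first sample of 10 mothers: the product over sample
# members equals c * p^r * (1 - p)^(n - r), where r is the number of smokers.
y = [1, 0, 0, 0, 1, 1, 0, 0, 0, 1]   # 1 = smoker (F), 0 = non-smoker (S)
n, r = len(y), sum(y)                # n = 10, r = 4
c = (1 / 1296) ** n                  # the design constant

def lik_product(p):
    out = 1.0
    for yi in y:
        out *= (1 / 1296) * p ** yi * (1 - p) ** (1 - yi)
    return out

def lik_sufficient(p):
    return c * p ** r * (1 - p) ** (n - r)

for p in (0.2, 0.4, 0.6):
    print(p, lik_product(p), lik_sufficient(p))   # the two forms agree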

7.6.1.1 Sufficient and ancillary statistics


In general, the contribution to the likelihood of each observation has to be computed sep-
arately for each observation, but in many of the models we will use, like the binomial, the
likelihood function depends on the observation values only through one or more composites
of these values, and not through the values themselves. These composites are called sufficient
(for the likelihood) statistics, and are a great computational convenience. The binomial like-
lihood depends on one sufficient statistic r, in addition to the fixed and given sample size n,
and is a scale multiple of the binomial probability of r failures in n trials, (n choose r) p^r (1 − p)^(n−r).
A graph of this function, at the grid spacing of 1/1,296, is shown in Figure 7.1.
At this very fine grid spacing we cannot distinguish the individual grid values: the function
when graphed looks like a continuous function of p, though it is discrete, and is graphed with
a point plotting symbol, not a line.

FIGURE 7.1: Binomial likelihood, n = 10, r = 4 (likelihood plotted against p).

FIGURE 7.2: Binomial likelihood, n = 10, r = 3 (solid) and 4 (dotted).

The second sample gave three smoking mothers. The likelihood functions of p for n = 10
are shown in Figure 7.2 for r = 3 (solid curve) and 4 (dotted curve), in Figure 7.3 for r = 7
when n = 20, and in Figure 7.4 for r = 14 when n = 40. For the last two figures the observed
proportion of smoking mothers is the same (0.35) – but the likelihood is more concentrated
in the larger sample – we are more certain that p is in the range 0.2–0.6. These figures give
us a visual impression of the plausibility of other values of p. If the observed number of
successes is 0 or n, the likelihood has its maximum on the boundary of the parameter space,
at 0 or 1 (Figure 7.5).

FIGURE 7.3: Binomial likelihood, n = 20, r = 7.
FIGURE 7.4: Binomial likelihood, n = 40, r = 14.
FIGURE 7.5: Binomial likelihood, n = 10, r = 0.
Does the likelihood convey any other information, beyond the sufficient statistic? This
can be assessed from the conditional distribution of the data given the sufficient statistic:

Pr[ {yi} | Σ_i yi = r ] = Π_{i=1}^{n} [ (1/1296) · p^(yi) (1 − p)^(1−yi) ] / [ (n choose r) p^r (1 − p)^(n−r) ]
                        = Π_{i=1}^{n} (1/1296) / (n choose r).

The result is just the probability of the observed sequence 1000110001, divided by the bino-
mial coefficient. Under the model, any other sequence would have the same probability, so
if the model is correct, the particular sequence we observed would be just as likely as any
other sequence – it tells us nothing about the probability of smoking. The sequence looks
like a random permutation of the 0s and 1s, as expected from the random sampling.
If the sequence had been 1111000000, this would look non-random, as though some event
changed the sample draws from non-smokers to smokers, or the random sampling design
had not been followed. The sequence of successes and failures is called an ancillary statistic
– a function of the data which is not informative about the model parameter p, but may
be informative about other aspects of the study design. There are several ways of assessing
departures from randomness, using properties of sequences – runs – of zeros and ones. We do
not discuss these further.
A potentially important point which we have not taken into account is that, once we
have the sample, we can see that there are other constraints on the possible values of p. We
have observed four smokers and six non-smokers in the sample, so in the population there
must be at least four smokers, and six non-smokers, so at most 1,290 smokers. So the possible
values of p must be in the smaller interval [4/1,296, 1,290/1,296], which is [0.0031, 0.9954].
The very small sample does not much restrict the possible values of p.

7.6.1.2 The maximum likelihood estimate (MLE)


The value of p at which the likelihood is maximised is called the maximum likelihood estimate,
abbreviated to MLE. A tabulation of the likelihood function (used to construct the graph
in Figure 7.1) gives the MLE as 518/1,296 = 0.39969, with the corresponding maximised
likelihood of 0.208227 to 6 dp.
The appearance of a continuous likelihood function suggests that the MLE could be
evaluated without computation, by differentiation of the log of the likelihood function:

ℓ(p) = log L(p) = r log p + (n − r) log(1 − p),
ℓ′(p) = r/p − (n − r)/(1 − p) = 0

when p = r/n. The “continuous” maximum probability of four smoking mothers occurs at
pb = 0.4, and is 0.208222 to 6 dp. The estimates agree to 3 dp and the likelihoods agree to 4
dp, which is sufficiently accurate for the continuous assumption to be useful. We now use a
continuous approximation to the discrete likelihood: as we saw from its graph it is impossible
to distinguish visually the discrete population values in the population of size 1,296.
Not surprisingly, what we observe in the sample has the highest probability when the
population proportion is the same as the sample proportion! However the MLE does not
summarise the information in the likelihood. Other values of p also give high probability –
near the maximum – to r = 4; these values have high likelihoods. At p = 0.35 or 0.45, the
probability of r = 4 is 0.238 (to 3dp), so these values of p are also very plausible.4
Further analysis requires a theory of statistical inference.
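The grid and calculus versions of the MLE can be compared directly. A minimal Python sketch (not from the book), using the kernel p^r (1 − p)^(n−r), since any constant multiple has the same maximising value:

# Maximum likelihood for the binomial model: the maximiser over the grid
# p = k/1296 and the calculus solution p-hat = r/n.
n, r = 10, 4

def kernel(p):
    return p ** r * (1 - p) ** (n - r)

grid = [k / 1296 for k in range(1297)]
p_grid = max(grid, key=kernel)   # grid MLE, 0.39969...
p_mle = r / n                    # calculus MLE, 0.4
print(p_grid, p_mle)

# Relative likelihood of a nearby value: p = 0.35 is almost as plausible.
print(kernel(0.35) / kernel(p_mle))   # about 0.95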

7.7 Frequentist theory


There is a long-running split in non-Bayesian model-based inference between the frequen-
tist school, represented by Neyman, and the likelihood school, represented (and invented)
by Fisher. The frequentist theory, like the survey sampling theory, relies on the repeated
sampling principle for the assessment of evidence, to generate a probability of some event
defined over the repeated samples. In this theory, the role of the likelihood function is simply
to provide estimators for the model parameters. Maximum likelihood is just one method for
generating estimators; there are others. The variability in a parameter estimate, like the
sample proportion, is assessed from its sampling distribution in (conceptual – hypothetical)
repeated samples of the same size drawn from the same population. However it is the values
of the random variable yi which are regarded as repeatedly sampled. The sample selection
indicators used in the survey sampling theory play no role in the analysis, if the design is
non-informative: they are simply an additional ancillary statistic.
How do we assess the repeated sampling variability of the estimator p̂? Since this is
Σ_{i=1}^{n} yi / n, the variance of p̂ is immediate: Var[p̂] = Σ_{i=1}^{n} Var(yi)/n² = p(1 − p)/n. This
does not help us to make direct probability statements about p through p̂, because we do not
know the value of p to substitute in the variance. We know an upper bound for the variance,
since p(1 − p) ≤ 1/4, so Var[p̂] ≤ 0.25/n.
Through the Central Limit Theorem, we know the asymptotic distribution of p̂ is Gaussian;
in the limit as n → ∞ the variance → 0 and p̂ → p. So an approximation to the finite-sample
distribution of p̂ is Gaussian with mean p and variance p̂(1 − p̂)/n. The standard error (SE)
of p̂ is the square root of the approximate variance: SE(p̂) = √[p̂(1 − p̂)/n].
We examine in the following the consequences of this for inference about p.
Fisher’s definition of the likelihood and its properties provided a criterion for the choice
of an estimator and a general way of obtaining the (exact or asymptotic) distribution of es-
timators of model parameters. Fisher was not a frequentist: he referred to the unique sample
providing the probability distribution of the estimator through algebraic manipulation of
the likelihood. He dismissed repeated sampling interpretations of the distribution of maxi-
mum likelihood estimators and functions derived from them, including confidence intervals,
introduced by Neyman.
We illustrate this with the binomial model:
ℓ(p) = log L(p) = r log(p) + (n − r) log(1 − p)
′ r n−r
ℓ (p) = −
p 1−p
′′ r n−r
ℓ (p) = − 2 − .
p (1 − p)2
4 It is easy but unsatisfactory to say that they are very likely, since this has not been given a qualifying

dimension. Plausibility is invoked instead as an understood concept.



The maximum of ℓ(p) (and L(p)) occurs at p̂ = r/n, the MLE. Since r has a binomial
distribution with mean np and variance np(1 − p), p̂ = r/n has a scaled binomial distribution
with mean p and variance p(1 − p)/n. We have the same result as from the direct calculation
of the variance, and the same difficulty in using it.
Although the repeated sampling principle was invoked above for the frequentist asymp-
totic Gaussian distribution of p̂, this distribution approximation does not depend at all
on a repeated-sampling interpretation; it follows directly from the binomial distribution
properties and the Central Limit Theorem. This is true for many other MLEs in simple
models.
The frequentist inference then consists of quoting the MLE p̂ of p as the measure of
location or centrality, the SE as the measure of variability and a confidence interval for p as
a summary of the extent of the plausible variation in the inference about p. The confidence
interval for p is the MLE ± λSE based on the asymptotic Gaussian distribution, where λ is
chosen to give a specified probability coverage in repeated sampling (frequently 95%, with
λ = 1.96).
This is an important limitation for a general theory: the theory depends on an asymptotic
assumption, which is adequate only under restrictive conditions. We discuss this further in
§9.5 on the Gaussian distribution.
For our samples of n = 10, 20 and 40 with r = 1, 3, 4, 7 and 14, the 95% confidence
intervals for p are shown in Table 7.1 to 3 dp, using p̂ ± 1.96 SE(p̂). The intervals shorten
as the sample size increases, but even for n = 40 they are far from precise. The variance
approximation fails for r = 1, n = 10. It is not sufficient to truncate the interval below zero:
the zero value for p is impossible if r = 1. Truncating it just above zero requires a choice of
truncation point. The asymptotic theory does not handle this event.

TABLE 7.1
95% confidence intervals for p

 r    n    95% interval
 1   10    −0.086, 0.286
 3   10     0.016, 0.584
 4   10     0.096, 0.704
 7   20     0.141, 0.559
14   40     0.202, 0.498
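Table 7.1 can be reproduced in a few lines. A minimal Python sketch (not from the book) of the Wald interval p̂ ± 1.96 SE(p̂):

import math

# 95% confidence intervals p-hat +/- 1.96 * sqrt(p-hat * (1 - p-hat) / n).
for r, n in [(1, 10), (3, 10), (4, 10), (7, 20), (14, 40)]:
    p_hat = r / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    print(r, n, round(p_hat - 1.96 * se, 3), round(p_hat + 1.96 * se, 3))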
A remarkable feature of the likelihood is that asymptotically (that is, in large samples),
when its maximum is internal to the parameter space – not on a boundary – the log-likelihood
approaches a quadratic in the model parameters. The linear and quadratic terms define the
MLE and its SE, and other terms tend to zero with increasing sample size.
We discuss this at length in §9.5, but here give the results for the binomial distribution,
assuming that r ≠ 0 or n (maximum not on the boundary). The justification of this comes
from the Taylor expansion of the log-likelihood (here for a single parameter p) about the
maximising value p̂, assumed to be an internal point in the parameter space. We write

ℓ(p) = log L(p) = ℓ(p̂) + (p − p̂)ℓ′(p̂) + (1/2!)(p − p̂)²ℓ′′(p̂) + (1/3!)(p − p̂)³ℓ′′′(p̂)
       + (1/4!)(p − p̂)⁴ℓ′′′′(p̂) + · · ·
     = ℓ(p̂) + (1/2)(p − p̂)²ℓ′′(p̂) + (1/6)(p − p̂)³ℓ′′′(p̂) + (1/24)(p − p̂)⁴ℓ′′′′(p̂) + · · ·

since the first derivative is zero at the MLE. Writing σ̂² = −1/ℓ′′(p̂) = p̂(1 − p̂)/n, we have
after some algebra

ℓ′′′(p̂) = 2(1 − 2p̂)/(nσ̂⁴),
ℓ′′′′(p̂) = −6(1 − 3p̂ + 3p̂²)/(n²σ̂⁶),

ℓ(p) = ℓ(p̂) − (p − p̂)²/(2σ̂²) + [(1 − 2p̂)/(3nσ̂⁴)](p − p̂)³ − [(1 − 3p̂ + 3p̂²)/(4n²σ̂⁶)](p − p̂)⁴ + · · ·

The cubic and quartic terms in the expansion are of smaller order in the sample size n than
the quadratic, so as n increases,
ℓ(p) − ℓ(p̂) → −(1/2)(p − p̂)²/σ̂²,

which is equivalent on exponentiation to

L(p)/L(p̂) → exp[−(1/2)(p − p̂)²/σ̂²].

So under this quadratic assumption the likelihood approaches a multiple of a Gaussian den-
sity function of p̂ with mean p and variance σ̂². In this case we can express the inference
about p through the MLE p̂ as the best estimate of p, and its precision from the standard
deviation σ̂. We can check the quadratic assumption by computing the quadratic log-likelihood
approximation:

ℓ(p) ≈ ℓ(p̂) − (1/2)[n/(p̂(1 − p̂))](p̂ − p)²
     = r log(p̂) + (n − r) log(1 − p̂) − (p̂ − p)²/(2 Var(p̂)),
L(p) ≈ p̂^r (1 − p̂)^(n−r) · exp[−(p̂ − p)²/(2 Var(p̂))].

Figures 7.6, 7.7, 7.8 and 7.9 give the binomial (solid curve) and approximating Gaussian
(dotted curve) likelihoods for the cases r = 1 and 4 for n = 10, and r = 7, n = 20 and
r = 14, n = 40. They are identical at the MLE, and diverge away from it, decreasingly with
increasing n: the skew in the binomial likelihood reduces with increasing n for fixed p̂.
At n = 10 and r = 1, 3 or 4, the Gaussian approximation gives positive likelihood to
negative values of p, which are impossible. They do not appear in the graphs which are
constrained to the possible values, but the Gaussian approximation does not descend to zero
at the left end.

FIGURE 7.6: Likelihood (solid) and Gaussian approximation (dotted), n = 10, r = 1.
FIGURE 7.7: Likelihood (solid) and Gaussian approximation (dotted), n = 10, r = 4.
FIGURE 7.8: Likelihood (solid) and Gaussian approximation (dotted), n = 20, r = 7.
FIGURE 7.9: Likelihood (solid) and Gaussian approximation (dotted), n = 40, r = 14.
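The comparison in these figures can be reproduced by scaling both the exact likelihood and its Gaussian approximation to 1 at the MLE. A minimal Python sketch (not from the book):

import math

# Relative likelihood L(p)/L(p-hat) and its Gaussian approximation
# exp(-(p - p-hat)^2 / (2 * p-hat * (1 - p-hat) / n)), as in Figures 7.6-7.9.
def relative_likelihoods(n, r, p):
    p_hat = r / n
    exact = (p / p_hat) ** r * ((1 - p) / (1 - p_hat)) ** (n - r)
    gauss = math.exp(-(p - p_hat) ** 2 / (2 * p_hat * (1 - p_hat) / n))
    return exact, gauss

for p in (0.1, 0.25, 0.4, 0.55, 0.7):
    print(p, relative_likelihoods(10, 4, p))   # close near p-hat = 0.4, diverging in the tails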

7.7.1 Parameter transformations


One useful ad hoc approach to the difficulty of the finite range of the parameter p is to
transform the parameter to a scale which has an infinite range. A confidence interval for the
transformed parameter will then not exceed the range of the transformed parameter, and
the interval for the transformed parameter can be back-transformed to the p scale.
We illustrate with the logistic transformation, used in Chapter 16 for regression with bi-
nary response variables. We use it here for the single parameter p. The logistic transformation
FIGURE 7.6
Likelihood (solid) and Gaussian approximation (dotted), n = 10, r = 1


FIGURE 7.7
Likelihood (solid) and Gaussian approximation (dotted), n = 10, r = 4

FIGURE 7.8
Likelihood (solid) and Gaussian approximation (dotted), n = 20, r = 7


FIGURE 7.9
Likelihood (solid) and Gaussian approximation (dotted), n = 40, r = 14

of p is to θ = log[p/(1 − p)]. The reverse transformation is p = exp(θ)/[1 + exp(θ)]. If we
make this substitution in the likelihood, we have

L(θ) = [exp(θ)/(1 + exp(θ))]^r [1/(1 + exp(θ))]^(n−r)
     = exp(rθ)/[1 + exp(θ)]^n
ℓ(θ) = rθ − n log[1 + exp(θ)]
ℓ′(θ) = r − n exp(θ)/[1 + exp(θ)]
θ̂ = log[p̂/(1 − p̂)]
ℓ′′(θ) = −n exp(θ)/[1 + exp(θ)]²
Var[θ̂] = [1 + exp(θ̂)]²/[n exp(θ̂)]
        = 1/[n p̂(1 − p̂)].

So a 95% confidence interval for θ is θ̂ ± 1.96 √(1/[n p̂(1 − p̂)]). For the example with one
smoker out of 10, we have θ̂ = log(0.10) = −2.303, SE(θ̂) = 1.054. The 95% confidence
interval is −2.303 ± 2.066 = [−4.369, −0.237]. Transforming back to the p scale gives the
interval [0.013, 0.441]. We seem to have solved the problem.
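A sketch of this calculation, assuming Python with numpy; it follows the logit formulas above, and no particular numerical output is asserted here.

import numpy as np

def logit_interval(r, n, z=1.96):
    p_hat = r / n
    theta_hat = np.log(p_hat / (1 - p_hat))            # logit of the MLE
    se_theta = 1 / np.sqrt(n * p_hat * (1 - p_hat))    # SE on the logit scale
    lo, hi = theta_hat - z * se_theta, theta_hat + z * se_theta
    inv_logit = lambda t: np.exp(t) / (1 + np.exp(t))  # back-transform to the p scale
    return inv_logit(lo), inv_logit(hi)

print(logit_interval(1, 10))   # one smoker out of 10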
A non-obvious difficulty here is that the transformed interval will depend on the choice
of transformation. Any cumulative distribution function (cdf) can be inverted, analytically
or computationally, to give such a transformation. There is no principle in statistical theory
by which the choice of this cdf and its transformation can be specified. Another non-obvious
difficulty is that on the θ scale the likelihood (not shown) is skewed in the opposite direction,
so the quadratic assumption is not satisfactory on this scale either. Nor is the parameter
transformation the only way of obtaining a proper interval for p. We do not discuss the other
methods as they are unnecessary in the Bayesian inference framework. The point is that in
this binomial model the frequentist analysis is handicapped by the failure of the asymptotic
approximation in small samples or extreme outcomes.

7.7.2 Ambiguity of notation


There has developed an unfortunate conflation of frequentist and Bayesian terms for prob-
ability distributions. Frequentists have long used the term sampling distribution for the
distribution of a sample statistic like the sample mean in hypothetical repeated sampling,
over which the frequentist sampling distribution is defined.
Recently some Bayesians have begun to use the same term – sampling distribution – for
the probability model for the response variable, regarding it as the distribution of each sample
datum. This is very unfortunate, since it confuses students, and others, about the relation
between frequentist and Bayesian analyses. We restrict the term “sampling distribution” to
the frequentist use, and use the term “probability model” for both frequentist and Bayesian
probability representations – approximations – of the observed data.

7.8 Bayesian theory


We augment the likelihood with a prior (probability) distribution. This represents our infor-
mation about the model parameters before we observe the data, expressed as a probability
distribution. The prior distribution is combined with the likelihood to give the posterior

distribution, the information we have about the model parameters after seeing the data,
through Bayes’s theorem, which we express generally as a theorem in probability.

7.8.1 Bayes’s theorem


In §5.7 on the screening test example, we derived Bayes’s theorem to obtain the conditional
probability of a person tested having the serious condition, given a positive test, from the
reverse conditional probability of a positive test given the person had the condition, and the
prevalence of the condition. We summarise this result again.
For arbitrary events A and B, the probability of the joint event A ∩ B – that A and B
both occur – can be decomposed in two ways, into the marginal probability of one event,
and the conditional probability of the other event, given the first:
Pr[A ∩ B] = Pr[A] Pr[B | A] = Pr[B] Pr[A | B].
Bayes’s theorem follows immediately, that
Pr[B | A] = Pr[B] Pr[A | B]/ Pr[A].
This can be expressed as an updating of the probability of B as a consequence of observing
A: Pr[B] is scaled or weighted by the factor Pr[A | B]/ Pr[A], the ratio of the conditional
probability of A given B to the marginal or unconditional probability of A.
We apply this to the updating of our prior information about the parameter θ (B) from
the observation of the data y (A). We represent the prior information about θ through
the prior distribution π(θ). This is updated by the observation of the data to the posterior
distribution, written as π(θ | y). The probability of the data y under the model is the
likelihood L(θ), and Bayes’s theorem is
π(θ | y) = π(θ) · L(θ)/ Pr[y],
where

Pr[y] = Σ_θ π(θ) · L(θ),
and the “sum” is a finite summation if θ is discrete, or an integral if θ is continuous. The
sum is sometimes called the marginal density of the data or the marginal likelihood. Neither
of these terms is accurate, as we discuss in Chapter 13 on model comparisons. We will call
it the integrated likelihood. Bayes’s theorem is generally stated as
The posterior distribution is proportional to the product of the likelihood
and the prior.
The last term Pr[y] is a proportionality constant, depending on the data but not on the
value of θ. It is an integrating constant, which is necessary for the posterior to be a “proper”
probability distribution, that is, one which sums or integrates to 1 over the parameter space.
Its value is irrelevant to the inference about θ. This is easily understood by writing

π(θ | y) = c · π(θ) · L(θ),

where the constant c has to satisfy


∫ π(θ | y) dθ = c ∫ π(θ) · L(θ) dθ = 1
c = 1/∫ π(θ) · L(θ) dθ

if θ is continuous.
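When θ is restricted to a finite grid of support values, the proportionality constant is just the reciprocal of a finite sum. A minimal sketch, assuming Python with numpy and scipy, and using a binomial likelihood purely for illustration:

import numpy as np
from scipy.stats import binom

r, n = 3, 10
theta = np.linspace(0, 1, 1297)               # finite grid of support points for p
prior = np.full(theta.shape, 1 / len(theta))  # flat prior over the grid
lik = binom.pmf(r, n, theta)                  # likelihood at each support point

posterior = prior * lik
posterior /= posterior.sum()                  # divide by the sum: the constant c
print(posterior.sum())                        # sums to 1 by construction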

In many studies there is no specific prior information about the parameters: the point of
the study is to obtain such information from the data. In other research areas there may be
some information from related or similar studies; the difficulty is in expressing this informa-
tion in a full probability distribution. We discuss in detail below ways in which informative
priors can be constructed. We use first the flat or uniform prior, leaving the likelihood un-
changed. This allows us to describe what the data say through the model likelihood, before
incorporating any informative prior.
We have a very large number – 1,297 – of possible values of p. As we noted from the graph
of L(p), we cannot distinguish the separate points in the graph at this level of resolution. In
such cases it is convenient to treat p as though it is continuous rather than discrete. That is,
we remove the restriction that p can take only the 1,297 values on the fine grid, and allow it
to take any value in [0,1]. This does not change the form of the likelihood, only its support
– the values on which it is defined.
We now define the uniform prior for p as π(p) = 1 for p ∈ [0,1]. To convert the likelihood
into the posterior, we divide by the integral of the likelihood over its range. The integral
∫_0^1 p^r (1 − p)^(n−r) dp is a Beta function:

∫_0^1 p^r (1 − p)^(n−r) dp = B(r + 1, n − r + 1)
                           = Γ(r + 1)Γ(n − r + 1)/Γ(n + 2)
                           = r!(n − r)!/(n + 1)!

The posterior distribution of p is a Beta(r + 1, n − r + 1) distribution:

π(p | r, n) = p^r (1 − p)^(n−r)/B(r + 1, n − r + 1).

The cumulative distribution is not analytic (does not have a simple algebraic representation)
but is extensively tabulated and available as a system function in most statistical packages.
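For example, the quantiles of the Beta posterior can be read off from such a system function; a sketch using scipy (assumed available), for the case r = 3, n = 10 with the uniform prior:

from scipy.stats import beta

r, n = 3, 10
posterior = beta(r + 1, n - r + 1)          # Beta(4, 8) posterior for p
print(posterior.ppf([0.025, 0.5, 0.975]))   # 2.5%, 50% and 97.5% quantiles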

7.8.2 Summaries of the posterior distribution


The posterior distribution is a regular probability distribution, and so we can report its
properties in the usual way for any probability distribution. However an important feature
of posterior distributions of model parameters is that they are frequently not Gaussian and
may be skewed – asymmetric.
This possibility reduces the value of the mean and standard deviation or variance as
simple summaries of the location and scale of the distribution. Instead, it is common practice
in Bayesian analysis to report the median and several percentiles, like quartiles.5 In this book
we use the general term quantiles instead of percentiles. We report the median and the 2.5
and 97.5 quantiles of posterior distributions: the interval between the 97.5 and 2.5 quantiles
is called a 95% (central) credible interval for the parameter θ. It gives a commonly used
measure of precision of the posterior inference. This interval has the advantage that if we
are also interested in a monotone transformation g(θ) of the parameter θ, all quantiles of
the posterior for θ transform in the same way.6
For the examples with n = 10, 20 and 40, the posterior medians and 95% cen-
tral credible intervals for the proportion p of mothers smoking at diagnosis are given in
5 The interquartile range is sometimes used as a measure of variation; it is not used here.
6 The highest posterior density (HPD) region does not have this property and is not used here.
TABLE 7.2
Posterior medians and 95% credible interval quantiles
    n      r     50%     2.5%    97.5%
   10      1    0.148    0.023    0.419
   10      3    0.324    0.109    0.610
   10      4    0.412    0.168    0.692
   20      7    0.359    0.181    0.570
   40     14    0.355    0.221    0.506
1,296           0.340

TABLE 7.3
95% credible and confidence intervals for p
  r    n    cred            conf
  1   10    0.023, 0.419    −0.086, 0.286
  3   10    0.109, 0.610     0.016, 0.584
  4   10    0.168, 0.692     0.096, 0.704
  7   20    0.181, 0.570     0.141, 0.559
 14   40    0.221, 0.506     0.202, 0.498

Table 7.2. As n increases the posterior medians vary slightly around the observed propor-
tions, while the 95% credible intervals converge slowly towards the true population value,
which we know from the population listing is 0.340. However samples of this size give little
precision.
The 95% central credible intervals for p are shown in Table 7.3 to 3 dp, together with
the 95% confidence intervals. The variance approximation for the confidence interval fails
for r = 1, and is inaccurate (relative to the credible interval) for r = 3 and 4, n = 10, but
is increasingly accurate, at least at the 97.5 percentile, for the larger samples, where r is far
from the boundary. The logistic transformation 95% confidence interval for r = 1, n = 10 of
[0.013, 0.441] agrees fairly well with the credible interval.
If we can regard the StatLab population as a random sample of the full Child Health
and Development Study population, how precise is the StatLab value? Based on the sample
value of 0.340 in the StatLab sample of 1,296, the posterior median is 0.340, and the 95%
credible interval is [0.314, 0.366]. A sample of 1,296 gives quite precise information about
the population proportion.
An important, if uncommon, problem is what to give for a credible interval if r = 0 or
n. We discuss only this case of r = 0; symmetry applies to the other case. Figure 7.5 shows
the likelihood for r = 0, n = 10.
The central 95% credible interval of [0.002, 0.285] does not include 0, though this has
the highest likelihood and posterior density. Any interval we quote should not exclude zero,
unless there is strong prior evidence against it. The choice of a central (two-sided, equal-
tailed) interval is not a principle of Bayesian analysis, only a convention (there are others,
like the highest posterior density interval). For this extreme case it is sensible to use a
one-sided highest posterior density 95% credible interval, starting from zero: [0, 0.240].
Zero counts like this occur frequently in animal trapping of rare species. Ten traps are
set in an area but no animals are caught. Could there be no animals of this species in the
area?

7.8.3 Conjugate prior distributions


The Beta posterior distribution arises from the binomial likelihood and the uniform prior
distribution for p. However, this posterior arises more generally, from the conjugate Beta
prior distribution. Conjugate in the general English sense means “joined together, especially
in pairs”. In statistics it refers to the identical distribution structure of the posterior and
the prior, which means the identical structure of the likelihood and the prior. Suppose we
can express our uncertainty about p not through the uniform distribution – every possible
value of p being equally probable a priori – but through an informative prior, which specifies
higher prior certainty that p is in some region of the interval near 0.5 rather than near 0
or 1. For example suppose the prior distribution is quadratic: π(p) = 6p(1 − p), which is
symmetric about 0.5, with mean and mode (maximum) 0.5.
The uniform and quadratic priors are special cases of the general Beta prior, which we
write
π(p | a, b) = p^(a−1) (1 − p)^(b−1)/B(a, b).
This prior is proper (integrates to 1) if a and b – the prior parameters – are both positive.
If not, the prior is improper. The Beta prior is of the same form as (is conjugate to) the
binomial likelihood, and so the posterior is again of the same form:

π(p | a, b) = pa−1 (1 − p)b−1 /B(a, b)


 
n r
L(p) = p (1 − p)n−r
r
π(p | a, b, r, n) = pr+a−1 (1 − p)n−r+b−1 /B(r + a, n − r + b).

This gives a useful interpretation of the prior parameters: a − 1 and b − 1 can be thought of
as the numbers of prior successes and prior failures from a previous study, or set of studies,
even if no such studies have occurred. It calibrates the information in the prior relative to
that in the likelihood. If no such studies have been carried out, then a = b = 1 and the prior
is uniform.
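Because the update only adds counts, a small helper makes the conjugate analysis immediate. A sketch assuming scipy; the prior parameters a and b are whatever the analyst chooses:

from scipy.stats import beta

def beta_posterior(r, n, a=1.0, b=1.0):
    # Beta(a, b) prior combined with r successes in n trials
    return beta(a + r, b + n - r)

flat = beta_posterior(3, 10)                  # uniform prior, a = b = 1
quadratic = beta_posterior(3, 10, a=2, b=2)   # quadratic prior 6p(1 - p)
print(flat.mean(), quadratic.mean())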

7.8.4 Improving frequentist interval coverage


An unusual use of a conjugate prior was proposed by Agresti and Caffo (2000) to deal
with the poor repeated-sampling coverage of the standard frequentist confidence intervals
when the sample size was small. They proposed adding two successes and two failures to the
observed sample, so the confidence interval would be based on a data set of r + 2 successes
and n − r + 2 failures. They pointed out that this was equivalent to a conjugate Beta(3, 3)
prior distribution, which gives more support to values of p near 0.5 than those near 0 or 1.
In our difficult example of one success and nine failures in ten trials, this would give three
successes and 11 failures in 14 trials, with a central 95% confidence interval (untransformed)
of 0.214 ± 1.96 √(0.214 × 0.786/14) = [−0.001, 0.429]. The proposed approach does not
solve the problem in this example: neither negative values nor zero can be in the confidence
interval. Perhaps we need three additional successes and failures for this example.
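The adjusted interval is easy to reproduce; a sketch in Python, assuming the usual Wald form after adding two successes and two failures:

import numpy as np

def agresti_caffo(r, n, z=1.96):
    r2, n2 = r + 2, n + 4                    # add two successes and two failures
    p2 = r2 / n2
    half = z * np.sqrt(p2 * (1 - p2) / n2)
    return p2 - half, p2 + half

print(agresti_caffo(1, 10))   # roughly [-0.001, 0.429] for one success in ten trials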
It is hard to see how this approach could be justified in a frequentist framework – how
do we know in advance that p will be near 0.5 rather than 0 or 1? It could be relevant in a
Bayesian framework if we did know this. But the Bayesian analysis does not depend on this
kind of informative prior: the uniform prior gives a full analysis. The proposal is a recognition
that the frequentist quadratic likelihood paradigm is not working for this example, which
means that it is not working generally.

7.8.5 The bootstrap


We will discuss the bootstrap in more detail in §12.5, but here consider its use in the difficult
case of inference for p with r = 1, n = 10. Our aim is to obtain a variability measure for the
obtained proportion of successes of p̂ = 0.1. The bootstrap does this by generating a large
number of bootstrap samples of size n, resampled with replacement from the data sample. In
this process, the original sample is treated as a population, from which the samples of size
ten are repeatedly sampled. The “population” has a “population proportion” of successes
of 0.1, so the numbers of successes we draw in each sample have the binomial distribution
b(10, 0.1). This defines the bootstrap distribution of the MLE p̂. This distribution is given in
Table 7.4 to 3 dp.
There is clearly something wrong with this distribution. The true value of p cannot be
zero, since the original sample has one success. The bootstrap does not help here. This
example is discussed further in §12.5.
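The resampling itself is straightforward; a sketch assuming numpy's random generator (any resampling routine would do):

import numpy as np

rng = np.random.default_rng(1)
sample = np.array([1] + [0] * 9)                       # one success, nine failures
resamples = rng.choice(sample, size=(10000, 10), replace=True)
p_boot = resamples.mean(axis=1)                        # bootstrap distribution of p-hat

values, counts = np.unique(p_boot, return_counts=True)
print(dict(zip(values.round(1), (counts / counts.sum()).round(3))))   # compare with Table 7.4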

7.8.6 Non-informative prior rules


An important question is how to specify a non-informative Beta prior (non-informative rel-
ative to the likelihood). The conjugate prior interpretation suggests that the zero value for
a − 1 and b − 1, that is a = b = 1 (zero prior successes and prior failures) – which leaves the
likelihood counts unchanged – is appropriate, that is the uniform prior. This is not generally
agreed by Bayesians, however, with a strong preference by many Bayesians for the Jeffreys
prior with a = b = 1/2. A further possibility is the Haldane prior, with a = b = 0. These
two priors give an improper posterior if r is 0 or n, because they correspond to discounting
the observed successes and failures by 1/2 for the Jeffreys prior and by 1 for the Haldane
prior, giving impossible negative values for the indices for successes or failures in the Beta
posterior for r = 0 or n.
Jeffreys (1961) was only the first of many Bayesians to investigate rules for non-
informative priors for various probability distributions.

To take the prior probabilities different in the absence of observational reason for
doing so would be an expression of sheer prejudice. The rule that we should then
take them equal is not a statement of any belief about the actual composition of
the world, nor is it an inference from previous experience; it is merely the formal
way of expressing ignorance.
(pp. 33–34)
Two rules appear to cover the commonest cases. If the parameter may have any
value in a finite range, or from −∞ to +∞, its prior probability should be taken
as uniformly distributed. If it arises in such a way that it may have conceivably
any value from 0 to ∞, the prior probability of its logarithm should be taken as
uniformly distributed.
. . . It is now known that a rule with this property of invariance [under mono-
tone transformations] exists, and is capable of very wide, though not universal,
application.
(pp. 117–118)

TABLE 7.4
Bootstrap distribution of p̂
p̂    0     0.1    0.2    0.3    0.4    0.5
Pr  .349  .387   .194   .057   .011   .002

Jeffreys showed that his transformation invariance rule, now known as the Jeffreys prior
rule, when applied to the Bernoulli p was inconsistent with the “any value in a finite range”
prior rule, and he concluded that a single universal rule for non-informative prior assignment
would not be possible. This has not discouraged other searchers for a general rule: one of
the most recent is the reference priors of Berger, Bernardo and Sun (2009). Their reference
prior for p is the Jeffreys prior.
An early argument by Savage (1962) was called the principle of stable estimation, or
precise measurement. This specifies that when a likelihood function is sharply peaked in an
interval over which the prior density is relatively flat, the posterior density does not differ
much from the normed (scaled) likelihood function. The measurement itself is considered
to be precise, irrespective of the fine detail of the prior. This principle says essentially that
provided the information in the data is large relative to that in the prior, the precise form
of the prior is unimportant: the uniform prior would give essentially the same answer.
Jeffreys put the same point differently (1961, p. 122):

the mind retains great numbers of vague memories and inferences based on data
that have themselves been forgotten, and it is impossible to bring them into a formal
theory [formal prior distribution] because they are not sufficiently clearly stated. In
practice, if one of them leads to a suggestion of a problem worth investigating, all
that we can do is to treat the matter as if we were approaching it from ignorance –
the vague memory is not treated as providing any information at all.

7.8.7 Frequentist objections to flat priors


Fisher’s attack on Bayesian inference relied heavily on the claimed lack of transformation in-
variance of priors. This argument was contradicted by Geisser (1986), but the claim remains
current, as in Cox (2016, p. 73):

the flat prior is not invariant under reparametrization. Thus if θ is uniform, e^θ has
an improper exponential distribution.

This argument ignores the finite population, as Geisser pointed out. For the smoking mothers
question, the effect of reparametrisation on the flat prior for p might be expressed through the logistic
transformation θ = log[p/(1 − p)], p = e^θ/(1 + e^θ). It is easily seen that the transformed support values
log[R/(N + 1 − R)] are unequally spaced, but they have the same equal prior probabilities
1/(N + 1). The jump to the continuous parameter space introduces the Jacobian of the
transformation dp/dθ = e^θ/(1 + e^θ)², so the prior in θ has this form in the continuous space,
which has a mode at θ = 0. We appear to have higher prior belief in θ = 0 : p = 0.5 than in
large positive or negative θ : p near 0 or 1.
The Jacobian derivative simply defines the “packing density” of the support points in the
discrete θ space. Near θ = 0 the packing density is high, but for larger positive or negative
θ it is very low. So the Jacobian contribution of the transformation from the p scale to the
θ scale does not alter the uniformity of the prior, just the density of its support points.

7.8.8 General prior specifications


A major point of dispute among Bayesians is how priors should be specified and used in
general. One school of Bayesians feels that priors should reflect the personal information
of the data analyst, or the data user, about the model parameter(s): the function of the
analysis is to update his or her prior belief. It may be difficult for the user to elicit his or

her prior – to specify a full probability distribution for the parameter – but various ways
of assessing the prior location and variation may allow a full probability distribution to be
specified. This school dislikes the flat or non-informative prior specification, because it is
unreasonable to them to suppose that the user has no prior information whatever about
the parameter. A further powerful objection of this school is that the (probability) model
for the data is already a subjective choice of the statistician or user, though it may be
regarded as scientifically objective. We address this important objection in Chapter 12 on
model diagnostics.
Another school feels that there should always be a reference analysis with a non-
informative or minimally informative prior, to assess “what the data say” given the data
model; the user can then perform an analysis with an informative prior, and will then be
able to assess the information provided by the prior in the informative analysis. A particular
peril of the single personal prior analysis is that, if the data have little information about
the parameter, the posterior will be determined essentially by the prior, and the user may
not realise this.
In this book we follow the reference school, in using generally flat or non-informative
priors. We argue that the object of the research study is not to update the prior of the
analyst or researcher, but to provide a “neutral” analysis and interpretation of the data
which will inform researchers and scientists in general, whatever their personal priors. This
releases the student, statistician or user from the need to elicit his or her own prior, and
it allows the information content of the data to be assessed independently of the prior. If
the data (through the likelihood) are uninformative about the parameter, then the study
or experiment which led to this data set has not contributed to our understanding of the
question being investigated. This requires a more informative experiment and likelihood, not
the perilous use of an informative prior with an uninformative likelihood, which will simply
reproduce the informative prior as the posterior.
We give an example of this in §8.7, on the notorious ECMO trials.

7.8.9 Are parameters really just random variables?


An uncommon, but strongly held Bayesian view of parameters (e.g. by Seymour Geisser)7 is
that they are not really there: every “parameter” is a random variable, not a fixed quantity.
We do not follow this view: in our view all real populations are finite, and research ques-
tions about aspects of the populations commonly involve means and proportions (like the
proportion of smoking mothers), which are real properties of the population, not random
variables.
This uncommon view is partly a consequence of Bayes’s theorem: the inference about a
parameter like a population mean or proportion is expressed through a probability distribu-
tion, but this represents the strength of evidence we have from the likelihood and prior about
the real parameter: the mean or proportion itself is not varying; it is an unknown constant.

7.9 Inferences from posterior sampling


Bayesian inference made a major step forward in the late 1980s and 1990s with the de-
velopment of posterior sampling. In our discussion of the binomial model, the probability
7 In his book Predictive Inference: An Introduction, he held that conventional statistical inference about

unobservable population parameters amounts to inference about things that do not exist, following the work
of Bruno de Finetti (Wikipedia).

distribution for the data and the posterior distribution of the parameter with the conjugate
prior could be developed analytically (algebraically). However, in many more complex mod-
els this is not possible, but it is possible to draw random samples from the posterior, and
use these to draw inferences about the model parameters. This will become clear in later
chapters, but we give here a simple example to show how useful this can be.

7.9.1 The precision of posterior draws


The accuracy we can obtain from posterior sampling can be illustrated with the binomial
p. For the case r = 3 and n = 10, with a flat prior the posterior for p is Beta(4, 8). We
show 10, 100, 1,000 and 10,000 random draws from this distribution, in Figures 7.10, 7.11,
7.12 and 7.13, cumulated and plotted (circles) with the true cdf from the Beta distribution
(solid curve) over the range of the simulated data. At each distinct data point in each figure
we can construct the 95% credible interval for the true cdf at that data value. The set
of these intervals defines the 95% credible region for the whole cdf. We join the endpoints
of the intervals together with red line segments to give a clearer picture of the credible
region. However, the internal interval points do not represent the credible intervals at these
points: there is no data information about them. The credible region allows us to assess the
probability model assumption in other analyses, by counting the number of observed data
values n for which the assumed probability model cdf falls outside the credible region. This
number would have a distribution which is the sum of correlated Bernoullis with common
probability 0.05. Its mean value is n × 0.05. We use this region representation consistently
over later chapters.
The true cdf is omitted in the 10,000 sample to prevent obscuring completely the values
of the simulated cdf, and the circle graph character for these values is replaced by the dot
character in both the 1,000 and 10,000 samples.
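A sketch of the posterior sampling and the cdf bounds, assuming Python with numpy and scipy; one standard way to obtain the 95% bounds at each ordered draw is from the Beta distribution of the order statistics, and the sketch below uses that form.

import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(7)
M = 1000
draws = np.sort(rng.beta(4, 8, size=M))      # ordered draws from the Beta(4, 8) posterior
i = np.arange(1, M + 1)
emp_cdf = i / M                              # empirical cdf at the ordered draws

# 95% bounds for the true cdf value at each ordered draw
lower = beta.ppf(0.025, i, M - i + 1)
upper = beta.ppf(0.975, i, M - i + 1)

true_cdf = beta.cdf(draws, 4, 8)
print(np.mean((true_cdf >= lower) & (true_cdf <= upper)))   # pointwise coverage of the true cdf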

FIGURE 7.10
n = 10

FIGURE 7.11
n = 100


FIGURE 7.12
n = 1,000

FIGURE 7.13
n = 10,000

No reliable information can be obtained from the sample of ten. The credible region for
the sample of 100 covers the true cdf at all values, but is too wide to give any precision. The
cdf from 1,000 draws is fairly smooth, and very close to the true cdf.
The cdf from 10,000 draws is very smooth and overlies the true cdf: the upper and lower
bounds are almost equal. It is clear that credible intervals from 1,000 draws will be fairly
accurate, and the cdf from 10,000 draws will be a very accurate approximation to the true
cdf. In Chapter 12 on model diagnostics we use this approach generally.
In some Bayesian packages it is standard practice to give a kernel density estimate –
a graph of the estimated posterior density of the parameter draws – for each parameter.
The kernel density is a form of mixture density, frequently a Gaussian mixture (discussed in
Chapter 15). In this book we do not use these density graphs, for three reasons:
• the densities are sometimes jagged, with small bumps or ripples in the density estimate
which have no simple interpretation. The jaggedness is an artifact of the choice of the
bandwidth – the standard deviation – which is set too small in such cases;
• the density estimate cannot be used for anything other than to give an indication of
centrality – of the location of the mode or modes of the density – and the extent of skew;
• quantiles of the posterior distribution cannot be obtained from these density graphs: for
these we need the cdf of the draws.

7.10 Sample design


We often want to be able to estimate a population proportion with a given degree of preci-
sion. For example, we may want a 95% credible interval for a proportion to be not more than

a specified length. This usually requires a large random sample, for which the simple ap-
proximation for the credible interval is accurate. Suppose we want the 95% credible interval
for p to be not more than 0.04 in length. The simple approximate interval is

p̂ ± 2 √(p̂(1 − p̂)/n),

and if this is to be of length not more than 0.04, we must have

√(p̂(1 − p̂)/n) < 0.01,

which means that n > 10⁴ · p̂(1 − p̂). Since the maximum value of p̂(1 − p̂) is 1/4, the interval
length requirement will be satisfied, whatever the sample outcome, if n > 2,500. In general,
if the 95% credible interval for p is to be of length not more than δ, then this is guaranteed
if the sample size exceeds 4/δ².
We thus have a solution to the problem of inference about a single population proportion,
including the sample size required for a specified precision.
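A sketch of this sample-size calculation (plain Python):

import math

def sample_size_for_length(delta):
    # worst case p-hat = 1/2: the interval length is at most delta if n > 4 / delta^2
    return math.ceil(4 / delta**2)

print(sample_size_for_length(0.04))   # 2500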

7.11 Parameter transformations


As was the case with the frequentist inference about p, the Bayesian inference can also be
expressed through the logit transformation of p which has an unrestricted range. We discuss
this here because it proves very helpful in GLMs where the parameters have unrestricted
ranges.
The logistic transformation of p is to θ = log[p/(1 − p)], and the reverse transformation
is p = exp(θ)/[1 + exp(θ)]. If we make this substitution in the likelihood, we have

L(θ) = [exp(θ)/(1 + exp(θ))]^r [1/(1 + exp(θ))]^(n−r)
     = exp(rθ)/[1 + exp(θ)]^n
ℓ(θ) = rθ − n log[1 + exp(θ)]
ℓ′(θ) = r − n exp(θ)/[1 + exp(θ)]
θ̂ = log[p̂/(1 − p̂)]
ℓ′′(θ) = −n exp(θ)/[1 + exp(θ)]²
Var[θ̂] = [1 + exp(θ̂)]²/[n exp(θ̂)]
        = 1/[n p̂(1 − p̂)].
If we did not know that this likelihood could be transformed back to the finite range of p,
we would have a problem with the grid on which to compute the likelihood: the range of θ is
±∞. However we have information from the MLE θ̂ and its SE. We can use this information
to set up an equally spaced uniform grid on the range θ̂ ± 5 SE.
Why 5 SE? If the likelihood is Gaussian, then about 95% of its content is in the interval
θ̂ ± 2 SE. But if the likelihood is not Gaussian, and is skewed to the right or left, then the
grid interval will need to be longer on one side or the other. It is simplest to cover both.
If the likelihood is very skewed, this will be visible from the grid computation, with high
likelihoods at an extreme end of the grid. The grid can then be extended or shifted to provide
wider coverage.
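A sketch of the grid computation on the θ scale, assuming numpy; the case r = 1, n = 10 and the number of grid points are illustrative.

import numpy as np

r, n = 1, 10
p_hat = r / n
theta_hat = np.log(p_hat / (1 - p_hat))
se = 1 / np.sqrt(n * p_hat * (1 - p_hat))

theta = np.linspace(theta_hat - 5 * se, theta_hat + 5 * se, 100)   # equally spaced grid
log_lik = r * theta - n * np.log1p(np.exp(theta))                  # l(theta) = r*theta - n*log(1 + e^theta)
lik = np.exp(log_lik - log_lik.max())                              # likelihood scaled to 1 at its maximum

# If the likelihood is still appreciable at either end, extend or shift the grid.
print(lik[0], lik[-1])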

FIGURE 7.14
Logistic likelihood for θ ∈ [−5.4, 0.86]

For the example with r = 1, n = 10, we have θ̂ = log(0.10) = −2.303, SE(θ̂) = 1.054.
We begin with a 3 SE grid, giving a range of [−5.465, 0.859]. Figure 7.14 shows the logistic
likelihood calculated at each of the 100 equally spaced grid points in this interval.
The likelihood does not descend to zero at the left end. The range has to be extended
further on the left. Figure 7.15 shows the likelihood calculated at each of 1000 equally-
spaced grid points in the interval [−10,1]. The value −10 is 7.3 SEs away from the MLE.
The likelihood on the θ scale has extreme left skew, while that on the p scale has extreme
right skew (Figure 7.6). A consequence of the left skew in θ is that the 95% confidence
interval on the θ scale is also misleading: the left end of the confidence interval is much too
high, and so the left end of the confidence interval for p is much too high. Neither the p nor
the θ scale gives the confidence interval an accurate representation of the credible interval.
Is there a scale transformation which does give an accurate representation? In the ex-
ponential family this question was investigated by Anscombe (1964), in the framework of
removing skew, by the third derivative of the log-likelihood being zero at the MLE value. We
discuss this further in Chapter 9, with the two possible parametrisations of the exponential
distribution.
Now we give an application to a different model: an extended version of the binomial
model.

7.12 The Poisson distribution


Table 2.2 in §2.2 gave an example of 532 missile hits on 576 square areas in South London
in 1944, reproduced in Table 7.5. The data and analysis come from Shaw and Shaw (2019).
The Poisson distribution provides a model for rare events. It is an extreme limit of the
binomial distribution, when the number of trials n becomes very large, but the probability

FIGURE 7.15
Logistic likelihood for θ ∈ [−10, 1]

TABLE 7.5
V1 hits
Number of V1 hits Number of squares
0 237
1 189
2 115
3 28
4 6
5 1

of the event of interest (“success”) becomes very small, in such a way that the mean number
of successes µ = np remains finite and not very small, and the variance σ² = np(1 − p) → µ.
Accidents are a common application: the chance of a road accident is small, but the number
of road users is large, and the actual number of accidents is appreciable.
The limiting process is easy to show. In the binomial model, write p = µ/n; then

Pr[Y = r | µ] = C(n, r) (µ/n)^r (1 − µ/n)^(n−r)
             = [n!/(n^r (n − r)!)] · (1 − µ/n)^(n(1−r/n)) · µ^r/r!
             = [∏_{i=1}^{r} (1 − (i − 1)/n)] · (1 − µ/n)^(n(1−r/n)) · µ^r/r!
             → 1 · e^(−µ) · µ^r/r!

as n → ∞ with r, µ fixed.
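The limit is easy to check numerically; a sketch assuming scipy:

from scipy.stats import binom, poisson

mu, r = 0.924, 2
for n in (10, 100, 10000):
    print(n, binom.pmf(r, n, mu / n))   # approaches the Poisson probability as n grows
print(poisson.pmf(r, mu))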
The research question at the time was whether the squares with many hits were deliber-
ately targeted, or whether this distribution of hits was random. This was a critical issue for
the understanding of the guidance system of the missile. We examine whether the Poisson
distribution can represent the number of missile hits.

7.12.1 Poisson likelihood and ML


We are given n_i values of y_i = i for i = 0, 1, . . . , m, with Σ_{i=0}^{m} n_i = n and Σ_{i=0}^{m} n_i y_i = T. The
likelihood in µ, the log-likelihood and its derivatives are

L(µ) = ∏_{i=0}^{m} [exp(−µ) µ^{y_i}/y_i!]^{n_i}
     = exp(−nµ) µ^T / ∏_{i=0}^{m} (y_i!)^{n_i}
ℓ(µ) = −nµ + T log(µ) + c
ℓ′(µ) = −n + T/µ
ℓ′′(µ) = −T/µ².

The MLE is µ̂ = T/n = ȳ = 532/576 = 0.924, and its asymptotic variance is Var(µ̂) = µ̂/n,
with standard error SE(µ̂) = √(ȳ/n) = 0.040. An (asymptotic) 95% confidence interval for
µ is 0.924 ± 0.079 = [0.845, 1.003]. Do the Poisson probabilities with the ML estimate of µ
correspond well with the observed proportions? Table 7.6 shows the agreement.
The Poisson probabilities add to 0.999 because of rounding. The agreement appears close,
apart from the 1 and 2 hit categories. A graph (Figure 7.16) of the empirical cdf (circles)
with the 95% credible region (red segments) and fitted Poisson model (green) shows that
the Poisson is a good fit.
A traditional test of significance, which has been used since Karl Pearson invented it, is to
compare the observed and “expected” (ML) frequencies under the null hypothesis Poisson
model at each hit value using the Pearson χ2 or X2 test. We write Oi for the observed
frequency at the i-th value of hits, and Ei for the expected frequency under the Poisson
model: this is n × the ML estimated Poisson probability. We show these in Table 7.7; the
expected frequencies are given to one dp.
The expected frequencies add to 575.5 instead of 576 because of rounding and the one
dp accuracy. The Pearson test compares

X² = Σ_{i=0}^{5} (O_i − E_i)²/E_i = 6.18

with the null hypothesis χ²₅ distribution of X². The value of 6.18 has an upper-tail probability
of 0.29 under χ²₅: there is no strong evidence against the Poisson hypothesis. The Pearson X² test is an

TABLE 7.6
Observed and Poisson ML proportion of squares with each number of hits
Hits 0 1 2 3 4 5
Obs prop. 0.411 0.328 0.200 0.049 0.010 0.002
Pois prop. 0.397 0.367 0.169 0.052 0.012 0.002

FIGURE 7.16
Empirical cdf (circles), 95% bounds (red) and ML fitted Poisson model (green)

TABLE 7.7
Observed and Poisson expected frequencies of squares with each number of hits
Hits 0 1 2 3 4 5
Oi 237 189 115 28 6 1
Ei 228.7 211.4 97.3 30.0 6.9 1.2

approximation to the likelihood ratio test. In Chapter 16 on generalised linear models we
give more details of how to assess the goodness of fit of a Poisson regression model.
Here we can say, as did the actuary analysts in 1944, that the pattern of hits appears
quite consistent with the random Poisson distribution. Random variation in the aiming of
the missiles at launch would lead to a random pattern of strikes: the missiles were not guided
in flight.
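A sketch of the whole fit, assuming Python with numpy and scipy and taking the counts of Table 7.5 as given; the degrees of freedom (5) follow the text.

import numpy as np
from scipy.stats import poisson, chi2

hits = np.arange(6)
counts = np.array([237, 189, 115, 28, 6, 1])         # Table 7.5
n, T = counts.sum(), (hits * counts).sum()           # 576 squares, 532 hits
mu_hat = T / n                                       # ML estimate of the Poisson mean
se = np.sqrt(mu_hat / n)

expected = n * poisson.pmf(hits, mu_hat)             # ML expected frequencies (cf. Table 7.7)
x2 = ((counts - expected)**2 / expected).sum()       # Pearson X^2
print(mu_hat, se, x2, chi2.sf(x2, df=5))             # upper-tail probability under chi-squared_5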

7.12.2 Bayesian inference


We augment the likelihood with a prior π(µ). For a non-informative prior, there are two
simple alternative possibilities: a uniform prior for the population mean µ, and a prior c/µ
for a positive parameter (uniform for log µ). These are both improper. We give a more
general (improper) form of prior which includes both: a power prior π(µ) = cµ^s for some
constant s. The posterior distribution for µ is then

π(µ | y) = c · exp(−nµ) µ^T · µ^s
         = c · exp(−nµ) µ^(T+s)
         = exp(−nµ)(nµ)^(T+s)/Γ(T + s + 1),

a gamma density with parameters n and T + s. This density does not have an analytic cdf,
but the cdf is well tabulated and available as a library function in most statistical packages,
usually as the cdf of the standard gamma density with parameter k:
f(θ | k) = exp(−θ) θ^k/Γ(k + 1).
So θ = nµ with k = T + s has this density, and µ has the density of θ/n. For reasonably large
T , whether s is 0 or 1 or −1 makes little difference to the posterior quantiles. With very
small T it can have an appreciable effect. The posterior median and 95% central credible
intervals for µ are shown to 3 dp in Table 7.8 for s = −1, 0, 1. Here n = 576, T = 532.
The MLE of 0.924 agrees well with the median, and the asymptotic 95% confidence
interval of [0.849, 0.999] agrees well with the credible interval, though it is slightly shorter.
The effect of s is to provide an additional contribution to the observation total without
increasing the sample size. This has a negligible effect because of the large n.
The general conjugate gamma prior and corresponding posterior are of the form
π(µ | s, m) = exp(−mµ)(mµ)^s/Γ(s + 1)
L(µ) = c · exp(−nµ) µ^T
π(µ | n, T, s, m) = exp[−(n + m)µ][(n + m)µ]^(T+s)/Γ(T + s + 1).
The limiting case m → 0 recovers the family above.
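A sketch of the posterior quantile computation with scipy's gamma distribution; the shape is taken as T + s + 1 (so that the density is proportional to µ^(T+s) e^(−nµ)) with rate n, and the resulting quantiles should be close to those of Table 7.8.

from scipy.stats import gamma

n, T = 576, 532
for s in (-1, 0, 1):
    post = gamma(T + s + 1, scale=1 / n)       # posterior of mu under the power prior mu^s
    print(s, post.ppf([0.025, 0.5, 0.975]).round(3))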

7.12.3 Prediction of a new Poisson value


We will find later that the prediction of a new value is useful in model assessment. We
have observed n Poisson values with total T, and want to predict the value of a new, as yet
unobserved, value through its posterior distribution.
The posterior predictive distribution of the new value y₀, with the prior s = −1 on µ (so that
the posterior of µ is π(µ | n, T) = n^T µ^(T−1) e^(−nµ)/Γ(T)), is given by

Pr[y₀ | n, T] = ∫ Pr[y₀ | µ] · π(µ | n, T) dµ
Γ(T) y₀! Pr[y₀ | n, T] = ∫ µ^(y₀) e^(−µ) · n^T µ^(T−1) e^(−nµ) dµ
                       = n^T ∫ µ^(T+y₀−1) exp[−(n + 1)µ] dµ
                       = n^T Γ(T + y₀)/(n + 1)^(T+y₀)
Pr[y₀ | n, T] = [Γ(T + y₀)/(y₀! Γ(T))] p^T (1 − p)^(y₀),

where p = n/(n + 1), a negative binomial distribution.
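A sketch checking the negative binomial form by simulation, assuming numpy and scipy; µ is drawn from its Gamma posterior (shape T, rate n, the s = −1 prior) and y₀ from the Poisson distribution given µ.

import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(3)
n, T = 576, 532
p = n / (n + 1)

mu = rng.gamma(shape=T, scale=1 / n, size=100000)   # draws from the posterior of mu
y0 = rng.poisson(mu)                                # posterior predictive draws

for k in range(4):
    print(k, (y0 == k).mean().round(4), nbinom.pmf(k, T, p).round(4))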

TABLE 7.8
Median and 95% credible interval quantiles (s = −1, 0, 1) for
Poisson mean, missiles data
s 0.025 0.5 0.975
−1 0.845 0.921 1.002
0 0.847 0.923 1.004
1 0.848 0.925 1.006

7.12.4 Side effect risk


In Chapter 2, we gave the example of a court case involving rare events. In the court case
assessing a side effect risk after vaccination (Aitkin 1992), the question of concern was:
• Given the international standard rate for this side effect of one in 310,000 vaccinations,
• and given the occurrence of the side effect in the present year of four cases in 300,533
vaccinations,
• has the rate of occurrence of the side effect increased in the present year?
How do we answer this question? We may ask – how many cases would we expect in the
300,533 vaccinations? It is clear that this must be some number around one, but how do we
express it – and how is it evidence?
The chance p of a side effect is extremely small – 1/310,000 if the international standard
rate applies – and the number n of vaccinations is extremely large – 300,533 – with the
binomial mean µ = np = 0.9695. We can work with the Poisson distribution.

7.12.4.1 Frequentist analysis


If there had been no change in the rate in the current year, the probability of any given
number of side effect events would be given by the Poisson distribution with mean µ = 0.9695.
This distribution is given in Table 7.9. The probabilities add to 0.9999 because of rounding.
The probability of four events is 0.0140, very small.
If there had been a change, what was it? We do not have an alternative value of µ which
would give the probability of four events. The ML estimate of µ is ȳ = 4, implying a rate of
4/310,000. If this was the true rate in the current year, the probabilities of the numbers of
side effect events would be given by Table 7.10.
The probabilities add to 0.9998 because of rounding. The probability of four events is
0.1954. The ratio of the probabilities of the four events is 0.0140/0.1954 = 0.0716. This ratio
looks small, but we do not have a criterion for assessing it, and in any case we do not know
that the changed µ is 4 (if there is a change). In Chapter 13 we give the formal structure of
frequentist hypothesis testing or model comparison, and revisit this issue.
A common expression of frequentist evidence is the probability of the observed event,
or a more extreme one under the hypothesis of the standard rate. This is commonly called
the p-value. The probability of four or more cases with µ = 0.9695 is 0.0171. This appears
to be “fairly strong” evidence against the standard rate: the common calibration of the
p-value would use 0.05 as “suggestive” and 0.01 as “strong” evidence. It has the obvious
logical difficulty that we did not observe four or more events, we observed four. Surely our

TABLE 7.9
Poisson distribution with µ = 0.9695
Y 0 1 2 3 4 5 6
Pr[Y ] 0.3793 0.3677 0.1782 0.0576 0.0140 0.0027 0.0004

TABLE 7.10
Poisson distribution with µ = 4
Y 0 1 2 3 4 5 6
Pr[Y ] 0.0183 0.0733 0.1465 0.1954 0.1954 0.1563 0.1042
Y 7 8 9 10 11 12 13
Pr[Y ] 0.0595 0.0298 0.0132 0.0052 0.0019 0.0006 0.0002

conclusions should be based on the number observed, not on numbers not observed. The
probability of four events is 0.0140, but this cannot be interpreted by itself as evidence in
the frequentist framework.
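The two probabilities quoted above come directly from the Poisson distribution; a sketch assuming scipy:

from scipy.stats import poisson

mu = 300533 / 310000            # binomial mean under the international standard rate
print(poisson.pmf(4, mu))       # probability of exactly four cases, about 0.014
print(poisson.sf(3, mu))        # probability of four or more cases, about 0.017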

7.12.4.2 Bayesian analysis


A simple approach is to find the posterior distribution of µ, and see whether this appears
consistent with the international standard value 0.9695 for this number of vaccinations.
An important question is, immediately, what prior to use for µ. This is important because
we have only one observation (of four cases) – a sample size of 1 (not 300,533)! We illustrate
this importance with the µ^s power prior we gave above. We have n = 1, T = 4. The posterior
distribution of µ is the standard gamma with k = 4 + s (and n = 1). Figure 7.17 shows the
posterior density of the present year’s rate for the two values of s (solid curve s = 0, dotted
curve s = 1). The modes are at µ = 4 and 5, far above the international standard rate. For
s = −1 the mode is at 3 (not shown), still well above the international standard rate.
The posterior probability that µ > 0.9695 – an increase in risk – is given by the upper
tail of the gamma distribution: for s = −1, k = 3 it is 0.9252, and for s = 0, k = 4 it is
0.9828. The very small sample size means the prior parameter has a considerable effect on
the posterior probability. However, for either value of s the evidence is strongly against the
international standard value, and in favour of an increase in the rate of occurrence. This is
a clear benefit of the Bayesian analysis: we do not have to specify the alternative rate, only
its direction relative to the standard.

7.12.5 A two-parameter binomial distribution


In §2.7 we saw a sample of species counts from five observers, reproduced in Table 7.11.
It is clear that the observers had different success in observing species. Although a Pois-
son model would seem appropriate for the distribution of counts, this model would not


FIGURE 7.17
Posterior densities of side effect rate – solid, s = 0, dotted, s = 1
TABLE 7.11
Observed numbers of species

16 18 22 25 27

account for the different observer successes, and does not have a parameter for the species
population size. Instead we use the binomial distribution, where the chance of any observer
identifying a species is p, and the actual number of species present in the sampled area is N .
It might appear that every observer i should have his or her own probability pi of success
in identifying a species, but these parameters would not be identifiable because we will have
more parameters than observations, as we have observed only the numbers of successes in
the N binomial trials, and we do not know the number of trials, which is the parameter of
interest.
Given the sample of counts y_1, . . . , y_n, the binomial likelihood is

L(N, p) = ∏_{i=1}^{n} C(N, y_i) p^{y_i} (1 − p)^{N−y_i}
        = [∏_{i=1}^{n} C(N, y_i)] p^T (1 − p)^{Nn−T},

where T = Σ_{i=1}^{n} y_i.

7.12.5.1 Frequentist analysis


The MLE of p for given N is p̂(N) = T/(Nn) = ȳ/N. This defines a hyperbola in the N, p
space, which is visible in the base of the graph in Figure 7.18. Since the MLE of the mean
N p of the binomial is ȳ, if N is large, p must be small, and vice-versa. The two parameters
are strongly dependent. The hyperbola in the surface is the consequence of this relation.
However we are not required to regard p as the nuisance parameter: the model can be
reparametrised in any convenient way to eliminate this curvature. If we define a new nuisance
parameter by ψ = Np, p = ψ/N, the likelihood in N and ψ is given here.

L(N, ψ) = ∏_{i=1}^{n} C(N, y_i) [ψ/N]^{y_i} (1 − ψ/N)^{N−y_i}
        = [∏_{i=1}^{n} C(N, y_i)] [ψ/N]^T (1 − ψ/N)^{Nn−T}
        = [∏_{i=1}^{n} C(N, y_i)] [ψ/N]^T (1 − ψ/N)^{N(n−T/N)}
        → [∏_{i=1}^{n} C(N, y_i)] [ψ/N]^T exp[−ψ(n − T/N)]
        → [∏_{i=1}^{n} C(N, y_i)] N^{−T} ψ^T exp(−ψn),

as N → ∞. The likelihood L(N, ψ) is shown in Figure 7.19. The MLE of ψ is ȳ for any value
of N , and is sharply defined. Not only is the curvature eliminated, but the two parameters

FIGURE 7.18
Likelihood in N and p

FIGURE 7.19
Likelihood in N and ψ

are almost independent: the likelihood is almost separable as N → ∞. The “step” in the
graph is because ψ cannot be greater than N .
Figure 7.20 shows the likelihood in N at ψ̂. It rises rapidly from N = 27 (the largest
observed count) to a poorly defined maximum around N = 100, but then decreases very
slowly to an asymptote at N = ∞ of 0.935 of its value at the maximum, where the binomial
reaches its Poisson limit.

FIGURE 7.20
Likelihood in N at ψ̂

We have learnt very little from the data about N – it has to be at least 27, but any value
between 50 and ∞ is plausible.
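The likelihood in N at ψ̂ can be sketched directly, assuming numpy and scipy and taking the counts of Table 7.11 as given; the grid of N values is illustrative.

import numpy as np
from scipy.special import gammaln

y = np.array([16, 18, 22, 25, 27])      # Table 7.11
n, T = len(y), y.sum()
ybar = T / n                            # psi-hat

def log_lik(N):
    # log L(N, psi-hat): binomial log-likelihood with p = ybar / N
    p = ybar / N
    log_binom = gammaln(N + 1) - gammaln(y + 1) - gammaln(N - y + 1)
    return log_binom.sum() + T * np.log(p) + (N * n - T) * np.log(1 - p)

N_grid = np.arange(28, 2001)
ll = np.array([log_lik(N) for N in N_grid])
lik = np.exp(ll - ll.max())             # scaled to 1 at the maximum
print(N_grid[lik.argmax()], lik[-1])    # a poorly defined maximum and a slow decline towards a non-zero asymptote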

7.12.5.2 Bayesian analysis


We restructure the likelihood in N and p to clarify the posterior distributions.

L(N, p) = ∏_{i=1}^{n} C(N, y_i) p^{y_i} (1 − p)^{N−y_i}
        = [∏_{i=1}^{n} C(N, y_i)] p^T (1 − p)^{Nn−T}
        = [∏_{i=1}^{n} C(N, y_i)] B(T + 1, Nn − T + 1) · [p^T (1 − p)^{Nn−T}/B(T + 1, Nn − T + 1)].

What is the marginal posterior distribution of N ? For fixed N and a flat prior distribu-
tion, p has a conditional Beta(T+1, Nn-T+1) distribution. N marginally has no standard
distribution. However, this does not help: we need to eliminate p, not N . What prior should
be used to eliminate p? It might seem obvious that a Beta distribution would be appropriate –
we just have to specify the prior parameters.
Kahn (1987) considered the general conjugate Beta prior

π(p) = p^(a−1) (1 − p)^(b−1)/B(a, b)

and evaluated the integrated likelihood in N, ∫ L(N, p) π(p) dp, as a function of a and b. He
found that this did not depend at all on the second Beta index b, but depended critically on

the first index a, which controlled the location of the mode and the heaviness of the tail of
the integrated likelihood:
• For a = 0 this tail was flat, giving an essentially uninformative integrated likelihood for
N, like the tail of the likelihood in the frequentist analysis with (N, ψ̂).
• For a = 1/2 (as in the Jeffreys prior) and a = 1 (as in the uniform prior) the
integrated likelihoods decreased with large N and had well-defined (different) interior
modes.
So the choice of the a parameter of the Beta prior for p had a critical effect on the posterior
in N . This is a consequence of the unusual shape of the two-parameter likelihood in N and
p (Figure 7.18).
For the uniform prior on p, the integrated likelihood could not be normalised (scaled
to integrate to 1) because of the non-zero tail at ∞. So no credible interval inference was
possible with a flat prior on N : any value from 50 to ∞ has approximately the same likelihood.
Many Bayesians would put a prior on N to eliminate the tail: an obvious choice would be
π(N ) = c/N (improper for any constant c). However this is a strongly informative prior,
which would need external scientific justification. The posterior in N would also be very
sensitive to changes in its informative prior: the conclusions about N are determined by the
priors for p and N .
These uncomfortable conclusions are not different for the frequentist and Bayesian anal-
yses: the problem is that we do not have enough data to identify N effectively. The model
asks for more than the data can deliver.
A more detailed analysis of this example is given in Aitkin and Stasinopoulos (1989),
partly reproduced in Aitkin (2010, pp. 24–31). The model is relevant in animal herd counting
by multiple aerial observers, and in air warfare by multiple aerial observers counting defensive
missile flashes in attacks on targets. How many herds were there? How many missiles were
fired?

7.13 Categorical variables


7.13.1 The multinomial distribution
The multinomial distribution is the extension of the binomial distribution to k > 2 cat-
egories. Its most well-known and historical applications are to categorical variables when
the categories have no numerical scale ordering. Many variables of interest with more than
two categories are of this form. Blood group and occupation are two such variables in the
StatLab database. The categories may be unordered, as with these two variables, or ordered
as with education and church attendance. Our analysis will handle both of these forms.
Other variables can be a mixture of categorical and “continuous”, like smoking, which has
categories of non-smoker, quit, and (for smokers) number smoked per day. We discuss this
case in the child birthweight analysis.
Our sample of 40 gave in Table 7.12 the breakdown across the father’s occupation cate-
gories, and the child’s blood group. Note that StatLab defined the occupational categories
from the index value 0, not 1.
What can we say about the population proportions in each category? A simple way would
be to take the categories one at a time, and construct credible intervals for the individual
proportions in each category, by pooling together the other categories. This ignores the fact
TABLE 7.12
Father’s occupational category and child’s blood group, n = 40
Occ. Category 0 1 2 3 4 5 6 7 8
Count 6 3 3 0 1 3 15 4 5
Blood group 1 2 3 4 5 6 7 8 9
Count 1 4 2 0 10 11 7 2 3

that the proportions must sum to 1.0, so that (for example) random draws of each proportion
will not add to 1 across the categories. We will want to use all the categories simultaneously
in many applications, so we develop a general approach, for any number of categories K.
We define the population proportion in the k-th of the K categories by p_k, with the p_k
satisfying p_k ≥ 0 and Σ_{k=1}^{K} p_k = 1. We allow for the possibility that one or more p_k could be
zero – these categories may be absent from the population, or from sub-populations. We
draw a random sample of size n with replacement from the population and obtain sample
counts n1 , . . . , nK in the K categories; some of the nk may be zero.
The probability of observing these sample counts has the multinomial distribution, written
M(n; p_1, . . . , p_K), in which

M(n; p_1, . . . , p_K) = [n!/∏_{k=1}^{K} n_k!] ∏_{k=1}^{K} p_k^{n_k}.

For the case K = 2 we have the binomial distribution:

M(n; p_1, p_2) = [n!/(n_1! n_2!)] p_1^{n_1} p_2^{n_2} = C(n, n_1) p_1^{n_1} (1 − p_1)^{n−n_1}.

7.14 Maximum likelihood


ML estimation is very simple: the obvious extension of the binomial. Given the sample counts
{n_k}, the likelihood is (omitting constants)

L(p_1, . . . , p_K | n_1, . . . , n_K) = ∏_{k=1}^{K} p_k^{n_k}.

To maximise the likelihood we add a constraint to the log-likelihood, that the p_k must sum
to 1:

P = log L − λ(Σ_k p_k − 1)
  = Σ_k n_k log p_k − λ(Σ_k p_k − 1)
∂P/∂p_k = n_k/p_k − λ = 0
p_k = n_k/λ
∂P/∂λ = −(Σ_k p_k − 1) = 0
Σ_k p_k = Σ_k n_k/λ = 1
p̂_k = n_k/n,

unsurprisingly. The variance of the MLE p̂_k is p̂_k(1 − p̂_k)/n, and the covariance of p̂_k and p̂_ℓ
is −p̂_k p̂_ℓ/n.

7.15 Bayesian analysis


We generalise the conjugate Beta prior in a similar way: we use the conjugate Dirichlet
(generalised Beta) distribution, written Dir(a; a_1, . . . , a_K), in which

π(p_1, . . . , p_K | a_1, . . . , a_K) = [Γ(a)/∏_{k=1}^{K} Γ(a_k)] ∏_{k=1}^{K} p_k^{a_k−1},

where the prior parameters a_k ≥ 0, and Σ_{k=1}^{K} a_k = a. The posterior distribution of the p_k
is again a Dirichlet distribution, with parameters n_k + a_k; the prior weight a_k is added to
the sample weight n_k to give the posterior weight n_k + a_k at p_k.
It might be expected that, as with the binomial/Beta case K = 2, the non-informative
prior would have ak = 1 for all k. However this means that the total prior weight would
be K, which may be quite large in many of the applications we want to consider, so it can
have a large effect on the posterior if the category sample sizes are small. We could use the
(improper) Haldane prior with ak = 0 for all k as the non-informative prior; this will be used
in §14.5 for “continuous” distributions.
A particular feature of this prior is that it gives zero posterior weight to categories k
for which there are no sample observations. That is, categories with zero sample counts are
treated as though they do not exist in the population – the sample zeros represent structural
(population) zeros, which is a strong statement of prior belief!
To avoid this, we need to assign positive values to the ak for which there are no sample
observations, which means in practice for all categories, since we do not know in advance
of the data which categories will have zero counts (and if the sample is small, many of the
counts may be zero). This treats the zero counts as sampling zeros, for which the population
counts are not forced to be zero.

7.15.1 Posterior sampling


Sampling from the Dirichlet posterior requires additional computation, because this distri-
bution is multivariate. It is included in several packages, but we give the approach available
from the gamma distribution, discussed again in detail in Chapter 14.
If $Y_1, Y_2, \ldots, Y_K$ are $K$ independent variables with $\mathrm{Gamma}(n_k, 1)$ distributions, and
\[
Y_+ = \sum_{k=1}^{K} Y_k, \qquad X_k = Y_k / Y_+, \qquad n_+ = \sum_{k=1}^{K} n_k, \qquad p_k = n_k / n_+,
\]
then the distribution of $X_1, \ldots, X_K$ is Dirichlet, $\mathrm{Dir}(n_+; p_1, \ldots, p_K)$.



To generate a sample from the Dirichlet, we need to specify a prior. The uniform prior
here will give a total prior weight of 9, relative to a sample weight of 40. The Haldane prior
will exclude category 3 with the zero count. We compromise with a minimally informative
prior, giving equal prior weight 0.1 to all categories, and a total prior weight of 0.9.
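A minimal Python sketch of this posterior sampling scheme (not code from the book), using the occupation counts of Table 7.12 and the prior weight 0.1 per category chosen above:

```python
import numpy as np

rng = np.random.default_rng(1)

counts = np.array([6, 3, 3, 0, 1, 3, 15, 4, 5])   # father's occupation, Table 7.12
post_weight = counts + 0.1                        # Dirichlet posterior parameters n_k + a_k

M = 5000
# Dirichlet draws via independent Gamma(n_k + a_k, 1) variables, normalised to sum to 1
Y = rng.gamma(shape=post_weight, scale=1.0, size=(M, len(counts)))
P = Y / Y.sum(axis=1, keepdims=True)              # each row is one draw of (p_0, ..., p_8)

# 2.5%, 50% and 97.5% quantiles of each proportion, as reported in Table 7.13
print(np.round(np.percentile(P, [2.5, 50, 97.5], axis=0), 3))
```

The same draws can also be obtained directly from rng.dirichlet(post_weight, M).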
Table 7.13 gives the median and 95% central credible intervals based on the 2.5% and
97.5% quantiles of the simulated cdfs from 5,000 random draws from the Dirichlet posteriors
of the category proportions for father’s occupation. Table 7.14 gives the same results for
child’s blood group. An additional line in each table gives the true StatLab population
proportions in these categories.
The population values are covered by the 95% credible intervals for all categories of both
variables, but the medians are often not close to the population values. With the small
sample of 40, and the very small counts in each category, high precision is not achieved. The
ninth blood group of “missing” is not a blood group category. (How to deal with missing
data is an important practical question. We discuss it at length in Chapter 15.)
A biological/genetic question of interest is whether blood group and Rh factor are asso-
ciated. We can assess this from the population structure, omitting the “missing” category.
For the StatLab sub-population which has complete data on blood group and Rh factor,
the breakdown over the eight categories is given in Table 7.15. We now reorder the table to
make it two-way: blood group by Rh factor, in Table 7.16. Marginal totals over each factor
are added to the table.
Blood groups O and A are much more common than B or AB. The Rh– group is uncom-
mon. Although it is not clear from the table, we can assess the question of independence of
blood group and Rh factor from the table data. Write Rh and BG for the two classifications.
If these are independent, then

Pr[Rh ∩ BG] = Pr[Rh] Pr[BG].

TABLE 7.13
Father’s occupational category, n = 40
Occ. Category 0 1 2 3 4 5 6 7 8
2.5% .054 .015 .016 .000 .027 .016 .215 .027 .041
median .133 .064 .064 .022 .088 .063 .341 .088 .113
97.5% .253 .163 .162 .027 .193 .160 .483 .193 .223
population .227 .069 .056 .023 .042 .103 .310 .060 .110

TABLE 7.14
Child’s blood group, n = 40
Blood group 1 2 3 4 5 6 7 8 9
2.5% .001 .029 .007 .000 .129 .152 .077 .007 .017
median .019 .094 .044 .000 .244 .266 .168 .044 .078
97.5% .092 .209 .138 .023 .385 .415 .299 .134 .171
population .052 .040 .012 .005 .343 .326 .110 .048 .065

TABLE 7.15
Child’s blood group, missing data excluded
Blood group 1 2 3 4 5 6 7 8
population .056 .043 .013 .005 .366 .348 .118 .051
TABLE 7.16
Child’s blood group by Rh factor
O A B AB T
Rh– .056 .043 .013 .005 .117
Rh+ .366 .348 .118 .051 .883
T .422 .391 .131 .056 1.000

TABLE 7.17
Product table for child’s blood group
by Rh factor
O A B AB T
Rh– .049 .046 .015 .007 .117
Rh+ .373 .345 .116 .049 .883
T .422 .391 .131 .056 1.000

So if we multiply the marginal probabilities of each level of the two factors, we should see joint
probabilities very close to those in the population. This product table is given in Table 7.17.
Apart from blood group O, the product proportions differ from the population proportions
by at most 0.003. For group O, the difference is 0.007. The agreement is very close.

The Rh factor genetic information is also inherited from our parents, but it is
inherited independently of the ABO blood type alleles.
(The University of Arizona)

7.15.2 Sampling without replacement


In survey sampling of finite populations, sampling is almost always without replacement.
Clearly population members already sampled cannot contribute further information to the
analysis: we want new population members not previously sampled for full value from the
sample design. The probability of a simple random sample of size n from a finite population
of size N is easily developed. For the very simple case of drawing a sample of n = 3 for the
mothers’ smoking analysis, we write R for the number of smoking mothers and N − R for
the number of non-smoking mothers in the population.
The first mother drawn is a non-smoker. The probability of this is (N − R)/N. Now there
are N − R − 1 non-smokers left, and R smokers. The second mother drawn is a smoker; this
happens with probability R/(N − 1). The third mother is a non-smoker, with probability
(N − R − 1)/(N − 2). So the sample of three has two non-smokers and one smoker, and the
probability of this sequence is
\[
\frac{N-R}{N} \cdot \frac{R}{N-1} \cdot \frac{N-R-1}{N-2}.
\]
This probability does not depend on the actual sequence of the smoker and non-smoker
draws: if the first two mothers drawn were non-smokers and the third was a smoker, the
probability of this sequence is the same. So the probability of two non-smokers and one
smoker in the sample in any of the three possible orders is three times this value. The above
probability can be expressed as

    
\[
\binom{R}{1}\binom{N-R}{2} \Big/ \binom{N}{3}.
\]
In general, following the notation for sampling with replacement, when there are D
distinct values of $Y_I$ in the population, the probability that the sample contains $n_I$ of the
$N_I$ values of $Y_I$ in the population is the hypergeometric probability
\[
\Pr[\{n_I\} \mid \{N_I\}] = \left[\prod_{I=1}^{D} \binom{N_I}{n_I}\right] \Big/ \binom{N}{n}.
\]

Both Bayesian and frequentist analyses are more complex than for the previous case of
sampling with replacement. A full discussion can be found in Aitkin (2010) Chapter 4.
We note here that sampling with replacement accurately approximates sampling without
replacement if the sample fraction n/N is small.
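A small numerical check of the earlier equivalence (not from the book), with hypothetical values of N and R:

```python
from math import comb

N, R = 1000, 300   # hypothetical population size and number of smoking mothers
# Probability of 2 non-smokers and 1 smoker in a sample of 3, by the binomial-coefficient form
p_coef = comb(R, 1) * comb(N - R, 2) / comb(N, 3)
# The same probability as 3 times the sequence probability
p_seq = 3 * (N - R) / N * R / (N - 1) * (N - R - 1) / (N - 2)
print(p_coef, p_seq)   # the two expressions agree
```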
8
Comparison of binomials
The Randomised Clinical Trial

8.1 Definition
An important application of statistical inference using the binomial distribution is to the
comparison of new medical or surgical treatments for disease or illness in a randomised
clinical trial (RCT). Such trials have certain characteristic features:
• A new treatment which has been found to be effective in small studies on selected patients
is to be evaluated in a large study, compared with the current best treatment.

• The medical profession must be in equipoise regarding the treatments: there must be no
issues of different side effects or other unwanted aspects of the treatments (Freedman
1987).
• Patients taking part in the study are assigned to receive either the new treatment or the
current best treatment. Assignment to one treatment or the other is by randomisation:
in the simplest form (not used in practice) by tossing a coin for each patient – heads
means the new treatment, tails means the current best treatment.
• Randomisation of patients requires their informed consent in advance: they must be told
what the treatments are, and that they will be randomised to one or the other treatment,
but they will not be told which one.

• The two treatments received by the patients must appear to them to be the same, so the
patients are not aware of which treatment they are receiving – the patients are blinded
to the treatment identification.
• Physicians must also be blinded (the trial is then double-blinded) – they must not know
which treatment a patient is receiving, so that this information cannot be accidentally
disclosed to the patient.
• With new drug treatments, the pills or capsules are made to look the same for each
treatment, though for the current best treatment group the pill may have no active
component – it may be an inert placebo.


8.2 Example – RCT of Depepsen for the treatment of duodenal ulcers
This study was carried out in 1959 at the Royal North Shore hospital in Sydney by Professor
D.W. Piper and co-workers. This was one of a series of studies of gastric and duodenal ulcers;
see for example Piper et al (1981).
The drug Depepsen (a trade name for sodium amylosulphate) had been found effective
in the treatment of gastric (stomach) ulcers, and it was believed that because of its known
physiological action in the treatment of this condition, and the similarity of the two con-
ditions, it should also be effective for duodenal ulcers. The criterion for “success” of the
treatment was taken as the complete healing of the ulcer within a period of eight weeks
after the beginning of treatment. The existence of an ulcer, and its healing, were positively
identified by fibre-optic duodenoscopy, in which a flexible tube is swallowed by the patient,
and the lining of the duodenum examined visually through the optical tube.
The standard (current best) treatment for duodenal ulcers was then bed rest, sedation,
a bland diet, and antacid liquids to counteract stomach acidity. About 50% of patients
recovered, that is healed completely, in eight weeks with this treatment alone, without any
additional drug or other treatment. However this proportion varied across countries and
cultural groups. It could not be assumed to apply to the hospital subjects in the study – it
had to be assessed from patients randomly assigned to a control group receiving the current
best treatment.
The study encountered a difficulty – the drug Depepsen was very expensive to produce,
and only 20 doses were available for the trial, so only 20 patients could be treated with
Depepsen. The dose given to patients in the Depepsen treatment group was 5 ml containing
500 mg of Depepsen, to be taken six times daily for eight weeks, one hour and three hours

FIGURE 8.1
Peptic ulcers, stomach and duodenal

after each main meal. Patients randomised to the placebo control group received a “dose”
of 5 ml of flavoured liquid at the same frequency. At the end of the eight-week period,
duodenoscopy was performed again to determine whether the ulcer had completely healed.
Twenty patients were randomly assigned to the treatment group and 18 to the control
group, but three patients had to be excluded from the study because they did not comply
with the protocol – the instructions for the treatment. All three took all their medication in
the first week. Two of these were in the treatment group and one in the control group. These
patients who did not follow the protocol were removed from the trial, and no follow-up after
the eight weeks was performed on them. They had no further role in the study.
Of the 35 remaining patients, 13 of the 18 receiving Depepsen healed, while ten of the
17 receiving placebo healed, in eight weeks. Does this indicate a real superiority in healing
of Depepsen over placebo? Classifying the patients by treatment and recovery, we have
Table 8.1.
We write pD for the probability of recovery with Depepsen, and p0 for the probability of
recovery with placebo. Then pbD = 13/18 = 0.722, and pb0 = 10/17 = 0.588. A higher sample
proportion of patients recover with Depepsen – the difference in proportions in favour of
Depepsen is 0.134 – but is this true in the population, or could it be that pD = p0 , or
pD < p0 ?

8.2.1 Frequentist analysis: confidence interval


For the comparison of the two recovery probabilities $p_D$ and $p_0$, the asymptotic frequentist
theory already used in the single binomial case gives a straightforward analysis. The
asymptotic distributions of $\hat{p}_D$ and $\hat{p}_0$ are
\[
\hat{p}_D \sim N(p_D, \; p_D(1-p_D)/n_D), \qquad \hat{p}_0 \sim N(p_0, \; p_0(1-p_0)/n_0),
\]
and so that of $\hat{p}_D - \hat{p}_0$ is
\[
\hat{p}_D - \hat{p}_0 \sim N(p_D - p_0, \; p_D(1-p_D)/n_D + p_0(1-p_0)/n_0).
\]
Again we need to approximate by substituting estimates in the variances:
\[
\hat{p}_D - \hat{p}_0 \sim N(p_D - p_0, \; \hat{p}_D(1-\hat{p}_D)/n_D + \hat{p}_0(1-\hat{p}_0)/n_0).
\]
For the sample values this gives the approximate 95% confidence interval for $p_D - p_0$ of
\[
0.134 \pm 1.96\sqrt{0.722 \times 0.278/18 + 0.588 \times 0.412/17},
\]
which is $0.134 \pm 0.312$, or $[-0.178, 0.446]$.
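A quick numerical check of this interval (an illustrative sketch, not code from the book):

```python
from math import sqrt

nD, rD = 18, 13          # Depepsen: patients, healed
n0, r0 = 17, 10          # placebo: patients, healed
pD, p0 = rD / nD, r0 / n0

diff = pD - p0
se = sqrt(pD * (1 - pD) / nD + p0 * (1 - p0) / n0)
print(round(diff, 3), round(diff - 1.96 * se, 3), round(diff + 1.96 * se, 3))
# about 0.134, -0.178 and 0.446
```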


The “zero difference” value falls in the confidence interval, so we cannot conclude that
either treatment is clearly superior to the other. This interval does not refer to the coverage
in this sample, but to the overall coverage of such intervals in hypothetical repeated samples.

TABLE 8.1
Clinical trial of Depepsen
Depepsen Placebo Total
Healed 13 10 23
Not healed 5 7 12
Total 18 17 35

FIGURE 8.2
Posterior densities placebo (dotted) and Depepsen (solid)

Also the actual coverage depends on whether the asymptotic Gaussian distribution of the
difference in proportions is appropriate in the small samples of 18 and 17, which cannot be
determined in real or hypothetical samples.
An alternative approach which leads to the same conclusion is through hypothesis testing,
discussed in §8.5.

8.2.2 Bayesian analysis: credible interval


To construct a credible interval to answer this question, we follow the approach for the pro-
portion of smoking mothers in the StatLab population. We assign independent uniform prior
distributions to the Depepsen and placebo recovery probabilities. The posterior distribution
of pD is then Beta(14, 6), and that of p0 is Beta(11, 8). Figure 8.2 shows the two posterior
densities (placebo dotted, Depepsen solid). The density of p0 falls to the left of that of pD ,
but the two overlap substantially; what can we say about the distribution of the difference
pD − p0 between the two? Does this have a high posterior probability of being positive? To
obtain an analytic (exact) answer to this question is very difficult, but we can simulate very
easily from the posterior distribution of pD − p0 .

8.3 Monte Carlo simulation


We want to find the probability distribution of some function Z = g(X, Y ) of independent
variables X and Y whose probability distributions f1 (x) and f2 (y) are known. The proba-
bility distribution of Z cannot be evaluated easily in terms of those of X and Y . We use


repeatedly the following simple algorithm, already used in earlier chapters:

• We generate a large number M of random draws X [m] from the distribution f1 (x) of X,
and M independent random draws Y [m] from the distribution f2 (y) of Y .
• We form the M values Z [m] = g(X [m] , Y [m] ).

• Then the Z [m] are random draws from the probability distribution of Z.

By sorting the draws into increasing order, and assigning each one probability 1/M ,
we obtain a discrete approximation to the probability distribution of Z. If M is large, say
10,000, the cdf of Z is very smooth, and the quantiles of the distribution of Z can be closely
approximated by the sample quantiles of the M draws. For example, the lower and upper
2.5% points of the distribution of Z = X − Y can be approximated directly from the sample
cdf of the Z [m] = X [m] − Y [m] , by finding the 250th and 9,750th ordered values of the Z [m] .
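A minimal sketch of this algorithm in Python (not code from the book), applied to the Beta posteriors of §8.2.2 for the two recovery probabilities:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 10_000

pD = rng.beta(14, 6, M)   # X: draws from the Depepsen posterior Beta(14, 6)
p0 = rng.beta(11, 8, M)   # Y: draws from the placebo posterior Beta(11, 8)
Z = pD - p0               # Z = g(X, Y), here the risk difference

print(round(np.mean(Z > 0), 3))                        # posterior Pr[p_D > p_0], about 0.79
print(np.round(np.percentile(Z, [2.5, 50, 97.5]), 3))  # about [-0.17, 0.12, 0.41]
```

The same draws, substituted into the odds ratio or the risk ratio, give the analyses of §8.6.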

8.4 RCT continued


We apply this approach to the clinical trial. Figure 8.3 shows the cdf of the 10,000 sample
differences $Z^{[m]} = p_D^{[m]} - p_0^{[m]}$. Of the differences, nearly 80% (0.791) are positive, and
more than 20% (0.209) are negative – there is an appreciable posterior probability that the
Depepsen recovery probability is actually less than the placebo recovery probability.


FIGURE 8.3
cdfs of 10,000 differences pD − p0

The median difference is 0.124 (close to the difference 0.134 in sample proportions),
and the lower and upper 2.5% points of the Z [m] are −0.172 and 0.410, so the central 95%
credible interval for pD − p0 is [−0.172, 0.410] which includes zero, the “no difference” value.
So the difference in recovery proportions in the two populations could plausibly be as much
as 0.41 in favour of Depepsen, or as much as 0.17 in favour of placebo. The asymptotic 95%
confidence interval of [−0.178, 0.446] is similar. The trial is so small that the small difference
in sample proportions is a poor indicator of the difference in the population proportions,
which could be zero – a critical issue for recommending the Depepsen treatment.
So this trial was inconclusive, as are many small trials – the sample sizes are too small
to give any precision in the difference in response proportions. How large would the trial
need to be to find that this difference did indicate the superiority of Depepsen? The solution
of this problem is beyond the course level, but by more advanced methods we can show
that, if the same sample recovery proportions were to be attained in a trial with about 110
patients in each group, the observed difference of 0.722−0.588 would indicate the superiority
of Depepsen over placebo, because the 95% credible interval would not include zero. The
actual sample sizes are less than 1/5 of the required size – the trial was far too small to
establish a real difference.
For this reason such a clinical trial would now be regarded as unethical – patients were
being exposed to a trial of a new treatment which had little chance of being demonstrated
to be more effective than the existing best treatment, even if in fact it was more effective.
Soon after this trial, a different drug treatment for duodenal ulcers – cimetidine (trade name
Tagamet) – was found to be effective, and trials of Depepsen for the treatment of duodenal
ulcers were abandoned.
In the last ten years, these drug treatments, which were based on reducing acidity in the
stomach, have been replaced by an entirely different treatment with antibiotics. It was discov-
ered that most ulcers develop from a stomach infection by the Helicobacter pylori bacterium,
which responds rapidly to antibiotic drug treatment. See the Helicobacter Foundation site,
www.helico.com/h_history.html, from which we quote:

Helicobacter pylori (H. pylori for short) was first discovered in the stomachs of
patients with gastritis and stomach ulcers nearly 25 years ago by Dr Barry J. Mar-
shall and Dr J. Robin Warren of Perth, Western Australia. At the time (1982/83)
the conventional thinking was that no bacterium can live in the human stomach
as the stomach produced extensive amounts of acid which was similar in strength
to the acid found in a car-battery. Marshall and Warren literally “re-wrote” the
text-books with reference to what causes gastritis and gastric ulcers. In recogni-
tion of their very important discovery, they were awarded the 2005 Nobel Prize for
Medicine and Physiology.
H. pylori is a corkscrew-shaped Gram-negative bacterium which is found to be
present in the stomach-lining of nearly 3 billion people around the world (i.e. half
the world’s population) and is the most common bacterial infection of man. Many of
those carrying the bacterium have little or no symptoms and are apparently well,
but all without exception have inflammation of the stomach lining, a condition
which is called “gastritis”. Gastritis is the underlying condition which eventually
causes ulcers and other digestive complaints. If a person has had an H. pylori
infection constantly for 20–30 years, it can lead to cancer of the stomach. This is
the reason that the World Health Organisation’s (WHO) International Agency for
Research into Cancer (IARC) has classified H. pylori as a “Class I Carcinogen” i.e.
in the same category as cigarette smoking is to cancer of the lung and respiratory
tract.

8.5 Bayesian hypothesis testing/model comparison


We discuss first the Bayesian formulation, which is both simpler and more general. The
assessment of whether Depepsen performs better than Placebo can be expressed in another
way. As we noted in the discussion, the zero value of the difference in recovery probabilities
has an important role. If the evidence against a zero or negative value of the difference is not
strong, then we do not have compelling evidence for using Depepsen as a drug treatment
additional to the (then) best current treatment. When the treatment difference can be ex-
pressed through a single parametric function in this way, this analysis is sufficient to come
to a conclusion about the new treatment.
However, when there are more than two treatments involved, we do not have a simple
way of expressing the strength of evidence that the treatments are different. We need a way
of assessing this evidence, and this is expressed as a hypothesis testing or model comparison
problem, as opposed to inference about a single function of the model parameters.
Many Bayesians reject the notion of null hypothesis testing (we already know that any
point null hypothesis cannot be exactly true), while others accept the notion, but use an
integrated likelihood approach. We explain in detail the Bayesian approach we use through
the Depepsen example, and later extend it to more complex examples. We note first that
most Bayesians are more comfortable with the name model comparisons than with hypothesis
testing.

8.5.1 The null and alternative hypotheses, and the two models
We specify two models: the null (hypothesis) model and the alternative (hypothesis) model.
• The null model specifies that the recovery probabilities are the same in the Depepsen
and Placebo patient populations: pD = p0 = pc , unspecified.
• The alternative model specifies that the recovery probabilities are different in the De-
pepsen and Placebo patient populations: pD ̸= p0 , both unspecified.
If the three unspecified probabilities were all specified, we would have a simple comparison
of the models through Bayes’s theorem, by evaluating the likelihoods under each model. The
approach followed here, in which we find the posterior distributions of the deviances for each
model, is due originally to Dempster (1974, 1997) as extended by Aitkin (1997, 2010).

• Under the null model, the two treatment groups have the same recovery probabilities,
and the sample from both treatments gives us (from Table 8.1) 23 patients recovering out
of 35, with the common recovery probability $p_c$. So the likelihood (omitting permutation
constants) is
\[
L_0 = p_c^{23}(1 - p_c)^{12}.
\]
• Under the alternative model, the likelihood (omitting permutation constants) is
\[
L_1 = p_D^{13}(1 - p_D)^{5} \cdot p_0^{10}(1 - p_0)^{7}.
\]
Given prior probabilities $\pi_0$ for the null and $\pi_1 = 1 - \pi_0$ for the alternative, the ratio of
posterior probabilities is, from Bayes's theorem,
\[
\frac{\pi_{0\mid\mathrm{data}}}{\pi_{1\mid\mathrm{data}}} = \frac{\pi_0}{\pi_1} \cdot \frac{L_0}{L_1}.
\]

If we take the prior probabilities equal, then the post-data probability of the null model
(hypothesis) is
\[
\pi_{0\mid\mathrm{data}} = \frac{L_0}{L_0 + L_1} = \frac{L_0/L_1}{1 + L_0/L_1}.
\]
A problem in computing the likelihoods is numerical underflow: the values become ex-
tremely small in large samples and may be below the numerical accuracy of the statistical
package, or the computer. We avoid this problem by working on the log scale. Since we
need only the ratio of the likelihoods, we obtain this by computing the difference of the
log-likelihoods. For reasons of statistical theory which will become clear later, we modify
this slightly by computing the deviance, defined as −2 log L, and compute the difference of
the deviances under the two models, then exponentiate this back to get the likelihood ratio.
However, we are not given the recovery probabilities: we have only the sample information
about them, giving their posterior distributions. We can make $M$ random draws $p_D^{[m]}$ and
$p_0^{[m]}$ as before, and now make random draws of $p_c$ as well. So the posterior draws allow us
to make an inference about the probability of the null model:
• We make $M$ random draws $p_D^{[m]}$ and $M$ independent random draws $p_0^{[m]}$, and substitute
them into the deviance
\[
D_1 = -2\log L_1 = -2[\,13\log p_D + 5\log(1 - p_D) + 10\log p_0 + 7\log(1 - p_0)\,],
\]
to give $M$ draws $D_1^{[m]}$.
• We make $M$ independent random draws from the Beta(24, 13) posterior distribution of
$p_c$ (again using a flat prior), and substitute them into the deviance
\[
D_0 = -2\log L_0 = -2[\,23\log p_c + 12\log(1 - p_c)\,],
\]
to give $M$ draws $D_0^{[m]}$.
• We randomly pair the draws of $D_0^{[m]}$ and $D_1^{[m]}$ to give $M$ draws of $D_{01}^{[m]} = D_0^{[m]} - D_1^{[m]}$.
• We convert these draws to draws of $L_0/L_1$ and $\pi_{0\mid\mathrm{data}}$ (see the sketch after this list):
\[
\left[\frac{L_0}{L_1}\right]^{[m]} = \exp\left(-\tfrac{1}{2} D_{01}^{[m]}\right), \qquad
\pi_{0\mid\mathrm{data}}^{[m]} = \frac{[L_0/L_1]^{[m]}}{1 + [L_0/L_1]^{[m]}}.
\]
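A minimal Python sketch of this deviance-comparison procedure (an illustration under the flat priors above, not code from the book):

```python
import numpy as np

rng = np.random.default_rng(1)
M = 10_000

# Posterior draws under flat priors
pD = rng.beta(14, 6, M)    # Depepsen: 13 healed, 5 not
p0 = rng.beta(11, 8, M)    # placebo: 10 healed, 7 not
pc = rng.beta(24, 13, M)   # common probability: 23 healed, 12 not

# Deviances under the alternative and null models
D1 = -2 * (13 * np.log(pD) + 5 * np.log(1 - pD) + 10 * np.log(p0) + 7 * np.log(1 - p0))
D0 = -2 * (23 * np.log(pc) + 12 * np.log(1 - pc))

D01 = D0 - D1                  # deviance differences; the draws are already randomly paired
L_ratio = np.exp(-0.5 * D01)   # L0 / L1
post_null = L_ratio / (1 + L_ratio)

print(np.round(np.percentile(D01, [2.5, 50, 97.5]), 2))        # roughly [-5.6, 0.1, 4.1]
print(np.round(np.percentile(post_null, [2.5, 50, 97.5]), 3))  # roughly [0.11, 0.49, 0.94]
```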

Figure 8.4 shows the posterior distribution of the deviances D0 (solid curve) and D1
(dotted curve). The deviance curves cross at about the 40% point of the cdfs. A smaller
deviance means a higher likelihood: the null model has higher likelihood in 60% of the
draws, the alternative model has higher likelihood in the other 40% of the draws. So the null
model is better supported than the alternative model, but not by much.
Figure 8.5 shows the posterior distribution of the deviance difference D0 −D1 . The median
is very close to zero (0.085) and the 95% central credible interval for the true deviance
difference is [−5.584, 4.13], which includes the zero point of no real difference. Figure 8.6
shows the posterior distribution of the null model probability. The median is 0.489 (close to

FIGURE 8.4
Cdfs of 10,000 deviances D0 (solid) and D1 (dotted)


FIGURE 8.5
Cdf of 10,000 deviance differences D0 − D1

FIGURE 8.6
Cdf of 10,000 draws of π0|data

the prior “indifference” value of 0.5) and the central 95% credible interval is [0.112, 0.940].
This is wide, and does not point clearly to either model. We come to the same conclusion as
from the posterior distribution of pD − p0 : the patient samples are too small to give strong
evidence either way. It certainly could not be claimed that Depepsen had been shown to be
more effective than the then-current best treatment.

8.6 Other measures of treatment difference


We have taken for granted that the quantity of interest in the trial is the difference in recovery
probabilities. This is known in the medical statistics literature as the risk difference. However,
it is not the traditional measure of difference: this is the odds ratio θ, defined by

\[
\theta = \frac{p_0/(1-p_0)}{p_D/(1-p_D)},
\]

or its log, the log-odds ratio. The term “odds”, which comes from gambling, is used for
the ratio p/(1 − p), the “odds on recovery” under Placebo compared with the odds under
Depepsen. To complicate matters further, many medical studies were interested in the risk
ratio or relative risk, here pD /p0 , or its reciprocal.
A virtue of the Bayesian analysis is that all these different measures of effect can be
analysed in exactly the same way. We use the same draws of p0 and pD to substitute in the
appropriate function of the parameters.
For the odds ratio, shown in Figure 8.7, the median is 0.582 and the 95% central credible
interval is [0.144, 2.189]. This is very wide and includes 1, the “no difference” value. For the
risk ratio, shown in Figure 8.8, the median is 0.830 and the 95% central credible interval

FIGURE 8.7
Cdf of 10,000 draws of the odds ratio


FIGURE 8.8
Cdf of 10,000 draws of the risk ratio

is [0.402, 1.323]. The conclusions are the same, not surprisingly: that there is not enough
evidence to come to a clear conclusion about which treatment is better.

8.6.1 Frequentist analysis: hypothesis testing


The standard large-sample frequentist procedure for testing a null hypothesis H0 of zero (or
other specified) difference against an alternative hypothesis H1 of an unspecified difference,
is the likelihood ratio test (LRT). The LRTS – the LRT statistic – is defined by
\[
\mathrm{LRTS} = -2\log[L_{0\max}/L_{1\max}],
\]
a function of the ratio of maximised likelihoods under each hypothesis. The test statistic has
an asymptotic $\chi^2_\nu$ distribution – a form of gamma distribution – under the null hypothesis,
with degrees of freedom equal to the difference $\nu$ in the number of parameters under the two
hypotheses. (The $\chi^2$ distribution is discussed in Chapter 15 on the Gaussian distribution.)
For computational reasons, we compute the frequentist deviances $D_{\min} = -2\log L_{\max}$ to
avoid underflow. The LRTS is then $D_{0\min} - D_{1\min}$.
The formal testing decision process is to reject the null hypothesis in favour of the alternative
at test size $\alpha$ if the LRTS value exceeds $\chi^2_{1-\alpha,\nu}$, and to not reject it otherwise. In
the latter case the null hypothesis is not accepted, but maintained as tenable. A less formal
“descriptive” approach is to find the upper-tail probability of the LRTS and quote it as
the p-value or observed significance level of the test – the level at which the null hypothesis
would be just rejected. This allows the analyst to decide on the observed significance level –
the p-value – which would constitute compelling evidence. We describe this process for the
Depepsen trial.

• Under the null model, the two treatment groups have the same recovery probabilities,
and the sample from both treatments gives us (from Table 8.1) 23 patients recovering
out of 35, with the common recovery probability $\bar{p}$. So the maximised likelihood is
\[
L_{0\max} = (23/35)^{23}\,(12/35)^{12},
\]
with the corresponding minimised deviance
\[
D_{0\min} = -2\,[\,23\log(23) + 12\log(12) - 35\log(35)\,] = 45.00.
\]
• Under the alternative model, the maximised likelihood is
\[
L_{1\max} = (13/18)^{13}\,(5/18)^{5} \cdot (10/17)^{10}\,(7/17)^{7},
\]
with the corresponding minimised deviance
\[
D_{1\min} = -2\,[\,13\log(13) + 5\log(5) - 18\log(18) + 10\log(10) + 7\log(7) - 17\log(17)\,] = 44.31.
\]

The difference $D_{0\min} - D_{1\min} = 0.69$. This value is close to the 60th quantile of $\chi^2_1$ – no
evidence at all against the null hypothesis. The upper-tail probability of $\chi^2_1$ beyond the
observed value is the p-value, 0.41. It is often incorrectly treated as the probability of the
null hypothesis, as in the Sally Clark case. The frequentist theory does not have probabilities
of the hypotheses, only the probabilities of events under the two hypotheses.
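A short sketch of this calculation (not code from the book), with the tail area from scipy:

```python
import numpy as np
from scipy.stats import chi2

rD, nD = 13, 18   # Depepsen healed / total (Table 8.1)
r0, n0 = 10, 17   # placebo healed / total

def min_dev(r, n):
    # -2 log of the maximised binomial likelihood, omitting permutation constants
    return -2 * (r * np.log(r) + (n - r) * np.log(n - r) - n * np.log(n))

D0min = min_dev(rD + r0, nD + n0)            # common probability: about 45.00
D1min = min_dev(rD, nD) + min_dev(r0, n0)    # separate probabilities: about 44.31

lrts = D0min - D1min
print(round(lrts, 2), round(chi2.sf(lrts, df=1), 2))   # about 0.69 and 0.41
```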

8.6.2 How are the hypothetical samples to be drawn?


A question which arises frequently in, and bedevils, the frequentist theory is whether, and
if so how, accidents in the sampling process or the execution of the study should be repli-
cated in the hypothetical samples. In the Depepsen study three of the patients had to be
dropped from the analysis because of their failure to follow the protocol. However, they
could have been included in a second analysis – they did not change their treatment, only
accelerated it. In considering hypothetical replications of the study, should we assume no
such patient losses, giving replicate samples of 20 each, or should we hypothetically repli-
cate with samples of 18 and 17? The question may seem absurd: surely any hypothetical
replications should give the same treated sample sizes as those of the study? Since the repli-
cations are hypothetical, the question seems meaningless. How are they actually going to be
performed?
In modern clinical trials in which patients are informed about the treatments, it is some-
times an ethical requirement that patients can change from their randomly assigned treat-
ment to another treatment. What should be done with the resulting mixture of patients
responding to the treatments? A proposed recent principle of the frequentist analysis has
been that the responses of the changed patients must be assigned to the treatment to which
they were initially assigned, not to the treatment to which they have changed. The analysis
under this principle is called an intention to treat (ITT) analysis.
This has caused a great deal of argument. This comment from Wikipedia on intention to
treat analysis illustrates it:

ITT analysis requires participants to be included even if they did not fully adhere
to the protocol. Participants who strayed from the protocol (for instance, by not
adhering to the prescribed intervention, or by being withdrawn from active treat-
ment) should still be kept in the analysis. An extreme variation of this is that the
participants who receive the treatment from the group they were not allocated to,
should be kept in their original group for the analysis.
...
The rationale for this approach is that, in the first instance, we want to estimate
the effects of allocating an intervention in practice, not the effects in the subgroup
of the participants who adhere to it.
In comparison, in a per-protocol analysis, only patients who complete the entire
clinical trial according to the protocol are counted towards the final results.
(Emphasis added)

Our added emphasis draws attention to the aim of the study: for most studies it is to
assess the effects of the different treatments, not the effect of being included in a clinical
trial, whatever the treatment assignment. The Depepsen analysis described earlier is a per-
protocol analysis. An intention-to-treat analysis could not be done, as the patients who did
not follow the protocol were removed from the trial and no follow-up after the eight weeks
was performed on them.
The asymptotic confidence interval procedure described earlier for the treatment differ-
ence can also be used as a formal frequentist test: if the 95% confidence interval does not
contain the zero value of the difference in response probabilities, then we can reject the null
hypothesis of no difference in these probabilities. If the confidence interval does contain the
null value then we cannot reject the null hypothesis (at significance level 5%: 1-the confidence
coverage).

8.6.3 Conditional testing


The frequentist inference approach is complicated even further by a proposal by R.A. Fisher
of a different and controversial test which has become widely used despite its failings: the
“Fisher conditional test”, better known by Fisher’s label as the (Fisher) exact test.
To understand this test we need to see how the likelihood can be expressed. The algebra
is rather forbidding. First, the full binomial likelihood for Depepsen and Placebo can be
expressed as
   
\[
\begin{aligned}
L(p_D, p_0) &= \binom{n_D}{r_D} p_D^{r_D}(1-p_D)^{n_D - r_D} \cdot \binom{n_0}{r_0} p_0^{r_0}(1-p_0)^{n_0 - r_0} \\
&= \binom{n_D}{r_D}\binom{n_0}{r_0} \left(\frac{p_D}{1-p_D}\right)^{r_D}(1-p_D)^{n_D} \left(\frac{p_0}{1-p_0}\right)^{r_0}(1-p_0)^{n_0} \\
&= \binom{n_D}{r_D}\binom{n_0}{r_0}\, \theta_D^{r_D}(1-p_D)^{n_D}\, \theta_0^{r_0}(1-p_0)^{n_0},
\end{aligned}
\]
where $\theta_D = \frac{p_D}{1-p_D}$ is the odds on recovery under Depepsen, and $\theta_0 = \frac{p_0}{1-p_0}$ is the odds on
recovery under Placebo.
Then we define $r = r_D + r_0$, and substitute $r - r_D$ for $r_0$:
\[
\begin{aligned}
L(p_D, p_0) &= \binom{n_D}{r_D}\binom{n_0}{r - r_D}\, \theta_D^{r_D}(1-p_D)^{n_D}\, \theta_0^{\,r - r_D}(1-p_0)^{n_0} \\
&= \binom{n_D}{r_D}\binom{n_0}{r - r_D} \left(\frac{\theta_D}{\theta_0}\right)^{r_D} (1-p_D)^{n_D}\, \theta_0^{\,r}(1-p_0)^{n_0}.
\end{aligned}
\]

Fisher saw that this likelihood could be factored into the product of a marginal and a
conditional likelihood, which could simplify the inference. We consider the distribution of
the random variable R = RD + R0 , the marginal total of recoveries under both treatments.
This is
\[
\Pr[R = r] = \sum_{u=u_1}^{u_2} \Pr[R_D = u]\,\Pr[R_0 = r - u]
= \sum_{u=u_1}^{u_2} \binom{n_D}{u}\binom{n_0}{r-u} \left(\frac{\theta_D}{\theta_0}\right)^{u} (1-p_D)^{n_D}\, \theta_0^{\,r}(1-p_0)^{n_0},
\]
where $u_1 = \max(0, r - n_0)$ and $u_2 = \min(n_D, r)$. The conditional distribution of $R_D$ given
the marginal total $R$ is then
\[
\Pr[R_D = r_D \mid R = r]
= \binom{n_D}{r_D}\binom{n_0}{r_0} \left(\frac{\theta_D}{\theta_0}\right)^{r_D} \Bigg/ \sum_{u=u_1}^{u_2} \binom{n_D}{u}\binom{n_0}{r-u} \left(\frac{\theta_D}{\theta_0}\right)^{u}.
\]

The conditional distribution depends only on the odds ratio, the relative odds on recovery
under the two treatments.
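As an illustration (not from the book), Fisher's conditional test applied to the Depepsen data of Table 8.1; note that scipy reports the sample odds ratio in the Depepsen-to-placebo direction, the reciprocal of the θ defined in §8.6:

```python
from scipy.stats import fisher_exact

table = [[13, 5],   # Depepsen: healed, not healed
         [10, 7]]   # placebo:  healed, not healed

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(round(odds_ratio, 2), round(p_value, 2))   # sample odds ratio (13*7)/(5*10) = 1.82
```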
Fisher argued that the marginal total of successes was an ancillary statistic – it gave
information about the variability of the odds ratio, but not about its location. He then argued
that the inference about the odds ratio should be based on this conditional distribution,
which meant, in terms of hypothetical repeated sampling, that this had to be restricted to
hypothetical tables which had the same margins as the observed table. This gave a different
interpretation of the observed difference from the unconditional large-sample result. In the
following ECMO example we show how dramatic these differences can be in very small
samples.
Barnard (1945) gave a different test, also “exact”, in which only the designed numbers
of subjects in each treatment were fixed. This was computationally much more complex
then Fisher’s test. He showed in a small example that this test had greater power than
Fisher’s test. Fisher (1945) was furious and attacked Barnard, who retracted his claim in a
later paper, Barnard (1949). Crossing Fisher could be professionally damaging. The detail
of the different hypergeometric probabilities in the two approaches is given by Mehta and
Senchaudhuri (2003) in a Web paper.
Arguments among frequentists over Fisher’s proposal continue to this day. Plackett (1977)
showed that the marginal total was not ancillary except in the limit as the treatment sample
sizes tend to ∞. This was clear from the beginning, since the marginal distribution of the
total depended on the additional term $(1-p_D)^{n_D}\,\theta_0^{\,r}(1-p_0)^{n_0}$, which was not a function of
the odds ratio, but was nevertheless informative about the two probabilities. This term was
eliminated from the conditional distribution. In large samples the conditional test gave the
same results as the unconditional test, but in the small-sample limit it lost information. But
this was just where Fisher said it should be used!
Fisher’s “exact” test has been shown in simulations to be less sensitive than the test using
unrestricted responses in the repeated sampling. Simulation studies are exact applications
of the repeated sampling principle: when repeated samples are actually drawn from a known
population, and are used to assess the coverage of confidence intervals, or the performance
of test procedures.
Fisher’s supporters claim that this evaluation misses exactly the point – that the sim-
ulation evaluations should have been with respect to the restricted samples with the same
marginal number recovering. These two approaches are incommensurate, being based on dif-
ferent repeated-sampling arguments. But neither gives an analysis relevant to the observed
sample – they refer to different ensembles of simulated tables.
What is most remarkable about the “exact” test is that it forced the analyst to frame
the scientific questions in terms of odds ratios, rather than other measures like risk ratios
or risk differences. This conditional analysis requirement was even implemented, for a long
period, in protocols for the analysis of medical research studies under some US government
grants.

8.7 The ECMO trials


We finally consider a very controversial clinical trial, or pair of trials.

8.7.1 The first trial


In Section 2.1.3, we gave the table reproduced in Table 8.2. The data come from a ran-
domised clinical trial of ECMO (Extra-Corporeal Membrane Oxygenation) for the treat-
ment of newborn babies with respiratory distress (inadequate lung function), compared to
the then-current best medical treatment (CMT). The CMT was administering oxygen un-
der high pressure into the baby’s lungs. In ECMO the baby’s blood was passed through
an external oxygenation machine and then returned to the baby’s body: this avoided the
TABLE 8.2
Babies surviving or dying under CMT and ECMO
Response
Treat Survived Died Total
CMT 0 1 1
ECMO 11 0 11
Total 11 1 12

possibility of lung damage from the high-pressure treatment. In the COVID-19 pandemic
ECMO was used for patients with severe breathing difficulties who did not improve with the
ventilator. See Lyons (2020), www.abc.net.au/news/health/2020-07-22/coronavirus-ecmo-
explainer/12472498. The first ECMO study was reported in Bartlett et al (1985), and statis-
tical and ethical issues in this trial, and in a subsequent trial, were discussed in Ware (1989)
and Begg (1990). The trial used adaptive randomisation of babies to the treatments, with a
success (survival) under a treatment increasing the probability of randomisation of the next
baby to that treatment. This was intended to minimise the number of babies randomised to
the less effective treatment.

8.7.2 Frequentist analysis


The frequentist analysis was complicated by the adaptive randomisation, and by the stopping
rule of the trial, which was to stop when the difference in the number of babies surviving
between the treatments was nine. As is clear, this rule was not followed.
At the specified stopping time, the study continued with two babies assigned non-
randomly (i.e., with probability 1) to the ECMO condition. Both survived. Many p-values
were given for this table (Ware 1989; Begg 1990) by different frequentist arguments, from
0.001 to 0.62. Cox (2016, p. 198) expressed the p-value problem well:

Difficulties with the [frequentist] approach are partly technical in evaluating p-


values in complicated systems and also lie in ensuring that the hypothetical long
run used in calibration is relevant to the specific data under analysis, often taking
due account of how the data were obtained. The choice of the appropriate set of
hypothetical repetitions is in principle fundamental, although in practice much less
often a focus of immediate concern.
(Emphasis added)

To understand the diversity of the conclusions, we need to understand the design of the study.
The then-current best treatment had a very high death rate, around 80%. Non-randomised
studies of ECMO had shown very low death rates. To compare the treatments required
ethical approval of a treatment assignment design which would minimise the number of
babies assigned to the less-successful treatment, whichever that was, and would terminate
the trial as soon as the evidence for the best treatment was strong.
The design adopted was a “play the winner” adaptive randomisation using an urn con-
taining initially M white balls (for CMT) and M black balls (for ECMO). The study used
M = 1.

• The urn was shaken well and a ball drawn from it. The ball was black, and the baby was
assigned to ECMO. Following treatment, this baby survived.
• Before the next assignment, one black ball was added to the urn, which was shaken well
and a ball drawn from it. The ball was white, and the baby was assigned to CMT. This
baby died.

• Before the next assignment, another black ball was added to the urn, which was shaken
well and a ball drawn from it. The ball was black, and the baby was assigned to ECMO.
This baby survived.

This process and outcome continued, until nine babies had been randomly assigned (with
increasing probabilities) to ECMO, and all had survived.
At this point the stopping rule had been reached, which was that when the difference in
the number of babies recovering under the two treatments reached 9, the trial would stop.
However the trial did not stop, but the randomisation was stopped, and two more babies
were non-randomly (with probability 1) assigned to ECMO, and survived. (At this point, if
another randomisation had occurred, the probability that ECMO would have been chosen
was 11/12 = 0.917. This is very close to non-random assignment.)
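A small simulation may make the urn mechanics clearer. This is a sketch (not from the book), assuming the usual play-the-winner updating rule suggested by the description above – a success adds a ball of the treatment's colour, a failure a ball of the other colour – with hypothetical survival probabilities:

```python
import numpy as np

rng = np.random.default_rng(1)
p_ecmo, p_cmt = 0.9, 0.2        # hypothetical survival probabilities, for illustration only

def play_the_winner(n_babies=11, M=1):
    black, white = M, M         # black = ECMO, white = CMT
    history = []
    for _ in range(n_babies):
        ecmo = rng.random() < black / (black + white)          # draw a ball
        survived = rng.random() < (p_ecmo if ecmo else p_cmt)  # treatment outcome
        # success adds a ball of the same colour, failure a ball of the other colour
        if ecmo == survived:
            black += 1
        else:
            white += 1
        history.append(("ECMO" if ecmo else "CMT", "S" if survived else "D"))
    return history

print(play_the_winner())
```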
Frequentist difficulties with the study are severe, because of the play-the-winner design
and the failure to follow the stopping rule. Following Cox’s discussion, what hypothetical set
of replications of the study should be used to assess a p-value? Should the replications follow
the stopping rule, or be allowed to add two non-randomised babies? Should the sequence of
outcomes of the urn draws be fixed (conditioned on the observed sequence), or should it be
random? Different choices for these and other possibilities led to the wide range of p-values,
from 0.001 to 0.62, proposed by different analysts. We do not discuss these; they are given
at great length in the two references.

8.7.3 The likelihood


To understand the arguments and the conflicting p-values, we need to construct the likeli-
hood, for which we need indicators for ball colour and treatment outcome. Write B for the
event of drawing a Black ball, and πBm for the probability of drawing a Black ball at draw
m. Correspondingly, write W for the event of drawing a White ball, and πW m for the prob-
ability of drawing a White ball at draw m. Write pE and pC for the survival probabilities
(assumed constant) under ECMO and CMT. Write S for Survival and D for Death.
The observed sequence of events is
B1.S.[+B].W2.D[+B].B3.S.[+B]B4.S.[+B]. . .B9.S.[B10.S.B11.S]
and the likelihood is

\[
\begin{aligned}
L(p_E, p_C) &= [0.5 \cdot p_E] \cdot [(1/3)(1 - p_C)] \cdot [(2/4)\, p_E] \cdot [(3/5)\, p_E] \cdot [(4/6)\, p_E] \\
&\quad \cdot [(5/7)\, p_E] \cdot [(6/8)\, p_E] \cdot [(7/9)\, p_E] \cdot [(8/10)\, p_E] \cdot [1 \cdot p_E] \cdot [1 \cdot p_E] \\
&= \frac{1}{9 \cdot 10}\, p_E^{11}(1 - p_C).
\end{aligned}
\]
So the design – the adaptive randomisation – though depending at each assignment on the
outcome of the previous assignment and treatment outcome, is non-informative, because the
assignment randomisation probabilities are known at each step. This is not a peculiarity of
starting with M = 1 Black and one White ball: it would remain true with any number M of
balls of each colour in the urn initially.
So for a Bayesian, conditioning on all the data, all that matters for analysis is that there
were 11 survivals out of 11 trials with ECMO, and one failure out of one trial with CMT.

FIGURE 8.9
Cdfs of 10,000 draws of the risk differences ECMO1-CMT, for 11 (solid curve) and nine
(dotted curve) ECMO survivals

(An argument arose about whether the last two babies should have been included in the
analysis, as they were non-randomly assigned. We give a second analysis next which excludes
them.)

8.7.3.1 Bayesian Analysis


Using uniform priors for pE and pC , their posterior distributions are for pE , Beta(12, 1) for
the 11 survivals or Beta(10, 1) for the nine survivals, and for pC , Beta(1, 2). Figure 8.9 shows
the posterior distributions of pE − pC from 10,000 draws for the 11 (solid curve) and for the
nine (dotted curve) ECMO survivals.
The median and 95% credible intervals for pE − pC are 0.622 and [0.065, 0.949] for
11 ECMO survivals and 0.608 and [0.044, 0.946] for nine ECMO survivals. Both credible
intervals exclude zero, so there is fairly strong evidence that ECMO is better, but the intervals
are exceedingly wide, and give no precision for the risk difference, or for any other summary.
The two additional survivals under ECMO make little difference: it is the uncertainty in the
CMT survival probability that limits the information in the study.
Not surprisingly, the trial was criticised for the “inadequate” CMT sample size. Critics
of the study had to grapple with both the ethical and the statistical difficulties of the
proposition that more babies were needed to have died under CMT. Many Bayesians would
argue that in such a study an informative prior for the CMT death rate, based on historical
data, was essential. The difficulty is that with only one observation from CMT, the prior
would dominate the likelihood and the conclusion would be effectively a comparison of the
ECMO treatment posterior with the CMT historical data prior.
A second study (by a different medical team) was carried out to establish the value of
ECMO on stronger evidence.
TABLE 8.3
Babies surviving or dying under CMT and ECMO, randomised
Response
Treat Survived Died Total
CMT 6 4 10
ECMO 9 0 9
Total 15 4 19

8.7.4 The second ECMO study


The second study, described in Ware (1989), was carried out with blocks of four babies
assigned at random to ECMO or CMT. This was to prevent the trial stopping with a
small number of babies assigned to CMT. A different stopping rule was used, also based
on the number of deaths or survivals on the better treatment, and at the termination of the
randomisation part of the study, babies were assigned to the better treatment until another
stopping rule was met. We do not give more details of the study, which can be found in Ware
(1989). The study results are given in two ways, Table 8.3 for the randomised part and the
numbers for the non-randomised (to ECMO) part of the study.
Following the stopping rule for the randomised part of the trial, 20 babies were non-
randomly assigned to ECMO. Of these 20, 19 survived and one died. What do we conclude?
For the randomised part, following the analysis of the first trial with flat priors on the
two survival probabilities, the posterior distributions are Beta(7, 5) and Beta(10, 1). If we
combine the non-randomised babies with the randomised, the comparison is of Beta(7, 5)
with Beta(29, 2). In Figure 8.10 we show the posterior distributions of the risk difference
pE − pC for the first, second randomised, and second combined analyses. The larger sample


FIGURE 8.10
Cdfs of 10,000 draws of the risk differences ECMO-CMT for ECMO1 (solid curve), ECMO2
rand (dotted), ECMO2 combined (dashed)
TABLE 8.4
ECMO trials comparisons
Trial median 95% credible interval
1 0.627 [0.056, 0.847]
2 rand 0.329 [0.005, 0.631]
2 comb 0.349 [0.090, 0.635]

of CMT babies has greatly increased the precision of the first analysis, though it has not
much changed the evidence against the “no difference” hypothesis. The medians and 95%
credible intervals for the difference pE − pC are given in Table 8.4.
Combining the non-randomised ECMO babies does not much change the interval, except
near zero. The survival rate under ECMO is still much better estimated than that under
CMT. We do not discuss these studies further.
9
Data visualisation

We divert from probability modelling for the moment to consider the general issue of how
to visualise the data structures and models we want to use. We illustrate the initial simple
uses of visualisation, and will return in later chapters to extend these ideas. We begin with
the single sample of field telephone lifetimes in hours, reproduced from §2.1.
Tables of data do not give much insight into structure or relationships. It is routine in
“descriptive” statistics (describing features of the sample data) to compute the mean and
variance of the sample data, as summaries of location and variation of the sample data.
These summaries do not lead to a probability model specification. Statisticians developed
visualisation tools in the late 19th century to assist the interpretation of their data.

9.1 The histogram


For moderate to large data sets, computation of the mean and variance was greatly simplified
by grouping the individual observations into class intervals (now commonly called bins) and
using the observation count in each interval or bin as a weight in the summations, with the
interval mid-point used as the value for all those in the interval.
A graph of the counts in each interval, drawn as a bar chart and called a histogram,
became a standard form of data presentation. The point of the histogram construction was
to give an impression of the shape of the underlying population represented by the sample.
For this purpose the presentation of the bin count as a set of area bars rather than a set of
numerical point counts assisted the interpretation, though it provided no more information
than the bin counts.
The histogram was only partly successful, as the shape made up by the bins depended
on the sample size and both the number and width of the bins. A large number of narrow
bins gave a very irregular pattern of small counts from which not much understanding of the
population could be gained, while a small number of wide bins gave only a crude picture.
Bins of varying width complicated further the visual impression of the histogram. Figure 9.1
is an extreme example of small bins: the graph is a maximum resolution histogram of values
of family income at birth for the StatLab family population. The bin width is the reporting
unit of income, one $.
A more informative example is the grouping of the radio telephone lifetimes by the
number of hours. There is no loss of information in this case, as the grouping is induced by
field conditions and is not an artificial addition. A point graph of the count against hours
(Figure 9.2) shows the lack of shape information: we see only that small lifetimes are more
common than medium to large values, and the tail of large values is very long.
Since the binning or grouping generally lost some information, corrections to the grouped
mean and variance were necessary; these were formalised by Sheppard (1897) and are known
as Sheppard’s corrections. They are rarely used today with the disappearance of hand cal-
culations of the mean and variance. Bin widths are generally taken as equal, but this has


FIGURE 9.1
Maximum resolution histogram of family income at birth, StatLab population


FIGURE 9.2
Counts of phone lifetimes
TABLE 9.1
Lifetimes and numbers of radio transceivers
t n t n t n t n t n t n t n
8 1 16 4 32 2 40 4 56 3 60 1 64 1
72 5 80 4 96 2 104 1 108 1 112 2 114 1
120 1 128 1 136 1 152 3 156 1 160 1 168 5
176 1 184 3 194 1 208 2 216 1 224 4 232 1
240 1 246 1 256 1 264 2 272 1 280 1 288 1
304 1 308 1 328 2 340 1 352 1 358 1 360 1
384 1 392 1 400 1 424 1 438 1 448 1 464 1
480 1 536 1 552 1 576 1 608 1 656 1 716 1

TABLE 9.2
Lifetimes t and cumulative numbers n of radio transceivers
t n t n t n t n t n t n t n
8 1 16 5 32 7 40 11 56 14 60 15 64 16
72 21 80 25 96 27 104 28 108 29 112 31 114 32
120 33 128 34 136 35 152 38 156 39 160 40 168 45
176 46 184 49 194 50 208 52 216 53 224 57 232 58
240 59 246 60 256 61 264 63 272 64 280 65 288 66
304 67 308 68 328 70 340 71 352 72 358 73 360 74
384 75 392 76 400 77 424 78 438 79 448 80 464 81
480 82 536 83 552 84 576 85 608 86 656 87 716 88

disadvantages at the extremes of the data range where bin counts are small. Variable width
bins are sometimes used.

9.2 The empirical mass and cumulative distribution functions


Modern statistical analysis has recognised the value of the empirical cumulative distribution
function (ecdf ). A version of the ecdf derived from the histogram has been known under
another name: the cumulative histogram. The cumulated version of Table 9.1 is shown in
Table 9.2. Like the previous lifetimes table, this is not very informative. However if we
transform the cumulative counts to proportions, by dividing by 89, and graph them against
lifetime hours, a quite different picture emerges (Figure 9.3).1
The graph of the ecdf is quite smooth,2 and appears to be representable by an exponential
function of some kind, with small variations about it.

9.3 Probability models for continuous variables


It has been accepted for many years in statistical data analysis that the data should point to-
wards the appropriate probability representation. Probability models provide approximations
1 Strictly speaking the ecdf should reach the value 1 at the largest sample value, but we scale it by 89
instead of 88 to avoid this value of 1.0, which leads to breakdown of probability scale transformations which
we use to develop the probability models.
2 Smoothness is a property of integration or cumulation; roughness is a property of differentiation or differencing.
[Plot: cumulative proportion against hours (0–700).]

FIGURE 9.3
Phone lifetimes empirical cdf

[Plot: survivor function against hours (0–700).]

FIGURE 9.4
Phone empirical survivor function
[Plot: survivor function against hours (0–700).]

FIGURE 9.5
Phone empirical (circles) and exponential (curve) survivor functions

[Plot: log integrated hazard against log hours (3–6).]

FIGURE 9.6
Empirical (circles) and ML fitted (solid) log exponential integrated hazard

to sample and population proportions. This is usually done by assessing the fit of the as-
sumed model to the data, if necessary through residuals from a fitted model; we discuss this
in detail in later chapters.
We use the phone lifetimes to illustrate this. The empirical cumulative distribution func-
tion of the lifetimes has a very smooth appearance, with small variations. We want to ap-
proximate the ecdf by a smooth function – the cdf of a continuous random variable. How do
we identify the appropriate random variable?
We use the practical background and technological context to provide a transformation
of the ecdf to suggest the choice. In studies of lifetimes of electrical and mechanical devices,
a common function of interest is the survivor function.
The survivor function for a continuous variable Y is S(y) = 1−F (y), the proportion surviving
– still functioning – at time y. The empirical survivor function at yi is 1 − qi where qi is the
empirical cdf. Figure 9.4 shows the empirical survivor function for the phones.
The appearance of an exponential decline is striking, and suggests that a suitable model
for the survivor function could be S(y) = e−y/λ for some value λ. Figure 9.5 shows the
empirical survivor function, with superimposed the exponential survivor function with the
value of λ taken as 210.8, the mean survival time.
The exponential curve agrees fairly well with the observed data, though it appears to un-
dershoot and then overshoot. Could a different value of λ improve the agreement? How do
we decide on the value of λ? Figure 9.6 shows the same function on transformed scales.
We continue this example in the next chapter.
10
Statistical inference II – the continuous
exponential, Gaussian and uniform distributions

10.1 The exponential distribution


We were led to the exponential distribution as a model for the phone lifetimes by the “ex-
ponential decay” appearance of the survivor function. The exponential model approximates
the survivor function by the exponential function S(y | θ) = exp(−θy). This can also be
expressed as S(y | λ) = exp(−y/λ) where λ = 1/θ. We will frequently switch back and
forward between these parametrisations, which are both in common use. The parameter λ
is the mean lifetime (in hours here), and the parameter θ is the hazard – the rate of dying
(per hour here).
However the likelihood function is a product over observations of the probability mass
function, not the survivor function. Also, the observations are discrete, but the exponential
survivor function is continuous. How do we relate them?
All “continuous” variables are recorded with some limited measurement precision δ, on a
grid of spacing δ. Most of the recorded lifetimes are multiples of eight hours, though some are
multiples of four or two hours. The smallest lifetime is eight hours. What is the probability
of this time under the exponential model? Suppose that the recorded lifetime of eight hours
is measured to the nearest hour. Then the actual lifetime is a number between 7.5 and 8.5
hours. The probability of a lifetime in this interval is (using the λ parametrisation)

\[
\Pr[7.5 < Y < 8.5] = S(7.5 \mid \lambda) - S(8.5 \mid \lambda) \doteq -S'(8 \mid \lambda)\cdot 1.0 = f(8 \mid \lambda) = \exp(-8/\lambda)/\lambda .
\]
Here we have used an approximation based on the Mean Value Theorem; that
\[
F(b) - F(a) = \int_a^b f(x)\,dx = (b-a)\, f\big((1-\phi)a + \phi b\big)
\]
for some ϕ ∈ (0, 1). Our approximation uses ϕ = 0.5.1


The derivative of the continuous distribution function F (y | λ) = 1−S(y | λ) is called the
probability density function f (y | λ), the continuous analogue of the probability mass function
pi . If time is measured to a finer or coarser scale than one hour, the same argument gives
the same approximation in terms of the density at eight hours, except that the 1.0 difference
between the upper and lower limits is replaced by another constant number. As will be
1 This corresponds to approximating the area under the curve f(x) between a and b by the area of the rectangle (b − a) · f((a + b)/2).


seen, this makes no difference to our conclusions about the model parameter, though the
approximation may be poor if the time scale is very coarse. We are free then to approximate
the likelihood as the product of continuous density ordinates, omitting the constant arising
from the mean value approximation.
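As an illustration of how good this approximation is, the following minimal Python sketch (not part of the original text; it assumes the mean lifetime value λ = 210.8 obtained later for the phone data) compares the exact interval probability with the density ordinate:

```python
import numpy as np

lam = 210.8  # assumed mean lifetime in hours (the value fitted later for the phone data)

def survivor(y, lam):
    """Exponential survivor function S(y | lambda) = exp(-y / lambda)."""
    return np.exp(-y / lam)

# exact probability that a lifetime recorded as 8 hours (to the nearest hour) occurred
exact = survivor(7.5, lam) - survivor(8.5, lam)

# mid-point approximation: density at 8 hours times the interval width of 1.0 hour
approx = np.exp(-8.0 / lam) / lam * 1.0

print(exact, approx)  # both are approximately 0.00457
```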
We note here the mean µ of the random variable defined in this way is
\begin{align*}
\mu &= \int_0^\infty y\, f(y \mid \lambda)\,dy = \int_0^\infty y \exp(-y/\lambda)\,dy/\lambda \\
    &= \lambda \int_0^\infty z \exp(-z)\,dz = \Gamma(2)\,\lambda = \lambda .
\end{align*}
We will also find useful the hazard function h(y), which is the instantaneous probability of failure at time y, given survival up to this time. This is given by
\[
h(y) = f(y \mid \lambda)/S(y \mid \lambda) = 1/\lambda = \theta,
\]
a constant. We note for later use the integrated hazard function H(y), given by
\[
H(y) = \int_0^y h(s)\,ds = y/\lambda = \theta y,
\]
which means that S(y) = exp[−H(y)], and H(y) = − log S(y). The log integrated hazard has a particularly simple form:
\[
\log H(y) = \log\theta + \log y .
\]
This function is shown with the fitted exponential distribution in Figure 9.6.
In later chapters we will see the value of this representation for model assessment. The
exponential distribution has the characteristic property of constant hazard. For the phones,
this means that they do not age or wear out with use; a phone which has been in use for 100
hours is just as good (in its remaining lifetime distribution) as a new phone. This is unlikely
to be true in the dusty conditions of their use.

10.2 The exponential likelihood


Using the density representation, the likelihood is computed over a 1,000-point grid in λ
from 1 to 700. The count ni of phones at each lifetime acts as a weight in the computation.
The likelihood is
\begin{align*}
L(\lambda) &= \prod_{i=1}^{m} f(y_i \mid \lambda)^{n_i}
            = \prod_{i=1}^{m} \left[\frac{1}{\lambda}\exp\left(-\frac{y_i}{\lambda}\right)\right]^{n_i} \\
           &= \frac{1}{\lambda^{n}}\exp\left(-\frac{T}{\lambda}\right)
            = \frac{1}{\lambda^{n}}\exp\left(-\frac{n\bar{y}}{\lambda}\right),
\end{align*}
[Plot: relative likelihood against λ (150–350).]

FIGURE 10.1
Exponential relative likelihood for λ

where $T = \sum_{i=1}^{m} n_i y_i$ and $\bar{y} = T/n$. The total lifetime T of all the phones is a sufficient statistic for the mean λ. Equivalently, the sample mean lifetime ȳ is a sufficient statistic. Because the likelihood values are extremely small, in graphing or computing the likelihood it is convenient to scale the likelihood by its maximum (achieved at $\hat{\lambda} = \bar{y}$):
\[
L(\lambda)/L(\bar{y}) = \left(\frac{\hat{\lambda}}{\lambda}\right)^{n} \exp\left[-n\left(\frac{\hat{\lambda}}{\lambda} - 1\right)\right].
\]

This scaled likelihood is called the relative likelihood, relative to the maximum. It has a
maximum value of 1.0. The relative likelihood is shown over the reduced range of appreciable
likelihood in Figure 10.1.
The likelihood is right-skewed, with a longer right-hand than left-hand tail.
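The relative likelihood in Figure 10.1 can be computed directly from this formula. The following Python sketch (an illustration, not the book's own software) assumes only the data summaries n = 88 and ȳ = 210.8:

```python
import numpy as np

n, ybar = 88, 210.8  # number of phones and sample mean lifetime

# 1,000-point grid in lambda from 1 to 700, as described in the text
lam = np.linspace(1, 700, 1000)

# log relative likelihood: log L(lambda) - log L(ybar)
log_rel = n * (np.log(ybar / lam) - ybar / lam + 1)
rel_lik = np.exp(log_rel)  # maximum value 1.0, attained at lambda = ybar

# endpoints of the range of lambda with appreciable likelihood (here, above 0.01)
appreciable = lam[rel_lik > 0.01]
print(appreciable.min(), appreciable.max())
```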

10.3 Frequentist theory


A fundamental complication with inference in the frequentist theory appears in this model,
caused by the two possible parametrisations. While the ML estimates correspond, their
precisions do not.

10.3.1 Parameter transformations


The MLE of λ and its asymptotic variance are computed using the log-likelihood ℓ(λ):
\begin{align*}
\ell(\lambda) &= \log L(\lambda) = -n\log\lambda - \frac{n\bar{y}}{\lambda} \\
\ell'(\lambda) &= -n/\lambda + n\bar{y}/\lambda^{2} \\
\ell''(\lambda) &= n/\lambda^{2} - 2n\bar{y}/\lambda^{3} \\
\ell'''(\lambda) &= -2n/\lambda^{3} + 6n\bar{y}/\lambda^{4} .
\end{align*}
The MLE of λ is $\hat{\lambda} = \bar{y} = 210.8$, and its estimated variance is $\widehat{\mathrm{Var}}[\hat{\lambda}] = \hat{\lambda}^{2}/n$. The standard error of ȳ is $\bar{y}/\sqrt{n} = 210.8/\sqrt{88} = 22.47$. Then an (asymptotic) 95% confidence interval for the true (population) value of λ is $210.8 \pm 2 \times 22.47 = [165.9, 255.7]$, centred at the MLE.
We note for later discussion that the third derivative of the log-likelihood, at the MLE, is $4n/\bar{y}^{3}$. This must be positive, so the likelihood in λ is inherently right-skewed: the first two derivatives do not provide complete information about the parameter. The scale-free measure of skewness is defined by the third derivative divided by (minus the second derivative)$^{3/2}$, at the MLE, which is $4/\sqrt{n}$.
However, if we work instead with the hazard parametrisation θ, we have
\begin{align*}
\ell(\theta) &= \log L(\theta) = n\log\theta - n\bar{y}\theta \\
\ell'(\theta) &= n/\theta - n\bar{y} \\
\ell''(\theta) &= -n/\theta^{2} \\
\ell'''(\theta) &= 2n/\theta^{3} .
\end{align*}
The MLE of θ is $\hat{\theta} = 1/\bar{y} = 0.00474 = 1/\hat{\lambda}$, and its estimated variance is $\widehat{\mathrm{Var}}[\hat{\theta}] = \hat{\theta}^{2}/n$. The standard error of $\hat{\theta}$ is $\hat{\theta}/\sqrt{n} = 0.000506$. Then an (asymptotic) 95% confidence interval for the true (population) value of θ is $0.00474 \pm 2 \times 0.000506 = [0.00373, 0.00575]$. If we invert the 95% limits to those for λ, we have [173.9, 268.1]. These are shifted to the right, by 8 at the lower limit and 12 at the upper limit, relative to those from the λ parametrisation. Why? The skewness on the θ scale is $2/\sqrt{n}$, only half that on the λ scale. The likelihoods have different curvature and skew, and this changes the interval endpoints.
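A short Python sketch (illustrative only, not the book's computation) of the two asymptotic intervals, using n = 88 and ȳ = 210.8, shows the disagreement directly:

```python
import numpy as np

n, ybar = 88, 210.8

# mean (lambda) parametrisation
lam_hat = ybar
se_lam = lam_hat / np.sqrt(n)
ci_lam = (lam_hat - 2 * se_lam, lam_hat + 2 * se_lam)        # about (165.9, 255.7)

# hazard (theta) parametrisation
theta_hat = 1.0 / ybar
se_theta = theta_hat / np.sqrt(n)
ci_theta = (theta_hat - 2 * se_theta, theta_hat + 2 * se_theta)

# inverting the theta limits gives a different interval for lambda
ci_lam_from_theta = (1.0 / ci_theta[1], 1.0 / ci_theta[0])   # about (173.8, 268.0)

print(ci_lam, ci_theta, ci_lam_from_theta)
```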
An obvious question for a frequentist is: since we can transform from either scale to the
other, which scale should we use? Or should we use some other scale? This issue was investi-
gated by Anscombe (1964) for several single-parameter distributions, with the aim of defining
a parameter transformation which would normalise (or Gaussianise) the likelihood, in the
sense of making the third derivative zero at the maximum, so giving a more nearly symmetric
likelihood. This is quite simple, if we define $\theta = \phi^{r}$ for some r to be determined. Then
\begin{align*}
L(\phi) &= \phi^{nr} e^{-n\bar{y}\phi^{r}} \\
\ell(\phi) &= nr\log\phi - n\bar{y}\phi^{r} \\
\ell'(\phi) &= nr/\phi - nr\bar{y}\phi^{r-1} \\
\ell''(\phi) &= -nr/\phi^{2} - nr(r-1)\bar{y}\phi^{r-2} \\
\ell'''(\phi) &= 2nr/\phi^{3} - nr(r-1)(r-2)\bar{y}\phi^{r-3} .
\end{align*}
Here $\hat{\phi} = [1/\bar{y}]^{1/r}$, and at the MLE, $\ell'''(\hat{\phi}) = nr[2 - (r-1)(r-2)]/\hat{\phi}^{3}$, which is zero when r = 0 or 3. Zero does not give a transformation, so the transformation $\theta = \phi^{3}$, or $\phi = \theta^{1/3}$, gives a zero third derivative at the maximum. However, the fourth derivative $\ell^{iv}(\hat{\phi})$ at this value of r is $-18n/\hat{\phi}^{4}$, which is negative, not zero, so this transformation does not induce
Gaussianity. Sprott (1980) gave a t-approximation for the transformation, correcting for the
[Plot: log relative likelihood against λ (160–300).]

FIGURE 10.2
Exponential (solid) and Gaussian (dotted) relative likelihoods for λ, log scale

fourth derivative. We do not discuss these transformations further, as they are not useful in
regression problems.

10.3.2 Frequentist asymptotics


The use of only the first two derivatives of the log-likelihood implies the assumption of
a quadratic form of the log-likelihood, or Gaussianity of the likelihood. To rely on this
approximation, we need to verify that the log-likelihood is quadratic, or equivalently, that the
likelihood is Gaussian. This is easily done: we simply compute the Gaussian approximation
from the sample mean and standard error, and plot the approximating Gaussian likelihood
with the actual likelihood. Figure 10.2 shows the exact (solid) and approximate (dotted) log
relative likelihoods in λ.
It is immediately clear that the Gaussian approximation is a poor one: the curves agree
only within a tiny distance from the maximum. Removing the cubic and quartic terms from
the log-likelihood removes the possibility of skewness (from the third derivative) and heavy
or light tails from the fourth derivative. The Gaussian likelihood approximation is wholly
within the exponential likelihood: it understates the variability of the inference about λ.
The “95%” confidence interval, based on ± 2 SEs from the MLE, is too short, especially on
the right.
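The comparison in Figure 10.2 can be sketched in a few lines (again illustrative Python, not the book's code): the exact log relative likelihood and its quadratic (Gaussian) approximation are evaluated on a grid, and at the upper ±2 SE endpoint the two clearly disagree.

```python
import numpy as np

n, ybar = 88, 210.8
lam = np.linspace(160, 300, 500)

exact = n * (np.log(ybar / lam) - ybar / lam + 1)        # exact log relative likelihood
quadratic = -n * (lam - ybar) ** 2 / (2 * ybar ** 2)     # Gaussian (log-quadratic) approximation

# at the upper 2-SE point the quadratic approximation gives -2.0,
# but the exact value is higher (about -1.5): the exact likelihood is still appreciable there
upper = ybar + 2 * ybar / np.sqrt(n)
print(n * (np.log(ybar / upper) - ybar / upper + 1))     # about -1.5
```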
The important message in inference for the exponential distribution, as with many others,
is that the choice of parametrisation affects the frequentist asymptotic inference, but as we
will see does not affect the Bayesian inference, which is always based on the likelihood, not
an asymptotic approximation to it.2
2 There are some probability models for which improved confidence intervals can be constructed by inverting tests (like the likelihood ratio test) rather than relying on the asymptotic approximation. We do not give details for these, as the Bayesian analysis does not require them.

10.4 Bayesian theory


To proceed with a Bayesian analysis, we need a prior distribution for λ. It is an important
part of Bayesian theory that the prior distribution must not be dependent on the data. The
prior represents the information available about the parameter prior or external to the data
becoming available. We must not look at the data and then decide on the prior – this would
be “using the data twice”.
We can see clearly how the Bayesian inference will follow with a flat prior. The likelihood
function for the exponential distribution in the mean parametrisation is
\[
L(\lambda) = \frac{1}{\lambda^{n}}\exp\left(-\frac{n\bar{y}}{\lambda}\right).
\]
To convert it to a probability distribution, we scale it by a constant. The form of the posterior
is clearer if we transform the parameter from λ to the rate or hazard parameter θ = 1/λ. We
use a more general prior, of the power form π(θ) = θ^a, where a could be −1, 0 or 1. (These
values of a are reversed for the corresponding prior on λ. The value of −1 for a corresponds
to the Jeffreys uniform rule for the log of the parameter.) These priors are all improper (do
not integrate to 1), but the posteriors are all proper.
The likelihood is then
\[
L(\theta) = \theta^{n}\exp(-n\bar{y}\theta).
\]
This has the form of a gamma distribution with parameters nȳ and n + 1, but it is missing the integrating constant $(n\bar{y})^{n+1}/\Gamma(n+1)$. The full form of the posterior for θ, for the power prior on θ, is
\[
\pi(\theta \mid y) = \frac{(n\bar{y})^{n+a+1}}{\Gamma(n+a+1)}\,\theta^{n+a}\exp(-n\bar{y}\theta).
\]
The maximum (mode) occurs at θ = (n + a)/(nȳ).
The gamma distribution cdf is available as a system function in most statistical packages, usually for the standard form
\[
f(y \mid m) = \frac{1}{\Gamma(m+1)}\, y^{m}\exp(-y),
\]
though it is sometimes given with
\[
f(y \mid m) = \frac{1}{\Gamma(m)}\, y^{m-1}\exp(-y).
\]
Quantiles from this (θ) distribution can be transformed to those for the λ distribution by
dividing the nȳ value by the gamma quantile. For example for the phone data, the median
and 95% central credible interval for λ, with nȳ = 18,550, n = 88, and the three values of a
above are given in Table 10.1.

TABLE 10.1
Median and 95% credible intervals
for λ with prior parameter a
a median 95% interval
−1 214.0 [174.7, 266.2]
0 211.6 [172.9, 262.8]
1 209.2 [171.1, 259.5]

As a increases the median decreases and the credible interval shrinks: the prior parameter
increases the effective sample size without affecting the mean. The same approach can be
used for quantiles of the distribution. The 80th quantile is of interest because we want to
know when this fraction of the lifetime has been reached in the field. The 80th quantile,
which we denote by $y_{80}$, satisfies the equation
\begin{align*}
S(y_{80} \mid \lambda) &= 0.20 \\
\exp(-y_{80}/\lambda) &= 0.20 \\
-y_{80}/\lambda &= \log(0.2) \\
y_{80} &= -\lambda\log(0.2) = 1.61\lambda .
\end{align*}

So the median and 95% credible interval for the 80th quantile follow by simply scaling up
those for λ (with a = −1) by the multiplier 1.61, giving median 344.4, credible interval
[281.2, 428.4].
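The quantile calculation behind Table 10.1 and the 80th-quantile interval can be sketched with scipy (an illustration only, not the book's software). It uses the posterior π(θ | y) given above; the precise values depend on how the gamma shape parameter is counted, so they may differ slightly from the table.

```python
import numpy as np
from scipy.stats import gamma

n, T = 88, 18550          # number of phones and total lifetime T = n * ybar

for a in (-1, 0, 1):      # power-prior exponent: pi(theta) proportional to theta**a
    shape = n + a + 1     # posterior: theta ~ Gamma(shape = n + a + 1, rate = T)
    g = gamma.ppf([0.025, 0.5, 0.975], shape)   # standard-gamma quantiles
    lam_median = T / g[1]                       # lambda = 1/theta, so divide T by them
    lam_interval = (T / g[2], T / g[0])
    print(a, round(lam_median, 1), np.round(lam_interval, 1))

# 80th quantile of the lifetime distribution: y80 = -lambda * log(0.2) = 1.61 * lambda,
# so its median and credible interval are 1.61 times those of lambda
print(-np.log(0.2))       # 1.609...
```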

10.4.1 Conjugate priors


We may have information which provides a preference for some parameter values over others.
One way of representing such information is through a conjugate prior. The conjugate gamma
prior on the θ scale is of the form

\[
\pi(\theta \mid \alpha, \beta) = c \cdot (\beta\theta)^{\alpha}\exp(-\beta\theta).
\]

The prior adds α to the sample size and β to the sum T of the observations. No new principle
is involved, and we do not discuss these further.

10.5 The Gaussian distribution


We need a probability model to describe the variation in boy birthweights. As we discussed
briefly in the previous chapter, and will discuss at length in later chapters, we use the ecdf of
the sample birthweight values in §5.9 to guide the choice of models. We see in Figure 10.3 the
smooth shape of the ecdf. What probability distribution would have such a sloping S-shaped
cdf? The Gaussian distribution has such a cdf, and Figure 10.4 shows the “best” Gaussian
distribution cdf superimposed on the ecdf, with the 95% credible region shown in red for
the true cdf. The agreement looks good, but for birthweights between 8.2 and 9 pounds the
Gaussian cdf is outside the credible region. The largest birthweight is also notably far from
the second-largest. We investigate the agreement more carefully in Chapter 12 on diagnostics.
We assume for the present that the Gaussian distribution can adequately represent the
birthweights. Its density function is

\[
f(y \mid \mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y-\mu)^{2}}{2\sigma^{2}}\right),
\]

where µ is the mean of the distribution and σ is a scale parameter (assumed known in this
chapter), the standard deviation.
[Plot: cumulative proportion against birthweight (4–14).]

FIGURE 10.3
Boy birthweight cumulative proportions
[Plot: cdf against birthweight (4–14).]

FIGURE 10.4
Boy birthweight cumulative proportions (circles), Gaussian model (solid) and 95% credible
region (red)

10.6 The Gaussian likelihood function


The likelihood for the sample $y = (y_1, \ldots, y_n)$ is
\begin{align*}
L(\mu) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y_i-\mu)^{2}}{2\sigma^{2}}\right) \\
       &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y_i-\bar{y}+\bar{y}-\mu)^{2}}{2\sigma^{2}}\right) \\
       &= \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n}\exp\left(-\frac{\sum_{i=1}^{n}(y_i-\bar{y})^{2} + n(\bar{y}-\mu)^{2}}{2\sigma^{2}}\right) \\
       &= c \cdot \exp\left(-\frac{n(\bar{y}-\mu)^{2}}{2\sigma^{2}}\right),
\end{align*}

where c is a known constant, involving the data and σ but not µ. The likelihood depends
on only one function of the data: the sample mean ȳ, which is a sufficient statistic for µ:
it is the only data function (apart from the known sample size n) we need to describe the
likelihood.
Another function of the data, the sample sum of squares (about the sample mean) $\sum_{i=1}^{n}(y_i-\bar{y})^{2}$, also appears in the likelihood, but it is a known constant, as is σ, the scale
parameter. The sum of squares in this model is an ancillary statistic: it does not give infor-
mation about µ, but it may give information about the model, in this case the specified value
of σ. We will see in Chapter 11 on two-parameter models that if the Gaussian distribution
model is correct, then the sample sum of squares divided by σ 2 has a χ2 distribution with
(n − 1) degrees of freedom. This would allow us to check whether the specification of the
value of σ 2 is consistent with the sample data.
It is striking that the likelihood is identical (apart from ignorable constants) to that
from a single Gaussianly distributed “observation” ȳ with mean µ and variance σ 2 /n. This
illustrates a remarkable property of the likelihood, pointed out by Fisher, that the likelihood
defines the distribution of the sufficient statistics (if these exist), in this case the sample
mean ȳ, which is Gaussianly distributed N (µ, σ 2 /n). Although the distribution of the sample
mean is commonly described as its repeated-sampling distribution, there is no invocation of
the Central Limit Theorem or the repeated sampling principle needed to justify it: the
distribution follows directly from the likelihood.

10.7 Frequentist inference


The first attempt at a non-Bayesian analysis with the likelihood was by Fisher, who was
strongly anti-Bayesian. It was a very simple idea: the Gaussian likelihood defined the distribution of the sample mean $\bar{y} \sim N(\mu, \sigma^{2}/n)$. So with probability 95%, $\bar{y} \in \mu \pm 1.96\sigma/\sqrt{n}$. By simple inversion of the interval, this could be written as $\mu \in \bar{y} \pm 1.96\sigma/\sqrt{n}$. Fisher called this
interval a 95% fiducial interval for µ. The unusual word fiducial was a legal term implying
“taken as a standard of reference”, or “founded on faith or trust”. Fisher did not explain
the name or how it should be understood.

He may have thought this obvious: he regarded the result as an indication that the
Bayesian analysis was unnecessary: he had the same result without the need for a prior
distribution, uniform or not. Fisher extended the idea to the two-parameter Gaussian dis-
tribution, but it could not be extended to discrete distributions, and the aforementioned
simple result depended on the existence of pivotal functions (discussed in §10.12).
Neyman and Pearson (1933) based their development of confidence intervals on the re-
peated sampling principle. The principle does not state how procedures should be evaluated,
nor which behaviours, but common practice led to the use of bias and sampling variance or
mean square error in simulation studies, as measures of performance of parameter estimators
– estimates computed from actual or hypothetical repeated samples. For the performance of
intervals, their length and coverage in repeated sampling were the general criteria.
Before the likelihood or computer experiments were available, it was impractical to assess
the precision of intervals, and statistical inference therefore relied on asymptotic theory. The
Central Limit Theorem (CLT) played an important role: it could be assumed that sums (or
means) of random variables would be Gaussianly distributed in sufficiently large samples,
and the CLT could be invoked to justify the asymptotic properties of the MLE and the
confidence interval. It was neither necessary nor possible to know whether the asymptotic
result applied in the observed sample: it was sufficient that it existed.
The likelihood expressed this more precisely and strongly. If the MLE was internal to
the parameter space (not on a boundary) then as the sample size → ∞, the likelihood would
approach the Gaussian form. This meant that the log-likelihood would be quadratic in the
parameter, and so the frequentist analysis could be based on the first two derivatives of the
log-likelihood function ℓ(µ):

\begin{align*}
\ell(\mu) &= \log L(\mu) = c^{*} - n\log\sigma - \tfrac{n}{2}\log(2\pi) - \frac{n(\bar{y}-\mu)^{2}}{2\sigma^{2}} \\
\ell'(\mu) &= \frac{n(\bar{y}-\mu)}{\sigma^{2}} \\
\ell''(\mu) &= -\frac{n}{\sigma^{2}} .
\end{align*}

The maximum likelihood estimate (MLE) $\hat{\mu}$ of µ is the solution of $\ell'(\mu) = 0$, which is ȳ,
and its variance σ 2 /n is the negative inverse of the second derivative. The higher derivatives of
the Gaussian log-likelihood in µ are all zero: the information about µ is contained completely
in the first two derivatives and the value at the maximum. The Gaussian distribution is
unique in having a quadratic log-likelihood.
This asymptotic result is a principle of much of frequentist analysis in general: we rely
on the first two derivatives of the log-likelihood to provide the MLE and its variance. We
demonstrate this process repeatedly in subsequent models. The post-Fisher development of
likelihood computation allowed a major advance: we could now see from its computation
(at least in single-parameter models) whether the observed sample data log-likelihood was
quadratic.
If it was, then the credible intervals (with a flat prior) in this sample would be the same
as the confidence intervals. However, for credible and confidence intervals to be the same
generally in the repeated-sampling theory of inference, the assumption of the log-quadratic
likelihood would have to apply in all the hypothetical samples from which the confidence
intervals would be computed. Of course we could not assess this since the samples are
hypothetical. On the other hand, we could conceptualise a sequence of hypothetical samples
all of which have the same log-quadratic property as the observed sample. Or, we could have

a much longer set of hypothetical random samples from which we filter out those which do
not have log-quadratic likelihoods. These constructions of hypothetical samples may seem
absurd, but follow exactly the argument used in the 2×2 contingency table in Chapter 7. As
Cox (2006, p. 198) put it,

The choice of the appropriate set of hypothetical repetitions is in principle fundamental, although in practice much less often a focus of immediate concern.
(Emphasis added)

One of the peculiarities of this procedure (apart from its hypotheticality) is that it gives no
information about whether the confidence interval covers the true value. It says only that
the probability of this coverage in the hypothetical samples is 95%. A long-run frequency
statement has to have a long run for its use, even if it is hypothetical. Students often confuse
this statement with the Bayesian posterior probability statement, which does refer to the
actual sample: it is the probability that the true value lies in the credible interval.
For the 648 boy babies, the sample mean birthweight is 7.65 pounds. With σ = 1.12
pounds, the 95% confidence interval for the Child Development population is given by $7.65 \pm 1.96 \cdot 1.12/\sqrt{648} = 7.65 \pm 0.086 = [7.56, 7.74]$ pounds.
It is rare in practice for us to know the standard deviation but not the mean. In general
both parameters are unknown. We deal with this case in Chapter 11.

10.8 Bayesian inference


We use the non-informative uniform prior for µ. This is improper – it does not integrate to a
finite quantity. However, provided the sample data do not give a flat likelihood, the posterior
distribution of µ is proper: it is exactly the Gaussian density:

\[
f(\mu \mid \bar{y}) = \frac{\sqrt{n}}{\sigma\sqrt{2\pi}}\exp\left(-\frac{n(\bar{y}-\mu)^{2}}{2\sigma^{2}}\right).
\]
So µ has the Gaussian posterior distribution $N(\bar{y}, \sigma^{2}/n)$. It follows immediately that a 95% central credible interval for µ is $\bar{y} \pm 1.96\sigma/\sqrt{n}$, where the value 1.96 is the 97.5 quantile of the standard normal distribution. The 95% credible interval is exactly the same as the 95% confidence interval. This is a consequence of the pivotal function $\sqrt{n}(\bar{y}-\mu)/\sigma$ in the Gaussian distribution, and the flat prior on µ.

10.8.1 Prior arguments


The prior represents the information available about the parameter prior to the data be-
coming available. This is a principal objection to the Bayesian analysis by some frequentists.
Since the prior isn’t part of the data, it must be based on some internal judgement of the
scientist, or analyst. Doesn’t that make it “subjective” – or personal to the person doing
the analysis? Many Bayesians would reply that the choice of the Gaussian distribution is
already a matter of judgement by the scientist, although it can be assessed and discarded or
altered if it is inappropriate (we discuss this further in Chapter 12 on diagnostics). Related
to this question is an argument about prior ignorance. If the analyst has no information at
all about the parameter, how can the analysis be done?

Bayes and Laplace used flat – constant – priors to represent ignorance, if there is no prior
evidence to support any possible parameter value over any other possible value. If the prior
is constant, then the posterior distribution is the likelihood function, scaled to integrate to 1.
An immediate problem (frequently presented as a dismissive objection) arises with the
Gaussian mean µ. Since this can conceptually take on any value in an infinite range, a
“proper” uniform distribution for µ (one which integrates to 1 over the range of µ) cannot
be defined over an infinite range, since the integral would be infinite as well. This criticism
can be answered in two ways:
• No matter how large µ may be, we can always construct a vastly long but finite interval
for it.
• We can define the prior over a finite interval and then let the interval endpoints tend to
±∞.
An additional argument for the uniform prior comes from finite population considerations
(and all real populations are finite). For the proportion p of zero values of a binary variable
in a population of size N , the possible values of the proportion are I/(N + 1), where I is
an integer in the range 0 to N . Without any prior information, all the values I/(N + 1) are
equally probable – so the non-informative prior is discrete uniform on these values. As N
increases, the uniform discrete prior approaches the uniform continuous prior on (0,1). This
argument extends directly to the mean of any variable y measured with finite measurement
precision δ. Its possible values are on a finite grid, and the possible values of the population
mean µ are also on a finite grid of spacing δ/N .
For the Gaussian mean, using the second option above, with a uniform prior for µ over
a large range [a, b], the posterior is the scaled likelihood over the range [a, b], which must
integrate to 1 over this range after scaling. The posterior of µ is
\begin{align*}
\pi(\mu \mid \bar{y}) &= c \cdot f(\mu \mid \bar{y}) \\
&= c \cdot \exp\left(-\frac{n(\mu-\bar{y})^{2}}{2\sigma^{2}}\right), \quad \text{for } \mu \in [a, b].
\end{align*}
The constant c is determined by integration over [a, b]. We have
\begin{align*}
1 &= \int_a^b c \cdot f(\mu \mid \bar{y})\,d\mu
   = c \cdot \left[\Phi\left(\frac{\sqrt{n}(b-\bar{y})}{\sigma}\right) - \Phi\left(\frac{\sqrt{n}(a-\bar{y})}{\sigma}\right)\right], \\
c &= 1\Big/\left[\Phi\left(\frac{\sqrt{n}(b-\bar{y})}{\sigma}\right) - \Phi\left(\frac{\sqrt{n}(a-\bar{y})}{\sigma}\right)\right],
\end{align*}
where Φ is the standard Gaussian cdf. As b → ∞ and a → −∞, c → 1. So we obtain the same result as using the improper prior π(µ) = 1 on µ ∈ (−∞, ∞).

10.9 Hypothesis testing


Hypothesis testing, or more generally model comparison requires us to compare the evidence
for two or more competing model explanations for the observed data.

This is different from making a decision about the “best” model and taking some con-
sequent action. Decisions and actions have to consider many factors beyond the empirical
data. In this book we do not consider these factors, which are part of Statistical Decision
Theory, in which decisions and actions have consequent losses, determined by the state of
nature, and the object is usually to minimise the (expected) loss.
Our aim is narrower: to state the relative evidence for the competing hypotheses or mod-
els, and where possible and appropriate to average over the models to provide a composite
conclusion. This will be developed in subsequent chapters.
In the one-parameter Gaussian model, there are two possible types of competing hy-
potheses:
• comparing one specified value µ1 with another specified value µ2 ;
• comparing a specified “null” value µ = µ0 with an unspecified “alternative” value of
µ ≠ µ0.
We take as an example for the first case a sample of n = 25 from a Gaussian distribution
with variance σ 2 = 1 and sample mean ȳ = 0.4. Model 1 has µ1 = 0, model 2 has µ2 = 1.
For the second case the sample data are the same, and model 1 has µ0 = 0, but model 2 has
µ unspecified.

10.10 Frequentist hypothesis testing


10.10.1 µ1 vs µ2
For this case, both frequentist and Bayesian inference use the likelihood ratio L(µ1 )/L(µ2 ).
This is
\begin{align*}
\frac{L(\mu_1)}{L(\mu_2)} &= \frac{\exp\left[-\dfrac{n(\bar{y}-\mu_1)^{2}}{2\sigma^{2}}\right]}{\exp\left[-\dfrac{n(\bar{y}-\mu_2)^{2}}{2\sigma^{2}}\right]} \\
&= \exp\left[-\frac{n}{\sigma^{2}}(\mu_2-\mu_1)(\bar{y}-\bar{\mu})\right],
\end{align*}
where µ̄ = (µ1 + µ2 )/2. Informally, a large likelihood ratio, much greater than 1, would
support M1 (model 1) over M2 (model 2), and a small value, much less than 1, would
support M2 over M1, while a likelihood ratio around 1 would not give a clear preference for
either model.
But frequentist hypothesis testing, as developed initially by Fisher and more formally by
Neyman and Pearson, was not formulated to deal with two “equal status” models. It was
designed to give preference to one – the current best theory – as the “null”, over the other –
an extension of the null – as the “alternative”, as in the second case mentioned. The current
best theory was to be retained unless the data gave convincing evidence against it.
The difficulty for the frequentist theory, with competing models of equal status, was in
defining the distribution of the likelihood ratio, for there were two different distributions
under the two models. The theory could work only if one of the models was specified as
the “null” and the other as the “alternative”. Reversing these would change the distribution
of the observed likelihood ratio, and not symmetrically. We do not consider this problem
further.

10.10.2 µ0 vs µ ≠ µ0
We restrict this discussion to the use of the p-value, widespread in all fields of application.
The argument with the credible interval has an exact parallel in the confidence interval. Does
the confidence interval cover the null hypothesis value? The 95% central confidence interval for µ is $0.4 \pm 1.96 \cdot 1/\sqrt{25} = [0.008, 0.792]$. This just excludes 0: the zero value is on the boundary of the 95.44% confidence interval, and the parameter region beyond this interval has confidence 4.56%.3
However, in the frequentist p-value hypothesis testing framework, we express this differently. Under the null hypothesis, the probability of observing a sample mean of 0.4 (a “Z”-value of $\sqrt{25}(\bar{y} - \mu_0)/1 = 2$) or more is 0.0228 – 2.28%. This is the one-sided p-value of the sample outcome. The two-sided p-value of 0.0456 corresponds to including in the “more extreme values” region values of ȳ < −0.4. This seems unreasonable, but the whole
process is based on values “more extreme than that observed” – these can be defined in
different ways, giving different p-values. A common calibration of the p-value is that 0.05
is mild evidence, 0.01 is strong evidence and 0.001 is very strong evidence against the null
hypothesis.
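These p-values are simple to verify (a minimal Python check, not part of the text):

```python
from math import sqrt
from scipy.stats import norm

n, sigma, ybar, mu0 = 25, 1.0, 0.4, 0.0
z = sqrt(n) * (ybar - mu0) / sigma   # the "Z"-value, equal to 2.0 here

p_one_sided = norm.sf(z)             # P(sample mean >= 0.4 under the null), about 0.0228
p_two_sided = 2 * norm.sf(abs(z))    # also includes sample means <= -0.4, about 0.0455
print(z, p_one_sided, p_two_sided)
```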
The logic of this expression of evidence is puzzling, as we noted before. What we observed
is a sample mean of 0.4, but the evidence we quote is for a different event: that the sample
mean was greater than or equal to 0.4. The simple reason for this is that in the frequentist
framework we cannot give non-zero probability to single values of a continuous random
variable – only intervals of values can have non-zero probability. Students are sometimes
told, confusingly:

Well, we would certainly reject the hypothesis even more definitely if the sample
mean was greater than 0.4, so we include this in the interval probability.

This is a statement of procedure, not of logic.


We now give the Bayesian analysis for both cases.

10.11 Bayesian hypothesis testing


10.11.1 µ1 vs µ2
The likelihood ratio L(µ1 )/L(µ2 ) is
\begin{align*}
\frac{L(\mu_1)}{L(\mu_2)} &= \frac{\exp\left[-\dfrac{n(\bar{y}-\mu_1)^{2}}{2\sigma^{2}}\right]}{\exp\left[-\dfrac{n(\bar{y}-\mu_2)^{2}}{2\sigma^{2}}\right]} \\
&= \exp\left[-\frac{n}{\sigma^{2}}(\mu_2-\mu_1)(\bar{y}-\bar{\mu})\right],
\end{align*}
where µ̄ = (µ1 + µ2 )/2. The Bayesian inference formalises the use of the likelihood ratio. We
need prior probabilities, π1 and π2 on the two models, with π2 = 1 − π1 . We update with
3 The 4.56% region is beyond both ends of the confidence interval, corresponding to an alternative on either side of the null.



the data y the prior probabilities to posterior probabilities through Bayes’s theorem:

\begin{align*}
\Pr[M1 \mid y] &= \Pr[y \mid M1]\,\Pr[M1]/\Pr[y] \\
\Pr[M2 \mid y] &= \Pr[y \mid M2]\,\Pr[M2]/\Pr[y] \\
\frac{\Pr[M1 \mid y]}{\Pr[M2 \mid y]} &= \frac{\Pr[y \mid M1]}{\Pr[y \mid M2]}\cdot\frac{\Pr[M1]}{\Pr[M2]}
 = \frac{L(\mu_1)}{L(\mu_2)}\cdot\frac{\Pr[M1]}{\Pr[M2]},
\end{align*}
which can be expressed as
\[
\text{posterior odds} = \text{likelihood ratio} \cdot \text{prior odds}.
\]
Note that this ratio does not involve the scale constant Pr[y].
Write $LR = L(\mu_1)/L(\mu_2)$, $\pi_1 = \Pr[M1]$, $\pi_{1|y} = \Pr[M1 \mid y]$ and similarly for M2. Then
\[
\frac{\pi_{1|y}}{\pi_{2|y}} = LR \cdot \frac{\pi_1}{\pi_2} .
\]
If we are initially indifferent between the two models, the posterior odds are equal to the likelihood ratio, and $\pi_{1|y} = LR/(1+LR)$, $\pi_{2|y} = 1/(1+LR)$. In general
\[
\pi_{1|y} = \frac{LR\cdot\pi_1/\pi_2}{1 + LR\cdot\pi_1/\pi_2} = \frac{\pi_1 LR}{\pi_2 + \pi_1 LR} .
\]
An important question is how to calibrate the likelihood ratio and posterior probability.
There is no unique scale for the interpretation of the likelihood ratio, but values of 1, 3,
10, 30 and 100 are in common use for none, very weak, mild, strong and very strong data
evidence for the numerator model compared to the denominator model. If the models had
equal prior probabilities, the corresponding numerator posterior probabilities would be (to
3 dp) 0.5, 0.75, 0.909, 0.968, 0.990.
For the example, we have µ̄ = 0.5, and
\[
\frac{L(\mu_1)}{L(\mu_2)} = \exp\left[-\frac{25}{1}(1)\cdot(-0.1)\right] = \exp(2.5) = 12.18.
\]

By the calibration above, this would be mild sample evidence in favour of M1 , not surprising
since the sample mean is closer to 0 than to 1.
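With equal prior probabilities, the posterior probability of M1 follows directly from the formulas above; a short Python check (illustrative only):

```python
import numpy as np

n, sigma, ybar = 25, 1.0, 0.4
mu1, mu2 = 0.0, 1.0
mubar = (mu1 + mu2) / 2

LR = np.exp(-(n / sigma**2) * (mu2 - mu1) * (ybar - mubar))   # exp(2.5) = 12.18
post_M1 = LR / (1 + LR)                                       # about 0.924 with equal prior odds
print(LR, post_M1)
```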

10.11.2 µ0 vs µ ≠ µ0
In the second case, the null model is based on the current best theory, which is to be retained
unless the data give convincing evidence against it. The example has µ1 = 0 and σ = 1, with
n = 25 and the sample mean ȳ = 0.4. Given the data y, which model is better supported?
Under the null hypothesis we have the likelihood L(µ0 ), but under the alternative µ is
unknown. However, we have its posterior distribution (with the flat prior on µ) given by
µ | ȳ ∼ N (ȳ, σ 2 /n). We can use this in two different but complementary ways.

10.11.2.1 Use the credible interval


Does the credible interval contain the null hypothesis value? The 95% central credible interval for µ is $0.4 \pm 1.96 \cdot 1/\sqrt{25} = [0.008, 0.792]$. This just excludes 0: the zero value is at the
boundary of the 95.44% credible interval. This seems to be some evidence against the null
hypothesis, though it is not very strong.

10.11.2.2 Use the likelihood ratio


We first note that the best-supported value of µ from the data under the alternative is $\hat{\mu} = \bar{y} = 0.4$. At this value the likelihood ratio is
\[
\frac{L(\mu_1)}{L(\hat{\mu})} = \exp\left[-\frac{25}{2}(1)\cdot(\bar{y}-\mu_1)^{2}\right] = \exp(-2) = 0.1353.
\]

The alternative hypothesis (that µ ≠ 0) is better-supported than the null at this value of µ,
but not very strongly: the inverse of the ratio is 1/0.1353 = 7.39. At other values of µ the
support for the alternative is even weaker, so the likelihood ratio is greater.
Dempster (1974, 1997) gave a straightforward approach to this calibration. We are unable
to give a single number likelihood ratio, but we can find its posterior distribution, from that
of µ. Although we can give in this simple case an analytic solution (as did Dempster), we
will instead use a simulation approach, because this is very general and simple.
We make a large number M of random draws µ[m] of µ from its N (ȳ, σ 2 /n) posterior
distribution, and substitute them into the denominator of the likelihood ratio, to give M
random draws LR[m] = L(µ0 )/L(µ[m] ) from the posterior distribution of the likelihood ratio.
The cdf of M = 10, 000 draws is shown in Figure 10.5.

[Plot: cdf against likelihood ratio (0.25–2.00).]

FIGURE 10.5
Posterior distribution of the likelihood ratio

The distribution is extremely skewed. The median likelihood ratio is 0.1694, and the 95%
central credible interval is [0.1354, 1.666]. This does not exclude the value 1 of indifference
between the two hypotheses. The posterior probability that the likelihood ratio is greater
than 1 (null hypothesis better-supported than the alternative) is 0.0465. We can convert
these results to those for the posterior probability of the null hypothesis.
If we specify equal prior probabilities for the null and alternative hypotheses, then the
median posterior probability of the null hypothesis is 0.145, and the 95% credible interval is
[0.119, 0.625]. The evidence against the null hypothesis is quite weak.
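The simulation just described is straightforward to reproduce; the Python sketch below (illustrative, not the book's code) generates the posterior draws of the likelihood ratio and the summaries quoted above (the exact figures vary slightly with the random seed):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, ybar, mu0, M = 25, 1.0, 0.4, 0.0, 10_000

# draws of mu from its posterior N(ybar, sigma^2/n)
mu = rng.normal(ybar, sigma / np.sqrt(n), size=M)

# likelihood ratio LR[m] = L(mu0) / L(mu[m])
LR = np.exp(-(n / (2 * sigma**2)) * ((ybar - mu0)**2 - (ybar - mu)**2))

print(np.median(LR))                        # about 0.17
print(np.quantile(LR, [0.025, 0.975]))      # about [0.135, 1.7]
print(np.mean(LR > 1))                      # about 0.046

# with equal prior probabilities, posterior probability of the null for each draw
post_null = LR / (1 + LR)
print(np.median(post_null), np.quantile(post_null, [0.025, 0.975]))   # about 0.15, [0.12, 0.63]
```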

10.11.2.3 The integrated likelihood


Many Bayesians have difficulty with the null hypothesis testing problem. Some dismiss it
completely as useless in the Bayesian paradigm: they dismiss the idea of “testing a null
hypothesis” as irrelevant. (We already know that any precise point hypothesis is false!) It
is, however, relevant in the frequentist paradigm, and is entrenched in many official and
semi-official procedures.
Other Bayesians struggle with a conventional approach. It is obvious that the maximised
likelihood, as a one-point summary, is an extreme overstatement of data information about
the likelihood, as would be the MLE or posterior mode as a one-point summary of the
data information about the parameter. The difficulty for these Bayesians arises from the
convention that instead of maximising the likelihood over the unknown parameter, this
parameter should be integrated out of – averaged over – the likelihood with respect to its
prior distribution.
The argument given for this convention comes from an analogy with bivariate dis-
tributions. If (Y, X) have a bivariate distribution f (y, x), this can be factored (as with
events) into the marginal distribution of X and the conditional distribution of Y given
X: $f(y, x) = f(y \mid x)f(x)$. Integrating out x on both sides gives the marginal distribution of Y: $\int f(y, x)\,dx = f(y) = \int f(y \mid x)f(x)\,dx$. The analogy uses y as the random variable Y with a conditional distribution given θ of f(y | θ), and uses X as the model parameter θ with prior (“marginal”) distribution π(θ). Then the marginal distribution of y is
\[
f(y) = \int f(y \mid \theta)\pi(\theta)\,d\theta = \int L(\theta)\pi(\theta)\,d\theta .
\]

The right-hand side is the scaling or normalising factor which scales the product of likelihood
and prior into the posterior. The integrated likelihood is a weighted average of the likelihood,
with prior weights given by the prior importance of the parameter values. In the convention,
the integrated likelihood is used as though it were the likelihood from a completely specified
model (as with the null model). So there is no uncertainty in the integrated likelihood: it
will be equal to a value of the likelihood L(µ∗ ) at some other value µ∗ of the parameter µ,
determined by the prior specification.
There is no principle in Bayesian analysis justifying this use of the integrated likelihood
as though it were the actual likelihood under the alternative model. It is not a consequence
of Bayes’s theorem. An alternative way of expressing the integrated likelihood is as the prior
mean of the likelihood: it is another one-point summary of the likelihood function, like the
maximised likelihood. This view emphasises the difficulty of the “marginal” argument. The
prior distribution represents our information about the model parameter before the data are
observed. It is not the distribution of an observable random variable on which the distribution
of the data has been conditioned. (An unusual view of some Bayesians is that Nature has

performed a random draw of θ from the prior, and the best we can do is to average over
Nature’s possible choices.)
Quite apart from this philosophical objection, there is an immediate difficulty with
the computation of the integrated likelihood if the prior is improper, for example flat on
(−∞, ∞). This prior does not integrate to 1, nor can it be scaled to do so by multiplying
by a constant. For a finite integrated likelihood, the prior must be proper, for example by
being defined on a finite interval, or by other parametrisations.
Suppose we define the prior for µ as flat on the finite interval [a, b] : π(µ) = 1/(b − a).
Then integrating the Gaussian likelihood over this interval gives
\[
\frac{1}{b-a}\int_a^b L(\mu)\,d\mu = \frac{c}{b-a}\cdot\left[\Phi\left(\frac{\sqrt{n}(b-\bar{y})}{\sigma}\right) - \Phi\left(\frac{\sqrt{n}(a-\bar{y})}{\sigma}\right)\right],
\]

where c is a constant function of the data and σ. The integrated likelihood is an explicit
function of a and b. Varying these values will vary the integrated likelihood correspondingly.
We cannot let a → −∞ and b → ∞ as in the posterior computation, for then, though the
second term → 1, the first term → zero, so the integrated likelihood under the alternative
hypothesis tends to zero – the null hypothesis will be infinitely well-supported relative to
the alternative!
Many attempts have been made to avoid this difficulty, by using priors that are “not
too diffuse” relative to the likelihood (for example Kass and Raftery 1990). This requires an
inspection of the likelihood to decide how to assess diffuseness and specify a not-too-diffuse
prior. This is using the likelihood to determine the prior, violating the fundamental Bayesian
precept that the prior must be specified before the data are observed: we are using the data
twice! A detailed discussion of this problem can be found in Aitkin (2010, §2.8).

10.12 Pivotal functions


The data and parameter function $\sqrt{n}(\mu - \bar{y})/\sigma$ (or $\sqrt{n}(\bar{y} - \mu)/\sigma$) has the remarkable property
of having the same known N (0, 1) distribution, whether frequentist (with ȳ the random
variable and µ the fixed parameter) or Bayesian (with ȳ the fixed data function and µ the
random variable with a uniform prior distribution). A function of data and parameters whose
distribution does not depend on the parameters is called a pivot, or pivotal function.
Pivots are found in other distribution models, not just the Gaussian. As a consequence of
this identity, the Gaussian 95% (or any other %) confidence interval for µ also has a Bayesian
interpretation with a specific prior, and in this interpretation is called a 95% (or any other
%) credible interval. But the Bayesian interpretation is different. As we noted earlier, in the
frequentist theory we cannot say in general that

there is a 95% probability that this 95% confidence interval covers the true popu-
lation value.

That statement is effectively a Bayesian credible interval statement, which correctly re-
expressed is

there is a 95% probability that the true population value lies in this 95% credible
interval.

The first statement has to be re-expressed in terms of the coverage property of a hypothetical
ensemble of random intervals:

In hypothetical repetitions of the sampling process, 95% of the hypothetical 95%


confidence intervals will cover the true population value.

But in this special case of the Gaussian distribution with known variance (and in other cases
in which a pivot exists), the probability statement has both interpretations!

10.13 Conjugate priors


We do not need to use a flat prior. We may have information which provides a preference
for some parameter values over others. One way of representing such information is through
the conjugate prior. The conjugate prior for µ would be of the Gaussian form

\[
\pi(\mu \mid \bar{z}, m) = c \cdot \exp\left(-\frac{m(\bar{z}-\mu)^{2}}{2\sigma^{2}}\right),
\]

where m and z̄ are called prior parameters or sometimes hyper-parameters. The prior can be
expressed as representing the information from a prior experiment or prior study of size m in
which the study mean was z̄. By appropriate choice of the prior parameter values, the prior
can approximate the information about µ that is actually available (provided it is symmetric
around the mean). As m → 0, the Gaussian prior approaches the uniform prior.
The posterior distribution of µ can be evaluated directly from the two densities:

\begin{align*}
f(\mu \mid \bar{y}, \bar{z}) &= \frac{f(\mu, \bar{y}, \bar{z})}{f(\bar{y}, \bar{z})}
 = \frac{f(\bar{y} \mid \mu)\cdot f(\mu \mid \bar{z})}{\int f(\bar{y} \mid \mu)\cdot f(\mu \mid \bar{z})\,d\mu} \\
&= c \cdot \frac{\exp\left\{-\left[n(\bar{y}-\mu)^{2} + m(\bar{z}-\mu)^{2}\right]/2\sigma^{2}\right\}}{\int \exp\left\{-\left[n(\bar{y}-\mu)^{2} + m(\bar{z}-\mu)^{2}\right]/2\sigma^{2}\right\}\,d\mu} .
\end{align*}
The numerator term is
\[
c^{*}\cdot\exp\left\{-\left[(n+m)\left(\mu - \frac{n\bar{y}+m\bar{z}}{n+m}\right)^{2} + \frac{nm(\bar{y}-\bar{z})^{2}}{n+m}\right]\Big/2\sigma^{2}\right\},
\]
where the normalising denominator term does not involve µ. The posterior distribution of µ is
\[
\mu \mid \bar{y}, \bar{z}, n, m \sim N\left(\frac{n\bar{y}+m\bar{z}}{n+m},\ \frac{\sigma^{2}}{n+m}\right).
\]

The posterior mean is the weighted mean from the likelihood, weighted by the data sample
size, and the prior, weighted by the prior sample size. The posterior variance is reduced by
the additional prior sample size.
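As a small numerical illustration (with a hypothetical prior, not one used in the text): combining the boy birthweight summaries n = 648, ȳ = 7.65 and σ = 1.12 with a conjugate prior equivalent to m = 50 prior observations with mean z̄ = 7.5 gives the posterior below.

```python
import numpy as np

# data summaries from the boy birthweight example
n, ybar, sigma = 648, 7.65, 1.12

# hypothetical prior "sample": m prior observations with mean zbar (illustrative values only)
m, zbar = 50, 7.5

post_mean = (n * ybar + m * zbar) / (n + m)   # weighted mean of data mean and prior mean
post_sd = sigma / np.sqrt(n + m)              # posterior SD, reduced by the prior sample size

ci = (post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd)
print(post_mean, post_sd, ci)
```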

10.14 The uniform distribution


The uniform distribution appears constantly in Bayesian simulations, which start from ran-
dom draws from the standard uniform (0,1) distribution, which are then expressed as random
draws of the cdf of the distribution of interest, then transformed into random draws from
the density or mass function to generate the likelihood.
The standard uniform distribution itself is: f (y) = 1 for y ∈ (0, 1), f (y) = 0 elsewhere. It
has no parameters, other than the interval over which it is non-zero. The mean is 1/2 and
variance is 1/12. However, special cases of this distribution illustrate very important points
of statistical inference.

10.14.1 The location-shifted uniform distribution


We have a random sample (y1 , . . . , yn ) drawn from the location-shifted uniform distribution,
with a shift of θ and density f (y | θ) = 1 for y ∈ (θ, θ + 1). The likelihood is
L(θ) = 1 for yi ∈ (θ, θ + 1), ∀i
= 0 otherwise.
This is equivalent to
L(θ) = 1 for θ ∈ (y(n) − 1, y(1) )
= 0 otherwise.
where y(1) and y(n) are the smallest and largest order statistics of the sample when the
observations are sorted into increasing order.
The likelihood is zero outside the θ range [y(n) − 1, y(1) ], and is the constant 1 – flat –
elsewhere. There is no single sufficient statistic like the sample mean for θ. The order statistics
y(1) and y(n) are jointly sufficient for θ. Their role here is just to define the interval of values
of θ which have the same non-zero likelihood. There is no single maximum likelihood estimate:
all the values in the interval [y(n) − 1, y(1)] have the same likelihood.
Confidence interval construction is peculiar as well. There is no way to distinguish one
value of θ over another in the permissible interval. On the other hand, there is no need to: the
permissible interval is a 100% confidence interval for θ – we are certain that θ must lie in this
interval ! Further, this interval is also a 100% credible interval with a flat prior, or with any
other prior, and the posterior is just a rescaled version of the prior over the permissible range.
To some model-free frequentists, this unusual property of the uniform likelihood suggests
that a model-free approach with the obvious sample mean estimator of θ would do better, with the usual asymptotic 95% confidence interval of $\bar{y} \pm 1.96\, s/\sqrt{n}$. This should work ad-
equately; if not, the bootstrap standard error could be used instead, or the quantiles from
random draws of the bootstrap mean.
An extended discussion of this example and others is given in Aitkin (2018). There are
two difficulties with alternative approaches:
• The 95% confidence interval may exclude values of θ which are included in the likelihood-
based 100% confidence interval, and are certainly possible;
• The 95% confidence interval may include values of θ which are excluded from the
likelihood-based 100% confidence interval, and are certainly impossible.
The asymptotic coverage of the model-free intervals in repeated sampling is irrelevant to the
100% precision in the given sample, which the likelihood provides in this model.
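A small simulation (a Python sketch with assumed values, not from the text) makes the contrast concrete: the likelihood-based interval [y(n) − 1, y(1)] is certain to contain θ, while the mean-based interval can both exclude certainly possible values and include certainly impossible ones.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 3.7                       # true shift (any illustrative value)
n = 20
y = rng.uniform(theta, theta + 1, size=n)

# likelihood-based 100% interval: L(theta) = 1 exactly for theta in [y_(n) - 1, y_(1)]
lik_interval = (y.max() - 1, y.min())

# the model-free comparison used in the text: ybar +/- 1.96 s / sqrt(n)
# (note that E[Y] = theta + 1/2, so as an interval for theta itself one would shift by 1/2)
ybar, s = y.mean(), y.std(ddof=1)
mean_interval = (ybar - 1.96 * s / np.sqrt(n), ybar + 1.96 * s / np.sqrt(n))

print(lik_interval, mean_interval)
```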
11
Statistical Inference III – two-parameter continuous
distributions

We now extend the family of models to those with two parameters. This raises new issues
in the relation between the two parameters.

11.1 The Gaussian distribution


We have already seen this distribution with the scale parameter σ known; now it is unknown.
The likelihood for the sample $y = (y_1, \ldots, y_n)$ is
\begin{align*}
L(\mu, \sigma) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y_i-\mu)^{2}}{2\sigma^{2}}\right) \\
&= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y_i-\bar{y}+\bar{y}-\mu)^{2}}{2\sigma^{2}}\right) \\
&= \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n}\exp\left(-\frac{\sum_{i=1}^{n}(y_i-\bar{y})^{2} + n(\bar{y}-\mu)^{2}}{2\sigma^{2}}\right) \\
&= c \cdot \frac{\sqrt{n}}{\sigma}\exp\left(-\frac{n(\bar{y}-\mu)^{2}}{2\sigma^{2}}\right)\cdot\frac{1}{\sigma^{n-1}}\exp\left(-\frac{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}{2\sigma^{2}}\right)
\end{align*}

where c is a known constant, not involving σ or µ. The likelihood depends on only two
functions of the data: the sample mean ȳ and the “residual sum of squares” (about the mean) $RSS = \sum_{i=1}^{n}(y_i-\bar{y})^{2}$. These are the sufficient statistics for µ and σ: they are the
only data functions (apart from the known sample size n) we need to describe the likelihood,
which factors into two separate pieces. The mean µ appears only in the first term, but σ
appears in both terms.

11.2 Frequentist analysis


The MLEs of µ and σ² are $\hat{\mu} = \bar{y}$, $\hat{\sigma}^{2} = RSS/n$. The MLE of σ² is biased, in a distributional sense: its expectation is not σ², but (n − 1)σ²/n. As n → ∞ the bias (σ²/n) → 0. This

property was often used as an argument against maximum likelihood as a general method
of finding an estimator of a parameter. The unbiased (quadratic in the data) estimator is
s2 = RSS/(n − 1). We discuss next the use of the restricted or marginal likelihood for the
justification of the unbiased estimator.


The frequentist inference about µ and σ is based on the sufficient statistics ȳ and RSS, which have independent distributions:
• $\sqrt{n}(\bar{y} - \mu)/\sigma \sim N(0, 1)$;
• $RSS/\sigma^{2} \sim \chi^{2}_{n-1}$.
The two-parameter Gaussian distribution has two pivots! For inference about µ, the frequentist approach uses the derived pivot
\[
t = \frac{\sqrt{n}(\bar{y}-\mu)}{\sigma}\Big/\frac{s}{\sigma} = \frac{\sqrt{n}(\bar{y}-\mu)}{s} \sim t_{n-1},
\]
which has a Student's t-distribution with ν = n − 1 degrees of freedom, with density
\[
\frac{\Gamma((\nu+1)/2)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)}\left(1 + \frac{t^{2}}{\nu}\right)^{-\left(\frac{\nu+1}{2}\right)}.
\]
This is symmetric about the mean, but has longer “tails” than the Gaussian distribution,
reflecting the uncertainty in the variability. Large deviations from the mean have higher
probability than under the Gaussian.

11.3 Bayesian analysis


The form of the likelihood in the Gaussian model appears in other two-parameter (θ, ϕ)
distributions, in which the likelihood factors into a conditional likelihood of θ given ϕ and a
marginal likelihood of ϕ.
For the Gaussian, we need a joint prior distribution for µ and σ. Provided that the prior
distribution is conjugate – factors into a function of µ given σ and a function of σ alone –
the joint posterior distribution can be expressed as the product of the conditional posterior
distribution of µ given σ and the marginal posterior distribution of σ.
We see immediately that as in the case of σ known, the posterior distribution of µ is, with
a flat prior, N (ȳ, σ 2 /n). But since σ is unknown, this is now the conditional distribution given
σ. The posterior distribution of σ is less clear, but it can be made clearer by reparametrising
to the inverse variance – the precision parameter ϕ = 1/σ². Then the likelihood term in ϕ is $\phi^{(n-1)/2}\exp(-RSS\,\phi/2)$, a gamma distribution.
If the prior distribution of ϕ is the improper power family $\pi(\phi) = c\cdot\phi^{r}$, then the posterior is the gamma distribution:
\[
\pi(\phi \mid y) = \frac{RSS^{[(n-1)/2+r+1]}}{2^{[(n-1)/2+r+1]}\,\Gamma[(n-1)/2+r+1]}\,\phi^{[(n-1)/2+r]}\exp(-RSS\,\phi/2).
\]
This can be represented in the χ² form. In the general gamma form
\[
f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x},
\]
the gamma (α, β) random variable with α = ν/2 and β = 1/2 is called a χ² random variable with ν degrees of freedom. So $RSS\,\phi = RSS/\sigma^{2} \sim \chi^{2}_{n+2r+1}$. The value of r determines the degrees of freedom ν (Table 11.1). The prior $\pi(\phi) = c\cdot\phi^{-1}$ is equivalent to $\pi(\sigma) = c\cdot\sigma^{-1}$, for which the posterior is $RSS/\sigma^{2} \sim \chi^{2}_{n-1}$. This is the prior suggested by Jeffreys for positive
TABLE 11.1
χ2 degrees of freedom ν for prior
parameter r
r −2 −3/2 −1 −1/2 0
ν n−3 n−2 n−1 n n+1

parameters, and is the common standard for a non-informative prior for a scale parameter
in Bayesian analysis. This shows the property of the two pivots in the Gaussian distribution:
the inferential conclusions for µ and σ are the same, for the flat priors on µ and log σ.
From a frequentist viewpoint, the marginal posterior of ϕ is equivalent to the Marginal
or Restricted likelihood for σ 2 . The maximising value of the restricted likelihood is given by
σ̃ 2 = RSS/(n − 1), which is generally called the REML (Restricted Maximum Likelihood)
or MML (Marginal Maximum Likelihood) estimate.
The joint conjugate prior is of the same form as the likelihood, and can be expressed as
being based on an auxiliary experiment with sample size m, sample mean z̄ and residual
sum of squares ASS. The joint posterior then factors into the marginal posterior χ²_{n+m−1}
distribution of (RSS + ASS)/σ² and the conditional Gaussian distribution of µ given σ,
N([nȳ + mz̄]/[n + m], σ²/[n + m]).
It might appear that this result is not general, because the same σ has been assumed for
the prior as for the data model. However, if we take a different variance for the prior, say
ϕ², we can redefine the prior variance to be σ² and the prior sample size m* to be such that
ϕ²/m = σ²/m*, that is, define m* = mσ²/ϕ². We do not give further details.

11.3.1 Inference for σ


The σ parameter is not generally of interest, but inference about it is straightforward.
Given the sum of squares RSS for a sample of size n from a Gaussian distribution, and
the Jeffreys prior (flat on log σ), the 95% central credible interval for σ² is given by

    σ² ∈ [RSS/χ²_{n−1, 0.975}, RSS/χ²_{n−1, 0.025}].

For example, if RSS = 24 in a sample of n = 20 from a Gaussian distribution, the 95%
credible interval is σ² ∈ [24/32.85, 24/8.91] = [0.73, 2.69], so the 95% credible interval for σ
is [0.85, 1.64].
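
A minimal computational sketch of this interval (in Python, assuming SciPy is available) for the RSS = 24, n = 20 example is:

from scipy.stats import chi2

RSS, n = 24.0, 20
lower = RSS / chi2.ppf(0.975, df=n - 1)   # 24/32.85, about 0.73
upper = RSS / chi2.ppf(0.025, df=n - 1)   # 24/8.91, about 2.69
print(lower, upper)                        # 95% credible interval for sigma^2
print(lower**0.5, upper**0.5)              # 95% credible interval for sigma: about [0.85, 1.64]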

11.3.2 Inference for µ


The Gaussian conditional distribution for µ given σ can be marginalised by integrating over
the χ² distribution of RSS ϕ = RSS/σ²:

    f(µ | y) = ∫ f(µ | y, σ²) f(σ² | y) dσ².

The Gaussian distribution is one of the very few in which this integration can be done
analytically. As for the frequentist case, the analytic derivation gives a Student's t-distribution
with degrees of freedom ν = n − 1, for t = √n(µ − ȳ)/s where s² = RSS/(n − 1), with density

    f(t) = Γ((ν + 1)/2) / [√(νπ) Γ(ν/2)] · (1 + t²/ν)^{−(ν+1)/2}.
For most two-parameter distributions, this integration is not analytic, and must be carried
out by simulation. Instead of finding the probability density of µ by integration, we generate
a large random sample of M values of µ. If M is sufficiently large, like 10,000, this provides
an accurate approximation to the cdf, from which accurate quantiles can be obtained.

11.3.2.1 Simulation marginalisation


We give a simple example of the Gaussian simulation. We have RSS = 24 in a sample of
n = 5 from a Gaussian distribution, with sample mean ȳ = 10. We make M = 10,000 draws
σ²[m] from 24/χ²₄, and for each m draw a random µ[m] from N(10, σ²[m]/5). The empirical
cdf of the 10,000 draws is shown (dots) in Figure 11.1, together with the cdf of t₄ (solid
curve), shifted by ȳ = 10 and scaled by s/√n. The agreement is very close.
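
A minimal sketch of this simulation (in Python, assuming NumPy is available) is:

import numpy as np

rng = np.random.default_rng(1)
RSS, n, ybar, M = 24.0, 5, 10.0, 10_000

sigma2 = RSS / rng.chisquare(df=n - 1, size=M)   # draws of sigma^2 = RSS/chi^2_{n-1}
mu = rng.normal(ybar, np.sqrt(sigma2 / n))       # draws of mu | sigma^2 ~ N(ybar, sigma^2/n)
print(np.quantile(mu, [0.025, 0.5, 0.975]))      # marginal posterior quantiles of mu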
The t-distribution arises in another way, from a Gaussian distribution where the variance
σ 2 is varying randomly over the data. This is a common model for outliers. We will see in
Chapter 12 that the boy with the largest birthweight does not belong to the distribution
for the other boys. The constant variance model represents well the other boy birthweights
but does not apply to this boy. In §15.8 we represent this by a finite mixture of Gaussian
distributions, where the components of the mixture have the same variances but different
means. But we can also represent it as a continuous mixture of Gaussians with the same
means but different variances.

11.3.3 Parametric functions


Several functions of µ and σ are of interest: the coefficient of variation σ/µ and its inverse
µ/σ, the number of SDs µ is away from zero.¹ Inference about these follows directly from
the random draws of µ and σ. For the first, we compute the M values σ [m] /µ[m] (using
different random seeds for the two sets of draws). For the second, we reverse the terms:
µ[m] /σ [m] . Figures 11.2 and 11.3 show the cdfs of the 10,000 draws for the aforementioned


FIGURE 11.1
Cdf of 10,000 draws of t4 (dots) and t4 cdf (solid curve)
1 In some disciplines this function is called the effect size.


FIGURE 11.2
Cdf of 10,000 draws of the coefficient of variation σ/µ


FIGURE 11.3
Cdf of 10,000 draws of µ/σ

simulation example. The posterior distribution of σ/µ is heavily skewed, with a very long
tail. Simulation values of µ near zero give very large values of the coefficient of variation.
The posterior of µ/σ is only slightly skewed, the small values of µ having little effect. So if
we are interested in the coefficient of variation, and µ may be near zero, it would be more
effective, and more accurate, to use the posterior distribution of µ/σ to provide quantiles,
then invert these to give those of the coefficient of variation.
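
A minimal sketch of these parametric-function posteriors (in Python, assuming NumPy; the draws are regenerated here from the same RSS = 24, n = 5, ȳ = 10 example, directly from the joint draws) is:

import numpy as np

rng = np.random.default_rng(1)
RSS, n, ybar, M = 24.0, 5, 10.0, 10_000
sigma2 = RSS / rng.chisquare(df=n - 1, size=M)
mu = rng.normal(ybar, np.sqrt(sigma2 / n))

effect = mu / np.sqrt(sigma2)                      # draws of mu/sigma, the "effect size"
print(np.quantile(effect, [0.025, 0.5, 0.975]))
# quantiles of the coefficient of variation sigma/mu, by inverting those of mu/sigma
print(1.0 / np.quantile(effect, [0.975, 0.5, 0.025]))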

11.3.4 Prediction of a new observation


Prediction of a new value of y is an unusual problem in this simple Gaussian model, but
when we consider regression models it becomes an important application, so we develop the
theory here. It is also important for model assessment, which we discuss at length in Chapter
12. We give first the Bayesian formulation.
Given a random sample y drawn from a population modelled by the Gaussian distribution
N (µ, σ 2 ), we want to predict a new value, denoted by y0 , to be drawn from this distribution.
We do this in a two-stage random drawing process, from the posterior distribution of (µ, σ 2 ).
The analytic distribution of y0 can be obtained by integration:

    f(y0 | y) = ∫∫ f(y0 | µ, σ²) f(µ, σ² | y) dµ dσ²
              = ∫ f(σ² | y) dσ² ∫ f(y0 | µ, σ²) f(µ | σ², y) dµ.

Analytically, the marginal distribution of

    t = √[n/(n + 1)] · (y0 − ȳ)/s

is t_{n−1}, where s² = Σᵢ(yᵢ − ȳ)²/(n − 1). However, the sequence of conditional distributions
in the integral shows clearly how to simulate from the distribution of y0 , by reversing the
order of integration:
• From the marginal distribution of σ², draw a large number M of random values σ²[m];
• then for each m, draw a random value µ[m] from the conditional Gaussian distribution
of µ given σ²[m];
• then for each m, draw a random value y0[m] from the conditional Gaussian distribution
of y given µ[m] and σ²[m].
The M values y0[m] are a random sample from the posterior predictive distribution of y. These
values are more variable than those for the mean µ from the first two draws in the drawing
sequence: they incorporate the inevitable uncertainty in drawing a random value from any
distribution.
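
A minimal sketch of the three-step predictive simulation (in Python, assuming NumPy; the RSS = 24, n = 5, ȳ = 10 values are reused for illustration) is:

import numpy as np

rng = np.random.default_rng(2)
RSS, n, ybar, M = 24.0, 5, 10.0, 10_000

sigma2 = RSS / rng.chisquare(df=n - 1, size=M)   # step 1: sigma^2 | y
mu = rng.normal(ybar, np.sqrt(sigma2 / n))       # step 2: mu | sigma^2, y
y0 = rng.normal(mu, np.sqrt(sigma2))             # step 3: y0 | mu, sigma^2
print(np.quantile(y0, [0.025, 0.5, 0.975]))      # posterior predictive quantiles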
The frequentist formulation is in three steps:

    √[n/(n + 1)] (y0 − ȳ)/σ ∼ N(0, 1),
    (n − 1)s²/σ² ∼ χ²_{n−1} independently,
    √[n/(n + 1)] (y0 − ȳ)/s = {√[n/(n + 1)] (y0 − ȳ)/σ} / √(s²/σ²) ∼ t_{n−1}.
The posterior predictive distribution has an important role in assessing the Gaussian model
assumption. Instead of predicting new values of Y, we can use the sample data we have,
and compare the sample cdf of the "Studentised" variables √[n/(n + 1)]·(yᵢ − ȳ)/s with that
for the t_{n−1} distribution. This might seem strange – shouldn't we be comparing it with the
Gaussian distribution? We pay a price for not knowing µ and σ, which are estimated
from the sample with imprecision; that imprecision is reflected in the more diffuse t-
distribution with its longer tails than the Gaussian. As n increases this imprecision goes
to zero.
Two-parameter skewed distributions have wide application. We examine three of them.

11.4 The lognormal distribution


The lognormal distribution was widely used over many decades for the analysis of right-
skewed data. The log transformation of the response scale could greatly reduce and sometimes
eliminate the skew, allowing the Gaussian model to be assumed for the log transformed
responses. We illustrate this idea with the phone data.

11.4.1 The lognormal density


We begin with the reverse transformation. We have Z ∼ N(µ, σ²), and transform the scale
to Y = exp(Z). The Gaussian and lognormal probability elements are

    f(z) dz = 1/[√(2π)σ] · exp[−(z − µ)²/(2σ²)] dz,   −∞ < z < ∞,
    f(y) dy = 1/[√(2π)σy] · exp[−(log y − µ)²/(2σ²)] dy,   0 < y < ∞.

The only difference between the densities is in the additional denominator term in y in the
lognormal. This does not affect the ML or the Bayesian posteriors, though it does affect the
value of the likelihood.
For the ML or Bayesian analysis, it is just a matter of replacing the usual terms in yᵢ
by the corresponding terms in zᵢ = log(yᵢ). So the MLE of µ from the random sample
y = (y1, . . . , yn) is µ̂ = Σᵢ zᵢ/n, and for σ² is σ̂² = Σᵢ(zᵢ − µ̂)²/n. The MLEs are
µ̂ = 5.267, σ̂ = 0.842. Figure 11.4 shows the ML fitted cdf of the lognormal (solid curve)
with the ecdf (circles) and 95% credible region (red) for the true cdf on the scale of log hours.
The lognormal does not fit well: the curvature is wrong.
Inference about the 80th quantile follows from the 80th quantile of the standard Gaussian
distribution, which is 0.842. So the posterior distribution of the 80th quantile for the phone
data is obtained from the 10,000 draws of the parametric function µ[m] + 0.842 σ [m] . Figure
11.5 gives the posterior cdf. The median and 95% credible interval for the 80th quantile of
µ+0.842 σ are 6.02 and [3.66, 8.40]. Transforming exponentially to the original scale of y, the
median and 95% credible interval transform to 411.6 and [38.9, 4447]. These values are quite
different from those for the exponential. The apparently incorrect distribution specification
has thrown off both the location and the length of the credible interval. We discuss this further in
Chapter 12 on model assessment.
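
A minimal sketch of the 80th-quantile simulation (in Python, assuming NumPy; the lifetimes used here are hypothetical stand-ins for the phone data) is:

import numpy as np

rng = np.random.default_rng(3)
hours = np.array([12., 40., 95., 150., 210., 320., 480., 700.])   # hypothetical lifetimes
z = np.log(hours)
n, zbar = len(z), z.mean()
RSS = np.sum((z - zbar)**2)

M = 10_000
sigma2 = RSS / rng.chisquare(df=n - 1, size=M)      # sigma^2 | data (flat prior on log sigma)
mu = rng.normal(zbar, np.sqrt(sigma2 / n))          # mu | sigma^2, data
q80 = np.exp(mu + 0.842 * np.sqrt(sigma2))          # 80th quantile, back on the hours scale
print(np.quantile(q80, [0.025, 0.5, 0.975]))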

FIGURE 11.4
Empirical (circles) and ML fitted lognormal (solid) cdfs, and 95% credible region (red)


FIGURE 11.5
Cdf of the lognormal 80th quantile

11.5 The Weibull distribution


The Weibull distribution, named after Waloddi Weibull (1887–1979), a Swedish engineer,
scientist and mathematician, is an extension of the exponential – a power transformation of
it. The form of the density is (extending the exponential hazard form)

    f(y | θ, α) = αθ y^{α−1} exp(−θy^α),

where α is the shape parameter and θ the scale parameter.² The survivor function is now
S(y) = exp(−θy^α), the hazard function is h(y) = αθy^{α−1}, and the integrated hazard is
H(y) = θy^α. The hazard is monotone decreasing for α < 1, constant for α = 1 and monotone
increasing for α > 1. The Weibull distribution is an accelerated failure time model, in the
sense that we may think of an age variable a as defining a rescaling of clock time y, by
a = y α . On this age scale, the distribution of age lifetime is exponential. So if α > 1, age is
“accelerated” relative to clock time, while if α < 1, age is “braked” relative to clock time.
The Weibull mean and variance are complex functions of the model parameters: mean
µ = θ^{−1/α}(1/α)Γ(1/α), variance = θ^{−2/α}[(2/α)Γ(2/α) − (1/α)²Γ²(1/α)]. Practical interest
is focussed on the quantiles of the distribution, which have simpler functional form in the
parameters. The 100γ quantile of the distribution, yγ, is given by

    S(yγ) = exp(−θyγ^α) = 1 − γ
    H(yγ) = θyγ^α = −log(1 − γ)
    yγ = {−log(1 − γ)/θ}^{1/α}
    log(yγ) = {log[−log(1 − γ)] − log(θ)}/α.

So the median is y_{0.5} = [0.693/θ]^{1/α}.
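
A minimal sketch of this quantile formula (in Python, assuming NumPy), evaluated at the ML estimates α̂ = 1.308, θ̂ = 0.000821 quoted in §11.5.2, is:

import numpy as np

def weibull_quantile(gamma, theta, alpha):
    # y_gamma = {-log(1 - gamma) / theta}^(1/alpha)
    return (-np.log(1.0 - gamma) / theta) ** (1.0 / alpha)

print(weibull_quantile(0.5, 0.000821, 1.308))   # median, about 173 hours
print(weibull_quantile(0.8, 0.000821, 1.308))   # 80th quantile, about 329 hours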

11.5.1 The Weibull likelihood


For the random sample y = (y1, . . . , yn), the likelihood in θ and α is

    L(θ, α) = ∏_{i=1}^n f(yᵢ | θ, α)
            = ∏_{i=1}^n αθ yᵢ^{α−1} exp(−θyᵢ^α)
            = αⁿ θⁿ P^{α−1} exp(−θT[α])
            = θⁿ exp(−θT[α]) · αⁿ P^{α−1}
            = c · θⁿ exp(−θT[α]) · αⁿ exp(αS)
            = c · {θⁿ T[α]ⁿ exp(−θT[α]) / Γ(n + 1)} · {αⁿ exp(αS) / T[α]ⁿ},

where T[α] = Σ_{i=1}^n yᵢ^α, P = ∏_{i=1}^n yᵢ, S = log P, and c is a function of the data only.
² The form of the density and the parameter symbols are not those standardised in the engineering literature.

11.5.2 Frequentist analysis


There is no simple joint ML estimation for the parameters, though it is easily seen that,
given α, θ̂_α = n/T[α]. ML for the Weibull is widely available in most statistical packages.
For this parametrisation we have α̂ = 1.308, θ̂ = 0.000821.

11.5.3 Bayesian analysis


As for the exponential distribution, the likelihood again factors, but into a less simple form,
as there are no simple sufficient statistics for θ and α because of the appearance of yᵢ^α in the
sum T[α]. With a prior 1/θ (uniform on log θ) and π(α) on α, the posterior distribution of
θ, conditional on α, is gamma with parameters n and T[α]. α has no analytic posterior in
general, whatever prior is used for it,³ though it can be computed numerically.
We compute the likelihood over a uniform double grid, 300 × 300, for α and log θ in the
region of appreciable likelihood. Marginalising gives the marginal posteriors for α and log θ.
Figure 11.6 shows the marginal posterior for α and Figure 11.7 for θ, exponentiated from
log θ.
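
A minimal sketch of the double-grid computation (in Python, assuming NumPy and SciPy; the lifetimes, grid limits and the coarser 200 × 200 grid here are illustrative choices only) is:

import numpy as np
from scipy.special import logsumexp

hours = np.array([12., 40., 95., 150., 210., 320., 480., 700.])   # hypothetical lifetimes
n, S = len(hours), np.sum(np.log(hours))

alpha = np.linspace(0.5, 2.5, 200)
log_theta = np.linspace(-10.0, -3.0, 200)
A, LT = np.meshgrid(alpha, log_theta, indexing="ij")

T = np.array([np.sum(hours**a) for a in alpha])[:, None]       # T[alpha] over the grid
loglik = n*np.log(A) + n*LT + (A - 1.0)*S - np.exp(LT)*T       # log L(theta, alpha)

post = np.exp(loglik - logsumexp(loglik))                      # normalised over the grid
marg_alpha = post.sum(axis=1)                                  # marginal posterior of alpha
marg_logtheta = post.sum(axis=0)                               # marginal posterior of log theta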
The posterior for θ is heavily skewed (one reason for using the log transformation). The
median and 95% central credible intervals are, for α, 1.313, [1.108, 1.538], and for θ, 0.000775
and [0.0001693, 0.002588]. The MLE for α (1.308) is very close to the posterior median, but
both are far from the exponential value of 1. The MLE of θ is less close (0.000821) to the
posterior median, a consequence of the heavy skew in this parameter.


FIGURE 11.6
Posterior density of α

3 It is not a gamma distribution unless S is negative.



FIGURE 11.7
Posterior density of θ

Our interest is in the 80th quantile of the distribution, y80 . This is an analytic function
of the parameters:

    S(y80 | θ, α) = 0.20
    exp(−θ y80^α) = 0.20
    θ y80^α = −log(0.2)
    y80^α = −log(0.2)/θ = 1.61/θ
    y80 = (1.61/θ)^{1/α}
    log(y80) = [log(1.61) − log(θ)]/α = [0.476 − log(θ)]/α.

We evaluate this function of the two parameters, sort the values, and construct the me-
dian and 95% credible interval for log(y80 ) in the usual way, then exponentiate these
values to give those for y80 . Figure 11.8 shows the posterior distribution for the 80th
quantile.4
The median and 95% credible interval on the log scale are 5.799 and [5.696, 5.968]; the
credible interval is slightly asymmetric about the median. Exponentiating gives median hours
330.0 and credible interval [283.2, 380.4]. The median is close to that for the exponential
distribution, but the credible interval is much shorter (48) at the upper end.
⁴ The small ripples near the median are a consequence of the grid spacing. With 10⁶ points in the grid instead of 90,000, the cdf is completely smooth.



FIGURE 11.8
Cdf of the 80th quantile, log hours

11.5.4 The extreme value distribution


If the survival time Y has a Weibull distribution, the log survival time Z = log Y has an
extreme value distribution, with density and survivor functions

    f(z) = αθ exp(αz) exp(−θe^{αz}),
    S(z) = exp(−θe^{αz}).

The log survivor function is a negative exponential function of z, rather than the negative
power function of y in the Weibull. We do not discuss its properties further: they follow from
those of the Weibull.

11.5.5 Median Rank Regression (MRR)


A widely used estimation method in the engineering literature (Abernethy et al 1983; Aber-
nethy 2010; Genschel and Meeker 2010) for the Weibull parameters is based on the shape
of the log integrated hazard function: log H(y) = log θ + α log y. So if we graph the sample
log integrated hazard against log survival time, we should see a linear trend, from which the
parameters log θ and α can be estimated.
Figure 9.6 showed the phones data on these scales, with the straight line corresponding
to the exponential distribution (α = 1). It was clear that the exponential distribution did
not fit well, but the graph suggested a straight line with slope greater than 1. Figure 11.9
shows the same graph but with the Weibull distribution fitted by maximum likelihood (with
α̂ = 1.308). The envelope of red lines is a 95% credible region for the true cdf. The Weibull
distribution clearly fits well.
The approach to fitting the Weibull distribution in the MRR literature is quite different.
A straight line is fitted by least squares (LS) to the log hazard graph against log survival time.

FIGURE 11.9
Empirical (circles) and ML fitted Weibull (black line) log integrated hazard, with 95% cred-
ible region (red lines)

This is shown in Figure 11.10. The least squares estimates (and SEs) are α̃ = 1.231 (0.013),
log θ̃ = −6.680 (0.071), θ̃ = 0.00126. The estimates are fairly close to the MLEs,
α̂ = 1.308, θ̂ = 0.000821, but the LS standard errors are much smaller than the preci-
sions of the MLEs or the posteriors, as they do not reflect the information in the Weibull
likelihood – they are optimal for a Gaussian likelihood. The estimates could be used, how-
ever, as initial estimates in a Newton-Raphson or Fisher scoring algorithm for maximum
likelihood.
The MRR approach to analysis reflects a view, unfortunately common outside the statistics
profession, of statistical data analysis as a branch of optimisation. Given the problem, we need
a criterion or objective (goodness-of-fit) function which has to be optimised – maximised or
minimised. The sum of squared residuals from the fitted model is a popular choice. As we
have seen in other models, the sum of squares would be optimal in the sense of statistical
theory if the “error” distribution were Gaussian. Here it is not, so the LS estimates cannot be
optimal in the statistical sense – they are not efficient. More seriously, since their standard
errors are seriously underestimated, they will give over-precise confidence intervals for any
parameters or parametric functions, like quantiles or predicted values.
The unusual name of the method comes from the early analysis of heavily censored
experimental data. We do not give details, but discuss censoring in the next section.

11.5.6 Censoring
An important application of the Weibull (and other survival distributions) is the censoring of
observations. This term is used in statistics for the termination – cutting off – of observation
of the lifetime, before failure has occurred. Censored observations cannot “speak” – their
information is “censored”.
Censoring is common in both engineering and medical applications. In the first, it in-
volves life-testing of components by placing them under stress on a suitable machine and


FIGURE 11.10
Least squares fit to the log integrated hazard

operating the machine until failure occurs, or until some pre-specified censoring time has
elapsed. In medical applications, especially of cancer treatment, the patient is observed after
treatment until death or for a fixed pre-specified follow-up time. The observed lifetimes of
the components, or patients, which have not been fully observed because of censoring are
not discarded: they provide information about the model parameters and must be included
in the data analysis.
This is done quite simply: a component or patient which has survived for a time T before
observation ends is known to have an unobserved failure time which is greater than the
termination time. So the contribution to the likelihood of a censored lifetime T is given by
the survivor function S(y) at time T .
We give a simple example of the exponential distribution. Suppose we have an additional
phone which was withdrawn from service after operating for a time T = 300 hours with-
out failing. How does this affect our inference about the mean lifetime λ? The likelihood,
including this observation, is now

    L(λ) = [∏_{i=1}^m f(yᵢ | λ)^{nᵢ}] · S(T | λ)
         = ∏_{i=1}^m [(1/λ) exp(−yᵢ/λ)]^{nᵢ} · exp(−T/λ)
         = (1/λⁿ) exp{−[Σᵢ nᵢyᵢ + T]/λ}
         = (1/λⁿ) exp{−(nȳ + T)/λ}.

The effect of the censored observation is to increase the total survival time of the phones
by T without increasing their number, so the MLE of λ will increase, from ȳ = 210.8 to
ȳ + T /n = 210.8 + 3.41 = 214.21. The Bayesian analysis will be affected correspondingly.
We discuss the effect of censoring and its analysis further in §15.4.
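
A minimal sketch of the censoring adjustment (in Python, assuming NumPy; the uncensored lifetimes here are hypothetical) is:

import numpy as np

hours = np.array([12., 40., 95., 150., 210., 320., 480., 700.])   # hypothetical failure times
T = 300.0                                    # one phone withdrawn, unfailed, at 300 hours

lam_uncensored = hours.mean()                    # MLE of lambda without the censored phone
lam_censored = (hours.sum() + T) / len(hours)    # censored time adds exposure, not a failure
print(lam_uncensored, lam_censored)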

11.6 The gamma distribution


The first generalisation of the exponential was the Weibull; the second is the gamma distri-
bution, which arises as the distribution of the sum of a set of independent exponential random
variables. We have already seen this in the discussion of the exponential distribution, where
the posterior distribution for the hazard θ had exactly the form of a gamma density:

    π(θ | y) = (nȳ)^{n+a+1}/Γ(n + a + 1) · θ^{n+a} exp(−nθȳ).

The general form of the gamma density has two parameters, which we will take to be the
mean µ and the shape parameter r:

    f(y | µ, r) = r^r/[Γ(r)µ^r] · y^{r−1} exp(−ry/µ).
This form is widely used for generalised linear models (GLMs), discussed in a later chapter.
The mean µ is a scale parameter, but not a location parameter. The cdf of the distribution
can be scaled to a mean parameter equal to 1, but the shape of the density and cdf still
depend on the shape parameter r.

11.7 The gamma likelihood


The likelihood in µ and r for the phones sample with counts nᵢ at survival times yᵢ, with
i = 1, . . . , m, is

    L(µ, r) = ∏_{i=1}^m f(yᵢ | µ, r)^{nᵢ}
            = ∏_{i=1}^m {r^r/[Γ(r)µ^r] · yᵢ^{r−1} exp(−ryᵢ/µ)}^{nᵢ}
            = [exp(−rT/µ)/µ^{nr}] · [r^{nr} P^{r−1}/Γⁿ(r)],

where n = Σ_{i=1}^m nᵢ, T = Σ_{i=1}^m nᵢyᵢ and P = ∏_{i=1}^m yᵢ^{nᵢ}. The likelihood factors into a product
of a term in µ and r, and one in r only, as in the Gaussian distribution.

11.7.1 Frequentist analysis


Maximum likelihood for the exponential family, including the gamma distribution, has been
widely implemented since the 1980s. In the GLM formulation with parameters (µ, r), the
MLE of µ is ȳ, while r̂ is the solution of a non-linear equation:

    n[log r − ψ(r)] − n log ȳ + log P = 0,

where ψ is the digamma function – the derivative of the log gamma function. The MLEs
(and SEs) of µ and r are 210.8 (17.4) and 1.67 (0.89).
An attraction of this formulation is that the cross-derivative of the log-likelihood with
respect to µ and r is zero at the MLEs, as in the Gaussian log-likelihood with µ and σ. So the
MLEs µ̂ and r̂ are uncorrelated and asymptotically independent. However, as in the Gaussian
case, the likelihood in µ and r does not factor into independent terms: the parameters are
not independent in the posterior.

11.7.2 Bayesian analysis


The representation of the likelihood given earlier allows simple conditional/marginal sam-
pling for the posteriors, but the parametrisation can be simplified. We define θ = r/µ, and
reparametrise to θ and r. Then

    L(θ, r) = [θ^{nr} exp(−θT)] · [P^{r−1}/Γⁿ(r)].

For independent priors for θ and r, with that for log θ uniform and π(r) not yet specified,
the posterior for θ and r can be expressed as

    π(θ, r | y) = {T^{nr}/Γ(nr) · θ^{nr−1} exp(−θT)} · {Γ(nr)/[Γⁿ(r)T^{nr}] · P^{r−1} π(r)}.
The joint posterior now factors into a term in r only, and a term in θ and r which is exactly
a gamma posterior. From this we can conclude that, conditional on r, θT has a gamma
posterior distribution with parameters nr and 1. The other term – the marginal distribution
of r – however, has no standard form, whatever prior distribution we might use for it.
We compute the likelihood component for r with a flat prior on a dense grid from 0.5
to 2.5, then scale the likelihood component by its sum. (Since r > 0, a flat prior on log r
would also be reasonable.) Figure 11.11 shows the posterior density of r, and Figure 11.12
the cdf. The median and central 95% credible interval for r are 1.54 and [1.16, 2.00]. The
distribution is slightly right-skewed.
Then we make 10,000 random draws r[m] from this posterior, and for each r[m] , make
one random draw θ[m] from the Gamma(nr[m] , 1)/T conditional posterior distribution.
Figure 11.13 gives the joint scatter of the 10,000 draws. The two parameters are strongly
associated, not surprisingly.
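
A minimal sketch of this conditional/marginal sampling, carried through to the 80th quantile (in Python, assuming NumPy and SciPy; the lifetimes and counts are hypothetical stand-ins for the phone data) is:

import numpy as np
from scipy.special import gammaln
from scipy.stats import gamma

rng = np.random.default_rng(4)
hours = np.array([12., 40., 95., 150., 210., 320., 480., 700.])   # hypothetical survival times
counts = np.ones_like(hours)                                      # hypothetical counts n_i
n, T, S = counts.sum(), np.sum(counts*hours), np.sum(counts*np.log(hours))

# grid posterior of r: proportional to Gamma(nr) P^(r-1) / [Gamma(r)^n T^(nr)], flat prior on r
r_grid = np.linspace(0.5, 2.5, 2000)
logpost = gammaln(n*r_grid) + (r_grid - 1.0)*S - n*gammaln(r_grid) - n*r_grid*np.log(T)
post_r = np.exp(logpost - logpost.max())
post_r /= post_r.sum()

M = 10_000
r_draw = rng.choice(r_grid, size=M, p=post_r)            # r | data
theta_draw = rng.gamma(shape=n*r_draw, scale=1.0) / T    # theta T | r ~ Gamma(nr, 1)
y80 = gamma.ppf(0.8, a=r_draw) / theta_draw              # 80th quantile draws
print(np.quantile(y80, [0.025, 0.5, 0.975]))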
The posterior median for r is quite different from the MLE (1.80), which uses the addi-
tional information about r in the first component of the likelihood. This difference is similar
to that in the Gaussian distribution, where the MLE of σ 2 has n in the divisor of the RSS,
while the posterior distribution has its mode at n − 1.
Figure 11.14 shows the marginal posterior cdf of θ. The distribution of θ is also
slightly right-skewed. The median for θ is 0.00729 and the 95% central credible interval
is [0.00522, 0.00984]. The median differs somewhat from the MLE 0.00687, again a conse-
quence of the skew.
Our interest is in the 80th quantile of the distribution. This is not an analytic function
of the parameters, but we can use the inverse gamma cdf function (the “gamma deviate”
or “gamma quantile” function) to generate its posterior, by substituting the random draws
(r[m] , θ[m] ) into the gamma quantile function at the 80th quantile (Figure 11.15):


FIGURE 11.11
Posterior density of r


FIGURE 11.12
Cdf of r


FIGURE 11.13
Joint draws of r, θ


FIGURE 11.14
Cdf of θ

FIGURE 11.15
Cdf of the lifetime 80th quantile

    y80[m] = G⁻¹(0.8, r[m])/θ[m].

The median is 325.7 and the 95% credible interval is [276.9, 389.5]. These values are very
close to those for the Weibull: 330.0 and [283.2, 380.4]. The gamma median is 4.3 hours less
than the Weibull median, and the gamma credible interval is 15 hours longer. We assess the
fit of the gamma and Weibull distributions in Chapter 12 on model assessment.
12
Model assessment

We have so far assumed that the probability models specified in the analyses are appropriate.
We now examine ways in which this assumption can be investigated. The first approach, for
continuous distributions, is through the agreement between the empirical cdf (ecdf) and the
model cdf, described several times previously. For this we construct a credible region for
the true model cdf from the set of credible intervals from the ecdf at each distinct data
point, based on the binomial model and the Beta posterior distribution. We give the formal
construction:
• At each ordered distinct sample value yᵢ of the variable Y, where the value of the ecdf
is nᵢ/n, we make 10,000 draws Pᵢ[m] from the Beta posterior with the uniform prior,
Beta(nᵢ + 1, n − nᵢ + 1).
• We order the draws and extract the 2.5 and 97.5 quantiles of the ordered draws at yi .
• These quantiles define the central 95% credible interval for the true population propor-
tions Pi .
• We repeat the simulations independently at each distinct value of y.
• The region defined by the interior of the set of 95% credible intervals is called a 95%
(pointwise) credible region for the true cdf.
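
A minimal sketch of this construction (in Python, assuming NumPy; the sample is hypothetical) is:

import numpy as np

rng = np.random.default_rng(5)
y = np.sort(rng.normal(size=50))                   # hypothetical sample
yy, counts = np.unique(y, return_counts=True)
cum = np.cumsum(counts)                            # n_i: number of observations <= y_i
n, M = len(y), 10_000

lower, upper = [], []
for ni in cum:
    draws = rng.beta(ni + 1, n - ni + 1, size=M)   # Beta(n_i + 1, n - n_i + 1) posterior at y_i
    lo, hi = np.quantile(draws, [0.025, 0.975])
    lower.append(lo)
    upper.append(hi)
# (yy, lower) and (yy, upper) trace the pointwise 95% credible region for the true cdf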
An important value of this approach is that the uncertainty in the true cdf is assessed from
the data, not from the assumed distribution. This allows multiple models to be assessed for fit
to the same data without separate calculations of variability for each model. A frequentist
confidence region for the true cdf can be constructed in a number of ways (Owen 1995),
including the frequentist analogue of this one. We do not give details.
We begin with the Gaussian model example.

12.1 Gaussian model assessment


Figure 12.1 shows the agreement between the cdfs of the Gaussian distribution at the ML
estimates and the boy birthweight data, together with the 95% credible region for the true
population cdf. Though there are 648 boys in the sample, there are only 61 distinct birth-
weights. The agreement looks good.
A common problem with visualisation is assessing departures of the ecdf from a curve;
it is much easier visually to detect departures from a straight line. This is easily achieved
in location/scale models (with a location and/or a scale parameter). We transform from
the cdf probability scale to the inverse cdf (quantile) scale: on this scale the cdf becomes
a straight line, with the intercept and slope determined by the location and scale parame-
ters. The inverse Gaussian cdf scale is commonly called the probit scale, for (the Gaussian)


FIGURE 12.1
Birthweight cumulative proportions (circles), ML fitted Gaussian model (black curve) and
95% credible region (red curves)

PROBability InTegral transformation, and we will use this term and Gaussian quantile in-
terchangeably.
Figure 12.2 shows the boy birthweights on this scale. It now appears that at least the
heaviest boy departs markedly from the Gaussian distribution model. The lightest boys
do not. The heaviest boy is an outlier in the statistical sense: he does not belong to this
distribution. However the central region, around nine pounds, also departs slightly from the
Gaussian model. We discuss better-fitting models which allow for the outlier in §15.8.

12.2 Lognormal model assessment


The lognormal model needs only the cdf assessment tools from the Gaussian distribution.
We do not discuss posteriors for the model parameters. Figure 12.3 shows the 95% credible
region for the phones' true cdf, on the cdf and log hours scales. The curvature of the data appears
wrong. Figure 12.4 shows the 95% credible region on the probit scale. The strong curvature
shows that the lognormal is a bad fit: no straight line can be fitted inside the envelope. This
also means that the posterior and credible interval for the 80th quantile from the lognormal
model are unreliable: the model is wrong.

12.3 Exponential model assessment


We saw in Figure 8.3 (reproduced in Figure 12.5 with the 95% credible region for the true
cdf added) that the exponential survivor function appeared to fit quite well the empirical


FIGURE 12.2
Birthweight cumulative proportions (circles), ML fitted Gaussian model (line) and 95% cred-
ible region (red curves) on probit scale


FIGURE 12.3
Phone log lifetimes (circles), ML fitted lognormal model (curve) and 95% credible region
(red curves)


FIGURE 12.4
Phone log lifetimes (circles), ML fitted lognormal model (line), and 95% credible region (red
curves) on probit scale


FIGURE 12.5
Phone empirical (circles), ML exponential (curve) survivor functions and 95% credible region
(red curves)


FIGURE 12.6
Phone empirical (circles) and ML exponential (curve) survivor functions, and 95% credible
region (red curves), probit scale

survivor function for the phones. The exponential cdf falls completely inside the 95% credible
region. However, the cdf curve moves from one side of the credible region to the other,
suggesting a poor fit. The sample size is too small to give precision in the credible region,
but the curvature of the exponential seems to be wrongly specified. Figure 12.6 gives the
same picture on the probit scale with the same conclusions. The exponential distribution
cdf does not give a straight line on this scale. Of course it would not: the probit scale
assumes a Gaussian distribution. We need a different transformation of scale for the cdf, and
possibly time.
We have, for the exponential distribution cdf with mean λ,

    S(y) = 1 − F(y) = e^{−y/λ},
    −log S(y) = y/λ = H(y),

the integrated hazard function. So if the exponential distribution is the correct model, the
graph of the integrated hazard H(y) should be close to a straight line through the origin,
with slope 1/λ. Figure 12.7 shows the empirical integrated hazard function (circles) with the
fitted exponential line and the 95% credible region. It is clear that the integrated hazard
function does not fit a straight line at all, so the log transformation of the survivor function
is not adequate. The remaining curvature suggests a log transformation of time as well. We
log transform both scales. Then if the exponential model is correct, we should have

log[H(y)] = log[− log S(y)] = log(y) − log λ.

So the log integrated hazard function should increase linearly with log(y), with intercept
−log(λ). Figure 12.8 shows the effect of this transformation. The graph is nearly linear, and


FIGURE 12.7
Empirical (circles) and exponential (line) integrated hazard functions and 95% credible re-
gion (red curves)


FIGURE 12.8
Empirical (circles) and exponential (line) log integrated hazard functions and 95% credible
region (red curves)

the line just fits inside the credible region, but it moves from one side of the region to the
other: it is clear that the slope is incorrect. The ML fitted exponential log integrated hazard
is −5.35 + log(hours).
We extend this analysis in the next section.

12.4 Weibull model assessment


We proceed as for the exponential model. The Weibull cdf at the response value y is
F(y | θ, α) = 1 − exp(−θy^α). As with the exponential distribution, a (further) transforma-
tion of the cdf gives a clearer picture of departures from the model. We have

    F(y | θ, α) = 1 − exp(−θy^α)
    S(y | θ, α) = exp(−θy^α)
    −log S(y | θ, α) = θy^α
    log[−log S(y | θ, α)] = log(θ) + α log(y).

So on the log integrated hazard scale for the survivor function and the log scale for the
response, we should see a straight line with intercept log(θ) and slope α under the Weibull
model. Figure 12.9 shows the (ML) fitted log integrated hazard function and the credible
region.


FIGURE 12.9
Empirical (circles) and Weibull log integrated hazard functions (line) with 95% credible
region (red curves)

The ML fitted Weibull log integrated hazard is −7.105 + 1.308 log(hours). All the data
points are within the credible region boundaries. The Weibull fit is acceptable. However, this
does not mean that the Weibull distribution is correct! It means only that the Weibull is an
adequate fit to the data. But other distributions may also fit adequately; we will examine the
gamma as well. If more than one distributional model fits adequately, what do we conclude?
We discuss this in Chapter 13.
The Weibull distribution is not a location/scale family member. However, the log of the
response variable with a Weibull distribution has an extreme value distribution, which is
a location/scale family, as can be seen from the form of the density for z = log y, and
transforming the parameters by σ = 1/α, ψ = − log θ/α :

    f(y) dy = αθ y^{α−1} exp(−θy^α) dy
    f(z) dz = αθ exp(z)^{α−1} exp[−θ exp(z)^α] exp(z) dz
            = αθ exp(αz) exp[−θ exp(αz)] dz
    f(z) = (1/σ) exp[(z − ψ)/σ] exp{−exp[(z − ψ)/σ]}.

Here ψ is a location parameter and σ a scale parameter. The standard form of the extreme
value distribution, with ψ = 0 and σ = 1 is

f (z) = exp(z) exp[− exp(z)].

The model assessment in this form is the same as it is in the Weibull form as the distribution
still depends on two parameters. However the log integrated hazard function is linear in log
y in the Weibull, but linear in z in the extreme value. We do not consider the extreme value
form further.

12.5 Gamma model assessment


How do we know whether the gamma distribution fits the phone data better than the expo-
nential or Weibull? The gamma is not a location/scale model, so there is no simple inverse
cdf transformation which gives a straight line fit. We need to examine the direct fit of the cdf.
Figure 12.10 shows two fitted gamma cdfs, at the MLEs of µ and r (solid curve) and
at their posterior median values (dashed curve), with the empirical cdf (circles) and the
95% credible region (red curves). The two cdf estimates are barely distinguishable, but
their fit is poor, with the upper tail curving away from the credible region. Why is this? A
graph of the gamma and Weibull densities at their MLEs shows why (Figure 12.11). The
gamma model gives higher probability to values around the mode – lifetimes between 50
and 250 hours, while the Weibull gives higher probability to the very low and very high
lifetimes.
In Figure 12.10 the gamma cdf increases too slowly up to 50 hours, with the cdf on the
lower edge of the 95% critical region, and then increases too rapidly up to 250 hours, where
the cdf goes beyond the 95% critical region. In Chapter 14 we discuss model comparison of
these and other possible lifetime models through their deviance distributions.

FIGURE 12.10
Fitted gamma cdfs (solid – ML, dashed – posterior, circles – empirical) and 95% credible
region (red curves)


FIGURE 12.11
ML fitted gamma (red) and Weibull (black) densities
13
The multinomial distribution

In §6.9 we used the multinomial distribution in its conventional form to represent the prob-
ability structure of the population of a categorical variable, as an extension of the binomial
distribution for two categories to three or more. But surprisingly, this representation can
be used as well for “continuous” variables, which has remarkably valuable consequences for
statistical modelling and analysis. A detailed discussion of the history of this development
can be found in Aitkin (2010) Chapter 4. Here we give a brief summary. The fundamental
point, quoted previously, was expressed by Pitman (1979):

all observable random variables have discrete distributions

because any real random variable Y is recorded with finite measurement precision δ, and
so is recorded on a grid of reported values of spacing δ. Then the set of possible recorded
population values Y_{I*}, I* = 1, . . . , N can be tabulated by the D distinct ordered values
Y_I, I = 1, . . . , D into a set of D population counts N_I on these distinct population values
Y_I. The total population count is N = Σ_{I=1}^D N_I, and the population proportions P_I = N_I/N
define the population multinomial distribution of Y.
This approach is due to Hartley and Rao (1968) who called it the “scale-load” distribu-
tion, and to Ericson (1969) who gave the Bayesian version. We can think of this (YI , PI )
structure as a maximum resolution population histogram, where the bin “width” is the mea-
surement precision δ. Figure 13.1 shows the 648 family incomes in dollars of the StatLab
boy population, as a histogram of counts at the level of resolution of the incomes (δ = $1).
The histogram is far from smooth!
The multinomial idea seems bizarre to many model-based statisticians, or at least unrea-
sonably complicated. Much of the early teaching of introductory statistics was based on the
idea of a smooth population: if we could draw larger and larger samples from the population
we would see the population histogram become increasingly smooth (no such demonstrations
were commonly done). The question of interest then would be what form of smooth density
function could best represent it.
However, the income data show that this assumption can fail: the population histogram is
jagged, and while a conceptual larger population (like the combined boy and girl populations)
might have a smoother histogram, the visible preference for particular numbers (5s and
10s) in the population would remain. Model-based statisticians have for so long regarded a
probability model as a necessary smooth simplification of the unknown population structure,
through a small number of model parameters, that the idea of complicating the population
structure with a large number of parameters, one for every distinct observation, seems to be
going in the wrong direction. We seem to be overburdening ourselves with parameters which
are scarcely identifiable.
But these population proportion parameters are not the parameters of interest in most
applications. It is surprising to find that, on the contrary, population parameters of interest,
like the mean, median or other moments or quantiles, and regression coefficients, are all well-
identified without any simplifying model, and that statistical inference with the multinomial



FIGURE 13.1
StatLab family income histogram, boy population

population representation is competitive with methods based on standard model assumptions
when the simple model assumptions hold, and is superior when they do not hold. What is
more, the multinomial approach does not require any model assessment or validation: it is
valid by definition. The multinomial distribution is not a simplifying model, but an exact
representation of the population.
As with simple models, the multinomial population parameters are unknown constants,
about which we learn from the sample data. If we draw a simple random sample from the
population, we can tabulate the values of Y in the sample correspondingly, and the sample
can be expressed through the sample counts nI at YI (many of these may be zero). If the
sample size n is small compared to the population size N , so that sampling with replacement
accurately approximates sampling without replacement,1 the multinomial probability of the
sample counts nI is
D
n! Y
Pr(n1 , . . . , nD | P1 , . . . , PD ) = M (n; P1 , . . . , PD ) = QD PInI ,
I=1 nI ! I=1

where the factorial term gives the number of distinguishable arrangements of the sample
values. The multinomial is a multivariate version of the binomial, with the moments of the
nI given by

E[nI ] = N PI , Var[nI ] = N PI (1 − PI ), Cov[nI , nJ ] = −N PI PJ .


Some Bayesians worry about the “restrictive” covariance structure of the multinomial. It
has no separate parametric covariance structure, unlike the multivariate Gaussian. The co-
variance structure, like that for the binomial variance, follows from the exact representation
1 The alternative case leading to a hypergeometric likelihood is discussed in detail in Aitkin (2010),

Chapter 4.

of the population. Unlike the applications of the multinomial to true category variables, we
are not interested in the details of the individual PI and YI , since each “category” is just a
distinct single value of Y. As we will see, we are interested in weighted linear functions of the
PI, like the mean µ = Σ_{I=1}^D PI YI. The covariance structure is inherent in this representation,
and in the corresponding sample structure.

13.1 The multinomial likelihood


Given the sample counts, the likelihood (omitting known constants) is

    L(P1, . . . , PD) = ∏_{I=1}^D PI^{nI}.

Formally, we need to know the number D of distinct values of Y in the population, and the
smallest and largest population values Y1 and YD , to be able to compute this likelihood.
But for any unobserved values of YI the corresponding nI is zero, so the likelihood can be
re-expressed in terms of the PI for only the observed distinct YI . The PI for the unobserved
YI do not contribute to the likelihood, so these YI do not need to be known unless the prior
gives them non-zero weight.
In the absence of informative prior information about the unobserved YI and their pop-
ulation proportions PI , we can rewrite the likelihood L(p1 , . . . , pd ), in terms of the sample
index i and the d ordered observed sample quantities yi and ni , i = 1, . . . , d:
    L(p1, . . . , pd) = ∏_{i=1}^d pi^{ni}.

This quantity is frequently called the empirical likelihood.

13.2 Frequentist analysis


The maximum profile empirical likelihood analysis, due to Owen (1988, 2001), requires max-
imisation of the multinomial log-likelihood over the PI subject to constraints implied by the
definition of the population parameters of interest. For the population mean µ = Σ_I PI YI,
we maximise log L(P1, . . . , PD) subject to the constraints Σ_I PI YI = µ and Σ_I PI = 1,
through the constrained log-likelihood G:

    G({PI}) = Σ_{I=1}^D nI log PI − nλ Σ_{I=1}^D (PI YI − µ) − nϕ Σ_{I=1}^D (PI − 1),
    ∂G/∂PI = nI/PI − nλYI − nϕ = 0.

So

    nI = nλPI YI + nϕPI,
    n = Σ_{I=1}^D nI = nλ Σ_{I=1}^D PI YI + nϕ Σ_{I=1}^D PI,

    1 = λµ + ϕ,
    PI = nI/[nλYI + n(1 − λµ)] = P̃I/[1 + λ(YI − µ)],

where P̃I = nI/n. The constrained profile MLE (MPLE) of PI given µ is

    P̂I(µ) = P̃I/[1 + λ̂(µ)(YI − µ)],

where λ̂(µ) is the implicit solution of

    Σ_{I=1}^D P̃I/[1 + λ̂(µ)(YI − µ)] = 1.

The empirical profile log-likelihood in µ is then

    EPℓ(µ) = Σ_I nI log P̂I(µ).

This can be computed over a grid of µ and maximised numerically to give the MPLE of
µ. The precision of the MPLE is assessed by treating the profile empirical likelihood as a
parametric likelihood and inverting the likelihood ratio test to construct a profile-likelihood-
based confidence interval for µ. The confidence coefficient is asymptotic, from χ². Owen gave
details.
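
A minimal sketch of the profile computation (in Python, assuming NumPy and SciPy; the data are hypothetical, and the Lagrange multiplier is found from the equivalent estimating equation Σᵢ p̃ᵢ(yᵢ − µ)/[1 + λ(yᵢ − µ)] = 0) is:

import numpy as np
from scipy.optimize import brentq

y = np.array([1.2, 0.7, 2.2, 2.9, 3.4, 5.1])         # hypothetical data
n = len(y)

def profile_loglik(mu):
    d = y - mu
    g = lambda lam: np.mean(d / (1.0 + lam * d))     # estimating equation for lambda
    lo = -(1.0 - 1e-6) / d.max()                     # bracket keeping all 1 + lam*d_i > 0
    hi = -(1.0 - 1e-6) / d.min()
    lam = brentq(g, lo, hi)
    p = (1.0 / n) / (1.0 + lam * d)                  # constrained MPLEs of the p_i
    return np.sum(np.log(p))

grid = np.linspace(y.min() + 0.05, y.max() - 0.05, 200)
ell = np.array([profile_loglik(m) for m in grid])
print(grid[np.argmax(ell)])                          # MPLE of mu (close to the sample mean)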
A complicating issue for the MPLE analysis is that if the variance as well as the mean
is to be estimated, the penalised maximisation with the additional variance constraint will
give a different MPLE for the mean, and a more complex constrained maximisation.

13.3 Bayesian analysis


The Bayesian computations are much simpler. The conjugate prior for the multinomial is
the Dirichlet, the multi-category extension of the Beta distribution:

    π({PI} | {aI}) = Γ(Σ_{I=1}^D aI)/[∏_{I=1}^D Γ(aI)] · ∏_{I=1}^D PI^{aI−1},

where aI is the prior weight on PI. The posterior distribution of the population proportions
with this prior is again Dirichlet:

    π({PI} | {aI}, {nI}) = Γ(Σ_{I=1}^D [nI + aI])/[∏_{I=1}^D Γ(nI + aI)] · ∏_{I=1}^D PI^{nI+aI−1}.

Its use requires the specification of the prior parameters aI on all the values of Y , observed
or unobserved. An immediate possibility for a non-informative prior would appear to be
aI = 1 ∀ I, as in the binomial. However this prior has serious difficulties. In general we will
not know the number or values of the YI which have not been observed, and we do not know
the total prior weight a = Σ_{I=1}^D aI either! Even if we use only the observed data range,² the
total prior weight can be of substantial magnitude relative to the sample size, so may have
a substantial effect on the inference.
To avoid these problems, we use the reference improper Haldane prior with aI = 0 ∀ I,
giving zero prior and posterior weight to any proportions PI of values YI not observed in
the sample:
    π({PI}) = c · ∏_{I=1}^D PI^{−1}
    L({PI} | {nI}) = ∏_{I=1}^D PI^{nI}
    π({PI} | {nI}) = Γ(n)/[∏_{I=1}^D Γ(nI)] · ∏_{I=1}^D PI^{nI−1}.

The prior has an arbitrary constant c, showing that it is improper (does not integrate to 1).
With this prior, the posterior would be improper for any value PI which has sample size
nI zero, that is, is not observed in the data. The impropriety is that for such PI , the
corresponding posterior density 1/PI → ∞ as PI → 0, that is, has an infinite spike at zero:
we are infinitely certain that such PI are zero.
So in the posterior, we need to distinguish only the d distinct ordered, observed sample
values, denoted by yi, with their sample counts ni and corresponding population proportions
pi, so the posterior can be expressed as

    π({pi} | {ni}) = Γ(n)/[∏_{i=1}^d Γ(ni)] · ∏_{i=1}^d pi^{ni−1}.

For values YI not observed in the sample, their posterior probabilities are zero. We can
express the posterior as the empirical likelihood with the improper Haldane prior on the pi
for the observed data values yi .

13.4 Criticisms of the Haldane prior


It may appear that we are effectively contracting the underlying population to values already
sampled. This may appear unreasonable – we seem to be saying that it is impossible for values
unobserved in the sample to be in the population: their population proportions are structural
zeros. Rubin (1981), who proposed this Bayesian bootstrap, regarded it as unreasonable, and
(possibly consequently) this approach was little used. As we described earlier, this situation
is a consequence of the multinomial model, not the prior. For unsampled values of YI we
simply have no sample evidence about them – there is nothing to say. The alternative is even
more unattractive: to specify a proper prior we have to make informative prior statements
about the unobserved part of the population, not just about the Dirichlet parameters for the
observed part.
2 This would mean fitting the prior to the data!

Banks (1988) took up these criticisms by developing a smoothing of the Dirichlet poste-
rior. Given the Haldane prior, he proposed generating a random value of pI for each observed
YI , and then spreading it uniformly over this YI and all unobserved values (which therefore
had to be known) to the left of this YI down to the next observed value.
This required prior assumptions about both the number of unobserved values and their
locations. In this way the posterior mass was spread over the extended sample range from
y1 to yn , though in an ad hoc way. Values of Y outside the sample range had zero posterior
probability, while those within the sample range had varying prior and posterior probabilities
determined by their prior-specified spacing.
Lazar (2003) examined proposals for other Bayesian analyses with the multinomial dis-
tribution, including a mixture of maximum likelihood and Bayesian analyses using the max-
imised (profile) likelihood as a parametric likelihood, with a conventional parametric prior.
This was discussed by Owen (2001, §§9.4, 9.5).
Lazar (2003, p. 320) dismissed the Bayesian bootstrap because of

the extreme sensitivity of the results to the model assumption, in particular the form
of prior, which is usually taken to be Dirichlet for reasons of conjugacy. Furthermore,
there is no intuitive way of setting the prior, since it is not likely that information
will be available a priori about the [PI ], making any type of subjective Bayesian
analysis impossible.
(Emphasis added)

This argument seems to suggest that the multinomial distribution is subject to extreme sen-
sitivity of results, though no evidence is given to support this claim. Sensitivity of Bayesian
posteriors in any model to variations in the prior is natural and expected.
Aitkin (2008) examined the behaviour of the Haldane prior: an apparently unreasonable
prior specification would be expected to perform poorly in simulations. He demonstrated
the contrary with a simulation study of several methods. Informative departures from the
Haldane prior performed poorly, while the multinomial analysis with the Haldane prior
performed as well as the correct parametric distribution as the sample size increased.
Some statisticians suggest the use of the “overdispersed” Dirichlet-multinomial com-
pound distribution (the generalisation of the beta-binomial to more than two categories) to
generalise the covariance structure. This only complicates matters further. Integrating out
the multinomial parameters PI gives a form of multivariate hypergeometric distribution:
    Pr[{nI} | {aI}] = Γ(n + 1)Γ(Σ_I aI)/Γ(n + Σ_I aI) · ∏_{I=1}^D Γ(nI + aI)/[Γ(nI + 1)Γ(aI)].

These probabilities now depend strongly on the prior parameters aI instead of the pI , for
which we have the same difficulties as before. Worse, the posterior inferences of interest to
us are about linear functions of the PI , which have been integrated out of the compound
distribution altogether. The multivariate hypergeometric distribution arises naturally when
the sample size is large relative to the population size, and we need the hypergeometric
likelihood. This is discussed at length, with its two-stage simulation extension, in Aitkin
(2008, 2010, Chapter 4).
An important issue in our use of the multinomial distribution as a data model is that we
never need to make explicit use of the covariance structure of the model, or make inferences
about individual parameters PI : we need only the simulated distributions of functions of
these parameters, linear or non-linear. These make implicit use of the covariance structure,
but the simulations reported in Aitkin (2008, 2010, Chapter 4) show that the simulation
inferences are competitive with model-based inferences as the sample size increases.

So the "restrictive" covariance structure of the multinomial distribution is irrelevant to
our use of the multinomial. In this book we take the Haldane prior as the standard represen-
tation of the absence of prior information for the “continuous” version of the multinomial
distribution.

13.4.1 The Dirichlet process prior


The Dirichlet distribution is a special case of the Dirichlet process prior (DPP, Ferguson
1973). This has become popular in recent Bayesian computation. It is seen as a more flexible
alternative to the Haldane prior, replacing the Dirichlet weighting of the observations by the
weighting of distributions of observations in a finite mixture prior.
The object of the DPP analysis is to draw an inference about the population distribution,
rather than about the mean or other parameters. The Dirichlet process is an informative
prior, for which the user has to specify a “concentration” parameter α which controls the
complexity of the finite mixture posterior, and a “base” or kernel density distribution F0 .
The output of the DPP analysis is a set of random draws of finite mixtures of base densities,
with the number of mixture components and their proportions determined by the Dirichlet
draws.
If α is zero the posterior is just the diffuse Dirichlet multinomial, as in the Bayesian
bootstrap: the base density plays no role. As α increases, the posterior increases in mixture
complexity and closeness to the form of the base density. An example is discussed for the
galaxy recession velocities in Chapter 15.
We do not use the DPP. It represents an informative smoothing of the empirical distri-
bution towards the base density: the user must already have in mind a distribution family
which is “close to” the empirical distribution, which is difficult without seeing the data. Since
we are interested in the population mean and other parameters, not the distribution itself,
we have no need for the base density.
The term “Bayesian bootstrap” has become ambiguous, as it is used both by Rubin for
his proposal, and by other Bayesians for the Dirichlet process prior. In this book we reserve
the name for Rubin’s proposal.

13.4.2 Posterior sampling


The computation of the posterior distribution is restricted to the d distinct observed sample values $y_i$. Simulation of the $p_i$, and therefore of any marginal function of the $p_i$ like the mean $\mu = \sum_i p_i y_i$, from the Dirichlet posterior with the Haldane prior is particularly simple. There are several equivalent approaches. We use the form of the Dirichlet as a transformation and scaling of the sum of independent gamma variables.
We want to generate random draws of the population mean from a sample with d distinct values $y_i$. For a single simulation we generate d independent uniform (0,1) values $U_i$, transform them to d independent gamma variables $G_i$ with parameters 1 and $n_i$ through the gamma inverse cdf (quantile) function, and then define $p_i = G_i/\sum_j G_j$. Repeating the simulations M times gives M simulated values $p_i^{[m]}$ of the $p_i$, and hence M simulated values
$$\mu^{[m]} = \sum_{i=1}^{d} p_i^{[m]} y_i$$
from the marginal posterior distribution of µ.
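A minimal Python sketch of this simulation is given below, under stated assumptions: the arrays y and counts are hypothetical placeholders for the d distinct values and their multiplicities (they are not the StatLab data), and the gamma variables are drawn directly, which is equivalent to the uniform inverse-cdf transformation described above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical placeholders: d distinct sample values y_i with multiplicities n_i.
y = np.array([5.2, 6.1, 6.8, 7.3, 7.9, 8.4, 9.0])
counts = np.array([1, 3, 8, 12, 9, 4, 1])

M = 10_000
# Dirichlet posterior draws (Haldane prior) via gamma variables:
# G_i ~ Gamma(shape n_i), p_i = G_i / sum_j G_j.
G = rng.gamma(shape=counts, scale=1.0, size=(M, len(y)))
p = G / G.sum(axis=1, keepdims=True)

# M draws from the marginal posterior of the population mean mu = sum_i p_i y_i.
mu = p @ y
print("posterior median:", np.median(mu))
print("95% central credible interval:", np.quantile(mu, [0.025, 0.975]))
```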


Figure 13.2 gives the posterior cdf of the boy birthweight mean in the Study Population
from 10,000 draws from the Dirichlet posterior on the 61 distinct values in the StatLab
sample. The posterior median is 7.651 and the 95% central credible interval is [7.567, 7.739]. These are almost identical to the sample mean 7.652 and the 95% confidence (and credible) interval [7.565, 7.739] from the Gaussian model. The single "outlier" boy with the unusual weight does not affect the Gaussian model inference.

FIGURE 13.2
Boy birthweight mean posterior, cdf scale
Transforming the cdf scale to the probit scale in Figure 13.3 shows that the draws of the
posterior mean are very close to the Gaussian distribution. This could be expected since the
draws are the weighted sum of 61 values of y.
The computational and conceptual saving achieved with the non-informative Dirichlet
prior is striking compared with the empirical profile likelihood, and with frequentist boot-
strapping. These procedures also depend only on the observed sample values and their sample
frequencies: they do not assume anything about unobserved values. The finite population
survey sampling paradigm, which does not use parametric probability models, also makes no
assumption about the population values not included in the sample. Necessary assumptions
refer to the design of the sampling, not the structure of the population.
The term “Bayesian bootstrap” comes from the analogy with the frequentist bootstrap,
which resamples from the observed sample. This is discussed at length in §13.7. The Bayesian
bootstrap also uses only the observed sample, but it resamples from the posterior distribution
of the probabilities attached to each observed value, rather than from the values themselves.
The Bayesian bootstrap gives a more “fine-grained” distribution of the mean.
There are striking differences between the Bayesian generation of the posteriors of the
parameters of interest and the frequentist generation of the empirical profile likelihood in
these parameters.

• The population parameter posteriors are explicit functions of the multinomial Dirichlet
draws; the multinomial MPLEs are implicit functions of the multinomial parameters PI .

• The parameter posteriors are fully informative in location and precision; the precision of
the MPLEs for the population parameters has to be assessed by additional analysis, in
which it is difficult to assess or allow for skew.

The Bayesian analysis is very much simpler: the Dirichlet posterior draws are unaffected by
additional specifications of the population parameters. A further important point is that the
Bayesian bootstrap is easily extended to cover Gaussian-type regression models. We give
several examples in later chapters.

13.5 Inference for multinomial quantiles


Quantiles for parametric models are explicit functions of the parameters, so their posterior
distributions follow from those of the model parameters. The Bayesian bootstrap approach
to inference about functions of the multinomial probabilities is not limited to moments of
Y . The posterior distribution of any function of the probabilities PI can be simulated in the
same way. We illustrate with the median and the 75th percentile.
The population median Y0.5 is defined as the largest value of Y satisfying the inequality
Pr[Y ≤ Y0.5 ] ≤ 0.5. Roughly speaking the median is the value above which the cumulative
distribution function of Y changes from less than (or equal to) 0.5 to greater than 0.5. This is
easily simulated: from the mth draw of the d values $p_i^{[m]}$ we form the cumulative probabilities $c_i^{[m]} = \sum_{k=1}^{i} p_k^{[m]}$, and find the median value $Y_{0.5}^{[m]}$ for this draw; this is a random draw from the posterior distribution of $Y_{0.5}$. Since the sample values of Y are discrete, the posterior distribution of the median is also discrete, on the same sample support. So the cdf will be constant between jumps at the observed support points, and the exact percentiles of the posterior distribution will be a discrete set.

FIGURE 13.3
Boy birthweight mean posterior, probit scale

FIGURE 13.4
Cdf of posterior median, family income
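The sketch below illustrates this quantile simulation, reusing hypothetical arrays y and counts (illustrative values only, not the StatLab family income data); the tie-handling convention for locating the quantile on the discrete support is one common choice and may differ in small details from the book's.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical distinct values and multiplicities, for illustration only.
y = np.array([46, 47, 52, 53, 55, 56, 58, 60, 65, 67, 69, 70, 72, 75, 80, 85, 93])
counts = np.array([1, 1, 1, 2, 2, 3, 6, 5, 4, 3, 2, 2, 2, 1, 1, 1, 1])

def posterior_quantile_draws(y, counts, q, M=10_000):
    """Draw M values of the population q-quantile under the Dirichlet (Haldane) posterior."""
    G = rng.gamma(shape=counts, scale=1.0, size=(M, len(y)))
    p = G / G.sum(axis=1, keepdims=True)
    c = np.cumsum(p, axis=1)              # cumulative probabilities c_i^[m]
    idx = (c < q).sum(axis=1)             # index of the first c_i >= q in each draw
    return y[idx]

med_draws = posterior_quantile_draws(y, counts, 0.5)
q75_draws = posterior_quantile_draws(y, counts, 0.75)
print("posterior median of the median:", np.median(med_draws))
print("posterior median of the 75th percentile:", np.median(q75_draws))
```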
Figure 13.4 shows the posterior cdf of the median from M = 10, 000 draws of family
income at birth, and Figure 13.5 shows the posterior cdf of the 75th percentile from the
same set of draws. The posterior median (for the population median) is 60, and for the
75th percentile is 75. The discrete posterior mass functions of these percentiles are shown in
Table 13.1, and the posterior cdfs are shown in Table 13.2.
We cannot set the credibility coefficients for credible intervals for percentiles arbitrarily, because of the discreteness of the posterior distributions. Approximate 96% credible intervals are [56, 70] for the median and [67, 93] for the 75th percentile. The true values are 70 and 90.

13.6 Dirichlet posterior weighting


We first note a computational point in the use of the Dirichlet posterior weights. If the
number of distinct data points n is very large (as in a large sample with many continuous
covariates), many of the Dirichlet draws which sum to 1 can be very small. This may lead to
the effective omission of these data points, or the inaccurate computation of means or other
weighted functions. To prevent this, we adopt in these analyses the convention of rescaling
the Dirichlet draws to sum to the sample size, rather than to 1. This scaling of the draws gives
them an average of 1, rather than 1/n. The rescaling does not affect the parameter draws,
but it does affect explicit functions of the sample size, like deviances. These are incorrectly scaled by the direct Dirichlets, since they appear to correspond to a sample of 1 instead of n. The rescaling makes such weighted functions comparable with the unweighted functions. We do not comment on this further.
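As a small illustration of this convention, the sketch below rescales one Dirichlet draw to sum to n; the value of n and the responses are hypothetical placeholders. Weighted parameter estimates such as a weighted mean are unchanged by the rescaling.

```python
import numpy as np

rng = np.random.default_rng(5)

n = 500                                   # hypothetical number of distinct data points
w = rng.dirichlet(np.ones(n))             # one Dirichlet draw, sums to 1
w_rescaled = n * w                        # rescaled draws: average weight 1, sum n

# A weighted mean is identical under either scaling, but quantities depending
# explicitly on the total weight (like deviances) now correspond to a sample of
# size n rather than 1.
y = rng.normal(size=n)                    # hypothetical responses
print(np.average(y, weights=w), np.average(y, weights=w_rescaled))
```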

FIGURE 13.5
Cdf of posterior 75th percentile, family income

TABLE 13.1
Posterior mass functions of median and 75th percentile
Y 46 47 52 53 55 56 58 60
median .0001 .0004 .0007 .0100 .0161 .0256 .4367 .2484
75th .0003 .0034
Y 65 67 69 70 71 72 75 77
median .1576 .0465 .0287 .0161 .0066 .0041 .0019 .0005
75th .0194 .0253 .0418 .0686 .0993 .1258 .1456 .1471
Y 80 81 85 93 96 104 107
75th .1296 .0846 .0601 .0328 .0186 .0012 .0002

TABLE 13.2
Posterior cdfs of median and 75th percentile
Y 46 47 52 53 55 56 58 60
median .0001 .0005 .0012 .0112 .0273 .0529 .4896 .7380
75th .0003 .0034
Y 65 67 69 70 71 72 75 77
median .8956 .9421 .9708 .9869 .9935 .9976 .9995 1.000
75th .0194 .0447 .0865 .1551 .2544 .3802 .5258 .6729
Y 80 81 85 93 96 104 107
75th .8025 .8871 .9472 .9800 .9986 .9998 1.000


13.7 The frequentist bootstrap


It may seem surprising that the frequentist bootstrap, which is in very wide use, was not
discussed before the Bayesian bootstrap, which developed from it, and is in little use. This
choice reflects the simplicity of the Bayesian analysis relative to the frequentist analysis.
Efron’s (1974) proposal of the bootstrap was a form of “distribution-free” or “model-
free” inference, initially about a population mean. Given a random sample y1 , . . . , yn from a
population, for which the sample mean ȳ was to estimate the population mean µ, he suggested
resampling the sample values with replacement to generate a large number of resampled
means ȳj , which could provide interval variability statements about the population mean
through their quantiles. This would generally be more appropriate than the usual Gaussian-
based standard error inference for the precision of the sample mean. A separate issue is
that the use of ȳ as the estimator of the population mean eliminates consideration of those
population distributions (like the lognormal and Weibull) for which the sample mean is
not the MLE of the population mean. Inference for the mean µ based on ȳ would then be
inefficient.
The idea was greatly extended, especially to difficult model-comparison problems like the
number of components in a finite mixture, where even asymptotic theory did not provide a
reliable test; resampling from the fitted model could provide an effective comparison.
The basis of the bootstrap can be understood in terms of that of the Bayesian bootstrap. For the latter, the set of possible recorded population values $Y_{I^*}$, $I^* = 1,\ldots,N$ is tabulated by the D distinct ordered values $Y_I$, $I = 1,\ldots,D$, into a set of D population counts $N_I$ on these distinct population values $Y_I$. The total population count is $N = \sum_{I=1}^{D} N_I$, and the population proportions $P_I = N_I/N$ define the population multinomial distribution of Y. A simple random sample of size n from the population can be tabulated correspondingly, and can be expressed through the sample counts $n_i$ and proportions $p_i$ at the d distinct ordered sample values $y_i$.
The frequentist bootstrap iterates this sampling process, with the jth of r additional bootstrap samples of size n, drawn with replacement from the original sample, having sample mean $\bar{y}_j = \sum_{i=1}^{d} n_{ij} y_i / n$, from $n_{ij}$ values of $y_i$. The set of r means $\{\bar{y}_j\}$ is called the bootstrap distribution of the sample mean. It acts as a surrogate for the unobservable repeated sampling distribution. It is used to provide a confidence interval for the population mean, through either the bootstrap standard error, relying on the large-sample Gaussianity of the bootstrap sample means, or the quantiles of the bootstrap distribution. We generally restrict our discussion to the use of the quantiles, to avoid dependence on the CLT.
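A minimal sketch of this procedure, using quantile intervals on a hypothetical sample, is given below; the data values are placeholders for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sample.
sample = np.array([5.1, 6.3, 6.8, 7.0, 7.2, 7.5, 7.9, 8.2, 8.8, 10.4])

r = 10_000
n = len(sample)
# r bootstrap samples of size n, drawn with replacement from the observed sample.
boot = rng.choice(sample, size=(r, n), replace=True)
boot_means = boot.mean(axis=1)

# Quantile (percentile) interval for the population mean, avoiding the CLT.
print("bootstrap 95% interval:", np.quantile(boot_means, [0.025, 0.975]))
```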
We can express the bootstrap explicitly in terms of the multinomial to clarify its relation to the Bayesian bootstrap. Since the sampling is with replacement, the "population" which is being bootstrap resampled is multinomial $M(n, \{p_i\})$. The jth bootstrap sample mean can be rewritten as $\bar{y}_j = \sum_i p_i^{[j]} y_i$, where $p_i^{[j]} = n_{ij}/n$ is a random draw from the multinomial distribution $M(n; \{p_i\})$. The jth Bayesian bootstrap sample mean can be written as $\mu^{[j]} = \sum_i \pi_i^{[j]} y_i$, where $\pi_i^{[j]}$ is a random draw from the posterior Dirichlet distribution $Dir(n; \pi_1, \ldots, \pi_n)$.

The only difference between these draws is the use of the multinomial observed data
proportions pi in the bootstrap, and the Dirichlet posterior proportions πi in the Bayesian
bootstrap. The multinomial probabilities in the bootstrap are constant across draws, while
the Dirichlet probabilities are varying randomly across draws. The variability of the Bayesian
bootstrap mean draws will be greater than that of the bootstrap mean draws, correcting for
the assumption of known proportions pi in the bootstrap samples.

13.7.1 Two-category sample


We gave a very simple example in §6.9.5, of the bootstrap distribution of the sample proportion $\hat{p} = 0.1$ in a sample of ten. In this example we did not need to do the resampling: the bootstrap distribution could be obtained analytically. The bootstrap distribution of $\hat{p}$ was the binomial distribution b(10, 0.1), while the posterior distribution of p with a flat prior was Beta(2, 10). The two distributions are shown in Figure 13.6.
The bootstrap distribution of $\hat{p}$ has very small and discrete support, while the posterior
distribution of p has continuous support on the full (0,1) interval. Both distributions give
highest probability to p = 0.1, and have similar right-tail behaviour. However the value p = 0,
impossible because we have observed one success, has zero probability in the posterior, but
probability 0.349 in the bootstrap distribution. It may seem absurd that we have a large
bootstrap probability of the zero value of p, when we know that it cannot be zero. This
happens because many of the bootstrap samples will have no successes – the one success in
the original sample is not drawn in these bootstrap samples. Over the full set of bootstrap
samples, nearly 35% of them have no success. This is an alarming, if unusual, aspect of the
bootstrap procedure.
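The two distributions can be computed directly, as in the sketch below: the bootstrap pmf is the binomial b(10, 0.1) and the posterior is Beta(2, 10), confirming the probability of about 0.349 for the impossible value p = 0.

```python
import numpy as np
from scipy.stats import binom, beta

n, successes = 10, 1
p_hat = successes / n

# Bootstrap distribution of p-hat: the number of resampled successes is b(10, 0.1),
# so p-hat takes the values k/10 with binomial probabilities.
k = np.arange(n + 1)
boot_pmf = binom.pmf(k, n, p_hat)
print("P(p-hat = 0) =", boot_pmf[0])          # 0.9**10, about 0.349

# Posterior of p with a flat prior: Beta(2, 10), continuous on (0, 1).
grid = np.linspace(0.001, 0.999, 999)
post_pdf = beta.pdf(grid, successes + 1, n - successes + 1)
print("posterior mode near", grid[np.argmax(post_pdf)])   # about 0.1
```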


FIGURE 13.6
Bootstrap (circles) and posterior (solid) distributions of p from one success in ten trials

One might think that this is easily dealt with – we can simply truncate the bootstrap distribution to the positive values in the support – we cut off the value 0. This has two difficulties: there is now no indication that values between 0 and 0.1 are even possible, and the remaining probabilities have to be rescaled to sum to 1, by multiplying by the ratio 1/0.651 = 1.536. The rescaled values do not correspond to the shape of the posterior distribution of p. A characteristic feature of the bootstrap distribution is its discreteness, even when the parameter space is continuous.
We now consider the relevance of the bootstrap distribution to formal inference about the
population mean, when the variable Y has more than two values. Conditional on the initial
sample with counts ni at the distinct values {yi }, the n counts nij in bootstrap sample j
have a multinomial distribution M (n, {qi }), where qi = ni /n is the proportion of the initial
sample at yi . The joint distribution of the sample counts ni and the bootstrap sample counts
nij is given by the marginal/conditional product:
$$\Pr[\{n_i\},\{n_{ij}\} \mid \{p_i\}] = \left\{\frac{n!}{\prod_{i=1}^{d} n_i!}\prod_{i=1}^{d} p_i^{n_i}\right\} \cdot \prod_{j=1}^{r}\left[\frac{n!}{\prod_{i=1}^{d} n_{ij}!}\prod_{i=1}^{d} q_i^{n_{ij}}\right]$$

Inspection of this joint distribution shows that the nij are ancillary for the pi and hence for µ:
their distribution M (n, qi ) is completely known (since the qi are known) and independent of
the pi , and hence of µ. So we cannot improve, in a model-based analysis, on the information
in the original sample. The bootstrap samples do not provide further information about µ:
the information they do provide is misleadingly precise.
We do not discuss the bootstrap further.

13.8 Stratified sampling and weighting


In practical survey design, populations are often stratified for sampling, and differentially
sampled in different strata. Small but important strata may be oversampled relative to larger
strata, to achieve adequate sample sizes in the small strata. If inference is required (as it
usually is) for the strata means and the population mean, the differential sampling rates
have to be accounted for in the population mean inference.
In the design-based approach, weighting of sample estimates by the inverse of the sample
fractions – the sample selection probabilities – is almost universally recommended in sam-
ple designs with unequal selection probabilities. This is generally called inverse probability
weighting, or IPW. It is used not only for means in simple stratified designs, but also for
more complex regression analyses, and multilevel regression analyses as well.
The argument for weighting, as it is applied to inference about the population mean,
is superficially appealing. We denote the S strata proportions in the population of size
N by π1 , . . . , πs , . . . , πS , with corresponding strata population numbers N1 , . . . , NS , where
πs = Ns /N . We denote the sampling rates in the strata sub-populations by p1 , . . . , pS , with
corresponding strata sample sizes n1 , . . . , nS , where ps = ns /Ns .
We write ysj for the jth observation on variable y in stratum s. For the sample fraction
ps (s = 1, . . . S) in stratum s, each of the ns sample members ysj from stratum s “represents”
ws = 1/ps = Ns /ns members of the population in stratum s, and so if we are computing the
estimate of the population mean from the pooled sample observations from all the strata,
it is clear that we should give different weights ws to the observations in each stratum s, to
correct for the differential sampling rates in the strata. To achieve this, the weighted sample
mean, known as the Horvitz-Thompson (HT) estimator (Horvitz and Thompson 1952), is defined as
$$\tilde{\mu} = \frac{\sum_{s=1}^{S}\sum_{j=1}^{n_s} w_s y_{sj}}{\sum_{s=1}^{S}\sum_{j=1}^{n_s} w_s} = \frac{\sum_{s=1}^{S} w_s n_s \bar{y}_s}{\sum_{s=1}^{S} w_s n_s} = \frac{\sum_{s=1}^{S} N_s \bar{y}_s}{\sum_{s=1}^{S} N_s} = \sum_{s=1}^{S} \pi_s \bar{y}_s.$$

So the inverse probability weighting of the individual observations is exactly equivalent to the direct weighting of the strata sample means by their proportions in the population: it
is not necessary to weight the individual observations. In regression analyses with stratified
samples, oversampling of some strata relative to others is easily dealt with by including
the strata as a factor in the regression model. Then if an aggregate estimate over the full
population is required, this is achieved by weighting the strata effects by the strata population
proportions. No actual weighting of the regression analysis is required.
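The equivalence in the display above is easy to check numerically; the sketch below uses hypothetical strata sizes, sampling rates and observations (all placeholder values) and computes the HT mean both ways.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical stratified design: strata population sizes and sample sizes.
N_s = np.array([8000, 1500, 500])          # strata population counts N_s
n_s = np.array([80, 60, 50])               # strata sample sizes (small strata oversampled)
w_s = N_s / n_s                            # inverse selection probabilities 1/p_s

# Hypothetical sample observations by stratum.
samples = [rng.normal(loc=mu, scale=1.0, size=n) for mu, n in zip([10.0, 12.0, 15.0], n_s)]

# IPW (Horvitz-Thompson) mean over the pooled observations.
obs = np.concatenate(samples)
weights = np.repeat(w_s, n_s)
ht_mean = np.sum(weights * obs) / np.sum(weights)

# Equivalent form: strata sample means weighted by population proportions pi_s.
pi_s = N_s / N_s.sum()
strata_means = np.array([s.mean() for s in samples])
print(ht_mean, np.sum(pi_s * strata_means))   # identical up to rounding
```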
14
Model comparison and model averaging

From the model assessment for the phones, we have an uncomfortable situation. Three
distributions appear to be inappropriate – the exponential, lognormal and gamma – while
two others – the Weibull and multinomial – appear to be nearly equally appropriate. The
multinomial distribution is always appropriate – it makes no assumptions. The credible
intervals for the lifetime 80th quantile vary considerably among the distributions we have
considered. How do we express the information we now have about the 80th quantile?
A common frequentist procedure is to choose the model with the highest maximised
likelihood and base conclusions on it, ignoring the others, as though they had not been
investigated. If the maximised likelihoods for two of the models are very close, this will
clearly be unsatisfactory, especially if the 80th quantile conclusions from the two models are
different.
We develop recent Bayesian approaches to this problem: detailed discussions were given
in Aitkin, Liu and Chadwick (2009) and Aitkin (2010). We build up the general procedure
from simpler models.

14.1 Comparison of two fully specified models


We are given the random sample of data in Table 14.1 (from Cox 1961), drawn from either
a Poisson distribution with mean 1, or a geometric distribution with mean 1. How do we
determine the strength of the data evidence for each model?
The probability distributions are, at data values yi , and given mean µ,
• Poisson: $P[Y = y_i \mid \mu] = e^{-\mu}\mu^{y_i}/y_i!$
• geometric: $G[Y = y_i \mid \mu] = \mu^{y_i}/(\mu+1)^{y_i+1}$.
The corresponding likelihoods P(µ) and G(µ) are
$$P(\mu) = \prod_i \left(\frac{e^{-\mu}\mu^{y_i}}{y_i!}\right)^{n_i} = \frac{e^{-n\mu}\mu^{T}}{P},$$
$$G(\mu) = \prod_i \left[\left(\frac{\mu}{1+\mu}\right)^{y_i}\frac{1}{1+\mu}\right]^{n_i} = \frac{\mu^{T}}{(1+\mu)^{T+n}},$$
where $n = \sum_i n_i = 30$, $T = \sum_i n_i y_i = 26$ and $P = \prod_i (y_i!)^{n_i} = 384$. The likelihoods (Poisson solid and geometric dashed) are shown on the same scale in Figure 14.1.

TABLE 14.1
Counts from Cox (1961)
i     1   2   3   4   5
yi    0   1   2   3   >3
ni   12  11   6   1   0

FIGURE 14.1
Poisson (solid) and geometric (dashed) likelihoods, Cox data
The likelihood ratio at µ = 1 is 17.56 in favour of the Poisson. If the prior probabilities
of the Poisson and geometric are equal, then the posterior probability of the Poisson is
17.56/18.56 = 0.946. That is quite strong evidence for the Poisson.
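The sketch below computes these quantities directly from the counts in Table 14.1 and the two log-likelihoods; it should reproduce the likelihood ratio 17.56 at µ = 1 (and 20.13 at the common MLE discussed next) and the corresponding posterior probabilities.

```python
import numpy as np

# Counts from Table 14.1.
y = np.array([0, 1, 2, 3])
counts = np.array([12, 11, 6, 1])

n = counts.sum()                               # 30
T = (counts * y).sum()                         # 26
logP = (counts * np.log([1, 1, 2, 6])).sum()   # log of prod (y_i!)^n_i = log 384

def log_lik_poisson(mu):
    return -n * mu + T * np.log(mu) - logP

def log_lik_geometric(mu):
    return T * np.log(mu) - (T + n) * np.log(1 + mu)

for mu in (1.0, T / n):
    lr = np.exp(log_lik_poisson(mu) - log_lik_geometric(mu))
    print(f"mu = {mu:.3f}: LR = {lr:.2f}, posterior P(Poisson) = {lr / (1 + lr):.3f}")
```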
However, Cox’s problem was more difficult: µ is unknown. As mentioned earlier, a stan-
dard form of frequentist analysis would replace the unknown µ by its MLE under each model.
In fact $\hat{\mu} = \bar{y} = T/n = 0.867$ under both models, and the likelihood ratio at this value of µ is
the maximum possible: 20.13. If that were the specified value of µ, the posterior probability
of the Poisson would be even higher: 20.13/21.13 = 0.953.
But the true value is uncertain, and smaller values of the likelihood ratio, and the cor-
responding posterior Poisson probability, are consistent with the sample data. We need to
analyse this problem in a principled Bayesian way:

Any function of model parameters and observed data has a posterior distribution
which can be obtained from that of the model parameters, since the data values
after their observation are known numbers.

We have seen this with the credible region for the population cdf. We now discuss the general
problem.

14.2 General model comparison


Bayesian and frequentist model comparisons are discussed at great length in Aitkin (2010,
Chapter 2). Here we give a concise treatment.

14.2.1 Known parameters


If we have two models f1 (y | θ1 ) and f2 (y | θ2 ) with fully known parameters θ1 and θ2 ,
Bayesians and frequentists agree that the appropriate measure of relative data evidence for
model 1 over model 2, based on random sample data y, is the likelihood ratio between the
models: L1 /L2 , where Lj = fj (y | θj ), j = 1, 2. A Bayesian analysis would complement this
ratio with the prior information, through the ratio of prior model probabilities π1 /π2 . If
there were no prior preference for one model over the other, the prior probabilities would be
equal, and the ratio of posterior model probabilities would be equal to the likelihood ratio.
If there are K > 2 models with known parameters, likelihoods Lk from data y and prior
probabilities πk , the posterior probability of model k is
$$\pi_{k|y} = \pi_k L_k \Big/ \sum_{\ell=1}^{K} \pi_\ell L_\ell.$$

14.2.2 Unknown parameters


Bayesians and frequentists differ in the treatment of unknown parameters.
• Bayesians are themselves divided on whether such model comparisons make sense. If they
do, they generally use the Bayes factor, the ratio of integrated likelihoods, integrated with
respect to the priors on each parameter, which must be proper – they cannot be diffuse
and improper. The Bayes factor is treated as though it was a likelihood ratio between
models with known parameters:
$$BF_{k,\ell} = \int L_k(\theta_k)\pi_k(\theta_k)\,d\theta_k \Big/ \int L_\ell(\theta_\ell)\pi_\ell(\theta_\ell)\,d\theta_\ell.$$

Combined with the ratio of prior model probabilities it gives an analogue of the ratio of
posterior model probabilities.
• Frequentists use the MLE of the parameters under each model, giving the maximised
likelihoods. These are used in different ways for nested and non-nested models:
– for nested models, in which model 1 is a special case of model 2 under a null hypothesis, the maximised likelihoods are used in the likelihood ratio test, with the test statistic $-2\log[L_1(\hat\theta_1)/L_2(\hat\theta_2)]$, which has under the null hypothesis an asymptotic $\chi^2$ distribution with degrees of freedom equal to the difference in the number of parameters between the models.
– for non-nested models, the values of the frequentist deviance −2 log Lmax are pe-
nalised by a function of the number of model parameters to provide a decision
criterion: the model with the smaller value of the criterion is the “best”.
The frequentist likelihood ratio test approach is not able to deal with non-nested models,
like the comparison of Weibull, gamma and lognormal models for the phone data. The

penalised maximised likelihood approach does not give a measure of strength of evidence for
each model, but a decision criterion for the “best” model. It does not provide a procedure
for the joint use of multiple well-supported models. The Bayes factor integrated likelihood
approach reduces the likelihood function to a single number – a one-point average summary
of a parametric function:
$$\int L_k(\theta_k)\pi_k(\theta_k)\,d\theta_k = L_k(\tilde{\theta}_k)$$

for some $\tilde{\theta}_k$ which depends on the prior $\pi_k(\theta_k)$. This is analogous to the frequentist use of the MLEs $\hat{\theta}_j$ as the appropriate values of the $\theta_j$, though $\tilde{\theta}_k$ is implicit and is known only
after the likelihood has been integrated. A detailed discussion of other Bayesian approaches
to model comparison is given in Ando (2010).

14.3 Posterior distribution of the likelihood


The principled Bayesian analysis is through the posterior distribution of each model’s like-
lihood, illustrated here with the Poisson/geometric comparison. We generate the posterior
distribution of µ under each model (here using a flat prior for µ). The posterior distribution
of µ is Gamma(T + 1, n) for the Poisson model. The unfamiliar form of the geometric likelihood can be put in a familiar form by a parameter transformation. Write $p = \mu/(1+\mu)$; then $G(p) = p^{T}(1-p)^{n}$, a binomial likelihood, and the derivative of the transformation is $d\mu/dp = (1-p)^{-2}$, so p has an improper Beta(1, −1) prior distribution, and a proper Beta(T + 1, n − 1) posterior distribution.
We make M random draws $\mu_P^{[m]}$ and $p_G^{[m]}$ from each posterior using different random seeds, and substitute these values into the likelihood for each model to give M draws $P(\mu_P^{[m]})$ and $G(p_G^{[m]})$. These draws define the posterior distribution of the likelihood under each model. We generate the posterior distribution of the likelihood ratio for Poisson to geometric by forming the M paired simulated likelihood ratios $LR^{[m]} = P(\mu_P^{[m]})/G(p_G^{[m]})$.
A general numerical problem with likelihood functions is their frequently very small values, and the larger the data set, the smaller the likelihood. Likelihoods are frequently of the order of $10^{-30}$ or smaller. Dividing one such number by another may cause numerical underflow, and fail or give zero. This is easily avoided by working with the log-likelihood.
Instead of substituting the random parameter draws into the likelihoods, we substitute them into the log-likelihoods:
$$\log P(\mu) = -n\mu + T\log(\mu) - \log(P),$$
$$\log G(p) = T\log(p) + n\log(1-p),$$
to give M draws $\log P(\mu_P^{[m]})$ and $\log G(p_G^{[m]})$. We form the M paired simulated log-likelihood differences $\log LR^{[m]} = \log P(\mu_P^{[m]}) - \log G(p_G^{[m]})$. Then exponentiating these differences gives the posterior distribution of the likelihood ratio.
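A minimal sketch of this simulation for the Cox data is given below; it should approximately reproduce the median log-likelihood difference and the credible intervals reported next, up to simulation variation.

```python
import numpy as np

rng = np.random.default_rng(14)

n, T, logP = 30, 26, np.log(384.0)
M = 10_000

# Posterior draws under flat priors: mu ~ Gamma(T+1, rate n); p ~ Beta(T+1, n-1).
mu_P = rng.gamma(shape=T + 1, scale=1.0 / n, size=M)
p_G = rng.beta(T + 1, n - 1, size=M)

logL_P = -n * mu_P + T * np.log(mu_P) - logP
logL_G = T * np.log(p_G) + n * np.log(1 - p_G)

diff = logL_P - logL_G                      # log likelihood ratio, Poisson to geometric
print("median log-LR:", np.median(diff))
print("95% credible interval for the LR:", np.exp(np.quantile(diff, [0.025, 0.975])))

post_prob = 1 / (1 + np.exp(-diff))         # posterior probability of the Poisson
print("median posterior P(Poisson):", np.median(post_prob))
```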
Figure 14.2 shows the 10,000 draws of both log-likelihoods, using dot-point characters.
The Poisson is the higher in the centre. Figure 14.3 shows the 10,000 draws of the geometric-
Poisson log-likelihood differences.
The median log-likelihood difference is 3.010, and the 95% central credible interval is [0.723, 5.258]. Exponentiating and converting to the Poisson posterior probability scale, the median posterior probability of the Poisson is 0.953 (the value at the MLE of µ), and the 95% credible interval is [0.673, 0.995]. As expected, the median is close to the MLE, but the credible interval shows that much smaller values are quite plausible. Reporting only a point estimate (as always) fails to account for the uncertainty in the small sample of 30.

FIGURE 14.2
Poisson (upper) and geometric (lower) log-likelihoods

FIGURE 14.3
Geometric-Poisson log-likelihood difference
The log-likelihood is widely used in applications in all fields. Recently the deviance has
become the common tool of inference, in both frequentist and Bayesian analyses.

14.4 The deviance


The term deviance was introduced by Nelder and Wedderburn (1972) in the formulation and analysis of generalised linear models, discussed in Chapter 16. In this formulation there were, confusingly, two definitions of the deviance of a model with likelihood L(θ). If the model was Gaussian, the deviance was the residual sum of squares from the fitted model. If it was not Gaussian, the deviance was defined to be $-2\log L(\hat\theta) + 2\log L_{sat}$, where $L_{sat}$ was the likelihood for the saturated model, which had a different parameter for every observation. The difference in definitions came from the wish to use the deviance as a goodness-of-fit test of the fitted model, with an asymptotic $\chi^2_{n-p}$ "null" distribution if the model was correctly specified, with n the sample size and p the number of model parameters. It was later realised that this asymptotic distribution was not valid for non-Gaussian models, but it was valid (asymptotically) for comparing nested models, with the null model being the smaller one.
Even more confusingly, Bayesian analysts later used the same term deviance, but this referred to $-2\log L(\theta)$, a function of the parameter θ and the data. We will try to avoid this confusion by calling $-2\log L(\hat\theta)$ the frequentist deviance, and $-2\log L(\theta)$ the Bayesian deviance. The term "deviance" without a qualifier will mean the Bayesian deviance. The frequentist deviance is a number, a function of the MLE $\hat\theta$, and the Bayesian deviance is a function of the model parameter θ.
So for the Cox data the Poisson Bayesian deviance is
$$D_P(\mu_P) = 2[n\mu_P - T\log\mu_P + \log P] = 2[30\mu_P - 26\log\mu_P + 5.95],$$
and the frequentist deviance is
$$D_P(\hat\mu_P) = 2[n\hat\mu_P - T\log\hat\mu_P + \log P] = 2[26 - 26\times(-0.143) + 5.95] = 2\times 35.67 = 71.34.$$
The geometric Bayesian deviance is
$$D_G(\mu_G) = 2[(T+n)\log(1+\mu_G) - T\log\mu_G] = 2[56\log(1+\mu_G) - 26\log\mu_G],$$
and the frequentist deviance is
$$D_G(\hat\mu_G) = 2[(T+n)\log(1+\hat\mu_G) - T\log\hat\mu_G] = 2[56\times 0.624 - 26\times(-0.143)] = 2\times 38.67 = 77.34.$$
The difference in the frequentist deviances is 6.00, which is exactly the value of the
frequentist likelihood ratio test statistic: twice the difference in maximised log-likelihoods.
However in this case of non-nested models, there is no asymptotic sampling distribution for
this ratio.
The cdfs of M = 10, 000 draws from the posterior distributions of the Poisson (solid) and
geometric (dot-dashed) Bayesian deviances are shown in Figure 14.4. The slight blurriness
in the curves is explained in the following. The cdf of the Bayesian deviance differences
(geometric-Poisson) is shown in Figure 14.5. The difference graph is identical to that of the
log-likelihood except for the scale change (factor of 2) in the horizontal axis.


FIGURE 14.4
Poisson (left, solid) and geometric (right, dashed) deviance distributions, Cox data


FIGURE 14.5
Deviance difference distribution, geometric minus Poisson, Cox data

14.5 Asymptotic distribution of the deviance


A very important result follows from the asymptotic pivotal property of the likelihood ratio.
For regular models f (y | θ), with maximum over θ internal to the parameter space and flat
priors, using the second-order Taylor expansion of the Bayesian deviance about the MLE $\hat\theta$ gives:
$$
\begin{aligned}
-2\log L(\theta) &\doteq -2\log L(\hat\theta) - 2(\theta-\hat\theta)'\ell'(\hat\theta) - (\theta-\hat\theta)'\ell''(\hat\theta)(\theta-\hat\theta) \\
&= -2\log L(\hat\theta) + (\theta-\hat\theta)' I(\hat\theta)(\theta-\hat\theta) \\
L(\theta) &\doteq L(\hat\theta)\cdot\exp[-(\theta-\hat\theta)' I(\hat\theta)(\theta-\hat\theta)/2] \\
\pi(\theta \mid y) &\doteq c\cdot\exp[-(\theta-\hat\theta)' I(\hat\theta)(\theta-\hat\theta)/2].
\end{aligned}
$$
So asymptotically (and loosely in the first result)
$$
\begin{aligned}
\theta - \hat\theta \mid y &\sim N(0, I(\hat\theta)^{-1}), \\
(\theta-\hat\theta)' I(\hat\theta)(\theta-\hat\theta) \mid y &\sim \chi^2_p, \\
-2\log\frac{L(\theta)}{L(\hat\theta)} \mid y &\sim \chi^2_p,
\end{aligned}
$$
where p is the dimension of θ.


What is remarkable about these results is that they are all pivotal functions. The reversal of the frequentist and Bayesian distributions for $\hat\theta$ and θ is paralleled in the quadratic form and likelihood ratio results. So the Bayesian deviance $-2\log L(\theta)$ has an asymptotic shifted $\chi^2_p$ distribution, shifted by the frequentist deviance $-2\log L(\hat\theta)$. This is a simple and remarkable result: for large samples from regular models with non-informative priors, the posterior distribution of the likelihood, or of the deviance, depends on the data only through the maximised likelihood and the number of model parameters. So in large samples, the comparison of non-nested regular models depends only on their maximised likelihood ratio, or their frequentist deviance difference.
The calibration of the Poisson-geometric deviance difference depends on the distribution of the difference between two independent $\chi^2$ variables, which has no simple exact form, but is very easily simulated. Further, the agreement between the asymptotic and the empirical distributions is very easily assessed, by simply plotting the empirical and asymptotic cdfs in the same graph. The blurriness we noted in Figure 14.4 is caused by graphing together the empirical and the asymptotic distributions for each model. The asymptotic distribution is almost exact for the Poisson model, and is very close for the geometric. This contrasts strongly with the asymptotic repeated-sampling $\chi^2$ distribution of the frequentist likelihood ratio test statistic for nested models, for which there is no validation possible from the observed data.
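The sketch below carries out this check for the Poisson model in the Cox example: the posterior draws of the Bayesian deviance are compared with the frequentist deviance (about 71.34) plus a $\chi^2_1$ draw, the asymptotic approximation described above.

```python
import numpy as np

rng = np.random.default_rng(21)

n, T, logP = 30, 26, np.log(384.0)
M = 10_000

# Posterior draws of the Poisson Bayesian deviance.
mu = rng.gamma(shape=T + 1, scale=1.0 / n, size=M)
dev = 2 * (n * mu - T * np.log(mu) + logP)

# Asymptotic approximation: frequentist deviance + chi-square with p = 1 d.f.
mu_hat = T / n
dev_freq = 2 * (n * mu_hat - T * np.log(mu_hat) + logP)     # about 71.34
dev_asym = dev_freq + rng.chisquare(1, size=M)

# Compare a few quantiles of the two distributions.
for q in (0.25, 0.5, 0.75, 0.9):
    print(q, np.quantile(dev, q), np.quantile(dev_asym, q))
```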
Our experience with heavily parametrised models (for example the Gaussian mixture
distributions in Chapter 15) shows that with increasing parameter dimension, the achieved
maximum of the likelihood in 10,000 draws from the posterior distribution of the likelihood is
increasingly far below the analytic maximum, because the effective sample size per parameter
decreases. However we do not need to rely on the asymptotic distribution, as the empirical
distribution provides all the necessary information, subject only to the sampling variation
in proportions computed from a simulation sample size of 10,000.

14.6 Nested models


For nested models, the principled Bayesian analysis is simpler, since there is only one set of
model parameters. We illustrate with the Gaussian model yi ∼ N (µ, σ 2 ), i = 1, . . . , n. The
likelihood is
$$
\begin{aligned}
L(\mu,\sigma) &= \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(y_i-\mu)^2}{2\sigma^2}\right] \\
&= \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(y_i-\bar{y}+\bar{y}-\mu)^2}{2\sigma^2}\right] \\
&= \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n}\exp\left[-\frac{\sum_{i=1}^{n}(y_i-\bar{y})^2 + n(\bar{y}-\mu)^2}{2\sigma^2}\right] \\
&= c\cdot\frac{\sqrt{n}}{\sigma}\exp\left[-\frac{n(\bar{y}-\mu)^2}{2\sigma^2}\right]\cdot\frac{1}{\sigma^{n-1}}\exp\left[-\frac{\sum_{i=1}^{n}(y_i-\bar{y})^2}{2\sigma^2}\right],
\end{aligned}
$$
where c is a known constant, involving the data but not σ or µ. The likelihood depends on only two functions of the data: the sample mean $\bar{y}$ and the residual sum of squares $RSS = \sum_{i=1}^{n}(y_i-\bar{y})^2$.
A null hypothesis H0 specifies µ = µ0 , the alternative hypothesis H1 does not specify µ.
The SD σ is unspecified under both hypotheses; we assume it is not affected by the different
hypotheses. The likelihood ratio, and Bayesian deviance difference, of the null model to the
alternative model, with the standard priors for µ given σ and σ, are
$$
\begin{aligned}
\frac{L(\mu_0,\sigma)}{L(\mu,\sigma)} &= \frac{\frac{\sqrt{n}}{\sigma}\exp\left[-\frac{n(\bar{y}-\mu_0)^2}{2\sigma^2}\right]\cdot\frac{1}{\sigma^{n-1}}\exp\left[-\frac{\sum_{i=1}^{n}(y_i-\bar{y})^2}{2\sigma^2}\right]}{\frac{\sqrt{n}}{\sigma}\exp\left[-\frac{n(\bar{y}-\mu)^2}{2\sigma^2}\right]\cdot\frac{1}{\sigma^{n-1}}\exp\left[-\frac{\sum_{i=1}^{n}(y_i-\bar{y})^2}{2\sigma^2}\right]} \\
&= \exp\left[\frac{-n(\bar{y}-\mu_0)^2 + n(\bar{y}-\mu)^2}{2\sigma^2}\right] \\
-2\log\frac{L(\mu_0,\sigma)}{L(\mu,\sigma)} &= \frac{n(\bar{y}-\mu_0)^2 - n(\bar{y}-\mu)^2}{\sigma^2} \\
&= \frac{n(\bar{y}-\mu_0)^2}{s^2}\cdot\frac{s^2}{\sigma^2} - \frac{n(\bar{y}-\mu)^2}{\sigma^2} \\
&\sim t^2\cdot\chi^2_{n-1}/(n-1) - \chi^2_1,
\end{aligned}
$$
where t is the frequentist test statistic for the null hypothesis. The difference between the two independent scaled $\chi^2$ terms has no exact distribution, but is easily simulated. The (posterior) mean of the deviance difference is $t^2 - 1$, and the posterior variance is $2t^4/(n-1) + 2$.
Aitkin (2010, p. 71) gave an example: a sample of ten gives a sample mean of ȳ = 5 and
standard deviation s = 8.74. The null hypothesis of µ = µ0 = 0 gives a frequentist t-statistic
of 1.809, with a p-value of 0.104. The evidence against the null hypothesis is not convincing.
Figure 14.6 shows the posterior cdf of 10,000 draws of the deviance difference. The 95%
central credible interval for the true deviance difference is [−2.258, 6.266]: the difference is
poorly defined in this very small sample. The equivalent interval for the likelihood ratio of
null to alternative is [0.044, 3.09], and that for the posterior probability of the null hypothesis
is [0.042, 0.756]. The sample data are consistent with a wide interval of values for µ, and
hence for the likelihood ratio and posterior probability of the null hypothesis. The Bayesian
analysis is much more informative than the frequentist p-value.
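A minimal sketch of this simulation for the example is given below; it should approximately reproduce the reported credible intervals for the deviance difference, the likelihood ratio and the posterior probability of the null hypothesis, up to simulation variation.

```python
import numpy as np

rng = np.random.default_rng(22)

n, ybar, s, mu0 = 10, 5.0, 8.74, 0.0
t = np.sqrt(n) * (ybar - mu0) / s                       # about 1.809
M = 100_000

# Deviance difference (null minus alternative): t^2 chi2_{n-1}/(n-1) - chi2_1.
dev_diff = t**2 * rng.chisquare(n - 1, size=M) / (n - 1) - rng.chisquare(1, size=M)
print("95% interval for the deviance difference:",
      np.quantile(dev_diff, [0.025, 0.975]))

# Likelihood ratio of null to alternative and posterior probability of the null
# (with equal prior model probabilities): LR = exp(-deviance difference / 2).
lr = np.exp(-dev_diff / 2)
print("95% interval for the LR:", np.quantile(lr, [0.025, 0.975]))
print("95% interval for P(null):", np.quantile(lr / (1 + lr), [0.025, 0.975]))
```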

FIGURE 14.6
Deviance difference distribution, null minus alternative

14.7 Model choice and model averaging


The aim of the analysis of the phone data was to make a statement about the 80th quantile
of the lifetime distribution. The empirical 80th quantile of the observed lifetimes is between
340 and 352 hours. We have five models which give different posterior information about the
80th quantile: the medians and 95% credible intervals for it are
gamma: 325, [276, 390]
Weibull: 330, [283, 380]
multinomial: 340, [273, 414]
exponential: 344, [281, 421]
lognormal: 411, [294, 447]

How do we – or do we – choose among, or combine, these posteriors? The multinomial makes no assumptions, and for the moment we set it aside. Our diagnostics for the parametric
models suggested that the exponential, lognormal and gamma are not suitable, but the
Weibull is suitable. The relative evidence for each model is based on its posterior distribution
of the model deviance, as described earlier for the Poisson-geometric comparison. Figure 14.7
shows these posterior distributions.
The different shape of the exponential is due to its single parameter, giving an (asymptotic) $\chi^2_1$ distribution. The other three deviance distributions are nearly parallel and like $\chi^2_2$ distributions, and are slightly bumpy because of the grid of 100 for each parameter, while the exponential's one parameter has a grid of 10,000.
It is clear that the Weibull and gamma are almost equally well supported, while the
exponential and lognormal are not supported. The median difference in deviance between
the first two and the exponential is 18 deviance units. It is almost impossible for a random draw of the exponential deviance to be smaller than a random draw of the Weibull or gamma. The differences from the lognormal are even greater.

FIGURE 14.7
Exponential (red), Weibull (black), gamma (orange) and lognormal (green) deviance distributions, phones data
The Weibull and gamma are the only parametric models which need to be considered,
and their conclusions about the 80th quantile are very close, despite the slightly different
shapes of their density functions and corresponding cdfs. It is a matter for the researcher
whether the Weibull or the gamma is preferred. The analytic properties of the Weibull quantiles (for five-year survivals) make it generally preferable in medical or life-testing applications.
What about the multinomial? The median 80th quantile for the multinomial is close to
the medians for the Weibull and gamma, but the credible interval is more conservative –
longer tails in both directions. The increased precision of the parametric models comes at
the cost of the model assumptions.
A technical difficulty with the comparison of the multinomial deviance with that of any
continuous-parameter model is that the likelihoods are not comparable, because the support
for the multinomial is at a discrete set of probabilities, while the continuous distributions
have continuous support for their parameters. We do not discuss this further. The importance
of the multinomial is to compare its credible interval with the credible intervals for the
parametric models.
We do not need the formal model-averaging procedure here, but state it below; a small sketch of the procedure in code follows the list. To construct M draws of the averaged 80th quantile, we need to follow M times for each m this sequence:
• draw at random a value of the Bayesian deviance $D_j^{[m]}$ from each model j;
• convert the set of deviances to a set of model probabilities $p_j^{[m]}$ through Bayes's theorem;
• draw at random a model $M_j^{[m]}$ using the model probabilities $p_j^{[m]}$;
• for the selected model $M_j^{[m]}$, make one random draw $y_{j,80}^{[m]}$ from the posterior distribution of the 80th quantile of this given model.
The set of random draws $y_{j,80}^{[m]}$ defines the posterior distribution of the averaged 80th quantile.
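The sketch below follows the four steps above; the arrays deviance_draws and quantile_draws are hypothetical placeholders standing in for the posterior draws that would come from actual model fits (the numbers are illustrative only, not the phones analysis), and equal prior model probabilities are assumed.

```python
import numpy as np

rng = np.random.default_rng(33)

# Hypothetical inputs: for each of J models, M posterior draws of the Bayesian
# deviance and M posterior draws of the 80th quantile (placeholder values).
M, J = 10_000, 2
deviance_draws = np.column_stack([rng.chisquare(2, M) + 1112,
                                  rng.chisquare(2, M) + 1113])
quantile_draws = np.column_stack([rng.normal(330, 25, M),
                                  rng.normal(325, 29, M)])

averaged = np.empty(M)
for m in range(M):
    D = deviance_draws[m]                       # one deviance draw per model
    w = np.exp(-(D - D.min()) / 2)              # likelihoods L_j = exp(-D_j/2), stabilised
    prob = w / w.sum()                          # model probabilities (equal priors)
    j = rng.choice(J, p=prob)                   # draw a model
    averaged[m] = quantile_draws[m, j]          # one quantile draw from that model

print("averaged 80th quantile, median and 95% interval:",
      np.median(averaged), np.quantile(averaged, [0.025, 0.975]))
```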
15
Gaussian linear regression models

15.1 Simple linear regression


The term “simple” refers to models with a single covariate. The term “linear” does not refer
to the degree of the relationship between a response variable y and a covariate x, but to
the linearity of the mean function in the model parameters. A Gaussian regression model
quadratic in the x variable is still linear in the regression parameters. The term “regression”
is somewhat misleading. It is used in the developmental sense, of “returning” or “becoming
smaller”. Its origin is in genetics:

The concept of regression comes from genetics and was popularized by Sir Fran-
cis Galton during the late 19th century with the publication of Regression towards
mediocrity in hereditary stature (Galton 1886). Galton observed that extreme char-
acteristics (e.g., height) in parents are not passed on completely to their offspring.
Rather, the characteristics in the offspring regress towards a mediocre point (a point
which has since been identified as the mean). By measuring the heights of hundreds
of people, he was able to quantify regression to the mean, and estimate the size of
the effect.
(Wikipedia)

We begin with a medical example.

15.1.1 Vitamin K
Vitamin K was discovered in the 1930s to have an effect on blood clotting: a lack of Vitamin
K in the diet could lead to very slow blood clotting of patients with disease or injury involving
haemorrhaging. The identification of Vitamin K was established by the Danish scientist Henrik
Dam, working with Schønheyder and other colleagues in Denmark, and E.A. Doisy and his
colleagues in the USA. It could be extracted from dried liver. Dam and Doisy shared the
Nobel Prize for Physiology or Medicine in 1943 for this discovery. The calibration of the effect
of Vitamin K on blood clotting was established in an experiment by Schønheyder (1936) in
a study of chickens.
Fifteen chickens were deprived of Vitamin K and then fed dried liver for three days at
a (varying) dose of x′ mg per gram weight of chick per day. At the end of this period, the
response of each chicken was measured as the concentration y ′ of a clotting agent needed
to clot samples of its blood in three minutes. The data from the 15 chickens are given in
Table 15.1, and are graphed in Figure 15.1.
There is a rapid drop of concentration with increasing dose, which then flattens out. This
decline appears more rapid than an exponential decay.
Modelling a sharp curve is not straightforward. A property of the data – the ra-
tio of largest to smallest on each variable – suggests an alternative. Ratios over 10 for



TABLE 15.1
Concentration y′ and dose x′
x′   1.6  2.2  2.8  3.0  3.7  4.4  4.8
y′   500  162  178  136   78   47   62
x′   6.1  6.8  7.2  8.2  9.9 10.2 11.3 14.8
y′    39   40   21   19   12   10    9    8


FIGURE 15.1
Concentration vs dose, Vitamin K

positive variables suggest a scale transformation to a logarithmic scale. This ratio is 62.5
for concentration, and 9.25 for dose. Both variables are positive, and do not have zero
values. Figure 15.2 shows the data on (natural) log scales for both variables, defining
y = log(concentration), x = log(dose). The plot is now nearly linear: the curvature has
disappeared.
We model the relationship between the transformed variables by a simple linear regression
model. This can be expressed in several different ways. The most common is algebraic:

y = α + βx + ϵ.

This model structure can be interpreted as a fixed part – the α + βx – and a random part –
the ϵ. In the fixed part, α is the intercept parameter and β is the slope parameter or regression
coefficient. The ϵ is a random variability term for departures of the observations from the
straight line defined by the fixed part of the model. This is sometimes called an “error” term,
though there is no error in recording the data. The fixed and random parts of the model are
called in engineering and communications the “signal” and the “noise”.
How do we interpret the model? The covariate (sometimes called “explanatory vari-
able”) – liver dose – is determined by the experimenter, and at a given dose x, if there
were no “error”, the concentration response y would be the linear function α + βx of x.
Concentration would be determined or caused by dose. The ϵ term represents random “noise”


FIGURE 15.2
Log concentration vs log dose, Vitamin K

or variation due to differences in the individual chickens which had not been, or could not
be, controlled by the experimenter.
We need to estimate the parameters α and β. Without a probability distribution for ϵ,
we have no optimal principle for estimation. Many analysts use “ordinary least squares” –
OLS – a principle due to Gauss and Markov, which chooses the estimates α̃ and β̃ to
minimise the sum of the squared deviations of y from the “fitted” linear function α̃ + β̃x.
The same estimates are obtained as maximum likelihood estimates if we assume a Gaussian
distribution with zero mean and unknown variance σ 2 for the random terms ϵ, as we show in
the following. If this distribution is not Gaussian, then the best estimates will not be those
from OLS.
Figure 15.3 shows the fitted line with the transformed data, and Figure 15.4 shows the
fitted values from the linear model reverse-transformed back to the original scales. It might
appear that the fitted model on the original scale “misses” the first value by a large amount,
but the transformed scale shows that this difference is neither large nor unusual on that scale.
The slightly “jerky” appearance of the fitted model is a consequence of the graph procedure
which uses straight-line segments between adjacent fitted values. This can be smoothed by
computing fitted values explicitly over a fine x grid and graphing them againt the grid values,
giving a smooth curve.

15.2 Model assessment through residual examination


Assessing the model probability assumption is more difficult in regression models, because
it is not the observed response variable y which has the homogeneous Gaussian distribution
(its mean changes with x), but the unobserved random variation terms ϵ.


FIGURE 15.3
Log concentration vs log dose and ML fitted linear model, Vitamin K


FIGURE 15.4
Concentration vs dose and ML fitted model, Vitamin K


FIGURE 15.5
Vitamin K residuals

The standard frequentist assessment of model departures from the data is based on the
residuals. We will refer to these traditional data features as the frequentist residuals: as we
will see, the Bayesian analysis uses a different definition, which we will call the Bayesian
residuals. The frequentist residual $e_i$ for the ith observation is defined by $e_i = y_i - \hat\alpha - \hat\beta x_i$, where $\hat\alpha$ and $\hat\beta$ are the maximum likelihood estimates. It is easily shown that the frequentist
residuals are uncorrelated with the covariate x: the linear dependence has been accounted
for by the regression model fitting.
A traditional frequentist approach is to treat the frequentist residuals ei as though they
were the unobserved ϵi , to plot them against x to see if any structure remains in the residuals,
and to plot the Gaussian quantiles of the empirical cdf of the residuals against them, to assess
departures from the Gaussian distribution.
This is not quite correct, since the frequentist residuals are correlated (they sum to zero)
and have different variances as well. Figure 15.5 shows the frequentist residuals (from the
log-transformed model) plotted against x, and Figure 15.6 shows their quantile plot. The
residuals show a random pattern against log dose, expected since they are uncorrelated with
the covariate, and their sample quantiles are nearly linear, supporting a Gaussian distribution
for the ϵi . There is no evidence casting doubt on the linearity of the model or the Gaussian
distribution assumption for the ϵi , though in a sample of 15 even large departures of the
frequentist residuals from the Gaussian straight line might not be persuasive. We formalise
the Bayesian residuals in a later section.

15.3 Likelihood for the simple linear regression model


The likelihood function for the Gaussian linear regression model is very similar to that for
the single-sample Gaussian model, although the structure of the data is quite different. We


FIGURE 15.6
Vitamin K residual quantiles

have a sample of size n from an experiment in which each response yi has a corresponding
covariate xi related to yi through the linear regression model. The “error” terms ϵi are
random values from the Gaussian distribution N (0, σ 2 ), independent of the xi . So the model
can be written alternatively as

yi | xi ∼ N (µi , σ 2 ),
µi = α + βxi .

Here the symbol | is a conditioning sign, read as “given” – the response yi is conditioned on
xi explicitly through the model. This representation will be used repeatedly in the book. It
represents the separation of the “random” part of the model – the probability distribution –
from the “fixed” or “structural” part of the model: the relation between the probability
model parameters and the covariates.
The likelihood for the model parameters α, β and σ is

$$
\begin{aligned}
L(\alpha,\beta,\sigma) &= \Pr[y_1,\ldots,y_n \mid x_1,\ldots,x_n,\alpha,\beta,\sigma] \\
&= \prod_{i=1}^{n} f(y_i \mid \alpha,\beta,\sigma)\cdot\delta \\
&= \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\frac{(y_i-\mu_i)^2}{\sigma^2}\right]\delta \\
&= \left(\frac{\delta}{\sqrt{2\pi}}\right)^{n}\cdot\frac{1}{\sigma^{n}}\exp\left[-\frac{1}{2}\frac{\sum_{i=1}^{n}(y_i-\mu_i)^2}{\sigma^2}\right].
\end{aligned}
$$

Here δ is the measurement precision with which y is measured. The likelihood is based
on measurement of y with high precision, so the discrete probability of each yi can be
approximated accurately by the Pn density ordinate rectangle with width δ.
The sum of squares term i=1 (yi − µi )2 in the likelihood appears in some form in every
Gaussian model, and has a standard re-expression. We introduce vector notation for the
parameters and the covariates. We write θ = (α, β)′ for the column vector of the parameters,
and xi = (1, xi )′ for the column vector of the covariates for the ith data value (treating 1 as
a constant covariate). Then the model can be written as

yi | xi ∼ N (µi , σ 2 ),
µi = θ ′ xi .
We write $SS(\theta) = \sum_{i=1}^{n}(y_i-\theta'x_i)^2$ for the sum of squares. Writing $y = (y_1,\ldots,y_n)'$, a column vector of length n, and $X = (x_1,\ldots,x_n)'$, an n × 2 matrix, this can be expressed as
$$SS(\theta) = \sum_{i=1}^{n}(y_i-\theta'x_i)^2 = (y-X\theta)'(y-X\theta) = y'y - 2y'X\theta + \theta'X'X\theta.$$

The p×p matrix X′ X is called the sum of squares and cross-products matrix of the covariates
with each other, and the p × 1 vector X′ y is called the sum of cross-products vector, of the
covariates with the response.

15.4 Maximum likelihood


The likelihood L(θ, σ), omitting known constants, and the log-likelihood ℓ(θ, σ) and its derivatives are
$$
\begin{aligned}
L(\theta,\sigma) &= \prod_{i=1}^{n}\frac{1}{\sigma}\exp\left[-\frac{1}{2}\frac{(y_i-\theta'x_i)^2}{\sigma^2}\right] = \frac{1}{\sigma^{n}}\exp\left[-\frac{1}{2}\frac{SS(\theta)}{\sigma^2}\right] \\
\ell(\theta,\sigma) &= -n\log\sigma - \frac{SS(\theta)}{2\sigma^2} \\
\frac{\partial\ell(\theta,\sigma)}{\partial\theta} &= -\frac{1}{2\sigma^2}\cdot\frac{\partial SS(\theta)}{\partial\theta} = \frac{1}{\sigma^2}\left[X'y - X'X\theta\right] \\
\frac{\partial\ell(\theta,\sigma)}{\partial\sigma} &= -\frac{n}{\sigma} + \frac{SS(\theta)}{\sigma^3} \\
\frac{\partial^2\ell(\theta,\sigma)}{\partial\theta\,\partial\theta'} &= -\frac{1}{\sigma^2}X'X \\
\frac{\partial^2\ell(\theta,\sigma)}{\partial\theta\,\partial\sigma} &= -\frac{2}{\sigma^3}\left[X'y - X'X\theta\right] \\
\frac{\partial^2\ell(\theta,\sigma)}{\partial\sigma^2} &= \frac{n}{\sigma^2} - 3\frac{SS(\theta)}{\sigma^4}.
\end{aligned}
$$

The MLEs of θ and $\sigma^2$ are $\hat\theta = [X'X]^{-1}X'y$ and $\hat\sigma^2 = SS(\hat\theta)/n$. The cross-derivative with respect to both θ and σ is zero at the MLEs, so the parameter estimates $\hat\theta$ and $\hat\sigma$ are uncorrelated, and asymptotically independent. The sum of squares can be partitioned, as in the Gaussian mean model, into two parts (the cross-product terms vanish):
$$SS(\theta) = [y - X\hat\theta + X(\hat\theta-\theta)]'[y - X\hat\theta + X(\hat\theta-\theta)] = (y-X\hat\theta)'(y-X\hat\theta) + (\hat\theta-\theta)'X'X(\hat\theta-\theta).$$
b − θ).

The first term is the residual sum of squares RSS, the second is the regression sum of squares
RegSS. The likelihood can be decomposed correspondingly (see the following).
It is important to note (as in the simple Gaussian mean case) that the log-likelihood is
not quadratic in all the parameters, though SS(θ) is quadratic in θ. We can write these and
the ML estimates explicitly in the simple linear regression model:
$$X'X = \begin{pmatrix} n & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{pmatrix} = n\begin{pmatrix} 1 & \bar{x} \\ \bar{x} & \overline{x^2} \end{pmatrix},$$
$$[X'X]^{-1} = \frac{1}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\begin{pmatrix} \overline{x^2} & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix}, \qquad X'y = n\begin{pmatrix} \bar{y} \\ \overline{xy} \end{pmatrix},$$
where $\overline{x^2} = \sum_{i=1}^{n} x_i^2/n$ and $\overline{xy} = \sum_{i=1}^{n} x_i y_i/n$. After some algebraic simplification,
$$\hat\beta = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} = \frac{\overline{xy}-\bar{x}\bar{y}}{\overline{x^2}-\bar{x}^2}, \qquad \hat\alpha = \bar{y} - \hat\beta\bar{x}.$$

So the residual sum of squares can be expressed as


$$RSS = \sum_{i=1}^{n}(y_i - \hat\alpha - \hat\beta x_i)^2 = \sum_{i=1}^{n}[(y_i-\bar{y}) - \hat\beta(x_i-\bar{x})]^2.$$

The information matrix (evaluated at the MLEs) is block diagonal: the regression parameter
estimates are uncorrelated with the σ estimate. The inverse of the information block for the
regression parameters is $\hat\sigma^2[X'X]^{-1}$, so
$$\widehat{\mathrm{Var}}[\hat\alpha] = \hat\sigma^2\,\overline{x^2}\Big/\sum_{i=1}^{n}(x_i-\bar{x})^2, \qquad \widehat{\mathrm{Var}}[\hat\beta] = \hat\sigma^2\Big/\sum_{i=1}^{n}(x_i-\bar{x})^2.$$
However, the parameter estimates $\hat\alpha$ and $\hat\beta$ are correlated, with correlation $r = -\bar{x}/\sqrt{\overline{x^2}}$. This has the opposite sign to $\bar{x}$. It depends on the location (sample mean) of the covariate x, but not its scale (sample standard deviation). An important treatment of the data, by centring (or centering) the covariate by subtracting the mean $\bar{x}$: $x_i' = x_i - \bar{x}$, gives uncorrelated parameter estimates, with the intercept estimate now being $\bar{y}$, and the slope estimate $\sum_i (x_i' y_i')/\sum_i (x_i'^2)$.

15.4.1 Vitamin K example


Without centering of the log dose, the MLEs (with SEs in parentheses) are intercept $\hat\alpha = 6.931$ (0.166), slope $\hat\beta = -1.892$ (0.092), and the correlation between them is −0.937. With centering, these are $\hat\alpha_c = 3.742$ (0.058), $\hat\beta_c = -1.892$ (0.092), with correlation zero. The residual sum of squares is 0.6540, giving the "unbiased" estimate $\tilde\sigma^2 = 0.0503$ and $\tilde\sigma = 0.224$. The MLE and SE of β, and the RSS, are unaffected by the centering.
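A minimal sketch of this fit from the data in Table 15.1, using the explicit ML (= OLS) formulas of the previous section, is given below; it should reproduce the estimates quoted above (about 6.93 and −1.89, with centred intercept about 3.74).

```python
import numpy as np

# Data from Table 15.1.
dose = np.array([1.6, 2.2, 2.8, 3.0, 3.7, 4.4, 4.8,
                 6.1, 6.8, 7.2, 8.2, 9.9, 10.2, 11.3, 14.8])
conc = np.array([500, 162, 178, 136, 78, 47, 62,
                 39, 40, 21, 19, 12, 10, 9, 8], dtype=float)

x, y = np.log(dose), np.log(conc)
n = len(y)

# ML estimates from the explicit formulas.
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()
rss = np.sum((y - alpha_hat - beta_hat * x) ** 2)
sigma2_tilde = rss / (n - 2)                       # "unbiased" estimate

print("alpha-hat:", alpha_hat, " beta-hat:", beta_hat)
print("RSS:", rss, " sigma-tilde:", np.sqrt(sigma2_tilde))

# Centring x gives intercept y-bar with the same slope.
xc = x - x.mean()
print("centred intercept and slope:", y.mean(), np.sum(xc * y) / np.sum(xc ** 2))
```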

15.5 Bayesian and frequentist inferences


The likelihood partitions with SS(θ):
$$
\begin{aligned}
L(\theta,\sigma) &= \frac{1}{\sigma^{n}}\exp\left[-\frac{1}{2}\frac{SS(\theta)}{\sigma^2}\right] \\
&= \frac{1}{\sigma^{n}}\exp\left[-\frac{1}{2}\frac{RSS + (\hat\theta-\theta)'X'X(\hat\theta-\theta)}{\sigma^2}\right] \\
&= \frac{1}{\sigma^{n-2}}\exp\left[-\frac{1}{2}\frac{RSS}{\sigma^2}\right]\cdot\frac{1}{\sigma^{2}}\exp\left[-\frac{1}{2}\frac{(\hat\theta-\theta)'X'X(\hat\theta-\theta)}{\sigma^2}\right].
\end{aligned}
$$

The Bayesian and frequentist interpretations of this partition are parallel to those in the
Gaussian mean model:
• (Bayesian) with a prior on σ of $1/\sigma^2$, and a flat prior on θ, the marginal posterior distribution of $RSS/\sigma^2$ is $\chi^2_{n-2}$, and the conditional posterior distribution of θ given $\sigma^2$ is the bivariate Gaussian $N_2(\hat\theta, \sigma^2[X'X]^{-1})$;
• (frequentist) the first term is the $\chi^2_{n-2}$ distribution of $RSS/\sigma^2$, and the second is the bivariate Gaussian distribution of $\hat\theta$: $N_2(\theta, \sigma^2[X'X]^{-1})$.
So as in the Gaussian mean model, the two inferences about the regression parameters and
the variance are identical, subject to the prior specifications above. We need not distinguish
the two approaches in the Gaussian model. The advantage of the centering of x is that it
eliminates the covariance, so that statements about the regression coefficient and the inter-
cept are independent: the two parameters have independent posterior Gaussian distributions
given σ 2 .
Inference about the regression parameters is usually based on the marginal t-distributions
of the parameters or their ML estimates; the joint distribution gives credibility or confidence
ellipses in the joint parameter space. Functions of the model parameters are straightforward.
One of particular interest is the mean function µ(x) = α + βx. Figure 15.7 shows the median
(line) and 95% mean precision region (green curves) from 10,000 draws of this function. The
uncertainty in the mean function increases away from the mean x̄ in both directions. This
is characteristic of regression models in general.
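The draws behind a figure of this kind follow directly from the partition above: σ² is drawn from RSS/χ²ₙ₋₂, then (α, β) from the conditional bivariate Gaussian, then µ(x) is evaluated over a grid. The following is a minimal Python/NumPy sketch, not from the text; the data arrays are hypothetical stand-ins for the centred sample.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed inputs: centred covariate xc and response y (hypothetical values).
xc = np.array([-1.2, -0.8, -0.3, 0.1, 0.4, 0.9, 1.3])
y = np.array([4.9, 4.2, 3.9, 3.4, 3.1, 2.4, 1.9])
n = len(y)

X = np.column_stack([np.ones(n), xc])
XtX = X.T @ X
theta_hat = np.linalg.solve(XtX, X.T @ y)
RSS = np.sum((y - X @ theta_hat) ** 2)

M = 10_000
grid = np.linspace(xc.min(), xc.max(), 50)
mu_draws = np.empty((M, grid.size))

for m in range(M):
    # sigma^2 | y : RSS / chi^2_{n-2}  (flat prior on theta, 1/sigma^2 prior on sigma^2)
    sigma2 = RSS / rng.chisquare(n - 2)
    # theta | sigma^2, y : N2(theta_hat, sigma^2 [X'X]^{-1})
    theta = rng.multivariate_normal(theta_hat, sigma2 * np.linalg.inv(XtX))
    mu_draws[m] = theta[0] + theta[1] * grid

median = np.percentile(mu_draws, 50, axis=0)
lower, upper = np.percentile(mu_draws, [2.5, 97.5], axis=0)
# The band (lower, upper) widens as the grid moves away from the mean of the covariate.
```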

15.6 Model-robust analysis


The preceding analysis is optimal in the statistical sense if the distribution of ϵ is Gaussian.
We have no evidence that it is not Gaussian, but even small departures from Gaussian might affect the conclusions.

FIGURE 15.7
Vitamin K ML mean function (line) and 95% precision region bounds (green curves)

We can guard against this possibility by using the Bayesian
bootstrap (BB) analysis from Chapter 12. The parametric functions of inference are the
population regression parameters – the population analogues of the sample estimates. These
are defined by
$$B = \frac{\sum_{I=1}^N (X_I - \bar{X})(Y_I - \bar{Y})}{\sum_{I=1}^N (X_I - \bar{X})^2}, \qquad A = \bar{Y} - B\bar{X},$$

in the population of size N with population values (YI , XI ), I = 1, . . . , N. We follow exactly


the BB procedure, expressed as an iteratively weighted Gaussian ML analysis of the sample
data, with weights given by random Dirichlet draws of the multinomial probabilities pi on
the observed distinct data values (xi , yi ) with multiplicity ni . The random values A[m] and
B [m] are then random draws from the joint posterior distribution of A and B. Figures 15.8,
15.9 and 15.10 show the posterior distributions of α, β and σ from M = 10,000 draws. We
compare the centered BB analysis with the centered ML analysis.
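A minimal sketch of the BB step, not from the text: each of the M draws weights the observations by a Dirichlet(1, . . . , 1) vector and computes the weighted least squares values of A and B, giving draws from their joint posterior. The data arrays are hypothetical, and the weighted residual scale draw at the end is one simple choice rather than the book's exact computation.

```python
import numpy as np

rng = np.random.default_rng(2)

x = np.array([-1.2, -0.8, -0.3, 0.1, 0.4, 0.9, 1.3])  # hypothetical centred doses
y = np.array([4.9, 4.2, 3.9, 3.4, 3.1, 2.4, 1.9])     # hypothetical responses
n = len(y)

M = 10_000
A_draws = np.empty(M)
B_draws = np.empty(M)
sigma_draws = np.empty(M)

for m in range(M):
    w = rng.dirichlet(np.ones(n))          # random multinomial probabilities on the data
    xbar_w = np.sum(w * x)
    ybar_w = np.sum(w * y)
    B = np.sum(w * (x - xbar_w) * (y - ybar_w)) / np.sum(w * (x - xbar_w) ** 2)
    A = ybar_w - B * xbar_w
    A_draws[m], B_draws[m] = A, B
    # one simple choice of scale draw: the weighted root mean squared residual
    sigma_draws[m] = np.sqrt(np.sum(w * (y - A - B * x) ** 2))

print(np.percentile(A_draws, [2.5, 50, 97.5]))
print(np.percentile(B_draws, [2.5, 50, 97.5]))
```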
The medians and 95% credible intervals are: for α, 3.743 and [3.629, 3.858]; for β, −1.891
and [−2.070, −1.713]; and for σ, 0.230 and [0.164, 0.359]. Those for α and β correspond closely
to the MLEs and Gaussian 95% credible/confidence intervals: $\hat{\alpha}_c = 3.742$, [3.618, 3.866],
and $\hat{\beta}_c = -1.892$, [−2.089, −1.695]. For σ, the Gaussian estimate is 0.224, and the 95%
confidence/credible interval is [0.163, 0.361], which also agrees closely with the posterior
values.

FIGURE 15.8
Posterior of α

FIGURE 15.9
Posterior of β

FIGURE 15.10
Posterior of σ

So the robust BB analysis validates the Gaussian analysis in this sample. However, we
can invert the logic of this comparison. We needed to do the robust BB analysis because of
concern for the validity of the Gaussian model assumption. In fact, we did not need to do the
Gaussian analysis! – only specify the Gaussian as a tentative model for the log response. In
fact we do not even need to have a tentative probability model – it is sufficient to take the
population definitions of the regression parameters A and B as the population parameters of
interest for analysis. The whole BB analysis can be carried out without any formal statement
about a restrictive probability model for the response. This parallels the survey sampling
approach to analysis, but bases it on the non-restrictive multinomial model. (We should add
here that there is a circularity in the argument: the definitions of the population parameters
of interest are themselves implied by the Gaussian model.)
We make use of this feature of the BB analysis in more complex models.

15.6.1 The robust variance estimate


In many application fields where the Gaussian assumption of constant error variances may be
questionable, it is popular to allow for “heteroscedasticity” – unequal variances – by using a
robust estimate of the precision of the Gaussian MLEs – robust standard errors (White 1980).
We want to retain the Gaussian-based MLEs, but abandon their Gaussian-based precisions.
There is a logical difficulty in this process. If the observation variances are unequal, then
the standard Gaussian model is incorrect, and the Gaussian MLEs are not optimal estimates.
They need to be replaced as well. One possible way of accounting for this is by a specific
structural model for the variances, discussed in Chapter 17. Without such a model, the
difficulty is resolved by the BB analysis.
We simply weight the standard Gaussian-based analysis by the random Dirichlet draws,
giving the robust full posterior distribution of the model parameters without the Gaussian
model assumptions, including constant variance. (However, the assumption of constant vari-
ance is built in implicitly through the definition of the population parameters of interest.)

15.7 Correlation and prediction


15.7.1 Correlation
An important summary measure of the strength of association between y and x in the
simple linear regression model is the sample correlation coefficient, denoted by r. It is a
useful descriptive statistic regardless of the probability model for y.
The correlation coefficient calibrates the strength of association through the proportion of
the residual sum of squares in y “explained” (modelled) by the regression on x. This is easily
calculated: if no regression model is fitted (equivalent to β = 0), the Gaussian model becomes
the simple mean model, for which the residual sum of squares is $RSS_0 = \sum_{i=1}^n (y_i - \bar{y})^2$. The
squared correlation between y and x is defined by

$$r^2 = (RSS_0 - RSS_1)/RSS_0 = 1 - RSS_1/RSS_0,$$

where $RSS_1$ is the residual sum of squares from the regression model. Algebraically,

$$\begin{aligned}
r^2 &= 1 - \frac{\sum_{i=1}^n (y_i - \hat{\alpha} - \hat{\beta} x_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2} \\
&= 1 - \frac{\sum_{i=1}^n [(y_i - \bar{y}) - \hat{\beta}(x_i - \bar{x})]^2}{\sum_{i=1}^n (y_i - \bar{y})^2} \\
&= \frac{\left[\sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x})\right]^2}{\sum_{i=1}^n (y_i - \bar{y})^2 \sum_{i=1}^n (x_i - \bar{x})^2}.
\end{aligned}$$

So the correlation itself is

$$r = \frac{\sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x})}{\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2 \sum_{i=1}^n (x_i - \bar{x})^2}},$$

where the sign of r is the sign of the numerator sum of cross-products, or of the slope
coefficient $\hat{\beta}$.
The squared correlation is restricted to the interval [0,1]; r itself is restricted to the
interval [−1,1]. For the Vitamin K sample, the correlation between the log-transformed dose
and response variables is −0.937; the squared correlation is 0.878. These values are extremely
high in magnitude: we will see in most surveys (as opposed to experiments) that correlations
are much lower. The correlation between the original variables before the transformation is
much smaller: on the original scale the relation between the variables is strongly non-linear.
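A small numerical check, not from the text, that the two routes to r² agree: the cross-product formula and 1 − RSS₁/RSS₀ give the same value, with the sign of r matching the sign of the slope estimate. The toy arrays are hypothetical.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # hypothetical covariate
y = np.array([2.1, 2.0, 3.3, 3.9, 4.2, 5.5])   # hypothetical response

xbar, ybar = x.mean(), y.mean()
beta_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
alpha_hat = ybar - beta_hat * xbar

RSS0 = np.sum((y - ybar) ** 2)                       # mean-only model
RSS1 = np.sum((y - alpha_hat - beta_hat * x) ** 2)   # regression model

r2 = 1.0 - RSS1 / RSS0
r_direct = (np.sum((y - ybar) * (x - xbar))
            / np.sqrt(np.sum((y - ybar) ** 2) * np.sum((x - xbar) ** 2)))

print(r2, r_direct ** 2)   # equal; the sign of r_direct matches the sign of beta_hat
```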

15.7.2 Prediction
In designed studies like that for Vitamin K, the experimenter may want to use the fitted
regression model to make a prediction about the response value y0 for a new “predictor”
variable value x0 . We have already seen (in §10.3.1) the prediction of a new observation given
a sample of response values only. The Bayesian and frequentist analyses of the regression
extension are identical and very straightforward.
We observe y and x related through the regression model, and we have a new x0 and want
to make a predictive inference about the corresponding y0 , through its posterior predictive

distribution, derived from the posterior distribution of α + βx0 . The Bayesian predictive
inference is constructed from the conditional distribution of y0 given x0 , α, β, σ. We have

$$\begin{aligned}
f(y_0, y \mid x_0, x, \alpha, \beta, \sigma) &= f(y_0 \mid x_0, \alpha, \beta, \sigma) \cdot f(y \mid x, \alpha, \beta, \sigma) \\
&= c \cdot f(y_0 \mid x_0, \alpha, \beta, \sigma) \cdot f(\alpha, \beta \mid \sigma, y, x) \cdot f(\sigma \mid y, x).
\end{aligned}$$

The posterior predictive distribution can be obtained analytically by integrating out successively α, β and σ:

$$f(y_0 \mid x_0, y, x) = c \int\!\!\int\!\!\int f(y_0 \mid x_0, \alpha, \beta, \sigma)\, f(\alpha, \beta \mid \sigma, y, x)\, f(\sigma \mid y, x)\; d\alpha\, d\beta\, d\sigma.$$

However, posterior simulation gives a very simple (and more generally useful) alternative:
• We make M random draws $\sigma^{2[m]}$ from the marginal posterior distribution of $\sigma^2$;
• then for each draw m we make a single random draw $(\alpha^{[m]}, \beta^{[m]})$ from the conditional
bivariate Gaussian distribution of (α, β) given $\sigma^{[m]}$;
• finally for each draw m we make a single random draw $y_0^{[m]}$ from the conditional Gaussian
distribution of $y_0$ given $x_0, \alpha^{[m]}, \beta^{[m]}, \sigma^{[m]}$: $N(\alpha^{[m]} + \beta^{[m]} x_0, \sigma^{2[m]})$.

The M values $y_0^{[m]}$ are a random sample from the posterior predictive distribution.
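The three simulation steps translate directly into code. The following is a minimal Python/NumPy sketch, not from the text; the sample arrays and the new centred value x0c are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed sample quantities (hypothetical data): centred covariate and response.
xc = np.array([-1.2, -0.8, -0.3, 0.1, 0.4, 0.9, 1.3])
y = np.array([4.9, 4.2, 3.9, 3.4, 3.1, 2.4, 1.9])
n = len(y)
X = np.column_stack([np.ones(n), xc])

XtX = X.T @ X
theta_hat = np.linalg.solve(XtX, X.T @ y)
RSS = np.sum((y - X @ theta_hat) ** 2)

x0c = -0.076          # new (centred) covariate value, hypothetical here
M = 10_000
y0 = np.empty(M)

for m in range(M):
    sigma2 = RSS / rng.chisquare(n - 2)                               # step 1
    alpha, beta = rng.multivariate_normal(theta_hat,                  # step 2
                                          sigma2 * np.linalg.inv(XtX))
    y0[m] = rng.normal(alpha + beta * x0c, np.sqrt(sigma2))           # step 3

print(np.percentile(y0, [2.5, 50, 97.5]))   # posterior predictive median and 95% interval
```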
The frequentist formulation is in three steps, as before:

$$\begin{aligned}
\hat{\theta}'x &\sim N(\theta'x, \sigma^2 h), \\
y_0 - \hat{\theta}'x &\sim N(0, \sigma^2(1 + h)), \\
RSS_1/\sigma^2 &\sim \chi^2_{n-2} \text{ independently},
\end{aligned}$$

so that

$$(y_0 - \hat{\theta}'x)\big/\big(s\sqrt{1 + h}\,\big) \sim t_{n-2},$$

where $h = x'[X'X]^{-1}x$ and $s^2 = RSS_1/(n-2)$.

15.7.3 Example

We want to predict the Vitamin K concentration for a new dose level of x0 = 5 mgm per
gm wt of chick per day. We transform to the log dose scale: x0 = log 5 = 1.61, then centre
this value by subtracting the original sample mean x̄: 1.686, so x0c = −0.076. The posterior
prediction median or mean of log (y) is 3.886, with 95% prediction interval [3.385, 4.387].
The corresponding prediction values for y are 48.7 and [29.5, 80.4].
There are several cautions about this procedure:
• No prediction should be given without its full posterior predictive distribution, or a
credible interval for it. A point prediction, like a point estimate, is useless.

• The precision of prediction from the posterior predictive depends on the validity of the
probability model for the response variables.
• Prediction from the model assumes that the model applies to the new value; in particular, that the new x value lies within the range of the existing predictor variables.

Prediction of a response variable from a covariate value outside the range of the observed
covariates, a common practice in extrapolating time series into the future, is hazardous,
because it assumes a “steady state”. We have no way of knowing if the “steady state” – the
regression – will continue unchanged beyond the observed data. For this reason predictions
without precisions are sometimes called “projections”, which are assumed not to require
measures of precision. They do require them!

15.7.4 Prediction as a model assessment tool


We can use the inferential structure of prediction of new observations to assess both the
precision of the fitted model, and the data variability to be expected from the model. This is
simply done by constructing the prediction intervals for new observations y at the observed
values x. In some statistical packages this is assisted by the provision of a function: the
variance of the linear predictor (of the mean function) at $x_i$ is

$$\mathrm{Var}[\hat{\eta}_i] = \sigma^2_{P_i} = \mathrm{Var}[\hat{\beta}'x_i] = x_i'\,C\,x_i,$$

where C is the covariance matrix of the ML estimates $\hat{\beta}$.
Then the variance of the mean prediction is given directly by this function, and the data
variability to be expected from the model at $x_i$ is this variance plus the variance of a single
observation: $\sigma^2_{V_i} = \mathrm{Var}[\hat{\eta}_i] + \hat{\sigma}^2$. We adopt as a general notation the graphical representation
of the 95% precision region $\hat{\mu}_i \pm 2\sigma_{P_i}$ by green curves, and the 95% credible region (for
variability) $\hat{\mu}_i \pm 2\sigma_{V_i}$ by red curves.
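A minimal sketch, not from the text, of the two regions: the precision band uses Var[η̂ᵢ] = xᵢ′Cxᵢ with C = σ̂²[X′X]⁻¹, and the variability band adds σ̂² for a single new observation. The data are hypothetical, and σ̂² is taken here as RSS/(n − p − 1).

```python
import numpy as np

# Assumed inputs (hypothetical data): design matrix X and response y.
xc = np.array([-1.2, -0.8, -0.3, 0.1, 0.4, 0.9, 1.3])
y = np.array([4.9, 4.2, 3.9, 3.4, 3.1, 2.4, 1.9])
n, k = len(y), 2
X = np.column_stack([np.ones(n), xc])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
fitted = X @ beta_hat
RSS = np.sum((y - fitted) ** 2)
sigma2_hat = RSS / (n - k)           # variance estimate used here (an assumption)

C = sigma2_hat * XtX_inv             # covariance matrix of the ML estimates

var_eta = np.einsum('ij,jk,ik->i', X, C, X)    # x_i' C x_i for each observation
sd_P = np.sqrt(var_eta)                        # precision SD of the mean function
sd_V = np.sqrt(var_eta + sigma2_hat)           # variability SD for a new observation

green_lower, green_upper = fitted - 2 * sd_P, fitted + 2 * sd_P   # 95% precision region
red_lower, red_upper = fitted - 2 * sd_V, fitted + 2 * sd_V       # 95% variability region
```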

15.8 Probability model assessment


We saw in §15.2 that the usual investigation of the distribution of the frequentist model
residuals through their probit plot was theoretically unsatisfactory: these residuals have zero
mean, but different variances and are correlated. A formal Bayesian analysis, however, solves
this problem. We may write the model yi | xi ∼ N (α + βxi , σ 2 ) in the equivalent forms

yi − α − βxi ∼ N (0, σ 2 ),
(yi − α − βxi )/σ ∼ N (0, 1).

We denote the Bayesian (standardised) residuals by $\epsilon_i = (y_i - \alpha - \beta x_i)/\sigma$, which have
independent standard Gaussian distributions. However, they are not observable. We generate
M independent draws of the parameters α, β, σ, and for each observation $(y_i, x_i)$ we construct
the standardised residual at $x_i$:

$$\epsilon_i^{[m]} = (y_i - \alpha^{[m]} - \beta^{[m]} x_i)/\sigma^{[m]}.$$

At each $x_i$ we sort the M draws $\epsilon_i^{[m]}$ and find the 2.5% and 97.5% quantiles of these ordered
draws. The region within the upper and lower quantiles is a 95% credible region for the cdf
of these residuals. We show the quantiles joined by red straight line segments, together with
the median straight line, and the frequentist residuals as circles in Figure 15.11.
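A minimal sketch, not from the text, of this construction: for each parameter draw the standardised residuals are computed, and the 2.5%, 50% and 97.5% quantiles across draws are taken at each observation. The data arrays are hypothetical; the plotting against the probit scale is omitted.

```python
import numpy as np

rng = np.random.default_rng(4)

xc = np.array([-1.2, -0.8, -0.3, 0.1, 0.4, 0.9, 1.3])   # hypothetical data
y = np.array([4.9, 4.2, 3.9, 3.4, 3.1, 2.4, 1.9])
n = len(y)
X = np.column_stack([np.ones(n), xc])
XtX = X.T @ X
theta_hat = np.linalg.solve(XtX, X.T @ y)
RSS = np.sum((y - X @ theta_hat) ** 2)

M = 10_000
eps = np.empty((M, n))
for m in range(M):
    sigma2 = RSS / rng.chisquare(n - 2)
    alpha, beta = rng.multivariate_normal(theta_hat, sigma2 * np.linalg.inv(XtX))
    eps[m] = (y - alpha - beta * xc) / np.sqrt(sigma2)   # standardised residuals

# Pointwise quantiles of the M draws at each observation give the 95% credible band.
lower = np.percentile(eps, 2.5, axis=0)
median = np.percentile(eps, 50, axis=0)
upper = np.percentile(eps, 97.5, axis=0)
```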

FIGURE 15.11
Vitamin K residual probits

The credible region is wide, with the Gaussian model in the middle. There is no evidence
against the Gaussian residual distribution. Joining the quantile points by straight lines is
more visible and attractive, but may give a misleading impression of our (lack of) knowledge
of the cdf between the observations.

15.9 “Dummy variable” regression


A very important application of the regression model is to the two-sample design. We have
discussed the construction of the posterior distribution of the difference between two Gaus-
sian means, or two binomial proportions. However, the linear regression model can be used
to represent these structures, through the use of a “dummy” or “indicator” variable.
We define a covariate x to take only two possible values, here defined to be 0 or 1 (other
possibilities will be explained shortly). The regression model then becomes:

Y ∼ N (α, σ 2 ), x = 0
∼ N (α + β, σ 2 ), x = 1.

So α is the mean in the first group of observations with x = 0, and β is the difference between
the means in the second and the first groups.
If we define x to take the two values a and b instead of 0 and 1, then β = (µ₂ − µ₁)/(b − a)
and α = (bµ₁ − aµ₂)/(b − a). In particular, if a = −1/2 and b = 1/2, then β = µ₂ − µ₁ and α = (µ₁ + µ₂)/2.
The 0, 1 choice gives the simpler relationship, as we are not usually interested in the equally
weighted average mean. We will see how useful this is in the analysis of cross-classifications.
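A minimal numerical illustration, not from the text, of the 0/1 coding: regressing y on the indicator returns the first-group mean as the intercept and the difference of group means as the slope. The two-group data are hypothetical.

```python
import numpy as np

# Hypothetical two-group data.
y1 = np.array([4.1, 3.8, 4.5, 4.0])     # group with x = 0
y2 = np.array([5.2, 5.6, 4.9, 5.4])     # group with x = 1
y = np.concatenate([y1, y2])
x = np.concatenate([np.zeros(len(y1)), np.ones(len(y2))])

X = np.column_stack([np.ones(len(y)), x])
alpha_hat, beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(alpha_hat, y1.mean())              # intercept = mean of the first group
print(beta_hat, y2.mean() - y1.mean())   # slope = difference of the group means
```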
TABLE 15.2
Absence, IQ and number of family dependents, Aboriginal girls
days 14 11 2 5 5 35 22 20 13
IQ 60 60 70 86 86 81 86 93 93
deps 11 11 5 9 10 9 7 4 12
days 7 14 27 6 20 4 15 13 6 6
IQ 96 90 82 79 65 64 66 76 73 74
deps 6 5 11 6 10 7 8 3 11 9
days 5 16 17 46 43 40 16 14 32
IQ 70 98 84 100 84 91 105 104 76
deps 7 14 3 10 7 7 4 11 11
days 57 6 53 23 8 34 36 38 23 28
IQ 83 73 92 93 99 95 84 106 89 103
deps 10 9 9 13 7 9 10 11 9 9

15.10 Two-variable models


The simple linear regression model can be extended directly to two covariates, with a small
change in notation; this extension leads directly to the general p-covariate model. We begin
with a simple example of absence from school, in Table 15.2, reproduced from Table 2.7.
How is the number of days absent from school related to IQ and number of family
dependents? It is always useful to graph data; graphs of each variable against the other two
follow, in Figures 15.12, 15.13 and 15.14. All the graphs show a great deal of variability,
without a clear strong relation. Writing 1 for IQ, 2 for number of family dependents and y
for absence, correlations among the variables are positive and small: r1y = 0.343, r2y = 0.208
and r12 = 0.025.
The two absence graphs show a vague pattern of increasing variability with increasing IQ
or dependents. The graph of IQ against dependents shows almost no relation: the variables
appear to be independent. We now fit single-variable regressions of days on IQ and days on
dependents. The fitted models are shown with the data in Figures 15.15 and 15.16.

For IQ, the fitted regression is $\hat{\mu} = -12.48\ (0.39) + 0.391\ (0.178)\,\mathrm{IQ}$.

For dependents, the fitted regression is $\hat{\mu} = 10.98\ (7.93) + 1.129\ (0.887)\,\mathrm{deps}$.

The 95% confidence/credible interval for the IQ slope excludes zero, while that for depen-
dents includes zero – dependents do not appear important for absence. However the increas-
ing variability is not accounted for in either model. We discuss how to deal with this in
Chapter 17.

15.11 Model assumptions


We examine the Gaussian model assumption through the frequentist residuals. Figure 15.17
shows the graph of the residual probits from the regression of absence on IQ against the
ordered residuals. The curvature shows that the distribution is right-skewed – the Gaussian
model does not fit well.

FIGURE 15.12
Absence vs IQ

FIGURE 15.13
Absence vs dependents

FIGURE 15.14
IQ vs dependents

FIGURE 15.15
Absence vs IQ and fitted linear model

FIGURE 15.16
Absence vs dependents and fitted linear model

FIGURE 15.17
Probit residuals

Before proceeding further with this example we give the theory for
the general p-variable case.

15.12 The p-variable linear model


The linearity of the p-variable multiple regression model is of two kinds:
• The relation of the p covariates to the mean of the response distribution is through a
linear combination of these variables x, commonly called a linear predictor, and denoted
by the Greek letter η (eta).

• The linear predictor acts as a single variable in the regression, as in the simple linear
regression model.
The model is sometimes called a single-index model, with index referring to the linear func-
tion.
We first extend the notation of both covariates and regression parameters. We define $x_0$
to be the vector of 1s, and rename the intercept parameter α as $\beta_0$. We write the linear
predictor in vector form: $\eta = \sum_{j=0}^p \beta_j x_j = \beta'x$. We do not need a regression coefficient on
η because it is already a function of the regression coefficients β. More informatively,

$$y_i \mid x_i \sim N(\mu_i, \sigma^2), \qquad \mu_i = \beta'x_i.$$

The x variables themselves do not need to be linear; they can be powers or products of other
variables, as we will see. Geometrically, the population mean values are modelled as lying in
a hyper-plane – the generalisation of a (y, x) line in two dimensions and a (y, x1 , x2 ) plane
in three dimensions.

15.13 The Gaussian multiple regression likelihood


The likelihood has the same form as for the simple linear regression model:

$$\begin{aligned}
L(\beta, \sigma) &= \Pr[y_1, \ldots, y_n \mid x_1, \ldots, x_n, \beta, \sigma] \\
&= \prod_{i=1}^n f(y_i \mid \beta, \sigma) \cdot \delta \\
&= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{1}{2}\,\frac{(y_i - \mu_i)^2}{\sigma^2}\right\} \delta \\
&= \left(\frac{\delta}{\sqrt{2\pi}}\right)^n \frac{1}{\sigma^n} \exp\left\{-\frac{1}{2}\,\frac{\sum_{i=1}^n (y_i - \mu_i)^2}{\sigma^2}\right\}.
\end{aligned}$$

Again δ is the measurement precision with which y is measured.
The sum of squares term $\sum_{i=1}^n (y_i - \mu_i)^2$ in the likelihood appears in some form in every
Gaussian model, and has a standard re-expression. We write $SS(\beta) = \sum_{i=1}^n (y_i - \beta'x_i)^2$

for the sum of squares. Writing $y = (y_1, \ldots, y_n)'$, a column vector of length n, and the
$n \times (p+1)$ matrix $X = (x_1, \ldots, x_n)'$, with ith row $x_i'$, this can be expressed as

$$SS(\beta) = \sum_{i=1}^n (y_i - \beta'x_i)^2 = (y - X\beta)'(y - X\beta) = y'y - 2y'X\beta + \beta'X'X\beta.$$

The matrix X′ X is now the sum of squares matrix of the covariates, and the vector X′ y is
the sum of cross-products vector of the covariates with the response.
It is clear that the Bayesian and frequentist results will again be the same, with appropriate
adjustment of the prior on $\sigma^2$:
• (Bayesian) with a prior on σ of $1/\sigma^{p+1}$, and a flat prior on β, the marginal posterior
distribution of $RSS/\sigma^2$ is $\chi^2_{n-p-1}$, and the conditional posterior distribution of β given
$\sigma^2$ is the (p+1)-variate Gaussian distribution $N_{p+1}(\hat{\beta}, \sigma^2[X'X]^{-1})$;
• (frequentist) the first term is the $\chi^2_{n-p-1}$ distribution of $RSS/\sigma^2$, and the second is the
(p+1)-variate Gaussian distribution of $\hat{\beta}$: $N_{p+1}(\beta, \sigma^2[X'X]^{-1})$.

In the Gaussian analyses described in the following, we use the frequentist terms as these
are most common in the applications of Gaussian regression analysis. We repeat that the
Bayesian analysis gives the same results.
As in the single variable model, the intercept and regression model parameter ML es-
timates (or parameters for Bayesians) are uncorrelated with the variance parameter ML
estimate (or parameter), but the regression coefficient estimates (or parameters) are in gen-
eral correlated with each other and with the intercept. By centering each of the covariates,
we can achieve independence of the intercept from the regression coefficients, but not of the
regression coefficients from each other.
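The matrix computations for the p-covariate model are a direct extension of the single-covariate case. The following is a minimal Python/NumPy sketch, not from the text, with hypothetical simulated data; the covariates are centred, and the usual variance estimate RSS/(n − p − 1) is used for the standard errors and t-values.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data: n observations, p = 2 covariates.
n, p = 30, 2
Z = rng.normal(size=(n, p))                      # raw covariates
y = 1.0 + Z @ np.array([0.5, -0.3]) + rng.normal(scale=0.4, size=n)

Zc = Z - Z.mean(axis=0)                          # centred covariates
X = np.column_stack([np.ones(n), Zc])            # design matrix with intercept column

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                     # ML estimates [X'X]^{-1} X'y
RSS = np.sum((y - X @ beta_hat) ** 2)
s2 = RSS / (n - p - 1)                           # usual variance estimate

cov_beta = s2 * XtX_inv                          # estimated covariance of the estimates
se = np.sqrt(np.diag(cov_beta))
t_values = beta_hat / se

print(beta_hat, se, t_values)
```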

15.13.1 Absence from school


We return to the example. The two-variable IQ/dependents ML fitted model with standard errors is

$$\hat{\mu} = -21.24\ (16.58) + 0.385\ (0.177)\,\mathrm{IQ} + 1.083\ (0.845)\,\mathrm{deps}.$$
Both parameter ML estimates change little from the single-variable models, because of the
near-independence of IQ and dependents.
It now appears that the number of dependents does not contribute to absence beyond
the single variable IQ, and can be omitted. However we have not accounted for the large
variation about the regression. In interpreting the analysis it may appear odd that absence
increases with IQ. In observational studies like this one, there is no interpretation of a causal
relation between the variables: increasing IQ does not cause increased absence from school –
nor does increased absence from school cause increased IQ! Any such interpretation could
be drawn only from a randomised experiment, one which followed the children longitudinally
(in time) to observe changes over time in both absence and IQ, which would be difficult to
design and execute.
In the sample analysed, these two variables are very weakly related, and there are many
other observed or unobserved possible confounding variables which might be related to both,
which could reproduce the observed relation. We deal with this possibility systematically, in

the framework of omitted variables, in Chapter 16. We also examine in Chapter 17 alterna-
tive models for absence, allowing for relations between the covariates and the variability of
absence.

15.14 Interactions
The two-variable model we have examined has a strong assumption built-in: that the two
variables do not interact with each other. The meaning of interaction can be understood if
we examine the effect of varying the number of dependents on the regression of absence on
IQ. Writing the structural part of the model as µabs = β0 + β1 IQ + β2 deps, this structure is
for deps = 1 and 2,

µabs = β0 + β1 IQ + β2 ,

µabs = β0 + β1 IQ + 2β2 .

The change in the number of dependents affects only the intercept of the regression, β0 + β2
or β0 + 2β2 , and not the slope of the regression on IQ. So the two-variable regression can
be thought of as a set of parallel single-variable IQ regressions, with the vertical spacing
between them set by the deps coefficient β2 . The same argument leads to the alternative
interpretation of a set of parallel single-variable deps regressions, with the vertical spacing
between them set by the IQ coefficient β1 .
This no interaction property can be changed by extending the model, to include the
interaction – the cross-product of the IQ and dependents variables:

µabs = β0 + β1 IQ + β2 deps + β3 IQ ∗ deps.

Now the change from deps 1 to deps 2 has an effect on the slope of the regression on IQ as
well as on the intercept:

µabs = β0 + β1 IQ + β2 + β3 IQ,

µabs = β0 + β1 IQ + 2β2 + 2β3 IQ.

It is important to note that, when the interaction IQ ∗ deps is modelled, the regression
coefficients β1 and β2 for the interacting variables IQ and deps no longer represent the slopes
of the relations between the response and these variables: they represent only the slopes of
the relations at the zero values of the other variable. This is very important in the analysis
of cross-classifications, considered in §15.18.
It is difficult to identify the possible need for an interaction term from the graphs of
absence against IQ and dependents. The interaction term has to be fitted and assessed
for necessity. In the ML fitted interaction model, the coefficient of the interaction term is
−0.0310 with SE 0.0718: the interaction is small and can be omitted. Interactions can be
extended to higher orders with three, four, . . . variables interacting. We discuss examples
below and in the GLM chapter.
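A minimal sketch, not from the text, of fitting an interaction: the cross-product column is appended to the design matrix and its t-value examined (a value of |t| below about 2 suggesting the interaction can be omitted). The IQ-like and deps-like covariates are simulated, hypothetical values.

```python
import numpy as np

rng = np.random.default_rng(6)

n = 40
iq = rng.normal(80, 12, size=n)                    # hypothetical IQ values
deps = rng.integers(3, 14, size=n).astype(float)   # hypothetical numbers of dependents
y = -12 + 0.4 * iq + 1.0 * deps + rng.normal(scale=10, size=n)

# Design matrix with both main effects and the IQ*deps cross-product.
X = np.column_stack([np.ones(n), iq, deps, iq * deps])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
RSS = np.sum((y - X @ beta_hat) ** 2)
s2 = RSS / (n - X.shape[1])
se = np.sqrt(np.diag(s2 * XtX_inv))

t_interaction = beta_hat[3] / se[3]
print(beta_hat[3], se[3], t_interaction)   # |t| < 2 suggests dropping the interaction
```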

15.14.1 ANOVA, ANCOVA and MR


Historically, applications of the multiple regression model have been given different names.

15.14.1.1 ANOVA
In the analysis of designed experiments in agriculture, engineering and many other fields,
factorial experiments were used in which different factors – treatments at different levels of
some variable – were analysed as cross-classifications with main effects and interactions of
the factors. This required the partitioning of the regression sum of squares into separate and
independent components for the main effects and their interactions. This partitioning process
was invented by R.A. Fisher, who called it the Analysis of Variance, generally abbreviated
to ANOVA.
A fundamentally important difference in the use of ANOVA occurs between designed
experiments and observational studies. Fisher invented both ANOVA and the concept of
designed experiments. He saw that if the factors being investigated in the experiment could
be designed to be orthogonal – uncorrelated – the ANOVA would have a unique partition
structure, and the effects of both main effects and interactions could be established in a
single analysis through the ANOVA table. His 1935 book The Design of Experiments became
famous, and changed the design and analysis of agricultural and other industrial experiments.
In observational studies, like the Bennett investigation of emotional response in hus-
bands to suicide attempts by their wives, the possibly important factors affecting emotional
response are almost never orthogonal except by accident. This means that the ANOVA table
is not unique – it depends on the sequence in which effects are included in, or removed from,
the model. With p covariates in the model, there are p! possible permutations of the order
of their entry into the model, and even more possible models: there are 2p possible models
with any number of covariates.
There have been several attempts to construct a single sum of squares table which cor-
rectly represents the importance of the covariates. In our view these are unsuccessful; a full
discussion of the attempts and their difficulties can be found in Aitkin (1978); it is unfor-
tunate that this discussion has been largely ignored. Fortunately, there is an alternative
approach.

15.14.1.2 Backward elimination


We follow here a standard frequentist analysis method developed in the early days of
computer-based regression, generally called backward elimination.1 This proceeds by first
fitting all covariates in the “full model”. Then the contribution of each covariate is assessed
by computing for each parameter the “t-value” – the ratio of the ML estimate to the standard
error for the parameter – and eliminating from the model the variable with the smallest t-
value in magnitude. We use an elimination criterion of |t| < 2, corresponding to a 95%
credible or confidence interval for the corresponding parameter including zero. In Gaussian
regression models these t-values have a Student t-distribution, but in other distributions this
is only an approximation: they have an asymptotic Gaussian distribution.
The model is refitted after each elimination, and the process repeated until no further
elimination is needed. The t-values change in distribution as the elimination procedure pro-
gresses, but we do not attempt to adjust the “elimination value” of t to reflect this. We
illustrate this procedure on the following examples.2 The probability model is then assessed
by the probit plot of the frequentist residuals.
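A minimal sketch, not from the text, of the backward elimination loop just described: fit the full model, find the covariate with the smallest |t| (keeping the intercept), drop it if |t| < 2, refit, and repeat. The data and variable names are hypothetical.

```python
import numpy as np

def fit(X, y):
    """Return ML estimates, standard errors and t-values for a Gaussian linear model."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    RSS = np.sum((y - X @ beta) ** 2)
    se = np.sqrt(RSS / (n - k) * np.diag(XtX_inv))
    return beta, se, beta / se

def backward_eliminate(X, y, names, t_crit=2.0):
    """Repeatedly drop the covariate with the smallest |t| below t_crit."""
    names = list(names)
    while True:
        beta, se, t = fit(X, y)
        if X.shape[1] == 1:                  # only the intercept left
            return names, beta, se
        cand = np.abs(t[1:])                 # exclude the intercept (column 0)
        j = int(np.argmin(cand)) + 1
        if cand[j - 1] >= t_crit:
            return names, beta, se
        X = np.delete(X, j, axis=1)          # eliminate the covariate and refit
        names.pop(j)

# Hypothetical example: four covariates, two of them irrelevant.
rng = np.random.default_rng(7)
n = 80
Z = rng.normal(size=(n, 4))
y = 2.0 + 1.5 * Z[:, 0] - 1.0 * Z[:, 1] + rng.normal(scale=1.0, size=n)
X = np.column_stack([np.ones(n), Z])
kept, beta, se = backward_eliminate(X, y, ["const", "z1", "z2", "z3", "z4"])
print(kept)
```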
A closely related Bayesian formulation is the spike and slab prior (Mitchell and
Beauchamp 1988). This is a mixture prior, the product of p independent two-component
Gaussian mixture priors, one component with mean 0 and a very small variance (the spike),
the other component with mean 0 and a very large variance (the slab). These are mixed in
proportions $p_j$ and $1 - p_j$ for covariate j, with the $p_j$ unknown with flat priors. The aim
of the mixture prior is to shrink the small MLEs towards zero, while leaving unaffected the
large MLEs; this is the same aim as backward elimination, though the shrinkage is generally
not to zero. We discuss this further in Chapter 16.

1 We do not discuss the forward selection and stepwise forward/backward methods.
2 Some Bayesians would prefer a model-averaged analysis over the well-supported models out of the $2^p$ possible models.

15.14.1.3 ANCOVA
In some experiments, one or more continuous covariates is part of the model, and the re-
gression of the response on these covariates could be interacted with the factorial effects.
The process of decomposition of the regression sum of squares in these models was called
the Analysis of Covariance, generally abbreviated to ANCOVA.
When all covariates are continuous, it is possible to interact covariates with each other
into a higher-dimensional surface model. In some engineering applications, the true response
is a complex multi-dimensional function, which is approximated by a response surface model
which includes powers and cross-products of the covariates.

15.15 Ridge regression, the Lasso and the “elastic net”


If the number of covariates is large, and some are highly intercorrelated, the variability (SEs
or posterior SDs) of the ML estimates or posterior means may “blow up” – the SEs may be
very large from the near-singularity of the SSP matrix.3 We saw a small version of this in
the linear and quadratic terms which were not centred. Centering eliminated this problem,
but it does not solve the problem if the covariates are continuous. An early treatment of
this problem was to add a penalty term to the log-likelihood which was a function of the
squared regression coefficients. The “objective function” of the parameters which was to be
maximised was this penalised log-likelihood:
$$\begin{aligned}
H(\theta, \sigma) &= \log L(\theta, \sigma) - \lambda \sum_{j=1}^p \theta_j^2 \\
&= -\left[(y - X\theta)'(y - X\theta)\right]/2\sigma^2 - n\log\sigma - \lambda\,\theta'\theta.
\end{aligned}$$


The object of using the squared regression coefficients is to shrink the MLEs towards zero.
The “ridge” – the main diagonal of the SSP matrix – is “loaded”, augmented by the positive
product 2λσ 2 , decreasing the high correlations in the augmented matrix. Here λ is the
penalty constant which determines the penalised MLE (PMLE) θ̃ of θ:
θ̃ = [X′ X + 2λσ 2 I]−1 X′ y
= [I + 2λσ 2 [X′ X]−1 ]−1 [X′ X]−1 X′ y
= [I + 2λσ 2 [X′ X]−1 ]−1 θ.
b

The loading determines the bias and variance of the PMLE:


$$\begin{aligned}
E[\tilde{\theta}] &= [I + 2\lambda\sigma^2 [X'X]^{-1}]^{-1} \theta, \\
\mathrm{Var}[\tilde{\theta}] &= \sigma^2\, [I + 2\lambda\sigma^2 [X'X]^{-1}]^{-1} [X'X]^{-1} [I + 2\lambda\sigma^2 [X'X]^{-1}]^{-1}.
\end{aligned}$$
3 Some regression programs test for and/or report a singularity, or avoid it by omitting the covariate which leads to the singularity.



This approach has several difficulties:


• the magnitude of the regression coefficients depends on the scale of their covariates;
if these are quite different there is no reason for the penalty to be the same for each
coefficient.
• the penalty constant cannot be estimated from the data: it has to be known.
The first difficulty can be addressed by standardising the covariates to have sample mean 0
and variance 1. The regression coefficients are then all on the same scale. The second difficulty
is insoluble: the choice of λ has to be decided by the experimenter or data analyst. For
frequentists, the issue is the “tradeoff” between the biases and the variances of the shrunken
regression coefficients. A large value of λ will shrink the estimates and their variances strongly
towards zero, but will lead to substantial biases in the estimates. A small value of λ will
lead to small biases and smaller reductions in variances. This may be assessed by “tuning”
or “eyeballing” λ: a “flexible” process like optimising over a grid, but the optimisation is to
the user’s eye, rather than of a numerical criterion.4 As with other “flexible” procedures, we
are left uncertain of the properties of the ridge estimates and their precisions.
These issues are less relevant for Bayesians, though prediction of new values is equally
affected for frequentists and Bayesians. A Bayesian interpretation makes clearer the meaning
of the penalty. It can be expressed in terms of the likelihood as a Gaussian prior distribution,
of independent parameters θj with mean 0 and variance 1/(2λ). As with any informative
prior, the prior parameter λ must be known, or else be given a non-informative hyper-prior.
A closely related alternative penalty approach is the Lasso (Tibshirani 1996), where
the penalty is the sum of the absolute values of the regression coefficients. The Bayesian
interpretation of this is that the prior distribution of the independent parameters θj is a
Laplace distribution, of the form

$$\pi(\theta_j) = \frac{\lambda}{2}\, e^{-\lambda|\theta_j|}.$$

A yet further generalisation is the “elastic net”, which has both penalties. From a Bayesian
point of view, the parameter prior in this case is the normalised product of the Gaussian and
the Laplace.
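A minimal sketch, not from the text, of the ridge PMLE in the notation above, θ̃ = [X′X + 2λσ²I]⁻¹X′y, on standardised covariates with a centred response; λ and σ² are treated as known, as the discussion requires, and the data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical data with two highly correlated covariates.
n = 50
z1 = rng.normal(size=n)
z2 = z1 + 0.05 * rng.normal(size=n)            # nearly collinear with z1
y = 1.0 + z1 + rng.normal(scale=0.5, size=n)

# Standardise the covariates (mean 0, variance 1) so a common penalty is sensible.
Z = np.column_stack([z1, z2])
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
yc = y - y.mean()                              # centre y; the intercept is not penalised

lam = 5.0                                      # penalty constant, chosen by the analyst
sigma2 = 0.25                                  # treated as known for the penalty term

theta_ml = np.linalg.solve(Z.T @ Z, Z.T @ yc)
theta_ridge = np.linalg.solve(Z.T @ Z + 2 * lam * sigma2 * np.eye(2), Z.T @ yc)

print(theta_ml)      # unstable: large, opposite-signed coefficients
print(theta_ridge)   # shrunk towards zero, much less variable
```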
We do not take the discussion of penalties further here, though the idea will reappear in
later models. Near-collinearity is easily handled by the omission of one or more of the highly
correlated covariates (as occurs in backward elimination), or by redefining them into a single
covariate. Actual collinearity is detected by most packages.

15.16 Modelling boy birthweights


We give as an example the modelling of birthweight for the 648 StatLab sample boys. The
research aim was to determine which (if any) personal factors in the child’s family might be
connected to the child’s birthweight. We examine six: parent covariates of age, smoking at
diagnosis, weight (recorded only for the mother) and family income. For previous smokers
who have quit, smoking is coded as −1. This is arbitrary, but has to be distinguishable from
0 or a positive number. In our analysis we will want to determine whether quit smokers need
to be distinguished from non-smokers in the relation between birthweight and smoking.
With a large number of covariates, graphing the response against all the covariates (or
all covariates against each other and the response) in the “trellis plot” becomes confusing,
4 A possible criterion is the mean square error of θ̃ – the sum of squared bias and variance.

and the graphs do not provide a reliable indication of the importance of the covariates in
the joint model.
We do not show the individual graphs for this analysis. We begin with a “full model” of
the six covariates mentioned, then eliminate the redundant covariates in order of smallest
t-value, while |t| < 2. This leaves two of the six covariates in the model: mothers’ weight
(mwt) and mothers’ smoking (msm).
The ML fitted values and (SEs) are

6.141 (0.251) − 0.0182 (0.0047) msm + 0.0123 (0.0019) mwt

The interpretation is straightforward. A ten-pound (4.53 kg) increase in mother’s weight is


associated with a 0.123 pound (56 gm) increase in mean boy’s birthweight. A ten cigarettes
per day increase in mother’s smoking is associated with a 0.182 pound (83 gm) decrease in
mean boy birthweight. This conclusion assumes that there are no interactions among the
covariates. We assess this by extending the model to include all two-covariate interactions,
and eliminate the redundant terms in the same way.
But first we consider the awkward smoking variable. The assessment of its effect is
complicated by its categorical and continuous structure. We resolve this by first examining
the variation of birthweight with smoking, for the smokers. We define a “smoker” indicator
which is 1 if the smoking variable is greater than 0, and is zero otherwise. Then we use
this indicator as a weight in the regression analysis with smoking and the other variables:
this omits from the analysis the non-smokers and quit smokers. We do this for both mother
smoking and father smoking. The variation in birthweight with cigarettes smoked per day
by both mother and father turns out to be unimportant (both are dropped with t values
around 1). So it is not the number of cigarettes smoked daily by either mother or father
which is important, but whether they smoke at all.
We can now reclassify smoking into three categories for both mother and father: smoker,
non-smoker and quit smoker, and assess the importance of this classification for birthweight
variation in the full data set, without the use of number of daily cigarettes smoked, accounting
for mother’s weight. We find an important difference in mean birthweight between smokers
and the other two categories. Fathers’ smoking was not important, and neither was the
difference between non-smoker and quit smokers. So we can finally categorise the smoking
variable as smoker or non-/quit smoker. We repeat the interaction model with this two-
category smoking covariate.
There are two possible strategies for variable elimination with interactions. One is to
regard all model terms as equally important in the elimination process, whether interaction
or main effect. This is a common approach with factorial designs, where the covariates are
all categorical factors and are independent. The other, which we will follow, is to retain all
main effects until all redundant interactions have been eliminated, and then eliminate any
redundant main effect covariates not involved in the retained interactions.
The final model is (in pounds of birthweight)

$$\hat{\mu} = 4.65\ (0.61) + 0.0116\ (0.0019)\,\mathrm{mwt} - 0.360\ (0.091)\,\mathrm{msm} + 0.0565\ (0.0199)\,\mathrm{mage} + 0.0186\ (0.0074)\,\mathrm{inc} - 0.000644\ (0.000249)\,\mathrm{mage{\cdot}inc}.$$

Smoking mothers have a mean boy birthweight of 0.36 pounds (164 grams) less than non-
smoking or quit mothers, with 95% confidence or credible interval [0.188,0.536] pounds,
or [85,243] grams. This is a birthweight reduction of about 5% relative to average boy
birthweight, and about 1/3 of the residual standard deviation from the model.
An increase of ten pounds in mother’s weight is associated with a mean increase of 0.116
pounds, or 53 grams in birthweight. Mothers’ age and family income interact on birthweight:
both are positively related to birthweight, but their effects are reduced at high values of
either. Figure 15.18 graphs boy birthweight against mother's weight, and Figure 15.19 identifies
mother's smoking (red circles: smoker, green circles: non-smoker).

FIGURE 15.18
Boy birthweights against mother weights

The fitted values from a model with just smoking and log mother's weight are shown
as two parallel red and green curves with a vertical spacing of 0.36 pounds. The additional
variation from the mother's age and family income is not shown here: it remains in the
variation about the fitted terms. The log transformation of mother's weight gives a slightly
better fit. Without the log transformation the fitted lines are very close to the curves.
It is very difficult to see, without the two curves, any difference between the red and
green circles: they appear to be randomly mixed. The difference in mean birthweight with
smoking is small, relative to the variability, but the sample size is so large that it is real. This
research study established that mother’s smoking affected negatively boy birthweights. It is
the mother’s smoking environment, not the number of cigarettes smoked, that is apparently
important for lower boy baby birthweights.
Could there be unobserved variables which explain some of the remaining variation? The
mixture model for overdispersion identifies only the largest boy birthweight as defining a
component; this has no effect on the regression model for the remaining 647 boys. The full
modelling of girl birthweights is left as an exercise for the student.

FIGURE 15.19
Boy birthweights against mother weights and fitted model. Red smoker, green non-smoker or quit

15.17 Modelling girl intelligence at age ten and family income

In §3.2 we gave several research questions about intelligence at age ten and family income.
Here we consider:

• How does the child’s intelligence, as measured by the Peabody and Raven tests, relate to
birthweight, family income and mother’s and father’s ages, occupation and education?
We examine this question for the StatLab girls.
The two intelligence tests from the 1960s assessed different aspects of intelligence. The
Peabody test (its full name was the Peabody Picture Vocabulary Test) consisted of 175
vocabulary items of generally increasing difficulty. The child listened to a word uttered by
the interviewer, and then selected one of four pictures that best described the word’s mean-
ing. All of the questions on the Raven test (its full name was the Raven Progressive Matrices
Test) consisted of visual geometric designs with a missing piece. The child was given six to
eight choices to pick from and fill in the missing piece. How the test results were coded into
a score is not given on the websites of these tests (it is given in the test manuals).
As these are tests of different aspects of the child’s intelligence, we would expect them to
be fairly highly correlated. Figure 15.20 graphs the Peabody score against the Raven score
for the girls. It is clear that the tests are fairly highly correlated. However, the variability of
the Peabody score is increasing with increasing Raven score – the data points “fan out” as
Raven score increases. If we were to regress Peabody score on Raven score we would need to
account for this pattern of variability. We may need this also for regressing Peabody score on
other possible covariates. In Chapter 17 we discuss how to use the double GLM to do this.
In this book we consider each intelligence measure separately. We first model the girls’
Raven scores on birthweight, family income at both birth and test, and mother and father
ages, occupation and education, the last two as categories. Many packages require factors
to have non-zero level codes, for which occupation and education need to be increased by 1.
We do not give complete details of the model reduction here.
Mother and father occupation, mother and father age and child birthweight were all irrel-
evant, as was family income at both birth and test. However, education level was important
for both mother (MED) and father (FED), showing a steady increase in Peabody mean score
with increasing education level.

FIGURE 15.20
Girl Peabody score against Raven score
By defining a new education variable rather than a factor, and then its square, cube and
fourth power, we can model the education relation as a polynomial, of order necessary to fit
the Peabody values. A linear term is sufficient to fit the relation. The ML fitted model is

$$\hat{\mu}_{\mathrm{Rave}} = 13.63\ (1.46) + 2.667\ (0.438)\,\mathrm{MEDL} + 2.196\ (0.383)\,\mathrm{FEDL}.$$

The mother and father linear education terms differ by one SE of either coefficient, so they
can be reduced to a common coefficient, by defining a new linear term MFEDL – "Total
MFL" – as the sum of the two. This gives the final model

$$\hat{\mu}_{\mathrm{Rave}} = 13.75\ (1.44) + 2.411\ (0.189)\,\mathrm{MFEDL}.$$

Each step up the Total MFL scale corresponds to a 2.4 step increase in mean Raven score.
The fitted model is shown with the Total MFL values in Figure 15.21. Are the model resid-
uals reasonably near Gaussian? The frequentist residuals are shown on the probit scale in
Figure 15.22. The Gaussian model looks acceptable.
We repeat the modelling with the girls’ Peabody scores on birthweight, family income
at both birth and test and mother and father ages, occupation and education. As with the
Raven score, mother and father occupation, family income at both birth and test and mother
age and child birthweight were irrelevant. However father age was important, and as before
education level was important for both mother and father, showing the same form of steady
increase in mean score with increasing education level. The composite Total MFL score again
fitted linearly. The final model was

$$\hat{\mu}_{\mathrm{Pea}} = 52.71\ (2.28) + 2.401\ (0.188)\,\mathrm{MFEDL} + 0.215\ (0.053)\,\mathrm{FAGE}.$$


FIGURE 15.21
Girl Raven score against Total MFL, with ML fitted model

FIGURE 15.22
Girl Raven score residuals

The increase in mean Peabody score with an increase of 1 in Total MFL was very close to
that for the Raven score. Although the scales are different, as can be seen from Figure 15.20,
the ranges (maximum-minimum) are nearly the same. A probit plot of the residuals (not
shown) had some curvature. A log transformation of the Peabody score gave more nearly
Gaussian residuals, but did not change the conclusions.
So the level of both parents’ education has a clear relation to the aspects of the girls’
intelligence measured by both the Raven and Peabody tests. Father’s age is important for
the Peabody test.

15.18 Modelling of the hostility data


In §2.12 we described the study by Bennett of hostility and affection in the husbands of
wives who were recovering from a suicide attempt through overdoses of tranquilisers, and in
those recovering from an acute abdominal (non-psychological) illness. When the recovery of
the wife was established, Bennett interviewed the husbands, and recorded their responses to
three questions about the state of the marriage. The responses were analysed for affection
and hostility content using the Gottschalk-Gleser scales (Gottschalk and Gleser 1969, from
now abbreviated to GG). The analysis scored the husbands’ responses for affect (emotional
response); this provides a scale score which can be taken as approximately Gaussian (it uses
a square-root transformation of the count of affective words). The psychiatric question was:
how are the levels of affect, on the several GG scales, related to the nature of the event (sui-
cide attempt or organic abdominal condition), and do the other personal factors influence
the affect level?
The need for a “control group” for the suicide group was clear: any crisis requiring emer-
gency hospitalisation would arouse very strong emotional responses in husbands. However,
the group which had experienced the acute abdominal condition might have experienced a
previous episode of this condition, and so would have previously experienced the emergency
hospital stresses. This might affect the level of both affection and negative affect. This could
have applied as well to the suicide attempt group. So Bennett classified both groups by an
additional covariate, whether there had been a previous occurrence of the event.
In the “control” group, another possibility was the nature of the marital history. Marital
difficulties requiring psychological or marriage counselling might have produced emotional
effects in the husband similar to the occurrence of a previous event. So Bennett divided the
control group into two sub-groups, which he called “true controls” – those with no history
of marital difficulties – and “false controls” – those with a history of marital difficulties.
A further personal factor which might have been important was the nationality of the
husband. This was Australian-born or British-born: there were no other origin husbands in
either group. We report here the analysis of the affection score. Of the 67 husbands, there
were 25 in the “overdose” group (OD), 13 in the “false control” group (FC), and 29 in
the “true control” group (TC). 37 husbands had had no previous occurrence (NPO) of the
event, 30 had had a previous occurrence (PO). 48 husbands were Australian-born (A), 19
were British-born (B). The sample mean affection across all 67 husbands was 2.20, and the
standard deviation was 0.60.
A preliminary tabulation of the affection responses is given in the classification Table 15.3,
in which the numbers of husbands, and their sample affection means and standard deviations,
are given within each cell. Cell numbers vary widely, from 1 to 13. Sample means vary from
1.38 to 3.20, and SDs from 0.23 to 0.70. Three cells with only one observation have no SD.
It is difficult to draw conclusions from such tables, except that the within-cell variability
seems fairly consistent. To draw stronger conclusions, we need to model the mean variation.
TABLE 15.3
Counts, means and SDs of affection
Nation A B
Occur NPO PO NPO PO
OD 12 8 4 1
mean 1.92 1.85 1.90 1.38
SD 0.52 0.46 0.45 –
FC 4 5 3 1
mean 2.46 1.91 2.84 2.37
SD 0.23 0.51 0.35 –
TC 13 6 1 9
mean 2.53 1.93 3.20 2.58
SD 0.52 0.70 – 0.63

15.18.1 Data structure


The “fixed” or “structural” model is a three-way interaction model with constant variance
and (tentatively) a Gaussian distribution for the within-cell variability. We begin by fitting
the full three-way interaction model. The output format depends on the (dummy variable)
parametrisation of the model adopted in the package. We give the parametrisation in which
the first level of factors is the reference level, and the dummy variables give the difference
of the higher level means from the first level. So the group has two dummies, called here
G2 and G3. The two-level factors N and PO have single dummies, called here N2 and PO.
Interactions are represented here by dot products of the dummy variables. The MLEs and
SEs with t values (omitted for the intercept) are shown in the output log, which also gives
the Gaussian frequentist deviance.

deviance = 15.316
estimate s.e. t parameter
1 1.999 0.3573 1
2 1.020 0.6796 1.50 G2
3 1.135 0.5096 2.23 G3
4 0.4263 0.8274 0.52 N2
5 -0.07625 0.2409 -0.32 PO
6 -0.1286 1.292 -0.10 G2.N2
7 0.3215 1.401 0.23 G3.N2
8 -0.4803 0.4282 -1.12 G2.PO
9 -0.5271 0.3548 -1.49 G3.PO
10 -0.4463 0.6373 -0.70 N2.PO
11 0.5294 0.9501 0.56 G2.N2.PO
12 0.3685 0.8851 0.42 G3.N2.PO
scale parameter 0.2785

The “scale parameter” is the “Restricted” ML (REML) estimate of σ 2 (the posterior mode
with the default prior). We proceed with backward elimination, but with a hierarchical order
of main effects and interactions, starting with the highest-order interactions. The t-values of
both the three-way interaction terms are small, so they may be omitted.

deviance = 15.411
estimate s.e. t parameter
1 2.056 0.3370 1
2 0.8676 0.6028 G2
3 1.058 0.4648 G3
4 0.07410 0.5245 N2
5 -0.1172 0.2260 PO
6 0.5458 0.4207 1.30 G2.N2
7 0.8206 0.4158 1.97 G3.N2
8 -0.3780 0.3753 -1.01 G2.PO
9 -0.4714 0.3201 -1.47 G3.PO
10 -0.1596 0.3691 -0.43 N2.PO
scale parameter 0.2704
We examine next only the two-way interactions. We omit N2.PO. To save text space we
show only the interaction next omitted:
deviance = 15.461
8 -0.3955 0.3705 -1.07 G2.PO
9 -0.4773 0.3176 -1.50 G3.PO
scale parameter 0.2666
The scale parameter decreases if the omitted variable has a t of less than 1, otherwise it
increases. After these two interactions are omitted, we see another possibility for simplifica-
tion:
deviance = 16.131
estimate s.e. t parameter
1 2.456 0.2286 1
2 0.3240 0.2093 1.55 G2
3 0.4136 0.1665 2.48 G3
4 -0.1745 0.2608 -0.67 N2
5 -0.4025 0.1407 -2.86 PO
6 0.6232 0.4056 1.54 G2.N2
7 0.6572 0.3470 1.89 G3.N2
scale parameter 0.2688
The G.N interaction terms are rather large. Rather than omitting them both, we first note
that they are very similar in value, as are the “main effects” G2 and G3. This suggests that
the true and false control group dichotomy may not be relevant in the interaction with N.
We examine this by defining a new dummy variable G23 which combines these categories,
corresponding to G2 or G3, that is, not G1. Replacing the two G.N interactions by a single
G23.N2 interaction gives:
deviance = 16.133
estimate s.e. t parameter
1 2.450 0.2180 1
2 0.9618 0.3015 3.19 G2
3 1.062 0.2812 3.78 G3
4 -0.1737 0.2585 -0.67 N2
5 -0.3986 0.1323 -3.01 PO
6 0.6451 0.3150 2.05 G23.N2
scale parameter 0.2645

The reparametrisation of the interaction changes the G2 and G3 estimates by 0.63, the vari-
ation suppressed by the single G23.N2 interaction. We now encounter a common situation:
the N2 “main effect” is small, but it has a substantial interaction with G23. This means
that the N “effect” (which is not averaged across the husband groups but is only that in
overdoses) is small in the overdose group but larger in the true and false control groups. We
can then set it to zero in the overdose group by eliminating the N2 dummy:
deviance = 16.252
estimate s.e. t parameter
1 2.403 0.2055 1
2 0.8199 0.2143 3.99 G2
3 0.9198 0.1841 5.00 G3
4 -0.3895 0.1310 -2.97 PO
5 0.4684 0.1729 2.71 G23.N2
scale parameter 0.2621

We see again that the main effects for G2 and G3 are very similar, and we equate them in
the same way, by replacing them by G23:
deviance = 16.341
estimate s.e. t parameter
1 2.399 0.2043 1
2 -0.3870 0.1302 -2.97 PO
3 0.4192 0.1402 2.99 G23
4 0.4712 0.1719 2.74 G23.N2
scale parameter 0.2594
At this point no further variable elimination can occur. The coefficients for G2 and G3
change with the reparametrisation of the dummy variables for the category levels.
The interpretation is straightforward:
• Husbands with a previous occurrence are much less affectionate than those without a
prior occurrence, consistently over the three groups.
• The true and false control husbands are substantially more affectionate than the overdose
husbands.
• The British husbands are substantially more affectionate than the Australian husbands
in the true and false control groups.
We use the terms “much less” and “substantially more” here without any natural scale of
affection. We can construct a scale in terms of standard deviation units differences (called
effect sizes in some fields) among the classifications. We first smooth the coefficients, by
noting that they are quite close in magnitude (in terms of their variabilities), and could be
equated in a simpler model. We define a variable lin, by
lin = −PO + G23 + G23*N2, and fit this variable instead of its three components:

deviance = 16.399
estimate s.e. t parameter
1 2.035 0.0684 1
2 0.4266 0.0776 5.50 LIN
scale parameter 0.2523
TABLE 15.4
Observed and fitted mean values of affection
Nation A B
Occur NPO PO NPO PO
OD 12 8 4 1
mean 1.92 1.85 1.90 1.38
fitted 2.0 1.5 2.0 1.5
FC 4 5 3 1
mean 2.46 1.91 2.84 2.37
fitted 2.5 2.0 3.0 2.5
TC 13 6 1 9
mean 2.53 1.93 3.20 2.58
fitted 2.5 2.0 3.0 2.5

We have been able to represent the mean affection variation by a single scale or index,
which has a set of steps as the covariates change. To make the scale values simpler we do
some small rounding of the estimates, within their variability, and of the scale parameter. The
residual standard deviation is $\sqrt{0.2523} = 0.502$, which we round to 0.5; we round the
lin coefficient by one standard error to 0.5, and the intercept by half a standard error to
2.0. We fit this constrained model lin2 = 2.0 + 0.5 lin as an offset – a variable with a fixed
regression coefficient of 1 – and without an intercept, and tabulate the fitted values by the
cross-classifying variables:
deviance = 16.627
-- No parameters to display
scale parameter 0.2482

We can summarise this table by noting that the fitted values define a four-point scale of
mean affection, with a spacing of one standard deviation (0.5) between the scale points; the
number of fathers at each scale point is given in parentheses:
1.5 : overdoses with a previous occurrence (9)
2.0 : overdoses with no previous occurrence (16);
Australian controls with a previous occurrence (11)
2.5 : Australian controls with no previous occurrence (17);
British controls with a previous occurrence (10)
3.0 : British controls with no previous occurrence (4)
There is no distinction between true and false controls in affection. The 1.5 gap between the
most and least affectionate scale groups is three standard deviations on the affection scale –
a very wide range.
We conclude by examining the Gaussian model assumption through the residuals from
the final model. Figures 15.23 and 15.24 give the cdf and probit plots of these residuals. The
probit plot shows some curvature.
A final assessment can be made by posterior Dirichlet weighting of the three-variable final
model, for the posterior distribution of the model parameters. The medians and 95% credible
intervals for the parameters from 10,000 draws are given in the left panel of Table 15.5, and
the MLEs and 95% confidence interval endpoints in the right panel.

FIGURE 15.23
Affection residuals cdf

FIGURE 15.24
Affection residuals probit
TABLE 15.5
Posterior medians and 95% credible intervals (left); MLEs and 95% confidence interval endpoints (right)
quantile    int      PO2      G23     G23N2   |  MLE/ends    int      PO2      G23     G23N2
2.5        1.806   −0.626    0.165   0.126    |  2.5        1.990   −0.647    0.139   0.127
50.0       2.014   −0.387    0.422   0.474    |  MLE        2.399   −0.387    0.419   0.471
97.5       2.205   −0.145    0.673   0.780    |  97.5       2.808   −0.127    0.699   0.815

Apart from the intercept, the MLEs correspond very closely to the medians. The larger
intercept shift is induced by the smaller shifts in the draws of the other parameters. The 95%
credible intervals for the regression coefficients are 5–10% shorter than the 95% confidence
intervals. There is no serious departure from Gaussianity in the parameter posteriors (not
shown).

15.18.1.1 Replication and variance heterogeneity


A common feature of designed experiments is replication – the repetition of the same treat-
ment on more than one experimental unit. The purpose of this is to have an independent
assessment of the precision of the experimental treatments – the experimental error – through
the variance of the responses under the same treatment. This is more informative than the
usual variance estimate based on the residual sum of squares from the model regression. The
latter combines the lack of fit variability of the mean responses to the regression model –
the variability among treatments – with the variability within treatments.
In observational studies, it sometimes happens, especially with cross-classifications, that
accidental replication occurs, as for example in the suicide attempts study above. The clas-
sification by the three factors showed that, in the cells with more than one observation, the
within-cell variances were quite similar, as the analysis required.

15.19 Principal component regression


A popular approach to the problem of high dimensional covariates in regression is to use the
principal components of the covariates to reduce their dimensionality. Principal components
were introduced into statistics by Harold Hotelling (1933). They are linear transformations
u = Hx of the covariates x, chosen to orthogonalise the covariance matrix V of the covari-
ates, so that
Cov(u) = HVH′ = diag [e1 , e2 , . . . , ep ],
where the ej are the variances of the uncorrelated principal variables. In formal matrix terms,
the rows of H are the eigenvectors of V and the ej are the corresponding eigenvalues. Without any
loss of generality we may assume (by relabelling of the components) that the eigenvalues are
ordered from largest e1 to smallest ep .
Then we may try to reduce the number of necessary covariates by regressing the re-
sponse y on the principal components through a model y | u ∼ N (γ ′ u, ϕ2 ), commonly
called a PCA (Principal Component Analysis). An immediate computational and inferential
benefit is that, since the principal components are uncorrelated, there is no need for model
reduction by backward elimination or any other method: it is immediately visible from the

regression coefficient estimates and their standard errors which components can be omitted:
their omission leaves the importance of the other principal variables unaffected.
It is tempting to assume that the importance of the components as predictors will cor-
respond to the size of their variances. There is no necessary connection between these two
properties; in particular it is not true that the first principal component is the best single
predictor of y. We know that the best single predictor of y is given by the MLE $\hat{\beta}'x$, not the
first principal variable $u_1$.
A particular difficulty with principal component regression is that the principal com-
ponent variables have no simple interpretation. In most research studies the covariates are
chosen for their relevance to variations in the response. So the weight (including zero) they
are given by the estimated regression coefficients has a direct interpretation in the research
process. Principal variables do not have such an interpretation, since they are already linear
functions of the covariates. An interpretation can be drawn by reverse-transforming the es-
timated regression function of principal variables back to a linear function of the covariates,
through γ b′u = γ
b ′ Hx. Whether this would be more precise than the backward elimination
process depends on the sample data. We do not pursue further in this book the use of
principal components in regression.
16
Incomplete data and their analysis with the EM
and DA algorithms

The generality of these “missing data” algorithms is remarkable: they have changed the face
of complex analysis since 1977. For historical reasons, we give a detailed exposition of the
EM algorithm with examples, before discussing the Bayesian extension through the Data
Augmentation algorithm and other MCMC procedures.
We first express the EM algorithm in its most general form, then give a number of
applications to problems we have already encountered in previous chapters. We do not give
a fully detailed theoretical treatment of EM: many such treatments have been published
since the fundamental paper of Dempster, Laird and Rubin (1977) and the books of Rubin
(1987) and Little and Rubin (1987). A comprehensive coverage of extensions of EM can be
found in McLachlan and Krishnan (1997).

16.1 The general incomplete data model


We model the variation in a response variable Y with observed data y through a distribution
g(y | θ) depending on a parameter θ. Through some circumstance (we will give many exam-
ples) the data that we observe are incomplete in some way. This incompleteness complicates
the frequentist and Bayesian analyses of the observed data y – the MLE $\hat{\theta}$ and its SE, and the
posterior distribution π(θ | y) of θ – through the observed data likelihood: L(θ) = g(y | θ).
If the data had been complete, these analyses would have been straightforward.
The EM algorithm relates the observed data analysis to the complete data analysis. For
this purpose we define the complete data y, Z as what would have been observed had the
incompleteness circumstance not occurred.1 We print Z in upper case, to indicate that it is
an unobserved random quantity. Of course this “definition” is an unverifiable assumption –
it is, in the language of Dempster et al, a counterfactual.
We write f (y, Z | θ) for the complete data density function. With this definition, we can
define the complete data likelihood as LZ (θ | y, Z) = f (y, Z | θ). We need to relate the
complete data to the observed data. The complete data are an expansion of the observed
data; the observed data are a contraction of the complete data. We need a general form for
this contraction. We write the relation between the distributions of Z and y, and therefore
the likelihoods, as Z
g(y | θ) = f (y, Z | θ) dZ,
X (y)

where g is the observed data likelihood given the observed data y, and f is the complete data
likelihood given the complete data y, Z. The integral transformation X (y) is quite general,
1 This definition is slightly different notationally from the standard definition, for greater simplicity.




representing any kind of constraint on the y, Z space which reduces the data information to
that in the observable data y. We assume this transformation does not depend on the model
parameters θ.
Then taking logs and differentiating the log-likelihood under the integral sign (X (y) does
not depend on θ), we have, for the observed data score vector (first derivative) sy (θ) and
Hessian matrix (second derivative) Hy (θ),

$$
\begin{aligned}
s_y(\theta) &= \frac{\partial \log g}{\partial \theta}
  = \frac{\dfrac{\partial}{\partial\theta}\displaystyle\int_{\mathcal{X}(y)} f(Z \mid \theta)\, dZ}{\displaystyle\int_{\mathcal{X}(y)} f(Z \mid \theta)\, dZ}
  = \frac{\displaystyle\int_{\mathcal{X}(y)} s_Z(\theta)\, f(Z \mid \theta)\, dZ}{\displaystyle\int_{\mathcal{X}(y)} f(Z \mid \theta)\, dZ} \\
&= \int_{\mathcal{X}(y)} s_Z(\theta)\, f(Z \mid \theta, y)\, dZ
  = E[s_Z(\theta) \mid y], \\[6pt]
H_y(\theta) &= \frac{\partial^2 \log g}{\partial \theta\, \partial \theta'}
  = \frac{\partial}{\partial \theta'} \int_{\mathcal{X}(y)} s_Z(\theta)\, f(Z \mid \theta, y)\, dZ \\
&= \int_{\mathcal{X}(y)} \left[ H_Z(\theta)\, f(Z \mid \theta, y) + s_Z(\theta)\, \frac{\partial f(Z \mid \theta, y)}{\partial \theta'} \right] dZ \\
&= \int_{\mathcal{X}(y)} \Big[ H_Z(\theta) + s_Z(\theta)\big\{ s_Z(\theta)' - E[s_Z(\theta)' \mid y] \big\} \Big]\, f(Z \mid \theta, y)\, dZ \\
&= E[H_Z(\theta) \mid y] + C[s_Z(\theta) \mid y].
\end{aligned}
$$

So
• the observed data score is equal to the conditional expectation of the complete data score,
and
• the observed data Hessian is equal to the conditional expectation of the complete data
Hessian plus the conditional covariance of the complete data score,
where the conditioning is with respect to the observed data. We can re-express the second
result in terms of the information matrix I = −H:
• the observed data information matrix Iy (θ) is equal to the conditional expectation of the
complete data information matrix IZ (θ) minus the conditional covariance of the complete
data score.
Remarkably, these results hold regardless of the form of incompleteness, so long as the afore-
mentioned conditions hold. So algorithms like Gauss-Newton can be implemented directly
from the complete data form of the analysis, provided the conditional distribution of the
complete data score and Hessian given the observed data can be evaluated. This does not,
however, guarantee convergence without step-length or other controls on these algorithms.
It is important that, for the precisions of the MLEs, it is insufficient to compute the
conditional expectation of the complete data information matrix: this overstates the precision
of the estimates. It is corrected by the subtraction of the conditional covariance of the
complete data score function. The latter is quite complex even in simple models, and we do
not give details in most of the applications.

A further concern in the use of standard errors from the information matrix is that in
incomplete data models the observed data likelihood can be far from Gaussian, even when
the complete data likelihood is Gaussian. For reliable measures of precision, we need the
Bayesian extension, given in a later section.

16.2 The EM algorithm


The expression of the observed data likelihood in terms of the expected complete data
likelihood does not tell us how to achieve ML computationally. The basis of the EM algorithm
is the alternation of E – Expectation – and M – maximisation – steps. We begin with some
initial estimate θ [0] of the parameter. In the E-step we take the conditional expectation of
the complete data log-likelihood, evaluated at θ [0] . The expectation replaces any function of
the unobserved Z by its conditional mean given the data y and θ [0] . Then in the M-step we
maximise this expected complete data log-likelihood with respect to the parameter θ; the
maximising value gives us the next estimate θ [1] of θ.
We continue to alternate successive E and M steps until the parameter estimate sequence
converges, to at least a local maximum of the likelihood. Remarkably, this occurs regardless
of the starting value, and without any control needed on the “step size” of the parameter
updates. However, there may be multiple maxima of the likelihood, and different starting
values may lead to convergence to different local maxima.
We first need to understand the types of missingness.

16.3 Missingness
We have assumed throughout the previous chapters that the data are always recorded cor-
rectly and are without missing values. The latter are endemic in all kinds of studies. In this
section we describe the types of missingness and the consequent methods for dealing with
them. In order to deal with missingness, we need to define the probability structure which
leads to values being missing.

• Missing completely at random (MCAR). The values are missing through a random pro-
cess, unrelated to the variables or any model parameters. An extreme example would
be a laboratory fire which destroys some specimens and loses data on other damaged
specimens. The damage is unrelated to the data values.
• Missing at random (MAR). The values are missing through a random process which may
depend on completely observed variables, but not on the values of the variables which are
incomplete. An example would be a study of randomly selected school boys and girls in
which some members of each sex are missing some variables from one of the interviewing
and data recording sessions because of minor illness.
• Missing non-randomly (MNR). The values are missing through a process which depends
explicitly on the values which would have been observed. An example is the destruction
of lifetime records of phones with low lifetimes. Another is a study of reported incomes
in which some sample members with high incomes decline to report them.

16.4 Lost data


In the phone data example of the 88 phones, suppose that in fact 90 phones were on test,
but the lifetimes of two were lost. How should this affect the analysis? It might seem obvious
that, since the data were lost, we can analyse only the observed data – nothing else needs
to be done. This would be true if the missingness process was at random – unrelated to the
lifetimes which would have been observed. That might seem obvious too. But suppose that
in fact their lifetimes had been observed, but then were destroyed because they were very
low, suggesting poor quality. Then the observed data are a biased sample of lifetimes: the
complete data including the two very low lifetimes would have had a reduced mean lifetime.
So the obvious analysis of the observed data is valid only under the assumption of the
missing observations missing at random. This assumption can be formalised mathematically,
but we make the point here as it occurs repeatedly in incomplete data applications.
One frequently suggested possibility of dealing with missing data of this kind is to impute
the missing data – fill them in somehow. If the values are missing at random, the EM
algorithm provides the answer. Write Z1 , Z2 for the two missing phone lifetime values, which
are assumed to have the same exponential lifetime distribution as the observed lifetimes.
Then the complete data likelihood is
$$
\begin{aligned}
CL(\lambda) &= \prod_{i=1}^{m} f(y_i \mid \lambda)^{n_i} \cdot \prod_{j=1}^{2} f(Z_j \mid \lambda)
 = \prod_{i=1}^{m} \left[ \frac{1}{\lambda}\exp\!\left(-\frac{y_i}{\lambda}\right) \right]^{n_i}
   \cdot \prod_{j=1}^{2} \frac{1}{\lambda}\exp\!\left(-\frac{Z_j}{\lambda}\right) \\
&= \frac{1}{\lambda^{n+2}} \exp\!\left( -\frac{\sum_{i=1}^{m} n_i y_i + \sum_{j=1}^{2} Z_j}{\lambda} \right)
 = \frac{1}{\lambda^{n+2}} \exp\!\left( -\frac{n\bar{y} + 2\bar{Z}}{\lambda} \right),
\end{aligned}
$$

where Z̄ = (Z1 + Z2 )/2. The complete data log-likelihood is

log CL(λ) = −(n + 2) log λ − (nȳ + 2Z̄)/λ

and its maximising value is


λ̃ = (nȳ + 2Z̄)/(n + 2).
In the E step we have to replace the term Z̄ by its conditional expectation given y and the
current value of λ. Since Z1 and Z2 are independent of y, and have the same exponential
distribution, we have immediately that

E[Zj | y, λ] = λ.

We need an initial value λ[0] of λ to assign this value to the Zj . Then the next value of λ is

λ[1] = (nȳ + 2λ[0] )/(n + 2).

An obvious choice for λ[0] is the MLE ȳ from the observed data – this should be close to the
MLE from the expected complete data. Then

λ[1] = (nȳ + 2ȳ)/(n + 2) = ȳ.



So the λ estimate at the next step is again ȳ – the algorithm has converged immediately!
The MLE is unaffected by the imputation of the two missing values. We might as well have
ignored them.2
Note that this process is not the same as enlarging the observed data to n+2 by replacing
each unobserved Zj by the sample mean. That process is called mean imputation. It gives
the correct MLE, but its precision is wrong: its variance is λ²/(n + 2) instead of λ²/n.
This is a general feature of single imputation methods – replacing unobserved or incomplete
observations by the sample mean or some other estimate. In the EM algorithm the conditional
expectation step does not provide “plug-in” estimates of the incomplete observations: it is
a device which leads to ML estimation. Another way of looking at the EM result is to see
that the two missing observations cannot contribute to the likelihood. So if the missingness
is at random, we can forget them and analyse the observed data.
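As a small numerical illustration of footnote 2, the recursion λ[r+1] = (nȳ + 2λ[r])/(n + 2) can be started away from ȳ; the sketch below (in Python, with n = 88 and ȳ = 210.8 from the phone example, and an arbitrary starting value) converges back to ȳ after a few iterations.

# Iterate the lost-data EM recursion from a deliberately poor starting value.
n, ybar = 88, 210.8
lam = 50.0                          # arbitrary starting value, far from ybar
for r in range(30):
    lam = (n * ybar + 2 * lam) / (n + 2)
print(round(lam, 1))                # converges to ybar = 210.8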

16.5 Censoring in the exponential distribution


In §10.5.5 we discussed the effect of censored observations on the likelihood and MLE in the
exponential distribution for the phone lifetimes. We now use the EM algorithm to find the
MLE of λ with one additional censored observation T = 300 hours.3
The observed data are the n observed failure times yi with frequencies ni , i = 1, . . . , m,
and the censored observation time T . The unobserved complete data are the yi and the
unobserved failure time Tf . The complete data likelihood is
$$
\begin{aligned}
CL(\lambda) &= \prod_{i=1}^{m} f(y_i \mid \lambda)^{n_i} \cdot f(T_f \mid \lambda)
 = \prod_{i=1}^{m} \left[ \frac{1}{\lambda}\exp\!\left(-\frac{y_i}{\lambda}\right) \right]^{n_i}
   \cdot \frac{1}{\lambda}\exp\!\left(-\frac{T_f}{\lambda}\right) \\
&= \frac{1}{\lambda^{n+1}} \exp\!\left( -\frac{\sum_{i=1}^{m} n_i y_i + T_f}{\lambda} \right)
 = \frac{1}{\lambda^{n+1}} \exp\!\left( -\frac{n\bar{y} + T_f}{\lambda} \right).
\end{aligned}
$$

We now need to take the conditional expectation of the complete data log-likelihood
log CL(λ) with respect to the unobserved Tf . This is

$$\log CL(\lambda) = -(n+1)\log\lambda - \frac{n\bar{y} + T_f}{\lambda}.$$
In the E step of the algorithm we replace the term Tf by its conditional expectation. The
exponential distribution has constant hazard: phones do not age in service; used is as good
as new! Knowing that the phone has survived T hours makes no difference to the expectation
of its future life, which remains at λ. So E[y | y > T ] = T + λ.
We adopt a notation for the successive replacement values T [r] for Tf and the successive
ML estimates λ[r] for λ. We begin with λ[0] = ȳ = 210.8. We replace Tf by T + λ[0] in the
2 If some other initial value of λ is used, the algorithm converges after some iterations. We illustrate this

in examples.
3 The single observation is not a restriction: any number of censored observations can be analysed in the

same way, as in the previous section.



expected complete data log-likelihood, E[log CL], and maximise this over λ to get the next
MLE:

$$E[\log CL(\lambda)] = -(n+1)\log\lambda - \frac{n\bar{y} + T + \lambda^{[0]}}{\lambda}.$$
Maximising this gives

$$\lambda^{[1]} = \frac{n\bar{y} + T + \lambda^{[0]}}{n+1} = \frac{n\bar{y} + T}{n+1} + \frac{\lambda^{[0]}}{n+1} = 214.17,$$

and in general, recursively,

$$\lambda^{[r+1]} = \frac{n\bar{y} + T}{n+1} + \frac{\lambda^{[r]}}{n+1} = 211.8 + \frac{\lambda^{[r]}}{89}.$$

So at convergence, with $\lambda^{[\infty]} = \hat{\lambda}$, we have

$$\hat{\lambda} = 211.8 + \frac{\hat{\lambda}}{89} = \frac{211.8}{1 - 1/89} = 214.2,$$

the same value as we found directly and analytically in §10.5.5.
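A minimal sketch of this iteration, using only the values quoted above (n = 88, ȳ = 210.8, T = 300); the use of Python and the number of iterations are illustrative.

# EM for the censored exponential example: E step replaces Tf by T + lambda,
# M step maximises the expected complete-data log-likelihood.
n, ybar, T = 88, 210.8, 300.0
lam = ybar                              # lambda[0] = ybar = 210.8
for r in range(20):
    Tf = T + lam                        # E step: E[lifetime | survived T hours] = T + lambda
    lam = (n * ybar + Tf) / (n + 1)     # M step
print(round(lam, 2))                    # first iterate 214.17, converging to 214.2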

16.6 Randomly missing Gaussian observations


We draw a sample x of size n from a population modelled by a Gaussian distribution
N (µ, σ 2 ). However, only the first m observations y are actually recorded, the last n − m
being lost by a random process.
Since the lost observations are missing at random, maximum likelihood can be achieved
directly from the observed random subsample of m observations, so
$$\hat{\mu} = \bar{y} = \sum_{i=1}^{m} y_i / m, \qquad \hat{\sigma}^2 = \sum_{i=1}^{m} (y_i - \bar{y})^2 / m.$$

Alternatively, we can use the EM algorithm.


The complete data log-likelihood (omitting constants) is
$$C\ell(\mu, \sigma) = -n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.$$

Given the current estimates µ[p] and σ [p] , the E-step replaces the log-likelihood for the
unobserved terms in x by its conditional expectation given the current parameter estimates.
So for the observed terms for i = 1, . . . , m, xi = yi , and for the unobserved missing terms for
i = m + 1, . . . , n, each (xi − µ)2 is replaced by

$$E[(x_i - \mu)^2 \mid y, \mu^{[p]}, \sigma^{[p]}] = \sigma^{[p]2}.$$



So the conditional expected complete data log-likelihood is


"m n
#
1 X X
E[Cℓ(µ, σ) | y, µ[p] , σ [p] ] = −n log σ − 2 (yi − µ)2 + σ [p]2
2σ i=1 i=m+1
"m #
1 X 2 [p]2
= −n log σ − 2 (yi − µ) + (n − m)σ ,
2σ i=1

so the MLEs at the next iteration are


"
m
#
X
[p+1] [p+1]2 2 [p]2
µ = ȳ, σ = (yi − ȳ) + (n − m)σ /n.
i=1

At convergence, the second equation gives


"m #
X
c2 = 2 c2 /n
σ (yi − ȳ) + (n − m)σ
i=1
m
X
= (yi − ȳ)2 /m.
i=1

It is easily, if tediously, verified that the conditional expectation of the complete data score, and the conditional expectation of the complete data Hessian plus the conditional covariance of the complete data score, are given by

$$
\begin{aligned}
E[s_x(\mu) \mid y] &= \frac{1}{\sigma^2}\sum_{i=1}^{m}(y_i-\mu) \\
E[s_x(\sigma) \mid y] &= -\frac{n}{\sigma} + \frac{1}{\sigma^3}\left[\sum_{i=1}^{m}(y_i-\mu)^2 + (n-m)\sigma^2\right] \\
E[H_x(\mu,\mu) \mid y] &= -\frac{n}{\sigma^2} \\
E[H_x(\mu,\sigma) \mid y] &= -\frac{2}{\sigma^3}\sum_{i=1}^{m}(y_i-\mu) \\
E[H_x(\sigma,\sigma) \mid y] &= \frac{n}{\sigma^2} - \frac{3}{\sigma^4}\left[\sum_{i=1}^{m}(y_i-\mu)^2 + (n-m)\sigma^2\right] \\
C[s_x(\mu,\mu) \mid y] &= \frac{n-m}{\sigma^2} \\
C[s_x(\mu,\sigma) \mid y] &= 0 \\
C[s_x(\sigma,\sigma) \mid y] &= \frac{2(n-m)}{\sigma^2},
\end{aligned}
$$
which give the observed data score and Hessian.
This application may seem trivial, but it is a widespread practice to use multiple impu-
tation for randomly missing response values in models of all kinds. This involves unnecessary
effort since the randomly missing values do not contribute to the observed data likelihood.
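A small simulation check of the σ² recursion above; the simulated data and settings are illustrative, and the point is only that the fixed point of the EM recursion is the variance of the observed values, as derived above.

# EM for randomly missing Gaussian observations: mu is fixed at ybar, and the
# sigma^2 recursion converges to the observed-sample variance.
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 60                                  # n planned observations, m recorded
y = rng.normal(loc=5.0, scale=2.0, size=m)      # the observed subsample

mu = y.mean()                                   # EM fixed point for mu
sigma2 = 1.0                                    # deliberately poor starting value
for p in range(200):
    # M step with the E-step replacement: each missing (x_i - mu)^2 contributes sigma^{[p]2}
    sigma2 = (np.sum((y - mu) ** 2) + (n - m) * sigma2) / n

print(round(sigma2, 4), round(np.mean((y - mu) ** 2), 4))   # the two agree at convergence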

16.7 Missing responses and/or covariates in simple and multiple regression
Large data sets generally have some proportion of missing values, in responses or covariates
or both. Missing responses are like lost data: if the missingness is random, we can ignore
these observations, which do not contribute to the likelihood. Missing covariates are much
more difficult to deal with. We begin with the simplest case.

16.7.1 Missing values in the single covariate in simple linear regression


We have n completely observed data (yi , xi ), i = 1, . . . , n and m additional observations
yn+1 , . . . , yn+m for which the covariate is missing at random. The data are modelled as
Y | x ∼ N (α + βx, σ 2 ). How can the incomplete data be included in the analysis?
Early approaches to missing data were generally of two simple kinds, the first based on
single imputation of each missing value, by some method of choosing a “reasonable” value.
The data set “completed by imputation” would then be analysed as though there had been
no missing data. This was always clearly unreasonable, since the precision of any statement
about model parameters must be overstated.
The second approach was to omit all the observations with incomplete data, and analyse
only the complete cases. Since the additional response observations yn+1 , . . . , yn+m have a
probability distribution, they contribute to the likelihood, and so omitting them loses their
contribution to inference about the parameters. What is their contribution? The data model
conditions on the covariate values xi : if these are not observed, how do we proceed?

16.7.2 Modelling the covariate distribution – Gaussian


A distribution for the covariate is not part of the model. We have to extend the model with
this distribution. The earliest approach to the Gaussian regression model was to assume
a Gaussian distribution N (µx , σx2 ) for X as well. The X model parameters ϕ = (µx , σx )
are nuisance parameters of no substantive interest, but they are needed to complete the
model specification, and we need to estimate them along with the parameters of interest
θ = (α, β, σ) of the Y | x distribution.
With this specification, the joint distribution of Y and X is bivariate Gaussian, with
variances σy|x = σ 2 , σxx = σx2 , and covariance σyx = βσx2 , and so Y has a marginal Gaussian
distribution:
Y ∼ N (α + βµx , σ 2 + β 2 σx2 ).
The full likelihood for the n complete and m incomplete observations is then
$$
\begin{aligned}
L(\theta, \phi) ={}& \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2\sigma^2}(y_i - \alpha - \beta x_i)^2\right] \\
&\cdot \prod_{i=n+1}^{n+m} \frac{1}{\sqrt{2\pi}\sqrt{\sigma^2 + \beta^2\sigma_x^2}} \exp\!\left[-\frac{(y_i - \alpha - \beta\mu_x)^2}{2(\sigma^2 + \beta^2\sigma_x^2)}\right] \\
&\cdot \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_x} \exp\!\left[-\frac{1}{2\sigma_x^2}(x_i - \mu_x)^2\right].
\end{aligned}
$$

The second term in the likelihood mixes up the parameters for the Y | x and the X dis-
tributions, and the direct ML estimation of all the parameters now has to follow a general
Newton-Raphson approach. The number of parameters has increased by two, and the infor-
mation matrix no longer separates into independent pieces for (α, β) and σ. So even with
this simple “conjugate” model for X, direct ML estimation for the full data set requires the
full NR approach.
The EM algorithm applies directly. The complete data likelihood can be written as
$$
\begin{aligned}
CL(\theta, \phi) ={}& \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2\sigma^2}(y_i - \alpha - \beta x_i)^2\right]
\cdot \prod_{i=n+1}^{n+m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2\sigma^2}(y_i - \alpha - \beta x_i^{*})^2\right] \\
&\cdot \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_x} \exp\!\left[-\frac{1}{2\sigma_x^2}(x_i - \mu_x)^2\right]
\cdot \prod_{i=n+1}^{n+m} \frac{1}{\sqrt{2\pi}\,\sigma_x} \exp\!\left[-\frac{1}{2\sigma_x^2}(x_i^{*} - \mu_x)^2\right],
\end{aligned}
$$

where x∗ denotes the missing values of x. The complete data log-likelihood is, ignoring known
constants,

$$
\begin{aligned}
\log CL(\theta, \phi) ={}& -(n+m)[\log\sigma + \log\sigma_x] \\
& - \frac{1}{2\sigma^2}\left[\sum_{i=1}^{n}(y_i - \alpha - \beta x_i)^2 + \sum_{i=n+1}^{n+m}(y_i - \alpha - \beta x_i^{*})^2\right] \\
& - \frac{1}{2\sigma_x^2}\left[\sum_{i=1}^{n}(x_i - \mu_x)^2 + \sum_{i=n+1}^{n+m}(x_i^{*} - \mu_x)^2\right].
\end{aligned}
$$

For the E step, we need the expectations of $x_i^{*}$ and $x_i^{*2}$ given the observed $y_i$ and the parameters $\theta, \phi$. Since $(y, x)$ are bivariate Gaussian, the conditional distribution of $x_i$ given $y_i$ is Gaussian:

$$N\!\left(\mu_x + \sigma_{xy}\sigma_{yy}^{-1}(y_i - \alpha - \beta\mu_x),\;\; \sigma_{xx} - \sigma_{xy}\sigma_{yy}^{-1}\sigma_{yx}\right),$$

which we write as $N(\tilde{x}_i, V_i)$. So $x_i^{*}$ is replaced by $\tilde{x}_i$, and $x_i^{*2}$ is replaced by $V_i + \tilde{x}_i^2$. A considerable simplification of notation is possible by defining a new "x" variable, which we will call $w$:

$$w_i = x_i \;\text{ for } i \le n, \qquad w_i = \tilde{x}_i \;\text{ for } i > n.$$
Initial estimates of the parameters can be those from the complete cases, which should be
near the final MLEs. Implementation of the algorithm can be accelerated by giving the ML
equations for the parameters with their Z functions replaced appropriately. So we replace
the complete data equations
" n n+m
#
∂ log CL 1 X X
= 2 (yi − α − βxi ) + (yi − α − βx∗i ) = 0
∂α σ i=1 i=n+1
" n n+m
#
∂ log CL 1 X X
∗ ∗
= 2 xi (yi − α − βxi ) + xi (yi − α − βxi ) = 0
∂β σ i=1 i=n+1
244 Introduction to Statistical Modelling and Inference
" n n+m
#
∂ log CL n+m 1 X 2
X
∗ 2
=− + 3 (yi − α − βxi ) + (yi − α − βxi ) = 0
∂σ σ σ i=1 i=n+1
" n n+m
#
∂ log CL 1 X X
= 2 (xi − µx ) + (x∗i − µx ) = 0
∂µx σx i=1 i=n+1
" n n+m
#
∂ log CL n+m 1 X X
=− (xi − µx )2 + (x∗i − µx )2 = 0
∂σx σx σx3 i=1 i=n+1

by the expected complete data equations with the new notation, and rearrange to define the
MLEs:

$$
\begin{aligned}
\hat{\mu}_x &= \bar{w}; \\
\hat{\sigma}_x^2 &= \left[\sum_{i=1}^{n+m}(w_i - \bar{w})^2 + \sum_{i=n+1}^{n+m} V_i\right]\Big/(n+m) = s_{ww} + \bar{V}; \\
\hat{\beta} &= \left[\sum_{i=1}^{n+m}(w_i - \bar{w})(y_i - \bar{y})\right]\Big/\left[\sum_{i=1}^{n+m}(w_i - \bar{w})^2 + \sum_{i=n+1}^{n+m} V_i\right] = s_{wy}/[s_{ww} + \bar{V}]; \\
\hat{\alpha} &= \bar{y} - \bar{w}\hat{\beta}; \\
\hat{\sigma}^2 &= \sum_{i=1}^{n+m}(y_i - \hat{\alpha} - \hat{\beta} w_i)^2\Big/(n+m) + \hat{\beta}^2 \sum_{i=n+1}^{n+m} V_i\Big/(n+m) = s_{y|w} + \hat{\beta}^2\bar{V},
\end{aligned}
$$

where
$$
\begin{aligned}
\bar{w} &= \sum_{i=1}^{n+m} w_i/(n+m); \qquad \bar{V} = \sum_{i=n+1}^{n+m} V_i/(n+m); \\
s_{ww} &= \sum_{i=1}^{n+m}(w_i - \bar{w})^2/(n+m); \\
s_{wy} &= \sum_{i=1}^{n+m}(w_i - \bar{w})(y_i - \bar{y})/(n+m); \\
s_{y|w} &= \sum_{i=1}^{n+m}(y_i - \hat{\alpha} - \hat{\beta} w_i)^2/(n+m).
\end{aligned}
$$
The correction term $\sum_{i=n+1}^{n+m} V_i$ has the effect of a diagonal loading on the SSP matrix
for the two models, and can be programmed in this way. We do not give further details.
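For readers who want to see the E and M steps of this section assembled, the following sketch applies them to simulated data; the true parameter values, the missingness rate and the use of Python are illustrative assumptions, not part of the text's analyses.

# EM for simple linear regression with the single covariate missing at random,
# with a Gaussian model for x, using the closed-form updates of this section.
import numpy as np

rng = np.random.default_rng(3)
N = 500
alpha, beta, sigma, mu_x, sigma_x = 1.0, 2.0, 1.0, 0.5, 1.5
x = rng.normal(mu_x, sigma_x, size=N)
y = alpha + beta * x + rng.normal(0.0, sigma, size=N)
miss = rng.random(N) < 0.3                       # covariate missing completely at random
obs = ~miss
m = miss.sum()

# initial estimates from the complete cases
slope, intercept = np.polyfit(x[obs], y[obs], 1)
a_hat, b_hat = intercept, slope
s2 = np.mean((y[obs] - a_hat - b_hat * x[obs]) ** 2)
mx, sx2 = x[obs].mean(), x[obs].var()

for it in range(200):
    # E step: conditional mean and variance of each missing x given its y
    s_yy = s2 + b_hat ** 2 * sx2
    s_xy = b_hat * sx2
    x_tilde = mx + (s_xy / s_yy) * (y[miss] - a_hat - b_hat * mx)
    V = sx2 - s_xy ** 2 / s_yy                   # the same for every missing case
    w = x.copy()
    w[miss] = x_tilde
    Vbar = m * V / N

    # M step: the closed-form updates given above
    wbar, ybar = w.mean(), y.mean()
    s_ww = np.mean((w - wbar) ** 2)
    s_wy = np.mean((w - wbar) * (y - ybar))
    mx = wbar
    sx2 = s_ww + Vbar
    b_hat = s_wy / (s_ww + Vbar)
    a_hat = ybar - b_hat * wbar
    s2 = np.mean((y - a_hat - b_hat * w) ** 2) + b_hat ** 2 * Vbar

print("alpha, beta, sigma^2:", round(a_hat, 3), round(b_hat, 3), round(s2, 3))
print("mu_x, sigma_x^2:     ", round(mx, 3), round(sx2, 3))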

16.7.3 Modelling the covariate distribution – multinomial


The assumption of a Gaussian distribution for the covariate will often be unreasonable (for
example, if it is binary!). There are many possibilities for non-Gaussian distributions. We
consider only the simplest general model – the multinomial. With the same Gaussian response

example in §16.7.2, we use the complete cases to provide an empirical distribution of the
missing covariate values. The d distinct observed values of the covariate in the complete
cases, which we denote by uj , have counts nj at the uj , with unknown covariate population
proportions pj . The EM algorithm applies as before, but the complete data likelihood is
different:
$$
\begin{aligned}
CL(\theta, \phi) ={}& \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2\sigma^2}(y_i - \alpha - \beta x_i)^2\right]
\cdot \prod_{i=n+1}^{n+m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2\sigma^2}(y_i - \alpha - \beta x_i^{*})^2\right] \\
&\cdot \left[\prod_{j=1}^{d} p_j^{\,n_j} \cdot \prod_{j=1}^{d}\prod_{i=n+1}^{n+m} p_j^{\,Z_{ij}}\right] \\
={}& \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2\sigma^2}(y_i - \alpha - \beta x_i)^2\right]
\cdot \prod_{i=n+1}^{n+m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2\sigma^2}(y_i - \alpha - \beta x_i^{*})^2\right]
\cdot \prod_{j=1}^{d} p_j^{\,n_j + Z_{+j}},
\end{aligned}
$$

where x∗ denotes a missing value of x, Zij = 1 if the missing x∗i is uj and is zero otherwise,
$Z_{+j} = \sum_{i=n+1}^{n+m} Z_{ij}$, and $\phi = (p_1, \dots, p_d)$. The missingness is now expressed in two ways: the
missed value x∗i and the indicator Zij which identifies which of the possible uj is the missed
value for x∗i .
We now need the log complete data likelihood, omitting known constants:
" n n+m
#
1 X X
log CL = − (n + m) log σ − 2 (yi − α − βxi )2 + (yi − α − βx∗i )2
2σ i=1 i=n+1
d
X
+ (nj + Z+j ) log pj .
j=1

The expected complete data log-likelihood requires the expectations of $x_i^{*}$, $x_i^{*2}$ and $Z_{+j}$. We cannot use the previous $w_i$, as $x$ does not have a Gaussian distribution. The conditional distribution of $x$ given $y$ is given by

$$
\begin{aligned}
f(x_i \mid y_i) &= f(y_i \mid x_i)\,f(x_i)/f(y_i) \\
\pi_{j \mid y_i} = \Pr[x_i = u_j \mid y_i] &= p_j\, f(y_i \mid x_i = u_j)\Big/\sum_{j} p_j\, f(y_i \mid x_i = u_j) \\
E[x_i \mid y_i] &= \sum_{j} u_j\, p_j\, f(y_i \mid x_i = u_j)\Big/\sum_{j} p_j\, f(y_i \mid x_i = u_j) \\
E[x_i^2 \mid y_i] &= \sum_{j} u_j^2\, p_j\, f(y_i \mid x_i = u_j)\Big/\sum_{j} p_j\, f(y_i \mid x_i = u_j).
\end{aligned}
$$

The interpretation of Bayes's theorem is that $x_i \mid y_i$ has a posterior multinomial distribution $\pi_{j \mid y_i}$ on the support points $u_j$, with the prior weights $p_j$ scaled by the Gaussian density of $y_i$ at each support point, and then normalised to sum to 1. The conditional mean and second moment of $x_i$ average the $u_j$ over the support points with the posterior weights.
moment of xi average the uj over the support points with the posterior weights.
We now define a new wi = xi if xi is observed, and wi = E[xi |yi ] if xi is missing,
and a new Vi = E[x2i |yi ] − wi2 . Then we can parallel the results for the regression model
parameters from the Gaussian x case. We still need the conditional distribution of the Z+j ,
the unobserved total number of incomplete x observations assigned to the support point uj .
The derivatives of the log-likelihood with respect to the θ parameters of the Gaussian
regression model are the same as those for the Gaussian x model. For the multinomial model
parameters, we need to impose the constraint that the pj sum to 1, so we need the derivative
of the log likelihood minus a Lagrange multiplier times the constraint:
$$
\begin{aligned}
\frac{\partial\left[\log CL - \lambda\left(\sum_j p_j - 1\right)\right]}{\partial p_j} &= (n_j + Z_{+j})/p_j - \lambda = 0
\;\;\Rightarrow\;\; \hat{p}_j = (n_j + Z_{+j})/\lambda \\
\frac{\partial\left[\log CL - \lambda\left(\sum_j p_j - 1\right)\right]}{\partial \lambda} &= -\left(\sum_j p_j - 1\right) = 0
\;\;\Rightarrow\;\; \hat{p}_j = (n_j + Z_{+j})\Big/\sum_j (n_j + Z_{+j}) = (n_j + Z_{+j})/(n+m).
\end{aligned}
$$

The unobserved total count on the uj is added to the observed count.


We need the conditional expectation of $Z_{ij} \mid y_i, \theta, \phi$. As for $x_i^{*}$, this is given by

$$
\begin{aligned}
\Pr[Z_{ij} = 1] &= p_j \\
\Pr[Z_{ij} = 1 \mid y_i] &= p_j\, f(y_i \mid x_i = u_j)\Big/\sum_{j=1}^{d} p_j\, f(y_i \mid x_i = u_j) \\
E[Z_{ij} \mid y_i] &= \Pr[Z_{ij} = 1 \mid y_i].
\end{aligned}
$$

16.7.4 Multiple covariates missing


With multiple incomplete covariates the ML analysis is further complicated. We would need
the conditional distributions of many subsets of the incomplete covariates given the complete
covariates and the current parameter estimates. With just three incomplete covariates in
different missingness patterns, we would need seven different conditional distributions. These
would be poorly estimated because the number of complete cases would be reduced.
If all the covariates were independent of each other, the problem would be greatly sim-
plified because each incomplete covariate could be treated separately, as though it were the
only incomplete covariate, and the analysis of the single incomplete covariate case could be
extended to this case.
This idea was used by Vermunt, Van Ginkel, Van der Ark and Sijtsma (2008), and Si and
Reiter (2013), through the latent class model: the covariates were modelled as conditionally
independent, given the latent classes to which they belonged. Vermunt et al gave the ML
analysis and Si and Reiter gave a Bayesian analysis using the Dirichlet process prior. These
analyses are beyond the level of this book.

16.8 Mixture distributions


Mixture distributions arise from probability models with a latent variable Z, and an observed
variable Y whose distribution depends on the latent variable. The latent variable may be
continuous or discrete; the resulting mixture distributions are continuous mixtures or finite
mixtures. The negative binomial distribution is a continuous mixture of a Poisson distribution
with a latent gamma distribution. Finite mixtures of Gaussians are so widely used that we
give a detailed discussion of them. The term mixture of experts model is sometimes used in
the machine learning community: each component of a mixture is thought of as an “expert”
in its field – the local data area in which it applies. A comprehensive general reference is
McLachlan and Peel (2000).
A Gaussian mixture of a Gaussian remains Gaussian. We cannot generalise a Gaussian
model with an omitted unobserved Gaussian variable: it only increases the variance of the
resulting Gaussian variable.

16.8.1 The two-component Gaussian mixture model


We apply the EM approach to the boy birthweight data. We noted in §11.1 that the heaviest
boy or boys seemed to belong to a weight distribution different from the others. We model the
full distribution as a two-component mixture of Gaussian distributions. The ML fitted model
for the single Gaussian is shown in Figure 16.1, together with the 95% credible region for the
true cdf (red curves), reproduced from Figure 11.2. Figure 16.2 shows the two-component
mixture with the credible region. The probability of the second component is 0.0016 – very

FIGURE 16.1
Boy birthweights and Gaussian cdf (solid line), with 95% credible bounds for the true cdf (red curves) on probit scale

FIGURE 16.2
Boy birthweights and two-component Gaussian mixture cdf (solid curve), with 95% credible bounds for the true cdf (red curves) on probit scale

close to 1/648. The heaviest boy defines the second component, with the next three heaviest
boys having decreasing probabilities of belonging to this component. The remaining boys
all belong to the first component. In both graphs the ML fitted curves fall just outside the
credible region at four and eight pounds weight.
The three-component mixture resolves the first component into two very close compo-
nents, with means 7.67 and 7.52, common standard deviation 1.095, and probabilities (to
2dp) 0.83 and 0.17. The mean separation of 0.15 is about 1/7th of a standard deviation,
barely detectable or observable.
We now give the details of the EM algorithm. The observed data y are a random sample
from the model
f (yi | θ) = pf1 (yi | µ1 , σ1 ) + (1 − p)f2 (yi | µ2 , σ2 ),

with

$$f_j(y_i \mid \mu_j, \sigma_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\left[-\frac{1}{2\sigma_j^2}(y_i - \mu_j)^2\right],$$

and θ = (µ1 , µ2 , σ1 , σ2 , p). The likelihood is an awkward product across the observations of
sums across the two components.
For the EM algorithm, we introduce a latent component identifier Zi which, if observed,
would convert the model to a two-group Gaussian model with group-specific means and
variances. We define Zi = 1 if observation i is from component 1, and Zi = 0 if observation i
is from component 2, and give Zi the Bernoulli distribution with parameter p. The complete
data are then y and Z, and the complete data likelihood, log-likelihood, score and Hessian

are (in an obvious shorthand)


$$
\begin{aligned}
L_Z(\theta) &= \prod_{i=1}^{n} \left[f_1(y_i; \mu_1, \sigma_1)\, p\right]^{Z_i} \left[f_2(y_i; \mu_2, \sigma_2)\,(1-p)\right]^{1-Z_i} \\
\ell_Z(\theta) &= \sum_{i=1}^{n} \left[Z_i \log f_1(y_i; \mu_1, \sigma_1) + (1-Z_i)\log f_2(y_i; \mu_2, \sigma_2) + Z_i \log p + (1-Z_i)\log(1-p)\right] \\
s_Z(\theta) &= \sum_{i=1}^{n} \left[Z_i\, s_{Z_i}(\mu_1, \sigma_1) + (1-Z_i)\, s_{Z_i}(\mu_2, \sigma_2) + \frac{Z_i - p}{p(1-p)}\right] \\
H_Z(\theta) &= \sum_{i=1}^{n} \left[Z_i\, H_{Z_i}(\mu_1, \sigma_1) + (1-Z_i)\, H_{Z_i}(\mu_2, \sigma_2) - Z_i/p^2 - (1-Z_i)/(1-p)^2\right].
\end{aligned}
$$

The common appearance of Zi in these terms greatly simplifies the observed score and Hes-
sian computations, which require the conditional expectations of Zi given the observed data.
From Bayes’s theorem,

$$
\begin{aligned}
E[Z_i \mid y, \theta] &= \Pr[Z_i = 1 \mid y, \theta]
= \frac{f_1(y_i \mid \mu_1, \sigma_1)\,\Pr[Z_i = 1]}{f_1(y_i \mid \mu_1, \sigma_1)\,\Pr[Z_i = 1] + f_2(y_i \mid \mu_2, \sigma_2)\,\Pr[Z_i = 0]} = Z_i^{*}, \\
s_y(\theta) &= E[s_Z(\theta) \mid y]
= \sum_{i=1}^{n} \left[Z_i^{*}\, s_{Z_i}(\mu_1, \sigma_1) + (1-Z_i^{*})\, s_{Z_i}(\mu_2, \sigma_2) + Z_i^{*}/p - (1-Z_i^{*})/(1-p)\right].
\end{aligned}
$$

Formally, we write the observed data score as


$$
\begin{aligned}
s_y(\mu_1) &= \frac{1}{\sigma_1^2}\sum_{i=1}^{n} Z_i^{*}(y_i - \mu_1) \\
s_y(\sigma_1) &= -\frac{\sum_{i=1}^{n} Z_i^{*}}{\sigma_1} + \frac{1}{\sigma_1^3}\sum_{i=1}^{n} Z_i^{*}(y_i - \mu_1)^2 \\
s_y(\mu_2) &= \frac{1}{\sigma_2^2}\sum_{i=1}^{n} (1-Z_i^{*})(y_i - \mu_2) \\
s_y(\sigma_2) &= -\frac{\sum_{i=1}^{n}(1-Z_i^{*})}{\sigma_2} + \frac{1}{\sigma_2^3}\sum_{i=1}^{n} (1-Z_i^{*})(y_i - \mu_2)^2 \\
s_y(p) &= \sum_{i=1}^{n} Z_i^{*}/p - \sum_{i=1}^{n}(1-Z_i^{*})/(1-p).
\end{aligned}
$$

The score equations are simple weighted versions of the complete data score equations, with
the component membership probabilities as weights. The EM algorithm alternates between
evaluating the parameter estimates given the weights, and evaluating the weights given the
parameter estimates. At convergence (asymptotic) variances of the estimates are given by

the observed data Hessian:

$$
\begin{aligned}
H_y(\theta) ={}& E[H_Z(\theta) \mid y] + C[s_Z(\theta) \mid y] \\
={}& \sum_{i=1}^{n}\left[Z_i^{*} H_{x_i}(\mu_1,\sigma_1) + (1-Z_i^{*}) H_{x_i}(\mu_2,\sigma_2) - Z_i^{*}/p^2 - (1-Z_i^{*})/(1-p)^2\right] \\
&+ E\!\left[\sum_{i=1}^{n}\sum_{j=1}^{n}[Z_i - Z_i^{*}][Z_j - Z_j^{*}]
\left(s_{x_i}(\mu_1,\sigma_1) - s_{x_i}(\mu_2,\sigma_2) + \frac{1}{p(1-p)}\right)
\left(s_{x_j}(\mu_1,\sigma_1) - s_{x_j}(\mu_2,\sigma_2) + \frac{1}{p(1-p)}\right)'\right] \\
={}& \sum_{i=1}^{n}\left[Z_i^{*} H_{x_i}(\mu_1,\sigma_1) + (1-Z_i^{*}) H_{x_i}(\mu_2,\sigma_2) - Z_i^{*}/p^2 - (1-Z_i^{*})/(1-p)^2\right] \\
&+ \sum_{i=1}^{n} Z_i^{*}[1 - Z_i^{*}]
\left(s_{x_i}(\mu_1,\sigma_1) - s_{x_i}(\mu_2,\sigma_2) + \frac{1}{p(1-p)}\right)
\left(s_{x_i}(\mu_1,\sigma_1) - s_{x_i}(\mu_2,\sigma_2) + \frac{1}{p(1-p)}\right)'.
\end{aligned}
$$

The formal version of the Hessian is formidable, and is not given here. The effort involved
in evaluating the covariance terms is considerable, even in this simple model, and increases
rapidly in more complex models. Several important features of the observed data Hessian
are visible:
• The off-diagonal expected Hessian terms in the means and standard deviations are zero
at the maximum likelihood estimates.

• The covariance terms are never zero, except by accident in the (µ1 , p) and (µ2 , p) terms.
• The covariance terms always have opposite signs to the expected Hessian terms: infor-
mation is reduced by the unobserved Zi , not surprisingly.
Consequently,

• the expected complete data Hessian, which is produced by many ML packages in the M
step, understates the standard errors of the ML estimates.
• All estimated parameters are positively correlated.
An important point which may not be visible is that the mixture likelihood can be far from
Gaussian in the model parameters. This explains the importance of EM in being able to reach
a maximum of the likelihood without step-length correction. However it also means that the
parameters generally have skewed, non-Gaussian distributions, and the standard errors from
the information matrix are unreliable as measures of precision. The Bayesian analysis is
essential for full information through the posterior distributions of the parameters.
A further important point is that mixture distributions, and other incomplete data mod-
els, may have multiple maxima in the likelihood, reachable from different initial estimates.
This means that an extensive search for local maxima with varying starting values has to
be a part of any mixture (or other latent variable) analysis. In mixture distributions, it is
common for a K-component model to have a local maximum at a K − 1 component model
local or global maximum.
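A minimal sketch of the alternation just described, for a two-component mixture fitted to simulated data; the data, starting values and number of iterations are illustrative assumptions, and in practice several starting values should be tried because of the possible multiple maxima noted above.

# EM for a two-component Gaussian mixture: alternate the membership weights Z*
# (E step) and the weighted parameter estimates (M step).
import numpy as np

rng = np.random.default_rng(7)
y = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(4.0, 1.5, 200)])
n = len(y)

def dnorm(y, mu, sd):
    return np.exp(-0.5 * ((y - mu) / sd) ** 2) / (np.sqrt(2 * np.pi) * sd)

mu1, mu2 = np.quantile(y, 0.25), np.quantile(y, 0.75)   # crude starting values
sd1 = sd2 = y.std()
p = 0.5

for it in range(500):
    # E step: posterior component membership probabilities
    f1 = p * dnorm(y, mu1, sd1)
    f2 = (1 - p) * dnorm(y, mu2, sd2)
    z = f1 / (f1 + f2)

    # M step: weighted means, SDs and mixing proportion
    mu1 = np.sum(z * y) / np.sum(z)
    mu2 = np.sum((1 - z) * y) / np.sum(1 - z)
    sd1 = np.sqrt(np.sum(z * (y - mu1) ** 2) / np.sum(z))
    sd2 = np.sqrt(np.sum((1 - z) * (y - mu2) ** 2) / np.sum(1 - z))
    p = z.mean()

loglik = np.sum(np.log(p * dnorm(y, mu1, sd1) + (1 - p) * dnorm(y, mu2, sd2)))
print(round(mu1, 2), round(mu2, 2), round(sd1, 2), round(sd2, 2), round(p, 2), round(-2 * loglik, 1))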

16.9 Bayesian analysis and the Data Augmentation algorithm


The EM algorithm was extended to a Bayesian version, initially by Tanner and Wong (1987),
which they called Data Augmentation. The same conditional distributions were involved, but
they were treated Bayesianly.
The E step, in which functions of the unobservable Z in the log-likelihood were replaced
by their conditional expectations, was replaced by an Augmentation step, in which M pos-
terior draws Z [m] of the unobserved Z were made from its conditional distribution given the
observed data and the current parameter draws θ [m] .
The M step of maximising the expected complete data log-likelihood was replaced by
posterior draws θ [m] from the conditional distribution of the parameters given the observed
data and the Z [m] draws.
The alternating steps of conditional posterior draws of parameters and unobservables
continue until the two sets of draws converge in distribution to their full marginal
distributions. Unlike the EM algorithm, convergence cannot be monitored by the increasing
sequence of maximised log-likelihood values: it has to be assessed by the stability of the
posterior distributions. Many checks on convergence are now available in the implementations
of this algorithm and others, for example in BUGS.
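One possible sketch of the two alternating steps for the two-component Gaussian mixture is given below, using simple conjugate priors (a Beta prior for p, Gaussian priors for the means, inverse-gamma priors for the variances). These prior choices, the simulated data and the chain length are illustrative assumptions; they are not the specifications used for the analyses later in this chapter, which use finite-range flat priors and BUGS-style implementations.

# Data Augmentation (Gibbs) sketch for a two-component Gaussian mixture,
# under illustrative conjugate priors.
import numpy as np

rng = np.random.default_rng(11)
y = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(4.0, 1.5, 200)])
n = len(y)

def dnorm(y, mu, sd):
    return np.exp(-0.5 * ((y - mu) / sd) ** 2) / (np.sqrt(2 * np.pi) * sd)

mu = np.array([y.min(), y.max()])
var = np.array([y.var(), y.var()])
p = 0.5
m0, v0 = y.mean(), 100.0 * y.var()        # vague Gaussian prior for each mean (assumption)
a0, b0 = 2.0, y.var()                     # weak inverse-gamma prior for each variance (assumption)
draws = []

for it in range(3000):
    # Augmentation step: draw the component indicators Z given y and the current parameters
    w1 = p * dnorm(y, mu[0], np.sqrt(var[0]))
    w2 = (1 - p) * dnorm(y, mu[1], np.sqrt(var[1]))
    z = rng.random(n) < w1 / (w1 + w2)

    # Posterior draws of the parameters given y and Z
    for k, idx in enumerate([z, ~z]):
        nk = idx.sum()
        yk = y[idx]
        vpost = 1.0 / (nk / var[k] + 1.0 / v0)
        mpost = vpost * (yk.sum() / var[k] + m0 / v0)
        mu[k] = rng.normal(mpost, np.sqrt(vpost))
        var[k] = 1.0 / rng.gamma(a0 + nk / 2.0, 1.0 / (b0 + 0.5 * np.sum((yk - mu[k]) ** 2)))
    p = rng.beta(1 + z.sum(), 1 + n - z.sum())
    if it >= 1000:                        # discard burn-in draws
        draws.append([mu[0], mu[1], np.sqrt(var[0]), np.sqrt(var[1]), p])

print(np.round(np.median(np.array(draws), axis=0), 2))

Label switching between the two components is possible in such draws, and a careful analysis would need to address it; the sketch is intended only to show the structure of the alternation.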
We illustrate the mixture of Gaussians model with an example which has become an
argued-over test-bed for Bayesian model comparison procedures.

16.9.1 The galaxy recession velocity study


The data in Table 16.1 are the recession velocities, in km/sec divided by 1,000, of 82 galaxies
from the observer, in six well-separated sections of the Corona Borealis region (Postman,
Huchra and Geller 1986). They are shown as a sideways stem-and-leaf plot – another form
of histogram where the individual observations are identified. The first column is the stem,
the values of velocity before the decimal point, and the rows after the decimal point are the
leaves, growing out of the stem.

TABLE 16.1
Galaxy recession velocities (km/sec × 10⁻³)
9. 17 35 48 56 78
10. 23 41

16. 08 17

18. 42 55 60 93
19. 05 07 33 34 34 44 47 53 54 55 66 85 86 86 91 92 97 99
20. 17 18 18 20 22 22 42 63 80 82 85 88 99
21. 14 49 70 81 92 96
22. 19 21 24 25 31 37 50 75 75 89 91
23. 21 24 26 48 54 54 67 71 71
24. 13 29 29 37 72 99
25. 63
26. 96 99

32. 07 79

34. 28

The question of astronomical interest was whether these velocities were clumped into
groups or clusters, or instead the velocity density increased initially and then gradually
tailed off. This had implications for theories of evolution of the universe. If the velocities
were clumped, the velocity distribution should be multi-modal.
The question of clumping has been investigated repeatedly by fitting mixtures of Gaus-
sian distributions to the velocity data; the number of mixture components necessary to
represent the data – or the number of modes – is the parameter of particular interest. We
do not consider modes here: these raise additional complications, as two poorly separated
components may have one or two modes.
Mixtures of Gaussians were fitted by ML to the velocity data. Table 16.2 gives the MLEs
of the means, proportions and standard deviations with up to six components, together with
the frequentist deviances. We give both the equal variance and unequal variance cases; the
equal variance model is much inferior to the unequal variance model, as will be clear from
the table.
A first question is: could the data come from a single Gaussian distribution? Figure 16.3
shows the empirical cdf (circles) and the 95% credible region for the true cdf (inside the red
curves). Without the ML fitted Gaussian cdf the conclusion is unclear. Figure 16.4 shows
the cdf on the probit scale with the credible region and the ML fitted Gaussian cdf (solid
curve). The answer is clearly NO.
It is clear from Figure 16.4 that the observations in the central group follow a nearly
linear structure, while the upper and lower groups need structures of their own. We expect
to need two additional components at least, to represent the upper and lower groups of
observations. To proceed further, we need to use EM. We do not give details of the ML
mixture analyses.
The probit plots for ML fitted 2, 3, 4 and 6 component mixtures are shown in Figures 16.5,
16.6, 16.7 and 16.8. For clarity we do not give the credible region bounds: the models beyond
two components give a very close fit to the empirical cdf.
It is clear from Figures 16.7 and 16.8 that increasing K from 4 to 6 gives only an
interpolation of the data. We do not use K > 6 as this leads to degenerate components
in which a single observation is split off to form a component. A one-observation component
is meaningless, except as an outlier.
The three-component model fits well, apart from the region around velocity 20, where
the slope seems to be wrong. The four-component model provides a separate slope in this
region for a closer fit. We do not show the five-component model, which improves very little
over the four-component model. The six-component model has two jumps in the plot to
accommodate the two pairs of observations above and below the central group. Each pair
defines an additional component, but not fully: each new component has three parameters,
but only two observations.
The major difficulty with mixture analysis is the determination of the number of compo-
nents K needed to represent the data. In frequentist analysis this is usually done by sequen-
tially increasing K until the maximised log-likelihood no longer increases by a “substantial”
amount.
One frequentist way of assessing the number of components is the bootstrap likelihood ratio
test, in which the likelihood ratio test statistic, for the null hypothesis of the smaller number
against the alternative of the larger number, is computed for a large number of bootstrap
samples, generated by resampling from the fitted null hypothesis model, and computing the
bootstrap distribution of the likelihood ratio test statistic.
The more common frequentist way to determine what is substantial is by adding a penalty
to the frequentist deviance, based on the number of model parameters, and choosing the
model with the smallest penalised deviance.
TABLE 16.2
Means, proportions and SDs, galaxy data
                 Equal variances            Unequal variances
K  k          mean    prop     sd         mean    prop     sd
--------------------------------------------------------------
1  1         20.83   1       4.54        20.83   1       4.54
   deviance          480.83                       480.83
2  1         21.49   0.533   4.49        21.35   0.740   1.88
   2         20.08   0.467               19.36   0.260   8.15
   deviance          480.83                       440.72
3  1         32.94   0.037   2.08        33.04   0.037   0.92
   2         21.40   0.877               21.40   0.878   2.20
   3          9.75   0.086                9.71   0.085   0.42
   deviance          425.36                       406.96
4  1         33.04   0.037   1.32        33.05   0.037   0.92
   2         23.50   0.352               21.94   0.665   2.27
   3         20.00   0.526               19.75   0.213   0.45
   4          9.71   0.085                9.71   0.085   0.42
   deviance          416.50                       395.43
5  1         33.04   0.037   1.07        33.05   0.036   0.92
   2         26.38   0.037               22.92   0.289   1.02
   3         23.04   0.366               21.85   0.245   3.05
   4         19.76   0.475               19.82   0.344   0.63
   5          9.71   0.085                9.71   0.085   0.42
   deviance          410.85                       392.27
6  1         33.04   0.037   0.81        33.04   0.037   0.92
   2         26.24   0.044               26.98   0.024   0.018
   3         23.05   0.357               22.93   0.424   1.20
   4         19.93   0.453               19.79   0.406   0.68
   5         16.14   0.025               16.13   0.024   0.043
   6          9.71   0.085                9.71   0.085   0.42
   deviance          394.58                       365.15

Several penalties, their number increasing over time, have been proposed and many of
these are in wide use, though there is no strong theoretical justification for their use, nor
agreement over which is most appropriate, or when. We illustrate this process with the two
oldest methods, the AIC and BIC. The AIC uses a penalty of 2p on the deviance while BIC
uses p log(n), where p = 3K − 1 is the number of model parameters, n is the sample size and
K is the number of components. The BIC is derived through an asymptotic approximation
to the integrated likelihood used in the Bayes factor. Many authors have noted that the BIC
penalty is greater than the AIC penalty, provided that log(n) > 2, or n > 7.4.
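The AIC and BIC columns of Table 16.3 below can be reproduced directly from the unequal-variance deviances of Table 16.2; a short check (in Python, used here only as a calculator) is:

# Penalised deviances for the galaxy mixtures: p = 3K - 1 parameters, n = 82.
import math

devs = {1: 480.83, 2: 440.72, 3: 406.96, 4: 395.43, 5: 392.27, 6: 365.15}
n = 82
for K, dev in devs.items():
    p = 3 * K - 1
    print(K, p, round(dev + 2 * p, 2), round(dev + p * math.log(n), 2))   # K, p, AIC, BIC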

FIGURE 16.3
Empirical cdf of galaxy data with 95% credible region (red)

FIGURE 16.4
Probit of galaxy data with ML fitted Gaussian (line) and 95% credible region (red)

FIGURE 16.5
Probit of galaxy data with ML fitted two-component Gaussian

FIGURE 16.6
Probit of galaxy data with ML fitted three-component Gaussian

FIGURE 16.7
Probit of galaxy data with ML fitted four-component Gaussian

FIGURE 16.8
Probit of galaxy data with ML fitted six-component Gaussian
TABLE 16.3
Model deviances and penalised deviances
K p dev AIC BIC
1 2 480.83 484.83 489.64
2 5 440.72 450.72 462.75
3 8 406.96 422.96 442.21
4 11 395.43 417.43 443.90
5 14 392.27 420.27 453.96
6 17 365.15 399.15 440.06

Table 16.3 shows the frequentist deviance and the AIC and BIC for each of the unequal
variance models.
AIC prefers four components to three, BIC prefers three to four, but both prefer 6 overall.
The preference for six comes from the small drop in deviance from four to five components,
and the large drop from five to six, in which the two pairs of observations at the ends of the
central group are fitted by one component each.
In §13.4 we discussed the principled Bayesian approach to model comparison. This ap-
proach requires only the random posterior draws of the model parameters from the Bayesian
analysis, which are substituted into the deviance function to give random draws of the model
deviance, for each model. The deviance distributions are examined for stochastic ordering,
and can be used to give medians (or any other quantile) of the posterior model probabilities.
A recent addition to the penalised deviances is the DIC, the deviance information criterion
(Spiegelhalter, Best, Carlin and van der Linde 2002). This does not use the frequentist
deviance, but the random posterior draws of the Bayesian deviance. These are averaged,
to give the (simulated) mean deviance which is then penalised by an effective number of
parameters, which also has to be computed from the deviance draws.
For the DA analysis, we need priors for the component means, SDs and proportions.
Computationally these need to be proper: improper priors on (0, ∞) or (−∞, ∞) do not
provide useful random draws. Finite range flat priors for µ and log σ or σ, and the minimally
informative Dirichlet prior with indices 1 are commonly used. Details are not given here: see
Celeux, Forbes, Robert and Titterington (2006) for these specifications for the galaxy data.
The following figures, over several pages, show the cdfs, on different horizontal scales,
of the 10,000 deviance draws for each number of components K from 1 to 7 (solid curves)
together with the cdfs of the asymptotic shifted χ2 distributions (dashed curves). The fre-
quentist deviance for each K is the circle near the graph origin. An important point in these
graphs is that the left end of the posterior cdf of the deviance does not “reach” the frequen-
tist deviance beyond K = 2: with these data and models the frequentist deviance could not
be randomly drawn for mixtures with K > 2 even with a sample of 10,000. The frequentist
deviance is not a representative value of the data support for the model. The last figure shows
all the deviance draw distributions on the same scale.
The interpretation of these graphs is straightforward.
For K = 1 the asymptotic and the deviance draw distributions are indistinguishable: for
n = 82 the simulated Gaussian deviance cannot be differentiated from the shifted $\chi^2_2$.
As K increases, even for K = 2, the deviance distributions depart increasingly from the
asymptotic distribution.
The deviance distribution for K = 1 is far to the right of the others – the one-component
mixture (single Gaussian distribution) is a very bad fit: its deviance distribution is the stochas-
tically largest.

FIGURE 16.9
Galaxy and asymptotic deviances, one-component Gaussian

FIGURE 16.10
Galaxy and asymptotic deviances, two-component Gaussian

FIGURE 16.11
Galaxy and asymptotic deviances, three-component Gaussian

FIGURE 16.12
Galaxy and asymptotic deviances, four-component Gaussian

FIGURE 16.13
Galaxy and asymptotic deviances, five-component Gaussian

FIGURE 16.14
Galaxy and asymptotic deviances, six-component Gaussian

FIGURE 16.15
Galaxy and asymptotic deviances, seven-component Gaussian

FIGURE 16.16
Galaxy deviances, one- to seven-component Gaussians (curves labelled, from left to right: 3, 4, 5, 6, 7, 2, 1)

The fit improves substantially from K = 1 to 2, and continues to improve from K = 2 to 3.
The distribution for K = 3 is the stochastically smallest.
As the number of components increases beyond three the deviance distributions move
steadily to the right, to larger deviance values (lower likelihoods).
They also become more diffuse, with decreasing slope, tipping over to the right.
With more parameters there is less information about each parameter, and therefore
about the likelihood and deviance, so their posterior distributions become more diffuse. The
improvement in fit can be assessed from the distributions of the deviance differences between
neighbouring numbers of components. Three components fit best, but the improvement over
four is not overwhelming. Figure 16.17 shows the cdf of 10,000 draws of the differences
between random draws of the unordered deviances from three and four components. About
64% of the deviance differences are negative (three components fits better) while the other
36% are positive (four components fits better). We cannot choose firmly between these
numbers of components, but we can rule out one or two, and more than four. More detail
on these comparisons can be found in Aitkin (2011).
An important point here for penalised maximised likelihood procedures like AIC and BIC
is that the frequentist deviance becomes decreasingly appropriate as a measure of goodness
of model fit as the models become increasingly complex. It is not just a matter of the shift in
the asymptotic distribution: the shape of the simulated distribution departs further from the
χ2 shape. This is equally true of the DIC, which loses completely the shape of the deviance
distribution by penalising its mean.
Aitkin, Vu and Francis (2015) described a simulation study with samples generated from
galaxy-based Gaussian mixture distributions with parameters given by the MLEs of the
galaxy data. The model comparison by posterior deviances was superior to that by the

FIGURE 16.17
Galaxy deviance differences, three to four components
TABLE 16.4
Percentages of correct model identification using DIC and
posterior deviance
n 82 164 328 656
K DIC Dev DIC Dev DIC Dev DIC Dev
1 100 100 100 99 100 100 100 100
2 85 98 100 100 100 100 100 97
3 51 99 98 99 100 99 100 99
4 3 9 11 67 30 99 17 99
5 0 18 0 9 0 37 1 89
6 2 9 0 10 56 100 78 100
7 0 1 0 15 4 3 4 32

DIC, especially for a large number of poorly separated components (Table 16.4). With the
galaxy sample size of 82, both methods had difficulty identifying more than three compo-
nents. Doubling repeatedly the galaxy sample size steadily improved the posterior deviance
comparisons, but did not change much the DIC comparisons.

16.9.2 The Dirichlet process prior


In Chapter 13 we described the DPP as a representation of a population distribution. This
was used by Lunn, Jackson, Best, Thomas and Spiegelhalter (2013, §11.8.1, p. 296) for an
analysis of the galaxy recession velocity data. Their focus was on the posterior shape of
the velocity distribution; the number of Gaussian mixture components used in the random
draws was a minor part of the analysis. Two graphs (their Figure 11.11) of the posterior
distribution of the number of mixture components used in the analysis were given based on
three choices of α: a fixed value of α = 1, giving a median number of components of 7, with
95% credible interval [4,10], a Gamma(2, 4) prior giving median 6 and credible interval [3,11],
and a uniform prior on (0.3,10) giving median 13 and credible interval [6,18]. In both the
cases α = 1 and uniform, the posterior was monotone decreasing from K = 1. The authors
noted that

there is sensitivity to the prior choice . . . however the qualitative shape of the
density is unchanged.

The number of component densities used in the random draws is not the number of compo-
nents which can be identified in the data but the number of component densities generated by
the Dirichlet process, which is unrelated to the data structure. The DPP analysis is unrelated
to the question of clumping.
17
Generalised linear models (GLMs)

17.1 The exponential family


The key to the successful use of regression models in many distributions beyond the Gaussian
was their expression through the exponential family of distributions, so called because their
log densities or mass functions could be expressed in a general form:

log f (y | θ, ϕ) = [yθ − b(θ)]/ϕ + c(y, ϕ),

so that

f (y | θ, ϕ) = exp{[yθ − b(θ)]/ϕ + c(y, ϕ)}
             = d(y, ϕ) · exp{[yθ − b(θ)]/ϕ}.

Here θ is the natural or canonical parameter of the distribution (often the parameter of
interest, or a function of it) and ϕ is a scale parameter, usually a nuisance parameter. The
regression model was for θ or a monotone function of it. The scale parameter, if there was
one, represented the additional random variability not modelled by the regression. If there
wasn’t one, the probability model and the regression had to account for all the variability
in the data.
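As a concrete illustration of this form (an added example, not one of the book's): for the Poisson distribution with mean µ,

log f (y | µ) = y log µ − µ − log y!,

so the natural parameter is θ = log µ, with b(θ) = exp(θ) = µ, ϕ = 1 and c(y, ϕ) = −log y!; there is no separate scale parameter.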

17.2 Maximum likelihood


Theoretical and applied statistics were both convulsed by the publication of the GLM paper
by Nelder and Wedderburn (1972). It showed that regression models in the one-parameter
exponential family distributions could be fitted by maximum likelihood using an iteratively
weighted least squares (IWLS or IRLS) algorithm. Special cases of this result were already
in use for the probit and logit regression models (Finney 1947 and Lord 1952 respectively).
The generality of the result meant that a concise algorithm could be constructed for all the
one-parameter exponential family distributions (binomial, Poisson, exponential), and could
be extended to the two-parameter distributions (normal and gamma, and later the inverse
Gaussian distribution) by a simple (non-maximum likelihood) estimation of the second scale
parameter.
Nelder and Wedderburn at Rothamsted began from early 1972 to develop code for the
algorithm; this work was extended by Baker and members of the GLIM Working Party of the
Royal Statistical Society, and implemented in the statistical package GLIM (Generalised Lin-
ear Interactive Modelling), first released in 1974. This was the first package implementation
of the IWLS algorithm, and it was designed from the beginning for interactive use, though
many UK university computing systems at this time did not support multiple interactive
users, who had to batch their jobs instead.
The user had to specify the error (probability) distribution, the link function relating the
mean of the distribution to the monotone function of it being modelled, and the regression


model itself. The program gave the ML estimates and standard errors based on the estimated
expected information (evaluated at the ML estimates), and a quantity called the deviance,
whose definition varied by the probability distribution. If the distribution was Gaussian, the
deviance was the residual sum of squares, while if it was not, the deviance was the likelihood
ratio test statistic for the fitted model compared to a “saturated” model with a parameter
MLE for each observation. This unusual combination was intended to give the deviance the
same interpretation, as a goodness of fit statistic for the model, whether Gaussian or not. A
detailed history of the development of GLIM was given in Aitkin (2018).
Books on GLMs, and the implementation of GLMs in other packages, developed slowly
at first. McCullagh and Nelder's 1983 book became the standard theoretical text, though it
made little mention of GLIM or any other software implementation. The 1989 GLIM book
by Aitkin, Anderson, Francis and Hinde was widely adopted as a teaching text for the use
of GLIM; the text was ported to R in a 2009 edition (Aitkin, Francis, Hinde and Darnell).
The development of the object-oriented statistical computing system S, and its open source
version R, greatly extended the range of applied statistical modelling, using modern robust
and simulation-based methods. The series of books by Venables and Ripley (fourth edition
2002 for example) showed the importance of this development.
GLIM’s early success showed clearly the importance of GLMs as a major tool of both
theoretical and applied statistics. For applied statistics, it changed the focus of analysis from
statistical methods to statistical models and modelling. Maximum likelihood for GLMs is now
widely implemented and documented, and details of the model fitting are not given in the
examples here. We summarise the GLM algorithm.

17.3 The GLM algorithm


The GLM algorithm is a special case of the general Newton-Raphson algorithm for solving
a system of non-linear equations. We want to find the MLE of β with data (y, X) from a model
f (yi | xi , ηi ) with ηi = β ′ xi . The likelihood is L(β), with log-likelihood ℓ(β). We want to
solve

∂ℓ/∂β = D′ ∂ℓ/∂η = 0,

where D is the matrix of derivatives with Dij = ∂ηi /∂βj .
The Newton-Raphson algorithm for solving this system iteratively is

βnew = β + H⁻¹ ∂ℓ/∂β,

where H is the matrix of negative second derivatives

H = (−∂²ℓ/∂βj ∂βk ) = D′ (−∂²ℓ/∂ηj ∂ηk ) D + (−∂ℓ/∂η) (∂²η/∂βj ∂βk ).

The right-hand side of the updating Newton-Raphson algorithm is evaluated at the current
parameter values. The Fisher scoring algorithm used in most GLM algorithms simplifies this
updating by replacing the second derivative matrix by its expected value (at the current
parameter values). This simplifies H to D’WD, where

Wjk = E(−∂²ℓ/∂ηj ∂ηk ),

so that

βnew = β + (D′WD)⁻¹ D′ ∂ℓ/∂η
     = (D′WD)⁻¹ D′W (Dβ + W⁻¹ ∂ℓ/∂η)
     = (D′WD)⁻¹ D′W z,

where z = Dβ + W⁻¹ ∂ℓ/∂η.
In applying this formulation to specific distributions in the exponential family, the GLM
packages hold a library of the necessary functions for the covered range of distributions. We
do not give details. The algorithm is in fact more general than the exponential family: all
that is required is that the distribution has a single index η = β ′ x.
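As a minimal illustration of the algorithm — a Python sketch, not the GLIM or any package implementation — the following code carries out Fisher scoring for a Bernoulli logistic regression, for which D = X, W = diag(µi (1 − µi )) and z = Xβ + W⁻¹(y − µ); the function and variable names are ours, and the sketch omits the safeguards a real implementation would need.

import numpy as np

def fisher_scoring_logistic(X, y, n_iter=25, tol=1e-10):
    """IWLS / Fisher scoring for a Bernoulli logistic GLM with canonical link.

    Each iteration is a weighted least-squares regression of the working
    variate z = X beta + (y - mu)/w on X, with weights w = mu*(1 - mu).
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = mu * (1.0 - mu)                    # Fisher weights W
        z = eta + (y - mu) / w                 # working variate
        XtW = X.T * w                          # D'W (here D = X)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# hypothetical usage, with X a design matrix (including a column of ones) and y a 0/1 response:
# beta_hat = fisher_scoring_logistic(X, y)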

17.4 Bayesian package development


Complex models do not generally have explicit posterior distributions for the model param-
eters. As we have seen already even with two-parameter distributions like the gamma, some
form of grid computation or posterior sampling is needed. This is a particular difficulty with
the GLM family outside the Gaussian. The BUGS package (Bayesian inference Using Gibbs
Sampling), Lunn, Jackson, Best, Thomas and Spiegelhalter (2013), was developed by a con-
sortium in the UK with extensive support from the UK Medical Research Council to provide
a very general Bayesian analysis. The aim of this system was to provide the full conditional
distributions for all unobserved quantities (parameters and latent variables) in a wide range
of models. We have seen in the finite mixture model that both the EM and DA algorithms
depend on the backward conditional distributions of the latent component membership vari-
ables Z, given the observed data and the parameters. These can be derived analytically in
mixtures of exponential family models, which is why this EM algorithm application advanced
so rapidly. In more complex discrete and continuous mixtures the conditional distributions
are more difficult, and are generally not analytic. BUGS provides these conditionals from its
library and gives the full posterior distributions of all unobservables, through their random
draws from the sequence of forward and backward conditionals. The user has to specify the
forward conditionals – the dependence of the observed data on the latent variables and the
distributions, including priors, of all unobservables. This provides also the parameter pos-
teriors for models without covariates or latent variables. Later Bayesian packages extended
the coverage in BUGS.
A feature of many Bayesian packages is the initial burn-in period needed to move from
the initial random draws from the prior distributions of the parameters to the final random

draws from the posteriors. This process can be accelerated by using information about the
likelihood from the ML analysis.

17.5 Bayesian analysis from ML


For non-Gaussian models, the frequentist reliance on MLEs and SEs depends on the validity
of the asymptotic Gaussian form for the likelihood holding in the data. We can investigate
this by using a simple approach, discussed in §6.12 in the case of the simple binomial model
on the logit scale, which can also provide the posterior distributions of the parameters without
the Gaussian assumption, or MCMC.
We use the individual MLEs and their SEs to set up a uniform grid in the region of the
MLEs. The extent of the grid needed depends on the extent of departures of the likelihood
from Gaussian, which we cannot evaluate without computing it at each grid-point of the
full parameter set. If we find that the likelihood is asymmetric about the MLEs, we have
evidence that the asymptotic Gaussian distribution is not holding, and the MLEs and SEs
may not give reliable inferences. For this an iterated set of grids may be required to establish
the limits of the region of appreciable likelihood – skewed or heavy-tailed parameters will
need a wider grid.
When we have established a uniform grid which covers the whole of the region of appre-
ciable likelihood, we sum these likelihoods and normalise by the sum to generate the joint
posterior distribution with the flat joint prior.
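A schematic sketch of this grid computation, assuming only a user-supplied function returning the log-likelihood at a parameter vector (the function name, grid size and half-width of five SEs are illustrative choices):

import numpy as np
from itertools import product

def grid_posterior(loglik, mle, se, half_width=5.0, n_points=50):
    """Flat-prior posterior on a uniform Cartesian grid centred at the MLEs.

    The MLEs and SEs are used only to place and scale the grid; the number of
    points grows as n_points**dim, so this is practical only in low dimensions.
    """
    axes = [np.linspace(m - half_width * s, m + half_width * s, n_points)
            for m, s in zip(mle, se)]
    theta = np.array(list(product(*axes)))         # all grid points, one per row
    logL = np.array([loglik(t) for t in theta])
    w = np.exp(logL - logL.max())                  # subtract the maximum first
    return theta, w / w.sum()                      # normalised posterior weights

If the returned weights are still appreciable at the edges of the grid, the grid needs to be extended or relocated, as noted above.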

17.6 Binary response models


17.6.1 Probit or logit analysis?
The probit and logit transformations are very similar. The standard Gaussian distribution
has variance 1, but the standard logistic distribution has variance π²/3. If we rescale the
logit transformation by dividing the scale by π/√3, the transformed cdfs are very close
(Figure 17.1). It is largely a matter of personal preference whether we use the logit or the
probit transformation; only the scales of the parameters are different.
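A quick numerical check of this rescaling (illustrative code, not from the text):

import numpy as np
from scipy.stats import norm

x = np.linspace(-3, 3, 13)
probit_cdf = norm.cdf(x)
# logistic cdf with its scale divided by pi/sqrt(3), so that it has variance 1
logit_cdf = 1.0 / (1.0 + np.exp(-x * np.pi / np.sqrt(3)))
print(np.max(np.abs(probit_cdf - logit_cdf)))   # maximum discrepancy is about 0.02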

17.6.2 Other binomial link functions and their origins


Most GLM packages include the complementary log-log link (CLL) function:

CLL(p) = log[− log(1 − p)],

which is asymmetric in p and (1 − p). All the binomial link functions can be derived from a
dichotomisation of a continuous distribution at a cut-point.
Suppose a continuous random variable Y has a density function f (y) and cdf F (y). At
a y-value y0 the distribution is cut into two pieces, with probability contents p = F (y0 ) and
1 − p = 1 − F (y0 ). The value of y is unobservable: we can observe only the binary variable Z,
that y > y0 (z = 1) or y ≤ y0 (z = 0). We model (either of) these probabilities as functions
of the covariates x through the link function. The choice of the link function determines the
distribution of the underlying Y.

[figure: probit against p]

FIGURE 17.1
Probit (solid) and logit (dashed) transformations on the probit scale

There is no need to believe in the existence of an underlying continuous variable to make
use of the link function. Some binary variables are inherently binary, and do not arise from a
dichotomised continuous scale. But this representation makes clear that a very wide range of
link functions for binary data could be developed from particular continuous distributions.
We simply choose a cdf F for which F (y) = p has a simple inverse function y = F −1 (p).
The logit link arises from the logistic distribution, for which the cdf is
F (y) = exp(y)/[1 + exp(y)] = p, with y = log[p/(1 − p)]. The probit link arises from the
Gaussian distribution: p = F (y) = Φ(y), y = Φ−1 (p). The CLL link comes from the extreme
value distribution: y = log[− log(1 − p)], p = 1 − exp(− exp(y)). A link not commonly used,
or even mentioned, is the log-log link (LL), with p = exp(− exp(y)), y = log[− log p]. This
distribution is the reversed extreme value distribution, with opposite skew to the extreme
value distribution. We give an example of these links with the beetle data in §17.6.6.
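These links and their inverses can be coded directly; a small sketch (the dictionary layout is ours; scipy's norm supplies Φ and Φ⁻¹ for the probit):

import numpy as np
from scipy.stats import norm

# forward links y = g(p) and inverse links p = g^{-1}(y)
links = {
    "logit":  (lambda p: np.log(p / (1 - p)),    lambda y: 1 / (1 + np.exp(-y))),
    "probit": (norm.ppf,                         norm.cdf),
    "CLL":    (lambda p: np.log(-np.log(1 - p)), lambda y: 1 - np.exp(-np.exp(y))),
    "LL":     (lambda p: np.log(-np.log(p)),     lambda y: np.exp(-np.exp(y))),
}

p = np.array([0.1, 0.5, 0.9])
for name, (g, ginv) in links.items():
    assert np.allclose(ginv(g(p)), p)   # each inverse link recovers p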

17.6.3 The Racine data


In §2.8, we gave an example of a binary response in a small-animal experiment, reproduced
in Table 17.1.
This unusually small study was carried out to establish whether worthwhile evidence
could be obtained from small animal study designs, allowing animal sacrifice to be reduced.
The social and scientific importance of this question led to a major collaborative study car-
ried out by CIBA-GEIGY. The original description of the study and analysis was reported at
length in Racine, Grieve, Flühler and Smith (1986), and it has been used as an introductory
example of Bayesian data analysis in the Bayesian Data Analysis book series (Gelman, Car-
lin, Stern, Dunson, Vehtari and Rubin 2014, §3.7). The aim of the analysis was to determine
the LD50 (50% lethal dose), the dose at which the probability of death is 0.5. We examine
also the LD90, the dose giving a 90% probability of death.
TABLE 17.1
Bioassay data from Racine et al
Dose xi (log gm/ml)   Number of animals, ni   Number of deaths, yi
−0.86 5 0
−0.30 5 1
−0.05 5 3
0.73 5 5

The large uncertainty in the relation between dose and death probability is clear if we
construct the 95% credible region for the relation from the observed data. The 95% central
credible region for the death probabilities pi is given directly from the 95% credible intervals,
from the posterior Beta(yi + 1, ni − yi + 1) distributions, with independent uniform priors
on each pi . The 95% credible interval bounds are shown as a red line-segment region with
the data (black circles) and the posterior median values of p (green circles) in Figure 17.2.
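The interval computation just described is immediate; a short sketch using scipy and the data of Table 17.1:

import numpy as np
from scipy.stats import beta

y = np.array([0, 1, 3, 5])          # deaths (Table 17.1)
n = np.array([5, 5, 5, 5])          # animals at each dose

# with independent uniform priors, p_i | y_i ~ Beta(y_i + 1, n_i - y_i + 1)
lower = beta.ppf(0.025, y + 1, n - y + 1)
median = beta.ppf(0.5,  y + 1, n - y + 1)
upper = beta.ppf(0.975, y + 1, n - y + 1)
print(np.round(np.column_stack([lower, median, upper]), 3))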
The line segments are in fact illusory as there are no data points between the design
points to provide credible intervals. However, it is clear that the true relation could be almost
anything – it need not even be monotone increasing. Any two-parameter model which goes
to 0 and 1 at the data margins will appear to be an adequate fit; in particular any link
function for the binomial will give an acceptable fit. (A cubic with any link will fit exactly.)
The logistic linear model is just one of these, but it is the most common. The probability
p(x) of death at dose level x (mgm per gram of bodyweight) is modelled by the logistic linear

[figure: proportion against log dose]

FIGURE 17.2
Racine 95% credible region (red segments) for the death probability function, with observed
proportions (black circles) and posterior medians (green circles)

regression model:

yi | xi , ni ∼ b(ni , pi ),
logit pi = log[pi /(1 − pi )] = α + βxi .

We note that the LD50, written here as x50 , is given by
logit 0.5 = 0 = α + βx50 , so x50 = −α/β,
and similarly for the LD90,
logit 0.9 = 2.197 = α + βx90 , so x90 = (2.197 − α)/β.

17.6.4 Maximum likelihood


The data and two fitted models are shown in Figure 17.3. The solid curve uses the posterior
medians of the parameters, the dashed curve uses the ML estimates. The ML fit is extremely
close – almost exact – with a frequentist deviance of 0.055.
The null model of constant response proportion has frequentist deviance 15.79, compared
with 0.055 for the logistic linear model. The parameter MLEs and (SE)s without centering
are α̂ = 0.85 (1.02) and β̂ = 7.75 (4.87), with correlation 0.71. As for the Gaussian linear

[figure: probability against x]

FIGURE 17.3
Racine data (circles) and fitted logistic models (solid curve posterior median, dashed curve
MLEs)

model, centering the covariate is helpful, reducing both the correlation of the MLEs and the
parameter grid necessary to cover the region of appreciable likelihood.
With centering to the mean covariate value −0.12, the parameter estimates become
α̂c = −0.08 (0.73) and β̂ = 7.75 (4.87), with correlation 0.20.¹ The slope estimate and SE are
unaffected by the centering, but the precision of the intercept is increased, with a smaller SE.
Part of the difficulty of the analysis without centering is the elongation of the near-elliptical
contours of the likelihood (shown in Gelman et al’s Figure 3.3a), which is greatly reduced
by centering.

17.6.5 Bayesian analysis


A particular difficulty with this data set was discussed in Racine et al: the small sample
size gives a diffuse likelihood, for which considerable search over the parameter space was
needed to locate the region of high likelihood. This should be straightforward, as Gelman et
al described (p. 76):

We will compute the joint posterior distribution at a grid of points (α, β). It is a
good idea to get a rough estimate of (α, β) so we know where to look. To obtain
the rough estimate, we use existing software to perform a logistic regression . . .
The [ML] estimate is (α̂, β̂) = (0.8, 7.7) with standard errors of 1.0 and 4.9 for
α and β respectively.
We are now ready to compute the posterior density at a grid of points (α, β).
After some experimentation, we use the range (α, β) ∈ [−5, 10] × [−10, 40] which
captures almost all of the mass of the posterior distribution. The resulting contour
plot appears in [their] Figure 3.3a.

Although it was not commented on, the uniform grid in both parameters is effectively
a discrete uniform prior for them. It might be thought that searching the parameter space
for the area of high likelihood would be “tuning” the prior to the data (the likelihood), but
this is not so. Conceptually, it is simply restricting the range of the parameters to the region
of high likelihood by having an initial near-infinite grid and then removing from this grid
regions of zero likelihood.
With such a small sample (and small numbers of animals at each dose level), we would
expect the posterior of β at least to be skewed. Neither Racine et al (1986) nor Gelman et
al (2014) showed the posterior distributions for α or β: they moved directly to the LD50.
Skew in the posterior for β is already clear, from the inconsistency between the two tests
for a zero value of β: the Wald test β̂/SE (1.59) and the likelihood ratio test (15.74), which
would be close to the squared Wald test value (2.53) in a Gaussian likelihood.
To assess this, we follow the Gelman et al analysis and generate 10,000 grid values of
α and β over the region [−5, 10] for α and [−10,40] for β. We substitute the grid values
into the binomial log-likelihood and exponentiate this. Direct exponentiation may fail for
very large negative values of the log-likelihood. To prevent this we carry out the exponen-
tiation in several steps. First we find the largest log-likelihood and subtract this from the
log-likelihood values. This guarantees negative values of the difference which exponentiate
without difficulty.
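A sketch of this grid computation and of the transformation to the LD50 and LD90 (the 100 × 100 grid gives the 10,000 grid values mentioned above; the random seed and the use of numpy are our illustrative choices):

import numpy as np

x = np.array([-0.86, -0.30, -0.05, 0.73])   # log dose (Table 17.1)
y = np.array([0, 1, 3, 5])                  # deaths
n = np.array([5, 5, 5, 5])                  # animals

def loglik(alpha, beta):
    eta = alpha + beta * x
    # binomial log-likelihood for the logistic model, up to a constant
    return np.sum(y * eta - n * np.log1p(np.exp(eta)))

a_grid = np.linspace(-5, 10, 100)
b_grid = np.linspace(-10, 40, 100)
logL = np.array([[loglik(a, b) for a in a_grid] for b in b_grid])
w = np.exp(logL - logL.max())               # subtract the maximum, then exponentiate
w /= w.sum()                                # flat-prior posterior weights

rng = np.random.default_rng(1)
idx = rng.choice(w.size, size=10_000, p=w.ravel())
B, A = np.divmod(idx, len(a_grid))          # row index = beta, column index = alpha
alpha_d, beta_d = a_grid[A], b_grid[B]

ld50 = -alpha_d / beta_d                    # posterior draws of the LD50
ld90 = (np.log(9) - alpha_d) / beta_d       # and of the LD90 (logit 0.9 = log 9 = 2.197)
print(np.percentile(ld50, [2.5, 50, 97.5]))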
Figure 17.4 shows the likelihood classified by the regression coefficients for intercept and
slope, where the likelihood at each point is coded in two ways: on a colour scale and a size
scale. The largest circles, coloured green, have the highest likelihoods (at least 80% of the
¹ The correlation is not zero, because the mean is unweighted, not weighted by the iterative weights used in the ML estimation.
[figure: b against a]

FIGURE 17.4
Racine data likelihood

maximum); the crimson have likelihoods 60–80% of the maximum, the blue 40–60%, the
yellow 20–40%, and the red less than 20% of the maximum. The size has a similar structure.
Blank areas have zero likelihoods to machine accuracy. The elliptic form of the likelihood is
asymmetric and very long-tailed in β. The marginal posterior densities and cdfs of α and β
are shown in Figures 17.5, 17.6, 17.7 and 17.8.
For the LD50 and LD90, we substitute the random parameter draws into the definitions
of these functions. Figure 17.9 gives the LD50 posterior cdf and Figure 17.10 gives the LD90
posterior cdf.
Despite the small samples at each design point, and the very small number of design
points, the study was able to draw fully informative (though not precise!) conclusions about
the parameters and the LD50 and LD90, without relying on asymptotic theory.

17.6.6 The beetle data


The effectiveness of carbon disulphide (CS2 ) as a poison for beetles was assessed in a study
at eight different levels of concentration (Bliss 1935). The data are available in R, and are
given in Table 17.2, which shows the number of beetles y dying out of n exposed at log dose
concentration d, which is in units of log10 CS2 mg per litre.
Figure 17.11 shows the proportions dying at each dose level (circles), together with the
95% credible region for the true proportions (red curves). The sample sizes are not large
enough to give precision at any dose level.
We now consider different links for the binomial model. The log-log (LL) link is not
generally given in GLM packages, as it can be fitted using the CLL link by reversing the
proportion being modelled. Figure 17.12 shows the ML fitted regressions on dose from the
LL (red), logit (black) and CLL (green) links. The LL gives a poor fit, and the CLL is clearly

[figure: density against a]

FIGURE 17.5
α density

[figure: cdf against a]

FIGURE 17.6
α cdf

[figure: density against b]

FIGURE 17.7
β density

[figure: cdf against b]

FIGURE 17.8
β cdf

[figure: cdf against LD50]

FIGURE 17.9
LD50 cdf

[figure: cdf against LD90]

FIGURE 17.10
LD90 cdf
TABLE 17.2
Beetles dying y, exposed n, at dose d
y 6 13 18 28 52 53 61 60
n 59 60 62 56 63 59 62 60
d 1.6907 1.7242 1.7552 1.7842 1.8113 1.8369 1.8610 1.8839

[figure: death proportion against dose]

FIGURE 17.11
Proportion dying (circles) and 95% credible region (red curves)

better than the logit: the frequentist deviances from the three fitted models are 27.92 (LL),
11.23 (logit) and 3.45 (CLL), each with 6 df. Although these models are not nested, it is
clear that the CLL curvature is the best: it gives a close fit at every dose level. The fit of
the logit model can be improved, by increasing its curvature with an added quadratic term
in dose. The frequentist deviance of the quadratic logit model is 3.20. Figure 17.13 shows
the ML fitted logit (black), CLL (green) and quadratic logit (red) models. The quadratic
model is also a close fit. Is the difference in deviances meaningful, with the different degrees
of freedom?
We can assess this approximately (and sufficiently) by graphing the two deviance dis-
tributions, assuming the asymptotic χ2 forms. Figure 17.14 shows the CLL (green) and
quadratic logit (red) deviance cdfs from 10,000 random draws. The two cdfs cross near the
5% point of the cdf: neither is uniformly better than the other. In about 5% of the deviance
draws, the quadratic logit draw is smaller than the CLL draw; the remaining 95% reverse
this difference.
The quadratic logit model (red) has a more diffuse cdf from its 3 df and this contributes
to the large preference for the CLL over the quadratic logit model. However, this is only a
preference: the deviance difference between the two is quite small at all percentiles. Formally,
if we pair the random draws and subtract one from the other, the 95% credible interval for
the true deviance difference is [−5.37, 7.75], with a median of 0.60. We cannot choose firmly
between these models.
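A sketch of these comparisons. The book fits the models by ML in a GLM package; here, purely for illustration, each binomial deviance is minimised directly with scipy, and the paired posterior deviance draws are constructed on the assumption that each posterior deviance is approximately the frequentist deviance plus a χ² variate with degrees of freedom equal to the number of fitted parameters:

import numpy as np
from scipy.optimize import minimize
from scipy.special import xlogy

# beetle mortality data (Table 17.2)
y = np.array([6, 13, 18, 28, 52, 53, 61, 60])
n = np.array([59, 60, 62, 56, 63, 59, 62, 60])
d = np.array([1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839])
dc = d - d.mean()                                  # centred dose

inv_link = {
    "LL":    lambda e: np.exp(-np.exp(e)),
    "logit": lambda e: 1 / (1 + np.exp(-e)),
    "CLL":   lambda e: 1 - np.exp(-np.exp(e)),
}

def deviance(p):
    """Binomial deviance against the saturated model (xlogy returns 0 when y or n-y is 0)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return 2 * np.sum(xlogy(y, y / (n * p)) + xlogy(n - y, (n - y) / (n * (1 - p))))

def fitted_deviance(link, X):
    g = inv_link[link]
    res = minimize(lambda b: deviance(g(X @ b)), np.zeros(X.shape[1]),
                   method="Nelder-Mead", options={"maxiter": 20000, "fatol": 1e-10})
    return res.fun

X_lin = np.column_stack([np.ones_like(dc), dc])
X_quad = np.column_stack([np.ones_like(dc), dc, dc ** 2])
for link in ("LL", "logit", "CLL"):
    print(link, round(fitted_deviance(link, X_lin), 2))              # cf. 27.92, 11.23, 3.45
print("quadratic logit", round(fitted_deviance("logit", X_quad), 2)) # cf. 3.20

# paired posterior deviance draws (an assumption: frequentist deviance + chi-squared(#parameters))
rng = np.random.default_rng(0)
diff = (3.20 + rng.chisquare(3, 10_000)) - (3.45 + rng.chisquare(2, 10_000))
print(np.percentile(diff, [2.5, 50, 97.5]))                          # cf. [-5.37, 0.60, 7.75]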

[figure: death proportion against dose]

FIGURE 17.12
LL (red), logit (black) and CLL (green) links

[figure: death proportion against dose]

FIGURE 17.13
Quadratic logit (red), logit (black) and CLL (green) links
[figure: cdf against deviance]

FIGURE 17.14
Quadratic logit (red) and CLL (green) deviance cdfs

17.7 The menarche data


The data set in Table 17.3 from Milicer and Szczotka (1966) arose from a study of the
age development of the menarche (beginning of menstruation) in Polish girls. It gives, for
age groups of varying width, the mean age of the group, the number of girls assessed and
the number who had reached menarche, for the 3,918 Polish girls in the 1965 study. The
analysis was later extended to covariates in Laska-Mierzejewska (1970).

TABLE 17.3
Number of girls assessed, and number reaching menarche (positive), by mean age
group 1 2 3 4 5 6 7 8 9 10
mean age 9.21 10.21 10.58 10.83 11.08 11.33 11.58 11.83 12.08 12.33
number 376 200 93 120 90 88 105 111 100 93
positive 0 0 0 2 2 5 10 17 16 29
group 11 12 13 14 15 16 17 18 19 20
mean age 12.58 12.83 13.08 13.33 13.58 13.83 14.08 14.33 14.58 14.83
number 100 108 99 106 105 117 98 97 120 102
positive 39 51 47 67 81 88 79 90 113 95
group 21 22 23 24 25
mean age 15.08 15.33 15.58 15.83 17.58
number 122 111 94 114 1,049
positive 117 107 92 112 1,049
[figure: proportion against age]

FIGURE 17.15
Proportion reaching menarche by age group (circles) and ML fitted logistic model (curve),
with 95% credible region (red segments)

The sample sizes at each mean age are large, and the number of age groups is also large:
the regression should be well-defined. Figure 17.15 shows the proportion data, the fitted
centered ML logistic regression, and the 95% credible region bounds (red curves).
The fitted regression falls entirely within the 95% credible region for the true function;
the bounds vary in width with the variation in sample sizes at each age. The fit appears
very good: the deviance is 26.70 with 23 df. Centering makes a dramatic difference to the
intercept and its correlation with the slope:
• Uncentred: −21.23 (0.770) + 1.632 (0.059) AGE, correlation −0.9966;

• Centred: 0.153 (0.063) + 1.632 (0.059) AGEC, correlation 0.0665.


The likelihood is shown in Figure 17.16, tabulated over the range of α (centred) and
β. The contours are almost circular, a consequence of the near-zero correlation and near-
equal SEs.
The coarse grid results from the restriction in the graph of the full grid to the area of
appreciable likelihood, magnifying the grid separation. Figures 17.17, 17.18 and 17.19 show
the cdfs of α, β and ζ = −α/β, the last of which, when added to the sample mean from
centering, gives the posterior of the median age at menarche.

[figure: b against a]

FIGURE 17.16
Likelihood in α and β

[figure: cdf against a]

FIGURE 17.17
Posterior cdf of α

[figure: cdf against b]

FIGURE 17.18
Posterior cdf of β

[figure: cdf against zeta]

FIGURE 17.19
Posterior cdf of ζ

17.7.1 Down’s syndrome analysis


17.7.1.1 BC analysis
The British Columbia (BC) birth population numbers are very large, and most of the Down’s
counts are also quite large. The ML analysis will be closely equivalent to the Bayesian
analysis. The graph of incidence p against age (Figure 17.20) shows a very rapid increase
with age over 35 – at age 45 the observed rate is 35 cases/1,000 – but there is very little
resolution of the low rates. Do the small numbers there have more variability, and perhaps
a constant rate? We transform to the logit scale (Figure 17.21).
On the logit scale curvature is immediately evident: a logistic linear model will be a bad
fit. We model it with the quadratic logit model: Figure 17.22 shows the ML fitted model.
The fit to the curvature appears good, but does it account for the variability in the data? In
Figure 17.23 we add the 95% red/blue credible region for the true proportions at each age,
on the probit scale, to compare with Figure 17.22. Only the vertical scales differ.
The small numbers of births at high ages have high variability. One red and one blue
bound point fall on the “wrong side” of the region, and three other blue bound points fall
right on the region edge. The quadratic model appears reasonable. Its frequentist deviance
is 48.9 with 27 df. Fitting the cubic term improves this only slightly, to 43.8, with two bound
points now on the wrong side.
However, inspection of the bound gaps suggests another possibility: of either a constant
or a decreasing linear relation from age 15 to around 30, followed by a linear or quadratic
increase from age 30. How do we set up such a segmented regression model? We return to
this question in Chapter 18.
17.7.1.2 Four regions analysis
We now extend the quadratic logistic model analysis to the other regions. The table of
counts and population sizes in the different regions has been set up by adding zero cases and

[figure: Downs rate p against age]

FIGURE 17.20
Down’s incidence

[figure: Downs rate logit against age]

FIGURE 17.21
Down’s incidence, logit scale

[figure: Downs rate logit against age]

FIGURE 17.22
Down’s incidence and quadratic logit model

[figure: probit against age]

FIGURE 17.23
Down’s incidence, quadratic model and 95% credible region

population counts as necessary, so that all regions have data of the same length. This assists
comparative analyses between regions: we can use interactions of region with the BC model
to establish the nature of differences in their regressions.
Figure 17.24 shows the four region observed proportions with Down’s syndrome on the
logit scale. The colour code is BC red, Massachusetts orange, New York blue and Sweden
green. They all show similar curvature, in different degrees: the quadratic is needed in all
regions.
To examine the differences in the regressions we need to “block” the four data sets r, n, a
of length 35 into a single data set of length 140, by concatenating them. We define the new
“long” variables lr, ln, la, qa with qa = la2 , and a four-level factor of region, all of length
140. Then we fit the “full” model of region × (la + qa) which gives different quadratic models
in each region. The ML fitted full model is shown in Figure 17.25. The blue proportions –
for New York – are notably lower than the others, which are quite similar.
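A sketch of this blocking step (the per-region arrays r1, n1, a1, … are assumed to hold the padded regional data described above; the data-frame and column names are illustrative):

import numpy as np
import pandas as pd

def block_regions(regions):
    """Concatenate per-region (cases r, births n, mean age a) arrays into one long
    data set and build the design matrix for the full model region x (la + qa)."""
    rows = [{"region": k, "lr": r[i], "ln": n[i], "la": a[i], "qa": a[i] ** 2}
            for k, (r, n, a) in regions.items() for i in range(len(a))]
    long = pd.DataFrame(rows)
    REG = pd.get_dummies(long["region"], prefix="REG", drop_first=True).astype(float)
    X = pd.concat([long[["la", "qa"]], REG,
                   REG.mul(long["la"], axis=0).add_suffix(".la"),
                   REG.mul(long["qa"], axis=0).add_suffix(".qa")], axis=1)
    X.insert(0, "const", 1.0)
    return long, X

# hypothetical call (region 3 is New York, as in the text):
# long, X = block_regions({1: (r1, n1, a1), 2: (r2, n2, a2), 3: (r3, n3, a3), 4: (r4, n4, a4)})
# X then has 12 columns, one separate quadratic in age per region; the binomial
# response is (long["lr"], long["ln"]).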
We examine the importance of the region × qa interactions, omitting in sequence those
with the smallest t value less than 2 in magnitude. For those omitted, we then examine
the region × la interactions in the same way. Finally we examine any region terms which
are not included in interactions, to see which can be omitted. The process terminates in a
five-parameter model, shown with MLEs and (SE)s:
−6.094 (0.029) − 0.190 (0.003) la + 0.00745 (0.00343) qa
−0.846 (0.058) REG 3 − 0.00119 (0.00055) REG 3 ∗ qa.
The only region term needed is for Region 3 – New York. This region has both a lower
initial level and a reduced curvature relative to the others, which all have the same model
(Figure 17.26), shown in green.
Why might New York be different from the others? Maternal age is often said to be
the only certain risk factor for Down’s syndrome. A search of reports on terminations of

[figure: logit p against age]

FIGURE 17.24
Four regions Down’s incidence

[figure: logit p against age]

FIGURE 17.25
Down’s incidence and full quadratic logit model by region

[figure: logit p against age]

FIGURE 17.26
Down’s incidence and final quadratic logit model by region

pregnancies establishes that there were variations in the incidence of termination of Down’s
syndrome foetuses by ethnic origin in the United States. Two examples are Bishop, Huether,
Torfs, Lorey and Deddens (1997) and Caruso, Westgate and Holmes (1998).
This may imply a selection bias in New York in the analysis of the observed proportions,
since these may have been reduced by terminations. The references estimate the incidence
that would have been observed in their populations if the terminations had not occurred.

17.7.2 The Finney vasoconstriction data


We use the “Bayes from ML” approach of §16.4 with a well-known data set (Finney 1947)
originally used to illustrate the probit regression model; it has been reanalysed many times
with the logistic regression model. See for example Aitkin et al (2009) for the logistic model
ML analysis in R, and Ibrahim and Laud (1991) for a Bayesian analysis. The data are given
in Table 17.4, and were obtained in a carefully controlled study of the effect of the rate
(rate) and volume (vol) of air inspired by human subjects on the occurrence (coded 1) or
non-occurrence (coded 0) of a transient vasoconstriction response r in the skin of the fingers.
Studies of this kind were discussed in Gilliatt (1948).
The question of interest is the relation between the vasoconstriction response and the rate
and volume variables. This relation is shown in Figure 17.27 where the occurrence response
is shown in blue and the non-occurrence response in red. It is clear that occurrence responses
(blue) are mainly in the upper right-hand corner, while non-responses are mainly in the lower
left-hand corner.
The relation is modelled by a logistic regression of response on log volume and log rate
(both after centering on the log scale):
θ = α + β log vol + γ log rate ,
TABLE 17.4
Vasoconstriction data
vol rate r vol rate r vol rate r vol rate r
3.7 0.825 1 0.9 0.45 0 0.85 1.415 1 1.9 0.95 1
3.5 1.09 1 0.8 0.57 0 1.7 1.06 0 1.6 0.4 0
1.25 2.5 1 0.55 2.75 0 1.8 1.8 1 2.7 0.75 1
0.75 1.5 1 0.6 3.0 0 0.4 2.0 0 2.35 0.3 0
0.8 3.2 1 1.4 2.33 1 0.95 1.36 0 1.1 1.83 0
0.7 3.5 1 0.75 3.75 1 1.35 1.35 0 1.1 2.2 1
0.6 0.75 0 2.3 1.64 1 1.5 1.36 0 1.2 2.0 1
1.1 1.7 0 3.2 1.6 1 1.6 1.78 1 0.8 3.33 1
0.9 0.75 0 0.6 1.5 0 0.95 1.9 0
1.8 1.5 1 0.75 1.9 0
0.95 1.9 0 1.3 1.625 1

[figure: log rate against log volume]

FIGURE 17.27
Vasoconstriction data

where θ = log[p/(1 − p)] is the logistic transformation. The ML fitted centered model with
(SE)s is
θ̂ = −0.345 (0.540) + 5.220 (1.852) log vol + 4.631 (1.783) log rate.
The parameter estimates are all correlated, despite the centering: corr(α̂, β̂) = −0.522,
corr(α̂, γ̂) = −0.373, corr(β̂, γ̂) = 0.804. ML fitted values for the model are shown in
Figure 17.28 as “level curves” (actually parallel lines) for 10% (red), 50% (black) and 90%
(blue) probability of vasoconstriction, computed from:
• 10%: −0.345 + 5.220 log vol + 4.631 log rate = log(1/9) = −2.197;
• 50%: −0.345 + 5.220 log vol + 4.631 log rate = log(1) = 0;
• 90%: −0.345 + 5.220 log vol + 4.631 log rate = log(9) = 2.197.

[figure: log rate against log volume]

FIGURE 17.28
Vasoconstriction data with level curves: 10% (red), 50% (black), 90% (blue)

The two covariates are almost equally important. The fitted surface within the data bound-
aries is essentially a triangular plain in the bottom left corner, and a triangular plateau in
the upper right corner, with a smooth logistic slope joining them.
Are the MLEs and SEs reliable? Are the posterior distributions of the parameters Gaussian?
For a grid in multiple dimensions, there is a conflict between the fineness of the grid for
each individual parameter and the total number of grid points; this becomes severe in
high-dimensional covariate models. Most Bayesian analysts dismiss high-dimensional grids.
However, we can avoid this problem by keeping the total number of points fixed at 10,000
whatever the dimension, and sampling the point locations uniformly at random over a range
determined by the parameter MLE ± 5 SEs from the asymptotic Gaussian distribution. We use
± 5 SEs to cover a major part of the high-likelihood region, while recognising that an extension
or relocation of the grid may be necessary. (A wider range would be needed if the SEs are
themselves unreliable, for example if they are complete-data standard errors from an
incomplete-data EM analysis.)
So we generate a 10,000-point 3D uniform random grid over the ten SE parameter inter-
vals: α ∈ −0.345 ± 5 ∗ 0.54; β ∈ 5.220 ± 5 ∗ 1.852; γ ∈ 4.631 ± 5 ∗ 1.783.
We evaluate the binomial log-likelihood at these points and exponentiate this, to give
the full joint posterior with this flat prior. Our particular interest is in the (marginal) joint
distribution of β and γ; the intercept α is of no contextual interest. Figure 17.29 shows
the likelihood classified by the regression coefficients for log volume and log rate, where
the likelihoods at all points are coded on two scales of likelihood value as in the previous
example:
• an interval scale of the size of the circle plotting character;
• a five-point colour scale from smallest red to largest green, the green corresponding to
likelihoods more than 80% of the maximum.
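A sketch of this randomly located grid, in contrast to the full Cartesian grid of §17.5 (the log-likelihood function is assumed supplied by the user; the seed and function name are illustrative):

import numpy as np

def random_grid_posterior(loglik, mle, se, k=5.0, n_points=10_000, seed=0):
    """n_points parameter vectors drawn uniformly on the box MLE +/- k*SE,
    with flat-prior posterior weights proportional to the likelihood."""
    rng = np.random.default_rng(seed)
    mle, se = np.asarray(mle, float), np.asarray(se, float)
    theta = rng.uniform(mle - k * se, mle + k * se, size=(n_points, mle.size))
    logL = np.array([loglik(t) for t in theta])
    w = np.exp(logL - logL.max())
    return theta, w / w.sum()

# for the vasoconstriction model: mle = [-0.345, 5.220, 4.631], se = [0.540, 1.852, 1.783],
# and loglik(t) is the Bernoulli log-likelihood at the centred covariates of Table 17.4.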
[figure: g against b]

FIGURE 17.29
Vasoconstriction, log rate γ vs log volume β

There is a peculiar pattern in the figure, with four features:


• The high likelihood values are clustered around the centre of the graph, but not uniformly,
because of the random generation of the grid locations.
• We are seeing a 2D projection of the 3D likelihood function. The few small red circles in
the central area show low likelihood values, where the unseen intercept parameters are
on the fringe of the joint likelihood.
• The elliptical scatter pattern is symmetrical in the two axes: the covariates are exchange-
able in their contributions to the response.
• This exchangeability, and the very close MLEs for the two covariates, suggest that they
can be replaced by a single covariate, their sum: total = log volume + log rate.
Other analysts have come to the same conclusion. We investigate this by transforming the
two covariates into their sum – the total – and their difference: log vol – log rate.
The ML fitted centred model has a frequentist deviance of 29.264:
θ̂ = −0.345 (0.540) + 4.926 (1.727) total + 0.2946 (0.570) difference.

The t statistic for the difference variable is 0.517: this variable can clearly be omitted. The
reduced centred ML model has deviance 29.535:
θ̂ = −0.407 (0.524) + 4.93 (1.72) total.
The correlation between α̂ and β̂ is −0.491.
We repeat the Bayesian analysis through ML for the one-variable model. Figure 17.30
shows the likelihood (centred) in α and β based on 10,000 points. The negative correlation is
clear. Figures 17.31 and 17.32 show the marginal cdfs of the two parameters. The posterior
modes are very close to the MLEs; α is nearly symmetric and β is heavily skewed. We do
not take this analysis further.

[figure: b against a]

FIGURE 17.30
Vasoconstriction, total

[figure: cdf against a]

FIGURE 17.31
Posterior cdf α
[figure: cdf against b]

FIGURE 17.32
Posterior cdf β

TABLE 17.5
Proportion in (sample size) tolerant of racial intermarriage
Region Ed 72 73 74
South 3 0.704 (27) 0.729 (48) 0.860 (57)
2 0.438 (137) 0.584 (178) 0.568 (176)
1 0.258 (155) 0.222 (162) 0.274 (146)
Central 3 0.783 (60) 0.873 (55) 0.862 (65)
2 0.699 (219) 0.701 (211) 0.732 (231)
1 0.374 (155) 0.389 (131) 0.422 (135)
North East 3 0.893 (56) 0.949 (59) 0.929 (70)
2 0.740 (196) 0.729 (193) 0.764 (240)
1 0.504 (403) 0.488 (84) 0.443 (79)
West 3 0.966 (29) 0.903 (31) 1.0 (27)
2 0.714 (91) 0.839 (87) 0.827 (81)
1 0.578 (45) 0.556 (45) 0.620 (50)

17.7.3 Cross-classifications with binary data


In §2.12 we gave a cross-classification table (Table 17.5) from a UK survey of rates of tolerance
for racial intermarriage. The study results are published as the sample proportions p tolerant
of racial intermarriage, out of the n (in parentheses) 16-year-olds interviewed. The proportion
is cross-classified by

• geographical region: 1 – South, 2 – Central, 3 – North East, 4 – West.


• level of education: 1 – did not complete high school, 2 – completed high school, 3 – more
than high school.
• year of survey: (19)72, 73, 74.
How do we summarise the variations in proportions with the three cross-classifying fac-
tors? It is clear from even a cursory look that the proportions tolerant generally increase
with time and with education level, and also with region downwards, from South to West,
but with some fluctuations.
We assume that the cell counts come from independent binomial distributions: there
is no heterogeneity within cell. The model structure is a three-factor cross-classification,
with a four-level Region and two three-level Education and Year factors. We give first the
well-established frequentist ML GLM analysis.
The cell numbers are quite large; the smallest is 27, with an observed proportion of 0.704
in one cell and one of 1.0 in another. The boundary value of the cell proportion p̂ gives a
likelihood in the cell parameter p which cannot be Gaussian, no matter how large the sample
size. For the proportion of 0.704 the likelihood in p is mildly left-skewed, and for the larger
sample sizes the skew is very small (for p̂ = 0.5 it is zero). The likelihood for the full data
set will be close to Gaussian in the logit model parameters. We give the frequentist analysis.
We could fit the full three-factor interaction model, as with the suicide attempt data.
However the classifying factors are of different kinds: education level and year are on scales,
but region, like nationality in the suicide data, is not. There is no a priori meaning to a scale
of regions: region is a stratifying categorical factor, defining different strata – levels – of the
population, geographical, sociological or other.
A detailed analysis begins with separate analyses of the four regions, to assess the possibly
different trends in time and education level in each region. The three levels of the education
and year factors have two degrees of freedom – they each require two dummy variables to
specify different response proportions at each level. By using the levels 1, 2 and 3 for each
factor as variables, we can replace the two dummy variables by the linear and quadratic
terms in the factor value EL, EQ, YL, YQ.
These “trend” and “curvature” variables can then be multiplied to give interactions
between time and educational level in a “full” model in each region. For this it is not
necessary to construct separate data sets for each region: the analyses can be done with a
weighting variable which gives weight to only those cells in the region of interest.
In region 1, the fitted “full” model gives the output log:
scaled deviance = 0
residual df = 0 from 9 observations

estimate s.e. t parameter


1 4.777 3.621 1
2 -7.182 4.261 -1.69 EL
3 -8.380 4.041 -2.07 YL
4 2.024 1.156 1.75 EQ
5 2.048 0.994 2.06 YQ
6 9.958 4.704 2.12 EL.YL
7 -2.492 1.266 -1.97 YL.EQ
8 -2.432 1.153 -2.11 EL.YQ
9 0.622 0.310 2.01 EQ.YQ
scale parameter 1.000

It appears that nearly all the terms are necessary. This is a consequence of very highly cor-
related variables, which can be seen from the correlation matrix of the parameter estimates.
correlations between parameter estimates

1 1.0000
2 -0.9699 1.0000
3 -0.9723 0.9374 1.0000
4 0.9150 -0.9821 -0.8802 1.0000
5 0.9333 -0.8975 -0.9891 0.8411 1.0000
6 0.9476 -0.9717 -0.9694 0.9503 0.9570 1.0000
7 -0.8967 0.9577 0.9135 -0.9713 -0.9006 -0.9818 1.0000
8 -0.9102 0.9312 0.9600 -0.9092 -0.9692 -0.9887 0.9696 1.0000
9 0.8612 -0.9179 -0.9050 0.9295 0.9128 0.9713 -0.9882 -0.9817 1.0000
Many correlations are very close to ±1: the largest in magnitude is −0.9891. This is because
the linear and quadratic terms are very highly correlated. When their interactions are
included in the model they are even more highly correlated.
This can be avoided by expressing them as orthogonal polynomials: a simple example
would be to centre the scales: −1,0,1 instead of 1,2,3. We repeat this analysis with the
centred linear and quadratic scales. The full model and its estimated parameter correlation
matrix follow.

estimate s.e. t parameter


1 0.339 0.152 1
2 1.122 0.188 5.97 EL
3 0.262 0.115 2.28 YL
4 -0.472 0.243 -1.95 EQ
5 -0.327 0.191 -1.72 YQ
6 0.217 0.156 1.39 EL.YL
7 -0.00378 0.194 -0.02 YL.EQ
8 0.0563 0.245 0.23 EL.YQ
9 0.622 0.310 2.01 EQ.YQ
scale parameter 1.000

correlations between parameter estimates

1 1.0000
2 -0.0000 1.0000
3 -0.0000 -0.0000 1.0000
4 -0.6291 0.3838 0.0000 1.0000
5 -0.7979 0.0000 -0.0742 0.5020 1.0000
6 -0.0000 -0.0000 -0.0000 0.0000 0.0000 1.0000
7 0.0000 0.0000 -0.5920 -0.0000 0.0439 0.5254 1.0000
8 0.0000 -0.7685 0.0000 -0.2949 -0.0000 -0.0513 -0.0432 1.0000
9 0.4906 -0.2992 0.0456 -0.7797 -0.6149 -0.0423 -0.0596 0.4404 1.0000

There are now no correlations near ±1; the largest is −0.7979. The highest-order interac-
tion EQ.YQ and its SE are unchanged. The others are changed considerably. The quadratic
term is now 1,0,1, uncorrelated with the linear term, instead of 1,4,9, which is correlated
0.9897 with the linear term. (Note that the sequence 1,5,9 correlates 1.0 with 1,2,3.)
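A quick numerical check of these correlations (an added illustration, not part of the original analysis):

import numpy as np

E = np.array([1.0, 2.0, 3.0])                        # raw factor scale
print(np.corrcoef(E, E ** 2)[0, 1])                  # 0.9897: raw linear vs quadratic
Ec = E - E.mean()                                    # centred scale -1, 0, 1
print(np.corrcoef(Ec, Ec ** 2)[0, 1])                # 0.0: centred linear vs quadratic
print(np.corrcoef(E, [1.0, 5.0, 9.0])[0, 1])         # 1.0: 1,5,9 is linear in 1,2,3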
Successive elimination of terms with the smallest t < 2 gives a final model.

17.7.3.1 Region 1
scaled deviance = 7.67
residual df = 6 from 9 observations

estimate s.e. t parameter


1 0.111 0.073 1
2 1.191 0.105 11.31 EL
3 0.189 0.083 2.29 YL
scale parameter 1.000

Proportion logits increase linearly with both time and (much more strongly) educational
level. We repeat the elimination analysis in turn for regions 2, 3 and 4. The final models
are given in the following, followed by a table of observed and fitted probabilities for each
region.

17.7.3.2 Region 2
scaled deviance = 3.52
residual df = 6 from 9 observations

estimate s.e. t parameter


1 0.901 0.086 1
2 1.041 0.113 9.21 EL
3 -0.291 0.142 -2.05 EQ
scale parameter 1.000

There is no year effect but a strong linear education increase with negative curvature.

17.7.3.3 Region 3
scaled deviance = 3.97
residual df = 7 from 9 observations

estimate s.e. t parameter


1 1.125 0.076 1
2 1.174 0.101 11.65 EL
scale parameter 1.000

A linear logit increase with education level.

17.7.3.4 Region 4
scaled deviance = 10.90
residual df = 7 from 9 observations

estimate s.e. t parameter


1 1.435 0.129 1
2 1.149 0.183 6.27 EL
scale parameter 1.000

A linear logit increase with education level.


TABLE 17.6
Observed and (fitted) proportions tolerant of racial intermarriage
Region Ed 72 73 74
South 3 0.704 (0.753) 0.729 (0.786) 0.860 (0.816)
2 0.438 (0.480) 0.584 (0.528) 0.568 (0.574)
1 0.258 (0.219) 0.222 (0.253) 0.274 (0.291)
Central 3 0.783 (0.839) 0.873 (0.839) 0.862 (0.839)
2 0.699 (0.711) 0.701 (0.711) 0.732 (0.711)
1 0.374 (0.394) 0.389 (0.394) 0.422 (0.394)
North East 3 0.893 (0.909) 0.949 (0.909) 0.929 (0.909)
2 0.740 (0.755) 0.729 (0.755) 0.764 (0.755)
1 0.504 (0.488) 0.488 (0.488) 0.443 (0.488)
West 3 0.966 (0.930) 0.903 (0.930) 1.0 (0.930)
2 0.714 (0.808) 0.839 (0.808) 0.827 (0.808)
1 0.578 (0.571) 0.556 (0.571) 0.620 (0.571)

17.7.3.5 Observed and (fitted) proportions, all regions


It is striking that in all regions except the South there was no systematic change in propor-
tions across time: the large differences are across the education levels. In the South region
proportion logits increased linearly with time. Because the logit transformation is non-linear,
the changes in fitted proportions are not linear, though they are nearly linear. Proportions
increased steadily from South to West in all cells.

17.8 Poisson regression – fish species frequency


We consider a data set from Barbour and Brown (1974), given in Table 2.8. They investigated
the environmental variables which influenced the number of fish species occurring in lakes.
For a sample of 70 lakes and inland seas from throughout the world, they reported that
surface area and latitude accounted for about one-third of the variability in fish species
diversity. The data set gives the species count y and the area x of each of the 70 lakes, but
not its latitude.
We show the number of species graphed against area in Figure 17.33. It is difficult to see
how a model can encompass all of the data. Log scales make a big difference; see Figure 17.34.
There appears to be a general upward trend, with a possible upturn at the extreme left of
the graph, and a great deal of variability, possibly increasing with log area.
The nature of the integer counts suggests that the Poisson or another discrete distribution
might be appropriate. For the Poisson the natural link function is the log, to ensure non-
negative fitted values from any regression model.
We fit by ML a quadratic log-linear Poisson model with mean defined by the linear
predictor η = α + β log(x) + γ[log(x)]² with µ = exp(η). The MLEs and (SE)s of the
quadratic model parameters are

α̂ = 2.683 (0.092), β̂ = −0.0303 (0.0251), γ̂ = 0.0116 (0.0016).

The deviance (1,488 with 67 df) shows immediately that the Poisson model is mis-specified.
For a well-fitting Poisson model the deviance should be of the same order as the degrees of

[figure: number of species against area]

FIGURE 17.33
Count of fish species with lake area

[figure: log count against log lake area]

FIGURE 17.34
Count of fish species with lake area, both on log scales

freedom. The residual deviance for the simpler linear model is even larger (1,535.7). The t
statistic for the quadratic term is 7.25, though this does not reduce the deviance substantially:
the Poisson quadratic model is clearly incorrect.
We define the precision and variability bounds from the fitted Poisson model by analogy
with the Gaussian case, and the properties of the GLM algorithm, in which the "working
variate"
Z = log(µ) + (y − µ)/µ
is iteratively regressed on the covariates with a weight variable; the variance of the working
variate is 1/µ. Because of the discreteness of the Poisson and its varying mass function shape
with the mean, we do not have precise credible interval probabilities for these bounds. We use
instead a figure of ±4.5 SDs to define an approximate 95% credible interval, with the choice
of 4.5 SDs based on the Chebyshev inequality: that for any random variable, the probability
of the random variable exceeding k standard deviations from its mean is not more than 1/k 2 .
For k = 4.5 this probability is not more than 1/20.25 = 0.05.
So an approximate 95% credible (precision) region for the mean function at x is given by

µ̂(x) ± 4.5 √Var[µ̂(x)],

and the approximate 95% variability (prediction) region at x is given by

µ̂(x) ± 4.5 √(1/µ̂(x)).
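Following these two formulas literally (a sketch; the fitted means and their standard errors at the plotted x values are assumed to come from the fitted Poisson quadratic model):

import numpy as np

def poisson_bounds(mu_hat, se_mu_hat, k=4.5):
    """Approximate 95% precision and prediction bounds, using k = 4.5 standard
    deviations as justified by the Chebyshev inequality (1/4.5**2 is about 0.05)."""
    mu_hat, se_mu_hat = np.asarray(mu_hat, float), np.asarray(se_mu_hat, float)
    precision = (mu_hat - k * se_mu_hat, mu_hat + k * se_mu_hat)
    prediction = (mu_hat - k * np.sqrt(1.0 / mu_hat), mu_hat + k * np.sqrt(1.0 / mu_hat))
    return precision, prediction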

We show in Figure 17.35 the number of species graphed against area, both on log scales,
together with the Poisson fitted quadratic model (solid curve), 4.5 SD bounds (green curves)

[figure: log count against log area]

FIGURE 17.35
Count of fish species with lake area, log scales and Poisson fitted quadratic model (black
curve) with 4.5 SD precision bounds (green curves) and 4.5 SD prediction bounds (red
curves)

for the 95% precision region and 4.5 SD bounds (red curves) for the 95% prediction region.
Of the 70 points in the graph, 21 (30%) fall outside the 4.5 SD bounds of the prediction
region. The Poisson model does not provide the appropriate variability representation for the
data. For this reason the direct Bayesian analysis from the Poisson ML will not be effective.
Where to go from here?

17.8.1 Gaussian approximation


Before GLMs were formalised, a common analysis of discrete count data was to use a Gaus-
sian representation for the log transformed counts: a lognormal distribution for the counts.
This provides a separate variability parameter, untied from the mean. The log transforma-
tion of the data corresponds to the log link for the Poisson: we are modelling on the same
log scale. We evaluate this analysis.
We begin by fitting a Gaussian linear model of log count regressed on log area:

log(count) ∼ N(α + β log(area), σ²).
The ML estimates and (SE)s are α̂ = 2.339 (0.199), β̂ = 0.1436 (0.0266). The residual sum of
squares (RSS) from the model is 36.925; the RSS from the null model with β = 0 is 52.713,
so R² = 0.2995, R = 0.547: 30% of the variability (measured by residual sum of squares) in
species “diversity” (measured by log count) is “explained” by log area.
Does the Gaussian model identify a real upturn on the extreme left? We assess this by
adding the quadratic term in log area to the model. The ML estimates and (SE)s from the
quadratic model are:

α̂ = 2.786 (0.288), β̂ = −0.0532 (0.0968), γ̂ = 0.0156 (0.0074),

where γ is the coefficient of the quadratic term. The t statistic for γ̂ is 2.11, a vast difference
from the Poisson value of 7.25.
We show in Figure 17.36 the number of species graphed against area, both on log scales,
together with the fitted Gaussian linear and quadratic models. Does this model represent
well the variability? In Figure 17.38 we add the 95% precision (green) and variability (pre-
dictive) regions for the linear model, and in Figure 17.39 the regions for the quadratic
model.
Is the variability about the regression(s) consistent with the Gaussian distribution? If
the model is reasonably close to correct, we expect about 5% of the observations to fall
outside the red bounds. Three observations (4%) fall outside the linear model bounds, two
(3%) outside the quadratic model bounds. The Gaussian model appears to be a good fit, but
the need for, and size of, the quadratic term may be sensitive to the Gaussian assumption
for the log counts. We investigate this with a probit plot of the residuals from the linear
model (Figure 17.37). Curvature in the plot is marked: the Gaussian assumption is not
supported. This is not because of the omitted quadratic term: the plot of the residuals from
the quadratic model (not shown) has the same curvature.

17.8.2 The Bayesian bootstrap and posterior weighting


The consistency of the Poisson and Gaussian ML estimates, and the failure of the Poisson
assumption, lead us to consider a general model using the linear/quadratic mean struc-
ture. We achieve this with the no-assumption multinomial model for the counts. We accept
the Gaussian definition of the population model parameters but use the Dirichlet posterior
FIGURE 17.36
Count of fish species with lake area, with Gaussian linear (solid line) and quadratic (dashed
curve) models

FIGURE 17.37
Residuals from Gaussian linear model, probit scale
FIGURE 17.38
Count of fish species with lake area, with Gaussian linear model (black line), 95% precision
region (green curves) and 95% predictive region (red lines)

FIGURE 17.39
Count of fish species with lake area, with Gaussian quadratic model (black curve), 95%
precision region (green curves) and 95% predictive region (red curves)
weighting of the Gaussian analysis to obtain the posterior distributions of the quadratic
model parameters, robust to the failure of the Gaussian model assumption.
Figures 17.40, 17.41, 17.42 and 17.43 give the posterior cdfs of the parameters from 10,000
draws. The posterior medians and 95% central credible intervals are:
α: 2.782, [2.433, 3.115]; β: −0.0534, [−0.1772, 0.0750]; γ: 0.0156, [0.0048, 0.0260];
σ²: 0.4873, [0.3502, 0.7014]; VR: 0.664, [0.484, 0.863].
The 95% credible interval for γ excludes zero, confirming the need for the quadratic term.
Here VR is the ratio of the quadratic model variance to the null model variance: it is the proportion of variance left "unexplained" by the quadratic model. The Gaussian frequentist
MLEs agree quite closely with the posterior medians, so the ML fitted Gaussian quadratic
is a fair picture of the relation, but the 95% Gaussian-based confidence intervals below are
much longer than the 95% credible intervals:
α: 2.786, [2.222, 3.350]; β: −0.0531, [−0.2429, 0.1635]; γ: 0.0156, [0.0011, 0.0301];
σ²: 0.5168, [0.378, 0.749]; VR: 0.675.
The distribution of α is left-skewed, while the others are right-skewed, heavily so for σ².
The poor agreement of the residuals with the Gaussian model leads to poor summaries of the precisions by the SEs. This poor precision may extend to the mean function.
We can assess the precision region from the joint posterior distribution of all the param-
eters. For the precision, we need the posterior of the linear predictor at each data value.

We sort the M values of the linear predictor x_i′β^[m] at each x_i, and save the median, 2.5%
and 97.5% quantiles. Connecting these values across the data gives the multinomial-based
median and 95% precision bounds of the linear predictor. Figure 17.44 shows the 95% pre-
cision regions for the Gaussian (red) and Bayesian bootstrap (green) analyses. Precision is

FIGURE 17.40
Posterior cdf of α
FIGURE 17.41
Posterior cdf of β

FIGURE 17.42
Posterior cdf of γ
FIGURE 17.43
Posterior cdf of σ²

FIGURE 17.44
Gaussian (red) and Bayesian bootstrap (green) 95% precision regions
increased by the BB analysis, for at least the smaller and medium-sized lakes. The BB anal-
ysis gets the precisions right, without assumptions. We do not need to check the multinomial
variability representation: it is already accounted for in its definition.
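One common way to implement the Dirichlet (Bayesian bootstrap) posterior weighting described above is weighted least squares with Dirichlet(1, . . . , 1) weights; the Python sketch below is an illustration of this idea, not the book's code, and the data it generates are placeholders standing in for the 70 lakes.

import numpy as np

rng = np.random.default_rng(1)
# Placeholder data standing in for the 70 lakes: z = log area, y = log species count.
z = rng.uniform(0.5, 12.0, size=70)
y = 2.8 - 0.05*z + 0.015*z**2 + rng.normal(0.0, 0.7, size=70)

n = len(y)
X = np.column_stack([np.ones(n), z, z**2])
M = 10000

par_draws = np.empty((M, 3))
eta_draws = np.empty((M, n))
for m in range(M):
    w = rng.dirichlet(np.ones(n))                  # Bayesian bootstrap (Dirichlet) weights
    XtW = X.T * w
    beta = np.linalg.solve(XtW @ X, XtW @ y)       # weighted least squares: (alpha, beta, gamma)
    par_draws[m] = beta
    eta_draws[m] = X @ beta                        # linear predictor at each data point

# posterior medians and central 95% credible intervals for alpha, beta, gamma
param_summary = np.quantile(par_draws, [0.025, 0.5, 0.975], axis=0)
# pointwise median and 95% precision bounds of the mean function, as plotted in Figure 17.44
band = np.quantile(eta_draws, [0.025, 0.5, 0.975], axis=0)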

17.8.3 Omitted variables, overdispersion and the negative binomial model


In the Barbour and Brown study, latitude was also important for fish species diversity.
This means that the simple log area model is already mis-specified. There may have been
other important variables also omitted; perhaps the quadratic term is a consequence of the
omission of some other important variable or variables. A more appropriate model for the
linear predictor might have been

ζ = α + β log(x) + γ ′ Z = η + γ ′ Z,

where Z is the full vector of omitted variables, with γ the corresponding vector of regression
coefficients. We are assuming that these additional variables are part of the linear predictor,
rather than acting non-linearly in some way. We don’t know how many variables have been
omitted, or what they are, apart from latitude which we don’t have. So Z is a completely
unknown quantity, varying over the data set – it is an unobserved or latent random variable.
In some fields this representation of omitted variables is called overdispersion or unobserved
heterogeneity. Since γ ′ Z is a scalar quantity, we might as well represent the more appropriate
model by
ζ = η + Z,
where now Z is an unobserved scalar variable with a completely unknown distribution. With-
out a probability model specification g(z) for Z, we do not have a probability distribution
and likelihood for Y . We first adopt the conjugate gamma distribution, leading to a negative
binomial marginal distribution for the response.

17.8.4 Conjugate W
Expressed in terms of the non-negative variable W = e^Z, the Poisson conditional mean given W is µ_W = e^{α + β log(x)} · W = W e^η. For the Gamma(r, θ) distribution of W with density function

    f(W | r, θ) = exp(−Wθ) W^{r−1} θ^r / Γ(r),

the mean is r/θ. The marginal distribution of Y is given by

    y! h(y) = ∫ e^{−µ_W} µ_W^y exp(−Wθ) W^{r−1} θ^r dW / Γ(r)
            = ∫ e^{−W e^η} W^y e^{ηy} exp(−Wθ) W^{r−1} θ^r dW / Γ(r)
            = e^{ηy} θ^r ∫ exp[−W(θ + e^η)] W^{r−1+y} dW / Γ(r)
            = e^{ηy} θ^r [Γ(r + y)/(θ + e^η)^{r+y}] / Γ(r),

so

    h(y) = [Γ(r + y)/(Γ(r) y!)] · e^{ηy} θ^r / (θ + e^η)^{r+y}
         = C(r + y − 1, y) p^y (1 − p)^r,

where p = e^η/(θ + e^η).
This probability function is a negative binomial probability of y "successes" and r "failures" – it is the probability that y successes precede the rth failure in a sequence of binomial trials. Here r is not required to be an integer. The binomial coefficient is C(r + y − 1, y), not C(r + y, y) as in the regular binomial distribution, because the (r + y)-th trial must be a failure. The
logistic transformation of p is ϕ = log[p/(1 − p)] = η − log(θ). The logistic linear predictor
has an additional constant log θ subtracted from the Poisson linear predictor, but this is
completely confounded with the intercept α and is not identifiable. The regression coeffi-
cient of log(x) (and of any other model variable) is unaffected. The value of θ can be fixed
at any arbitrary positive value. A convenient choice is 1, so that the intercept is unaffected:
the mean and variance of the gamma distribution are then both r, and the unconditional
variance of Y is r eη (1 + eη ).
The more common choice in statistical packages is θ = r, which complicates the analysis
in one respect: log r then appears negatively in the intercept of the negative binomial linear
predictor, as well as r in the binomial coefficient. This has to be allowed for in the post-
processing. However, it has an advantage in another respect: the cross-derivative of the
log-likelihood is then zero at the MLEs, so the parameter estimates β̂ and r̂ can be regarded
as independent asymptotically. In either case we have an explicit probability distribution for
y; the likelihood is
    L(η, r) = ∏_{i=1}^{n} [Γ(r + y_i)/(Γ(r) y_i!)] p_i^{y_i} (1 − p_i)^r.
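For illustration only, the negative binomial log-likelihood in this (η, r) parametrisation, with θ fixed, can be evaluated directly. The Python sketch below assumes a count vector y and a covariate matrix X for the linear predictor; it is not taken from the book.

import numpy as np
from math import lgamma

def negbin_loglik(beta, log_r, y, X, theta=1.0):
    """Negative binomial log-likelihood in the (eta, r) parametrisation of the text,
    with p_i = exp(eta_i)/(theta + exp(eta_i)) and eta = X beta; theta is fixed (e.g. 1)."""
    r = np.exp(log_r)                          # keep r positive by working on the log scale
    eta = X @ beta
    log_p = eta - np.log(theta + np.exp(eta))
    log_1mp = np.log(theta) - np.log(theta + np.exp(eta))
    const = np.array([lgamma(r + yi) - lgamma(r) - lgamma(yi + 1.0) for yi in y])
    return float(np.sum(const + y * log_p + r * log_1mp))

The function can be passed to any general-purpose optimiser over (β, log r) to obtain the MLEs.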
The search for a better-fitting compound discrete distribution can be extended to use the
multinomial distribution for the unobserved Z. But since the multinomial can be used, not
for the compounding latent variable Z but for the observed variable Y , there seems little
point in an extensive search for other discrete distributions for Y . We do not take this search
further.
Since the issue of the data variability is important, a further possibility is to model the
variation explicitly through the double GLM. This is discussed in detail in §18.1, where we
return to this example.

17.9 Gamma regression


The gamma distribution provides an alternative model for right-skewed continuous response
variables without transformation, and is a member of the exponential family of distributions.
The gamma form in the convenient parametrisation of §10.6 is

    log f(y | θ, r) = r log θ − log Γ(r) + (r − 1) log y − θy.

Since θ must be positive, the natural scale (link) for θ is the log, on which the regression
model is fitted: log θ = β ′ x. It is also possible that r is varying over the data. Since it also
must be positive, the log scale is natural for it. We do not give further details of model
fitting; an application follows in the next chapter.
18
Extensions of GLMs

18.1 Double GLMs


The one-parameter exponential family distributions – binomial, Poisson, exponential – are
restricted in their representation of variability, which is determined by the mean. The two-
parameter Gaussian and gamma distributions are more flexible as the second variability
parameter is unrelated to the mean. We have assumed, throughout the discussion of these
models, that the second variability parameter is fixed – that it is not itself dependent on the
covariates.
In both of these models the second parameter can also be modelled as a regression on
covariates (Smyth 1986; Aitkin 1987; Smyth 1989).1 Since both σ in the Gaussian and r in
the gamma are positive parameters, the regression models fitted are on the log scale, through
a log link function. We illustrate with the Gaussian maximum likelihood analysis, which is
simpler. Care is needed with these models, since including the same covariates in both models
may lead to unidentifiability. The formality of the model structure, with covariates x and
z, is
    y_i | x_i, z_i, σ_i ∼ N(µ_i, σ_i²)
    µ_i = β′x_i
    log σ_i² = λ′z_i,
where z of dimension q may be a subvector of x of dimension p, or overlapping with or equal
to x, or disjoint from x.
The likelihood and log-likelihood (omitting constants) are
    L(β, λ) = ∏_i (2πσ_i²)^{−1/2} exp[−(y_i − β′x_i)²/(2σ_i²)]

    ℓ(β, λ) = −(1/2) [Σ_i log σ_i² + Σ_i e_i²/σ_i²],

where e_i = y_i − β′x_i.

18.2 Maximum likelihood


The first and second derivatives of the log-likelihood with respect to the parameters are
    ∂ℓ/∂β = Σ_i x_i e_i/σ_i²
    ∂²ℓ/∂β∂β′ = −Σ_i x_i x_i′/σ_i²
    ∂ℓ/∂λ = (1/2)[−Σ_i z_i + Σ_i e_i² z_i/σ_i²] = (1/2) Σ_i (e_i²/σ_i² − 1) z_i
    ∂²ℓ/∂λ∂λ′ = −(1/2) Σ_i (e_i²/σ_i²) z_i z_i′
    ∂²ℓ/∂β∂λ′ = −Σ_i (e_i/σ_i²) x_i z_i′.

1 The attempt by Efron (1986) to define a second variability-parameter generalisation of the binomial distribution had serious difficulties, described in Aitkin (1995).
Taking expectations, the expected information matrix is block-diagonal, with blocks
    Iβ = X′W11X,   Iλ = (1/2) Z′Z,

where

    X′ = [x_1, . . . , x_n],   Z′ = [z_1, . . . , z_n],   W11 = diag(1/σ_i²),
since E[E_i] = 0 and E[E_i²] = σ_i². So a Fisher scoring algorithm for the simultaneous MLE of β and λ reduces to two separate algorithms for β and λ. Since, however, W11 depends on λ and e_i depends on β, it is simplest to formulate the scoring algorithm as a successive relaxation algorithm. For given σ_i², β̂ is a weighted least squares estimate with weights 1/σ_i², and for given β, λ̂ is the MLE from a gamma model with scale parameter 2 (an exponential distribution) and a response variable e_i².
The algorithm begins with an initial unweighted Gaussian regression of y on x, taking
σi2 ≡ σ 2 . The squared residuals from the least squares fit are defined as a new response
variable with a gamma distribution with scale parameter 2. The linear predictor λ′ x is then
fitted using a log link function, and the frequentist deviance calculated for the initial estimate
of (β, λ). A weighted Gaussian regression of y on x is now fitted, with scale parameter 1 and
weights given by the reciprocals of the fitted values from the gamma model. This alternating
process continues until the deviance converges. At this point the standard errors (based on
the expected information) from both models are correct.
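A compact Python sketch of this successive relaxation algorithm (an illustration under the assumptions above, not the book's implementation) alternates the weighted least squares step for β with a scoring step for λ based on the squared residuals:

import numpy as np

def double_glm_ml(y, X, Z, n_iter=50):
    """Alternating (successive relaxation) ML for y ~ N(X beta, exp(Z lambda)).
    Z would normally include an intercept column."""
    beta = np.linalg.solve(X.T @ X, X.T @ y)      # initial unweighted regression
    lam = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        e2 = (y - X @ beta)**2                    # squared residuals: gamma response
        # scoring step for lambda (gamma model, log link, scale parameter 2)
        sig2 = np.exp(Z @ lam)
        score = 0.5 * Z.T @ (e2 / sig2 - 1.0)
        info = 0.5 * (Z.T @ Z)                    # expected information I_lambda
        lam = lam + np.linalg.solve(info, score)
        # weighted least squares step for beta with weights 1/sigma_i^2
        w = 1.0 / np.exp(Z @ lam)
        XtW = X.T * w
        beta = np.linalg.solve(XtW @ X, XtW @ y)
    return beta, lam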
However, the log-likelihood in the two sets of parameters may be very skewed, and so
the standard errors are not a reliable indicator of variable importance. In addition, the loss
of degrees of freedom in the variance model due to the estimation of the mean parameters
may be serious, requiring a marginal or restricted likelihood maximisation for the variance
model. Smyth and Verbyla (1999) gave a discussion. The preceding analysis assumes that
the parameters β and λ are functionally independent, which will usually be the case.
We give several examples of ML with the double Gaussian model.

18.3 Bayesian analysis


We express the log-likelihood explicitly in terms of the two model parameters:

    ℓ(β, λ) = −(1/2) [Σ_{i=1}^{n} log σ_i² + Σ_{i=1}^{n} e_i²/σ_i²]
            = −(1/2) Σ_{i=1}^{n} [λ′z_i + e^{−λ′z_i} (y_i − β′x_i)²].
We define two "weight functions":

    w_{1i} = λ′z_i
    w_{2i} = (y_i − β′x_i)².

Then the log-likelihood can be expressed in two different ways:

    ℓ(β | λ) = −(1/2) [Σ_{i=1}^{n} w_{1i} + Σ_{i=1}^{n} e^{−w_{1i}} (y_i − β′x_i)²]
    ℓ(λ | β) = −(1/2) Σ_{i=1}^{n} w_{2i} λ′z_i.

We alternate between these forms through Gibbs sampling, analogous to ML successive relaxation. We begin with an initial unweighted Gaussian regression of y on x, taking σ_i² ≡ σ². We make M random draws β^[m] from the constant-variance Gaussian posterior distribution N(β̂, σ̂²(X′X)^{−1}), and for each m form the initial weights

    w_{2i}^[m] = (y_i − β^{[m]′}x_i)².

We write u^[m] = Σ_{i=1}^{n} w_{2i}^[m] z_i. The conditional log-likelihood −(1/2) λ′u^[m] for λ given β^[m] is a product-exponential, and with flat independent priors on the λ_j gives the initial conditional product-exponential posterior with

    π(λ_j | β^[m]) = (u_j^[m]/2) exp(−(1/2) λ_j u_j^[m]),

so that λ_j has an exponential distribution with mean 2/u_j^[m].
For each m we make one random draw λ^[m] from this posterior, and evaluate the weights w_{1i}^[m] = λ^{[m]′}z_i. For the next M draws of β, we sample from the Gaussian conditional posterior of β given λ:

    β | λ ∼ N_p(β̂, [X′WX]^{−1}),   with β̂ = [X′WX]^{−1}X′Wy,

where W = diag[e^{−w_{1i}}]: the w_{1i} are the log variances of the observations y_i, and the e^{−w_{1i}} are the reciprocal variances. (The additional "constant" term Σ_{i=1}^{n} w_{1i}^[m] cancels in the posterior.) This alternating process continues until the posterior distributions stabilise.
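A Python sketch of this alternating sampling scheme, following the steps just described (illustrative only; the data arrays y, X and Z are assumed supplied), is:

import numpy as np

def double_glm_gibbs(y, X, Z, M=10000, rng=None):
    """Alternating draws of beta and lambda as described above (illustrative sketch)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, p = X.shape
    # initial constant-variance Gaussian posterior for beta
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ (X.T @ y)
    sig2_hat = np.sum((y - X @ beta_hat)**2) / (n - p)
    beta = rng.multivariate_normal(beta_hat, sig2_hat * XtX_inv)
    beta_draws, lam_draws = [], []
    for _ in range(M):
        w2 = (y - X @ beta)**2                       # w_2i given the current beta draw
        u = Z.T @ w2
        lam = rng.exponential(scale=2.0 / u)         # lambda_j ~ exponential, mean 2/u_j
        w1 = Z @ lam                                 # log variances
        W = np.exp(-w1)                              # reciprocal variances
        XtW = X.T * W
        cov = np.linalg.inv(XtW @ X)
        bhat = cov @ (XtW @ y)
        beta = rng.multivariate_normal(bhat, cov)    # beta | lambda
        beta_draws.append(beta)
        lam_draws.append(lam)
    return np.array(beta_draws), np.array(lam_draws)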

18.3.1 Hospital beds and patients


In §2.14 we showed the data on number of beds and number of patients discharged in a
survey of 393 short-stay hospitals (Herson 1976).
The trend is nearly linear, but the variability is increasing with number of beds
(Figure 18.1). A linear regression of number of patients on number of beds gives the ML
fitted model with (SE)s
    µ̂_p = 122.9 (20.1) + 2.518 (0.058) b,
and the squared correlation is 0.829. There is clearly something inappropriate here: the
intercept of 122 is the ML fitted mean number of patients for a “no-bed” hospital: the
intercept is well above the actual values in the dense cloud of small numbers of patients for
the small hospitals.
FIGURE 18.1
Numbers of patients treated and hospital beds

It is possible that log transformations on both scales – a lognormal distribution for pa-
tients – would linearise the relation and also stabilise the variance (Figure 18.2). The squared
correlation increases to 0.891: more of the variability is “explained” by the transformations.
However the variance heterogeneity, while reduced, is now greater at low values, and/or the
relation is curving downward at the lower end. The ML fitted model with (SE)s is

µ[
log p = 1.316 (0.090) + 0.959 (0.018) log b,

which back-transforms to

cp = e1.316 b 0.959
µ
= 3.73 b 0.959 .

The mean number of patients appears to be nearly proportional to the number of beds b,
but the value 0.959 is 2.28 SEs away from 1. The lognormal model is not well supported.
The alternative is to model the Gaussian variance directly, through a log-linear model:2

µ = β0 + β1 b
log σ 2 = λ0 + λ1 log b.

The log scale for the variance guarantees positive fitted variances, increasing or decreasing
log-linearly in log b – variances increasing or decreasing as a power function of b. The un-
transformed number of beds b could be used in the variance regression instead of log b; then
the variance will increase exponentially with b, faster than a power function. We illustrate
2 A "saturated" model specifying unrelated different variances σ_i² for each observation will not be identifiable.
FIGURE 18.2
Numbers of patients treated and hospital beds, log scales

this possibility by fitting both variance models. The ML fitted mean and variance model
with (SE)s and log b are followed by those for b in the variance:

    µ̂ = 14.84 (4.98) + 2.989 (0.052) b
    log σ̂² = 1.850 (0.388) + 1.592 (0.073) log b

    µ̂ = 26.24 (9.73) + 3.019 (0.060) b
    log σ̂² = 8.585 (0.116) + 0.00626 (0.00034) b.

The mean functions and SEs are very close visually, with the slope within 1/3 of an SE of 3.
Both fitted mean models µ̂ (solid lines), together with the 95% variability bounds (dashed curves) based on µ̂ ± 2σ̂, are shown in colour in Figure 18.3. The log b variance model is
shown in red, the b variance model is in green. The red bounds exclude 20 points, 5% of
the sample. The exponential shape of the bounds for the b model is striking, but they lie
inside those for the log b model for b < 500. The result is that the green bounds exclude
many more data points than the red bounds. The frequentist deviances indicate clearly the
data preference for the models: 5,116.4 for the log b model and 5,164.8 for the b model. The
difference of 48.4 is very substantial. The fit of the log b model appears to be good.
A minor point is that the intercepts in the mean regression are not near zero. It might
seem that the regression should go through the origin, since at zero beds there must be zero
patients. However, there are no zero beds hospitals: this extrapolation is irrelevant to the
reality (the smallest hospital has ten beds). The mean regression coefficient is close to 3: the
mean number of patients is roughly three times the number of beds plus 15. The variance
increases with the number of beds at roughly the rate b1.6 .
FIGURE 18.3
Joint model ML fit with 95% variability bounds: red log b, green b

18.3.2 The absence from school data


In §14.13.1 we saw variability in the school absence data beyond that modelled by the mean
function in IQ and number of dependents.
We first need to recognise that a linear regression for the mean number of days absent may
give fitted values from the model which are negative at some extreme points. Since negative
values of absence are impossible, we need to make a transformation of either the link function
or the response variable to prevent it. The Poisson model for non-negative counts uses a log
link function to guarantee this property, as discussed in Chapter 16. However, if there is
extra variation in the data beyond the Poisson, we have to extend the model to involve a
latent variable to represent that extra variation.
A simpler alternative is to transform the response – use its log – which guarantees the
same positivity for fitted values of days absent. (If there are zero values of days absent, these
can be made positive by adding a small number, 0.5 or 1, to the zeros.) This also allows us
to model the variance through the double GLM, with a variance model as well as a mean
model. Of course we have only the observed covariates IQ and dependents to include in
these models, but we can extend the two-variable regression to higher order terms in both
covariates. We illustrate this with the absence data.
We extend both models to include quadratic as well as linear terms in both IQ and
dependents. We fit the double models by ML, and reduce first the variance model, and then
the mean model, by successively eliminating unnecessary terms.
The final models for log days absent are:
Mean: 1.384 + 0.0154 (0.0074) IQ;
Variance: −7.288 + 1.859 (0.753) DEPS − 0.117 (0.0430) DEPS².
Figure 18.4 shows on the log days scale, the observed data (circles), fitted mean (line) and
the 2 SD variability bounds (red and blue triangles) against IQ. The bounds contain all the
FIGURE 18.4
Data (circles), ML fitted mean (line) and variability bounds (triangles), log days

observed values, though this is difficult to see because of the variations in variability from
dependents. Converting the parameter ML estimates on the log days scale to the original
days scale, we show the observed data (circles), the fitted mean (line) and the 2 SD variability
bounds (red and blue triangles) for IQ in Figure 18.5, and in Figure 18.6 the observed data
(circles) and the 2 SD variability bounds (red and blue triangles) for DEPS. No mean function
is shown in the second figure as the mean is constant over dependents, though the variance
is not: it changes remarkably (quadratically) with DEPS.
The large variability with number of dependents suggests that there may be other unmea-
sured important variables (for example family income) related to both absence and number
of dependents.

18.3.3 The fish species data


We fit to the log species count the Gaussian quadratic mean model and the log-linear variance
model in log area. The ML fitted models are
    µ̂ = 2.767 (0.171) − 0.043 (0.072) log area + 0.0148 (0.0064) [log area]²
    log σ̂² = −1.936 (0.303) + 0.163 (0.051) log area.
The fitted mean model, 95% precision bounds (green curves) and 95% variability bounds
(red curves) are shown in Figure 18.7, for comparison with the Poisson quadratic model fit
in Figure 18.8, reproduced from Figure 17.35.
Two of the 70 observations (3%) fall outside the double Gaussian 95% variability bounds.
The fit looks good.
It is clear that the Poisson model has the variability around the wrong way. The difference
in shape of the bounds is caused by the variance of the Poisson being determined by the
FIGURE 18.5
Data (circles), ML fitted mean (curve) and variability bounds (triangles)

FIGURE 18.6
Data (circles) and variability bounds (triangles)
FIGURE 18.7
Count of fish species with lake area, log scales and Gaussian fitted quadratic mean and log-
linear variance model with 2SD bounds

FIGURE 18.8
Count of fish species with lake area, log scales and Poisson fitted quadratic model (black
curve) with 4.5 SD precision bounds (green curves) and 4.5 SD prediction bounds (red
curves)
mean, rather than freely modelled, and by the log scale of the response, which has decreasing
instead of increasing variability with increasing log area.

18.3.4 Sea temperatures


Figure 18.9 shows the average global sea surface temperature anomaly in degrees Centigrade
over the period 1880–2015, from

https://www.epa.gov/climate-indicators/climate-change-indicators-sea-surface-temperature.

The word "anomaly" refers to the deviation of the annual temperature from the average temperature over the period, obtained by subtracting that average. This is a simple location change in the temperature variable.
There is a clear decline over the period 1880–1910, and a steady increase over 1910–2015,
apart from a very sudden increase and then decline in the period 1940–1945, and a much
smaller drop and increase in the period 1908–1911, together with a great deal of variation,
which appears to be decreasing over time. It is unclear how to model the variation.
The EPA quotes an uncertainty in the reported anomalies in terms of a 95% confidence
interval, presumably based on averaging across measurement occasions within each year.
The uncertainty jumps in the period 1907–1911, and jumps substantially in 1939–1945. It
decreases rapidly from 1945 to 1980, then increases until 2015. Figure 18.10 shows this vari-
ability as the measurement standard deviation implied by the Gaussian confidence interval.
In any regression model we need to allow for the precision of measurements. Other factors
may also need to be considered: the sudden jump in the period 1940–1945 corresponds to
World War II and the considerable sinkings of ships in many oceans, reducing the reliability
of measurements in this period, and possibly losing measurements completely in war zones.
The period 1907–1911 does not correspond to any obvious external event.
We need to weight each reported anomaly inversely by its measurement variance (squared
SD). It is important to centre and scale the year to a time t to avoid the computation of high

FIGURE 18.9
Sea surface annual temperature anomaly 1880–2015 (°F)
FIGURE 18.10
Sea surface annual temperature anomaly standard deviation

FIGURE 18.11
Sea annual temperature anomaly and fourth-degree polynomial regression
FIGURE 18.12
Fourth-degree regression with 95% bounds (red segments) from the measurement SD

powers of large numbers. We begin with a standard Gaussian regression analysis with inverse
variance weighting. This requires a fourth-degree mean function, though the quadratic term
is not needed. Figure 18.11 shows the fitted model with the data. The effect of the quartic
term is that the slope of the regression increases rapidly with increasing year. Figure 18.12
adds the 95% variability bounds (red segments) given by the EPA measurement standard
deviation. The variability bounds exclude 20 observations, 15% of the data. It is immediately
clear that the 1940–1945 period is an anomaly in the technical sense – the sea temperatures
in this period do not belong to the remainder of the model structure, and the period 1907–
1911 is also anomalous. Changes in level have occurred in these periods which cannot be
represented by the polynomial model.
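A small Python sketch of this weighted polynomial regression (illustrative only; the year, anomaly and measurement-SD vectors would come from the EPA series) is:

import numpy as np

def weighted_poly_fit(year, anomaly, sd, degree=4):
    """Inverse-variance-weighted polynomial regression on centred and scaled year."""
    t = (year - year.mean()) / year.std()        # centre and scale to avoid huge powers
    X = np.column_stack([t**k for k in range(degree + 1)])
    w = 1.0 / sd**2                              # weights from the measurement SDs
    XtW = X.T * w
    beta = np.linalg.solve(XtW @ X, XtW @ anomaly)
    return beta, X @ beta                        # coefficients and fitted values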
There is an additional difficulty, that the reported variability is inflated in the 1940–1945
period, and is much lower at the beginning and end of the full period analysed. This suggests
that an analysis which allows for smoothly varying variance is needed, through the double
GLM. We ignore the reported measurement variability, and fit a sixth-degree model in both
mean and log variance, eliminating unnecessary terms, and examine the effect of successively
removing the data for the periods 1940–1945 and 1907–1911 from the analysis.
The full 136 observations require a sixth-degree polynomial in the mean and a fifth
degree in the variance. The reason is clear from Figure 18.13. The two sets of anomalous
values induce bulges in the variance model from the high-degree terms. We now remove the
years 1940–1945 from the data and refit the models to the remaining 130 observations, in
Figure 18.14.
Both models now require fifth-degree polynomials, but the bulge in the variance model
remains. It now does not include the cluster of five observations at 1907–1911. We finally
exclude this cluster as well. The removal of these two clusters allows a constant variance model
to fit the remaining 125 observations with a fifth-degree mean regression, in Figure 18.15.
Five values fall outside the bounds, 4% of the restricted data. The two clusters excluded
explain the high-degree variance polynomial needed for the full data. The measurement
FIGURE 18.13
Sea annual temperature anomaly, DGLM polynomial regression with 95% bounds

FIGURE 18.14
DGLM polynomial regression with 95% bounds, excluding 1940–1945
FIGURE 18.15
Fifth-degree polynomial regression with 95% bounds, excluding 1940–1945 and 1907–1911
standard deviation does not represent adequately the data variability: this can be represented
as constant variance apart from the two anomalous clusters. The fifth-degree mean function
increases even more rapidly than the quartic with the EPA measurement model.

18.3.5 Model assessment


The model residuals ei = (yi − µi )/σi should “look like” a sample from a standard Gaussian
distribution, if the model specification is correct. Figure 18.16 shows the probit plot of the
ordered residuals from the final model excluding the two anomalous sets, together with the
straight line for the standard Gaussian distribution. The fit is extremely close.
The interpretation of the full data has to be fractional. For the period outside 1907–1911
and 1940–1945, the general picture is very clear, a sharp decrease from 1880 to 1900, then a
steadily increasing upward trend: from 2000 to 2015 the average sea temperature increased
by 0.15 degrees. The trend is accompanied by a constant variance. The departures from this
trend during the 1940–1945 and 1907–1911 periods are in opposite directions. The former
rapid increase appears to be associated with the war; the latter decrease in the 1907–1911
period has no simple interpretation, though the EPA has reported its jump in uncertainty.

https://1v1d1e1lmiki1lgcvx32p49h8fe-wpengine.netdna-ssl.com/wp-content/uploads/2021/01/1609991325-SoTC2020 ag1 V52 900w 1.png (BOM Australia).

18.4 Segmented or broken-stick regressions


In §16.6 we considered a possible two-part regression for the logit of Down’s syndrome rate
against mother’s age. This model has several other different names – piecewise regression,
FIGURE 18.16
Probit plot of residuals from the EPA125 final model, excluding 1940–1945 and 1907–1911

change-point and break-point regression are three of them. The idea is that there is a break,
or change, in the value of one or more model parameters at some point or points in the data
(there may be more than two sections). This is similar to a mixture, but different because
the break defines different models on each side of the break, with either continuity or a jump
in the response mean value given by the model at the break-point. We give two examples.

18.4.1 Nile flood volumes


A simple and well-known example is the variation in the annual flooding level of the Nile river
in Egypt. The data are the flood volumes of the river for the 100 years 1870–1969, shown in
Figure 18.17. An important question for Egyptian agriculture is whether the flood volume
has decreased to a lower constant level at some time point in the 100 years, or has decreased
systematically in some other way. The graph shows a general decline of some kind, but does
not give any clear indication of a break in the mean. The large overall variability appears to
be fairly constant. If there is no break, and flood volumes vary randomly by year according
to a single Gaussian distribution, the probit graph should be close to a straight line.
Figure 18.18 graphs the cdf of the 100 volumes on the probit scale, together with the
fitted single Gaussian and the 95% credible region for the true cdf. The empirical cdf shows
curvature, and the Gaussian straight line scrapes the region boundaries around volume 860.
The constant mean Gaussian does not give a close fit.
We first fit some standard Gaussian models. The constant, linear and quadratic mean
models have residual sums of squares 2,835,421, 2,227,545 and 1,910,836, with residual vari-
ances 28,644, 22,730 and 19,699 (SDs of 169, 151 and 140), and deviances 1,310, 1,287 and
1,273. The linear model (Figure 18.19) with 95% variability bounds shows a clear trend
and appears to represent adequately the data variability: three points (3%) are outside the
bounds.
FIGURE 18.17
Nile flood volumes 1870–1969

FIGURE 18.18
Nile flood volumes, probit scale, with single Gaussian (line) and 95% credible region (red
segments)
FIGURE 18.19
Nile flood volumes and ML linear model, 95% precision bounds (green) and prediction bounds
(red)

The quadratic model (Figure 18.20) with 95% variability bounds suggests an increase
in volume towards the end of the period. The t-statistic for the quadratic term is 4.01: the
linear model is not adequate.
The quadratic model also appears to represent adequately the data variability: four points
(4%) are outside the bounds.

18.4.2 Modelling the break


The break-point model specifies different sub-models up to and after the break. The sub-
models can be the same linear or other structure, or different structures, for example one
linear, one constant. The difficult aspect of the model is identifying the location of the
break-point. We have to know this before we can fit the break-point model. A common
approach is “eyeballing”: we inspect the data, try out a few possibilities, and find the best,
call it θ. Then given the choice, we can fit the composite model using a dummy variable
Z = 1 if year > θ, Z = 0 if year ≤ θ: the composite model mean is simply 1 + Z. If we want
the two segment means and their SEs, we simply define W = 1 − Z and fit the model with
W and Z but without an intercept. Figure 18.22 shows the two constant sub-models fitted
to the break-point θ = 1899, with the fitted model (black dots) with 95% precision (green
dots) and prediction (red dots) regions. The dot character is used to avoid the steep jump
in continuous graphed functions between the two segments.
The means and SEs in the two sub-models are 1,087 (24.4) and 851 (15.6), with residual
sum of squares 1,692,871, residual variance 17,274 (SD 130) and deviance 1,259. This model
appears to fit substantially better than the quadratic model. Extending this constant mean
sub-model to the linear sub-model in each part gives no improvement. Is the variability
FIGURE 18.20
Nile flood volumes and ML quadratic model, 95% precision bounds (green) and prediction
bounds (red)

FIGURE 18.21
Nile flood volumes break-point t values
FIGURE 18.22
Nile flood volumes, 1899 break

better modelled? Seven (7%) of the observations fall outside the 95% prediction region, a
worse fit than the linear or quadratic models.
From a Bayesian point of view this analysis is unsatisfactory. We should be regarding the
break-point θ as an additional parameter in the model. Eyeballing the data to search for the
best break-point means that we have no external information about θ. So it should be given
a flat prior, on all the observations from 1871 to 1968. The extreme endpoints are excluded,
as they give a single Gaussian instead of two.
The analysis is now much heavier. Assuming we are fitting constant levels in both sub-
models, we have to repeat the above analysis for each break-point, find the sample mean and
variance in each sub-model and combine these to give the residual variance and the deviance
from the composite model.
A helpful summary for each model is the t-statistic for the difference of the two sub-model
means (Figure 18.21). The largest t value is 8.71 at 1899, but there are five values near 8.
We cannot say unequivocally that the change is at 1899. The t-statistic can be computed in
a loop of 98 repetitions with a transfer of one observation from one sub-model sample to the
other. The deviance is a 1:1 function of the t-statistic.
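The loop can be written directly; the Python sketch below (illustrative, with the flood volumes assumed supplied as a vector indexed by year) computes, for each candidate break, the two sub-model means, the pooled t statistic and the Gaussian frequentist deviance, from which posterior model probabilities of the kind discussed next can be obtained by exponentiating −deviance/2 and normalising.

import numpy as np

def breakpoint_scan(year, volume):
    """Two-constant-level break-point models: t statistic and deviance for each break."""
    n = len(volume)
    out = []
    for b in range(1, n - 1):                    # break after observation b (both sides non-empty)
        y1, y2 = volume[:b], volume[b:]
        rss = np.sum((y1 - y1.mean())**2) + np.sum((y2 - y2.mean())**2)
        s2_pooled = rss / (n - 2)                # pooled two-sample variance estimate
        se = np.sqrt(s2_pooled * (1.0/len(y1) + 1.0/len(y2)))
        t = (y1.mean() - y2.mean()) / se
        s2_ml = rss / n                          # ML residual variance of the composite model
        dev = n * np.log(2*np.pi*s2_ml) + n      # frequentist deviance, -2 log Lmax
        out.append((year[b - 1], t, dev))
    return out

# posterior model probabilities over candidate break years (flat prior):
# prob_k proportional to exp(-dev_k / 2), normalised over the candidates.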
It is instructive to see the full Bayesian model comparison approach in this example. Each
break-point defines a Gaussian model with two Gaussian segments and a cut-point defining
them, with a frequentist deviance from the three parameters for each model. For the Gaussian
model the posterior deviance distributions are exactly χ23 , shifted by the frequentist deviance.
Figure 18.23 shows the full deviance distribution cdfs for the 98 possible break-points.
The individual years are not identified, but the first seven on the left – with the smallest
deviance distributions – are exactly those with the largest t-statistics. The cdfs are ex-
actly parallel and so their differences can be summarised by the frequentist deviances. Their
posterior model probabilities can be obtained by direct transformation of the deviances to
maximised likelihoods, and are shown in Table 18.1 for probabilities larger than 0.001. All
FIGURE 18.23
Nile break-point model deviances

TABLE 18.1
Nile flood break-point posterior probabilities
year volume post.prob.
1896 1,260 0.0019
1897 1,220 0.0536
1898 1,030 0.1162
1899 1,100 0.7749
1900 774 0.0428
1901 840 0.0077
1902 874 0.0025

the remaining possible break-points have a total posterior probability of 0.0004. The years
1899 and 1898 are the only ones with appreciable probabilities. In the further analyses below
we take the break-point to be at 1899. Model-averaging over the possible break-points would
change very little the results, as only 1898 has appreciable probability, which would differ
from 1899 by the location of only one observation.
Could there be more than one break-point? Since we have identified one at 1899, we
keep this one and consider the two periods before and after 1899 for any possible break-
points in these periods. “Eyeballing” of Figure 18.22 is unhelpful. The period 1870–1899
seems too short, and the longer period 1900–1969 shows no obvious level change, though
the variabilities above and below the mean appear unequal. This raises a separate modelling
question: are the variances in the two break-point regions necessarily equal? Modelling the
variance with the linear, quadratic or break-point models gives no better fit than the constant
variance models. We leave open the possibility of the Bayesian bootstrap for the possibility
of non-Gaussian variability, though there is no sign of it. The flood volume decreased during the 100-year period, but whether this was linear, quadratic, or through a jump is unclear.

18.4.3 Down’s syndrome


We consider a linear decline model for ages 15 to θ, connecting continuously with a quadratic
model for ages greater than θ. This would allow for the first model to be constant, and the
second to be linear if appropriate. As before, we define a binary indicator variable Z by Z = 1
if age > θ, Z = 0 if age ≤ θ: the composite logistic model is α + β1 age + Z·[β2 age + β3 age²]. However, this model excludes the possibility of different quadratic regressions before and after the break. If we fit this model, we can assess the need for the quadratic or linear terms in both models. We reformulate the two models in a slightly different way, by defining completely separate quadratic regressions in each part of the break:

    (1 − Z)·[α0 + β1 age + β2 age²] + Z·[α1 + γ1 age + γ2 age²].
Then each model can be reduced as appropriate.
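For any given break-point θ, the composite model is an ordinary logistic GLM once the design matrix is constructed; a Python sketch of the construction (illustrative only) is:

import numpy as np

def break_design(age, theta):
    """Design matrix for separate quadratic logistic regressions on each side of theta."""
    age = np.asarray(age, dtype=float)
    z = (age > theta).astype(float)            # Z = 1 if age > theta, else 0
    w = 1.0 - z
    return np.column_stack([w, w*age, w*age**2,     # alpha0, beta1, beta2
                            z, z*age, z*age**2])    # alpha1, gamma1, gamma2

# The matrix is then used as the covariates of a binomial (logistic) GLM for the
# Down's syndrome counts, with no additional intercept, and theta varied over a grid.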
With the two quadratics, the maximum likelihood/minimum deviance break-point location is not around 25 but at 38, and the break is in the quadratic term, which changes sign there – is reversed – for the last eight observations. The minimum deviance at this break-point is 30.3, with four extra parameters (including the break-point), compared with 48.9 for the single quadratic logit model. The fitted model is shown in Figure 18.25, together with, in Figure 18.24, the graph of the fitted quadratic logit model. The change in curvature is discontinuous – a visible jump. This is a property of all piecewise functions.
The posterior probabilities, greater than 0.001 to 3 dp, of the location of a break with
the two quadratics are shown in Table 18.2. The location of the break-point is very unclear,
though the posterior probability is 0.963 that the break-point is greater than 34 (if there
is one).

FIGURE 18.24
Down’s incidence, and quadratic logit model
FIGURE 18.25
Down’s incidence, and break-point quadratic model

TABLE 18.2
Posterior probability of a break-point at age
age 27 28 29 30 31 32 33 34 35
pp .009 .006 .001 .001 .001 .005 .008 .006 .183
age 36 37 38 39 40 41 42 43
pp .096 .107 .295 .039 .016 .120 .067 .040

At age 38 the posterior probability is less than 0.3, hardly a clear indication, despite the
substantial reduction in deviance which is also nearly achieved at other values of age. The
reflection of the quadratic trend in the last eight observations appears unreasonable.
The t value for the quadratic term past the break-point is −1.5: this term could be
eliminated from the model, leaving a linear trend from age 39 onwards, as is visible in the
figure. This may be less unreasonable, but we do not discuss it further. No strong conclusion
can be drawn about the value of the break-point model for these data.

18.5 Heterogeneous regressions


The Gaussian distribution has been used in a wide range of non-linear models, in which
some parameters, as well as some covariates, appear non-linearly in the regression. The
frequentist and Bayes analyses are both complicated by the non-linearity. We do not give a
general formulation of non-linear models, but give a number of examples of different kinds
of non-linearity.
FIGURE 18.26
Durations against waiting times, Old Faithful

The eruption times of Old Faithful, the famous Yellowstone National Park volcanic geyser,
have been recorded many times to investigate the regularity of its eruptions. The Wikipedia
site for Old Faithful gives a graph of one of the public data sets, which shows eruption
durations di against waiting times wi between eruptions. A graph of the same data is shown
in Figure 18.26.
The figure has a very clear message: short waiting times correspond to short durations,
and long waiting times to long durations. There are almost no intermediate waiting times,
or durations. A linear regression appears to be well-supported; see Figure 18.27. The near-
separation of the data into two point clouds raises possible difficulties with this simple
interpretation. It is possible that, within each cloud, the relation of duration to waiting time
is much weaker, if it is even the same in the two clouds. This kind of heterogeneity can be
investigated with mixture modelling as described in §15.5.
However, another difficulty arises from an alternative interpretation of the data. Azzalini
and Bowman (1990) described their 272 observations, tabulated as durations di followed by
waiting times wi , as

the time interval between the starts of successive eruptions wi , and the duration of
the subsequent eruption di .
(Emphasis added)

This implies that the first eruption duration d1 was followed by the first waiting time w1
before the second eruption duration d2 . So the first waiting time applies to the second eruption
time, not to the first. This would follow from an observation process which begins during a
(left-censored) waiting period, so that the first recorded observation would be of the length
of an eruption duration, followed by the next waiting period, and continuing with these
alternations.
FIGURE 18.27
Durations di against waiting times wi with ML regression, Old Faithful

If we graph the resulting 271 observations di+1 against wi , as did Azzalini and Bowman,
we see something quite different. Now there appear to be three or four regions or point clouds
and their interpretation is quite different (Figure 18.28):

• For waiting times less than 70 minutes, nearly all the durations are long;
• for waiting times longer than 70 minutes, about 40% of the durations are short, and 60%
are long, though shorter than the durations for the short waiting times;
• the evidence for regression of duration on waiting time is weak in all three major clouds;
• there is no simple model description of the structure: a three- or four-component mixture
of linear regressions might be necessary to accommodate the small cloud of six in the
left-hand bottom quadrant;
• within the components the regressions may be quite different.

Azzalini and Bowman gave a detailed discussion of the geological mechanism of the eruptions
and waiting periods which could produce this effect.
Figures 18.29, 18.30, 18.31 and 18.32 show the two-, three-, four- and five-component
mixture regressions. The observations are coloured by the line colour if they have (ML
estimated) posterior probability greater than 0.5 of being in that coloured component. Un-
coloured (white) observations have appreciable probabilities of being in more than one com-
ponent. Details of the model fits are given in Table 18.3.
The two-component model is very clear. The two upper point clouds are fitted as one
component in all models except the five-component. The blue component regression is almost
constant. As more components are added, these two components lose peripheral observations
to the added components, though their slopes remain much the same.
FIGURE 18.28
Durations di+1 against waiting times wi , Old Faithful

FIGURE 18.29
Durations against waiting times, two components
FIGURE 18.30
Durations against waiting times, three components

FIGURE 18.31
Durations against waiting times, four components
FIGURE 18.32
Durations against waiting times, five components

TABLE 18.3
K-component Gaussian regressions for the Old Faithful data
K    p    σ̂      dev     AIC     BIC     π1     π2     π3     π4     π5
1 3 0.950 741.1 747.1 757.9 1
2 5 0.334 524.4 534.4 552.4 0.641 0.359
3 8 0.305 502.3 518.3 547.1 0.594 0.153 0.253
4 11 0.256 475.0 497.0 536.6 0.536 0.080 0.151 0.233
5 14 0.192 454.9 482.9 533.3 0.261 0.295 0.076 0.129 0.239

The third green component appears artificial, with observations only at both ends. Unas-
signed observations, with no component membership probability greater than 0.5, occur only
in the three-component model, and there are only three of them.
The fourth orange component takes over some of the green and red components’ obser-
vations, plus the unassigned observations in the three-component model.
The five-component model splits the first component into two, shown as the red and
violet lines. The six-component model (not shown) has a deviance of 449.5 and the seven-
component model a deviance of 443.4: the small deviance changes for K = 6 and 7 are less than the penalties for these values, for both AIC and BIC.
The visual evidence for more than two components is not compelling. The common frequentist model comparison methods from Table 17.3 give K = 5 as the best model.
Further assessment requires the Bayesian model comparison, as in the galaxy recession
velocity data: we do not pursue that here.
18.6 Highly non-linear functions


In §2.15 we showed the development of 27 dugongs, with length related to age. At some
age the dugongs are fully grown. The data are graphed in Figure 18.33. The “inverted
exponential” shape suggests a maximum length somewhere near 2.7 metres, with evidently
quite large variation, since the oldest dugong is not the largest. Can we transform the scales,
or the data, to see how a linear model might be fitted?
Logging both scales looks promising (Figure 18.34), and a linear regression on the log
scales looks a good fit (Figure 18.35). However if we exponentiate the fit to the original
scale (Figure 18.36), the problem is clear: the length increases to ∞ with age – there is no
plateauing to an upper bound – an asymptote. Another problem appears on the original
scale: the variability is very low for small ages and then increases, until possibly decreasing
again after 15 years, though it is still large at age 30+.
A common approach to growth problems with an asymptote is through the von Berta-
lanffy growth function (von Bertalanffy 1969, p. 136), based on a differential equation, in his
notation:
    dL/da = k(L∞ − L)
    L(a) = L∞ (1 − exp(−k(a − a0))),
where a is age, k is the growth coefficient, a0 is a value used to calculate size when age is zero
and L∞ is asymptotic size. The rate of growth decreases as length approaches its asymptotic
value, and is zero at the asymptote.
If we reparametrise this into a linear model form, writing L = y, a = x, L∞ = γ, k = −β and defining α = log γ − βa0, we can express the model on the log scale as log(γ − y) = α + βx.

FIGURE 18.33
Ages and lengths of dugongs
FIGURE 18.34
Ages and lengths of dugongs, log scales

FIGURE 18.35
Ages and lengths of dugongs, with fitted log-log model
FIGURE 18.36
Ages and lengths of dugongs, with fitted exponentiated model

Here γ must be larger than the largest y value in the data. This model is not a member of
the linear model family: γ, the parameter of interest, appears non-linearly.
Maximum likelihood analysis can be carried out by profiling – hill-climbing: if we fix the
value of γ, α and β can be estimated by Gaussian ML, assuming the data model has added
Gaussian variability. By defining a grid of values of γ and maximising the likelihood over the
other parameters, we can find the profile MLE of γ which maximises this profile (maximised)
likelihood in γ.
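A Python sketch of this profiling calculation (illustrative only; it fits the regression by Gaussian ML on the log(γ − y) scale for each fixed γ, as described, with the age and length vectors assumed supplied) is:

import numpy as np

def profile_gamma(age, length, gamma_grid):
    """Profile (maximised) Gaussian log-likelihood over gamma for log(gamma - y) regressed on age."""
    n = len(length)
    X = np.column_stack([np.ones(n), age])
    profile = []
    for g in gamma_grid:                       # gamma must exceed the largest observed length
        u = np.log(g - length)                 # transformed response
        coef = np.linalg.solve(X.T @ X, X.T @ u)
        rss = np.sum((u - X @ coef)**2)
        loglik = -0.5 * n * (np.log(2*np.pi*rss/n) + 1.0)
        profile.append(loglik)
    return np.array(profile)

# e.g. gamma_grid = np.linspace(length.max() + 0.01, 6.0, 100); the profile MLE is the grid
# value maximising the profile log-likelihood (for these data it keeps increasing with gamma).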
For the dugong data, this approach is unsuccessful: the maximised likelihood increases
indefinitely with γ. The literature on this model mentions the difficulty of estimating this
parameter when there are no or few large animals – in effect, we have to extrapolate outside
the data range.
One possible way of having an asymptote is to use inverse polynomials (Nelder 1966;
McCullagh and Nelder 1983). With the model µ = α − β/x, the mean increases with x to
an asymptote of α. The same asymptote structure results with any positive power of x^{−1}, or multiple such powers. We illustrate with the inverse square root power x1 = x^{−1/2} and its square x2 = x1² = x^{−1}. We also need to model the variability. After some searching
over combinations of x1 and x2 in the mean and variance models, we find the model with
covariates x1 and x2 in both mean and variance models gives the best fit.
The fitted mean model with SEs is

    µ̂ = 3.118 (0.061) − 2.663 (0.236) x1 + 1.344 (0.182) x2,

and the fitted variance model on the log scale (link), with SEs, is

    log σ̂² = −7.256 (1.595) + 13.41 (7.09) x1 − 15.63 (6.29) x2.
The fitted model, with (green) bounds for the 95% precision region and (red) bounds for the
95% variability region, and the sample data, are shown in Figure 18.37. All observations fall
inside the 95% variability region. The model appears to fit well.
FIGURE 18.37
Ages and lengths of dugongs, with fitted inverse polynomials and 95% bounds

The mean model intercept is the asymptote of the regression: 3.12 (0.06) metres, with
95% confidence interval [2.99, 3.24]. This interval is well above the longest sample dugong.
Without information about the age cycle of dugongs, we are unable to say whether this is a
reasonable range or not. Wikipedia gives dugongs a lifetime of 70 years; at this value of age
the model mean length is 2.82 metres: the sample we have is of young dugongs. The Great
Barrier Reef Marine Park Authority gives dugongs a mature length of up to three metres.
We leave further investigation to the student.

18.7 Neural networks


Neural networks were a minor specialisation in statistics until their rapid development in the
1990s related to models for brain connections, signal transmission and generalised regression.
Their current implementation in deep learning is now expanding rapidly. Here we give the
original version with a single “hidden layer” of latent variables. The structure is very similar
to a finite mixture of regressions. The relation between response variable Yi and p covariates
xi in a sample of size n is expressed through a finite mixture of a set of q binary latent
variables Zi = (Zi1 , . . . , Zij , . . . , Ziq ):
    Y_i | Z_i, x_i ∼ N(λ′Z_i, σ²),
    Z_ij | x_i ∼ b(1, p_ij),
    logit p_ij = β_j′ x_i.
The binary latent variable Zij has a logistic regression on the covariates xi with regression
coefficient β j . So the relation between response and covariates is mediated by the latent
variables: in this model there is no direct connection between response and covariates, but
the model can be generalised to include this. Since the Zij are unobservable, we are unable to
determine which latent variables are “active” – have the value 1 – in the observed responses.
While the number p of covariates is known, the number q of latent variables is not known,
since they are not observable. This is a standard problem with finite mixtures.
The observed data $Y_i$ and $x_i$, eliminating the unobserved $Z_{ij}$, have a complex mean
structure:
\[
E[Y_i \mid x_i] = \sum_{j=1}^{q} \lambda_j E[Z_{ij} \mid x_i]
= \sum_{j=1}^{q} \lambda_j \frac{\exp(\beta_j' x_i)}{1 + \exp(\beta_j' x_i)}.
\]

The early analyses of this complex mean structure used the residual sum of squares from
the fitted mean function as the objective function to be minimised. Details can be found in
the major books of Michie, Spiegelhalter and Taylor (1994), Ripley (1996) and Kay and Tit-
terington (1999). Considerable difficulties with the convergence of the optimising algorithm
with this model led to a loss of interest in neural networks by the statistical and computer
science communities. Aitkin and Foxall (2003) pointed out that the objective function being
minimised was not the negative log-likelihood for the complex model, as the unconditional
variance was not constant, but an even more complex function with two sets of regression
parameters:

\[
\begin{aligned}
\operatorname{Var}[Y_i \mid x_i] &= \operatorname{Var}[E(Y_i \mid Z_i)] + E[\operatorname{Var}(Y_i \mid Z_i)] \\
&= \sum_{j=1}^{q} \lambda_j^2 \frac{\exp(\beta_j' x_i)}{[1 + \exp(\beta_j' x_i)]^2}
 + \sum_{j=1}^{q} \lambda_j \frac{\exp(\beta_j' x_i)}{1 + \exp(\beta_j' x_i)} + \sigma^2.
\end{aligned}
\]
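As a concrete illustration of the mean structure given earlier in this section, the following minimal Python sketch evaluates E[Y | x] for a single-hidden-layer model; the parameter values are invented for illustration and are not estimates from any data set.

```python
import numpy as np

# Illustration of the single-hidden-layer mean function
# E[Y | x] = sum_j lambda_j * exp(beta_j' x) / (1 + exp(beta_j' x)).
# The parameter values below are made up for illustration only.
rng = np.random.default_rng(1)
q, p = 4, 3                           # q latent nodes, p covariates
beta = rng.normal(size=(q, p + 1))    # node logistic coefficients (with intercept)
lam = rng.normal(size=q)              # output weights lambda_j

def mean_function(x):
    xi = np.concatenate(([1.0], x))   # prepend an intercept term
    eta = beta @ xi                   # node logits beta_j' x
    p_j = 1.0 / (1.0 + np.exp(-eta))  # node probabilities E[Z_j | x]
    return lam @ p_j

print("E[Y | x] =", mean_function(np.array([0.5, -1.0, 2.0])))
```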

By taking advantage of the incomplete data form of the model, Aitkin and Foxall gave both
scoring and EM algorithms for ML estimation in the explicitly formulated latent variable
model. Figure 18.38, reproduced from their paper, shows the ML fit of a four-node model
to the motorcycle acceleration data, discussed at length in §18.9. It did not allow for the
increasing variance.
In later research papers, the EM algorithm was extended to the multilevel “deep learning”
model. We do not give details here.

18.8 Social networks and social group membership


This section is included as an example of recent research in the field of social networks, which
is expanding rapidly.

18.8.1 History of network structures


The origin of social networks came from representing the connections of members of a com-
munity – actors – to each other. Friendship networks were an important example. The
relation between members could be expressed mathematically through the value of a binary
(0,1) link variable between each pair of actors. Statistical modelling of the probability of a

FIGURE 18.38
Acceleration of motorcycle helmet, with four-node neural network fit

link was developed extensively by Holland and Leinhardt (1981); a broad coverage is given
by Lusher, Koskinen and Robins (2013).

18.8.2 The Natchez women network


In §2.14 we gave in Table 2.12 attendances at 14 social events by 18 women in Natchez,
Mississippi. The question of interest to the sociologists at the time was whether there were
distinct cliques – social groups of the women attending different events. Table 2.12 is difficult
to understand: the redundant zeros crowd it. If we remove the zeros the picture is clearer
(Table 18.4). There are large blank areas in the top right and bottom left corners.
Women 1–8 did not attend any of the events 10–14. Women 10–18 did not attend any of
the events 1–5. So there seem to be at least two distinct groups; the remaining events were
attended by some women in both these groups.
This little data set has been analysed at least 21 times from different computational
points of view (Freeman 2003). Only one of these 21 analyses used a statistical model,
the exponential random graph model (ERGM), though the model itself did not address
specifically the group structure of attendance. How do we express this statistically? What
is the model, or models? The following sections are condensed from the full discussion in
Aitkin, Vu and Francis (2014).
TABLE 18.4
Attendance, zeros removed
x W \E 1 2 3 4 5 6 7 8 9 10 11 12 13 14 T
1 1 1 1 1 1 1 1 1 1 8
0 2 1 1 1 1 1 1 1 7
0 3 1 1 1 1 1 1 1 1 8
0 4 1 1 1 1 1 1 1 7
0 5 1 1 1 1 4
0 6 1 1 1 1 4
0 7 1 1 1 1 4
0 8 1 1 1 3
0 9 1 1 1 1 4
0 10 1 1 1 1 4
0 11 1 1 1 1 1 5
0 12 1 1 1 1 1 1 6
1 13 1 1 1 1 1 1 1 7
1 14 1 1 1 1 1 1 1 1 8
1 15 1 1 1 1 1 5
1 16 1 1 2
1 17 1 1 2
1 18 1 1 2
T 3 3 6 4 8 9 10 14 12 5 4 6 3 3 90

18.8.3 Statistical models


We consider the presence or absence of woman i at event j as a random process – her
attendance was determined by a possibly large number of factors unknown to us, so we
represent the process outcome as a Bernoulli (binary) random variable, taking the value
Yij = 1 with probability pij , and Yij = 0 with probability 1 − pij . We can bring the women
and event structures into a model in several ways.

18.8.3.1 The “null” random graph model


This is a single-parameter model, giving a constant probability pij = p for every woman
independently attending every event. It has no substantive interest in general, providing
only a baseline for comparison with informative models.

18.8.3.2 The “saturated” model


This is just a re-statement of the general model, with the event attendance probabilities pij
treated as completely unrelated parameters, in general different for every i and j. We aim to improve
on this model, and the null model, with parsimonious models.

18.8.3.3 The Rasch model


This model is widely used in item response theory (IRT) in psychology for educational and
psychological testing.
• Each woman i has a propensity θi to attend any event.
• Each event j has an attractiveness ϕj to any woman.
• Women attend events independently.

The Rasch model is a main effect or additive model, in events and women, on the logit scale:
\[
\operatorname{logit}\, p_{ij} = \log\!\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \theta_i + \phi_j.
\]
The model has no group structure for women, and so plays the role of a baseline model for
comparison with models with group structure.

18.8.4 The Exponential Random Graph Model (ERGM)


The ERGM is a general regression model for binary responses, and can be extended to more
complex responses, like counts or continuous response variables. It is a particular case of the
GLM family. The Rasch model is a simple case of the exponential family GLM.

18.8.4.1 The latent class or mixed Rasch model


We extend the Rasch model to a mixture of Rasch models, one for each social group. The
formal mixture model specifies a K-group latent structure for women; the groups are distin-
guished by K sets of event attendance parameters ϕjk , different among groups, but identical
within groups. Within each group k, all the women in the group attend event j with a com-
mon logit attendance parameter ϕjk , but with woman-dependent propensity parameters θi .
The proportion of women in group k is πk .
The group structure is unobserved, and the number K of groups is unknown. These are
implied and identified by the women’s different patterns of event attendance. We extend
the notation to qijk , the probability of attendance at event j for woman i in group k,
i = 1, . . . , nk .
The formal model and likelihood are
\[
\Pr[\{y_{ij}\} \mid i, \text{class } k] = \prod_{j=1}^{r} q_{ijk}^{\,y_{ij}} (1 - q_{ijk})^{1 - y_{ij}},
\]
\[
\operatorname{logit}\, q_{ijk} = \theta_i + \phi_{jk},
\qquad
\Pr[\text{class } k] = \pi_k,
\]
\[
\Pr[\{y_{ij}\} \mid i] = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{r} q_{ijk}^{\,y_{ij}} (1 - q_{ijk})^{1 - y_{ij}},
\]
\[
L = \prod_{i=1}^{n} \left[ \sum_{k=1}^{K} \pi_k \prod_{j=1}^{r} q_{ijk}^{\,y_{ij}} (1 - q_{ijk})^{1 - y_{ij}} \right].
\]
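The mixture likelihood can be evaluated directly. The following Python sketch computes log L for given θ, φ and π; the data and parameter values are simulated stand-ins, not the Natchez attendance data or fitted estimates.

```python
import numpy as np

# Sketch of the mixed Rasch (latent class) likelihood: y is an n x r 0/1
# attendance matrix, theta the woman propensities, phi the K x r sets of class
# event parameters, pi the class proportions.
def mixture_loglik(y, theta, phi, pi):
    n, r = y.shape
    K = len(pi)
    loglik = 0.0
    for i in range(n):
        terms = np.empty(K)
        for k in range(K):
            q = 1.0 / (1.0 + np.exp(-(theta[i] + phi[k])))   # q_{ijk}, j = 1..r
            terms[k] = pi[k] * np.prod(q ** y[i] * (1 - q) ** (1 - y[i]))
        loglik += np.log(terms.sum())
    return loglik

rng = np.random.default_rng(0)
y = (rng.random((18, 14)) < 0.3).astype(int)   # stand-in data, not Table 18.4
theta = rng.normal(size=18)
phi = rng.normal(size=(2, 14))                 # K = 2 latent classes
pi = np.array([0.5, 0.5])
print("log L =", mixture_loglik(y, theta, phi, pi))
```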

An important question is how to specify the number of classes K. The finite mixture
structure allows ML and Bayesian analyses by EM and DA. We do not give details here.
Membership of each woman in the groups is expressed probabilistically through her condi-
tional group probabilities, given the number of groups and her event attendance pattern.
The number of groups is determined through the posterior distributions of the deviances for
each number of groups.
For the Natchez women data the two-group model has very high probability. Women
1–8 have very high probabilities of belonging to Group 1; Women 10–18 have very high
probabilities of belonging to Group 2. Woman 9 has probability close to 0.5 of being in both
groups. The last result may look ambiguous, or a model failure, but the sociologists asked
the women to define which group they belonged to. All agreed that Women 1–8 were in
Group 1 and Women 10–18 were in Group 2, and that Woman 9 belonged to both groups.

This can be understood from Woman 9’s attendance at only four events: 5, 7, 8 and 9:
she was not a frequent attender. All the women in Group 1 except one attended event 5,
and all the women in Group 2 except one attended event 9. So Woman 9’s membership in
both groups was consistent with her attendance pattern.
This analysis was the only one of the 21 analyses which identified Woman 9’s joint
membership. This modelling approach was applied to the identification of the structure of
the Noordin Top terrorist network in Aitkin, Vu and Francis (2017).

18.9 The motorcycle data


We reproduce in Figure 18.39 the motorcycle acceleration data from §2.16. A detailed analysis
was given by Silverman (1985); the data are available from R. The original simulated crash
study was reported by Schmidt, Mattern and Schueler (1981). A curious aspect of the data
is the bunching of multiple accelerations at the same time, even at the beginning, suggesting
multiple measuring instruments or observers of some kind. We ignore this possibility for the
moment and take the data as given; we diagnose this curiosity at the end of this section.
Nothing happens for the first 13.8 ms (milliseconds) – the helmet moves at almost con-
stant velocity before the collision. We clearly need a break-point model, with a constant part
before the break at 13.8 ms. The acceleration pattern then follows a sine wave, as in simple
harmonic motion, with clearly decreasing magnitude and increasing variability. We discuss
this further later, but first try a double Gaussian GLM, with a 12th-degree polynomial for

FIGURE 18.39
Acceleration of motorcycle helmet
FIGURE 18.40
Acceleration of motorcycle helmet after 13.8 ms, with ML mean and variance polyfit and
95% precision (green) and variability (red) bounds

the mean function and a fourth-degree polynomial for the log variance function, applied to
the data after the break.
We reduce the unnecessary terms, first from the variance model and then from the mean
model, terminating with a second degree model for the variance, and an 11th degree poly-
nomial without the seventh- and ninth-degree terms for the mean. The ML fitted model is
shown over the restricted time scale in Figure 18.40, with 95% variability bounds (red) and
95% precision bounds (green) around the fitted mean model. The horizontal and vertical
scales are both different in Figures 18.39 and 18.40.
Five observations out of the 113, 4%, fall outside the 95% bounds, a good fit to the
variability, though the wiggles in the fitted mean beyond 40 ms are unconvincing. The
precision of the polynomial model decreases and then increases with time.
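For readers who want to experiment, the sketch below illustrates one way such a mean-and-variance model can be fitted, by alternating a weighted least-squares fit of the polynomial mean (with weights 1/σ̂²) with a fit of a polynomial model for log σ². The log-variance step here uses ordinary least squares on the log squared residuals as a crude stand-in for the gamma log-link fit of the double GLM, and the data are simulated rather than the motorcycle measurements.

```python
import numpy as np

# Rough sketch of an alternating fit for a mean model plus log-variance model.
# The variance step is a crude stand-in (least squares on log squared residuals)
# for the gamma log-link fit; the data are simulated, not the motorcycle data.
rng = np.random.default_rng(2)
t = np.sort(rng.uniform(14, 55, 120))
y = 100 * np.exp(-0.07 * (t - 14)) * np.sin(np.pi + 0.22 * (t - 14)) \
    + rng.normal(scale=5 + 0.5 * (t - 14))

def poly_design(t, degree):
    tc = (t - t.mean()) / t.std()        # centre and scale to avoid overflow
    return np.vander(tc, degree + 1, increasing=True)

X_mean, X_var = poly_design(t, 6), poly_design(t, 2)
log_s2 = np.full_like(y, np.log(y.var()))          # start from constant variance
for _ in range(20):
    w = np.exp(-log_s2)                            # weights 1 / sigma_i^2
    WX = X_mean * w[:, None]
    beta = np.linalg.solve(X_mean.T @ WX, WX.T @ y)        # weighted LS for the mean
    resid2 = (y - X_mean @ beta) ** 2
    gamma = np.linalg.lstsq(X_var, np.log(resid2 + 1e-12), rcond=None)[0]
    log_s2 = X_var @ gamma                                 # updated log-variance fit
print("mean coefficients:        ", np.round(beta, 3))
print("log-variance coefficients:", np.round(gamma, 3))
```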
Many modelling statisticians are sceptical of high-degree polynomials, which may be
unstable and cannot be extrapolated beyond the observed data. The instability is a matter
of over-fitting the degree of the polynomial terms, and the critical need for orthogonalisation
of the components. Centering and scaling the time scale are essential to avoid numerical
underflow or overflow. The extrapolation difficulty is true of all models, and the polynomial
is not intended to be a representation of an unobserved process, but simply a smoothing of
the variability, to allow the complex identifiable trend to be represented.
The fitted model has no useful engineering or design interpretation, and cannot be ex-
trapolated beyond the observed data: it is entirely empirical. The “cyclic” behaviour of the
helmet after the collision suggests that a real physical model for the mean, of a damped
simple harmonic motion (SHM), could fit the observed behaviour and could be extrapolated
beyond the upper time point.
Inspection of the data shows that the acceleration returns to zero at about 28 ms, then
continues in the second half of the full cycle period to about 48 ms, though with reduced

amplitude – the motion is damped by a factor of about 0.57 (the ratio of the magnitudes of
the two maxima). Beyond this time the cyclic pattern is lost in the noise level, which increases
steadily from the initial collision and later decreases.
The damped sine wave can be represented initially by

\[
\mu(t) = \alpha^* \exp(-\lambda t)\,\sin(2\pi t/\tau),
\]

where λ is the decay constant and τ the period – the time it takes for a single cycle of
the SHM. An immediate problem appears. The sine function cycles between positive and
negative values as time increases, but starts at zero, then increases from t = 0. To make it
decrease, we need to shift the sine function through π, and redefine the time scale to start
from 0 at 13.8 ms: t∗ = t − 13.8:

\[
\mu(t) = \exp(\alpha + \beta t^*)\,\sin(\pi + \gamma t^*),
\]

where α = log(α∗ ), β = −λ, γ = 2π/τ . If we move the time origin to 13.8 ms, then the
period can be roughly estimated as τ̃ = 2 × (28 − 13.8) = 28.4 ms.
There is no straightforward ML analysis for the model other than a general Newton-
Raphson. Both maximum likelihood and Bayesian analyses can be obtained by setting up
a 3-D grid in the parameters and assessing the likelihood near the edges of the grid for
possible grid changes. The posterior distributions of the parameters can then be obtained
by rescaling the likelihood.
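The following Python sketch illustrates such a random-grid computation for the damped SHM mean model, with a constant Gaussian variance for simplicity (the full analysis also models the variance) and simulated data; the grid is centred at eyeballed values like those used below.

```python
import numpy as np

# Sketch of a random-grid deviance search for the damped SHM mean
# mu(t*) = exp(alpha + beta t*) sin(pi + gamma t*), with constant Gaussian
# variance. Data are simulated, not the motorcycle measurements.
rng = np.random.default_rng(3)
t_star = np.sort(rng.uniform(0, 44, 113))             # time after the break-point
y = np.exp(5.4 - 0.07 * t_star) * np.sin(np.pi + 0.224 * t_star) \
    + rng.normal(scale=20, size=t_star.size)

def deviance(alpha, beta, gamma):
    mu = np.exp(alpha + beta * t_star) * np.sin(np.pi + gamma * t_star)
    sigma2 = np.mean((y - mu) ** 2)                    # profile MLE of sigma^2
    return y.size * (np.log(2 * np.pi * sigma2) + 1)   # -2 log L at the profile MLE

# a wide random grid of 10,000 points around rough central values
draws = np.column_stack([
    rng.uniform(3.4, 7.4, 10_000),         # alpha
    rng.uniform(-0.12, -0.02, 10_000),     # beta
    rng.uniform(0.12, 0.32, 10_000),       # gamma
])
dev = np.array([deviance(a, b, g) for a, b, g in draws])
best = draws[dev.argmin()]
print("minimum deviance %.1f at alpha = %.2f, beta = %.3f, gamma = %.3f"
      % (dev.min(), *best))
# Posterior weights over the grid: proportional to exp(-(dev - dev.min()) / 2).
```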
The analysis used a wide random grid of 10,000 points with central values the eyeballed
parameter values α0 = 5.4, β0 = −0.070, γ0 = 0.224. The grids had to be extended repeat-
edly as the likelihood was very flat. The 10,000 random draws of the parameters with the
corresponding deviances are shown in Figures 18.41, 18.42 and 18.43 over the final grid, with
the deviance range restricted to the minimum deviance (1,068) plus 10, covering the range of

FIGURE 18.41
Deviances and α
FIGURE 18.42
Deviances and β

FIGURE 18.43
Deviances and γ
FIGURE 18.44
Cdf α

appreciable likelihood. The deviance graphs for all three parameters are very poorly defined,
with no clear minimising parameter values. Computationally, the minimum deviance was
1,068 at the MLEs α̂ = 7.44, β̂ = −0.112, γ̂ = 0.0017. These values are far from the eyeball
estimates; the bunching of multiple acceleration values at the same time confuses the eyeball
location process.
The posterior distributions shown in Figures 18.44, 18.45 and 18.46 are very unusual.
Those for α and γ have a long left tail and are then nearly uniform over the ranges shown,
while β has its MLE at the left-hand end with a long right tail. The data are not very
informative about the model, in spite of the very clear damped SHM structure. Many fitted
models will be consistent with the data: we do not pursue any further elaboration of the
model. Something important is missing in this model.
Some analysts of the data have expressed concerns over the apparent increase in accelera-
tion at the end of the time period, implying their expectation of a return to zero acceleration.
The damped SHM model has this increase, and the eventual damping to zero magnitude, as a
consequence of its model structure.
We now return to the peculiarity of the bunching of multiple accelerations. What is
happening before 13.8 ms? If we graph just the early observations, we see something quite
bizarre (Figure 18.47). It appears that there are four different data sets which have been
amalgamated. In only one set is the helmet at constant velocity before the collision; in the
other sets the helmet is moving with different small but constant negative accelerations –
braking – before the collision. Up to 13.8 ms the data set membership of each observation
is clear, but beyond this point we are unable to identify data set memberships. To do this
would require a mixture model of four damped SHMs, which evidently have slightly different
parameters. This particularly complex model could be fitted by extending the likelihood
with the additional parameters of the mixture components. We do not discuss it further: we
FIGURE 18.45
Cdf β

FIGURE 18.46
Cdf γ
FIGURE 18.47
Acceleration of motorcycle helmet, first 13.8 ms

do not need to identify the individual recorders’ data. Constructing a credible region for any
of the four recordings is beyond the scope of this book, and will not be of great interest.
What would be of interest is to have the four disaggregated recordings.
The unmodelled variations between the four sets, and their slight differences in param-
eters, explain the large residual variation which also increases with time as the individual
curves diverge.
19
Appendix 1 – length-biased sampling

It is possible that even when a simple random sample of the population has been drawn,
and the values of Y obtained from all the sample members, the resulting values of Y are
not a simple random sample. This is the case when the sampling used is informative, in the
sense described in §6.5: that the selection of population member I into the sample depends
on the value of YI . It may be difficult to imagine how this could occur, since the values of
YI are observed only after the sample is drawn.
However, an important class of duration sampling designs do have this property, and
result in length-biased sampling. A simple example is that of residential tenure. We want to
investigate how long people live – their length of tenure – in their residences – dwellings –
before moving. We have the list of dwellings in the area of study interest, and draw a simple
random sample of dwellings of size n from this list. We assume it is with replacement – the
sample size is small compared to the population size.
We then visit each dwelling and ask the residents there how long they have lived there.
We record the residential tenure yi for each household i, and could easily assume that we have
a random sample of tenures. However some thought will show that this is not so: families
who move rarely will have a greater chance of being in the sampled dwellings than families
who move frequently (Cox and Miller 1965). We need to model this dependence in some way.
Here we assume that the probability of being included in the sample increases linearly with
tenure (time).
The population data are the pairs (UI , YI ) for I = 1, . . . , N . The data from which we
construct the likelihood are the sampled pairs (ui , yi ) with ui = 1, and the values uI = 0
for the unsampled dwellings. As we have no tenure recorded for the unsampled dwellings,
the likelihood contribution from these uI is just a constant, as in Chapter 2 – the dwellings
were drawn by a simple random sample. The probability contribution of a sampled dwelling
pair is
\[
f^*(y) = \Pr[u = 1, y] = \Pr[u = 1 \mid y]\,\Pr[y] = c \cdot y \cdot f(y),
\]

where f ∗ (y) is the distribution of the sampled durations, f (y) is the probability model
for the population value y and c is a constant of proportionality, reflecting the fact that the
chance of being included in the sample increases linearly with tenure y. We have assumed the
simplest model for inclusion; we could assume more generally that the probability increases
monotonically with y.
We want to find the probability density f ∗ (y) of the sampled values under this specifi-
cation of informative sampling. Since the density f ∗ (y) has to integrate to 1 over the range
of y (assumed positive), we must have
Z ∞ Z ∞

1= f (y)dy = c · yf (y)dy = c · µ,
0 0


where µ is the mean of y. So c = 1/µ, and the density of y under this informative sampling is

\[
f^*(y) = y f(y)/\mu.
\]

This specification allows us to correctly estimate the model parameter(s), even under infor-
mative sampling, if the specification is correct.
We note immediately that the mean and variance of the sampled durations are quite
different from those of the underlying population. The mean and variance are
\[
E^*[Y] = \int y^2 f(y)\,dy / \mu = (\mu^2 + \sigma^2)/\mu = \mu + \sigma^2/\mu,
\]
\[
\operatorname{Var}^*[Y] = \int y^3 f(y)\,dy / \mu - E^*[Y]^2
= [\mu_3 + 3(\mu^2 + \sigma^2)\mu - 2\mu^3]/\mu - (\mu + \sigma^2/\mu)^2
= \mu_3/\mu + \sigma^2(1 - \sigma^2/\mu^2),
\]

where µ3 is the third central moment of Y, E[(Y − µ)³]. So the mean of the sampled durations
exceeds µ, the true population mean (as expected), and the variance depends on the third
moment of the true population.
Suppose for example that we specify the duration of tenure in the population by an
exponential distribution with mean µ. The likelihood, log-likelihood and its derivatives for
the informatively sampled durations y1, . . . , yn are:
\[
L(\mu) = \prod_{i=1}^{n} [y_i f(y_i)/\mu] = \prod_{i=1}^{n} [y_i \exp(-y_i/\mu)/\mu^2],
\]
\[
\ell(\mu) = \sum_{i=1}^{n} [\log y_i - y_i/\mu - 2 \log \mu],
\]
\[
\frac{d\ell}{d\mu} = \sum_{i=1}^{n} [y_i/\mu^2 - 2/\mu],
\qquad
\frac{d^2\ell}{d\mu^2} = \sum_{i=1}^{n} [-2 y_i/\mu^3 + 2/\mu^2].
\]

So the MLE of µ is ȳ/2, not ȳ, and its asymptotic variance is ȳ²/(8n). The sample mean is
seriously biased – twice the correct MLE – and its variance is ȳ²/(2n), four times that of
the MLE.
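These results are easy to check by simulation. The sketch below (assuming a population mean of 3 purely for illustration) draws a length-biased sample from an exponential population and compares the sample mean with the MLE ȳ/2.

```python
import numpy as np

# Simulation check of length-biased sampling from an exponential population:
# the sampled mean should be close to mu + sigma^2/mu = 2*mu, and ybar/2
# should recover mu. The value mu = 3 is an arbitrary illustration.
rng = np.random.default_rng(4)
mu = 3.0
population = rng.exponential(mu, size=1_000_000)

# include each unit with probability proportional to its duration y
probs = population / population.sum()
sample = rng.choice(population, size=5_000, replace=True, p=probs)

ybar = sample.mean()
print("population mean      :", round(population.mean(), 3))
print("length-biased mean   :", round(ybar, 3), "(close to 2*mu)")
print("MLE ybar/2           :", round(ybar / 2, 3), "(close to mu)")
print("asymptotic SE of MLE :", round(np.sqrt(ybar ** 2 / (8 * sample.size)), 4))
```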
We do not take this discussion further.
20
Appendix 2 – two-component Gaussian mixture

Writing the two-component Gaussian mixture Hessian formally at length, we have (with the
expected Hessian terms on the left, the covariance terms on the right):
\[
\begin{aligned}
H_y(\mu_1,\mu_1) &= -\sum_{i=1}^{n}\frac{Z_i^*}{\sigma_1^2}
  + \sum_{i=1}^{n}\frac{Z_i^*(1-Z_i^*)(y_i-\mu_1)^2}{\sigma_1^4}\\
H_y(\mu_1,\sigma_1) &= -\frac{2}{\sigma_1^3}\sum_{i=1}^{n} Z_i^*(y_i-\mu_1)
  + \sum_{i=1}^{n}\frac{Z_i^*(1-Z_i^*)(y_i-\mu_1)}{\sigma_1^2}
    \left(-\frac{1}{\sigma_1}+\frac{(y_i-\mu_1)^2}{\sigma_1^3}\right)\\
H_y(\mu_1,\mu_2) &= -\sum_{i=1}^{n}\frac{Z_i^*(1-Z_i^*)(y_i-\mu_1)(y_i-\mu_2)}{\sigma_1^2\sigma_2^2}\\
H_y(\mu_1,\sigma_2) &= \sum_{i=1}^{n}\frac{Z_i^*(1-Z_i^*)(y_i-\mu_1)}{\sigma_1^2}
    \left(\frac{1}{\sigma_2}-\frac{(y_i-\mu_2)^2}{\sigma_2^3}\right)\\
H_y(\mu_1,p) &= \sum_{i=1}^{n}\frac{Z_i^*(1-Z_i^*)(y_i-\mu_1)}{p(1-p)\sigma_1^2}\\
H_y(\sigma_1,\sigma_1) &= \sum_{i=1}^{n}\frac{Z_i^*}{\sigma_1^2}
    \left(1-\frac{3(y_i-\mu_1)^2}{\sigma_1^2}\right)
  + \sum_{i=1}^{n}\frac{Z_i^*(1-Z_i^*)}{\sigma_1^2}
    \left(-1+\frac{(y_i-\mu_1)^2}{\sigma_1^2}\right)^2\\
H_y(\sigma_1,\mu_2) &= -\sum_{i=1}^{n}\frac{Z_i^*(1-Z_i^*)(y_i-\mu_2)}{\sigma_2^2}
    \left(-\frac{1}{\sigma_1}+\frac{(y_i-\mu_1)^2}{\sigma_1^3}\right)\\
H_y(\sigma_1,p) &= \sum_{i=1}^{n}\frac{Z_i^*(1-Z_i^*)}{p(1-p)}
    \left(-\frac{1}{\sigma_1}+\frac{(y_i-\mu_1)^2}{\sigma_1^3}\right)\\
H_y(\mu_2,\mu_2) &= -\sum_{i=1}^{n}\frac{1-Z_i^*}{\sigma_2^2}
  + \sum_{i=1}^{n}\frac{Z_i^*(1-Z_i^*)(y_i-\mu_2)^2}{\sigma_2^4}\\
H_y(\mu_2,\sigma_2) &= -\frac{2}{\sigma_2^3}\sum_{i=1}^{n} (1-Z_i^*)(y_i-\mu_2)
  + \sum_{i=1}^{n}\frac{Z_i^*(1-Z_i^*)(y_i-\mu_2)}{\sigma_2^2}
    \left(-\frac{1}{\sigma_2}+\frac{(y_i-\mu_2)^2}{\sigma_2^3}\right)\\
H_y(\mu_2,p) &= \sum_{i=1}^{n}\frac{Z_i^*(1-Z_i^*)(y_i-\mu_2)}{p(1-p)\sigma_2^2}\\
H_y(\sigma_2,\sigma_2) &= \sum_{i=1}^{n}\frac{1-Z_i^*}{\sigma_2^2}
    \left(1-\frac{3(y_i-\mu_2)^2}{\sigma_2^2}\right)
  + \sum_{i=1}^{n}\frac{Z_i^*(1-Z_i^*)}{\sigma_2^2}
    \left(-1+\frac{(y_i-\mu_2)^2}{\sigma_2^2}\right)^2\\
H_y(\sigma_2,p) &= \sum_{i=1}^{n}\frac{Z_i^*(1-Z_i^*)}{p(1-p)}
    \left(-\frac{1}{\sigma_2}+\frac{(y_i-\mu_2)^2}{\sigma_2^3}\right)\\
H_y(p,p) &= -\sum_{i=1}^{n}\left(\frac{Z_i^*}{p^2}+\frac{1-Z_i^*}{(1-p)^2}\right)
  + \sum_{i=1}^{n}\frac{Z_i^*(1-Z_i^*)}{[p(1-p)]^2}.
\end{aligned}
\]

An important point is that while the expected Hessian terms involve only the linear and
quadratic terms in y, the covariance matrix of the score involves third and fourth powers of
y. Departures of the response model from the assumed Gaussianity in each component will
therefore affect the observed data Hessian and the stated precisions of the model parameters.
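As a minimal illustration, the Python sketch below computes the posterior component weights Z_i* and two of the listed Hessian elements, H_y(µ1, µ1) and H_y(p, p), for simulated data and assumed parameter values.

```python
import numpy as np

# Posterior component weights Z_i* and two observed-data Hessian elements for
# a two-component Gaussian mixture; data and parameter values are illustrative.
def npdf(y, m, s):
    return np.exp(-0.5 * ((y - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

rng = np.random.default_rng(5)
y = np.concatenate([rng.normal(0.0, 1.0, 60), rng.normal(3.0, 1.5, 40)])
mu1, s1, mu2, s2, p = 0.0, 1.0, 3.0, 1.5, 0.6

f1, f2 = npdf(y, mu1, s1), npdf(y, mu2, s2)
z = p * f1 / (p * f1 + (1 - p) * f2)          # Z_i*: posterior weight, component 1

H_mu1_mu1 = -np.sum(z) / s1**2 + np.sum(z * (1 - z) * (y - mu1) ** 2) / s1**4
H_p_p = (-np.sum(z / p**2 + (1 - z) / (1 - p) ** 2)
         + np.sum(z * (1 - z)) / (p * (1 - p)) ** 2)

print("H_y(mu1, mu1) =", round(H_mu1_mu1, 3))
print("H_y(p, p)     =", round(H_p_p, 3))
```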
21
Appendix 3 – StatLab variables

The description and codes for a selection of the variables are given in the tables.

21.1 Child variables

Variable Code/Description
CB Child blood type:
1 O – Rh negative
2 A – Rh negative
3 B – Rh negative
4 AB – Rh negative
5 O – Rh positive
6 A – Rh positive
7 B – Rh positive
8 AB – Rh positive
9 Unknown
LGTH Length of baby to 0.1 inch (2.54 mm)
CBWGT Weight of baby to 0.1 pound (45.4 gm)
C10HGHT Height of child (age ten) to 0.1 inch (2.54 mm)
C10WGT Weight of child (age ten) to nearest pound (454 gm)
PEA Score on the Peabody Picture Vocabulary Test
RA Score on the Raven Progressive Matrices Test

21.2 Family variables

Variable Code/Description
INCB Family income at time of birth, in units of $100
INC10 Family income at child’s age ten, in units of $100
CHURCH Church attendance at child’s age ten:
1 Entire family attends church fairly regularly
2 Mother and child attend fairly regularly
3 Child only attends fairly regularly
4 Anyone in family attends sporadically
5 Anyone in family attends on Holy Days only
6 No-one in family ever attends


21.3 Mother variables

Variable Code/Description
MB Mother’s blood type (same codes)
MAGE Mother’s age at baby’s birth (years)
MBWGT Weight of mother at diagnosis of pregnancy
MBOCC Mother’s occupation at diagnosis of pregnancy:
0 Housewife
1 Office/clerical
2 Sales
3 Teacher/counsellor
4 Professional/managerial
5 Services
7 Factory worker
8 All other
MBSM Mother’s cigarette smoking history at diagnosis of pregnancy:
N Never smoked
Q Smoked at one time but has now quit
01–99 Number smoked per day
M10HGHT Mother’s height at child’s age ten to 0.1 in (2.54 mm)
M10WGT Mother’s weight at child’s age ten to nearest pound (454 gm)
M10ED Mother’s education at child’s age ten:
0 Less than 8th grade
1 8th–12th grade
2 High school graduate
3 Some college
4 College graduate
M10OCC Mother’s occupation at child’s age ten (same codes)
M10SM Mother’s cigarette smoking history at child’s age ten

21.4 Father variables

Variable Code/Description
FB Father’s blood type (same codes)
FAGE Father’s age at baby’s birth (years)
FBOCC Father’s occupation at diagnosis of pregnancy:
0 Professional
1 Teacher/counsellor
2 Manager/official
3 Self-employed
4 Sales
5 Clerical
6 Craftsman/operator
7 Laborer
8 Service worker
FBSM Father’s cigarette smoking history at diagnosis of pregnancy
(same codes)
F10HGHT Father’s height at child’s age ten to 0.1 in (2.54 mm)
F10WGT Father’s weight at child’s age ten to nearest pound (454 gm)
F10ED Father’s education at child’s age ten
F10OCC Father’s occupation at child’s age ten
F10SM Father’s cigarette smoking history at child’s age ten
22
Appendix 4 – a short history of statistics from 1890

This history is of the major contributors to the subject up to the 1960s. The later modern
Bayesian developments are not detailed or discussed here.

22.1 Karl Pearson (1857–1936)


Pearson attended University College School, followed by King’s College, Cambridge, in 1876
to study mathematics, graduating in 1879. He travelled to Germany to study physics and
metaphysics at the University of Heidelberg, and visited the University of Berlin, where
he attended lectures on Darwinism and studied Roman law, medieval and 16th-century
German literature and socialism. He became an accomplished historian and Germanist and
spent much time in Berlin, Heidelberg and Vienna.
Pearson returned to London in 1881 to study law, then returned to mathematics, deputis-
ing for the mathematics professor at King’s College, London, in 1881 and for the professor
at University College, London, in 1883. In 1884, he was appointed to the Goldsmid Chair of
Applied Mathematics and Mechanics at University College, London (Wikipedia).
Late in the 19th century there was no general theory of statistical inference. Bayesian
inference with the uniform prior was widely used, following Laplace’s extension of the use of
Bayes’s theorem, as was the Central Limit Theorem for the Gaussian distribution of sample
means. The “normal” distribution was widely assumed to represent the population structure
of almost all continuous variables.
Pearson became dissatisfied with this widespread assumption, and set out to discredit it.
From 1894 to 1902 he established a new approach to data analysis, which we now describe
as statistical modelling. In this period, he published
• a new visualisation tool for large amounts of data: the histogram;
• the p-value;

• a new test – the famous Pearson X² or χ² goodness-of-fit test – for Gaussianity of a
response variable, based on comparing the fitted density content of the histogram bars
to their sample content through the p-value; this was the first “test of significance”;
• a new family – the “Pearson” family – of probability densities p(x) based on solutions
to the differential equation (Pearson 1895, p. 381)
\[
\frac{p'(x)}{p(x)} + \frac{a + (x - \lambda)}{b_0 + b_1 (x - \lambda) + b_2 (x - \lambda)^2} = 0,
\]

where a, b0, b1 and b2 were functions of the skewness β1 and kurtosis β2 of the density:
\[
b_0 = \mu_2\,\frac{4\beta_2 - 3\beta_1}{10\beta_2 - 12\beta_1 - 18},
\qquad
a = b_1 = \sqrt{\mu_2}\,\sqrt{\beta_1}\,\frac{\beta_2 + 3}{10\beta_2 - 12\beta_1 - 18},
\qquad
b_2 = \frac{2\beta_2 - 3\beta_1 - 6}{10\beta_2 - 12\beta_1 - 18};
\]

• a new “Method of Moments” (later abbreviated to MOM) for fitting these densities
to data, by equating the sample moments to the population moments and solving the
resulting system of simultaneous equations for the parameter estimates.

These tools were widely used. Pearson applied his test to many published data sets, including
the famous Weldon dice data. The χ2 test showed that many of these data sets failed to fit the
Gaussian distribution. These examples began to discredit the almost universal assumption
of “normality”. Weldon’s data failed to fit the symmetric probability of 1/6 for all die faces.
In 1901, with Weldon and Galton, he founded the journal Biometrika whose object was
the development of statistical theory. He edited the journal until his death. In 1911 Pearson
relinquished the Goldsmid chair to become the first Galton professor of eugenics, a chair that
was offered first to him in keeping with Galton’s expressed wish. He formed the Department
of Applied Statistics into which he incorporated the Biometric and Galton laboratories. He
retired in 1933 but continued to work in a room at University College until a few months
before his death.
Pearson’s developments greatly widened the scope of data analysis in Britain and estab-
lished the basis of mathematical statistics internationally.

22.2 Ronald Fisher (1890–1962)


Fisher was a British statistician, geneticist and eugenicist. He graduated with a First in
Mathematics from Cambridge in 1912. At this time Karl Pearson was the most eminent
statistician in Britain. Remarkably, in the same year of his graduation, and at age 22, Fisher’s
first published paper in Messenger of Mathematics put forward a statistical inferential func-
tion based on the probability density of the sample observations. He did not give it a name,
but in his 1922 paper called it the likelihood function. His 1922 and 1925a papers set out a
detailed and nearly complete theory for statistical inference through the likelihood function
and its derivatives. It did not require prior distributions: Fisher was always a strong critic
of Bayesian inference, though his own unsuccessful fiducial theory was very close to it.
From 1919 to 1933 Fisher worked at the Rothamsted Experimental Station analysing the
vast amounts of crop data accumulated since 1842 from their historical field experiments.
This led him to develop the analysis of variance of such experiments, and equally importantly
to improve their design. His books Statistical Methods for Research Workers (1925b) and The
Design of Experiments (1935) became famous and led to a great increase in the importance
of statistics internationally.
Fisher corrected Pearson’s degrees of freedom for the χ2 goodness of fit test. Pearson
had given the number of bins, or class intervals used for the histogram, as the degrees of
freedom. Fisher saw that the “fitted” class frequencies had to add to the sample size, and so

had one linear restriction on them. The number of degrees of freedom had therefore to be
reduced by 1. Pearson did not accept this initially, but later did.
Pearson’s method of moments approach to parameter inference was inconsistent with
Fisher’s likelihood approach, and Fisher dismissed the former as inefficient, as for example
in the negative binomial distribution. Here the moment estimates were not based on the
likelihood, though Fisher allowed their value as initial estimates in a Newton-Raphson ML
procedure.
Nevertheless the method of moments remains alive and well, especially in economics
applications, as the “generalised method of moments” (GMM). This can be used when the
distribution of the response variable is not fully specified, so the likelihood is not defined.

22.3 Jerzy Neyman (1894–1981)


Neyman was a Polish mathematician and statistician who spent the first part of his pro-
fessional career at various institutions in Warsaw, Poland, and then at University College
London. Neyman proposed and studied randomised experiments in 1923. Later in the 1920s
he spent a year at University College London studying with Karl Pearson. In 1933 he was
invited to University College London to work with Egon Pearson, who had succeeded his fa-
ther Karl as the head of the Department of Statistics. Neyman’s paper “On the two different
aspects of the Representative Method: the method of Stratified Sampling and the method of
Purposive Selection”, given at the Royal Statistical Society in 1934, was a groundbreaking
event, leading to the intensive development of modern statistical sampling.
Neyman and Egon Pearson collaborated on extending Fisher’s null hypothesis test to the
comparison of null and alternative hypotheses. Fisher, as a scientist, did not accept the idea
of a formal alternative hypothesis to the null. The latter was the best current scientific model,
and its rejection implied the need for amendment of the current model, not the rejection of
the model in favour of some unspecified model. It was not a matter of a choice between two
scientific models.
Neyman and Pearson (1933) nevertheless worked to extend Fisher’s null hypothesis test-
ing to the comparison of the null and alternative hypotheses through the famous likelihood
ratio test. Pearson (1962) at first considered the use of the value of the likelihood ratio itself
as the test criterion, but Neyman persuaded him to use instead the tail-area probability
beyond the observed value, extending Fisher’s p-value.
Neyman introduced the confidence interval in 1935, based on the repeated sampling
principle. Fisher at first thought that this was equivalent to his own fiducial concept, but
decided later that it was not, and rejected the idea of a repeated sampling interpretation.
Neyman moved to the University of California at Berkeley in 1938 and remained there
until his death. He was a most influential supervisor of 39 PhD students in his 43 years there,
contributing greatly to the strength of the Neyman approach in the USA.

22.4 Harold Jeffreys (1891–1989)


Jeffreys was an English mathematician, statistician, geophysicist and astronomer. He studied
at Armstrong College in Newcastle upon Tyne, then part of the University of Durham, and

now the University of Newcastle. At the University of Cambridge he taught mathematics,
then geophysics and finally became the Plumian Professor of Astronomy.
He is best known in statistics for his book, Theory of Probability, which was first published
in 1939 (third edition 1961), and played an important role in the revival of the “objective”
Bayesian view of probability. Objective here refers to the use of diffuse or “non-informative”
priors, rather than priors personal to the statistician, or based on external data, or based on
replications of the current data. The Jeffreys prior is widely used as a reference or “diffuse”
prior, though Jeffreys himself found that the transformation principle on which it was based
conflicted with another principle he also supported.
Jeffreys was surprised by Fisher’s likelihood-without-prior analysis, and showed in (Gaus-
sian) examples that his diffuse prior analysis led to effectively the same conclusions as Fisher’s
maximum likelihood analysis. Jeffreys and Fisher were both major scientists in their own
different fields, and each respected, but was not influenced by, the other’s different statistical
point of view.
23
References

Abernethy, R.B., Breneman, J.E., Medlin, C.H. and Reinman, G.L. (1983) Weibull Analysis
Handbook. Aero Propulsion Laboratory, Air Force Wright Aeronautical Laboratories,
Wright-Patterson Air Force Base Ohio.
Abernethy, R.B. (2010) The New Weibull Handbook (5th edn.). At www.barringer1.com/
tnwhb.htm.
Agresti, A. and Caffo, B.S. (2000) Simple and effective confidence intervals for proportions
and differences of proportions result from adding two successes and two failures. The
American Statistician 54, 280–288.
Aitkin, M. (1987) Modelling variance heterogeneity in normal regression using GLIM. Ap-
plied Statistics 36, 332–339.
Aitkin, M. (1992) Evidence and the posterior Bayes factor. The Mathematical Scientist 17,
15–25.
Aitkin, M. (2010) Statistical Inference: An Integrated Bayesian/Likelihood Approach. Boca
Raton: CRC Press.
Aitkin, M. (2018) A history of the GLIM statistical package. International Statistical Review
86, 275–299.
Aitkin, M. and Foxall, R. (2003) Statistical modelling of artificial neural networks using
the multi-layer perceptron. Statistics and Computing 13, 227–239.
Aitkin, M., Liu, C.C. and Chadwick, T. (2009) Bayesian model comparison and model
averaging for small-area estimation. Annals of Applied Statistics 3, 199–221.
Aitkin, M. and Stasinopoulos, M. (1989) Likelihood analysis of a binomial sample size
problem. In Contributions to Probability and Statistics: Essays in Honor of Ingram Olkin,
eds L.J. Gleser, M.D. Perlman, S.J. Press and A.R. Sampson, New York: Springer-Verlag,
399–411.
Aitkin, M., Vu, D. and Francis, B. (2014) Statistical modelling of the group structure of
social networks. Social Networks 38, 74–87.
Aitkin, M., Vu, D. and Francis, B. (2017) Statistical modelling of a terrorist network.
Journal of the Royal Statistical Society A 180, 751–768.
Ando, T. (2010) Bayesian Model Selection and Statistical Modelling. Boca Raton: Chapman
and Hall/CRC Press.
Anscombe, F.J. (1964) Normal likelihood functions. Annals of the Institute of Statistical
Mathematics 26, 1–19.
Barbour, C.D. and Brown, J.H. (1974) Fish species diversity in lakes. The American Nat-
uralist 108, 473–489.
Barnard, G.A. (1945) A new test for 2x2 tables. Nature 156, 177.
Barnard, G.A. (1949) Statistical inference. Journal of the Royal Statistical Society B 11,
115–149.


Bartlett, R.H., Roloff, D.W., Cornell, R.G., Andrews, A.F., Dillon, P.W. and Zwischen-
berger, J.B. (1985) Extracorporeal circulation in neonatal respiratory failure: a prospec-
tive randomized study. Pediatrics 76, 479–487.
Begg, Colin B. (1990) On inferences from Wei’s biased coin design for clinical trials (with
discussion). Biometrika 77, 467–484.
Berger, J.O., Bernardo, J.M. and Sun, D. (2009) The formal definition of reference priors.
The Annals of Statistics 37, 905–938.
Bertalanffy, L. von. (1969) General System Theory. New York: George Braziller.
Bishop, J., Huether, C.A., Torfs, C., Lorey, F. and Deddens, J. (1997) Epidemiologic study
of Down Syndrome in a racially diverse California population, 1989–1991. American
Journal of Epidemiology 145, 134–147.
Bliss, C.I. (1935) The calculation of the dose-mortality curve. Annals of Applied Biology
22, 134–167.
Caruso, T.M., Westgate, M.N. and Holmes, L.B. (1998) Impact of prenatal screening on the
birth status of fetuses with Down syndrome at an urban hospital, 1972–1994. Genetics:
Medicine 1, 22–28.
Celeux, G., Forbes, F., Robert, C.P. and Titterington, D.M. (2006) Deviance information
criteria for missing data models. Bayesian Analysis 1, 651–674.
Chambers, R.L. and Clark, R.G. (2012) An Introduction to Model-based Survey Sampling
with Applications. Oxford: Oxford University Press.
Cox, D.R. (1961) Tests of separate families of hypotheses. Proceedings of the 4th Berkeley
Symposium 1, 105–123.
Cox, D.R. (2006) Principles of Statistical Inference. Cambridge: Cambridge University
Press.
Cox, D.R. and Miller, H.D. (1965) The Theory of Stochastic Processes. London: Chapman
and Hall.
Davis, A., Gardner, B.B. and Gardner, M.R. (1941) Deep South: A Social Anthropological
Study of Caste and Class. Chicago: Chicago University Press.
Dempster, A.P. (1974). The direct use of likelihood in significance testing. In Proceedings
of the Conference on Foundational Questions in Statistical Inference, eds O. Barndorff-
Nielsen, P. Blaesild and G. Sihon, 335–352.
Dempster, A.P. (1997) The direct use of likelihood in significance testing. Statistics and
Computing 7, 247–252.
Efron, B. (1979) Bootstrap methods: another look at the jackknife. Annals of Statistics 7,
1–26.
Efron, B. (1986) Double exponential families and their use in generalized linear regression.
Journal of the American Statistical Association 81, 709–721.
Efron, B. and Tibshirani, R.J. (1993) An Introduction to the Bootstrap. New York: Chapman
and Hall.
Ericson, W.A. (1969) Subjective Bayesian models in sampling finite populations (with dis-
cussion). Journal of the Royal Statistical Society B 31, 195–233.
Finney, D. (1947) Probit Analysis: A Statistical Treatment of the Sigmoid Response Curve.
Cambridge: Cambridge University Press.
Fisher, R.A. (1912) On an absolute criterion for fitting frequency curves. Messenger of
Mathematics 41, 155–160.
Fisher, R.A. (1922) On the mathematical foundations of theoretical statistics. Philosophical
Transactions of the Royal Society of London 222A, 309–368.

Fisher, R.A. (1925a) Theory of statistical estimation. Proceedings of the Cambridge Philo-
sophical Society 22, 700–725.
Fisher, R.A. (1925b). Statistical Methods for Research Workers. Edinburgh: Oliver and
Boyd.
Fisher, R.A. (1945). A new test for 2 × 2 tables. Nature 156 (3961), 388.
Freeman, L.C. (2003) Finding social groups: a meta-analysis of the Southern women data.
In Dynamic Social Network Modeling and Analysis, eds R. Breiger, K. Carley and P. Pattison.
Washington, DC: The National Academies Press.
Freedman, B. (1987) Equipoise and the ethics of clinical research. New England Journal of
Medicine 31, 141–145.
Galton, F. (1886) Regression towards mediocrity in hereditary stature. The Journal of the
Anthropological Institute of Great Britain and Ireland 15, 246–263.
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A. and Rubin D.B. (2014)
Bayesian Data Analysis (3rd edn.). Boca Raton: Chapman and Hall/CRC Press.
Genschel, U. and Meeker, W.Q. (2010) A comparison of maximum likelihood and median
rank regression for Weibull estimation. Quality Engineering 22, 234–253.
Geyer, C.J. (1991) Constrained maximum likelihood exemplified by isotonic convex logistic
regression. Journal of the American Statistical Association 86, 717–724.
Gilliatt, R.W. (1948) Vaso-constriction in the finger after deep inspiration. Journal of Phys-
iology 107, 76–88.
Hartley, H.O. and Rao, J.N.K. (1968) A new estimation theory for sample surveys.
Biometrika 55, 547–557.
Hartley, H.O. and Rao, J.N.K. (1969) A new estimation theory for sample surveys, II.
Clearing House for Federal Scientific and Technical Information, US Department of Com-
merce/National Bureau of Standards.
Herson, J. (1976) An investigation of relative efficiency of least squares prediction to con-
ventional probability sampling plans. Journal of the American Statistical Association 71,
700–703.
Higgins, J.F. and Koch, G.G. (1977) Variable selection and generalized chi-square anal-
ysis of categorical data applied to a large cross-sectional occupational health survey.
International Statistical Review 45, 51–62.
Hite, S. (1987) Women and Love – A Cultural Revolution in Progress. New York: Knopf.
Hodges, J.L., Krech, D. and Crutchfield, R.S. (1975) StatLab: An Empirical Introduction
to Statistics. New York: McGraw-Hill.
Holland, P.W. and Leinhardt, S. (1981) An exponential family of probability distributions
for directed graphs. Journal of the American Statistical Association 76, 33–50.
Hotelling, H. (1933) Analysis of a complex of statistical variables into principal components.
Journal of Educational Psychology 24, 417–441, 498–520.
Ibrahim, J.G. and Laud, P.W. (1991) On Bayesian analysis of generalized linear models
using Jeffreys’s prior. Journal of the American Statistical Association 86, 981–986.
Jeffreys, H. (1961) Theory of Probability (3rd edn). Oxford: Clarendon Press.
Kahn, W.D. (1987) A cautionary tale for Bayesian estimation of the binomial parameter
n. The American Statistician 41, 38–40.
Kay, J.W. and Titterington, D.M. (1999) Statistics and Neural Networks: Advances at the
Interface. Oxford: Oxford University Press.
Kemp, A.W. and Kemp, C.D. (1991) Weldon’s dice data revisited. The American Statisti-
cian 45 (3), 216–222.

Kendall, M.G. and Stuart, A. (1966) The Advanced Theory of Statistics, Vol. 3. London:
Griffin Hafner.
Kolmogorov, A.N. (1933) Grundbegriffe der Wahrscheinlichkeitrechnung; translated as
Foundations of Probability. New York: Chelsea Publishing Company.
Laska-Mierzejewska, T. (1970) Effect of ecological and socio-economic factors on the age at
menarche, body height and weight of rural girls in Poland. Human Biology 42, 284–292.
Lazar, N. (2003) Bayesian empirical likelihood. Biometrika 90, 319–326.
Lindsey, J.K. (1997) Applying Generalized Linear Models. New York: Springer-Verlag.
Little, R.J. (2004) To model or not to model? Competing models of inference for finite
population sampling. Journal of the American Statistical Association 99, 546–556.
Lord, F. (1952) A Theory of Test Scores: Psychometric Monograph no. 7. Richmond: Psy-
chometric Corporation.
Lunn, D., Jackson, C., Best, N., Thomas, A. and Spiegelhalter, D. (2013) The BUGS Book.
Boca Raton: CRC Press.
Lusher, D., Koskinen, J. and Robins, G. (eds) (2013) Exponential Random Graph Models
for Social Networks. Cambridge: Cambridge University Press.
Lyons, S. (2020) What is ECMO, extracorporeal membrane oxygenation, and how is it being
used to help severe COVID-19 patients? ABC News, www.abc.net.au/news/health/2020-
07-22/coronavirus-ecmo-explainer/12472498.
McCullagh, P. and Nelder, J.A. (1983) Generalized Linear Models. London: Chapman and
Hall.
McLachlan, G. and Peel, D. (2000) Finite Mixture Models. New York: John Wiley.
Mehta, C. and Senchaudhuri, P. (2003) Conditional versus unconditional exact tests for
comparing two binomials. At www.google.com/search?channel=fs&client=ubuntu&q=
The+Fisher-Barnard+argument+over+the+%22exact%22+conditional+test.
Michie, D., Spiegelhalter, D.J. and Taylor, C.C. (1994) Machine Learning, Neural and
Statistical Classification. New York: Ellis Horwood.
Milicer, H. and Szczotka, F. (1966) Age at menarche in Warsaw girls in 1965. Human
Biology 40, 199–203.
Mitchell, T. and Beauchamp, J. (1988) Bayesian variable selection in linear regression.
Journal of the American Statistical Association 83, 1023–1032.
Nelder, J.A. (1966) Inverse polynomials, a useful group of multi-factor response functions.
Biometrics 22, 128–141.
Neyman, J. (1935) On the problem of confidence intervals. Annals of Mathematical Statistics
6, 111–116.
Neyman, J. and Pearson, E.S. (1933). On the problem of the most efficient tests of statistical
hypotheses. Philosophical Transactions of the Royal Society of London A 231, 289–337.
Owen, A.B. (1988) Empirical likelihood ratio confidence intervals for a single functional.
Biometrika 75, 237–249.
Owen, A.B. (1995) Nonparametric likelihood confidence bands for a distribution function.
Journal of the American Statistical Association 90, 516–521.
Owen, A.B. (2001) Empirical likelihood. Boca Raton: Chapman and Hall/CRC Press.
Pearson, E.S. (1962) in the discussion of L.J. Savage The Foundation of Statistical Inference.
London: Methuen.
Pearson, K. (1895). Contributions to the mathematical theory of evolution. II. Skew vari-
ation in homogeneous material. Philosophical Transactions of the Royal Society A 186,
343–414.

Pearson, K. (1900) On the criterion that a given system of deviations from the probable in
the case of a correlated system of variables is such that it can be reasonably supposed
to have arisen from random sampling. Philosophical Magazine 50, 157–175.
Piper, D.W., McIntosh, J.H., Ariotti, D.E., Calogiuri, J.V., Brown R.W. and Shy, C.M.
(1981) Life events and chronic duodenal ulcer: a case control study. Gut 22, 1011–1017.
Pitman, E.J.G. (1979) Some Basic Theory for Statistical Inference. London: Chapman and
Hall.
Plackett, R.L. (1977) The marginal totals of a 2 × 2 table. Biometrika 64, 37–42.
Postman, M.J., Huchra, J.P. and Geller, M.J. (1986) Probes of large-scale structures in the
Corona Borealis region. The Astronomical Journal 92, 1238–1247.
Racine, A., Grieve, A.P., Flühler, H. and Smith, A.F.M. (1986) Bayesian methods in prac-
tice: experiences in the pharmaceutical industry. Applied Statistics 35, 93–150.
Ripley, B.D. (1996) Pattern Recognition and Neural Networks. Cambridge: Cambridge Uni-
versity Press.
Roeder, K. (1990) Density estimation with confidence sets exemplified by superclusters and
voids in the galaxies. Journal of the American Statistical Association 85, 617–624.
Royall, R.M. and Cumberland, W.G. (1981) An empirical study of the ratio estimator
and estimators of its variance (with discussion). Journal of the American Statistical
Association 76, 66–88.
Rubin, D.B. (1981) The Bayesian bootstrap. Annals of Statistics 9, 130–134.
Ruppert, D., Wand, M.P. and Carroll, R.J. (2003) Semiparametric Regression. Cambridge:
Cambridge University Press.
Schmidt, G., Mattern, R. and Schueler, F. (1981) Biomechanical investigation to determine
physical and traumatological differentiation criteria for the maximum load capacity of
head and vertebral column with and without protective helmet under the effects of
impact. Tech. Report, Institut für Rechtsmedizin, University of Heidelberg, Germany.
Schønheyder, F. (1936) The quantitative determination of vitamin K. Biochemical Journal
30, 890–896.
Shaw, L.P. and Shaw, L.F. (2019) The flying bomb and the actuary. Significance 16(5),
12–17.
Sheppard, W.F. (1897) On the calculation of the average square, cube, of a large number
of magnitudes. Journal of the Royal Statistical Society 60, 698–703.
Sheppard, W.F. (1898) On the application of the theory of error to cases of normal distri-
butions and normal correlations. Philosophical Transactions of the Royal Society A 192,
101.
Si, Y. and Reiter, J.P. (2013) Nonparametric Bayesian multiple imputation for incomplete
categorical variables in large-scale assessment surveys. Journal of Educational and Be-
havioral Statistics 38, 499–521.
Silverman, B.W. (1985) Some aspects of the spline smoothing approach to non-parametric
curve fitting (with discussion). Journal of the Royal Statistical Society B 47, 1–52.
Smith, T.M.F. (1976) The foundations of survey sampling: a review (with discussion).
Journal of the Royal Statistical Society A 139, 183–204.
Smyth, G.K. (1986) Modelling the dispersion parameter in generalized linear models. In
Proceedings of the Statistical Computing Section. Alexandria: American Statistical Asso-
ciation, 278–283.
Smyth, G.K. (1989) Generalized linear models with varying dispersion. Journal of the Royal
Statistical Society B 51, 47–60.

Spiegelhalter, D.J., Best, N.G., Carlin, B.P. and van der Linde, A. (2002) Bayesian measures
of model complexity and fit (with discussion). Journal of the Royal Statistical Society B
64, 583–639.
Sprott, D.A. (1980) Maximum likelihood in small samples: Estimation in the presence of
nuisance parameters. Biometrika 67, 515–523.
Surkova, E., Nikolayevskyy, V. and Drobniewski, F. (2020) False-positive COVID-19 results:
Hidden problems and costs. Lancet Respiratory Medicine 8(12), 1167–1168.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society B 58, 267–288.
Valliant, R., Dorfman, A.H. and Royall, R.M. (2000) Finite Population Sampling and In-
ference: A Prediction Approach. New York: Wiley.
Venables, W.N. and Ripley, B.D. (2002) Modern Applied Statistics with S (4th edn.). New
York: Springer.
Vermunt, J.K., Van Ginkel, J.R., Van der Ark, L.A. and Sijtsma, K. (2008) Multiple imputa-
tion of categorical data using latent class analysis. Sociological Methodology 33, 369–397.
Wainer, H. (2016) Truth or Truthiness: Distinguishing Fact from Fiction by Learning to
Think Like a Data Scientist. New York: Cambridge University Press.
Ware, J.H. (1989) Investigating therapies of potentially great benefit: ECMO (with discus-
sion). Statistical Science 4, 298–340.
White, H. (1980) A heteroskedasticity-consistent covariance matrix estimator and a direct
test for heteroskedasticity. Econometrica 48, 17–38.
Index

A model comparison, 97
Acceleration of, motorcycle helmet, 15 null and alternative hypotheses, two
α cdf, 273, 274 models, 97–100
α density, 273, 274 pivotal functions, 134–135
Alternative hypotheses, 97–100, uniform distribution, 136
132–134, 257 µ0 vs. µ ̸= µ0, 131–134
Analysis of Covariance (ANCOVA), 219 µ1 vs. µ2, 130–131
Analysis of Variance (ANOVA), 218 Bayesian inference, 78–79, 127
Axioms probability, 35 prior arguments, 127–128
coin tossing, 35–36 Bayesian interpretation, 134, 220
Bayesian residuals, 199, 209
dice example, 35
Bayesian theory, 63–64, 122–123
B Bayes’s theorem, 64–65
bootstrap, 68
Backward elimination, 218–219 conjugate prior distributions, 67
Bayes factor, 185, 186 conjugate priors, 123
Bayesian analysis, 87, 138–139, frequentist objections, flat priors, 69
146–147 general prior specifications, 69–70
binary response models, 272–276 improving frequentist interval
credible interval, 94 coverage, 67
double GLMs non-informative prior rules, 68–69
absence from school data, 312–314 parameters, random variables, 70
fish species data, 313, 317–318 synopsis of, posterior distribution,
hospital beds and patients, 309–312 65–66
log-likelihood, 308–309 Bayes’s theorem, 64–65, 249
model assessment, 320, 321 Beetle data, 273, 277–279
sea temperatures, 316–320 Beta distribution, 71, 170
weight functions, 309 β cdf, 273, 275
inference for µ, 139–140 β density, 273, 275
inference for σ, 139 Big Data, 3
parametric functions, 140–142 Binary response models
posterior sampling, 87–89 Bayesian analysis, 272–276
prediction of, new observation, 142–143 beetle data, 273, 277–279
sampling without replacement, 89–90 binomial link functions and their
Bayesian and frequentist inferences, 203, 204 origins, 268–269
Bayesian bootstrap (BB), 171–175, 178–179 maximum likelihood, 271–272
analysis, 204 probit and logit transformation,
and posterior weighting, 299, 302–305 268, 269
Bayesian deviance, 188, 257 Racine data, 269–271
Bayesian formulation, 97, 142, 218 Binomial distribution, 53
Bayesian hypothesis testing binomial likelihood function,
conjugate priors, 135 53–54


maximum likelihood estimate (MLE), CLL function, see Complementary log-log


57–58 link function
sufficient and ancillary statistics, 54–57 CLT, see Central Limit Theorem
Binomial link functions, 268–269 Complementary log-log link (CLL) function,
Binomials comparison 268, 277–279
Bayesian hypothesis testing/model Complete data
comparison, 97 density function, 235
null, alternative hypotheses and two likelihood, 235, 239, 243
models, 97–100 log-likelihood, 239–240
definition, 91 Computer sampling, 30–31
ECMO trials, 105 natural random processes, 30–31
first trial, 105–106 Constrained log-likelihood, 169
frequentist analysis, 106–107 Correlation coefficient, 207
likelihood, 107–109 Counts of, phone lifetimes, 112
second trial, 109–110 Cross-classifications, binary data
measures of, treatment difference, observed and proportions, all
100–102 regions, 296
conditional testing, 104–105 proportion, 292–293
frequentist analysis, hypothesis Region 1 to 4, 295
testing, 102 tolerant of, racial intermarriage,
hypothetical samples drawn, 103 292, 296
Monte Carlo simulation, 94–95 Cross-products matrix, 201
randomised clinical trial (RCT), 95–96 Cumulative distribution function (cdf), 1, 63
treatment of, duodenal ulcers, 92–93 78th quantile, lifetime, 155
Bayesian analysis, credible interval, 94 78th quantile, log hours, 148
frequentist analysis, confidence interval, 78th quantile, lognormal, 144
93–94 and empirical mass, 113
Biometrika, 356 motorcycle data, 346–347
Birthweight cumulative proportions, 124 Nile flood volumes, 321, 323
Bivariate Gaussian, 242, 243 of r, 153
Block diagonal, 202, 308 of θ, 154
Bootstrap, 68
distribution, 68, 252 D
frequentist, 178–180 DA algorithm, see Data Augmentation
likelihood ratio test, 252 algorithm
standard error, 178 Dam, Henrik, 195
Break-point regression, see Data and research study
Segmented/broken-stick regressions clustering of, V1 missile hits, 5, 6
court case, vaccination risk, 6
C
dugong growth, ages and lengths, 15
cdf, see Cumulative distribution function fish species, 10
Censoring, 149–151, 239–240 global warming, 15
Central Limit Theorem (CLT), 126 hospital bed use, 13
Change-point regression, see hostility, 11–12
Segmented/broken-stick regressions incidence of, Down’s syndrome, 9–10
Chebyshev inequality, 298 lifetimes of, radio transceivers, 5–6
Child Health and Development Study, 21 school absence, 10–11
Child’s blood group, 88–89 simulated motorcycle collision, 15
Child variables, 353 social group membership, 17
Clinical trial of Depepsen, 93 species counts, 9
Index 367

tolerance of, racial intermarriage, 12–13 four regions analysis, 284–287


toxicology, 9 incidence of, 9–10
treatment of, duodenal ulcers, 6–7 segmented/broken-stick regressions,
treatment respiratory distress, 327–328
newborn, 7–8 DPP, see Dirichlet process prior
Vitamin K, 8 Drug Depepsen, 6, 92
Data Augmentation (DA) algorithm, 251 Dugong growth, 13
Dirichlet process prior, 263 “Dummy variable” regression, 210
galaxy recession velocity, 251–263
Data matrix, 3 E
Data structure, 3, 227–232
ECMO, see Extra-Corporeal Membrane
Data visualisation, 3, 111
Oxygenation
empirical mass and cumulative
Elastic net, 219–220
distribution functions, 113
EM algorithm, 235, 237
histogram, 111–113
boy birthweight data, 247–250
probability models for, continuous
complete data likelihood, 243, 245
variables, 113–116
latent component identifier, 248
Deep learning, 337–338
MLE of λ, 239
Degree of, belief, 30
Empirical and ML fitted
Depepsen treatment group, 92–93
log exponential integrated hazard, 115
The Design of Experiments (Fisher),
lognormal, cdfs, 144
218, 356
Weibull log integrated hazard, 149
Deviance, 188–189, 266
Empirical cumulative distribution function
asymptotic distribution of, 190
(ecdf), 113
distribution, 257, 262
Empirical likelihood, 169–171
galaxy and asymptotic, 258–262
Experimental error, 232
motorcycle data, 344–346
Exponential distribution, 117–118, 183
DGLM polynomial regression, 317, 320
censoring, 239–240
Dirichlet multinomial, 172, 173
deviance, 192, 193
Dirichlet posterior weighting, 176, 178
Exponential family, 265
Dirichlet process prior (DPP)
Exponential likelihood, 118–119
Data Augmentation algorithm, 263
Exponential model assessment, 158–163
Haldane prior, 173
Exponential random graph model (ERGM),
Doisy, E.A., 195
341–342
Double GLMs, 307
Exponential relative likelihood, λ, 119
Bayesian analysis (see Bayesian
Extra-Corporeal Membrane Oxygenation
analysis, double GLMs)
(ECMO), 7–8, 105
maximum likelihood, 307–308
first trial, 105–106
motorcycle data, 342–348
frequentist analysis, 106–107
Down’s incidence, 283
likelihood, 107–109
break-point quadratic model, 327, 328
second trial, 109–110
four regions, 286
Extrasensory perception (ESP), 31–33
and full quadratic logit model, 286
Extreme value distribution, 148, 164, 269
logit scale, 284
quadratic logit model, 283–284,
286–287 327 F
quadratic model and credible Family variables, 353
region, 285 Father’s occupational category, 88
Down’s syndrome, 10–11 Father variables, 354
BC analysis, 280, 284, 285 Field telephones, 5, 111
368 Index

Fifth-degree polynomial regression, 318, 320 model-robust analysis, 204


Finite population, 50–51, 89 Bayesian bootstrap (BB) analysis,
Finney vasoconstriction data, 287–292 204, 206
Fisher, Ronald (1890–1962), 356–357 posterior distribution of α, 204, 205
Fisher conditional test, 104–105 posterior distribution of β, 204, 205
Fisher scoring algorithm, 267 posterior distribution of σ, 204, 206
Fourth-degree polynomial regression, robust variance estimate, 206
317, 318 sum of squares, 201
Frequentist analysis, 137–138, 146 Gaussian mean model, 202, 203
confidence interval, 93–94 Gaussian mixture, 247–251
Frequentist deviance, 188, 257 Gaussian MLEs, 206
Frequentist formulation, 143, 208 Gaussian model assessment, 157–158
Frequentist hypothesis testing, 129 Gaussian multiple regression likelihood,
µ0 vs. µ ̸= µ0, 130 215–216
µ1 vs. µ2, 129 absence from school, 216–217
Frequentist inference, 125–127 Gaussian prior distribution, 220
Frequentist residuals, 199, 211, 215 Gaussian regression model, 242
Frequentist theory, 58–60, 119 Gelman, A., 272
ambiguity of, notation, 63 Generalised Linear Interactive Modelling
frequentist asymptotics, 121 (GLIM), 265–266
parameter transformations, 60–63, Generalised linear models (GLMs)
119–121 Bayesian analysis from ML, 268
Full three-way interaction model, 227 Bayesian package development,
267–268
G binary response models (see Binary
Galaxy recession velocity, 251–263 response models)
Galton, Sir Francis, 195 extensions (see Double GLMs)
Gamma distribution, 122, 151 Fisher scoring algorithm, 267
deviance, 192, 193 gamma regression, 306
Gamma likelihood, 151 menarche data (see Menarche data,
Bayesian analysis, 152–155 GLMs)
frequentist analysis, 151–152 Newton-Raphson algorithm, 266–267
Gamma model assessment, 164 Poisson regression, fish species
Gamma regression, 306 frequency, 296–299
Gaussian approximation, 60–62, 121, 299 Bayesian bootstrap and posterior
Poisson regression, 299–301 weighting, 299, 302–305
Gaussian distribution, 123–124, 137, 199, conjugate W, 305–306
269, 320, 328 Gaussian approximation, 299–301
covariate distribution modelling, negative binomial model, 305
242–244 omitted variables, 305
“error” terms, 200 overdispersion, 305
Gaussian likelihood function, 125 single index, 267
statistical inference III, 137 Generalised method of moments
Gaussian linear regression model (GMM), 357
likelihood for, 199–201 General Newton-Raphson algorithm, 266
model assumptions, 211, 215 Geometric Bayesian deviance, 188
modelling Geometric distribution
boy birthweights, 220–223 vs. Poisson distribution, 183–184
girl intelligence, 223–226 Geometric-Poisson log-likelihood differences,
of hostility data, 226–232 186, 187
Index 369

GLIM, see Generalised Linear Interactive observed data score vector, 236
Modelling randomly missing Gaussian
GLMs, see Generalised linear models observations, 240–241
GMM, see Generalised method of moments Information matrix, 236, 237
Intention to treat (ITT) analysis, 103
H Inverse polynomials, 336–337
Inverse probability weighting (IPW), 180
Haldane prior, 68, 87–88
criticisms J
Bayesian bootstrap, 171, 172
Dirichlet process prior, 173 Jeffreys, Harold (1891–1989), 357–358
multivariate hypergeometric Joint draws of r, θ, 154
distribution, 172
posterior sampling, 173–175 K
structural zeros, 171 Kings of Spades (KOSs), 42
improper, 171
Helicobacter pylori, 96 L
Hessian matrix, 236 Laplace distribution, 220
Heterogeneous regressions, 328–333 Lasso, 219–220
Heteroscedasticity, 206 Latent class/mixed Rasch model,
Highly non-linear functions, 334–337 341–342
Histogram, 111–113, 355 Latent component identifier, 248
Horvitz-Thompson (HT) estimator, 181; LD48 cdf, 273, 276
see also Weighted sample LD88 cdf, 273, 276
mean Least squares fit to the log integrated
Hospital bed use, 13 hazard, 150
Hostility data, modelling of Length-biased sampling, 349–350
control group, 226 Likelihood and Gaussian approximation,
counts, means and SDs of affection, 61–62
226, 227 Likelihood function, 356
data structure, 227–232 Likelihood ratio, 41–42, 129–133, 185
Gottschalk-Gleser scales, 226 Likelihood ratio test (LRT), 102, 185, 252
Hypothesis testing, 128–129 Likelihood-without-prior analysis, 358
Linear predictor, 215
I Logistic linear regression model, 270–271
Improper Haldane prior, 87, 171 Log-likelihoods, 59–60, 186, 188
Incomplete data model, 51 Poisson and geometric, 187
Bayesian analysis and DA algorithm Lognormal distribution, 143, 183
(see Data Augmentation deviance, 192, 193
algorithm) lognormal density, 143–144
definition, 235 Lognormal model assessment, 158
EM algorithm (see EM algorithm) LRT, see Likelihood ratio test
exponential distribution, censoring in,
239–240 M
Hessian matrix, 236 MAR, see Missing at random
information matrix, 236–237 Marginal posterior distribution, 173
lost data, 238–239 Maximised likelihoods, 183, 185
missingness (see Missingness) Maximum likelihood, 201–202, 265–266
mixture distributions, Bayesian analysis from, 268
247–251 binary response models, 271–272
370 Index

double GLMs, 307–308 Gamma model assessment, 164


for GLMs (see Generalised linear Gaussian model assessment, 157–158
models) Lognormal model assessment, 158
Maximum likelihood estimate (MLE), prediction as tool, 209
57–58, 119–120, 126 probability, 209–210
frequentists, 185–186 through residual examination,
Maximum resolution histogram of, family 199, 200
income, 112 Weibull model assessment, 163–164
Maximum resolution population Model averaging, 192–194
histogram, 167 Model-based inference theories, 51
MCAR, see Missing completely at random Model choice, 192–193
Mean deviance, 257 Model comparison
Mean imputation, 239 Bayesian vs. frequentist model
Measures of, treatment difference, 100–102 known parameters, 185
conditional testing, 104–105 unknown parameters, 185–186
frequentist analysis, hypothesis null vs. alternative hypotheses,
testing, 102 97–118
hypothetical samples drawn, 103 Poisson vs. geometric distribution,
Median Rank Regression (MRR), 148–149 183–184
Menarche data, GLMs posterior distribution, 186–188
cross-classifications with binary data, Model deviances, 257
292–296 Model residuals, 320, 321
Down’s syndrome analysis, 283–287 MOM, see Method of Moments
Finney vasoconstriction data, 287–292 Monte Carlo simulation, 94–95
likelihood, 280, 281 Mother variables, 354
number of girls assessed, 279 Motorcycle data, 342–348
posterior cdf, 280–282 MRR, see Median Rank Regression
proportion data, 280 Multinomial distribution, 183, 193
Messenger of Mathematics (Fisher), 356 Bayesian analysis, 170–171
Method of Moments (MOM), 356 covariance structure, 168–169
−2 log L(θ), 188; see also deviance covariate distribution modelling,
Missing at random (MAR), 237, 238 244–246
Missing completely at random (MCAR), 237 Dirichlet posterior weighting, 176, 178
“Missing data” algorithms, 235 frequentist analysis, 169–170
Missingness frequentist bootstrap, 178–179
missing responses/covariates two-category sample, 179–180
covariate distribution modelling, Haldane prior, criticisms (see Haldane
Gaussian, 242–244 prior)
covariate distribution modelling, likelihood, 169
multinomial, 244–246 multinomial quantiles, inference for,
multiple covariates missing, 246 175–177
in single covariate in simple linear population multinomial
regression, 242 distribution, 167
types, 237 population proportion parameters,
Missing non-randomly (MNR), 237 167–168
Mixture of experts model, 247 StatLab family income histogram, boy
MLE, see Maximum likelihood estimate population, 167, 168
MNR, see Missing non-randomly stratified sampling and weighting,
Model assessment, 157, 197 180–181
Exponential model assessment, 158–163 Multinomial likelihood, 169
Index 371

N two-parameter binomial distribution,


Natchez women network, 339, 340 81–82
Natural/canonical parameter, 265 Bayesian analysis, 84–85
Negative binomial marginal frequentist analysis, 82–84
distribution, 305 Poisson fitted quadratic model, 298
Negative binomial model, 305 Poisson regression, fish species frequency,
Negative binomial probability, 306 296–299
Bayesian bootstrap and posterior
Nested models, 185
deviance difference, 191, 192 weighting, 299, 302–305
Gaussian model, 191 conjugate W, 305–306
null hypothesis, 191 Gaussian approximation, 299–301
Neural networks, 337–338 negative binomial model, 305
Newsday sample, 25 omitted variables, 305
overdispersion, 305
Neyman, Jerzy (1894–1981), 357
Non-nested models, 185 Population multinomial distribution, 167
Null hypotheses, 97–100, 257 Posterior density
Null hypothesis test, 357 placebo and depepsen, 94
Null random graph model, 340 of r, 153
of α, 146
O of θ, 147
Posterior deviances, 262, 263
Observed data Hessian, 250 Posterior distribution, 192–193
Overdispersion, 305 of likelihood, 186–188
Poisson and geometric Bayesian
P
deviances, 188, 189
PCA, see Principal Component Posterior predictive distribution, 142–143,
Analysis 207–208
Peabody test and score, 223–224, 226 Posterior probability of model, 185
Pearson, Karl (1857–1936), 355–356 Posterior sampling, 87–89, 173–175
Pearson X 2 , 355 inferences, 70–71
Penalised deviances, 257 precision of, posterior draws, 71–73
Penalised log-likelihood, 219 Prediction, regression model, 207–208
Peptic ulcers, stomach and duodenal, 7, 92 as model assessment tool, 209
Phone empirical and exponential survivor vitamin K concentration, 208–209
functions, 115 Predictive inference, 207–208
Phone empirical survivor function, 114 Principal Component Analysis
Phone lifetimes empirical cdf, 114 (PCA), 232
Piecewise regression, see Principal component regression, 232–233
Segmented/broken-stick regressions Probability, 29
Pitman, E.J.G., 167 axioms, 35
Placebo control group, 93 coin tossing, 35–36
Poisson Bayesian deviance, 188 dice example, 35
Poisson distribution, 75–77 computer sampling, 30–31
Bayesian inference, 78–79 natural random processes, 30–31
vs. geometric distribution, 183–184 degree of, belief, 30
Poisson likelihood and ML, 77–78 density, 349–350, 355–356
prediction of, new Poisson value, 79 misuse of, Sally Clark case, 39–42
side effect risk, 80 random variables and distributions, 42
Bayesian analysis, 81 definitions, 42–45
frequentist analysis, 80–81 relative frequency, 29
372 Index

sampling S
extrasensory perception (ESP),
SAE, see Small area estimation
31–33
Sample surveys
representative sampling, 33–35
bias, newsday sample, 25
screening tests and Bayes’s theorem, bias, women and love sample, 25–27
36–39 children desire, 24
StatLab dice sampling, 30
representative sampling, 24
sums of, independent random
women and love, 23
variables, 45
Sampling
Probability density function, 117 computer, 30–31
Probability model assessment, 209–210 natural random processes, 30–31
Probability models for, continuous variables, extrasensory perception (ESP), 31–33
113–116
representative, 33–35
Profiling, 336
representative sampling, 33–35
p-value, 82, 102, 130, 355
Saturated model, 340
p-variable multiple regression model, 215 “Scale-load” distribution, 167
Q Scale parameter, 265
Screening tests and Bayes’s theorem, 36–39
Quadratic logit model, 277–279 Segmented/broken-stick regressions,
320–321
R Down’s syndrome, 327–328
Racine data, 269–271 modelling the break, 323, 325–327
Racine data likelihood, 272, 273 Nile flood volumes, 321–323
Radio transceivers SHM, see Simple harmonic motion
lifetimes and cumulative numbers, 113 SIDS, see Sudden Infant Death Syndrome
lifetimes and numbers, 5–6, 113 Simple harmonic motion (SHM), 343,
Randomised clinical trial (RCT), 91, 344, 346
95–96 Simple linear regression model, 195, 196
treatment of, duodenal ulcers, correlation, 207
92–93 “dummy variable” regression, 210
Bayesian analysis, credible interval, 94 likelihood function, 199–201
frequentist analysis, confidence interval, ML estimates, 202
93–94 prediction, 207–208
Random variables as model assessment tool, 209
and distributions, 42–45 vitamin K concentration, 208–209
sums of, independent, 45 two-variable models, 216
Rasch model, 340–341 absence vs. dependents, 211, 212
Raven test and score, 223–226 absence vs. dependents and fitted linear
RCT, see Randomised clinical trial model, 211, 214
Regression sum of squares (RegSS), 202 absence vs. IQ, 211, 212
RegSS, see Regression sum of squares absence vs. IQ and fitted linear model,
Relative frequency, 29 211, 213
Replication, 232 interactions, 217–219
Representative sampling, 24, 33–35 IQ vs. dependents, 211, 213
Residual sum of squares (RSS), 202, 207 Simulated motorcycle collision, 14–16
“Restricted” ML (REML) estimate, 227 joint model ML, 14
“Restrictive” covariance structure, 168, 173 patients treated and hospital beds, 15
Reversed extreme value distribution, 269 Simulation marginalisation, 140
Ridge regression, 219–220 Single-index model, 215
Index 373

Small area estimation (SAE), 51 Statistical inference II, continuous


Social networks exponential, Gaussian and uniform
exponential random graph model distributions
(ERGM), 341–342 Bayesian hypothesis testing
Natchez women network, 339, 340 conjugate priors, 135
network structures, history of, 338–339 pivotal functions, 134–135
statistical models, 340–341 uniform distribution, 136
Statistical analysis, 1 µ0 vs. µ ̸= µ0, 131–134
Statistical inference, 1–2; see also individual µ1 vs. µ2, 130–131
entries Bayesian inference, 127
Statistical inference I, discrete distributions prior arguments, 127–128
basis of, 47 Bayesian theory, 122–123
Bayesian analysis, 87 conjugate priors, 123
posterior sampling, 87–89 exponential distribution, 117–118
sampling without replacement, 89–90 exponential likelihood, 118–119
Bayesian theory, 63–64 frequentist hypothesis testing, 129
Bayes’s theorem, 64–65 µ0 vs. µ ̸= µ0, 130
bootstrap, 68 µ1 vs. µ2, 129
conjugate prior distributions, 67 frequentist inference, 125–127
frequentist objections, flat priors, 69 frequentist theory, 119
general prior specifications, 69–70 frequentist asymptotics, 121
improving frequentist interval parameter transformations, 119–121
coverage, 67 Gaussian distribution, 123–124
non-informative prior rules, 68–69 Gaussian likelihood function, 125
parameters, random variables, 70 hypothesis testing, 128–129
synopsis of, posterior distribution, Statistical inference III, two-parameter
65–66 continuous distributions, 137
binomial distribution, 53 Bayesian analysis, 138–139
binomial likelihood function, 53–58 inference for µ, 139–140
categorical variables, multinomial inference for σ, 139
distribution, 85–86 parametric functions, 140–142
evidence-based policy, 47 prediction of, new observation, 142–143
frequentist theory, 58–60 frequentist analysis, 137–138
ambiguity of, notation, 63 gamma distribution, 151
parameter transformations, 60–63 gamma likelihood, 151
likelihood function, 52–53 Bayesian analysis, 152–155
maximum likelihood, 86–87 frequentist analysis, 151–152
model-based inference theories, 51 Gaussian distribution, 137
parameter transformations, 74–75 lognormal distribution, 143
Poisson distribution lognormal density, 143–144
Bayesian inference, 78–79 Weibull distribution, 145
Poisson likelihood and ML, 77–78 Bayesian analysis, 146–147
prediction of, new Poisson value, 79 censoring, 149–151
side effect risk, 80–81 extreme value distribution, 148
two-parameter binomial distribution, frequentist analysis, 146
81–85 Median Rank Regression (MRR),
posterior sampling inferences, 70–71 148–149
precision of, posterior draws, 71–73 Weibull likelihood, 145
sample design, 73–74 Statistical Methods for Research Workers
survey sampling approach, 48–51 (Fisher), 356
374 Index

Statistical modelling, 1; see also individual ANCOVA, 219


entries ANOVA, 218
Statistics, history from 1890 backward elimination, 218–219
Fisher, Ronald (1890–1962), 356–357 IQ vs. dependents, 211, 213
Jeffreys, Harold (1891–1989), 357–358
Neyman, Jerzy (1894–1981), 357 U
Pearson, Karl (1857–1936), 355–356 Unobserved heterogeneity, see
Statistics: Concepts and Controversies, 24 Overdispersion
StatLab database, 30, 85
population, 21 V
population questions, 21
V1 missile hits, 5, 6
variables types, 20
Vaccination risk, court case, 6
StatLab dice sampling, 30
Variance heterogeneity, 232
StatLab variables, 353–354
Vitamin K, 8, 203
Stopping rule, 106–107, 109
concentration vs. dose, 195–196
Sudden Infant Death Syndrome (SIDS),
log concentration vs. log dose,
40–41
196–198
Sudden Unexplained Death in Infancy
ML fitted linear model, 197, 198
(SUDI) study, 40
ML fitted model, 197, 198
Sufficient and ancillary statistics, 54–57
ML mean function and precision region
Sum of cross-products vector, 201
bounds, 202, 203
Sum of squares, 201
residuals, 199
von Bertalanffy growth function, 334
T
Theory of Probability (Jeffreys), 358 W
Tolerance of, racial intermarriage, 12–13 Weibull distribution, 145, 183
Toxicology, 9 Bayesian analysis, 146–147
Treatment of, duodenal ulcers, 92–93 censoring, 149–151
Bayesian analysis, credible interval, 94 deviance, 192, 193
frequentist analysis, confidence interval, extreme value distribution, 148
93–94 frequentist analysis, 146
Two-component Gaussian mixture Hessian, Median Rank Regression (MRR),
351–352 148–149
Two-component Gaussian mixture model, Weibull likelihood, 145
247–251 Weibull likelihood, 145
Two-variable models, 216 Weibull model assessment,
absence vs. dependents, 211, 212 163–164
absence vs. dependents and fitted linear Weighted sample mean,
model, 211, 214 180–181
absence vs. IQ, 211, 212 Women and love sample
absence vs. IQ and fitted linear model, bias, 25–27
211, 213 sexual relationships, 23
interactions, 217 statistics, 23