Murray Aitkin - Introduction To Statistical Modelling and Inference-CRC Press - Chapman & Hall (2022)
Features
• Probability models are developed from the shape of the sample empirical cumulative
distribution function (cdf) or a transformation of it.
• Bounds for the value of the population cumulative distribution function are obtained from
the Beta distribution at each point of the empirical cdf.
• Bayes’s theorem is developed from the properties of the screening test for a rare condition.
• The multinomial distribution provides an always-true model for any randomly sampled
data.
• The model-free bootstrap method for finding the precision of a sample estimate has a
model-based parallel – the Bayesian bootstrap – based on the always-true multinomial
distribution.
• The Bayesian posterior distributions of model parameters can be obtained from the
maximum likelihood analysis of the model.
This book is aimed at students in a wide range of disciplines including Data Science. The book
is based on the model-based theory, used widely by scientists in many fields, and compares it, in
less detail, with the model-free theory, popular in computer science, machine learning and official
survey analysis. The development of the model-based theory is accelerated by recent developments
in Bayesian analysis.
Murray Aitkin earned his BSc, PhD and DSc from Sydney University, Australia, in Mathematical
Statistics. Dr Aitkin completed his post-doctoral work at the Psychometric Laboratory, University
of North Carolina, Chapel Hill. He has held teaching/lecturing positions at Virginia Polytechnic
Institute, the University of New South Wales and Macquarie University along with research
professor positions at Lancaster University (three years, UK Social Science Research Council)
and the University of Western Australia (five years, Australian Research Council). He has been a
Professor of Statistics at Lancaster University, Tel Aviv University and the University of Newcastle.
He has been a visiting researcher and also held consulting positions at the Educational Testing
Service (Fulbright Senior Fellow 1971–1972 and Senior Statistician 1988–1989). He was the Chief
Statistician from 2000 to 2002 at the Education Statistics Services Institute, American Institutes
for Research, Washington DC, and advisor to the National Center for Education Statistics, US
Department of Education.
He is a Fellow of the American Statistical Association, an Elected Member of the International
Statistical Institute, and an Honorary Member of the Statistical Modelling Society.
He is an Honorary Professorial Associate at the University of Melbourne (Department of
Psychology 2004–2008, Department [now School] of Mathematics and Statistics 2008–present).
Introduction to Statistical
Modelling and Inference
Murray Aitkin
University of Melbourne, Australia
First edition published 2023
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume respon-
sibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the
copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify
in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any
form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming,
and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright
Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on
CCC please contact mpkbookspermissions@tandf.co.uk
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and
explanation without intent to infringe.
DOI: 10.1201/9781003216025
Typeset in CMR
by Deanta Global Publishing Services, Chennai, India
Contents
Preface xiii

1 Introduction 1
  1.1 What is Statistical Modelling? 1
  1.2 What is Statistical Analysis? 1
  1.3 What is Statistical Inference? 1

6 Probability 29
  6.1 Relative frequency 29
  6.2 Degree of belief 30
  6.3 StatLab dice sampling 30
  6.4 Computer sampling 30
    6.4.1 Natural random processes 31
  6.5 Probability for sampling 31
    6.5.1 Extrasensory perception 31
    6.5.2 Representative sampling 33
  6.6 Probability axioms 35
    6.6.1 Dice example 35
    6.6.2 Coin tossing 35
  6.7 Screening tests and Bayes's theorem 36
  6.8 The misuse of probability in the Sally Clark case 39
  6.9 Random variables and their probability distributions 42
    6.9.1 Definitions 42
  6.10 Sums of independent random variables 45

8 Comparison of binomials 91
  8.1 Definition 91
  8.2 Example – RCT of Depepsen for the treatment of duodenal ulcers 92
    8.2.1 Frequentist analysis: confidence interval 93
    8.2.2 Bayesian analysis: credible interval 94
  8.3 Monte Carlo simulation 94
  8.4 RCT continued 95
  8.5 Bayesian hypothesis testing/model comparison 97
    8.5.1 The null and alternative hypotheses, and the two models 97
  8.6 Other measures of treatment difference 100
    8.6.1 Frequentist analysis: hypothesis testing 102
    8.6.2 How are the hypothetical samples to be drawn? 103
    8.6.3 Conditional testing 104
  8.7 The ECMO trials 105
    8.7.1 The first trial 105
    8.7.2 Frequentist analysis 106
    8.7.3 The likelihood 107
      8.7.3.1 Bayesian analysis 108
    8.7.4 The second ECMO study 109

16 Incomplete data and their analysis with the EM and DA algorithms 235
  16.1 The general incomplete data model 235
  16.2 The EM algorithm 237
  16.3 Missingness 237
  16.4 Lost data 238
  16.5 Censoring in the exponential distribution 239
  16.6 Randomly missing Gaussian observations 240
  16.7 Missing responses and/or covariates in simple and multiple regression 242
    16.7.1 Missing values in the single covariate in simple linear regression 242
    16.7.2 Modelling the covariate distribution – Gaussian 242
    16.7.3 Modelling the covariate distribution – multinomial 244
    16.7.4 Multiple covariates missing 246
  16.8 Mixture distributions 247
    16.8.1 The two-component Gaussian mixture model 247
  16.9 Bayesian analysis and the Data Augmentation algorithm 251
    16.9.1 The galaxy recession velocity study 251
    16.9.2 The Dirichlet process prior 263

23 References 359

Index 365
Preface
This book develops a new introductory course in statistics, aimed at students in a wide
range of programmes including Data Science. It is suitable for students in a general Statistics
programme, for whom Data Science can be one of many application areas.
The book has two aims: to clarify for students the confusion between the several infer-
ential frameworks used in different fields of application of statistics, and their philosophical
bases, and to serve as a textbook for students new to statistics and statistical modelling. Stu-
dents are expected to have a basic mathematical background of algebra, coordinate geometry
and calculus.
Efron and Tibshirani were not referring to Bayesian theory here, but to nonparametric
bootstrapping, without an “unnecessary” probability model.
We call this streamlining of statistical methodology minimalist statistics. As the
field of Statistics finds itself increasingly intertwined with other disciplines, and as
required models become more complicated, we believe that such minimalism is of
critical importance. There is only so much time available to educate interdisciplinary
researchers and practitioners in statistical theory and methodology.
(Ruppert, Wand and Carroll 2003, pp. 320–321, emphasis in original.)
Ruppert et al. were not referring to Bayesian theory here, but to a high-level simplification of
statistical modelling in two-level models, which is beyond the scope of this book. Nevertheless
their comment refers equally well to the simplification of the Bayesian theory relative to the
frequentist theory.
The Bayesian approach has developed very rapidly in the last 20 years, greatly assisted
by the dramatic increases in speed and memory of personal computers. It is now practical for
very complex modelling problems, for which the frequentist approach struggles to provide
an analysis.
programmes. The course is equally suitable for statistics students not intending to study
Data Science, as it gives a broad coverage of applications.
An important philosophical position we endorse is the probability model and likelihood
basis for statistical analysis and inference. This basis underlies both the (model-based) fre-
quentist and the Bayesian analyses of sample data. An important group of statisticians (more
common in survey fields), mathematicians and computer scientists regard this approach as
limiting and unnecessary, and argue that the choice of estimators for a population quantity
of interest need not, and should not, be restricted to those based on a formal probability
model.
The classical theory of survey sampling is based on this argument, as we discuss in the
book. Some well-known and highly regarded methods in machine learning – bootstrapping is
an example – are not based on probability models. A common view outside statistics is
that statistical analysis is a form of optimisation – maximisation or minimisation – and that
the important question is the choice of the criterion or objective function to be optimised.
The sum of squares or weighted sum of squares of residuals from the fitted function is often
used as the criterion.
The difficulty with general optimisation methods is that, without a model basis, we cannot
evaluate their quality for statistical inference without very detailed performance simulation
studies. Such studies tend to rely on asymptotic – large-sample – behaviour, supplemented
by some finite-sample behaviour. If these methods are based on a probability model, then
we can assess their performance relative to the likelihood-based analysis. Fisher pointed this
out in his revolutionary proposal of the likelihood.
In the development of this book we have taken advantage of recent developments in
Bayesian theory, which have important applications in survey sampling and frequentist max-
imum likelihood theory, and this allows us to give a unified presentation of these areas even
in the first course.
Acknowledgements
This programme has developed from courses based on this approach given or prepared for
presentation at a number of universities, including the universities of Lancaster and New-
castle UK, and Melbourne Australia. We are grateful to Brian Francis, Göran Kauermann
and Adriano de Carli for invitations to give courses: their invitations have spurred the de-
velopment of this programme.
The present form of the book owes much to discussions with Sunil Rao and Steve Fien-
berg, and we have benefitted from extensive discussions with the staff at the Department of
Statistics, University of Colorado at Fort Collins, the Department of Social Statistics, School
of Social Sciences at the University of Manchester, and the Griffith Social and Behavioural
Research College, Griffith University Queensland.
I am grateful for the support over many years from the UK Social Science Research
Council and Economic and Social Research Council, the Australian Research Council, the
US National Center for Education Statistics and the US Institute of Education Sciences, for
research into the development of new model-based and Bayesian data analysis methods.
I am particularly grateful to the many historical and current statistical scientists who have
influenced my philosophical and computational views, including George Barnard, Al Beaton,
Jim Berger, David Cox, Noel Cressie, Art Dempster, Anthony Edwards, William Ericson,
Steve Fienberg, Ronald Fisher, Harold Jeffreys, Jim Kalbfleisch, Nan Laird, Jim Lindsey,
Rod McDonald, Jon Rao, Richard Royall, Don Rubin, David Sprott, Martin Tanner and
Matt Wand. In the finalisation of the book text, we have had much help from Rob Calver
and Lillian Woodall at Taylor and Francis, and Andrew Robinson at CEBRA, University of
Melbourne.
the statistical model, but it can be based on conceptual or actual resampling from the
population without a model.
We develop analysis and inference progressively. We begin in Chapter 2 with Big Data, a
term which has become popular though undefined. Chapter 3 gives a variety of “small data”
sets from a wide range of scientific and social studies, with the research questions which have
led to these studies. These data sets are analysed progressively through the book. We also
make use in Chapter 4 of the StatLab database of a large public health study to investigate
important social questions at the time of the study. Before the technical discussions of models
and modelling, in Chapter 5 we discuss the problems of interpretation in sample surveys,
illustrated by a small survey and two larger surveys. The design of these studies is critical
to their interpretation and value.
Chapter 6 gives a discussion of the two aspects of probability, as the relative frequency
of some event in repeated “trials”, and the degree of belief in the outcome of some event.
These two aspects of probability are the foundations of the two approaches to statistical
inference, which are developed in the subsequent chapters. Chapter 7 develops the principles
of inference for the two aspects of probability in two simple discrete probability models: the
binomial and the Poisson. The fundamental evidence function of the model-based school is
introduced: the likelihood function, with its different usage by Bayesians and non-Bayesians.
Chapter 8 extends the discussion of the binomial model to the very important randomised
clinical trial.
Chapter 9 is a short discussion of visualisation of data and statistical analyses.
Chapter 10 extends the discussion of Chapter 7 to one-parameter continuous distribu-
tions: the exponential, Gaussian and uniform. Chapter 11 extends this discussion to two-
parameter distributions: the Gaussian, lognormal, Weibull and gamma.
Having several possible distributions for sample data requires ways of assessing the
suitability of the models we are considering for the data. Chapter 12 on model
assessment shows how to do this with a data set of operating lifetimes of mobile phones.
Chapter 13 extends the binomial distribution to the multinomial, with more than two dis-
crete outcome categories. Chapter 14 continues assessing and comparing competing models
for the same data, extending Chapter 12 to the possibility of model averaging several models
to produce a composite analysis. Chapter 15 introduces regression models with a Gaussian
response variable, and their maximum likelihood and Bayesian analysis. Common machine
learning extensions to ridge regression, the lasso and principal component regression (PCR)
are discussed in the framework of statistical modelling.
Chapter 16 develops the analysis of incomplete data of many kinds, through the maximum
likelihood EM algorithm and the Bayesian Data Augmentation algorithm. Chapters 17 and
18 extend the Gaussian multiple regression framework to generalised linear models (GLMs),
and further extensions of GLMs.
Four appendices discuss length-biased sampling, two-component Gaussian mixtures, the
StatLab variables, and a short history of the major figures in the development of statistical
modelling and analysis since 1890.
2
What is (or are) Big Data?
Apart from Big Data meaning a lot of data, the use of this term implies a need for analysis
methods for large-scale data, for which the traditional methods for the statistical analysis of
“small” samples presumably fail. The problems with the analysis of large-scale data are not
new: they have been with us since surveys and experiments began to generate large samples
with complex data structures.
A simple way of quantifying the problem is through the size of the data matrix. In tradi-
tional social surveys, the data matrix X was an array of dimension n × p, of measurements
or recorded values of p “variables” on n “cases” or individuals. Simple tabulation procedures
in early computers for variable means and variances and their intercorrelations could handle
large numbers of individuals, but for any kind of regression modelling the covariance matrix
of the variables, of dimension p × p, had to be inverted.
When I began (1959) a summer research assistantship with the Chief Accountant’s Office
of the Commonwealth Bank in Sydney, the bank wanted to relate the total bank branch
salaries and number of staff at their 650 branches to 35 “activity” variables, the routine
activities of a branch staff member. The aim of this analysis was to establish which branches
were under- or overstaffed for the level of business, so that staff could be moved between
branches appropriately. To determine the relation between salaries or staff numbers and the
35 activity items through a regression model would require the inversion of a 36 × 36 matrix,
far beyond the capacity of an electric calculator, or of computers of that time. By the early
1960s, regression programs were available on an IBM mainframe which could do this. By
the early 1970s, regression programs could handle 250 variables, with very large numbers of
cases.
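To make concrete how routine this computation has become, here is a minimal sketch in Python of a regression fitted through the normal equations, with simulated data whose dimensions echo the bank example (the variable names are ours, and nothing here is from the bank's records):

import numpy as np

rng = np.random.default_rng(1)
n, p = 650, 35                       # 650 branches, 35 activity variables
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + predictors
beta = rng.normal(size=p + 1)
y = X @ beta + rng.normal(size=n)    # synthetic "salaries"

# Normal equations (X'X) b = X'y: the 36 x 36 solve that was out of reach in 1959
b_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(b_hat[:3])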
The major expansion of Big Data sizes came from several application areas: the recording
of detailed supermarket transactions on identifiable individuals, through their use of super-
market loyalty cards; the very fast and heavy stock-market transactions; and the genotyping
of individuals and its relation to disease occurrence, especially cancers. An early example
of the latter was gene expression data for p = 6,033 genes on 52 prostate cancer patients –
“cases” – and 50 normal subjects – “controls”. The research question was how to identify the
genes which were important for differentiating cases from controls in their expressions: these
might be related to genetic differences in the two populations. Here p ≫ n = 102, so the
covariance and correlation matrices of the gene expressions, while they could be computed,
could not be inverted as they were of rank 100. Traditional statistical methods could not be
used without drastic changes.
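A small sketch in Python, with simulated data and 500 variables standing in for the 6,033 genes, shows the rank problem directly:

import numpy as np

rng = np.random.default_rng(0)
n, p = 102, 500                      # 102 subjects, p variables with p >> n
X = rng.normal(size=(n, p))          # synthetic expression levels
S = np.cov(X, rowvar=False)          # the p x p sample covariance matrix

print(S.shape)                       # (500, 500): computable
print(np.linalg.matrix_rank(S))      # rank at most n - 1 = 101 here: singular, not invertible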
With further increases in the size of genotyped data sets, and especially detailed astro-
nomical surveys with very high resolution telescopes, even holding the data set in computer
memory was impossible for even the largest computers. Statistical analysis would have to be
adapted to be usable.
In courses above the introductory level, the student will meet a variety of procedures
developed for handling very large data sets. One obvious one is to split the full data set into
multiple subsets, analyse each subset separately and then combine the analyses appropriately.
We do not pursue how to do this or other procedures further in this book: they depend on
more complex modelling. In this book, we establish the principles for statistical analysis for
the traditional (and important) small and medium-sized data sets, covering a wide range of
applications.
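As a minimal sketch of the subset-and-combine idea, under the simplest possible assumptions (equal-sized subsets, and a sample mean as the "analysis"):

import numpy as np

rng = np.random.default_rng(2)
data = rng.exponential(scale=3.0, size=1_000_000)   # one "large" variable

subsets = np.array_split(data, 100)                 # analyse each subset separately
subset_means = [s.mean() for s in subsets]
combined = np.mean(subset_means)                    # combine the subset analyses

print(combined, data.mean())                        # matches the full-data analysis

Combining more complex analyses – regressions, for example – requires the more careful methods referred to above.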
An important aspect of preliminary analysis is data visualisation. In the next chapter we
give data tables for a number of different research studies, and graphs for a few. Data tables
are not helpful in understanding the message of the data, and in later chapters we discuss
at some length historical and more informative modern ways of visualising data and fitted
models.
3
Data and research studies
We analyse and discuss a number of studies in this book. Research studies and collections
of administrative data have data of different types. Several small examples from research
studies follow: we will analyse them in later chapters.
TABLE 3.1
Lifetimes and numbers of radio transceivers
i t n i t n i t n i t n i t n i t n i t n
1 8 1 2 16 4 3 32 2 4 40 4 5 56 3 6 60 1 7 64 1
8 72 5 9 80 4 10 96 2 11 104 1 12 108 1 13 112 2 14 114 1
15 120 1 16 128 1 17 136 1 18 152 3 19 156 1 20 160 1 21 168 5
22 176 1 23 184 3 24 194 1 25 208 2 26 216 1 27 224 4 28 232 1
29 240 1 30 246 1 31 256 1 32 264 2 33 272 1 34 280 1 35 288 1
36 304 1 37 308 1 38 328 2 39 340 1 40 352 1 41 358 1 42 360 1
43 384 1 44 392 1 45 400 1 46 424 1 47 438 1 48 448 1 49 464 1
50 480 1 51 536 1 52 552 1 53 576 1 54 608 1 55 656 1 56 716 1
TABLE 3.2
V1 hits
Number of V1 hits Number of squares
0 237
1 189
2 115
3 28
4 6
5 1
FIGURE 3.1
Peptic ulcers, stomach and duodenal
TABLE 3.3
Clinical trial of Depepsen
Depepsen Placebo Total
Healed 13 10 23
Not healed 5 7 12
Total 18 17 35
To assess the value of Depepsen, a randomised clinical trial was carried out with 35
patients, in which 13 of the 18 patients receiving Depepsen healed, while ten of the 17
receiving an inert placebo healed. Does this indicate a real superiority in healing of Depepsen
over placebo? Classifying the patients by treatment and recovery, we have Table 3.3.
We discuss this study at length in Chapter 8.
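As a preview of the Bayesian analysis, the two healing probabilities can be compared by simulation from their Beta posterior distributions; the sketch below assumes uniform Beta(1, 1) priors, one simple choice among those discussed later in the book:

import numpy as np

rng = np.random.default_rng(3)
M = 100_000
p_depepsen = rng.beta(1 + 13, 1 + 5, size=M)    # 13 healed, 5 not, of 18
p_placebo = rng.beta(1 + 10, 1 + 7, size=M)     # 10 healed, 7 not, of 17

# Posterior probability that Depepsen has the higher healing probability
print((p_depepsen > p_placebo).mean())          # roughly 0.8 under these assumptions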
TABLE 3.4
The first ECMO trial
Treatment Survived Died Total
CMT 0 1 1
ECMO 11 0 11
Total 11 1 12
TABLE 3.5
Concentration y and dose x
x 1.6 2.2 2.8 3.0 3.7 4.4 4.8
y 500 162 178 136 78 47 62
x 6.1 6.8 7.2 8.2 9.9 10.2 11.3 14.8
y 39 40 21 19 12 10 9 8
TABLE 3.6
Number of species observed
by five independent observers
16 18 22 25 27
3.6 Vitamin K
In a study of the role of Vitamin K in blood clotting, 15 chickens were deprived of Vitamin
K and then fed dried liver (a source of Vitamin K) for three days at a (varying) dose of x
mg per gram weight of chick per day. At the end of this period, the response of each chicken
was measured as the concentration y of a clotting agent needed to clot samples of its blood
in three minutes. The data from the 15 chickens are given in Table 3.5.
What can we say about any relationship between the dose of Vitamin K and the concen-
tration of the clotting agent? We discuss this data set in §14.1.1.
TABLE 3.7
Dose–response data: dose x, number tested n, number responding r
x n r
−0.86 5 0
−0.30 5 1
−0.05 5 3
0.73 5 5
TABLE 3.8
Counts of Down’s babies and baby populations in four
regions
Region BC Mass NY Sweden
age r n r n r n r n
FIGURE 3.2
Incidence of Down’s syndrome in British Columbia, 1991 report
subset of the data is shown here. It is well known that the mother’s age is important, with
the observed incidence of Down's syndrome (r/n) increasing for mothers over 30. Figure 3.2
shows the relation for British Columbia.
There are several research questions:
• Is the incidence of the syndrome low and constant up to some mother age?
• How is the increase best presented?
• Is the increase pattern consistent over the four regions? If not, how do they differ?
We address these questions in §16.6.
TABLE 3.10
Absence, IQ and number of family dependents, Aboriginal girls
days 14 11 2 5 5 35 22 20 13
IQ 60 60 70 86 86 81 86 93 93
deps 11 11 5 9 10 9 7 4 12
days 7 14 27 6 20 4 15 13 6 6
IQ 96 90 82 79 65 64 66 76 73 74
deps 6 5 11 6 10 7 8 3 11 9
days 5 16 17 46 43 40 16 14 32
IQ 70 98 84 100 84 91 105 104 76
deps 7 14 3 10 7 7 4 11 11
days 57 6 53 23 8 34 36 38 23 28
IQ 83 73 92 93 99 95 84 106 89 103
deps 10 9 9 13 7 9 10 11 9 9
of days absent from school, the IQ (intelligence quotient) and the number of dependents in
the family for 38 Aboriginal girl children. What can be said about the relation between
absence and the other variables? See §14.10.
wife. The other was that the husbands were angry, critical and unsupportive. These reports
were based on small numbers of cases of individual psychiatrists.
Dr Bennett, as part of his MD doctoral thesis, was given access to married women patients
admitted to hospital after a suicide attempt and to married women patients admitted to
hospital with critical organic (non-psychological) abdominal conditions. When the recovery
of the wives was established, Bennett interviewed the husbands and recorded their responses
to three questions about the state of the marriage. The responses were analysed for affection
and hostility content using the Gottschalk-Gleser scales (Gottschalk and Gleser 1969, from
now abbreviated to GG). The analysis scored the husbands’ responses for affect (emotional
response) on several forms of negative affect (anger, guilt, . . .), and one of positive affect –
affection. The analysis provides scale scores which can be taken as approximately Gaussian
(the scales use a transformation of the count of affective words). The psychiatric question
was: how are the levels of affect, on the several GG scales, related to the nature of the
event (suicide attempt or organic abdominal condition), and do personal factors influence
the affect level?
In §14.17 we discuss how the data (not given here) were analysed and the conclusion
about affection.
TABLE 3.11
Proportion tolerant of racial intermarriage
Region Ed 72 73 74
It is clear that there are large differences in the proportions between educational levels, and
smaller differences between regions; it is unclear what changes have occurred over time. How
do we summarise the variations in proportions over the three cross-classifying factors? See §16.8.
FIGURE 3.3
Numbers of patients treated and hospital beds
FIGURE 3.4
Joint model ML fit with 95% variability bounds
FIGURE 3.5
Ages and lengths of dugongs
FIGURE 3.6
Acceleration of motorcycle helmet
FIGURE 3.7
Acceleration of motorcycle helmet and ML fitted polynomial model
FIGURE 3.8
Sea temperature anomaly by year
TABLE 3.12
Event attendance
W \E 1 2 3 4 5 6 7 8 9 10 11 12 13 14 T
1 1 1 1 1 1 1 1 0 1 1 0 0 0 0 8
2 1 1 1 0 1 1 1 1 0 0 0 0 0 0 7
3 0 1 1 1 1 1 1 1 1 0 0 0 0 0 8
4 1 0 1 1 1 1 1 1 0 0 0 0 0 0 7
5 0 0 1 1 1 0 1 0 0 0 0 0 0 0 4
6 0 0 1 0 1 1 0 1 0 0 0 0 0 0 4
7 0 0 0 0 1 1 1 1 0 0 0 0 0 0 4
8 0 0 0 0 0 1 0 1 1 0 0 0 0 0 3
9 0 0 0 0 1 0 1 1 1 0 0 0 0 0 4
10 0 0 0 0 0 0 1 1 1 0 0 1 0 0 4
11 0 0 0 0 0 0 0 1 1 1 0 1 0 0 4
12 0 0 0 0 0 0 0 1 1 1 0 1 1 1 6
13 0 0 0 0 0 0 1 1 1 1 0 1 1 1 7
14 0 0 0 0 0 1 1 0 1 1 1 1 1 1 8
15 0 0 0 0 0 0 1 1 0 1 1 1 0 0 5
16 0 0 0 0 0 0 0 1 1 0 0 0 0 0 2
17 0 0 0 0 0 0 0 0 1 0 1 0 0 0 2
18 0 0 0 0 0 0 0 0 1 0 1 0 0 0 2
T 3 3 6 4 8 8 10 14 12 5 4 6 3 3 89
The StatLab database is a “small” population of 1,296 families. The database was freely
available on the Web for some years. It is described in pp. 318–319 of the StatLab book
(Hodges, Krech and Crutchfield 1975):
The StatLab population [called Census in the book] covers 1296 member families of
the Kaiser Foundation Health Plan (a prepaid medical care programme) living in
the San Francisco Bay area during the years 1961–72. These families were partici-
pating members of the Child Health and Development Study conceived and directed
by Professor Jacob Yerushalmy, in the School of Public Health at the University of
California, Berkeley.
On her first visit to the Oakland hospital of the Health Plan after pregnancy
was diagnosed, each woman was interviewed intensively on a wide range of medical
and socioeconomic matters relating both to herself and to her husband. In addi-
tion, various physical and physiological measures were made. When her child was
born, further data about her and her newborn baby were recorded. Approximately
10 years later the child and mother were called in for follow-up testing, interview-
ing and measurement. In some instances, the husband was also interviewed and
measured.
The 1296 families of the population are divided into two equal subpopulations:
648 families consisting of a mother, father and female child; and 648 families of a
mother, father and male child. The children were all born in the Kaiser Foundation
Hospital, Oakland California, between 1 April 1961 and 15 April 1963. The popu-
lation does not include any other children who may have existed in these families.
More than 10,000 families took part in the Child Health and Development Study. To make
the data more widely available and suitable for student training in statistics, the study
authors prepared a subsample of 1,296 families from the full study as a statistical population
– the StatLab population – to provide for dice sampling, discussed in the following.
From the available data recorded in the Child Health and Development Study, 32 variables
were selected for the StatLab book. The 36 pages of the population listed each of these 32
variables for each of the 1,296 families. The first 18 pages covered the families with girls;
the second 18 pages covered the families with boys. Within each of these two sets of pages
the families were listed in order of mother’s age, with the youngest mothers first and the
oldest last.
The population list consisted of printouts numbered in consecutive “dice numbers” (i.e.
the population pages were numbered 11, 12, 13, 14, 15, 16, 21, 22, 23, . . ., 65, 66). Similarly,
the 36 families on each page were designated in consecutive dice numbers from 11 to 66. The
identification number (ID no.) for any given family consisted of two pairs of dice numbers, the
first pair indicating the page and the second pair indicating the family on the page. To select a
family purely at random from the population of 1,296, it was necessary to throw a pair of dice
twice. (The StatLab book was sold with a pair of dice, one red, one green.) If, for example,
the first throw gave a red 2 and a green 6, this selected page 26. If the second throw gave a
red 5 and a green 4, this selected family 54 on that page. Thus the ID number for this family
was 26-54.
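The dice numbering is simply a base-6 scheme with the digits running from 1 to 6. A small Python sketch (the function names are ours, for illustration only) draws a random ID by simulated dice throws, and converts an ID to a serial family number from 1 to 1,296:

import random

def random_statlab_id(rng=random):
    """Throw a pair of dice twice: red/green for the page, red/green for the family."""
    throws = [rng.randint(1, 6) for _ in range(4)]
    return f"{throws[0]}{throws[1]}-{throws[2]}{throws[3]}"

def id_to_index(statlab_id):
    """Map an ID like '26-54' to a serial number 1..1296 (four base-6 digits 1..6)."""
    digits = [int(d) - 1 for d in statlab_id.replace("-", "")]
    index = 0
    for d in digits:
        index = index * 6 + d
    return index + 1

print(random_statlab_id())      # e.g. '26-54'
print(id_to_index("11-11"))     # 1, the first family
print(id_to_index("66-66"))     # 1296, the last family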
The 32 variables for each family were grouped by child, mother, father and family.
Part of the data were collected at the time of birth (1961–1963) and the other data at
the time of test (1971–1972). A full description of variables collected in the survey is given in
Appendix 1.
TABLE 4.1
Types of variables
Variable type Examples
• Do mothers who were smoking at the diagnosis of pregnancy have babies with lower
birthweight than mothers who were not smoking at diagnosis? (Low birthweight increases
risk for babies.)
• More generally, how is the baby’s birthweight related to mother’s weight, mother’s age,
mother’s blood group, mother’s smoking, father’s smoking and family income?
• How does the child’s intelligence at age ten, as assessed by the Peabody and Raven
tests, relate to birthweight, family income and mother’s and father’s age, education and
occupation?
There are many other social/economic/public health questions which could be addressed
using this database – that is one of the values of such databases. We do not have the full
study population data to check the inferences, but that is the reality of all research studies.
We are assuming, in using the database to answer these questions, that the database is
a random sample from the full study population. But why is such an assumption necessary?
Would it matter if it was a systematic sample of some kind? In the worst case, the analysis
might be relevant only to the database itself: it might not generalise to any larger population.
This becomes clear when we consider some notorious studies, in the next chapter.
5
Sample surveys – should we believe what we read?
• 39% of women married 25 years or more had been struck or beaten by a husband or
lover;
• 39% of women never married had been struck or beaten by a husband or lover.
Alarming as these “statistics” were, there seemed some strange inconsistencies. For example,
What are we to make of these “statistics”? That depends on how they were obtained. An
important issue is the sampling method used in this study, and in other studies. The sampling
method has a critical effect on what we think about the study, and illustrates the fundamental
concept of “randomness” and its use to achieve a representative sample. This requires an
understanding of probability concepts.
and a wide range of other organisations, such as senior citizens’ homes and disabled
people’s organisations, in various states.
In addition, individual women wrote for copies of the questionnaire . . .
All in all, 100,000 questionnaires were distributed, and 4,500 returned . . .
Those receiving the questionnaire were asked to pass it along to relevant people they
knew, if they were not themselves interested in responding. If they were interested in re-
sponding, they were permitted to omit any questions which were not relevant to them. The
response rate in this survey was 4.5%.
What should we conclude? What population is being sampled? The implication of the
book is that the target – the intended – population is the US adult female population. But
the sampling method is very likely to oversample groups with higher proportions of women
with relational difficulties. Since the questionnaires were sent to groups, we have no idea
who actually answered them (they were anonymous). The sample is clearly biased because
we cannot give the chance of any woman in the US population being included in the sample.
Hite’s questionnaire introduced a further difficulty. It directed respondents:
It is not necessary to answer every question! There are seven headings; feel free to
skip around and answer only those sections or questions you choose ...
(p. 787, emphasis in the original)
So the sample size may be different for different questions (these sample sizes were not given
in the book) – even the percentage responding to different questions may not be comparable
within the study, as we mentioned earlier:
• 95% reported emotional or psychological harassment from their husband or lover; but
• 84% of women were not emotionally satisfied by their relationships.
These numbers are surely a consequence of non-response to one question, or both. We gain
a misleading impression even within the survey by comparing percentages based on different
subsets of respondents.
If the sample is biased, what conclusions can we draw? Since the sampled population is
undefined, we can only regard the sample as the population. Hite had information from
4,500 women, and the percentages reported are the percentages in her sample, which is her
population.
These results have no knowable connection with percentages of the US female population,
and cannot be used to refer to this population. To make such a connection, it is not sufficient
to have a voluntary sample, no matter how large – we need a probability sample, in which
every woman in the population would be included with a known probability.
A little more can be said about the US female population. For the year 1986, the US
female population between the ages of 25 and 69 was 66,538,000 (rounded to the nearest
1,000). We take this as the target population for Hite’s survey. (It makes little difference if
we extend the age range.)
We know that, of the 4,500 women in the Hite sample, 95% – 4,275 – reported emotional
or psychological harassment by a husband or lover. So in the US population of women,
the proportion of those harassed is at least 4,275/66,538,000 = 0.000064, or 0.0064% of the
population. On the other hand, the proportion of women not harassed in the population is
at least 225/66,538,000 = 0.000003382, or 0.0003382%. So the most that we can say with
certainty about the population proportion of women who suffered emotional or psychological
harassment by a husband or lover is that it must be between 0.0064% and 99.9996618%. This
is so close to 0% and 100% that we have learnt almost nothing about the population. That
is the consequence of the survey design.
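The certainty-bound arithmetic above can be checked in a few lines (the figures are those given in the text):

n_sample = 4_500
n_population = 66_538_000

harassed = round(0.95 * n_sample)        # 4,275 reported harassment
not_harassed = n_sample - harassed       # 225 did not

lower = harassed / n_population          # the proportion harassed is at least this
upper = 1 - not_harassed / n_population  # and at most this
print(f"between {100 * lower:.4f}% and {100 * upper:.7f}%")
# between 0.0064% and 99.9996618%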
Hite responded vigorously to criticisms of the study. Some of her responses were emo-
tional: “the stories told by these women are heartbreaking” as indeed they must have been.
Sample surveys – should we believe what we read? 27
On the study design she persisted in the claim that 4,500 was a large sample and allowed gen-
eralisation to the population. Further comment on the sampling method and Hite’s responses
can be found at http://davidstreitfeld.com/archive/controversies/hite01.html.
An important issue is that in observational studies we may not have measured or recorded
some important variables. This issue is described pungently in Wainer (2016):
Controlled experimental studies are typically regarded as the gold standard for
which all investigators should strive, and observational studies as their polar oppo-
site, pejoratively described as “some data we found lying on the street”.
(p. 29)
6
Probability
The word “probability” is used in everyday language to represent uncertainty, and is often
used interchangeably with the word “likelihood”. However these words have quite different
and specific meanings in statistical inference, which is the basis of drawing conclusions from
data. Probability is used in two different ways:
• as a measure of relative frequency of occurrence of some event;
• as a measure of degree of belief in the occurrence of some event.
The “event” referred to is in the future – the probability statement is a predictive statement
about the not-yet-observed event.
draw integers “modulo N ”. Depending on the implementation of the routine, these may be
numbers from 1 to N or from 0 to N − 1. If the latter, we add 1 to each random number. (We
could avoid this altogether by renumbering the families from 1 to 1,296, and drawing ran-
dom integers from this range. For consistency with the StatLab book, we retain the original
database numbering.)
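In Python, for example, the adjustment is one line (a sketch, assuming the 0 to N − 1 convention):

import random

N = 1_296
r = random.randrange(N)        # returns 0 .. N-1
family_number = r + 1          # shift to 1 .. 1,296
print(family_number)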
A class experiment with an ESP card deck is useful for understanding the basic ideas
of probability. The “sender” stands in the classroom behind an impermeable screen and
reads through the shuffled deck, one card at a time. At each card the sender pauses, con-
centrates on the symbol, and “projects” the card symbol to the student audience, which
cannot see him or her, only hear the “Next card” announcements. The students write down
the card symbols they “receive”. At the end of the 25 cards the sender asks the students
to switch their records of the card sequence with the students next to them, to prevent
“cheating”. The sender then reads through the sequence of cards, and the student markers
tick off the correct answers, sum the total number of correct identifications, and return the
records.
The sender then asks for a show of hands at each number correct, starting from 25
and working backwards. This requires a group of hand-counters to assist the sender. Af-
ter describing the experiment, but before it begins, the sender’s hand-counters count the
number of students present, and the sender puts up on the board the numbers of stu-
dents who are predicted to identify each possible number of cards correctly. The sender
emphasises that this is not a prediction for individual students, but for the class as a group.
In a class of 200 students, we would expect to see (rounded to integers) the numbers in
Table 6.1.
Only one student in a class of 200 would be expected to score zero, or more than ten
correct.
Students are always puzzled by the closeness of the results to those predicted. “How does
he/she know that? How can he/she say that?”
The prediction is based on a simple statistical model. The analyst assumes (or believes)
that no student can learn anything about the cards being projected, so the student answers
are random guesses. The simplest model for this is that if the five card symbols are regarded
as equally likely under guessing, then the probability of a correct guess with any card is 1/5.
If the guesses are made independently and with the same probability, then the number r
correctly “identified” (guessed) has the binomial distribution b(r | 25, 0.2), shown in Table 6.2
to four decimal places.
We develop this distribution in the framework of StatLab sampling.
TABLE 6.1
Expected number in 200 for ESP card guessing
r 0 1 2 3 4 5
Expected # 1 5 14 27 37 39
r 6 7 8 9 10 11
Expected # 33 22 12 6 2 1
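Table 6.1 can be reproduced directly from the guessing model b(r | 25, 0.2); a sketch using the scipy library:

from scipy.stats import binom

n_students, n_cards, p_guess = 200, 25, 0.2
for r in range(12):
    expected = n_students * binom.pmf(r, n_cards, p_guess)
    print(f"r = {r:2d}: expected {expected:5.1f} students")  # Table 6.1 rounds to integers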
Now we throw the dice again to draw a second family. The family we chose at the first draw
remains in the population and could be drawn again, though that would be very unlikely –
its probability, by the same argument, would be 1/1,296. Sampling the population in this
way is called sampling with replacement. What would happen if we did draw the same family
again? We would set it aside and draw another one – repeating the same family does not
give any more information. Practical surveys are always drawn without replacement, but
the two methods have very similar properties if the population is large compared to the
sample.
Since the population has not changed, and the design of the throws makes the outcome
of the second throw independent of that at the first throw, the probability of a boy family
at the second throw is again p = 1/2 = Pr[B2 ], and Pr[G2 ] = 1 − p = 1/2. Clearly this will
be true for all the successive throws. This is an example of an axiom of probability theory:
that if the events B1 and B2 are independent, then the probability that both occur is the
product of their separate probabilities:

Pr[B1 ∩ B2 ] = Pr[B1 ] Pr[B2 ].

Here the notation ∩ – commonly called “cap” – stands for intersection – the joint event.
The event of drawing a boy family and a girl family in the first two draws can occur in
two different ways: [B1 ∩ G2 ] and [G1 ∩ B2 ]. The probability of each of these is p(1 − p),
so the probability that a sample of two families contains one boy and one girl family is
2p(1 − p) = 2 · (1/2)² = 1/2.
We can extend this argument indefinitely. In general, the probability that we obtain r
boy families in n throws of the two dice is given by the binomial (“two names”) distribution:

Pr[r boys | n, p = 1/2] = C(n, r) (1/2)^n,

where C(n, r) = n!/(r!(n − r)!) is the binomial coefficient representing the number of
arrangements of the r boy families and n − r girl families – the number of distinct orderings
of the r B and n − r G symbols.
For the sample sizes n = 10, 20 and 40, and p = 1/2, the binomial distributions are shown
in Tables 6.3, 6.4 and 6.5 (probabilities less than 0.001 are omitted).
We now state the probability axioms more formally. The next section can be omitted or
postponed without loss.
TABLE 6.3
Binomial distribution, n = 10, p = 1/2
r 0 1 2 3 4 5 6 7 8 9 10
Pr[r] .001 .010 .044 .117 .205 .246 .205 .117 .044 .010 .001
TABLE 6.4
Binomial distribution, n = 20, p = 1/2
r 3 4 5 6 7 8 9 10
Pr[r] .001 .005 .015 .037 .074 .120 .160 .176
r 11 12 13 14 15 16 17
Pr[r] .160 .120 .074 .037 .015 .005 .001
Probability 35
TABLE 6.5
Binomial distribution, n = 40, p = 1/2
r 12 13 14 15 16 17 18 19 20
Pr[r] .005 .011 .021 .037 .057 .081 .103 .119 .125
r 21 22 23 24 25 26 27 28
Pr[r] .119 .103 .081 .057 .037 .021 .011 .005
Elementary probability examples generally involve dice throwing and coin tossing. The
probability axioms do not depend in any way on how the model probabilities are assigned,
or what they are.
We have to assume that the throws are independent – that the face shown on one throw
does not affect the probability of the same face on the next throw. (In gambling dice games
the dice are thrown together to ensure this.)
Then two heads have probability p² and one tail probability q, but the sequence of H and T
is arbitrary: HHT, HTH, THH. So the probability of two heads and one tail in any order is
3p²q. We use these kinds of results repeatedly in statistical inference.
TABLE 6.6
True and false positives and
negatives, screening test, 2%
incidence
C+ C− Total
P 19 49 68
N 1 931 932
Total 20 980 1,000
remaining 931 are true negatives. Adding across the rows, we have 68 people testing positive
and 932 testing negative.
What does it mean if you test positive? What is the probability that you have the
condition? Of the 68 people who tested positive, only 19 actually had the condition. The rest
were false positives. So given that you had a positive test, your probability of having the
condition is only 19/68 = 0.279, about 28%.
What does it mean if you test negative? What is the probability that you have the
condition? Of the 932 people who tested negative, only one actually had the condition, a
false negative. The rest were true negatives. So given that you had a negative test, your
probability of having the condition is only 1/932 = 0.0011: the probability that you do not
have the condition is 0.9989.
So the test is very accurate at identifying people who do not have the condition. It is
much less accurate at identifying people who do have the condition. Of those testing positive,
only 28% actually have the condition.
This answer shocks many students – how is it possible that a test with such a high rate
of detection of people with the condition gives so little confidence in their identification
from a positive result? The answer is closely connected to the prevalence – frequency – of
the condition in the population; the condition is quite rare (only 2% have it), and so most
positive tests will be false positives.
To represent this result formally, we need the probability calculus. We have Pr[C−] =
1 − Pr[C+]. We usually have a good idea of the probability Pr[C+], that is, the proportion
of people in the population being tested who have the condition. We use the value stated
earlier of 0.02; 2% of people being tested would be expected to have the condition.
We suppose as before that the screening test correctly identifies 95% of people who have
the condition. We express this as a conditional probability: given that the person has condition
C+, the probability of a true positive response P is 0.95. We express this by the notation
Pr[P | C+] = 0.95. We suppose also that the test incorrectly identifies (by a positive test
result) only 5% of people who do not have the condition: the probability of a false positive
response is Pr[P | C−] = 0.05.
The first person tested, a man, gives a positive test result. What advice do we give him?
He could be truly positive, or truly negative.
We know the probability of a positive result from people with the condition is 0.95, and
is 0.05 for people without the condition. But these numbers are known prior to the test; we
want to update – revise – them given the positive result of the test. For this we need the
conditional probabilities given the data – the probabilities of C+ and C− given P.
Formally, we express the conclusions using a standard probability result. Given two events
A and B, the probability that they both occur, Pr[A and B], written formally as Pr[A ∩ B]
(“A cap B”) can be expressed in two different ways, as the product of the probability of one
event and the conditional probability of the other event given the first:

Pr[A ∩ B] = Pr[A | B] Pr[B] = Pr[B | A] Pr[A].

So

Pr[A | B] = Pr[B | A] Pr[A]/ Pr[B].
This result was derived by Thomas Bayes and published posthumously in 1763; it is known as Bayes's theorem. It shows
us how to update the probability of an event A when we are given new information that the
event B has occurred.
We apply this to our events C and P: we are given

Pr[P | C+] = 0.95, Pr[C+] = 0.02, Pr[P | C−] = 0.05, Pr[C−] = 0.98,

so

Pr[C+ | P ] = Pr[P | C+] Pr[C+]/ Pr[P ].

Now

Pr[P ] = Pr[P ∩ C+] + Pr[P ∩ C−] = Pr[P | C+] Pr[C+] + Pr[P | C−] Pr[C−],

since one of C+ and C− must be true, so

Pr[P ] = 0.95 × 0.02 + 0.05 × 0.98 = 0.019 + 0.049 = 0.068,

and hence

Pr[C+ | P ] = 0.019/0.068 = 0.279.
Now we see that the high true positive rate does not guarantee a high probability of the
condition given a positive test: we have to consider as well the chance of a positive test when
the person is free of the condition, and combine these probabilities with the prevalence of
the condition.
The misinterpretation of the high true positive rate as the probability of the condition
given the positive test used to be endemic in law court discussions of probability, and is
known as the prosecutor’s fallacy. We will give a court case example later.
We use Bayes’s theorem (generally but incorrectly abbreviated to Bayes’ or Bayes theo-
rem) – the inversion of the sequence of events A and B – repeatedly in this book. It is the
foundation of statistical inference in the Bayesian paradigm, and all the methods developed
later in the book are extensions of this result.
The results of the screening test are potentially seriously misleading to those screened.
How can this be improved? There are two possible approaches (for a fixed prevalence rate
of the condition):
(1) Increase the true positive rate;
(2) Increase the true negative rate.
Small improvements in these already high rates will not change the conclusions much. But
now suppose the prevalence of the condition in the population is higher – 20% instead of
2% – so that the condition is common. With the previous rates of 0.95 and 0.05 for the true
and false positive rates of the test, we have

Pr[P ] = 0.95 × 0.20 + 0.05 × 0.80 = 0.190 + 0.040 = 0.230,

and hence

Pr[C+ | P ] = Pr[P | C+] Pr[C+]/ Pr[P ] = 0.190/0.230 = 0.826,

and more than 80% of those testing positive do have the condition. But those testing negative
are less certain of their negative state:

Pr[N ] = Pr[N | C+] Pr[C+] + Pr[N | C−] Pr[C−] = 0.05 × 0.20 + 0.95 × 0.80 = 0.770,

and hence

Pr[C+ | N ] = Pr[N | C+] Pr[C+]/ Pr[N ] = 0.010/0.770 = 0.013.

This is a factor of 10 larger than for 2% incidence, but it is still very low. A negative test is
still a very strong indication that the person does not have the condition.
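Both screening calculations follow from Bayes's theorem applied to the prevalence and the true and false positive rates; a sketch in Python (the function names are ours, for illustration only):

def prob_condition_given_positive(prevalence, true_pos, false_pos):
    """Pr[C+ | P] by Bayes's theorem: Pr[P | C+] Pr[C+] / Pr[P]."""
    p_positive = true_pos * prevalence + false_pos * (1 - prevalence)
    return true_pos * prevalence / p_positive

def prob_condition_given_negative(prevalence, true_pos, false_pos):
    """Pr[C+ | N]: the chance of the condition after a negative test."""
    p_negative = (1 - true_pos) * prevalence + (1 - false_pos) * (1 - prevalence)
    return (1 - true_pos) * prevalence / p_negative

print(prob_condition_given_positive(0.02, 0.95, 0.05))   # 0.279 for the rare condition
print(prob_condition_given_positive(0.20, 0.95, 0.05))   # 0.826 for the common condition
print(prob_condition_given_negative(0.20, 0.95, 0.05))   # 0.013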
The properties of the screening test depend strongly on the population incidence of the
condition – the prior probability, before the test is done, that a randomly selected person
has the condition. In terms of the previous table, we now have Table 6.7.
Sally Clark (August 1964–15 March 2007) was a British solicitor who became the
victim of an infamous miscarriage of justice when she was wrongly convicted of the
murder of her two sons in 1999.
Clark’s first son died suddenly within a few weeks of his birth in 1996. After
her second son died in a similar manner, she was arrested in 1998 and tried for the
murder of both sons. Her prosecution was controversial due to statistical evidence
presented by pediatrician Professor Sir Roy Meadow, who testified that the chance
of two children from an affluent family suffering sudden infant death syndrome was
1 in 73 million, which was arrived at by squaring 1 in 8500 for the likelihood of
a cot death in similar circumstances. The Royal Statistical Society later issued a
public statement expressing its concern at the “misuse of statistics in the courts”
and arguing that there was “no statistical basis” for Meadow’s claim.
Clark was convicted in November 1999. The [two] convictions were upheld at
appeal in October 2000 but overturned in a second appeal in January 2003, after
it emerged that the prosecutor’s pathologist had failed to disclose microbiological
TABLE 6.7
True and false positives and
negatives, screening test, 20%
incidence
C+ C− Total
P 190 40 230
N 10 760 770
Total 200 800 1,000
reports that suggested one of her sons had died of natural causes. She was released
from prison having served more than three years of her sentence. The journalist
Geoffrey Wansell called Clark’s experience “one of the great miscarriages of justice
in modern British legal history”. As a result of her case, the Attorney-General
ordered a review of hundreds of other cases, and two other women convicted of
murdering their children had their convictions overturned.
What is the basis of the statistical evidence against Clark? We quote from the expert
witness testimony by Professor Phil Dawid:
The SUDI [Sudden Unexplained Death in Infancy] study was conducted between
February 1993 and March 1996 in a study area consisting of five regions of the
country [UK], having a total population of nearly 18 million. During the study
period there were around 470,000 live births in the study area. 456 of these babies
suffered sudden [unexplained] death in infancy, 363 of these deaths being classified
as cases of SIDS [Sudden Infant Death Syndrome]. Of these, 325 were subjected to
further analysis. For each of the 325 “index cases”, four “control” babies, born at
around the same time but not suffering SIDS, were identified by the health visitor.
For both index and control cases, a number of possibly relevant characteristics of
the family, baby, etc. were measured. Statistical analyses were conducted with the
aim of discovering differences between such characteristics, which might distinguish
the index (SIDS) babies from the control (non-SIDS) babies.
The figures presented in court were based on Table 3.58 of the SUDI report,
which purported to classify the risk of SIDS according to “the three prenatal factors
with the highest predictive value”:
• Anybody smokes in the household
• No waged income in household
• Mother less than 27 years and this child not her first.
In the case of Sally Clark, none of the above factors was present. For such a case,
the table gave a rate of 0.117 SIDS cases per 1000 live births, i.e. 1 in 8,543 live
births. The figure of 1 in 73 million mentioned . . . above was calculated by squaring
this (8,543 times 8,543 = 73 million, approximately).
(Dawid)
What was wrong with Meadow’s statistical argument? There were two main issues:
• The deaths of the two children were treated as independent events, and their probabilities
multiplied.
If a genetic or environmental factor contributed to the deaths of both children, this would
mean that the occurrence of the death of the first child would increase the probability
of the death of the second child, if the genetic or environmental factor was unchanged.
The assumption of independence was not based on evidence and changed substantially
the stated probability of two deaths.
• The question for the jury was never expressed correctly, in terms of inferring the proba-
bility of a hypothesis from the probability of events under the hypotheses.
The inference drawn by Meadow was based on the argument that if a child death event
occurs which is extremely unlikely under normal family circumstances, then the circum-
stances must have been abnormal, which he took to be murder by a parent or parents
(Clark’s husband was initially charged with murder as well, but the charge was dropped).
So there were only two hypotheses: SIDS deaths by chance (event C) or deaths by murder (actually
event C̄ – “not C”). The probability of C was, by Meadow's calculation, 1/73,000,000.
This was “too small” for C to be believable, and therefore C̄ must be true. The jury, and
many commentators, were left with the impression that the probability of 1/73,000,000 was
the probability that Clark was innocent. This misinterpretation is so common in law that it
has been named by statisticians “the prosecutor’s fallacy”.
It should be clear that something is missing here. We express the problem through Bayes’s
theorem. We assign prior probabilities to C and C̄, and calculate the likelihood – the prob-
ability of the two deaths – under C and C̄. We have the likelihood under C as Meadow
claimed, but what is the probability that a mother will murder two very young children?
The implication of Meadow’s argument is that this must be large, or at least much larger
than 1/73,000,000.
In his evidence, Professor Dawid searched the UK crime database for relevant information:
In 1996 there were 649,489 live births in England and Wales. Of these babies, 14
were later classified as having been murdered in the first year of life. If we were to
take the ratio 14/649,489 as our estimate of the probability that a single baby will
be murdered in the first year of life, and manipulate it in exactly the same way
as he did the SIDS rate, we would calculate that the probability of two babies in
one family both being murdered is [(14/649,489)²], which gives 1 in 2,152,224,291.
On this basis, the “logic” [this tiny probability of C̄] would imply that we could
essentially exclude the possibility that Sally Clark’s two babies were murdered!
However, Dawid does not for a moment accept the Meadow argument, or his own. The
relevant calculation is of the likelihood ratio. If we assume the child murder rate to be
relevant to this case, as the SIDS rate was assumed to be relevant, the likelihood ratio for
C to C̄ would be

(1/8543²) / (14/649,489)² = (649,489/(14 × 8543))² ≈ 5.43² ≈ 29.5.

With equal prior probabilities on C and C̄, the posterior probability of C would be about 0.967,
or 29.5/30.5.
As Dawid concluded, the statistical evidence clearly favoured SIDS over murder. However, the
statistical evidence was less compelling than the revelation of the suppressed pathological
evidence that one child had died of natural causes, and it was the combination of these two
independent sources of information that secured the successful appeal.
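The likelihood-ratio calculation can be sketched in a few lines of Python, using the SIDS rate 1/8,543 and the murder rate 14/649,489 quoted from Dawid; the equal prior probabilities on C and C̄ are an assumption of the argument, not a fact about the case:

```python
# Likelihood ratio and posterior for C (double SIDS) against not-C (double murder).
p_two_sids = (1 / 8543) ** 2           # likelihood under C
p_two_murders = (14 / 649_489) ** 2    # likelihood under not-C
lr = p_two_sids / p_two_murders        # likelihood ratio for C
posterior_c = lr / (lr + 1)            # posterior Pr(C), equal priors
print(lr, posterior_c)                 # about 29.5 and 0.967
```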
Meadow was struck off the medical register by the General Medical Council in
2005 for serious professional misconduct, but he was reinstated in 2006 after he
appealed and the court ruled that his misconduct was not serious enough to warrant
him being struck off. In June 2005, Alan Williams, the pathologist who conducted
the postmortem examinations on both the Clark babies, was banned from Home
Office pathology work and coroners’ cases for three years after the General Medical
Council found him guilty of serious professional misconduct in the Clark case. This
decision was upheld by the High Court in November 2007.
Clark was permanently affected by the accusation, wrongful imprisonment, and
persecution by other prisoners. She never recovered from the experience, developed
a number of serious psychiatric problems including serious alcohol dependency, and
died of acute alcohol poisoning in her home in March 2007.
(Wikipedia)
Further details of the statistical argument can be found in Professor Phil Dawid's expert
witness testimony, at www.statslab.cam.ac.uk/~apd/SallyClark_report.doc.
A simple version of this argument comes from early uses of the likelihood function,
in which the design was often omitted, or was implicit in the data to be analysed. The
importance of the design is made clear in an old example sometimes used to attempt to
discredit the use of the likelihood ratio:
A 52-card deck is shuffled, and a card drawn from it at random. It is the King of
Spades. The card dealer says:
“You are a likelihoodist – what is the probability of this deck being a deck
entirely made up of Kings of Spades (KOSs)?
If the deck is a regular deck, the probability of drawing the KOS is 1/52. If the
deck is entirely KOSs, the probability is 1. So the likelihood ratio of KOS to regular
deck is 52:1. You must be almost certain that this is the KOS deck. Of course this
is nonsense.”
What is missing here is the design of the study. Consider this design, of two card decks.
The first deck is made up of 52 KOSs. The second is a regular deck. One deck is chosen
at random, shuffled well and a card is drawn from it. The card is the KOS. What do we
conclude about the chosen deck?
The implicit assumption of the card dealer's argument is that the prior probability of his
deck being regular is very high, close to 1: decks of KOSs don't occur in regular games. Wainer's
quote applies here:
quote applies here:
The likelihoodist could resolve the matter very quickly by drawing a second card from the
deck. Of course the card dealer cannot allow this, a sure sign that this is a meaningless
example.
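Under the stated two-deck design the Bayes calculation is immediate; a minimal sketch (the equal prior probability for the two decks is part of the design):

```python
# Bayes update for the two-deck design: one all-KOS deck and one regular
# deck, chosen with equal probability; the drawn card is the King of Spades.
prior_kos = 0.5
lik_kos, lik_regular = 1.0, 1 / 52    # Pr(draw KOS | each deck)
posterior_kos = prior_kos * lik_kos / (
    prior_kos * lik_kos + (1 - prior_kos) * lik_regular
)
print(posterior_kos)    # 52/53 = 0.981
```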
6.9.1 Definitions
A random variable is a variable which is the outcome of a random process of some kind
whose properties are not deterministic. The variable may be a count, a binary, a category
or a variable measured on a scale. The usual examples of coin tossing and dice throwing are
simple cases. The outcomes of the throw of a die or a coin are uncertain, and cannot be
predicted deterministically.
An argument is sometimes made that this uncertainty is simply a consequence of igno-
rance of the precise details of the construction of the die or coin, the throwing or tossing, the
surface on which the die or coin is landing and the air movement in the throwing or tossing
environment. If these were known then the physical equations of motion under gravity and
friction would determine the face showing.
Since the precision necessary to remove this uncertainty is unknown or unavailable, this
argument is hypothetical. As a consequence we can specify only the probability of each
outcome, from a suitable probability model.
However, this argument provides the basis for a general approach to statistical modelling.
The model has two components: a systematic part which can be specified from the design
of the study, and a random part which cannot be specified except through a probability
distribution.
In traditional statistics courses, a distinction is made between discrete random variables,
which can take only a finite set of possible values, like the die faces, and continuous random
variables, like the phone lifetimes, which can take any value in a continuous interval. This
distinction was denied by the distinguished Australian statistician, Edwin Pitman, in his
last book (Pitman 1979, p. 1):
All actual sample spaces are discrete, and all observable random variables have
discrete distributions. The continuous distribution is a mathematical construction,
suitable for mathematical treatment, but not practically observable.
The distinction is artificial, since every “continuous” variable is recorded with finite measure-
ment precision on a discrete scale, whether of days, shifts, hours, minutes or seconds, or km,
metres, cm, mm or µm (micrometres) etc, and so can take only a finite set of measurable
values, though the number of values may be very large. Removing this distinction allows us
to develop a unified treatment of discrete and continuous random variables.
We allow, in the definition of a random variable, for a countably infinite set of values of
the variable, that is, a set which can be put in 1:1 correspondence with the non-negative
integers. This allows for count random variables for which there may be no physical upper
limit, though the observed values are always finite.
So a random variable Y takes a finite or countably infinite set of ordered distinct values

Y1 < Y2 < · · · < YI < · · ·

in a discrete variable space 𝒴, with a corresponding set of non-negative probabilities pI which
sum to 1 over I.9 The set of possible values YI and corresponding probabilities pI define the
probability distribution of the random variable Y. The probabilities pI as a function of I are
called the probability mass function of Y, sometimes abbreviated to pmf, and the cumulative
probabilities qI = Σ_{J=1}^I pJ are called the cumulative distribution function of Y, usually
abbreviated to cdf.
We are often interested in the mean and variance of random variables. The mean is
denoted by µ and the variance by σ²; σ (positive) is called the standard deviation of the
random variable. They are defined by

µ = Σ_I pI YI ,    σ² = Σ_I pI (YI − µ)² ,    σ = √σ².
9 In other treatments of random variables, they are defined over a sample space, which is confusing, because
The term expectation or expected value of a function g(y) of a random variable Y is used
more generally to describe the mean value of that function, and is denoted by E[g(y)].
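These definitions translate directly into computation; a minimal sketch, using a hypothetical four-point distribution for illustration only:

```python
import numpy as np

# Mean, variance and cdf of a discrete random variable computed from its pmf.
Y = np.array([1.0, 2.0, 3.0, 4.0])    # ordered distinct values Y_I
p = np.array([0.1, 0.2, 0.4, 0.3])    # probabilities p_I, summing to 1

mu = np.sum(p * Y)                    # mean
sigma2 = np.sum(p * (Y - mu) ** 2)    # variance
q = np.cumsum(p)                      # cumulative probabilities q_I (cdf)
print(mu, sigma2, q)
```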
An example of a random variable with a countably infinite set of possible values is the
Poisson random variable which takes the non-negative integer values YI = 0, 1, 2, . . . with
probabilities pI = e−λ λYI /YI !, where λ is the mean of the probability distribution:
µ = Σ_{I=0}^∞ YI · pI
  = Σ_{I=0}^∞ YI · e^{−λ} λ^{YI}/YI!
  = λ · Σ_{I=1}^∞ e^{−λ} λ^{YI−1}/(YI − 1)!
  = λ · Σ_{K=0}^∞ e^{−λ} λ^{YK}/YK!
  = λ,
where K = I − 1. An example of a random variable with a finite set of possible values is the
birthweights of the StatLab boy babies. The birthweight in pounds is recorded in the database
to one decimal place. A graph of the counts of the 648 boys at each distinct value of birth-
weight is given in Figure 6.1. The probability mass function is just a rescaling of the count
axis to give a total scaled count of 1.0. The cumulative probabilities are shown in Figure 6.2.
The sloping S-shape is characteristic of many variables with symmetric or near-symmetric
distributions. We discuss this much later with the Gaussian distribution in Chapter 9.
FIGURE 6.1
Boy birthweight counts
FIGURE 6.2
Boy birthweight cumulative proportions
The second property is the famous Central Limit Theorem, commonly abbreviated to CLT.
We do not prove any of these properties: they are central to the frequentist theory but are
largely irrelevant to the Bayesian theory which we will develop.
7
Statistical inference I – discrete distributions

1 In some science and technology fields the term parameters is used for what we call variables.
We are thus entering a rather narrow area of statistical theory, but it is an area
which has been intensively cultivated, and this on the grounds of its practical
importance rather than of its mathematical attractiveness.
(Kendall and Stuart 1966, p. 166)
Among the branches of statistics, survey sampling is notable for its public impor-
tance and its theoretical isolation. It should perhaps be an important component
of every statistician’s education, but by and large is neglected, and when not ne-
glected, found to be an alien subject having its own rules and orientation at odds
with standard methods of statistical inference. Students of statistics catch a glimpse,
shudder, and pass on.
(Valliant, Dorfman and Royall 2000, p. xv)
Survey sampling is a major field of application of statistics and it is one of the most
satisfying and useful fields of statistics where both the target of inference is solid
and observable and the range of models and associated methods used in modern
statistics can be applied.
(Chambers and Clark 2012, Preface summary)
The survey sampling approach is used almost universally by National Statistical Offices or
Central Bureaus of Statistics. Much of the routine work of these institutions (though less so
now than in the past) focuses on population and sub-population means, or totals, of important
variables. The simplest problem of inference is how to relate the sample mean ȳ of a variable
Y, from a sample of size n, to the population mean µ in the finite population of size N. This
approach makes the repeated sampling principle central to the analysis:
Under this principle the variability in a parameter estimator, like the sample mean, is assessed
from its sampling distribution in conceptual – hypothetical – repeated samples of the same
size drawn from the same population. It does not require any population probability model
for the response variable Y, as such models make assumptions which cannot be verified
from the sample data, and which could lead to misleading conclusions if the probability
assumptions were incorrect. So the sampling process is not viewed as giving sample values of
a random variable Y : it gives sample values of a random variable U, the selection indicator.
The randomness in the sample values is a consequence of the random selection process for
the sampling of the finite population.
This selection process defines a set of N binary selection indicators u1 , . . . , uN, with
uI = 1 if population member I is selected into the sample, and uI = 0 otherwise. Then
the sample mean ȳ of the n sampled values y1 , . . . , yn can be expressed as a weighted linear
combination, with weights YI /n, of the randomly generated uI :
ȳ = Σ_{i=1}^n yi/n = Σ_{I=1}^N uI YI/n.
The observed sample is referred to a set of hypothetical other samples of the same size
which might have been drawn, but were not, by the selection process. In this hypothetical
repeated sampling, the uI and the resulting selected YI – the sample values yi – will vary.
The repeated sampling distribution of ȳ is that of a weighted linear function of (correlated)
binary Bernoulli random variables UI (correlated because their sum is the fixed sample size
n). The Central Limit Theorem can be used to obtain the asymptotic Gaussian distribution
of this linear function. Theories of inference which rely on the repeated sampling principle
for inference, like the survey sampling theory, are generally called frequentist.
For simple random sampling with Pr[UI = 1] = p = n/N for all I, the mean and variance
of the sampling distribution of ȳ can be easily found, in terms of the population mean µ and
variance σ 2 of the variable Y :
E[ȳ] = Σ_{I=1}^N p YI/n = Nµp/n = µ;

Var[ȳ] = [Σ_{I=1}^N YI² Var[UI] + Σ_{I=1}^N Σ_{J≠I} YI YJ Cov[UI, UJ]]/n².
The sample mean is an unbiased estimator of the population mean. Here, the word unbiased
in an estimator means that the expected value of the estimator is equal to the parameter it
is meant to be estimating. If its expected value is not equal to the parameter, the estimator
is biased. For the variance of ȳ we need the joint distribution of pairs of the UI . These are
not independent: their joint inclusion probability is

Pr[UI = 1, UJ = 1] = n(n − 1)/(N(N − 1)),
where π = n/N is called the sample fraction – the proportion of the population which is
included in the sample. Then
Cov[UI, UJ] = n(n − 1)/(N(N − 1)) − (n/N)²
            = −[1/(N − 1)](n/N)(1 − n/N)
            = −π(1 − π)/(N − 1),
Corr[UI, UJ] = −1/(N − 1).
Var[ȳ] = Σ_I YI² Var[UI]/n² + Σ_{I≠J} YI YJ Cov[UI, UJ]/n²
       = [(1 − n/N)/(nN(N − 1))][(N − 1) Σ_I YI² − Σ_{I≠J} YI YJ]
       = [(1 − n/N)/(n(N − 1))] Σ_I (YI − µ)²
       = (1 − n/N) σ²/n
       = (1 − π) σ²/n,
if the population variance σ² is defined as Σ_I (YI − µ)²/(N − 1). The factor (1 − n/N)
is called the finite population correction and is written (1 − π). For a small sample fraction,
Var[ȳ] ≃ σ²/n, but as π → 1, n → N, and Var[ȳ] → 0, since the sample exhausts the
population. In regression models, the foundation of all complex analyses, the least squares
principle (the Gauss-Markov theorem) is used to estimate regression coefficients.
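These repeated-sampling results are easy to check by simulation; a minimal sketch, with a hypothetical gamma-distributed finite population and an arbitrary number of replicate samples:

```python
import numpy as np

# Monte Carlo check of E[ybar] = mu and Var[ybar] = (1 - pi) * sigma^2 / n
# for simple random sampling without replacement.
rng = np.random.default_rng(1)
N, n = 1000, 100
Y = rng.gamma(2.0, 1.0, size=N)              # a fixed finite population
mu = Y.mean()
sigma2 = np.sum((Y - mu) ** 2) / (N - 1)     # divisor N - 1, as in the text

ybars = np.array([rng.choice(Y, size=n, replace=False).mean()
                  for _ in range(20_000)])
print(ybars.mean(), mu)                          # close agreement
print(ybars.var(), (1 - n / N) * sigma2 / n)     # close agreement
```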
This theory has been extensively developed for very complicated survey designs. Without
a model for the YI , there are no optimal procedures (in the model-based sense of §7.4):
competing procedures have to be evaluated by their biases and variances. As Little (2004,
p. 547) described it sadly in his review paper, it is all a matter of judgement. We change his
notation I to our U:
For inference about a finite population quantity Q = Q(Y), the following steps are
involved:
1. Choosing an estimator q̂ = q̂(Y_inc, U), a function of the observed part Y_inc of
Y, that is unbiased or approximately unbiased for Q with respect to the distribution
of U. . . .
2. Choosing a variance estimator v̂ = v̂(q̂(Y_inc, U)) that is unbiased or approx-
imately unbiased for the variance of q̂ with respect to the distribution of U.
So this inference procedure is not a science or a technology – art plays an important role. In
a much earlier review paper, Smith (1976) wrote in his conclusion, with the same frustration:
The basic question to ask is why should finite population inference be different from
inferences made in the rest of statistics? I have yet to find a satisfactory answer.
My view is that survey statisticians should accept their responsibility for providing
stochastic models for finite populations in the same way as statisticians in the
experimental sciences. These models can then be treated within the framework of
conventional theories of inference. The problems with the Neyman approach then
disappear to be replaced by disputes between frequentists, Bayesians, empirical
Bayesians, fiducialists and so on. But at least these disputes are common to all
branches of statistics and sample surveys are no longer seen as an outlier.
What is remarkable is that survey statisticians had already done this (Hartley and Rao 1968;
Ericson 1969), but their work was ignored for many years, and is still not taken seriously.
Further details are not given here; they can be found in Aitkin (2010, Chapter 4), and will be
revisited in Chapter 12, where the multinomial distribution for the YI plays the role Smith
wished for: a stochastic, always-true model for any finite population.
Models are, however, used in survey sampling for two specific classes of problems: small-
area estimation (SAE) and incomplete or missing data.
• SAE: involves two-stage sampling, with small numbers of secondary sampling units.
These small samples give unreliable population estimates for their areas, and need to be
strengthened by borrowing strength from a distributional model for the area means. We
do not discuss two-stage sampling in this book.
= Π_{I=1}^N [Pr(UI) Pr(yI)]
= Π_{I=1}^N (πI pI)^{uI}
= [Π_{I=1}^N πI^{uI}] · [Π_{I=1}^N f(YI | θ)^{uI}].
Under this assumption of independence between the sample selection process and the re-
sponse variable distribution (called a non-informative or ignorable sample design), the first
term in the likelihood is a constant, depending only on the design of the data collection.
(For example, a simple random sample of size n has selection probabilities πI = 1/N for all
population members.)
2 In the next section we give an example in which this independence does not hold.
The second term depends on the probability model, and can be (and generally is) written
in terms of the observed sample values yi , i = 1, . . . , n rather than the partly observed
population values. So the likelihood is generally written
n
Y
L(θ | y) = c · f (yi | θ),
i=1
where c is a constant, not involving the model parameters θ. (Many treatments of likelihood
omit any constant, but we retain the constant, for reasons which will become clear later.)
The first research question to be investigated in this chapter is the proportion of StatLab
mothers who were smoking at the diagnosis of their pregnancy.
where the notation I ∈ s means that population member I was drawn in the sample. (This
is a standard notation in survey sampling.)
It is convenient to have a similar index notation for the sample members drawn, as well
as for the full population. We write the sample index as i, ranging from 1 to n, and the
smoking “indicator variable” (1 for smokers, 0 for non-smokers) for the i-th sample member
as yi . Then equivalently
L(p) = Π_{i=1}^n [(1/1296) · p^{yi} · (1 − p)^{1−yi}]
     = c · p^r (1 − p)^{n−r},

where c = (1/1296)^n and r = Σ_i yi, the number of smoking mothers in the sample.
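The likelihood is equally easy to compute directly; a sketch for the first sample, dropping the constant c since it does not involve p:

```python
import numpy as np

# The binomial likelihood p^r (1-p)^(n-r) for n = 10, r = 4 on a grid of
# p values, as plotted in Figure 7.1 below.
n, r = 10, 4
p = np.linspace(0.001, 0.999, 999)
lik = p**r * (1 - p)**(n - r)
print(p[np.argmax(lik)])    # maximum at p = r/n = 0.4
```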
FIGURE 7.1
Binomial likelihood, n = 10, r = 4
FIGURE 7.2
Binomial likelihood, n = 10, r = 3 (solid) and 4 (dotted)
The second sample gave three smoking mothers. The likelihood functions of p for n = 10
are shown in Figure 7.2 for r = 3 (solid curve) and 4 (dotted curve), in Figure 7.3 for r = 7
when n = 20, and in Figure 7.4 for r = 14 when n = 40. For the last two figures the observed
proportion of smoking mothers is the same (0.35) – but the likelihood is more concentrated
in the larger sample – we are more certain that p is in the range 0.2–0.6. These figures give
us a visual impression of the plausibility of other values of p. If the observed number of
successes is 0 or n, the likelihood has its maximum on the boundary of the parameter space,
at 0 or 1 (Figure 7.5).
Does the likelihood convey any other information, beyond the sufficient statistic? This
can be assessed from the conditional distribution of the data given the sufficient statistic:
" # " n #
X Y
yi 1−yi n r
Pr {yi } | yi = r = (1/1296) · p (1 − p) / p (1 − p)n−r
i i=1
r
" n #
Y n
= (1/1296) /
i=1
r
The result is just the probability of the observed sequence 0111001110, divided by the bino-
mial coefficient. Under the model, any other sequence would have the same probability, so
if the model is correct, the particular sequence we observed would be just as likely as any
other sequence – it tells us nothing about the probability of smoking. The sequence looks
like a random permutation of the 0s and 1s, as expected from the random sampling.
If the sequence had been 1111000000, this would look non-random, as though some event
changed the sample draws from non-smokers to smokers, or the random sampling design
had not been followed. The sequence of successes and failures is called an ancillary statistic
– a function of the data which is not informative about the model parameter p, but may
be informative about other aspects of the study design. There are several ways of assessing
FIGURE 7.3
Binomial likelihood, n = 20, r = 7
FIGURE 7.4
Binomial likelihood, n = 40, r = 14
FIGURE 7.5
Binomial likelihood, n = 10, r = 0
departures from randomness, using properties of sequences – runs – of zeros and ones. We do
not discuss these further.
A potentially important point which we have not taken into account is that, once we
have the sample, we can see that there are other constraints on the possible values of p. We
have observed four smokers and six non-smokers in the sample, so in the population there
must be at least four smokers, and six non-smokers, so at most 1,290 smokers. So the possible
values of p must be in the smaller interval [4/1,296, 1,290/1,296], which is [0.0031, 0.9954].
The very small sample does not much restrict the possible values of p.
when p = r/n. The “continuous” maximum probability of four smoking mothers occurs at
p̂ = 0.4, and is 0.208222 to 6 dp. The estimates agree to 3 dp and the likelihoods agree to 4
dp, which is sufficiently accurate for the continuous assumption to be useful. We now use a
continuous approximation to the discrete likelihood: as we saw from its graph it is impossible
to distinguish visually the discrete population values in the population of size 1,296.
Not surprisingly, what we observe in the sample has the highest probability when the
population proportion is the same as the sample proportion! However the MLE does not
summarise the information in the likelihood. Other values of p also give high probability –
near the maximum – to r = 4; these values have high likelihoods. At p = 0.35 or 0.45, the
probability of r = 4 is 0.238 (to 3dp), so these values of p are also very plausible.4
Further analysis requires a theory of statistical inference.
The maximum of ℓ(p) (and L(p)) occurs at p̂ = r/n, the MLE. Since r has a binomial
distribution with mean np and variance np(1 − p), p̂ = r/n has a scaled binomial distribution
with mean p and variance p(1 − p)/n. We have the same result as from the direct calculation
of the variance, and the same difficulty in using it.
Although the repeated sampling principle was invoked above for the frequentist asymp-
totic Gaussian distribution of p̂, this distribution approximation does not depend at all
on a repeated-sampling interpretation; it follows directly from the binomial distribution
properties and the Central Limit Theorem. This is true for many other MLEs in simple
models.
The frequentist inference then consists of quoting the MLE p̂ of p as the measure of
location or centrality, the SE as the measure of variability and a confidence interval for p as
a summary of the extent of the plausible variation in the inference about p. The confidence
interval for p is the MLE ± λ SE, based on the asymptotic Gaussian distribution, where λ is
chosen to give a specified probability coverage in repeated sampling (frequently 95%, with
λ = 1.96).
This is an important limitation for a general theory: the theory depends on an asymptotic
assumption, which is adequate only under restrictive conditions. We discuss this further in
§9.5 on the Gaussian distribution.
For our samples of n = 10, 20 and 40 with r = 1, 3, 4, 7 and 14, the 95% confidence
intervals for p are shown in Table 7.1 to 3 dp, using p̂ ± 1.96 SE(p̂). The intervals shorten
as the sample size increases, but even for n = 40 they are far from precise. The variance
approximation fails for r = 1, n = 10. It is not sufficient simply to truncate the interval at
zero: the value p = 0 is impossible if r = 1, and truncating just above zero requires an
arbitrary choice of truncation point. The asymptotic theory does not handle this case.
A remarkable feature of the likelihood is that asymptotically (that is, in large samples),
when its maximum is internal to the parameter space – not on a boundary – the log-likelihood
approaches a quadratic in the model parameters. The linear and quadratic terms define the
MLE and its SE, and other terms tend to zero with increasing sample size.
We discuss this at length in §9.5, but here give the results for the binomial distribution,
assuming that r ≠ 0 or n (maximum not on the boundary). The justification of this comes
from the Taylor expansion of the log-likelihood (here for a single parameter p) about the
maximising value p̂, assumed to be an internal point in the parameter space. We write
ℓ(p) = log L(p) = ℓ(p̂) + (p − p̂) ℓ′(p̂) + (1/2!)(p − p̂)² ℓ″(p̂) + (1/3!)(p − p̂)³ ℓ‴(p̂)
                + (1/4!)(p − p̂)⁴ ℓⁱᵛ(p̂) + · · ·
     = ℓ(p̂) + (1/2)(p − p̂)² ℓ″(p̂) + (1/6)(p − p̂)³ ℓ‴(p̂) + (1/24)(p − p̂)⁴ ℓⁱᵛ(p̂) + · · · ,

since the first derivative is zero at the MLE. Writing σ̂² = −1/ℓ″(p̂) = p̂(1 − p̂)/n, we have
after some algebra

ℓ‴(p̂) = 2(1 − 2p̂)/(n σ̂⁴),
ℓⁱᵛ(p̂) = −6(1 − 3p̂ + 3p̂²)/(n² σ̂⁶),

so that

ℓ(p) = ℓ(p̂) − (1/2)(p − p̂)²/σ̂² + [(1 − 2p̂)/(3n σ̂⁴)](p − p̂)³ − [(1 − 3p̂ + 3p̂²)/(4n² σ̂⁶)](p − p̂)⁴ + · · ·
TABLE 7.1
95% confidence intervals for p
r n conf
1 10 −0.086, 0.286
3 10 0.016, 0.584
4 10 0.096, 0.704
7 20 0.141, 0.559
14 40 0.202, 0.498
The cubic and quartic terms in the expansion are of smaller order in the sample size n than
the quadratic, so as n increases,

ℓ(p) − ℓ(p̂) → −(1/2)(p − p̂)²/σ̂²,

which is equivalent on exponentiation to

L(p)/L(p̂) → exp[−(1/2)(p − p̂)²/σ̂²].
So under this quadratic assumption the likelihood approaches a multiple of a Gaussian den-
sity function of pb with mean p and variance σc2 . In this case we can express the inference
about p through the MLE pb as the best estimate of p, and its precision from the standard de-
viation σ
b. We can check the quadratic assumption by computing the quadratic log-likelihood
approximation:
ℓ(p) ≈ ℓ(p̂) − (1/2)[n/(p̂(1 − p̂))](p̂ − p)²
     = r log p̂ + (n − r) log(1 − p̂) − (p̂ − p)²/(2 Var(p̂)),

L(p) ≈ p̂^r (1 − p̂)^{n−r} · exp[−(p̂ − p)²/(2 Var(p̂))].
Figures 7.6, 7.7, 7.8 and 7.9 give the binomial (solid curve) and approximating Gaussian
(dotted curve) likelihoods for the cases r = 1 and 4 for n = 10, and r = 7, n = 20 and
r = 14, n = 40. They are identical at the MLE, and diverge away from it, decreasingly with
increasing n: the skew in the binomial likelihood reduces with increasing n for fixed p̂.
At n = 10 and r = 1, 3 or 4, the Gaussian approximation gives positive likelihood to
negative values of p, which are impossible. They do not appear in the graphs which are
constrained to the possible values, but the Gaussian approximation does not descend to zero
at the left end.
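The two curves in these figures can be reproduced from the formulas above; a minimal sketch for n = 10, r = 4:

```python
import numpy as np

# Binomial likelihood and its Gaussian approximation at the MLE, as in
# Figures 7.6-7.9, here for n = 10, r = 4.
n, r = 10, 4
phat = r / n
var = phat * (1 - phat) / n
p = np.linspace(0.001, 0.999, 999)
lik = p**r * (1 - p)**(n - r)
approx = phat**r * (1 - phat)**(n - r) * np.exp(-(phat - p)**2 / (2 * var))
print(np.abs(lik - approx).max())    # identical at the MLE, diverging away from it
```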
FIGURE 7.6
Likelihood (solid) and Gaussian approximation (dotted), n = 10, r = 1
FIGURE 7.7
Likelihood (solid) and Gaussian approximation (dotted), n = 10, r = 4
FIGURE 7.8
Likelihood (solid) and Gaussian approximation (dotted), n = 20, r = 7
FIGURE 7.9
Likelihood (solid) and Gaussian approximation (dotted), n = 40, r = 14
distribution, the information we have about the model parameters after seeing the data,
through Bayes’s theorem, which we express generally as a theorem in probability.
π(θ | y) = π(θ) L(θ | y) / ∫ π(θ) L(θ | y) dθ

if θ is continuous.
In many studies there is no specific prior information about the parameters: the point of
the study is to obtain such information from the data. In other research areas there may be
some information from related or similar studies; the difficulty is in expressing this informa-
tion in a full probability distribution. We discuss in detail below ways in which informative
priors can be constructed. We use first the flat or uniform prior, leaving the likelihood un-
changed. This allows us to describe what the data say through the model likelihood, before
incorporating any informative prior.
We have a very large number – 1,297 – of possible values of p. As we noted from the graph
of L(p), we cannot distinguish the separate points in the graph at this level of resolution. In
such cases it is convenient to treat p as though it is continuous rather than discrete. That is,
we remove the restriction that p can take only the 1,297 values on the fine grid, and allow it
to take any value in [0,1]. This does not change the form of the likelihood, only its support
– the values on which it is defined.
We now define the uniform prior for p as π(p) = 1 for p ∈ [0,1]. To convert the likelihood
into the posterior, we divide by the integral of the likelihood over its range. The integral
∫₀¹ p^r (1 − p)^{n−r} dp is a Beta function:

∫₀¹ p^r (1 − p)^{n−r} dp = B(r + 1, n − r + 1)
                        = Γ(r + 1) Γ(n − r + 1)/Γ(n + 2)
                        = r!(n − r)!/(n + 1)!.
The cumulative distribution is not analytic (does not have a simple algebraic representation)
but is extensively tabulated and available as a system function in most statistical packages.
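For example, the central credible intervals for the samples above can be read off the Beta quantile function; a sketch using scipy, with the Beta(r + 1, n − r + 1) posterior following from the uniform prior as above:

```python
from scipy.stats import beta

# Posterior medians and central 95% credible intervals from Beta(r+1, n-r+1).
for r, n in [(1, 10), (3, 10), (4, 10), (7, 20), (14, 40)]:
    lo, med, hi = beta.ppf([0.025, 0.5, 0.975], r + 1, n - r + 1)
    print(f"r = {r:2d}, n = {n:2d}: median {med:.3f}, 95% interval [{lo:.3f}, {hi:.3f}]")
```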
TABLE 7.3
95% credible and confidence intervals for p
r n cred conf
Table 7.2. As n increases the posterior medians vary slightly around the observed propor-
tions, while the 95% credible intervals converge slowly towards the true population value,
which we know from the population listing is 0.340. However samples of this size give little
precision.
The 95% central credible intervals for p are shown in Table 7.3 to 3 dp, together with
the 95% confidence intervals. The variance approximation for the confidence interval fails
for r = 1, and is inaccurate (relative to the credible interval) for r = 3 and 4, n = 10, but
is increasingly accurate, at least at the 97.5 percentile, for the larger samples, where r is far
from the boundary. The logistic transformation 95% confidence interval for r = 1, n = 10 of
[0.013, 0.441] agrees fairly well with the credible interval.
If we can regard the StatLab population as a random sample of the full Child Health
and Development Study population, how precise is the StatLab value? Based on the sample
value of 0.340 in the StatLab sample of 1,296, the posterior median is 0.340, and the 95%
credible interval is [0.314, 0.366]. A sample of 1,296 gives quite precise information about
the population proportion.
An important, if uncommon, problem is what to give for a credible interval if r = 0 or
n. We discuss only this case of r = 0; symmetry applies to the other case. Figure 7.5 shows
the likelihood for r = 0, n = 10.
The central 95% credible interval of [0.002, 0.285] does not include 0, though this has
the highest likelihood and posterior density. Any interval we quote should not exclude zero,
unless there is strong prior evidence against it. The choice of a central (two-sided, equal-
tailed) interval is not a principle of Bayesian analysis, only a convention (there are others,
like the highest posterior density interval). For this extreme case it is sensible to use a
one-sided highest posterior density 95% credible interval, starting from zero: [0, 0.240].
Zero counts like this occur frequently in animal trapping of rare species. Ten traps are
set in an area but no animals are caught. Could there be no animals of this species in the
area?
This gives a useful interpretation of the prior parameters: a − 1 and b − 1 can be thought of
as the numbers of prior successes and prior failures from a previous study, or set of studies,
even if no such studies have occurred. It calibrates the information in the prior relative to
that in the likelihood. If no such studies have been carried out, then a = b = 1 and the prior
is uniform.
To take the prior probabilities different in the absence of observational reason for
doing so would be an expression of sheer prejudice. The rule that we should then
take them equal is not a statement of any belief about the actual composition of
the world, nor is it an inference from previous experience; it is merely the formal
way of expressing ignorance.
(pp. 33–34)
Two rules appear to cover the commonest cases. If the parameter may have any
value in a finite range, or from −∞ to +∞, its prior probability should be taken
as uniformly distributed. If it arises in such a way that it may have conceivably
any value from 0 to ∞, the prior probability of its logarithm should be taken as
uniformly distributed.
. . . It is now known that a rule with this property of invariance [under mono-
tone transformations] exists, and is capable of very wide, though not universal,
application.
(pp. 117–118)
TABLE 7.4
Bootstrap distribution of p̂

p̂    0     0.1   0.2   0.3   0.4   0.5
Pr   .349  .387  .194  .057  .011  .002
Jeffreys showed that his transformation invariance rule, now known as the Jeffreys prior
rule, when applied to the Bernoulli p was inconsistent with the “any value in a finite range”
prior rule, and he concluded that a single universal rule for non-informative prior assignment
would not be possible. This has not discouraged other searchers for a general rule: one of
the most recent is the reference priors of Berger, Bernardo and Sun (2009). Their reference
prior for p is the Jeffreys prior.
An early argument by Savage (1962) was called the principle of stable estimation, or
precise measurement. This specifies that when a likelihood function is sharply peaked in an
interval over which the prior density is relatively flat, the posterior density does not differ
much from the normed (scaled) likelihood function. The measurement itself is considered
to be precise, irrespective of the fine detail of the prior. This principle says essentially that
provided the information in the data is large relative to that in the prior, the precise form
of the prior is unimportant: the uniform prior would give essentially the same answer.
Jeffreys put the same point differently (1961, p. 122):
the mind retains great numbers of vague memories and inferences based on data
that have themselves been forgotten, and it is impossible to bring them into a formal
theory [formal prior distribution] because they are not sufficiently clearly stated. In
practice, if one of them leads to a suggestion of a problem worth investigating, all
that we can do is to treat the matter as if we were approaching it from ignorance –
the vague memory is not treated as providing any information at all.
the flat prior is not invariant under reparametrization. Thus if θ is uniform, e^θ has
an improper exponential distribution.
This argument ignores the finite population, as Geisser pointed out. For the smoking mothers
question, the effect of the flat prior on p under reparametrization can be expressed through
the logistic transformation θ = log[p/(1 − p)], p = e^θ/(1 + e^θ). It is easily seen that the
transformed support values log[R/(N + 1 − R)] are unequally spaced, but they have the same
equal prior probabilities 1/(N + 1). The jump to the continuous parameter space introduces
the Jacobian of the transformation, dp/dθ = e^θ/(1 + e^θ)², so the prior in θ has this form in
the continuous space, which has a mode at θ = 0. We appear to have higher prior belief in
θ = 0 : p = 0.5 than in
large positive or negative θ : p near 0 or 1.
The Jacobian derivative simply defines the “packing density” of the support points in the
discrete θ space. Near θ = 0 the packing density is high, but for larger positive or negative
θ it is very low. So the Jacobian contribution of the transformation from the p scale to the
θ scale does not alter the uniformity of the prior, just the density of its support points.
her prior – to specify a full probability distribution for the parameter – but various ways
of assessing the prior location and variation may allow a full probability distribution to be
specified. This school dislikes the flat or non-informative prior specification, because it is
unreasonable to them to suppose that the user has no prior information whatever about
the parameter. A further powerful objection of this school is that the (probability) model
for the data is already a subjective choice of the statistician or user, though it may be
regarded as scientifically objective. We address this important objection in Chapter 12 on
model diagnostics.
Another school feels that there should always be a reference analysis with a non-
informative or minimally informative prior, to assess “what the data say” given the data
model; the user can then perform an analysis with an informative prior, and will then be
able to assess the information provided by the prior in the informative analysis. A particular
peril of the single personal prior analysis is that, if the data have little information about
the parameter, the posterior will be determined essentially by the prior, and the user may
not realise this.
In this book we follow the reference school, in using generally flat or non-informative
priors. We argue that the object of the research study is not to update the prior of the
analyst or researcher, but to provide a “neutral” analysis and interpretation of the data
which will inform researchers and scientists in general, whatever their personal priors. This
releases the student, statistician or user from the need to elicit his or her own prior, and
it allows the information content of the data to be assessed independently of the prior. If
the data (through the likelihood) are uninformative about the parameter, then the study
or experiment which led to this data set has not contributed to our understanding of the
question being investigated. This requires a more informative experiment and likelihood, not
the perilous use of an informative prior with an uninformative likelihood, which will simply
reproduce the informative prior as the posterior.
We give an example of this in §8.7, on the notorious ECMO trials.
unobservable population parameters amounts to inference about things that do not exist, following the work
of Bruno de Finetti (Wikipedia).
distribution for the data and the posterior distribution of the parameter with the conjugate
prior could be developed analytically (algebraically). However, in many more complex mod-
els this is not possible, but it is possible to draw random samples from the posterior, and
use these to draw inferences about the model parameters. This will become clear in later
chapters, but we give here a simple example to show how useful this can be.
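A minimal sketch of the idea, for the binomial example above (Beta(r + 1, n − r + 1) posterior with r = 4, n = 10; the number of draws is arbitrary):

```python
import numpy as np

# Random draws from the Beta posterior of p; quantiles are read off the
# empirical cdf of the draws, as in Figures 7.10-7.13 below.
rng = np.random.default_rng(1)
r, n, M = 4, 10, 10_000
draws = rng.beta(r + 1, n - r + 1, size=M)
print(np.quantile(draws, [0.025, 0.5, 0.975]))   # close to the exact Beta quantiles
```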
FIGURE 7.10
n = 10 [cdf against beta]
FIGURE 7.11
n = 100 [cdf against beta]
FIGURE 7.12
n = 1,000 [cdf against beta]
FIGURE 7.13
n = 10,000 [cdf against beta]
No reliable information can be obtained from the sample of ten. The credible region for
the sample of 100 covers the true cdf at all values, but is too wide to give any precision. The
cdf from 1,000 draws is fairly smooth, and very close to the true cdf.
The cdf from 10,000 draws is very smooth and overlies the true cdf: the upper and lower
bounds are almost equal. It is clear that credible intervals from 1,000 draws will be fairly
accurate, and the cdf from 10,000 draws will be a very accurate approximation to the true
cdf. In Chapter 12 on model diagnostics we use this approach generally.
In some Bayesian packages it is standard practice to give a kernel density estimate –
a graph of the estimated posterior density of the parameter draws – for each parameter.
The kernel density is a form of mixture density, frequently a Gaussian mixture (discussed in
Chapter 15). In this book we do not use these density graphs, for three reasons:
• the densities are sometimes jagged, with small bumps or ripples in the density estimate
which have no simple interpretation. The jaggedness is an artifact of the choice of the
bandwidth – the standard deviation – which is set too small in such cases;
• the density estimate cannot be used for anything other than to give an indication of
centrality – of the location of the mode or modes of the density – and the extent of skew;
• quantiles of the posterior distribution cannot be obtained from these density graphs: for
these we need the cdf of the draws.
a specified length. This usually requires a large random sample, for which the simple ap-
proximation for the credible interval is accurate. Suppose we want the 95% credible interval
for p to be not more than 0.04 in length. The simple approximate interval is

p̂ ± 2√(p̂(1 − p̂)/n),

and if this is to be of length not more than 0.04, we must have

√(p̂(1 − p̂)/n) < 0.01,

which means that n > 10⁴ · p̂(1 − p̂). Since the maximum value of p̂(1 − p̂) is 1/4, the interval
length requirement will be satisfied, whatever the sample outcome, if n > 2,500. In general,
if the 95% credible interval for p is to be of length not more than δ, then this is guaranteed
if the sample size exceeds 4/δ².
We thus have a solution to the problem of inference about a single population proportion,
including the sample size required for a specified precision.
FIGURE 7.14
Logistic likelihood for θ ∈ [−5.4, 0.86]
For the example with r = 1, n = 10, we have θ̂ = log(0.10) = −2.303, SE(θ̂) = 1.054.
We begin with a 3 SE grid, giving a range of [−5.465, 0.859]. Figure 7.14 shows the logistic
likelihood calculated at each of the 100 equally spaced grid points in this interval.
The likelihood does not descend to zero at the left end. The range has to be extended
further on the left. Figure 7.15 shows the likelihood calculated at each of 1000 equally-
spaced grid points in the interval [−10,1]. The value −10 is 7.3 SEs away from the MLE.
The likelihood on the θ scale has extreme left skew, while that on the p scale has extreme
right skew (Figure 7.6). A consequence of the left skew in θ is that the 95% confidence
interval on the θ scale is also misleading: the left end of the confidence interval is much too
high, and so the left end of the confidence interval for p is much too high. On neither the p
nor the θ scale does the confidence interval give an accurate representation of the credible interval.
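The likelihood on the θ scale is computed simply by transforming the grid back to p; a sketch for r = 1, n = 10 over the wider grid of Figure 7.15:

```python
import numpy as np

# Likelihood on the logit scale theta = log[p/(1-p)], for r = 1, n = 10,
# over the grid [-10, 1]; the extreme left skew shows in the grid values.
n, r = 10, 1
theta = np.linspace(-10, 1, 1000)
p = np.exp(theta) / (1 + np.exp(theta))
lik = p**r * (1 - p)**(n - r)
print(theta[np.argmax(lik)])    # grid point maximising the likelihood
```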
Is there a scale transformation which does give an accurate representation? In the ex-
ponential family this question was investigated by Anscombe (1964), in the framework of
removing skew, by the third derivative of the log-likelihood being zero at the MLE value. We
discuss this further in Chapter 9, with the two possible parametrisations of the exponential
distribution.
Now we give an application to a different model: an extended version of the binomial
model.
FIGURE 7.15
Logistic likelihood for θ ∈ [−10, 1]
TABLE 7.5
V1 hits
Number of V1 hits Number of squares
0 237
1 189
2 115
3 28
4 6
5 1
of the event of interest (“success”) becomes very small, in such a way that the mean number
of successes µ = np remains finite and not very small, and the variance σ 2 = np(1 − p) → µ.
Accidents are a common application: the chance of a road accident is small, but the number
of road users is large, and the actual number of accidents is appreciable.
The limiting process is easy to show. In the binomial model, write p = µ/n; then

Pr[Y = r | µ] = (n choose r) (µ/n)^r (1 − µ/n)^{n−r}
             = [n!/(n^r (n − r)!)] · (1 − µ/n)^{n−r} · µ^r/r!
             = [Π_{i=1}^r (1 − (i − 1)/n)] · (1 − µ/n)^{n(1−r/n)} · µ^r/r!
             → 1 · e^{−µ} · µ^r/r!
as n → ∞ with r, µ fixed.
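The limit is easy to check numerically; a sketch using scipy, where the values µ = 0.9 and r = 2 are arbitrary illustrations:

```python
from scipy.stats import binom, poisson

# Binomial(n, mu/n) probabilities approach Poisson(mu) probabilities
# as n grows, with r and mu held fixed.
mu, r = 0.9, 2
for n in (10, 100, 1000, 10_000):
    print(n, binom.pmf(r, n, mu / n), poisson.pmf(r, mu))
```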
The research question at the time was whether the squares with many hits were deliber-
ately targeted, or whether this distribution of hits was random. This was a critical issue for
the understanding of the guidance system of the missile. We examine whether the Poisson
distribution can represent the number of missile hits.
with the null hypothesis χ²₅ distribution of X². The value of 6.18 has an upper-tail
probability of about 0.29 under χ²₅. There is no strong evidence against the Poisson
hypothesis. The Pearson X² test is an
TABLE 7.6
Observed and Poisson ML proportion of squares with each number of hits
Hits 0 1 2 3 4 5
Obs prop. 0.411 0.328 0.200 0.049 0.010 0.002
Pois prop. 0.397 0.367 0.169 0.052 0.012 0.002
FIGURE 7.16
Empirical cdf (circles), 95% bounds (red) and ML fitted Poisson model (green)
TABLE 7.7
Observed and Poisson expected frequencies of squares with each number of hits
Hits 0 1 2 3 4 5
Oi 237 189 115 28 6 1
Ei 228.7 211.4 97.3 30.0 6.9 1.2
π(µ | y) = c · exp(−nµ) µ^T · µ^s
         = c · exp(−nµ) µ^{T+s}
         = exp(−nµ)(nµ)^{T+s}/Γ(T + s + 1),

a gamma density with parameters n and T + s. This density does not have an analytic cdf,
but the cdf is well tabulated and available as a library function in most statistical packages,
usually as the cdf of the standard gamma density with parameter k:

f(θ | k) = exp(−θ) θ^k/Γ(k + 1).
So θ = nµ with k = T + s has this density, and µ has the density of θ/n. For reasonably large
T , whether s is 0 or 1 or −1 makes little difference to the posterior quantiles. With very
small T it can have an appreciable effect. The posterior median and 95% central credible
intervals for µ are shown to 3 dp in Table 7.8 for s = −1, 0, 1. Here n = 576, T = 532.
The MLE of 0.924 agrees well with the median, and the asymptotic 95% confidence
interval of [0.849, 0.999] agrees well with the credible interval, though it is slightly shorter.
The effect of s is to provide an additional contribution to the observation total without
increasing the sample size. This has a negligible effect because of the large n.
The general conjugate gamma prior and corresponding posterior are of the form

π(µ | s, m) = exp(−mµ)(mµ)^s/Γ(s + 1),
L(µ) = c · exp(−nµ) µ^T,
π(µ | n, T, s, m) = exp[−(n + m)µ][(n + m)µ]^{T+s}/Γ(T + s + 1).
The limiting case m → 0 recovers the family above.
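A sketch of the computation behind Table 7.8, assuming the gamma shape/rate convention (shape T + s, rate n in scipy's parametrisation) that reproduces the tabulated quantiles:

```python
from scipy.stats import gamma

# Posterior quantiles for the Poisson mean mu, missiles data: n = 576, T = 532.
n, T = 576, 532
for s in (-1, 0, 1):
    q = gamma.ppf([0.025, 0.5, 0.975], T + s, scale=1 / n)
    print(s, q.round(3))    # compare Table 7.8
```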
nT −1 Γ(T + y0 )
=
(n + 1)T +y0 −1
Γ(T + y0 ) T −1
Pr[y0 | n, T ] = p (1 − p)y0 .
y0 !Γ(T )
where p = n/(n + 1), a negative binomial distribution.
TABLE 7.8
Median and 95% credible interval quantiles (s = −1, 0, 1) for
Poisson mean, missiles data
s 0.025 0.5 0.975
−1 0.845 0.921 1.002
0 0.847 0.923 1.004
1 0.848 0.925 1.006
TABLE 7.9
Poisson distribution with µ = 0.9695
Y 0 1 2 3 4 5 6
Pr[Y ] 0.3793 0.3677 0.1782 0.0576 0.0140 0.0027 0.0004
TABLE 7.10
Poisson distribution with µ = 4
Y 0 1 2 3 4 5 6
Pr[Y ] 0.0183 0.0733 0.1465 0.1954 0.1954 0.1563 0.1042
Y 7 8 9 10 11 12 13
Pr[Y ] 0.0595 0.0298 0.0132 0.0052 0.0019 0.0006 0.0002
conclusions should be based on the number observed, not on numbers not observed. The
probability of four events is 0.0140, but this cannot be interpreted by itself as evidence in
the frequentist framework.
FIGURE 7.17
Posterior densities of side effect rate – solid, s = 0; dotted, s = 1
TABLE 7.11
Observed numbers of species
16 18 22 25 27
account for the different observer successes, and does not have a parameter for the species
population size. Instead we use the binomial distribution, where the chance of any observer
identifying a species is p, and the actual number of species present in the sampled area is N .
It might appear that every observer i should have his or her own probability pi of success
in identifying a species, but these parameters would not be identifiable because we will have
more parameters than observations, as we have observed only the numbers of successes in
the N binomial trials, and we do not know the number of trials, which is the parameter of
interest.
Given the sample of counts y1, . . . , yn, the binomial likelihood is

L(N, p) = Π_{i=1}^n (N choose yi) p^{yi} (1 − p)^{N−yi}
        = [Π_{i=1}^n (N choose yi)] p^T (1 − p)^{Nn−T},

where T = Σ_{i=1}^n yi.
as N → ∞. The likelihood L(N, ψ) is shown in Figure 7.19. The MLE of ψ is ȳ for any value
of N , and is sharply defined. Not only is the curvature eliminated, but the two parameters
FIGURE 7.18
Likelihood in N and p
FIGURE 7.19
Likelihood in N and ψ
are almost independent: the likelihood is almost separable as N → ∞. The “step” in the
graph is because ψ cannot be greater than N .
Figure 7.20 shows the likelihood in N at ψ̂. It rises rapidly from N = 27 (the largest
observed count) to a poorly defined maximum around N = 100, but then decreases very
slowly to an asymptote at N = ∞ of 0.935 of its value at the maximum, where the binomial
reaches its Poisson limit.
FIGURE 7.20
Likelihood in N at ψ̂
We have learnt very little from the data about N – it has to be at least 27, but any value
between 50 and ∞ is plausible.
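The flatness of this likelihood is easy to reproduce; a sketch of the log-likelihood in N at p = ȳ/N for the counts of Table 7.11:

```python
import numpy as np
from scipy.special import gammaln

# Log-likelihood in N at the conditional MLE p = ybar/N, species counts data.
y = np.array([16, 18, 22, 25, 27])
n, T, ybar = len(y), y.sum(), y.mean()

def loglik(N):
    p = ybar / N
    binom_coefs = sum(gammaln(N + 1) - gammaln(yi + 1) - gammaln(N - yi + 1)
                      for yi in y)
    return binom_coefs + T * np.log(p) + (N * n - T) * np.log(1 - p)

Ns = np.arange(27, 5001)
ll = np.array([loglik(N) for N in Ns])
print(Ns[np.argmax(ll)])           # poorly defined maximum, around N = 100
print(np.exp(ll[-1] - ll.max()))   # tail stays close to the maximum (book: 0.935)
```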
What is the marginal posterior distribution of N? For fixed N and a flat prior distribu-
tion, p has a conditional Beta(T + 1, Nn − T + 1) distribution. N marginally has no standard
distribution. However, this does not help: we need to eliminate p, not N . What prior should
be used to eliminate p? It might seem obvious that a Beta distribution would be appropriate –
we just have to specify the prior parameters.
Kahn (1987) considered the general conjugate Beta prior

π(p) = p^{a−1}(1 − p)^{b−1}/B(a, b)

and evaluated the integrated likelihood in N, ∫ L(N, p) π(p) dp, as a function of a and b. He
found that this did not depend at all on the second Beta index b, but depended critically on
the first index a, which controlled the location of the mode and the heaviness of the tail of
the integrated likelihood:
• For a = 0 this tail was flat, giving an essentially uninformative integrated likelihood for
N, like the tail of the likelihood in the frequentist analysis with (N, ψ̂).
• For a = 1/2 (as in the Jeffreys prior) and a = 1 (as in the uniform prior) the
integrated likelihoods decreased with large N and had well-defined (different) interior
modes.
So the choice of the a parameter of the Beta prior for p had a critical effect on the posterior
in N . This is a consequence of the unusual shape of the two-parameter likelihood in N and
p (Figure 7.18).
For the uniform prior on p, the integrated likelihood could not be normalised (scaled
to integrate to 1) because of the non-zero tail at ∞. So no credible interval inference was
possible with a flat prior on N : any value from 50 to ∞ has approximately the same likelihood.
Many Bayesians would put a prior on N to eliminate the tail: an obvious choice would be
π(N ) = c/N (improper for any constant c). However this is a strongly informative prior,
which would need external scientific justification. The posterior in N would also be very
sensitive to changes in its informative prior: the conclusions about N are determined by the
priors for p and N .
These uncomfortable conclusions are not different for the frequentist and Bayesian anal-
yses: the problem is that we do not have enough data to identify N effectively. The model
asks for more than the data can deliver.
A more detailed analysis of this example is given in Aitkin and Stasinopoulos (1989),
partly reproduced in Aitkin (2010, pp. 24–31). The model is relevant in animal herd counting
by multiple aerial observers, and in air warfare by multiple aerial observers counting defensive
missile flashes in attacks on targets. How many herds were there? How many missiles were
fired?
that the proportions must sum to 1.0, so that (for example) random draws of each proportion
will not add to 1 across the categories. We will want to use all the categories simultaneously
in many applications, so we develop a general approach, for any number of categories K.
We define the population proportion in the k-th of the K categories by p_k, with the p_k
satisfying p_k ≥ 0, Σ_{k=1}^K p_k = 1. We allow for the possibility that one or more p_k could be
zero – these categories may be absent from the population, or from sub-populations. We
draw a random sample of size n with replacement from the population and obtain sample
counts n1, . . . , nK in the K categories; some of the n_k may be zero.
The probability of observing these sample counts has the multinomial distribution, writ-
ten M (n; p1 , . . . , pK ), in which
M(n; p1, …, pK) = [n! / (n1! ⋯ nK!)] p1^n1 ⋯ pK^nK.
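As a quick numerical illustration, the multinomial probability can be evaluated directly. The following is a minimal sketch in Python using scipy; the counts and proportions are hypothetical illustrative values, not data from the text.

```python
import numpy as np
from scipy.stats import multinomial

# Hypothetical example: K = 3 categories, n = 10 draws
n_k = np.array([3, 5, 2])            # sample counts n_1, n_2, n_3
p_k = np.array([0.2, 0.5, 0.3])      # category proportions p_1, p_2, p_3

# M(n; p_1, ..., p_K) = [n!/(n_1! ... n_K!)] p_1^n_1 ... p_K^n_K
print(multinomial.pmf(n_k, n=n_k.sum(), p=p_k))
```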
To maximise the likelihood we add a constraint to the log-likelihood, that the pk must sum to 1:

P = log L − λ(Σk pk − 1)
  = Σk nk log pk − λ(Σk pk − 1),

∂P/∂pk = nk/pk − λ = 0, giving pk = nk/λ;
∂P/∂λ = −(Σk pk − 1) = 0, giving Σk pk = Σk nk/λ = 1, so λ = n and

p̂k = nk/n,
unsurprisingly. The variance of the MLE p̂k is p̂k(1 − p̂k)/n, and the covariance of p̂k and p̂ℓ is −p̂k p̂ℓ/n.
The conjugate prior distribution for the pk is the Dirichlet distribution

π(p1, …, pK | a1, …, aK) = [Γ(a) / (Γ(a1) ⋯ Γ(aK))] p1^(a1−1) ⋯ pK^(aK−1),

where the prior parameters ak ≥ 0, and Σk ak = a. The posterior distribution of the pk
is again a Dirichlet distribution, with parameters nk + ak ; the prior weight ak is added to
the sample weight nk to give the posterior weight nk + ak at pk .
It might be expected that, as with the binomial/Beta case K = 2, the non-informative
prior would have ak = 1 for all k. However this means that the total prior weight would
be K, which may be quite large in many of the applications we want to consider, so it can
have a large effect on the posterior if the category sample sizes are small. We could use the
(improper) Haldane prior with ak = 0 for all k as the non-informative prior; this will be used
in §14.5 for “continuous” distributions.
A particular feature of this prior is that it gives zero posterior weight to categories k
for which there are no sample observations. That is, categories with zero sample counts are
treated as though they do not exist in the population – the sample zeros represent structural
(population) zeros, which is a strong statement of prior belief!
To avoid this, we need to assign positive values to the ak for which there are no sample
observations, which means in practice for all categories, since we do not know in advance
of the data which categories will have zero counts (and if the sample is small, many of the
counts may be zero). This treats the zero counts as sampling zeros, for which the population
counts are not forced to be zero.
To generate a sample from the Dirichlet posterior, we need to specify a prior. The uniform prior
here will give a total prior weight of 9, relative to a sample weight of 40. The Haldane prior
will exclude category 3 with the zero count. We compromise with a minimally informative
prior, giving equal prior weight 0.1 to all categories, and a total prior weight of 0.9.
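The simulation can be sketched in a few lines of Python (numpy). The prior weight 0.1 per category is as in the text; the sample counts below are hypothetical stand-ins for the StatLab counts, which the text summarises only through the tables that follow.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical counts over 9 categories summing to n = 40 (stand-ins for
# the StatLab sample); a_k = 0.1 is the minimally informative prior weight
counts = np.array([5, 3, 3, 1, 4, 3, 14, 4, 3])
a_k = 0.1

# 5,000 draws from the Dirichlet posterior with weights n_k + a_k
draws = rng.dirichlet(counts + a_k, size=5000)

# Median and 95% central credible interval for each category proportion
lo, med, hi = np.quantile(draws, [0.025, 0.5, 0.975], axis=0)
for k in range(len(counts)):
    print(f"category {k}: median {med[k]:.3f}, 95% CI [{lo[k]:.3f}, {hi[k]:.3f}]")
```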
Table 7.13 gives the median and 95% central credible intervals based on the 2.5% and
97.5% quantiles of the simulated cdfs from 5,000 random draws from the Dirichlet posteriors
of the category proportions for father’s occupation. Table 7.14 gives the same results for
child’s blood group. An additional line in each table gives the true StatLab population
proportions in these categories.
The population values are covered by the 95% credible intervals for all categories of both
variables, but the medians are often not close to the population values. With the small
sample of 40, and the very small counts in each category, high precision is not achieved. The
ninth blood group of “missing” is not a blood group category. (How to deal with missing
data is an important practical question. We discuss it at length in Chapter 15.)
A biological/genetic question of interest is whether blood group and Rh factor are asso-
ciated. We can assess this from the population structure, omitting the “missing” category.
For the StatLab sub-population which has complete data on blood group and Rh factor,
the breakdown over the eight categories is given in Table 7.15. We now reorder the table to
make it two-way: blood group by Rh factor, in Table 7.16. Marginal totals over each factor
are added to the table.
Blood groups O and A are much more common than B or AB. The Rh– group is uncom-
mon. Although it is not clear from the table, we can assess the question of independence of
blood group and Rh factor from the table data. Write Rh and BG for the two classifications.
If these are independent, then Pr[BG = j, Rh = i] = Pr[BG = j] · Pr[Rh = i] for every blood group j and Rh level i.
TABLE 7.13
Father’s occupational category, n = 40
Occ. Category 0 1 2 3 4 5 6 7 8
2.5% .054 .015 .016 .000 .027 .016 .215 .027 .041
median .133 .064 .064 .022 .088 .063 .341 .088 .113
97.5% .253 .163 .162 .027 .193 .160 .483 .193 .223
population .227 .069 .056 .023 .042 .103 .310 .060 .110
TABLE 7.14
Child’s blood group, n = 40
Blood group 1 2 3 4 5 6 7 8 9
2.5% .001 .029 .007 .000 .129 .152 .077 .007 .017
median .019 .094 .044 .000 .244 .266 .168 .044 .078
97.5% .092 .209 .138 .023 .385 .415 .299 .134 .171
population .052 .040 .012 .005 .343 .326 .110 .048 .065
TABLE 7.15
Child’s blood group, missing data excluded
Blood group 1 2 3 4 5 6 7 8
population .056 .043 .013 .005 .366 .348 .118 .051
TABLE 7.16
Child’s blood group by Rh factor
O A B AB T
Rh– .056 .043 .013 .005 .117
Rh+ .366 .348 .118 .051 .883
T .422 .391 .131 .056 1.000
TABLE 7.17
Product table for child’s blood group
by Rh factor
O A B AB T
Rh– .049 .046 .015 .007 .117
Rh+ .373 .345 .116 .049 .883
T .422 .391 .131 .056 1.000
So if we multiply the marginal probabilities of each level of the two factors, we should see joint
probabilities very close to those in the population. This product table is given in Table 7.17.
Apart from blood group O, the product proportions differ from the population proportions
by at most 0.003. For group O, the difference is 0.007. The agreement is very close.
The Rh factor genetic information is also inherited from our parents, but it is
inherited independently of the ABO blood type alleles.
(The University of Arizona)
C(R, 1) C(N − R, 2) / C(N, 3).
In general, following the notation for sampling with replacement, when there are D
distinct values of YI in the population, the probability that the sample contains nI of the
NI values of YI in the population is the hypergeometric probability
" D #
Y NI N
Pr[{nI } | {NI }] = / .
nI n
I=1
Both Bayesian and frequentist analyses are more complex than for the previous case of
sampling with replacement. A full discussion can be found in Aitkin (2010) Chapter 4.
We note here that sampling with replacement accurately approximates sampling without
replacement if the sample fraction n/N is small.
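A minimal sketch of this computation in Python with scipy; the population and sample counts are hypothetical illustrative values.

```python
import numpy as np
from scipy.special import comb

def mv_hypergeom(n_counts, N_counts):
    """Pr[{n_I} | {N_I}]: product of C(N_I, n_I) over I, divided by C(N, n)."""
    n_counts, N_counts = np.asarray(n_counts), np.asarray(N_counts)
    return comb(N_counts, n_counts).prod() / comb(N_counts.sum(), n_counts.sum())

# Hypothetical population with D = 3 distinct values (N = 60), sample n = 6
print(mv_hypergeom([1, 2, 3], [10, 20, 30]))
```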
8
Comparison of binomials
The Randomised Clinical Trial
8.1 Definition
An important application of statistical inference using the binomial distribution is to the
comparison of new medical or surgical treatments for disease or illness in a randomised
clinical trial (RCT). Such trials have certain characteristic features:
• A new treatment which has been found to be effective in small studies on selected patients
is to be evaluated in a large study, compared with the current best treatment.
• The medical profession must be in equipoise regarding the treatments: there must be no
issues of different side effects or other unwanted aspects of the treatments (Freedman
1987).
• Patients taking part in the study are assigned to receive either the new treatment or the
current best treatment. Assignment to one treatment or the other is by randomisation:
in the simplest form (not used in practice) by tossing a coin for each patient – heads
means the new treatment, tails means the current best treatment.
• Randomisation of patients requires their informed consent in advance: they must be told
what the treatments are, and that they will be randomised to one or the other treatment,
but they will not be told which one.
• The two treatments received by the patients must appear to them to be the same, so the
patients are not aware of which treatment they are receiving – the patients are blinded
to the treatment identification.
• Physicians must also be blinded (the trial is then double-blinded) – they must not know
which treatment a patient is receiving, so that this information cannot be accidentally
disclosed to the patient.
• With new drug treatments, the pills or capsules are made to look the same for each
treatment, though for the current best treatment group the pill may have no active
component – it may be an inert placebo.
FIGURE 8.1
Peptic ulcers, stomach and duodenal
after each main meal. Patients randomised to the placebo control group received a “dose”
of 5 ml of flavoured liquid at the same frequency. At the end of the eight-week period,
duodenoscopy was performed again to determine whether the ulcer had completely healed.
Twenty patients were randomly assigned to the treatment group and 18 to the control
group, but three patients had to be excluded from the study because they did not comply
with the protocol – the instructions for the treatment. All three took all their medication in
the first week. Two of these were in the treatment group and one in the control group. These
patients who did not follow the protocol were removed from the trial, and no follow-up after
the eight weeks was performed on them. They had no further role in the study.
Of the 35 remaining patients, 13 of the 18 receiving Depepsen healed, while ten of the
17 receiving placebo healed, in eight weeks. Does this indicate a real superiority in healing
of Depepsen over placebo? Classifying the patients by treatment and recovery, we have
Table 8.1.
We write pD for the probability of recovery with Depepsen, and p0 for the probability of
recovery with placebo. Then pbD = 13/18 = 0.722, and pb0 = 10/17 = 0.588. A higher sample
proportion of patients recover with Depepsen – the difference in proportions in favour of
Depepsen is 0.134 – but is this true in the population, or could it be that pD = p0 , or
pD < p0 ?
In large samples the sample proportions are approximately Gaussian:

p̂D ∼ N(pD, pD(1 − pD)/nD),
p̂0 ∼ N(p0, p0(1 − p0)/n0),

and so that of p̂D − p̂0 is approximately

p̂D − p̂0 ∼ N(pD − p0, p̂D(1 − p̂D)/nD + p̂0(1 − p̂0)/n0).
For the sample values this gives the approximate 95% confidence interval for pD − p0 of

0.134 ± 1.96 √(0.722 × 0.278/18 + 0.588 × 0.412/17) = [−0.178, 0.446].
TABLE 8.1
Clinical trial of Depepsen
Depepsen Placebo Total
Healed 13 10 23
Not healed 5 7 12
Total 18 17 35
FIGURE 8.2
Posterior densities placebo (dotted) and Depepsen (solid)
Also the actual coverage depends on whether the asymptotic Gaussian distribution of the
difference in proportions is appropriate in the small samples of 18 and 17, which cannot be
determined in real or hypothetical samples.
An alternative approach which leads to the same conclusion is through hypothesis testing,
discussed in §8.5.
• We generate a large number M of random draws X [m] from the distribution f1 (x) of X,
and M independent random draws Y [m] from the distribution f2 (y) of Y .
• We form the M values Z [m] = g(X [m] , Y [m] ).
• Then the Z [m] are random draws from the probability distribution of Z.
By sorting the draws into increasing order, and assigning each one probability 1/M ,
we obtain a discrete approximation to the probability distribution of Z. If M is large, say
10,000, the cdf of Z is very smooth, and the quantiles of the distribution of Z can be closely
approximated by the sample quantiles of the M draws. For example, the lower and upper
2.5% points of the distribution of Z = X − Y can be approximated directly from the sample
cdf of the Z [m] = X [m] − Y [m] , by finding the 250th and 9,750th ordered values of the Z [m] .
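For the Depepsen trial this takes only a few lines. Here is a sketch in Python (numpy), assuming uniform Beta(1, 1) priors so that the posteriors are Beta(14, 6) for Depepsen and Beta(11, 8) for placebo; up to simulation error, the quantiles should match the values quoted below.

```python
import numpy as np

rng = np.random.default_rng(2)
M = 10_000

# Posterior draws: 13 healed of 18 -> Beta(14, 6); 10 of 17 -> Beta(11, 8)
p_D = rng.beta(14, 6, size=M)
p_0 = rng.beta(11, 8, size=M)

Z = p_D - p_0                 # M draws from the posterior of p_D - p_0

# Median and central 95% credible interval from the sample quantiles
print(np.quantile(Z, [0.025, 0.5, 0.975]))
```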
FIGURE 8.3
Cdf of 10,000 differences pD − p0
The median difference is 0.124 (close to the difference 0.134 in sample proportions), and the lower and upper 2.5% points of the Z^[m] are −0.172 and 0.410, so the central 95% credible interval for pD − p0 is [−0.172, 0.410], which includes zero, the “no difference” value.
So the difference in recovery proportions in the two populations could plausibly be as much
as 0.41 in favour of Depepsen, or as much as 0.17 in favour of placebo. The asymptotic 95%
confidence interval of [−0.178, 0.446] is similar. The trial is so small that the small difference
in sample proportions is a poor indicator of the difference in the population proportions,
which could be zero – a critical issue for recommending the Depepsen treatment.
So this trial was inconclusive, as are many small trials – the sample sizes are too small
to give any precision in the difference in response proportions. How large would the trial
need to be to find that this difference did indicate the superiority of Depepsen? The solution
of this problem is beyond the course level, but by more advanced methods we can show
that, if the same sample recovery proportions were to be attained in a trial with about 110
patients in each group, the observed difference of 0.722−0.588 would indicate the superiority
of Depepsen over placebo, because the 95% credible interval would not include zero. The
actual sample sizes are less than 1/5 of the required size – the trial was far too small to
establish a real difference.
For this reason such a clinical trial would now be regarded as unethical – patients were
being exposed to a trial of a new treatment which had little chance of being demonstrated
to be more effective than the existing best treatment, even if in fact it was more effective.
Soon after this trial, a different drug treatment for duodenal ulcers – cimetidine (trade name
Tagamet) – was found to be effective, and trials of Depepsen for the treatment of duodenal
ulcers were abandoned.
In the last ten years, these drug treatments, which were based on reducing acidity in the
stomach, have been replaced by an entirely different treatment with antibiotics. It was discov-
ered that most ulcers develop from a stomach infection by the Helicobacter pylori bacterium,
which responds rapidly to antibiotic drug treatment. See the Helicobacter Foundation site,
www.helico.com/h history.html, from which we quote:
Helicobacter pylori (H. pylori for short) was first discovered in the stomachs of
patients with gastritis and stomach ulcers nearly 25 years ago by Dr Barry J. Mar-
shall and Dr J. Robin Warren of Perth, Western Australia. At the time (1982/83)
the conventional thinking was that no bacterium can live in the human stomach
as the stomach produced extensive amounts of acid which was similar in strength
to the acid found in a car-battery. Marshall and Warren literally “re-wrote” the
text-books with reference to what causes gastritis and gastric ulcers. In recogni-
tion of their very important discovery, they were awarded the 2005 Nobel Prize for
Medicine and Physiology.
H. pylori is a corkscrew-shaped Gram-negative bacterium which is found to be
present in the stomach-lining of nearly 3 billion people around the world (i.e. half
the world’s population) and is the most common bacterial infection of man. Many of
those carrying the bacterium have little or no symptoms and are apparently well,
but all without exception have inflammation of the stomach lining, a condition
which is called “gastritis”. Gastritis is the underlying condition which eventually
causes ulcers and other digestive complaints. If a person has had an H. pylori
infection constantly for 20–30 years, it can lead to cancer of the stomach. This is
the reason that the World Health Organisation’s (WHO) International Agency for
Research into Cancer (IARC) has classified H. pylori as a “Class I Carcinogen” i.e.
in the same category as cigarette smoking is to cancer of the lung and respiratory
tract.
8.5.1 The null and alternative hypotheses, and the two models
We specify two models: the null (hypothesis) model and the alternative (hypothesis) model.
• The null model specifies that the recovery probabilities are the same in the Depepsen
and Placebo patient populations: pD = p0 = pc , unspecified.
• The alternative model specifies that the recovery probabilities are different in the De-
pepsen and Placebo patient populations: pD ̸= p0 , both unspecified.
If the three unspecified probabilities were all specified, we would have a simple comparison
of the models through Bayes’s theorem, by evaluating the likelihoods under each model. The
approach followed here, in which we find the posterior distributions of the deviances for each
model, is due originally to Dempster (1974, 1997) as extended by Aitkin (1997, 2010).
• Under the null model, the two treatment groups have the same recovery probabilities,
and the sample from both treatments gives us (from Table 8.1) 23 patients recovering out
of 35, with the common recovery probability pc . So the likelihood (omitting permutation
constants) is
L0 = pc^23 (1 − pc)^12.

• Under the alternative model, the two treatment groups have separate recovery probabilities pD and p0, and the likelihood is

L1 = pD^13 (1 − pD)^5 · p0^10 (1 − p0)^7.
Given prior probabilities π0 for the null and π1 = 1 − π0 for the alternative, the ratio of
posterior probabilities is, from Bayes’s theorem,
π0|data / π1|data = (π0 / π1) · (L0 / L1).
If we take the prior probabilities equal, then the post-data probability of the null model
(hypothesis) is
π0|data = L0 / (L0 + L1) = 1 / (1 + L1/L0).
A problem in computing the likelihoods is numerical underflow: the values become ex-
tremely small in large samples and may be below the numerical accuracy of the statistical
package, or the computer. We avoid this problem by working on the log scale. Since we
need only the ratio of the likelihoods, we obtain this by computing the difference of the
log-likelihoods. For reasons of statistical theory which will become clear later, we modify
this slightly by computing the deviance, defined as −2 log L, and compute the difference of
the deviances under the two models, then exponentiate this back to get the likelihood ratio.
However, we are not given the recovery probabilities: we have only the sample information about them, giving their posterior distributions. We can make M random draws pD^[m] and p0^[m] as before, and now make random draws of pc as well. So the posterior draws allow us to make an inference about the probability of the null model:

• We make M random draws pD^[m] and M independent random draws p0^[m], and substitute them into the deviance D1; the independent draws pc^[m] are substituted into the deviance D0.
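A sketch of this deviance simulation in Python (numpy), continuing the draws of the earlier sketch; the pooled posterior Beta(24, 13) again assumes a uniform prior.

```python
import numpy as np

rng = np.random.default_rng(3)
M = 10_000

p_D = rng.beta(14, 6, size=M)      # Depepsen posterior draws
p_0 = rng.beta(11, 8, size=M)      # placebo posterior draws
p_c = rng.beta(24, 13, size=M)     # pooled model: 23 of 35 recoveries

# Deviances D = -2 log L under each model, computed on the log scale
D0 = -2 * (23 * np.log(p_c) + 12 * np.log(1 - p_c))
D1 = -2 * (13 * np.log(p_D) + 5 * np.log(1 - p_D)
           + 10 * np.log(p_0) + 7 * np.log(1 - p_0))

# Likelihood ratio L0/L1 and posterior probability of the null model
LR = np.exp(-(D0 - D1) / 2)
pi0 = LR / (1 + LR)
print(np.quantile(D0 - D1, [0.025, 0.5, 0.975]))
print(np.quantile(pi0, [0.025, 0.5, 0.975]))
```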
Figure 8.4 shows the posterior distribution of the deviances D0 (solid curve) and D1
(dotted curve). The deviance curves cross at about the 40% point of the cdfs. A smaller
deviance means a higher likelihood: the null model has higher likelihood in 60% of the
draws, the alternative model has higher likelihood in the other 40% of the draws. So the null
model is better supported than the alternative model, but not by much.
Figure 8.5 shows the posterior distribution of the deviance difference D0 −D1 . The median
is very close to zero (0.085) and the 95% central credible interval for the true deviance
difference is [−5.584, 4.13], which includes the zero point of no real difference. Figure 8.6
shows the posterior distribution of the null model probability. The median is 0.489 (close to
FIGURE 8.4
Cdfs of 10,000 deviances D0 (solid) and D1 (dotted)
FIGURE 8.5
Cdf of 10,000 deviance differences D0 − D1
FIGURE 8.6
Cdf of 10,000 draws of π0|data
the prior “indifference” value of 0.5) and the central 95% credible interval is [0.112, 0.940].
This is wide, and does not point clearly to either model. We come to the same conclusion as
from the posterior distribution of pD − p0 : the patient samples are too small to give strong
evidence either way. It certainly could not be claimed that Depepsen had been shown to be
more effective than the then-current best treatment.
A further measure of treatment difference, long used in medical applications, is the odds ratio

θ = [p0/(1 − p0)] / [pD/(1 − pD)],
or its log, the log-odds ratio. The term “odds”, which comes from gambling, is used for
the ratio p/(1 − p), the “odds on recovery” under Placebo compared with the odds under
Depepsen. To complicate matters further, many medical studies were interested in the risk
ratio or relative risk, here pD /p0 , or its reciprocal.
A virtue of the Bayesian analysis is that all these different measures of effect can be
analysed in exactly the same way. We use the same draws of p0 and pD to substitute in the
appropriate function of the parameters.
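A sketch in Python (numpy), using fresh draws from the same Beta posteriors as in the earlier sketches; here the risk ratio is computed as p0/pD, the reciprocal form whose quantiles are quoted below.

```python
import numpy as np

rng = np.random.default_rng(4)
p_D = rng.beta(14, 6, size=10_000)
p_0 = rng.beta(11, 8, size=10_000)

odds_ratio = (p_0 / (1 - p_0)) / (p_D / (1 - p_D))   # theta, as defined above
risk_ratio = p_0 / p_D                               # reciprocal of p_D / p_0

for name, x in [("odds ratio", odds_ratio), ("risk ratio", risk_ratio)]:
    lo, med, hi = np.quantile(x, [0.025, 0.5, 0.975])
    print(f"{name}: median {med:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```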
For the odds ratio, shown in Figure 8.7, the median is 0.582 and the 95% central credible
interval is [0.144, 2.189]. This is very wide and includes 1, the “no difference” value. For the
risk ratio, shown in Figure 8.8, the median is 0.830 and the 95% central credible interval
FIGURE 8.7
Cdf of 10,000 draws of the odds ratio
FIGURE 8.8
Cdf of 10,000 draws of the risk ratio
is [0.402, 1.323]. The conclusions are the same, not surprisingly: that there is not enough
evidence to come to a clear conclusion about which treatment is better.
a function of the ratio of maximised likelihoods under each hypothesis. The test statistic has an asymptotic χ²(ν) distribution – a form of gamma distribution – under the null hypothesis, with degrees of freedom equal to the difference ν in the number of parameters under the two hypotheses. (The χ² distribution is discussed in Chapter 15 on the Gaussian distribution.) For computational reasons, we compute the frequentist deviances Dmin = −2 log Lmax to avoid underflow. The LRTS is then D0min − D1min.
The formal testing decision process is to reject the null hypothesis in favour of the alternative at test size α if the LRTS value exceeds the 1 − α quantile of χ²(ν), and to not reject it otherwise. In
the latter case the null hypothesis is not accepted, but maintained as tenable. A less formal
“descriptive” approach is to find the upper-tail probability of the LRTS and quote it as
the p-value or observed significance level of the test – the level at which the null hypothesis
would be just rejected. This allows the analyst to decide on the observed significance level –
the p-value – which would constitute compelling evidence. We describe this process for the
Depepsen trial.
• Under the null model, the two treatment groups have the same recovery probabilities, and the sample from both treatments gives us (from Table 8.1) 23 patients recovering out of 35, with the common recovery probability p̄ = 23/35. So the maximised likelihood is L̂0 = p̄^23 (1 − p̄)^12.
The difference D0min − D1min = 0.69. This value is close to the 40th quantile of χ²(1): no evidence at all against the null hypothesis. The tail area probability from χ²(1) beyond the
observed value is the p-value, 0.594. It is often incorrectly treated as the probability of the
null hypothesis, as in the Sally Clark case. The frequentist theory does not have probabilities
of the hypotheses, only the probabilities of events under the two hypotheses.
ITT analysis requires participants to be included even if they did not fully adhere
to the protocol. Participants who strayed from the protocol (for instance, by not
adhering to the prescribed intervention, or by being withdrawn from active treat-
ment) should still be kept in the analysis. An extreme variation of this is that the
participants who receive the treatment from the group they were not allocated to,
should be kept in their original group for the analysis.
...
The rationale for this approach is that, in the first instance, we want to estimate
the effects of allocating an intervention in practice, not the effects in the subgroup
of the participants who adhere to it.
In comparison, in a per-protocol analysis, only patients who complete the entire
clinical trial according to the protocol are counted towards the final results.
(Emphasis added)
Our added emphasis draws attention to the aim of the study: for most studies it is to
assess the effects of the different treatments, not the effect of being included in a clinical
trial, whatever the treatment assignment. The Depepsen analysis described earlier is a per-
protocol analysis. An intention-to-treat analysis could not be done, as the patients who did
not follow the protocol were removed from the trial and no follow-up after the eight weeks
was performed on them.
The asymptotic confidence interval procedure described earlier for the treatment differ-
ence can also be used as a formal frequentist test: if the 95% confidence interval does not
contain the zero value of the difference in response probabilities, then we can reject the null
hypothesis of no difference in these probabilities. If the confidence interval does contain the
null value then we cannot reject the null hypothesis (at significance level 5%, that is, 1 minus the confidence coverage).
Fisher saw that this likelihood could be factored into the product of a marginal and a
conditional likelihood, which could simplify the inference. We consider the distribution of
the random variable R = RD + R0 , the marginal total of recoveries under both treatments.
This is
Pr[R = r] = Σ_{u=u1..u2} Pr[RD = u] Pr[R0 = r − u]
          = Σ_{u=u1..u2} C(nD, u) C(n0, r − u) (θD/θ0)^u · θ0^r (1 − pD)^nD (1 − p0)^n0,

where θD = pD/(1 − pD) and θ0 = p0/(1 − p0) are the recovery odds under the two treatments, and the sum runs over the possible values u1 = max(0, r − n0) to u2 = min(nD, r). The conditional distribution of RD given the total is then

Pr[RD = rD | R = RD + R0] = C(nD, rD) C(n0, r0) (θD/θ0)^rD / Σ_{u=u1..u2} C(nD, u) C(n0, r − u) (θD/θ0)^u.
The conditional distribution depends only on the odds ratio, the relative odds on recovery
under the two treatments.
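The conditional distribution can be evaluated directly. A minimal Python sketch (scipy.special.comb), with psi denoting the odds ratio θD/θ0, applied to the Depepsen table:

```python
import numpy as np
from scipy.special import comb

def conditional_pmf(r_D, n_D, n_0, r, psi):
    """Pr[R_D = r_D | R_D + R_0 = r] for odds ratio psi = theta_D / theta_0."""
    u = np.arange(max(0, r - n_0), min(n_D, r) + 1)
    denom = (comb(n_D, u) * comb(n_0, r - u) * psi**u).sum()
    return comb(n_D, r_D) * comb(n_0, r - r_D) * psi**r_D / denom

# Depepsen table: n_D = 18, n_0 = 17, r = 23 recoveries, r_D = 13 observed;
# psi = 1 is the "no difference" value
print(conditional_pmf(13, 18, 17, 23, psi=1.0))
```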
Fisher argued that the marginal total of successes was an ancillary statistic – it gave
information about the variability of the odds ratio, but not about its location. He then argued
that the inference about the odds ratio should be based on this conditional distribution,
which meant, in terms of hypothetical repeated sampling, that this had to be restricted to
hypothetical tables which had the same margins as the observed table. This gave a different
interpretation of the observed difference from the unconditional large-sample result. In the
following ECMO example we show how dramatic these differences can be in very small
samples.
Barnard (1945) gave a different test, also “exact”, in which only the designed numbers
of subjects in each treatment were fixed. This was computationally much more complex than Fisher's test. He showed in a small example that this test had greater power than
Fisher’s test. Fisher (1945) was furious and attacked Barnard, who retracted his claim in a
later paper, Barnard (1949). Crossing Fisher could be professionally damaging. The detail
of the different hypergeometric probabilities in the two approaches is given by Mehta and
Senchaudhuri (2003) in a Web paper.
Arguments among frequentists over Fisher’s proposal continue to this day. Plackett (1977)
showed that the marginal total was not ancillary except in the limit as the treatment sample
sizes tend to ∞. This was clear from the beginning, since the marginal distribution of the
total depended on the additional term θ0^r (1 − pD)^nD (1 − p0)^n0, which was not a function of
the odds ratio, but was nevertheless informative about the two probabilities. This term was
eliminated from the conditional distribution. In large samples the conditional test gave the
same results as the unconditional test, but in the small-sample limit it lost information. But
this was just where Fisher said it should be used!
Fisher’s “exact” test has been shown in simulations to be less sensitive than the unconditional test using unrestricted responses in the repeated sampling. Simulation studies are exact applications
of the repeated sampling principle: when repeated samples are actually drawn from a known
population, and are used to assess the coverage of confidence intervals, or the performance
of test procedures.
Fisher’s supporters claim that this evaluation misses exactly the point – that the sim-
ulation evaluations should have been with respect to the restricted samples with the same
marginal number recovering. These two approaches are incommensurate, being based on dif-
ferent repeated-sampling arguments. But neither gives an analysis relevant to the observed
sample – they refer to different ensembles of simulated tables.
What is most remarkable about the “exact” test is that it forced the analyst to frame
the scientific questions in terms of odds ratios, rather than other measures like risk ratios
or risk differences. This conditional analysis requirement was even implemented, for a long
period, in protocols for the analysis of medical research studies under some US government
grants.
possibility of lung damage from the high-pressure treatment. In the COVID-19 pandemic
ECMO was used for patients with severe breathing difficulties who did not improve with the
ventilator. See Lyons (2020), www.abc.net.au/news/health/2020-07-22/coronavirus-ecmo-explainer/12472498. The first ECMO study was reported in Bartlett et al. (1985), and statistical and ethical issues in this trial, and in a subsequent trial, were discussed in Ware (1989)
and Begg (1990). The trial used adaptive randomisation of babies to the treatments, with a
success (survival) under a treatment increasing the probability of randomisation of the next
baby to that treatment. This was intended to minimise the number of babies randomised to
the less effective treatment.
To understand the diversity of the conclusions, we need to understand the design of the study.
The then-current best treatment had a very high death rate, around 80%. Non-randomised
studies of ECMO had shown very low death rates. To compare the treatments required
ethical approval of a treatment assignment design which would minimise the number of
babies assigned to the less-successful treatment, whichever that was, and would terminate
the trial as soon as the evidence for the best treatment was strong.
The design adopted was a “play the winner” adaptive randomisation using an urn con-
taining initially M white balls (for CMT) and M black balls (for ECMO). The study used
M = 1.
• The urn was shaken well and a ball drawn from it. The ball was black, and the baby was
assigned to ECMO. Following treatment, this baby survived.
• Before the next assignment, one black ball was added to the urn, which was shaken well
and a ball drawn from it. The ball was white, and the baby was assigned to CMT. This
baby died.
• Before the next assignment, another black ball was added to the urn, which was shaken
well and a ball drawn from it. The ball was black, and the baby was assigned to ECMO.
This baby survived.
This process and outcome continued, until nine babies had been randomly assigned (with
increasing probabilities) to ECMO, and all had survived.
At this point the stopping rule had been reached, which was that when the difference in
the number of babies recovering under the two treatments reached 9, the trial would stop.
However the trial did not stop, but the randomisation was stopped, and two more babies
were non-randomly (with probability 1) assigned to ECMO, and survived. (At this point, if
another randomisation had occurred, the probability that ECMO would have been chosen
was 11/12 = 0.917. This is very close to non-random assignment.)
Frequentist difficulties with the study are severe, because of the play-the-winner design
and the failure to follow the stopping rule. Following Cox’s discussion, what hypothetical set
of replications of the study should be used to assess a p-value? Should the replications follow
the stopping rule, or be allowed to add two non-randomised babies? Should the sequence of
outcomes of the urn draws be fixed (conditioned on the observed sequence), or should it be
random? Different choices for these and other possibilities led to the wide range of p-values,
from 0.001 to 0.62, proposed by different analysts. We do not discuss these; they are given
at great length in the two references.
FIGURE 8.9
Cdfs of 10,000 draws of the risk differences ECMO1 − CMT, for 11 (solid curve) and nine (dotted curve) ECMO survivals
(An argument arose about whether the last two babies should have been included in the
analysis, as they were non-randomly assigned. We give a second analysis next which excludes
them.)
FIGURE 8.10
Cdfs of 10,000 draws of the risk differences ECMO − CMT for ECMO1 (solid curve), ECMO2 rand (dotted), ECMO2 combined (dashed)
TABLE 8.4
ECMO trials comparisons
Trial median 95% credible interval
1 0.627 [0.056, 0.847]
2 rand 0.329 [0.005, 0.631]
2 comb 0.349 [0.090, 0.635]
of CMT babies has greatly increased the precision of the first analysis, though it has not
much changed the evidence against the “no difference” hypothesis. The medians and 95%
credible intervals for the difference pE − pC are given in Table 8.4.
Combining the non-randomised ECMO babies does not much change the interval, except
near zero. The survival rate under ECMO is still much better estimated than that under
CMT. We do not discuss these studies further.
9
Data visualisation
We divert from probability modelling for the moment to consider the general issue of how
to visualise the data structures and models we want to use. We illustrate the initial simple
uses of visualisation, and will return in later chapters to extend these ideas. We begin with
the single sample of field telephone lifetimes in hours, reproduced from §2.1.
Tables of data do not give much insight into structure or relationships. It is routine in
“descriptive” statistics (describing features of the sample data) to compute the mean and
variance of the sample data, as summaries of location and variation of the sample data.
These summaries do not lead to a probability model specification. Statisticians developed
visualisation tools in the late 19th century to assist the interpretation of their data.
FIGURE 9.1
Maximum resolution histogram of family income at birth, StatLab population
FIGURE 9.2
Counts of phone lifetimes
TABLE 9.1
Lifetimes and numbers of radio transceivers
t n t n t n t n t n t n t n
8 1 16 4 32 2 40 4 56 3 60 1 64 1
72 5 80 4 96 2 104 1 108 1 112 2 114 1
120 1 128 1 136 1 152 3 156 1 160 1 168 5
176 1 184 3 194 1 208 2 216 1 224 4 232 1
240 1 246 1 256 1 264 2 272 1 280 1 288 1
304 1 308 1 328 2 340 1 352 1 358 1 360 1
384 1 392 1 400 1 424 1 438 1 448 1 464 1
480 1 536 1 552 1 576 1 608 1 656 1 716 1
TABLE 9.2
Lifetimes t and cumulative numbers n of radio transceivers
t n t n t n t n t n t n t n
8 1 16 5 32 7 40 11 56 14 60 15 64 16
72 21 80 25 96 27 104 28 108 29 112 31 114 32
120 33 128 34 136 35 152 38 156 39 160 40 168 45
176 46 184 49 194 50 208 52 216 53 224 57 232 58
240 59 246 60 256 61 264 63 272 64 280 65 288 66
304 67 308 68 328 70 340 71 352 72 358 73 360 74
384 75 392 76 400 77 424 78 438 79 448 80 464 81
480 82 536 83 552 84 576 85 608 86 656 87 716 88
disadvantages at the extremes of the data range where bin counts are small. Variable width
bins are sometimes used.
differencing.
FIGURE 9.3
Phone lifetimes empirical cdf
FIGURE 9.4
Phone empirical survivor function
FIGURE 9.5
Phone empirical (circles) and exponential (curve) survivor functions
FIGURE 9.6
Empirical (circles) and ML fitted (solid) log exponential integrated hazard
to sample and population proportions. This is usually done by assessing the fit of the as-
sumed model to the data, if necessary through residuals from a fitted model; we discuss this
in detail in later chapters.
We use the phone lifetimes to illustrate this. The empirical cumulative distribution func-
tion of the lifetimes has a very smooth appearance, with small variations. We want to ap-
proximate the ecdf by a smooth function – the cdf of a continuous random variable. How do
we identify the appropriate random variable?
We use the practical background and technological context to provide a transformation
of the ecdf to suggest the choice. In studies of lifetimes of electrical and mechanical devices,
a common function of interest is the survivor function.
The survivor function for a continuous variable Y is S(y) = 1−F (y), the proportion surviving
– still functioning – at time y. The empirical survivor function at yi is 1 − qi where qi is the
empirical cdf. Figure 9.4 shows the empirical survivor function for the phones.
The appearance of an exponential decline is striking, and suggests that a suitable model
for the survivor function could be S(y) = e−y/λ for some value λ. Figure 9.5 shows the
empirical survivor function, with superimposed the exponential survivor function with the
value of λ taken as 210.8, the mean survival time.
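A sketch in Python (numpy) of the construction behind Figures 9.4 and 9.5; the lifetimes here are a hypothetical short list, standing in for the 88 values of Table 9.1.

```python
import numpy as np

# Hypothetical lifetimes in hours (stand-ins for the Table 9.1 data)
y = np.sort(np.array([8.0, 72, 120, 176, 240, 304, 384, 480]))
n = len(y)

# Empirical survivor function 1 - i/n at each ordered lifetime y_(i)
S_emp = 1 - np.arange(1, n + 1) / n

# Exponential survivor function S(y) = exp(-y/lambda), lambda = sample mean
lam = y.mean()
S_exp = np.exp(-y / lam)

for yi, se, sm in zip(y, S_emp, S_exp):
    print(f"{yi:6.0f}  empirical {se:.3f}  exponential {sm:.3f}")
```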
The exponential curve agrees fairly well with the observed data, though it appears to un-
dershoot and then overshoot. Could a different value of λ improve the agreement? How do
we decide on the value of λ? Figure 9.6 shows the same function on transformed scales.
We continue this example in the next chapter.
10
Statistical inference II – the continuous
exponential, Gaussian and uniform distributions
Here we have used an approximation based on the Mean Value Theorem: that

F(b) − F(a) = ∫_a^b f(x) dx ≐ (b − a) f((1 − ϕ)a + ϕb)

for some 0 < ϕ < 1.
seen, this makes no difference to our conclusions about the model parameter, though the
approximation may be poor if the time scale is very coarse. We are free then to approximate
the likelihood as the product of continuous density ordinates, omitting the constant arising
from the mean value approximation.
We note here that the mean µ of the random variable defined in this way is

µ = ∫_0^∞ y f(y | λ) dy = ∫_0^∞ (y/λ) exp(−y/λ) dy = λ ∫_0^∞ z exp(−z) dz = Γ(2) λ = λ.
We will also find useful the hazard function h(y), which is the instantaneous probability of
failure at time y, given survival up to this time. This is given by
h(y) = f (y | λ)/S(y | λ) = 1/λ = θ,
a constant. We note for later use the integrated hazard function H(y), given by

H(y) = ∫_0^y h(s) ds = y/λ = θy,
which means that S(y) = exp[−H(y)], and H(y) = − log S(y). The log integrated hazard has a particularly simple form:

log H(y) = log θ + log y.
This function is shown with the fitted exponential distribution in Figure 9.6.
In later chapters we will see the value of this representation for model assessment. The
exponential distribution has the characteristic property of constant hazard. For the phones,
this means that they do not age or wear out with use; a phone which has been in use for 100
hours is just as good (in its remaining lifetime distribution) as a new phone. This is unlikely
to be true in the dusty conditions of their use.
FIGURE 10.1
Exponential relative likelihood for λ
where T = Σ_{i=1..m} ni yi and ȳ = T/n. The total lifetime T of all the phones is a sufficient statistic for the mean λ. Equivalently, the sample mean lifetime ȳ is a sufficient statistic. Because the likelihood values are extremely small, in graphing or computing the likelihood it is convenient to scale the likelihood by its maximum (achieved at λ̂ = ȳ):

L(λ)/L(ȳ) = (λ̂/λ)^n exp[−n(λ̂/λ − 1)].
This scaled likelihood is called the relative likelihood, relative to the maximum. It has a
maximum value of 1.0. The relative likelihood is shown over the reduced range of appreciable
likelihood in Figure 10.1.
The likelihood is right-skewed, with a longer right-hand than left-hand tail.
The MLE of λ is λ̂ = ȳ = 210.8, and its estimated variance is Var̂[λ̂] = λ̂²/n. The standard error of ȳ is ȳ/√n = 210.8/√88 = 22.47. Then an (asymptotic) 95% confidence interval for the true (population) value of λ is 210.8 ± 2 × 22.47 = [165.9, 255.7], centred at the MLE.
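A sketch in Python (numpy) of the relative likelihood and the asymptotic interval, using only the summary statistics n = 88 and ȳ = 210.8 from the text.

```python
import numpy as np

n, ybar = 88, 210.8
lam_hat = ybar                       # MLE of lambda
se = lam_hat / np.sqrt(n)            # standard error ybar / sqrt(n) = 22.47

print(f"MLE {lam_hat:.1f}, SE {se:.2f}, 95% CI "
      f"[{lam_hat - 2*se:.1f}, {lam_hat + 2*se:.1f}]")

# Relative likelihood L(lambda)/L(ybar) = (lam_hat/lam)^n exp[-n(lam_hat/lam - 1)]
lam = np.linspace(150, 350, 201)
rel = (lam_hat / lam)**n * np.exp(-n * (lam_hat / lam - 1))
print(lam[rel > 0.1][[0, -1]])       # rough range of appreciable likelihood
```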
We note for later discussion that the third derivative of the log-likelihood, at the MLE, is 4n/ȳ³. This must be positive, so the likelihood in λ is inherently right-skewed: the first two derivatives do not provide complete information about the parameter. The scale-free measure of skewness is defined by (third derivative)/(−second derivative)^(3/2) at the MLE, which is 4/√n.
However, if we work instead with the hazard parametrisation θ, we have
FIGURE 10.2
Exponential (solid) and Gaussian (dotted) relative likelihoods for λ, log scale
fourth derivative. We do not discuss these transformations further, as they are not useful in
regression problems.
…inverting tests (like the likelihood ratio test) rather than relying on the asymptotic approximation. We do not give details for these, as the Bayesian analysis does not require them.
TABLE 10.1
Median and 95% credible intervals
for λ with prior parameter a
a median 95% interval
−1 214.0 [174.7, 266.2]
0 211.6 [172.9, 262.8]
1 209.2 [171.1, 259.5]
As a increases the median decreases and the credible interval shrinks: the prior parameter
increases the effective sample size without affecting the mean. The same approach can be
used for quantiles of the distribution. The 80th quantile is of interest because we want to
know when this fraction of the lifetime has been reached in the field. The 80th quantile,
which we denote by y80, satisfies the equation

S(y80 | λ) = 0.20
exp(−y80/λ) = 0.20
−y80/λ = log(0.2)
y80 = −λ log(0.2) = 1.61λ.
So the median and 95% credible interval for the 80th quantile follow by simply scaling up
those for λ (with a = −1) by the multiplier 1.61, giving median 344.4, credible interval
[281.2, 428.4].
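A sketch of this computation in Python (numpy). It assumes the a = −1 posterior corresponds to θ = 1/λ having a Gamma(n + a, rate T) distribution, a convention that reproduces the Table 10.1 intervals for a = −1, 0, 1.

```python
import numpy as np

rng = np.random.default_rng(5)
n, ybar, a = 88, 210.8, -1
T = n * ybar                        # total lifetime, the sufficient statistic

# Draws of theta = 1/lambda from Gamma(n + a, rate T), then invert
theta = rng.gamma(shape=n + a, scale=1 / T, size=10_000)
lam = 1 / theta

y80 = -np.log(0.2) * lam            # 80th-quantile draws: 1.61 * lambda
print(np.quantile(lam, [0.025, 0.5, 0.975]))   # ~ [174.7, 214.0, 266.2]
print(np.quantile(y80, [0.025, 0.5, 0.975]))   # ~ [281.2, 344.4, 428.4]
```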
The prior adds α to the sample size and β to the sum T of the observations. No new principle
is involved, and we do not discuss these further.
f(y | µ) = [1/(√(2π) σ)] exp[−(y − µ)²/(2σ²)],
where µ is the mean of the distribution and σ is a scale parameter (assumed known in this
chapter), the standard deviation.
FIGURE 10.3
Boy birthweight cumulative proportions
FIGURE 10.4
Boy birthweight cumulative proportions (circles), Gaussian model (solid) and 95% credible region (red)
where c is a known constant, involving the data and σ but not µ. The likelihood depends
on only one function of the data: the sample mean ȳ, which is a sufficient statistic for µ:
it is the only data function (apart from the known sample size n) we need to describe the
likelihood.
Another function of the data, the sample sum of squares about the sample mean, Σ_{i=1..n} (yi − ȳ)², also appears in the likelihood, but it is a known constant, as is σ, the scale parameter.
mation about µ, but it may give information about the model, in this case the specified value
of σ. We will see in Chapter 11 on two-parameter models that if the Gaussian distribution
model is correct, then the sample sum of squares divided by σ² has a χ² distribution with (n − 1) degrees of freedom. This would allow us to check whether the specification of the value of σ² is consistent with the sample data.
It is striking that the likelihood is identical (apart from ignorable constants) to that
from a single Gaussianly distributed “observation” ȳ with mean µ and variance σ²/n. This
illustrates a remarkable property of the likelihood, pointed out by Fisher, that the likelihood
defines the distribution of the sufficient statistics (if these exist), in this case the sample
mean ȳ, which is Gaussianly distributed N(µ, σ²/n). Although the distribution of the sample
mean is commonly described as its repeated-sampling distribution, there is no invocation of
the Central Limit Theorem or the repeated sampling principle needed to justify it: the
distribution follows directly from the likelihood.
He may have thought this obvious: he regarded the result as an indication that the
Bayesian analysis was unnecessary: he had the same result without the need for a prior
distribution, uniform or not. Fisher extended the idea to the two-parameter Gaussian dis-
tribution, but it could not be extended to discrete distributions, and the aforementioned
simple result depended on the existence of pivotal functions (discussed in §10.12).
Neyman and Pearson (1933) based their development of confidence intervals on the re-
peated sampling principle. The principle does not state how procedures should be evaluated,
nor which behaviours, but common practice led to the use of bias and sampling variance or
mean square error in simulation studies, as measures of performance of parameter estimators
– estimates computed from actual or hypothetical repeated samples. For the performance of
intervals, their length and coverage in repeated sampling were the general criteria.
Before the likelihood or computer experiments were available, it was impractical to assess
the precision of intervals, and statistical inference therefore relied on asymptotic theory. The
Central Limit Theorem (CLT) played an important role: it could be assumed that sums (or
means) of random variables would be Gaussianly distributed in sufficiently large samples,
and the CLT could be invoked to justify the asymptotic properties of the MLE and the
confidence interval. It was neither necessary nor possible to know whether the asymptotic
result applied in the observed sample: it was sufficient that it existed.
The likelihood expressed this more precisely and strongly. If the MLE was internal to
the parameter space (not on a boundary) then as the sample size → ∞, the likelihood would
approach the Gaussian form. This meant that the log-likelihood would be quadratic in the
parameter, and so the frequentist analysis could be based on the first two derivatives of the
log-likelihood function ℓ(µ):
a much longer set of hypothetical random samples from which we filter out those which do
not have log-quadratic likelihoods. These constructions of hypothetical samples may seem
absurd, but follow exactly the argument used in the 2×2 contingency table in Chapter 7. As
Cox (2006, p. 198) put it,
One of the peculiarities of this procedure (apart from its hypotheticality) is that it gives no
information about whether the confidence interval covers the true value. It says only that
the probability of this coverage in the hypothetical samples is 95%. A long-run frequency
statement has to have a long run for its use, even if it is hypothetical. Students often confuse
this statement with the Bayesian posterior probability statement, which does refer to the
actual sample: it is the probability that the true value lies in the credible interval.
For the 648 boy babies, the sample mean birthweight is 7.65 pounds. With σ = 1.12 pounds, the 95% confidence interval for the Child Development population mean is given by 7.65 ± 1.96 · 1.12/√648 = 7.65 ± 0.086 = [7.56, 7.74] pounds.
It is rare in practice for us to know the standard deviation but not the mean. In general
both parameters are unknown. We deal with this case in Chapter 11.
Bayes and Laplace used flat – constant – priors to represent ignorance, if there is no prior
evidence to support any possible parameter value over any other possible value. If the prior
is constant, then the posterior distribution is the likelihood function, scaled to integrate to 1.
An immediate problem (frequently presented as a dismissive objection) arises with the
Gaussian mean µ. Since this can conceptually take on any value in an infinite range, a
“proper” uniform distribution for µ (one which integrates to 1 over the range of µ) cannot
be defined over an infinite range, since the integral would be infinite as well. This criticism
can be answered in two ways:
• No matter how large µ may be, we can always construct a vastly long but finite interval
for it.
• We can define the prior over a finite interval and then let the interval endpoints tend to
±∞.
An additional argument for the uniform prior comes from finite population considerations
(and all real populations are finite). For the proportion p of zero values of a binary variable
in a population of size N , the possible values of the proportion are I/(N + 1), where I is
an integer in the range 0 to N . Without any prior information, all the values I/(N + 1) are
equally probable – so the non-informative prior is discrete uniform on these values. As N
increases, the uniform discrete prior approaches the uniform continuous prior on (0,1). This
argument extends directly to the mean of any variable y measured with finite measurement
precision δ. Its possible values are on a finite grid, and the possible values of the population
mean µ are also on a finite grid of spacing δ/N .
For the Gaussian mean, using the second option above, with a uniform prior for µ over
a large range [a, b], the posterior is the scaled likelihood over the range [a, b], which must
integrate to 1 over this range after scaling. The posterior of µ is
π(µ | ȳ) = c · f(µ | ȳ) ∝ exp[−n(µ − ȳ)²/(2σ²)], for µ ∈ [a, b],

where f(µ | ȳ) is the N(ȳ, σ²/n) density. The constant c is determined by integration over [a, b]. We have

1 = ∫_a^b c · f(µ | ȳ) dµ = c · [Φ(√n (b − ȳ)/σ) − Φ(√n (a − ȳ)/σ)],

c = 1 / [Φ(√n (b − ȳ)/σ) − Φ(√n (a − ȳ)/σ)],
where Φ is the standard Gaussian cdf. As b → ∞ and a → −∞, c → 1. So we obtain the
same result as using the improper prior π(µ) = 1 on µ ∈ (−∞, ∞).
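A quick numerical check of this limit, in Python with scipy, using the boy-birthweight values n = 648, ȳ = 7.65 and σ = 1.12 from earlier in this chapter; the intervals [a, b] are arbitrary illustrations.

```python
import numpy as np
from scipy.stats import norm

n, sigma, ybar = 648, 1.12, 7.65
z = np.sqrt(n) / sigma

# c = 1 / [Phi(sqrt(n)(b - ybar)/sigma) - Phi(sqrt(n)(a - ybar)/sigma)]
for a, b in [(7.5, 7.8), (7.0, 8.3), (0.0, 15.0)]:
    c = 1 / (norm.cdf(z * (b - ybar)) - norm.cdf(z * (a - ybar)))
    print(f"[a, b] = [{a}, {b}]: c = {c:.6f}")   # c -> 1 as [a, b] widens
```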
This is different from making a decision about the “best” model and taking some con-
sequent action. Decisions and actions have to consider many factors beyond the empirical
data. In this book we do not consider these factors, which are part of Statistical Decision
Theory, in which decisions and actions have consequent losses, determined by the state of
nature, and the object is usually to minimise the (expected) loss.
Our aim is narrower: to state the relative evidence for the competing hypotheses or mod-
els, and where possible and appropriate to average over the models to provide a composite
conclusion. This will be developed in subsequent chapters.
In the one-parameter Gaussian model, there are two possible types of competing hy-
potheses:
• comparing one specified value µ1 with another specified value µ2 ;
• comparing a specified “null” value µ = µ0 with an unspecified “alternative” value of
µ ̸= µ0 .
We take as an example for the first case a sample of n = 25 from a Gaussian distribution
with variance σ² = 1 and sample mean ȳ = 0.4. Model 1 has µ1 = 0, model 2 has µ2 = 1.
For the second case the sample data are the same, and model 1 has µ0 = 0, but model 2 has
µ unspecified.
10.10.2 µ0 vs µ ≠ µ0
We restrict this discussion to the use of the p-value, widespread in all fields of application.
The argument with the credible interval has an exact parallel in the confidence interval. Does the confidence interval cover the null hypothesis value? The 95% central confidence interval for µ is 0.4 ± 1.96 · 1/√25 = [0.008, 0.792]. This just excludes 0: the zero value is on the boundary of the 95.44% confidence interval, and the parameter region beyond this interval has confidence 4.56%.
However, in the frequentist p-value hypothesis testing framework, we express this differently. Under the null hypothesis, the probability of observing a sample mean of 0.4 (a “Z”-value of √25 (ȳ − µ0)/1 = 2) or more is 0.0228 – 2.28%. This is the one-sided p-value of the sample outcome. The two-sided p-value of 0.0456 corresponds to including in the “more extreme values” region values of ȳ < −0.4. This seems unreasonable, but the whole process is based on values “more extreme than that observed” – these can be defined in different ways, giving different p-values. A common calibration of the p-value is that 0.05 is mild evidence, 0.01 is strong evidence and 0.001 is very strong evidence against the null hypothesis.
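The p-value computation is elementary; a sketch in Python with scipy, using the values above:

```python
import numpy as np
from scipy.stats import norm

n, sigma, ybar, mu0 = 25, 1.0, 0.4, 0.0
z = np.sqrt(n) * (ybar - mu0) / sigma   # observed "Z"-value: 2.0

p_one = 1 - norm.cdf(z)                 # one-sided p-value: 0.0228
p_two = 2 * p_one                       # two-sided p-value: 0.0456
print(z, p_one, p_two)
```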
The logic of this expression of evidence is puzzling, as we noted before. What we observed
is a sample mean of 0.4, but the evidence we quote is for a different event: that the sample
mean was greater than or equal to 0.4. The simple reason for this is that in the frequentist
framework we cannot give non-zero probability to single values of a continuous random
variable – only intervals of values can have non-zero probability. Students are sometimes
told, confusingly:
Well, we would certainly reject the hypothesis even more definitely if the sample
mean was greater than 0.4, so we include this in the interval probability.
Given the data y, the likelihood ratio LR converts the prior probabilities to posterior probabilities through Bayes's theorem:

π1|y / π2|y = LR · π1/π2.

If we are initially indifferent between the two models, the posterior odds are equal to the likelihood ratio, and π1|y = LR/(1 + LR), π2|y = 1/(1 + LR). In general

π1|y = (LR · π1/π2) / (1 + LR · π1/π2) = π1 LR / (π2 + π1 LR).
An important question is how to calibrate the likelihood ratio and posterior probability.
There is no unique scale for the interpretation of the likelihood ratio, but values of 1, 3,
10, 30 and 100 are in common use for none, very weak, mild, strong and very strong data
evidence for the numerator model compared to the denominator model. If the models had
equal prior probabilities, the corresponding numerator posterior probabilities would be (to
3 dp) 0.5, 0.75, 0.909, 0.968, 0.990.
For the example, we have µ̄ = (µ1 + µ2)/2 = 0.5, and

L(µ1)/L(µ2) = exp[−(n/σ²)(µ2 − µ1)(ȳ − µ̄)] = exp[−(25/1)(1)(−0.1)] = exp(2.5) = 12.18.
By the calibration above, this would be mild sample evidence in favour of M1 , not surprising
since the sample mean is closer to 0 than to 1.
10.11.2 µ0 vs µ ≠ µ0
In the second case, the null model is based on the current best theory, which is to be retained
unless the data give convincing evidence against it. The example has µ1 = 0 and σ = 1, with
n = 25 and the sample mean ȳ = 0.4. Given the data y, which model is better supported?
Under the null hypothesis we have the likelihood L(µ0 ), but under the alternative µ is
unknown. However, we have its posterior distribution (with the flat prior on µ) given by
µ | ȳ ∼ N(ȳ, σ²/n). We can use this in two different but complementary ways.
The alternative hypothesis (that µ ̸= 0) is better-supported than the null at this value of µ,
but not very strongly: the inverse of the ratio is 1/0.1353 = 7.39. At other values of µ the
support for the alternative is even weaker, so the likelihood ratio is greater.
Dempster (1974, 1997) gave a straightforward approach to this calibration. We are unable
to give a single number likelihood ratio, but we can find its posterior distribution, from that
of µ. Although we can give in this simple case an analytic solution (as did Dempster), we
will instead use a simulation approach, because this is very general and simple.
We make a large number M of random draws µ[m] of µ from its N(ȳ, σ²/n) posterior
distribution, and substitute them into the denominator of the likelihood ratio, to give M
random draws LR[m] = L(µ0 )/L(µ[m] ) from the posterior distribution of the likelihood ratio.
The cdf of M = 10, 000 draws is shown in Figure 10.5.
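A sketch of this simulation in Python (numpy), using the likelihood ratio in the form exp[−(n/2σ²){(ȳ − µ0)² − (ȳ − µ)²}]:

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma, ybar, mu0 = 25, 1.0, 0.4, 0.0
M = 10_000

# Draws of mu from its posterior N(ybar, sigma^2/n) under the flat prior
mu = rng.normal(ybar, sigma / np.sqrt(n), size=M)

# Likelihood ratio L(mu0)/L(mu) for each draw
LR = np.exp(-n / (2 * sigma**2) * ((ybar - mu0)**2 - (ybar - mu)**2))

print(np.quantile(LR, [0.025, 0.5, 0.975]))  # median ~ 0.17
print((LR > 1).mean())                       # P(null better supported) ~ 0.05
```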
FIGURE 10.5
Posterior distribution of the likelihood ratio
The distribution is extremely skewed. The median likelihood ratio is 0.1694, and the 95%
central credible interval is [0.1354, 1.666]. This does not exclude the value 1 of indifference
between the two hypotheses. The posterior probability that the likelihood ratio is greater
than 1 (null hypothesis better-supported than the alternative) is 0.0465. We can convert
these results to those for the posterior probability of the null hypothesis.
If we specify equal prior probabilities for the null and alternative hypotheses, then the
median posterior probability of the null hypothesis is 0.145, and the 95% credible interval is
[0.119, 0.625]. The evidence against the null hypothesis is quite weak.
The right-hand side is the scaling or normalising factor which scales the product of likelihood
and prior into the posterior. The integrated likelihood is a weighted average of the likelihood,
with prior weights given by the prior importance of the parameter values. By this convention,
the integrated likelihood is used as though it were the likelihood from a completely specified
model (as with the null model). So there is no uncertainty in the integrated likelihood: it
will be equal to a value of the likelihood L(µ∗ ) at some other value µ∗ of the parameter µ,
determined by the prior specification.
There is no principle in Bayesian analysis justifying this use of the integrated likelihood
as though it were the actual likelihood under the alternative model. It is not a consequence
of Bayes’s theorem. An alternative way of expressing the integrated likelihood is as the prior
mean of the likelihood: it is another one-point summary of the likelihood function, like the
maximised likelihood. This view emphasises the difficulty of the “marginal” argument. The
prior distribution represents our information about the model parameter before the data are
observed. It is not the distribution of an observable random variable on which the distribution
of the data has been conditioned. (An unusual view of some Bayesians is that Nature has
performed a random draw of θ from the prior, and the best we can do is to average over
Nature’s possible choices.)
Quite apart from this philosophical objection, there is an immediate difficulty with
the computation of the integrated likelihood if the prior is improper, for example flat on
(−∞, ∞). This prior does not integrate to 1, nor can it be scaled to do so by multiplying
by a constant. For a finite integrated likelihood, the prior must be proper, for example by
being defined on a finite interval, or by other parametrisations.
Suppose we define the prior for µ as flat on the finite interval [a, b] : π(µ) = 1/(b − a).
Then integrating the Gaussian likelihood over this interval gives
$$\frac{1}{b-a}\int_a^b L(\mu)\,d\mu = \frac{c}{b-a}\cdot\left[\Phi\!\left(\frac{\sqrt{n}\,(b-\bar y)}{\sigma}\right) - \Phi\!\left(\frac{\sqrt{n}\,(a-\bar y)}{\sigma}\right)\right],$$
where c is a constant function of the data and σ. The integrated likelihood is an explicit
function of a and b. Varying these values will vary the integrated likelihood correspondingly.
We cannot let a → −∞ and b → ∞ as in the posterior computation, for then, though the bracketed difference of Φ terms → 1, the factor 1/(b − a) → 0, so the integrated likelihood under the alternative hypothesis tends to zero – the null hypothesis will be infinitely well-supported relative to the alternative!
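The dependence on the prior interval is easily demonstrated numerically. A minimal sketch, assuming the example values n = 25, σ = 1 and ȳ = 0.4, and ignoring the constant c:

```python
import numpy as np
from scipy.stats import norm

n, sigma, ybar = 25, 1.0, 0.4
se = sigma / np.sqrt(n)

# Integrated likelihood over the flat prior on [a, b], up to the constant c.
def int_lik(a, b):
    return (norm.cdf((b - ybar) / se) - norm.cdf((a - ybar) / se)) / (b - a)

for a, b in [(-1, 1), (-10, 10), (-100, 100)]:
    print((a, b), int_lik(a, b))   # shrinks towards zero as the interval widens
```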
Many attempts have been made to avoid this difficulty, by using priors that are “not
too diffuse” relative to the likelihood (for example Kass and Raftery 1990). This requires an
inspection of the likelihood to decide how to assess diffuseness and specify a not-too-diffuse
prior. This is using the likelihood to determine the prior, violating the fundamental Bayesian
precept that the prior must be specified before the data are observed: we are using the data
twice! A detailed discussion of this problem can be found in Aitkin (2010, §2.8).
there is a 95% probability that this 95% confidence interval covers the true popu-
lation value.
That statement is effectively a Bayesian credible interval statement, which correctly re-
expressed is
there is a 95% probability that the true population value lies in this 95% credible
interval.
The first statement has to be re-expressed in terms of the coverage property of a hypothetical ensemble of random intervals:
in a long sequence of repeated random samples from the same population, 95% of the 95% confidence intervals constructed in this way would cover the true population value.
But in this special case of the Gaussian distribution with known variance (and in other cases
in which a pivot exists), the probability statement has both interpretations!
$$\pi(\mu \mid \bar z, m) = c \cdot \exp\left\{-\frac{m(\bar z - \mu)^2}{2\sigma^2}\right\},$$
where m and z̄ are called prior parameters or sometimes hyper-parameters. The prior can be
expressed as representing the information from a prior experiment or prior study of size m in
which the study mean was z̄. By appropriate choice of the prior parameter values, the prior
can approximate the information about µ that is actually available (provided it is symmetric
around the mean). As m → 0, the Gaussian prior approaches the uniform prior.
The posterior distribution of µ can be evaluated directly from the product of the two densities, where the normalising denominator term of Bayes's theorem does not involve µ. The posterior distribution of µ is
$$\mu \mid \bar y, \bar z, n, m \sim N\!\left(\frac{n\bar y + m\bar z}{n+m},\ \frac{\sigma^2}{n+m}\right).$$
The posterior mean is the weighted mean of the data mean, weighted by the data sample size, and the prior mean, weighted by the prior sample size. The posterior variance is reduced by the additional prior sample size.
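The conjugate update is a one-line computation. A minimal sketch with illustrative values for the data and prior summaries (n, ȳ, m, z̄ and σ are assumptions for the example):

```python
import numpy as np

sigma = 1.0
n, ybar = 25, 0.4    # data sample size and mean (illustrative)
m, zbar = 10, 0.0    # prior "sample size" and prior mean (illustrative)

post_mean = (n * ybar + m * zbar) / (n + m)  # weighted mean of data and prior means
post_sd = np.sqrt(sigma ** 2 / (n + m))      # prior adds m to the effective sample size
print(post_mean, post_sd)
```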
We now extend the family of models to those with two parameters. This raises new issues
in the relation between the two parameters.
where c is a known constant, not involving σ or µ. The likelihood depends on only two
functions of the data: the sample mean ȳ and the "residual sum of squares" (about the mean) $RSS = \sum_{i=1}^{n}(y_i - \bar y)^2$. These are the sufficient statistics for µ and σ: they are the
only data functions (apart from the known sample size n) we need to describe the likelihood,
which factors into two separate pieces. The mean µ appears only in the first term, but σ
appears in both terms.
The MLE of σ² is RSS/n, which is biased; this property was often used as an argument against maximum likelihood as a general method of finding an estimator of a parameter. The unbiased (quadratic in the data) estimator is s² = RSS/(n − 1). We discuss next the use of the restricted or marginal likelihood for the justification of the unbiased estimator.
The frequentist inference about µ and σ is based on the sufficient statistics ȳ and RSS, which have independent distributions:
• $\sqrt{n}(\bar y - \mu)/\sigma \sim N(0, 1)$;
• $RSS/\sigma^2 \sim \chi^2_{n-1}$.
The two-parameter Gaussian distribution has two pivots! For inference about µ, the frequen-
tist approach uses the derived pivot
$$t = \frac{\sqrt{n}(\bar y - \mu)}{\sigma} \Big/ \frac{s}{\sigma} = \frac{\sqrt{n}(\bar y - \mu)}{s} \sim t_{n-1},$$
which has a Student’s t-distribution with ν = n − 1 degrees of freedom, with density
$$\frac{\Gamma((\nu+1)/2)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)}\left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}.$$
This is symmetric about the mean, but has longer “tails” than the Gaussian distribution,
reflecting the uncertainty in the variability. Large deviations from the mean have higher
probability than under the Gaussian.
The flat prior on log σ corresponds to the prior π(σ) ∝ 1/σ on σ; it is invariant to power transformations of the scale parameter, and is the common standard for a non-informative prior for a scale parameter in Bayesian analysis. This shows the property of the two pivots in the Gaussian distribution: the inferential conclusions for µ and σ are the same, for the flat priors on µ and log σ.
From a frequentist viewpoint, the marginal posterior of ϕ is equivalent to the Marginal
or Restricted likelihood for σ 2 . The maximising value of the restricted likelihood is given by
σ̃ 2 = RSS/(n − 1), which is generally called the REML (Restricted Maximum Likelihood)
or MML (Marginal Maximum Likelihood) estimate.
The joint conjugate prior is of the same form as the likelihood, and can be expressed as being based on an auxiliary experiment with sample size m, sample mean z̄ and residual sum of squares ASS. The joint posterior then factors into the marginal posterior $\chi^2_{n+m-1}$ distribution of $(RSS + ASS)/\sigma^2$ and the conditional Gaussian distribution of µ given σ, $N([n\bar y + m\bar z]/[n+m],\ \sigma^2/[n+m])$.
It might appear that this result is not general, because the same σ has been assumed for
the prior as for the data model. However, if we take a different variance for the prior, say
ϕ2 , we can redefine the prior variance to be σ 2 and the prior sample size m∗ to be such that
ϕ2 /m = σ 2 /m∗ , that is, define m∗ = mσ 2 /ϕ2 . We do not give further details.
The Gaussian distribution is one of the very few in which this integration can be done analytically. As for the frequentist case, the analytic derivation gives a Student's t-distribution with degrees of freedom ν = n − 1, for $t = \sqrt{n}(\mu - \bar y)/s$ where $s^2 = RSS/(n-1)$, with density
$$\frac{\Gamma((\nu+1)/2)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)}\left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}.$$
For most two-parameter distributions, this integration is not analytic, and can be carried
out by simulation. Instead of finding the probability density of µ by integration, we generate
a large random sample of M values of µ. If M is sufficiently large, like 10,000, this provides
an accurate approximation to the cdf, from which accurate quantiles can be obtained.
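For the Gaussian model the simulation uses the two pivots directly: draw σ² from the scaled χ²ₙ₋₁ pivot, then µ given σ. A minimal sketch with illustrative sufficient statistics n, ȳ and RSS:

```python
import numpy as np

rng = np.random.default_rng(2)
M = 10_000
n, ybar, RSS = 25, 0.4, 30.0   # illustrative sufficient statistics

# sigma^2 | y ~ RSS / chi^2_{n-1}  (flat prior on log sigma),
sigma2 = RSS / rng.chisquare(n - 1, M)
# then mu | sigma, y ~ N(ybar, sigma^2/n).
mu = rng.normal(ybar, np.sqrt(sigma2 / n), M)

# Marginally, sqrt(n)(mu - ybar)/s ~ t_{n-1}; quantiles come from the sorted draws.
print(np.quantile(mu, [0.025, 0.5, 0.975]))
```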
FIGURE 11.1
Cdf of 10,000 draws of t4 (dots) and t4 cdf (solid curve)
1 In some disciplines this function is called the effect size.
FIGURE 11.2
Cdf of 10,000 draws of the coefficient of variation σ/µ
FIGURE 11.3
Cdf of 10,000 draws of µ/σ
Figures 11.2 and 11.3 show the posterior cdfs of these two parametric functions from the simulation example. The posterior distribution of σ/µ is heavily skewed, with a very long tail. Simulation values of µ near zero give very large values of the coefficient of variation.
The posterior of µ/σ is only slightly skewed, the small values of µ having little effect. So if
we are interested in the coefficient of variation, and µ may be near zero, it would be more
effective, and more accurate, to use the posterior distribution of µ/σ to provide quantiles,
then invert these to give those of the coefficient of variation.
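A sketch of this inversion, with illustrative posterior draws of µ and σ generated as in the earlier sketch; the quantiles of σ/µ are the inverted quantiles of µ/σ in reverse order, valid when the draws of µ/σ are all positive:

```python
import numpy as np

rng = np.random.default_rng(3)
M, n, ybar, RSS = 10_000, 25, 2.0, 30.0   # illustrative values
sigma2 = RSS / rng.chisquare(n - 1, M)
mu = rng.normal(ybar, np.sqrt(sigma2 / n), M)

inv_cv = mu / np.sqrt(sigma2)             # mu/sigma: only slightly skewed
lo, med, hi = np.quantile(inv_cv, [0.025, 0.5, 0.975])

# Invert in reverse order to get the quantiles of the coefficient of variation.
print(1 / hi, 1 / med, 1 / lo)
```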
The posterior predictive distribution has an important role in assessing the Gaussian model assumption. Instead of predicting new values of Y, we can use the sample data we have, and compare the sample cdf of the "Studentised" variables $\sqrt{\frac{n}{n+1}}\,\frac{y_i - \bar y}{s}$ with that for
the tn−1 distribution. This might seem strange – shouldn’t we be comparing it with the
Gaussian distribution? We pay a price for not knowing µ and σ, which are estimated
from the sample with imprecision; that imprecision is reflected in the more diffuse t-
distribution with its longer tails than the Gaussian. As n increases this imprecision goes
to zero.
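A sketch of this check, with an illustrative sample y standing in for the data:

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(4)
y = rng.normal(5, 2, 40)        # illustrative sample; replace with the real data
n, ybar, s = len(y), y.mean(), y.std(ddof=1)

# Studentised values sqrt(n/(n+1)) * (y_i - ybar)/s, compared with the t_{n-1} cdf.
u = np.sort(np.sqrt(n / (n + 1)) * (y - ybar) / s)
ecdf = np.arange(1, n + 1) / n
print(np.column_stack([u, ecdf, t.cdf(u, n - 1)])[:5])   # sample cdf vs model cdf
```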
Two-parameter skewed distributions have wide application. We examine three of them.
$$f(z)\,dz = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(z-\mu)^2}{2\sigma^2}\right\}dz, \qquad -\infty < z < \infty$$
$$f(y)\,dy = \frac{1}{\sqrt{2\pi}\,\sigma y}\exp\left\{-\frac{(\log y-\mu)^2}{2\sigma^2}\right\}dy, \qquad 0 < y < \infty.$$
The only difference between the densities is in the additional denominator term in y in the
lognormal. This does not affect the ML or the Bayesian posteriors, though it does affect the
value of the likelihood.
For the ML or Bayesian analysis, it is just a matter of replacing the usual terms in $y_i$ by the corresponding terms in $z_i = \log(y_i)$. So the MLE of µ from the random sample $y = (y_1, \ldots, y_n)$ is $\hat\mu = \sum_{i=1}^{n} z_i/n$, and for $\sigma^2$ is $\hat\sigma^2 = \sum_{i=1}^{n}(z_i - \hat\mu)^2/n$. The MLEs are $\hat\mu = 5.267$, $\hat\sigma = 0.842$. Figure 11.4 shows the ML fitted cdf of the lognormal (solid curve)
with the ecdf (circles) and 95% credible region (red) for the true cdf on the scale of log hours.
The lognormal does not fit well: the curvature is wrong.
Inference about the 80th quantile follows from the 80th quantile of the standard Gaussian
distribution, which is 0.842. So the posterior distribution of the 80th quantile for the phone
data is obtained from the 10,000 draws of the parametric function µ[m] + 0.842 σ [m] . Figure
11.5 gives the posterior cdf. The median and 95% credible interval for the 80th quantile of
µ+0.842 σ are 6.02 and [3.66, 8.40]. Transforming exponentially to the original scale of y, the
median and 95% credible interval transform to 411.6 and [38.9, 4447]. These values are quite
different from those for the exponential. The apparently incorrect distribution specification
has thrown out the location and length of the credible interval. We discuss this further in
Chapter 12 on model assessment.
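A sketch of this computation, using the two-stage Gaussian simulation on the log scale; the MLEs above give the centre, but the sample size n is an assumption for the illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
M, n = 10_000, 23                 # n is an assumption for the illustration
muhat, sighat = 5.267, 0.842      # MLEs on the log scale (from the text)
RSS = n * sighat ** 2

# Two-stage posterior draws on the log scale (flat priors on mu and log sigma).
sigma = np.sqrt(RSS / rng.chisquare(n - 1, M))
mu = rng.normal(muhat, sigma / np.sqrt(n), M)

q80_log = mu + 0.842 * sigma      # 0.842 is the standard Gaussian 80th quantile
q = np.quantile(q80_log, [0.025, 0.5, 0.975])
print(q, np.exp(q))               # log scale, then back on the scale of hours
```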
FIGURE 11.4
Empirical (circles) and ML fitted lognormal (solid) cdfs, and 95% credible region (red)
FIGURE 11.5
Cdf of the lognormal 80th quantile
where α is the shape parameter and θ the scale parameter. The survivor function is now $S(y) = \exp(-\theta y^\alpha)$, the hazard function is $h(y) = \alpha\theta y^{\alpha-1}$, and the integrated hazard is $H(y) = \theta y^\alpha$. The hazard is monotone decreasing for α < 1, constant for α = 1 and monotone increasing for α > 1. The Weibull distribution is an accelerated failure time model, in the sense that we may think of an age variable a as defining a rescaling of clock time y, by $a = y^\alpha$. On this age scale, the distribution of age lifetime is exponential. So if α > 1, age is "accelerated" relative to clock time, while if α < 1, age is "braked" relative to clock time.
The Weibull mean and variance are complex functions of the model parameters: mean $\mu = \theta^{-1/\alpha}(1/\alpha)\Gamma(1/\alpha)$, variance $= \theta^{-2/\alpha}\left[(2/\alpha)\Gamma(2/\alpha) - (1/\alpha)^2\Gamma^2(1/\alpha)\right]$. Practical interest is focussed on the quantiles of the distribution, which have simpler functional form in the parameters. The 100γ quantile of the distribution, $y_\gamma$, is given by
$$S(y_\gamma) = \exp(-\theta y_\gamma^\alpha) = 1 - \gamma$$
$$H(y_\gamma) = \theta y_\gamma^\alpha = -\log(1-\gamma)$$
$$y_\gamma = \{-\log(1-\gamma)/\theta\}^{1/\alpha}$$
$$\log(y_\gamma) = \{\log[-\log(1-\gamma)] - \log(\theta)\}/\alpha.$$
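These quantile formulas translate directly into code. A minimal sketch, using for illustration the Weibull MLEs for the phone data quoted later in this chapter:

```python
import numpy as np

alpha, theta = 1.308, 0.000821    # Weibull shape and scale (MLEs quoted in the text)

def weibull_quantile(gamma):
    # y_gamma = {-log(1 - gamma)/theta}^(1/alpha)
    return (-np.log(1 - gamma) / theta) ** (1 / alpha)

print(weibull_quantile(0.8))      # 80th quantile, about 329 hours
```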
FIGURE 11.6
Posterior density of α
FIGURE 11.7
Posterior density of θ
Our interest is in the 80th quantile of the distribution, y80 . This is an analytic function
of the parameters:
$$S(y_{80} \mid \theta, \alpha) = 0.20$$
$$\exp(-\theta y_{80}^{\alpha}) = 0.20$$
$$\theta y_{80}^{\alpha} = -\log(0.2),\qquad y_{80}^{\alpha} = 1.61/\theta$$
$$y_{80} = (1.61/\theta)^{1/\alpha}$$
$$\log(y_{80}) = [0.476 - \log(\theta)]/\alpha.$$
We evaluate this function of the two parameters, sort the values, and construct the me-
dian and 95% credible interval for log(y80 ) in the usual way, then exponentiate these
values to give those for y80 . Figure 11.8 shows the posterior distribution for the 80th
quantile.4
The median and 95% credible interval on the log scale are 5.799 and [5.696, 5.968]; the credible interval is slightly asymmetric about the median. Exponentiating gives median hours 330.0 and credible interval [283.2, 380.4]. The median is close to that for the exponential distribution, but the credible interval is much shorter (by 48 hours) at the upper end.
⁴ The small ripples near the median are a consequence of the grid spacing. With 10⁶ points in the grid
FIGURE 11.8
Cdf of the 80th quantile, log hours
The log survivor function is a negative exponential function of z, rather than the negative
power function of y in the Weibull. We do not discuss its properties further: they follow from
those of the Weibull.
FIGURE 11.9
Empirical (circles) and ML fitted Weibull (black line) log integrated hazard, with 95% cred-
ible region (red lines)
This is shown in Figure 11.10. The Least Squares estimates and (SEs) are α̃ = 1.231 (0.013),
log θ̃ = −6.680 (0.071), θ̃ = 0.00126. The estimates are fairly close to the MLEs:
$\hat\alpha = 1.308$, $\hat\theta = 0.000821$, but the LS standard errors are much smaller than the precisions of the MLEs or the posteriors, as they do not reflect the information in the Weibull
likelihood – they are optimal for a Gaussian likelihood. The estimates could be used, how-
ever, as initial estimates in a Newton-Raphson or Fisher scoring algorithm for maximum
likelihood.
The MRR approach to analysis reflects a view, unfortunately common outside the statistical profession, of statistical data analysis as a branch of optimisation. Given the problem, we need
a criterion or objective (goodness-of-fit) function which has to be optimised – maximised or
minimised. The sum of squared residuals from the fitted model is a popular choice. As we
have seen in other models, the sum of squares would be optimal in the sense of statistical
theory if the “error” distribution were Gaussian. Here it is not, so the LS estimates cannot be
optimal in the statistical sense – they are not efficient. More seriously, since their standard
errors are seriously underestimated, they will give over-precise confidence intervals for any
parameters or parametric functions, like quantiles or predicted values.
The unusual name of the method comes from the early analysis of heavily censored
experimental data. We do not give details, but discuss censoring in the next section.
11.5.6 Censoring
An important application of the Weibull (and other survival distributions) is the censoring of
observations. This term is used in statistics for the termination – cutting off – of observation
of the lifetime, before failure has occurred. Censored observations cannot “speak” – their
information is “censored”.
Censoring is common in both engineering and medical applications. In the first, it in-
volves life-testing of components by placing them under stress on a suitable machine and
FIGURE 11.10
Least squares fit to the log integrated hazard
operating the machine until failure occurs, or until some pre-specified censoring time has
elapsed. In medical applications, especially of cancer treatment, the patient is observed after
treatment until death or for a fixed pre-specified follow-up time. The observed lifetimes of
the components, or patients, which have not been fully observed because of censoring are
not discarded: they provide information about the model parameters and must be included
in the data analysis.
This is done quite simply: a component or patient which has survived for a time T before
observation ends is known to have an unobserved failure time which is greater than the
termination time. So the contribution to the likelihood of a censored lifetime T is given by
the survivor function S(y) at time T .
We give a simple example of the exponential distribution. Suppose we have an additional
phone which was withdrawn from service after operating for a time T = 300 hours with-
out failing. How does this affect our inference about the mean lifetime λ? The likelihood,
including this observation, is now
"m #
Y
ni
L(λ) = f (yi | λ) · S(T | λ)
i=1
m y ni
Y 1 i T
= exp − · exp −
i=1
λ λ λ
Pm
1 n i yi + T
= n exp − i=1
λ λ
1 nȳ + T
= n exp − .
λ λ
The effect of the censored observation is to increase the total survival time of the phones
by T without increasing their number, so the MLE of λ will increase, from ȳ = 210.8 to
ȳ + T /n = 210.8 + 3.41 = 214.21. The Bayesian analysis will be affected correspondingly.
We discuss the effect of censoring and its analysis further in §15.4.
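The adjustment of the MLE is a one-line computation. A sketch with the values from the text; the sample size n = 88 is an assumption, chosen so that T/n = 3.41 as quoted:

```python
ybar, T = 210.8, 300.0   # mean of the observed lifetimes; censored time
n = 88                   # assumption: implied by T/n = 3.41 in the text

# A censored observation adds to the total time on test but not to the
# failure count, so the exponential MLE of the mean lifetime increases.
lam_hat = ybar + T / n
print(lam_hat)           # about 214.2
```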
where ψ is the digamma function – the derivative of the log gamma function. The MLEs
and (SE)s of µ and r are 210.8 (17.4) and 1.67 (0.89).
An attraction of this formulation is that the cross-derivative of the log-likelihood with
respect to µ and r is zero at the MLEs, as in the Gaussian log-likelihood with µ and σ. So the
MLEs µ b and rb are uncorrelated and asymptotically independent. However, as in the Gaussian
case, the likelihood in µ and r does not factor into independent terms: the parameters are
not independent in the posterior.
For independent priors for θ and r, with that for log θ uniform and π(r) not yet specified,
the posterior for θ and r can be expressed as
$$\pi(\theta, r \mid y) = \frac{T^{nr}\,\theta^{nr-1}\exp(-\theta T)}{\Gamma(nr)} \cdot \frac{\Gamma(nr)\left(\prod_i y_i\right)^{r-1}}{\Gamma^n(r)\,T^{nr}}\,\pi(r).$$
The joint posterior now factors into a term in r only, and a term in θ and r which is exactly
a gamma posterior. From this we can conclude that, conditional on r, θT has a gamma
posterior distribution with parameters nr and 1. The other term – the marginal distribution
of r – however, has no standard form, whatever prior distribution we might use for it.
We compute the likelihood component for r with a flat prior on a dense grid from 0.5
to 2.5, then scale the likelihood component by its sum. (Since r > 0, a flat prior on log r
would also be reasonable.) Figure 11.11 shows the posterior density of r, and Figure 11.12
the cdf. The median and central 95% credible interval for r are 1.54 and [1.16, 2.00]. The
distribution is slightly right-skewed.
Then we make 10,000 random draws $r^{[m]}$ from this posterior, and for each $r^{[m]}$, make one random draw $\theta^{[m]}$ from the Gamma($nr^{[m]}$, 1)/T conditional posterior distribution.
Figure 11.13 gives the joint scatter of the 10,000 draws. The two parameters are strongly
associated, not surprisingly.
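A sketch of the two-step sampling. The grid posterior of r below is a stand-in shape for illustration; in the analysis it is the scaled likelihood component for r, and n and T = Σᵢ nᵢyᵢ are assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
M, n, T = 10_000, 88, 18_550.0    # n and T are assumptions for the illustration

# Stand-in grid posterior for r; the real one is the scaled likelihood component.
r_grid = np.linspace(0.5, 2.5, 2001)
post_r = np.exp(-0.5 * ((r_grid - 1.54) / 0.21) ** 2)
post_r /= post_r.sum()

# Step 1: draw r from its grid posterior; step 2: draw theta | r ~ Gamma(nr, 1)/T.
r = rng.choice(r_grid, size=M, p=post_r)
theta = rng.gamma(shape=n * r) / T

print(np.quantile(theta, [0.025, 0.5, 0.975]))
```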
The posterior median for r is quite different from the MLE (1.80), which uses the addi-
tional information about r in the first component of the likelihood. This difference is similar
to that in the Gaussian distribution, where the MLE of σ² has n in the divisor of the RSS,
while the posterior distribution has its mode at n − 1.
Figure 11.14 shows the marginal posterior cdf of θ. The distribution of θ is also
slightly right-skewed. The median for θ is 0.00729 and the 95% central credible interval
is [0.00522, 0.00984]. The median differs somewhat from the MLE 0.00687, again a conse-
quence of the skew.
Our interest is in the 80th quantile of the distribution. This is not an analytic function
of the parameters, but we can use the inverse gamma cdf function (the “gamma deviate”
or “gamma quantile” function) to generate its posterior, by substituting the random draws
(r[m] , θ[m] ) into the gamma quantile function at the 80th quantile (Figure 11.15):
FIGURE 11.11
Posterior density of r
FIGURE 11.12
Cdf of r
FIGURE 11.13
Joint draws of r, θ
FIGURE 11.14
Cdf of θ
FIGURE 11.15
Cdf of the lifetime 80th quantile
$$y_{80}^{[m]} = G^{-1}\!\left(0.8,\, r^{[m]}\right)/\theta^{[m]}.$$
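In code, the quantile step uses the inverse gamma cdf, gamma.ppf in scipy.stats. Continuing with stand-in draws of (r, θ) as in the earlier sketch (n and T again assumptions):

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(7)
M, n, T = 10_000, 88, 18_550.0            # assumptions, as in the earlier sketch
r = rng.normal(1.54, 0.21, M).clip(0.5)   # stand-in draws of r from its posterior
theta = rng.gamma(shape=n * r) / T        # theta | r ~ Gamma(nr, 1)/T

# y80[m] = G^{-1}(0.8, r[m]) / theta[m], via the gamma quantile function.
y80 = gamma.ppf(0.8, r) / theta
print(np.median(y80), np.quantile(y80, [0.025, 0.975]))
```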
The median is 325.7 and the 95% credible interval is [276.9, 389.5]. These values are very
close to those for the Weibull: 330.0 and [283.2, 380.4]. The gamma median is 4.3 hours less
than the Weibull median, and the gamma credible interval is 15 hours longer. We assess the
fit of the gamma and Weibull distributions in Chapter 12 on model assessment.
12
Model assessment
We have so far assumed that the probability models specified in the analyses are appropriate.
We now examine ways in which this assumption can be investigated. The first approach, for
continuous distributions, is through the agreement between the empirical cdf (ecdf) and the
model cdf, described several times previously. For this we construct a credible region for
the true model cdf from the set of credible intervals from the ecdf at each distinct data
point, based on the binomial model and the Beta posterior distribution. We give the formal
construction:
• At each ordered distinct sample value $y_i$ of the variable Y, where the value of the ecdf is $n_i/n$, we make 10,000 draws $P_i^{[m]}$ from the Beta posterior with the uniform prior, Beta($n_i + 1,\ n - n_i + 1$).
• We order the draws and extract the 2.5 and 97.5 quantiles of the ordered draws at yi .
• These quantiles define the central 95% credible interval for the true population propor-
tions Pi .
• We repeat the simulations independently at each distinct value of y.
• The region defined by the interior of the set of 95% credible intervals is called a 95%
(pointwise) credible region for the true cdf.
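A sketch of this construction for a sample with distinct values; an illustrative Gaussian sample stands in for the data, and with ties the cumulative counts nᵢ replace the ranks:

```python
import numpy as np

rng = np.random.default_rng(8)
y = np.sort(rng.normal(7.6, 1.1, 61))   # illustrative sample; replace with the data
n, M = len(y), 10_000

band = []
for i in range(1, n + 1):               # cumulative count n_i = i for distinct values
    # Beta posterior for the true cdf value P_i at y_i, with the uniform prior.
    draws = rng.beta(i + 1, n - i + 1, M)
    band.append(np.quantile(draws, [0.025, 0.975]))

band = np.array(band)                   # pointwise 95% credible region for the cdf
print(band[:3])
```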
An important value of this approach is that the uncertainty in the true cdf is assessed from
the data, not from the assumed distribution. This allows multiple models to be assessed for fit
to the same data without separate calculations of variability for each model. A frequentist
confidence region for the true cdf can be constructed in a number of ways (Owen 1995),
including the frequentist analogue of this one. We do not give details.
We begin with the Gaussian model example.
FIGURE 12.1
Birthweight cumulative proportions (circles), ML fitted Gaussian model (black curve) and
95% credible region (red curves)
The probit transformation gets its name from the PROBability InTegral transformation, and we will use this term and Gaussian quantile interchangeably.
Figure 12.2 shows the boy birthweights on this scale. It now appears that at least the
heaviest boy departs markedly from the Gaussian distribution model. The lightest boys
do not. The heaviest boy is an outlier in the statistical sense: he does not belong to this
distribution. However the central region, around nine pounds, also departs slightly from the
Gaussian model. We discuss better-fitting models which allow for the outlier in §15.8.
FIGURE 12.2
Birthweight cumulative proportions (circles), ML fitted Gaussian model (line) and 95% cred-
ible region (red curves) on probit scale
FIGURE 12.3
Phone log lifetimes (circles), ML fitted lognormal model (curve) and 95% credible region
(red curves)
FIGURE 12.4
Phone log lifetimes (circles), ML fitted lognormal model (line), and 95% credible region (red
curves) on probit scale
FIGURE 12.5
Phone empirical (circles), ML exponential (curve) survivor functions and 95% credible region
(red curves)
FIGURE 12.6
Phone empirical (circles) and ML exponential (curve) survivor functions, and 95% credible
region (red curves), probit scale
Figure 12.5 shows the empirical and ML fitted exponential survivor function for the phones. The exponential cdf falls completely inside the 95% credible
region. However, the cdf curve moves from one side of the credible region to the other,
suggesting a poor fit. The sample size is too small to give precision in the credible region,
but the curvature of the exponential seems to be wrongly specified. Figure 12.6 gives the
same picture on the probit scale with the same conclusions. The exponential distribution
cdf does not give a straight line on this scale. Of course it would not: the probit scale
assumes a Gaussian distribution. We need a different transformation of scale for the cdf, and
possibly time.
We have, for the exponential distribution cdf with mean λ, $-\log S(y) = y/\lambda = H(y)$, the integrated hazard function. So if the exponential distribution is the correct model, the
graph of the integrated hazard H(y) should be close to a straight line through the origin,
with slope 1/λ. Figure 12.7 shows the empirical integrated hazard function (circles) with the
fitted exponential line and the 95% credible region. It is clear that the integrated hazard
function does not fit a straight line at all, so the log transformation of the survivor function
is not adequate. The remaining curvature suggests a log transformation of time as well. We
log transform both scales. Then if the exponential model is correct, we should have $\log H(y) = \log(y) - \log(\lambda)$. So the log integrated hazard function should increase linearly with log(y), with intercept $-\log(\lambda)$.
Figure 12.8 shows the effect of this transformation. The graph is nearly linear, and
FIGURE 12.7
Empirical (circles) and exponential (line) integrated hazard functions and 95% credible re-
gion (red curves)
FIGURE 12.8
Empirical (circles) and exponential (line) log integrated hazard functions and 95% credible
region (red curves)
the line just fits inside the credible region, but it moves from one side of the region to the
other: it is clear that the slope is incorrect. The ML fitted exponential log integrated hazard
is −5.35 + log(hours).
We extend this analysis in the next section.
$$F(y \mid \theta, \alpha) = 1 - \exp(-\theta y^\alpha)$$
$$S(y \mid \theta, \alpha) = \exp(-\theta y^\alpha)$$
$$-\log S(y \mid \theta, \alpha) = \theta y^\alpha$$
$$\log[-\log S(y \mid \theta, \alpha)] = \log(\theta) + \alpha\log(y).$$
So on the log integrated hazard scale for the survivor function and the log scale for the
response, we should see a straight line with intercept log(θ) and slope α under the Weibull
model. Figure 12.9 shows the (ML) fitted log integrated hazard function and the credible
region.
FIGURE 12.9
Empirical (circles) and Weibull log integrated hazard functions (line) with 95% credible
region (red curves)
The ML fitted Weibull log integrated hazard is −7.105 + 1.308 log(hours). All the data
points are within the credible region boundaries. The Weibull fit is acceptable. However, this
does not mean that the Weibull distribution is correct! It means only that the Weibull is an
adequate fit to the data. But other distributions may also fit adequately; we will examine the
gamma as well. If more than one distributional model fits adequately, what do we conclude?
We discuss this in Chapter 13.
The Weibull distribution is not a location/scale family member. However, the log of the
response variable with a Weibull distribution has an extreme value distribution, which is
a location/scale family, as can be seen from the form of the density for z = log y, and
transforming the parameters by σ = 1/α, ψ = −log θ/α:
$$f(z) = \frac{1}{\sigma}\exp\left\{\frac{z - \psi}{\sigma} - \exp\left(\frac{z - \psi}{\sigma}\right)\right\}.$$
Here ψ is a location parameter and σ a scale parameter. The standard form of the extreme value distribution, with ψ = 0 and σ = 1, is
$$f(z) = \exp\{z - e^z\}.$$
The model assessment in this form is the same as it is in the Weibull form as the distribution
still depends on two parameters. However the log integrated hazard function is linear in log
y in the Weibull, but linear in z in the extreme value. We do not consider the extreme value
form further.
FIGURE 12.10
Fitted gamma cdfs (solid – ML, dashed – posterior, circles – empirical) and 95% credible
region (red curves)
FIGURE 12.11
ML fitted gamma (red) and Weibull (black) densities
13
The multinomial distribution
In §6.9 we used the multinomial distribution in its conventional form to represent the prob-
ability structure of the population of a categorical variable, as an extension of the binomial
distribution for two categories to three or more. But surprisingly, this representation can
be used as well for “continuous” variables, which has remarkably valuable consequences for
statistical modelling and analysis. A detailed discussion of the history of this development
can be found in Aitkin (2010) Chapter 4. Here we give a brief summary. The fundamental
point, quoted previously, was expressed by Pitman (1979): all observable random variables have discrete distributions, because any real random variable Y is recorded with finite measurement precision δ, and
so is recorded on a grid of reported values of spacing δ. Then the set of possible recorded population values $Y_{I^*}$, $I^* = 1, \ldots, N$ can be tabulated by the D distinct ordered values $Y_I$, $I = 1, \ldots, D$ into a set of D population counts $N_I$ on these distinct population values $Y_I$. The total population count is $N = \sum_{I=1}^{D} N_I$, and the population proportions $P_I = N_I/N$ define the population multinomial distribution of Y.
This approach is due to Hartley and Rao (1968) who called it the “scale-load” distribu-
tion, and to Ericson (1969) who gave the Bayesian version. We can think of this (YI , PI )
structure as a maximum resolution population histogram, where the bin “width” is the mea-
surement precision δ. Figure 13.1 shows the 648 family incomes in dollars of the StatLab
boy population, as a histogram of counts at the level of resolution of the incomes (δ = $1).
The histogram is far from smooth!
The multinomial idea seems bizarre to many model-based statisticians, or at least unrea-
sonably complicated. Much of the early teaching of introductory statistics was based on the
idea of a smooth population: if we could draw larger and larger samples from the population
we would see the population histogram become increasingly smooth (no such demonstrations
were commonly done). The question of interest then would be what form of smooth density
function could best represent it.
However, the income data show that this assumption can fail: the population histogram is
jagged, and while a conceptual larger population (like the combined boy and girl populations)
might have a smoother histogram, the visible preference for particular numbers (5s and
10s) in the population would remain. Model-based statisticians have for so long regarded a
probability model as a necessary smooth simplification of the unknown population structure,
through a small number of model parameters, that the idea of complicating the population
structure with a large number of parameters, one for every distinct observation, seems to be
going in the wrong direction. We seem to be overburdening ourselves with parameters which
are scarcely identifiable.
But these population proportion parameters are not the parameters of interest in most
applications. It is surprising to find that, on the contrary, population parameters of interest,
like the mean, median or other moments or quantiles, and regression coefficients, are all well-
identified without any simplifying model, and that statistical inference with the multinomial
FIGURE 13.1
StatLab family income histogram, boy population
where the factorial term gives the number of distinguishable arrangements of the sample
values. The multinomial is a multivariate version of the binomial, with the moments of the
$n_I$ given by
$$E[n_I] = nP_I, \qquad \mathrm{Var}[n_I] = nP_I(1 - P_I), \qquad \mathrm{Cov}[n_I, n_J] = -nP_I P_J \quad (I \neq J).$$
of the population. Unlike the applications of the multinomial to true category variables, we
are not interested in the details of the individual PI and YI , since each “category” is just a
distinct single value of Y. As we will see, we are interested in weighted linear functions of the $P_I$, like the mean $\mu = \sum_{I=1}^{D} P_I Y_I$. The covariance structure is inherent in this representation, and in the corresponding sample structure.
Formally, we need to know the number D of distinct values of Y in the population, and the
smallest and largest population values Y1 and YD , to be able to compute this likelihood.
But for any unobserved values of YI the corresponding nI is zero, so the likelihood can be
re-expressed in terms of the PI for only the observed distinct YI . The PI for the unobserved
YI do not contribute to the likelihood, so these YI do not need to be known unless the prior
gives them non-zero weight.
In the absence of informative prior information about the unobserved YI and their pop-
ulation proportions PI , we can rewrite the likelihood L(p1 , . . . , pd ), in terms of the sample
index i and the d ordered observed sample quantities yi and ni , i = 1, . . . , d:
$$L(p_1, \ldots, p_d) = \prod_{i=1}^{d} p_i^{n_i}.$$
$$1 = \lambda\mu + \phi$$
$$P_I = \frac{n_I}{n\lambda Y_I + n(1 - \lambda\mu)} = \frac{\tilde P_I}{1 + \lambda(Y_I - \mu)},$$
$$\widehat{P_I}(\mu) = \frac{\tilde P_I}{1 + \hat\lambda(\mu)(Y_I - \mu)},$$
where $\hat\lambda(\mu)$ is the implicit solution of
$$\sum_{I=1}^{D} \frac{\tilde P_I}{1 + \hat\lambda(\mu)(Y_I - \mu)} = 1.$$
d
This can be computed over a grid of µ and maximised numerically to give the MPLE of
µ. The precision of the MPLE is assessed by treating the profile empirical likelihood as a
parametric likelihood and inverting the likelihood ratio test to construct a profile-likelihood-
based confidence interval for µ. The confidence coefficient is asymptotic, from χ2 . Owen gave
details.
A complicating issue for the MPLE analysis is that if the variance as well as the mean
is to be estimated, the penalised maximisation with the additional variance constraint will
give a different MPLE for the mean, and a more complex constrained maximisation.
where aI is the prior weight on pI . The posterior distribution of the population proportions
with this prior is again Dirichlet:
$$\pi(\{P_I\} \mid \{a_I\}, \{n_I\}) = \frac{\Gamma\!\left(\sum_{I=1}^{D}[n_I + a_I]\right)}{\prod_{I=1}^{D}\Gamma(n_I + a_I)} \prod_{I=1}^{D} P_I^{n_I + a_I - 1}.$$
Its use requires the specification of the prior parameters aI on all the values of Y , observed
or unobserved. An immediate possibility for a non-informative prior would appear to be
aI = 1 ∀ I, as in the binomial. However this prior has serious difficulties. In general we will
The prior has an arbitrary constant c, showing that it is improper (does not integrate to 1).
With this prior, the posterior would be improper for any value PI which has sample size
nI zero, that is, is not observed in the data. The impropriety is that for such PI , the
corresponding posterior density 1/PI → ∞ as PI → 0, that is, has an infinite spike at zero:
we are infinitely certain that such PI are zero.
So in the posterior, we need distinguish only the d distinct ordered, observed sample
values, denoted by yi , with their sample counts ni and corresponding population proportions
pi , so the posterior can be expressed as
$$\pi(\{p_i\} \mid \{n_i\}) = \frac{\Gamma(n)}{\prod_{i=1}^{d}\Gamma(n_i)} \prod_{i=1}^{d} p_i^{n_i - 1}.$$
For values YI not observed in the sample, their posterior probabilities are zero. We can
express the posterior as the empirical likelihood with the improper Haldane prior on the pi
for the observed data values yi .
Banks (1988) took up these criticisms by developing a smoothing of the Dirichlet poste-
rior. Given the Haldane prior, he proposed generating a random value of pI for each observed
YI , and then spreading it uniformly over this YI and all unobserved values (which therefore
had to be known) to the left of this YI down to the next observed value.
This required prior assumptions about both the number of unobserved values and their
locations. In this way the posterior mass was spread over the extended sample range from
y1 to yn , though in an ad hoc way. Values of Y outside the sample range had zero posterior
probability, while those within the sample range had varying prior and posterior probabilities
determined by their prior-specified spacing.
Lazar (2003) examined proposals for other Bayesian analyses with the multinomial dis-
tribution, including a mixture of maximum likelihood and Bayesian analyses using the max-
imised (profile) likelihood as a parametric likelihood, with a conventional parametric prior.
This was discussed by Owen (2001, §§9.4, 9.5).
Lazar (2003, p. 320) dismissed the Bayesian bootstrap because of
the extreme sensitivity of the results to the model assumption, in particular the form
of prior, which is usually taken to be Dirichlet for reasons of conjugacy. Furthermore,
there is no intuitive way of setting the prior, since it is not likely that information
will be available a priori about the [PI ], making any type of subjective Bayesian
analysis impossible.
(Emphasis added)
This argument seems to suggest that the multinomial distribution is subject to extreme sensitivity of results, though no evidence is given to support this claim. Sensitivity of Bayesian posteriors in any model to variations in the prior is natural and expected.
Aitkin (2008) examined the behaviour of the Haldane prior: an apparently unreasonable
prior specification would be expected to perform poorly in simulations. He demonstrated
the contrary with a simulation study of several methods. Informative departures from the
Haldane prior performed poorly, while the multinomial analysis with the Haldane prior
performed as well as the correct parametric distribution as the sample size increased.
Some statisticians suggest the use of the “overdispersed” Dirichlet-multinomial com-
pound distribution (the generalisation of the beta-binomial to more than two categories) to
generalise the covariance structure. This only complicates matters further. Integrating out
the multinomial parameters PI gives a form of multivariate hypergeometric distribution:
$$\Pr[\{n_I\} \mid \{a_I\}] = \frac{\Gamma(N+1)\,\Gamma\!\left(\sum_I a_I\right)}{\Gamma\!\left(N + \sum_I a_I\right)} \cdot \prod_{I=1}^{D} \frac{\Gamma(n_I + a_I)}{\Gamma(n_I + 1)\,\Gamma(a_I)}.$$
These probabilities now depend strongly on the prior parameters aI instead of the pI , for
which we have the same difficulties as before. Worse, the posterior inferences of interest to
us are about linear functions of the PI , which have been integrated out of the compound
distribution altogether. The multivariate hypergeometric distribution arises naturally when
the sample size is large relative to the population size, and we need the hypergeometric
likelihood. This is discussed at length, with its two-stage simulation extension, in Aitkin
(2008, 2010, Chapter 4).
An important issue in our use of the multinomial distribution as a data model is that we
never need to make explicit use of the covariance structure of the model, or make inferences
about individual parameters PI : we need only the simulated distributions of functions of
these parameters, linear or non-linear. These make implicit use of the covariance structure,
but the simulations reported in Aitkin (2008, 2010, Chapter 4) show that the simulation
inferences are competitive with model-based inferences as the sample size increases.
FIGURE 13.2
Boy birthweight mean posterior, cdf scale
The posterior median is 7.651 and the 95% central credible interval is [7.567, 7.739].
These are almost identical to the sample mean 7.652 and the 95% confidence (and credible)
interval [7.565, 7.739] from the Gaussian model. The single “outlier” boy with the unusual
weight does not affect the Gaussian model inference.
Transforming the cdf scale to the probit scale in Figure 13.3 shows that the draws of the
posterior mean are very close to the Gaussian distribution. This could be expected since the
draws are the weighted sum of 61 values of y.
The computational and conceptual saving achieved with the non-informative Dirichlet
prior is striking compared with the empirical profile likelihood, and with frequentist boot-
strapping. These procedures also depend only on the observed sample values and their sample
frequencies: they do not assume anything about unobserved values. The finite population
survey sampling paradigm, which does not use parametric probability models, also makes no
assumption about the population values not included in the sample. Necessary assumptions
refer to the design of the sampling, not the structure of the population.
The term “Bayesian bootstrap” comes from the analogy with the frequentist bootstrap,
which resamples from the observed sample. This is discussed at length in §13.7. The Bayesian
bootstrap also uses only the observed sample, but it resamples from the posterior distribution
of the probabilities attached to each observed value, rather than from the values themselves.
The Bayesian bootstrap gives a more “fine-grained” distribution of the mean.
There are striking differences between the Bayesian generation of the posteriors of the
parameters of interest and the frequentist generation of the empirical profile likelihood in
these parameters.
• The population parameter posteriors are explicit functions of the multinomial Dirichlet
draws; the multinomial MPLEs are implicit functions of the multinomial parameters PI .
• The parameter posteriors are fully informative in location and precision; the precision of
the MPLEs for the population parameters has to be assessed by additional analysis, in
which it is difficult to assess or allow for skew.
The Bayesian analysis is very much simpler: the Dirichlet posterior draws are unaffected by
additional specifications of the population parameters. A further important point is that the
Bayesian bootstrap is easily extended to cover Gaussian-type regression models. We give
several examples in later chapters.
FIGURE 13.3
Boy birthweight mean posterior, probit scale
FIGURE 13.4
Cdf of posterior median, family income
The posterior distribution of the median is also discrete, on the same sample support. So the cdf will be
constant between jumps at the observed support points, and the exact percentiles of the
posterior distribution will be a discrete set.
Figure 13.4 shows the posterior cdf of the median from M = 10, 000 draws of family
income at birth, and Figure 13.5 shows the posterior cdf of the 75th percentile from the
same set of draws. The posterior median (for the population median) is 60, and for the
75th percentile is 75. The discrete posterior mass functions of these percentiles are shown in
Table 13.1, and the posterior cdfs are shown in Table 13.2.
We cannot set arbitrarily the credibility coefficients for credible intervals for percentiles
because of the discreteness of the posterior distributions. Approximate 96% credible intervals
are for the median [56, 70] and for the 75th percentile [67, 93]. The true values are 70
and 90.
FIGURE 13.5
Cdf of posterior 75th percentile, family income
TABLE 13.1
Posterior mass functions of median and 75th percentile

Y        46     47     52     53     55     56     58     60
median   .0001  .0004  .0007  .0100  .0161  .0256  .4367  .2484
75th     –      –      –      –      –      –      .0003  .0034

Y        65     67     69     70     71     72     75     77
median   .1576  .0465  .0287  .0161  .0066  .0041  .0019  .0005
75th     .0194  .0253  .0418  .0686  .0993  .1258  .1456  .1471

Y        80     81     85     93     96     104    107
75th     .1296  .0846  .0601  .0328  .0186  .0012  .0002
TABLE 13.2
Posterior cdfs of median and 75th percentile

Y        46     47     52     53     55     56     58     60
median   .0001  .0005  .0012  .0112  .0273  .0529  .4896  .7380
75th     –      –      –      –      –      –      .0003  .0034

Y        65     67     69     70     71     72     75     77
median   .8956  .9421  .9708  .9869  .9935  .9976  .9995  1.000
75th     .0194  .0447  .0865  .1551  .2544  .3802  .5258  .6729

Y        80     81     85     93     96     104    107
75th     .8025  .8871  .9472  .9800  .9986  .9998  1.000
scaled by the direct Dirichlets, since they appear to correspond to a sample of 1 instead of
n. The rescaling makes such weighted functions comparable with the unweighted functions.
We do not comment on this further.
The only difference between these draws is the use of the multinomial observed data
proportions pi in the bootstrap, and the Dirichlet posterior proportions πi in the Bayesian
bootstrap. The multinomial probabilities in the bootstrap are constant across draws, while
the Dirichlet probabilities are varying randomly across draws. The variability of the Bayesian
bootstrap mean draws will be greater than that of the bootstrap mean draws, correcting for
the assumption of known proportions pi in the bootstrap samples.
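The two resampling schemes differ only in the weights. A minimal sketch with an illustrative sample: the frequentist bootstrap resamples with fixed proportions 1/n, the Bayesian bootstrap draws Dirichlet posterior weights:

```python
import numpy as np

rng = np.random.default_rng(9)
y = rng.exponential(210.8, 50)   # illustrative sample; replace with the data
n, M = len(y), 10_000

# Frequentist bootstrap: resample the observed values with fixed weights 1/n.
boot = np.array([rng.choice(y, n, replace=True).mean() for _ in range(M)])

# Bayesian bootstrap: Dirichlet(1,...,1) posterior weights on the observed values.
w = rng.dirichlet(np.ones(n), M)
bb = w @ y                       # M draws from the posterior of the mean

print(boot.std(), bb.std())      # compare the spread of the two sets of draws
```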
FIGURE 13.6
Bootstrap (circles) and posterior (solid) distributions of p from one success in ten trials
One might think that this is easily dealt with – we can simply truncate the bootstrap
distribution to the positive values in the support – we cut off the value 0. This has two
difficulties: there is now no indication that values between 0 and 0.1 are even possible, and
the remaining probabilities have to be rescaled to sum to 1, by multiplying by the ratio
1/0.349 = 2.865. The rescaled values do not correspond to the shape of the posterior dis-
tribution of p. A characteristic feature of the bootstrap distribution is its discreteness, even
when the parameter space is continuous.
We now consider the relevance of the bootstrap distribution to formal inference about the
population mean, when the variable Y has more than two values. Conditional on the initial
sample with counts ni at the distinct values {yi }, the n counts nij in bootstrap sample j
have a multinomial distribution M (n, {qi }), where qi = ni /n is the proportion of the initial
sample at yi . The joint distribution of the sample counts ni and the bootstrap sample counts
nij is given by the marginal/conditional product:
$$\Pr[\{n_i\}, \{n_{ij}\} \mid \{p_i\}] = \left\{\frac{n!}{\prod_{i=1}^{d} n_i!}\prod_{i=1}^{d} p_i^{n_i}\right\} \cdot \prod_{j=1}^{r}\left[\frac{n!}{\prod_{i=1}^{d} n_{ij}!}\prod_{i=1}^{d} q_i^{n_{ij}}\right].$$
Inspection of this joint distribution shows that the nij are ancillary for the pi and hence for µ:
their distribution M (n, qi ) is completely known (since the qi are known) and independent of
the pi , and hence of µ. So we cannot improve, in a model-based analysis, on the information
in the original sample. The bootstrap samples do not provide further information about µ:
the information they do provide is misleadingly precise.
We do not discuss the bootstrap further.
The weighted estimator of the population mean, known as the Horvitz-Thompson (HT) estimator (Horvitz and Thompson 1952), is defined as
$$\tilde\mu = \sum_{s=1}^{S}\sum_{j=1}^{n_s} w_s y_{sj} \Big/ \sum_{s=1}^{S}\sum_{j=1}^{n_s} w_s = \sum_{s=1}^{S} w_s n_s \bar y_s \Big/ \sum_{s=1}^{S} w_s n_s = \sum_{s=1}^{S} N_s \bar y_s \Big/ \sum_{s=1}^{S} N_s = \sum_{s=1}^{S} \pi_s \bar y_s.$$
From the model assessment for the phones, we have an uncomfortable situation. Three
distributions appear to be inappropriate – the exponential, lognormal and gamma – while
two others – the Weibull and multinomial – appear to be nearly equally appropriate. The
multinomial distribution is always appropriate – it makes no assumptions. The credible
intervals for the lifetime 80th quantile vary considerably among the distributions we have
considered. How do we express the information we now have about the 80th quantile?
A common frequentist procedure is to choose the model with the highest maximised
likelihood and base conclusions on it, ignoring the others, as though they had not been
investigated. If the maximised likelihoods for two of the models are very close, this will
clearly be unsatisfactory, especially if the 80th quantile conclusions from the two models are
different.
We develop recent Bayesian approaches to this problem: detailed discussions were given
in Aitkin, Liu and Chadwick (2009) and Aitkin (2010). We build up the general procedure
from simpler models.
TABLE 14.1
Counts from Cox (1961)
i 1 2 3 4 5
yi 0 1 2 3 >3
ni 12 11 6 1 0
FIGURE 14.1
Poisson (solid) and geometric (dashed) likelihoods, Cox data
$$G(\mu) = \prod_i \left[\left(\frac{\mu}{1+\mu}\right)^{y_i}\frac{1}{1+\mu}\right]^{n_i} = \frac{\mu^{T}}{(1+\mu)^{T+n}},$$
where $T = \sum_i n_i y_i$.
Any function of model parameters and observed data has a posterior distribution
which can be obtained from that of the model parameters, since the data values
after their observation are known numbers.
We have seen this with the credible region for the population cdf. We now discuss the general
problem.
Combined with the ratio of prior model probabilities it gives an analogue of the ratio of
posterior model probabilities.
• Frequentists use the MLE of the parameters under each model, giving the maximised
likelihoods. These are used in different ways for nested and non-nested models:
– for nested models, in which model 1 is a special case of model 2 under a null hypothesis, the maximised likelihoods are used in the likelihood ratio test, with the test statistic $-2\log[L_1(\hat\theta_1)/L_2(\hat\theta_2)]$, which has under the null hypothesis an asymptotic $\chi^2$ distribution with degrees of freedom equal to the difference in the number of parameters between the models.
– for non-nested models, the values of the frequentist deviance −2 log Lmax are pe-
nalised by a function of the number of model parameters to provide a decision
criterion: the model with the smaller value of the criterion is the “best”.
The frequentist likelihood ratio test approach is not able to deal with non-nested models,
like the comparison of Weibull, gamma and lognormal models for the phone data. The
penalised maximised likelihood approach does not give a measure of strength of evidence for
each model, but a decision criterion for the “best” model. It does not provide a procedure
for the joint use of multiple well-supported models. The Bayes factor integrated likelihood
approach reduces the likelihood function to a single number – a one-point average summary
of a parametric function:
$$\int L_k(\theta_k)\,\pi_k(\theta_k)\,d\theta_k = L_k(\tilde\theta_k)$$
for some θ̃k which depends on the prior πk (θk ). This is analogous to the frequentist use of
the MLEs $\hat\theta_j$ as the appropriate values of the $\theta_j$, though $\tilde\theta_k$ is implicit and is known only
after the likelihood has been integrated. A detailed discussion of other Bayesian approaches
to model comparison is given in Ando (2010).
FIGURE 14.2
Poisson (upper) and geometric (lower) log-likelihoods
FIGURE 14.3
Geometric-Poisson log-likelihood difference
The median posterior probability of the Poisson is 0.953 (the value at the MLE of µ), and the
95% credible interval is [0.673, 0.995]. As expected, the median is close to the MLE, but the
credible interval shows that much smaller values are quite plausible. Reporting only a point
estimate (as always) fails to account for the uncertainty in the small sample of 30.
The log-likelihood is widely used in applications in all fields. Recently the deviance has
become the common tool of inference, in both frequentist and Bayesian analyses.
deviance
FIGURE 14.4
Poisson (left, solid) and geometric (right, dashed) deviance distributions, Cox data
FIGURE 14.5
Deviance difference distribution, geometric minus Poisson, Cox data
$$\theta - \hat\theta \mid y \sim N\!\left(0,\ I(\hat\theta)^{-1}\right),$$
$$(\theta - \hat\theta)'\, I(\hat\theta)\,(\theta - \hat\theta) \mid y \sim \chi^2_p,$$
$$-2\log\frac{L(\theta)}{L(\hat\theta)} \,\Big|\, y \sim \chi^2_p,$$
$$\frac{L(\mu_0, \sigma)}{L(\mu, \sigma)} = \exp\left\{\frac{-n(\bar y - \mu_0)^2 + n(\bar y - \mu)^2}{2\sigma^2}\right\}$$
$$-2\log\frac{L(\mu_0, \sigma)}{L(\mu, \sigma)} = \frac{n(\bar y - \mu_0)^2 - n(\bar y - \mu)^2}{\sigma^2} = \frac{n(\bar y - \mu_0)^2}{s^2}\cdot\frac{s^2}{\sigma^2} - \frac{n(\bar y - \mu)^2}{\sigma^2} \sim t^2\cdot\chi^2_{n-1}/(n-1) - \chi^2_1,$$
where t is the frequentist test statistic for the null hypothesis. The difference between the two independent scaled χ² terms has no exact distribution, but is easily simulated. The (posterior) mean of the deviance difference is $t^2 - 1$, and the posterior variance is $2t^4/(n-1) + 2$.
Aitkin (2010, p. 71) gave an example: a sample of ten gives a sample mean of ȳ = 5 and
standard deviation s = 8.74. The null hypothesis of µ = µ0 = 0 gives a frequentist t-statistic
of 1.809, with a p-value of 0.104. The evidence against the null hypothesis is not convincing.
Figure 14.6 shows the posterior cdf of 10,000 draws of the deviance difference. The 95%
central credible interval for the true deviance difference is [−2.258, 6.266]: the difference is
poorly defined in this very small sample. The equivalent interval for the likelihood ratio of
null to alternative is [0.044, 3.09], and that for the posterior probability of the null hypothesis
is [0.042, 0.756]. The sample data are consistent with a wide interval of values for µ, and
hence for the likelihood ratio and posterior probability of the null hypothesis. The Bayesian
analysis is much more informative than the frequentist p-value.
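The simulation behind Figure 14.6 follows immediately from the distributional result above. A minimal sketch with the example values n = 10 and t = 1.809:

```python
import numpy as np

rng = np.random.default_rng(10)
n, tstat, M = 10, 1.809, 10_000

# Posterior draws of the deviance difference t^2 * chi2_{n-1}/(n-1) - chi2_1.
dev_diff = tstat ** 2 * rng.chisquare(n - 1, M) / (n - 1) - rng.chisquare(1, M)

print(np.quantile(dev_diff, [0.025, 0.975]))   # approximately [-2.26, 6.27]
```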
FIGURE 14.6
Deviance difference distribution, null minus alternative
FIGURE 14.7
Exponential (red), Weibull (black), gamma (orange) and lognormal (green) deviance distri-
butions, phones data
The separation between the first two and the exponential is 18 deviance units. It is almost impossible for a random draw of the exponential deviance to be smaller than a random draw of the Weibull or gamma. The differences from the lognormal are even greater.
The Weibull and gamma are the only parametric models which need to be considered,
and their conclusions about the 80th quantile are very close, despite the slightly different
shapes of their density functions and corresponding cdfs. It is a matter for the researcher
whether the Weibull or the gamma is preferred. The analytic properties of the Weibull
quantiles (for five-year survivals) make it generally preferable in medical or life-testing
applications.
What about the multinomial? The median 80th quantile for the multinomial is close to
the medians for the Weibull and gamma, but the credible interval is more conservative –
longer tails in both directions. The increased precision of the parametric models comes at
the cost of the model assumptions.
A technical difficulty with the comparison of the multinomial deviance with that of any
continuous-parameter model is that the likelihoods are not comparable, because the support
for the multinomial is at a discrete set of probabilities, while the continuous distributions
have continuous support for their parameters. We do not discuss this further. The importance
of the multinomial is to compare its credible interval with the credible intervals for the
parametric models.
We do not need the formal model-averaging procedure here, but state it below. To construct M draws of the averaged 80th quantile, we repeat the following sequence for each m = 1, …, M (a coded sketch follows the list):
• draw at random a value of the Bayesian deviance D_j^[m] from each model j;
• convert the set of deviances to a set of model probabilities p_j^[m] through Bayes’s theorem;
• draw at random a model M_j^[m] using the model probabilities p_j^[m];
• for the selected model M_j^[m], make one random draw y_{j,80}^[m] from the posterior distribution of the 80th quantile of this given model.
The set of random draws y_{j,80}^[m] defines the posterior distribution of the averaged 80th quantile.
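A sketch of this sequence in code (Python with NumPy, our own illustration): deviance_draw and quantile_draw are hypothetical stand-ins for the model-specific posterior machinery, and the deviance-to-probability conversion assumes equal prior model probabilities, so that p_j is proportional to exp(−D_j/2).

import numpy as np

rng = np.random.default_rng(0)

def averaged_quantile_draws(deviance_draw, quantile_draw, J, M=10_000):
    """deviance_draw(j): one posterior draw of model j's Bayesian deviance.
    quantile_draw(j): one posterior draw of model j's 80th quantile.
    Both are hypothetical stand-ins for the model-specific analyses."""
    draws = np.empty(M)
    for m in range(M):
        D = np.array([deviance_draw(j) for j in range(J)])
        # deviances to model probabilities: with equal prior model
        # probabilities, p_j is proportional to exp(-D_j/2) (assumption)
        w = np.exp(-(D - D.min()) / 2)
        p = w / w.sum()
        j = rng.choice(J, p=p)       # draw a model
        draws[m] = quantile_draw(j)  # one quantile draw from that model
    return draws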
15
Gaussian linear regression models
The concept of regression comes from genetics and was popularized by Sir Fran-
cis Galton during the late 19th century with the publication of Regression towards
mediocrity in hereditary stature (Galton 1886). Galton observed that extreme char-
acteristics (e.g., height) in parents are not passed on completely to their offspring.
Rather, the characteristics in the offspring regress towards a mediocre point (a point
which has since been identified as the mean). By measuring the heights of hundreds
of people, he was able to quantify regression to the mean, and estimate the size of
the effect.
(Wikipedia)
15.1.1 Vitamin K
Vitamin K was discovered in the 1930s to have an effect on blood clotting: a lack of Vitamin K in the diet could lead to very slow blood clotting in patients with disease or injury involving haemorrhaging. The identification of Vitamin K was established by the Danish scientist Henrik
Dam, working with Schønheyder and other colleagues in Denmark, and E.A. Doisy and his
colleagues in the USA. It could be extracted from dried liver. Dam and Doisy shared the
Nobel Prize for Physiology or Medicine in 1943 for this discovery. The calibration of the effect
of Vitamin K on blood clotting was established in an experiment by Schønheyder (1936) in
a study of chickens.
Fifteen chickens were deprived of Vitamin K and then fed dried liver for three days at
a (varying) dose of x′ mg per gram weight of chick per day. At the end of this period, the
response of each chicken was measured as the concentration y ′ of a clotting agent needed
to clot samples of its blood in three minutes. The data from the 15 chickens are given in
Table 15.1, and are graphed in Figure 15.1.
There is a rapid drop of concentration with increasing dose, which then flattens out. This
decline appears more rapid than an exponential decay.
Modelling a sharp curve is not straightforward. A property of the data – the ratio of largest to smallest on each variable – suggests an alternative. Ratios over 10 for positive variables suggest a scale transformation to a logarithmic scale. This ratio is 62.5 for concentration, and 9.25 for dose. Both variables are positive, and do not have zero values. Figure 15.2 shows the data on (natural) log scales for both variables, defining y = log(concentration), x = log(dose). The plot is now nearly linear: the curvature has disappeared.

FIGURE 15.1
Concentration vs dose, Vitamin K
We model the relationship between the transformed variables by a simple linear regression
model. This can be expressed in several different ways. The most common is algebraic:
y = α + βx + ϵ.
This model structure can be interpreted as a fixed part – the α + βx – and a random part –
the ϵ. In the fixed part, α is the intercept parameter and β is the slope parameter or regression
coefficient. The ϵ is a random variability term for departures of the observations from the
straight line defined by the fixed part of the model. This is sometimes called an “error” term,
though there is no error in recording the data. The fixed and random parts of the model are
called in engineering and communications the “signal” and the “noise”.
How do we interpret the model? The covariate (sometimes called “explanatory vari-
able”) – liver dose – is determined by the experimenter, and at a given dose x, if there
were no “error”, the concentration response y would be the linear function α + βx of x.
Concentration would be determined or caused by dose. The ϵ term represents random “noise”
or variation due to differences in the individual chickens which had not been, or could not be, controlled by the experimenter.

FIGURE 15.2
Log concentration vs log dose, Vitamin K
We need to estimate the parameters α and β. Without a probability distribution for ϵ,
we have no optimal principle for estimation. Many analysts use “ordinary least squares” –
OLS – a principle due to Gauss and Markov, which chooses the estimates α̃ and β̃ to
minimise the sum of the squared deviations of y from the “fitted” linear function α̃ + β̃x.
The same estimates are obtained as maximum likelihood estimates if we assume a Gaussian
distribution with zero mean and unknown variance σ 2 for the random terms ϵ, as we show in
the following. If this distribution is not Gaussian, then the best estimates will not be those
from OLS.
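As an illustration, the ML (equivalently OLS) estimates can be computed in a few lines. A sketch in Python with NumPy; the x and y values below are hypothetical stand-ins, not the Table 15.1 data:

import numpy as np

# hypothetical stand-ins for the log dose and log concentration of Table 15.1
x = np.log([1.0, 2.0, 4.0, 6.0, 8.0, 12.0])
y = np.array([6.0, 5.2, 4.3, 3.8, 3.4, 2.8])

xbar, ybar = x.mean(), y.mean()
beta_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
alpha_hat = ybar - beta_hat * xbar
resid = y - alpha_hat - beta_hat * x
sigma_hat = np.sqrt(np.mean(resid ** 2))  # ML estimate (divisor n)
print(alpha_hat, beta_hat, sigma_hat)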
Figure 15.3 shows the fitted line with the transformed data, and Figure 15.4 shows the
fitted values from the linear model reverse-transformed back to the original scales. It might
appear that the fitted model on the original scale “misses” the first value by a large amount,
but the transformed scale shows that this difference is neither large nor unusual on that scale.
The slightly “jerky” appearance of the fitted model is a consequence of the graph procedure
which uses straight-line segments between adjacent fitted values. This can be smoothed by
computing fitted values explicitly over a fine x grid and graphing them against the grid values,
giving a smooth curve.
FIGURE 15.3
Log concentration vs log dose and ML fitted linear model, Vitamin K
FIGURE 15.4
Concentration vs dose and ML fitted model, Vitamin K
FIGURE 15.5
Vitamin K residuals
The standard frequentist assessment of model departures from the data is based on the
residuals. We will refer to these traditional data features as the frequentist residuals: as we
will see, the Bayesian analysis uses a different definition, which we will call the Bayesian
residuals. The frequentist residual eᵢ for the ith observation is defined by eᵢ = yᵢ − α̂ − β̂xᵢ, where α̂ and β̂ are the maximum likelihood estimates. It is easily shown that the frequentist
residuals are uncorrelated with the covariate x: the linear dependence has been accounted
for by the regression model fitting.
A traditional frequentist approach is to treat the frequentist residuals ei as though they
were the unobserved ϵi , to plot them against x to see if any structure remains in the residuals,
and to plot the Gaussian quantiles of the empirical cdf of the residuals against them, to assess
departures from the Gaussian distribution.
This is not quite correct, since the frequentist residuals are correlated (they sum to zero)
and have different variances as well. Figure 15.5 shows the frequentist residuals (from the
log-transformed model) plotted against x, and Figure 15.6 shows their quantile plot. The
residuals show a random pattern against log dose, expected since they are uncorrelated with
the covariate, and their sample quantiles are nearly linear, supporting a Gaussian distribution
for the ϵi . There is no evidence casting doubt on the linearity of the model or the Gaussian
distribution assumption for the ϵi , though in a sample of 15 even large departures of the
frequentist residuals from the Gaussian straight line might not be persuasive. We formalise
the Bayesian residuals in a later section.
FIGURE 15.6
Vitamin K residual quantiles
We have a sample of size n from an experiment in which each response yi has a corresponding
covariate xi related to yi through the linear regression model. The “error” terms ϵi are
random values from the Gaussian distribution N (0, σ 2 ), independent of the xi . So the model
can be written alternatively as
yi | xi ∼ N (µi , σ 2 ),
µi = α + βxi .
Here the symbol | is a conditioning sign, read as “given” – the response yi is conditioned on
xi explicitly through the model. This representation will be used repeatedly in the book. It
represents the separation of the “random” part of the model – the probability distribution –
from the “fixed” or “structural” part of the model: the relation between the probability
model parameters and the covariates.
The likelihood for the model parameters α, β and σ is
\[
L(\alpha, \beta, \sigma)
= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(y_i - \mu_i)^2}{2\sigma^2} \right] \delta
= \left( \frac{\delta}{\sqrt{2\pi}} \right)^{\!n} \frac{1}{\sigma^n}
  \exp\left[ -\frac{\sum_{i=1}^n (y_i - \mu_i)^2}{2\sigma^2} \right].
\]
Here δ is the measurement precision with which y is measured. The likelihood is based on measurement of y with high precision, so the discrete probability of each yᵢ can be approximated accurately by the density ordinate rectangle with width δ.
The sum of squares term \(\sum_{i=1}^n (y_i - \mu_i)^2\) in the likelihood appears in some form in every
Gaussian model, and has a standard re-expression. We introduce vector notation for the
parameters and the covariates. We write θ = (α, β)′ for the column vector of the parameters,
and xi = (1, xi )′ for the column vector of the covariates for the ith data value (treating 1 as
a constant covariate). Then the model can be written as
yi | xi ∼ N (µi , σ 2 ),
µi = θ ′ xi .
We write \(SS(\theta) = \sum_{i=1}^n (y_i - \theta' x_i)^2\) for the sum of squares. Writing y = (y₁, …, yₙ)′, a column vector of length n, and X = (x₁, …, xₙ)′, an n × 2 matrix, this can be expressed as
\[
SS(\theta) = \sum_{i=1}^n (y_i - \theta' x_i)^2
= (y - X\theta)'(y - X\theta)
= y'y - 2y'X\theta + \theta' X'X\, \theta .
\]
The p×p matrix X′ X is called the sum of squares and cross-products matrix of the covariates
with each other, and the p × 1 vector X′ y is called the sum of cross-products vector, of the
covariates with the response.
The first term is the residual sum of squares RSS, the second is the regression sum of squares
RegSS. The likelihood can be decomposed correspondingly (see the following).
It is important to note (as in the simple Gaussian mean case) that the log-likelihood is
not quadratic in all the parameters, though SS(θ) is quadratic in θ. We can write these and
the ML estimates explicitly in the simple linear regression model:
\[
X'X = \begin{pmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{pmatrix}
= n \begin{pmatrix} 1 & \bar x \\ \bar x & \overline{x^2} \end{pmatrix},
\qquad
[X'X]^{-1} = \frac{1}{\sum_{i=1}^n (x_i - \bar x)^2}
\begin{pmatrix} \overline{x^2} & -\bar x \\ -\bar x & 1 \end{pmatrix},
\qquad
X'y = n \begin{pmatrix} \bar y \\ \overline{xy} \end{pmatrix},
\]
where \(\overline{x^2} = \sum_{i=1}^n x_i^2/n\) and \(\overline{xy} = \sum_{i=1}^n x_i y_i/n\). After some algebraic simplification,
\[
\hat\beta = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}
= \frac{\overline{xy} - \bar x\,\bar y}{\overline{x^2} - \bar x^2},
\qquad
\hat\alpha = \bar y - \hat\beta\,\bar x .
\]
The information matrix (evaluated at the MLEs) is block diagonal: the regression parameter estimates are uncorrelated with the σ estimate. The inverse of the information block for the regression parameters is \(\hat\sigma^2 [X'X]^{-1}\), so
\[
\widehat{\mathrm{Var}}[\hat\alpha] = \hat\sigma^2\, \overline{x^2} \Big/ \sum_{i=1}^n (x_i - \bar x)^2,
\qquad
\widehat{\mathrm{Var}}[\hat\beta] = \hat\sigma^2 \Big/ \sum_{i=1}^n (x_i - \bar x)^2 .
\]
However, the parameter estimates α̂ and β̂ are correlated, with correlation \(r = -\bar x / \sqrt{\overline{x^2}}\). This has the opposite sign to x̄. It depends on the location (sample mean) of the covariate x, but not its scale (sample standard deviation). An important treatment of the data, centring (or centering) the covariate by subtracting the mean x̄: x′ᵢ = xᵢ − x̄, gives uncorrelated parameter estimates, with the intercept estimate now being ȳ, and the slope estimate \(\sum_i x'_i y_i \big/ \sum_i x_i'^2\).
The Bayesian and frequentist interpretations of this partition are parallel to those in the
Gaussian mean model:
• (Bayesian) with a prior on σ of 1/σ², and a flat prior on θ, the marginal posterior distribution of RSS/σ² is χ²_{n−2}, and the conditional posterior distribution of θ given σ² is the bivariate Gaussian N₂(θ̂, σ²[X′X]⁻¹);
• (frequentist) the first term is the χ²_{n−2} distribution of RSS/σ², and the second is the bivariate Gaussian distribution of θ̂: N₂(θ, σ²[X′X]⁻¹).
So as in the Gaussian mean model, the two inferences about the regression parameters and
the variance are identical, subject to the prior specifications above. We need not distinguish
the two approaches in the Gaussian model. The advantage of the centering of x is that it
eliminates the covariance, so that statements about the regression coefficient and the inter-
cept are independent: the two parameters have independent posterior Gaussian distributions
given σ 2 .
Inference about the regression parameters is usually based on the marginal t-distributions
of the parameters or their ML estimates; the joint distribution gives credibility or confidence
ellipses in the joint parameter space. Functions of the model parameters are straightforward.
One of particular interest is the mean function µ(x) = α + βx. Figure 15.7 shows the median
(line) and 95% mean precision region (green curves) from 10,000 draws of this function. The
uncertainty in the mean function increases away from the mean x̄ in both directions. This
is characteristic of regression models in general.
FIGURE 15.7
Vitamin K ML mean function (line) and 95% precision region bounds (green curves)
might affect the conclusions. We can guard against this possibility by using the Bayesian
bootstrap (BB) analysis from Chapter 12. The parametric functions of inference are the
population regression parameters – the population analogues of the sample estimates. These
are defined by
\[
B = \frac{\sum_{I=1}^N (X_I - \bar X)(Y_I - \bar Y)}{\sum_{I=1}^N (X_I - \bar X)^2},
\qquad
A = \bar Y - B\bar X ,
\]
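A sketch of the BB computation (Python/NumPy, our own illustration): Dirichlet(1, …, 1) weights over the n observations replace the equal weights 1/n, and each weight vector gives one posterior draw of (A, B).

import numpy as np

def bb_regression_draws(x, y, M=10_000, seed=2):
    """Bayesian bootstrap posterior draws of the population parameters
    (A, B): Dirichlet(1,...,1) weights on the n observations, then the
    weighted analogues of the population definitions above."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = rng.dirichlet(np.ones(len(y)), size=M)  # M weight vectors
    xb, yb = w @ x, w @ y                       # weighted means
    B = (w @ (x * y) - xb * yb) / (w @ (x ** 2) - xb ** 2)
    A = yb - B * xb
    return A, B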
FIGURE 15.8
Posterior of α
FIGURE 15.9
Posterior of β
FIGURE 15.10
Posterior of σ
Gaussian analysis! – only specify the Gaussian as a tentative model for the log response. In
fact we do not even need to have a tentative probability model – it is sufficient to take the
population definitions of the regression parameters A and B as the population parameters of
interest for analysis. The whole BB analysis can be carried out without any formal statement
about a restrictive probability model for the response. This parallels the survey sampling
approach to analysis, but bases it on the non-restrictive multinomial model. (We should add
here that there is a circularity in the argument: the definitions of the population parameters
of interest are themselves implied by the Gaussian model.)
We make use of this feature of the BB analysis in more complex models.
where RSS₁ is the residual sum of squares from the regression model. Algebraically,
\[
r^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat\alpha - \hat\beta x_i)^2}{\sum_{i=1}^n (y_i - \bar y)^2}
= 1 - \frac{\sum_{i=1}^n [(y_i - \bar y) - \hat\beta(x_i - \bar x)]^2}{\sum_{i=1}^n (y_i - \bar y)^2}
= \frac{\left[\sum_{i=1}^n (y_i - \bar y)(x_i - \bar x)\right]^2}{\sum_{i=1}^n (y_i - \bar y)^2\, \sum_{i=1}^n (x_i - \bar x)^2},
\]
where the sign of r is the sign of the numerator sum of cross-products, or of the slope coefficient β̂.
The squared correlation is restricted to the interval [0,1]; r itself is restricted to the
interval [−1,1]. For the Vitamin K sample, the correlation between the log-transformed dose
and response variables is −0.937; the squared correlation is 0.878. These values are extremely
high in magnitude: we will see in most surveys (as opposed to experiments) that correlations
are much lower. The correlation between the original variables before the transformation is
much smaller: on the original scale the relation between the variables is strongly non-linear.
15.7.2 Prediction
In designed studies like that for Vitamin K, the experimenter may want to use the fitted
regression model to make a prediction about the response value y0 for a new “predictor”
variable value x0 . We have already seen (in §10.3.1) the prediction of a new observation given
a sample of response values only. The Bayesian and frequentist analyses of the regression
extension are identical and very straightforward.
We observe y and x related through the regression model, and we have a new x0 and want
to make a predictive inference about the corresponding y0 , through its posterior predictive
distribution, derived from the posterior distribution of α + βx₀. The Bayesian predictive inference is constructed from the conditional distribution of y₀ given x₀, α, β, σ. We have
\[
y_0 \mid x_0, \alpha, \beta, \sigma \sim N(\alpha + \beta x_0, \sigma^2).
\]
The posterior predictive distribution can be obtained analytically by integrating out successively α, β and σ:
\[
f(y_0 \mid x_0, y, x) = c \cdot \int f(\sigma \mid y, x)
\left[ \int\!\!\int f(y_0 \mid x_0, \alpha, \beta, \sigma)\, f(\alpha, \beta \mid \sigma, y, x)\, d\alpha\, d\beta \right] d\sigma .
\]
However, posterior simulation gives a very simple (and more generally useful) alternative:
• we make M random draws σ^{2[m]} from the marginal posterior distribution of σ²;
• then for each draw m we make a single random draw (α^[m], β^[m]) from the conditional bivariate Gaussian distribution of (α, β) given σ^[m];
• finally for each draw m we make a single random draw y₀^[m] from the conditional Gaussian distribution of y₀ given x₀, α^[m], β^[m], σ^[m]: N(α^[m] + β^[m]x₀, σ^{2[m]}).
The M values y₀^[m] are a random sample from the posterior predictive distribution.
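These steps translate directly into code. A sketch (Python/NumPy, our own illustration); theta_hat, XtX_inv, RSS1 and n are assumed already computed from the fitted model:

import numpy as np

def predictive_draws(theta_hat, XtX_inv, RSS1, n, x0, M=10_000, seed=3):
    """Posterior predictive draws of y0 at covariate vector x0 = (1, x0)'.
    Priors as in the text: flat on (alpha, beta), 1/sigma^2 on sigma."""
    rng = np.random.default_rng(seed)
    x0 = np.asarray(x0, float)
    sigma2 = RSS1 / rng.chisquare(n - 2, M)   # step 1: sigma^2 draws
    y0 = np.empty(M)
    for m in range(M):
        # step 2: (alpha, beta) | sigma^2 ~ N2(theta_hat, sigma^2 [X'X]^{-1})
        theta = rng.multivariate_normal(theta_hat, sigma2[m] * XtX_inv)
        # step 3: y0 | x0, theta, sigma ~ N(theta'x0, sigma^2)
        y0[m] = rng.normal(theta @ x0, np.sqrt(sigma2[m]))
    return y0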
The frequentist formulation is in three steps, as before:
\[
\hat\theta' x \sim N(\theta' x, \sigma^2 h), \qquad
y_0 - \hat\theta' x \sim N(0, \sigma^2(1 + h)), \qquad
RSS_1/\sigma^2 \sim \chi^2_{n-2} \ \ \text{independently},
\]
so that
\[
(y_0 - \hat\theta' x) \big/ s\sqrt{1 + h} \sim t_{n-2} .
\]
15.7.3 Example
We want to predict the Vitamin K concentration for a new dose level of x₀′ = 5 mg per gram weight of chick per day. We transform to the log dose scale: x₀ = log 5 = 1.61, then centre this value by subtracting the original sample mean x̄ = 1.686, so x₀c = −0.076. The posterior
prediction median or mean of log (y) is 3.886, with 95% prediction interval [3.385, 4.387].
The corresponding prediction values for y are 48.7 and [29.5, 80.4].
There are several cautions about this procedure:
• No prediction should be given without its full posterior predictive distribution, or a
credible interval for it. A point prediction, like a point estimate, is useless.
• The precision of prediction from the posterior predictive depends on the validity of the
probability model for the response variables.
• Prediction from the model assumes that the model applies to the new value; in particular, that the new x value lies within the range of the existing predictor variables.
Prediction of a response variable from a covariate value outside the range of the observed
covariates, a common practice in extrapolating time series into the future, is hazardous,
because it assumes a “steady state”. We have no way of knowing if the “steady state” – the
regression – will continue unchanged beyond the observed data. For this reason predictions
without precisions are sometimes called “projections”, which are assumed not to require
measures of precision. They do require them!
\[
\mathrm{Var}[\hat\eta_i] = \sigma^2 P_i = \mathrm{Var}[\hat\beta' x_i] = x_i' C x_i ,
\]
yi − α − βxi ∼ N (0, σ 2 ),
(yi − α − βxi )/σ ∼ N (0, 1).
We denote the Bayesian (standardised ) residuals by ϵi = (yi − α − βxi )/σ, which have
independent standard Gaussian distributions. However, they are not observable. We generate
M independent draws of the parameters α, β, σ, and for each observation (yi , xi ) we construct
the standardised residuals ϵi at xi :
ϵᵢ^[m] = (yᵢ − α^[m] − β^[m] xᵢ)/σ^[m].
At each xᵢ we sort the M draws ϵᵢ^[m] and find the 2.5% and 97.5% quantiles of these ordered draws. The region within the upper and lower quantiles is a 95% credible region for the cdf of these residuals. We show the quantiles joined by red straight line segments, together with the median straight line, and the frequentist residuals as circles in Figure 15.11.
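A sketch of this construction (Python/NumPy, our own illustration), given arrays alpha, beta, sigma of M joint posterior draws, obtained as in the prediction sketch above:

import numpy as np

def residual_bands(y, x, alpha, beta, sigma):
    """alpha, beta, sigma: arrays of M joint posterior draws.
    Returns the 2.5%, 50% and 97.5% quantiles at each x_i of the
    standardised residuals eps_i^[m] = (y_i - alpha - beta*x_i)/sigma."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    # eps has shape (M, n): one row of residuals per posterior draw
    eps = (y[None, :] - alpha[:, None] - beta[:, None] * x[None, :]) / sigma[:, None]
    return np.quantile(eps, [0.025, 0.5, 0.975], axis=0)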
FIGURE 15.11
Vitamin K residual probits
The credible region is wide, with the Gaussian model in the middle. There is no evidence
against the Gaussian residual distribution. Joining the quantile points by straight lines is
more visible and attractive, but may give a misleading impression of our (lack of) knowledge
of the cdf between the observations.
Y ∼ N (α, σ 2 ), x = 0
∼ N (α + β, σ 2 ), x = 1.
So α is the mean in the first group of observations with x = 0, and β is the difference between
the means in the second and the first groups.
If we define x to take the two values a and b instead of 0 and 1, then β = (µ₂ − µ₁)/(b − a) and α = (bµ₁ − aµ₂)/(b − a). In particular, if a = −1/2 and b = 1/2, then β = µ₂ − µ₁ and α = (µ₁ + µ₂)/2.
The 0, 1 choice gives the simpler relationship, as we are not usually interested in the equally
weighted average mean. We will see how useful this is in the analysis of cross-classifications.
TABLE 15.2
Absence, IQ and number of family dependents, Aboriginal girls
days 14 11 2 5 5 35 22 20 13
IQ 60 60 70 86 86 81 86 93 93
deps 11 11 5 9 10 9 7 4 12
days 7 14 27 6 20 4 15 13 6 6
IQ 96 90 82 79 65 64 66 76 73 74
deps 6 5 11 6 10 7 8 3 11 9
days 5 16 17 46 43 40 16 14 32
IQ 70 98 84 100 84 91 105 104 76
deps 7 14 3 10 7 7 4 11 11
days 57 6 53 23 8 34 36 38 23 28
IQ 83 73 92 93 99 95 84 106 89 103
deps 10 9 9 13 7 9 10 11 9 9
For IQ, the fitted regression is µ̂ = −12.48 (0.39) + 0.391 (0.178) IQ.
The 95% confidence/credible interval for the IQ slope excludes zero, while that for depen-
dents includes zero – dependents do not appear important for absence. However, the increasing variability is not accounted for in either model. We discuss how to deal with this in
Chapter 17.
FIGURE 15.12
Absence vs IQ
FIGURE 15.13
Absence vs dependents
FIGURE 15.14
IQ vs dependents
FIGURE 15.15
Absence vs IQ and fitted linear model
FIGURE 15.16
Absence vs dependents and fitted linear model
FIGURE 15.17
Probit residuals
model does not fit well. Before proceeding further with this example we give the theory for
the general p-variable case.
• The linear predictor acts as a single variable in the regression, as in the simple linear
regression model.
The model is sometimes called a single-index model, with index referring to the linear func-
tion.
We first extend the notation of both covariates and regression parameters. We define x₀ to be the vector of 1s, and rename the intercept parameter α as β₀. We write the linear predictor in vector form: \(\eta = \sum_{j=0}^p \beta_j x_j = \beta' x\). We do not need a regression coefficient on η because it is already a function of the regression coefficients β. More informatively,
\[
y_i \mid x_i \sim N(\mu_i, \sigma^2), \qquad \mu_i = \beta' x_i .
\]
The x variables themselves do not need to be linear; they can be powers or products of other
variables, as we will see. Geometrically, the population mean values are modelled as lying in
a hyper-plane – the generalisation of a (y, x) line in two dimensions and a (y, x1 , x2 ) plane
in three dimensions.
for the sum of squares. Writing y = (y₁, …, yₙ)′, a column vector of length n, and X = (x₁, …, xₙ)′, now an n × (p + 1) matrix, this can be expressed as
\[
SS(\beta) = \sum_{i=1}^n (y_i - \beta' x_i)^2
= (y - X\beta)'(y - X\beta)
= y'y - 2y'X\beta + \beta' X'X\, \beta .
\]
The matrix X′ X is now the sum of squares matrix of the covariates, and the vector X′ y is
the sum of cross-products vector of the covariates with the response.
It is clear that the Bayesian and frequentist results will again be the same, with appro-
priate adjustment of the prior on σ 2 :
• (Bayesian) with a prior on σ of 1/σ^{p+1}, and a flat prior on β, the marginal posterior distribution of RSS/σ² is χ²_{n−p−1}, and the conditional posterior distribution of β given σ² is the (p + 1)-variate Gaussian distribution N_{p+1}(β̂, σ²[X′X]⁻¹);
• (frequentist) the first term is the χ²_{n−p−1} distribution of RSS/σ², and the second is the (p + 1)-variate Gaussian distribution of β̂: N_{p+1}(β, σ²[X′X]⁻¹).
In the Gaussian analyses described in the following, we use the frequentist terms as these
are most common in the applications of Gaussian regression analysis. We repeat that the
Bayesian analysis gives the same results.
As in the single variable model, the intercept and regression model parameter ML es-
timates (or parameters for Bayesians) are uncorrelated with the variance parameter ML
estimate (or parameter), but the regression coefficient estimates (or parameters) are in gen-
eral correlated with each other and with the intercept. By centering each of the covariates,
we can achieve independence of the intercept from the regression coefficients, but not of the
regression coefficients from each other.
the framework of omitted variables, in Chapter 16. We also examine in Chapter 17 alterna-
tive models for absence, allowing for relations between the covariates and the variability of
absence.
15.14 Interactions
The two-variable model we have examined has a strong assumption built-in: that the two
variables do not interact with each other. The meaning of interaction can be understood if
we examine the effect of varying the number of dependents on the regression of absence on
IQ. Writing the structural part of the model as µabs = β0 + β1 IQ + β2 deps, this structure is
for deps = 1 and 2,
µabs = β0 + β1 IQ + β2 ,
µabs = β0 + β1 IQ + 2β2 .
The change in the number of dependents affects only the intercept of the regression, β0 + β2
or β0 + 2β2 , and not the slope of the regression on IQ. So the two-variable regression can
be thought of as a set of parallel single-variable IQ regressions, with the vertical spacing
between them set by the deps coefficient β2 . The same argument leads to the alternative
interpretation of a set of parallel single-variable deps regressions, with the vertical spacing
between them set by the IQ coefficient β1 .
This no-interaction property can be changed by extending the model to include the interaction – the cross-product of the IQ and dependents variables:
µabs = β₀ + β₁ IQ + β₂ deps + β₃ IQ · deps.
Now the change from deps = 1 to deps = 2 has an effect on the slope of the regression on IQ as well as on the intercept:
µabs = β₀ + β₁ IQ + β₂ + β₃ IQ,
µabs = β₀ + β₁ IQ + 2β₂ + 2β₃ IQ.
It is important to note that, when the interaction IQ ∗ deps is modelled, the regression
coefficients β1 and β2 for the interacting variables IQ and deps no longer represent the slopes
of the relations between the response and these variables: they represent only the slopes of
the relations at the zero values of the other variable. This is very important in the analysis
of cross-classifications, considered in §15.18.
It is difficult to identify the possible need for an interaction term from the graphs of
absence against IQ and dependents. The interaction term has to be fitted and assessed
for necessity. In the ML fitted interaction model, the coefficient of the interaction term is
−0.0310 with SE 0.0718: the interaction is small and can be omitted. Interactions can be
extended to higher orders with three, four, . . . variables interacting. We discuss examples
below and in the GLM chapter.
15.14.1.1 ANOVA
In the analysis of designed experiments in agriculture, engineering and many other fields,
factorial experiments were used in which different factors – treatments at different levels of
some variable – were analysed as cross-classifications with main effects and interactions of
the factors. This required the partitioning of the regression sum of squares into separate and
independent components for the main effects and their interactions. This partitioning process
was invented by R.A. Fisher, who called it the Analysis of Variance, generally abbreviated
to ANOVA.
A fundamentally important difference in the use of ANOVA occurs between designed
experiments and observational studies. Fisher invented both ANOVA and the concept of
designed experiments. He saw that if the factors being investigated in the experiment could
be designed to be orthogonal – uncorrelated – the ANOVA would have a unique partition
structure, and the effects of both main effects and interactions could be established in a
single analysis through the ANOVA table. His 1935 book The Design of Experiments became
famous, and changed the design and analysis of agricultural and other industrial experiments.
In observational studies, like the Bennett investigation of emotional response in hus-
bands to suicide attempts by their wives, the possibly important factors affecting emotional
response are almost never orthogonal except by accident. This means that the ANOVA table
is not unique – it depends on the sequence in which effects are included in, or removed from,
the model. With p covariates in the model, there are p! possible permutations of the order
of their entry into the model, and even more possible models: there are 2^p possible models
with any number of covariates.
There have been several attempts to construct a single sum of squares table which cor-
rectly represents the importance of the covariates. In our view these are unsuccessful; a full
discussion of the attempts and their difficulties can be found in Aitkin (1978); it is unfor-
tunate that this discussion has been largely ignored. Fortunately, there is an alternative
approach.
the other component with mean 0 and a very large variance (the slab). These are mixed in
proportions pj and 1 − pj for covariate j, with the pj unknown with flat priors. The aim
of the mixture prior is to shrink the small MLEs towards zero, while leaving unaffected the
large MLEs; this is the same aim as backward elimination, though the shrinkage is generally
not to zero. We discuss this further in Chapter 16.
15.14.1.3 ANCOVA
In some experiments, one or more continuous covariates is part of the model, and the re-
gression of the response on these covariates could be interacted with the factorial effects.
The process of decomposition of the regression sum of squares in these models was called
the Analysis of Covariance, generally abbreviated to ANCOVA.
When all covariates are continuous, it is possible to interact covariates with each other
into a higher-dimensional surface model. In some engineering applications, the true response
is a complex multi-dimensional function, which is approximated by a response surface model
which includes powers and cross-products of the covariates.
The object of using the squared regression coefficients is to shrink the MLEs towards zero.
The “ridge” – the main diagonal of the SSP matrix – is “loaded”, augmented by the positive
product 2λσ 2 , decreasing the high correlations in the augmented matrix. Here λ is the
penalty constant which determines the penalised MLE (PMLE) θ̃ of θ:
\[
\tilde\theta = [X'X + 2\lambda\sigma^2 I]^{-1} X'y
= [I + 2\lambda\sigma^2 [X'X]^{-1}]^{-1} [X'X]^{-1} X'y
= [I + 2\lambda\sigma^2 [X'X]^{-1}]^{-1} \hat\theta .
\]
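The PMLE is a one-line computation. A sketch (Python/NumPy, our own illustration; λ and σ² are treated as given):

import numpy as np

def ridge_pmle(X, y, lam, sigma2):
    """Penalised MLE: theta_tilde = [X'X + 2*lam*sigma2*I]^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + 2 * lam * sigma2 * np.eye(p), X.T @ y)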
A yet further generalisation is the “elastic net”, which has both penalties. From a Bayesian
point of view, the parameter prior in this case is the normalised product of the Gaussian and
the Laplace.
We do not take the discussion of penalties further here, though the idea will reappear in
later models. Near-collinearity is easily handled by the omission of one or more of the highly
correlated covariates (as occurs in backward elimination), or by redefining them into a single
covariate. Actual collinearity is detected by most packages.
and the graphs do not provide a reliable indication of the importance of the covariates in
the joint model.
We do not show the individual graphs for this analysis. We begin with a “full model” of
the six covariates mentioned, then eliminate the redundant covariates in order of smallest
t-value, while |t| < 2. This leaves two of the six covariates in the model: mothers’ weight
(mwt) and mothers’ smoking (msm).
The ML fitted values and (SEs) are
µ̂ = 4.65 (0.61) + 0.0116 (0.0019) mwt − 0.360 (0.091) msm + 0.0565 (0.0199) mage
  + 0.0186 (0.0074) inc − 0.000644 (0.000249) mage.inc
Smoking mothers have a mean boy birthweight of 0.36 pounds (164 grams) less than non-
smoking or quit mothers, with 95% confidence or credible interval [0.188,0.536] pounds,
or [85,243] grams. This is a birthweight reduction of about 5% relative to average boy
birthweight, and about 1/3 of the residual standard deviation from the model.
An increase of ten pounds in mother’s weight is associated with a mean increase of 0.116
pounds, or 53 grams in birthweight. Mothers’ age and family income interact on birthweight:
both are positively related to birthweight, but their effects are reduced at high values of
either. Figure 15.18 graphs boy birthweight against mother's weight, and Figure 15.19 identifies mother's smoking (red circles: smoker, green circles: non-smoker).

FIGURE 15.18
Boy birthweights against mother weights
The fitted values from a model with just smoking and log mother's weight are shown
as two parallel red and green curves with a vertical spacing of 0.36 pounds. The additional
variation from the mother’s age and family income is not shown here: it remains in the
variation about the fitted terms. The log transformation of mother’s weight gives a slightly
better fit. Without the log transformation the fitted lines are very close to the curves.
It is very difficult to see, without the two curves, any difference between the red and
green circles: they appear to be randomly mixed. The difference in mean birthweight with
smoking is small, relative to the variability, but the sample size is so large that it is real. This
research study established that mother's smoking negatively affected boy birthweights. It is
the mother’s smoking environment, not the number of cigarettes smoked, that is apparently
important for lower boy baby birthweights.
Could there be unobserved variables which explain some of the remaining variation? The
mixture model for overdispersion identifies only the largest boy birthweight as defining a
component; this has no effect on the regression model for the remaining 647 boys. The full
modelling of girl birthweights is left as an exercise for the student.
FIGURE 15.19
Boy birthweights against mother weights and fitted model. Red smoker, green non-smoker or quit
• How does the child’s intelligence, as measured by the Peabody and Raven tests, relate to
birthweight, family income and mother’s and father’s ages, occupation and education?
We examine this question for the StatLab girls.
The two intelligence tests from the 1960s assessed different aspects of intelligence. The
Peabody test (its full name was the Peabody Picture Vocabulary Test) consisted of 175
vocabulary items of generally increasing difficulty. The child listened to a word uttered by
the interviewer, and then selected one of four pictures that best described the word’s mean-
ing. All of the questions on the Raven test (its full name was the Raven Progressive Matrices
Test) consisted of visual geometric designs with a missing piece. The child was given six to
eight choices to pick from and fill in the missing piece. How the test results were coded into
a score is not given on the websites of these tests (it is given in the test manuals).
As these are tests of different aspects of the child’s intelligence, we would expect them to
be fairly highly correlated. Figure 15.20 graphs the Peabody score against the Raven score
for the girls. It is clear that the tests are fairly highly correlated. However, the variability of
the Peabody score is increasing with increasing Raven score – the data points “fan out” as
Raven score increases. If we were to regress Peabody score on Raven score we would need to
account for this pattern of variability. We may need this also for regressing Peabody score on
other possible covariates. In Chapter 17 we discuss how to use the double GLM to do this.
In this book we consider each intelligence measure separately. We first model the girls’
Raven scores on birthweight, family income at both birth and test, and mother and father
ages, occupation and education, the last two as categories. Many packages require factors
to have non-zero level codes, for which occupation and education need to be increased by 1.
We do not give complete details of the model reduction here.
Mother and father occupation, mother and father age and child birthweight were all irrel-
evant, as was family income at both birth and test. However, education level was important
for both mother (MED) and father (FED), showing a steady increase in Peabody mean score with increasing education level.

FIGURE 15.20
Girl Peabody score against Raven score
By defining a new education variable rather than a factor, and then its square, cube and
fourth power, we can model the education relation as a polynomial, of order necessary to fit
the Peabody values. A linear term is sufficient to fit the relation. The ML fitted model is
µ̂_Rave = 13.63 (1.46) + 2.667 (0.438) MEDL + 2.196 (0.383) FEDL.
The mother and father linear education terms differ by one SE of either coefficient, so they
can be reduced to a common coefficient, by defining a new linear term MFEDL - “Total
MFL” - as the sum of the two. This gives a final model
µ̂_Rave = 13.75 (1.44) + 2.411 (0.189) MFEDL.
Each step up the Total MFL scale corresponds to a 2.4 step increase in mean Raven score.
The fitted model is shown with the Total MFL values in Figure 15.21. Are the model resid-
uals reasonably near Gaussian? The frequentist residuals are shown on the probit scale in
Figure 15.22. The Gaussian model looks acceptable.
We repeat the modelling with the girls’ Peabody scores on birthweight, family income
at both birth and test and mother and father ages, occupation and education. As with the
Raven score, mother and father occupation, family income at both birth and test and mother
age and child birthweight were irrelevant. However father age was important, and as before
education level was important for both mother and father, showing the same form of steady
increase in mean score with increasing education level. The composite Total MFL score again
fitted linearly. The final model was
FIGURE 15.21
Girl Raven score against Total MFL, with ML fitted model
FIGURE 15.22
Girl Raven score residuals
The increase in mean Peabody score with an increase of 1 in Total MFL was very close to
that for the Raven score. Although the scales are different, as can be seen from Figure 15.20,
the ranges (maximum-minimum) are nearly the same. A probit plot of the residuals (not
shown) had some curvature. A log transformation of the Peabody score gave more nearly
Gaussian residuals, but did not change the conclusions.
So the level of both parents’ education has a clear relation to the aspects of the girls’
intelligence measured by both the Raven and Peabody tests. Father’s age is important for
the Peabody test.
deviance = 15.316
estimate s.e. t parameter
1 1.999 0.3573 1
2 1.020 0.6796 1.50 G2
3 1.135 0.5096 2.23 G3
4 0.4263 0.8274 0.52 N2
5 -0.07625 0.2409 -0.32 PO
6 -0.1286 1.292 -0.10 G2.N2
7 0.3215 1.401 0.23 G3.N2
8 -0.4803 0.4282 -1.12 G2.PO
9 -0.5271 0.3548 -1.49 G3.PO
10 -0.4463 0.6373 -0.70 N2.PO
11 0.5294 0.9501 0.56 G2.N2.PO
12 0.3685 0.8851 0.42 G3.N2.PO
scale parameter 0.2785
The “scale parameter” is the “Restricted” ML (REML) estimate of σ 2 (the posterior mode
with the default prior). We proceed with backward elimination, but with a hierarchical order
of main effects and interactions, starting with the highest-order interactions. The t-values of
both the three-way interaction terms are small, so they may be omitted.
deviance = 15.411
estimate s.e. t parameter
1 2.056 0.3370 1
2 0.8676 0.6028 G2
3 1.058 0.4648 G3
4 0.07410 0.5245 N2
5 -0.1172 0.2260 PO
6 0.5458 0.4207 1.30 G2.N2
7 0.8206 0.4158 1.97 G3.N2
8 -0.3780 0.3753 -1.01 G2.PO
9 -0.4714 0.3201 -1.47 G3.PO
10 -0.1596 0.3691 -0.43 N2.PO
scale parameter 0.2704
We examine next only the two-way interactions. We omit N2.PO. To save text space we
show only the interaction next omitted:
deviance = 15.461
8 -0.3955 0.3705 -1.07 G2.PO
9 -0.4773 0.3176 -1.50 G3.PO
scale parameter 0.2666
The scale parameter decreases if the omitted variable has a t of less than 1, otherwise it
increases. After these two interactions are omitted, we see another possibility for simplifica-
tion:
deviance = 16.131
estimate s.e. t parameter
1 2.456 0.2286 1
2 0.3240 0.2093 1.55 G2
3 0.4136 0.1665 2.48 G3
4 -0.1745 0.2608 -0.67 N2
5 -0.4025 0.1407 -2.86 PO
6 0.6232 0.4056 1.54 G2.N2
7 0.6572 0.3470 1.89 G3.N2
scale parameter 0.2688
The G.N interaction terms are rather large. Rather than omitting them both, we first note
that they are very similar in value, as are the “main effects” G2 and G3. This suggests that
the true and false control group dichotomy may not be relevant in the interaction with N.
We examine this by defining a new dummy variable G23 which combines these categories,
corresponding to G2 or G3, that is, not G1. Replacing the two G.N interactions by a single
G23.N2 interaction gives:
deviance = 16.133
estimate s.e. t parameter
1 2.450 0.2180 1
2 0.9618 0.3015 3.19 G2
3 1.062 0.2812 3.78 G3
4 -0.1737 0.2585 -0.67 N2
5 -0.3986 0.1323 -3.01 PO
6 0.6451 0.3150 2.05 G23.N2
scale parameter 0.2645
The reparametrisation of the interaction changes the G2 and G3 estimates by 0.63, the vari-
ation suppressed by the single G23.N2 interaction. We now encounter a common situation:
the N2 “main effect” is small, but it has a substantial interaction with G23. This means
that the N “effect” (which is not averaged across the husband groups, but is the nation effect in the overdose group alone) is small in the overdose group but larger in the true and false control groups. We
can then set it to zero in the overdose group by eliminating the N2 dummy:
deviance = 16.252
estimate s.e. t parameter
1 2.403 0.2055 1
2 0.8199 0.2143 3.99 G2
3 0.9198 0.1841 5.00 G3
4 -0.3895 0.1310 -2.97 PO
5 0.4684 0.1729 2.71 G23.N2
scale parameter 0.2621
We see again that the main effects for G2 and G3 are very similar, and we equate them in
the same way, by replacing them by G23:
deviance = 16.341
estimate s.e. t parameter
1 2.399 0.2043 1
2 -0.3870 0.1302 -2.97 PO
3 0.4192 0.1402 2.99 G23
4 0.4712 0.1719 2.74 G23.N2
scale parameter 0.2594
At this point no further variable elimination can occur. The coefficients for G2 and G3
change with the reparametrisation of the dummy variables for the category levels.
The interpretation is straightforward:
• Husbands with a previous occurrence are much less affectionate than those without a
prior occurrence, consistently over the three groups.
• The true and false control husbands are substantially more affectionate than the overdose
husbands.
• The British husbands are substantially more affectionate than the Australian husbands
in the true and false control groups.
We use the terms “much less” and “substantially more” here without any natural scale of
affection. We can construct a scale in terms of standard deviation units differences (called
effect sizes in some fields) among the classifications. We first smooth the coefficients, by
noting that they are quite close in magnitude (in terms of their variabilities), and could be
equated in a simpler model. We define a variable lin, by
lin = −PO + G23 + G23*N2, and fit this variable instead of its three components:
deviance = 16.399
estimate s.e. t parameter
1 2.035 0.0684 1
2 0.4266 0.0776 5.50 LIN
scale parameter 0.2523
TABLE 15.4
Observed and fitted mean values of affection

Nation                A                  B
Occur            NPO     PO        NPO     PO
OD     n          12      8          4      1
       mean     1.92   1.85       1.90   1.38
       fitted    2.0    1.5        2.0    1.5
FC     n           4      5          3      1
       mean     2.46   1.91       2.84   2.37
       fitted    2.5    2.0        3.0    2.5
TC     n          13      6          1      9
       mean     2.53   1.93       3.20   2.58
       fitted    2.5    2.0        3.0    2.5
We have been able to represent the mean affection variation by a single scale or index,
which has a set of steps as the covariates change. To make the scale values simpler we do
some small rounding, of the estimates within their variability, and the scale parameter. The residual standard deviation is √0.2523 = 0.502, which we round to 0.5, and we round the
lin coefficient by one standard error to 0.5, and the intercept by half a standard error to
2.0. We fit this constrained model lin2 = 2.0 + 0.5 lin as an offset – a variable with a fixed
regression coefficient of 1 – and without an intercept, and tabulate the fitted values by the
cross-classifying variables:
deviance = 16.627
-- No parameters to display
scale parameter 0.2482
We can summarise this table by noting that the fitted values define a four-point scale of
mean affection, with a spacing of one standard deviation (0.5) between the scale points; the
number of fathers at each scale point is given in parentheses:
1.5 : overdoses with a previous occurrence (9)
2.0 : overdoses with no previous occurrence (16);
Australian controls with a previous occurrence (11)
2.5 : Australian controls with no previous occurrence (17);
British controls with a previous occurrence (10)
3.0 : British controls with no previous occurrence (4)
There is no distinction between true and false controls in affection. The 1.5 gap between the
most and least affectionate scale groups is three standard deviations on the affection scale –
a very wide range.
We conclude by examining the Gaussian model assumption through the residuals from
the final model. Figures 15.23 and 15.24 give the cdf and probit plots of these residuals. The
probit plot shows some curvature.
A final assessment can be made by posterior Dirichlet weighting of the three-variable final
model, for the posterior distribution of the model parameters. The medians and 95% credible
intervals for the parameters from 10,000 draws are given in the left panel of Table 15.5, and
the MLEs and 95% confidence interval endpoints in the right panel.
FIGURE 15.23
Affection residuals cdf
FIGURE 15.24
Affection residuals probit
TABLE 15.5
Posterior medians and 95% credible intervals (left), MLEs and 95% confidence intervals
(right)
quantile int PO2 G23 G23N2 MLE,ends int PO2 G23 G23N2
2.5 1.806 −0.626 0.165 0.126 2.5 1.990 −0.647 0.139 0.127
50.0 2.014 −0.387 0.422 0.474 MLE 2.399 −0.387 0.419 0.471
97.5 2.205 −0.145 0.673 0.780 97.5 2.808 −0.127 0.699 0.815
Apart from the intercept, the MLEs correspond very closely to the medians. The larger
intercept shift is induced by the smaller shifts in the draws of the other parameters. The 95%
credible intervals for the regression coefficients are 5–10% shorter than the 95% confidence
intervals. There is no serious departure from Gaussianity in the parameter posteriors (not
shown).
regression coefficient estimates and their standard errors which components can be omitted:
their omission leaves the importance of the other principal variables unaffected.
It is tempting to assume that the importance of the components as predictors will cor-
respond to the size of their variances. There is no necessary connection between these two
properties; in particular it is not true that the first principal component is the best single
predictor of y. We know that the best single predictor of y is given by the MLE β̂′x, not the first principal variable u₁.
A particular difficulty with principal component regression is that the principal com-
ponent variables have no simple interpretation. In most research studies the covariates are
chosen for their relevance to variations in the response. So the weight (including zero) they
are given by the estimated regression coefficients has a direct interpretation in the research
process. Principal variables do not have such an interpretation, since they are already linear
functions of the covariates. An interpretation can be drawn by reverse-transforming the es-
timated regression function of principal variables back to a linear function of the covariates,
through γ̂′u = γ̂′Hx. Whether this would be more precise than the backward elimination
process depends on the sample data. We do not pursue further in this book the use of
principal components in regression.
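For concreteness, a sketch of the back-transformation just described (Python/NumPy, our own illustration), taking H to be the matrix whose rows are the principal component loading vectors, so that u = Hx – an assumption matching the notation above:

import numpy as np

def pc_regression(X, y, k):
    """Regress y on the first k principal variables of X (columns centred),
    then map the coefficients back to the covariate scale via beta = H'gamma,
    since gamma'u = gamma'Hx = (H'gamma)'x."""
    Xc = X - X.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    H = evecs[:, ::-1].T                 # rows: loadings, largest variance first
    U = Xc @ H.T                         # principal variables u = Hx
    gamma, *_ = np.linalg.lstsq(U[:, :k], y - y.mean(), rcond=None)
    beta = H[:k].T @ gamma               # coefficients on the original covariates
    return gamma, beta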
16
Incomplete data and their analysis with the EM
and DA algorithms
The generality of these “missing data” algorithms is remarkable: they have changed the face
of complex analysis since 1977. For historical reasons, we give a detailed exposition of the
EM algorithm with examples, before discussing the Bayesian extension through the Data
Augmentation algorithm and other MCMC procedures.
We first express the EM algorithm in its most general form, then give a number of
applications to problems we have already encountered in previous chapters. We do not give
a fully detailed theoretical treatment of EM: many such treatments have been published
since the fundamental paper of Dempster, Laird and Rubin (1977) and the books of Rubin
(1987) and Little and Rubin (1987). A comprehensive coverage of extensions of EM can be
found in McLachlan and Krishnan (1997).
\[
g(y \mid \theta) = \int_{X(y)} f(Z \mid \theta)\, dZ,
\]
where g is the observed data likelihood given the observed data y, and f is the complete data likelihood given the complete data y, Z. The integral transformation X(y) is quite general,¹ representing any kind of constraint on the y, Z space which reduces the data information to that in the observable data y. We assume this transformation does not depend on the model parameters θ.

¹ This definition is slightly different notationally from the standard definition, for greater simplicity.
Then taking logs and differentiating the log-likelihood under the integral sign (X (y) does
not depend on θ), we have, for the observed data score vector (first derivative) sy (θ) and
Hessian matrix (second derivative) Hy (θ),
\[
s_y(\theta) = \frac{\partial \log g}{\partial \theta}
= \frac{(\partial/\partial\theta) \int_{X(y)} f(Z \mid \theta)\, dZ}{\int_{X(y)} f(Z \mid \theta)\, dZ}
= \frac{\int_{X(y)} s_Z(\theta)\, f(Z \mid \theta)\, dZ}{\int_{X(y)} f(Z \mid \theta)\, dZ}
= \int_{X(y)} s_Z(\theta)\, f(Z \mid \theta, y)\, dZ
= E[s_Z(\theta) \mid y],
\]
\[
H_y(\theta) = \frac{\partial^2 \log g}{\partial \theta\, \partial \theta'}
= \frac{\partial}{\partial \theta'} \int_{X(y)} s_Z(\theta)\, f(Z \mid \theta, y)\, dZ
= \int_{X(y)} \left[ H_Z(\theta)\, f(Z \mid \theta, y) + s_Z(\theta)\, \frac{\partial f(Z \mid \theta, y)}{\partial \theta'} \right] dZ
= \int_{X(y)} \left[ H_Z(\theta) + s_Z(\theta)\{ s_Z(\theta)' - E[s_Z(\theta)' \mid y] \} \right] f(Z \mid \theta, y)\, dZ .
\]
So
• the observed data score is equal to the conditional expectation of the complete data score,
and
• the observed data Hessian is equal to the conditional expectation of the complete data
Hessian plus the conditional covariance of the complete data score,
where the conditioning is with respect to the observed data. We can re-express the second
result in terms of the information matrix I = −H:
• the observed data information matrix Iy (θ) is equal to the conditional expectation of the
complete data information matrix IZ (θ) minus the conditional covariance of the complete
data score.
Remarkably, these results hold regardless of the form of incompleteness, so long as the aforementioned conditions hold. So algorithms like Gauss-Newton can be implemented directly
from the complete data form of the analysis, provided the conditional distribution of the
complete data score and Hessian given the observed data can be evaluated. This does not,
however, guarantee convergence without step-length or other controls on these algorithms.
It is important to note that, for the precisions of the MLEs, it is insufficient to compute the
conditional expectation of the complete data information matrix: this overstates the precision
of the estimates. It is corrected by the subtraction of the conditional covariance of the
complete data score function. The latter is quite complex even in simple models, and we do
not give details in most of the applications.
A further concern in the use of standard errors from the information matrix is that in
incomplete data models the observed data likelihood can be far from Gaussian, even when
the complete data likelihood is Gaussian. For reliable measures of precision, we need the
Bayesian extension, given in a later section.
16.3 Missingness
We have assumed throughout the previous chapters that the data are always recorded cor-
rectly and are without missing values. The latter are endemic in all kinds of studies. In this
section we describe the types of missingness and the consequent methods for dealing with
them. In order to deal with missingness, we need to define the probability structure which
leads to values being missing.
• Missing completely at random (MCAR). The values are missing through a random pro-
cess, unrelated to the variables or any model parameters. An extreme example would
be a laboratory fire which destroys some specimens and loses data on other damaged
specimens. The damage is unrelated to the data values.
• Missing at random (MAR). The values are missing through a random process which may
depend on completely observed variables, but not on the values of the variables which are
incomplete. An example would be a study of randomly selected school boys and girls in
which some members of each sex are missing some variables from one of the interviewing
and data recording sessions because of minor illness.
• Missing non-randomly (MNR). The values are missing through a process which depends
explicitly on the values which would have been observed. An example is the destruction
of lifetime records of phones with low lifetimes. Another is a study of reported incomes
in which some sample members with high incomes decline to report them.
E[Zⱼ | y, λ] = λ.
We need an initial value λ^[0] of λ to assign this value to the Zⱼ. Then the next value of λ is the complete data MLE, λ^[1] = (nȳ + 2λ^[0])/(n + 2). An obvious choice for λ^[0] is the MLE ȳ from the observed data – this should be close to the MLE from the expected complete data. Then λ^[1] = (nȳ + 2ȳ)/(n + 2) = ȳ. So the λ estimate at the next step is again ȳ – the algorithm has converged immediately!
The MLE is unaffected by the imputation of the two missing values. We might as well have
ignored them.2
Note that this process is not the same as enlarging the observed data to n+2 by replacing
each unobserved Zj by the sample mean. That process is called mean imputation. It gives
the correct MLE, but its precision is wrong: its variance is λ2 /(n + 2) instead of λ2 /n.
This is a general feature of single imputation methods – replacing unobserved or incomplete
observations by the sample mean or some other estimate. In the EM algorithm the conditional
expectation step does not provide “plug-in” estimates of the incomplete observations: it is
a device which leads to ML estimation. Another way of looking at the EM result is to see
that the two missing observations cannot contribute to the likelihood. So if the missingness
is at random, we can forget them and analyse the observed data.
We now need to take the conditional expectation of the complete data log-likelihood
log CL(λ) with respect to the unobserved T_f. This is

log CL(λ) = −(n + 1) log λ − (nȳ + T_f)/λ.
In the E step of the algorithm we replace the term Tf by its conditional expectation. The
exponential distribution has constant hazard: phones do not age in service; used is as good
as new! Knowing that the phone has survived T hours makes no difference to the expectation
of its future life, which remains at λ. So E[y | y > T ] = T + λ.
We adopt a notation for the successive replacement values T^[r] for T_f and the successive
ML estimates λ^[r] for λ. We begin with λ^[0] = ȳ = 210.8. We replace T_f by T + λ^[0] in the
expected complete data log-likelihood, E[log CL], and maximise this over λ to get the next
MLE:

E[log CL(λ)] = −(n + 1) log λ − (nȳ + T + λ^[0])/λ.

Maximising this, and iterating with the updated estimate, gives

λ^[r+1] = (nȳ + T)/(n + 1) + λ^[r]/(n + 1) = 211.8 + λ^[r]/89.

At convergence λ̂ satisfies

λ̂ = 211.8 + λ̂/89,

so

λ̂ = 211.8/(1 − 1/89) = 214.2.

² If some other initial value of λ is used, the algorithm converges after some iterations. We illustrate this
in examples.
³ The single observation is not a restriction: any number of censored observations can be analysed in the
same way.
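This fixed-point iteration is easily checked numerically. The following minimal sketch uses the constants quoted above (n = 88, ȳ = 210.8); the censoring time T is not quoted in the text, so it is reconstructed here to match the constant 211.8 = (nȳ + T)/(n + 1):

```python
import numpy as np

# EM iteration for one right-censored exponential observation:
# lambda[r+1] = (n*ybar + T + lambda[r]) / (n + 1).
n, ybar = 88, 210.8
T = 211.8 * (n + 1) - n * ybar   # censoring time implied by the text

lam = ybar                        # lambda[0] = observed-data MLE
for r in range(100):
    lam_new = (n * ybar + T + lam) / (n + 1)  # E step + M step combined
    if abs(lam_new - lam) < 1e-10:
        break
    lam = lam_new

print(round(lam, 1))              # 214.2 = 211.8/(1 - 1/89)
```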
Given the current estimates µ^[p] and σ^[p], the E-step replaces the log-likelihood for the
unobserved terms in x by its conditional expectation given the current parameter estimates.
So for the observed terms for i = 1, …, m, x_i = y_i, and for the unobserved missing terms for
i = m + 1, …, n, each (x_i − µ)² is replaced by

(σ^[p])² + (µ^[p] − µ)².
It is easily, if tediously, verified that the conditional expectation of the complete data score,
and the conditional expectation of the complete data Hessian plus the conditional covariance
of the complete data score, are given by

E[s_x(µ) | y] = (1/σ²) Σ_{i=1}^m (y_i − µ)

E[s_x(σ) | y] = −n/σ + (1/σ³) [Σ_{i=1}^m (y_i − µ)² + (n − m)σ²]

E[H_x(µ, µ) | y] = −n/σ²

E[H_x(µ, σ) | y] = −(2/σ³) [Σ_{i=1}^m (y_i − µ)]

E[H_x(σ, σ) | y] = n/σ² − (3/σ⁴) [Σ_{i=1}^m (y_i − µ)² + (n − m)σ²]

C[s_x(µ, µ) | y] = (n − m)/σ²

C[s_x(µ, σ) | y] = 0

C[s_x(σ, σ) | y] = 2(n − m)/σ²,

which give the observed data score and Hessian.
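A quick numerical check of these identities, on hypothetical data: for µ, the expected complete data Hessian plus the score covariance, −n/σ² + (n − m)/σ² = −m/σ², should equal the Hessian of the observed data log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 20, 14                      # n planned observations, m observed
sigma, mu = 2.0, 1.0
y = rng.normal(mu, sigma, size=m)  # the observed values

E_H_mumu = -n / sigma**2           # expected complete data Hessian (mu, mu)
C_s_mumu = (n - m) / sigma**2      # conditional covariance of the score
print(E_H_mumu + C_s_mumu)         # -m/sigma^2 = -3.5

def loglik(mu_):                   # observed data log-likelihood (no constants)
    return -0.5 * np.sum((y - mu_)**2) / sigma**2

h = 1e-4                           # second difference in mu
num_H = (loglik(mu + h) - 2 * loglik(mu) + loglik(mu - h)) / h**2
print(num_H)                       # agrees with -3.5
```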
This application may seem trivial, but it is a widespread practice to use multiple impu-
tation for randomly missing response values in models of all kinds. This involves unnecessary
effort since the randomly missing values do not contribute to the observed data likelihood.
· Π_{i=n+1}^{n+m} [1/(√(2π) √(σ² + β²σ_x²))] exp[−(y_i − α − βµ_x)²/(2(σ² + β²σ_x²))]

· Π_{i=1}^{n} [1/(√(2π) σ_x)] exp[−(x_i − µ_x)²/(2σ_x²)].
The second term in the likelihood mixes up the parameters for the Y | x and the X dis-
tributions, and the direct ML estimation of all the parameters now has to follow a general
Newton-Raphson approach. The number of parameters has increased by two, and the infor-
mation matrix no longer separates into independent pieces for (α, β) and σ. So even with
this simple “conjugate” model for X, direct ML estimation for the full data set requires the
full NR approach.
The EM algorithm applies directly. The complete data likelihood can be written as
CL(θ, ϕ) = Π_{i=1}^{n} [1/(√(2π)σ)] exp[−(y_i − α − βx_i)²/(2σ²)]

· Π_{i=n+1}^{n+m} [1/(√(2π)σ)] exp[−(y_i − α − βx_i*)²/(2σ²)]

· Π_{i=1}^{n} [1/(√(2π)σ_x)] exp[−(x_i − µ_x)²/(2σ_x²)]

· Π_{i=n+1}^{n+m} [1/(√(2π)σ_x)] exp[−(x_i* − µ_x)²/(2σ_x²)],
where x* denotes the missing values of x. The complete data log-likelihood is, ignoring known
constants,

log CL = −(n + m) log σ − (1/(2σ²)) [Σ_{i=1}^n (y_i − α − βx_i)² + Σ_{i=n+1}^{n+m} (y_i − α − βx_i*)²]
− (n + m) log σ_x − (1/(2σ_x²)) [Σ_{i=1}^n (x_i − µ_x)² + Σ_{i=n+1}^{n+m} (x_i* − µ_x)²].
For the E step, we need the expectations of x_i* and x_i*² given the observables y_i and the
parameters θ, ϕ. Since (y, x) are bivariate Gaussian, the conditional distribution of x_i given
y_i is Gaussian:

N(µ_x + σ_xy σ_yy⁻¹ (y_i − α − βµ_x), σ_xx − σ_xy σ_yy⁻¹ σ_yx),

which we write as N(x̃_i, V_i). So x_i* is replaced by x̃_i, and x_i*² is replaced by V_i + x̃_i². A
considerable simplification of notation is possible by defining a new “x” variable, which we
will call w:

w_i = x_i for i ≤ n,    w_i = x̃_i for i > n.
Initial estimates of the parameters can be those from the complete cases, which should be
near the final MLEs. Implementation of the algorithm can be accelerated by writing out the
ML equations for the parameters with the unobserved terms replaced appropriately. So we replace
the complete data equations

∂ log CL/∂α = (1/σ²) [Σ_{i=1}^n (y_i − α − βx_i) + Σ_{i=n+1}^{n+m} (y_i − α − βx_i*)] = 0

∂ log CL/∂β = (1/σ²) [Σ_{i=1}^n x_i(y_i − α − βx_i) + Σ_{i=n+1}^{n+m} x_i*(y_i − α − βx_i*)] = 0

∂ log CL/∂σ = −(n + m)/σ + (1/σ³) [Σ_{i=1}^n (y_i − α − βx_i)² + Σ_{i=n+1}^{n+m} (y_i − α − βx_i*)²] = 0

∂ log CL/∂µ_x = (1/σ_x²) [Σ_{i=1}^n (x_i − µ_x) + Σ_{i=n+1}^{n+m} (x_i* − µ_x)] = 0

∂ log CL/∂σ_x = −(n + m)/σ_x + (1/σ_x³) [Σ_{i=1}^n (x_i − µ_x)² + Σ_{i=n+1}^{n+m} (x_i* − µ_x)²] = 0
by the expected complete data equations with the new notation, and rearrange to define the
MLEs:
µ̂_x = w̄;

σ̂_x² = [Σ_{i=1}^{n+m} (w_i − w̄)² + Σ_{i=n+1}^{n+m} V_i]/(n + m) = s_ww + V̄;

β̂ = [Σ_{i=1}^{n+m} (w_i − w̄)(y_i − ȳ)] / [Σ_{i=1}^{n+m} (w_i − w̄)² + Σ_{i=n+1}^{n+m} V_i] = s_wy/[s_ww + V̄];

α̂ = ȳ − w̄β̂;

σ̂² = Σ_{i=1}^{n+m} (y_i − α̂ − β̂w_i)²/(n + m) + β̂² Σ_{i=n+1}^{n+m} V_i/(n + m) = s_y|w + β̂²V̄,

where

w̄ = Σ_{i=1}^{n+m} w_i/(n + m);    V̄ = Σ_{i=n+1}^{n+m} V_i/(n + m);

s_ww = Σ_{i=1}^{n+m} (w_i − w̄)²/(n + m);

s_wy = Σ_{i=1}^{n+m} (w_i − w̄)(y_i − ȳ)/(n + m);

s_y|w = Σ_{i=1}^{n+m} (y_i − α̂ − β̂w_i)²/(n + m).
The correction term Σ_{i=n+1}^{n+m} V_i has the effect of a diagonal loading on the SSP matrix
for the two models, and can be programmed in this way. We do not give further details.
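A sketch of the full iteration on hypothetical data follows; in this model the conditional variance V_i is the same for all incomplete cases, which simplifies the correction term:

```python
import numpy as np

# EM for simple linear regression with a Gaussian covariate missing at
# random, using the w_i / V_i updates given above (hypothetical data).
rng = np.random.default_rng(2)
n, m = 60, 20                       # n complete cases, m with x missing
x_full = rng.normal(0.5, 1.5, n + m)
y = 1.0 + 2.0 * x_full + rng.normal(0, 1.0, n + m)
x_obs = x_full[:n]                  # x observed only for the first n cases

# initial estimates from the complete cases
a, b = np.polyfit(x_obs, y[:n], 1)[::-1]
s2 = np.var(y[:n] - a - b * x_obs)
mx, s2x = x_obs.mean(), x_obs.var()

for _ in range(200):
    # E step: conditional mean and variance of missing x given y
    syy = s2 + b**2 * s2x
    xtilde = mx + (b * s2x / syy) * (y[n:] - a - b * mx)
    V = s2x - (b * s2x)**2 / syy
    w = np.concatenate([x_obs, xtilde])
    Vbar = m * V / (n + m)
    # M step: closed-form updates with the diagonal-loading correction
    wbar, ybar = w.mean(), y.mean()
    sww = np.mean((w - wbar)**2)
    swy = np.mean((w - wbar) * (y - ybar))
    mx, s2x = wbar, sww + Vbar
    b = swy / (sww + Vbar)
    a = ybar - wbar * b
    s2 = np.mean((y - a - b * w)**2) + b**2 * Vbar

print(round(a, 2), round(b, 2))     # near the complete-case estimates
```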
For the example in §16.7.2, we use the complete cases to provide an empirical distribution of the
missing covariate values. The d distinct observed values of the covariate in the complete
cases, which we denote by u_j, have counts n_j at the u_j, with unknown covariate population
proportions p_j. The EM algorithm applies as before, but the complete data likelihood is
different:
CL(θ, ϕ) = Π_{i=1}^{n} [1/(√(2π)σ)] exp[−(y_i − α − βx_i)²/(2σ²)]

· Π_{i=n+1}^{n+m} [1/(√(2π)σ)] exp[−(y_i − α − βx_i*)²/(2σ²)]

· Π_{j=1}^{d} p_j^{n_j} · Π_{j=1}^{d} Π_{i=n+1}^{n+m} p_j^{Z_ij}

= Π_{i=1}^{n} [1/(√(2π)σ)] exp[−(y_i − α − βx_i)²/(2σ²)]

· Π_{i=n+1}^{n+m} [1/(√(2π)σ)] exp[−(y_i − α − βx_i*)²/(2σ²)]

· Π_{j=1}^{d} p_j^{n_j + Z_+j},
where x* denotes a missing value of x, Z_ij = 1 if the missing x_i* is u_j and is zero otherwise,
Z_+j = Σ_{i=n+1}^{n+m} Z_ij, and ϕ = (p_1, …, p_d). The missingness is now expressed in two ways: the
missed value x_i* and the indicator Z_ij which identifies which of the possible u_j is the missed
value for x_i*.
We now need the log complete data likelihood, omitting known constants:

log CL = −(n + m) log σ − (1/(2σ²)) [Σ_{i=1}^n (y_i − α − βx_i)² + Σ_{i=n+1}^{n+m} (y_i − α − βx_i*)²]
+ Σ_{j=1}^d (n_j + Z_+j) log p_j.
The expected complete data log-likelihood requires the expectations of x_i*, x_i*² and Z_+j.
We cannot use the previous w_i, as x does not have a Gaussian distribution. The conditional
distribution of x given y is given by the weights p_j f(y_i | x_i = u_j) at each support point,
then normalised to sum to 1. The conditional mean and second moment of x_i are the averages
of u_j and u_j² over the support points with the posterior weights.
We now define a new w_i = x_i if x_i is observed, and w_i = E[x_i | y_i] if x_i is missing,
and a new V_i = E[x_i² | y_i] − w_i². Then we can parallel the results for the regression model
parameters from the Gaussian x case. We still need the conditional distribution of the Z+j ,
the unobserved total number of incomplete x observations assigned to the support point uj .
The derivatives of the log-likelihood with respect to the θ parameters of the Gaussian
regression model are the same as those for the Gaussian x model. For the multinomial model
parameters, we need to impose the constraint that the pj sum to 1, so we need the derivative
of the log likelihood minus a Lagrange multiplier times the constraint:
∂[log CL − λ(Σ_j p_j − 1)]/∂p_j = (n_j + Z_+j)/p_j − λ = 0

⟹ p̂_j = (n_j + Z_+j)/λ

∂[log CL − λ(Σ_j p_j − 1)]/∂λ = Σ_j p_j − 1 = 0

⟹ p̂_j = (n_j + Z_+j) / Σ_j (n_j + Z_+j).
For the E step, from Bayes’s theorem,

Pr[Z_ij = 1] = p_j,

Pr[Z_ij = 1 | y_i] = p_j f(y_i | x_i = u_j) / Σ_{j=1}^{d} p_j f(y_i | x_i = u_j).
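A sketch of this E step, with all values (support points, current estimates, responses) hypothetical:

```python
import numpy as np
from scipy.stats import norm

u = np.array([0.0, 1.0, 2.0])        # distinct observed covariate values u_j
p = np.array([0.5, 0.3, 0.2])        # current multinomial estimates p_j
alpha, beta, sigma = 1.0, 2.0, 1.0   # current regression estimates
y_miss = np.array([2.8, 5.1])        # responses with missing covariate

# likelihood f(y_i | x_i = u_j) under the Gaussian regression model
like = norm.pdf(y_miss[:, None], loc=alpha + beta * u[None, :], scale=sigma)
w = p * like                         # prior weight times likelihood
w /= w.sum(axis=1, keepdims=True)    # posterior weights, summing to 1 over j

E_x = w @ u                          # conditional mean of missing x_i
V = w @ u**2 - E_x**2                # conditional variance
print(np.round(E_x, 3), np.round(V, 3))
```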
FIGURE 16.1
Boy birthweights and Gaussian cdf (solid line), with 95% credible bounds for the true cdf
(red curves) on probit scale
FIGURE 16.2
Boy birthweights and two-component Gaussian mixture cdf (solid curve), with 95% credible
bounds for the true cdf (red curves) on probit scale
close to 1/648. The heaviest boy defines the second component, with the next three heaviest
boys having decreasing probabilities of belonging to this component. The remaining boys
all belong to the first component. In both graphs the ML fitted curves fall just outside the
credible region at four and eight pounds weight.
The three-component mixture resolves the first component into two very close compo-
nents, with means 7.67 and 7.52, common standard deviation 1.095, and probabilities (to
2dp) 0.83 and 0.17. The mean separation of 0.15 is about 1/7th of a standard deviation,
barely detectable or observable.
We now give the details of the EM algorithm. The observed data y are a random sample
from the model
f (yi | θ) = pf1 (yi | µ1 , σ1 ) + (1 − p)f2 (yi | µ2 , σ2 ),
with

f_j(y_i | µ_j, σ_j) = [1/(√(2π)σ_j)] exp[−(y_i − µ_j)²/(2σ_j²)]
and θ = (µ1 , µ2 , σ1 , σ2 , p). The likelihood is an awkward product across the observations of
sums across the two components.
For the EM algorithm, we introduce a latent component identifier Zi which, if observed,
would convert the model to a two-group Gaussian model with group-specific means and
variances. We define Zi = 1 if observation i is from component 1, and Zi = 0 if observation i
is from component 2, and give Zi the Bernoulli distribution with parameter p. The complete
data are then y and Z, and the complete data likelihood, log-likelihood, score and Hessian
follow directly from the two-group Gaussian model.
The common appearance of Zi in these terms greatly simplifies the observed score and Hes-
sian computations, which require the conditional expectations of Zi given the observed data.
From Bayes’s theorem,

E[Z_i | y, θ] = Pr[Z_i = 1 | y, θ]
= f_1(y; θ | Z_i = 1) Pr[Z_i = 1]/f(y; θ)
= f_1(y; θ | Z_i = 1) Pr[Z_i = 1] / {f_1(y; θ | Z_i = 1) Pr[Z_i = 1] + f_2(y; θ | Z_i = 0) Pr[Z_i = 0]}
= Z_i*,

and

s_y(θ) = E[s_Z(θ) | y]
= Σ_{i=1}^n [Z_i* s_Zi(µ_1, σ_1) + (1 − Z_i*) s_Zi(µ_2, σ_2) + Z_i*/p − (1 − Z_i*)/(1 − p)].
The score equations are simple weighted versions of the complete data score equations, with
the component membership probabilities as weights. The EM algorithm alternates between
evaluating the parameter estimates given the weights, and evaluating the weights given the
parameter estimates; a minimal sketch of the alternation is given below. At convergence,
(asymptotic) variances of the estimates are given by the inverse of the observed data
information matrix.
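A minimal sketch of the EM alternation for the two-component mixture, on simulated (hypothetical) data:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(0, 1, 150), rng.normal(4, 1.5, 50)])

p, mu1, mu2, s1, s2 = 0.5, y.min(), y.max(), y.std(), y.std()
for _ in range(500):
    # E step: membership probabilities Z* from Bayes's theorem
    f1 = p * norm.pdf(y, mu1, s1)
    f2 = (1 - p) * norm.pdf(y, mu2, s2)
    z = f1 / (f1 + f2)
    # M step: weighted proportion, means and SDs
    p = z.mean()
    mu1 = np.sum(z * y) / np.sum(z)
    mu2 = np.sum((1 - z) * y) / np.sum(1 - z)
    s1 = np.sqrt(np.sum(z * (y - mu1)**2) / np.sum(z))
    s2 = np.sqrt(np.sum((1 - z) * (y - mu2)**2) / np.sum(1 - z))

print(round(p, 2), round(mu1, 2), round(mu2, 2))
```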
The formal version of the Hessian is formidable, and is not given here. The effort involved
in evaluating the covariance terms is considerable, even in this simple model, and increases
rapidly in more complex models. Several important features of the observed data Hessian
are visible:
• The off-diagonal expected Hessian terms in the means and standard deviations are zero
at the maximum likelihood estimates.
• The covariance terms are never zero, except by accident in the (µ1 , p) and (µ2 , p) terms.
• The covariance terms always have opposite signs to the expected Hessian terms: infor-
mation is reduced by the unobserved Zi , not surprisingly.
Consequently,
• the expected complete data Hessian, which is produced by many ML packages in the M
step, understates the standard errors of the ML estimates.
• All estimated parameters are positively correlated.
An important point which may not be visible is that the mixture likelihood can be far from
Gaussian in the model parameters. This explains the importance of EM in being able to reach
a maximum of the likelihood without step-length correction. However it also means that the
parameters generally have skewed, non-Gaussian distributions, and the standard errors from
the information matrix are unreliable as measures of precision. The Bayesian analysis is
essential for full information through the posterior distributions of the parameters.
A further important point is that mixture distributions, and other incomplete data mod-
els, may have multiple maxima in the likelihood, reachable from different initial estimates.
This means that an extensive search for local maxima with varying starting values has to
be a part of any mixture (or other latent variable) analysis. In mixture distributions, it is
common for a K-component model to have a local maximum at a K − 1 component model
local or global maximum.
TABLE 16.1
Galaxy recession velocities (km/sec×10−3 )
9. 17 35 48 56 78
10. 23 41
–
16. 08 17
–
18. 42 55 60 93
19. 05 07 33 34 34 44 47 53 54 55 66 85 86 86 91 92 97 99
20. 17 18 18 20 22 22 42 63 80 82 85 88 99
21. 14 49 70 81 92 96
22. 19 21 24 25 31 37 50 75 75 89 91
23. 21 24 26 48 54 54 67 71 71
24. 13 29 29 37 72 99
25. 63
26. 96 99
–
32. 07 79
–
34. 28
The question of astronomical interest was whether these velocities were clumped into
groups or clusters, or instead the velocity density increased initially and then gradually
tailed off. This had implications for theories of evolution of the universe. If the velocities
were clumped, the velocity distribution should be multi-modal.
The question of clumping has been investigated repeatedly by fitting mixtures of Gaus-
sian distributions to the velocity data; the number of mixture components necessary to
represent the data – or the number of modes – is the parameter of particular interest. We
do not consider modes here: these raise additional complications, as two poorly separated
components may have one or two modes.
Mixtures of Gaussians were fitted by ML to the velocity data. Table 16.2 gives the MLEs
of the means, proportions and standard deviations with up to six components, together with
the frequentist deviances. We give both the equal variance and unequal variance cases; the
equal variance model is much inferior to the unequal variance model, as will be clear from
the table.
A first question is: could the data come from a single Gaussian distribution? Figure 16.3
shows the empirical cdf (circles) and the 95% credible region for the true cdf (inside the red
curves). Without the ML fitted Gaussian cdf the conclusion is unclear. Figure 16.4 shows
the cdf on the probit scale with the credible region and the ML fitted Gaussian cdf (solid
curve). The answer is clearly NO.
It is clear from Figure 16.4 that the observations in the central group follow a nearly
linear structure, while the upper and lower groups need structures of their own. We expect
to need two additional components at least, to represent the upper and lower groups of
observations. To proceed further, we need to use EM. We do not give details of the ML
mixture analyses.
The probit plots for ML fitted 2, 3, 4 and 6 component mixtures are shown in Figures 16.5,
16.6, 16.7 and 16.8. For clarity we do not give the credible region bounds: the models beyond
two components give a very close fit to the empirical cdf.
It is clear from Figures 16.7 and 16.8 that increasing K from 4 to 6 gives only an
interpolation of the data. We do not use K > 6 as this leads to degenerate components
in which a single observation is split off to form a component. A one-observation component
is meaningless, except as an outlier.
The three-component model fits well, apart from the region around velocity 20, where
the slope seems to be wrong. The four-component model provides a separate slope in this
region for a closer fit. We do not show the five-component model, which improves very little
over the four-component model. The six-component model has two jumps in the plot to
accommodate the two pairs of observations above and below the central group. Each pair
defines an additional component, but not fully: each new component has three parameters,
but only two observations.
The major difficulty with mixture analysis is the determination of the number of compo-
nents K needed to represent the data. In frequentist analysis this is usually done by sequen-
tially increasing K until the maximised log-likelihood no longer increases by a “substantial”
amount.
One frequentist way of assessing the number of components is the bootstrap likelihood ratio
test, in which the likelihood ratio test statistic, for the null hypothesis of the smaller number
against the alternative of the larger number, is computed for a large number of bootstrap
samples, generated by resampling from the fitted null hypothesis model, and computing the
bootstrap distribution of the likelihood ratio test statistic.
The more common frequentist way to determine what is substantial is by adding a penalty
to the frequentist deviance, based on the number of model parameters, and choosing the
model with the smallest penalised deviance.
TABLE 16.2
Means, proportions and SDs, galaxy data
Several penalties, their number increasing over time, have been proposed and many of
these are in wide use, though there is no strong theoretical justification for their use, nor
agreement over which is most appropriate, or when. We illustrate this process with the two
oldest methods, the AIC and BIC. The AIC uses a penalty of 2p on the deviance while BIC
uses p log(n), where p = 3K − 1 is the number of model parameters, n is the sample size and
K is the number of components. The BIC is derived through an asymptotic approximation
to the integrated likelihood used in the Bayes factor. Many authors have noted that the BIC
penalty is greater than the AIC penalty, provided that log(n) > 2, or n > 7.4.
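As an illustration, the penalised deviances of Table 16.3 can be reproduced from the deviances (values transcribed from the table; p = 3K − 1, n = 82):

```python
import numpy as np

n = 82
dev = {1: 480.83, 2: 440.72, 3: 406.96, 4: 395.43, 5: 392.27, 6: 365.15}
for K, d in dev.items():
    p = 3 * K - 1                       # number of mixture parameters
    print(K, round(d + 2 * p, 2),       # AIC: deviance + 2p
             round(d + p * np.log(n), 2))  # BIC: deviance + p log n
```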
FIGURE 16.3
Empirical cdf of galaxy data with 95% credible region (red)
FIGURE 16.4
Probit of galaxy data with ML fitted Gaussian (line) and 95% credible region (red)
FIGURE 16.5
Probit of galaxy data with ML fitted two-component Gaussian
FIGURE 16.6
Probit of galaxy data with ML fitted three-component Gaussian
FIGURE 16.7
Probit of galaxy data with ML fitted four-component Gaussian
FIGURE 16.8
Probit of galaxy data with ML fitted six-component Gaussian
TABLE 16.3
Model deviances and penalised deviances
K p dev AIC BIC
1 2 480.83 484.83 489.64
2 5 440.72 450.72 462.75
3 8 406.96 422.96 442.21
4 11 395.43 417.43 443.90
5 14 392.27 420.27 453.96
6 17 365.15 399.15 440.06
Table 16.3 shows the frequentist deviance and the AIC and BIC for each of the unequal
variance models.
AIC prefers four components to three, BIC prefers three to four, but both prefer six components overall.
The preference for six comes from the small drop in deviance from four to five components,
and the large drop from five to six, in which the two pairs of observations at the ends of the
central group are fitted by one component each.
In §13.4 we discussed the principled Bayesian approach to model comparison. This ap-
proach requires only the random posterior draws of the model parameters from the Bayesian
analysis, which are substituted into the deviance function to give random draws of the model
deviance, for each model. The deviance distributions are examined for stochastic ordering,
and can be used to give medians (or any other quantile) of the posterior model probabilities.
A recent addition to the penalised deviances is the DIC, the deviance information criterion
(Spiegelhalter, Best, Carlin and van der Linde 2002). This does not use the frequentist
deviance, but the random posterior draws of the Bayesian deviance. These are averaged,
to give the (simulated) mean deviance which is then penalised by an effective number of
parameters, which also has to be computed from the deviance draws.
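A sketch of the DIC computed from the deviance draws alone. The effective number of parameters is approximated here by pD = var(D)/2, which needs only the draws; the Spiegelhalter et al. definition instead penalises the mean deviance by pD = mean(D) − D(posterior mean of θ):

```python
import numpy as np

def dic(deviance_draws):
    d = np.asarray(deviance_draws)
    return d.mean() + d.var() / 2   # mean deviance plus pD approx var(D)/2

# usage: compare dic(draws_K) across the numbers of components K
```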
For the DA analysis, we need priors for the component means, SDs and proportions.
Computationally these need to be proper: improper priors on (0, ∞) or (−∞, ∞) do not
provide useful random draws. Finite range flat priors for µ and log σ or σ, and the minimally
informative Dirichlet prior with indices 1 are commonly used. Details are not given here: see
Celeux, Forbes, Robert and Titterington (2006) for these specifications for the galaxy data.
The following figures, over several pages, show the cdfs, on different horizontal scales,
of the 10,000 deviance draws for each number of components K from 1 to 7 (solid curves),
together with the cdfs of the asymptotic shifted χ² distributions (dashed curves). The frequentist
deviance for each K is the circle near the graph origin. An important point in these
graphs is that the left end of the posterior cdf of the deviance does not “reach” the frequen-
tist deviance beyond K = 2: with these data and models the frequentist deviance could not
be randomly drawn for mixtures with K > 2 even with a sample of 10,000. The frequentist
deviance is not a representative value of the data support for the model. The last figure shows
all the deviance draw distributions on the same scale.
The interpretation of these graphs is straightforward.
For K = 1 the asymptotic and the deviance draw distributions are indistinguishable: for
n = 82 the simulated Gaussian deviance cannot be differentiated from the shifted χ² with 2 degrees of freedom.
As K increases, even for K = 2, the deviance distributions depart increasingly from the
asymptotic distribution.
The deviance distribution for K = 1 is far to the right of the others – the one-component
mixture (single Gaussian distribution) is a very bad fit: its deviance distribution is the stochas-
tically largest.
FIGURE 16.9
Galaxy and asymptotic deviances, one-component Gaussian
FIGURE 16.10
Galaxy and asymptotic deviances, two-component Gaussian
FIGURE 16.11
Galaxy and asymptotic deviances, three-component Gaussian
FIGURE 16.12
Galaxy and asymptotic deviances, four-component Gaussian
FIGURE 16.13
Galaxy and asymptotic deviances, five-component Gaussian
FIGURE 16.14
Galaxy and asymptotic deviances, six-component Gaussian
FIGURE 16.15
Galaxy and asymptotic deviances, seven-component Gaussian
FIGURE 16.16
Galaxy deviances, one- to seven-component Gaussians
FIGURE 16.17
Galaxy deviance differences, three to four components
TABLE 16.4
Percentages of correct model identification using DIC and
posterior deviance
n 82 164 328 656
K DIC Dev DIC Dev DIC Dev DIC Dev
1 100 100 100 99 100 100 100 100
2 85 98 100 100 100 100 100 97
3 51 99 98 99 100 99 100 99
4 3 9 11 67 30 99 17 99
5 0 18 0 9 0 37 1 89
6 2 9 0 10 56 100 78 100
7 0 1 0 15 4 3 4 32
The posterior deviance comparison performed better than the DIC, especially for a large
number of poorly separated components (Table 16.4). With the galaxy sample size of 82,
both methods had difficulty identifying more than three components. Repeatedly doubling
the galaxy sample size steadily improved the posterior deviance comparisons, but changed
the DIC comparisons little.
there is sensitivity to the prior choice . . . however the qualitative shape of the
density is unchanged.
The number of component densities used in the random draws is not the number of compo-
nents which can be identified in the data but the number of component densities generated by
the Dirichlet process, which is unrelated to the data structure. The DPP analysis is unrelated
to the question of clumping.
17
Generalised linear models (GLMs)
The exponential family density has the form f(y | θ, ϕ) = exp{[yθ − b(θ)]/a(ϕ) + c(y, ϕ)}.
Here θ is the natural or canonical parameter of the distribution (often the parameter of
interest, or a function of it) and ϕ is a scale parameter, usually a nuisance parameter. The
regression model was for θ or a monotone function of it. The scale parameter, if there was
one, represented the additional random variability not modelled by the regression. If there
wasn’t one, the probability model and the regression had to account for all the variability
in the data.
The program gave the ML estimates and standard errors based on the estimated
expected information (evaluated at the ML estimates), and a quantity called the deviance,
whose definition varied by the probability distribution. If the distribution was Gaussian, the
deviance was the residual sum of squares, while if it was not, the deviance was the likelihood
ratio test statistic for the fitted model compared to a “saturated” model with a parameter
MLE for each observation. This unusual combination was intended to give the deviance the
same interpretation, as a goodness of fit statistic for the model, whether Gaussian or not. A
detailed history of the development of GLIM was given in Aitkin (2018).
Books on GLMs, and the implementation of GLMs in other packages, developed slowly
at first. McCullagh and Nelder’s 1983 book became the standard theoretical text, though it
made little mention of GLIM or any other software implementation. The 1989 GLIM book
by Aitkin, Anderson, Francis and Hinde was widely adopted as a teaching text for the use
of GLIM; the text was ported to R in a 2009 edition (Aitkin, Francis, Hinde and Darnell).
The development of the object-oriented statistical computing system S, and its open source
version R, greatly extended the range of applied statistical modelling, using modern robust
and simulation-based methods. The series of books by Venables and Ripley (fourth edition
2002 for example) showed the importance of this development.
GLIM’s early success showed clearly the importance of GLMs as a major tool of both
theoretical and applied statistics. For applied statistics, it changed the focus of analysis from
statistical methods to statistical models and modelling. Maximum likelihood for GLMs is now
widely implemented and documented, and details of the model fitting are not given in the
examples here. We summarise the GLM algorithm.
H = −∂²ℓ/∂β_j ∂β_k
= D′ [−∂²ℓ/∂η_j ∂η_k] D + [−∂ℓ/∂η] ∂²η/∂β_j ∂β_k.
The right-hand side of the updating Newton-Raphson algorithm is evaluated at the current
parameter values. The Fisher scoring algorithm used in most GLM algorithms simplifies this
updating by replacing the second derivative matrix by its expected value (at the current
parameter values). This simplifies H to D′WD, where

W_jk = E[−∂²ℓ/∂η_j ∂η_k],

so that

β_new = β + (D′WD)⁻¹ D′ (∂ℓ/∂η)
= (D′WD)⁻¹ D′W [Dβ + W⁻¹ (∂ℓ/∂η)]
= (D′WD)⁻¹ D′Wz,

where

z = Dβ + W⁻¹ (∂ℓ/∂η).
In applying this formulation to specific distributions in the exponential family, the GLM
packages hold a library of the necessary functions for the covered range of distributions. We
do not give details. The algorithm is in fact more general than the exponential family: all
that is required is that the distribution has a single index η = β ′ x.
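A minimal sketch of this scoring/working-variate iteration for the binomial logit link, for which ∂ℓ/∂η = y − µ and W = diag{µ(1 − µ)}; the data and names are hypothetical:

```python
import numpy as np

def irls_logistic(X, y, iters=25):
    """Fisher scoring via the working variate z = eta + (y - mu)/W."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = X @ beta
        mu = 1 / (1 + np.exp(-eta))      # inverse logit link
        W = mu * (1 - mu)                # iterative weights
        z = eta + (y - mu) / W           # working variate
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta

# usage (hypothetical): X = np.column_stack([np.ones(len(x)), x])
#                       beta_hat = irls_logistic(X, y)
```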
The Bayesian analysis proceeds by making random draws from the posteriors. This process
can be accelerated by using information about the likelihood from the ML analysis.
The complementary log-log (CLL) link, log[−log(1 − p)], is asymmetric in p and (1 − p). All
the binomial link functions can be derived from a bifurcation of the cdf of a continuous
distribution.
Suppose a continuous random variable Y has a density function f (y) and cdf F (y). At
a y-value y0 the distribution is cut into two pieces, with probability contents p = F (y0 ) and
1 − p = 1 − F(y₀). The value of y is unobservable: we can observe only the binary variable Z,
with z = 1 if y > y₀ and z = 0 if y ≤ y₀. We model (either of) these probabilities as functions
of the covariates x through the link function. The choice of the link function determines the
distribution of the underlying Y.
FIGURE 17.1
Probit (solid) and logit (dashed) transformations on the probit scale
The large uncertainty in the relation between dose and death probability is clear if we
construct the 95% credible region for the relation from the observed data. The 95% central
credible region for the death probabilities pi is given directly from the 95% credible intervals,
from the posterior Beta(y_i + 1, n_i − y_i + 1) distributions, with independent uniform priors
on each pi . The 95% credible interval bounds are shown as a red line-segment region with
the data (black circles) and the posterior median values of p (green circles) in Figure 17.2.
The line segments are in fact illusory as there are no data points between the design
points to provide credible intervals. However, it is clear that the true relation could be almost
anything – it need not even be monotone increasing. Any two-parameter model which goes
to 0 and 1 at the data margins will appear to be an adequate fit; in particular any link
function for the binomial will give an acceptable fit. (A cubic with any link will fit exactly.)
The logistic linear model is just one of these, but it is the most common. The probability
p(x) of death at dose level x (mgm per gram of bodyweight) is modelled by the logistic linear
regression model:

y_i | x_i, n_i ∼ b(n_i, p_i),

logit p_i = log[p_i/(1 − p_i)] = α + βx_i.

FIGURE 17.2
Racine 95% credible region (red segments) for the death probability function, with observed
proportions (black circles) and posterior medians (green circles)
FIGURE 17.3
Racine data (circles) and fitted logistic models (solid curve posterior median, dashed curve
MLEs)
For this model, centering the covariate is helpful, reducing both the correlation of the MLEs and the
parameter grid necessary to cover the region of appreciable likelihood.
With centering to the mean covariate value −0.12, the parameter estimates become
α̂_c = −0.08 (0.73) and β̂ = 7.75 (4.87), with correlation 0.20.¹ The slope estimate and SE are
unaffected by the centering, but the precision of the intercept is increased, with a smaller SE.
Part of the difficulty of the analysis without centering is the elongation of the near-elliptical
contours of the likelihood (shown in Gelman et al’s Figure 3.3a), which is greatly reduced
by centering.
We will compute the joint posterior distribution at a grid of points (α, β). It is a
good idea to get a rough estimate of (α, β) so we know where to look. To obtain
the rough estimate, we use existing software to perform a logistic regression . . .
The [ML] estimate is (α̂, β̂) = (0.8, 7.7), with standard errors of 1.0 and 4.9 for
α and β respectively.
We are now ready to compute the posterior density at a grid of points (α, β).
After some experimentation, we use the range (α, β) ∈ [−5, 10] × [−10, 40] which
captures almost all of the mass of the posterior distribution. The resulting contour
plot appears in [their] Figure 3.3a.
Although it was not commented on, the uniform grid in both parameters is effectively
a discrete uniform prior for them. It might be thought that searching the parameter space
for the area of high likelihood would be “tuning” the prior to the data (the likelihood), but
this is not so. Conceptually, it is simply restricting the range of the parameters to the region
of high likelihood by having an initial near-infinite grid and then removing from this grid
regions of zero likelihood.
With such a small sample (and small numbers of animals at each dose level), we would
expect the posterior of β at least to be skewed. Neither Racine et al (1986) nor Gelman et
al (2014) showed the posterior distributions for α or β: they moved directly to the LD50.
Skew in the posterior for β is already clear from the inconsistency between the two tests
for a zero value of β: the Wald test β̂/SE (1.59) and the likelihood ratio test (15.74), which
would be close to the squared Wald test value (2.53) in a Gaussian likelihood.
To assess this, we follow the Gelman et al analysis and generate 10,000 grid values of
α and β over the region [−5, 10] for α and [−10,40] for β. We substitute the grid values
into the binomial log-likelihood and exponentiate this. Direct exponentiation may fail for
very large negative values of the log-likelihood. To prevent this we carry out the exponen-
tiation in several steps. First we find the largest log-likelihood and subtract this from the
log-likelihood values. This guarantees negative values of the difference which exponentiate
without difficulty.
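A sketch of this stabilised exponentiation (the familiar log-sum-exp shift): subtracting the largest log-likelihood makes every exponent non-positive, so exp() cannot overflow, and the subtracted constant cancels on normalising.

```python
import numpy as np

def posterior_weights(loglik_grid):
    """Normalised posterior weights over a grid, under the flat prior."""
    ll = np.asarray(loglik_grid)
    w = np.exp(ll - ll.max())   # relative likelihoods in (0, 1]
    return w / w.sum()
```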
Figure 17.4 shows the likelihood classified by the regression coefficients for intercept and
slope, where the likelihood at each point is coded in two ways: on a colour scale and a size
scale. The largest circles, coloured green, have the highest likelihoods (at least 80% of the
maximum); the crimson have likelihoods 60–80% of the maximum, the blue 40–60%, the
yellow 20–40%, and the red less than 20% of the maximum. The size scale has a similar
structure. Blank areas have zero likelihood to machine accuracy. The elliptical form of the
likelihood is asymmetric and very long-tailed in β. The marginal posterior densities and cdfs
of α and β are shown in Figures 17.5, 17.6, 17.7 and 17.8.

¹ The correlation is not zero, because the mean is unweighted, not weighted by the iterative weights used
in the ML estimation.

FIGURE 17.4
Racine data likelihood (slope b against intercept a)
For the LD50 and LD90, we substitute the random parameter draws into the definitions
of these functions. Figure 17.9 gives the LD50 posterior cdf and Figure 17.10 gives the LD90
posterior cdf.
Despite the small samples at each design point, and the very small number of design
points, the study was able to draw fully informative (though not precise!) conclusions about
the parameters and the LD50 and LD90, without relying on asymptotic theory.
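For the logit model logit p = α + βx, the definitions give LD50 = −α/β and LD90 = (log 9 − α)/β; a sketch of the substitution of the posterior draws:

```python
import numpy as np

def ld_draws(alpha_draws, beta_draws):
    """LD50 and LD90 draws from posterior draws of (alpha, beta)."""
    ld50 = -alpha_draws / beta_draws
    ld90 = (np.log(9.0) - alpha_draws) / beta_draws
    return ld50, ld90

# the posterior cdfs of Figures 17.9 and 17.10 are the sorted draws
# plotted against (i - 0.5) / M
```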
FIGURE 17.5
α density
FIGURE 17.6
α cdf
FIGURE 17.7
β density
FIGURE 17.8
β cdf
FIGURE 17.9
LD50 cdf
FIGURE 17.10
LD90 cdf
TABLE 17.2
Beetles dying y, exposed n, at dose d
y 6 13 18 28 52 53 61 60
n 59 60 62 56 63 59 62 60
d 1.6907 1.7242 1.7552 1.7842 1.8113 1.8369 1.8610 1.8839
FIGURE 17.11
Proportion dying (circles) and 95% credible region (red curves)
The CLL link fits the data much better than the logit: the frequentist deviances from the three fitted models are 27.92 (LL),
11.23 (logit) and 3.45 (CLL), each with 6 df. Although these models are not nested, it is
clear that the CLL curvature is the best: it gives a close fit at every dose level. The fit of
the logit model can be improved, by increasing its curvature with an added quadratic term
in dose. The frequentist deviance of the quadratic logit model is 3.20. Figure 17.13 shows
the ML fitted logit (black), CLL (green) and quadratic logit (red) models. The quadratic
model is also a close fit. Is the difference in deviances meaningful, with the different degrees
of freedom?
We can assess this approximately (and sufficiently) by graphing the two deviance dis-
tributions, assuming the asymptotic χ2 forms. Figure 17.14 shows the CLL (green) and
quadratic logit (red) deviance cdfs from 10,000 random draws. The two cdfs cross near the
5% point of the cdf: neither is uniformly better than the other. In about 5% of the deviance
draws, the quadratic logit draw is smaller than the CLL draw; the remaining 95% reverse
this difference.
The quadratic logit model (red) has a more diffuse cdf from its 3 df and this contributes
to the large preference for the CLL over the quadratic logit model. However, this is only a
preference: the deviance difference between the two is quite small at all percentiles. Formally,
if we pair the random draws and subtract one from the other, the 95% credible interval for
the true deviance difference is [−5.37, 7.75], with a median of 0.60. We cannot choose firmly
between these models.
FIGURE 17.12
LL (red), logit (black) and CLL (green) links
FIGURE 17.13
Quadratic logit (red), logit (black) and CLL (green) links
FIGURE 17.14
Quadratic logit (red) and CLL (green) deviance cdfs
TABLE 17.3
Number of girls assessed, and number reaching menarche (positive), by mean age
group 1 2 3 4 5 6 7 8 9 10
mean age 9.21 10.21 10.58 10.83 11.08 11.33 11.58 11.83 12.08 12.33
number 376 200 93 120 90 88 105 111 100 93
positive 0 0 0 2 2 5 10 17 16 29
group 11 12 13 14 15 16 17 18 19 20
mean age 12.58 12.83 13.08 13.33 13.58 13.83 14.08 14.33 14.58 14.83
number 100 108 99 106 105 117 98 97 120 102
positive 39 51 47 67 81 88 79 90 113 95
group 21 22 23 24 25
mean age 15.08 15.33 15.58 15.83 17.58
number 122 111 94 114 1,049
positive 117 107 92 112 1,049
FIGURE 17.15
Proportion reaching menarche by age group (circles) and ML fitted logistic model (curve),
with 95% credible region (red segments)
The sample sizes at each mean age are large, and the number of age groups is also large:
the regression should be well defined. Figure 17.15 shows the proportion data, the fitted
centered ML logistic regression, and the 95% credible region bounds (red curves).
The fitted regression falls entirely within the 95% credible region for the true function;
the bounds vary in width with the variation in sample sizes at each age. The fit appears
very good: the deviance is 26.70 with 23 df. Centering makes a dramatic difference to the
intercept and its correlation with the slope:
• Uncentred: −21.23 (0.770) + 1.632 (0.059) AGE, correlation −0.9966;
FIGURE 17.16
Likelihood in α and β
FIGURE 17.17
Posterior cdf of α
FIGURE 17.18
Posterior cdf of β
FIGURE 17.19
Posterior cdf of ζ
FIGURE 17.20
Down’s incidence
FIGURE 17.21
Down’s incidence, logit scale
FIGURE 17.22
Down’s incidence and quadratic logit model
FIGURE 17.23
Down’s incidence, quadratic model and 95% credible region
The data sets are extended with population counts as necessary, so that all regions have data of the same length. This assists
comparative analyses between regions: we can use interactions of region with the BC model
to establish the nature of differences in their regressions.
Figure 17.24 shows the four region observed proportions with Down’s syndrome on the
logit scale. The colour code is BC red, Massachusetts orange, New York blue and Sweden
green. They all show similar curvature, in different degrees: the quadratic is needed in all
regions.
To examine the differences in the regressions we need to “block” the four data sets r, n, a
of length 35 into a single data set of length 140, by concatenating them. We define the new
“long” variables lr, ln, la, qa with qa = la², and a four-level factor of region, all of length
140. Then we fit the “full” model of region × (la + qa) which gives different quadratic models
in each region. The ML fitted full model is shown in Figure 17.25. The blue proportions –
for New York – are notably lower than the others, which are quite similar.
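A sketch of the blocking step and the full interaction fit, with hypothetical variable names (statsmodels accepts a two-column binomial response of successes and failures):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def block(regions):                       # regions: dict name -> (r, n, a)
    long = pd.concat(
        [pd.DataFrame({"lr": r, "ln": n, "la": a, "region": name})
         for name, (r, n, a) in regions.items()],
        ignore_index=True)                # length 4 x 35 = 140
    long["qa"] = long["la"] ** 2
    return long

def fit_full(long):
    # region dummies and their interactions with la and qa
    X = pd.get_dummies(long["region"], drop_first=True).astype(float)
    for col in list(X.columns):
        X[f"{col}:la"] = X[col] * long["la"]
        X[f"{col}:qa"] = X[col] * long["qa"]
    X["la"], X["qa"] = long["la"], long["qa"]
    X = sm.add_constant(X)
    y = np.column_stack([long["lr"], long["ln"] - long["lr"]])
    return sm.GLM(y, X, family=sm.families.Binomial()).fit()
```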
We examine the importance of the region × qa interactions, omitting in sequence those
with the smallest t value less than 2 in magnitude. For those omitted, we then examine
the region × la interactions in the same way. Finally we examine any region terms which
are not included in interactions, to see which can be omitted. The process terminates in a
five-parameter model, shown with MLEs and (SE)s:
−6.094 (0.029) − 0.190 (0.003) la + 0.00745 (0.00343) qa
−0.846 (0.058) REG 3 − 0.00119 (0.00055) REG 3 ∗ qa.
The only region term needed is for Region 3 – New York. This region has both a lower
initial level and a reduced curvature relative to the others, which all have the same model
(Figure 17.26), shown in green.
Why might New York be different from the others? Maternal age is often said to be
the only certain risk factor for Down’s syndrome. A search of reports on terminations of
FIGURE 17.24
Four regions Down’s incidence
FIGURE 17.25
Down’s incidence and full quadratic logit model by region
FIGURE 17.26
Down’s incidence and final quadratic logit model by region
pregnancies establishes that there were variations in the incidence of termination of Down’s
syndrome foetuses by ethnic origin in the United States. Two examples are Bishop, Huether,
Torfs, Lorey and Deddens (1997) and Caruso, Westgate and Holmes (1998).
This may imply a selection bias in New York in the analysis of the observed proportions,
since these may have been reduced by terminations. The references estimate the incidence
that would have been observed in their populations if the terminations had not occurred.
FIGURE 17.27
Vasoconstriction data
where θ = log[p/(1 − p)] is the logistic transformation. The ML fitted centered model with
(SE)s is

θ̂ = −0.345 (0.540) + 5.220 (1.852) log vol + 4.631 (1.783) log rate.
The parameter estimates are all correlated, despite the centering: corr(α̂, β̂) = −0.522,
corr(α̂, γ̂) = −0.373, corr(β̂, γ̂) = 0.804. ML fitted values for the model are shown in
Figure 17.28 as “level curves” (actually parallel lines) for 10% (red), 50% (black) and 90%
(blue) probability of vasoconstriction, computed from:
• 10%: −0.345 + 5.220 log vol + 4.631 log rate = log(1/9) = −2.197;
• 50%: −0.345 + 5.220 log vol + 4.631 log rate = log(1) = 0;
• 90%: −0.345 + 5.220 log vol + 4.631 log rate = log(9) = 2.197.
FIGURE 17.28
Vasoconstriction data with level curves: 10% (red), 50% (black), 90% (blue)
The two covariates are almost equally important. The fitted surface within the data bound-
aries is essentially a triangular plain in the bottom left corner, and a triangular plateau in
the upper right corner, with a smooth logistic slope joining them.
Are the MLEs and SEs reliable? Are the posterior distributions of the parameters Gaussian?
For a grid in multiple dimensions, there is a conflict between the fineness of the grid for
individual parameters and the total grid size; this becomes severe in high-dimensional
covariate models.
Most Bayesian analysts dismiss high-dimensional grids. However, we can avoid this problem
by having a fixed-dimension individual parameter precision of 10,000 points, but randomly
sampling uniformly the grid point locations, over a range determined by the parameter MLE
± 5 SEs from the asymptotic Gaussian distribution. We use ± 5 SEs to cover a major part
of the high-likelihood region, while recognising that an extension or relocation of the grid
may be necessary. (A wider range would be needed if the SEs are themselves unreliable, for
example if they are complete data standard errors from an incomplete data EM analysis.)
So we generate a 10,000-point 3D uniform random grid over the ten SE parameter inter-
vals: α ∈ −0.345 ± 5 ∗ 0.54; β ∈ 5.220 ± 5 ∗ 1.852; γ ∈ 4.631 ± 5 ∗ 1.783.
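A sketch of this random grid, with the MLEs and SEs as quoted above; the log-likelihood evaluation is left abstract:

```python
import numpy as np

rng = np.random.default_rng(4)
M = 10_000
mle = np.array([-0.345, 5.220, 4.631])     # alpha, beta, gamma MLEs
se = np.array([0.540, 1.852, 1.783])       # their standard errors
grid = rng.uniform(mle - 5 * se, mle + 5 * se, size=(M, 3))

# weights = np.exp(loglik(grid) - loglik(grid).max()), as shown earlier,
# then normalised; marginal summaries come from the weighted columns
```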
We evaluate the binomial log-likelihood at these points and exponentiate this, to give
the full joint posterior with this flat prior. Our particular interest is in the (marginal) joint
distribution of β and γ; the intercept α is of no contextual interest. Figure 17.29 shows
the likelihood classified by the regression coefficients for log volume and log rate, where
the likelihoods at all points are coded on two scales of likelihood value as in the previous
example:
• an interval scale of the size of the circle plotting character;
• a five-point colour scale from smallest red to largest green, the green corresponding to
likelihoods more than 80% of the maximum.
FIGURE 17.29
Vasoconstriction, log rate γ vs log volume β
The t statistic for the difference variable is 0.517: this variable can clearly be omitted. The
reduced centred ML model has deviance 29.535:
θ̂ = −0.407 (0.524) + 4.93 (1.72) total.

The correlation between α̂ and β̂ is −0.491.
We repeat the Bayesian analysis through ML for the one-variable model. Figure 17.30
shows the likelihood (centred) in α and β based on 10,000 points. The negative correlation is
clear. Figures 17.31 and 17.32 show the marginal cdfs of the two parameters. The posterior
modes are very close to the MLEs; α is nearly symmetric and β is heavily skewed. We do
not take this analysis further.
FIGURE 17.30
Vasoconstriction, total
FIGURE 17.31
Posterior cdf α
FIGURE 17.32
Posterior cdf β
TABLE 17.5
Proportion in (sample size) tolerant of racial intermarriage
Region Ed 72 73 74
South 3 0.704 (27) 0.729 (48) 0.860 (57)
2 0.438 (137) 0.584 (178) 0.568 (176)
1 0.258 (155) 0.222 (162) 0.274 (146)
Central 3 0.783 (60) 0.873 (55) 0.862 (65)
2 0.699 (219) 0.701 (211) 0.732 (231)
1 0.374 (155) 0.389 (131) 0.422 (135)
North East 3 0.893 (56) 0.949 (59) 0.929 (70)
2 0.740 (196) 0.729 (193) 0.764 (240)
1 0.504 (403) 0.488 (84) 0.443 (79)
West 3 0.966 (29) 0.903 (31) 1.0 (27)
2 0.714 (91) 0.839 (87) 0.827 (81)
1 0.578 (45) 0.556 (45) 0.620 (50)
It appears that nearly all the terms are necessary. This is a consequence of very highly cor-
related variables, which can be seen from the correlation matrix of the parameter estimates.
correlations between parameter estimates
1 1.0000
2 -0.9699 1.0000
3 -0.9723 0.9374 1.0000
4 0.9150 -0.9821 -0.8802 1.0000
5 0.9333 -0.8975 -0.9891 0.8411 1.0000
6 0.9476 -0.9717 -0.9694 0.9503 0.9570 1.0000
7 -0.8967 0.9577 0.9135 -0.9713 -0.9006 -0.9818 1.0000
8 -0.9102 0.9312 0.9600 -0.9092 -0.9692 -0.9887 0.9696 1.0000
9 0.8612 -0.9179 -0.9050 0.9295 0.9128 0.9713 -0.9882 -0.9817 1.0000
Many correlations are very close to ±1: the largest in magnitude is −0.9891. This is because
the linear and quadratic terms are very highly correlated. When their interactions are
included in the model they are even more highly correlated.
This can be avoided by expressing them as orthogonal polynomials: a simple example
would be to centre the scales: −1,0,1 instead of 1,2,3. We repeat this analysis with the
centred linear and quadratic scales. The full model and its estimated parameter correlation
matrix follow.
1 1.0000
2 -0.0000 1.0000
3 -0.0000 -0.0000 1.0000
4 -0.6291 0.3838 0.0000 1.0000
5 -0.7979 0.0000 -0.0742 0.5020 1.0000
6 -0.0000 -0.0000 -0.0000 0.0000 0.0000 1.0000
7 0.0000 0.0000 -0.5920 -0.0000 0.0439 0.5254 1.0000
8 0.0000 -0.7685 0.0000 -0.2949 -0.0000 -0.0513 -0.0432 1.0000
9 0.4906 -0.2992 0.0456 -0.7797 -0.6149 -0.0423 -0.0596 0.4404 1.0000
There are now no correlations near ±1; the largest is −0.7979. The highest-order interac-
tion EQ.YQ and its SE are unchanged. The others are changed considerably. The quadratic
term is now 1,0,1, uncorrelated with the linear term, instead of 1,4,9, which is correlated
0.9897 with the linear term. (Note that the sequence 1,5,9 correlates 1.0 with 1,2,3.)
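A quick check of the quoted correlations between the raw and centred linear and quadratic scores:

```python
import numpy as np

lin, quad = np.array([1., 2., 3.]), np.array([1., 4., 9.])
print(np.corrcoef(lin, quad)[0, 1])      # 0.9897: raw scales

centred_lin = lin - 2.0                  # -1, 0, 1
centred_quad = centred_lin ** 2          # 1, 0, 1
print(np.corrcoef(centred_lin, centred_quad)[0, 1])  # 0.0: orthogonal
```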
Successive elimination of terms with the smallest t < 2 gives a final model.
17.7.3.1 Region 1
scaled deviance = 7.67
residual df = 6 from 9 observations
Proportion logits increase linearly with both time and (much more strongly) educational
level. We repeat the elimination analysis in turn for regions 2, 3 and 4. The final models
are given in the following, followed by a table of observed and fitted probabilities for each
region.
17.7.3.2 Region 2
scaled deviance = 3.52
residual df = 6 from 9 observations
There is no year effect but a strong linear education increase with negative curvature.
17.7.3.3 Region 3
scaled deviance = 3.97
residual df = 7 from 9 observations
17.7.3.4 Region 4
scaled deviance = 10.90
residual df = 7 from 9 observations
The deviance (1,488 with 67 df) shows immediately that the Poisson model is mis-specified.
For a well-fitting Poisson model the deviance should be of the same order as the degrees of
freedom.

FIGURE 17.33
Count of fish species with lake area

FIGURE 17.34
Count of fish species with lake area, both on log scales

The residual deviance for the simpler linear model is even larger (1,535.7). The t
statistic for the quadratic term is 7.25, though this does not reduce the deviance substantially:
the Poisson quadratic model is clearly incorrect.
We define the precision and variability bounds from the fitted Poisson model by analogy
with the Gaussian case, and the properties of the GLM algorithm, in which the “working
variate”

Z = log(µ) + (y − µ)/µ

is iteratively regressed on the covariates with a weight variable; the variance of the working
variate is 1/µ. Because of the discreteness of the Poisson and its varying mass function shape
with the mean, we do not have precise credible interval probabilities for these bounds. We use
instead a figure of ±4.5 SDs to define an approximate 95% credible interval, with the choice
of 4.5 SDs based on the Chebyshev inequality: for any random variable, the probability
of the random variable exceeding k standard deviations from its mean is not more than 1/k².
For k = 4.5 this probability is not more than 1/20.25 ≈ 0.05.

So an approximate 95% credible (precision) region for the mean function at x is given by

µ̂(x) ± 4.5 √(Var[µ̂(x)]).
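A sketch of these bounds on the log-mean scale, assuming the fitted coefficient vector and its covariance matrix are available; following the construction above, the prediction bounds add the working-variate variance 1/µ:

```python
import numpy as np

def poisson_bounds(x0, beta_hat, cov_beta):
    """4.5 SD precision and prediction bounds at design row x0."""
    eta = x0 @ beta_hat                  # fitted linear predictor log(mu)
    v_prec = x0 @ cov_beta @ x0          # Var of the fitted linear predictor
    v_pred = v_prec + 1.0 / np.exp(eta)  # adds the working-variate variance
    prec = (eta - 4.5 * np.sqrt(v_prec), eta + 4.5 * np.sqrt(v_prec))
    pred = (eta - 4.5 * np.sqrt(v_pred), eta + 4.5 * np.sqrt(v_pred))
    return prec, pred
```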
We show in Figure 17.35 the number of species graphed against area, both on log scales,
together with the Poisson fitted quadratic model (solid curve), 4.5 SD bounds (green curves)
for the 95% precision region and 4.5 SD bounds (red curves) for the 95% prediction region.

FIGURE 17.35
Count of fish species with lake area, log scales and Poisson fitted quadratic model (black
curve) with 4.5 SD precision bounds (green curves) and 4.5 SD prediction bounds (red
curves)
Of the 70 points in the graph, 21 (30%) fall outside the 4.5 SD bounds of the prediction
region. The Poisson model does not provide the appropriate variability representation for the
data. For this reason the direct Bayesian analysis from the Poisson ML will not be effective.
Where to go from here?
log(count) ∼ N(α + β log(area), σ²).

The ML estimates and (SE)s are α̂ = 2.339 (0.199), β̂ = 0.1436 (0.0266). The residual sum of
squares (RSS) from the model is 36.925; the RSS from the null model with β = 0 is 52.713,
so R² = 0.2995, R = 0.547: 30% of the variability (measured by residual sum of squares) in
species “diversity” (measured by log count) is “explained” by log area.
Does the Gaussian model identify a real upturn on the extreme left? We assess this by
adding the quadratic term in log area to the model. The ML estimates and (SE)s from the
quadratic model are:
FIGURE 17.36
Count of fish species with lake area, with Gaussian linear (solid line) and quadratic (dashed
curve) models
FIGURE 17.37
Residuals from Gaussian linear model, probit scale
FIGURE 17.38
Count of fish species with lake area, with Gaussian linear model (black line), 95% precision
region (green curves) and 95% predictive region (red lines)
FIGURE 17.39
Count of fish species with lake area, with Gaussian quadratic model (black curve), 95%
precision region (green curves) and 95% predictive region (red curves)
We use the Bayesian bootstrap weighting of the Gaussian analysis to obtain the posterior
distributions of the quadratic model parameters, robust to the failure of the Gaussian model
assumption.
Figures 17.40, 17.41, 17.42 and 17.43 give the posterior cdfs of the parameters from 10,000
draws. The posterior medians and 95% central credible intervals are:

α: 2.782, [2.433, 3.115]; β: −0.0534, [−0.1772, 0.0750]; γ: 0.0156, [0.0048, 0.0260];
σ²: 0.4873, [0.3502, 0.7014]; VR: 0.664, [0.484, 0.863].
The 95% credible interval for γ excludes zero, confirming the need for the quadratic term.
Here VR is the ratio of the quadratic model variance to the null model variance: it is the
proportion of “unexplained” variance by the quadratic model. The Gaussian frequentist
MLEs agree quite closely with the posterior medians, so the ML fitted Gaussian quadratic
is a fair picture of the relation, but the 95% Gaussian-based confidence intervals below are
much longer than the 95% credible intervals:
α: 2.786, [2.222, 3.350]; β: −0.0531, [−0.2429, 0.1635]; γ: 0.0156, [0.0011, 0.0301]; σ²: 0.5168, [0.378, 0.749]; VR: 0.675.
The distribution of α is left-skewed, while the others are right-skewed, heavily for σ².
The poor agreement of the residuals with the Gaussian model leads to the poor summaries
of the precisions by the SEs. This poor precision may extend to the mean function.
We can assess the precision region from the joint posterior distribution of all the parameters. For the precision, we need the posterior of the linear predictor at each data value. We sort the M values of the linear predictor \( \beta^{[m]\prime} x_i \) at each \( x_i \), and save the median, 2.5% and 97.5% quantiles. Connecting these values across the data gives the multinomial-based median and 95% precision bounds of the linear predictor. Figure 17.44 shows the 95% precision regions for the Gaussian (red) and Bayesian bootstrap (green) analyses.
FIGURE 17.40
Posterior cdf of α
FIGURE 17.41
Posterior cdf of β
FIGURE 17.42
Posterior cdf of γ
FIGURE 17.43
Posterior cdf of σ²
FIGURE 17.44
Gaussian (red) and Bayesian bootstrap (green) 95% precision regions
Precision is increased by the BB analysis, for at least the smaller and medium-sized lakes. The BB analysis gets the precisions right, without assumptions. We do not need to check the multinomial variability representation: it is already accounted for in its definition.
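A sketch of the Bayesian bootstrap precision computation described above, in Python; the arrays logarea and logcount are hypothetical stand-ins for the lake data, and the Dirichlet(1, ..., 1) draws implement the multinomial weighting:

```python
import numpy as np

rng = np.random.default_rng(0)

def bb_precision_bands(x, y, M=10_000):
    """Bayesian bootstrap 95% precision bands for a quadratic mean function.

    Each draw reweights the n observations by Dirichlet(1,...,1) weights
    and refits the quadratic by weighted least squares; the linear
    predictor at each x is summarised by its 2.5%, 50% and 97.5%
    quantiles across the M draws."""
    n = len(x)
    X = np.column_stack([np.ones(n), x, x**2])
    eta = np.empty((M, n))
    for m in range(M):
        w = rng.dirichlet(np.ones(n))                      # multinomial weights
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
        eta[m] = X @ beta
    return np.quantile(eta, [0.025, 0.5, 0.975], axis=0)

# lo, med, hi = bb_precision_bands(logarea, logcount)     # hypothetical data
```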
Consider the effect of omitted variables on the linear predictor:
\[ \zeta = \alpha + \beta \log(x) + \gamma' Z = \eta + \gamma' Z, \]
where Z is the full vector of omitted variables, with γ the corresponding vector of regression
coefficients. We are assuming that these additional variables are part of the linear predictor,
rather than acting non-linearly in some way. We don’t know how many variables have been
omitted, or what they are, apart from latitude which we don’t have. So Z is a completely
unknown quantity, varying over the data set – it is an unobserved or latent random variable.
In some fields this representation of omitted variables is called overdispersion or unobserved
heterogeneity. Since γ ′ Z is a scalar quantity, we might as well represent the more appropriate
model by
ζ = η + Z,
where now Z is an unobserved scalar variable with a completely unknown distribution. With-
out a probability model specification g(z) for Z, we do not have a probability distribution
and likelihood for Y . We first adopt the conjugate gamma distribution, leading to a negative
binomial marginal distribution for the response.
17.8.4 Conjugate W
Expressed in terms of the non-negative variable W = e^Z, the Poisson conditional mean given W is µ_W = e^{α + β log(x)} · W = W e^η. For the Gamma(r, θ) distribution of W with density function
\[ f(W \mid r, \theta) = \exp(-W\theta)\, W^{r-1} \theta^r / \Gamma(r), \]
the mean is r/θ. The marginal distribution of Y is given by
\begin{align*}
y!\, h(y) &= \int e^{-\mu_W} \mu_W^{\,y} \exp(-W\theta)\, W^{r-1} \theta^r\, dW / \Gamma(r) \\
 &= \int e^{-W e^{\eta}} W^y e^{\eta y} \exp(-W\theta)\, W^{r-1} \theta^r\, dW / \Gamma(r) \\
 &= e^{\eta y}\, \theta^r \int \exp[-W(\theta + e^{\eta})]\, W^{r+y-1}\, dW / \Gamma(r) \\
 &= \frac{e^{\eta y}\, \theta^r\, \Gamma(r+y)}{\Gamma(r)\, (\theta + e^{\eta})^{r+y}},
\end{align*}
so that
\[ h(y) = \frac{\Gamma(r+y)}{\Gamma(r)\, y!}\, p^y (1-p)^r \]
– the negative binomial distribution – where p = e^η/(θ + e^η).
Since θ must be positive, the natural scale (link) for θ is the log, on which the regression
model is fitted: log θ = β ′ x. It is also possible that r is varying over the data. Since it also
must be positive, the log scale is natural for it. We do not give further details of model
fitting; an application follows in the next chapter.
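A small simulation sketch of the gamma-mixed Poisson model, checking the negative binomial mean and variance implied by the derivation above; the values of r, θ and η are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)

# W ~ Gamma(r, theta) with mean r/theta; Y | W ~ Poisson(W * exp(eta)).
r, theta, eta = 2.0, 1.5, 0.7
W = rng.gamma(shape=r, scale=1/theta, size=200_000)
Y = rng.poisson(W * np.exp(eta))

# Marginally Y is negative binomial with p = exp(eta)/(theta + exp(eta)):
p = np.exp(eta) / (theta + np.exp(eta))
print(Y.mean(), r * p / (1 - p))        # mean r*p/(1-p)
print(Y.var(), r * p / (1 - p)**2)      # variance r*p/(1-p)^2
```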
18
Extensions of GLMs
where e_i = y_i − β′x_i. The remaining derivatives are
\begin{align*}
\frac{\partial^2 \ell}{\partial\beta\,\partial\beta'} &= -\sum_i x_i x_i'/\sigma_i^2 \\
\frac{\partial \ell}{\partial\lambda} &= -\frac{1}{2}\Big[\sum_i z_i - \sum_i e_i^2 z_i/\sigma_i^2\Big]
 = \frac{1}{2}\sum_i (e_i^2/\sigma_i^2 - 1)\, z_i \\
\frac{\partial^2 \ell}{\partial\lambda\,\partial\lambda'} &= -\frac{1}{2}\sum_i (e_i^2/\sigma_i^2)\, z_i z_i' \\
\frac{\partial^2 \ell}{\partial\beta\,\partial\lambda'} &= -\sum_i (e_i/\sigma_i^2)\, x_i z_i'.
\end{align*}
Taking expectations, the expected information matrix is block-diagonal, with blocks
\[ I_\beta = X' W_{11} X, \qquad I_\lambda = \tfrac{1}{2} Z'Z, \]
where
\[ X' = [x_1, \ldots, x_n], \quad Z' = [z_1, \ldots, z_n], \quad W_{11} = \mathrm{diag}(1/\sigma_i^2), \]
since E[E_i] = 0 and E[E_i²] = σ_i². So a Fisher scoring algorithm for the simultaneous MLE of β and λ reduces to two separate algorithms for β and λ. Since, however, W_{11} depends on λ and e_i depends on β, it is simplest to formulate the scoring algorithm as a successive relaxation algorithm. For given σ_i², β̂ is a weighted least squares estimate with weights 1/σ_i², and for given β, λ̂ is the MLE from a gamma model with scale parameter 2 (an exponential distribution) and a response variable e_i².
The algorithm begins with an initial unweighted Gaussian regression of y on x, taking
σi2 ≡ σ 2 . The squared residuals from the least squares fit are defined as a new response
variable with a gamma distribution with scale parameter 2. The linear predictor λ′ x is then
fitted using a log link function, and the frequentist deviance calculated for the initial estimate
of (β, λ). A weighted Gaussian regression of y on x is now fitted, with scale parameter 1 and
weights given by the reciprocals of the fitted values from the gamma model. This alternating
process continues until the deviance converges. At this point the standard errors (based on
the expected information) from both models are correct.
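A sketch of this successive relaxation algorithm, assuming the statsmodels library and numpy arrays y, X (mean covariates) and Z (variance covariates), both including intercept columns; this is an illustration of the scheme, not the author's code:

```python
import numpy as np
import statsmodels.api as sm

def double_glm_ml(y, X, Z, tol=1e-8, max_iter=100):
    """Alternate (i) weighted Gaussian regression of y on X with weights
    1/sigma_i^2 and (ii) a gamma GLM with log link for the squared
    residuals e_i^2 on Z, whose fitted values become sigma_i^2."""
    sigma2 = np.full(len(y), sm.OLS(y, X).fit().resid.var())  # initial fit
    old_dev = np.inf
    for _ in range(max_iter):
        mean_fit = sm.WLS(y, X, weights=1/sigma2).fit()
        e2 = (y - mean_fit.fittedvalues) ** 2  # assumes no exactly zero residuals
        var_fit = sm.GLM(e2, Z, family=sm.families.Gamma(
            link=sm.families.links.Log())).fit()
        sigma2 = var_fit.fittedvalues
        dev = np.sum(np.log(sigma2) + e2 / sigma2)  # -2 log-likelihood + constant
        if abs(old_dev - dev) < tol:
            break
        old_dev = dev
    return mean_fit, var_fit
```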
However, the log-likelihood in the two sets of parameters may be very skewed, and so
the standard errors are not a reliable indicator of variable importance. In addition, the loss
of degrees of freedom in the variance model due to the estimation of the mean parameters
may be serious, requiring a marginal or restricted likelihood maximisation for the variance
model. Smyth and Verbyla (1999) gave a discussion. The preceding analysis assumes that
the parameters β and λ are functionally independent, which will usually be the case.
We give several examples of ML with the double Gaussian model.
The two working responses for the Bayesian analysis are
\[ w_{1i} = \lambda' z_i, \qquad w_{2i} = (y_i - \beta' x_i)^2. \]
We alternate between these forms through Gibbs sampling, analogous to ML successive re-
laxation. We begin with an initial unweighted Gaussian regression of y on x, taking σ_i² ≡ σ². We make M random draws β^{[m]} from the constant-variance Gaussian posterior distribution \( N(\hat\beta,\ \hat\sigma^2 [X'X]^{-1}) \), and for each m form the initial weights
\[ w_{2i}^{[m]} = (y_i - \beta^{[m]\prime} x_i)^2. \]
We write \( u^{[m]} = \sum_{i=1}^n w_{2i}^{[m]} z_i \). The conditional log-likelihood \( -\tfrac{1}{2} \lambda' u^{[m]} \) for λ given β^{[m]} is a product-exponential, and with flat independent priors on the λ_j gives the initial conditional product-exponential posterior
\[ \pi(\lambda_j \mid \beta^{[m]}) = \frac{u_j^{[m]}}{2} \exp\Big(-\frac{\lambda_j u_j^{[m]}}{2}\Big), \]
so that λ_j has an exponential distribution with mean \( 2/u_j^{[m]} \).
For each m we make one random draw λ^{[m]} from this posterior, and evaluate the weights \( w_{1i}^{[m]} = \lambda^{[m]\prime} z_i \). For the next M draws of β, we sample from the Gaussian conditional posterior of β given λ:
\[ \beta \mid \lambda \sim N_p\big(\hat\beta_W,\ [X'WX]^{-1}\big), \qquad \hat\beta_W = [X'WX]^{-1} X'Wy, \]
where W = diag[e^{−w_{1i}}], the w_{1i} are the log variances of the observations y_i, and the e^{−w_{1i}} are the reciprocal variances. (The additional “constant” term \( \sum_{i=1}^n w_{1i}^{[m]} \) cancels in the posterior.) This alternating process continues until the posterior distributions stabilise.
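A sketch of this alternating sampler, with numpy arrays y, X and Z as before; the initialisation is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2)

def double_model_gibbs(y, X, Z, M=10_000):
    """Draw beta from its Gaussian conditional posterior given the current
    variances, then each lambda_j from its conditional exponential
    posterior with mean 2/u_j, where u = sum_i w2i * z_i."""
    ols = np.linalg.lstsq(X, y, rcond=None)[0]
    s2 = np.mean((y - X @ ols) ** 2)
    beta = rng.multivariate_normal(ols, s2 * np.linalg.inv(X.T @ X))
    betas, lambdas = [], []
    for m in range(M):
        u = Z.T @ ((y - X @ beta) ** 2)             # u_j from squared residuals
        lam = rng.exponential(scale=2.0 / u)        # lambda_j ~ Exp(mean 2/u_j)
        W = np.exp(-(Z @ lam))                      # reciprocal variances
        V = np.linalg.inv(X.T @ (W[:, None] * X))   # [X'WX]^-1
        beta = rng.multivariate_normal(V @ (X.T @ (W * y)), V)
        betas.append(beta); lambdas.append(lam)
    return np.array(betas), np.array(lambdas)
```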
FIGURE 18.1
Numbers of patients treated and hospital beds
It is possible that log transformations on both scales – a lognormal distribution for pa-
tients – would linearise the relation and also stabilise the variance (Figure 18.2). The squared
correlation increases to 0.891: more of the variability is “explained” by the transformations.
However the variance heterogeneity, while reduced, is now greater at low values, and/or the
relation is curving downward at the lower end. The ML fitted model with (SE)s is
\[ \widehat{\log \mu}_p = 1.316\ (0.090) + 0.959\ (0.018)\, \log b, \]
which back-transforms to
\[ \hat\mu_p = e^{1.316}\, b^{0.959} = 3.73\, b^{0.959}. \]
The mean number of patients appears to be nearly proportional to the number of beds b,
but the value 0.959 is 2.28 SEs away from 1. The lognormal model is not well supported.
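A minimal numerical check of the back-transform and the proportionality t-statistic:

```python
import numpy as np

intercept, slope, se = 1.316, 0.959, 0.018
print(np.exp(intercept))     # 3.728... -> mu_p ~ 3.73 * b**0.959
print((slope - 1) / se)      # -2.28: the slope is 2.28 SEs below 1
```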
The alternative is to model the Gaussian variance directly, through a log-linear model:²
\[ \mu = \beta_0 + \beta_1 b, \qquad \log \sigma^2 = \lambda_0 + \lambda_1 \log b. \]
The log scale for the variance guarantees positive fitted variances, increasing or decreasing
log-linearly in log b – variances increasing or decreasing as a power function of b. The un-
transformed number of beds b could be used in the variance regression instead of log b; then
the variance will increase exponentially with b, faster than a power function. We illustrate
² A “saturated” model specifying unrelated different variances σ_i² for each observation will not be identifiable.
FIGURE 18.2
Numbers of patients treated and hospital beds, log scales
this possibility by fitting both variance models. The ML fitted mean and variance model
with (SE)s and log b are followed by those for b in the variance:
\begin{align*}
\hat\mu &= 14.84\ (4.98) + 2.989\ (0.052)\, b \\
\widehat{\log \sigma^2} &= 1.850\ (0.388) + 1.592\ (0.073)\, \log b \\
\hat\mu &= 26.24\ (9.73) + 3.019\ (0.060)\, b \\
\widehat{\log \sigma^2} &= 8.585\ (0.116) + 0.00626\ (0.00034)\, b.
\end{align*}
The mean functions and SEs are very close visually, with the slope within 1/3 of an SE of 3.
Both fitted mean models µ̂ (solid lines), together with the 95% variability bounds (dashed curves) based on µ̂ ± 2σ̂, are shown in colour in Figure 18.3. The log b variance model is
shown in red, the b variance model is in green. The red bounds exclude 20 points, 5% of
the sample. The exponential shape of the bounds for the b model is striking, but they lie
inside those for the log b model for b < 500. The result is that the green bounds exclude
many more data points than the red bounds. The frequentist deviances indicate clearly the
data preference for the models: 5,116.4 for the log b model and 5,164.8 for the b model. The
difference of 48.4 is very substantial. The fit of the log b model appears to be good.
A minor point is that the intercepts in the mean regression are not near zero. It might
seem that the regression should go through the origin, since at zero beds there must be zero
patients. However, there are no zero beds hospitals: this extrapolation is irrelevant to the
reality (the smallest hospital has ten beds). The mean regression coefficient is close to 3: the
mean number of patients is roughly three times the number of beds plus 15. The variance
increases with the number of beds at roughly the rate b1.6 .
FIGURE 18.3
Joint model ML fit with 95% variability bounds: red log b, green b
FIGURE 18.4
Data (circles), ML fitted mean (line) and variability bounds (triangles), log days
observed values, though this is difficult to see because of the variations in variability from
dependents. Converting the parameter ML estimates on the log days scale to the original
days scale, we show the observed data (circles), the fitted mean (line) and the 2 SD variability
bounds (red and blue triangles) for IQ in Figure 18.5, and in Figure 18.6 the observed data
(circles) and the 2 SD variability bounds (red and blue triangles) for DEPS. No mean function
is shown in the second figure as the mean is constant over dependents, though the variance
is not: it changes remarkably (quadratically) with DEPS.
The large variability with number of dependents suggests that there may be other unmea-
sured important variables (for example family income) related to both absence and number
of dependents.
FIGURE 18.5
Data (circles), ML fitted mean (curve) and variability bounds (triangles)
FIGURE 18.6
Data (circles) and variability bounds (triangles)
FIGURE 18.7
Count of fish species with lake area, log scales and Gaussian fitted quadratic mean and log-linear variance model with 2 SD bounds
FIGURE 18.8
Count of fish species with lake area, log scales and Poisson fitted quadratic model (black curve) with 4.5 SD precision bounds (green curves) and 4.5 SD prediction bounds (red curves)
mean, rather than freely modelled, and by the log scale of the response, which has decreasing
instead of increasing variability with increasing log area.
https://www.epa.gov/climate-indicators/climate-change-indicators-sea-surface-temperature.
The word “anomaly” refers to the deviation of each annual temperature from the average temperature over the period. This is a simple location change in the temperature variable.
There is a clear decline over the period 1880–1910, and a steady increase over 1910–2015,
apart from a very sudden increase and then decline in the period 1940–1945, and a much
smaller drop and increase in the period 1908–1911, together with a great deal of variation,
which appears to be decreasing over time. It is unclear how to model the variation.
The EPA quotes an uncertainty in the reported anomalies in terms of a 95% confidence
interval, presumably based on averaging across measurement occasions within each year.
The uncertainty jumps in the period 1907–1911, and jumps substantially in 1939–1945. It
decreases rapidly from 1945 to 1980, then increases until 2015. Figure 18.10 shows this vari-
ability as the measurement standard deviation implied by the Gaussian confidence interval.
In any regression model we need to allow for the precision of measurements. Other factors
may also need to be considered: the sudden jump in the period 1940–1945 corresponds to
World War II and the considerable sinkings of ships in many oceans, reducing the reliability
of measurements in this period, and possibly losing measurements completely in war zones.
The period 1907–1911 does not correspond to any obvious external event.
We need to weight each reported anomaly inversely by its measurement variance (squared SD). It is important to centre and scale the year to a time t to avoid the computation of high powers of large numbers.
FIGURE 18.9
Sea surface annual temperature anomaly 1880–2015 (°F)
FIGURE 18.10
Sea surface annual temperature anomaly standard deviation
FIGURE 18.11
Sea annual temperature anomaly and fourth-degree polynomial regression
FIGURE 18.12
Fourth-degree regression with 95% bounds (red segments) from the measurement SD
We begin with a standard Gaussian regression analysis with inverse variance weighting. This requires a fourth-degree mean function, though the quadratic term
is not needed. Figure 18.11 shows the fitted model with the data. The effect of the quartic
term is that the slope of the regression increases rapidly with increasing year. Figure 18.12
adds the 95% variability bounds (red segments) given by the EPA measurement standard
deviation. The variability bounds exclude 20 observations, 15% of the data. It is immediately
clear that the 1940–1945 period is an anomaly in the technical sense – the sea temperatures
in this period do not belong to the remainder of the model structure, and the period 1907–
1911 is also anomalous. Changes in level have occurred in these periods which cannot be
represented by the polynomial model.
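A sketch of the inverse-variance weighted polynomial fit with a centred and scaled time variable; the arrays year, anomaly and sd are hypothetical stand-ins for the EPA series:

```python
import numpy as np

def weighted_polyfit(year, anomaly, sd, degree=4):
    """Weighted least squares polynomial regression on a centred, scaled
    time variable (avoiding high powers of large year values); the
    weights are the inverse measurement variances."""
    t = (year - year.mean()) / year.std()
    X = np.column_stack([t**k for k in range(degree + 1)])
    w = 1.0 / sd**2
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * anomaly))
    return beta, X @ beta
```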
There is an additional difficulty, that the reported variability is inflated in the 1940–1945
period, and is much lower at the beginning and end of the full period analysed. This suggests
that an analysis which allows for smoothly varying variance is needed, through the double
GLM. We ignore the reported measurement variability, and fit a sixth-degree model in both
mean and log variance, eliminating unnecessary terms, and examine the effect of successively
removing the data for the periods 1940–1945 and 1907–1911 from the analysis.
The full 136 observations require a sixth-degree polynomial in the mean and a fifth
degree in the variance. The reason is clear from Figure 18.13. The two sets of anomalous
values induce bulges in the variance model from the high-degree terms. We now remove the
years 1940–1945 from the data and refit the models to the remaining 130 observations, in
Figure 18.14.
Both models now require fifth-degree polynomials, but the bulge in the variance model
remains. It now does not include the cluster of five observations at 1907–1911. We finally
exclude this cluster as well. The removal of these two clusters allows a constant variance model
to fit the remaining 125 observations with a fifth-degree mean regression, in Figure 18.15.
Five values fall outside the bounds, 4% of the restricted data. The two clusters excluded
explain the high-degree variance polynomial needed for the full data. The measurement
FIGURE 18.13
Sea annual temperature anomaly, DGLM polynomial regression with 95% bounds
FIGURE 18.14
DGLM polynomial regression with 95% bounds, excluding 1940–1945
FIGURE 18.15
Fifth-degree polynomial regression with 95% bounds, excluding 1940–1945 and 1907–1911
standard deviation does not represent adequately the data variability: this can be represented as constant variance apart from the two anomalous clusters. The fifth-degree mean function increases even more rapidly than the quartic with the EPA measurement model.
https://1v1d1e1lmiki1lgcvx32p49h8fe-wpengine.netdna-ssl.com/wp-content/uploads/2021/01/1609991325-SoTC2020 ag1 V52 900w 1.png (BOM Australia).
FIGURE 18.16
Probit plot of residuals from the EPA125 final model, excluding 1940–1945 and 1907–1911
change-point and break-point regression are among them. The idea is that there is a break,
or change, in the value of one or more model parameters at some point or points in the data
(there may be more than two sections). This is similar to a mixture, but different because
the break defines different models on each side of the break, with either continuity or a jump
in the response mean value given by the model at the break-point. We give two examples.
FIGURE 18.17
Nile flood volumes 1870–1969
FIGURE 18.18
Nile flood volumes, probit scale, with single Gaussian (line) and 95% credible region (red segments)
FIGURE 18.19
Nile flood volumes and ML linear model, 95% precision bounds (green) and prediction bounds (red)
The quadratic model (Figure 18.20) with 95% variability bounds suggests an increase
in volume towards the end of the period. The t-statistic for the quadratic term is 4.01: the
linear model is not adequate.
The quadratic model also appears to represent adequately the data variability: four points
(4%) are outside the bounds.
FIGURE 18.20
Nile flood volumes and ML quadratic model, 95% precision bounds (green) and prediction bounds (red)
FIGURE 18.21
Nile flood volumes break-point t values
FIGURE 18.22
Nile flood volumes, 1899 break
Are the data better modelled? Seven (7%) of the observations fall outside the 95% prediction region, a worse fit than the linear or quadratic models.
From a Bayesian point of view this analysis is unsatisfactory. We should be regarding the
break-point θ as an additional parameter in the model. Eyeballing the data to search for the
best break-point means that we have no external information about θ. So it should be given
a flat prior, on all the observations from 1871 to 1968. The extreme endpoints are excluded,
as they give a single Gaussian instead of two.
The analysis is now much heavier. Assuming we are fitting constant levels in both sub-
models, we have to repeat the above analysis for each break-point, find the sample mean and
variance in each sub-model and combine these to give the residual variance and the deviance
from the composite model.
A helpful summary for each model is the t-statistic for the difference of the two sub-model
means (Figure 18.21). The largest t value is 8.71 at 1899, but there are five values near 8.
We cannot say unequivocally that the change is at 1899. The t-statistic can be computed in
a loop of 98 repetitions with a transfer of one observation from one sub-model sample to the
other. The deviance is a 1:1 function of the t-statistic.
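A sketch of the t-statistic loop, assuming the flood volumes are held in year order in a numpy array y:

```python
import numpy as np

def breakpoint_t_values(y):
    """Two-sample t statistic for a mean difference at every interior
    break-point; pooled residual variance on n - 2 degrees of freedom."""
    n = len(y)
    ts = np.full(n, np.nan)
    for k in range(2, n - 1):          # extreme endpoints excluded
        a, b = y[:k], y[k:]
        s2 = (((a - a.mean())**2).sum() + ((b - b.mean())**2).sum()) / (n - 2)
        ts[k] = (a.mean() - b.mean()) / np.sqrt(s2 * (1/len(a) + 1/len(b)))
    return ts
```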
It is instructive to see the full Bayesian model comparison approach in this example. Each
break-point defines a Gaussian model with two Gaussian segments and a cut-point defining
them, with a frequentist deviance from the three parameters for each model. For the Gaussian
model the posterior deviance distributions are exactly χ23 , shifted by the frequentist deviance.
Figure 18.23 shows the full deviance distribution cdfs for the 98 possible break-points.
The individual years are not identified, but the first seven on the left – with the smallest
deviance distributions – are exactly those with the largest t-statistics. The cdfs are ex-
actly parallel and so their differences can be summarised by the frequentist deviances. Their
posterior model probabilities can be obtained by direct transformation of the deviances to
maximised likelihoods, and are shown in Table 18.1 for probabilities larger than 0.001.
FIGURE 18.23
Nile break-point model deviances
TABLE 18.1
Nile flood break-point posterior probabilities
year volume post.prob.
1896 1,260 0.0019
1897 1,220 0.0536
1898 1,030 0.1162
1899 1,100 0.7749
1900 774 0.0428
1901 840 0.0077
1902 874 0.0025
All the remaining possible break-points have a total posterior probability of 0.0004. The years
1899 and 1898 are the only ones with appreciable probabilities. In the further analyses below
we take the break-point to be at 1899. Model-averaging over the possible break-points would
change very little the results, as only 1898 has appreciable probability, which would differ
from 1899 by the location of only one observation.
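The transformation of deviances to posterior model probabilities can be sketched as follows, assuming equal prior probabilities across break-points:

```python
import numpy as np

def posterior_probs(deviances):
    """Posterior model probabilities proportional to the maximised
    likelihoods exp(-deviance/2)."""
    d = np.asarray(deviances)
    w = np.exp(-(d - d.min()) / 2)     # subtract the minimum for stability
    return w / w.sum()
```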
Could there be more than one break-point? Since we have identified one at 1899, we
keep this one and consider the two periods before and after 1899 for any possible break-
points in these periods. “Eyeballing” of Figure 18.22 is unhelpful. The period 1870–1899
seems too short, and the longer period 1900–1969 shows no obvious level change, though
the variabilities above and below the mean appear unequal. This raises a separate modelling
question: are the variances in the two break-point regions necessarily equal? Modelling the
variance with the linear, quadratic or break-point models gives no better fit than the constant
variance models. We leave open the possibility of the Bayesian bootstrap for the possibility
Extensions of GLMs 327
of non-Gaussian variability, though there is no sign of it. The flood volume decreased during the 100-year period, but whether this was linear, quadratic, or through a jump is unclear.
FIGURE 18.24
Down's incidence, and quadratic logit model
FIGURE 18.25
Down's incidence, and break-point quadratic model
TABLE 18.2
Posterior probability of a break-point at age
age 27 28 29 30 31 32 33 34 35
pp .009 .006 .001 .001 .001 .005 .008 .006 .183
age 36 37 38 39 40 41 42 43
pp .096 .107 .295 .039 .016 .120 .067 .040
At age 38 the posterior probability is less than 0.3, hardly a clear indication, despite the
substantial reduction in deviance which is also nearly achieved at other values of age. The
reflection of the quadratic trend in the last eight observations appears unreasonable.
The t value for the quadratic term past the break-point is −1.5: this term could be
eliminated from the model, leaving a linear trend from age 39 onwards, as is visible in the
figure. This may be less unreasonable, but we do not discuss it further. No strong conclusion
can be drawn about the value of the break-point model for these data.
FIGURE 18.26
Durations against waiting times, Old Faithful
The eruption times of Old Faithful, the famous Yellowstone National Park volcanic geyser,
have been recorded many times to investigate the regularity of its eruptions. The Wikipedia
site for Old Faithful gives a graph of one of the public data sets, which shows eruption
durations di against waiting times wi between eruptions. A graph of the same data is shown
in Figure 18.26.
The figure has a very clear message: short waiting times correspond to short durations,
and long waiting times to long durations. There are almost no intermediate waiting times,
or durations. A linear regression appears to be well-supported; see Figure 18.27. The near-
separation of the data into two point clouds raises possible difficulties with this simple
interpretation. It is possible that, within each cloud, the relation of duration to waiting time
is much weaker, if it is even the same in the two clouds. This kind of heterogeneity can be
investigated with mixture modelling as described in §15.5.
However, another difficulty arises from an alternative interpretation of the data. Azzalini
and Bowman (1990) described their 272 observations, tabulated as durations di followed by
waiting times wi , as
the time interval between the starts of successive eruptions wi , and the duration of
the subsequent eruption di .
(Emphasis added)
This implies that the first eruption duration d1 was followed by the first waiting time w1
before the second eruption duration d2 . So the first waiting time applies to the second eruption
time, not to the first. This would follow from an observation process which begins during a
(left-censored) waiting period, so that the first recorded observation would be of the length
of an eruption duration, followed by the next waiting period, and continuing with these
alternations.
FIGURE 18.27
Durations di against waiting times wi with ML regression, Old Faithful
If we graph the resulting 271 observations di+1 against wi , as did Azzalini and Bowman,
we see something quite different. Now there appear to be three or four regions or point clouds
and their interpretation is quite different (Figure 18.28):
• For waiting times less than 70 minutes, nearly all the durations are long;
• for waiting times longer than 70 minutes, about 40% of the durations are short, and 60%
are long, though shorter than the durations for the short waiting times;
• the evidence for regression of duration on waiting time is weak in all three major clouds;
• there is no simple model description of the structure: a three- or four-component mixture
of linear regressions might be necessary to accommodate the small cloud of six in the
left-hand bottom quadrant;
• within the components the regressions may be quite different.
Azzalini and Bowman gave a detailed discussion of the geological mechanism of the eruptions
and waiting periods which could produce this effect.
Figures 18.29, 18.30, 18.31 and 18.32 show the two-, three-, four- and five-component
mixture regressions. The observations are coloured by the line colour if they have (ML
estimated) posterior probability greater than 0.5 of being in that coloured component. Un-
coloured (white) observations have appreciable probabilities of being in more than one com-
ponent. Details of the model fits are given in Table 18.3.
The two-component model is very clear. The two upper point clouds are fitted as one
component in all models except the five-component. The blue component regression is almost
constant. As more components are added, these two components lose peripheral observations
to the added components, though their slopes remain much the same.
FIGURE 18.28
Durations di+1 against waiting times wi, Old Faithful
FIGURE 18.29
Durations against waiting times, two components
FIGURE 18.30
Durations against waiting times, three components
FIGURE 18.31
Durations against waiting times, four components
FIGURE 18.32
Durations against waiting times, five components
TABLE 18.3
K-component Gaussian regressions for the Old Faithful data
K    p    σ̂      dev    AIC    BIC    π1     π2     π3     π4     π5
1    3    0.950  741.1  747.1  757.9  1
2    5    0.334  524.4  534.4  552.4  0.641  0.359
3    8    0.305  502.3  518.3  547.1  0.594  0.153  0.253
4    11   0.256  475.0  497.0  536.6  0.536  0.080  0.151  0.233
5    14   0.192  454.9  482.9  533.3  0.261  0.295  0.076  0.129  0.239
The third green component appears artificial, with observations only at both ends. Unas-
signed observations, with no component membership probability greater than 0.5, occur only
in the three-component model, and there are only three of them.
The fourth orange component takes over some of the green and red components’ obser-
vations, plus the unassigned observations in the three-component model.
The five-component model splits the first component into two, shown as the red and
violet lines. The six-component model (not shown) has a deviance of 449.5 and the seven-component model a deviance of 443.4: the small deviance changes for K = 6 and 7 are less than the penalties for these values, for both AIC and BIC.
The visual evidence for more than two components is not compelling. The common frequentist model comparison methods from Table 18.3 give K = 5 as the best model.
Further assessment requires the Bayesian model comparison, as in the galaxy recession
velocity data: we do not pursue that here.
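For concreteness, a sketch of an EM algorithm for a K-component Gaussian mixture of linear regressions with a common residual variance, the model summarised in Table 18.3; the initialisation and iteration count are illustrative assumptions:

```python
import numpy as np

def em_mixture_regressions(x, y, K, n_iter=500, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    pi = np.full(K, 1 / K)
    beta = rng.normal(scale=0.5, size=(K, 2)) + np.array([y.mean(), 0.0])
    sigma2 = y.var()
    for _ in range(n_iter):
        # E-step: posterior component membership probabilities.
        mu = X @ beta.T                                    # n x K fitted means
        logf = -0.5 * ((y[:, None] - mu)**2 / sigma2 + np.log(2*np.pi*sigma2))
        w = pi * np.exp(logf - logf.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        # M-step: weighted least squares per component, common variance.
        for k in range(K):
            wk = w[:, k]
            beta[k] = np.linalg.solve(X.T @ (wk[:, None] * X), X.T @ (wk * y))
        pi = w.mean(axis=0)
        sigma2 = np.sum(w * (y[:, None] - X @ beta.T)**2) / n
    return pi, beta, sigma2, w
```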
FIGURE 18.33
Ages and lengths of dugongs
FIGURE 18.34
Ages and lengths of dugongs, log scales
FIGURE 18.35
Ages and lengths of dugongs, with fitted log-log model
FIGURE 18.36
Ages and lengths of dugongs, with fitted exponentiated model
Here γ must be larger than the largest y value in the data. This model is not a member of
the linear model family: γ, the parameter of interest, appears non-linearly.
Maximum likelihood analysis can be carried out by profiling – hill-climbing: if we fix the
value of γ, α and β can be estimated by Gaussian ML, assuming the data model has added
Gaussian variability. By defining a grid of values of γ and maximising the likelihood over the
other parameters, we can find the profile MLE of γ which maximises this profile (maximised)
likelihood in γ.
For the dugong data, this approach is unsuccessful: the maximised likelihood increases
indefinitely with γ. The literature on this model mentions the difficulty of estimating this
parameter when there are no or few large animals – in effect, we have to extrapolate outside
the data range.
One possible way of having an asymptote is to use inverse polynomials (Nelder 1966;
McCullagh and Nelder 1983). With the model µ = α − β/x, the mean increases with x to
an asymptote of α. The same asymptote structure results with any positive power of x⁻¹, or multiple such powers. We illustrate with the inverse square root power x₁ = x^{−1/2} and its square x₂ = x₁² = x^{−1}. We also need to model the variability. After some searching
over combinations of x1 and x2 in the mean and variance models, we find the model with
covariates x1 and x2 in both mean and variance models gives the best fit.
The fitted mean model with SEs is
\[ \hat\mu = 3.118\ (0.061) - 2.663\ (0.236)\, x_1 + 1.344\ (0.182)\, x_2, \]
and the fitted variance model on the log scale (link), with SEs, is
\[ \widehat{\log \sigma^2} = -7.256\ (1.595) + 13.41\ (7.09)\, x_1 - 15.63\ (6.29)\, x_2. \]
The fitted model, with (green) bounds for the 95% precision region and (red) bounds for the
95% variability region, and the sample data, are shown in Figure 18.37. All observations fall
inside the 95% variability region. The model appears to fit well.
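A sketch of the inverse-polynomial design and mean fit; the arrays age and length are hypothetical stand-ins for the dugong data, and the variance model would be fitted with the double GLM machinery above:

```python
import numpy as np

def inverse_poly_design(age):
    """Design matrix [1, x1, x2] with x1 = age**(-1/2) and x2 = x1**2."""
    x1 = age ** -0.5
    return np.column_stack([np.ones(len(age)), x1, x1**2])

# X = inverse_poly_design(age)
# beta = np.linalg.lstsq(X, length, rcond=None)[0]
# beta[0] is the asymptote: the mean length as age -> infinity.
```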
FIGURE 18.37
Ages and lengths of dugongs, with fitted inverse polynomials and 95% bounds
The mean model intercept is the asymptote of the regression: 3.12 (0.06) metres, with 95% confidence interval [2.99, 3.24]. This interval is well above the longest sample dugong.
Without information about the age cycle of dugongs, we are unable to say whether this is a
reasonable range or not. Wikipedia gives dugongs a lifetime of 70 years; at this value of age
the model mean length is 2.82 metres: the sample we have is of young dugongs. The Great
Barrier Reef Marine Park Authority gives dugongs a mature length of up to three metres.
We leave further investigation to the student.
variables: in this model there is no direct connection between response and covariates, but
the model can be generalised to include this. Since the Zij are unobservable, we are unable to
determine which latent variables are “active” – have the value 1 – in the observed responses.
While the number p of covariates is known, the number q of latent variables is not known,
since they are not observable. This is a standard problem with finite mixtures.
The observed data Y_i and x_i, eliminating the unobserved Z_ij, have a complex mean structure:
\[ E[Y_i \mid x_i] = \sum_{j=1}^q \lambda_j E[Z_{ij} \mid x_i]
 = \sum_{j=1}^q \lambda_j \frac{\exp(\beta_j' x_i)}{1 + \exp(\beta_j' x_i)}. \]
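The mean structure can be computed directly; a minimal sketch in which lam (the q component weights) and betas (a q × p coefficient array) are illustrative parameters:

```python
import numpy as np

def latent_class_mean(x, lam, betas):
    """E[Y | x] = sum_j lam_j * exp(beta_j'x) / (1 + exp(beta_j'x))."""
    eta = betas @ x                       # q linear predictors
    return lam @ (1 / (1 + np.exp(-eta)))
```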
The early analyses of this complex mean structure used the residual sum of squares from
the fitted mean function as the objective function to be minimised. Details can be found in
the major books of Michie, Spiegelhalter and Taylor (1994), Ripley (1996) and Kay and Tit-
terington (1999). Considerable difficulties with the convergence of the optimising algorithm
with this model led to a loss of interest in neural networks by the statistical and computer
science communities. Aitkin and Foxall (2003) pointed out that the objective function being
minimised was not the negative log-likelihood for the complex model, as the unconditional
variance was not constant, but an even more complex function with two sets of regression
parameters:
By taking advantage of the incomplete data form of the model, Aitkin and Foxall gave both
scoring and EM algorithms for ML estimation in the explicitly formulated latent variable
model. Figure 18.38, reproduced from their paper, shows the ML fit of a four-node model
to the motorcycle acceleration data, discussed at length in §18.9. It did not allow for the
increasing variance.
In later research papers, the EM algorithm was extended to the multilevel “deep learning”
model. We do not give details here.
FIGURE 18.38
Acceleration of motorcycle helmet, with four-node neural network fit
link was developed extensively by Holland and Linehart (1981); a broad coverage is given
by Lusher, Koskinen and Robins (2013).
The Rasch model is a main effect or additive model, in events and women, on the logit scale:
\[ \mathrm{logit}\, p_{ij} = \log \frac{p_{ij}}{1 - p_{ij}} = \theta_i + \phi_j. \]
The model has no group structure for women, and so plays the role of a baseline model for
comparison with models with group structure.
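A minimal sketch of the Rasch attendance probabilities, with illustrative effect vectors theta (women) and phi (events):

```python
import numpy as np

def rasch_prob(theta, phi):
    """p_ij with logit p_ij = theta_i + phi_j."""
    logit = theta[:, None] + phi[None, :]
    return 1 / (1 + np.exp(-logit))
```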
An important question is how to specify the number of classes K. The finite mixture
structure allows ML and Bayesian analyses by EM and DA. We do not give details here.
Membership of each woman in the groups is expressed probabilistically through her condi-
tional group probabilities, given the number of groups and her event attendance pattern.
The number of groups is determined through the posterior distributions of the deviances for
each number of groups.
For the Natchez women data the two-group model has very high probability. Women
1–8 have very high probabilities of belonging to Group 1; Women 10–18 have very high
probabilities of belonging to Group 2. Woman 9 has probability close to 0.5 of being in both
groups. The last result may look ambiguous, or a model failure, but the sociologists asked
the women to define which group they belonged to. All agreed that Women 1–8 were in
Group 1 and Women 10–18 were in Group 2, and that Woman 9 belonged to both groups.
This can be understood from Woman 9’s attendance at only four events: 5, 7, 8 and 9:
she was not a frequent attender. All the women in Group 1 except one attended event 5,
and all the women in Group 2 except one attended event 9. So Woman 9’s membership in
both groups was consistent with her attendance pattern.
This analysis was the only one of the 21 analyses which identified Woman 9’s joint
membership. This modelling approach was applied to the identification of the structure of
the Noordin Top terrorist network in Aitkin, Vu and Francis (2017).
FIGURE 18.39
Acceleration of motorcycle helmet
FIGURE 18.40
Acceleration of motorcycle helmet after 13.8 ms, with ML mean and variance polyfit and 95% precision (green) and variability (red) bounds
We begin with a high-degree polynomial for the mean function and a fourth-degree polynomial for the log variance function, applied to the data after the break.
We reduce the unnecessary terms, first from the variance model and then from the mean
model, terminating with a second degree model for the variance, and an 11th degree poly-
nomial without the seventh- and ninth-degree terms for the mean. The ML fitted model is
shown over the restricted time scale in Figure 18.40, with 95% variability bounds (red) and
95% precision bounds (green) around the fitted mean model. The horizontal and vertical
scales are both different in Figures 18.39 and 18.40.
Five observations out of the 113, 4%, fall outside the 95% bounds, a good fit to the
variability, though the wiggles in the fitted mean beyond 40 ms are unconvincing. The
precision of the polynomial model decreases and then increases with time.
Many modelling statisticians are sceptical of high-degree polynomials, which may be
unstable and cannot be extrapolated beyond the observed data. The instability is a matter
of over-fitting the degree of the polynomial terms, and the critical need for orthogonalisation
of the components. Centering and scaling the time scale are essential to avoid numerical
underflow or overflow. The extrapolation difficulty is true of all models, and the polynomial
is not intended to be a representation of an unobserved process, but simply a smoothing of
the variability, to allow the complex identifiable trend to be represented.
The fitted model has no useful engineering or design interpretation, and cannot be ex-
trapolated beyond the observed data: it is entirely empirical. The “cyclic” behaviour of the
helmet after the collision suggests that a real physical model for the mean, of a damped
simple harmonic motion (SHM), could fit the observed behaviour and could be extrapolated
beyond the upper time point.
Inspection of the data shows that the acceleration returns to zero at about 28 ms, then
continues in the second half of the full cycle period to about 48 ms, though with reduced
amplitude – the motion is damped by a factor of about λ̃ = 35/120 ≈ 0.29 (the ratio of the maxima). Beyond this time the cyclic pattern is lost in the noise level, which increases steadily from the initial collision and later decreases.
The damped sine wave can be represented initially by
\[ \mu(t) = \alpha^* e^{-\lambda t} \sin(2\pi t/\tau), \]
where λ is the decay constant and τ the period – the time it takes for a single cycle of the SHM. An immediate problem appears. The sine function cycles between positive and negative values as time increases, but starts at zero, then increases from t = 0. To make it decrease, we need to shift the sine function through π, and redefine the time scale to start from 0 at 13.8 ms: t* = t − 13.8:
\[ \mu(t^*) = \alpha^* e^{-\lambda t^*} \sin(2\pi t^*/\tau + \pi) = -e^{\alpha + \beta t^*} \sin(\gamma t^*), \]
where α = log(α*), β = −λ, γ = 2π/τ. If we move the time origin to 13.8 ms, then the period can be roughly estimated as τ̃ = 2 × (28 − 13.8) = 28.4 ms.
There is no straightforward ML analysis for the model other than a general Newton-
Raphson. Both maximum likelihood and Bayesian analyses can be obtained by setting up
a 3-D grid in the parameters and assessing the likelihood near the edges of the grid for
possible grid changes. The posterior distributions of the parameters can then be obtained
by rescaling the likelihood.
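A sketch of the random-grid deviance evaluation for the damped SHM model, profiling out σ²; the arrays tstar and y, and the grid centre and spread, are assumptions keyed to the eyeballed values quoted below:

```python
import numpy as np

rng = np.random.default_rng(3)

def shm_deviance(theta, tstar, y):
    """Gaussian deviance (up to a constant), with sigma^2 profiled out:
    dev = n * log(RSS / n) for the damped sine mean above."""
    alpha, beta, gamma = theta
    mu = -np.exp(alpha + beta * tstar) * np.sin(gamma * tstar)
    rss = np.sum((y - mu) ** 2)
    return len(y) * np.log(rss / len(y))

def random_grid(tstar, y, centre, spread, M=10_000):
    """Evaluate the deviance over a wide random grid around `centre`."""
    draws = centre + spread * rng.uniform(-1, 1, size=(M, 3))
    devs = np.array([shm_deviance(th, tstar, y) for th in draws])
    return draws, devs
```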
The analysis used a wide random grid of 10,000 points with central values the eyeballed
parameter values α0 = 5.4, β0 = −0.070, γ0 = 0.224. The grids had to be extended repeat-
edly as the likelihood was very flat. The 10,000 random draws of the parameters with the
corresponding deviances are shown in Figures 18.41, 18.42 and 18.43 over the final grid, with the deviance range restricted to the minimum deviance (1,068) plus 10, covering the range of appreciable likelihood.
FIGURE 18.41
Deviances and α
FIGURE 18.42
Deviances and β
FIGURE 18.43
Deviances and γ
FIGURE 18.44
Cdf of α
The deviance graphs for all three parameters are very poorly defined, with no clear minimising parameter values. Computationally, the minimum deviance was 1,068 at the MLEs α̂ = 7.44, β̂ = −0.112, γ̂ = 0.0017. These values are far from the eyeball
estimates; the bunching of multiple acceleration values at the same time confuses the eyeball
location process.
The posterior distributions shown in Figures 18.44, 18.45 and 18.46 are very unusual.
Those for α and γ have a long left tail and are then nearly uniform over the ranges shown,
while β has its MLE at the left-hand end with a long right tail. The data are not very
informative about the model, in spite of the very clear damped SHM structure. Many fitted
models will be consistent with the data: we do not pursue any further elaboration of the
model. Something important is missing in this model.
Some analysts of the data have expressed concerns over the apparent increase in accelera-
tion at the end of the time period, implying their expectation of a return to zero acceleration.
The damped SHM model has this increase, and limiting damping to zero magnitude, as a
consequence of its model structure.
We now return to the peculiarity of the bunching of multiple accelerations. What is
happening before 13.8 ms? If we graph just the early observations, we see something quite
bizarre (Figure 18.47). It appears that there are four different data sets which have been
amalgamated. In only one set is the helmet at constant velocity before the collision; in the
other sets the helmet is moving with different small but constant negative accelerations –
braking – before the collision. Up to 13.8 ms the data set membership of each observation
is clear, but beyond this point we are unable to identify data set memberships. To do this
would require a mixture model of four damped SHMs, which evidently have slightly different
parameters. This particularly complex model could be fitted by extending the likelihood
with the additional parameters of the mixture components. We do not discuss it further: we
FIGURE 18.45
Cdf of β
FIGURE 18.46
Cdf of γ
FIGURE 18.47
Acceleration of motorcycle helmet, first 13.8 ms
do not need to identify the individual recorders’ data. Constructing a credible region for any
of the four recordings is beyond the scope of this book, and will not be of great interest.
What would be of interest is to have the four disaggregated recordings.
The unmodelled variations between the four sets, and their slight differences in param-
eters, explain the large residual variation which also increases with time as the individual
curves diverge.
19
Appendix 1 – length-biased sampling
It is possible that even when a simple random sample of the population has been drawn,
and the values of Y obtained from all the sample members, the resulting values of Y are
not a simple random sample. This is the case when the sampling used is informative, in the
sense described in §6.5: that the selection of population member I into the sample depends
on the value of YI . It may be difficult to imagine how this could occur, since the values of
YI are observed only after the sample is drawn.
However, an important class of duration sampling designs do have this property, and
result in length-biased sampling. A simple example is that of residential tenure. We want to
investigate how long people live – their length of tenure – in their residences – dwellings –
before moving. We have the list of dwellings in the area of study interest, and draw a simple
random sample of dwellings of size n from this list. We assume it is with replacement – the
sample size is small compared to the population size.
We then visit each dwelling and ask the residents there how long they have lived there.
We record the residential tenure yi for each household i, and could easily assume that we have
a random sample of tenures. However some thought will show that this is not so: families
who move rarely will have a greater chance of being in the sampled dwellings than families
who move frequently (Cox and Miller 1965). We need to model this dependence in some way.
Here we assume that the probability of being included in the sample increases linearly with
tenure (time).
The population data are the pairs (UI , YI ) for I = 1, . . . , N . The data from which we
construct the likelihood are the sampled pairs (ui , yi ) with ui = 1, and the values uI = 0
for the unsampled dwellings. As we have no tenure recorded for the unsampled dwellings,
the likelihood contribution from these uI is just a constant, as in Chapter 2 – the dwellings
were drawn by a simple random sample. The probability contribution of a sampled dwelling
pair is
\[ f^*(y) = c \cdot y\, f(y), \]
where f ∗ (y) is the distribution of the sampled durations, f (y) is the probability model
for the population value y and c is a constant of proportionality, reflecting the fact that the
chance of being included in the sample increases linearly with tenure y. We have assumed the
simplest model for inclusion; we could assume more generally that the probability increases
monotonically with y.
We want to find the probability density f ∗ (y) of the sampled values under this specifi-
cation of informative sampling. Since the density f ∗ (y) has to integrate to 1 over the range
of y (assumed positive), we must have
\[ 1 = \int_0^\infty f^*(y)\, dy = c \int_0^\infty y f(y)\, dy = c\,\mu, \]
where µ is the mean of y. So c = 1/µ, and the density of y under this informative sampling is
\[ f^*(y) = y f(y)/\mu. \]
This specification allows us to correctly estimate the model parameter(s), even under infor-
mative sampling, if the specification is correct.
We note immediately that the mean and variance of the sampled durations are quite different from those of the underlying population. The mean and variance are
\[ E^*[Y] = \int y^2 f(y)\, dy / \mu = \mu + \sigma^2/\mu, \]
\[ \mathrm{Var}^*[Y] = \int y^3 f(y)\, dy / \mu - (E^*[Y])^2
 = \frac{\mu_3 + 3\mu\sigma^2 + \mu^3}{\mu} - \Big(\mu + \frac{\sigma^2}{\mu}\Big)^2, \]
where µ₃ is the third central moment of Y: E[(y − µ)³]. So the mean of the sampled durations exceeds µ, the true population mean (as expected), and the variance depends on the third moment of the true population.
Suppose for example that we specify the duration of tenure in the population by an exponential distribution with mean µ. The likelihood, log-likelihood and its derivatives for the informatively sampled durations y₁, ..., yₙ are:
\begin{align*}
L(\mu) &= \prod_{i=1}^n [y_i f(y_i)/\mu] = \prod_{i=1}^n [y_i \exp(-y_i/\mu)/\mu^2] \\
\ell(\mu) &= \sum_{i=1}^n [\log y_i - y_i/\mu - 2 \log \mu] \\
\frac{d\ell}{d\mu} &= \sum_{i=1}^n [y_i/\mu^2 - 2/\mu] \\
\frac{d^2\ell}{d\mu^2} &= \sum_{i=1}^n [-2y_i/\mu^3 + 2/\mu^2].
\end{align*}
So the MLE of µ is ȳ/2, not ȳ, and its asymptotic variance is ȳ²/(8n) = µ̂²/(2n). The sample mean is seriously biased – twice the correct MLE – and its variance is four times that of the MLE.
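A simulation sketch verifying the result: under this length-biased sampling the durations have the Gamma(2, µ) density y e^{−y/µ}/µ², so the sample mean estimates 2µ and ȳ/2 recovers µ:

```python
import numpy as np

rng = np.random.default_rng(4)

mu = 5.0
y = rng.gamma(shape=2.0, scale=mu, size=100_000)  # length-biased draws
print(y.mean())       # ~ 10 = 2*mu: biased as an estimate of mu
print(y.mean() / 2)   # ~ 5 = mu: the MLE
```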
We do not take this discussion further.
20
Appendix 2 – two-component Gaussian mixture
Writing the two-component Gaussian mixture Hessian formally at length, we have (with the expected Hessian terms on the left, the covariance terms on the right):
\begin{align*}
H_y(\mu_1,\mu_1) &= -\sum_{i=1}^n \frac{Z_i^*}{\sigma_1^2}
 + \sum_{i=1}^n \frac{Z_i^*(1-Z_i^*)(y_i-\mu_1)^2}{\sigma_1^4} \\
H_y(\mu_1,\sigma_1) &= -\frac{2}{\sigma_1^3}\sum_{i=1}^n Z_i^*(y_i-\mu_1)
 + \sum_{i=1}^n \frac{Z_i^*(1-Z_i^*)(y_i-\mu_1)}{\sigma_1^2}
   \Big(-\frac{1}{\sigma_1}+\frac{(y_i-\mu_1)^2}{\sigma_1^3}\Big) \\
H_y(\mu_1,\mu_2) &= -\sum_{i=1}^n \frac{Z_i^*(1-Z_i^*)(y_i-\mu_1)(y_i-\mu_2)}{\sigma_1^2\sigma_2^2} \\
H_y(\mu_1,\sigma_2) &= \sum_{i=1}^n \frac{Z_i^*(1-Z_i^*)(y_i-\mu_1)}{\sigma_1^2}
   \Big(\frac{1}{\sigma_2}-\frac{(y_i-\mu_2)^2}{\sigma_2^3}\Big) \\
H_y(\mu_1,p) &= \sum_{i=1}^n \frac{Z_i^*(1-Z_i^*)(y_i-\mu_1)}{p(1-p)\sigma_1^2} \\
H_y(\sigma_1,\sigma_1) &= \sum_{i=1}^n \frac{Z_i^*}{\sigma_1^2}
   \Big(1-\frac{3(y_i-\mu_1)^2}{\sigma_1^2}\Big)
 + \sum_{i=1}^n \frac{Z_i^*(1-Z_i^*)}{\sigma_1^2}
   \Big(-1+\frac{(y_i-\mu_1)^2}{\sigma_1^2}\Big)^2 \\
H_y(\sigma_1,\mu_2) &= -\sum_{i=1}^n \frac{Z_i^*(1-Z_i^*)(y_i-\mu_2)}{\sigma_2^2}
   \Big(-\frac{1}{\sigma_1}+\frac{(y_i-\mu_1)^2}{\sigma_1^3}\Big) \\
H_y(\sigma_1,p) &= \sum_{i=1}^n \frac{Z_i^*(1-Z_i^*)}{p(1-p)}
   \Big(-\frac{1}{\sigma_1}+\frac{(y_i-\mu_1)^2}{\sigma_1^3}\Big) \\
H_y(\mu_2,\mu_2) &= -\sum_{i=1}^n \frac{1-Z_i^*}{\sigma_2^2}
 + \sum_{i=1}^n \frac{Z_i^*(1-Z_i^*)(y_i-\mu_2)^2}{\sigma_2^4} \\
H_y(\mu_2,\sigma_2) &= -\frac{2}{\sigma_2^3}\sum_{i=1}^n (1-Z_i^*)(y_i-\mu_2)
 + \sum_{i=1}^n \frac{Z_i^*(1-Z_i^*)(y_i-\mu_2)}{\sigma_2^2}
   \Big(-\frac{1}{\sigma_2}+\frac{(y_i-\mu_2)^2}{\sigma_2^3}\Big) \\
H_y(\mu_2,p) &= -\sum_{i=1}^n \frac{Z_i^*(1-Z_i^*)(y_i-\mu_2)}{p(1-p)\sigma_2^2}
\end{align*}
An important point is that while the expected Hessian terms involve only the linear and
quadratic terms in y, the covariance matrix of the score involves third and fourth powers of
y. Departures of the response model from the assumed Gaussianity in each component will
therefore affect the observed data Hessian and the stated precisions of the model parameters.
21
Appendix 3 – StatLab variables
The description and codes for a selection of the variables are given in the tables.
Variable Code/Description
CB Child blood type:
1 O – Rh negative
2 A – Rh negative
3 B – Rh negative
4 AB – Rh negative
5 O – Rh positive
6 A – Rh positive
7 B – Rh positive
8 AB – Rh positive
9 Unknown
LGTH Length of baby to 0.1 inch (2.54 mm)
CBWGT Weight of baby to 0.1 pound (45.4 gm)
C10HGHT Height of child (age ten) to 0.1 inch (2.54 mm)
C10WGT Weight of child (age ten) to nearest pound (454 gm)
PEA Score on the Peabody Picture Vocabulary Test
RA Score on the Raven Progressive Matrices Test
Variable Code/Description
INCB Family income at time of birth, in units of $100
INC10 Family income at child’s age ten, in units of $100
CHURCH Church attendance at child’s age ten:
1 Entire family attends church fairly regularly
2 Mother and child attend fairly regularly
3 Child only attends fairly regularly
4 Anyone in family attends sporadically
5 Anyone in family attends on Holy Days only
6 No-one in family ever attends
Variable Code/Description
MB Mother’s blood type (same codes)
MAGE Mother’s age at baby’s birth (years)
MBWGT Weight of mother at diagnosis of pregnancy
MBOCC Mother’s occupation at diagnosis of pregnancy:
0 Housewife
1 Office/clerical
2 Sales
3 Teacher/counsellor
4 Professional/managerial
5 Services
7 Factory worker
8 All other
MBSM Mother’s cigarette smoking history at diagnosis of pregnancy:
N Never smoked
Q Smoked at one time but has now quit
01–99 Number smoked per day
M10HGHT Mother’s height at child’s age ten to 0.1 in (2.54 mm)
M10WGT Mother’s weight at child’s age ten to nearest pound (454 gm)
M10ED Mother’s education at child’s age ten:
0 Less than 8th grade
1 8th–12th grade
2 High school graduate
3 Some college
4 College graduate
M10OCC Mother’s occupation at child’s age ten (same codes)
M10SM Mother’s cigarette smoking history at child’s age ten
Variable Code/Description
FB Father’s blood type (same codes)
FAGE Father’s age at baby’s birth (years)
FBOCC Father’s occupation at diagnosis of pregnancy:
0 Professional
1 Teacher/counsellor
2 Manager/official
3 Self-employed
4 Sales
5 Clerical
6 Craftsman/operator
7 Laborer
8 Service worker
FBSM Father’s cigarette smoking history at diagnosis of pregnancy
(same codes)
F10HGHT Father’s height at child’s age ten to 0.1 in (2.54 mm)
F10WGT Father’s weight at child’s age ten to nearest pound (454 gm)
F10ED Father’s education at child’s age ten
F10OCC Father’s occupation at child’s age ten
F10SM Father’s cigarette smoking history at child’s age ten
22
Appendix 4 – a short history of statistics from 1890
This history is of the major contributors to the subject up to the 1960s. The later modern
Bayesian developments are not detailed or discussed here.
where a, b₀, b₁ and b₂ were functions of the skewness β₁ and kurtosis β₂ of the density:
\begin{align*}
b_0 &= \frac{4\beta_2 - 3\beta_1}{10\beta_2 - 12\beta_1 - 18}\,\mu_2, \\
a = b_1 &= \sqrt{\mu_2}\,\sqrt{\beta_1}\,\frac{\beta_2 + 3}{10\beta_2 - 12\beta_1 - 18}, \\
b_2 &= \frac{2\beta_2 - 3\beta_1 - 6}{10\beta_2 - 12\beta_1 - 18}.
\end{align*}
• a new “Method of Moments” (later abbreviated to MOM) for fitting these densities
to data, by equating the sample moments to the population moments and solving the
resulting system of simultaneous equations for the parameter estimates.
These tools were widely used. Pearson applied his test to many published data sets, including
the famous Weldon dice data. The χ2 test showed that many of these data sets failed to fit the
Gaussian distribution. These examples began to discredit the almost universal assumption
of “normality”. Weldon’s data failed to fit the symmetric probability of 1/6 for all die faces.
In 1901, with Weldon and Galton, he founded the journal Biometrika whose object was
the development of statistical theory. He edited the journal until his death. In 1911 Pearson
relinquished the Goldsmid chair to become the first Galton professor of eugenics, a chair that
was offered first to him in keeping with Galton’s expressed wish. He formed the Department
of Applied Statistics into which he incorporated the Biometric and Galton laboratories. He
retired in 1933 but continued to work in a room at University College until a few months
before his death.
Pearson’s developments greatly widened the scope of data analysis in Britain and estab-
lished the basis of mathematical statistics internationally.
had one linear restriction on them. The number of degrees of freedom had therefore to be reduced by 1. Pearson did not accept this initially, but later did.
Pearson’s method of moments approach to parameter inference was inconsistent with
Fisher’s likelihood approach, and Fisher dismissed the former as inefficient, as for example
in the negative binomial distribution. Here the moment estimates were not based on the
likelihood, though Fisher allowed their value as initial estimates in a Newton-Raphson ML
procedure.
Nevertheless the method of moments remains alive and well, especially in economics
applications, as the “generalised method of moments” (GMM). This can be used when the
distribution of the response variable is not fully specified, so the likelihood is not defined.
References

Abernethy, R.B., Breneman, J.E., Medlin, C.H. and Reinman, G.L. (1983) Weibull Analysis Handbook. Aero Propulsion Laboratory, Air Force Wright Aeronautical Laboratories, Wright-Patterson Air Force Base, Ohio.
Abernethy, R.B. (2010) The New Weibull Handbook (5th edn.). At www.barringer1.com/
tnwhb.htm.
Agresti, A. and Caffo, B.S. (2000) Simple and effective confidence intervals for proportions
and differences of proportions result from adding two successes and two failures. The
American Statistician 54, 280–288.
Aitkin, M. (1987) Modelling variance heterogeneity in normal regression using GLIM. Ap-
plied Statistics 36, 332–339.
Aitkin, M. (1992) Evidence and the posterior Bayes factor. The Mathematical Scientist 17,
15–25.
Aitkin, M. (2010) Statistical Inference: An Integrated Bayesian/Likelihood Approach. Boca
Raton: CRC Press.
Aitkin, M. (2018) A history of the GLIM statistical package. International Statistical Review
86, 275–299.
Aitkin, M. and Foxall, R. (2003) Statistical modelling of artificial neural networks using
the multi-layer perceptron. Statistics and Computing 13, 227–239.
Aitkin, M., Liu, C.C. and Chadwick, T. (2009) Bayesian model comparison and model
averaging for small-area estimation. Annals of Applied Statistics 3, 199–221.
Aitkin, M. and Stasinopoulos, M. (1989) Likelihood analysis of a binomial sample size
problem. In Contributions to Probability and Statistics: Essays in Honor of Ingram Olkin,
eds L.J. Gleser, M.D. Perlman, S.J. Press and A.R. Sampson, New York: Springer-Verlag,
399–411.
Aitkin, M., Vu, D. and Francis, B. (2014) Statistical modelling of the group structure of
social networks. Social Networks 38, 74–87.
Aitkin, M., Vu, D. and Francis, B. (2017) Statistical modelling of a terrorist network.
Journal of the Royal Statistical Society A 180, 751–768.
Ando, T. (2010) Bayesian Model Selection and Statistical Modelling. Boca Raton: Chapman
and Hall/CRC Press.
Anscombe, F.J. (1964) Normal likelihood functions. Annals of the Institute of Statistical Mathematics 16, 1–19.
Barbour, C.D. and Brown, J.H. (1974) Fish species diversity in lakes. The American Nat-
uralist 108, 473–489.
Barnard, G.A. (1945) A new test for 2 × 2 tables. Nature 156, 177.
Barnard, G.A. (1949) Statistical inference. Journal of the Royal Statistical Society B 11, 115–149.
Bartlett, R.H., Roloff, D.W., Cornell, R.G., Andrews, A.F., Dillon, P.W. and Zwischen-
berger, J.B. (1985) Extracorporeal circulation in neonatal respiratory failure: a prospec-
tive randomized study. Pediatrics 76, 479–487.
Begg, C.B. (1990) On inferences from Wei’s biased coin design for clinical trials (with discussion). Biometrika 77, 467–484.
Berger, J.O., Bernardo, J.M. and Sun, D. (2009) The formal definition of reference priors.
The Annals of Statistics 37, 905–938.
Bertalanffy, L. von. (1969) General System Theory. New York: George Braziller.
Bishop, J., Huether, C.A., Torfs, C., Lorey, F. and Deddens, J. (1997) Epidemiologic study
of Down Syndrome in a racially diverse California population, 1989–1991. American
Journal of Epidemiology 145, 134–147.
Bliss, C.I. (1935) The calculation of the dose-mortality curve. Annals of Applied Biology
22, 134–167.
Caruso, T.M., Westgate, M.N. and Holmes, L.B. (1998) Impact of prenatal screening on the birth status of fetuses with Down syndrome at an urban hospital, 1972–1994. Genetics in Medicine 1, 22–28.
Celeux, G., Forbes, F., Robert, C.P. and Titterington, D.M. (2006) Deviance information
criteria for missing data models. Bayesian Analysis 1, 651–674.
Chambers, R.L. and Clark, R.G. (2012) An Introduction to Model-based Survey Sampling
with Applications. Oxford: Oxford University Press.
Cox, D.R. (1961) Tests of separate families of hypotheses. Proceedings of the 4th Berkeley
Symposium 1, 105–123.
Cox, D.R. (2006) Principles of Statistical Inference. Cambridge: Cambridge University
Press.
Cox, D.R. and Miller, H.D. (1965) The Theory of Stochastic Processes. London: Chapman
and Hall.
Davis, A., Gardner, B.B. and Gardner, M.R. (1941) Deep South: A Social Anthropological
Study of Caste and Class. Chicago: Chicago University Press.
Dempster, A.P. (1974) The direct use of likelihood in significance testing. In Proceedings of the Conference on Foundational Questions in Statistical Inference, eds O. Barndorff-Nielsen, P. Blæsild and G. Schou, 335–352.
Dempster, A.P. (1997) The direct use of likelihood in significance testing. Statistics and
Computing 7, 247–252.
Efron, B. (1979) Bootstrap methods: another look at the jackknife. Annals of Statistics 7,
1–26.
Efron, B. (1986) Double exponential families and their use in generalized linear regression.
Journal of the American Statistical Association 81, 709–721.
Efron, B. and Tibshirani, R.J. (1993) An Introduction to the Bootstrap. New York: Chapman
and Hall.
Ericson, W.A. (1969) Subjective Bayesian models in sampling finite populations (with dis-
cussion). Journal of the Royal Statistical Society B 31, 195–233.
Finney, D.J. (1947) Probit Analysis: A Statistical Treatment of the Sigmoid Response Curve.
Cambridge: Cambridge University Press.
Fisher, R.A. (1912) On an absolute criterion for fitting frequency curves. Messenger of
Mathematics 41, 155–160.
Fisher, R.A. (1922) On the mathematical foundations of theoretical statistics. Philosophical
Transactions of the Royal Society of London 222A, 309–368.
Fisher, R.A. (1925a) Theory of statistical estimation. Proceedings of the Cambridge Philo-
sophical Society 22, 700–725.
Fisher, R.A. (1925b) Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd.
Fisher, R.A. (1945) A new test for 2 × 2 tables. Nature 156 (3961), 388.
Freedman, B. (1987) Equipoise and the ethics of clinical research. New England Journal of Medicine 317, 141–145.
Freeman, L.C. (2003) Finding social groups: a meta-analysis of the Southern women data. In Dynamic Social Network Modeling and Analysis, eds R. Breiger, K. Carley and P. Pattison. Washington, DC: The National Academies Press.
Galton, F. (1886) Regression towards mediocrity in hereditary stature. The Journal of the
Anthropological Institute of Great Britain and Ireland 15, 246–263.
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A. and Rubin D.B. (2014)
Bayesian Data Analysis (3rd edn.). Boca Raton: Chapman and Hall/CRC Press.
Genschel, U. and Meeker, W.Q. (2010) A comparison of maximum likelihood and median
rank regression for Weibull estimation. Quality Engineering 22, 234–253.
Geyer, C.J. (1991) Constrained maximum likelihood exemplified by isotonic convex logistic
regression. Journal of the American Statistical Association 86, 717–724.
Gilliatt, R.W. (1948) Vaso-constriction in the finger after deep inspiration. Journal of Phys-
iology 107, 76–88.
Hartley, H.O. and Rao, J.N.K. (1968) A new estimation theory for sample surveys.
Biometrika 55, 547–557.
Hartley, H.O. and Rao, J.N.K. (1969) A new estimation theory for sample surveys, II.
Clearing House for Federal Scientific and Technical Information, US Department of Com-
merce/National Bureau of Standards.
Herson, J. (1976) An investigation of relative efficiency of least squares prediction to con-
ventional probability sampling plans. Journal of the American Statistical Association 71,
700–703.
Higgins, J.F. and Koch, G.G. (1977) Variable selection and generalized chi-square anal-
ysis of categorical data applied to a large cross-sectional occupational health survey.
International Statistical Review 45, 51–62.
Hite, S. (1987) Women and Love – A Cultural Revolution in Progress. New York: Knopf.
Hodges, J.L., Krech, D. and Crutchfield, R.S. (1975) StatLab: An Empirical Introduction
to Statistics. New York: McGraw-Hill.
Holland, P.W. and Leinhardt, S. (1981) An exponential family of probability distributions
for directed graphs. Journal of the American Statistical Association 76, 33–50.
Hotelling, H. (1933) Analysis of a complex of statistical variables into principal components.
Journal of Educational Psychology 24, 417–441, 498–520.
Ibrahim, J.G. and Laud, P.W. (1991) On Bayesian analysis of generalized linear models
using Jeffreys’s prior. Journal of the American Statistical Association 86, 981–986.
Jeffreys, H. (1961) Theory of Probability (3rd edn). Oxford: Clarendon Press.
Kahn, W.D. (1987) A cautionary tale for Bayesian estimation of the binomial parameter
n. The American Statistician 41, 38–40.
Kay, J.W. and Titterington, D.M. (1999) Statistics and Neural Networks: Advances at the
Interface. Oxford: Oxford University Press.
Kemp, A.W. and Kemp, C.D. (1991) Weldon’s dice data revisited. The American Statisti-
cian 45 (3), 216–222.
Kendall, M.G. and Stuart, A. (1966) The Advanced Theory of Statistics, Vol. 3. London:
Griffin Hafner.
Kolmogorov, A.N. (1933) Grundbegriffe der Wahrscheinlichkeitsrechnung; translated as Foundations of the Theory of Probability. New York: Chelsea Publishing Company.
Laska-Mierzejewska, T. (1970) Effect of ecological and socio-economic factors on the age at
menarche, body height and weight of rural girls in Poland. Human Biology 42, 284–292.
Lazar, N. (2003) Bayesian empirical likelihood. Biometrika 90, 319–326.
Lindsey, J.K. (1997) Applying Generalized Linear Models. New York: Springer-Verlag.
Little, R.J. (2004) To model or not to model? Competing models of inference for finite
population sampling. Journal of the American Statistical Association 99, 546–556.
Lord, F. (1952) A Theory of Test Scores: Psychometric Monograph no. 7. Richmond: Psy-
chometric Corporation.
Lunn, D., Jackson, C., Best, N., Thomas, A. and Spiegelhalter, D. (2013) The BUGS Book.
Boca Raton: CRC Press.
Lusher, D., Koskinen, J. and Robins, G. (eds) (2013) Exponential Random Graph Models for Social Networks. Cambridge: Cambridge University Press.
Lyons, S. (2020) What is ECMO, extracorporeal membrane oxygenation, and how is it being
used to help severe COVID-19 patients? ABC News, www.abc.net.au/news/health/2020-
07-22/coronavirus-ecmo-explainer/12472498.
McCullagh, P. and Nelder, J.A. (1983) Generalized Linear Models. London: Chapman and
Hall.
McLachlan, G. and Peel, D. (2000) Finite Mixture Models. New York: John Wiley.
Mehta, C. and Senchaudhuri, P. (2003) Conditional versus unconditional exact tests for
comparing two binomials. At www.google.com/search?channel=fs&client=ubuntu&q=
The+Fisher-Barnard+argument+over+the+%22exact%22+conditional+test.
Michie, D., Spiegelhalter, D.J. and Taylor, C.C. (1994) Machine Learning, Neural and
Statistical Classification. New York: Ellis Horwood.
Milicer, H. and Szczotka, F. (1966) Age at menarche in Warsaw girls in 1965. Human Biology 38, 199–203.
Mitchell, T. and Beauchamp, J. (1988) Bayesian variable selection in linear regression.
Journal of the American Statistical Association 83, 1023–1032.
Nelder, J.A. (1966) Inverse polynomials, a useful group of multi-factor response functions.
Biometrics 22, 128–141.
Neyman, J. (1935) On the problem of confidence intervals. Annals of Mathematical Statistics
6, 111–116.
Neyman, J. and Pearson, E.S. (1933) On the problem of the most efficient tests of statistical
hypotheses. Philosophical Transactions of the Royal Society of London A 231, 289–337.
Owen, A.B. (1988) Empirical likelihood ratio confidence intervals for a single functional.
Biometrika 75, 237–249.
Owen, A.B. (1995) Nonparametric likelihood confidence bands for a distribution function.
Journal of the American Statistical Association 90, 516–521.
Owen, A.B. (2001) Empirical Likelihood. Boca Raton: Chapman and Hall/CRC Press.
Pearson, E.S. (1962) Discussion of Savage, L.J., The Foundations of Statistical Inference. London: Methuen.
Pearson, K. (1895) Contributions to the mathematical theory of evolution. II. Skew vari-
ation in homogeneous material. Philosophical Transactions of the Royal Society A 186,
343–414.
Pearson, K. (1900) On the criterion that a given system of deviations from the probable in
the case of a correlated system of variables is such that it can be reasonably supposed
to have arisen from random sampling. Philosophical Magazine 50, 157–175.
Piper, D.W., McIntosh, J.H., Ariotti, D.E., Calogiuri, J.V., Brown R.W. and Shy, C.M.
(1981) Life events and chronic duodenal ulcer: a case control study. Gut 22, 1011–1017.
Pitman, E.J.G. (1979) Some Basic Theory for Statistical Inference. London: Chapman and
Hall.
Plackett, R.L. (1977) The marginal totals of a 2 × 2 table. Biometrika 64, 37–42.
Postman, M.J., Huchra, J.P. and Geller, M.J. (1986) Probes of large-scale structures in the
Corona Borealis region. The Astronomical Journal 92, 1238–1247.
Racine, A., Grieve, A.P., Flühler, H. and Smith, A.F.M. (1986) Bayesian methods in prac-
tice: experiences in the pharmaceutical industry. Applied Statistics 35, 93–150.
Ripley, B.D. (1996) Pattern Recognition and Neural Networks. Cambridge: Cambridge Uni-
versity Press.
Roeder, K. (1990) Density estimation with confidence sets exemplified by superclusters and
voids in the galaxies. Journal of the American Statistical Association 85, 617–624.
Royall, R.M. and Cumberland, W.G. (1981) An empirical study of the ratio estimator
and estimators of its variance (with discussion). Journal of the American Statistical
Association 76, 66–88.
Rubin, D.B. (1981) The Bayesian bootstrap. Annals of Statistics 9, 130–134.
Ruppert, D., Wand, M.P. and Carroll, R.J. (2003) Semiparametric Regression. Cambridge:
Cambridge University Press.
Schmidt, G., Mattern, R. and Schueler, F. (1981) Biomechanical investigation to determine
physical and traumatological differentiation criteria for the maximum load capacity of
head and vertebral column with and without protective helmet under the effects of
impact. Tech. Report, Institut für Rechtsmedizin, University of Heidelberg, Germany.
Schønheyder, F. (1936) The quantitative determination of vitamin K. Biochemical Journal
30, 890–896.
Shaw, L.P. and Shaw, L.F. (2019) The flying bomb and the actuary. Significance 16(5),
12–17.
Sheppard, W.F. (1897) On the calculation of the average square, cube, &c., of a large number of magnitudes. Journal of the Royal Statistical Society 60, 698–703.
Sheppard, W.F. (1898) On the application of the theory of error to cases of normal distributions and normal correlations. Philosophical Transactions of the Royal Society A 192, 101–167.
Si, Y. and Reiter, J.P. (2013) Nonparametric Bayesian multiple imputation for incomplete
categorical variables in large-scale assessment surveys. Journal of Educational and Be-
havioral Statistics 38, 499–521.
Silverman, B.W. (1985) Some aspects of the spline smoothing approach to non-parametric
curve fitting (with discussion). Journal of the Royal Statistical Society B 47, 1–52.
Smith, T.M.F. (1976) The foundations of survey sampling: a review (with discussion).
Journal of the Royal Statistical Society A 139, 183–204.
Smyth, G.K. (1986) Modelling the dispersion parameter in generalized linear models. In
Proceedings of the Statistical Computing Section. Alexandria: American Statistical Asso-
ciation, 278–283.
Smyth, G.K. (1989) Generalized linear models with varying dispersion. Journal of the Royal
Statistical Society B 51, 47–60.
Spiegelhalter, D.J., Best, N.G., Carlin, B.P. and van der Linde, A. (2002) Bayesian measures
of model complexity and fit (with discussion). Journal of the Royal Statistical Society B
64, 583–639.
Sprott, D.A. (1980) Maximum likelihood in small samples: Estimation in the presence of
nuisance parameters. Biometrika 67, 515–523.
Surkova, E., Nikolayevskyy, V. and Drobniewski, F. (2020) False-positive COVID-19 results:
Hidden problems and costs. Lancet Respiratory Medicine 8(12), 1167–1168.
Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society B 58, 267–288.
Valliant, R., Dorfman, A.H. and Royall, R.M. (2000) Finite Population Sampling and In-
ference: A Prediction Approach. New York: Wiley.
Venables, W.N. and Ripley, B.D. (2002) Modern Applied Statistics with S (4th edn.). New
York: Springer.
Vermunt, J.K., Van Ginkel, J.R., Van der Ark, L.A. and Sijtsma, K. (2008) Multiple imputation of categorical data using latent class analysis. Sociological Methodology 38, 369–397.
Wainer, H. (2016) Truth or Truthiness: Distinguishing Fact from Fiction by Learning to
Think Like a Data Scientist. New York: Cambridge University Press.
Ware, J.H. (1989) Investigating therapies of potentially great benefit: ECMO (with discus-
sion). Statistical Science 4, 298–340.
White, H. (1980) A heteroskedasticity-consistent covariance matrix estimator and a direct
test for heteroskedasticity. Econometrica 48, 17–38.
Index
A
Acceleration of, motorcycle helmet, 15
α cdf, 273, 274
α density, 273, 274
Alternative hypotheses, 97–100, 132–134, 257
Analysis of Covariance (ANCOVA), 219
Analysis of Variance (ANOVA), 218
Axioms probability, 35
  coin tossing, 35–36
  dice example, 35

B
Backward elimination, 218–219
Bayes factor, 185, 186
Bayesian analysis, 87, 138–139, 146–147
  binary response models, 272–276
  credible interval, 94
  double GLMs
    absence from school data, 312–314
    fish species data, 313, 317–318
    hospital beds and patients, 309–312
    log-likelihood, 308–309
    model assessment, 320, 321
    sea temperatures, 316–320
    weight functions, 309
  inference for µ, 139–140
  inference for σ, 139
  parametric functions, 140–142
  posterior sampling, 87–89
  prediction of, new observation, 142–143
  sampling without replacement, 89–90
Bayesian and frequentist inferences, 203, 204
Bayesian bootstrap (BB), 171–175, 178–179
  analysis, 204
  and posterior weighting, 299, 302–305
Bayesian deviance, 188, 257
Bayesian formulation, 97, 142, 218
Bayesian hypothesis testing
  conjugate priors, 135
  model comparison, 97
  null and alternative hypotheses, two models, 97–100
  pivotal functions, 134–135
  uniform distribution, 136
  µ0 vs. µ ≠ µ0, 131–134
  µ1 vs. µ2, 130–131
Bayesian inference, 78–79, 127
  prior arguments, 127–128
Bayesian interpretation, 134, 220
Bayesian residuals, 199, 209
Bayesian theory, 63–64, 122–123
  Bayes’s theorem, 64–65
  bootstrap, 68
  conjugate prior distributions, 67
  conjugate priors, 123
  frequentist objections, flat priors, 69
  general prior specifications, 69–70
  improving frequentist interval coverage, 67
  non-informative prior rules, 68–69
  parameters, random variables, 70
  synopsis of, posterior distribution, 65–66
Bayes’s theorem, 64–65, 249
Beetle data, 273, 277–279
Beta distribution, 71, 170
β cdf, 273, 275
β density, 273, 275
Big Data, 3
Binary response models
  Bayesian analysis, 272–276
  beetle data, 273, 277–279
  binomial link functions and their origins, 268–269
  maximum likelihood, 271–272
  probit and logit transformation, 268, 269
  Racine data, 269–271
Binomial distribution, 53
  binomial likelihood function, 53–54
GLIM, see Generalised Linear Interactive Modelling
GLMs, see Generalised linear models
GMM, see Generalised method of moments

H
Haldane prior, 68, 87–88
  criticisms
    Bayesian bootstrap, 171, 172
    Dirichlet process prior, 173
    multivariate hypergeometric distribution, 172
    posterior sampling, 173–175
    structural zeros, 171
  improper, 171
Helicobacter pylori, 96
Hessian matrix, 236
Heterogeneous regressions, 328–333
Heteroscedasticity, 206
Highly non-linear functions, 334–337
Histogram, 111–113, 355
Horvitz-Thompson (HT) estimator, 181; see also Weighted sample mean
Hospital bed use, 13
Hostility data, modelling of
  control group, 226
  counts, means and SDs of affection, 226, 227
  data structure, 227–232
  Gottschalk-Gleser scales, 226
Hypothesis testing, 128–129

I
Improper Haldane prior, 87, 171
Incomplete data model, 51
  Bayesian analysis and DA algorithm (see Data Augmentation algorithm)
  definition, 235
  EM algorithm (see EM algorithm)
  exponential distribution, censoring in, 239–240
  Hessian matrix, 236
  information matrix, 236–237
  lost data, 238–239
  missingness (see Missingness)
  mixture distributions, 247–251
  observed data score vector, 236
  randomly missing Gaussian observations, 240–241
Information matrix, 236, 237
Intention to treat (ITT) analysis, 103
Inverse polynomials, 336–337
Inverse probability weighting (IPW), 180

J
Jeffreys, Harold (1891–1989), 357–358
Joint draws of r, θ, 154

K
Kings of Spades (KOSs), 42

L
Laplace distribution, 220
Lasso, 219–220
Latent class/mixed Rasch model, 341–342
Latent component identifier, 248
LD48 cdf, 273, 276
LD88 cdf, 273, 276
Least squares fit to the log integrated hazard, 150
Length-biased sampling, 349–350
Likelihood and Gaussian approximation, 61–62
Likelihood function, 356
Likelihood ratio, 41–42, 129–133, 185
Likelihood ratio test (LRT), 102, 185, 252
Likelihood-without-prior analysis, 358
Linear predictor, 215
Logistic linear regression model, 270–271
Log-likelihoods, 59–60, 186, 188
  Poisson and geometric, 187
Lognormal distribution, 143, 183
  deviance, 192, 193
  lognormal density, 143–144
Lognormal model assessment, 158
LRT, see Likelihood ratio test

M
MAR, see Missing at random
Marginal posterior distribution, 173
Maximised likelihoods, 183, 185
Maximum likelihood, 201–202, 265–266
  Bayesian analysis from, 268
  binary response models, 271–272
  sampling
    extrasensory perception (ESP), 31–33
    representative sampling, 33–35
  screening tests and Bayes’s theorem, 36–39
  StatLab
  sums of, independent random variables, 45
Probability density function, 117
Probability model assessment, 209–210
Probability models for, continuous variables, 113–116
Profiling, 336
p-value, 82, 102, 130, 355
p-variable multiple regression model, 215

Q
Quadratic logit model, 277–279

R
Racine data, 269–271
Racine data likelihood, 272, 273
Radio transceivers
  lifetimes and cumulative numbers, 113
  lifetimes and numbers, 5–6, 113
Randomised clinical trial (RCT), 91, 95–96
  treatment of, duodenal ulcers, 92–93
    Bayesian analysis, credible interval, 94
    frequentist analysis, confidence interval, 93–94
Random variables
  and distributions, 42–45
  sums of, independent, 45
Rasch model, 340–341
Raven test and score, 223–226
RCT, see Randomised clinical trial
Regression sum of squares (RegSS), 202
RegSS, see Regression sum of squares
Relative frequency, 29
Replication, 232
Representative sampling, 24, 33–35
Residual sum of squares (RSS), 202, 207
“Restricted” ML (REML) estimate, 227
“Restrictive” covariance structure, 168, 173
Reversed extreme value distribution, 269
Ridge regression, 219–220

S
SAE, see Small area estimation
Sample surveys
  bias, newsday sample, 25
  bias, women and love sample, 25–27
  children desire, 24
  dice sampling, 30
  representative sampling, 24
  women and love, 23
Sampling
  computer, 30–31
  natural random processes, 30–31
  extrasensory perception (ESP), 31–33
  representative, 33–35
  representative sampling, 33–35
Saturated model, 340
“Scale-load” distribution, 167
Scale parameter, 265
Screening tests and Bayes’s theorem, 36–39
Segmented/broken-stick regressions, 320–321
  Down’s syndrome, 327–328
  modelling the break, 323, 325–327
  Nile flood volumes, 321–323
SHM, see Simple harmonic motion
SIDS, see Sudden Infant Death Syndrome
Simple harmonic motion (SHM), 343, 344, 346
Simple linear regression model, 195, 196
  correlation, 207
  “dummy variable” regression, 210
  likelihood function, 199–201
  ML estimates, 202
  prediction, 207–208
    as model assessment tool, 209
    vitamin K concentration, 208–209
  two-variable models, 216
    absence vs. dependents, 211, 212
    absence vs. dependents and fitted linear model, 211, 214
    absence vs. IQ, 211, 212
    absence vs. IQ and fitted linear model, 211, 213
    interactions, 217–219
    IQ vs. dependents, 211, 213
Simulated motorcycle collision, 14–16
  joint model ML, 14
  patients treated and hospital beds, 15
Simulation marginalisation, 140
Single-index model, 215