
M248 Analysing data

Block C
About this course
M248 Analysing data uses the software package Student Version of MINITAB for
Windows (Minitab Inc.) and other software to explore and analyse data and to
investigate statistical concepts. This software is provided as part of the course, and its
use is covered in the associated computer books.

Acknowledgement
Grateful acknowledgement is made to the Statistical Laboratory, Iowa State University
for permission to reproduce the photograph of R.A. Fisher in Figure 5.4 of Unit C1.
Every effort has been made to contact copyright owners. If any have been inadvertently
overlooked, the publishers will be pleased to make the necessary amendments at the
first opportunity.
This publication forms part of an Open University course. Details of this and other
Open University courses can be obtained from the Student Registration and Enquiry
Service, The Open University, PO Box 197, Milton Keynes MK7 6BJ, United Kingdom:
tel. +44 (0)845 300 6090, email general-enquiries@open.ac.uk
Alternatively, you may visit the Open University website at http://www.open.ac.uk
where you can learn more about the wide range of courses and packs offered at all
levels by The Open University.
To purchase a selection of Open University course materials visit
http://www.ouw.co.uk, or contact Open University Worldwide, Walton Hall, Milton
Keynes MK7 6AA, United Kingdom, for a brochure: tel. +44 (0)1908 858793,
fax +44 (0)1908 858787, email ouw-customer-services@open.ac.uk

The Open University, Walton Hall, Milton Keynes, MK7 6AA.


First published 2009.
Copyright © 2009 The Open University
All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system, transmitted or utilised in any form or by any means, electronic,
mechanical, photocopying, recording or otherwise, without written permission from
the publisher or a licence from the Copyright Licensing Agency Ltd. Details of such
licences (for reprographic reproduction) may be obtained from the Copyright
Licensing Agency Ltd, Saffron House, 6–10 Kirby Street, London EC1N 8TS;
website http://www.cla.co.uk.
Open University course materials may also be made available in electronic formats
for use by students of the University. All rights, including copyright and related
rights and database rights, in electronic course materials and their contents are
owned by or licensed to The Open University, or otherwise used by The Open
University as permitted by applicable law.
In using electronic course materials and their contents you agree that your use will
be solely for the purposes of following an Open University course of study or
otherwise as licensed by The Open University or its assigns.
Except as permitted above you undertake not to copy, store in any medium
(including electronic storage or use in a website), distribute, transmit or retransmit,
broadcast, modify or show in public such electronic materials in whole or in part
without the prior written consent of The Open University or in accordance with the
Copyright, Designs and Patents Act 1988.
Edited, designed and typeset by The Open University, using the Open University
TEX System.
Printed in the United Kingdom by Hobbs the Printers Limited, Brunel Road,
Totton, Hampshire SO40 3WX
ISBN 978 0 7492 5275 5
1.1
Contents
Study guide 5
Study guide for Block C 5

UNIT C1 Testing hypotheses 6


Study guide for Unit C1 6

Introduction 6

1 An approach using confidence intervals 8

2 Introducing significance testing 14


3 More on significance testing 20
3.1 Testing a normal mean 20
3.2 Testing a proportion 26
3.3 Performing significance tests using MINITAB 28

4 Two-sample tests 29
4.1 Testing the difference between two Bernoulli
probabilities 29
4.2 The two-sample t-test 31
4.3 Performing significance tests using MINITAB 35

5 Fixed-level testing 36
5.1 Performing a fixed-level test 36
5.2 A few comments 42
5.3 Fisher, Pearson and Neyman 43
5.4 Exploring the principles of hypothesis testing 44

6 Power, and choosing sample sizes 45


6.1 Calculating the power of a test 46
6.2 Planning sample sizes 49
6.3 Power and sample size using a computer 51

Summary of Unit C1 52
Learning outcomes 53

Solutions to Activities 54
Solutions to Exercises 58

UNIT C2 Nonparametrics 60
Study guide for Unit C2 60
Introduction 60

1 Nonparametric tests 61
1.1 Early ideas: the sign test 61
1.2 The Wilcoxon signed rank test 66
1.3 The Mann–Whitney test 71
1.4 Nonparametric tests using MINITAB 74

2 A test for goodness of fit 75


2.1 Goodness of fit of discrete distributions 76
2.2 The chi-squared distribution 78
2.3 The chi-squared goodness-of-fit test 80

Summary of Unit C2 83
Learning outcomes 84

Solutions to Activities 85

Solutions to Exercises 88

UNIT C3 The modelling process 90


Study guide for Unit C3 90

Introduction 90
1 Choosing a model: getting started 91
1.1 Continuous or discrete? 92
1.2 Which discrete distribution? 94
1.3 Which continuous distribution? 96

2 Exploring the data 99


2.1 Getting a feel for the data 99
2.2 Interpreting probability plots 103
2.3 Transforming the data 104
2.4 Dealing with outliers 107

3 Statistical modelling with MINITAB 110

4 Writing a statistical report 110


4.1 The structure of a statistical report 111
4.2 Writing the report 112

Summary of Unit C3 117


Learning outcomes 117

Solutions to Activities 119

Solutions to Exercises 122

Index for Block C 123


Study guide
Study guide for Block C
There is no CMA on Block C. TMA 03 covers Unit B3 and Block C.
Unit C1 is longer than average and Units C2 and C3 are both shorter than
average. Unit C1 will need seven study sessions; Units C2 and C3 will each need
four study sessions.
UNIT C1 Testing hypotheses


Study guide for Unit C1
You should schedule seven study sessions, including time for answering the TMA
questions on the unit and for generally reviewing and consolidating your work on
this unit.
In general, each section depends to some extent on ideas and skills from preceding
sections, so we recommend that you study the sections in order.
There are five chapters of Computer Book C associated with this unit: there is
one chapter associated with each of Sections 3, 4 and 5, and there are two
chapters associated with Section 6. We recommend that you study the chapters at
the points indicated in the text, although it would be possible to postpone
Chapter 1 and study it in the same session as Chapter 2 if this is more convenient
for you. If you follow the study pattern below, then you will need access to your
computer for the second, third, fifth and sixth study sessions. You may find that
the first session is longer than average.
Study session 1: Sections 1 and 2.
Study session 2: Section 3. You will need access to your computer for this session,
together with Computer Book C.
Study session 3: Section 4. You will need access to your computer for this session,
together with Computer Book C.
Study session 4: TMA question on significance testing and consolidation of your
work on Sections 1 to 4.
Study session 5: Section 5. You will need access to your computer for this session,
together with Computer Book C.
Study session 6: Section 6. You will need access to your computer for this session,
together with Computer Book C.
Study session 7: Complete TMA questions on Unit C1.

Introduction
In Block B, the results of statistical experiments were used to obtain confidence
intervals for population parameters, thus providing plausible ranges of values for
the parameters. This unit is about testing claims, or hypotheses, about the values
of population parameters: in each situation discussed, a claim is reinterpreted as a
statement about a population parameter — that is, a hypothesis is
formulated — and data are used to investigate the validity of the hypothesis.
Examples of the sorts of claims that are amenable to statistical investigation
include the following.
Drug A prolongs sleep, on average.
Drug A and Drug B are, on average, equally effective at prolonging sleep.
Eight out of ten dogs prefer Pupkins to any other dog food. (At the time of
writing, there is no dog food on the market called Pupkins, and no allusion to
any other real trade name is intended here.)
Notice that all these claims can be reinterpreted as statements about the values of
some unknown population parameters. For example, the first claim might be
interpreted as a statement about the mean sleep gain of patients taking Drug A
(for some relevant population of patients). And the second claim can be thought
of as a statement that the mean sleep gain using Drug A is equal to the mean
sleep gain using Drug B (for some relevant population of patients).
Now consider the third claim, which might possibly be advanced by the
manufacturer of the dog food in the course of an advertising campaign: ‘Eight out
of ten dogs prefer Pupkins to any other dog food.’ It appears to mean that, in the
relevant population of dogs (perhaps all those in Britain), 80% of dogs presented
with a choice of all available dog foods would select Pupkins, while the other 20%
would make a selection from the remainder. So this claim may be interpreted as a
hypothesis about a population proportion.
One way to test this hypothesis would be to take a sample of dogs from the
population of dogs in Britain, offer them the full array of available dog foods, and
keep a record of which of them preferred Pupkins to all others. (There is a
problem of definition here: what exactly does ‘any other dog food’ mean? But let
us gloss over that.)
This could be an expensive experiment in terms of materials, and not an easy one
to conduct. But let us concentrate on the principles. (Maybe some alternative
experimental design, in which different dogs are offered a more limited choice in
various combinations, could be contrived.)
Notice that there is, implicit in the claim, the idea that 'at least 80% of dogs
prefer Pupkins'. We would not enter into serious dispute with the manufacturer if
the evidence actually suggested that, say, 85% or 90% of dogs preferred Pupkins.
We would seriously contest the claim only if there was evidence that it was
exaggerated, and that in fact the underlying proportion was less than 80%.
If, in a small sample of 20 dogs, only 15 dogs showed the claimed preference for
Pupkins, an observed sample proportion of 75%, only an unreasonable person
would challenge the claim of an underlying 80%, for allowance must be made for
random variation arising from the sampling process. But if as few as (say) 11 or
12 of the 20 dogs demonstrated the claimed preference — only just over half the
sample — we might seriously begin to doubt the manufacturer’s claim. This unit
is about what constitutes sufficient evidence to reject a claim or hypothesis, or at
least to cast doubt upon it, and how to assess the strength of the evidence against
it.
In Section 1, the following straightforward approach to hypothesis testing is taken.
First, the claim is reinterpreted as a statement about the value of some unknown
population parameter. Then a random sample is taken from the population; this
is used to construct a confidence interval for the parameter. If the confidence
interval contains the hypothesized parameter value, then that value is a plausible
value for the parameter, and the sample does not provide enough evidence to
dismiss the claim. However, if the interval does not contain the hypothesized
value, then the conclusion is reached that there is sufficient evidence to doubt the
claim. In this way, a rule is developed for deciding whether or not to reject a
hypothesis. Notice that, since the confidence interval will be different for different
confidence levels, the decision rule will depend on the confidence level adopted.
Notice, also, the wording used here: whatever the results of the test, there is no
implication that the claim is accepted as 'true'. Even if it is established that the
hypothesized parameter value is plausible, it will in general be only one of a range
of plausible values. Statistical hypothesis testing is based on a principle of
falsifiability rather than of verifiability. In the approach just described, a
hypothesis is either rejected in the light of the evidence, or not rejected because
there is insufficient evidence to reject it. In other words, this statistical approach
may not be used to prove the truth of something, but merely to provide evidence
for its falsity. (In the 'Pupkins' example, if only 15 dogs out of a sample of 20
showed the claimed preference, that would not be enough evidence to reject the
claim that 80% of dogs prefer Pupkins, but nor does it prove that the claim is true!)

A rather different approach to testing hypotheses, known as significance testing, is


introduced in Section 2. This is the approach that is most commonly used in
practice. The main feature of a significance test is that it provides a numerical
measure of the extent to which the data provide evidence against the hypothesis
being tested, rather than a rule for deciding whether or not to reject it. Several
applications of significance testing are discussed in Sections 3 and 4. In Section 3,
you will learn about Student’s one-sample t-test, which is used for testing
hypotheses involving the mean of a normal distribution; you will also learn how to
test a proportion. Situations where a comparison is being made between two
populations are covered in Section 4. The main test covered in this section is the
two-sample t-test, for comparing two normal means.
In Section 5, an approach called fixed-level testing is discussed briefly. This
approach is related to that in Section 1. In many circumstances, it leads to the
same conclusion.
Finally, in Section 6, you will see how considerations and calculations, based on
the statistical testing procedure to be used, can help in the process of planning an
experiment, and in particular with the question of what an appropriate size for
the sample(s) involved would be.

1 An approach using confidence intervals
One of the interpretations of confidence intervals that you met in Block B is the
‘plausible range’ interpretation: a confidence interval for an unknown parameter
can be seen as the range of parameter values that are plausible in the light of the
observed data. Thus we can decide whether or not a specific hypothesized
parameter value is plausible simply by calculating a confidence interval for the
parameter and seeing whether the hypothesized value is one of the plausible
values within the interval. This approach to testing a hypothesis is illustrated in
Example 1.1. It will be used to introduce some of the ideas and terminology of
hypothesis testing.

Example 1.1 Testing a hypothesis about a proportion


This example is about testing the random number generator within a computer
program. In general, there is more to this than simply testing the frequencies of
the different digits, but to keep things simple that is all that will be considered
here.
In an experiment a computer is used to simulate successive rolls of a perfect die,
and to calculate the proportion of sixes obtained. The hypothesis to be tested is
that the underlying proportion of sixes generated by the computer program is
equal to 1/6. If the observed proportion is sufficiently close to 1/6, then the program
will be deemed satisfactory. If not — that is, if there are either too many sixes, or
too few — then the program will be called unsatisfactory.
In what is intended to be a simulation of a sequence of Bernoulli trials, a 0
indicates that a 1, 2, 3, 4 or 5 was rolled; a 1 indicates a 6. The results of 100 rolls
are shown in Table 1.1.

Table 1.1 Throwing a six: computer simulation


0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0

In this example, ten sixes were obtained in a total of 100 rolls of the supposedly
fair die — rather fewer than expected. Does this experiment provide any
substantial evidence that the die is biased — that is, that the program generating
the throws is flawed?
The methods of Block B can be used to find an exact 90% confidence interval
for p, the underlying proportion of sixes. Assuming a binomial model, B(100, p),
for the number of sixes that occur in 100 rolls of the die, the confidence interval
for p is (0.0553, 0.1637). (The exact confidence intervals quoted in this example
and in Activity 1.1 were calculated using MINITAB.)
A confidence interval can be used to provide a decision rule for a test of a
hypothesis about the value of a model parameter. The confidence interval can be
thought of as a range of plausible values of the parameter. The most noticeable
feature of this particular confidence interval is that it does not contain the
theoretical (or assumed) underlying value p = 1/6 ≈ 0.1667. The conclusions of
this simple test may be stated as follows.
In a test of the hypothesis that p = 1/6, there was sufficient evidence at the
10% significance level to reject the hypothesis in favour of the alternative that
p ≠ 1/6. (The terminology used here will be explained shortly.) In fact, since the
entire confidence interval is below 1/6, there is some indication that p < 1/6. ■
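The exact interval quoted above can be reproduced with the Clopper–Pearson construction, which is what MINITAB's exact binomial interval computes. Here is a sketch in Python using SciPy; SciPy is not part of the course software, and the function name `exact_binomial_ci` is ours, chosen for illustration.

```python
from scipy.stats import beta

def exact_binomial_ci(x, n, conf):
    """Clopper-Pearson (exact) confidence interval for a binomial proportion,
    based on x successes in n trials, at confidence level conf."""
    alpha = 1 - conf
    lower = beta.ppf(alpha / 2, x, n - x + 1) if x > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, x + 1, n - x) if x < n else 1.0
    return lower, upper

# 10 sixes in 100 simulated rolls, at the 90% confidence level
lo, hi = exact_binomial_ci(10, 100, 0.90)
print(round(lo, 4), round(hi, 4))   # the unit quotes (0.0553, 0.1637)
print(lo <= 1/6 <= hi)              # 1/6 lies above the interval
```

Since 1/6 ≈ 0.1667 falls outside this interval, the null hypothesis is rejected at the 10% significance level, as in the example.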

There are a number of features of the testing process in Example 1.1 to notice.
Most obviously, the raw material for the statistical testing of a hypothesis is the
same as that for the construction of a confidence interval: we require data; we
require an underlying probability model; and we need to have identified a model
parameter relevant to the question we are interested in answering.
What has altered is the form of the final statement: rather than listing a range of
plausible values for an unknown parameter, at some level of confidence, a
statement is made about whether or not a hypothesis about a particular
parameter value is tenable, at some assigned significance level. Notice that in
this example the significance level has been expressed as a percentage, and is
equal to 100% minus the confidence level (90%) for the interval used to perform
the test. This is just the way the conventional language has developed; you will
become used to it as you work through this unit. (In most circumstances, large
confidence levels (90% or more) are used; so the corresponding significance levels
are small.)
Example 1.1 continued Testing a hypothesis about a proportion
An exact 95% confidence interval for the binomial parameter p, based on a count
of 10 successes in 100 trials, is (0.0490, 0.1762). In this case, the hypothesized
value p = 1/6 is contained in the confidence interval. In other words, at this
confidence level, and based on these data, it is a plausible value. The conclusions
of the corresponding test may be stated as follows.
In a test of the hypothesis that p = 1/6, there was insufficient evidence at the
0.05 significance level to reject the hypothesis in favour of the alternative that
p ≠ 1/6. (Here, the significance level has been expressed as a number (0.05)
between 0 and 1, rather than as a percentage (5%). Both formulations are common.)
Note that it has certainly not been concluded that the parameter p is definitely
equal to 1/6, but merely that 1/6 is a plausible value for the parameter. There is not
sufficient evidence to reject the hypothesis that p = 1/6; but that does not mean
that this hypothesis must be true.
In the statement of the hypothesis that p = 1/6 and in the context of the problem,
the implication has been that either a sample proportion too high (suggesting
p > 1/6) or a sample proportion too low (suggesting p < 1/6) would both offer
evidence to reject the hypothesis. Using the conventional terminology, a
two-sided test has been performed. (A two-sided test is sometimes called a
two-tailed test.) ■
The approach to hypothesis testing based on confidence intervals, which has been
illustrated in Example 1.1, may be summarized as follows.
First, the characteristics of a population are expressed in terms of a parameter θ
in such a way that the hypothesis under test takes the form θ = θ0, where θ0 is
some specified value. This hypothesis is called the null hypothesis and is
denoted by H0, so it is written
H0 : θ = θ0.
(The reason for the term 'null' will become clearer later in the unit.)
(In Example 1.1, the null hypothesis is H0 : p = 1/6.)
The alternative hypothesis, which is denoted by H1, is that the claim
expressed by the null hypothesis is false. This is written
H1 : θ ≠ θ0.
(In Example 1.1, the alternative hypothesis is H1 : p ≠ 1/6.)
Next, data are collected on an appropriate random variable, whose variation may
be expressed by a probability distribution indexed by θ. (In Example 1.1, the
data were modelled by a binomial distribution B(100, p).)
A 100(1 − α)% confidence interval for θ is calculated using the data. This
confidence interval is used to decide whether or not the null hypothesis can be
rejected. The decision rule for rejecting the null hypothesis H0 depends simply on
whether the hypothesized value θ0 of θ is, or is not, contained in the interval. If θ0
is in the interval, there is insufficient evidence to reject the null hypothesis H0 at
the α significance level, because θ0 is a plausible value for θ. If θ0 is not in the
interval, the null hypothesis H0 can be rejected in favour of the alternative
hypothesis.
Notice that the significance level is related to the confidence level used in
calculating the confidence interval: for a 100(1 − α)% confidence level, the
significance level is just α (or 100α% when written as a percentage).

Activity 1.1
A different computer, and a different statistical package, were used to simulate
the results of a sequence of Bernoulli trials. The intention was that the
probability of success p at any trial should be 2/3. The results of a sequence of 25
trials are shown in Table 1.2.

Table 1.2 Computer simulation


0 0 0 1 0 0 1 0 0 1 0 1 0
1 0 1 1 0 0 1 1 0 0 0 1

An exact 95% confidence interval for p based on the observed sequence of trials is
(0.2113, 0.6133). An exact 99% confidence interval for p is (0.1679, 0.6702).
These intervals are to be used to test the hypothesis that the underlying
proportion of successes is 2/3 against the alternative hypothesis that the underlying
proportion is different from 2/3. Thus the null and alternative hypotheses are
H0 : p = 2/3, H1 : p ≠ 2/3.
(a) What would the conclusion of a test be if a significance level of 0.05 is used?
(b) What would the conclusion of a test be if a significance level of 0.01 is used?

In Activity 1.1, you saw that there was enough evidence against the hypothesis
p = 2/3 to reject the hypothesis at the 0.05 significance level; but more evidence is
required to reject it at the 0.01 level, and there was not enough to do this. It is of
some interest that the null hypothesis was rejected at one significance level (0.05)
but not at another (0.01), on the basis of exactly the same data. This illustrates
that hypothesis testing is a matter of evaluating evidence.
Example 1.1 and Activity 1.1 both involved using (exact) confidence intervals for
a binomial parameter p to test a hypothesis. Example 1.2 illustrates that the
method of using a confidence interval to test a hypothesis may be applied in other
situations.
Example 1.2 A mechanical kitchen timer


The time was recorded when a mechanical kitchen timer rang after being set for a
five-minute delay (300 seconds). On ten such occasions, a stop-watch was used to
record the time (in seconds). The resulting data are shown in Table 1.3.

Table 1.3 Ten times (seconds)

293.7 296.2 296.4 294.0 297.3
293.7 294.3 291.3 295.1 296.1
(Data provided by B.J.R. Bailey, University of Southampton.)

An appropriate probability model for the times is a normal distribution with
unknown mean µ and unknown variance σ²; that is, the data can be treated as a
random sample from a population with distribution N(µ, σ²). A method involving
the t-distribution (described in Block B) can be used to calculate a 90% confidence
interval for the unknown mean µ of this normal distribution: the interval is
(293.8, 295.8).
An obvious specific question to ask is whether the timer is biased, in the following
sense. Clearly, when it is set for 300 seconds, it does not always take exactly the
same time to ring. But could the average time it takes to ring be 300 seconds, or
does it take on average more or less time than that? This question can be
investigated using a hypothesis test. The null hypothesis is
H0 : µ = 300.
The alternative hypothesis is
H1 : µ ≠ 300.
Since 300 does not lie in the 90% confidence interval for µ based on these data, it
is not a plausible value for µ. Thus the null hypothesis that µ = 300 can be
rejected at the 10% significance level in favour of the alternative that µ ≠ 300. In
other words, the timer seems to be biased. In fact, since the sample mean is less
than 300 (it is 294.81), there is an indication that µ < 300, or in other words that
the timer is under-recording five-minute intervals, on average. (Alternatively, the
fact that the confidence interval is entirely below 300 could have been used as
evidence that the timer is under-recording five-minute intervals, rather than
over-recording them. The point here is to use the data in some way as evidence
that µ < 300.)
Other significance levels can be dealt with similarly, by using other confidence
intervals. For instance, a 99% confidence interval for µ is (293.0, 296.6). Thus,
since 300 is not in this interval, the null hypothesis can also be rejected at the 1%
(or 0.01) significance level.
This shows that the evidence against the null hypothesis that µ = 300 is actually
considerably stronger than would be implied simply by the statement that the
null hypothesis can be rejected at the 10% significance level. Informally, this is
because we can be more confident that the 99% confidence interval contains the
true value of µ than that the 90% confidence interval does. The hypothesized
value of µ, 300, lies outside the wider 99% confidence interval, as well as outside
the 90% confidence interval. Therefore we can be more certain, in a sense, that µ
is not 300, than we could be if we knew only that 300 lay outside the 90%
confidence interval. ■
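The t-based intervals in Example 1.2 can be checked by direct computation. Here is a sketch in Python using SciPy for the t-quantile (an assumption on our part; the course computes such intervals with MINITAB, and the function name `t_interval` is ours):

```python
from math import sqrt
from statistics import mean, stdev
from scipy.stats import t

# Table 1.3: ten recorded times (seconds) for the kitchen timer set to 300 s
times = [293.7, 296.2, 296.4, 294.0, 297.3,
         293.7, 294.3, 291.3, 295.1, 296.1]

def t_interval(data, conf):
    """100*conf% confidence interval for a normal mean, variance unknown."""
    n = len(data)
    xbar, s = mean(data), stdev(data)          # sample mean and standard deviation
    margin = t.ppf((1 + conf) / 2, n - 1) * s / sqrt(n)
    return xbar - margin, xbar + margin

lo, hi = t_interval(times, 0.90)
print(round(lo, 1), round(hi, 1))   # the unit quotes (293.8, 295.8)
print(lo <= 300 <= hi)              # 300 lies outside: reject H0 at the 10% level
```

Calling `t_interval(times, 0.99)` reproduces the 99% interval (293.0, 296.6), which also excludes 300.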

The discussions in the previous paragraph and after Activity 1.1 indicate that, in
cases where the null hypothesis has been rejected, there appears to be some kind
of relationship between the significance level used in a test and the strength of
evidence against the null hypothesis provided by the test — the lower the
significance level, the stronger the evidence. This idea can be made more precise
by investigating how hypothesis tests and significance levels are interpreted. This
is done in Example 1.3 by considering how hypothesis testing is related to the
repeated experiments interpretation of confidence intervals.
Example 1.3 Interpreting the significance level


Imagine that you take a kitchen timer, like the one that provided the data in
Table 1.3, and that you repeat many times the experiment of taking the timer,
setting it to five minutes on ten occasions, recording the resulting times with a
stop-watch, and calculating a 90% confidence interval for µ, the mean length of
time (in seconds) until the timer rings. You would, of course, get a different
confidence interval each time, because of the variability in the time until the timer
rings. But the repeated experiments interpretation of confidence intervals says
that (approximately) 90% of the resulting intervals will contain the true value
of µ.
Now imagine that this particular kitchen timer is unbiased, in the sense that the
true value of µ is actually 300. Then 90% of the 90% confidence intervals will
contain the value 300, and the other 10% will not. This can be translated into the
language of hypothesis testing as follows. Suppose that the null hypothesis
H0 : µ = 300
is in fact true. That is, you have a timer that takes, on average, exactly
300 seconds to ring when it is set for five minutes. Suppose that, as above, you
repeat many times the experiment of taking ten timings, and for each of these
repeated experiments, you test the hypothesis H0 , using the method based on
confidence intervals. In 90% of these repeated experiments, the 90% confidence
interval will contain 300, the true value of µ, and the null hypothesis H0 will not
be rejected. In the other 10% of the experiments, the 90% confidence interval will
not contain 300, and the null hypothesis H0 will be rejected even though it is
true. ■

Example 1.3 illustrates one way of thinking of the significance level of a


hypothesis test. This is stated in the following box.

Interpreting significance levels

If the null hypothesis H0 is true then, in repeated experiments, H0 will be
rejected in some of the experiments, even though it is true. The significance
level gives the proportion of the repeated experiments (or percentage, if the
significance level is expressed as a percentage) in which H0 will be rejected
falsely. (Note that this statement refers to repeated experiments in which the
null hypothesis is true.)

This interpretation means that the smaller the significance level that is used, the
less likely it is that the null hypothesis will be rejected when it is true.
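This interpretation can be checked by simulation, in the spirit of Example 1.3: repeat the ten-timing experiment many times with a timer whose true mean really is 300, and count how often the 90% interval misses 300. The Python sketch below assumes, arbitrarily, a standard deviation of 1.8 seconds; neither the simulation nor these settings come from the course text.

```python
import random
from math import sqrt
from statistics import mean, stdev
from scipy.stats import t

random.seed(2)
n, conf, trials = 10, 0.90, 5000
tq = t.ppf((1 + conf) / 2, n - 1)      # 0.95-quantile of t with 9 degrees of freedom

# H0 is true here: the simulated timer has mean exactly 300 (sd 1.8 is arbitrary)
false_rejections = 0
for _ in range(trials):
    sample = [random.gauss(300, 1.8) for _ in range(n)]
    margin = tq * stdev(sample) / sqrt(n)
    if not (mean(sample) - margin <= 300 <= mean(sample) + margin):
        false_rejections += 1

print(false_rejections / trials)       # close to 0.10, the significance level
```

The observed false-rejection rate settles near 10%, matching the interpretation in the box above.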
The examples discussed so far have been based on exact confidence intervals. In
Block B, you learned about large-sample confidence intervals. These can be used
to carry out hypothesis tests in the same way as exact confidence intervals are
used. This is illustrated in Example 1.4.

Example 1.4 Yeast cells on a microscope slide

Some of Student's original experiments involved counting the numbers of yeast
cells found on a microscope slide. The results of one such experiment are given in
Table 1.4. ('Student' (1907) On the error of counting with a haemacytometer.
Biometrika, 5, 351–360. 'Student' was the pseudonym of W.S. Gosset (1876–1937),
developer of many key statistical ideas and results including the t-distribution.
Gosset pursued his statistical research while working for the Guinness brewery
company, which prohibited its employees from publishing under their own names.)

Table 1.4 Yeast cells on a microscope slide

Cells in a square, i    0    1    2    3    4    5
Frequency             213  128   37   18    3    1

The table gives the numbers of yeast cells found in each of 400 very small squares
on the slide when a liquid was spread over it. The first row gives the number i of
yeast cells observed in a square and the second row gives the number of squares
containing i cells.
These data are to be used to test the null hypothesis that the mean number of
cells per square is 0.6, against the alternative hypothesis that the mean is
different from 0.6:
H0 : µ = 0.6, H1 : µ ≠ 0.6.
In Block B you learned how to calculate an approximate confidence interval for a
population mean µ without choosing a specific probability model for the data.
The method is valid when the sample size n is large, as is the case here. So it is
not necessary to assume a distribution for the variable (the number of cells in a
square). An approximate 100(1 − α)% confidence interval for µ is given by
� �
s s
(µ− , µ+ ) = x − z √ , x + z √ ,
n n
where x is the sample mean, s is the sample standard deviation and z is the
(1 − α/2)-quantile of the standard normal distribution.
No significance level has been stipulated for the test. For a significance level of α,
a 100(1 − α)% confidence interval for the unknown population mean is required.
So, if α = 0.05 is chosen, a 95% confidence interval is needed.
Here x = 0.6825, s = 0.9021, n = 400 and, for a 95% confidence interval, z = 1.96.
(The values of x and s were calculated using the data in Table 1.4.) So the
interval is
(µ− , µ+ ) = ( 0.6825 − 1.96 × 0.9021/√400 , 0.6825 + 1.96 × 0.9021/√400 )
           ≃ (0.5941, 0.7709).
This confidence interval contains 0.6, the hypothesized value of the underlying
mean µ. So there is insufficient evidence to reject the null hypothesis at the 5%
significance level. It remains plausible that the population mean takes the value
0.6; that is, it is plausible that the mean number of cells per square is 0.6.
(You should bear in mind that there is a certain amount of approximation here,
since the confidence interval is approximate.) ■
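The interval arithmetic can be checked in a few lines. This is an illustrative Python sketch, not part of the course materials:

```python
import math

xbar, s, n = 0.6825, 0.9021, 400  # summary statistics from Table 1.4
z = 1.96                          # 0.975-quantile of N(0, 1), for a 95% interval

half_width = z * s / math.sqrt(n)
lower, upper = xbar - half_width, xbar + half_width

print(round(lower, 4), round(upper, 4))   # 0.5941 0.7709
print(lower <= 0.6 <= upper)              # True: do not reject H0 at the 5% level
```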

Activity 1.2 March rainfall


The March rainfall (in inches) was measured in each of 30 years in Minneapolis St Hinkley, D. (1977) On quick
Paul, USA. An approximate 95% confidence interval for µ, the underlying mean choice of power transformation.
March rainfall in Minneapolis St Paul, calculated using these measurements, is Applied Statistics, 26, 67–69.
(1.317, 2.033). This confidence interval is used to test the null hypothesis that
µ = 2 against the alternative hypothesis that µ = 2.
(a) What is the significance level of the test?
(b) What is the conclusion of the test?

Summary of Section 1
In this section, a method for performing hypothesis tests based on confidence
intervals has been described. The terminology for the hypotheses involved in a
test, and the notion of significance level, have been introduced. The approach to
hypothesis testing using confidence intervals is general, and straightforward (if
you know how to calculate the interval). However, there is more to hypothesis
testing than this! In Section 2, a different approach is introduced, in which the
strength of the evidence against the null hypothesis is quantified.
Exercise on Section 1
Exercise 1.1 Book loans
A sample of 122 books in a particular library was selected and, for each book, the
number of times that it had been borrowed in the preceding twelve months was
counted. The sample mean number of loans was x = 1.992 and the sample
standard deviation of the number of loans was s = 1.394. An approximate 90%
confidence interval for µ, the mean number of loans in a year for books in the
library, calculated using large-sample methods, is (1.784, 2.200). This confidence
interval is used to test the null hypothesis that µ is 2.5 against the alternative
hypothesis that µ ≠ 2.5.
(a) What is the significance level of the test?
(b) What is the conclusion of the test?

[Burrell, Q.L. and Cane, V.R. (1982) The analysis of library data. J. Royal
Statistical Society, Series A, 145, 439–471. The authors collected their data from
several libraries. The data used in the exercise are from one of the sections of the
Wishart Library in Cambridge.]
2 Introducing significance testing


The approach to testing hypotheses using confidence intervals, which was
discussed in Section 1, requires the user to decide on an appropriate significance
level α to use, and you may well have felt that it was not clear exactly how this
choice should be made. But once the choice has been made, the procedure gives
the user a clear decision rule on whether or not to reject the null hypothesis H0 .
This clarity may be an advantage in certain situations, but it has drawbacks. In
Example 1.1, for instance, the null hypothesis was clearly rejected at the 10%
significance level, and just as clearly not rejected at the 5% significance level. In
Example 1.2, the null hypothesis was rejected at the 1% significance level, but
that simple statement does not tell us everything about how strong the evidence
against the null hypothesis is in this case.
In this section, a second approach to testing hypotheses is introduced. This
approach, which is called significance testing, has become the most common
method for assessing a hypothesis. For the most part, this is the approach that is
used in later units of M248; and unless you are specifically asked to use a different
approach, you should use the significance testing approach whenever you have to
perform a hypothesis test.
The significance testing approach gets around the problem of having to decide in
advance on an appropriate significance level for a hypothesis test, but at the price
of not providing a clear decision rule on when to reject the null hypothesis. But,
as you will see, this ‘disadvantage’ is in most circumstances not as important as it
may seem at first sight. Instead of leading to a stated decision (such as
‘reject H0 ’), the test results in a number called the significance probability
which, loosely speaking, describes the extent to which the data provide evidence
against the null hypothesis. The general idea is that the lower the significance
probability, the more evidence the data provide against the null hypothesis.
The significance probability is denoted by p; it is because of this notation that it
is often called simply the p value. (Do not confuse this use of the letter p with its
use to denote the probability of success in a Bernoulli trial. The same letter
happens to be standard in both situations. However, note that some books use an
upper-case P for p value.)
Performing a significance test
In Section 1, you saw that, in order to perform a hypothesis test, the first
requirement is for a clear statement of the hypothesis being tested. This is usually
expressed in terms of a parameter of a probability model. And, of course, we need
data!
Further, we need to decide on an alternative hypothesis that will indicate the sort
of departure from the null hypothesis that would be of interest. In this section,
values above the parameter value specified in the null hypothesis and values below
it are both of interest — so the tests considered are two-sided.
You have seen that the hypothesis to be tested is called the null hypothesis, and
is denoted by the symbol H0 . The hypothesis to be regarded as an alternative to
this is called the alternative hypothesis and is denoted by H1 .
The general statement of the null and alternative hypotheses for a two-sided test
might take the form
H0 : θ = θ0 ,   H1 : θ ≠ θ0 .
The other crucial ingredient for a hypothesis test, in the approach introduced in
this section, is a random variable called the test statistic; often it is a statistic
such as the sample mean, or the sample total, or perhaps the sample maximum or
median, or it may be a rather more complicated quantity. It must be possible to
calculate a value of the test statistic from the data, and the test statistic must
provide information about the value of the parameter θ. Also, you need to know
what the probability distribution of the test statistic would be if the null
hypothesis H0 were true. This distribution is called the distribution of the test
statistic under the null hypothesis, or, more conveniently, the null distribution
of the test statistic. Knowing (at least, approximately) the null distribution of the
test statistic is crucial to being able to carry out a hypothesis test. To illustrate
these ideas, Example 1.4 will be revisited.

Example 2.1 Yeast cells: choosing a test statistic


Table 1.4 contains counts of yeast cells in small squares on a microscope slide. In
Example 1.4, these data were used to test the hypothesis that the mean number
of cells per square is 0.6, against the alternative hypothesis that the mean is
different from 0.6:
H0 : µ = 0.6,   H1 : µ ≠ 0.6.
Using the approach based on (approximate large-sample) confidence intervals, you
saw that there was insufficient evidence to reject the null hypothesis (at the 5%
significance level).
How could the same problem be approached using the idea of a test statistic and
its null distribution? The hypotheses have been set up in terms of the underlying
mean µ of the distribution of the number of cells per square. If you wanted to
estimate µ, the observed value x of the sample mean X would be the obvious
quantity to use; and, roughly speaking, the further x is from 0.6, the more
evidence the data provide against the null hypothesis. Thus the random variable
X provides useful information about the underlying mean µ. Furthermore, the
distribution of X under the null hypothesis is known. Since the sample size is 400,
which is large, the approximate distribution of X can be used here. For any value
of the underlying mean µ, the approximate distribution of X is N (µ, σ2 /400).
Thus when H0 is true — that is, when µ = 0.6 — the (approximate) distribution of
X is N (0.6, σ2 /400). So the (approximate) null distribution of X is
N (0.6, σ2 /400).
This raises a problem. In order to use X as the test statistic, its null distribution
must be known and not depend on unknown parameters; but the distribution of
X does depend on the (unknown) population variance σ2 . However, in other
situations where the large-sample distribution of the sample mean has been
involved, it has been an acceptable approximation to replace the population
variance by the sample variance, s2 . This is also the case here; for these data, the
sample standard deviation s is 0.9021. (See Example 1.4.) So, for the null
hypothesis H0 : µ = 0.6, the approximate null distribution of the sample mean X
is N (0.6, 0.9021²/400), or N (0.6, 0.00203). Thus, in this case, the sample mean X
can be used as the test statistic. ■

In significance testing, the idea is to describe the extent to which the data provide
evidence against the null hypothesis, rather than to decide whether or not there is
sufficient evidence to reject the null hypothesis (as was the case in Section 1 when
using the confidence interval approach). This is done by calculating the
probability of obtaining a value of the test statistic that is ‘at least as extreme as’
the observed value when the null hypothesis is true. This is illustrated in
Example 2.2.

Example 2.2 Yeast cells: calculating the significance probability


For the counts of yeast cells, the observed value x of the test statistic X is 0.6825.
(See Example 1.4.) Figure 2.1 shows the null distribution of X; the observed
value x = 0.6825 is also marked.

Figure 2.1 The null distribution of X, and the values that are at least as extreme
as the observed value x = 0.6825

The shaded regions in Figure 2.1 contain all the possible values of X that, if µ
were equal to 0.6 (the value specified in the null hypothesis), would be at least as
extreme as the actual value that was observed (0.6825). In the tail of the null
distribution that contains the observed value 0.6825, it is easy to see which values
these are: any value greater than 0.6825 would be more extreme than 0.6825 (as it
is further from 0.6). Thus the shaded area in the right tail of the null distribution
includes all values of X greater than or equal to 0.6825. (In a certain sense these
‘extreme’ values in the shaded regions are those that are as little or even less in
accord with the null hypothesis than is the actual observed value — so they
provide as little or less support for the null hypothesis than does the observed
value.)

Why is there a shaded area in the left tail as well? Remember this is a two-sided
test, so a sample mean a long way below the hypothesized value of 0.6 would also
provide evidence against the null hypothesis. The observed value, 0.6825, is
0.0825 above 0.6. The null distribution of X is symmetric about x = 0.6, and the
value 0.5175 is 0.0825 below 0.6. Therefore it seems reasonable to state that a
value of X equal to 0.5175 would be just as extreme as the observed value, 0.6825,
and that values further into the left tail (that is, less than 0.5175) are even more
extreme. Thus the set of values of X that are at least as extreme (in relation to
the null hypothesis) as the observed value, 0.6825, consists precisely of all values
greater than or equal to 0.6825, together with all values less than or equal
to 0.5175.
The significance probability for the test is simply the sum of the two shaded tail
areas in Figure 2.1. That is,
p = P (X ≥ 0.6825) + P (X ≤ 0.5175),
where X ∼ N (0.6, 0.00203). Since the null distribution of X is symmetric about
x = 0.6, the probability in the left tail is equal to the probability in the right tail,
and hence the significance probability is
p = 2P (X ≥ 0.6825).
Standardizing and using the table of probabilities for the standard normal
distribution in the Handbook gives
p = 2P ( Z ≥ (0.6825 − 0.6)/√0.00203 ),   where Z ∼ N (0, 1),
  ≃ 2P (Z ≥ 1.83)
  = 2 × 0.0336
  = 0.0672
  ≃ 0.067. ■

Note that the test could have been carried out using a test statistic a little more
complicated than X. The null distribution of X is N (0.6, 0.00203). Although this
distribution is not difficult to deal with, it would have been even easier to deal
with N (0, 1). If X had been standardized using its null distribution, then the
resulting random variable Z would have had the standard normal distribution as
its null distribution. That is, if Z = (X − 0.6)/√0.00203 had been used as the
test statistic, then its null distribution would have been N (0, 1). There would
have been slightly more effort in calculating the observed value of Z — it comes to
1.83 — but then we would simply have had to find the value of P (|Z| ≥ 1.83),
where Z ∼ N (0, 1). In this situation, the change is of no practical importance; it
has been mentioned only to show that the most obvious choice for a test statistic
is not the only possibility, and that it is possible that another choice may be
easier to work with.

Example 2.3 Yeast cells: drawing a conclusion


The significance probability p = 0.067 that has been calculated in Example 2.2
tells us that the data provide some evidence against H0 but the evidence is far
from strong. If the null hypothesis were really true, we would expect to find
evidence against H0 that was at least as strong as this (in the sense of being in
the shaded areas in Figure 2.1) 6.7% of the time, and 6.7% is not an extremely
small number.
The significance probability does not tell you whether or not to reject H0 , but
maybe that is in itself an advantage. Different individuals may need different
levels of evidence in order to be content with a particular conclusion. A
significance test gives a measure of evidence, rather than a yes/no, reject/do not
reject rule, and hence allows different people to come to different conclusions. In
other words, the test provides information that is an aid to decision making,
rather than a prescriptive method of decision making. ■

Examples 2.1 and 2.2 illustrate most features of significance testing. However,
before summarizing the procedure for carrying out a significance test, there
remain two areas that need further consideration.
First, in a two-sided test, the rule for deciding which values in the opposite tail of
the null distribution (that is, the one that does not include the observed value)
should be counted as being ‘at least as extreme, in relation to the null hypothesis,
as that observed’ is somewhat arbitrary. In fact, there is no universal agreement
on exactly how this should be done. Several proposed general rules exist. In
practice, in cases like that in Example 2.2, where the null distribution is unimodal
and symmetric, the rules in common use all agree: the appropriate area in the
opposite tail is drawn symmetrically to match that in the observed tail. (This has
the consequence that the areas in these two tails are equal, so that the total
probability in the two tails can be found simply by doubling the probability in one
of them.) In very many common testing situations, the null distribution is indeed
unimodal and symmetric, so no problem arises. In Subsection 3.2, you will meet
an example where the null distribution is not symmetric, and which is complicated
even more by the fact that the test statistic is discrete. The rule used in this
course for defining the appropriate area in the opposite tail is described there.
Secondly, it is all very well saying that the significance testing approach does not
carry with it a firm rule on whether or not to reject H0 , but in practice how are
these p values to be interpreted? It is dangerous to give a general rule of thumb,
because there are always situations where any rule of thumb will be found
wanting. Nevertheless, a rough guide to interpreting significance probabilities is
given in Table 2.1.

Table 2.1 Interpreting significance probabilities

Significance probability p Rough interpretation


p > 0.10 little evidence against H0
0.10 ≥ p > 0.05 weak evidence against H0
0.05 ≥ p > 0.01 moderate evidence against H0
p ≤ 0.01 strong evidence against H0
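Table 2.1 translates directly into a small helper function. The sketch below is illustrative Python; the function name interpret_p is mine, and the returned phrases simply echo the table:

```python
def interpret_p(p):
    """Rough interpretation of a significance probability, following Table 2.1."""
    if p > 0.10:
        return "little evidence against H0"
    elif p > 0.05:
        return "weak evidence against H0"
    elif p > 0.01:
        return "moderate evidence against H0"
    else:
        return "strong evidence against H0"

print(interpret_p(0.067))  # weak evidence against H0
print(interpret_p(0.004))  # strong evidence against H0
```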

Example 2.4 Yeast cells: completing the test


In Example 2.2, you saw that the p value for the two-sided test is 0.067. Using
Table 2.1 as a guide, this p value can be interpreted as follows.
Since 0.05 < p ≤ 0.10, there is weak evidence against the null hypothesis that the
mean number of cells in a square is 0.6.
Further, since x = 0.6825 > 0.6, the data suggest that the underlying mean
number of cells in a square is greater than 0.6.
This completes the test. ■

The procedure for a significance test is summarized in the following box.

Procedure for significance testing


1 Determine the null hypothesis H0 and the alternative hypothesis H1 .
2 Decide what data to collect that will be informative for the test.
3 Determine a suitable test statistic and the null distribution of the test
statistic (that is, the distribution of the test statistic when H0 is true).
4 Collect the data and calculate the observed value of the test statistic for
the sample.
5 Identify all other values of the test statistic that are at least as extreme,
in relation to the null hypothesis, as the value that was actually
observed.
6 Calculate the significance probability p. This is the probability, under
the null hypothesis, of those values of the test statistic identified in
Step 5.
7 Interpret the significance probability.
8 Report the conclusion to be drawn from the test clearly.
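For the large-sample test of a mean used in this section, steps 3 to 6 of the procedure reduce to a short calculation. The sketch below is illustrative Python (the function name large_sample_p is mine); it assumes the normal approximation to the null distribution of the sample mean, so it is appropriate only when n is large:

```python
import math

def large_sample_p(xbar, s, n, mu0):
    """Two-sided significance probability for H0: mu = mu0, using the
    normal approximation N(mu0, s**2/n) to the null distribution of the
    sample mean (appropriate only for large n)."""
    z = (xbar - mu0) / (s / math.sqrt(n))          # standardized test statistic
    tail = 0.5 * math.erfc(abs(z) / math.sqrt(2))  # P(Z >= |z|) for Z ~ N(0, 1)
    return 2 * tail                                # both tails, by symmetry

# Yeast cell data (summaries from Table 1.4): p is about 0.067.
print(round(large_sample_p(0.6825, 0.9021, 400, 0.6), 3))  # 0.067
```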

The main features of the significance testing approach to testing a hypothesis


have been discussed in some detail in this section in the context of the yeast cell
data from Table 1.4. And the procedure for carrying out a significance test has
just been summarized. But when performing a significance test, how much detail
should be included, and how should the test be set out? The significance test
discussed in Examples 2.1 to 2.4 is used in Example 2.5 to illustrate the level of
detail required in practice. The steps of the procedure are identified by their
numbers in the margin.
Example 2.5 Yeast cells: the complete test
The null and alternative hypotheses are
H0 : µ = 0.6,   H1 : µ ≠ 0.6,
where µ is the mean number of yeast cells in a square. (Step 1)
The data to be used are in Table 1.4. (Step 2)
The test statistic is the sample mean X. Its approximate null distribution is
N (0.6, s²/400) = N (0.6, 0.00203). (Step 3)
The observed value of the test statistic is x = 0.6825. (Step 4)
The values of the test statistic that are at least as extreme as the observed value
fall into the two tails of the null distribution (see Figure 2.1). These consist of
values greater than or equal to 0.6825 and values less than or equal
to 0.5175. (Step 5)
The significance probability is thus
p = P (X ≤ 0.5175) + P (X ≥ 0.6825)
  = 2P (X ≥ 0.6825)
  = 2P ( Z ≥ (0.6825 − 0.6)/√0.00203 )
  ≃ 2P (Z ≥ 1.83)
  = 2 × 0.0336
  = 0.0672
  ≃ 0.067. (Step 6)
There is weak evidence against the null hypothesis that the mean number of cells
in a square is 0.6. Since x = 0.6825 > 0.6, the data suggest that the underlying
mean number of cells in a square is greater than 0.6. ■ (Steps 7 and 8; in
practice, steps 7 and 8 are often combined.)

Activity 2.1 Insect traps

A total of 33 insect traps were set out across sand dunes and the numbers of
different insects caught in a fixed time were counted. Table 2.2 gives the number Gilchrist, W. (1984) Statistical
of traps containing various numbers of insects of the taxon Staphylinoidea. Modelling. John Wiley and
Sons, Chichester, p. 132. The
Table 2.2 Staphylinoidea in 33 traps original purpose of the
experiment was to test the
Count 0 1 2 3 4 5 6 ≥7 quality of fit of a Poisson model.
Here no specific model is
Frequency 10 9 5 5 1 2 1 0 assumed, though a sample size
of 33 is arguably a little small to
The sample mean of the counts is 1.636, and the sample standard deviation use the large-sample result for
is 1.655. the distribution of the sample
mean.
Perform a two-sided significance test of the null hypothesis that µ, the underlying
mean number of insects of the taxon Staphylinoidea in a trap, is 1. Use the
sample mean X as the test statistic, and use a normal approximation to the null
distribution of X.
Follow the procedure for significance testing, numbering the steps in your test.

Summary of Section 2
In this section, a general procedure for carrying out a hypothesis test has been
introduced. The method described is known as significance testing. You have
learned how to interpret the significance probability that arises from a significance
test.
Exercise on Section 2
Exercise 2.1 Book loans
In Exercise 1.1, data on the number of times that a sample of 122 books in a
library were borrowed in a year were investigated using the confidence interval
approach to hypothesis testing. The sample mean number of loans per book was
x = 1.992, and the sample standard deviation was s = 1.394. Perform a
(two-sided) significance test of the null hypothesis that the underlying mean
number of loans per book in a year is 2.5.

3 More on significance testing


In this section, the significance testing approach to the problem of testing a
statistical hypothesis is discussed further. A description of what the technique
involves for tests on normal means is given in Subsection 3.1. Testing a
proportion is discussed in Subsection 3.2. In Subsection 3.3, you will learn how to
use MINITAB to carry out the tests described in Sections 2 and 3.

3.1 Testing a nor mal mean


The use of significance testing to test a hypothesis about a normal mean is
illustrated in Example 3.1.

Example 3.1 Shoshoni rectangles


Most individuals, if required to draw a rectangle (for example, when composing a
picture), would produce something which is not too close to a square and not too
oblong. A typical rectangle that might be produced is shown in Figure 3.1.
The Greeks called a rectangle ‘golden’ if the ratio of its width to its length was
(√5 − 1)/2 ≃ 0.618. The Shoshoni, a tribe of Native North Americans, used
beaded rectangles to decorate their leather goods. The data in Table 3.1 are the
width-to-length ratios for 20 Shoshoni rectangles, analysed as part of a study in
experimental aesthetics.

[Figure 3.1: A typical rectangle.]

Table 3.1 Width-to-length ratios, Shoshoni rectangles

0.693  0.662  0.690  0.606  0.570  0.749  0.672  0.628  0.609  0.844
0.654  0.615  0.668  0.601  0.576  0.670  0.606  0.611  0.553  0.933

[DuBois, C. (1960) Lowie’s Selected Papers in Anthropology. University of
California Press, pp. 137–142.]
Do the Shoshoni tend to use golden rectangles? This can be investigated using a
significance test as follows.
Appropriate hypotheses for the test are
H0 : µ = 0.618,   H1 : µ ≠ 0.618,
where µ is the underlying mean width-to-length ratio of Shoshoni rectangles
(step 1). (Note that departures from the null hypothesis in either direction would
indicate that the Shoshoni are doing something other than constructing ‘golden’
rectangles, so the test is two-sided.)
The values in Table 3.1 will be used for data (step 2).
The next step (step 3) is to decide on the test statistic and find its null
distribution. Here, the hypotheses relate to the underlying mean µ. So, as in
Example 2.1, the obvious choice for the test statistic is perhaps the sample
mean X. But what is its null distribution?
An appropriate probability model for the ratios is a normal distribution with
unknown mean µ and unknown variance σ2 ; that is, the data can be treated as a
random sample from a population with distribution N (µ, σ2 ), where both the
mean and variance are unknown. The distribution of X is N (µ, σ2 /n), where n is
the sample size (20, in this case). So far so good; the null distribution of X is
N (0.618, σ2 /20). A snag arises: in order to calculate the significance probability,
the variance of the null distribution of X must be known, but we do not know the
value of σ2 . In Example 2.1, the way round this was to replace σ2 by its sample
estimate s2 , but that is inappropriate here because the sample size is small.
Therefore X is simply not usable as a test statistic in this situation.
The way round this impasse is to use a different test statistic, one whose null
distribution is known. ■

Activity 3.1 Choosing a test statistic


Do you have any suggestions as to how an appropriate test statistic might be
constructed?
Comment
When calculating confidence intervals in Block B in a similar situation, you made
use of the quantity
(X − µ)/(S/√n),
where S is the sample standard deviation. You might have thought that this
would help; and indeed it does.

Example 3.1 continued Shoshoni rectangles


The random variable (X − µ)/(S/√n) has the advantage that its distribution is
known: it
has Student’s t-distribution with n − 1 degrees of freedom. However, it cannot be
used directly as a test statistic, for two reasons. First, it involves the unknown
quantity µ, so that it cannot be calculated from the data; and secondly, it is not
clear exactly what it has to do with the hypothesis we are trying to test. But
consider the random variable
T = (X − 0.618)/(S/√n) = (X − 0.618)/(S/√20).
This does not involve any values that cannot be calculated from the data. Also, it
might be expected to take values relatively close to zero when the null hypothesis
is true (because the sample mean would then be close to the hypothesized
underlying mean, 0.618). Furthermore, when H0 is true (so that µ = 0.618), T
has Student’s t-distribution with n − 1 = 19 degrees of freedom. That is, we know
its null distribution: it is t(19). So, if we use T as a test statistic, we can deal
with step 3 of the testing procedure.
Step 4 is to calculate the observed value t of the test statistic T . For the data in
Table 3.1, the sample mean is 0.6605 and the sample standard deviation is 0.0925,
so the observed value of the test statistic is
t = (x − 0.618)/(s/√20) = (0.6605 − 0.618)/(0.0925/√20) ≃ 2.055.

[Figure 3.2: The null distribution of T.]

Figure 3.2 shows the null distribution of T; the observed value t = 2.055 of T is
also marked. The shaded regions in the diagram are those possible values of T
that, if they had been observed, would have been at least as extreme, in relation
to the null hypothesis, as the actual value that was observed (2.055). (Step 5.) It
is easy to see which values these are in the same tail of the distribution as that
where the observed value lies. Any value of T greater than the actual observed
value of 2.055 would be more extreme (in relation to the null hypothesis) than is
2.055. Thus the shaded area in the right tail of the distribution includes all values
of T greater than or equal to t = 2.055.
Why is there a shaded area in the left tail as well? Remember this is a two-sided
test, so that a sample mean a long way below the hypothesized value of 0.618,
corresponding to a negative value of T , would also provide evidence against the
null hypothesis. Since the null distribution of T is symmetric about t = 0, it
seems reasonable to state that a value of T equal to −2.055 would be just as
extreme in relation to the null hypothesis as is the observed value, 2.055, and that
values further into the left tail (that is, less than −2.055) are even more extreme.
Thus the set of values of T that are at least as extreme (in relation to H0 ) as the
observed value, 2.055, consists precisely of all values greater than or equal to
2.055, together with all values less than or equal to −2.055.
The significance probability for the test is simply the sum of the two shaded tail
areas in Figure 3.2. In other words, it is p = P (|T | ≥ 2.055), where T ∼ t(19). To
find this probability exactly needs computing facilities; in fact, p = 0.0539
(step 6).
The necessity of using a computer (or some other calculating method) to calculate
the significance probability for many tests may be seen as a disadvantage of
significance testing. However, in practice it is not usually a major problem. For
instance, in this case the table of quantiles for t-distributions indicates that the
0.95-quantile and the 0.975-quantile of t(19) are 1.729 and 2.093, respectively.
Thus
P (T ≥ 1.729) = 1 − 0.95 = 0.05
and
P (T ≥ 2.093) = 1 − 0.975 = 0.025.
Since 2.055 is between 1.729 and 2.093, it follows that
0.025 < P (T ≥ 2.055) < 0.05.
But p = P (|T | ≥ 2.055) = 2P (T > 2.055), so 0.05 < p < 0.1. In most
circumstances, this kind of rather imprecise information is adequate for drawing
conclusions from a test.
There is weak evidence against the null hypothesis that the underlying mean
width-to-length ratio of Shoshoni rectangles is 0.618 and in favour of the
alternative hypothesis.
Since the sample mean is greater than 0.618, the data suggest that Shoshoni
rectangles may be somewhat ‘squarer’ than golden ratio rectangles (steps 7 and 8).
This completes the test. ■
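As noted above, finding p = P(|T| ≥ 2.055) exactly needs computing facilities. The sketch below shows one way, in illustrative Python using only the standard library: it integrates the t(19) density numerically with Simpson’s rule. The function names are mine, and the truncation point and step count are ad hoc choices:

```python
import math

def t_density(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def t_tail(t0, df, upper=60.0, steps=20000):
    """P(T >= t0) by Simpson's rule on [t0, upper]; the density beyond
    `upper` is negligible at this accuracy. `steps` must be even."""
    h = (upper - t0) / steps
    total = t_density(t0, df) + t_density(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(t0 + k * h, df)
    return total * h / 3

# Two-sided significance probability for the Shoshoni data: t = 2.055, t(19).
p = 2 * t_tail(2.055, 19)
print(round(p, 4))  # about 0.0539
```

In practice statistical software does this for you; the point is only that the tail area is an ordinary integral of a known density.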

The test just used for testing a hypothesis about a normal mean is called the
t-test or Student’s t-test (although, actually, R.A. Fisher had a lot to do with
its development). To be more specific, it is called the one-sample t-test to
distinguish it from another test, involving two samples, that you will meet in
Section 4. The various kinds of t-test are among the most commonly used tests in
statistics because, as you have seen, the normal distribution is a very widely used
probability model.

Activity 3.2 The kitchen timer again


In Example 1.2, a hypothesis test was carried out using the confidence interval
approach on data obtained by timing a kitchen timer with a stop-watch. Ten
times (in seconds) were measured when the timer was set to five minutes
(300 seconds). The observed sample mean is 294.81 seconds and the sample
standard deviation is 1.77 seconds. (The data are in Table 1.3.)
Assuming a normal model for the variation in times, use the procedure for
significance testing to test the null hypothesis that µ, the underlying mean time
that the timer takes to ring when it is set to five minutes, is equal to five minutes
(300 seconds) against the alternative hypothesis that it is different from five
minutes:
H0 : µ = 300,   H1 : µ ≠ 300.
Follow the procedure for significance testing, numbering the steps in your test.

When the alternative hypothesis of a test describes departures from the null
hypothesis in only one of the two possible directions, the test is said to be
one-sided. (A one-sided test is sometimes called a one-tailed test.) So the general
statement of the null and alternative hypotheses might take either of the following
forms, depending on the context of the claim being investigated:
H0 : θ = θ0 , H1 : θ > θ0 ;
or
H0 : θ = θ0 , H1 : θ < θ0 .
When you learned how to calculate confidence intervals for normal means in
Block B, you saw that the technique can be applied to data that are actually
differences of matched pairs. Not surprisingly, tests for normal means can also be
applied to differences of matched pairs. Apart from the fact that the data arise in
a particular way, the technique is the same as that used in Example 3.1. However,
to illustrate this, an example for which a one-sided test is appropriate will be used.

Example 3.2 Paired obser vations


In his 1908 paper, ‘Student’ analysed data on the effects of two drugs on sleep
duration. Ten patients were each given two different drugs, L-hyoscyamine
hydrobromide and D-hyoscyamine hydrobromide. An interesting preliminary test
is to investigate whether each of the drugs on its own has an effect on sleep.

[‘Student’ (1908) The probable error of a mean. Biometrika, 6, 1–25.]

The sleep gains (measured in hours) for the ten individuals when they took
L-hyoscyamine hydrobromide are given in Table 3.2.

Table 3.2 Sleep gain (hours)

Patient    Gain
1           1.9
2           0.8
3           1.1
4           0.1
5          −0.1
6           4.4
7           5.5
8           1.6
9           4.6
10          3.4

These data are actually individual differences between matched pairs of
observations. For each of the ten individuals, the length of time asleep after
taking no drug was subtracted from the length of time asleep after taking
L-hyoscyamine hydrobromide. The differences are all positive except the fifth, and
this remark alone suggests that L-hyoscyamine hydrobromide is effective, on
average, at prolonging sleep. However, a null hypothesis for a formal test that the
drug, in fact, makes no difference to the duration of sleep might take the form
H0 : µ = 0,
where the parameter µ is the mean underlying sleep gain. This is an example
where the name ‘null’ for this hypothesis makes particularly good sense, because
it proposes a zero value for the parameter. However, the hypothesis is still called
‘null’ when it does not propose a zero value, as you have seen.
In this case, it will be interesting to pursue a one-sided test to reflect the
suspicion (or even the aim in administering the dose) that the drug does indeed
prolong sleep, on average. The alternative hypothesis will therefore be written as
H1 : µ > 0. (This concludes Step 1.)
24 Unit C1
The data to be used are those in Table 3.2. Step 2.
The next step is to determine a suitable test statistic and its null distribution. In
order to do this, it is necessary to specify a model for the variability in the data
on sleep gain. This data set is very small and a histogram, or even a normal
probability plot, is not likely to display much in the way of persuasive evidence for
or against a normal model. However, a continuous model that will permit
negative as well as positive observations is required, and in this regard there are
not really any alternative models that have been considered in the course.
Using D to represent the observed random variable, it makes sense to base the test statistic on the sample mean D̄, but (just as in Example 3.1) we need to take account of the fact that the variance of D is unknown. (Any symbol could have been used, but D, for ‘difference’, is often used in this kind of context.) Assuming that the observed differences di, i = 1, 2, . . . , 10, are independent observations on a normal random variable D ∼ N(µ, σ²), then under the null hypothesis H0 : µ = 0, the quantity D̄/(S/√10), where S is the sample standard deviation of the differences, has a t-distribution with 9 degrees of freedom. (More generally, if the sample size is equal to n, then under the null hypothesis that µ = 0, D̄/(S/√n) ∼ t(n − 1).)
So the quantity D̄/(S/√10) will be used as a test statistic; its null distribution is t(9). This concludes Step 3.
Next, the observed value of the test statistic is calculated (Step 4). The sample standard deviation of the observed differences is s = 2.002 and the sample mean is d̄ = 2.33. The observed value of the test statistic is therefore
d̄/(s/√10) = 2.33/(2.002/√10) ≈ 3.68.
The null distribution of T is shown in Figure 3.3. The observed value is also marked on the diagram. The other values of the test statistic that are at least as extreme, in relation to the null hypothesis, as the observed value are in the (very small) shaded area in Figure 3.3.
Figure 3.3  The null distribution of the test statistic, with the observed value marked
In contrast to the situation in Example 3.1, only values in the same tail as the with the observed value marked
observed value are shaded. This is because the test is one-sided, and only positive
values of the test statistic provide evidence against the null hypothesis and in
favour of the alternative hypothesis. (A negative value of the test statistic would
tend to indicate that the underlying mean sleep gain is less than zero, but that is
closer to the position described by the null hypothesis than it is to the position
described by the alternative hypothesis.)
Thus the set of values required for step 5 of the testing procedure consists of those
values greater than or equal to the observed value, 3.68. Step 5.
The significance probability is thus p = P(T ≥ 3.68), where T ∼ t(9) (Step 6). ◾
Activity 3.3 Calculating a significance probability
Suppose that you did not have access to a computer to calculate the significance
probability in Example 3.2. What could you say about it on the basis of the table
of quantiles for t-distributions in the Handbook (complete step 6)? How would
you interpret the results of the test (steps 7 and 8)?
Comment
The 0.995-quantile and the 0.999-quantile of t(9) are, respectively, 3.250 and
4.297. Thus
P (T ≥ 3.250) = 1 − 0.995 = 0.005
and
P (T ≥ 4.297) = 1 − 0.999 = 0.001.
The significance probability p must therefore be somewhere between these two
values. That is, 0.001 < p < 0.005. Step 6.
Wherever p is in this range, it is clearly very small, so there is strong evidence against the null hypothesis. We conclude that the drug L-hyoscyamine hydrobromide does prolong sleep, on average. (Steps 7 and 8.)
Actually, the exact p value here, calculated by computer, is p = 0.0025.
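The exact calculation mentioned above can be reproduced in software. In the course this would be done with MINITAB; the Python sketch below (using SciPy for the t-distribution) computes the observed test statistic and the one-sided significance probability from the data in Table 3.2.

```python
from math import sqrt
from statistics import stdev
from scipy.stats import t as t_dist

# Sleep gains (hours) for L-hyoscyamine hydrobromide, from Table 3.2
gains = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]

n = len(gains)
d_bar = sum(gains) / n                         # sample mean, 2.33
s = stdev(gains)                               # sample standard deviation, about 2.002
t_obs = d_bar / (s / sqrt(n))                  # observed test statistic, about 3.68
p_one_sided = t_dist.sf(t_obs, df=n - 1)       # P(T >= t_obs) for T ~ t(9), about 0.0025
p_two_sided = 2 * p_one_sided                  # doubling gives the two-sided p of Activity 3.4
```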
Activity 3.4 Calculating a significance probability for a two-sided test
Suppose that there were good reasons for conducting a two-sided test, with the
same data and the same null hypothesis. What would be the significance
probability?
Comment
The test statistic and null distribution are exactly the same as for the one-sided
test. The only difference is that the set of values required for step 5 of the testing
procedure includes values in both tails of the null distribution. That is, it includes
values less than or equal to −3.68 as well as values greater than or equal to the
observed value, 3.68.
Therefore, the significance probability is p = P (|T | ≥ 3.68), where T ∼ t(9). But,
by the symmetry about 0 of the t-distribution, P (|T | ≥ 3.68) = 2P (T ≥ 3.68), so
the significance probability for the two-sided test is exactly twice that for the
one-sided test of the same null hypothesis. It is thus 2 × 0.0025 = 0.005.
In general, for tests with a continuous null distribution, the significance
probability for a two-sided test will be double that for a one-sided test (provided
that the observed value of the test statistic is in the tail of the null distribution
that indicates support for the one-sided alternative hypothesis). The same is true
of tests with a discrete null distribution, as long as that distribution is symmetric.
The position is more complicated for tests with a discrete null distribution that is
not symmetric, as you will see in Subsection 3.2.
Activity 3.5 Another hypnotic substance?
The data in Table 3.3 are also from Student’s 1908 paper. The table gives the sleep gains (measured in hours) of ten patients when given the drug D-hyoscyamine hydrobromide. Use the data to perform a one-sided test of the hypothesis that the drug has no effect on sleep, against the alternative hypothesis that, on average, it leads to a net sleep gain. Start with an explicit statement of the hypotheses H0 and H1, and follow the procedure for significance testing.

Table 3.3  Sleep gain (hours)

Patient   Gain
1          0.7
2         −1.6
3         −0.2
4         −1.2
5         −0.1
6          3.4
7          3.7
8          0.8
9          0.0
10         2.0
3.2 Testing a proportion
Genetics is a field where there is much interest in the fraction of a population
displaying a particular attribute; so, many examples in genetics involve testing
the value of a Bernoulli probability p. Often sample sizes are quite small, and
exact distribution theory is appropriate. A few complications arise when applying
the general approach of significance testing to discrete distributions, as you will
see in Example 3.3.
Example 3.3 The colour of seed cotyledons in the edible pea
Gregor Mendel, an early explorer of the science of genetics, observed that seed
cotyledons in the edible pea may be either yellow or green and that the peas
themselves appear either yellow or green. In his second experiment in what
became known as the ‘first series’, he observed 6022 yellow peas and 2001 green
peas in a harvest of 8023 peas bred in particular circumstances. These results
offered support for his theory that on genetic principles the proportion of yellow peas in such circumstances should be 3/4.
In a smaller experiment, 12 yellow peas were found in a harvest of 20 peas (Steps 1 and 2). It was required to use these data in a significance test of the null hypothesis
H0 : p = 3/4.
The test will be two-sided, because departures in either direction from a value of 3/4 for p will throw doubt on the genetic theory, or more specifically on its applicability in this case. Thus an appropriate alternative hypothesis is
H1 : p ≠ 3/4.
The obvious test statistic to use in this context is the number of yellow peas — N, say. In repeated experiments of the same size, N would have a binomial distribution: N ∼ B(20, p) (Step 3). Under the null hypothesis H0 : p = 3/4, the null distribution of N is B(20, 3/4).
The observed value of the test statistic was n = 12. Step 4.
A diagram of the null distribution is shown in Figure 3.4. The shaded regions in the diagram show the possible counts that are at least as extreme, in relation to the null hypothesis, as the value that was actually observed (Step 5).
Figure 3.4  The null distribution B(20, 3/4) and counts that are at least as extreme, in relation to the null hypothesis, as that observed, n = 12
It is clear from the diagram that the observed value n = 12 is in the lower tail of
the null distribution; that is, it is rather a small value to have observed if the null
hypothesis is true. It is therefore clear that values further out in this tail — 11, 10,
9, and so on — provide even more evidence against the null hypothesis. However,
this is a two-sided test, so we need to consider values in the other tail as well, but
which values?
In this course the principle that will be used to choose ‘extreme’ values in the
opposite tail to that observed is based on ‘tail probabilities’ — that is, on the total
probability included in the ‘tails’ of the distribution, where a ‘tail’ includes all
values from one extreme or the other up to a certain value. Table 3.4 shows the probability function of the null distribution B(20, 3/4), together with the corresponding tail probabilities for both tails of the distribution.
Table 3.4  The binomial distribution B(20, 3/4), with tail probabilities (the values in this table were calculated using MINITAB)

x     p(x)     P(N ≤ x)   P(N ≥ x)
0 0.0000 0.0000 1.0000
1 0.0000 0.0000 1.0000
2 0.0000 0.0000 1.0000
3 0.0000 0.0000 1.0000
4 0.0000 0.0000 1.0000
5 0.0000 0.0000 1.0000
6 0.0000 0.0000 1.0000
7 0.0002 0.0002 1.0000
8 0.0008 0.0009 0.9998
9 0.0030 0.0039 0.9991
10 0.0099 0.0139 0.9961
11 0.0271 0.0409 0.9861
12 0.0609 0.1018 0.9591
13 0.1124 0.2142 0.8982
14 0.1686 0.3828 0.7858
15 0.2023 0.5852 0.6172
16 0.1897 0.7748 0.4148
17 0.1339 0.9087 0.2252
18 0.0669 0.9757 0.0913
19 0.0211 0.9968 0.0243
20 0.0032 1.0000 0.0032
The probability in the ‘tail’ up to and including the observed count, n = 12, is thus 0.1018. The upper tail is then chosen to be the tail which includes the largest amount of probability that is not greater than this value (that is, less than or equal to 0.1018). This is the tail including the values 18, 19 and 20, for which the total probability is 0.0913. (MINITAB uses a more complicated method to choose the second tail. However, in many instances this leads to the same p value as the method described here.)
The significance probability for the test is given by the sum of the two shaded areas in Figure 3.4 — that is, by the sum of the probabilities in the two tails (Step 6). So
p = P(N ≤ 12) + P(N ≥ 18) = 0.1018 + 0.0913 = 0.1931.
In this case, there is little evidence that the underlying value of the proportion p is different from the hypothesized value of 3/4. (Steps 7 and 8.) ◾
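The tail-probability principle described above is straightforward to automate. The Python sketch below (plain standard library; the helper `pmf` is our own, and this is not MINITAB's method) finds the opposite tail by scanning for the largest upper-tail probability that does not exceed the observed tail probability.

```python
from math import comb

n_trials, p0, observed = 20, 0.75, 12   # null model B(20, 3/4); observed count of yellow peas

def pmf(k):
    """Binomial probability P(N = k) under the null distribution B(20, 3/4)."""
    return comb(n_trials, k) * p0**k * (1 - p0)**(n_trials - k)

p_lower = sum(pmf(k) for k in range(observed + 1))   # P(N <= 12), about 0.1018

# Opposite tail: the largest upper-tail probability P(N >= k)
# that does not exceed the observed tail probability.
upper_tails = [sum(pmf(j) for j in range(k, n_trials + 1)) for k in range(n_trials + 1)]
p_upper = max(q for q in upper_tails if q <= p_lower)   # tail {18, 19, 20}, about 0.0913

p_value = p_lower + p_upper   # about 0.1931
```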
Activity 3.6 Calculating another binomial significance probability
Suppose that the observed number of yellow peas (out of 20) in the experiment
described in Example 3.3 had been 19 instead of 12. What would the significance
probability have been and what would the conclusion be in this case?
3.3 Performing significance tests using MINITAB
The work in this subsection involves working through a chapter of Computer
Book C. You will learn how to use MINITAB to do the calculations involved in
carrying out the significance tests discussed in Section 2 and in this section.
Summar y of Section 3
In this section, you have seen how significance tests can be used for testing normal
means and Bernoulli probabilities. One-sided tests and two-sided tests have both
been considered.
In particular, you have seen that a useful test statistic for testing the null
hypothesis H0 : µ = µ0 about the mean µ of a normal distribution is the random
variable
T = (X̄ − µ0)/(S/√n).
The null distribution of T is t(n − 1), where n is the sample size. The test is
called the one-sample t-test.
You have used MINITAB to carry out the tests described in Sections 2 and 3.
Exercise on Section 3
Exercise 3.1 Differences in plant heights
Charles Darwin measured differences in height for fifteen pairs of plants of the species Zea mays. (The data are quoted in Fisher, R.A. (1942) The Design of Experiments, 3rd edn. Oliver and Boyd, London, p. 27.) Each pair had parents grown from the same seed — one plant in each pair was the progeny of a cross-fertilization, the other of a self-fertilization. Darwin’s measurements were the differences in height between cross-fertilized and self-fertilized progeny. The data are given in Table 3.5. The units of measurement are eighths of an inch.

Table 3.5  Difference in plant height (1/8 inch)

Pair   Difference
1        49
2       −67
3         8
4        16
5         6
6        23
7        28
8        41
9        14
10       29
11       56
12       24
13       75
14       60
15      −48

Suppose that the observed differences di, i = 1, 2, . . . , 15, are independent observations on a normally distributed random variable D with mean µ and variance σ².
(a) State appropriate null and alternative hypotheses for a two-sided test of the hypothesis that there is no difference (on average) between the heights of progeny of cross-fertilized and self-fertilized plants, and state the null distribution of an appropriate test statistic.
(b) Calculate the value of the test statistic for this data set, and obtain the significance probability for the test.
(c) Interpret the significance probability, and state your conclusion clearly.
4 Two-sample tests
Many of the tests that you have met so far have been concerned with testing the
hypothesis that some population parameter takes a particular value. Such
hypotheses are not uncommon in certain application areas, for example, genetics.
However, a much more common testing situation is where samples are drawn from
two separate populations, in order to test some hypothesis about differences in
population characteristics. Very often, in such a situation, the hypothesis being
tested is that the populations actually do not differ in terms of the characteristic
of interest. This is the origin of the term null hypothesis. (‘Null’ ≡ ‘no difference’.)
The most general question that could be asked is perhaps this: ‘Is the pattern of
variation in the attribute of interest exactly the same in both populations?’ In
other words, denoting the respective cumulative distribution functions by F1 (.)
and F2 (.), this might suggest the hypothesis
H0 : F1 (x) = F2 (x) for all x.
This may make sense, for example, in a context where the two populations differ
only in that different experimental treatments have been applied to them. If, in
fact, it is possible that the treatments make no difference at all, and the
individuals involved were drawn originally from a single population, then this kind
of hypothesis would be a sensible one to investigate. But more typically, some
common probability model, with parameters, would be assumed for the two
populations, and the null hypothesis would simply state that the parameter values
were equal for the two populations. An example is where the model is
parameterized by the mean and, denoting the two population means by µ1 and
µ2 , the null hypothesis is
H0 : µ1 = µ2 .
Subsection 4.1 is concerned with the situation where the variation in each of the
two populations is assumed to be adequately described by the binomial
distribution, and the question of interest is whether the Bernoulli probability p is
the same in the two populations. In Subsection 4.2, the variation in the two
populations is assumed to be adequately described by a normal distribution, and
the question of interest is whether the means of the two populations are equal.
Subsection 4.3 involves working through a chapter of Computer Book C : you will
learn how to carry out the significance tests described in Subsections 4.1 and 4.2
using MINITAB.
In Sections 2 and 3, the steps in the significance testing procedure were identified
by their numbers, both in examples and in solutions. From now on, although the
steps will be followed, they will not be numbered explicitly. However, you should
continue to use them in activities and exercises to help you to produce clear,
structured solutions.
4.1 Testing the difference between two Bernoulli probabilities
In Example 4.1, data on the proportion of male sand-flies sampled in traps at
different heights are used to describe a procedure for testing the difference
between two Bernoulli probabilities.
Example 4.1  Sand-flies
An experiment was performed in which sand-flies were caught in two different light traps, one at 3 feet above the ground and the other at 35 feet. (Christiensen, H.A., Herrer, A. and Telford, S.R. (1972) Enzootic cutaneous leishmaniasis in Eastern Panama. II: Entomological investigations. Annals of Tropical Medicine and Parasitology, 66, 55–66.) At the higher altitude, 125 male flies were found out of a total of 198 flies sampled; at the lower altitude, there were 173 male flies out of 323 flies sampled. One question that arises is whether the underlying proportions of male sand-flies differed at these two different heights.
These data can be modelled using the binomial distribution as follows. Let the underlying proportions of male flies at high and low altitude be denoted by p1 and p2, respectively. Then assume that X1, the number of male flies sampled at high altitude, has the binomial distribution B(198, p1), and X2, the number of male flies sampled at low altitude, has the binomial distribution B(323, p2).
Appropriate hypotheses for the test, in terms of the parameters p1 and p2, are
H0 : p1 = p2, H1 : p1 ≠ p2.
(There seems to be no reason in the context of the situation to concentrate on differences in one direction or the other, so a two-sided test is appropriate.)
An obvious choice for a test statistic is the difference between the sample proportions of males:
D = X1/198 − X2/323.
(Indeed, D is an appropriate quantity to use for constructing a confidence interval in this situation.) The observed value of D is
d = 125/198 − 173/323 ≈ 0.6313 − 0.5356 = 0.0957.
(That is, d = p̂1 − p̂2, where p̂1 = x1/n1 and p̂2 = x2/n2.)
In order to use D as a test statistic, its null distribution must be known, at least approximately. In general, the distribution of D is approximately normal, and is given by
D ≈ N(p1 − p2, p1(1 − p1)/n1 + p2(1 − p2)/n2),
where n1 and n2 are the sample sizes (198 and 323, in this case). When finding a confidence interval for the difference D, the variance is estimated by replacing p1 and p2 by their sample estimates, p̂1 = x1/n1 and p̂2 = x2/n2. This could be done in order to find the approximate null distribution of D. Then since, under the null hypothesis, p1 = p2, the approximate null distribution would be
N(0, p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2).
That would work, but we can do (a little) better when estimating the variance of the null distribution. The procedure above does not allow for the fact that, under the null hypothesis, the value of the Bernoulli parameter p is the same for both populations. Under the null hypothesis that p1 = p2, the two samples can be thought of as a sequence of n1 + n2 Bernoulli trials with the same success probability p (say), so p1 = p2 = p. The two samples can thus be combined to find an estimate of this common value of p. The resulting estimate is
p̂ = (x1 + x2)/(n1 + n2).
For the given data, the observed value of p̂ is
p̂ = (125 + 173)/(198 + 323) = 298/521 ≈ 0.5720.
Assuming the null hypothesis is true, this is a rather better estimate of the common value of the Bernoulli probability than either of the two estimates derived from the two samples separately. Thus a better approximation to the null distribution of D is
N(0, p̂(1 − p̂)/n1 + p̂(1 − p̂)/n2) = N(0, p̂(1 − p̂)(1/n1 + 1/n2)).
For the given data,
p̂(1 − p̂)(1/n1 + 1/n2) = 0.5720 × (1 − 0.5720) × (1/198 + 1/323) ≈ 0.001994.
So the null distribution of D is approximately N(0, 0.001994). Therefore, if Z ∼ N(0, 1), then the (two-sided) significance probability is
P(|D| ≥ 0.0957) ≈ P(|Z| ≥ 0.0957/√0.001994)
≈ P(|Z| ≥ 2.14)
= 2 × 0.0162
= 0.0324
≈ 0.032.
(Recall that the observed value of the test statistic is d = 0.0957. Since the approximate null distribution is symmetrical about 0, the appropriate ‘opposite tail’ of the null distribution to consider is that below −0.0957. The table of probabilities for N(0, 1) in the Handbook has been used here.)
Thus there is moderate evidence against the null hypothesis that the underlying proportion of sand-flies that are male is the same at the two altitudes. Moreover, p̂1 = 0.6313 and p̂2 = 0.5356, so p̂1 > p̂2, and hence the data indicate that the proportion of sand-flies that are male is higher at the higher altitude. ◾
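The steps of this example can be collected into a short routine, which is also what Activity 4.1 requires. This Python sketch uses only the standard library (`erfc` stands in for the Handbook's normal table, since 2 × P(Z ≥ |z|) = erfc(|z|/√2)); the function name is our own, not part of the course software.

```python
from math import sqrt, erfc

def two_proportion_test(x1, n1, x2, n2):
    """Two-sided test of H0: p1 = p2, using the pooled estimate of p."""
    d = x1 / n1 - x2 / n2                             # difference in sample proportions
    p_pool = (x1 + x2) / (n1 + n2)                    # pooled estimate of the common p
    var0 = p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)  # variance of D under H0
    z = d / sqrt(var0)
    p = erfc(abs(z) / sqrt(2))                        # two-sided: 2 * P(Z >= |z|)
    return z, p

# Sand-fly data: 125 males out of 198 flies at 35 feet, 173 out of 323 at 3 feet
z, p_value = two_proportion_test(125, 198, 173, 323)
# z is about 2.14 and p about 0.032: moderate evidence against H0.
```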
Activity 4.1 Helping behaviour
This activity uses data from an experiment designed to test people’s willingness to help others. (Sissons, M. (1981) Race, sex and helping behaviour. British Journal of Social Psychology, 20, 285–292.) The question of interest was whether the gender of the person requiring help was an important feature. In the experiment described, 71 male students out of 100 asking for help (changing a small coin) were given it; 89 female students out of 105 were helped. Use the procedure outlined in Example 4.1 (with a pooled estimate of the common Bernoulli parameter under the null hypothesis) to explore any differences between the proportions of male and female students who receive help.
4.2 The two-sample t-test
The two-sample t-test is one of the most useful tests available in statistics.
Under certain circumstances, it permits a test of the null hypothesis that the
means µ1 and µ2 of two distinct populations are equal:
H0 : µ1 = µ2 .
The assumptions of the test are as follows.
The variation in the first population may be adequately modelled by a
normal distribution with mean µ1 and variance σ2 , and the variation in the
second population may be adequately modelled by a normal distribution with
mean µ2 and variance σ2 . That is, denoting the random variables being
observed in the samples from the two populations by X1 and X2 ,
X1 ∼ N (µ1 , σ2 ), X2 ∼ N (µ2 , σ2 ).
The observations on the two populations are independent of one another.
You have seen that the normal distribution is an adequate model for many
situations. Given data, its adequacy can be checked by graphical methods — for
example, by drawing a histogram or a probability plot. But you should be aware
that another assumption is being made above.
The variance is the same in the two populations.
This assumption is not a trivial one. It will almost invariably be true that the sample variances s₁² and s₂² are not equal, and thus the question arises of how
pronounced the difference between them must be before it throws the assumption
of equal underlying variances into doubt. It is sometimes suggested that a formal
test of the hypothesis that the population variances are equal should be carried
out before the t-test for the equality of the means. However, most statisticians
would see this as inappropriate. But it is usually worth at least having a quick
informal look at the sample variances. The following rule of thumb will be used in
this course: if the sample variances differ by a factor of less than about 3, it may
be assumed that the assumption of equal variances is not seriously amiss.
There are also many situations where an assumption of equal variances is
reasonable on modelling grounds, at least when the null hypothesis of equal
means is true. These situations include those where the individuals in both
samples have been sampled from a common population and then had different
treatments applied to them. Under the hypothesis that the treatments have no
effect, the underlying distributions of both populations will be the same. So, their
variances as well as their means will be equal.
The assumptions required for the two-sample t-test may be summarized as in the
following box.
Assumptions for the two-sample t-test
The variation in each population is adequately modelled by a normal
distribution.
The samples drawn from the populations are independent.
The population variances are equal.
The two-sample t-test is illustrated in Example 4.2.
Example 4.2  Infants with SIRDS
The birth weights of 50 infants who displayed severe idiopathic respiratory distress syndrome (SIRDS) were collected. (van Vliet, P.K. and Gupta, J.M. (1973) Sodium bicarbonate in idiopathic respiratory distress syndrome. Archives of Disease in Childhood, 48, 249–255.) Of these infants, 27 died, while 23 survived. Is birth weight associated with survival in children with SIRDS?
One way to investigate this question is to compare the mean birth weight of the children who died with that of the children who survived. This may be done in
various ways, one of which is to carry out a test of the hypothesis
H0 : µ1 = µ2 ,
where µ1 is the mean birth weight of infants with SIRDS who die, and µ2 is the
mean birth weight of infants with SIRDS who survive. The two-sample t-test is
appropriate here provided that the data indicate that the assumptions of the test
are reasonable. Normal probability plots (not shown) suggest that it might be
reasonable to assume a normal model for the variation observed in each group.
The observations in the two groups can be assumed to be independent.
What about the assumption of equal variances? If the birth weights of the
27 infants who died are taken as the first sample, and the birth weights of the 23
infants who survived as the second sample, then the sample variances are
s₁² = 0.268, s₂² = 0.442.
These are certainly not equal. The rule of thumb is that the assumption of equal variances is not unreasonable provided the sample variances differ by a factor of less than about 3. The ratio of the larger sample variance to the smaller is
s₂²/s₁² = 0.442/0.268 ≈ 1.65.
This ratio is a lot less than 3, so the assumption of equal variances will be made.
A two-sided test will be carried out, since departures from the null hypothesis in
either direction are of interest. Thus, denoting the underlying mean birth weights
by µ1 and µ2 , the hypotheses are
H0 : µ1 = µ2, H1 : µ1 ≠ µ2. ◾
The next stage in setting up a test is to find a suitable test statistic. It would seem sensible to base the test statistic on the difference between the sample means, D = X̄1 − X̄2. It follows from results on the distribution of a linear combination of independent normally distributed random variables (discussed in Block B) that D also has a normal distribution:
D = X̄1 − X̄2 ∼ N(µ1 − µ2, σ²/n1 + σ²/n2),   (4.1)
where n1 and n2 are the sample sizes. The snag, just as in the single sample case,
is that the variance of D depends on the unknown variance σ2 . Thus, even under
the null hypothesis H0 : µ1 = µ2 , where the mean of D is known to be 0, the
distribution of D is not known. So D cannot be used as a test statistic. When
dealing with a single sample, this difficulty was resolved by calculating a test
statistic that involved the sample standard deviation S and using the
t-distribution. In the present setting, however, there are two distinct (and
independent) estimates of σ2 , one from each sample. Intuitively, it would make
sense to combine them in some way to obtain a single estimate. There are several
ways of doing this, but the optimal combination is given in the following box.
Pooled estimate of the common variance
Given independent samples of size n1 with sample variance s₁² and size n2 with sample variance s₂², from distributions with a common variance σ², the pooled estimate of σ² is
sP² = ((n1 − 1)s₁² + (n2 − 1)s₂²)/(n1 + n2 − 2).   (4.2)
The corresponding random variable is denoted SP².
Note that the pooled estimate of the common variance gives more weight to the estimate from the larger sample.
The following result is a consequence of (4.1):
((X̄1 − X̄2) − (µ1 − µ2)) / (SP√(1/n1 + 1/n2)) ∼ t(n1 + n2 − 2).
(The result does not follow quite directly, but the details are unimportant. Note that it can be used to calculate a confidence interval for µ1 − µ2.)
That is, the random variable on the left has a t-distribution with n1 + n2 − 2 degrees of freedom. Under the null hypothesis H0 : µ1 = µ2, (µ1 − µ2) = 0, so the null distribution of the quantity
T = (X̄1 − X̄2) / (SP√(1/n1 + 1/n2))
is t(n1 + n2 − 2). The statistic T involves only quantities whose values can be calculated from the samples, and it is essentially just a scaled version of the difference between the two sample means. Furthermore, its null distribution is known. Thus it is an appropriate test statistic for the test of equality of means.
The test statistic for a two-sample t-test
The test statistic T for a two-sample t-test is
T = (X̄1 − X̄2) / (SP√(1/n1 + 1/n2)),   (4.3)
where n1 and n2 are the sample sizes and SP² is the pooled estimate of the common variance.
The null distribution of T is t(n1 + n2 − 2).
Example 4.2 continued Infants with SIRDS
For the data on SIRDS, relevant summary statistics are
n1 = 27, x̄1 = 1.692, s₁² = 0.268;
n2 = 23, x̄2 = 2.307, s₂² = 0.442.
Thus the null distribution of the test statistic T is a t-distribution with degrees of freedom n1 + n2 − 2 = 27 + 23 − 2 = 48.
The pooled estimate of the common variance is
sP² = ((n1 − 1)s₁² + (n2 − 1)s₂²)/(n1 + n2 − 2)
    = (26 × 0.268 + 22 × 0.442)/(27 + 23 − 2)
    = 0.34775.
Hence the observed value t of the test statistic T given in (4.3) is
t = (x̄1 − x̄2) / (sP√(1/n1 + 1/n2)) = (1.692 − 2.307) / (√0.34775 × √(1/27 + 1/23)) ≈ −3.68.
The significance probability is p = P(|T| ≥ 3.68), where T ∼ t(48). Figure 4.1 illustrates the area to be calculated to find p.
Figure 4.1  The significance probability for a two-sample t-test
This probability cannot be found exactly from the table of t-quantiles in the Handbook, but the values in the table can give us a good idea of its general size. There is no row in the tables for 48 degrees of freedom, but the 0.999-quantiles of t(45) and of t(50) are given as 3.281 and 3.261, respectively. Thus the 0.999-quantile of t(48) must be between these values, and hence must be less than 3.68, the absolute value of the observed test statistic. Therefore, the significance probability must be less than 2 × (1 − 0.999) — that is, less than 0.002. It is certainly very small! (The significance probability can be calculated using a computer; it is 0.0006.) We conclude that there is strong evidence that the underlying means differ. Moreover, since x̄2 > x̄1, the data indicate that the mean birth weight of children with SIRDS who die is less than that for children with SIRDS who survive. ◾
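The two-sample t-test above can be reproduced from the summary statistics alone. The Python sketch below (using SciPy for the t-distribution; in the course these calculations are done with MINITAB) computes the pooled variance from equation (4.2) and the test statistic from (4.3).

```python
from math import sqrt
from scipy.stats import t as t_dist

# Summary statistics for the SIRDS birth weights
n1, xbar1, s1_sq = 27, 1.692, 0.268   # infants who died
n2, xbar2, s2_sq = 23, 2.307, 0.442   # infants who survived

df = n1 + n2 - 2                                    # 48 degrees of freedom
sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / df  # pooled variance, 0.34775
t_obs = (xbar1 - xbar2) / (sqrt(sp_sq) * sqrt(1 / n1 + 1 / n2))  # about -3.68
p_value = 2 * t_dist.sf(abs(t_obs), df)             # two-sided, about 0.0006
```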
Activity 4.2 Insect joint measurements
Measurements were taken on the widths of the first joints of the second tarsus for two species of the insect Chaetocnema. (Lindsey, J.C., Herzberg, A.M. and Watts, D.G. (1987) A method of cluster analysis based on projections and quantile-quantile plots. Biometrics, 43, 327–341.) Ten insects from each species were sampled, yielding the following sample means and variances:
x̄1 = 128.40, s₁² = 48.93;
x̄2 = 122.80, s₂² = 114.84.
You may assume that the distribution of joint widths in each population is normal.
(a) Check that the assumption of equal variances is reasonable, and calculate the
pooled estimate of the common variance.
(b) Carry out a (two-sided) significance test of the hypothesis that the
underlying mean joint width is the same in the two populations. Interpret
the result of the test.

What should we do if the assumption of equal variances is not justified? There
are alternative tests, one of which is implemented in MINITAB. You might ask
yourself: if there is another test that does not make the restrictive assumption of
equal variances, why do we bother with the test that assumes equal variances?
The answer is that, when the population variances are indeed equal, the test that
assumes equal variances is more ‘powerful’ than any test that does not make this
assumption, particularly when the sample sizes are relatively small. (The power
of a test will be defined in Section 5.) Essentially, this means that, when the
population variances are equal, the p value obtained using the two-sample t-test
that assumes equal variances will be smaller (and thus the evidence against the
null hypothesis will be stronger) than for any test that does not make this
assumption. Thus the two-sample t-test with an assumption of equal variances is
of considerable practical use. Its theory is also considerably easier to describe
than is that of the test that does not assume equal variances. The theory of that
test is beyond the scope of this course and will not be described.
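To give a feel for the comparison, the two tests can be run side by side on the SIRDS summary statistics. This Python sketch (scipy assumed; it is not the course's MINITAB) uses scipy's `ttest_ind_from_stats`, whose `equal_var` flag switches between the pooled test and the alternative that does not assume equal variances.

```python
# Pooled test versus the unequal-variances alternative, on the SIRDS data.
import math
from scipy import stats

n1, xbar1, s1 = 27, 1.692, math.sqrt(0.268)
n2, xbar2, s2 = 23, 2.307, math.sqrt(0.442)

pooled = stats.ttest_ind_from_stats(xbar1, s1, n1, xbar2, s2, n2, equal_var=True)
welch = stats.ttest_ind_from_stats(xbar1, s1, n1, xbar2, s2, n2, equal_var=False)

# Here both p values are small and close to each other; the pooled test's
# p value is the (slightly) smaller of the two.
print(pooled.pvalue, welch.pvalue)
```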

4.3 Performing significance tests using MINITAB


As you have seen, the calculations involved in performing a significance test can
be tedious; they are more easily done on a computer. In this subsection, you will
learn how to use MINITAB to carry out the significance tests described in this
section.

Refer to the relevant chapter of Computer Book C for the work in this subsection.

Summary of Section 4
In this section, you have learned how to carry out significance tests for the
difference between two Bernoulli probabilities and for the difference between the
means of two normal populations. You have also learned how to use MINITAB to
carry out these tests.

Exercises on Section 4
Exercise 4.1 A clinical trial
Patients newly diagnosed with rheumatoid arthritis were recruited into a clinical
trial of a new drug to control the symptoms of the disease. Of these patients, 62
were randomly allocated to the new drug, and 60 to receive standard therapy.
After one year, participants in the trial were categorized as being in remission or
not being in remission. Of the 62 patients on the new drug, 38 were in remission;
and of the 60 on standard treatment, 22 were in remission.
Test the hypothesis that the underlying probabilities of being in remission after a
year are the same for patients on the new drug and for patients on the standard
therapy. (In clinical trials of this general nature, it is normal practice to use
two-sided significance tests, because it is considered important that the analysis
should pick up situations in which the new treatment is actually worse than the
existing standard treatment, as well as those where the new treatment is better.)

Exercise 4.2 Etruscan and Italian skulls


Data were collected on the maximum breadths (in mm) of 84 Etruscan skulls and
70 modern Italian skulls. The question of interest was whether there was a
difference between the underlying means of the two populations. Assume that
normal distributions are adequate for describing the variation in these
populations. Summary statistics for the two samples (labelling the Etruscans as
sample 1) are as follows:
[Source: Barnicot, N.A. and Brothwell, D.R. (1959) The evaluation of metrical data in the comparison of ancient and modern bones, in Wolstenholme, G.E.W. and O’Connor, C.M. (eds) Medical Biology and Etruscan Origins. Little, Brown and Co., USA.]
x1 = 143.77, s1² = 35.65;
x2 = 132.44, s2² = 33.06.
(a) Check that the assumption of equal population variances is reasonable, and
calculate the pooled estimate of the common variance.
(b) Carry out a (two-sided) significance test of the hypothesis that the
underlying mean maximum head breadth is the same in the two populations.
Interpret the result of the test.

5 Fixed-level testing
In Block B, you saw that it is common to calculate confidence intervals at the
predetermined confidence levels of 90%, 95% and 99% (although there is no
particularly compelling reason for using these levels). Similarly, it is common to
perform tests of hypotheses at predetermined significance levels. Such tests are
called fixed-level tests. It is common to choose levels such as 10%, 5% and 1%,
corresponding to the confidence levels just mentioned. In Section 1, you saw that
these tests can be performed simply by using the corresponding confidence
intervals. In Subsection 5.1, a different way of thinking of the testing procedure is
introduced. In many common cases, the way of thinking about the procedure is
the only difference, since the outcomes of the tests will be the same as for the
process using confidence intervals. (This is not true for some testing situations,
including tests involving binomial proportions. However, even in these cases, the
differences between the two methods are usually small, and do not often lead to
differences in conclusions.) The aim of this approach is, as in Section 1, to
develop a decision rule for rejection of a null hypothesis in favour of a stated
alternative hypothesis, at some predetermined significance level. This approach to
hypothesis testing is quite general and can be applied in any context where a
significance test can be used. The procedure for fixed-level testing will be
illustrated for a few of the situations discussed in Sections 2 and 3. Two further
ideas associated with hypothesis testing are discussed briefly in Subsection 5.2.

It has probably occurred to you to ask why there should be so many rather
different approaches to what seems, after all, to be a straightforward problem to
describe. A brief investigation of the historical background of hypothesis testing,
which may help to make some sense of this, is given in Subsection 5.3.
In Subsection 5.4, you will use a SUStats program to develop further your
understanding of the principles of hypothesis testing.

5.1 Performing a fixed-level test


Most of the ingredients for a fixed-level test are the same as for a significance test:
both require a clear statement of the null and alternative hypotheses, a test
statistic and its null distribution, and data. However, in fixed-level testing, as in
the confidence interval approach of Section 1, a significance level is specified in
advance, and a rule is developed for deciding whether or not to reject the null
hypothesis. The fixed-level approach to hypothesis testing is illustrated in
Example 5.1 using the yeast cell data from Table 1.4, which were used to
introduce significance testing in Section 2.

Example 5.1 Yeast cells: a fixed-level test


In Section 2, a significance test was used to test the null hypothesis that µ, the
underlying mean number of yeast cells in a square on a microscope slide, is 0.6
against the alternative hypothesis that µ differs from 0.6. In this example, a
fixed-level test of the same hypotheses is described.
As in the significance test, the null and alternative hypotheses are
H0 : µ = 0.6, H1 : µ ≠ 0.6.
The data to be used are in Table 1.4 and, as before, the test statistic is the sample
mean X, and its null distribution is N (0.6, 0.9021²/400) = N (0.6, 0.00203).
(See Example 2.1.)
The idea of fixed-level testing is to see whether or not the observed value of the
test statistic is consistent with this null distribution. We wish to determine a
precise rule for whether or not to reject the null hypothesis in favour of the
alternative hypothesis. In fixed-level testing, this is done by determining in
advance what values of the test statistic we would regard as extreme. In
Example 1.4, when using the confidence interval approach, a 5% significance level
was used. Suppose that, as in that example, we wish to use a 5% significance
level. Then values less than or equal to the 0.025-quantile of the null distribution
of the test statistic and values greater than or equal to the 0.975-quantile will be
regarded as extreme.

The 0.025-quantile of N (0.6, 0.9021²/400) is
q0.025 = 0.6 − 1.96 × 0.9021/√400 ≈ 0.5116,
and the 0.975-quantile is
q0.975 = 0.6 + 1.96 × 0.9021/√400 ≈ 0.6884.
The null distribution of X is shown in Figure 5.1, together with its 0.025-quantile
and 0.975-quantile.
The decision rule for the test is as follows. If the observed value x of the test
statistic X is so small (that is, less than or equal to 0.5116) or so large (that is,
greater than or equal to 0.6884) that it indicates a large departure from the null
hypothesis, in the sense that the observed value is in the set of extreme values of
X that have a probability of only 0.05, then H0 will be rejected in favour of H1 at
the 5% significance level. The shaded region in Figure 5.2 shows the set of values
of the test statistic for which H0 will be rejected. This set is called the rejection
region.
[Figure 5.1: The 0.025-quantile and the 0.975-quantile of the null distribution of X]
The penultimate stage of the test is to calculate the observed value x of the test
statistic. For these data, x = 0.6825. (See Example 1.4.) This value is (just)
outside the rejection region. (This is shown in Figure 5.2.)
Finally, the conclusions of the test must be stated: there is insufficient evidence at
the 5% significance level to reject the null hypothesis that the underlying mean
number of yeast cells per square is equal to 0.6. In other words, in the light of
these data, it remains plausible that the mean number of yeast cells per square is
indeed 0.6. �
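The rejection region just described is easy to verify numerically. This Python sketch (scipy assumed; the course itself uses MINITAB) computes the two quantiles of the null distribution of the sample mean for the yeast cell data and checks whether the observed value falls in the rejection region.

```python
# Rejection region for the yeast cell test of Example 5.1, sketched in Python.
import math
from scipy import stats

mu0, sigma, n = 0.6, 0.9021, 400
scale = sigma / math.sqrt(n)        # standard deviation of the sample mean under H0

q_lo = stats.norm.ppf(0.025, loc=mu0, scale=scale)   # 0.025-quantile
q_hi = stats.norm.ppf(0.975, loc=mu0, scale=scale)   # 0.975-quantile

x_obs = 0.6825
reject = (x_obs <= q_lo) or (x_obs >= q_hi)          # is x in the rejection region?
print(round(q_lo, 4), round(q_hi, 4), reject)        # 0.5116 0.6884 False
```

As in the example, the observed mean 0.6825 lies (just) inside the interval (0.5116, 0.6884), so H0 is not rejected at the 5% level.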

There are two points about what has been done in Example 5.1 that need further
comment.
[Figure 5.2: The rejection region and the observed value of the test statistic X]
First, as you saw in Section 2, Z = (X − 0.6)/√0.00203 could be used as the test
statistic instead of X; its null distribution is N (0, 1). In this case, the observed
value of Z is (0.6825 − 0.6)/√0.00203 ≈ 1.831. The 0.025-quantile and the
0.975-quantile of the standard normal distribution are −1.96 and 1.96, so the
rejection region consists of values of Z greater than or equal to 1.96 or less than
or equal to −1.96. The conclusion is as before: since 1.831 is not in the rejection
region, there is insufficient evidence at the 5% significance level to reject the null
hypothesis.
Secondly, note that the conclusion in Example 5.1 is the same as that in
Example 1.4, where the same hypothesis was tested, using the same data, but
using an approach based on confidence intervals. It might have been somewhat
worrying if the conclusions had been different. The fact that they must be the
same may be explained as follows.
In Example 5.1, the limit of the part of the rejection region in the upper tail of
the null distribution of X is just its 0.975-quantile, which was calculated as
q0.975 = 0.6 + 1.96 × 0.9021/√400. That means we would reject the null
hypothesis, using this part of the rejection region, if the observed value x of the
sample mean satisfies x ≥ 0.6 + 1.96 × 0.9021/√400. This will happen if
x − 1.96 × 0.9021/√400 ≥ 0.6.
However, in Example 1.4 you saw that the lower limit of the 95% confidence
interval for the population mean µ is x − 1.96 × 0.9021/√400. If
x − 1.96 × 0.9021/√400 ≥ 0.6, then the hypothesized value (0.6) of the population
mean will be on or below this lower limit, and hence outside the confidence
interval. (The confidence interval includes all values in between, but not equal to,
the confidence limits.) Thus the test statistic will fall in the upper tail of the
rejection region exactly when the hypothesized value of the population mean falls
on or below the lower limit of the confidence interval. Similarly, the test statistic
will fall in the lower tail of the rejection region exactly when the hypothesized
value of the population mean falls on or above the upper limit of the confidence
interval.

Putting these two facts together, the test statistic falls in the rejection region
exactly when the hypothesized value (0.6) of the population mean falls outside the
confidence interval for the population mean.
Thus, in circumstances like this, the results of the confidence interval approach
and the fixed-level approach involving the null distribution of a test statistic have
to be the same. (There are circumstances in which the two methods do not
coincide in this way.)
Before summarizing the procedure for fixed-level testing, there is a general point
about the hypotheses of a test that should be emphasized. It is important to note
that the hypothesis testing procedure does not treat the two hypotheses (null and
alternative) on an equal footing. The test is performed by finding out whether the
data provide enough evidence against the null hypothesis for it to be
rejected (at the chosen significance level). If the observed value of the test
statistic falls inside the rejection region, which consists of those values that are
least likely if the null hypothesis is true, then the null hypothesis is rejected in
favour of the alternative hypothesis. But if the test statistic falls outside the
rejection region, this means that there is not enough evidence to reject the null
hypothesis, which is not the same as saying that we accept that the null
hypothesis must be true. The situation is in some ways like the procedure in
criminal courts in countries with legal systems similar to those in England and the
USA. There, the initial assumption is that the accused person is innocent
(corresponding to the null hypothesis) and, to secure a conviction, the prosecution
have to provide evidence that proves beyond reasonable doubt that this
assumption is wrong and must be rejected. If they cannot provide such evidence
and proof, the ‘null hypothesis’ of innocence cannot be rejected. In many cases,
this will indeed be because the accused really is innocent of the crime; in other
cases, the position may be that the accused actually committed the crime but the
evidence was insufficient to prove this.
The strategy for a fixed-level test may be summarized as in the following box.

Procedure for fixed-level testing


1 Determine the null hypothesis H0 and the alternative hypothesis H1 .
2 Decide what data to collect that will be informative for the test.
3 Determine a suitable test statistic and the null distribution of the test
statistic (that is, the distribution of the test statistic when H0 is true).
4 Choose an appropriate significance level for the test (usually 1%, 5% or
10%), if one has not been specified.
5 Use the significance level of the test to determine the rejection region of
the test. For this, you will need to calculate quantiles of the null
distribution.
6 Collect the data and calculate the observed value of the test statistic for
the sample.
7 By determining whether or not the observed value of the test statistic
lies in the rejection region, decide whether or not to reject the null
hypothesis in favour of the alternative hypothesis.
8 Report the conclusion to be drawn from the test clearly.

This procedure is used in Example 5.2 and Activity 5.1 for a data set from
Section 1.
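Steps 5 and 7 of the boxed procedure are purely mechanical, and can be sketched in code. The following Python function (scipy assumed; this is an illustration, not part of the course software) carries out a two-sided fixed-level test for a statistic with a t(df) null distribution; the values in the usage example are illustrative, not taken from the course data.

```python
# A two-sided fixed-level t-test: steps 5 and 7 of the boxed procedure.
from scipy import stats

def fixed_level_t_test(t_obs, df, level=0.05):
    """Return True if H0 is rejected at the given significance level."""
    # Step 5: the rejection region is |T| >= the (1 - level/2)-quantile of t(df).
    q = stats.t.ppf(1 - level / 2, df)
    # Step 7: reject H0 exactly when the observed value lies in that region.
    return bool(abs(t_obs) >= q)

# Illustrative values: an observed t of 2.5 on t(12), at the 5% level.
print(fixed_level_t_test(2.5, df=12, level=0.05))   # True
```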

Example 5.2 That kitchen timer again


In Example 1.2, a hypothesis test was carried out using the confidence interval
approach on data recorded by timing a kitchen timer with a stop-watch. Ten
times (in seconds), which were measured when the timer was set to five minutes
(300 seconds), were used to test the null hypothesis H0 : µ = 300 against the
alternative hypothesis H1 : µ ≠ 300. You performed a significance test of these
hypotheses in Activity 3.2. The process will be repeated using the fixed-level
testing procedure that has just been summarized. The same hypotheses will be
used as before (step 1), and the values in Table 1.3 will be used for data (step 2).
The next step (step 3) is to decide on the test statistic and find its null
distribution. As in Activity 3.2, an appropriate test statistic is
T = (X − 300)/(S/√n) = (X − 300)/(S/√10),
and its null distribution is t(9). �

Activity 5.1 Completing the test


Complete the test that has been started in Example 5.2. Use a 1% significance
level (step 4). (One of the significance levels used in Example 1.2 was 1%.)
(a) Step 5: Find appropriate quantiles of t(9), the null distribution of T , and use
them to define a rejection region.
The observed value of the test statistic, which you calculated in Activity 3.2, is
−9.272 (step 6).
(b) Step 7: Decide whether or not the null hypothesis can be rejected in favour of
the alternative hypothesis, at the 1% significance level.
(c) Step 8: State your conclusions clearly.

Activity 5.2 Shoshoni rectangles


In Example 3.1, data on the width-to-length ratios of 20 beaded rectangles used
to decorate leather goods by the Shoshoni were used in a significance test of the
null hypothesis that µ, the mean width-to-length ratio, is 0.618 against the
alternative hypothesis that it is different from 0.618:
H0 : µ = 0.618, H1 : µ ≠ 0.618.
(The observed sample mean is x = 0.6605 and the sample standard deviation is
s = 0.0925.)
Perform a fixed-level test of the same hypotheses using a 5% significance level.

Significance tests and fixed-level tests


In most situations, the conclusion of a fixed-level hypothesis test can be deduced,
at any fixed level, from the p value for the corresponding significance test. For
instance, for the Shoshoni rectangles data, you saw in Example 3.1 that the
p value for the significance test is 0.0539. Thus, since the observed value of the
test statistic is 2.055, 5.39% of the probability in the null distribution lies on or
outside the values −2.055 and +2.055. So the probability in each tail of extreme
values is approximately 0.027. In a fixed-level test using a 5% significance level,
the probability in each tail of the rejection region is 0.025. Therefore, the
boundaries of the rejection region must be further out than −2.055 and +2.055,
and hence the observed value t = 2.055 does not lie inside the rejection region for
a fixed-level test using a 5% significance level. So, in a fixed-level test, H0 would
not be rejected at the 5% significance level (as you know from Activity 5.2). By a
similar argument, H0 would not be rejected at the 1% significance level, but it
would be rejected in a fixed-level test with a 10% significance level. More
generally, H0 would not be rejected in any fixed-level test with level smaller than
the significance probability of 0.0539, but it would be rejected in a fixed-level test
with level greater than 0.0539.
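The relationship between the p value and the outcomes of fixed-level tests can be checked directly. This Python sketch (scipy assumed; not the course's MINITAB) computes the two-sided p value for the Shoshoni test statistic, t = 2.055 on t(19), and compares it with each of the usual fixed levels.

```python
# p value versus fixed significance levels, for the Shoshoni rectangles test.
from scipy import stats

t_obs, df = 2.055, 19
p = 2 * stats.t.sf(abs(t_obs), df)   # two-sided significance probability

for level in (0.10, 0.05, 0.01):
    # H0 is rejected at a fixed level exactly when that level exceeds p.
    print(level, level > p)
```

Running this confirms the text: H0 is rejected at the 10% level but not at the 5% or 1% levels, because p (about 0.054) lies between 0.05 and 0.10.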

Activity 5.3 Significance probabilities and significance levels


(a) Suppose that the p value for a significance test is 0.083. Say whether or not
the null hypothesis would be rejected in a fixed-level test of the same
hypotheses using the same data if the significance level of the test is (i) 10%
(ii) 5% (iii) 1%.
(b) Suppose that in a fixed-level test, the null hypothesis H0 is rejected at the
1% significance level. What can you say about the p value for the
corresponding significance test using the same data?

So far, fixed-level testing has been discussed only for situations where the
alternative hypothesis is two-sided. A one-sided fixed-level test is described in
Example 5.3.

Example 5.3 Prolonging sleep: a one-sided test


In Example 3.2, data on the sleep gains (measured in hours) for ten individuals
when they took the drug L-hyoscyamine hydrobromide were used in a significance
test of the null hypothesis that the drug makes no difference to the duration of
sleep against the alternative hypothesis that the drug prolongs sleep. The
hypotheses for the test were
H0 : µ = 0, H1 : µ > 0 (step 1),
where the parameter µ is the mean underlying sleep gain of patients given
L-hyoscyamine hydrobromide.
The data used are in Table 3.2 (step 2).

The test statistic used was
T = D/(S/√10),
where D is the sample mean sleep gain, and S is the sample standard deviation;
its null distribution is t(9) (step 3).
The implication of setting up the alternative hypothesis as above is that only
large values of the test statistic will be treated as providing evidence against the
null hypothesis (and in favour of the alternative hypothesis). That is, negative
values of the sample mean will be considered simply as arising from sampling
variability, and not from some effect of this drug to reduce sleep. Negative values
of the test statistic arise when the sample mean is negative, so they cannot be
evidence in favour of the hypothesis that the underlying mean is positive.
Therefore the rejection region for the test must consist only of positive values, in
contrast to the situation with a two-sided test.
Suppose that a 10% significance level is chosen for the test. Then, under the null
hypothesis, the probability that the observed value of the test statistic falls in the
rejection region must be equal to 0.10. Thus the relevant quantile of t(9) is the
0.9-quantile: q0.90 = 1.383; and the rejection region is the shaded area shown in
Figure 5.3 (steps 4 and 5).

Figure 5.3 The rejection region for a one-sided test with significance level 0.10

Next, the observed value of the test statistic is calculated (step 6). The sample
standard deviation of the observed differences is s = 2.002 and the sample mean is d = 2.33
(from Example 3.2), so the observed value of the test statistic is
d/(s/√10) = 2.33/(2.002/√10) ≈ 3.68.
The observed value of the test statistic, t = 3.68, is well inside the rejection
region, so the null hypothesis is rejected in favour of the alternative hypothesis at
the 10% significance level (steps 7 and 8). We conclude that, on average, the drug
L-hyoscyamine hydrobromide does prolong sleep. �
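The one-sided calculation in Example 5.3 can be checked in a few lines. This Python sketch (scipy assumed; not the course's MINITAB) computes the observed test statistic and the single 0.9-quantile of t(9) that bounds the one-sided rejection region.

```python
# One-sided fixed-level test for the sleep-gain data of Example 5.3.
import math
from scipy import stats

n, dbar, s = 10, 2.33, 2.002
t_obs = dbar / (s / math.sqrt(n))       # observed value of T

q = stats.t.ppf(0.90, df=n - 1)         # 0.9-quantile of t(9); rejection region is T >= q
print(round(t_obs, 2), round(q, 3), t_obs >= q)   # 3.68 1.383 True
```

Note that the rejection region consists of one tail only: had the alternative been two-sided at the same 10% level, the boundary would instead have been the 0.95-quantile.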

Activity 5.4 Another hypnotic substance?


In Activity 3.5, you used data on the sleep gains (measured in hours) of ten
patients when given the drug D-hyoscyamine hydrobromide to perform a
one-sided significance test of the hypothesis that the drug has no effect on sleep,
against the alternative hypothesis that, on average, it leads to a net sleep gain.
(The data are in Table 3.3; the sample mean is 0.75 and the sample standard
deviation is 1.789.)
Perform a fixed-level test of the same hypotheses. Follow the procedure for
fixed-level testing, and use a 5% significance level.

In Example 5.3, you have seen how to carry out a one-sided fixed-level test. In
fact, in most areas of application of statistics, one-sided tests are not used very
often. There are several reasons for this. First, situations where departures from
the null hypothesis in one particular direction are of no interest, or can
realistically be assumed not to be possible, are not particularly common in
practice. For instance, in a drug testing scenario, we may be confident in advance
that a new drug is on average at least as good as no treatment at all, and this
might lead us to propose a one-sided test, where the null hypothesis is that the
new drug performs exactly as well (on average) as no treatment, and the
alternative hypothesis is that the new drug is better. But what would happen if,
when the data were collected, the new drug turned out to do much worse (on
average) than no treatment? The null hypothesis could not be rejected; there
would be no possibility of concluding that the new drug could actually be worse
than useless. In general, it would be important to use a procedure that allowed
this conclusion as a possibility. Secondly, the two ‘tails’ involved in the rejection
region of a two-sided test usually contain (exactly or approximately) equal
probabilities. This means that, if the observed value of the test statistic falls just
inside the rejection region at the 10% significance level on a two-sided test, it will
fall inside the rejection region at the 5% significance level on the corresponding
one-sided test. A brief glance at these results might lead us to conclude that the
one-sided test provides stronger evidence against the null hypothesis than the
two-sided test, even though the data are the same in both cases. In general,
statisticians wish to avoid making the evidence in their data look stronger than it
really is, and in many situations this issue would lead a statistician to prefer a
two-sided test.

5.2 A few comments


Interpreting the significance level
You have seen that the rejection region for a fixed-level hypothesis test is defined
by identifying those values of the test statistic that, under the null hypothesis,
would be most extreme. It constitutes a summary of those results that would
appear to be so inconsistent with the null hypothesis that the null hypothesis is
rejected. But, of course, from the definition of the rejection region, its boundaries
are calculated using the null distribution, which is the distribution of the test
statistic when the null hypothesis H0 is true. So we have
significance level = α = P (rejecting H0 , given that H0 is true).
The act of rejecting H0 when H0 is true is called a Type I error. By convention,
acceptable values for α, the probability of a Type I error, are 1%, 5% and 10%;
occasionally smaller values or different values are acceptable, but not usually
values greater than 10%. A Type I error is an error which, in the nature of things,
the person carrying out the test will not know has been committed. It can occur
only if the null hypothesis has been rejected, but then the person carrying out the
test cannot know whether the correct conclusion has been reached or a Type I
error has occurred. But notice that its probability α is entirely within the control
of the designer of the test.

The power of a test


There is another sort of error that can occur in hypothesis testing: the person
carrying out the test fails to reject the null hypothesis H0 even though it happens
to be false. This outcome of the testing scenario is called a Type II error. If the
null hypothesis is not rejected, the person carrying out the test cannot know
whether this conclusion is correct or whether a Type II error has occurred
instead. Moreover, if the test designer proceeds by fixing the level of the test in
advance, and therefore defining the rejection region, the probability of a Type II
error is determined as a consequence, and thus is not explicitly controlled. Indeed
(other things being equal) the smaller the rejection region, the less likely a Type I
error will be, but the more likely it will be that a Type II error could be made.
In many circumstances, the usefulness of a particular hypothesis test is most aptly
measured by the following probability, which is called the power of the test:
P (rejecting H0 , given that H0 is false).
This is simply one minus the probability of a Type II error; in other words, it is
the probability of avoiding that error.
Other things being equal, there is a trade-off between the power and the
significance level of a test. If the power is increased (which is desirable) by making
the rejection region larger, then this will increase the significance level (which is
undesirable). If other things are not equal, the position will be different. One way
of simultaneously increasing the power and reducing the significance level is to
increase the sample size. In situations where the sample size is under the control
of the person gathering the data, calculations involving the power can throw light
on what an appropriate sample size would be; this topic is discussed in Section 6.
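The trade-offs just described can be made concrete for the yeast cell test of Example 5.1. The following Python sketch (scipy assumed; not the course's MINITAB) computes the power of that two-sided test against a hypothetical true mean; the value 0.7 used below is an illustrative assumption, not a value from the course, and the sketch shows how a larger sample size raises the power.

```python
# Power of the two-sided fixed-level test of Example 5.1, for a given true mean.
import math
from scipy import stats

def power(n, mu_true, mu0=0.6, sigma=0.9021, level=0.05):
    """P(reject H0: mu = mu0 | true mean is mu_true), for the test on the sample mean."""
    scale = sigma / math.sqrt(n)
    # Rejection region under H0, as in Example 5.1.
    q_lo = stats.norm.ppf(level / 2, loc=mu0, scale=scale)
    q_hi = stats.norm.ppf(1 - level / 2, loc=mu0, scale=scale)
    # Probability, under the true mean, of landing in the rejection region.
    return stats.norm.cdf(q_lo, loc=mu_true, scale=scale) + \
           stats.norm.sf(q_hi, loc=mu_true, scale=scale)

# Hypothetical true mean 0.7: quadrupling the sample size raises the power.
print(power(400, 0.7), power(1600, 0.7))
```

Setting `mu_true = mu0` in the same function returns the significance level itself, illustrating that the power function evaluated at the null value equals the probability of a Type I error.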

5.3 Fisher, Pearson and Neyman


The twentieth century was enlivened by a number of philosophical disputes among
statistical practitioners. One of the more hotly argued is that between the British
statistician and geneticist, Ronald Aylmer Fisher (1890–1962), and the duo of
Egon Pearson (1895–1980) and Jerzy Neyman (1894–1981).
At the age of 22, on graduation from Cambridge University, Fisher worked for
three years as a statistician in London and then until 1919 as a schoolteacher (and
not a good one, according to contemporary sources). From 1919 until 1933, he
worked at Rothamsted Experimental Station, the agricultural research
establishment near Harpenden in Hertfordshire, England; in 1925, he published
the first edition of the famous text entitled Statistical Methods for Research
Workers. After leaving Rothamsted he was Professor of Eugenics at University
College London until 1943, after which he was appointed Professor of Genetics at
Cambridge. His papers on theoretical statistics form the foundation of much of
modern statistics, and many of his methods are used worldwide to this day.
[Figure 5.4: R.A. Fisher]
Egon Pearson was the son of Karl Pearson (1857–1936), arguably the founder of
modern statistics. Egon worked in the Department of Applied Statistics at
University College London (headed by his father) from 1921. In 1933, on Karl’s
retirement, he took over the chair of the department, which he headed until his
own retirement in 1960. During the key period in the dispute between Fisher and
Neyman and Pearson, their departments occupied different floors of the same
building at University College.
Jerzy Neyman was born in Bendery near the border between Russia and
Romania. He was educated at the University of Kharkov in the Ukraine and
lectured there until going to live in Poland in 1921. He was a lecturer at the
University of Warsaw when in 1925 he visited London and met Egon Pearson.
The pair, much of an age, struck up an immediate and close personal and
professional relationship.
In 1933, Neyman and Pearson published a paper ‘On the Problem of the Most
Efficient Tests of Statistical Hypotheses’ in the Philosophical Transactions of the
[Figure 5.5: Egon Pearson]
Royal Society, Series A, 231, 289–337. Their approach to hypothesis testing is
basically the fixed-level approach of Subsection 5.1. Essentially, their work was
generated by concern that there should be some criterion other than intuition to
provide a guide to what test statistic to utilize in performing a hypothesis test,
and this in turn implied the strict requirement for an alternative hypothesis.
In many cases, however, a statistical test is used more or less to assess the data,
and not necessarily to reach any firm conclusion, at least not without
consideration of other relevant matters. This is the idea behind the significance
testing approach, and seems to have been the attitude of Fisher, who was thinking
of scientific research situations and not of cases where the background of the
problem requires a clear decision. Fisher’s approach corresponds in most respects
to the approach described in Sections 2, 3 and 4. However, he would not have
agreed with everything you have read there, and certainly not with the mentioning
of pre-specified alternative hypotheses in the context of significance testing.
Fisher’s approach requires three components: a null distribution for the test
statistic; an ordering of all possible observations of the test statistic according to
their degree of support for the null hypothesis; and a measure of deviation from
the null distribution as the chance that anything even more extreme is observed.
[Figure 5.6: Jerzy Neyman]

Within repeated experiments, the idea of outcomes more discordant with a null
hypothesis than others is fairly clear. However, with different experiments or
when using different test statistics, it is not at all clear whether a significance
probability as an absolute measure of accord with the null hypothesis — one that
can be compared across experiments — is a useful notion. This is an important
criticism of Fisher’s approach.
The approach of Neyman and Pearson offers an alternative, but some key
concepts were always rejected by Fisher. Among other things, he considered the
use of a pre-specified alternative hypothesis to be inappropriate for scientific
investigations. He maintained that the fixed-level approach was that of mere
mathematicians, without experience in the natural sciences. As well as the subtle
and irreconcilable philosophical and theoretical incompatibilities between the two
approaches, there is no doubt that the controversy was also fuelled by personal
antipathies. Peters (1987) writes, ‘Fisher was a fighter rather than a modest and
charitable academic.’
Peters, W.S. (1987) Counting for Something — Statistical Principles and
Personalities, Springer-Verlag, New York.
The heat has gone out of this controversy now, and most statisticians who test
hypotheses use approaches that to some extent compromise between the Fisher
approach and that of Neyman and Pearson. For example, they typically calculate
significance probabilities (Fisher) but refer to alternative hypotheses
(Neyman–Pearson). But differences between the approaches do remain, and these
can still blur what is really going on when a hypothesis is being tested.

5.4 Exploring the principles of hypothesis testing


The work in this subsection consists of working through a chapter of Computer
Book C, in which you will use a SUStats program to confirm and extend your
understanding of the principles behind hypothesis testing.

[Refer to the relevant chapter of Computer Book C for the work in this subsection.]

Summary of Section 5
In this section, a general procedure for carrying out a fixed-level test of a
statistical hypothesis has been described. You have seen how to interpret the
significance level of a test as the probability of the test statistic being in the
rejection region, according to the null distribution of the test statistic. The ideas
of Type I error and Type II error, and the power of a test, have been introduced.
You have also read a little about the history of testing statistical hypotheses. And
you have used a SUStats program to enhance your understanding of the principles
behind hypothesis testing.

Exercise on Section 5
Exercise 5.1 Differences in plant heights
In Exercise 3.1, you performed a significance test on the differences in height
between the progeny of cross-fertilized and self-fertilized plants. The data are
reproduced in Table 5.1. The units of measurement are eighths of an inch.
As in Exercise 3.1, suppose that the observed differences di, i = 1, 2, . . . , 15, are
independent observations on a normally distributed random variable D with mean
µ and variance σ².
Perform a two-sided fixed-level test of the hypothesis that there is no difference
(on average) between the heights of progeny of cross-fertilized and self-fertilized
plants. Use a 10% significance level.

Table 5.1 Difference in plant height (1/8 inch)
Pair  Difference
  1       49
  2      −67
  3        8
  4       16
  5        6
  6       23
  7       28
  8       41
  9       14
 10       29
 11       56
 12       24
 13       75
 14       60
 15      −48

6 Power, and choosing sample sizes


The power of a test was defined in Subsection 5.2; it is the probability that the
null hypothesis is rejected when, in fact, it is false. Essentially, this concept makes
sense only when there is an explicit rule for rejecting the null hypothesis. Since
this is not the case for significance testing, power will be discussed within the
framework of fixed-level testing. Certain aspects of the power of a statistical test
will be investigated; and you will see how these considerations are related to the
problem of planning a statistical investigation before any data have been collected.
Considerations of power are important in planning an investigation because it
makes very little sense to spend time and money on gathering data that are
unlikely to lead to the null hypothesis being rejected even when it is false: a study
that does not have reasonable power may not be worth carrying out. On the other
hand, if a planned study has extremely high power, it might be a better use of
resources to reduce the sample sizes and spend the resulting savings on something
else. Thus, in increasingly many circumstances, institutions responsible for
approving and funding research require appropriate statistical power
calculations as part of any research plan they consider.
Although the principles involved are reasonably straightforward, some of the
details of the calculations can be complicated and are best left to a computer.
Therefore, the details of only one particularly simple situation are considered in
this section. Further (more realistic) situations are dealt with in the associated
computer book chapters.
Power calculations are discussed briefly in Subsection 6.1, and their use to plan
the sample size of a study is described in Subsection 6.2. The work in
Subsection 6.3 consists of two chapters of Computer Book C.

6.1 Calculating the power of a test


In this subsection, power calculations will be illustrated for the situation where
the data may be adequately modelled by a normal distribution and where it is
assumed that the population standard deviation σ is known.

Example 6.1 More sleep?


In Activity 5.4, you carried out a t-test using data on sleep gains (in hours) for
ten patients who had had their length of sleep measured with and without taking
the drug D-hyoscyamine hydrobromide. A one-sided test of the null hypothesis
H0 : µ = 0 against the alternative hypothesis H1 : µ > 0 was performed (where µ
is the underlying mean sleep gain with the drug). The data were assumed to come
from a normal distribution. The sample mean (0.75) and sample standard
deviation (1.789) were used to calculate the observed value t = 1.33 of the test
statistic T , and as a result of comparing this value with the null distribution of T ,
which is t(9), the null hypothesis was not rejected at the 5% significance level.
(Under the null hypothesis that µ = 0, T = D/(S/√n) ∼ t(n − 1), where D is the
sample mean sleep gain, S is the sample standard deviation and n is the sample
size.)
Let us briefly consider again what exactly is going on here. There was not enough
evidence to reject the null hypothesis that the underlying mean sleep gain was
zero. But this does not establish that the underlying mean is zero. It establishes
only that zero is a plausible value for this mean. There are other plausible values
as well.
sample size.
Suppose that, in fact, the underlying mean µ takes the value 0.5. That is, the
alternative hypothesis is true (though, of course, 0.5 is not the only possible value
of µ included in the alternative hypothesis). In this case, how likely would it be
that the test would come to the correct conclusion (that is, reject H0 )? That is,
what is the power of the test when the mean µ takes the value 0.5? In general,
the power is a little tricky to work out — but we can find an approximate value by
assuming, temporarily, that we know the value of the population standard
deviation σ. Let us assume that σ is actually equal to the sample standard
deviation, 1.789. Then the test statistic would be
T = D/(σ/√n) = D/(1.789/√10).
The next stage is to consider the null distribution of T — that is, the distribution
of T when µ = 0. In a real application, T would have a t-distribution. But here
we are pretending that we know the underlying standard deviation. Under these
circumstances, the null distribution of T would be standard normal. For a
fixed-level test using a 5% significance level, the null hypothesis would be rejected
if the observed value of the test statistic were greater than or equal to q0.95, the
0.95-quantile of N(0, 1), which is 1.645. The null distribution of T and the
rejection region are shown in Figure 6.1.
Figure 6.1 The null distribution and rejection region for a test
But what is the power of the test if µ = 0.5? How likely is it that the test would
reject the null hypothesis if µ = 0.5? To answer this question, we need to know
the distribution of the sample mean D when µ = 0.5. In general, whatever the
value of µ, the distribution of the sample mean D is normal: D ∼ N (µ, σ2 /n).
Since the distribution of D is normal, any linear function of D is also normally
distributed. In particular, T = D/(σ/√n) is normal:
T ∼ N(µ/(σ/√n), 1).
(The following results from Block B have also been used here. For any random
variable X and constant a, E(aX) = aE(X) and V(aX) = a²V(X).)

So, in particular, if µ = 0.5, then
T ∼ N(0.5/(σ/√n), 1);
that is,
T ∼ N(0.5/(1.789/√10), 1) = N(0.884, 1);
and the probability that H0 is rejected if µ = 0.5 is just P(T ≥ q0.95), where q0.95
is the 0.95-quantile of the standard normal distribution. (This probability is
illustrated in Figure 6.2.)
Figure 6.2 The power of the test when µ = 0.5
So the power of the test when µ = 0.5 is
P(T ≥ q0.95) = P(T − 0.884 ≥ q0.95 − 0.884)
= P(Z ≥ q0.95 − 0.884)
= P(Z ≥ 1.645 − 0.884)
= P(Z ≥ 0.761)
≈ 0.223.
(Here Z ∼ N(0, 1), and the probability is obtained from the table of probabilities
for the standard normal distribution, using linear interpolation.)
This probability is not very high! If the alternative hypothesis were true and the
underlying mean were actually 0.5, this test with this sample size would not be
very likely to come to the correct conclusion by rejecting the null hypothesis. �
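The arithmetic in Example 6.1 can be checked numerically. The following sketch (in Python rather than the course's MINITAB, and with a function name of our own choosing) implements the power calculation for the one-sided test with σ treated as known:

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # the standard normal distribution N(0, 1)

def power_one_sided(d, sigma, n, alpha=0.05):
    """Power of the one-sided fixed-level test of H0: mu = mu0 against
    H1: mu > mu0 when the true mean is mu0 + d and the population
    standard deviation sigma is treated as known."""
    q = Z.inv_cdf(1 - alpha)  # q_{1-alpha}, e.g. 1.645 for alpha = 0.05
    return 1 - Z.cdf(q - d / (sigma / sqrt(n)))

# The sleep-gain example: d = 0.5, sigma = 1.789, n = 10
print(round(power_one_sided(0.5, 1.789, 10), 3))  # prints 0.223
```

With a 1% significance level instead, the same function gives approximately 0.075, matching the calculation later in this subsection.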

More generally, suppose that the data are distributed as N(µ, σ²), where σ is
known, and the test statistic T = (X − µ0)/(σ/√n) is to be used to test the null
hypothesis H0: µ = µ0 against the alternative hypothesis H1: µ > µ0, with a
significance level α. Then, using an argument similar to that in Example 6.1, it
can be shown that the power of the test when the true value of the underlying
mean is µ0 + d is given by
P(Z ≥ q1−α − d/(σ/√n)),    (6.1)
where Z ∼ N(0, 1). If the corresponding two-sided test is performed, and d is
positive, the power will be approximately
P(Z ≥ q1−α/2 − d/(σ/√n)).    (6.2)
(Expression (6.2) ignores the probability that the null hypothesis is rejected due
to a negative value of the test statistic, even though the hypothesized value of the
underlying mean is actually bigger than the value in the null hypothesis. This
probability can be ignored because it is typically very small.)
What happens to the power of the test when the significance level of the test is
changed? An example is helpful here. In Example 6.1 you saw that, for a
significance level of 0.05, the rejection region consists of values of 1.645 or larger
and the power of the test when µ = 0.5 is 0.223. For a significance level of 0.01,
the rejection region consists of values at least as large as q0.99, the 0.99-quantile of
N(0, 1), which is 2.326. So, in this case, the power of the test when µ is 0.5 is
given by (6.1):
P(Z ≥ q1−α − d/(σ/√n)) = P(Z ≥ 2.326 − 0.884)
= P(Z ≥ 1.442)
≈ 0.075.

These results are illustrated in Figures 6.3 and 6.4. Figure 6.3 shows the rejection
regions corresponding to significance levels of 0.05 and 0.01; the power of the test
when µ is 0.5 is represented in Figure 6.4 for each of these significance levels.

Figure 6.3 Rejection regions (a) α = 0.05 (b) α = 0.01

Figure 6.4 The power when µ is 0.5 (a) α = 0.05 (b) α = 0.01

As you can see, reducing the significance level has made the rejection region
smaller and hence, when µ is 0.5, it has also made the power of the test smaller.
In general, when α is made smaller, q1−α becomes larger, so q1−α − d/(σ/√n)
becomes larger. Hence the power
P(Z ≥ q1−α − d/(σ/√n))
becomes smaller.
becomes smaller. That is, when the significance level is made smaller, the power
of the test decreases.

Activity 6.1
By examining Expressions (6.1) and (6.2) for the power of the test, say what
happens to the power of the test in each of the following cases.
(a) The difference d (between the mean in the null hypothesis and the mean for
which the power is being calculated) is increased.
(b) The population standard deviation σ is increased.
(c) The sample size n is increased.

Intuitively, the results of Activity 6.1 seem reasonable. If the difference between
the two hypothesized values of the underlying mean is bigger, then, other things
being equal, the test will be more likely to distinguish between them and come to
the right conclusion. If the population standard deviation increases, there is more
variability in the data, and the test will be less likely to ‘see through’ this
variability and come to the correct conclusion. Finally, increasing the sample size
will increase the information obtained from the test and thus make it more likely
that the correct conclusion is reached.

Activity 6.2
A psychologist wishes to investigate the IQ of a certain specific population. The
aim is to investigate whether the mean IQ in this population could plausibly be
equal to that in the population of the UK as a whole. For the IQ test that the
psychologist will use, the mean score for the general UK population is 100, and
the standard deviation of scores is 15. The psychologist intends to take a sample
of 80 individuals from the specific population and measure their IQs using this
test. It is thought likely that the standard deviation of IQ scores in the specific
population is the same as that in the general UK population (though, of course,
nobody can be sure until the data have been collected). The psychologist will test
the null hypothesis that the mean IQ in the specific population is 100, using a
two-sided test based on the normal distribution, at significance level 0.05.
Suppose the actual mean IQ score in the specific population were 105. What is
the probability that the psychologist will reject the null hypothesis?

Expressions (6.1) and (6.2) above for calculating the power are, strictly, only valid
if the population standard deviation is known. In practice, in the great majority
of cases, it will not be known, and the population standard deviation will be
replaced by the sample standard deviation and a t-test performed. If the sample
size is fairly large, these expressions will still give reasonable approximations to
the power of the test; but if the sample size is small, they will not. More
complicated calculations, based on the same principles, can give accurate values
for the power of t-tests. But you are spared the details: in Subsection 6.3, you will
use your computer for such calculations, and also for similar calculations in the
case of other tests.
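The claim that Expressions (6.1) and (6.2) still approximate the power of a t-test reasonably well can be explored by simulation. The rough Monte Carlo sketch below is our own illustration (not part of the course software); it estimates the power of the one-sided one-sample t-test in the sleep-gain setting, where 1.833 is the 0.95-quantile of t(9):

```python
import random
from statistics import mean, stdev

def t_power_sim(d, sigma, n, t_crit, reps=20000, seed=1):
    """Estimate by simulation the power of the one-sided t-test that
    rejects H0: mu = 0 when t >= t_crit, if the true mean is d."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(d, sigma) for _ in range(n)]
        t = mean(sample) / (stdev(sample) / n ** 0.5)
        if t >= t_crit:
            rejections += 1
    return rejections / reps

# Sleep-gain setting: true mean 0.5, sigma = 1.789, n = 10, 5% level
print(t_power_sim(0.5, 1.789, 10, t_crit=1.833))
```

The estimate comes out close to (a little below) the value 0.223 obtained from the normal approximation, illustrating that for small samples the approximation is serviceable but not exact.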

6.2 Planning sample sizes


Perhaps the most common use of calculations of power in a hypothesis test is in a
slightly different context: power calculations can be used in planning a study in
order to get an idea of what appropriate sample sizes would be. The sleep gain
example will be used to illustrate how this can be done.

Example 6.2 How many sleepers do we need?


In Example 6.1 you saw that, with a sample of size 10, the test actually had a low
power, 0.223, if the underlying population mean sleep gain was really 0.5 hours.
Suppose that, for some reason, it is important to be able to detect a mean sleep
gain of this size. You know that increasing the sample size will, other things being
equal, increase the power of the test. What sample size would you have to use in
order that the power was reasonably high, say 0.8?
The power was calculated by finding P(T ≥ q0.95), where T ∼ N(0.5/(σ/√n), 1). We
want to choose the sample size, n, so that this power is equal to 0.8. Instead of
knowing the sample size n and using its value to find the power, the position is
that we know the power and want to find n. How can this be done?

The power is given by
P(T ≥ q0.95) = P(T − 0.5/(σ/√n) ≥ q0.95 − 0.5/(σ/√n))
= P(Z ≥ q0.95 − 0.5/(1.789/√n)),
where Z ∼ N(0, 1). (In Example 6.1, it was assumed that σ is 1.789.) We want
this probability to be equal to 0.8; that is,
P(Z ≥ q0.95 − 0.5/(1.789/√n)) = 0.8.
But if P(Z ≥ z) = 0.8, then z must be q0.2, the 0.2-quantile of the standard
normal distribution, which is −0.8416. Hence
−0.8416 = q0.95 − 0.5/(1.789/√n) = 1.645 − 0.5/(1.789/√n),
so that
0.5/(1.789/√n) = 1.645 + 0.8416 = 2.4866
and therefore
√n = 2.4866 × 1.789/0.5 ≈ 8.897.
Thus n = 8.897² ≈ 79.16. The sample size must be a whole number, so this is
rounded up to 80.
The conclusion is that 80 people rather than 10 must be used in this study, if we
want to be able to detect a mean sleep gain of 0.5 hours with a probability
of 0.8. �

In general, suppose the data are distributed as N(µ, σ²) and the test statistic
T = (X − µ0)/(σ/√n) is to be used to test the null hypothesis H0: µ = µ0 against
the alternative hypothesis H1: µ > µ0, with a significance level α. Suppose the
sample size n is to be chosen such that the power of the test when the true value
of the underlying mean is µ0 + d is equal to some predetermined value γ. Then it
can be shown that the required sample size is
n = (σ²/d²)(q1−α − q1−γ)²,    (6.3)
where q1−α and q1−γ are quantiles of the standard normal distribution. For a
two-sided test, q1−α is replaced by q1−α/2 in (6.3) to give an expression for the
approximate sample size required. The approximation is reasonable provided that
d/σ is not too small.

Example 6.3 Using the sample size formula


Formula (6.3) will be applied to the situation in Example 6.2. The test is
one-sided, and the significance level is α = 0.05. It was assumed that the
population standard deviation σ is 1.789. The difference d that the test is being
designed to detect is 0.5, and the required power to detect such a difference is
γ = 0.8. Thus q1−α = q0.95 = 1.645, and q1−γ = q0.2 = −0.8416. So, the sample
size required is
n = (σ²/d²)(q1−α − q1−γ)² = (1.789²/0.5²)(1.645 + 0.8416)² = (1.789²/0.5²)(2.4866)² ≈ 79.16,
which is rounded up to 80, as before. �
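Formula (6.3) is easy to mechanize. The Python sketch below (our own illustration, not the course software) reproduces Example 6.3, and swaps in q1−α/2 for a two-sided test as described above:

```python
from math import ceil
from statistics import NormalDist

Z = NormalDist()  # standard normal N(0, 1)

def sample_size(d, sigma, gamma, alpha=0.05, two_sided=False):
    """Sample size from Formula (6.3): n = (sigma^2/d^2)(q_{1-alpha} - q_{1-gamma})^2,
    rounded up to a whole number; q_{1-alpha/2} replaces q_{1-alpha} if two-sided."""
    a = alpha / 2 if two_sided else alpha
    n = (sigma ** 2 / d ** 2) * (Z.inv_cdf(1 - a) - Z.inv_cdf(1 - gamma)) ** 2
    return ceil(n)

# Example 6.3: detect d = 0.5 with power 0.8, sigma = 1.789, one-sided 5% test
print(sample_size(0.5, 1.789, 0.8))  # prints 80
```

With the values from Activity 6.3 (two-sided test, α = 0.01, power 0.9, σ = 10, d = 5) the function returns 60, consistent with the 60 subjects mentioned in Exercise 6.1.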

Activity 6.3 Calculating sample sizes


A researcher plans an investigation of whether a particular drug alters blood
pressure. In the investigation, each member of a group of individuals will have
their blood pressure measured; then they will take the drug, and one hour later
they will have their blood pressure measured again. Blood pressure measurements
have been studied on many occasions, and it is known that, for patients who have
not taken any drug, the standard deviation of hourly blood pressure
measurements on the same patient is about 10 mm Hg. The researcher intends to
analyse the data by applying a two-sided test (based on the normal distribution)
to the individual differences in blood pressure before and after taking the drug, at
a significance level of 0.01. The intention is that the study should have a power of
0.9 of finding a mean difference in blood pressure of 5 mm Hg. How many subjects
should the researcher use?

Formula (6.3) will still give approximately correct answers in the case where the
underlying variance is not known and a t-test is being performed. But your
computer can perform accurate sample size calculations even for small sample
sizes, and can also use similar expressions in other testing situations. You will be
using MINITAB to do sample size calculations in Subsection 6.3.

6.3 Power and sample size using a computer


In this subsection, you will use a SUStats program to investigate further the
notion of the power of a test. You will also learn how to do power and related
sample size calculations using MINITAB.

[Refer to the relevant chapters of Computer Book C for the work in this subsection.]

Summary of Section 6
In this section, you have learned how to calculate the power of a test on a normal
mean in the simple case where the population variance is assumed known. You
have seen how these calculations can be used to choose an appropriate sample size
for a statistical investigation.
You have also learned how to use MINITAB to perform sample size and power
calculations for one- and two-sample t-tests, and for a test of one proportion.

Exercises on Section 6
Exercise 6.1 Power
Suppose that the researcher in Activity 6.3 did indeed carry out the study as
described, with 60 subjects. What would be the power of the researcher’s test to
detect a mean difference in blood pressure of just 2 mm Hg?

Exercise 6.2 Calculating sample sizes


Suppose now that the psychologist in Activity 6.2 was planning a similar study on
a different specific population, and that using a 5% significance level the
psychologist wished to detect a difference in mean IQ of 2 points between the
specific population and the UK general population, with power 0.8. How many
subjects should the psychologist use?

Summary of Unit C1
In this unit, you have learned how to test statistical hypotheses. Specifically, you
have seen how null and alternative hypotheses, making specific statements about
the value of a population parameter, can be set up.
You have seen three different approaches to testing a statistical hypothesis, on the
basis of data. In the first approach, a confidence interval for the parameter of
interest is calculated, and if the specific parameter value named in the null
hypothesis is outside this interval, the null hypothesis is rejected in favour of the
alternative hypothesis. Otherwise, the null hypothesis remains plausible. In this
context, the significance level of the hypothesis test is 100% minus the confidence
level for the interval.
The second, and most common, approach to testing hypotheses is known as
significance testing. A test statistic is chosen; this is a quantity whose value gives
information about the parameter in the hypotheses, and whose distribution when
the null hypothesis is true — its null distribution — is known. A significance
probability, or p value, is calculated. This is the probability that, under the null
hypothesis, a value of the test statistic would be observed that is at least as
extreme as the value that was actually observed. The smaller the significance
probability, the stronger is the evidence against the null hypothesis. Significance
tests can have two-sided or one-sided alternative hypotheses, depending on
whether or not departures from the null hypothesis in both directions are
included. For one-sided tests, the extreme values fall in one tail of the null
distribution; for two-sided tests, the extreme values fall in both tails.
In the third approach, a significance level is chosen and fixed in advance. This
approach also involves choosing a test statistic and knowing its null distribution.
However, instead of calculating a significance probability, a rejection region is set
up, such that values of the test statistic in the rejection region tend to favour the
alternative hypothesis rather than the null hypothesis, and such that the
probability that the test statistic is in the rejection region, when the null
hypothesis is true, is equal to the significance level. In some circumstances, such
tests are exactly equivalent to those produced using the method based on
confidence intervals. Fixed-level tests also come in one-sided or two-sided forms,
depending on the nature of the alternative hypothesis. The choice of alternative
hypothesis is reflected in how the rejection region is set up: in one-sided cases the
rejection region falls in just one tail of the null distribution of the test statistic,
but in two-sided cases it falls in both tails.
Tests of hypotheses may lead to two possible kinds of wrong conclusion. First, a
null hypothesis may be rejected when it is in fact true. This is known as a Type I
error, and its probability is equal to the significance level of the test. Secondly, a
test may fail to reject a null hypothesis when it is in fact false. This is known as a
Type II error. The power of a test is the probability that the test will (correctly)
reject the null hypothesis when it is false. So the power of a test is one minus the
probability of a Type II error. You have seen how to calculate the power of certain
simple tests by hand, and of a wider range of tests using MINITAB. You have also
seen how to use power calculations to plan an appropriate sample size for a study.
You have seen how to perform statistical tests for null hypotheses involving a
single normal mean and a single proportion, as well as for differences between
these quantities in two populations. (The tests for normal means are called
t-tests.) You have also seen how to perform tests involving a population mean,
using large-sample results for the distribution of the sample mean. You have
learned to carry out these tests both with and without using a computer.

Learning outcomes
You have been working towards the following learning outcomes.

Terms to know and use


Hypothesis test, null hypothesis, alternative hypothesis, significance level,
two-sided test, one-sided test, fixed-level test, test statistic, null distribution,
rejection region, (Student’s) one-sample t-test, (Student’s) two-sample t-test,
Type I error, Type II error, power, significance test, significance probability,
p value.

Symbols and notation to know and use


The notation H0 for the null hypothesis of a statistical test.
The notation H1 for the alternative hypothesis of a statistical test.

Ideas to be aware of
That a significance test involves assessing the strength of the evidence against
a null hypothesis.
That a fixed-level test typically involves deciding whether or not to reject a
null hypothesis.
That, even when a null hypothesis is not rejected, it is not appropriate to
conclude that it must be true.
That the conclusions of fixed-level tests are subject to two different possible
types of error, and that the probabilities of these errors can be taken into
account.
That the context of a statistical investigation determines whether the
appropriate statistical test is one-sided or two-sided.
That a quantity cannot be used as a test statistic unless its null distribution
is known (at least approximately).
That there is a relationship between confidence intervals and fixed-level
hypothesis testing.
That a two-sample t-test for the difference between two normal means may
involve the assumption that the variances of the two populations are equal.
That the power of a test decreases when the significance level is made smaller.

Statistical skills
Set up hypotheses appropriately for a hypothesis test.
Perform a significance test or a fixed-level test for null hypotheses involving a
single mean of a normal distribution, a single proportion (exact and
approximate), a mean of a population whose distribution is unspecified
(based on a large sample), the difference between the means of two normal
distributions, or the difference between two proportions.
Interpret the results of a significance test or a fixed-level test, in terms of the
real-world investigation that gave rise to the test.
Calculate the power of a test on a normal mean when the population variance
is assumed to be known.
Calculate the sample size required to obtain a given power in a test on a
normal mean when the population variance is assumed to be known.

Features of the software to use


Use MINITAB to carry out one-sample z-tests, one- and two-sample t-tests,
and one- and two-sample tests on proportions.
Use MINITAB to perform power calculations, including calculations for
planning sample sizes.

Solutions to Activities
Solution 1.1
(a) The appropriate confidence interval to use for a test with significance level
0.05 (or 5%) is the 95% confidence interval. This is given in the question as
(0.2113, 0.6133). The value in the null hypothesis, 2/3 ≈ 0.6667, is not contained
in this interval. The conclusions of the test may be stated as follows. There is
sufficient evidence at the 0.05 significance level to reject the null hypothesis that
p = 2/3 in favour of the alternative hypothesis that p differs from 2/3. In fact,
since the entire confidence interval is below 2/3, the data suggest that p < 2/3.
(b) This time, the appropriate confidence interval to use is the 99% interval. The
value 0.6667 is included in this interval. So there is insufficient evidence at the
0.01 significance level to reject the null hypothesis that p = 2/3.

Solution 1.2
(a) The significance level of the test will be 5% (or 0.05).
(b) Since 2 is in the confidence interval, there is insufficient evidence at the 5%
significance level to reject the null hypothesis that the mean March rainfall in
Minneapolis St Paul is 2 inches.

Solution 2.1
Step 1: The null and alternative hypotheses are
H0: µ = 1, H1: µ ≠ 1,
where µ is the underlying mean number of insects of the taxon Staphylinoidea
caught in a trap.
Step 2: The data are in Table 2.2.
Step 3: An appropriate test statistic is the sample mean X. When the null
hypothesis is true, µ = 1, so the approximate null distribution of X is N(1, σ²/33).
The unknown population variance σ² can be replaced by its sample estimate
s² = 1.655², so the approximate null distribution of X is
N(1, 1.655²/33) = N(1, 0.0830).
Step 4: The observed value of the test statistic is x = 1.636.
Step 5: Since this is a two-sided test, the values that are at least as extreme as
the observed value (1.636) will fall into two tails. (See Figure S.1.)
Step 6: The significance probability is
p = P(X ≤ 0.364) + P(X ≥ 1.636)
= 2P(X ≥ 1.636)
= 2P(Z ≥ (1.636 − 1)/√0.0830), where Z ∼ N(0, 1)
≈ 2P(Z ≥ 2.21)
= 2 × 0.0136
= 0.0272
≈ 0.027.
Figure S.1 The null distribution of X, and the values that are at least as extreme
as the observed value x = 1.636
Steps 7 and 8: There is moderate evidence against the null hypothesis that the
mean number of insects of the taxon Staphylinoidea in a trap is 1. Further, since
x = 1.636 > 1, the data suggest that the underlying mean number of insects in a
trap is greater than 1.

Solution 3.2
The null and alternative hypotheses are given in the question (step 1), and the
data to be used are in Table 1.3 (step 2).
Step 3: An appropriate test statistic is
T = (X − 300)/(S/√10);
its null distribution is t(9).
Step 4: The observed value of the test statistic is
t = (x − 300)/(s/√10) = (294.81 − 300)/(1.77/√10) ≈ −9.272.
Step 5: The set of values of the test statistic that are at least as extreme as the
observed value consists of all values less than or equal to −9.272, together with all
values greater than or equal to 9.272.
Step 6: The significance probability is
p = P(|T| ≥ 9.272).
From the table of quantiles in the Handbook, the 0.999-quantile of t(9) is 4.297.
Therefore P(T ≥ 9.272) < 0.001, and hence p < 0.002.
Steps 7 and 8: There is strong evidence against the null hypothesis that the mean
time that the timer takes to ring when set to five minutes is equal to five minutes
(300 seconds). Since the observed sample mean x = 294.81, the data suggest that
the average time the timer takes to ring is less than five minutes.
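The two-sided significance probability in Solution 2.1 can be checked directly. This short Python sketch (our own check, not course material) standardizes the observed mean under the null distribution N(1, 0.0830) and doubles the upper-tail probability:

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal N(0, 1)

# Solution 2.1: observed mean 1.636 under the null distribution N(1, 0.0830)
z = (1.636 - 1) / sqrt(0.0830)  # standardized test statistic, about 2.21
p = 2 * (1 - Z.cdf(z))          # two-sided significance probability
print(round(z, 2), round(p, 3))  # prints 2.21 0.027
```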

Solution 3.5 Step 1: Denoting the population mean sleep gain for patients taking D-hyoscyamine hydrobromide by µ, the null and alternative hypotheses are
H0: µ = 0, H1: µ > 0.
(The justifications for carrying out a one-sided test here are as follows. First, that is what you were asked to do. Secondly, there may be grounds for assuming that the drug can have an effect only in one direction, as was considered to be the case for L-hyoscyamine hydrobromide.)
Step 2: The data are given in the question.
Step 3: Since there are ten patients, the appropriate test statistic is
T = D/(S/√10),
where D is the sample mean sleep gain and S is the sample standard deviation of the sleep gains. The null distribution of this statistic is t(9).
Step 4: The sample mean is 0.75 and the sample standard deviation is 1.789, so the observed value of the test statistic is
t = d/(s/√n) = 0.75/(1.789/√10) ≈ 1.33.
Step 5: This is a one-sided test, so the values of the test statistic that are at least as extreme as the observed value fall in one tail of the null distribution. This tail corresponds to values of T that are greater than or equal to the observed value t = 1.33.
Step 6: The significance probability is p = P(T ≥ 1.33), where T ∼ t(9). The 0.9-quantile of t(9) is 1.383, so P(T ≥ 1.383) = 0.1. But 1.33 < 1.383, so
p = P(T ≥ 1.33) > 0.1.
Steps 7 and 8: There is little evidence against the null hypothesis that, on average, D-hyoscyamine hydrobromide does not prolong sleep.

Solution 3.6 The observed value, 19, is in the upper tail of the null distribution of the test statistic. Thus, from Table 3.4, the probability in the tail including the observed value and larger values is 0.0243. The lower tail is chosen so that it contains the largest amount of probability that is less than or equal to 0.0243. Again using Table 3.4, this is the tail containing all values up to and including 10, for which the total probability is 0.0139.
Thus the significance probability is
p = P(N ≤ 10) + P(N ≥ 19) = 0.0139 + 0.0243 = 0.0382.
Observing 19 yellow peas out of 20 would thus provide moderate evidence that the underlying value p of the proportion of peas that are yellow is different from the hypothesized value of 3/4. Since the observed proportion is greater than 3/4, the data suggest that the underlying proportion is greater than 3/4.

Solution 4.1 Appropriate hypotheses for the test, in terms of the parameters p1 and p2, which represent the underlying proportions of males and females who are helped, are
H0: p1 = p2, H1: p1 ≠ p2.
The observed value of D, the difference between the sample proportions, is
d = 71/100 − 89/105 ≈ 0.71 − 0.8476 = −0.1376.
Under the null hypothesis that p1 = p2 = p, the pooled estimate of the common value of the proportion p is
p̂ = (71 + 89)/(100 + 105) = 160/205 ≈ 0.7805.
The approximate null distribution of D is
N(0, p̂(1 − p̂)(1/n1 + 1/n2))
= N(0, 0.7805(1 − 0.7805)(1/100 + 1/105))
= N(0, 0.003345).
Therefore, if Z ∼ N(0, 1), the significance probability is
P(|D| ≥ 0.1376) ≈ P(|Z| ≥ 0.1376/√0.003345)
  ≈ P(|Z| ≥ 2.38)
  = 2 × 0.0087
  = 0.0174
  ≈ 0.017.
Thus there is moderate evidence against the null hypothesis and in favour of the alternative hypothesis. The conclusion is that there is moderate evidence that male students and female students are not equally likely to be helped in the context of an experiment like this. Indeed, since p̂2 > p̂1, the data indicate that female students are more likely to be helped.

Solution 4.2
(a) The ratio of the larger sample variance to the smaller is 114.84/48.93 ≈ 2.35. This is less than 3. So, by the rule of thumb, the equal variance assumption is not unreasonable.
The pooled estimate of the common variance is
s²P = ((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2)
    = (9 × 48.93 + 9 × 114.84)/(10 + 10 − 2)
    ≈ 81.89.
(In this case, the two sample sizes are equal, so the pooled estimate is in fact the average of the two sample variances, and it is a bit quicker to calculate it as (48.93 + 114.84)/2 ≈ 81.89.)
(b) Denoting the underlying mean joint widths in the two populations by µ1 and µ2, the hypotheses are
H0: µ1 = µ2, H1: µ1 ≠ µ2.
The null distribution of the test statistic has a t-distribution with 10 + 10 − 2 = 18 degrees of freedom.
The observed value of the test statistic is
t = (x1 − x2)/(sP √(1/n1 + 1/n2)) = (128.40 − 122.80)/(√81.89 × √(1/10 + 1/10)) ≈ 1.384.
The significance probability is p = P(|T| ≥ 1.384), where T ∼ t(18). The number 1.384 lies between the 0.9-quantile and the 0.95-quantile of t(18), so P(T ≤ 1.384) lies between 0.9 and 0.95, and hence
0.05 < P(T ≥ 1.384) < 0.1.
Since p = P(|T| ≥ 1.384) = 2P(T ≥ 1.384), it follows that 0.1 < p < 0.2. (Computer calculations give the value of the significance probability as 0.183.)
Thus there is little evidence against the null hypothesis. We cannot rule out the possibility that the underlying mean joint widths are equal.

Solution 5.1
(a) Step 5: For a test using a 1% significance level, the appropriate quantiles are q0.005 and q0.995. From the table of quantiles for t-distributions, the 0.995-quantile of t(9) is q0.995 = 3.250. Using the symmetry of the distribution, q0.005 = −3.250. The rejection region for the test is therefore as shown in Figure S.2.

Figure S.2 The null distribution of the test statistic, and the rejection region

(b) Step 7: The observed value of the test statistic, −9.272, falls well into the lower part of the rejection region (because −9.272 < −3.250). Therefore, the null hypothesis can be rejected at the 1% significance level.
(c) Step 8: We conclude that the average time taken for the timer to ring, when set to 300 seconds, is not 300 seconds. Indeed, since the observed sample mean is less than 300, the data provide evidence that the average time is less than 300 seconds.
Notice that the conclusion matches that found in Example 1.2 by the procedure using confidence intervals.

Solution 5.2 Step 1: The hypotheses are given in the question:
H0: µ = 0.618, H1: µ ≠ 0.618.
Step 2: The data are in Table 3.1.
Step 3: Since there are 20 rectangles, the appropriate test statistic is
T = (X − 0.618)/(S/√20),
where X is the sample mean ratio and S is the sample standard deviation of the ratios. The null distribution of this statistic is t(19).
Step 4: The significance level has been specified as 5%.
Step 5: The rejection region is defined by the 0.025-quantile and 0.975-quantile of t(19); these are q0.025 = −2.093 and q0.975 = 2.093. (See Figure S.3.)

Figure S.3 The null distribution of the test statistic, and the rejection region

Step 6: The observed sample mean and sample standard deviation are, respectively, x = 0.6605 and s = 0.0925. Thus (as in Example 3.1) the observed value of the test statistic is
t = (x − 0.618)/(s/√20) = (0.6605 − 0.618)/(0.0925/√20) ≈ 2.055.
Step 7: The observed value of the test statistic lies very close to the boundary of, but not in, the rejection region, so the null hypothesis cannot be rejected at the 5% significance level.
Step 8: Therefore, there is not sufficient evidence (at the 5% significance level) to reject the hypothesis that, on average, the population mean ratio for Shoshoni rectangles is 0.618. It remains plausible that Shoshoni rectangles do, on average, have a width-to-length ratio of 0.618.
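The arithmetic of Step 6 in Solution 5.2 can be reproduced in a couple of lines. A sketch in Python (an assumption: the course itself uses MINITAB), with the 0.975-quantile of t(19) taken from tables:

```python
from math import sqrt

# Summary statistics for the 20 Shoshoni rectangle width-to-length ratios
n, xbar, s = 20, 0.6605, 0.0925
t = (xbar - 0.618) / (s / sqrt(n))

q = 2.093  # 0.975-quantile of t(19), from tables
print(round(t, 3), abs(t) >= q)  # t is about 2.055; not in the rejection region
```

As the solution notes, t falls just inside the boundary, so the comparison `abs(t) >= q` is False at the 5% level.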
Solution 5.3
(a) The null hypothesis H0 would be rejected in a fixed-level test with a significance level greater than 0.083, but would not be rejected if a significance level less than 0.083 is used. Thus H0 would be rejected in a fixed-level test with a 10% significance level, but not in a fixed-level test with either a 5% or a 1% significance level.
(b) Since H0 is rejected at the 1% significance level, the observed value of the test statistic lies in the rejection region, and the rejection region contains 1% of the probability in the null distribution. Thus the observed value must be at least as extreme as the boundaries of the rejection region, and hence the significance probability p ≤ 0.01.

Solution 5.4 Step 1: As in Activity 3.5, denoting the population mean sleep gain for patients taking D-hyoscyamine hydrobromide by µ, the null and alternative hypotheses are
H0: µ = 0, H1: µ > 0.
Step 2: The data are in Table 3.3.
Step 3: Since there are ten patients, the appropriate test statistic is
T = D/(S/√10),
where D is the sample mean sleep gain and S is the sample standard deviation of the sleep gains. The null distribution of this statistic is t(9).
Step 4: The significance level has been specified as 5%.
Step 5: The rejection region is defined by the 0.95-quantile of t(9); this is q0.95 = 1.833.
Step 6: The sample mean is 0.75 and the sample standard deviation is 1.789, so the observed value of the test statistic is
t = d/(s/√n) = 0.75/(1.789/√10) ≈ 1.33.
Step 7: The observed value of the test statistic does not lie in the rejection region, because it is less than q0.95 = 1.833. The null hypothesis cannot be rejected at the 5% significance level.
Step 8: On the basis of these data, it remains plausible that, on average, D-hyoscyamine hydrobromide does not prolong sleep.

Solution 6.1 For a one-sided test, the key quantity in (6.1) to consider is
z = q1−α − d/(σ/√n).
If z decreases, then P(Z ≥ z) increases, and the power of the test will go up. If z increases, then the power of the test will go down.
(a) If d increases, then q1−α − d/(σ/√n) will decrease. Thus the power of the test will increase.
(b) If σ increases, then q1−α − d/(σ/√n) will increase. Thus the power of the test will decrease.
(c) If n increases, then q1−α − d/(σ/√n) will decrease. Thus the power of the test will increase.
For a two-sided test, the key quantity in (6.2) to consider is
q1−α/2 − d/(σ/√n).
Similar results follow in this case.

Solution 6.2 Here, translating the problem into the notation of power calculations, we have
α = 0.05, d = 105 − 100 = 5, σ = 15, n = 80.
The test is two-sided, so the probability required is given by
P(Z ≥ q1−α/2 − d/(σ/√n)),
where Z ∼ N(0, 1). In this case, q1−α/2 = q0.975 = 1.96. So
q1−α/2 − d/(σ/√n) = 1.96 − 5/(15/√80) ≈ −1.02.
From the table of probabilities for the standard normal distribution,
P(Z ≥ −1.02) = 0.8461 ≈ 0.846.
The psychologist's procedure has a reasonably good chance of finding a difference in mean IQ of this size.

Solution 6.3 The appropriate number of subjects is given by
n = (σ²/d²)(q1−α/2 − q1−γ)².
Here α = 0.01. The assumed population standard deviation σ is 10. The difference d that the test is being designed to detect is 5, and γ, the required power to detect such a difference, is 0.9. Thus q1−α/2 = q0.995 = 2.576 and q1−γ = q0.1 = −1.282. So, the sample size required is
n = (10²/5²)(2.576 + 1.282)² ≈ 59.54,
which is rounded up to 60.
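The sample-size formula used in Solution 6.3 is easy to wrap in a small function. A sketch in Python (an assumption; the course software is MINITAB), where `NormalDist` supplies the standard normal quantiles that the solution reads from tables:

```python
from math import ceil
from statistics import NormalDist

def sample_size(alpha, gamma, sigma, d):
    """Sample size for a two-sided z-test with significance level alpha,
    to achieve power gamma against a true difference d, given standard
    deviation sigma: n = (sigma/d)^2 * (q_{1-alpha/2} - q_{1-gamma})^2."""
    q = NormalDist().inv_cdf
    n = (sigma / d) ** 2 * (q(1 - alpha / 2) - q(1 - gamma)) ** 2
    return ceil(n)  # round up to a whole number of subjects

print(sample_size(alpha=0.01, gamma=0.9, sigma=10, d=5))  # 60, as in Solution 6.3
```

Using exact quantiles rather than rounded table values gives 59.52 instead of 59.54 before rounding up; the required sample size of 60 is unchanged.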
Solutions to Exercises

Solution 1.1
(a) The significance level of the test is 10% (or 0.1).
(b) The hypothesized value of 2.5 is outside the 90% confidence interval for µ. So the null hypothesis that µ = 2.5 can be rejected at the 0.1 significance level in favour of the alternative hypothesis that the population mean number of loans per book in a year differs from 2.5. Indeed, since the sample mean is less than 2.5, the data indicate that the population mean is less than 2.5.

Solution 2.1 Step 1: The null and alternative hypotheses are
H0: µ = 2.5, H1: µ ≠ 2.5,
where µ is the underlying mean number of loans per book in a year.
Step 2: The data are given in the question in summary form: x = 1.992, s = 1.394, n = 122.
Step 3: An appropriate test statistic is the sample mean X. When the null hypothesis is true, µ = 2.5 so, using s to estimate the population standard deviation σ, the approximate null distribution of X is
N(2.5, 1.394²/122) = N(2.5, 0.01593).
Step 4: The observed value of the test statistic is x = 1.992.
Steps 5 and 6: Since this is a two-sided test, the values that are at least as extreme as the observed value (1.992) fall into two tails, and the significance probability is twice the probability in the lower tail:
p = 2P(X ≤ 1.992)
  = 2P(Z ≤ (1.992 − 2.5)/√0.01593)
  ≈ 2P(Z ≤ −4.02).
Using the table of probabilities for the standard normal distribution in the Handbook gives p = 0.0000 to four decimal places.
Steps 7 and 8: There is strong evidence against the null hypothesis that the mean number of loans per book is 2.5. Since the sample mean is less than 2.5, the data indicate that the underlying mean number of loans per book in a year is less than 2.5.

Solution 3.1
(a) Steps 1 and 3: The null and alternative hypotheses are
H0: µ = 0, H1: µ ≠ 0,
where µ is the (population) mean difference between the heights of a pair of cross-fertilized and self-fertilized plants whose parents were grown from the same seed. An appropriate test statistic is
T = D/(S/√n) = D/(S/√15)
with null distribution t(n − 1) = t(14).
(b) Steps 4, 5 and 6: For this data set, the sample mean is 20.93 and the sample standard deviation is 37.74, so the observed value of the test statistic is
t = d/(s/√n) = 20.93/(37.74/√15) ≈ 2.148.
The significance probability is p = P(|T| ≥ 2.148). The 0.975-quantile of t(14) is 2.145 and the 0.99-quantile is 2.624, so P(T ≥ 2.148) lies between 0.01 and 0.025. Therefore 0.02 < p < 0.05.
(c) Steps 7 and 8: There is moderate evidence against the null hypothesis that there is no difference, on average, between the heights of cross-fertilized and self-fertilized plants whose parents were grown from the same seed. Since the sample mean difference is 20.93, the data indicate that cross-fertilized plants tend to be taller than self-fertilized plants.

Solution 4.1 Appropriate hypotheses for the test are
H0: p1 = p2, H1: p1 ≠ p2,
where the parameters p1 and p2 denote the underlying proportions of patients receiving the new drug and the standard treatment, respectively, who are in remission after a year.
The observed value of D, the difference between the sample proportions, is
d = 38/62 − 22/60 ≈ 0.6129 − 0.3667 = 0.2462.
The pooled estimate of the common value of the proportion p (under the null hypothesis that p1 = p2 = p) is
p̂ = (38 + 22)/(62 + 60) = 60/122 ≈ 0.4918.
The approximate null distribution of D is
N(0, p̂(1 − p̂)(1/n1 + 1/n2))
or
N(0, 0.4918(1 − 0.4918)(1/62 + 1/60)) = N(0, 0.00820).
The significance probability is thus
P(|D| ≥ 0.2462) ≈ P(|Z| ≥ 0.2462/√0.00820), where Z ∼ N(0, 1),
  ≈ P(|Z| ≥ 2.72)
  = 2 × 0.0033
  = 0.0066.
Thus there is strong evidence against the null hypothesis and in favour of the alternative hypothesis. The conclusion is that there is strong evidence that patients on the two treatments differ in their probability of being in remission after a year. Moreover, since p̂1 = 0.6129 > 0.3667 = p̂2, the data indicate that patients on the new drug are more likely to be in remission than are those on the standard treatment.
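The two-proportion calculation in Solution 4.1 can be checked with a short script. A sketch using Python's standard library (an assumption; MINITAB is the course software):

```python
from math import sqrt
from statistics import NormalDist

# Remission counts: 38 of 62 on the new drug, 22 of 60 on the standard treatment
x1, n1, x2, n2 = 38, 62, 22, 60
d = x1 / n1 - x2 / n2                  # difference in sample proportions
p_hat = (x1 + x2) / (n1 + n2)          # pooled estimate under H0: p1 = p2
se = sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
z = d / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(round(z, 2), round(p_value, 4))  # z about 2.72; p about 0.0065
```

The exact normal computation gives p ≈ 0.0065; the solution's 0.0066 comes from rounding z to 2.72 before consulting tables.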
Solution 4.2
(a) The sample variances are very close, so there is no reason to doubt the equal variance assumption.
The pooled estimate of the common variance is
s²P = ((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2)
    = (83 × 35.65 + 69 × 33.06)/(84 + 70 − 2)
    ≈ 34.47.
(b) Denoting the underlying mean maximum skull breadths in the two populations by µ1 and µ2, the hypotheses are
H0: µ1 = µ2, H1: µ1 ≠ µ2.
The null distribution of the test statistic T is a t-distribution with 84 + 70 − 2 = 152 degrees of freedom.
The observed value of the test statistic T is
t = (x1 − x2)/(sP √(1/n1 + 1/n2)) = (143.77 − 132.44)/(√34.47 × √(1/84 + 1/70)) ≈ 11.92.
The shape of t(152) is very close to that of the standard normal distribution. Without recourse to tables or a computer, it is clear that the significance probability p is very close to zero.
Thus there is strong evidence against the null hypothesis that the underlying mean maximum skull breadths are equal. Indeed, since x1 > x2, the data indicate that on average Etruscans had wider skulls than modern Italians.

Solution 5.1 Steps 1 to 3 are the same as for the significance test in Exercise 3.1: the hypotheses are
H0: µ = 0, H1: µ ≠ 0,
where µ is the (population) mean difference between the heights of a pair of cross-fertilized and self-fertilized plants whose parents were grown from the same seed; the data are in Table 5.1; an appropriate test statistic is
T = D/(S/√15),
which has null distribution t(14).
Step 4: A 10% significance level is to be used.
Step 5: The test is two-sided, so the boundary points of the rejection region are the 0.05-quantile and the 0.95-quantile of t(14), which are −1.761 and 1.761. The rejection region consists of values less than or equal to −1.761 and values greater than or equal to 1.761.
Step 6: As in Exercise 3.1, the observed value of the test statistic is t = 2.148.
Step 7: The observed value of the test statistic is in the rejection region (because 2.148 > 1.761), so the null hypothesis that the mean difference is zero can be rejected at the 10% significance level.
Step 8: We conclude that the mean difference is not zero. Since the sample mean difference is 20.93, the data suggest that cross-fertilized plants tend to be taller, on average, than self-fertilized plants whose parents were grown from the same seed.

Solution 6.1 Here, in the notation of power calculations, we have
α = 0.01, d = 2, σ = 10, n = 60.
The test is two-sided, so the required power is given by (6.2):
P(Z ≥ q1−α/2 − d/(σ/√n)),
where Z ∼ N(0, 1). In this case, q1−α/2 = q0.995 = 2.576. So
q1−α/2 − d/(σ/√n) = 2.576 − 2/(10/√60) ≈ 1.03.
From the table of probabilities for the standard normal distribution,
P(Z ≥ 1.03) = 0.1515 ≈ 0.152.
The procedure does not have a very good chance of detecting a difference in blood pressure of 2 mm Hg.

Solution 6.2 The appropriate number of subjects is given by
n = (σ²/d²)(q1−α/2 − q1−γ)².
Here α = 0.05. The assumed population standard deviation is σ = 15. The difference d that the test is being designed to detect is 2, and γ, the required power to detect such a difference, is 0.8. Thus q1−α/2 = q0.975 = 1.96 and q1−γ = q0.2 = −0.8416. So the sample size required is
n = (15²/2²)(1.96 + 0.8416)² ≈ 441.5,
which is rounded up to 442.
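The power computed in Solution 6.1 follows the same recipe each time, so it is worth a small function. A sketch in Python (an assumption; the course software is MINITAB):

```python
from math import sqrt
from statistics import NormalDist

def power_two_sided(alpha, d, sigma, n):
    """Approximate power of a two-sided z-test of size alpha against a
    true difference d, given standard deviation sigma and sample size n."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2) - d / (sigma / sqrt(n))
    return 1 - nd.cdf(z)  # P(Z >= z)

print(round(power_two_sided(alpha=0.01, d=2, sigma=10, n=60), 3))  # about 0.152
```

The same function reproduces the 0.846 of Solution 6.2 in the activities: `power_two_sided(alpha=0.05, d=5, sigma=15, n=80)`.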
UNIT C2 Nonparametrics
Study guide for Unit C2
This unit is shorter than average. You should schedule four study sessions,
including time for answering the TMA questions on the unit and for generally
reviewing and consolidating your work on this unit.
Section 2 does not depend on ideas or skills from Section 1. Thus, if you have
some pressing reason for studying the two sections out of order, this should cause
no problems.
As you study this unit you will be asked to work through Chapter 6 of
Computer Book C. We recommend that you do this at the place indicated in the
unit (in Subsection 1.4), though it would be possible to postpone studying the
chapter until later in the unit without affecting your understanding of the unit
itself.
One possible study pattern is as follows.
Study session 1: Subsections 1.1 and 1.2.
Study session 2: Subsections 1.3 and 1.4. You will need access to your computer
for this session, together with Computer Book C.
Study session 3: Section 2.
Study session 4: TMA questions on Unit C2.
Introduction
This unit contains two different topics, which do not have a great deal in common
except that they both involve significance testing.
Most of the significance test procedures that you learned in Unit C1 involved a
particular assumption about the underlying distribution of the data involved (for
example, that they had a normal distribution, or a binomial distribution). What
do we do if such assumptions do not seem justified? In Section 1, you will meet
several hypothesis tests that do not involve such assumptions.
This raises the question of how we can actually tell whether a sample of data
could plausibly have been drawn from a particular distribution. You have already
seen certain graphical methods for investigating this, in the form of probability
plots. In Section 2, you will learn about a more formal procedure for testing what
is known as the goodness of fit of a probability model to discrete data.
1 Nonparametric tests
Many of the significance tests you met in Unit C1 involve assumptions about the
distribution of the data. For instance, t-tests involve the assumption that the
underlying distribution of the population or populations is normal. The
distributional properties of the test statistic depend on this assumption of
normality. If the underlying population distribution(s) were not normal, the t-test
statistic would not in general have a Student’s t-distribution, and therefore any
p value that you calculated on the basis of this assumption might be incorrect.
How serious this was in practice would depend on how different the underlying
distribution was from the assumed normal form. In some cases the discrepancy
might be small, and no practical problems might arise; but in other cases the
discrepancy might be crucial. This kind of restriction does not apply to all the
tests you have met; some of them are based on large-sample approximations
(using the central limit theorem) which are valid for a wide range of underlying
distributions. In other circumstances, as you will see in Unit C3, it may be
possible to transform the data, that is, to apply an appropriate function to all the
data values, such that the transformed data are at least approximately normally
distributed. Then methods that assume a normal distribution can be applied to
the transformed data. However, this is not always possible. In such cases, a
technique that does not require a specific probability model must be used. Such
techniques are called nonparametric. In the one-sample t-test, for example, the
underlying distribution is assumed to be normal, but with unknown mean and
variance. That is, the underlying distribution comes from a family defined by two
parameters, the mean µ and the variance σ². Thus this test is said to be
parametric. In a nonparametric test, there is no assumption that the underlying
distribution comes from a specified family indexed by parameters in this way.
Since no particular distributional form is assumed, the tests are also called
distribution-free, though, as you will see, this does not mean that there are no
distributions involved at all!
A nonparametric test is described in each of Subsections 1.1 to 1.3. You will learn
how to use MINITAB to carry out these tests in Subsection 1.4.
1.1 Early ideas: the sign test
Distribution-free statistical tests pre-date parametric tests like the t-test. They can be traced back at least as far as 1710, when John Arbuthnot produced the first recorded instance of such a technique. The fundamental principle behind his test is simple. For each year in the 82-year period from 1629 to 1710 (inclusive), he observed from City of London records that the number of births of boys exceeded the number of births of girls. If births of boys and girls were equally likely, then the probability of such an outcome would be (1/2)^82, which is a very tiny probability. He therefore refused to believe that this was so — in statistical parlance he rejected the null hypothesis of boys and girls being equally likely — and concluded instead that the probability of a boy was greater than that of a girl. He further concluded that the observation constituted clear evidence for divine providence since, with wars and diseases resulting in a higher death rate for males, God had compensated by arranging for more males to be produced, and hence arranged 'for every woman her proper husband'.

Arbuthnot, J. (1710) An Argument for Divine Providence, taken from the constant Regularity observ'd in the Births of both Sexes. Philosophical Transactions, 27, 186–190. Presumably the data for 1710 did not cover the whole year.
Notice that Arbuthnot’s test makes no assumptions about the distribution of the
number of births for either sex. Nowadays his test is called the sign test, under
which name it appears in most elementary statistics textbooks. The principle of
the test is as follows: if you have paired data (such as Arbuthnot’s numbers of
male and female births for each year), calculate the differences and count the
number of differences that have a + sign. This simple count is the test statistic.
If the distribution of the differences is centred on zero in the sense that zero is its
median, then you can expect roughly as many + signs as − signs; so you can
obtain a significance probability by using a binomial distribution, B(n, 1/2) (where n is the sample size). Arbuthnot subtracted the number of girls recorded from the number of boys for each of 82 years and obtained 82 + signs.
Example 1.1 Corneal thickness — the sign test

Table 1.1 gives the corneal thickness, in microns, of eight people, each of whom had one eye affected by glaucoma.

Table 1.1 Corneal thickness in patients with glaucoma (microns)
Patient            1    2    3    4    5    6    7    8
Glaucomatous eye  488  478  480  426  440  410  458  460
Normal eye        484  478  492  444  436  398  464  476

Ehlers, N. (1970) On corneal thickness and intraocular pressure. II. A clinical study on the thickness of the corneal stroma in glaucomatous eyes. Acta Ophthalmologica, 48, 1107–1112.
These data were collected to investigate whether, on average, there is a difference between corneal thickness in the eye affected by glaucoma and the other eye. A
t-test for zero mean difference would involve calculating the differences between
the thicknesses in the two eyes, and assuming that these differences are
adequately modelled by a normal distribution.
The differences (glaucomatous eye − normal eye) are as follows.
4 0 −12 −18 4 12 −6 −16
A normal probability plot of these differences is shown in Figure 1.1.
Figure 1.1 A normal probability plot for corneal thickness differences
The points do not lie particularly close to a straight line, so the evidence is not
compelling that a normal distribution is appropriate for modelling these data.
However, it must be borne in mind that the sample size is small, and therefore
that the evidence against a normal distribution is not compelling either. In fact,
the (two-sided) t-test gives a p value of 0.33, so there is very little evidence
against the null hypothesis of a zero mean difference. But this conclusion may be
in doubt, because of the possible non-normality of the underlying distribution.
One way to avoid assuming a normal model for the differences is to use the sign
test. The null hypothesis for this test is that the underlying median difference
(rather than the mean difference, as used in the t-test) is zero. The test then
proceeds simply by counting the number of differences with + signs and the
number with − signs. It is common practice simply to ignore zeros in the sign test
(and reduce the sample size accordingly). Looking at the differences, and ignoring the zero, there are three + signs and four − signs or, to put it another way, three + signs out of seven. Assuming equal probability for individual + and − signs, the distribution of the number of + signs out of seven is B(7, 1/2). If we wished to perform a one-sided test, the p value would be the probability of three or fewer + signs out of seven; this is

Σ_{x=0}^{3} C(7, x)(1/2)^7 = 1 × (1/2)^7 + 7 × (1/2)^7 + 21 × (1/2)^7 + 35 × (1/2)^7 = 1/2.

There is another approach to the sign test that incorporates the zeros in the analysis, but it is not discussed in this course.
Thus, for a one-sided test, the p value would be 0.5. This is illustrated in
Figure 1.2. For a two-sided test, as well as the probability that the number of
+ signs is less than or equal to the observed value, 3, we would need to consider
the other tail of the distribution of the test statistic. The lower tail contains the
values 0, 1, 2 and 3, so the upper tail contains the values 4, 5, 6 and 7. Together
this accounts for all the possible numbers of + signs out of seven; the p value for
the two-sided test is 1. This provides absolutely no evidence against the null
hypothesis; in other words, there is no evidence of a difference (on average) in
corneal thickness.
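Example 1.1's sign test takes only a few lines of code. A sketch using Python's standard library (Python is an assumption here; the course uses MINITAB):

```python
from math import comb

# Differences (glaucomatous eye - normal eye) from Table 1.1
diffs = [4, 0, -12, -18, 4, 12, -6, -16]
signs = [d for d in diffs if d != 0]   # common practice: drop the zeros
n = len(signs)                         # 7 non-zero differences
plus = sum(1 for d in signs if d > 0)  # 3 "+" signs

# One-sided p value: P(X <= 3) for X ~ B(7, 1/2)
p_one = sum(comb(n, x) for x in range(plus + 1)) / 2 ** n
# Two-sided p value: double the one-sided value, capped at 1
p_two = min(2 * p_one, 1.0)
print(n, plus, p_one, p_two)  # 7 3 0.5 1.0
```

This reproduces the example: a one-sided p value of 0.5 and a two-sided p value of 1.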
Another way of thinking of the relationship between the p values for the one-sided and two-sided sign tests is as follows. The distribution of the test statistic (the number of + signs) is B(n, 1/2). This is a symmetric distribution. Thus the probability in the tail of this distribution opposite to that actually observed will always be equal to that in the observed tail. Therefore the p value for the two-sided test is generally double that of the one-sided test.

Figure 1.2 The null distribution of the number of + signs: B(7, 1/2)

There is one exception. This occurs when n, the total number of signs, is even, and the observed number of + signs is in the middle of its distribution, at n/2. In this case, p = 1 for the two-sided test.

Activity 1.1 Sleep gain — the sign test

In Example 3.2 of Unit C1, you met some data on the possible hypnotic effect in humans of the drug L-hyoscyamine hydrobromide. Ten individuals had their sleep time measured before and after taking this drug. The differences in sleep time
(time after taking the drug − time before taking the drug) are given in Table 1.2.
Table 1.2 Sleep gain (hours) when patients take L-hyoscyamine hydrobromide
Patient        1    2    3    4     5    6    7    8    9   10
Gain in sleep 1.9  0.8  1.1  0.1  −0.1  4.4  5.5  1.6  4.6  3.4

'Student' (1908) The probable error of a mean. Biometrika, 6, 1–25.
In Unit C1 these data were analysed using a t-test. The null and alternative
hypotheses were as follows:
H0 : µ = 0, H1 : µ > 0,
where µ is the underlying mean sleep gain. (Thus you were performing a
one-sided test.) The p value for this test was 0.0025 (see Activity 3.3 of Unit C1), so that there was strong evidence against the null hypothesis (and hence that this substance worked as a
hypnotic). There was no particular reason to doubt the appropriateness of a
normal model for these data. However, the data can still be analysed using the
sign test. The hypotheses will need to be amended as follows:
H0 : m = 0, H1 : m > 0,
where m is the (population) median sleep gain.
What is the value of the sign test statistic for these data? Calculate the
corresponding p value and report your conclusion.
In Unit C1 you saw how the one-sample t-test could be used both to test for a
specified value of the population mean given a sample from a single population,
and to test for a zero mean difference between matched pairs. Working with
matched pairs merely involves calculating the differences between the paired
values, and then applying the one-sample t-test procedure to the resulting
differences, to test whether their underlying mean could be zero. The sign test
has been introduced here as a test for differences between matched pairs, but the
first step is to look at the differences between the paired values. Thus, if we had a single sample of data and wanted to test whether the population median could be zero, we could simply apply the sign test to the original data. Indeed we can go further; we can test the null hypothesis that a single sample of data is drawn from a population with a specified median simply by subtracting the specified median from all the data values, and applying the test procedure to the resulting numbers. Example 1.2 illustrates how this works.
(For the sign test, it is not strictly speaking necessary to calculate these differences, since all we need to know is the sign of each difference and that can be found simply by looking to see which of the paired values is the larger; but the principle remains true.)
Example 1.2 Shoshoni rectangles
In Example 3.1 of Unit C1, data on the width-to-length ratio of twenty rectangles drawn by Shoshoni Native North Americans were analysed, to investigate whether (on average) the ratios matched the Greek 'golden ratio' of 0.618. A t-test was used to test the null hypothesis that the mean width-to-length ratio is 0.618, against the alternative hypothesis that it is not 0.618:
H0: µ = 0.618, H1: µ ≠ 0.618.

DuBois, C. (1960) Lowie's Selected Papers in Anthropology. University of California Press, pp. 137–142.
The data are repeated in Table 1.3.
Table 1.3 Width-to-length ratios of Shoshoni rectangles
0.693 0.662 0.690 0.606 0.570 0.749 0.672 0.628 0.609 0.844
0.654 0.615 0.668 0.601 0.576 0.670 0.606 0.611 0.553 0.933
This test involved assuming that a normal model is appropriate. A normal probability plot of these data is shown in Figure 1.3.

Figure 1.3 A normal probability plot for Shoshoni rectangle ratios
The shape of the plot is decidedly curved, so there must be some doubt over the
appropriateness of the assumption of normality. Can an alternative analysis be
performed that avoids the normality assumption?
One alternative analysis is to use the sign test with the data in Table 1.3 to test
the null hypothesis that the population median m of width-to-length ratios of
Shoshoni rectangles is 0.618 against the alternative hypothesis that it is not 0.618:
    H0 : m = 0.618,  H1 : m ≠ 0.618.

(As usual with the sign test, the hypotheses refer to the population median rather than the mean.)

The first step is to subtract the value hypothesized in the null hypothesis (that is, 0.618) from each value in the sample. Then omit any zero differences and count how many of the resulting differences have + signs and how many have − signs. You can do this if you wish, but it is quicker to note that a data value greater than 0.618 will lead to a + sign, and one less than 0.618 will lead to a − sign.
Activity 1.2 Shoshoni rectangles — the sign test statistic
Use this quicker approach to calculate the sign test statistic. Also write down the
value of n, the total number of + and − signs.
Comment
Of the original data values, 11 are greater than 0.618, and the other 9 are smaller
than 0.618. Thus the sign test results in 11 + signs and 9 − signs. The sign test
statistic is therefore 11. The total number of + and − signs, n, is 20. (In this
case, the total is the same as the sample size because there are no zero
differences — no original value is exactly equal to 0.618.)
Example 1.2 continued Shoshoni rectangles
It would now be possible to calculate the p value in the usual way. Under the null hypothesis, the number of + signs has the distribution B(20, 1/2). The observed number of + signs, 11, is in the upper tail of this distribution. Therefore the probability required is that of a random variable with the distribution B(20, 1/2) being greater than or equal to 11; and then, to obtain the required p value for a two-sided test, this probability will need to be doubled. That is, we require

    2 Σ_{x=11}^{20} (20 choose x) (1/2)^20.
Probabilities like this are usually rather tedious to calculate by hand; a computer calculated this one to be 0.824. However, in this case the calculation is actually reasonably straightforward if the relevant probability in the lower tail is considered in a different way. The relevant lower tail of the B(20, 1/2) distribution consists of values from 0 to 9 inclusive. A B(20, 1/2) random variable can take integer values from 0 to 20 inclusive. So all its possible values except 10 are included in one or other of the relevant tails of the distribution, and the required p value is thus

    P(X ≤ 9 or X ≥ 11) = Σ_{x=0}^{9} (20 choose x)(1/2)^20 + Σ_{x=11}^{20} (20 choose x)(1/2)^20
                       = 1 − P(X = 10)
                       = 1 − 20!/(10! × 10!) × (1/2)^20
                       = 1 − 184756 × 1/1048576
                       ≈ 1 − 0.176
                       = 0.824.
Thus (however you did the calculation) the sign test provides little evidence that
the underlying median of width-to-length ratios of Shoshoni rectangles is different
from the hypothesized value of 0.618.
You may have found this result quite surprising. The p value for the t-test that was carried out in Unit C1 was 0.0539, which could be interpreted as providing some evidence (though rather weak) against the hypothesis that the population mean is 0.618. (See Unit C1, Example 3.1.) There are various possible reasons for the large difference in p values between the two tests, among them the following. First, the two tests were actually of different hypotheses: the t-test is a test for the population mean, whereas the sign test is a test for the population median. In fact, the curved shape of the probability plot in Figure 1.3 indicates that the data have a skew distribution; therefore the mean will not be equal to the median, and it could thus be the case that the population median is 0.618 while the population mean is different from 0.618. Secondly, and much more likely to be important here, is the following. Maybe the population median really does differ from 0.618, but the sign test is simply not powerful enough in this case to detect the difference. In other words, maybe a Type II error has occurred. In fact, in most contexts, the sign test has relatively low power (and indeed further analysis of these data provides an indication that this lack of power is the real problem here).
The question of the power of the sign test is discussed further in the next
subsection. This subsection concludes with a summary of the procedure for
performing the sign test.
The sign test
The sign test is a test on a single sample of data, which (in many
applications of the test) may have arisen as a set of differences between
matched pairs.
1 Determine the null and alternative hypotheses. The null hypothesis for
the test takes the form
H0 : m = m0 .
For a two-sided test, the alternative hypothesis is
H1 : m = m0 .
For a one-sided test, the alternative hypothesis is either
H1 : m > m0 or H1 : m < m0 ,
as appropriate. Here m denotes the underlying population median,
and m0 is some specified value of interest. In a test on differences
between matched pairs, m0 is usually 0.
2 Delete any data values that are equal to m0 from the sample.
3 Calculate the test statistic. In principle, this is done by subtracting m0
from the remaining data values, and counting how many of the resulting
values are positive. (In practice, since the signs are all that matter, this
can be done by simply counting how many of the data values are greater
than m0 .) The test statistic is the number of positive signs.
4 Under the null hypothesis, the test statistic has the binomial distribution B(n, 1/2), where n is the sample size (after deleting any data values equal to m0). Find the significance probability using this distribution.
5 State your conclusions.
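This five-step procedure is easy to sketch in code. The short Python function below (the name `sign_test` is just for illustration; it is not the course's MINITAB routine) carries out the two-sided version of the test, doubling the smaller binomial tail probability, and reproduces the Shoshoni result of Example 1.2:

```python
from math import comb

def sign_test(values, m0):
    """Two-sided sign test of H0: m = m0 for a single sample."""
    diffs = [v - m0 for v in values if v != m0]   # step 2: drop values equal to m0
    n = len(diffs)
    plus = sum(1 for d in diffs if d > 0)         # step 3: count the + signs
    # step 4: smaller tail probability under B(n, 1/2), doubled for a two-sided test
    tail = min(sum(comb(n, x) for x in range(plus, n + 1)),   # P(X >= plus)
               sum(comb(n, x) for x in range(0, plus + 1)))   # P(X <= plus)
    p = min(1.0, 2 * tail / 2**n)
    return plus, n, p

# Shoshoni rectangle ratios from Table 1.3, tested against m0 = 0.618
ratios = [0.693, 0.662, 0.690, 0.606, 0.570, 0.749, 0.672, 0.628, 0.609, 0.844,
          0.654, 0.615, 0.668, 0.601, 0.576, 0.670, 0.606, 0.611, 0.553, 0.933]
statistic, n, p = sign_test(ratios, 0.618)
print(statistic, n, round(p, 3))   # 11 + signs out of n = 20, p = 0.824
```

For a one-sided test, the appropriate single tail probability would be used instead of doubling.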
1.2 The Wilcoxon signed rank test
The sign test is of historical interest, but it is now rarely used in practice. The
reason for this is that it simply ignores too much valuable information, and as a
result is not very powerful. That is, it is prone to failing to reject a null
hypothesis except when the evidence against the null hypothesis is very clear. The
information ignored is information on the size of the differences. Among the
differences in Example 1.1 for patients with glaucoma, patient 4 with a difference
of −18 microns is given exactly the same importance in the analysis as patient 1
with a difference of 4 microns. In most situations this seems unsatisfactory, and it
fell to Frank Wilcoxon in 1945 to propose a method of testing which takes into account the size of the differences.

Wilcoxon, F. (1945) Individual comparisons by ranking methods. Biometrics Bulletin, 1, 80–83.

Wilcoxon’s idea was to replace the individual differences by ranks. An overview
of the method is as follows. Ranks are allocated to the absolute values of the
differences, the smallest being given a rank of 1, the next smallest a rank of 2, and
so on. These ranks are then allocated to two groups: ranks for differences with a
positive sign are allocated to one group; ranks for differences with a negative sign
are allocated to the other group. The ranks are added up separately for the two
sign groups. If the total for one of the sign groups is very small (which means that
the total for the other sign group is very large, because they add up to a fixed
total), then the null hypothesis of zero median difference is rejected. The test is
called the Wilcoxon signed rank test. An example should make the method
clear.
Example 1.3 Corneal thickness — the Wilcoxon signed rank test
Consider again the paired data from Example 1.1 on corneal thickness in patients
with glaucoma. Table 1.4 separates the absolute values of the differences from
their associated signs, and shows the ranks of the absolute values.
Table 1.4 Corneal thickness in patients with glaucoma (microns), with differences and their ranks

Patient                                 1     2     3     4     5     6     7     8
Glaucomatous eye                      488   478   480   426   440   410   458   460
Normal eye                            484   478   492   444   436   398   464   476
Sign of difference                      +           −     −     +     +     −     −
Absolute value of difference            4     0    12    18     4    12     6    16
Rank of absolute value of difference   1½          4½     7    1½    4½     3     6
There are two important things to notice in Table 1.4. The first is that the difference of zero for patient 2 has not been included in the ranking; it is ignored, and the sample size is taken as 7 instead of 8, just as for the sign test. The second is that, where two differences have the same absolute value, an average rank is given. In the table (after ignoring the zero difference), the two lowest absolute differences are tied on 4. Since the two lowest ranks are 1 and 2, each of these two lowest differences is allocated rank ½(1 + 2) = 1½. The same has happened where two absolute differences are tied on 12. If they had had very slightly different values instead of being equal, they would have had ranks 4 and 5, so each is given a rank of ½(4 + 5) = 4½.

(As for the sign test, there is an alternative procedure in which zeros are incorporated into the analysis. Except for very small samples, this alternative approach does not usually lead to substantially different conclusions.)
Now the signs are taken into account. The sum of the ranks for the positive differences is

    w+ = 1½ + 1½ + 4½ = 7½.

The sum of the ranks for the negative differences is

    w− = 4½ + 7 + 3 + 6 = 20½.

If either of these sums were particularly large or particularly small, this would provide evidence against the null hypothesis of zero median difference. In general, the sum w+ + w− is equal to 1 + 2 + · · · + n = ½n(n + 1), where n is the sample size (after excluding zeros). (The use of average ranks makes this true even where there are ties. In this case, w+ + w− = 28; the sample size n is 7.) Thus, for a given sample size, w+ is small exactly when w− is large, and vice versa. Thus we can concentrate on just one of these quantities. The test statistic for the Wilcoxon signed rank test is w+ : under the null hypothesis of zero median difference, values of w+ that are extremely small or extremely large will lead to rejection of the null hypothesis (for a two-sided test).
The null distribution of the test statistic w+ is different for each value of n. It is rather complicated, and is in general calculated using a complete enumeration of cases. So to obtain the p value for a Wilcoxon signed rank test, a computer is generally used. As for most of the tests you have met, the null distribution of the test statistic is symmetric, so that the p value for a one-sided test is exactly half that for a two-sided test.

The significance probability for a two-sided test with w+ = 7½ is 0.344. (This is not the same as the p value given by MINITAB, which uses an approximation to calculate p values for the Wilcoxon signed rank test.) On this analysis (as for the t-test and the sign test) there is little evidence of a non-zero difference in corneal thickness.
In Subsection 1.1, you saw that the sign test can be used to test the null
hypothesis that a single sample of data is drawn from a population with a
specified median m0 . Similarly, the Wilcoxon signed rank test can be used to test
such a null hypothesis simply by subtracting the specified median from each data
value to obtain a set of differences. The procedure for performing the Wilcoxon
signed rank test for zero median difference is summarized in the following box.
The Wilcoxon signed rank test for zero difference
The Wilcoxon signed rank test is a test on a single sample of data, which (in
many applications of the test) may have arisen as a set of differences
between matched pairs.
1 Determine the null and alternative hypotheses. The null hypothesis for
the test takes the form
H0 : m = m0 .
For a two-sided test, the alternative hypothesis is
 m0 .
H1 : m =
For a one-sided test, the alternative hypothesis is either
H1 : m > m0 or H1 : m < m0 ,
as appropriate. Here m denotes the underlying population median.
2 Obtain a set of differences d1 , d2 , . . . , dn with 0s deleted.
3 Without regard to their sign, order the differences from least (that is,
nearest to zero) to greatest, and allocate rank i to the ith absolute
difference. In the event of ties, allocate the average rank to the tied
differences.
4 Now consider again the signs of the original differences. Denote by w+
the sum of the ranks of the positive differences. This is the Wilcoxon
signed rank test statistic.
5 Obtain the significance probability p (usually by using computer
software).
6 State your conclusions.
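The ranking steps above can be sketched in a few lines of Python (the function name is illustrative, and this is not MINITAB's own routine). Applied to the corneal-thickness differences of Table 1.4, it reproduces w+ = 7½:

```python
def wilcoxon_w_plus(diffs):
    """Wilcoxon signed rank statistic w+ (zeros dropped, average ranks for ties)."""
    diffs = [d for d in diffs if d != 0]            # step 2: delete zero differences
    ordered = sorted(diffs, key=abs)                # step 3: order by absolute value
    positions = {}                                  # rank positions for each tied |d|
    for i, d in enumerate(ordered, start=1):
        positions.setdefault(abs(d), []).append(i)
    avg_rank = {a: sum(r) / len(r) for a, r in positions.items()}
    # step 4: sum the (average) ranks of the positive differences
    return sum(avg_rank[abs(d)] for d in diffs if d > 0)

# Differences (glaucomatous eye − normal eye) for the eight patients in Table 1.4
diffs = [4, 0, -12, -18, 4, 12, -6, -16]
print(wilcoxon_w_plus(diffs))   # 7.5, i.e. w+ = 7½
```

The statistic w− could be obtained in the same way by summing over the negative differences, or more simply as ½n(n + 1) − w+.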
Activity 1.3 Sleep gain — the Wilcoxon signed rank test
In Activity 1.1, you used the sign test to analyse some data on the possible
hypnotic effect in humans of the drug L-hyoscyamine hydrobromide. Ten
individuals had their sleep time measured before and after taking this drug. The
differences in sleep time (time after taking the drug − time before taking the
drug) for ten individuals who took this drug were given in Table 1.2.
In this activity you are asked to analyse the data using the Wilcoxon signed rank
test. The hypotheses are the same as for the sign test:
H0 : m = 0, H1 : m > 0,
where m is the (population) median sleep gain.
(a) Calculate the Wilcoxon signed rank test statistic for the data.
(b) The p value for the Wilcoxon test is 0.003. What do you conclude? How does
this compare with your conclusion in Activity 1.1 where you used the sign
test?
In Activities 1.4 and 1.5, you are asked to use the Wilcoxon signed rank test to
investigate the Shoshoni rectangle data given in Example 1.2. In fact, there is a
snag in using the test with this data set. However, ignore that for now and
proceed with the test. (This snag is discussed later in the subsection.)
Activity 1.4 Shoshoni rectangles — the Wilcoxon signed rank test
Obtain a table of differences for the data on width-to-length ratios of Shoshoni
rectangles by subtracting the value in the null hypothesis, 0.618, from each of the
data values in Table 1.3. Allocate ranks to the absolute values of the differences,
and hence calculate the Wilcoxon signed rank test statistic, w+ .
Activity 1.5 Shoshoni rectangles — drawing a conclusion
The Wilcoxon signed rank test was used to test the null hypothesis in Activity 1.4
against a two-sided alternative hypothesis. The p value for the test is 0.088. What
can you conclude? How do your conclusions compare with those from the sign
test (see Example 1.2)?
Comment
This significance probability provides weak evidence against the null hypothesis
that the median width-to-length ratio is 0.618 (but considerably more than was
the case for the sign test, for which the significance probability was 0.824).
Looking at the data, it would seem that the median width-to-length ratio may be
greater than 0.618.
There is a sense in which the central limit theorem operates with the Wilcoxon
signed rank test statistic: provided that the number of differences is sufficiently
large, a normal approximation to the null distribution of the test statistic may be
used. This approximation is described in the following box.
Normal approximation to the null distribution of the Wilcoxon test statistic
Under the null hypothesis of zero median difference, for a sample of size n (excluding any zero differences), the random variable W+, whose observed value is the Wilcoxon test statistic w+, has mean and variance given by

    E(W+) = n(n + 1)/4,    V(W+) = n(n + 1)(2n + 1)/24.

The distribution of

    Z = (W+ − E(W+))/SD(W+)

is approximately standard normal. (Here, SD(W+) denotes the standard deviation of W+, that is, the square root of V(W+).) The approximation is quite good, but should not be used for sample sizes that are very small. (As a rule of thumb, the normal approximation is generally adequate as long as the sample size n is at least 16.)
Example 1.4 Corneal thickness — normal approximation for the Wilcoxon signed rank test
In Example 1.3 we looked at differences in corneal thickness in patients with one normal and one glaucomatous eye. There were seven such differences (excluding one with zero difference), so

    E(W+) = (7 × 8)/4 = 14,
    V(W+) = (7 × 8 × 15)/24 = 35.

In this case, the observed sum of ranks for the positive differences is w+ = 7½, so the corresponding observed value of Z is

    z = (w+ − 14)/√35 = (7½ − 14)/√35 ≈ −1.10.

Using the table of probabilities of the standard normal distribution in the Handbook gives

    P(Z ≤ −1.10) = 1 − Φ(1.10) = 0.1357 ≈ 0.136.

So, according to the approximation, the probability of obtaining a Wilcoxon signed rank test statistic of 7½ or less is approximately 0.136. For a two-sided test, the p value is double this, that is, 0.272.
Here the sample size is only 7; the approximate p value is noticeably different from
the exact value of 0.344 given in Example 1.3. However, 7 is a lot less than the
minimum sample size of 16 given in the ‘rule of thumb’ for adequacy of the normal
approximation, so it is not very surprising that the approximation is poor.
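The arithmetic of Example 1.4 can be checked with a short Python sketch (the function name is illustrative; Φ is evaluated here from the error function rather than from tables):

```python
from math import erf, sqrt

def wilcoxon_normal_approx(w_plus, n):
    """Two-sided p value for w+ using the normal approximation to W+."""
    mean = n * (n + 1) / 4                       # E(W+)
    var = n * (n + 1) * (2 * n + 1) / 24         # V(W+)
    z = (w_plus - mean) / sqrt(var)
    phi = 0.5 * (1 + erf(-abs(z) / sqrt(2)))     # Phi(-|z|), the tail probability
    return z, 2 * phi                            # double the tail: two-sided test

# Corneal-thickness example: w+ = 7.5 from n = 7 non-zero differences
z, p = wilcoxon_normal_approx(7.5, 7)
print(round(z, 2), round(p, 3))   # z = -1.1 and p = 0.272, as in Example 1.4
```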
Activity 1.6 Shoshoni rectangles — normal approximation
In Activity 1.4, you found that the value of the Wilcoxon signed rank test statistic
is 151 for the data on width-to-length ratios of Shoshoni rectangles. Use a normal
approximation to test the null hypothesis that the Shoshoni rectangles conform
(on average) to the Greek golden ratio standard. Compare the p value you obtain
using the approximation with the exact p value (0.088) given in Activity 1.5.
Before leaving the Wilcoxon signed rank test, we should spend a little time
considering the assumptions behind it. The test is indeed nonparametric, in that
it does not involve an assumption that the data can be modelled by a particular
parametric family of distributions. But that does not mean that it does not
involve any assumptions about the population distribution. Its advantage over the
sign test is that it makes some use of the size of the differences, instead of just
using their signs. This, however, comes at a price. The null distribution of the
test statistic is found by assuming that an absolute difference with a particular
rank is just as likely to be associated with a positive difference as with a negative
one. Suppose, however, that the null hypothesis of the test (zero population
median difference) is true, but that the differences have a distribution that is not
symmetric about zero. For definiteness, suppose that the true distribution of
differences is right-skew, that is, that its upper tail (positive values) is more
spread out than its lower tail. Then, because we have assumed that the median
difference is zero, we would expect the number of positive differences to be about
the same as the number of negative differences, but on the whole the positive
differences would tend to be larger in absolute value than the negative differences.
In other words, a difference whose absolute value had a large rank would be more
likely to be positive than to be negative. If this were indeed the case, the null
distribution of the Wilcoxon test statistic would be wrong. Therefore, in order to
use information about the relative size of the differences and use the Wilcoxon
signed rank test, we must make an assumption that the differences can reasonably
be modelled by a symmetric distribution, at least under the null hypothesis. The
particular shape of the distribution does not matter at all, as long as it is
symmetric. If this is not the case, the test is not valid. In many circumstances,
particularly for differences in paired data, such an assumption of symmetry is
perfectly reasonable. However, this is not always so. Judging from the sample values, it looks as if it may well be inappropriate to model the Shoshoni rectangles data by a symmetric distribution, because the sample is quite heavily right-skew. (Its sample skewness is 1.75.) Thus, it may be inappropriate to use the Wilcoxon test with these data. One plausible explanation for the fact that both the one-sample t-test and the Wilcoxon signed rank test provide some evidence against the null hypothesis may be simply that both reflect the skewness of the data, rather than the possibility that the population mean or median is different from 0.618.

(Further investigation, using the transformation methods you will meet in Unit C3, indicates that the Wilcoxon test and the t-test do not provide misleading information in this case, even though the sample data are skew. That is, the data do really provide some weak evidence that the underlying mean or median is not 0.618, and the fact that the sign test failed to detect this is simply a consequence of its lack of power.)

Note that, since the Wilcoxon signed rank test involves an assumption of symmetry, the care that was taken to express its null hypothesis as

    H0 : m = 0

(where m is the population median) was actually misplaced. If the underlying distribution is symmetric, then its median is equal to its mean, so we could have written the null hypothesis as

    H0 : µ = 0
(with corresponding forms for the alternative hypothesis), just as for the
one-sample t-test.
1.3 The Mann–Whitney test
The idea of using ranks instead of the original data values, as in the Wilcoxon
signed rank test, is a logical and appealing one. Furthermore, it has an extension
to testing two groups of data when a two-sample t-test may not be applicable
because of lack of normality. This two-sample test was first proposed by H.B. Mann and D.R. Whitney in 1947, and later (in a different form that was shown to be equivalent) by Wilcoxon. Strictly speaking, a test based on ranks does not test the same null hypothesis as the t-test, which tests for equal means.

Mann, H.B. and Whitney, D.R. (1947) On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18, 50–60.

It can perhaps best be thought of as a test of the null hypothesis that both samples are drawn from the same population distribution, with no assumptions being made about the form of this distribution (so that, for instance, no
assumption of symmetry is needed). The null hypothesis in the two-sample t-test
is that the two population means are equal. But since two assumptions of the test
are that both populations have normal distributions and that the population
variances are equal, this means that, under the null hypothesis, the two samples
are drawn from the same normal distribution. Under the alternative hypothesis of
the two-sample t-test, the assumptions of normality and equal variance continue
to hold, but the population means differ; that is, the population distributions are
the same shape (in terms of spread, skewness and so on), but differ in location.
Theoretically, for the Mann–Whitney test, the alternative hypothesis can include
situations in which the two population distributions differ in shape as well as
location. However, in most situations, it makes sense to think of the alternative
hypothesis in terms of a difference in location between the two populations. In
this sense, the Mann–Whitney test is a valid alternative to the two-sample t-test,
without the necessity of assuming normal population distributions.
Suppose we have two independent samples, A and B: the Mann–Whitney test may be used to test the null hypothesis that the samples arise from the same population, using the following procedure. (This test is sometimes called the Mann–Whitney–Wilcoxon test, because of Wilcoxon’s involvement in its development.)
The Mann–Whitney test
The Mann–Whitney test is a test on two independent samples of data,
designed to investigate differences in location between the populations from
which the samples were drawn.
1 Determine the hypotheses. The null hypothesis is that the distributions of the populations from which the samples were drawn are identical (in terms of location and in every other respect); the alternative hypothesis is that the populations differ in location. (The alternative hypothesis can be two-sided or one-sided, depending on whether differences in both directions or just one specified direction are of interest. All the tests performed in this subsection are two-sided.)
2 Pool the two samples, keeping track of the sample to which each data value belongs. Then sort the combined data into ascending order.
3 Allocate a rank to each data value, the smallest being given rank 1. If two or more data values are equal, allocate the average of the ranks to each.
4 Add up the ranks for each sample, and write
uA = the sum of the ranks for sample A,
uB = the sum of the ranks for sample B.
Notice that if the size of sample A is nA and the size of sample B is nB ,
then the sum of uA and uB is
    uA + uB = 1 + 2 + · · · + (nA + nB) = ½(nA + nB)(nA + nB + 1).
This provides a useful check on arithmetic.
5 The Mann–Whitney test statistic is uA . Very small or very large
observed values provide evidence against the null hypothesis, suggesting
respectively that A-values are ‘too frequently’ smaller than or larger
than B-values.
Considering the test statistic uA as the observed value of a random variable UA, it
may then be compared with the null distribution of UA to yield a significance
probability for the test. As for the Wilcoxon signed rank test, the null
distribution is complicated: its calculation would normally require the use of a
computer. However, at least when there are no ties (values that are the same) in
the data, the null distribution is symmetric, so the p value for a two-sided test is
exactly double that for the corresponding one-sided test.
In Subsection 1.2, you saw that there is a normal approximation for the null
distribution of the Wilcoxon signed rank test statistic. There is also a normal
approximation for the null distribution of the Mann–Whitney test statistic UA ,
which may be used to calculate approximate p values. This approximation and a
rough rule for when the approximation is adequate are given in the following box.
Normal approximation to the null distribution of the Mann–Whitney test statistic
For independent samples of sizes nA and nB, the null distribution of the Mann–Whitney test statistic UA may be approximated by a normal distribution:

    UA ≈ N( nA(nA + nB + 1)/2 , nA nB(nA + nB + 1)/12 ).

This approximation can be used for quite modest values of nA and nB; say, each of size 8 or more. (This normal approximation is valid as long as the number of tied values, that is, values that are the same, in the pooled data set is not too great.)
This test, including the use of the normal approximation, is illustrated in Example 1.5.

Example 1.5 Dopamine activity
In a study into the causes of schizophrenia, 25 hospitalized patients with schizophrenia were treated with antipsychotic medication, and after a period of time were classified as psychotic or non-psychotic by hospital staff. A sample of cerebro-spinal fluid was taken from each patient and assayed for dopamine β-hydroxylase enzyme activity. The data are given in Table 1.5; the units are nmol/(ml)(hr)/mg of protein.

Sternberg, D.E., Van Kammen, D.P. and Bunney, W.E. (1982) Schizophrenia: dopamine β-hydroxylase activity and treatment response. Science, 216, 1423–1425.
Table 1.5 Dopamine β-hydroxylase activity (nmol/(ml)(hr)/mg)

(A) Judged non-psychotic
0.0104 0.0105 0.0112 0.0116 0.0130 0.0145 0.0154 0.0156
0.0170 0.0180 0.0200 0.0200 0.0210 0.0230 0.0252

(B) Judged psychotic
0.0150 0.0204 0.0208 0.0222 0.0226 0.0245 0.0270 0.0275
0.0306 0.0320
The Mann–Whitney test may be used to test the hypothesis that the distribution
of β-hydroxylase activity is the same for patients judged non-psychotic as for
patients judged psychotic. The data may be pooled and ranked as shown in
Table 1.6.
Table 1.6 Pooled and ranked data
0.0104 0.0105 0.0112 0.0116 0.0130 0.0145 0.0150
Sample A A A A A A B
Rank 1 2 3 4 5 6 7
0.0154 0.0156 0.0170 0.0180 0.0200 0.0200 0.0204
Sample A A A A A A B
Rank 8 9 10 11 12½ 12½ 14
0.0208 0.0210 0.0222 0.0226 0.0230 0.0245 0.0252
Sample B A B B A B A
Rank 15 16 17 18 19 20 21
0.0270 0.0275 0.0306 0.0320
Sample B B B B
Rank 22 23 24 25

It is fairly clear from the table that the values in the A sample on the whole have
smaller ranks than the values in the B sample, though it is not utterly
straightforward to make this comparison because of the different sample sizes.
The sample sizes are nA = 15 and nB = 10; so nA + nB = 25. Summing the ranks
for each sample gives
uA = 1 + 2 + 3 + · · · + 19 + 21 = 140,
uB = 7 + 14 + 15 + · · · + 24 + 25 = 185.
Their sum is 140 + 185 = 325. Note also that
    ½(nA + nB)(nA + nB + 1) = ½(25)(26) = 325.
This provides a useful check on your arithmetic if you are not using a computer.
The expected value of UA under the null hypothesis that the two samples are from identical populations is

    nA(nA + nB + 1)/2 = 15(15 + 10 + 1)/2 = 195.
The observed value uA = 140 is substantially smaller than this (in accord with
our observation that the A values tend to be smaller than the B values), but is it
significantly smaller? When there are ties in the data (as there are here), the null
distribution of UA can have a very complicated shape with many modes. Exact
computation of significance probabilities in such a context is quite difficult. A
computer gives the (two-sided) significance probability as p = 0.0015.
Alternatively, the variance of UA under the null hypothesis is

    nA nB(nA + nB + 1)/12 = (15 × 10 × 26)/12 = 325.

For the observed value uA = 140, the corresponding z value is

    z = (140 − 195)/√325 = −55/√325 ≈ −3.05.
So the approximate p value based on the normal approximation is
p = 2 × Φ(−3.05) = 0.0022 ≈ 0.002.
This is close to the exact p value.
The significance probability is very small, so there is strong evidence that the
distribution of dopamine activity is not the same in the two groups. The dopamine
activity in psychotic and non-psychotic patients appears to differ, and, looking at
the data, it would seem that those judged non-psychotic (the population
corresponding to sample A) have lower dopamine activity, on average.
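The ranking and the normal approximation of Example 1.5 can be reproduced with a short Python sketch (function and variable names are illustrative):

```python
from math import erf, sqrt

def mann_whitney_u(sample_a, sample_b):
    """Rank-sum statistic uA for sample A, using average ranks for ties."""
    pooled = sorted(sample_a + sample_b)
    # a tied value occupies a run of consecutive rank positions;
    # its average rank is the mean of the first and last position in the run
    rank = {v: (pooled.index(v) + 1 + pooled.index(v) + pooled.count(v)) / 2
            for v in set(pooled)}
    return sum(rank[v] for v in sample_a)

a = [0.0104, 0.0105, 0.0112, 0.0116, 0.0130, 0.0145, 0.0154, 0.0156,
     0.0170, 0.0180, 0.0200, 0.0200, 0.0210, 0.0230, 0.0252]   # non-psychotic
b = [0.0150, 0.0204, 0.0208, 0.0222, 0.0226, 0.0245, 0.0270, 0.0275,
     0.0306, 0.0320]                                           # psychotic

u_a = mann_whitney_u(a, b)
n_a, n_b = len(a), len(b)
mean = n_a * (n_a + n_b + 1) / 2                  # E(UA) = 195
var = n_a * n_b * (n_a + n_b + 1) / 12            # V(UA) = 325
z = (u_a - mean) / sqrt(var)
p = 2 * 0.5 * (1 + erf(-abs(z) / sqrt(2)))        # two-sided normal-approximation p
print(u_a, round(z, 2), round(p, 3))              # uA = 140.0, z = -3.05, p = 0.002
```

The exact p value of 0.0015 quoted in the example would need the enumeration-based null distribution (as computed by statistical software) rather than this approximation.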
Activity 1.7 Recall of pleasant and unpleasant memories
In a study of memory recall times, a series of stimulus words was shown to a subject on a computer screen. For each word, the subject was instructed to recall either a pleasant or an unpleasant memory associated with that word. Successful recall of a memory was indicated by the subject pressing a bar on the computer keyboard. The data on recall times (in seconds) for twenty pleasant and twenty unpleasant memories are given in Table 1.7.

Dunn, G. and Master, M. (1982) Latency models: the statistical analysis of response times. Psychological Medicine, 12, 659–665.

The distributions of the two samples are actually fairly skewed, so that the normality assumption required for a t-test may be inappropriate. Carry out a distribution-free test of the null hypothesis that there is no difference between the distributions of recall times for pleasant and unpleasant memories.

Table 1.7 Memory recall times (seconds)

Pleasant memory    Unpleasant memory
1.07               1.45
1.17               1.67
1.22               1.90
1.42               2.02
1.63               2.32
1.98               2.35
2.12               2.43
2.32               2.47
2.56               2.57
2.70               3.33
2.93               3.87
2.97               4.33
3.03               5.35
3.15               5.72
3.22               6.48
3.42               6.90
4.63               8.68
4.70               9.47
5.55               10.00
6.17               10.93

So far in this section, you have seen that, by the simple expedient of replacing data values by ranks, it is possible to carry out tests of statistical hypotheses without making detailed distributional assumptions. Of course, some of the information in the data has still been discarded by using the ranks rather than the original values: information on how far apart the data values are can no longer be used. But in most circumstances this loss is not very important. Even in the situation where the data really do come from normal distributions, the Wilcoxon signed rank test and the Mann–Whitney test are almost as powerful as their t-test analogues, and they can be applied in situations where t-tests cannot. (However, remember that the sign test is considerably less powerful than the t-test or the Wilcoxon signed rank test, in situations where all three are valid.)

In view of this, it may occur to you to wonder why statisticians do not simply use such nonparametric tests routinely, rather than putting themselves at risk of
making false assumptions of normality by using parametric methods like
the t-test. There are several reasons. In some cases, the nonparametric test does
have some sort of distributional assumptions attached (for example, symmetry for
the Wilcoxon signed rank test), so that we cannot forget about distributions
entirely. Some departures from the usual assumptions behind tests — such as the
assumption that the data are sampled randomly and are thus independent of one
another — interfere with nonparametric tests as much as with standard
parametric tests. Furthermore, for many of the more complicated statistical
testing procedures, there is no straightforward nonparametric alternative to the
parametric procedure. Thus, while nonparametric tests can be extremely useful,
they certainly do not solve all the awkward problems of significance testing!
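The Mann–Whitney rank calculation for the memory-recall data of Table 1.7 can also be sketched in a few lines of a general-purpose language. The following Python fragment is an aside, not part of M248 (which uses MINITAB): it pools the two samples, allocates average ranks to tied values, and sums the ranks of the pleasant-memory sample to give the test statistic uA.

```python
# Sketch (not course material): Mann-Whitney rank calculation for Table 1.7.
# Tied values receive the average of the ranks they occupy, as in the text.

pleasant = [1.07, 1.17, 1.22, 1.42, 1.63, 1.98, 2.12, 2.32, 2.56, 2.70,
            2.93, 2.97, 3.03, 3.15, 3.22, 3.42, 4.63, 4.70, 5.55, 6.17]
unpleasant = [1.45, 1.67, 1.90, 2.02, 2.32, 2.35, 2.43, 2.47, 2.57, 3.33,
              3.87, 4.33, 5.35, 5.72, 6.48, 6.90, 8.68, 9.47, 10.00, 10.93]

def average_ranks(values):
    """Rank all values from 1 upward, averaging ranks over tied groups."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1   # average of positions i..j, 1-based
        i = j + 1
    return ranks

pooled = pleasant + unpleasant
ranks = average_ranks(pooled)
u_A = sum(ranks[:len(pleasant)])   # rank sum for the pleasant-memory sample
```

For these data the rank sum comes out as 345½, in agreement with Solution 1.7; the two samples share one tied value (2.32), which receives the average rank 12½ in each.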
1.4 Nonparametric tests using MINITAB

The calculations for the Wilcoxon signed rank test and the Mann–Whitney test
are tedious except for very small samples, and in general are best done on a
computer. In this subsection you will learn how to use MINITAB to carry out
these tests and the sign test.
Summary of Section 1
In this section, the idea of a nonparametric (or distribution-free) test has been
introduced. You have learned how to perform the sign test and the Wilcoxon
signed rank test (which are both tests for the population median difference given
paired data or for the median of a single population). In most circumstances, the
Wilcoxon signed rank test is more powerful than the sign test, because it uses
more of the information in the data. You have also learned how to perform the
Mann–Whitney test for comparing the distributions of two populations given a
sample of data from each population. You have been introduced briefly to the
pros and cons of these tests compared with the corresponding t-tests, and you
have learned how to carry out the tests using MINITAB.
Exercise on Section 1
Exercise 1.1 Byzantine coins — nonparametric tests
Data are given in Table 1.8 on the silver content of coins from two different
coinages of the reign of Manuel I, Comnenus (1143–1180).

Hendy, M.F. and Charles, J.A. (1970) The production techniques, silver content
and circulation history of the twelfth-century Byzantine trachy. Archaeometry,
12, 13–21.

Table 1.8 Silver content of coins: first and fourth coinages (% Ag)

First coinage   5.9 6.8 6.4 7.0 6.6 7.7 7.2 6.9 6.2
Fourth coinage  5.3 5.6 5.5 5.1 6.2 5.8 5.8
It is not easy to tell from such small samples whether an assumption of normality
is appropriate, and there is certainly no strong evidence against their normality.
Nevertheless, in this exercise you are asked to carry out nonparametric tests using
these data.
(a) Suppose that an archaeologist wishes to investigate whether it is plausible
that the coins from the first coinage could come from a population where the
median silver content is 6.4%. Test this hypothesis using the sign test.
(b) Test the archaeologist’s hypothesis from part (a) using the Wilcoxon signed
rank test. Use the normal approximation to calculate the p value (even
though this will not be particularly accurate because of the small sample
size).
(c) Use the Mann–Whitney test to investigate whether it is plausible that the
silver contents of both of these sets of coins come from the same distribution.
Use the normal approximation to compute the p value.
2 A test for goodness of fit
The statistical tests you have encountered so far are generally used to investigate
scientific or other hypotheses of interest. This applies just as much to the tests you
learned about in Unit C1 as to the nonparametric tests described in Section 1.
In this section, a different type of test is described, which you can apply to test
the validity of a statistical model. In Blocks A and B, a variety of graphical
techniques were described which you can use to explore the validity of the
distributional assumptions you may have made. For example, you can investigate
whether the shape of a bar chart or histogram of the data mirrors that of your
chosen distribution. More formally, you can use an exponential probability plot to
investigate the validity of an exponential model, or a normal probability plot to
look at whether a normal model is reasonable.
These graphical techniques, though extremely useful, are necessarily exploratory
and generally involve a degree of subjective judgement. Hence, developing a
significance test of the null hypothesis that the model ‘fits’ the data would be
attractive.
In this section, a significance test is introduced that can be used with discrete
data to decide how reasonable it is to assume that a particular probability model
generated the data. The test provides a test of the null hypothesis that the model
‘fits’ the data, allowing for random variation consistent with the model, and is
thus referred to as a test of ‘goodness of fit’. In fact, the method can be adapted
to test for goodness of fit of continuous distributions. However, this will not be
covered here as the calculations, though quite simple, can become a little tedious.
In Subsection 2.1, a test statistic is developed; the distribution of the test statistic
is described in Subsection 2.2; and the test is illustrated in Subsection 2.3.
2.1 Goodness of fit of discrete distributions
The assessment of goodness of fit is based on quantifying the discrepancy between
the data observed and the values that are expected under the model. The method
is demonstrated through an example.

Example 2.1 Testing the fit of a Poisson model
In order to elucidate the phenomenon of radioactivity, the scientists Rutherford
and Geiger counted the number of alpha particles emitted from a radioactive
source during 2612 different intervals of 7.5 seconds duration; the data are
reproduced in Table 2.1. In 57 of these intervals there were zero emissions, in 203
there was a single emission, and so on. These observed frequencies Oi constitute
the data. Rutherford and Geiger observed that a Poisson model seemed to
provide a good fit to these data. How can this statement be made more precise?

Rutherford, E. and Geiger, H. (1910) The probability variations in the
distribution of alpha particles. Philosophical Magazine, Sixth Series, 20, 698–704.

Table 2.1 Emissions of alpha particles: counts and observed frequencies

Count i                   0   1   2   3   4   5   6   7   8   9  10  11  12  >12
Observed frequency Oi    57 203 383 525 532 408 273 139  49  27  10   4   2    0

The idea is to assume that the Poisson model is indeed correct and to calculate
the corresponding expected frequencies Ei of intervals containing 0, 1, 2, . . .
emissions. If the model fits the data well, then the differences Oi − Ei should be
small in magnitude. (The expression 'the Poisson model is correct' is convenient
shorthand for 'the data were randomly sampled from a Poisson distribution'.)

The first step in obtaining the expected frequencies is to calculate the mean
number of emissions in a 7.5-second interval. The sample mean is 3.877. If the
Poisson model is correct, then the probability that i emissions occur in
a 7.5-second interval is given by the corresponding Poisson probability:
   P(X = i) = e^(−3.877) 3.877^i / i!

A total of 2612 intervals were observed during the experiment. If the Poisson
model is correct, the expected number of intervals in which i emissions occur is
   Ei = 2612 × P(X = i) = 2612 × e^(−3.877) 3.877^i / i!
For example, the expected frequency for one emission is
   E1 = 2612 × e^(−3.877) × 3.877^1 / 1! ≈ 209.75;
and hence the difference between the observed and expected frequencies for one
emission is
   O1 − E1 ≈ 203 − 209.75 = −6.75.
When similar calculations are made for each emission frequency, the results are as
shown in the third and fourth columns of Table 2.2.
Table 2.2 Emissions of alpha particles: observed and expected
frequencies and differences between them, assuming a Poisson model
Count Observed frequency Expected frequency Difference
i Oi Ei Oi − Ei
0 57 54.10 2.90
1 203 209.75 −6.75
2 383 406.61 −23.61
3 525 525.47 −0.47
4 532 509.31 22.69
5 408 394.92 13.08
6 273 255.19 17.81
7 139 141.34 −2.34
8 49 68.50 −19.50
9 27 29.51 −2.51
10 10 11.44 −1.44
11 4 4.03 −0.03
12 2 1.30 0.70
> 12 0 0.53 −0.53
The differences Oi − Ei between the observed frequencies and the expected
frequencies appear ‘quite small’ in magnitude. However, to make this kind of
statement precise, a test statistic is required with which to quantify the overall
agreement, or lack of agreement, between the observed and expected frequencies.
Since we are not interested in whether the differences are positive or negative, but
only in their magnitudes, it makes sense to use squared differences in an overall
assessment. However, a problem arises when this is done: even though most of the
differences are small, when squared a few become inordinately large. For example,
the difference of 2.90 in an expected frequency of 54.10 (count = 0) is virtually
the same percentage difference as the difference of −23.61 in an expected count
of 406.61 (count = 2). Yet when the differences are squared, one is much larger
than the other, misrepresenting the discrepancy between the data and the
model. ■
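The expected frequencies in Example 2.1 are easy to verify numerically. Here is a Python sketch (an aside — the course software is MINITAB) that computes Ei = 2612 × P(X = i) for the Poisson(3.877) model and the differences Oi − Ei of Table 2.2.

```python
import math

# Check of Example 2.1: expected Poisson frequencies E_i = 2612 * P(X = i)
# with mean 3.877, as tabulated in Table 2.2.

n_intervals = 2612
mean = 3.877

def poisson_pmf(i, mu):
    """P(X = i) for X ~ Poisson(mu)."""
    return math.exp(-mu) * mu**i / math.factorial(i)

observed = [57, 203, 383, 525, 532, 408, 273, 139, 49, 27, 10, 4, 2]
expected = [n_intervals * poisson_pmf(i, mean) for i in range(13)]
# The final "> 12" category takes the remaining expected frequency.
expected.append(n_intervals - sum(expected))

differences = [o - e for o, e in zip(observed + [0], expected)]
```

Printing the rounded values reproduces the entries of Table 2.2; in particular E1 works out to 209.75 and O1 − E1 to −6.75, as in the text.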
The solution to the problem identified in Example 2.1 is to scale the squared
differences by dividing by the expected frequency. Thus the scaled squared
differences (Oi − Ei)²/Ei are used. These are then added up to give the overall
measure of goodness of fit:
   χ² = Σ (Oi − Ei)²/Ei .
(χ is the Greek letter chi and is pronounced 'kye'.)
This statistic is the chi-squared goodness-of-fit statistic. It was devised
in 1900 by Karl Pearson.
Note that χ² is zero when Oi = Ei for all categories, that is, when the observed
and expected frequencies are equal. Clearly, due to random variation, we would
not expect this to occur even when the model is correct. In order to assess what
values of χ² are consistent or inconsistent with the model, the distribution of χ² is
required. In fact, the goodness-of-fit test is based upon the approximate
distribution of χ² under the null hypothesis that the model is correct. This
distribution is described in Subsection 2.2.
2.2 The chi-squared distribution
A random variable W is said to have the chi-squared distribution
with r degrees of freedom if W is the sum of the squares of r independent
observations Z1, Z2, . . . , Zr on the standard normal random variable Z:
   W = Z1² + Z2² + · · · + Zr².
This is written
   W ∼ χ²(r).
This defines a family of distributions which, like the family of t-distributions, is
indexed by a parameter called the degrees of freedom.
A chi-squared random variable takes only strictly positive values. The p.d.f.s of
chi-squared distributions with 1, 2, 3 and 8 degrees of freedom are shown in
Figure 2.1.
Notice that for smaller values of the degrees of freedom, the distribution is very
right-skew. For larger values, the distribution is more symmetrical. In fact, as the
number of degrees of freedom increases, the distribution approaches a normal
distribution. This is hardly surprising: for large values of r, W may be regarded
as the sum of a large number of independent identically distributed random
variables, and hence by the central limit theorem the distribution of W is
approximately normal.
Figure 2.1 The p.d.f.s of four chi-squared distributions
Example 2.2 The mean and variance of χ²(1)

Suppose that W is a chi-squared random variable with one degree of freedom.
The mean of W is quite easy to obtain. Since W = Z², where Z ∼ N(0, 1),
   µW = E(W) = E(Z²).
But V(Z) = E(Z²) − (E(Z))², so
   µW = V(Z) + (E(Z))² = σZ² + µZ².
But Z ∼ N(0, 1), so µZ = 0 and σZ² = 1. Therefore
   µW = 1 + 0² = 1.
It is not quite so easy to obtain the variance of W, and the details will not be
given. In fact,
   σW² = 2. ■
Activity 2.1 The mean and variance of a chi-squared random variable

The chi-squared random variable W with r degrees of freedom is defined to be
   W = Z1² + Z2² + · · · + Zr²,
where the Zi, i = 1, 2, . . . , r, are independent observations on the standard normal
random variable Z. Show that the mean and variance of W are given by
   µW = r,   σW² = 2r.
The definition of a chi-squared distribution and the results of Activity 2.1 are
summarized in the following box.

The chi-squared distribution

The continuous random variable W given by
   W = Z1² + Z2² + · · · + Zr²,
which is the sum of r independent squared observations on the standard
normal random variable Z, is said to have a chi-squared distribution
with r degrees of freedom. This is written
   W ∼ χ²(r).
The mean and variance of W are given by
   µW = r,   σW² = 2r.
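The results in the box can be illustrated by simulation. This Python sketch (an aside, not course material) generates many observations on W = Z1² + · · · + Zr² for r = 8 and compares the sample mean and variance with the theoretical values r = 8 and 2r = 16.

```python
import random

# Simulation check of the box above: for W = Z1^2 + ... + Zr^2, with the Zi
# independent standard normal, the sample mean of W should be near r and
# the sample variance near 2r.

random.seed(2)
r = 8
n_samples = 100_000

w_values = []
for _ in range(n_samples):
    w = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(r))
    w_values.append(w)

mean_w = sum(w_values) / n_samples
var_w = sum((w - mean_w) ** 2 for w in w_values) / (n_samples - 1)
```

With this many samples, both estimates should lie well within simulation error of 8 and 16 respectively.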
Notice that the p.d.f. of the random variable W ∼ χ²(r) has not been given. This
is because it is quite complicated and is not useful for present purposes. For
instance, the p.d.f. does not yield an explicit formula for calculating probabilities
of the form FW (w) = P (W ≤ w). These probabilities need to be computed, or
deduced from tables.
A different table would be required for each value of the degrees-of-freedom
parameter, so a comprehensive listing of tail probabilities would require many
pages. Most published tables contain only selected quantiles of the chi-squared
distribution for a range of values of the degrees of freedom. A table of quantiles of
chi-squared distributions is given in the Handbook. This table is used in a similar
way to the table of quantiles for t-distributions. Table 2.3 contains the 0.01-, 0.05-
and 0.95-quantiles of chi-squared distributions with degrees of freedom up to 10.

Table 2.3 Selected quantiles of chi-squared distributions

df      0.01    0.05    0.95
 1    0.0001  0.0039    3.84
 2     0.020   0.103    5.99
 3     0.115   0.352    7.81
 4     0.297   0.711    9.49
 5     0.554    1.14   11.07
 6     0.872    1.64   12.59
 7      1.24    2.17   14.07
 8      1.65    2.73   15.51
 9      2.09    3.33   16.92
10      2.56    3.94   18.31

(df stands for 'degrees of freedom'.)

Example 2.3 Using the table

The 0.95-quantile of χ²(5) is the number in the row labelled 5 (df = 5) and in the
column headed 0.95, which is 11.07. Similarly, the 0.01-quantile of χ²(7) is 1.24,
and the 0.05-quantile of χ²(2) is 0.103. ■

Activity 2.2 Tail probabilities for chi-squared distributions

Use the table of quantiles for chi-squared distributions in the Handbook to answer
the following.
(a) Find the 0.01-quantile of W, where W ∼ χ²(18).
(b) Find the value w such that P(W > w) = 0.05, where W ∼ χ²(12).
(c) Find the best possible lower bound and the best possible upper bound on
    P(W > 12.03), where W ∼ χ²(4).
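As a numerical check on Table 2.3 (an aside — in M248 the quantiles are read from the Handbook), the c.d.f. of a chi-squared distribution can be computed from the standard power series for the regularized lower incomplete gamma function, since P(W ≤ w) = P(df/2, w/2). Evaluating this c.d.f. at a tabulated quantile should recover the corresponding probability. The Python sketch below does this using only the standard library.

```python
import math

# Numerical check of Table 2.3. P(W <= w) for W ~ chi-squared(df) is the
# regularized lower incomplete gamma function P(df/2, w/2), computed here
# from its power series.

def chi2_cdf(w, df):
    a, x = df / 2.0, w / 2.0
    if x <= 0:
        return 0.0
    term = 1.0 / a
    total = term
    for k in range(1, 500):
        term *= x / (a + k)
        total += term
    return math.exp(a * math.log(x) - x - math.lgamma(a)) * total

# Evaluating the c.d.f. at tabulated quantiles should return the
# corresponding probabilities (to the accuracy of the table).
check_95_df5 = chi2_cdf(11.07, 5)    # close to 0.95
check_01_df7 = chi2_cdf(1.24, 7)     # close to 0.01
check_05_df2 = chi2_cdf(0.103, 2)    # close to 0.05
```

A tail probability of the kind needed for a goodness-of-fit test is then 1 − chi2_cdf(w, df).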
2.3 The chi-squared goodness-of-fit test
In Subsection 2.1, the chi-squared goodness-of-fit test statistic was defined to be
   χ² = Σ (Oi − Ei)²/Ei .
This statistic can be calculated for any discrete model. If the model is correct, the
distribution of the test statistic is approximately chi-squared, with degrees of
freedom that depend on the number of categories and the assumed model. The
chi-squared goodness-of-fit test is described in the following box.
The chi-squared goodness-of-fit test

Suppose that, in a random sample of size n, each observation can be
classified into one of k distinct classes or categories and that the number of
observations out of a total of n that fall into category i is denoted by Oi.
Suppose that a model is set up, including p parameters whose values are
estimated from the data, and that, according to the model, the probability
of an observation falling into category i is θi. Then the expected number of
observations falling into category i is denoted by Ei, and Ei = nθi.
The chi-squared goodness-of-fit test statistic for the model is
   χ² = Σ(i=1 to k) (Oi − Ei)²/Ei .
If the model is correct then, for large n, the distribution of the test statistic
is approximately chi-squared with k − p − 1 degrees of freedom:
   χ² = Σ(i=1 to k) (Oi − Ei)²/Ei ≈ χ²(k − p − 1).   (2.1)
Result (2.1) is presented without proof. You should merely note that it is a
consequence of the central limit theorem. But how good is the approximation?
Here is a simple rule to follow.
The validity of the chi-squared approximation

The chi-squared approximation to the null distribution of the chi-squared
goodness-of-fit test statistic is adequate if no expected frequency Ei is less
than 5. Otherwise the approximation may not be adequate.
Example 2.4 Calculations for the data on emissions of alpha particles

The calculations in Example 2.1 for the data on emissions of alpha particles fit
into the framework of the chi-squared goodness-of-fit test as follows. The total
number of observations is 2612 (so n = 2612), and there are 14 categories. Fitting
the Poisson model requires the estimation of the mean by the sample
mean µ̂ = x̄ = 3.877; hence the number of parameters estimated is 1 (so p = 1).
The probabilities θi are calculated as follows from the Poisson model:
   θi = e^(−3.877) 3.877^i / i!,   i = 0, 1, . . . , 12,
   θ13 = P(X > 12),
where X ∼ Poisson(3.877). (Note that, for convenience, i has been taken to range
from 0 to 13 rather than from 1 to 14.)
The expected frequencies were given in Table 2.2 (Ei = 2612θi). The values of Ei
corresponding to counts of 11, 12 and > 12 are all less than 5, thus violating the
rule that each Ei must be at least 5 for the chi-squared approximation to be valid.
This problem is overcome by pooling (combining) categories until all values of Ei
are greater than or equal to 5. This is achieved by replacing categories 11, 12
and > 12 with a single > 10 category, which has expected frequency
4.03 + 1.30 + 0.53 = 5.86. The resulting frequencies and the corresponding values
of (Oi − Ei)²/Ei are shown in Table 2.4.

Table 2.4 Calculating χ² for the data on emissions of alpha particles

   i     Oi       Ei    Oi − Ei   (Oi − Ei)²/Ei
   0     57    54.10      2.90       0.155
   1    203   209.75     −6.75       0.217
   2    383   406.61    −23.61       1.371
   3    525   525.47     −0.47       0.000
   4    532   509.31     22.69       1.011
   5    408   394.92     13.08       0.433
   6    273   255.19     17.81       1.243
   7    139   141.34     −2.34       0.039
   8     49    68.50    −19.50       5.551
   9     27    29.51     −2.51       0.213
  10     10    11.44     −1.44       0.181
> 10      6     5.86      0.14       0.003

There are now 12 categories (so k = 12), and the value of the goodness-of-fit test
statistic is
   χ² = Σ (Oi − Ei)²/Ei = 0.155 + 0.217 + · · · + 0.003 = 10.417 ≈ 10.4.
Under the null hypothesis that the Poisson model is correct, the distribution of χ²
is approximately a chi-squared distribution with k − p − 1 = 12 − 1 − 1 = 10
degrees of freedom.
Remember that the chi-squared test statistic measures the extent to which
observed frequencies differ from those expected under the assumed model: the
higher the value of χ², the greater the discrepancy between the data and the
model. Thus the appropriate test is one-sided: only high values of χ² indicate
that the model does not fit the data well. (It is possible to argue that low values
of χ² suggest a fit so good that the data are suspect, showing less variation than
might be expected, and hence that a two-sided test is required. However, this is
not the approach adopted in M248.)
The upper tail of χ²(10) cut off at 10.4 is shown in Figure 2.2: its area is 0.406.
Thus the significance probability is 0.406. There is little evidence against the null
hypothesis that the assumed Poisson model is correct. ■

Figure 2.2 The null distribution and the significance probability for the data on
emissions of alpha particles, assuming a Poisson model
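The whole of Example 2.4 can be reproduced in a short script. The following Python sketch (an aside — M248 carries out such calculations in MINITAB) computes the expected frequencies, pools the sparse categories into '> 10', evaluates χ², and obtains the significance probability from the upper tail of χ²(10).

```python
import math

# Sketch (not course material) reproducing Example 2.4: Poisson(3.877)
# expected frequencies for 2612 intervals, pooling of counts 11, 12 and
# > 12 into "> 10", the chi-squared statistic, and its p value.

observed = [57, 203, 383, 525, 532, 408, 273, 139, 49, 27, 10, 4, 2, 0]
n, mean = 2612, 3.877

expected = [n * math.exp(-mean) * mean**i / math.factorial(i) for i in range(13)]
expected.append(n - sum(expected))          # the "> 12" category

obs_pooled = observed[:11] + [sum(observed[11:])]   # 12 categories; last is "> 10"
exp_pooled = expected[:11] + [sum(expected[11:])]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(obs_pooled, exp_pooled))
df = len(obs_pooled) - 1 - 1                # k - p - 1, with one estimated parameter

def chi2_cdf(w, dof):
    """P(W <= w) for W ~ chi-squared(dof), from the power series for the
    regularized lower incomplete gamma function."""
    a, x = dof / 2.0, w / 2.0
    term = 1.0 / a
    total = term
    for k in range(1, 500):
        term *= x / (a + k)
        total += term
    return math.exp(a * math.log(x) - x - math.lgamma(a)) * total

p_value = 1.0 - chi2_cdf(chi_sq, df)        # upper tail of chi-squared(10)
```

This yields χ² ≈ 10.4 and a significance probability of about 0.41, as in the text; every pooled expected frequency is at least 5, so the chi-squared approximation is adequate.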
Example 2.5 Testing the fit of a binomial model
In an experiment on visual perception, a screen of 155 520 squares was created,
and a computer program was used to colour each square randomly either black or
white. The computer program assigned colours according to a long sequence of
Bernoulli trials with the predetermined probabilities
   P(black) = 0.29,   P(white) = 0.71.
After this was done, and before performing the experiment, the screen was
sampled to check whether the colouring algorithm had operated successfully.

Laner, S., Morris, P. and Oldfield, R.C. (1957) A random pattern screen.
Quarterly Journal of Experimental Psychology, 9, 105–108.

A total of 1000 larger squares, each containing sixteen of the small squares, was
randomly selected and i, the number of black squares in each larger square, was
counted. The observed frequencies Oi are shown in Table 2.5.
If the computer program worked as intended, then the observed values should be
consistent with 1000 observations from a binomial distribution, B(16, 0.29). This
hypothesis can be tested using a chi-squared goodness-of-fit test.
The expected frequencies Ei are also shown in Table 2.5. These were calculated as
follows. For i = 2, for example,
   E2 = 1000θ2 = 1000 × C(16, 2) × (0.29)² × (1 − 0.29)^14 ≈ 83.48.

Table 2.5 Counts on a random screen pattern

    i    Oi      Ei
    0     2    4.17
    1    28   27.25
    2    93   83.48
    3   159  159.13
    4   184  211.23
    5   195  207.07
    6   171  155.06
    7    92   90.48
    8    45   41.57
    9    24   15.09
   10     6    4.32
   11     1    0.96
   12     0    0.16
   13     0    0.02
   14     0    0.00
   15     0    0.00
   16     0    0.00

(The expected frequencies add up to 999.99 instead of 1000. This discrepancy,
which is due to rounding error, is not important and can be ignored.)

The first two and the last seven categories need to be pooled to obtain categories
with expected frequencies of 5 or more. The values obtained when this is done are
shown in Table 2.6. Values of (Oi − Ei)²/Ei are also shown.

Table 2.6 Counts on a random screen pattern: calculating the chi-squared
goodness-of-fit test statistic

Count i    Oi       Ei    Oi − Ei   (Oi − Ei)²/Ei
0 or 1     30    31.42     −1.42       0.064
2          93    83.48      9.52       1.086
3         159   159.13     −0.13       0.000
4         184   211.23    −27.23       3.510
5         195   207.07    −12.07       0.704
6         171   155.06     15.94       1.639
7          92    90.48      1.52       0.026
8          45    41.57      3.43       0.283
9          24    15.09      8.91       5.261
> 9         7     5.46      1.54       0.434
The value of the chi-squared goodness-of-fit test statistic is
   χ² = Σ (Oi − Ei)²/Ei = 0.064 + 1.086 + · · · + 0.434 = 13.007 ≈ 13.01.
After pooling there are 10 categories, so k = 10. No parameter has been
estimated, since the hypothesized binomial model was fully specified, including
the value of the Bernoulli parameter, 0.29. Thus p, the number of estimated
parameters, is 0. The null distribution of χ² is therefore approximately
chi-squared with k − p − 1 = 10 − 0 − 1 = 9 degrees of freedom. The significance
probability for a one-sided test (found using a computer) is 0.162. There is little
evidence against the hypothesis that the observations are from the binomial
distribution B(16, 0.29). ■
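Example 2.5 can be checked in the same way. This Python sketch (again an aside, not course material) computes the B(16, 0.29) expected frequencies, pools the first two and the last seven categories, and evaluates the test statistic and its significance probability.

```python
import math

# Sketch (not course material) reproducing Example 2.5: expected B(16, 0.29)
# frequencies for 1000 larger squares, pooling so that every expected
# frequency is at least 5, and the chi-squared goodness-of-fit statistic.

observed = [2, 28, 93, 159, 184, 195, 171, 92, 45, 24, 6, 1, 0, 0, 0, 0, 0]
n, size, prob = 1000, 16, 0.29

expected = [n * math.comb(size, i) * prob**i * (1 - prob)**(size - i)
            for i in range(size + 1)]

# Pool counts 0 and 1 into one category, and counts 10-16 into "> 9".
obs_pooled = [observed[0] + observed[1]] + observed[2:10] + [sum(observed[10:])]
exp_pooled = [expected[0] + expected[1]] + expected[2:10] + [sum(expected[10:])]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(obs_pooled, exp_pooled))
df = len(obs_pooled) - 0 - 1        # k - p - 1 with k = 10 categories, p = 0

def chi2_cdf(w, dof):
    """P(W <= w) for W ~ chi-squared(dof), from the power series for the
    regularized lower incomplete gamma function."""
    a, x = dof / 2.0, w / 2.0
    term = 1.0 / a
    total = term
    for k in range(1, 500):
        term *= x / (a + k)
        total += term
    return math.exp(a * math.log(x) - x - math.lgamma(a)) * total

p_value = 1.0 - chi2_cdf(chi_sq, df)
```

The computed statistic and significance probability agree with the values 13.01 and 0.162 quoted in the text.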
Activity 2.3 Leaves of Indian creeper plants
The leaves of the Indian creeper plant Pharbitis nil can be variegated or
unvariegated and, at the same time, faded or unfaded. In an experiment, plants
were crossed. Of 290 offspring plants observed, the four types of leaf occurred
with frequencies 187, 35, 37 and 31.

Bailey, N.T.J. (1961) Mathematical Theory of Genetic Linkage. Clarendon Press,
Oxford, p. 41.

(a) According to one genetic theory, the four types should have occurred in the
    ratios 9/16 : 3/16 : 3/16 : 1/16. Use a chi-squared test of goodness of fit to
    show that the data offer strong evidence against the theory.
(b) According to a more sophisticated theory, allowing for genetic linkage but
    requiring estimation of one model parameter, the hypothesized proportions
    are 0.6209, 0.1291, 0.1291, 0.1209, respectively. Use a chi-squared
    goodness-of-fit test to investigate the validity of this theory.
Summary of Section 2
In this section, the chi-squared distribution has been introduced and the
chi-squared goodness-of-fit test has been described. You have learned how to use
this test to investigate the goodness of fit of discrete models.
Exercise on Section 2
Exercise 2.1 Diseased trees
The ecologist E.C. Pielou was interested in the pattern of healthy and diseased
trees in a plantation of Douglas firs. Several lines of trees were examined. The
lengths of unbroken runs of healthy and diseased trees were recorded. The
observations made on a total of 109 runs of diseased trees are given in Table 2.7.

Pielou, E.C. (1963) Runs of healthy and diseased trees in transects through an
infected forest. Biometrics, 19, 603–614.

Table 2.7 Run lengths of diseased trees

Run length       1   2   3   4   5   6
Number of runs  71  28   5   2   2   1
There were no runs of more than six diseased trees. Pielou proposed that the
geometric distribution might be a good model for these data, and from the data
estimated the geometric parameter p to be 0.657. (Here, p is the proportion of
healthy trees in the plantation.)
Investigate the goodness of fit of the geometric model.
Summary of Unit C2
In this unit, you have learned about certain nonparametric tests, and about a
method for testing the goodness of fit of a probability model for discrete data. A
nonparametric or distribution-free significance test is a statistical testing
procedure that does not involve making specific assumptions about the form of
the distribution of the population(s) involved. This means, for example, that such
procedures can be used in place of t-tests when the populations involved cannot
be assumed to have a normal distribution. You have met two tests whose null
hypothesis is that a single sample of data, which might actually arise as a set of
differences between pairs of data, comes from a population whose median has a
specified value. The sign test discards much of the information in the data, is thus
not very powerful and is not often used in practice. The Wilcoxon signed rank test
uses more of the information in the data and generally has reasonable power, but
it involves the assumption that the distribution from which the data were drawn is
symmetric. You have also met the Mann–Whitney test, which is used to compare
the distributions of the populations from which two samples of data were drawn.
A family of distributions, called chi-squared distributions, has been introduced.
These are indexed by a parameter r, known as the degrees of freedom. The
chi-squared goodness-of-fit test for discrete probability models involves calculating
expected frequencies for each possible value of the random variable involved (that
is, for each category), and producing a summary measure of how these differ from
the frequencies that were actually observed. Under the null hypothesis that the
proposed model fits the data, the distribution of this summary measure is
approximately a chi-squared distribution. This approximate result for the null
distribution of the chi-squared goodness-of-fit test statistic is only valid if the
expected frequencies for the categories are not too small: if any of the expected
frequencies are less than 5, then some of the categories should be pooled before
the value of the test statistic is calculated.
Learning outcomes
You have been working towards the following learning outcomes.
Terms to know and use
Nonparametric test, distribution-free test, parametric test, sign test,
Wilcoxon signed rank test, Mann–Whitney test, goodness-of-fit test,
chi-squared distribution, chi-squared goodness-of-fit test statistic.
Symbols and notation to know and use
The notation w+ for the Wilcoxon signed rank test statistic.
The notation uA for the Mann–Whitney test statistic.
The notation χ²(r) for the chi-squared distribution with r degrees of freedom.
The notation χ² for the chi-squared goodness-of-fit test statistic.
Ideas to be aware of
That there are statistical tests that do not involve making specific
distributional assumptions about the data.
That, in general, nonparametric tests may involve certain broad
distributional assumptions, such as an assumption of symmetry.
That the validity of a particular model for a given set of data can be tested
using a goodness-of-fit test.
Statistical skills
Perform the sign test, the Wilcoxon signed rank test and the Mann–Whitney
test.
Perform a chi-squared goodness-of-fit test for discrete data.
Use the table of quantiles for chi-squared distributions in the Handbook.
Features of the software to use
Use MINITAB to carry out the sign test, the Wilcoxon signed rank test and
the Mann–Whitney test.
Solutions to Activities
Solution 1.1
There are nine + signs and one − sign. (There are no zeros to omit.) The test
statistic is the number of + signs, which is 9. The null distribution of the number
of + signs is B(10, ½). The significance probability is the probability of obtaining
nine or more + signs, under the null hypothesis. This is
   Σ(x=9 to 10) (10 choose x) (½)^10 = 10 × (½)^10 + 1 × (½)^10 = 11/1024 ≈ 0.011.
(This is the appropriate tail to consider, since under the alternative hypothesis
you would expect there to be many + signs. Since you are performing a one-sided
test, there is no requirement to consider the other tail and double the p value.)
The p value is small, 0.011. There is quite strong evidence against the null
hypothesis. We conclude it is highly likely that the median sleep gain is greater
than zero.

Solution 1.3
(a) The Wilcoxon signed rank test statistic can be calculated using Table S.1.

Table S.1

Patient                                1    2    3    4    5    6    7    8    9   10
Sign of difference in sleep time       +    +    +    +    −    +    +    +    +    +
Absolute value of difference         1.9  0.8  1.1  0.1  0.1  4.4  5.5  1.6  4.6  3.4
Rank of absolute value of difference   6    3    4   1½   1½    8   10    5    9    7

The test statistic w+ is the sum of the ranks associated with the positive
differences, so
   w+ = 6 + 3 + 4 + 1½ + 8 + 10 + 5 + 9 + 7 = 53½.
Note that, in this case, it may be slightly quicker to work out the sum of the
ranks for the negative differences (w− = 1½) and to use the fact that the sum of
all the ranks is ½n(n + 1) = ½ × 10 × 11 = 55, to give
   w+ = 55 − 1½ = 53½.
(b) The p value is very small, indicating strong evidence against the null
hypothesis. We conclude that it is very likely that the median sleep gain is
greater than zero.
This is in general terms the same conclusion as for the sign test. However, the
p value for the Wilcoxon signed rank test is even smaller than for the sign test
(0.003 compared with 0.011), indicating that the Wilcoxon test provides even
more evidence against the null hypothesis than the sign test does. This is an
illustration of the fact that, in many circumstances, the Wilcoxon signed rank
test is more powerful than the sign test.

Solution 1.4
Table S.2 shows the results of subtracting 0.618 from each entry in Table 1.3 and
allocating ranks.

Table S.2

Original value   0.693   0.662   0.690   0.606   0.570
Difference       0.075   0.044   0.072  −0.012  −0.048
Sign                 +       +       +       −       −
Rank                17      10      16      5½      11

Original value   0.749   0.672   0.628   0.609   0.844
Difference       0.131   0.054   0.010  −0.009   0.226
Sign                 +       +       +       −       +
Rank                18      14       4       3      19

Original value   0.654   0.615   0.668   0.601   0.576
Difference       0.036  −0.003   0.050  −0.017  −0.042
Sign                 +       −       +       −       −
Rank                 8       1      12       7       9

Original value   0.670   0.606   0.611   0.553   0.933
Difference       0.052  −0.012  −0.007  −0.065   0.315
Sign                 +       −       −       −       +
Rank                13      5½       2      15      20

There are no 0s and only two tied differences.
The value of the test statistic w+ is the sum of the ranks associated with positive
differences. Thus
   w+ = 17 + 10 + 16 + 18 + 14 + 4 + 19 + 8 + 12 + 13 + 20 = 151.

Solution 1.6
The sample size is 20 and there are no zero differences, so n = 20. Therefore
   E(W+) = n(n + 1)/4 = (20 × 21)/4 = 105,
   V(W+) = n(n + 1)(2n + 1)/24 = (20 × 21 × 41)/24 = 717.5.
The observed value of the test statistic is w+ = 151, so
   z = (151 − 105)/√717.5 ≈ 1.72.
The table of probabilities for the standard normal distribution in the Handbook
gives
   P(Z ≤ 1.72) = Φ(1.72) = 0.9573 ≈ 0.957.
So there is a probability of 0.043 of being at least this far out into the
(right-hand) tail of the standard normal distribution. Since you are performing a
two-sided test, you need to consider the other tail as well. Thus the approximate
p value is 2 × 0.043 = 0.086. This is very close to the value given by the exact
test. The p value of 0.086 provides weak evidence that Shoshoni rectangles do not
conform to the Greek golden ratio standard.
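The arithmetic of Solutions 1.4 and 1.6 can be confirmed with a short script. This Python sketch (an aside — the course uses MINITAB) ranks the absolute differences from 0.618, sums the ranks of the positive differences to get w+, and applies the normal approximation.

```python
import math

# Sketch (not course material): checking Solutions 1.4 and 1.6. Differences
# from 0.618 are ranked by absolute value (with average ranks for ties),
# w+ is the sum of ranks of positive differences, and the normal
# approximation gives the two-sided p value.

data = [0.693, 0.662, 0.690, 0.606, 0.570, 0.749, 0.672, 0.628, 0.609, 0.844,
        0.654, 0.615, 0.668, 0.601, 0.576, 0.670, 0.606, 0.611, 0.553, 0.933]
diffs = [round(x - 0.618, 3) for x in data]   # round so tied values compare equal

abs_diffs = [abs(d) for d in diffs]
order = sorted(range(len(abs_diffs)), key=lambda i: abs_diffs[i])
ranks = [0.0] * len(diffs)
i = 0
while i < len(order):
    j = i
    while j + 1 < len(order) and abs_diffs[order[j + 1]] == abs_diffs[order[i]]:
        j += 1
    for k in range(i, j + 1):
        ranks[order[k]] = (i + j) / 2 + 1     # average rank for a tied group
    i = j + 1

w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

n = len(diffs)
mean_w = n * (n + 1) / 4                      # 105
var_w = n * (n + 1) * (2 * n + 1) / 24        # 717.5
z = (w_plus - mean_w) / math.sqrt(var_w)

# Two-sided p value from the standard normal distribution.
p_value = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
```

Running this reproduces w+ = 151, z ≈ 1.72 and a two-sided p value of about 0.086, as in the solutions.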
Solution 1.7  The appropriate test is the Mann–Whitney test. The ranks are given in Table S.3.

Table S.3
Pleasant   Rank   Unpleasant   Rank
memory            memory
1.07        1      1.45         5
1.17        2      1.67         7
1.22        3      1.90         8
1.42        4      2.02        10
1.63        6      2.32        12½
1.98        9      2.35        14
2.12       11      2.43        15
2.32       12½     2.47        16
2.56       17      2.57        18
2.70       19      3.33        25
2.93       20      3.87        27
2.97       21      4.33        28
3.03       22      5.35        31
3.15       23      5.72        33
3.22       24      6.48        35
3.42       26      6.90        36
4.63       29      8.68        37
4.70       30      9.47        38
5.55       32     10.00        39
6.17       34     10.93        40
Sums      345½                474½

If the group of pleasant memory recall times is labelled A, then the test statistic uA is 345½. There are only two tied values, and the samples are not particularly small, so the normal approximation should be adequate for calculating the p value.
   E(UA) = nA(nA + nB + 1)/2 = (20 × 41)/2 = 410,
   V(UA) = nAnB(nA + nB + 1)/12 = (20 × 20 × 41)/12 ≈ 1366.667.
The mean of UA under the null hypothesis is greater than the observed value, confirming the impression given by the table that the A values tend to have smaller ranks than the B values. The z value is
   z = (345.5 − 410)/√1366.667 ≈ −1.74.
So the approximate p value is
   2Φ(−1.74) = 2 × 0.0409 = 0.0818 ≈ 0.082.
Therefore there is some evidence against the null hypothesis that the distribution of recall times is the same for pleasant and unpleasant memories, but the evidence is weak. Looking at the data, it would seem that pleasant memories have shorter recall times, on average.

Solution 2.1  Since W is the sum of r independent observations Z1², Z2², . . . , Zr² on Z²,
   E(W) = E(Z1²) + E(Z2²) + · · · + E(Zr²),
   V(W) = V(Z1²) + V(Z2²) + · · · + V(Zr²).
So, since Z² has mean 1,
   E(W) = 1 + 1 + · · · + 1 = r;
and since Z² has variance 2,
   V(W) = 2 + 2 + · · · + 2 = 2r.

Solution 2.2  The following values were obtained using the table of quantiles of chi-squared distributions in the Handbook.
(a) q0.01 = 7.01.
(b) The 0.95-quantile of χ²(12) is required, so w = 21.03.
(c) The value 12.03 lies between the 0.975-quantile and the 0.99-quantile of χ²(4). Thus
   0.01 < P(W > 12.03) < 0.025.
(In fact, P(W > 12.03) = 0.0171.)

Solution 2.3
(a) The categories have expected frequencies given by
   Ei = nθi = 290θi,  i = 1, 2, 3, 4,
where
   θ1 = 9/16, θ2 = 3/16, θ3 = 3/16, θ4 = 1/16.
This leads to the values in Table S.4.

Table S.4  Pharbitis nil, simple theory
i    Oi     Ei        Oi − Ei    (Oi − Ei)²/Ei
1   187   163.125     23.875       3.494
2    35    54.375    −19.375       6.904
3    37    54.375    −17.375       5.552
4    31    18.125     12.875       9.146

The value of the chi-squared test statistic is
   χ² = Σ (Oi − Ei)²/Ei
      = 3.494 + 6.904 + 5.552 + 9.146
      = 25.096 ≈ 25.10.
There are four categories and no model parameters were estimated, so the null distribution of the test statistic has 4 − 0 − 1 = 3 degrees of freedom. The value 25.10 is greater than the 0.995-quantile of χ²(3), which is 12.84, so the significance probability is less than 0.005. (The actual value is about 0.000015.) This is very small, so there is strong evidence that the simple theory is flawed.
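The same kind of computer check works for the Mann–Whitney normal approximation in Solution 1.7. This is a minimal sketch, not part of the course software; the function name is my own:

```python
from math import erf, sqrt

def mann_whitney_normal_approx(u_a, n_a, n_b):
    """Two-sided p value for the Mann-Whitney rank-sum statistic U_A,
    using its normal approximation."""
    mean = n_a * (n_a + n_b + 1) / 2
    variance = n_a * n_b * (n_a + n_b + 1) / 12
    z = (u_a - mean) / sqrt(variance)
    phi = 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF at z
    return z, 2 * min(phi, 1 - phi)

# Recall-time data: u_A = 345.5 with n_A = n_B = 20
z, p = mann_whitney_normal_approx(345.5, 20, 20)
```

This gives z ≈ −1.74 and p ≈ 0.081; a table-based calculation that rounds z to −1.74 gives 0.082, the small difference being due only to rounding.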
(b) Allowing for genetic linkage, the expected frequencies are given by
   Ei = nθi = 290θi,  i = 1, 2, 3, 4,
where
   θ1 = 0.6209, θ2 = 0.1291, θ3 = 0.1291, θ4 = 0.1209.
This leads to the values in Table S.5.

Table S.5  Pharbitis nil, genetic linkage theory
i    Oi     Ei        Oi − Ei   (Oi − Ei)²/Ei
1   187   180.061     6.939       0.267
2    35    37.439    −2.439       0.159
3    37    37.439    −0.439       0.005
4    31    35.061    −4.061       0.470

The value of the chi-squared test statistic is
   χ² = Σ (Oi − Ei)²/Ei
      = 0.267 + 0.159 + 0.005 + 0.470
      = 0.901 ≈ 0.90.
The number of categories is again 4. However, one parameter has been estimated, so the null distribution of the test statistic has 4 − 1 − 1 = 2 degrees of freedom. The 0.1-quantile of χ²(2) is 0.211 and the 0.5-quantile is 1.39. The observed value 0.90 lies between these quantiles, so the significance probability is between 0.5 and 0.9. (The actual value is 0.638.) Hence there is little evidence against the theory. The model appears to fit the data well.
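The chi-squared calculations in both parts of Solution 2.3 can be reproduced directly from the observed counts and the model probabilities. A minimal sketch (the function name is my own, for illustration only):

```python
def chi_squared_statistic(observed, probs):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E,
    with expected frequencies E_i = n * theta_i."""
    n = sum(observed)
    return sum((o - n * q) ** 2 / (n * q) for o, q in zip(observed, probs))

observed = [187, 35, 37, 31]  # Pharbitis nil counts, n = 290

# (a) simple theory: theta = 9/16, 3/16, 3/16, 1/16
x2_simple = chi_squared_statistic(observed, [9/16, 3/16, 3/16, 1/16])

# (b) genetic linkage theory, with one parameter estimated from the data
x2_linked = chi_squared_statistic(observed, [0.6209, 0.1291, 0.1291, 0.1209])
```

Here x2_simple ≈ 25.10 (compared with χ²(3), since no parameters were estimated) and x2_linked ≈ 0.90 (compared with χ²(2), since one parameter was estimated).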
Solutions to Exercises
Solution 1.1
(a) To find the sign test statistic, you simply need to examine the data, counting a + sign for values greater than the hypothesized value of 6.4, a − sign for values below 6.4, and omitting values that are equal to the hypothesized value. This gives six + signs and two − signs; one value is omitted. Thus the sign test statistic is 6. In total there are eight + and − signs, so the null distribution of the test statistic is B(8, ½). Thus the significance probability is twice the probability of obtaining six or more + signs, under the null hypothesis. This is
   2 × Σ from x = 6 to 8 of (8 choose x)(½)^8 = 2 × (28 × (½)^8 + 8 × (½)^8 + 1 × (½)^8) = 2 × 37/256 ≈ 0.289.
This provides very little evidence against the null hypothesis that the underlying median silver content is 6.4%.
(b) This time you must begin by calculating the differences between the data values and the hypothesized value, 6.4. These differences, together with the appropriate ranks, are shown in Table S.6.

Table S.6
Original value   5.9    6.8   6.4   7.0   6.6   7.7   7.2   6.9    6.2
Difference      −0.5    0.4   0     0.6   0.2   1.3   0.8   0.5   −0.2
Sign             −      +           +     +     +     +     +     −
Rank             4½     3           6     1½    8     7     4½    1½

There is one 0 (which is subsequently omitted) and two pairs of ties (given average ranks). The test statistic w+ is the sum of the ranks associated with the positive differences, so it is 30. The total number of positive and negative differences is n = 8. Thus the mean and variance of the test statistic under the null distribution are given by
   E(W+) = n(n + 1)/4 = (8 × 9)/4 = 18,
   V(W+) = n(n + 1)(2n + 1)/24 = (8 × 9 × 17)/24 = 51.
This leads to a z value of
   z = (30 − 18)/√51 ≈ 1.68.
So the significance probability is
   p = 2Φ(−1.68) = 2 × 0.0465 = 0.093.
This provides weak evidence against the null hypothesis that the underlying median silver content of coins from the first coinage is 6.4%.
(A computer program that can calculate exact p values for Wilcoxon tests gave 0.117 for the significance probability. Thus the normal approximation is not very good in this case.)
(c) The ranks are given in Table S.7.

Table S.7
First     Rank   Fourth    Rank
coinage          coinage
5.9        7      5.1        1
6.2        8½     5.3        2
6.4       10      5.5        3
6.6       11      5.6        4
6.8       12      5.8        5½
6.9       13      5.8        5½
7.0       14      6.2        8½
7.2       15
7.7       16
Sums     106½               29½

If the group of silver contents of coins from the first coinage are labelled A, then the test statistic uA is 106½.
   E(UA) = nA(nA + nB + 1)/2 = (9 × 17)/2 = 76.5,
   V(UA) = nAnB(nA + nB + 1)/12 = (9 × 7 × 17)/12 = 89.25.
The mean of UA under the null hypothesis is less than the observed value, confirming the impression given by the table that the A values tend to have larger ranks than the B values. The z value is
   z = (106.5 − 76.5)/√89.25 ≈ 3.18.
So the approximate p value for the two-sided test is
   p = 2Φ(−3.18) = 0.0014 ≈ 0.001.
Therefore there is strong evidence against the null hypothesis that the distribution of silver contents is the same for the two coinages. The silver contents appear to differ, and looking at the data, it would seem that the first coinage contains more silver, on average. (Note that the normal approximation is quite good in this case: a computer program that can calculate exact p values for Mann–Whitney tests gave 0.00149.)
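The exact sign-test p value in Solution 1.1(a) is just a binomial tail probability, which takes only a few lines to compute. An illustrative sketch (the function name is my own):

```python
from math import comb

def sign_test_two_sided(n_plus, n_minus):
    """Exact two-sided sign-test p value: twice the tail probability
    of B(n, 1/2) at the larger of the two sign counts."""
    n = n_plus + n_minus
    k = max(n_plus, n_minus)
    tail = sum(comb(n, x) for x in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Silver-content data: six + signs, two - signs (one zero omitted)
p = sign_test_two_sided(6, 2)
```

This gives p = 74/256 ≈ 0.289, as in the hand calculation.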
Solution 2.1  For the geometric model, the probability of a run of length i (i = 1, 2, . . . , 6) is
   θi = (1 − p)^(i−1) p.
The observation that no run was greater than 6 must be accounted for by a category, with probability
   θ7 = P(X > 6) = (1 − p)^6.
The geometric parameter p has been estimated from the data to be 0.657. The expected frequencies under the geometric model are 109θi; these are shown in Table S.8.

Table S.8  Observed and expected frequencies
Run length    Oi     Ei
1             71    71.61
2             28    24.56
3              5     8.43
4              2     2.89
5              2     0.99
6              1     0.34
>6             0     0.18

To ensure that all expected values are at least 5, runs of length 3 or more are pooled. The calculations are set out in Table S.9.

Table S.9  Testing the geometric model
Run length    Oi     Ei      Oi − Ei   (Oi − Ei)²/Ei
1             71    71.61     −0.61       0.005
2             28    24.56      3.44       0.482
≥3            10    12.83     −2.83       0.624

The value of the chi-squared test statistic is
   χ² = Σ (Oi − Ei)²/Ei
      = 0.005 + 0.482 + 0.624
      = 1.111 ≈ 1.11.
There are three categories and one parameter has been estimated from the data (the geometric parameter p = 0.657), so the null distribution of the test statistic has 3 − 1 − 1 = 1 degree of freedom. The observed value 1.11 lies between the 0.5-quantile and the 0.9-quantile of χ²(1), so the significance probability is greater than 0.1. (The actual value is 0.292.) There is therefore little evidence against the geometric model. This confirms that Pielou's assumptions were reasonable.
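The expected frequencies in Table S.8 follow directly from the fitted geometric model. A minimal sketch (the function name is my own), using the estimate p = 0.657:

```python
def geometric_expected(n, p, max_run=6):
    """Expected frequencies under a geometric model for run lengths
    1..max_run, plus a final '> max_run' category."""
    probs = [(1 - p) ** (i - 1) * p for i in range(1, max_run + 1)]
    probs.append((1 - p) ** max_run)  # P(X > max_run)
    return [n * q for q in probs]

expected = geometric_expected(109, 0.657)  # 71.61, 24.56, 8.43, ...
```

Categories with small expected frequencies would then be pooled, as in Table S.9, before the chi-squared statistic is computed.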
UNIT C3 The modelling process
Study guide for Unit C3
This unit is shorter than average. You should schedule four study sessions,
including time for answering the TMA questions on the unit, for generally
reviewing and consolidating your work on this unit, and for completing TMA 03.
The unit contains relatively little new material, and thus gives you an opportunity
to revise many of the ideas you have met so far in M248.
Sections 1, 2 and 4 should be studied in order. As you work through the unit, you
will be asked to work through two chapters of Computer Book C. We recommend
that you work through the first (Chapter 7) at the point indicated in the text (in
Subsection 2.2). Section 3 consists of working through Chapter 8. You should not
do this before you have studied Section 2, but you could postpone this until after
Section 4 if this is more convenient for you.
One possible study pattern is as follows.
Study session 1: Sections 1 and 2. You will need access to your computer for this
session, together with Computer Book C.
Study session 2: Section 3. You will need access to your computer for this session,
together with Computer Book C.
Study session 3: Section 4.
Study session 4: TMA questions on Unit C3; complete TMA 03.
If you follow this study pattern, then the first session may be a long one.
Introduction
You have now met some of the most important ideas of statistics: you can
summarize the key features of a data set and can represent it graphically in
different ways; you have seen that variability in a population can be represented
by a probability distribution; with a few assumptions, you are able to use a
variety of distributions to represent both discrete and continuous random
variables; and you can use data to answer practical questions.
Each unit so far has dealt with a particular topic or method of statistics, which
has been illustrated by examples. In contrast, the focus of this unit is on the
statistical modelling process. You might think of the techniques you have
encountered as statistical tools. Having assembled your toolbox, the aim now is to
work out how to use it when confronted with a statistical problem.
The beginning of most statistical investigations is usually a practical problem. For
example, a medical researcher might want to know whether or not a treatment for
cancer works; an engineer might wish to estimate the tensile strength of a
particular material; a social scientist might seek to understand what factors
influence school performance; an economist might wish to predict future inflation
rates. In a statistical investigation, the problem is formulated in statistical terms,
appropriate data are collected and analysed, and the conclusions are summarized
in a statistical report. The journey from practical problem to statistical report is
best thought of as a research process, which can be represented by the flow chart
in Figure 0.1. Note that, in practice, the various stages of the modelling process
might arise in a slightly different order from that in Figure 0.1. For example, it is
sometimes more convenient to check assumptions after the model has been fitted.
[Figure 0.1  The modelling process]
Formulating the right questions, and designing studies to answer them, are
important statistical issues, though they are not dealt with in this course.
Typically, they involve collaborations with specialists in other
disciplines — medical doctors, engineers, social scientists, economists, and so
on — and usually require some knowledge of the particular application area.
However, this unit focuses on the issues of statistical modelling and reporting,
which are similar in all application areas. The starting point is therefore the
problem or question under consideration, together with the data that have been
collected to throw light on it. Thus, in terms of the flow chart in Figure 0.1, you
will begin at the box marked ‘Choose model’.
What is a model? In general terms, it is a simplified representation of the process
generating the data. The key component of a statistical model is the underlying
distribution from which the data are sampled; but the model might also include
other components — for example, transformations or other relationships (known
or presumed) between variables. (Transformations are discussed in
Subsection 2.3.) However, the terms ‘distribution’ and ‘model’ are
used interchangeably in much of this unit.
A suitable model to start with is one that reflects the important attributes of the
data. For example, if the data consist of measurements on a continuous variable,
then it makes sense to choose a continuous distribution to represent the
underlying variation. Also, the question to be answered will often suggest how the
model will be used — for example, to calculate a confidence interval or carry out a
t-test. Having chosen a model, you will need to check that it fits the data, and
that any assumptions required are satisfied. If either is not the case, then you will
need to alter the model in some way, or perhaps even try a completely different
one. You will then need to repeat the process, improving the model at each stage
until it is good enough for its purpose. Having chosen a model, the final stage of
the modelling process is to report your results.
Statistical modelling is an art as much as a science, requiring common sense and
judgement and, just occasionally, a little inspiration. It is as well to remember
that statistical models are at best idealizations of reality: you should not expect
to find a ‘perfect’ model. The real skill is in finding a model that is good enough
for your purposes, and from which you can draw valid conclusions.
In Section 1, some hints are given on how to get started, even before looking at
the data. The emphasis is on developing some a priori ideas about how you might
approach the analysis and the choice of model, knowing only the context of the
problem and the type of data collected. In Section 2, methods for exploring the
data are discussed with a view to choosing a model, transforming the data, and
checking the model. In Section 3, you will practise undertaking a complete
analysis using a variety of tools; the section consists of a chapter of Computer
Book C. Writing the statistical report is discussed in Section 4.
1 Choosing a model: getting started
In this section some guidance is provided about how to approach the task of
selecting a model. It is important to remember that there are no fixed rules for
this, only general principles — and even they, on occasion, might reasonably be set
aside, as you will see. Unusually for this course, no data are presented. This is
not because you should choose the model before you see the data. On the
contrary, it is recommended that before choosing a model you get a feel for the
data using graphical methods. Rather, the aim is to emphasize the importance of
the setting or context in which the data are obtained, and the type of data
collected, in structuring your ideas. This is often sufficient to narrow down the
possibilities to a small number of likely candidates for a model. These can then be
refined further (or perhaps set aside) after looking at histograms, bar charts or
other graphical displays.
In Subsection 1.1, we discuss how to choose between continuous and discrete
models. Having made this choice, you then need to select an appropriate discrete
or continuous distribution. The models available and factors to take into account
when choosing one of them are described in Subsections 1.2 and 1.3.
1.1 Continuous or discrete?
So far you have met the following ten families of probability models, which are
listed here in alphabetical order: Bernoulli, binomial, chi-squared, continuous
uniform, discrete uniform, exponential, geometric, normal, Poisson, t. Given a
data set, you can start narrowing down the likely contenders from among this list
of modelling distributions, by considering some key features of the data and the
context in which they are collected. Here are four examples, giving brief details of
the problem investigated and the type of data collected — for example, whether
the data are counts, proportions or continuous measurements.
Example 1.1 Fish traps
One hundred fish traps were set and later the number of fish caught in each trap
was counted. Questions of interest include the distribution of numbers of fish
caught, the proportion of traps with no catches, and whether or not catches are
‘clustered’ in some way.
(Source: David, F.N. (1971) A First Course in Statistics, 2nd edn. Griffin, London.)
Example 1.2 Heredity and head shape
In 25 families where there were at least two sons, measurements were taken on the
head length and head breadth (both measured in mm) of the first and second
sons. Head size can be measured as length + breadth, head shape as
100 × (breadth/length). One issue of interest is whether there is a difference
between the head shapes of first and second sons.
(Source: Frets, G.P. (1921) Heredity of head form in man. Genetica, 3, 193–384.)
Example 1.3 Interspike intervals
Motor cortex neuron interspike intervals of an unstimulated monkey were
measured (in milliseconds). The aims of the study were to describe the
distribution of waiting times between spikes, and to estimate the firing rate in the
absence of any stimulus.
(Source: Zeger, S.L. and Qaqish, B. (1988) Markov regression models for time series: a quasi-likelihood approach. Biometrics, 44, 1019–1031.)
Example 1.4 The great pigeonhole in the sky
Dr Sutherland worked in a large organization in which internal notes and
memoranda are sent in reusable envelopes. Each envelope has twelve spaces
(windows) for the names of recipients; new users cross out their own name, and
write in the next window the name of the person they wish to contact. Dr
Sutherland kept a count of how many names, including his, were written on some
of the envelopes he received. The purpose of the analysis is to describe the
distribution of used windows, and thus to obtain some idea of the age structure of
envelopes in circulation.
(Source: Sutherland, W. (1990) The great pigeonhole in the sky. New Scientist, 9 June, 73–74.)
Activity 1.1 First thoughts
For each of Examples 1.1 to 1.4, suggest a probability model that you think might
provide a reasonable starting point for statistical analysis.
Remember that the model you choose need not be perfect — at this stage, a ‘first
guess’ is all that is required. Suitable models are discussed later in this section, so
do not spend long on this activity.
For the purposes of statistical modelling, the first major distinction to be drawn is
whether the data should be modelled as discrete or continuous. In each of these
four examples, the choice is reasonably clear. For example, in Example 1.1 the
data are fish counts, so take the values 0, 1, 2, 3, . . .. Thus even before seeing any
data you may conclude that a first choice of model should be one suited to
discrete data: this narrows down your choice to the Bernoulli, binomial, discrete
uniform, geometric and Poisson models. In contrast, the data in Example 1.2 are
sizes, measured in millimetres. Thus a suitable first model for these data should
be a continuous distribution. Possible models include exponential, continuous
uniform and normal distributions.
Activity 1.2 Candidate distributions
For each of Examples 1.3 and 1.4, state whether the data are discrete or
continuous, and hence identify suitable candidate distributions for modelling these
data.
The distinction between discrete and continuous data is a fundamental one in
selecting a model. However, the distinction is not always clear-cut. For example,
in the discussion of Example 1.1 after Activity 1.1 it was taken for granted that
fish counts were low integers 0, 1, 2, . . .. But what if the counts were typically
hundreds of fish? The underlying distribution is still discrete, but the data might
reasonably be modelled as continuous, as an error of one fish in hundreds could be
considered to be negligible. Conversely, measurements on continuous variables are
often rounded to some fixed number of decimal places, and so may be regarded as
discrete. If the rounding can be ignored, it is reasonable to treat the rounded
measurements as continuous. On the other hand, in some cases, the measurement
might be very crude and the data may then best be regarded as discrete. These
points are illustrated in Example 1.5.
Example 1.5 Earthquakes
The times in days between successive major earthquakes between 16 December
1902 and 4 March 1977 were collected. There were 63 earthquakes and hence 62
‘waiting times’ ranging from 9 days to 1901 days.
(Adapted from: The Open University (1981) S237 The Earth: Structure, Composition and Evolution. The Open University, Milton Keynes.)
The continuous variable ‘time’ is reported as whole days and thus could arguably
be regarded as discrete. However, it could equally be argued that the rounding to
the nearest day is immaterial, and that for all practical purposes, the data should
be treated as continuous. In situations such as this, the data can be treated either
as continuous or as discrete, and both approaches should lead to similar results.
These data are interesting for a further reason: they include only ‘major’
earthquakes, defined as those whose magnitude was at least 7.5 on the Richter
scale, or which led to the deaths of more than 1000 people. The magnitude of an
earthquake, as measured on the Richter scale, is continuous. However, the
classification < 7.5 or ≥ 7.5 turns it into a discrete variable. Thus, in this case,
magnitude measured as ‘minor’ or ‘major’ is discrete. Finally, the severity of an
earthquake as measured by the number of deaths it causes is clearly a discrete
variable, since it takes only integer values. However, the numbers of deaths
involved are large, and probably approximate: for example, the earthquake of
16 December 1902 in Turkestan is reported to have killed 4500 people; and that of
4 March 1977 in Vrancea, Romania, killed 2000. These numbers are clearly
approximate. It might therefore make sense to measure severity in units of
thousands of deaths (with values 4.5 for the Turkestan earthquake and 2.0 for the
Romanian one) and treat it as continuous.
Example 1.5 illustrates the point that clear rules even about the apparently
simple matter of deciding on a discrete or continuous model can be difficult to
specify. When the error involved in treating a continuous variable as discrete, or
vice versa, is negligible, then it may not matter which choice is made. In this case,
the choice might reasonably be made on the grounds of convenience, and how well
the proposed model fits the data.
Choosing a model: continuous or discrete?
If the random variable X is continuous, it usually makes sense to choose
a continuous distribution — for example, exponential, continuous
uniform or normal.
If the random variable X is discrete, you might choose a discrete
distribution — for example, binomial, discrete uniform, geometric or
Poisson.
In some circumstances, it is appropriate to model a continuous variable
as discrete or a discrete variable as continuous. This is typically the
case when the error involved in doing so is negligible. (In Example 1.5,
you saw that, although it is continuous, the time between earthquakes
may be modelled as discrete; and, if measured in thousands, the number
of deaths could be modelled as continuous.)

Activity 1.3 Continuous or discrete?
For each of the random variables described below, state whether you would model
it using a continuous distribution, a discrete distribution, or either of these. Give
a reason for your answer in each case.
(a) The weights in kilograms, to two decimal places, of a sample of children aged
ten years.
(b) The litter size for a sample of sows.
(c) The number of tickets sold in the UK National Lottery each week.
(d) The number of tickets sold in the UK National Lottery each week that win
the jackpot.
(e) The examination marks of a sample of M248 students.
(f) The pass grades (1 to 4) of a sample of M248 students who passed the course.
Having narrowed the field to either discrete models or continuous models, the
next step is to choose which of the models in each of these categories is most
likely to be suitable. How to do this is discussed in Subsections 1.2 and 1.3.
1.2 Which discrete distribution?
The Bernoulli model can be regarded as a special case of the binomial model with
n = 1. Thus the choice of discrete models available to you is really between the
binomial, geometric, discrete uniform and Poisson distributions. (There are many
other discrete distributions, but these are the ones you have met in this course.)
Choosing between these is helped by prior understanding, knowledge or intuition
about the process generating the random variable X which you wish to model.
In particular, the four discrete distributions were introduced in specific settings; and each
setting may be regarded as the standard one for this model. If your data were
collected in such a setting, then it makes sense to try the corresponding model.
Choosing a model: the standard settings for discrete distributions
If X may be regarded as the number of successes in some known
number n of independent Bernoulli trials with constant probability p of
success at each trial, then choose the binomial distribution B(n, p).
If X has a finite range and every outcome has the same probability, or
at least is believed to be equally likely, then try the discrete uniform
distribution.
If X may be regarded as the number of trials up to and including the
first success in a sequence of independent Bernoulli trials with constant
success probability p, then choose the geometric distribution G(p).
If events are believed to occur at random and X is the number of events
that occur during intervals of fixed length, then try the Poisson
distribution.
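The Poisson standard setting can be illustrated by simulation: if events occur at random in time, with exponentially distributed waiting times between them, then the counts over fixed-length intervals behave like Poisson observations. The sketch below uses assumed parameter values of my own (rate 2 events per unit time, unit-length intervals):

```python
import random

random.seed(1)

def counts_in_intervals(rate, interval, n_intervals):
    """Simulate events occurring at random (exponential waiting times)
    and count how many fall in each fixed-length interval."""
    counts = []
    for _ in range(n_intervals):
        t, k = 0.0, 0
        while True:
            t += random.expovariate(rate)
            if t > interval:
                break
            k += 1
        counts.append(k)
    return counts

counts = counts_in_intervals(rate=2.0, interval=1.0, n_intervals=10000)
mean_count = sum(counts) / len(counts)  # close to rate x interval = 2
```

The sample mean of the simulated counts is close to 2, the mean of the corresponding Poisson distribution, as the standard setting predicts.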
These models are not guaranteed to work — in particular, you will need to check
the assumptions required in each case. However, they often provide a convenient
starting point.
Example 1.6 A vaccine trial
Informed consent to participate in a trial of two new vaccines against whooping
cough was obtained from the parents of 334 babies. Vaccine 1 was administered to
175 babies, Vaccine 2 to 159 babies. The variable of interest is the number of
babies who do not experience a rash at the injection site. These may be regarded
as numbers of successes in, respectively, 175 and 159 independent Bernoulli trials.
The probability of success may be different for babies in the two vaccine groups,
but can reasonably be assumed to be equal for babies within the same group. So
appropriate models are the binomial distributions B(175, p1 ) in the Vaccine 1
group and B(159, p2 ) in the Vaccine 2 group, where p1 and p2 are parameters to
be estimated.
Activity 1.4 Damaged pylons
A storm has damaged one of the pylons carrying high-tension power cables
between two towns. There are 24 sequentially numbered pylons and each is
equally likely to have failed. Identify a suitable model for the location of the fault.
Activity 1.5 Particle emissions
In order to elucidate the phenomenon of radioactivity, the scientists Rutherford
and Geiger counted the number of alpha particles emitted from a radioactive
source during 2612 intervals of length 7.5 seconds. Identify a suitable model for
the counts. (The fit of a particular model to Rutherford and Geiger’s data was
discussed in Unit C2. However, for this activity, you should use only the
information given here and the idea of standard settings to identify a suitable
model.)
(Source: Rutherford, E. and Geiger, H. (1910) The probability variations in the distribution of alpha particles. Philosophical Magazine, Sixth Series, 20, 698–704.)
Clearly, not all problems conform to the standard settings just described.
Thankfully, however, the four discrete distributions apply much more widely than
the settings might suggest. On the other hand, in some circumstances none of
them will do; this is illustrated in Example 1.7.
Example 1.7 None of the standard settings might apply
Consider the setting described in Example 1.4. The data are counts of names on
envelopes, ranging from 1 to 12, so clearly they are discrete. I considered in turn
each of the standard settings to see if one might fit this problem.
I started with the binomial model. To fit this in with the standard model, I
needed to identify a relevant Bernoulli trial. The obvious choice is whether or not
a window has been filled. However, the probability that a window is filled depends
on the position of the window: those higher up the envelope are more likely to
have been filled than those below (assuming that people use up the windows in
sequence). In fact, the top window is always filled. I couldn’t think of any other
Bernoulli trial relevant to this situation. Thus the standard setting for the
binomial model did not seem appropriate. And for the same reason — my inability
to define what would constitute a Bernoulli trial in this context — I ruled out the
standard setting for the geometric distribution.
I also discounted the standard setting for the Poisson model because I could not
think of what the ‘event’ was that occurred at random, nor was it clear to me
that there was a fixed interval involved in the specification of the problem. This
left the discrete uniform distribution. Certainly the random variable has a finite
range (1 to 12), but it is not clear at all from the setting that each outcome has
the same probability. This might be the case, but there was no particular reason
to suppose it was true.
96 Unit C3

I therefore concluded that, for this problem, there was no compelling reason to
opt for any of the standard models, though perhaps the least ‘bad’ choice might
be to opt for the discrete uniform distribution.
In some cases, as in Example 1.7, none of the standard settings is appropriate.
However, this does not mean that the distributions themselves are not
appropriate. What it means is that the distribution must be chosen on empirical
grounds; that is, a distribution whose ‘shape’ is likely to mirror that of the data
should be selected. This is discussed in Section 2. Nevertheless, some features can
help guide the choice of model even before looking at the data. These include the
range of the data, how many modes there are (all the models described have at
most a single mode), and where the mode might lie.
Choosing a discrete model: range and shape
Your choice of distribution can be guided by the range and the likely shape
of the distribution.
Binomial, B(n, p): finite range {0, 1, . . . , n}; one mode, which can take
any value within the range (depending on the value of the parameter p).
Discrete uniform: finite range; no mode.
Geometric: unbounded range {1, 2, 3, . . .}; one mode which is always
at 1.
Poisson(µ): unbounded range {0, 1, 2, . . .}; one mode, which can take
any value within the range (depending on the value of the parameter µ).
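The range-and-shape facts in the box can be verified numerically from the probability mass functions. The sketch below is purely illustrative; the parameter values are arbitrary choices of my own, not taken from any data set in the course:

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def geometric_pmf(x, p):  # support 1, 2, 3, ...
    return (1 - p) ** (x - 1) * p

def poisson_pmf(x, mu):
    return exp(-mu) * mu ** x / factorial(x)

# The mode is where the pmf is largest (the unbounded ranges of the
# geometric and Poisson distributions are truncated for the search).
mode_binomial = max(range(13), key=lambda x: binomial_pmf(x, 12, 0.4))
mode_geometric = max(range(1, 50), key=lambda x: geometric_pmf(x, 0.3))
mode_poisson = max(range(50), key=lambda x: poisson_pmf(x, 4.2))
```

With these example parameters, mode_binomial is 5 (an interior value of the finite range), mode_geometric is 1 (the geometric mode is always at 1) and mode_poisson is 4.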

In some cases no distribution fits all the requirements. Even then, all may not be
lost: it is important to remember that the purpose of statistical modelling is not
to find a perfect model, but a ‘good enough’ model. Provided that the model does
not fail in some key respect, it might still be useful.

Example 1.8 Used windows


For the data on used windows of Examples 1.4 and 1.7, the only facts known for
certain are that the data are counts, and are in the finite range 1 to 12. The only
one out of the four distributions listed which meets these requirements is the
discrete uniform distribution. However, there is as yet no information about the
likely shape of the distribution. The best you can do at this stage is retain the
discrete uniform model as a possibility, but keep an open mind as it is far from
clear that, when you look at the data, the shape will be approximately that of the
uniform distribution. �

1.3 Which continuous distribution?


Choosing an appropriate continuous distribution follows much the same process as
choosing a discrete distribution, with one major exception: if none of the available
distributions seems appropriate, then you can try transforming the data.
(Transforming data is discussed in Subsection 2.3.) For the time being, consider
the standard continuous distributions you have met so far. These are the
exponential, continuous uniform, normal, t and chi-squared distributions. (There are other continuous distributions; these are the ones you have met in this course.) The t-distribution and the chi-squared distribution were introduced specifically in the context of statistical tests. Nothing precludes you from using them for modelling, but in this course only the exponential, continuous uniform and normal distributions are used for this purpose.
There are standard settings for some continuous distributions. These can help you
to select a candidate distribution.

Choosing a model: the standard settings for continuous distributions
If events are thought to occur at random in time and X is the waiting
time between successive events, then try the exponential distribution.
If X takes values between a and b and each value in the interval
a ≤ x ≤ b is equally likely, or believed to be equally likely, then try the
continuous uniform distribution U (a, b).
The normal distribution is a good first choice when X is clustered
around a central value, and is as likely to lie below as above this value.
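The first of these settings can be illustrated by simulation. If events are scattered uniformly ('at random') over an interval, the waiting times between successive events behave like exponential observations, so their sample mean and sample standard deviation should be roughly equal. The following Python sketch (not part of the course materials) checks this.

```python
import random
import statistics

random.seed(2)

# Scatter n events 'at random' (uniformly) over the interval [0, T).
n, T = 20000, 1.0
event_times = sorted(random.uniform(0, T) for _ in range(n))

# Waiting times between successive events.
gaps = [b - a for a, b in zip(event_times, event_times[1:])]

mean_gap = statistics.mean(gaps)
sd_gap = statistics.pstdev(gaps)

# For an exponential distribution, mean = standard deviation,
# so this ratio should be close to 1.
print(round(sd_gap / mean_gap, 2))
```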

Example 1.9 Interspike intervals


Consider the setting of Example 1.3. The data in this case are waiting times between spikes. It seems reasonable, as a first assumption, to suppose that these spikes arise at random. Hence a reasonable first model for the waiting times between spikes is the exponential distribution. �

You might have noticed that the standard setting for both the discrete uniform
distribution and the continuous uniform distribution includes the situation in
which there is no reason to believe that any value is special. Thus the uniform
distribution is sometimes used to represent lack of knowledge, though of course
such knowledge might be acquired later through experimentation.
Note also that the standard setting for the normal distribution is rather less specific than for the exponential distribution or the uniform distribution. In fact, there is no really natural setting for the normal distribution. This does not mean, of course, that the normal distribution is not useful. On the contrary, it is the most commonly used continuous model.
In Subsection 1.2, you saw that the shape of the distribution is a key factor in
choosing a discrete model; it is also a key factor in choosing a continuous model.

Choosing a continuous model: range and shape


The exponential distribution has range 0 ≤ x < ∞ and its mode is at 0.
The uniform distribution has finite range a ≤ x ≤ b and has no mode.
The normal distribution has range −∞ < x < ∞ and is symmetric
about a single mode that coincides with the mean. Values far from the
mean have low probability.

Example 1.10 Silver content of coins


You have met data on the silver content of Byzantine coins at several points in
the course. (For instance, see Unit C2, Exercise 1.1.) Now consider the problem of choosing a model for these data.
The silver content X of a coin, measured in % Ag, must take values between 0 and
100. Thus the range of X is finite. The only continuous model with a finite range
is the uniform distribution. However, in this setting coins with no silver at all
(X = 0) or made of pure silver (X = 100) are presumably unlikely, and surely not
all values between 0 and 100 are equally likely. Hence the uniform distribution
may not be a good choice. It is more reasonable to assume that the silver content
will cluster around some central value. A reasonable model in this case is the
normal distribution, even though, in theory, this allows values of X that are
negative or more than 100, which of course cannot happen in practice. However,
provided that values of X close to 0 and close to 100 are very unlikely, then the
normal model may be good enough. �
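The final proviso in Example 1.10 can be made concrete: under a fitted normal model, the probability assigned to impossible silver contents (below 0 or above 100 % Ag) should be negligible. The sketch below is a hypothetical Python illustration; the mean 40 and standard deviation 10 are invented values, not estimates from the coin data.

```python
import math

def normal_cdf(x, mu, sigma):
    # Phi((x - mu) / sigma), via the complementary error function.
    return 0.5 * math.erfc(-(x - mu) / (sigma * math.sqrt(2)))

# Illustrative values only; they are NOT fitted to the coin data.
mu, sigma = 40.0, 10.0

p_below_0 = normal_cdf(0.0, mu, sigma)            # P(X < 0)
p_above_100 = 1.0 - normal_cdf(100.0, mu, sigma)  # P(X > 100)

# Both probabilities are tiny, so the impossible values assigned
# positive density by the normal model can safely be ignored.
print(p_below_0, p_above_100)
```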

Example 1.10 serves to illustrate the important point that models are at best
approximate representations of reality. The aim is to formulate a reasonable
model, not a perfect one.

Activity 1.6 Head size and shape


The three variables below are described in Example 1.2. They are all associated
with measurements taken in 25 families on the heads of the first and second sons.
Identify suitable candidate models for the following variables on head size and
shape.
(a) Head size, measured as length + breadth.
(b) Head shape, measured as 100 × (breadth/length).
(c) Head shape difference, measured as head shape of first son minus head shape
of second son.

In this section, the importance of thinking about the context or setting in which
the data are collected has been emphasized, in order to guide the choice of model.
This is certainly useful but does not guarantee a unique answer. In some settings
there may be two, or many more, reasonable candidates.

Example 1.11 Treatment duration


The duration (in days) of treatment was recorded for 86 hospital patients. Treatment can last several months, so it makes sense to model these data as continuous. But which model is appropriate? The normal model might be a good one, if there is a ‘typical’ treatment length around which the values cluster. Alternatively, you could think of treatment durations as waiting times, which might suggest that an exponential model may be suitable. In this case, the setting alone does not really help: it is necessary to look at the shape of the data. This topic is discussed in Section 2. �
(Copas, J.B. and Fryer, M.J. (1980) Density estimation and suicide risks in psychiatric treatment. Journal of the Royal Statistical Society, Series A, 143, 167–176.)

Before ending this section, it is worth emphasizing that there are many more
distributions than the ten covered in detail in this course, though these include
some of the most important ones in statistics. MINITAB provides several other
distributions, which you might care to explore (this is entirely optional); and
there are others beside these. However, the approach to selecting an appropriate
distribution is much the same whatever the collection of distributions to which
you have access.

Summary of Section 1
In this section, some basic principles for thinking about data and models, even
before looking at the data, have been reviewed. The key issues are whether the
data are discrete or continuous, whether the setting in which the data were
collected conforms to any of the standard settings, and what is the likely shape of
the distribution. These basic principles can help to formulate a starting point for
choosing a model, which can be revised in the light of the data.

Exercises on Section 1
Exercise 1.1 Car production
A car manufacturer monitors its production by means of daily counts of the cars
it produces. Would you model the output as a discrete or a continuous variable?
What other information might you require before deciding on an answer?

Exercise 1.2 A leaking water main


A leak has developed in an underground water main running the length of a
street. It is not known where the leak might have occurred.
(a) In the absence of any other information, identify a model for the location of
the leak along the length of the street.
(b) Now suppose that it is known that the leak is likely to have occurred at one
of the T-joints supplying the houses along the street. Identify a suitable
model for the location of the leak in this setting.

2 Exploring the data


In Section 1, you saw how the setting of a study, and the type of data to be
collected, can often suggest a model, or perhaps a few possible models, even
before you have looked at the data. Accordingly, Section 1 contained no data,
only descriptions of settings and variables. However, such general considerations
can only provide a starting point in choosing a model. Having narrowed down the
possibilities to one or a few likely candidates, you then need to check that the
data that have been collected are consistent with the model you wish to fit. To do
this, it is essential to explore the data, using numerical summaries and graphical
methods, to check that your proposed model is a reasonable one.
The aim of this section is to bring together various tools you have already
encountered, and apply them to checking the assumptions of your chosen model.
In Subsection 2.1, you will be exploring data by looking at graphical displays and
numerical summaries. Probability plots are revisited in Subsection 2.2; your work
on this subsection will include a chapter of Computer Book C. Data
transformations are considered in Subsection 2.3; and the problem of dealing with
outliers is discussed briefly in Subsection 2.4.

2.1 Getting a feel for the data


After making sure that you are clear about the problem being addressed, and
understand the setting in which the data were collected and the type of data
available, the next step is to examine the data. This is arguably the most
important part of the statistical modelling process, and one which statisticians describe as getting a feel for the data. (It could be argued that the most critical statistical input is at the study design stage, to make sure that the right sort of data are collected.) There are no simple rules for how to do this. You have already met all the methods commonly used. The aim of this section is to apply these techniques to the task of selecting a model and checking any assumptions.
The first step in any analysis is to look at the data. This will give you an idea of
the scales of measurement and the number of significant figures used in recording
the data, and might also allow you to spot missing or unusual values.
The next stage is to obtain a graphical representation of the data. This will reveal
the key features of the data, and might indicate whether the notional model you
have in mind is at all appropriate. Graphical representations will indicate how
many modes the data have, where they are located within the range of the data,
the variation around these modes, the degree of skewness of the data, and so on.
You can then judge whether your proposed model mirrors these aspects of the
data.

Example 2.1 Particle emissions


In Activity 1.5 you considered data on alpha particle emissions from a radioactive
source. The data are counts of particles emitted in intervals of equal length.
Provided that particles are emitted at random, this setting is standard for the
Poisson distribution. Thus a good model with which to start is the Poisson
distribution. Having collected the data, the next stage is to look at them.
Figure 2.1 shows a bar chart of the counts. The distribution is unimodal and right-skew: both are attributes of the Poisson distribution. Of course, this does not demonstrate that the counts arise from a Poisson distribution, but it does confirm that a Poisson distribution is a good candidate model. �
Figure 2.1 Counts of particles emitted

The key idea in Example 2.1 is the rather obvious one that the ‘shape’ of the data
should be consistent with the distribution you wish to fit. You met this idea in
Block A.

Activity 2.1 The shape of the data


Suppose that a bar chart of the particle emission data had looked like one of those shown in Figure 2.2. For each of the three bar charts, what conclusion would you have drawn about the suitability of a Poisson model for the data, and why? (The vertical scales have been omitted since only the shapes of the bar charts are of interest here.)

Figure 2.2 Three bar charts with different shapes

Activity 2.2 will give you some practice at using the shape of a bar chart to decide
whether various distributions are good candidate models for the data.

Activity 2.2 Shapes of some discrete distributions


For each of the bar charts in Figure 2.3, state whether or not its shape is
consistent with each of a Poisson distribution, a binomial distribution and a
geometric distribution. Note that the shapes of the bar charts may be consistent
with more than one of these distributions.

Figure 2.3 Shapes of some discrete distributions

In using ‘shape’ to suggest a distribution it is important to remember that


random fluctuations might result in chance departures from the theoretical shape.
For example, the bar chart in Figure 2.3(c) might represent a sample from a
geometric distribution for which, by chance, the mode is at 3 rather than 1. Such
chance fluctuations are more likely to have a major impact on the shape of the
bar chart for small samples than for large ones. For this reason, using shape as a
guide to selecting the distribution is most useful when the sample size is large.

Activity 2.3 Windows: some data at last


Possible models for the setting described in Example 1.4, concerning the windowed
envelopes collected by Dr Sutherland, were discussed in Examples 1.7 and 1.8. It
was suggested that a discrete uniform distribution might be a reasonable choice in
the absence of any further information about the shape of the distribution.
So far, the discussion of this example has not involved any data, just a rather
abstract discussion of the setting. Recall that the data comprise the numbers of
used windows on reusable envelopes, and can range from 1 to 12. Figure 2.4
shows a bar chart of the numbers of used windows for a sample of 311 windowed
envelopes.
(a) Do you think the discrete uniform model is appropriate for these data?
(b) What model (or models) does the shape of the data suggest?
(c) What are the shortcomings of the model(s) that you named in part (b)?
(d) How would you check your choice of model?
Figure 2.4 Numbers of used windows

The bar chart in Figure 2.4 of the data on used windows shows that the uniform
model, which was tentatively suggested in Examples 1.7 and 1.8, is unsuitable. A
geometric model would appear to be better. This illustrates the point that
statistical modelling often requires you to alter the model as more features of the
data become apparent. It is important, therefore, to keep an open mind, but also
to concentrate on the important aspects of the data. In this case, getting the
shape right is probably more important than abiding by constraints about the
range of the data.
Similar considerations apply to continuous data. A good starting point is always a
graphical display of the key variable or variables.

Example 2.2 Interspike intervals


In Example 1.9 you saw that a sensible first model to try for the data on
interspike intervals is the exponential distribution. Figure 2.5 shows a histogram
of these data. The data certainly display a long right tail, as would be expected of
an exponential distribution. However, it is noticeable that the mode is not at
zero, but in the range 20 to 30. This could be due to random variation, but it is
more likely that it reflects a failure of the exponential model. �
Figure 2.5 Interspike intervals

Activity 2.4 Treatment duration


For the treatment duration data of Example 1.11 the setting alone could not help
identify a suitable candidate model. Figure 2.6 shows a histogram of treatment
times. What model might you choose to fit to these data?

Figure 2.6 Treatment times for hospital patients

Histograms or bar charts are essential for displaying the main features of a data set, and for suggesting a modelling approach. When dealing with certain types of continuous data, further checks are possible using probability plots; you met these in Blocks A and B. Recall that a probability plot for a data set consisting of ordered observations x(1), . . . , x(n) is a plot of these observations against scores y1, . . . , yn derived from the proposed distribution. These ‘scores’ are certain quantiles of the proposed distribution. In general terms, if the data are randomly sampled from the proposed distribution, then the points on the probability plot should lie close to a straight line.
In Block A you met exponential probability plots, in which the scores are derived
assuming an exponential distribution. For such data, the points on the probability
plot should lie close to a line passing through the origin. In Block B you met the
corresponding plot for an assumed normal distribution. In this case, the points on
the probability plot should also lie close to a line but this line does not have to
pass through the origin. Probability plots provide a useful check on model
assumptions, and are more sensitive to departures from these assumptions than
histograms.
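A probability plot of this kind can be built from first principles. The sketch below (illustrative Python, not the course's MINITAB procedure) takes the scores for an exponential plot to be the standard exponential quantiles at the plotting positions (i − 0.5)/n, one common convention among several, and summarises straightness by the correlation between the ordered data and the scores. For a normal plot, normal quantiles would be used instead.

```python
import math
import random

def exponential_scores(n):
    # Quantiles of the standard exponential distribution,
    # F^-1(q) = -ln(1 - q), at the positions q = (i - 0.5) / n.
    return [-math.log(1 - (i - 0.5) / n) for i in range(1, n + 1)]

def correlation(xs, ys):
    # Pearson correlation coefficient, from first principles.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(4)
data = sorted(random.expovariate(1 / 30) for _ in range(100))  # mean 30

scores = exponential_scores(len(data))
r = correlation(scores, data)

# Genuinely exponential data give points close to a straight line
# through the origin, so the correlation should be close to 1.
print(round(r, 3))
```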

Activity 2.5 More on interspike intervals


The histogram in Figure 2.5 suggested that an exponential model might not be
appropriate for the interspike intervals data. Figure 2.7 shows an exponential
probability plot for these data, together with the ‘best’ straight line through the
points that passes through the origin. What do you conclude? What might you
try next?

Figure 2.7 An exponential probability plot for the interspike intervals data

The interpretation of probability plots is considered in more detail in


Subsection 2.2.
So far, only graphical tools for data exploration have been discussed. These are
undeniably the most important. However, data summaries can also provide useful
insight. In particular, for either a Poisson distribution or an exponential
distribution, the relationship between the mean and variance provides an informal
guide to whether that distribution is at all likely to be a useful model.

Mean–variance relationships for Poisson distributions and exponential distributions

For a Poisson distribution, the mean and variance are equal: µ = σ^2.
For an exponential distribution, the mean and standard deviation are equal: µ = σ.

For example, if a Poisson model is to be fitted, then you might expect the sample
mean and the sample variance to be similar, at least in large samples. Similarly, if
an exponential model is to be considered, the sample mean and the sample
standard deviation should be similar. Note, however, the necessary emphasis on
‘similar’ here: exact equality is most unlikely and, especially in small samples,
there might be substantial discrepancies due to random fluctuations. Nevertheless,
examination of the mean–variance relationship for the data can provide a clue as
to possible problems with fitting either a Poisson model or an exponential model.
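This diagnostic is easy to automate. The following sketch simulates counts from a Poisson distribution (using a simple pure-Python sampler) and compares the sample mean with the sample variance; for genuinely Poisson data the ratio should be close to 1, though not exactly 1.

```python
import math
import random
import statistics

def poisson_draw(mu):
    # Knuth's method: count uniform draws until their running product
    # drops below e^(-mu). Adequate for small mu.
    limit = math.exp(-mu)
    k, prod = 0, 1.0
    while prod > limit:
        k += 1
        prod *= random.random()
    return k - 1

random.seed(5)
counts = [poisson_draw(3.9) for _ in range(2000)]

mean = statistics.mean(counts)
variance = statistics.variance(counts)

# For Poisson data the population mean and variance are equal,
# so the sample ratio should be near 1 (exact equality is unlikely).
print(round(variance / mean, 2))
```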

Example 2.3 Alpha particles


For the alpha particle data of Example 2.1, the sample mean of the counts is
3.877 and the sample variance is 3.696. The variance is thus very close to the
mean, and this provides further support for the Poisson model. �

Activity 2.6 Treatment duration


The sample mean of the data on treatment times of hospital patients of
Activity 2.4 is 122.3 days, and the sample standard deviation is 146.7 days. How
might you use this information to guide your choice of model?

2.2 Interpreting probability plots


In this subsection, probability plots are revisited. Normal probability plots are the
most commonly used probability plots; their interpretation is discussed in more
detail than previously. You will explore some further probability plotting facilities
available in MINITAB for exponential and normal probability plots.
If data are sampled from a normal distribution, then the points on a normal
probability plot should lie close to a straight line. But what if they don’t? And
how close to the line should the points lie? If just a few points in the tails lie far
from the line, then it is reasonable to conclude that the normality assumption
holds for the bulk of the data, but perhaps not for the tails — for example, due to
the presence of outliers. However, quite commonly, a normal probability plot as a
whole reveals a systematic pattern, even though no individual point lies very far
from the line. This is illustrated in Example 2.4.

Example 2.4 Systematic patterns in probability plots


Figure 2.8 shows a normal probability plot obtained using a random sample of
size 50 drawn from a continuous uniform distribution.

Figure 2.8 A normal probability plot for a sample of size 50 from U (0, 1)

It is clear that the probability plot has a wavy shape which differs in a systematic
way from a straight line, even though no individual point lies very far from the
line drawn in the figure. �

The systematic pattern in Figure 2.8 provides clear evidence that the data are not
sampled from a normal distribution. In fact, the wavy pattern is characteristic of
a symmetric distribution with tails that are too ‘light’ compared with a normal
distribution. For U (0, 1), there are no values less than 0 or greater than 1.
Similarly, if the underlying distribution is either right-skew or left-skew, this will
induce a systematic non-linear pattern in a normal probability plot of the data.
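The lightness or heaviness of the tails can also be summarised numerically, by the sample excess kurtosis: roughly 0 for normal data and −1.2 for U(0, 1). Kurtosis is not part of M248; the Python sketch below is an optional illustration.

```python
import random

def excess_kurtosis(xs):
    # Sample excess kurtosis: m4 / m2^2 - 3, where mj is the j-th
    # central moment. Near 0 for normal data; negative for light tails.
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3

random.seed(6)
uniform_sample = [random.random() for _ in range(5000)]    # U(0, 1)
normal_sample = [random.gauss(0, 1) for _ in range(5000)]  # N(0, 1)

# U(0, 1) has excess kurtosis -1.2: much lighter tails than normal.
print(round(excess_kurtosis(uniform_sample), 2))
print(round(excess_kurtosis(normal_sample), 2))
```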

Activity 2.7 Interpreting normal probability plots


Figure 2.9 shows two normal probability plots. For each plot, decide whether the
data might reasonably be assumed to be a sample from a normal distribution.

Figure 2.9 Two normal probability plots

Deducing the shape of the distribution from its normal probability plot is not an
easy task, and is seldom attempted. The key point is that the presence of a
systematic non-linear pattern in a normal probability plot is evidence that the
data are not sampled from a normal distribution. If a probability plot indicates
that the normality assumption is invalid, you should look at a histogram of the
data to identify in what respects the normality assumption fails.
A further problem in interpreting probability plots is that individual points may
lie far from the straight line, not because the underlying distribution is not
normal, but because of chance effects. This is frequently the case for points at the
extremities of the plots. It would be useful to quantify the degree of variation that
might be expected to occur by chance. This can be done when using MINITAB to
produce a probability plot: confidence bands are drawn either side of the straight
line. These are explained in the computer book.
Refer to Chapter 7 of Computer Book C for the rest of the work in
this subsection.

2.3 Transforming the data


The continuous distributions available to you that are most commonly used for
modelling purposes are the normal, exponential and continuous uniform
distributions. At first sight, this presents a serious problem, since the choice of
shapes covered by these models is very restricted. For example, the only one of
the three with non-zero skewness is the exponential distribution, and this
distribution is rather inflexible: the mode is at zero and, whatever the value of the exponential parameter, the skewness is equal to 2. (The skewness referred to here is the population analogue of the sample skewness. You do not need to know how to calculate this quantity for M248.)
Thankfully, when dealing with continuous data, the range of modelling distributions can be extended enormously by using transformations. The idea is simple: if the data do not have the shape required, you can try to transform them so that the transformed data do. You met transformations of data in Blocks A and B. In this subsection, transformations are considered in a little more detail.

The use of transformations is illustrated in Example 2.5 using simulated data.

Example 2.5 Transforming data


Four histograms of data sets, each comprising 300 data points, are shown in
Figure 2.10.

Figure 2.10 Four histograms

The histogram in Figure 2.10(a) looks as though the data might be normally
distributed, but those in Figures 2.10(b), (c) and (d) are progressively more
skewed and the distributions appear to be far from normal. It may surprise you to
learn that the same sample of data was used for all four histograms. A computer
was used to generate a sample of size 300 from a normal distribution; these data
are represented by the histogram in Figure 2.10(a). Suppose that a typical value
in this data set is denoted x. Then the data points used for Figure 2.10(b) are the values x^2; the data points used for Figure 2.10(c) are the values e^(2(x−1)); and the data points used for Figure 2.10(d) are the values 1/(251x).
Since the data represented in Figure 2.10(a) are normally distributed, they could
be used to carry out a t-test, for example. However, it would not be legitimate to
carry out a t-test using the data in any of Figures 2.10(b), (c) or (d) because the
variation is far from normal. Suppose now that data resembling those in
Figures 2.10(b), (c) or (d) were to arise in practice. It would clearly be in order to
transform them. For example, if the data looked like those in Figure 2.10(b), then
the correct procedure would be to take the square root of each value. This
transformation would result in the data represented in Figure 2.10(a). It would
then be appropriate to carry out t-tests on the transformed data, based on the
assumption of normality. �
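The effect described in Example 2.5 is easy to reproduce. In the sketch below (illustrative Python; the parameters behind Figure 2.10 are not given in the text, so the values used here are invented), a normal sample with positive values is squared, which induces right skew, and taking square roots recovers the original sample exactly.

```python
import random

def skewness(xs):
    # Sample skewness: m3 / m2^(3/2), with mj the j-th central moment.
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

random.seed(7)
# Invented parameters; positive values mean x -> x^2 is increasing.
x = [random.gauss(1.0, 0.15) for _ in range(3000)]
y = [v ** 2 for v in x]        # squaring induces right skew
back = [v ** 0.5 for v in y]   # the square root undoes it

print(round(skewness(x), 2), round(skewness(y), 2))
```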

Example 2.6 March precipitation in Minneapolis St Paul


The total precipitation (in inches) in the month of March was recorded in 30 successive years in Minneapolis St Paul. Figure 2.11(a) shows a normal probability plot for these data. The normal probability plots in Figures 2.11(b) and 2.11(c) were produced after transforming the data using the log transformation log x and the cube root transformation x^(1/3), respectively.
(Hinckley, D. (1977) On quick choice of power transformation. Applied Statistics, 26, 67–69.)

Figure 2.11 Normal probability plots for precipitation data: (a) untransformed (b) log transformed (c) cube root transformed
The probability plot for the untransformed data displays a systematic pattern
indicating non-normality. The log transformation results in a much straighter
plot. An even straighter probability plot is obtained using the cube root
transformation. �

One aim of transforming a set of data values x to a different set of values y by


means of a mathematical transformation is to render the transformed data more
plausibly normal. Possible transformations include y = x^2, y = 1/x and y = log x;
the list is endless. Note, however, that some transformations are not always
appropriate. For example, the transformation y = log x is not defined for negative
(or zero) values of x, so it can only be used when all the data values are positive.
Also, a transformation must be either increasing or decreasing over the range of
values of x. Recall that a transformation of x is increasing if its graph rises as you
move to the right through the range of x; it is decreasing if its graph falls.
For example, if x can take both positive and negative values, then the transformation y = x^2 is neither increasing nor decreasing over the range of x. This is illustrated in Figure 2.12: the graph of y = x^2 is decreasing between −2 and 0, then increasing between 0 and +2. On the other hand, if the range of x included only values x ≥ 0, then the transformation y = x^2 would be appropriate, as it is increasing over this range.
Figure 2.12 The graph of y = x^2 over the range −2 to +2
The case for transforming data is strengthened if some natural physical interpretation of the transformation is available. Often, however, as in Example 2.6, there is no convenient physical interpretation, and the data are transformed simply to satisfy the requirements of the statistical procedure that you wish to use.
Some general indications for choosing a transformation may be given. For data
that are positive and highly skewed with many relatively small values and fewer
higher values, and possibly some very high values, the following transformations
all tend to reduce the spread of higher values more than that of lower values:

y = x = x1/2 , y = x1/3 , y = log x or y = log(1 + x).
The overall effect of one of these transformations is to reduce the skew in the
data. The ladder of powers lists transformations of the form
. . . , x−2 , x−1 , x−1/2 , log x, x1/2 , x1 , x2 , . . ..

The transformation x^1 leaves values unchanged. Transformations with powers above 1 on the ladder (that is, transformations to the right of x^1) expand the high values relative to the low values, and transformations with powers below 1 have the opposite effect. Notice the position of log x: though not, in fact, a power transformation, it fills the position of x^0, which is not a valid transformation because it collapses all values to 1.
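The ladder can be explored by applying several rungs to the same right-skewed sample and computing the sample skewness after each transformation. The sketch below uses simulated exponential data (population skewness 2); notice how powers below 1 reduce the right skew, and how going too far down the ladder (here, log x) overshoots into left skew.

```python
import math
import random

def skewness(xs):
    # Sample skewness: m3 / m2^(3/2), with mj the j-th central moment.
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

random.seed(8)
x = [random.expovariate(1.0) for _ in range(5000)]  # right-skewed data

# Rungs at and below x^1 on the ladder, applied to the same data.
rungs = {
    "x": x,
    "x^(1/2)": [v ** 0.5 for v in x],
    "x^(1/3)": [v ** (1 / 3) for v in x],
    "log x": [math.log(v) for v in x],
}

for name, values in rungs.items():
    print(name, round(skewness(values), 2))
```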

Activity 2.8 Interspike intervals


In Activity 2.5 you found that an exponential model is not appropriate for
the data on interspike intervals. A histogram of the data is shown in Figure 2.13(a).

Figure 2.13 Histograms of interspike intervals: (a) untransformed (b) log transformed (c) square root transformed

The data are highly skewed, indicating that a normal model is not appropriate
either. Figure 2.13(b) shows a histogram of the data after a log transformation,
and Figure 2.13(c) represents the data after a square root transformation.
Comment on the effect of the two transformations. In your view, which
transformation has produced the more symmetrical result?

It is perhaps worth pointing out that it is not always necessary to obtain a more
symmetrical distribution! For example, suppose that the aim of the analysis of
the interspike intervals data is to calculate the mean interspike interval, together
with a 95% confidence interval for the mean. Then, since the sample size is
sufficiently large — there are 100 observations — the distribution of the sample
mean is approximately normal, even though the underlying distribution is skew.
So large-sample methods can be used to find an approximate 95% confidence
interval for the mean. Therefore it is not necessary to transform the data to find a
confidence interval for the mean. There may, of course, be other reasons to
transform the data.
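The large-sample calculation runs along the following lines. Since the interspike interval summaries are not quoted in the text, the sketch below uses the treatment duration summaries from Activity 2.6 (mean 122.3 days, standard deviation 146.7 days, n = 86) instead.

```python
import math

# Summaries quoted in Activity 2.6 for the treatment duration data.
n = 86
sample_mean = 122.3   # days
sample_sd = 146.7     # days

# Large-sample 95% confidence interval: mean +/- 1.96 * s / sqrt(n).
half_width = 1.96 * sample_sd / math.sqrt(n)
lower = sample_mean - half_width
upper = sample_mean + half_width

print(round(lower, 1), round(upper, 1))  # 91.3 153.3
```

No transformation of the data is involved; the skewness of the underlying distribution is absorbed by the approximate normality of the sample mean.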

2.4 Dealing with outliers


From the moment that graphical representations of variability in a set of data
were introduced in Block A, you have been aware of the notion of statistical
outliers. Similarly, in probability plots, occasionally there may be evidence of an
unusual outlying point, or set of points, suggesting that, apart from a few
exceptional observations, the hypothesized model is adequate to describe the
variability observed. Of course, outliers are more disconcerting if they are found
in very small data sets, as in Example 2.7.

Example 2.7 Radio-carbon age determinations


The data in Table 2.1 are a set of radio-carbon age determinations, in years, on
eight samples from the Lake Lamoka site.

Table 2.1 Radio-carbon dating

Sample    Radio-carbon age
number    determination
C-288     2419
M-26      2485
C-367     3433
M-195     2575
M-911     2521
M-912     2451
Y-1279    2550
Y-1280    2540

(Long, A. and Rippeteau, B. (1974) Testing contemporaneity and averaging radio-carbon dates. American Antiquity, 39, 205–215.)

Most determinations suggest an age of between 2400 and 2600 years. However,
sample C-367 indicates an age of 3433 years, which is quite out of step with the
other values: this sample is a clear outlier. �
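The visual judgement in Example 2.7 can be backed up with the 1.5 × IQR rule commonly used for boxplots (the exact quartile convention is an implementation choice; the course software may use a slightly different one). The sketch below applies it to the eight determinations in Table 2.1.

```python
import statistics

# Radio-carbon age determinations from Table 2.1 (years).
ages = [2419, 2485, 3433, 2575, 2521, 2451, 2550, 2540]

# Quartiles by linear interpolation ('inclusive', the R type 7
# convention; other conventions give slightly different fences).
q1, _, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

outliers = [a for a in ages if a < low_fence or a > high_fence]
print(outliers)  # [3433]
```

Only sample C-367 falls outside the fences, agreeing with the judgement by eye.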

If at all possible, when outliers are present, your first step should be to check that
they are not the result of recording, coding or data entry errors. Such errors are
very common. It is well worth repeating here that, if you enter your own data,
you should always check your computer data file against the original. Not all data
entry errors will necessarily appear as outliers!
The study of outliers and how to treat them can be rather complex; so only a
little general guidance will be given in this course. Broadly speaking, the
treatment of outliers depends upon how many appear in the data, what effect they
have on the conclusions, and how far you are prepared to go in believing that you
have been unlucky enough to obtain a few ‘atypical’ values, rather than believing
that the distributional assumptions are not viable. This last point is important:
the outliers might just reflect the fact that you have chosen the ‘wrong’ model.
The effect of model choice on outliers is illustrated in Example 2.8.

Example 2.8 Treatment duration


A histogram of the data on treatment times was given in Figure 2.6 (see
Activities 2.4 and 2.6). From this histogram and by looking at the
mean–variance relationship, you may well have
concluded that an exponential model might be reasonable. An exponential
probability plot for these data is shown in Figure 2.14.

Figure 2.14 An exponential probability plot for the data on treatment times

The probability plot is not straight. In particular, there are four outliers in the
top right-hand corner of the plot, corresponding to the four longest treatment
times. These outliers could be interpreted as relating to an atypical group of
patients with unusually long treatment times.
However, if the transformation x1/4 is applied to the data, then the corresponding
histogram, which is shown in Figure 2.15(a), is roughly symmetric; and the
normal probability plot in Figure 2.15(b) suggests that a normal model might be
appropriate for the transformed data.

Figure 2.15 Transformed data: (a) histogram (b) normal probability plot

There is no suggestion of a problem at longer treatment times now! There are
three ill-fitting points with treatment times of one day. Since times are counted in
whole days, it is possible that some of the patients to which these times relate
were treated for a slightly longer or slightly shorter time than one day. The poor
fit in the lower tail might thus be the result of crude rounding of the data.
So which model is correct? Are the untransformed data exponentially distributed,
with a group of four atypical patients with long treatment times? Or are the
transformed data roughly normal? It is not possible to settle the issue without
further information, in particular about the four possibly atypical patients. Which
model to use would in any case depend on the purpose of the analysis.
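The effect of a fourth-root transformation on skewness can be illustrated numerically. The Python sketch below uses simulated exponential data in place of the actual treatment times (which are not reproduced here), and computes the sample skewness before and after applying the transformation x^(1/4).

```python
import random
from statistics import mean

def skewness(xs):
    # Sample moment coefficient of skewness: third central moment divided
    # by the standard deviation cubed.
    m = mean(xs)
    n = len(xs)
    s2 = sum((x - m) ** 2 for x in xs) / n
    s3 = sum((x - m) ** 3 for x in xs) / n
    return s3 / s2 ** 1.5

# Simulated right-skew "treatment durations" (stand-ins for the real data).
random.seed(2)
times = [random.expovariate(0.1) for _ in range(200)]
transformed = [t ** 0.25 for t in times]

print(skewness(times), skewness(transformed))
```

The transformed values have skewness much closer to zero, which is exactly the effect the ladder of powers is intended to achieve.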

An important consideration in dealing with outliers is the purpose of the analysis.


If there are relatively few outliers, and the model you have selected appears to fit
the rest of the data reasonably well, then a sensible procedure is to examine the
sensitivity of your conclusions to the outliers. This is easily done by undertaking
two analyses, one including the outliers and the other excluding them. If your
conclusions do not differ substantially under the two procedures, then the outliers
are not influential and should not be a major source of concern. In this case, you
should report the presence of the outliers, and state that the conclusion reached is
not sensitive to inclusion or exclusion of these outliers. On the other hand, if the
outliers do have a big impact on the conclusions, then you need to report your
findings with the data analysed both ways.
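As a small illustration of this two-analyses approach (in Python rather than MINITAB), the sketch below compares the sample mean of the Table 2.1 radio-carbon determinations with and without the outlying value for sample C-367. Only the mean is compared here; a fuller sensitivity check would recompute whatever quantities the analysis reports, such as confidence intervals.

```python
from statistics import mean

# Radio-carbon age determinations from Table 2.1 (years).
ages = {"C-288": 2419, "M-26": 2485, "C-367": 3433, "M-195": 2575,
        "M-911": 2521, "M-912": 2451, "Y-1279": 2550, "Y-1280": 2540}

with_outlier = mean(ages.values())
without_outlier = mean(v for k, v in ages.items() if k != "C-367")

print(round(with_outlier, 1), round(without_outlier, 1))  # 2621.8 2505.9
```

The outlier shifts the mean by more than a hundred years, so here the conclusion is clearly sensitive to whether C-367 is included, and both analyses would need to be reported.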

Activity 2.9 Age of the Lake Lamoka site


The radio-carbon data of Example 2.7 are to be used to provide an estimate of the
age of the Lake Lamoka site. Provide such an estimate using all of the data points,
and investigate its sensitivity to the outlier. How would you report your results?

There are no hard and fast rules to decide how many data values may be deleted
in order to salvage a particular modelling assumption. In practice, it is best to
remove no more than one or two values.
If in doubt, an alternative is to keep all the values and revert to a distribution-free
method. Using ranks instead of data values loses information about how far apart
the values are but, on the other hand, it removes sensitivity to abnormally large
or abnormally small values. If decisions about which method to use seem unduly
vague, you should remember that there is not always a definitely right or wrong
way of performing a statistical analysis. All you can do is use your common sense.
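The robustness of ranks is easy to see in a small sketch (Python, with a made-up sample): making the largest observation even more extreme changes its distance from the other values but not its rank, so any rank-based procedure is unaffected.

```python
def ranks(xs):
    # Average ranks: a value x occupies ranks count(<x)+1 .. count(<=x),
    # so tied values share the average of the ranks they occupy.
    return [(sum(y < x for y in xs) + 1 + sum(y <= x for y in xs)) / 2
            for x in xs]

# Hypothetical sample with one abnormally large value (60), then the same
# sample with that value made far more extreme (600).
data = [12, 15, 14, 13, 16, 15, 14, 60]
inflated = data[:-1] + [600]

print(ranks(data) == ranks(inflated))  # True: the ranks are unchanged
```

A mean-based analysis of these two samples would give very different answers; a rank-based one gives identical answers, at the cost of discarding the information about how far apart the values are.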

Summary of Section 2
In this section, techniques for exploring data, choosing a model and checking the
model have been discussed. Graphical and other methods for guiding model
choice, based on features such as the shape of the data, its range and other
properties such as mean–variance relationships have been reviewed. The
interpretation of probability plots has been discussed and relevant features of
MINITAB explored. Transformations of continuous variables have been
considered, and the ladder of powers introduced. The handling of outliers has
been discussed briefly.

3 Statistical modelling with MINITAB
In this section, you will have the opportunity to tackle an exercise, or
mini-project, in statistical modelling using MINITAB. The aim is to give you
practice in the skills of statistical modelling. The mini-project begins with some
background, a scientific question and a description of a data set relevant to the
question.
Refer to Chapter 8 of Computer Book C for the work in this section.

Summary of Section 3
In this section, you have undertaken extended analysis of a data set using
MINITAB, starting from a scientific question, and progressing through the various
stages of exploratory analysis, model and method choice, model checking, and
performing the relevant statistical calculations.

4 Writing a statistical report


A statistical investigation usually begins with a practical problem and ends with
the results being summarized in a statistical report. The stages involved were
discussed briefly in the introduction and represented in a flow chart in Figure 0.1.
In this section, some general advice is given on how to write — and how not to
write — a statistical report.
The statistical report is an account of what you did, why you did it, what you
found, and what your results mean in terms of the original scientific problem.
(The report is also important for your own record.) The
main challenge in writing a report is that it is aimed at two very different
readerships. First, it should be sufficiently detailed to allow other statisticians to
understand clearly what you did, and enable them to assess the validity of your
conclusions. But it is equally important that it should provide a non-technical
account of your investigation to non-statisticians who are interested primarily in
the original question. These distinct aims are usually reconciled by writing a
non-technical summary of your investigation to accompany the more technical
report.
It is also important to stress that the report should be succinct, and if possible,
short. A long-winded, rambling document is of little use to anyone!
In Subsection 4.1, the structure of a typical statistical report is described, and in
Subsection 4.2, an example is discussed in detail.

4.1 The structure of a statistical report


The key to a good report is its structure. This makes for easy reading. For
example, a non-specialist might ignore the more technical parts of the report
dealing with statistical methods, and read only the summary and the conclusions.
Also it is easier to write, as the structure helps to organize the material.
A possible structure for a statistical report is set out in the following box.

The structure of a statistical report


A statistical report comprises the following sections.
Summary
Introduction
Methods
Results
Discussion

This structure is reasonably standard, though some authors might use different
section headings — for example, Background instead of Introduction, Conclusion
instead of Discussion, and so on.
The Summary should be completely self-contained. It should state briefly the aim
of the analysis, the method used, the key finding or findings, and the
interpretation. It is usually written last, and should use largely non-technical
language. The ‘largely’ in the previous sentence is a reflection of the fact that it is
often simply not possible to provide an accurate summary of results without using
some statistical terminology or referring to some statistical concepts. It is far
better to give a slightly technical, but correct, summary than one apparently
easily understood by all, but potentially misleading.
The Introduction should contain a brief description of the problem or hypothesis
to be investigated, the setting in which the data were collected, and the data
available. Note that, in this course, the starting point is always a problem and
some data relevant to that problem.
The Methods section should include a description of the model, the procedures
used to check the model, the statistical tests employed, the method used for
calculating confidence intervals, and any other relevant techniques you have used,
such as data transformations. The key guide to this section is to include enough
detail to allow other statisticians to evaluate your method, and to repeat your
investigation if they had the same data. You should not include all the blind
alleys and dead ends you travelled (we all travel them) before settling on your
preferred solution. However, if you found two equally plausible models that give
appreciably different results, then you should include both.
The Results section should contain descriptive summaries of your data (for
example, graphical and numerical summaries), evidence that your model is
appropriate and, finally, the numerical results of statistical tests or confidence
interval calculations. It is important to remember that this section, as all others,
should be written in prose: a collection of numbers and graphs is not sufficient.
The Discussion should contain your own assessment of the statistical evidence
relating to the original question or hypothesis. In particular, you should discuss
any evidence of lack of fit of your model, any problems with the data (for
example, outliers), or any other matter that might have a bearing on the
interpretation of the results.
There is no set order in which to write the sections of a report but you should
present the sections in the order just described. Many readers will not read all the
sections — for example, many will read only the Introduction and Discussion — so
it is important to structure your report so that they can find the sections they are
interested in quickly. In some sense the Results section forms the heart of the
report. The Methods section is organized in such a way as to explain how you

obtained your results, while the Discussion is your interpretation of the results.
Some authors prefer to write the Results section first, followed by the Methods
section. You should use whatever order you feel most comfortable with. In any
case, you will probably find yourself going back over previous sections to make
sure everything fits together in a coherent whole.
Finally, one important general rule: the shorter, the better. If you can describe
something accurately in one sentence rather than two, then so much the better!
(But, of course, two short sentences are better than one long rambling sentence.)

Activity 4.1 Organizing the report


The following are a few notes from a statistical analysis that you wish to write up
as a statistical report. Organize this material into an outline report under the
headings Introduction, Methods, Results, Discussion.
Two groups, continuous variable. Sample variances similar. Did two sample
t-test, p = 0.16. Normality seemed OK in each group (probability plot). 95%
t-interval (−3.92, 17.63). Conclude means could be the same for both groups.
Sample sizes 24 and 32.
Note that you are not expected to write the report, just to reorganize the
information in the sentences or parts of sentences under the four headings.

4.2 Writing the report


In this subsection, an analysis of the data on head shape that were described in
Example 1.2 is used to illustrate what might be included in a statistical report.
Only the comparison of head shapes of first and second sons will be considered.
Example 4.1 provides a brief account of the analysis of these data.

Example 4.1 Head shapes of first and second sons


Head measurements were taken for the first and second sons from 25 families. For
each son, head shape was calculated as head breadth divided by head length
(both in the same units), multiplied by 100. This is called the head shape index.
It is assumed that this index does not vary substantially over the age range of the
data. The question is whether there is any difference in shape indices between
first and second sons.
The data are paired, so it makes sense to look at the differences between the head
shape indices of first and second sons. The mean difference is 0.19, so it appears
that the shape index is slightly greater for first sons than for second sons. In
Activity 1.6 you may have suggested that a normal model seemed appropriate for
the differences. A histogram and a normal probability plot of the differences are
shown in Figure 4.1.

Figure 4.1 Head shape differences: (a) histogram (b) normal probability plot

The probability plot confirms the validity of the normal model.


The next step is to calculate a t-interval for the mean difference: the 95%
t-interval is (−1.35, 1.72). Finally, a paired t-test of the null hypothesis of zero
mean difference might be carried out. This gives t = 0.25 on 24 degrees of
freedom. The p value for the test is 0.80. There were no problems with
outliers.
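The calculations in Example 4.1 follow the standard paired-t recipe: work with the differences, compute t as the mean difference divided by its standard error, refer it to n − 1 degrees of freedom, and form the confidence interval from the same standard error. The head shape data are not reproduced here, so the Python sketch below uses a small set of hypothetical differences purely to show the mechanics (MINITAB performs the same computation).

```python
import math
from statistics import mean, stdev
from scipy import stats

# Hypothetical differences (first son minus second son); the actual head
# shape data are not reproduced here.
diffs = [0.8, -1.2, 2.1, 0.4, -0.9, 1.5, -2.3, 0.6, 1.1, -0.5]

n = len(diffs)
d_bar, s = mean(diffs), stdev(diffs)
se = s / math.sqrt(n)

t_stat = d_bar / se                     # paired t statistic, n - 1 df
t_star = stats.t.ppf(0.975, n - 1)      # critical value for a 95% interval
ci = (d_bar - t_star * se, d_bar + t_star * se)
p_value = 2 * stats.t.sf(abs(t_stat), n - 1)
```

The same numbers come out of scipy.stats.ttest_1samp(diffs, 0), which is the paired test applied to the differences.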

Example 4.2 Writing the report


The information in Example 4.1 can be organized into a statistical report as
follows.
The Introduction states the problem and describes the data available, including
the source of the data. You should not include extraneous material such as
theories of heredity or genetics here (though, of course the scientists who gave you
the data might wish to do so in their report — but that is up to them). So the
following (taken from the description of Example 1.2) is adequate.
Introduction
In 25 families where there were at least two sons, measurements were taken
on the head length and head breadth (both measured in mm) of the first and
second sons. The head shape index is defined as 100 × (breadth/length). The
issue addressed in this report is whether there is a difference between the
head shapes of first and second sons. The data for this analysis were taken
from Frets, G.P. (1921) Heredity of head form in man. Genetica, 3, 193–384.
The next step is to write the Methods section. You should state the variables and
describe the methods that you used to reach your conclusions, but not all the
blind alleys that you might have explored in the process! For example, a normal
model was chosen, so it should be mentioned that a probability plot was used to
check the adequacy of the model. This information is required so that, if other
statisticians read your results, they will know that the model is valid. However, if
you originally thought you might use, say, an exponential model, but dropped the
idea once you looked at the data, this information should not go in the Methods
section. Readers are not interested in your thought processes, or all the mistakes
you might have made along the way, but simply want to know how you obtained
your results, and whether your methods were appropriate.
The Methods section is generally aimed at a statistical readership, and hence you
can quote standard methods without describing them in any detail. For example,
it is perfectly appropriate to say ‘95% t-intervals were calculated’ without
explaining what a confidence interval is or what the t-distribution is. Indeed, you
should most definitely not describe what they are! Finally, it is often useful to
include details of the software you used, usually the name and version. Here is a
suggested Methods section for the report on head shapes.
Methods
As the data are paired, the analysis was based on differences between the
head shape indices of first sons and second sons. A normal model was used for
the differences; this was checked using a normal probability plot. A 95%
t-interval was calculated and a paired t-test was used to test the hypothesis
that the mean difference is zero. All analyses were performed using MINITAB
Student Version 14.
Next comes the Results section. A general rule is to separate the results into
descriptive summaries and analytic results. Descriptive summaries might include
some relevant numerical summaries (for example, median and range) or graphs.
The aim is to convey some feel for the data. However, you should beware of
including too many descriptive summaries: the aim is to highlight aspects of the
data that are relevant to the question of interest. For example, for the head shape
data, you might include a comparative boxplot of head shape indices, or perhaps a
histogram of the differences in head shape indices; one such graph is enough here.

The analytic results include those that directly address the original question set
out in the Introduction. The original question relates to the difference between the
head shapes of first and second sons. Thus you should report the mean difference
and the 95% t-interval. In addition, you should report the result of the paired
t-test. Finally, you need to provide some evidence that your methods are justified.
In this case, a probability plot was used to test the normality assumption. It is
not essential to show this plot. To save space it is quite reasonable simply to state
that you used this method to check the assumption.
Results
The distribution of the 25 differences between the head shape indices of first
and second sons is shown in the histogram below. The data were
approximately normally distributed, as confirmed by a probability plot.

The mean difference in head shape indices was 0.19, with 95% t-interval
(−1.35, 1.72). A paired t-test of the hypothesis of zero mean difference gave
t = 0.25 on 24 degrees of freedom, p = 0.80.
The next section is the Discussion section, in which you give your interpretation
of the results in the light of the original question. This is also the place where you
should comment on the possible impact of any other factors (such as missing data
or outliers) on the interpretation. In this example there are no such factors. The
section can thus be suitably brief: there is no evidence of a difference. However, it
is worth qualifying this conclusion by reminding the reader that the sample size
was rather small. (In general, it is important to write concisely and to the point.)
Discussion
We conclude that there is little evidence against the hypothesis of no
difference between the head shape indices of first and second sons. However,
the sample size for this study was only 25.
Finally, having assembled and re-read the report, you can now write the
Summary. This states briefly the purpose of the analysis, the method used, the
key finding and its interpretation. It should be largely non-technical.
Summary
The aim of this analysis was to compare head shapes of first and second sons,
using a shape index based on the ratio of head breadth to head length. Data
on 25 pairs of first and second sons were obtained from a published source
and analysed using a normal model. We found no significant difference
between the head shapes of first and second sons.
This completes the report. The final step is to read through the report and check
it.

The sections of the report on head shapes of first and second sons that were
written in Example 4.2 are assembled in the following box.

A complete statistical report


Summary
The aim of this analysis was to compare head shapes of first and second
sons, using a shape index based on the ratio of head breadth to head length.
Data on 25 pairs of first and second sons were obtained from a published
source and analysed using a normal model. We found no significant
difference between the head shapes of first and second sons.
Introduction
In 25 families where there were at least two sons, measurements were taken
on the head length and head breadth (both measured in mm) of the first
and second sons. The head shape index is defined as 100 × (breadth/length).
The issue addressed in this report is whether there is a difference between
the head shapes of first and second sons. The data for this analysis were
taken from Frets, G.P. (1921) Heredity of head form in man. Genetica, 3,
193–384.
Methods
As the data are paired, the analysis was based on differences between the
head shape indices of first sons and second sons. A normal model was used
for the differences; this was checked using a normal probability plot. A 95%
t-interval was calculated and a paired t-test was used to test the hypothesis
that the mean difference is zero. All analyses were performed using
MINITAB Student Version 14.
Results
The distribution of the 25 differences between the head shape indices of first
and second sons is shown in the histogram below. The data were
approximately normally distributed, as confirmed by a probability plot.

The mean difference in head shape indices was 0.19, with 95% t-interval
(−1.35, 1.72). A paired t-test of the hypothesis of zero mean difference gave
t = 0.25 on 24 degrees of freedom, p = 0.80.
Discussion
We conclude that there is little evidence against the hypothesis of no
difference between the head shape indices of first and second sons. However,
the sample size for this study was only 25.

Activities 4.2 and 4.3 will give you some practice at writing short statistical
reports.

Activity 4.2 Fish traps


In this activity you are invited to write a report on a further analysis that was
carried out of the fish traps data of Example 1.1. (These data were discussed in
Activity 1.1.) The aims of this further analysis were as follows.
To describe the distribution of catches.
To estimate the mean number of fish caught per trap.
To estimate the proportion of traps with no catch.
There were a total of 100 fish traps: 72 of the traps contained between 3 and 5
fish and the range was from 0 to 8 per trap. The distribution of catches is shown
in the bar chart in Figure 4.2.
The mean number of fish per trap was 4.04, with 95% z-interval (3.76, 4.32). One
trap failed to catch any fish. The proportion of traps with no catch was thus 0.01.
The exact 95% confidence interval for this proportion is (0.00025, 0.054).
Write a short report of this analysis.
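For interest, the "exact" interval quoted above for the proportion of traps with no catch is, assuming the usual construction, the Clopper-Pearson interval, which can be computed from quantiles of beta distributions. A Python sketch:

```python
from scipy import stats

def clopper_pearson(x, n, level=0.95):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion,
    based on x successes in n trials."""
    alpha = 1 - level
    lower = 0.0 if x == 0 else stats.beta.ppf(alpha / 2, x, n - x + 1)
    upper = 1.0 if x == n else stats.beta.ppf(1 - alpha / 2, x + 1, n - x)
    return lower, upper

# One trap out of 100 with no catch.
lo, hi = clopper_pearson(1, 100)
print(round(lo, 5), round(hi, 3))  # 0.00025 0.054
```

This reproduces the interval (0.00025, 0.054) given in the activity.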

Activity 4.3 Used windows on reusable envelopes


This activity is based on the data on used windows on windowed envelopes, which
were discussed in Example 1.4. These data were discussed further in Examples 1.7
and 1.8 and in Activities 1.1, 1.2 and 2.3. The aims of the analysis were as follows.
To estimate the mean number of used windows.
To find a suitable model for the distribution of used windows.
A bar chart of the numbers of used windows is shown in Figure 4.3. The mean
number of used windows for this sample of 311 envelopes was 3.412, with
large-sample 95% confidence interval (3.122, 3.701).
In Activity 2.3 you may have concluded that the shape of the distribution was
suggestive of a geometric distribution, even though the maximum possible number
of used windows is twelve. Table 4.1 shows the observed frequencies, together
with those expected assuming a geometric model with p = 1/3.412. The goodness
of fit was assessed after combining the last three rows and yielded a chi-squared
test statistic of 3.16 on 8 degrees of freedom, p = 0.92.

Table 4.1 Numbers of used windows: observed and expected frequencies

Count   Observed   Expected
1 91 91.15
2 64 64.43
3 39 45.55
4 30 32.20
5 27 22.76
6 20 16.09
7 11 11.38
8 8 8.04
9 7 5.68
10 9 4.02
11 4 2.84
≥ 12 1 6.85

Write a short report of this analysis.
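The goodness-of-fit figures quoted above can be reproduced directly from the table. The Python sketch below (the course itself uses MINITAB for such calculations) builds the expected geometric frequencies from p = 1/3.412 and n = 311, combines the last three cells as described, and computes the chi-squared statistic; one extra degree of freedom is deducted because p was estimated from the data.

```python
from scipy import stats

n, sample_mean = 311, 3.412
p = 1 / sample_mean               # geometric parameter estimated from the mean
observed = [91, 64, 39, 30, 27, 20, 11, 8, 7, 9, 4, 1]  # counts 1..11, then >= 12

# Expected frequencies under a geometric model: P(X = k) = p(1-p)^(k-1),
# with the final cell collecting P(X >= 12).
expected = [n * p * (1 - p) ** (k - 1) for k in range(1, 12)]
expected.append(n * (1 - p) ** 11)

# Combine the last three cells, as in the text, then compute the statistic.
obs = observed[:9] + [sum(observed[9:])]
exp = expected[:9] + [sum(expected[9:])]
chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
dof = len(obs) - 1 - 1            # one parameter (p) estimated from the data
p_value = stats.chi2.sf(chi2, dof)
print(round(chi2, 2), dof, round(p_value, 2))  # 3.16 8 0.92
```

The large p value indicates no evidence of lack of fit of the geometric model.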

Summary of Section 4
In this section, you have learned how to structure and write a statistical report. A
convenient structure includes paragraphs entitled Summary, Introduction,
Methods, Results and Discussion.

Exercise on Section 4
Exercise 4.1 Interspike intervals
Data on the motor cortex neuron interspike intervals of an unstimulated monkey
were introduced in Example 1.3. Suppose it is required to describe the distribution
of the logarithms of the interspike intervals and, in particular, to estimate the
mean and calculate a confidence interval for the mean. Suppose also that you have
chosen a normal model. The following is a brief description of a suitable analysis.
The 100 interspike intervals, measured in milliseconds, were transformed using
natural logarithms (see Figure 2.13(b)). The distribution of the transformed data
was roughly normal, as judged by a normal probability plot, with one possible
outlier corresponding to an interval of 2 milliseconds. Accordingly, a 95%
t-interval was calculated for the mean of the transformed data. The mean was
3.41, with t-interval (3.28, 3.54). The standard deviation was 0.66. The
calculations were repeated without the outlier, yielding mean 3.44, 95% t-interval
(3.31, 3.56), and standard deviation 0.6085.
Write a short report of this analysis.
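The first t-interval quoted in the exercise can be reproduced from the summary statistics alone. The Python sketch below applies the standard formula, mean plus or minus t* times s over root n, using the reported values (mean 3.41, standard deviation 0.66, n = 100).

```python
import math
from scipy import stats

def t_interval(sample_mean, sd, n, level=0.95):
    """t confidence interval for a mean, from summary statistics."""
    t_star = stats.t.ppf((1 + level) / 2, n - 1)
    half = t_star * sd / math.sqrt(n)
    return sample_mean - half, sample_mean + half

lo, hi = t_interval(3.41, 0.66, 100)   # all 100 log interspike intervals
print(round(lo, 2), round(hi, 2))      # 3.28 3.54
```

This matches the interval (3.28, 3.54) given in the exercise; the interval excluding the outlier can be checked in the same way from its own summary statistics.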

Summary of Unit C3
In this unit, the methods that you have learned so far in the course have been
integrated into a statistical modelling process. Starting with a question or set of
questions, you have learned to use information about the setting, the types of
variables collected, and any other prior information you might possess — for
example, about the likely shape of the distribution of the data — to identify
possible models for the data. Various approaches to selecting a model, after an
initial exploration of the data using graphical methods and numerical summaries,
have been discussed. You have also learned how to interpret probability plots and
how to transform continuous data. The problem of outliers has been discussed
briefly. MINITAB has been used to apply these methods to a mini-project, which
began with a problem of scientific interest. You have also learned how to
structure and write a statistical report.

Learning outcomes
You have been working towards the following learning outcomes.

Terms to know and use


Ladder of powers, confidence bands.

Ideas to be aware of
That statistical analysis is a process, beginning with a question or problem of
interest, and ending with a statistical report, and involving data exploration,
model choice and model checking, in a cycle that may be repeated several
times.
That the aim of statistical modelling is to draw valid and relevant inferences,
not to find a perfect model.
That some statistical models arise from standard settings, which can in turn
be used for model choice.
That a statistical report comprises a non-technical summary, an introduction,
a methods section, a results section, and a discussion.

Statistical skills
Use information about the setting of a problem and the type of data collected
to set out an initial modelling framework.
Use information on the discreteness or continuity of a variable and the shape
of its distribution to guide your choice of model.
Use graphical representations of data to select an appropriate distribution to
represent a variable.
Use numerical summaries such as the mean and the variance to inform model
choice.
Use standard settings to guide model choice.
Choose a transformation and apply it to a variable in order to reduce
skewness.
Interpret probability plots using confidence bands.
Identify outliers and explore their influence.
Choose a statistical technique to address a specific problem or question.
Structure a statistical report.
Write a statistical report.

Features of the software to use


Use MINITAB to obtain confidence bands for probability plots.
Use MINITAB to test normality assumptions.
Combine the data manipulation, calculation, statistical and graphical
facilities of MINITAB to undertake a complete statistical analysis.

Solutions to Activities
Solution 1.1 Here are my immediate thoughts Solution 1.4 The relevant random variable is the
about the four examples. Later, you will see why these number of the failed pylon, which can take values
might be reasonable starting points. At this stage, between 1 and 24. The variable is thus discrete on the
however, these suggestions are really only ‘hunches’; integers 1 to 24. Since the pylons are all equally likely
other suggestions might be just as valid. to have failed, a suitable model is the discrete uniform
Example 1.1 involves counts of fish; the word ‘counts’, distribution on the integers 1, 2, . . . , 24.
with no notion of a repeated ‘trial’ (which would make
the binomial and geometric models worth Solution 1.5 The data are counts of events in
considering), immediately suggests a Poisson model. intervals of fixed length. The emission of alpha
particles may reasonably be assumed to occur at
Example 1.2 involves differences between head shapes
random. The setting thus corresponds to the standard
of first and second sons; perhaps a normal distribution
setting for the Poisson distribution.
might be appropriate.
Example 1.3: the data are waiting times, so this Solution 1.6 All three variables are continuous, so
suggests an exponential distribution. a continuous model should be chosen in each case.
Example 1.4: I’m not sure about this one. (a) Head size is necessarily positive. However, values
Solution 1.2
The data of Example 1.3 are time measurements, and hence are continuous. So suitable candidate distributions include the exponential and the normal. The continuous uniform distribution would require there to be a maximum interspike interval. This is not ruled out, though it might seem a little unlikely.
The data of Example 1.4 are counts in the range 1 to 12, and hence are discrete. At this stage, any of the standard discrete distributions might be considered. The geometric and Poisson distributions allow counts greater than 12, which are ruled out by the setting. Perhaps the discrete uniform distribution seems the least unnatural choice at this stage, though no distribution seems entirely appropriate.

Solution 1.3
(a) Continuous. Weights are continuous, and the measurement accuracy (to two decimal places) is good.
(b) Discrete. The data are counts, and take low values — I’d guess mostly under 20.
(c) Continuous or discrete. The number of tickets sold is discrete, hence it could be modelled as a discrete variable. However, the numbers are so big (several million tickets are sold each week) that it could equally well be modelled as a continuous variable.
(d) Discrete. The number of jackpot winners is clearly a discrete variable. Unfortunately, few people win the jackpot, so it would not be appropriate to treat the number of jackpot winners as a continuous variable.
(e) Continuous or discrete. Examination marks are integers from 0 to 100, so they are discrete. However, it might make sense to treat them as continuous in view of the large number of different values.
(f) Discrete. The pass grades are discrete and take only a restricted range of values.

of head length plus head breadth are likely to cluster around some typical value some distance from zero, so that a normal model is likely to be appropriate. Note that the normal model is not ideal since it theoretically allows negative values. However, our aim is to obtain a reasonable model, not a perfect one!
(b) Head shape also takes positive values. The normal model is appropriate here as well — at least as a first model — since it might be expected that shape measurements would fluctuate around an average value.
(c) Differences between head shape indices can reasonably be expected to take both negative and positive values, perhaps clustered close to zero. A normal model again seems appropriate here, as a first choice.

Solution 2.1
(a) The Poisson distribution is unimodal, whereas the distribution in Figure 2.2(a) is bimodal. So a Poisson model is unsuitable.
(b) The Poisson distribution has zero as its mode if µ is less than 1, so a Poisson model may be appropriate.
(c) The data are left-skew, whereas the Poisson distribution is always right-skew (or roughly symmetrical when the mean is large), so a Poisson model is unsuitable.

Solution 2.2
Figure 2.3(a): The shape of this bar chart is consistent with all three of the distributions.
Figure 2.3(b): This bar chart is left-skew, whereas Poisson distributions and geometric distributions are always right-skew. So the shape of this bar chart is only consistent with a binomial distribution.
120 Unit C3
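Solution 2.1(b) relies on the fact that a Poisson distribution has its mode at zero when µ is less than 1, because the ratio P(X = k+1)/P(X = k) = µ/(k + 1) is then below 1 for every k. The course's calculations are done in MINITAB; purely as an independent illustration outside the course software, a short Python sketch locates the mode numerically using that pmf recurrence.

```python
from math import exp

def poisson_mode(mu, kmax=50):
    """Smallest k maximising the Poisson pmf with mean mu.

    Uses the recurrence P(X = k) = P(X = k - 1) * mu / k, so the
    probabilities decrease from k onwards once mu / (k + 1) < 1.
    """
    p = exp(-mu)               # P(X = 0)
    best_k, best_p = 0, p
    for k in range(1, kmax):
        p *= mu / k
        if p > best_p:
            best_k, best_p = k, p
    return best_k

print(poisson_mode(0.7))   # 0: the mode is at zero whenever mu < 1
print(poisson_mode(3.5))   # 3: more generally the mode is at floor(mu)
```

When µ is an integer the pmf ties at µ − 1 and µ; the unimodal, right-skew shape appealed to in parts (a) and (c) follows from the same recurrence.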
Figure 2.3(c): The shape of this bar chart is consistent with either a Poisson distribution or a binomial distribution, but not with a geometric distribution (for which the mode is always at 1, the lowest value in its range).
Note that only the shapes of the bar charts have been considered in this solution. Other factors, such as the range, have been ignored.

Solution 2.3
(a) It is clear that envelopes with few used windows are more frequent than envelopes with many used windows. So the uniform distribution is not appropriate.
(b) The shape of the data appears to be consistent in general terms with either a Poisson model or a geometric model. The range of a geometric distribution is 1, 2, . . . , and that of a Poisson distribution is 0, 1, . . . . Since the smallest number of used windows is 1, a geometric model seems the more suitable. In fact, a geometric model gives a very good fit to these data. Alternatively, we could, for instance, try modelling X − 1 using a Poisson distribution, where X is a random variable representing the number of used windows.
(c) The shortcoming of the geometric model is that it has an unbounded range, whereas in the present setting the number of used windows can be no greater than 12. However, this is probably not a big problem since there are relatively few envelopes with more than half the windows used, and only one with twelve used windows. So a distribution that gives small probabilities to values just below 12 and very small probabilities to values greater than 12 could be appropriate. Thus the geometric model might be a reasonable one in spite of the range restriction.
(d) The fit of the model would be checked using a chi-squared goodness-of-fit test. The expected value for twelve used windows should be calculated using the upper tail of the geometric distribution — that is, P(X ≥ 12).

Solution 2.4
In Example 1.11 it was suggested that the underlying distribution for these data might be normal or exponential. The normal model seems out of the question, owing to the substantial skewness of the histogram in Figure 2.6. The shape of the histogram suggests that the exponential model might be better.

Solution 2.5
The points of the exponential probability plot do not lie close to the straight line, which suggests that the exponential model is not appropriate. One alternative is to try a normal model. However, the histogram in Figure 2.5 is positively skewed. A possibility is to try transforming the data. This is discussed in Subsection 2.3.

Solution 2.6
The mean and standard deviation do not differ greatly. This suggests that an exponential model might be a reasonable choice.

Solution 2.7
The pattern of the points in Figure 2.9(a) is roughly linear, apart from a single observation in the top right corner of the plot. Thus the normal model appears to be appropriate for the bulk of the data, though there may be an outlier.
The plot in Figure 2.9(b) shows systematic curvature, suggesting that the normal model is inappropriate. This shape is in fact typical of right-skew distributions.

Solution 2.8
The original data are markedly right-skew. Both the log and the square root transformations have reduced the skew by ‘pulling in’ the values to the right of the mode and ‘stretching out’ those to the left. This ‘stretching out’ effect is more marked for the log transformation. However, it is rather difficult to decide which of the two transformed histograms is the more symmetric.

Solution 2.9
An estimate of the age of the site based on all eight observations is given by the sample mean, which is approximately 2622 years. However, this mean is rather unsatisfactory, as it lies above seven of the eight points. This is because it is greatly influenced by the value for C-367, which is 3433 years. When this point is omitted, the mean of the remaining seven points is approximately 2506 years. Sample number C-367 clearly has a big influence on the mean. You should therefore report the calculations both including and excluding sample C-367, and perhaps suggest further investigation of the outlier.

Solution 4.1
Reorganizing the material should produce something like the following.
Introduction
Compare means of a continuous variable in two groups given samples of sizes 24 and 32.
Methods
Check normality in each group using probability plots.
Calculate 95% t-interval.
Perform two-sample t-test.
Results
Normal model reasonable.
95% t-interval was (−3.92, 17.63).
Two-sample t-test (with equal variances) gave p = 0.16.
Discussion
Little evidence that the means are different in the two groups.
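Solution 2.3(d) recommends computing the expected frequency for the final cell from the upper tail P(X ≥ 12). For a geometric distribution on 1, 2, . . . this tail has the closed form (1 - p)^11. As an illustration only (the value p = 1/3.412 is borrowed from the geometric fit reported in Solution 4.3, not derived here), a short Python check confirms the closed form against direct summation of the pmf.

```python
# Illustrative value of p, borrowed from the geometric fit in Solution 4.3.
p = 1 / 3.412

# Closed form for the pooled upper-tail cell: P(X >= 12) = (1 - p)^11.
tail = (1 - p) ** 11

# Check against direct summation of P(X = k) = (1 - p)^(k - 1) * p;
# the truncation error beyond k = 400 is negligible.
tail_by_sum = sum((1 - p) ** (k - 1) * p for k in range(12, 400))

assert abs(tail - tail_by_sum) < 1e-9
print(round(tail, 4))   # about 0.022
```

Multiplying this tail probability by the sample size gives the expected frequency for the pooled "12 or more" cell in the goodness-of-fit test.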
Solutions to Activities 121
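Solution 2.8 compares the log and square root transformations for reducing right skew. The course assesses this graphically in MINITAB; as a numerical illustration outside the course software (using a simulated right-skew exponential sample rather than the course data), the following Python sketch compares the moment coefficient of skewness before and after each transformation.

```python
import math
import random

def skewness(xs):
    """Sample moment skewness m3 / m2^(3/2)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

random.seed(1)
# Simulated right-skew sample, standing in for the skewed data.
data = [random.expovariate(1.0) for _ in range(2000)]

raw = skewness(data)
sq = skewness([math.sqrt(x) for x in data])
lg = skewness([math.log(x) for x in data])

print(round(raw, 2), round(sq, 2), round(lg, 2))
```

Both transformed samples are much less skew than the raw one, mirroring the solution's point; note that the log transformation can overshoot into left skew, which is one reason it can be hard to say which transformed histogram is the more symmetric.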
Solution 4.2
Summary
The distribution of catch sizes from 100 fish traps is described.
Introduction
The numbers of fish caught in 100 traps were counted. The aims of this analysis are to describe the distribution of fish catches, and to estimate the mean catch per trap and the proportion of traps with zero catch. The data for this analysis were obtained from David, F.N. (1971) A First Course in Statistics, 2nd edn. Griffin, London.
Methods
An approximate 95% confidence interval for the mean catch per trap was calculated using large-sample methods. A 95% confidence interval for the proportion of traps with zero catch was calculated using exact methods. All calculations were performed using MINITAB Student Version 14.
Results
A bar chart of the catches is shown below.
[Bar chart of the catch sizes for the 100 traps]
Of the 100 traps, 72 contained between 3 and 5 fish. The minimum catch was 0, the maximum was 8. The average catch per trap was 4.04 fish, with approximate 95% confidence interval (3.76, 4.32). Out of the 100 traps, only one produced a zero catch. The proportion of traps with no catch is thus 0.01, with exact 95% confidence interval (0.00025, 0.054).
Discussion
The distribution of fish catches was symmetric with mean about 4. Only one out of one hundred traps had zero catch.

Solution 4.3
Here is a possible report. You might have decided to include the data in Table 4.1 as well.
Summary
The aim of this analysis is to develop a model for the distribution of the number of addressees on a sample of reusable envelopes. Data on a sample of 311 envelopes were obtained from a published source. A geometric model was found to provide a good fit to the data.
Introduction
The numbers of used windows on a sample of 311 reusable envelopes used for internal circulation of documents within an office were counted. The aim of the analysis is to describe the distribution of the number of used windows, estimate the mean number of used windows, and obtain a model for the distribution. The data for this analysis were obtained from Sutherland, W. (1990) The great pigeonhole in the sky. New Scientist, 9 June, 73–74.
Methods
A 95% z-interval for the mean number of used windows was calculated. The fit of the geometric model was assessed using the chi-squared goodness-of-fit test. For this purpose, frequencies of 10, 11 and 12 were combined into one category, so the number of degrees of freedom for the test was 8. All calculations were performed using MINITAB Student Version 14.
Results
The numbers of used windows on the 311 envelopes were distributed as shown in the bar chart below.
[Bar chart of the numbers of used windows on the 311 envelopes]
The mean number of used windows was approximately 3.41, with approximate 95% confidence interval (3.12, 3.70). A geometric model with p = 1/3.412 provided a good fit to the data: chi-squared test statistic 3.16 on 8 degrees of freedom, p = 0.92.
Discussion
The average age of the envelopes, calculated in numbers of addressees, is about 3.41. A geometric distribution provided an excellent fit to the data. However, the model is only approximate as it does not allow for the fact that the number of windows on each envelope is restricted to 12.
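The Results sections of Solutions 4.2 and 4.3 quote a MINITAB exact 95% interval of (0.00025, 0.054) for one zero catch in 100 traps, and p = 0.92 for a chi-squared statistic of 3.16 on 8 degrees of freedom. As an independent check outside the course software, the Python sketch below reproduces both figures from first principles, assuming that the "exact methods" refer to the usual Clopper-Pearson construction, and using the closed form for the chi-squared upper tail that holds for even degrees of freedom.

```python
from math import comb, exp

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

def clopper_pearson(x, n, level=0.95):
    """Exact (Clopper-Pearson) CI for a proportion, found by bisection."""
    a = (1 - level) / 2

    def solve(root_is_above):
        lo, hi = 0.0, 1.0
        for _ in range(60):            # bisection to ~2^-60 precision
            mid = (lo + hi) / 2
            if root_is_above(mid):
                lo = mid
            else:
                hi = mid
        return lo

    # Lower limit solves P(X >= x | p) = a; the left side increases with p.
    lower = 0.0 if x == 0 else solve(lambda p: 1 - binom_cdf(x - 1, n, p) < a)
    # Upper limit solves P(X <= x | p) = a; the left side decreases with p.
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p) > a)
    return lower, upper

def chi2_sf_even_df(x, df):
    """P(chi-squared on df degrees of freedom >= x); closed form, even df."""
    m = df // 2
    term, total = 1.0, 1.0
    for k in range(1, m):
        term *= (x / 2) / k
        total += term
    return exp(-x / 2) * total

# Solution 4.2: one zero catch in 100 traps.
lower, upper = clopper_pearson(1, 100)
print(round(lower, 5), round(upper, 3))     # 0.00025 0.054

# Solution 4.3: chi-squared statistic 3.16 on 8 degrees of freedom.
print(round(chi2_sf_even_df(3.16, 8), 2))   # 0.92
```

Both quoted figures are recovered, which supports reading the reported intervals and p value at face value.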
Solutions to Exercises
Solution 1.1
The data are clearly discrete, so a discrete model is appropriate. However, if the output is typically large (which might be the case with a major manufacturer such as Ford), it might make sense to model output as a continuous variable. On the other hand, if output is low (as might be the case for a luxury hand-crafted car such as the Morgan), then a discrete model would be better. Thus information about the scale of the output is required.

Solution 1.2
(a) The location of the leak can be measured as its distance from one end of the street, and hence takes values between 0 and L, where L is the street length. Nothing more is known, so it is reasonable to assume that the leak is equally likely to have occurred at any point. Thus it would seem reasonable to use the continuous uniform distribution U(0, L) to model the location of the leak.
(b) Suppose the T-joints are numbered 1 to n, where n is the number of T-joints in the street. All that is known is that the leak is likely to have occurred at one of these joints. Without any reason to suppose otherwise, we can assume that the leak might have occurred at any joint with equal probability. Thus an appropriate model is the discrete uniform distribution on the integers 1, 2, . . . , n.

Solution 4.1
The following report could also include a histogram of the logarithms of the intervals, such as the one in Figure 2.13(b).
Summary
A normal model was used to describe the variation in the logarithms of interspike intervals of an unstimulated monkey. The mean log(interspike interval) is estimated.
Introduction
One hundred motor cortex neuron interspike intervals of an unstimulated monkey were measured (in milliseconds). In this analysis, the mean of the logarithms of the interspike intervals is estimated. The data for the analysis were obtained from Zeger, S.L. and Qaqish, B. (1988) Markov regression models for time series: a quasi-likelihood approach. Biometrics, 44, 1019–1031.
Methods
The data were transformed using natural logarithms. A 95% t-interval was calculated for the mean of the logarithms of the interspike intervals. The validity of the normal model was investigated using a normal probability plot.
Results
The normal model for the logarithms of the interspike intervals was adequate, apart from a single outlier corresponding to an interspike interval of 2 milliseconds. The mean was 3.41, with 95% confidence interval (3.28, 3.54). The standard deviation was 0.66. When the outlier was excluded the mean was 3.44, with 95% confidence interval (3.31, 3.56), and the standard deviation was 0.61.
Discussion
The outlier has little effect on the results. A normal model is appropriate for describing the variation in the logarithms of the interspike intervals.
Index for Block C
p value 14
t-test 22
alternative hypothesis 10, 15
chi-squared distribution 78, 79
chi-squared goodness-of-fit statistic 77
chi-squared goodness-of-fit test 80
choosing a continuous model: range and shape 97
choosing a discrete model: range and shape 96
choosing a model 91, 94
confidence level 9, 10
degrees of freedom 78
distribution-free test 61
fixed-level testing 36
  procedure 38
goodness-of-fit test 77
hypothesis 9
hypothesis testing using confidence intervals 9
interpreting significance levels 12
interpreting significance probabilities 18
ladder of powers 106
Mann–Whitney test 71, 72
mean–variance relationships 102
modelling process 90
nonparametric test 61
normal approximation
  in Mann–Whitney test 72
  in Wilcoxon test 69
null distribution 15
null hypothesis 9, 15, 29
one-sample t-test 22, 28
one-sided test 23
one-tailed test 23
outliers 107
parametric test 61
planning sample sizes 49
pooled estimate of the common variance 33
power 42, 46
probability plots 103
ranks 66
rejection region 37
sign test 61, 66
significance level 9, 10
significance probability 14
significance testing 14
  procedure 18
standard settings for continuous distributions 97
standard settings for discrete distributions 94
statistical report 110
structure of a statistical report 111
Student’s t-test 22
test statistic 15
testing a normal mean 20
testing a proportion 8, 9, 26
testing the difference between two Bernoulli probabilities 29
transforming data 104
two-sample t-test 31, 33
  assumptions 31, 32
two-sided test 9
two-tailed test 9
Type I error 42
Type II error 42
validity of the chi-squared approximation 80
Wilcoxon signed rank test 66, 68
writing a statistical report 110