Block C
About this course
M248 Analysing data uses the software package Student Version of MINITAB for
Windows (Minitab Inc.) and other software to explore and analyse data and to
investigate statistical concepts. This software is provided as part of the course, and its
use is covered in the associated computer books.
Acknowledgement
Grateful acknowledgement is made to the Statistical Laboratory, Iowa State University
for permission to reproduce the photograph of R.A. Fisher in Figure 5.4 of Unit C1.
Every effort has been made to contact copyright owners. If any have been inadvertently
overlooked, the publishers will be pleased to make the necessary amendments at the
first opportunity.
This publication forms part of an Open University course. Details of this and other
Open University courses can be obtained from the Student Registration and Enquiry
Service, The Open University, PO Box 197, Milton Keynes MK7 6BJ, United Kingdom:
tel. +44 (0)845 300 6090, email general-enquiries@open.ac.uk
Alternatively, you may visit the Open University website at http://www.open.ac.uk
where you can learn more about the wide range of courses and packs offered at all
levels by The Open University.
To purchase a selection of Open University course materials visit
http://www.ouw.co.uk, or contact Open University Worldwide, Walton Hall, Milton
Keynes MK7 6AA, United Kingdom, for a brochure: tel. +44 (0)1908 858793,
fax +44 (0)1908 858787, email ouw-customer-services@open.ac.uk
Introduction 6
4 Two-sample tests 29
4.1 Testing the difference between two Bernoulli
probabilities 29
4.2 The two-sample t-test 31
4.3 Performing significance tests using MINITAB 35
5 Fixed-level testing 36
5.1 Performing a fixed-level test 36
5.2 A few comments 42
5.3 Fisher, Pearson and Neyman 43
5.4 Exploring the principles of hypothesis testing 44
Summary of Unit C1 52
Learning outcomes 53
Solutions to Activities 54
Solutions to Exercises 58
UNIT C2 Nonparametrics 60
Study guide for Unit C2 60
Introduction 60
1 Nonparametric tests 61
1.1 Early ideas: the sign test 61
1.2 The Wilcoxon signed rank test 66
1.3 The Mann–Whitney test 71
1.4 Nonparametric tests using MINITAB 74
Summary of Unit C2 83
Learning outcomes 84
Solutions to Activities 85
Solutions to Exercises 88
Introduction 90
1 Choosing a model: getting started 91
1.1 Continuous or discrete? 92
1.2 Which discrete distribution? 94
1.3 Which continuous distribution? 96
Study guide
Study guide for Block C
There is no CMA on Block C. TMA 03 covers Unit B3 and Block C.
Unit C1 is longer than average and Units C2 and C3 are both shorter than
average. Unit C1 will need seven study sessions; Units C2 and C3 will each need
four study sessions.
6 Unit C1
Introduction
In Block B, the results of statistical experiments were used to obtain confidence
intervals for population parameters, thus providing plausible ranges of values for
the parameters. This unit is about testing claims, or hypotheses, about the values
of population parameters: in each situation discussed, a claim is reinterpreted as a
statement about a population parameter — that is, a hypothesis is
formulated — and data are used to investigate the validity of the hypothesis.
Examples of the sorts of claims that are amenable to statistical investigation
include the following.
Drug A prolongs sleep, on average.
Drug A and Drug B are, on average, equally effective at prolonging sleep.
Eight out of ten dogs prefer Pupkins to any other dog food.
(At the time of writing, there is no dog food on the market called Pupkins, and no allusion to any other real trade name is intended here.)
Notice that all these claims can be reinterpreted as statements about the values of some unknown population parameters. For example, the first claim might be interpreted as a statement about the mean sleep gain of patients taking Drug A
(for some relevant population of patients). And the second claim can be thought
of as a statement that the mean sleep gain using Drug A is equal to the mean
sleep gain using Drug B (for some relevant population of patients).
Now consider the third claim, which might possibly be advanced by the
manufacturer of the dog food in the course of an advertising campaign: ‘Eight out
of ten dogs prefer Pupkins to any other dog food.’ It appears to mean that, in the
relevant population of dogs (perhaps all those in Britain), 80% of dogs presented
with a choice of all available dog foods would select Pupkins, while the other 20%
would make a selection from the remainder. So this claim may be interpreted as a
hypothesis about a population proportion.
One way to test this hypothesis would be to take a sample of dogs from the
population of dogs in Britain, offer them the full array of available dog foods, and
keep a record of which of them preferred Pupkins to all others. (There is a
problem of definition here: what exactly does ‘any other dog food’ mean? But let
us gloss over that.)
This could be an expensive experiment in terms of materials, and not an easy one to conduct. But let us concentrate on the principles.
(Maybe some alternative experimental design, in which different dogs are offered a more limited choice in various combinations, could be contrived.)
Notice that there is, implicit in the claim, the idea that ‘at least 80% of dogs prefer Pupkins’. We would not enter into serious dispute with the manufacturer if the evidence actually suggested that, say, 85% or 90% of dogs preferred Pupkins. We would seriously contest the claim only if there was evidence that it was exaggerated, and that in fact the underlying proportion was less than 80%.
If, in a small sample of 20 dogs, only 15 dogs showed the claimed preference for
Pupkins, an observed sample proportion of 75%, only an unreasonable person
would challenge the claim of an underlying 80%, for allowance must be made for
random variation arising from the sampling process. But if as few as (say) 11 or
12 of the 20 dogs demonstrated the claimed preference — only just over half the
sample — we might seriously begin to doubt the manufacturer’s claim. This unit
is about what constitutes sufficient evidence to reject a claim or hypothesis, or at
least to cast doubt upon it, and how to assess the strength of the evidence against
it.
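The course software for such calculations is MINITAB, but the arithmetic behind these intuitive judgements is easy to check directly. The following Python sketch (not part of the course materials; the counts are those discussed above) computes how likely each outcome would be if the manufacturer's figure of 80% were exactly right:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p), summed directly from the pmf."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# If the manufacturer's claim p = 0.8 is true, how surprising are
# the two sample outcomes discussed above?
p15 = binom_cdf(15, 20, 0.8)   # 15 or fewer of 20 dogs prefer Pupkins
p12 = binom_cdf(12, 20, 0.8)   # 12 or fewer of 20 dogs prefer Pupkins
print(f"P(X <= 15) = {p15:.4f}")   # about 0.37: quite unremarkable
print(f"P(X <= 12) = {p12:.4f}")   # about 0.03: grounds for doubt
```

P(X ≤ 15) is large, so 15 dogs out of 20 is entirely consistent with an underlying proportion of 0.8, whereas P(X ≤ 12) is small, matching the intuition that 11 or 12 dogs would cast serious doubt on the claim.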
In Section 1, the following straightforward approach to hypothesis testing is taken.
First, the claim is reinterpreted as a statement about the value of some unknown
population parameter. Then a random sample is taken from the population; this
is used to construct a confidence interval for the parameter. If the confidence
interval contains the hypothesized parameter value, then that value is a plausible
value for the parameter, and the sample does not provide enough evidence to
dismiss the claim. However, if the interval does not contain the hypothesized
value, then the conclusion is reached that there is sufficient evidence to doubt the
claim. In this way, a rule is developed for deciding whether or not to reject a
hypothesis. Notice that, since the confidence interval will be different for different
confidence levels, the decision rule will depend on the confidence level adopted.
Notice, also, the wording used here: whatever the results of the test, there is no implication that the claim is accepted as ‘true’. Even if it is established that the hypothesized parameter value is plausible, it will in general be only one of a range of plausible values. Statistical hypothesis testing is based on a principle of falsifiability rather than of verifiability. In the approach just described, a hypothesis is either rejected in the light of the evidence, or not rejected because there is insufficient evidence to reject it. In other words, this statistical approach may not be used to prove the truth of something, but merely to provide evidence for its falsity.
(In the ‘Pupkins’ example, if only 15 dogs out of a sample of 20 showed the claimed preference, that would not be enough evidence to reject the claim that 80% of dogs prefer Pupkins, but nor does it prove that the claim is true!)
Tests for comparing two populations are covered in Section 4. The main test covered in this section is the
two-sample t-test, for comparing two normal means.
In Section 5, an approach called fixed-level testing is discussed briefly. This
approach is related to that in Section 1. In many circumstances, it leads to the
same conclusion.
Finally, in Section 6, you will see how considerations and calculations, based on
the statistical testing procedure to be used, can help in the process of planning an
experiment, and in particular with the question of what an appropriate size for
the sample(s) involved would be.
Example 1.1 Testing a hypothesis about a proportion
In this example, ten sixes were obtained in a total of 100 rolls of the supposedly
fair die — rather fewer than expected. Does this experiment provide any
substantial evidence that the die is biased — that is, that the program generating
the throws is flawed?
The methods of Block B can be used to find an exact 90% confidence interval
for p, the underlying proportion of sixes. Assuming a binomial model, B(100, p),
for the number of sixes that occur in 100 rolls of the die, the confidence interval for p is (0.0553, 0.1637).
(The exact confidence intervals quoted in this example and in Activity 1.1 were calculated using MINITAB.)
A confidence interval can be used to provide a decision rule for a test of a hypothesis about the value of a model parameter. The confidence interval can be thought of as a range of plausible values of the parameter. The most noticeable feature of this particular confidence interval is that it does not contain the theoretical (or assumed) underlying value p = 1/6 ≈ 0.1667. The conclusions of this simple test may be stated as follows.
In a test of the hypothesis that p = 1/6, there was sufficient evidence at the 10% significance level to reject the hypothesis in favour of the alternative that p ≠ 1/6. In fact, since the entire confidence interval is below 1/6, there is some indication that p < 1/6. ◆
(The terminology used here will be explained shortly.)
There are a number of features of the testing process in Example 1.1 to notice.
Most obviously, the raw material for the statistical testing of a hypothesis is the
same as that for the construction of a confidence interval: we require data; we
require an underlying probability model; and we need to have identified a model
parameter relevant to the question we are interested in answering.
What has altered is the form of the final statement: rather than listing a range of
plausible values for an unknown parameter, at some level of confidence, a
statement is made about whether or not a hypothesis about a particular
parameter value is tenable, at some assigned significance level. Notice that in
this example the significance level has been expressed as a percentage, and is
equal to 100% minus the confidence level (90%) for the interval used to perform the test. This is just the way the conventional language has developed; you will become used to it as you work through this unit.
(In most circumstances, large confidence levels (90% or more) are used; so the corresponding significance levels are small.)
Example 1.1 continued Testing a hypothesis about a proportion
An exact 95% confidence interval for the binomial parameter p, based on a count
of 10 successes in 100 trials, is (0.0490, 0.1762). In this case, the hypothesized
value p = 1/6 is contained in the confidence interval. In other words, at this confidence level, and based on these data, it is a plausible value. The conclusions of the corresponding test may be stated as follows.
In a test of the hypothesis that p = 1/6, there was insufficient evidence at the 0.05 significance level to reject the hypothesis in favour of the alternative that p ≠ 1/6.
(Here, the significance level has been expressed as a number (0.05) between 0 and 1, rather than as a percentage (5%). Both formulations are common.)
Note that it has certainly not been concluded that the parameter p is definitely equal to 1/6, but merely that 1/6 is a plausible value for the parameter. There is not sufficient evidence to reject the hypothesis that p = 1/6; but that does not mean that this hypothesis must be true.
In the statement of the hypothesis that p = 1/6 and in the context of the problem, the implication has been that either a sample proportion too high (suggesting p > 1/6) or a sample proportion too low (suggesting p < 1/6) would both offer evidence to reject the hypothesis. Using the conventional terminology, a two-sided test has been performed. ◆
(A two-sided test is sometimes called a two-tailed test.)
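The exact intervals quoted in Example 1.1 were produced by MINITAB. Assuming these are the usual Clopper–Pearson ‘exact’ intervals, which invert the binomial tail probabilities, they can be reproduced without statistical software. The following Python sketch is one way to do so (the function names are mine, not MINITAB's):

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def exact_ci(k, n, conf):
    """Clopper-Pearson 'exact' confidence interval for a binomial
    proportion p, from k successes in n trials, found by bisection."""
    alpha = 1 - conf

    def solve(tail_prob, increasing):
        # find p in (0, 1) with tail_prob(p) = alpha / 2;
        # tail_prob is monotone in p, so bisection converges
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            too_small = (tail_prob(mid) < alpha / 2) == increasing
            if too_small:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # lower limit: P(X >= k; p) = alpha/2  (increasing in p)
    lower = 0.0 if k == 0 else solve(lambda p: 1 - binom_cdf(k - 1, n, p), True)
    # upper limit: P(X <= k; p) = alpha/2  (decreasing in p)
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), False)
    return lower, upper

# Example 1.1: 10 sixes in 100 rolls
print(exact_ci(10, 100, 0.90))  # close to (0.0553, 0.1637)
print(exact_ci(10, 100, 0.95))  # close to (0.0490, 0.1762)
```

Note that the 90% interval lies entirely below 1/6 while the 95% interval contains it, reproducing the two conclusions reached in the example.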
The approach to hypothesis testing based on confidence intervals, which has been
illustrated in Example 1.1, may be summarized as follows.
First, the characteristics of a population are expressed in terms of a parameter θ
in such a way that the hypothesis under test takes the form θ = θ0 , where θ0 is
some specified value. This hypothesis is called the null hypothesis and is denoted by H0, so it is written
H0 : θ = θ0 .
(The reason for the term ‘null’ will become clearer later in the unit.)
(In Example 1.1, the null hypothesis is H0 : p = 1/6.)
Activity 1.1
A different computer, and a different statistical package, were used to simulate
the results of a sequence of Bernoulli trials. The intention was that the
probability of success p at any trial should be 2/3. The results of a sequence of 25
trials are shown in Table 1.2.
An exact 95% confidence interval for p based on the observed sequence of trials is
(0.2113, 0.6133). An exact 99% confidence interval for p is (0.1679, 0.6702).
These intervals are to be used to test the hypothesis that the underlying
proportion of successes is 2/3 against the alternative hypothesis that the underlying proportion is different from 2/3. Thus the null and alternative hypotheses are
H0 : p = 2/3, H1 : p ≠ 2/3.
(a) What would the conclusion of a test be if a significance level of 0.05 is used?
(b) What would the conclusion of a test be if a significance level of 0.01 is used?
In Activity 1.1, you saw that there was enough evidence against the hypothesis
p = 2/3 to reject the hypothesis at the 0.05 significance level; but more evidence is
required to reject it at the 0.01 level, and there was not enough to do this. It is of
some interest that the null hypothesis was rejected at one significance level (0.05)
but not at another (0.01), on the basis of exactly the same data. This illustrates
that hypothesis testing is a matter of evaluating evidence.
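The reasoning in Activity 1.1 reduces to a containment check on the two quoted intervals. A minimal Python sketch of the decision rule (the function name is mine; the intervals themselves were calculated with MINITAB):

```python
def reject_h0(interval, theta0):
    """Section 1 decision rule: reject H0: theta = theta0 at significance
    level alpha exactly when theta0 falls outside the corresponding
    100(1 - alpha)% confidence interval."""
    lower, upper = interval
    return not (lower <= theta0 <= upper)

p0 = 2 / 3
# Activity 1.1: exact CIs for p from 25 simulated Bernoulli trials
print(reject_h0((0.2113, 0.6133), p0))  # 95% CI -> True: reject at 0.05
print(reject_h0((0.1679, 0.6702), p0))  # 99% CI -> False: not at 0.01
```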
Example 1.1 and Activity 1.1 both involved using (exact) confidence intervals for
a binomial parameter p to test a hypothesis. Example 1.2 illustrates that the
method of using a confidence interval to test a hypothesis may be applied in other
situations.
The discussions in the previous paragraph and after Activity 1.1 indicate that, in
cases where the null hypothesis has been rejected, there appears to be some kind
of relationship between the significance level used in a test and the strength of
evidence against the null hypothesis provided by the test — the lower the
significance level, the stronger the evidence. This idea can be made more precise
by investigating how hypothesis tests and significance levels are interpreted. This
is done in Example 1.3 by considering how hypothesis testing is related to the
repeated experiments interpretation of confidence intervals.
This interpretation means that the smaller the significance level that is used, the
less likely it is that the null hypothesis will be rejected when it is true.
The examples discussed so far have been based on exact confidence intervals. In
Block B, you learned about large-sample confidence intervals. These can be used
to carry out hypothesis tests in the same way as exact confidence intervals are
used. This is illustrated in Example 1.4.
These data are to be used to test the null hypothesis that the mean number of
cells per square is 0.6, against the alternative hypothesis that the mean is
different from 0.6:
H0 : µ = 0.6, H1 : µ ≠ 0.6.
In Block B you learned how to calculate an approximate confidence interval for a
population mean µ without choosing a specific probability model for the data.
The method is valid when the sample size n is large, as is the case here. So it is
not necessary to assume a distribution for the variable (the number of cells in a
square). An approximate 100(1 − α)% confidence interval for µ is given by
(µ−, µ+) = (x − z s/√n, x + z s/√n),
where x is the sample mean, s is the sample standard deviation and z is the (1 − α/2)-quantile of the standard normal distribution.
No significance level has been stipulated for the test. For a significance level of α,
a 100(1 − α)% confidence interval for the unknown population mean is required.
So, if α = 0.05 is chosen, a 95% confidence interval is needed.
Here x = 0.6825, s = 0.9021, n = 400 and, for a 95% confidence interval, z = 1.96. So the interval is
(µ−, µ+) = (0.6825 − 1.96 × 0.9021/√400, 0.6825 + 1.96 × 0.9021/√400)
         = (0.5941, 0.7709).
(The values of x and s were calculated using the data in Table 1.4.)
This confidence interval contains 0.6, the hypothesized value of the underlying mean µ. So, there is insufficient evidence to reject the null hypothesis at the 5% significance level. It remains plausible that the population mean takes the value 0.6; that is, it is plausible that the mean number of cells per square is 0.6. ◆
(You should bear in mind that there is a certain amount of approximation here, since the confidence interval is approximate.)
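If you wish to reproduce the interval in Example 1.4 without statistical software, the large-sample formula translates directly into Python (the function name is mine):

```python
from math import sqrt

def large_sample_ci(xbar, s, n, z):
    """Approximate 100(1 - alpha)% CI for a mean: xbar +/- z * s / sqrt(n),
    where z is the (1 - alpha/2)-quantile of N(0, 1)."""
    half_width = z * s / sqrt(n)
    return xbar - half_width, xbar + half_width

# Example 1.4: 400 squares, sample mean 0.6825 cells, s = 0.9021;
# z = 1.96 is the 0.975-quantile of N(0, 1), for a 95% interval
lo, hi = large_sample_ci(0.6825, 0.9021, 400, 1.96)
print(f"({lo:.4f}, {hi:.4f})")  # (0.5941, 0.7709)
assert lo < 0.6 < hi  # 0.6 is plausible: do not reject H0 at the 5% level
```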
Summary of Section 1
In this section, a method for performing hypothesis tests based on confidence
intervals has been described. The terminology for the hypotheses involved in a
test, and the notion of significance level, have been introduced. The approach to
hypothesis testing using confidence intervals is general, and straightforward (if
you know how to calculate the interval). However, there is more to hypothesis
testing than this! In Section 2, a different approach is introduced, in which the
strength of the evidence against the null hypothesis is quantified.
Exercise on Section 1
Exercise 1.1 Book loans
A sample of 122 books in a particular library was selected and, for each book, the number of times that it had been borrowed in the preceding twelve months was counted. The sample mean number of loans was x = 1.992 and the sample standard deviation of the number of loans was s = 1.394. An approximate 90% confidence interval for µ, the mean number of loans in a year for books in the library, calculated using large-sample methods, is (1.784, 2.200). This confidence interval is used to test the null hypothesis that µ is 2.5 against the alternative hypothesis that µ ≠ 2.5.
(Burrell, Q.L. and Cane, V.R. (1982) The analysis of library data. J. Royal Statistical Society, Series A, 145, 439–471. The authors collected their data from several libraries. The data used in the exercise are from one of the sections of the Wishart Library in Cambridge.)
sample standard deviation s is 0.9021. (See Example 1.4.) So, for the null
hypothesis H0 : µ = 0.6, the approximate null distribution of the sample mean X
is N(0.6, 0.9021²/400), or N(0.6, 0.00203). Thus, in this case, the sample mean X
can be used as the test statistic. ◆
In significance testing, the idea is to describe the extent to which the data provide
evidence against the null hypothesis, rather than to decide whether or not there is
sufficient evidence to reject the null hypothesis (as was the case in Section 1 when
using the confidence interval approach). This is done by calculating the
probability of obtaining a value of the test statistic that is ‘at least as extreme as’
the observed value when the null hypothesis is true. This is illustrated in
Example 2.2.
Figure 2.1 The null distribution of X, and the values that are at least as extreme
as the observed value x = 0.6825
The shaded regions in Figure 2.1 contain all the possible values of X that, if µ were equal to 0.6 (the value specified in the null hypothesis), would be at least as extreme as the actual value that was observed (0.6825). In the tail of the null distribution that contains the observed value 0.6825, it is easy to see which values these are: any value greater than 0.6825 would be more extreme than 0.6825 (as it is further from 0.6). Thus the shaded area in the right tail of the null distribution includes all values of X greater than or equal to 0.6825.
(In a certain sense these ‘extreme’ values in the shaded regions are those that are as little or even less in accord with the null hypothesis than is the actual observed value, so they provide as little or less support for the null hypothesis than does the observed value.)
Why is there a shaded area in the left tail as well? Remember this is a two-sided test, so a sample mean a long way below the hypothesized value of 0.6 would also
provide evidence against the null hypothesis. The observed value, 0.6825, is
0.0825 above 0.6. The null distribution of X is symmetric about x = 0.6, and the
value 0.5175 is 0.0825 below 0.6. Therefore it seems reasonable to state that a
value of X equal to 0.5175 would be just as extreme as the observed value, 0.6825,
and that values further into the left tail (that is, less than 0.5175) are even more
extreme. Thus the set of values of X that are at least as extreme (in relation to
the null hypothesis) as the observed value, 0.6825, consists precisely of all values
greater than or equal to 0.6825, together with all values less than or equal
to 0.5175.
The significance probability for the test is simply the sum of the two shaded tail
areas in Figure 2.1. That is,
p = P (X ≥ 0.6825) + P (X ≤ 0.5175),
where X ∼ N (0.6, 0.00203). Since the null distribution of X is symmetric about
x = 0.6, the probability in the left tail is equal to the probability in the right tail,
and hence the significance probability is
p = 2P (X ≥ 0.6825).
Standardizing and using the table of probabilities for the standard normal
distribution in the Handbook gives
p = 2P( Z ≥ (0.6825 − 0.6)/√0.00203 ), where Z ∼ N(0, 1),
  = 2P(Z ≥ 1.83)
  = 2 × 0.0336
  = 0.0672 ≈ 0.067. ◆
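The same calculation can be checked without the Handbook table by computing the normal tail probability exactly. A short Python sketch (my verification, using the standard identity 2(1 − Φ(z)) = erfc(z/√2)):

```python
from math import erfc, sqrt

def z_significance(xbar, mu0, variance):
    """Two-sided significance probability for a normal-approximation test:
    p = 2 P(Z >= |xbar - mu0| / sd), where Z ~ N(0, 1)."""
    z = abs(xbar - mu0) / sqrt(variance)
    p = erfc(z / sqrt(2))  # equals 2 * (1 - Phi(z))
    return z, p

# Example 2.2: the null distribution of the sample mean is N(0.6, 0.00203)
z, p = z_significance(0.6825, 0.6, 0.00203)
print(f"z = {z:.2f}, p = {p:.4f}")  # z about 1.83, p about 0.067
```

The small difference from the tabulated answer (0.0672) comes only from rounding z to two decimal places before using the table.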
Note that the test could have been carried out using a test statistic a little more
complicated than X. The null distribution of X is N (0.6, 0.00203). Although this
distribution is not difficult to deal with, it would have been even easier to deal
with N (0, 1). If X had been standardized using its null distribution, then the
resulting random variable Z would have had the standard normal distribution as its null distribution. That is, if Z = (X − 0.6)/√0.00203 had been used as the
test statistic, then its null distribution would have been N (0, 1). There would
have been slightly more effort in calculating the observed value of Z — it comes to
1.83 — but then we would simply have had to find the value of P (|Z| ≥ 1.83),
where Z ∼ N (0, 1). In this situation, the change is of no practical importance; it
has been mentioned only to show that the most obvious choice for a test statistic
is not the only possibility, and that it is possible that another choice may be
easier to work with.
Examples 2.1 and 2.2 illustrate most features of significance testing. However,
before summarizing the procedure for carrying out a significance test, there
remain two areas that need further consideration.
First, in a two-sided test, the rule for deciding which values in the opposite tail of
the null distribution (that is, the one that does not include the observed value)
should be counted as being ‘at least as extreme, in relation to the null hypothesis,
as that observed’ is somewhat arbitrary. In fact, there is no universal agreement
on exactly how this should be done. Several proposed general rules exist. In
practice, in cases like that in Example 2.2, where the null distribution is unimodal
and symmetric, the rules in common use all agree: the appropriate area in the
opposite tail is drawn symmetrically to match that in the observed tail. (This has
the consequence that the areas in these two tails are equal, so that the total
probability in the two tails can be found simply by doubling the probability in one
of them.) In very many common testing situations, the null distribution is indeed
unimodal and symmetric, so no problem arises. In Subsection 3.2, you will meet
an example where the null distribution is not symmetric, and which is complicated
even more by the fact that the test statistic is discrete. The rule used in this
course for defining the appropriate area in the opposite tail is described there.
Secondly, it is all very well saying that the significance testing approach does not
carry with it a firm rule on whether or not to reject H0 , but in practice how are
these p values to be interpreted? It is dangerous to give a general rule of thumb,
because there are always situations where any rule of thumb will be found
wanting. Nevertheless, a rough guide to interpreting significance probabilities is
given in Table 2.1.
A total of 33 insect traps were set out across sand dunes and the numbers of different insects caught in a fixed time were counted. Table 2.2 gives the number of traps containing various numbers of insects of the taxon Staphylinoidea.
Table 2.2 Staphylinoidea in 33 traps
Count      0   1   2   3   4   5   6   ≥7
Frequency  10  9   5   5   1   2   1   0
The sample mean of the counts is 1.636, and the sample standard deviation is 1.655.
(Gilchrist, W. (1984) Statistical Modelling. John Wiley and Sons, Chichester, p. 132. The original purpose of the experiment was to test the quality of fit of a Poisson model. Here no specific model is assumed, though a sample size of 33 is arguably a little small to use the large-sample result for the distribution of the sample mean.)
Perform a two-sided significance test of the null hypothesis that µ, the underlying
mean number of insects of the taxon Staphylinoidea in a trap, is 1. Use the
sample mean X as the test statistic, and use a normal approximation to the null
distribution of X.
Follow the procedure for significance testing, numbering the steps in your test.
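After completing the activity, you can check your summary statistics and significance probability against the following Python sketch, which recomputes everything from the frequency table (my verification code, not part of the course materials):

```python
from math import erfc, sqrt

# Table 2.2: numbers of Staphylinoidea per trap, and trap frequencies
counts = [0, 1, 2, 3, 4, 5, 6]
freqs = [10, 9, 5, 5, 1, 2, 1]

n = sum(freqs)                                            # 33 traps
xbar = sum(c * f for c, f in zip(counts, freqs)) / n      # sample mean
ssq = sum(f * (c - xbar) ** 2 for c, f in zip(counts, freqs)) / (n - 1)
s = sqrt(ssq)                                             # sample st. dev.

# Normal approximation to the null distribution of the sample mean
z = abs(xbar - 1) / (s / sqrt(n))
p = erfc(z / sqrt(2))  # two-sided significance probability, 2(1 - Phi(z))
print(f"mean = {xbar:.3f}, s = {s:.3f}, z = {z:.2f}, p = {p:.4f}")
```

The computed mean and standard deviation should match the values 1.636 and 1.655 quoted above.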
Summary of Section 2
In this section, a general procedure for carrying out a hypothesis test has been
introduced. The method described is known as significance testing. You have
learned how to interpret the significance probability that arises from a significance
test.
Exercise on Section 2
Exercise 2.1 Book loans
In Exercise 1.1, data on the number of times that a sample of 122 books in a
library were borrowed in a year were investigated using the confidence interval
approach to hypothesis testing. The sample mean number of loans per book was
x = 1.992, and the sample standard deviation was s = 1.394. Perform a
(two-sided) significance test of the null hypothesis that the underlying mean
number of loans per book in a year is 2.5.
Do the Shoshoni tend to use golden rectangles? This can be investigated using a
significance test as follows.
Appropriate hypotheses for the test are
H0 : µ = 0.618, H1 : µ ≠ 0.618,
where µ is the underlying mean width-to-length ratio of Shoshoni rectangles
(step 1). (Note that departures from the null hypothesis in either direction would
indicate that the Shoshoni are doing something other than constructing ‘golden’
rectangles, so the test is two-sided.)
The values in Table 3.1 will be used for data (step 2).
The next step (step 3) is to decide on the test statistic and find its null
distribution. Here, the hypotheses relate to the underlying mean µ. So, as in
Example 2.1, the obvious choice for the test statistic is perhaps the sample
mean X. But what is its null distribution?
Why is there a shaded area in the left tail as well? Remember this is a two-sided
test, so that a sample mean a long way below the hypothesized value of 0.618,
corresponding to a negative value of T , would also provide evidence against the
null hypothesis. Since the null distribution of T is symmetric about t = 0, it
seems reasonable to state that a value of T equal to −2.055 would be just as
extreme in relation to the null hypothesis as is the observed value, 2.055, and that
values further into the left tail (that is, less than −2.055) are even more extreme.
Thus the set of values of T that are at least as extreme (in relation to H0 ) as the
observed value, 2.055, consists precisely of all values greater than or equal to
2.055, together with all values less than or equal to −2.055.
The significance probability for the test is simply the sum of the two shaded tail
areas in Figure 3.2. In other words, it is p = P (|T | ≥ 2.055), where T ∼ t(19). To
find this probability exactly needs computing facilities; in fact, p = 0.0539
(step 6).
The necessity of using a computer (or some other calculating method) to calculate
the significance probability for many tests may be seen as a disadvantage of
significance testing. However, in practice it is not usually a major problem. For
instance, in this case the table of quantiles for t-distributions indicates that the
0.95-quantile and the 0.975-quantile of t(19) are 1.729 and 2.093, respectively.
Thus
P (T ≥ 1.729) = 1 − 0.95 = 0.05
and
P (T ≥ 2.093) = 1 − 0.975 = 0.025.
Since 2.055 is between 1.729 and 2.093, it follows that
0.025 < P (T ≥ 2.055) < 0.05.
But p = P (|T | ≥ 2.055) = 2P (T > 2.055), so 0.05 < p < 0.1. In most
circumstances, this kind of rather imprecise information is adequate for drawing
conclusions from a test.
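This bracketing argument is mechanical enough to automate. The following Python sketch (my construction, not from the course materials) locates a two-sided significance probability between bounds using only tabulated quantiles:

```python
def bracket_two_sided_p(t_obs, quantile_table):
    """Bracket the two-sided significance probability p = 2 P(T >= |t|)
    using only tabulated quantiles (a dict of level -> quantile value)."""
    lo, hi = 0.0, 1.0  # bounds on the one-tailed probability P(T >= |t|)
    for level, q in quantile_table.items():
        tail = 1 - level           # P(T >= q) for this tabulated quantile
        if abs(t_obs) >= q:        # observed value is further out than q
            hi = min(hi, tail)
        else:                      # observed value is nearer in than q
            lo = max(lo, tail)
    return 2 * lo, 2 * hi          # double for a two-sided test

# 0.95- and 0.975-quantiles of t(19), from the table of t quantiles
p_lo, p_hi = bracket_two_sided_p(2.055, {0.95: 1.729, 0.975: 2.093})
print(f"{p_lo:.2f} < p < {p_hi:.2f}")  # 0.05 < p < 0.10
```

Feeding in more tabulated quantiles simply tightens the bracket.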
There is weak evidence against the null hypothesis that the underlying mean
width-to-length ratio of Shoshoni rectangles is 0.618 and in favour of the
alternative hypothesis.
Since the sample mean is greater than 0.618, the data suggest that Shoshoni
rectangles may be somewhat ‘squarer’ than golden ratio rectangles (steps 7 and 8).
This completes the test. ◆
The test just used for testing a hypothesis about a normal mean is called the
t-test or Student’s t-test (although, actually, R.A. Fisher had a lot to do with
its development). To be more specific, it is called the one-sample t-test to
distinguish it from another test, involving two samples, that you will meet in
Section 4. The various kinds of t-test are among the most commonly used tests in
statistics because, as you have seen, the normal distribution is a very widely used
probability model.
Assuming a normal model for the variation in times, use the procedure for
significance testing to test the null hypothesis that µ, the underlying mean time
that the timer takes to ring when it is set to five minutes, is equal to five minutes
(300 seconds) against the alternative hypothesis that it is different from five
minutes:
H0 : µ = 300, H1 : µ ≠ 300.
Follow the procedure for significance testing, numbering the steps in your test.
When the alternative hypothesis of a test describes departures from the null
hypothesis in only one of the two possible directions, the test is said to be
one-sided. So the general statement of the null and alternative hypotheses might A one-sided test is sometimes
take either of the following forms, depending on the context of the claim being called a one-tailed test.
investigated:
H0 : θ = θ0 , H1 : θ > θ0 ;
or
H0 : θ = θ0 , H1 : θ < θ0 .
When you learned how to calculate confidence intervals for normal means in
Block B, you saw that the technique can be applied to data that are actually
differences of matched pairs. Not surprisingly, tests for normal means can also be
applied to differences of matched pairs. Apart from the fact that the data arise in
a particular way, the technique is the same as that used in Example 3.1. However,
to illustrate this, an example for which a one-sided test is appropriate will be used.
Patient  Gain
 1   1.9
 2   0.8
 3   1.1
 4   0.1
 5  −0.1
 6   4.4
 7   5.5
 8   1.6
 9   4.6
10   3.4
These data are actually individual differences between matched pairs of observations. For each of the ten individuals, the length of time asleep after taking no drug was subtracted from the length of time asleep after taking L-hyoscyamine hydrobromide. The differences are all positive except the fifth, and this remark alone suggests that L-hyoscyamine hydrobromide is effective, on average, at prolonging sleep. However, a null hypothesis for a formal test that the drug, in fact, makes no difference to the duration of sleep might take the form
H0 : µ = 0,
where the parameter µ is the mean underlying sleep gain. This is an example
where the name ‘null’ for this hypothesis makes particularly good sense, because
it proposes a zero value for the parameter. However, the hypothesis is still called
‘null’ when it does not propose a zero value, as you have seen.
In this case, it will be interesting to pursue a one-sided test to reflect the
suspicion (or even the aim in administering the dose) that the drug does indeed
prolong sleep, on average. The alternative hypothesis will therefore be written as
H1 : µ > 0. This concludes Step 1.
Comment
The 0.995-quantile and the 0.999-quantile of t(9) are, respectively, 3.250 and
4.297. Thus
P (T ≥ 3.250) = 1 − 0.995 = 0.005
and
P (T ≥ 4.297) = 1 − 0.999 = 0.001.
The significance probability p must therefore be somewhere between these two
values. That is, 0.001 < p < 0.005. Step 6.
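As a quick arithmetic check, the test statistic for these data can be computed in a few lines. This sketch uses Python rather than the course's MINITAB, and takes the t(9) quantiles quoted above as given:

```python
from math import sqrt
from statistics import mean, stdev

# Sleep gains for the ten patients, from the table above
gains = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]

n = len(gains)
d_bar = mean(gains)            # sample mean of the differences
s = stdev(gains)               # sample standard deviation
t = d_bar / (s / sqrt(n))      # observed one-sample t statistic

print(round(d_bar, 2), round(s, 3), round(t, 2))  # 2.33 2.002 3.68

# The 0.995- and 0.999-quantiles of t(9), quoted above, bracket t,
# so 0.001 < p < 0.005 for the one-sided test
print(3.250 < t < 4.297)  # True
```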
A diagram of the null distribution is shown in Figure 3.4. The shaded regions in
the diagram show the possible counts that are at least as extreme, in relation to
the null hypothesis, as the value that was actually observed. Step 5.
Figure 3.4 The null distribution B(20, 3/4) and counts that are at least as extreme,
in relation to the null hypothesis, as that observed, n = 12
It is clear from the diagram that the observed value n = 12 is in the lower tail of
the null distribution; that is, it is rather a small value to have observed if the null
hypothesis is true. It is therefore clear that values further out in this tail — 11, 10,
9, and so on — provide even more evidence against the null hypothesis. However,
this is a two-sided test, so we need to consider values in the other tail as well, but
which values?
In this course the principle that will be used to choose ‘extreme’ values in the
opposite tail to that observed is based on ‘tail probabilities’ — that is, on the total
probability included in the ‘tails’ of the distribution, where a ‘tail’ includes all
values from one extreme or the other up to a certain value. Table 3.4 shows the
probability function of the null distribution B(20, 3/4), together with the
corresponding tail probabilities for both tails of the distribution.
The probability in the ‘tail’ up to and including the observed count, n = 12, is
thus 0.1018. The upper tail is then chosen to be the tail which includes the
largest amount of probability that is not greater than this value (that is, less
than or equal to 0.1018). This is the tail including the values 18, 19 and 20,
for which the total probability is 0.0913. (MINITAB uses a more complicated
method to choose the second tail. However, in many instances this leads to the
same p value as the method described here.)
The significance probability for the test is given by the sum of the two shaded
areas in Figure 3.4 — that is, by the sum of the probabilities in the two tails.
So p = 0.1018 + 0.0913 = 0.1931. Step 6.
Summary of Section 3
In this section, you have seen how significance tests can be used for testing normal
means and Bernoulli probabilities. One-sided tests and two-sided tests have both
been considered.
In particular, you have seen that a useful test statistic for testing the null
hypothesis H0 : µ = µ0 about the mean µ of a normal distribution is the random
variable

T = (X̄ − µ0) / (S/√n).

The null distribution of T is t(n − 1), where n is the sample size. The test is
called the one-sample t-test.
You have used MINITAB to carry out the tests described in Sections 2 and 3.
Exercise on Section 3
Exercise 3.1 Differences in plant heights
Charles Darwin measured differences in height for fifteen pairs of plants of the
species Zea mays. Each pair had parents grown from the same seed — one plant in
each pair was the progeny of a cross-fertilization, the other of a
self-fertilization. Darwin’s measurements were the differences in height between
cross-fertilized and self-fertilized progeny. The data are given in Table 3.5.
The units of measurement are eighths of an inch. (The data are quoted in Fisher,
R.A. (1942) The Design of Experiments, 3rd edn, Oliver and Boyd, London, p. 27.)

Table 3.5 Difference in plant height (1/8 inch)

Pair  Difference
1        49
2       −67
3         8
4        16
5         6
6        23
7        28
8        41
9        14
10       29
11       56
12       24
13       75
14       60
15      −48

Suppose that the observed differences di, i = 1, 2, . . . , 15, are independent
observations on a normally distributed random variable D with mean µ and
variance σ².

(a) State appropriate null and alternative hypotheses for a two-sided test of
the hypothesis that there is no difference (on average) between the heights of
progeny of cross-fertilized and self-fertilized plants, and state the null
distribution of an appropriate test statistic.

(b) Calculate the value of the test statistic for this data set, and obtain the
significance probability for the test.

(c) Interpret the significance probability, and state your conclusion clearly.
4 Two-sample tests
Many of the tests that you have met so far have been concerned with testing the
hypothesis that some population parameter takes a particular value. Such
hypotheses are not uncommon in certain application areas, for example, genetics.
However, a much more common testing situation is where samples are drawn from
two separate populations, in order to test some hypothesis about differences in
population characteristics. Very often, in such a situation, the hypothesis being
tested is that the populations actually do not differ in terms of the characteristic
of interest. This is the origin of the term null hypothesis. ‘Null’ ≡ ‘no difference’.
The most general question that could be asked is perhaps this: ‘Is the pattern of
variation in the attribute of interest exactly the same in both populations?’ In
other words, denoting the respective cumulative distribution functions by F1 (.)
and F2 (.), this might suggest the hypothesis
H0 : F1 (x) = F2 (x) for all x.
This may make sense, for example, in a context where the two populations differ
only in that different experimental treatments have been applied to them. If, in
fact, it is possible that the treatments make no difference at all, and the
individuals involved were drawn originally from a single population, then this kind
of hypothesis would be a sensible one to investigate. But more typically, some
common probability model, with parameters, would be assumed for the two
populations, and the null hypothesis would simply state that the parameter values
were equal for the two populations. An example is where the model is
parameterized by the mean and, denoting the two population means by µ1 and
µ2 , the null hypothesis is
H0 : µ1 = µ2 .
Subsection 4.1 is concerned with the situation where the variation in each of the
two populations is assumed to be adequately described by the binomial
distribution, and the question of interest is whether the Bernoulli probability p is
the same in the two populations. In Subsection 4.2, the variation in the two
populations is assumed to be adequately described by a normal distribution, and
the question of interest is whether the means of the two populations are equal.
Subsection 4.3 involves working through a chapter of Computer Book C : you will
learn how to carry out the significance tests described in Subsections 4.1 and 4.2
using MINITAB.
In Sections 2 and 3, the steps in the significance testing procedure were identified
by their numbers, both in examples and in solutions. From now on, although the
steps will be followed, they will not be numbered explicitly. However, you should
continue to use them in activities and exercises to help you to produce clear,
structured solutions.
would see this as inappropriate. But it is usually worth at least having a quick
informal look at the sample variances. The following rule of thumb will be used in
this course: if the sample variances differ by a factor of less than about 3, the
assumption of equal variances may be taken as not seriously amiss.
There are also many situations where an assumption of equal variances is
reasonable on modelling grounds, at least when the null hypothesis of equal
means is true. These situations include those where the individuals in both
samples have been sampled from a common population and then had different
treatments applied to them. Under the hypothesis that the treatments have no
effect, the underlying distributions of both populations will be the same. So, their
variances as well as their means will be equal.
The assumptions required for the two-sample t-test may be summarized as in the
following box.
The next stage in setting up a test is to find a suitable test statistic. It would
seem sensible to base the test statistic on the difference between the sample
means, D = X̄1 − X̄2. It follows from results on the distribution of a linear
combination of independent normally distributed random variables (discussed in
Block B) that D also has a normal distribution:

D = X̄1 − X̄2 ∼ N(µ1 − µ2, σ²/n1 + σ²/n2),   (4.1)

where n1 and n2 are the sample sizes. The snag, just as in the single-sample
case, is that the variance of D depends on the unknown variance σ².
the null hypothesis H0 : µ1 = µ2 , where the mean of D is known to be 0, the
distribution of D is not known. So D cannot be used as a test statistic. When
dealing with a single sample, this difficulty was resolved by calculating a test
statistic that involved the sample standard deviation S and using the
t-distribution. In the present setting, however, there are two distinct (and
independent) estimates of σ², one from each sample. Intuitively, it would make
sense to combine them in some way to obtain a single estimate. There are several
ways of doing this, but the optimal combination is given in the following box.
Note that the pooled estimate of the common variance gives more weight to the
estimate from the larger sample.
The following result is a consequence of (4.1) (the result does not follow quite
directly, but the details are unimportant; note that it can be used to calculate
a confidence interval for µ1 − µ2):

((X̄1 − X̄2) − (µ1 − µ2)) / (SP √(1/n1 + 1/n2)) ∼ t(n1 + n2 − 2).
That is, the random variable on the left has a t-distribution with n1 + n2 − 2
degrees of freedom. Under the null hypothesis H0 : µ1 = µ2 , (µ1 − µ2 ) = 0, so the
null distribution of the quantity

T = (X̄1 − X̄2) / (SP √(1/n1 + 1/n2))
is t(n1 + n2 − 2). The statistic T involves only quantities whose values can be
calculated from the samples, and it is essentially just a scaled version of the
difference between the two sample means. Furthermore, its null distribution is
known. Thus it is an appropriate test statistic for the test of equality of means.
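Assuming the usual pooled estimate of the common variance, S_P² = ((n1 − 1)S1² + (n2 − 1)S2²)/(n1 + n2 − 2) (the formula referred to in the box above), the statistic T can be sketched as follows. The two samples here are invented purely for illustration:

```python
from math import sqrt
from statistics import mean, stdev

def two_sample_t(x1, x2):
    """Pooled two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(x1), len(x2)
    # The pooled variance weights each sample's variance by its degrees of
    # freedom, so the larger sample gets more weight
    sp2 = ((n1 - 1) * stdev(x1)**2 + (n2 - 1) * stdev(x2)**2) / (n1 + n2 - 2)
    t = (mean(x1) - mean(x2)) / (sqrt(sp2) * sqrt(1/n1 + 1/n2))
    return t, n1 + n2 - 2

# Invented samples, for illustration only
a = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3]
b = [4.6, 4.9, 4.4, 4.7, 4.5]
t, df = two_sample_t(a, b)
print(round(t, 2), df)  # 3.24 9
```

The observed value would then be compared with the t(9) distribution, exactly as in the one-sample case.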
the evidence against the null hypothesis will be stronger) than for any test that
does not make this assumption. Thus the two-sample t-test with an assumption of
equal variances is of considerable practical use. Its theory is also considerably
easier to describe than is that of the test that does not assume equal variances.
The theory of that test is beyond the scope of this course and will not be
described.
Summary of Section 4
In this section, you have learned how to carry out significance tests for the
difference between two Bernoulli probabilities and for the difference between the
means of two normal populations. You have also learned how to use MINITAB to
carry out these tests.
Exercises on Section 4
Exercise 4.1 A clinical trial
Patients newly diagnosed with rheumatoid arthritis were recruited into a clinical
trial of a new drug to control the symptoms of the disease. Of these patients, 62
were randomly allocated to the new drug, and 60 to receive standard therapy.
After one year, participants in the trial were categorized as being in remission or
not being in remission. Of the 62 patients on the new drug, 38 were in remission;
and of the 60 on standard treatment, 22 were in remission.
Test the hypothesis that the underlying probabilities of being in remission after a
year are the same for patients on the new drug and for patients on the standard
therapy. (In clinical trials of this general nature, it is normal practice to use
two-sided significance tests, because it is considered important that the analysis
should pick up situations in which the new treatment is actually worse than the
existing standard treatment, as well as those where the new treatment is better.)
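By way of illustration, a test of this kind can be computed with the standard large-sample z approximation, in which the two sample proportions are compared using a pooled estimate under the null hypothesis. (This is a common form of the test; whether it matches the exact statistic of Subsection 4.1 is an assumption here, since that subsection's formula is not reproduced above.)

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 38, 62   # remissions on the new drug
x2, n2 = 22, 60   # remissions on standard therapy

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)          # pooled estimate under H0: p1 = p2
se = sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
z = (p1 - p2) / se                      # approximately N(0, 1) under H0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided
print(round(z, 2), round(p_value, 3))
```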
5 Fixed-level testing
In Block B, you saw that it is common to calculate confidence intervals at the
predetermined confidence levels of 90%, 95% and 99% (although there is no
particularly compelling reason for using these levels). Similarly, it is common to
perform tests of hypotheses at predetermined significance levels. Such tests are
called fixed-level tests. It is common to choose levels such as 10%, 5% and 1%,
corresponding to the confidence levels just mentioned. In Section 1, you saw that
these tests can be performed simply by using the corresponding confidence
intervals. In Subsection 5.1, a different way of thinking of the testing procedure is
introduced. In many common cases, the way of thinking about the procedure is
the only difference, since the outcomes of the tests will be the same as for the
process using confidence intervals. (This is not true for some testing
situations, including tests involving binomial proportions. However, even in
these cases, the differences between the two methods are usually small, and do
not often lead to differences in conclusions.) The aim of this approach is, as
in Section 1, to develop a decision rule for rejection of a null hypothesis in
favour of a stated alternative hypothesis, at some predetermined significance
level. This approach to hypothesis testing is quite general and can be applied
in any context where a significance test can be used. The procedure for
fixed-level testing will be illustrated for a few of the situations discussed in
Sections 2 and 3. Two further ideas associated with hypothesis testing are
discussed briefly in Subsection 5.2.
It has probably occurred to you to ask why there should be so many rather
different approaches to what seems, after all, to be a straightforward problem to
describe. A brief investigation of the historical background of hypothesis testing,
which may help to make some sense of this, is given in Subsection 5.3.
In Subsection 5.4, you will use a SUStats program to develop further your
understanding of the principles of hypothesis testing.
There are two points about what has been done in Example 5.1 that need further
comment.
First, as you saw in Section 2, Z = (X̄ − 0.6)/√0.00203 could be used as the test
statistic instead of X̄; its null distribution is N(0, 1). In this case, the
observed value of Z is (0.6825 − 0.6)/√0.00203 ≈ 1.831. The 0.025-quantile and
the 0.975-quantile of the standard normal distribution are −1.96 and 1.96, so
the rejection region consists of values of Z greater than or equal to 1.96 or
less than or equal to −1.96. The conclusion is as before: since 1.831 is not in
the rejection region, there is insufficient evidence at the 5% significance
level to reject the null hypothesis.

Figure 5.2 The rejection region and the observed value of the test statistic X̄
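This check is easy to reproduce. The sketch below (Python, illustrative) uses the unrounded variance 0.9021²/400, which gives 1.829 rather than 1.831; the small difference comes from rounding to 0.00203:

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu0 = 0.6825, 0.6
var_xbar = 0.9021**2 / 400            # variance of the sample mean, about 0.00203

z = (x_bar - mu0) / sqrt(var_xbar)    # observed value of the test statistic
q = NormalDist().inv_cdf(0.975)       # 1.96: boundary of the 5% rejection region
print(round(z, 2), round(q, 2))       # 1.83 1.96
print(abs(z) >= q)                    # False: do not reject H0 at the 5% level
```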
Secondly, note that the conclusion in Example 5.1 is the same as that in
Example 1.4, where the same hypothesis was tested, using the same data, but
using an approach based on confidence intervals. It might have been somewhat
worrying if the conclusions had been different. The fact that they must be the
same may be explained as follows.
In Example 5.1, the limit of the part of the rejection region in the upper tail
of the null distribution of X̄ is just its 0.975-quantile, which was calculated
as q0.975 = 0.6 + 1.96 × 0.9021/√400. That means we would reject the null
hypothesis, using this part of the rejection region, if the observed value x̄ of
the sample mean satisfies x̄ ≥ 0.6 + 1.96 × 0.9021/√400. This will happen if
x̄ − 1.96 × 0.9021/√400 ≥ 0.6.
However, in Example 1.4 you saw that the lower limit of the 95% confidence
interval for the population mean µ is x̄ − 1.96 × 0.9021/√400. If
x̄ − 1.96 × 0.9021/√400 ≥ 0.6, then the hypothesized value (0.6) of the
population mean will be on or below this lower limit, and hence outside the
confidence interval. (The confidence interval includes all values in between,
but not equal to, the confidence limits.) Thus the test statistic will fall in
the upper tail of the rejection region exactly when the hypothesized value of
the population mean falls on or below the lower limit of the confidence
interval. Similarly, the test statistic will fall in the lower tail of the
rejection region exactly when the hypothesized value of the population mean
falls on or above the upper limit of the confidence interval.
Putting these two facts together, the test statistic falls in the rejection region
exactly when the hypothesized value (0.6) of the population mean falls outside the
confidence interval for the population mean.
Thus, in circumstances like this, the results of the confidence interval
approach and the fixed-level approach involving the null distribution of a test
statistic have to be the same. (There are circumstances in which the two
methods do not coincide in this way.)
Before summarizing the procedure for fixed-level testing, there is a general point
about the hypotheses of a test that should be emphasized. It is important to note
that the hypothesis testing procedure does not treat the two hypotheses (null and
alternative) on an equal footing. The test is performed by finding out whether the
data provide enough evidence against the null hypothesis in order for it to be
rejected (at the chosen significance level). If the observed value of the test
statistic falls inside the rejection region, which consists of those values that are
least likely if the null hypothesis is true, then the null hypothesis is rejected in
favour of the alternative hypothesis. But if the test statistic falls outside the
rejection region, this means that there is not enough evidence to reject the null
hypothesis, which is not the same as saying that we accept that the null
hypothesis must be true. The situation is in some ways like the procedure in
criminal courts in countries with legal systems similar to those in England and the
USA. There, the initial assumption is that the accused person is innocent
(corresponding to the null hypothesis) and, to secure a conviction, the prosecution
have to provide evidence that proves beyond reasonable doubt that this
assumption is wrong and must be rejected. If they cannot provide such evidence
and proof, the ‘null hypothesis’ of innocence cannot be rejected. In many cases,
this will indeed be because the accused really is innocent of the crime; in other
cases, the position may be that the accused actually committed the crime but the
evidence was insufficient to prove this.
The strategy for a fixed-level test may be summarized as in the following box.
This procedure is used in Example 5.2 and Activity 5.1 for a data set from
Section 1.
So far, fixed-level testing has been discussed only for situations where the
alternative hypothesis is two-sided. A one-sided fixed-level test is described in
Example 5.3.
Figure 5.3 The rejection region for a one-sided test with significance level 0.10
Next, the observed value of the test statistic is calculated. The sample
standard deviation of the observed differences is s = 2.002 and the sample mean
is d̄ = 2.33 (from Example 3.2), so the observed value of the test statistic is

d̄/(s/√10) = 2.33/(2.002/√10) ≈ 3.68. Step 6.
The observed value of the test statistic, t = 3.68, is well inside the rejection
region, so the null hypothesis is rejected in favour of the alternative
hypothesis at the 10% significance level. We conclude that, on average, the drug
L-hyoscyamine hydrobromide does prolong sleep. Steps 7 and 8.
In Example 5.3, you have seen how to carry out a one-sided fixed-level test. In
fact, in most areas of application of statistics, one-sided tests are not used very
often. There are several reasons for this. First, situations where departures from
the null hypothesis in one particular direction are of no interest, or can
realistically be assumed not to be possible, are not particularly common in
practice. For instance, in a drug testing scenario, we may be confident in advance
that a new drug is on average at least as good as no treatment at all, and this
might lead us to propose a one-sided test, where the null hypothesis is that the
new drug performs exactly as well (on average) as no treatment, and the
alternative hypothesis is that the new drug is better. But what would happen if,
when the data were collected, the new drug turned out to do much worse (on
average) than no treatment? The null hypothesis could not be rejected; there
would be no possibility of concluding that the new drug could actually be worse
than useless. In general, it would be important to use a procedure that allowed
this conclusion as a possibility. Secondly, the two ‘tails’ involved in the rejection
region of a two-sided test usually contain (exactly or approximately) equal
probabilities. This means that, if the observed value of the test statistic falls just
inside the rejection region at the 10% significance level on a two-sided test, it will
fall inside the rejection region at the 5% significance level on the corresponding
one-sided test. A brief glance at these results might lead us to conclude that the
one-sided test provides stronger evidence against the null hypothesis than the
two-sided test, even though the data are the same in both cases. In general,
statisticians wish to avoid making the evidence in their data look stronger than it
really is, and in many situations this issue would lead a statistician to prefer a
two-sided test.
Within repeated experiments, the idea of outcomes more discordant with a null
hypothesis than others is fairly clear. However, with different experiments or
when using different test statistics, it is not at all clear whether a significance
probability as an absolute measure of accord with the null hypothesis — one that
can be compared across experiments — is a useful notion. This is an important
criticism of Fisher’s approach.
The approach of Neyman and Pearson offers an alternative, but some key
concepts were always rejected by Fisher. Among other things, he considered the
use of a pre-specified alternative hypothesis to be inappropriate for scientific
investigations. He maintained that the fixed-level approach was that of mere
mathematicians, without experience in the natural sciences. As well as subtle and
irreconcilable philosophical and theoretical incompatibilities between the two
approaches, there is no doubt that the controversy was fuelled by personal
antipathies as well. Peters (1987) writes, ‘Fisher was a fighter rather than a
modest and charitable academic.’ (Peters, W.S. (1987) Counting for Something —
Statistical Principles and Personalities, Springer-Verlag, New York.)

The heat has gone out of this controversy now, and most statisticians who test
hypotheses use approaches that to some extent compromise between the Fisher
approach and that of Neyman and Pearson. For example, they typically calculate
significance probabilities (Fisher) but refer to alternative hypotheses
(Neyman–Pearson). But there do remain differences between the approaches, and
these still tend, sometimes, to blur what is really going on when a hypothesis is
being tested.
Summary of Section 5
In this section, a general procedure for carrying out a fixed-level test of a
statistical hypothesis has been described. You have seen how to interpret the
significance level of a test as the probability of the test statistic being in the
rejection region, according to the null distribution of the test statistic. The ideas
of Type I error and Type II error, and the power of a test, have been introduced.
You have also read a little about the history of testing statistical hypotheses. And
you have used a SUStats program to enhance your understanding of the principles
behind hypothesis testing.
Exercise on Section 5
Exercise 5.1 Differences in plant heights

In Exercise 3.1, you performed a significance test on the differences in height
between the progeny of cross-fertilized and self-fertilized plants. The data are
reproduced in Table 5.1. The units of measurement are eighths of an inch.

Table 5.1 Difference in plant height (1/8 inch)

Pair  Difference
1        49
2       −67
3         8
4        16
5         6
6        23
7        28
8        41
9        14
10       29
11       56
12       24
13       75
14       60
15      −48

As in Exercise 3.1, suppose that the observed differences di, i = 1, 2, . . . , 15,
are independent observations on a normally distributed random variable D with
mean µ and variance σ².

Perform a two-sided fixed-level test of the hypothesis that there is no
difference (on average) between the heights of progeny of cross-fertilized and
self-fertilized plants. Use a 10% significance level.
More generally, suppose that the data are distributed as N(µ, σ²), where σ is
known, and the test statistic T = (X̄ − µ0)/(σ/√n) is to be used to test the null
hypothesis H0 : µ = µ0 against the alternative hypothesis H1 : µ > µ0, with a
significance level α. Then, using an argument similar to that in Example 6.1, it
can be shown that the power of the test when the true value of the underlying
mean is µ0 + d is given by

P(Z ≥ q1−α − d/(σ/√n)),   (6.1)
where Z ∼ N(0, 1). If the corresponding two-sided test is performed, and d is
positive, the power will be approximately

P(Z ≥ q1−α/2 − d/(σ/√n)).   (6.2)

(Expression (6.2) ignores the probability that the null hypothesis is rejected
due to a negative value of the test statistic, even though the hypothesized
value of the underlying mean is actually bigger than the value in the null
hypothesis. This probability can be ignored because it is typically very small.)

What happens to the power of the test when the significance level of the test
is changed? An example is helpful here. In Example 6.1 you saw that, for a
significance level of 0.05, the rejection region consists of values of 1.645 or
larger and the power of the test when µ = 0.5 is 0.223. For a significance
level of 0.01, the rejection region consists of values at least as large as
q0.99, the 0.99-quantile of N(0, 1), which is 2.326. So, in this case, the power
of the test when µ is 0.5 is given by (6.1):
P(Z ≥ q1−α − d/(σ/√n)) = P(Z ≥ 2.326 − 0.884)
                       = P(Z ≥ 1.442)
                       ≈ 0.075.
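Expression (6.1) translates directly into a short function. The standardized shift d/(σ/√n) = 0.884 below is taken from the example just worked:

```python
from statistics import NormalDist

def power_one_sided(shift, alpha):
    """Power from Expression (6.1), where shift = d / (sigma / sqrt(n))."""
    z = NormalDist()
    return 1 - z.cdf(z.inv_cdf(1 - alpha) - shift)

# Standardized shift 0.884, as in the example where mu = 0.5
print(round(power_one_sided(0.884, 0.05), 3))  # 0.223
print(round(power_one_sided(0.884, 0.01), 3))  # 0.075
```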
These results are illustrated in Figures 6.3 and 6.4. Figure 6.3 shows the rejection
regions corresponding to significance levels of 0.05 and 0.01; the power of the test
when µ is 0.5 is represented in Figure 6.4 for each of these significance levels.
Figure 6.4 The power when µ is 0.5 (a) α = 0.05 (b) α = 0.01
As you can see, reducing the significance level has made the rejection region
smaller and hence, when µ is 0.5, it has also made the power of the test smaller.
In general, when α is made smaller, q1−α becomes larger, so q1−α − d/(σ/√n)
becomes larger. Hence the power

P(Z ≥ q1−α − d/(σ/√n))
becomes smaller. That is, when the significance level is made smaller, the power
of the test decreases.
Activity 6.1
By examining Expressions (6.1) and (6.2) for the power of the test, say what
happens to the power of the test in each of the following cases.
(a) The difference d (between the mean in the null hypothesis and the mean for
which the power is being calculated) is increased.
(b) The population standard deviation σ is increased.
(c) The sample size n is increased.
Intuitively, the results of Activity 6.1 seem reasonable. If the difference between
the two hypothesized values of the underlying mean is bigger, then, other things
being equal, the test will more likely be able to distinguish them and come to the
right conclusion. If the population standard deviation increases, there is more
variability in the data, and the test will less likely be able to ‘see through’ this
variability and come to the correct conclusion. Finally, increasing the sample size
will increase the information obtained from the test and thus make it more likely
that the correct conclusion is reached.
Activity 6.2
A psychologist wishes to investigate the IQ of a certain specific population. The
aim is to investigate whether the mean IQ in this population could plausibly be
equal to that in the population of the UK as a whole. For the IQ test that the
psychologist will use, the mean score for the general UK population is 100, and
the standard deviation of scores is 15. The psychologist intends to take a sample
of 80 individuals from the specific population and measure their IQs using this
test. It is thought likely that the standard deviation of IQ scores in the specific
population is the same as that in the general UK population (though, of course,
nobody can be sure until the data have been collected). The psychologist will test
the null hypothesis that the mean IQ in the specific population is 100, using a
two-sided test based on the normal distribution, at significance level 0.05.
Suppose the actual mean IQ score in the specific population were 105. What is
the probability that the psychologist will reject the null hypothesis?
Expressions (6.1) and (6.2) above for calculating the power are, strictly, only valid
if the population standard deviation is known. In practice, in the great majority
of cases, it will not be known, and the population standard deviation will be
replaced by the sample standard deviation and a t-test performed. If the sample
size is fairly large, these expressions will still give reasonable approximations to
the power of the test; but if the sample size is small, they will not. More
complicated calculations, based on the same principles, can give accurate values
for the power of t-tests. But you are spared the details: in Subsection 6.3, you will
use your computer for such calculations, and also for similar calculations in the
case of other tests.
In general, suppose the data are distributed as N(µ, σ²) and the test statistic
T = (X̄ − µ0)/(σ/√n) is to be used to test the null hypothesis H0 : µ = µ0
against the alternative hypothesis H1 : µ > µ0, with a significance level α.
Suppose the sample size n is to be chosen such that the power of the test when
the true value of the underlying mean is µ0 + d is equal to some predetermined
value γ. Then it can be shown that the required sample size is

n = (σ²/d²)(q1−α − q1−γ)²,   (6.3)

where q1−α and q1−γ are quantiles of the standard normal distribution. For a
two-sided test, q1−α is replaced by q1−α/2 in (6.3) to give an expression for
the approximate sample size required. The approximation is reasonable provided
that d/σ is not too small.
Formula (6.3) will still give approximately correct answers in the case where the
underlying variance is not known and a t-test is being performed. But your
computer can perform accurate sample size calculations even for small sample
sizes, and can also use similar expressions in other testing situations. You will be
using MINITAB to do sample size calculations in Subsection 6.3.
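Formula (6.3) can also be sketched directly. The parameter values below (σ = 15, d = 5, a 5% significance level and 90% power) are hypothetical, chosen only to illustrate the calculation:

```python
from math import ceil
from statistics import NormalDist

def sample_size(d, sigma, alpha, gamma):
    """Required sample size from Formula (6.3), for a one-sided test."""
    z = NormalDist()
    q_alpha = z.inv_cdf(1 - alpha)
    q_gamma = z.inv_cdf(1 - gamma)
    return (sigma**2 / d**2) * (q_alpha - q_gamma)**2

# Hypothetical values, for illustration only
n = sample_size(d=5, sigma=15, alpha=0.05, gamma=0.90)
print(ceil(n))  # 78: round up to the next whole observation
```

For a two-sided test, `1 - alpha` would be replaced by `1 - alpha/2` in the first quantile, as noted above.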
Summary of Section 6
In this section, you have learned how to calculate the power of a test on a normal
mean in the simple case where the population variance is assumed known. You
have seen how these calculations can be used to choose an appropriate sample size
for a statistical investigation.
You have also learned how to use MINITAB to perform sample size and power
calculations for one- and two-sample t-tests, and for a test of one proportion.
Exercises on Section 6
Exercise 6.1 Power
Suppose that the researcher in Activity 6.3 did indeed carry out the study as
described, with 60 subjects. What would be the power of the researcher’s test to
detect a mean difference in blood pressure of just 2 mm Hg?
Summary of Unit C1
In this unit, you have learned how to test statistical hypotheses. Specifically, you
have seen how null and alternative hypotheses, making specific statements about
the value of a population parameter, can be set up.
You have seen three different approaches to testing a statistical hypothesis, on the
basis of data. In the first approach, a confidence interval for the parameter of
interest is calculated, and if the specific parameter value named in the null
hypothesis is outside this interval, the null hypothesis is rejected in favour of the
alternative hypothesis. Otherwise, the null hypothesis remains plausible. In this
context, the significance level of the hypothesis test is 100% minus the confidence
level for the interval.
The second, and most common, approach to testing hypotheses is known as
significance testing. A test statistic is chosen; this is a quantity whose value gives
information about the parameter in the hypotheses, and whose distribution when
the null hypothesis is true — its null distribution — is known. A significance
probability, or p value, is calculated. This is the probability that, under the null
hypothesis, a value of the test statistic would be observed that is at least as
extreme as the value that was actually observed. The smaller the significance
probability, the stronger is the evidence against the null hypothesis. Significance
tests can have two-sided or one-sided alternative hypotheses, depending on
whether or not departures from the null hypothesis in both directions are
included. For one-sided tests, the extreme values fall in one tail of the null
distribution; for two-sided tests, the extreme values fall in both tails.
In the third approach, a significance level is chosen and fixed in advance. This
approach also involves choosing a test statistic and knowing its null distribution.
However, instead of calculating a significance probability, a rejection region is set
up, such that values of the test statistic in the rejection region tend to favour the
alternative hypothesis rather than the null hypothesis, and such that the
probability that the test statistic is in the rejection region, when the null
hypothesis is true, is equal to the significance level. In some circumstances, such
tests are exactly equivalent to those produced using the method based on
confidence intervals. Fixed-level tests also come in one-sided or two-sided forms,
depending on the nature of the alternative hypothesis. The choice of alternative
hypothesis is reflected in how the rejection region is set up: in one-sided cases the
rejection region falls in just one tail of the null distribution of the test statistic,
but in two-sided cases the rejection region falls in both tails.
Tests of hypotheses may lead to two possible kinds of wrong conclusion. First, a
null hypothesis may be rejected when it is in fact true. This is known as a Type I
error, and its probability is equal to the significance level of the test. Secondly, a
test may fail to reject a null hypothesis when it is in fact false. This is known as a
Type II error. The power of a test is the probability that the test will (correctly)
reject the null hypothesis when it is false. So the power of a test is one minus the
probability of a Type II error. You have seen how to calculate the power of certain
simple tests by hand, and of a wider range of tests using MINITAB. You have also
seen how to use power calculations to plan an appropriate sample size for a study.
You have seen how to perform statistical tests for null hypotheses involving a
single normal mean and a single proportion, as well as for differences between
these quantities in two populations. (The tests for normal means are called
t-tests.) You have also seen how to perform tests involving a population mean,
using large-sample results for the distribution of the sample mean. You have
learned to carry out these tests both with and without using a computer.
Learning outcomes
You have been working towards the following learning outcomes.
Ideas to be aware of
That a significance test involves assessing the strength of the evidence against
a null hypothesis.
That a fixed-level test typically involves deciding whether or not to reject a
null hypothesis.
That, even when a null hypothesis is not rejected, it is not appropriate to
conclude that it must be true.
That the conclusions of fixed-level tests are subject to two different possible
types of error, and that the probabilities of these errors can be taken into
account.
That the context of a statistical investigation determines whether the
appropriate statistical test is one-sided or two-sided.
That a quantity cannot be used as a test statistic unless its null distribution
is known (at least approximately).
That there is a relationship between confidence intervals and fixed-level
hypothesis testing.
That a two-sample t-test for the difference between two normal means may
involve the assumption that the variances of the two populations are equal.
That the power of a test decreases when the significance level is made smaller.
Statistical skills
Set up hypotheses appropriately for a hypothesis test.
Perform a significance test or a fixed-level test for null hypotheses involving a
single mean of a normal distribution, a single proportion (exact and
approximate), a mean of a population whose distribution is unspecified
(based on a large sample), the difference between the means of two normal
distributions, or the difference between two proportions.
Interpret the results of a significance test or a fixed-level test, in terms of the
real-world investigation that gave rise to the test.
Calculate the power of a test on a normal mean when the population variance
is assumed to be known.
Calculate the sample size required to obtain a given power in a test on a
normal mean when the population variance is assumed to be known.
Solutions to Activities
Solution 1.1

(a) The appropriate confidence interval to use for a test with significance level
0.05 (or 5%) is the 95% confidence interval. This is given in the question as
(0.2113, 0.6133). The value in the null hypothesis, 2/3 ≈ 0.6667, is not contained
in this interval. The conclusions of the test may be stated as follows.
There is sufficient evidence at the 0.05 significance level to reject the null
hypothesis that p = 2/3 in favour of the alternative hypothesis that p differs
from 2/3. In fact, since the entire confidence interval is below 2/3, the data
suggest that p < 2/3.

(b) This time, the appropriate confidence interval to use is the 99% interval.
The value 0.6667 is included in this interval. So there is insufficient evidence
at the 0.01 significance level to reject the null hypothesis that p = 2/3.

Solution 1.2

(a) The significance level of the test will be 5% (or 0.05).

(b) Since 2 is in the confidence interval, there is insufficient evidence at the
5% significance level to reject the null hypothesis that the mean March rainfall
in Minneapolis St Paul is 2 inches.

Solution 2.1

Step 1: The null and alternative hypotheses are

H0: µ = 1,  H1: µ ≠ 1,

where µ is the underlying mean number of insects of the taxon Staphylinoidea
caught in a trap.

Step 2: The data are in Table 2.2.

Step 3: An appropriate test statistic is the sample mean X. When the null
hypothesis is true, µ = 1, so the approximate null distribution of X is
N(1, σ²/33). The unknown population variance σ² can be replaced by its sample
estimate s² = 1.655², so the approximate null distribution of X is
N(1, 1.655²/33) = N(1, 0.0830).

Step 4: The observed value of the test statistic is x = 1.636.

Step 5: Since this is a two-sided test, the values that are at least as extreme
as the observed value (1.636) will fall into two tails. (See Figure S.1.)

Figure S.1  The null distribution of X, and the values that are at least as
extreme as the observed value x = 1.636

Step 6: The significance probability is

p = P(X ≤ 0.364) + P(X ≥ 1.636)
  = 2P(X ≥ 1.636)
  = 2P(Z ≥ (1.636 − 1)/√0.0830), where Z ∼ N(0, 1)
  ≈ 2P(Z ≥ 2.21)
  = 2 × 0.0136
  = 0.0272 ≈ 0.027.

Steps 7 and 8: There is moderate evidence against the null hypothesis that the
mean number of insects of the taxon Staphylinoidea in a trap is 1. Further, since
x = 1.636 > 1, the data suggest that the underlying mean number of insects in a
trap is greater than 1.

Solution 3.2

The null and alternative hypotheses are given in the question (step 1), and the
data to be used are in Table 1.3 (step 2).

Step 3: An appropriate test statistic is

T = (X − 300)/(S/√10);

its null distribution is t(9).

Step 4: The observed value of the test statistic is

t = (x − 300)/(s/√10) = (294.81 − 300)/(1.77/√10) ≈ −9.272.

Step 5: The set of values of the test statistic that are at least as extreme as
the observed value consists of all values less than or equal to −9.272, together
with all values greater than or equal to 9.272.

Step 6: The significance probability is p = P(|T| ≥ 9.272). From the table of
quantiles in the Handbook, the 0.999-quantile of t(9) is 4.297. Therefore
P(T ≥ 9.272) < 0.001, and hence p < 0.002.

Steps 7 and 8: There is strong evidence against the null hypothesis that the mean
time that the timer takes to ring when set to five minutes is equal to five
minutes (300 seconds). Since the observed sample mean is x = 294.81, the data
suggest that the average time the timer takes to ring is less than five minutes.
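As an aside, the large-sample arithmetic in Solution 2.1 can be checked with a few
lines of Python (an illustrative sketch, not part of the course software, which
uses MINITAB for such calculations):

```python
# Sketch of the large-sample z-test in Solution 2.1: sample mean 1.636,
# s = 1.655, n = 33, testing H0: mu = 1 against a two-sided alternative.
from math import sqrt
from statistics import NormalDist

x_bar, mu0, s, n = 1.636, 1, 1.655, 33
se = s / sqrt(n)                      # estimated standard error of the mean
z = (x_bar - mu0) / se                # standardized observed value
p = 2 * (1 - NormalDist().cdf(z))     # two-sided significance probability
print(round(z, 2), round(p, 3))       # 2.21 0.027
```

The small difference from the table-based working (2 × 0.0136 = 0.0272) is just
rounding.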
Solution 3.5

Step 1: Denoting the population mean sleep gain for patients taking
D-hyoscyamine hydrobromide by µ, the null and alternative hypotheses are

H0: µ = 0,  H1: µ > 0.

(The justifications for carrying out a one-sided test here are as follows. First,
that is what you were asked to do. Secondly, there may be grounds for assuming
that the drug can have an effect only in one direction, as was considered to be
the case for L-hyoscyamine hydrobromide.)

Step 2: The data are given in the question.

Step 3: Since there are ten patients, the appropriate test statistic is

T = D/(S/√10),

where D is the sample mean sleep gain and S is the sample standard deviation of
the sleep gains. The null distribution of this statistic is t(9).

Step 4: The sample mean is 0.75 and the sample standard deviation is 1.789, so
the observed value of the test statistic is

t = d/(s/√n) = 0.75/(1.789/√10) ≈ 1.33.

Step 5: This is a one-sided test, so the values of the test statistic that are at
least as extreme as the observed value fall in one tail of the null distribution.
This tail corresponds to values of T that are greater than or equal to the
observed value t = 1.33.

Step 6: The significance probability is p = P(T ≥ 1.33), where T ∼ t(9). The
0.9-quantile of t(9) is 1.383, so P(T ≥ 1.383) = 0.1. But 1.33 < 1.383, so

p = P(T ≥ 1.33) > 0.1.

Steps 7 and 8: There is little evidence against the null hypothesis that, on
average, D-hyoscyamine hydrobromide does not prolong sleep.

Solution 3.6

The observed value, 19, is in the upper tail of the null distribution of the test
statistic. Thus, from Table 3.4, the probability in the tail including the
observed value and larger values is 0.0243. The lower tail is chosen so that it
contains the largest amount of probability that is less than or equal to 0.0243.
Again using Table 3.4, this is the tail containing all values up to and including
10, for which the total probability is 0.0139.

Thus the significance probability is

p = P(N ≤ 10) + P(N ≥ 19) = 0.0139 + 0.0243 = 0.0382.

Observing 19 yellow peas out of 20 would thus provide moderate evidence that the
underlying value p of the proportion of peas that are yellow is different from
the hypothesized value of 3/4. Since the observed proportion is greater than 3/4,
the data suggest that the underlying proportion is greater than 3/4.

Solution 4.1

Appropriate hypotheses for the test, in terms of the parameters p1 and p2, which
represent the underlying proportions of males and females who are helped, are

H0: p1 = p2,  H1: p1 ≠ p2.

The observed value of D, the difference between the sample proportions, is

d = 71/100 − 89/105 ≈ 0.71 − 0.8476 = −0.1376.

Under the null hypothesis that p1 = p2 = p, the pooled estimate of the common
value of the proportion p is

p̂ = (71 + 89)/(100 + 105) = 160/205 ≈ 0.7805.

The approximate null distribution of D is

N(0, p̂(1 − p̂)(1/n1 + 1/n2)) = N(0, 0.7805(1 − 0.7805)(1/100 + 1/105))
                             = N(0, 0.003345).

Therefore, if Z ∼ N(0, 1), the significance probability is

P(|D| ≥ 0.1376) ≈ P(|Z| ≥ 0.1376/√0.003345)
                ≈ P(|Z| ≥ 2.38)
                = 2 × 0.0087
                = 0.0174 ≈ 0.017.

Thus there is moderate evidence against the null hypothesis and in favour of the
alternative hypothesis. The conclusion is that there is moderate evidence that
male students and female students are not equally likely to be helped in the
context of an experiment like this. Indeed, since p̂2 > p̂1, the data indicate
that female students are more likely to be helped.

Solution 4.2

(a) The ratio of the larger sample variance to the smaller is
114.84/48.93 ≈ 2.35. This is less than 3. So, by the rule of thumb, the equal
variance assumption is not unreasonable.

The pooled estimate of the common variance is

s²P = ((n1 − 1)s²1 + (n2 − 1)s²2)/(n1 + n2 − 2)
    = (9 × 48.93 + 9 × 114.84)/(10 + 10 − 2) ≈ 81.89.

(In this case, the two sample sizes are equal, so the pooled estimate is in fact
the average of the two sample variances, and it is a bit quicker to calculate it
as (48.93 + 114.84)/2 ≈ 81.89.)

(b) Denoting the underlying mean joint widths in the two populations by µ1 and
µ2, the hypotheses are

H0: µ1 = µ2,  H1: µ1 ≠ µ2.

The null distribution of the test statistic has a t-distribution with
10 + 10 − 2 = 18 degrees of freedom.
The observed value of the test statistic is

t = (x1 − x2)/(sP√(1/n1 + 1/n2)) = (128.40 − 122.80)/(√81.89 × √(1/10 + 1/10)) ≈ 1.384.

The significance probability is p = P(|T| ≥ 1.384), where T ∼ t(18). The number
1.384 lies between the 0.9-quantile and the 0.95-quantile of t(18), so
P(T ≤ 1.384) lies between 0.9 and 0.95, and hence

0.05 < P(T ≥ 1.384) < 0.1.

Since p = P(|T| ≥ 1.384) = 2P(T ≥ 1.384), it follows that 0.1 < p < 0.2.
(Computer calculations give the value of the significance probability as 0.183.)

Thus there is little evidence against the null hypothesis. We cannot rule out the
possibility that the underlying mean joint widths are equal.

Solution 5.1

(a) Step 5: For a test using a 1% significance level, the appropriate quantiles
are q0.005 and q0.995. From the table of quantiles for t-distributions, the
0.995-quantile of t(9) is q0.995 = 3.250. Using the symmetry of the distribution,
q0.005 = −3.250. The rejection region for the test is therefore as shown in
Figure S.2.

Solution 5.2

Step 1: The hypotheses are given in the question:

H0: µ = 0.618,  H1: µ ≠ 0.618.

Step 2: The data are in Table 3.1.

Step 3: Since there are 20 rectangles, the appropriate test statistic is

T = (X − 0.618)/(S/√20),

where X is the sample mean ratio and S is the sample standard deviation of the
ratios. The null distribution of this statistic is t(19).

Step 4: The significance level has been specified as 5%.

Step 5: The rejection region is defined by the 0.025-quantile and 0.975-quantile
of t(19); these are q0.025 = −2.093 and q0.975 = 2.093. (See Figure S.3.)

Solution 5.3

(a) The null hypothesis H0 would be rejected in a fixed-level test with a
significance level greater than 0.083, but would not be rejected if a
significance level less than 0.083 is used. Thus H0 would be rejected in a
fixed-level test with a 10% significance level, but not in a fixed-level test
with either a 5% or a 1% significance level.

(b) Since H0 is rejected at the 1% significance level, the observed value of the
test statistic lies in the rejection region, and the rejection region contains 1%
of the probability in the null distribution. Thus the observed value must be at
least as extreme as the boundaries of the rejection region, and hence the
significance probability p ≤ 0.01.

Solution 5.4

Step 1: As in Activity 3.5, denoting the population mean sleep gain for patients
taking D-hyoscyamine hydrobromide by µ, the null and alternative hypotheses are

H0: µ = 0,  H1: µ > 0.

Step 2: The data are in Table 3.3.

Step 3: Since there are ten patients, the appropriate test statistic is

T = D/(S/√10),

where D is the sample mean sleep gain and S is the sample standard deviation of
the sleep gains. The null distribution of this statistic is t(9).

Step 4: The significance level has been specified as 5%.

Step 5: The rejection region is defined by the 0.95-quantile of t(9); this is
q0.95 = 1.833.

Step 6: The sample mean is 0.75 and the sample standard deviation is 1.789, so
the observed value of the test statistic is

t = d/(s/√n) = 0.75/(1.789/√10) ≈ 1.33.

Step 7: The observed value of the test statistic does not lie in the rejection
region, because it is less than q0.95 = 1.833. The null hypothesis cannot be
rejected at the 5% significance level.

Step 8: On the basis of these data, it remains plausible that, on average,
D-hyoscyamine hydrobromide does not prolong sleep.

Solution 6.1

For a one-sided test, the key quantity in (6.1) to consider is

z = q1−α − d/(σ/√n).

If z decreases, then P(Z ≥ z) increases, and the power of the test will go up. If
z increases, then the power of the test will go down.

(a) If d increases, then q1−α − d/(σ/√n) will decrease. Thus the power of the
test will increase.

(b) If σ increases, then q1−α − d/(σ/√n) will increase. Thus the power of the
test will decrease.

(c) If n increases, then q1−α − d/(σ/√n) will decrease. Thus the power of the
test will increase.

For a two-sided test, the key quantity in (6.2) to consider is

q1−α/2 − d/(σ/√n).

Similar results follow in this case.

Solution 6.2

Here, translating the problem into the notation of power calculations, we have

α = 0.05,  d = 105 − 100 = 5,  σ = 15,  n = 80.

The test is two-sided, so the probability required is given by

P(Z ≥ q1−α/2 − d/(σ/√n)),

where Z ∼ N(0, 1). In this case, q1−α/2 = q0.975 = 1.96. So

q1−α/2 − d/(σ/√n) = 1.96 − 5/(15/√80) ≈ −1.02.

From the table of probabilities for the standard normal distribution,

P(Z ≥ −1.02) = 0.8461 ≈ 0.846.

The psychologist's procedure has a reasonably good chance of finding a difference
in mean IQ of this size.

Solution 6.3

The appropriate number of subjects is given by

n = (σ²/d²)(q1−α/2 − q1−γ)².

Here α = 0.01. The assumed population standard deviation σ is 10. The difference
d that the test is being designed to detect is 5, and γ, the required power to
detect such a difference, is 0.9. Thus q1−α/2 = q0.995 = 2.576 and
q1−γ = q0.1 = −1.282. So, the sample size required is

n = (10²/5²)(2.576 + 1.282)² ≈ 59.54,

which is rounded up to 60.
Solutions to Exercises
Solution 1.1

(a) The significance level of the test is 10% (or 0.1).

(b) The hypothesized value of 2.5 is outside the 90% confidence interval for µ.
So the null hypothesis that µ = 2.5 can be rejected at the 0.1 significance level
in favour of the alternative hypothesis that the population mean number of loans
per book in a year differs from 2.5. Indeed, since the sample mean is less than
2.5, the data indicate that the population mean is less than 2.5.

Solution 2.1

Step 1: The null and alternative hypotheses are

H0: µ = 2.5,  H1: µ ≠ 2.5,

where µ is the underlying mean number of loans per book in a year.

Step 2: The data are given in the question in summary form: x = 1.992, s = 1.394,
n = 122.

Step 3: An appropriate test statistic is the sample mean X. When the null
hypothesis is true, µ = 2.5 so, using s to estimate the population standard
deviation σ, the approximate null distribution of X is

N(2.5, 1.394²/122) = N(2.5, 0.01593).

Step 4: The observed value of the test statistic is x = 1.992.

Steps 5 and 6: Since this is a two-sided test, the values that are at least as
extreme as the observed value (1.992) fall into two tails, and the significance
probability is twice the probability in the lower tail:

p = 2P(X ≤ 1.992)
  = 2P(Z ≤ (1.992 − 2.5)/√0.01593)
  ≈ 2P(Z ≤ −4.02).

Using the table of probabilities for the standard normal distribution in the
Handbook gives p = 0.0000 to four decimal places.

Steps 7 and 8: There is strong evidence against the null hypothesis that the mean
number of loans per book is 2.5. Since the sample mean is less than 2.5, the data
indicate that the underlying mean number of loans per book in a year is less
than 2.5.

Solution 3.1

(a) Steps 1 and 3: The null and alternative hypotheses are

H0: µ = 0,  H1: µ ≠ 0,

where µ is the (population) mean difference between the heights of a pair of
cross-fertilized and self-fertilized plants whose parents were grown from the
same seed. An appropriate test statistic is

T = D/(S/√n) = D/(S/√15),

with null distribution t(n − 1) = t(14).

(b) Steps 4, 5 and 6: For this data set, the sample mean is 20.93 and the sample
standard deviation is 37.74, so the observed value of the test statistic is

t = d/(s/√n) = 20.93/(37.74/√15) ≈ 2.148.

The significance probability is p = P(|T| ≥ 2.148). The 0.975-quantile of t(14)
is 2.145 and the 0.99-quantile is 2.624, so P(T ≥ 2.148) lies between 0.01 and
0.025. Therefore 0.02 < p < 0.05.

(c) Steps 7 and 8: There is moderate evidence against the null hypothesis that
there is no difference, on average, between the heights of cross-fertilized and
self-fertilized plants whose parents were grown from the same seed. Since the
sample mean difference is 20.93, the data indicate that cross-fertilized plants
tend to be taller than self-fertilized plants.

Solution 4.1

Appropriate hypotheses for the test are

H0: p1 = p2,  H1: p1 ≠ p2,

where the parameters p1 and p2 denote the underlying proportions of patients
receiving the new drug and the standard treatment, respectively, who are in
remission after a year.

The observed value of D, the difference between the sample proportions, is

d = 38/62 − 22/60 ≈ 0.6129 − 0.3667 = 0.2462.

The pooled estimate of the common value of the proportion p (under the null
hypothesis that p1 = p2 = p) is

p̂ = (38 + 22)/(62 + 60) = 60/122 ≈ 0.4918.

The approximate null distribution of D is

N(0, p̂(1 − p̂)(1/n1 + 1/n2)),

or

N(0, 0.4918(1 − 0.4918)(1/62 + 1/60)) = N(0, 0.00820).

The significance probability is thus

P(|D| ≥ 0.2462) ≈ P(|Z| ≥ 0.2462/√0.00820), where Z ∼ N(0, 1),
                ≈ P(|Z| ≥ 2.72)
                = 2 × 0.0033
                = 0.0066.

Thus there is strong evidence against the null hypothesis and in favour of the
alternative hypothesis. The conclusion is that there is strong evidence that
patients on the two treatments differ in their probability of being in remission
after a year. Moreover, since p̂1 = 0.6129 > 0.3667 = p̂2, the data indicate that
patients on the new drug are more likely to be in remission than are those on the
standard treatment.
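The two-proportion calculation in Solution 4.1 above follows a fixed pattern,
sketched below in Python (an illustrative aside, not part of the course software;
the course itself uses MINITAB):

```python
# Sketch of the approximate two-proportion test in Solution 4.1:
# 38 of 62 patients on the new drug and 22 of 60 on the standard
# treatment were in remission after a year.
from math import sqrt
from statistics import NormalDist

x1, n1, x2, n2 = 38, 62, 22, 60
d = x1 / n1 - x2 / n2                         # difference in sample proportions
p_hat = (x1 + x2) / (n1 + n2)                 # pooled estimate under H0: p1 = p2
se = sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
z = d / se
p = 2 * (1 - NormalDist().cdf(abs(z)))        # two-sided significance probability
print(round(z, 2), round(p, 4))               # 2.72 0.0065
```

The Handbook-table working in the solution gives 0.0066; the small difference is
rounding in the tail probability.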
UNIT C2 Nonparametrics
Study guide for Unit C2
This unit is shorter than average. You should schedule four study sessions,
including time for answering the TMA questions on the unit and for generally
reviewing and consolidating your work on this unit.
Section 2 does not depend on ideas or skills from Section 1. Thus, if you have
some pressing reason for studying the two sections out of order, this should cause
no problems.
As you study this unit you will be asked to work through Chapter 6 of
Computer Book C. We recommend that you do this at the place indicated in the
unit (in Subsection 1.4), though it would be possible to postpone studying the
chapter until later in the unit without affecting your understanding of the unit
itself.
One possible study pattern is as follows.
Study session 1: Subsections 1.1 and 1.2.
Study session 2: Subsections 1.3 and 1.4. You will need access to your computer
for this session, together with Computer Book C.
Study session 3: Section 2.
Study session 4: TMA questions on Unit C2.
Introduction
This unit contains two different topics, which do not have a great deal in common
except that they both involve significance testing.
Most of the significance test procedures that you learned in Unit C1 involved a
particular assumption about the underlying distribution of the data involved (for
example, that they had a normal distribution, or a binomial distribution). What
do we do if such assumptions do not seem justified? In Section 1, you will meet
several hypothesis tests that do not involve such assumptions.
This raises the question of how we can actually tell whether a sample of data
could plausibly have been drawn from a particular distribution. You have already
seen certain graphical methods for investigating this, in the form of probability
plots. In Section 2, you will learn about a more formal procedure for testing what
is known as the goodness of fit of a probability model to discrete data.
1 Nonparametric tests
Many of the significance tests you met in Unit C1 involve assumptions about the
distribution of the data. For instance, t-tests involve the assumption that the
underlying distribution of the population or populations is normal. The
distributional properties of the test statistic depend on this assumption of
normality. If the underlying population distribution(s) were not normal, the t-test
statistic would not in general have a Student’s t-distribution, and therefore any
p value that you calculated on the basis of this assumption might be incorrect.
How serious this was in practice would depend on how different the underlying
distribution was from the assumed normal form. In some cases the discrepancy
might be small, and no practical problems might arise; but in other cases the
discrepancy might be crucial. This kind of restriction does not apply to all the
tests you have met; some of them are based on large-sample approximations
(using the central limit theorem) which are valid for a wide range of underlying
distributions. In other circumstances, as you will see in Unit C3, it may be
possible to transform the data, that is, to apply an appropriate function to all the
data values, such that the transformed data are at least approximately normally
distributed. Then methods that assume a normal distribution can be applied to
the transformed data. However, this is not always possible. In such cases, a
technique that does not require a specific probability model must be used. Such
techniques are called nonparametric. In the one-sample t-test, for example, the
underlying distribution is assumed to be normal, but with unknown mean and
variance. That is, the underlying distribution comes from a family defined by two
parameters, the mean µ and the variance σ2 . Thus this test is said to be
parametric. In a nonparametric test, there is no assumption that the underlying
distribution comes from a specified family indexed by parameters in this way.
Since no particular distributional form is assumed, the tests are also called
distribution-free, though, as you will see, this does not mean that there are no
distributions involved at all!
A nonparametric test is described in each of Subsections 1.1 to 1.3. You will learn
how to use MINITAB to carry out these tests in Subsection 1.4.
obtain a significance probability by using a binomial distribution, B(n, 1/2)
(where n is the sample size). Arbuthnot subtracted the number of girls recorded
from the number of boys for each of 82 years and obtained 82 + signs.

The points do not lie particularly close to a straight line, so the evidence is not
compelling that a normal distribution is appropriate for modelling these data.
However, it must be borne in mind that the sample size is small, and therefore
that the evidence against a normal distribution is not compelling either. In fact,
the (two-sided) t-test gives a p value of 0.33, so there is very little evidence
against the null hypothesis of a zero mean difference. But this conclusion may be
in doubt, because of the possible non-normality of the underlying distribution.

One way to avoid assuming a normal model for the differences is to use the sign
test. The null hypothesis for this test is that the underlying median difference
(rather than the mean difference, as used in the t-test) is zero. The test then
proceeds simply by counting the number of differences with + signs and the
number with − signs. It is common practice simply to ignore zeros in the sign test
(and reduce the sample size accordingly). (There is another approach to the sign
test that incorporates the zeros in the analysis, but it is not discussed in this
course.) Looking at the differences, and ignoring the zero, there are three
+ signs and four − signs or, to put it another way, three + signs out of seven.
Assuming equal probability for individual + and − signs, the distribution of the
number of + signs out of seven is B(7, 1/2). If we wished to perform a one-sided
test, the p value would be the probability of three or fewer + signs out of seven;
this is

Σ (x = 0 to 3) (7 choose x)(1/2)^7
    = 1 × (1/2)^7 + 7 × (1/2)^7 + 21 × (1/2)^7 + 35 × (1/2)^7 = 1/2.

Thus, for a one-sided test, the p value would be 0.5. This is illustrated in
Figure 1.2. For a two-sided test, as well as the probability that the number of
+ signs is less than or equal to the observed value, 3, we would need to consider
the other tail of the distribution of the test statistic. The lower tail contains the
values 0, 1, 2 and 3, so the upper tail contains the values 4, 5, 6 and 7. Together
this accounts for all the possible numbers of + signs out of seven; the p value for
the two-sided test is 1. This provides absolutely no evidence against the null
hypothesis; in other words, there is no evidence of a difference (on average) in
corneal thickness.
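The counting argument above is easily reproduced in a few lines of Python (an
illustrative aside, not part of the course software, which uses MINITAB; the
function name is my own):

```python
# Illustrative sketch: the sign test p value computed from the
# binomial null distribution B(n, 1/2).
from math import comb

def sign_test_p(n_plus, n, two_sided=True):
    """p value for the sign test, where n_plus is the number of + signs
    among n non-zero differences. Under H0 the count is B(n, 1/2); this
    sketch sums the smaller tail and doubles it for a two-sided test."""
    k = min(n_plus, n - n_plus)
    tail = sum(comb(n, x) for x in range(k + 1)) / 2 ** n  # P(X <= k)
    return min(1.0, 2 * tail) if two_sided else tail

# The corneal-thickness example: three + signs out of seven
print(sign_test_p(3, 7, two_sided=False))  # 0.5
print(sign_test_p(3, 7))                   # 1.0
```

The same sketch would serve for testing a specified median m0 in a single sample:
count as + signs the values exceeding m0, discarding any equal to it.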
Another way of thinking of the relationship between the p values for the one-sided
and two-sided sign tests is as follows. The distribution of the test statistic (the
number of + signs) is B(n, 1/2). This is a symmetric distribution. Thus the
probability in the tail of this distribution opposite to that actually observed will
always be equal to that in the observed tail. Therefore the p value for the
two-sided test is generally double that of the one-sided test. (There is one
exception. This occurs when n, the total number of signs, is even, and the
observed number of + signs is in the middle of its distribution, at n/2. In this
case, p = 1 for the two-sided test.)

Figure 1.2  The null distribution of the number of + signs: B(7, 1/2)

Activity 1.1 Sleep gain — the sign test

In Example 3.2 of Unit C1, you met some data on the possible hypnotic effect in
humans of the drug L-hyoscyamine hydrobromide. Ten individuals had their sleep
time measured before and after taking this drug. The differences in sleep time
(time after taking the drug − time before taking the drug) are given in Table 1.2.

Table 1.2  Sleep gain (hours) when patients take L-hyoscyamine hydrobromide

Patient        1    2    3    4    5     6    7    8    9    10
Gain in sleep  1.9  0.8  1.1  0.1  −0.1  4.4  5.5  1.6  4.6  3.4

Source: 'Student' (1908) The probable error of a mean. Biometrika, 6, 1–25.

In Unit C1 these data were analysed using a t-test. The null and alternative
hypotheses were as follows:

H0: µ = 0,  H1: µ > 0,

where µ is the underlying mean sleep gain. (Thus you were performing a
one-sided test.) The p value for this test was 0.0025, so that there was strong
evidence against the null hypothesis (and hence that this substance worked as a
hypnotic). (See Activity 3.3 of Unit C1.) There was no particular reason to doubt
the appropriateness of a normal model for these data. However, the data can still
be analysed using the sign test. The hypotheses will need to be amended as
follows:

H0: m = 0,  H1: m > 0,

where m is the (population) median sleep gain.

What is the value of the sign test statistic for these data? Calculate the
corresponding p value and report your conclusion.
In Unit C1 you saw how the one-sample t-test could be used both to test for a
specified value of the population mean given a sample from a single population,
and to test for a zero mean difference between matched pairs. Working with
matched pairs merely involves calculating the differences between the paired
values, and then applying the one-sample t-test procedure to the resulting
differences, to test whether their underlying mean could be zero. The sign test
has been introduced here as a test for differences between matched pairs, but the
first step is to look at the differences between the paired values. Thus, if we had
a single sample of data and wanted to test whether the population median could
be zero, we could simply apply the sign test to the original data. Indeed we can
go further; we can test the null hypothesis that a single sample of data is drawn
from a population with a specified median simply by subtracting the specified
median from all the data values, and applying the test procedure to the resulting
numbers. Example 1.2 illustrates how this works. (For the sign test, it is not
strictly speaking necessary to calculate these differences, since all we need to
know is the sign of each difference, and that can be found simply by seeing which
of the paired values is the larger; but the principle remains true.)
64 Unit C2
The shape of the plot is decidedly curved, so there must be some doubt over the
appropriateness of the assumption of normality. Can an alternative analysis be
performed that avoids the normality assumption?
One alternative analysis is to use the sign test with the data in Table 1.3 to test
the null hypothesis that the population median m of width-to-length ratios of
Shoshoni rectangles is 0.618 against the alternative hypothesis that it is not 0.618:
H0 : m = 0.618, H1 : m ≠ 0.618.

(As usual with the sign test, the hypotheses refer to the population median
rather than the mean.)

The first step is to subtract the value hypothesized in the null hypothesis (that is,
0.618) from each value in the sample. Then omit any zero differences and count
how many of the resulting differences have + signs and how many have − signs.
You can do this if you wish, but it is quicker to note that a data value greater
than 0.618 will lead to a + sign, and one less than 0.618 will lead to a − sign.
Probabilities like this are usually rather tedious to calculate by hand; a computer
calculated this one to be 0.824. However, in this case the calculation is actually
reasonably straightforward if the relevant probability in the lower tail is
considered in a different way. The relevant lower tail of the B(20, ½) distribution
consists of values from 0 to 9 inclusive. A B(20, ½) random variable can take
integer values from 0 to 20 inclusive. So all its possible values except 10 are
included in one or other of the relevant tails of the distribution, and the required
p value is thus

p = P(X ≤ 9) + P(X ≥ 11)
  = 1 − P(X = 10)
  = 1 − (20!/(10! × 10!)) × (½)^20
  = 1 − 184756 × 1/1048576
  ≈ 1 − 0.176
  = 0.824.
Thus (however you did the calculation) the sign test provides little evidence that
the underlying median of width-to-length ratios of Shoshoni rectangles is different
from the hypothesized value of 0.618.
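The calculation above is easy to check numerically. Here is a minimal sketch in Python (using only the counts from this example: n = 20 non-zero differences, with the observed lower tail running from 0 to 9):

```python
from math import comb

# Two-sided sign test for H0: m = 0.618 with n = 20 non-zero differences.
# The observed tail is {0, ..., 9}; by symmetry of B(20, 1/2) the opposite
# tail is {11, ..., 20}, so the p value is everything except P(X = 10).
n = 20
p_value = 1 - comb(n, 10) * 0.5 ** n
print(round(p_value, 3))  # 0.824
```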
You may have found this result quite surprising. The p value for the t-test that
was carried out in Unit C1 (Example 3.1) was 0.0539, which could be interpreted
as providing some evidence (though rather weak) against the hypothesis that the
mean is 0.618. There are various possible reasons for the large difference in
p values between the two tests, among them the following. First, the two tests
were actually of different hypotheses: the t-test is a test for the population mean,
whereas the sign test is a test for the population median. In fact, the curved
shape of the probability plot in Figure 1.3 indicates that the data have a skew
distribution; therefore the mean will not be equal to the median, and it could thus
be the case that the population median is 0.618 while the population mean is
different from 0.618. Secondly, and much more likely to be important here, is the
following. Maybe the population median really does differ from 0.618, but the
sign test is simply not powerful enough in this case to detect the difference. In
other words, maybe a Type II error has occurred. In fact, in most contexts, the
sign test has relatively low power (and indeed further analysis of these data
provides an indication that this lack of power is the real problem here). ◆
The question of the power of the sign test is discussed further in the next
subsection. This subsection concludes with a summary of the procedure for
performing the sign test.
There are two important things to notice in Table 1.4. The first is that the
difference of zero for patient 2 has not been included in the ranking; it is ignored,
and the sample size is taken as 7 instead of 8, just as for the sign test. (As for
the sign test, there is an alternative procedure in which zeros are incorporated
into the analysis. Except for very small samples, this alternative approach does
not usually lead to substantially different conclusions.) The second is that, where
two differences have the same absolute value, an average rank is given. In the
table (after ignoring the zero difference), the two lowest absolute differences are
tied on 4. Since the two lowest ranks are 1 and 2, each of these two lowest
differences is allocated rank ½(1 + 2) = 1½. The same has happened where two
absolute differences are tied on 12. If they had had very slightly different values
instead of being equal, they would have had ranks 4 and 5, so each is given a
rank of ½(4 + 5) = 4½.
Now the signs are taken into account. The sum of the ranks for the positive
differences is

w+ = 1½ + 1½ + 4½ = 7½.

The sum of the ranks for the negative differences is

w− = 4½ + 7 + 3 + 6 = 20½.

If either of these sums were particularly large or particularly small, this would
provide evidence against the null hypothesis of zero median difference. In general,
the sum w+ + w− is equal to 1 + 2 + · · · + n = ½n(n + 1), where n is the sample
size (after excluding zeros). (The use of average ranks makes this true even where
there are ties. In this case, w+ + w− = 28; the sample size n is 7.) Thus, for a
given sample size, w+ is small exactly when w− is large, and vice versa. Thus we
can concentrate on just one of these quantities. The test statistic for the
Wilcoxon signed rank test is w+: under the null hypothesis of zero median
difference, values of w+ that are extremely small or extremely large will lead to
rejection of the null hypothesis (for a two-sided test).
The null distribution of the test statistic w+ is different for each value of n. It is
rather complicated, and is in general calculated using a complete enumeration of
cases. So to obtain the p value for a Wilcoxon signed rank test, a computer is
generally used. As for most of the tests you have met, the null distribution of the
test statistic is symmetric, so that the p value for a one-sided test is exactly half
that for a two-sided test.
The significance probability for a two-sided test with w+ = 7½ is 0.344. On this
analysis (as for the t-test and the sign test) there is little evidence of a non-zero
difference in corneal thickness. ◆ (This is not the same as the p value given by
MINITAB, which uses an approximation to calculate p values for the Wilcoxon
signed rank test.)
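The 'complete enumeration of cases' mentioned above can be carried out directly for a sample as small as this. The sketch below (Python; using the averaged ranks from Table 1.4) lists all 2⁷ equally likely sign patterns and recovers the exact two-sided p value:

```python
from itertools import product

# Ranks of the absolute differences from Table 1.4 (zero excluded, ties averaged)
ranks = [1.5, 1.5, 3, 4.5, 4.5, 6, 7]

# Under H0 each difference is equally likely to be + or -, so each of the
# 2^7 = 128 sign patterns has probability 1/128; w+ sums the '+' ranks.
w_plus = [sum(r for r, positive in zip(ranks, signs) if positive)
          for signs in product([True, False], repeat=len(ranks))]

observed = 7.5
total = sum(ranks)  # 28, so the opposite tail starts at 28 - 7.5 = 20.5
p = sum(1 for w in w_plus if w <= observed or w >= total - observed) / len(w_plus)
print(round(p, 3))  # 0.344
```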
In Subsection 1.1, you saw that the sign test can be used to test the null
hypothesis that a single sample of data is drawn from a population with a
specified median m0 . Similarly, the Wilcoxon signed rank test can be used to test
such a null hypothesis simply by subtracting the specified median from each data
value to obtain a set of differences. The procedure for performing the Wilcoxon
signed rank test for zero median difference is summarized in the following box.
In Activities 1.4 and 1.5, you are asked to use the Wilcoxon signed rank test to
investigate the Shoshoni rectangle data given in Example 1.2. In fact, there is a
snag in using the test with this data set. However, ignore that for now and
proceed with the test. (This snag is discussed later in the subsection.)
There is a sense in which the central limit theorem operates with the Wilcoxon
signed rank test statistic: provided that the number of differences is sufficiently
large, a normal approximation to the null distribution of the test statistic may be
used. This approximation is described in the following box.
Here the sample size is only 7; the approximate p value is noticeably different from
the exact value of 0.344 given in Example 1.3. However, 7 is a lot less than the
minimum sample size of 16 given in the ‘rule of thumb’ for adequacy of the normal
approximation, so it is not very surprising that the approximation is poor. ◆
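Using the moments E(W+) = n(n + 1)/4 and V(W+) = n(n + 1)(2n + 1)/24 (quoted in Solution 1.6), the normal approximation can be sketched as follows. Note that no continuity correction is applied here, so the value obtained may differ slightly from the one given by the course's boxed procedure:

```python
from math import sqrt, erf

def phi(z):
    # Standard normal CDF, written via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

n, w_plus = 7, 7.5  # corneal thickness example: 7 non-zero differences
mean = n * (n + 1) / 4                # 14.0
var = n * (n + 1) * (2 * n + 1) / 24  # 35.0
z = (w_plus - mean) / sqrt(var)
p_two_sided = 2 * phi(z)  # z < 0, so phi(z) is already the smaller tail
print(round(p_two_sided, 2))  # about 0.27
```

Compare this with the exact value of 0.344: with only 7 differences the approximation is indeed poor, as the text observes.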
Before leaving the Wilcoxon signed rank test, we should spend a little time
considering the assumptions behind it. The test is indeed nonparametric, in that
it does not involve an assumption that the data can be modelled by a particular
parametric family of distributions. But that does not mean that it does not
involve any assumptions about the population distribution. Its advantage over the
sign test is that it makes some use of the size of the differences, instead of just
using their signs. This, however, comes at a price. The null distribution of the
test statistic is found by assuming that an absolute difference with a particular
rank is just as likely to be associated with a positive difference as with a negative
one. Suppose, however, that the null hypothesis of the test (zero population
median difference) is true, but that the differences have a distribution that is not
symmetric about zero. For definiteness, suppose that the true distribution of
differences is right-skew, that is, that its upper tail (positive values) is more
spread out than its lower tail. Then, because we have assumed that the median
difference is zero, we would expect the number of positive differences to be about
the same as the number of negative differences, but on the whole the positive
differences would tend to be larger in absolute value than the negative differences.
In other words, a difference whose absolute value had a large rank would be more
likely to be positive than to be negative. If this were indeed the case, the null
distribution of the Wilcoxon test statistic would be wrong. Therefore, in order to
use information about the relative size of the differences and use the Wilcoxon
signed rank test, we must make an assumption that the differences can reasonably
be modelled by a symmetric distribution, at least under the null hypothesis. The
particular shape of the distribution does not matter at all, as long as it is
symmetric. If this is not the case, the test is not valid. In many circumstances,
particularly for differences in paired data, such an assumption of symmetry is
perfectly reasonable. However, this is not always so. Judging from the sample
values, it looks as if it may well be inappropriate to model the Shoshoni rectangles
data by a symmetric distribution, because the sample is quite heavily right-skew.
(Its sample skewness is 1.75.) Thus, it may be inappropriate to use the Wilcoxon
test with these data. One plausible explanation for the fact that both the
one-sample t-test and the Wilcoxon signed rank test provide some evidence
against the null hypothesis may be simply that both reflect the skewness of the
data, rather than the possibility that their mean or median is different from
0.618. (Further investigation, using the transformation methods you will meet in
Unit C3, indicates that the Wilcoxon test and the t-test do not provide
misleading information in this case, even though the sample data are skew. That
is, the data really do provide some weak evidence that the underlying mean or
median is not 0.618, and the fact that the sign test failed to detect this is simply
a consequence of its lack of power.)

Note that, since the Wilcoxon signed rank test involves an assumption of
symmetry, the care that was taken to express its null hypothesis as

H0 : m = 0

(where m is the population median) was actually misplaced. If the underlying
distribution is symmetric, then its median is equal to its mean, so we could have
written the null hypothesis as

H0 : µ = 0

(with corresponding forms for the alternative hypothesis), just as for the
one-sample t-test.
The Mann–Whitney test may be used to test the hypothesis that the distribution
of β-hydroxylase activity is the same for patients judged non-psychotic as for
patients judged psychotic. The data may be pooled and ranked as shown in
Table 1.6.
It is fairly clear from the table that the values in the A sample on the whole have
smaller ranks than the values in the B sample, though it is not utterly
straightforward to make this comparison because of the different sample sizes.
The sample sizes are nA = 15 and nB = 10; so nA + nB = 25. Summing the ranks
for each sample gives
uA = 1 + 2 + 3 + · · · + 19 + 21 = 140,
uB = 7 + 14 + 15 + · · · + 24 + 25 = 185.
Their sum is 140 + 185 = 325. Note also that

½(nA + nB)(nA + nB + 1) = ½ × 25 × 26 = 325.

This provides a useful check on your arithmetic if you are not using a computer.

The expected value of UA under the null hypothesis that the two samples are
from identical populations is

nA(nA + nB + 1)/2 = 15 × (15 + 10 + 1)/2 = 195.
The observed value uA = 140 is substantially smaller than this (in accord with
our observation that the A values tend to be smaller than the B values), but is it
significantly smaller? When there are ties in the data (as there are here), the null
distribution of UA can have a very complicated shape with many modes. Exact
computation of significance probabilities in such a context is quite difficult. A
computer gives the (two-sided) significance probability as p = 0.0015.
Alternatively, the variance of UA under the null hypothesis is

nA nB (nA + nB + 1)/12 = (15 × 10 × 26)/12 = 325.

For the observed value uA = 140, the corresponding z value is

z = (140 − 195)/√325 = −55/√325 ≈ −3.05.

So the approximate p value based on the normal approximation is

p = 2 × Φ(−3.05) = 0.0022 ≈ 0.002.

This is close to the exact p value.
The significance probability is very small, so there is strong evidence that the
distribution of dopamine β-hydroxylase activity is not the same in the two groups.
The enzyme activity in psychotic and non-psychotic patients appears to differ,
and, looking at the data, it would seem that those judged non-psychotic (the
population corresponding to sample A) have lower activity, on average. ◆
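The normal approximation used in this example takes only a few lines. A sketch in Python (variable names are ours; Φ is computed from the standard library error function):

```python
from math import sqrt, erf

def phi(z):
    # Standard normal CDF
    return 0.5 * (1 + erf(z / sqrt(2)))

n_a, n_b, u_a = 15, 10, 140  # sample sizes and observed rank sum for sample A

mean = n_a * (n_a + n_b + 1) / 2        # 195.0
var = n_a * n_b * (n_a + n_b + 1) / 12  # 325.0

z = (u_a - mean) / sqrt(var)
p_two_sided = 2 * phi(z)
print(round(z, 2), round(p_two_sided, 3))  # -3.05 0.002
```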
Summary of Section 1
In this section, the idea of a nonparametric (or distribution-free) test has been
introduced. You have learned how to perform the sign test and the Wilcoxon
signed rank test (which are both tests for the population median difference given
paired data or for the median of a single population). In most circumstances, the
Wilcoxon signed rank test is more powerful than the sign test, because it uses
more of the information in the data. You have also learned how to perform the
Mann–Whitney test for comparing the distributions of two populations given a
sample of data from each population. You have been introduced briefly to the
pros and cons of these tests compared with the corresponding t-tests, and you
have learned how to carry out the tests using MINITAB.
Exercise on Section 1
Exercise 1.1 Byzantine coins — nonparametric tests
Data are given in Table 1.8 on the silver content of coins from two different
coinages of the reign of Manuel I, Comnenus (1143–1180). The data are from
Hendy, M.F. and Charles, J.A. (1970) The production techniques, silver content
and circulation history of the twelfth-century Byzantine trachy. Archaeometry,
12, 13–21.

Table 1.8 Silver content of coins: first and fourth coinages (% Ag)

First coinage   5.9  6.8  6.4  7.0  6.6  7.7  7.2  6.9  6.2
Fourth coinage  5.3  5.6  5.5  5.1  6.2  5.8  5.8
It is not easy to tell from such small samples whether an assumption of normality
is appropriate, and there is certainly no strong evidence against their normality.
Nevertheless, in this exercise you are asked to carry out nonparametric tests using
these data.
(a) Suppose that an archaeologist wishes to investigate whether it is plausible
that the coins from the first coinage could come from a population where the
median silver content is 6.4%. Test this hypothesis using the sign test.
(b) Test the archaeologist’s hypothesis from part (a) using the Wilcoxon signed
rank test. Use the normal approximation to calculate the p value (even
though this will not be particularly accurate because of the small sample
size).
(c) Use the Mann–Whitney test to investigate whether it is plausible that the
silver contents of both of these sets of coins come from the same distribution.
Use the normal approximation to compute the p value.
When similar calculations are made for each emission frequency, the results are as
shown in the third and fourth columns of Table 2.2.
Table 2.2 Emissions of alpha particles: observed and expected
frequencies and differences between them, assuming a Poisson model
Count Observed frequency Expected frequency Difference
i Oi Ei Oi − Ei
0 57 54.10 2.90
1 203 209.75 −6.75
2 383 406.61 −23.61
3 525 525.47 −0.47
4 532 509.31 22.69
5 408 394.92 13.08
6 273 255.19 17.81
7 139 141.34 −2.34
8 49 68.50 −19.50
9 27 29.51 −2.51
10 10 11.44 −1.44
11 4 4.03 −0.03
12 2 1.30 0.70
> 12 0 0.53 −0.53
The solution to the problem identified in Example 2.1 is to scale the squared
differences by dividing by the expected frequency. Thus the scaled squared
differences (Oi − Ei)²/Ei are used. These are then added up to give the overall
measure of goodness of fit:

χ² = Σ (Oi − Ei)²/Ei.

(χ is the Greek letter chi and is pronounced 'kye'.) This statistic is the
chi-squared goodness-of-fit statistic. It was devised in 1900 by Karl Pearson.
Note that χ2 is zero when Oi = Ei for all categories, that is, when the observed
and expected frequencies are equal. Clearly, due to random variation, we would
not expect this to occur even when the model is correct. In order to assess what
values of χ2 are consistent or inconsistent with the model, the distribution of χ2 is
required. In fact, the goodness-of-fit test is based upon the approximate
distribution of χ2 under the null hypothesis that the model is correct. This
distribution is described in Subsection 2.2.
The definition of a chi-squared distribution and the results of Activity 2.1 are
summarized in the following box.
Notice that the p.d.f. of the random variable W ∼ χ2 (r) has not been given. This
is because it is quite complicated and is not useful for present purposes. For
instance, the p.d.f. does not yield an explicit formula for calculating probabilities
of the form FW (w) = P (W ≤ w). These probabilities need to be computed, or
deduced from tables.
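Because the p.d.f. is awkward, it can be reassuring to check tail probabilities by simulation instead: a χ²(r) random variable is, by definition, a sum of r squared independent standard normal variables (so it has mean r and variance 2r, as Solution 2.1 shows). A quick sketch in Python (the sample size and seed are arbitrary choices):

```python
import random

random.seed(1)
r = 5            # degrees of freedom
n = 100_000      # number of simulated values

# W = Z1^2 + ... + Zr^2, each Zi a standard normal variable
samples = [sum(random.gauss(0, 1) ** 2 for _ in range(r)) for _ in range(n)]

mean = sum(samples) / n
var = sum((w - mean) ** 2 for w in samples) / n
upper_tail = sum(1 for w in samples if w > 11.07) / n  # 11.07 is the 0.95-quantile

print(round(mean, 1), round(var, 1), round(upper_tail, 2))
```

The sample mean should come out close to r = 5, the sample variance close to 2r = 10, and the estimated tail probability close to 0.05, in agreement with Table 2.3.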
A different table would be required for each value of the degrees-of-freedom
parameter, so a comprehensive listing of tail probabilities would require many
pages. Most published tables contain only selected quantiles of the chi-squared
distribution for a range of values of the degrees of freedom. A table of quantiles of
chi-squared distributions is given in the Handbook. This table is used in a similar
way to the table of quantiles for t-distributions. Table 2.3 contains the 0.01-, 0.05-
and 0.95-quantiles of chi-squared distributions with degrees of freedom up to 10.

Table 2.3 Selected quantiles of chi-squared distributions

df    0.01    0.05    0.95
1     0.0001  0.0039  3.84
2     0.020   0.103   5.99
3     0.115   0.352   7.81
4     0.297   0.711   9.49
5     0.554   1.14    11.07
6     0.872   1.64    12.59
7     1.24    2.17    14.07
8     1.65    2.73    15.51
9     2.09    3.33    16.92
10    2.56    3.94    18.31

(df stands for 'degrees of freedom'.)

Example 2.3 Using the table

The 0.95-quantile of χ²(5) is the number in the row labelled 5 (df = 5) and in the
column headed 0.95, which is 11.07. Similarly, the 0.01-quantile of χ²(7) is 1.24,
and the 0.05-quantile of χ²(2) is 0.103. ◆

Activity 2.2 Tail probabilities for chi-squared distributions

Use the table of quantiles for chi-squared distributions in the Handbook to answer
the following.

(a) Find the 0.01-quantile of W, where W ∼ χ²(18).
(b) Find the value w such that P(W > w) = 0.05, where W ∼ χ²(12).
(c) Find the best possible lower bound and the best possible upper bound on
    P(W > 12.03), where W ∼ χ²(4).
Result (2.1) is presented without proof. You should merely note that it is a
consequence of the central limit theorem. But how good is the approximation?
Here is a simple rule to follow.
The expected frequencies were given in Table 2.2 (Ei = 2612θi ). The values of Ei
corresponding to counts of 11, 12 and > 12 are all less than 5, thus violating the
rule that each Ei must be at least 5 for the chi-squared approximation to be valid.
This problem is overcome by pooling (combining) categories until all values of Ei
are greater than or equal to 5. This is achieved by replacing categories 11, 12
and > 12 with a single > 10 category, which has expected frequency
4.03 + 1.30 + 0.53 = 5.86. The resulting frequencies and the corresponding values
of (Oi − Ei )2 /Ei are shown in Table 2.4.
Table 2.4 Calculating χ² for the data on emissions of alpha particles

i     Oi    Ei      Oi − Ei    (Oi − Ei)²/Ei
0 57 54.10 2.90 0.155
1 203 209.75 −6.75 0.217
2 383 406.61 −23.61 1.371
3 525 525.47 −0.47 0.000
4 532 509.31 22.69 1.011
5 408 394.92 13.08 0.433
6 273 255.19 17.81 1.243
7 139 141.34 −2.34 0.039
8 49 68.50 −19.50 5.551
9 27 29.51 −2.51 0.213
10 10 11.44 −1.44 0.181
> 10 6 5.86 0.14 0.003
There are now 12 categories (so k = 12), and the value of the goodness-of-fit test
statistic is

χ² = Σ (Oi − Ei)²/Ei = 0.155 + 0.217 + · · · + 0.003 = 10.417 ≈ 10.4.
Under the null hypothesis that the Poisson model is correct, the distribution of χ2
is approximately a chi-squared distribution with k − p − 1 = 12 − 1 − 1 = 10
degrees of freedom.
Remember that the chi-squared test statistic measures the extent to which
observed frequencies differ from those expected under the assumed model: the
higher the value of χ2 , the greater the discrepancy between the data and the
model. Thus the appropriate test is one-sided: only high values of χ2 indicate
that the model does not fit the data well. (It is possible to argue that low values
of χ2 suggest a fit so good that the data are suspect, showing less variation than
might be expected, and hence that a two-sided test is required. However, this is
not the approach adopted in M248.)

The upper tail of χ²(10) cut off at 10.4 is shown in Figure 2.2 (the null
distribution and the significance probability for the data on emissions of alpha
particles, assuming a Poisson model): its area is 0.406. Thus the significance
probability is 0.406. There is little evidence against the null hypothesis that the
assumed Poisson model is correct. ◆
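The whole calculation in this example can be reproduced in a few lines. The sketch below (Python; frequencies copied from Table 2.4) also computes the tail probability directly, using the fact that for even degrees of freedom the chi-squared upper tail has the closed form P(W > x) = e^(−x/2) Σ (x/2)^k/k!, summed for k = 0 to df/2 − 1:

```python
from math import exp, factorial

# Observed and expected frequencies from Table 2.4 (after pooling)
observed = [57, 203, 383, 525, 532, 408, 273, 139, 49, 27, 10, 6]
expected = [54.10, 209.75, 406.61, 525.47, 509.31, 394.92,
            255.19, 141.34, 68.50, 29.51, 11.44, 5.86]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 1))  # 10.4

def chi2_upper_tail(x, df):
    # Exact upper-tail probability of chi-squared, valid for even df only
    assert df % 2 == 0
    return exp(-x / 2) * sum((x / 2) ** k / factorial(k) for k in range(df // 2))

print(round(chi2_upper_tail(10.4, 10), 3))  # 0.406
```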
After this was done, and before performing the experiment, the screen was
sampled to check whether the colouring algorithm had operated successfully.
A total of 1000 larger squares, each containing sixteen of the small squares, was
randomly selected and i, the number of black squares in each larger square, was
counted. The observed frequencies Oi are shown in Table 2.5.
If the computer program worked as intended, then the observed values should be
consistent with 1000 observations from a binomial distribution, B(16, 0.29). This
hypothesis can be tested using a chi-squared goodness-of-fit test.
The expected frequencies Ei are also shown in Table 2.5. These were calculated as
follows. For i = 2, for example,

E2 = 1000θ2 = 1000 × 120 × (0.29)²(1 − 0.29)^14 ≈ 83.48.

Table 2.5 Counts on a random screen pattern

i     Oi    Ei
0     2     4.17
1     28    27.25
2     93    83.48
3     159   159.13
4     184   211.23
5     195   207.07
6     171   155.06
7     92    90.48
8     45    41.57
9     24    15.09
10    6     4.32
11    1     0.96
12    0     0.16
13    0     0.02
14    0     0.00
15    0     0.00
16    0     0.00

(The expected frequencies add up to 999.99 instead of 1000. This discrepancy,
which is due to rounding error, is not important and can be ignored.)

The first two and the last seven categories need to be pooled to obtain categories
with expected frequencies of 5 or more. The values obtained when this is done are
shown in Table 2.6. Values of (Oi − Ei)²/Ei are also shown.

Table 2.6 Counts on a random screen pattern:
calculating the chi-squared goodness-of-fit test statistic

Count i   Oi    Ei      Oi − Ei   (Oi − Ei)²/Ei
0 or 1    30    31.42   −1.42     0.064
2         93    83.48   9.52      1.086
3         159   159.13  −0.13     0.000
4         184   211.23  −27.23    3.510
5         195   207.07  −12.07    0.704
6         171   155.06  15.94     1.639
7         92    90.48   1.52      0.026
8         45    41.57   3.43      0.283
9         24    15.09   8.91      5.261
>9        7     5.46    1.54      0.434
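The expected frequencies and the pooled test statistic for this example can be sketched as follows (Python; the pooling mirrors Table 2.6, and small discrepancies with the rounded table entries are to be expected):

```python
from math import comb

n_squares, n_cells, p_black = 1000, 16, 0.29

# Binomial expected frequencies E_i = 1000 * C(16, i) * 0.29^i * 0.71^(16 - i)
E = [n_squares * comb(n_cells, i) * p_black ** i * (1 - p_black) ** (n_cells - i)
     for i in range(n_cells + 1)]
print(round(E[2], 2))  # 83.48

O = [2, 28, 93, 159, 184, 195, 171, 92, 45, 24, 6, 1, 0, 0, 0, 0, 0]

# Pool '0 or 1' and '>9' so that every expected frequency is at least 5
O_pooled = [O[0] + O[1]] + O[2:10] + [sum(O[10:])]
E_pooled = [E[0] + E[1]] + E[2:10] + [sum(E[10:])]

chi2 = sum((o - e) ** 2 / e for o, e in zip(O_pooled, E_pooled))
print(round(chi2, 2))  # about 13.0, matching the sum of the last column above
```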
Summary of Section 2
In this section, the chi-squared distribution has been introduced and the
chi-squared goodness-of-fit test has been described. You have learned how to use
this test to investigate the goodness of fit of discrete models.
Exercise on Section 2
Exercise 2.1 Diseased trees
The ecologist E.C. Pielou was interested in the pattern of healthy and diseased
trees in a plantation of Douglas firs. Several lines of trees were examined. The
lengths of unbroken runs of healthy and diseased trees were recorded. The
observations made on a total of 109 runs of diseased trees are given in Table 2.7.
The data are from Pielou, E.C. (1963) Runs of healthy and diseased trees in
transects through an infected forest. Biometrics, 19, 603–614.
Table 2.7 Run lengths of diseased trees
Run length 1 2 3 4 5 6
Number of runs 71 28 5 2 2 1
There were no runs of more than six diseased trees. Pielou proposed that the
geometric distribution might be a good model for these data, and from the data
estimated the geometric parameter p to be 0.657. (Here, p is the proportion of
healthy trees in the plantation.)
Investigate the goodness of fit of the geometric model.
Summary of Unit C2
In this unit, you have learned about certain nonparametric tests, and about a
method for testing the goodness of fit of a probability model for discrete data. A
nonparametric or distribution-free significance test is a statistical testing
procedure that does not involve making specific assumptions about the form of
the distribution of the population(s) involved. This means, for example, that such
procedures can be used in place of t-tests when the populations involved cannot
be assumed to have a normal distribution. You have met two tests whose null
hypothesis is that a single sample of data, which might actually arise as a set of
differences between pairs of data, comes from a population whose median has a
specified value. The sign test discards much of the information in the data, is thus
not very powerful and is not often used in practice. The Wilcoxon signed rank test
uses more of the information in the data and generally has reasonable power, but
it involves the assumption that the distribution from which the data were drawn is
symmetric. You have also met the Mann–Whitney test, which is used to compare
the distributions of the populations from which two samples of data were drawn.
A family of distributions, called chi-squared distributions, has been introduced.
These are indexed by a parameter r, known as the degrees of freedom. The
chi-squared goodness-of-fit test for discrete probability models involves calculating
expected frequencies for each possible value of the random variable involved (that
is, for each category), and producing a summary measure of how these differ from
the frequencies that were actually observed. Under the null hypothesis that the
proposed model fits the data, the distribution of this summary measure is
approximately a chi-squared distribution. This approximate result for the null
distribution of the chi-squared goodness-of-fit test statistic is only valid if the
expected frequencies for the categories are not too small: if any of the expected
frequencies are less than 5, then some of the categories should be pooled before
the value of the test statistic is calculated.
Learning outcomes
You have been working towards the following learning outcomes.
Ideas to be aware of
That there are statistical tests that do not involve making specific
distributional assumptions about the data.
That, in general, nonparametric tests may involve certain broad
distributional assumptions, such as an assumption of symmetry.
That the validity of a particular model for a given set of data can be tested
using a goodness-of-fit test.
Statistical skills
Perform the sign test, the Wilcoxon signed rank test and the Mann–Whitney
test.
Perform a chi-squared goodness-of-fit test for discrete data.
Use the table of quantiles for chi-squared distributions in the Handbook.
Solutions to Activities
Solution 1.1 There are nine + signs and one Solution 1.4 Table S.2 shows the results of
− sign. (There are no zeros to omit.) The test subtracting 0.618 from each entry in Table 1.3 and
statistic is the number of + signs, which is� 9. The
� null allocating ranks.
distribution of the number of + signs is B 10, 12 . The Table S.2
significance probability is the probability of obtaining
nine or more + signs, under the null hypothesis. This Original value 0.693 0.662 0.690 0.606 0.570
is Difference 0.075 0.044 0.072 −0.012 −0.048
�10 � � Sign + + + − −
10 � 1 �10 � �10 � �10
2 = 10 × 12 + 1 × 12 Rank 17 10 16 5 12 11
x
= Σ (10 choose x)(½)^10 (summed over x = 9, 10) = 11/1024 ≈ 0.011.
(This is the appropriate tail to consider, since under the alternative hypothesis you would
expect there to be many + signs. Since you are performing a one-sided test, there is no
requirement to consider the other tail and double the p value.) The p value is small, 0.011.
There is quite strong evidence against the null hypothesis. We conclude it is highly likely
that the median sleep gain is greater than zero.

Solution 1.3
(a) The Wilcoxon signed rank test statistic can be calculated using Table S.1.

Table S.1
Patient                                1    2    3    4    5    6    7    8    9   10
Sign of difference in sleep time       +    +    +    +    −    +    +    +    +    +
Absolute value of difference         1.9  0.8  1.1  0.1  0.1  4.4  5.5  1.6  4.6  3.4
Rank of absolute value of difference   6    3    4   1½   1½    8   10    5    9    7

The test statistic w+ is the sum of the ranks associated with the positive differences, so
w+ = 6 + 3 + 4 + 1½ + 8 + 10 + 5 + 9 + 7 = 53½.
Note that, in this case, it may be slightly quicker to work out the sum of the ranks for the
negative differences, w− = 1½, and to use the fact that the sum of all the ranks is
½n(n + 1) = ½ × 10 × 11 = 55, to give
w+ = 55 − 1½ = 53½.
(b) The p value is very small, indicating strong evidence against the null hypothesis. We
conclude that it is very likely that the median sleep gain is greater than zero.
This is in general terms the same conclusion as for the sign test. However, the p value for
the Wilcoxon signed rank test is even smaller than for the sign test (0.003 compared with
0.011), indicating that the Wilcoxon test provides even more evidence against the null
hypothesis than the sign test does. This is an illustration of the fact that, in many
circumstances, the Wilcoxon signed rank test is more powerful than the sign test.

Original value   0.749   0.672   0.628   0.609   0.844
Difference       0.131   0.054   0.010  −0.009   0.226
Sign                +       +       +       −       +
Rank               18      14       4       3      19

Original value   0.654   0.615   0.668   0.601   0.576
Difference       0.036  −0.003   0.050  −0.017  −0.042
Sign                +       −       +       −       −
Rank                8       1      12       7       9

Original value   0.670   0.606   0.611   0.553   0.933
Difference       0.052  −0.012  −0.007  −0.065   0.315
Sign                +       −       −       −       +
Rank               13      5½       2      15      20

There are no 0s and only two tied differences. The value of the test statistic w+ is the
sum of the ranks associated with positive differences. Thus
w+ = 17 + 10 + 16 + 18 + 14 + 4 + 19 + 8 + 12 + 13 + 20 = 151.

Solution 1.6  The sample size is 20 and there are no zero differences, so n = 20. Therefore
E(W+) = n(n + 1)/4 = (20 × 21)/4 = 105,
V(W+) = n(n + 1)(2n + 1)/24 = (20 × 21 × 41)/24 = 717.5.
The observed value of the test statistic is w+ = 151, so
z = (151 − 105)/√717.5 ≈ 1.72.
The table of probabilities for the standard normal distribution in the Handbook gives
P(Z ≤ 1.72) = Φ(1.72) = 0.9573 ≈ 0.957.
So there is a probability of 0.043 of being at least this far out into the (right-hand) tail
of the standard normal distribution. Since you are performing a two-sided test, you need to
consider the other tail as well. Thus the approximate p value is 2 × 0.043 = 0.086. This is
very close to the value given by the exact test. The p value of 0.086 provides weak evidence
that Shoshoni rectangles do not conform to the Greek golden ratio standard.
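The signed-rank arithmetic in Solution 1.3 is easy to check by machine. The course uses MINITAB; the following is only an illustrative sketch in Python, assuming numpy and scipy are available:

```python
import numpy as np
from scipy.stats import rankdata

# Sleep-gain differences for the 10 patients; patient 5 has the one
# negative difference (-0.1).
d = np.array([1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4])

ranks = rankdata(np.abs(d))      # average ranks for ties: the two 0.1s each get 1.5
w_plus = ranks[d > 0].sum()      # sum of ranks of the positive differences
print(w_plus)                    # 53.5

# Shortcut used in the solution: w+ = (total rank sum) - w-
n = len(d)
assert w_plus == n * (n + 1) / 2 - ranks[d < 0].sum()
```

scipy.stats.wilcoxon computes the same statistic, together with an exact or normal-approximation p value, in a single call.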
86 Unit C2
Solution 1.7  The appropriate test is the Mann–Whitney test. The ranks are given in Table S.3.

Table S.3
Pleasant   Rank   Unpleasant   Rank
memory            memory
1.07          1     1.45           5
1.17          2     1.67           7
1.22          3     1.90           8
1.42          4     2.02          10
1.63          6     2.32         12½
1.98          9     2.35          14
2.12         11     2.43          15
2.32        12½     2.47          16
2.56         17     2.57          18
2.70         19     3.33          25
2.93         20     3.87          27
2.97         21     4.33          28
3.03         22     5.35          31
3.15         23     5.72          33
3.22         24     6.48          35
3.42         26     6.90          36
4.63         29     8.68          37
4.70         30     9.47          38
5.55         32    10.00          39
6.17         34    10.93          40
Sums       345½                 474½

If the group of pleasant memory recall times is labelled A, then the test statistic uA is
345½. There are only two tied values, and the samples are not particularly small, so the
normal approximation should be adequate for calculating the p value.
E(UA) = nA(nA + nB + 1)/2 = (20 × 41)/2 = 410,
V(UA) = nAnB(nA + nB + 1)/12 = (20 × 20 × 41)/12 ≈ 1366.667.
The mean of UA under the null hypothesis is greater than the observed value, confirming the
impression given by the table that the A values tend to have smaller ranks than the B
values. The z value is
z = (345.5 − 410)/√1366.667 ≈ −1.74.
So the approximate p value is
2Φ(−1.74) = 2 × 0.0409 = 0.0818 ≈ 0.082.
Therefore there is some evidence against the null hypothesis that the distribution of recall
times is the same for pleasant and unpleasant memories, but the evidence is weak. Looking at
the data, it would seem that pleasant memories have shorter recall times, on average.

Solution 2.1  Since W is the sum of r independent observations Z1², Z2², . . . , Zr² on Z²,
E(W) = E(Z1²) + E(Z2²) + · · · + E(Zr²),
V(W) = V(Z1²) + V(Z2²) + · · · + V(Zr²).
So, since Z² has mean 1,
E(W) = 1 + 1 + · · · + 1 = r;
and since Z² has variance 2,
V(W) = 2 + 2 + · · · + 2 = 2r.

Solution 2.2  The following values were obtained using the table of quantiles of chi-squared
distributions in the Handbook.
(a) q0.01 = 7.01.
(b) The 0.95-quantile of χ²(12) is required, so w = 21.03.
(c) The value 12.03 lies between the 0.975-quantile and the 0.99-quantile of χ²(4). Thus
0.01 < P(W > 12.03) < 0.025.
(In fact, P(W > 12.03) = 0.0171.)

Solution 2.3
(a) The categories have expected frequencies given by
Ei = nθi = 290θi,  i = 1, 2, 3, 4,
where
θ1 = 9/16, θ2 = 3/16, θ3 = 3/16, θ4 = 1/16.
This leads to the values in Table S.4.

Table S.4  Pharbitis nil, simple theory
i    Oi      Ei      Oi − Ei   (Oi − Ei)²/Ei
1   187   163.125    23.875       3.494
2    35    54.375   −19.375       6.904
3    37    54.375   −17.375       5.552
4    31    18.125    12.875       9.146

The value of the chi-squared test statistic is
χ² = Σ (Oi − Ei)²/Ei = 3.494 + 6.904 + 5.552 + 9.146 = 25.096 ≈ 25.10.
There are four categories and no model parameters were estimated, so the null distribution
of the test statistic has 4 − 0 − 1 = 3 degrees of freedom. The value 25.10 is greater than
the 0.995-quantile of χ²(3), which is 12.84, so the significance probability is less than
0.005. (The actual value is about 0.000015.) This is very small, so there is strong evidence
that the simple theory is flawed.
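The goodness-of-fit computation in Solution 2.3(a) can be cross-checked numerically. A minimal sketch in Python (numpy and scipy assumed available; the course itself uses MINITAB):

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([187, 35, 37, 31])   # Pharbitis nil counts, n = 290
theta = np.array([9, 3, 3, 1]) / 16      # category probabilities under the simple theory
expected = observed.sum() * theta        # 163.125, 54.375, 54.375, 18.125

# No parameters were estimated, so df = 4 - 1 = 3 (scipy's default)
stat, p = chisquare(observed, f_exp=expected)
print(round(stat, 2))   # 25.1
```

The returned p value is about 0.000015, matching the value quoted in the solution.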
Solutions to Activities 87
(b) Allowing for genetic linkage, the expected frequencies are given by
Ei = nθi = 290θi,  i = 1, 2, 3, 4,
where
θ1 = 0.6209, θ2 = 0.1291, θ3 = 0.1291, θ4 = 0.1209.
This leads to the values in Table S.5.

Table S.5  Pharbitis nil, genetic linkage theory
i    Oi      Ei      Oi − Ei   (Oi − Ei)²/Ei
1   187   180.061     6.939       0.267
2    35    37.439    −2.439       0.159
3    37    37.439    −0.439       0.005
4    31    35.061    −4.061       0.470

The value of the chi-squared test statistic is
χ² = Σ (Oi − Ei)²/Ei = 0.267 + 0.159 + 0.005 + 0.470 = 0.901 ≈ 0.90.
The number of categories is again 4. However, one parameter has been estimated, so the null
distribution of the test statistic has 4 − 1 − 1 = 2 degrees of freedom. The 0.1-quantile of
χ²(2) is 0.211 and the 0.5-quantile is 1.39. The observed value 0.90 lies between these
quantiles, so the significance probability is between 0.5 and 0.9. (The actual value is
0.638.) Hence there is little evidence against the theory. The model appears to fit the
data well.
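The linkage-model test in part (b) differs only in the category probabilities and in the loss of one further degree of freedom for the estimated parameter. A Python sketch (scipy assumed; its ddof argument supplies the extra reduction):

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([187, 35, 37, 31])
theta = np.array([0.6209, 0.1291, 0.1291, 0.1209])   # linkage theory, one fitted parameter
expected = observed.sum() * theta

# ddof=1 reduces the degrees of freedom from 4 - 1 = 3 to 4 - 1 - 1 = 2
stat, p = chisquare(observed, f_exp=expected, ddof=1)
print(round(stat, 2), round(p, 2))   # 0.9 0.64
```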
Solutions to Exercises
Solution 1.1
(a) To find the sign test statistic, you simply need to examine the data, counting a + sign
for values greater than the hypothesized value of 6.4, a − sign for values below 6.4, and
omitting values that are equal to the hypothesized value. This gives six + signs and two −
signs; one value is omitted. Thus the sign test statistic is 6. In total there are eight +
and − signs, so the null distribution of the test statistic is B(8, ½). Thus the
significance probability is twice the probability of obtaining six or more + signs, under
the null hypothesis. This is
2 × Σ (8 choose x)(½)^8 (summed over x = 6, 7, 8)
  = 2 × [28 × (½)^8 + 8 × (½)^8 + 1 × (½)^8]
  = 2 × 37/256 ≈ 0.289.
This provides very little evidence against the null hypothesis that the underlying median
silver content is 6.4%.
(b) This time you must begin by calculating the differences between the data values and the
hypothesized value, 6.4. These differences, together with the appropriate ranks, are shown
in Table S.6.

Table S.6
Original value   5.9   6.8   6.4   7.0   6.6   7.7   7.2   6.9   6.2
Difference      −0.5   0.4    0    0.6   0.2   1.3   0.8   0.5  −0.2
Sign              −     +          +     +     +     +     +     −
Rank             4½     3          6    1½     8     7    4½    1½

There is one 0 (which is subsequently omitted) and two pairs of ties (given average ranks).
The test statistic w+ is the sum of the ranks associated with the positive differences, so
it is 30. The total number of positive and negative differences is n = 8. Thus the mean and
variance of the test statistic under the null distribution are given by
E(W+) = n(n + 1)/4 = (8 × 9)/4 = 18,
V(W+) = n(n + 1)(2n + 1)/24 = (8 × 9 × 17)/24 = 51.
This leads to a z value of
z = (30 − 18)/√51 ≈ 1.68.
So the significance probability is
p = 2Φ(−1.68) = 2 × 0.0465 = 0.093.
This provides weak evidence against the null hypothesis that the underlying median silver
content of coins from the first coinage is 6.4%.
(A computer program that can calculate exact p values for Wilcoxon tests gave 0.117 for the
significance probability. Thus the normal approximation is not very good in this case.)
(c) The ranks are given in Table S.7.

Table S.7
First     Rank   Fourth    Rank
coinage          coinage
5.9          7   5.1          1
6.2         8½   5.3          2
6.4         10   5.5          3
6.6         11   5.6          4
6.8         12   5.8         5½
6.9         13   5.8         5½
7.0         14   6.2         8½
7.2         15
7.7         16
Sums      106½              29½

If the group of silver contents of coins from the first coinage is labelled A, then the test
statistic uA is 106½.
E(UA) = nA(nA + nB + 1)/2 = (9 × 17)/2 = 76.5,
V(UA) = nAnB(nA + nB + 1)/12 = (9 × 7 × 17)/12 = 89.25.
The mean of UA under the null hypothesis is less than the observed value, confirming the
impression given by the table that the A values tend to have larger ranks than the B values.
The z value is
z = (106.5 − 76.5)/√89.25 ≈ 3.18.
So the approximate p value for the two-sided test is
p = 2Φ(−3.18) = 0.0014 ≈ 0.001.
Therefore there is strong evidence against the null hypothesis that the distribution of
silver contents is the same for the two coinages. The silver contents appear to differ, and
looking at the data, it would seem that the first coinage contains more silver, on average.
(Note that the normal approximation is quite good in this case: a computer program that can
calculate exact p values for Mann–Whitney tests gave 0.00149.)
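The normal approximation in part (c) can be reproduced in a few lines of Python (numpy and scipy assumed available). The rank-sum statistic uA used in the course differs from the U statistic reported by scipy.stats.mannwhitneyu by a constant, so the ranks are summed directly here:

```python
import numpy as np
from scipy.stats import rankdata, norm

first  = np.array([5.9, 6.2, 6.4, 6.6, 6.8, 6.9, 7.0, 7.2, 7.7])   # group A
fourth = np.array([5.1, 5.3, 5.5, 5.6, 5.8, 5.8, 6.2])             # group B

ranks = rankdata(np.concatenate([first, fourth]))   # ranks 1..16, ties averaged
u_a = ranks[:len(first)].sum()                      # 106.5

n_a, n_b = len(first), len(fourth)
mean = n_a * (n_a + n_b + 1) / 2                    # 76.5
var = n_a * n_b * (n_a + n_b + 1) / 12              # 89.25
z = (u_a - mean) / np.sqrt(var)                     # about 3.18
p = 2 * norm.cdf(-abs(z))                           # about 0.0015
```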
Solution 2.1  For the geometric model, the probability of a run of length i
(i = 1, 2, . . . , 6) is
θi = (1 − p)^(i−1) p.
The observation that no run was greater than 6 must be accounted for by a category, with
probability
θ7 = P(X > 6) = (1 − p)^6.
The geometric parameter p has been estimated from the data to be 0.657. The expected
frequencies under the geometric model are 109θi; these are shown in Table S.8.

Table S.8  Observed and expected frequencies
Run length   Oi     Ei
1            71   71.61
2            28   24.56
3             5    8.43
4             2    2.89
5             2    0.99
6             1    0.34
>6            0    0.18

To ensure that all expected values are at least 5, runs of length 3 or more are pooled. The
calculations are set out in Table S.9.

Table S.9  Testing the geometric model
Run length   Oi     Ei     Oi − Ei   (Oi − Ei)²/Ei
1            71   71.61    −0.61        0.005
2            28   24.56     3.44        0.482
≥3           10   12.83    −2.83        0.624

The value of the chi-squared test statistic is
χ² = Σ (Oi − Ei)²/Ei = 0.005 + 0.482 + 0.624 = 1.111 ≈ 1.11.
There are three categories and one parameter has been estimated from the data (the geometric
parameter p = 0.657), so the null distribution of the test statistic has 3 − 1 − 1 = 1
degree of freedom. The observed value 1.11 lies between the 0.5-quantile and the
0.9-quantile of χ²(1), so the significance probability is greater than 0.1. (The actual
value is 0.292.) There is therefore little evidence against the geometric model. This
confirms that Pielou's assumptions were reasonable.
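The expected frequencies and the pooling step in Solution 2.1 can be scripted as follows (an illustrative sketch in Python with numpy assumed; the estimate p = 0.657 is taken from the solution):

```python
import numpy as np

p, n = 0.657, 109
i = np.arange(1, 7)
probs = (1 - p) ** (i - 1) * p             # P(run length = i), i = 1, ..., 6
probs = np.append(probs, (1 - p) ** 6)     # P(run length > 6)
expected = n * probs                       # 71.61, 24.56, 8.43, ..., 0.18

# Pool run lengths of 3 or more so that every expected count is at least 5
pooled_exp = np.array([expected[0], expected[1], expected[2:].sum()])
observed = np.array([71, 28, 10])
stat = ((observed - pooled_exp) ** 2 / pooled_exp).sum()
print(round(stat, 2))   # 1.11
```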
UNIT C3
Introduction
You have now met some of the most important ideas of statistics: you can
summarize the key features of a data set and can represent it graphically in
different ways; you have seen that variability in a population can be represented
by a probability distribution; with a few assumptions, you are able to use a
variety of distributions to represent both discrete and continuous random
variables; and you can use data to answer practical questions.
Each unit so far has dealt with a particular topic or method of statistics, which
has been illustrated by examples. In contrast, the focus of this unit is on the
statistical modelling process. You might think of the techniques you have
encountered as statistical tools. Having assembled your toolbox, the aim now is to
work out how to use it when confronted with a statistical problem.
Most statistical investigations begin with a practical problem. For
example, a medical researcher might want to know whether or not a treatment for
cancer works; an engineer might wish to estimate the tensile strength of a
particular material; a social scientist might seek to understand what factors
influence school performance; an economist might wish to predict future inflation
rates. In a statistical investigation, the problem is formulated in statistical terms,
appropriate data are collected and analysed, and the conclusions are summarized
in a statistical report. The journey from practical problem to statistical report is
best thought of as a research process, which can be represented by the flow chart
in Figure 0.1. Note that, in practice, the various stages of the modelling process
might arise in a slightly different order from that in Figure 0.1. For example, it is
sometimes more convenient to check assumptions after the model has been fitted.
Figure 0.1 The modelling process
Formulating the right questions, and designing studies to answer them, are
important statistical issues, though they are not dealt with in this course.
Typically, they involve collaborations with specialists in other
disciplines — medical doctors, engineers, social scientists, economists, and so
on — and usually require some knowledge of the particular application area.
However, this unit focuses on the issues of statistical modelling and reporting,
which are similar in all application areas. The starting point is therefore the
problem or question under consideration, together with the data that have been
collected to throw light on it. Thus, in terms of the flow chart in Figure 0.1, you
will begin at the box marked ‘Choose model’.
What is a model? In general terms, it is a simplified representation of the process
generating the data. The key component of a statistical model is the underlying
distribution from which the data are sampled; but the model might also include
other components — for example, transformations or other relationships (known
or presumed) between variables. However, the terms ‘distribution’ and ‘model’ are
used interchangeably in much of this unit. (Transformations are discussed in
Subsection 2.3.)
A suitable model to start with is one that reflects the important attributes of the
data. For example, if the data consist of measurements on a continuous variable,
then it makes sense to choose a continuous distribution to represent the
underlying variation. Also, the question to be answered will often suggest how the
model will be used — for example, to calculate a confidence interval or carry out a
t-test. Having chosen a model, you will need to check that it fits the data, and
that any assumptions required are satisfied. If either is not the case, then you will
need to alter the model in some way, or perhaps even try a completely different
one. You will then need to repeat the process, improving the model at each stage
until it is good enough for its purpose. Having chosen a model, the final stage of
the modelling process is to report your results.
Statistical modelling is an art as much as a science, requiring common sense and
judgement and, just occasionally, a little inspiration. It is as well to remember
that statistical models are at best idealizations of reality: you should not expect
to find a ‘perfect’ model. The real skill is in finding a model that is good enough
for your purposes, and from which you can draw valid conclusions.
In Section 1, some hints are given on how to get started, even before looking at
the data. The emphasis is on developing some a priori ideas about how you might
approach the analysis and the choice of model, knowing only the context of the
problem and the type of data collected. In Section 2, methods for exploring the
data are discussed with a view to choosing a model, transforming the data, and
checking the model. In Section 3, you will practise undertaking a complete
analysis using a variety of tools; the section consists of a chapter of Computer
Book C. Writing the statistical report is discussed in Section 4.
For the purposes of statistical modelling, the first major distinction to be drawn is
whether the data should be modelled as discrete or continuous. In each of these
four examples, the choice is reasonably clear. For example, in Example 1.1 the
data are fish counts, so take the values 0, 1, 2, 3, . . .. Thus even before seeing any
data you may conclude that a first choice of model should be one suited to
discrete data: this narrows down your choice to the Bernoulli, binomial, discrete
uniform, geometric and Poisson models. In contrast, the data in Example 1.2 are
sizes, measured in millimetres. Thus a suitable first model for these data should
be a continuous distribution. Possible models include exponential, continuous
uniform and normal distributions.
Example 1.5 illustrates the point that, even for the apparently simple matter of
deciding on a discrete or continuous model, clear rules can be difficult to
specify. When the error involved in treating a continuous variable as discrete, or
vice versa, is negligible, then it may not matter which choice is made. In this case,
the choice might reasonably be made on the grounds of convenience, and how well
the proposed model fits the data.
Having narrowed the field to either discrete models or continuous models, the
next step is to choose which of the models in each of these categories is most
likely to be suitable. How to do this is discussed in Subsections 1.2 and 1.3.
These models are not guaranteed to work — in particular, you will need to check
the assumptions required in each case. However, they often provide a convenient
starting point.
Clearly, not all problems conform to the standard settings just described.
Thankfully, however, the four discrete distributions apply much more widely than
the settings might suggest. On the other hand, in some circumstances none of
them will do; this is illustrated in Example 1.7.
I therefore concluded that, for this problem, there was no compelling reason to
opt for any of the standard models, though perhaps the least ‘bad’ choice might
be to opt for the discrete uniform distribution.
In some cases no distribution fits all the requirements. Even then, all may not be
lost: it is important to remember that the purpose of statistical modelling is not
to find a perfect model, but a ‘good enough’ model. Provided that the model does
not fail in some key respect, it might still be useful.
You might have noticed that the standard setting for both the discrete uniform
distribution and the continuous uniform distribution includes the situation in
which there is no reason to believe that any value is special. Thus the uniform
distribution is sometimes used to represent lack of knowledge, though of course
such knowledge might be acquired later through experimentation.
Note also that the standard setting for the normal distribution is rather less
specific than for the exponential distribution or the uniform distribution. In fact,
there is no really natural model for the normal distribution. This does not mean,
of course, that the normal distribution is not useful. On the contrary, it is the
most commonly used continuous model.
In Subsection 1.2, you saw that the shape of the distribution is a key factor in
choosing a discrete model; it is also a key factor in choosing a continuous model.
Example 1.10 serves to illustrate the important point that models are at best
approximate representations of reality. The aim is to formulate a reasonable
model, not a perfect one.
In this section, the importance of thinking about the context or setting in which
the data are collected has been emphasized, in order to guide the choice of model.
This is certainly useful but does not guarantee a unique answer. In some settings
there may be two, or many more, reasonable candidates.
Before ending this section, it is worth emphasizing that there are many more
distributions than the ten covered in detail in this course, though these include
some of the most important ones in statistics. MINITAB provides several other
distributions, which you might care to explore (this is entirely optional); and
there are others beside these. However, the approach to selecting an appropriate
distribution is much the same whatever the collection of distributions to which
you have access.
Summary of Section 1
In this section, some basic principles for thinking about data and models, even
before looking at the data, have been reviewed. The key issues are whether the
data are discrete or continuous, whether the setting in which the data were
collected conforms to any of the standard settings, and what is the likely shape of
the distribution. These basic principles can help to formulate a starting point for
choosing a model, which can be revised in the light of the data.
Exercises on Section 1
Exercise 1.1 Car production
A car manufacturer monitors its production by means of daily counts of the cars
it produces. Would you model the output as a discrete or a continuous variable?
What other information might you require before deciding on an answer?
The key idea in Example 2.1 is the rather obvious one that the ‘shape’ of the data
should be consistent with the distribution you wish to fit. You met this idea in
Block A.
Activity 2.2 will give you some practice at using the shape of a bar chart to decide
whether various distributions are good candidate models for the data.
The bar chart in Figure 2.4 of the data on used windows shows that the uniform
model, which was tentatively suggested in Examples 1.7 and 1.8, is unsuitable. A
geometric model would appear to be better. This illustrates the point that
statistical modelling often requires you to alter the model as more features of the
data become apparent. It is important, therefore, to keep an open mind, but also
to concentrate on the important aspects of the data. In this case, getting the
shape right is probably more important than abiding by constraints about the
range of the data.
Similar considerations apply to continuous data. A good starting point is always a
graphical display of the key variable or variables.
Histograms or bar charts are essential for displaying the main features of a data
set, and for suggesting a modelling approach. When dealing with certain types of
continuous data, further checks are possible using probability plots; you met these
in Blocks A and B. Recall that a probability plot for a data set consisting of
ordered observations x(1) , . . . , x(n) is a plot of these observations against scores
y1, . . . , yn derived from the proposed distribution. These ‘scores’ are certain
quantiles of the proposed distribution. In general terms, if the data are randomly
sampled from the proposed distribution, then the points on the probability plot
should lie close to a straight line.
Figure 2.6 Treatment times for hospital patients
In Block A you met exponential probability plots, in which the scores are derived
assuming an exponential distribution. For such data, the points on the probability
plot should lie close to a line passing through the origin. In Block B you met the
corresponding plot for an assumed normal distribution. In this case, the points on
the probability plot should also lie close to a line but this line does not have to
pass through the origin. Probability plots provide a useful check on model
assumptions, and are more sensitive to departures from these assumptions than
histograms.
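The scores just described can be computed directly. The following sketch uses Python rather than MINITAB; the plotting position (i − ½)/n is one common convention (MINITAB's default formula differs slightly):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = np.sort(rng.normal(loc=10, scale=2, size=50))   # ordered observations x(1), ..., x(n)

n = len(data)
scores = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)   # normal quantile scores y1, ..., yn

# For normally distributed data the points (score, observation) lie close to a
# line with slope roughly sigma and intercept roughly mu; a correlation near 1
# indicates the plot is close to linear.
r = np.corrcoef(scores, data)[0, 1]
print(round(r, 3))
```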
Figure 2.7 An exponential probability plot for the interspike intervals data
For example, if a Poisson model is to be fitted, then you might expect the sample
mean and the sample variance to be similar, at least in large samples. Similarly, if
an exponential model is to be considered, the sample mean and the sample
standard deviation should be similar. Note, however, the necessary emphasis on
‘similar’ here: exact equality is most unlikely and, especially in small samples,
there might be substantial discrepancies due to random fluctuations. Nevertheless,
examination of the mean–variance relationship for the data can provide a clue as
to possible problems with fitting either a Poisson model or an exponential model.
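This mean–variance check is quick to carry out. A sketch in Python (numpy assumed), using simulated data purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.poisson(lam=4.0, size=500)          # Poisson: mean and variance both equal 4
print(x.mean(), x.var(ddof=1))              # the two sample values should be similar

y = rng.exponential(scale=3.0, size=500)    # exponential: mean and sd both equal 3
print(y.mean(), y.std(ddof=1))              # again similar, up to sampling error
```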
Figure 2.8 A normal probability plot for a sample of size 50 from U (0, 1)
It is clear that the probability plot has a wavy shape which differs in a systematic
way from a straight line, even though no individual point lies very far from the
line drawn in the figure.
The systematic pattern in Figure 2.8 provides clear evidence that the data are not
sampled from a normal distribution. In fact, the wavy pattern is characteristic of
a symmetric distribution with tails that are too ‘light’ compared with a normal
distribution. For U (0, 1), there are no values less than 0 or greater than 1.
Similarly, if the underlying distribution is either right-skew or left-skew, this will
induce a systematic non-linear pattern in a normal probability plot of the data.
Deducing the shape of the distribution from its normal probability plot is not an
easy task, and is seldom attempted. The key point is that the presence of a
systematic non-linear pattern in a normal probability plot is evidence that the
data are not sampled from a normal distribution. If a probability plot indicates
that the normality assumption is invalid, you should look at a histogram of the
data to identify in what respects the normality assumption fails.
A further problem in interpreting probability plots is that individual points may
lie far from the straight line, not because the underlying distribution is not
normal, but because of chance effects. This is frequently the case for points at the
extremities of the plots. It would be useful to quantify the degree of variation that
might be expected to occur by chance. This can be done when using MINITAB to
produce a probability plot: confidence bands are drawn either side of the straight
line. These are explained in the computer book.
Refer to Chapter 7 of Computer Book C for the rest of the work in
this subsection.
The histogram in Figure 2.10(a) looks as though the data might be normally
distributed, but those in Figures 2.10(b), (c) and (d) are progressively more
skewed and the distributions appear to be far from normal. It may surprise you to
learn that the same sample of data was used for all four histograms. A computer
was used to generate a sample of size 300 from a normal distribution; these data
are represented by the histogram in Figure 2.10(a). Suppose that a typical value
in this data set is denoted x. Then the data points used for Figure 2.10(b) are the
values x²; the data points used for Figure 2.10(c) are the values e^(2(x−1)); and the
data points used for Figure 2.10(d) are the values 1/(251x).
Since the data represented in Figure 2.10(a) are normally distributed, they could
be used to carry out a t-test, for example. However, it would not be legitimate to
carry out a t-test using the data in any of Figures 2.10(b), (c) or (d) because the
variation is far from normal. Suppose now that data resembling those in
Figures 2.10(b), (c) or (d) were to arise in practice. It would clearly be in order to
transform them. For example, if the data looked like those in Figure 2.10(b), then
the correct procedure would be to take the square root of each value. This
transformation would result in the data represented in Figure 2.10(a). It would
then be appropriate to carry out t-tests on the transformed data, based on the
assumption of normality.
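The effect described in this example is easy to reproduce. A sketch (numpy and scipy assumed; the location 5 keeps the simulated normal values positive, so squaring can be undone exactly by the square root):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(loc=5, scale=1, size=300)   # roughly normal, as in Figure 2.10(a)
b = x ** 2                                 # right-skewed, as in Figure 2.10(b)

# Squaring introduces positive skewness; the square-root transformation
# removes it again, recovering the original sample.
print(stats.skew(x), stats.skew(b))
assert np.allclose(np.sqrt(b), x)
```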
Figure 2.11 Normal probability plots for precipitation data: (a) untransformed (b) log transformed (c) cube root
transformed
The probability plot for the untransformed data displays a systematic pattern
indicating non-normality. The log transformation results in a much straighter
plot. An even straighter probability plot is obtained using the cube root
transformation.
The data are highly skewed, indicating that a normal model is not appropriate
either. Figure 2.13(b) shows a histogram of the data after a log transformation,
and Figure 2.13(c) represents the data after a square root transformation.
Comment on the effect of the two transformations. In your view, which
transformation has produced the more symmetrical result?
It is perhaps worth pointing out that it is not always necessary to obtain a more
symmetrical distribution! For example, suppose that the aim of the analysis of
the interspike intervals data is to calculate the mean interspike interval, together
with a 95% confidence interval for the mean. Then, since the sample size is
sufficiently large — there are 100 observations — the distribution of the sample
mean is approximately normal, even though the underlying distribution is skew.
So large-sample methods can be used to find an approximate 95% confidence
interval for the mean. Therefore it is not necessary to transform the data to find a
confidence interval for the mean. There may, of course, be other reasons to
transform the data.
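The large-sample interval described here needs only the sample mean and standard deviation. A sketch in Python (numpy and scipy assumed), with simulated skewed data standing in for the 100 interspike intervals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
intervals = rng.exponential(scale=10.0, size=100)   # skewed, like the interspike data

mean = intervals.mean()
se = intervals.std(ddof=1) / np.sqrt(len(intervals))
z = stats.norm.ppf(0.975)                           # 1.96 for a 95% interval
ci = (mean - z * se, mean + z * se)
print(ci)   # approximate 95% confidence interval for the mean, justified by the CLT
```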
Most determinations suggest an age of between 2400 and 2600 years. However,
sample C-367 indicates an age of 3433 years, which is quite out of step with the
other values: this sample is a clear outlier.
If at all possible, when outliers are present, your first step should be to check that
they are not the result of recording, coding or data entry errors. Such errors are
very common. It is well worth repeating here that, if you enter your own data,
you should always check your computer data file against the original. Not all data
entry errors will necessarily appear as outliers!
The study of outliers and how to treat them can be rather complex; so only a
little general guidance will be given in this course. Broadly speaking, the
treatment of outliers depends upon how many appear in the data, what effect they
have on the conclusions, and how far you are prepared to go in believing that you
have been unlucky enough to obtain a few ‘atypical’ values, rather than believing
that the distributional assumptions are not viable. This last point is important:
the outliers might just reflect the fact that you have chosen the ‘wrong’ model.
The effect of model choice on outliers is illustrated in Example 2.8.
Figure 2.14 An exponential probability plot for the data on treatment times
The probability plot is not straight. In particular, there are four outliers in the
top right-hand corner of the plot, corresponding to the four longest treatment times.
Figure 2.15 Transformed data: (a) histogram (b) normal probability plot
There are no hard and fast rules to decide how many data values may be deleted
in order to salvage a particular modelling assumption. In practice, it is best to
remove no more than one or two values.
If in doubt, an alternative is to keep all the values and revert to a distribution-free
method. Using ranks instead of data values loses information about how far apart
the values are but, on the other hand, it removes sensitivity to abnormally large
or abnormally small values. If decisions about which method to use seem unduly
vague, you should remember that there is not always a definitely right or wrong
way of performing a statistical analysis. All you can do is use your common sense.
Summary of Section 2
In this section, techniques for exploring data, choosing a model and checking the
model have been discussed. Graphical and other methods for guiding model
choice, based on features such as the shape of the data, its range and other
properties such as mean–variance relationships have been reviewed. The
interpretation of probability plots has been discussed and relevant features of
MINITAB explored. Transformations of continuous variables have been
considered, and the ladder of powers introduced. The handling of outliers has
been discussed briefly.
Summary of Section 3
In this section, you have undertaken extended analysis of a data set using
MINITAB, starting from a scientific question, and progressing through the various
stages of exploratory analysis, model and method choice, model checking, and
performing the relevant statistical calculations.
This structure is reasonably standard, though some authors might use different
section headings — for example, Background instead of Introduction, Conclusion
instead of Discussion, and so on.
The Summary should be completely self-contained. It should state briefly the aim
of the analysis, the method used, the key finding or findings, and the
interpretation. It is usually written last, and should use largely non-technical
language. The ‘largely’ in the previous sentence is a reflection of the fact that it is
often simply not possible to provide an accurate summary of results without using
some statistical terminology or referring to some statistical concepts. It is far
better to give a slightly technical, but correct, summary than one apparently
easily understood by all, but potentially misleading.
The Introduction should contain a brief description of the problem or hypothesis
to be investigated, the setting in which the data were collected, and the data
available. Note that, in this course, the starting point is always a problem and
some data relevant to that problem.
The Methods section should include a description of the model, the procedures
used to check the model, the statistical tests employed, the method used for
calculating confidence intervals, and any other relevant techniques you have used,
such as data transformations. The key guide to this section is to include enough
detail to allow other statisticians to evaluate your method and, given the same
data, to repeat your investigation. You should not include all the blind
alleys and dead ends you travelled (we all travel them) before settling on your
preferred solution. However, if you found two equally plausible models that give
appreciably different results, then you should include both.
The Results section should contain descriptive summaries of your data (for
example, graphical and numerical summaries), evidence that your model is
appropriate and, finally, the numerical results of statistical tests or confidence
interval calculations. It is important to remember that this section, like all the
others, should be written in prose: a collection of numbers and graphs is not sufficient.
The Discussion should contain your own assessment of the statistical evidence
relating to the original question or hypothesis. In particular, you should discuss
any evidence of lack of fit of your model, any problems with the data (for
example, outliers), or any other matter that might have a bearing on the
interpretation of the results.
There is no set order in which to write the sections of a report but you should
present the sections in the order just described. Many readers will not read all the
sections — for example, many will read only the Introduction and Discussion — so
it is important to structure your report so that they can find the sections they are
interested in quickly. In some sense the Results section forms the heart of the
report. The Methods section is organized in such a way as to explain how you
obtained your results, while the Discussion is your interpretation of the results.
Some authors prefer to write the Results section first, followed by the Methods
section. You should use whatever order you feel most comfortable with. In any
case, you will probably find yourself going back over previous sections to make
sure everything fits together in a coherent whole.
Finally, one important general rule: the shorter, the better. If you can describe
something accurately in one sentence rather than two, then so much the better!
(But, of course, two short sentences are better than one long rambling sentence.)
Figure 4.1 Head shape differences: (a) histogram (b) normal probability plot
Section 4 113
The analytic results include those that directly address the original question set
out in the Introduction. The original question relates to the difference between the
head shapes of first and second sons. Thus you should report the mean difference
and the 95% t-interval. In addition, you should report the result of the paired
t-test. Finally, you need to provide some evidence that your methods are justified.
In this case, a probability plot was used to test the normality assumption. It is
not essential to show this plot. To save space it is quite reasonable simply to state
that you used this method to check the assumption.
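The normality check described here was done with a MINITAB probability plot. Purely as an illustrative sketch, the same kind of check can be run in Python on simulated stand-in data (not the head-shape differences):

```python
# Sketch: checking normality with a normal probability plot, as in the
# analysis above. Data are simulated stand-ins, not the head-shape data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
diffs = rng.normal(loc=0.2, scale=3.8, size=25)  # hypothetical differences

# probplot pairs the ordered data with normal quantiles and fits a
# least-squares line; r is the correlation of the plotted points
(osm, osr), (slope, intercept, r) = stats.probplot(diffs, dist="norm")

# Points close to a straight line (r near 1) support the normal model
print(f"probability-plot correlation r = {r:.3f}")
```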
Results
The distribution of the 25 differences between the head shape indices of first
and second sons is shown in the histogram below. The data were
approximately normally distributed, as confirmed by a probability plot.
The mean difference in head shape indices was 0.19, with 95% t-interval
(−1.35, 1.72). A paired t-test of the hypothesis of zero mean difference gave
t = 0.25 on 24 degrees of freedom, p = 0.80.
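The quantities in the Results box (mean difference, 95% t-interval, paired t-test) were obtained with MINITAB. As a rough sketch of the same calculations, here is some Python using hypothetical differences, not the 25 head-shape differences:

```python
# Sketch of the paired-t calculations reported above, using hypothetical
# differences (NOT the head-shape data, which were analysed in MINITAB).
import numpy as np
from scipy import stats

d = np.array([1.2, -0.8, 0.5, 2.1, -1.4, 0.3, 0.9, -0.2, 1.7, -0.6])
n = d.size
mean_diff = d.mean()
se = d.std(ddof=1) / np.sqrt(n)

# 95% t-interval for the mean difference (n - 1 degrees of freedom)
lo, hi = stats.t.interval(0.95, n - 1, loc=mean_diff, scale=se)

# A paired t-test is a one-sample t-test applied to the differences
t_stat, p_value = stats.ttest_1samp(d, popmean=0.0)

print(f"mean = {mean_diff:.2f}, 95% t-interval = ({lo:.2f}, {hi:.2f})")
print(f"t = {t_stat:.2f} on {n - 1} df, p = {p_value:.2f}")
```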
The next section is the Discussion section, in which you give your interpretation
of the results in the light of the original question. This is also the place where you
should comment on the possible impact of any other factors (such as missing data
or outliers) on the interpretation. In this example there are no such factors. The
section can thus be suitably brief: there is no evidence of a difference. However,
it is worth qualifying this conclusion by reminding the reader that the sample size
was rather small. (In general, it is important to write concisely and to the point.)
Discussion
We conclude that there is little evidence against the hypothesis of no
difference between the head shape indices of first and second sons. However,
the sample size for this study was only 25.
Finally, having assembled and re-read the report, you can now write the
Summary. This states briefly the purpose of the analysis, the method used, the
key finding and its interpretation. It should be largely non-technical.
Summary
The aim of this analysis was to compare head shapes of first and second sons,
using a shape index based on the ratio of head breadth to head length. Data
on 25 pairs of first and second sons were obtained from a published source
and analysed using a normal model. We found no significant difference
between the head shapes of first and second sons.
This completes the report. The final step is to read through the report and check
it.
The sections of the report on head shapes of first and second sons that were
written in Example 4.2 are assembled in the following box.
The mean difference in head shape indices was 0.19, with 95% t-interval
(−1.35, 1.72). A paired t-test of the hypothesis of zero mean difference gave
t = 0.25 on 24 degrees of freedom, p = 0.80.
Discussion
We conclude that there is little evidence against the hypothesis of no
difference between the head shape indices of first and second sons. However,
the sample size for this study was only 25.
Activities 4.2 and 4.3 will give you some practice at writing short statistical
reports.
Summary of Section 4
In this section, you have learned how to structure and write a statistical report. A
convenient structure includes sections entitled Summary, Introduction,
Methods, Results and Discussion.
Exercise on Section 4
Exercise 4.1 Interspike intervals
Data on the motor cortex neuron interspike intervals of an unstimulated monkey
were introduced in Example 1.3. Suppose it is required to describe the distribution
of the logarithms of the interspike intervals and, in particular, to estimate the
mean and calculate a confidence interval for the mean. Suppose also that you have
chosen a normal model. The following is a brief description of a suitable analysis.
The 100 interspike intervals, measured in milliseconds, were transformed using
natural logarithms (see Figure 2.13(b)). The distribution of the transformed data
was roughly normal, as judged by a normal probability plot, with one possible
outlier corresponding to an interval of 2 milliseconds. Accordingly, a 95%
t-interval was calculated for the mean of the transformed data. The mean was
3.41, with t-interval (3.28, 3.54). The standard deviation was 0.66. The
calculations were repeated without the outlier, yielding mean 3.44, 95% t-interval
(3.31, 3.56), and standard deviation 0.61.
Write a short report of this analysis.
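The figures quoted in the exercise came from MINITAB. Purely as a sketch of the shape of such an analysis, the following Python performs the same steps (log transform, 95% t-interval, recalculation without the smallest value) on invented data, not the monkey interspike intervals:

```python
# Sketch of the exercise's analysis pattern: log-transform, 95% t-interval,
# then repeat without the smallest observation. The data are invented
# stand-ins; the real 100 intervals were analysed in MINITAB.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# 99 simulated intervals (ms) plus one suspiciously short 2 ms interval
intervals_ms = np.append(rng.lognormal(mean=3.4, sigma=0.6, size=99), 2.0)

def t_interval(sample, confidence=0.95):
    """Sample mean and t-interval for the mean."""
    n = sample.size
    m = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)
    return m, stats.t.interval(confidence, n - 1, loc=m, scale=se)

logs = np.log(intervals_ms)
mean_all, ci_all = t_interval(logs)

# Repeat with the possible outlier (the smallest log value) removed
logs_trimmed = np.delete(logs, np.argmin(logs))
mean_trim, ci_trim = t_interval(logs_trimmed)

print(f"all data:        mean {mean_all:.2f}, interval {ci_all}")
print(f"outlier removed: mean {mean_trim:.2f}, interval {ci_trim}")
```

Comparing the two sets of results, as the exercise does, shows how much influence the single short interval has on the conclusions.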
Summary of Unit C3
In this unit, the methods that you have learned so far in the course have been
integrated into a statistical modelling process. Starting with a question or set of
questions, you have learned to use information about the setting, the types of
variables collected, and any other prior information you might possess — for
example, about the likely shape of the distribution of the data — to identify
possible models for the data. Various approaches to selecting a model, after an
initial exploration of the data using graphical methods and numerical summaries,
have been discussed. You have also learned how to interpret probability plots and
how to transform continuous data. The problem of outliers has been discussed
briefly. MINITAB has been used to apply these methods to a mini-project, which
began with a problem of scientific interest. You have also learned how to
structure and write a statistical report.
Learning outcomes
You have been working towards the following learning outcomes.
Ideas to be aware of
That statistical analysis is a process, beginning with a question or problem of
interest, and ending with a statistical report, and involving data exploration,
model choice and model checking, in a cycle that may be repeated several
times.
That the aim of statistical modelling is to draw valid and relevant inferences,
not to find a perfect model.
That some statistical models arise from standard settings, which can in turn
be used for model choice.
That a statistical report comprises a non-technical summary, an introduction,
a methods section, a results section, and a discussion.
Statistical skills
Use information about the setting of a problem and the type of data collected
to set out an initial modelling framework.
Use information on the discreteness or continuity of a variable and the shape
of its distribution to guide your choice of model.
Use graphical representations of data to select an appropriate distribution to
represent a variable.
Use numerical summaries such as the mean and the variance to inform model
choice.
Use standard settings to guide model choice.
Choose a transformation and apply it to a variable in order to reduce
skewness.
Interpret probability plots using confidence bands.
Identify outliers and explore their influence.
Choose a statistical technique to address a specific problem or question.
Structure a statistical report.
Write a statistical report.
Solutions to Activities
Solution 1.1 Here are my immediate thoughts about the four examples. Later, you will see why these might be reasonable starting points. At this stage, however, these suggestions are really only ‘hunches’; other suggestions might be just as valid.
Example 1.1 involves counts of fish; the word ‘counts’, with no notion of a repeated ‘trial’ (which would make the binomial and geometric models worth considering), immediately suggests a Poisson model.
Example 1.2 involves differences between head shapes of first and second sons; perhaps a normal distribution might be appropriate.
Example 1.3: the data are waiting times, so this suggests an exponential distribution.
Example 1.4: I’m not sure about this one.
Solution 1.2 The data of Example 1.3 are time measurements, and hence are continuous. So suitable candidate distributions include the exponential and the normal. The continuous uniform distribution would require there to be a maximum interspike interval. This is not ruled out, though it might seem a little unlikely.
The data of Example 1.4 are counts in the range 1 to 12, and hence are discrete. At this stage, any of the standard discrete distributions might be considered. The geometric and Poisson distributions allow counts greater than 12, which are ruled out by the setting. Perhaps the discrete uniform distribution seems the least unnatural choice at this stage, though no distribution seems entirely appropriate.
Solution 1.3
(a) Continuous. Weights are continuous, and the measurement accuracy (to two decimal places) is good.
(b) Discrete. The data are counts, and take low values — I’d guess mostly under 20.
(c) Continuous or discrete. The number of tickets sold is discrete, hence it could be modelled as a discrete variable. However, the numbers are so big (several million tickets are sold each week) that it could equally well be modelled as a continuous variable.
(d) Discrete. The number of jackpot winners is clearly a discrete variable. Unfortunately, few people win the jackpot, so it would not be appropriate to treat the number of jackpot winners as a continuous variable.
(e) Continuous or discrete. Examination marks are integers from 0 to 100, so they are discrete. However, it might make sense to treat them as continuous in view of the large number of different values.
(f) Discrete. The pass grades are discrete and take only a restricted range of values.
Solution 1.4 The relevant random variable is the number of the failed pylon, which can take values between 1 and 24. The variable is thus discrete on the integers 1 to 24. Since the pylons are all equally likely to have failed, a suitable model is the discrete uniform distribution on the integers 1, 2, . . . , 24.
Solution 1.5 The data are counts of events in intervals of fixed length. The emission of alpha particles may reasonably be assumed to occur at random. The setting thus corresponds to the standard setting for the Poisson distribution.
Solution 1.6 All three variables are continuous, so a continuous model should be chosen in each case.
(a) Head size is necessarily positive. However, values of head length plus head breadth are likely to cluster around some typical value some distance from zero, so that a normal model is likely to be appropriate. Note that the normal model is not ideal since it theoretically allows negative values. However, our aim is to obtain a reasonable model, not a perfect one!
(b) Head shape also takes positive values. The normal model is appropriate here as well — at least as a first model — since it might be expected that shape measurements would fluctuate around an average value.
(c) Differences between head shape indices can reasonably be expected to take both negative and positive values, perhaps clustered close to zero. A normal model again seems appropriate here, as a first choice.
Solution 2.1
(a) The Poisson distribution is unimodal, whereas the distribution in Figure 2.2(a) is bimodal. So a Poisson model is unsuitable.
(b) The Poisson distribution has zero as its mode if µ is less than 1, so a Poisson model may be appropriate.
(c) The data are left-skew, whereas the Poisson distribution is always right-skew (or roughly symmetrical when the mean is large), so a Poisson model is unsuitable.
Solution 2.2 Figure 2.3(a): The shape of this bar chart is consistent with all three of the distributions.
Figure 2.3(b): This bar chart is left-skew, whereas Poisson distributions and geometric distributions are always right-skew. So the shape of this bar chart is only consistent with a binomial distribution.
Figure 2.3(c): The shape of this bar chart is consistent with either a Poisson distribution or a binomial distribution, but not with a geometric distribution (for which the mode is always at 1, the lowest value in its range).
Note that only the shapes of the bar charts have been considered in this solution. Other factors, such as the range, have been ignored.
Solution 2.3
(a) It is clear that envelopes with few used windows are more frequent than envelopes with many used windows. So the uniform distribution is not appropriate.
(b) The shape of the data appears to be consistent in general terms with either a Poisson model or a geometric model. The range of a geometric distribution is 1, 2, . . . , and that of a Poisson distribution is 0, 1, . . .. Since the smallest number of used windows is 1, a geometric model seems the more suitable. In fact, a geometric model gives a very good fit to these data. Alternatively, we could, for instance, try modelling X − 1 using a Poisson distribution, where X is a random variable representing the number of used windows.
(c) The shortcoming of the geometric model is that it has an unbounded range, whereas in the present setting the number of used windows can be no greater than 12. However, this is probably not a big problem since there are relatively few envelopes with more than half the windows used, and only one with twelve used windows. So a distribution that gives small probabilities to values just below 12 and very small probabilities to values greater than 12 could be appropriate. Thus the geometric model might be a reasonable one in spite of the range restriction.
(d) The fit of the model would be checked using a chi-squared goodness-of-fit test. The expected value for twelve used windows should be calculated using the upper tail of the geometric distribution — that is, P(X ≥ 12).
Solution 2.4 In Example 1.11 it was suggested that the underlying distribution for these data might be normal or exponential. The normal model seems out of the question, owing to the substantial skewness of the histogram in Figure 2.6. The shape of the histogram suggests that the exponential model might be better.
Solution 2.5 The points of the exponential probability plot do not lie close to the straight line, which suggests that the exponential model is not appropriate. One alternative is to try a normal model. However, the histogram in Figure 2.5 is positively skewed. A possibility is to try transforming the data. This is discussed in Subsection 2.3.
Solution 2.6 The mean and standard deviation do not differ greatly. This suggests that an exponential model might be a reasonable choice.
Solution 2.7 The pattern of the points in Figure 2.9(a) is roughly linear, apart from a single observation in the top right corner of the plot. Thus the normal model appears to be appropriate for the bulk of the data, though there may be an outlier.
The plot in Figure 2.9(b) shows systematic curvature, suggesting that the normal model is inappropriate. This shape is in fact typical of right-skew distributions.
Solution 2.8 The original data are markedly right-skew. Both the log and the square root transformations have reduced the skew by ‘pulling in’ the values to the right of the mode and ‘stretching out’ those to the left. This ‘stretching out’ effect is more marked for the log transformation. However, it is rather difficult to decide which of the two transformed histograms is the more symmetric.
Solution 2.9 An estimate of the age of the site based on all eight observations is given by the sample mean, which is approximately 2622 years. However, this mean is rather unsatisfactory, as it lies above seven of the eight points. This is because it is greatly influenced by the value for C-367, which is 3433 years. When this point is omitted, the mean of the remaining seven points is approximately 2506 years. Sample number C-367 clearly has a big influence on the mean. You should therefore report the calculations both including and excluding sample C-367, and perhaps suggest further investigation of the outlier.
Solution 4.1 Reorganizing the material should produce something like the following.
Introduction
Compare means of a continuous variable in two groups given samples of sizes 24 and 32.
Methods
Check normality in each group using probability plots.
Calculate 95% t-interval.
Perform two-sample t-test.
Results
Normal model reasonable.
95% t-interval was (−3.92, 17.63).
Two-sample t-test (with equal variances) gave p = 0.16.
Discussion
Little evidence that the means are different in the two groups.
Solutions to Exercises
Solution 1.1 The data are clearly discrete, so a discrete model is appropriate. However, if the output is typically large (which might be the case with a major manufacturer such as Ford), it might make sense to model output as a continuous variable. On the other hand, if output is low (as might be the case for a luxury hand-crafted car such as the Morgan), then a discrete model would be better. Thus information about the scale of the output is required.
Solution 1.2
(a) The location of the leak can be measured as its distance from one end of the street, and hence takes values between 0 and L, where L is the street length. Nothing more is known, so it is reasonable to assume that the leak is equally likely to have occurred at any point. Thus it would seem reasonable to use the continuous uniform distribution U(0, L) to model the location of the leak.
(b) Suppose the T-joints are numbered 1 to n, where n is the number of T-joints in the street. All that is known is that the leak is likely to have occurred at one of these joints. Without any reason to suppose otherwise, we can assume that the leak might have occurred at any joint with equal probability. Thus an appropriate model is the discrete uniform distribution on the integers 1, 2, . . . , n.
Solution 4.1 The following report could also include a histogram of the logarithms of the intervals, such as the one in Figure 2.13(b).
Summary
A normal model was used to describe the variation in the logarithms of interspike intervals of an unstimulated monkey. The mean log(interspike interval) is estimated.
Introduction
One hundred motor cortex neuron interspike intervals of an unstimulated monkey were measured (in milliseconds). In this analysis, the mean of the logarithms of the interspike intervals is estimated. The data for the analysis were obtained from Zeger, S.L. and Qaqish, B. (1988) Markov regression models for time series: a quasi-likelihood approach. Biometrics, 44, 1019–1031.
Methods
The data were transformed using natural logarithms. A 95% t-interval was calculated for the mean of the logarithms of the interspike intervals. The validity of the normal model was investigated using a normal probability plot.
Results
The normal model for the logarithms of the interspike intervals was adequate, apart from a single outlier corresponding to an interspike interval of 2 milliseconds. The mean was 3.41, with 95% confidence interval (3.28, 3.54). The standard deviation was 0.66. When the outlier was excluded the mean was 3.44, with 95% confidence interval (3.31, 3.56), and the standard deviation was 0.61.
Discussion
The outlier has little effect on the results. A normal model is appropriate for describing the variation in the logarithms of the interspike intervals.