
Chapter 1: Introduction

1.1 What is Statistics?

Statistics involves collecting, analysing, presenting and interpreting data.

We frequently see statistical tools (such as bar charts, tables, plots of data, averages and percentages) on TV, in
newspapers and in magazines. Such methods used to organise and summarise data, so as to increase the
understanding of the data, are called descriptive statistics.

Statistics is also used in practice in many different walks of life, going beyond simple data summarisation to
answer a wide variety of questions such as:
 Medicine: Does a certain new drug prolong life for AIDS sufferers?
 Science: Is global warming really happening?
 Education: Are GCSE and A level examinations standards declining?
 Psychology: Is the national lottery making us a nation of compulsive gamblers?
 Sociology: Is the gap between rich and poor widening in Britain?
 Business: Do Persil adverts really make us want to buy Persil?
 Finance: What will interest rates be in 6 months time?

1.2 Populations and Samples

Suppose that we wanted to investigate whether smoking during pregnancy leads to lower birth weight of babies.
We use this example to illustrate the following definitions.

Definitions:
 Experimental unit: the object on which measurements are made.
For the above example, we are measuring birth weights of newborn babies, so a unit is a newborn baby.
 Variable: a measurable characteristic of a unit.
For the above example, the variable is birth weight.
 Population: the set of all units about which information is required.
For the above example, the population is all newborn babies.
 Sample: a subset of units of the population for which we can observe the variable of interest.
For the above example, a sample would be the observed birth weights for a set of newborn babies (which will be
a subset of all newborn babies).
 Random sample: a sample such that each unit in the population has the same chance of being chosen
independently of whether or not any other unit is chosen.

To determine whether smoking during pregnancy leads to lower birth weight of babies, we would compare a
random sample of weights of newborn babies whose mothers smoke with a random sample of weights of
newborn babies of non-smoking mothers. By analysing the sample data, we would hope to be able to draw
conclusions about the effects on birth weight of smoking during pregnancy for all babies (i.e. the population).
The process of using a random sample to draw conclusions about a population is called statistical inference.

If we do not have a random sample, then sampling bias can invalidate our statistical results. For example, birth
weights of twins are generally lower than the weights of babies born alone. So if all the non-smoking mothers in
the sample were giving birth to twins, whereas all the smoking mothers were giving birth to single babies, then
the conclusions we draw about the effects of smoking in pregnancy will not necessarily be correct as they are
affected by sampling bias.

Different units of the same population will have different values of the same variable; this is called natural
variation. For example, the weights of all newborn babies are obviously not the same. Different samples will
therefore contain different data; this is called sampling variability. It is important to bear in mind that slightly
different conclusions could be reached from different samples.

1.3 Types of Data

Different types of data require different types of analysis. The type of data set is determined by several factors:

 Type of variable:
 quantitative data - i.e. numerical (e.g., heights of students, number of phone calls in an hour).
 qualitative data - i.e. non-numerical (for example, eye colour, M/F).
Quantitative data can be subdivided further:
 discrete – a discrete variable can take only particular values (e.g., number of phone calls received at an
exchange).
 continuous- a continuous variable can take any value in a given range (e.g., heights of students).
 Number of variables measured:
 1 variable: univariate data.
 2 variables: bivariate data. E.g., we may have both the heights and weights of a set of individuals. The
data set then consists of pairs of observations on each unit such as (1.7m, 65kg).
 3 or more variables: multivariate data. E.g., we have heights, weights, eye colour, gender for a group
of individuals. In this case the data set consists of sets of 4 observations made on each unit such as
(1.7m, 65kg, blue, M).
 Number of samples: For example, when investigating the effects of smoking during pregnancy, we would
observe two samples:
 a sample of birth weights of babies born to smoking mothers
 a sample of birth weights of babies born to non-smoking mothers.
 Relationship of samples (if more than 1 sample):
 Are the samples independent? E.g., the two birth weight samples should be independent.
 Are the samples dependent?

 Example:
Suppose that a doctor would like to assess the effectiveness of changing to a low-fat diet in lowering cholesterol
for a group of patients. To do this the doctor might measure the cholesterol of the patients before starting on the
low-fat diet and then measure the cholesterol for the same patients after they have been on the low-fat diet. We
therefore have 2 samples of measured cholesterol:
 a sample before the diet
 a sample after the diet.
However, the 2 samples are not independent, since the cholesterol measurements for each sample were taken on
the same patients. Samples of this type are called matched pair data.

1.4 Recommended Books

You will need to use statistical tables for the course. The tables used in the exams are:
 Lindley, D.V. and Scott, W.F., New Cambridge Elementary Statistical Tables, C.U.P., 1984.
Statistical tables will be used throughout this course.

There are many books which cover the material in this course. Some good books are:
 Introduction to probability and statistics for engineers and scientists; [with CD-ROM] / Sheldon M. Ross
 Probability and Statistics for Engineers and Scientists, 7th edition, R.E. Walpole, R.H. Myers, S.L. Myers and
K. Ye, Prentice Hall, 2002.
 Clarke, G.M., and Cooke, D. A Basic Course in Statistics, Edward Arnold, 4th edition, 1999.
 Daly, F., Hand, D.J., Jones, M.C., Lunn, A.D. and McConway, K.J. Elements of Statistics, Open University,
1995.
Goes beyond what's required for this course, but is quite clearly written with some real examples.
 Devore, J and Peck, R. Introductory Statistics, West, 1990.
Rather simplistic at times, but has lots of real examples. Especially good if you have not done any statistics
before.
 Spiegel, M.R., Probability and Statistics, Schaum Outline Series, 1988.

In addition, you could browse in the library around QA276 and find a book which suits you. For starters you
could try looking at some of the following.

 Anderson, D.R., Sweeney, D.J. and Williams, T.A. Introduction to Statistics: Concepts and Applications,
West, 2nd edition, 1991.
 Bassett, E.E., Bremner, J.M., Jolliffe, I.T., Jones, B., Morgan, B.J.T. and North, P.M., Statistics: Problems
and Solutions, Edward Arnold, 1986.
 Moore, D.S., The Basic Practice of Statistics, Freeman, 1995.
 Moore, D.S., Think and Explain with Statistics, Addison-Wesley, 1986.
 Moore, D.S., Statistics: Concepts and Controversies, Freeman, 1991, 1985, 1979.

There are many online books which could be useful. See for example
http://www.statsoft.com/textbook/stathome.html

Chapter 2: Graphical and Numerical Statistics
2.1 Histograms

Histograms give a visual representation of continuous data. We consider two separate cases corresponding to
when (i) all the bars in the histogram have the same width; (ii) the intervals are of variable widths.

2.1.1 Histograms with equal class widths

 Example:
Mercury contamination can be particularly high in certain types of fish. The mercury content (ppm) in the hair
of 40 fishermen in a region thought to be particularly vulnerable is given below. (From the paper “Mercury content
of commercially imported fish of the Seychelles, and hair mercury levels of a selected part of the population.”
Environ. Research, (1983), 305-312.)
13.26 32.43 18.10 58.23 64.00 68.20 35.35 33.92 23.94 18.28
22.05 39.14 31.43 18.51 21.03 5.50 6.96 5.19 28.66 26.29
13.89 25.87 9.84 26.88 16.81 38.65 19.23 21.82 31.58 30.13
42.42 16.51 21.16 32.97 9.84 10.64 29.56 40.69 12.86 13.80

 The first step is to group the data. A reasonable choice of class intervals is:
0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70.
The frequency table that results from the use of these intervals is:

Interval Frequency
0-10 5
10-20 11
20-30 10
30-40 9
40-50 2
50-60 1
60-70 2

N.B. By convention, any observation that is at a boundary of a class will be put into the higher class. For
example, an observation of 10 above would be put into the 10-20 category.

To construct the histogram in this situation (i.e. all class widths equal):
 Mark boundaries of the class intervals on the horizontal axis.
 The height of the bars above each interval can be taken as the frequency for that interval.

Instead of using frequencies to give the heights of the rectangles in a histogram, relative frequencies may be
used. The relative frequency for an interval is that interval's frequency divided by the total frequency.

 So for the mercury example…

Interval Frequency Relative frequency


0-10 5 .125
10-20 11 .275
20-30 10 .250
30-40 9 .225
40-50 2 .050
50-60 1 .025
60-70 2 .050
Total 40 1
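
The frequency and relative frequency columns above can be checked with a short Python sketch. This is only an illustration and not part of the original notes; the helper name `frequency_table` is hypothetical, and it assumes the boundary convention stated earlier (a value on a class boundary goes into the higher class).

```python
# Mercury content (ppm) in the hair of the 40 fishermen, as listed above.
mercury = [
    13.26, 32.43, 18.10, 58.23, 64.00, 68.20, 35.35, 33.92, 23.94, 18.28,
    22.05, 39.14, 31.43, 18.51, 21.03, 5.50, 6.96, 5.19, 28.66, 26.29,
    13.89, 25.87, 9.84, 26.88, 16.81, 38.65, 19.23, 21.82, 31.58, 30.13,
    42.42, 16.51, 21.16, 32.97, 9.84, 10.64, 29.56, 40.69, 12.86, 13.80,
]


def frequency_table(data, boundaries):
    """Count the observations in [b0, b1), [b1, b2), ... (hypothetical helper).

    A value equal to a class boundary is counted in the higher class.
    """
    counts = [0] * (len(boundaries) - 1)
    for x in data:
        for k in range(len(counts)):
            if boundaries[k] <= x < boundaries[k + 1]:
                counts[k] += 1
                break
    return counts


boundaries = [0, 10, 20, 30, 40, 50, 60, 70]
frequencies = frequency_table(mercury, boundaries)
# Relative frequency = class frequency divided by the total frequency.
relative = [f / len(mercury) for f in frequencies]

for lo, hi, f, r in zip(boundaries, boundaries[1:], frequencies, relative):
    print(f"{lo}-{hi}: frequency {f}, relative frequency {r:.3f}")
```

Multiplying each relative frequency by 100 gives the percentages.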

The relative frequencies can be expressed as percentages (which is how Minitab produces a relative frequency
histogram):

Notice that the shape of the histograms, whether using frequencies or relative frequencies, is the same.

2.1.2 Histograms with unequal class widths

There is no hard and fast rule as to how many intervals should be used. Too many classes produce an uneven
distribution, but having too few loses information. Usually the number of classes is about 6-20. The more
observations we have, the more classes we will usually use.

The width of the intervals defining the histograms need not all be equal. It is often sensible to choose short
intervals where the data are dense and longer intervals where the data are more sparse. This
ensures that we don’t have too many intervals with zero frequency, while keeping as much information about the
distributional shape of the data as possible.

When unequal interval widths are used, then the frequency density should be used on the vertical scale on the
histogram, where
Frequency density = Frequency ÷ class width.

 Example:
The lengths (in metres) of 250 vehicles aboard a cross-channel ferry are summarised in the following table:

Vehicle length (m) Class width Frequency Frequency density


3.0-4.0 1 90 90
4.0-4.5 0.5 80 160
4.5-5.0 0.5 40 80
5.0-5.5 0.5 24 48
5.5-7.5 2 16 8
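
As a quick check of the frequency density column, the division can be sketched in Python. The variable names are illustrative only; the class boundaries and frequencies are those of the table above.

```python
# Class boundaries (metres) and frequencies from the vehicle-length table.
boundaries = [3.0, 4.0, 4.5, 5.0, 5.5, 7.5]
frequencies = [90, 80, 40, 24, 16]

# Class width = upper boundary minus lower boundary.
widths = [hi - lo for lo, hi in zip(boundaries, boundaries[1:])]
# Frequency density = frequency divided by class width.
densities = [f / w for f, w in zip(frequencies, widths)]

for lo, hi, w, f, d in zip(boundaries, boundaries[1:], widths, frequencies, densities):
    print(f"{lo}-{hi} m: width {w}, frequency {f}, frequency density {d:g}")
```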

[Figure: A histogram showing the lengths of 250 vehicles. Horizontal axis: vehicle length (m); vertical axis: frequency density.]

Notice that if we had simply defined the heights of the rectangles to be the frequencies, then the histogram
would exaggerate, for example, the incidence of cars between 3 and 4 metres in length.

An alternative way of producing a histogram in situations where not all class widths are equal is to set the bar
height to be the relative frequency density. This is given by:

Relative freq. density = Relative freq. ÷ class width.

If the histogram is produced in this way, then the total area of all the bars is 1.

 Example (continued)
The relative frequency densities for the vehicle length data are as follows:

Vehicle length (m) Class width Frequency Relative freq. Rel. freq. density
3.0-4.0 1 90 0.36 0.36
4.0-4.5 0.5 80 0.32 0.64
4.5-5.0 0.5 40 0.16 0.32
5.0-5.5 0.5 24 0.096 0.192
5.5-7.5 2 16 0.064 0.032
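
The claim that the bars have total area 1 can be verified numerically. The following is a minimal sketch using the table values; the variable names are illustrative.

```python
boundaries = [3.0, 4.0, 4.5, 5.0, 5.5, 7.5]
frequencies = [90, 80, 40, 24, 16]
n = sum(frequencies)  # total number of vehicles

widths = [hi - lo for lo, hi in zip(boundaries, boundaries[1:])]
rel_freq = [f / n for f in frequencies]
# Relative frequency density = relative frequency divided by class width.
rel_density = [r / w for r, w in zip(rel_freq, widths)]

# Area of each bar = height (density) times width; the areas sum to 1
# because they are just the relative frequencies again.
total_area = sum(d * w for d, w in zip(rel_density, widths))

print([round(d, 3) for d in rel_density])
print(round(total_area, 10))
```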

The corresponding histogram can then be produced:

2.1.3 Histogram shapes

Histograms are very useful for giving some idea of the shape of a density by approximating the histogram to a
smooth curve.

Densities can take many different shapes:

[Figure: example densities in three groups: unimodal, bimodal and multimodal distributions; symmetric, positively skewed and negatively skewed distributions; normal, heavy-tailed and light-tailed distributions.]

2.1.4 Histograms for discrete data

Discrete data is usually illustrated using a bar-line chart (or a bar chart), whilst histograms are generally used for
continuous data. However, when the number of possible values for the observations is large, a bar diagram
would become uninformative. In this case it is acceptable to group the values into class intervals, much as you
would for continuous data.

 Example:
Suppose we have the following data:
1 1 2 2 2 3 3 4 4 5 5 5 5 6 6 7 7 7
8 9 9 9 9 10 10 10 10 10 11 11 11 11 12 12 12 12
13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 16 16 16
17 17 17 18 18 19 19 20 21 21 22 22 23 23 24 26 27 29

As there are a large number of different values here, to get a better idea of the shape of the distribution, we can
group data into classes. Let's consider grouping all observations between 1 - 3, 4 - 6 and so on. To draw a
histogram we need a continuous scale and so we need to define our histogram intervals to be 0.5 - 3.5, 3.5 - 6.5,
and so on. (Remember: a histogram never has gaps between the bars).

We then get the following frequency distribution:

Interval Frequency
0.5 - 3.5 7
3.5 - 6.5 8
6.5 - 9.5 8
9.5 - 12.5 13
12.5 - 15.5 15
15.5 - 18.5 8
18.5 - 21.5 5
21.5 - 24.5 5
24.5 - 27.5 2
27.5 - 30.5 1

The histogram can now be drawn in the normal way.

2.2 Stem-and-leaf plots

Stem-and-leaf plots are an effective way of providing a visual display of quantitative data with very little effort.
The idea of the plots is to separate each observation into 2 parts - the first part being the stem and the second the
leaf.
To construct a stem-and-leaf plot:
 Select one or more leading digits for the stem values. The following digit or digits become the leaves.
 List possible stem values in a vertical column.
 Record the leaf value for every observation beside the corresponding stem value.
 Indicate the units for stems and leaves.

 Example:
To investigate the efficiency of new air-conditioning equipment installed on Boeing
720 aircraft, the times (in hours) to first failure of the equipment were obtained
from 28 different aircraft:
79 90 10 60 61 49 14 24 56 20 84 44 25 59
46 37 32 76 26 35 29 53 75 25 44 23 27 33

For these data an obvious choice for the stems is the leading digit (tens) and the leaves are then the second digits
(units). So, for example, the first observation of 79 has stem 7 and leaf 9. The data values range from 10 up to
90, so we have the stem values 1-9.

An unordered stem-and-leaf diagram for the Boeing data (the leaves should be written in aligned columns):

Stem | Leaves
1 | 0 4
2 | 4 0 5 6 9 5 3 7
3 | 7 2 5 3
4 | 9 4 6 4
5 | 6 9 3
6 | 0 1
7 | 9 6 5
8 | 4
9 | 0

Scale: Stem = 10s Leaves = units

An ordered stem-and-leaf diagram for the Boeing data (the leaves have now been put in order):

Stem | Leaves
1 | 0 4
2 | 0 3 4 5 5 6 7 9
3 | 2 3 5 7
4 | 4 4 6 9
5 | 3 6 9
6 | 0 1
7 | 5 6 9
8 | 4
9 | 0

Scale: Stem = 10s Leaves = units

N.B. Rearranging the leaves in ascending order clarifies things and is useful for producing numerical
summaries.

N.B.2 One advantage that stem-and-leaf diagrams have over histograms is that they retain the detail of the raw
data.
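
The construction steps can also be sketched in Python. This is an illustrative sketch rather than a standard routine: the function name `stem_and_leaf` and its parameters are hypothetical, and it assumes non-negative whole-number data with a single-digit leaf.

```python
from collections import defaultdict

# Times (hours) to first failure for the 28 aircraft, as listed above.
failure_times = [79, 90, 10, 60, 61, 49, 14, 24, 56, 20, 84, 44, 25, 59,
                 46, 37, 32, 76, 26, 35, 29, 53, 75, 25, 44, 23, 27, 33]


def stem_and_leaf(data, stem_unit=10, ordered=True):
    """Return {stem: [leaves]} for non-negative integers (hypothetical helper)."""
    leaves_by_stem = defaultdict(list)
    for x in data:
        stem, leaf = divmod(x, stem_unit)  # e.g. 79 -> stem 7, leaf 9
        leaves_by_stem[stem].append(leaf)
    # List every possible stem, including any with no leaves, so gaps stay visible.
    result = {}
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = leaves_by_stem.get(stem, [])
        result[stem] = sorted(leaves) if ordered else list(leaves)
    return result


plot = stem_and_leaf(failure_times)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
print("Scale: Stem = 10s, Leaves = units")
```

Calling `stem_and_leaf(failure_times, ordered=False)` keeps the leaves in the order in which the observations were recorded.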

2.2.1 Use of stem-and-leaf plots

 Stem-and-leaf plots give a visual display of the rough shape of the distribution of the variable being
measured. We can identify whether the density is a) unimodal or multimodal; b) symmetric, negatively or
positively skewed; c) normal, heavy- or light-tailed.
 Stem-and-leaf plots are useful for informal inference. We can find medians and quartiles easily from the
diagrams and obtain estimates of probabilities. For example, in the Boeing data 10 pieces of equipment
lasted under 30 hours so we could estimate the probability of a new piece of equipment failing within the
first 30 hours as 10/28.
 Stem-and-leaf plots are useful for identifying outliers, which are unusually large or small observations. For
example, if the Boeing data had contained an extra observation of 119, then this might be an outlier:

1 | 0 4
2 | 0 3 4 5 5 6 7 9
3 | 2 3 5 7
4 | 4 4 6 9
5 | 3 6 9
6 | 0 1
7 | 5 6 9
8 | 4
9 | 0
10 |
11 | 9 (this could be considered an outlying value)

2.2.2 Choice of stem unit


Choice of stem unit can be important.

 Example:
To determine the age of a prehistoric settlement in North Wales, 24 small fragments from a
wooden boat found at the settlement were independently radiocarbon dated. The
radiocarbon determinations (in years) of the ages of the fragments are:
4969 5163 5052 5144 4965 5152 4967 4934 4895 5078 5019 4908
5009 5046 4912 5012 4889 5034 4914 5117 4931 5081 4984 4881

 Possibility 1: We could round each observation to the nearest one hundred years:
5000 5200 5100 5100 5000 5200 5000 4900 4900 5100 5000 4900
5000 5000 4900 5000 4900 5000 4900 5100 4900 5100 5000 4900

Taking the stem unit to be 1000 years gives the following diagram:

4 | 9 9 9 9 9 9 9 9
5 | 0 2 1 1 0 2 0 1 0 0 0 0 0 1 1 0

Scale: Stem = 1000's Leaves = 100's

Because we have so few stem values here, we lose a lot of information. We can’t say anything for example
about the shape of the distribution.

 Possibility 2: Round observations to the nearest 10 years.


4970 5160 5050 5140 4970 5150 4970 4930 4900 5080 5020 4910
5010 5050 4910 5010 4890 5030 4910 5120 4930 5080 4980 4880

Taking the stem unit as 100 years gives:


48 | 9 8
49 | 7 7 7 3 0 1 1 1 3 8
50 | 5 8 2 1 5 1 3 8
51 | 6 4 5 2

Scale: Stem = 100's Leaves = 10's

This plot is a little more informative, but we could still do with having slightly more stems.

 Possibility 3: Split the stems into high and low values

48L |
48H | 8 9
49L | 0 1 1 1 3 3
49H | 7 7 7 8
50L | 1 1 2 3
50H | 5 5 8 8
51L | 2 4
51H | 5 6

Scale: Stem = 100's Leaves = 10's

In each low (L) category you put any 0s, 1s, 2s, 3s or 4s. In each high (H) category you write any 5s, 6s, 7s, 8s or 9s.

The diagram is now quite informative about the distribution: there is evidence of a positive skew.

[Note that if the stem unit was taken to be 10s, then the diagram we would get would be poor, since we would then
have too many stem values (a lot of the rows would have no values in them).]
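
Splitting each stem into low and high halves can be sketched in Python as follows. The helper names `round_half_up` and `split_stem_plot` are hypothetical. Halves are rounded upwards here (so 4965 becomes 4970, matching the rounded values listed under Possibility 2); Python's built-in `round` is deliberately avoided because it rounds halves to the nearest even value.

```python
# Radiocarbon ages (years) of the 24 fragments, as listed above.
ages = [4969, 5163, 5052, 5144, 4965, 5152, 4967, 4934, 4895, 5078, 5019, 4908,
        5009, 5046, 4912, 5012, 4889, 5034, 4914, 5117, 4931, 5081, 4984, 4881]


def round_half_up(x, unit):
    """Round a non-negative integer to the nearest multiple of unit, halves upwards."""
    return ((x + unit // 2) // unit) * unit


def split_stem_plot(data, leaf_unit=10):
    """Ordered stem-and-leaf rows with each stem split into L (0-4) and H (5-9)."""
    rows = {}
    for x in sorted(round_half_up(v, leaf_unit) for v in data):
        stem, leaf = divmod(x // leaf_unit, 10)  # e.g. 4890 -> stem 48, leaf 9
        label = f"{stem}{'L' if leaf < 5 else 'H'}"
        rows.setdefault(label, []).append(leaf)
    return rows


rows = split_stem_plot(ages)
for label, leaves in rows.items():
    print(f"{label} | {' '.join(str(leaf) for leaf in leaves)}")
print("Scale: Stem = 100's, Leaves = 10's")
```

Rows with no leaves (such as 48L) are simply omitted by this sketch; when drawing the diagram by hand they should still be listed.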

2.2.3 Back-to-back displays for displaying two independent samples

If there are 2 sets of data which you wish to compare, then both of these can be put on the same stem-and-leaf
plot with the leaves for one dataset going to the right and the leaves of the other dataset going to the left.

 Example:
Using a technique involving chromium dioxide, the protein assimilation efficiencies (i.e. percentage of protein
intake actually absorbed) were measured on field mice and voles fed on their natural diets. The assimilation
efficiencies (in percentages) are given below:
A.E.'s of field mice:
61.3 65.4 71.7 62.6 63.6 76.3 67.8 61.9
57.8 70.6 70.5 68.9 62.6 69.7 74.6
A.E.'s of voles:
51.7 66.7 72.0 69.8 63.7 77.2 62.6 63.5 69.2 67.5
70.1 67.3 75.2 73.8 59.6 69.9 77.6 74.1 73.7

Rounding observations to the nearest integer gives us:

An unordered back-to-back stem-and-leaf diagram for the protein data

A.E.s for field mice | Stem | A.E.s for voles
                     |  5L  | 2 (outlier?)
                   8 |  5H  |
           3 2 4 3 1 |  6L  | 4 3 4 0
               9 8 5 |  6H  | 7 9 8 7
             0 1 1 2 |  7L  | 2 0 0 4 0 4 4
                 5 6 |  7H  | 7 5 8

Scale: Stem = 10's Leaves = 1's

Then ordering the leaves we get…

An ordered back-to-back stem-and-leaf diagram showing the protein data

A.E.s for field mice | Stem | A.E.s for voles
                     |  5L  | 2
                   8 |  5H  |
           4 3 3 2 1 |  6L  | 0 3 4 4
               9 8 5 |  6H  | 7 7 8 9
             2 1 1 0 |  7L  | 0 0 0 2 4 4 4
                 6 5 |  7H  | 5 7 8

Scale: Stem = 10's Leaves = 1's

2.2.4 Stem-and-leaf diagrams for matched-pair data

It is not a good idea to do a back-to-back plot if the 2 variates are not independent. Consider the following
example.

 Example:
Fifteen people participated in a short typing course. Their typing speeds (words/min) before and after the course
were recorded:

Subject 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Before 15 18 23 27 36 12 8 19 32 22 17 21 16 15 33
After 26 28 27 26 28 24 26 42 32 36 20 29 21 22 28

These data are an example of matched-pair data (there are two measurements recorded on each participant).
Matched-pair data are likely to be dependent (a person with a fast typing speed before the course is also likely to
have a fast typing speed after the course). By drawing a back-to-back stem-and-leaf diagram you lose information
about how the measurements pair up. You could draw a scatter diagram (this would show the pairings). Alternatively, you
could produce a stem-and-leaf diagram of the differences:

Subject 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Change 11 10 4 -1 -8 12 18 23 0 14 3 8 5 7 -5

A stem-and-leaf diagram showing the change in typing speeds after a short course

-0 | 1 8 5
0 | 4 0 3 8 5 7
1 | 1 0 2 8 4
2 | 3

Scale: Stem = 10’s Leaves = units.

A slightly more informative diagram can be obtained by splitting each stem up into two parts (one for the lower
leaves and the other for higher leaves):

A stem-and-leaf diagram showing the change in typing speeds after a short course

-0H | 8 5
-0L | 1
0L | 4 0 3
0H | 8 5 7
1L | 1 0 2 4
1H | 8
2L | 3

Scale: Stem = 10’s Leaves = units.

Each diagram could then be ordered.
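
The differences used in these diagrams are simply "after minus before" for each subject. A minimal Python sketch (variable names are illustrative) that keeps the pairing intact:

```python
# Typing speeds (words/min) for subjects 1 to 15, from the table above.
before = [15, 18, 23, 27, 36, 12, 8, 19, 32, 22, 17, 21, 16, 15, 33]
after = [26, 28, 27, 26, 28, 24, 26, 42, 32, 36, 20, 29, 21, 22, 28]

# zip pairs each subject's two measurements, so the pairing is never lost.
changes = [a - b for a, b in zip(after, before)]
improved = sum(1 for c in changes if c > 0)

print(changes)
print(f"{improved} of {len(changes)} subjects typed faster after the course")
```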

2.2.5 Problems

Stem-and-leaf plots cannot be used for displaying qualitative data and they become impractical for large
numbers of observations.

2.3 Cumulative Frequency Plots

A cumulative frequency plot also uses classes and frequencies. The cumulative frequency for a class is the
number of observations with values less than the upper boundary for that class.

 Example:
Consider the mercury example again. The cumulative frequencies are given in the table below:

Interval Frequency Cumulative frequency


0-10 5 5
10-20 11 16
20-30 10 26
30-40 9 35
40-50 2 37
50-60 1 38
60-70 2 40
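
The cumulative frequency column is a running total of the frequency column. As a small sketch, Python's standard `itertools.accumulate` reproduces it; the variable names are illustrative.

```python
from itertools import accumulate

upper_boundaries = [10, 20, 30, 40, 50, 60, 70]
frequencies = [5, 11, 10, 9, 2, 1, 2]

# Running total: number of observations below each upper class boundary.
cumulative = list(accumulate(frequencies))

for upper, c in zip(upper_boundaries, cumulative):
    print(f"less than {upper}: {c}")
```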

In a cumulative frequency polygon the cumulative frequencies are plotted against the upper class boundaries of
the classes. Consecutive points are then joined by straight lines.

 Example (continued)
For the mercury example we want to plot the points (0, 0), (10, 5), (20, 16),…, (70, 40) and then join these
points:

A cumulative frequency plot is useful for giving us some idea of the shape of the distribution function of the
variable. It can also be used to obtain estimates of the median and other quantiles for grouped data.

2.4 Scatter Plots

Scatter plots are useful for assessing relationships between 2 variables. To draw a scatter plot we represent one
of the variables by the horizontal axis and the other variable by the vertical axis. We then simply plot the pairs of
data points on the graph.

 Example:
Fifteen children were given a visual-discrimination (V) test during the first week at primary school and a
reading-achievement (R) test at the end of their first year of schooling. Scores out of 100 were calculated for
each test.

Child no. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
V-score 75 69 70 62 52 45 42 39 37 34 34 66 54 58 63
R-score 95 90 82 69 58 49 38 35 30 20 31 75 61 64 77

To draw a scatter plot we now want to plot the points (75, 95), (69, 90), (70, 82), …, (63, 77).

The plot would suggest that there is a positive relationship between the V-score and the R-score.

2.4.1 Positive/ negative correlation

The following graphs give illustrations of variables that are (a) positively and (b) negatively correlated with each
other. Correlation can also be categorised as strong or weak depending upon how close the points are to lying on
a straight line.

[Figure: four scatter plots of y against x, showing strong positive, weak positive, strong negative and weak negative correlation.]

2.4.2 Correlation does not imply causation

It is important to realise that scatter plots point to associations between variables. They do not necessarily show
a causal relationship.

 Example:
Information about two variables (life expectancy and the number of people per television set) is available for 12
countries:

It is clear that the two variables are negatively correlated. However, it clearly would be wrong to conclude that
simply sending more televisions to countries with low life expectancies would cause their inhabitants to live
longer.

This example illustrates the very important distinction between causation and association. Two variables may be
strongly correlated without a cause-and-effect relationship existing between them. Often the explanation is that
both variables are related to a third variable not being measured. In the example above, for instance, life
expectancy and the number of televisions in the population will both be related to the country’s wealth.

There is one further type of graph that we will consider later in the chapter (namely box-and-whisker plots). We
first however need to look at numerical summary measures for data.

2.5 Numerical summaries of data

In the next few sections we will look at some numerical ways of summarising data.

2.5.1 Some notation

Suppose that we would like to learn about the random variable X. To do this we will observe a random sample of
n observations, $X_1, X_2, \ldots, X_n$, such that each $X_i$ has the same distribution as X. The observed values of
$X_1, X_2, \ldots, X_n$ are then denoted $x_1, x_2, \ldots, x_n$.

 Example:
Suppose we are interested in the number of units of alcohol students at UKC consumed last week. To do this we
could randomly select 50 students to form a random sample $X_1, X_2, \ldots, X_{50}$, where $X_i$ is the random variable
representing the number of units of alcohol consumed by the ith student. The observed value of $X_i$ is denoted
$x_i$.

Now suppose that we order the random sample $x_1, x_2, \ldots, x_n$. We let:

 $x_{(1)}$ denote the smallest observation;
 $x_{(2)}$ denote the second smallest observation;
…
 $x_{(i)}$ denote the ith smallest observation;
…
 $x_{(n)}$ denote the largest observation.
Then $x_{(i)}$ is called the ith order statistic and the following relation holds:
\[ x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}. \]

 Example:
Suppose that we have the observations:

Then

When we have frequency data, we will denote the frequency of the kth class by $f_k$ for k = 1,…, K, where K is
the number of classes. Then $\sum_{k=1}^{K} f_k = n$.

 Example:
Consider the mercury example again. Here we have the frequency table given by:

Interval Frequency
0-10 5
10-20 11
20-30 10
30-40 9
40-50 2
50-60 1
60-70 2

Here we have 7 classes, so that K = 7. Then $f_1 = 5$, $f_2 = 11$ and so on, such that $\sum_{k=1}^{7} f_k = 40$.

2.5.2 Measures of location

 The Sample Mean

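Let $X_1, X_2, \ldots, X_n$ denote the random variables for a sample of size n. The sample mean, denoted $\bar{X}$, is defined
by:
\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i. \]

The observed value of the sample mean for a particular sample is therefore:
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. \]

When the data are grouped by means of a frequency table, then the equivalent formula for $\bar{x}$ is given by:
\[ \bar{x} = \frac{1}{n} \sum_{k=1}^{K} f_k m_k, \]
where K is the number of classes or groups, $f_k$ is the frequency of class k, and $m_k$ is the mid-point of class k.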

 Example:
Consider the mercury example again.

Interval Mid-point, $m_k$ Frequency, $f_k$


0-10 5 5
10-20 15 11
20-30 25 10
30-40 35 9
40-50 45 2
50-60 55 1
60-70 65 2

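The sample mean is therefore:
\[ \bar{x} = \frac{(5 \times 5) + (15 \times 11) + (25 \times 10) + (35 \times 9) + (45 \times 2) + (55 \times 1) + (65 \times 2)}{40} = \frac{1030}{40} = 25.75. \]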
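
The grouped-data calculation can be sketched in Python as below; the variable names are illustrative. Because every observation in a class is replaced by the class mid-point, the result is an approximation to the mean of the raw data.

```python
# Mid-points and frequencies from the mercury frequency table above.
midpoints = [5, 15, 25, 35, 45, 55, 65]
frequencies = [5, 11, 10, 9, 2, 1, 2]

n = sum(frequencies)  # total number of observations
# Grouped mean: sum of (frequency x mid-point) divided by n.
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

print(n, grouped_mean)
```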

Note: The mean is probably the most useful measure of location. Its advantages are that it uses all the values in
the data and is easy to manipulate mathematically. A disadvantage is that it is not robust- this means that its
value can be sensitive to the presence of outlying values. More robust measures of location (such as the median
or trimmed mean) are increasing in popularity amongst statisticians.

 The Median

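To find the median of a set of n data values, we must first rearrange them in order of size. The median is then
equal to the middle observation if n is odd, and the average of the middle two observations if n is even.
More formally,
\[ \text{sample median} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd}, \\ \frac{1}{2}\left( x_{(n/2)} + x_{(n/2+1)} \right) & \text{if } n \text{ is even}. \end{cases} \]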

 Example 1:
The values below are systolic blood pressures of patients admitted to a hospital:
112.1 138.6 115.9 109.5 108.2 110.9 159.6 115.8 122.3 122.4 123.8 117.5.
To find the median value for the blood pressure, we must first list them in ascending order:
108.2 109.5 110.9 112.1 115.8 115.9 117.5 122.3 122.4 123.8 138.6 159.6.

Here we have an even number of observations (n = 12). So

Sample median = $\frac{1}{2}\left( x_{(6)} + x_{(7)} \right) = \frac{1}{2}(115.9 + 117.5) = 116.7$.

For these data the sample mean is:

Sample mean = $\frac{1456.6}{12} \approx 121.38$,
which is somewhat larger than the sample median. The mean is influenced by the outlying value (159.6). The
median is more robust than the mean and is not really affected by outliers.
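
The odd/even rule for the median can be sketched in Python as follows. The function name `sample_median` is hypothetical; the data are the blood pressures of Example 1.

```python
def sample_median(values):
    """Middle value if n is odd, average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


pressures = [112.1, 138.6, 115.9, 109.5, 108.2, 110.9,
             159.6, 115.8, 122.3, 122.4, 123.8, 117.5]

median = sample_median(pressures)
mean = sum(pressures) / len(pressures)
print(f"median = {median:.1f}, mean = {mean:.2f}")

# Dropping the outlying value 159.6 moves the mean far more than the median.
without_outlier = [p for p in pressures if p != 159.6]
print(f"without 159.6: median = {sample_median(without_outlier):.1f}, "
      f"mean = {sum(without_outlier) / len(without_outlier):.2f}")
```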

 Example 2:
A football team has scored the following number of goals in the last 44 matches:

Number of goals 0 1 2 3 4
Frequency 9 8 15 9 3

As n = 44, the median will lie halfway between the 22nd and 23rd observations. Since both $x_{(22)}$ and $x_{(23)}$ are
2, the median value is 2.

For grouped data, the most convenient way to estimate the median is by graphical methods. This is most easily
demonstrated via an example.

 Example
Consider the mercury example once again. The cumulative frequency plot is given below. We have a total of 40
observations, so when the cumulative frequency is 20 we might expect the corresponding value of mercury read
off from the graph to be an estimate of the median. In this case we estimate the median as approximately 24.

Note:
The median is also often a better measure of location than the mean when data are highly skewed. The following
show the relative positions of the mean and median for 3 densities:

 Example:
Distributions of incomes are commonly positively skewed as there are typically a few very large salaries, which
give the density a long right-hand tail. Therefore the median is often used to give a typical salary value, rather
than the mean.

Disadvantages of the median:


There are two main disadvantages of using the median. It ignores the actual values of the data and uses only their
ranks (it effectively uses only the “middle” part of the data set). It is also not as easy to use mathematically in the
theory of statistics as the arithmetic mean.

 The Trimmed Mean

The trimmed mean can be viewed as some sort of compromise between the mean and the median. To calculate a
trimmed mean:
 order the data values
 delete a selected number of values from each end of the ordered list
 average the remaining values.
The trimmed mean avoids the disadvantages of the mean by excluding extreme observations and avoids that of
the median by taking some account of the observations other than the middle one. To calculate the 5% trimmed
mean for example, discard the top 5% and the bottom 5% of observations, and average those remaining.

 Example:
The body temperatures (deg. F) of 10 patients hospitalised with meningitis are as follows:

104.0 104.8 101.6 108.0 103.8


100.8 104.2 100.2 102.4 101.4
The sample mean for these data is:
\[ \bar{x} = \frac{1031.2}{10} = 103.12. \]

To find the 10% trimmed mean, as we have 10 observations, we drop the smallest and largest data values (100.2 and 108.0):
10% trimmed mean = $\frac{823.0}{8} = 102.875 \approx 102.88$.
In this case the 10% trimmed mean is probably a better representation of the centre of the distribution as it
ignores the (possible) outlier, 108.
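
The three steps for a trimmed mean (order, delete, average) can be sketched in Python. The function name `trimmed_mean` and its `proportion` parameter are hypothetical, and the sketch assumes that the number of values dropped from each end is the proportion times n, rounded down.

```python
def trimmed_mean(values, proportion):
    """Average after dropping the given proportion of values from each end."""
    ordered = sorted(values)
    # Number of values to delete from each end (small tolerance guards
    # against floating-point error such as 0.29 * 100 = 28.999...).
    k = int(len(ordered) * proportion + 1e-9)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Body temperatures (deg. F) of the 10 meningitis patients.
temperatures = [104.0, 104.8, 101.6, 108.0, 103.8,
                100.8, 104.2, 100.2, 102.4, 101.4]

mean = sum(temperatures) / len(temperatures)
trimmed = trimmed_mean(temperatures, 0.10)
print(f"mean = {mean:.2f}, 10% trimmed mean = {trimmed:.3f}")
```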

 The Mode

The mode is a very simple measure of location. For discrete data, it is the value of x with the largest frequency.
We cannot calculate a mode for ungrouped continuous data. For data grouped into classes we obtain a modal
class.

 Example:
Consider again the family size data presented in the previous section. The numbers of children in the sampled
families are:
2, 6, 3, 2, 2, 7, 5, 4, 1, 4, 0, 5, 2, 4, 1.
Here the most commonly occurring value is 2 and so this is the mode.
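
For discrete data the mode can be found by counting how often each value occurs. A minimal sketch using Python's standard `collections.Counter` is given below; it returns every value that attains the largest frequency, since a data set may have more than one mode.

```python
from collections import Counter

# Numbers of children in the sampled families.
children = [2, 6, 3, 2, 2, 7, 5, 4, 1, 4, 0, 5, 2, 4, 1]

counts = Counter(children)      # value -> frequency
highest = max(counts.values())  # the largest frequency
modes = sorted(v for v, c in counts.items() if c == highest)

print(modes, highest)
```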

 Quantiles

The median divides the data into two equal parts. In a similar way, quartiles divide the data into four equal parts,
deciles divide the data into 10 equal parts and percentiles divide it into 100 equal parts.

The upper and lower quartiles can be found in the following way:
sample lower quartile = median of lower half of data
sample upper quartile = median of upper half of data

If n is odd, then the median of the entire sample is included in both halves.

Note that deciles and percentiles only tend to be used on very large data sets.

 Example:
The salinity values for 28 water specimens are as follows:
7.6 7.7 4.3 5.9 5.0 10.5 7.7 9.5 12.0 12.6
6.5 8.3 8.2 13.2 12.6 13.6 14.1 13.5 11.5 12.0
10.4 10.8 13.1 12.3 10.4 13.0 14.1 15.1

To find the quartiles we first need to order the data:


4.3 5.0 5.9 6.5 7.6 7.7 7.7 8.2 8.3 9.5
10.4 10.4 10.5 10.8 11.5 12.0 12.0 12.3 12.6 12.6
13.0 13.1 13.2 13.5 13.6 14.1 14.1 15.1

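We have 28 observations and so the median is the average of the 14th and 15th ordered values:

$$\text{median} = \frac{10.8 + 11.5}{2} = 11.15.$$

To find the lower and upper quartiles we need to find the median of the lower 14 and upper 14 observations
respectively:

$$\text{LQ} = \frac{7.7 + 8.2}{2} = 7.95, \qquad \text{UQ} = \frac{13.0 + 13.1}{2} = 13.05.$$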
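
The median-of-halves rule can be sketched in Python as follows. The function name quartiles is an assumption for this sketch; the rule itself (including the overall median in both halves when n is odd) is the one stated above. Statistical packages often use other quartile conventions, so their answers may differ slightly.

```python
from statistics import median


def quartiles(values):
    """Return (lower quartile, median, upper quartile) by the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    # Size of each half; when n is odd the middle value belongs to both halves.
    half = (n + 1) // 2
    lower_half = ordered[:half]
    upper_half = ordered[n - half:]
    return median(lower_half), median(ordered), median(upper_half)


salinity = [7.6, 7.7, 4.3, 5.9, 5.0, 10.5, 7.7, 9.5, 12.0, 12.6,
            6.5, 8.3, 8.2, 13.2, 12.6, 13.6, 14.1, 13.5, 11.5, 12.0,
            10.4, 10.8, 13.1, 12.3, 10.4, 13.0, 14.1, 15.1]

lq, med, uq = quartiles(salinity)
print(round(lq, 2), round(med, 2), round(uq, 2))
```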

 Exercise:
Find the median, together with the lower and upper quartiles for the following examination marks:
68, 72, 31, 60, 90, 96, 45, 57, 54, 45, 16, 22, 82, 63, 52.

Just as with finding the median, we can estimate quantiles graphically.

 Example:
Consider again the cumulative frequency polygon for the mercury data. As the total number of observations is
40, we can estimate the lower and upper quartiles by reading off the mercury values from the graph for a
cumulative frequency of 10 and 30, respectively.

We see UQ = 34 and LQ = 14 (approximately).
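
Reading a value off a cumulative frequency polygon amounts to linear interpolation between class boundaries. The sketch below assumes the mercury class frequencies from the grouped data table used elsewhere in this chapter (5, 11, 10, 9, 2, 1, 2 for classes of width 10 from 0 to 70); the function name is hypothetical.

```python
def value_at_cumulative_frequency(boundaries, cum_freqs, target):
    """Linearly interpolate along the cumulative frequency polygon."""
    for i in range(1, len(boundaries)):
        lo_cf, hi_cf = cum_freqs[i - 1], cum_freqs[i]
        if lo_cf <= target <= hi_cf and hi_cf > lo_cf:
            fraction = (target - lo_cf) / (hi_cf - lo_cf)
            width = boundaries[i] - boundaries[i - 1]
            return boundaries[i - 1] + fraction * width
    raise ValueError("target lies outside the range of cumulative frequencies")


# Class boundaries and the cumulative frequency reached at each boundary
# (assumed from the mercury frequency table: 5, 11, 10, 9, 2, 1, 2).
boundaries = [0, 10, 20, 30, 40, 50, 60, 70]
cum_freqs = [0, 5, 16, 26, 35, 37, 38, 40]

lq = value_at_cumulative_frequency(boundaries, cum_freqs, 10)
uq = value_at_cumulative_frequency(boundaries, cum_freqs, 30)
print(round(lq, 1), round(uq, 1))
```

Under these assumptions the interpolated values are roughly 14.5 and 34.4, in line with the approximate readings taken from the graph.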

2.5.3 Measures of dispersion

Obviously specifying the central value of a set of data does not tell the whole story. We also need to consider the
variability (or spread or dispersion) of the data.

 The Range

The simplest measure of dispersion is the range which is simply the difference between the largest and smallest
values in the data set. If we have grouped data then we cannot calculate an exact range, only an upper limit.

 Example:
For the water salinity data, the largest observation is 15.1 and the smallest is 4.3. Therefore,
range = 15.1 - 4.3 = 10.8.

Note: The range is sensitive to the presence of one or two extremely large or small values in the data.

 Inter-quartile range

This is usually a more useful measure of dispersion than the range because it is unaffected by a few extreme
values. It is simply the difference between the upper and lower quartiles. The inter-quartile range contains the
middle half of the data set.

 Example:
We calculated the upper and lower quartiles for the water salinity data to be 13.05 and 7.95 respectively.
Therefore,
Inter-quartile range = 13.05 - 7.95 = 5.1.
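
For comparison, both the range and the inter-quartile range can be computed directly from the salinity values. This is only a sketch: it assumes the median-of-halves rule for the quartiles described earlier, and the variable names are illustrative.

```python
from statistics import median

salinity = [7.6, 7.7, 4.3, 5.9, 5.0, 10.5, 7.7, 9.5, 12.0, 12.6,
            6.5, 8.3, 8.2, 13.2, 12.6, 13.6, 14.1, 13.5, 11.5, 12.0,
            10.4, 10.8, 13.1, 12.3, 10.4, 13.0, 14.1, 15.1]

ordered = sorted(salinity)
n = len(ordered)
half = (n + 1) // 2  # the middle value is shared by both halves when n is odd

data_range = ordered[-1] - ordered[0]
lower_quartile = median(ordered[:half])
upper_quartile = median(ordered[n - half:])
iqr = upper_quartile - lower_quartile

print(round(data_range, 2), round(iqr, 2))
```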

 The Mean Deviation

The deviations in a sample are the differences between each observation and the sample mean,

$$x_i - \bar{x}, \qquad i = 1, \dots, n.$$

One possible idea for obtaining a summary measure of the dispersion in the sample would be to calculate the
mean of these deviations. However, the mean of these deviations is always zero. [Think about why this should
be.]

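Instead we could take the absolute value of each of the deviations and calculate the mean of these. This gives the
mean (absolute) deviation:

$$\text{mean deviation} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|.$$

For grouped data the equivalent formula is:

$$\text{mean deviation} = \frac{1}{n} \sum_{k} f_k \, |m_k - \bar{x}|,$$

where $m_k$ is the midpoint of the kth class, $f_k$ is its frequency and $n = \sum_k f_k$.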

 Example
Twelve students record their weight in kg, creating the following sample:
50, 51, 61, 75, 62, 73, 64, 86, 65, 58, 73, 59.

The mean of these 12 observations is:

$$\bar{x} = \frac{777}{12} = 64.75 \text{ kg}.$$

The deviations of each value from the mean are:

-14.75, -13.75, -3.75, 10.25, -2.75, 8.25, -0.75, 21.25, 0.25, -6.75, 8.25, -5.75.

So the mean deviation is:

$$\text{Mean deviation} = \frac{1}{12} \sum_{i=1}^{12} |x_i - \bar{x}| = \frac{96.5}{12} = 8.04 \text{ kg}.$$
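
The calculation can be checked with a few lines of Python. The sketch below also confirms that the raw deviations sum to zero, which is why their absolute values are averaged instead; the variable names are illustrative.

```python
weights = [50, 51, 61, 75, 62, 73, 64, 86, 65, 58, 73, 59]

n = len(weights)
mean = sum(weights) / n
deviations = [x - mean for x in weights]

# The signed deviations cancel out, so their mean is zero.
sum_of_deviations = sum(deviations)

# Averaging the absolute deviations gives the mean (absolute) deviation.
mean_deviation = sum(abs(d) for d in deviations) / n

print(mean, sum_of_deviations, round(mean_deviation, 2))
```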

 The Sample Variance and Sample Standard Deviation

Instead of taking the absolute values of the deviations (so that the positive and negative deviations don't just
cancel each other out), we could use the squares of the deviations. The sample variance (usually denoted by $s^2$)
can be thought of as an ‘average’ of the squared deviations.

The sample variance is defined by:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$

Note that although we are summing n squared deviations, we divide through by n – 1. This is important! The
reason why we use n - 1 and not n in the definition of the sample variance will become apparent later on in the
course when we look at unbiased estimators.

The disadvantage of using the sample variance is that it is not measured in the units of measurement used for the
data, but in squared units. This problem is overcome by using the standard deviation. The sample standard
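deviation is simply the square root of the sample variance, i.e.

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}.$$

Note: For grouped data, we use the following definition for a sample s.d.:

$$s = \sqrt{\frac{1}{n-1} \sum_{k} f_k (m_k - \bar{x})^2},$$

where $m_k$ and $f_k$ are the midpoint and frequency of the kth class and $n = \sum_k f_k$.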

 Example
Consider again the weights of the 12 students given above. The deviations from the mean were:
-14.75, -13.75, -3.75, 10.25, -2.75, 8.25, -0.75, 21.25, 0.25, -6.75, 8.25, -5.75.

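So the sample variance is:

$$s^2 = \frac{(-14.75)^2 + (-13.75)^2 + \dots + (-5.75)^2}{11} = \frac{1200.25}{11} = 109.1136 \text{ kg}^2.$$

This means that the sample standard deviation is $s = \sqrt{109.1136} = 10.446$ kg.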
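
As a check, the definition can be applied directly in Python and compared with the standard library's statistics.variance and statistics.stdev, which also divide by n - 1. This is a sketch for illustration; the variable names are not part of the notes.

```python
import math
import statistics

weights = [50, 51, 61, 75, 62, 73, 64, 86, 65, 58, 73, 59]

n = len(weights)
mean = sum(weights) / n

# Sample variance: sum of squared deviations divided by n - 1.
s2 = sum((x - mean) ** 2 for x in weights) / (n - 1)
s = math.sqrt(s2)

# Cross-check against the standard library (both use the n - 1 divisor).
library_s2 = statistics.variance(weights)
library_s = statistics.stdev(weights)

print(round(s2, 4), round(s, 3))
```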

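Result:
Using the above formula to calculate the sample variance by hand can be laborious. In general it is easier to use
the expression:

$$s^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right] = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right].$$

To calculate the variance using this expression we need to know the sum of the observations and the sum of the
squares.

Proof:
We need to show that both formulae for the sample variance are equivalent. It suffices to show:

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2.$$

Now,

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} \left( x_i^2 - 2\bar{x} x_i + \bar{x}^2 \right) = \sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + n\bar{x}^2.$$

But, $\sum_{i=1}^{n} x_i = n\bar{x}$, so

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2,$$

as required.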

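Note:
There is an equivalent expression for grouped data, so that:

$$s^2 = \frac{1}{n-1} \left[ \sum_{k} f_k m_k^2 - \frac{1}{n} \left( \sum_{k} f_k m_k \right)^2 \right],$$

where $m_k$ and $f_k$ are the midpoint and frequency of the kth class and $n = \sum_k f_k$.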

 Example 1:
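Consider again the student weight data:
50, 51, 61, 75, 62, 73, 64, 86, 65, 58, 73, 59.

We can check that the new formula for calculating the variance does in fact give us the same result:

$$\sum x_i = 777, \qquad \sum x_i^2 = 51511.$$

So,

$$s^2 = \frac{1}{11} \left[ 51511 - \frac{777^2}{12} \right] = \frac{51511 - 50310.75}{11} = \frac{1200.25}{11} = 109.1136,$$

as before.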
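
The same check can be sketched in Python, using only the sum of the observations and the sum of their squares. The variable names are illustrative.

```python
weights = [50, 51, 61, 75, 62, 73, 64, 86, 65, 58, 73, 59]

n = len(weights)
sum_x = sum(weights)                   # sum of the observations
sum_x2 = sum(x * x for x in weights)   # sum of the squared observations

# Shortcut formula: no deviations from the mean are needed.
s2_shortcut = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Definition, for comparison.
mean = sum_x / n
s2_definition = sum((x - mean) ** 2 for x in weights) / (n - 1)

print(sum_x, sum_x2, round(s2_shortcut, 4))
```

The shortcut is convenient for hand or calculator work. In floating-point arithmetic it subtracts two large, nearly equal numbers, so it can lose accuracy when the mean is very large compared with the spread; for small data sets such as this one the two versions agree.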

 Example 2:
For an example of grouped data, consider the mercury data again:

Interval    Mid-point, m_k    Frequency, f_k

0 - 10            5                 5
10 - 20          15                11
20 - 30          25                10
30 - 40          35                 9
40 - 50          45                 2
50 - 60          55                 1
60 - 70          65                 2

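Here we have,

$$n = \sum_k f_k = 40, \qquad \sum_k f_k m_k = 1030, \qquad \sum_k f_k m_k^2 = 35400,$$

where $m_k$ denotes the class mid-points and $f_k$ the class frequencies.

So,

$$s^2 = \frac{1}{39} \left[ 35400 - \frac{1030^2}{40} \right] = \frac{35400 - 26522.5}{39} = \frac{8877.5}{39} = 227.6282.$$

The sample standard deviation is therefore $\sqrt{227.6282} = 15.09$.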
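
A short Python sketch of the grouped calculation is given below. It treats every observation in a class as if it sat at the class mid-point, which is the approximation the grouped formula makes; the variable names are illustrative.

```python
import math

midpoints = [5, 15, 25, 35, 45, 55, 65]
frequencies = [5, 11, 10, 9, 2, 1, 2]

n = sum(frequencies)
sum_fm = sum(f * m for f, m in zip(frequencies, midpoints))
sum_fm2 = sum(f * m * m for f, m in zip(frequencies, midpoints))

grouped_mean = sum_fm / n
# Grouped version of the shortcut formula, with the n - 1 divisor.
s2 = (sum_fm2 - sum_fm ** 2 / n) / (n - 1)
s = math.sqrt(s2)

print(n, sum_fm, sum_fm2, round(s2, 4), round(s, 2))
```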

 Exercise:
A sample of 50 adults were asked how many lottery tickets they purchased last week:

Number of lottery tickets    0    1    2    3    4    5
Frequency                   19   11   10    3    4    3

Find the sample standard deviation.

Note:
Find out how to use your calculator’s statistical mode to calculate standard deviations.

2.6 Box-and-whisker plots

Box-and-whisker plots aim to highlight a few important features of a data set. They are based on the following
location summaries: minimum, lower quartile, median, upper quartile and maximum. These 5 quantities are
sometimes referred to as the five-number summary.

 Simple Example:
The numbers of runs scored by a batsman on 14 occasions are as follows:
40, 22, 17, 50, 24, 48, 5, 0, 28, 19, 30, 25, 16, 37.
Ordering these values we get:
0, 5, 16, 17, 19, 22, 24, 25, 28, 30, 37, 40, 48, 50.
The five-number summary then is:
Minimum value = 0
Lower quartile, Q1 = 17
Median, Q2 = 24.5
Upper quartile, Q3 = 37
Maximum value = 50

The box-and-whisker plot then looks like:

In the above diagram, the box indicates the interquartile range. The whiskers go from the lower and upper
quartiles to the smallest and largest observations respectively. The median is represented by a line within the
box.
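
The five-number summary behind such a plot can be sketched in Python. The function name five_number_summary is an assumption for this sketch, and the quartiles follow the median-of-halves rule used in this course, so other software may give slightly different quartiles.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2  # the middle value is in both halves when n is odd
    return (ordered[0],
            median(ordered[:half]),
            median(ordered),
            median(ordered[n - half:]),
            ordered[-1])


runs = [40, 22, 17, 50, 24, 48, 5, 0, 28, 19, 30, 25, 16, 37]
summary = five_number_summary(runs)
print(summary)  # (0, 17, 24.5, 37, 50)
```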

Note: the position of the median within the box gives an indication of whether the data are skewed:
 Symmetry: $Q_3 - Q_2 = Q_2 - Q_1$;
 positive skew: $Q_3 - Q_2 > Q_2 - Q_1$;
 negative skew: $Q_3 - Q_2 < Q_2 - Q_1$.

Box-and-whisker plots are especially useful for comparing two different data sets as they give a simple picture
of the locations and spreads of different distributions.
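
The skewness indication given by the position of the median within the box can be expressed as a comparison of the two halves of the box: the distance from the median to the upper quartile against the distance from the lower quartile to the median. The sketch below assumes this standard comparison; the function name is hypothetical.

```python
def skew_from_quartiles(q1, q2, q3):
    """Classify skewness from the position of the median inside the box."""
    upper_gap = q3 - q2  # median to upper quartile
    lower_gap = q2 - q1  # lower quartile to median
    if upper_gap > lower_gap:
        return "positive skew"
    if upper_gap < lower_gap:
        return "negative skew"
    return "symmetric"


# Batsman data: Q1 = 17, Q2 = 24.5, Q3 = 37, so the gaps are 12.5 and 7.5.
batsman_shape = skew_from_quartiles(17, 24.5, 37)
print(batsman_shape)  # positive skew
```

This is only a rough guide: it uses three numbers from the sample, so small differences between the two gaps should not be over-interpreted.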

 Example:
The numbers of hysterectomies performed by 15 male doctors and 10 female doctors are given below:

Male doctors 20 25 25 27 28 31 33 34 36 37 44 50 59 85 86
Female doctors 5 7 10 14 18 19 25 29 31 33

First of all we need to find the five-number summaries for the two data sets.

Summary statistic    Male doctors    Female doctors

Minimum                  20               5
Lower quartile           27.5            10
Median                   34              18.5
Upper quartile           47              29
Maximum                  86              33

 Exercise
Consider again the protein assimilation efficiency data given in Section 2.2.3. We then had the following stem-
and-leaf diagram:

An ordered back-to-back stem-and-leaf diagram showing the protein data

A.E.s for field mice        A.E.s for voles

                     5L  2
                  8  5H
          4 3 3 2 1  6L  0 3 4 4
              9 8 5  6H  7 7 8 9
            2 1 1 0  7L  0 0 0 2 4 4 4
                6 5  7H  5 7 8

Scale: Stem = 10’s Leaves = 1’s

Draw box-and-whisker plots for the field mice and voles and compare the shapes of these.

Note:
Minitab calculates the quartiles slightly differently to the method used in this course. Consequently, slightly
different values for the quartiles can arise when using Minitab.
