AS Level Mathematics Statistics (New)

Year 1 Statistics Notes
Contents
0 Mathematical models in probability and statistics
0.1 The definition of Probability
0.2 The definition of Statistics
0.3 Usefulness of models in probability and statistics
1 Data collection
1.1 Types of data
1.2 Using Statistics to solve problems
1.3 Sampling
2 Data processing, presentation and interpretation
2.1 Presenting different types of data
2.2 Ranked data
2.3 Discrete numerical data
2.4 Continuous numerical data
2.4.1 Displaying continuous grouped data
2.4.2 Estimating summary measures from grouped continuous data
2.4.3 Cumulative frequency graphs
2.5 Bivariate data
2.5.1 Displaying bivariate data
2.5.2 Dependent and Independent variables
2.5.3 Random and non-random variables
2.5.4 Interpreting scatter diagrams
2.5.5 Summary measures for bivariate data
2.5.6 Lines of best fit
2.6 Variance and Standard Deviation
2.6.1 A new measure of spread
2.6.2 Identifying outliers
2.7 Linear coding
3 Probability
3.1 Introduction to Probability
3.2 Mutually exclusive events
3.3 Independent events
3.4 Risk
3.5 Probability distributions
3.5.1 Introduction to probability distributions
3.5.2 Properties of probability distributions
4 The Binomial Distribution
4.1 The Criteria And Probability Formula For The Binomial Distribution
4.2 Finding Cumulative Probabilities With The Binomial Distribution
5 Statistical hypothesis testing with the binomial distribution
5.1 Introduction to hypothesis testing
5.2 Hypothesis testing basics
5.3 Hypothesis test examples
Mr J Berwick
0 Mathematical models in probability and statistics

0.1 The definition of Probability
Probability measures how likely a random (or stochastic) event is to occur. For example, a coin
tossed in the air may land on heads or tails. Although individual events occur randomly, there is
normally an underlying pattern, or probability distribution, such as the Poisson, Binomial or
Normal distribution. Probability theory studies these distributions and their properties.
0.2 The definition of Statistics

Statistics is the study of data; it tells us how to collect data appropriately, how to present different
types of data, and how to analyse and interpret them. Statistics allows us to understand and make
predictions about real world stochastic phenomena. The basis of this statistical analysis is usually a
probability model that attempts to describe the behaviour of the data.
0.3 Usefulness of models in probability and statistics

• To simplify (or represent) a real world problem.
• To improve understanding.
• To analyse a real world problem.
• To make predictions or find estimates.
1 Data collection
1.1 Types of data
Data can be described as a series of facts from which conclusions can be drawn. To collect data, it
is necessary to measure a property, something that is called a variable. Variables can be categorised
depending on their types:
• A variable is quantitative if it can take a numerical value:
– A quantitative variable that can take any value in a given range is continuous.
– A quantitative variable that has clear steps between its possible values is discrete.
• A variable is qualitative if it is not possible for it to take a numerical value.

For example, height is a continuous variable, whilst shoe size is a discrete variable (with both, of
course, being quantitative variables). Favourite academic subject is a qualitative variable, since it
does not take numerical values. A random variable is a quantity whose value depends on chance. For
example, the number of heads that you would get from tossing a coin 10 times is a (discrete) random
variable.
1.2 Using Statistics to solve problems

Statistics may seem boring to some people, but it is one of the most useful areas of maths in the ‘real
world’ because it uses collected data to solve many problems. Statistics has helped with many problems
in recent years, and health is one of the biggest areas. Finding a relationship between
a variable such as how much you drink and a disease such as cancer is valuable, but a relationship does not imply
causation. Articles tend to overemphasise findings of studies by focusing on these relationships. The
lesson here is that we should read with caution!
Example
Suppose that Suzanne, who takes calls for the emergency services in Birmingham, is asked by her
manager whether they need more staff to cope with the increase in emergency calls per day in her
area. Suzanne decides that the best way to answer this is to carry out an in-depth analysis.
She will use a model known as ‘the problem solving cycle’, which is explained
in more detail later on. Suzanne will have to think about the following:
1. Problem specification and analysis
• What issue am I going to address? For example, Suzanne’s issue is whether her
workplace needs more staff to cope with the increase in emergency calls in Birmingham.
• What type of data am I collecting? For example, continuous, discrete, categorical data or
a mix.
2. Information collection
• What is the best way to collect the data that I want?
3. Processing and representation
• How am I going to present the data? E.g. the use of scatter graphs?
4. Interpretation
• How will I be able to interpret the data and explain my conclusions?
Problem specification and analysis
Suzanne needs to understand the problem as a whole before she starts collecting any kind of data.
She decides to focus on three questions for her analysis:
1. How many calls can an average person who works for the emergency services take in Birmingham
per hour?
2. How many people are answering calls at one time?

3. How many calls do the emergency services in Birmingham receive every hour?
Information collection
Suzanne can now collect data as she has highlighted the questions she wants to answer.
• To work out how many calls the average emergency services worker can answer
per hour, she decides to collect data on 50 people and then take an average.
• To answer the second question, she looks at the rotas and shift times to work out the
average number of people working at one time.
• Suzanne looks at the phone data in order to see how many calls are being received per hour.
Processing and representation
For the first question, the data that Suzanne will collect are the number of calls answered per
hour by each of the 50 people in her sample. These data are referred to as raw
data. The first thing that Suzanne must do is clean these data (the word ‘data’ is plural).
Cleaning involves dealing with outliers, missing data and errors.
Outliers are extreme values that are detached from the main body of data. Suzanne notices that
one person has apparently answered an unusually high 80 calls in an hour. She
decides to investigate. It turns out that this number is so high because a number of calls were
transferred to this person and then the person was cut off for some unknown reason. Suzanne decides that
this person has not truly answered 80 calls. She discards this person from her study because the
value would distort the average.
Suzanne then investigates whether there are any missing data. It turns out that one person’s plug
socket failed for a total of 5 minutes in the hour from which she wanted to take the raw data. She
decides to adjust for this by finding this person’s average number of calls per 5 minute interval
during the hour, adding that to the person’s total for the hour, and rounding to the nearest whole
number. Do you think this is fair?
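The adjustment Suzanne describes can be sketched as follows. This is only one reading of her method: it assumes the average is taken over the 5 minute intervals that were actually recorded, and the figure of 22 calls is made up for illustration, as is the function name.

```python
import math

def adjust_for_downtime(calls_taken, minutes_missing, interval=5, hour=60):
    """Estimate a full-hour call count when part of the hour was not recorded.

    Assumption: the average is taken over the intervals that WERE recorded.
    """
    recorded_intervals = (hour - minutes_missing) / interval
    missing_intervals = minutes_missing / interval
    average_per_interval = calls_taken / recorded_intervals
    estimate = calls_taken + average_per_interval * missing_intervals
    return math.floor(estimate + 0.5)  # round to the nearest whole number

# Hypothetical figure: 22 calls logged in the 55 minutes that were recorded.
print(adjust_for_downtime(22, 5))  # 24
```

Whether this is fair depends on the assumption that the person would have answered calls at the same rate during the missing 5 minutes.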
Whilst looking through the data, Suzanne notices that not all of the values are whole numbers, even
though the number of calls is discrete. One record shows that someone answered 50.5 calls. This is
impossible, so she rounds it to 51. Suzanne does this because she believes that a problem with the
processing of this result caused the error.
This process is repeated for questions two and three.
Interpretation
Suzanne must now either write a small report or verbally communicate with her manager about
her results and comment on whether she thinks there is any need for more staff.
The problem solving cycle
In the example that you have just read, Suzanne was presented with a problem which required her to
use statistics. The steps she took fit into the problem solving cycle. This is a general process used
for investigation and problem solving.
1. Problem specification and analysis
This stage is where we define the problem at hand. We recognise that there is a problem and
we think about how we can solve this problem. After this, we need to set out what type of data
we are to collect, what sampling technique we are going to use to collect these data (see next
section), and how we are going to present it once we have it.
2. Information collection
This often involves taking what is known as a sample from all the possible data using a specific
sampling technique. However, we can sometimes collect data on the entire population which is
called a census (the government does a census of the UK population every 10 years). There
are various sampling methods for collecting data from a given population. We will discuss
these methods later on.
3. Processing and representation
First of all, we clean the data, as explained in the example above. We usually follow
this by presenting the main features of the data using diagrams such as pie charts
(for categorical data). We then calculate summary measures, which you do not need to worry about
just yet.
4. Interpretation
This is where we put all of our data analysis together. We make conclusions from what we have
found out, and try to answer the question that was first asked.
1.3 Sampling
A general election has just been and gone again. Suppose that we wish to predict the next
general election result for Sutton Coldfield by sampling. The Conservative candidate
is Andrew Mitchell, the Labour candidate is Rob Pocock, and the Liberal Democrat candidate is
Jennifer Wilkinson. Suppose that we ask 100 people in Sutton Coldfield town centre how they
intend to vote at the next general election. The results are as follows:
• Andrew Mitchell got 30 votes.
• Rob Pocock got 50 votes.
• Jennifer Wilkinson got 20 votes.
Does this imply that Labour will take this seat from the Conservative party? Why? Do these results
from the 2017 general election help in any way?
There should be a few questions going around your head by now.

1. How was the sample selected? (sampling technique.)
2. Was the sample representative of the whole population? (was it fair or unfair?)
3. Was the sample large enough to be representative?
4. Those who were asked about their voting intention were actually asked ‘Who would make the
best MP?’. Will this matter?
Terminology and notation

• A sample is a subset of the population, selected using a sampling technique,
which is used to draw conclusions about the population as a whole.
• The overarching set is called the population. The population can be finite, such as every person
inside this school, or infinite, for example the points where a golf ball can land on a golf course.
• A population is normally described in terms of its parameters, such as its mean which is
denoted as µ. Note that Greek letters are used to denote parameters of populations whilst
Roman letters are used to denote the equivalent sample values.
• Individuals of a population are often numbered to form a list called a sampling frame. This
could, for example, be a list of pigs in a pen. In many cases, no sampling frame exists: for
example, there is no list of the cod in the North Atlantic. The proportion of the available items
that are actually sampled is called the sampling fraction. A census is where we collect data from
every member of the population.
Why do we take a sample?
There are many reasons as to why we may want to take a sample of a population. The three main
reasons are:
• To obtain information as part of a pilot study to inform a proposed investigation.
• To estimate the values of the parameters of the population.
• To conduct a hypothesis test (we will come to this later on).
When sampling from the population, we need to think about how we collect the sample, and how we
can ensure that it is representative of the population. For example, the sample in the example above
was of poor quality because it was not representative of Sutton Coldfield.
An estimate of a parameter such as the mean of the population from the sample data will often
be different to the true value. The difference between the means of the sample and the population is
what we call sampling error. To reduce the sampling error, you want the sample to be as represen-
tative of the population as possible. However, this is easier said than done in many cases.
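The idea of sampling error can be illustrated with a small simulation. The population below is entirely made up (1000 random heights), and the names are illustrative only; the sketch simply compares sample means with the population mean.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1000 made-up heights in cm.
population = [rng.gauss(170, 10) for _ in range(1000)]
mu = mean(population)  # the population parameter

def sampling_error(sample):
    """Sample mean minus the population mean."""
    return mean(sample) - mu

small_sample = rng.sample(population, 10)
large_sample = rng.sample(population, 500)

print(round(sampling_error(small_sample), 2))
print(round(sampling_error(large_sample), 2))
print(sampling_error(population))  # a census: the error is 0.0
```

Larger random samples tend to give smaller sampling errors, although any single sample can be lucky or unlucky.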
In the UK, you may have heard of the population census. The population census is carried
out every 10 years on the entire population of the UK by the Office for National Statistics. The
population census gathers important information about the population of the UK, such as
the sizes and population densities of major cities. It is carried out every
10 years rather than every year due to the sheer expense of it. It is worth noting that the sampling
error here is 0 as we have taken data on the entire population. Note that a census cannot always be
carried out. For example, we could not take a census of the population of cod in the sea.
When we are just about to take a sample, we should be asking ourselves the following questions:
• Are the data relevant?
If in the politics example we asked ‘Who would make the best MP?’, then this is seen as obtaining
irrelevant information. We should ask ‘Who do you intend to vote for?’.
• Are the data likely to be biased?
Let’s say that we wanted to find out the average height at which people can jump. If we were
to take a sample of men on the Olympic high jump team, then this would not be representative
of the population.
• Does the method of collection distort the data?

Try not to ask leading questions like ‘Are you a driver who obeys the speed limits?’ as this
is leading the person to say ‘yes’.
• Is the right person collecting the data?
If a university wanted to find out the percentage of students in the university that would choose
partying over studying and got the lecturers to do the sample, then the university has clearly
not chosen the right people to do the job.
• Is the sample large enough?
The political party example that we have used is perfect for this. In order to get a true ‘feel’ of
the voting intentions of Sutton Coldfield, we would have to sample a decent number of people
in the district.
• Is the sampling procedure appropriate in the circumstances?

If we were to ask people about their voting intentions by stopping passers-by in a town,
then this would not be appropriate. The people asked in the survey would probably not be in work,
and may therefore have different voting habits to the people who are working.
Sampling techniques
Simple random sampling
A simple random sample is a sample chosen in such a manner that each possible sample of a given
size has the same chance of being selected. It follows that in such a procedure every member of the
population is equally likely to be selected. In order to carry out this method of sampling, we would
need a sampling frame.
Please note that the converse of this is not true! Suppose that we wish to obtain a random
sample of 15 students from a class consisting of 15 boys and 15 girls. To select that sample, we toss an
unbiased coin: if it shows heads, the 15 boys are chosen, and if it shows tails, the 15 girls are
chosen. The sampling method is clearly flawed, because only two of the possible samples can ever be
selected, but it still satisfies the criterion that ‘every member of the population is equally likely
to be selected’.
An advantage of this sampling technique is that the sample is probably representative of the popula-
tion. Disadvantages of this sampling technique are that it can be time consuming and we always need
a sampling frame.
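A minimal sketch of the two methods discussed in this subsection, using a hypothetical sampling frame for the class of 15 boys and 15 girls. The function names are made up for illustration; `random.sample` is the standard library call that draws without replacement so that every subset of the requested size is equally likely.

```python
import random

# Hypothetical sampling frame: 15 boys and 15 girls.
sampling_frame = [f"boy {i}" for i in range(1, 16)] + [f"girl {i}" for i in range(1, 16)]

def simple_random_sample(frame, size, rng=random):
    """Every possible sample of the given size is equally likely."""
    return rng.sample(frame, size)

def coin_toss_sample(frame, rng=random):
    """The flawed method: heads picks all the boys, tails all the girls.

    Each pupil still has probability 1/2 of being chosen, but only two
    of the possible samples can ever occur.
    """
    chosen = "boy" if rng.random() < 0.5 else "girl"
    return [name for name in frame if name.startswith(chosen)]

print(simple_random_sample(sampling_frame, 15))
print(coin_toss_sample(sampling_frame))
```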
Stratified sampling
Let us yet again go back to the political party example. In that situation, we may want to iden-
tify a number of different sub-groups which may have different voting patterns. These sub-groups are
known as strata. For example, we could have youth, middle-aged, and elderly voters as our strata.
Within each group, or stratum, a probability sample is then selected.
An advantage of this sampling technique is that it minimises sample selection bias by ensuring cer-
tain segments of the population are not overrepresented or under-represented. Disadvantages of this
sampling technique are that it is sometimes difficult to split the population into naturally occurring
groups and we always need a sampling frame.
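Stratified sampling might be sketched as below. The notes only say that a probability sample is selected within each stratum; sizing each stratum's sample in proportion to its share of the population is an assumption made here, and the electorate of 1000 voters is invented for illustration.

```python
import random

def stratified_sample(strata, total_size, rng=random):
    """Simple random sample from each stratum, with sizes proportional to
    the stratum's share of the population (an assumed allocation rule)."""
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        size = round(total_size * len(members) / population_size)
        sample[name] = rng.sample(members, size)
    return sample

# Hypothetical electorate of 1000 voters split into three age strata.
strata = {
    "youth": [f"Y{i}" for i in range(200)],
    "middle-aged": [f"M{i}" for i in range(500)],
    "elderly": [f"E{i}" for i in range(300)],
}
sample = stratified_sample(strata, 100)
print({name: len(chosen) for name, chosen in sample.items()})
# {'youth': 20, 'middle-aged': 50, 'elderly': 30}
```

With less convenient numbers, the rounded stratum sizes may not add up exactly to the intended total, so a small adjustment would be needed.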
Systematic sampling
Systematic sampling is a method where individuals are chosen at regular intervals from a sampling
frame. Say that we have a sampling frame which has a size of 100. If we wanted a sample of 25 people,
then we could pick every 4th person on the list, going up to the 100th person.
Advantages of this sampling technique are that it assures that the population will be evenly sam-
pled and it is quick and easy. Disadvantages of this sampling technique are that there may be missing
values in the population and we always need a sampling frame.
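The example of picking every 4th person from a frame of 100 can be sketched as follows. The function name and the optional `start` parameter are illustrative; choosing the starting position at random is a common variation that the notes do not describe.

```python
def systematic_sample(frame, sample_size, start=None):
    """Pick items at regular intervals from an ordered sampling frame."""
    step = len(frame) // sample_size      # 100 // 25 = 4
    if start is None:
        start = step - 1                  # begin at the 4th person, as in the text
    return frame[start::step][:sample_size]

frame = list(range(1, 101))               # people numbered 1 to 100
sample = systematic_sample(frame, 25)
print(sample[:3], sample[-1], len(sample))  # [4, 8, 12] 100 25
```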
Quota sampling
Quota sampling is when an interviewer or researcher splits the population into groups or strata that
are representative of the whole population. Judgement is then used to select the members from each
group.
An advantage of this sampling technique is that we do not need a sampling frame. Disadvantages of
this sampling technique are that it is non-random and can be biased.
Opportunity (or convenience) sampling
Opportunity sampling consists of taking a sample of people who are available at the time the study
is carried out and who fit the criteria (e.g. a smoker) you are looking for. An example of when this
could be done is at a conference for teachers. This is seen as the weakest form of sampling as it can
lead to bias.
An advantage of this sampling technique is that it is easy to select a sample. Disadvantages of
this sampling technique are that it is non-random and can be biased.
2 Data processing, presentation and interpretation

2.1 Presenting different types of data
It is first worth thinking about what kind of data we have to begin with. Are the data categorical? Are the
data continuous? Are the data discrete? The answers to these questions matter! We can display distance
against time on a line graph, but we would display the number of people with different eye colours on
a bar chart. There are various ways that we can display categorical data, but pictograms, bar charts,
dot plots and pie charts are by far the most common. Let us look at an example of these displays below.
Example
A school in Cambridge wishes to investigate the most popular type of car in their area. They decide
to take a sample of 40 cars from a point on a main road. The results are shown below.
Type of car     Frequency
Aston Martin    4
Toyota          5
Ford            20
Mini Cooper     5
Mercedes        6
The display in Figure 1 is known as a pictogram. In this example we have used the scale that one
picture of a car represents one car, but we could have used the scale that a picture of a car
represents 4 cars. If we had data on flowers then we could have had pictures of flowers. Always
specify the key in pictograms!
Figure 1: Pictogram
Figure 2: Dot plot of cars data
Figure 3: Bar chart of cars data
We should explain how the sector for each car make is calculated. There are 360 degrees in a circle so
there are 360 degrees in a pie chart. In order to draw a pie chart accurately, you will need a
protractor. Each sector is calculated by finding how many cars of that particular make there were
in the sample of 40 cars as a proportion, and then multiplying by 360 degrees. So for the Ford cars
we have $\frac{20}{40} \times 360 = 180$. So we give the Ford cars sector 180 degrees of the 360
degrees on the pie chart.
Figure 4: Pie chart of cars data
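The sector angles for all five makes can be checked with a short calculation. This is a minimal sketch using the frequencies from the cars table; the variable names are illustrative.

```python
# Frequencies from the sample of 40 cars.
cars = {"Aston Martin": 4, "Toyota": 5, "Ford": 20, "Mini Cooper": 5, "Mercedes": 6}

total = sum(cars.values())  # 40

# Sector angle = (frequency / total) x 360 degrees.
angles = {make: count * 360 / total for make, count in cars.items()}

for make, angle in angles.items():
    print(f"{make}: {angle} degrees")

print(sum(angles.values()))  # 360.0, so the sectors fill the whole circle
```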
We use measures that are called summary measures to sum up the data that we are investigating
(quite simple really). The main summary measure that is used with categorical data is the modal
class. The modal class is the class with the highest frequency. Therefore, the modal class of our
example above is Ford.
2.2 Ranked data

In statistics, it is sometimes convenient to rearrange data so that they are in order of size from
smallest to largest. Sometimes people then choose to rank their data, assigning the smallest number
a rank of 1 and the largest number a rank equal to the size of the data set, but we will tend not to
do this here as we can all work logically.
When presented with raw data, stem-and-leaf diagrams are a manner in which we can organise the
data. This involves placing all digits except the final digit in a column, and writing the final digits of
all data values in corresponding rows. This is best shown with an example.
Example
Create a stem-and-leaf diagram for the following data:
6.2, 3.1, 4.8, 9.1, 8.3, 6.2, 1.4, 9.6, 0.3, 0.3, 8.4, 6.1, 8.2, 4.3.
We begin by creating a column that shows the first digit of each value. This will begin at 0 (for the
two observations of 0.3) and finish at 9 (for 9.1 and 9.6). For the first value, 6.2, we place a 2 in the
row that begins with 6. We then continue by putting a 1 in the row beginning with 3, and continuing
until this has been done for every value. We finish by including a key so that the location of the
decimal point can be established: 0|3 could mean 03, 0.3 or 0.03, to give just three possibilities.
0 | 3 3
1 | 4
2 |
3 | 1
4 | 8 3
5 |
6 | 2 2 1
7 |
8 | 3 4 2
9 | 1 6
Key: 0|3 means 0.3
An ordered stem-and-leaf diagram is one where the numbers in each row (not including the leading
digit) are placed in ascending order. Using the data in the example above, the ordered stem-and-leaf
diagram for these data is
0 | 3 3
1 | 4
2 |
3 | 1
4 | 3 8
5 |
6 | 1 2 2
7 |
8 | 2 3 4
9 | 1 6
Key: 0|3 means 0.3
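As a rough sketch, an ordered stem-and-leaf diagram for data given to one decimal place could be built like this. The function name is made up for illustration, and the code assumes the key "stem|leaf means stem.leaf".

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Rows of an ordered stem-and-leaf diagram for values with one decimal place."""
    rows = defaultdict(list)
    for value in values:
        tenths = round(value * 10)         # 6.2 -> 62, avoiding floating point slips
        stem, leaf = divmod(tenths, 10)    # 62 -> stem 6, leaf 2
        rows[stem].append(leaf)
    lines = []
    for stem in range(min(rows), max(rows) + 1):   # include empty stems
        leaves = " ".join(str(leaf) for leaf in sorted(rows[stem]))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

data = [6.2, 3.1, 4.8, 9.1, 8.3, 6.2, 1.4, 9.6, 0.3, 0.3, 8.4, 6.1, 8.2, 4.3]
print("\n".join(stem_and_leaf(data)))
print("Key: 0|3 means 0.3")
```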
If we have two variables, it is possible to modify the stem-and-leaf diagram by placing the final digits
either side of any leading digits, so that the values for one variable are displayed on the left of the
central column, and the values for the other variable are displayed on the right of the central column.
An example of this is the following ordered back-to-back stem-and-leaf diagram that displays the
masses of a sample of males (on the left) and females (on the right):
          | 10 | 6
          | 11 | 4 8 8
          | 12 | 2 7 7
      4 2 | 13 | 5 6 8 9
    8 4 3 | 14 | 0 3 6 6 7
      5 1 | 15 | 3 5 9
          | 16 |
9 8 2 2 1 | 17 | 5
  7 6 1 0 | 18 |
      2 1 | 19 |
Key: 10|6 means 106 pounds
Note that both variables are ordered starting closest to the central column and moving outwards,
rather than starting at the left and moving to the right.
For a set of data, the median is a value where there is an equal number of values above and
below it. To find the median of a data set of n values, arrange the values in order of increasing size. If
n is odd, the median is the $\frac{1}{2}(n + 1)$th value. If n is even, the median is halfway between
the $\frac{1}{2}n$th value and the following value. The median of a data set is often written as Q2.
Example
Find the median of 49, 56, 55, 68, 61, 57, 61, 52, 63.
The data in ascending order are 49, 52, 55, 56, 57, 61, 61, 63, 68. We have nine values, so the
median will be the $\frac{1}{2}(9 + 1) = 5$th value. Therefore the median of the data is 57.
Example
Find the median of 49, 56, 55, 68, 61, 57, 61, 52, 63, 59.
The data in ascending order are 49, 52, 55, 56, 57, 59, 61, 61, 63, 68. We have ten values, so
the median is halfway between the $\frac{1}{2} \times 10 = 5$th and $5 + 1 = 6$th values, which are
57 and 59 respectively. Therefore the median of the data is $\frac{1}{2}(57 + 59) = 58$.
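The rule for odd and even n can be written as a short function and checked against both examples. This is a minimal sketch; the function name is illustrative.

```python
def median(values):
    """Median using the (n + 1)/2 rule for odd n and the midpoint rule for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the (n + 1)/2 th value (Python lists count from 0).
        return ordered[(n + 1) // 2 - 1]
    # Even n: halfway between the (n/2)th value and the following value.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(median([49, 56, 55, 68, 61, 57, 61, 52, 63]))      # 57
print(median([49, 56, 55, 68, 61, 57, 61, 52, 63, 59]))  # 58.0
```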
The median is known as a measure of central tendency. It usually provides a good
representative value. The mean, which you have probably come across before, can be distorted by
extreme values, whereas the median cannot.
The lower quartile (Q1) and the upper quartile (Q3) are the points below which 25% of the data
(for Q1) and 75% of the data (for Q3) lie.
There is no standard way to find the lower and upper quartiles, but Edexcel have a set way that
they wish you to do it.
The range of a set of data values is defined by the equation
range = largest value − smallest value .
Because the range only considers the smallest and largest values, it might not be representative of the
entire data. For this reason, the interquartile range is often used instead of the range, where
interquartile range = upper quartile − lower quartile = Q3 − Q1 .
The lower and upper quartiles lie at the positions one-quarter and three-quarters respectively of the
way through the data when the values are arranged in order. The way in which they are calculated
is summed up in the following table:
Example
Find the range and interquartile range for the following data:
7, 8, 11, 13, 15, 18, 19, 20, 22 .
The range is 22 − 7 = 15.
To calculate the lower quartile, we divide n by 4 and get 2.25. According to our table, we round
up and so the third value is the lower quartile. The lower quartile is 11.
To calculate the upper quartile, we divide 3n by 4 and get 6.75. According to our table, we round up
and so the seventh value is the upper quartile. The upper quartile is 19.
Thus the interquartile range is 19 − 11 = 8.
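The quartile calculation in this example can be sketched in code. The rounding-up rule for a non-integer position is taken from the worked example; the rule used when the position is a whole number (take the midpoint of that value and the next) is an assumption, since that case is not worked through in the text. The function names are illustrative.

```python
import math

def value_at(ordered, position):
    """Value at a 1-based position such as n/4 in an ordered list."""
    if position == int(position):
        k = int(position)
        # Assumed rule for a whole-number position: midpoint with the next value.
        return (ordered[k - 1] + ordered[k]) / 2
    # Non-integer position: round up, as in the worked example.
    return ordered[math.ceil(position) - 1]

def quartiles(values):
    ordered = sorted(values)
    n = len(ordered)
    return value_at(ordered, n / 4), value_at(ordered, 3 * n / 4)

data = [7, 8, 11, 13, 15, 18, 19, 20, 22]
q1, q3 = quartiles(data)
print(q1, q3, q3 - q1)  # 11 19 8
```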
It should be noted that interquartile range is often used to investigate whether certain values that
seem to be too extreme are outliers. A standard procedure is to investigate items that are at least 1.5
× interquartile range above Q3 or below Q1 .
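A minimal sketch of this procedure, using the quartiles from the example above (Q1 = 11, Q3 = 19). The function names are made up, and the value 40 is a hypothetical extra observation added only to show a value being flagged.

```python
def outlier_fences(q1, q3, k=1.5):
    """Boundaries that lie k x IQR below Q1 and above Q3."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def values_to_investigate(values, q1, q3):
    """Values at least 1.5 x IQR below Q1 or above Q3."""
    low, high = outlier_fences(q1, q3)
    return [v for v in values if v <= low or v >= high]

print(outlier_fences(11, 19))  # (-1.0, 31.0)

data = [7, 8, 11, 13, 15, 18, 19, 20, 22]
print(values_to_investigate(data, 11, 19))         # []
print(values_to_investigate(data + [40], 11, 19))  # [40]
```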
The semi-interquartile range is another measure of spread and is half of the interquartile range.
It is half the distance needed to cover half the scores. The semi-interquartile range is affected very
little by extreme scores. Semi-interquartile range is comparable to standard deviation which we
will talk about later on.
The median and quartiles are not the only divisions that are used when it comes to data! We may use
divisions called percentiles. Percentiles lie at the percentage points and are widely used with large
data sets. Q1, Q2, and Q3 are used to denote the 25th, 50th, and 75th percentiles.
The 90th percentile is the $\frac{90n}{100}$th data value when the data are in ascending order. If
$\frac{90n}{100}$ is not an integer, then choose the next value up so that 90% of the data would have
a value (not ranked value) less than the 90th percentile.
For a set of data, the median, lower quartile, upper quartile, minimum value and maximum value
together summarise the entire data. For this reason, they are often represented in a box and whisker
plot that shows these values in relation to each other.
Example
The data below give the number of fish caught each day over a period of 11 days by a fisherman:
0, 2, 5, 2, 0, 4, 4, 8, 9, 8, 8 .
Draw a box and whisker plot to represent these data.
Rearranging the data in ascending order gives
0, 0, 2, 2, 4, 4, 5, 8, 8, 8, 9 .
The minimum value is 0 and the maximum value is 9.
There are 11 values, so the median is the $\frac{1}{2}(11 + 1) = 6$th value. This is 4.
To calculate the lower quartile, we divide n by 4 and get 2.75. According to our table, we round
up and so the third value is the lower quartile. The lower quartile is 2.
To calculate the upper quartile, we divide 3n by 4 and get 8.25. According to our table, we round up
and so the ninth value is the upper quartile. The upper quartile is 8.
Plotting the five values (0, 2, 4, 8 and 9) gives us the following box and whisker plot:
2.3 Discrete numerical data

Sometimes we may have to work with discrete data. Remember that discrete data are data that can
take certain particular values. Sometimes we may not be given a frequency table. In fact, we may
have to sort it ourselves (which is exhausting I know!). Say that we collected the shoe sizes of a class
of 30, then we would have to create a frequency table with the 30 raw values.
We have already mentioned that the median is a measure of central tendency, but are there any
others? Of course there are! The median, the mode, and the mean are measures of central tendency
that are used for discrete data.
The mode of a data set is the value that occurs with the highest frequency. A data set can have
more than one mode if two or more values have the same maximum frequency. If two values occur
more frequently than the rest and have the same frequency, then the data set is said to be bimodal.
A data set has no mode if all of the values have the same frequency.
Example
Find the mode of 49, 49, 58, 61, 56, 55, 68, 61, 57, 61, 52, 63.
The only value that occurs more than twice is 61. This means that 61 is the modal value.
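The definition of the mode, including the bimodal and no-mode cases, can be sketched with a frequency count. The function name is illustrative, and returning an empty list for "no mode" is a design choice made here.

```python
from collections import Counter

def modes(values):
    """All values with the highest frequency; empty if every value is equally frequent."""
    counts = Counter(values)
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []  # all values have the same frequency: no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([49, 49, 58, 61, 56, 55, 68, 61, 57, 61, 52, 63]))  # [61]
print(modes([1, 1, 2, 2, 3]))                                   # [1, 2], bimodal
print(modes([1, 2, 3]))                                         # [], no mode
```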
The mean, $\bar{x}$, of a data set of n values is given by
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum x_i}{n}.$$
Example
Find the mean of 49, 56, 55, 68, 61, 57, 61, 52 and 63.
If the values are $x_1, x_2, \dots, x_9$, then $\sum x_i = 49 + 56 + 55 + 68 + 61 + 57 + 61 + 52 + 63 = 522$
and $n = 9$, so $\bar{x} = \frac{522}{9} = 58$.
A mean can also be found from a frequency table. The mean, $\bar{x}$, of a data set in which the
variable takes the value $x_1$ with frequency $f_1$, $x_2$ with frequency $f_2$, and so on is given by
$$\bar{x} = \frac{x_1 f_1 + x_2 f_2 + \dots + x_n f_n}{f_1 + f_2 + \dots + f_n} = \frac{\sum x_i f_i}{\sum f_i}.$$
Example
The number of pets that Year 11 pupils at a certain school have is given below. Find the mean
of these data.
Number of pets, $x_i$    Frequency, $f_i$    $x_i f_i$
0                        36                  0
1                        94                  94
2                        48                  96
3                        15                  45
4                        7                   28
5                        3                   15
6                        1                   6
Totals                   $\sum f_i = 204$    $\sum x_i f_i = 284$
We begin by finding the values of $x_i f_i$ and summing these and the frequencies (shown in the totals row above).
The mean, $\bar{x}$, is thus
\[ \bar{x} = \frac{284}{204} = 1.39 \text{ pets, correct to 3 significant figures.} \]
If the frequency table were grouped, the midpoint of each class would be used to calculate xi fi . We
will come to grouped frequency tables later so do not worry!
The median for the example above is between the 102nd value and 103rd value. Therefore, the
median is 1 pet. The mode is also 1 pet.
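The three averages for the pets table can be checked with a short Python sketch. The dictionary `pets` (value mapped to frequency) is just one convenient way of holding the table and is not part of the notes.

```python
# Frequency table from the example: number of pets -> frequency.
pets = {0: 36, 1: 94, 2: 48, 3: 15, 4: 7, 5: 3, 6: 1}

n = sum(pets.values())                       # total frequency
total = sum(x * f for x, f in pets.items())  # sum of x_i * f_i
mean = total / n

# Expand the table into a sorted list of raw values to read off the median.
values = sorted(x for x, f in pets.items() for _ in range(f))
# n is even here, so the median is halfway between the 102nd and 103rd values.
median = (values[n // 2 - 1] + values[n // 2]) / 2

# The mode is the value with the largest frequency.
mode = max(pets, key=pets.get)

print(round(mean, 2), median, mode)
```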
Perhaps the simplest measure of spread for this type of data is the range, the difference
between the highest and lowest values. The range for this data is 6. If we ranked the data then we
could find the lower and upper quartile and thus, the interquartile range. The most widely used
measure of spread is standard deviation, and this is covered later on. A nice way to display the
last example’s data would be to plot a vertical line chart. Try this for yourself.
Say that we have collected data on the scores of 95 pupils on their final examination for statistics.
It would be quite tedious to list each score for each student (which we could do on a stem and
leaf diagram). Instead, we present the data in a grouped frequency table.

Scores      Frequency
40 − 49     24
50 − 59     29
60 − 69     14
70 − 79     9
80 − 89     12
90 − 100    7

Grouping means putting the data into a number of classes. The number of data items falling into
any class is called the frequency for that class. When numerical data are grouped, each item of
data falls within a class interval lying between class boundaries.
The main advantage of grouping is that it makes it easier to display data, and to estimate some of
the summary measures (like the mean). However, since you have grouped the data, the summary
measures are only estimates now as you have lost the raw values.
The best way to display this kind of data is by drawing a simple bar chart like the one below.
Different types of distribution are described in terms of the position of their modes or modal groups.
Figure 5: Unimodal Figure 6: Uniform Figure 7: Bimodal
The measures of central tendency could lead to varying interpretations of a data set. It is valuable to
consider this in more detail.
The mean considers all values of a set of data. However, this means that it can be highly affected by
outliers. For instance, the mean of 40, 45, 55 and 200 is 85 which is not representative of the data.
The mean of the data has been affected by the outlier 200.
The median ignores outliers, and instead only considers the values in the middle of the distribu-
tion. This means it is not affected by outliers, although it ignores a large proportion of the data
obtained.
In general, if positive skew exists, the positive tail causes the mean to be greater than the median.
The median is closer to the lower quartile when a distribution is positively skewed. If negative
skew exists, the negative tail causes the mean to be less than the median. The median is closer to
the upper quartile when a distribution is negatively skewed. The three graphs below show no skew,
positive skew and negative skew respectively:
It should be noted that this relationship between skew and the difference between the mean and
median is not always true, especially for discrete distributions.
The mode may not seem as useful compared to the median and the mean. However, can you calculate
the mean and median of a qualitative data set? You cannot, but you can calculate a mode. Therefore,
the mode is still important in some cases.
2.4 Continuous numerical data

When we are investigating continuous data, we normally have to group the data. This includes two
special cases.
• The variable is actually discrete, but the intervals between values are very small (£0.001 for
example).
• The underlying variable is continuous, but the measurements are rounded. If we were measuring
heights of students, then rounding to the nearest mm would be an example of this.
2.4.1 Displaying continuous grouped data

The two ways that we can display continuous data using bars are frequency charts and histograms.
In both cases, there are no gaps between the bars.
Frequency chart
A frequency chart is used to display data that are grouped into classes of equal width. The
frequency table below shows the ages of a sample of 50 people in a secondary school, together with
its frequency chart.
Class boundaries Frequency

0 ≤ a < 10 0
10 ≤ a < 20 12
20 ≤ a < 30 14
30 ≤ a < 40 11
40 ≤ a < 50 7
50 ≤ a < 60 6
If the midpoints of each bar of a frequency chart are connected, a frequency polygon is constructed.
The frequency polygon for the data given in the example above is shown below:
Frequency charts and frequency polygons are a useful way of showing the shape of a distribution.
Please note that you can only use a frequency chart or a frequency polygon if all the classes are of
equal width.
Histograms
There are key differences between histograms and frequency charts:
• In a histogram, frequency is represented by the area of a bar and not by its height.
• The y-axis of a histogram represents frequency density and not frequency.
Frequency ∝ Frequency Density × Class Width.

We should note that this is different to what we saw in GCSE maths. There we always saw
that Frequency = Frequency Density × Class Width, but this was to keep it simple. Please refer to the
textbook to look at examples where Frequency is proportional to the product of Frequency Density
and Class Width. We will do an example where they are equal.
The class width is a measure of how wide each class is. Suppose that the data are grouped as
101 − 110, 111 − 120, 121 − 130, . . .. We begin by finding the class boundaries. For a value x to
be in the class 111 − 120, one might initially assume that 111 ≤ x ≤ 120. However, we have not
considered what occurs when 110 < x < 111, because these observations would then not belong to
any group. We instead take the class boundaries to be 110.5 ≤ x < 120.5, because, if x = 110.5,
it would be rounded to 111, which falls in the class 111 − 120. The class width of this class is then
120.5 − 110.5 = 10. Supposing that the corresponding frequency were 35, the frequency density of the
class (and therefore, the height of the bar) is $\frac{35}{10} = 3.5$.
Before continuing, suppose that these groups were actually 0 − 10, 11 − 20, 21 − 30, . . . . The class
boundaries for the classes 11 − 20 and 21 − 30 would be 10.5 ≤ x < 20.5 and 20.5 ≤ x < 30.5
respectively. However, this would suggest that the class boundaries for the class 0 − 10 are −0.5 ≤ x < 10.5.
As unusual as this might initially appear, this is the correct procedure, and is preferable to
altering the class width so as to avoid negative values, even if negative values are outside the range of
the variable. However, note that the exam board can choose to alter the class width to avoid negative
values. If they alter the class width, then this will be made clear in the question.
Example
The grouped frequency distribution below shows the height in inches of a sample of 39 people. Rep-
resent these data in a histogram.
Height (inches)   Frequency   Class Boundaries    Class Width   Frequency Density

62 − 63           4           61.5 ≤ h < 63.5     2             2
64 − 65           5           63.5 ≤ h < 65.5     2             2.5
66 − 67           8           65.5 ≤ h < 67.5     2             4
68 − 71           13          67.5 ≤ h < 71.5     4             3.25
72 − 75           5           71.5 ≤ h < 75.5     4             1.25
76 − 79           4           75.5 ≤ h < 79.5     4             1
We begin by adding columns for the class boundaries, class width and frequency density (the last
three columns of the table). The values of frequency density were worked out using the standard
formula; for the first class, frequency density $= \frac{4}{2} = 2$.
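The frequency density column can be reproduced with a few lines of Python. This is only a sketch: each class is stored as a hypothetical tuple of (lower boundary, upper boundary, frequency), and frequency is taken to equal frequency density × class width, as in this example.

```python
# (lower boundary, upper boundary, frequency) for each class in the table.
classes = [
    (61.5, 63.5, 4),
    (63.5, 65.5, 5),
    (65.5, 67.5, 8),
    (67.5, 71.5, 13),
    (71.5, 75.5, 5),
    (75.5, 79.5, 4),
]

# Frequency density = frequency / class width.
densities = [f / (upper - lower) for lower, upper, f in classes]
print(densities)  # [2.0, 2.5, 4.0, 3.25, 1.25, 1.0]

# The total area of the bars recovers the sample size.
total_area = sum(d * (upper - lower) for d, (lower, upper, _) in zip(densities, classes))
print(total_area)  # 39.0
```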
The following histogram can be drawn, with the bar for each class lying along the interval given
in the class boundary column, and the height of each bar being the frequency density:
2.4.2 Estimating summary measures from grouped continuous data

In this section we will look at an example of how to calculate the mean, the median, and the modal
class with a grouped continuous data set.
Example
A simple random sample of 95 people at a local office was taken and their heights were measured.
The results are shown in the table below. Calculate the mean, the median, and the modal class.
Height, h (cm)    Frequency, $f_i$

120 ≤ h < 130 2
130 ≤ h < 135 5
135 ≤ h < 140 11
140 ≤ h < 145 37
145 ≤ h < 155 32
155 ≤ h < 180 8
First of all, we need to create a column with the midpoints of the class intervals and a column that
calculates the value of each midpoint multiplied by the corresponding frequency.
Height, h (cm)    Frequency, $f_i$    Midpoint, $m_i$    $m_i \times f_i$

120 ≤ h < 130     2                   125                250
130 ≤ h < 135     5                   132.5              662.5
135 ≤ h < 140     11                  137.5              1512.5
140 ≤ h < 145     37                  142.5              5272.5
145 ≤ h < 155     32                  150                4800
155 ≤ h < 180     8                   167.5              1340
Totals            $\sum f_i = 95$                        $\sum m_i f_i = 13\,837.5$
Now we use the formula for the mean given earlier, with the midpoints as the data values. So
\[ \frac{\sum m_i f_i}{\sum f_i} = \frac{13\,837.5}{95} = 145.66 \text{ (2dp)}. \]
The median is the $\frac{n+1}{2} = \frac{96}{2} = 48$th value. The 48th value lies in the class interval 140 ≤ h < 145,
because the first three classes contain 2 + 5 + 11 = 18 values. In fact, the median is the 48 − 18 = 30th
value into the class interval 140 ≤ h < 145. Now we assume that the 37 measurements that are in that
class interval are evenly distributed, so that the median is equal to
$140 + \frac{30}{37} \times 5 = 140 + 4.054 = 144$ (3sf), where 5 is the class width. This method is sometimes
called (linear) interpolation.
The modal class is the class with the highest frequency density. If you do some quick calcula-
tions then this will show that it is the class 140 ≤ h < 145.
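The three estimates in this example can be sketched in Python as follows. The helper `interpolate` is a hypothetical name; it assumes, as above, that the values are spread evenly within each class.

```python
# (lower boundary, upper boundary, frequency) for each height class.
classes = [
    (120, 130, 2),
    (130, 135, 5),
    (135, 140, 11),
    (140, 145, 37),
    (145, 155, 32),
    (155, 180, 8),
]

n = sum(f for _, _, f in classes)

# Estimated mean: use each class midpoint as the representative value.
mean = sum((lo + hi) / 2 * f for lo, hi, f in classes) / n


def interpolate(position):
    """Estimate the value at a given position by linear interpolation within its class."""
    cumulative = 0
    for lo, hi, f in classes:
        if cumulative + f >= position:
            return lo + (position - cumulative) / f * (hi - lo)
        cumulative += f
    raise ValueError("position exceeds the total frequency")


median = interpolate((n + 1) / 2)

# Modal class: the class with the highest frequency density.
modal_class = max(classes, key=lambda c: c[2] / (c[1] - c[0]))

print(round(mean, 2), round(median, 3), modal_class[:2])
```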
2.4.3 Cumulative frequency graphs

To estimate the median, quartiles and percentiles (along with related measures) we can use cumulative
frequency graphs. Instead of representing the frequency of observations within a closed range (as in a
histogram), cumulative frequency graphs depict the total frequency of observations less than a certain
point. For instance, if there are 40 observations of a variable less than the value 50 (so that 50 is an
upper bound for the class), the point (50, 40) is plotted on the graph. The points plotted should then
be joined with straight lines, which facilitates further statistical analysis of the data.
Example
For the data used in the section with the histogram, draw the corresponding cumulative frequency
graph. Estimate the proportion of the sample whose heights were less than 69 inches.
From the data, the following is evident:
• There are 0 observations less than 61.5. The point (61.5, 0) should therefore be plotted.
• There are 4 observations less than 63.5. The point (63.5, 4) should therefore be plotted.
• There are 4 + 5 = 9 observations less than 65.5. The point (65.5, 9) should therefore be plotted.
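• There are 9 + 8 = 17 observations less than 67.5. The point (67.5, 17) should therefore be
plotted.
• There are 17 + 13 = 30 observations less than 71.5. The point (71.5, 30) should therefore be
plotted.
• There are 30 + 5 = 35 observations less than 75.5. The point (75.5, 35) should therefore be
plotted.
• There are 35 + 4 = 39 observations less than 79.5. The point (79.5, 39) should therefore be
plotted.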
Plotting these points and connecting the points with straight lines gives the following graph:
To find the proportion of the sample whose heights were less than 69 inches, we must estimate the num-
ber of people in the sample whose heights were less than 69 inches. The proportion is then this number
divided by the total number of people in the sample. If we were to draw a vertical line corresponding
to 69 inches, and then a horizontal line to where this meets the cumulative frequency axis, we obtain a
cumulative frequency of approximately 21.9 (this is shown on the graph). Therefore, the proportion of
the sample whose heights were less than 69 inches is approximately $\frac{21.9}{39} = 0.5615\ldots = 0.562$, correct
to 3 significant figures.
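Reading a value off a cumulative frequency graph drawn with straight lines is the same as linear interpolation between the plotted points. The sketch below uses the points obtained from the class boundaries and running totals of the height table; the function name `cumulative_frequency` is only illustrative.

```python
# (upper class boundary, cumulative frequency) for the height data.
points = [(61.5, 0), (63.5, 4), (65.5, 9), (67.5, 17), (71.5, 30), (75.5, 35), (79.5, 39)]


def cumulative_frequency(x):
    """Read the cumulative frequency at x from the straight-line graph."""
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return c0 + (x - x0) / (x1 - x0) * (c1 - c0)
    raise ValueError("x lies outside the range of the graph")


below_69 = cumulative_frequency(69)
proportion = below_69 / points[-1][1]
print(below_69, round(proportion, 2))
```

The computed cumulative frequency is 21.875, which agrees with the graph reading of approximately 21.9; the small difference in the final proportion comes only from that rounding.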
It is possible to approximate the median for a grouped frequency table by using interpolation or
a cumulative frequency graph. The median will only be an estimate because the data is grouped.
The number of values is typically large with a grouped frequency table, and so it is unnecessary to
consider whether the $\frac{1}{2}(n+1)$th value or the number halfway between the $\frac{1}{2}n$th value and the following
value is to be selected as the median (since any difference is negligible, especially when we recall that
the median will be an estimate anyway). Instead, the $\frac{1}{2}n$th value is normally used as the median in
the case of using grouped data. Also, the $\frac{1}{4}n$th, the $\frac{3}{4}n$th, and the $\frac{p}{100}n$th values are taken as $Q_1$,
$Q_3$, and the $p$th percentile in the case of having a grouped frequency table. Note that if we had used
the $\frac{1}{2}n$th value as the median in the last example, we would have ended up with the same answer to 3sf.
Example
Approximate the median playing time and the interquartile range of the ninety-five CDs in the table
below by using a cumulative frequency graph.
Playing time (min) Frequency
40 − 44 1
45 − 49 7
50 − 54 12
55 − 59 24
60 − 64 29
65 − 69 14
70 − 74 5
75 − 79 3
A cumulative frequency graph that passes through (39.5, 0), (44.5, 1), (49.5, 8), . . ., (79.5, 95) is
shown below:
Using the graph, the $\frac{1}{2} \times 95 = 47.5$th value corresponds to approximately 60 minutes. Therefore, the
median playing time of the ninety-five CDs is approximately 60 minutes.
Using the graph, the $\frac{1}{4} \times 95 = 23.75$th value corresponds to approximately 55 minutes. The
$\frac{3}{4} \times 95 = 71.25$th value corresponds to approximately 64 minutes. This gives an interquartile range of
approximately 64 − 55 = 9 minutes.
2.5 Bivariate data

The table below shows some credit data about 15 randomly selected people in the UK.
Income Limit Rating Cards Age Education Balance

1 14.891 3606 283 2 34 11 333
2 106.025 6645 483 3 82 15 903
3 104.593 7075 514 4 71 11 580
4 148.924 9504 681 3 36 11 964
5 55.882 4897 357 2 68 16 331
6 80.18 8047 569 4 77 10 1151
7 20.996 3388 259 2 37 12 203
8 71.408 7114 512 2 87 9 872
9 15.125 3300 266 5 66 13 279
10 71.061 6819 491 3 41 19 1350
11 63.095 8117 589 4 30 14 1407
12 15.045 1311 138 3 64 16 0
13 80.616 5308 394 1 57 7 204
14 43.682 6922 511 1 49 9 1081
15 19.144 3291 269 2 75 13 148
This credit data records balance (average credit card debt for a number of individuals) as well as
several quantitative predictors: age, cards (number of credit cards), education (years of education),
income (in thousands of dollars), limit (credit limit), and rating (credit rating). The data shown in
this table are known as multivariate because each data item covers several variables. If we
are investigating the relationship between limit and balance, then our items cover just two variables
and so the data are described as bivariate.
2.5.1 Displaying bivariate data

Bivariate data are often represented on a scatter graph (or diagram).
The scatter graph of limit vs balance is displayed on the right. We see that people with a high limit
tend to have a high balance. There is a high level of association between the two variables. If, as
in this case, high values of both variables occur together, and the same for low values, the association
is positive. If, on the other hand, high values of one variable are associated with low values of the
other, the association is negative.
A line of best fit is often drawn through the points on a scatter graph. If the points lie close to a
straight line the association is described as correlation. So correlation is linear association. The
equation of the line of best fit for the scatter graph of limit vs balance is y = 0.1967x − 453. The line
of best fit can be found using statistical functions on your calculator (we will do this). Major statistical
software like R or Microsoft Excel will output the equations for the lines of best fit automatically. The
equation shows that for every unit increase in limit, balance increases by 0.1967 units.
Interpolation is the prediction within the range of the data. As long as your regression model (line of
best fit) is accurate, i.e. there should not be too much scatter about the line of regression, it should
give reliable results.
Extrapolation is a prediction outside the range of the data. The further beyond the range of the
data the prediction is made, the less reliable the result.
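The difference between interpolation and extrapolation can be illustrated with the line of best fit quoted above, y = 0.1967x − 453. In this sketch the function names are hypothetical, and the range of the data is taken from the limit column of the table of 15 people.

```python
# Line of best fit quoted in the notes: balance = 0.1967 * limit - 453.
SLOPE = 0.1967
INTERCEPT = -453

# Credit limits of the 15 people in the table.
limits = [3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114,
          3300, 6819, 8117, 1311, 5308, 6922, 3291]


def predict_balance(limit):
    """Predict balance from limit using the quoted line of best fit."""
    return SLOPE * limit + INTERCEPT


def prediction_type(limit):
    """Say whether a prediction is made inside or outside the range of the data."""
    if min(limits) <= limit <= max(limits):
        return "interpolation"
    return "extrapolation"


for limit in (5000, 12000):
    print(limit, round(predict_balance(limit), 1), prediction_type(limit))
```

A limit of 5000 lies between the smallest and largest limits in the table, so that prediction is an interpolation; a limit of 12000 lies beyond the largest limit, so that prediction should be treated with much more caution.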
2.5.2 Dependent and Independent variables

The scatter graph in the last section was drawn with limit on the x-axis and balance on the y-axis.
This was done to emphasise that we are predicting balance based on the limit of a credit card. So
balance is dependent on the limit of a credit card. It is normal practice to plot the dependent (or
response) variable on the y-axis and the independent (or explanatory) variable on the x-axis.
It is not always obvious which variable should be which; however, in the exam it will be
made clear. For example, the total weight of passengers on a bus (dependent variable) relies on
the number of people on the bus (independent variable).
2.5.3 Random and non-random variables

Both the variables used for the scatter graph have unpredictable values, and are therefore random.
Both variables are free to assume any of a particular set of values in a given range. When we talk
about correlation we will assume that both variables are random.
Sometimes one or both variables are controlled. A controlled variable is one whose values the
researcher sets during an experiment. For example, the blood pressure of a patient could be
measured every 10 minutes, so that time is controlled. Controlled variables are independent variables and are
usually plotted on the x-axis. Independent variables are the variables that we can change!
2.5.4 Interpreting scatter diagrams

We can normally judge whether two variables are correlated based on their scatter graphs (we are as-
suming that none of the variables are controlled). However, we must be careful as there are situations
where you may think there is correlation, but there is not!
When correlation is present we expect most of the points on a scatter graph to lie within what we call an
ellipse.
The first graph shows a positive correlation. We would describe this as weak positive correlation.
When there is a positive correlation, but it is even clearer than this we call it strong positive cor-
relation. The second graph shows a negative correlation. We would describe this as weak negative
correlation. When there is a negative correlation, but it is even clearer than this we call it strong
negative correlation. The third graph shows no indication that there is any correlation.
A correlation between two variables does not automatically mean that the change in one variable
is the cause of the change in the values of the other variable. Causation indicates that one event
is the result of the occurrence of the other event; i.e. there is a causal relationship between the two
events. Previously, we were trying to predict balance by using the limit of a credit card. Even though
a regression line was calculated, we cannot be certain that a change in limit causes the balance to
change all the time. To do this we would need more evidence.
We need to look out for the following 3 cases:
1. If a scatter graph shows two islands, then this indicates that the graph is showing two different
groups. Neither of the groups shows correlation.
2. Outliers may give the impression that there is correlation, but there is not!
3. A few data items may give a ‘funnel’ effect which may give the impression that there is correlation,
but there is not.
2.5.5 Summary measures for bivariate data

There are two summary measures that you may meet when using statistical software:
• Spearman’s rank correlation coefficient is a measure of correlation that is calculated from the ranks
of both variables. It is named after Charles Spearman. It is a number that shows
how closely two sets of data are linked and its interpretation depends on the sample size.
• Pearson’s product moment correlation coefficient is a measure of linear correlation. It is given
by most statistical software. Its interpretation also relies on the sample size.
2.5.6 Lines of best fit

When plotting a scatter graph, you are often required to draw a line of best fit. This can be done
roughly by trying to ensure that about the same number of points lie above and below the line. If
you work out the mean values of the two variables, then try to make sure the line of best fit that
you draw goes through this point (the proof is left for when you do this formally). The formal way to
find a line of best fit is the method of least squares, which gives the least squares regression line.
You will meet this later if you do further statistics.
2.6 Variance and Standard Deviation

2.6.1 A new measure of spread
We have already met the range and the interquartile range which are measures of spread. However,
these measures of spread only consider certain values, whereas the variance (and standard deviation)
considers every value.
Let us look at what is known as mean absolute deviation before we look at the variance (and
standard deviation).
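The mean absolute deviation of a set of data values $x_1, x_2, \dots, x_n$, whose mean is
\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum x_i}{n}, \]
is given by the formula
\[ \text{Mean absolute deviation} = \frac{1}{n} \sum |x_i - \bar{x}|, \quad \text{where } |x| = \begin{cases} x & \text{if } x \ge 0, \\ -x & \text{if } x < 0. \end{cases} \]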
Example
Find the mean absolute deviation of the following numbers:
49, 56, 55, 68, 61, 57, 61, 52, 63 .
The mean, $\bar{x}$, of the data is
\[ \bar{x} = \frac{49 + 56 + 55 + 68 + 61 + 57 + 61 + 52 + 63}{9} = \frac{522}{9} = 58. \]
We shall now construct a table with the values of $x_i$, $(x_i - \bar{x})$ and $|x_i - \bar{x}|$:

                              Data                                    Total ($\sum$)
$x_i$                         49   56   55   68   61   57   61   52   63   522
Deviation $(x_i - \bar{x})$   −9   −2   −3   10   3    −1   3    −6   5    0
$|x_i - \bar{x}|$             9    2    3    10   3    1    3    6    5    42
The total of the deviations is 0 which is not a coincidence! The reason for this is down to the definition
of the mean. The total distance of the values above the mean must be equal to the total distance of
those below it.
So the mean absolute deviation for this data set is $\frac{\sum |x_i - \bar{x}|}{n} = \frac{42}{9} = 4.67$ (3sf). The number 4.67
represents the average distance between each data value and the mean.
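The table above can be reproduced with a short Python sketch, which also confirms that the deviations sum to zero. The variable names are only illustrative.

```python
data = [49, 56, 55, 68, 61, 57, 61, 52, 63]

n = len(data)
mean = sum(data) / n

# Signed deviations from the mean always sum to zero.
deviations = [x - mean for x in data]

# The mean absolute deviation averages the sizes of the deviations instead.
mean_absolute_deviation = sum(abs(d) for d in deviations) / n

print(mean, sum(deviations), round(mean_absolute_deviation, 2))
```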
Mean absolute deviation is an acceptable measure of spread, but is not widely used because it is
difficult to work with. Instead we use something called standard deviation which is incredibly
important!
The variance of a set of data values $x_1, x_2, \dots, x_n$, whose mean is
\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum x_i}{n}, \]
is given by the formulae
\[ \text{variance} = \frac{1}{n} \sum (x_i - \bar{x})^2 = \frac{1}{n} \sum x_i^2 - \bar{x}^2. \]
In practice, the second of these is easiest to use when beginning with the original data, although the
first can become necessary when only given certain information regarding the data. Note that, if x1 ,
x2 , . . ., xn are measured in a certain unit, the variance takes the square of this unit as its unit. The
standard deviation of a data set is the square root of the variance of the data set. The standard
deviation is measured in the same units as the data.
Example
Find the variance of the following numbers:
49, 56, 55, 68, 61, 57, 61, 52, 63 .
Method 1
We shall begin by constructing a table with the values of $x_i$ and $x_i^2$:

$x_i$               $x_i^2$
49                  2401
56                  3136
55                  3025
68                  4624
61                  3721
57                  3249
61                  3721
52                  2704
63                  3969
$\sum x_i = 522$    $\sum x_i^2 = 30\,550$

The mean, $\bar{x}$, is given by
\[ \bar{x} = \frac{\sum x_i}{n} = \frac{522}{9} = 58. \]
So the variance is
\[ \frac{\sum x_i^2}{n} - \bar{x}^2 = \frac{30\,550}{9} - 58^2 = 30.44\ldots \]
Therefore, the variance is 30.4, correct to 3 significant figures.
Method 2
The mean, $\bar{x}$, of the data is
\[ \bar{x} = \frac{49 + 56 + 55 + 68 + 61 + 57 + 61 + 52 + 63}{9} = \frac{522}{9} = 58. \]
We shall now construct a table with the values of $x_i$, $(x_i - \bar{x})$ and $(x_i - \bar{x})^2$:

$x_i$    $x_i - \bar{x} = x_i - 58$    $(x_i - \bar{x})^2 = (x_i - 58)^2$

49       −9                            81
56       −2                            4
55       −3                            9
68       10                            100
61       3                             9
57       −1                            1
61       3                             9
52       −6                            36
63       5                             25
                                       $\sum (x_i - \bar{x})^2 = 274$

The variance is then
\[ \frac{\sum (x_i - \bar{x})^2}{n} = \frac{274}{9} = 30.44\ldots \]
Therefore, the variance is 30.4, correct to 3 significant figures.
For the example above, the standard deviation would be $\sqrt{30.44\ldots} = 5.517\ldots = 5.52$, correct to
3 significant figures. This will take the same units as $x_i$, and not its square (like the variance), as
mentioned above.
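Both variance formulae can be checked against each other in Python. This is a minimal sketch; note that it divides by n, as these notes do, rather than by n − 1.

```python
import math

data = [49, 56, 55, 68, 61, 57, 61, 52, 63]
n = len(data)
mean = sum(data) / n

# Method 1: mean of the squares minus the square of the mean.
variance_1 = sum(x * x for x in data) / n - mean ** 2

# Method 2: mean of the squared deviations from the mean.
variance_2 = sum((x - mean) ** 2 for x in data) / n

standard_deviation = math.sqrt(variance_2)

print(round(variance_1, 1), round(variance_2, 1), round(standard_deviation, 2))
```

The two results agree up to floating-point rounding, which is why the comparison is made after rounding rather than with an exact equality.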
The variance of a data set represented in a frequency table can also be found. The variance of
data given in a frequency table in which the variable takes the value $x_1$ with frequency $f_1$, the value
$x_2$ with frequency $f_2$ and so on is given by
\[ \text{variance} = \frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i} = \frac{\sum x_i^2 f_i}{\sum f_i} - \bar{x}^2. \]
Example
The number of pets that Year 11 pupils at a certain school have is given below. Find the standard
deviation of these data.
Number of pets, $x_i$    Frequency, $f_i$    $x_i f_i$               $x_i^2 f_i$
0                        36                  0                       0
1                        94                  94                      94
2                        48                  96                      192
3                        15                  45                      135
4                        7                   28                      112
5                        3                   15                      75
6                        1                   6                       36
Totals                   $\sum f_i = 204$    $\sum x_i f_i = 284$    $\sum x_i^2 f_i = 644$
We begin by inserting columns for $x_i f_i$ and $x_i^2 f_i$, and finding the totals of each column (shown in
the totals row above). The mean, $\bar{x}$, is given by
\[ \frac{\sum x_i f_i}{\sum f_i} = \frac{284}{204} = 1.392\ldots \]
The variance is then
\[ \frac{\sum x_i^2 f_i}{\sum f_i} - \bar{x}^2 = \frac{644}{204} - (1.392\ldots)^2 = 1.218\ldots \]
This gives a standard deviation of $\sqrt{1.218\ldots} = 1.103\ldots = 1.10$, correct to 3 significant figures.
The second equation for variance was used to simplify the arithmetic. The first equation could
also have been used, and it would have yielded the same answer, although the arithmetic would have
been somewhat longer. To estimate the variance of a frequency table where the data are grouped, the
midpoint of each class must be used to calculate $x_i f_i$ and $x_i^2 f_i$. This is an identical process to that
for estimating the mean of a grouped frequency table. Again, this will only be an estimate, because
the original grouping of the data loses accuracy.
The following example shows how a new mean and variance can be found when an additional value is
merged into an existing data set.
Example
The mean and standard deviation of the heights of 12 boys in a class are 148.8 cm and 5.4 cm
respectively. A boy of height 153.4 cm joins the class. Find the mean and standard deviation of the
heights of the 13 boys.
For this question, we shall use specific notation. The two means shall be written as $\bar{x}_{12}$ and $\bar{x}_{13}$,
and the two variances shall be written as $\sigma_{12}^2$ and $\sigma_{13}^2$. The symbol $\sigma^2$ is often used to represent
variance, although this is more appropriate for the variance of a probability distribution, rather
than a sample of data.
Because $\bar{x}_{12} = 148.8$,
\[ \frac{\sum_{i=1}^{12} x_i}{12} = 148.8 \iff \sum_{i=1}^{12} x_i = 12 \times 148.8 = 1785.6. \]
Then,
\[ \sum_{i=1}^{13} x_i = \sum_{i=1}^{12} x_i + 153.4 = 1785.6 + 153.4 = 1939. \]
Hence
\[ \bar{x}_{13} = \frac{\sum_{i=1}^{13} x_i}{13} = \frac{1939}{13} = 149.15\ldots = 149 \text{ cm, correct to 3 significant figures.} \]
Now, because $\sigma_{12}^2 = 5.4^2$,
\[ \frac{\sum_{i=1}^{12} x_i^2}{12} - \bar{x}_{12}^2 = 5.4^2 \iff \sum_{i=1}^{12} x_i^2 = 12\left(5.4^2 + 148.8^2\right) = 266\,047.2. \]
Then,
\[ \sum_{i=1}^{13} x_i^2 = \sum_{i=1}^{12} x_i^2 + 153.4^2 = 266\,047.2 + 153.4^2 = 289\,578.76. \]
So
\[ \sigma_{13}^2 = \frac{\sum_{i=1}^{13} x_i^2}{13} - \bar{x}_{13}^2 = \frac{289\,578.76}{13} - (149.15\ldots)^2 = 28.41\ldots \]
This gives a standard deviation of $\sqrt{28.41\ldots} = 5.330\ldots = 5.33$ cm, correct to 3 significant figures.
Thus, the mean and standard deviation of the heights of the 13 boys are 149 cm and 5.33 cm respectively,
both correct to 3 significant figures.
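The same steps can be followed in Python: recover the two totals from the summary statistics, add the new value, then recompute. The variable names are only illustrative.

```python
import math

n, mean, sd = 12, 148.8, 5.4   # summary statistics for the original 12 boys
new_height = 153.4             # height of the boy who joins

# Recover the totals from the mean and the variance.
sum_x = n * mean
sum_x2 = n * (sd ** 2 + mean ** 2)

# Merge the new value into both totals.
n += 1
sum_x += new_height
sum_x2 += new_height ** 2

# Recompute the summary statistics for all 13 boys.
new_mean = sum_x / n
new_variance = sum_x2 / n - new_mean ** 2
new_sd = math.sqrt(new_variance)

print(round(new_mean, 2), round(new_sd, 2))
```

Keeping the unrounded mean when calculating the variance matters here: rounding the mean to 149 first would change the answer noticeably, because two large numbers are being subtracted.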
Standard deviation is an incredibly important measure of spread in statistics! It can be used for
both discrete and continuous data along with ungrouped and grouped data.
2.6.2 Identifying outliers

We have already seen how to identify outliers using the interquartile range, but have we got a proce-
dure for standard deviation? Of course we have!
We normally investigate data items that lie more than 2 standard deviations from the mean and
decide whether they should be included in the analysis or not. The distribution of many sets of data
is approximately normal. The normal distribution will be dealt with next year in a formal manner.
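A sketch of this procedure in Python is given below. The data set is an assumption made purely for illustration: it is the nine values used earlier with a hypothetical extra value of 120 added, and the standard deviation divides by n as in these notes.

```python
import statistics

# The nine values used earlier, plus a hypothetical extra value of 120.
data = [49, 56, 55, 68, 61, 57, 61, 52, 63, 120]

mean = statistics.fmean(data)
sd = statistics.pstdev(data)  # population standard deviation (divides by n)

# Flag any value lying more than 2 standard deviations from the mean.
flagged = [x for x in data if abs(x - mean) > 2 * sd]
print(flagged)  # [120]
```

Flagging a value only marks it for investigation; whether it is then kept in the analysis is still a judgement to be made.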
2.7 Linear coding

Suppose we wish to find the mean of the numbers 907, 908, 898, 902 and 897. This could, of course,
be achieved using the standard method, but it would be easier to subtract 900 from each number to
give 7, 8, −2, 2 and −3, and calculate the mean of these numbers. The mean of these numbers is
\[ \frac{7 + 8 - 2 + 2 - 3}{5} = 2.4. \]
It follows that the mean of the original data is 900 + 2.4 = 902.4.
The process of subtracting a constant from all data values is called coding. Calling this constant
$a$,
\[ \bar{x} = \frac{\sum (x - a)}{n} + a. \]
Example
The heights, x cm, of a sample of 80 female students are summarised by the equation $\sum (x - 160) = 240$.
Find the mean height of a female student.
The mean height is
\[ \bar{x} = \frac{\sum (x - 160)}{n} + 160 = \frac{240}{80} + 160 = 163 \text{ cm.} \]
This is a relatively easy example. It is also possible to multiply all the values of a data set by a constant,
and then add or subtract a constant; we are still able to recover the mean of the original data set. It is
possible to also find the variance (and standard deviation) of a data set by using coding. The formulas
are summed up in the following table:
Original value Coded value

coding x y = a + bx
mean x̄ ȳ = a + bx̄
standard deviation sdx sdy = b × sdx
Please note that the variance (and standard deviation) are not affected by adding or subtracting a
constant. We are measuring spread remember!
Suppose we want to find the variance of 907, 908, 898, 902 and 897. We can code these values
to 7, 8, −2, 2 and −3. From its equation, we can see that the variance is the average squared distance
of the data values from the mean. Because the mean has also changed when coding the data, the
distance of each value from the mean is unchanged. It follows that the variance of the coded values
is equal to the variance of the original values. To find the variance of a set of data, it is necessary to
find its mean first. It is imperative that, when finding the variance of the coded values, the mean of
the coded values is used, and not the mean of the original values.
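The claim that subtracting a constant shifts the mean but leaves the variance unchanged can be checked with a short Python sketch, using the five numbers above and the coding x − 900.

```python
import math
import statistics

original = [907, 908, 898, 902, 897]
coded = [x - 900 for x in original]  # 7, 8, -2, 2, -3

# The mean shifts by the constant, so add it back to recover the original mean.
mean_original = statistics.fmean(coded) + 900

# The variance (dividing by n) is the same for the coded and original values.
variance_coded = statistics.pvariance(coded)
variance_original = statistics.pvariance(original)

print(mean_original, variance_coded, variance_original)
```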
Example
The heights, x cm, of a sample of 80 female students are summarised by the equations
\[ \sum (x - 160) = 240 \quad \text{and} \quad \sum (x - 160)^2 = 8720. \]
Find the mean and standard deviation of the heights of the 80 female students. The mean is
\[ \bar{x} = \frac{\sum (x - 160)}{n} + 160 = \frac{240}{80} + 160 = 163 \text{ cm.} \]
The variance of x is equal to the variance of x − 160, which is
\[ \frac{\sum (x - 160)^2}{n} - \left( \frac{\sum (x - 160)}{n} \right)^2 = \frac{8720}{80} - \left( \frac{240}{80} \right)^2 = 100 \text{ cm}^2. \]
Thus, the standard deviation of x is $\sqrt{100} = 10$ cm.
3 Probability
3.1 Introduction to Probability
Probability was given a formal definition at the start of these notes, but it is essentially a way of
describing the likelihood of different outcomes as a result of some experiment. Another word for
experiment is trial. The word trial tends to be used when there are two possible outcomes like
when flipping a coin. Another important word is event, and this often describes several outcomes
put together like rolling a die and getting an even number. It should be noted that the word event
can be used to describe a single outcome too.
We have given rough descriptions of the key words to use when describing probability, but now
we will make these more formal.
The set of possible outcomes of a random trial is called the sample space of the trial which is denoted
by Ω (omega). For example, the sample space for tossing a coin twice is
{(H, H), (H, T ), (T, H), (T, T )},
where H represents heads and T tails.
Probabilities can be assigned to the outcomes of a sample space. The following criteria are observed:
• Each probability lies between 0 and 1 inclusive.
• The sum of all of the probabilities is equal to 1.

Subsets of the sample space are called events. If A is the event of one head being obtained for the
previous sample space, then A = {(H, T ), (T, H)}. If B = {(H, H)}, then B is the event of two
heads being obtained.
Estimation of probability
Probability can be estimated experimentally or theoretically.
Experimental estimation of probability
In many situations probabilities are estimated on the basis of data collected experimentally, as in
the following example.
Example
Of 40 2 pence coins tossed in the air, 29 were found to have landed on tails. From this you
would estimate that the probability that the next coin tossed in the air will land on tails is $\frac{29}{40}$, or 0.725.
We can describe this more formally as $P(T) = \frac{n(T)}{n(X)}$, where T is the event that the next throw
will result in the 2 pence coin landing on tails, n(T) is the number of times the coin lands on tails,
and n(X) is the total number of throws.
Theoretical estimation of probability
However, we do know that if we are tossing a fair 2 pence coin, we would expect $P(T) = \frac{1}{2}$.
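The link between the experimental and theoretical estimates can be illustrated by simulation. The sketch below is an assumption-laden illustration rather than part of the notes: it models a fair coin with Python's pseudo-random number generator, and the function name `estimate_tails_probability` and the seed are arbitrary choices.

```python
import random

def estimate_tails_probability(n_tosses, seed=0):
    """Estimate P(T) as n(T) / n(X) from simulated tosses of a fair coin."""
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    n_tails = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return n_tails / n_tosses

# With more tosses, the experimental estimate tends to settle near 1/2.
for n in (40, 400, 40000):
    print(n, estimate_tails_probability(n))
```

A small experiment such as 40 tosses can easily give an estimate some way from 0.5, which is consistent with the 0.725 observed in the example.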
Example
Figure out the probability that the correct answer for the maths challenge question will be answer A
(answer options are A, B, C, D, and E).
Assuming that the maths challenge test-setter has used each letter equally often, the probability,
P (A), that the next question will have the answer A can be written as follows:
\[ P(A) = \frac{1}{5}. \]
For outcomes that are equally likely, the probability, P(A), of event A occurring can be expressed
formally as:
\[ P(A) = \frac{n(A)}{n(\Omega)} \]
where n(A) is the number of outcomes in which A occurs and n(Ω) is the total number of outcomes
in the sample space.
Probabilities of 0 and 1
The two ends of probability are certainty at one end of the scale and impossibility at the other.
An example of certainty is that when we roll a die the die will land on a 1, 2, 3, 4, 5 or 6. An
impossible event here is that the die will land on a 7.
Let A be any event. Then 0 ≤ P (A) ≤ 1. So if you get a negative number or a number that is
larger than 1, then you have gone wrong!
The complement of an event
The complement of an event A is the event where A does not occur, and is written as A′. For
the example of a coin being tossed twice, where A = {(H, T), (T, H)}, A′ = {(H, H), (T, T)}. Because
the events A and A′ together involve every outcome of the sample space exactly once, P(A) + P(A′) = 1.
It is sometimes quicker to find P(A) by considering P(A′) first.
Example
Two cards are drawn from an ordinary pack. Find the probability that they are not both kings.
Let A be the event of two kings not being drawn. Because the event A consists of so many outcomes,
it is easier to consider A′, which is the event of two kings being drawn. The outcomes that
constitute A′ are
{(KC, KD), (KD, KC), (KC, KH), (KH, KC), (KC, KS), (KS, KC),
(KD, KH), (KH, KD), (KD, KS), (KS, KD), (KH, KS), (KS, KH)},
where KC is the king of clubs, KD is the king of diamonds, KH is the king of hearts, KS is the
king of spades, and the first entry in each bracket is the first card picked, with the second entry
being the second card picked.
There are 52 ways of picking the first card (because an ordinary pack of cards contains 52 cards),
and then 51 ways of picking the second card. This means there are 52 × 51 = 2652 ordered pairs of
cards, each of which is equally likely.
The event A′ involves 12 of these 2652 ordered pairs, so
\[ P(A') = \frac{12}{2652} = \frac{1}{221}. \]
Thus,
\[ P(A) = 1 - \frac{1}{221} = \frac{220}{221}. \]
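The counting in this example can be confirmed by listing every ordered pair of cards. This is a minimal sketch; the card labels (rank followed by a suit letter, so that `KC` is the king of clubs) are an illustrative encoding.

```python
from fractions import Fraction
from itertools import permutations

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["C", "D", "H", "S"]
pack = [rank + suit for rank in ranks for suit in suits]

# Every ordered pair (first card, second card) drawn without replacement.
draws = list(permutations(pack, 2))
# A label starts with "K" only when the card is a king.
both_kings = [d for d in draws if d[0][0] == "K" and d[1][0] == "K"]

p_both_kings = Fraction(len(both_kings), len(draws))
p_not_both_kings = 1 - p_both_kings
print(len(draws), len(both_kings), p_both_kings, p_not_both_kings)
# 2652 12 1/221 220/221
```

Working with `Fraction` keeps the probabilities exact, so the result matches the hand calculation without rounding.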
Expectation
The expected number of occurrences is equal to np, where n is the population size and p is the
probability for each member of the population. Let us say that each of the 90,000 people in a certain
area of Sierra Leone has a 1 in 3 chance of contracting Ebola. How many do we expect to contract the
disease? This is equal to $90{,}000 \times \frac{1}{3} = 30{,}000$.
The probability of either one event or another
So far we have just looked at one event at a time. However, sometimes we wish to find the
probability that one or other of two events happens. The probability of event A or event B occurring is
found by adding the probability of A occurring to the probability of B occurring. This is
true only as long as the two events cannot happen together!
Example
A die with six faces has been made from brass and aluminium, and is not fair. The probability
of a 6 is $\frac{1}{4}$, the probabilities of 2, 3, 4 and 5 are each $\frac{1}{6}$, and the probability of a 1 is $\frac{1}{12}$. Find the
probability of rolling a 1 or a 6.
Let A be the event of rolling a 1 and B be the event of rolling a 6.
\[ P(A \text{ or } B) = P(A \cup B) = P(A) + P(B) = \frac{1}{12} + \frac{1}{4} = \frac{1}{3}. \]
We use the ∪ sign to mean or.
Sometimes we may not have theoretical situations. In this case, we will have to use experimental
probabilities.
Example
The company LoveFilm has shared some data with us below. The table shows the probability of
the next requested DVD falling into each of the three categories listed, assuming that each DVD is
equally likely to be requested.
Category of DVD       Typical numbers   Probability
On the shelves (S)    15,000            0.1875
Out on loan (L)       50,000            0.625
Overdue (O)           15,000            0.1875
Total                 80,000            1.00
What is the probability that a randomly requested DVD is either out on loan or overdue?
\[ P(L \text{ or } O) = P(L \cup O) = \frac{50{,}000 + 15{,}000}{80{,}000} = \frac{65{,}000}{80{,}000} = 0.8125 \tag{1} \]
We can write this more formally as
\[ P(L \text{ or } O) = P(L \cup O) = \frac{n(L \cup O)}{n(\Omega)} = \frac{n(L)}{n(\Omega)} + \frac{n(O)}{n(\Omega)} = P(L) + P(O) \]
Figure 8: Venn diagram showing events L and O.
Example
The table below shows further details of the categories of DVDs that LoveFilm offers.
Category of DVD          Number of DVDs
On the shelves (S)       15,000
Out on loan (L)          50,000
Unauthorised loan (U)    15,000
Adult fiction (A)        22,000
Adult non-fiction (B)    40,000
Junior (C)               18,000
Total stock              80,000
Assuming that all the DVDs are equally likely to be requested, find the probability that the next
DVD request will be either out on loan or a DVD of adult non-fiction.
Some people may think that the solution will be

\[ P(L \text{ or } B) = P(L \cup B) = \frac{50{,}000 + 40{,}000}{80{,}000} = \frac{90{,}000}{80{,}000} = 1.125 \tag{2} \]
which is completely absurd! A probability cannot be more than 1! The reason that this calculation
has gone wrong is that it involves double counting. Some of the DVDs that are classed as adult
non-fiction are also out on loan, so the events L and B can happen simultaneously! The Venn diagram
in Figure 9 depicts the scenario. We would need a more detailed table in order to solve this problem.
Figure 9: Venn diagram showing events L and B.
3.2 Mutually exclusive events

Two events that share no outcomes (so cannot occur simultaneously) are called mutually exclusive
events. Suppose we roll a die, and we let A be the event of obtaining a 1, B the event of obtaining an
even number, and C the event of obtaining a number larger than 4. The events A and B are mutually
exclusive, because A and B share no outcomes (so they cannot occur simultaneously). Similarly, A
and C are mutually exclusive events. However, B and C are not mutually exclusive events, because,
when a 6 is rolled, both B and C occur.
If $A_1, A_2, A_3, \ldots, A_n$ are n mutually exclusive events, then
\[ P(A_1 \text{ or } A_2 \text{ or } A_3 \text{ or } \ldots \text{ or } A_n) = P(A_1) + P(A_2) + P(A_3) + \ldots + P(A_n). \]
This result is not true for events that are not mutually exclusive.
Example
A box contains 6 red, 9 green and 10 blue counters. A counter is drawn at random. What is
the probability that it is blue or red?
Let B be the event of obtaining a blue counter, and R the event of obtaining a red counter. These
events are mutually exclusive, since there are no outcomes common to both. Thus,
\[ P(B \text{ or } R) = P(B) + P(R) = \frac{10}{6+9+10} + \frac{6}{6+9+10} = \frac{10}{25} + \frac{6}{25} = \frac{16}{25}. \]
It should be noted that if the two events can occur simultaneously, then this method will not
work!
Theorem 1.
If A and B are any events (if they are mutually exclusive then this works also), then
P (A ∪ B) = P (A) + P (B) − P (A ∩ B),
where A ∩ B means A and B.
Proof. Let A and B be any two events. We shall use the notation A \ B to mean the event A, but
not the event B.
First of all, notice that the events A and B \ A are mutually exclusive and their union makes A ∪ B.
∴ P (A ∪ B) = P (A ∪ (B \ A)) = P (A) + P (B \ A). (3)
Now we observe that B = (A ∩ B) ∪ (B \ A) and the events (A ∩ B) and (B \ A) are mutually exclusive.
∴ P (B) = P ((A ∩ B) ∪ (B \ A)) = P (A ∩ B) + P (B \ A). (4)
Rearranging (4) we get
P (B \ A) = P (B) − P (A ∩ B) (5)
Now substitute (5) into (3) and we get P(A ∪ B) = P(A) + P(B) − P(A ∩ B). So we are done.
Figure 10: Mutually exclusive events
Figure 11: Not mutually exclusive events
Example
A card is selected at random from a normal pack of 52 playing cards. Let event A be ‘the card
is a club’ and let event B be ‘the card is a Queen’.
1. Draw a Venn diagram showing events A and B.

2. Find the probability that the card is a club.
3. Find the probability that the card is a Queen.
4. Find the probability that the card is the Queen of clubs.
5. Find the probability that the card is a Queen or a club.
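1. [Venn diagram showing events A and B]
2. $P(A) = \frac{13}{52} = \frac{1}{4}$.
3. $P(B) = \frac{4}{52} = \frac{1}{13}$.
4. $P(A \cap B) = \frac{1}{52}$.
5. $P(A \cup B) = P(A) + P(B) - P(A \cap B) = \frac{13}{52} + \frac{4}{52} - \frac{1}{52} = \frac{16}{52} = \frac{8}{26} = \frac{4}{13}$.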
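Theorem 1 can be checked for this example by treating events as Python sets of cards, where `|` is union and `&` is intersection. This is a sketch under an illustrative encoding: each card is a (rank, suit) pair and the helper `prob` is a hypothetical name.

```python
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
pack = {(rank, suit) for rank in ranks for suit in suits}

A = {card for card in pack if card[1] == "clubs"}  # the card is a club
B = {card for card in pack if card[0] == "Q"}      # the card is a Queen

def prob(event):
    """Probability of an event when every card is equally likely."""
    return Fraction(len(event), len(pack))

lhs = prob(A | B)                       # P(A or B), counted directly
rhs = prob(A) + prob(B) - prob(A & B)   # addition rule from Theorem 1
print(lhs, rhs)  # 4/13 4/13
```

Without the subtraction of P(A ∩ B), the Queen of clubs would be counted twice, giving 17/52 instead of 16/52.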
3.3 Independent events

Two events that have no effect on one another are called independent events. For instance, two
consecutive tosses of a coin are independent because the outcome of the first toss cannot affect the
outcome of the second toss. Conversely, the events of it snowing and a certain football match being
played are not independent, because snow can affect the probability of the football match taking place.
If $A_1, A_2, A_3, \ldots, A_n$ are n independent events, then
\[ P(A_1 \text{ and } A_2 \text{ and } A_3 \text{ and } \ldots \text{ and } A_n) = P(A_1) \times P(A_2) \times P(A_3) \times \ldots \times P(A_n). \]
This result is not true for events that are not independent: the rules of conditional probability must
be applied instead. Conditional probability is covered next year.
Example
In a game at a fair, a contestant has to toss a fair coin and then roll a fair cubical die whose faces are
numbered 1 to 6. The contestant wins a prize if the coin shows heads and the die score is below 3.
Find the probability that a contestant wins a prize.
The event of the coin showing heads is independent of the event that the die score is below 3, which
means we can use the above result.
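\[ \begin{aligned}
P(\text{prize is won}) &= P(\text{coin shows heads and die score is below 3}) \\
&= P(\text{coin shows heads}) \times P(\text{die score is below 3}) \\
&= \frac{1}{2} \times \frac{2}{6} \\
&= \frac{1}{6}.
\end{aligned} \]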
Another way we can show probability is with a tree diagram.
Example
Mr Smith is a teacher and drives 20 minutes to work each day. On the way to work he has to
go through two sets of traffic lights. The probability that he is stopped at the first is 0.6 and the
probability he is stopped at the second is 0.4. The timings of the two traffic lights are independent
of each other.
1. What is the probability that he is stopped at both sets of traffic lights?
2. What is the probability that he is stopped at least once?
1. First of all, we draw a simple tree diagram.
Now we multiply accordingly as the events of Mr Smith stopping at the two sets of traffic lights are
independent:
P (stopped at first set and stopped at second set) = 0.6 × 0.4 = 0.24.
2. There are two methods that we can use here.
Method 1: Adding the probabilities for the three ways that it can happen.
P(Stopped and Not stopped) = 0.6 × 0.6 = 0.36
P(Not stopped and Stopped) = 0.4 × 0.4 = 0.16        (6)
P(Stopped and Stopped) = 0.6 × 0.4 = 0.24
So P(Stopped at least once) = 0.36 + 0.16 + 0.24 = 0.76.
Method 2: Find the complement of not being stopped at all.
P (Not stopped and Not stopped) = 0.4 × 0.6 = 0.24.
So P (Stopped at least once) = 1 − 0.24 = 0.76.
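A tree diagram can be represented in code by multiplying along each path and adding the paths of interest. The sketch below assumes the two sets of lights are independent, as stated in the example; the dictionary names are illustrative.

```python
from itertools import product

p_first = {"stopped": 0.6, "not stopped": 0.4}
p_second = {"stopped": 0.4, "not stopped": 0.6}

# Each path through the tree is a (first, second) pair; because the lights
# are independent, its probability is the product of the branch probabilities.
paths = {(a, b): p_first[a] * p_second[b] for a, b in product(p_first, p_second)}

p_both = paths[("stopped", "stopped")]
# Method 1: add every path that contains at least one "stopped".
p_at_least_once = sum(p for path, p in paths.items() if "stopped" in path)
# Method 2: use the complement of never being stopped.
p_complement = 1 - paths[("not stopped", "not stopped")]

print(round(p_both, 2), round(p_at_least_once, 2), round(p_complement, 2))
```

Both methods should give the same value, and the four path probabilities should add up to 1.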
3.4 Risk
You may have heard of probability being described as risk in everyday life. For example, your mum
might ask you, ‘What is the risk of it raining tomorrow?’ To answer this, you would go on your phone and
look at a weather application that has estimated the risk. Risk is normally mentioned when
we do not want something to happen. In the example just given, it is most probable that your mum
has asked you this question because she wants to do some washing and, therefore, does not want it
to rain.
Example
The risk of an earthquake in Italy in any one month is 1 in 75. What is the risk that it has at
least one earthquake in the next three months?
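Let A be the event that Italy has at least one earthquake in the next three months. Assuming that
what happens in one month is independent of the other months,
\[ P(A) = 1 - P(A') = 1 - P(\text{no earthquake at all}) = 1 - \left( \frac{74}{75} \right)^3 = 0.0395 \text{ (3 sf)}. \]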
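The same complement argument works for any number of months. The function below is an illustrative sketch (its name is not from the notes) and assumes that the months are independent with the same monthly risk.

```python
def risk_of_at_least_one(monthly_risk, months):
    """P(at least one event) = 1 - P(no event in any of the months)."""
    return 1 - (1 - monthly_risk) ** months

print(round(risk_of_at_least_one(1 / 75, 3), 4))  # 0.0395
```

Note that the answer is slightly less than 3 × 1/75 = 0.04, because simply adding the monthly risks would double count months in which more than one earthquake-month occurs together.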
3.5 Probability distributions

3.5.1 Introduction to probability distributions
If we were to toss a fair coin twice, we know what the data should look like. Suppose that there
are one-hundred sets of two tosses. Then, it would be expected that 25 pairs of tosses would give no
heads, 50 pairs of tosses would give one head, and 25 pairs of tosses would give two heads. Of course,
there would be slight differences; however, this model could be used in statistical analysis.
For this example, let X be the random variable of the number of heads when tossing a coin twice.
The values it can take are {0, 1, 2}; these will be denoted as x. It is customary to use a capital letter
to represent the random variable, and the corresponding lower-case letter for the values the random
variable can take. We can then say that $P(X = 0) = \frac{1}{4}$, because the probability that X has a value
of 0 (when the coin is tossed twice, 0 heads are obtained) is $\frac{1}{4}$. Additionally, $P(X = 1) = \frac{1}{2}$ and
$P(X = 2) = \frac{1}{4}$. A probability distribution of a random variable is a listing of the possible values of
the variable and the corresponding probabilities. Thus, the probability distribution of X is
x           0      1      2
P(X = x)    1/4    1/2    1/4
A probability distribution may be displayed graphically with the variable on the x-axis and
probability on the y-axis. The graphical representation of the probability distribution for the
example above is shown on the right.
The example above involved a discrete random variable. However, there are such things as continuous
random variables. A continuous random variable is a random variable that can take any value in an
interval, so infinitely many values are possible. For example, a random variable measuring the time
taken for something to be done is continuous, since there are an infinite number of possible times
that can be taken. We note that if we were to draw the probability distribution of a continuous random
variable on a graph then we would have a function whereby the area underneath this function is 1.
The area underneath this function between values a and b (say) represents the probability of the
continuous random variable being between a and b. This function is called a probability density
function. The probability density function for a special continuous distribution called the normal
distribution is shown on the right.
3.5.2 Properties of probability distributions

For any discrete random variable, X, the sum of the probabilities is 1; that is, $\sum P(X = x) = 1$. The
reason for this is obvious: the sum of the probabilities in a sample space always equals 1.
Example
The table below gives the probability distribution of the random variable T . Find the value of c
and P(T > 2).
t           1    2    3    4    5
P(T = t)    c    2c   2c   2c   c
Because $\sum P(T = t) = 1$,
\[ c + 2c + 2c + 2c + c = 8c = 1 \iff c = \frac{1}{8}. \]
So
\[ P(T > 2) = P(T = 3) + P(T = 4) + P(T = 5) = 2c + 2c + c = 5c = \frac{5}{8}. \]
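The same steps can be followed in Python: store each probability as a multiple of c, choose c so that the total is 1, then add the probabilities required. The dictionary name `weights` is an illustrative choice.

```python
from fractions import Fraction

# P(T = t) written as a multiple of c, taken from the table above.
weights = {1: 1, 2: 2, 3: 2, 4: 2, 5: 1}

# The probabilities must sum to 1, so c = 1 / (sum of the multiples).
c = Fraction(1, sum(weights.values()))
distribution = {t: w * c for t, w in weights.items()}

p_greater_than_2 = sum(p for t, p in distribution.items() if t > 2)
print(c, p_greater_than_2)  # 1/8 5/8
```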
4 The Binomial Distribution

4.1 The Criteria And Probability Formula For The Binomial Distribution
Suppose that we roll a die twenty times, and we wish to find the probability distribution of the number
of sixes we obtain. The following conditions are observed:
1. There are only two outcomes to each throw (that cannot occur simultaneously): either we get a
six, or we do not get a six.
2. There are exactly twenty rolls.
3. Obtaining or not obtaining a six for any roll does not make it more or less likely to get a six on
any other roll.
4. The probability that we obtain a six is constant for each throw.
We can then say that the number of sixes follows a binomial distribution. In general, a binomial
distribution occurs when the following conditions are satisfied:
1. A single trial has exactly two possible outcomes (success and failure) and these are mutually
exclusive.
2. A fixed number, n, of trials takes place.
3. The outcome of each trial is independent of the outcome of all the other trials.
4. The probability, p, of a success at each trial is constant.
Before we carry on we will define $\binom{n}{x}$ as $\frac{n!}{x!(n-x)!}$, where $n! = n \times (n-1) \times \ldots \times 2 \times 1$.
$\binom{n}{x}$ represents the number of ways to choose x successes from n trials.
The binomial random variable X (random variables are capital letters!), which represents the number
of successes in the n trials of the experiment, has a probability distribution given by

\[ P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x} \]
for x = 0, 1, 2, . . . , n. This can be shortened to X ∼ B(n, p), which is read as ‘X follows a binomial
distribution where there are n trials and the probability of success is p’. The values n and p are called
the parameters of the distribution.
Example
A die is rolled 20 times. Find the probability that four sixes are obtained. Let X represent the
number of sixes, so that we wish to find P(X = 4). We know that X follows a binomial distribution
where there are 20 trials and the probability of success is $\frac{1}{6}$, so $X \sim B\left(20, \frac{1}{6}\right)$. Thus,
\[ \begin{aligned}
P(X = 4) &= \binom{20}{4} \times \left( \frac{1}{6} \right)^4 \times \left( \frac{5}{6} \right)^{16} \\
&= 0.2022\ldots \\
&= 0.202, \text{ correct to 3 significant figures.}
\end{aligned} \]
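The binomial probability formula translates directly into code. This is a minimal sketch; `binomial_pmf` is an illustrative name, and `math.comb` (available from Python 3.8) supplies the number of ways to choose x successes from n trials.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Probability of exactly four sixes in 20 rolls of a fair die.
print(round(binomial_pmf(4, 20, 1 / 6), 4))  # 0.2022
```

As a check on the formula, the probabilities for x = 0, 1, ..., n should add up to 1.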
Example
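Given that $X \sim B\left(8, \frac{1}{4}\right)$, find P(X = 6), P(X ≤ 2) and P(X > 0).
\[ \begin{aligned}
P(X = 6) &= \binom{8}{6} \times \left( \frac{1}{4} \right)^6 \times \left( \frac{3}{4} \right)^2 \\
&= 0.003845\ldots
\end{aligned} \]
\[ \begin{aligned}
P(X \le 2) &= P(X = 0) + P(X = 1) + P(X = 2) \\
&= \binom{8}{0} \times \left( \frac{1}{4} \right)^0 \times \left( \frac{3}{4} \right)^8 + \binom{8}{1} \times \left( \frac{1}{4} \right)^1 \times \left( \frac{3}{4} \right)^7 + \binom{8}{2} \times \left( \frac{1}{4} \right)^2 \times \left( \frac{3}{4} \right)^6 \\
&= 0.1001\ldots + 0.2669\ldots + 0.3114\ldots \\
&= 0.6785\ldots
\end{aligned} \]
\[ \begin{aligned}
P(X > 0) &= 1 - P(X = 0) \\
&= 1 - \binom{8}{0} \times \left( \frac{1}{4} \right)^0 \times \left( \frac{3}{4} \right)^8 \\
&= 1 - 0.1001\ldots \\
&= 0.8998\ldots
\end{aligned} \]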
4.2 Finding Cumulative Probabilities With The Binomial Distribution

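Suppose that $X \sim B\left(16, \frac{1}{4}\right)$, and we wish to find P(X ≤ 8). We could find this using the standard
formula for the binomial distribution:
\[ \begin{aligned}
P(X \le 8) &= P(X = 0) + P(X = 1) + \ldots + P(X = 8) \\
&= \binom{16}{0} \times \left( \frac{1}{4} \right)^0 \times \left( \frac{3}{4} \right)^{16} + \binom{16}{1} \times \left( \frac{1}{4} \right)^1 \times \left( \frac{3}{4} \right)^{15} + \ldots + \binom{16}{8} \times \left( \frac{1}{4} \right)^8 \times \left( \frac{3}{4} \right)^8
\end{aligned} \]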
It is evident, however, that much computation is required to find this. For this reason, there exist
tables of cumulative binomial probabilities. In this instance, we must use a table for which n = 16. We
begin by locating the column corresponding to $p = \frac{1}{4} = 0.25$. Part of this column and the immediately
adjacent columns are shown below:
p        0.2      0.25     0.3
x = 6    0.9733   0.9204   0.8247
x = 7    0.9930   0.9729   0.9256
x = 8    0.9985   0.9925   0.9743
x = 9    0.9998   0.9984   0.9929
The row x = 6 gives P(X ≤ 6) for various values of p. For instance, if p = 0.2,
P(X ≤ 6) = 0.9733, correct to 4 decimal places. We can thus see that, if $p = \frac{1}{4}$, P(X ≤ 8) = 0.9925,
correct to 4 decimal places, so P(X ≤ 8) = 0.993, correct to 3 significant figures.
Other cumulative binomial probabilities can be found using the tables by expressing them in the
form P(X ≤ c) with various values of c. The tables will be given to you in the exam.
Example
Given that Y ∼ B(10, 0.3), find P(Y < 8), P(Y > 5), P(3 < Y ≤ 5) and P(2 ≤ Y < 7). Using
the table for which n = 10, and the column where p = 0.3,
P(Y < 8) = P(Y ≤ 7) = 0.9984
P(Y > 5) = 1 − P(Y ≤ 5) = 1 − 0.9527 = 0.0473
P(3 < Y ≤ 5) = P(Y ≤ 5) − P(Y ≤ 3) = 0.9527 − 0.6496 = 0.3031
P(2 ≤ Y < 7) = P(Y ≤ 6) − P(Y ≤ 1) = 0.9894 − 0.1493 = 0.8401
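Each of these probabilities is rewritten in the form P(Y ≤ c) before the table is used, and the same rewriting can be done in code. The sketch below uses a hypothetical helper `binomial_cdf` that adds up the binomial formula instead of reading a table.

```python
from math import comb

def binomial_cdf(c, n, p):
    """P(X <= c) for X ~ B(n, p), summing the probability formula directly."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(c + 1))

n, p = 10, 0.3
p_less_than_8 = binomial_cdf(7, n, p)                            # P(Y < 8) = P(Y <= 7)
p_more_than_5 = 1 - binomial_cdf(5, n, p)                        # P(Y > 5)
p_between_3_and_5 = binomial_cdf(5, n, p) - binomial_cdf(3, n, p)  # P(3 < Y <= 5)
p_from_2_to_6 = binomial_cdf(6, n, p) - binomial_cdf(1, n, p)      # P(2 <= Y < 7)

for value in (p_less_than_8, p_more_than_5, p_between_3_and_5, p_from_2_to_6):
    print(round(value, 4))
```

Because table entries are already rounded to 4 decimal places, an answer formed by subtracting two of them can differ from the exact value in the last decimal place. Here the exact value of P(3 < Y ≤ 5) rounds to 0.3030, against 0.3031 from the table.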
Example
Extensive research into the colour of people’s eyes has shown that 1 in every 4 people have blue
eyes on average. How large would a sample have to be for the probability of it containing at least one
person with blue eyes to be greater than 99.9%?
Let the sample size be given by n and let X be the random variable that denotes the number of
people with blue eyes in the sample, so that X ∼ B(n, 0.25).
The probability that none of the chosen sample have blue eyes is $P(X = 0) = (0.75)^n$ and so the
probability that at least one person has blue eyes is $P(X \ge 1) = 1 - P(X = 0) = 1 - (0.75)^n$. So we
require
\[ \begin{aligned}
1 - (0.75)^n &> 0.999 \\
(0.75)^n &< 0.001 \\
n \log(0.75) &< \log(0.001) \\
n &> \frac{\log(0.001)}{\log(0.75)} \\
n &> 24.01
\end{aligned} \tag{7} \]
The inequality sign reverses when we divide by log(0.75), because log(0.75) is negative.
So the smallest suitable sample size is n = 25.
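The required sample size can also be found by trying n = 1, 2, 3, ... in turn, which avoids having to remember which way the inequality points when dividing by a negative logarithm. The function name below is an illustrative choice.

```python
from math import log

def smallest_sample_size(p_success, target):
    """Smallest n for which 1 - (1 - p_success)**n exceeds target."""
    n = 1
    while 1 - (1 - p_success) ** n <= target:
        n += 1
    return n

print(smallest_sample_size(0.25, 0.999))  # 25
# The logarithm method gives the same boundary, roughly 24.01.
print(log(0.001) / log(0.75))
```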
5 Statistical hypothesis testing with the binomial distribution

5.1 Introduction to hypothesis testing
A common application of probability is in a technique called hypothesis testing, which involves assessing
whether there is evidence to suggest that a specific hypothesis, or theory, is likely to be true.
Only variables that are distributed binomially (discretely) are to be discussed here. There is hypothesis
testing when variables are distributed continuously, but this will be taught next year. Consider
the three following situations:
• In the past few years, a reporter has claimed that 50% of Sutton Coldfield votes for the Conservative
Party. Julie, who is a local resident, believes that this is not true. She believes that the
probability of someone voting for the Conservative Party is less than this.
• In a report, it was stated that 60% of all current patients at Good Hope hospital are over the
age of 60. A newspaper believes that this figure is an underestimate, and obtains a sample of
patients to verify this.
• The BBC has always claimed that it is politically unbiased. However, many people are currently
saying that this is far from the truth. So a reporter collects a sample that is representative of
the population of the UK and asks the sample whether they think the BBC is unbiased. If the
BBC were unbiased, we would expect a 50:50 split on opinion in the sample.
For all three cases, information has been given concerning what should be the theoretical probability
of the variables. If we write this as p, then the original suppositions are that p = 0.5, p = 0.6 and
p = 0.5. Following this, it is desired to determine whether this supposition has changed; for the three
cases, this can be written mathematically as p < 0.5, p > 0.6 and p ≠ 0.5. It is apparent that there
are two different statements, or hypotheses, regarding each situation: one for what is initially believed
to be correct, and a second that seeks to improve the first. The first hypothesis is called the null
hypothesis, whilst the second is called the alternative hypothesis. These are written as H0 and
H1 (or sometimes Hα ) respectively. The hypotheses are written as follows:
• H0 : p = 0.5, H1 : p < 0.5,
• H0 : p = 0.6, H1 : p > 0.6,
• H0 : p = 0.5, H1 : p ≠ 0.5.
The purpose of a hypothesis test is to indicate whether there is evidence to suggest that the alternative
hypothesis, H1 , should be accepted in place of the null hypothesis, H0 . However, this cannot be done
with any certainty: only the likelihood of H0 being true can be considered. Therefore, a hypothesis
test is based on whether there is evidence to suggest that the null hypothesis should be rejected or
that there is no evidence for it to be rejected.
The alternative hypotheses of the first two situations are rather different to that of the third. The
inequalities p < 0.5 and p > 0.6 relate solely to a change in p in one particular direction, which causes
them to be called one-tail tests (one end of the ‘tail’ of the distribution curve is involved). Similarly,
the inequality p ≠ 0.5 causes the third situation to involve a two-tail test.
For a hypothesis test on the probability, p, the null hypothesis, H0 , proposes a value, p0 , (say)
for p : H0 : p = p0 . The alternative hypothesis, H1 , suggests the manner in which p might differ from
p0 . H1 can take three forms:
• H1 : p < p0 , a one-tail test for a decrease,
• H1 : p > p0 , a one-tail test for an increase,
• H1 : p ≠ p0 , a two-tail test for a difference.
The actual execution of a hypothesis test involves no additional mathematics to what you already
know; only the terminology and conventions of hypothesis testing must be adopted. As previously
alluded to, a hypothesis test is used to indicate whether an alternative hypothesis should replace a
null hypothesis. To do this, a probability value using the observed outcome must be obtained, which
is called the p-value.
The p-value is the probability of the observed outcome or a more extreme outcome.
The p-value can indicate one of two things: that H0 should be rejected (so there is evidence in
favour of H1), or that there is no evidence to reject H0. For H0 to be rejected, the p-value must be
less than what we call the significance level (the observed outcome then falls in the critical/rejection
region), and, for it not to be rejected, the p-value must be larger than the significance level (the
observed outcome then falls in the acceptance region).
The observed outcomes on the boundary of the rejection region are called critical values.
If the p-value is very small, this suggests that H0 is not true, since the observed outcome would be
unlikely to have happened by chance if H0 were true. The threshold below which the p-value is judged
to be small enough is called the significance level. Thus, for a test
at 5% significance, H0 is rejected if the p-value is 5% or less. Choosing a lower significance level,
therefore, reduces the probability that H0 is erroneously rejected, but it makes it more difficult to
recognise any change.
• The value of p is the probability of success in a binomial trial.
• The p-value is calculated from the observed outcome. Its value is used to decide whether the
null hypothesis, H0 , should be rejected.
• The critical/rejection region gives the values of the observed outcomes for which the null
hypothesis, H0 , is rejected.
• The acceptance region gives the values of the test statistic for which the null hypothesis, H0 ,
is not rejected.
• Any boundary values of the critical region are called critical values.
• The significance level of a test gives the probability that the observed outcome falls in the
rejection region when H0 is true.
The nature of hypothesis testing might seem very confusing at first, but it is remarkably simple in
practice.
5.2 Hypothesis testing basics

Hypothesis testing questions
1. Was the test set up before or after the data were known?
If the data were known, then we could essentially cheat when conducting a hypothesis test.
We could choose the null and alternate hypotheses along with the significance level to get the
result we want. So always treat a test with a little suspicion.
2. Was the sample involved chosen at random and are the data independent?
We normally conduct a hypothesis test to see whether a certain population has a different
result to what we once thought. If the sample is collected in such a way that it does not repre-
sent the population, then the hypothesis test becomes invalid.
Example of dependent data
A recent study of driving speed in England has been carried out across the entire country.
In this study, speed cameras were set up at 100 randomly sampled road-sites and the speed of all
cars passing through was registered. In order for all the data to be used, we must assume that
each car measured throughout the country had the same chance to drive at a particular speed.
However, this is absolute nonsense as different road-sites have different speed limits along with
limiting capabilities for speed. For example, if the first car recorded at a site drove 30 miles/h,
the probability of the next car passing through at a speed of 60 miles/h is much smaller than
when the first car recorded drove 65 miles/h. To conclude, measurements at the same road-site
resemble each other much more than measurements across the country do. Therefore, we do
not have the identically and independently distributed data that we need in order to conduct a
hypothesis test.
3. Is the statistical procedure actually testing the original claim?
The claim and the alternative hypothesis may be related, but are not necessarily the same. This is
because translating a verbal statement into mathematics can be difficult.
The steps for a hypothesis test (with a binomial distribution)
In order to carry out an ideal hypothesis test, you need to follow the steps below:
1. Establish the null and alternate hypotheses.
2. Decide on the significance level.

3. Collect suitable data using a random sampling procedure that ensures the items are independent.
4. Calculate the probability of the observed or a more extreme value and compare the probability
with the significance level.
5. For a one-tail test, reject the null hypothesis if this probability is less than the significance level;
for a two-tail test use the side that produces a probability less than 50% and reject the null
hypothesis if the probability is less than half of the significance level.
6. Interpret this result and write a conclusion that refers to the original problem.
This type of hypothesis test is one that uses p-values. However, we can use critical regions for a
hypothesis test. Critical regions are used in an example later on along with p-values.
Choosing the significance level
The standard significance level used in industry is 5%. However, significance levels of 1% and 10%
are still used from time to time. It is often down to the judgement of the person conducting the
hypothesis test as to what significance level to use. A 1% significance level means that we will only
reject the null hypothesis if there is incredibly strong evidence to reject it. A 10% significance level
means that we have a 10% chance of rejecting the null hypothesis when it is actually true (this is
down to the definition of significance level).
5.3 Hypothesis test examples

Example
A national opinion poll claims that 40% of the electorate would vote for party R if there were an
election tomorrow. A student at a large college suspects that the proportion of young people who
would vote for them is lower. She asks 16 fellow students, chosen at random from the college roll,
which party they would vote for. Three choose party R. Show, at the 10% significance level, that this
indicates that the reported figure is too high for the young people at the student’s college.
The null and alternative hypotheses are H0 : p = 0.4 and H1 : p < 0.4, where p is the propor-
tion of the electorate that would vote for party R.
The test is a one-tail test at 10% significance, so we reject H0 if the p-value is less than 0.1.
Assuming that H0 is true, X ∼ B(16, 0.4), where X is the number of students that would select
party R. The required probability, found using tables, is
P (X ≤ 3) = 0.0651 < 0.1.
∴ Reject H0 : there is sufficient evidence at the 10% significance level to suggest that the reported
figure is too high for the young people at the student’s college.
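The p-value in this test can be computed directly instead of being read from tables. This is a sketch under the assumption that H0 is true, so that X ∼ B(16, 0.4); the helper name `binomial_cdf` is illustrative.

```python
from math import comb

def binomial_cdf(c, n, p):
    """P(X <= c) for X ~ B(n, p)."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(c + 1))

n, p0, observed, significance = 16, 0.4, 3, 0.10

# One-tail test for a decrease: the p-value is P(X <= observed) under H0.
p_value = binomial_cdf(observed, n, p0)
reject_h0 = p_value < significance
print(round(p_value, 4), reject_h0)  # 0.0651 True
```

Here "more extreme" means fewer students choosing party R, because H1 is p < 0.4.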
Note that the wording of the conclusion had to be carefully made. Whilst it is apparent that the
sample was indeed a random sample for the college, it was not a random sample from the entire
population, and, consequently, it could not have been implied that the reported figure is too high for
the entire electorate.
Continuing with this example, if one more student had said that they would vote for party R,
then the probability would have been P (X ≤ 4) = 0.1666 > 0.1, which does not lead to the
rejection of H0 . This shows that the critical/rejection region is X ≤ 3 and the critical value is 3.
Even though the significance level of the test was intended to be 10%, the probability of X falling
in the critical region when H0 is true (the actual significance level) is not 10%, but approximately
6.51%. This is because X is discrete, so there is no value of X whose cumulative probability is
exactly 0.1. It must be emphasised that, even if P (X ≤ 4) were only slightly greater than 0.1,
X ≤ 4 could not be used as the rejection region: the probability of X falling within the rejection
region by chance must be at most the significance level, not merely as close to it as possible.
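The search for the critical value described above can be sketched in code. This is an illustrative sketch only, assuming Python's standard library; the function names binom_cdf and lower_critical_value are hypothetical helpers, not part of the notes.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(k + 1))

def lower_critical_value(n, p, alpha):
    """Largest k with P(X <= k) <= alpha, or None if even X = 0 is too likely."""
    k = None
    for c in range(n + 1):
        if binom_cdf(c, n, p) <= alpha:
            k = c          # c is still inside the rejection region
        else:
            break          # the cdf only increases, so we can stop here
    return k

k = lower_critical_value(16, 0.4, 0.10)
actual_level = binom_cdf(k, 16, 0.4)   # actual significance level of the test
next_prob = binom_cdf(k + 1, 16, 0.4)  # first cumulative probability above 10%

print(k, round(actual_level, 4), round(next_prob, 4))
```

The search stops at the last value whose cumulative probability does not exceed the significance level, which is why the actual significance level falls below the intended 10%.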
Example
In order to test a coin for bias, it is tossed 20 times. The result is 14 heads and 6 tails. Test,
at the 5% significance level, whether the coin is biased.
The null and alternative hypotheses are H0 : p = 0.5 and H1 : p ≠ 0.5, where p is the probability
of obtaining a head.
The test is a two-tail test at 5% significance, so we reject H0 if the probability in the relevant
tail is less than 0.025.
Assuming that H0 is true, X ∼ B(20, 0.5), where X is the number of heads. The required
probability, found using tables, is
P (X ≥ 14) = 1 − P (X ≤ 13) = 1 − 0.9423 = 0.0577 > 0.025.
∴ We do not reject H0 : we conclude that there is insufficient evidence to suggest that the coin is biased.
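As a numerical check of the coin example, here is a short sketch assuming only Python's standard library. The helper binom_cdf is an illustrative name, and the upper tail is obtained as 1 − P(X ≤ 13), exactly as in the working above.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(k + 1))

n, p0, observed, alpha = 20, 0.5, 14, 0.05

# Upper tail: P(X >= 14) = 1 - P(X <= 13).
upper_tail = 1 - binom_cdf(observed - 1, n, p0)

# Two-tail test: compare the single tail with half the significance level.
reject_h0 = upper_tail < alpha / 2

print(f"P(X >= 14) = {upper_tail:.4f}, reject H0: {reject_h0}")
```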
You may have noticed that we could have chosen X to be the number of tails. It does not matter!
In that case we would have calculated P (X ≤ 6), which equals P (X ≥ 14) because B(20, 0.5) is
symmetric, and then concluded in the same way. Some textbooks do not halve the significance
level. Instead, they carry out two-tail tests by adding up the probabilities of the two extremes.
With X as the number of heads, a result as extreme as 14 heads in the other direction is 6 heads
(20 − 14), and P (X ≤ 6) = P (X ≥ 14). Such textbooks would therefore compare
P (X ≤ 6) + P (X ≥ 14) with 0.05 at a 5% significance level, whereas we halve the significance
level and compare P (X ≥ 14) with 0.025. For this test we are not actually doing anything different.
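The agreement between the two conventions for this coin example can be illustrated with a brief sketch, assuming Python's standard library. The names binom_cdf, halved and summed are introduced here purely for illustration.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(k + 1))

lower = binom_cdf(6, 20, 0.5)        # P(X <= 6)
upper = 1 - binom_cdf(13, 20, 0.5)   # P(X >= 14)

# Convention used in these notes: one tail against half the significance level.
halved = upper < 0.025
# Convention used in some textbooks: both tails added, against the full level.
summed = lower + upper < 0.05

print(round(lower, 4), round(upper, 4), halved, summed)
```

Because the distribution under H0 is symmetric here, the two tails are equal and both conventions give the same decision.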
Note: In the last chapter, we learnt that the expected value of a binomial distribution is np.
Here np = 10. Since the observed result, 14, is greater than np, the p-value is calculated as
P (X ≥ 14). If the observed result were less than np, we would instead calculate P (X ≤ OR),
where OR is the observed result. This is because of step 5 in the steps for a hypothesis test.
Example
It is believed that 25% of people in the world have blonde hair. A scientist thinks that this figure
is an overestimate and wishes to investigate. The scientist sends hundreds of field workers to
randomly chosen places in the world, each to sample 30 people. Find the critical/rejection region
for the test at a 5% significance level.
The null and alternative hypotheses are H0 : p = 0.25 and H1 : p < 0.25, where p is the
probability of observing a person with blonde hair.
The test is a one-tail test at 5% significance.
Assuming that H0 is true, X ∼ B(30, 0.25), where X is the number of people with blonde hair
in a sample of 30 people. The critical region is the region X ≤ k, where P (X ≤ k) ≤ 0.05 and
P (X ≤ k + 1) > 0.05. From the cumulative binomial tables, P (X ≤ 3) ≈ 0.0374 ≤ 0.05,
but P (X ≤ 4) ≈ 0.0979 > 0.05. ∴ The critical region is X ≤ 3.
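The relevant rows of the cumulative binomial table can be reproduced with a small sketch, assuming Python's standard library. The helper binom_cdf and the choice to tabulate only k = 0 to 5 are assumptions made for illustration.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(k + 1))

n, p0, alpha = 30, 0.25, 0.05

# Rebuild the first few rows of the cumulative table for B(30, 0.25).
# Only k = 0..5 is tabulated, assuming the critical value lies in this range.
table = {k: binom_cdf(k, n, p0) for k in range(6)}
for k, prob in table.items():
    print(f"P(X <= {k}) = {prob:.4f}")

# Critical value: the largest tabulated k whose cumulative probability is <= 5%.
critical_value = max(k for k, prob in table.items() if prob <= alpha)
print("critical region: X <=", critical_value)
```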
Example
In a particular area in China, 25% of the air is believed to be polluted. An environmentalist wishes
to see whether this is true. She decides to sample 15 places in this area. For what numbers of
polluted places, at the 10% significance level, can she conclude that the pollution level in the area
has changed?
The null and alternative hypotheses are H0 : p = 0.25 and H1 : p ≠ 0.25, where p is the
probability that a sampled place is polluted.
The test is a two-tail test at a 10% significance level.
We want the probability in each tail to be as near as possible to 5%, but neither may exceed 5%.
Using the cumulative binomial tables it is observed that the left-hand tail includes 0, but not 1 or
more; the right-hand tail is 8 or above, but not 7. ∴ The critical region is X = 0 or X ≥ 8, where
X is the number of polluted places in the sample.
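The two tails can be located numerically as follows. This is a sketch under the assumption that each tail is allowed at most 5%, using only Python's standard library; the variable names are illustrative.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(k + 1))

n, p0, tail = 15, 0.25, 0.05   # 10% two-tail test, so 5% in each tail

# Left-hand tail: largest a with P(X <= a) <= 0.05.
lower = max((a for a in range(n + 1) if binom_cdf(a, n, p0) <= tail), default=None)

# Right-hand tail: smallest b with P(X >= b) = 1 - P(X <= b - 1) <= 0.05.
upper = min((b for b in range(1, n + 1) if 1 - binom_cdf(b - 1, n, p0) <= tail),
            default=None)

# Actual significance level: total probability of both tails under H0.
actual_level = binom_cdf(lower, n, p0) + (1 - binom_cdf(upper - 1, n, p0))

print(f"critical region: X <= {lower} or X >= {upper}, actual level {actual_level:.4f}")
```

As in the one-tail case, the actual significance level is below the intended 10% because X is discrete.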