A1: Probability and Statistics
SJ Payne
Michaelmas Term 2017 - 4 lectures, 1 tutorial sheet
With acknowledgements to T Blakeborough and V Grau

Books
All the recommended general mathematics text books
have sections on probability and statistics. The relevant
chapters of three popular ones are:
Riley, Hobson and Bence, Mathematical Methods for
Physics and Engineering (3rd edition): Chapter 24
Kreyszig, Advanced Engineering Mathematics (10th
edition): Chapters 24-25
James, Modern Engineering Mathematics (4th ed): ch 13
for Probability
James, Advanced Modern Engineering Mathematics (3rd
ed): ch 11 for Statistics.

Any of these covers the material in this course. In
particular, most of the notation and conventions have
been adapted from Kreyszig; however, Riley, Hobson and
Bence also contains an excellent description of the subject.
For a deeper coverage of the matter you can go to
specialised books such as:
Ross, Introduction to Probability and Statistics for
Engineers and Scientists (Academic Press, 5th edition,
2014). Recommended as a complement to Kreyszig if you
want to expand on the topic.
Grimmett, & Welsh, Probability: An Introduction, (OUP).
Mostly covers the Probability part.
T.T. Soong, Fundamentals of Probability and Statistics for
Engineers (Wiley 2004)
Table of Contents

1 Set theory
1.1 Laws of set algebra
1.2 A useful example: Rolling a die
1.3 Probability values
1.4 Permutations and combinations
1.5 Probability of events
2 Prior knowledge, Independence & Bayes' theorem
2.1 Total probability theorem
2.2 Bayes' rule
2.3 Independence
3 Probability distributions
3.1 Random variables
3.2 Discrete random variables
3.2.1 Expectation and Variance
3.2.2 Independent random variables
3.2.3 The uniform discrete probability function
3.2.4 The Binomial (or Bernoulli) distribution
3.2.5 Geometric distribution
3.2.6 Poisson distribution
3.2.7 Poisson approximation to the binomial distribution
3.3 Continuous random variables
3.3.1 Expectation and Variance of a continuous distribution
3.3.2 Other measures of central tendency and spread
3.3.3 Chebycheff's inequality
3.3.4 Uniform distribution
3.3.5 Exponential probability density function
3.3.6 Normal or Gaussian distribution
3.3.7 Central limit theorem: approximations using the normal distribution
3.3.8 The chi-square distribution
3.3.9 The t-distribution
4 Random sampling
4.1 Characterising random samples
5 Estimating mean and variance of a distribution
6 Confidence intervals
6.1 Confidence interval for the mean when the variance is known
6.2 Confidence interval for the mean when the variance is unknown
6.3 Confidence interval for the variance of a normal distribution
6.4 Distributions different from the normal
7 Hypothesis testing
7.1 Type I and type II errors
8 Conclusion
PART 1. PROBABILITY
Life is full of things that are not certain. We therefore
need a way of quantifying this uncertainty. The
mathematical treatment of uncertainty is based on
probability theory. The formulation of the theory relies
on sets of events so before outlining the theory it will be
useful to go over a bit of set theory.

1 Set theory
• A set is a collection of objects with some common
property.
• The objects are the elements of the set.
• A set is represented by enclosing the objects inside
curly brackets, e.g.
A = { 1, 2, 3, 4, 5, 6 } and C = {x : x > 0}
• The symbol ∈ means 'is a member of', e.g.
y ∈ C means y is a member of C.
• The empty or null set ∅ contains no elements.
• The symbol ⊆ means 'is a subset of', i.e. A ⊆ B (or
equivalently B ⊇ A, B is a superset of A) means every
element of A is also an element of B.
• The symbol ∪ means union and A ∪ B is the set
containing all the elements of A and B.
• The symbol ∩ means intersection and A ∩ B contains
the elements common to both A and B.
• A space is the set containing all the elements that are
being considered.
• The set containing all the elements not in a particular
set is called its complement. There are various
conventions about how this is written down. These notes
use a superscript 'c', e.g. Aᶜ, but you will also see Ā
and S\A in books.
• If the intersection of two sets is the null set (i.e. they
have no elements in common) the sets are said to be
disjoint or mutually exclusive.
A good way of visualising sets, and set operations, is to
use Euler or Venn diagrams. At this point it should be
obvious to you what the union and the intersection sets
correspond to in the diagram.

[Venn diagram: sets A, B and C inside S, the space containing all elements]

1.1 Laws of set algebra


Some important laws of set algebra are given below. They
should all be quite intuitive. As an exercise, try and check
these using Venn diagrams.
Union and intersection are commutative, associative and
distributive:
A ∪ B = B ∪ A
A ∩ B = B ∩ A
(A ∪ B) ∪ C = A ∪ (B ∪ C)
(A ∩ B) ∩ C = A ∩ (B ∩ C)
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)

Additional laws: If S is the set that contains all possible
elements (the space),
A ∪ ∅ = A        A ∩ A = A
A ∩ S = A        A ∪ S = S
A ∪ Aᶜ = S       A ∩ ∅ = ∅
A ∩ Aᶜ = ∅
A ∪ A = A


1.2 A useful example: Rolling a die


As the result of rolling a single conventionally numbered
die once (experiment or trial) you get a ‘6’ (an
outcome or sample); the set of all possible outcomes is
called the sample space S, which for a successful throw
of a die is S = {1, 2, 3, 4, 5, 6}. Often we are not
interested in individual outcomes, but in whether an
outcome belongs to a given subset (e.g. A) of S. These
subsets are called events. In this example, we might
consider two mutually exclusive events: throwing an even
number or throwing an odd number; another event would
be throwing a number which is an integer multiple of 3.
We must be careful to include all possible outcomes, and
to take into account whether all outcomes are equally
likely. For our purposes here, we will assume that we
have a fair die, and it always ends up with a number face
up.
Note that:
• The outcomes are all separate. Each trial results in only
one outcome – you cannot get both a '1' and a '6' from
a single throw of a die – so the outcomes correspond to
the elements of a set.
• As the result of a trial at least one event occurs (the
sample space S itself is an event).

1.3 Probability values
We will now associate a probability to each outcome. We
do this such that a probability of 1 corresponds to
certainty and 0 to impossibility.
We can observe that the probability of an event depends
on the probabilities of the outcomes it contains. This can
be achieved by defining the probability of an event as the
sum of the probabilities of all the outcomes it contains.
P(A), the probability of event A, can thus be defined as
P(A) = P(∪ᵢ eᵢ) = Σᵢ P(eᵢ)   for all eᵢ ∈ A,
where eᵢ represents an outcome within the event A.
Since something is bound to happen as the result of an
experiment we can say straight away that P(S) = 1.
Similarly, the probability that none of the possible
outcomes will happen is 0 (i.e. P(∅) = 0), and we can note
that no event can be less likely to happen than this, so
there are no negative probabilities (i.e. P(A) ≥ 0).

Example: single throw of a die


If the die we considered above is fair or unbiased, each
outcome is equally probable and we can say that the
probability of any number coming up is 1/6.
P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6

1.4 Permutations and combinations
A central idea in understanding probability calculations is
the concept of relative frequency, i.e. the frequency with
which we can expect a particular event to appear among
all possible events. If all events are equally likely, we just
need to count the number of possible results to assess
probabilities. There are tools to calculate the number of
results that can occur in different conditions. To
understand the possibilities, it is useful to contemplate a
simple situation for our examples – the drawing of balls
with different colours (or numbers) from bags.
A typical question could be: what is the chance of picking
first a white ball, followed immediately by two black balls,
from a bag containing 7 white balls and 3 black ones?
There are two factors that we need to take into account.
The first relates to whether the balls are put back into the
bag or not:
a.1) when we put the ball back into the bag before
drawing the next one (in which case the chances of
selecting a particular ball does not change, i.e. the
outcomes are independent), or
a.2) when we do not replace the ball (and thus the odds
change for subsequent selections).
In the first case we say that sampling is done with
replacement, while the second corresponds to sampling
without replacement.
The second factor relates to whether we are interested in
the order in which we have drawn them:
b.1) If we do not care about the order, then we talk about
combinations.
b.2) If the order is important, then we talk about
permutations.
These two factors combined give rise to four possible
cases: permutations/combinations with/without
replacement. In the following we will obtain mathematical rules
for the number of possible outcomes in each case.

Let us first find out in how many ways it is possible to
arrange a set of objects. Let us imagine a bag containing
n balls, numbered 1 to n, and let us draw r of them in
sequence without replacement.
Now there are n possibilities for the first ball, n-1 for the
second, and so on until you get to n-r+1 ways for the rth
and last ball.
The total number of possible sequences of r balls from a
bag with n balls is therefore
nPr = n(n−1)(n−2)⋯(n−r+1) = n!/(n−r)!
Since we do not put the balls back and the order is
considered, we are talking about permutations without
replacement and the number of permutations is written as
nPr, as shown above.

As an example, calculate how many card sequences you
can get if you take 10 cards (one by one) from a deck of
52 – the number is surprisingly large!
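A quick way to see just how large: a minimal Python sketch using only the standard library (math.perm requires Python 3.8 or later).

```python
# Sketch: counting permutations without replacement, 52 P 10.
import math

n_sequences = math.perm(52, 10)   # 52! / (52 - 10)!
print(n_sequences)                # 57407703889536000, i.e. about 5.7e16
```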
If r = n, and noting that 0!=1, we can see there are n!
unique ways of ordering n objects. The probability of
obtaining any particular sequence is therefore 1/ n!.
In the case of permutations with replacement, we put
the previous ball back in the bag before drawing the next
one, so each ball is actually taken in exactly the same
conditions. The number of possible sequences of r balls
from a total of n is nr.
Now suppose we are only interested in the set of balls
that we draw, not the precise order. How many ways of
doing this are there?
We can calculate this by first choosing a permutation
containing the right balls – one of the n P r permutations.
There are r! ways of ordering this collection of balls (see
above) so the number of unique collections (the number
of combinations without replacement) is the number
of permutations over r!:

nCr = nPr/r! = n!/((n−r)! r!)
nCr can also be written as the binomial coefficient
("n choose r"), which will be used later. It is easy
to see that nCr = nC(n−r).

Example: What is the chance of winning the jackpot in
the national lottery?
There are 49 balls, numbered from 1 to 49, in the
machine. Six balls are selected without replacement and
the jackpot is won when all 6 of the selected balls are
correctly identified.
How many ways of choosing combinations of 6 balls from
a set of 49 are there? It is 49C6.
49C6 = 49!/((49−6)! 6!) = 49!/(43! 6!) = 13,983,816
Our chances: each combination is equally likely so the
chance of selecting all the numbers is one in 13,983,816.
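As a quick check with the standard library (math.comb requires Python 3.8 or later):

```python
# Sketch: counting combinations without replacement, 49 C 6.
import math

ways = math.comb(49, 6)   # 49! / (43! * 6!)
print(ways)               # 13983816
print(1 / ways)           # ~7.2e-08, the chance of any single ticket winning
```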

The last possible case is combinations with
replacement. As above, given a set of n possible values,
if we take a set of r allowing for the values to be repeated
(i.e., with replacement), how many different combinations
can we get?
Obtaining a mathematical expression for this case is less
intuitive than in the previous ones, so we will state the
answer first. The number of combinations with
replacement is (n+r−1)C(n−1) or, equivalently, (n+r−1)C(r).

Example: If we throw five dice, how many combinations
of numbers can we get?
Obviously we need to allow for repetitions (replacement),
and given that the order is not important in this case we
are talking about combinations with replacement. Given
that r = 5 and n = 6, the number of combinations is
(5+6−1)C5 = 10C5 = 10!/(5! 5!) = 252

In order to derive the result above, we note that for each
of our r selections, we have n possibilities. This is exactly
equivalent to the problem of putting r balls into n boxes,
or dividing r items into n categories. We can imagine this
as follows:
r=3, n=6 o|x|o|x|x|o or x|o|x|x|o|o or …
r=6, n=6 oo|x|o|o|o|o or o|o|x|oo|o|o or …
r=6, n=3 oo|o|ooo or o|oooo|o or …

Each of these problems is equivalent to ordering n−1
dividers (|) and r objects (o) into n+r−1 locations. This is
now a problem of selection without replacement, in which
we aim to select n-1 or equivalently r of the n+r-1
possibilities, the number of ways in which this can be
done is (n+r−1)C(n−1) or, equivalently, (n+r−1)C(r).

1.5 Probability of events


Events are sets of outcomes. In the case of a die, we can
ask what is the probability of the number being even, or
being larger than 3. We can also ask more complicated
questions, like what is the chance of it being both even
and larger than three, or larger than a second throw of
the die (although in this case we will have to increase our
sample space to contain pairs of throws).

The answer to the first question, that the number is even,
is the set of outcomes E={2, 4, 6}, and according to the
summing rule the probability of the event is the sum of
the probabilities of the outcomes contained in the event
set.
P(E) = 1/6 + 1/6 + 1/6 = 1/2
Intuitively this corresponds again to the idea of relative
frequency, i.e. if we repeated the test a large number of
times, how often will the outcome be in E? In this example,
we can expect to get even values half of the time. In a
similar way, we could ask for the probability of the
number being exactly divisible by 3, i.e. what is P(T) when
T={3, 6}? We can see that the probability of T is 1/3
(=2/6).
Now let us ask the combined question “what is the chance
of the number being exactly divisible by two or three or
both?”
This set is C={2, 3, 4, 6}, which has four equally likely
outcomes so the probability P(C) = 4/6 = 2/3. Note we
cannot just add the probabilities of each event because
the element ‘6’ appears in both sets and its probability
would be included twice. We must correct for this by
taking off one of each of the probabilities of the joint
outcomes.
This illustrates the rule of addition of probabilities
P A  B  P A  P B  P A  B
It is easy to confirm this visually by checking the sets A, B,
their union and their intersection. We can also check that
it gives the same answer as before:
P C   P E  T   P E   P T   P E  T 
1 1 1 2
   
2 3 6 3

2 Prior knowledge, Independence & Bayes’
theorem
In the examples above, the sample space S was the whole
set of possible outcomes. There are many situations in
which we possess prior information, i.e. we already know
something about the outcome. For example, the likelihood
of a device failure in the next minutes might be affected
by the knowledge that the room temperature is unusually
high. As an example for our die we can ask “if we know
that the outcome is even, what is the probability of it
being larger than 3?”
The set of outcomes in S that are both even and larger than
3 is {4, 6}, and the set of outcomes that are even but not
larger than 3 is {2}. It thus seems that the mentioned
event will happen 2/3 of the time. Notice that this is
different from the probability of the value being larger
than 3 in the absence of any prior information, which
would be 1/2.
Again, let’s try to calculate this in a more formal way.
What we are trying to calculate here is a conditional
probability, i.e. the probability of the event A given that
the event B has happened. By introducing our prior
knowledge about the outcome (i.e. that it is even), we
have reduced the set of possible outcomes to a subset of
S, which we will denote B = {2,4,6}. The event for which
we want to calculate the probability is A = {4,5,6}, of
which the only ones possible now are A ∩ B, i.e. {4, 6}.
This conditional probability of A given B is written P(A|B)
(remember the form of the words; it will help you with the
order of the arguments), and following the argument
above it can be defined as:
P(A|B) = P(A ∩ B) / P(B),   where P(B) ≠ 0

which we can also write:
P(A ∩ B) = P(A|B) P(B) = P(B|A) P(A)


[Venn diagram of the die sample space: event A = {4, 5, 6}, event B = {2, 4, 6}, overlapping at {4, 6}]
Intuitively we can read this as: the chance of two events
happening simultaneously is the chance of one of them
happening, multiplied by the chance of the second
happening given that the first has happened.
2.1 Total probability theorem
We can use this result to derive important relationships.
The first is the total probability theorem. If we partition
the sample space into a set of n disjoint sets Ai

[Diagram: the sample space S partitioned into disjoint sets A1, A2, …, An, with the event B overlapping several of them]
we can see from the addition rule above that the
probability of B is given by
P B   P B A1 P  A1   P B A2 P  A2     P B An P  An 
n
  P B Ai P Ai 
i 1

As a special case we can say
P(B) = P(B|A)P(A) + P(B|Aᶜ)P(Aᶜ)
2.2 Bayes’ rule
From the equation above:
P(A ∩ B) = P(A|B) P(B) = P(B|A) P(A)
we can then rearrange this to give
P(A|B) = P(B|A) P(A) / P(B)
which is Bayes’ rule.
Bayes' rule is often combined with the total probability
rule:
P(Aj|B) = P(B|Aj) P(Aj) / Σᵢ P(B|Ai) P(Ai),   summing over i = 1 to n

This simple equation is extremely important in many
branches of Engineering.

2.3 Independence
The final thing to note is that if prior knowledge does not
affect the probability of the second event, P(A|B) = P(A),
and thus
P(A ∩ B) = P(A|B) P(B) = P(B|A) P(A) = P(A) P(B)
then we say that the events A and B are independent.
So: if two events are independent, the probability that the
two of them happen in the same experiment is the
product of their individual probabilities. Note that
statistical dependence does not require any causative link,
as we have seen before in the example of the die
outcomes.

Example: Clinical test:
A test for a rare disease detects the disease with a
probability of 99%, and has a false positive ratio (i.e. it
tests positive even though the person is healthy) of 0.5%.
We know that the percentage of the general population
who have the disease is 1 in 10000.
Suppose we choose a random subject and perform the test,
which comes out positive. What is the probability of the
person actually having the disease?
The probability of D (having the disease) before the test is
1 in 10000: P(D)=0.0001 (i.e. P(Dc)=0.9999)
The conditional probabilities of getting a positive in the
test P(T) when the person does / does not have the
disease are P(T|D)=0.99 (true positive ratio) and
P(T|Dc)=0.005 (false positive ratio).
The probability of getting a positive result in the test is
P(T), which we can calculate using total probability:
P(T) = P(T|D)P(D) + P(T|Dᶜ)P(Dᶜ) ≈ 0.005
Now we just need to apply Bayes' theorem:
P(D|T) = P(T|D)P(D)/P(T) = (0.99 × 0.0001)/0.005 ≈ 0.02
You may be surprised to know that even with a high
quality test such as the one described, if a random subject
is tested and it comes out positive, the probability of the
subject being ill is only 2%. Of course this would all
change if the person has shown previous symptoms or
belongs to a certain risk group (i.e. if you have further
prior information).
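As a minimal Python sketch of the same calculation (all numbers are those of the example above):

```python
# Sketch: clinical-test example via total probability and Bayes' rule.
p_disease = 0.0001                      # P(D)
p_pos_given_disease = 0.99              # P(T|D), true positive rate
p_pos_given_healthy = 0.005             # P(T|D^c), false positive rate

# Total probability: P(T) = P(T|D)P(D) + P(T|D^c)P(D^c)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' rule: P(D|T) = P(T|D)P(D) / P(T)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_pos, p_disease_given_pos)       # ~0.0051 and ~0.019, i.e. about 2%
```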

If you like brain teasers, you may want to look for the
“Monty Hall problem”, an application of prior probability
that has puzzled many since it was originally proposed in
1975.

3 Probability distributions
3.1 Random variables
If we assign a numerical value to each point in a sample
space, then we have a function defined on the space. This
function is called a random variable, usually represented
by a capital letter, e.g. X or Y. Variables that are random
in time are also called stochastic variables.
If the sample space is a set of discrete points then the
variable is discrete, otherwise it is non-discrete, or
continuous. For example, the number of students at a
particular lecture is a discrete random variable; the height
of the tallest student is a continuous random variable.
3.2 Discrete random variables
Given a sample space containing a discrete set of
values xi, we say that the probability that the random
variable X equals the value xi is p(xi), or
P(X = xi) = p(xi)
The function p(xi) is a discrete probability density
function (pdf). The function must obey the basic rules of
probability, i.e.
p(xi) ≥ 0   and   Σᵢ p(xi) = 1, summing over all i.

For historical reasons the discrete probability density
function is also known as the probability mass function.
The Figure below represents a sample discrete pdf.
Later in these lecture notes we introduce some
particularly interesting probability density functions.
Before that, however, we will look at ways to describe
discrete probability distributions.

[Figure: probability density function of a sample discrete random variable]

3.2.1 Expectation and Variance

We often need to quantify the main properties of a
distribution using a small set of parameters. One of the
most useful parameters is called the expectation.
Intuitively it is related to the concept of mean of a number
of samples, so we will rely on this well-known concept to
introduce it.
We know if we take N samples of a random variable we
can calculate the mean value:
x̄ = (1/N) Σₖ xₖ,   summing over the N samples

This is itself a random variable (the mean is probably
going to be different if we take another set of N samples),
but as N → ∞ we can expect this to settle down to the
true average of the random variable.
If the variable is discrete, there is a set of distinct values
that x can have. If we collect all of these together we can
reorganise the sum above
x̄ = (1/N) Σᵢ nᵢ xᵢ = Σᵢ xᵢ (nᵢ/N)

where ni is the number of times the value xi appears in
the N samples. But as N → ∞ we know that nᵢ/N → p(xᵢ),
so we can write
x̄ = Σᵢ xᵢ p(xᵢ)

We call this the 'expected' value of the discrete random
variable X
E[X] = Σᵢ xᵢ p(xᵢ)

and it is the value we would expect the mean to be.


We can extend this argument to work out the expected
values of functions of the random variable. Instead of
averaging the value of the variable itself, we can consider
some function Z = f(X) of the random variable X. In which
case, we would get
E Z   E f X    f x i px i 
i

We can show that expectation is a linear operator: if
Z = aX + b,
E[Z] = E[aX + b] = Σᵢ (a xᵢ + b) p(xᵢ)
     = a Σᵢ xᵢ p(xᵢ) + b Σᵢ p(xᵢ)
     = a E[X] + b
Similarly, it is easy to demonstrate that, given two
random variables X and Y, the expectation of a linear
combination of the two (Z = aX + bY) is a linear
combination of the respective expectations:
E Z   E a X  bY   a E X   bEY 
The expectation gives us an idea of which values we can
expect to get (though we still need to be careful on the
use of this parameter: for example, there is no guarantee
that the expectation corresponds to the value that is most
likely to appear). We would also like to know how much
the values are likely to vary between tests: this is related
to the breadth of the distribution. A simple, intuitive way
to quantify this is by getting the 'average' value of the
square of the difference of each point from the mean. This
figure is called the variance of the distribution
σX² = E[(X − x̄)²]

We can expand this to give a useful way of calculating the
variance:
σX² = E[(X − x̄)²]
    = E[X² − 2Xx̄ + x̄²]
    = E[X²] − 2x̄E[X] + x̄²
    = E[X²] − (E[X])²

The square root of the variance is called the standard
deviation.
In the same way as we did above for the expectation, one
can easily demonstrate that the variance of Z = aX + b is
σZ² = a²σX².

3.2.2 Independent random variables

We saw before how, if two events are independent, the
probability of the two events happening together is the
product of their individual probabilities. This concept can
be extended to random variables. Given two random
variables X and Y, they are independent if the probability
that X takes the value xi and Y takes the value yj
simultaneously is P(xi)P(yj). This needs to hold for any
combination of values xi, yj.
A consequence of the definition of independence between
random variables is the following: given two independent
variables X and Y, if we define Z = aX + bY, the variance
of Z is σZ² = a²σX² + b²σY². You can find the proof in the
recommended bibliography – or even better, try to prove
it yourself.

3.2.3 The uniform discrete probability function

A uniform probability function is one which assigns the
same probability value to all points. An example of
uniform discrete probability function is the score from
rolling a die – all values from 1 to 6 receive the same
probability equal to 1/6.
p(i) = 1/6,   i = 1, 2, 3, 4, 5, 6
Apart from our die, other examples are drawing a single
card from a pack, the numbers on an English roulette
wheel, the number of the person who draws a short straw.
Most (fair) games of chance use the uniform distribution
as their basis.
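As a quick illustration of the definitions from Section 3.2.1 applied to the die score, here is a minimal Python sketch:

```python
# Sketch: expectation and variance of the uniform die score, computed directly
# from E[X] = sum x p(x) and Var[X] = E[X^2] - (E[X])^2.
values = [1, 2, 3, 4, 5, 6]
p = 1 / 6

mean = sum(x * p for x in values)               # 3.5
second_moment = sum(x**2 * p for x in values)   # 91/6
variance = second_moment - mean**2              # 35/12, about 2.92
print(mean, variance)
```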

3.2.4 The Binomial (or Bernoulli) distribution

Imagine you are sampling sets of data. You might be
counting the number of sixes you get when you throw
three dice, or the number of ‘1’s in a binary byte
transmitted across the net, or the number of times in a
month the average daily flow in a river exceeds a certain
value. All these events have the property that you are
counting the number of times a condition, which can
either be true (and given a value 1) or false (with value 0),
is met. This type of sampling (something that is either
true or false) is called a Bernoulli trial.
To represent this with a simple example, let us say you
are throwing a set of n “special” coins. The property of the
coins is that they come up with heads with a probability of
p, and so tails have a probability q = 1 – p. We would like
to know what the chance is of getting a given number of
heads in a throw of the n coins.
There are the following possibilities:
• No '1's – all trials must result in '0's so the probability
is qⁿ.
• A single '1' – let the first trial be a '1' and the
remainder '0'. The probability of this is qⁿ⁻¹p. The '1' can
come in any of n mutually exclusive positions, so the
probability of getting a single '1' anywhere in the order
is nqⁿ⁻¹p.
• r '1's – let r ones come first followed by n − r zeros. The
probability of this is qⁿ⁻ʳpʳ, but then the r '1's and n − r
zeros can be arranged in nCr ways (from Section 1.4) so
the probability is nCr qⁿ⁻ʳpʳ.
• All '1's – all trials must be '1' so the probability is pⁿ.
The values for each possible outcome are summarised in
the following Table:
No of successes:   0     1        ….   r             ….   n
Probability:       qⁿ    nqⁿ⁻¹p   ….   nCr qⁿ⁻ʳpʳ    ….   pⁿ
We can prove that this distribution meets the properties of
a discrete probability distribution: a) It is obvious that all
probabilities are positive; and b) To demonstrate that the
sum of all the probabilities equals 1, we can see
that the terms in the Table above correspond to those of
the binomial expansion (q + p)ⁿ, which equals 1 since
q+p=1.
(q + p)ⁿ = qⁿ + nqⁿ⁻¹p + … + nCr qⁿ⁻ʳpʳ + … + pⁿ
This distribution is called the binomial distribution
(from the generating function) or the Bernoulli
distribution, after Jacques Bernoulli, the 17th century
Swiss mathematician who first discovered it. It is often
represented in the form B(n,p).
The binomial distribution is very common and it occurs
whenever there is a set of independent trials with
outcomes ‘0’ or ‘1’ and which all have the same
probability of success. Below you can see sample plots
corresponding to B(100,1/3) and B(100,1/2) in the first
Figure, and B(25,1/3) and B(25,1/2) in the second. Check
whether the position of the maximum and the width of the
distribution are those you expected.


[Figure: binomial probability distributions B(100, 1/3) and B(100, 1/2) (upper plot) and B(25, 1/3) and B(25, 1/2) (lower plot)]
Example: At a particular junction 10% of cars turn left.
Five cars approach the junction, what is the probability
that 3 or more will turn left?
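The worked answer is left blank in the notes; as a check, a minimal Python sketch of the calculation (the number of left-turning cars is binomial B(5, 0.1)) is:

```python
# Sketch: P(X >= 3) for X ~ B(5, 0.1), summed directly from the binomial pmf.
from math import comb

n, p = 5, 0.1
p_3_or_more = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(3, n + 1))
print(p_3_or_more)    # ~0.00856
```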


3.2.5 Geometric distribution

Another distribution where Bernoulli trials pop up is the
geometric distribution. Imagine you throw one of our
special coins (remember, the ones which had a probability
p of heads and q=1-p of tails). The variable for which we
are trying to get the distribution is the number of trials to
the first success, i.e. how many times do we throw the
coin until we get the first heads. This is calculated quite
easily. For the kth attempt to be the successful one the
previous k – 1 attempts must all have been failures
(which has a probability qk-1), and this one a success
(probability p), which gives:
pk   q k 1 p
You may recognise that these are the terms of a
geometric series with ratio q. We can use this to check
that the probabilities all add up to 1:
Σₖ p(k) = Σₖ qᵏ⁻¹ p = p/(1 − q) = p/p = 1,   summing over k = 1, 2, 3, …

3.2.6 Poisson distribution

Another important discrete distribution is the Poisson
distribution, which concerns the number of occurrences of
an event over time.
The Poisson distribution covers the occurrences of rare
events, like traffic accidents on a road, number of cars
crossing a bridge in very light traffic (if the traffic is heavy
the events are not independent) or the injuries to Prussian
cavalry officers by their horses – the problem Poisson was
originally asked to investigate which led to his discovery
of the distribution. We can get to the Poisson distribution
through the binomial distribution.
Let us imagine we are looking at a process where the
probability of an event happening is uniform in time with a
mean rate of λ events per unit time (for example,
we may know that a certain call centre receives an
average of fifty calls per hour). Let us split our unit time
period into a series of n small, equal periods of time. The
probability of an event happening on each of the smaller
periods is λ/n but there are n of them. If we keep on
dividing the period up into more and more of ever shorter
periods eventually we will get to the point when every
little period has at most one event in it. Sampling the
process for a unit period of time can then be considered a
binomial process (since each sub-period can have either
one event in it or none) with a very large number of sub-
periods n.
What is the distribution of successes (note that we use
success here as a probability term, and it does not
necessarily reflect a positive event)?
The process is binomial so the probability of there being k
successes is
p(k) = nCk qⁿ⁻ᵏ pᵏ
with p being the probability of a success within each of the
n periods, i.e. p = λ/n. Let us make the substitution for p:
p(k) = [n!/((n−k)! k!)] (1 − λ/n)ⁿ⁻ᵏ (λ/n)ᵏ
which can be rewritten
p(k) = [n!/((n−k)! k!)] (λ/n)ᵏ (1 − λ/n)ⁿ (1 − λ/n)⁻ᵏ
Now what happens to this as n → ∞? Let us look at each
component in turn:
n!/(n−k)! → nᵏ
(1 − λ/n)ⁿ → exp(−λ)
(1 − λ/n)⁻ᵏ → 1
and so the probability becomes
p(k) = (nᵏ/k!) (λ/n)ᵏ exp(−λ)
     = (λᵏ/k!) exp(−λ)
This is the probability distribution of independent rare
events. The requirements are that
• each event is independent of the others
• only one event can happen at a time
• the mean rate of events is constant.
So in our example of the call centre, we can use the
function above to calculate the probability of getting a
certain number of calls in one hour, helping us to estimate
the number of lines we need to hire to attend the calls.
Note that this is much more than the original information
(which was just the average number of calls per hour).
We can see that the sum of all probabilities is
Σₖ (λᵏ/k!) exp(−λ) = exp(−λ) exp(λ) = 1,   summing over k = 0, 1, 2, …

It is relatively easy (and left as an exercise for the tutorial
sheet) to show that both the mean and the variance
of the Poisson distribution are equal to λ.
Examples of Poisson distributions with values of λ equal to
1, 5 and 10 are given below. You can see how both the
centre and the spread of the distribution change with λ.

[Figure: Poisson probability distributions for λ = 1, λ = 5 and λ = 10, plotted against k]

Example: If the mean number of accidents on a particular road
is 2 per month, what is the probability that there are no
accidents during a given month?
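The worked answer is left blank in the notes; a minimal Python sketch using the Poisson probability function is:

```python
# Sketch: with lambda = 2 accidents per month, P(no accidents) = exp(-lambda).
from math import exp, factorial

def poisson_pmf(k, lam):
    return lam**k * exp(-lam) / factorial(k)

print(poisson_pmf(0, 2))    # exp(-2), about 0.135
```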

3.2.7 Poisson approximation to the binomial
distribution

The Poisson distribution has another use. We derived it by
taking the binomial distribution to a limit, so it seems
reasonable that the Poisson distribution can be used to
approximate the binomial distribution for large n and small
p (setting λ = np). As a rule of thumb, we can say this
is reasonably accurate if n > 10 and p < 0.1.
Example: What is the probability of there being no
defective components in a batch of 100 if the probability
of a component being defective is 0.02?
Binomial (the correct distribution)

Poisson approximation
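The worked answers are left blank in the notes; as a rough check, a minimal Python sketch computing both values is:

```python
# Sketch: exact binomial answer vs the Poisson approximation
# for n = 100, p = 0.02, lambda = np = 2.
from math import comb, exp

n, p = 100, 0.02
lam = n * p

p_binomial = comb(n, 0) * p**0 * (1 - p)**n    # 0.98**100, about 0.133
p_poisson = exp(-lam)                          # exp(-2), about 0.135
print(p_binomial, p_poisson)
```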

3.3 Continuous random variables
In the previous section we considered discrete random
variables – variables that can take one of a countable set
of distinct values. In other cases the variable can vary
continuously over a range: take for example the
distribution of student heights within your class. In this
case it is no longer possible to define the probability of
getting any single value of the variable because there is
an infinite number of them. It turns out that we have to
define the probability not of getting a particular value, but
the probability of the variable being within a given range.
It is best to approach this problem indirectly by starting
with the cumulative distribution function (CDF) -
sometimes also called cumulative probability function or
CPF. This is defined as the function F(x), equal to the
probability that the variable is less than or equal to a particular value:
F(x) = P(X ≤ x)
Discrete random variables also have CDFs; the one for the
Poisson distribution with λ = 25 is shown below.
[Figure: cumulative probability function and discrete probability function for a Poisson distribution with λ = 25]

We can see that because the probability function is
discrete this CDF has a staircase shape. We can also note
that (all these should be evident if you have understood
the definition above):
• it starts from a value of 0,
• rising monotonically
• to a final value of 1, and that
• it is steeper where the probability density function is
largest.
We can determine the probability that the random
variable lies inside a given range from the CDF
P(xl < X ≤ xu) = F(xu) − F(xl)

In particular, consider the mean level of probability in a
small range around x
P(x < X ≤ x + δx)/δx = [F(x + δx) − F(x)]/δx
As δx → 0 the term on the right becomes the differential
of the CDF and we call this the probability density
function (pdf), f(x), or
f(x) = dF(x)/dx
We can see that because F(x) rises monotonically with x,
f(x) ≥ 0 (condition A), and because F(∞) = 1 we can say
∫ f(x) dx = 1, integrating over all x (condition B)
Any function that satisfies conditions A and B is a valid pdf.
We can now calculate the probability that the random
variable lies within a certain range as an integral
P(xl < X ≤ xu) = ∫ f(x) dx,   integrated from xl to xu

This is illustrated in the following Figure: the probability
that the variable takes a value between xl and xu is the
highlighted area.


[Figure: a pdf f(x) with the area under the curve between xl and xu highlighted; P(xl < X ≤ xu) is equal to this area]

If the CDF is continuous (no steps) then the pdf is a finite
valued function and the probability of any particular
number being an outcome from a trial is zero, which can
be a rather perplexing thought. This means that zero
probability does not necessarily imply an impossible event.
On the other hand an impossible event will always have a
zero probability.

3.3.1 Expectation and Variance of a continuous distribution

Earlier we worked out the expectation of discrete variables.
We can also extend this idea to continuous random
variables if we switch our sums to integrals.
So we can just substitute sums by the corresponding
integrals to get
x̄ = E[X] = ∫ x f(x) dx,   integrating over all x

We note that this is the first moment of the distribution.


The variance, and from it, the standard deviation, is
defined in the same way as before, and is derived from
the second moment of the distribution
σ² = E[(X − x̄)²]
   = E[X²] − (E[X])²
   = ∫ x² f(x) dx − (∫ x f(x) dx)²

3.3.2 Other measures of central tendency and spread

We have already noted that the mean is a measure of the
average value. However, as we already mentioned above,
the mean does not necessarily correspond to the most
likely value: in fact it is easy to come up with possible
pdfs which have a value equal to zero at the mean. There
are two other measures that give ‘typical’ values of the
variable.
The first is the mode, which is the most likely value, i.e.
the peak of the pdf.
The second is the median, which is the mid-point in the
distribution. For a sample of values drawn from the
distribution, half are likely to fall below the median.
[Figure: a pdf f(x) showing the mode (the most probable value, at the peak of the pdf) and the median (the point with P(X < xmedian) = 0.5, i.e. 50% of the probability on either side)]

The standard deviation is one measure of the dispersion of
a distribution: the higher the value of the standard
deviation, the less concentrated around the mean the
distribution is. For some distributions it makes more sense
to extend the median idea to give other proportions of the
distribution, so you get quartiles (25% of samples are
likely to fall below the lower quartile and 75% above).
There is also the upper quartile where the proportions are
the other way around. More generally we talk about
percentiles. A typical one is the 95th percentile: on
average only 5% of the samples will be higher than this
value. It is used in civil engineering to define the
‘characteristic load’ that is the standard load used in
design for the service (everyday) condition, and is the
load expected to be exceeded only once in 20 times.

3.3.3 Chebycheff’s inequality

The exact values of these outlying measures depend on
the particular distribution that you are considering. We
can, however, obtain bounds to the extreme values of
probability functions from the mean and standard
deviation alone. For any random variable X with mean x
and variance  X2 , it can be proven that Chebycheff’s
inequality tells us that
1
P  X  x  k X  
k2
This is a useful bound when no other information is
available. It applies to absolutely any distribution that has
a mean and standard deviation (there are some that do
not!). But as you might expect from such a general rule it
is extremely pessimistic in most cases, and if we have any
information about the shape of the distribution it is usually
possible to calculate tighter bounds.
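As an illustration of how pessimistic the bound can be, the following minimal Python sketch compares the Chebycheff bound for k = 2 with the exact two-sided tail probability of a standard normal distribution (computed with math.erf):

```python
# Sketch: Chebycheff's bound for k = 2 vs the exact N(0,1) tail probability.
from math import erf, sqrt

k = 2
chebycheff_bound = 1 / k**2                            # 0.25, valid for any distribution
normal_tail = 2 * (1 - 0.5 * (1 + erf(k / sqrt(2))))   # P(|Z| >= 2), about 0.046
print(chebycheff_bound, normal_tail)
```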

3.3.4 Uniform distribution

One of the simplest continuous distributions is the uniform
distribution, where outcomes are equally probable inside a
range. This is the continuous analogue of the uniform
discrete distribution.
We know that the integral of the function over the whole
range must sum to 1 so the mean level of the probability
must be inversely proportional to the range. If our
random variable can take values from a to b and is
uniformly distributed in this range, the corresponding pdf
is
f(x) = 1/(b − a)   for a ≤ x ≤ b
     = 0           elsewhere.
Both the pdf and the CDF for this distribution are
illustrated in the following Figure:

[Figure: the uniform distribution on [a, b]; the pdf is flat at 1/(b−a) between a and b, and the CDF rises linearly from 0 to 1 over the same range]
The uniform distribution is used where there is no reason
to assume any particular value is more likely to occur than
any other, and being particularly simple it is easy to use
mathematically.

3.3.5 Exponential probability density function

Imagine an event that is equally likely to occur at any
time with a mean rate λ (let's say, 3 times per hour), i.e.
the probability of the event occurring in a period of time
dt is equal to λ dt. What is the probability that an event
will have happened in time t (let’s say after 10 hours)?
You might be tempted to guess that the event will happen
with probability λt. It is easy to see that this would not
meet the conditions expected of a probability distribution:
if t is large enough we will get probability values larger
than one. The correct answer is a bit more complicated.
We can analyse this by asking what is the chance that the
event will happen between times t and t+dt. From our
definition of the CDF we can say that the probability that
the event will have happened by time t is
P(T ≤ t) = F(t)
The probability of the event occurring in the next interval of
time is equal to the probability that it has not occurred
already (1 − F) times the probability that it will occur in the
interval (λ dt), i.e.
dF = (1 − F) λ dt
Rearranging this and integrating gives
∫ dF/(1 − F) = ∫ λ dt   (from 0 to F(t) on the left, 0 to t on the right)
so −ln(1 − F) = λt
⇒ F(t) = 1 − exp(−λt)
We can now differentiate this to get the pdf
f(t) = 0              for t < 0
     = λ exp(−λt)     for t ≥ 0
The exponential distribution function is used extensively in
failure analysis, where the event we are trying to detect is
a failure occurring. Think about questions like this: if on
average we get one failure every three days, what is the
likelihood that a certain component will fail today? It is
also the distribution of inter-arrival times in a Poisson
process. This can be very useful in systems which have to
react to events that occur randomly: think for example
about a maintenance department at a large company
where system failures represent huge costs, or an
ambulance service, which cannot afford to make the users
wait for more than a few minutes.
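As a sketch of the failure question posed above – assuming that "one failure every three days on average" corresponds to a constant rate of λ = 1/3 failures per day – the probability of a failure within a day follows directly from the CDF:

```python
# Sketch: exponential CDF F(t) = 1 - exp(-lambda t), with lambda = 1/3 per day.
from math import exp

lam = 1 / 3                      # assumed failure rate, per day

def exponential_cdf(t, lam):
    return 1 - exp(-lam * t)

print(exponential_cdf(1, lam))   # probability of a failure within one day, about 0.28
```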

Mean and variance of the exponential probability function
What are the mean and standard deviation of the
exponential probability function?
f t    e t for t 0

t  E T  
  t
 0  t e dt

We do not need to work this one out – it is a Laplace
transform and the result is in the Laplace transforms
section in HLT:
t^(n−1)/(n−1)!  ⟷  1/sⁿ
Setting n to 2 and s to λ gives the answer we want: ∫ t exp(−λt) dt = 1/λ².
So the integral is
t̄ = λ · (1/λ²) = 1/λ
We can also work out E[T²], using n = 3:
E[T²] = ∫ t² λ exp(−λt) dt   (from 0 to ∞)   = λ · (2/λ³) = 2/λ²
from which we can say
σT² = E[T²] − (E[T])² = 2/λ² − 1/λ² = 1/λ²
so the mean and the standard deviation have the same
value 1/ λ.

3.3.6 Normal or Gaussian distribution

The most important distribution in probability is the
normal (or Gaussian) distribution. This is because in
real life applications there are lots of random variables
whose distribution is normal, can be approximated by a
normal in certain conditions or can be transformed into a
normal using simple transformations. It is characterised
by two parameters: mean m and variance σ², and is
usually written N(m, σ²). The pdf is given by the following
expression:
f(x) = N(m, σ²) = (1/(σ√(2π))) exp(−(x − m)²/(2σ²))
As an exercise, you can check that this function meets the
two conditions for pdfs mentioned above.
The normal distribution has many remarkable properties,
one of which is that, given two independent normally
distributed random variables X = N(mx, σx²) and Y = N(my,
σy²), the distribution of a linear combination Z = aX + bY is
itself a normally distributed variable, with the mean and
variance that we calculated in Section 3.2.2:
f(z) = N(a·mx + b·my, a²σx² + b²σy²)
To calculate values of pdf and CDF for the normal
distribution, the usual methods make use of tables (or,
nowadays, statistics software packages). Tabulated values
are given in terms of a 'standardised' random variable Z:
Z = (X − mX)/σX
so that f(z) = N(0, 1)
The normal distribution can be integrated to give the CDF,
which in the standardised case is usually represented as
Φ(z):
P(X ≤ x) = Φ(z) = ½ (1 + erf(z/√2))

This is plotted, together with the pdf, in the Figure below:

[Figure: pdf N(0,1) and CDF Φ(z) of the standard Gaussian distribution, plotted for z between −5 and 5]

You can find tables for the standardised normal
distribution in HLT and in many other text books. Let's say
you want to calculate the value of the CDF at x = 2, for a
normal distribution N(1, 0.5). First we need to calculate
the corresponding value of the standardised variable z:
z = (x − m)/σ = (2 − 1)/√0.5 ≈ 1.41
Now we can go to the table and find the corresponding
value, which in this case is Φ(z) = 0.9207.
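Instead of tables, the standardised CDF can be evaluated with the error function; a minimal Python sketch for the example above (the result differs slightly from the tabulated 0.9207 only because z is not rounded to 1.41):

```python
# Sketch: evaluating the standard normal CDF with math.erf.
from math import erf, sqrt

def phi(z):
    """CDF of the standard normal N(0, 1)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

m, var = 1.0, 0.5
z = (2 - m) / sqrt(var)       # about 1.41
print(z, phi(z))              # about 0.921
```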
Note that the tables generally only give us the CDF for
positive values of the variable. This is because, given the
symmetry of the normal distribution, we can write:
Φ(−z) = 1 − Φ(z)
The tables can also be used to calculate the probability of
the variable lying within a certain range a ≤ z ≤ b, by
using Φ(b) − Φ(a). Another typical situation is when we
want to solve the inverse problem, for example finding the
value z0 corresponding to the 90th percentile, i.e.
P(z<z0)=0.9. Tables for this are also widely available,
including in HLT. Assuming a distribution N(0,1), and
given that the values in the HLT table correspond to P(z>z0)
and are provided in percentage points, we just need to
find the entry for P=10, which is 1.2816. A useful statistic
is that the deviation to give the 95th percentile is 1.64σ.

Example: The diameter of a shaft in an optical disc drive is
normally distributed with a mean of 0.2508 cm and a
standard deviation of 0.0005 cm. The specifications on the
shaft are 0.2500 ± 0.0015 cm.
What proportion of shafts conform to the specifications?
That is, find the proportion of shafts no larger than
0.2515 cm and no smaller than 0.2485 cm.
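The worked answer is left as an exercise in the notes; a minimal Python sketch of the calculation, using the standardised variable and math.erf, is:

```python
# Sketch: proportion of shafts in [0.2485, 0.2515] cm for N(0.2508, 0.0005^2).
from math import erf, sqrt

def phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

m, sigma = 0.2508, 0.0005
z_upper = (0.2515 - m) / sigma        # 1.4
z_lower = (0.2485 - m) / sigma        # -4.6
print(phi(z_upper) - phi(z_lower))    # about 0.92
```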

3.3.7 Central limit theorem: approximations using
the normal distribution

The reason normal distributions appear everywhere is
because if you sum n samples of any random variable X
(any random variable with finite mean and variance), as n tends to infinity
the distribution of the sum will tend to the normal
distribution N(n·mx, n·σx²), where mx and σx² are the mean
and variance of the original distribution. This is called the
central limit theorem. It is a remarkable result, and one
that can be used in a large number of applications.
Imagine, for example, if you want to describe the total
weight of a large number of pieces, or the average
resistance of a large number of components. Since the
average is the sum divided by the number of components,
both variables will have a normal distribution, regardless
of the distribution of the individual values.
Using the normal to approximate discrete distributions
It follows from the central limit theorem that you can use
the normal as an approximation to the binomial and the
Poisson distribution, for large values of n (for the binomial
distribution) or  (for the Poisson), even though both
binomial and Poisson are discrete distributions. A binomial
distribution with parameters (n,p) can be seen as a sum
of n random variables, each one of them taking the value
1 or 0 with respective probabilities p and 1-p. It is easy to
see that the mean of each of these n individual variables
is p, and it can also be derived that the variance is p(1-p).
The binomial distribution can then be approximated by a
normal with mean np and variance np(1-p). Given that, as
we have seen before, the Poisson distribution can be used
as an approximation to the binomial distribution, we can
in turn approximate the Poisson distribution by a normal
distribution with mean and variance equal to λ. For the
normal to be a good approximation, np and n(1 − p) (for
the binomial distribution) or λ (for Poisson) have to be
large. To get an idea, the meaning of "large" here is
sometimes taken to be ≥ 5.

The continuity correction
In the conversion between discrete and continuous pdfs,
you can improve the accuracy of the estimate by taking a
mean value of the function at the mid-point between the
steps in the discrete CDF. This is called the continuity
correction.
The CDF for a discrete probability function corresponding
to a discrete random variable X is flat between the
discrete values, so we can say
P X  k   P X  k  1
To approximate using a continuous random variable Y, we
have the choice of the range of values between k and k+1.
It turns out that the value in the middle of the interval is a
pretty good approximation:
P X  k   P Y  k  0.5

The Figure below shows how a normal distribution
(dashed line) can be used as an approximation to a
Poisson distribution (staircase curve), in this case with
λ = 10. The dots are points at values k+0.5 for all integer
values of k.
Comparison between a Poisson CPF (staircase curve) and a Normal approximation (dashed curve).
The dots represent the approximated values after continuity correction.

3.3.8 The chi-square distribution

This arises from the normal distribution, and it is defined
in the following way: given a set of independent,
standard (i.e. normalised) normal random variables,
which we will denote Z1 ... Zn, if we define a new variable
X,
X = Z1² + Z2² + … + Zn²
X is said to have a chi-square distribution with n degrees
of freedom. This can also be written as χn².
An important property of the chi-square distribution (and
one that is not difficult to prove) is that the sum of two
independent chi-square random variables with n1 and n2 degrees of
freedom is another chi-square random variable with n1 +
n2 degrees of freedom.
The distribution function of the chi-square distribution
with n degrees of freedom can be calculated as follows:
f(x) = [1/(2^(n/2) Γ(n/2))] exp(−x/2) x^(n/2 − 1)
where Γ(·) is the gamma function. In practice, as for the
normal distribution, this expression is seldom used: there
are Tables (and, of course, statistics software) available
for the calculation of probability values or, inversely, to
calculate the values of the variable that correspond to a
certain, fixed probability. The Figure below shows
probability density functions for chi-square distributions
with different degrees of freedom n=3, 5, 10.


[Figure: chi-square probability density functions for n = 3, 5 and 10 degrees of freedom]

It can be shown that the chi-square distribution has mean
E[X] = n and variance σx² = 2n.
It is easy to see the use of the chi-square distribution in
determining the distribution of the sum of squares of
normal distributions (for example, in the calculation of
Euclidean distances). In fact the chi-square distribution
plays a fundamental role in statistical testing, as we will
see later.

3.3.9 The t-distribution

Another distribution related to the normal is the t-
distribution (also called Student's t distribution). It is
defined as follows: given two independent random
variables, Z and χn², with Z having a standard
(normalised) normal distribution and χn² having a chi-
square distribution with n degrees of freedom, the random
variable Tn:
Tn = Z / √(χn²/n)
has a t-distribution with n degrees of freedom. The mean
and variance of the t-distribution are respectively:
E[Tn] = 0,             for n > 1
Var[Tn] = n/(n − 2),   for n > 2
Values for the t-distribution can also be found in Tables or
calculated using statistics software. In the following Figure
we can see the t-distribution with n = 1, 3 and 10 degrees
of freedom (dashed lines) compared to the normal
distribution. One can see that, as the value of n increases,
the t-distribution converges to the normal.

[Figure: t-distributions with 1, 3 and 10 degrees of freedom (dashed lines) compared with the normal distribution]
As with the chi-square distribution, we will see the
importance of the t-distribution later on when we
introduce the concept of statistical testing.

PART 2. STATISTICS
In the lectures up to now we have been trying to build
models that provide an accurate representation of actual
phenomena which include a certain random component.
This is called probability theory. In the following we will
study ways to apply these models to analyse real-life
observations, in order to inform our decisions: this is
known as statistics.

4 Random sampling
Imagine you are in charge of quality control for a
company which manufactures a certain product. Your duty
is to assure that the products that come out of the
production line are of sufficient quality: if a damaged
batch is detected when the product is already on the market, a fix will be much more expensive and might cause significant damage to the company's reputation.
If you are manufacturing a cheap product (let’s say
screws), it obviously makes no sense to test every single
one of them: the cost of testing will be higher than the
actual production costs. If the product is expensive (let’s
say cars), you might afford to test each one of them, but
then again there are different levels of testing, and you
might not be able to test absolutely every single possible
defect on every single car.
The solution: take a subset of the products, selected
randomly (this is called a random sample) and test this
as a representation of the whole production. There are a
number of questions you can think about: how many
samples do I need? What are the maximum/minimum
values for the tested parameters that I can accept?
Statistics provides tools to solve these questions and
similar ones in lots of different engineering applications.

4.1 Characterising random samples
Probably the most important parameter we can use to
characterise a random sample is its mean. Let’s say we
want to estimate the mean height of the buildings in
Oxford. We can take a random sample of n=100 buildings
(here, as in many other situations, you need to take extra
care to assure that the sample is really random), measure
each one of them and calculate the sample mean:
$$\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j$$

In the same way, the sample variance is defined as


$$s^2 = \frac{1}{n-1}\sum_{j=1}^{n} (x_j - \bar{x})^2$$
It is important to note the difference between the means
and variances that we defined for probability distributions
and these that we have just defined for samples. Even if
our samples come from a known distribution, there is no
guarantee that the sample mean from a particular sample
will coincide with the mean of the distribution (though
intuitively we might expect these two values to be close).
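As a small illustrative sketch (the data values here are invented, not from the notes), the sample mean and sample variance defined above can be computed as follows; note the ddof=1 argument, which gives the 1/(n-1) divisor.

import numpy as np

# hypothetical sample of building heights in metres
x = np.array([12.0, 8.5, 15.2, 9.8, 11.4, 20.1])

x_bar = x.mean()            # sample mean, (1/n) * sum of x_j
s2 = x.var(ddof=1)          # sample variance with the 1/(n-1) divisor

print(x_bar, s2)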

5 Estimating mean and variance of a


distribution
Let’s consider the general case of a probability distribution
that we want to characterise by calculating relevant
parameters using a finite sample of values drawn from the distribution (for example the mean and variance for a normal distribution, or the value of $\lambda$ for a Poisson
distribution). One of the best ways to do this is using the
maximum likelihood method. Let’s take the normal
distribution as an example. Suppose we have a normal
distribution pdf f(x) and we take a sample of random
independent values arising from this distribution, x1…xn. If
we knew the mean and variance characterising the
distribution, we could easily calculate the probability that

the distribution produces precisely this sample, which in a
discrete distribution would be
$$l = f(x_1)\, f(x_2) \cdots f(x_n)$$
and which, in the case of a continuous distribution such as the normal, we can represent by the probability of the variable taking values in very small intervals $\Delta x$:
$$l = f(x_1)\Delta x \cdot f(x_2)\Delta x \cdots f(x_n)\Delta x$$

The function l is called the likelihood function. Our goal is to obtain the best estimates for the mean m and variance $\sigma^2$ of the normal distribution, i.e. the values that maximise this likelihood. In practice, rather than the likelihood we can maximise its logarithm (the log-likelihood), which will give the same result since the logarithm function is monotonic, and has the advantage of transforming the product above into a sum:
$$\frac{\partial \ln l}{\partial m} = 0 \qquad \text{and} \qquad \frac{\partial \ln l}{\partial \sigma} = 0$$

In the case of the normal distribution,
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-m)^2}{2\sigma^2}\right)$$
$$\ln f(x) = -\frac{1}{2}\ln(2\pi) - \ln\sigma - \frac{(x-m)^2}{2\sigma^2}$$
And thus the log-likelihood is
$$\ln l = \sum_{j=1}^{n} \ln f(x_j) = -\frac{n}{2}\ln(2\pi) - n\ln\sigma - \sum_{j=1}^{n}\frac{(x_j - m)^2}{2\sigma^2}$$
Now we just have to differentiate this with respect to m and $\sigma$ and set the derivatives equal to zero to calculate the estimates (i.e. the stationary points), which are
$$\hat{m} = \frac{1}{n}\sum_{j=1}^{n} x_j, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{j=1}^{n} (x_j - \hat{m})^2$$

As you can see, these expressions are hardly surprising: they are almost exactly the same as the sample mean and variance. Note the little hat (more formally called a caret) on the symbols used here, $\hat{m}$ and $\hat{\sigma}^2$. These indicate that the values are estimates, rather than the exact mean and variance of the underlying probability distribution.
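A brief numerical sketch (my own illustration, with simulated data): the maximum likelihood estimates derived above are simply the sample mean and the variance computed with a 1/n divisor.

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=10.0, scale=2.0, size=50)   # sample from N(m=10, sigma^2=4)

m_hat = x.mean()                               # MLE of the mean
sigma2_hat = np.mean((x - m_hat)**2)           # MLE of the variance (1/n divisor)

print(m_hat, sigma2_hat)
print(x.var(ddof=0))                           # identical to sigma2_hat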
Note
Some people define the sample variance as
$$s^2 = \frac{1}{n}\sum_{j=1}^{n} (x_j - \bar{x})^2$$

and the unbiased sample variance as


$$s^2 = \frac{1}{n-1}\sum_{j=1}^{n} (x_j - \bar{x})^2$$
using the latter as an estimate of the population variance,
which is to say the opposite way round to the description
above.

6 Confidence intervals
So if we take a random sample from a normal distribution,
and we want to use the sample mean as an estimate for
the actual mean of the distribution, how confident can we
be that the estimate is good enough for our purposes?
Important decisions might rely on this: for example, a
change in the mean of the diameters of the measured
screws might be due to errors in the manufacturing
process (which should lead to stopping the production line
for an assessment), or might just be caused by chance.
To assess this we introduce the concept of confidence
intervals. For a distribution parameter θ (e.g. the mean),
a confidence interval is the range of values $\theta_1 \le \theta \le \theta_2$ that contains the actual value of θ with a certain confidence level $\gamma$ (more on this later). Typically the confidence level is chosen beforehand, and is usually above 90%.
We will refer to θ1 and θ2 as the lower and upper
confidence limits.

There are different symbols used for confidence intervals.
Here we will use the convention in Kreyszig:
$$\mathrm{CONF}_{\gamma}\{\theta_1 \le \theta \le \theta_2\}$$

6.1 Confidence interval for the mean when the


variance is known
We will start with a simple example: the estimation of the
mean of a normal distribution, assuming that the variance $\sigma^2$ is known. The key here is to realise that, if we take a random sample containing n values from a normal distribution $N(m, \sigma^2)$, the mean of these values is also a normally distributed random variable $N(m, \sigma^2/n)$. This
comes from the fact that the mean is a linear combination
of the values, and as we saw in Section 3.3.6, a linear
combination of normal variables also follows a normal
distribution. Now let’s take a look at the Figure below,
representing a normal distribution N(0,1):

If this is the distribution of the standardised random


variable Z, we could conclude that the value of the
variable z will be within the coloured interval 90% of the
times, and thus for a 90% confidence level we can give
the confidence interval
$$\mathrm{CONF}_{0.9}\{-c \le z \le c\}$$

for which the value c=1.6449 can be found in normal


distribution tables.

We also know that we can transform a normally distributed variable $X \sim N(m, \sigma^2/n)$ into the standardised $Z \sim N(0,1)$ by using the change of variable
$$Z = \frac{X - m}{\sigma/\sqrt{n}} \qquad \Leftrightarrow \qquad X = m + \frac{\sigma}{\sqrt{n}}\,Z$$
so applying this variable change and using the c value corresponding to the chosen confidence level $\gamma$ we can arrive at an expression for a confidence interval for the variable we are interested in, which is the mean $\mu$:
$$\mathrm{CONF}_{\gamma}\left\{\bar{x} - c\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + c\frac{\sigma}{\sqrt{n}}\right\}$$
where the value $\bar{x}$ corresponds to the mean of the current sample, which is our best estimate of the mean of the distribution.
The process for calculating a confidence interval for the mean, given that we know the variance, is thus:
• Pick a confidence level $\gamma$.
• Calculate the value c of the standardised normal variable corresponding to the chosen confidence level, from tables or an appropriate software package.
• Form a confidence interval centred on the sample mean, with a length equal to $2c\sigma/\sqrt{n}$.
The concept of a confidence interval might be slightly confusing. Note that we cannot say that there is a probability $\gamma$ that the mean value of the distribution lies within this particular interval. Whether it does depends on the particular values in the random sample: we might have been unlucky in this case and drawn a sample whose values all lie in the outskirts of the distribution. What we can say is that, if we repeated this procedure many times with different random samples, constructing an interval from each one, then in 100$\gamma$% of the cases the mean of the distribution would lie within the calculated interval. While this may not be exactly what we would like to have, it is a good indication of the distribution parameter.
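A minimal sketch of the procedure (the sample values and the known standard deviation here are assumptions for illustration), using scipy to look up the value c for a 90% confidence level:

import numpy as np
from scipy import stats

x = np.array([101.2, 98.7, 100.4, 99.1, 102.3, 97.9, 100.8, 99.5])
sigma = 1.5                                 # known standard deviation (assumed)
gamma = 0.90                                # chosen confidence level

c = stats.norm.ppf((1 + gamma) / 2)         # c = 1.6449 for gamma = 0.90
half_width = c * sigma / np.sqrt(len(x))
x_bar = x.mean()

print(x_bar - half_width, x_bar + half_width)   # lower and upper confidence limits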

6.2 Confidence interval for the mean when the
variance is unknown
We will skip the mathematical proofs here – you can go to
the recommended literature for details. Overall the
process is very similar to the one described above, except
in two aspects. First, given that we do not know the value
of the variance, we will use the sample variance as a
substitute. Second, in this case the standardised sample mean $(\bar{x} - \mu)/(S/\sqrt{n})$ does not follow a standard normal distribution but a t-distribution with n-1 degrees of freedom (where, as before, n is the number of elements in our random sample).
As mentioned above, tables for the t-distribution are also
easy to find (in HLT, for example). Let’s say our random
sample contains 6 measurements (that means 5 degrees
of freedom). For a confidence level $\gamma = 0.95$, we need to
look at the percentage point 2.5 in the HLT table, which
lists a value 2.571 (if you don’t understand why we looked
at the percentage point 2.5 rather than 5, take into
account the symmetry of the confidence interval).
When the number of degrees of freedom is large (i.e.
when our sample contains many measurements), the
normal distribution is a good approximation; however
when we have a small number of measurements the
difference can be important.
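A sketch of the unknown-variance case (the measurements are invented for illustration), using scipy's t-distribution; with 5 degrees of freedom and γ = 0.95 the multiplier matches the tabulated value 2.571.

import numpy as np
from scipy import stats

x = np.array([9.8, 10.4, 10.1, 9.6, 10.7, 10.2])   # hypothetical sample, n = 6
gamma = 0.95
n = len(x)

c = stats.t.ppf((1 + gamma) / 2, df=n - 1)          # 2.571 for 5 d.o.f.
half_width = c * x.std(ddof=1) / np.sqrt(n)         # uses the sample standard deviation

print(x.mean() - half_width, x.mean() + half_width)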
6.3 Confidence interval for the variance of a normal
distribution
So we have seen how to calculate confidence intervals for
the mean of the distribution when we assume we know
the distribution variance, and when we need to use the
sample variance as an estimation. There is one more thing
we may want to characterise: can we estimate a
confidence interval for the variance itself?
As above, we will skip the mathematical proof, which can
be found in any of the recommended books. Let’s just say
that the sample variance can be characterised using a
random variable with a chi-square distribution with n-1
degrees of freedom, in the following way:

$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$$

where n is the number of samples, $S^2$ is the sample variance, and $\sigma^2$ represents the population variance. Some simple rearrangement of the variables results in the following expression for a confidence interval:
$$\mathrm{CONF}_{\gamma}\left\{\frac{(n-1)S^2}{\chi^2_{\alpha/2,\,n-1}} \le \sigma^2 \le \frac{(n-1)S^2}{\chi^2_{1-\alpha/2,\,n-1}}\right\}$$
where $\alpha = 1 - \gamma$ and $\chi^2_{\alpha, n}$ is the value for which $P(X > \chi^2_{\alpha, n}) = \alpha$.
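A sketch of this calculation (illustrative data; note that scipy's ppf returns lower-tail quantiles, whereas the table convention above is upper-tail):

import numpy as np
from scipy import stats

x = np.array([4.1, 3.8, 4.5, 4.0, 3.6, 4.3, 4.2, 3.9])   # hypothetical sample
gamma = 0.95
alpha = 1 - gamma
n = len(x)
s2 = x.var(ddof=1)

# upper-tail point chi^2_{alpha/2, n-1} = stats.chi2.ppf(1 - alpha/2, df=n-1)
lower = (n - 1) * s2 / stats.chi2.ppf(1 - alpha / 2, df=n - 1)
upper = (n - 1) * s2 / stats.chi2.ppf(alpha / 2, df=n - 1)

print(lower, upper)        # confidence interval for the population variance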

6.4 Distributions different from the normal


Up to now we have only considered the problem of
parameter estimation from normal distributions. The
central limit theorem tells us that, for large samples, we
can use the normal as an approximation to the
distribution. The concept of “large” sample here is
sometimes taken to mean n>20.
As an example, let’s find a confidence interval for the
mean of a binomial distribution. We have seen above how
a binomial distribution can be approximated by a normal
of mean np and variance np(1-p). This means that, if we
calculate the value of c for a particular confidence level γ,
we can write the confidence interval as for the normal
distribution:
$$-c \le \frac{X - np}{\sqrt{np(1-p)}} \le c$$

What we want to obtain is an interval for p, for which we
can approximate the values of mean and variance using
the sample mean and the sample variance, i.e.
$$np \approx n\hat{p}, \qquad np(1-p) \approx n\hat{p}(1-\hat{p})$$
and thus the confidence interval is
$$\mathrm{CONF}_{\gamma}\left\{\hat{p} - c\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \;\le\; p \;\le\; \hat{p} + c\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right\}$$

This can be used also to calculate the required sample
sizes to achieve a specified confidence level. Note
however that this would require an initial estimation of p.
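A short sketch of this approximate interval for p (the counts are invented for illustration):

import numpy as np
from scipy import stats

n, successes = 200, 36                      # e.g. 36 defective items out of 200 tested
p_hat = successes / n
gamma = 0.95

c = stats.norm.ppf((1 + gamma) / 2)         # 1.96 for gamma = 0.95
half_width = c * np.sqrt(p_hat * (1 - p_hat) / n)

print(p_hat - half_width, p_hat + half_width)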

7 Hypothesis testing
As suggested above, many times in engineering
applications we have to make decisions based on
information coming from random processes. Imagine, for
example, that you have to decide whether the pieces from
a certain manufacturer meet the requirements you need
for your products. It is not possible to test every single
piece, so you may have to rely on the measurements
coming from a small random sample for your decision. Or
let’s say your department has come up with a new
formulation for a chemical product, and you want to check
whether this is better or worse than the previous one you
were producing. Again, you will need to make an
important decision based on a limited number of samples.
The theory that lies behind this type of decision is called
hypothesis testing, and is one of the most important
areas of statistics when applied to engineering.
As a working example, let’s say we are producing a new
hybrid car which we expect to get 100 miles per gallon.
We have carried out tests on the first 8 cars coming out of
the production line, which produced the following values:
100.3 102.1 95.6 97.7 99.8 103.2 96.4 98.5
How can we determine whether the average mpg of the
produced cars (we will call it µ) is the one we expected?
The basic idea is the following: we are trying to check a
hypothesis (µ=100). We call this the null hypothesis.
We also need to define an alternative hypothesis, which
in this case will be µ≠100. We also have to define a
significance level. This concept has a similar role to the
confidence level we were using in parameter estimation,

and it is related to how sure we need to be about the
result in order to accept or reject the hypothesis. Let’s say
we are looking for a significance level $\alpha = 0.05$ (we will
revisit the concept of significance level later). We will also
assume that the measurements come from a normal
distribution. Finally, let’s assume that the variance of the
distribution is known: $\sigma^2 = 10$.
You can see the parallelism to the situation in which we
were trying to establish a confidence interval for the mean
when the standard deviation is known. We can thus use
the tables for the normal distribution (if we did not know
the variance we would use instead the Student’s t-
distribution, with n-1=7 degrees of freedom). We can see
that 95% of the area under the probability distribution
corresponds to a range of 1.96. From the list of values
above, the sample mean is 99.2 and we know the
variance is 2=10. Applying the usual change of variable,
if the mean is 100 we can expect the sample mean to be
in the following interval with 95% probability (i.e. a
significance level of 100-95=5%):
100  1.96 / n    100  1.96 / n
97.8    102.2
Our sample mean is 99.2, which falls within the interval,
and thus we can accept the hypothesis: to the significance
level required, the car meets the mpg condition. Note however that, if we had set a different significance level, the result might have been different. As you can see, setting
the significance level is an important decision.
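A brief sketch of this two-sided test using the eight measurements above (the library calls are my own choice; the numbers follow the worked example):

import numpy as np
from scipy import stats

mpg = np.array([100.3, 102.1, 95.6, 97.7, 99.8, 103.2, 96.4, 98.5])
mu0, sigma2, alpha = 100.0, 10.0, 0.05

n = len(mpg)
se = np.sqrt(sigma2 / n)                      # standard error of the sample mean
c = stats.norm.ppf(1 - alpha / 2)             # 1.96 for alpha = 0.05

z = (mpg.mean() - mu0) / se
print(mpg.mean(), z, abs(z) <= c)             # 99.2, z about -0.72, True -> accept H0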
There is another aspect in which we could have taken a
different approach. We used µ=100 as the null hypothesis,
so the alternative hypothesis µ≠100 includes either
having a mean significantly lower or significantly higher
than this. This is called a two-sided test (called two-
tailed in some textbooks). If we were sure that the mean
mpg is not going to be higher than 100, we would only
need to check the alternative hypothesis µ<100. This
would be a one-sided test (one-tailed), which in turn can
be left-sided (alternative hypothesis µ<100) or right-sided

(alternative hypothesis µ>100). You can easily check that
the limits of the confidence interval are different between
the one- and two-sided cases. Thus the choice between
the two types of tests is important, and requires a careful
assessment of the particular situation. This is important
for example in medical research tests: the developers of a
new drug may insist that only a possible improvement in
patient’s condition is tested (a one-sided test), while it
may be more appropriate to use a two-sided test to check
whether the drug can actually be worse than currently
available treatments.
7.1 Type I and type II errors
When we make a decision based on the result of a
hypothesis test, we run the risk of it being wrong. There
are two possible types of errors:
Type I errors occur when a true hypothesis is rejected.
In the previous example, it would happen if we wrongly
conclude that the mpg is different from 100. This can
happen in any case if we are unlucky with the values that
come up on the random sample, but it is more likely to
occur if we chose an overly restrictive significance level .
In fact, we can easily see that the value of  is the
probability of making a type I error.
Type II errors occur when a hypothesis that should have
been rejected is accepted as true: in the previous example,
if we conclude that mpg=100 when in reality it is not. This
is more likely to occur if  is small: in general there is a
trade-off between type I and type II errors. We call  the
probability of making a type II error, and =1- is called
the power of the test.
In the example of the hybrid car above, we calculated the interval that gave us a probability of type I errors $\alpha = 0.05$. We can now go on to calculate $\eta$, the power of the test. Let's again assume that the variance of the distribution is known: $\sigma^2 = 10$ (the same could be calculated for an unknown variance, a case we will not discuss here). The value of $\eta$ depends on the actual mean of the distribution $\mu$: obviously if $\mu$ is very far from the test value 100, the

probability of making a type II error will be very small. We can calculate $\eta$ as
$$\eta(\mu) = P_{\mu}(\bar{x} < 97.8) + P_{\mu}(\bar{x} > 102.2)$$
where the subscript $\mu$ indicates that the probability is calculated for this particular value of the mean. To give some example values, $\eta(99.0) = 0.14$, meaning that if the mean of the process is 99.0 (rather than the hypothesised 100.0), the probability of this test correctly rejecting the hypothesis is only 14%. On the other hand $\eta(97.0) = 0.76$, so if the mean goes down to 97.0, the test will correctly reject the hypothesis with probability 76%. We could continue finding values manually or, much more efficiently, using suitable software, to produce a curve for the power function. Keeping the value of $\alpha = 0.05$, we can see how the curve changes with different values of n:

As you can see, the power function becomes larger – and


with it, the number of Type II errors decreases – when we
take a larger sample.

In the next Figure, we keep the size of the random sample at n = 8, and we illustrate the effect of changing the value of $\alpha$.

Here, as one would expect, the power function is better


(larger) for higher values of $\alpha$. This shows that, in the
absence of any other changes, there is a trade-off
between Type I and Type II errors. It is up to the designer
to decide about the relative importance of these two types,
which will typically depend on the application.
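A sketch of how such power curves could be generated (my own illustration; the function below simply evaluates the acceptance region derived earlier under different true means):

import numpy as np
from scipy import stats

def power(mu, n=8, mu0=100.0, sigma2=10.0, alpha=0.05):
    """Power of the two-sided test of H0: mean = mu0 when the true mean is mu."""
    se = np.sqrt(sigma2 / n)
    c = stats.norm.ppf(1 - alpha / 2)
    lo, hi = mu0 - c * se, mu0 + c * se            # acceptance region for the sample mean
    # eta(mu) = P_mu(xbar < lo) + P_mu(xbar > hi)
    return stats.norm.cdf(lo, loc=mu, scale=se) + stats.norm.sf(hi, loc=mu, scale=se)

print(power(99.0), power(97.0))    # close to the 0.14 and 0.76 quoted above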

8 Conclusion
In these four lectures we have discussed:
 The concept of probability, both intuitively and in a
formal mathematical framework.
 Methods to calculate probability using the concept of
relative frequency
 The concept of prior probability and Bayes’ theorem
 Some of the most important probability distributions, in
particular the normal or Gaussian.
 The concept of statistical testing, and some basic
testing methods.
These concepts find applications in all branches of
engineering – and are explored in more detail in some of
the advanced courses.

To: Dr SJ Payne, Engineering Science

Feedback on Lecture series A1 Probability and Statistics

I have attended this series of lectures and think:

1. The level of difficulty of the material compared to


other courses was:

□ Absurdly easy – a walkover.


□ Fairly easy – comfortably manageable.
□ Just the same as other typical courses.
□ Fairly hard – but just about manageable.
□ Completely beyond me.

2. The pace of presentation was

□ Much too slow, more material could have been covered.


□ Fairly easy, but acceptable.
□ About right for me.
□ Fairly hard but acceptable.
□ Way too fast.

3. Clarity of presentation. Overall, the lectures were

□ Clearly presented and easy to follow.


□ Mainly clear but with occasional unclear presentation.
□ Mainly unclear but with some clearly presented parts.
□ Very unclear and hard to follow.

4. Quality of the handout. The lecture handouts were

□ Easy enough to read and understand.


□ Easy enough to read but hard to understand.
□ Tricky to read but made sense once read carefully.
□ Tricky to read and hard to understand.

5. Relevance to the examples sheets. So far, the lectures


and handouts were

□ Not much use for solving examples problems.


□ Useful, but not closely enough related to the problems.
□ Directly useful for solving the problems.

Finally, feel free to add any other comments on aspects of


the lectures or notes you found particularly helpful or
unhelpful below. If you spotted a typo please let me know.
Thank you.
