
Probabilistic Methods in Engineering

Dr. Horst Hohberger

University of Michigan - Shanghai Jiaotong University


Joint Institute

Spring Term 2009

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 1 / 355
Elements of Probability Theory

Part I

Elements of Probability Theory

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 2 / 355
Elements of Probability Theory

Introduction to Probability and Counting

Discrete Random Variables

Continuous Random variables

Joint Distributions

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 3 / 355
Elements of Probability Theory Introduction to Probability and Counting

Introduction to Probability and Counting

Discrete Random Variables

Continuous Random variables

Joint Distributions

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 4 / 355
Elements of Probability Theory Introduction to Probability and Counting

Probability - Historical Notes


Although random events have fascinated humanity for thousands of years
(archaeologists have discovered dice games of the ancient Babylonians),
the concept of probability has eluded mathematicians for a surprisingly
long time. There are two main reasons for this:
◮ The Greek philosophy of mathematics, which profoundly influenced
Western mathematics for millennia, viewed mathematics as an ideal,
precise description of abstract truths, in particular with regard to
geometry. It would not have entered the minds of the Greek
philosophers to attempt to describe random, unpredictable events
using such techniques.
◮ Random events were viewed as being decided by God or some other
higher power. In particular, bones were used to foretell the future and
unpredictable events like the throw of dice or the weather were viewed
as unknowable by mortals. The influential Christian church did not
encourage people to “question the decisions of God.”

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 5 / 355
Elements of Probability Theory Introduction to Probability and Counting

Probability - Classical Definition

In the 16th century, the Italian mathematician Cardano, who was a heavy
gambler, attempted to use mathematics to describe the outcome of
games. He hit upon the following definition, which is really a procedure for
calculating probabilities:
1.1.1. Definition. Let A be a random outcome (random event) of an
experiment (game) that may proceed in various ways. Assume each of
these ways is equally likely. Then the probability of the outcome A is
P(A) = (number of ways leading to outcome A) / (number of ways the experiment can proceed).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 6 / 355
Elements of Probability Theory Introduction to Probability and Counting

Classical Probability - Examples

1.1.2. Examples.
1. The experiment consists of flipping a coin. We are interested in the
probability of the coin landing heads up. The experiment can proceed
in two ways: the coin lands heads up or tails up. We assume each
event is equally likely, so the classical definition gives

P[coin lands heads up] = 1/2 = 0.5

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 7 / 355
Elements of Probability Theory Introduction to Probability and Counting

Classical Probability - Examples

1.1.3. Examples.
2. The experiment consists of rolling two 6-sided dice and summing the
results, so the possible outcomes are the numbers S = 2, 3, . . . , 12.
We are interested in the outcome S = 3. Each die will give results 1,
2, 3, 4, 5 or 6. We assume that each result is equally likely. There are
two ways we can get the outcome S = 3: either the first die’s result is
1 and the second die’s result is 2, or the first die gives 2 and the
second die gives 1. In total, the experiment can proceed in 6 × 6 = 36
different ways. Hence
P[3] = 2/36 = 1/18 ≈ 0.056
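As a quick sanity check of the classical definition, the counting in this example can be reproduced by brute-force enumeration. The following Python sketch (an illustration, not part of the original slides) lists all 36 equally likely outcomes of the two dice and counts those with sum 3.

```python
from itertools import product

# All 36 equally likely ways two six-sided dice can land.
outcomes = list(product(range(1, 7), repeat=2))

# Ways leading to the outcome S = 3.
favorable = [o for o in outcomes if sum(o) == 3]

# Classical definition: favorable ways / total ways.
p = len(favorable) / len(outcomes)
print(len(favorable), len(outcomes), p)  # 2 36 0.0555...
```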

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 8 / 355
Elements of Probability Theory Introduction to Probability and Counting

Historical Notes
Cardano’s work, although published, received little attention, and 100 years
later, in the middle of the 17th century, the two French mathematicians
Fermat and Pascal rediscovered his principles, also by considering games.
They discussed how to divide the jackpot if a game in progress is
interrupted. Imagine that Fermat and Pascal are playing a simple game,
whereby a coin is repeatedly tossed. Fermat wins as soon as the coin has
turned up heads six times, Pascal wins as soon as the coin has turned up
tails six times.
There are 24 gold pieces in the pot. Now the game is interrupted when
the coin has already turned up 5 tails and 4 heads. How to divide the pot?
Fermat can only win if the coin turns up heads two times in a row, and we
can calculate that he has a 1/4 chance of this happening. Therefore, he
receives 1/4 of the pot (6 gold pieces), while Pascal receives 3/4 (18 gold
pieces).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 9 / 355
Elements of Probability Theory Introduction to Probability and Counting

Tree Diagrams

We use a tree diagram to find the probability of two heads occurring:

[Tree diagram: the 1st toss branches into “heads” and “tails”; each branch leads to a 2nd toss, which again branches into “heads” and “tails”. The four leaves are: 2 heads, 1 head and 1 tail, 1 head and 1 tail, 2 tails.]

After two tosses, there are four possible outcomes. Only one of the four
outcomes involves two heads, so the probability of this happening is 1/4.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 10 / 355
Elements of Probability Theory Introduction to Probability and Counting

The Art of Counting

The strategy just illustrated, of using a tree diagram to evaluate
probabilities in a multi-stage process (each node represents a stage, or
step), shows us that counting is important. However, counting is also quite
difficult; more difficult, in fact, than you probably think.
The mathematical field that concerns itself with counting is known as
combinatorics. You probably have discussed some basic combinatorics in
school, and we will review some elementary concepts here.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 11 / 355
Elements of Probability Theory Introduction to Probability and Counting

Permutations in Calculus
There is a small difference in the use of the term “permutation” between
analysts and combinatorialists; we first recall the usage that is familiar from
calculus/linear algebra.

1.1.4. Definition. Let {x1 , . . . , xn } be a set of n distinguishable elements


(xk ̸= xj for j ̸= k). Then an injective map
π : {x1 , . . . , xn } → {x1 , . . . , xn }
is called a permutation of these elements.
Recall that a map f is called injective if
f (x1 ) = f (x2 ) ⇒ x1 = x2 .

1.1.5. Remark. A permutation of n elements is a finite map because the


domain and range are sets with a finite number of elements. Any injective
map from a finite set to itself is automatically bijective.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 12 / 355
Elements of Probability Theory Introduction to Probability and Counting

Permutations in Calculus

1.1.6. Notation. A permutation is defined on a set of n distinguishable


elements; instead of {x1 , . . . , xn } we can also simply write {1, . . . , n},
replacing the permutation of elements with a permutation of indices.
Recall that a function f is defined as a set of pairs of the form (x, f (x)),
where x is the independent variable. Thus we could define a permutation
π through a set of pairs {(1, π(1)), . . . , (n, π(n))}. In fact, we do represent
permutations in this way, but use a different notation, writing

π = \begin{pmatrix} 1 & 2 & \cdots & n \\ \pi(1) & \pi(2) & \cdots & \pi(n) \end{pmatrix}

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 13 / 355
Elements of Probability Theory Introduction to Probability and Counting

Permutations in Calculus
1.1.7. Example. Consider the set of 3 objects, {1, 2, 3}. Then one
permutation is given by the identity map,
π_0 = \begin{pmatrix} 1 & 2 & 3 \\ 1 & 2 & 3 \end{pmatrix},

while other permutations are given by

\begin{pmatrix} 1 & 2 & 3 \\ 3 & 1 & 2 \end{pmatrix},
\begin{pmatrix} 1 & 2 & 3 \\ 2 & 3 & 1 \end{pmatrix},
\begin{pmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{pmatrix},
\begin{pmatrix} 1 & 2 & 3 \\ 2 & 1 & 3 \end{pmatrix},
\begin{pmatrix} 1 & 2 & 3 \\ 1 & 3 & 2 \end{pmatrix}.

1.1.8. Notation. Instead of writing out the map as above, we often simply
give the ordered n-tuple (π(1), . . . , π(n)). For example, instead of
\begin{pmatrix} 1 & 2 & 3 \\ 3 & 1 & 2 \end{pmatrix} we might write (3, 1, 2).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 14 / 355
Elements of Probability Theory Introduction to Probability and Counting

Permutations in Calculus
We have previously encountered permutations in the definition of the
determinant of an n × n matrix using the Leibniz formula; there one sums
over all permutations of the index set {1, . . . , n} and also includes the sign
of the permutation, a concept we won’t elaborate on here.

1.1.9. Lemma. There are n! permutations π of the set {1, . . . , n}.

Proof.
We consider the number of possible values of π(k), k = 1, . . . , n. The first
value, π(1), can be any of the n elements of {1, . . . , n}. The second value
can be any element of {1, . . . , n} \ {π(1)}. Hence there are n possible
values of π(1), but only n − 1 possible values of π(2). In general, there are
n − k + 1 possible values for π(k), so in total, there are

n · (n − 1) · · · (n − n + 1) = n!

different choices for the values (π(1), . . . , π(n)) of a permutation.


Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 15 / 355
Elements of Probability Theory Introduction to Probability and Counting

Permutations in Combinatorics
In combinatorics, permutations are regarded as arrangements in a definite
order. Clearly, the ordered tuple (π(1), . . . , π(n)) is an arrangement of
(1, . . . , n) in a definite order, i.e., the calculus-based definition realizes this
goal.
However, in combinatorics one is often interested in first selecting r ≤ n
objects from {1, . . . , n} and then arranging these objects in a definite
order. We realize this by the following, more general definition:

1.1.10. Definition. Let n, r ∈ N \ {0} with r ≤ n. Then an injective map

π : {1, . . . , r } → {1, . . . , n}

is called a permutation of r elements from {1, . . . , n}.

1.1.11. Remark. For r = n we regain our previous definition.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 16 / 355
Elements of Probability Theory Introduction to Probability and Counting

Permutations in Combinatorics
1.1.12. Notation. We again write π as
\begin{pmatrix} 1 & 2 & \cdots & r \\ \pi(1) & \pi(2) & \cdots & \pi(r) \end{pmatrix}  or  (π(1), . . . , π(r)),
where π(k) ∈ {1, . . . , n}, k = 1, . . . , r .

1.1.13. Example. Let n = 3, r = 2. Then the following are permutations of


two elements from {1, 2, 3}:

(1, 2), (2, 1), (1, 3), (3, 1), (2, 3), (3, 2).

1.1.14. Theorem. There are

n · (n − 1) · · · (n − r + 1) = \frac{n!}{(n − r)!}

different permutations of r elements from {1, . . . , n}.
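To connect Theorem 1.1.14 with the tuple notation of 1.1.12, here is a small Python check (illustrative only, with n and r chosen arbitrarily): enumerate all permutations of r = 2 elements from {1, 2, 3} and compare the count with n!/(n − r)!.

```python
from itertools import permutations
from math import factorial

n, r = 3, 2
tuples = list(permutations(range(1, n + 1), r))  # ordered r-tuples without repetition
print(tuples)                             # (1, 2), (1, 3), (2, 1), ...
print(len(tuples))                        # 6
print(factorial(n) // factorial(n - r))   # n!/(n-r)! = 6
```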
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 17 / 355
Elements of Probability Theory Introduction to Probability and Counting

Combinations

A related question arising in combinatorics is the number of selections of


r elements from a set of n elements, where we do not care about the
order. Such selections are called combinations.
1.1.15. Example. Let n = 3, r = 2. Then the following are combinations of
two elements from {1, 2, 3}:

{1, 2}, {1, 3}, {2, 3}.

Note that a selection consists of sets (which are unordered) while a


permutation consists of ordered tuples.

1.1.16. Definition. A combination of r elements from {1, . . . , n} is a subset
A ⊂ {1, . . . , n} with |A| = r elements.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 18 / 355
Elements of Probability Theory Introduction to Probability and Counting

Combinations
1.1.17. Theorem. There are

\binom{n}{r} = \frac{n!}{r!(n − r)!}

combinations of r elements from {1, . . . , n}.
Proof.
We first consider permutations of r objects from {1, . . . , n}. Every
permutation gives us a combination, by identifying

(π(1), . . . , π(r )) −→ {π(1), . . . , π(r )}.

Obviously, more than one permutation will give us the same combination.
Note that any permutation of (π(1), . . . , π(r )) will be a permutation of r
objects from {1, . . . , n}. Furthermore, any permutation of (π(1), . . . , π(r ))
will yield the same combination. For each tuple (π(1), . . . , π(r )), there are
r ! permutations of (π(1), . . . , π(r )), so we need to divide the total number
of permutations of r objects from {1, . . . , n} by r !.
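Theorem 1.1.17 is easy to verify numerically for small n and r. The sketch below (not from the slides; the values n = 5, r = 3 are just an example) enumerates the unordered selections directly and compares the count with n!/(r!(n − r)!).

```python
from itertools import combinations
from math import comb, factorial

n, r = 5, 3
subsets = list(combinations(range(1, n + 1), r))          # unordered selections
print(len(subsets))                                        # 10
print(factorial(n) // (factorial(r) * factorial(n - r)))   # 10
print(comb(n, r))                                          # built-in binomial coefficient, also 10
```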
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 19 / 355
Elements of Probability Theory Introduction to Probability and Counting

Combinations as Permutations of Indistinguishable Objects

We can regard the previous situation in a different light: a combination


of r elements of {1, . . . , n} is similar to a permutation, but without regard
to order. Fundamentally, if we do not order the selected objects, we regard
them as indistinguishable once they have been selected. (We can not say
“Take the fifth selected object,” but merely, “Take a selected object.” We
do not distinguish between the selected objects.)
Therefore, a combination can be regarded as a permutation of r
indistinguishable objects from {1, . . . , n}.
Effectively, this corresponds to a division of {1, . . . , n} into two classes, a
subset consisting of r elements and its complement.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 20 / 355
Elements of Probability Theory Introduction to Probability and Counting

Sorting Into Classes

We can hence reformulate Theorem 1.1.17 on combinations with regard to


the problem of sorting elements of {1, . . . , n} into two classes (sets) A1
and A2 of specified size.

1.1.18. Theorem. Let N := {1, . . . , n}. Then there are


\frac{|N|!}{|A_1|!\,|A_2|!} = \frac{n!}{n_1!\,n_2!} = \frac{n!}{r!(n − r)!} = \binom{n}{r}

possible ways of dividing N into two sets A1 and A2 , where


◮ A1 ⊂ N , |A1 | = n1 = r ,
◮ A2 = N \ A1 , |A2 | = n2 = n − r .

Of course, this result can be generalized!

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 21 / 355
Elements of Probability Theory Introduction to Probability and Counting

Sorting Into Classes

1.1.19. Theorem. Let N := {1, . . . , n}. Then there are

\frac{n!}{n_1!\, n_2! \cdots n_k!}

possible ways of dividing N into k sets A1 , . . . , Ak , where
◮ N = \bigcup_{i=1}^{k} A_i with the A_i pairwise disjoint (A_i ∩ A_j = ∅ for i ≠ j),
◮ |Ai | = ni , i = 1, . . . , k.

1.1.20. Remark. Since we are only interested in dividing a number of


elements into classes, and do not distinguish between the elements within
each class, this sorting into classes is also called permutation of
indistinguishable objects.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 22 / 355
Elements of Probability Theory Introduction to Probability and Counting

Sorting Into Classes


Proof of Theorem 1.1.19.
We first count the number of ways of sorting n1 elements into A1 :
\binom{n}{n_1} = \frac{n!}{n_1!(n − n_1)!}.

Next we count the ways of sorting n2 elements of the remaining n − n1


elements into A2 :
\binom{n − n_1}{n_2} = \frac{(n − n_1)!}{n_2!(n − n_1 − n_2)!}.

The total number of ways of sorting n1 elements into A1 and n2 elements


into A2 is then the product
\binom{n}{n_1}\binom{n − n_1}{n_2} = \frac{n!}{n_1!(n − n_1)!} \cdot \frac{(n − n_1)!}{n_2!(n − n_1 − n_2)!} = \frac{n!}{n_1!\,n_2!\,(n − n_1 − n_2)!}

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 23 / 355
Elements of Probability Theory Introduction to Probability and Counting

Sorting Into Classes


Proof of Theorem 1.1.19 (continued).
Proceeding in this manner, we find there are
\binom{n}{n_1} \prod_{i=1}^{k−1} \binom{n − \sum_{j=1}^{i} n_j}{n_{i+1}}
  = \frac{n!}{n_1!(n − n_1)!} \prod_{i=1}^{k−1} \frac{\bigl(n − \sum_{j=1}^{i} n_j\bigr)!}{n_{i+1}!\,\bigl(n − \sum_{j=1}^{i+1} n_j\bigr)!}
  = \frac{n! \prod_{i=1}^{k−1} \bigl(n − \sum_{j=1}^{i} n_j\bigr)!}{n_1!(n − n_1)! \prod_{i=1}^{k−1} n_{i+1}! \prod_{i=1}^{k−1} \bigl(n − \sum_{j=1}^{i+1} n_j\bigr)!}
  = \frac{n! \prod_{i=2}^{k−1} \bigl(n − \sum_{j=1}^{i} n_j\bigr)!}{\prod_{i=1}^{k} n_i! \prod_{i=1}^{k−2} \bigl(n − \sum_{j=1}^{i+1} n_j\bigr)!}
  = \frac{n!}{\prod_{i=1}^{k} n_i!}

ways of sorting all objects into classes A1 , . . . , Ak , |Ai | = ni .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 24 / 355
Elements of Probability Theory Introduction to Probability and Counting

Sample Spaces and Sample Points


These techniques of counting permutations, combinations and sorting into
classes are very useful for evaluating probabilities by the classical
definition. First, however, we need to translate physical situations into a
mathematical context.
1.1.21. Definition. A sample space for an experiment is a set S such that
each physical outcome of the experiment corresponds to exactly one
element of S.
An element of S is called a sample point.

1.1.22. Remark. Not every element of a sample space needs to correspond


to the outcome of an experiment. It is often convenient to select a very
large, but simple sample space. For example, for the roll of a six-sided die,
we can take S = N, where the result of the die corresponds to the natural
numbers 1,2,3,4,5 or 6 and all other natural numbers are not used. All
natural numbers in this example are sample points, even though only six
actually correspond to physical reality.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 25 / 355
Elements of Probability Theory Introduction to Probability and Counting

Events

1.1.23. Definition. Any subset A of a sample space S is called an event.


Two events A1 , A2 are called mutually exclusive if A1 ∩ A2 = ∅.

1.1.24. Example. We roll a four-sided die 10 times. Then a possible sample


space is S = N^10 and a sample point is a 10-tuple, for example
(1, 2, 3, 2, 3, 3, 1, 1, 4, 4) ∈ N^10 . This sample point corresponds to first
rolling a one, then a 2, next a 3, followed by a 2, a 3, a 3, two ones and
two fours.
An event might correspond to “rolling at least two fours” in which case
this would be a subset A ⊂ S such that each 10-tuple in A has at least
two entries equal to 4. For example, (1, 2, 3, 2, 3, 3, 1, 1, 4, 4) ∈ A but
(1, 2, 3, 2, 3, 3, 1, 1, 3, 4) ∉ A.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 26 / 355
Elements of Probability Theory Introduction to Probability and Counting

Example
1.1.25. Example. We roll a four-sided die 10 times. What is the probability
of obtaining 5 ones, 3 twos, 1 three and 1 four?
There are 4^10 = 1048576 possibilities for the 10-tuple of results of the die
rolls, corresponding to that many sample points in S = N^10 that
correspond to physical results. The event A consists of all ordered
10-tuples containing 5 ones, 3 twos, 1 three and 1 four. There are
\frac{10!}{5!\,3!\,1!\,1!} = 5040
possible ways of obtaining 5 ones, 3 twos, 1 three and 1 four, so there are
that many elements in A. The probability is
\frac{5040}{1048576} ≈ 0.00481 ≈ 0.5%.
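The arithmetic of this example can be checked in a few lines of Python (a sketch, not part of the slides); for this small case the count 5040 can even be confirmed by enumerating all 4^10 outcome tuples.

```python
from itertools import product
from math import factorial

# Multinomial coefficient 10! / (5! 3! 1! 1!)
ways = factorial(10) // (factorial(5) * factorial(3) * factorial(1) * factorial(1))
total = 4 ** 10
print(ways, total, ways / total)   # 5040 1048576 0.00480...

# Brute-force check over all 4^10 outcome tuples (feasible for this small case).
count = sum(1 for t in product(range(1, 5), repeat=10)
            if (t.count(1), t.count(2), t.count(3), t.count(4)) == (5, 3, 1, 1))
print(count)   # 5040
```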

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 27 / 355
Elements of Probability Theory Introduction to Probability and Counting

A Note on Numerical Values

As seen in the above example, we will round numerical values to a suitable


number of significant digits.
Hence 0.6 and 0.60 mean two different things, as you recall from your
physics lectures.
ALWAYS put a leading zero before the decimal point: writing .6 is not
acceptable for engineers, always write 0.6.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 28 / 355
Elements of Probability Theory Introduction to Probability and Counting

Counting Mountains
We will now look at a non-trivial example of counting. Consider the
following problem: We wish to draw “mountains” using upstrokes and
downstrokes, like this:
◮ One upstroke, one downstroke:
[Figure: a single peak, one upstroke followed by one downstroke.]

◮ Two upstrokes, two downstrokes:

[Figure: the two admissible mountains: up-down-up-down (two small peaks) and up-up-down-down (one tall peak).]

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 29 / 355
Elements of Probability Theory Introduction to Probability and Counting

Counting Mountains

◮ Three upstrokes, three downstrokes:

[Figure: the admissible mountains built from three upstrokes and three downstrokes.]
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 30 / 355
Elements of Probability Theory Introduction to Probability and Counting

Counting Mountains
The mountains must each consist of n upstrokes and n downstrokes, and
there may be no “valleys,” i.e., no stroke may lie under the starting point.
That means that the following constructions are not allowed:

[Figure: two disallowed constructions, each dipping below the starting level.]

The question is: for given n, how many such mountains using n up- and n
downstrokes can we draw?
The answer is related to many diverse combinatorial problems.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 31 / 355
Elements of Probability Theory Introduction to Probability and Counting

Counting Mountains

We would like to consider every mountain as a “word” or “string”


consisting of the “letters” or “characters” ↑ and ↓. For instance, we would
write

[Figure: a mountain built from three upstrokes and three downstrokes]

as W(3 ↑, 3 ↓) = ↑↑↓↓↑↓.
However, not all such words lead to admissible mountains. For
example, ↑↓↓↑↓↑ does not give a mountain.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 32 / 355
Elements of Probability Theory Introduction to Probability and Counting

Dyck Words and Catalan Numbers


Clearly, we need to impose two conditions on a word w :
1. #(↓) = #(↑) = n
2. In any word consisting of the first k letters of w , #(↑) ≥ #(↓)
(k = 1, . . . , 2n)
Words satisfying these conditions are called Dyck words.

1.1.26. Theorem. The number of Dyck words of length 2n, n ∈ N \ {0}, is


C_n = \frac{1}{n + 1} \binom{2n}{n}.

Cn is called the nth Catalan number.

A proof will be given in the recitation class. If you think that you are good
at counting, I challenge you to find a proof by yourselves before the class!
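If you would like to experiment before the recitation, the following Python sketch (not part of the slides) counts Dyck words by brute force for small n and compares the result with C_n = \binom{2n}{n}/(n + 1).

```python
from itertools import product
from math import comb

def is_dyck(word):
    """word is a tuple of +1 (up) and -1 (down); check both Dyck conditions."""
    height = 0
    for step in word:
        height += step
        if height < 0:          # prefix condition #(up) >= #(down) violated
            return False
    return height == 0          # equal numbers of ups and downs

for n in range(1, 6):
    brute = sum(is_dyck(w) for w in product((1, -1), repeat=2 * n))
    catalan = comb(2 * n, n) // (n + 1)
    print(n, brute, catalan)    # the two counts agree: 1, 2, 5, 14, 42
```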

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 33 / 355
Elements of Probability Theory Introduction to Probability and Counting

Historical Notes - Independence

Towards the end of the 17th century, the French mathematician de Moivre
had to flee to England, as he was a Protestant (Huguenot), and Protestants were
being persecuted in France at the time. However, in England he couldn’t
get a job, since he was French. (The English did not like or trust the
French.) So he ended up spending a lot of time in coffee houses.
There he earned money by helping gamblers resolve disputes (such as how
to divide the pot in an interrupted game). His experiences led him to the
concept of independence of events - a coin does not remember its last
result, and will always have the same chance of turning heads up, no
matter how often it has done so in the past. We will explore this essential
concept in more detail later.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 34 / 355
Elements of Probability Theory Introduction to Probability and Counting

Historical Notes - Law of Large Numbers

In contrast, experience tells us that if we throw a fair coin lots of times, it


should not come heads up all the time; this was formulated by Jacob
Bernoulli in the early 18th century, who stated the Law of Large Numbers.
Essentially, this law says that if an experimental outcome A happens with
probability p, and if we conduct “sufficiently many” experiments, then a
proportion close to p · 100% of the experimental outcomes should be A.
This statement is the basis of an empirical method of finding probabilities.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 35 / 355
Elements of Probability Theory Introduction to Probability and Counting

Probability - Relative Frequency Approximation

1.1.27. Definition. Let A be a random outcome (random event) of an


experiment. Then the probability P(A) of this event occurring may be
approximated by performing the experiment N times:

P[A] ≈ (number of times A occurs) / (number of times the experiment is performed) =: f(A)/N.

This is called the relative frequency approximation to P[A].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 36 / 355
Elements of Probability Theory Introduction to Probability and Counting

Probability - Relative Frequency Approximation


1.1.28. Examples.
1. An experiment consists of flipping a coin. We are interested in the
probability of the coin landing heads up. We perform the experiment
100 times and observe that the coin lands heads up 52 times. By the
relative frequency approximation,
P[coin lands heads up] ≈ 52/100 = 0.52.

2. An experiment consists of rolling two 6-sided dice and summing the


results, so the possible outcomes are the numbers S = 2, 3, . . . , 12.
We are interested in the outcome S = 3. We perform the experiment
200 times and observe this outcome 10 times. Hence
P[3] = P[S = 3] ≈ 10/200 = 0.05.
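The relative frequency approximation is easy to explore by simulation. The sketch below (illustrative only; the observed frequencies will vary from run to run) repeats both experiments of Example 1.1.28.

```python
import random

random.seed(0)  # for reproducibility of this illustration

# Experiment 1: flip a fair coin 100 times.
N = 100
heads = sum(random.random() < 0.5 for _ in range(N))
print(heads / N)        # relative frequency of heads, close to 0.5

# Experiment 2: roll two dice 200 times and record how often the sum is 3.
N = 200
threes = sum(random.randint(1, 6) + random.randint(1, 6) == 3 for _ in range(N))
print(threes / N)       # close to 2/36 = 0.056
```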

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 37 / 355
Elements of Probability Theory Introduction to Probability and Counting

Probability - Relative Frequency Approximation

1.1.29. Remark. You should be careful in making deductions from the law
of large numbers. For example, if a fair coin is tossed four times, it is more
likely to obtain a 3-1 split (3 heads and one tail or 3 tails and one head)
than it is to obtain a 2-2 split (2 heads and 2 tails). Verify this for yourself
by using the classical definition of probability!

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 38 / 355
Elements of Probability Theory Introduction to Probability and Counting

Classical Probability - Problems


You will have noticed that there are severe problems with the classical
definition of probability; it assumes that events are “equally likely ” - but
what does this mean? It means that the probabilities of each event
occurring are equal! So the definition of probability already presumes the
concept of probability - it is circular.
Such a “definition” is clearly useless!
In the beginning of the 20th century, the Russian mathematician
Kolmogorov decided to base probability on a solid mathematical
foundation. The classical definition of probability was not only shaky from
a logical point of view, it also didn’t help much when trying to treat things
like throwing a dart at a dart board.
He defined probability entirely in the abstract; the physical world is
translated into a mathematical set of events and probability is defined as a
certain function on this set. This removes all ambiguity from the
definition, but does not really tell us what probability means.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 39 / 355
Elements of Probability Theory Introduction to Probability and Counting

What is Probability?

Probability theory is mainly applied by statisticians. For them, abstract


axioms are meaningless. On the other hand, pure mathematicians cannot
easily deal with the “fuzzy” and at times inconsistent explanations of
probability with reference to the real world. If you are interested in the
subject, I advise you to google/baidu
◮ Frequentism and
◮ Bayesianism
for some schools of thought on what probability actually is. Be aware that
this will lead you to some fascinating philosophical questions!

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 40 / 355
Elements of Probability Theory Introduction to Probability and Counting

Axiomatic Definition of Probability

The upshot of Kolmogorov’s work for us is that we define probability in


the following way:
1.1.30. Definition. Let S be a sample space and P(S) the power set (set of
all subsets) of S. Then a function P : P(S) → R, A ↦ P[A], is called a
probability function (or just probability) on S if
1. P[A] ≥ 0 for all A ∈ P(S),
2. P[S] = 1,
3. for any countable collection of pairwise disjoint events {Ak } ⊂ P(S) (Ak ∩ Aj = ∅ for k ≠ j),

P[\bigcup_k A_k] = \sum_k P[A_k].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 41 / 355
Elements of Probability Theory Introduction to Probability and Counting

Axiomatic Definition of Probability

From the definition of the probability function P it follows immediately


that

P[∅] = 0, P[Ac ] = 1 − P[A],

where Ac = S \ A denotes the complement of A ⊂ S.


We can also derive the general addition rule

P[A1 ∪ A2 ] = P[A1 ] + P[A2 ] − P[A1 ∩ A2 ].

Simple examples may be discussed in recitation class!

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 42 / 355
Elements of Probability Theory Introduction to Probability and Counting

Probability Tree Diagrams

Now that we have a good grasp of the concept of probability, we can use
tree diagrams more generally. Consider the case of an unfair coin, which
turns heads up 60% of the time. We toss the coin twice:

[Tree diagram: the first toss gives heads with probability 0.6 or tails with probability 0.4; from either branch, the second toss again gives heads (0.6) or tails (0.4). The four leaves are:
P[2 heads] = 0.36, P[head, tail] = 0.24, P[tail, head] = 0.24, P[2 tails] = 0.16.]

Probabilities are multiplied along each branch to give the final probability,
and summed across sub-branches to give unity.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 43 / 355
Elements of Probability Theory Introduction to Probability and Counting

Conditional Probability
We now want to study the effect of information on probability. In
particular, two events may be related, such that the occurrence of one
event influences the probability that the other event occurs. For example,
event B may have a 50% chance of occurring, but if one also knows that
event A has previously occurred, then event B might have a higher or
lower chance of occurring.
1.1.31. Example. There is a test for the gender of an unborn child called
“starch gel electrophoresis” (this type of test is of course not done in
China!). It detects the presence of a protein zone called the pregnancy
zone. The following statistical information is known:
◮ 43% of all pregnant women have the pregnancy zone.
◮ 51% of all children born are male.
◮ 17% of all children are male and their mothers have the pregnancy
zone.
Given that the zone is present, what is the probability that the child is
male?
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 44 / 355
Elements of Probability Theory Introduction to Probability and Counting

Conditional Probability
We now use a classical probability tree,

[Tree diagram: “All pregnant women” splits into “zone absent” and “zone present”, with P[zone present] = 0.43 on the right branch; “zone present” then splits into “female” and “male”, with P[male | zone present] on the branch to “male” and P[male ∩ zone present] = 0.17 at that leaf.]
Here P[male | zone present] means the probability that the child is male,
given that the zone is present.
Since we multiply probabilities along branches of probability trees, it is
clear that
P[male | zone present] = \frac{P[male ∩ zone present]}{P[zone present]} = \frac{0.17}{0.43} ≈ 0.40

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 45 / 355
Elements of Probability Theory Introduction to Probability and Counting

Conditional Probability
The previous example motivates the general definition:
1.1.32. Definition. Let A, B be events and P(A) ̸= 0. Then we define the
conditional probability

P[B | A] := \frac{P[A ∩ B]}{P[A]}.

Conditional probability emphasizes information; our idea of the probability


of an event occurring can change radically if we obtain information about
related events. Mathematically, P(B) becomes P(B | A), which may be
dramatically less or greater than P(B). This concept of probability being
related to incomplete information is at the heart of Bayesianism. It is a
different approach to Frequentism, which takes a more empirical
approach based on the law of large numbers.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 46 / 355
Elements of Probability Theory Introduction to Probability and Counting

The Monty Hall Paradox

You are participating in a game show to win 10,000,000 RMB. The game
master [Monty Hall] presents you with three closed doors. Behind one of
the doors is the prize, behind the other two doors there is nothing. If you
open the correct door, you will receive the money, if you open one of the
other two doors you will not get anything.
Before opening any of the three doors, you can announce which door you
intend to open. Obviously, at least one of the other two doors does not
hide the money. The game master opens this (empty) door. You are then
given the option of either
◮ sticking with your choice or
◮ switching to the other closed door.
What do you do and does it make a difference?

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 47 / 355
Elements of Probability Theory Introduction to Probability and Counting

The Monty Hall Paradox

To many people it seems counter-intuitive, but the best course of action is


to change your choice to the other door. There will be a 2/3 probability
that the prize is behind the remaining door that you have not chosen!
Why is that? By opening the door, the game master has not given you
any information about the door you have chosen (he can always open one
of the remaining doors, no matter which door you choose). The
probability of this being the correct door was 1/3 before he opens the
other door, and it remains that way after he opens the door.
However, his opening a door does give you information on the other two
doors, namely, it tells you which of the other two doors does definitely not
hide the prize. The original 2/3 probability that one of these doors hides
the money is now concentrated on just the one door. Therefore, it is
advantageous for you to change your choice.
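A simulation makes the 1/3 versus 2/3 split easy to believe. The following Python sketch (not part of the slides) plays the game many times under both strategies; the rule for which empty door the host opens when the contestant has picked the prize is a modelling assumption here, but it does not affect the result.

```python
import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        prize = random.randrange(3)            # door hiding the money
        choice = random.randrange(3)           # contestant's announced door
        # Host opens an empty door that is not the contestant's choice.
        opened = next(d for d in range(3) if d != choice and d != prize)
        if switch:
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == prize)
    return wins / trials

print(play(switch=False))   # about 1/3
print(play(switch=True))    # about 2/3
```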

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 48 / 355
Elements of Probability Theory Introduction to Probability and Counting

Independence of Events

If one event does not influence another, then we say that the two events
are independent. Mathematically, we express this in the following way.
1.1.33. Definition. Let A, B be two events. We say that A and B are
independent if

P(A ∩ B) = P(A)P(B). (1.1.1)

Equation (1.1.1) is equivalent to

P(A | B) = P(A) if P(B) ̸= 0,


P(B | A) = P(B) if P(A) ̸= 0.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 49 / 355
Elements of Probability Theory Introduction to Probability and Counting

Independence of Events
1.1.34. Example. The birthdays (day and month) of a group of people are
generally assumed to be independent. Disregarding leap years, any person
is assumed to have a 1/365 chance of being born on a given day. (Do you
think that this is a reasonable assumption?) How many people should a
group have so that there is a better than even chance of two people in the
group having the same birthday?
We consider the complementary problem and start with a single person in
the group. If we add a second person, there is a 364/365 chance of them
not sharing a birthday. Adding a third person, for no two people to share a
birthday, this person must have his birthday on one of the other 363 days
of the year, so there is now a
364 363
365 365
chance of no two people in the group sharing a birthday.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 50 / 355
Elements of Probability Theory Introduction to Probability and Counting

Independence of Events
Continuing this argument, in a group of n ≥ 2 people there is a

\prod_{k=2}^{n} \frac{366 − k}{365} = \frac{1}{365^{n−1}} \cdot \frac{364!}{(365 − n)!}

chance of no two people having the same birthday. It turns out that for
n = 23 this number is less than 0.5, so the probability of two people
having the same birthday is > 0.5.
This statement has been verified empirically; in a soccer match there are
2 × 11 players + 1 referee on the pitch. On any given playing day in the
Premier Division of the English league, about half the games should
feature two participants with the same birthday.
Refer to the article Coincidences: The truth is out there, Teaching
Statistics, Vol. 1, No. 1, 1998 (uploaded to the Resources section on
SAKAI).
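The birthday computation is easy to do numerically. The sketch below (not part of the slides) evaluates the product for increasing group sizes and reports the first n for which the probability of a shared birthday exceeds 0.5.

```python
p_no_match = 1.0
n = 1
while True:
    n += 1
    p_no_match *= (366 - n) / 365      # the new person avoids the previous n-1 birthdays
    if 1 - p_no_match > 0.5:
        break
print(n, 1 - p_no_match)               # 23, about 0.507
```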
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 51 / 355
Elements of Probability Theory Introduction to Probability and Counting

Total Probability
Imagine you conduct a 2-stage experiment with two possible outcomes A
and B but three intermediate stages F , G and H. The probability tree for
this experiment looks like this:

P[F ] ccccccccccc[[[[[[[[[[[P[H]
ccccccc [[[[[[[[
ccccccc P[G ] [[[[[[
F G H
P[A|F ]qqqMMMP[B|F
M ] P[A|G ]qqqMMMP[B|G
M ] P[A|H]qqqMMMP[B|H]
M
q MMM q MMM q MMM
qqq qqq qqq
A B A B A B

Then the probability of obtaining outcome A is given by

P[A] = P[A | F ] · P[F ] + P[A | G ] · P[G ] + P[A | H] · P[H]

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 52 / 355
Elements of Probability Theory Introduction to Probability and Counting

Total Probability

The previous example gives rise to the following theorem:


1.1.35. Theorem. (Total Probability) Let A1 , . . . , An ⊂ S be a set of
mutually exclusive events whose union is S. Let B ⊂ S be any event. Then

P[B] = \sum_{j=1}^{n} P[B | A_j] · P[A_j].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 53 / 355
Elements of Probability Theory Introduction to Probability and Counting

Bayes’s Theorem

This immediately leads to the celebrated theorem of Bayes:


1.1.36. Theorem. (Bayes) Let A1 , . . . , An ⊂ S be a set of mutually exclusive events
whose union is S. Let B ⊂ S be any event such that P[B] ̸= 0. Then for any Ak ,
k = 1, . . . , n,

P[Ak | B] = \frac{P[B ∩ Ak]}{P[B]} = \frac{P[B | Ak] · P[Ak]}{\sum_{j=1}^{n} P[B | Aj] · P[Aj]}.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 54 / 355
Elements of Probability Theory Introduction to Probability and Counting

Bayes’s Theorem
1.1.37. Example. Blood type distribution in US:
◮ Type A: 41%
◮ Type B: 9%
◮ Type AB: 4%
◮ Type O: 46%
Measurement statistics:
◮ P[type A registered | true type O] = 4%
◮ P[type A registered | true type A] = 88%
◮ P[type A registered | true type B] = 4%
◮ P[type A registered | true type AB] = 10%
A test registers type A; what is the probability that this is correct?
Wanted: P[true type A | type A registered]

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 55 / 355
Elements of Probability Theory Introduction to Probability and Counting

Bayes’s Theorem
By Bayes’s Theorem,

P[true type A | type A registered]
  = \frac{P[type A registered | true type A] · P[true type A]}{P[type A registered]},
where the total probability of P[type A registered] is
P[type A registered]
= P[type A registered | true type O] · P[true type O]
+ P[type A registered | true type A] · P[true type A]
+ P[type A registered | true type B] · P[true type B]
+ P[type A registered | true type AB] · P[true type AB]
Inserting the numerical values,
P[true type A | type A registered] ≈ 0.93
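The numerical values of this example can be plugged in directly; a short Python sketch (illustrative only):

```python
prior = {"A": 0.41, "B": 0.09, "AB": 0.04, "O": 0.46}   # P[true type]
lik = {"A": 0.88, "B": 0.04, "AB": 0.10, "O": 0.04}     # P[type A registered | true type]

# Total probability of the test registering type A.
p_reg_A = sum(lik[t] * prior[t] for t in prior)

# Bayes's Theorem.
posterior_A = lik["A"] * prior["A"] / p_reg_A
print(p_reg_A, posterior_A)    # about 0.387 and 0.93
```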

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 56 / 355
Elements of Probability Theory Discrete Random Variables

Introduction to Probability and Counting

Discrete Random Variables

Continuous Random variables

Joint Distributions

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 57 / 355
Elements of Probability Theory Discrete Random Variables

Countable and Uncountable Sets

We first need to review the concept of countable and uncountable sets.


1.2.1. Definition. Let A be a set. Then we say that A is countable if either
◮ A is finite or
◮ there exists a bijection A → N.
If A is not countable, then we say that A is uncountable.

1.2.2. Examples.
1. The integer numbers are countable.
2. The rational numbers are countable.
3. The real numbers in the interval [0, 1] are uncountable.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 58 / 355
Elements of Probability Theory Discrete Random Variables

Discrete Random Variables


1.2.3. Definition. Let S be a sample space and Ω a countable subset of R.
A discrete random variable is a map X : S → Ω together with a function
fX : Ω → R with the properties that
1. fX ≥ 0 and
2. \sum_{x∈Ω} fX(x) = 1.
Then fX gives the probability that X assumes a given value x, i.e.,
fX (x) = P[X = x].
The function fX is called the probability function or distribution function
of the random variable X .
Please note the non-standard notation above. “X = x” makes no sense at
all, since X is a function and x is a number. What is meant is
P(X (p) = x), the probability of the random variable X applied to a
sample point p ∈ S yielding the value x. Like physics, probability theory
uses its own notation, which is often incompatible with that of “standard”
calculus.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 59 / 355
Elements of Probability Theory Discrete Random Variables

Discrete Random Variables


1.2.4. Example. Consider a class of 25 students. We test the blood type of
each student and count the total number of students with each blood
type. Then the sample space is
S = {(n1 , n2 , n3 , n4 ) : n1 , n2 , n3 , n4 ∈ N, n1 + n2 + n3 + n4 = 25},

where the quadruple (n1 , n2 , n3 , n4 ) means that n1 students have type A,


n2 have type B, n3 have type AB, n4 have type O.
A random variable X might denote the number of students with type A or
type B blood. Then
X : S → Ω = {0, 1, 2, . . . , 25}, (n1 , n2 , n3 , n4 ) 7→ x = n1 + n2 .
Assuming that the blood types of the students are independent and that
the distribution of the blood types among the students is the same as that
of the general population, data for the general population can be used to
assign a probability to each value x of X .
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 60 / 355
Elements of Probability Theory Discrete Random Variables

Discrete Random Variables

Please note:
◮ The map X : S → Ω alone is just a variable.
◮ If Ω is countable, we call X a discrete variable, otherwise a
continuous variable.
◮ X together with fX (i.e., the pair (X , fX )) is a discrete or continuous
random variable.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 61 / 355
Elements of Probability Theory Discrete Random Variables

Cumulative distribution
In the previous example, we might be interested in the probability that not
more than 10 out of 25 students have blood type A. The random variable
would be
X : S → Ω = {0, 1, 2, . . . , 25}, (n1 , n2 , n3 , n4 ) 7→ x = n1 .
with a certain density function fX . We would hence determine
P[X ≤ 10] = P[X = 0] + P[X = 1] + · · · + P[X = 10].
This is known as the cumulative probability, and we in fact define the
so-called cumulative distribution function of a random variable by
F (x) = P[X ≤ x].
For a discrete random variable,
F(x) = \sum_{y ≤ x} P[X = y] = \sum_{y ≤ x} fX(y).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 62 / 355
Elements of Probability Theory Discrete Random Variables

Expectation
Consider the rolling of a fair six-sided die. We are interested in the average
or expected value of the result. Since each result (numbers 1,2,3,4,5,6)
occurs with probability 1/6, we take the weighted sum:
(1/6) · 1 + (1/6) · 2 + (1/6) · 3 + (1/6) · 4 + (1/6) · 5 + (1/6) · 6 = 3.5.
The average result of a die roll is then 3.5, even though this result itself
can never occur.
1.2.5. Definition. Let (X , fX ) be a discrete random variable. Then the
expected value of X is
E[X] = \sum_{x∈Ω} x · fX(x),

provided that the sum (series) on the right converges absolutely.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 63 / 355
Elements of Probability Theory Discrete Random Variables

Expectation

Given a random variable X and a function H : Ω → R, the composition


H ◦ X will again be a random variable, albeit with a different probability
density function. If X is discrete, then so is H ◦ X .
1.2.6. Definition. Let (X , fX ) be a discrete random variable and H : Ω → R
some function. Then the expected value of H ◦ X is
E[H ◦ X] = \sum_{x∈Ω} H(x) · fX(x),

provided that the sum (series) on the right converges absolutely.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 64 / 355
Elements of Probability Theory Discrete Random Variables

Some Properties of the Expectation


1.2.7. Theorem.
1. Consider the random variable S → R given by p ↦ c, for every
sample point p ∈ S and a fixed number c ∈ R. Denote this random
variable by c also. Then
E[c] = c.

2. Let X be a random variable and c ∈ R. Then the composition of the


function H : R → R, H : y ↦ c · y with X is a random variable, and
we write cX := H ◦ X . Then
E[cX ] = c E[X ].

3. Let X , Y be two random variables that are independent (we will later
give a precise definition). Then
E[X + Y ] = E[X ] + E[Y ].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 65 / 355
Elements of Probability Theory Discrete Random Variables

Variance
The expectation of a random variable is its mean, the point about which
the values of the random variable fluctuate. What the mean does not
describe is the degree of this fluctuation.
1.2.8. Example. We roll three fair six-sided dice and sum the results. The
result of each die roll is a random variable X , Y and Z , respectively. The
random variable is X + Y + Z and its expected value is
E[X + Y + Z ] = E[X ] + E[Y ] + E[Z ] = 3.5 + 3.5 + 3.5 = 10.5
We can also roll a single twenty-sided (icosahedral) die. Then the
expected value of the result is
E[W] = \sum_{i=1}^{20} \frac{i}{20} = \frac{1}{20} \cdot \frac{20 · 21}{2} = 10.5
Although both W and X + Y + Z have the same mean, they assume a
totally different range of values and we would also find that
P[X + Y + Z = 9] ̸= P[W = 9].
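A simulation (illustrative only; the sample size is arbitrary) shows that the sum of three dice and a single twenty-sided die share the mean 10.5 but fluctuate very differently; compare the sample variances.

```python
import random
from statistics import mean, variance

random.seed(1)
N = 100_000

three_dice = [sum(random.randint(1, 6) for _ in range(3)) for _ in range(N)]
d20 = [random.randint(1, 20) for _ in range(N)]

print(mean(three_dice), variance(three_dice))   # mean ~10.5, variance ~8.75
print(mean(d20), variance(d20))                 # mean ~10.5, variance ~33.25
```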
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 66 / 355
Elements of Probability Theory Discrete Random Variables

Variance
We therefore want to measure not only the mean of a random variable,
but also its expected deviation from the mean. The deviation from the
mean is given by
X − E [X ],
but we are interested in its absolute size, so we square it and then take the
expected value.
1.2.9. Definition. Let X be a random variable with expectation E[X ]. Then
the variance of X is defined as
Var X = E[(X − E[X])^2].

1.2.10. Notation. For short (and especially in statistics) we write


E[X] = µ_X = µ, Var X = σ_X^2 = σ^2.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 67 / 355
Elements of Probability Theory Discrete Random Variables

Variance and Standard Deviation


From the definition of the variance we can easily calculate that

Var X = E[X^2] − E[X]^2.

If the random variable X assumes values with physical units (e.g., x


meters, x m), then the variance will have units that are the square of
these. This often does not make physical sense, since we want to measure
the expected deviation from the mean of a random variable, which should
have the same units.

1.2.11. Definition. Let X be a random variable with variance σ_X^2. Then we
define the standard deviation of X by

σ_X = \sqrt{Var X} = \sqrt{σ^2}.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 68 / 355
Elements of Probability Theory Discrete Random Variables

Some Properties of the Variance


1.2.12. Theorem.
1. Consider the random variable S → R given by p ↦ c, for every
sample point p ∈ S and a fixed number c ∈ R. Denote this random
variable by c also. Then
Var c = 0.

2. Let X be a random variable and c ∈ R. Then the composition of the
function H : R → R, H : y ↦ c · y with X is a random variable, and
we write cX := H ◦ X . Then
Var cX = c^2 Var X .

3. Let X , Y be two random variables that are independent. Then


Var[X + Y ] = Var[X ] + Var[Y ].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 69 / 355
Elements of Probability Theory Discrete Random Variables

The Geometric Distribution

The main interest in studying random variables lies in their distribution
functions fX . The actual values that X takes are just numbers; all
probabilistic information lies in the distribution of these numbers.
Many discrete probability density functions arise from experiments that
follow a simple set of rules. We first discuss the properties of a geometric
random variable, i.e., a random variable with a geometric distribution
function.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 70 / 355
Elements of Probability Theory Discrete Random Variables

The Geometric Distribution

Geometric Properties.
1. The experiment consists of a series of trials. The outcome of each
trial can be classed as being either a “success” (s) or a “failure” (f ).
A trial with this property is called a Bernoulli trial.
2. The trials are identical and independent in the sense that the
outcome of one trial has no effect on the outcome of any other. The
probability of success, p, remains the same for each trial.
3. The random variable X denotes the number of trials needed to obtain
the first success.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 71 / 355
Elements of Probability Theory Discrete Random Variables

The Geometric Distribution


1.2.13. Definition. A random variable (X , fX ) with

X : S → Ω = N \ {0}

and distribution function fX : N \ {0} → R given by

fX(x) = (1 − p)^{x−1} p, 0 < p < 1,

is said to have a geometric distribution with parameter p.

1.2.14. Lemma. The cumulative distribution function for a geometrically


distributed random variable (X , fX ) with parameter p is given by

F(x) = P[X ≤ x] = 1 − q^{⌊x⌋},

where q = 1 − p is the probability of failure and ⌊x⌋ denotes the greatest


integer less than or equal to x.
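Lemma 1.2.14 can be checked numerically. The sketch below (not from the slides; p = 0.3 is an arbitrary choice) compares the closed form 1 − q^⌊x⌋ with a direct sum of the probability function.

```python
from math import floor

p = 0.3
q = 1 - p

def f(x):                     # geometric probability function, x = 1, 2, 3, ...
    return q ** (x - 1) * p

def F_sum(x):                 # cumulative distribution by direct summation
    return sum(f(k) for k in range(1, floor(x) + 1))

for x in (1, 2.5, 4, 7):
    print(x, F_sum(x), 1 - q ** floor(x))   # the two columns agree
```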
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 72 / 355
Elements of Probability Theory Discrete Random Variables

Moments and the Moment-Generating Function

Finding the expectation value and variance for a geometric random


variable can be quite tricky if we use the definition directly. It will be even
more difficult for some distributions which we are still to encounter.
However, there exists a tool that allows us to employ all the power of
calculus to this problem.
1.2.15. Definition. Let (X , fX ) be a random variable. The kth ordinary
moment of X is defined as E[X^k].
Hence the expectation E[X ] is the first moment of X , while the variance
Var X = E[X^2] − E[X]^2 is the second moment minus the square of the first
moment. It turns out not only that many important properties of X can
be inferred from the knowledge of its moments, but also that the moments
can be derived easily from the so-called moment-generating function.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 73 / 355
Elements of Probability Theory Discrete Random Variables

Moments and the Moment-Generating Function


1.2.16. Definition. Let (X , fX ) be a random variable. Assume that there
exists some ε > 0 such that

E[e^{tX}] exists for −ε < t < ε.

Then the moment-generating function (m.g.f.) mX for X is defined as

mX : [−ε, ε] → R, mX(t) = E[e^{tX}].

1.2.17. Theorem. Let (X , fX ) be a random variable with
moment-generating function mX . Then

E[X^k] = \frac{d^k m_X(t)}{dt^k} \Big|_{t=0}

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 74 / 355
Elements of Probability Theory Discrete Random Variables

Moments and the Moment-Generating Function


Proof of Theorem 1.2.17.
We calculate directly the expectation value of e^{tX}:

mX(t) = E[e^{tX}] = E\Bigl[\sum_{n=0}^{\infty} \frac{t^n X^n}{n!}\Bigr] = \sum_{n=0}^{\infty} \frac{t^n}{n!} E[X^n].

Here we have (ab)used the various properties of the expectation.


Differentiating term-by-term,
\frac{d^k m_X(t)}{dt^k} = \sum_{n=0}^{\infty} \frac{d^k}{dt^k} \frac{t^n}{n!} E[X^n] = \sum_{n=k}^{\infty} \frac{t^{n−k}}{(n − k)!} E[X^n].

At t = 0, only the first term survives, so


\frac{d^k m_X(t)}{dt^k} \Big|_{t=0} = E[X^k].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 75 / 355
Elements of Probability Theory Discrete Random Variables

Moments and the Moment-Generating Function


We will also use the moment-generating function as a “fingerprint” for a
distribution: We treat the association “distribution 7→ m.g.f.” as being
injective.
In other words, if we know that a given distribution g has a certain
moment-generating function, and if we find that some random variable
(X , fX ) has the same m.g.f., then fX = g .
This is strictly speaking untrue. Such a statement actually holds for the
characteristic function
χ_X(t) = E[e^{itX}],

where i = \sqrt{−1}. This has to do with the fact that the m.g.f. can be
regarded as the Laplace transform of the indicator function on Ω, while
the characteristic function is its Fourier transform. There are stronger
analytic uniqueness results for the Fourier transform than there are for the
Laplace transform.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 76 / 355
Elements of Probability Theory Discrete Random Variables

The Geometric Distribution

We now tackle the geometric distribution:


1.2.18. Proposition. Let (X , fX ) be a geometrically distributed random
variable with parameter p. Then the moment-generating function for X is
given by

mX : (−∞, − ln q) → R, mX(t) = \frac{pe^t}{1 − qe^t},
where q = 1 − p.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 77 / 355
Elements of Probability Theory Discrete Random Variables

The Geometric Distribution

Proof.
We have fX(x) = q^{x−1} p for x ∈ N \ {0}. Then

mX(t) = E[e^{tX}] = \sum_{x=1}^{\infty} e^{tx} q^{x−1} p = \frac{p}{q} \sum_{x=1}^{\infty} (qe^t)^x.

This is a geometric series which converges for |qe^t| = qe^t < 1, i.e., for
t < − ln q. For such t, the limit is given by

mX(t) = \frac{p}{q} \sum_{x=1}^{\infty} (qe^t)^x = \frac{p}{q} \Bigl( \sum_{x=0}^{\infty} (qe^t)^x − 1 \Bigr) = \frac{p}{q} \Bigl( \frac{1}{1 − qe^t} − 1 \Bigr)
      = \frac{p}{q} \cdot \frac{qe^t}{1 − qe^t} = \frac{pe^t}{1 − qe^t}.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 78 / 355
Elements of Probability Theory Discrete Random Variables

The Geometric Distribution


1.2.19. Lemma. Let (X , fX ) be a geometrically distributed random variable
with parameter p. Then the expectation value and variance are given by
E[X] = \frac{1}{p}  and  Var X = \frac{q}{p^2},

where q = 1 − p.

Proof.
We use the moment-generating function to calculate the expectation value:
E[X] = \frac{d}{dt}\Big|_{t=0} mX(t) = \frac{d}{dt}\Big|_{t=0} \frac{pe^t}{1 − qe^t}
     = \frac{pe^t(1 − qe^t) + pe^t \cdot qe^t}{(1 − qe^t)^2}\Big|_{t=0} = \frac{p}{(1 − q)^2} = \frac{1}{p}.

The proof for the variance is similar and is left to the reader.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 79 / 355
Elements of Probability Theory Discrete Random Variables

The Binomial Distribution

We will now look at various further discrete distributions, starting with the
binomial distribution.
Binomial Properties.
1. The experiment consists of a fixed number n of Bernoulli trials.
2. The trials are identical and independent. The probability of success,
p, remains the same for each trial.
3. The random variable X denotes the number of successes in the n
trials.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 80 / 355
Elements of Probability Theory Discrete Random Variables

The Binomial Distribution

Since x successes among n trials can be achieved in \binom{n}{x} ways, and x
successes with probability p must be accompanied by n − x failures with
probability 1 − p, we see that

P[x successes in n trials] = \binom{n}{x} p^x (1 − p)^{n−x}.
This motivates the following
1.2.20. Definition. A random variable (X , fX ) with

X : S → Ω = {0, 1, 2, . . . , n}

and distribution function fX : Ω → R given by


fX(x) = \binom{n}{x} p^x (1 − p)^{n−x}, 0 < p < 1, n ∈ N \ {0}
is said to have a binomial distribution with parameters n and p.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 81 / 355
Elements of Probability Theory Discrete Random Variables

The Binomial Distribution


1.2.21. Theorem. Let (X , fX ) be a binomial random variable with
parameters n and p.
1. The moment generating function of X is given by
mX : R → R, mX(t) = (q + pe^t)^n , q = 1 − p.

2. E[X ] = np.
3. Var X = npq.
The proof of this theorem is left as an exercise. We will also later be very
interested in the cumulative distribution F , F (x) = P[X ≤ x]. There is no
simple way of evaluating the sums involved, so the values have been
tabulated (Table I of Appendix A in the textbook). The table gives the
values of
F(t) = \sum_{x=0}^{⌊t⌋} \binom{n}{x} p^x (1 − p)^{n−x},
where ⌊t⌋ is the greatest integer less than or equal to t.
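The formulas of Theorem 1.2.21 and the tabulated cumulative distribution are easy to reproduce directly; the following sketch (not tied to Table I of the textbook, with n = 10 and p = 0.4 chosen arbitrarily) sums the probability function.

```python
from math import comb, floor

n, p = 10, 0.4
q = 1 - p

def f(x):                      # binomial probability function
    return comb(n, x) * p**x * q**(n - x)

mean = sum(x * f(x) for x in range(n + 1))
var = sum(x**2 * f(x) for x in range(n + 1)) - mean**2
print(mean, n * p)             # both 4.0 (up to floating-point rounding)
print(var, n * p * q)          # both 2.4

def F(t):                      # cumulative distribution F(t) = P[X <= t]
    return sum(f(x) for x in range(floor(t) + 1))

print(F(3))                    # P[X <= 3], about 0.3823 for these parameters
```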
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 82 / 355
Elements of Probability Theory Discrete Random Variables

The Pascal (Negative Binomial) Distribution

While the binomial distribution counts the number of successes in n trials,


the Pascal distribution counts the number of trials needed for r successes.

Pascal Properties.
1. The experiment consists of a series of Bernoulli trials.
2. The trials are identical and independent. The probability of success,
p, remains the same for each trial.
3. The trials are observed until exactly r successes are obtained, where r
is fixed beforehand.
4. The random variable X is the number of trials needed to obtain the r
successes.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 83 / 355
Elements of Probability Theory Discrete Random Variables

The Pascal Distribution

We will try to derive a formula for the probability density associated with
negative binomial trials. First, notice that if we need x trials for r
successes, then in x − 1 trials we must have had exactly r − 1 successes
and therefore x − r failures. We know that the probability of this is
    P[exactly r − 1 successes in x − 1 trials] = C(x − 1, r − 1) p^{r−1} (1 − p)^{x−r}.

Now with probability p the xth trial will be a success, so


    P[obtain rth success in xth trial] = C(x − 1, r − 1) p^r (1 − p)^{x−r}.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 84 / 355
Elements of Probability Theory Discrete Random Variables

The Pascal Distribution


1.2.22. Definition. Let r ∈ N \ {0}. A random variable (X , fX ) with
X : S → Ω = N \ {0, 1, . . . , r − 1} = {r , r + 1, r + 2, . . .}
and distribution function fX : Ω → R given by
    f_X(x) = C(x − 1, r − 1) p^r (1 − p)^{x−r},    0 < p < 1,
is said to have a negative binomial distribution with parameters p and r .

1.2.23. Theorem. Let (X , fX ) be a negative binomial random variable with


parameters p and r .
1. The moment generating function of X is given by
   m_X : (−∞, −ln q) → R,    m_X(t) = (p e^t)^r / (1 − q e^t)^r,    q = 1 − p.
2. E[X] = r/p.
3. Var X = rq/p².
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 85 / 355
Elements of Probability Theory Discrete Random Variables

The Pascal Distribution

Proof.
We derive the moment-generating function only. It is given by

    m_X(t) = E[e^{tX}] = Σ_{x=r}^∞ e^{tx} C(x − 1, r − 1) p^r (1 − p)^{x−r}
           = Σ_{x=0}^∞ e^{t(r+x)} C(r + x − 1, r − 1) p^r (1 − p)^x
           = p^r e^{tr} Σ_{x=0}^∞ C(r − 1 + x, x) [e^t (1 − p)]^x.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 86 / 355
Elements of Probability Theory Discrete Random Variables

The Pascal Distribution


Proof (continued).
Note that the binomial series gives
    (1 − y)^{−r} = Σ_{x=0}^∞ C(−r, x) (−y)^x,

where

    C(−r, x) = [(−r)(−r − 1) ⋯ (−r − x + 1)] / x!
             = (−1)^x [r(r + 1) ⋯ (r + x − 1)] / x!
             = (−1)^x (r + x − 1)! / (x!(r − 1)!) = (−1)^x C(r − 1 + x, x).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 87 / 355
Elements of Probability Theory Discrete Random Variables

The Pascal Distribution

Proof (continued).
It follows that (1 − y)^{−r} = Σ_{x=0}^∞ C(r − 1 + x, x) y^x. Therefore,

    m_X(t) = p^r e^{tr} Σ_{x=0}^∞ C(r − 1 + x, x) [e^t (1 − p)]^x
           = p^r e^{tr} (1 − (1 − p) e^t)^{−r} = (p e^t)^r / (1 − q e^t)^r

with q = 1 − p.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 88 / 355
Elements of Probability Theory Discrete Random Variables

The Pascal Distribution


1.2.24. Example. The president of a large corporation makes decisions by
throwing darts at a board. The center section is marked “yes” and
represents a success. The probability of his hitting a “yes” is 0.6, and this
probability remains constant from throw to throw. The president continues
to throw until he has three “hits.” We denote X as the number of the trial
on which he experiences his third hit. The president’s decision rule is
simple: If he gets three hits on or before the fifth throw he decides in favor
of the question. What is the probability that he will decide in favor?

    P[X ≤ 5] = P[X = 3] + P[X = 4] + P[X = 5]
             = C(2, 2)(0.6)³(0.4)⁰ + C(3, 2)(0.6)³(0.4)¹ + C(4, 2)(0.6)³(0.4)²
             = 0.683.
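A short numerical check of this calculation using the Pascal density (a sketch; the parameters are those of the example):

    from math import comb

    p, r = 0.6, 3

    def pascal_pmf(x, r, p):
        """f_X(x) = C(x - 1, r - 1) p^r (1 - p)^(x - r)."""
        return comb(x - 1, r - 1) * p**r * (1 - p)**(x - r)

    # P[X <= 5] = P[X = 3] + P[X = 4] + P[X = 5]
    print(sum(pascal_pmf(x, r, p) for x in range(3, 6)))  # approximately 0.683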

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 89 / 355
Elements of Probability Theory Discrete Random Variables

The Hypergeometric Distribution


The hypergeometric distribution concerns trials that are not independent.
In order to construct such trials, Bernoulli introduced the idea of drawing
different colored balls from an urn without replacing them. Thus each draw
influences the distribution of the remaining balls.
Hypergeometric Properties.
1. The experiment consists of drawing a random sample of size n
without replacement and without regard to order from a collection of
N ≥ n objects.
2. Of the N objects, r have a trait that interests us; the other N − r do
not have the trait.
3. The random variable X is the number of objects in the sample with
the trait.

1.2.25. Remark. If the sampling were done with replacement, this would
correspond to a binomial situation.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 90 / 355
Elements of Probability Theory Discrete Random Variables

The Hypergeometric Distribution

In order to find a density function for this situation, we use the classical
definition of probability. Fix the number of objects N, the sample size n
and the number of objects with the trait, r .

P[exactly x objects with trait in sample]
    = [(# ways to select x out of r objects) · (# ways to select n − x out of N − r objects)]
      / (# ways to select n out of N objects)
    = C(r, x) C(N − r, n − x) / C(N, n).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 91 / 355
Elements of Probability Theory Discrete Random Variables

The Hypergeometric Distribution

1.2.26. Definition. Let N, n, r ∈ N \ {0}, r , n ≤ N. A random variable


(X , fX ) with

X : S → Ω = {x ∈ N : max(0, n − (N − r )) ≤ x ≤ min(n, r )}

and distribution function fX : Ω → R given by


    f_X(x) = C(r, x) C(N − r, n − x) / C(N, n)

is said to have a hypergeometric distribution with parameters N, n and r .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 92 / 355
Elements of Probability Theory Discrete Random Variables

The Hypergeometric Distribution

The hypergeometric distribution is tricky to treat explicitly. Without proof


or motivation, we give the following result:
1.2.27. Theorem. Let (X , fX ) be a hypergeometric distribution with
parameters N, n and r .
1. E[X] = n · (r/N).
2. Var X = n · (r/N) · ((N − r)/N) · ((N − n)/(N − 1)).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 93 / 355
Elements of Probability Theory Discrete Random Variables

The Hypergeometric Distribution

The hypergeometric distribution is used in acceptance sampling.


Suppose a retailer buys goods in lots and each item can be either
acceptable or defective. Let

N = # of items in a lot,

and
r = # of defectives in a lot.
Then we can calculate the probability that a sample of size n contains x
defectives.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 94 / 355
Elements of Probability Theory Discrete Random Variables

The Hypergeometric Distribution

1.2.28. Example. Suppose that a lot of 25 machine parts is delivered, where


a part is considered acceptable only if it passes tolerance. We sample 10
parts and find that none are defective (all are within tolerance). What is
the probability of this event if there are 6 defectives in the lot of 25?
Applying the hypergeometric distribution, N = 25, r = 6 and n = 10, we
have

    P[X = 0] = f_X(0) = C(r, 0) C(N − r, n) / C(N, n) = C(6, 0) C(19, 10) / C(25, 10) = 0.028,
showing that our observed event is quite unlikely if there are 6 defectives in
the lot.
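The value is easy to verify numerically; a minimal sketch with the data of this example:

    from math import comb

    def hypergeom_pmf(x, N, n, r):
        """f_X(x) = C(r, x) C(N - r, n - x) / C(N, n)."""
        return comb(r, x) * comb(N - r, n - x) / comb(N, n)

    print(hypergeom_pmf(0, 25, 10, 6))   # approximately 0.028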

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 95 / 355
Elements of Probability Theory Discrete Random Variables

Approximating the Hypergeometric Distribution

The hypergeometric distribution describes “sampling without


replacement.” In contrast, “sampling with replacement” is described by
the binomial distribution. Therefore, it makes sense that under certain
conditions the hypergeometric distribution may be approximated by the
binomial distribution.
The approximation is valid whenever the sampling fraction n/N is small.
Depending on whom you ask, “small” can mean n/N ≤ 0.05 or
n/N ≤ 0.1. Then the hypergeometric distribution can be approximated by
a binomial distribution with parameters n and p = r /N.
The smaller n/N is, the better the approximation.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 96 / 355
Elements of Probability Theory Discrete Random Variables

Approximating the Hypergeometric Distribution

1.2.29. Example. A production lot of 200 units has 8 defectives. A random


sample of 10 units is selected, and we want to find the probability that the
random sample will contain exactly one defective. The true probability is
    P[X = 1] = C(r, 1) C(N − r, n − 1) / C(N, n) = C(8, 1) C(192, 9) / C(200, 10) = 0.288.

Since n/N = 10/200 = 0.05, the approximation is applicable. With
p = r/N = 8/200 = 0.04, the binomial approximation gives
    P[X = 1] ≈ C(10, 1)(0.04)¹(0.96)⁹ = 0.277.
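The comparison is easy to reproduce numerically; a short sketch using the values of this example:

    from math import comb

    N, n, r = 200, 10, 8
    p = r / N

    exact = comb(r, 1) * comb(N - r, n - 1) / comb(N, n)   # hypergeometric
    approx = comb(n, 1) * p**1 * (1 - p)**(n - 1)          # binomial approximation

    print(exact, approx)   # approximately 0.288 and 0.277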

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 97 / 355
Elements of Probability Theory Discrete Random Variables

The Poisson Distribution

The Poisson distribution is used for discrete occurrences, called arrivals or


births, occurring randomly in a continuous time frame.
The random variable is Xt , the number of arrivals that occur in the time
interval [0, t] for some t > 0. Hence for any t, Xt : S → N.
We make the assumption that the numbers of arrivals during non-overlapping
time intervals T_1, T_2 with T_1 ∩ T_2 = ∅ are independent.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 98 / 355
Elements of Probability Theory Discrete Random Variables

The Poisson Distribution

We also assume that there exists some number λ > 0 such that for any
small time interval of size ∆t the following postulates are satisfied:
1. The probability that exactly one arrival will occur in an interval of
width ∆t is approximately λ · ∆t.
2. The probability that exactly zero arrivals will occur in the interval is
approximately 1 − λ · ∆t.
3. The probability that two or more arrivals occur in the interval is
approximately zero (very small).
We will formulate these postulates mathematically, and from them obtain
the density function for the Poisson distribution.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 99 / 355
Elements of Probability Theory Discrete Random Variables

The Poisson Distribution


We denote by o(t) any function f such that

    lim_{t→0} f(t)/t = 0.
Hence o(t) does not denote a particular function, rather a class of
functions. For example,
◮ t 2 = o(t),
◮ (1 + t)2 = 1 + 2t + o(t),
◮ sin t = t + o(t).
In particular,
◮ o(t) + o(t) = o(t),
◮ t n · o(t) = o(t) for all n ∈ N,
◮ o(t) · o(t) = o(t).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 100 / 355
Elements of Probability Theory Discrete Random Variables

The Poisson Distribution

We now formulate our postulates mathematically:


1. The probability that exactly one arrival will occur in an interval of
width ∆t is λ · ∆t + o(∆t).
2. The probability that exactly zero arrivals will occur in the interval is
1 − λ · ∆t + o(∆t).
3. The probability that two or more arrivals occur in the interval is
o(∆t).
From this, we can derive the probability density function fXt . We write

fXt (x) = P[Xt = x] =: px (t) for x = 0, 1, 2, 3, . . .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 101 / 355
Elements of Probability Theory Discrete Random Variables

The Poisson Distribution


The probability of having zero arrivals in the time interval [0, t + ∆t] is
given by the product of the probabilities of having zero arrivals in [0, t] and
[t, t + ∆t], respectively. We fix t and obtain

p0 (t + ∆t) = (1 − λ∆t + o(∆t))p0 (t) = (1 − λ∆t)p0 (t) + o(∆t).

so that
    −λ p_0(t) = [p_0(t + ∆t) − p_0(t)]/∆t + o(∆t)/∆t.
We can take the limit as ∆t → 0 on both sides. Then the fraction
o(∆t)/∆t vanishes by definition, and we have

    −λ p_0(t) = lim_{∆t→0} [p_0(t + ∆t) − p_0(t)]/∆t = p_0′(t).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 102 / 355
Elements of Probability Theory Discrete Random Variables

The Poisson Distribution


Now let x > 0. Then

px (t + ∆t) = λ∆t px−1 (t) + (1 − λ∆t)px (t) + o(∆t).

so that
    λ p_{x−1}(t) − λ p_x(t) = [p_x(t + ∆t) − p_x(t)]/∆t + o(∆t)/∆t.
We can take the limit as ∆t → 0 on both sides. Then the fraction
o(∆t)/∆t vanishes by definition, and we have

px′ (t) = λpx−1 (t) − λpx (t).

Together with p_0′ = −λ p_0 we hence have a system of differential equations
that can be solved inductively to determine p_0, p_1, p_2, . . .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 103 / 355
Elements of Probability Theory Discrete Random Variables

The Poisson Distribution


The solution to these equations is

    f_{X_t}(x) = p_x(t) = (λt)^x e^{−λt} / x!.
If we set k = λt, we obtain the Poisson distribution with parameter k.
1.2.30. Definition. Let k > 0. A random variable (X, f_X) with

X: S →N

and distribution function fX : N → R given by

    f_X(x) = k^x e^{−k} / x!
is said to have a Poisson distribution with parameter k.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 104 / 355
Elements of Probability Theory Discrete Random Variables

The Poisson Distribution

1.2.31. Theorem. Let (X , fX ) be a Poisson distributed random variable


with parameter k.
1. The moment generating function of X is given by
   m_X : R → R,    m_X(t) = e^{k(e^t − 1)}.

2. E[X ] = k.
3. Var X = k.

The proof of this theorem is left as an exercise.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 105 / 355
Elements of Probability Theory Discrete Random Variables

The Poisson Distribution


1.2.32. Example. The white blood cell count of a healthy individual can
average as low as 6000/ mm3 of blood. To detect a white-cell deficiency, a
0.001 mm3 drop of blood is taken and the number X of white blood cells
is found. How many white cells are expected in a healthy individual? If at
most two are found, is there evidence of a white-cell deficiency?
Here the (continuous) volume of blood (in mm3 ) takes the role of the time
variable, a white cell counts as an “arrival.” The number of arrivals per
unit volume is λ = 6000, the volume under consideration is s = 0.001.
Hence we have a Poisson process with parameter
k = λs = 6.
The expected value is E[X] = k = 6. Furthermore,

    P[X ≤ 2] = Σ_{x=0}^{2} e^{−6} 6^x / x! = 0.062.
This value has been found from Table II of Appendix A of the text book.
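The table value is easy to reproduce; a minimal sketch using only the standard library:

    from math import exp, factorial

    k = 6.0   # k = lambda * s = 6000 * 0.001
    prob = sum(exp(-k) * k**x / factorial(x) for x in range(3))  # P[X <= 2]
    print(prob)   # approximately 0.062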
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 106 / 355
Elements of Probability Theory Discrete Random Variables

Approximating the Binomial Distribution

If n is large and p is small, we can approximate the binomial distribution


by the Poisson distribution. We then set

k = pn.

Note that usually k = λt, where λ corresponds to the arrivals per unit
time. In general, we require p < 0.1 for this approximation. The smaller p
and the larger n are, the better the approximation.
In fact, the Poisson distribution can be regarded as a limiting case of the
binomial distribution, as you shall prove in the exercises.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 107 / 355
Elements of Probability Theory Discrete Random Variables

Approximating the Binomial Distribution

1.2.33. Example. The probability that a particular rivet in the wing surface
of a new aircraft is defective is 0.001. There are 4000 rivets in the wing.
What is the probability that not more than six defective rivets will be
installed?
The true probability is
    P[X ≤ 6] = Σ_{x=0}^{6} C(4000, x) (0.001)^x (0.999)^{4000−x}.

Using the Poisson approximation, k = 4000 · 0.001 = 4 and

    P[X ≤ 6] ≈ Σ_{x=0}^{6} e^{−4} 4^x / x! = 0.889.
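Both sums can be evaluated directly to see how close the approximation is; a short sketch:

    from math import comb, exp, factorial

    n, p = 4000, 0.001
    k = n * p   # = 4

    exact = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(7))
    approx = sum(exp(-k) * k**x / factorial(x) for x in range(7))
    print(exact, approx)   # both approximately 0.889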

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 108 / 355
Elements of Probability Theory Continuous Random variables

Introduction to Probability and Counting

Discrete Random Variables

Continuous Random variables

Joint Distributions

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 109 / 355
Elements of Probability Theory Continuous Random variables

Continuous Random Variables

1.3.1. Definition. Let S be a sample space. A continuous random variable


is a map X : S → R together with a function fX : R → R with the
properties that
1. fX ≥ 0 and
2. ∫_{−∞}^∞ f_X(x) dx = 1.
The integral of fX is interpreted as the probability that X assumes values
x in a given range, i.e.,
    P[a ≤ X ≤ b] = ∫_a^b f_X(x) dx.

The function fX is called the probability density function (or just density)
of the random variable X .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 110 / 355
Elements of Probability Theory Continuous Random variables

Cumulative Distribution
Notice that by the above definition,
    P[X = x] = ∫_x^x f_X(y) dy = 0,

i.e., the probability that X assumes any specific value is zero.


In particular, if for two continuous random variables (X , fX ) and (Y , fY )
the densities fX and fY differ only on sets of measure zero (e.g., countable
sets), then for any a, b ∈ R
    P[a ≤ X ≤ b] = ∫_a^b f_X(x) dx = ∫_a^b f_Y(y) dy = P[a ≤ Y ≤ b].

In this case we say

(X , fX ) = (Y , fY ) almost surely.

(From the point of view of calculus, fX = fY almost everywhere.)


Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 111 / 355
Elements of Probability Theory Continuous Random variables

Cumulative Distribution

1.3.2. Definition. Let (X , fX ) be a continuous random variable. The


cumulative distribution function for X is defined by F_X : R → R,

    F_X(x) := P[X ≤ x] = ∫_{−∞}^x f_X(y) dy.

Notice that by the fundamental theorem of calculus we can easily obtain


the density fX from FX :
fX (x) = FX′ (x).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 112 / 355
Elements of Probability Theory Continuous Random variables

Expectation and Variance


We define expectation similarly to the discrete case:
1.3.3. Definition. Let (X , fX ) be a continuous random variable and
H : R → R some function. Then the expected value of H ◦ X is
    E[H ◦ X] = ∫_{−∞}^∞ H(x) · f_X(x) dx,

provided that the integral on the right converges absolutely.


As a special case we regain the expected value of X ,
    E[X] = ∫_R x · f_X(x) dx

and also note that as before

Var X = E[(X − E[X ])2 ] = E[X 2 ] − E[X ]2 .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 113 / 355
Elements of Probability Theory Continuous Random variables

Exponential Distribution

In contradistinction to the discrete distributions, the continuous


distributions can often not be easily motivated by experiments. In practice,
one finds that a random variable may follow some distribution, but it is
not generally possible to predict the distribution in advance just by
specifying the circumstances.

1.3.4. Definition. Let β ∈ R, β > 0. A continuous random variable (X , fβ )


with density
    f_β(x) = (1/β) e^{−x/β}  for x > 0,        f_β(x) = 0  for x ≤ 0,

is said to follow an exponential distribution with parameter β.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 114 / 355
Elements of Probability Theory Continuous Random variables

Exponential Distribution

[Figure: graphs of y = f_β(x) for β = 1/2, 1, 3/2, 2.]

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 115 / 355
Elements of Probability Theory Continuous Random variables

Exponential Distribution
We can easily calculate the expectation and variance for the exponential
distribution:
    E[X] = ∫_{−∞}^∞ x f_β(x) dx = ∫_0^∞ (x/β) e^{−x/β} dx
         = [−x e^{−x/β}]_0^∞ + ∫_0^∞ e^{−x/β} dx = β,

    E[X²] = ∫_{−∞}^∞ x² f_β(x) dx = ∫_0^∞ (x²/β) e^{−x/β} dx
          = [−x² e^{−x/β}]_0^∞ + 2 ∫_0^∞ x e^{−x/β} dx = 2β²,

    ⇒ Var X = E[X²] − E[X]² = β².

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 116 / 355
Elements of Probability Theory Continuous Random variables

Exponential Distribution

Similarly, we can obtain the moment-generating function:


    m_X(t) = E[e^{tX}] = ∫_{−∞}^∞ e^{tx} f_β(x) dx
           = ∫_0^∞ (1/β) e^{−(1/β − t)x} dx
           = [(1/β − t)^{−1} / β] ∫_0^∞ e^{−y} dy
           = (1 − βt)^{−1},    for t < 1/β.
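A simulation check of E[X] = β and Var X = β² (a sketch; note that random.expovariate expects the rate λ = 1/β, and the sample size is an arbitrary choice):

    import random

    beta = 2.0
    random.seed(0)
    sample = [random.expovariate(1 / beta) for _ in range(200_000)]

    mean = sum(sample) / len(sample)
    var = sum((x - mean)**2 for x in sample) / len(sample)
    print(mean, beta)     # both approximately 2
    print(var, beta**2)   # both approximately 4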

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 117 / 355
Elements of Probability Theory Continuous Random variables

Exponential Distribution
The exponential distribution has an interpretation as “time-to-failure.” In
fact, there is a relationship between the exponential distribution and the
Poisson distribution.
Recall that for Poisson-distributed events (arrivals) the probability of x
arrivals in the time interval [0, t] was given by

    p(x) = (λt)^x e^{−λt} / x!,    x ∈ N.
Then p(0) is the probability of no arrivals in [0, t]. This can also be
interpreted as the probability that the first arrival occurs at a time greater
than t. Let the time of the first arrival be a continuous random variable
denoted by T . Then

    P[T > t] = p(0) = e^{−λt},    t ≥ 0.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 118 / 355
Elements of Probability Theory Continuous Random variables

Exponential Distribution

Hence, if we denote by F the cumulative distribution of the density of T ,


we have

    F(t) = P[T ≤ t] = 1 − e^{−λt},    t ≥ 0.

Since f_T(t) = F′(t), it follows that the density is

    f_T(t) = λ e^{−λt},    t ≥ 0,

and fT (t) = 0 for t < 0. Thus the time between successive arrivals of a
Poisson-distributed random variable is exponentially distributed with
parameter β = 1/λ.
If we interpret an “arrival” as the failure of a component, then β^{−1} = λ
represents the rate of failure.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 119 / 355
Elements of Probability Theory Continuous Random variables

Exponential Distribution

1.3.5. Example. An electronic component is known to have a useful life


represented by an exponential density with a failure rate of 10^{−5} failures per
hour, i.e., 1/β = 10^{−5}. The mean time to failure, E[X], is thus β = 10^5
hours.
Suppose we wanted to determine the fraction of such components that
would fail before the mean or expected life:
    P[T ≤ β] = ∫_0^β (1/β) e^{−x/β} dx = 1 − e^{−1} = 0.63212.

That is, 63.2% of the components will fail before the mean life time. As
you can see, this result does not depend on the value of β.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 120 / 355
Elements of Probability Theory Continuous Random variables

Exponential Distribution

The exponential distribution has an interesting and unique property: it is


memoryless. In other words,

P[X > x + s | X > x] = P[X > s].

For example, if a cathode ray tube has an exponential time to failure


distribution and at time x it is observed to be still functioning, then the
remaining life has the same exponential failure distribution as the tube had
at time zero.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 121 / 355
Elements of Probability Theory Continuous Random variables

Exponential Distribution

To see this, note that


    P[X > x] = ∫_x^∞ f(t) dt = ∫_x^∞ λ e^{−λt} dt = e^{−λx}.

Then

    P[X > x + s | X > x] = P[(X > x + s) ∩ (X > x)] / P[X > x] = P[X > x + s] / P[X > x]
                         = e^{−λ(x+s)} / e^{−λx} = e^{−λs} = P[X > s].
e −λx

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 122 / 355
Elements of Probability Theory Continuous Random variables

Gamma Distribution
A more general form of the exponential distribution is the gamma
distribution, which has two parameters.
1.3.6. Definition. Let α, β ∈ R, α, β > 0. A continuous random variable
(X , fα,β ) with density
    f_{α,β}(x) = x^{α−1} e^{−x/β} / (Γ(α) β^α)  for x > 0,        f_{α,β}(x) = 0  for x ≤ 0,

is said to follow a gamma distribution with parameters α and β.


Here
    Γ(α) = ∫_0^∞ z^{α−1} e^{−z} dz,    α > 0,

is the Euler gamma function.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 123 / 355
Elements of Probability Theory Continuous Random variables

Gamma Distribution
The gamma function satisfies Γ(1) = 1 and Γ(α) = (α − 1)Γ(α − 1) if
α > 1. In other words,
n! = Γ(n + 1) for n ∈ N.
It is a continuous extension of the factorial function to the positive real
numbers and can in fact be defined for all complex numbers except zero
and the negative integers. Below is its graph for α ∈ (−3, 5).
[Figure: graph of y = Γ(x) = ∫_0^∞ z^{x−1} e^{−z} dz for x ∈ (−3, 5).]

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 124 / 355
Elements of Probability Theory Continuous Random variables

Gamma Distribution

1.3.7. Theorem. Let (X , fα,β ) be a Gamma distributed random variable


with parameters α, β > 0.
1. The moment-generating function of X is given by

   m_X : (−∞, 1/β) → R,    m_X(t) = (1 − βt)^{−α}.

2. E[X] = αβ.
3. Var X = αβ².

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 125 / 355
Elements of Probability Theory Continuous Random variables

Gamma Distribution
Proof.
We will verify the moment-generating function only.
    m_X(t) = E[e^{tX}] = ∫_0^∞ [e^{tx} / (Γ(α) β^α)] x^{α−1} e^{−x/β} dx
           = [1 / (Γ(α) β^α)] ∫_0^∞ x^{α−1} e^{−x(1/β − t)} dx.

Substituting y = x(1/β − t), we have dy = (1/β − t) dx and

    m_X(t) = [1 / (Γ(α) β^α)] (1/β − t)^{−1} ∫_0^∞ [y/(1/β − t)]^{α−1} e^{−y} dy
           = [(1/β − t)^{−α} / (Γ(α) β^α)] ∫_0^∞ y^{α−1} e^{−y} dy
           = (1 − βt)^{−α}.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 126 / 355
Elements of Probability Theory Continuous Random variables

Special Gamma Distributions

The gamma distribution is of interest because many other distributions are


special cases. For example, the exponential distribution is a gamma
distribution with α = 1.
An important distribution in statistics is the chi squared distribution.

1.3.8. Definition. Let X be a gamma random variable with β = 2 and


α = γ/2 for a positive integer γ ∈ N. Then X is said to have a chi
squared distribution with γ degrees of freedom. We denote this variable by
X = X_γ².
We will encounter this distribution again later.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 127 / 355
Elements of Probability Theory Continuous Random variables

Special Gamma Distributions

[Figure: graph of the density of the chi squared distribution.]

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 128 / 355
Elements of Probability Theory Continuous Random variables

Normal (Gauß) Distribution

The normal distribution was first described by de Moivre in 1733 as a


limiting case of the binomial distribution. This result did not get much
attention, however, and was soon forgotten. The distribution was
rediscovered by Laplace and Gauß half a century later, as they both found
that this distribution described errors in astronomical measurements.

1.3.9. Definition. Let µ ∈ R, σ > 0. A continuous random variable (X , fX )


with density
    f_X(x) = [1/(√(2π) σ)] e^{−((x−µ)/σ)²/2}
is said to follow a normal distribution with parameters µ and σ.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 129 / 355
Elements of Probability Theory Continuous Random variables

Normal Distribution

[Figure: graph of the density of the normal distribution, with µ − σ, µ and µ + σ marked on the x-axis.]

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 130 / 355
Elements of Probability Theory Continuous Random variables

Normal Distribution

It is easily verified that ∫_R f_X(x) dx = 1 by using polar coordinates (see
blackboard). We furthermore have the following result:
1.3.10. Theorem. Let (X , fX ) be a normally distributed random variable
with parameters µ and σ.
1. The moment-generating function of X is given by
   m_X : R → R,    m_X(t) = e^{µt + σ²t²/2}.

2. E[X] = µ.
3. Var X = σ².

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 131 / 355
Elements of Probability Theory Continuous Random variables

Normal Distribution

Proof.
We will verify the moment-generating function only.
    m_X(t) = E[e^{tX}] = ∫_{−∞}^∞ [e^{tx}/(√(2π) σ)] e^{−((x−µ)/σ)²/2} dx
           = [1/(√(2π) σ)] ∫_{−∞}^∞ e^{tx − ((x−µ)/σ)²/2} dx.

We complete the square in the exponent to obtain

    tx − ((x − µ)/σ)²/2 = −(x − (µ + σ²t))²/(2σ²) + µt + σ²t²/2.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 132 / 355
Elements of Probability Theory Continuous Random variables

Normal Distribution

Proof (continued).
Substituting into the integral,
    m_X(t) = [1/(√(2π) σ)] ∫_{−∞}^∞ e^{−(x − (µ + σ²t))²/(2σ²) + µt + σ²t²/2} dx
           = e^{µt + σ²t²/2} · [1/(√(2π) σ)] ∫_{−∞}^∞ e^{−(x − (µ + σ²t))²/(2σ²)} dx
           = e^{µt + σ²t²/2},

since the remaining integral is that of a normal density with mean µ + σ²t and
standard deviation σ and hence equals 1.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 133 / 355
Elements of Probability Theory Continuous Random variables

Standard Normal Distribution

1.3.11. Definition. A normally distributed random variable with parameters


µ = 0 and σ = 1 is called a standard normal random variable and denoted
by Z .
The standard normal distribution is particularly important because any
normally distributed random variable can be transformed into a
standard-normally distributed one.
1.3.12. Theorem. Let X be a normally distributed random variable with
mean µ and standard deviation σ. Then
    Z := (X − µ)/σ
has standard normal distribution.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 134 / 355
Elements of Probability Theory Continuous Random variables

Transformation of Random Variables

It is easily seen that Z = (X − µ)/σ has mean E[Z] = 0 and variance Var Z = 1,
but it is not clear that Z is normally distributed. To see this, we need to
find the density of Z .
Hence it is worth studying the density of transformed variables in general.
1.3.13. Theorem. Let X be a continuous random variable with density fX .
Let Y = φ ◦ X , where φ : R → R is strictly monotonic and differentiable.
The density for Y is then given by
    f_Y(y) = f_X(φ^{−1}(y)) · |dφ^{−1}(y)/dy|.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 135 / 355
Elements of Probability Theory Continuous Random variables

Transformation of Random Variables


Proof.
We assume without loss of generality that φ is strictly decreasing. (The
case where φ is strictly increasing is analogous.)
The cumulative distribution function for Y is given by

FY (y ) = P[Y ≤ y ] = P[φ(X ) ≤ y ].

Since φ is strictly decreasing, φ−1 exists and is also decreasing. Therefore,

    F_Y(y) = P[φ(X) ≤ y] = P[φ^{−1}(φ(X)) ≥ φ^{−1}(y)] = P[X ≥ φ^{−1}(y)]
           = 1 − P[X ≤ φ^{−1}(y)] = 1 − F_X(φ^{−1}(y)).

Differentiating this expression, we obtain


    f_Y(y) = F_Y′(y) = −f_X(φ^{−1}(y)) · dφ^{−1}(y)/dy = f_X(φ^{−1}(y)) · |dφ^{−1}(y)/dy|.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 136 / 355
Elements of Probability Theory Continuous Random variables

Standard Normal Distribution


We can now prove Theorem 1.3.12. We have Z = φ ◦ X, where
φ(x) = (x − µ)/σ is strictly increasing and differentiable. Note that

    φ^{−1}(z) = σz + µ,        dφ^{−1}(z)/dz = σ > 0.

Using

    f_X(x) = [1/(√(2π) σ)] e^{−((x−µ)/σ)²/2}

we have

    f_Z(z) = f_X(φ^{−1}(z)) · |dφ^{−1}(z)/dz| = [1/(√(2π) σ)] e^{−z²/2} · σ = [1/√(2π)] e^{−z²/2},

which is the density of the standard normal distribution. Hence the
variable Z = (X − µ)/σ is standard normal.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 137 / 355
Elements of Probability Theory Continuous Random variables

Standard Normal Distribution


The cumulative distribution function of the standard normal distribution is
often denoted by Φ,
    Φ(z) = [1/√(2π)] ∫_{−∞}^z e^{−t²/2} dt.

The values of Φ are given in Table V of Appendix A.


1.3.14. Example. The breaking strength (in Newtons) of a synthetic fabric
is denoted X , and it is normally distributed with mean µ = 800 N and
standard deviation σ = 12 N. The purchaser of the fabric requires the
fabric to have a strength of at least 772 N. A fabric sample is randomly
selected and tested. To find P[X ≥ 772], we first calculate
    P[X < 772] = P[(X − µ)/σ < (772 − 800)/12]
               = P[Z < −2.33] = Φ(−2.33) = 0.01.

Hence the desired probability is P[X ≥ 772] = 0.99.
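The same probability can be found without tables; a minimal sketch (statistics.NormalDist is available in Python 3.8 and later):

    from statistics import NormalDist

    X = NormalDist(mu=800, sigma=12)
    print(1 - X.cdf(772))                      # P[X >= 772], approximately 0.99
    print(NormalDist().cdf((772 - 800) / 12))  # Phi(-2.33), approximately 0.01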
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 138 / 355
Elements of Probability Theory Continuous Random variables

Standard Normal Distribution

[Figure: densities of the normal distribution with µ = 800, σ = 12 and of the standard normal distribution; the shaded areas represent P[X < 772] and P[Z < −2.33], respectively.]

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 139 / 355
Elements of Probability Theory Continuous Random variables

Standard Normal Distribution

1.3.15. Example. The pitch diameter of the thread on a fitting is normally


distributed with a mean of 0.4008 cm and a standard deviation of
0.0004 cm. The design specifications are 0.4000 ± 0.0010 cm.
Notice that this process is operating with the mean not equal to the
nominal specifications. We desire to determine what fraction of product is
within tolerance.
    P[0.399 ≤ X ≤ 0.401] = P[(0.3990 − 0.4008)/0.0004 ≤ Z ≤ (0.4010 − 0.4008)/0.0004]
                         = P[−4.5 ≤ Z ≤ 0.5] = Φ(0.5) − Φ(−4.5)
                         = 0.6915 − 0.0000 = 0.6915.

Hence about 30% of the product will fall outside tolerance.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 140 / 355
Elements of Probability Theory Continuous Random variables

Standard Normal Distribution

1.3.16. Example. As process engineers study the results of such


calculations, they decide to replace a worn cutting tool and adjust the
machine producing the fittings so that the new mean falls directly at the
nominal value of 0.4000 cm. Then
    P[0.399 ≤ X ≤ 0.401] = P[(0.3990 − 0.4)/0.0004 ≤ Z ≤ (0.4010 − 0.4)/0.0004]
                         = P[−2.5 ≤ Z ≤ 2.5] = Φ(2.5) − Φ(−2.5)
                         = 0.9938 − 0.0062 = 0.9876.

With the adjustment, 98.76% of the product will fall inside tolerance.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 141 / 355
Elements of Probability Theory Continuous Random variables

Standard Normal Distribution

[Figure: density of the normal distribution with µ = 0.4008 and σ = 0.0004 for the unadjusted process; the shaded area indicates the design specifications 0.3990 to 0.4010.]
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 142 / 355
Elements of Probability Theory Continuous Random variables

Standard Normal Distribution

[Figure: density of the normal distribution with µ = 0.4000 and σ = 0.0004 for the adjusted process; the shaded area indicates the design specifications 0.3990 to 0.4010.]
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 143 / 355
Elements of Probability Theory Continuous Random variables

Standard Normal Distribution


1.3.17. Example. Let X denote the amount of radiation that can be
absorbed by an individual before death ensues. Assume that X is normal
with a mean dosage of 500 roentgens and a standard deviation of 150
roentgens. Above what dosage level will only 5% of those exposed
survive?
Here we want to find x0 such that P[X ≥ x0 ] = 0.05. Standardizing,
    P[X ≥ x_0] = P[(X − 500)/150 ≥ (x_0 − 500)/150] = P[Z ≥ (x_0 − 500)/150] = 0.05.

From Table V, P[Z ≥ 1.64] = 0.0505 and P[Z ≥ 1.65] = 0.0495.


Interpolating, we take P[Z ≥ 1.645] ≈ 0.0500, so we have
    (x_0 − 500 roentgen)/(150 roentgen) = 1.645    ⇔    x_0 = 746.75 roentgen.
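The interpolation of Table V can be replaced by an inverse-CDF call; a sketch using the standard library (NormalDist.inv_cdf, Python 3.8+):

    from statistics import NormalDist

    mu, sigma = 500, 150
    z = NormalDist().inv_cdf(0.95)   # approximately 1.645
    print(mu + sigma * z)            # approximately 746.7 roentgen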

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 144 / 355
Elements of Probability Theory Continuous Random variables

Standard Normal Distribution


In general, the following estimates are often useful:
1.3.18. Theorem. Let X be normally distributed with parameters µ and σ.
Then

P[−σ < X − µ < σ] = 0.68


P[−2σ < X − µ < 2σ] = 0.95
P[−3σ < X − µ < 3σ] = 0.997

Hence 68% of the values of a normal random variable lie within one
standard deviation of the mean, 95% lie within two standard deviations,
and 99.7% lie within three standard deviations. This rule of thumb will be
especially important in statistics, where the number of “extraordinary”
events needs to be judged.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 145 / 355
Elements of Probability Theory Continuous Random variables

Chebyshev’s Inequality
There is a general estimate, which does not depend on the distribution of
a discrete or continuous random variable, that tells us how the variance
influences the probability of deviating from the mean. This estimate is
called Chebyshev’s inequality.
1.3.19. Theorem. Let (X , fX ) be a discrete or continuous random variable
and k > 0 some positive number. Then
    P[−kσ < X − µ < kσ] ≥ 1 − 1/k²                                    (1.3.1)

or, equivalently,

    P[|X − µ| ≥ kσ] ≤ 1/k².                                           (1.3.2)
k2
Comparing (1.3.2) with Theorem 1.3.18, we see that the estimates in the
theorem are better. This is not surprising, as Chebyshev’s rule is valid for
any distribution, while the previous theorem uses the specific properties of
the normal distribution.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 146 / 355
Elements of Probability Theory Continuous Random variables

Chebyshev’s Inequality
Proof.
We will prove the theorem for the case of a continuous random variable
only. The proof in the discrete case is quite similar. By definition,
    σ² = Var X = E[(X − µ)²] = ∫_R (x − µ)² f_X(x) dx
       = ∫_{−∞}^{µ−√K} (x − µ)² f_X(x) dx + ∫_{µ−√K}^{µ+√K} (x − µ)² f_X(x) dx
         + ∫_{µ+√K}^∞ (x − µ)² f_X(x) dx

for any K > 0. Since the middle integral is nonnegative,

    σ² ≥ ∫_{−∞}^{µ−√K} (x − µ)² f_X(x) dx + ∫_{µ+√K}^∞ (x − µ)² f_X(x) dx.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 147 / 355
Elements of Probability Theory Continuous Random variables

Chebyshev’s Inequality
Proof (continued).

Now (x −√µ)2 ≥ K if and√only if |x − µ| ≥ K which is the case if
x ≥ µ + K or x ≤ µ − K . Therefore,
Z √ Z
µ− K ∞
σ ≥
2
(x − µ) fX (x) dx +
2
√ (x − µ)2 fX (x) dx
−∞ µ+ K
Z √ Z
µ− K ∞
≥K fX (x) dx + K √ fX (x) dx
−∞ µ+ K
³ √ √ ´
= K P[X ≤ µ − K ] + P[X ≥ µ + K ] ,

or, equivalently,
√ σ2
P[|X − µ| ≥ K] ≤ .
K
√ 1
Taking k = K /σ, we obtain P[|X − µ| ≥ kσ] ≤ .
k2
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 148 / 355
Elements of Probability Theory Continuous Random variables

Chebyshev’s Inequality

1.3.20. Example. From an analysis of company records, a materials control


manager estimates that the mean and standard deviation of the “lead
time” required in ordering a small valve are 8 days and 1.5 days,
respectively. He does not know the distribution of the lead time, but he is
willing to assume the estimates of the mean and standard deviation to be
absolutely correct.
The manager would like to determine a time interval such that the
probability is at least 8/9 that the order will be received during that time.
That is,

    1 − 1/k² = 8/9,

so that k = 3 and µ ± kσ = (8 ± 4.5) days. It is noted that this interval
may well be too large to be of any value to the manager, in which case he
may elect to learn more about the distribution of lead times.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 149 / 355
Elements of Probability Theory Continuous Random variables

Approximating the Binomial Distribution

We have previously seen how the binomial distribution can be


approximated by the Poisson distribution in the case where p is small. As
we have noted before, the normal distribution was originally conceived as
an approximation to the binomial distribution. This approximation is valid
for any p if n is large enough. In practice, n is often a sample size, as we
shall see in an example.
The mathematical formulation is known as the Theorem of de
Moivre-Laplace. The precise theorem contains an estimate for the error in
the approximation. Since we do not want to delve too deeply into the
mathematical theory involved, we give an informal formulation only.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 150 / 355
Elements of Probability Theory Continuous Random variables

Approximating the Binomial Distribution


We will use Stirling’s approximation of the factorial function,

    n! ≈ √(2π) e^{−n} n^{n+1/2},

in the sense that the relative error converges to zero as n goes to infinity,

    lim_{n→∞} [ n! − √(2π) e^{−n} n^{n+1/2} ] / n! = 0.
Then for a binomial random variable with parameters n and p (q = 1 − p)
we can eventually show that
    P[X = x] = [n!/(x!(n − x)!)] p^x q^{n−x} ≈ [1/√(2π npq)] e^{−(x−np)²/(2npq)},

so that

    P[X ≤ x] ≈ Φ((x − np)/√(npq)).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 151 / 355
Elements of Probability Theory Continuous Random variables

Approximating the Binomial Distribution


Thus the binomial distribution with parameters n and p behaves as a
normal distribution with mean µ = np and variance σ² = npq.
This approximation is good if p is close to 1/2 and n > 10. Otherwise, we
require that
np > 5 if p ≤ 1/2 or n(1 − p) > 5 if p > 1/2.

We also need to make a half-unit correction. For example, when


approximating the cumulative binomial distribution to determine
P[X ≥ 12], we in fact have to calculate 1 − Φ((11.5 − np)/√(npq)):

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 152 / 355
Elements of Probability Theory Continuous Random variables

Approximating the Binomial Distribution


1.3.21. Example. In sampling from a production process that produces
items of which 20% are defective, a random sample of 100 items is
selected each hour of each production shift. The number of defectives in a
sample is denoted by X .
To find, say, P[X ≤ 15] we might use the normal approximation as follows:
    P[X ≤ 15] ≈ P[Z ≤ (15 − 100 · 0.2)/√(100 · 0.2 · 0.8)] = P[Z ≤ −1.25]
              = Φ(−1.25) = 0.1056.
A half-unit correction would instead give
    P[X ≤ 15] ≈ P[Z ≤ (15.5 − 20)/4] = 0.130.

The correct result is P[X ≤ 15] = Σ_{k=0}^{15} C(100, k) (0.2)^k (0.8)^{100−k} = 0.1285.
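A short sketch comparing the exact value with the two approximations:

    from math import comb, sqrt
    from statistics import NormalDist

    n, p = 100, 0.2
    mu, sigma = n * p, sqrt(n * p * (1 - p))
    Phi = NormalDist().cdf

    exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(16))
    plain = Phi((15 - mu) / sigma)         # no half-unit correction
    corrected = Phi((15.5 - mu) / sigma)   # with half-unit correction

    print(exact, plain, corrected)   # approximately 0.129, 0.106, 0.130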

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 153 / 355
Elements of Probability Theory Continuous Random variables

Weibull Distribution

Like the gamma distribution, the Weibull distribution (introduced by W.


Weibull in 1951) is useful in engineering and life sciences applications
because it is flexible, with parameters that enable it to reduce to or
approximate the exponential and normal distributions.
The most general is the three-parameter Weibull distribution, with density
    f(x) = αβ (x − γ)^{β−1} e^{−α(x−γ)^β}  for x > γ,        f(x) = 0  otherwise,

with α, β > 0, γ ∈ R.

Here γ is called the location parameter, β is called the shape parameter


and α is the scale parameter. We will here consider the (physically most
common) case of γ = 0.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 154 / 355
Elements of Probability Theory Continuous Random variables

Weibull Distribution
1.3.22. Definition. A random variable (X , fX ) is said to have a
(two-parameter) Weibull distribution with parameters α and β if its
density is given by
    f(x) = αβ x^{β−1} e^{−αx^β}  for x > 0,        f(x) = 0  otherwise,

with α, β > 0.

1.3.23. Theorem. Let X be a Weibull random variable with parameters α


and β. The mean and variance of X are given by
    µ = α^{−1/β} Γ(1 + 1/β)
and
    σ² = α^{−2/β} Γ(1 + 2/β) − µ².

For the proof of the expression for the mean, we refer to the text book.
The proof of the formula for the variance is left as an exercise.
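These formulas are straightforward to evaluate with the gamma function from the standard library; a sketch (the parameter values are illustrative):

    from math import gamma

    def weibull_mean_var(alpha, beta):
        """Mean and variance of the two-parameter Weibull distribution."""
        mu = alpha**(-1 / beta) * gamma(1 + 1 / beta)
        var = alpha**(-2 / beta) * gamma(1 + 2 / beta) - mu**2
        return mu, var

    print(weibull_mean_var(1.0, 1.0))   # beta = 1: exponential case, (1.0, 1.0)
    print(weibull_mean_var(0.5, 2.0))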
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 155 / 355
Elements of Probability Theory Continuous Random variables

Weibull Distribution
The Weibull distribution is used (for example)
◮ In reliability engineering and failure analysis (the most common usage)
◮ In survival analysis (a branch of statistics which deals with death in
biological organisms and failure in mechanical systems)
◮ To represent manufacturing and delivery times in industrial
engineering
◮ In weather forecasting
◮ In radar systems to model the dispersion of the received signal level
produced by some types of clutter
◮ To model fading channels in wireless communications
◮ In general insurance to model the size of reinsurance claims
◮ In forecasting technological change (also known as the Sharif-Islam
model)
◮ To describe wind speed distributions, as the natural distribution often
matches the Weibull shape
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 156 / 355
Elements of Probability Theory Continuous Random variables

Reliability

Reliability studies are concerned with assessing whether or not a system


functions adequately under the conditions for which it was designed.
Interest centers on describing the behavior of the random variable X , the
time to failure of a system that can not be repaired once it fails to
operate.
Three functions come into play:
◮ the failure density f ,
◮ the reliability function R,
◮ the failure or hazard rate ϱ.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 157 / 355
Elements of Probability Theory Continuous Random variables

Reliability

Consider some system being put into operation at time t = 0. We observe


the system until it eventually fails. Let X denote the time of failure. This
is a continuous random variable with values in (0, ∞).
The probability density function of X is the failure density f . The
reliability function R is defined to be the probability that the component
will not fail before time t. Thus

    R(t) = 1 − P[component will fail before time t]
         = 1 − ∫_0^t f(x) dx = 1 − F(t),

where F is the cumulative distribution function of X .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 158 / 355
Elements of Probability Theory Continuous Random variables

Reliability - Hazard Rate


To define ϱ, the hazard rate function, consider a time interval [t, t + ∆t].
We define the hazard rate function over this interval by
    ϱ(t) := lim_{∆t→0} P[t ≤ X ≤ t + ∆t | t ≤ X] / ∆t
          = lim_{∆t→0} P[X ∈ [t, t + ∆t]] / (P[X ∈ [t, ∞)] · ∆t).

Note that

    lim_{∆t→0} P[X ∈ [t, t + ∆t]] / ∆t = lim_{∆t→0} [F(t + ∆t) − F(t)] / ∆t = F′(t) = f(t)

and P[X ∈ [t, ∞)] = 1 − F(t) = R(t) does not depend on ∆t. Therefore,

    ϱ(t) = f(t) / R(t).
The job of the scientist is to find the form of these functions for the
problem at hand. In practice, one often begins by assuming a particular
form for the hazard rate function based on empirical evidence.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 159 / 355
Elements of Probability Theory Continuous Random variables

Reliability - Hazard Rate

Interpretation of the Hazard rate:


1. If ϱ is increasing over an interval, then as time goes by a failure is
more likely to occur. This normally happens for systems that begin to
fail primarily due to wear.
2. If ϱ is decreasing over an interval, then as time goes by a failure is
less likely to occur than it was earlier in the time interval. This
happens in situations in which defective systems tend to fail early. As
time goes by, the hazard rate for a well-made system decreases.
3. A steady hazard rate is expected over the useful life span of a
component. A failure tends to occur during this period due mainly to
random factors.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 160 / 355
Elements of Probability Theory Continuous Random variables

Reliability
Often one has an idea of ϱ, but not of the failure density f or reliability
function R. Then the following theorem enables one to determine the
other two functions:
1.3.24. Theorem. Let X be a random variable with failure (probability)
density f, reliability function R and hazard rate ϱ. Then
    R(t) = e^{−∫_0^t ϱ(x) dx}.

Proof.
Note that since R(x) = 1 − F (x) we have R ′ (x) = −F ′ (x). Therefore,

    ϱ(x) = f(x)/R(x) = F′(x)/R(x) = −R′(x)/R(x),

so R ′ (x) = −ϱ(x)R(x). Note that R(0) = 1, since the component will not
fail before t > 0. Solving this differential equation through separation of
variables, we obtain the result.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 161 / 355
Elements of Probability Theory Continuous Random variables

Reliability
1.3.25. Example. One hazard function in widespread use is the function
    ϱ(t) = αβ t^{β−1},    t > 0,  α, β > 0.

◮ If β = 1, the hazard rate is constant (failure is due to random factors)


◮ If β > 1, the hazard rate is increasing (failure is due to wear)
◮ If β < 1, the hazard rate is decreasing (early failure likely due to
malfunction)
The reliability function is given by

    R(t) = e^{−∫_0^t αβ x^{β−1} dx} = e^{−α t^β}.

The failure density is given by


    f(t) = ϱ(t) R(t) = αβ t^{β−1} e^{−α t^β},

which happens to be the probability density of the Weibull distribution
with parameters α and β.
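A small sketch of these three functions for the hazard rate above (the values of α, β and t are illustrative only):

    from math import exp

    def hazard(t, alpha, beta):
        return alpha * beta * t**(beta - 1)

    def reliability(t, alpha, beta):
        """R(t) = exp(-alpha * t**beta)."""
        return exp(-alpha * t**beta)

    def failure_density(t, alpha, beta):
        """f(t) = rho(t) * R(t), i.e. the Weibull density."""
        return hazard(t, alpha, beta) * reliability(t, alpha, beta)

    alpha, beta = 0.01, 1.5
    print(reliability(10.0, alpha, beta), failure_density(10.0, alpha, beta))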
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 162 / 355
Elements of Probability Theory Continuous Random variables

Reliability of Series and Parallel Systems

Recall that reliability was defined as

R(t) = 1 − P[component will fail before time t].

Here we assume t is fixed; instead of tracking the time-dependence of the


reliability of a single component, we are interested in how the design of
a system with components of various reliabilities influences the reliability of
the entire system.
Components in multiple-component systems can be installed in the system
in various ways. Many systems are arranged in “series” configuration,
some are in “parallel” and others are combinations of the two designs.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 163 / 355
Elements of Probability Theory Continuous Random variables

Reliability of Series and Parallel Systems


1.3.26. Definition.
1. A system whose components are arranged in such a way that the
system fails whenever any of its components fail is called a series
system.
2. A system whose components are arranged in such a way that the
system fails only if all of its components fail is called a parallel system.
Assuming the components are independent of each other, the reliability of
a series system with k components is given by
    R_s(t) = Π_{i=1}^k R_i(t),
where Ri is the reliability of the ith component.
The reliability of a parallel system is given by

    R_p(t) = 1 − P[all components fail before t] = 1 − Π_{i=1}^k (1 − R_i(t)).
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 164 / 355
Elements of Probability Theory Continuous Random variables

Reliability of Series and Parallel Systems

1.3.27. Example. Consider a system consisting of eight independent


components, connected as shown below:

The reliability of the entire system is the product of the reliabilities of
assemblies I–V, working out to 0.7689.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 165 / 355
Elements of Probability Theory Joint Distributions

Introduction to Probability and Counting

Discrete Random Variables

Continuous Random variables

Joint Distributions

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 167 / 355
Elements of Probability Theory Joint Distributions

Random Vectors

Often, a single random variable is not enough to describe a physical


problem, or we are interested in the effect of one quantity on another. In
such a case we consider two (or more) random variables together. Then
we consider a “vector” where each component is itself a (“scalar”) random
variable.
We call such a vector a random vector or a multi-variate random variable
or an n-dimensional random variable. The components can be discrete or
continuous random variables, and even mixtures of the two.
In this section we will study bivariate (two-dimensional) random variables
where either both components are discrete or both components are
continuous random variables.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 168 / 355
Elements of Probability Theory Joint Distributions

Discrete Bivariate Random Variables

1.4.1. Definition. Let S be a sample space and Ω a subset of N2 . A


discrete bivariate random variable is a map (X , Y ) : S → Ω together with
a function fXY : Ω → R with the properties that
1. fXY ≥ 0 and
2. Σ_{(x,y)∈Ω} f_XY(x, y) = 1.
Then fXY gives the probability that the pair (X , Y ) assumes a given value
(x, y ), i.e.,
fXY (x, y ) = P[X = x and Y = y ].
The function fXY is called the joint density function of the random
variable (X , Y ).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 169 / 355
Elements of Probability Theory Joint Distributions

Discrete Bivariate Random Variables

1.4.2. Example. In an automobile plant two tasks are performed by robots.


The first entails welding two joints; the second, tightening three bolts. Let
X denote the number of defective welds and Y the number of improperly
tightened bolts produced per car. Past data may indicate the following
joint distribution for X and Y :

X = x/Y = y 0 1 2 3
0 0.840 0.030 0.020 0.010
1 0.060 0.010 0.008 0.002
2 0.010 0.005 0.004 0.001

For example, P[X = 1 and Y = 2] = 0.008.


Note that the sum of all probabilities is equal to 1, as required.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 170 / 355
Elements of Probability Theory Joint Distributions

Discrete Bivariate Random Variables


1.4.3. Example.
X = x/Y = y 0 1 2 3
0 0.840 0.030 0.020 0.010
1 0.060 0.010 0.008 0.002
2 0.010 0.005 0.004 0.001
We can read off that the probability of one error being committed is
P[X = 1 and Y = 0] + P[X = 0 and Y = 1] = 0.09.
The probability of no defective joints is
    P[Y = 0] = Σ_{x=0}^{2} P[X = x and Y = 0] = 0.91.

By summing in this way, we can determine P[Y = y ] for all y . This is


called the marginal density for Y .
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 171 / 355
Elements of Probability Theory Joint Distributions

Discrete Bivariate Random Variables


1.4.4. Definition. Let ((X , Y ), fXY ) be a discrete bivariate random variable.
We define the marginal density fX for X by
    f_X(x) = Σ_y f_XY(x, y)

and the marginal density f_Y for Y by

    f_Y(y) = Σ_x f_XY(x, y)

1.4.5. Example.
X = x/Y = y 0 1 2 3 fX (x)
0 0.840 0.030 0.020 0.010 0.900
1 0.060 0.010 0.008 0.002 0.080
2 0.010 0.005 0.004 0.001 0.020
fY (y ) 0.910 0.045 0.032 0.013 1.00
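The marginal rows and columns of such a table are simply row and column sums; a short sketch with the data of this example:

    # Joint density of Example 1.4.5, stored as f[x][y].
    f = {
        0: {0: 0.840, 1: 0.030, 2: 0.020, 3: 0.010},
        1: {0: 0.060, 1: 0.010, 2: 0.008, 3: 0.002},
        2: {0: 0.010, 1: 0.005, 2: 0.004, 3: 0.001},
    }

    f_X = {x: sum(row.values()) for x, row in f.items()}   # sum over y
    f_Y = {y: sum(f[x][y] for x in f) for y in range(4)}   # sum over x

    print(f_X)   # {0: 0.900, 1: 0.080, 2: 0.020}  (up to floating-point rounding)
    print(f_Y)   # {0: 0.910, 1: 0.045, 2: 0.032, 3: 0.013}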

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 172 / 355
Elements of Probability Theory Joint Distributions

Continuous Bivariate Random Variables

1.4.6. Definition. Let S be a sample space. A continuous bivariate random


variable is a map (X , Y ) : S → R2 together with a function fXY : R2 → R
with the properties that
1. fXY ≥ 0 and
2. ∫_{−∞}^∞ ∫_{−∞}^∞ f_XY(x, y) dy dx = 1.
The integral of fXY is interpreted as the probability that X and Y assume
values (x, y ) in a given range, i.e.,
    P[a ≤ X ≤ b and c ≤ Y ≤ d] = ∫_a^b ∫_c^d f_XY(x, y) dy dx

for a ≤ b, c ≤ d. The function fX ,Y is called the joint density function of


the random variable (X , Y ).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 173 / 355
Elements of Probability Theory Joint Distributions

Continuous Bivariate Random Variables

1.4.7. Definition. Let ((X , Y ), fXY ) be a continuous bivariate random


variable. We define the marginal density fX for X by
    f_X(x) = ∫_{−∞}^∞ f_XY(x, y) dy

and the marginal density f_Y for Y by

    f_Y(y) = ∫_{−∞}^∞ f_XY(x, y) dx

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 174 / 355
Elements of Probability Theory Joint Distributions

Independence
Two events A1 , A2 are independent if P[A1 ∩ A2 ] = P[A1 ] · P[A2 ]. This
motivates us to define the independence of two discrete random variables
through
P[X = x and Y = y ] = P[X = x] · P[Y = y ]
which works out to fXY (x, y ) = fX (x)fY (y ). This also generalizes to
continuous random variables.
1.4.8. Definition. Let ((X , Y ), fXY ) be a bivariate random variable with
marginal densities fX and fY . If

dom fXY = (dom fX ) × (dom fY )

and

fXY (x, y ) = fX (x)fY (y ) for all (x, y ) ∈ dom fXY

then (X , fX ) and (Y , fY ) are independent random variables.


Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 175 / 355
Elements of Probability Theory Joint Distributions

Independence

1.4.9. Example.

X = x/Y = y 0 1 2 3 fX (x)
0 0.840 0.030 0.020 0.010 0.900
1 0.060 0.010 0.008 0.002 0.080
2 0.010 0.005 0.004 0.001 0.020
fY (y ) 0.910 0.045 0.032 0.013 1.00

(X , fX ) and (Y , fY ) are not independent because

    f_XY(0, 0) = 0.840 ≠ f_X(0) · f_Y(0) = 0.900 · 0.910 = 0.819.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 176 / 355
Elements of Probability Theory Joint Distributions

Conditional Densities
Whenever we consider two random variables, it does not make sense to
speak of (X , fX ) and (Y , fY ) anymore, as the two random variables may
influence each other. We therefore always need to consider a joint density
fXY for the pair (X , Y ), i.e., a bivariate random variable. The “individual
densities” fX and fY are then just the marginal densities.
If Y = y is kept fixed, the density of the random variable (X , fX ) will be
some given function for this value of Y . It may be another function for
another value of Y .
The conditional probability for an event A1 given A2 was
P[A1 | A2] = P[A1 ∩ A2]/P[A2]. This motivates us to define the
conditional density of a discrete random variable X given Y = y through

P[X = x | Y = y] = P[X = x and Y = y]/P[Y = y] = fXY(x, y)/fY(y).

This will also generalize to continuous random variables.


Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 177 / 355
Elements of Probability Theory Joint Distributions

Conditional Densities

1.4.10. Definition. Let ((X, Y), fXY) be a bivariate random variable with
marginal densities fX and fY. Then
1. The conditional density for X given Y = y is defined as

   fX|y(x) = fXY(x, y) / fY(y)

   whenever fY(y) > 0.
2. The conditional density for Y given X = x is defined as

   fY|x(y) = fXY(x, y) / fX(x)

   whenever fX(x) > 0.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 178 / 355
Elements of Probability Theory Joint Distributions

Expectation for Discrete Bivariate Random Variables

1.4.11. Definition. Let ((X, Y), fXY) be a discrete bivariate random variable
and H: Ω → R some function on the range Ω of (X, Y). Then the expected value of H ◦ (X, Y) is

E[H ◦ (X, Y)] = ∑_{(x,y)∈Ω} H(x, y) · fXY(x, y),

provided that the sum (series) on the right converges absolutely. As
special cases, we consider H(x, y) = x and H(x, y) = y, giving

E[X] = ∑_{(x,y)∈Ω} x · fXY(x, y),    E[Y] = ∑_{(x,y)∈Ω} y · fXY(x, y).

Note that this implies E[X + Y] = E[X] + E[Y], as we have stated earlier.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 179 / 355
Elements of Probability Theory Joint Distributions

Expectation for Discrete Bivariate Random Variables


1.4.12. Example.

X = x/Y = y 0 1 2 3 fX (x)
0 0.840 0.030 0.020 0.010 0.900
1 0.060 0.010 0.008 0.002 0.080
2 0.010 0.005 0.004 0.001 0.020
fY (y ) 0.910 0.045 0.032 0.013 1.00

E[X] = ∑_{(x,y)∈Ω} x · fXY(x, y) = ∑_{x=0}^{2} ∑_{y=0}^{3} x · fXY(x, y) = 0.12

E[Y] = ∑_{(x,y)∈Ω} y · fXY(x, y) = ∑_{x=0}^{2} ∑_{y=0}^{3} y · fXY(x, y) = 0.148
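
A quick numerical check of these two sums (an illustrative Python sketch, not part of the original slides):

    import numpy as np

    f_xy = np.array([[0.840, 0.030, 0.020, 0.010],
                     [0.060, 0.010, 0.008, 0.002],
                     [0.010, 0.005, 0.004, 0.001]])
    x = np.arange(3)   # values of X
    y = np.arange(4)   # values of Y

    E_X = (x[:, None] * f_xy).sum()   # sum over all (x, y) of x * f_XY(x, y)
    E_Y = (y[None, :] * f_xy).sum()   # sum over all (x, y) of y * f_XY(x, y)
    print(E_X, E_Y)                   # 0.12 and 0.148 (up to floating-point rounding), as on the slide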

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 180 / 355
Elements of Probability Theory Joint Distributions

Expectation for Continuous Bivariate Random Variables

1.4.13. Definition. Let ((X, Y), fXY) be a continuous bivariate random
variable and H: R² → R some function. Then the expected value of
H ◦ (X, Y) is

E[H ◦ (X, Y)] = ∫∫_{R²} H(x, y) · fXY(x, y) dx dy,

provided that the integral on the right converges absolutely. As special
cases, we consider H(x, y) = x and H(x, y) = y, giving

E[X] = ∫∫_{R²} x · fXY(x, y) dx dy,    E[Y] = ∫∫_{R²} y · fXY(x, y) dx dy.

Again, we see that E[X + Y] = E[X] + E[Y].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 181 / 355
Elements of Probability Theory Joint Distributions

Covariance

Whenever we deal with bivariate random variables, we are interested not


only in their individual variances, but also in how their interplay influences
their joint deviation from the respective means.
1.4.14. Definition. Let ((X , Y ), fXY ) be a bivariate random variable with
means µX = E[X ] and µY = E[Y ]. Then the covariance of (X , Y ) is given
by
Cov(X , Y ) = σXY = E[(X − µX )(Y − µY )].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 182 / 355
Elements of Probability Theory Joint Distributions

Covariance

We can compute the covariance from the formula

Cov(X, Y) = E[XY] − E[X] E[Y].

1.4.15. Theorem. Let ((X , Y ), fXY ) be a bivariate random variable. If X


and Y are independent, then

Cov(X , Y ) = 0 or, equivalently, E[XY ] = E[X ] E[Y ].

The proof is straightforward and left as an exercise.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 183 / 355
Elements of Probability Theory Joint Distributions

Correlation
While the covariance tells us whether or not two random variables affect
each other, there is no quantitative statement contained in the covariance
as such. We often want to know, however, if there is a linear dependence
between X and Y . This can be determined from the Pearson coefficient of
correlation, defined as follows.

1.4.16. Definition. Let ((X , Y ), fXY ) be a bivariate random variable with


◮ means µX = E[X ], µY = E[Y ],
◮ variances σX2 = Var[X ] = E[(X − µX )2 ] ̸= 0, σY2 = Var[Y ] ̸= 0,
◮ covariance σXY = Cov(X , Y ) = E[(X − µX )(Y − µY )].
The correlation of (X , Y ) is then defined by

ρXY = Cov(X, Y) / √((Var X)(Var Y)) = σXY / (σX σY).
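
For the joint table used in the earlier examples, the covariance and correlation can be computed directly; the following lines are an illustrative Python sketch, not part of the original slides:

    import numpy as np

    f_xy = np.array([[0.840, 0.030, 0.020, 0.010],
                     [0.060, 0.010, 0.008, 0.002],
                     [0.010, 0.005, 0.004, 0.001]])
    x, y = np.arange(3), np.arange(4)

    E_X = (x[:, None] * f_xy).sum()
    E_Y = (y[None, :] * f_xy).sum()
    E_XY = (np.outer(x, y) * f_xy).sum()
    var_X = ((x - E_X)[:, None] ** 2 * f_xy).sum()
    var_Y = ((y - E_Y)[None, :] ** 2 * f_xy).sum()

    cov = E_XY - E_X * E_Y               # Cov(X, Y) = E[XY] - E[X] E[Y]
    rho = cov / np.sqrt(var_X * var_Y)   # Pearson correlation coefficient
    print(cov, rho)                      # both positive: X and Y are correlated (hence not independent)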

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 184 / 355
Elements of Probability Theory Joint Distributions

Correlation

If ρXY = 0, we say that X and Y are uncorrelated, otherwise they are


correlated. However, we can obtain a more useful result:
1.4.17. Theorem. Let ((X , Y ), fXY ) be a bivariate random variable.
1. The correlation coefficient satisfies −1 ≤ ρXY ≤ 1.
2. |ρXY | = 1 if and only if there exist numbers β0 , β1 ∈ R, β1 ̸= 0, such
that
Y = β0 + β1 X
almost surely.

1.4.18. Remark. Heuristically, the closer |ρXY | is to 1, the “more linear”


the relationship between X and Y is.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 185 / 355
Elements of Probability Theory Joint Distributions

Correlation
Proof.
We first show that −1 ≤ ρXY ≤ 1.
We consider two random variables W and Z such that E[Z 2 ], E[W 2 ] ̸= 0.
Note that (aW − Z)² ≥ 0 for any a ∈ R, so

0 ≤ E[(aW − Z)²] = a² E[W²] − 2a E[WZ] + E[Z²].

Now let a = E[WZ]/E[W²]. Then we obtain

−E[WZ]²/E[W²] + E[Z²] ≥ 0   ⇔   E[WZ]²/(E[W²] E[Z²]) ≤ 1.

Now let W = X − µX and Z = Y − µY. Then

E[(X − µX)(Y − µY)]² / (E[(X − µX)²] E[(Y − µY)²]) = ρ²XY ≤ 1.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 186 / 355
Elements of Probability Theory Joint Distributions

Correlation
Proof (continued).
Now let ρ²XY = 1. Then, reversing the steps above, we obtain

E[(X − µX)(Y − µY)]² / (E[(X − µX)²] E[(Y − µY)²]) = E[WZ]²/(E[W²] E[Z²]) = 1
⇔ −E[WZ]²/E[W²] + E[Z²] = 0
⇔ a² E[W²] − 2a E[WZ] + E[Z²] = E[(aW − Z)²] = 0.

Since (aW − Z)² ≥ 0, this implies (aW − Z)² = 0 and aW − Z = 0
almost surely. Re-substituting W = X − µX and Z = Y − µY, we obtain

Y = (µY − aµX) + aX

almost surely.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 187 / 355
Elements of Probability Theory Joint Distributions

Transformation of Variables

The following theorem allows us to perform transformations of random


variables and obtain the densities of the transformed variables.
1.4.19. Theorem. Let ((X , Y ), fXY ) be a continuous bivariate random
variable and let H : R2 → R2 be a differentiable bijective map with inverse
H −1 . Then (U, V ) = H ◦ (X , Y ) is a continuous bivariate random variable
with density

fUV (u, v ) = fXY ◦ H −1 (u, v ) · |det DH −1 (u, v )|,

where DH −1 is the Jacobian of H −1 .


This result basically uses knowledge from Calculus II, which you are
encouraged to review if necessary.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 188 / 355
Elements of Probability Theory Joint Distributions

First Midterm Exam

The preceding material completes our overview of probability theory; on


this basis we will commence with studying statistics.
The preceding material encompasses all of the material that will be the
subject of the First Midterm Exam. The exam will take place on Thursday,
the 12th of March during the usual lecture time.
Please bring a non-programmable calculator to the exam.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 189 / 355
An Introduction to Statistical Methods

Part II

An Introduction to Statistical Methods

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 190 / 355
An Introduction to Statistical Methods

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 191 / 355
An Introduction to Statistical Methods Descriptive statistics

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 192 / 355
An Introduction to Statistical Methods Descriptive statistics

Descriptive Statistics
Up to now we have discussed probability theory, i.e., the properties of
random variables and their distributions. It is clear that we have only
scratched the surface of probability theory and a lot more can be explored.
For example, the theory of stochastic processes and Markov chains have
not been discussed.
Now, however, instead of delving deeper into probability theory, we will
leave the “perfect world” of known random variables and distributions,
and enter the “real world” of statistics, which deals with incomplete
information. Statistical problems are characterized through
◮ a large group of objects about which inferences are to be made, called
a population,
◮ at least one random variable whose behavior is to be studied relative
to the population,
◮ a subset of the population, called a sample which is actually studied.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 193 / 355
An Introduction to Statistical Methods Descriptive statistics

Populations and Samples

2.1.1. Examples.
1. We may want to find the average gas mileage of cars in Shanghai.
Then the population is “all cars in Shanghai”, and we may pick a
sample of 100 cars. Here the random variable is the gas mileage, of
which we wish to obtain the mean.
2. We may want to find out whether a proposed new car model will have
a lower mean gas mileage than existing cars. In this case, the
population is “all cars of this model, existing now and produced in the
future” and a sample might consist of a trial production of 20
prototype cars. Again the random variable is the gas mileage and we
are interested in its expected value.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 194 / 355
An Introduction to Statistical Methods Descriptive statistics

Random Sampling

The first step in a statistical analysis is the selection of a random sample,


which consists of a set of n objects selected from the population such that
the selection of one object does not influence the subsequent selection of
any other.
The random variable X to be analyzed is then studied relative to each
object in the sample; effectively, one obtains n random variables
X1 , . . . , Xn . These random variables and their values are also referred to as
random samples.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 195 / 355
An Introduction to Statistical Methods Descriptive statistics

Random Sampling

However, we define:
2.1.2. Definition. A random sample of size n from the distribution of X is
a collection of n independent random variables X1 , . . . , Xn , each with the
same distribution as X . We say they are independent identically
distributed (i.i.d.) random variables.
In order to guarantee that the random variables in a random sample
are indeed independently distributed, the size of the random sample should
not exceed 5% of the population.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 196 / 355
An Introduction to Statistical Methods Descriptive statistics

Random Variables and Statistics

A statistic is generally speaking a random variable whose numerical values


can be determined from a random sample.
2.1.3. Examples.
1. ∑_{k=1}^{n} Xk,
2. ∑_{k=1}^{n} Xk²,
3. ∑_{k=1}^{n} Xk / n,
4. min{Xk : k = 1, . . . , n}, max{Xk : k = 1, . . . , n}.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 197 / 355
An Introduction to Statistical Methods Descriptive statistics

Random Variables and Statistics

Determining the distribution of a random variable X of a population from


a sample of the population is in general quite difficult. If the random variable is
discrete, it may often be surmised from the physical description of the
experiment.
If X is continuous, a guess can be made from the shape of the distribution
of the random sample: if it appears flat, X may be uniformly distributed;
if it is bell-shaped, X may be normally distributed; and if it appears skewed,
X may have a certain gamma (exponential, χ2 ) distribution.
Since the random sample is just a set of numbers, numerous techniques
have been developed to aid the visualization of its shape.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 198 / 355
An Introduction to Statistical Methods Descriptive statistics

Stem-and-Leaf Diagrams

A stem-and-leaf diagram is a very rough way to get an idea of the shape


of the distribution of a random sample, while preserving some of its
numeric information. It consists of labeled rows of numbers, where the
label is called the stem and the other numbers are called leaves. This idea
was introduced by Tukey in 1977.
In order to construct a stem-and-leaf diagram from a random sample (a
set of values of random variables), follow these steps:
1. Choose some convenient numbers to serve as stems,
2. label the rows using the stems,
3. for each datum of the random sample, note down the digit following
the stem in the corresponding row,
4. turn the graph on its side to get an idea of its distribution.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 199 / 355
An Introduction to Statistical Methods Descriptive statistics

Stem-and-Leaf Diagrams with Mathematica


As an example, we consider the random sample

4285  564 1278  205 3920
2066  604  209  602 1379
2584   14  349 3770   99
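
The original slide shows the corresponding Mathematica output, which is not reproduced here. As a rough illustrative sketch (Python is assumed, with stems chosen as the thousands digit and leaves as the hundreds digit, one possible "convenient" choice for this data set):

    from collections import defaultdict

    data = [4285, 564, 1278, 205, 3920, 2066, 604, 209, 602, 1379,
            2584, 14, 349, 3770, 99]

    stems = defaultdict(list)
    for d in sorted(data):
        stems[d // 1000].append((d // 100) % 10)   # stem: thousands digit, leaf: hundreds digit

    for stem in sorted(stems):
        print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))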

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 200 / 355
An Introduction to Statistical Methods Descriptive statistics

Multiple Stem-and-Leaf Diagrams with Mathematica


Sometimes it is useful to further subdivide the stems. For example, we
might wish to have separate leaves for digits in the ranges 0-4 and 5-9.
This is called a double stem-and-leaf plot.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 201 / 355
An Introduction to Statistical Methods Descriptive statistics

Sample Range

For a random sample X1 , . . . , Xn , the sample range is simply

max_{1≤k≤n} Xk − min_{1≤k≤n} Xk.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 202 / 355
An Introduction to Statistical Methods Descriptive statistics

Histograms

A histogram is a vertical or horizontal bar graph, of the type often


appearing in news media and presentations. There are four main
properties that a histogram should have:
◮ The number of categories should be suitable for the amount of data;
a suggested guideline is based on Sturges’s rule (1926),
◮ each datum should fall into exactly one category,
◮ the categories should have the same width,
◮ no datum should assume a boundary value.
An algorithm for selecting categories that implement these points is given
in the textbook. In practice, you will use computer software that will
create histograms from data sets and it will often be advisable to modify
the default setting to conform to this algorithm.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 203 / 355
An Introduction to Statistical Methods Descriptive statistics

Histograms
Steps in creating a histogram for numerical data:
1. Find the desired number of categories:
Data Set Size          # Categories
< 16                   Insufficient data
2^(n−1) to 2^n − 1     n

2. Calculate data (sample) range: (largest datum) - (smallest datum).


3. Divide data range by number of categories; round up to the accuracy
of the data, or add a smallest decimal unit if already at accuracy of
data. This is the category length.
4. The lower boundary for the first category lies 1/2 smallest decimal
unit below the smallest datum.
5. The remaining boundaries are found by adding the category length to
the preceding boundary value.
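
The steps above can be turned into a short function. The sketch below is illustrative only and is not from the original slides (Python is assumed, and the handling of the data's accuracy via the `decimals` argument is an assumption of this sketch):

    import math

    def histogram_boundaries(data, decimals=0):
        """Category boundaries following the steps above; `decimals` is the accuracy of the data."""
        n = len(data)
        if n < 16:
            raise ValueError("insufficient data")
        k = math.floor(math.log2(n)) + 1     # 2**(k-1) <= n <= 2**k - 1  gives k categories
        unit = 10.0 ** (-decimals)           # smallest decimal unit of the data
        data_range = max(data) - min(data)
        # round the category length up to the accuracy of the data; if it is already
        # at that accuracy, one smallest decimal unit is added
        width = (math.floor(data_range / k / unit + 1e-9) + 1) * unit
        lower = min(data) - unit / 2         # first boundary: half a unit below the smallest datum
        return [lower + i * width for i in range(k + 1)]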

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 204 / 355
An Introduction to Statistical Methods Descriptive statistics

Histograms with Mathematica (default settings)

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 205 / 355
An Introduction to Statistical Methods Descriptive statistics

Histograms with Mathematica (determining categories)

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 206 / 355
An Introduction to Statistical Methods Descriptive statistics

Histograms with Mathematica (adjusted settings)

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 207 / 355
An Introduction to Statistical Methods Descriptive statistics

Ogives

An ogive is also known as a cumulative frequency plot. The abscissa


shows the boundaries of the categories of the histogram, while the ordinate
gives the number of data in the corresponding category and all
“preceding” categories. We use a relative frequency ogive, so we divide by
the total number of data. This graph will approximate the cumulative
distribution function F for a continuously distributed random variable.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 208 / 355
An Introduction to Statistical Methods Descriptive statistics

Ogives with Mathematica

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 209 / 355
An Introduction to Statistical Methods Descriptive statistics

Sample Statistics - Location

2.1.4. Definition. Let X1 , . . . , Xn be a random sample from the distribution


of a random variable X . The statistic

X = (1/n) ∑_{i=1}^{n} Xi

is called the sample mean.


Note that X ̸= E[X ]. While E[X ] is the actual mean of X , X depends on
the chosen random sample and may at best approximate E[X ].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 210 / 355
An Introduction to Statistical Methods Descriptive statistics

Sample Statistics - Location


Another measure for the location of a distribution of a random variable X
is the median M, defined by

P[X ≤ M] = 0.50.

For a continuous distribution, the median is the “halfway point”, i.e., an


observation of X is just as likely to fall below it as above it.
For a random sample X1, . . . , Xn, the sample median is defined in the
following way: arrange (re-index) the values x1, . . . , xn in such a way that
xk ≤ xk+1, k = 1, . . . , n − 1. Then

x̃ = (x_{n/2} + x_{n/2+1})/2   if n is even,
x̃ = x_{(n+1)/2}               if n is odd.

This sample median may differ significantly from the sample mean!
We define the median location by the number (n + 1)/2.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 211 / 355
An Introduction to Statistical Methods Descriptive statistics

Mean and Median with Mathematica

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 212 / 355
An Introduction to Statistical Methods Descriptive statistics

Sample Statistics - Variability


The main measure of variability in a random variable is the variance, and
we can define the variance of a sample analogously to that of a discrete
random variable, replacing the expectation value by the sample mean,

S² = (1/n) ∑_{k=1}^{n} (Xk − X)².

However, it can be shown that this formula (which is in use in certain


calculators and computer systems) underestimates the variance σ 2 ;
therefore, we set instead

S² = (1/(n−1)) ∑_{k=1}^{n} (Xk − X)²,

defining the sample variance in this way and the sample standard deviation
S = √(S²).
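
As a practical aside (illustrative, not from the slides): many software defaults divide by n, so the n − 1 divisor has to be requested explicitly. In Python/NumPy, for instance, using the thermal-conductivity measurements that appear later in Example 2.2.14:

    import numpy as np

    x = np.array([41.60, 41.48, 42.34, 41.95, 41.86, 42.18, 41.72, 42.26, 41.81, 42.04])

    print(x.var())         # divides by n: the version that underestimates sigma^2
    print(x.var(ddof=1))   # divides by n - 1: the sample variance S^2
    print(x.std(ddof=1))   # sample standard deviation S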
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 213 / 355
An Introduction to Statistical Methods Descriptive statistics

Variance and Standard Deviation with Mathematica

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 214 / 355
An Introduction to Statistical Methods Descriptive statistics

Rounding of Statistics

We round the statistics in the following ways (rounding instead of


truncating):
◮ For the mean we give one more decimal place than the original data
has.
◮ For the variance we give two more decimal places than the original
data has.
◮ For the standard deviation we give one more decimal place than the
original data has.
◮ The range and median are not rounded.
A final warning: when calculating the above statistics, always make sure
that you are actually studying a sample of a population, not the
population as a whole!

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 215 / 355
An Introduction to Statistical Methods Descriptive statistics

Boxplots - Quartiles
Boxplots are a very useful way of visualizing data. In order to construct a
boxplot, we first need to determine the quartiles q1 and q3 and the
interquartile range iqr = q3 − q1 . The quartiles play a similar role to that
of the median (which would be the quartile denoted q2 ) in that of a
random sample ordered from smallest to largest, 25% would lie below q1
and 75% would lie below q3 . The precise construction of q1 and q3 varies
(one algorithm is given in the textbook).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 216 / 355
An Introduction to Statistical Methods Descriptive statistics

Construction of Boxplots
The construction of a boxplot will be demonstrated on the blackboard.
We need the following data:
◮ q1, x̃, q3 and iqr.
◮ Inner fences

  f1 = q1 − (3/2) iqr,   f3 = q3 + (3/2) iqr.

◮ Adjacent values

  a1 = min{xk : xk ≥ f1},   a3 = max{xk : xk ≤ f3}.

◮ Outer fences

  F1 = q1 − 3 iqr,   F3 = q3 + 3 iqr.
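
These quantities are easy to compute. The lines below are an illustrative Python sketch (not from the slides), using one common convention for the quartiles — the slide notes that the precise construction of q1 and q3 varies — and the random sample from the stem-and-leaf example:

    import numpy as np

    x = np.sort(np.array([4285, 564, 1278, 205, 3920, 2066, 604, 209, 602, 1379,
                          2584, 14, 349, 3770, 99]))

    q1, q3 = np.percentile(x, [25, 75])        # one common quartile convention
    iqr = q3 - q1
    f1, f3 = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # inner fences
    F1, F3 = q1 - 3.0 * iqr, q3 + 3.0 * iqr    # outer fences
    a1, a3 = x[x >= f1].min(), x[x <= f3].max()               # adjacent values
    near = x[((x < f1) | (x > f3)) & (x >= F1) & (x <= F3)]   # near outliers
    far = x[(x < F1) | (x > F3)]                              # far outliers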

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 217 / 355
An Introduction to Statistical Methods Descriptive statistics

Boxplots with Mathematica

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 218 / 355
An Introduction to Statistical Methods Descriptive statistics

Outliers

Data points lying between the inner and outer fences are called near
outliers, those lying outside the outer fences are called far outliers. Far
outliers are unusual if (and only if!) an approximately bell-shaped
distribution of the random variable X of the population is expected. In
this case, their origin should be investigated.
◮ If the outlier seems to be the result of an error in measurement or
data collecting, it may be discarded from the data.
◮ If the outlier seems to be the result of a random measurement, it is
recommended that statistics are reported twice: with the outlier
included and without the outlier.
◮ As a rule of thumb: Of 1000 random samples of a normally
distributed population, it can be expected that 7 will be outliers.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 219 / 355
An Introduction to Statistical Methods Estimation

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 220 / 355
An Introduction to Statistical Methods Estimation

Estimation
In the last section we have seen how data obtained from a random sample
can be used to obtain information on a population; in particular a statistic
(such as the sample mean) would approximate a population quantity (such as the
mean). However, no precise information on the “quality” of the
approximation was given, and one formula (for the sample variance)
remained obscure and counterintuitive.
The process of using statistics to approximate random variables is called
estimation. We now aim to provide a mathematical framework for this
process. Note that the language of statistics differs slightly from that of
probability theory; instead of a random variable X of a population, we
refer to a more general population parameter θ such as the mean or
standard deviation. But a population parameter can also be the parameter
of a certain distribution (such as λ of the Poisson distribution). We have
previously seen that functions of random samples X1 , . . . , Xn (which in
probability theory are random variables themselves) are called statistics.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 221 / 355
An Introduction to Statistical Methods Estimation

Estimators

An estimator for a population parameter θ is a statistic and denoted by θ̂.
Any given value of θ̂ is called an estimate. (More precisely, we refer to
point estimators and point estimates.) We would like an estimator to have
the following properties:
◮ The expected value of θ̂ should be equal to θ,
◮ θ̂ should have small variance for large sample sizes.

2.2.1. Definition. The difference θ − E[θ̂] is called the bias of an estimator
θ̂ for a population parameter θ. We say that θ̂ is unbiased if E[θ̂] = θ.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 222 / 355
An Introduction to Statistical Methods Estimation

Estimators

The quality of an estimator is measured by its mean square error, defined as

MSE(θ̂) := E[(θ̂ − θ)²].

This can be rewritten as

MSE(θ̂) = E[(θ̂ − E[θ̂])²] + (θ − E[θ̂])² = Var θ̂ + (bias)².

Hence variance can be just as important as bias for an estimator (see
blackboard). In general, we will prefer to have an unbiased estimator, but
sometimes biased estimation is used (e.g., in multiple regression).
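
A small simulation makes the bias/variance trade-off concrete. The sketch below (illustrative only, assuming Python/NumPy) compares the two variance estimators from the previous section on repeated normal samples:

    import numpy as np

    rng = np.random.default_rng(0)
    sigma2, n, reps = 4.0, 10, 100_000

    samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    s2_biased = samples.var(axis=1, ddof=0)     # divides by n
    s2_unbiased = samples.var(axis=1, ddof=1)   # divides by n - 1

    print(s2_biased.mean(), s2_unbiased.mean())   # about 3.6 vs about 4.0: the first underestimates sigma^2
    mse = lambda est: np.mean((est - sigma2) ** 2)   # MSE = Var + bias^2
    print(mse(s2_biased), mse(s2_unbiased))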

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 223 / 355
An Introduction to Statistical Methods Estimation

Sample Mean

2.2.2. Theorem. Let X1 , . . . , Xn be a random sample of size n from a


distribution with mean µ. The sample mean X is an unbiased estimator
for µ.

Proof.
We simply insert the definition of the sample mean and use the properties
of the expectation:
E[X] = E[(X1 + · · · + Xn)/n] = (1/n) E[X1 + · · · + Xn]
     = (1/n)(E[X1] + · · · + E[Xn]) = nµ/n = µ.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 224 / 355
An Introduction to Statistical Methods Estimation

Sample Variance
2.2.3. Theorem. Let X be the sample mean of a random sample of size n
from a distribution with mean µ and variance σ 2 . Then
Var X = E[(X − µ)²] = σ²/n.
Proof.
We simply insert the definition of the sample mean and use the properties
of the variance:
Var X = Var((X1 + · · · + Xn)/n) = (1/n²) Var(X1 + · · · + Xn)
      = (1/n²)(Var X1 + · · · + Var Xn) = nσ²/n² = σ²/n.

Thus X is both unbiased and has a variance that decreases with large n; it
is a “nice” estimator, since we can make the mean square error MSE X as
small as desired by taking n large enough.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 225 / 355
An Introduction to Statistical Methods Estimation

Standard Error of the Mean and Sample Variance

2.2.4. Definition. The standard deviation of X is given by √(Var X) = σ/√n
and is called the standard error of the mean.

2.2.5. Theorem. The sample variance

S² = (1/(n−1)) ∑_{k=1}^{n} (Xk − X)²

is an unbiased estimator for σ².

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 226 / 355
An Introduction to Statistical Methods Estimation

Sample Variance

Proof.
We simply calculate E[S²]:

E[S²] = (1/(n−1)) E[ ∑_{k=1}^{n} (Xk − X)² ]
      = (1/(n−1)) E[ ∑_{k=1}^{n} (Xk − µ + µ − X)² ]
      = (1/(n−1)) E[ ∑_{k=1}^{n} (Xk − µ)² − 2(X − µ) ∑_{k=1}^{n} (Xk − µ) + ∑_{k=1}^{n} (µ − X)² ]
      = (1/(n−1)) E[ ∑_{k=1}^{n} (Xk − µ)² − 2(X − µ)( ∑_{k=1}^{n} Xk − nµ ) + n(µ − X)² ].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 227 / 355
An Introduction to Statistical Methods Estimation

Sample Variance
Proof (continued).
Note that ∑_{k=1}^{n} Xk = nX, so

E[S²] = (1/(n−1)) E[ ∑_{k=1}^{n} (Xk − µ)² − 2(X − µ)( ∑_{k=1}^{n} Xk − nµ ) + n(µ − X)² ]
      = (1/(n−1)) E[ ∑_{k=1}^{n} (Xk − µ)² − 2(X − µ)(nX − nµ) + n(µ − X)² ]
      = (1/(n−1)) E[ ∑_{k=1}^{n} (Xk − µ)² − n(X − µ)² ]
      = (1/(n−1)) ( ∑_{k=1}^{n} E[(Xk − µ)²] − n E[(X − µ)²] ).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 228 / 355
An Introduction to Statistical Methods Estimation

Sample Variance

Proof (continued).
Since Var Xk = σ 2 = E[(Xk − µ)2 ] for each k = 1, . . . , n, and
E[(X − µ)2 ] = σ 2 /n by Theorem 2.2.3, we have
E[S²] = (1/(n−1)) ( ∑_{k=1}^{n} E[(Xk − µ)²] − n E[(X − µ)²] )
      = (1/(n−1)) ( ∑_{k=1}^{n} σ² − n · σ²/n )
      = (1/(n−1)) (nσ² − σ²) = σ².

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 229 / 355
An Introduction to Statistical Methods Estimation

Finding Estimators - Method of Moments

The method of moments goes back to Karl Pearson in 1894 and uses the
basic fact that unbiased estimators Mk for the kth moments E [X k ] of a
distribution are
Mk = (1/n) ∑_{i=1}^{n} Xi^k,

given a random sample X1 , . . . , Xn .


The idea is then that population parameters θj can often be expressed in
terms of the moments of the distribution. Replacing the moments in these
expressions by their estimators then yields estimators for the parameters θj .
Note: Estimators obtained in this way are not necessarily unbiased!

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 230 / 355
An Introduction to Statistical Methods Estimation

Finding Estimators - Method of Moments

2.2.6. Example. Let X1 , . . . , Xn be a random sample from a gamma


distribution with parameters α and β. We know that

E[X] = αβ,    Var X = E[X²] − E[X]² = αβ².

Replacing the moments with M1 and M2, we obtain

M1 = α̂β̂,    M2 − M1² = α̂β̂².

This gives first M2 − M1² = M1 β̂ and then

β̂ = (M2 − M1²)/M1,    α̂ = M1/β̂ = M1²/(M2 − M1²).
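
An illustrative check of these estimators on simulated data (a Python sketch, not part of the original slides; NumPy's gamma generator is parameterised by shape α and scale β, matching the convention E[X] = αβ used above):

    import numpy as np

    def gamma_moment_estimates(x):
        x = np.asarray(x)
        m1 = x.mean()            # first sample moment M1
        m2 = (x ** 2).mean()     # second sample moment M2
        beta_hat = (m2 - m1 ** 2) / m1
        alpha_hat = m1 ** 2 / (m2 - m1 ** 2)
        return alpha_hat, beta_hat

    rng = np.random.default_rng(1)
    x = rng.gamma(shape=2.0, scale=3.0, size=10_000)
    print(gamma_moment_estimates(x))   # should be close to (2, 3) for a sample this large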

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 231 / 355
An Introduction to Statistical Methods Estimation

Finding Estimators - Method of Maximum Likelihood


Maximum likelihood estimators can be traced back to Carl Friedrich Gauß, who
used them more than 170 years ago on isolated problems. They are based
on the idea that given a set of observations x1 , . . . xn , one finds the value
of the population parameter most likely to have produced these
observations. In other words, we express the probability of obtaining
x1 , . . . xn as a function of the parameter and then find the value of θ that
maximizes this probability. We proceed as follows:
1. Obtain a random sample x1 , . . . , xn from the distribution of a random
variable X with density f and associated parameter θ.
2. Define the likelihood function L by

   L(θ) = ∏_{i=1}^{n} f(xi).

3. Obtain the estimator θ̂(x1, . . . , xn) for θ from the condition
   L(θ̂) = max.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 232 / 355
An Introduction to Statistical Methods Estimation

Finding Estimators - Method of Maximum Likelihood

2.2.7. Example. Water samples of a specific size are taken from a river
suspected of having been polluted by improper treatment procedures at an
upstream sewage-disposal plant. Let X denote the number of coliform
organism found per sample, and assume that X is a Poisson random
variable with parameter k. Let x1 , . . . , xn be a random sample from the
distribution of X . We want to determine the value of k that gives the
highest probability of observing this sample.
Since random sampling implies independence,

P[X1 = x1 and X2 = x2 and . . . and Xn = xn] = ∏_{j=1}^{n} P[Xj = xj].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 233 / 355
An Introduction to Statistical Methods Estimation

Finding Estimators - Method of Maximum Likelihood


The density for X is given by P[X = x] = f(x) = e^{−k} k^x / x!, x ∈ N, so

∏_{j=1}^{n} P[Xj = xj] = e^{−nk} k^{∑_j xj} / ∏_j xj! =: L(k).

L is called the likelihood function for k. We want to find the value of k


that maximizes L. To simplify our calculations, we take the logarithm of
the above expression:
ln L(k) = −nk + (∑_{j=1}^{n} xj) ln k − ln ∏_{j=1}^{n} xj!.

Maximizing ln L(k) will also maximize L(k), so we take the first derivative
and set it equal to zero:

d ln L(k)/dk = −n + (1/k) ∑_{j=1}^{n} xj = 0   ⇔   k = x.
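
A numerical illustration (a Python sketch with simulated counts, not part of the original slides): the closed-form estimate k̂ = x̄ can be checked against a brute-force maximisation of the log-likelihood.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.poisson(lam=4.5, size=200)   # simulated coliform counts, purely for illustration

    k_hat = x.mean()                     # maximum likelihood estimate: k_hat = x_bar

    def log_likelihood(k):               # constant term -sum(log x_j!) omitted
        return -len(x) * k + x.sum() * np.log(k)

    ks = np.linspace(0.5, 10.0, 2000)
    print(k_hat, ks[np.argmax(log_likelihood(ks))])   # the two values agree closely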

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 234 / 355
An Introduction to Statistical Methods Estimation

Distribution of the Sample Mean


We now want to analyze the distribution of the sample mean - this will be
the first step in obtaining information on how well the sample mean
approximates the population mean.
Our main tool will be the moment generating functions. We thus first
establish some basic properties of them.
Uniqueness Theorem (false as stated, but convenient): Let X and Y be two random
variables with moment-generating functions mX and mY, respectively. If
mX = mY in some neighborhood of 0, then X = Y (more precisely, X and Y then have the same distribution).

2.2.8. Theorem. Let X1 and X2 be two random variables with


moment-generating functions mX1 and mX2 , respectively. If Y = X1 + X2 ,
then
mY = mX1 mX2 .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 235 / 355
An Introduction to Statistical Methods Estimation

Distribution of the Sample Mean

2.2.9. Theorem. Let X be a random variable with moment-generating


function mX . Let Y = α + βX . Then

mY (t) = e αt mX (βt).

These results can be used to obtain (homework!) the following

2.2.10. Theorem. Let X1 , . . . , Xn be a random sample of size n from a


normal distribution with mean µ and variance σ 2 . Then X is normally
distributed with mean µ and variance σ 2 /n.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 236 / 355
An Introduction to Statistical Methods Estimation

Confidence Intervals
2.2.11. Definition. Let 0 ≤ α ≤ 1. A 100(1 − α)% confidence interval for a
parameter θ is an interval [L1 , L2 ] such that

P[L1 ≤ θ ≤ L2 ] = 1 − α.

Since L1 and L2 are random variables, we often call [L1 , L2 ] a random


interval. Note that the population parameter θ is not a random variable,
but constant.
If one of L1 and L2 is not a random variable, but a fixed number (such as
0, ∞ or −∞), then we speak of a one-sided confidence interval.
Initially, we are most often interested in centered confidence intervals with
L1 = θ̂ − L and L2 = θ̂ + L, where L is a sample statistic and the interval
is centered on θ̂, a point estimate for θ.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 237 / 355
An Introduction to Statistical Methods Estimation

Interval Estimation

2.2.12. Notation. We will often denote an interval of the form


[x − ε, x + ε] for x ∈ R, ε > 0 by x ± ε. In fact, we define

y = x ± ε   :⇔   y ∈ [x − ε, x + ε].

We would like to make statements such as “based on the results of a


sample, we are 90% certain that the mean of a population lies in X ± L.”
This is known as interval estimation.
The simplest case is when we are looking for a confidence interval for the
mean of a normal population where we already know its variance.
Although this case rarely occurs in applications, understanding it is a first
step to more complicated (and realistic) scenarios.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 238 / 355
An Introduction to Statistical Methods Estimation

Interval Estimation

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 239 / 355
An Introduction to Statistical Methods Estimation

Interval Estimation for the Mean (Variance Known)

Assume that we have a random sample of size n from a normal population


with unknown mean µ and known variance σ 2 . Our sample yields a point
estimate X for µ. We are now interested in finding L = L(α) such that we
can state with 100(1 − α)% confidence that µ = X ± L.

We first define zα/2 for α ∈ [0, 1] by

α/2 = P[Z ≥ zα/2] = (1/√(2π)) ∫_{zα/2}^{∞} e^{−x²/2} dx,

where Z is a standard normal variable.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 240 / 355
An Introduction to Statistical Methods Estimation

Interval Estimation for the Mean (Variance Known)


Fix α ∈ [0, 1]. Then

1 − α = P[X − L ≤ µ ≤ X + L] = P[(X − µ − L)/(σ/√n) ≤ 0 ≤ (X − µ + L)/(σ/√n)].

By Theorem 2.2.10 the sample mean is normally distributed with mean µ
and variance σ²/n. Thus,

Z = (X − µ)/(σ/√n)

follows a standard normal distribution, and so

1 − α = P[Z − L/(σ/√n) ≤ 0 ≤ Z + L/(σ/√n)]
      = P[−L/(σ/√n) ≤ Z ≤ L/(σ/√n)]
      = 2 P[0 ≤ Z ≤ L/(σ/√n)] = 1 − 2 P[L/(σ/√n) ≤ Z < ∞].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 241 / 355
An Introduction to Statistical Methods Estimation

Interval Estimation for the Mean (Variance Known)


Thus we determine L as being the number such that
· ¸
L
P √ ≤ Z ≤ ∞ = α/2.
σ/ n
But this means that
L zα/2 · σ
√ = zα/2 ⇔ L= √ .
σ/ n n
We have thus proved the following result:
2.2.13. Theorem. Let X1 , . . . , Xn be a random sample of size n from a
normal distribution with mean µ and variance σ 2 . A 100(1 − α)%
confidence interval on µ is given by
zα/2 · σ
X± √ .
n

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 242 / 355
An Introduction to Statistical Methods Estimation

Interval Estimation
2.2.14. Example. An article in the Journal of Heat Transfer describes a
method of measuring the thermal conductivity of Armco iron. Using a
temperature of 100◦ F and a power input of 550 W, the following 10
measurements of thermal conductivity (in Btu /(hr ft ◦ F)) were obtained:
41.60 41.48 42.34 41.95 41.86
42.18 41.72 42.26 41.81 42.04
A point estimate of the mean thermal conductivity at 100◦ F and 550 W is
the sample mean,

x = 41.924 Btu /(hr ft ◦ F).

Suppose we know that the standard deviation of the thermal conductivity
under the given conditions is σ = 0.10 Btu /(hr ft ◦ F). A 95% confidence
interval (α = 0.05) on the mean is then given by

x ± z0.025 · σ/√n = 41.924 ± 0.062 = [41.862, 41.986].
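
The same numbers can be reproduced with a few lines of code (an illustrative Python/SciPy sketch, not part of the original slides):

    import numpy as np
    from scipy.stats import norm

    x = np.array([41.60, 41.48, 42.34, 41.95, 41.86, 42.18, 41.72, 42.26, 41.81, 42.04])
    sigma, alpha = 0.10, 0.05

    z = norm.ppf(1 - alpha / 2)              # z_{alpha/2}, about 1.96
    half = z * sigma / np.sqrt(len(x))
    print(x.mean() - half, x.mean() + half)  # about [41.862, 41.986], as above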

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 243 / 355
An Introduction to Statistical Methods Estimation

Central Limit Theorem

While up to now most assertions we have made have been proven (more or
less rigorously), many important results in statistics are highly non-trivial
and require an inordinate (for this course) amount of effort to prove. One
of the first of these is the Central Limit Theorem, which we now cite:
2.2.15. Theorem. Let X1 , . . . , Xn be a sequence of independent random
variables with arbitrary distributions, means E[Xj ] = µj and variances
Var Xj = σj² (all finite). Let Y = X1 + · · · + Xn. Then under some general
conditions

Zn = (Y − ∑_j µj) / √(∑_j σj²)

is approximately standard-normally distributed as n becomes large.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 244 / 355
An Introduction to Statistical Methods Estimation

Central Limit Theorem

If the random variables X1 , . . . , Xn all follow the same distribution, one


obtains the following special case:
2.2.16. Theorem. Let X1 , . . . , Xn be a random sample of size n from an
arbitrary distribution with mean µ and variance σ 2 . Let
Y = X1 + · · · + Xn = nX . Then under some general conditions, for large n,
X is approximately normal with mean µ and variance σ 2 /n. Furthermore,

Zn = (Y − nµ)/(σ√n) = (X − µ)/(σ/√n)

is approximately standard normal.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 245 / 355
An Introduction to Statistical Methods Estimation

Central Limit Theorem


2.2.17. Example. In a construction project, a network of major activities
has been constructed to serve as a basis for planning and scheduling. On a
critical path there are 16 activities.
Activity Mean (weeks) Variance Activity Mean (weeks) Variance
1 2.7 1.0 9 3.1 1.2
2 3.2 1.3 10 4.2 0.8
3 4.6 1.0 11 3.6 1.6
4 2.1 1.2 12 0.5 0.2
5 3.6 0.8 13 2.1 0.6
6 5.2 2.1 14 1.5 0.7
7 7.1 1.9 15 1.2 0.4
8 1.5 0.5 16 2.8 0.7
The activity times may be considered independent and the project time Y
is the sum of the individual activity times Xj on the critical path, i.e.,
Y = X1 + X2 + · · · + X16 .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 246 / 355
An Introduction to Statistical Methods Estimation

Central Limit Theorem


The contractor would like to know
1. the expected completion time and
2. a project time y0 corresponding to a probability of 0.90 of having the
project completed.

We calculate µY = ∑_j µXj = 49 weeks and σY² = ∑_j σXj² = 16 weeks².
Hence the expected completion time is 49 weeks.
Using the central limit theorem, we can use the normal distribution to find
an approximate value for y0:

P[Y ≤ y0] = P[Z ≤ (y0 − 49 weeks)/(4 weeks)] = 0.9

gives

(y0 − 49 weeks)/(4 weeks) = 1.282,   or   y0 = 54.128 weeks.
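
Equivalently (an illustrative SciPy sketch, not part of the original slides), y0 is the 0.90 quantile of a normal distribution with the computed mean and standard deviation:

    from scipy.stats import norm

    mu_Y, sigma_Y = 49.0, 4.0                    # sum of the means; square root of the summed variances (weeks)
    y0 = norm.ppf(0.90, loc=mu_Y, scale=sigma_Y)
    print(y0)                                    # about 54.1 weeks, matching the calculation above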

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 247 / 355
An Introduction to Statistical Methods Estimation

Central Limit Theorem

How large must n be for the central limit theorem to give a good
approximation?
This depends on how “well-behaved” the distributions of the variables Xj
are:
1. Well-behaved (nearly symmetric densities that look close to that of a
normal distribution): n ≥ 4.
2. Reasonably behaved (no prominent mode, densities look like uniform
densities): n ≥ 12.
3. Ill-behaved (much of the weight of the densities is in the tails,
irregular appearance): n ≥ 100.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 248 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 249 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Preliminary Theory

Let X1 , . . . , Xn be i.i.d. random variables following a normal distribution


with mean µ and variance σ 2 . We define the random variable
χn = (1/σ) √( ∑_{k=1}^{n} (Xk − µ)² ),

and we are interested in its density function. We will consider the
cumulative distribution function Fχn,

Fχn (y ) = P[χn ≤ y ].

Clearly, Fχn (y ) = 0 for y < 0.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 250 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Preliminary Theory
For y > 0,

Fχn(y) = P[χn ≤ y] = P[(1/σ) √( ∑_{k=1}^{n} (Xk − µ)² ) < y]
       = P[∑_{k=1}^{n} ((Xk − µ)/σ)² < y²] = P[∑_{k=1}^{n} Zk² < y²],

where the variables Zk = (Xk − µ)/σ, k = 1, . . . , n, follow a standard normal
distribution.
The joint density of n independent standard normal random variables is the
product of their individual densities, so we have

Fχn(y) = ∫_{∑_{k=1}^{n} zk² < y²} (2π)^{−n/2} e^{−∑_{k=1}^{n} zk²/2} dz1 . . . dzn.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 251 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Preliminary Theory

We recall that on Rn \ {0} we can introduce polar coordinates


(r , θ1 , . . . , θn−1 ) by

x1 = r sin θ1
x2 = r cos θ1 sin θ2
x3 = r cos θ1 cos θ2 sin θ3
..
.
xn−1 = r cos θ1 cos θ2 . . . cos θn−2 sin θn−1
xn = r cos θ1 cos θ2 . . . cos θn−2 cos θn−1

Here r > 0 and −π/2 < θk < π/2 for k = 1, . . . , n − 2, and 0 < θn−1 < 2π.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 252 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Preliminary Theory
The integral becomes

Fχn(y) = (2π)^{−n/2} ∫_0^{2π} ∫_{−π/2}^{π/2} · · · ∫_{−π/2}^{π/2} ∫_0^y e^{−r²/2} r^{n−1}
         × D(θ1, . . . , θn−1) dr dθ1 . . . dθn−2 dθn−1,

where D(θ1, . . . , θn−1) is the modulus of the determinant of the Jacobian
of the transformation (θ1, . . . , θn−1) → x/r. Writing

Cn = (2π)^{−n/2} ∫_0^{2π} ∫_{−π/2}^{π/2} · · · ∫_{−π/2}^{π/2} D(θ1, . . . , θn−1) dθ1 . . . dθn−2 dθn−1,

we have

Fχn(y) = Cn ∫_0^y r^{n−1} e^{−r²/2} dr.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 253 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Preliminary Theory

We determine Cn from

1 = lim_{y→∞} Fχn(y) = Cn ∫_0^∞ r^{n−1} e^{−r²/2} dr = Cn Γ(n/2) 2^{n/2−1}.

Hence

Fχn(y) = (1/(Γ(n/2) 2^{n/2−1})) ∫_0^y r^{n−1} e^{−r²/2} dr,

and the density of χn is given by

fχn(y) = F′χn(y) = (2/(2^{n/2} Γ(n/2))) y^{n−1} e^{−y²/2}.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 254 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

The χ2 -Distribution
Next we consider the random variable

χn² = (1/σ²) ∑_{k=1}^{n} (Xk − µ)²,                    (2.3.1)

where again X1, . . . , Xn are i.i.d. random variables following a normal
distribution with mean µ and variance σ².
From our formula for the transformation of variables, we obtain

fχn²(z) = (2/(2^{n/2} Γ(n/2))) z^{(n−1)/2} e^{−z/2} · 1/(2√z) = (1/(2^{n/2} Γ(n/2))) z^{n/2−1} e^{−z/2}.

This is just the chi-squared distribution with n degrees of freedom. Note


that this distribution does not depend on µ or σ; hence any χ2 distribution
can be regarded as being the distribution of the sum of squares of
independent standard-normally distributed random variables.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 255 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

The Chi-Squared Distribution


We immediately obtain the following result:
2.3.1. Lemma. Let X²γ1, . . . , X²γn be n independent random variables
following chi-squared distributions with γ1, . . . , γn degrees of freedom,
respectively. Then

X²α := ∑_{k=1}^{n} X²γk

is a chi-squared random variable with α = ∑_{k=1}^{n} γk degrees of freedom.

Proof.
Each of the chi-squared random variables X²γk may be regarded as being a
sum of γk squares of standard-normally distributed random variables.
Hence their sum may be regarded as the sum of α = ∑_{k=1}^{n} γk squares of
standard-normally distributed random variables, giving a chi-squared
random variable with α degrees of freedom.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 256 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Joint Sampling of Mean and Variance


Our interest in the chi-squared distribution is not merely abstract, for
understanding the sum of squares of normally distributed random
variables; in fact, the main application lies in analyzing the distribution of
the sample variance. In the previous chapter, we were able to analyze the
sample mean, and also its distribution, under the assumption of known
variance. If the variance

σ 2 = E[(X − µ)2 ]

is unknown, we must start all over again, and first learn more about the
sample variance
1 X
n
S2 = (Xk − X )2 .
n−1
k=1

The problem essentially is that we are using the random sample X1 , . . . , Xn


to obtain X and S 2 at the same time, i.e., we actually need to obtain the
joint distribution of X and S 2 .
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 257 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Joint Sampling of Mean and Variance

The problem is resolved by the following fundamental theorem:


2.3.2. Theorem. Let X1 , . . . , Xn , n ≥ 2, be a random sample of size n
from a normal distribution with mean µ and variance σ 2 . Then
1. The sample mean X is independent of the sample variance S 2 ,
2. X is normally distributed with mean µ and variance σ 2 /n,
3. (n − 1)S 2 /σ 2 is chi-squared distributed with n − 1 degrees of freedom.

The proof of this theorem uses the so-called Helmert transformation,


which we now discuss.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 258 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

The Helmert Transformation


The Helmert transformation is a very special kind of orthogonal
transformation from a set of n ≥ 2 i.i.d. normal random variables
X1 , . . . , Xn to a new set of random variables Y1 , . . . , Yn . Effectively, a
sample of size n of a normal population X with mean µ and variance σ 2 is
transformed as follows:
Y1 = (1/√n) (X1 + · · · + Xn)
Y2 = (1/√2) (X1 − X2)
Y3 = (1/√6) (X1 + X2 − 2X3)
  ⋮
Yn = (1/√(n(n−1))) (X1 + X2 + · · · + Xn−1 − (n − 1)Xn)

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 259 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

The Helmert Transformation


In matrix notation,

  ( Y1 )   (  1/√n          1/√n          1/√n        · · ·   1/√n            ) ( X1 )
  ( Y2 )   (  1/√2         −1/√2          0           · · ·   0               ) ( X2 )
  ( Y3 ) = (  1/√6          1/√6         −2/√6        · · ·   0               ) ( X3 )
  (  ⋮ )   (   ⋮             ⋮             ⋮           ⋱       ⋮              ) (  ⋮ )
  ( Yn )   ( 1/√(n(n−1))   1/√(n(n−1))   1/√(n(n−1))  · · ·  −(n−1)/√(n(n−1)) ) ( Xn )

or Y = AX for short. It is easy to see that the rows of the matrix A are
orthonormal. Thus, A is an orthogonal matrix, A⁻¹ = Aᵀ. This
immediately implies |det A| = 1, since

det A = det Aᵀ = det A⁻¹ = 1/det A   ⇒   (det A)² = 1.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 260 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

The Helmert Transformation


Incidentally, the orthogonality of A also implies that if y = Ax, then

∑_{i=1}^{n} yi² = ⟨y, y⟩ = ⟨Ax, Ax⟩ = ⟨AᵀAx, x⟩ = ⟨x, x⟩ = ∑_{i=1}^{n} xi².        (2.3.2)

We have assumed that the random variables X1 , . . . , Xn are i.i.d., so their


joint distribution function is given by the product of the individual normal
distributions,
fX1···Xn(x1, . . . , xn) = ∏_{i=1}^{n} (2π)^{−1/2} σ^{−1} e^{−(xi − µ)²/(2σ²)}
                        = (2π)^{−n/2} σ^{−n} e^{−(1/(2σ²)) ∑_{i=1}^{n} (xi² − 2µxi + µ²)}

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 261 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

The Helmert Transformation


Note that the Helmert transformation is linear, so its derivative (Jacobian)
is simply A. Using (2.3.2), |det A−1 | = 1 and Theorem 1.4.19 on the
transformation of joint random variables, we obtain

fY1···Yn(y1, . . . , yn) = fX1···Xn(Aᵀy)
  = (2π)^{−n/2} σ^{−n} e^{−(1/(2σ²)) ( ∑_{i=1}^{n} yi² − 2µ√n y1 + nµ² )}
  = (2π)^{−n/2} σ^{−n} e^{−(1/(2σ²)) ( ∑_{i=2}^{n} yi² + (y1 − √n µ)² )}
  = (2π)^{−1/2} σ^{−1} e^{−(y1 − √n µ)²/(2σ²)} ∏_{i=2}^{n} (2π)^{−1/2} σ^{−1} e^{−yi²/(2σ²)}.

In particular, we see from this representation of the joint density function
that the random variables Y1, . . . , Yn are all independent. Note also that
Y1 is normally distributed with mean √n µ and variance σ².
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 262 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

The Helmert Transformation


Proof of Theorem 2.3.2.
Using the Helmert transformation, we can rewrite

X = (1/√n) Y1.

Furthermore,

(n − 1)S² = ∑_{i=1}^{n} (Xi − X)² = ∑_{i=1}^{n} Xi² − nX² = ∑_{i=1}^{n} Yi² − Y1² = ∑_{i=2}^{n} Yi².

Since the Yi are all independent, it follows that X is independent of S 2 , so


we have proven assertion 1. of the theorem.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 263 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

The Helmert Transformation


Proof of Theorem 2.3.2 (continued).
Since X = (1/√n) Y1 and

fY1(y1) = (2π)^{−1/2} σ^{−1} e^{−(y1 − √n µ)²/(2σ²)},

it follows that

fX(x) = (2π)^{−1/2} σ^{−1} e^{−(√n x − √n µ)²/(2σ²)} · √n,

so X is normally distributed with mean µ and variance σ²/n.
Now

(n − 1)S²/σ² = (1/σ²) ∑_{i=2}^{n} Yi²

follows a chi-squared distribution with n − 1 degrees of freedom (see
(2.3.1) and the following discussion). This completes the proof.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 264 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Independence of Sample Mean and Sample Variance

2.3.3. Remark. Theorem 2.3.2 essentially uses the fact that the i.i.d.
variables Xk, k = 1, . . . , n, are normally distributed. In fact, the converse
result is true also:
Let X1 , . . . , Xn , n ≥ 2, be i.i.d. random variables. Then if X and
S 2 are independent, the Xk , k = 1, . . . , n follow a normal
distribution.
This means that the independence of X and S 2 is a characteristic property
of the normal distribution. Furthermore, if in a given situation we assume
that X and S 2 are independently distributed we are essentially assuming
that the population is normally distributed.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 265 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Interval Estimation of Variability

We can now use Theorem 2.3.2 to find a confidence interval for the
variance based on the sample variance S². First, for 0 < α ≤ 1 we define
χ²_{1−α/2,n} ≤ χ²_{α/2,n} ∈ R by

∫_0^{χ²_{1−α/2,n}} fχn²(x) dx = α/2,        ∫_{χ²_{α/2,n}}^{∞} fχn²(x) dx = α/2,

where fχn² is the probability density of the chi-squared distribution
with n degrees of freedom.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 266 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Interval Estimation of Variability


From Theorem 2.3.2 we know that given a sample of size n from a normal
population, (n − 1)S²/σ² follows a chi-squared distribution with n − 1
degrees of freedom. Thus

1 − α = P[χ²_{1−α/2,n−1} ≤ (n − 1)S²/σ² ≤ χ²_{α/2,n−1}]
      = P[(n − 1)S²/χ²_{α/2,n−1} ≤ σ² ≤ (n − 1)S²/χ²_{1−α/2,n−1}].

This gives us the following result:

2.3.4. Theorem. Let X1, . . . , Xn, n ≥ 2, be a random sample of size n from
a normal distribution with mean µ and variance σ². A 100(1 − α)%
confidence interval on σ² is given by

[(n − 1)S²/χ²_{α/2,n−1}, (n − 1)S²/χ²_{1−α/2,n−1}].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 267 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Interval Estimation of Variability


Often, we are only interested in finding an upper or lower bound for the
variance.
2.3.5. Theorem. Let X1 , . . . , Xn , n ≥ 2, be a random sample of size n from
a normal distribution with mean µ and variance σ 2 . Then with
100(1 − α)% confidence
(n − 1)S²/χ²_{α,n−1} ≤ σ².

[(n − 1)S²/χ²_{α,n−1}, ∞) is known as a 100(1 − α)% lower confidence interval for σ².
Similarly, with 100(1 − α)% confidence,

σ² ≤ (n − 1)S²/χ²_{1−α,n−1}.

[0, (n − 1)S²/χ²_{1−α,n−1}] is known as a 100(1 − α)% upper confidence interval for σ².

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 268 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Interval Estimation of Variability

2.3.6. Example. A manufacturer of soft drink beverages is interested in the


uniformity of the machine used to fill cans. Specifically, it is desirable that
the standard deviation σ of the filling process be less than 0.2 fluid ounces;
otherwise there will be a higher than allowable percentage of cans that are
underfilled. We will assume that fill volume is approximately normally
distributed. A random sample of 20 cans results in a sample variance of
s² = 0.0225 (fluid ounces)². A 95% upper confidence interval is given by

σ² ≤ (n − 1)s²/χ²_{0.95,19} = 19 · 0.0225 (fluid ounces)² / 10.117 = 0.0423 (fluid ounces)².

This corresponds to σ ≤ 0.21 fluid ounces with 95% confidence. This is


not sufficient to support the hypothesis that σ ≤ 0.20 fluid ounces so
further investigation is necessary.
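
The chi-squared quantile and the bound can be reproduced numerically (an illustrative SciPy sketch, not from the original slides; note that χ²_{0.95,19} is the point with area 0.95 to its right, i.e. the 5% quantile):

    from scipy.stats import chi2

    n, s2, alpha = 20, 0.0225, 0.05
    chi2_lower = chi2.ppf(alpha, df=n - 1)     # about 10.117
    upper_bound = (n - 1) * s2 / chi2_lower
    print(upper_bound, upper_bound ** 0.5)     # about 0.0423 (fluid ounces)^2, i.e. sigma <= about 0.21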

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 269 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Interval Estimation for the Mean (Variance unknown)


Recall that we have derived a formula for the confidence interval of the
mean of a normal distribution using the random variable

Z = (X − µ)/(σ/√n)

which was found to be normally distributed. The Central Limit Theorem


allowed us to extend this result (approximately) even to non-normal
distributions, but one central difficulty remained: σ must be known!
Our main goal is to derive a general formula for a confidence interval on
the mean when the value of σ is not known and must be estimated.
The difficulty lies in the fact that the distribution of

(X − µ)/(S/√n)

is not known.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 270 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

The Student T -distribution

2.3.7. Definition. Let Z be a standard normal variable and let Xγ2 be an


independent chi-squared random variable with γ degrees of freedom. The
random variable
Tγ = Z / √(X²γ/γ)

is said to follow a T distribution with γ degrees of freedom.

2.3.8. Theorem. The density of a T distribution with γ degrees of freedom


is given by
f_{T_γ}(t) = [Γ((γ + 1)/2)/(Γ(γ/2)√(πγ))] · (1 + t²/γ)^{−(γ+1)/2}.

The proof is left to you (homework!).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 271 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

The Student T -distribution


2.3.9. Theorem. Let X1 , . . . , Xn be a random sample from a normal
distribution with mean µ and variance σ 2 . The random variable

T_{n−1} = (X̄ − µ)/(S/√n)

follows a T distribution with n − 1 degrees of freedom.

Proof.
We know that (X̄ − µ)/(σ/√n) is standard normal and (n − 1)S²/σ² is a
chi-squared random variable with n − 1 degrees of freedom; moreover, X̄ and S²
are independent for a sample from a normal distribution. Therefore,

Z/√(X²_γ/γ) = [(X̄ − µ)/(σ/√n)] / √(((n − 1)S²/σ²)/(n − 1)) = (X̄ − µ)/(S/√n)

follows a T distribution with n − 1 degrees of freedom.


Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 272 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Interval Estimation of Mean with Variance Unknown

Let 0 < α ≤ 1/2. We define t_{α,n} ≥ 0 by

∫_{−∞}^{−t_{α,n}} f_{T_n}(t) dt = α,

where f_{T_n} is the density of the T-distribution with n degrees of freedom.


2.3.10. Theorem. Let X1 , . . . , Xn be a random sample of size n from a
normal distribution with mean µ and variance σ 2 . Then a 100(1 − α)%
confidence interval on µ is given by

X̄ ± t_{α/2,n−1} S/√n

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 273 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Interval Estimation of Mean with Variance Unknown


2.3.11. Example. An article in the Journal of Testing and Evaluation
presents the following 20 measurements on residual flame time (in
seconds) of treated specimens of children’s nightwear:
9.85 9.93 9.75 9.77 9.67 9.87 9.67 9.94 9.85 9.75
9.83 9.92 9.74 9.99 9.88 9.95 9.95 9.93 9.92 9.89
We wish to find a 95% confidence interval on the mean residual flame
time. The sample mean and standard deviation are

x = 9.8475, s = 0.0954

We refer to the table for the T distribution with 20 − 1 = 19 degrees of
freedom and α/2 = 0.025 to obtain t_{0.025,19} = 2.093. Hence

µ = (9.8475 ± 0.0446) sec, i.e., 9.8029 ≤ µ ≤ 9.8921

with 95% confidence.
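A quick way to check this interval from the summary statistics is sketched below (a minimal sketch, assuming SciPy is available):

```python
# Confidence interval of Example 2.3.11 from the summary statistics (illustrative sketch).
from scipy.stats import t

n, xbar, s = 20, 9.8475, 0.0954
half_width = t.ppf(1 - 0.025, n - 1) * s / n ** 0.5   # t_{0.025,19} = 2.093
print(xbar - half_width, xbar + half_width)           # ~9.8029, ~9.8921
```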


Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 274 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypotheses and Testing

Often, the statistician will have some idea of the value of a population
parameter, or will try to verify or refute a statement on this parameter. As
an example, a new type of battery might be designed to have a longer life
span than a traditional model. Then a series of prototypes of the new type
(constituting a sample from the population of all not yet produced
batteries of the new type) might be tested for their life span. If the mean
life span of the traditional batteries was 160 days, a hypothesis might be
that “the new type of batteries has a mean life span of more than 170
days”.
The hypothesis that is to be tested is called the research hypothesis and
denoted by H1 (in our example, H1 would be “µ > 170 days”), while the
negation of H1 is called the null hypothesis and denoted H0 (here:
“µ ≤ 170 days”).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 275 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypotheses

We will use the following conventions regarding hypotheses:


◮ A hypothesis always involves the numerical value of a population
parameter θ.
◮ The hypothesis to be supported is denoted H1 , the negation is
denoted H0 . One hopes to accept H1 and reject H0 through
statistical evidence.
◮ The statement of equality for θ is always part of H0 . This value is
called the null value θ0 of θ.
(In our above example, θ0 = 170 days.)
A hypothesis of the form θ ≥ θ0 , θ ≤ θ0 , θ > θ0 or θ < θ0 is known as a
one-sided hypothesis, while a hypothesis of the form θ = θ0 or θ ̸= θ0 is
known as a two-sided hypothesis.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 276 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypothesis Testing
We always test a pair of research and null hypotheses. Let θ denote a
population parameter whose value is being compared to some θ0 ∈ R.
Then we have
One-sided tests.
H 0 : θ ≤ θ0 , H1 : θ > θ 0
H 0 : θ ≥ θ0 , H1 : θ < θ 0

Two-sided test.
H 0 : θ = θ0 , H1 : θ ̸= θ0

In order to test a hypothesis, we select a random sample and evaluate a


statistic whose distribution is known under the assumption that θ = θ0 .
This statistic is called a test statistic.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 277 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypothesis Testing
On the basis of the test statistic, we either
◮ reject H0 or
◮ fail to reject H0 .

2.3.12. Example. We are given the task of testing the following hypothesis:
More than half of all car headlights in Shanghai are incorrectly
adjusted.
We now establish the mathematical/statistical framework for tackling this
problem:
◮ The population is the set of all cars in Shanghai;
◮ The random variable X is discrete: either a car has a correctly
adjusted headlight (X = 0) or it does not (X = 1);
◮ X follows a binomial distribution with
N = “number of cars in Shanghai” and population parameter p.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 278 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypothesis Testing
Our hypotheses are
H0 : p ≤ 0.5 H1 : p > 0.5
with the null value p0 = 0.5.
We take a random sample of n = 20 automobiles and count those that
have maladjusted headlights. Effectively, we are conducting an experiment
with a binomial random variable X : S → Ω = {0, 1, . . . , 20}.
Our test statistic will be the number X of cars in the random sample with
incorrectly adjusted headlights.
If p = p0 , then X follows a binomial distribution with p = p0 = 0.5, and
we can expect p̂ = X /n to be close to p. In this case, E[X ] = np0 = 10.
We decide to reject H0 if at least 14 cars have incorrectly adjusted
headlights; the probability of this happening if p = p0 is
P[X ≥ 14 | p = 0.5, n = 20] = 1 − P[X ≤ 13 | p = 0.5, n = 20]
= 1 − 0.9423 = 0.0577.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 279 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypothesis Testing
Thus, if H0 is true (i.e., p = 0.5) then there is approximately a 6% chance
of observing 14 or more cars with incorrectly adjusted headlights.
What does this imply? Assume that we actually observe 14 or more cars
with incorrectly adjusted headlights and reject H0 . Then there are two
possibilities:
◮ We have correctly rejected H0 or
◮ We have falsely rejected H0 .
In the second case, H0 is in fact true. However, the probability of this
happening is at most 6%, since if p ≤ p0 , then

P[X ≥ 14 | p, n = 20] ≤ P[X ≥ 14 | p0 , n = 20] = 0.0577.

Falsely rejecting H0 is committing a so-called Type I error and the


probability of committing this error is denoted by α (in our example,
α = 5.77%).
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 280 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Type I Errors
We want to keep the probability α of committing a Type I error as small
as possible. We directly control α by arbitrarily defining the critical region
of the test.
The critical region is the subset of the range of the test statistic which will
lead us to reject H0 . In our example, the critical region is
C = {14, 15, 16, 17, 18, 19, 20} ⊂ Ω.
If we reduce the critical region, we decrease α. For example, if we set
C = {16, 17, 18, 19, 20} ⊂ Ω,
i.e., we only reject H0 if at least 16 out of 20 cars have maladjusted
headlights, then
α = P[X ≥ 16 | p ≤ 0.5, n = 20] ≤ P[X ≥ 16 | p = 0.5, n = 20]
= 1 − P[X ≤ 15 | p = 0.5, n = 20] = 1 − 0.9941 = 0.0059.
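The two tail probabilities can be computed directly; here is a minimal sketch assuming SciPy is available:

```python
# Type I error probabilities for the two critical regions (illustrative sketch).
from scipy.stats import binom

n, p0 = 20, 0.5
alpha_C14 = 1 - binom.cdf(13, n, p0)   # critical region {14, ..., 20}: ~0.0577
alpha_C16 = 1 - binom.cdf(15, n, p0)   # critical region {16, ..., 20}: ~0.0059
print(alpha_C14, alpha_C16)
```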

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 281 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Type I Errors

We say that rejecting H0 is a strong conclusion, because we can nearly


always set up our critical region in such a way that we control the
probability α of erring in coming to this conclusion. We call α the level of
significance of our test.
Note that for a two-sided test (H0 : θ = θ0 , H1 : θ ̸= θ0 ), α is a constant
determined by the critical region. However, in a one-sided test such as

H 0 : θ ≤ θ0 , H1 : θ > θ 0

α depends on the true value of the population parameter θ. In fact, the


smaller θ is, the smaller the probability of falsely rejecting H0 . Thus we
can define a function α = α(θ) for θ ≤ θ0 . However, α will be largest
when θ = θ0 , so we generally just quote this value of α for the level of
significance.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 282 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Type I Errors

In general, we will end up in one of the following situations:


1. We shall reject H0 even though H0 is in fact true - this is the Type I
error we have discussed.
2. We shall reject H0 when H0 is untrue.
3. We shall fail to reject H0 even though H0 is untrue - this is known as
a Type II error.
4. We shall fail to reject H0 when H0 is true.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 283 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Type II Errors
In testing either of
H 0 : θ ≤ θ0 , H1 : θ > θ 0
H 0 : θ ≥ θ0 , H1 : θ < θ 0
H 0 : θ = θ0 , H1 : θ ̸= θ0
Type II errors are more tricky than Type I errors:
◮ A Type II error occurs if H0 is untrue but we still fail to reject H0 .
◮ We do not control β, the probability of committing a Type II error.
◮ β depends on the true value of the population parameter θ.
We also introduce the power of a test, which is the probability of rejecting
H0 when H1 is true. The power is given by 1 − β. Summarizing, we have
α = P[Type I error] = P[reject H0 | H0 true],
β = P[Type II error] = P[fail to reject H0 | H0 false],
Power = 1 − β = P[reject H0 | H0 false].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 284 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Example - A binomial test


2.3.13. Example. In the context of our previous example, we wanted to test
the statement that more than half of all car headlights in Shanghai are
maladjusted and selected a random sample of 20 cars, deciding to accept
the statement (reject H0 ) if 14 or more have wrongly adjusted headlights.
Suppose that p = 0.7. Then
β = P[fail to reject H0 | p = 0.7] = P[X ≤ 13 | p = 0.7, n = 20] = 0.3920
However, if p = 0.8,
β = P[X ≤ 13 | p = 0.8, n = 20] = 0.0867

Thus, (unless β is known) failing to reject H0 is thought of as a weak


conclusion.
We prefer to say that we “fail to reject H0 ” rather than “accept H0 ”,
indicating that we simply do not have enough evidence to reject H0 (as
opposed to being sure that H0 is true).
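These values of β can be reproduced as follows (a short sketch, assuming SciPy is available):

```python
# Type II error probabilities of Example 2.3.13 (illustrative sketch).
from scipy.stats import binom

for p in (0.7, 0.8):
    beta = binom.cdf(13, 20, p)    # P[fail to reject H0 | p], critical region {14, ..., 20}
    print(p, beta)                 # ~0.3920 for p = 0.7, ~0.0867 for p = 0.8
```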
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 285 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Operating Characteristic (OC) Curves


For many types of tests the dependence of β on the true value of the
population parameter is known and can be looked up in the form of a
so-called operating characteristic (OC) curve. While the textbook
discusses OC curves only in the context of acceptance sampling (see pages
672-674), OC curves are actually fundamental to hypothesis testing.
2.3.14. Example. We are interested in the mean compressive strength of a
particular type of concrete. Specifically, we want to decide whether or not
the mean compressive strength (say µ) is 2500 psi. We set up the
hypotheses

H0 : µ = 2500 psi, H1 : µ ̸= 2500 psi .

We take a random sample of size n and obtain the sample mean X as a


test statistic for µ. We decide to reject H0 if |X − µ0 | > 50 psi, where
µ0 = 2500 psi is the null value of µ.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 286 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Operating Characteristic (OC) Curves


From the central limit theorem, we know that X follows a normal
distribution with mean µ. Through our choice of critical region (and
sample size), we have effectively fixed α. Depending on the true value of
µ, β can be large or small. If µ = µ0 = 2500 psi, then β = 1 − α. We can
represent β as a function of µ through a curve of the following shape:

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 287 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Operating Characteristic (OC) Curves


Of course, β also depends on the sample size, since a large sample size will
reduce the variance of X and therefore make it less likely to falsely fail to
reject H0 . When increasing the sample size from n1 to n2 , the OC curve
narrows:

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 288 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Operating Characteristic (OC) Curves


By increasing α, we make it easier to reject H0 . This also decreases β, the
probability of failing to reject H0 even though H0 is false:

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 289 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Operating Characteristic (OC) Curves


The previous curves were those for the two-sided test H1 : µ ̸= µ0 . If we
have a one-sided test
H0 : µ ≤ µ0 , H1 : µ > µ0
then we can a priori only define β(µ) for µ > µ0 . However, since
β(µ0 ) = 1 − α and we in fact have α = α(µ) for µ ≤ µ0 , we simply set
β(µ) = 1 − α(µ) for µ ≤ µ0 .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 290 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypothesis Testing
2.3.15. Example. The burning rate of a rocket propellant is being studied.
Specifications require that the mean burning rate must be 40 cm/s.
Furthermore, suppose that we know that the standard deviation of the
burning rate is approximately σ = 2 cm/s. The experimenter decides to
specify a Type I error probability of α = 0.05 and he will base the test on
a random sample of size n = 25. The hypotheses we wish to test are

H0 : µ = 40 cm/s, H1 : µ ̸= 40 cm/s.

If H0 is true, the sample mean is normally distributed with mean


µ0 = 40 cm/s and variance σ 2 /n; thus we will use the standard normal
statistic
Z = (X̄ − µ0)/(σ/√n)
to test the hypotheses.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 291 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypothesis Testing

Since the level of significance is to be α = 0.05, we set our acceptance


region (the complement of the critical region) to

−z_{α/2} ≤ Z ≤ z_{α/2} ⇔ −1.96 ≤ (X̄ − µ0)/(σ/√n) ≤ 1.96.

Twenty-five specimens are tested, and the sample mean burning rate
obtained is x̄ = 41.25 cm/s. The value of the test statistic is

Z0 = (x̄ − µ0)/(σ/√n) = (41.25 − 40)/(2/√25) = 3.125.

We note that |Z0 | > 1.96, so it falls in the critical region and we can reject
H0 at the 5% level of significance.
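A short sketch of this calculation (assuming SciPy is available; the variable names are illustrative only):

```python
# Test of Example 2.3.15 (illustrative sketch).
from scipy.stats import norm

xbar, mu0, sigma, n, alpha = 41.25, 40.0, 2.0, 25, 0.05
z0 = (xbar - mu0) / (sigma / n ** 0.5)   # 3.125
z_crit = norm.ppf(1 - alpha / 2)         # 1.96
print(z0, abs(z0) > z_crit)              # True: reject H0
```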

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 292 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypothesis Testing
2.3.16. Example. Continuing from the previous example, suppose that the
analyst is concerned about the probability of a Type II error if the true
mean burning rate is µ = 41 cm/s. We may use the following operating
characteristic curve (specific to α = 0.05) to find β:

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 293 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypothesis Testing
In this graph,

d := |µ − µ0 |/σ = (41 − 40)/2 = 1/2.
Since in our example n = 25 we can read off β ≈ 0.30.
2.3.17. Example. Continuing from the previous example, suppose that the
analyst would like to design the test so that if the true mean burning rate
differs from 40 cm/s by more than 1 cm/s the test will detect this (i.e.,
reject H0 : µ = 40) with a high probability, say 0.90. This corresponds to
setting the power 1 − β of the test to 0.90, i.e., we want to achieve
β = 0.10.
We want to have β ≤ 0.1 if

d = |µ − µ0 |/σ = |µ − 40|/2 ≥ 1/2.
We see that the point (d, β) = (0.5, 0.1) is intersected by the OC curve
for n = 40 and that the curve remains below 0.1 for d > 1/2. Thus the
test should involve a sample size of n = 40 or more.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 294 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Significance Testing

As previously mentioned, hypothesis testing as just described can be a


little too rigid.
An alternative approach involves setting up H0 and H1 as before, but not
specifying α and the critical region before the test. Rather, a value of the
test statistic is observed, and then the probability of observing this value
given that θ = θ0 is calculated. This probability is variously called
◮ critical level,
◮ descriptive level of significance, or
◮ probability or P value
of the test.
The P-value is also the smallest value for α at which we would have been
able to reject H0 . We reject H0 if we consider this P value to be small.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 295 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Significance Testing
For a one-tailed test, the P-value is the area under the density curve to the
right or left of the observed statistic:

How does one define the P-value for two-tailed tests?


If the distribution is symmetric about µ0 we simply take twice the P-value
of the one-tailed test. This will not be exact if the distribution is
asymmetric (like the chi-squared distribution for tests of variance), but we
will use it as an approximation anyway.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 296 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Significance Testing

2.3.18. Example. We test the hypothesis that a new car design increases
mileage.
◮ The population is the set of all cars with the new design;
◮ The random variable X is the mileage of the newly designed cars;
◮ The distribution of X is unknown;
◮ The population parameter µ is the mean of X .
We take a random sample of n = 36 automobiles. Our hypotheses are

H0 : µ ≤ 26 H1 : µ > 26

with the null value µ0 = 26. Currently, the mileage of cars has a standard
deviation of 5 miles and we assume this will also be true for the new
design if µ = µ0 .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 297 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Significance Testing

We obtain a data set of 36 mileages with a sample mean X = 28.04 mpg.


To see whether there is enough evidence to reject H0 we find the P-value
of the test. By the Central Limit Theorem, if µ = 26 and σ = 5, the sample
mean X̄ is at least approximately normally distributed with mean 26 and
standard deviation σ/√n = 5/6. Hence

P[X̄ ≥ 28.04 | µ = 26, σ = 5] = P[(X̄ − 26)/(5/6) ≥ (28.04 − 26)/(5/6)]
= P[Z ≥ 2.45] = 1 − P[Z ≤ 2.45]
= 1 − 0.9929 = 0.0071.

This is the P-value of the test. We may decide that it is sufficiently small
to reject H0 .
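The same P-value can be obtained numerically; a minimal sketch assuming SciPy is available:

```python
# P-value of Example 2.3.18 (illustrative sketch).
from scipy.stats import norm

xbar, mu0, sigma, n = 28.04, 26.0, 5.0, 36
z = (xbar - mu0) / (sigma / n ** 0.5)   # ~2.45
print(z, norm.sf(z))                    # upper-tail P-value, ~0.007
```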

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 298 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Selecting Appropriate Hypotheses


For one-sided tests, it is sometimes not clear how to select the null and
research hypotheses. For example, suppose that a soft drink beverage
bottler purchases 10-ounce nonreturnable bottles from a glass company.
The bottler wants to be sure that the bottles exceed the specification on
mean internal pressure or bursting strength, which for 10-ounce bottles is
200 psi.
The bottler has decided to formulate the decision procedure for a specific
lot of bottles as a hypothesis problem. There are two possible formulations
for this problem, either

H0 : µ ≤ 200 psi H1 : µ > 200 psi (2.3.3)

or

H0 : µ ≥ 200 psi H1 : µ < 200 psi (2.3.4)

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 299 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Selecting Appropriate Hypotheses


The formulation (2.3.3) forces the manufacturer to reject H0 , i.e., to
demonstrate that the bottles conform to specification.
In the formulation (2.3.4) the bottles will be judged satisfactory unless H0
is rejected.
Choosing the “correct” formulation depends on the precise circumstances:
the first formulation means that the bottler needs proof that the bottles
conform to specification (perhaps because there have been problems in the
past), while in the second formulation the bottles will be accepted unless
there is strong evidence to the contrary (the bottler might have been
consistently satisfied with the bottles in the past, and small deviations
from µ ≥ 200 psi might not be harmful).
In formulating one-sided research hypotheses, we should remember that
rejecting H0 is always a strong conclusion, and consequently, we should
put the statement about which it is important to make a strong conclusion
in the research hypothesis. Often this will depend on our point of view and
experience with the situation.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 300 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypotheses and Significance Tests on the Mean


There are three forms for tests of hypotheses on the mean of a
distribution:
1. Right-tailed test: H0 : µ ≤ µ0 , H1 : µ > µ0
2. Left-tailed test: H0 : µ ≥ µ0 , H1 : µ < µ0
3. Two-tailed test: H0 : µ = µ0 , H1 : µ ̸= µ0
Remember: To test a hypothesis on a parameter θ, we must find a
statistic whose probability distribution is known at least approximately
when θ = θ0 (the null value).
We know that if X is normal, the statistic

T = (X̄ − µ0)/(S/√n)

follows a Tn−1 -distribution. Tests based on this distribution are called T


tests.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 301 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypotheses and Significance Tests on the Mean


2.3.19. Example. The breaking strength of a textile fiber is a normally
distributed random variable. Specifications require that the mean breaking
strength should equal 150 psi. The manufacturer would like to detect any
significant departure from this value. Thus, he wishes to test

H0 : µ = 150 psi H1 : µ ̸= 150 psi

A random sample of 15 fiber specimens is selected and their breaking


strengths determined. The statistic

T = (X̄ − µ0)/(S/√n)

will follow a T14 -distribution. We specify α = 0.05, and find


t_{0.025,14} = 2.145 and −t_{0.025,14} = −2.145 from Table VI of the textbook.
Thus, the critical region is given by |t| > 2.145.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 302 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypotheses and Significance Tests on the Mean


The sample mean and variance are computed from the sample data as
x = 152.18 and s 2 = 16.63. Therefore, the test statistic is
t = (x̄ − µ0)/(s/√n) = (152.18 − 150)/√(16.63/15) = 2.07,

which does not fall into the critical region, so there is insufficient evidence
to reject H0 at the 5% level of significance.
Note that the T -distribution may be used for (X̄ − µ0)/(S/√n) when a sample is
obtained from a normal population. If a sample is obtained from a
non-normal population, care must be taken; for large to medium sample
sizes (n ≥ 25) it can be shown that violating the normality assumption
does not significantly change α and β. For small sample sizes, a T -test
cannot be used and an alternative (non-parametric) test must be
employed; such tests will be discussed later.
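For reference, the test statistic, critical value, and an approximate P-value for Example 2.3.19 can be computed as follows (a sketch assuming SciPy is available):

```python
# T test of Example 2.3.19 from the summary statistics (illustrative sketch).
from scipy.stats import t

n, xbar, s2, mu0 = 15, 152.18, 16.63, 150.0
t0 = (xbar - mu0) / (s2 / n) ** 0.5       # ~2.07
t_crit = t.ppf(1 - 0.025, n - 1)          # t_{0.025,14} = 2.145
p_value = 2 * t.sf(abs(t0), n - 1)        # two-sided P-value, slightly above 0.05
print(t0, t_crit, p_value)
```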

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 303 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypotheses and Significance Tests on the Mean

In order to accept H1 : µ > µ0 we must reject H0 : µ ≤ µ0 . However, it is


clearly sufficient to reject H0 : µ = µ0 by showing evidence that the mean
is greater than µ0 . For this reason, one often prefers the following
conventions,

1. Right-tailed test: H0 : µ = µ0 , H1 : µ > µ0


2. Left-tailed test: H0 : µ = µ0 , H1 : µ < µ0
3. Two-tailed test: H0 : µ = µ0 , H1 : µ ̸= µ0

These conventions emphasize the null value µ0 , for which a test statistic
with known distribution must be found.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 304 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypotheses and Significance Tests on the Variance


Hypotheses on the variance have the same general form as for the mean:

1. Right-tailed test: H0 : σ = σ0 , H1 : σ > σ0


2. Left-tailed test: H0 : σ = σ0 , H1 : σ < σ0
3. Two-tailed test: H0 : σ = σ0 , H1 : σ ̸= σ0

It is important to be aware of the following difficulty:


◮ The T -distribution can be used in the presence of large sample sizes
for the distribution of the sample mean even if the underlying
distribution is non-normal.
◮ It is, however, not possible to approximate the χ²_{n−1} statistic in this
way if the distribution is non-normal, regardless of sample size!
Therefore, normality of the data must first be tested, and if the data
is non-normal, other methods must be used.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 305 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypotheses and Significance Tests on the Variance

2.3.20. Example. One random variable studied while designing the


front-wheel-drive half-shaft of a new model automobile is the displacement
(in millimeters) of the constant velocity (CV) joints. With the joint angle
fixed at 12◦ , 20 simulations were conducted, resulting in the following
data:
6.2 1.9 4.4 4.9 3.5
4.6 4.2 1.1 1.3 4.8
4.1 3.7 2.5 3.7 4.2
1.4 2.6 1.5 3.9 3.2
For these data, x̄ = 3.39 and s = 1.41. Engineers designing the
front-wheel-drive half-shaft claim that the standard deviation in the
displacement of the CV shaft is less than 1.5 mm. Do these data support
this contention?

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 306 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypotheses and Significance Tests on the Variance


We test H0 : σ = 1.5, H1 : σ < 1.5, which is equivalent to testing

H0 : σ 2 = 2.25, H1 : σ 2 < 2.25.

The observed value of the test statistic is


(n − 1)s²/σ0² = 19 · 1.41²/2.25 = 16.79.

Since the test is left-tailed, we reject H0 if this value is too small to have
occurred by chance. From the χ2 table we see that

P[χ²_{19} ≤ 14.6] = 0.25 and P[χ²_{19} ≤ 18.3] = 0.50.

Since the observed value lies between 14.6 and 18.3, the probability
P[χ²_{19} ≤ 16.79] lies between 0.25 and 0.50. This probability is too large to
be able to reject H0 , so we cannot claim that σ < 1.5 mm.
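The exact tail probability can be obtained numerically; a minimal sketch assuming SciPy is available:

```python
# Left-tail probability of Example 2.3.20 (illustrative sketch).
from scipy.stats import chi2

n, s, sigma0 = 20, 1.41, 1.5
stat = (n - 1) * s ** 2 / sigma0 ** 2   # ~16.79
print(stat, chi2.cdf(stat, n - 1))      # P[chi^2_19 <= 16.79], between 0.25 and 0.50
```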
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 307 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Alternative Nonparametric Methods

◮ “Non-parametric” = “distribution free”


◮ Less powerful than “normal theory procedures” if the underlying
assumptions are met
◮ Much more powerful when normality assumptions are not met; useful
even when they are met.
◮ Two examples: Sign Test for the Median & Wilcoxon Signed Rank
Test

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 308 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Sign Test for the Median

Recall that the median of a random variable X is defined as the value M


such that

P(X < M) = P(X > M) = 1/2.

For a symmetric distribution (such as the normal distribution), mean and


median are identical. We will see that the sign test is a form of binomial
test. We again have the traditional forms of tests for the median,
1. Right-tailed test: H0 : M = M0 , H1 : M > M0
2. Left-tailed test: H0 : M = M0 , H1 : M < M0
3. Two-tailed test: H0 : M = M0 , H1 : M ̸= M0

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 309 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Sign Test for the Median


Let X1 , . . . , Xn denote a random sample from a symmetric random variable
X with median M. Assuming a continuous distribution of X , each of the
differences Xi − M has probability 1/2 of being positive, 1/2 of being
negative and probability 0 of being 0.
Let Q± = #{Xi : Xi − M0 ≷ 0}. If H0 : M = M0 is true, then Q+ is
binomially distributed, with parameters n and p = 1/2.
In a right-tailed test, we perform a significance test on the value of Q−
and reject H0 if the number of negative results is too small to have
occurred by chance.
Similarly, in a left-tailed test we reject H0 if Q+ is too small to have
occurred by chance.
In a two-tailed test, we form Q = min(Q− , Q+ ) and reject H0 if Q is too
small to have occurred by chance.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 310 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Sign Test for the Median


2.3.21. Example. Montgomery, Peck and Vining (2001) report on a study
in which a rocket motor is formed by binding an ignitor propellant and a
sustainer propellant together inside a metal housing. The shear strength of
the bond between the two propellant types is an important characteristic.
The results of testing 20 random samples are shown below. We would like
to test the hypothesis that the median shear strength is 2000 psi.
Observation (i) Shear Strength (Xi ) Observation (i) Shear Strength (Xi )
1 2158.70 11 2165.20
2 1678.15 12 2399.55
3 2316.00 13 1779.80
4 2061.30 14 2336.75
5 2207.50 15 1765.30
6 1708.30 16 2053.50
7 1784.70 17 2414.40
8 2575.10 18 2200.50
9 2357.90 19 2654.20
10 2256.70 20 1753.70

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 311 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Sign Test for the Median

Formally, we set

H0 : M = 2000,
H1 : M ̸= 2000,

i.e., we perform a two-tailed test. We calculate Xi − M0 , where


M0 = 2000 is the null value of the median, for every i and note whether
the difference is positive or negative.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 312 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Sign Test for the Median


Observation (i) Shear Strength (Xi ) Xi − 2000 Sign
1 2158.70 158.70 +
2 1678.15 -321.85 −
3 2316.00 316.00 +
4 2061.30 61.30 +
5 2207.50 207.50 +
6 1708.30 -291.70 −
7 1784.70 -215.30 −
8 2575.10 575.10 +
9 2357.90 357.90 +
10 2256.70 256.70 +
11 2165.20 165.20 +
12 2399.55 399.55 +
13 1779.80 -220.20 −
14 2336.75 336.75 +
15 1765.30 -234.70 −
16 2053.50 53.50 +
17 2414.40 414.40 +
18 2200.50 200.50 +
19 2654.20 654.20 +
20 1753.70 -246.30 −

Note that Q+ = 14 and Q− = 6, so Q = min(Q− , Q+ ) = 6.


Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 313 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Sign Test for the Median

While there are tables for checking the significance of such a result, we
can calculate it directly, because the signs of the differences are binomially
distributed. The probability of observing 6 or fewer negative signs in a
sample of 20 observations is
P[Q ≤ 6] = ∑_{r=0}^{6} C(20, r) (1/2)^r (1/2)^{20−r} = 0.058.

The P-value of this test is hence 5.8%, so at the 5% level of significance we
are unable to reject H0 .
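Since the signs are binomially distributed, this probability is a one-line computation (sketch, assuming SciPy is available):

```python
# Sign test tail probability for Q = 6 out of n = 20 (illustrative sketch).
from scipy.stats import binom

print(binom.cdf(6, 20, 0.5))   # ~0.058
```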

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 314 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Sign Test for the Median

In practice, it may happen that Xi − M0 = 0. There has been extensive


research on treating these zeroes. It is recommended that you either
1. count the zeroes in the way least likely to result in the rejection of
H0 , or
2. discard all zeroes if their number is small compared to the sample
size, and reduce the sample size accordingly.
The sign test works even if the magnitudes of the differences Xi − M0 are
unknown; if they are known, we can apply a second test that uses this
information, called the Wilcoxon signed rank test.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 315 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Wilcoxon Signed Rank Test

◮ X1 , . . . , Xn random sample from continuous distribution with median


M.
◮ H0 : M = M0 , H1 : M > M0 , H1 : M < M0 , H1 : M ̸= M0 .
◮ Order the n absolute differences |Xi − M0 | according to magnitude and
assign ranks 1, . . . , n; give each rank the sign of the corresponding
difference Xi − M0 , yielding signed ranks Ri .
◮ If ties in the rank occur, the mean of the ranks is assigned to both
values.
◮ Define

W+ = ∑_{Ri >0} Ri ,   |W− | = ∑_{Ri <0} |Ri |.

◮ If H0 is true, W+ ≈ |W− | - consider distribution of W .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 316 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Wilcoxon Signed Rank Test


◮ For a left-tailed test (H1 : M < M0 ) we use W+ as a statistic - we
reject H0 if the observed value of W+ is small.
◮ For a right-tailed test (H1 : M > M0 ) we use |W− | as a statistic - we
reject H0 if the observed value of |W− | is small.
◮ For a two-tailed test (H1 : M ̸= M0 ) we use W = min(W+ , |W− |) as
a statistic - we reject H0 if the observed value of W is small.
◮ The level of significance of such a rejection is tabulated.
◮ The distribution of W is approximately normal with mean

E[W ] = n(n + 1)/4

and variance

Var W = n(n + 1)(2n + 1)/24.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 317 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Wilcoxon Signed Rank Test


2.3.22. Example. We return to the measurements of the previous example.
Observation (i) Shear Strength (Xi ) Xi − 2000 Signed Rank
16 2053.50 53.50 +1
4 2061.30 61.30 +2
1 2158.70 158.70 +3
11 2165.20 165.20 +4
18 2200.50 200.50 +5
5 2207.50 207.50 +6
7 1784.70 -215.30 −7
13 1779.80 -220.20 −8
15 1765.30 -234.70 −9
20 1753.70 -246.30 −10
10 2256.70 256.70 +11
6 1708.30 -291.70 −12
3 2316.00 316.00 +13
2 1678.15 -321.85 −14
14 2336.75 336.75 +15
9 2357.90 357.90 +16
12 2399.55 399.55 +17
17 2414.40 414.40 +18
8 2575.10 575.10 +19
19 2654.20 654.20 +20

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 318 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Wilcoxon Signed Rank Test

Here

W+ = 1 + 2 + 3 + 4 + 5 + 6 + 11 + · · · + 19 + 20 = 150

and
|W− | = 7 + 8 + 9 + 10 + 12 + 14 = 60.
For our two-tailed test, we take W = min(60, 150) = 60. From the Table
VIII in Appendix A, with n = 20 observations we have the critical value of
52 for a two-tailed test with P = 0.05. Since W = 60 ̸< 52, we cannot
reject H0 at the 5% level of significance.
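The signed ranks and the values of W+ and |W−| can be reproduced as follows (a sketch assuming NumPy and SciPy are available, and that there are no zero differences):

```python
# Signed ranks for the shear-strength data of Example 2.3.22 (illustrative sketch).
import numpy as np
from scipy.stats import rankdata

x = np.array([2158.70, 1678.15, 2316.00, 2061.30, 2207.50, 1708.30, 1784.70,
              2575.10, 2357.90, 2256.70, 2165.20, 2399.55, 1779.80, 2336.75,
              1765.30, 2053.50, 2414.40, 2200.50, 2654.20, 1753.70])
d = x - 2000.0
ranks = rankdata(np.abs(d))          # ranks of |X_i - M_0|; ties would receive mean ranks
w_plus = ranks[d > 0].sum()          # 150
w_minus = ranks[d < 0].sum()         # 60
print(w_plus, w_minus, min(w_plus, w_minus))
```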

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 319 / 355
An Introduction to Statistical Methods Inferences on Proportions

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 320 / 355
An Introduction to Statistical Methods Inferences on Proportions

Estimating Proportions
One of the (mathematically) simplest population parameters of general
interest is the proportion of members of a population with some trait.
Every member of the population is characterized as either having or not
having this trait. We describe this mathematically by defining the random
variable

X = 1 if the member has the trait, X = 0 if it does not.

The proportion of the members of the population having the trait is

p = (number of members with the trait)/(population size) = (1/N) ∑_{i=1}^{N} x_i ,

where N is the population size and xi is the value of the variable X for the
ith member of the population. Hence the proportion is equal to the mean
of X .
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 321 / 355
An Introduction to Statistical Methods Inferences on Proportions

Estimating Proportions
It follows that if we take a random sample X1 , . . . , Xn of X , the sample
mean

p̂ = X̄ = (1/n) ∑_{i=1}^{n} X_i

is an (unbiased) estimator for p.


The random variable X follows a point binomial distribution with
expectation E[X ] = p and variance p(1 − p).
By the central limit theorem, p̂ is approximately normally distributed with
mean p and variance p(1 − p)/n.
Hence

(p̂ − p)/√(p(1 − p)/n)

is approximately standard-normally distributed.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 322 / 355
An Introduction to Statistical Methods Inferences on Proportions

Estimating Proportions
We therefore obtain the following 100(1 − α)% confidence interval for p:
p̂ ± z_{α/2} √(p(1 − p)/n)

But the interval depends on the unknown parameter p, which we are


actually trying to estimate! The solution is to replace p by p̂, i.e., to write
p̂ ± z_{α/2} √(p̂(1 − p̂)/n).

But then the number zα/2 is no longer accurate (when we replaced σ by S


to obtain a confidence interval for the mean, we had to switch from zα/2
to tα/2 ). However, we are approximating the binomial distribution in any
case - we argue that if the sample size n is large enough to allow the
central limit theorem to hold, then the difference between zα/2 and a
corrected value will be negligible.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 323 / 355
An Introduction to Statistical Methods Inferences on Proportions

Estimating Proportions

2.4.1. Example. In a random sample of 75 axle shafts, 12 have a surface


finish that is rougher than the specifications will allow. Therefore, a point
estimate of the proportion p of shafts in the population that exceed the
roughness specifications is p̂ = x = 12/75 = 0.16. A 95% two-sided
confidence interval for p is then
p̂ − z_{α/2} √(p̂(1 − p̂)/n) ≤ p ≤ p̂ + z_{α/2} √(p̂(1 − p̂)/n)

or

0.16 − 1.96 √(0.16 · 0.84/75) ≤ p ≤ 0.16 + 1.96 √(0.16 · 0.84/75),
which simplifies to 0.08 ≤ p ≤ 0.24.
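A minimal sketch of this computation (assuming SciPy is available):

```python
# Confidence interval of Example 2.4.1 (illustrative sketch).
from scipy.stats import norm

n, defective = 75, 12
p_hat = defective / n                                            # 0.16
half_width = norm.ppf(0.975) * (p_hat * (1 - p_hat) / n) ** 0.5  # z_{0.025} = 1.96
print(p_hat - half_width, p_hat + half_width)                    # ~0.08, ~0.24
```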

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 324 / 355
An Introduction to Statistical Methods Inferences on Proportions

Choosing the Sample Size


As a practical matter, we are often able to choose (perhaps within
constraints) the sample size. We may want to be able to claim that “with
xx% probability, p̂ differs from p by at most d.” Given the 100(1 − α)%
confidence interval p̂ ± z_{α/2} √(p̂(1 − p̂)/n), we know with 100(1 − α)%
confidence that

d = z_{α/2} √(p̂(1 − p̂)/n).

Given d, this means that we should choose

n = z²_{α/2} p̂(1 − p̂)/d²

to ensure that |p − p̂| < d with 100(1 − α)% confidence. However, this
formula requires us to have an idea (estimate) p̂ of p beforehand. If this is
not the case, we can at least use that x(1 − x) ≤ 1/4 for all x ∈ R to
deduce that

n = z²_{α/2}/(4d²)

will ensure |p − p̂| < d with 100(1 − α)% confidence.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 325 / 355
An Introduction to Statistical Methods Inferences on Proportions

Choosing the Sample Size

2.4.2. Example. A new method of precoating fittings used in oil, brake and
other fluid systems in heavy-duty trucks is being studied. How large a
sample is needed to estimate the proportion of fittings that leak to within
0.02 with 90% confidence?
Since no prior estimate is available, we take

n = z²_{0.05}/(4d²) = 1.645²/(4 · 0.02²) ≈ 1691.3,

so a sample of size n = 1692 suffices.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 326 / 355
An Introduction to Statistical Methods Inferences on Proportions

Hypothesis Testing
There are three types of hypotheses and tests for proportions. Let p0 denote
the null value of a proportion p. Then we have
1. H0 : p = p0 , H1 : p > p0 (Right-tailed)
2. H0 : p = p0 , H1 : p < p0 (Left-tailed)
3. H0 : p = p0 , H1 : p ̸= p0 (Two-tailed)
For large sample sizes we use the following test statistic to test H0 : p = p0 :

Z = (p̂ − p0 )/√(p0 (1 − p0 )/n).

If H0 is true, then this statistic follows a standard normal distribution.


1. For a right-tailed test, we reject H0 if Z is a large positive number.
2. For a left-tailed test, we reject H0 if Z is a large negative number.
3. For a two-tailed test, we reject H0 if |Z | is large.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 327 / 355
An Introduction to Statistical Methods Inferences on Proportions

Comparing Two Proportions


We are often interested in estimating the difference between two
proportions p1 and p2 , where in general p1 is the proportion of objects
with some trait in a population P1 and p2 is the proportion of objects with
some (other) trait in a (different) population P2 .
If we let

X^(1) = 1 if the object has the trait, X^(1) = 0 if it does not,

be a random variable defined for population P1 and X^(2) be a similarly
defined random variable, then we are interested in the difference of the
means µ1 = p1 and µ2 = p2 of these random variables. An unbiased
estimator for p1 − p2 is given by

p̂1 − p̂2 = X̄^(1) − X̄^(2),

where X̄^(1) and X̄^(2) are the means of random samples from the random
variables X^(1) and X^(2), respectively.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 328 / 355
An Introduction to Statistical Methods Inferences on Proportions

Comparing Two Proportions

Assume that we have random samples of sizes n1 and n2 of X^(1) and X^(2),
respectively. Since X̄^(1) and X̄^(2) are both approximately normally
distributed, with means p1 , p2 and variances p1 (1 − p1 )/n1 and
p2 (1 − p2 )/n2 , respectively, we obtain the following result:
2.4.3. Theorem. For large samples, the estimator p̂1 − p̂2 is approximately
normal with mean p1 − p2 and variance p1 (1 − p1 )/n1 + p2 (1 − p2 )/n2 .
This allows us to deduce the following 100(1 − α)% confidence interval for
p1 − p2 :

p̂1 − p̂2 ± z_{α/2} √(p̂1 (1 − p̂1 )/n1 + p̂2 (1 − p̂2 )/n2 ),

which is valid for large sample sizes.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 329 / 355
An Introduction to Statistical Methods Inferences on Proportions

Comparing Two Proportions


We can now also test hypotheses on the difference of two proportions. Let
(p1 − p2 )0 denote the null value of the difference. Then we have the
hypotheses
1. H0 : p1 − p2 = (p1 − p2 )0 , H1 : p1 − p2 > (p1 − p2 )0
(Right-tailed)
2. H0 : p1 − p2 = (p1 − p2 )0 , H1 : p1 − p2 < (p1 − p2 )0
(Left-tailed)
3. H0 : p1 − p2 = (p1 − p2 )0 , H1 : p1 − p2 ̸= (p1 − p2 )0
(Two-tailed)
For large sample sizes we use the approximate test statistic

Z = (p̂1 − p̂2 − (p1 − p2 )0 )/√(p̂1 (1 − p̂1 )/n1 + p̂2 (1 − p̂2 )/n2 ).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 330 / 355
An Introduction to Statistical Methods Inferences on Proportions

Pooled Proportions
Most commonly we test against H0 : (p1 − p2 )0 = 0. Then we have the
hypotheses
1. H0 : p1 − p2 = 0, H1 : p1 − p2 > 0 (Right-tailed)
2. H0 : p1 − p2 = 0, H1 : p1 − p2 < 0 (Left-tailed)
3. H0 : p1 − p2 = 0, H1 : p1 − p2 ̸= 0 (Two-tailed)
Now if H0 is true, then p̂1 and p̂2 are both estimators for the same
proportion p. Then the variance becomes

p(1 − p)/n1 + p(1 − p)/n2 = p(1 − p)(1/n1 + 1/n2 )

and

Z = (p̂1 − p̂2 )/√(p(1 − p)(1/n1 + 1/n2 ))

has an approximate standard normal distribution.


Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 331 / 355
An Introduction to Statistical Methods Inferences on Proportions

Pooled Proportions

In estimating p, we now have a choice of p̂1 and p̂2 . It turns out that it is
best to take the weighted average

p̂ = (n1 p̂1 + n2 p̂2 )/(n1 + n2 )

and to use the statistic

Z = (p̂1 − p̂2 )/√(p̂(1 − p̂)(1/n1 + 1/n2 )).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 332 / 355
An Introduction to Statistical Methods Inferences on Proportions

Pooled Proportions
2.4.4. Example. Many consumers think that automobiles built on Mondays
are more likely to have serious defects than those built on any other day of
the week. To support this theory, a random sample of 100 cars built on
Monday is selected and inspected. Of these, eight are found to have
serious defects. A random sample of 200 cars produced on other days
reveals 12 with serious defects. Do these data support the stated
contention?

We test
H 0 : p1 = p2 , H1 : p1 > p2
where p1 denotes the proportion of cars with serious defects produced on
Mondays.
Estimates for p1 and p2 are
p̂1 = 8/100 = 0.08, p̂2 = 12/200 = 0.06.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 333 / 355
An Introduction to Statistical Methods Inferences on Proportions

Pooled Proportions

The pooled estimate for the common population proportion is

p̂ = (100 · 0.08 + 200 · 0.06)/(100 + 200) = 20/300 = 0.066.

The observed value of the test statistic is

(p̂1 − p̂2 )/√(p̂(1 − p̂)(1/n1 + 1/n2 )) = (0.08 − 0.06)/√(0.066 · 0.934 · (1/100 + 1/200)) = 0.658.

From the standard normal table, we see that the probability of observing
this large or a larger value is 0.2546, so we shall not reject H0 .
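The pooled test can be reproduced as follows (a sketch assuming SciPy is available):

```python
# Pooled two-proportion test of Example 2.4.4 (illustrative sketch).
from scipy.stats import norm

n1, x1, n2, x2 = 100, 8, 200, 12
p1_hat, p2_hat = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)
z = (p1_hat - p2_hat) / (p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) ** 0.5
print(z, norm.sf(z))   # z ~ 0.65 (the value 0.658 above uses p-hat rounded to 0.066); P-value ~ 0.26
```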

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 334 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 335 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Two Means - A Point Estimator


We have two populations with different means µ1 and µ2 ; our goal is to
estimate the difference µ1 − µ2 by taking one sample from each
population in such a way that the selection of one sample does not
influence the selection of the other (they are independent).
The natural point estimator is the difference of the sample means,

µ̂1 − µ̂2 = X̄1 − X̄2 .
To determine confidence intervals and to test hypotheses we need to know
the distribution of X 1 − X 2 .

2.5.1. Theorem. Let X 1 and X 2 be the sample means based on


independent random samples of sizes n1 and n2 drawn from normal
distributions with means µ1 and µ2 and variance σ12 and σ22 , respectively.
Then X 1 − X 2 is normal with mean µ1 − µ2 and variance σ12 /n1 + σ22 /n2 .
As usual, the Central Limit theorem allows us to apply this result even to
non-normal populations if we have large sample sizes.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 336 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Variances

We have seen when estimating the mean of a single distribution that the
key lies in understanding the distribution of the variance. Therefore, we
first consider the comparison of variances.
We consider the two cases
1. H0 : σ12 = σ22 , H1 : σ12 > σ22 (right-tailed test)
2. H0 : σ12 = σ22 , H1 : σ12 ̸= σ22 (two-tailed test)
For comparing variances, we prefer to consider the quotient instead of the
difference of the estimators: if the sample variances are S1² and S2², we
expect S1²/S2² to be close to 1 when the null hypothesis is true, while we
reject the null hypothesis if the quotient is much larger than 1.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 337 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Variances - The F -Test

Recall that (n − 1)S 2 /σ 2 follows a χ2 -distribution with n − 1 degrees of


freedom. We will need to consider the distribution of the quotient of two
sample variances, so we introduce the following

2.5.2. Definition. Let X²_{γ1} and X²_{γ2} be independent χ² random variables
with γ1 and γ2 degrees of freedom, respectively. The random variable

F_{γ1,γ2} = (X²_{γ1}/γ1)/(X²_{γ2}/γ2)

follows what is called an F distribution with γ1 and γ2 degrees of freedom.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 338 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Variances - The F -Test


2.5.3. Theorem. Let S12 and S22 be sample variances based on independent
random samples of sizes n1 and n2 drawn from normal populations with
means µ1 and µ2 and variances σ12 and σ22 , respectively.
If σ12 = σ22 , then the statistic
S12 /S22
follows an F distribution with n1 − 1 and n2 − 1 degrees of freedom.

Proof.
We know that (n1 − 1)S1²/σ1² and (n2 − 1)S2²/σ2² follow χ²-distributions
with n1 − 1 and n2 − 1 degrees of freedom, respectively, and they are
independent since the samples are independent. Then

F_{n1−1,n2−1} = [((n1 − 1)S1²/σ1²)/(n1 − 1)] / [((n2 − 1)S2²/σ2²)/(n2 − 1)] = (σ2² S1²)/(σ1² S2²).

If σ1² = σ2², this reduces to S1²/S2².


Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 339 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Variances - The F -Test

There are some assumptions and restrictions underlying the F -test:


1. Normality is essential,
2. The sample sizes should be equal,
3. The test is not very powerful (there is a high probability of
committing a Type II error)

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 340 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Variances - The F -Test

2.5.4. Example. Chemical etching is used to remove copper from printed


circuit boards. X1 and X2 represent process yields when two different
concentrations are used. Suppose that we wish to test H0 : σ12 = σ22 ,
H1 : σ12 ̸= σ22 .
Two samples of sizes n1 = n2 = 8 yield s1² = 4.02 and s2² = 3.89, and

s1²/s2² = 4.02/3.89 = 1.03.

If α = 0.05, we find that the critical value for F7,7 is 3.787. Therefore,
there is not enough evidence to reject H0 .
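A sketch of this comparison (assuming SciPy is available):

```python
# F test of Example 2.5.4 (illustrative sketch).
from scipy.stats import f

n1, n2, s1_sq, s2_sq = 8, 8, 4.02, 3.89
F0 = s1_sq / s2_sq                         # ~1.03
print(F0, f.ppf(0.95, n1 - 1, n2 - 1))     # critical value F_{0.05,7,7} ~ 3.787
```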

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 341 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Means - Variances Equal


By Theorem 2.5.1, we know that

[(X̄1 − X̄2 ) − (µ1 − µ2 )]/√(σ1²/n1 + σ2²/n2 )

is standard normal. If σ1² = σ2² =: σ², this reduces to

[(X̄1 − X̄2 ) − (µ1 − µ2 )]/√(σ²(1/n1 + 1/n2 )),

and we are faced with the task of estimating σ². We define the pooled
estimator

Sp² = [(n1 − 1)S1² + (n2 − 1)S2²]/(n1 + n2 − 2).

Then (n1 + n2 − 2)Sp²/σ² will follow a χ²-distribution with n1 + n2 − 2
degrees of freedom.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 342 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Means - Variances Equal

Thus

[(X̄1 − X̄2 ) − (µ1 − µ2 )]/√(Sp²(1/n1 + 1/n2 ))

will follow a T -distribution with n1 + n2 − 2 degrees of freedom. We
obtain the following 100(1 − α)% confidence interval for µ1 − µ2 ,

(X̄1 − X̄2 ) ± t_{α/2,n1+n2−2} √(Sp²(1/n1 + 1/n2 )),

where t_{α/2,n1+n2−2} is the critical value of the T_{n1+n2−2} -distribution.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 343 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Means - Variances Equal

2.5.5. Example. In a batch chemical process used for etching circuit
boards, two different catalysts are being compared to determine whether
they require different immersion times for removal of identical quantities of
photo-resistant material.
Twelve batches were run with catalyst 1, resulting in a sample mean
immersion time of x̄1 = 24.6 minutes and a sample standard deviation of
s1 = 0.85 minutes. Fifteen batches were run with catalyst 2, resulting in a
mean immersion time of x̄2 = 22.1 minutes and a standard deviation of
s2 = 0.98 minutes.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 344 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Means - Variances Equal

We will find a 95% confidence interval on the difference in means µ1 − µ2


assuming that the variances of the two populations are equal. The pooled
estimate for the variance gives

sp² = [(n1 − 1)s1² + (n2 − 1)s2²]/(n1 + n2 − 2) = 0.8557,
so sp = 0.925. Since t0.025,25 = 2.060, we obtain

µ1 − µ2 = (2.5 ± 0.74) minutes
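The interval can be reproduced from the summary statistics (a sketch assuming SciPy is available):

```python
# Pooled-variance confidence interval of Example 2.5.5 (illustrative sketch).
from scipy.stats import t

n1, xbar1, s1 = 12, 24.6, 0.85
n2, xbar2, s2 = 15, 22.1, 0.98
sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)        # ~0.8557
half_width = t.ppf(0.975, n1 + n2 - 2) * (sp2 * (1 / n1 + 1 / n2)) ** 0.5
print(xbar1 - xbar2 - half_width, xbar1 - xbar2 + half_width)          # ~1.76, ~3.24
```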

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 345 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Means - Variances Equal

We can use the previous results for the testing of hypotheses on the
difference of means. If µ1 − µ2 = (µ1 − µ2 )0 (the null value of the
difference of means), the statistic

T_{n1+n2−2} = [X̄1 − X̄2 − (µ1 − µ2 )0 ]/√(Sp²(1/n1 + 1/n2 ))

follows a T -distribution with n1 + n2 − 2 degrees of freedom.


While any value for (µ1 − µ2 )0 can be tested, the most common in
applications is (µ1 − µ2 )0 = 0 (testing for equality of means). We then
generally test H0 : µ1 = µ2 against H1 : µ1 ≷ µ2 (one-tailed test) or
H1 : µ1 ̸= µ2 (two-tailed test).
This type of test is known as a pooled T -test.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 346 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Means - Variances Equal


2.5.6. Example. Two catalysts are being analyzed to determine how they
affect the mean yield of a chemical process. Specifically, catalyst 1 is
currently in use, but catalyst 2 is acceptable. Since catalyst 2 is cheaper, it
should be adopted if it does not change the process yield. Suppose we
wish to test the hypotheses
H0 : µ1 = µ2 , H1 : µ1 ̸= µ2 .
Pilot data yields n1 = 8, x̄1 = 91.73, s1² = 3.89, n2 = 8, x̄2 = 93.75,
s2² = 4.02. Then

sp² = 3.96

and the test statistic is

(x̄1 − x̄2 )/(sp √(1/n1 + 1/n2 )) = −2.03.
Using α = 0.05, we find t0.025,14 = 2.145 and −t0.025,14 = −2.145, so H0
cannot be rejected.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 347 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Means - Variances Unequal


If the variances of the two populations are distinct, we also consider the
standard normal random variable

[(X̄1 − X̄2 ) − (µ1 − µ2 )]/√(σ1²/n1 + σ2²/n2 )

and estimate the variances to obtain the statistic

[(X̄1 − X̄2 ) − (µ1 − µ2 )]/√(S1²/n1 + S2²/n2 ).

We can expect this to also be a T -random variable, but it is not clear
what its degrees of freedom are.
A solution is to use the Smith-Satterthwaite approximation, setting

γ = (S1²/n1 + S2²/n2 )² / [ (S1²/n1 )²/(n1 − 1) + (S2²/n2 )²/(n2 − 1) ].

We round the values of γ down to the nearest integer.


Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 348 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Means - Paired Data

In some situations, we do not take independent samples from two different


populations, but rather the samples are naturally related to each other.
For example, we might test reaction times of a given set of people when
sober and when under the influence of alcohol.
In such situations, pairing of data is appropriate. Instead of considering
two random variables X and Y , we define a new random variable
D = X − Y , whose mean will be

µD = E[D] = E[X − Y ] = E[X ] − E[Y ] = µX − µY .

We can then analyze D using the methods for the mean of a single
random variable.
Hypothesis tests for paired data are called paired T -tests.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 349 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

The Wilcoxon Rank-Sum Test

If the sample sizes are small, the variances unequal, and/or the populations
non-normal, the T-tests may not yield good results. In such situations, a
non-parametric test is available that is nearly as good as a T-test.
The Wilcoxon rank-sum test tests two random variables X and Y for
equality. However, it is especially sensitive to differences in location. Hence
we usually state the hypotheses in terms of the medians MX and MY :
◮ H0 : MX = MY , H1 : MX > MY (right-tailed),
◮ H0 : MX = MY , H1 : MX < MY (left-tailed),
◮ H0 : MX = MY , H1 : MX ≠ MY (two-tailed).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 350 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

The Wilcoxon Rank-Sum Test

Assume that we have two random samples X1 , . . . , Xm and Y1 , . . . , Yn ,
m ≤ n. We pool the N = m + n observations and rank them from 1 to N from
smallest to largest, while retaining their group identity. The test statistic,
denoted by Wm , is the sum of the ranks associated with the smaller (X )
sample.
The reasoning is as follows: if X is located below Y , then the smaller
ranks will be associated with the X values. Thus we will reject H0 in favor
of H1 : MX < MY for small values of Wm .
For large values of m, Wm is approximately normal with mean
E[Wm ] = m(m + n + 1)/2 and variance Var[Wm ] = mn(m + n + 1)/12.
For paired data, we can use the sign test or the Wilcoxon signed-rank test
on D = X − Y as before.
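
A sketch of the rank-sum statistic and its normal approximation, assuming Python with NumPy and SciPy and hypothetical samples:

```python
# Wilcoxon rank-sum statistic W_m and its large-sample normal approximation.
import numpy as np
from scipy import stats

x = np.array([12.1, 11.8, 12.6, 11.9])         # smaller sample (m = 4), hypothetical
y = np.array([12.8, 13.0, 12.4, 13.3, 12.9])   # larger sample (n = 5), hypothetical
m, n = len(x), len(y)

ranks = stats.rankdata(np.concatenate([x, y]))  # ranks 1..(m+n) of the pooled observations
w_m = ranks[:m].sum()                           # sum of the ranks of the X sample

mean_w = m * (m + n + 1) / 2                    # E[W_m]
var_w = m * n * (m + n + 1) / 12                # Var[W_m]
z = (w_m - mean_w) / np.sqrt(var_w)             # approximately standard normal for large m
p_two_tailed = 2 * stats.norm.sf(abs(z))
print(w_m, z, p_two_tailed)
```

(The normal approximation is intended only for large samples; with m = 4 one would instead use tabulated critical values, as in the example that follows.)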

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 351 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

The Wilcoxon Rank-Sum Test


2.5.7. Example. The mean axial stress in tensile members used in an
aircraft structure is being studied. Two alloys are being investigated. Alloy
1 is a traditional material and alloy 2 is a new aluminum-lithium alloy that
is much lighter than the standard material. The sample data are arranged
in the following table:
Alloy 1 (psi) 3238 3195 3246 3190 3204
3254 3229 3225 3217 3241
Alloy 2 (psi) 3261 3187 3209 3212 3258
3248 3215 3226 3240 3234
We test

H0 : M1 = M2 , H1 : M1 ≠ M2 .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 352 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

The Wilcoxon Rank-Sum Test

The data are arranged in order and ranked as follows:


Alloy   Axial Stress   Rank      Alloy   Axial Stress   Rank
  2         3187          1        1         3229        11
  1         3190          2        2         3234        12
  1         3195          3        1         3238        13
  1         3204          4        2         3240        14
  2         3209          5        1         3241        15
  2         3212          6        1         3246        16
  2         3215          7        2         3248        17
  1         3217          8        1         3254        18
  1         3225          9        2         3258        19
  2         3226         10        2         3261        20

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 353 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

The Wilcoxon Rank-Sum Test

The sums of the ranks are W1 = 99 for alloy 1, and W2 = 111 for alloy 2.
By Table X in Appendix A, the critical values for α = 0.05 in a two-tailed
test are 79 and 131. Since neither sum of ranks is outside the interval
[79, 131], we cannot reject H0 .
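
The rank sums W1 = 99 and W2 = 111 can be checked directly from the data; a short sketch, assuming Python with NumPy and SciPy:

```python
# Verifying the rank sums in Example 2.5.7 (assumes SciPy's rankdata).
from scipy import stats

alloy1 = [3238, 3195, 3246, 3190, 3204, 3254, 3229, 3225, 3217, 3241]
alloy2 = [3261, 3187, 3209, 3212, 3258, 3248, 3215, 3226, 3240, 3234]

ranks = stats.rankdata(alloy1 + alloy2)   # ranks 1..20 of the pooled observations
w1 = ranks[:len(alloy1)].sum()            # 99.0
w2 = ranks[len(alloy1):].sum()            # 111.0
print(w1, w2)
```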

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 354 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Second Midterm Exam

The preceding material completes our introduction to statistical theory; we
will next look at some more specialized applications in statistics.
It encompasses all of the material that will be the subject of the Second
Midterm Exam. The exam will take place on Thursday, the 2nd of April,
during the usual lecture time.
For this exam a non-programmable calculator is required.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 355 / 355
