
Probabilistic Methods in Engineering

Dr. Horst Hohberger

University of Michigan - Shanghai Jiaotong University


Joint Institute

Spring Term 2009

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 1 / 355
Elements of Probability Theory

Part I

Elements of Probability Theory

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 2 / 355
Elements of Probability Theory

Introduction to Probability and Counting

Discrete Random Variables

Continuous Random variables

Joint Distributions

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 3 / 355
Elements of Probability Theory Introduction to Probability and Counting

Introduction to Probability and Counting

Discrete Random Variables

Continuous Random variables

Joint Distributions

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 4 / 355
Elements of Probability Theory Introduction to Probability and Counting

Probability - Historical Notes


Although random events have fascinated humanity for thousands of years
(archaeologists have discovered dice games of the ancient Babylonians),
the concept of probability has eluded mathematicians for a surprisingly
long time. There are two main reasons for this:
◮ The Greek philosophy of mathematics, which profoundly influenced
Western mathematics for millennia, viewed mathematics as an ideal,
precise description of abstract truths, in particular with regard to
geometry. It would not have entered the minds of the Greek
philosophers to attempt to describe random, unpredictable events
using such techniques.
◮ Random events were viewed as being decided by God or some other
higher power. In particular, bones were used to foretell the future and
unpredictable events like the throw of dice or the weather were viewed
as unknowable by mortals. The influential Christian church did not
encourage people to “question the decisions of God.”

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 5 / 355
Elements of Probability Theory Introduction to Probability and Counting

Probability - Classical Definition

In the 16th century, the Italian mathematician Cardano, who was a heavy
gambler, attempted to use mathematics to describe the outcome of
games. He hit upon the following definition, which is really a procedure for
calculating probabilities:
1.1.1. Definition. Let A be a random outcome (random event) of an
experiment (game) that may proceed in various ways. Assume each of
these ways is equally likely. Then the probability of the outcome A is
P(A) = (number of ways leading to outcome A) / (number of ways the experiment can proceed).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 6 / 355
Elements of Probability Theory Introduction to Probability and Counting

Classical Probability - Examples

1.1.2. Examples.
1. The experiment consists of flipping a coin. We are interested in the
probability of the coin landing heads up. The experiment can proceed
in two ways: the coin lands heads up or tails up. We assume each
event is equally likely, so the classical definition gives

P[coin lands heads up] = 1/2 = 0.5

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 7 / 355
Elements of Probability Theory Introduction to Probability and Counting

Classical Probability - Examples

1.1.3. Examples.
2. The experiment consists of rolling two 6-sided dice and summing the
results, so the possible outcomes are the numbers S = 2, 3, . . . , 12.
We are interested in the outcome S = 3. Each die will give results 1,
2, 3, 4, 5 or 6. We assume that each result is equally likely. There are
two ways we can get the outcome S = 3: either the first die’s result is
1 and the second die’s result is 2, or the first die gives 2 and the
second die gives 1. In total, the experiment can proceed in 6 × 6 = 36
different ways. Hence
P[3] = 2/36 = 1/18 ≈ 0.056
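As a quick sanity check of the classical definition, the counting in this example can be reproduced by brute-force enumeration. The following Python sketch (an illustration, not part of the original slides) lists all 36 equally likely outcomes of the two dice and counts those with sum 3.

```python
from itertools import product

# All 36 equally likely ways two six-sided dice can land.
outcomes = list(product(range(1, 7), repeat=2))

# Ways leading to the outcome S = 3.
favorable = [o for o in outcomes if sum(o) == 3]

# Classical definition: favorable ways / total ways.
p = len(favorable) / len(outcomes)
print(len(favorable), len(outcomes), p)  # 2 36 0.0555...
```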

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 8 / 355
Elements of Probability Theory Introduction to Probability and Counting

Historical Notes
Cardano’s work, although published, received little attention, and 100 years
later, in the middle of the 17th century, the two French mathematicians
Fermat and Pascal rediscovered his principles, also by considering games.
They discussed how to divide the jackpot if a game in progress is
interrupted. Imagine that Fermat and Pascal are playing a simple game,
whereby a coin is repeatedly tossed. Fermat wins as soon as the coin has
turned up heads six times, Pascal wins as soon as the coin has turned up
tails six times.
There are 24 gold pieces in the pot. Now the game is interrupted when
the coin has already turned up 5 tails and 4 heads. How to divide the pot?
Fermat can only win if the coin turns up heads two times in a row, and we
can calculate that he has a 1/4 chance of this happening. Therefore, he
receives 1/4 of the pot (6 gold pieces), while Pascal receives 3/4 (18 gold
pieces).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 9 / 355
Elements of Probability Theory Introduction to Probability and Counting

Tree Diagrams

We use a tree diagram to find the probability of two heads occurring:

[Tree diagram: the 1st toss branches into “heads” and “tails”; each branch leads to a 2nd toss, which again branches into “heads” and “tails”. The four leaves are: 2 heads, 1 head and 1 tail, 1 head and 1 tail, 2 tails.]

After two tosses, there are four possible outcomes. Only one of the four
outcomes involves two heads, so the probability of this happening is 1/4.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 10 / 355
Elements of Probability Theory Introduction to Probability and Counting

The Art of Counting

The strategy just illustrated, of using a tree diagram to evaluate
probabilities in a multi-stage process (each node represents a stage, or
step), shows us that counting is important. However, counting is also quite
difficult; more difficult, in fact, than you probably think.
The mathematical field that concerns itself with counting is known as
combinatorics. You probably have discussed some basic combinatorics in
school, and we will review some elementary concepts here.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 11 / 355
Elements of Probability Theory Introduction to Probability and Counting

Permutations in Calculus
There is a small difference in the use of the term “permutation” between
analysts and combinatorialists; we first recall the usage that is familiar from
calculus/linear algebra.

1.1.4. Definition. Let {x1 , . . . , xn } be a set of n distinguishable elements


(xk ̸= xj for j ̸= k). Then an injective map
π : {x1 , . . . , xn } → {x1 , . . . , xn }
is called a permutation of these elements.
Recall that a map f is called injective if
f (x1 ) = f (x2 ) ⇒ x1 = x2 .

1.1.5. Remark. A permutation of n elements is a finite map because the


domain and range are sets with a finite number of elements. Any injective
map from a finite set to itself is automatically bijective.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 12 / 355
Elements of Probability Theory Introduction to Probability and Counting

Permutations in Calculus

1.1.6. Notation. A permutation is defined on a set of n distinguishable


elements; instead of {x1 , . . . , xn } we can also simply write {1, . . . , n},
replacing the permutation of elements with a permutation of indices.
Recall that a function f is defined as a set of pairs of the form (x, f (x)),
where x is the independent variable. Thus we could define a permutation
π through a set of pairs {(1, π(1)), . . . , (n, π(n))}. In fact, we do represent
permutations in this way, but use a different notation, writing

π = \begin{pmatrix} 1 & 2 & \cdots & n \\ \pi(1) & \pi(2) & \cdots & \pi(n) \end{pmatrix}

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 13 / 355
Elements of Probability Theory Introduction to Probability and Counting

Permutations in Calculus
1.1.7. Example. Consider the set of 3 objects, {1, 2, 3}. Then one
permutation is given by the identity map,
π_0 = \begin{pmatrix} 1 & 2 & 3 \\ 1 & 2 & 3 \end{pmatrix},

while other permutations are given by

\begin{pmatrix} 1 & 2 & 3 \\ 3 & 1 & 2 \end{pmatrix},
\begin{pmatrix} 1 & 2 & 3 \\ 2 & 3 & 1 \end{pmatrix},
\begin{pmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{pmatrix},
\begin{pmatrix} 1 & 2 & 3 \\ 2 & 1 & 3 \end{pmatrix},
\begin{pmatrix} 1 & 2 & 3 \\ 1 & 3 & 2 \end{pmatrix}.

1.1.8. Notation. Instead of writing out the map as above, we often simply
give the ordered n-tuple (π(1), . . . , π(n)). For example, instead of
\begin{pmatrix} 1 & 2 & 3 \\ 3 & 1 & 2 \end{pmatrix} we might write (3, 1, 2).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 14 / 355
Elements of Probability Theory Introduction to Probability and Counting

Permutations in Calculus
We have previously encountered permutations in the definition of the
determinant of an n × n matrix using the Leibniz formula; there one sums
over all permutations of the index set {1, . . . , n} and also includes the sign
of the permutation, a concept we won’t elaborate on here.

1.1.9. Lemma. There are n! permutations π of the set {1, . . . , n}.

Proof.
We consider the number of possible values of π(k), k = 1, . . . , n. The first
value, π(1), can be any of the n elements of {1, . . . , n}. The second value
can be any element of {1, . . . , n} \ {π(1)}. Hence there are n possible
values of π(1), but only n − 1 possible values of π(2). In general, there are
n − k + 1 possible values for π(k), so in total, there are

n · (n − 1) · · · (n − n + 1) = n!

different choices for the values (π(1), . . . , π(n)) of a permutation.


Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 15 / 355
Elements of Probability Theory Introduction to Probability and Counting

Permutations in Combinatorics
In combinatorics, permutations are regarded as arrangements in a definite
order. Clearly, the ordered tuple (π(1), . . . , π(n)) is an arrangement of
(1, . . . , n) in a definite order, i.e., the calculus-based definition realizes this
goal.
However, in combinatorics one is often interested in first selecting r ≤ n
objects from {1, . . . , n} and then arranging these objects in a definite
order. We realize this by the following, more general definition:

1.1.10. Definition. Let n, r ∈ N \ {0} with r ≤ n. Then an injective map

π : {1, . . . , r } → {1, . . . , n}

is called a permutation of r elements from {1, . . . , n}.

1.1.11. Remark. For r = n we regain our previous definition.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 16 / 355
Elements of Probability Theory Introduction to Probability and Counting

Permutations in Combinatorics
1.1.12. Notation. We again write π as
\begin{pmatrix} 1 & 2 & \cdots & r \\ \pi(1) & \pi(2) & \cdots & \pi(r) \end{pmatrix}  or  (π(1), . . . , π(r)),
where π(k) ∈ {1, . . . , n}, k = 1, . . . , r .

1.1.13. Example. Let n = 3, r = 2. Then the following are permutations of


two elements from {1, 2, 3}:

(1, 2), (2, 1), (1, 3), (3, 1), (2, 3), (3, 2).

1.1.14. Theorem. There are

n · (n − 1) · · · (n − r + 1) = \frac{n!}{(n − r)!}

different permutations of r elements from {1, . . . , n}.
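To connect Theorem 1.1.14 with the tuple notation of 1.1.12, here is a small Python check (illustrative only, with n and r chosen arbitrarily): enumerate all permutations of r = 2 elements from {1, 2, 3} and compare the count with n!/(n − r)!.

```python
from itertools import permutations
from math import factorial

n, r = 3, 2
tuples = list(permutations(range(1, n + 1), r))  # ordered r-tuples without repetition
print(tuples)                             # (1, 2), (1, 3), (2, 1), ...
print(len(tuples))                        # 6
print(factorial(n) // factorial(n - r))   # n!/(n-r)! = 6
```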
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 17 / 355
Elements of Probability Theory Introduction to Probability and Counting

Combinations

A related question arising in combinatorics is the number of selections of


r elements from a set of n elements, where we do not care about the
order. Such selections are called combinations.
1.1.15. Example. Let n = 3, r = 2. Then the following are combinations of
two elements from {1, 2, 3}:

{1, 2}, {1, 3}, {2, 3}.

Note that a selection consists of sets (which are unordered) while a


permutation consists of ordered tuples.

1.1.16. Definition. A combination of r elements from {1, . . . , n} is a subset
A ⊂ {1, . . . , n} with |A| = r elements.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 18 / 355
Elements of Probability Theory Introduction to Probability and Counting

Combinations
1.1.17. Theorem. There are

\binom{n}{r} = \frac{n!}{r!(n − r)!}

combinations of r elements from {1, . . . , n}.
Proof.
We first consider permutations of r objects from {1, . . . , n}. Every
permutation gives us a combination, by identifying

(π(1), . . . , π(r )) −→ {π(1), . . . , π(r )}.

Obviously, more than one permutation will give us the same combination.
Note that any permutation of (π(1), . . . , π(r )) will be a permutation of r
objects from {1, . . . , n}. Furthermore, any permutation of (π(1), . . . , π(r ))
will yield the same combination. For each tuple (π(1), . . . , π(r )), there are
r ! permutations of (π(1), . . . , π(r )), so we need to divide the total number
of permutations of r objects from {1, . . . , n} by r !.
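Theorem 1.1.17 is easy to verify numerically for small n and r. The sketch below (not from the slides; the values n = 5, r = 3 are just an example) enumerates the unordered selections directly and compares the count with n!/(r!(n − r)!).

```python
from itertools import combinations
from math import comb, factorial

n, r = 5, 3
subsets = list(combinations(range(1, n + 1), r))          # unordered selections
print(len(subsets))                                        # 10
print(factorial(n) // (factorial(r) * factorial(n - r)))   # 10
print(comb(n, r))                                          # built-in binomial coefficient, also 10
```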
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 19 / 355
Elements of Probability Theory Introduction to Probability and Counting

Combinations as Permutations of Indistinguishable Objects

We can regard the previous situation in a different light: a combination


of r elements of {1, . . . , n} is similar to a permutation, but without regard
to order. Fundamentally, if we do not order the selected objects, we regard
them as indistinguishable once they have been selected. (We can not say
“Take the fifth selected object,” but merely, “Take a selected object.” We
do not distinguish between the selected objects.)
Therefore, a combination can be regarded as a permutation of r
indistinguishable objects from {1, . . . , n}.
Effectively, this corresponds to a division of {1, . . . , n} into two classes, a
subset consisting of r elements and its complement.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 20 / 355
Elements of Probability Theory Introduction to Probability and Counting

Sorting Into Classes

We can hence reformulate Theorem 1.1.17 on combinations with regard to


the problem of sorting elements of {1, . . . , n} into two classes (sets) A1
and A2 of specified size.

1.1.18. Theorem. Let N := {1, . . . , n}. Then there are


\frac{|N|!}{|A_1|!\,|A_2|!} = \frac{n!}{n_1!\,n_2!} = \frac{n!}{r!(n − r)!} = \binom{n}{r}

possible ways of dividing N into two sets A1 and A2 , where


◮ A1 ⊂ N , |A1 | = n1 = r ,
◮ A2 = N \ A1 , |A2 | = n2 = n − r .

Of course, this result can be generalized!

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 21 / 355
Elements of Probability Theory Introduction to Probability and Counting

Sorting Into Classes

1.1.19. Theorem. Let N := {1, . . . , n}. Then there are

\frac{n!}{n_1!\, n_2! \cdots n_k!}

possible ways of dividing N into k sets A1 , . . . , Ak , where
◮ N = \bigcup_{i=1}^{k} A_i with the A_i pairwise disjoint (A_i ∩ A_j = ∅ for i ≠ j),
◮ |Ai | = ni , i = 1, . . . , k.

1.1.20. Remark. Since we are only interested in dividing a number of


elements into classes, and do not distinguish between the elements within
each class, this sorting into classes is also called permutation of
indistinguishable objects.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 22 / 355
Elements of Probability Theory Introduction to Probability and Counting

Sorting Into Classes


Proof of Theorem 1.1.19.
We first count the number of ways of sorting n1 elements into A1 :
\binom{n}{n_1} = \frac{n!}{n_1!(n − n_1)!}.

Next we count the ways of sorting n2 elements of the remaining n − n1


elements into A2 :
\binom{n − n_1}{n_2} = \frac{(n − n_1)!}{n_2!(n − n_1 − n_2)!}.

The total number of ways of sorting n1 elements into A1 and n2 elements


into A2 is then the product
\binom{n}{n_1}\binom{n − n_1}{n_2} = \frac{n!}{n_1!(n − n_1)!} \cdot \frac{(n − n_1)!}{n_2!(n − n_1 − n_2)!} = \frac{n!}{n_1!\,n_2!\,(n − n_1 − n_2)!}

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 23 / 355
Elements of Probability Theory Introduction to Probability and Counting

Sorting Into Classes


Proof of Theorem 1.1.19 (continued).
Proceeding in this manner, we find there are
\binom{n}{n_1} \prod_{i=1}^{k−1} \binom{n − \sum_{j=1}^{i} n_j}{n_{i+1}}
  = \frac{n!}{n_1!(n − n_1)!} \prod_{i=1}^{k−1} \frac{\bigl(n − \sum_{j=1}^{i} n_j\bigr)!}{n_{i+1}!\,\bigl(n − \sum_{j=1}^{i+1} n_j\bigr)!}
  = \frac{n! \prod_{i=1}^{k−1} \bigl(n − \sum_{j=1}^{i} n_j\bigr)!}{n_1!(n − n_1)! \prod_{i=1}^{k−1} n_{i+1}! \prod_{i=1}^{k−1} \bigl(n − \sum_{j=1}^{i+1} n_j\bigr)!}
  = \frac{n! \prod_{i=2}^{k−1} \bigl(n − \sum_{j=1}^{i} n_j\bigr)!}{\prod_{i=1}^{k} n_i! \prod_{i=1}^{k−2} \bigl(n − \sum_{j=1}^{i+1} n_j\bigr)!}
  = \frac{n!}{\prod_{i=1}^{k} n_i!}

ways of sorting all objects into classes A1 , . . . , Ak , |Ai | = ni .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 24 / 355
Elements of Probability Theory Introduction to Probability and Counting

Sample Spaces and Sample Points


These techniques of counting permutations, combinations and sorting into
classes are very useful for evaluating probabilities by the classical
definition. First, however, we need to translate physical situations into a
mathematical context.
1.1.21. Definition. A sample space for an experiment is a set S such that
each physical outcome of the experiment corresponds to exactly one
element of S.
An element of S is called a sample point.

1.1.22. Remark. Not every element of a sample space needs to correspond


to the outcome of an experiment. It is often convenient to select a very
large, but simple sample space. For example, for the roll of a six-sided die,
we can take S = N, where the result of the die corresponds to the natural
numbers 1,2,3,4,5 or 6 and all other natural numbers are not used. All
natural numbers in this example are sample points, even though only six
actually correspond to physical reality.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 25 / 355
Elements of Probability Theory Introduction to Probability and Counting

Events

1.1.23. Definition. Any subset A of a sample space S is called an event.


Two events A1 , A2 are called mutually exclusive if A1 ∩ A2 = ∅.

1.1.24. Example. We roll a four-sided die 10 times. Then a possible sample


space is S = N^10 and a sample point is a 10-tuple, for example
(1, 2, 3, 2, 3, 3, 1, 1, 4, 4) ∈ N^10 . This sample point corresponds to first
rolling a one, then a 2, next a 3, followed by a 2, a 3, a 3, two ones and
two fours.
An event might correspond to “rolling at least two fours” in which case
this would be a subset A ⊂ S such that each 10-tuple in A has at least
two entries equal to 4. For example, (1, 2, 3, 2, 3, 3, 1, 1, 4, 4) ∈ A but
(1, 2, 3, 2, 3, 3, 1, 1, 3, 4) ∉ A.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 26 / 355
Elements of Probability Theory Introduction to Probability and Counting

Example
1.1.25. Example. We roll a four-sided die 10 times. What is the probability
of obtaining 5 ones, 3 twos, 1 three and 1 four?
There are 4^10 = 1048576 possibilities for the 10-tuple of results of the die
rolls, corresponding to that many sample points in S = N^10 that
correspond to physical results. The event A consists of all ordered
10-tuples containing 5 ones, 3 twos, 1 three and 1 four. There are
\frac{10!}{5!\,3!\,1!\,1!} = 5040
possible ways of obtaining 5 ones, 3 twos, 1 three and 1 four, so there are
that many elements in A. The probability is
\frac{5040}{1048576} ≈ 0.00481 ≈ 0.5%.
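The arithmetic of this example can be checked in a few lines of Python (a sketch, not part of the slides); for this small case the count 5040 can even be confirmed by enumerating all 4^10 outcome tuples.

```python
from itertools import product
from math import factorial

# Multinomial coefficient 10! / (5! 3! 1! 1!)
ways = factorial(10) // (factorial(5) * factorial(3) * factorial(1) * factorial(1))
total = 4 ** 10
print(ways, total, ways / total)   # 5040 1048576 0.00480...

# Brute-force check over all 4^10 outcome tuples (feasible for this small case).
count = sum(1 for t in product(range(1, 5), repeat=10)
            if (t.count(1), t.count(2), t.count(3), t.count(4)) == (5, 3, 1, 1))
print(count)   # 5040
```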

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 27 / 355
Elements of Probability Theory Introduction to Probability and Counting

A Note on Numerical Values

As seen in the above example, we will round numerical values to a suitable


number of significant digits.
Hence 0.6 and 0.60 mean two different things, as you recall from your
physics lectures.
ALWAYS put a leading zero before the decimal point: writing .6 is not
acceptable for engineers, always write 0.6.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 28 / 355
Elements of Probability Theory Introduction to Probability and Counting

Counting Mountains
We will now look at a non-trivial example of counting. Consider the
following problem: We wish to draw “mountains” using upstrokes and
downstrokes, like this:
◮ One upstroke, one downstroke:
[Figure: a single peak, one upstroke followed by one downstroke.]

◮ Two upstrokes, two downstrokes:

[Figure: the two admissible mountains: up-down-up-down (two small peaks) and up-up-down-down (one tall peak).]

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 29 / 355
Elements of Probability Theory Introduction to Probability and Counting

Counting Mountains

◮ Three upstrokes, three downstrokes:

[Figure: the admissible mountains built from three upstrokes and three downstrokes.]
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 30 / 355
Elements of Probability Theory Introduction to Probability and Counting

Counting Mountains
The mountains must each consist of n upstrokes and n downstrokes, and
there may be no “valleys,” i.e., no stroke may lie under the starting point.
That means that the following constructions are not allowed:

[Figure: two disallowed constructions, each dipping below the starting level.]

The question is: for given n, how many such mountains using n up- and n
downstrokes can we draw?
The answer is related to many diverse combinatorial problems.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 31 / 355
Elements of Probability Theory Introduction to Probability and Counting

Counting Mountains

We would like to consider every mountain as a “word” or “string”


consisting of the “letters” or “characters” ↑ and ↓. For instance, we would
write

[Figure: a mountain built from three upstrokes and three downstrokes]

as W(3 ↑, 3 ↓) = ↑↑↓↓↑↓.
However, not all such words lead to admissible mountains. For
example, ↑↓↓↑↓↑ does not give a mountain.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 32 / 355
Elements of Probability Theory Introduction to Probability and Counting

Dyck Words and Catalan Numbers


Clearly, we need to impose two conditions on a word w :
1. #(↓) = #(↑) = n
2. In any word consisting of the first k letters of w , #(↑) ≥ #(↓)
(k = 1, . . . , 2n)
Words satisfying these conditions are called Dyck words.

1.1.26. Theorem. The number of Dyck words of length 2n, n ∈ N \ {0}, is


C_n = \frac{1}{n + 1} \binom{2n}{n}.

Cn is called the nth Catalan number.

A proof will be given in the recitation class. If you think that you are good
at counting, I challenge you to find a proof by yourselves before the class!
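If you would like to experiment before the recitation, the following Python sketch (not part of the slides) counts Dyck words by brute force for small n and compares the result with C_n = \binom{2n}{n}/(n + 1).

```python
from itertools import product
from math import comb

def is_dyck(word):
    """word is a tuple of +1 (up) and -1 (down); check both Dyck conditions."""
    height = 0
    for step in word:
        height += step
        if height < 0:          # prefix condition #(up) >= #(down) violated
            return False
    return height == 0          # equal numbers of ups and downs

for n in range(1, 6):
    brute = sum(is_dyck(w) for w in product((1, -1), repeat=2 * n))
    catalan = comb(2 * n, n) // (n + 1)
    print(n, brute, catalan)    # the two counts agree: 1, 2, 5, 14, 42
```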

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 33 / 355
Elements of Probability Theory Introduction to Probability and Counting

Historical Notes - Independence

Towards the end of the 17th century, the French mathematician de Moivre
had to flee to England, as he was a Protestant (Huguenot), and Protestants were
being persecuted in France at the time. However, in England he couldn’t
get a job, since he was French. (The English did not like or trust the
French.) So he ended up spending a lot of time in coffee houses.
There he earned money by helping gamblers resolve disputes (such as how
to divide the pot in an interrupted game). His experiences led him to the
concept of independence of events - a coin does not remember its last
result, and will always have the same chance of turning heads up, no
matter how often it has done so in the past. We will explore this essential
concept in more detail later.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 34 / 355
Elements of Probability Theory Introduction to Probability and Counting

Historical Notes - Law of Large Numbers

In contrast, experience tells us that if we throw a fair coin lots of times, it


should not come heads up all the time; this was formulated by Jacob
Bernoulli in the early 18th century, who stated the Law of Large Numbers.
Essentially, this law says that if an experimental outcome A happens with
probability p, and if we conduct “sufficiently many” experiments, then a
proportion close to p · 100% of the experimental outcomes should be A.
This statement is the basis of an empirical method of finding probabilities.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 35 / 355
Elements of Probability Theory Introduction to Probability and Counting

Probability - Relative Frequency Approximation

1.1.27. Definition. Let A be a random outcome (random event) of an


experiment. Then the probability P(A) of this event occurring may be
approximated by performing the experiment N times:

P[A] ≈ (number of times A occurs) / (number of times the experiment is performed) =: f(A)/N.

This is called the relative frequency approximation to P[A].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 36 / 355
Elements of Probability Theory Introduction to Probability and Counting

Probability - Relative Frequency Approximation


1.1.28. Examples.
1. An experiment consists of flipping a coin. We are interested in the
probability of the coin landing heads up. We perform the experiment
100 times and observe that the coin lands heads up 52 times. By the
relative frequency approximation,
P[coin lands heads up] ≈ 52/100 = 0.52.

2. An experiment consists of rolling two 6-sided dice and summing the


results, so the possible outcomes are the numbers S = 2, 3, . . . , 12.
We are interested in the outcome S = 3. We perform the experiment
200 times and observe this outcome 10 times. Hence
P[3] = P[S = 3] ≈ 10/200 = 0.05.
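The relative frequency approximation is easy to explore by simulation. The sketch below (illustrative only; the observed frequencies will vary from run to run) repeats both experiments of Example 1.1.28.

```python
import random

random.seed(0)  # for reproducibility of this illustration

# Experiment 1: flip a fair coin 100 times.
N = 100
heads = sum(random.random() < 0.5 for _ in range(N))
print(heads / N)        # relative frequency of heads, close to 0.5

# Experiment 2: roll two dice 200 times and record how often the sum is 3.
N = 200
threes = sum(random.randint(1, 6) + random.randint(1, 6) == 3 for _ in range(N))
print(threes / N)       # close to 2/36 = 0.056
```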

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 37 / 355
Elements of Probability Theory Introduction to Probability and Counting

Probability - Relative Frequency Approximation

1.1.29. Remark. You should be careful in making deductions from the law
of large numbers. For example, if a fair coin is tossed four times, it is more
likely to obtain a 3-1 split (3 heads and one tail or 3 tails and one head)
than it is to obtain a 2-2 split (2 heads and 2 tails). Verify this for yourself
by using the classical definition of probability!

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 38 / 355
Elements of Probability Theory Introduction to Probability and Counting

Classical Probability - Problems


You will have noticed that there are severe problems with the classical
definition of probability; it assumes that events are “equally likely ” - but
what does this mean? It means that the probabilities of each event
occurring are equal! So the definition of probability already presumes the
concept of probability - it is circular.
Such a “definition” is clearly useless!
In the beginning of the 20th century, the Russian mathematician
Kolmogorov decided to base probability on a solid mathematical
foundation. The classical definition of probability was not only shaky from
a logical point of view, it also didn’t help much when trying to treat things
like throwing a dart at a dart board.
He defined probability entirely in the abstract; the physical world is
translated into a mathematical set of events and probability is defined as a
certain function on this set. This removes all ambiguity from the
definition, but does not really tell us what probability means.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 39 / 355
Elements of Probability Theory Introduction to Probability and Counting

What is Probability?

Probability theory is mainly applied by statisticians. For them, abstract


axioms are meaningless. On the other hand, pure mathematicians cannot
easily deal with the “fuzzy” and at times inconsistent explanations of
probability with reference to the real world. If you are interested in the
subject, I advise you to google/baidu
◮ Frequentism and
◮ Bayesianism
for some schools of thought on what probability actually is. Be aware that
this will lead you to some fascinating philosophical questions!

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 40 / 355
Elements of Probability Theory Introduction to Probability and Counting

Axiomatic Definition of Probability

The upshot of Kolmogorov’s work for us is that we define probability in


the following way:
1.1.30. Definition. Let S be a sample space and P(S) the power set (set of
all subsets) of S. Then a function P : P(S) → R, A ↦ P[A], is called a
probability function (or just probability) on S if
1. P[A] ≥ 0 for all A ∈ P(S),
2. P[S] = 1,
3. for any countable collection of pairwise disjoint events {Ak } ⊂ P(S) (Ak ∩ Aj = ∅ for k ≠ j),

P[\bigcup_k A_k] = \sum_k P[A_k].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 41 / 355
Elements of Probability Theory Introduction to Probability and Counting

Axiomatic Definition of Probability

From the definition of the probability function P it follows immediately


that

P[∅] = 0, P[Ac ] = 1 − P[A],

where Ac = S \ A denotes the complement of A ⊂ S.


We can also derive the general addition rule

P[A1 ∪ A2 ] = P[A1 ] + P[A2 ] − P[A1 ∩ A2 ].

Simple examples may be discussed in recitation class!

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 42 / 355
Elements of Probability Theory Introduction to Probability and Counting

Probability Tree Diagrams

Now that we have a good grasp of the concept of probability, we can use
tree diagrams more generally. Consider the case of an unfair coin, which
turns heads up 60% of the time. We toss the coin twice:

[Tree diagram: the first toss gives heads with probability 0.6 or tails with probability 0.4; from either branch, the second toss again gives heads (0.6) or tails (0.4). The four leaves are:
P[2 heads] = 0.36, P[head, tail] = 0.24, P[tail, head] = 0.24, P[2 tails] = 0.16.]

Probabilities are multiplied along each branch to give the final probability,
and summed across sub-branches to give unity.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 43 / 355
Elements of Probability Theory Introduction to Probability and Counting

Conditional Probability
We now want to study the effect of information on probability. In
particular, two events may be related, such that the occurrence of one
event influences the probability that the other event occurs. For example,
event B may have a 50% chance of occurring, but if one also knows that
event A has previously occurred, then event B might have a higher or
lower chance of occurring.
1.1.31. Example. There is a test for the gender of an unborn child called
“starch gel electrophoresis” (this type of test is of course not done in
China!). It detects the presence of a protein zone called the pregnancy
zone. The following statistical information is known:
◮ 43% of all pregnant women have the pregnancy zone.
◮ 51% of all children born are male.
◮ 17% of all children are male and their mothers have the pregnancy
zone.
Given that the zone is present, what is the probability that the child is
male?
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 44 / 355
Elements of Probability Theory Introduction to Probability and Counting

Conditional Probability
We now use a classical probability tree,

[Tree diagram: “All pregnant women” splits into “zone absent” and “zone present”, with P[zone present] = 0.43 on the right branch; “zone present” then splits into “female” and “male”, with P[male | zone present] on the branch to “male” and P[male ∩ zone present] = 0.17 at that leaf.]
Here P[male | zone present] means the probability that the child is male,
given that the zone is present.
Since we multiply probabilities along branches of probability trees, it is
clear that
P[male | zone present] = \frac{P[male ∩ zone present]}{P[zone present]} = \frac{0.17}{0.43} ≈ 0.40

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 45 / 355
Elements of Probability Theory Introduction to Probability and Counting

Conditional Probability
The previous example motivates the general definition:
1.1.32. Definition. Let A, B be events and P(A) ̸= 0. Then we define the
conditional probability

P[B | A] := \frac{P[A ∩ B]}{P[A]}.

Conditional probability emphasizes information; our idea of the probability


of an event occurring can change radically if we obtain information about
related events. Mathematically, P(B) becomes P(B | A), which may be
dramatically less or greater than P(B). This concept of probability being
related to incomplete information is at the heart of Bayesianism. It is a
different approach to Frequentism, which takes a more empirical
approach based on the law of large numbers.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 46 / 355
Elements of Probability Theory Introduction to Probability and Counting

The Monty Hall Paradox

You are participating in a game show to win 10,000,000 RMB. The game
master [Monty Hall] presents you with three closed doors. Behind one of
the doors is the prize, behind the other two doors there is nothing. If you
open the correct door, you will receive the money, if you open one of the
other two doors you will not get anything.
Before opening any of the three doors, you can announce which door you
intend to open. Obviously, at least one of the other two doors does not
hide the money. The game master opens this (empty) door. You are then
given the option of either
◮ sticking with your choice or
◮ switching to the other closed door.
What do you do and does it make a difference?

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 47 / 355
Elements of Probability Theory Introduction to Probability and Counting

The Monty Hall Paradox

To many people it seems counter-intuitive, but the best course of action is


to change your choice to the other door. There will be a 2/3 probability
that the prize is behind the remaining door that you have not chosen!
Why is that? By opening the door, the game master has not given you
any information about the door you have chosen (he can always open one
of the remaining doors, no matter which door you choose). The
probability of this being the correct door was 1/3 before he opens the
other door, and it remains that way after he opens the door.
However, his opening a door does give you information on the other two
doors, namely, it tells you which of the other two doors does definitely not
hide the prize. The original 2/3 probability that one of these doors hides
the money is now concentrated on just the one door. Therefore, it is
advantageous for you to change your choice.
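A simulation makes the 1/3 versus 2/3 split easy to believe. The following Python sketch (not part of the slides) plays the game many times under both strategies; the rule for which empty door the host opens when the contestant has picked the prize is a modelling assumption here, but it does not affect the result.

```python
import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        prize = random.randrange(3)            # door hiding the money
        choice = random.randrange(3)           # contestant's announced door
        # Host opens an empty door that is not the contestant's choice.
        opened = next(d for d in range(3) if d != choice and d != prize)
        if switch:
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == prize)
    return wins / trials

print(play(switch=False))   # about 1/3
print(play(switch=True))    # about 2/3
```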

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 48 / 355
Elements of Probability Theory Introduction to Probability and Counting

Independence of Events

If one event does not influence another, then we say that the two events
are independent. Mathematically, we express this in the following way.
1.1.33. Definition. Let A, B be two events. We say that A and B are
independent if

P(A ∩ B) = P(A)P(B). (1.1.1)

Equation (1.1.1) is equivalent to

P(A | B) = P(A) if P(B) ̸= 0,


P(B | A) = P(B) if P(A) ̸= 0.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 49 / 355
Elements of Probability Theory Introduction to Probability and Counting

Independence of Events
1.1.34. Example. The birthdays (day and month) of a group of people are
generally assumed to be independent. Disregarding leap years, any person
is assumed to have a 1/365 chance of being born on a given day. (Do you
think that this is a reasonable assumption?) How many people should a
group have so that there is a better than even chance of two people in the
group having the same birthday?
We consider the complementary problem and start with a single person in
the group. If we add a second person, there is a 364/365 chance of them
not sharing a birthday. Adding a third person, for no two people to share a
birthday, this person must have his birthday on one of the other 363 days
of the year, so there is now a
364 363
365 365
chance of no two people in the group sharing a birthday.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 50 / 355
Elements of Probability Theory Introduction to Probability and Counting

Independence of Events
Continuing this argument, in a group of n ≥ 2 people there is a

\prod_{k=2}^{n} \frac{366 − k}{365} = \frac{1}{365^{n−1}} \cdot \frac{364!}{(365 − n)!}

chance of no two people having the same birthday. It turns out that for
n = 23 this number is less than 0.5, so the probability of two people
having the same birthday is > 0.5.
This statement has been verified empirically; in a soccer match there are
2 × 11 players + 1 referee on the pitch. On any given playing day in the
Premier Division of the English league, about half the games should
feature two participants with the same birthday.
Refer to the article Coincidences: The truth is out there, Teaching
Statistics, Vol. 1, No. 1, 1998 (uploaded to the Resources section on
SAKAI).
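The birthday computation is easy to do numerically. The sketch below (not part of the slides) evaluates the product for increasing group sizes and reports the first n for which the probability of a shared birthday exceeds 0.5.

```python
p_no_match = 1.0
n = 1
while True:
    n += 1
    p_no_match *= (366 - n) / 365      # the new person avoids the previous n-1 birthdays
    if 1 - p_no_match > 0.5:
        break
print(n, 1 - p_no_match)               # 23, about 0.507
```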
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 51 / 355
Elements of Probability Theory Introduction to Probability and Counting

Total Probability
Imagine you conduct a 2-stage experiment with two possible outcomes A
and B but three intermediate stages F , G and H. The probability tree for
this experiment looks like this:

P[F ] ccccccccccc[[[[[[[[[[[P[H]
ccccccc [[[[[[[[
ccccccc P[G ] [[[[[[
F G H
P[A|F ]qqqMMMP[B|F
M ] P[A|G ]qqqMMMP[B|G
M ] P[A|H]qqqMMMP[B|H]
M
q MMM q MMM q MMM
qqq qqq qqq
A B A B A B

Then the probability of obtaining outcome A is given by

P[A] = P[A | F ] · P[F ] + P[A | G ] · P[G ] + P[A | H] · P[H]

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 52 / 355
Elements of Probability Theory Introduction to Probability and Counting

Total Probability

The previous example gives rise to the following theorem:


1.1.35. Theorem. (Total Probability) Let A1 , . . . , An ⊂ S be a set of
mutually exclusive events whose union is S. Let B ⊂ S be any event. Then

P[B] = \sum_{j=1}^{n} P[B | A_j] · P[A_j].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 53 / 355
Elements of Probability Theory Introduction to Probability and Counting

Bayes’s Theorem

This immediately leads to the celebrated theorem of Bayes:


1.1.36. Theorem. (Bayes) Let A1 , . . . , An ⊂ S be a set of mutually exclusive events
whose union is S. Let B ⊂ S be any event such that P[B] ̸= 0. Then for any Ak ,
k = 1, . . . , n,

P[Ak | B] = \frac{P[B ∩ Ak]}{P[B]} = \frac{P[B | Ak] · P[Ak]}{\sum_{j=1}^{n} P[B | Aj] · P[Aj]}.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 54 / 355
Elements of Probability Theory Introduction to Probability and Counting

Bayes’s Theorem
1.1.37. Example. Blood type distribution in US:
◮ Type A: 41%
◮ Type B: 9%
◮ Type AB: 4%
◮ Type O: 46%
Measurement statistics:
◮ P[type A registered | true type O] = 4%
◮ P[type A registered | true type A] = 88%
◮ P[type A registered | true type B] = 4%
◮ P[type A registered | true type AB] = 10%
A test registers type A; what is the probability that this is correct?
Wanted: P[true type A | type A registered]

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 55 / 355
Elements of Probability Theory Introduction to Probability and Counting

Bayes’s Theorem
By Bayes’s Theorem,

P[true type A | type A registered]
  = \frac{P[type A registered | true type A] · P[true type A]}{P[type A registered]},
where the total probability of P[type A registered] is
P[type A registered]
= P[type A registered | true type O] · P[true type O]
+ P[type A registered | true type A] · P[true type A]
+ P[type A registered | true type B] · P[true type B]
+ P[type A registered | true type AB] · P[true type AB]
Inserting the numerical values,
P[true type A | type A registered] ≈ 0.93
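The numerical values of this example can be plugged in directly; a short Python sketch (illustrative only):

```python
prior = {"A": 0.41, "B": 0.09, "AB": 0.04, "O": 0.46}   # P[true type]
lik = {"A": 0.88, "B": 0.04, "AB": 0.10, "O": 0.04}     # P[type A registered | true type]

# Total probability of the test registering type A.
p_reg_A = sum(lik[t] * prior[t] for t in prior)

# Bayes's Theorem.
posterior_A = lik["A"] * prior["A"] / p_reg_A
print(p_reg_A, posterior_A)    # about 0.387 and 0.93
```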

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 56 / 355
Elements of Probability Theory Discrete Random Variables

Introduction to Probability and Counting

Discrete Random Variables

Continuous Random variables

Joint Distributions

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 57 / 355
Elements of Probability Theory Discrete Random Variables

Countable and Uncountable Sets

We first need to review the concept of countable and uncountable sets.


1.2.1. Definition. Let A be a set. Then we say that A is countable if either
◮ A is finite or
◮ there exists a bijection A → N.
If A is not countable, then we say that A is uncountable.

1.2.2. Examples.
1. The integer numbers are countable.
2. The rational numbers are countable.
3. The real numbers in the interval [0, 1] are uncountable.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 58 / 355
Elements of Probability Theory Discrete Random Variables

Discrete Random Variables


1.2.3. Definition. Let S be a sample space and Ω a countable subset of R.
A discrete random variable is a map X : S → Ω together with a function
fX : Ω → R with the properties that
1. fX ≥ 0 and
2. \sum_{x∈Ω} fX(x) = 1.
Then fX gives the probability that X assumes a given value x, i.e.,
fX (x) = P[X = x].
The function fX is called the probability function or distribution function
of the random variable X .
Please note the non-standard notation above. “X = x” makes no sense at
all, since X is a function and x is a number. What is meant is
P(X (p) = x), the probability of the random variable X applied to a
sample point p ∈ S yielding the value x. Like physics, probability theory
uses its own notation, which is often incompatible with that of “standard”
calculus.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 59 / 355
Elements of Probability Theory Discrete Random Variables

Discrete Random Variables


1.2.4. Example. Consider a class of 25 students. We test the blood type of
each student and count the total number of students with each blood
type. Then the sample space is
S = {(n1 , n2 , n3 , n4 ) : n1 , n2 , n3 , n4 ∈ N, n1 + n2 + n3 + n4 = 25},

where the quadruple (n1 , n2 , n3 , n4 ) means that n1 students have type A,


n2 have type B, n3 have type AB, n4 have type O.
A random variable X might denote the number of students with type A or
type B blood. Then
X : S → Ω = {0, 1, 2, . . . , 25}, (n1 , n2 , n3 , n4 ) 7→ x = n1 + n2 .
Assuming that the blood types of the students are independent and that
the distribution of the blood types among the students is the same as that
of the general population, data for the general population can be used to
assign a probability to each value x of X .
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 60 / 355
Elements of Probability Theory Discrete Random Variables

Discrete Random Variables

Please note:
◮ The map X : S → Ω alone is just a variable.
◮ If Ω is countable, we call X a discrete variable, otherwise a
continuous variable.
◮ X together with fX (i.e., the pair (X , fX )) is a discrete or continuous
random variable.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 61 / 355
Elements of Probability Theory Discrete Random Variables

Cumulative distribution
In the previous example, we might be interested in the probability that not
more than 10 out of 25 students have blood type A. The random variable
would be
X : S → Ω = {0, 1, 2, . . . , 25}, (n1 , n2 , n3 , n4 ) 7→ x = n1 .
with a certain density function fX . We would hence determine
P[X ≤ 10] = P[X = 0] + P[X = 1] + · · · + P[X = 10].
This is known as the cumulative probability, and we in fact define the
so-called cumulative distribution function of a random variable by
F (x) = P[X ≤ x].
For a discrete random variable,
F(x) = \sum_{y ≤ x} P[X = y] = \sum_{y ≤ x} fX(y).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 62 / 355
Elements of Probability Theory Discrete Random Variables

Expectation
Consider the rolling of a fair six-sided die. We are interested in the average
or expected value of the result. Since each result (numbers 1,2,3,4,5,6)
occurs with probability 1/6, we take the weighted sum:
(1/6) · 1 + (1/6) · 2 + (1/6) · 3 + (1/6) · 4 + (1/6) · 5 + (1/6) · 6 = 3.5.
The average result of a die roll is then 3.5, even though this result itself
can never occur.
1.2.5. Definition. Let (X , fX ) be a discrete random variable. Then the
expected value of X is
E[X] = \sum_{x∈Ω} x · fX(x),

provided that the sum (series) on the right converges absolutely.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 63 / 355
Elements of Probability Theory Discrete Random Variables

Expectation

Given a random variable X and a function H : Ω → R, the composition


H ◦ X will again be a random variable, albeit with a different probability
density function. If X is discrete, then so is H ◦ X .
1.2.6. Definition. Let (X , fX ) be a discrete random variable and H : Ω → R
some function. Then the expected value of H ◦ X is
E[H ◦ X] = \sum_{x∈Ω} H(x) · fX(x),

provided that the sum (series) on the right converges absolutely.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 64 / 355
Elements of Probability Theory Discrete Random Variables

Some Properties of the Expectation


1.2.7. Theorem.
1. Consider the random variable S → R given by p ↦ c, for every
sample point p ∈ S and a fixed number c ∈ R. Denote this random
variable by c also. Then
E[c] = c.

2. Let X be a random variable and c ∈ R. Then the composition of the


function H : R → R, H : y ↦ c · y with X is a random variable, and
we write cX := H ◦ X . Then
E[cX ] = c E[X ].

3. Let X , Y be two random variables that are independent (we will later
give a precise definition). Then
E[X + Y ] = E[X ] + E[Y ].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 65 / 355
Elements of Probability Theory Discrete Random Variables

Variance
The expectation of a random variable is its mean, the point about which
the values of the random variable fluctuate. What the mean does not
describe is the degree of this fluctuation.
1.2.8. Example. We roll three fair six-sided dice and sum the results. The
result of each die roll is a random variable X , Y and Z , respectively. The
random variable is X + Y + Z and its expected value is
E[X + Y + Z ] = E[X ] + E[Y ] + E[Z ] = 3.5 + 3.5 + 3.5 = 10.5
We can also roll a single twenty-sided (icosahedral) die. Then the
expected value of the result is
E[W] = \sum_{i=1}^{20} \frac{i}{20} = \frac{1}{20} \cdot \frac{20 · 21}{2} = 10.5
Although both W and X + Y + Z have the same mean, they assume a
totally different range of values and we would also find that
P[X + Y + Z = 9] ̸= P[W = 9].
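A simulation (illustrative only; the sample size is arbitrary) shows that the sum of three dice and a single twenty-sided die share the mean 10.5 but fluctuate very differently; compare the sample variances.

```python
import random
from statistics import mean, variance

random.seed(1)
N = 100_000

three_dice = [sum(random.randint(1, 6) for _ in range(3)) for _ in range(N)]
d20 = [random.randint(1, 20) for _ in range(N)]

print(mean(three_dice), variance(three_dice))   # mean ~10.5, variance ~8.75
print(mean(d20), variance(d20))                 # mean ~10.5, variance ~33.25
```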
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 66 / 355
Elements of Probability Theory Discrete Random Variables

Variance
We therefore want to measure not only the mean of a random variable,
but also its expected deviation from the mean. The deviation from the
mean is given by
X − E [X ],
but we are interested in its absolute size, so we square it and then take the
expected value.
1.2.9. Definition. Let X be a random variable with expectation E[X ]. Then
the variance of X is defined as
Var X = E[(X − E[X])^2].

1.2.10. Notation. For short (and especially in statistics) we write


E[X] = µ_X = µ, Var X = σ_X^2 = σ^2.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 67 / 355
Elements of Probability Theory Discrete Random Variables

Variance and Standard Deviation


From the definition of the variance we can easily calculate that

Var X = E[X^2] − E[X]^2.

If the random variable X assumes values with physical units (e.g., x


meters, x m), then the variance will have units that are the square of
these. This often does not make physical sense, since we want to measure
the expected deviation from the mean of a random variable, which should
have the same units.

1.2.11. Definition. Let X be a random variable with variance σ_X^2. Then we
define the standard deviation of X by

σ_X = \sqrt{Var X} = \sqrt{σ^2}.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 68 / 355
Elements of Probability Theory Discrete Random Variables

Some Properties of the Variance


1.2.12. Theorem.
1. Consider the random variable S → R given by p ↦ c, for every
sample point p ∈ S and a fixed number c ∈ R. Denote this random
variable by c also. Then
Var c = 0.

2. Let X be a random variable and c ∈ R. Then the composition of the
function H : R → R, H : y ↦ c · y with X is a random variable, and
we write cX := H ◦ X . Then
Var cX = c^2 Var X .

3. Let X , Y be two random variables that are independent. Then


Var[X + Y ] = Var[X ] + Var[Y ].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 69 / 355
Elements of Probability Theory Discrete Random Variables

The Geometric Distribution

The main interest in studying random variables lies in their distribution
functions fX . The actual values that X takes are just numbers; all
probabilistic information lies in the distribution of these numbers.
Many discrete probability density functions arise from experiments that
follow a simple set of rules. We first discuss the properties of a geometric
random variable, i.e., a random variable with a geometric distribution
function.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 70 / 355
Elements of Probability Theory Discrete Random Variables

The Geometric Distribution

Geometric Properties.
1. The experiment consists of a series of trials. The outcome of each
trial can be classed as being either a “success” (s) or a “failure” (f ).
A trial with this property is called a Bernoulli trial.
2. The trials are identical and independent in the sense that the
outcome of one trial has no effect on the outcome of any other. The
probability of success, p, remains the same for each trial.
3. The random variable X denotes the number of trials needed to obtain
the first success.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 71 / 355
Elements of Probability Theory Discrete Random Variables

The Geometric Distribution


1.2.13. Definition. A random variable (X , fX ) with

X : S → Ω = N \ {0}

and distribution function fX : N \ {0} → R given by

fX(x) = (1 − p)^{x−1} p, 0 < p < 1,

is said to have a geometric distribution with parameter p.

1.2.14. Lemma. The cumulative distribution function for a geometrically


distributed random variable (X , fX ) with parameter p is given by

F(x) = P[X ≤ x] = 1 − q^{⌊x⌋},

where q = 1 − p is the probability of failure and ⌊x⌋ denotes the greatest


integer less than or equal to x.
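Lemma 1.2.14 can be checked numerically. The sketch below (not from the slides; p = 0.3 is an arbitrary choice) compares the closed form 1 − q^⌊x⌋ with a direct sum of the probability function.

```python
from math import floor

p = 0.3
q = 1 - p

def f(x):                     # geometric probability function, x = 1, 2, 3, ...
    return q ** (x - 1) * p

def F_sum(x):                 # cumulative distribution by direct summation
    return sum(f(k) for k in range(1, floor(x) + 1))

for x in (1, 2.5, 4, 7):
    print(x, F_sum(x), 1 - q ** floor(x))   # the two columns agree
```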
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 72 / 355
Elements of Probability Theory Discrete Random Variables

Moments and the Moment-Generating Function

Finding the expectation value and variance for a geometric random


variable can be quite tricky if we use the definition directly. It will be even
more difficult for some distributions which we are still to encounter.
However, there exists a tool that allows us to employ all the power of
calculus to this problem.
1.2.15. Definition. Let (X , fX ) be a random variable. The kth ordinary
moment of X is defined as E[X^k].
Hence the expectation E[X ] is the first moment of X , while the variance
Var X = E[X^2] − E[X]^2 is the second moment minus the square of the first
moment. It turns out not only that many important properties of X can
be inferred from the knowledge of its moments, but also that the moments
can be derived easily from the so-called moment-generating function.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 73 / 355
Elements of Probability Theory Discrete Random Variables

Moments and the Moment-Generating Function


1.2.16. Definition. Let (X , fX ) be a random variable. Assume that there
exists some ε > 0 such that

E[e^{tX}] exists for −ε < t < ε.

Then the moment-generating function (m.g.f.) mX for X is defined as

mX : [−ε, ε] → R, mX(t) = E[e^{tX}].

1.2.17. Theorem. Let (X , fX ) be a random variable with
moment-generating function mX . Then

E[X^k] = \frac{d^k m_X(t)}{dt^k} \Big|_{t=0}

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 74 / 355
Elements of Probability Theory Discrete Random Variables

Moments and the Moment-Generating Function


Proof of Theorem 1.2.17.
We calculate directly the expectation value of e^{tX}:

mX(t) = E[e^{tX}] = E\Bigl[\sum_{n=0}^{\infty} \frac{t^n X^n}{n!}\Bigr] = \sum_{n=0}^{\infty} \frac{t^n}{n!} E[X^n].

Here we have (ab)used the various properties of the expectation.


Differentiating term-by-term,
\frac{d^k m_X(t)}{dt^k} = \sum_{n=0}^{\infty} \frac{d^k}{dt^k} \frac{t^n}{n!} E[X^n] = \sum_{n=k}^{\infty} \frac{t^{n−k}}{(n − k)!} E[X^n].

At t = 0, only the first term survives, so


\frac{d^k m_X(t)}{dt^k} \Big|_{t=0} = E[X^k].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 75 / 355
Elements of Probability Theory Discrete Random Variables

Moments and the Moment-Generating Function


We will also use the moment-generating function as a “fingerprint” for a
distribution: We treat the association “distribution 7→ m.g.f.” as being
injective.
In other words, if we know that a given distribution g has a certain
moment-generating function, and if we find that some random variable
(X , fX ) has the same m.g.f., then fX = g .
This is strictly speaking untrue. Such a statement actually holds for the
characteristic function
χ_X(t) = E[e^{itX}],

where i = \sqrt{−1}. This has to do with the fact that the m.g.f. can be
regarded as the Laplace transform of the indicator function on Ω, while
the characteristic function is its Fourier transform. There are stronger
analytic uniqueness results for the Fourier transform than there are for the
Laplace transform.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 76 / 355
Elements of Probability Theory Discrete Random Variables

The Geometric Distribution

We now tackle the geometric distribution:


1.2.18. Proposition. Let (X , fX ) be a geometrically distributed random
variable with parameter p. Then the moment-generating function for X is
given by

mX : (−∞, − ln q) → R, mX(t) = \frac{pe^t}{1 − qe^t},
where q = 1 − p.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 77 / 355
Elements of Probability Theory Discrete Random Variables

The Geometric Distribution

Proof.
We have fX(x) = q^{x−1} p for x ∈ N \ {0}. Then

mX(t) = E[e^{tX}] = \sum_{x=1}^{\infty} e^{tx} q^{x−1} p = \frac{p}{q} \sum_{x=1}^{\infty} (qe^t)^x.

This is a geometric series which converges for |qe^t| = qe^t < 1, i.e., for
t < − ln q. For such t, the limit is given by

mX(t) = \frac{p}{q} \sum_{x=1}^{\infty} (qe^t)^x = \frac{p}{q} \Bigl( \sum_{x=0}^{\infty} (qe^t)^x − 1 \Bigr) = \frac{p}{q} \Bigl( \frac{1}{1 − qe^t} − 1 \Bigr)
      = \frac{p}{q} \cdot \frac{qe^t}{1 − qe^t} = \frac{pe^t}{1 − qe^t}.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 78 / 355
Elements of Probability Theory Discrete Random Variables

The Geometric Distribution


1.2.19. Lemma. Let (X , fX ) be a geometrically distributed random variable
with parameter p. Then the expectation value and variance are given by
E[X] = \frac{1}{p}  and  Var X = \frac{q}{p^2},

where q = 1 − p.

Proof.
We use the moment-generating function to calculate the expectation value:
E[X] = \frac{d}{dt}\Big|_{t=0} mX(t) = \frac{d}{dt}\Big|_{t=0} \frac{pe^t}{1 − qe^t}
     = \frac{pe^t(1 − qe^t) + pe^t \cdot qe^t}{(1 − qe^t)^2}\Big|_{t=0} = \frac{p}{(1 − q)^2} = \frac{1}{p}.

The proof for the variance is similar and is left to the reader.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 79 / 355
Elements of Probability Theory Discrete Random Variables

The Binomial Distribution

We will now look at various further discrete distributions, starting with the
binomial distribution.
Binomial Properties.
1. The experiment consists of a fixed number n of Bernoulli trials.
2. The trials are identical and independent. The probability of success,
p, remains the same for each trial.
3. The random variable X denotes the number of successes in the n
trials.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 80 / 355
Elements of Probability Theory Discrete Random Variables

The Binomial Distribution

Since x successes among n trials can be achieved in \binom{n}{x} ways, and x
successes with probability p must be accompanied by n − x failures with
probability 1 − p, we see that

P[x successes in n trials] = \binom{n}{x} p^x (1 − p)^{n−x}.
This motivates the following
1.2.20. Definition. A random variable (X , fX ) with

X : S → Ω = {0, 1, 2, . . . , n}

and distribution function fX : Ω → R given by


fX(x) = \binom{n}{x} p^x (1 − p)^{n−x}, 0 < p < 1, n ∈ N \ {0}
is said to have a binomial distribution with parameters n and p.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 81 / 355
Elements of Probability Theory Discrete Random Variables

The Binomial Distribution


1.2.21. Theorem. Let (X , fX ) be a binomial random variable with
parameters n and p.
1. The moment generating function of X is given by
mX : R → R, mX(t) = (q + pe^t)^n , q = 1 − p.

2. E[X ] = np.
3. Var X = npq.
The proof of this theorem is left as an exercise. We will also later be very
interested in the cumulative distribution F , F (x) = P[X ≤ x]. There is no
simple way of evaluating the sums involved, so the values have been
tabulated (Table I of Appendix A in the textbook). The table gives the
values of
F(t) = \sum_{x=0}^{⌊t⌋} \binom{n}{x} p^x (1 − p)^{n−x},
where ⌊t⌋ is the greatest integer less than or equal to t.
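The formulas of Theorem 1.2.21 and the tabulated cumulative distribution are easy to reproduce directly; the following sketch (not tied to Table I of the textbook, with n = 10 and p = 0.4 chosen arbitrarily) sums the probability function.

```python
from math import comb, floor

n, p = 10, 0.4
q = 1 - p

def f(x):                      # binomial probability function
    return comb(n, x) * p**x * q**(n - x)

mean = sum(x * f(x) for x in range(n + 1))
var = sum(x**2 * f(x) for x in range(n + 1)) - mean**2
print(mean, n * p)             # both 4.0 (up to floating-point rounding)
print(var, n * p * q)          # both 2.4

def F(t):                      # cumulative distribution F(t) = P[X <= t]
    return sum(f(x) for x in range(floor(t) + 1))

print(F(3))                    # P[X <= 3], about 0.3823 for these parameters
```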
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 82 / 355
Elements of Probability Theory Discrete Random Variables

The Pascal (Negative Binomial) Distribution

While the binomial distribution counts the number of successes in n trials,


the Pascal distribution counts the number of trials needed for r successes.

Pascal Properties.
1. The experiment consists of a series of Bernoulli trials.
2. The trials are identical and independent. The probability of success,
p, remains the same for each trial.
3. The trials are observed until exactly r successes are obtained, where r
is fixed beforehand.
4. The random variable X is the number of trials needed to obtain the r
successes.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 83 / 355
Elements of Probability Theory Discrete Random Variables

The Pascal Distribution

We will try to derive a formula for the probability density associated with
negative binomial trials. First, notice that if we need x trials for r
successes, then in x − 1 trials we must have had exactly r − 1 successes
and therefore x − r failures. We know that the probability of this is
    P[exactly r − 1 successes in x − 1 trials] = C(x − 1, r − 1) p^{r−1} (1 − p)^{x−r}.

Now with probability p the xth trial will be a success, so


    P[obtain rth success in xth trial] = C(x − 1, r − 1) p^r (1 − p)^{x−r}.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 84 / 355
Elements of Probability Theory Discrete Random Variables

The Pascal Distribution


1.2.22. Definition. Let r ∈ N \ {0}. A random variable (X , fX ) with
X : S → Ω = N \ {0, 1, . . . , r − 1} = {r , r + 1, r + 2, . . .}
and distribution function fX : Ω → R given by
    f_X(x) = C(x − 1, r − 1) p^r (1 − p)^{x−r},    0 < p < 1,
is said to have a negative binomial distribution with parameters p and r .

1.2.23. Theorem. Let (X , fX ) be a negative binomial random variable with


parameters p and r .
1. The moment generating function of X is given by
   m_X : (−∞, −ln q) → R,    m_X(t) = (p e^t)^r / (1 − q e^t)^r,    q = 1 − p.
2. E[X] = r/p.
3. Var X = rq/p².
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 85 / 355
Elements of Probability Theory Discrete Random Variables

The Pascal Distribution

Proof.
We derive the moment-generating function only. It is given by

    m_X(t) = E[e^{tX}] = Σ_{x=r}^∞ e^{tx} C(x − 1, r − 1) p^r (1 − p)^{x−r}
           = Σ_{x=0}^∞ e^{t(r+x)} C(r + x − 1, r − 1) p^r (1 − p)^x
           = p^r e^{tr} Σ_{x=0}^∞ C(r − 1 + x, x) [e^t (1 − p)]^x.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 86 / 355
Elements of Probability Theory Discrete Random Variables

The Pascal Distribution


Proof (continued).
Note that the binomial series gives
    (1 − y)^{−r} = Σ_{x=0}^∞ C(−r, x) (−y)^x,

where

    C(−r, x) = [(−r)(−r − 1) ⋯ (−r − x + 1)] / x!
             = (−1)^x [r(r + 1) ⋯ (r + x − 1)] / x!
             = (−1)^x (r + x − 1)! / (x!(r − 1)!) = (−1)^x C(r − 1 + x, x).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 87 / 355
Elements of Probability Theory Discrete Random Variables

The Pascal Distribution

Proof (continued).
It follows that (1 − y)^{−r} = Σ_{x=0}^∞ C(r − 1 + x, x) y^x. Therefore,

    m_X(t) = p^r e^{tr} Σ_{x=0}^∞ C(r − 1 + x, x) [e^t (1 − p)]^x
           = p^r e^{tr} (1 − (1 − p) e^t)^{−r} = (p e^t)^r / (1 − q e^t)^r

with q = 1 − p.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 88 / 355
Elements of Probability Theory Discrete Random Variables

The Pascal Distribution


1.2.24. Example. The president of a large corporation makes decisions by
throwing darts at a board. The center section is marked “yes” and
represents a success. The probability of his hitting a “yes” is 0.6, and this
probability remains constant from throw to throw. The president continues
to throw until he has three “hits.” We denote X as the number of the trial
on which he experiences his third hit. The president’s decision rule is
simple: If he gets three hits on or before the fifth throw he decides in favor
of the question. What is the probability that he will decide in favor?

    P[X ≤ 5] = P[X = 3] + P[X = 4] + P[X = 5]
             = C(2, 2)(0.6)³(0.4)⁰ + C(3, 2)(0.6)³(0.4)¹ + C(4, 2)(0.6)³(0.4)²
             = 0.683.
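A short numerical check of this calculation using the Pascal density (a sketch; the parameters are those of the example):

    from math import comb

    p, r = 0.6, 3

    def pascal_pmf(x, r, p):
        """f_X(x) = C(x - 1, r - 1) p^r (1 - p)^(x - r)."""
        return comb(x - 1, r - 1) * p**r * (1 - p)**(x - r)

    # P[X <= 5] = P[X = 3] + P[X = 4] + P[X = 5]
    print(sum(pascal_pmf(x, r, p) for x in range(3, 6)))  # approximately 0.683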

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 89 / 355
Elements of Probability Theory Discrete Random Variables

The Hypergeometric Distribution


The hypergeometric distribution concerns trials that are not independent.
In order to construct such trials, Bernoulli introduced the idea of drawing
different colored balls from an urn without replacing them. Thus each draw
influences the distribution of the remaining balls.
Hypergeometric Properties.
1. The experiment consists of drawing a random sample of size n
without replacement and without regard to order from a collection of
N ≥ n objects.
2. Of the N objects, r have a trait that interests us; the other N − r do
not have the trait.
3. The random variable X is the number of objects in the sample with
the trait.

1.2.25. Remark. If the sampling were done with replacement, this would
correspond to a binomial situation.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 90 / 355
Elements of Probability Theory Discrete Random Variables

The Hypergeometric Distribution

In order to find a density function for this situation, we use the classical
definition of probability. Fix the number of objects N, the sample size n
and the number of objects with the trait, r .

P[exactly x objects with trait in sample]
    = [(# ways to select x out of r objects) · (# ways to select n − x out of N − r objects)]
      / (# ways to select n out of N objects)
    = C(r, x) C(N − r, n − x) / C(N, n).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 91 / 355
Elements of Probability Theory Discrete Random Variables

The Hypergeometric Distribution

1.2.26. Definition. Let N, n, r ∈ N \ {0}, r , n ≤ N. A random variable


(X , fX ) with

X : S → Ω = {x ∈ N : max(0, n − (N − r )) ≤ x ≤ min(n, r )}

and distribution function fX : Ω → R given by


    f_X(x) = C(r, x) C(N − r, n − x) / C(N, n)

is said to have a hypergeometric distribution with parameters N, n and r .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 92 / 355
Elements of Probability Theory Discrete Random Variables

The Hypergeometric Distribution

The hypergeometric distribution is tricky to treat explicitly. Without proof


or motivation, we give the following result:
1.2.27. Theorem. Let (X , fX ) be a hypergeometric distribution with
parameters N, n and r .
1. E[X] = n · (r/N).
2. Var X = n · (r/N) · ((N − r)/N) · ((N − n)/(N − 1)).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 93 / 355
Elements of Probability Theory Discrete Random Variables

The Hypergeometric Distribution

The hypergeometric distribution is used in acceptance sampling.


Suppose a retailer buys goods in lots and each item can be either
acceptable or defective. Let

N = # of items in a lot,

and
r = # of defectives in a lot.
Then we can calculate the probability that a sample of size n contains x
defectives.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 94 / 355
Elements of Probability Theory Discrete Random Variables

The Hypergeometric Distribution

1.2.28. Example. Suppose that a lot of 25 machine parts is delivered, where


a part is considered acceptable only if it passes tolerance. We sample 10
parts and find that none are defective (all are within tolerance). What is
the probability of this event if there are 6 defectives in the lot of 25?
Applying the hypergeometric distribution, N = 25, r = 6 and n = 10, we
have

    P[X = 0] = f_X(0) = C(r, 0) C(N − r, n) / C(N, n) = C(6, 0) C(19, 10) / C(25, 10) = 0.028,
showing that our observed event is quite unlikely if there are 6 defectives in
the lot.
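The value is easy to verify numerically; a minimal sketch with the data of this example:

    from math import comb

    def hypergeom_pmf(x, N, n, r):
        """f_X(x) = C(r, x) C(N - r, n - x) / C(N, n)."""
        return comb(r, x) * comb(N - r, n - x) / comb(N, n)

    print(hypergeom_pmf(0, 25, 10, 6))   # approximately 0.028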

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 95 / 355
Elements of Probability Theory Discrete Random Variables

Approximating the Hypergeometric Distribution

The hypergeometric distribution describes “sampling without


replacement.” In contrast, “sampling with replacement” is described by
the binomial distribution. Therefore, it makes sense that under certain
conditions the hypergeometric distribution may be approximated by the
binomial distribution.
The approximation is valid whenever the sampling fraction n/N is small.
Depending on whom you ask, “small” can mean n/N ≤ 0.05 or
n/N ≤ 0.1. Then the hypergeometric distribution can be approximated by
a binomial distribution with parameters n and p = r /N.
The smaller n/N is, the better the approximation.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 96 / 355
Elements of Probability Theory Discrete Random Variables

Approximating the Hypergeometric Distribution

1.2.29. Example. A production lot of 200 units has 8 defectives. A random


sample of 10 units is selected, and we want to find the probability that the
random sample will contain exactly one defective. The true probability is
    P[X = 1] = C(r, 1) C(N − r, n − 1) / C(N, n) = C(8, 1) C(192, 9) / C(200, 10) = 0.288.

Since n/N = 10/200 = 0.05, the approximation is applicable. With
p = r/N = 8/200 = 0.04, the binomial approximation gives
    P[X = 1] ≈ C(10, 1)(0.04)¹(0.96)⁹ = 0.277.
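The comparison is easy to reproduce numerically; a short sketch using the values of this example:

    from math import comb

    N, n, r = 200, 10, 8
    p = r / N

    exact = comb(r, 1) * comb(N - r, n - 1) / comb(N, n)   # hypergeometric
    approx = comb(n, 1) * p**1 * (1 - p)**(n - 1)          # binomial approximation

    print(exact, approx)   # approximately 0.288 and 0.277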

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 97 / 355
Elements of Probability Theory Discrete Random Variables

The Poisson Distribution

The Poisson distribution is used for discrete occurrences, called arrivals or


births, occurring randomly in a continuous time frame.
The random variable is Xt , the number of arrivals that occur in the time
interval [0, t] for some t > 0. Hence for any t, Xt : S → N.
We make the assumption that the numbers of arrivals during non-overlapping
time intervals T_1, T_2 with T_1 ∩ T_2 = ∅ are independent.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 98 / 355
Elements of Probability Theory Discrete Random Variables

The Poisson Distribution

We also assume that there exists some number λ > 0 such that for any
small time interval of size ∆t the following postulates are satisfied:
1. The probability that exactly one arrival will occur in an interval of
width ∆t is approximately λ · ∆t.
2. The probability that exactly zero arrivals will occur in the interval is
approximately 1 − λ · ∆t.
3. The probability that two or more arrivals occur in the interval is
approximately zero (very small).
We will formulate these postulates mathematically, and from them obtain
the density function for the Poisson distribution.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 99 / 355
Elements of Probability Theory Discrete Random Variables

The Poisson Distribution


We denote by o(t) any function f such that

    lim_{t→0} f(t)/t = 0.
Hence o(t) does not denote a particular function, rather a class of
functions. For example,
◮ t 2 = o(t),
◮ (1 + t)2 = 1 + 2t + o(t),
◮ sin t = t + o(t).
In particular,
◮ o(t) + o(t) = o(t),
◮ t n · o(t) = o(t) for all n ∈ N,
◮ o(t) · o(t) = o(t).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 100 / 355
Elements of Probability Theory Discrete Random Variables

The Poisson Distribution

We now formulate our postulates mathematically:


1. The probability that exactly one arrival will occur in an interval of
width ∆t is λ · ∆t + o(∆t).
2. The probability that exactly zero arrivals will occur in the interval is
1 − λ · ∆t + o(∆t).
3. The probability that two or more arrivals occur in the interval is
o(∆t).
From this, we can derive the probability density function fXt . We write

fXt (x) = P[Xt = x] =: px (t) for x = 0, 1, 2, 3, . . .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 101 / 355
Elements of Probability Theory Discrete Random Variables

The Poisson Distribution


The probability of having zero arrivals in the time interval [0, t + ∆t] is
given by the product of the probabilities of having zero arrivals in [0, t] and
[t, t + ∆t], respectively. We fix t and obtain

p0 (t + ∆t) = (1 − λ∆t + o(∆t))p0 (t) = (1 − λ∆t)p0 (t) + o(∆t).

so that
    −λ p_0(t) = [p_0(t + ∆t) − p_0(t)]/∆t + o(∆t)/∆t.
We can take the limit as ∆t → 0 on both sides. Then the fraction
o(∆t)/∆t vanishes by definition, and we have

    −λ p_0(t) = lim_{∆t→0} [p_0(t + ∆t) − p_0(t)]/∆t = p_0′(t).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 102 / 355
Elements of Probability Theory Discrete Random Variables

The Poisson Distribution


Now let x > 0. Then

px (t + ∆t) = λ∆t px−1 (t) + (1 − λ∆t)px (t) + o(∆t).

so that
    λ p_{x−1}(t) − λ p_x(t) = [p_x(t + ∆t) − p_x(t)]/∆t + o(∆t)/∆t.
We can take the limit as ∆t → 0 on both sides. Then the fraction
o(∆t)/∆t vanishes by definition, and we have

px′ (t) = λpx−1 (t) − λpx (t).

Together with p_0′ = −λ p_0 we hence have a system of differential equations
that can be solved inductively to determine p_0, p_1, p_2, . . .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 103 / 355
Elements of Probability Theory Discrete Random Variables

The Poisson Distribution


The solution to these equations is

    f_{X_t}(x) = p_x(t) = (λt)^x e^{−λt} / x!.
If we set k = λt, we obtain the Poisson distribution with parameter k.
1.2.30. Definition. Let k > 0. A random variable (X, f_X) with

X: S →N

and distribution function fX : N → R given by

    f_X(x) = k^x e^{−k} / x!
is said to have a Poisson distribution with parameter k.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 104 / 355
Elements of Probability Theory Discrete Random Variables

The Poisson Distribution

1.2.31. Theorem. Let (X , fX ) be a Poisson distributed random variable


with parameter k.
1. The moment generating function of X is given by
   m_X : R → R,    m_X(t) = e^{k(e^t − 1)}.

2. E[X ] = k.
3. Var X = k.

The proof of this theorem is left as an exercise.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 105 / 355
Elements of Probability Theory Discrete Random Variables

The Poisson Distribution


1.2.32. Example. The white blood cell count of a healthy individual can
average as low as 6000/ mm3 of blood. To detect a white-cell deficiency, a
0.001 mm3 drop of blood is taken and the number X of white blood cells
is found. How many white cells are expected in a healthy individual? If at
most two are found, is there evidence of a white-cell deficiency?
Here the (continuous) volume of blood (in mm3 ) takes the role of the time
variable, a white cell counts as an “arrival.” The number of arrivals per
unit volume is λ = 6000, the volume under consideration is s = 0.001.
Hence we have a Poisson process with parameter
k = λs = 6.
The expected value is E[X] = k = 6. Furthermore,

    P[X ≤ 2] = Σ_{x=0}^{2} e^{−6} 6^x / x! = 0.062.
This value has been found from Table II of Appendix A of the text book.
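The table value is easy to reproduce; a minimal sketch using only the standard library:

    from math import exp, factorial

    k = 6.0   # k = lambda * s = 6000 * 0.001
    prob = sum(exp(-k) * k**x / factorial(x) for x in range(3))  # P[X <= 2]
    print(prob)   # approximately 0.062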
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 106 / 355
Elements of Probability Theory Discrete Random Variables

Approximating the Binomial Distribution

If n is large and p is small, we can approximate the binomial distribution


by the Poisson distribution. We then set

k = pn.

Note that usually k = λt, where λ corresponds to the arrivals per unit
time. In general, we require p < 0.1 for this approximation. The smaller p
and the larger n are, the better the approximation.
In fact, the Poisson distribution can be regarded as a limiting case of the
binomial distribution, as you shall prove in the exercises.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 107 / 355
Elements of Probability Theory Discrete Random Variables

Approximating the Binomial Distribution

1.2.33. Example. The probability that a particular rivet in the wing surface
of a new aircraft is defective is 0.001. There are 4000 rivets in the wing.
What is the probability that not more than six defective rivets will be
installed?
The true probability is
    P[X ≤ 6] = Σ_{x=0}^{6} C(4000, x) (0.001)^x (0.999)^{4000−x}.

Using the Poisson approximation, k = 4000 · 0.001 = 4 and

    P[X ≤ 6] ≈ Σ_{x=0}^{6} e^{−4} 4^x / x! = 0.889.
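Both sums can be evaluated directly to see how close the approximation is; a short sketch:

    from math import comb, exp, factorial

    n, p = 4000, 0.001
    k = n * p   # = 4

    exact = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(7))
    approx = sum(exp(-k) * k**x / factorial(x) for x in range(7))
    print(exact, approx)   # both approximately 0.889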

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 108 / 355
Elements of Probability Theory Continuous Random variables

Introduction to Probability and Counting

Discrete Random Variables

Continuous Random variables

Joint Distributions

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 109 / 355
Elements of Probability Theory Continuous Random variables

Continuous Random Variables

1.3.1. Definition. Let S be a sample space. A continuous random variable


is a map X : S → R together with a function fX : R → R with the
properties that
1. fX ≥ 0 and
2. ∫_{−∞}^∞ f_X(x) dx = 1.
The integral of fX is interpreted as the probability that X assumes values
x in a given range, i.e.,
    P[a ≤ X ≤ b] = ∫_a^b f_X(x) dx.

The function fX is called the probability density function (or just density)
of the random variable X .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 110 / 355
Elements of Probability Theory Continuous Random variables

Cumulative Distribution
Notice that by the above definition,
    P[X = x] = ∫_x^x f_X(y) dy = 0,

i.e., the probability that X assumes any specific value is zero.


In particular, if for two continuous random variables (X , fX ) and (Y , fY )
the densities fX and fY differ only on sets of measure zero (e.g., countable
sets), then for any a, b ∈ R
    P[a ≤ X ≤ b] = ∫_a^b f_X(x) dx = ∫_a^b f_Y(y) dy = P[a ≤ Y ≤ b].

In this case we say

(X , fX ) = (Y , fY ) almost surely.

(From the point of view of calculus, fX = fY almost everywhere.)


Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 111 / 355
Elements of Probability Theory Continuous Random variables

Cumulative Distribution

1.3.2. Definition. Let (X , fX ) be a continuous random variable. The


cumulative distribution function for X is defined by F_X : R → R,

    F_X(x) := P[X ≤ x] = ∫_{−∞}^x f_X(y) dy.

Notice that by the fundamental theorem of calculus we can easily obtain


the density fX from FX :
fX (x) = FX′ (x).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 112 / 355
Elements of Probability Theory Continuous Random variables

Expectation and Variance


We define expectation similarly to the discrete case:
1.3.3. Definition. Let (X , fX ) be a continuous random variable and
H : R → R some function. Then the expected value of H ◦ X is
    E[H ◦ X] = ∫_{−∞}^∞ H(x) · f_X(x) dx,

provided that the integral on the right converges absolutely.


As a special case we regain the expected value of X ,
    E[X] = ∫_R x · f_X(x) dx

and also note that as before

Var X = E[(X − E[X ])2 ] = E[X 2 ] − E[X ]2 .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 113 / 355
Elements of Probability Theory Continuous Random variables

Exponential Distribution

In contradistinction to the discrete distributions, the continuous


distributions can often not be easily motivated by experiments. In practice,
one finds that a random variable may follow some distribution, but it is
not generally possible to predict the distribution in advance just by
specifying the circumstances.

1.3.4. Definition. Let β ∈ R, β > 0. A continuous random variable (X , fβ )


with density
    f_β(x) = (1/β) e^{−x/β}  for x > 0,        f_β(x) = 0  for x ≤ 0,

is said to follow an exponential distribution with parameter β.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 114 / 355
Elements of Probability Theory Continuous Random variables

Exponential Distribution

[Figure: graphs of y = f_β(x) for β = 1/2, 1, 3/2, 2.]

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 115 / 355
Elements of Probability Theory Continuous Random variables

Exponential Distribution
We can easily calculate the expectation and variance for the exponential
distribution:
    E[X] = ∫_{−∞}^∞ x f_β(x) dx = ∫_0^∞ (x/β) e^{−x/β} dx
         = [−x e^{−x/β}]_0^∞ + ∫_0^∞ e^{−x/β} dx = β,

    E[X²] = ∫_{−∞}^∞ x² f_β(x) dx = ∫_0^∞ (x²/β) e^{−x/β} dx
          = [−x² e^{−x/β}]_0^∞ + 2 ∫_0^∞ x e^{−x/β} dx = 2β²,

    ⇒ Var X = E[X²] − E[X]² = β².

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 116 / 355
Elements of Probability Theory Continuous Random variables

Exponential Distribution

Similarly, we can obtain the moment-generating function:


    m_X(t) = E[e^{tX}] = ∫_{−∞}^∞ e^{tx} f_β(x) dx
           = ∫_0^∞ (1/β) e^{−(1/β − t)x} dx
           = [(1/β − t)^{−1} / β] ∫_0^∞ e^{−y} dy
           = (1 − βt)^{−1},    for t < 1/β.
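A simulation check of E[X] = β and Var X = β² (a sketch; note that random.expovariate expects the rate λ = 1/β, and the sample size is an arbitrary choice):

    import random

    beta = 2.0
    random.seed(0)
    sample = [random.expovariate(1 / beta) for _ in range(200_000)]

    mean = sum(sample) / len(sample)
    var = sum((x - mean)**2 for x in sample) / len(sample)
    print(mean, beta)     # both approximately 2
    print(var, beta**2)   # both approximately 4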

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 117 / 355
Elements of Probability Theory Continuous Random variables

Exponential Distribution
The exponential distribution has an interpretation as “time-to-failure.” In
fact, there is a relationship between the exponential distribution and the
Poisson distribution.
Recall that for Poisson-distributed events (arrivals) the probability of x
arrivals in the time interval [0, t] was given by

    p(x) = (λt)^x e^{−λt} / x!,    x ∈ N.
Then p(0) is the probability of no arrivals in [0, t]. This can also be
interpreted as the probability that the first arrival occurs at a time greater
than t. Let the time of the first arrival be a continuous random variable
denoted by T . Then

    P[T > t] = p(0) = e^{−λt},    t ≥ 0.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 118 / 355
Elements of Probability Theory Continuous Random variables

Exponential Distribution

Hence, if we denote by F the cumulative distribution of the density of T ,


we have

    F(t) = P[T ≤ t] = 1 − e^{−λt},    t ≥ 0.

Since f_T(t) = F′(t), it follows that the density is

    f_T(t) = λ e^{−λt},    t ≥ 0,

and fT (t) = 0 for t < 0. Thus the time between successive arrivals of a
Poisson-distributed random variable is exponentially distributed with
parameter β = 1/λ.
If we interpret an “arrival” as the failure of a component, then β^{−1} = λ
represents the rate of failure.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 119 / 355
Elements of Probability Theory Continuous Random variables

Exponential Distribution

1.3.5. Example. An electronic component is known to have a useful life


represented by an exponential density with a failure rate of 10^{−5} failures per
hour, i.e., 1/β = 10^{−5}. The mean time to failure, E[X], is thus β = 10^5
hours.
Suppose we wanted to determine the fraction of such components that
would fail before the mean or expected life:
    P[T ≤ β] = ∫_0^β (1/β) e^{−x/β} dx = 1 − e^{−1} = 0.63212.

That is, 63.2% of the components will fail before the mean life time. As
you can see, this result does not depend on the value of β.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 120 / 355
Elements of Probability Theory Continuous Random variables

Exponential Distribution

The exponential distribution has an interesting and unique property: it is


memoryless. In other words,

P[X > x + s | X > x] = P[X > s].

For example, if a cathode ray tube has an exponential time to failure


distribution and at time x it is observed to be still functioning, then the
remaining life has the same exponential failure distribution as the tube had
at time zero.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 121 / 355
Elements of Probability Theory Continuous Random variables

Exponential Distribution

To see this, note that


    P[X > x] = ∫_x^∞ f(t) dt = ∫_x^∞ λ e^{−λt} dt = e^{−λx}.

Then

    P[X > x + s | X > x] = P[(X > x + s) ∩ (X > x)] / P[X > x] = P[X > x + s] / P[X > x]
                         = e^{−λ(x+s)} / e^{−λx} = e^{−λs} = P[X > s].
e −λx

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 122 / 355
Elements of Probability Theory Continuous Random variables

Gamma Distribution
A more general form of the exponential distribution is the gamma
distribution, which has two parameters.
1.3.6. Definition. Let α, β ∈ R, α, β > 0. A continuous random variable
(X , fα,β ) with density
    f_{α,β}(x) = x^{α−1} e^{−x/β} / (Γ(α) β^α)  for x > 0,        f_{α,β}(x) = 0  for x ≤ 0,

is said to follow a gamma distribution with parameters α and β.


Here
    Γ(α) = ∫_0^∞ z^{α−1} e^{−z} dz,    α > 0,

is the Euler gamma function.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 123 / 355
Elements of Probability Theory Continuous Random variables

Gamma Distribution
The gamma function satisfies Γ(1) = 1 and Γ(α) = (α − 1)Γ(α − 1) if
α > 1. In other words,
n! = Γ(n + 1) for n ∈ N.
It is a continuous extension of the factorial function to the positive real
numbers and can in fact be defined for all complex numbers except zero
and the negative integers. Below is its graph for α ∈ (−3, 5).
[Figure: graph of y = Γ(x) = ∫_0^∞ z^{x−1} e^{−z} dz for x ∈ (−3, 5).]

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 124 / 355
Elements of Probability Theory Continuous Random variables

Gamma Distribution

1.3.7. Theorem. Let (X , fα,β ) be a Gamma distributed random variable


with parameters α, β > 0.
1. The moment-generating function of X is given by

   m_X : (−∞, 1/β) → R,    m_X(t) = (1 − βt)^{−α}.

2. E[X] = αβ.
3. Var X = αβ².

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 125 / 355
Elements of Probability Theory Continuous Random variables

Gamma Distribution
Proof.
We will verify the moment-generating function only.
    m_X(t) = E[e^{tX}] = ∫_0^∞ [e^{tx} / (Γ(α) β^α)] x^{α−1} e^{−x/β} dx
           = [1 / (Γ(α) β^α)] ∫_0^∞ x^{α−1} e^{−x(1/β − t)} dx.

Substituting y = x(1/β − t), we have dy = (1/β − t) dx and

    m_X(t) = [1 / (Γ(α) β^α)] (1/β − t)^{−1} ∫_0^∞ [y/(1/β − t)]^{α−1} e^{−y} dy
           = [(1/β − t)^{−α} / (Γ(α) β^α)] ∫_0^∞ y^{α−1} e^{−y} dy
           = (1 − βt)^{−α}.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 126 / 355
Elements of Probability Theory Continuous Random variables

Special Gamma Distributions

The gamma distribution is of interest because many other distributions are


special cases. For example, the exponential distribution is a gamma
distribution with α = 1.
An important distribution in statistics is the chi squared distribution.

1.3.8. Definition. Let X be a gamma random variable with β = 2 and


α = γ/2 for a positive integer γ ∈ N. Then X is said to have a chi
squared distribution with γ degrees of freedom. We denote this variable by
X = X_γ².
We will encounter this distribution again later.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 127 / 355
Elements of Probability Theory Continuous Random variables

Special Gamma Distributions

[Figure: graph of the density of the chi squared distribution.]

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 128 / 355
Elements of Probability Theory Continuous Random variables

Normal (Gauß) Distribution

The normal distribution was first described by de Moivre in 1733 as a


limiting case of the binomial distribution. This result did not get much
attention, however, and was soon forgotten. The distribution was
rediscovered by Laplace and Gauß half a century later, as they both found
that this distribution described errors in astronomical measurements.

1.3.9. Definition. Let µ ∈ R, σ > 0. A continuous random variable (X , fX )


with density
    f_X(x) = [1/(√(2π) σ)] e^{−((x−µ)/σ)²/2}
is said to follow a normal distribution with parameters µ and σ.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 129 / 355
Elements of Probability Theory Continuous Random variables

Normal Distribution

[Figure: graph of the density of the normal distribution, with µ − σ, µ and µ + σ marked on the x-axis.]

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 130 / 355
Elements of Probability Theory Continuous Random variables

Normal Distribution

It is easily verified that ∫_R f_X(x) dx = 1 by using polar coordinates (see
blackboard). We furthermore have the following result:
1.3.10. Theorem. Let (X , fX ) be a normally distributed random variable
with parameters µ and σ.
1. The moment-generating function of X is given by
   m_X : R → R,    m_X(t) = e^{µt + σ²t²/2}.

2. E[X] = µ.
3. Var X = σ².

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 131 / 355
Elements of Probability Theory Continuous Random variables

Normal Distribution

Proof.
We will verify the moment-generating function only.
    m_X(t) = E[e^{tX}] = ∫_{−∞}^∞ [e^{tx}/(√(2π) σ)] e^{−((x−µ)/σ)²/2} dx
           = [1/(√(2π) σ)] ∫_{−∞}^∞ e^{tx − ((x−µ)/σ)²/2} dx.

We complete the square in the exponent to obtain

    tx − ((x − µ)/σ)²/2 = −(x − (µ + σ²t))²/(2σ²) + µt + σ²t²/2.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 132 / 355
Elements of Probability Theory Continuous Random variables

Normal Distribution

Proof (continued).
Substituting into the integral,
    m_X(t) = [1/(√(2π) σ)] ∫_{−∞}^∞ e^{−(x − (µ + σ²t))²/(2σ²) + µt + σ²t²/2} dx
           = e^{µt + σ²t²/2} · [1/(√(2π) σ)] ∫_{−∞}^∞ e^{−(x − (µ + σ²t))²/(2σ²)} dx
           = e^{µt + σ²t²/2},

since the remaining integral is that of a normal density with mean µ + σ²t and
standard deviation σ and hence equals 1.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 133 / 355
Elements of Probability Theory Continuous Random variables

Standard Normal Distribution

1.3.11. Definition. A normally distributed random variable with parameters


µ = 0 and σ = 1 is called a standard normal random variable and denoted
by Z .
The standard normal distribution is particularly important because any
normally distributed random variable can be transformed into a
standard-normally distributed one.
1.3.12. Theorem. Let X be a normally distributed random variable with
mean µ and standard deviation σ. Then
    Z := (X − µ)/σ
has standard normal distribution.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 134 / 355
Elements of Probability Theory Continuous Random variables

Transformation of Random Variables

It is easily seen that Z = (X − µ)/σ has mean E[Z] = 0 and variance Var Z = 1,
but it is not clear that Z is normally distributed. To see this, we need to
find the density of Z .
Hence it is worth studying the density of transformed variables in general.
1.3.13. Theorem. Let X be a continuous random variable with density fX .
Let Y = φ ◦ X , where φ : R → R is strictly monotonic and differentiable.
The density for Y is then given by
    f_Y(y) = f_X(φ^{−1}(y)) · |dφ^{−1}(y)/dy|.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 135 / 355
Elements of Probability Theory Continuous Random variables

Transformation of Random Variables


Proof.
We assume without loss of generality that φ is strictly decreasing. (The
case where φ is strictly increasing is analogous.)
The cumulative distribution function for Y is given by

FY (y ) = P[Y ≤ y ] = P[φ(X ) ≤ y ].

Since φ is strictly decreasing, φ−1 exists and is also decreasing. Therefore,

    F_Y(y) = P[φ(X) ≤ y] = P[φ^{−1}(φ(X)) ≥ φ^{−1}(y)] = P[X ≥ φ^{−1}(y)]
           = 1 − P[X ≤ φ^{−1}(y)] = 1 − F_X(φ^{−1}(y)).

Differentiating this expression, we obtain


    f_Y(y) = F_Y′(y) = −f_X(φ^{−1}(y)) · dφ^{−1}(y)/dy = f_X(φ^{−1}(y)) · |dφ^{−1}(y)/dy|.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 136 / 355
Elements of Probability Theory Continuous Random variables

Standard Normal Distribution


We can now prove Theorem 1.3.12. We have Z = φ ◦ X, where
φ(x) = (x − µ)/σ is strictly increasing and differentiable. Note that

    φ^{−1}(z) = σz + µ,        dφ^{−1}(z)/dz = σ > 0.

Using

    f_X(x) = [1/(√(2π) σ)] e^{−((x−µ)/σ)²/2}

we have

    f_Z(z) = f_X(φ^{−1}(z)) · |dφ^{−1}(z)/dz| = [1/(√(2π) σ)] e^{−z²/2} · σ = [1/√(2π)] e^{−z²/2},

which is the density of the standard normal distribution. Hence the
variable Z = (X − µ)/σ is standard normal.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 137 / 355
Elements of Probability Theory Continuous Random variables

Standard Normal Distribution


The cumulative distribution function of the standard normal distribution is
often denoted by Φ,
    Φ(z) = [1/√(2π)] ∫_{−∞}^z e^{−t²/2} dt.

The values of Φ are given in Table V of Appendix A.


1.3.14. Example. The breaking strength (in Newtons) of a synthetic fabric
is denoted X , and it is normally distributed with mean µ = 800 N and
standard deviation σ = 12 N. The purchaser of the fabric requires the
fabric to have a strength of at least 772 N. A fabric sample is randomly
selected and tested. To find P[X ≥ 772], we first calculate
    P[X < 772] = P[(X − µ)/σ < (772 − 800)/12]
               = P[Z < −2.33] = Φ(−2.33) = 0.01.

Hence the desired probability is P[X ≥ 772] = 0.99.
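The same probability can be found without tables; a minimal sketch (statistics.NormalDist is available in Python 3.8 and later):

    from statistics import NormalDist

    X = NormalDist(mu=800, sigma=12)
    print(1 - X.cdf(772))                      # P[X >= 772], approximately 0.99
    print(NormalDist().cdf((772 - 800) / 12))  # Phi(-2.33), approximately 0.01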
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 138 / 355
Elements of Probability Theory Continuous Random variables

Standard Normal Distribution

[Figure: densities of the normal distribution with µ = 800, σ = 12 and of the standard normal distribution; the shaded areas represent P[X < 772] and P[Z < −2.33], respectively.]

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 139 / 355
Elements of Probability Theory Continuous Random variables

Standard Normal Distribution

1.3.15. Example. The pitch diameter of the thread on a fitting is normally


distributed with a mean of 0.4008 cm and a standard deviation of
0.0004 cm. The design specifications are 0.4000 ± 0.0010 cm.
Notice that this process is operating with the mean not equal to the
nominal specifications. We desire to determine what fraction of product is
within tolerance.
    P[0.399 ≤ X ≤ 0.401] = P[(0.3990 − 0.4008)/0.0004 ≤ Z ≤ (0.4010 − 0.4008)/0.0004]
                         = P[−4.5 ≤ Z ≤ 0.5] = Φ(0.5) − Φ(−4.5)
                         = 0.6915 − 0.0000 = 0.6915.

Hence about 30% of the product will fall outside tolerance.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 140 / 355
Elements of Probability Theory Continuous Random variables

Standard Normal Distribution

1.3.16. Example. As process engineers study the results of such


calculations, they decide to replace a worn cutting tool and adjust the
machine producing the fittings so that the new mean falls directly at the
nominal value of 0.4000 cm. Then
    P[0.399 ≤ X ≤ 0.401] = P[(0.3990 − 0.4)/0.0004 ≤ Z ≤ (0.4010 − 0.4)/0.0004]
                         = P[−2.5 ≤ Z ≤ 2.5] = Φ(2.5) − Φ(−2.5)
                         = 0.9938 − 0.0062 = 0.9876.

With the adjustment, 98.76% of the product will fall inside tolerance.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 141 / 355
Elements of Probability Theory Continuous Random variables

Standard Normal Distribution

[Figure: density of the normal distribution with µ = 0.4008 and σ = 0.0004 for the unadjusted process; the shaded area indicates the design specifications 0.3990 to 0.4010.]
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 142 / 355
Elements of Probability Theory Continuous Random variables

Standard Normal Distribution

[Figure: density of the normal distribution with µ = 0.4000 and σ = 0.0004 for the adjusted process; the shaded area indicates the design specifications 0.3990 to 0.4010.]
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 143 / 355
Elements of Probability Theory Continuous Random variables

Standard Normal Distribution


1.3.17. Example. Let X denote the amount of radiation that can be
absorbed by an individual before death ensues. Assume that X is normal
with a mean dosage of 500 roentgens and a standard deviation of 150
roentgens. Above what dosage level will only 5% of those exposed
survive?
Here we want to find x0 such that P[X ≥ x0 ] = 0.05. Standardizing,
    P[X ≥ x_0] = P[(X − 500)/150 ≥ (x_0 − 500)/150] = P[Z ≥ (x_0 − 500)/150] = 0.05.

From Table V, P[Z ≥ 1.64] = 0.0505 and P[Z ≥ 1.65] = 0.0495.


Interpolating, we take P[Z ≥ 1.645] ≈ 0.0500, so we have
    (x_0 − 500 roentgen)/(150 roentgen) = 1.645    ⇔    x_0 = 746.75 roentgen.
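The interpolation of Table V can be replaced by an inverse-CDF call; a sketch using the standard library (NormalDist.inv_cdf, Python 3.8+):

    from statistics import NormalDist

    mu, sigma = 500, 150
    z = NormalDist().inv_cdf(0.95)   # approximately 1.645
    print(mu + sigma * z)            # approximately 746.7 roentgen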

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 144 / 355
Elements of Probability Theory Continuous Random variables

Standard Normal Distribution


In general, the following estimates are often useful:
1.3.18. Theorem. Let X be normally distributed with parameters µ and σ.
Then

P[−σ < X − µ < σ] = 0.68


P[−2σ < X − µ < 2σ] = 0.95
P[−3σ < X − µ < 3σ] = 0.997

Hence 68% of the values of a normal random variable lie within one
standard deviation of the mean, 95% lie within two standard deviations,
and 99.7% lie within three standard deviations. This rule of thumb will be
especially important in statistics, where the number of “extraordinary”
events needs to be judged.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 145 / 355
Elements of Probability Theory Continuous Random variables

Chebyshev’s Inequality
There is a general estimate, which does not depend on the distribution of
a discrete or continuous random variable, that tells us how the variance
influences the probability of deviating from the mean. This estimate is
called Chebyshev’s inequality.
1.3.19. Theorem. Let (X , fX ) be a discrete or continuous random variable
and k > 0 some positive number. Then
    P[−kσ < X − µ < kσ] ≥ 1 − 1/k²                                    (1.3.1)

or, equivalently,

    P[|X − µ| ≥ kσ] ≤ 1/k².                                           (1.3.2)
k2
Comparing (1.3.2) with Theorem 1.3.18, we see that the estimates in the
theorem are better. This is not surprising, as Chebyshev’s rule is valid for
any distribution, while the previous theorem uses the specific properties of
the normal distribution.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 146 / 355
Elements of Probability Theory Continuous Random variables

Chebyshev’s Inequality
Proof.
We will prove the theorem for the case of a continuous random variable
only. The proof in the discrete case is quite similar. By definition,
    σ² = Var X = E[(X − µ)²] = ∫_R (x − µ)² f_X(x) dx
       = ∫_{−∞}^{µ−√K} (x − µ)² f_X(x) dx + ∫_{µ−√K}^{µ+√K} (x − µ)² f_X(x) dx
         + ∫_{µ+√K}^∞ (x − µ)² f_X(x) dx

for any K > 0. Since the middle integral is nonnegative,

    σ² ≥ ∫_{−∞}^{µ−√K} (x − µ)² f_X(x) dx + ∫_{µ+√K}^∞ (x − µ)² f_X(x) dx.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 147 / 355
Elements of Probability Theory Continuous Random variables

Chebyshev’s Inequality
Proof (continued).

Now (x −√µ)2 ≥ K if and√only if |x − µ| ≥ K which is the case if
x ≥ µ + K or x ≤ µ − K . Therefore,
Z √ Z
µ− K ∞
σ ≥
2
(x − µ) fX (x) dx +
2
√ (x − µ)2 fX (x) dx
−∞ µ+ K
Z √ Z
µ− K ∞
≥K fX (x) dx + K √ fX (x) dx
−∞ µ+ K
³ √ √ ´
= K P[X ≤ µ − K ] + P[X ≥ µ + K ] ,

or, equivalently,
√ σ2
P[|X − µ| ≥ K] ≤ .
K
√ 1
Taking k = K /σ, we obtain P[|X − µ| ≥ kσ] ≤ .
k2
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 148 / 355
Elements of Probability Theory Continuous Random variables

Chebyshev’s Inequality

1.3.20. Example. From an analysis of company records, a materials control


manager estimates that the mean and standard deviation of the “lead
time” required in ordering a small valve are 8 days and 1.5 days,
respectively. He does not know the distribution of the lead time, but he is
willing to assume the estimates of the mean and standard deviation to be
absolutely correct.
The manager would like to determine a time interval such that the
probability is at least 8/9 that the order will be received during that time.
That is,

    1 − 1/k² = 8/9,

so that k = 3 and µ ± kσ = (8 ± 4.5) days. It is noted that this interval
may well be too large to be of any value to the manager, in which case he
may elect to learn more about the distribution of lead times.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 149 / 355
Elements of Probability Theory Continuous Random variables

Approximating the Binomial Distribution

We have previously seen how the binomial distribution can be


approximated by the Poisson distribution in the case where p is small. As
we have noted before, the normal distribution was originally conceived as
an approximation to the binomial distribution. This approximation is valid
for any p if n is large enough. In practice, n is often a sample size, as we
shall see in an example.
The mathematical formulation is known as the Theorem of de
Moivre-Laplace. The precise theorem contains an estimate for the error in
the approximation. Since we do not want to delve too deeply into the
mathematical theory involved, we give an informal formulation only.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 150 / 355
Elements of Probability Theory Continuous Random variables

Approximating the Binomial Distribution


We will use Stirling’s approximation of the factorial function,

    n! ≈ √(2π) e^{−n} n^{n+1/2},

in the sense that the relative error converges to zero as n goes to infinity,

    lim_{n→∞} [ n! − √(2π) e^{−n} n^{n+1/2} ] / n! = 0.
Then for a binomial random variable with parameters n and p (q = 1 − p)
we can eventually show that
    P[X = x] = [n!/(x!(n − x)!)] p^x q^{n−x} ≈ [1/√(2π npq)] e^{−(x−np)²/(2npq)},

so that

    P[X ≤ x] ≈ Φ((x − np)/√(npq)).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 151 / 355
Elements of Probability Theory Continuous Random variables

Approximating the Binomial Distribution


Thus the binomial distribution with parameters n and p behaves as a
normal distribution with mean µ = np and variance σ² = npq.
This approximation is good if p is close to 1/2 and n > 10. Otherwise, we
require that
np > 5 if p ≤ 1/2 or n(1 − p) > 5 if p > 1/2.

We also need to make a half-unit correction. For example, when


approximating the cumulative binomial distribution to determine
P[X ≥ 12], we in fact have to calculate 1 − Φ((11.5 − np)/√(npq)):

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 152 / 355
Elements of Probability Theory Continuous Random variables

Approximating the Binomial Distribution


1.3.21. Example. In sampling from a production process that produces
items of which 20% are defective, a random sample of 100 items is
selected each hour of each production shift. The number of defectives in a
sample is denoted by X .
To find, say, P[X ≤ 15] we might use the normal approximation as follows:
    P[X ≤ 15] ≈ P[Z ≤ (15 − 100 · 0.2)/√(100 · 0.2 · 0.8)] = P[Z ≤ −1.25]
              = Φ(−1.25) = 0.1056.
A half-unit correction would instead give
    P[X ≤ 15] ≈ P[Z ≤ (15.5 − 20)/4] = 0.130.

The correct result is P[X ≤ 15] = Σ_{k=0}^{15} C(100, k) (0.2)^k (0.8)^{100−k} = 0.1285.
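A short sketch comparing the exact value with the two approximations:

    from math import comb, sqrt
    from statistics import NormalDist

    n, p = 100, 0.2
    mu, sigma = n * p, sqrt(n * p * (1 - p))
    Phi = NormalDist().cdf

    exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(16))
    plain = Phi((15 - mu) / sigma)         # no half-unit correction
    corrected = Phi((15.5 - mu) / sigma)   # with half-unit correction

    print(exact, plain, corrected)   # approximately 0.129, 0.106, 0.130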

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 153 / 355
Elements of Probability Theory Continuous Random variables

Weibull Distribution

Like the gamma distribution, the Weibull distribution (introduced by W.


Weibull in 1951) is useful in engineering and life sciences applications
because it is flexible, with parameters that enable it to reduce to or
approximate the exponential and normal distributions.
The most general is the three-parameter Weibull distribution, with density
    f(x) = αβ (x − γ)^{β−1} e^{−α(x−γ)^β}  for x > γ,        f(x) = 0  otherwise,

with α, β > 0, γ ∈ R.

Here γ is called the location parameter, β is called the shape parameter


and α is the scale parameter. We will here consider the (physically most
common) case of γ = 0.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 154 / 355
Elements of Probability Theory Continuous Random variables

Weibull Distribution
1.3.22. Definition. A random variable (X , fX ) is said to have a
(two-parameter) Weibull distribution with parameters α and β if its
density is given by
    f(x) = αβ x^{β−1} e^{−αx^β}  for x > 0,        f(x) = 0  otherwise,

with α, β > 0.

1.3.23. Theorem. Let X be a Weibull random variable with parameters α


and β. The mean and variance of X are given by
    µ = α^{−1/β} Γ(1 + 1/β)
and
    σ² = α^{−2/β} Γ(1 + 2/β) − µ².

For the proof of the expression for the mean, we refer to the text book.
The proof of the formula for the variance is left as an exercise.
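These formulas are straightforward to evaluate with the gamma function from the standard library; a sketch (the parameter values are illustrative):

    from math import gamma

    def weibull_mean_var(alpha, beta):
        """Mean and variance of the two-parameter Weibull distribution."""
        mu = alpha**(-1 / beta) * gamma(1 + 1 / beta)
        var = alpha**(-2 / beta) * gamma(1 + 2 / beta) - mu**2
        return mu, var

    print(weibull_mean_var(1.0, 1.0))   # beta = 1: exponential case, (1.0, 1.0)
    print(weibull_mean_var(0.5, 2.0))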
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 155 / 355
Elements of Probability Theory Continuous Random variables

Weibull Distribution
The Weibull distribution is used (for example)
◮ In reliability engineering and failure analysis (the most common usage)
◮ In survival analysis (a branch of statistics which deals with death in
biological organisms and failure in mechanical systems)
◮ To represent manufacturing and delivery times in industrial
engineering
◮ In weather forecasting
◮ In radar systems to model the dispersion of the received signal level
produced by some types of clutter
◮ To model fading channels in wireless communications
◮ In general insurance to model the size of reinsurance claims
◮ In forecasting technological change (also known as the Sharif-Islam
model)
◮ To describe wind speed distributions, as the natural distribution often
matches the Weibull shape
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 156 / 355
Elements of Probability Theory Continuous Random variables

Reliability

Reliability studies are concerned with assessing whether or not a system


functions adequately under the conditions for which it was designed.
Interest centers on describing the behavior of the random variable X , the
time to failure of a system that can not be repaired once it fails to
operate.
Three functions come into play:
◮ the failure density f ,
◮ the reliability function R,
◮ the failure or hazard rate ϱ.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 157 / 355
Elements of Probability Theory Continuous Random variables

Reliability

Consider some system being put into operation at time t = 0. We observe


the system until it eventually fails. Let X denote the time of failure. This
is a continuous random variable with values in (0, ∞).
The probability density function of X is the failure density f . The
reliability function R is defined to be the probability that the component
will not fail before time t. Thus

    R(t) = 1 − P[component will fail before time t]
         = 1 − ∫_0^t f(x) dx = 1 − F(t),

where F is the cumulative distribution function of X .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 158 / 355
Elements of Probability Theory Continuous Random variables

Reliability - Hazard Rate


To define ϱ, the hazard rate function, consider a time interval [t, t + ∆t].
We define the hazard rate function over this interval by
    ϱ(t) := lim_{∆t→0} P[t ≤ X ≤ t + ∆t | t ≤ X] / ∆t
          = lim_{∆t→0} P[X ∈ [t, t + ∆t]] / (P[X ∈ [t, ∞)] · ∆t).

Note that

    lim_{∆t→0} P[X ∈ [t, t + ∆t]] / ∆t = lim_{∆t→0} [F(t + ∆t) − F(t)] / ∆t = F′(t) = f(t)

and P[X ∈ [t, ∞)] = 1 − F(t) = R(t) does not depend on ∆t. Therefore,

    ϱ(t) = f(t) / R(t).
The job of the scientist is to find the form of these functions for the
problem at hand. In practice, one often begins by assuming a particular
form for the hazard rate function based on empirical evidence.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 159 / 355
Elements of Probability Theory Continuous Random variables

Reliability - Hazard Rate

Interpretation of the Hazard rate:


1. If ϱ is increasing over an interval, then as time goes by a failure is
more likely to occur. This normally happens for systems that begin to
fail primarily due to wear.
2. If ϱ is decreasing over an interval, then as time goes by a failure is
less likely to occur than it was earlier in the time interval. This
happens in situations in which defective systems tend to fail early. As
time goes by, the hazard rate for a well-made system decreases.
3. A steady hazard rate is expected over the useful life span of a
component. A failure tends to occur during this period due mainly to
random factors.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 160 / 355
Elements of Probability Theory Continuous Random variables

Reliability
Often one has an idea of ϱ, but not of the failure density f or reliability
function R. Then the following theorem enables one to determine the
other two functions:
1.3.24. Theorem. Let X be a random variable with failure (probability)
density f, reliability function R and hazard rate ϱ. Then
    R(t) = e^{−∫_0^t ϱ(x) dx}.

Proof.
Note that since R(x) = 1 − F (x) we have R ′ (x) = −F ′ (x). Therefore,

    ϱ(x) = f(x)/R(x) = F′(x)/R(x) = −R′(x)/R(x),

so R ′ (x) = −ϱ(x)R(x). Note that R(0) = 1, since the component will not
fail before t > 0. Solving this differential equation through separation of
variables, we obtain the result.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 161 / 355
Elements of Probability Theory Continuous Random variables

Reliability
1.3.25. Example. One hazard function in widespread use is the function
    ϱ(t) = αβ t^{β−1},    t > 0,  α, β > 0.

◮ If β = 1, the hazard rate is constant (failure is due to random factors)


◮ If β > 1, the hazard rate is increasing (failure is due to wear)
◮ If β < 1, the hazard rate is decreasing (early failure likely due to
malfunction)
The reliability function is given by

    R(t) = e^{−∫_0^t αβ x^{β−1} dx} = e^{−α t^β}.

The failure density is given by


    f(t) = ϱ(t) R(t) = αβ t^{β−1} e^{−α t^β},

which happens to be the probability density of the Weibull distribution
with parameters α and β.
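A small sketch of these three functions for the hazard rate above (the values of α, β and t are illustrative only):

    from math import exp

    def hazard(t, alpha, beta):
        return alpha * beta * t**(beta - 1)

    def reliability(t, alpha, beta):
        """R(t) = exp(-alpha * t**beta)."""
        return exp(-alpha * t**beta)

    def failure_density(t, alpha, beta):
        """f(t) = rho(t) * R(t), i.e. the Weibull density."""
        return hazard(t, alpha, beta) * reliability(t, alpha, beta)

    alpha, beta = 0.01, 1.5
    print(reliability(10.0, alpha, beta), failure_density(10.0, alpha, beta))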
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 162 / 355
Elements of Probability Theory Continuous Random variables

Reliability of Series and Parallel Systems

Recall that reliability was defined as

R(t) = 1 − P[component will fail before time t].

Here we assume t is fixed; instead of tracking the time-dependence of the


reliability of a single component, we are interested in how the design of
a system with components of various reliabilities influences the reliability of
the entire system.
Components in multiple-component systems can be installed in the system
in various ways. Many systems are arranged in “series” configuration,
some are in “parallel” and others are combinations of the two designs.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 163 / 355
Elements of Probability Theory Continuous Random variables

Reliability of Series and Parallel Systems


1.3.26. Definition.
1. A system whose components are arranged in such a way that the
system fails whenever any of its components fail is called a series
system.
2. A system whose components are arranged in such a way that the
system fails only if all of its components fail is called a parallel system.
Assuming the components are independent of each other, the reliability of
a series system with k components is given by
    R_s(t) = Π_{i=1}^k R_i(t),
where Ri is the reliability of the ith component.
The reliability of a parallel system is given by

    R_p(t) = 1 − P[all components fail before t] = 1 − Π_{i=1}^k (1 − R_i(t)).
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 164 / 355
Elements of Probability Theory Continuous Random variables

Reliability of Series and Parallel Systems

1.3.27. Example. Consider a system consisting of eight independent


components, connected as shown below:

The reliability of the entire system is the product of the reliabilities of
assemblies I–V, working out to 0.7689.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 165 / 355
Elements of Probability Theory Joint Distributions

Introduction to Probability and Counting

Discrete Random Variables

Continuous Random variables

Joint Distributions

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 167 / 355
Elements of Probability Theory Joint Distributions

Random Vectors

Often, a single random variable is not enough to describe a physical


problem, or we are interested in the effect of one quantity on another. In
such a case we consider two (or more) random variables together. Then
we consider a “vector” where each component is itself a (“scalar”) random
variable.
We call such a vector a random vector or a multi-variate random variable
or an n-dimensional random variable. The components can be discrete or
continuous random variables, and even mixtures of the two.
In this section we will study bivariate (two-dimensional) random variables
where either both components are discrete or both components are
continuous random variables.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 168 / 355
Elements of Probability Theory Joint Distributions

Discrete Bivariate Random Variables

1.4.1. Definition. Let S be a sample space and Ω a subset of N2 . A


discrete bivariate random variable is a map (X , Y ) : S → Ω together with
a function fXY : Ω → R with the properties that
1. fXY ≥ 0 and
2. Σ_{(x,y)∈Ω} f_XY(x, y) = 1.
Then fXY gives the probability that the pair (X , Y ) assumes a given value
(x, y ), i.e.,
fXY (x, y ) = P[X = x and Y = y ].
The function fXY is called the joint density function of the random
variable (X , Y ).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 169 / 355
Elements of Probability Theory Joint Distributions

Discrete Bivariate Random Variables

1.4.2. Example. In an automobile plant two tasks are performed by robots.


The first entails welding two joints; the second, tightening three bolts. Let
X denote the number of defective welds and Y the number of improperly
tightened bolts produced per car. Past data may indicate the following
joint distribution for X and Y :

X = x/Y = y 0 1 2 3
0 0.840 0.030 0.020 0.010
1 0.060 0.010 0.008 0.002
2 0.010 0.005 0.004 0.001

For example, P[X = 1 and Y = 2] = 0.008.


Note that the sum of all probabilities is equal to 1, as required.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 170 / 355
Elements of Probability Theory Joint Distributions

Discrete Bivariate Random Variables


1.4.3. Example.
X = x/Y = y 0 1 2 3
0 0.840 0.030 0.020 0.010
1 0.060 0.010 0.008 0.002
2 0.010 0.005 0.004 0.001
We can read off that the probability of one error being committed is
P[X = 1 and Y = 0] + P[X = 0 and Y = 1] = 0.09.
The probability of no defective joints is
    P[Y = 0] = Σ_{x=0}^{2} P[X = x and Y = 0] = 0.91.

By summing in this way, we can determine P[Y = y ] for all y . This is


called the marginal density for Y .
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 171 / 355
Elements of Probability Theory Joint Distributions

Discrete Bivariate Random Variables


1.4.4. Definition. Let ((X , Y ), fXY ) be a discrete bivariate random variable.
We define the marginal density fX for X by
    f_X(x) = Σ_y f_XY(x, y)

and the marginal density f_Y for Y by

    f_Y(y) = Σ_x f_XY(x, y)

1.4.5. Example.
X = x/Y = y 0 1 2 3 fX (x)
0 0.840 0.030 0.020 0.010 0.900
1 0.060 0.010 0.008 0.002 0.080
2 0.010 0.005 0.004 0.001 0.020
fY (y ) 0.910 0.045 0.032 0.013 1.00
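The marginal rows and columns of such a table are simply row and column sums; a short sketch with the data of this example:

    # Joint density of Example 1.4.5, stored as f[x][y].
    f = {
        0: {0: 0.840, 1: 0.030, 2: 0.020, 3: 0.010},
        1: {0: 0.060, 1: 0.010, 2: 0.008, 3: 0.002},
        2: {0: 0.010, 1: 0.005, 2: 0.004, 3: 0.001},
    }

    f_X = {x: sum(row.values()) for x, row in f.items()}   # sum over y
    f_Y = {y: sum(f[x][y] for x in f) for y in range(4)}   # sum over x

    print(f_X)   # {0: 0.900, 1: 0.080, 2: 0.020}  (up to floating-point rounding)
    print(f_Y)   # {0: 0.910, 1: 0.045, 2: 0.032, 3: 0.013}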

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 172 / 355
Elements of Probability Theory Joint Distributions

Continuous Bivariate Random Variables

1.4.6. Definition. Let S be a sample space. A continuous bivariate random


variable is a map (X , Y ) : S → R2 together with a function fXY : R2 → R
with the properties that
1. fXY ≥ 0 and
2. ∫_{−∞}^∞ ∫_{−∞}^∞ f_XY(x, y) dy dx = 1.
The integral of fXY is interpreted as the probability that X and Y assume
values (x, y ) in a given range, i.e.,
    P[a ≤ X ≤ b and c ≤ Y ≤ d] = ∫_a^b ∫_c^d f_XY(x, y) dy dx

for a ≤ b, c ≤ d. The function fX ,Y is called the joint density function of


the random variable (X , Y ).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 173 / 355
Elements of Probability Theory Joint Distributions

Continuous Bivariate Random Variables

1.4.7. Definition. Let ((X , Y ), fXY ) be a continuous bivariate random


variable. We define the marginal density fX for X by
    f_X(x) = ∫_{−∞}^∞ f_XY(x, y) dy

and the marginal density f_Y for Y by

    f_Y(y) = ∫_{−∞}^∞ f_XY(x, y) dx

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 174 / 355
Elements of Probability Theory Joint Distributions

Independence
Two events A1 , A2 are independent if P[A1 ∩ A2 ] = P[A1 ] · P[A2 ]. This
motivates us to define the independence of two discrete random variables
through
P[X = x and Y = y ] = P[X = x] · P[Y = y ]
which works out to fXY (x, y ) = fX (x)fY (y ). This also generalizes to
continuous random variables.
1.4.8. Definition. Let ((X , Y ), fXY ) be a bivariate random variable with
marginal densities fX and fY . If

dom fXY = (dom fX ) × (dom fY )

and

fXY (x, y ) = fX (x)fY (y ) for all (x, y ) ∈ dom fXY

then (X , fX ) and (Y , fY ) are independent random variables.


Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 175 / 355
Elements of Probability Theory Joint Distributions

Independence

1.4.9. Example.

X = x/Y = y 0 1 2 3 fX (x)
0 0.840 0.030 0.020 0.010 0.900
1 0.060 0.010 0.008 0.002 0.080
2 0.010 0.005 0.004 0.001 0.020
fY (y ) 0.910 0.045 0.032 0.013 1.00

(X , fX ) and (Y , fY ) are not independent because

    f_XY(0, 0) = 0.840 ≠ f_X(0) · f_Y(0) = 0.900 · 0.910 = 0.819.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 176 / 355
Elements of Probability Theory Joint Distributions

Conditional Densities
Whenever we consider two random variables, it does not make sense to
speak of (X , fX ) and (Y , fY ) anymore, as the two random variables may
influence each other. We therefore always need to consider a joint density
fXY for the pair (X , Y ), i.e., a bivariate random variable. The “individual
densities” fX and fY are then just the marginal densities.
If Y = y is kept fixed, the density of the random variable (X , fX ) will be
some given function for this value of Y . It may be another function for
another value of Y .
The conditional probability for an event A1 given A2 was
P[A1 | A2] = P[A1 ∩ A2]/P[A2]. This motivates us to define the
conditional density of a discrete random variable X given Y = y through

P[X = x | Y = y] = P[X = x and Y = y]/P[Y = y] = fXY(x, y)/fY(y).

This will also generalize to continuous random variables.


Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 177 / 355
Elements of Probability Theory Joint Distributions

Conditional Densities

1.4.10. Definition. Let ((X, Y), fXY) be a bivariate random variable with
marginal densities fX and fY. Then
1. The conditional density for X given Y = y is defined as

   fX|y(x) = fXY(x, y) / fY(y)

   whenever fY(y) > 0.
2. The conditional density for Y given X = x is defined as

   fY|x(y) = fXY(x, y) / fX(x)

   whenever fX(x) > 0.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 178 / 355
Elements of Probability Theory Joint Distributions

Expectation for Discrete Bivariate Random Variables

1.4.11. Definition. Let ((X, Y), fXY) be a discrete bivariate random variable
and H: Ω → R some function on the range Ω of (X, Y). Then the expected value of H ◦ (X, Y) is

E[H ◦ (X, Y)] = ∑_{(x,y)∈Ω} H(x, y) · fXY(x, y),

provided that the sum (series) on the right converges absolutely. As
special cases, we consider H(x, y) = x and H(x, y) = y, giving

E[X] = ∑_{(x,y)∈Ω} x · fXY(x, y),    E[Y] = ∑_{(x,y)∈Ω} y · fXY(x, y).

Note that this implies E[X + Y] = E[X] + E[Y], as we have stated earlier.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 179 / 355
Elements of Probability Theory Joint Distributions

Expectation for Discrete Bivariate Random Variables


1.4.12. Example.

X = x/Y = y 0 1 2 3 fX (x)
0 0.840 0.030 0.020 0.010 0.900
1 0.060 0.010 0.008 0.002 0.080
2 0.010 0.005 0.004 0.001 0.020
fY (y ) 0.910 0.045 0.032 0.013 1.00

E[X] = ∑_{(x,y)∈Ω} x · fXY(x, y) = ∑_{x=0}^{2} ∑_{y=0}^{3} x · fXY(x, y) = 0.12

E[Y] = ∑_{(x,y)∈Ω} y · fXY(x, y) = ∑_{x=0}^{2} ∑_{y=0}^{3} y · fXY(x, y) = 0.148
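
A quick numerical check of these two sums (an illustrative Python sketch, not part of the original slides):

    import numpy as np

    f_xy = np.array([[0.840, 0.030, 0.020, 0.010],
                     [0.060, 0.010, 0.008, 0.002],
                     [0.010, 0.005, 0.004, 0.001]])
    x = np.arange(3)   # values of X
    y = np.arange(4)   # values of Y

    E_X = (x[:, None] * f_xy).sum()   # sum over all (x, y) of x * f_XY(x, y)
    E_Y = (y[None, :] * f_xy).sum()   # sum over all (x, y) of y * f_XY(x, y)
    print(E_X, E_Y)                   # 0.12 and 0.148 (up to floating-point rounding), as on the slide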

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 180 / 355
Elements of Probability Theory Joint Distributions

Expectation for Continuous Bivariate Random Variables

1.4.13. Definition. Let ((X, Y), fXY) be a continuous bivariate random
variable and H: R² → R some function. Then the expected value of
H ◦ (X, Y) is

E[H ◦ (X, Y)] = ∫∫_{R²} H(x, y) · fXY(x, y) dx dy,

provided that the integral on the right converges absolutely. As special
cases, we consider H(x, y) = x and H(x, y) = y, giving

E[X] = ∫∫_{R²} x · fXY(x, y) dx dy,    E[Y] = ∫∫_{R²} y · fXY(x, y) dx dy.

Again, we see that E[X + Y] = E[X] + E[Y].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 181 / 355
Elements of Probability Theory Joint Distributions

Covariance

Whenever we deal with bivariate random variables, we are interested not


only in their individual variances, but also in how their interplay influences
their joint deviation from the respective means.
1.4.14. Definition. Let ((X , Y ), fXY ) be a bivariate random variable with
means µX = E[X ] and µY = E[Y ]. Then the covariance of (X , Y ) is given
by
Cov(X , Y ) = σXY = E[(X − µX )(Y − µY )].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 182 / 355
Elements of Probability Theory Joint Distributions

Covariance

We can compute the covariance from the formula

Cov(X, Y) = E[XY] − E[X] E[Y].

1.4.15. Theorem. Let ((X , Y ), fXY ) be a bivariate random variable. If X


and Y are independent, then

Cov(X , Y ) = 0 or, equivalently, E[XY ] = E[X ] E[Y ].

The proof is straightforward and left as an exercise.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 183 / 355
Elements of Probability Theory Joint Distributions

Correlation
While the covariance tells us whether or not two random variables affect
each other, there is no quantitative statement contained in the covariance
as such. We often want to know, however, if there is a linear dependence
between X and Y . This can be determined from the Pearson coefficient of
correlation, defined as follows.

1.4.16. Definition. Let ((X , Y ), fXY ) be a bivariate random variable with


◮ means µX = E[X ], µY = E[Y ],
◮ variances σX2 = Var[X ] = E[(X − µX )2 ] ̸= 0, σY2 = Var[Y ] ̸= 0,
◮ covariance σXY = Cov(X , Y ) = E[(X − µX )(Y − µY )].
The correlation of (X , Y ) is then defined by

ρXY = Cov(X, Y) / √((Var X)(Var Y)) = σXY / (σX σY).
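
For the joint table used in the earlier examples, the covariance and correlation can be computed directly; the following lines are an illustrative Python sketch, not part of the original slides:

    import numpy as np

    f_xy = np.array([[0.840, 0.030, 0.020, 0.010],
                     [0.060, 0.010, 0.008, 0.002],
                     [0.010, 0.005, 0.004, 0.001]])
    x, y = np.arange(3), np.arange(4)

    E_X = (x[:, None] * f_xy).sum()
    E_Y = (y[None, :] * f_xy).sum()
    E_XY = (np.outer(x, y) * f_xy).sum()
    var_X = ((x - E_X)[:, None] ** 2 * f_xy).sum()
    var_Y = ((y - E_Y)[None, :] ** 2 * f_xy).sum()

    cov = E_XY - E_X * E_Y               # Cov(X, Y) = E[XY] - E[X] E[Y]
    rho = cov / np.sqrt(var_X * var_Y)   # Pearson correlation coefficient
    print(cov, rho)                      # both positive: X and Y are correlated (hence not independent)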

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 184 / 355
Elements of Probability Theory Joint Distributions

Correlation

If ρXY = 0, we say that X and Y are uncorrelated, otherwise they are


correlated. However, we can obtain a more useful result:
1.4.17. Theorem. Let ((X , Y ), fXY ) be a bivariate random variable.
1. The correlation coefficient satisfies −1 ≤ ρXY ≤ 1.
2. |ρXY | = 1 if and only if there exist numbers β0 , β1 ∈ R, β1 ̸= 0, such
that
Y = β0 + β1 X
almost surely.

1.4.18. Remark. Heuristically, the closer |ρXY | is to 1, the “more linear”


the relationship between X and Y is.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 185 / 355
Elements of Probability Theory Joint Distributions

Correlation
Proof.
We first show that −1 ≤ ρXY ≤ 1.
We consider two random variables W and Z such that E[Z 2 ], E[W 2 ] ̸= 0.
Note that (aW − Z)² ≥ 0 for any a ∈ R, so

0 ≤ E[(aW − Z)²] = a² E[W²] − 2a E[WZ] + E[Z²].

Now let a = E[WZ]/E[W²]. Then we obtain

−E[WZ]²/E[W²] + E[Z²] ≥ 0   ⇔   E[WZ]²/(E[W²] E[Z²]) ≤ 1.

Now let W = X − µX and Z = Y − µY. Then

E[(X − µX)(Y − µY)]² / (E[(X − µX)²] E[(Y − µY)²]) = ρ²XY ≤ 1.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 186 / 355
Elements of Probability Theory Joint Distributions

Correlation
Proof (continued).
Now let ρ²XY = 1. Then, reversing the steps above, we obtain

E[(X − µX)(Y − µY)]² / (E[(X − µX)²] E[(Y − µY)²]) = E[WZ]²/(E[W²] E[Z²]) = 1
⇔ −E[WZ]²/E[W²] + E[Z²] = 0
⇔ a² E[W²] − 2a E[WZ] + E[Z²] = E[(aW − Z)²] = 0.

Since (aW − Z)² ≥ 0, this implies (aW − Z)² = 0 and aW − Z = 0
almost surely. Re-substituting W = X − µX and Z = Y − µY, we obtain

Y = (µY − aµX) + aX

almost surely.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 187 / 355
Elements of Probability Theory Joint Distributions

Transformation of Variables

The following theorem allows us to perform transformations of random


variables and obtain the densities of the transformed variables.
1.4.19. Theorem. Let ((X , Y ), fXY ) be a continuous bivariate random
variable and let H : R2 → R2 be a differentiable bijective map with inverse
H −1 . Then (U, V ) = H ◦ (X , Y ) is a continuous bivariate random variable
with density

fUV (u, v ) = fXY ◦ H −1 (u, v ) · |det DH −1 (u, v )|,

where DH −1 is the Jacobian of H −1 .


This result basically uses knowledge from Calculus II, which you are
encouraged to review if necessary.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 188 / 355
Elements of Probability Theory Joint Distributions

First Midterm Exam

The preceding material completes our overview of probability theory; on


this basis we will commence with studying statistics.
The preceding material encompasses all of the material that will be the
subject of the First Midterm Exam. The exam will take place on Thursday,
the 12th of March during the usual lecture time.
Please bring a non-programmable calculator to the exam.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 189 / 355
An Introduction to Statistical Methods

Part II

An Introduction to Statistical Methods

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 190 / 355
An Introduction to Statistical Methods

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 191 / 355
An Introduction to Statistical Methods Descriptive statistics

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 192 / 355
An Introduction to Statistical Methods Descriptive statistics

Descriptive Statistics
Up to now we have discussed probability theory, i.e., the properties of
random variables and their distributions. It is clear that we have only
scratched the surface of probability theory and a lot more can be explored.
For example, the theory of stochastic processes and Markov chains have
not been discussed.
Now, however, instead of delving deeper into probability theory, we will
leave the “perfect world” of known random variables and distributions,
and enter the “real world” of statistics, which deals with incomplete
information. Statistical problems are characterized through
◮ a large group of objects about which inferences are to be made, called
a population,
◮ at least one random variable whose behavior is to be studied relative
to the population,
◮ a subset of the population, called a sample which is actually studied.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 193 / 355
An Introduction to Statistical Methods Descriptive statistics

Populations and Samples

2.1.1. Examples.
1. We may want to find the average gas mileage of cars in Shanghai.
Then the population is “all cars in Shanghai”, and we may pick a
sample of 100 cars. Here the random variable is the gas mileage, of
which we wish to obtain the mean.
2. We may want to find out whether a proposed new car model will have
a lower mean gas mileage than existing cars. In this case, the
population is “all cars of this model, existing now and produced in the
future” and a sample might consist of a trial production of 20
prototype cars. Again the random variable is the gas mileage and we
are interested in its expected value.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 194 / 355
An Introduction to Statistical Methods Descriptive statistics

Random Sampling

The first step in a statistical analysis is the selection of a random sample,


which consists of a set of n objects selected from the population such that
the selection of one object does not influence the subsequent selection of
any other.
The random variable X to be analyzed is then studied relative to each
object in the sample; effectively, one obtains n random variables
X1 , . . . , Xn . These random variables and their values are also referred to as
random samples.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 195 / 355
An Introduction to Statistical Methods Descriptive statistics

Random Sampling

However, we define:
2.1.2. Definition. A random sample of size n from the distribution of X is
a collection of n independent random variables X1 , . . . , Xn , each with the
same distribution as X . We say they are independent identically
distributed (i.i.d.) random variables.
In order to guarantee that the random variables in a random sample
are indeed independently distributed, the size of the random sample should
not exceed 5% of the population.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 196 / 355
An Introduction to Statistical Methods Descriptive statistics

Random Variables and Statistics

A statistic is generally speaking a random variable whose numerical values


can be determined from a random sample.
2.1.3. Examples.
1. ∑_{k=1}^{n} Xk,
2. ∑_{k=1}^{n} Xk²,
3. ∑_{k=1}^{n} Xk / n,
4. min{Xk : k = 1, . . . , n}, max{Xk : k = 1, . . . , n}.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 197 / 355
An Introduction to Statistical Methods Descriptive statistics

Random Variables and Statistics

Determining the distribution of a random variable X of a population from


a sample of the population is in general quite difficult. If the random variable is
discrete, it may often be surmised from the physical description of the
experiment.
If X is continuous, a guess can be made from the shape of the distribution
of the random sample: if it appears flat, X may be uniformly distributed;
if it is bell-shaped, X may be normally distributed; and if it appears skewed,
X may have a certain gamma (exponential, χ2 ) distribution.
Since the random sample is just a set of numbers, numerous techniques
have been developed to aid the visualization of its shape.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 198 / 355
An Introduction to Statistical Methods Descriptive statistics

Stem-and-Leaf Diagrams

A stem-and-leaf diagram is a very rough way to get an idea of the shape


of the distribution of a random sample, while preserving some of its
numeric information. It consists of labeled rows of numbers, where the
label is called the stem and the other numbers are called leaves. This idea
was introduced by Tukey in 1977.
In order to construct a stem-and-leaf diagram from a random sample (a
set of values of random variables), follow these steps:
1. Choose some convenient numbers to serve as stems,
2. label the rows using the stems,
3. for each datum of the random sample, note down the digit following
the stem in the corresponding row,
4. turn the graph on its side to get an idea of its distribution.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 199 / 355
An Introduction to Statistical Methods Descriptive statistics

Stem-and-Leaf Diagrams with Mathematica


As an example, we consider the random sample

4285  564 1278  205 3920
2066  604  209  602 1379
2584   14  349 3770   99
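
The original slide shows the corresponding Mathematica output, which is not reproduced here. As a rough illustrative sketch (Python is assumed, with stems chosen as the thousands digit and leaves as the hundreds digit, one possible "convenient" choice for this data set):

    from collections import defaultdict

    data = [4285, 564, 1278, 205, 3920, 2066, 604, 209, 602, 1379,
            2584, 14, 349, 3770, 99]

    stems = defaultdict(list)
    for d in sorted(data):
        stems[d // 1000].append((d // 100) % 10)   # stem: thousands digit, leaf: hundreds digit

    for stem in sorted(stems):
        print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))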

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 200 / 355
An Introduction to Statistical Methods Descriptive statistics

Multiple Stem-and-Leaf Diagrams with Mathematica


Sometimes it is useful to further subdivide the stems. For example, we
might wish to have separate leaves for digits in the ranges 0-4 and 5-9.
This is called a double stem-and-leaf plot.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 201 / 355
An Introduction to Statistical Methods Descriptive statistics

Sample Range

For a random sample X1 , . . . , Xn , the sample range is simply

max_{1≤k≤n} Xk − min_{1≤k≤n} Xk.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 202 / 355
An Introduction to Statistical Methods Descriptive statistics

Histograms

A histogram is a vertical or horizontal bar graph, of the type often


appearing in news media and presentations. There are four main
properties that a histogram should have:
◮ The number of categories should be suitable for the amount of data;
a suggested guideline is based on Sturges’s rule (1926),
◮ each datum should fall into exactly one category,
◮ the categories should have the same width,
◮ no datum should assume a boundary value.
An algorithm for selecting categories that implement these points is given
in the textbook. In practice, you will use computer software that will
create histograms from data sets and it will often be advisable to modify
the default setting to conform to this algorithm.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 203 / 355
An Introduction to Statistical Methods Descriptive statistics

Histograms
Steps in creating a histogram for numerical data:
1. Find the desired number of categories:
Data Set Size          # Categories
< 16                   Insufficient data
2^(n−1) to 2^n − 1     n

2. Calculate data (sample) range: (largest datum) - (smallest datum).


3. Divide data range by number of categories; round up to the accuracy
of the data, or add a smallest decimal unit if already at accuracy of
data. This is the category length.
4. The lower boundary for the first category lies 1/2 smallest decimal
unit below the smallest datum.
5. The remaining boundaries are found by adding the category length to
the preceding boundary value.
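
The steps above can be turned into a short function. The sketch below is illustrative only and is not from the original slides (Python is assumed, and the handling of the data's accuracy via the `decimals` argument is an assumption of this sketch):

    import math

    def histogram_boundaries(data, decimals=0):
        """Category boundaries following the steps above; `decimals` is the accuracy of the data."""
        n = len(data)
        if n < 16:
            raise ValueError("insufficient data")
        k = math.floor(math.log2(n)) + 1     # 2**(k-1) <= n <= 2**k - 1  gives k categories
        unit = 10.0 ** (-decimals)           # smallest decimal unit of the data
        data_range = max(data) - min(data)
        # round the category length up to the accuracy of the data; if it is already
        # at that accuracy, one smallest decimal unit is added
        width = (math.floor(data_range / k / unit + 1e-9) + 1) * unit
        lower = min(data) - unit / 2         # first boundary: half a unit below the smallest datum
        return [lower + i * width for i in range(k + 1)]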

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 204 / 355
An Introduction to Statistical Methods Descriptive statistics

Histograms with Mathematica (default settings)

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 205 / 355
An Introduction to Statistical Methods Descriptive statistics

Histograms with Mathematica (determining categories)

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 206 / 355
An Introduction to Statistical Methods Descriptive statistics

Histograms with Mathematica (adjusted settings)

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 207 / 355
An Introduction to Statistical Methods Descriptive statistics

Ogives

An ogive is also known as a cumulative frequency plot. The abscissa


shows the boundaries of the categories of the histogram, while the ordinate
gives the number of data in the corresponding category and all
“preceding” categories. We use a relative frequency ogive, so we divide by
the total number of data. This graph will approximate the cumulative
distribution function F for a continuously distributed random variable.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 208 / 355
An Introduction to Statistical Methods Descriptive statistics

Ogives with Mathematica

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 209 / 355
An Introduction to Statistical Methods Descriptive statistics

Sample Statistics - Location

2.1.4. Definition. Let X1 , . . . , Xn be a random sample from the distribution


of a random variable X . The statistic

X = (1/n) ∑_{i=1}^{n} Xi

is called the sample mean.


Note that X ̸= E[X ]. While E[X ] is the actual mean of X , X depends on
the chosen random sample and may at best approximate E[X ].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 210 / 355
An Introduction to Statistical Methods Descriptive statistics

Sample Statistics - Location


Another measure for the location of a distribution of a random variable X
is the median M, defined by

P[X ≤ M] = 0.50.

For a continuous distribution, the median is the “halfway point”, i.e., an


observation of X is just as likely to fall below it as above it.
For a random sample X1, . . . , Xn, the sample median is defined in the
following way: arrange (re-index) the values x1, . . . , xn in such a way that
xk ≤ xk+1, k = 1, . . . , n − 1. Then

x̃ = (x_{n/2} + x_{n/2+1})/2   if n is even,
x̃ = x_{(n+1)/2}               if n is odd.

This sample median may differ significantly from the sample mean!
We define the median location by the number (n + 1)/2.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 211 / 355
An Introduction to Statistical Methods Descriptive statistics

Mean and Median with Mathematica

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 212 / 355
An Introduction to Statistical Methods Descriptive statistics

Sample Statistics - Variability


The main measure of variability in a random variable is the variance, and
we can define the variance of a sample analogously to that of a discrete
random variable, replacing the expectation value by the sample mean,

S² = (1/n) ∑_{k=1}^{n} (Xk − X)².

However, it can be shown that this formula (which is in use in certain


calculators and computer systems) underestimates the variance σ 2 ;
therefore, we set instead

S² = (1/(n−1)) ∑_{k=1}^{n} (Xk − X)²,

defining the sample variance in this way and the sample standard deviation
S = √(S²).
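
As a practical aside (illustrative, not from the slides): many software defaults divide by n, so the n − 1 divisor has to be requested explicitly. In Python/NumPy, for instance, using the thermal-conductivity measurements that appear later in Example 2.2.14:

    import numpy as np

    x = np.array([41.60, 41.48, 42.34, 41.95, 41.86, 42.18, 41.72, 42.26, 41.81, 42.04])

    print(x.var())         # divides by n: the version that underestimates sigma^2
    print(x.var(ddof=1))   # divides by n - 1: the sample variance S^2
    print(x.std(ddof=1))   # sample standard deviation S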
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 213 / 355
An Introduction to Statistical Methods Descriptive statistics

Variance and Standard Deviation with Mathematica

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 214 / 355
An Introduction to Statistical Methods Descriptive statistics

Rounding of Statistics

We round the statistics in the following ways (rounding instead of


truncating):
◮ For the mean we give one more decimal place than the original data
has.
◮ For the variance we give two more decimal places than the original
data has.
◮ For the standard deviation we give one more decimal place than the
original data has.
◮ The range and median are not rounded.
A final warning: when calculating the above statistics, always make sure
that you are actually studying a sample of a population, not the
population as a whole!

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 215 / 355
An Introduction to Statistical Methods Descriptive statistics

Boxplots - Quartiles
Boxplots are a very useful way of visualizing data. In order to construct a
boxplot, we first need to determine the quartiles q1 and q3 and the
interquartile range iqr = q3 − q1 . The quartiles play a similar role to that
of the median (which would be the quartile denoted q2 ) in that of a
random sample ordered from smallest to largest, 25% would lie below q1
and 75% would lie below q3 . The precise construction of q1 and q3 varies
(one algorithm is given in the textbook).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 216 / 355
An Introduction to Statistical Methods Descriptive statistics

Construction of Boxplots
The construction of a boxplot will be demonstrated on the blackboard.
We need the following data:
◮ q1, x̃, q3 and iqr.
◮ Inner fences

  f1 = q1 − (3/2) iqr,   f3 = q3 + (3/2) iqr.

◮ Adjacent values

  a1 = min{xk : xk ≥ f1},   a3 = max{xk : xk ≤ f3}.

◮ Outer fences

  F1 = q1 − 3 iqr,   F3 = q3 + 3 iqr.
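
These quantities are easy to compute. The lines below are an illustrative Python sketch (not from the slides), using one common convention for the quartiles — the slide notes that the precise construction of q1 and q3 varies — and the random sample from the stem-and-leaf example:

    import numpy as np

    x = np.sort(np.array([4285, 564, 1278, 205, 3920, 2066, 604, 209, 602, 1379,
                          2584, 14, 349, 3770, 99]))

    q1, q3 = np.percentile(x, [25, 75])        # one common quartile convention
    iqr = q3 - q1
    f1, f3 = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # inner fences
    F1, F3 = q1 - 3.0 * iqr, q3 + 3.0 * iqr    # outer fences
    a1, a3 = x[x >= f1].min(), x[x <= f3].max()               # adjacent values
    near = x[((x < f1) | (x > f3)) & (x >= F1) & (x <= F3)]   # near outliers
    far = x[(x < F1) | (x > F3)]                              # far outliers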

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 217 / 355
An Introduction to Statistical Methods Descriptive statistics

Boxplots with Mathematica

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 218 / 355
An Introduction to Statistical Methods Descriptive statistics

Outliers

Data points lying between the inner and outer fences are called near
outliers, those lying outside the outer fences are called far outliers. Far
outliers are unusual if (and only if!) an approximately bell-shaped
distribution of the random variable X of the population is expected. In
this case, their origin should be investigated.
◮ If the outlier seems to be the result of an error in measurement or
data collecting, it may be discarded from the data.
◮ If the outlier seems to be the result of a random measurement, it is
recommended that statistics are reported twice: with the outlier
included and without the outlier.
◮ As a rule of thumb: Of 1000 random samples of a normally
distributed population, it can be expected that 7 will be outliers.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 219 / 355
An Introduction to Statistical Methods Estimation

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 220 / 355
An Introduction to Statistical Methods Estimation

Estimation
In the last section we have seen how data obtained from a random sample
can be used to obtain information on a population; in particular a statistic
(such as the sample mean) would approximate a population quantity (such as the
mean). However, no precise information on the “quality” of the
approximation was given, and one formula (for the sample variance)
remained obscure and counterintuitive.
The process of using statistics to approximate random variables is called
estimation. We now aim to provide a mathematical framework for this
process. Note that the language of statistics differs slightly from that of
probability theory; instead of a random variable X of a population, we
refer to a more general population parameter θ such as the mean or
standard deviation. But a population parameter can also be the parameter
of a certain distribution (such as λ of the Poisson distribution). We have
previously seen that functions of random samples X1 , . . . , Xn (which in
probability theory are random variables themselves) are called statistics.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 221 / 355
An Introduction to Statistical Methods Estimation

Estimators

An estimator for a population parameter θ is a statistic and denoted by θ̂.
Any given value of θ̂ is called an estimate. (More precisely, we refer to
point estimators and point estimates.) We would like an estimator to have
the following properties:
◮ The expected value of θ̂ should be equal to θ,
◮ θ̂ should have small variance for large sample sizes.

2.2.1. Definition. The difference θ − E[θ̂] is called the bias of an estimator
θ̂ for a population parameter θ. We say that θ̂ is unbiased if E[θ̂] = θ.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 222 / 355
An Introduction to Statistical Methods Estimation

Estimators

The quality of an estimator is measured by its mean square error, defined as

MSE(θ̂) := E[(θ̂ − θ)²].

This can be rewritten as

MSE(θ̂) = E[(θ̂ − E[θ̂])²] + (θ − E[θ̂])² = Var θ̂ + (bias)².

Hence variance can be just as important as bias for an estimator (see
blackboard). In general, we will prefer to have an unbiased estimator, but
sometimes biased estimation is used (e.g., in multiple regression).
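
A small simulation makes the bias/variance trade-off concrete. The sketch below (illustrative only, assuming Python/NumPy) compares the two variance estimators from the previous section on repeated normal samples:

    import numpy as np

    rng = np.random.default_rng(0)
    sigma2, n, reps = 4.0, 10, 100_000

    samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    s2_biased = samples.var(axis=1, ddof=0)     # divides by n
    s2_unbiased = samples.var(axis=1, ddof=1)   # divides by n - 1

    print(s2_biased.mean(), s2_unbiased.mean())   # about 3.6 vs about 4.0: the first underestimates sigma^2
    mse = lambda est: np.mean((est - sigma2) ** 2)   # MSE = Var + bias^2
    print(mse(s2_biased), mse(s2_unbiased))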

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 223 / 355
An Introduction to Statistical Methods Estimation

Sample Mean

2.2.2. Theorem. Let X1 , . . . , Xn be a random sample of size n from a


distribution with mean µ. The sample mean X is an unbiased estimator
for µ.

Proof.
We simply insert the definition of the sample mean and use the properties
of the expectation:
E[X] = E[(X1 + · · · + Xn)/n] = (1/n) E[X1 + · · · + Xn]
     = (1/n)(E[X1] + · · · + E[Xn]) = nµ/n = µ.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 224 / 355
An Introduction to Statistical Methods Estimation

Sample Variance
2.2.3. Theorem. Let X be the sample mean of a random sample of size n
from a distribution with mean µ and variance σ 2 . Then
Var X = E[(X − µ)²] = σ²/n.
Proof.
We simply insert the definition of the sample mean and use the properties
of the variance:
Var X = Var((X1 + · · · + Xn)/n) = (1/n²) Var(X1 + · · · + Xn)
      = (1/n²)(Var X1 + · · · + Var Xn) = nσ²/n² = σ²/n.

Thus X is both unbiased and has a variance that decreases with large n; it
is a “nice” estimator, since we can make the mean square error MSE X as
small as desired by taking n large enough.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 225 / 355
An Introduction to Statistical Methods Estimation

Standard Error of the Mean and Sample Variance

2.2.4. Definition. The standard deviation of X is given by √(Var X) = σ/√n
and is called the standard error of the mean.

2.2.5. Theorem. The sample variance

S² = (1/(n−1)) ∑_{k=1}^{n} (Xk − X)²

is an unbiased estimator for σ².

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 226 / 355
An Introduction to Statistical Methods Estimation

Sample Variance

Proof.
We simply calculate E[S²]:

E[S²] = (1/(n−1)) E[ ∑_{k=1}^{n} (Xk − X)² ]
      = (1/(n−1)) E[ ∑_{k=1}^{n} (Xk − µ + µ − X)² ]
      = (1/(n−1)) E[ ∑_{k=1}^{n} (Xk − µ)² − 2(X − µ) ∑_{k=1}^{n} (Xk − µ) + ∑_{k=1}^{n} (µ − X)² ]
      = (1/(n−1)) E[ ∑_{k=1}^{n} (Xk − µ)² − 2(X − µ)( ∑_{k=1}^{n} Xk − nµ ) + n(µ − X)² ].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 227 / 355
An Introduction to Statistical Methods Estimation

Sample Variance
Proof (continued).
Note that ∑_{k=1}^{n} Xk = nX, so

E[S²] = (1/(n−1)) E[ ∑_{k=1}^{n} (Xk − µ)² − 2(X − µ)( ∑_{k=1}^{n} Xk − nµ ) + n(µ − X)² ]
      = (1/(n−1)) E[ ∑_{k=1}^{n} (Xk − µ)² − 2(X − µ)(nX − nµ) + n(µ − X)² ]
      = (1/(n−1)) E[ ∑_{k=1}^{n} (Xk − µ)² − n(X − µ)² ]
      = (1/(n−1)) ( ∑_{k=1}^{n} E[(Xk − µ)²] − n E[(X − µ)²] ).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 228 / 355
An Introduction to Statistical Methods Estimation

Sample Variance

Proof (continued).
Since Var Xk = σ 2 = E[(Xk − µ)2 ] for each k = 1, . . . , n, and
E[(X − µ)2 ] = σ 2 /n by Theorem 2.2.3, we have
E[S²] = (1/(n−1)) ( ∑_{k=1}^{n} E[(Xk − µ)²] − n E[(X − µ)²] )
      = (1/(n−1)) ( ∑_{k=1}^{n} σ² − n · σ²/n )
      = (1/(n−1)) (nσ² − σ²) = σ².

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 229 / 355
An Introduction to Statistical Methods Estimation

Finding Estimators - Method of Moments

The method of moments goes back to Karl Pearson in 1894 and uses the
basic fact that unbiased estimators Mk for the kth moments E [X k ] of a
distribution are
Mk = (1/n) ∑_{i=1}^{n} Xi^k,

given a random sample X1 , . . . , Xn .


The idea is then that population parameters θj can often be expressed in
terms of the moments of the distribution. Replacing the moments in these
expressions by their estimators then yields estimators for the parameters θj .
Note: Estimators obtained in this way are not necessarily unbiased!

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 230 / 355
An Introduction to Statistical Methods Estimation

Finding Estimators - Method of Moments

2.2.6. Example. Let X1 , . . . , Xn be a random sample from a gamma


distribution with parameters α and β. We know that

E[X] = αβ,    Var X = E[X²] − E[X]² = αβ².

Replacing the moments with M1 and M2, we obtain

M1 = α̂β̂,    M2 − M1² = α̂β̂².

This gives first M2 − M1² = M1 β̂ and then

β̂ = (M2 − M1²)/M1,    α̂ = M1/β̂ = M1²/(M2 − M1²).
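
An illustrative check of these estimators on simulated data (a Python sketch, not part of the original slides; NumPy's gamma generator is parameterised by shape α and scale β, matching the convention E[X] = αβ used above):

    import numpy as np

    def gamma_moment_estimates(x):
        x = np.asarray(x)
        m1 = x.mean()            # first sample moment M1
        m2 = (x ** 2).mean()     # second sample moment M2
        beta_hat = (m2 - m1 ** 2) / m1
        alpha_hat = m1 ** 2 / (m2 - m1 ** 2)
        return alpha_hat, beta_hat

    rng = np.random.default_rng(1)
    x = rng.gamma(shape=2.0, scale=3.0, size=10_000)
    print(gamma_moment_estimates(x))   # should be close to (2, 3) for a sample this large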

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 231 / 355
An Introduction to Statistical Methods Estimation

Finding Estimators - Method of Maximum Likelihood


Maximum likelihood estimators can be traced back to Carl Friedrich Gauß, who
used them more than 170 years ago on isolated problems. They are based
on the idea that given a set of observations x1 , . . . xn , one finds the value
of the population parameter most likely to have produced these
observations. In other words, we express the probability of obtaining
x1 , . . . xn as a function of the parameter and then find the value of θ that
maximizes this probability. We proceed as follows:
1. Obtain a random sample x1 , . . . , xn from the distribution of a random
variable X with density f and associated parameter θ.
2. Define the likelihood function L by

   L(θ) = ∏_{i=1}^{n} f(xi).

3. Obtain the estimator θ̂(x1, . . . , xn) for θ from the condition
   L(θ̂) = max.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 232 / 355
An Introduction to Statistical Methods Estimation

Finding Estimators - Method of Maximum Likelihood

2.2.7. Example. Water samples of a specific size are taken from a river
suspected of having been polluted by improper treatment procedures at an
upstream sewage-disposal plant. Let X denote the number of coliform
organism found per sample, and assume that X is a Poisson random
variable with parameter k. Let x1 , . . . , xn be a random sample from the
distribution of X . We want to determine the value of k that gives the
highest probability of observing this sample.
Since random sampling implies independence,

P[X1 = x1 and X2 = x2 and . . . and Xn = xn] = ∏_{j=1}^{n} P[Xj = xj].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 233 / 355
An Introduction to Statistical Methods Estimation

Finding Estimators - Method of Maximum Likelihood


The density for X is given by P[X = x] = f(x) = e^{−k} k^x / x!, x ∈ N, so

∏_{j=1}^{n} P[Xj = xj] = e^{−nk} k^{∑_j xj} / ∏_j xj! =: L(k).

L is called the likelihood function for k. We want to find the value of k


that maximizes L. To simplify our calculations, we take the logarithm of
the above expression:
ln L(k) = −nk + (∑_{j=1}^{n} xj) ln k − ln ∏_{j=1}^{n} xj!.

Maximizing ln L(k) will also maximize L(k), so we take the first derivative
and set it equal to zero:

d ln L(k)/dk = −n + (1/k) ∑_{j=1}^{n} xj = 0   ⇔   k = x.
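
A numerical illustration (a Python sketch with simulated counts, not part of the original slides): the closed-form estimate k̂ = x̄ can be checked against a brute-force maximisation of the log-likelihood.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.poisson(lam=4.5, size=200)   # simulated coliform counts, purely for illustration

    k_hat = x.mean()                     # maximum likelihood estimate: k_hat = x_bar

    def log_likelihood(k):               # constant term -sum(log x_j!) omitted
        return -len(x) * k + x.sum() * np.log(k)

    ks = np.linspace(0.5, 10.0, 2000)
    print(k_hat, ks[np.argmax(log_likelihood(ks))])   # the two values agree closely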

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 234 / 355
An Introduction to Statistical Methods Estimation

Distribution of the Sample Mean


We now want to analyze the distribution of the sample mean - this will be
the first step in obtaining information on how well the sample mean
approximates the population mean.
Our main tool will be the moment generating functions. We thus first
establish some basic properties of them.
Uniqueness Theorem (false as stated, but convenient): Let X and Y be two random
variables with moment-generating functions mX and mY, respectively. If
mX = mY in some neighborhood of 0, then X = Y (more precisely, X and Y then have the same distribution).

2.2.8. Theorem. Let X1 and X2 be two random variables with


moment-generating functions mX1 and mX2 , respectively. If Y = X1 + X2 ,
then
mY = mX1 mX2 .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 235 / 355
An Introduction to Statistical Methods Estimation

Distribution of the Sample Mean

2.2.9. Theorem. Let X be a random variable with moment-generating


function mX . Let Y = α + βX . Then

mY (t) = e αt mX (βt).

These results can be used to obtain (homework!) the following

2.2.10. Theorem. Let X1 , . . . , Xn be a random sample of size n from a


normal distribution with mean µ and variance σ 2 . Then X is normally
distributed with mean µ and variance σ 2 /n.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 236 / 355
An Introduction to Statistical Methods Estimation

Confidence Intervals
2.2.11. Definition. Let 0 ≤ α ≤ 1. A 100(1 − α)% confidence interval for a
parameter θ is an interval [L1 , L2 ] such that

P[L1 ≤ θ ≤ L2 ] = 1 − α.

Since L1 and L2 are random variables, we often call [L1 , L2 ] a random


interval. Note that the population parameter θ is not a random variable,
but constant.
If one of L1 and L2 is not a random variable, but a fixed number (such as
0, ∞ or −∞), then we speak of a one-sided confidence interval.
Initially, we are most often interested in centered confidence intervals with
L1 = θ̂ − L and L2 = θ̂ + L, where L is a sample statistic and the interval
is centered on θ̂, a point estimate for θ.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 237 / 355
An Introduction to Statistical Methods Estimation

Interval Estimation

2.2.12. Notation. We will often denote an interval of the form


[x − ε, x + ε] for x ∈ R, ε > 0 by x ± ε. In fact, we define

y = x ± ε   :⇔   y ∈ [x − ε, x + ε].

We would like to make statements such as “based on the results of a


sample, we are 90% certain that the mean of a population lies in X ± L.”
This is known as interval estimation.
The simplest case is when we are looking for a confidence interval for the
mean of a normal population where we already know its variance.
Although this case rarely occurs in applications, understanding it is a first
step to more complicated (and realistic) scenarios.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 238 / 355
An Introduction to Statistical Methods Estimation

Interval Estimation

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 239 / 355
An Introduction to Statistical Methods Estimation

Interval Estimation for the Mean (Variance Known)

Assume that we have a random sample of size n from a normal population


with unknown mean µ and known variance σ 2 . Our sample yields a point
estimate X for µ. We are now interested in finding L = L(α) such that we
can state with 100(1 − α)% confidence that µ = X ± L.

We first define zα/2 for α ∈ [0, 1] by

α/2 = P[Z ≥ zα/2] = (1/√(2π)) ∫_{zα/2}^{∞} e^{−x²/2} dx,

where Z is a standard normal variable.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 240 / 355
An Introduction to Statistical Methods Estimation

Interval Estimation for the Mean (Variance Known)


Fix α ∈ [0, 1]. Then

1 − α = P[X − L ≤ µ ≤ X + L] = P[(X − µ − L)/(σ/√n) ≤ 0 ≤ (X − µ + L)/(σ/√n)].

By Theorem 2.2.10 the sample mean is normally distributed with mean µ
and variance σ²/n. Thus,

Z = (X − µ)/(σ/√n)

follows a standard normal distribution, and so

1 − α = P[Z − L/(σ/√n) ≤ 0 ≤ Z + L/(σ/√n)]
      = P[−L/(σ/√n) ≤ Z ≤ L/(σ/√n)]
      = 2 P[0 ≤ Z ≤ L/(σ/√n)] = 1 − 2 P[L/(σ/√n) ≤ Z < ∞].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 241 / 355
An Introduction to Statistical Methods Estimation

Interval Estimation for the Mean (Variance Known)


Thus we determine L as being the number such that
· ¸
L
P √ ≤ Z ≤ ∞ = α/2.
σ/ n
But this means that
L zα/2 · σ
√ = zα/2 ⇔ L= √ .
σ/ n n
We have thus proved the following result:
2.2.13. Theorem. Let X1 , . . . , Xn be a random sample of size n from a
normal distribution with mean µ and variance σ 2 . A 100(1 − α)%
confidence interval on µ is given by
zα/2 · σ
X± √ .
n

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 242 / 355
An Introduction to Statistical Methods Estimation

Interval Estimation
2.2.14. Example. An article in the Journal of Heat Transfer describes a
method of measuring the thermal conductivity of Armco iron. Using a
temperature of 100◦ F and a power input of 550 W, the following 10
measurements of thermal conductivity (in Btu /(hr ft ◦ F)) were obtained:
41.60 41.48 42.34 41.95 41.86
42.18 41.72 42.26 41.81 42.04
A point estimate of the mean thermal conductivity at 100◦ F and 550 W is
the sample mean,

x = 41.924 Btu /(hr ft ◦ F).

Suppose we know that the standard deviation of the thermal conductivity
under the given conditions is σ = 0.10 Btu /(hr ft ◦ F). A 95% confidence
interval (α = 0.05) on the mean is then given by

x ± z0.025 · σ/√n = 41.924 ± 0.062 = [41.862, 41.986].
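
The same numbers can be reproduced with a few lines of code (an illustrative Python/SciPy sketch, not part of the original slides):

    import numpy as np
    from scipy.stats import norm

    x = np.array([41.60, 41.48, 42.34, 41.95, 41.86, 42.18, 41.72, 42.26, 41.81, 42.04])
    sigma, alpha = 0.10, 0.05

    z = norm.ppf(1 - alpha / 2)              # z_{alpha/2}, about 1.96
    half = z * sigma / np.sqrt(len(x))
    print(x.mean() - half, x.mean() + half)  # about [41.862, 41.986], as above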

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 243 / 355
An Introduction to Statistical Methods Estimation

Central Limit Theorem

While up to now most assertions we have made have been proven (more or
less rigorously), many important results in statistics are highly non-trivial
and require an inordinate (for this course) amount of effort to prove. One
of the first of these is the Central Limit Theorem, which we now cite:
2.2.15. Theorem. Let X1 , . . . , Xn be a sequence of independent random
variables with arbitrary distributions, means E[Xj ] = µj and variances
Var Xj = σj² (all finite). Let Y = X1 + · · · + Xn. Then under some general
conditions

Zn = (Y − ∑_j µj) / √(∑_j σj²)

is approximately standard-normally distributed as n becomes large.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 244 / 355
An Introduction to Statistical Methods Estimation

Central Limit Theorem

If the random variables X1 , . . . , Xn all follow the same distribution, one


obtains the following special case:
2.2.16. Theorem. Let X1 , . . . , Xn be a random sample of size n from an
arbitrary distribution with mean µ and variance σ 2 . Let
Y = X1 + · · · + Xn = nX . Then under some general conditions, for large n,
X is approximately normal with mean µ and variance σ 2 /n. Furthermore,

Zn = (Y − nµ)/(σ√n) = (X − µ)/(σ/√n)

is approximately standard normal.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 245 / 355
An Introduction to Statistical Methods Estimation

Central Limit Theorem


2.2.17. Example. In a construction project, a network of major activities
has been constructed to serve as a basis for planning and scheduling. On a
critical path there are 16 activities.
Activity Mean (weeks) Variance Activity Mean (weeks) Variance
1 2.7 1.0 9 3.1 1.2
2 3.2 1.3 10 4.2 0.8
3 4.6 1.0 11 3.6 1.6
4 2.1 1.2 12 0.5 0.2
5 3.6 0.8 13 2.1 0.6
6 5.2 2.1 14 1.5 0.7
7 7.1 1.9 15 1.2 0.4
8 1.5 0.5 16 2.8 0.7
The activity times may be considered independent and the project time Y
is the sum of the individual activity times Xj on the critical path, i.e.,
Y = X1 + X2 + · · · + X16 .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 246 / 355
An Introduction to Statistical Methods Estimation

Central Limit Theorem


The contractor would like to know
1. the expected completion time and
2. a project time y0 corresponding to a probability of 0.90 of having the
project completed.

We calculate µY = ∑_j µXj = 49 weeks and σY² = ∑_j σXj² = 16 weeks².
Hence the expected completion time is 49 weeks.
Using the central limit theorem, we can use the normal distribution to find
an approximate value for y0:

P[Y ≤ y0] = P[Z ≤ (y0 − 49 weeks)/(4 weeks)] = 0.9

gives

(y0 − 49 weeks)/(4 weeks) = 1.282,   or   y0 = 54.128 weeks.
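
Equivalently (an illustrative SciPy sketch, not part of the original slides), y0 is the 0.90 quantile of a normal distribution with the computed mean and standard deviation:

    from scipy.stats import norm

    mu_Y, sigma_Y = 49.0, 4.0                    # sum of the means; square root of the summed variances (weeks)
    y0 = norm.ppf(0.90, loc=mu_Y, scale=sigma_Y)
    print(y0)                                    # about 54.1 weeks, matching the calculation above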

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 247 / 355
An Introduction to Statistical Methods Estimation

Central Limit Theorem

How large must n be for the central limit theorem to give a good
approximation?
This depends on how “well-behaved” the distributions of the variables Xj
are:
1. Well-behaved (nearly symmetric densities that look close to that of a
normal distribution): n ≥ 4.
2. Reasonably behaved (no prominent mode, densities look like uniform
densities): n ≥ 12.
3. Ill-behaved (much of the weight of the densities is in the tails,
irregular appearance): n ≥ 100.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 248 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 249 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Preliminary Theory

Let X1 , . . . , Xn be i.i.d. random variables following a normal distribution


with mean µ and variance σ 2 . We define the random variable
χn = (1/σ) √( ∑_{k=1}^{n} (Xk − µ)² ),

and we are interested in its density function. We will consider the
cumulative distribution function Fχn,

Fχn (y ) = P[χn ≤ y ].

Clearly, Fχn (y ) = 0 for y < 0.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 250 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Preliminary Theory
For y > 0,

Fχn(y) = P[χn ≤ y] = P[(1/σ) √( ∑_{k=1}^{n} (Xk − µ)² ) < y]
       = P[∑_{k=1}^{n} ((Xk − µ)/σ)² < y²] = P[∑_{k=1}^{n} Zk² < y²],

where the variables Zk = (Xk − µ)/σ, k = 1, . . . , n, follow a standard normal
distribution.
The joint density of n independent standard normal random variables is the
product of their individual densities, so we have

Fχn(y) = ∫_{∑_{k=1}^{n} zk² < y²} (2π)^{−n/2} e^{−∑_{k=1}^{n} zk²/2} dz1 . . . dzn.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 251 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Preliminary Theory

We recall that on Rn \ {0} we can introduce polar coordinates


(r , θ1 , . . . , θn−1 ) by

x1 = r sin θ1
x2 = r cos θ1 sin θ2
x3 = r cos θ1 cos θ2 sin θ3
..
.
xn−1 = r cos θ1 cos θ2 . . . cos θn−2 sin θn−1
xn = r cos θ1 cos θ2 . . . cos θn−2 cos θn−1

Here r > 0 and −π/2 < θk < π/2 for k = 1, . . . , n − 2, and 0 < θn−1 < 2π.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 252 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Preliminary Theory
The integral becomes

Fχn(y) = (2π)^{−n/2} ∫_0^{2π} ∫_{−π/2}^{π/2} · · · ∫_{−π/2}^{π/2} ∫_0^y e^{−r²/2} r^{n−1}
         × D(θ1, . . . , θn−1) dr dθ1 . . . dθn−2 dθn−1,

where D(θ1, . . . , θn−1) is the modulus of the determinant of the Jacobian
of the transformation (θ1, . . . , θn−1) → x/r. Writing

Cn = (2π)^{−n/2} ∫_0^{2π} ∫_{−π/2}^{π/2} · · · ∫_{−π/2}^{π/2} D(θ1, . . . , θn−1) dθ1 . . . dθn−2 dθn−1,

we have

Fχn(y) = Cn ∫_0^y r^{n−1} e^{−r²/2} dr.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 253 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Preliminary Theory

We determine Cn from

1 = lim_{y→∞} Fχn(y) = Cn ∫_0^∞ r^{n−1} e^{−r²/2} dr = Cn Γ(n/2) 2^{n/2−1}.

Hence

Fχn(y) = (1/(Γ(n/2) 2^{n/2−1})) ∫_0^y r^{n−1} e^{−r²/2} dr,

and the density of χn is given by

fχn(y) = F′χn(y) = (2/(2^{n/2} Γ(n/2))) y^{n−1} e^{−y²/2}.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 254 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

The χ2 -Distribution
Next we consider the random variable

χn² = (1/σ²) ∑_{k=1}^{n} (Xk − µ)²,                    (2.3.1)

where again X1, . . . , Xn are i.i.d. random variables following a normal
distribution with mean µ and variance σ².
From our formula for the transformation of variables, we obtain

fχn²(z) = (2/(2^{n/2} Γ(n/2))) z^{(n−1)/2} e^{−z/2} · 1/(2√z) = (1/(2^{n/2} Γ(n/2))) z^{n/2−1} e^{−z/2}.

This is just the chi-squared distribution with n degrees of freedom. Note


that this distribution does not depend on µ or σ; hence any χ2 distribution
can be regarded as being the distribution of the sum of squares of
independent standard-normally distributed random variables.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 255 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

The Chi-Squared Distribution


We immediately obtain the following result:
2.3.1. Lemma. Let X²γ1, . . . , X²γn be n independent random variables
following chi-squared distributions with γ1, . . . , γn degrees of freedom,
respectively. Then

X²α := ∑_{k=1}^{n} X²γk

is a chi-squared random variable with α = ∑_{k=1}^{n} γk degrees of freedom.

Proof.
Each of the chi-squared random variables X²γk may be regarded as being a
sum of γk squares of standard-normally distributed random variables.
Hence their sum may be regarded as the sum of α = ∑_{k=1}^{n} γk squares of
standard-normally distributed random variables, giving a chi-squared
random variable with α degrees of freedom.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 256 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Joint Sampling of Mean and Variance


Our interest in the chi-squared distribution is not merely abstract, for
understanding the sum of squares of normally distributed random
variables; in fact, the main application lies in analyzing the distribution of
the sample variance. In the previous chapter, we were able to analyze the
sample mean, and also its distribution, under the assumption of known
variance. If the variance

σ 2 = E[(X − µ)2 ]

is unknown, we must start all over again, and first learn more about the
sample variance
1 X
n
S2 = (Xk − X )2 .
n−1
k=1

The problem essentially is that we are using the random sample X1 , . . . , Xn


to obtain X and S 2 at the same time, i.e., we actually need to obtain the
joint distribution of X and S 2 .
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 257 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Joint Sampling of Mean and Variance

The problem is resolved by the following fundamental theorem:


2.3.2. Theorem. Let X1 , . . . , Xn , n ≥ 2, be a random sample of size n
from a normal distribution with mean µ and variance σ 2 . Then
1. The sample mean X is independent of the sample variance S 2 ,
2. X is normally distributed with mean µ and variance σ 2 /n,
3. (n − 1)S 2 /σ 2 is chi-squared distributed with n − 1 degrees of freedom.

The proof of this theorem uses the so-called Helmert transformation,


which we now discuss.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 258 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

The Helmert Transformation


The Helmert transformation is a very special kind of orthogonal
transformation from a set of n ≥ 2 i.i.d. normal random variables
X1 , . . . , Xn to a new set of random variables Y1 , . . . , Yn . Effectively, a
sample of size n of a normal population X with mean µ and variance σ 2 is
transformed as follows:
Y1 = (1/√n) (X1 + · · · + Xn)
Y2 = (1/√2) (X1 − X2)
Y3 = (1/√6) (X1 + X2 − 2X3)
  ⋮
Yn = (1/√(n(n−1))) (X1 + X2 + · · · + Xn−1 − (n − 1)Xn)

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 259 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

The Helmert Transformation


In matrix notation,

  ( Y1 )   (  1/√n          1/√n          1/√n        · · ·   1/√n            ) ( X1 )
  ( Y2 )   (  1/√2         −1/√2          0           · · ·   0               ) ( X2 )
  ( Y3 ) = (  1/√6          1/√6         −2/√6        · · ·   0               ) ( X3 )
  (  ⋮ )   (   ⋮             ⋮             ⋮           ⋱       ⋮              ) (  ⋮ )
  ( Yn )   ( 1/√(n(n−1))   1/√(n(n−1))   1/√(n(n−1))  · · ·  −(n−1)/√(n(n−1)) ) ( Xn )

or Y = AX for short. It is easy to see that the rows of the matrix A are
orthonormal. Thus, A is an orthogonal matrix, A⁻¹ = Aᵀ. This
immediately implies |det A| = 1, since

det A = det Aᵀ = det A⁻¹ = 1/det A   ⇒   (det A)² = 1.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 260 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

The Helmert Transformation


Incidentally, the orthogonality of A also implies that if y = Ax, then

∑_{i=1}^{n} yi² = ⟨y, y⟩ = ⟨Ax, Ax⟩ = ⟨AᵀAx, x⟩ = ⟨x, x⟩ = ∑_{i=1}^{n} xi².        (2.3.2)

We have assumed that the random variables X1 , . . . , Xn are i.i.d., so their


joint distribution function is given by the product of the individual normal
distributions,
fX1···Xn(x1, . . . , xn) = ∏_{i=1}^{n} (2π)^{−1/2} σ^{−1} e^{−(xi − µ)²/(2σ²)}
                        = (2π)^{−n/2} σ^{−n} e^{−(1/(2σ²)) ∑_{i=1}^{n} (xi² − 2µxi + µ²)}

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 261 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

The Helmert Transformation


Note that the Helmert transformation is linear, so its derivative (Jacobian)
is simply A. Using (2.3.2), |det A−1 | = 1 and Theorem 1.4.19 on the
transformation of joint random variables, we obtain

fY1···Yn(y1, . . . , yn) = fX1···Xn(Aᵀy)
  = (2π)^{−n/2} σ^{−n} e^{−(1/(2σ²)) ( ∑_{i=1}^{n} yi² − 2µ√n y1 + nµ² )}
  = (2π)^{−n/2} σ^{−n} e^{−(1/(2σ²)) ( ∑_{i=2}^{n} yi² + (y1 − √n µ)² )}
  = (2π)^{−1/2} σ^{−1} e^{−(y1 − √n µ)²/(2σ²)} ∏_{i=2}^{n} (2π)^{−1/2} σ^{−1} e^{−yi²/(2σ²)}.

In particular, we see from this representation of the joint density function
that the random variables Y1, . . . , Yn are all independent. Note also that
Y1 is normally distributed with mean √n µ and variance σ².
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 262 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

The Helmert Transformation


Proof of Theorem 2.3.2.
Using the Helmert transformation, we can rewrite

X = (1/√n) Y1.

Furthermore,

(n − 1)S² = ∑_{i=1}^{n} (Xi − X)² = ∑_{i=1}^{n} Xi² − nX² = ∑_{i=1}^{n} Yi² − Y1² = ∑_{i=2}^{n} Yi².

Since the Yi are all independent, it follows that X is independent of S 2 , so


we have proven assertion 1. of the theorem.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 263 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

The Helmert Transformation


Proof of Theorem 2.3.2 (continued).
Since X = (1/√n) Y1 and

fY1(y1) = (2π)^{−1/2} σ^{−1} e^{−(y1 − √n µ)²/(2σ²)},

it follows that

fX(x) = (2π)^{−1/2} σ^{−1} e^{−(√n x − √n µ)²/(2σ²)} · √n,

so X is normally distributed with mean µ and variance σ²/n.
Now

(n − 1)S²/σ² = (1/σ²) ∑_{i=2}^{n} Yi²

follows a chi-squared distribution with n − 1 degrees of freedom (see
(2.3.1) and the following discussion). This completes the proof.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 264 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Independence of Sample Mean and Sample Variance

2.3.3. Remark. Theorem 2.3.2 essentially uses the fact that the i.i.d.
variables Xk, k = 1, . . . , n, are normally distributed. In fact, the converse
result is true also:
Let X1 , . . . , Xn , n ≥ 2, be i.i.d. random variables. Then if X and
S 2 are independent, the Xk , k = 1, . . . , n follow a normal
distribution.
This means that the independence of X and S 2 is a characteristic property
of the normal distribution. Furthermore, if in a given situation we assume
that X and S 2 are independently distributed we are essentially assuming
that the population is normally distributed.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 265 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Interval Estimation of Variability

We can now use Theorem 2.3.2 to find a confidence interval for the
variance based on the sample variance S². First, for 0 < α ≤ 1 we define
χ²_{1−α/2,n} ≤ χ²_{α/2,n} ∈ R by

∫_0^{χ²_{1−α/2,n}} fχn²(x) dx = α/2,        ∫_{χ²_{α/2,n}}^{∞} fχn²(x) dx = α/2,

where fχn² is the probability density of the chi-squared distribution
with n degrees of freedom.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 266 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Interval Estimation of Variability


From Theorem 2.3.2 we know that given a sample of size n from a normal
population, (n − 1)S²/σ² follows a chi-squared distribution with n − 1
degrees of freedom. Thus

1 − α = P[χ²_{1−α/2,n−1} ≤ (n − 1)S²/σ² ≤ χ²_{α/2,n−1}]
      = P[(n − 1)S²/χ²_{α/2,n−1} ≤ σ² ≤ (n − 1)S²/χ²_{1−α/2,n−1}].

This gives us the following result:

2.3.4. Theorem. Let X1, . . . , Xn, n ≥ 2, be a random sample of size n from
a normal distribution with mean µ and variance σ². A 100(1 − α)%
confidence interval on σ² is given by

[(n − 1)S²/χ²_{α/2,n−1}, (n − 1)S²/χ²_{1−α/2,n−1}].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 267 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Interval Estimation of Variability


Often, we are only interested in finding an upper or lower bound for the
variance.
2.3.5. Theorem. Let X1 , . . . , Xn , n ≥ 2, be a random sample of size n from
a normal distribution with mean µ and variance σ 2 . Then with
100(1 − α)% confidence
(n − 1)S²/χ²_{α,n−1} ≤ σ².

[(n − 1)S²/χ²_{α,n−1}, ∞) is known as a 100(1 − α)% lower confidence interval for σ².
Similarly, with 100(1 − α)% confidence,

σ² ≤ (n − 1)S²/χ²_{1−α,n−1}.

[0, (n − 1)S²/χ²_{1−α,n−1}] is known as a 100(1 − α)% upper confidence interval for σ².

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 268 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Interval Estimation of Variability

2.3.6. Example. A manufacturer of soft drink beverages is interested in the


uniformity of the machine used to fill cans. Specifically, it is desirable that
the standard deviation σ of the filling process be less than 0.2 fluid ounces;
otherwise there will be a higher than allowable percentage of cans that are
underfilled. We will assume that fill volume is approximately normally
distributed. A random sample of 20 cans results in a sample variance of
s² = 0.0225 (fluid ounces)². A 95% upper confidence interval is given by

σ² ≤ (n − 1)s²/χ²_{0.95,19} = 19 · 0.0225 (fluid ounces)² / 10.117 = 0.0423 (fluid ounces)².

This corresponds to σ ≤ 0.21 fluid ounces with 95% confidence. This is


not sufficient to support the hypothesis that σ ≤ 0.20 fluid ounces so
further investigation is necessary.
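
The chi-squared quantile and the bound can be reproduced numerically (an illustrative SciPy sketch, not from the original slides; note that χ²_{0.95,19} is the point with area 0.95 to its right, i.e. the 5% quantile):

    from scipy.stats import chi2

    n, s2, alpha = 20, 0.0225, 0.05
    chi2_lower = chi2.ppf(alpha, df=n - 1)     # about 10.117
    upper_bound = (n - 1) * s2 / chi2_lower
    print(upper_bound, upper_bound ** 0.5)     # about 0.0423 (fluid ounces)^2, i.e. sigma <= about 0.21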

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 269 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Interval Estimation for the Mean (Variance unknown)


Recall that we have derived a formula for the confidence interval of the
mean of a normal distribution using the random variable

Z = (X − µ)/(σ/√n)

which was found to be normally distributed. The Central Limit Theorem


allowed us to extend this result (approximately) even to non-normal
distributions, but one central difficulty remained: σ must be known!
Our main goal is to derive a general formula for a confidence interval on
the mean when the value of σ is not known and must be estimated.
The difficulty lies in the fact that the distribution of

(X − µ)/(S/√n)

is not known.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 270 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

The Student T -distribution

2.3.7. Definition. Let Z be a standard normal variable and let Xγ2 be an


independent chi-squared random variable with γ degrees of freedom. The
random variable
Tγ = Z / √(X²γ/γ)

is said to follow a T distribution with γ degrees of freedom.

2.3.8. Theorem. The density of a T distribution with γ degrees of freedom


is given by
f_{T_γ}(t) = [Γ((γ + 1)/2)/(Γ(γ/2)√(πγ))] · (1 + t²/γ)^{−(γ+1)/2}.

The proof is left to you (homework!).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 271 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

The Student T -distribution


2.3.9. Theorem. Let X1 , . . . , Xn be a random sample from a normal
distribution with mean µ and variance σ 2 . The random variable

T_{n−1} = (X̄ − µ)/(S/√n)

follows a T distribution with n − 1 degrees of freedom.

Proof.
We know that (X̄ − µ)/(σ/√n) is standard normal and (n − 1)S²/σ² is a
chi-squared random variable with n − 1 degrees of freedom; moreover, X̄ and S²
are independent for a sample from a normal distribution. Therefore,

Z/√(X²_γ/γ) = [(X̄ − µ)/(σ/√n)] / √(((n − 1)S²/σ²)/(n − 1)) = (X̄ − µ)/(S/√n)

follows a T distribution with n − 1 degrees of freedom.


Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 272 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Interval Estimation of Mean with Variance Unknown

Let 0 < α ≤ 1/2. We define t_{α,n} ≥ 0 by

∫_{−∞}^{−t_{α,n}} f_{T_n}(t) dt = α,

where f_{T_n} is the density of the T-distribution with n degrees of freedom.


2.3.10. Theorem. Let X1 , . . . , Xn be a random sample of size n from a
normal distribution with mean µ and variance σ 2 . Then a 100(1 − α)%
confidence interval on µ is given by

X̄ ± t_{α/2,n−1} S/√n

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 273 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Interval Estimation of Mean with Variance Unknown


2.3.11. Example. An article in the Journal of Testing and Evaluation
presents the following 20 measurements on residual flame time (in
seconds) of treated specimens of children’s nightwear:
9.85 9.93 9.75 9.77 9.67 9.87 9.67 9.94 9.85 9.75
9.83 9.92 9.74 9.99 9.88 9.95 9.95 9.93 9.92 9.89
We wish to find a 95% confidence interval on the mean residual flame
time. The sample mean and standard deviation are

x = 9.8475, s = 0.0954

We refer to the table for the T distribution with 20 − 1 = 19 degrees of
freedom and α/2 = 0.025 to obtain t_{0.025,19} = 2.093. Hence

µ = (9.8475 ± 0.0446) sec, i.e., 9.8029 ≤ µ ≤ 9.8921

with 95% confidence.
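A quick way to check this interval from the summary statistics is sketched below (a minimal sketch, assuming SciPy is available):

```python
# Confidence interval of Example 2.3.11 from the summary statistics (illustrative sketch).
from scipy.stats import t

n, xbar, s = 20, 9.8475, 0.0954
half_width = t.ppf(1 - 0.025, n - 1) * s / n ** 0.5   # t_{0.025,19} = 2.093
print(xbar - half_width, xbar + half_width)           # ~9.8029, ~9.8921
```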


Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 274 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypotheses and Testing

Often, the statistician will have some idea of the value of a population
parameter, or will try to verify or refute a statement on this parameter. As
an example, a new type of battery might be designed to have a longer life
span than a traditional model. Then a series of prototypes of the new type
(constituting a sample from the population of all not yet produced
batteries of the new type) might be tested for their life span. If the mean
life span of the traditional batteries was 160 days, a hypothesis might be
that “the new type of batteries has a mean life span of more than 170
days”.
The hypothesis that is to be tested is called the research hypothesis and
denoted by H1 (in our example, H1 would be “µ > 170 days”), while the
negation of H1 is called the null hypothesis and denoted H0 (here:
“µ ≤ 170 days”).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 275 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypotheses

We will use the following conventions regarding hypotheses:


◮ A hypothesis always involves the numerical value of a population
parameter θ.
◮ The hypothesis to be supported is denoted H1 , the negation is
denoted H0 . One hopes to accept H1 and reject H0 through
statistical evidence.
◮ The statement of equality for θ is always part of H0 . This value is
called the null value θ0 of θ.
(In our above example, θ0 = 170 days.)
A hypothesis of the form θ ≥ θ0 , θ ≤ θ0 , θ > θ0 or θ < θ0 is known as a
one-sided hypothesis, while a hypothesis of the form θ = θ0 or θ ̸= θ0 is
known as a two-sided hypothesis.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 276 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypothesis Testing
We always test a pair of research and null hypotheses. Let θ denote a
population parameter whose value is being compared to some θ0 ∈ R.
Then we have
One-sided tests.
H 0 : θ ≤ θ0 , H1 : θ > θ 0
H 0 : θ ≥ θ0 , H1 : θ < θ 0

Two-sided test.
H 0 : θ = θ0 , H1 : θ ̸= θ0

In order to test a hypothesis, we select a random sample and evaluate a


statistic whose distribution is known under the assumption that θ = θ0 .
This statistic is called a test statistic.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 277 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypothesis Testing
On the basis of the test statistic, we either
◮ reject H0 or
◮ fail to reject H0 .

2.3.12. Example. We are given the task of testing the following hypothesis:
More than half of all car headlights in Shanghai are incorrectly
adjusted.
We now establish the mathematical/statistical framework for tackling this
problem:
◮ The population is the set of all cars in Shanghai;
◮ The random variable X is discrete: either a car has a correctly
adjusted headlight (X = 0) or it does not (X = 1);
◮ X follows a binomial distribution with
N = “number of cars in Shanghai” and population parameter p.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 278 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypothesis Testing
Our hypotheses are
H0 : p ≤ 0.5 H1 : p > 0.5
with the null value p0 = 0.5.
We take a random sample of n = 20 automobiles and count those that
have maladjusted headlights. Effectively, we are conducting an experiment
with a binomial random variable X : S → Ω = {0, 1, . . . , 20}.
Our test statistic will be the number X of cars in the random sample with
incorrectly adjusted headlights.
If p = p0 , then X follows a binomial distribution with p = p0 = 0.5, and
we can expect p̂ = X /n to be close to p. In this case, E[X ] = np0 = 10.
We decide to reject H0 if at least 14 cars have incorrectly adjusted
headlights; the probability of this happening if p = p0 is
P[X ≥ 14 | p = 0.5, n = 20] = 1 − P[X ≤ 13 | p = 0.5, n = 20]
= 1 − 0.9423 = 0.0577.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 279 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypothesis Testing
Thus, if H0 is true (i.e., p = 0.5) then there is approximately a 6% chance
of observing 14 or more cars with incorrectly adjusted headlights.
What does this imply? Assume that we actually observe 14 or more cars
with incorrectly adjusted headlights and reject H0 . Then there are two
possibilities:
◮ We have correctly rejected H0 or
◮ We have falsely rejected H0 .
In the second case, H0 is in fact true. However, the probability of this
happening is at most 6%, since if p ≤ p0 , then

P[X ≥ 14 | p, n = 20] ≤ P[X ≥ 14 | p0 , n = 20] = 0.0577.

Falsely rejecting H0 is committing a so-called Type I error and the


probability of committing this error is denoted by α (in our example,
α = 5.77%).
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 280 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Type I Errors
We want to keep the probability α of committing a Type I error as small
as possible. We directly control α by arbitrarily defining the critical region
of the test.
The critical region is the subset of the range of the test statistic which will
lead us to reject H0 . In our example, the critical region is
C = {14, 15, 16, 17, 18, 19, 20} ⊂ Ω.
If we reduce the critical region, we decrease α. For example, if we set
C = {16, 17, 18, 19, 20} ⊂ Ω,
i.e., we only reject H0 if at least 16 out of 20 cars have maladjusted
headlights, then
α = P[X ≥ 16 | p ≤ 0.5, n = 20] ≤ P[X ≥ 16 | p = 0.5, n = 20]
= 1 − P[X ≤ 15 | p = 0.5, n = 20] = 1 − 0.9941 = 0.0059.
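The two tail probabilities can be computed directly; here is a minimal sketch assuming SciPy is available:

```python
# Type I error probabilities for the two critical regions (illustrative sketch).
from scipy.stats import binom

n, p0 = 20, 0.5
alpha_C14 = 1 - binom.cdf(13, n, p0)   # critical region {14, ..., 20}: ~0.0577
alpha_C16 = 1 - binom.cdf(15, n, p0)   # critical region {16, ..., 20}: ~0.0059
print(alpha_C14, alpha_C16)
```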

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 281 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Type I Errors

We say that rejecting H0 is a strong conclusion, because we can nearly


always set up our critical region in such a way that we control the
probability α of erring in coming to this conclusion. We call α the level of
significance of our test.
Note that for a two-sided test (H0 : θ = θ0 , H1 : θ ̸= θ0 ), α is a constant
determined by the critical region. However, in a one-sided test such as

H 0 : θ ≤ θ0 , H1 : θ > θ 0

α depends on the true value of the population parameter θ. In fact, the


smaller θ is, the smaller the probability of falsely rejecting H0 . Thus we
can define a function α = α(θ) for θ ≤ θ0 . However, α will be largest
when θ = θ0 , so we generally just quote this value of α for the level of
significance.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 282 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Type I Errors

In general, we will end up in one of the following situations:


1. We shall reject H0 even though H0 is in fact true - this is the Type I
error we have discussed.
2. We shall reject H0 when H0 is untrue.
3. We shall fail to reject H0 even though H0 is untrue - this is known as
a Type II error.
4. We shall fail to reject H0 when H0 is true.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 283 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Type II Errors
In testing either of
H 0 : θ ≤ θ0 , H1 : θ > θ 0
H 0 : θ ≥ θ0 , H1 : θ < θ 0
H 0 : θ = θ0 , H1 : θ ̸= θ0
Type II errors are more tricky than Type I errors:
◮ A Type II error occurs if H0 is untrue but we still fail to reject H0 .
◮ We do not control β, the probability of committing a Type II error.
◮ β depends on the true value of the population parameter θ.
We also introduce the power of a test, which is the probability of rejecting
H0 when H1 is true. The power is given by 1 − β. Summarizing, we have
α = P[Type I error] = P[reject H0 | H0 true],
β = P[Type II error] = P[fail to reject H0 | H0 false],
Power = 1 − β = P[reject H0 | H0 false].

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 284 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Example - A binomial test


2.3.13. Example. In the context of our previous example, we wanted to test
the statement that more than half of all car headlights in Shanghai are
maladjusted and selected a random sample of 20 cars, deciding to accept
the statement (reject H0 ) if 14 or more have wrongly adjusted headlights.
Suppose that p = 0.7. Then
β = P[fail to reject H0 | p = 0.7] = P[X ≤ 13 | p = 0.7, n = 20] = 0.3920
However, if p = 0.8,
β = P[X ≤ 13 | p = 0.8, n = 20] = 0.0867

Thus, (unless β is known) failing to reject H0 is thought of as a weak


conclusion.
We prefer to say that we “fail to reject H0 ” rather than “accept H0 ”,
indicating that we simply do not have enough evidence to reject H0 (as
opposed to being sure that H0 is true).
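These values of β can be reproduced as follows (a short sketch, assuming SciPy is available):

```python
# Type II error probabilities of Example 2.3.13 (illustrative sketch).
from scipy.stats import binom

for p in (0.7, 0.8):
    beta = binom.cdf(13, 20, p)    # P[fail to reject H0 | p], critical region {14, ..., 20}
    print(p, beta)                 # ~0.3920 for p = 0.7, ~0.0867 for p = 0.8
```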
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 285 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Operating Characteristic (OC) Curves


For many types of tests the dependence of β on the true value of the
population parameter is known and can be looked up in the form of a
so-called operating characteristic (OC) curve. While the textbook
discusses OC curves only in the context of acceptance sampling (see pages
672-674), OC curves are actually fundamental to hypothesis testing.
2.3.14. Example. We are interested in the mean compressive strength of a
particular type of concrete. Specifically, we want to decide whether or not
the mean compressive strength (say µ) is 2500 psi. We set up the
hypotheses

H0 : µ = 2500 psi, H1 : µ ̸= 2500 psi .

We take a random sample of size n and obtain the sample mean X as a


test statistic for µ. We decide to reject H0 if |X − µ0 | > 50 psi, where
µ0 = 2500 psi is the null value of µ.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 286 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Operating Characteristic (OC) Curves


From the central limit theorem, we know that X follows a normal
distribution with mean µ. Through our choice of critical region (and
sample size), we have effectively fixed α. Depending on the true value of
µ, β can be large or small. If µ = µ0 = 2500 psi, then β = 1 − α. We can
represent β as a function of µ through a curve of the following shape:

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 287 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Operating Characteristic (OC) Curves


Of course, β also depends on the sample size, since a large sample size will
reduce the variance of X and therefore make it less likely to falsely fail to
reject H0 . When increasing the sample size from n1 to n2 , the OC curve
narrows:

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 288 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Operating Characteristic (OC) Curves


By increasing α, we make it easier to reject H0 . This also decreases β, the
probability of failing to reject H0 even though H0 is false:

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 289 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Operating Characteristic (OC) Curves


The previous curves were those for the two-sided test H1 : µ ̸= µ0 . If we
have a one-sided test
H0 : µ ≤ µ0 , H1 : µ > µ0
then we can a priori only define β(µ) for µ > µ0 . However, since
β(µ0 ) = 1 − α and we in fact have α = α(µ) for µ ≤ µ0 , we simply set
β(µ) = 1 − α(µ) for µ ≤ µ0 .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 290 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypothesis Testing
2.3.15. Example. The burning rate of a rocket propellant is being studied.
Specifications require that the mean burning rate must be 40 cm/s.
Furthermore, suppose that we know that the standard deviation of the
burning rate is approximately σ = 2 cm/s. The experimenter decides to
specify a Type I error probability of α = 0.05 and he will base the test on
a random sample of size n = 25. The hypotheses we wish to test are

H0 : µ = 40 cm/s, H1 : µ ̸= 40 cm/s.

If H0 is true, the sample mean is normally distributed with mean


µ0 = 40 cm/s and variance σ 2 /n; thus we will use the standard normal
statistic
Z = (X̄ − µ0)/(σ/√n)
to test the hypotheses.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 291 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypothesis Testing

Since the level of significance is to be α = 0.05, we set our acceptance


region (the complement of the critical region) to

−z_{α/2} ≤ Z ≤ z_{α/2} ⇔ −1.96 ≤ (X̄ − µ0)/(σ/√n) ≤ 1.96.

Twenty-five specimens are tested, and the sample mean burning rate
obtained is x̄ = 41.25 cm/s. The value of the test statistic is

Z0 = (x̄ − µ0)/(σ/√n) = (41.25 − 40)/(2/√25) = 3.125.

We note that |Z0 | > 1.96, so it falls in the critical region and we can reject
H0 at the 5% level of significance.
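A short sketch of this calculation (assuming SciPy is available; the variable names are illustrative only):

```python
# Test of Example 2.3.15 (illustrative sketch).
from scipy.stats import norm

xbar, mu0, sigma, n, alpha = 41.25, 40.0, 2.0, 25, 0.05
z0 = (xbar - mu0) / (sigma / n ** 0.5)   # 3.125
z_crit = norm.ppf(1 - alpha / 2)         # 1.96
print(z0, abs(z0) > z_crit)              # True: reject H0
```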

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 292 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypothesis Testing
2.3.16. Example. Continuing from the previous example, suppose that the
analyst is concerned about the probability of a Type II error if the true
mean burning rate is µ = 41 cm/s. We may use the following operating
characteristic curve (specific to α = 0.05) to find β:

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 293 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypothesis Testing
In this graph,

d := |µ − µ0 |/σ = (41 − 40)/2 = 1/2.
Since in our example n = 25 we can read off β ≈ 0.30.
2.3.17. Example. Continuing from the previous example, suppose that the
analyst would like to design the test so that if the true mean burning rate
differs from 40 cm/s by more than 1 cm/s the test will detect this (i.e.,
reject H0 : µ = 40) with a high probability, say 0.90. This corresponds to
setting the power 1 − β of the test to 0.90, i.e., we want to achieve
β = 0.10.
We want to have β ≤ 0.1 if

d = |µ − µ0 |/σ = |µ − 40|/2 ≥ 1/2.
We see that the point (d, β) = (0.5, 0.1) is intersected by the OC curve
for n = 40 and that the curve remains below 0.1 for d > 1/2. Thus the
test should involve a sample size of n = 40 or more.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 294 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Significance Testing

As previously mentioned, hypothesis testing as just described can be a


little too rigid.
An alternative approach involves setting up H0 and H1 as before, but not
specifying α and the critical region before the test. Rather, a value of the
test statistic is observed, and then the probability of observing this value
given that θ = θ0 is calculated. This probability is variously called
◮ critical level,
◮ descriptive level of significance, or
◮ probability or P value
of the test.
The P-value is also the smallest value for α at which we would have been
able to reject H0 . We reject H0 if we consider this P value to be small.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 295 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Significance Testing
For a one-tailed test, the P-value is the area under the density curve to the
right or left of the observed statistic:

How does one define the P-value for two-tailed tests?


If the distribution is symmetric about µ0 we simply take twice the P-value
of the one-tailed test. This will not be exact if the distribution is
asymmetric (like the chi-squared distribution for tests of variance), but we
will use it as an approximation anyway.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 296 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Significance Testing

2.3.18. Example. We test the hypothesis that a new car design increases
mileage.
◮ The population is the set of all cars with the new design;
◮ The random variable X is the mileage of the newly designed cars;
◮ The distribution of X is unknown;
◮ The population parameter µ is the mean of X .
We take a random sample of n = 36 automobiles. Our hypotheses are

H0 : µ ≤ 26 H1 : µ > 26

with the null value µ0 = 26. Currently, the mileage of cars has a standard
deviation of 5 miles and we assume this will also be true for the new
design if µ = µ0 .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 297 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Significance Testing

We obtain a data set of 36 mileages with a sample mean X = 28.04 mpg.


To see whether there is enough evidence to reject H0 we find the P-value
of the test. By the Central Limit Theorem, if µ = 26 and σ = 5, the sample
mean X̄ is at least approximately normally distributed with mean 26 and
standard deviation σ/√n = 5/6. Hence

P[X̄ ≥ 28.04 | µ = 26, σ = 5] = P[(X̄ − 26)/(5/6) ≥ (28.04 − 26)/(5/6)]
= P[Z ≥ 2.45] = 1 − P[Z ≤ 2.45]
= 1 − 0.9929 = 0.0071.

This is the P-value of the test. We may decide that it is sufficiently small
to reject H0 .
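The same P-value can be obtained numerically; a minimal sketch assuming SciPy is available:

```python
# P-value of Example 2.3.18 (illustrative sketch).
from scipy.stats import norm

xbar, mu0, sigma, n = 28.04, 26.0, 5.0, 36
z = (xbar - mu0) / (sigma / n ** 0.5)   # ~2.45
print(z, norm.sf(z))                    # upper-tail P-value, ~0.007
```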

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 298 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Selecting Appropriate Hypotheses


For one-sided tests, it is sometimes not clear how to select the null and
research hypotheses. For example, suppose that a soft drink beverage
bottler purchases 10-ounce nonreturnable bottles from a glass company.
The bottler wants to be sure that the bottles exceed the specification on
mean internal pressure or bursting strength, which for 10-ounce bottles is
200 psi.
The bottler has decided to formulate the decision procedure for a specific
lot of bottles as a hypothesis problem. There are two possible formulations
for this problem, either

H0 : µ ≤ 200 psi H1 : µ > 200 psi (2.3.3)

or

H0 : µ ≥ 200 psi H1 : µ < 200 psi (2.3.4)

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 299 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Selecting Appropriate Hypotheses


The formulation (2.3.3) forces the manufacturer to reject H0 , i.e., to
demonstrate that the bottles conform to specification.
In the formulation (2.3.4) the bottles will be judged satisfactory unless H0
is rejected.
Choosing the “correct” formulation depends on the precise circumstances:
the first formulation means that the bottler needs proof that the bottles
conform to specification (perhaps because there have been problems in the
past), while in the second formulation the bottles will be accepted unless
there is strong evidence to the contrary (the bottler might have been
consistently satisfied with the bottles in the past, and small deviations
from µ ≥ 200 psi might not be harmful).
In formulating one-sided research hypotheses, we should remember that
rejecting H0 is always a strong conclusion, and consequently, we should
put the statement about which it is important to make a strong conclusion
in the research hypothesis. Often this will depend on our point of view and
experience with the situation.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 300 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypotheses and Significance Tests on the Mean


There are three forms for tests of hypotheses on the mean of a
distribution:
1. Right-tailed test: H0 : µ ≤ µ0 , H1 : µ > µ0
2. Left-tailed test: H0 : µ ≥ µ0 , H1 : µ < µ0
3. Two-tailed test: H0 : µ = µ0 , H1 : µ ̸= µ0
Remember: To test a hypothesis on a parameter θ, we must find a
statistic whose probability distribution is known at least approximately
when θ = θ0 (the null value).
We know that if X is normal, the statistic

T = (X̄ − µ0)/(S/√n)

follows a Tn−1 -distribution. Tests based on this distribution are called T


tests.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 301 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypotheses and Significance Tests on the Mean


2.3.19. Example. The breaking strength of a textile fiber is a normally
distributed random variable. Specifications require that the mean breaking
strength should equal 150 psi. The manufacturer would like to detect any
significant departure from this value. Thus, he wishes to test

H0 : µ = 150 psi H1 : µ ̸= 150 psi

A random sample of 15 fiber specimens is selected and their breaking


strengths determined. The statistic

T = (X̄ − µ0)/(S/√n)

will follow a T14 -distribution. We specify α = 0.05, and find


t_{0.025,14} = 2.145 and −t_{0.025,14} = −2.145 from Table VI of the textbook.
Thus, the critical region is given by |t| > 2.145.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 302 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypotheses and Significance Tests on the Mean


The sample mean and variance are computed from the sample data as
x = 152.18 and s 2 = 16.63. Therefore, the test statistic is
t = (x̄ − µ0)/(s/√n) = (152.18 − 150)/√(16.63/15) = 2.07,

which does not fall into the critical region, so there is insufficient evidence
to reject H0 at the 5% level of significance.
Note that the T -distribution may be used for (X̄ − µ0)/(S/√n) when a sample is
obtained from a normal population. If a sample is obtained from a
non-normal population, care must be taken; for large to medium sample
sizes (n ≥ 25) it can be shown that violating the normality assumption
does not significantly change α and β. For small sample sizes, a T -test
cannot be used and an alternative (non-parametric) test must be
employed; such tests will be discussed later.
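For reference, the test statistic, critical value, and an approximate P-value for Example 2.3.19 can be computed as follows (a sketch assuming SciPy is available):

```python
# T test of Example 2.3.19 from the summary statistics (illustrative sketch).
from scipy.stats import t

n, xbar, s2, mu0 = 15, 152.18, 16.63, 150.0
t0 = (xbar - mu0) / (s2 / n) ** 0.5       # ~2.07
t_crit = t.ppf(1 - 0.025, n - 1)          # t_{0.025,14} = 2.145
p_value = 2 * t.sf(abs(t0), n - 1)        # two-sided P-value, slightly above 0.05
print(t0, t_crit, p_value)
```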

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 303 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypotheses and Significance Tests on the Mean

In order to accept H1 : µ > µ0 we must reject H0 : µ ≤ µ0 . However, it is


clearly sufficient to reject H0 : µ = µ0 by showing evidence that the mean
is greater than µ0 . For this reason, one often prefers the following
conventions,

1. Right-tailed test: H0 : µ = µ0 , H1 : µ > µ0


2. Left-tailed test: H0 : µ = µ0 , H1 : µ < µ0
3. Two-tailed test: H0 : µ = µ0 , H1 : µ ̸= µ0

These conventions emphasize the null value µ0 , for which a test statistic
with known distribution must be found.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 304 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypotheses and Significance Tests on the Variance


Hypotheses on the variance have the same general form as for the mean:

1. Right-tailed test: H0 : σ = σ0 , H1 : σ > σ0


2. Left-tailed test: H0 : σ = σ0 , H1 : σ < σ0
3. Two-tailed test: H0 : σ = σ0 , H1 : σ ̸= σ0

It is important to be aware of the following difficulty:


◮ The T -distribution can be used in the presence of large sample sizes
for the distribution of the sample mean even if the underlying
distribution is non-normal.
◮ It is, however, not possible to approximate the χ²_{n−1} statistic in this
way if the distribution is non-normal, regardless of sample size!
Therefore, normality of the data must first be tested, and if the data
is non-normal, other methods must be used.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 305 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypotheses and Significance Tests on the Variance

2.3.20. Example. One random variable studied while designing the


front-wheel-drive half-shaft of a new model automobile is the displacement
(in millimeters) of the constant velocity (CV) joints. With the joint angle
fixed at 12◦ , 20 simulations were conducted, resulting in the following
data:
6.2 1.9 4.4 4.9 3.5
4.6 4.2 1.1 1.3 4.8
4.1 3.7 2.5 3.7 4.2
1.4 2.6 1.5 3.9 3.2
For these data, x̄ = 3.39 and s = 1.41. Engineers designing the
front-wheel-drive half-shaft claim that the standard deviation in the
displacement of the CV shaft is less than 1.5 mm. Do these data support
this contention?

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 306 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Hypotheses and Significance Tests on the Variance


We test H0 : σ = 1.5, H1 : σ < 1.5, which is equivalent to testing

H0 : σ 2 = 2.25, H1 : σ 2 < 2.25.

The observed value of the test statistic is


(n − 1)s²/σ0² = 19 · 1.41²/2.25 = 16.79.

Since the test is left-tailed, we reject H0 if this value is too small to have
occurred by chance. From the χ2 table we see that

P[χ²_{19} ≤ 14.6] = 0.25 and P[χ²_{19} ≤ 18.3] = 0.50.

Since the observed value lies between 14.6 and 18.3, the probability
P[χ²_{19} ≤ 16.79] lies between 0.25 and 0.50. This probability is too large to
be able to reject H0 , so we cannot claim that σ < 1.5 mm.
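The exact tail probability can be obtained numerically; a minimal sketch assuming SciPy is available:

```python
# Left-tail probability of Example 2.3.20 (illustrative sketch).
from scipy.stats import chi2

n, s, sigma0 = 20, 1.41, 1.5
stat = (n - 1) * s ** 2 / sigma0 ** 2   # ~16.79
print(stat, chi2.cdf(stat, n - 1))      # P[chi^2_19 <= 16.79], between 0.25 and 0.50
```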
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 307 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Alternative Nonparametric Methods

◮ “Non-parametric” = “distribution free”


◮ Less powerful than “normal theory procedures” if the underlying
assumptions are met
◮ Much more powerful when normality assumptions are not met; useful
even when they are met.
◮ Two examples: Sign Test for the Median & Wilcoxon Signed Rank
Test

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 308 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Sign Test for the Median

Recall that the median of a random variable X is defined as the value M


such that

P(X < M) = P(X > M) = 1/2.

For a symmetric distribution (such as the normal distribution), mean and


median are identical. We will see that the sign test is a form of binomial
test. We again have the traditional forms of tests for the median,
1. Right-tailed test: H0 : M = M0 , H1 : M > M0
2. Left-tailed test: H0 : M = M0 , H1 : M < M0
3. Two-tailed test: H0 : M = M0 , H1 : M ̸= M0

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 309 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Sign Test for the Median


Let X1 , . . . , Xn denote a random sample from a symmetric random variable
X with median M. Assuming a continuous distribution of X , each of the
differences Xi − M has probability 1/2 of being positive, 1/2 of being
negative and probability 0 of being 0.
Let Q± = #{Xi : Xi − M0 ≷ 0}. If H0 : M = M0 is true, then Q+ is
binomially distributed, with parameters n and p = 1/2.
In a right-tailed test, we perform a significance test on the value of Q−
and reject H0 if the number of negative results is too small to have
occurred by chance.
Similarly, in a left-tailed test we reject H0 if Q+ is too small to have
occurred by chance.
In a two-tailed test, we form Q = min(Q− , Q+ ) and reject H0 if Q is too
small to have occurred by chance.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 310 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Sign Test for the Median


2.3.21. Example. Montgomery, Peck and Vining (2001) report on a study
in which a rocket motor is formed by binding an ignitor propellant and a
sustainer propellant together inside a metal housing. The shear strength of
the bond between the two propellant types is an important characteristic.
The results of testing 20 random samples are shown below. We would like
to test the hypothesis that the median shear strength is 2000 psi.
Observation (i) Shear Strength (Xi ) Observation (i) Shear Strength (Xi )
1 2158.70 11 2165.20
2 1678.15 12 2399.55
3 2316.00 13 1779.80
4 2061.30 14 2336.75
5 2207.50 15 1765.30
6 1708.30 16 2053.50
7 1784.70 17 2414.40
8 2575.10 18 2200.50
9 2357.90 19 2654.20
10 2256.70 20 1753.70

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 311 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Sign Test for the Median

Formally, we set

H0 : M = 2000,
H1 : M ̸= 2000,

i.e., we perform a two-tailed test. We calculate Xi − M0 , where


M0 = 2000 is the null value of the median, for every i and note whether
the difference is positive or negative.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 312 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Sign Test for the Median


Observation (i) Shear Strength (Xi ) Xi − 2000 Sign
1 2158.70 158.70 +
2 1678.15 -321.85 −
3 2316.00 316.00 +
4 2061.30 61.30 +
5 2207.50 207.50 +
6 1708.30 -291.70 −
7 1784.70 -215.30 −
8 2575.10 575.10 +
9 2357.90 357.90 +
10 2256.70 256.70 +
11 2165.20 165.20 +
12 2399.55 399.55 +
13 1779.80 -220.20 −
14 2336.75 336.75 +
15 1765.30 -234.70 −
16 2053.50 53.50 +
17 2414.40 414.40 +
18 2200.50 200.50 +
19 2654.20 654.20 +
20 1753.70 -246.30 −

Note that Q+ = 14 and Q− = 6, so Q = min(Q− , Q+ ) = 6.


Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 313 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Sign Test for the Median

While there are tables for checking the significance of such a result, we
can calculate it directly, because the signs of the differences are binomially
distributed. The probability of observing 6 or fewer negative signs in a
sample of 20 observations is
P[Q ≤ 6] = ∑_{r=0}^{6} C(20, r) (1/2)^r (1/2)^{20−r} = 0.058.

The P-value of this test is hence 5.8%, so at the 5% level of significance we
are unable to reject H0 .
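Since the signs are binomially distributed, this probability is a one-line computation (sketch, assuming SciPy is available):

```python
# Sign test tail probability for Q = 6 out of n = 20 (illustrative sketch).
from scipy.stats import binom

print(binom.cdf(6, 20, 0.5))   # ~0.058
```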

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 314 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Sign Test for the Median

In practice, it may happen that Xi − M0 = 0. There has been extensive


research on treating these zeroes. It is recommended that you either
1. count the zeroes in the way least likely to result in the rejection of
H0 , or
2. discard all zeroes if their number is small compared to the sample
size, and reduce the sample size accordingly.
The sign test works even if the magnitudes of the differences Xi − M0 are
unknown; if they are known, we can apply a second test that uses this
information, called the Wilcoxon signed rank test.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 315 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Wilcoxon Signed Rank Test

◮ X1 , . . . , Xn random sample from continuous distribution with median


M.
◮ H0 : M = M0 , H1 : M > M0 , H1 : M < M0 , H1 : M ̸= M0 .
◮ Order the n absolute differences |Xi − M0 | according to magnitude and
assign ranks 1, . . . , n; give each rank the sign of the corresponding
difference Xi − M0 , yielding signed ranks Ri .
◮ If ties in the rank occur, the mean of the ranks is assigned to both
values.
◮ Define

W+ = ∑_{Ri >0} Ri ,   |W− | = ∑_{Ri <0} |Ri |.

◮ If H0 is true, W+ ≈ |W− | - consider distribution of W .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 316 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Wilcoxon Signed Rank Test


◮ For a left-tailed test (H1 : M < M0 ) we use W+ as a statistic - we
reject H0 if the observed value of W+ is small.
◮ For a right-tailed test (H1 : M > M0 ) we use |W− | as a statistic - we
reject H0 if the observed value of |W− | is small.
◮ For a two-tailed test (H1 : M ̸= M0 ) we use W = min(W+ , |W− |) as
a statistic - we reject H0 if the observed value of W is small.
◮ The level of significance of such a rejection is tabulated.
◮ The distribution of W is approximately normal with mean

E[W ] = n(n + 1)/4

and variance

Var W = n(n + 1)(2n + 1)/24.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 317 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Wilcoxon Signed Rank Test


2.3.22. Example. We return to the measurements of the previous example.
Observation (i) Shear Strength (Xi ) Xi − 2000 Signed Rank
16 2053.50 53.50 +1
4 2061.30 61.30 +2
1 2158.70 158.70 +3
11 2165.20 165.20 +4
18 2200.50 200.50 +5
5 2207.50 207.50 +6
7 1784.70 -215.30 −7
13 1779.80 -220.20 −8
15 1765.30 -234.70 −9
20 1753.70 -246.30 −10
10 2256.70 256.70 +11
6 1708.30 -291.70 −12
3 2316.00 316.00 +13
2 1678.15 -321.85 −14
14 2336.75 336.75 +15
9 2357.90 357.90 +16
12 2399.55 399.55 +17
17 2414.40 414.40 +18
8 2575.10 575.10 +19
19 2654.20 654.20 +20

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 318 / 355
An Introduction to Statistical Methods Inferences on the Mean and Variance of a Distribution

Wilcoxon Signed Rank Test

Here

W+ = 1 + 2 + 3 + 4 + 5 + 6 + 11 + · · · + 19 + 20 = 150

and
|W− | = 7 + 8 + 9 + 10 + 12 + 14 = 60.
For our two-tailed test, we take W = min(60, 150) = 60. From the Table
VIII in Appendix A, with n = 20 observations we have the critical value of
52 for a two-tailed test with P = 0.05. Since W = 60 ̸< 52, we cannot
reject H0 at the 5% level of significance.
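The signed ranks and the values of W+ and |W−| can be reproduced as follows (a sketch assuming NumPy and SciPy are available, and that there are no zero differences):

```python
# Signed ranks for the shear-strength data of Example 2.3.22 (illustrative sketch).
import numpy as np
from scipy.stats import rankdata

x = np.array([2158.70, 1678.15, 2316.00, 2061.30, 2207.50, 1708.30, 1784.70,
              2575.10, 2357.90, 2256.70, 2165.20, 2399.55, 1779.80, 2336.75,
              1765.30, 2053.50, 2414.40, 2200.50, 2654.20, 1753.70])
d = x - 2000.0
ranks = rankdata(np.abs(d))          # ranks of |X_i - M_0|; ties would receive mean ranks
w_plus = ranks[d > 0].sum()          # 150
w_minus = ranks[d < 0].sum()         # 60
print(w_plus, w_minus, min(w_plus, w_minus))
```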

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 319 / 355
An Introduction to Statistical Methods Inferences on Proportions

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 320 / 355
An Introduction to Statistical Methods Inferences on Proportions

Estimating Proportions
One of the (mathematically) simplest population parameters of general
interest is the proportion of members of a population with some trait.
Every member of the population is characterized as either having or not
having this trait. We describe this mathematically by defining the random
variable

X = 1 if the member has the trait, X = 0 if it does not.

The proportion of the members of the population having the trait is

p = (number of members with the trait)/(population size) = (1/N) ∑_{i=1}^{N} x_i ,

where N is the population size and xi is the value of the variable X for the
ith member of the population. Hence the proportion is equal to the mean
of X .
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 321 / 355
An Introduction to Statistical Methods Inferences on Proportions

Estimating Proportions
It follows that if we take a random sample X1 , . . . , Xn of X , the sample
mean

p̂ = X̄ = (1/n) ∑_{i=1}^{n} X_i

is an (unbiased) estimator for p.


The random variable X follows a point binomial distribution with
expectation E[X ] = p and variance p(1 − p).
By the central limit theorem, p̂ is approximately normally distributed with
mean p and variance p(1 − p)/n.
Hence

(p̂ − p)/√(p(1 − p)/n)

is approximately standard-normally distributed.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 322 / 355
An Introduction to Statistical Methods Inferences on Proportions

Estimating Proportions
We therefore obtain the following 100(1 − α)% confidence interval for p:
p̂ ± z_{α/2} √(p(1 − p)/n)

But the interval depends on the unknown parameter p, which we are


actually trying to estimate! The solution is to replace p by p̂, i.e., to write
p̂ ± z_{α/2} √(p̂(1 − p̂)/n).

But then the number zα/2 is no longer accurate (when we replaced σ by S


to obtain a confidence interval for the mean, we had to switch from zα/2
to tα/2 ). However, we are approximating the binomial distribution in any
case - we argue that if the sample size n is large enough to allow the
central limit theorem to hold, then the difference between zα/2 and a
corrected value will be negligible.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 323 / 355
An Introduction to Statistical Methods Inferences on Proportions

Estimating Proportions

2.4.1. Example. In a random sample of 75 axle shafts, 12 have a surface


finish that is rougher than the specifications will allow. Therefore, a point
estimate of the proportion p of shafts in the population that exceed the
roughness specifications is p̂ = x = 12/75 = 0.16. A 95% two-sided
confidence interval for p is then
p̂ − z_{α/2} √(p̂(1 − p̂)/n) ≤ p ≤ p̂ + z_{α/2} √(p̂(1 − p̂)/n)

or

0.16 − 1.96 √(0.16 · 0.84/75) ≤ p ≤ 0.16 + 1.96 √(0.16 · 0.84/75),
which simplifies to 0.08 ≤ p ≤ 0.24.
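A minimal sketch of this computation (assuming SciPy is available):

```python
# Confidence interval of Example 2.4.1 (illustrative sketch).
from scipy.stats import norm

n, defective = 75, 12
p_hat = defective / n                                            # 0.16
half_width = norm.ppf(0.975) * (p_hat * (1 - p_hat) / n) ** 0.5  # z_{0.025} = 1.96
print(p_hat - half_width, p_hat + half_width)                    # ~0.08, ~0.24
```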

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 324 / 355
An Introduction to Statistical Methods Inferences on Proportions

Choosing the Sample Size


As a practical matter, we are often able to choose (perhaps within
constraints) the sample size. We may want to be able to claim that “with
xx% probability, p̂ differs from p by at most d.” Given the 100(1 − α)%
confidence interval p̂ ± z_{α/2} √(p̂(1 − p̂)/n), we know with 100(1 − α)%
confidence that

d = z_{α/2} √(p̂(1 − p̂)/n).

Given d, this means that we should choose

n = z²_{α/2} p̂(1 − p̂)/d²

to ensure that |p − p̂| < d with 100(1 − α)% confidence. However, this
formula requires us to have an idea (estimate) p̂ of p beforehand. If this is
not the case, we can at least use that x(1 − x) ≤ 1/4 for all x ∈ R to
deduce that

n = z²_{α/2}/(4d²)

will ensure |p − p̂| < d with 100(1 − α)% confidence.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 325 / 355
An Introduction to Statistical Methods Inferences on Proportions

Choosing the Sample Size

2.4.2. Example. A new method of precoating fittings used in oil, brake and
other fluid systems in heavy-duty trucks is being studied. How large a
sample is needed to estimate the proportion of fittings that leak to within
0.02 with 90% confidence?
Since no prior estimate is available, we take

n = z²_{0.05}/(4d²) = 1.645²/(4 · 0.02²) ≈ 1691.3,

so a sample of size n = 1692 suffices.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 326 / 355
An Introduction to Statistical Methods Inferences on Proportions

Hypothesis Testing
There are three types of hypotheses and tests for proportions. Let p0 denote
the null value of a proportion p. Then we have
1. H0 : p = p0 , H1 : p > p0 (Right-tailed)
2. H0 : p = p0 , H1 : p < p0 (Left-tailed)
3. H0 : p = p0 , H1 : p ̸= p0 (Two-tailed)
For large sample sizes we use the following test statistic to test H0 : p = p0 :

Z = (p̂ − p0 )/√(p0 (1 − p0 )/n).

If H0 is true, then this statistic follows a standard normal distribution.


1. For a right-tailed test, we reject H0 if Z is a large positive number.
2. For a left-tailed test, we reject H0 if Z is a large negative number.
3. For a two-tailed test, we reject H0 if |Z | is large.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 327 / 355
An Introduction to Statistical Methods Inferences on Proportions

Comparing Two Proportions


We are often interested in estimating the difference between two
proportions p1 and p2 , where in general p1 is the proportion of objects
with some trait in a population P1 and p2 is the proportion of objects with
some (other) trait in a (different) population P2 .
If we let

X^(1) = 1 if the object has the trait, X^(1) = 0 if it does not,

be a random variable defined for population P1 and X^(2) be a similarly
defined random variable, then we are interested in the difference of the
means µ1 = p1 and µ2 = p2 of these random variables. An unbiased
estimator for p1 − p2 is given by

p̂1 − p̂2 = X̄^(1) − X̄^(2),

where X̄^(1) and X̄^(2) are the means of random samples from the random
variables X^(1) and X^(2), respectively.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 328 / 355
An Introduction to Statistical Methods Inferences on Proportions

Comparing Two Proportions

Assume that we have random samples of sizes n1 and n2 of X^(1) and X^(2),
respectively. Since X̄^(1) and X̄^(2) are both approximately normally
distributed, with means p1 , p2 and variances p1 (1 − p1 )/n1 and
p2 (1 − p2 )/n2 , respectively, we obtain the following result:
2.4.3. Theorem. For large samples, the estimator p̂1 − p̂2 is approximately
normal with mean p1 − p2 and variance p1 (1 − p1 )/n1 + p2 (1 − p2 )/n2 .
This allows us to deduce the following 100(1 − α)% confidence interval for
p1 − p2 :

p̂1 − p̂2 ± z_{α/2} √(p̂1 (1 − p̂1 )/n1 + p̂2 (1 − p̂2 )/n2 ),

which is valid for large sample sizes.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 329 / 355
An Introduction to Statistical Methods Inferences on Proportions

Comparing Two Proportions


We can now also test hypotheses on the difference of two proportions. Let
(p1 − p2 )0 denote the null value of the difference. Then we have the
hypotheses
1. H0 : p1 − p2 = (p1 − p2 )0 , H1 : p1 − p2 > (p1 − p2 )0
(Right-tailed)
2. H0 : p1 − p2 = (p1 − p2 )0 , H1 : p1 − p2 < (p1 − p2 )0
(Left-tailed)
3. H0 : p1 − p2 = (p1 − p2 )0 , H1 : p1 − p2 ̸= (p1 − p2 )0
(Two-tailed)
For large sample sizes we use the approximate test statistic

Z = (p̂1 − p̂2 − (p1 − p2 )0 )/√(p̂1 (1 − p̂1 )/n1 + p̂2 (1 − p̂2 )/n2 ).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 330 / 355
An Introduction to Statistical Methods Inferences on Proportions

Pooled Proportions
Most commonly we test against H0 : (p1 − p2 )0 = 0. Then we have the
hypotheses
1. H0 : p1 − p2 = 0, H1 : p1 − p2 > 0 (Right-tailed)
2. H0 : p1 − p2 = 0, H1 : p1 − p2 < 0 (Left-tailed)
3. H0 : p1 − p2 = 0, H1 : p1 − p2 ̸= 0 (Two-tailed)
Now if H0 is true, then p̂1 and p̂2 are both estimators for the same
proportion p. Then the variance becomes

p(1 − p)/n1 + p(1 − p)/n2 = p(1 − p)(1/n1 + 1/n2 )

and

Z = (p̂1 − p̂2 )/√(p(1 − p)(1/n1 + 1/n2 ))

has an approximate standard normal distribution.


Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 331 / 355
An Introduction to Statistical Methods Inferences on Proportions

Pooled Proportions

In estimating p, we now have a choice of p̂1 and p̂2 . It turns out that it is
best to take the weighted average

p̂ = (n1 p̂1 + n2 p̂2 )/(n1 + n2 )

and to use the statistic

Z = (p̂1 − p̂2 )/√(p̂(1 − p̂)(1/n1 + 1/n2 )).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 332 / 355
An Introduction to Statistical Methods Inferences on Proportions

Pooled Proportions
2.4.4. Example. Many consumers think that automobiles built on Mondays
are more likely to have serious defects than those built on any other day of
the week. To support this theory, a random sample of 100 cars built on
Monday is selected and inspected. Of these, eight are found to have
serious defects. A random sample of 200 cars produced on other days
reveals 12 with serious defects. Do these data support the stated
contention?

We test
H 0 : p1 = p2 , H1 : p1 > p2
where p1 denotes the proportion of cars with serious defects produced on
Mondays.
Estimates for p1 and p2 are
p̂1 = 8/100 = 0.08, p̂2 = 12/200 = 0.06.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 333 / 355
An Introduction to Statistical Methods Inferences on Proportions

Pooled Proportions

The pooled estimate for the common population proportion is

p̂ = (100 · 0.08 + 200 · 0.06)/(100 + 200) = 20/300 = 0.066.

The observed value of the test statistic is

(p̂1 − p̂2 )/√(p̂(1 − p̂)(1/n1 + 1/n2 )) = (0.08 − 0.06)/√(0.066 · 0.934 · (1/100 + 1/200)) = 0.658.

From the standard normal table, we see that the probability of observing
this large or a larger value is 0.2546, so we shall not reject H0 .
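The pooled test can be reproduced as follows (a sketch assuming SciPy is available):

```python
# Pooled two-proportion test of Example 2.4.4 (illustrative sketch).
from scipy.stats import norm

n1, x1, n2, x2 = 100, 8, 200, 12
p1_hat, p2_hat = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)
z = (p1_hat - p2_hat) / (p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) ** 0.5
print(z, norm.sf(z))   # z ~ 0.65 (the value 0.658 above uses p-hat rounded to 0.066); P-value ~ 0.26
```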

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 334 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 335 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Two Means - A Point Estimator


We have two populations with different means µ1 and µ2 ; our goal is to
estimate the difference µ1 − µ2 by taking one sample from each
population in such a way that the selection of one sample does not
influence the selection of the other (they are independent).
The natural point estimator is the difference of the sample means,

µ̂1 − µ̂2 = X̄1 − X̄2 .
To determine confidence intervals and to test hypotheses we need to know
the distribution of X 1 − X 2 .

2.5.1. Theorem. Let X 1 and X 2 be the sample means based on


independent random samples of sizes n1 and n2 drawn from normal
distributions with means µ1 and µ2 and variance σ12 and σ22 , respectively.
Then X 1 − X 2 is normal with mean µ1 − µ2 and variance σ12 /n1 + σ22 /n2 .
As usual, the Central Limit theorem allows us to apply this result even to
non-normal populations if we have large sample sizes.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 336 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Variances

We have seen when estimating the mean of a single distribution that the
key lies in understanding the distribution of the variance. Therefore, we
first consider the comparison of variances.
We consider the two cases
1. H0 : σ12 = σ22 , H1 : σ12 > σ22 (right-tailed test)
2. H0 : σ12 = σ22 , H1 : σ12 ̸= σ22 (two-tailed test)
For comparing variances, we prefer to consider the quotient instead of the
difference of the estimators: if the sample variances are S1² and S2², we
expect S1²/S2² to be close to 1 when the null hypothesis is true, while we
reject the null hypothesis if the quotient is much larger than 1.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 337 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Variances - The F -Test

Recall that (n − 1)S 2 /σ 2 follows a χ2 -distribution with n − 1 degrees of


freedom. We will need to consider the distribution of the quotient of two
sample variances, so we introduce the following

2.5.2. Definition. Let X²_{γ1} and X²_{γ2} be independent χ² random variables
with γ1 and γ2 degrees of freedom, respectively. The random variable

F_{γ1,γ2} = (X²_{γ1}/γ1)/(X²_{γ2}/γ2)

follows what is called an F distribution with γ1 and γ2 degrees of freedom.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 338 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Variances - The F -Test


2.5.3. Theorem. Let S12 and S22 be sample variances based on independent
random samples of sizes n1 and n2 drawn from normal populations with
means µ1 and µ2 and variances σ12 and σ22 , respectively.
If σ12 = σ22 , then the statistic
S12 /S22
follows an F distribution with n1 − 1 and n2 − 1 degrees of freedom.

Proof.
We know that (n1 − 1)S1²/σ1² and (n2 − 1)S2²/σ2² follow χ²-distributions
with n1 − 1 and n2 − 1 degrees of freedom, respectively, and they are
independent since the samples are independent. Then

F_{n1−1,n2−1} = [((n1 − 1)S1²/σ1²)/(n1 − 1)] / [((n2 − 1)S2²/σ2²)/(n2 − 1)] = (σ2² S1²)/(σ1² S2²).

If σ1² = σ2², this reduces to S1²/S2².


Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 339 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Variances - The F -Test

There are some assumptions and restrictions underlying the F -test:


1. Normality is essential,
2. The sample sizes should be equal,
3. The test is not very powerful (there is a high probability of
committing a Type II error)

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 340 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Variances - The F -Test

2.5.4. Example. Chemical etching is used to remove copper from printed


circuit boards. X1 and X2 represent process yields when two different
concentrations are used. Suppose that we wish to test H0 : σ12 = σ22 ,
H1 : σ12 ̸= σ22 .
Two samples of sizes n1 = n2 = 8 yield s1² = 4.02 and s2² = 3.89, and

s1²/s2² = 4.02/3.89 = 1.03.

If α = 0.05, we find that the critical value for F7,7 is 3.787. Therefore,
there is not enough evidence to reject H0 .
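A sketch of this comparison (assuming SciPy is available):

```python
# F test of Example 2.5.4 (illustrative sketch).
from scipy.stats import f

n1, n2, s1_sq, s2_sq = 8, 8, 4.02, 3.89
F0 = s1_sq / s2_sq                         # ~1.03
print(F0, f.ppf(0.95, n1 - 1, n2 - 1))     # critical value F_{0.05,7,7} ~ 3.787
```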

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 341 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Means - Variances Equal


By Theorem 2.5.1, we know that

[(X̄1 − X̄2 ) − (µ1 − µ2 )]/√(σ1²/n1 + σ2²/n2 )

is standard normal. If σ1² = σ2² =: σ², this reduces to

[(X̄1 − X̄2 ) − (µ1 − µ2 )]/√(σ²(1/n1 + 1/n2 )),

and we are faced with the task of estimating σ². We define the pooled
estimator

Sp² = [(n1 − 1)S1² + (n2 − 1)S2²]/(n1 + n2 − 2).

Then (n1 + n2 − 2)Sp²/σ² will follow a χ²-distribution with n1 + n2 − 2
degrees of freedom.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 342 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Means - Variances Equal

Thus

[(X̄1 − X̄2 ) − (µ1 − µ2 )]/√(Sp²(1/n1 + 1/n2 ))

will follow a T -distribution with n1 + n2 − 2 degrees of freedom. We
obtain the following 100(1 − α)% confidence interval for µ1 − µ2 ,

(X̄1 − X̄2 ) ± t_{α/2,n1+n2−2} √(Sp²(1/n1 + 1/n2 )),

where t_{α/2,n1+n2−2} is the critical value of the T_{n1+n2−2} -distribution.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 343 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Means - Variances Equal

2.5.5. Example. In a batch chemical process used for etching circuit
boards, two different catalysts are being compared to determine whether
they require different immersion times for removal of identical quantities of
photo-resistant material.
Twelve batches were run with catalyst 1, resulting in a sample mean
immersion time of x̄1 = 24.6 minutes and a sample standard deviation of
s1 = 0.85 minutes. Fifteen batches were run with catalyst 2, resulting in a
mean immersion time of x̄2 = 22.1 minutes and a standard deviation of
s2 = 0.98 minutes.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 344 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Means - Variances Equal

We will find a 95% confidence interval on the difference in means µ1 − µ2


assuming that the variances of the two populations are equal. The pooled
estimate for the variance gives

sp² = [(n1 − 1)s1² + (n2 − 1)s2²]/(n1 + n2 − 2) = 0.8557,
so sp = 0.925. Since t0.025,25 = 2.060, we obtain

µ1 − µ2 = (2.5 ± 0.74) minutes
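The interval can be reproduced from the summary statistics (a sketch assuming SciPy is available):

```python
# Pooled-variance confidence interval of Example 2.5.5 (illustrative sketch).
from scipy.stats import t

n1, xbar1, s1 = 12, 24.6, 0.85
n2, xbar2, s2 = 15, 22.1, 0.98
sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)        # ~0.8557
half_width = t.ppf(0.975, n1 + n2 - 2) * (sp2 * (1 / n1 + 1 / n2)) ** 0.5
print(xbar1 - xbar2 - half_width, xbar1 - xbar2 + half_width)          # ~1.76, ~3.24
```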

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 345 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Means - Variances Equal

We can use the previous results for the testing of hypotheses on the
difference of means. If µ1 − µ2 = (µ1 − µ2 )0 (the null value of the
difference of means), the statistic

T_{n1+n2−2} = [X̄1 − X̄2 − (µ1 − µ2 )0 ]/√(Sp²(1/n1 + 1/n2 ))

follows a T -distribution with n1 + n2 − 2 degrees of freedom.


While any value for (µ1 − µ2 )0 can be tested, the most common in
applications is (µ1 − µ2 )0 = 0 (testing for equality of means). We then
generally test H0 : µ1 = µ2 against H1 : µ1 ≷ µ2 (one-tailed test) or
H1 : µ1 ̸= µ2 (two-tailed test).
This type of test is known as a pooled T -test.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 346 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Means - Variances Equal


2.5.6. Example. Two catalysts are being analyzed to determine how they
affect the mean yield of a chemical process. Specifically, catalyst 1 is
currently in use, but catalyst 2 is acceptable. Since catalyst 2 is cheaper, it
should be adopted if it does not change the process yield. Suppose we
wish to test the hypotheses
H0 : µ1 = µ2 , H1 : µ1 ̸= µ2 .
Pilot data yields n1 = 8, x̄1 = 91.73, s1² = 3.89, n2 = 8, x̄2 = 93.75,
s2² = 4.02. Then

sp² = 3.96

and the test statistic is

(x̄1 − x̄2 )/(sp √(1/n1 + 1/n2 )) = −2.03.
Using α = 0.05, we find t0.025,14 = 2.145 and −t0.025,14 = −2.145, so H0
cannot be rejected.
Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 347 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Means - Variances Unequal


If the variances of the two populations are distinct, we also consider the
standard normal random variable

[(X̄1 − X̄2 ) − (µ1 − µ2 )]/√(σ1²/n1 + σ2²/n2 )

and estimate the variances to obtain the statistic

[(X̄1 − X̄2 ) − (µ1 − µ2 )]/√(S1²/n1 + S2²/n2 ).

We can expect this to also be a T -random variable, but it is not clear
what its degrees of freedom are.
A solution is to use the Smith-Satterthwaite approximation, setting

γ = (S1²/n1 + S2²/n2 )² / [ (S1²/n1 )²/(n1 − 1) + (S2²/n2 )²/(n2 − 1) ].

We round the values of γ down to the nearest integer.


Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 348 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Comparing Means - Paired Data

In some situations, we do not take independent samples from two different


populations, but rather the samples are naturally related to each other.
For example, we might test reaction times of a given set of people when
sober and when under the influence of alcohol.
In such situations, pairing of data is appropriate. Instead of considering
two random variables X and Y , we define a new random variable
D = X − Y , whose mean will be

µD = E[D] = E[X − Y ] = E[X ] − E[Y ] = µX − µY .

We can then analyze D using the methods for the mean of a single
random variable.
Hypothesis tests for paired data are called paired T -tests.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 349 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

The Wilcoxon Rank-Sum Test

If the sample sizes are small, the variances unequal, and/or the populations
non-normal, the T-tests may not yield good results. In such situations, a
non-parametric test is available that is nearly as good as a T-test.
The Wilcoxon rank-sum test tests two random variables X and Y for
equality. However, it is especially sensitive to differences in location. Hence
we usually state the hypotheses in terms of the medians MX and MY :
◮ H0 : MX = MY , H1 : MX > MY (right-tailed),
◮ H0 : MX = MY , H1 : MX < MY (left-tailed),
◮ H0 : MX = MY , H1 : MX ≠ MY (two-tailed).

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 350 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

The Wilcoxon Rank-Sum Test

Assume that we have two random samples X1 , . . . , Xm and Y1 , . . . , Yn ,
m ≤ n. We pool the N = m + n observations and rank them from 1 to N from
smallest to largest, while retaining their group identity. The test statistic,
denoted by Wm , is the sum of the ranks associated with the smaller (X )
sample.
The reasoning is as follows: if X is located below Y , then the smaller
ranks will be associated with the X values. Thus we will reject H0 in favor
of H1 : MX < MY for small values of Wm .
For large values of m, Wm is approximately normal with mean
E[Wm ] = m(m + n + 1)/2 and variance Var[Wm ] = mn(m + n + 1)/12.
For paired data, we can use the sign test or the Wilcoxon signed-rank test
on D = X − Y as before.
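
A sketch of the rank-sum statistic and its normal approximation, assuming Python with NumPy and SciPy and hypothetical samples:

```python
# Wilcoxon rank-sum statistic W_m and its large-sample normal approximation.
import numpy as np
from scipy import stats

x = np.array([12.1, 11.8, 12.6, 11.9])         # smaller sample (m = 4), hypothetical
y = np.array([12.8, 13.0, 12.4, 13.3, 12.9])   # larger sample (n = 5), hypothetical
m, n = len(x), len(y)

ranks = stats.rankdata(np.concatenate([x, y]))  # ranks 1..(m+n) of the pooled observations
w_m = ranks[:m].sum()                           # sum of the ranks of the X sample

mean_w = m * (m + n + 1) / 2                    # E[W_m]
var_w = m * n * (m + n + 1) / 12                # Var[W_m]
z = (w_m - mean_w) / np.sqrt(var_w)             # approximately standard normal for large m
p_two_tailed = 2 * stats.norm.sf(abs(z))
print(w_m, z, p_two_tailed)
```

(The normal approximation is intended only for large samples; with m = 4 one would instead use tabulated critical values, as in the example that follows.)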

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 351 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

The Wilcoxon Rank-Sum Test


2.5.7. Example. The mean axial stress in tensile members used in an
aircraft structure is being studied. Two alloys are being investigated. Alloy
1 is a traditional material and alloy 2 is a new aluminum-lithium alloy that
is much lighter than the standard material. The sample data are arranged
in the following table:
Alloy 1 (psi) 3238 3195 3246 3190 3204
3254 3229 3225 3217 3241
Alloy 2 (psi) 3261 3187 3209 3212 3258
3248 3215 3226 3240 3234
We test

H0 : M1 = M2 , H1 : M1 ≠ M2 .

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 352 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

The Wilcoxon Rank-Sum Test

The data are arranged in order and ranked as follows:


Alloy   Axial Stress   Rank      Alloy   Axial Stress   Rank
  2         3187          1        1         3229        11
  1         3190          2        2         3234        12
  1         3195          3        1         3238        13
  1         3204          4        2         3240        14
  2         3209          5        1         3241        15
  2         3212          6        1         3246        16
  2         3215          7        2         3248        17
  1         3217          8        1         3254        18
  1         3225          9        2         3258        19
  2         3226         10        2         3261        20

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 353 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

The Wilcoxon Rank-Sum Test

The sums of the ranks are W1 = 99 for alloy 1, and W2 = 111 for alloy 2.
By Table X in Appendix A, the critical values for α = 0.05 in a two-tailed
test are 79 and 131. Since neither sum of ranks is outside the interval
[79, 131], we cannot reject H0 .
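
The rank sums W1 = 99 and W2 = 111 can be checked directly from the data; a short sketch, assuming Python with NumPy and SciPy:

```python
# Verifying the rank sums in Example 2.5.7 (assumes SciPy's rankdata).
from scipy import stats

alloy1 = [3238, 3195, 3246, 3190, 3204, 3254, 3229, 3225, 3217, 3241]
alloy2 = [3261, 3187, 3209, 3212, 3258, 3248, 3215, 3226, 3240, 3234]

ranks = stats.rankdata(alloy1 + alloy2)   # ranks 1..20 of the pooled observations
w1 = ranks[:len(alloy1)].sum()            # 99.0
w2 = ranks[len(alloy1):].sum()            # 111.0
print(w1, w2)
```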

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 354 / 355
An Introduction to Statistical Methods Comparing Two Means and Two Variances

Second Midterm Exam

The preceding material completes our introduction to statistical theory; we
will next look at some more specialized applications in statistics.
It encompasses all of the material that will be the subject of the Second
Midterm Exam. The exam will take place on Thursday, the 2nd of April,
during the usual lecture time.
For this exam a non-programmable calculator is required.

Dr. Hohberger (UM-SJTU JI) Probabilistic Methods in Engineering Spring 2009 355 / 355
