Probability Theory - Year 2 Applied Maths & Physics - 2019-2020
School of Science
Department of Mathematics & Physics
November 2019
Contents

1.1 Introduction . . . . . . . . . . 4
2.5.2 Arrangements . . . . . . . . 38
2.5.3 Permutations . . . . . . . . 39
2.5.4 Combinations . . . . . . . . 41

3 Random Variables 49
3.1 Introduction . . . . . . . . . . 49

7 Limit Theorems 116
7.1 The Joint Distribution of the Sample Mean and Sample Variance from a Normal Population . . . . . . . . . . 116
Chapter 1
1.1 Introduction
In the study of probability, we first imagine an appropriate random experiment. A random experiment is an experiment or a process whose outcome cannot be predicted with certainty. All the possible outcomes of a random experiment are known, but which outcome will result when the experiment is performed is not known; i.e., the result of the random experiment is known only after the experiment has been performed. An experiment is random if it has more than one possible outcome, and deterministic if it has only one. When we repeat a random experiment several times, we call each repetition a trial. Thus, a trial is a particular performance of a random experiment. For example, tossing a coin and rolling a die are random experiments.
Suppose that one tosses a regular coin up in the air. The coin has two sides, namely the
head (H) and tail (T ). Let us assume that the tossed coin will land on either H or T .
Every time one tosses the coin, there is the possibility of the coin landing on its head
or tail (multiplicity of outcomes). But, no one can say with absolute certainty whether
the coin would land on its head, or for that matter, on its tail (uncertainty regarding the
outcomes). One may toss this coin as many times as one likes under identical conditions
(repeatability) provided the coin is not damaged in any way in the process of tossing it
successively.
All three components are crucial ingredients of a random experiment. In order to contrast
a random experiment with another experiment, suppose that in a lab environment, a bowl
of pure water is boiled and the boiling temperature is then recorded. The first time this
experiment is performed, the recorded temperature would read 100◦ Celsius (or 212◦
Fahrenheit). Under identical and perfect lab conditions, we can think of repeating this
experiment several times, but then each time the boiling temperature would read 100◦
Celsius (or 212◦ Fahrenheit). Such an experiment will not fall in the category of a random
experiment because the requirements of multiplicity and uncertainty of the outcomes are
both violated here.
Exercise 1.1.1 Provide two examples of non-random experiments and two examples of
random experiments. Justify your answers.
T H T H T T H T T H        (1.2.1)
Let nk be the number of Hs observed in a sequence of k tosses of the coin, while nk/k refers to the associated relative frequency of H. For the observed sequence in the Expression (1.2.1), the successive frequencies and relative frequencies of H are given in the accompanying Table 1.1. The observed values for nk/k empirically provide a sense of what p may be, but admittedly this particular observed sequence of relative frequencies appears a little unstable. However, as the number of tosses increases, the oscillations between the successive values of nk/k will become less noticeable. Ultimately nk/k and p are expected to become indistinguishable in the sense that in the long haul nk/k will be very close to p. That is, our instinct may simply lead us to interpret p as

p = lim_{k→∞} nk/k
k    nk   nk/k        k    nk   nk/k
1    0    0           6    2    1/3
2    1    1/2         7    3    3/7
3    1    1/3         8    3    3/8
4    2    1/2         9    3    1/3
5    2    2/5         10   4    2/5

Table 1.1: Frequencies and relative frequencies of H in the sequence (1.2.1).
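The stabilization of nk/k described above is easy to see in a simulation. The following sketch (the function name, seed and toss count are our choices, not part of the text) flips a simulated coin with P(H) = p and tracks the relative frequency after each toss:

```python
import random

def relative_frequencies(p, tosses, seed=0):
    """Simulate `tosses` flips of a coin with P(H) = p and return the
    relative frequency n_k / k after each toss, as in Table 1.1."""
    rng = random.Random(seed)
    heads = 0
    freqs = []
    for k in range(1, tosses + 1):
        heads += rng.random() < p   # True counts as 1 head
        freqs.append(heads / k)
    return freqs

freqs = relative_frequencies(p=0.5, tosses=100_000)
print(freqs[:10])   # early values oscillate, much as in Table 1.1
print(freqs[-1])    # after many tosses, close to p = 0.5
```

For a handful of tosses the relative frequency jumps around, while after 100,000 tosses it typically sits within a fraction of a percent of p, in line with the limit interpretation of p.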
A random experiment provides in a natural fashion a list of all possible outcomes, also
referred to as the simple events. These simple events act like atoms in the sense that the
experimenter is going to observe only one of these simple events as a possible outcome
when the particular random experiment is performed.
A sample space is merely a set, denoted by S, which enumerates each and every possible
simple event or outcome. Then, a probability scheme is generated on the subsets of S,
including S itself, in a way which mimics the nature of the random experiment itself.
Throughout, we will write P (A) for the probability of a statement A, where A is a subset
of S. A more precise treatment of these topics is provided in Chapter 2.
Example 1.2.1 Suppose that we toss a fair coin three times and record the outcomes
observed in the first, second, and third toss respectively from left to right. Then the
possible simple events are HHH, HHT, HTH, HTT, THH, THT, TTH or TTT. Thus the sample space is given by

S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.

Since the coin is assumed to be fair, this particular random experiment generates the following probability scheme:

P{HHH} = P{HHT} = · · · = P{TTT} = 1/8.
Example 1.2.2 Suppose that we roll two fair dice, one red and one yellow, and record the pair of scores. The sample space is given by
S = {(1, 1), (1, 2), ..., (1, 6), (2, 1), ..., (2, 6), ..., (6, 1), ..., (6, 6)}
consisting of exactly 36 possible simple events. Since both dice are assumed to be fair, this
particular random experiment generates the following probability scheme:
P {(i, j)} = 1/36 for all i, j = 1, · · · , 6
Example 1.2.3 If the experiment consists of measuring the lifetime of a car, then the
sample space consists of all nonnegative real numbers i.e. the sample space is the set
S = {x : x ∈ R and x ≥ 0} = [0, ∞).
In summary:
• A random experiment is a procedure that produces exactly one outcome out of many
possible outcomes.
• A sample space, denoted by S, is defined to be the set of all the possible outcomes
in a random experiment.
• An event is defined to be any subset of the sample space, and events are usually
denoted by capital letters, A, B and so forth.
Example 1.2.4 Consider an experiment of tossing a die twice. Then the sample space is S = {(i, j) : i, j = 1, 2, ..., 6}, which contains 36 elements. "The sum of the results of the two tosses is equal to 10" is an event. If we denote this event by A, then A = {(i, j) : i + j = 10, i, j = 1, 2, · · · , 6} = {(4, 6), (6, 4), (5, 5)} and P (A) = 3/36.
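The count in Example 1.2.4 can be verified by brute-force enumeration of the 36 equally likely pairs (a sketch using Python's standard library):

```python
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely outcomes of tossing a die twice
# and count those whose coordinates sum to 10, as in Example 1.2.4.
outcomes = list(product(range(1, 7), repeat=2))
A = [(i, j) for (i, j) in outcomes if i + j == 10]
P_A = Fraction(len(A), len(outcomes))
print(sorted(A), P_A)  # [(4, 6), (5, 5), (6, 4)] 1/12
```

The same enumeration, with the condition `i + j == n`, answers Exercise 1.2.1 for every n from 2 to 12.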
Exercise 1.2.1 If two fair dice are tossed, what is the probability that the sum is n, where n = 2, 3, · · · , 12?
Exercise 1.2.2 A number X is chosen at random from the set {1, 2, ..., 25}. Find the probability that X leaves remainder 1 when divided by 6.
Exercise 1.2.3 A fair die is rolled three times and the scores are added. What is the
probability that the sum of the scores is 6?
A set S is a collection of objects which are tied together by one or more common defining properties. For example, we may consider a set S = {x : x is an integer}, in which case S can alternately be written as Z = {..., −2, −1, 0, 1, 2, ...}. Here the common defining property which ties together all the members of the set Z is that they are integers. In the set N = {1, 2, 3, · · · } of all positive integers, being a positive integer is the common defining property of all members of N.
Let S be a universe set, i.e. S is a set that contains all the entities, including itself, that
one wishes to consider in a given situation. We say that A is a subset of S, denoted by
A ⊆ S, provided that each member of A is also a member of S.
Consider A and B which are both subsets of S. Then, A is called a subset of B, denoted
by A ⊆ B, provided that each member of A is also a member of B. We say that A is a
proper subset of B, denoted by A ⊂ B, provided that A is a subset of B but there is at
least one member of B which does not belong to A. Two sets A and B are said to be
equal if and only if A ⊆ B, as well as B ⊆ A.
Example 1.3.1 Let us define S = {1, 3, 5, 7, 9, 11, 13}, A = {1, 5, 7}, B = {1, 5, 7, 11, 13},
C = {3, 5, 7, 9}, and D = {11, 13}. Here, A, B, C and D are all proper subsets of S. It is
obvious that A is a proper subset of B, but A is not a subset of either C or D.
Suppose that A and B are two subsets of S. Now, we mention some customary set
operations listed below:
Union: A ∪ B = {x : x ∈ A or x ∈ B}.

Intersection: A ∩ B = {x : x ∈ A and x ∈ B}.

Difference: A \ B = {x : x ∈ A but x ∉ B}.

Complement: Ac = {x : x ∈ S but x ∉ A}.

Let us note that the complement of a set A is sometimes denoted by A′.
Figure 1.1 gives the set operations and Venn diagrams for complement, subset, intersection and union.

Figure 1.1: Set Operations and Venn Diagrams.
Commutative Law : A ∪ B = B ∪ A, A ∩ B = B ∩ A.
Two sets A and B are said to be disjoint if and only if there is no common element
between A and B, that is, if and only if A ∩ B = ∅, where ∅ denotes the empty set, i.e.
a set which has no elements. Two disjoint sets A and B are also referred to as being
mutually exclusive.
Example 1.3.2 (Example 1.3.1 Continued) One can verify that A ∪ B = {1, 5, 7, 11, 13},
B ∪ C = S, but C and D are mutually exclusive. Also, A and D are mutually exclusive.
Note that Ac = {3, 9, 11, 13}, A∆C = {1, 3, 9} and A∆B = {11, 13}.
Reflexivity: A ⊆ A.
Antisymmetry: A ⊆ B and B ⊆ A if and only if A = B.
Transitivity: If A ⊆ B and B ⊆ C then A ⊆ C.
(1) ∅ ⊆ A ⊆ S.
(2) A ⊆ A ∪ B and A ∩ B ⊆ A.
(3) If A ⊆ C and B ⊆ C then A ∪ B ⊆ C.
(4) If C ⊆ A and C ⊆ B then C ⊆ A ∩ B.
(1) C \ (A ∩ B) = (C \ A) ∪ (C \ B).
(2) C \ (A ∪ B) = (C \ A) ∩ (C \ B).
(3) (B \ A) ∩ C = (B ∩ C) \ A = B ∩ (C \ A).
(4) (B \ A) ∪ C = (B ∪ C) \ (A \ C).
(5) C \ (B \ A) = (A ∩ C) ∪ (C \ B).
(6) A \ A = ∅, A \ ∅ = A, B \ A = B ∩ Ac .
(7) (B \ A)c = A ∪ B c , S \ A = Ac , A \ S = ∅.
The cardinality of a set is a measure of the "number of elements of the set". For example, the set A = {10, 20, 30} contains 3 elements, and therefore A has a cardinality of 3.
A set is called finite if it is empty or has the same cardinality as the set {1, 2, ..., n} for
some n ∈ N; it is called infinite otherwise.
A countable set is a set with the same cardinality (number of elements) as some subset
of the set N. A countable set is either a finite set or a countably infinite set. Whether
finite or infinite, the elements of a countable set can always be counted one at a time
and, although the counting may never finish, every element of the set is associated with
a unique natural number.
An uncountable set (or uncountably infinite set) is an infinite set that contains too many
elements to be countable.
(i) The set A = {10, 20, 30, 40} is a finite set and hence a countable set.
(ii) The sets N and Z are all countable sets but they are not finite.
(iii) The sets R and I = (0, 1) are all uncountable sets and hence infinite.
Now consider {Ai : i ∈ I}, a collection of subsets of S, where I is some indexing set. This
collection may be finite, countably infinite or uncountably infinite. We define
Union: ∪_{i∈I} Ai = {x : x ∈ Ai for at least one i ∈ I}
Intersection: ∩_{i∈I} Ai = {x : x ∈ Ai for all i ∈ I}        (1.3.1)
The Definition 1.3.1 lays down the set operations involving the union and intersection among an arbitrary number of sets. When we specialize I = {1, 2, 3, · · · } in the Definition 1.3.1, we can combine the notions of countably infinite numbers of unions and intersections to come up with some interesting results.
Theorem 1.3.4 (De Morgan's Laws) Consider {Ai : i ∈ I}, a collection of subsets of S. Then,

(∪_{i∈I} Ai)^c = ∩_{i∈I} Ai^c  and  (∩_{i∈I} Ai)^c = ∪_{i∈I} Ai^c        (1.3.2)

Proof. Suppose that an element x belongs to (∪_{i∈I} Ai)^c. Then x does not belong to ∪_{i∈I} Ai, that is, x ∉ Ai for every i ∈ I. Hence x ∈ Ai^c for each i ∈ I, so that x ∈ ∩_{i∈I} Ai^c. Thus we have

(∪_{i∈I} Ai)^c ⊆ ∩_{i∈I} Ai^c

Conversely, suppose that an element x belongs to ∩_{i∈I} Ai^c. That is, x ∈ Ai^c for each i ∈ I, which implies that x cannot belong to any of the sets Ai, i ∈ I. In other words, the element x cannot belong to the set ∪_{i∈I} Ai, so that x must belong to the set (∪_{i∈I} Ai)^c. Thus we have

(∪_{i∈I} Ai)^c ⊇ ∩_{i∈I} Ai^c

Combining the two inclusions proves the first equality in (1.3.2); the second follows by applying the first to the collection {Ai^c : i ∈ I}.
Definition 1.3.5 The collection of sets {Ai : i ∈ I} is said to consist of disjoint sets if and only if no two sets in this collection share a common element, that is, when Ai ∩ Aj = ∅ for all i ≠ j ∈ I.

The collection {Ai : i ∈ I} is called a partition of S if

(i) {Ai : i ∈ I} consists of disjoint sets only, and

(ii) {Ai : i ∈ I} spans the whole space S, that is ∪_{i∈I} Ai = S.
Example 1.3.6 Let S = (0, 1] and define the collection of sets {Ai : i ∈ I} where Ai = (1/2^i, 1/2^{i−1}], i ∈ I = {1, 2, 3, · · · }. One should check that the given collection of intervals forms a partition of (0, 1].
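One way to carry out that check numerically, at least at sample points, is the following sketch (the helper `members` and the particular test points are ours): for each x in (0, 1], exactly one index i should satisfy 1/2^i < x ≤ 1/2^{i−1}.

```python
# Check numerically that the intervals A_i = (1/2**i, 1/2**(i-1)],
# i = 1, 2, ..., are pairwise disjoint and cover sample points of (0, 1].
def members(x, n_intervals=60):
    """Indices i with x in A_i; should be exactly one for x in (0, 1]."""
    return [i for i in range(1, n_intervals + 1)
            if 1 / 2**i < x <= 1 / 2**(i - 1)]

for x in [1.0, 0.9, 0.5, 0.3, 0.0625, 1e-9]:
    assert len(members(x)) == 1, x
print("each sample point lies in exactly one A_i")
```

Note how the half-open intervals matter: the boundary point 0.0625 = 1/16 belongs to (1/32, 1/16] but not to (1/16, 1/8], so disjointness holds.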
Chapter 2
Axiomatic Development of
Probability
The axiomatic theory of probability was developed by Kolmogorov in his 1933 monograph,
originally written in German. Its English translation is cited as Kolmogorov (1950b).
Before we describe this approach, we need to fix some ideas first. Along the lines of
the examples discussed in the introduction, let us focus on some random experiment in
general and state a few definitions.
Definition 2.1.1 A sample space is a set, denoted by S, which enumerates each and
every possible outcome or simple event.
In general, an event is an appropriate subset of the sample space S, including the empty
subset ∅ and the whole set S.
(i) ∅ ∈ B;

(ii) If A ∈ B, then Ac ∈ B;

(iii) If Ai ∈ B for i = 1, 2, ..., then ∪_{i=1}^∞ Ai ∈ B.
In other words, the Borel sigma-field B is closed under the operations of complement and countable union of its members. It is obvious that the whole space S belongs to the Borel sigma-field B since we can write S = ∅c ∈ B, by the requirements (i)-(ii) in the Definition 2.1.2. Also, if Ai ∈ B for i = 1, 2, ..., k, then ∪_{i=1}^k Ai ∈ B, since with Ai = ∅ for i = k + 1, k + 2, ..., we have ∪_{i=1}^k Ai = ∪_{i=1}^∞ Ai ∈ B.
(ii) If A is any subset of S then the collection σ(A) = {∅, A, Ac , S} is a Borel sigma-field
on S.
(iii) Suppose that A and B are two subsets of S such that A ⊆ B. Then the collection
σ(A, B) = {∅, A, B, Ac , B c , A ∪ B c , Ac ∩ B, S} is a Borel sigma-field on S.
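For a finite sample space S, the smallest sigma-field containing a given set A can be produced mechanically by closing {∅, A, Ac, S} under complements and unions (a sketch; the function name is ours, and finiteness of S is assumed so that countable unions reduce to finite ones):

```python
from itertools import combinations

def generated_field(S, A):
    """Close {A} under complement and (finite) union inside the finite
    universe S; for finite S this yields the sigma-field sigma(A)."""
    S, A = frozenset(S), frozenset(A)
    field = {frozenset(), S, A, S - A}
    changed = True
    while changed:
        changed = False
        for X, Y in combinations(list(field), 2):
            for Z in (X | Y, S - (X | Y)):   # union and its complement
                if Z not in field:
                    field.add(Z)
                    changed = True
    return field

sigma_A = generated_field({1, 2, 3, 4}, {1, 2})
print(sorted(tuple(sorted(x)) for x in sigma_A))
# [(), (1, 2), (1, 2, 3, 4), (3, 4)]
```

For S = {1, 2, 3, 4} and A = {1, 2} the closure is exactly {∅, A, Ac, S}, matching the collection σ(A) in part (ii).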
Exercise 2.1.1 Show that if B is a Borel sigma-field, then B is closed under countable intersections, i.e. if A1, A2, · · · are elements of B, then

∩_{i=1}^∞ Ai ∈ B
Solution 2.1.4 We recall that for any given set A we have the equality A = (Ac)c. By using De Morgan's Laws (Theorem 1.3.4) one can write

∩_{i=1}^∞ Ai = [(∩_{i=1}^∞ Ai)^c]^c = (∪_{i=1}^∞ Ai^c)^c

Since Ai ∈ B, then Ai^c ∈ B by the Definition 2.1.2 (ii) for all i = 1, 2, · · · . It also follows from the Definition 2.1.2 (iii) that ∪_{i=1}^∞ Ai^c ∈ B. Finally, by the Definition 2.1.2 (ii) we have (∪_{i=1}^∞ Ai^c)^c ∈ B as requested.
Remark 2.1.1 If B1 and B2 are two Borel sigma-fields on S, then the collection of sets B1 ∩ B2 = {B : B ∈ B1 and B ∈ B2} is also a sigma-field on S, but the collection B1 ∪ B2 = {B : B ∈ B1 or B ∈ B2} need not be a Borel sigma-field on S.
Exercise 2.1.2 Provide two examples of Borel sigma-fields on the set R of all real num-
bers.
Frequently, we work with the Borel sigma-field B which consists of all subsets of the sample space S, that is B = P(S), but it need not always be that way. Having
started with a fixed Borel sigma-field B of subsets of S, a probability scheme is simply
a way to assign numbers between zero and one to every event while such assignment of
numbers must satisfy some general guidelines. In the next definition, we provide more
specifics.
(i) P (S) = 1;

(ii) 0 ≤ P (A) ≤ 1 for every A ∈ B;

(iii) P (∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P (Ai) for all Ai ∈ B such that the Ai's are pairwise disjoint, i = 1, 2, · · · .
Now, we are in a position to claim a few basic results involving probability. Some are
fairly intuitive while others may need more attention.
Theorem 2.1.7 Suppose that A and B are any two events and recall that ∅ denotes the empty set. Suppose also that the sequence of events {Bi : i = 1, 2, 3, · · · } forms a partition of the sample space S. Then,

(i) P (∅) = 0;

(ii) P (Ac) = 1 − P (A);

(iii) P (B ∩ Ac) = P (B) − P (A ∩ B);

(iv) P (A ∪ B) = P (A) + P (B) − P (A ∩ B);

(v) if A ⊆ B, then P (A) ≤ P (B);

(vi) P (A) = Σ_{i=1}^∞ P (A ∩ Bi).
Proof. (i) Observe that ∅ ∪ ∅c = S and that ∅ and ∅c = S are disjoint events. Hence, by part (iii) in the Definition 2.1.6, we have 1 = P (S) = P (∅ ∪ ∅c) = P (∅) + P (∅c). Thus, P (∅) = 1 − P (∅c) = 1 − P (S) = 1 − 1 = 0, in view of part (i) in the Definition 2.1.6. The second part follows from part (ii).
(ii) Observe that A ∪ Ac = S and then proceed as before. Observe that A and Ac are
disjoint events.
(iii) Notice that B = (B∩A)∪(B∩Ac ), where B∩A and B∩Ac are disjoint events. Hence
by part (iii) in the Definition 2.1.6, we claim that P (B) = P {(B ∩ A) ∪ (B ∩ Ac )} =
P (B ∩ A) + P (B ∩ Ac ). Now, the result is immediate.
(vi) Since the sequence of events {Bi : i ≥ 1} forms a partition of the sample space S, we can write A = A ∩ S = A ∩ (∪_{i=1}^∞ Bi) = ∪_{i=1}^∞ (A ∩ Bi), where the events A ∩ Bi, i = 1, 2, ... are also disjoint. Now, the result follows from part (iii) in the Definition 2.1.6.
Example 2.1.8 (Example 1.2.1 Continued) Let us define three events as follows:

A: We observe exactly two heads.

B: We observe at least one head.

C: We observe no heads.

First notice that as subsets of S, one can define the given events as follows:

A = {HHT, HTH, THH},

B = S \ {TTT},

C = {TTT}.

Now it becomes obvious that P (A) = 3/8, P (B) = 7/8, and P (C) = 1/8. One can also see that A ∩ B = {HHT, HTH, THH} so that P (A ∩ B) = 3/8, whereas A ∪ C = {HHT, HTH, THH, TTT} so that P (A ∪ C) = 4/8 = 1/2.
Example 2.1.9 (Example 1.2.2 Continued) Recall the two fair dice, one red and one yellow, from Example 1.2.2. Let us define two events as follows:

D: The total from the red and yellow dice is 8.

E: The red die comes up with 2 more than the yellow die.
Now, as subsets of the corresponding sample space S, we can rewrite these events as D = {(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)} and E = {(3, 1), (4, 2), (5, 3), (6, 4)}. It is now obvious that P (D) = 5/36 and P (E) = 4/36 = 1/9. It is also obvious that D ∩ E = {(5, 3)}, which gives P (D ∩ E) = 1/36.
Example 2.1.10 On a college campus, suppose that 2600 of the 4000 undergraduate students are women, while 800 of the 2000 undergraduates who are under the age of 25 are men. If one student is selected at random from this population of undergraduate students, what is the probability that the student will either be a man or be under the age of 25?
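The example leaves the computation to the reader; one way to organize it with the additive rule P (M ∪ U) = P (M) + P (U) − P (M ∩ U) is sketched below (we assume the 2000 under-25 undergraduates are counted within the same population of 4000 students):

```python
from fractions import Fraction

# Example 2.1.10 via the additive rule.  M = man, U = under 25.
total = 4000
men = total - 2600          # 1400 men
under_25 = 2000
men_under_25 = 800

P = Fraction(men + under_25 - men_under_25, total)
print(P)  # 13/20, i.e. 0.65
```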
Exercise 2.1.4 Suppose that A, B are Borel sets, that is they belong to B. Show that
P (A∆B) = P (A) + P (B) − 2P (A ∩ B) where A∆B stands for the symmetric difference
of A and B.
Solution 2.1.12 Note that A∆B = (A ∩ B c ) ∪ (B ∩ Ac ) and the two sets A ∩ B c and
B ∩ Ac are disjoint. So by using Definition 2.1.6 (iii) one can write P (A∆B) = P (A ∩
B c ) + P (B ∩ Ac ). Using the Theorem 2.1.7 (iii), it follows that P (A∆B) = P (A) − P (A ∩
B) + P (B) − P (A ∩ B) which is equivalent to P (A∆B) = P (A) + P (B) − 2P (A ∩ B) as
requested.
Exercise 2.1.5 There are 100 cards: 10 red cards numbered 1 through 10; 20 white cards numbered 1 through 20; 30 blue cards numbered 1 through 30; and 40 magenta cards numbered 1 through 40.
(iii) Let E be the event of picking a card with face value 11. Find P (E).
Solution 2.1.13 Using the given information in the statement of the problem, one can see that the requested probabilities are given by P (R) = 10/100, P (B) = 30/100, P (E) = 3/100, P (B ∪ R) = 40/100, P (E ∩ R) = 0, P (E ∩ B) = 1/100, P (E ∪ R) = 13/100, P (E ∪ B) = 32/100, P (E \ B) = 2/100 and P (B \ E) = 29/100.
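These values can be double-checked by rebuilding the deck and counting (a sketch; the colour labels and the helper `prob` are our notation):

```python
from fractions import Fraction

# Rebuild the 100-card deck of Exercise 2.1.5 and verify some of the
# probabilities in Solution 2.1.13 by direct counting.
deck = ([("red", n) for n in range(1, 11)]
        + [("white", n) for n in range(1, 21)]
        + [("blue", n) for n in range(1, 31)]
        + [("magenta", n) for n in range(1, 41)])

def prob(event):
    """Probability of an event, as the fraction of cards satisfying it."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

P_R = prob(lambda c: c[0] == "red")
P_B = prob(lambda c: c[0] == "blue")
P_E = prob(lambda c: c[1] == 11)   # face value 11: white, blue, magenta
print(P_R, P_B, P_E)  # 1/10 3/10 3/100
```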
Let us reconsider the Example 1.2.2. Suppose that the two fair dice, one red and the other yellow, are tossed in another room. After the toss, the experimenter comes out to announce that the event D has been observed. Recall that P (E) was 1/9 to begin with, but we know now that D has happened, and so the probability of the event E should be appropriately updated. Now then, how should one revise the probability of the event E, given the additional information?
The basic idea is simple: when we are told that the event D has been observed, then D
should take over the role of the sample space while the original sample space S should
be irrelevant at this point. In order to evaluate the probability of the event E in this
situation, one should simply focus on the portion of E which is inside the set D. This is
the fundamental idea behind the concept of conditioning.
Definition 2.2.1 Let S and B be respectively the sample space and the Borel sigma-field.
Suppose that A, B are two arbitrary events. The conditional probability of the event A
given the other event B, denoted by P (A|B), is defined as
P (A|B) = P (A ∩ B) / P (B), provided that P (B) > 0        (2.2.1)

In the same vein, we will write

P (B|A) = P (A ∩ B) / P (A), provided that P (A) > 0
Definition 2.2.2 Two arbitrary events A and B are said to be independent if and only if P (A|B) = P (A), that is, having the additional knowledge that B has been observed has not affected the probability of A, provided that P (B) > 0.

Two arbitrary events A and B are then said to be dependent if and only if P (A|B) ≠ P (A); in other words, knowing that B has been observed has affected the probability of A, provided that P (B) > 0.
In case the two events A and B are independent, intuitively it means that the occurrence
of the event A (or B) does not affect or influence the probability of the occurrence of the
other event B (or A). In other words, the occurrence of the event A (or B) yields no
reason to alter the likelihood of the other event B (or A).
When the two events A, B are dependent, sometimes we say that B is favorable to A if
and only if P (A|B) > P (A) provided that P (B) > 0. Also, when the two events A, B are
dependent, sometimes we say that B is unfavorable to A if and only if P (A|B) < P (A)
provided that P (B) > 0.
Example 2.2.3 (Example 1.2.2 Continued) Recall that P (D ∩ E) = P {(5, 3)} = 1/36 and P (D) = 5/36, so that we have P (E|D) = P (D ∩ E)/P (D) = (1/36) × (36/5) = 1/5. But P (E) = 1/9, which is different from P (E|D). In other words, we conclude that D and E are two dependent events. Since 1/5 = P (E|D) > P (E) = 1/9, we may add that the event D is favorable to the event E.
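The conditioning idea, restricting attention to the outcomes inside D, can be checked directly by enumeration:

```python
from fractions import Fraction
from itertools import product

# Conditioning by restriction to D: among the outcomes of D (total 8
# on a red and a yellow die), count those also in E (red = yellow + 2).
outcomes = list(product(range(1, 7), repeat=2))   # (red, yellow)
D = [(r, y) for (r, y) in outcomes if r + y == 8]
E_and_D = [(r, y) for (r, y) in D if r == y + 2]
P_E_given_D = Fraction(len(E_and_D), len(D))
print(P_E_given_D)  # 1/5
```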
Theorem 2.2.4 The two events B and A are independent if and only if A and B are
independent. Also, the two events A and B are independent if and only if P (A ∩ B) =
P (A)P (B).
For the second statement, we note that P (A ∩ B) = P (A|B)P (B). If A and B are independent, then P (A|B) = P (A); hence P (A ∩ B) = P (A)P (B). Conversely, if P (A ∩ B) = P (A)P (B), then P (A|B) = P (A ∩ B)/P (B) = P (A)P (B)/P (B) = P (A). Therefore A and B are independent.
Exercise 2.2.1 Let A and B be independent events with P (A) = P (B) and P (A ∪ B) = 1/2. Find P (A).

Solution 2.2.5 Let us first recall that P (A ∪ B) = P (A) + P (B) − P (A ∩ B). It follows from the statement of the problem that 1/2 = P (A) + P (B) − P (A ∩ B) ⇐⇒ 1/2 = 2P (A) − P (A ∩ B). Since A and B are assumed to be independent and P (A) = P (B), then P (A ∩ B) = P (A)P (B) = P (A)^2. Denoting P (A) by x, we obtain the quadratic equation 1/2 = 2x − x^2 ⇐⇒ 2x^2 − 4x + 1 = 0, for which the solutions are x = (2 ± √2)/2. Since the probability of any given event is between 0 and 1, we take P (A) = (2 − √2)/2.
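A quick numerical check of the chosen root (a sketch; variable names are ours):

```python
import math

# Check the root chosen in Solution 2.2.5: with x = P(A) = (2 - sqrt(2))/2,
# independence and P(A) = P(B) should give P(A u B) = 2x - x**2 = 1/2.
x = (2 - math.sqrt(2)) / 2
assert abs(2 * x**2 - 4 * x + 1) < 1e-12     # solves 2x^2 - 4x + 1 = 0
assert abs((2 * x - x**2) - 0.5) < 1e-12     # so P(A u B) = 1/2
assert 0 <= x <= 1                           # a legitimate probability
print(round(x, 4))  # 0.2929
```

The other root, (2 + √2)/2 ≈ 1.707, exceeds 1 and is rejected for exactly the reason given in the solution.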
Theorem 2.2.6 Suppose that A and B are two events. Then, the following statements are equivalent:

(i) A and B are independent;

(ii) Ac and B are independent;

(iii) A and Bc are independent;

(iv) Ac and Bc are independent.

Proof. It will suffice to prove that (i) =⇒ (ii) =⇒ (iii) =⇒ (iv) =⇒ (i).
(i) =⇒ (ii) : Assume that A and B are independent events. That is, in view of the
Theorem 2.2.4, we have P (A∩B) = P (A)P (B). Again in view of the Theorem 2.2.4,
we need to show that P (Ac ∩ B) = P (Ac )P (B). We apply parts (ii) − (iii) from
the Theorem 2.1.7 to write P (Ac ∩ B) = P (B) − P (A ∩ B) = P (B) − P (A)P (B) =
P (B)(1 − P (A)) = P (B)P (Ac ).
(ii) =⇒ (iii): Assume that Ac and B are independent events. That is, in view of the Theorem 2.2.4, we have P (Ac ∩ B) = P (Ac)P (B). Again in view of the Theorem 2.2.4, we need to show that P (A ∩ Bc) = P (A)P (Bc). Now, we combine De Morgan's Law from the Theorem 1.3.4 as well as the parts (ii)−(iv) from the Theorem 2.1.7 to write P (A ∩ Bc) = 1 − P {(A ∩ Bc)c} = 1 − P (Ac ∪ B) = 1 − {P (Ac) + P (B) − P (Ac ∩ B)} = 1 − P (Ac) − P (B) + P (Ac)P (B) = [1 − P (Ac)] [1 − P (B)] = P (A)P (Bc).
(iii) =⇒ (iv): Assume that the events A and Bc are independent. That is, in view of the Theorem 2.2.4, we have P (A ∩ Bc) = P (A)P (Bc). Again in view of the Theorem 2.2.4, we need to show that P (Ac ∩ Bc) = P (Ac)P (Bc). In view of the Theorem 2.1.7 (iii) and by the fact that the events A and Bc are independent, we have P (Ac ∩ Bc) = P (Bc ∩ Ac) = P (Bc) − P (Bc ∩ A) = P (Bc) − P (Bc)P (A) = (1 − P (A))P (Bc) = P (Ac)P (Bc).
(iv) =⇒ (i): Assume that Ac and Bc are independent events. That is, in view of the Theorem 2.2.4, we have P (Ac ∩ Bc) = P (Ac)P (Bc). Again in view of the Theorem 2.2.4, we need to show that P (A ∩ B) = P (A)P (B). Now, we combine De Morgan's Law from the Theorem 1.3.4 as well as the parts (ii)−(iv) from the Theorem 2.1.7 to write P (A ∩ B) = 1 − P {(A ∩ B)c} = 1 − P (Ac ∪ Bc) = 1 − {P (Ac) + P (Bc) − P (Ac ∩ Bc)} = 1 − P (Ac) − P (Bc) + P (Ac)P (Bc) = {1 − P (Ac)}{1 − P (Bc)} = P (A)P (B).
Note that a collection of events A1 , · · · , An may be pairwise independent, that is, any two
events are independent according to the Definition 2.2.2 (see Definition 2.2.8), but the
whole collection of sets may not be mutually independent.
Example 2.2.10 Consider the random experiment of tossing a fair coin twice. Let us
define the following events:
A1: The first toss results in a head.

A2: The second toss results in a head.

A3: The two tosses give the same result.
The sample space is given by S = {HH, HT, T H, T T } with each outcome being equally
likely. Now, rewrite
A1 = {HH, HT },
A2 = {HH, T H},
A3 = {HH, T T }
One can check that P (A1) = P (A2) = P (A3) = 1/2 and that P (A1 ∩ A2) = P (A1 ∩ A3) = P (A2 ∩ A3) = P {HH} = 1/4, so the events are pairwise independent. However, P (A1 ∩ A2 ∩ A3) = P {HH} = 1/4 ≠ 1/8 = P (A1)P (A2)P (A3), so A1, A2 and A3 are not mutually independent.
Theorem 2.2.11 If A, B and C are mutually independent events then A ∪ B and C are
independent events.
Proof. From the Definition 2.2.7, we need to show that P {(A ∪ B) ∩ C} = P (A ∪ B)P (C). Since the events A, B and C are mutually independent, by Definition 2.2.9 we have P (A ∩ B) = P (A)P (B), P (A ∩ C) = P (A)P (C), P (B ∩ C) = P (B)P (C) and P (A ∩ B ∩ C) = P (A)P (B)P (C). Note that P {(A ∪ B) ∩ C} = P {(A ∩ C) ∪ (B ∩ C)} = P (A ∩ C) + P (B ∩ C) − P (A ∩ B ∩ C) = P (A)P (C) + P (B)P (C) − P (A)P (B)P (C) = {P (A) + P (B) − P (A)P (B)} P (C) = P (A ∪ B)P (C).
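The distinction made in Example 2.2.10 between pairwise and mutual independence can be confirmed by enumeration (a sketch; the helper P is our notation):

```python
from fractions import Fraction
from itertools import product

# Example 2.2.10: two fair coin tosses, four equally likely outcomes.
S = [h + t for h, t in product("HT", repeat=2)]   # HH, HT, TH, TT

def P(event):
    return Fraction(len(set(event) & set(S)), len(S))

A1, A2, A3 = {"HH", "HT"}, {"HH", "TH"}, {"HH", "TT"}

# Pairwise independent: each pair factorizes ...
for X, Y in [(A1, A2), (A1, A3), (A2, A3)]:
    assert P(X & Y) == P(X) * P(Y)

# ... but the triple intersection does not, so they are not
# mutually independent.
assert P(A1 & A2 & A3) != P(A1) * P(A2) * P(A3)
print(P(A1 & A2 & A3), P(A1) * P(A2) * P(A3))  # 1/4 1/8
```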
Definition 2.2.12 The events in the sequence E1, E2, ... are said to be mutually exclusive if Ei ∩ Ej = ∅ for all i ≠ j.

In other words, the events are said to be mutually exclusive if they do not have any outcomes (elements) in common, i.e. they are pairwise disjoint.

The events E1, E2, · · · , En are said to be exhaustive if

E1 ∪ E2 ∪ · · · ∪ En = S
Exercise 2.2.2 Consider a sample space S and suppose that A, B are two events which
are mutually exclusive, that is, A ∩ B = ∅. If P (A) > 0, P (B) > 0, then can these two
events A, B be independent? Explain.
Solution 2.2.14 Since A ∩ B = ∅, then P (A ∩ B) = 0. In addition, since P (A) > 0 and P (B) > 0, then P (A)P (B) > 0. Recall (see Theorem 2.2.4) that A and B are independent if and only if P (A ∩ B) = P (A)P (B). In our case 0 = P (A ∩ B) ≠ P (A)P (B) > 0. Hence the two events cannot be independent.
Exercise 2.2.3 Suppose that A and B are two events such that P (A) = 0.7999 and
P (B) = 0.2002. Are A and B mutually exclusive events? Explain.
Solution 2.2.15 Let us first recall that A and B are mutually exclusive if A ∩ B = ∅, in which case P (A ∩ B) = 0. It is known that P (A ∪ B) = P (A) + P (B) − P (A ∩ B). If A and B were mutually exclusive, then P (A ∪ B) = 0.7999 + 0.2002 > 1, which is impossible since the probability of any event is at most 1. Hence the events A and B with the given probabilities cannot be mutually exclusive.
Exercise 2.2.4 For events A and B, you are given that P (A) = 2/3, P (B) = 2/5 and P (A ∪ B) = 3/4. Find P (Ac), P (Bc), P (A ∩ B), P (Ac ∪ Bc) and P (Ac ∩ Bc).

Solution 2.2.16 From the given information, one can see that P (Ac) = 1 − P (A) = 1 − 2/3 = 1/3 and P (Bc) = 1 − P (B) = 1 − 2/5 = 3/5. Note that P (A ∩ B) = P (A) + P (B) − P (A ∪ B) = 2/3 + 2/5 − 3/4 = 19/60. It is clear that P (Ac ∪ Bc) = P [(A ∩ B)c] = 1 − P (A ∩ B) = 1 − 19/60 = 41/60. Finally, P (Ac ∩ Bc) = P [(A ∪ B)c] = 1 − P (A ∪ B) = 1 − 3/4 = 1/4.
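Exact fraction arithmetic confirms these values (a sketch using Python's fractions module; variable names are ours):

```python
from fractions import Fraction

# Exact arithmetic for Exercise 2.2.4.
P_A, P_B, P_AuB = Fraction(2, 3), Fraction(2, 5), Fraction(3, 4)

P_AiB = P_A + P_B - P_AuB            # P(A n B) by the additive rule
P_Ac_u_Bc = 1 - P_AiB                # De Morgan: (A n B)^c = A^c u B^c
P_Ac_i_Bc = 1 - P_AuB                # De Morgan: (A u B)^c = A^c n B^c
print(P_AiB, P_Ac_u_Bc, P_Ac_i_Bc)  # 19/60 41/60 1/4
```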
Exercise 2.2.5 There are two telephone lines A and B. Let E1 be the event that line A is engaged and let E2 be the event that line B is engaged. After a statistical study, one finds that P (E1) = 0.5, P (E2) = 0.6 and P (E1 ∩ E2) = 0.3. Find the probability of the following events:

(i) F : Line A is free.

(ii) G: At least one line is engaged.

(iii) At most one line is free.
Solution 2.2.17 From the statement of the problem, we answer the requested probabilities.

(i) To say that "line A is free" is the same as saying that "line A is not engaged". Hence F = E1c. It follows that P (F ) = P (E1c) = 1 − P (E1) = 1 − 0.5 = 0.5.

(ii) To say that "at least one line is engaged" is equivalent to saying that "exactly one line is engaged" or "two lines are engaged". Hence G = E1 ∪ E2. It follows that P (G) = P (E1 ∪ E2) = P (E1) + P (E2) − P (E1 ∩ E2) = 0.5 + 0.6 − 0.3 = 0.8.

(iii) To say that "at most one line is free" is the same as saying that "no line is free" or "exactly one line is free". Note that "no line is free", "exactly one line is free" and "two lines are free" are the only possible cases; in other words, these three events are exhaustive. In addition, these events are disjoint. Hence P {No line is free} + P {Exactly one line is free} + P {Both lines are free} = 1. Equivalently, P {At most one line is free} = 1 − P {Both lines are free} = 1 − P (E1c ∩ E2c) = 1 − P {(E1 ∪ E2)c} = 1 − (1 − P (E1 ∪ E2)) = 1 − 1 + 0.8 = 0.8.
Exercise 2.2.6 A sample space consists of four simple events A, B, C and D, where A is twice as likely as B, B is four times as likely as C, and C is twice as likely as D. Find P (A), P (B), P (C) and P (D).

Solution 2.2.18 From the statement of the problem, we have P (A) = 2P (B), P (B) = 4P (C) and P (C) = 2P (D). Since A, B, C and D make up the whole sample space S, we must have P (S) = 1 ⇐⇒ P (A) + P (B) + P (C) + P (D) = 1. Note that P (A) = 2P (B) = 8P (C) = 16P (D), P (B) = 4P (C) = 8P (D) and P (C) = 2P (D). It follows that 16P (D) + 8P (D) + 2P (D) + P (D) = 1 ⇐⇒ 27P (D) = 1 ⇐⇒ P (D) = 1/27. Therefore P (C) = 2/27, P (B) = 8/27 and P (A) = 16/27.
Exercise 2.2.7 Suppose that P (A) = 0.68, P (B) = 0.55 and P (A ∩ B) = 0.35. Find

(a) The conditional probability that B occurs given that A occurs.

(b) The conditional probability that B does not occur given that A occurs.

(c) The conditional probability that B occurs given that A does not occur.
Solution 2.2.19 From the given information, we can answer the requested questions.

(a) The requested probability is P (B|A) = P (A ∩ B)/P (A) = 0.35/0.68 = 35/68 ≈ 0.515.

(b) The requested probability is given by P (Bc|A) = P (A ∩ Bc)/P (A) = (P (A) − P (A ∩ B))/P (A) = (0.68 − 0.35)/0.68 = 33/68 ≈ 0.485.

(c) Note that P (Ac) = 1 − P (A) = 1 − 0.68 = 0.32. Hence the requested probability is P (B|Ac) = P (B ∩ Ac)/P (Ac) = (P (B) − P (A ∩ B))/P (Ac) = (0.55 − 0.35)/0.32 = 0.20/0.32 = 0.625.
Exercise 2.2.8 Suppose that P (A) = 0.4, P (B) = 0.5 and the probability that either A or B occurs is 0.7. Find

(a) The conditional probability that B occurs given that A occurs.

(b) The conditional probability that B occurs given that A does not occur.
Solution 2.2.20 (a) We are asked to calculate P (B|A) = P (A ∩ B)/P (A). From the statement of the problem, it is clear that P (A ∪ B) = 0.7, and hence P (A ∩ B) = P (A) + P (B) − P (A ∪ B) = 0.4 + 0.5 − 0.7 = 0.2, so that P (B|A) = 0.2/0.4 = 1/2.

(b) Since P (A) = 0.4, then P (Ac) = 1 − P (A) = 0.6. We are asked to find P (B|Ac). Note that

P (B|Ac) = P (B ∩ Ac)/P (Ac) = (P (B) − P (A ∩ B))/P (Ac) = (0.5 − 0.2)/0.6 = 1/2.
Let A and B be events. We first recall that the conditional probability of event B occurring, given that event A has already occurred, is given by

P (B|A) = P (A ∩ B)/P (A), given that P (A) > 0        (2.3.1)

The role of A and B can be interchanged to define P (A|B). The conditional probability of A given that B has already occurred is given by

P (A|B) = P (A ∩ B)/P (B), given that P (B) > 0        (2.3.2)
Suppose that A and B are two arbitrary events. Here we summarize some of the standard
rules involving probabilities.
Additive Rule: P (A ∪ B) = P (A) + P (B) − P (A ∩ B).

Conditional Probability Rule: P (A|B) = P (A ∩ B)/P (B), provided that P (B) > 0.

Multiplicative Rule: P (A ∩ B) = P (A|B)P (B) = P (B|A)P (A).
The additive rule was proved in the Theorem 2.1.7, part (iv). The Conditional Probability
Rule is a restatement of the Definition 2.2.1. The Multiplicative Rule follows easily from
two Equations 2.3.1 and 2.3.2 (but does not require P (A) > 0 and P (B) > 0).
Sometimes the experimental setup itself may directly indicate the values of the various
conditional probabilities. In such situations, one can obtain the probabilities of joint
events such as A ∩ B by cross multiplying the two sides in the Theorem 2.2.4.
From the simple case of the Multiplicative Rule P (A ∩ B) = P (A|B)P (B), one can
immediately get the Multiplicative Rule in general cases:
P (A1 ∩ A2 ∩ · · · ∩ An) = P (A1)P (A2|A1)P (A3|A1 ∩ A2) · · · P (An|A1 ∩ A2 ∩ · · · ∩ An−1)
From the simple case of the Additive Rule, one can immediately get the Additive Rule
in general cases. For example we may calculate the probability that any one of the three
events E or F or G occurs. This is done as follows: From the associativity of the operation
of union we have the equality E ∪ F ∪ G = (E ∪ F ) ∪ G which by the Additive Rule gives
P (E ∪ F ∪ G) = P ((E ∪ F ) ∪ G) = P (E ∪ F ) + P (G) − P ((E ∪ F ) ∩ G).
P (E1 ∪ E2 ∪ · · · ∪ En ) = Σ_i P (Ei ) − Σ_{i<j} P (Ei ∩ Ej ) + Σ_{i<j<k} P (Ei ∩ Ej ∩ Ek ) − Σ_{i<j<k<l} P (Ei ∩ Ej ∩ Ek ∩ El ) + · · · + (−1)^{n+1} P (E1 ∩ E2 ∩ · · · ∩ En )
In words, the previous equation states that the probability of the union of n events equals
the sum of the probabilities of these events taken one at a time minus the sum of the
probabilities of these events taken two at a time plus the sum of the probabilities of these
events taken three at a time, and so on.
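The inclusion-exclusion formula can be checked numerically on a small equally likely sample space. The following Python sketch uses three hypothetical events E, F, G on the faces of a single die; the particular sets are chosen only for illustration:

```python
from fractions import Fraction
from itertools import combinations

S = {1, 2, 3, 4, 5, 6}                       # equally likely sample space (one die)
E, F, G = {1, 2, 3}, {2, 3, 4}, {3, 4, 5}    # hypothetical events for this check

def prob(A):
    # probability of an event A under the equally likely model
    return Fraction(len(A), len(S))

direct = prob(E | F | G)                     # probability of the union, directly

events = [E, F, G]
incl_excl = (sum(prob(A) for A in events)
             - sum(prob(A & B) for A, B in combinations(events, 2))
             + prob(E & F & G))

print(direct, incl_excl)  # 5/6 5/6
```

Exact rational arithmetic via `Fraction` avoids any floating-point rounding in the comparison.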
Theorem 2.3.1 (Law of Total Probability) Assume that A1 , A2 , · · · , Am are events such that ∪_{i=1}^m Ai = S and Ai ∩ Aj = ∅ for i ≠ j, with P (Ai ) > 0 for all i, i.e., A1 , A2 , · · · , Am form a partition of S into pairwise disjoint events. Then the probability of an event B can be calculated as
P (B) = Σ_{i=1}^m P (B|Ai )P (Ai )
Proof. The Law of Total Probability follows from the following observation: Since A1 , A2 , · · · , Am are disjoint events, then B ∩ A1 , B ∩ A2 , · · · , B ∩ Am are also disjoint events. Also, since ∪_{i=1}^m Ai = S and B ⊆ S, then
B = B ∩ S = B ∩ (∪_{i=1}^m Ai ) = ∪_{i=1}^m (B ∩ Ai )
Hence
P (B) = P (∪_{i=1}^m (B ∩ Ai )) = Σ_{i=1}^m P (B ∩ Ai ) = Σ_{i=1}^m P (B|Ai )P (Ai )
Note that in the last two equalities we have used the ”Additive Rule” (see also Theorem 2.1.7 (vi)) and the Conditional Probability Rule.
Example 2.3.2 Suppose that a company publishes two magazines, M1 and M2 . Based on their record of subscriptions in a suburb (a residential area), they find that sixty percent of the households subscribe to M1 , forty-five percent subscribe to M2 , while twenty percent subscribe to both M1 and M2 . If a household is picked at random from this suburb, the magazine publishers would like to address the following questions: What is the probability that the randomly selected household is a subscriber for (i) at least one of the magazines, and (ii) none of the magazines?
Let A be the event that the household subscribes to M1 and B the event that it subscribes to M2 . We have been told that P (A) = 0.60, P (B) = 0.45 and P (A ∩ B) = 0.20. Then, P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = 0.85, that is, there is an 85% chance that the randomly selected household is a subscriber for at least one of the magazines M1 , M2 . This answers question (i).
In order to answer question (ii), we need to evaluate P (Ac ∩ B c ) = P ((A ∪ B)c ), which can simply be written as 1 − P (A ∪ B) = 0.15; that is, there is a 15% chance that the randomly selected household subscribes to none of the magazines M1 , M2 .
Example 2.3.3 Suppose that we have an urn at our disposal which contains eight green
and twelve blue marbles, all of equal size and weight. The following random experiment
is now performed: The marbles inside the urn are mixed and then one marble is picked
from the urn at random, but we do not observe its color. This first drawn marble is
not returned to the urn. The remaining marbles inside the urn are again mixed and one
marble is picked at random. This kind of selection process is often referred to as sampling
without replacement. Now, what is the probability that
(i) both the first and second drawn marbles are green?
(ii) the second drawn marble is green?
Let A denote the event that the first drawn marble is green and B the event that the second drawn marble is green. Obviously, P (A) = 8/20 = 0.4, P (B|A) = 7/19 and P (B|Ac ) = 8/19. Observe that the experimental setup itself dictates the value of P (B|A). A result such as the Conditional Product Rule is not very helpful in the present situation. In order to answer question (i), we proceed by using the Multiplicative Rule to evaluate P (A ∩ B) = P (A)P (B|A) = 8/20 × 7/19 = 14/95.
Obviously, A, Ac form a partition of the sample space. Now, in order to answer question (ii), using the Theorem 2.1.7, part (vi), we write
P (B) = P (A ∩ B) + P (Ac ∩ B) (2.3.4)
But, as before, we have P (Ac ∩ B) = P (Ac )P (B|Ac ) = 12/20 × 8/19 = 24/95. Thus, from the Equality 2.3.4, we have P (B) = 14/95 + 24/95 = 38/95 = 2/5 = 0.4. Here, note that P (B|A) ≠ P (B) and so, by the Definition 2.2.2, the two events A, B are dependent. One may guess this fact easily from the layout of the experiment itself.
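The two answers in this example can be verified with exact rational arithmetic; the following Python sketch reproduces the urn computation (8 green, 12 blue, two draws without replacement):

```python
from fractions import Fraction

green, blue = 8, 12
total = green + blue

P_A = Fraction(green, total)                  # first marble green
P_B_given_A = Fraction(green - 1, total - 1)  # second green, given first green
P_B_given_Ac = Fraction(green, total - 1)     # second green, given first blue

P_A_and_B = P_A * P_B_given_A                        # Multiplicative Rule
P_B = P_A * P_B_given_A + (1 - P_A) * P_B_given_Ac   # Law of Total Probability

print(P_A_and_B, P_B)  # 14/95 2/5
```

Note that P_B equals P_A, reflecting the symmetry of sampling without replacement: unconditionally, every draw is equally likely to be green.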
Exercise 2.3.1 A list of important customers contains 25 names. Among them, 20 persons have their accounts in good standing while 5 are delinquent (irresponsible). Two persons will be selected at random from this list and the status of their accounts will be checked. Calculate the probability that
(a) Both accounts are delinquent.
(b) One account is delinquent and the other one is in good standing.
Let D1 and D2 denote the events that the first and the second selected accounts, respectively, are delinquent, and let G1 and G2 denote the corresponding events that the accounts are in good standing.
(i) Here the problem is to calculate P (D1 ∩ D2 ). Using the Multiplicative Law, we write
P (D1 ∩ D2 ) = P (D2 |D1 )P (D1 )
In order to calculate P (D1 ), we need only to consider selecting one account at random from 20 good and 5 delinquent. Clearly, P (D1 ) = 5/25. The next step is to calculate P (D2 |D1 ). Given that D1 has occurred, there will remain 20 good and 4 delinquent accounts at the time the second selection is made. Therefore, the conditional probability of D2 given D1 is given by P (D2 |D1 ) = 4/24. Multiplying these two probabilities, we get
P (Both delinquent) = P (D1 ∩ D2 ) = 5/25 × 4/24 = 1/30 (2.3.5)
(ii) Represent by D1 ∩ G2 the event that the first account checked is delinquent and the second account is in good standing, and by G1 ∩ D2 the event that the first account checked is in good standing and the second account is delinquent. Hence the probability of the event ”one account is delinquent and the other one is in good standing” will be given by
P ((D1 ∩ G2 ) ∪ (G1 ∩ D2 )) = P (D1 ∩ G2 ) + P (G1 ∩ D2 )
since it is impossible to select one delinquent account and one account in good standing at the same time in the first trial, i.e., D1 ∩ G1 = ∅ (hence the events D1 and G1 are mutually exclusive). Similarly, the events D2 and G2 are mutually exclusive. Since
P (G1 ∩ D2 ) = P (G1 )P (D2 |G1 ) = 20/25 × 5/24 = 1/6
P (D1 ∩ G2 ) = P (D1 )P (G2 |D1 ) = 5/25 × 20/24 = 1/6
the required probability is 1/6 + 1/6 = 1/3.
Solution 2.3.5 Clearly P (Ac ) = 1 − P (A) = 1 − 0.4 = 0.6. By the Multiplicative Rule P (A ∩ B) = P (A|B)P (B), it follows that P (A ∩ B) = 0.7 × 0.25 = 7/40. Using the Additive Rule we get immediately that
P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = 0.4 + 0.25 − 7/40 = 19/40
Exercise 2.3.3 Suppose that P (A) = 0.4, P (B) = 0.4 and P (A ∪ B) = 0.7. Are the events A and B independent?
Solution 2.3.6 Let A and B be the given events with P (A) = 0.4, P (B) = 0.4 and
P (A ∪ B) = 0.7.
(a) If A and B are independent then P (A ∩ B) = P (A)P (B). But by the Additive Rule
we have P (A ∩ B) = P (A) + P (B) − P (A ∪ B) = 0.4 + 0.4 − 0.7 = 0.1. On the other
hand P (A)P (B) = 0.4 × 0.4 = 0.16. Hence A and B are not independent.
(b) Determine P (A ∪ B) if A and B are mutually exclusive.
Solution 2.3.7 (a) If A and B are assumed to be independent, it follows from the Additive Rule that P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = P (A) + P (B) − P (A)P (B) = 0.6 + 0.22 − 0.6 × 0.22 = 0.688.
(b) If A and B are assumed to be mutually exclusive, then A ∩ B = ∅ and hence P (A ∩ B) = 0. It follows from the Additive Rule that P (A ∪ B) = P (A) + P (B) = 0.6 + 0.22 = 0.82.
(c) If A and B are assumed to be mutually exclusive, then A ∩ B = ∅ and hence P (A ∩ B) = 0. It follows that
P (A|B c ) = P (A ∩ B c )/P (B c ) = (P (A) − P (A ∩ B))/(1 − P (B)) = P (A)/(1 − P (B)) = 0.6/(1 − 0.22) = 60/78 ≈ 0.769
Exercise 2.3.5 A box contains 25 parts, of which 3 are defective and 22 are non-defective. If 2 parts are randomly selected without replacement, find the probabilities of the following events:
(i) Both parts are defective.
(ii) Neither part is defective.
Let A be the event that the first selected part is defective and B the event that the second selected part is defective. From the statement of the problem, one can see that P (A) = 3/25, P (B|A) = 2/24, P (Ac ) = 22/25, P (B c |A) = 22/24 and P (B c |Ac ) = 21/24.
(i) We need to calculate the probability of the event ”both parts are defective”, i.e. A ∩ B. By the Multiplicative Rule we have P (A ∩ B) = P (B|A)P (A) = 2/24 × 3/25 = 6/600 = 0.01.
(ii) It is requested to calculate the probability of the event ”neither one is defective”, which means that both parts are non-defective, i.e. Ac ∩ B c . Note that P (Ac ∩ B c ) = P (B c |Ac )P (Ac ) = 21/24 × 22/25 = 0.770.
Exercise 2.3.6 In a shipment of 15 room air conditioners, there are three with defective thermostats. Two air conditioners will be selected at random and inspected one after another. Let A be the event that the first selected unit is defective and B the event that the second selected unit is defective. Find the probabilities of the events considered below.
From the statement of the problem, one can see that P (A) = 3/15, P (B|A) = 2/14, P (B|Ac ) = 3/14, P (B c |A) = 12/14 and P (B c |Ac ) = 11/14.
(a) It is clear that P (A) = 3/15 = 1/5.
(b) Given that the first unit is defective, the probability that the second is good is P (B c |A) = 12/14 = 6/7.
(c) Here we are asked to calculate P (A ∩ B), which is given by P (A ∩ B) = P (B|A)P (A) = 2/14 × 3/15 = 6/210 = 1/35.
(e) We now calculate P {Exactly one is defective}. Note that P {Neither is defective} + P {Exactly one is defective} + P {Both are defective} = 1. But P {Neither is defective} = P (Ac ∩ B c ) = P (B c |Ac )P (Ac ) = 11/14 × 12/15 = 132/210. Hence P {Exactly one is defective} = 1 − P (Ac ∩ B c ) − P (A ∩ B) = 1 − 132/210 − 6/210 = 72/210 = 12/35.
Exercise 2.3.7 Of three events A, B and C, suppose events A and B are independent
and events B and C are mutually exclusive. Their probabilities are P (A) = 0.7, P (B) =
0.2 and P (C) = 0.3. Express the following events in set notation and calculate their
probabilities.
(a) Both B and C occur.
Solution 2.3.10 (a) The event ”Both B and C occur” can be expressed as B ∩ C. Since
B and C are assumed to be mutually exclusive then B∩C = ∅ and hence P (B∩C) = 0.
(b) The event ”At least one of A and B occurs” can be expressed as A ∪ B. Since A and B are assumed to be independent, then P (A ∩ B) = P (A)P (B) = 0.7 × 0.2 = 0.14 and hence P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = 0.7 + 0.2 − 0.14 = 0.76.
(c) The event ”B does not occur” can be expressed as B c . Clearly P (B c ) = 1 − P (B) = 1 − 0.2 = 0.8.
(d) The event ”All three events occur” can be expressed as A ∩ B ∩ C. Since B ∩ C = ∅ it follows that A ∩ B ∩ C = ∅ and hence it is impossible for all three events to occur at the same time. Therefore P (A ∩ B ∩ C) = 0.
Example 2.4.1 Suppose in another room, an experimenter has two urns at his disposal,
urn number 1 and urn number 2. The urn number 1 has eight green and twelve blue
marbles whereas the urn number 2 has ten green and eight blue marbles, all of same size
and weight. The experimenter selects one of the urns at random with equal probability and
from the selected urn picks a marble at random. It is announced that the selected marble
in the other room turned out blue. What is the probability that the blue marble was chosen
from the urn number 2? We will answer this question shortly.
The following theorem will be helpful in answering questions such as the one raised in the
Example 2.4.1.
Theorem 2.4.2 (Bayes’s Theorem) Suppose that the events {A1 , · · · , Ak } form a par-
tition of the sample space S and B is another event. Then,
P (Aj |B) = P (Aj )P (B|Aj ) / Σ_{i=1}^k P (Ai )P (B|Ai ), for j = 1, 2, · · · , k (2.4.1)
Proof. Since {A1 , · · · , Ak } form a partition of S, in view of the Theorem 2.1.7, part (vi)
(see also Theorem 2.3.1 about Total Probabilities) we can immediately write
P (B) = Σ_{i=1}^k P (B ∩ Ai ) = Σ_{i=1}^k P (Ai )P (B|Ai ) (2.4.2)
by using the Multiplicative Rule. Next, using the Multiplicative Rule once more, let us write P (Aj ∩ B) = P (Aj )P (B|Aj ). Dividing this by P (B) as computed in the Equation 2.4.2 gives P (Aj |B) = P (Aj ∩ B)/P (B), which is the claimed result.
This marvelous result and the ideas behind it originated from the work of Rev. Thomas Bayes (published posthumously in 1763). In the statement of the Theorem 2.4.2, note that the conditioning events on the right hand side are A1 , · · · , Ak , but on the left hand side one has the conditioning event B instead. The quantities P (Ai ), i = 1, · · · , k are often referred to as the a priori or prior probabilities, whereas P (Aj |B) is referred to as the posterior probability.
Thus, the chance that the randomly drawn blue marble came from the urn number 2 was 20/47, which is equivalent to saying that the chance of the blue marble coming from the urn number 1 was 27/47.
The Bayes Theorem helps in finding the conditional probabilities when the original con-
ditioning events A1 , · · · , Ak and the event B reverse their roles.
Example 2.4.4 This example has more practical flavor. Suppose that 40% of the in-
dividuals in a population have some disease. The diagnosis of the presence or absence
of this disease in an individual is reached by performing a type of blood test. But, like
many other clinical tests, this particular test is not perfect. The manufacturer of the
blood-test-kit made the accompanying information available to the clinics. If an individual
has the disease, the test indicates the absence (false negative) of the disease 10% of the
time whereas if an individual does not have the disease, the test indicates the presence
(false positive) of the disease 20% of the time. Now, from this population an individual
is selected at random and his blood is tested. The health professional is informed that
the test indicated the presence of the particular disease. What is the probability that this
individual does indeed have the disease?
Thus there is a 75% chance that the tested individual has the disease if we know that the blood test had indicated so.
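The posterior probability of 75% can be reproduced with a short computation; the following Python sketch applies the Law of Total Probability and Bayes' Theorem to the stated rates:

```python
from fractions import Fraction

P_D = Fraction(40, 100)             # prior probability of having the disease
P_pos_given_D = Fraction(90, 100)   # test positive given disease (10% false negatives)
P_pos_given_Dc = Fraction(20, 100)  # test positive given no disease (false positives)

# Law of Total Probability: overall chance of a positive test
P_pos = P_pos_given_D * P_D + P_pos_given_Dc * (1 - P_D)

# Bayes' Theorem: probability of disease given a positive test
P_D_given_pos = P_pos_given_D * P_D / P_pos
print(P_D_given_pos)  # 3/4
```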
Exercise 2.4.1 There are three urns, #A, #B, and #C. Urn #A has a red marbles
and b green marbles, urn #B has c red marbles and d green marbles, and urn #C has
a red marbles and c green marbles. Let A be the event of choosing urn #A, B of choos-
ing urn #B and, C of choosing urn #C. Let R be the event of choosing a red marble
and G be the event of choosing a green marble. An urn is chosen at random, and af-
ter that, from this urn, a marble is chosen at random. Find the following probabilities:
P (G), P (G|C), P (C|G), P (R), P (R|A) and P (A|R).
P (G) = P (G|A)P (A) + P (G|B)P (B) + P (G|C)P (C) = b/(a + b) × 1/3 + d/(c + d) × 1/3 + c/(a + c) × 1/3
• P (G|C) = c/(a + c)
P (R) = P (R|A)P (A) + P (R|B)P (B) + P (R|C)P (C) = a/(a + b) × 1/3 + c/(c + d) × 1/3 + a/(a + c) × 1/3
• P (R|A) = a/(a + b)
In many situations, the sample space S consists of only a finite number of equally likely
outcomes and the associated Borel sigma-field B is the collection of all possible subsets of
S, including the empty set ∅ and the whole set S. Then, in order to find the probability
of an event A, it will be important to enumerate all the possible outcomes included in
the sample space S and the event A. This section reviews briefly some of the customary
counting rules followed by a few examples. We first recall the following:
Definition 2.5.1 A group of elements is said to be ordered if the order in which these
elements are drawn is of relevance. Otherwise, it is called unordered.
Example 2.5.2 Ordered sample: The first three places in an Olympic 100m race are
determined by the order in which the athletes arrive at the finishing line. If 8 athletes
are competing with each other, the number of possible results for the first three places
is of interest.
Unordered sample: The selected members for a national football team. The order in
which the selected names are announced is irrelevant.
2.5.1 The Fundamental Rule of Counting
Suppose that there are k different tasks where the ith task can be completed in ni ways, i = 1, ..., k. If these tasks can be completed independently, then the total number N of ways these k tasks can be completed is given by
N = n1 × n2 × · · · × nk = Π_{i=1}^k ni
Example 2.5.3 How many sample points are there in the sample space when a pair of dice is thrown? In this case, it can be seen that the first die can land in any of n1 = 6 ways. For each of these 6 ways, the second die can also land in n2 = 6 ways. Therefore, the pair of dice can land in N = n1 × n2 = 6 × 6 = 36 ways.
Remark 2.5.1 It is to be noted that, in the case of events A and B that are mutually exclusive, if an event A can occur in a total of n1 ways and a different event B can occur in n2 ways, then the event A or B (i.e. A ∪ B) can occur in n1 + n2 ways.
Example 2.5.4 An urn contains 2 white and 6 black balls. Count the number of cases in
which a ball that is drawn at random is either white or black. Clearly, a white ball can be
drawn in 2 cases and a black ball can be drawn in 6 cases. Thus the required number of
cases will be 2 + 6 = 8.
2.5.2 Arrangements
Suppose that we are given n distinct objects and wish to arrange m of these objects in
a line, where n ≥ m ≥ 1. Since there are n ways of choosing the first object, and after
this is done, n − 1 ways of choosing the second object,· · · , and finally n − m + 1 ways of
choosing the mth object, it follows by the Fundamental Rule of Counting that the number
of different arrangements, denoted by the symbol n Pm , is given by
n Pm = n(n − 1)(n − 2) · · · (n − m + 1) = Π_{i=1}^m (n − i + 1) (2.5.1)
where it is noted that the product has m factors. We call n Pm the number of arrangements of n objects taken m at a time.
Example 2.5.5 The sequence of letters ”WASP” is an arrangement of four letters from among the 26 letters of the English Alphabet, and the sequence ”SWAP” is an arrangement of the same four letters but in a different order. Thus ”SWAP” and ”WASP” are different arrangements of 4 letters from among 26 letters of the English Alphabet.
Example 2.5.6 It is required to seat 5 men and 4 women in a row so that the women
occupy the even places. How many such arrangements are possible? The men may be
seated in 5 P5 ways, and the women 4 P4 ways. Each arrangement of the men may be
associated with each arrangement of the women. Hence, the number of arrangements will
be given by 5 P5 ×4 P4 = 120 × 24 = 2880.
The number n Pm of arrangements with repetition of m objects selected from among n > 0
objects is given by nm .
Example 2.5.7 The sequence of letters ”SWAPS” is an arrangement of five letters with
one repetition from among the 26 letters of the English alphabet. The sequence of letters
”WASPS” is also an arrangement of the same five letters with the same repetition from
among the 26 letters of the English alphabet, but in a different order. Thus ”SWAPS”
and ”WASPS” are different arrangements with repetitions.
2.5.3 Permutations
From the Equation 2.5.1 and the Definition 2.5.8 it follows that the quantity n Pn is the
number of permutations of n objects taken n at a time.
We distinguish between two cases: If all the elements are distinguishable, then we speak
of permutation without replacement. However, if some or all of the elements are not
distinguishable, then we speak of permutation with replacement.
• Permutations without Replacement
If all the n elements are distinguishable, then there are n Pn different compositions of these elements. This is the number of ways we can arrange all n objects, and it is denoted by the special symbol
n! = n × (n − 1) × · · · × 2 × 1
Example 2.5.9 There were three candidate cities for hosting the 2020 Olympic Games:
Tokyo (T), Istanbul (I), and Madrid (M). Before the election, there were 3! = 6 possible
outcomes, regarding the final rankings of the cities:
• Permutations with Replacement
Assume that not all n elements are distinguishable. The elements are divided into groups, and these groups are distinguishable. Suppose there are s groups of sizes n1 , n2 , ..., ns . The total number of different ways to arrange the n elements in s groups is:
n! / (n1 ! n2 ! n3 ! · · · ns !) (2.5.3)
Example 2.5.10 Consider all the permutations of the letters in the word BOB. Since
there are three letters, there should be 3! = 6 different permutations. Those permutations
are BOB, BBO, OBB, OBB, BBO and BOB. Now, while there are six permutations,
some of them are indistinguishable from each other. If you look at the permutations that
are distinguishable, you only have three BOB, OBB and BBO.
In fact, since there are two groups with sizes n1 = 2 (the group with two letters B) and n2 = 1 (the group with one letter O), it follows from the Identity 2.5.3 that, to find the number of distinguishable permutations, we take the factorial of the total number of letters and divide by the product of the factorials of the letter frequencies. That is
3!/(2! 1!) = (3 × 2 × 1)/(2 × 1 × 1) = 3
Exercise 2.5.1 Find the number of distinguishable permutations of the letters in the word
M ISSISSIP P I.
Solution 2.5.11 In this situation we have permutation with replacement with n1 = 1, n2 = 4, n3 = 4 and n4 = 2, where n1 , n2 , n3 and n4 are the frequencies of the letters M, I, S and P respectively. Hence the number of distinguishable permutations will then be given by
11!/(1! 4! 4! 2!) = 34650
Let us note that the total number of different permutations in the word M ISSISSIP P I
is 11! = 39916800.
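The count 34650 can be verified computationally; the following Python sketch computes the multinomial count from the letter frequencies and cross-checks the shorter BOB example by brute-force enumeration:

```python
from math import factorial
from itertools import permutations

word = "MISSISSIPPI"
counts = {ch: word.count(ch) for ch in set(word)}  # M:1, I:4, S:4, P:2

denom = 1
for c in counts.values():
    denom *= factorial(c)                          # product of frequency factorials

distinguishable = factorial(len(word)) // denom
print(distinguishable)  # 34650

# cross-check the BOB example by enumerating distinct permutations directly
assert len(set(permutations("BOB"))) == 3
```

Brute-force enumeration is only feasible for short words; MISSISSIPPI itself has 11! = 39916800 raw permutations, which is why the formula is preferred.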
2.5.4 Combinations
The Binomial coefficient for any integers m and n with n ≥ m ≥ 0 is denoted by C(n, m) and defined as
C(n, m) = n!/(m!(n − m)!) (2.5.4)
and it is read as ”n choose m”. There are several calculation rules for the Binomial coefficient:
C(n, 0) = 1, C(n, 1) = n, C(n, m) = C(n, n − m), (2.5.5)
C(n, m) = Π_{i=1}^m (n + 1 − i)/i, C(n + 1, m) = C(n, m) + C(n, m − 1)
The following result uses these combinatorial expressions and it goes by the name
Theorem 2.5.12 (Binomial Theorem) For any two real numbers a, b and a positive
integer n, one has
(a + b)^n = Σ_{k=0}^n C(n, k) a^k b^{n−k} (2.5.7)
For example, if a = b = 1, from the above equation one can see that
2^n = Σ_{k=0}^n C(n, k)
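Both the Binomial Theorem and the special case a = b = 1 can be checked for sample values; a small Python sketch (the values of a, b and n below are arbitrary illustrative choices):

```python
from math import comb

a, b, n = 3, 5, 7   # arbitrary sample values for the check
lhs = (a + b) ** n
rhs = sum(comb(n, k) * a**k * b**(n - k) for k in range(n + 1))
print(lhs == rhs)  # True

# special case a = b = 1: the binomial coefficients sum to 2^n
assert sum(comb(n, k) for k in range(n + 1)) == 2**n
```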
We now answer the question of how many different possibilities exist to draw m out of n
elements, i.e. m out of n balls from an urn. In other words, how many different ways
exist to select m out of n elements?
(i) Combinations without replacement and without consideration of the order of the
elements.
(ii) Combinations without replacement and with consideration of the order of the ele-
ments.
(iii) Combinations with replacement and without consideration of the order of the ele-
ments.
(iv) Combinations with replacement and with consideration of the order of the elements.
Let us note that combinations with and without replacement are also often called combi-
nations with and without repetition.
When there is no replacement and the order of the elements is also not relevant, then the
total number of distinguishable combinations in drawing m out of n elements is
C(n, m) (2.5.8)
Example 2.5.13 Suppose a company elects a new board of directors. The board consists
of 5 members and 15 people are eligible to be elected. How many combinations for the
board of directors exist?
Since a person cannot be elected twice, we have a situation where there is no replacement.
The order is also of no importance: either one is elected or not. We can thus apply the
Rule 2.5.8 which yields
C(15, 5) = 15!/(5! 10!) = 3003
possible combinations.
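This count is easy to confirm with the standard-library binomial coefficient; a short Python check:

```python
from math import comb, factorial

# 5 board members chosen from 15 eligible people; order irrelevant, no replacement
n_boards = comb(15, 5)
print(n_boards)  # 3003

# agrees with the factorial formula n!/(m!(n - m)!)
assert n_boards == factorial(15) // (factorial(5) * factorial(10))
```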
• Combinations without Replacement and with Consideration of the Order
The total number of different combinations for the setting without replacement and with
consideration of the order is
n!/(n − m)! = m! × C(n, m) (2.5.9)
Example 2.5.14 Consider a horse race with 12 horses. A possible bet is to forecast the
winner of the race, the second horse of the race, and the third horse of the race. The total
number of different combinations for the horses in the first three places is
12!/(12 − 3)! = 12 × 11 × 10 = 1320
This result can be explained intuitively: for the first place, there is a choice of 12 different
horses. For the second place, there is a choice of 11 different horses (12 horses minus the
winner). For the third place, there is a choice of 10 different horses (12 horses minus the
first and second horses). The total number of combinations is the product 12 × 11 × 10.
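The count 1320 can be confirmed directly; a short Python check (`math.perm(n, m)` computes exactly n!/(n − m)!):

```python
from math import perm

# ordered selection of 3 horses out of 12, without replacement
n_podiums = perm(12, 3)   # 12!/(12 - 3)!
print(n_podiums)  # 1320
assert n_podiums == 12 * 11 * 10
```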
• Combinations with Replacement and without Consideration of the Order
The total number of different combinations with replacement and without consideration of the order is
C(n + m − 1, m)
If 4 different organic products are denoted as A, B, C and D then the following combina-
tions are possible:
Please note that, for example, (A, B) is identical to (B, A) because the order in which
the products A and B are cultivated on the first or second field is not important in this
example.
• Combinations with Replacement and with Consideration of the Order
The total number of different combinations of m out of n elements, with replacement and when the order is of relevance, is n^m .
Example 2.5.16 Consider a credit card with a four-digit personal identification number (PIN) code. The total number of possible combinations for the PIN is n^m = 10^4 = 10000. This follows from the fact that every digit in the first, second, third, and fourth places (m = 4) can be chosen out of the ten digits from 0 to 9 (n = 10).
The proofs of the above formulas are beyond the scope of this course, but they can be found in many Mathematics textbooks, for instance Yves Nievergelt, Foundations of Logic and Mathematics: Applications to Computer Science and Cryptography, Springer Science+Business Media, LLC, 2002.
For large n, the factorial n! can be approximated by Stirling's formula
n! ≈ √(2πn) n^n e^{−n} (2.5.11)
where e = 2.71828 · · · is the base of natural logarithms. The symbol ≈ means that the ratio of the left side to the right side approaches 1 as n −→ ∞. That is
lim_{n→∞} n!/(√(2πn) n^n e^{−n}) = 1 (2.5.12)
Example 2.5.17 Evaluate 70! By Stirling's formula, 70! ≈ √(140π) × 70^70 × e^{−70} = N, say. Then log N = (1/2) log 140 + (1/2) log π + 70 log 70 − 70 log(2.71828) = 1.07306 + 0.24857 + 129.15686 − 30.40061 = 100.07788 = 100 + 0.07788. Hence N ≈ 1.20 × 10^100 .
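The quality of Stirling's approximation at n = 70 can be checked numerically; a short Python sketch comparing the approximation with the exact factorial:

```python
from math import factorial, sqrt, pi, e

n = 70
exact = factorial(n)                         # exact value of 70!
approx = sqrt(2 * pi * n) * n**n * e**(-n)   # Stirling's approximation

ratio = approx / exact
print(f"{approx:.4e}")   # about 1.20 x 10^100
print(round(ratio, 4))   # close to 1, slightly below
```

The ratio is just under 1, consistent with the next correction term 1/(12n) in the full Stirling series.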
In this section, we will look at some examples of probability calculations by using some
of the counting techniques developed in the previous section.
Example 2.6.1 Suppose that a fair coin is tossed five times. Now the outcomes would look like HHT HT , T HT T H and so on. By the fundamental rule of counting we realize that the sample space S will consist of 2^5 (= 2 × 2 × 2 × 2 × 2), that is thirty-two, outcomes, each of which has five components, and these are equally likely. How many of these five-dimensional outcomes would include two heads?
Imagine five distinct positions in a row, where each position will be filled by the letter H or T . Out of these five positions, choose two positions and fill them both with the letter H while the remaining three positions are filled with the letter T . This can be done in C(5, 2) = 10 ways.
Example 2.6.3 (Example 2.6.2 Continued) Suppose that there are six men and four
women in the small class. Then what is the probability that a randomly selected committee
of four students would consist of two men and two women?
Two men and two women can be chosen in C(6, 2) × C(4, 2) ways, that is in 90 ways. In other words,
P (Two men and two women are selected) = (C(6, 2) × C(4, 2))/C(10, 4) = 90/210 = 3/7.
Example 2.6.4 John, Sue, Rob, Dan and Molly have gone to see a movie. Inside the
theatre, they picked a row where there were exactly five empty seats next to each other.
These five friends can sit in those chairs in exactly 5! ways, that is in 120 ways. Here,
seating arrangement is naturally pertinent. The sample space S consists of 120 equally
likely outcomes. But the usher does not know in advance that John and Molly must sit
next to each other. If the usher lets the five friends take those seats randomly, what is the
probability that John and Molly would sit next to each other?
John and Molly may occupy the first two seats and other three friends may permute in 3!
ways in the remaining chairs. But then John and Molly may occupy the second and third
or the third and fourth or the fourth and fifth seats while the other three friends would
take the remaining three seats in any order. One can also permute John and Molly in 2!
ways. That is, we have
P (John and Molly sit next to each other) = (2!)(4)(3!)/5! = 48/120 = 2/5
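The probability 2/5 can also be obtained by brute-force enumeration of all 120 equally likely seatings; a short Python sketch:

```python
from itertools import permutations

friends = ["John", "Sue", "Rob", "Dan", "Molly"]
seatings = list(permutations(friends))   # all 5! = 120 equally likely orders

# seatings in which John and Molly occupy adjacent seats
together = [s for s in seatings
            if abs(s.index("John") - s.index("Molly")) == 1]

prob = len(together) / len(seatings)
print(len(together), prob)  # 48 0.4
```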
Example 2.6.5 Consider distributing a standard pack of 52 cards to four players (North,
South, East and West) in a bridge game where each player gets 13 cards. Here again the
ordering of the cards is not crucial to the game. The total number of ways these 52 cards
can be distributed among the four players is then given by
n = C(52, 13) × C(39, 13) × C(26, 13) × C(13, 13)
Then,
P (North will receive 4 aces and 4 kings) = (C(4, 4) × C(4, 4) × C(44, 5) × C(39, 13) × C(26, 13) × C(13, 13))/n = 11/6431950
because North is given 4 aces and 4 kings plus 5 other cards from the remaining 44 cards
while the remaining 39 cards are distributed equally among South, East and West. By the
same token,
P (South will receive exactly one ace) = (C(4, 1) × C(48, 12) × C(39, 13) × C(26, 13) × C(13, 13))/n = 9139/20825
since South would receive one out of the 4 aces and 12 other cards from 48 non-ace cards
while the remaining 39 cards are distributed equally among North, East and West.
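Both bridge probabilities can be reproduced with exact arithmetic; the following Python sketch mirrors the counting argument above:

```python
from fractions import Fraction
from math import comb

# total number of ways to deal 13 cards to each of the four players
n = comb(52, 13) * comb(39, 13) * comb(26, 13) * comb(13, 13)

# North gets all 4 aces, all 4 kings, and 5 of the other 44 cards
p_north = Fraction(comb(4, 4) * comb(4, 4) * comb(44, 5)
                   * comb(39, 13) * comb(26, 13) * comb(13, 13), n)

# South gets exactly 1 of the 4 aces and 12 of the 48 non-ace cards
p_south = Fraction(comb(4, 1) * comb(48, 12)
                   * comb(39, 13) * comb(26, 13) * comb(13, 13), n)

print(p_north)  # 11/6431950
print(p_south)  # 9139/20825
```

Note how the factors covering the other three hands cancel, so p_north reduces to C(44, 5)/C(52, 13) and p_south to 4 × C(48, 12)/C(52, 13).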
Exercise 2.6.1 From 7 consonants and 5 vowels, how many words can be formed consist-
ing of 4 different consonants and 3 different vowels? The words need not have meaning.
Solution 2.6.6 The four different consonants can be selected in C(7, 4) ways, the three different vowels can be selected in C(5, 3) ways, and the resulting 7 different letters can then be arranged among themselves in 7! ways. Hence the number of possible words is C(7, 4) × C(5, 3) × 7! = 35 × 10 × 5040 = 1764000.
Exercise 2.6.2 The Dean's office has received 12 nominations from which to designate 4 student representatives to serve on a campus curriculum committee. Among the nominees, 5 are liberal arts majors and 7 are science majors.
(a) How many ways can 4 students be selected from the group of 12?
(b) How many selections are possible that would include 1 science major and 3 liberal arts
majors?
(c) If the selection process were random, what is the probability that 1 science major and 3 liberal arts majors would be selected?
Solution 2.6.7 According to the counting rules, we can answer the questions asked.
(a) C(12, 4) = (12 × 11 × 10 × 9)/(4 × 3 × 2 × 1) = 495
(b) One science major can be chosen from 7 in C(7, 1) = 7 ways. Also, 3 liberal arts majors can be chosen from 5 in C(5, 3) = 10 ways. Each of the 7 choices of science major can be combined with each of the 10 choices of liberal arts majors, so there are 7 × 10 = 70 possible selections.
(c) Let A be the event ”1 science and 3 liberal arts are selected”. Then P (A) = 70/495 ≈ 0.141.
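The three answers can be confirmed with a short computation; a minimal Python check:

```python
from math import comb

total = comb(12, 4)                  # all ways to choose 4 of the 12 nominees
favorable = comb(7, 1) * comb(5, 3)  # 1 science major and 3 liberal arts majors
p = favorable / total

print(total, favorable, round(p, 3))  # 495 70 0.141
```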
Exercise 2.6.3 From a group of α males and β females a committee of λ people will be
chosen. Here we assume that α + β ≥ λ.
(iii) What is the probability that Mary and Peter will be serving together in a committee?
(iv) What is the probability that Mary and Peter will not be serving together?
Solution 2.6.8 First observe that this experiment has a sample space of size C(α + β, λ). There are C(β, η) ways of choosing η of the females. The remaining λ − η members of the committee must be male, and they can be chosen in C(α, λ − η) ways.
(i) From the above observation, it follows that the desired probability (under the assumption that β ≥ η and α ≥ λ − η) is
(C(β, η) × C(α, λ − η))/C(α + β, λ)
(ii) Similarly, the desired probability is
(C(α, λ − 3) × C(β, 3) + C(α, λ − 2) × C(β, 2) + C(α, λ − 1) × C(β, 1) + C(α, λ) × C(β, 0))/C(α + β, λ)
(iii) We must assume that Peter and Mary belong to the original set of people, otherwise
the probability will be 0. Since Peter and Mary must belong to the committee, we
must choose λ − 2 other people from the pool of the α + β − 2 people remaining. The
desired probability is thus
C(α + β − 2, λ − 2)/C(α + β, λ)
(iv) Again, we must assume that Peter and Mary belong to the original set of people,
otherwise the probability will be 1. Observe that one of the following three situations
may arise:
The number of committees that include Peter but exclude Mary is C(α + β − 2, λ − 1). The number of committees that include Mary but exclude Peter is C(α + β − 2, λ − 1). The number of committees that exclude both Peter and Mary is C(α + β − 2, λ). Hence the desired probability is seen to be
(C(α + β − 2, λ − 1) + C(α + β − 2, λ − 1) + C(α + β − 2, λ))/C(α + β, λ)
Exercise 2.6.4 An urn has 3 white marbles, 4 red marbles, and 5 blue marbles. Three
marbles are drawn at once from the urn, and their colour noted. What is the probability
that a marble of each colour is drawn?
Solution 2.6.9 First observe that this experiment has a sample space of size C(12, 3). Since we need to select a marble of each colour, the requested probability will be given by
(C(3, 1) × C(4, 1) × C(5, 1))/C(12, 3) = 60/220 = 3/11
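A quick Python check of this probability:

```python
from fractions import Fraction
from math import comb

# 3 white, 4 red, 5 blue marbles; 3 drawn at once
p = Fraction(comb(3, 1) * comb(4, 1) * comb(5, 1), comb(12, 3))
print(p)  # 3/11
```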
Chapter 3
Random Variables
3.1 Introduction
Definition 3.1.1 Let S represent the sample space of a random experiment, and let R be the set of real numbers. A random variable is a function X which assigns to each element s ∈ S one and only one number X(s) = x ∈ R, i.e.,
X : S −→ R
Example 3.1.2 (a) The score obtained from the roll of a die can be thought of as a
random variable taking the values 1 to 6.
(b) If two dice are rolled, a random variable can be defined to be the sum of the scores,
taking the values 2 to 12.
(c) A random variable can be defined to be the positive difference between the scores
obtained from two dice, taking the values 0 to 5.
It is customary to denote a random variable by a capital letter, such as X, and a particular value it takes by the corresponding lowercase letter, such as x. For example, it may be stated that a random variable X takes the values x = −0.6, x = 2.3,
and x = 4.0. This custom helps clarify whether the random variable is being referred to,
or a particular value taken by the random variable, and it is helpful in the subsequent
mathematical discussions.
Since the value of a random variable is determined by the outcome of the experiment, we
may assign probabilities to the possible values of the random variable.
Example 3.1.3 Letting X denote the random variable that is defined as the sum of two fair dice; then
P {X = 2} = P {(1, 1)} = 1/36,
P {X = 3} = P {(1, 2), (2, 1)} = 2/36,
P {X = 4} = P {(1, 3), (2, 2), (3, 1)} = 3/36,
P {X = 5} = P {(1, 4), (2, 3), (3, 2), (4, 1)} = 4/36,
P {X = 6} = P {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)} = 5/36,
P {X = 7} = P {(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)} = 6/36, (3.1.1)
P {X = 8} = P {(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)} = 5/36,
P {X = 9} = P {(3, 6), (4, 5), (5, 4), (6, 3)} = 4/36,
P {X = 10} = P {(4, 6), (5, 5), (6, 4)} = 3/36,
P {X = 11} = P {(5, 6), (6, 5)} = 2/36,
P {X = 12} = P {(6, 6)} = 1/36.
In other words, the random variable X can take on any integral value between two and
twelve, and the probability that it takes on each value is given by the Equation 3.1.1.
Since X must take on one of the values two through twelve, we must have that
1 = P ( ⋃_{n=2}^{12} {X = n} ) = Σ_{n=2}^{12} P {X = n}
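The pmf in Equation 3.1.1 can be checked by brute-force enumeration of the 36 equally likely outcomes. The sketch below (names are illustrative) tallies the distribution of the sum and confirms that the probabilities add to one:

```python
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of rolling two fair dice
# and tally the probability mass function of the sum X.
pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    s = d1 + d2
    pmf[s] = pmf.get(s, Fraction(0)) + Fraction(1, 36)

print(pmf[7])             # 1/6, i.e. 6/36
print(sum(pmf.values()))  # 1, as required
```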
Example 3.1.4 For a second example, suppose that our experiment consists of tossing
two fair coins. Letting Y denote the number of heads appearing, then Y is a random
variable taking on one of the values 0, 1, 2 with respective probabilities
P {Y = 0} = P {T T } = 1/4
P {Y = 1} = P {T H, HT } = 2/4   (3.1.2)
P {Y = 2} = P {HH} = 1/4

Of course, P {Y = 0} + P {Y = 1} + P {Y = 2} = 1/4 + 2/4 + 1/4 = 1.
Example 3.1.5 Suppose that we toss a coin having a probability p of coming up heads,
until the first head appears. Letting N denote the number of flips required, then assuming
that the outcome of successive flips are independent, N is a random variable taking on
one of the values 1, 2, 3, · · · , with respective probabilities
P {N = 1} = P {H} = p,
P {N = 2} = P {T H} = (1 − p)p,
P {N = 3} = P {T T H} = (1 − p)²p,
⋮                                            (3.1.3)
P {N = n} = P {T T · · · T H} = (1 − p)^{n−1} p,   n ≥ 1

where the sequence T T · · · T consists of n − 1 tails.
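As a quick check of Equation 3.1.3, the following sketch evaluates the pmf for an assumed p = 0.3 and confirms that the probabilities accumulate to one:

```python
# Probability that the first head appears on flip n, for a coin with
# head probability p (the geometric pmf of Equation 3.1.3).
def pmf_first_head(n, p):
    return (1 - p) ** (n - 1) * p

p = 0.3
probs = [pmf_first_head(n, p) for n in range(1, 200)]
print(probs[0])    # P{N = 1} = p = 0.3
print(sum(probs))  # partial sums approach 1
```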
Example 3.1.6 Suppose that our experiment consists of seeing how long a battery can
operate before wearing down. Suppose also that we are not primarily interested in the
actual lifetime of the battery but are concerned only about whether or not the battery lasts
at least two years. In this case, we may define the random variable I by
I =  1,  if the lifetime of the battery is two or more years   (3.1.4)
     0,  otherwise
If E denotes the event that the battery lasts two or more years, then the random variable
I is known as the indicator random variable for event E. Note that I equals 1 or 0
depending on whether or not E occurs.
Example 3.1.7 Suppose that independent trials, each of which results in any of m pos-
sible outcomes with respective probabilities
p_1 , · · · , p_m ,   Σ_{i=1}^{m} p_i = 1,
are continually performed. Let X denote the number of trials needed until each outcome
has occurred at least once. Rather than directly considering P {X = n} we will first
determine P {X > n}, the probability that at least one of the outcomes has not yet occurred
after n trials. Letting Ai denote the event that outcome i has not yet occurred after the
first n trials, i = 1, · · · , m, then
P {X > n} = P ( ⋃_{i=1}^{m} A_i ) = Σ_i P (A_i) − Σ_{i<j} P (A_i ∩ A_j)
          + Σ_{i<j<k} P (A_i ∩ A_j ∩ A_k) − · · · + (−1)^{m+1} P (A_1 ∩ A_2 ∩ · · · ∩ A_m)
Now, P (Ai ) is the probability that each of the first n trials results in a non-i outcome, and
so by independence P (Ai ) = (1 − pi )n . Similarly, P (Ai ∩ Aj ) is the probability that the
first n trials all result in a non-i and non-j outcome, and so P (Ai ∩ Aj ) = (1 − pi − pj )n .
As all of the other probabilities are similar, we see that
P {X > n} = Σ_{i=1}^{m} (1 − p_i)^n − Σ_{i<j} (1 − p_i − p_j)^n + Σ_{i<j<k} (1 − p_i − p_j − p_k)^n − · · ·
Since
P {X = n} = P {X > n − 1} − P {X > n},
we see, upon using the algebraic identity
P {X = n} = Σ_{i=1}^{m} p_i (1 − p_i)^{n−1} − Σ_{i<j} (p_i + p_j)(1 − p_i − p_j)^{n−1}
          + Σ_{i<j<k} (p_i + p_j + p_k)(1 − p_i − p_j − p_k)^{n−1} − · · ·
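The inclusion-exclusion formula for P {X > n} can be evaluated mechanically. The helper below is a hypothetical sketch of the displayed formula, checked against the case m = 2 with equally likely outcomes, where P {X > n} = 2(1/2)^n:

```python
from itertools import combinations

# Inclusion-exclusion evaluation of P{X > n}: the probability that,
# after n independent trials with outcome probabilities p, at least
# one of the m outcomes has not yet occurred.
def prob_not_all_seen(n, p):
    m = len(p)
    total = 0.0
    for r in range(1, m + 1):  # size of the index set {i, j, ...}
        for idx in combinations(range(m), r):
            miss = (1 - sum(p[i] for i in idx)) ** n
            total += (-1) ** (r + 1) * miss
    return total

# With m = 2 equally likely outcomes, P{X > 3} = 2 (1/2)^3 = 0.25.
print(prob_not_all_seen(3, [0.5, 0.5]))  # 0.25
```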
In all of the preceding examples, the random variables of interest took on either a finite
or a countable number of possible values. Such random variables are called discrete.
However, there also exist random variables that take on a continuum of possible values.
These are known as continuous random variables. One example is the random variable
denoting the lifetime of a car, when the car’s lifetime is assumed to take on any value in
some interval (a, b).
Definition 3.1.8 The cumulative distribution function (cdf ) (or more simply the dis-
tribution function) F (·) of the random variable X is defined for any real number b,
−∞ < b < ∞, by
F (b) = P {X ≤ b}
In words, F (b) denotes the probability that the random variable X takes on a value that
is less than or equal to b. Some properties of the cdf F are:
(i) F (b) is a nondecreasing function of b;
(ii) lim_{b→∞} F (b) = 1;
(iii) lim_{b→−∞} F (b) = 0.
Property (i) follows since for a < b the event {X ≤ a} is contained in the event {X ≤ b},
and so it must have a smaller probability.
Properties (ii) and (iii) follow since X must take on some finite value.
All probability questions about X can be answered in terms of the cdf F (·). For example,

P {a < X ≤ b} = F (b) − F (a),   a < b

This follows since we may calculate P {a < X ≤ b} by first computing the probability
that X ≤ b [that is, F (b)] and then subtracting from this the probability that X ≤ a
[that is, F (a)].
If we desire the probability that X is strictly smaller than b, we may calculate this
probability by
P {X < b} = lim_{h→0⁺} P {X ≤ b − h} = lim_{h→0⁺} F (b − h)
Note that P {X < b} does not necessarily equal F (b) since F (b) also includes the proba-
bility that X equals b.
Remark 3.1.1 The following equations show in details how one can calculate probabilities
for random variables using the cdf F .
P {X ≤ a} = F (a)
P {X < a} = P {X ≤ a} − P {X = a} = F (a) − P {X = a}
P {X > a} = 1 − P {X ≤ a} = 1 − F (a)
P {X ≥ a} = 1 − P {X < a} = 1 − F (a) + P {X = a}
P {a ≤ X ≤ b} = P {X ≤ b} − P {X < a} = F (b) − F (a) + P {X = a}
P {a < X ≤ b} = F (b) − F (a)
P {a < X < b} = F (b) − F (a) − P {X = b}
P {a ≤ X < b} = F (b) − F (a) − P {X = b} + P {X = a}
Example 3.1.9 Consider the experiment of rolling a die. There are six possible out-
comes. If we define the random variable X as the number of dots observed on the upper
surface of the die, then the six possible outcomes can be described as x1 = 1, x2 = 2, ..., x6 =
6. The respective probabilities are P {X = x_i} = 1/6 for i = 1, 2, ..., 6. The cdf of X is therefore defined as follows:

        0,    if −∞ < x < 1
        1/6,  if 1 ≤ x < 2
        2/6,  if 2 ≤ x < 3
F (x) = 3/6,  if 3 ≤ x < 4
        4/6,  if 4 ≤ x < 5
        5/6,  if 5 ≤ x < 6
        1,    if 6 ≤ x < ∞
As was previously mentioned, a random variable that can take on at most a countable
number of possible values is said to be discrete. For a discrete random variable X, we
define the probability mass function p(a) of X by
p(a) = P {X = a}
The probability mass function p(a) is positive for at most a countable number of values
of a. That is, if X must assume one of the values x1 , x2 , · · · , then
p(xi ) > 0, i = 1, 2, · · · .
Since X must take on one of the values x_i, we have

Σ_{i=1}^{∞} p(x_i) = 1
For example, if X has the probability mass function given by p(1) = 1/2, p(2) = 1/3, p(3) = 1/6, then the cumulative distribution function of X is

        0,    for a < 1
        1/2,  for 1 ≤ a < 2
F (a) =
        5/6,  for 2 ≤ a < 3
        1,    for 3 ≤ a
Discrete random variables are often classified according to their probability mass func-
tions. We now consider some of these random variables.
3.2.1 The Bernoulli Random Variable
A random variable X is said to be a Bernoulli random variable with parameter p, 0 < p < 1, if its probability mass function is given by

p(0) = P {X = 0} = 1 − p
p(1) = P {X = 1} = p          (3.2.1)
3.2.2 The Binomial Random Variable
Suppose that n independent trials, each of which results in a ”success” with probability p
and in a ”failure” with probability 1 − p, are to be performed. If X represents the number
of successes that occur in the n trials, then X is said to be a binomial random variable
with parameters (n, p).
The probability mass function of a binomial random variable having parameters (n, p) is
given by
p(i) = P {X = i} = \binom{n}{i} p^i (1 − p)^{n−i} ,   i = 0, 1, 2, · · · , n   (3.2.2)

where

\binom{n}{i} = n! / ((n − i)! i!)
equals the number of different groups of i objects that can be chosen from a set of n
objects. The validity of the Equation 3.2.2 may be verified by first noting that the
probability of any particular sequence of the n outcomes containing i successes and n − i
failures is, by the assumed independence of trials, p^i (1 − p)^{n−i}. The Equation 3.2.2 then follows since there are \binom{n}{i} different sequences of the n outcomes leading to i successes and n − i failures.
For instance, if n = 3, i = 2, then there are \binom{3}{2} = 3 ways in which the three trials can result in two successes, namely, any one of the three outcomes (S, S, F ), (S, F, S), (F, S, S), where the outcome (S, S, F ) means that the first two trials are successes and the third a failure. Since each of the three outcomes (S, S, F ), (S, F, S), (F, S, S) has a probability p²(1 − p) of occurring, the desired probability is thus \binom{3}{2} p²(1 − p).
Note that, by the Binomial Theorem 2.5.12, the probabilities sum to one, that is,
Σ_{i=0}^{∞} p(i) = Σ_{i=0}^{n} \binom{n}{i} p^i (1 − p)^{n−i} = (p + (1 − p))^n = 1   (3.2.3)
Example 3.2.1 Four fair coins are flipped. If the outcomes are assumed independent,
what is the probability that two heads and two tails are obtained?
Letting X equal the number of heads ("successes") that appear, then X is a binomial random variable with parameters (n = 4, p = 1/2). Hence, by the Equation 3.2.2 we have

P (X = 2) = \binom{4}{2} (1/2)² (1/2)² = 3/8
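Equation 3.2.2 is straightforward to evaluate with math.comb; the sketch below reproduces the 3/8 just obtained and checks that the pmf sums to one:

```python
from math import comb

# Binomial pmf of Equation 3.2.2: probability of i successes in n
# independent trials, each succeeding with probability p.
def binom_pmf(i, n, p):
    return comb(n, i) * p ** i * (1 - p) ** (n - i)

# Four fair coins: probability of exactly two heads.
print(binom_pmf(2, 4, 0.5))  # 6/16 = 0.375 = 3/8
```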
Example 3.2.2 It is known that any item produced by a certain machine will be defective
with probability 0.1, independently of any other item. What is the probability that in a
sample of three items, at most one will be defective?
If X is the number of defective items in the sample, then X is a binomial random variable
with parameters (3, 0.1). Hence, the desired probability is given by
P {X ≤ 1} = P {X = 0} + P {X = 1} = \binom{3}{0} (0.1)⁰ (0.9)³ + \binom{3}{1} (0.1)(0.9)² = 0.972
Example 3.2.3 Suppose that an airplane engine will fail, when in flight, with probability
1−p independently from engine to engine; suppose that the airplane will make a successful
flight if at least 50 percent of its engines remain operative. For what values of p is a four-
engine plane preferable to a two-engine plane?
Because each engine is assumed to fail or function independently of what happens with
the other engines, it follows that the number of engines remaining operative is a binomial
random variable. Hence, the probability that a four-engine plane makes a successful flight
is
\binom{4}{2} p²(1 − p)² + \binom{4}{3} p³(1 − p) + \binom{4}{4} p⁴(1 − p)⁰ = 6p²(1 − p)² + 4p³(1 − p) + p⁴

whereas the corresponding probability for a two-engine plane is

\binom{2}{1} p(1 − p) + \binom{2}{2} p²(1 − p)⁰ = 2p(1 − p) + p²

Hence the four-engine plane is preferable if

6p²(1 − p)² + 4p³(1 − p) + p⁴ ≥ 2p(1 − p) + p²

or, equivalently (dividing through by p), if

6p(1 − p)² + 4p²(1 − p) + p³ ≥ 2 − p

which simplifies to

3p³ − 8p² + 7p − 2 ≥ 0,   that is,   (p − 1)²(3p − 2) ≥ 0

which is equivalent to p ≥ 2/3. Hence, the four-engine plane is safer when the engine success probability is at least as large as 2/3, whereas the two-engine plane is safer when this probability falls below 2/3.
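The comparison can also be inspected numerically. The two functions below are direct transcriptions of the success probabilities derived above; the sampling points for p are arbitrary:

```python
# Success probabilities for the two aircraft, as functions of the
# engine success probability p.
def four_engine(p):
    return 6 * p**2 * (1 - p) ** 2 + 4 * p**3 * (1 - p) + p**4

def two_engine(p):
    return 2 * p * (1 - p) + p**2

# The two designs break even at p = 2/3.
for p in (0.5, 0.7, 0.9):
    print(p, four_engine(p) >= two_engine(p))  # False, True, True
```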
Example 3.2.4 Suppose that a particular trait of a person (such as eye color or left
handedness) is classified on the basis of one pair of genes and suppose that D represents a
dominant gene and r a recessive gene. Thus a person with DD genes is pure dominance,
one with rr is pure recessive, and one with rD is hybrid. The pure dominance and the
hybrid are alike in appearance. Children receive one gene from each parent. If, with respect
to a particular trait, two hybrid parents have a total of four children, what is the probability
that exactly three of the four children have the outward appearance of the dominant gene?
If we assume that each child is equally likely to inherit either of two genes from each parent, the probabilities that the child of two hybrid parents will have DD, rr, or rD pairs of genes are, respectively, 1/4, 1/4, 1/2. Hence, because an offspring will have the outward appearance of the dominant gene if its gene pair is either DD or rD, it follows that the number of such children is binomially distributed with parameters (4, 3/4). Thus the desired probability is

\binom{4}{3} (3/4)³ (1/4) = 27/64.
Suppose that independent trials, each having probability p of being a success, are per-
formed until a success occurs. If we let X be the number of trials required until the
first success, then X is said to be a geometric random variable with parameter p. Its probability mass function is given by

p(n) = P {X = n} = (1 − p)^{n−1} p,   n = 1, 2, · · ·   (3.2.4)

The Equation 3.2.4 follows since, in order for X to equal n, it is necessary and sufficient that the first n − 1 trials be failures and the nth trial a success; the product form then follows since the outcomes of the successive trials are assumed to be independent.
Here we have used the formula for a geometric series, namely

Σ_{k=0}^{∞} a r^k = a/(1 − r)   for |r| < 1
We can again use the urn model to motivate the hypergeometric distribution.
Consider an urn with N total balls, where M are white and N − M are black. We randomly draw n balls without replacement, i.e. we do not place a ball back into the urn once it is drawn. The order in which the balls are drawn is assumed to be of no interest; only the number of drawn white balls is of relevance. We define the random variable

X = number of white balls among the n drawn balls.

To be more precise, among the n drawn balls, x are white and n − x are black. There are \binom{M}{x} possibilities to choose x white balls from the total of M white balls and, analogously, there are \binom{N−M}{n−x} possibilities to choose n − x black balls from the total of N − M black balls. In total, we draw n out of N balls. Recall that the probability is defined as the number of favourable simple events divided by the number of all possible events. The number of combinations for all possible events is \binom{N}{n}. The number of favourable events is \binom{M}{x} \binom{N−M}{n−x}, because we draw, independently of each other, x out of M balls and n − x out of N − M balls. Hence, the probability mass function of the hypergeometric distribution is

p(x) = P (X = x) = \binom{M}{x} \binom{N−M}{n−x} / \binom{N}{n}   (3.2.5)
Example 3.2.6 The German national lottery draws 6 out of 49 balls from a rotating
bowl. Each ball is associated with a number between 1 and 49. A simple bet is to choose
6 numbers between 1 and 49. If 3 or more chosen numbers correspond to the numbers
drawn in the lottery, then one wins a certain amount of money. What is the probability
of choosing 4 correct numbers?
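The computation is not reproduced above; as a sketch, the probability follows from Equation 3.2.5 with N = 49, M = 6, n = 6, and x = 4:

```python
from math import comb

# Hypergeometric pmf of Equation 3.2.5: x white balls among n drawn
# from an urn with M white and N - M black balls.
def hypergeom_pmf(x, N, M, n):
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

# Lottery: N = 49 numbers, M = 6 winning, n = 6 chosen, x = 4 correct.
p4 = hypergeom_pmf(4, 49, 6, 6)
print(p4)  # ≈ 0.00097
```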
3.2.5 The Poisson Random Variable
A random variable X, taking on one of the values 0, 1, 2, ..., is said to be a Poisson random
variable with parameter λ, if for some λ > 0,
p(i) = P {X = i} = e^{−λ} λ^i / i! ,   i = 0, 1, 2, · · ·   (3.2.7)

The Equation 3.2.7 defines a probability mass function since

Σ_{i=0}^{∞} p(i) = e^{−λ} Σ_{i=0}^{∞} λ^i / i! = e^{−λ} e^{λ} = 1
The Poisson random variable has a wide range of applications in a diverse number of
areas. An important property of the Poisson random variable is that it may be used to
approximate a binomial random variable when the binomial parameter n is large and p is
small. To see this, suppose that X is a binomial random variable with parameters (n, p),
and let λ = np. Then

P {X = i} = [n! / ((n − i)! i!)] p^i (1 − p)^{n−i} = [n! / ((n − i)! i!)] (λ/n)^i (1 − λ/n)^{n−i}
          = [n(n − 1) · · · (n − i + 1) / n^i] (λ^i / i!) (1 − λ/n)^n / (1 − λ/n)^i

Now, for n large and p small, (1 − λ/n)^n ≈ e^{−λ}, n(n − 1) · · · (n − i + 1)/n^i ≈ 1, and (1 − λ/n)^i ≈ 1. Hence, for n large and p small,

P {X = i} ≈ e^{−λ} λ^i / i!
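The quality of the approximation is easy to inspect numerically; the sketch below compares the two pmfs for an assumed n = 100 and p = 0.03 (so λ = 3):

```python
from math import comb, exp, factorial

n, p = 100, 0.03  # large n, small p
lam = n * p       # λ = np = 3

# Binomial pmf versus its Poisson approximation, for i = 0, ..., 5.
binom = [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(6)]
poisson = [exp(-lam) * lam ** i / factorial(i) for i in range(6)]

for i in range(6):
    print(i, round(binom[i], 4), round(poisson[i], 4))
```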
Example 3.2.7 Suppose that the number of typographical errors on a single page of the
current document has a Poisson distribution with parameter λ = 1. Calculate the proba-
bility that there is at least one error on this page.
P {X ≥ 1} = 1 − P {X = 0} = 1 − e^{−1} ≈ 0.632

Example 3.2.8 If the number of accidents occurring on a highway each day is a Poisson random variable with parameter λ = 3, what is the probability that no accidents occur today? The desired probability is P {X = 0} = e^{−3} ≈ 0.050.
3.3 Continuous Random Variables
In this section, we shall concern ourselves with random variables whose set of possible
values is uncountable.
Let X be such a random variable. We say that X is a continuous random variable if there
exists a nonnegative function f (x), defined for all real x ∈ (−∞, ∞), having the property
that for any set B of real numbers
P {X ∈ B} = ∫_B f (x) dx   (3.3.1)
The function f (x) is called the probability density function of the random variable X. In
words, the Equation 3.3.1 states that the probability that X will be in B may be obtained
by integrating the probability density function over the set B. Since X must assume some
value, f (x) must satisfy
1 = P {X ∈ (−∞, ∞)} = ∫_{−∞}^{∞} f (x) dx
All probability statements about X can be answered in terms of f (x). For instance,
letting B = [a, b], we obtain from the Equation 3.3.1 that
P {a ≤ X ≤ b} = ∫_a^b f (x) dx   (3.3.2)
If we let a = b in the preceding, then
P {X = a} = ∫_a^a f (x) dx = 0
In words, this equation states that the probability that a continuous random variable will
assume any particular value is zero.
The relationship between the cumulative distribution F (·) and the probability density
f (·) is expressed by
F (a) = P {X ∈ (−∞, a]} = ∫_{−∞}^{a} f (x) dx
Differentiating both sides of the preceding yields
(d/da) F (a) = f (a)
That is, the density is the derivative of the cumulative distribution function. A somewhat
more intuitive interpretation of the density function may be obtained from the Equation
3.3.2 as follows:
P { a − ε/2 ≤ X ≤ a + ε/2 } = ∫_{a−ε/2}^{a+ε/2} f (x) dx ≈ ε f (a)
when ε is small. In other words, the probability that X will be contained in an interval
of length ε around the point a is approximately εf (a). From this, we see that f (a) is a
measure of how likely it is that the random variable will be near a. There are several
important continuous random variables that appear frequently in probability theory. The
remainder of this section is devoted to a study of certain of these random variables.
3.3.1 The Uniform Random Variable
A random variable is said to be uniformly distributed over the interval (0, 1) if its probability density function is given by

        1,  if 0 < x < 1
f (x) =                     (3.3.3)
        0,  otherwise

Note that f is indeed a density function, since f (x) ≥ 0 and

∫_{−∞}^{∞} f (x) dx = ∫_0^1 dx = 1
Since f (x) > 0 only when x ∈ (0, 1), it follows that X must assume a value in (0, 1). Also,
since f (x) is constant for x ∈ (0, 1), X is just as likely to be ”near” any value in (0, 1) as
any other value. To check this, note that, for any 0 < a < b < 1,
P {a ≤ X ≤ b} = ∫_a^b f (x) dx = b − a
In other words, the probability that X is in any particular subinterval of (0, 1) equals the
length of that subinterval.
In general, we say that X is a uniform random variable on the interval (α, β) if its
probability density function is given by
        1/(β − α),  if α < x < β
f (x) =                            (3.3.4)
        0,          otherwise
Example 3.3.1 Calculate the cumulative distribution function of a random variable uni-
formly distributed over (α, β).
Since F (a) = ∫_{−∞}^{a} f (x) dx, we obtain from the Equation 3.3.4 that

        0,                 if a ≤ α
F (a) = (a − α)/(β − α),   if α ≤ a < β
        1,                 if a ≥ β
Example 3.3.2 If X is uniformly distributed over (0, 10), calculate the probability that
(a) X < 3, (b) X > 7, (c) 1 < X < 6.
The density of X is

        1/10,  if 0 < x < 10
f (x) =                          (3.3.5)
        0,     otherwise

(a) P {X < 3} = (1/10) ∫_0^3 dx = 3/10
(b) P {X > 7} = 1 − P {X ≤ 7} = 1 − (1/10) ∫_0^7 dx = 1 − 7/10 = 3/10
(c) P {1 < X < 6} = (1/10) ∫_1^6 dx = 5/10 = 1/2
3.3.2 The Exponential Random Variable
A continuous random variable whose probability density function is given, for some λ > 0, by

        λe^{−λx},  if x ≥ 0
f (x) =                        (3.3.6)
        0,         if x < 0

is said to be an exponential random variable with parameter λ. Its cumulative distribution function F is given by

F (a) = ∫_0^a λe^{−λx} dx = 1 − e^{−λa},   a ≥ 0   (3.3.7)

Note that F (∞) = ∫_0^∞ λe^{−λx} dx = 1, as, of course, it must be.
3.3.3 The Gamma Random Variable
A continuous random variable whose density function is given by

        λe^{−λx} (λx)^{α−1} / Γ(α),  if x ≥ 0
f (x) =                                          (3.3.8)
        0,                           if x < 0

for some λ > 0, α > 0 is said to be a gamma random variable with parameters α, λ. The quantity Γ(α) is called the gamma function and is defined by

Γ(α) = ∫_0^∞ e^{−x} x^{α−1} dx
Exercise 3.3.1 Let X be a gamma random variable whose density function is given by the Equation 3.3.8. Prove that

∫_{−∞}^{∞} f (x) dx = 1   (3.3.9)
For more information about the Gamma function, we refer the reader to the course of Special Functions.
3.3.4 The Normal Random Variable
We say that X is a normal random variable (or simply that X is normally distributed ) with parameters µ and σ² if the density of X is given by

f (x) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)},   −∞ < x < ∞,   −∞ < µ < ∞,   σ² > 0   (3.3.10)
Exercise 3.3.2 Let X be a normally distributed random variable where its density func-
tion is given by the Equation 3.3.10. Prove that
∫_{−∞}^{∞} f (x) dx = 1   (3.3.11)
Remark 3.3.1 When there is more than one random variable under consideration, we shall denote the cumulative distribution function of a random variable Z by F_Z (·). Similarly, we shall denote the density of Z by f_Z (·).
An important fact about normal random variables is given in the following theorem.

Theorem 3.3.3 If X is normally distributed with parameters µ and σ², then Y = αX + β is normally distributed with parameters αµ + β and α²σ².

Proof. To prove this, suppose first that α > 0 and note that the cumulative distribution function of the random variable Y is given by

F_Y (a) = P {Y ≤ a} = P {αX + β ≤ a} = P {X ≤ (a − β)/α}
        = ∫_{−∞}^{(a−β)/α} (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)} dx
        = ∫_{−∞}^{a} (1/(√(2π) ασ)) exp{ −(v − (αµ + β))² / (2α²σ²) } dv   (3.3.12)

where the last equality is obtained by the change of variables v = αx + β. However, since

F_Y (a) = ∫_{−∞}^{a} f_Y (v) dv,

it follows from the Equation 3.3.12 that the probability density function f_Y (·) is that of a normal random variable with parameters αµ + β and α²σ². In particular, if X is normally distributed with parameters µ and σ², then Z = (X − µ)/σ is normally distributed with parameters 0 and 1. Such a random variable is said to have a standard normal distribution, with density

f (z) = (1/√(2π)) e^{−z²/2},   −∞ < z < ∞

Let Z be the standard normal random variable. The following properties can be observed from the symmetry of the standard normal curve about 0.
(i) P (Z ≤ 0) = 0.5
(ii) P (Z ≤ −z) = P (Z ≥ z) = 1 − P (Z ≤ z) for any z > 0
Example 3.3.4 Let Z be the standard normal random variable. Find P [Z ≤ 1.37] and
P [Z > 1.37]. From the Normal probability table, we see that the probability P [Z ≤ 1.37] =
0.9147. This is the area to the left of z = 1.37. Moreover, because [Z > 1.37] is the
complement of [Z ≤ 1.37], it follows that P [Z > 1.37] = 1 − P [Z ≤ 1.37] = 1 − 0.9147 =
0.0853 as we can see in the normal table. An alternative method is to use symmetry to
show that P [Z > 1.37] = P [Z < −1.37] which can be obtained directly from the normal
table.
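Instead of a printed table, the standard normal cdf is available in Python's statistics.NormalDist; the sketch below reproduces the values used in this example:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

p_le = Z.cdf(1.37)
print(round(p_le, 4))          # 0.9147, matching the table
print(round(1 - p_le, 4))      # 0.0853
print(round(Z.cdf(-1.37), 4))  # 0.0853, by symmetry
```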
Example 3.3.5 Compute P [−0.155 < Z < 1.60]. From the Normal probability table, we see that P [Z ≤ 1.60] = 0.9452, i.e. the area to the left of 1.60. We interpolate between the entries for −0.15 and −0.16 to obtain P [Z ≤ −0.155] = 0.4384, i.e. the area to the left of −0.155. Let us note that to obtain P [Z ≤ −0.155] we have interpolated by taking the mean of P [Z ≤ −0.15] and P [Z ≤ −0.16], since P [Z ≤ −0.155] is not presented in the Normal Probability Table. Therefore, P [−0.155 < Z < 1.60] = 0.9452 − 0.4384 = 0.5068.
The working steps employed in the Example 3.3.6 can be formalized into the following rule: if X is normally distributed with mean µ and standard deviation σ, then

P [a ≤ X ≤ b] = P [ (a − µ)/σ ≤ Z ≤ (b − µ)/σ ]   (3.3.14)
Exercise 3.3.3 The number of calories in a salad on the lunch menu is normally distributed with µ = 200 and σ = 5. Find the probability that the salad you select will contain
Solution 3.3.7 Let X be the number of calories in the salad; we have the standardized variable Z = (X − 200)/5.
(Normal probability table: areas under the standard normal curve, for negative and positive values of Z.)
Solution 3.3.8 Because the trials are conducted on different parent plants, it is natural to assume that they are independent. Let the random variable X denote the number of red flowered plants among the 5 offspring. If we identify the occurrence of red as a success S, the Mendelian theory specifies that P (S) = p = 1/4, and hence X has a Binomial distribution with n = 5 and p = 0.25. The required probabilities are therefore

(a) P {X = 0} = \binom{5}{0} (0.25)⁰ (1 − 0.25)⁵ = 0.237

(b) P {X ≥ 4} = P {X = 4} + P {X = 5} = \binom{5}{4} (0.25)⁴ (0.75)¹ + \binom{5}{5} (0.25)⁵ (0.75)⁰ = 0.015 + 0.001 = 0.016.
Exercise 3.3.5 The raw scores in a national aptitude test are normally distributed with
µ = 506 and σ = 81.
Solution 3.3.9 If we denote the raw scores by X, the standardized score Z = (X − 506)/81 is distributed as N (0, 1).
(b) We first find the 30th percentile in the z−scale and then convert it to the x−scale. That
is we need to find z such that P [Z ≤ z] = 0.30. From the Standard Normal Table,
we find P [Z ≤ −0.524] = 0.30. The standardized score z = −0.524 corresponds to
x = 506 + (81)(−0.524) = 463.56. Therefore, the 30th percentile score is about 463.6.
Exercise 3.3.6 Suppose that it is known that a new treatment is successful in curing a
muscular pain in 50% of the cases. If it is tried on 15 patients, find the probability that
(a) At most 6 will be cured.
(b) The number cured will be no fewer than 6 and no more than 10.
Solution 3.3.10 Designating the cure of a patient by S and assuming that the results for individual patients are independent, we note that the Binomial distribution with n = 15 and p = 0.5 is appropriate for X = number of patients who are cured.
(a) P {X ≤ 6} = P {X = 0} + P {X = 1} + P {X = 2} + P {X = 3} + P {X = 4} + P {X = 5} + P {X = 6} = 0.304.
(b) P {6 ≤ X ≤ 10} = P {X ≤ 10} − P {X ≤ 5} = 0.941 − 0.151 = 0.790.
(c) To find P {X ≥ 12} we use the law of complement, that is, P {X ≥ 12} = 1 − P {X ≤ 11} = 1 − [P {X = 0} + P {X = 1} + · · · + P {X = 11}] = 1 − 0.982 = 0.018.
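The sums above can be reproduced with a short helper (a sketch; the function name is illustrative):

```python
from math import comb

# P{X <= k} for a binomial random variable with n = 15, p = 0.5.
def binom_cdf(k, n=15, p=0.5):
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

print(round(binom_cdf(6), 3))                    # 0.304, part (a)
print(round(binom_cdf(10) - binom_cdf(5), 3))    # 0.79,  part (b)
print(round(1 - binom_cdf(11), 3))               # 0.018, part (c)
```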
(b) P [X ≤ 67]
Exercise 3.3.9 A company producing cereals offers a toy in every sixth cereal package in
celebration of their 50th anniversary. A father immediately buys 20 packages.
(i) What is the probability of finding 4 toys in the 20 packages?
(ii) What is the probability of finding no toy in the 20 packages?
Solution 3.3.11 The random variable X = number of packages with a toy is binomially distributed. In each of n = 20 trials, a toy can be found with probability p = 1/6.
(i) Here we have P [X = 4] = \binom{20}{4} (1/6)⁴ (5/6)¹⁶ ≈ 0.20.
(ii) Here we have P [X = 0] = \binom{20}{0} (1/6)⁰ (5/6)²⁰ ≈ 0.026.
Chapter 4
Expectation of a Random Variable
If X is a discrete random variable having a probability mass function p(x), then the
expected value of X is defined by
E[X] = Σ_{x:p(x)>0} x p(x)   (4.1.1)
In other words, the expected value of X is a weighted average of the possible values that
X can take on, each value being weighted by the probability that X assumes that value.
For example, if the probability mass function of X is given by p(1) = 1/2 = p(2), then

E[X] = 1(1/2) + 2(1/2) = 3/2

is just an ordinary average of the two possible values 1 and 2 that X can assume. On the other hand, if p(1) = 1/3 and p(2) = 2/3, then

E[X] = 1(1/3) + 2(2/3) = 5/3

is a weighted average of the two possible values 1 and 2, where the value 2 is given twice as much weight as the value 1 since p(2) = 2p(1).
Example 4.1.1 Find E[X] where X is the outcome when we roll a fair die. Since p(i) = 1/6 for i = 1, 2, · · · , 6, we obtain

E[X] = 1(1/6) + 2(1/6) + 3(1/6) + 4(1/6) + 5(1/6) + 6(1/6) = 7/2
Example 4.1.2 (Expectation of a Bernoulli Random Variable) Calculate E[X] when X is a Bernoulli random variable with parameter p. Since p(0) = 1 − p and p(1) = p, we have

E[X] = 0(1 − p) + 1(p) = p

Thus, the expected number of successes in a single trial is just the probability that the trial will be a success.
Example 4.1.3 (Expectation of a Binomial Random Variable) Calculate E[X] when X is binomially distributed with parameters n and p.

E[X] = Σ_{i=0}^{n} i p(i) = Σ_{i=0}^{n} i \binom{n}{i} p^i (1 − p)^{n−i}
     = Σ_{i=1}^{n} [i n! / ((n − i)! i!)] p^i (1 − p)^{n−i} = Σ_{i=1}^{n} [n! / ((n − i)!(i − 1)!)] p^i (1 − p)^{n−i}
     = np Σ_{i=1}^{n} [(n − 1)! / ((n − i)!(i − 1)!)] p^{i−1} (1 − p)^{n−i} = np Σ_{k=0}^{n−1} \binom{n−1}{k} p^k (1 − p)^{n−1−k}   (4.1.2)
     = np [p + (1 − p)]^{n−1} = np
where the second from the last equality follows by letting k = i − 1. Thus, the expected
number of successes in n independent trials is n multiplied by the probability that a trial
results in a success.
Example 4.1.4 (Expectation of a Geometric Random Variable) Calculate the expectation of a geometric random variable having parameter p. Writing q = 1 − p, we have

E[X] = Σ_{n=1}^{∞} n p q^{n−1} = p Σ_{n=1}^{∞} d/dq (q^n) = p d/dq ( Σ_{n=1}^{∞} q^n ) = p d/dq ( q/(1 − q) ) = p/(1 − q)² = 1/p
Here we have used the formula for a geometric series, namely

Σ_{k=0}^{∞} a r^k = a/(1 − r)   for |r| < 1

Indeed,

1/(1 − q) = Σ_{n=0}^{∞} q^n = 1 + Σ_{n=1}^{∞} q^n   and hence   Σ_{n=1}^{∞} q^n = 1/(1 − q) − 1 = q/(1 − q)
In words, the expected number of independent trials we need to perform until we attain
our first success equals the reciprocal of the probability that any one trial results in a
success.
We may also define the expected value of a continuous random variable. This is done as
follows. If X is a continuous random variable having a probability density function f (x),
then the expected value of X is defined by
E[X] = ∫_{−∞}^{∞} x f (x) dx   (4.2.1)
Example 4.2.1 (Expectation of a Uniform Random Variable) Calculate the expectation of a random variable uniformly distributed over (α, β). From the Equation 4.2.1, we have

E[X] = ∫_α^β x/(β − α) dx = (β² − α²)/(2(β − α)) = (β + α)/2   (4.2.2)
In other words, the expected value of a random variable uniformly distributed over the
interval (α, β) is just the midpoint of the interval.
Example 4.2.2 (Expectation of an Exponential Random Variable) Let X be exponentially distributed with parameter λ. Then

E[X] = ∫_0^∞ x λe^{−λx} dx

Integrating by parts yields

E[X] = [−x e^{−λx}]_0^∞ + ∫_0^∞ e^{−λx} dx = 0 − [e^{−λx}/λ]_0^∞ = 1/λ   (4.2.3)
Example 4.2.3 (Expectation of a Normal Random Variable) Let X be normally distributed with parameters µ and σ². Then

E[X] = (1/(√(2π) σ)) ∫_{−∞}^{∞} x e^{−(x−µ)²/(2σ²)} dx

Writing x as (x − µ) + µ yields

E[X] = (1/(√(2π) σ)) ∫_{−∞}^{∞} (x − µ) e^{−(x−µ)²/(2σ²)} dx + (µ/(√(2π) σ)) ∫_{−∞}^{∞} e^{−(x−µ)²/(2σ²)} dx

Letting y = x − µ leads to

E[X] = (1/(√(2π) σ)) ∫_{−∞}^{∞} y e^{−y²/(2σ²)} dy + µ ∫_{−∞}^{∞} f (x) dx

By symmetry (recall the properties of odd functions), the first integral must be 0, and so

E[X] = µ ∫_{−∞}^{∞} f (x) dx = µ
Suppose now that we are given a random variable X and its probability distribution (that
is, its probability mass function in the discrete case or its probability density function in
the continuous case). Suppose also that we are interested in calculating, not the expected
value of X, but the expected value of some function of X, say, g(X). How do we go about
doing this? One way is as follows. Since g(X) is itself a random variable, it must have a
probability distribution, which should be computable from a knowledge of the distribution
of X. Once we have obtained the distribution of g(X), we can then compute E[g(X)] by
the definition of the expectation.
Example 4.3.1 Suppose that X takes the values 0, 1, 2 with respective probabilities 0.2, 0.5, 0.3. Calculate E[X²].
Letting Y = X², we have that Y is a random variable that can take on one of the values 0², 1², 2² with respective probabilities

p_Y (0) = P {Y = 0²} = 0.2
p_Y (1) = P {Y = 1²} = 0.5
p_Y (4) = P {Y = 2²} = 0.3

Hence, E[X²] = E[Y ] = 0(0.2) + 1(0.5) + 4(0.3) = 1.7
Example 4.3.2 Let X be uniformly distributed over (0, 1). Calculate E[X³]. Letting Y = X³, we calculate the distribution of Y as follows. For 0 ≤ a ≤ 1 we have

F_Y (a) = P {Y ≤ a} = P {X³ ≤ a} = P {X ≤ a^{1/3}} = ∫_0^{a^{1/3}} dx = a^{1/3}

where the last equality follows since X is uniformly distributed over (0, 1). Differentiating, the density of Y is f_Y (a) = (1/3) a^{−2/3}, 0 < a < 1, and hence

E[X³] = E[Y ] = ∫_0^1 a (1/3) a^{−2/3} da = (1/3) ∫_0^1 a^{1/3} da = 1/4
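The value E[X³] = 1/4 can be checked numerically; the sketch below approximates ∫_0^1 x³ dx (which equals E[X³] for X uniform on (0, 1)) by a midpoint Riemann sum:

```python
# Midpoint Riemann-sum approximation of E[X^3] = ∫_0^1 x^3 dx
# for X uniform on (0, 1); exact value is 1/4.
N = 1_000_000
riemann = sum(((i + 0.5) / N) ** 3 for i in range(N)) / N

print(riemann)  # ≈ 0.25
```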
While the foregoing procedure will, in theory, always enable us to compute the expectation
of any function of X from a knowledge of the distribution of X, there is, fortunately,
an easier way to do this. The following proposition shows how we can calculate the
expectation of g(X) without first determining its distribution.
Theorem 4.3.3 (a) If X is a discrete random variable with probability mass function
p(x), then for any real-valued function g,
E[g(X)] = Σ_{x:p(x)>0} g(x) p(x)   (4.3.1)
(b) If X is a continuous random variable with probability density function f (x), then for
any real-valued function g,
E[g(X)] = ∫_{−∞}^{∞} g(x) f (x) dx   (4.3.2)
Corollary 4.3.6 If a and b are constants, then E[aX + b] = aE[X] + b.

Proof. In the discrete case,

E[aX + b] = Σ_{x:p(x)>0} (ax + b) p(x) = a Σ_{x:p(x)>0} x p(x) + b Σ_{x:p(x)>0} p(x) = aE[X] + b
The expected value of a random variable X, E[X], is also referred to as the mean or the
first moment of X. The quantity E[X n ], n ≥ 1, is called the nth moment of X. By the
Theorem 4.3.3, we note that
E[X^n] = Σ_{x:p(x)>0} x^n p(x),   if X is discrete   (4.3.3)

E[X^n] = ∫_{−∞}^{∞} x^n f (x) dx,   if X is continuous
The Variance of a Random Variable
If X is a random variable with mean µ = E[X], then the variance of X, denoted Var(X), is defined by

Var(X) = E[(X − µ)²]   (4.3.4)

For any constants a and b we have

Var(aX + b) = a² Var(X)

Proof. It follows from the Equation 4.3.4 defining the variance and the Corollary 4.3.6 that

Var(aX + b) = E[(aX + b − E[aX + b])²] = E[(aX − aE[X])²] = a² E[(X − E[X])²] = a² Var(X)
Example 4.3.8 (Variance of the Normal Random Variable) Let X be normally distributed with parameters µ and σ². Find Var(X).
We have

Var(X) = E[(X − µ)²] = (1/(√(2π) σ)) ∫_{−∞}^{∞} (x − µ)² e^{−(x−µ)²/(2σ²)} dx

Substituting y = (x − µ)/σ gives

Var(X) = (σ²/√(2π)) ∫_{−∞}^{∞} y² e^{−y²/2} dy

Integrating by parts (with u = y and dv = y e^{−y²/2} dy) then yields

Var(X) = (σ²/√(2π)) ( [−y e^{−y²/2}]_{−∞}^{∞} + ∫_{−∞}^{∞} e^{−y²/2} dy ) = (σ²/√(2π)) √(2π) = σ²
Theorem 4.3.9 Let X be a random variable. Suppose that we also have real valued
functions gi (x) and constants ai , i = 1, 2, · · · , k. Then we have
E[ a_0 + Σ_{i=1}^{k} a_i g_i(X) ] = a_0 + Σ_{i=1}^{k} a_i E[g_i(X)]   (4.3.5)
as long as all the expectations involved are finite. That is, the expectation is a linear
operation.
Proof. We supply a proof assuming that X is a continuous random variable with its pdf
f (x), x ∈ B. In the discrete case, the proof is similar. Let us write
E[ a_0 + Σ_{i=1}^{k} a_i g_i(X) ] = ∫_B ( a_0 + Σ_{i=1}^{k} a_i g_i(x) ) f (x) dx
    = a_0 ∫_B f (x) dx + ∫_B Σ_{i=1}^{k} a_i g_i(x) f (x) dx
    = a_0 ∫_B f (x) dx + Σ_{i=1}^{k} a_i ∫_B g_i(x) f (x) dx
    = a_0 + Σ_{i=1}^{k} a_i E[g_i(X)]   (4.3.6)

since the a_i 's are all constants and ∫_B f (x) dx = 1. Now, we have the desired result.
Theorem 4.3.12 Suppose that X is a random variable. Then we have the following
identity
Var(X) = E[X 2 ] − (E[X])2 (4.3.7)
Proof. We assume that X is continuous with density f , and let E[X] = µ. Then,
Var(X) = E[(X − µ)²] = E[X² − 2µX + µ²] = ∫_{−∞}^{∞} (x² − 2µx + µ²) f (x) dx
       = ∫_{−∞}^{∞} x² f (x) dx − 2µ ∫_{−∞}^{∞} x f (x) dx + µ² ∫_{−∞}^{∞} f (x) dx
       = E[X²] − 2µ · µ + µ² = E[X²] − µ²
A similar proof holds in the discrete case, and so we obtain the requested useful identity
Example 4.3.13 Calculate Var(X) when X represents the outcome when a fair die is rolled. As previously noted in the Example 4.1.1, E[X] = 7/2. Also,

E[X²] = 1(1/6) + 4(1/6) + 9(1/6) + 16(1/6) + 25(1/6) + 36(1/6) = 91/6

It follows that

Var(X) = 91/6 − 49/4 = 35/12
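The identity Var(X) = E[X²] − (E[X])² is easy to verify for the die in exact rational arithmetic (a sketch using Python's fractions module):

```python
from fractions import Fraction

# Verify Var(X) = E[X^2] - (E[X])^2 for a fair die, exactly.
values = range(1, 7)
mean = sum(Fraction(x, 6) for x in values)               # 7/2
second_moment = sum(Fraction(x * x, 6) for x in values)  # 91/6
variance = second_moment - mean ** 2

print(variance)  # 35/12
```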
Example 4.3.14 (Bernoulli random Variable) Recall that a Bernoulli random vari-
able X with parameter p takes the values 1 and 0 with probability p and 1 − p respectively.
We have seen in the Example 4.1.2 that E[X] = p. Note that

    E[X^2] = 1^2 · p + 0^2 · (1 − p) = p

Hence

    Var(X) = E[X^2] − (E[X])^2 = p − p^2 = p(1 − p)

Now let X be a binomial random variable with parameters n and p, and let us calculate
E[X(X − 1)]. Using the properties of the Binomial development we get
    E[X(X − 1)] = Σ_{x=0}^{n} x(x − 1) C(n, x) p^x (1 − p)^{n−x}
                = Σ_{x=0}^{n} x(x − 1) [n! / (x!(n − x)!)] p^x (1 − p)^{n−x}
                = Σ_{x=2}^{n} [n! / ((x − 2)!(n − x)!)] p^x (1 − p)^{n−x}
                = n(n − 1)p^2 Σ_{x=2}^{n} [(n − 2)! / ((x − 2)!(n − x)!)] p^{x−2} (1 − p)^{n−x}
                = n(n − 1)p^2 Σ_{y=0}^{m} [m! / (y!(m − y)!)] p^y (1 − p)^{m−y}    (m = n − 2, y = x − 2)
                = n(n − 1)p^2 (p + (1 − p))^m = n(n − 1)p^2

It follows that E[X^2] = E[X(X − 1)] + E[X] = n(n − 1)p^2 + np, and hence

    Var(X) = n(n − 1)p^2 + np − (np)^2 = np(1 − p)
Similarly, for a Poisson random variable X with parameter λ,

    E[X(X − 1)] = Σ_{x=0}^{∞} x(x − 1) e^{−λ} λ^x / x! = e^{−λ} Σ_{x=2}^{∞} x(x − 1) λ^x / x!
                = e^{−λ} Σ_{x=2}^{∞} λ^x / (x − 2)! = e^{−λ} λ^2 Σ_{k=0}^{∞} λ^k / k!
                = e^{−λ} λ^2 e^λ = λ^2

where we have used the identity Σ_{k=0}^{∞} λ^k / k! = e^λ. Hence

    Var(X) = E[X(X − 1)] + E[X] − (E[X])^2 = λ^2 + λ − λ^2 = λ
For a uniform random variable X on (α, β),

    E[X^2] = ∫_α^β x^2 / (β − α) dx = (β^3 − α^3) / (3(β − α)) = (β^2 + βα + α^2) / 3

Hence

    Var(X) = E[X^2] − (E[X])^2 = (β^2 + βα + α^2)/3 − ((α + β)/2)^2 = (β − α)^2 / 12
Example 4.3.18 (Exponential random Variable) Let X be exponentially distributed
with parameter λ. We have seen in the Example 4.2.2 that E[X] = 1/λ. We first find E[X^2].
Using the Method of Integration by Parts and the Example 4.2.2 one can see that

    E[X^2] = ∫_0^∞ x^2 λ e^{−λx} dx = [−x^2 e^{−λx}]_0^∞ + ∫_0^∞ 2x e^{−λx} dx
           = 0 + (2/λ) ∫_0^∞ x λ e^{−λx} dx = (2/λ)(1/λ) = 2/λ^2

Hence Var(X) = E[X^2] − (E[X])^2 = 2/λ^2 − 1/λ^2 = 1/λ^2.
Exercise 4.3.1 Let Y be a geometric random variable with parameter p. Show that
E[Y^2] = (2 − p)/p^2. Using the result from the Example 4.1.4, show that Var(Y) = (1 − p)/p^2.
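The binomial factorial-moment identity E[X(X − 1)] = n(n − 1)p^2 derived above can be verified by direct summation; the sketch below (with n and p chosen arbitrarily) uses exact rational arithmetic:

```python
from fractions import Fraction
from math import comb

# Direct check of E[X(X-1)] = n(n-1)p^2 for X ~ Binomial(n, p).
n, p = 10, Fraction(3, 10)
pmf = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

fact_moment = sum(x * (x - 1) * pr for x, pr in pmf.items())
mean = sum(x * pr for x, pr in pmf.items())
variance = fact_moment + mean - mean**2   # Var = E[X(X-1)] + E[X] - (E[X])^2

assert fact_moment == n * (n - 1) * p**2
assert variance == n * p * (1 - p)
print(variance)                           # 21/10
```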
Example 4.3.19 (Gamma random Variable) Let X be a Gamma random variable with pa-
rameters (α, β) with α > 0, β > 0. Recall that the probability density function of X is
given by

    f(x) = { β^α x^{α−1} e^{−βx} / Γ(α),  if x ≥ 0
           { 0,                            if x < 0        (4.3.8)
Let us compute the k-th moment of the gamma distribution. To do this we first recall (see
Real Analysis Course) that ∫_0^∞ x^{s−1} e^{−x} dx = Γ(s) and Γ(α + 1) = αΓ(α). Then

    E[X^k] = ∫_0^∞ x^k β^α x^{α−1} e^{−βx} / Γ(α) dx = (β^α / Γ(α)) ∫_0^∞ x^{α+k−1} e^{−βx} dx
           = (β^α Γ(α + k) / (Γ(α) β^{α+k})) ∫_0^∞ β^{α+k} x^{α+k−1} e^{−βx} / Γ(α + k) dx
           = Γ(α + k) / (Γ(α) β^k)
           = (α + k − 1)Γ(α + k − 1) / (Γ(α) β^k) = (α + k − 1)(α + k − 2) · · · α Γ(α) / (Γ(α) β^k)
           = (α + k − 1)(α + k − 2) · · · α / β^k

We have used the fact that

    ∫_0^∞ β^{α+k} x^{α+k−1} e^{−βx} / Γ(α + k) dx = 1

since f(x) = β^{α+k} x^{α+k−1} e^{−βx} / Γ(α + k), x ≥ 0, is a probability density function of a Gamma
random variable with parameters α + k and β.
Therefore E[X] = α/β and E[X^2] = α(α + 1)/β^2. It follows that

    Var(X) = E[X^2] − (E[X])^2 = α(α + 1)/β^2 − α^2/β^2 = α/β^2
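As a numerical sanity check (not in the original notes; α, β, and k are chosen arbitrarily), the moment formula can be compared against Simpson's-rule integration of the gamma density:

```python
from math import gamma, exp

# Numerical check of the gamma k-th moment E[X^k] = (a+k-1)...(a)/b^k
# for alpha = 2, beta = 1, k = 2, so the closed form is 3*2/1 = 6.
alpha, beta, k = 2.0, 1.0, 2

def pdf(x):
    return beta**alpha * x**(alpha - 1) * exp(-beta * x) / gamma(alpha)

def simpson(f, a, b, n=6000):          # composite Simpson rule, n even
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + (2*i - 1) * h) for i in range(1, n // 2 + 1))
    s += 2 * sum(f(a + 2*i * h) for i in range(1, n // 2))
    return s * h / 3

numeric = simpson(lambda x: x**k * pdf(x), 0.0, 60.0)
closed_form = (alpha + 1) * alpha / beta**k
assert abs(numeric - closed_form) < 1e-5
print(numeric)   # about 6.0
```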
Exercise 4.3.2 Consider the following cumulative distribution function of a random vari-
able X:

    F(x) = { 0,                      if x < 2
           { −(1/4)x^2 + 2x − 3,    if 2 ≤ x ≤ 4
           { 1,                      if x > 4

(i) Find the probability density function of X. (ii) Calculate P{X = 4} and P{X < 3}.
(iii) Calculate E[X] and Var(X).
Solution 4.3.20 (i) The first derivative of the cumulative distribution function yields
the pdf, that is F'(x) = f(x):

    f(x) = { 0,             if x < 2
           { −(1/2)x + 2,   if 2 ≤ x ≤ 4
           { 0,             if x > 4

(ii) It is known from the Equation 3.3.2 that for any continuous random variable P{X =
a} = 0 and therefore P{X = 4} = 0. We calculate

    P{X < 3} = P{X ≤ 3} = F(3) = ∫_2^3 (−(1/2)x + 2) dx = [−(1/4)x^2 + 2x]_2^3
             = (−9/4 + 6) − (−1 + 4) = 0.75

(iii) Using the Equation 4.2.1 we obtain the expectation as follows:

    E[X] = ∫_{−∞}^{∞} x f(x) dx = ∫_2^4 x(−(1/2)x + 2) dx = [−(1/6)x^3 + x^2]_2^4
         = (−64/6 + 16) − (−8/6 + 4) = 8/3

Given that we have already calculated E[X], we can use the Theorem 4.3.12 to calculate
the variance as Var(X) = E[X^2] − (E[X])^2. The expectation of X^2 is

    E[X^2] = ∫_2^4 x^2(−(1/2)x + 2) dx = [−(1/8)x^4 + (2/3)x^3]_2^4
           = (−32 + 128/3) − (−2 + 16/3) = 22/3

We thus obtain Var(X) = 22/3 − (8/3)^2 = (66 − 64)/9 = 2/9.
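The integrals in Exercise 4.3.2 are polynomial, so they can be checked exactly by coding the antiderivatives (a sketch, not part of the notes):

```python
from fractions import Fraction as F

# Exact verification of Exercise 4.3.2 with f(x) = -x/2 + 2 on [2, 4].
def P(x):   # antiderivative of x*f(x)   = -x^2/2 + 2x  ->  -x^3/6 + x^2
    x = F(x)
    return -x**3 / 6 + x**2

def Q(x):   # antiderivative of x^2*f(x) = -x^3/2 + 2x^2 -> -x^4/8 + 2x^3/3
    x = F(x)
    return -x**4 / 8 + F(2, 3) * x**3

mean = P(4) - P(2)            # E[X]   = 8/3
second = Q(4) - Q(2)          # E[X^2] = 22/3
variance = second - mean**2   # 2/9
print(mean, second, variance)
```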
As a further example, let X have probability density function

    f(x) = { cx(2 − x),  if 0 ≤ x ≤ 2
           { 0,          elsewhere

Normalization (∫_0^2 cx(2 − x) dx = 4c/3 = 1) forces c = 3/4, and the cumulative
distribution function is

    F(x) = { 0,                        if x < 0
           { (3/4)x^2 − (1/4)x^3,     if 0 ≤ x ≤ 2
           { 1,                        if 2 < x
Chapter 5

Jointly Distributed Random Variables
Thus far, we have concerned ourselves with the probability distribution of a single random
variable. However, we are often interested in probability statements concerning two or
more random variables. To deal with such probabilities, we define, for any two random
variables X and Y , the joint cumulative probability distribution function of X and Y by

    F(a, b) = P{X ≤ a, Y ≤ b},    −∞ < a, b < ∞
Even though two random variables X and Y may be jointly distributed, if interest is fo-
cused on only one of the random variables, then it is appropriate to consider the probability
distribution of that random variable alone. This is known as the marginal distribution of
that random variable and can be obtained quite simply by summing or integrating the
joint probability distribution over the values of the other random variable. In other words,
we have the following:
The distribution of X can be obtained from the joint distribution of X and Y as follows:

    F_X(a) = P{X ≤ a} = P{X ≤ a, Y < ∞} = F(a, ∞)

and, similarly, F_Y(b) = F(∞, b).
In the case where X and Y are both discrete random variables, it is convenient to define
the joint probability mass function of X and Y by
p(x, y) = P {X = x, Y = y}
The probability mass function of X can be obtained from p(x, y) by

    p_X(x) = Σ_{y : p(x,y)>0} p(x, y)    for each x

Similarly,

    p_Y(y) = Σ_{x : p(x,y)>0} p(x, y)    for each y

The function p_X(x) is called the marginal probability mass function of X, and p_Y(y) is
called the marginal probability mass function of Y .
Example 5.1.1 The joint cdf of two discrete random variables X and Y is given as
follows:
    F_{X,Y}(x, y) = { 1/8,  x = 1, y = 1
                    { 5/8,  x = 1, y = 2
                    { 1/4,  x = 2, y = 1
                    { 1,    x = 2, y = 2

Determine (i) the joint probability mass function of X and Y and (ii) the marginal
probability mass functions of X and Y .
Solution 5.1.2 (i) The joint probability mass function is obtained from the relationship:

    F_{X,Y}(a, b) = P{X ≤ a, Y ≤ b} = Σ_{y≤b} Σ_{x≤a} p(x, y)

Hence

    F_{X,Y}(1, 1) = p_{X,Y}(1, 1) = 1/8
    F_{X,Y}(1, 2) = p_{X,Y}(1, 1) + p_{X,Y}(1, 2) = 5/8  =⇒  p_{X,Y}(1, 2) = 5/8 − 1/8 = 1/2
    F_{X,Y}(2, 1) = p_{X,Y}(1, 1) + p_{X,Y}(2, 1) = 1/4  =⇒  p_{X,Y}(2, 1) = 1/4 − 1/8 = 1/8
    F_{X,Y}(2, 2) = p_{X,Y}(1, 1) + p_{X,Y}(1, 2) + p_{X,Y}(2, 1) + p_{X,Y}(2, 2) = 1
                 =⇒  p_{X,Y}(2, 2) = 1 − 1/8 − 1/2 − 1/8 = 1/4

(ii) The marginal probability mass function of X is

    p_X(1) = p_{X,Y}(1, 1) + p_{X,Y}(1, 2) = 1/8 + 1/2 = 5/8
    p_X(2) = p_{X,Y}(2, 1) + p_{X,Y}(2, 2) = 1/8 + 1/4 = 3/8

and that of Y is

    p_Y(1) = p_{X,Y}(1, 1) + p_{X,Y}(2, 1) = 1/8 + 1/8 = 1/4
    p_Y(2) = p_{X,Y}(1, 2) + p_{X,Y}(2, 2) = 1/2 + 1/4 = 3/4
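The pmf recovery in Example 5.1.1 is an instance of inclusion-exclusion on the joint cdf; a minimal sketch (not part of the notes) that reproduces the numbers:

```python
from fractions import Fraction as F

# Recover the joint pmf of Example 5.1.1 from its cdf, then marginalize.
cdf = {(1, 1): F(1, 8), (1, 2): F(5, 8), (2, 1): F(1, 4), (2, 2): F(1)}

def pmf(a, b):
    # inclusion-exclusion on the rectangle (a-1, a] x (b-1, b]
    def Fc(x, y):
        return cdf.get((x, y), F(0))
    return Fc(a, b) - Fc(a - 1, b) - Fc(a, b - 1) + Fc(a - 1, b - 1)

p = {(x, y): pmf(x, y) for x in (1, 2) for y in (1, 2)}
pX = {x: p[(x, 1)] + p[(x, 2)] for x in (1, 2)}
pY = {y: p[(1, y)] + p[(2, y)] for y in (1, 2)}
print(p, pX, pY)
```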
Example 5.1.3 Suppose that a coin is tossed twice. Let X be the total number of heads
shown and Y the total number of tails. The sample space is

    S = {HH, HT, TH, TT}

Clearly, p(x, y) = P{X = x, Y = y} is zero, except at the points (0, 2), (1, 1) and (2, 0).
Furthermore,

    p(0, 2) = P{TT} = 1/4,    p(1, 1) = P{HT, TH} = 1/2,    p(2, 0) = P{HH} = 1/4
Exercise 5.1.1 A fair coin is tossed three times. Let X be a random variable that takes
the value 0 if the first toss is a tail and the value 1 if the first toss is a head. Also, let Y
be a random variable that defines the total number of heads in the three tosses.
Solution 5.1.4 Let H denote the event that a head appears on a toss and T the event
that a tail appears on a toss. The sample space is given by

    S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}

The random variable X can take the values 0 and 1 while the random variable Y can take
the values 0, 1, 2 and 3. Each of these eight sample points is equally likely to be obtained,
with probability 1/8. The joint probability mass function of X and Y can be constructed
as follows:

    p_{X,Y}(0, 0) = P{X = 0, Y = 0} = P(TTT) = 1/8
    p_{X,Y}(0, 1) = P{X = 0, Y = 1} = P(THT) + P(TTH) = 1/4
    p_{X,Y}(0, 2) = P{X = 0, Y = 2} = P(THH) = 1/8
    p_{X,Y}(0, 3) = P{X = 0, Y = 3} = 0
    p_{X,Y}(1, 0) = P{X = 1, Y = 0} = 0
    p_{X,Y}(1, 1) = P{X = 1, Y = 1} = P(HTT) = 1/8
    p_{X,Y}(1, 2) = P{X = 1, Y = 2} = P(HHT) + P(HTH) = 1/4
    p_{X,Y}(1, 3) = P{X = 1, Y = 3} = P(HHH) = 1/8
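Joint pmfs built from a small sample space can also be obtained by brute-force enumeration; a sketch reproducing Exercise 5.1.1:

```python
from fractions import Fraction as F
from itertools import product

# X = 1 iff the first of three fair-coin tosses is a head; Y = total heads.
joint = {}
for outcome in product("HT", repeat=3):
    x = 1 if outcome[0] == "H" else 0
    y = outcome.count("H")
    joint[(x, y)] = joint.get((x, y), F(0)) + F(1, 8)

print(sorted(joint.items()))
```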
We say that X and Y are jointly continuous if there exists a function f(x, y) (sometimes
denoted by f_{X,Y}(x, y)), defined for all real x and y, having the property that for all sets
A and B of real numbers

    P{X ∈ A, Y ∈ B} = ∫_B ∫_A f(x, y) dx dy

The function f(x, y) is called the joint probability density function of X and Y . The
probability density of X can be obtained from a knowledge of f(x, y) by the following
reasoning:

    P{X ∈ A} = P{X ∈ A, Y ∈ (−∞, ∞)} = ∫_A ∫_{−∞}^{∞} f(x, y) dy dx = ∫_A f_X(x) dx

where

    f_X(x) = ∫_{−∞}^{∞} f(x, y) dy,    for −∞ < x < ∞

The function f_X(x) is called the marginal probability density function of X; similarly,
f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx is called the marginal probability density function of Y .
Example 5.1.5 Suppose that the joint pdf of two random variables X and Y is given by
    f(x, y) = { (6/5)(x + y^2),  0 ≤ x ≤ 1, 0 ≤ y ≤ 1
              { 0,               otherwise

(i) Verify that f(x, y) is a probability density function.
(ii) Determine P{0 ≤ X ≤ 1/4, 0 ≤ Y ≤ 1/4}.
(iii) Find the marginal pdf of X.
(v) Find P{1/4 ≤ Y ≤ 3/4}.

Solution 5.1.6 (i) To verify that the given function is a legitimate pdf, note that
f(x, y) ≥ 0 for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1 and

    ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = ∫_0^1 ∫_0^1 (6/5)(x + y^2) dx dy
        = (6/5) ∫_0^1 x dx + (6/5) ∫_0^1 y^2 dy = 6/10 + 6/15 = 1

(iii) The marginal pdf of X is given by

    f_X(x) = ∫_{−∞}^{∞} f(x, y) dy = ∫_0^1 (6/5)(x + y^2) dy = (6/5)x + 2/5

for 0 ≤ x ≤ 1 and 0 otherwise.
A variation of the Theorem 4.3.3 states that if X and Y are random variables and g is a
function of two variables, then
    E[g(X, Y)] = Σ_y Σ_x g(x, y) p(x, y)    in the discrete case        (5.1.1)

    E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f(x, y) dx dy    in the continuous case        (5.1.2)

For example, taking g(x, y) = x + y gives

    E[X + Y] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x + y) f(x, y) dx dy
             = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x f(x, y) dx dy + ∫_{−∞}^{∞} ∫_{−∞}^{∞} y f(x, y) dx dy
             = E[X] + E[Y]

where the first integral is evaluated by using the variation of the Theorem 4.3.3 with
g(x, y) = x, and the second with g(x, y) = y. The same result holds in the discrete case
and, combined with the Corollary 4.3.6, yields that for any constants a, b

    E[aX + bY] = aE[X] + bE[Y]        (5.1.3)

Joint probability distributions may also be defined for n random variables. The details
are exactly the same as when n = 2 and are left as an exercise. The corresponding result
to Equation 5.1.3 states that if X_1, X_2, ..., X_n are n random variables, then for any n
constants a_1, a_2, ..., a_n,

    E[a_1 X_1 + a_2 X_2 + · · · + a_n X_n] = a_1 E[X_1] + a_2 E[X_2] + · · · + a_n E[X_n]        (5.1.4)
Example 5.1.7 Calculate the expected sum obtained when three fair dice are rolled.
Let X denote the sum obtained. Then X = X_1 + X_2 + X_3 where X_i represents the value
of the i-th die. Thus,

    E[X] = E[X_1] + E[X_2] + E[X_3] = 3 · (7/2) = 21/2
Example 5.1.8 As another example of the usefulness of the Equation 5.1.4, let us use
it to obtain the expectation of a binomial random variable having parameters n and p.
Recalling that such a random variable X represents the number of successes in n trials
when each trial has probability p of being a success, we have that X = X1 + X2 + · · · + Xn
where

    X_i = { 1,  if the i-th trial is a success
          { 0,  if the i-th trial is a failure

Hence E[X_i] = P{X_i = 1} = p, and therefore E[X] = E[X_1] + · · · + E[X_n] = np.
This derivation should be compared with the one presented in the Example 4.1.3.
Example 5.1.9 At a party N men throw their hats into the center of a room. The hats
are mixed up and each man randomly selects one. Find the expected number of men who
select their own hats.
Letting X denote the number of men that select their own hats, we can best compute E[X]
by noting that X = X1 + X2 + · · · + XN where
    X_i = { 1,  if the i-th man selects his own hat
          { 0,  otherwise

Now, because the i-th man is equally likely to select any of the N hats, it follows that

    P{X_i = 1} = P{i-th man selects his own hat} = 1/N

and so

    E[X_i] = 1 · P{X_i = 1} + 0 · P{X_i = 0} = 1/N

Hence, from the Equation 5.1.4 we obtain that

    E[X] = E[X_1 + X_2 + · · · + X_N] = N · (1/N) = 1
Hence, no matter how many people are at the party, on the average exactly one of the
men will select his own hat.
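For small N the hat-matching answer can be confirmed by exhaustive enumeration (a sketch, with N = 5 chosen for illustration):

```python
from fractions import Fraction as F
from itertools import permutations

# Average number of fixed points over all 5! equally likely permutations.
N = 5
total_matches = sum(
    sum(1 for i, hat in enumerate(perm) if hat == i)
    for perm in permutations(range(N))
)
expected = F(total_matches, 120)   # 120 = 5!
print(expected)                    # 1
```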
Example 5.1.10 Suppose there are 25 different types of coupons and suppose that each
time one obtains a coupon, it is equally likely to be any one of the 25 types. Compute the
expected number of different types that are contained in a set of 10 coupons.
Let X denote the number of different types in the set of 10 coupons. We compute E[X]
by using the representation X = X1 + X2 + · · · + X25 , where
(
1, if at least one type i coupon is in the set of 10
Xi =
0, otherwise
Now

    E[X_i] = P{X_i = 1} = P{at least one type i coupon is in the set of 10}
           = 1 − P{no type i coupons are in the set of 10} = 1 − (24/25)^{10}

where the last equality follows since each of the 10 coupons will (independently) not be
of type i with probability 24/25. Hence,

    E[X] = E[X_1] + E[X_2] + · · · + E[X_{25}] = 25 [1 − (24/25)^{10}]
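The same answer can be reached by a one-step recursion on the expected number of distinct types, e_{m+1} = e_m + (25 − e_m)/25 (each new coupon is, in expectation, a new type with probability (25 − e_m)/25, by linearity); a sketch confirming it agrees with the closed form:

```python
from fractions import Fraction as F

# Expected number of distinct coupon types among 10 draws from 25 types.
e = F(0)
for _ in range(10):
    e = e + (25 - e) / 25              # expected number of new types added

closed_form = 25 * (1 - F(24, 25) ** 10)
assert e == closed_form
print(float(e))                        # about 8.38
```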
The random variables X and Y are said to be independent if, for all a, b,
P {X ≤ a, Y ≤ b} = P {X ≤ a}P {Y ≤ b} (5.2.1)
In other words, X and Y are independent if, for all a and b, the events Ea = {X ≤ a}
and Fb = {Y ≤ b} are independent.
In terms of the joint distribution function F_{X,Y} of X and Y , we have that X and Y are
independent if

    F_{X,Y}(a, b) = F_X(a) F_Y(b)    for all a, b

When X and Y are discrete, the condition of independence reduces to

    p(x, y) = p_X(x) p_Y(y)        (5.2.2)

while in the jointly continuous case it reduces to f(x, y) = f_X(x) f_Y(y).
To prove this statement, consider first the discrete version, and suppose that the joint
probability mass function p(x, y) satisfies the Equation 5.2.2. Then

    P{X ≤ a, Y ≤ b} = Σ_{y≤b} Σ_{x≤a} p(x, y) = Σ_{y≤b} Σ_{x≤a} p_X(x) p_Y(y)
                    = Σ_{y≤b} p_Y(y) Σ_{x≤a} p_X(x) = P{X ≤ a} P{Y ≤ b}

and so X and Y are independent. The proof that the Equation 5.2.2 implies independence
in the continuous case proceeds in the same manner and is left as an exercise.
Example 5.2.1 Let X and Y be two random variables and suppose that X and Y have
a joint probability density function of fX,Y (x, y) = 6xy 2 for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1 and
fX,Y (x, y) = 0 elsewhere. Check whether X and Y are independent.
The fact that this joint density function is a function of x multiplied by a function of
y immediately indicates that the two random variables are independent. The probability
density function of X is given by

    f_X(x) = ∫_0^1 f_{X,Y}(x, y) dy = ∫_0^1 6xy^2 dy = 2x

for 0 ≤ x ≤ 1, and similarly f_Y(y) = ∫_0^1 6xy^2 dx = 3y^2 for 0 ≤ y ≤ 1. The fact that
f_{X,Y}(x, y) = f_X(x) f_Y(y) confirms that the random variables X and Y are independent.
Example 5.2.2 (Exercise 5.1.1 continued) Consider the random variables X and Y
defined in the Exercise 5.1.1. Are X and Y independent?

Solution 5.2.3 If X and Y are independent, then for all x and y we have that
p_{X,Y}(x, y) = p_X(x) p_Y(y).
Thus, to show that X and Y are not independent, all we have to do is to find a pair of x
and y at which the joint probability mass function does not satisfy this condition. Since

    p_X(0) p_Y(0) = (1/2) × (1/8) = 1/16 ≠ p_{X,Y}(0, 0) = 1/8

we conclude that X and Y are not independent.
Exercise 5.2.1 Determine if random variables X and Y are independent when their joint
pdf is given by

    f_{X,Y}(x, y) = { e^{−(x+y)},  0 ≤ x < ∞, 0 ≤ y < ∞
                    { 0,           otherwise

Solution 5.2.5 To answer the question, we first evaluate the marginal pdfs of X and Y :

    f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy = ∫_0^∞ e^{−(x+y)} dy = e^{−x} ∫_0^∞ e^{−y} dy = e^{−x},   x ≥ 0
    f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx = ∫_0^∞ e^{−(x+y)} dx = e^{−y} ∫_0^∞ e^{−x} dx = e^{−y},   y ≥ 0

Now, f_X(x) f_Y(y) = e^{−x} e^{−y} = e^{−(x+y)} = f_{X,Y}(x, y), which means that X and Y are
independent.
Exercise 5.2.2 Determine if random variables X and Y are independent when their joint
pdf is given by
    f_{X,Y}(x, y) = { 2e^{−(x+y)},  0 ≤ x < y < ∞
                    { 0,            otherwise

Solution 5.2.6 To answer the question, we first evaluate the marginal pdfs of X and Y :

    f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy = 2 ∫_x^∞ e^{−(x+y)} dy = 2e^{−x} ∫_x^∞ e^{−y} dy = 2e^{−2x},   x ≥ 0
    f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx = 2 ∫_0^y e^{−(x+y)} dx = 2e^{−y} ∫_0^y e^{−x} dx = 2e^{−y}(1 − e^{−y}),   y ≥ 0

Now, f_X(x) f_Y(y) = 4e^{−(2x+y)} − 4e^{−2(x+y)} ≠ f_{X,Y}(x, y), which means that X and Y are
not independent.
Exercise 5.2.3 Suppose that a fair coin is tossed three times so that there are eight
equally likely outcomes, and that the random variable X is the number of heads obtained
in the first and second tosses, the random variable Y is the number of heads in the third
toss, and the random variable Z is the number of heads obtained in the second and third
tosses.
(a) (i) Find the joint probability mass function for X and Y .
(ii) Find the marginal distribution of X.
(iii) Find the marginal distribution of Y .
(b) (i) Find the joint probability mass function for X and Z.
(ii) Find the marginal distribution of X.
(iii) Find the marginal distribution of Z.
(c) (i) Find the joint probability mass function for Y and Z.
(ii) Find the marginal distribution of Y .
(iii) Find the marginal distribution of Z.
5.3 Covariance and Variance of Sums of Random Variables
Definition 5.3.1 The covariance of any two random variables X and Y , denoted by
Cov(X, Y), is defined by

    Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]        (5.3.1)

Note that if X and Y are independent, then by the Theorem 5.2.4 it follows that
Cov(X, Y) = 0.
Let us consider now the special case where X and Y are indicator variables for whether
or not the events A and B occur. That is, for events A and B, define

    X = { 1, if A occurs        and        Y = { 1, if B occurs
        { 0, otherwise                         { 0, otherwise

Then XY = 1 if both A and B occur and XY = 0 otherwise, so

    Cov(X, Y) = E[XY] − E[X]E[Y] = P{X = 1, Y = 1} − P{X = 1}P{Y = 1}

That is, the covariance of X and Y is positive if the outcome X = 1 makes it more likely
that Y = 1 (which, as is easily seen by symmetry, also implies the reverse).
Properties of Covariance
For any random variables X, Y, Z and any real constant α, the following hold:

(i) Cov(X, Y) = Cov(Y, X)
(ii) Cov(X, X) = Var(X)
(iii) Cov(αX, Y) = α Cov(X, Y)
(iv) Cov(X + Z, Y) = Cov(X, Y) + Cov(Z, Y)

Whereas the first three properties are immediate, the final one is easily proven as follows:

    Cov(X + Z, Y) = E[(X + Z)Y] − E[X + Z]E[Y]
                  = E[XY] − E[X]E[Y] + E[ZY] − E[Z]E[Y] = Cov(X, Y) + Cov(Z, Y)

The fourth property listed easily generalizes to give the following result:

    Cov(Σ_{i=1}^{n} X_i, Σ_{j=1}^{m} Y_j) = Σ_{i=1}^{n} Σ_{j=1}^{m} Cov(X_i, Y_j)        (5.3.2)
A useful expression for the variance of the sum of random variables can be obtained from
the Equation 5.3.2 as follows:
n
! n n
! n X
n
X X X X
Var Xi = Cov Xi , Xj = Cov(Xi , Xj )
i=1 i=1 j=1 i=1 j=1
Xn Xn X
= Cov(Xi , Xi ) + Cov(Xi , Xj )
i=1 i=1 j6=i
n
X Xn X
= Var(Xi ) + 2 Cov(Xi , Xj ) (5.3.3)
i=1 i=1 j<i
If Xi , i = 1, ..., n are independent random variables, then the previous expression reduces
to !
Xn Xn
Var Xi = Var(Xi ) (5.3.4)
i=1 i=1
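Equation 5.3.3 also holds exactly for empirical (sample) moments; a Monte Carlo sketch for n = 2 (the construction of the correlated data is arbitrary):

```python
import random

# Check Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y) on sample moments.
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(10000)]
ys = [x + random.gauss(0, 0.5) for x in xs]   # Y correlated with X

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return mean([(a - mu) * (b - mv) for a, b in zip(u, v)])

s = [a + b for a, b in zip(xs, ys)]
lhs = cov(s, s)                                # Var(X + Y)
rhs = cov(xs, xs) + cov(ys, ys) + 2 * cov(xs, ys)
assert abs(lhs - rhs) < 1e-9                   # identity is exact for sample moments
print(lhs, rhs)
```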
We can prove again the following statement (already proved in the Theorem 4.3.7), which
is a particular case of the Property (iii) in the list above.

Corollary 5.3.2 Let X be a random variable. If α and β are any fixed real constants
then

    Var(αX + β) = α^2 Var(X)

Proof. Indeed,

    Var(αX + β) = Cov(αX + β, αX + β) = Cov(αX, αX) + 2 Cov(αX, β) + Cov(β, β)
                = α^2 Var(X)

The last equality follows from the fact that α and β are fixed real constants and hence
Cov(αX, β) = Cov(β, αX) = Cov(β, β) = 0 by the use of the Equation 5.3.1.
In practice, the most convenient way to assess the strength of the dependence between
two random variables is through their correlation coefficient.
Definition 5.3.3 The correlation coefficient between two random variables X and Y is
defined to be

    ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))        (5.3.5)
The correlation coefficient takes values between −1 and 1, and independent random vari-
ables have a correlation of 0.
Random variables with a positive correlation coefficient are said to be positively corre-
lated, and in such cases there is a tendency for high values of one random variable to
be associated with high values of the other random variable. Random variables with a
negative correlation coefficient are said to be negatively correlated, and in such cases there
is a tendency for high values of one random variable to be associated with low values of
the other random variable. The strength of these tendencies increases as the correlation
coefficient moves further away from 0 to 1 or to -1.
It follows that Cov(X, Y) = E(XY) − E(X)E(Y) = 1/2 − (2/3) × (3/4) = 0. This result is
worth noting: a covariance of zero does not, in general, imply that the random variables
are independent.
Definition 5.3.5 If X_1, ..., X_n are independent and identically distributed, then the ran-
dom variable

    X̄ = (1/n) Σ_{i=1}^{n} X_i

is called the sample mean.
The following theorem shows that the covariance between the sample mean and a deviation
from that sample mean is zero. It will be needed later!

Theorem 5.3.6 Suppose that X_1, ..., X_n are independent and identically distributed with
expected value µ and variance σ^2. Then,
(a) E[X̄] = µ,
(b) Var(X̄) = σ^2/n,
(c) Cov(X̄, X_i − X̄) = 0.
Proof. Parts (a) and (b) follow from the linearity of expectation and the Equation 5.3.4:

    E[X̄] = (1/n) Σ_{i=1}^{n} E[X_i] = (1/n) Σ_{i=1}^{n} µ = µ

    Var(X̄) = (1/n)^2 Var(Σ_{i=1}^{n} X_i) = (1/n)^2 Σ_{i=1}^{n} Var(X_i) = σ^2/n

To establish part (c) we reason as follows:

    Cov(X̄, X_i − X̄) = Cov(X̄, X_i) − Cov(X̄, X̄)
                     = Cov((1/n)(X_i + Σ_{j≠i} X_j), X_i) − Var(X̄)
                     = (1/n) Cov(X_i, X_i) + (1/n) Cov(Σ_{j≠i} X_j, X_i) − σ^2/n
                     = (1/n) Var(X_i) − σ^2/n = σ^2/n − σ^2/n = 0

where the final equality used the fact that X_i and Σ_{j≠i} X_j are independent and thus have
covariance 0.
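Part (c) can be checked exactly on a small finite case; a sketch for n = 3 i.i.d. fair-coin variables, enumerating all 8 equally likely outcomes:

```python
from fractions import Fraction as F
from itertools import product

# Exact check of Cov(Xbar, X_1 - Xbar) = 0 for three fair-coin variables.
outcomes = list(product((0, 1), repeat=3))
prob = F(1, 8)

def E(g):
    return sum(prob * g(w) for w in outcomes)

xbar = lambda w: F(sum(w), 3)
dev = lambda w: w[0] - xbar(w)          # X_1 - Xbar
covariance = E(lambda w: xbar(w) * dev(w)) - E(xbar) * E(dev)
assert covariance == 0
print(covariance)
```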
Example 5.3.7 (Variance of a Binomial Random Variable) Compute the variance
of a binomial random variable X with parameters n and p.
Since such a random variable represents the number of successes in n independent trials
when each trial has a common probability p of being a success, we may write X = X1 +
X2 + · · · + Xn where the Xi are independent Bernoulli random variables such that
(
1, if the ith trial is a success
Xi =
0, otherwise
Hence, from the Equation 5.3.4 we obtain Var(X) = Var(X_1) + · · · + Var(X_n). Since
X_i^2 = X_i, we have E[X_i^2] = E[X_i] = p, so Var(X_i) = p − p^2 = p(1 − p) and therefore

    Var(X) = np(1 − p)

Consider next the problem of estimating p, the fraction of a population of N individuals
that is in favor of a certain proposition, by sampling n people at random without
replacement. If we let

    X_i = { 1,  if the i-th person chosen is in favor
          { 0,  otherwise
then the usual estimator of p is (1/n) Σ_{i=1}^{n} X_i. Let us now compute its mean and variance.
Now

    E[Σ_{i=1}^{n} X_i] = Σ_{i=1}^{n} E(X_i) = np

where the final equality follows since the i-th person chosen is equally likely to be any of
the N individuals in the population and so has probability Np/N = p of being in favor.
Also,

    Var(Σ_{i=1}^{n} X_i) = Σ_{i=1}^{n} Var(X_i) + 2 Σ_{i=1}^{n} Σ_{j<i} Cov(X_i, X_j)
Now, since X_i is a Bernoulli random variable with mean p, it follows that Var(X_i) =
p(1 − p). Also, for i ≠ j,

    Cov(X_i, X_j) = E[X_i X_j] − p^2 = P{X_i = 1, X_j = 1} − p^2 = p(Np − 1)/(N − 1) − p^2

Hence

    Var(Σ_{i=1}^{n} X_i) = np(1 − p) + 2 C(n, 2) [p(Np − 1)/(N − 1) − p^2]
                         = np(1 − p) − n(n − 1)p(1 − p)/(N − 1)

and so the mean and variance of our estimator are given by

    E[(1/n) Σ_{i=1}^{n} X_i] = p

    Var((1/n) Σ_{i=1}^{n} X_i) = p(1 − p)/n − (n − 1)p(1 − p)/(n(N − 1))
Some remarks are in order: As the mean of the estimator is the unknown value p, we
would like its variance to be as small as possible, and we see by the preceding that, as
a function of the population size N, the variance increases as N increases. The limiting
value, as N → ∞, of the variance is p(1 − p)/n, which is not surprising since for N large each
of the X_i will be (approximately) independent random variables, and thus Σ_{i=1}^{n} X_i will
have an (approximately) binomial distribution with parameters n and p.
The random variable Σ_{i=1}^{n} X_i can be thought of as representing the number of white balls
obtained when n balls are randomly selected from a population consisting of Np white
and N − Np black balls. (Identify a person who favors the proposition with a white ball
and one against with a black ball.) Such a random variable is called hypergeometric and
has a probability mass function given by
    P{Σ_{i=1}^{n} X_i = k} = C(Np, k) C(N − Np, n − k) / C(N, n)        (5.3.6)
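The hypergeometric pmf (5.3.6) can be checked directly; a sketch (parameters N = 20, n = 5, Np = 8 chosen for illustration) verifying that it sums to 1 and has mean np:

```python
from fractions import Fraction as F
from math import comb

# Hypergeometric pmf: k white balls among n drawn from Np white, N - Np black.
N, n, Np = 20, 5, 8
pmf = {k: F(comb(Np, k) * comb(N - Np, n - k), comb(N, n)) for k in range(n + 1)}

assert sum(pmf.values()) == 1                                  # Vandermonde identity
assert sum(k * pr for k, pr in pmf.items()) == F(n * Np, N)    # mean = np = 2
print(dict(pmf))
```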
Suppose first that X and Y are continuous, X having probability density f and Y having
probability density g. Then, letting FX+Y (a) be the cumulative distribution function of
X + Y , we have
    F_{X+Y}(a) = P{X + Y ≤ a} = ∫∫_{x+y≤a} f(x) g(y) dx dy
               = ∫_{−∞}^{∞} ∫_{−∞}^{a−y} f(x) g(y) dx dy = ∫_{−∞}^{∞} ( ∫_{−∞}^{a−y} f(x) dx ) g(y) dy
               = ∫_{−∞}^{∞} F_X(a − y) g(y) dy        (5.3.7)

The cumulative distribution function F_{X+Y} is called the convolution of the distributions
F_X and F_Y (the cumulative distribution functions of X and Y , respectively).
By differentiating the Equation 5.3.7, we obtain that the probability density function
f_{X+Y}(a) of X + Y is given by

    f_{X+Y}(a) = (d/da) ∫_{−∞}^{∞} F_X(a − y) g(y) dy
               = ∫_{−∞}^{∞} (d/da) F_X(a − y) g(y) dy = ∫_{−∞}^{∞} f(a − y) g(y) dy        (5.3.8)
For example, if X and Y are independent random variables, both uniformly distributed
on (0, 1), then f(a − y)g(y) = 1 precisely when 0 ≤ y ≤ 1 and 0 ≤ a − y ≤ 1.
For 0 ≤ a ≤ 1, this yields

    f_{X+Y}(a) = ∫_0^a dy = a

For 1 < a < 2, this yields

    f_{X+Y}(a) = ∫_{a−1}^1 dy = 2 − a

Hence,

    f_{X+Y}(a) = { a,      0 ≤ a ≤ 1
                 { 2 − a,  1 < a < 2
                 { 0,      otherwise
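The triangular shape of this density can be seen by evaluating the convolution integral numerically; a sketch using a simple midpoint Riemann sum:

```python
# Numerical convolution of two Uniform(0,1) densities:
# f_{X+Y}(a) should be a on [0,1] and 2 - a on [1,2].
def f_sum(a, steps=20000):
    # midpoint-rule approximation of the integral of f(a-y)g(y) over [0,1]
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        y = (i + 0.5) * h
        total += h if 0.0 <= a - y <= 1.0 else 0.0
    return total

assert abs(f_sum(0.5) - 0.5) < 1e-3
assert abs(f_sum(1.0) - 1.0) < 1e-3
assert abs(f_sum(1.5) - 0.5) < 1e-3
print(f_sum(0.5), f_sum(1.0), f_sum(1.5))
```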
Rather than deriving a general expression for the distribution of X + Y in the discrete
case, we shall consider an example.
Example 5.3.10 (Sums of independent Poisson random variables) Let X and Y
be independent Poisson random variables with respective parameters λ_1 and λ_2, and let
us compute the distribution of X + Y .
Since the event {X + Y = n} may be written as the union of the disjoint events {X =
k, Y = n − k}, 0 ≤ k ≤ n, we have

    P{X + Y = n} = Σ_{k=0}^{n} P{X = k, Y = n − k} = Σ_{k=0}^{n} P{X = k} P{Y = n − k}
                 = Σ_{k=0}^{n} e^{−λ_1} (λ_1^k / k!) e^{−λ_2} (λ_2^{n−k} / (n − k)!)
                 = e^{−(λ_1+λ_2)} Σ_{k=0}^{n} λ_1^k λ_2^{n−k} / (k!(n − k)!)
                 = (e^{−(λ_1+λ_2)} / n!) Σ_{k=0}^{n} (n! / (k!(n − k)!)) λ_1^k λ_2^{n−k}
                 = (e^{−(λ_1+λ_2)} / n!) (λ_1 + λ_2)^n        (5.3.9)

That is, X + Y has a Poisson distribution with parameter λ_1 + λ_2.
The concept of independence may, of course, be defined for more than two random vari-
ables. In general, the n random variables X_1, X_2, ..., X_n are said to be independent if, for
all values a_1, a_2, ..., a_n,

    P{X_1 ≤ a_1, X_2 ≤ a_2, ..., X_n ≤ a_n} = P{X_1 ≤ a_1} P{X_2 ≤ a_2} · · · P{X_n ≤ a_n}
Let X_1 and X_2 be jointly continuous random variables with joint probability density
function f(x_1, x_2). It is sometimes necessary to obtain the joint distribution of the random
variables Y_1 and Y_2 which arise as functions of X_1 and X_2. Specifically, suppose that
Y_1 = g_1(X_1, X_2) and Y_2 = g_2(X_1, X_2) for some functions g_1 and g_2, and assume that:

1. The equations y_1 = g_1(x_1, x_2) and y_2 = g_2(x_1, x_2) can be uniquely solved for x_1 and
x_2 in terms of y_1 and y_2 with solutions given by, say,

    x_1 = h_1(y_1, y_2),    x_2 = h_2(y_1, y_2)

2. The functions g_1 and g_2 have continuous partial derivatives at all points (x_1, x_2) and
are such that the following 2 × 2 determinant

    J(x_1, x_2) = | ∂g_1/∂x_1  ∂g_1/∂x_2 |  =  (∂g_1/∂x_1)(∂g_2/∂x_2) − (∂g_1/∂x_2)(∂g_2/∂x_1) ≠ 0
                  | ∂g_2/∂x_1  ∂g_2/∂x_2 |

at all points (x_1, x_2).

Under these two conditions it can be shown that the random variables Y_1 and Y_2 are
jointly continuous with joint density function given by

    f_{Y_1,Y_2}(y_1, y_2) = f_{X_1,X_2}(x_1, x_2) |J(x_1, x_2)|^{−1}        (5.4.1)

where x_1 = h_1(y_1, y_2) and x_2 = h_2(y_1, y_2).
A proof of the Equation 5.4.1 would proceed along the following lines:

    P{Y_1 ≤ y_1, Y_2 ≤ y_2} = ∫∫_B f_{X_1,X_2}(x_1, x_2) dx_1 dx_2        (5.4.2)

where B = {(x_1, x_2) : g_1(x_1, x_2) ≤ y_1, g_2(x_1, x_2) ≤ y_2}.
The joint density function can now be obtained by differentiating the Equation 5.4.2 with
respect to y_1 and y_2. That the result of this differentiation will be equal to the right-hand
side of the Equation 5.4.1 is an exercise in advanced calculus whose proof will not be
presented in the present lecture notes.
Example 5.4.1 If X and Y are independent gamma random variables with parameters
(α, λ) and (β, λ), respectively, compute the joint density of U = X + Y and V = X/(X + Y).
The joint density of X and Y is

    f_{X,Y}(x, y) = (λe^{−λx}(λx)^{α−1} / Γ(α)) · (λe^{−λy}(λy)^{β−1} / Γ(β)),    x, y > 0

Now, if g_1(x, y) = x + y and g_2(x, y) = x/(x + y), then ∂g_1/∂x = 1, ∂g_1/∂y = 1,
∂g_2/∂x = y/(x + y)^2 and ∂g_2/∂y = −x/(x + y)^2. It follows that

    J(x, y) = | 1              1             |  =  −x/(x + y)^2 − y/(x + y)^2  =  −1/(x + y)
              | y/(x + y)^2    −x/(x + y)^2  |

Finally, because the equations u = x + y, v = x/(x + y) have as their solutions x = uv,
y = u(1 − v), we see that

    f_{U,V}(u, v) = f_{X,Y}(uv, u(1 − v)) |J|^{−1} = f_{X,Y}(uv, u(1 − v)) u
                  = (λe^{−λu}(λu)^{α+β−1} / Γ(α + β)) · (Γ(α + β) / (Γ(α)Γ(β))) v^{α−1}(1 − v)^{β−1}

Hence X + Y and X/(X + Y) are independent, with X + Y having a gamma distribution
with parameters (α + β, λ) and X/(X + Y) having density function

    f_V(v) = (Γ(α + β) / (Γ(α)Γ(β))) v^{α−1}(1 − v)^{β−1},    0 < v < 1
This is called the beta density with parameters (α, β). This result is quite interesting.
For suppose there are n + m jobs to be performed, with each (independently) taking an
exponential amount of time with rate λ for performance, and suppose that we have two
workers to perform these jobs. Worker I will do jobs 1, 2, ..., n, and worker II will do the
remaining m jobs. If we let X and Y denote the total working times of workers I and
II, respectively, then upon using the preceding result it follows that X and Y will be
independent gamma random variables having parameters (n, λ) and (m, λ), respectively.
Then the preceding result yields that independently of the working time needed to com-
plete all n + m jobs (that is, of X + Y ), the proportion of this work that will be performed
by worker I has a beta distribution with parameters (n, m).
When the joint density function of the n random variables X_1, X_2, ..., X_n is given and we
want to compute the joint density function of Y_1, Y_2, ..., Y_n, where

    Y_i = g_i(X_1, X_2, ..., X_n),    i = 1, 2, ..., n

we proceed analogously: assuming that the functions g_i have continuous partial derivatives
and that the Jacobian determinant

    J(x_1, x_2, · · · , x_n) = | ∂g_1/∂x_1   ∂g_1/∂x_2   · · ·   ∂g_1/∂x_n |
                               | ∂g_2/∂x_1   ∂g_2/∂x_2   · · ·   ∂g_2/∂x_n |
                               |    ...         ...                 ...    |
                               | ∂g_n/∂x_1   ∂g_n/∂x_2   · · ·   ∂g_n/∂x_n |

is nonzero at all points, the joint density function of the random variables Y_i is given by

    f_{Y_1,...,Y_n}(y_1, ..., y_n) = f_{X_1,...,X_n}(x_1, ..., x_n) |J(x_1, ..., x_n)|^{−1}        (5.4.4)

where x_1, ..., x_n are the solutions of y_i = g_i(x_1, ..., x_n), i = 1, ..., n.
For example, if U = X + Y and W = X − Y, so that g_1(x, y) = x + y and g_2(x, y) = x − y,
then

    J(x, y) = | 1   1 |  =  −2        (5.4.5)
              | 1  −1 |

and, since x = (u + w)/2 and y = (u − w)/2, we obtain

    f_{U,W}(u, w) = f_{X,Y}(x, y) |J(x, y)|^{−1} = (1/2) f_{X,Y}((u + w)/2, (u − w)/2)
Exercise 5.4.3 If X is the amount of money spent on food and other expenses during a
day (in USD) and Y is the daily allowance of a businesswoman, the joint density of these
two variables is given by
    f_{XY}(x, y) = { c(100 − x)/x,  if 10 ≤ x ≤ 100, 40 ≤ y ≤ 100
                   { 0,             elsewhere.
(i) Choose c such that fXY (x, y) is a probability density function.
(iii) Calculate the probability that more than 75 USD are spent.
Chapter 6

Moment Generating Functions
The moment generating function φ(t) of the random variable X is defined for all values t
by
    φ(t) = E[e^{tX}] = { Σ_x e^{tx} p(x),             if X is discrete
                       { ∫_{−∞}^{∞} e^{tx} f(x) dx,   if X is continuous        (6.1.1)
We call φ(t) the moment generating function because all of the moments of X can be
obtained by successively differentiating φ(t).
For example,
    φ'(t) = (d/dt) E[e^{tX}] = E[(d/dt) e^{tX}] = E[X e^{tX}]

and hence

    φ'(0) = E[X]

Similarly,

    φ''(t) = (d/dt) E[X e^{tX}] = E[(d/dt) X e^{tX}] = E[X^2 e^{tX}]

and hence

    φ''(0) = E[X^2]
In general, the n-th derivative of φ(t) evaluated at t = 0 equals E[X^n], that is,

    φ^{(n)}(0) = E[X^n],    n ≥ 1

For example, for a binomial random variable with parameters n and p, differentiating
φ(t) = (pe^t + 1 − p)^n twice and setting t = 0 gives

    E[X^2] = φ''(0) = n(n − 1)p^2 + np

and so Var(X) = n(n − 1)p^2 + np − (np)^2 = np(1 − p).
For a Poisson random variable with parameter λ, φ(t) = exp{λ(e^t − 1)}. Differentiation
yields

    φ'(t) = λe^t exp{λ(e^t − 1)}
    φ''(t) = (λe^t)^2 exp{λ(e^t − 1)} + λe^t exp{λ(e^t − 1)}

and so E[X] = φ'(0) = λ and E[X^2] = φ''(0) = λ^2 + λ. Hence

    Var(X) = E[X^2] − (E[X])^2 = λ^2 + λ − λ^2 = λ
Thus, both the mean and the variance of the Poisson equal λ.
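The moment-extraction mechanism can be checked numerically by finite-differencing the Poisson mgf at t = 0 (λ chosen arbitrarily):

```python
from math import exp

# Finite differences of phi(t) = exp(lam*(e^t - 1)) at t = 0 recover moments.
lam = 2.0
phi = lambda t: exp(lam * (exp(t) - 1.0))

h = 1e-4
d1 = (phi(h) - phi(-h)) / (2 * h)               # ~ phi'(0)  = E[X] = lam
d2 = (phi(h) - 2 * phi(0.0) + phi(-h)) / h**2   # ~ phi''(0) = lam^2 + lam

assert abs(d1 - lam) < 1e-3
assert abs(d2 - (lam**2 + lam)) < 1e-2
print(d1, d2)
```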
We note by the preceding derivation that, for the exponential distribution, φ(t) is only
defined for values of t less than λ. Differentiating yields

    φ'(t) = λ/(λ − t)^2,    φ''(t) = 2λ/(λ − t)^3

Hence

    E[X] = φ'(0) = 1/λ,    E[X^2] = φ''(0) = 2/λ^2

and therefore Var(X) = 2/λ^2 − 1/λ^2 = 1/λ^2.
For a standard normal random variable Z,

    φ_Z(t) = E[e^{tZ}] = (1/√(2π)) ∫_{−∞}^{∞} e^{tx} e^{−x^2/2} dx = (1/√(2π)) ∫_{−∞}^{∞} e^{−(x^2−2tx)/2} dx
           = e^{t^2/2} (1/√(2π)) ∫_{−∞}^{∞} e^{−(x−t)^2/2} dx = e^{t^2/2}
If Z is a standard normal, then X = σZ + µ is normal with parameters µ and σ 2 (see
Theorem 3.3.3). Therefore,
    φ_X(t) = E[e^{tX}] = E[e^{t(σZ+µ)}] = e^{µt} E[e^{tσZ}] = e^{µt} φ_Z(tσ) = exp{σ^2 t^2/2 + µt}

By differentiating we obtain

    φ'_X(t) = (µ + tσ^2) exp{σ^2 t^2/2 + µt}
    φ''_X(t) = (µ + tσ^2)^2 exp{σ^2 t^2/2 + µt} + σ^2 exp{σ^2 t^2/2 + µt}

and so E[X] = φ'_X(0) = µ and E[X^2] = φ''_X(0) = µ^2 + σ^2, implying that

    Var(X) = E[X^2] − (E[X])^2 = σ^2
Theorem 6.1.5 The moment generating function of the sum of independent random vari-
ables is just the product of the individual moment generating functions.
Proof. To see this, suppose that X and Y are independent random variables and have
moment generating functions φ_X(t) and φ_Y(t), respectively. Then φ_{X+Y}(t), the moment
generating function of X + Y , is given by

    φ_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E[e^{tX}] E[e^{tY}] = φ_X(t) φ_Y(t)

where the next to the last equality follows from the Theorem 5.2.4 since X and Y are
independent.
Another important result is that the moment generating function uniquely determines
the distribution. That is, there exists a one-to-one correspondence between the moment
generating function and the distribution function of a random variable.
Discrete distribution    pmf p(x)                            φ(t)                       Mean    Variance
Binomial (n, p)          C(n, x) p^x (1−p)^{n−x}, x = 0..n   (pe^t + 1 − p)^n           np      np(1−p)
Poisson (λ)              e^{−λ} λ^x / x!, x = 0, 1, ...      exp{λ(e^t − 1)}            λ       λ
Geometric (p)            p(1−p)^{x−1}, x = 1, 2, ...         pe^t / (1 − (1−p)e^t)      1/p     (1−p)/p^2

Table 6.1: Moment generating functions for some common discrete distributions
Example 6.1.6 Suppose the moment generating function of a random variable X is given
by φ(t) = e^{3(e^t−1)}. What is P{X = 0}?
We see from Table 6.1 that φ(t) = e^{3(e^t−1)} is the moment generating function of a Poisson
random variable with mean 3. Hence, by the one-to-one correspondence between moment
generating functions and distribution functions, it follows that X must be a Poisson
random variable with mean 3. Thus, P{X = 0} = e^{−3}.
Continuous distribution   pdf f(x)                                         φ(t)                          Mean       Variance
Uniform over (a, b)       1/(b − a), a < x < b                             (e^{tb} − e^{ta})/(t(b − a))  (a + b)/2  (b − a)^2/12
Exponential (λ > 0)       λe^{−λx}, x ≥ 0                                  λ/(λ − t)                     1/λ        1/λ^2
Gamma (n, λ), λ > 0       λe^{−λx}(λx)^{n−1}/Γ(n), x ≥ 0                   (λ/(λ − t))^n                 n/λ        n/λ^2
Normal (µ, σ^2)           (1/(√(2π)σ)) exp{−(x − µ)^2/(2σ^2)}, −∞ < x < ∞  exp{µt + σ^2 t^2/2}           µ          σ^2

Table 6.2: Moment generating functions for some common continuous distributions
Exercise 6.1.1 Let X be a uniformly distributed random variable over (a, b). Compute its
moment generating function.

If X and Y are independent binomial random variables with parameters (n, p) and (m, p),
then

    φ_{X+Y}(t) = φ_X(t) φ_Y(t) = [(pe^t + 1 − p)^n][(pe^t + 1 − p)^m] = (pe^t + 1 − p)^{n+m}

But (pe^t + 1 − p)^{n+m} is just the moment generating function of a binomial random variable
having parameters m + n and p. Thus, this must be the distribution of X + Y .
Similarly, if X and Y are independent Poisson random variables with means λ_1 and λ_2, then

    φ_{X+Y}(t) = φ_X(t) φ_Y(t) = [exp{λ_1(e^t − 1)}][exp{λ_2(e^t − 1)}] = exp{(λ_1 + λ_2)(e^t − 1)}
Hence, X + Y is Poisson distributed with mean λ1 + λ2 , verifying the result given in
Example 5.3.10.
Finally, if X and Y are independent normal random variables with respective parameters
(µ_1, σ_1^2) and (µ_2, σ_2^2), then

    φ_{X+Y}(t) = φ_X(t) φ_Y(t) = exp{σ_1^2 t^2/2 + µ_1 t} exp{σ_2^2 t^2/2 + µ_2 t}
               = exp{(σ_1^2 + σ_2^2) t^2/2 + (µ_1 + µ_2) t}
which is the moment generating function of a normal random variable with mean µ1 + µ2
and variance σ12 + σ22 . Hence, the result follows since the moment generating function
uniquely determines the distribution.
Remark 6.1.1 For a nonnegative random variable X, it is often convenient to define its
Laplace transform g(t), t ≥ 0, by g(t) = φ(−t) = E[e^{−tX}]. That is, the Laplace trans-
form evaluated at t is just the moment generating function evaluated at −t. The advantage
of dealing with the Laplace transform, rather than the moment generating function, when
the random variable is nonnegative is that, since X ≥ 0 and t ≥ 0, we have 0 ≤ e^{−tX} ≤ 1.
That is, the Laplace transform is always between 0 and 1. As in the case of moment
generating functions, it remains true that nonnegative random variables that have the
same Laplace transform must also have the same distribution.
It is also possible to define the joint moment generating function of two or more random variables. This is done as follows: for any n random variables X1, ..., Xn, the joint moment generating function, φ(t1, ..., tn), is defined for all real values of t1, ..., tn by

φ(t1, ..., tn) = E[exp{t1 X1 + · · · + tn Xn}]
It can be shown that φ(t1 , · · · , tn ) uniquely determines the joint distribution of X1 , ..., Xn .
If Z1, ..., Zn are independent standard normal random variables and, for constants a_ij, 1 ≤ i ≤ m, 1 ≤ j ≤ n, and µ_i, 1 ≤ i ≤ m, we set

X_i = a_{i1}Z_1 + · · · + a_{in}Z_n + µ_i,   i = 1, ..., m

then the random variables X1, ..., Xm are said to have a multivariate normal distribution. It follows from the fact that the sum of independent normal random variables is itself a normal random variable that each Xi is a normal random variable with mean and variance given by E[Xi] = µ_i and Var(Xi) = Σ_{j=1}^n a_ij².
Let us now consider

φ(t1, ..., tm) = E[exp{t1 X1 + · · · + tm Xm}]

the joint moment generating function of X1, ..., Xm. The first thing to note is that, since Σ_{i=1}^m t_i X_i is itself a linear combination of the independent normal random variables Z1, ..., Zn, it is also normally distributed. Its mean and variance are, respectively,

E[Σ_{i=1}^m t_i X_i] = Σ_{i=1}^m t_i µ_i

and

Var(Σ_{i=1}^m t_i X_i) = Cov(Σ_{i=1}^m t_i X_i, Σ_{j=1}^m t_j X_j) = Σ_{i=1}^m Σ_{j=1}^m t_i t_j Cov(X_i, X_j)
Now, if Y is a normal random variable with mean µ and variance σ², then E[e^Y] = φ_Y(t)|_{t=1} = e^{µ + σ²/2}. Thus we see that

φ(t1, ..., tm) = exp{ Σ_{i=1}^m t_i µ_i + (1/2) Σ_{i=1}^m Σ_{j=1}^m t_i t_j Cov(X_i, X_j) }   (6.1.2)

which shows that the joint distribution of X1, ..., Xm is completely determined from a knowledge of the values of E[Xi] and Cov(Xi, Xj), i, j = 1, 2, ..., m.
Chapter 7
Limit Theorems
Definition 7.1.1 Let X1, ..., Xn be independent and identically distributed random variables, each with mean µ and variance σ². The random variable S² defined by

S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)²

is called the sample variance of these data.
We begin with the identity

Σ_{i=1}^n (X_i − X̄)² = Σ_{i=1}^n (X_i − µ)² − n(X̄ − µ)²   (7.1.1)

To verify it, write

Σ_{i=1}^n (X_i − X̄)² = Σ_{i=1}^n ((X_i − µ) + (µ − X̄))²
  = Σ_{i=1}^n (X_i − µ)² + n(X̄ − µ)² + 2(µ − X̄) Σ_{i=1}^n (X_i − µ)
  = Σ_{i=1}^n (X_i − µ)² + n(X̄ − µ)² + 2(µ − X̄)(nX̄ − nµ)
  = Σ_{i=1}^n (X_i − µ)² + n(X̄ − µ)² − 2n(X̄ − µ)²
  = Σ_{i=1}^n (X_i − µ)² − n(X̄ − µ)²
Using Identity 7.1.1 and Theorem 5.3.6(b), we obtain

E[(n − 1)S²] = Σ_{i=1}^n E[(X_i − µ)²] − nE[(X̄ − µ)²] = nσ² − n Var(X̄) = nσ² − σ² = (n − 1)σ²

so that

E[S²] = σ²
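The fact that the (n − 1)-denominator sample variance is unbiased for σ² can be illustrated by simulation. In this sketch (the sample size n = 5 and the normal population with σ = 2 are illustrative choices), S² is averaged over many independent samples:

```python
import random
import statistics

random.seed(7)

mu, sigma, n, reps = 10.0, 2.0, 5, 20000

# statistics.variance uses the (n - 1) denominator, matching S^2 above.
# Averaging it over many samples of size n should approach sigma^2 = 4.
avg_s2 = sum(
    statistics.variance([random.gauss(mu, sigma) for _ in range(n)])
    for _ in range(reps)
) / reps
print(avg_s2)
```

The printed average should be close to σ² = 4 even though each individual sample is tiny; that is exactly what E[S²] = σ² asserts.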
We will now determine the joint distribution of the sample mean X̄ = (1/n) Σ_{i=1}^n X_i and the sample variance S² when the Xi have a normal distribution.
We shall first compute the moment generating function of Σ_{i=1}^n Z_i², where Z1, ..., Zn are independent standard normal random variables. To begin, note that for t < 1/2,

E[exp{tZ_i²}] = (1/√(2π)) ∫_{−∞}^{∞} e^{tx²} e^{−x²/2} dx = (1/√(2π)) ∫_{−∞}^{∞} e^{−x²/(2σ²)} dx = σ = (1 − 2t)^{−1/2}

where σ² = (1 − 2t)^{−1}.
Hence,

E[exp{t Σ_{i=1}^n Z_i²}] = Π_{i=1}^n E[exp{tZ_i²}] = (1 − 2t)^{−n/2}
Now, let X1, ..., Xn be independent normal random variables, each with mean µ and variance σ², and let X̄ = (1/n) Σ_{i=1}^n X_i and S² denote their sample mean and sample variance. Since the sum of independent normal random variables is also a normal random variable, it follows that X̄ is a normal random variable with expected value µ and variance σ²/n. In addition, from Theorem 5.3.6,
Cov(X̄, X_i − X̄) = 0,   i = 1, 2, ..., n   (7.1.2)
Also, since X̄, X1 − X̄, X2 − X̄, ..., Xn − X̄ are all linear combinations of the independent standard normal random variables (X_i − µ)/σ, i = 1, ..., n, it follows that the random variables X̄, X1 − X̄, X2 − X̄, ..., Xn − X̄ have a joint distribution that is multivariate normal. However, if we let Y be a normal random variable with mean µ and variance σ²/n that is independent of X1, ..., Xn, then the random variables Y, X1 − X̄, X2 − X̄, ..., Xn − X̄ also have a multivariate normal distribution and, by Equation 7.1.2, they have the same expected values and covariances as the random variables X̄, X_i − X̄, i = 1, 2, ..., n. Thus, since a multivariate normal distribution is completely determined by its expected values and covariances, we can conclude that the random vectors (Y, X1 − X̄, X2 − X̄, ..., Xn − X̄) and (X̄, X1 − X̄, X2 − X̄, ..., Xn − X̄) have the same joint distribution, thus showing that X̄ is independent of the sequence of deviations X_i − X̄, i = 1, 2, ..., n.
Since X̄ is independent of the deviations X_i − X̄, it is also independent of the sample variance

S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)²

To determine its distribution, use Identity 7.1.1 in the form

(n − 1)S² = Σ_{i=1}^n (X_i − µ)² − n(X̄ − µ)²

Dividing both sides by σ² gives

(n − 1)S²/σ² + ((X̄ − µ)/(σ/√n))² = Σ_{i=1}^n (X_i − µ)²/σ²   (7.1.3)
Now, Σ_{i=1}^n (X_i − µ)²/σ² is the sum of the squares of n independent standard normal random variables, and so is a chi-squared random variable with n degrees of freedom; it thus has moment generating function (1 − 2t)^{−n/2}. Also, ((X̄ − µ)/(σ/√n))² is the square of a standard normal random variable and so is a chi-squared random variable with one degree of freedom; it thus has moment generating function (1 − 2t)^{−1/2}. In addition, we have previously seen that the two random variables on the left side of Equation 7.1.3 are independent. Therefore, because the moment generating function of the sum of independent random variables is equal to the product of their individual moment generating functions, we obtain that
E[exp{t(n − 1)S²/σ²}] (1 − 2t)^{−1/2} = (1 − 2t)^{−n/2}

which can be written as

E[exp{t(n − 1)S²/σ²}] = (1 − 2t)^{−(n−1)/2}
But because (1 − 2t)^{−(n−1)/2} is the moment generating function of a chi-squared random variable with n − 1 degrees of freedom, we can conclude, since the moment generating function uniquely determines the distribution of the random variable, that this is the distribution of (n − 1)S²/σ².
Theorem 7.1.3 If X1, ..., Xn are independent and identically distributed normal random variables with mean µ and variance σ², then the sample mean X̄ and the sample variance S² are independent. X̄ is a normal random variable with mean µ and variance σ²/n, while (n − 1)S²/σ² is a chi-squared random variable with n − 1 degrees of freedom.
Theorem 7.2.1 (Markov’s Inequality) If X is a random variable that takes only nonnegative values, then for any value a > 0,

P{X ≥ a} ≤ E[X]/a

Proof. We give a proof for the case where X is continuous with density f:

E[X] = ∫_0^∞ x f(x) dx = ∫_0^a x f(x) dx + ∫_a^∞ x f(x) dx ≥ ∫_a^∞ x f(x) dx ≥ ∫_a^∞ a f(x) dx = a ∫_a^∞ f(x) dx = a P{X ≥ a}
As a consequence, if X has mean µ and variance σ², then applying Markov’s inequality to the nonnegative random variable (X − µ)² with a = k² yields Chebyshev’s inequality: for any k > 0,

P{|X − µ| ≥ k} = P{(X − µ)² ≥ k²} ≤ E[(X − µ)²]/k² = σ²/k²

and the proof is complete.
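Both inequalities can be verified against a distribution whose tail probabilities are known exactly. The sketch below uses an exponential random variable with rate 1 (an illustrative choice), for which P{X ≥ a} = e^{−a}, the mean is 1, and the variance is 1:

```python
import math

# Exponential with rate 1: mean 1, variance 1, and P{X >= a} = e^{-a}.
mean, var = 1.0, 1.0

# Markov: P{X >= a} <= E[X]/a for every a > 0.
for a in (2.0, 5.0, 10.0):
    exact_tail = math.exp(-a)
    assert exact_tail <= mean / a

# Chebyshev: P{|X - 1| >= k} <= var/k^2.  For k >= 1 the lower tail is
# empty (X cannot fall below 0), so P{|X - 1| >= k} = e^{-(1 + k)}.
for k in (2.0, 3.0):
    exact_dev = math.exp(-(1.0 + k))
    assert exact_dev <= var / k**2

print("Markov and Chebyshev bounds hold")
```

Note how loose the bounds are here: for a = 10, Markov gives 0.1 while the true tail is about 0.000045. That looseness is the price of needing only the mean (or mean and variance).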
The importance of Markov’s and Chebyshev’s inequalities is that they enable us to derive
bounds on probabilities when only the mean, or both the mean and the variance, of the
probability distribution are known. Of course, if the actual distribution were known, then
the desired probabilities could be exactly computed, and we would not need to resort to
bounds.
Example 7.2.3 Suppose we know that the number of items produced in a factory during
a week is a random variable with mean 500.
(a) What can be said about the probability that this week’s production will be at least 1000?
(b) If the variance of a week’s production is known to equal 100, then what can be said
about the probability that this week’s production will be between 400 and 600?
Solution. (a) By Markov’s inequality,

P{X ≥ 1000} ≤ E[X]/1000 = 500/1000 = 1/2

(b) By Chebyshev’s inequality,

P{|X − 500| ≥ 100} ≤ σ²/100² = 100/10000 = 1/100

Hence

P{|X − 500| < 100} ≥ 1 − 1/100 = 99/100

and so the probability that this week’s production will be between 400 and 600 is at least 0.99.
The following theorem, known as the Strong Law of Large Numbers, is probably the
most well-known result in probability theory. It states that the average of a sequence
of independent random variables having the same distribution will, with probability 1,
converge to the mean of that distribution.
Formally, if X1, X2, ... are independent and identically distributed random variables with common mean µ, then

P{ lim_{n→∞} X̄_n = µ } = 1,   where X̄_n = (X1 + X2 + · · · + Xn)/n
As an illustration, suppose that a sequence of independent trials of some experiment is performed, and let E be a fixed event of the experiment. Define

X_i = 1 if E occurs on the ith trial, and X_i = 0 if E does not occur on the ith trial.
We have, by the strong law of large numbers that, with probability 1,
(X1 + X2 + · · · + Xn)/n −→ E[X_i] = P(E)   (7.2.2)
Since X1 + X2 + · · · + Xn represents the number of times that the event E occurs in the
first n trials, we may interpret the Equation 7.2.2 as stating that, with probability 1, the
limiting proportion of time that the event E occurs is just P (E).
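This frequency interpretation is easy to see numerically. The following sketch (the value p = 0.25 and the trial count are illustrative choices) simulates 200,000 independent Bernoulli trials and checks that the observed proportion of occurrences of E is close to P(E):

```python
import random

random.seed(11)

p = 0.25          # P(E): probability that E occurs on any one trial
n = 200_000      # number of independent trials

# Count the trials on which E occurs and form the relative frequency.
hits = sum(random.random() < p for _ in range(n))
proportion = hits / n
print(proportion)  # the strong law says this converges to p = 0.25
```

Re-running with larger n drives the proportion ever closer to p, which is exactly the convergence with probability 1 asserted by Equation 7.2.2.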
Running neck and neck with the Strong Law of Large Numbers for the honor of being
probability theory’s number one result is the Central Limit Theorem. Besides its the-
oretical interest and importance, this theorem provides a simple method for computing
approximate probabilities for sums of independent random variables. It also explains the
remarkable fact that the empirical frequencies of so many natural "populations" exhibit
a bell-shaped (that is, normal) curve.
Theorem (The Central Limit Theorem): Let X1, X2, ... be a sequence of independent and identically distributed random variables, each with mean µ and variance σ². Then the distribution of

(X1 + X2 + · · · + Xn − nµ)/(σ√n)

tends to the standard normal as n −→ ∞. That is,

P{ (X1 + X2 + · · · + Xn − nµ)/(σ√n) ≤ a } −→ (1/√(2π)) ∫_{−∞}^a e^{−x²/2} dx   as n −→ ∞
Note that like the other results of this section, this theorem holds for any distribution of
the Xi ’s; herein lies its power.
If X is binomially distributed with parameters n and p, then X has the same distribution
as the sum of n independent Bernoulli random variables, each with parameter p. (Recall
that the Bernoulli random variable is just a binomial random variable whose parameter
n equals 1.) Hence, the distribution of
X − E[X] X − np
p =p
Var(X) np(1 − p)
approaches the standard normal distribution as n approaches ∞. The normal approxi-
mation will, in general, be quite good for values of n satisfying np(1 − p) ≥ 10.
Since the binomial is a discrete random variable and the normal a continuous random variable, it leads to a better approximation to write the desired probability with a continuity correction. For example, if X is binomial with parameters n = 40 and p = 1/2, so that E[X] = 20 and Var(X) = 10, then

P{X = 20} = P{19.5 < X < 20.5} = P{ (19.5 − 20)/√10 < (X − 20)/√10 < (20.5 − 20)/√10 }
  = P{−0.16 < (X − 20)/√10 < 0.16} ≈ Φ(0.16) − Φ(−0.16)
where Φ(x), the probability that a standard normal random variable is less than x, is given by

Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−y²/2} dy
By the symmetry of the standard normal distribution Φ(−0.16) = P {N (0, 1) > 0.16} =
1 − Φ(0.16), where N (0, 1) is a standard normal random variable. Hence, the desired
probability is approximated by P {X = 20} ≈ 2Φ(0.16) − 1.
Using the normal probability table for Φ(x), we obtain P{X = 20} ≈ 0.1272.
The exact result is P{X = 20} = (40 choose 20)(1/2)^40, which can be shown to equal 0.1254.
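Both the exact value and the continuity-corrected normal approximation can be computed directly; here Φ is evaluated through the error function `math.erf`:

```python
import math

def phi(x):
    """Standard normal CDF, Phi(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# X ~ binomial(40, 1/2), so E[X] = 20 and Var(X) = 10.
exact = math.comb(40, 20) * 0.5**40

# Continuity-corrected normal approximation to P{X = 20}.
approx = phi((20.5 - 20) / math.sqrt(10)) - phi((19.5 - 20) / math.sqrt(10))
print(exact, approx)
```

The exact value is about 0.1254, and the unrounded approximation about 0.1256; the slightly larger 0.1272 quoted above comes from rounding 0.5/√10 = 0.158 to 0.16 before consulting the table.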
Example 7.2.7 Let X_i, i = 1, 2, ..., 10, be independent random variables, each uniformly distributed over (0, 1). Estimate P{Σ_{i=1}^{10} X_i > 7}.

Since E[X_i] = 1/2 and Var(X_i) = 1/12, we have by the Central Limit Theorem that

P{Σ_{i=1}^{10} X_i > 7} = P{ (Σ_{i=1}^{10} X_i − 5)/√(10/12) > (7 − 5)/√(10/12) } ≈ 1 − Φ(2.2) = 0.0139
Example 7.2.8 The lifetime of a special type of battery is a random variable with mean
40 hours and standard deviation 20 hours. A battery is used until it fails, at which point it
is replaced by a new one. Assuming a stockpile of 25 such batteries, the lifetimes of which
are independent, approximate the probability that over 1100 hours of use can be obtained.
If we let X_i denote the lifetime of the ith battery to be put in use, then we desire p = P{X1 + · · · + X25 > 1100}, which is approximated as follows: since E[X1 + · · · + X25] = 25 × 40 = 1000 and √(Var(X1 + · · · + X25)) = 20√25 = 100,

p = P{ (X1 + · · · + X25 − 1000)/100 > (1100 − 1000)/100 } ≈ 1 − Φ(1) ≈ 0.1587
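The arithmetic of this approximation (which, by the CLT, does not depend on the unspecified lifetime distribution) can be carried out directly:

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n, mu, sd = 25, 40.0, 20.0
total_mean = n * mu             # E[X1 + ... + X25] = 1000 hours
total_sd = sd * math.sqrt(n)    # sqrt(Var) = 20 * 5 = 100 hours

# P{X1 + ... + X25 > 1100} ~ 1 - Phi((1100 - 1000)/100) = 1 - Phi(1)
p = 1.0 - phi((1100.0 - total_mean) / total_sd)
print(p)  # about 0.1587
```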
We now present a heuristic proof of the Central Limit Theorem. Suppose first that the X_i have mean 0 and variance 1, and let E[e^{tX}] denote their common moment generating function. Then the moment generating function of (X1 + · · · + Xn)/√n is

E[exp{t(X1 + · · · + Xn)/√n}] = E[e^{tX1/√n} e^{tX2/√n} · · · e^{tXn/√n}] = (E[e^{tX/√n}])^n   by independence.
Now, for n large, we obtain from the Taylor series expansion of e^y that

e^{tX/√n} ≈ 1 + tX/√n + t²X²/(2n)

Taking expectations, and using E[X] = 0 and E[X²] = 1, gives

E[e^{tX/√n}] ≈ 1 + t²/(2n)

and therefore

E[exp{t(X1 + · · · + Xn)/√n}] ≈ (1 + t²/(2n))^n
When n goes to ∞ the approximation can be shown to become exact and we have that
lim_{n→∞} E[exp{t(X1 + · · · + Xn)/√n}] = e^{t²/2}

which is the moment generating function of a standard normal random variable. Since convergence of moment generating functions implies convergence of the corresponding distributions, this establishes the result when µ = 0 and σ² = 1. The general case follows by applying the above to the standardized random variables (X_i − µ)/σ, yielding

P{ (X1 − µ + X2 − µ + · · · + Xn − µ)/(σ√n) ≤ a } −→ Φ(a)

which proves the Central Limit Theorem.
Exercise 7.2.1 Let X be a binomial random variable with p = 0.6 and n = 150. Approximate the probability that
(a) X lies between 82 and 101 inclusive;
(b) X is greater than 97.
Solution 7.2.9 (a) We calculate the mean and the variance of X. The mean is given by µ = np = 150(0.6) = 90 and the variance by σ² = Var(X) = np(1 − p) = 150(0.6)(0.4) = 36; hence σ = √Var(X) = 6. The standardized variable is

Z = (X − µ)/σ = (X − 90)/6

The event [82 ≤ X ≤ 101] includes both endpoints. The appropriate continuity correction is to subtract 1/2 from the lower end and add 1/2 to the upper end. We then approximate

P[81.5 ≤ X ≤ 101.5] = P[ (81.5 − 90)/6 ≤ (X − 90)/6 ≤ (101.5 − 90)/6 ] = P[−1.417 ≤ Z ≤ 1.917]
(b) For [X > 97], we reason that 97 is not included, so that [X ≥ 97 + 0.5], or [X ≥ 97.5], is the event of interest:

P[X ≥ 97.5] ≈ P[ (X − 90)/6 ≥ (97.5 − 90)/6 ] = P[Z ≥ 1.25] = 1 − 0.8944 = 0.1056
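Both parts of the solution can be evaluated without tables by computing Φ through `math.erf`:

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

mu, sd = 90.0, 6.0  # np = 150(0.6) = 90, sqrt(np(1 - p)) = sqrt(36) = 6

# (a) P[82 <= X <= 101] with the continuity correction:
part_a = phi((101.5 - mu) / sd) - phi((81.5 - mu) / sd)

# (b) P[X > 97] becomes P[X >= 97.5]:
part_b = 1.0 - phi((97.5 - mu) / sd)
print(part_a, part_b)  # about 0.894 and 0.1056
```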
(a) If a random sample of size 64 is selected, what is the probability that the sample mean
will lie between 80.8 and 83.2?
(b) With a random sample of size 100, what is the probability that the sample mean will
lie between 80.8 and 83.2?
Exercise 7.2.4 Suppose that the population distribution of the gripping strengths of industrial workers is known to have a mean of 110 and a variance of 100. For a random sample of 75 workers, what is the probability that the sample mean gripping strength will be