Probability Theory - Year 2 Applied Maths & Physics - 2019-2020
School of Science
Department of Mathematics & Physics
November 2019
Contents

1.1 Introduction . . . . . . . . . . 4
2.5.2 Arrangements . . . . . . . . 38
2.5.3 Permutations . . . . . . . . 39
2.5.4 Combinations . . . . . . . . 41

3 Random Variables 49
3.1 Introduction . . . . . . . . . . 49

7 Limit Theorems 116
7.1 The Joint Distribution of the Sample Mean and Sample Variance from a Normal Population . . . . . . . . . . 116
Chapter 1
1.1 Introduction
In the study of probability, we first imagine an appropriate random experiment. A random experiment is an experiment or a process whose outcome cannot be predicted with certainty. All the possible outcomes of a random experiment are known, but which outcome will result when the experiment is performed is not known; i.e., the result of the random experiment is known only after the experiment has been performed. An experiment is random if it has more than one possible outcome, and deterministic if it has only one. When we repeat a random experiment several times, we call each repetition a trial. Thus, a trial is a particular performance of a random experiment. For example, tossing a coin and rolling a die are random experiments.
Suppose that one tosses a regular coin up in the air. The coin has two sides, namely the
head (H) and tail (T ). Let us assume that the tossed coin will land on either H or T .
Every time one tosses the coin, there is the possibility of the coin landing on its head
or tail (multiplicity of outcomes). But, no one can say with absolute certainty whether
the coin would land on its head, or for that matter, on its tail (uncertainty regarding the
outcomes). One may toss this coin as many times as one likes under identical conditions
(repeatability) provided the coin is not damaged in any way in the process of tossing it
successively.
All three components are crucial ingredients of a random experiment. In order to contrast
a random experiment with another experiment, suppose that in a lab environment, a bowl
of pure water is boiled and the boiling temperature is then recorded. The first time this
experiment is performed, the recorded temperature would read 100◦ Celsius (or 212◦
Fahrenheit). Under identical and perfect lab conditions, we can think of repeating this
experiment several times, but then each time the boiling temperature would read 100◦
Celsius (or 212◦ Fahrenheit). Such an experiment will not fall in the category of a random
experiment because the requirements of multiplicity and uncertainty of the outcomes are
both violated here.
Exercise 1.1.1 Provide two examples of non-random experiments and two examples of
random experiments. Justify your answers.
T H T H T T H T T H        (1.2.1)
Let nk be the number of Hs observed in a sequence of k tosses of the coin, while nk/k refers to the associated relative frequency of H. For the observed sequence in the Expression (1.2.1), the successive frequencies and relative frequencies of H are given in the accompanying Table 1.1. The observed values for nk/k empirically provide a sense of what p may be, but admittedly this particular observed sequence of relative frequencies appears a little unstable. However, as the number of tosses increases, the oscillations between the successive values of nk/k will become less noticeable. Ultimately nk/k and p are expected to become indistinguishable in the sense that in the long haul nk/k will be very close to p. That is, our instinct may simply lead us to interpret p as

p = lim_{k→∞} nk/k
k    nk   nk/k        k    nk   nk/k
1    0    0           6    2    1/3
2    1    1/2         7    3    3/7
3    1    1/3         8    3    3/8
4    2    1/2         9    3    1/3
5    2    2/5         10   4    2/5

Table 1.1: Frequencies and relative frequencies of H in the sequence (1.2.1).
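The stabilization of nk/k described above is easy to see in a simulation. The following sketch (the function name, seed and toss count are our choices, not part of the text) flips a simulated coin with P(H) = p and tracks the relative frequency after each toss:

```python
import random

def relative_frequencies(p, tosses, seed=0):
    """Simulate `tosses` flips of a coin with P(H) = p and return the
    relative frequency n_k / k after each toss, as in Table 1.1."""
    rng = random.Random(seed)
    heads = 0
    freqs = []
    for k in range(1, tosses + 1):
        heads += rng.random() < p   # True counts as 1 head
        freqs.append(heads / k)
    return freqs

freqs = relative_frequencies(p=0.5, tosses=100_000)
print(freqs[:10])   # early values oscillate, much as in Table 1.1
print(freqs[-1])    # after many tosses, close to p = 0.5
```

For a handful of tosses the relative frequency jumps around, while after 100,000 tosses it typically sits within a fraction of a percent of p, in line with the limit interpretation of p.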
A random experiment provides in a natural fashion a list of all possible outcomes, also
referred to as the simple events. These simple events act like atoms in the sense that the
experimenter is going to observe only one of these simple events as a possible outcome
when the particular random experiment is performed.
A sample space is merely a set, denoted by S, which enumerates each and every possible
simple event or outcome. Then, a probability scheme is generated on the subsets of S,
including S itself, in a way which mimics the nature of the random experiment itself.
Throughout, we will write P (A) for the probability of a statement A, where A is a subset
of S. A more precise treatment of these topics is provided in Chapter 2.
Example 1.2.1 Suppose that we toss a fair coin three times and record the outcomes
observed in the first, second, and third toss respectively from left to right. Then the
possible simple events are HHH, HHT, HTH, HTT, THH, THT, TTH or TTT. Thus the sample space is given by

S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.

Since the coin is assumed to be fair, this particular random experiment generates the following probability scheme:

P{HHH} = P{HHT} = · · · = P{TTT} = 1/8.
Example 1.2.2 Suppose that we roll two fair dice, one red and one yellow, and record the pair of scores. The sample space is given by
S = {(1, 1), (1, 2), ..., (1, 6), (2, 1), ..., (2, 6), ..., (6, 1), ..., (6, 6)}
consisting of exactly 36 possible simple events. Since both dice are assumed to be fair, this
particular random experiment generates the following probability scheme:
P {(i, j)} = 1/36 for all i, j = 1, · · · , 6
Example 1.2.3 If the experiment consists of measuring the lifetime of a car, then the
sample space consists of all nonnegative real numbers i.e. the sample space is the set
S = {x : x ∈ R and x ≥ 0} = [0, ∞).
In summary:
• A random experiment is a procedure that produces exactly one outcome out of many
possible outcomes.
• A sample space, denoted by S, is defined to be the set of all the possible outcomes
in a random experiment.
• An event is defined to be any subset of the sample space, and events are usually
denoted by capital letters, A, B and so forth.
Example 1.2.4 Consider an experiment of tossing a die twice. Then the sample space is S = {(i, j) : i, j = 1, 2, ..., 6}, which contains 36 elements. "The sum of the results of the two tosses is equal to 10" is an event. If we denote this event by A, then A = {(i, j) : i + j = 10, i, j = 1, 2, · · · , 6} = {(4, 6), (6, 4), (5, 5)} and P (A) = 3/36.
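The count in Example 1.2.4 can be verified by brute-force enumeration of the 36 equally likely pairs (a sketch using Python's standard library):

```python
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely outcomes of tossing a die twice
# and count those whose coordinates sum to 10, as in Example 1.2.4.
outcomes = list(product(range(1, 7), repeat=2))
A = [(i, j) for (i, j) in outcomes if i + j == 10]
P_A = Fraction(len(A), len(outcomes))
print(sorted(A), P_A)  # [(4, 6), (5, 5), (6, 4)] 1/12
```

The same enumeration, with the condition `i + j == n`, answers Exercise 1.2.1 for every n from 2 to 12.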
Exercise 1.2.1 If two fair dice are tossed, what is the probability that the sum is n, where n = 2, 3, · · · , 12?
Exercise 1.2.2 A number X is chosen at random from the set {1, 2, ..., 25}. Find the probability that X leaves remainder 1 when divided by 6.
Exercise 1.2.3 A fair die is rolled three times and the scores are added. What is the
probability that the sum of the scores is 6?
A set S is a collection of objects which are tied together by one or more common defining properties. For example, we may consider a set S = {x : x is an integer}, in which case S can alternately be written as Z = {..., −2, −1, 0, 1, 2, ...}. Here the common defining property which ties together all the members of the set Z is that they are integers. In the set N = {1, 2, 3, · · · } of all positive integers, being a positive integer is the common defining property of all members of N.
Let S be a universe set, i.e. S is a set that contains all the entities, including itself, that
one wishes to consider in a given situation. We say that A is a subset of S, denoted by
A ⊆ S, provided that each member of A is also a member of S.
Consider A and B which are both subsets of S. Then, A is called a subset of B, denoted
by A ⊆ B, provided that each member of A is also a member of B. We say that A is a
proper subset of B, denoted by A ⊂ B, provided that A is a subset of B but there is at
least one member of B which does not belong to A. Two sets A and B are said to be
equal if and only if A ⊆ B, as well as B ⊆ A.
Example 1.3.1 Let us define S = {1, 3, 5, 7, 9, 11, 13}, A = {1, 5, 7}, B = {1, 5, 7, 11, 13},
C = {3, 5, 7, 9}, and D = {11, 13}. Here, A, B, C and D are all proper subsets of S. It is
obvious that A is a proper subset of B, but A is not a subset of either C or D.
Suppose that A and B are two subsets of S. Now, we mention some customary set
operations listed below:
Union: A ∪ B = {x : x ∈ A or x ∈ B}.

Intersection: A ∩ B = {x : x ∈ A and x ∈ B}.

Difference: A \ B = {x : x ∈ A but x ∉ B}.

Complement: Ac = {x : x ∈ S but x ∉ A}.

Let us note that the complement of a set A is sometimes denoted by A′.
Figure 1.1 gives the set operations and Venn diagrams for complement, subset, intersection and union.

Figure 1.1: Set Operations and Venn Diagrams.
Commutative Law : A ∪ B = B ∪ A, A ∩ B = B ∩ A.
Two sets A and B are said to be disjoint if and only if there is no common element
between A and B, that is, if and only if A ∩ B = ∅, where ∅ denotes the empty set, i.e.
a set which has no elements. Two disjoint sets A and B are also referred to as being
mutually exclusive.
Example 1.3.2 (Example 1.3.1 Continued) One can verify that A ∪ B = {1, 5, 7, 11, 13},
B ∪ C = S, but C and D are mutually exclusive. Also, A and D are mutually exclusive.
Note that Ac = {3, 9, 11, 13}, A∆C = {1, 3, 9} and A∆B = {11, 13}.
Reflexivity: A ⊆ A.
Antisymmetry: A ⊆ B and B ⊆ A if and only if A = B.
Transitivity: If A ⊆ B and B ⊆ C then A ⊆ C.
(1) ∅ ⊆ A ⊆ S.
(2) A ⊆ A ∪ B and A ∩ B ⊆ A.
(3) If A ⊆ C and B ⊆ C then A ∪ B ⊆ C.
(4) If C ⊆ A and C ⊆ B then C ⊆ A ∩ B.
(1) C \ (A ∩ B) = (C \ A) ∪ (C \ B).
(2) C \ (A ∪ B) = (C \ A) ∩ (C \ B).
(3) (B \ A) ∩ C = (B ∩ C) \ A = B ∩ (C \ A).
(4) (B \ A) ∪ C = (B ∪ C) \ (A \ C).
(5) C \ (B \ A) = (A ∩ C) ∪ (C \ B).
(6) A \ A = ∅, A \ ∅ = A, B \ A = B ∩ Ac .
(7) (B \ A)c = A ∪ B c , S \ A = Ac , A \ S = ∅.
The cardinality of a set is a measure of the "number of elements of the set". For example, the set A = {10, 20, 30} contains 3 elements, and therefore A has a cardinality of 3.
A set is called finite if it is empty or has the same cardinality as the set {1, 2, ..., n} for
some n ∈ N; it is called infinite otherwise.
A countable set is a set with the same cardinality (number of elements) as some subset
of the set N. A countable set is either a finite set or a countably infinite set. Whether
finite or infinite, the elements of a countable set can always be counted one at a time
and, although the counting may never finish, every element of the set is associated with
a unique natural number.
An uncountable set (or uncountably infinite set) is an infinite set that contains too many
elements to be countable.
(i) The set A = {10, 20, 30, 40} is a finite set and hence a countable set.
(ii) The sets N and Z are all countable sets but they are not finite.
(iii) The sets R and I = (0, 1) are all uncountable sets and hence infinite.
Now consider {Ai : i ∈ I}, a collection of subsets of S, where I is some indexing set. This
collection may be finite, countably infinite or uncountably infinite. We define
Union: ∪_{i∈I} Ai = {x : x ∈ Ai for at least one i ∈ I}
Intersection: ∩_{i∈I} Ai = {x : x ∈ Ai for all i ∈ I}        (1.3.1)
The Definition 1.3.1 lays down the set operations involving the union and intersection among an arbitrary number of sets. When we specialize I = {1, 2, 3, · · · } in the Definition 1.3.1, we can combine the notions of countably infinite numbers of unions and intersections to come up with some interesting results.
Theorem 1.3.4 (De Morgan's Laws) Consider {Ai : i ∈ I}, a collection of subsets of S. Then,

(∪_{i∈I} Ai)^c = ∩_{i∈I} Ai^c  and  (∩_{i∈I} Ai)^c = ∪_{i∈I} Ai^c        (1.3.2)

Proof. Suppose that an element x belongs to (∪_{i∈I} Ai)^c. Then x does not belong to ∪_{i∈I} Ai, that is, x ∉ Ai for every i ∈ I. Hence x ∈ Ai^c for each i ∈ I, so that x ∈ ∩_{i∈I} Ai^c. Thus we have

(∪_{i∈I} Ai)^c ⊆ ∩_{i∈I} Ai^c

Conversely, suppose that an element x belongs to ∩_{i∈I} Ai^c. That is, x ∈ Ai^c for each i ∈ I, which implies that x cannot belong to any of the sets Ai, i ∈ I. In other words, the element x cannot belong to the set ∪_{i∈I} Ai, so that x must belong to the set (∪_{i∈I} Ai)^c. Thus we have

(∪_{i∈I} Ai)^c ⊇ ∩_{i∈I} Ai^c

Combining the two inclusions proves the first equality in (1.3.2); the second follows by applying the first to the collection {Ai^c : i ∈ I}.
Definition 1.3.5 The collection of sets {Ai : i ∈ I} is said to consist of disjoint sets if and only if no two sets in this collection share a common element, that is, when Ai ∩ Aj = ∅ for all i ≠ j ∈ I.

The collection {Ai : i ∈ I} is called a partition of S if

(i) {Ai : i ∈ I} consists of disjoint sets only, and

(ii) {Ai : i ∈ I} spans the whole space S, that is ∪_{i∈I} Ai = S.
Example 1.3.6 Let S = (0, 1] and define the collection of sets {Ai : i ∈ I} where Ai = (1/2^i, 1/2^{i−1}], i ∈ I = {1, 2, 3, · · · }. One should check that the given collection of intervals forms a partition of (0, 1].
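One way to carry out that check numerically, at least at sample points, is the following sketch (the helper `members` and the particular test points are ours): for each x in (0, 1], exactly one index i should satisfy 1/2^i < x ≤ 1/2^{i−1}.

```python
# Check numerically that the intervals A_i = (1/2**i, 1/2**(i-1)],
# i = 1, 2, ..., are pairwise disjoint and cover sample points of (0, 1].
def members(x, n_intervals=60):
    """Indices i with x in A_i; should be exactly one for x in (0, 1]."""
    return [i for i in range(1, n_intervals + 1)
            if 1 / 2**i < x <= 1 / 2**(i - 1)]

for x in [1.0, 0.9, 0.5, 0.3, 0.0625, 1e-9]:
    assert len(members(x)) == 1, x
print("each sample point lies in exactly one A_i")
```

Note how the half-open intervals matter: the boundary point 0.0625 = 1/16 belongs to (1/32, 1/16] but not to (1/16, 1/8], so disjointness holds.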
Chapter 2
Axiomatic Development of
Probability
The axiomatic theory of probability was developed by Kolmogorov in his 1933 monograph,
originally written in German. Its English translation is cited as Kolmogorov (1950b).
Before we describe this approach, we need to fix some ideas first. Along the lines of
the examples discussed in the introduction, let us focus on some random experiment in
general and state a few definitions.
Definition 2.1.1 A sample space is a set, denoted by S, which enumerates each and
every possible outcome or simple event.
In general, an event is an appropriate subset of the sample space S, including the empty
subset ∅ and the whole set S.
(i) ∅ ∈ B;

(ii) If A ∈ B, then Ac ∈ B;

(iii) If Ai ∈ B for i = 1, 2, ..., then ∪_{i=1}^∞ Ai ∈ B.
In other words, the Borel sigma-field B is closed under the operations of complement and countable union of its members. It is obvious that the whole space S belongs to the Borel sigma-field B since we can write S = ∅c ∈ B, by the requirements (i)-(ii) in the Definition 2.1.2. Also, if Ai ∈ B for i = 1, 2, ..., k, then ∪_{i=1}^k Ai ∈ B, since with Ai = ∅ for i = k + 1, k + 2, ..., we have ∪_{i=1}^k Ai = ∪_{i=1}^∞ Ai ∈ B.
(ii) If A is any subset of S then the collection σ(A) = {∅, A, Ac , S} is a Borel sigma-field
on S.
(iii) Suppose that A and B are two subsets of S such that A ⊆ B. Then the collection
σ(A, B) = {∅, A, B, Ac , B c , A ∪ B c , Ac ∩ B, S} is a Borel sigma-field on S.
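For a finite sample space S, the smallest sigma-field containing a given set A can be produced mechanically by closing {∅, A, Ac, S} under complements and unions (a sketch; the function name is ours, and finiteness of S is assumed so that countable unions reduce to finite ones):

```python
from itertools import combinations

def generated_field(S, A):
    """Close {A} under complement and (finite) union inside the finite
    universe S; for finite S this yields the sigma-field sigma(A)."""
    S, A = frozenset(S), frozenset(A)
    field = {frozenset(), S, A, S - A}
    changed = True
    while changed:
        changed = False
        for X, Y in combinations(list(field), 2):
            for Z in (X | Y, S - (X | Y)):   # union and its complement
                if Z not in field:
                    field.add(Z)
                    changed = True
    return field

sigma_A = generated_field({1, 2, 3, 4}, {1, 2})
print(sorted(tuple(sorted(x)) for x in sigma_A))
# [(), (1, 2), (1, 2, 3, 4), (3, 4)]
```

For S = {1, 2, 3, 4} and A = {1, 2} the closure is exactly {∅, A, Ac, S}, matching the collection σ(A) in part (ii).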
Exercise 2.1.1 Show that if B is a Borel sigma-field, then B is closed under countable intersections, i.e. if A1, A2, · · · are elements of B, then

∩_{i=1}^∞ Ai ∈ B
Solution 2.1.4 We recall that for any given set A we have the equality A = (Ac)c. By using De Morgan's Laws (Theorem 1.3.4) one can write

∩_{i=1}^∞ Ai = [(∩_{i=1}^∞ Ai)^c]^c = (∪_{i=1}^∞ Ai^c)^c

Since Ai ∈ B, then Ai^c ∈ B by the Definition 2.1.2 (ii) for all i = 1, 2, · · · . It also follows from the Definition 2.1.2 (iii) that ∪_{i=1}^∞ Ai^c ∈ B. Finally, by the Definition 2.1.2 (ii) we have (∪_{i=1}^∞ Ai^c)^c ∈ B as requested.
Remark 2.1.1 If B1 and B2 are two Borel sigma-fields on S, then the collection of sets B1 ∩ B2 = {B : B ∈ B1 and B ∈ B2} is also a sigma-field on S, but the collection B1 ∪ B2 = {B : B ∈ B1 or B ∈ B2} need not be a Borel sigma-field on S.
Exercise 2.1.2 Provide two examples of Borel sigma-fields on the set R of all real num-
bers.
Frequently, we work with the Borel sigma-field B which consists of all subsets of the sample space S, that is B = P(S), but it need not always be that way. Having
started with a fixed Borel sigma-field B of subsets of S, a probability scheme is simply
a way to assign numbers between zero and one to every event while such assignment of
numbers must satisfy some general guidelines. In the next definition, we provide more
specifics.
(i) P (S) = 1;

(ii) 0 ≤ P (A) ≤ 1 for every A ∈ B;

(iii) P (∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P (Ai) for all Ai ∈ B such that the Ai's are pairwise disjoint, i = 1, 2, · · · .
Now, we are in a position to claim a few basic results involving probability. Some are
fairly intuitive while others may need more attention.
Theorem 2.1.7 Suppose that A and B are any two events and recall that ∅ denotes the empty set. Suppose also that the sequence of events {Bi : i = 1, 2, 3, · · · } forms a partition of the sample space S. Then,

(i) P (∅) = 0;

(ii) P (Ac) = 1 − P (A);

(iii) P (B ∩ Ac) = P (B) − P (A ∩ B);

(iv) P (A ∪ B) = P (A) + P (B) − P (A ∩ B);

(v) if A ⊆ B, then P (A) ≤ P (B);

(vi) P (A) = Σ_{i=1}^∞ P (A ∩ Bi).
Proof. (i) Observe that ∅ ∪ ∅c = S and that ∅ and ∅c = S are disjoint events. Hence, by part (iii) in the Definition 2.1.6, we have 1 = P (S) = P (∅ ∪ ∅c) = P (∅) + P (∅c). Thus, P (∅) = 1 − P (∅c) = 1 − P (S) = 1 − 1 = 0, in view of part (i) in the Definition 2.1.6. The second part follows from part (ii).
(ii) Observe that A ∪ Ac = S and then proceed as before. Observe that A and Ac are
disjoint events.
(iii) Notice that B = (B∩A)∪(B∩Ac ), where B∩A and B∩Ac are disjoint events. Hence
by part (iii) in the Definition 2.1.6, we claim that P (B) = P {(B ∩ A) ∪ (B ∩ Ac )} =
P (B ∩ A) + P (B ∩ Ac ). Now, the result is immediate.
(vi) Since the sequence of events {Bi : i ≥ 1} forms a partition of the sample space S, we can write A = A ∩ S = A ∩ (∪_{i=1}^∞ Bi) = ∪_{i=1}^∞ (A ∩ Bi), where the events A ∩ Bi, i = 1, 2, ... are also disjoint. Now, the result follows from part (iii) in the Definition 2.1.6.
Example 2.1.8 (Example 1.2.1 Continued) Let us define three events as follows:

A: We observe exactly two heads.

B: We observe at least one head.

C: We observe no heads.

First notice that as subsets of S, one can define the given events as follows:

A = {HHT, HTH, THH},

B = S \ {TTT},

C = {TTT}.

Now it becomes obvious that P (A) = 3/8, P (B) = 7/8, and P (C) = 1/8. One can also see that A ∩ B = {HHT, HTH, THH} so that P (A ∩ B) = 3/8, whereas A ∪ C = {HHT, HTH, THH, TTT} so that P (A ∪ C) = 4/8 = 1/2.
Example 2.1.9 (Example 1.2.2 Continued) Recall the two fair dice, one red and one yellow, from Example 1.2.2. Let us define two events as follows:

D: The total from the red and yellow dice is 8.

E: The red die comes up with 2 more than the yellow die.
Now, as subsets of the corresponding sample space S, we can rewrite these events as D = {(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)} and E = {(3, 1), (4, 2), (5, 3), (6, 4)}. It is now obvious that P (D) = 5/36 and P (E) = 4/36 = 1/9. It is also obvious that D ∩ E = {(5, 3)}, which gives P (D ∩ E) = 1/36.
Example 2.1.10 On a college campus, suppose that 2600 of the 4000 undergraduate students are women, while 800 of the 2000 undergraduates who are under the age of 25 are men. If one student is selected at random from this population of undergraduate students, what is the probability that the student will either be a man or be under the age of 25?
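The example leaves the computation to the reader; one way to organize it with the additive rule P (M ∪ U) = P (M) + P (U) − P (M ∩ U) is sketched below (we assume the 2000 under-25 undergraduates are counted within the same population of 4000 students):

```python
from fractions import Fraction

# Example 2.1.10 via the additive rule.  M = man, U = under 25.
total = 4000
men = total - 2600          # 1400 men
under_25 = 2000
men_under_25 = 800

P = Fraction(men + under_25 - men_under_25, total)
print(P)  # 13/20, i.e. 0.65
```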
Exercise 2.1.4 Suppose that A, B are Borel sets, that is they belong to B. Show that
P (A∆B) = P (A) + P (B) − 2P (A ∩ B) where A∆B stands for the symmetric difference
of A and B.
Solution 2.1.12 Note that A∆B = (A ∩ B c ) ∪ (B ∩ Ac ) and the two sets A ∩ B c and
B ∩ Ac are disjoint. So by using Definition 2.1.6 (iii) one can write P (A∆B) = P (A ∩
B c ) + P (B ∩ Ac ). Using the Theorem 2.1.7 (iii), it follows that P (A∆B) = P (A) − P (A ∩
B) + P (B) − P (A ∩ B) which is equivalent to P (A∆B) = P (A) + P (B) − 2P (A ∩ B) as
requested.
Exercise 2.1.5 There are 100 cards: 10 red cards numbered 1 through 10; 20 white cards numbered 1 through 20; 30 blue cards numbered 1 through 30; and 40 magenta cards numbered 1 through 40.
(iii) Let E be the event of picking a card with face value 11. Find P (E).
Solution 2.1.13 Using the given information in the statement of the problem, one can see that the requested probabilities are given by P (R) = 10/100, P (B) = 30/100, P (E) = 3/100, P (B ∪ R) = 40/100, P (E ∩ R) = 0, P (E ∩ B) = 1/100, P (E ∪ R) = 13/100, P (E ∪ B) = 32/100, P (E \ B) = 2/100 and P (B \ E) = 29/100.
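These values can be double-checked by rebuilding the deck and counting (a sketch; the colour labels and the helper `prob` are our notation):

```python
from fractions import Fraction

# Rebuild the 100-card deck of Exercise 2.1.5 and verify some of the
# probabilities in Solution 2.1.13 by direct counting.
deck = ([("red", n) for n in range(1, 11)]
        + [("white", n) for n in range(1, 21)]
        + [("blue", n) for n in range(1, 31)]
        + [("magenta", n) for n in range(1, 41)])

def prob(event):
    """Probability of an event, as the fraction of cards satisfying it."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

P_R = prob(lambda c: c[0] == "red")
P_B = prob(lambda c: c[0] == "blue")
P_E = prob(lambda c: c[1] == 11)   # face value 11: white, blue, magenta
print(P_R, P_B, P_E)  # 1/10 3/10 3/100
```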
Let us reconsider the Example 1.2.2. Suppose that the two fair dice, one red and the other yellow, are tossed in another room. After the toss, the experimenter comes out to announce that the event D has been observed. Recall that P (E) was 1/9 to begin with, but we know now that D has happened, and so the probability of the event E should be appropriately updated. Now then, how should one revise the probability of the event E, given the additional information?
The basic idea is simple: when we are told that the event D has been observed, then D
should take over the role of the sample space while the original sample space S should
be irrelevant at this point. In order to evaluate the probability of the event E in this
situation, one should simply focus on the portion of E which is inside the set D. This is
the fundamental idea behind the concept of conditioning.
Definition 2.2.1 Let S and B be respectively the sample space and the Borel sigma-field.
Suppose that A, B are two arbitrary events. The conditional probability of the event A
given the other event B, denoted by P (A|B), is defined as
P (A|B) = P (A ∩ B) / P (B), provided that P (B) > 0        (2.2.1)

In the same vein, we will write

P (B|A) = P (A ∩ B) / P (A), provided that P (A) > 0
Definition 2.2.2 Two arbitrary events A and B are said to be independent if and only if P (A|B) = P (A), that is, having the additional knowledge that B has been observed has not affected the probability of A, provided that P (B) > 0.

Two arbitrary events A and B are then said to be dependent if and only if P (A|B) ≠ P (A); in other words, knowing that B has been observed has affected the probability of A, provided that P (B) > 0.
In case the two events A and B are independent, intuitively it means that the occurrence
of the event A (or B) does not affect or influence the probability of the occurrence of the
other event B (or A). In other words, the occurrence of the event A (or B) yields no
reason to alter the likelihood of the other event B (or A).
When the two events A, B are dependent, sometimes we say that B is favorable to A if
and only if P (A|B) > P (A) provided that P (B) > 0. Also, when the two events A, B are
dependent, sometimes we say that B is unfavorable to A if and only if P (A|B) < P (A)
provided that P (B) > 0.
Example 2.2.3 (Example 1.2.2 Continued) Recall that P (D ∩ E) = P {(5, 3)} = 1/36 and P (D) = 5/36, so that we have P (E|D) = P (D ∩ E)/P (D) = (1/36) × (36/5) = 1/5. But P (E) = 1/9, which is different from P (E|D). In other words, we conclude that D and E are two dependent events. Since 1/5 = P (E|D) > P (E) = 1/9, we may add that the event D is favorable to the event E.
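The conditioning idea, restricting attention to the outcomes inside D, can be checked directly by enumeration:

```python
from fractions import Fraction
from itertools import product

# Conditioning by restriction to D: among the outcomes of D (total 8
# on a red and a yellow die), count those also in E (red = yellow + 2).
outcomes = list(product(range(1, 7), repeat=2))   # (red, yellow)
D = [(r, y) for (r, y) in outcomes if r + y == 8]
E_and_D = [(r, y) for (r, y) in D if r == y + 2]
P_E_given_D = Fraction(len(E_and_D), len(D))
print(P_E_given_D)  # 1/5
```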
Theorem 2.2.4 The two events B and A are independent if and only if A and B are
independent. Also, the two events A and B are independent if and only if P (A ∩ B) =
P (A)P (B).
For the second statement, we note that P (A ∩ B) = P (A|B)P (B). If A and B are independent, then P (A|B) = P (A); hence P (A ∩ B) = P (A)P (B). Conversely, if P (A ∩ B) = P (A)P (B), then P (A|B) = P (A ∩ B)/P (B) = P (A)P (B)/P (B) = P (A). Therefore A and B are independent.
Exercise 2.2.1 Let A and B be independent events with P (A) = P (B) and P (A ∪ B) = 1/2. Find P (A).

Solution 2.2.5 Let us first recall that P (A ∪ B) = P (A) + P (B) − P (A ∩ B). It follows from the statement of the problem that 1/2 = P (A) + P (B) − P (A ∩ B) ⇐⇒ 1/2 = 2P (A) − P (A ∩ B). Since A and B are assumed to be independent and P (A) = P (B), then P (A ∩ B) = P (A)P (B) = P (A)^2. Denoting P (A) by x, we obtain the quadratic equation 1/2 = 2x − x^2 ⇐⇒ 2x^2 − 4x + 1 = 0, for which the solutions are x = (2 ± √2)/2. Since the probability of any given event is between 0 and 1, we take P (A) = (2 − √2)/2.
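A quick numerical check of the chosen root (a sketch; variable names are ours):

```python
import math

# Check the root chosen in Solution 2.2.5: with x = P(A) = (2 - sqrt(2))/2,
# independence and P(A) = P(B) should give P(A u B) = 2x - x**2 = 1/2.
x = (2 - math.sqrt(2)) / 2
assert abs(2 * x**2 - 4 * x + 1) < 1e-12     # solves 2x^2 - 4x + 1 = 0
assert abs((2 * x - x**2) - 0.5) < 1e-12     # so P(A u B) = 1/2
assert 0 <= x <= 1                           # a legitimate probability
print(round(x, 4))  # 0.2929
```

The other root, (2 + √2)/2 ≈ 1.707, exceeds 1 and is rejected for exactly the reason given in the solution.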
Theorem 2.2.6 Suppose that A and B are two events. Then, the following statements are equivalent:

(i) A and B are independent;

(ii) Ac and B are independent;

(iii) A and Bc are independent;

(iv) Ac and Bc are independent.

Proof. It will suffice to prove that (i) =⇒ (ii) =⇒ (iii) =⇒ (iv) =⇒ (i).
(i) =⇒ (ii) : Assume that A and B are independent events. That is, in view of the
Theorem 2.2.4, we have P (A∩B) = P (A)P (B). Again in view of the Theorem 2.2.4,
we need to show that P (Ac ∩ B) = P (Ac )P (B). We apply parts (ii) − (iii) from
the Theorem 2.1.7 to write P (Ac ∩ B) = P (B) − P (A ∩ B) = P (B) − P (A)P (B) =
P (B)(1 − P (A)) = P (B)P (Ac ).
(ii) =⇒ (iii): Assume that Ac and B are independent events. That is, in view of the Theorem 2.2.4, we have P (Ac ∩ B) = P (Ac)P (B). Again in view of the Theorem 2.2.4, we need to show that P (A ∩ Bc) = P (A)P (Bc). Now, we combine De Morgan's Law from the Theorem 1.3.4 as well as the parts (ii)−(iv) from the Theorem 2.1.7 to write P (A ∩ Bc) = 1 − P {(A ∩ Bc)c} = 1 − P (Ac ∪ B) = 1 − {P (Ac) + P (B) − P (Ac ∩ B)} = 1 − P (Ac) − P (B) + P (Ac)P (B) = [1 − P (Ac)] [1 − P (B)] = P (A)P (Bc).
(iii) =⇒ (iv): Assume that the events A and Bc are independent. That is, in view of the Theorem 2.2.4, we have P (A ∩ Bc) = P (A)P (Bc). Again in view of the Theorem 2.2.4, we need to show that P (Ac ∩ Bc) = P (Ac)P (Bc). In view of the Theorem 2.1.7 (iii) and by the fact that the events A and Bc are independent, we have P (Ac ∩ Bc) = P (Bc ∩ Ac) = P (Bc) − P (Bc ∩ A) = P (Bc) − P (Bc)P (A) = (1 − P (A))P (Bc) = P (Ac)P (Bc).
(iv) =⇒ (i): Assume that Ac and Bc are independent events. That is, in view of the Theorem 2.2.4, we have P (Ac ∩ Bc) = P (Ac)P (Bc). Again in view of the Theorem 2.2.4, we need to show that P (A ∩ B) = P (A)P (B). Now, we combine De Morgan's Law from the Theorem 1.3.4 as well as the parts (ii)−(iv) from the Theorem 2.1.7 to write P (A ∩ B) = 1 − P {(A ∩ B)c} = 1 − P (Ac ∪ Bc) = 1 − {P (Ac) + P (Bc) − P (Ac ∩ Bc)} = 1 − P (Ac) − P (Bc) + P (Ac)P (Bc) = {1 − P (Ac)}{1 − P (Bc)} = P (A)P (B).
Note that a collection of events A1 , · · · , An may be pairwise independent, that is, any two
events are independent according to the Definition 2.2.2 (see Definition 2.2.8), but the
whole collection of sets may not be mutually independent.
Example 2.2.10 Consider the random experiment of tossing a fair coin twice. Let us
define the following events:
A1: The first toss results in a head.

A2: The second toss results in a head.

A3: The two tosses give the same result.
The sample space is given by S = {HH, HT, T H, T T } with each outcome being equally
likely. Now, rewrite
A1 = {HH, HT },
A2 = {HH, T H},
A3 = {HH, T T }
One can check that P (A1) = P (A2) = P (A3) = 1/2 and that P (A1 ∩ A2) = P (A1 ∩ A3) = P (A2 ∩ A3) = P {HH} = 1/4, so the events are pairwise independent. However, P (A1 ∩ A2 ∩ A3) = P {HH} = 1/4 ≠ 1/8 = P (A1)P (A2)P (A3), so A1, A2 and A3 are not mutually independent.
Theorem 2.2.11 If A, B and C are mutually independent events then A ∪ B and C are
independent events.
Proof. From the Definition 2.2.7, we need to show that P {(A ∪ B) ∩ C} = P (A ∪ B)P (C). Since the events A, B and C are mutually independent, by Definition 2.2.9 we have P (A ∩ B) = P (A)P (B), P (A ∩ C) = P (A)P (C), P (B ∩ C) = P (B)P (C) and P (A ∩ B ∩ C) = P (A)P (B)P (C). Note that P {(A ∪ B) ∩ C} = P {(A ∩ C) ∪ (B ∩ C)} = P (A ∩ C) + P (B ∩ C) − P (A ∩ B ∩ C) = P (A)P (C) + P (B)P (C) − P (A)P (B)P (C) = {P (A) + P (B) − P (A)P (B)} P (C) = P (A ∪ B)P (C).
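The distinction made in Example 2.2.10 between pairwise and mutual independence can be confirmed by enumeration (a sketch; the helper P is our notation):

```python
from fractions import Fraction
from itertools import product

# Example 2.2.10: two fair coin tosses, four equally likely outcomes.
S = [h + t for h, t in product("HT", repeat=2)]   # HH, HT, TH, TT

def P(event):
    return Fraction(len(set(event) & set(S)), len(S))

A1, A2, A3 = {"HH", "HT"}, {"HH", "TH"}, {"HH", "TT"}

# Pairwise independent: each pair factorizes ...
for X, Y in [(A1, A2), (A1, A3), (A2, A3)]:
    assert P(X & Y) == P(X) * P(Y)

# ... but the triple intersection does not, so they are not
# mutually independent.
assert P(A1 & A2 & A3) != P(A1) * P(A2) * P(A3)
print(P(A1 & A2 & A3), P(A1) * P(A2) * P(A3))  # 1/4 1/8
```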
Definition 2.2.12 The events in the sequence E1, E2, ... are said to be mutually exclusive if Ei ∩ Ej = ∅ for all i ≠ j.

In other words, the events are said to be mutually exclusive if they do not have any outcomes (elements) in common, i.e. they are pairwise disjoint.

The events E1, E2, · · · , En are said to be exhaustive if

E1 ∪ E2 ∪ · · · ∪ En = S
Exercise 2.2.2 Consider a sample space S and suppose that A, B are two events which
are mutually exclusive, that is, A ∩ B = ∅. If P (A) > 0, P (B) > 0, then can these two
events A, B be independent? Explain.
Solution 2.2.14 Since A ∩ B = ∅, then P (A ∩ B) = 0. In addition, since P (A) > 0 and P (B) > 0, then P (A)P (B) > 0. Recall (see Theorem 2.2.4) that A and B are independent if and only if P (A ∩ B) = P (A)P (B). In our case 0 = P (A ∩ B) ≠ P (A)P (B) > 0. Hence the two events cannot be independent.
Exercise 2.2.3 Suppose that A and B are two events such that P (A) = 0.7999 and
P (B) = 0.2002. Are A and B mutually exclusive events? Explain.
Solution 2.2.15 Let us first recall that A and B are mutually exclusive if A ∩ B = ∅, in which case P (A ∩ B) = 0. It is known that P (A ∪ B) = P (A) + P (B) − P (A ∩ B). If A and B were mutually exclusive, then P (A ∪ B) = 0.7999 + 0.2002 > 1, which is impossible since the probability of any event is at most 1. Hence the events A and B with the given probabilities cannot be mutually exclusive.
Exercise 2.2.4 For events A and B, you are given that P (A) = 2/3, P (B) = 2/5 and P (A ∪ B) = 3/4. Find P (Ac), P (Bc), P (A ∩ B), P (Ac ∪ Bc) and P (Ac ∩ Bc).

Solution 2.2.16 From the given information, one can see that P (Ac) = 1 − P (A) = 1 − 2/3 = 1/3 and P (Bc) = 1 − P (B) = 1 − 2/5 = 3/5. Note that P (A ∩ B) = P (A) + P (B) − P (A ∪ B) = 2/3 + 2/5 − 3/4 = 19/60. It is clear that P (Ac ∪ Bc) = P [(A ∩ B)c] = 1 − P (A ∩ B) = 1 − 19/60 = 41/60. Finally, P (Ac ∩ Bc) = P [(A ∪ B)c] = 1 − P (A ∪ B) = 1 − 3/4 = 1/4.
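Exact fraction arithmetic confirms these values (a sketch using Python's fractions module; variable names are ours):

```python
from fractions import Fraction

# Exact arithmetic for Exercise 2.2.4.
P_A, P_B, P_AuB = Fraction(2, 3), Fraction(2, 5), Fraction(3, 4)

P_AiB = P_A + P_B - P_AuB            # P(A n B) by the additive rule
P_Ac_u_Bc = 1 - P_AiB                # De Morgan: (A n B)^c = A^c u B^c
P_Ac_i_Bc = 1 - P_AuB                # De Morgan: (A u B)^c = A^c n B^c
print(P_AiB, P_Ac_u_Bc, P_Ac_i_Bc)  # 19/60 41/60 1/4
```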
Exercise 2.2.5 There are two telephone lines A and B. Let E1 be the event that line A is engaged and let E2 be the event that line B is engaged. After a statistical study, one finds that P (E1) = 0.5, P (E2) = 0.6 and P (E1 ∩ E2) = 0.3. Find the probability of the following events:

(i) F : Line A is free.

(ii) G: At least one line is engaged.

(iii) At most one line is free.
Solution 2.2.17 From the statement of the problem, we answer the requested probabilities.

(i) To say that "line A is free" is the same as saying that "line A is not engaged". Hence F = E1c. It follows that P (F ) = P (E1c) = 1 − P (E1) = 1 − 0.5 = 0.5.

(ii) To say that "at least one line is engaged" is equivalent to saying that "exactly one line is engaged" or "two lines are engaged". Hence G = E1 ∪ E2. It follows that P (G) = P (E1 ∪ E2) = P (E1) + P (E2) − P (E1 ∩ E2) = 0.5 + 0.6 − 0.3 = 0.8.

(iii) To say that "at most one line is free" is the same as saying that "no line is free" or "exactly one line is free". Note that "no line is free", "exactly one line is free" and "two lines are free" are the only possible cases; in other words, these three events are exhaustive. In addition, these events are disjoint. Hence P {No line is free} + P {Exactly one line is free} + P {Both lines are free} = 1. Equivalently, P {At most one line is free} = 1 − P {Both lines are free} = 1 − P (E1c ∩ E2c) = 1 − P {(E1 ∪ E2)c} = 1 − (1 − P (E1 ∪ E2)) = 1 − 1 + 0.8 = 0.8.
Exercise 2.2.6 A sample space consists of four simple events A, B, C and D, where A is twice as likely as B, B is four times as likely as C, and C is twice as likely as D. Find P (A), P (B), P (C) and P (D).

Solution 2.2.18 From the statement of the problem, we have P (A) = 2P (B), P (B) = 4P (C) and P (C) = 2P (D). Since A, B, C and D make up the whole sample space S, we must have P (S) = 1 ⇐⇒ P (A) + P (B) + P (C) + P (D) = 1. Note that P (A) = 2P (B) = 8P (C) = 16P (D), P (B) = 4P (C) = 8P (D) and P (C) = 2P (D). It follows that 16P (D) + 8P (D) + 2P (D) + P (D) = 1 ⇐⇒ 27P (D) = 1 ⇐⇒ P (D) = 1/27. Therefore P (C) = 2/27, P (B) = 8/27 and P (A) = 16/27.
Exercise 2.2.7 Suppose that P (A) = 0.68, P (B) = 0.55 and P (A ∩ B) = 0.35. Find

(a) The conditional probability that B occurs given that A occurs.

(b) The conditional probability that B does not occur given that A occurs.

(c) The conditional probability that B occurs given that A does not occur.
Solution 2.2.19 From the given information, we can answer the requested questions.

(a) The requested probability is P (B|A) = P (A ∩ B)/P (A) = 0.35/0.68 = 35/68 ≈ 0.515.

(b) The requested probability is given by P (Bc|A) = P (A ∩ Bc)/P (A) = (P (A) − P (A ∩ B))/P (A) = (0.68 − 0.35)/0.68 = 33/68 ≈ 0.485.

(c) Note that P (Ac) = 1 − P (A) = 1 − 0.68 = 0.32. Hence the requested probability is P (B|Ac) = P (B ∩ Ac)/P (Ac) = (P (B) − P (A ∩ B))/P (Ac) = (0.55 − 0.35)/0.32 = 0.20/0.32 = 0.625.
Exercise 2.2.8 Suppose that P (A) = 0.4, P (B) = 0.5 and the probability that either A or B occurs is 0.7. Find

(a) The conditional probability that B occurs given that A occurs.

(b) The conditional probability that B occurs given that A does not occur.
Solution 2.2.20 (a) We are asked to calculate P (B|A) = P (A ∩ B)/P (A). From the statement of the problem, it is clear that P (A ∪ B) = 0.7, and hence P (A ∩ B) = P (A) + P (B) − P (A ∪ B) = 0.4 + 0.5 − 0.7 = 0.2, so that P (B|A) = 0.2/0.4 = 1/2.

(b) Since P (A) = 0.4, then P (Ac) = 1 − P (A) = 0.6. We are asked to find P (B|Ac). Note that

P (B|Ac) = P (B ∩ Ac)/P (Ac) = (P (B) − P (A ∩ B))/P (Ac) = (0.5 − 0.2)/0.6 = 1/2.
Let A and B be events. We first recall that the conditional probability of event B occurring, given that event A has already occurred, is given by

P (B|A) = P (A ∩ B)/P (A), given that P (A) > 0        (2.3.1)

The role of A and B can be interchanged to define P (A|B). The conditional probability of A given that B has already occurred is given by

P (A|B) = P (A ∩ B)/P (B), given that P (B) > 0        (2.3.2)
Suppose that A and B are two arbitrary events. Here we summarize some of the standard
rules involving probabilities.
Additive Rule: P (A ∪ B) = P (A) + P (B) − P (A ∩ B).

Conditional Probability Rule: P (A|B) = P (A ∩ B)/P (B), provided that P (B) > 0.

Multiplicative Rule: P (A ∩ B) = P (A|B)P (B) = P (B|A)P (A).
The additive rule was proved in the Theorem 2.1.7, part (iv). The Conditional Probability
Rule is a restatement of the Definition 2.2.1. The Multiplicative Rule follows easily from
two Equations 2.3.1 and 2.3.2 (but does not require P (A) > 0 and P (B) > 0).
Sometimes the experimental setup itself may directly indicate the values of the various
conditional probabilities. In such situations, one can obtain the probabilities of joint
events such as A ∩ B by cross multiplying the two sides in the Theorem 2.2.4.
From the simple case of the Multiplicative Rule P (A ∩ B) = P (A|B)P (B), one can
immediately get the Multiplicative Rule in general cases:
P (A1 ∩ A2 ∩ · · · ∩ An) = P (A1)P (A2|A1)P (A3|A1 ∩ A2) · · · P (An|A1 ∩ A2 ∩ · · · ∩ An−1)
From the simple case of the Additive Rule, one can immediately get the Additive Rule
in general cases. For example we may calculate the probability that any one of the three
events E or F or G occurs. This is done as follows: From the associativity of the operation
of union we have the equality E ∪ F ∪ G = (E ∪ F ) ∪ G which by the Additive Rule gives
P (E ∪ F ∪ G) = P ((E ∪ F ) ∪ G) = P (E ∪ F ) + P (G) − P ((E ∪ F ) ∩ G).
P (E1 ∪ E2 ∪ · · · ∪ En ) = Σ_i P (Ei ) − Σ_{i<j} P (Ei ∩ Ej ) + Σ_{i<j<k} P (Ei ∩ Ej ∩ Ek ) − Σ_{i<j<k<l} P (Ei ∩ Ej ∩ Ek ∩ El ) + · · · + (−1)^{n+1} P (E1 ∩ E2 ∩ · · · ∩ En )
In words, the previous equation states that the probability of the union of n events equals
the sum of the probabilities of these events taken one at a time minus the sum of the
probabilities of these events taken two at a time plus the sum of the probabilities of these
events taken three at a time, and so on.
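The inclusion-exclusion formula can be checked numerically on a small equally likely sample space. The following Python sketch uses three hypothetical events E, F, G on the faces of a single die; the particular sets are chosen only for illustration:

```python
from fractions import Fraction
from itertools import combinations

S = {1, 2, 3, 4, 5, 6}                       # equally likely sample space (one die)
E, F, G = {1, 2, 3}, {2, 3, 4}, {3, 4, 5}    # hypothetical events for this check

def prob(A):
    # probability of an event A under the equally likely model
    return Fraction(len(A), len(S))

direct = prob(E | F | G)                     # probability of the union, directly

events = [E, F, G]
incl_excl = (sum(prob(A) for A in events)
             - sum(prob(A & B) for A, B in combinations(events, 2))
             + prob(E & F & G))

print(direct, incl_excl)  # 5/6 5/6
```

Exact rational arithmetic via `Fraction` avoids any floating-point rounding in the comparison.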
Theorem 2.3.1 (Law of Total Probability) Assume that A1 , A2 , · · · , Am are events such that ∪_{i=1}^m Ai = S and Ai ∩ Aj = ∅ for i ≠ j, with P (Ai ) > 0 for all i, i.e., A1 , A2 , · · · , Am form a partition of S into pairwise disjoint events. Then the probability of an event B can be calculated as
P (B) = Σ_{i=1}^m P (B|Ai )P (Ai )
Proof. The Law of Total Probability follows from the following observation: Since A1 , A2 , · · · , Am are disjoint events, then B ∩ A1 , B ∩ A2 , · · · , B ∩ Am are also disjoint events. Also, since ∪_{i=1}^m Ai = S and B ⊆ S, then
B = B ∩ S = B ∩ (∪_{i=1}^m Ai ) = ∪_{i=1}^m (B ∩ Ai )
Hence
P (B) = P (∪_{i=1}^m (B ∩ Ai )) = Σ_{i=1}^m P (B ∩ Ai ) = Σ_{i=1}^m P (B|Ai )P (Ai )
Note that in the last two equalities we have used the ”Additive Rule” (see also Theorem 2.1.7 (vi)) and the Conditional Probability Rule.
Example 2.3.2 Suppose that a company publishes two magazines, M1 and M2 . Based on their record of subscriptions in a suburb (a residential area), they find that sixty percent of the households subscribe to M1 , forty-five percent subscribe to M2 , while twenty percent subscribe to both M1 and M2 . If a household is picked at random from this suburb, the magazine publishers would like to address the following questions: What is the probability that the randomly selected household is a subscriber for (i) at least one of the magazines, and (ii) none of the magazines?
Let A be the event that the household subscribes to M1 and B the event that it subscribes to M2 . We have been told that P (A) = 0.60, P (B) = 0.45 and P (A ∩ B) = 0.20. Then, P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = 0.85, that is, there is an 85% chance that the randomly selected household is a subscriber for at least one of the magazines M1 , M2 . This answers question (i).
In order to answer question (ii), we need to evaluate P (Ac ∩ B c ) = P ((A ∪ B)c ), which can simply be written as 1 − P (A ∪ B) = 0.15; that is, there is a 15% chance that the randomly selected household subscribes to none of the magazines M1 , M2 .
Example 2.3.3 Suppose that we have an urn at our disposal which contains eight green
and twelve blue marbles, all of equal size and weight. The following random experiment
is now performed: The marbles inside the urn are mixed and then one marble is picked
from the urn at random, but we do not observe its color. This first drawn marble is
not returned to the urn. The remaining marbles inside the urn are again mixed and one
marble is picked at random. This kind of selection process is often referred to as sampling
without replacement. Now, what is the probability that
(i) both the first and second drawn marbles are green?
(ii) the second drawn marble is green?
Let A denote the event that the first drawn marble is green and B the event that the second drawn marble is green. Obviously, P (A) = 8/20 = 0.4, P (B|A) = 7/19 and P (B|Ac ) = 8/19. Observe that the experimental setup itself dictates the value of P (B|A). A result such as the Conditional Product Rule is not very helpful in the present situation. In order to answer question (i), we proceed by using the Multiplicative Rule to evaluate P (A ∩ B) = P (A)P (B|A) = 8/20 × 7/19 = 14/95.
Obviously, A, Ac form a partition of the sample space. Now, in order to answer question (ii), using the Theorem 2.1.7, part (vi), we write
P (B) = P (A ∩ B) + P (Ac ∩ B) (2.3.4)
But, as before, we have P (Ac ∩ B) = P (Ac )P (B|Ac ) = 12/20 × 8/19 = 24/95. Thus, from the Equality 2.3.4, we have P (B) = 14/95 + 24/95 = 38/95 = 2/5 = 0.4. Here, note that P (B|A) ≠ P (B) and so, by the Definition 2.2.2, the two events A, B are dependent. One may guess this fact easily from the layout of the experiment itself.
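The two answers in this example can be verified with exact rational arithmetic; the following Python sketch reproduces the urn computation (8 green, 12 blue, two draws without replacement):

```python
from fractions import Fraction

green, blue = 8, 12
total = green + blue

P_A = Fraction(green, total)                  # first marble green
P_B_given_A = Fraction(green - 1, total - 1)  # second green, given first green
P_B_given_Ac = Fraction(green, total - 1)     # second green, given first blue

P_A_and_B = P_A * P_B_given_A                        # Multiplicative Rule
P_B = P_A * P_B_given_A + (1 - P_A) * P_B_given_Ac   # Law of Total Probability

print(P_A_and_B, P_B)  # 14/95 2/5
```

Note that P_B equals P_A, reflecting the symmetry of sampling without replacement: unconditionally, every draw is equally likely to be green.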
Exercise 2.3.1 A list of important customers contains 25 names. Among them, 20 persons have their accounts in good standing while 5 are delinquent (irresponsible). Two persons will be selected at random from this list and the status of their accounts will be checked. Calculate the probability that
(a) Both accounts are delinquent.
(b) One account is delinquent and the other one is in good standing.
Let D1 and D2 denote the events that the first and the second selected accounts, respectively, are delinquent, and let G1 and G2 denote the corresponding events that the accounts are in good standing.
(i) Here the problem is to calculate P (D1 ∩ D2 ). Using the Multiplicative Law, we write
P (D1 ∩ D2 ) = P (D2 |D1 )P (D1 )
In order to calculate P (D1 ), we need only to consider selecting one account at random from 20 good and 5 delinquent. Clearly, P (D1 ) = 5/25. The next step is to calculate P (D2 |D1 ). Given that D1 has occurred, there will remain 20 good and 4 delinquent accounts at the time the second selection is made. Therefore, the conditional probability of D2 given D1 is given by P (D2 |D1 ) = 4/24. Multiplying these two probabilities, we get
P (Both delinquent) = P (D1 ∩ D2 ) = 5/25 × 4/24 = 1/30 (2.3.5)
(ii) Represent by D1 ∩ G2 the event that the first account checked is delinquent and the second account is in good standing, and by G1 ∩ D2 the event that the first account checked is in good standing and the second account is delinquent. Hence the probability of the event ”one account is delinquent and the other one is in good standing” will be given by
P ((D1 ∩ G2 ) ∪ (G1 ∩ D2 )) = P (D1 ∩ G2 ) + P (G1 ∩ D2 )
since it is impossible to select one delinquent account and one account in good standing at the same time in the first trial, i.e., D1 ∩ G1 = ∅ (hence the events D1 and G1 are mutually exclusive). Similarly, the events D2 and G2 are mutually exclusive. Since
P (G1 ∩ D2 ) = P (G1 )P (D2 |G1 ) = 20/25 × 5/24 = 1/6
P (D1 ∩ G2 ) = P (D1 )P (G2 |D1 ) = 5/25 × 20/24 = 1/6
the required probability is 1/6 + 1/6 = 1/3.
Solution 2.3.5 Clearly P (Ac ) = 1 − P (A) = 1 − 0.4 = 0.6. By the Multiplicative Rule P (A ∩ B) = P (A|B)P (B), it follows that P (A ∩ B) = 0.7 × 0.25 = 7/40. Using the Additive Rule we get immediately that
P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = 0.4 + 0.25 − 7/40 = 19/40
Exercise 2.3.3 Suppose that P (A) = 0.4, P (B) = 0.4 and P (A ∪ B) = 0.7. Are the events A and B independent?
Solution 2.3.6 Let A and B be the given events with P (A) = 0.4, P (B) = 0.4 and
P (A ∪ B) = 0.7.
(a) If A and B are independent then P (A ∩ B) = P (A)P (B). But by the Additive Rule
we have P (A ∩ B) = P (A) + P (B) − P (A ∪ B) = 0.4 + 0.4 − 0.7 = 0.1. On the other
hand P (A)P (B) = 0.4 × 0.4 = 0.16. Hence A and B are not independent.
(b) Determine P (A ∪ B) if A and B are mutually exclusive.
Solution 2.3.7 (a) If A and B are assumed to be independent, it follows from the Additive Rule that P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = P (A) + P (B) − P (A)P (B) = 0.6 + 0.22 − 0.6 × 0.22 = 0.688.
(b) If A and B are assumed to be mutually exclusive, then A ∩ B = ∅ and hence P (A ∩ B) = 0. It follows from the Additive Rule that P (A ∪ B) = P (A) + P (B) = 0.6 + 0.22 = 0.82.
(c) If A and B are assumed to be mutually exclusive, then A ∩ B = ∅ and hence P (A ∩ B) = 0. It follows that
P (A|B c ) = P (A ∩ B c )/P (B c ) = (P (A) − P (A ∩ B))/(1 − P (B)) = P (A)/(1 − P (B)) = 0.6/(1 − 0.22) = 60/78 ≈ 0.769
Exercise 2.3.5 A box contains 25 parts, of which 3 are defective and 22 are non-defective. If 2 parts are randomly selected without replacement, find the probabilities of the following events:
(i) Both parts are defective.
(ii) Neither part is defective.
Let A be the event that the first selected part is defective and B the event that the second selected part is defective. From the statement of the problem, one can see that P (A) = 3/25, P (B|A) = 2/24, P (Ac ) = 22/25, P (B c |A) = 22/24 and P (B c |Ac ) = 21/24.
(i) We need to calculate the probability of the event ”both parts are defective”, i.e. A ∩ B. By the Multiplicative Rule we have P (A ∩ B) = P (B|A)P (A) = 2/24 × 3/25 = 6/600 = 0.01.
(ii) It is requested to calculate the probability of the event ”neither one is defective”, which means that both parts are non-defective, i.e. Ac ∩ B c . Note that P (Ac ∩ B c ) = P (B c |Ac )P (Ac ) = 21/24 × 22/25 = 0.770.
Exercise 2.3.6 In a shipment of 15 room air conditioners, there are three with defective thermostats. Two air conditioners will be selected at random and inspected one after another. Let A be the event that the first selected unit is defective and B the event that the second selected unit is defective. Find the probabilities of the events considered below.
From the statement of the problem, one can see that P (A) = 3/15, P (B|A) = 2/14, P (B|Ac ) = 3/14, P (B c |A) = 12/14 and P (B c |Ac ) = 11/14.
(a) It is clear that P (A) = 3/15 = 1/5.
(b) Given that the first unit is defective, the probability that the second is good is P (B c |A) = 12/14 = 6/7.
(c) Here we are asked to calculate P (A ∩ B), which is given by P (A ∩ B) = P (B|A)P (A) = 2/14 × 3/15 = 6/210 = 1/35.
(e) We now calculate P {Exactly one is defective}. Note that P {Neither is defective} + P {Exactly one is defective} + P {Both are defective} = 1. But P {Neither is defective} = P (Ac ∩ B c ) = P (B c |Ac )P (Ac ) = 11/14 × 12/15 = 132/210. Hence P {Exactly one is defective} = 1 − P (Ac ∩ B c ) − P (A ∩ B) = 1 − 132/210 − 6/210 = 72/210 = 12/35.
Exercise 2.3.7 Of three events A, B and C, suppose events A and B are independent
and events B and C are mutually exclusive. Their probabilities are P (A) = 0.7, P (B) =
0.2 and P (C) = 0.3. Express the following events in set notation and calculate their
probabilities.
(a) Both B and C occur.
Solution 2.3.10 (a) The event ”Both B and C occur” can be expressed as B ∩ C. Since
B and C are assumed to be mutually exclusive then B∩C = ∅ and hence P (B∩C) = 0.
(b) The event ”At least one of A and B occurs” can be expressed as A ∪ B. Since A and B are assumed to be independent, then P (A ∩ B) = P (A)P (B) = 0.7 × 0.2 = 0.14 and hence P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = 0.7 + 0.2 − 0.14 = 0.76.
(c) The event ”B does not occur” can be expressed as B c . Clearly P (B c ) = 1 − P (B) = 1 − 0.2 = 0.8.
(d) The event ”All three events occur” can be expressed as A ∩ B ∩ C. Since B ∩ C = ∅ it follows that A ∩ B ∩ C = ∅ and hence it is impossible for all three events to occur at the same time. Therefore P (A ∩ B ∩ C) = 0.
Example 2.4.1 Suppose in another room, an experimenter has two urns at his disposal,
urn number 1 and urn number 2. The urn number 1 has eight green and twelve blue
marbles whereas the urn number 2 has ten green and eight blue marbles, all of same size
and weight. The experimenter selects one of the urns at random with equal probability and
from the selected urn picks a marble at random. It is announced that the selected marble
in the other room turned out blue. What is the probability that the blue marble was chosen
from the urn number 2? We will answer this question shortly.
The following theorem will be helpful in answering questions such as the one raised in the
Example 2.4.1.
Theorem 2.4.2 (Bayes’s Theorem) Suppose that the events {A1 , · · · , Ak } form a par-
tition of the sample space S and B is another event. Then,
P (Aj |B) = P (Aj )P (B|Aj ) / Σ_{i=1}^k P (Ai )P (B|Ai ), for j = 1, 2, · · · , k (2.4.1)
Proof. Since {A1 , · · · , Ak } form a partition of S, in view of the Theorem 2.1.7, part (vi)
(see also Theorem 2.3.1 about Total Probabilities) we can immediately write
P (B) = Σ_{i=1}^k P (B ∩ Ai ) = Σ_{i=1}^k P (Ai )P (B|Ai ) (2.4.2)
by using the Multiplicative Rule. Next, using the Multiplicative Rule once more, let us write P (Aj ∩ B) = P (Aj )P (B|Aj ). Dividing this by P (B) as computed in the Equation 2.4.2 gives P (Aj |B) = P (Aj ∩ B)/P (B), which is the claimed result.
This marvelous result and the ideas behind it originated from the work of Rev. Thomas Bayes (published posthumously in 1763). In the statement of the Theorem 2.4.2, note that the conditioning events on the right hand side are A1 , · · · , Ak , but on the left hand side one has the conditioning event B instead. The quantities P (Ai ), i = 1, · · · , k are often referred to as the a priori or prior probabilities, whereas P (Aj |B) is referred to as the posterior probability.
Thus, the chance that the randomly drawn blue marble came from the urn number 2 was 20/47, which is equivalent to saying that the chance of the blue marble coming from the urn number 1 was 27/47.
The Bayes Theorem helps in finding the conditional probabilities when the original con-
ditioning events A1 , · · · , Ak and the event B reverse their roles.
Example 2.4.4 This example has more practical flavor. Suppose that 40% of the in-
dividuals in a population have some disease. The diagnosis of the presence or absence
of this disease in an individual is reached by performing a type of blood test. But, like
many other clinical tests, this particular test is not perfect. The manufacturer of the
blood-test-kit made the accompanying information available to the clinics. If an individual
has the disease, the test indicates the absence (false negative) of the disease 10% of the
time whereas if an individual does not have the disease, the test indicates the presence
(false positive) of the disease 20% of the time. Now, from this population an individual
is selected at random and his blood is tested. The health professional is informed that
the test indicated the presence of the particular disease. What is the probability that this
individual does indeed have the disease?
Thus there is a 75% chance that the tested individual has the disease if we know that the blood test had indicated so.
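The posterior probability of 75% can be reproduced with a short computation; the following Python sketch applies the Law of Total Probability and Bayes' Theorem to the stated rates:

```python
from fractions import Fraction

P_D = Fraction(40, 100)             # prior probability of having the disease
P_pos_given_D = Fraction(90, 100)   # test positive given disease (10% false negatives)
P_pos_given_Dc = Fraction(20, 100)  # test positive given no disease (false positives)

# Law of Total Probability: overall chance of a positive test
P_pos = P_pos_given_D * P_D + P_pos_given_Dc * (1 - P_D)

# Bayes' Theorem: probability of disease given a positive test
P_D_given_pos = P_pos_given_D * P_D / P_pos
print(P_D_given_pos)  # 3/4
```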
Exercise 2.4.1 There are three urns, #A, #B, and #C. Urn #A has a red marbles
and b green marbles, urn #B has c red marbles and d green marbles, and urn #C has
a red marbles and c green marbles. Let A be the event of choosing urn #A, B of choos-
ing urn #B and, C of choosing urn #C. Let R be the event of choosing a red marble
and G be the event of choosing a green marble. An urn is chosen at random, and af-
ter that, from this urn, a marble is chosen at random. Find the following probabilities:
P (G), P (G|C), P (C|G), P (R), P (R|A) and P (A|R).
P (G) = P (G|A)P (A) + P (G|B)P (B) + P (G|C)P (C) = b/(a + b) × 1/3 + d/(c + d) × 1/3 + c/(a + c) × 1/3
• P (G|C) = c/(a + c)
P (R) = P (R|A)P (A) + P (R|B)P (B) + P (R|C)P (C) = a/(a + b) × 1/3 + c/(c + d) × 1/3 + a/(a + c) × 1/3
• P (R|A) = a/(a + b)
In many situations, the sample space S consists of only a finite number of equally likely
outcomes and the associated Borel sigma-field B is the collection of all possible subsets of
S, including the empty set ∅ and the whole set S. Then, in order to find the probability
of an event A, it will be important to enumerate all the possible outcomes included in
the sample space S and the event A. This section reviews briefly some of the customary
counting rules followed by a few examples. We first recall the following:
Definition 2.5.1 A group of elements is said to be ordered if the order in which these
elements are drawn is of relevance. Otherwise, it is called unordered.
Example 2.5.2 Ordered sample: The first three places in an Olympic 100m race are
determined by the order in which the athletes arrive at the finishing line. If 8 athletes
are competing with each other, the number of possible results for the first three places
is of interest.
Unordered sample: The selected members for a national football team. The order in
which the selected names are announced is irrelevant.
2.5.1 The Fundamental Rule of Counting
Suppose that there are k different tasks where the ith task can be completed in ni ways, i = 1, ..., k. If these tasks can be completed independently, then the total number N of ways these k tasks can be completed is given by
N = n1 × n2 × · · · × nk = Π_{i=1}^k ni
Example 2.5.3 How many sample points are there in the sample space when a pair of dice is thrown? In this case, it can be seen that the first die can land in any of n1 = 6 ways. For each of these 6 ways, the second die can also land in n2 = 6 ways. Therefore, the pair of dice can land in N = n1 × n2 = 6 × 6 = 36 ways.
Remark 2.5.1 It is to be noted that, in the case of events A and B that are mutually exclusive, if an event A can occur in a total of n1 ways and a different event B can occur in n2 ways, then the event A or B (i.e. A ∪ B) can occur in n1 + n2 ways.
Example 2.5.4 An urn contains 2 white and 6 black balls. Count the number of cases in
which a ball that is drawn at random is either white or black. Clearly, a white ball can be
drawn in 2 cases and a black ball can be drawn in 6 cases. Thus the required number of
cases will be 2 + 6 = 8.
2.5.2 Arrangements
Suppose that we are given n distinct objects and wish to arrange m of these objects in
a line, where n ≥ m ≥ 1. Since there are n ways of choosing the first object, and after
this is done, n − 1 ways of choosing the second object,· · · , and finally n − m + 1 ways of
choosing the mth object, it follows by the Fundamental Rule of Counting that the number
of different arrangements, denoted by the symbol n Pm , is given by
n Pm = n(n − 1)(n − 2) · · · (n − m + 1) = Π_{i=1}^m (n − i + 1) (2.5.1)
where it is noted that the product has m factors. We call n Pm the number of arrangements of n objects taken m at a time.
Example 2.5.5 The sequence of letters ”WASP” is an arrangement of four letters from among the 26 letters of the English Alphabet, and the sequence ”SWAP” is an arrangement of the same four letters but in a different order. Thus ”SWAP” and ”WASP” are different arrangements of 4 letters from among 26 letters of the English Alphabet.
Example 2.5.6 It is required to seat 5 men and 4 women in a row so that the women
occupy the even places. How many such arrangements are possible? The men may be
seated in 5 P5 ways, and the women 4 P4 ways. Each arrangement of the men may be
associated with each arrangement of the women. Hence, the number of arrangements will
be given by 5 P5 ×4 P4 = 120 × 24 = 2880.
The number n Pm of arrangements with repetition of m objects selected from among n > 0
objects is given by nm .
Example 2.5.7 The sequence of letters ”SWAPS” is an arrangement of five letters with
one repetition from among the 26 letters of the English alphabet. The sequence of letters
”WASPS” is also an arrangement of the same five letters with the same repetition from
among the 26 letters of the English alphabet, but in a different order. Thus ”SWAPS”
and ”WASPS” are different arrangements with repetitions.
2.5.3 Permutations
From the Equation 2.5.1 and the Definition 2.5.8 it follows that the quantity n Pn is the
number of permutations of n objects taken n at a time.
We distinguish between two cases: If all the elements are distinguishable, then we speak
of permutation without replacement. However, if some or all of the elements are not
distinguishable, then we speak of permutation with replacement.
• Permutations without Replacement
If all the n elements are distinguishable, then there are n Pn different compositions of these elements. This is the number of ways we can arrange all n objects, and it is denoted by the special symbol
n! = n × (n − 1) × · · · × 2 × 1
Example 2.5.9 There were three candidate cities for hosting the 2020 Olympic Games:
Tokyo (T), Istanbul (I), and Madrid (M). Before the election, there were 3! = 6 possible
outcomes, regarding the final rankings of the cities:
• Permutations with Replacement
Assume that not all n elements are distinguishable. The elements are divided into groups, and these groups are distinguishable. Suppose there are s groups of sizes n1 , n2 , ..., ns . The total number of different ways to arrange the n elements in s groups is:
n! / (n1 ! n2 ! n3 ! · · · ns !) (2.5.3)
Example 2.5.10 Consider all the permutations of the letters in the word BOB. Since
there are three letters, there should be 3! = 6 different permutations. Those permutations
are BOB, BBO, OBB, OBB, BBO and BOB. Now, while there are six permutations,
some of them are indistinguishable from each other. If you look at the permutations that
are distinguishable, you only have three BOB, OBB and BBO.
In fact, since there are two groups with sizes n1 = 2 (the group with two letters B) and n2 = 1 (the group with one letter O), it follows from the Identity 2.5.3 that, to find the number of distinguishable permutations, we take the factorial of the total number of letters and divide by the product of the factorials of the letter frequencies. That is
3!/(2! 1!) = (3 × 2 × 1)/(2 × 1 × 1) = 3
Exercise 2.5.1 Find the number of distinguishable permutations of the letters in the word
M ISSISSIP P I.
Solution 2.5.11 In this situation we have permutation with replacement with n1 = 1, n2 = 4, n3 = 4 and n4 = 2, where n1 , n2 , n3 and n4 are the frequencies of the letters M, I, S and P respectively. Hence the number of distinguishable permutations will then be given by
11!/(1! 4! 4! 2!) = 34650
Let us note that the total number of different permutations in the word M ISSISSIP P I
is 11! = 39916800.
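The count 34650 can be verified computationally; the following Python sketch computes the multinomial count from the letter frequencies and cross-checks the shorter BOB example by brute-force enumeration:

```python
from math import factorial
from itertools import permutations

word = "MISSISSIPPI"
counts = {ch: word.count(ch) for ch in set(word)}  # M:1, I:4, S:4, P:2

denom = 1
for c in counts.values():
    denom *= factorial(c)                          # product of frequency factorials

distinguishable = factorial(len(word)) // denom
print(distinguishable)  # 34650

# cross-check the BOB example by enumerating distinct permutations directly
assert len(set(permutations("BOB"))) == 3
```

Brute-force enumeration is only feasible for short words; MISSISSIPPI itself has 11! = 39916800 raw permutations, which is why the formula is preferred.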
2.5.4 Combinations
The Binomial coefficient for any integers m and n with n ≥ m ≥ 0 is denoted by C(n, m) and defined as
C(n, m) = n!/(m!(n − m)!) (2.5.4)
and it is read as ”n choose m”. There are several calculation rules for the Binomial coefficient:
C(n, 0) = 1, C(n, 1) = n, C(n, m) = C(n, n − m), (2.5.5)
C(n, m) = Π_{i=1}^m (n + 1 − i)/i, C(n + 1, m) = C(n, m) + C(n, m − 1)
The following result uses these combinatorial expressions and it goes by the name
Theorem 2.5.12 (Binomial Theorem) For any two real numbers a, b and a positive
integer n, one has
(a + b)^n = Σ_{k=0}^n C(n, k) a^k b^{n−k} (2.5.7)
For example, if a = b = 1, from the above equation one can see that
2^n = Σ_{k=0}^n C(n, k)
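Both the Binomial Theorem and the special case a = b = 1 can be checked for sample values; a small Python sketch (the values of a, b and n below are arbitrary illustrative choices):

```python
from math import comb

a, b, n = 3, 5, 7   # arbitrary sample values for the check
lhs = (a + b) ** n
rhs = sum(comb(n, k) * a**k * b**(n - k) for k in range(n + 1))
print(lhs == rhs)  # True

# special case a = b = 1: the binomial coefficients sum to 2^n
assert sum(comb(n, k) for k in range(n + 1)) == 2**n
```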
We now answer the question of how many different possibilities exist to draw m out of n
elements, i.e. m out of n balls from an urn. In other words, how many different ways
exist to select m out of n elements?
(i) Combinations without replacement and without consideration of the order of the
elements.
(ii) Combinations without replacement and with consideration of the order of the ele-
ments.
(iii) Combinations with replacement and without consideration of the order of the ele-
ments.
(iv) Combinations with replacement and with consideration of the order of the elements.
Let us note that combinations with and without replacement are also often called combi-
nations with and without repetition.
When there is no replacement and the order of the elements is also not relevant, then the
total number of distinguishable combinations in drawing m out of n elements is
C(n, m) (2.5.8)
Example 2.5.13 Suppose a company elects a new board of directors. The board consists
of 5 members and 15 people are eligible to be elected. How many combinations for the
board of directors exist?
Since a person cannot be elected twice, we have a situation where there is no replacement.
The order is also of no importance: either one is elected or not. We can thus apply the
Rule 2.5.8 which yields
C(15, 5) = 15!/(5! 10!) = 3003
possible combinations.
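This count is easy to confirm with the standard-library binomial coefficient; a short Python check:

```python
from math import comb, factorial

# 5 board members chosen from 15 eligible people; order irrelevant, no replacement
n_boards = comb(15, 5)
print(n_boards)  # 3003

# agrees with the factorial formula n!/(m!(n - m)!)
assert n_boards == factorial(15) // (factorial(5) * factorial(10))
```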
• Combinations without Replacement and with Consideration of the Order
The total number of different combinations for the setting without replacement and with
consideration of the order is
n!/(n − m)! = m! × C(n, m) (2.5.9)
Example 2.5.14 Consider a horse race with 12 horses. A possible bet is to forecast the
winner of the race, the second horse of the race, and the third horse of the race. The total
number of different combinations for the horses in the first three places is
12!/(12 − 3)! = 12 × 11 × 10 = 1320
This result can be explained intuitively: for the first place, there is a choice of 12 different
horses. For the second place, there is a choice of 11 different horses (12 horses minus the
winner). For the third place, there is a choice of 10 different horses (12 horses minus the
first and second horses). The total number of combinations is the product 12 × 11 × 10.
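The count 1320 can be confirmed directly; a short Python check (`math.perm(n, m)` computes exactly n!/(n − m)!):

```python
from math import perm

# ordered selection of 3 horses out of 12, without replacement
n_podiums = perm(12, 3)   # 12!/(12 - 3)!
print(n_podiums)  # 1320
assert n_podiums == 12 * 11 * 10
```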
• Combinations with Replacement and without Consideration of the Order
The total number of different combinations with replacement and without consideration of the order is
C(n + m − 1, m)
If 4 different organic products are denoted as A, B, C and D then the following combina-
tions are possible:
Please note that, for example, (A, B) is identical to (B, A) because the order in which
the products A and B are cultivated on the first or second field is not important in this
example.
• Combinations with Replacement and with Consideration of the Order
The total number of different combinations of m out of n elements, with replacement and when the order is of relevance, is n^m .
Example 2.5.16 Consider a credit card with a four-digit personal identification number (PIN) code. The total number of possible combinations for the PIN is n^m = 10^4 = 10000. This follows from the fact that every digit in the first, second, third, and fourth places (m = 4) can be chosen out of the ten digits from 0 to 9 (n = 10).
The proofs of the above formulas are beyond the scope of this course, but they can be found in many Mathematics textbooks, for instance Yves Nievergelt, Foundations of Logic and Mathematics: Applications to Computer Science and Cryptography, Springer Science+Business Media, LLC, 2002.
For large n, the factorial n! can be approximated by Stirling's formula
n! ≈ √(2πn) n^n e^{−n} (2.5.11)
where e = 2.71828 · · · is the base of natural logarithms. The symbol ≈ means that the ratio of the left side to the right side approaches 1 as n −→ ∞. That is
lim_{n→∞} n!/(√(2πn) n^n e^{−n}) = 1 (2.5.12)
Example 2.5.17 Evaluate 70! By Stirling's formula, 70! ≈ √(140π) × 70^70 × e^{−70} = N, say. Then log N = (1/2) log 140 + (1/2) log π + 70 log 70 − 70 log(2.71828) = 1.07306 + 0.24857 + 129.15686 − 30.40061 = 100.07788 = 100 + 0.07788. Hence N ≈ 1.20 × 10^100 .
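The quality of Stirling's approximation at n = 70 can be checked numerically; a short Python sketch comparing the approximation with the exact factorial:

```python
from math import factorial, sqrt, pi, e

n = 70
exact = factorial(n)                         # exact value of 70!
approx = sqrt(2 * pi * n) * n**n * e**(-n)   # Stirling's approximation

ratio = approx / exact
print(f"{approx:.4e}")   # about 1.20 x 10^100
print(round(ratio, 4))   # close to 1, slightly below
```

The ratio is just under 1, consistent with the next correction term 1/(12n) in the full Stirling series.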
In this section, we will look at some examples of probability calculations by using some
of the counting techniques developed in the previous section.
Example 2.6.1 Suppose that a fair coin is tossed five times. Now the outcomes would look like HHT HT , T HT T H and so on. By the fundamental rule of counting we realize that the sample space S will consist of 2^5 (= 2 × 2 × 2 × 2 × 2), that is thirty-two, outcomes, each of which has five components, and these are equally likely. How many of these five-dimensional outcomes would include two heads?
Imagine five distinct positions in a row, where each position will be filled by the letter H or T . Out of these five positions, choose two positions and fill them both with the letter H while the remaining three positions are filled with the letter T . This can be done in C(5, 2) = 10 ways.
Example 2.6.3 (Example 2.6.2 Continued) Suppose that there are six men and four
women in the small class. Then what is the probability that a randomly selected committee
of four students would consist of two men and two women?
Two men and two women can be chosen in C(6, 2) × C(4, 2) ways, that is in 90 ways. In other words,
P (Two men and two women are selected) = (C(6, 2) × C(4, 2))/C(10, 4) = 90/210 = 3/7.
Example 2.6.4 John, Sue, Rob, Dan and Molly have gone to see a movie. Inside the
theatre, they picked a row where there were exactly five empty seats next to each other.
These five friends can sit in those chairs in exactly 5! ways, that is in 120 ways. Here,
seating arrangement is naturally pertinent. The sample space S consists of 120 equally
likely outcomes. But the usher does not know in advance that John and Molly must sit
next to each other. If the usher lets the five friends take those seats randomly, what is the
probability that John and Molly would sit next to each other?
John and Molly may occupy the first two seats and other three friends may permute in 3!
ways in the remaining chairs. But then John and Molly may occupy the second and third
or the third and fourth or the fourth and fifth seats while the other three friends would
take the remaining three seats in any order. One can also permute John and Molly in 2!
ways. That is, we have
P (John and Molly sit next to each other) = (2!)(4)(3!)/5! = 48/120 = 2/5
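The probability 2/5 can also be obtained by brute-force enumeration of all 120 equally likely seatings; a short Python sketch:

```python
from itertools import permutations

friends = ["John", "Sue", "Rob", "Dan", "Molly"]
seatings = list(permutations(friends))   # all 5! = 120 equally likely orders

# seatings in which John and Molly occupy adjacent seats
together = [s for s in seatings
            if abs(s.index("John") - s.index("Molly")) == 1]

prob = len(together) / len(seatings)
print(len(together), prob)  # 48 0.4
```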
Example 2.6.5 Consider distributing a standard pack of 52 cards to four players (North,
South, East and West) in a bridge game where each player gets 13 cards. Here again the
ordering of the cards is not crucial to the game. The total number of ways these 52 cards
can be distributed among the four players is then given by
n = C(52, 13) × C(39, 13) × C(26, 13) × C(13, 13)
Then,
P (North will receive 4 aces and 4 kings) = (C(4, 4) × C(4, 4) × C(44, 5) × C(39, 13) × C(26, 13) × C(13, 13))/n = 11/6431950
because North is given 4 aces and 4 kings plus 5 other cards from the remaining 44 cards
while the remaining 39 cards are distributed equally among South, East and West. By the
same token,
P (South will receive exactly one ace) = (C(4, 1) × C(48, 12) × C(39, 13) × C(26, 13) × C(13, 13))/n = 9139/20825
since South would receive one out of the 4 aces and 12 other cards from 48 non-ace cards
while the remaining 39 cards are distributed equally among North, East and West.
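Both bridge probabilities can be reproduced with exact arithmetic; the following Python sketch mirrors the counting argument above:

```python
from fractions import Fraction
from math import comb

# total number of ways to deal 13 cards to each of the four players
n = comb(52, 13) * comb(39, 13) * comb(26, 13) * comb(13, 13)

# North gets all 4 aces, all 4 kings, and 5 of the other 44 cards
p_north = Fraction(comb(4, 4) * comb(4, 4) * comb(44, 5)
                   * comb(39, 13) * comb(26, 13) * comb(13, 13), n)

# South gets exactly 1 of the 4 aces and 12 of the 48 non-ace cards
p_south = Fraction(comb(4, 1) * comb(48, 12)
                   * comb(39, 13) * comb(26, 13) * comb(13, 13), n)

print(p_north)  # 11/6431950
print(p_south)  # 9139/20825
```

Note how the factors covering the other three hands cancel, so p_north reduces to C(44, 5)/C(52, 13) and p_south to 4 × C(48, 12)/C(52, 13).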
Exercise 2.6.1 From 7 consonants and 5 vowels, how many words can be formed consist-
ing of 4 different consonants and 3 different vowels? The words need not have meaning.
Solution 2.6.6 The four different consonants can be selected in C(7, 4) ways, the three different vowels can be selected in C(5, 3) ways, and the resulting 7 different letters can then be arranged among themselves in 7! ways. Hence the number of possible words is C(7, 4) × C(5, 3) × 7! = 35 × 10 × 5040 = 1764000.
Exercise 2.6.2 The Dean's office has received 12 nominations from which to designate 4 student representatives to serve on a campus curriculum committee. Among the nominees, 5 are liberal arts majors and 7 are science majors.
(a) How many ways can 4 students be selected from the group of 12?
(b) How many selections are possible that would include 1 science major and 3 liberal arts
majors?
(c) If the selection process were random, what is the probability that 1 science major and 3 liberal arts majors would be selected?
Solution 2.6.7 According to the counting rules, we can answer the questions asked.
(a) C(12, 4) = (12 × 11 × 10 × 9)/(4 × 3 × 2 × 1) = 495
(b) One science major can be chosen from 7 in C(7, 1) = 7 ways. Also, 3 liberal arts majors can be chosen from 5 in C(5, 3) = 10 ways. Each of the 7 choices of science major can be combined with each of the 10 choices of liberal arts majors, so there are 7 × 10 = 70 possible selections.
(c) Let A be the event ”1 science and 3 liberal arts are selected”. Then P (A) = 70/495 ≈ 0.141.
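The three answers can be confirmed with a short computation; a minimal Python check:

```python
from math import comb

total = comb(12, 4)                  # all ways to choose 4 of the 12 nominees
favorable = comb(7, 1) * comb(5, 3)  # 1 science major and 3 liberal arts majors
p = favorable / total

print(total, favorable, round(p, 3))  # 495 70 0.141
```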
Exercise 2.6.3 From a group of α males and β females a committee of λ people will be
chosen. Here we assume that α + β ≥ λ.
(iii) What is the probability that Mary and Peter will be serving together in a committee?
(iv) What is the probability that Mary and Peter will not be serving together?
Solution 2.6.8 First observe that this experiment has a sample space of size C(α + β, λ). There are C(β, η) ways of choosing η of the females. The remaining λ − η members of the committee must be male, and they can be chosen in C(α, λ − η) ways.
(i) From the above observation, it follows that the desired probability (under the assumption that β ≥ η and α ≥ λ − η) is
(C(β, η) × C(α, λ − η))/C(α + β, λ)
(ii) Similarly, the desired probability is
(C(α, λ − 3) × C(β, 3) + C(α, λ − 2) × C(β, 2) + C(α, λ − 1) × C(β, 1) + C(α, λ) × C(β, 0))/C(α + β, λ)
(iii) We must assume that Peter and Mary belong to the original set of people, otherwise
the probability will be 0. Since Peter and Mary must belong to the committee, we
must choose λ − 2 other people from the pool of the α + β − 2 people remaining. The
desired probability is thus
C(α + β − 2, λ − 2)/C(α + β, λ)
(iv) Again, we must assume that Peter and Mary belong to the original set of people,
otherwise the probability will be 1. Observe that one of the following three situations
may arise:
The number of committees that include Peter but exclude Mary is C(α + β − 2, λ − 1). The number of committees that include Mary but exclude Peter is C(α + β − 2, λ − 1). The number of committees that exclude both Peter and Mary is C(α + β − 2, λ). Hence the desired probability is seen to be
(C(α + β − 2, λ − 1) + C(α + β − 2, λ − 1) + C(α + β − 2, λ))/C(α + β, λ)
Exercise 2.6.4 An urn has 3 white marbles, 4 red marbles, and 5 blue marbles. Three
marbles are drawn at once from the urn, and their colour noted. What is the probability
that a marble of each colour is drawn?
Solution 2.6.9 First observe that this experiment has a sample space of size C(12, 3). Since we need to select a marble of each colour, the requested probability will be given by
(C(3, 1) × C(4, 1) × C(5, 1))/C(12, 3) = 60/220 = 3/11
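A quick Python check of this probability:

```python
from fractions import Fraction
from math import comb

# 3 white, 4 red, 5 blue marbles; 3 drawn at once
p = Fraction(comb(3, 1) * comb(4, 1) * comb(5, 1), comb(12, 3))
print(p)  # 3/11
```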
Chapter 3
Random Variables
3.1 Introduction
Definition 3.1.1 Let S represent the sample space of a random experiment, and let R be the set of real numbers. A random variable is a function X which assigns to each element s ∈ S one and only one number X(s) = x ∈ R, i.e.,
X : S −→ R
Example 3.1.2 (a) The score obtained from the roll of a die can be thought of as a
random variable taking the values 1 to 6.
(b) If two dice are rolled, a random variable can be defined to be the sum of the scores,
taking the values 2 to 12.
(c) A random variable can be defined to be the positive difference between the scores
obtained from two dice, taking the values 0 to 5.
It is customary to denote a random variable by a capital letter, such as X, and a particular value it takes by the corresponding lowercase letter, such as x. For example, it may be stated that a random variable X takes the values x = −0.6, x = 2.3,
and x = 4.0. This custom helps clarify whether the random variable is being referred to,
or a particular value taken by the random variable, and it is helpful in the subsequent
mathematical discussions.
Since the value of a random variable is determined by the outcome of the experiment, we
may assign probabilities to the possible values of the random variable.
Example 3.1.3 Letting X denote the random variable that is defined as the sum of two fair dice; then
P {X = 2} = P {(1, 1)} = 1/36,
P {X = 3} = P {(1, 2), (2, 1)} = 2/36,
P {X = 4} = P {(1, 3), (2, 2), (3, 1)} = 3/36,
P {X = 5} = P {(1, 4), (2, 3), (3, 2), (4, 1)} = 4/36,
P {X = 6} = P {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)} = 5/36,
P {X = 7} = P {(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)} = 6/36, (3.1.1)
P {X = 8} = P {(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)} = 5/36,
P {X = 9} = P {(3, 6), (4, 5), (5, 4), (6, 3)} = 4/36,
P {X = 10} = P {(4, 6), (5, 5), (6, 4)} = 3/36,
P {X = 11} = P {(5, 6), (6, 5)} = 2/36,
P {X = 12} = P {(6, 6)} = 1/36.
In other words, the random variable X can take on any integral value between two and
twelve, and the probability that it takes on each value is given by the Equation 3.1.1.
Since X must take on one of the values two through twelve, we must have that
1 = P ( ⋃_{n=2}^{12} {X = n} ) = Σ_{n=2}^{12} P {X = n}
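The pmf in Equation 3.1.1 can be checked by brute-force enumeration of the 36 equally likely outcomes. The sketch below (names are illustrative) tallies the distribution of the sum and confirms that the probabilities add to one:

```python
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of rolling two fair dice
# and tally the probability mass function of the sum X.
pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    s = d1 + d2
    pmf[s] = pmf.get(s, Fraction(0)) + Fraction(1, 36)

print(pmf[7])             # 1/6, i.e. 6/36
print(sum(pmf.values()))  # 1, as required
```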
Example 3.1.4 For a second example, suppose that our experiment consists of tossing
two fair coins. Letting Y denote the number of heads appearing, then Y is a random
variable taking on one of the values 0, 1, 2 with respective probabilities
P {Y = 0} = P {T T } = 1/4
P {Y = 1} = P {T H, HT } = 2/4   (3.1.2)
P {Y = 2} = P {HH} = 1/4

Of course, P {Y = 0} + P {Y = 1} + P {Y = 2} = 1/4 + 2/4 + 1/4 = 1.
Example 3.1.5 Suppose that we toss a coin having a probability p of coming up heads,
until the first head appears. Letting N denote the number of flips required, then assuming
that the outcome of successive flips are independent, N is a random variable taking on
one of the values 1, 2, 3, · · · , with respective probabilities
P {N = 1} = P {H} = p,
P {N = 2} = P {T H} = (1 − p)p,
P {N = 3} = P {T T H} = (1 − p)²p,
⋮                                            (3.1.3)
P {N = n} = P {T T · · · T H} = (1 − p)^{n−1} p,   n ≥ 1

where the sequence T T · · · T consists of n − 1 tails.
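As a quick check of Equation 3.1.3, the following sketch evaluates the pmf for an assumed p = 0.3 and confirms that the probabilities accumulate to one:

```python
# Probability that the first head appears on flip n, for a coin with
# head probability p (the geometric pmf of Equation 3.1.3).
def pmf_first_head(n, p):
    return (1 - p) ** (n - 1) * p

p = 0.3
probs = [pmf_first_head(n, p) for n in range(1, 200)]
print(probs[0])    # P{N = 1} = p = 0.3
print(sum(probs))  # partial sums approach 1
```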
Example 3.1.6 Suppose that our experiment consists of seeing how long a battery can
operate before wearing down. Suppose also that we are not primarily interested in the
actual lifetime of the battery but are concerned only about whether or not the battery lasts
at least two years. In this case, we may define the random variable I by
I =  1,  if the lifetime of the battery is two or more years   (3.1.4)
     0,  otherwise
If E denotes the event that the battery lasts two or more years, then the random variable
I is known as the indicator random variable for event E. Note that I equals 1 or 0
depending on whether or not E occurs.
Example 3.1.7 Suppose that independent trials, each of which results in any of m pos-
sible outcomes with respective probabilities
p_1 , · · · , p_m ,   Σ_{i=1}^{m} p_i = 1,
are continually performed. Let X denote the number of trials needed until each outcome
has occurred at least once. Rather than directly considering P {X = n} we will first
determine P {X > n}, the probability that at least one of the outcomes has not yet occurred
after n trials. Letting Ai denote the event that outcome i has not yet occurred after the
first n trials, i = 1, · · · , m, then
P {X > n} = P ( ⋃_{i=1}^{m} A_i ) = Σ_i P (A_i) − Σ_{i<j} P (A_i ∩ A_j)
          + Σ_{i<j<k} P (A_i ∩ A_j ∩ A_k) − · · · + (−1)^{m+1} P (A_1 ∩ A_2 ∩ · · · ∩ A_m)
Now, P (Ai ) is the probability that each of the first n trials results in a non-i outcome, and
so by independence P (Ai ) = (1 − pi )n . Similarly, P (Ai ∩ Aj ) is the probability that the
first n trials all result in a non-i and non-j outcome, and so P (Ai ∩ Aj ) = (1 − pi − pj )n .
As all of the other probabilities are similar, we see that
P {X > n} = Σ_{i=1}^{m} (1 − p_i)^n − Σ_{i<j} (1 − p_i − p_j)^n + Σ_{i<j<k} (1 − p_i − p_j − p_k)^n − · · ·
Since
P {X = n} = P {X > n − 1} − P {X > n},
we see, upon using the algebraic identity
P {X = n} = Σ_{i=1}^{m} p_i (1 − p_i)^{n−1} − Σ_{i<j} (p_i + p_j)(1 − p_i − p_j)^{n−1}
          + Σ_{i<j<k} (p_i + p_j + p_k)(1 − p_i − p_j − p_k)^{n−1} − · · ·
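The inclusion-exclusion formula for P {X > n} can be evaluated mechanically. The helper below is a hypothetical sketch of the displayed formula, checked against the case m = 2 with equally likely outcomes, where P {X > n} = 2(1/2)^n:

```python
from itertools import combinations

# Inclusion-exclusion evaluation of P{X > n}: the probability that,
# after n independent trials with outcome probabilities p, at least
# one of the m outcomes has not yet occurred.
def prob_not_all_seen(n, p):
    m = len(p)
    total = 0.0
    for r in range(1, m + 1):  # size of the index set {i, j, ...}
        for idx in combinations(range(m), r):
            miss = (1 - sum(p[i] for i in idx)) ** n
            total += (-1) ** (r + 1) * miss
    return total

# With m = 2 equally likely outcomes, P{X > 3} = 2 (1/2)^3 = 0.25.
print(prob_not_all_seen(3, [0.5, 0.5]))  # 0.25
```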
In all of the preceding examples, the random variables of interest took on either a finite
or a countable number of possible values. Such random variables are called discrete.
However, there also exist random variables that take on a continuum of possible values.
These are known as continuous random variables. One example is the random variable
denoting the lifetime of a car, when the car’s lifetime is assumed to take on any value in
some interval (a, b).
Definition 3.1.8 The cumulative distribution function (cdf ) (or more simply the dis-
tribution function) F (·) of the random variable X is defined for any real number b,
−∞ < b < ∞, by
F (b) = P {X ≤ b}
In words, F (b) denotes the probability that the random variable X takes on a value that
is less than or equal to b. Some properties of the cdf F are:
(i) F (b) is a nondecreasing function of b;
(ii) lim_{b→∞} F (b) = 1;
(iii) lim_{b→−∞} F (b) = 0.
Property (i) follows since for a < b the event {X ≤ a} is contained in the event {X ≤ b},
and so it must have a smaller probability.
Properties (ii) and (iii) follow since X must take on some finite value.
All probability questions about X can be answered in terms of the cdf F (·). For example,

P {a < X ≤ b} = F (b) − F (a),   a < b

This follows since we may calculate P {a < X ≤ b} by first computing the probability
that X ≤ b [that is, F (b)] and then subtracting from this the probability that X ≤ a
[that is, F (a)].
If we desire the probability that X is strictly smaller than b, we may calculate this
probability by
P {X < b} = lim_{h→0⁺} P {X ≤ b − h} = lim_{h→0⁺} F (b − h)
Note that P {X < b} does not necessarily equal F (b) since F (b) also includes the proba-
bility that X equals b.
Remark 3.1.1 The following equations show in details how one can calculate probabilities
for random variables using the cdf F .
P {X ≤ a} = F (a)
P {X < a} = P {X ≤ a} − P {X = a} = F (a) − P {X = a}
P {X > a} = 1 − P {X ≤ a} = 1 − F (a)
P {X ≥ a} = 1 − P {X < a} = 1 − F (a) + P {X = a}
P {a ≤ X ≤ b} = P {X ≤ b} − P {X < a} = F (b) − F (a) + P {X = a}
P {a < X ≤ b} = F (b) − F (a)
P {a < X < b} = F (b) − F (a) − P {X = b}
P {a ≤ X < b} = F (b) − F (a) − P {X = b} + P {X = a}
Example 3.1.9 Consider the experiment of rolling a die. There are six possible out-
comes. If we define the random variable X as the number of dots observed on the upper
surface of the die, then the six possible outcomes can be described as x1 = 1, x2 = 2, ..., x6 =
6. The respective probabilities are P {X = x_i} = 1/6 for i = 1, 2, ..., 6. The cdf of X is therefore defined as follows:

        0,    if −∞ < x < 1
        1/6,  if 1 ≤ x < 2
        2/6,  if 2 ≤ x < 3
F (x) = 3/6,  if 3 ≤ x < 4
        4/6,  if 4 ≤ x < 5
        5/6,  if 5 ≤ x < 6
        1,    if 6 ≤ x < ∞
As was previously mentioned, a random variable that can take on at most a countable
number of possible values is said to be discrete. For a discrete random variable X, we
define the probability mass function p(a) of X by
p(a) = P {X = a}
The probability mass function p(a) is positive for at most a countable number of values
of a. That is, if X must assume one of the values x1 , x2 , · · · , then
p(xi ) > 0, i = 1, 2, · · · .
Since X must take on one of the values x_i, we have

Σ_{i=1}^{∞} p(x_i) = 1
For example, if X has the probability mass function given by p(1) = 1/2, p(2) = 1/3, p(3) = 1/6, then the cumulative distribution function of X is

        0,    for a < 1
        1/2,  for 1 ≤ a < 2
F (a) =
        5/6,  for 2 ≤ a < 3
        1,    for 3 ≤ a
Discrete random variables are often classified according to their probability mass func-
tions. We now consider some of these random variables.
3.2.1 The Bernoulli Random Variable
A random variable X is said to be a Bernoulli random variable with parameter p, 0 < p < 1, if its probability mass function is given by

p(0) = P {X = 0} = 1 − p
p(1) = P {X = 1} = p          (3.2.1)
3.2.2 The Binomial Random Variable
Suppose that n independent trials, each of which results in a ”success” with probability p
and in a ”failure” with probability 1 − p, are to be performed. If X represents the number
of successes that occur in the n trials, then X is said to be a binomial random variable
with parameters (n, p).
The probability mass function of a binomial random variable having parameters (n, p) is
given by
p(i) = P {X = i} = \binom{n}{i} p^i (1 − p)^{n−i} ,   i = 0, 1, 2, · · · , n   (3.2.2)

where

\binom{n}{i} = n! / ((n − i)! i!)
equals the number of different groups of i objects that can be chosen from a set of n
objects. The validity of the Equation 3.2.2 may be verified by first noting that the
probability of any particular sequence of the n outcomes containing i successes and n − i
failures is, by the assumed independence of trials, p^i (1 − p)^{n−i}. The Equation 3.2.2 then follows since there are \binom{n}{i} different sequences of the n outcomes leading to i successes and n − i failures.
For instance, if n = 3, i = 2, then there are \binom{3}{2} = 3 ways in which the three trials can result in two successes, namely, any one of the three outcomes (S, S, F ), (S, F, S), (F, S, S), where the outcome (S, S, F ) means that the first two trials are successes and the third a failure. Since each of the three outcomes (S, S, F ), (S, F, S), (F, S, S) has a probability p²(1 − p) of occurring, the desired probability is thus \binom{3}{2} p²(1 − p).
Note that, by the Binomial Theorem 2.5.12, the probabilities sum to one, that is,
Σ_{i=0}^{∞} p(i) = Σ_{i=0}^{n} \binom{n}{i} p^i (1 − p)^{n−i} = (p + (1 − p))^n = 1   (3.2.3)
Example 3.2.1 Four fair coins are flipped. If the outcomes are assumed independent,
what is the probability that two heads and two tails are obtained?
Letting X equal the number of heads ("successes") that appear, then X is a binomial random variable with parameters (n = 4, p = 1/2). Hence, by the Equation 3.2.2 we have

P (X = 2) = \binom{4}{2} (1/2)² (1/2)² = 3/8
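Equation 3.2.2 is straightforward to evaluate with math.comb; the sketch below reproduces the 3/8 just obtained and checks that the pmf sums to one:

```python
from math import comb

# Binomial pmf of Equation 3.2.2: probability of i successes in n
# independent trials, each succeeding with probability p.
def binom_pmf(i, n, p):
    return comb(n, i) * p ** i * (1 - p) ** (n - i)

# Four fair coins: probability of exactly two heads.
print(binom_pmf(2, 4, 0.5))  # 6/16 = 0.375 = 3/8
```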
Example 3.2.2 It is known that any item produced by a certain machine will be defective
with probability 0.1, independently of any other item. What is the probability that in a
sample of three items, at most one will be defective?
If X is the number of defective items in the sample, then X is a binomial random variable
with parameters (3, 0.1). Hence, the desired probability is given by
P {X ≤ 1} = P {X = 0} + P {X = 1} = \binom{3}{0} (0.1)⁰ (0.9)³ + \binom{3}{1} (0.1)(0.9)² = 0.972
Example 3.2.3 Suppose that an airplane engine will fail, when in flight, with probability
1−p independently from engine to engine; suppose that the airplane will make a successful
flight if at least 50 percent of its engines remain operative. For what values of p is a four-
engine plane preferable to a two-engine plane?
Because each engine is assumed to fail or function independently of what happens with
the other engines, it follows that the number of engines remaining operative is a binomial
random variable. Hence, the probability that a four-engine plane makes a successful flight
is
\binom{4}{2} p²(1 − p)² + \binom{4}{3} p³(1 − p) + \binom{4}{4} p⁴(1 − p)⁰ = 6p²(1 − p)² + 4p³(1 − p) + p⁴

whereas the corresponding probability for a two-engine plane is

\binom{2}{1} p(1 − p) + \binom{2}{2} p²(1 − p)⁰ = 2p(1 − p) + p²

Hence the four-engine plane is preferable if

6p²(1 − p)² + 4p³(1 − p) + p⁴ ≥ 2p(1 − p) + p²

or, equivalently (dividing through by p), if

6p(1 − p)² + 4p²(1 − p) + p³ ≥ 2 − p

which simplifies to

3p³ − 8p² + 7p − 2 ≥ 0,   that is,   (p − 1)²(3p − 2) ≥ 0

which is equivalent to p ≥ 2/3. Hence, the four-engine plane is safer when the engine success probability is at least as large as 2/3, whereas the two-engine plane is safer when this probability falls below 2/3.
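The comparison can also be inspected numerically. The two functions below are direct transcriptions of the success probabilities derived above; the sampling points for p are arbitrary:

```python
# Success probabilities for the two aircraft, as functions of the
# engine success probability p.
def four_engine(p):
    return 6 * p**2 * (1 - p) ** 2 + 4 * p**3 * (1 - p) + p**4

def two_engine(p):
    return 2 * p * (1 - p) + p**2

# The two designs break even at p = 2/3.
for p in (0.5, 0.7, 0.9):
    print(p, four_engine(p) >= two_engine(p))  # False, True, True
```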
Example 3.2.4 Suppose that a particular trait of a person (such as eye color or left
handedness) is classified on the basis of one pair of genes and suppose that D represents a
dominant gene and r a recessive gene. Thus a person with DD genes is pure dominance,
one with rr is pure recessive, and one with rD is hybrid. The pure dominance and the
hybrid are alike in appearance. Children receive one gene from each parent. If, with respect
to a particular trait, two hybrid parents have a total of four children, what is the probability
that exactly three of the four children have the outward appearance of the dominant gene?
If we assume that each child is equally likely to inherit either of two genes from each parent, the probabilities that the child of two hybrid parents will have DD, rr, or rD pairs of genes are, respectively, 1/4, 1/4, 1/2. Hence, because an offspring will have the outward appearance of the dominant gene if its gene pair is either DD or rD, it follows that the number of such children is binomially distributed with parameters (4, 3/4). Thus the desired probability is

\binom{4}{3} (3/4)³ (1/4) = 27/64.
Suppose that independent trials, each having probability p of being a success, are per-
formed until a success occurs. If we let X be the number of trials required until the
first success, then X is said to be a geometric random variable with parameter p. Its probability mass function is given by

p(n) = P {X = n} = (1 − p)^{n−1} p,   n = 1, 2, · · ·   (3.2.4)

The Equation 3.2.4 follows since, in order for X to equal n, it is necessary and sufficient that the first n − 1 trials be failures and the nth trial a success; the product form then follows since the outcomes of the successive trials are assumed to be independent.
Here we have used the formula for a geometric series, namely

Σ_{k=0}^{∞} a r^k = a/(1 − r)   for |r| < 1
We can again use the urn model to motivate the hypergeometric distribution.
Consider an urn with N total balls, where M are white and N − M are black. We randomly draw n balls without replacement, i.e. we do not place a ball back into the urn once it is drawn. The order in which the balls are drawn is assumed to be of no interest; only the number of drawn white balls is of relevance. We define the random variable

X = number of white balls among the n drawn balls.

To be more precise, among the n drawn balls, x are white and n − x are black. There are \binom{M}{x} possibilities to choose x white balls from the total of M white balls and, analogously, there are \binom{N−M}{n−x} possibilities to choose n − x black balls from the total of N − M black balls. In total, we draw n out of N balls. Recall that the probability is defined as the number of favourable simple events divided by the number of all possible events. The number of combinations for all possible events is \binom{N}{n}. The number of favourable events is \binom{M}{x} \binom{N−M}{n−x}, because we draw, independently of each other, x out of M balls and n − x out of N − M balls. Hence, the probability mass function of the hypergeometric distribution is

p(x) = P (X = x) = \binom{M}{x} \binom{N−M}{n−x} / \binom{N}{n}   (3.2.5)
Example 3.2.6 The German national lottery draws 6 out of 49 balls from a rotating
bowl. Each ball is associated with a number between 1 and 49. A simple bet is to choose
6 numbers between 1 and 49. If 3 or more chosen numbers correspond to the numbers
drawn in the lottery, then one wins a certain amount of money. What is the probability
of choosing 4 correct numbers?
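The computation is not reproduced above; as a sketch, the probability follows from Equation 3.2.5 with N = 49, M = 6, n = 6, and x = 4:

```python
from math import comb

# Hypergeometric pmf of Equation 3.2.5: x white balls among n drawn
# from an urn with M white and N - M black balls.
def hypergeom_pmf(x, N, M, n):
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

# Lottery: N = 49 numbers, M = 6 winning, n = 6 chosen, x = 4 correct.
p4 = hypergeom_pmf(4, 49, 6, 6)
print(p4)  # ≈ 0.00097
```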
3.2.5 The Poisson Random Variable
A random variable X, taking on one of the values 0, 1, 2, ..., is said to be a Poisson random
variable with parameter λ, if for some λ > 0,
p(i) = P {X = i} = e^{−λ} λ^i / i! ,   i = 0, 1, 2, · · ·   (3.2.7)

The Equation 3.2.7 defines a probability mass function since

Σ_{i=0}^{∞} p(i) = e^{−λ} Σ_{i=0}^{∞} λ^i / i! = e^{−λ} e^{λ} = 1
The Poisson random variable has a wide range of applications in a diverse number of
areas. An important property of the Poisson random variable is that it may be used to
approximate a binomial random variable when the binomial parameter n is large and p is
small. To see this, suppose that X is a binomial random variable with parameters (n, p),
and let λ = np. Then

P {X = i} = [n! / ((n − i)! i!)] p^i (1 − p)^{n−i} = [n! / ((n − i)! i!)] (λ/n)^i (1 − λ/n)^{n−i}
          = [n(n − 1) · · · (n − i + 1) / n^i] (λ^i / i!) (1 − λ/n)^n / (1 − λ/n)^i

Now, for n large and p small, (1 − λ/n)^n ≈ e^{−λ}, n(n − 1) · · · (n − i + 1)/n^i ≈ 1, and (1 − λ/n)^i ≈ 1. Hence, for n large and p small,

P {X = i} ≈ e^{−λ} λ^i / i!
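The quality of the approximation is easy to inspect numerically; the sketch below compares the two pmfs for an assumed n = 100 and p = 0.03 (so λ = 3):

```python
from math import comb, exp, factorial

n, p = 100, 0.03  # large n, small p
lam = n * p       # λ = np = 3

# Binomial pmf versus its Poisson approximation, for i = 0, ..., 5.
binom = [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(6)]
poisson = [exp(-lam) * lam ** i / factorial(i) for i in range(6)]

for i in range(6):
    print(i, round(binom[i], 4), round(poisson[i], 4))
```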
Example 3.2.7 Suppose that the number of typographical errors on a single page of the
current document has a Poisson distribution with parameter λ = 1. Calculate the proba-
bility that there is at least one error on this page.
P {X ≥ 1} = 1 − P {X = 0} = 1 − e^{−1} ≈ 0.632

Example 3.2.8 If the number of accidents occurring on a highway each day is a Poisson random variable with parameter λ = 3, what is the probability that no accidents occur today? The desired probability is P {X = 0} = e^{−3} ≈ 0.050.
3.3 Continuous Random Variables
In this section, we shall concern ourselves with random variables whose set of possible
values is uncountable.
Let X be such a random variable. We say that X is a continuous random variable if there
exists a nonnegative function f (x), defined for all real x ∈ (−∞, ∞), having the property
that for any set B of real numbers
P {X ∈ B} = ∫_B f (x) dx   (3.3.1)
The function f (x) is called the probability density function of the random variable X. In
words, the Equation 3.3.1 states that the probability that X will be in B may be obtained
by integrating the probability density function over the set B. Since X must assume some
value, f (x) must satisfy
1 = P {X ∈ (−∞, ∞)} = ∫_{−∞}^{∞} f (x) dx
All probability statements about X can be answered in terms of f (x). For instance,
letting B = [a, b], we obtain from the Equation 3.3.1 that
P {a ≤ X ≤ b} = ∫_a^b f (x) dx   (3.3.2)
If we let a = b in the preceding, then
P {X = a} = ∫_a^a f (x) dx = 0
In words, this equation states that the probability that a continuous random variable will
assume any particular value is zero.
The relationship between the cumulative distribution F (·) and the probability density
f (·) is expressed by
F (a) = P {X ∈ (−∞, a]} = ∫_{−∞}^{a} f (x) dx
Differentiating both sides of the preceding yields
(d/da) F (a) = f (a)
That is, the density is the derivative of the cumulative distribution function. A somewhat
more intuitive interpretation of the density function may be obtained from the Equation
3.3.2 as follows:
P { a − ε/2 ≤ X ≤ a + ε/2 } = ∫_{a−ε/2}^{a+ε/2} f (x) dx ≈ ε f (a)
when ε is small. In other words, the probability that X will be contained in an interval
of length ε around the point a is approximately εf (a). From this, we see that f (a) is a
measure of how likely it is that the random variable will be near a. There are several
important continuous random variables that appear frequently in probability theory. The
remainder of this section is devoted to a study of certain of these random variables.
3.3.1 The Uniform Random Variable
A random variable is said to be uniformly distributed over the interval (0, 1) if its probability density function is given by

        1,  if 0 < x < 1
f (x) =                     (3.3.3)
        0,  otherwise

Note that f is indeed a density function, since f (x) ≥ 0 and

∫_{−∞}^{∞} f (x) dx = ∫_0^1 dx = 1
Since f (x) > 0 only when x ∈ (0, 1), it follows that X must assume a value in (0, 1). Also,
since f (x) is constant for x ∈ (0, 1), X is just as likely to be ”near” any value in (0, 1) as
any other value. To check this, note that, for any 0 < a < b < 1,
P {a ≤ X ≤ b} = ∫_a^b f (x) dx = b − a
In other words, the probability that X is in any particular subinterval of (0, 1) equals the
length of that subinterval.
In general, we say that X is a uniform random variable on the interval (α, β) if its
probability density function is given by
        1/(β − α),  if α < x < β
f (x) =                            (3.3.4)
        0,          otherwise
Example 3.3.1 Calculate the cumulative distribution function of a random variable uni-
formly distributed over (α, β).
Since F (a) = ∫_{−∞}^{a} f (x) dx, we obtain from the Equation 3.3.4 that

        0,                 if a ≤ α
F (a) = (a − α)/(β − α),   if α ≤ a < β
        1,                 if a ≥ β
Example 3.3.2 If X is uniformly distributed over (0, 10), calculate the probability that
(a) X < 3, (b) X > 7, (c) 1 < X < 6.
The density of X is

        1/10,  if 0 < x < 10
f (x) =                          (3.3.5)
        0,     otherwise

(a) P {X < 3} = (1/10) ∫_0^3 dx = 3/10
(b) P {X > 7} = 1 − P {X ≤ 7} = 1 − (1/10) ∫_0^7 dx = 1 − 7/10 = 3/10
(c) P {1 < X < 6} = (1/10) ∫_1^6 dx = 5/10 = 1/2
3.3.2 The Exponential Random Variable
A continuous random variable whose probability density function is given, for some λ > 0, by

        λe^{−λx},  if x ≥ 0
f (x) =                        (3.3.6)
        0,         if x < 0

is said to be an exponential random variable with parameter λ. Its cumulative distribution function F is given by

F (a) = ∫_0^a λe^{−λx} dx = 1 − e^{−λa},   a ≥ 0   (3.3.7)

Note that F (∞) = ∫_0^∞ λe^{−λx} dx = 1, as, of course, it must be.
3.3.3 The Gamma Random Variable
A continuous random variable whose density function is given by

        λe^{−λx} (λx)^{α−1} / Γ(α),  if x ≥ 0
f (x) =                                          (3.3.8)
        0,                           if x < 0

for some λ > 0, α > 0 is said to be a gamma random variable with parameters α, λ. The quantity Γ(α) is called the gamma function and is defined by

Γ(α) = ∫_0^∞ e^{−x} x^{α−1} dx
Exercise 3.3.1 Let X be a gamma random variable whose density function is given by the Equation 3.3.8. Prove that

∫_{−∞}^{∞} f (x) dx = 1   (3.3.9)
For more information about the Gamma function, we refer the reader to the course of Special Functions.
3.3.4 The Normal Random Variable
We say that X is a normal random variable (or simply that X is normally distributed ) with parameters µ and σ² if the density of X is given by

f (x) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)},   −∞ < x < ∞,   −∞ < µ < ∞,   σ² > 0   (3.3.10)
Exercise 3.3.2 Let X be a normally distributed random variable where its density func-
tion is given by the Equation 3.3.10. Prove that
∫_{−∞}^{∞} f (x) dx = 1   (3.3.11)
Remark 3.3.1 When there is more than one random variable under consideration, we shall denote the cumulative distribution function of a random variable Z by F_Z (·). Similarly, we shall denote the density of Z by f_Z (·).
An important fact about normal random variables is given in the following theorem.

Theorem 3.3.3 If X is normally distributed with parameters µ and σ², then Y = αX + β is normally distributed with parameters αµ + β and α²σ².

Proof. To prove this, suppose first that α > 0 and note that the cumulative distribution function of the random variable Y is given by

F_Y (a) = P {Y ≤ a} = P {αX + β ≤ a} = P {X ≤ (a − β)/α}
        = ∫_{−∞}^{(a−β)/α} (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)} dx
        = ∫_{−∞}^{a} (1/(√(2π) ασ)) exp{ −(v − (αµ + β))² / (2α²σ²) } dv   (3.3.12)

where the last equality is obtained by the change of variables v = αx + β. However, since

F_Y (a) = ∫_{−∞}^{a} f_Y (v) dv,

it follows from the Equation 3.3.12 that the probability density function f_Y (·) is that of a normal random variable with parameters αµ + β and α²σ². In particular, if X is normally distributed with parameters µ and σ², then Z = (X − µ)/σ is normally distributed with parameters 0 and 1. Such a random variable is said to have a standard normal distribution, with density

f (z) = (1/√(2π)) e^{−z²/2},   −∞ < z < ∞

Let Z be the standard normal random variable. The following properties can be observed from the symmetry of the standard normal curve about 0.
(i) P (Z ≤ 0) = 0.5
(ii) P (Z ≤ −z) = P (Z ≥ z) = 1 − P (Z ≤ z) for any z > 0
Example 3.3.4 Let Z be the standard normal random variable. Find P [Z ≤ 1.37] and
P [Z > 1.37]. From the Normal probability table, we see that the probability P [Z ≤ 1.37] =
0.9147. This is the area to the left of z = 1.37. Moreover, because [Z > 1.37] is the
complement of [Z ≤ 1.37], it follows that P [Z > 1.37] = 1 − P [Z ≤ 1.37] = 1 − 0.9147 =
0.0853 as we can see in the normal table. An alternative method is to use symmetry to
show that P [Z > 1.37] = P [Z < −1.37] which can be obtained directly from the normal
table.
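Instead of a printed table, the standard normal cdf is available in Python's statistics.NormalDist; the sketch below reproduces the values used in this example:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

p_le = Z.cdf(1.37)
print(round(p_le, 4))          # 0.9147, matching the table
print(round(1 - p_le, 4))      # 0.0853
print(round(Z.cdf(-1.37), 4))  # 0.0853, by symmetry
```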
Example 3.3.5 Compute P [−0.155 < Z < 1.60]. From the Normal probability table, we see that P [Z ≤ 1.60] = 0.9452, i.e. the area to the left of 1.60. We interpolate between the entries for −0.15 and −0.16 to obtain P [Z ≤ −0.155] = 0.4384, i.e. the area to the left of −0.155. Let us note that to obtain P [Z ≤ −0.155] we have interpolated by taking the mean of P [Z ≤ −0.15] and P [Z ≤ −0.16], since P [Z ≤ −0.155] is not presented in the Normal Probability Table. Therefore, P [−0.155 < Z < 1.60] = 0.9452 − 0.4384 = 0.5068.
The working steps employed in the Example 3.3.6 can be formalized into the following rule: if X is normally distributed with mean µ and standard deviation σ, then

P [a ≤ X ≤ b] = P [ (a − µ)/σ ≤ Z ≤ (b − µ)/σ ]   (3.3.14)
Exercise 3.3.3 The number of calories in a salad on the lunch menu is normally distributed with µ = 200 and σ = 5. Find the probability that the salad you select will contain
Solution 3.3.7 Let X be the number of calories in the salad; we have the standardized variable Z = (X − 200)/5.
(Normal probability table: areas under the standard normal curve, for negative and positive values of Z.)
Solution 3.3.8 Because the trials are conducted on different parent plants, it is natural to assume that they are independent. Let the random variable X denote the number of red flowered plants among the 5 offspring. If we identify the occurrence of red as a success S, the Mendelian theory specifies that P (S) = p = 1/4, and hence X has a Binomial distribution with n = 5 and p = 0.25. The required probabilities are therefore

(a) P {X = 0} = \binom{5}{0} (0.25)⁰ (1 − 0.25)⁵ = 0.237

(b) P {X ≥ 4} = P {X = 4} + P {X = 5} = \binom{5}{4} (0.25)⁴ (0.75)¹ + \binom{5}{5} (0.25)⁵ (0.75)⁰ = 0.015 + 0.001 = 0.016.
Exercise 3.3.5 The raw scores in a national aptitude test are normally distributed with
µ = 506 and σ = 81.
Solution 3.3.9 If we denote the raw scores by X, the standardized score Z = (X − 506)/81 is distributed as N (0, 1).
(b) We first find the 30th percentile in the z−scale and then convert it to the x−scale. That
is we need to find z such that P [Z ≤ z] = 0.30. From the Standard Normal Table,
we find P [Z ≤ −0.524] = 0.30. The standardized score z = −0.524 corresponds to
x = 506 + (81)(−0.524) = 463.56. Therefore, the 30th percentile score is about 463.6.
Exercise 3.3.6 Suppose that it is known that a new treatment is successful in curing a
muscular pain in 50% of the cases. If it is tried on 15 patients, find the probability that
(a) At most 6 will be cured.
(b) The number cured will be no fewer than 6 and no more than 10.
Solution 3.3.10 Designating the cure of a patient by S and assuming that the results for individual patients are independent, we note that the Binomial distribution with n = 15 and p = 0.5 is appropriate for X = number of patients who are cured.
(a) P {X ≤ 6} = P {X = 0} + P {X = 1} + P {X = 2} + P {X = 3} + P {X = 4} + P {X = 5} + P {X = 6} = 0.304.
(b) P {6 ≤ X ≤ 10} = P {X ≤ 10} − P {X ≤ 5} = 0.941 − 0.151 = 0.790.
(c) To find P {X ≥ 12} we use the law of complement, that is, P {X ≥ 12} = 1 − P {X ≤ 11} = 1 − [P {X = 0} + P {X = 1} + · · · + P {X = 11}] = 1 − 0.982 = 0.018.
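The sums above can be reproduced with a short helper (a sketch; the function name is illustrative):

```python
from math import comb

# P{X <= k} for a binomial random variable with n = 15, p = 0.5.
def binom_cdf(k, n=15, p=0.5):
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

print(round(binom_cdf(6), 3))                    # 0.304, part (a)
print(round(binom_cdf(10) - binom_cdf(5), 3))    # 0.79,  part (b)
print(round(1 - binom_cdf(11), 3))               # 0.018, part (c)
```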
(b) P [X ≤ 67]
Exercise 3.3.9 A company producing cereals offers a toy in every sixth cereal package in
celebration of their 50th anniversary. A father immediately buys 20 packages.
(i) What is the probability of finding 4 toys in the 20 packages?
(ii) What is the probability of finding no toy in the 20 packages?
Solution 3.3.11 The random variable X = number of packages with a toy is binomially distributed. In each of n = 20 trials, a toy can be found with probability p = 1/6.
(i) Here we have P [X = 4] = \binom{20}{4} (1/6)⁴ (5/6)¹⁶ ≈ 0.20.
(ii) Here we have P [X = 0] = \binom{20}{0} (1/6)⁰ (5/6)²⁰ ≈ 0.026.
Chapter 4
Expectation of a Random Variable
If X is a discrete random variable having a probability mass function p(x), then the
expected value of X is defined by
E[X] = Σ_{x:p(x)>0} x p(x)   (4.1.1)
In other words, the expected value of X is a weighted average of the possible values that
X can take on, each value being weighted by the probability that X assumes that value.
For example, if the probability mass function of X is given by p(1) = 1/2 = p(2), then

E[X] = 1(1/2) + 2(1/2) = 3/2

is just an ordinary average of the two possible values 1 and 2 that X can assume. On the other hand, if p(1) = 1/3 and p(2) = 2/3, then

E[X] = 1(1/3) + 2(2/3) = 5/3

is a weighted average of the two possible values 1 and 2, where the value 2 is given twice as much weight as the value 1 since p(2) = 2p(1).
Example 4.1.1 Find E[X] where X is the outcome when we roll a fair die. Since p(i) = 1/6 for i = 1, 2, · · · , 6, we obtain

E[X] = 1(1/6) + 2(1/6) + 3(1/6) + 4(1/6) + 5(1/6) + 6(1/6) = 7/2
Example 4.1.2 (Expectation of a Bernoulli Random Variable) Calculate E[X] when X is a Bernoulli random variable with parameter p. Since p(0) = 1 − p and p(1) = p, we have

E[X] = 0(1 − p) + 1(p) = p

Thus, the expected number of successes in a single trial is just the probability that the trial will be a success.
Example 4.1.3 (Expectation of a Binomial Random Variable) Calculate E[X] when X is binomially distributed with parameters n and p.

E[X] = Σ_{i=0}^{n} i p(i) = Σ_{i=0}^{n} i \binom{n}{i} p^i (1 − p)^{n−i}
     = Σ_{i=1}^{n} [i n! / ((n − i)! i!)] p^i (1 − p)^{n−i} = Σ_{i=1}^{n} [n! / ((n − i)!(i − 1)!)] p^i (1 − p)^{n−i}
     = np Σ_{i=1}^{n} [(n − 1)! / ((n − i)!(i − 1)!)] p^{i−1} (1 − p)^{n−i} = np Σ_{k=0}^{n−1} \binom{n−1}{k} p^k (1 − p)^{n−1−k}   (4.1.2)
     = np [p + (1 − p)]^{n−1} = np
where the second from the last equality follows by letting k = i − 1. Thus, the expected
number of successes in n independent trials is n multiplied by the probability that a trial
results in a success.
Example 4.1.4 (Expectation of a Geometric Random Variable) Calculate the expectation of a geometric random variable having parameter p. Writing q = 1 − p, we have

E[X] = Σ_{n=1}^{∞} n p q^{n−1} = p Σ_{n=1}^{∞} d/dq (q^n) = p d/dq ( Σ_{n=1}^{∞} q^n ) = p d/dq ( q/(1 − q) ) = p/(1 − q)² = 1/p
Here we have used the formula for a geometric series, namely

Σ_{k=0}^{∞} a r^k = a/(1 − r)   for |r| < 1

Indeed,

1/(1 − q) = Σ_{n=0}^{∞} q^n = 1 + Σ_{n=1}^{∞} q^n   and hence   Σ_{n=1}^{∞} q^n = 1/(1 − q) − 1 = q/(1 − q)
In words, the expected number of independent trials we need to perform until we attain
our first success equals the reciprocal of the probability that any one trial results in a
success.
We may also define the expected value of a continuous random variable. This is done as
follows. If X is a continuous random variable having a probability density function f (x),
then the expected value of X is defined by
E[X] = ∫_{−∞}^{∞} x f (x) dx   (4.2.1)
Example 4.2.1 (Expectation of a Uniform Random Variable) Calculate the expectation of a random variable uniformly distributed over (α, β). From the Equation 4.2.1, we have

E[X] = ∫_α^β x/(β − α) dx = (β² − α²)/(2(β − α)) = (β + α)/2   (4.2.2)
In other words, the expected value of a random variable uniformly distributed over the
interval (α, β) is just the midpoint of the interval.
Example 4.2.2 (Expectation of an Exponential Random Variable) Let X be exponentially distributed with parameter λ. Then

E[X] = ∫_0^∞ x λe^{−λx} dx

Integrating by parts yields

E[X] = [−x e^{−λx}]_0^∞ + ∫_0^∞ e^{−λx} dx = 0 − [e^{−λx}/λ]_0^∞ = 1/λ   (4.2.3)
Example 4.2.3 (Expectation of a Normal Random Variable) Let X be normally distributed with parameters µ and σ². Then

E[X] = (1/(√(2π) σ)) ∫_{−∞}^{∞} x e^{−(x−µ)²/(2σ²)} dx

Writing x as (x − µ) + µ yields

E[X] = (1/(√(2π) σ)) ∫_{−∞}^{∞} (x − µ) e^{−(x−µ)²/(2σ²)} dx + (µ/(√(2π) σ)) ∫_{−∞}^{∞} e^{−(x−µ)²/(2σ²)} dx

Letting y = x − µ leads to

E[X] = (1/(√(2π) σ)) ∫_{−∞}^{∞} y e^{−y²/(2σ²)} dy + µ ∫_{−∞}^{∞} f (x) dx

By symmetry (recall the properties of odd functions), the first integral must be 0, and so

E[X] = µ ∫_{−∞}^{∞} f (x) dx = µ
Suppose now that we are given a random variable X and its probability distribution (that
is, its probability mass function in the discrete case or its probability density function in
the continuous case). Suppose also that we are interested in calculating, not the expected
value of X, but the expected value of some function of X, say, g(X). How do we go about
doing this? One way is as follows. Since g(X) is itself a random variable, it must have a
probability distribution, which should be computable from a knowledge of the distribution
of X. Once we have obtained the distribution of g(X), we can then compute E[g(X)] by
the definition of the expectation.
Example 4.3.1 Suppose that X takes the values 0, 1, 2 with respective probabilities 0.2, 0.5, 0.3. Calculate E[X²].
Letting Y = X², we have that Y is a random variable that can take on one of the values 0², 1², 2² with respective probabilities

p_Y (0) = P {Y = 0²} = 0.2
p_Y (1) = P {Y = 1²} = 0.5
p_Y (4) = P {Y = 2²} = 0.3

Hence, E[X²] = E[Y ] = 0(0.2) + 1(0.5) + 4(0.3) = 1.7
Example 4.3.2 Let X be uniformly distributed over (0, 1). Calculate E[X³]. Letting Y = X³, we calculate the distribution of Y as follows. For 0 ≤ a ≤ 1 we have

F_Y (a) = P {Y ≤ a} = P {X³ ≤ a} = P {X ≤ a^{1/3}} = ∫_0^{a^{1/3}} dx = a^{1/3}

where the last equality follows since X is uniformly distributed over (0, 1). Differentiating, the density of Y is f_Y (a) = (1/3) a^{−2/3}, 0 < a < 1, and hence

E[X³] = E[Y ] = ∫_0^1 a (1/3) a^{−2/3} da = (1/3) ∫_0^1 a^{1/3} da = 1/4
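The value E[X³] = 1/4 can be checked numerically; the sketch below approximates ∫_0^1 x³ dx (which equals E[X³] for X uniform on (0, 1)) by a midpoint Riemann sum:

```python
# Midpoint Riemann-sum approximation of E[X^3] = ∫_0^1 x^3 dx
# for X uniform on (0, 1); exact value is 1/4.
N = 1_000_000
riemann = sum(((i + 0.5) / N) ** 3 for i in range(N)) / N

print(riemann)  # ≈ 0.25
```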
While the foregoing procedure will, in theory, always enable us to compute the expectation
of any function of X from a knowledge of the distribution of X, there is, fortunately,
an easier way to do this. The following proposition shows how we can calculate the
expectation of g(X) without first determining its distribution.
Theorem 4.3.3 (a) If X is a discrete random variable with probability mass function
p(x), then for any real-valued function g,
E[g(X)] = Σ_{x:p(x)>0} g(x) p(x)   (4.3.1)
(b) If X is a continuous random variable with probability density function f (x), then for
any real-valued function g,
E[g(X)] = ∫_{−∞}^{∞} g(x) f (x) dx   (4.3.2)
Corollary 4.3.6 If a and b are constants, then E[aX + b] = aE[X] + b.

Proof. In the discrete case,

E[aX + b] = Σ_{x:p(x)>0} (ax + b) p(x) = a Σ_{x:p(x)>0} x p(x) + b Σ_{x:p(x)>0} p(x) = aE[X] + b
The expected value of a random variable X, E[X], is also referred to as the mean or the
first moment of X. The quantity E[X n ], n ≥ 1, is called the nth moment of X. By the
Theorem 4.3.3, we note that
E[X^n] = Σ_{x:p(x)>0} x^n p(x),   if X is discrete   (4.3.3)

E[X^n] = ∫_{−∞}^{∞} x^n f (x) dx,   if X is continuous
The Variance of a Random Variable
If X is a random variable with mean µ = E[X], then the variance of X, denoted Var(X), is defined by

Var(X) = E[(X − µ)²]   (4.3.4)

For any constants a and b we have

Var(aX + b) = a² Var(X)

Proof. It follows from the Equation 4.3.4 defining the variance and the Corollary 4.3.6 that

Var(aX + b) = E[(aX + b − E[aX + b])²] = E[(aX − aE[X])²] = a² E[(X − E[X])²] = a² Var(X)
Example 4.3.8 (Variance of the Normal Random Variable) Let X be normally distributed with parameters µ and σ². Find Var(X).
We have

Var(X) = E[(X − µ)²] = (1/(√(2π) σ)) ∫_{−∞}^{∞} (x − µ)² e^{−(x−µ)²/(2σ²)} dx

Substituting y = (x − µ)/σ gives

Var(X) = (σ²/√(2π)) ∫_{−∞}^{∞} y² e^{−y²/2} dy

Integrating by parts (with u = y and dv = y e^{−y²/2} dy) then yields

Var(X) = (σ²/√(2π)) ( [−y e^{−y²/2}]_{−∞}^{∞} + ∫_{−∞}^{∞} e^{−y²/2} dy ) = (σ²/√(2π)) √(2π) = σ²
Theorem 4.3.9 Let X be a random variable. Suppose that we also have real valued
functions gi (x) and constants ai , i = 1, 2, · · · , k. Then we have
E[ a_0 + Σ_{i=1}^{k} a_i g_i(X) ] = a_0 + Σ_{i=1}^{k} a_i E[g_i(X)]   (4.3.5)
as long as all the expectations involved are finite. That is, the expectation is a linear
operation.
Proof. We supply a proof assuming that X is a continuous random variable with its pdf
f (x), x ∈ B. In the discrete case, the proof is similar. Let us write
E[ a_0 + Σ_{i=1}^{k} a_i g_i(X) ] = ∫_B ( a_0 + Σ_{i=1}^{k} a_i g_i(x) ) f (x) dx
    = a_0 ∫_B f (x) dx + ∫_B Σ_{i=1}^{k} a_i g_i(x) f (x) dx
    = a_0 ∫_B f (x) dx + Σ_{i=1}^{k} a_i ∫_B g_i(x) f (x) dx
    = a_0 + Σ_{i=1}^{k} a_i E[g_i(X)]   (4.3.6)

since the a_i 's are all constants and ∫_B f (x) dx = 1. Now, we have the desired result.
Theorem 4.3.12 Suppose that X is a random variable. Then we have the following
identity
Var(X) = E[X 2 ] − (E[X])2 (4.3.7)
Proof. We assume that X is continuous with density f , and let E[X] = µ. Then,
Var(X) = E[(X − µ)²] = E[X² − 2µX + µ²] = ∫_{−∞}^{∞} (x² − 2µx + µ²) f (x) dx
       = ∫_{−∞}^{∞} x² f (x) dx − 2µ ∫_{−∞}^{∞} x f (x) dx + µ² ∫_{−∞}^{∞} f (x) dx
       = E[X²] − 2µ · µ + µ² = E[X²] − µ²
A similar proof holds in the discrete case, and so we obtain the requested useful identity
Example 4.3.13 Calculate Var(X) when X represents the outcome when a fair die is rolled. As previously noted in the Example 4.1.1, E[X] = 7/2. Also,

E[X²] = 1(1/6) + 4(1/6) + 9(1/6) + 16(1/6) + 25(1/6) + 36(1/6) = 91/6

It follows that

Var(X) = 91/6 − 49/4 = 35/12
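The identity Var(X) = E[X²] − (E[X])² is easy to verify for the die in exact rational arithmetic (a sketch using Python's fractions module):

```python
from fractions import Fraction

# Verify Var(X) = E[X^2] - (E[X])^2 for a fair die, exactly.
values = range(1, 7)
mean = sum(Fraction(x, 6) for x in values)               # 7/2
second_moment = sum(Fraction(x * x, 6) for x in values)  # 91/6
variance = second_moment - mean ** 2

print(variance)  # 35/12
```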
Example 4.3.14 (Bernoulli random Variable) Recall that a Bernoulli random vari-
able X with parameter p takes the values 1 and 0 with probability p and 1 − p respectively.
We have seen in the Example 4.1.2 that E[X] = p. Note that

    E[X^2] = 1^2 · p + 0^2 · (1 − p) = p

Hence

    Var(X) = E[X^2] − (E[X])^2 = p − p^2 = p(1 − p)

Now let X be a binomial random variable with parameters n and p, and let us calculate
E[X(X − 1)]. Using the properties of the Binomial development we get
    E[X(X − 1)] = Σ_{x=0}^{n} x(x − 1) C(n, x) p^x (1 − p)^{n−x}
                = Σ_{x=0}^{n} x(x − 1) [n! / (x!(n − x)!)] p^x (1 − p)^{n−x}
                = Σ_{x=2}^{n} [n! / ((x − 2)!(n − x)!)] p^x (1 − p)^{n−x}
                = n(n − 1)p^2 Σ_{x=2}^{n} [(n − 2)! / ((x − 2)!(n − x)!)] p^{x−2} (1 − p)^{n−x}
                = n(n − 1)p^2 Σ_{y=0}^{m} [m! / (y!(m − y)!)] p^y (1 − p)^{m−y}    (m = n − 2, y = x − 2)
                = n(n − 1)p^2 (p + (1 − p))^m = n(n − 1)p^2

It follows that E[X^2] = E[X(X − 1)] + E[X] = n(n − 1)p^2 + np, and hence

    Var(X) = n(n − 1)p^2 + np − (np)^2 = np(1 − p)
Similarly, for a Poisson random variable X with parameter λ,

    E[X(X − 1)] = Σ_{x=0}^{∞} x(x − 1) e^{−λ} λ^x / x! = e^{−λ} Σ_{x=2}^{∞} x(x − 1) λ^x / x!
                = e^{−λ} Σ_{x=2}^{∞} λ^x / (x − 2)! = e^{−λ} λ^2 Σ_{k=0}^{∞} λ^k / k!
                = e^{−λ} λ^2 e^λ = λ^2

where we have used the identity Σ_{k=0}^{∞} λ^k / k! = e^λ. Hence

    Var(X) = E[X(X − 1)] + E[X] − (E[X])^2 = λ^2 + λ − λ^2 = λ
For a uniform random variable X on (α, β),

    E[X^2] = ∫_α^β x^2 / (β − α) dx = (β^3 − α^3) / (3(β − α)) = (β^2 + βα + α^2) / 3

Hence

    Var(X) = E[X^2] − (E[X])^2 = (β^2 + βα + α^2)/3 − ((α + β)/2)^2 = (β − α)^2 / 12
Example 4.3.18 (Exponential random Variable) Let X be exponentially distributed
with parameter λ. We have seen in the Example 4.2.2 that E[X] = 1/λ. We first find E[X^2].
Using the Method of Integration by Parts and the Example 4.2.2 one can see that

    E[X^2] = ∫_0^∞ x^2 λ e^{−λx} dx = [−x^2 e^{−λx}]_0^∞ + ∫_0^∞ 2x e^{−λx} dx
           = 0 + (2/λ) ∫_0^∞ x λ e^{−λx} dx = (2/λ)(1/λ) = 2/λ^2

Hence Var(X) = E[X^2] − (E[X])^2 = 2/λ^2 − 1/λ^2 = 1/λ^2.
Exercise 4.3.1 Let Y be a geometric random variable with parameter p. Show that
E[Y^2] = (2 − p)/p^2. Using the result from the Example 4.1.4, show that Var(Y) = (1 − p)/p^2.
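The binomial factorial-moment identity E[X(X − 1)] = n(n − 1)p^2 derived above can be verified by direct summation; the sketch below (with n and p chosen arbitrarily) uses exact rational arithmetic:

```python
from fractions import Fraction
from math import comb

# Direct check of E[X(X-1)] = n(n-1)p^2 for X ~ Binomial(n, p).
n, p = 10, Fraction(3, 10)
pmf = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

fact_moment = sum(x * (x - 1) * pr for x, pr in pmf.items())
mean = sum(x * pr for x, pr in pmf.items())
variance = fact_moment + mean - mean**2   # Var = E[X(X-1)] + E[X] - (E[X])^2

assert fact_moment == n * (n - 1) * p**2
assert variance == n * p * (1 - p)
print(variance)                           # 21/10
```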
Example 4.3.19 (Gamma random Variable) Let X be a Gamma random variable with pa-
rameters (α, β) with α > 0, β > 0. Recall that the probability density function of X is
given by

    f(x) = { β^α x^{α−1} e^{−βx} / Γ(α),  if x ≥ 0
           { 0,                            if x < 0        (4.3.8)
Let us compute the k-th moment of the gamma distribution. To do this we first recall (see
Real Analysis Course) that ∫_0^∞ x^{s−1} e^{−x} dx = Γ(s) and Γ(α + 1) = αΓ(α). Then

    E[X^k] = ∫_0^∞ x^k β^α x^{α−1} e^{−βx} / Γ(α) dx = (β^α / Γ(α)) ∫_0^∞ x^{α+k−1} e^{−βx} dx
           = (β^α Γ(α + k) / (Γ(α) β^{α+k})) ∫_0^∞ β^{α+k} x^{α+k−1} e^{−βx} / Γ(α + k) dx
           = Γ(α + k) / (Γ(α) β^k)
           = (α + k − 1)Γ(α + k − 1) / (Γ(α) β^k) = (α + k − 1)(α + k − 2) · · · α Γ(α) / (Γ(α) β^k)
           = (α + k − 1)(α + k − 2) · · · α / β^k

We have used the fact that

    ∫_0^∞ β^{α+k} x^{α+k−1} e^{−βx} / Γ(α + k) dx = 1

since f(x) = β^{α+k} x^{α+k−1} e^{−βx} / Γ(α + k), x ≥ 0, is a probability density function of a Gamma
random variable with parameters α + k and β.
Therefore E[X] = α/β and E[X^2] = α(α + 1)/β^2. It follows that

    Var(X) = E[X^2] − (E[X])^2 = α(α + 1)/β^2 − α^2/β^2 = α/β^2
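As a numerical sanity check (not in the original notes; α, β, and k are chosen arbitrarily), the moment formula can be compared against Simpson's-rule integration of the gamma density:

```python
from math import gamma, exp

# Numerical check of the gamma k-th moment E[X^k] = (a+k-1)...(a)/b^k
# for alpha = 2, beta = 1, k = 2, so the closed form is 3*2/1 = 6.
alpha, beta, k = 2.0, 1.0, 2

def pdf(x):
    return beta**alpha * x**(alpha - 1) * exp(-beta * x) / gamma(alpha)

def simpson(f, a, b, n=6000):          # composite Simpson rule, n even
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + (2*i - 1) * h) for i in range(1, n // 2 + 1))
    s += 2 * sum(f(a + 2*i * h) for i in range(1, n // 2))
    return s * h / 3

numeric = simpson(lambda x: x**k * pdf(x), 0.0, 60.0)
closed_form = (alpha + 1) * alpha / beta**k
assert abs(numeric - closed_form) < 1e-5
print(numeric)   # about 6.0
```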
Exercise 4.3.2 Consider the following cumulative distribution function of a random vari-
able X:

    F(x) = { 0,                      if x < 2
           { −(1/4)x^2 + 2x − 3,    if 2 ≤ x ≤ 4
           { 1,                      if x > 4

(i) Find the probability density function of X. (ii) Calculate P{X = 4} and P{X < 3}.
(iii) Calculate E[X] and Var(X).
Solution 4.3.20 (i) The first derivative of the cumulative distribution function yields
the pdf, that is F'(x) = f(x):

    f(x) = { 0,             if x < 2
           { −(1/2)x + 2,   if 2 ≤ x ≤ 4
           { 0,             if x > 4

(ii) It is known from the Equation 3.3.2 that for any continuous random variable P{X =
a} = 0 and therefore P{X = 4} = 0. We calculate

    P{X < 3} = P{X ≤ 3} = F(3) = ∫_2^3 (−(1/2)x + 2) dx = [−(1/4)x^2 + 2x]_2^3
             = (−9/4 + 6) − (−1 + 4) = 0.75

(iii) Using the Equation 4.2.1 we obtain the expectation as follows:

    E[X] = ∫_{−∞}^{∞} x f(x) dx = ∫_2^4 x(−(1/2)x + 2) dx = [−(1/6)x^3 + x^2]_2^4
         = (−64/6 + 16) − (−8/6 + 4) = 8/3

Given that we have already calculated E[X], we can use the Theorem 4.3.12 to calculate
the variance as Var(X) = E[X^2] − (E[X])^2. The expectation of X^2 is

    E[X^2] = ∫_2^4 x^2(−(1/2)x + 2) dx = [−(1/8)x^4 + (2/3)x^3]_2^4
           = (−32 + 128/3) − (−2 + 16/3) = 22/3

We thus obtain Var(X) = 22/3 − (8/3)^2 = (66 − 64)/9 = 2/9.
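The integrals in Exercise 4.3.2 are polynomial, so they can be checked exactly by coding the antiderivatives (a sketch, not part of the notes):

```python
from fractions import Fraction as F

# Exact verification of Exercise 4.3.2 with f(x) = -x/2 + 2 on [2, 4].
def P(x):   # antiderivative of x*f(x)   = -x^2/2 + 2x  ->  -x^3/6 + x^2
    x = F(x)
    return -x**3 / 6 + x**2

def Q(x):   # antiderivative of x^2*f(x) = -x^3/2 + 2x^2 -> -x^4/8 + 2x^3/3
    x = F(x)
    return -x**4 / 8 + F(2, 3) * x**3

mean = P(4) - P(2)            # E[X]   = 8/3
second = Q(4) - Q(2)          # E[X^2] = 22/3
variance = second - mean**2   # 2/9
print(mean, second, variance)
```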
As a further example, let X have probability density function

    f(x) = { cx(2 − x),  if 0 ≤ x ≤ 2
           { 0,          elsewhere

Normalization (∫_0^2 cx(2 − x) dx = 4c/3 = 1) forces c = 3/4, and the cumulative
distribution function is

    F(x) = { 0,                        if x < 0
           { (3/4)x^2 − (1/4)x^3,     if 0 ≤ x ≤ 2
           { 1,                        if 2 < x
Chapter 5

Jointly Distributed Random Variables
Thus far, we have concerned ourselves with the probability distribution of a single random
variable. However, we are often interested in probability statements concerning two or
more random variables. To deal with such probabilities, we define, for any two random
variables X and Y , the joint cumulative probability distribution function of X and Y by

    F(a, b) = P{X ≤ a, Y ≤ b},    −∞ < a, b < ∞
Even though two random variables X and Y may be jointly distributed, if interest is fo-
cused on only one of the random variables, then it is appropriate to consider the probability
distribution of that random variable alone. This is known as the marginal distribution of
that random variable and can be obtained quite simply by summing or integrating the
joint probability distribution over the values of the other random variable. In other words,
we have the following:
The distribution of X can be obtained from the joint distribution of X and Y as follows:

    F_X(a) = P{X ≤ a} = P{X ≤ a, Y < ∞} = F(a, ∞)

and, similarly, F_Y(b) = F(∞, b).
In the case where X and Y are both discrete random variables, it is convenient to define
the joint probability mass function of X and Y by
p(x, y) = P {X = x, Y = y}
The probability mass function of X can be obtained from p(x, y) by

    p_X(x) = Σ_{y : p(x,y)>0} p(x, y)    for each x

Similarly,

    p_Y(y) = Σ_{x : p(x,y)>0} p(x, y)    for each y

The function p_X(x) is called the marginal probability mass function of X, and p_Y(y) is
called the marginal probability mass function of Y .
Example 5.1.1 The joint cdf of two discrete random variables X and Y is given as
follows:
    F_{X,Y}(x, y) = { 1/8,  x = 1, y = 1
                    { 5/8,  x = 1, y = 2
                    { 1/4,  x = 2, y = 1
                    { 1,    x = 2, y = 2

Determine (i) the joint probability mass function of X and Y and (ii) the marginal
probability mass functions of X and Y .
Solution 5.1.2 (i) The joint probability mass function is obtained from the relationship:

    F_{X,Y}(a, b) = P{X ≤ a, Y ≤ b} = Σ_{y≤b} Σ_{x≤a} p(x, y)

Hence

    F_{X,Y}(1, 1) = p_{X,Y}(1, 1) = 1/8
    F_{X,Y}(1, 2) = p_{X,Y}(1, 1) + p_{X,Y}(1, 2) = 5/8  =⇒  p_{X,Y}(1, 2) = 5/8 − 1/8 = 1/2
    F_{X,Y}(2, 1) = p_{X,Y}(1, 1) + p_{X,Y}(2, 1) = 1/4  =⇒  p_{X,Y}(2, 1) = 1/4 − 1/8 = 1/8
    F_{X,Y}(2, 2) = p_{X,Y}(1, 1) + p_{X,Y}(1, 2) + p_{X,Y}(2, 1) + p_{X,Y}(2, 2) = 1
                 =⇒  p_{X,Y}(2, 2) = 1 − 1/8 − 1/2 − 1/8 = 1/4

(ii) The marginal probability mass function of X is

    p_X(1) = p_{X,Y}(1, 1) + p_{X,Y}(1, 2) = 1/8 + 1/2 = 5/8
    p_X(2) = p_{X,Y}(2, 1) + p_{X,Y}(2, 2) = 1/8 + 1/4 = 3/8

and that of Y is

    p_Y(1) = p_{X,Y}(1, 1) + p_{X,Y}(2, 1) = 1/8 + 1/8 = 1/4
    p_Y(2) = p_{X,Y}(1, 2) + p_{X,Y}(2, 2) = 1/2 + 1/4 = 3/4
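The pmf recovery in Example 5.1.1 is an instance of inclusion-exclusion on the joint cdf; a minimal sketch (not part of the notes) that reproduces the numbers:

```python
from fractions import Fraction as F

# Recover the joint pmf of Example 5.1.1 from its cdf, then marginalize.
cdf = {(1, 1): F(1, 8), (1, 2): F(5, 8), (2, 1): F(1, 4), (2, 2): F(1)}

def pmf(a, b):
    # inclusion-exclusion on the rectangle (a-1, a] x (b-1, b]
    def Fc(x, y):
        return cdf.get((x, y), F(0))
    return Fc(a, b) - Fc(a - 1, b) - Fc(a, b - 1) + Fc(a - 1, b - 1)

p = {(x, y): pmf(x, y) for x in (1, 2) for y in (1, 2)}
pX = {x: p[(x, 1)] + p[(x, 2)] for x in (1, 2)}
pY = {y: p[(1, y)] + p[(2, y)] for y in (1, 2)}
print(p, pX, pY)
```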
Example 5.1.3 Suppose that a coin is tossed twice. Let X be the total number of heads
shown and Y the total number of tails. The sample space is

    S = {HH, HT, TH, TT}

Clearly, p(x, y) = P{X = x, Y = y} is zero, except at the points (0, 2), (1, 1) and (2, 0).
Furthermore,

    p(0, 2) = P{TT} = 1/4,    p(1, 1) = P{HT, TH} = 1/2,    p(2, 0) = P{HH} = 1/4
Exercise 5.1.1 A fair coin is tossed three times. Let X be a random variable that takes
the value 0 if the first toss is a tail and the value 1 if the first toss is a head. Also, let Y
be a random variable that defines the total number of heads in the three tosses.
Solution 5.1.4 Let H denote the event that a head appears on a toss and T the event
that a tail appears on a toss. The sample space is given by

    S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}

The random variable X can take the values 0 and 1 while the random variable Y can take
the values 0, 1, 2 and 3. Each of these eight sample points is equally likely to be obtained,
with probability 1/8. The joint probability mass function of X and Y can be constructed
as follows:

    p_{X,Y}(0, 0) = P{X = 0, Y = 0} = P(TTT) = 1/8
    p_{X,Y}(0, 1) = P{X = 0, Y = 1} = P(THT) + P(TTH) = 1/4
    p_{X,Y}(0, 2) = P{X = 0, Y = 2} = P(THH) = 1/8
    p_{X,Y}(0, 3) = P{X = 0, Y = 3} = 0
    p_{X,Y}(1, 0) = P{X = 1, Y = 0} = 0
    p_{X,Y}(1, 1) = P{X = 1, Y = 1} = P(HTT) = 1/8
    p_{X,Y}(1, 2) = P{X = 1, Y = 2} = P(HHT) + P(HTH) = 1/4
    p_{X,Y}(1, 3) = P{X = 1, Y = 3} = P(HHH) = 1/8
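Joint pmfs built from a small sample space can also be obtained by brute-force enumeration; a sketch reproducing Exercise 5.1.1:

```python
from fractions import Fraction as F
from itertools import product

# X = 1 iff the first of three fair-coin tosses is a head; Y = total heads.
joint = {}
for outcome in product("HT", repeat=3):
    x = 1 if outcome[0] == "H" else 0
    y = outcome.count("H")
    joint[(x, y)] = joint.get((x, y), F(0)) + F(1, 8)

print(sorted(joint.items()))
```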
We say that X and Y are jointly continuous if there exists a function f(x, y) (sometimes
denoted by f_{X,Y}(x, y)), defined for all real x and y, having the property that for all sets
A and B of real numbers

    P{X ∈ A, Y ∈ B} = ∫_B ∫_A f(x, y) dx dy

The function f(x, y) is called the joint probability density function of X and Y . The
probability density of X can be obtained from a knowledge of f(x, y) by the following
reasoning:

    P{X ∈ A} = P{X ∈ A, Y ∈ (−∞, ∞)} = ∫_A ∫_{−∞}^{∞} f(x, y) dy dx = ∫_A f_X(x) dx

where

    f_X(x) = ∫_{−∞}^{∞} f(x, y) dy,    for −∞ < x < ∞

The function f_X(x) is called the marginal probability density function of X; similarly,
f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx is called the marginal probability density function of Y .
Example 5.1.5 Suppose that the joint pdf of two random variables X and Y is given by
    f(x, y) = { (6/5)(x + y^2),  0 ≤ x ≤ 1, 0 ≤ y ≤ 1
              { 0,               otherwise

(i) Verify that f(x, y) is a probability density function.
(ii) Determine P{0 ≤ X ≤ 1/4, 0 ≤ Y ≤ 1/4}.
(iii) Find the marginal pdf of X.
(v) Find P{1/4 ≤ Y ≤ 3/4}.

Solution 5.1.6 (i) To verify that the given function is a legitimate pdf, note that
f(x, y) ≥ 0 for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1 and

    ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = ∫_0^1 ∫_0^1 (6/5)(x + y^2) dx dy
        = (6/5) ∫_0^1 x dx + (6/5) ∫_0^1 y^2 dy = 6/10 + 6/15 = 1

(iii) The marginal pdf of X is given by

    f_X(x) = ∫_{−∞}^{∞} f(x, y) dy = ∫_0^1 (6/5)(x + y^2) dy = (6/5)x + 2/5

for 0 ≤ x ≤ 1 and 0 otherwise.
A variation of the Theorem 4.3.3 states that if X and Y are random variables and g is a
function of two variables, then
    E[g(X, Y)] = Σ_y Σ_x g(x, y) p(x, y)    in the discrete case        (5.1.1)

    E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f(x, y) dx dy    in the continuous case        (5.1.2)

For example, taking g(x, y) = x + y gives

    E[X + Y] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x + y) f(x, y) dx dy
             = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x f(x, y) dx dy + ∫_{−∞}^{∞} ∫_{−∞}^{∞} y f(x, y) dx dy
             = E[X] + E[Y]

where the first integral is evaluated by using the variation of the Theorem 4.3.3 with
g(x, y) = x, and the second with g(x, y) = y. The same result holds in the discrete case
and, combined with the Corollary 4.3.6, yields that for any constants a, b

    E[aX + bY] = aE[X] + bE[Y]        (5.1.3)

Joint probability distributions may also be defined for n random variables. The details
are exactly the same as when n = 2 and are left as an exercise. The corresponding result
to Equation 5.1.3 states that if X_1, X_2, ..., X_n are n random variables, then for any n
constants a_1, a_2, ..., a_n,

    E[a_1 X_1 + a_2 X_2 + · · · + a_n X_n] = a_1 E[X_1] + a_2 E[X_2] + · · · + a_n E[X_n]        (5.1.4)
Example 5.1.7 Calculate the expected sum obtained when three fair dice are rolled.
Let X denote the sum obtained. Then X = X_1 + X_2 + X_3 where X_i represents the value
of the i-th die. Thus,

    E[X] = E[X_1] + E[X_2] + E[X_3] = 3 · (7/2) = 21/2
Example 5.1.8 As another example of the usefulness of the Equation 5.1.4, let us use
it to obtain the expectation of a binomial random variable having parameters n and p.
Recalling that such a random variable X represents the number of successes in n trials
when each trial has probability p of being a success, we have that X = X1 + X2 + · · · + Xn
where

    X_i = { 1,  if the i-th trial is a success
          { 0,  if the i-th trial is a failure

Hence E[X_i] = P{X_i = 1} = p, and therefore E[X] = E[X_1] + · · · + E[X_n] = np.
This derivation should be compared with the one presented in the Example 4.1.3.
Example 5.1.9 At a party N men throw their hats into the center of a room. The hats
are mixed up and each man randomly selects one. Find the expected number of men who
select their own hats.
Letting X denote the number of men that select their own hats, we can best compute E[X]
by noting that X = X1 + X2 + · · · + XN where
    X_i = { 1,  if the i-th man selects his own hat
          { 0,  otherwise

Now, because the i-th man is equally likely to select any of the N hats, it follows that

    P{X_i = 1} = P{i-th man selects his own hat} = 1/N

and so

    E[X_i] = 1 · P{X_i = 1} + 0 · P{X_i = 0} = 1/N

Hence, from the Equation 5.1.4 we obtain that

    E[X] = E[X_1 + X_2 + · · · + X_N] = N · (1/N) = 1
Hence, no matter how many people are at the party, on the average exactly one of the
men will select his own hat.
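For small N the hat-matching answer can be confirmed by exhaustive enumeration (a sketch, with N = 5 chosen for illustration):

```python
from fractions import Fraction as F
from itertools import permutations

# Average number of fixed points over all 5! equally likely permutations.
N = 5
total_matches = sum(
    sum(1 for i, hat in enumerate(perm) if hat == i)
    for perm in permutations(range(N))
)
expected = F(total_matches, 120)   # 120 = 5!
print(expected)                    # 1
```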
Example 5.1.10 Suppose there are 25 different types of coupons and suppose that each
time one obtains a coupon, it is equally likely to be any one of the 25 types. Compute the
expected number of different types that are contained in a set of 10 coupons.
Let X denote the number of different types in the set of 10 coupons. We compute E[X]
by using the representation X = X1 + X2 + · · · + X25 , where
(
1, if at least one type i coupon is in the set of 10
Xi =
0, otherwise
Now

    E[X_i] = P{X_i = 1} = P{at least one type i coupon is in the set of 10}
           = 1 − P{no type i coupons are in the set of 10} = 1 − (24/25)^{10}

where the last equality follows since each of the 10 coupons will (independently) not be
of type i with probability 24/25. Hence,

    E[X] = E[X_1] + E[X_2] + · · · + E[X_{25}] = 25 [1 − (24/25)^{10}]
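The same answer can be reached by a one-step recursion on the expected number of distinct types, e_{m+1} = e_m + (25 − e_m)/25 (each new coupon is, in expectation, a new type with probability (25 − e_m)/25, by linearity); a sketch confirming it agrees with the closed form:

```python
from fractions import Fraction as F

# Expected number of distinct coupon types among 10 draws from 25 types.
e = F(0)
for _ in range(10):
    e = e + (25 - e) / 25              # expected number of new types added

closed_form = 25 * (1 - F(24, 25) ** 10)
assert e == closed_form
print(float(e))                        # about 8.38
```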
The random variables X and Y are said to be independent if, for all a, b,
P {X ≤ a, Y ≤ b} = P {X ≤ a}P {Y ≤ b} (5.2.1)
In other words, X and Y are independent if, for all a and b, the events Ea = {X ≤ a}
and Fb = {Y ≤ b} are independent.
In terms of the joint distribution function F_{X,Y} of X and Y , we have that X and Y are
independent if

    F_{X,Y}(a, b) = F_X(a) F_Y(b)    for all a, b

When X and Y are discrete, the condition of independence reduces to

    p(x, y) = p_X(x) p_Y(y)        (5.2.2)

while in the jointly continuous case it reduces to f(x, y) = f_X(x) f_Y(y).
To prove this statement, consider first the discrete version, and suppose that the joint
probability mass function p(x, y) satisfies the Equation 5.2.2. Then

    P{X ≤ a, Y ≤ b} = Σ_{y≤b} Σ_{x≤a} p(x, y) = Σ_{y≤b} Σ_{x≤a} p_X(x) p_Y(y)
                    = Σ_{y≤b} p_Y(y) Σ_{x≤a} p_X(x) = P{X ≤ a} P{Y ≤ b}

and so X and Y are independent. The proof that the Equation 5.2.2 implies independence
in the continuous case proceeds in the same manner and is left as an exercise.
Example 5.2.1 Let X and Y be two random variables and suppose that X and Y have
a joint probability density function of fX,Y (x, y) = 6xy 2 for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1 and
fX,Y (x, y) = 0 elsewhere. Check whether X and Y are independent.
The fact that this joint density function is a function of x multiplied by a function of
y immediately indicates that the two random variables are independent. The probability
density function of X is given by

    f_X(x) = ∫_0^1 f_{X,Y}(x, y) dy = ∫_0^1 6xy^2 dy = 2x

for 0 ≤ x ≤ 1, and similarly f_Y(y) = ∫_0^1 6xy^2 dx = 3y^2 for 0 ≤ y ≤ 1. The fact that
f_{X,Y}(x, y) = f_X(x) f_Y(y) confirms that the random variables X and Y are independent.
Example 5.2.2 (Exercise 5.1.1 continued) Consider the random variables X and Y
defined in the Exercise 5.1.1. Are X and Y independent?

Solution 5.2.3 If X and Y are independent, then for all x and y we have that
p_{X,Y}(x, y) = p_X(x) p_Y(y).
Thus, to show that X and Y are not independent, all we have to do is to find a pair of x
and y at which the joint probability mass function does not satisfy this condition. Since

    p_X(0) p_Y(0) = (1/2) × (1/8) = 1/16 ≠ p_{X,Y}(0, 0) = 1/8

we conclude that X and Y are not independent.
Exercise 5.2.1 Determine if random variables X and Y are independent when their joint
pdf is given by

    f_{X,Y}(x, y) = { e^{−(x+y)},  0 ≤ x < ∞, 0 ≤ y < ∞
                    { 0,           otherwise

Solution 5.2.5 To answer the question, we first evaluate the marginal pdfs of X and Y :

    f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy = ∫_0^∞ e^{−(x+y)} dy = e^{−x} ∫_0^∞ e^{−y} dy = e^{−x},   x ≥ 0
    f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx = ∫_0^∞ e^{−(x+y)} dx = e^{−y} ∫_0^∞ e^{−x} dx = e^{−y},   y ≥ 0

Now, f_X(x) f_Y(y) = e^{−x} e^{−y} = e^{−(x+y)} = f_{X,Y}(x, y), which means that X and Y are
independent.
Exercise 5.2.2 Determine if random variables X and Y are independent when their joint
pdf is given by
    f_{X,Y}(x, y) = { 2e^{−(x+y)},  0 ≤ x < y < ∞
                    { 0,            otherwise

Solution 5.2.6 To answer the question, we first evaluate the marginal pdfs of X and Y :

    f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy = 2 ∫_x^∞ e^{−(x+y)} dy = 2e^{−x} ∫_x^∞ e^{−y} dy = 2e^{−2x},   x ≥ 0
    f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx = 2 ∫_0^y e^{−(x+y)} dx = 2e^{−y} ∫_0^y e^{−x} dx = 2e^{−y}(1 − e^{−y}),   y ≥ 0

Now, f_X(x) f_Y(y) = 4e^{−(2x+y)} − 4e^{−2(x+y)} ≠ f_{X,Y}(x, y), which means that X and Y are
not independent.
Exercise 5.2.3 Suppose that a fair coin is tossed three times so that there are eight
equally likely outcomes, and that the random variable X is the number of heads obtained
in the first and second tosses, the random variable Y is the number of heads in the third
toss, and the random variable Z is the number of heads obtained in the second and third
tosses.
(a) (i) Find the joint probability mass function for X and Y .
(ii) Find the marginal distribution of X.
(iii) Find the marginal distribution of Y .
(b) (i) Find the joint probability mass function for X and Z.
(ii) Find the marginal distribution of X.
(iii) Find the marginal distribution of Z.
(c) (i) Find the joint probability mass function for Y and Z.
(ii) Find the marginal distribution of Y .
(iii) Find the marginal distribution of Z.
5.3 Covariance and Variance of Sums of Random Variables
Definition 5.3.1 The covariance of any two random variables X and Y , denoted by
Cov(X, Y), is defined by

    Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]        (5.3.1)

Note that if X and Y are independent, then by the Theorem 5.2.4 it follows that
Cov(X, Y) = 0.
Let us consider now the special case where X and Y are indicator variables for whether
or not the events A and B occur. That is, for events A and B, define

    X = { 1, if A occurs        and        Y = { 1, if B occurs
        { 0, otherwise                         { 0, otherwise

Then XY = 1 if both A and B occur and XY = 0 otherwise, so

    Cov(X, Y) = E[XY] − E[X]E[Y] = P{X = 1, Y = 1} − P{X = 1}P{Y = 1}

That is, the covariance of X and Y is positive if the outcome X = 1 makes it more likely
that Y = 1 (which, as is easily seen by symmetry, also implies the reverse).
Properties of Covariance
For any random variables X, Y, Z and any real constant α, the following hold:

(i) Cov(X, Y) = Cov(Y, X)
(ii) Cov(X, X) = Var(X)
(iii) Cov(αX, Y) = α Cov(X, Y)
(iv) Cov(X + Z, Y) = Cov(X, Y) + Cov(Z, Y)

Whereas the first three properties are immediate, the final one is easily proven as follows:

    Cov(X + Z, Y) = E[(X + Z)Y] − E[X + Z]E[Y]
                  = E[XY] − E[X]E[Y] + E[ZY] − E[Z]E[Y] = Cov(X, Y) + Cov(Z, Y)

The fourth property listed easily generalizes to give the following result:

    Cov(Σ_{i=1}^{n} X_i, Σ_{j=1}^{m} Y_j) = Σ_{i=1}^{n} Σ_{j=1}^{m} Cov(X_i, Y_j)        (5.3.2)
A useful expression for the variance of the sum of random variables can be obtained from
the Equation 5.3.2 as follows:
n
! n n
! n X
n
X X X X
Var Xi = Cov Xi , Xj = Cov(Xi , Xj )
i=1 i=1 j=1 i=1 j=1
Xn Xn X
= Cov(Xi , Xi ) + Cov(Xi , Xj )
i=1 i=1 j6=i
n
X Xn X
= Var(Xi ) + 2 Cov(Xi , Xj ) (5.3.3)
i=1 i=1 j<i
If Xi , i = 1, ..., n are independent random variables, then the previous expression reduces
to !
Xn Xn
Var Xi = Var(Xi ) (5.3.4)
i=1 i=1
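Equation 5.3.3 also holds exactly for empirical (sample) moments; a Monte Carlo sketch for n = 2 (the construction of the correlated data is arbitrary):

```python
import random

# Check Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y) on sample moments.
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(10000)]
ys = [x + random.gauss(0, 0.5) for x in xs]   # Y correlated with X

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return mean([(a - mu) * (b - mv) for a, b in zip(u, v)])

s = [a + b for a, b in zip(xs, ys)]
lhs = cov(s, s)                                # Var(X + Y)
rhs = cov(xs, xs) + cov(ys, ys) + 2 * cov(xs, ys)
assert abs(lhs - rhs) < 1e-9                   # identity is exact for sample moments
print(lhs, rhs)
```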
We can prove again the following statement (already proved in the Theorem 4.3.7), which
is a particular case of the Property (iii) in the list above.

Corollary 5.3.2 Let X be a random variable. If α and β are any fixed real constants
then

    Var(αX + β) = α^2 Var(X)

Proof. Indeed,

    Var(αX + β) = Cov(αX + β, αX + β) = Cov(αX, αX) + 2 Cov(αX, β) + Cov(β, β)
                = α^2 Var(X)

The last equality follows from the fact that α and β are fixed real constants and hence
Cov(αX, β) = Cov(β, αX) = Cov(β, β) = 0 by the use of the Equation 5.3.1.
In practice, the most convenient way to assess the strength of the dependence between
two random variables is through their correlation coefficient.
Definition 5.3.3 The correlation coefficient between two random variables X and Y is
defined to be

    ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))        (5.3.5)
The correlation coefficient takes values between −1 and 1, and independent random vari-
ables have a correlation of 0.
Random variables with a positive correlation coefficient are said to be positively corre-
lated, and in such cases there is a tendency for high values of one random variable to
be associated with high values of the other random variable. Random variables with a
negative correlation coefficient are said to be negatively correlated, and in such cases there
is a tendency for high values of one random variable to be associated with low values of
the other random variable. The strength of these tendencies increases as the correlation
coefficient moves further away from 0 to 1 or to -1.
It follows that Cov(X, Y) = E(XY) − E(X)E(Y) = 1/2 − (2/3) × (3/4) = 0. This result is
worth noting: a covariance of zero does not, in general, imply that the random variables
are independent.
Definition 5.3.5 If X_1, ..., X_n are independent and identically distributed, then the ran-
dom variable

    X̄ = (1/n) Σ_{i=1}^{n} X_i

is called the sample mean.
The following theorem shows that the covariance between the sample mean and a deviation
from that sample mean is zero. It will be needed later!

Theorem 5.3.6 Suppose that X_1, ..., X_n are independent and identically distributed with
expected value µ and variance σ^2. Then,
(a) E[X̄] = µ,
(b) Var(X̄) = σ^2/n,
(c) Cov(X̄, X_i − X̄) = 0.
Proof. Parts (a) and (b) follow from the linearity of expectation and the Equation 5.3.4:

    E[X̄] = (1/n) Σ_{i=1}^{n} E[X_i] = (1/n) Σ_{i=1}^{n} µ = µ

    Var(X̄) = (1/n)^2 Var(Σ_{i=1}^{n} X_i) = (1/n)^2 Σ_{i=1}^{n} Var(X_i) = σ^2/n

To establish part (c) we reason as follows:

    Cov(X̄, X_i − X̄) = Cov(X̄, X_i) − Cov(X̄, X̄)
                     = Cov((1/n)(X_i + Σ_{j≠i} X_j), X_i) − Var(X̄)
                     = (1/n) Cov(X_i, X_i) + (1/n) Cov(Σ_{j≠i} X_j, X_i) − σ^2/n
                     = (1/n) Var(X_i) − σ^2/n = σ^2/n − σ^2/n = 0

where the final equality used the fact that X_i and Σ_{j≠i} X_j are independent and thus have
covariance 0.
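Part (c) can be checked exactly on a small finite case; a sketch for n = 3 i.i.d. fair-coin variables, enumerating all 8 equally likely outcomes:

```python
from fractions import Fraction as F
from itertools import product

# Exact check of Cov(Xbar, X_1 - Xbar) = 0 for three fair-coin variables.
outcomes = list(product((0, 1), repeat=3))
prob = F(1, 8)

def E(g):
    return sum(prob * g(w) for w in outcomes)

xbar = lambda w: F(sum(w), 3)
dev = lambda w: w[0] - xbar(w)          # X_1 - Xbar
covariance = E(lambda w: xbar(w) * dev(w)) - E(xbar) * E(dev)
assert covariance == 0
print(covariance)
```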
Example 5.3.7 (Variance of a Binomial Random Variable) Compute the variance
of a binomial random variable X with parameters n and p.
Since such a random variable represents the number of successes in n independent trials
when each trial has a common probability p of being a success, we may write X = X1 +
X2 + · · · + Xn where the Xi are independent Bernoulli random variables such that
(
1, if the ith trial is a success
Xi =
0, otherwise
Hence, from the Equation 5.3.4 we obtain Var(X) = Var(X_1) + · · · + Var(X_n). Since
X_i^2 = X_i, we have E[X_i^2] = E[X_i] = p, so Var(X_i) = p − p^2 = p(1 − p) and therefore

    Var(X) = np(1 − p)

Consider next the problem of estimating p, the fraction of a population of N individuals
that is in favor of a certain proposition, by sampling n people at random without
replacement. If we let

    X_i = { 1,  if the i-th person chosen is in favor
          { 0,  otherwise
then the usual estimator of p is (1/n) Σ_{i=1}^{n} X_i. Let us now compute its mean and variance.
Now

    E[Σ_{i=1}^{n} X_i] = Σ_{i=1}^{n} E(X_i) = np

where the final equality follows since the i-th person chosen is equally likely to be any of
the N individuals in the population and so has probability Np/N = p of being in favor.
Also,

    Var(Σ_{i=1}^{n} X_i) = Σ_{i=1}^{n} Var(X_i) + 2 Σ_{i=1}^{n} Σ_{j<i} Cov(X_i, X_j)
Now, since X_i is a Bernoulli random variable with mean p, it follows that Var(X_i) =
p(1 − p). Also, for i ≠ j,

    Cov(X_i, X_j) = E[X_i X_j] − p^2 = P{X_i = 1, X_j = 1} − p^2 = p(Np − 1)/(N − 1) − p^2

Hence

    Var(Σ_{i=1}^{n} X_i) = np(1 − p) + 2 C(n, 2) [p(Np − 1)/(N − 1) − p^2]
                         = np(1 − p) − n(n − 1)p(1 − p)/(N − 1)

and so the mean and variance of our estimator are given by

    E[(1/n) Σ_{i=1}^{n} X_i] = p

    Var((1/n) Σ_{i=1}^{n} X_i) = p(1 − p)/n − (n − 1)p(1 − p)/(n(N − 1))
Some remarks are in order: As the mean of the estimator is the unknown value p, we
would like its variance to be as small as possible, and we see by the preceding that, as
a function of the population size N, the variance increases as N increases. The limiting
value, as N → ∞, of the variance is p(1 − p)/n, which is not surprising since for N large each
of the X_i will be (approximately) independent random variables, and thus Σ_{i=1}^{n} X_i will
have an (approximately) binomial distribution with parameters n and p.
The random variable Σ_{i=1}^{n} X_i can be thought of as representing the number of white balls
obtained when n balls are randomly selected from a population consisting of Np white
and N − Np black balls. (Identify a person who favors the proposition with a white ball
and one against with a black ball.) Such a random variable is called hypergeometric and
has a probability mass function given by
    P{Σ_{i=1}^{n} X_i = k} = C(Np, k) C(N − Np, n − k) / C(N, n)        (5.3.6)
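The hypergeometric pmf (5.3.6) can be checked directly; a sketch (parameters N = 20, n = 5, Np = 8 chosen for illustration) verifying that it sums to 1 and has mean np:

```python
from fractions import Fraction as F
from math import comb

# Hypergeometric pmf: k white balls among n drawn from Np white, N - Np black.
N, n, Np = 20, 5, 8
pmf = {k: F(comb(Np, k) * comb(N - Np, n - k), comb(N, n)) for k in range(n + 1)}

assert sum(pmf.values()) == 1                                  # Vandermonde identity
assert sum(k * pr for k, pr in pmf.items()) == F(n * Np, N)    # mean = np = 2
print(dict(pmf))
```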
Suppose first that X and Y are continuous, X having probability density f and Y having
probability density g. Then, letting FX+Y (a) be the cumulative distribution function of
X + Y , we have
    F_{X+Y}(a) = P{X + Y ≤ a} = ∫∫_{x+y≤a} f(x) g(y) dx dy
               = ∫_{−∞}^{∞} ∫_{−∞}^{a−y} f(x) g(y) dx dy = ∫_{−∞}^{∞} ( ∫_{−∞}^{a−y} f(x) dx ) g(y) dy
               = ∫_{−∞}^{∞} F_X(a − y) g(y) dy        (5.3.7)

The cumulative distribution function F_{X+Y} is called the convolution of the distributions
F_X and F_Y (the cumulative distribution functions of X and Y , respectively).
By differentiating the Equation 5.3.7, we obtain that the probability density function
f_{X+Y}(a) of X + Y is given by

    f_{X+Y}(a) = (d/da) ∫_{−∞}^{∞} F_X(a − y) g(y) dy
               = ∫_{−∞}^{∞} (d/da) F_X(a − y) g(y) dy = ∫_{−∞}^{∞} f(a − y) g(y) dy        (5.3.8)
For example, if X and Y are independent random variables, both uniformly distributed
on (0, 1), then f(a − y)g(y) = 1 precisely when 0 ≤ y ≤ 1 and 0 ≤ a − y ≤ 1.
For 0 ≤ a ≤ 1, this yields

    f_{X+Y}(a) = ∫_0^a dy = a

For 1 < a < 2, this yields

    f_{X+Y}(a) = ∫_{a−1}^1 dy = 2 − a

Hence,

    f_{X+Y}(a) = { a,      0 ≤ a ≤ 1
                 { 2 − a,  1 < a < 2
                 { 0,      otherwise
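The triangular shape of this density can be seen by evaluating the convolution integral numerically; a sketch using a simple midpoint Riemann sum:

```python
# Numerical convolution of two Uniform(0,1) densities:
# f_{X+Y}(a) should be a on [0,1] and 2 - a on [1,2].
def f_sum(a, steps=20000):
    # midpoint-rule approximation of the integral of f(a-y)g(y) over [0,1]
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        y = (i + 0.5) * h
        total += h if 0.0 <= a - y <= 1.0 else 0.0
    return total

assert abs(f_sum(0.5) - 0.5) < 1e-3
assert abs(f_sum(1.0) - 1.0) < 1e-3
assert abs(f_sum(1.5) - 0.5) < 1e-3
print(f_sum(0.5), f_sum(1.0), f_sum(1.5))
```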
Rather than deriving a general expression for the distribution of X + Y in the discrete
case, we shall consider an example.
Example 5.3.10 (Sums of independent Poisson random variables) Let X and Y
be independent Poisson random variables with respective parameters λ_1 and λ_2, and let
us compute the distribution of X + Y .
Since the event {X + Y = n} may be written as the union of the disjoint events {X =
k, Y = n − k}, 0 ≤ k ≤ n, we have

    P{X + Y = n} = Σ_{k=0}^{n} P{X = k, Y = n − k} = Σ_{k=0}^{n} P{X = k} P{Y = n − k}
                 = Σ_{k=0}^{n} e^{−λ_1} (λ_1^k / k!) e^{−λ_2} (λ_2^{n−k} / (n − k)!)
                 = e^{−(λ_1+λ_2)} Σ_{k=0}^{n} λ_1^k λ_2^{n−k} / (k!(n − k)!)
                 = (e^{−(λ_1+λ_2)} / n!) Σ_{k=0}^{n} (n! / (k!(n − k)!)) λ_1^k λ_2^{n−k}
                 = (e^{−(λ_1+λ_2)} / n!) (λ_1 + λ_2)^n        (5.3.9)

That is, X + Y has a Poisson distribution with parameter λ_1 + λ_2.
The concept of independence may, of course, be defined for more than two random vari-
ables. In general, the n random variables X_1, X_2, ..., X_n are said to be independent if, for
all values a_1, a_2, ..., a_n,

    P{X_1 ≤ a_1, X_2 ≤ a_2, ..., X_n ≤ a_n} = P{X_1 ≤ a_1} P{X_2 ≤ a_2} · · · P{X_n ≤ a_n}
Let X_1 and X_2 be jointly continuous random variables with joint probability density
function f(x_1, x_2). It is sometimes necessary to obtain the joint distribution of the random
variables Y_1 and Y_2 which arise as functions of X_1 and X_2. Specifically, suppose that
Y_1 = g_1(X_1, X_2) and Y_2 = g_2(X_1, X_2) for some functions g_1 and g_2, and assume that:

1. The equations y_1 = g_1(x_1, x_2) and y_2 = g_2(x_1, x_2) can be uniquely solved for x_1 and
x_2 in terms of y_1 and y_2 with solutions given by, say,

    x_1 = h_1(y_1, y_2),    x_2 = h_2(y_1, y_2)

2. The functions g_1 and g_2 have continuous partial derivatives at all points (x_1, x_2) and
are such that the following 2 × 2 determinant

    J(x_1, x_2) = | ∂g_1/∂x_1  ∂g_1/∂x_2 |  =  (∂g_1/∂x_1)(∂g_2/∂x_2) − (∂g_1/∂x_2)(∂g_2/∂x_1) ≠ 0
                  | ∂g_2/∂x_1  ∂g_2/∂x_2 |

at all points (x_1, x_2).

Under these two conditions it can be shown that the random variables Y_1 and Y_2 are
jointly continuous with joint density function given by

    f_{Y_1,Y_2}(y_1, y_2) = f_{X_1,X_2}(x_1, x_2) |J(x_1, x_2)|^{−1}        (5.4.1)

where x_1 = h_1(y_1, y_2) and x_2 = h_2(y_1, y_2).
A proof of the Equation 5.4.1 would proceed along the following lines:

    P{Y_1 ≤ y_1, Y_2 ≤ y_2} = ∫∫_B f_{X_1,X_2}(x_1, x_2) dx_1 dx_2        (5.4.2)

where B = {(x_1, x_2) : g_1(x_1, x_2) ≤ y_1, g_2(x_1, x_2) ≤ y_2}.
The joint density function can now be obtained by differentiating the Equation 5.4.2 with
respect to y_1 and y_2. That the result of this differentiation will be equal to the right-hand
side of the Equation 5.4.1 is an exercise in advanced calculus whose proof will not be
presented in the present lecture notes.
Example 5.4.1 If X and Y are independent gamma random variables with parameters
(α, λ) and (β, λ), respectively, compute the joint density of U = X + Y and V = X/(X + Y).
The joint density of X and Y is

    f_{X,Y}(x, y) = (λe^{−λx}(λx)^{α−1} / Γ(α)) · (λe^{−λy}(λy)^{β−1} / Γ(β)),    x, y > 0

Now, if g_1(x, y) = x + y and g_2(x, y) = x/(x + y), then ∂g_1/∂x = 1, ∂g_1/∂y = 1,
∂g_2/∂x = y/(x + y)^2 and ∂g_2/∂y = −x/(x + y)^2. It follows that

    J(x, y) = | 1              1             |  =  −x/(x + y)^2 − y/(x + y)^2  =  −1/(x + y)
              | y/(x + y)^2    −x/(x + y)^2  |

Finally, because the equations u = x + y, v = x/(x + y) have as their solutions x = uv,
y = u(1 − v), we see that

    f_{U,V}(u, v) = f_{X,Y}(uv, u(1 − v)) |J|^{−1} = f_{X,Y}(uv, u(1 − v)) u
                  = (λe^{−λu}(λu)^{α+β−1} / Γ(α + β)) · (Γ(α + β) / (Γ(α)Γ(β))) v^{α−1}(1 − v)^{β−1}

Hence X + Y and X/(X + Y) are independent, with X + Y having a gamma distribution
with parameters (α + β, λ) and X/(X + Y) having density function

    f_V(v) = (Γ(α + β) / (Γ(α)Γ(β))) v^{α−1}(1 − v)^{β−1},    0 < v < 1
This is called the beta density with parameters (α, β). This result is quite interesting.
For suppose there are n + m jobs to be performed, with each (independently) taking an
exponential amount of time with rate λ for performance, and suppose that we have two
workers to perform these jobs. Worker I will do jobs 1, 2, ..., n, and worker II will do the
remaining m jobs. If we let X and Y denote the total working times of workers I and
II, respectively, then upon using the preceding result it follows that X and Y will be
independent gamma random variables having parameters (n, λ) and (m, λ), respectively.
Then the preceding result yields that independently of the working time needed to com-
plete all n + m jobs (that is, of X + Y ), the proportion of this work that will be performed
by worker I has a beta distribution with parameters (n, m).
When the joint density function of the n random variables X_1, X_2, ..., X_n is given and we
want to compute the joint density function of Y_1, Y_2, ..., Y_n, where

    Y_i = g_i(X_1, X_2, ..., X_n),    i = 1, 2, ..., n

we proceed analogously: assuming that the functions g_i have continuous partial derivatives
and that the Jacobian determinant

    J(x_1, x_2, · · · , x_n) = | ∂g_1/∂x_1   ∂g_1/∂x_2   · · ·   ∂g_1/∂x_n |
                               | ∂g_2/∂x_1   ∂g_2/∂x_2   · · ·   ∂g_2/∂x_n |
                               |    ...         ...                 ...    |
                               | ∂g_n/∂x_1   ∂g_n/∂x_2   · · ·   ∂g_n/∂x_n |

is nonzero at all points, the joint density function of the random variables Y_i is given by

    f_{Y_1,...,Y_n}(y_1, ..., y_n) = f_{X_1,...,X_n}(x_1, ..., x_n) |J(x_1, ..., x_n)|^{−1}        (5.4.4)

where x_1, ..., x_n are the solutions of y_i = g_i(x_1, ..., x_n), i = 1, ..., n.
For example, if U = X + Y and W = X − Y, so that g_1(x, y) = x + y and g_2(x, y) = x − y,
then

    J(x, y) = | 1   1 |  =  −2        (5.4.5)
              | 1  −1 |

and, since x = (u + w)/2 and y = (u − w)/2, we obtain

    f_{U,W}(u, w) = f_{X,Y}(x, y) |J(x, y)|^{−1} = (1/2) f_{X,Y}((u + w)/2, (u − w)/2)
Exercise 5.4.3 If X is the amount of money spent on food and other expenses during a
day (in USD) and Y is the daily allowance of a businesswoman, the joint density of these
two variables is given by
    f_{XY}(x, y) = { c(100 − x)/x,  if 10 ≤ x ≤ 100, 40 ≤ y ≤ 100
                   { 0,             elsewhere.
(i) Choose c such that fXY (x, y) is a probability density function.
(iii) Calculate the probability that more than 75 USD are spent.
Chapter 6

Moment Generating Functions
The moment generating function φ(t) of the random variable X is defined for all values t
by
    φ(t) = E[e^{tX}] = { Σ_x e^{tx} p(x),             if X is discrete
                       { ∫_{−∞}^{∞} e^{tx} f(x) dx,   if X is continuous        (6.1.1)
We call φ(t) the moment generating function because all of the moments of X can be
obtained by successively differentiating φ(t).
For example,
    φ'(t) = (d/dt) E[e^{tX}] = E[(d/dt) e^{tX}] = E[X e^{tX}]

and hence

    φ'(0) = E[X]

Similarly,

    φ''(t) = (d/dt) E[X e^{tX}] = E[(d/dt) X e^{tX}] = E[X^2 e^{tX}]

and hence

    φ''(0) = E[X^2]
In general, the n-th derivative of φ(t) evaluated at t = 0 equals E[X^n], that is,

    φ^{(n)}(0) = E[X^n],    n ≥ 1

For example, for a binomial random variable with parameters n and p, differentiating
φ(t) = (pe^t + 1 − p)^n twice and setting t = 0 gives

    E[X^2] = φ''(0) = n(n − 1)p^2 + np

and so Var(X) = n(n − 1)p^2 + np − (np)^2 = np(1 − p).
For a Poisson random variable with parameter λ, φ(t) = exp{λ(e^t − 1)}. Differentiation
yields

    φ'(t) = λe^t exp{λ(e^t − 1)}
    φ''(t) = (λe^t)^2 exp{λ(e^t − 1)} + λe^t exp{λ(e^t − 1)}

and so E[X] = φ'(0) = λ and E[X^2] = φ''(0) = λ^2 + λ. Hence

    Var(X) = E[X^2] − (E[X])^2 = λ^2 + λ − λ^2 = λ
Thus, both the mean and the variance of the Poisson equal λ.
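The moment-extraction mechanism can be checked numerically by finite-differencing the Poisson mgf at t = 0 (λ chosen arbitrarily):

```python
from math import exp

# Finite differences of phi(t) = exp(lam*(e^t - 1)) at t = 0 recover moments.
lam = 2.0
phi = lambda t: exp(lam * (exp(t) - 1.0))

h = 1e-4
d1 = (phi(h) - phi(-h)) / (2 * h)               # ~ phi'(0)  = E[X] = lam
d2 = (phi(h) - 2 * phi(0.0) + phi(-h)) / h**2   # ~ phi''(0) = lam^2 + lam

assert abs(d1 - lam) < 1e-3
assert abs(d2 - (lam**2 + lam)) < 1e-2
print(d1, d2)
```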
We note by the preceding derivation that, for the exponential distribution, φ(t) is only
defined for values of t less than λ. Differentiating yields

    φ'(t) = λ/(λ − t)^2,    φ''(t) = 2λ/(λ − t)^3

Hence

    E[X] = φ'(0) = 1/λ,    E[X^2] = φ''(0) = 2/λ^2

and therefore Var(X) = 2/λ^2 − 1/λ^2 = 1/λ^2.
For a standard normal random variable Z,

    φ_Z(t) = E[e^{tZ}] = (1/√(2π)) ∫_{−∞}^{∞} e^{tx} e^{−x^2/2} dx = (1/√(2π)) ∫_{−∞}^{∞} e^{−(x^2−2tx)/2} dx
           = e^{t^2/2} (1/√(2π)) ∫_{−∞}^{∞} e^{−(x−t)^2/2} dx = e^{t^2/2}
If Z is a standard normal, then X = σZ + µ is normal with parameters µ and σ 2 (see
Theorem 3.3.3). Therefore,
    φ_X(t) = E[e^{tX}] = E[e^{t(σZ+µ)}] = e^{µt} E[e^{tσZ}] = e^{µt} φ_Z(tσ) = exp{σ^2 t^2/2 + µt}

By differentiating we obtain

    φ'_X(t) = (µ + tσ^2) exp{σ^2 t^2/2 + µt}
    φ''_X(t) = (µ + tσ^2)^2 exp{σ^2 t^2/2 + µt} + σ^2 exp{σ^2 t^2/2 + µt}

and so E[X] = φ'_X(0) = µ and E[X^2] = φ''_X(0) = µ^2 + σ^2, implying that

    Var(X) = E[X^2] − (E[X])^2 = σ^2
Theorem 6.1.5 The moment generating function of the sum of independent random vari-
ables is just the product of the individual moment generating functions.
Proof. To see this, suppose that X and Y are independent random variables and have
moment generating functions φ_X(t) and φ_Y(t), respectively. Then φ_{X+Y}(t), the moment
generating function of X + Y , is given by

    φ_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E[e^{tX}] E[e^{tY}] = φ_X(t) φ_Y(t)

where the next to the last equality follows from the Theorem 5.2.4 since X and Y are
independent.
Another important result is that the moment generating function uniquely determines
the distribution. That is, there exists a one-to-one correspondence between the moment
generating function and the distribution function of a random variable.
Discrete distribution    pmf p(x)                            φ(t)                       Mean    Variance
Binomial (n, p)          C(n, x) p^x (1−p)^{n−x}, x = 0..n   (pe^t + 1 − p)^n           np      np(1−p)
Poisson (λ)              e^{−λ} λ^x / x!, x = 0, 1, ...      exp{λ(e^t − 1)}            λ       λ
Geometric (p)            p(1−p)^{x−1}, x = 1, 2, ...         pe^t / (1 − (1−p)e^t)      1/p     (1−p)/p^2

Table 6.1: Moment generating functions for some common discrete distributions
Example 6.1.6 Suppose the moment generating function of a random variable X is given
by φ(t) = e^{3(e^t−1)}. What is P{X = 0}?
We see from Table 6.1 that φ(t) = e^{3(e^t−1)} is the moment generating function of a Poisson
random variable with mean 3. Hence, by the one-to-one correspondence between moment
generating functions and distribution functions, it follows that X must be a Poisson
random variable with mean 3. Thus, P{X = 0} = e^{−3}.
Continuous distribution   pdf f(x)                                         φ(t)                          Mean       Variance
Uniform over (a, b)       1/(b − a), a < x < b                             (e^{tb} − e^{ta})/(t(b − a))  (a + b)/2  (b − a)^2/12
Exponential (λ > 0)       λe^{−λx}, x ≥ 0                                  λ/(λ − t)                     1/λ        1/λ^2
Gamma (n, λ), λ > 0       λe^{−λx}(λx)^{n−1}/Γ(n), x ≥ 0                   (λ/(λ − t))^n                 n/λ        n/λ^2
Normal (µ, σ^2)           (1/(√(2π)σ)) exp{−(x − µ)^2/(2σ^2)}, −∞ < x < ∞  exp{µt + σ^2 t^2/2}           µ          σ^2

Table 6.2: Moment generating functions for some common continuous distributions
Exercise 6.1.1 Let X be a uniformly distributed random variable over (a, b). Compute its
moment generating function.

If X and Y are independent binomial random variables with parameters (n, p) and (m, p),
then

    φ_{X+Y}(t) = φ_X(t) φ_Y(t) = [(pe^t + 1 − p)^n][(pe^t + 1 − p)^m] = (pe^t + 1 − p)^{n+m}

But (pe^t + 1 − p)^{n+m} is just the moment generating function of a binomial random variable
having parameters m + n and p. Thus, this must be the distribution of X + Y .
Similarly, if X and Y are independent Poisson random variables with means λ_1 and λ_2, then

    φ_{X+Y}(t) = φ_X(t) φ_Y(t) = [exp{λ_1(e^t − 1)}][exp{λ_2(e^t − 1)}] = exp{(λ_1 + λ_2)(e^t − 1)}
Hence, X + Y is Poisson distributed with mean λ1 + λ2 , verifying the result given in
Example 5.3.10.
Finally, if X and Y are independent normal random variables with respective parameters
(µ_1, σ_1^2) and (µ_2, σ_2^2), then

    φ_{X+Y}(t) = φ_X(t) φ_Y(t) = exp{σ_1^2 t^2/2 + µ_1 t} exp{σ_2^2 t^2/2 + µ_2 t}
               = exp{(σ_1^2 + σ_2^2) t^2/2 + (µ_1 + µ_2) t}
which is the moment generating function of a normal random variable with mean µ1 + µ2
and variance σ12 + σ22 . Hence, the result follows since the moment generating function
uniquely determines the distribution.
Remark 6.1.1 For a nonnegative random variable X, it is often convenient to define its
Laplace transform g(t), t ≥ 0, by g(t) = φ(−t) = E[e^{−tX}]. That is, the Laplace trans-
form evaluated at t is just the moment generating function evaluated at −t. The advantage
of dealing with the Laplace transform, rather than the moment generating function, when
the random variable is nonnegative is that, since X ≥ 0 and t ≥ 0, we have 0 ≤ e^{−tX} ≤ 1.
That is, the Laplace transform is always between 0 and 1. As in the case of moment
generating functions, it remains true that nonnegative random variables that have the
same Laplace transform must also have the same distribution.
It is also possible to define the joint moment generating function of two or more random variables. This is done as follows: for any n random variables X1, ..., Xn, the joint moment generating function, φ(t1, ..., tn), is defined for all real values of t1, ..., tn by

φ(t1, ..., tn) = E[exp{t1 X1 + · · · + tn Xn}]
It can be shown that φ(t1 , · · · , tn ) uniquely determines the joint distribution of X1 , ..., Xn .
If Z1, ..., Zn are independent standard normal random variables and, for constants a_ij, 1 ≤ i ≤ m, 1 ≤ j ≤ n, and µ_i, 1 ≤ i ≤ m, we set

X_i = a_{i1}Z_1 + · · · + a_{in}Z_n + µ_i,   i = 1, ..., m

then the random variables X1, ..., Xm are said to have a multivariate normal distribution. It follows from the fact that the sum of independent normal random variables is itself a normal random variable that each Xi is a normal random variable with mean and variance given by E[Xi] = µ_i and Var(Xi) = Σ_{j=1}^n a_ij².
Let us now consider

φ(t1, ..., tm) = E[exp{t1 X1 + · · · + tm Xm}]

the joint moment generating function of X1, ..., Xm. The first thing to note is that, since Σ_{i=1}^m t_i X_i is itself a linear combination of the independent normal random variables Z1, ..., Zn, it is also normally distributed. Its mean and variance are, respectively,

E[Σ_{i=1}^m t_i X_i] = Σ_{i=1}^m t_i µ_i

and

Var(Σ_{i=1}^m t_i X_i) = Cov(Σ_{i=1}^m t_i X_i, Σ_{j=1}^m t_j X_j) = Σ_{i=1}^m Σ_{j=1}^m t_i t_j Cov(X_i, X_j)
Now, if Y is a normal random variable with mean µ and variance σ², then E[e^Y] = φ_Y(t)|_{t=1} = e^{µ + σ²/2}. Thus we see that

φ(t1, ..., tm) = exp{ Σ_{i=1}^m t_i µ_i + (1/2) Σ_{i=1}^m Σ_{j=1}^m t_i t_j Cov(X_i, X_j) }   (6.1.2)

which shows that the joint distribution of X1, ..., Xm is completely determined from a knowledge of the values of E[Xi] and Cov(Xi, Xj), i, j = 1, 2, ..., m.
Chapter 7
Limit Theorems
Definition 7.1.1 Let X1, ..., Xn be independent and identically distributed random variables, each with mean µ and variance σ². The random variable S² defined by

S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)²

is called the sample variance of these data.
We begin with the identity

Σ_{i=1}^n (X_i − X̄)² = Σ_{i=1}^n (X_i − µ)² − n(X̄ − µ)²   (7.1.1)

To verify it, write

Σ_{i=1}^n (X_i − X̄)² = Σ_{i=1}^n ((X_i − µ) + (µ − X̄))²
  = Σ_{i=1}^n (X_i − µ)² + n(X̄ − µ)² + 2(µ − X̄) Σ_{i=1}^n (X_i − µ)
  = Σ_{i=1}^n (X_i − µ)² + n(X̄ − µ)² + 2(µ − X̄)(nX̄ − nµ)
  = Σ_{i=1}^n (X_i − µ)² + n(X̄ − µ)² − 2n(X̄ − µ)²
  = Σ_{i=1}^n (X_i − µ)² − n(X̄ − µ)²
Using Identity 7.1.1 and Theorem 5.3.6(b), we obtain

E[(n − 1)S²] = Σ_{i=1}^n E[(X_i − µ)²] − nE[(X̄ − µ)²] = nσ² − n Var(X̄) = nσ² − σ² = (n − 1)σ²

so that

E[S²] = σ²
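The fact that the (n − 1)-denominator sample variance is unbiased for σ² can be illustrated by simulation. In this sketch (the sample size n = 5 and the normal population with σ = 2 are illustrative choices), S² is averaged over many independent samples:

```python
import random
import statistics

random.seed(7)

mu, sigma, n, reps = 10.0, 2.0, 5, 20000

# statistics.variance uses the (n - 1) denominator, matching S^2 above.
# Averaging it over many samples of size n should approach sigma^2 = 4.
avg_s2 = sum(
    statistics.variance([random.gauss(mu, sigma) for _ in range(n)])
    for _ in range(reps)
) / reps
print(avg_s2)
```

The printed average should be close to σ² = 4 even though each individual sample is tiny; that is exactly what E[S²] = σ² asserts.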
We will now determine the joint distribution of the sample mean X̄ = (1/n) Σ_{i=1}^n X_i and the sample variance S² when the Xi have a normal distribution.
We shall first compute the moment generating function of Σ_{i=1}^n Z_i², where Z1, ..., Zn are independent standard normal random variables. To begin, note that for t < 1/2,

E[exp{tZ_i²}] = (1/√(2π)) ∫_{−∞}^{∞} e^{tx²} e^{−x²/2} dx = (1/√(2π)) ∫_{−∞}^{∞} e^{−x²/(2σ²)} dx = σ = (1 − 2t)^{−1/2}

where σ² = (1 − 2t)^{−1}.
Hence,

E[exp{t Σ_{i=1}^n Z_i²}] = Π_{i=1}^n E[exp{tZ_i²}] = (1 − 2t)^{−n/2}
Now, let X1, ..., Xn be independent normal random variables, each with mean µ and variance σ², and let X̄ = (1/n) Σ_{i=1}^n X_i and S² denote their sample mean and sample variance. Since the sum of independent normal random variables is also a normal random variable, it follows that X̄ is a normal random variable with expected value µ and variance σ²/n. In addition, from Theorem 5.3.6,
Cov(X̄, X_i − X̄) = 0,   i = 1, 2, ..., n   (7.1.2)
Also, since X̄, X1 − X̄, X2 − X̄, ..., Xn − X̄ are all linear combinations of the independent standard normal random variables (X_i − µ)/σ, i = 1, ..., n, it follows that the random variables X̄, X1 − X̄, X2 − X̄, ..., Xn − X̄ have a joint distribution that is multivariate normal. However, if we let Y be a normal random variable with mean µ and variance σ²/n that is independent of X1, ..., Xn, then the random variables Y, X1 − X̄, X2 − X̄, ..., Xn − X̄ also have a multivariate normal distribution and, by Equation 7.1.2, they have the same expected values and covariances as the random variables X̄, X_i − X̄, i = 1, 2, ..., n. Thus, since a multivariate normal distribution is completely determined by its expected values and covariances, we can conclude that the random vectors (Y, X1 − X̄, X2 − X̄, ..., Xn − X̄) and (X̄, X1 − X̄, X2 − X̄, ..., Xn − X̄) have the same joint distribution, thus showing that X̄ is independent of the sequence of deviations X_i − X̄, i = 1, 2, ..., n.
Since X̄ is independent of the deviations X_i − X̄, it is also independent of the sample variance

S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)²

To determine its distribution, use Identity 7.1.1 in the form

(n − 1)S² = Σ_{i=1}^n (X_i − µ)² − n(X̄ − µ)²

Dividing both sides by σ² gives

(n − 1)S²/σ² + ((X̄ − µ)/(σ/√n))² = Σ_{i=1}^n (X_i − µ)²/σ²   (7.1.3)
Now, Σ_{i=1}^n (X_i − µ)²/σ² is the sum of the squares of n independent standard normal random variables, and so is a chi-squared random variable with n degrees of freedom; it thus has moment generating function (1 − 2t)^{−n/2}. Also, ((X̄ − µ)/(σ/√n))² is the square of a standard normal random variable and so is a chi-squared random variable with one degree of freedom; it thus has moment generating function (1 − 2t)^{−1/2}. In addition, we have previously seen that the two random variables on the left side of Equation 7.1.3 are independent. Therefore, because the moment generating function of the sum of independent random variables is equal to the product of their individual moment generating functions, we obtain that
E[exp{t(n − 1)S²/σ²}] (1 − 2t)^{−1/2} = (1 − 2t)^{−n/2}

which can be written as

E[exp{t(n − 1)S²/σ²}] = (1 − 2t)^{−(n−1)/2}
But because (1 − 2t)^{−(n−1)/2} is the moment generating function of a chi-squared random variable with n − 1 degrees of freedom, we can conclude, since the moment generating function uniquely determines the distribution of the random variable, that this is the distribution of (n − 1)S²/σ².
Theorem 7.1.3 If X1, ..., Xn are independent and identically distributed normal random variables with mean µ and variance σ², then the sample mean X̄ and the sample variance S² are independent. X̄ is a normal random variable with mean µ and variance σ²/n, while (n − 1)S²/σ² is a chi-squared random variable with n − 1 degrees of freedom.
Theorem 7.2.1 (Markov’s Inequality) If X is a random variable that takes only nonnegative values, then for any value a > 0,

P{X ≥ a} ≤ E[X]/a

Proof. We give a proof for the case where X is continuous with density f:

E[X] = ∫_0^∞ x f(x) dx = ∫_0^a x f(x) dx + ∫_a^∞ x f(x) dx ≥ ∫_a^∞ x f(x) dx ≥ ∫_a^∞ a f(x) dx = a ∫_a^∞ f(x) dx = a P{X ≥ a}
As a consequence, if X has mean µ and variance σ², then applying Markov’s inequality to the nonnegative random variable (X − µ)² with a = k² yields Chebyshev’s inequality: for any k > 0,

P{|X − µ| ≥ k} = P{(X − µ)² ≥ k²} ≤ E[(X − µ)²]/k² = σ²/k²

and the proof is complete.
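Both inequalities can be verified against a distribution whose tail probabilities are known exactly. The sketch below uses an exponential random variable with rate 1 (an illustrative choice), for which P{X ≥ a} = e^{−a}, the mean is 1, and the variance is 1:

```python
import math

# Exponential with rate 1: mean 1, variance 1, and P{X >= a} = e^{-a}.
mean, var = 1.0, 1.0

# Markov: P{X >= a} <= E[X]/a for every a > 0.
for a in (2.0, 5.0, 10.0):
    exact_tail = math.exp(-a)
    assert exact_tail <= mean / a

# Chebyshev: P{|X - 1| >= k} <= var/k^2.  For k >= 1 the lower tail is
# empty (X cannot fall below 0), so P{|X - 1| >= k} = e^{-(1 + k)}.
for k in (2.0, 3.0):
    exact_dev = math.exp(-(1.0 + k))
    assert exact_dev <= var / k**2

print("Markov and Chebyshev bounds hold")
```

Note how loose the bounds are here: for a = 10, Markov gives 0.1 while the true tail is about 0.000045. That looseness is the price of needing only the mean (or mean and variance).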
The importance of Markov’s and Chebyshev’s inequalities is that they enable us to derive
bounds on probabilities when only the mean, or both the mean and the variance, of the
probability distribution are known. Of course, if the actual distribution were known, then
the desired probabilities could be exactly computed, and we would not need to resort to
bounds.
Example 7.2.3 Suppose we know that the number of items produced in a factory during
a week is a random variable with mean 500.
(a) What can be said about the probability that this week’s production will be at least 1000?
(b) If the variance of a week’s production is known to equal 100, then what can be said
about the probability that this week’s production will be between 400 and 600?
Solution. (a) By Markov’s inequality,

P{X ≥ 1000} ≤ E[X]/1000 = 500/1000 = 1/2

(b) By Chebyshev’s inequality,

P{|X − 500| ≥ 100} ≤ σ²/100² = 100/10000 = 1/100

Hence

P{|X − 500| < 100} ≥ 1 − 1/100 = 99/100

and so the probability that this week’s production will be between 400 and 600 is at least 0.99.
The following theorem, known as the Strong Law of Large Numbers, is probably the
most well-known result in probability theory. It states that the average of a sequence
of independent random variables having the same distribution will, with probability 1,
converge to the mean of that distribution.
Formally, if X1, X2, ... are independent and identically distributed random variables with common mean µ, then

P{ lim_{n→∞} X̄_n = µ } = 1,   where X̄_n = (X1 + X2 + · · · + Xn)/n
As an illustration, suppose that a sequence of independent trials of some experiment is performed, and let E be a fixed event of the experiment. Define

X_i = 1 if E occurs on the ith trial, and X_i = 0 if E does not occur on the ith trial.
We have, by the strong law of large numbers that, with probability 1,
(X1 + X2 + · · · + Xn)/n −→ E[X_i] = P(E)   (7.2.2)
Since X1 + X2 + · · · + Xn represents the number of times that the event E occurs in the
first n trials, we may interpret the Equation 7.2.2 as stating that, with probability 1, the
limiting proportion of time that the event E occurs is just P (E).
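This frequency interpretation is easy to see numerically. The following sketch (the value p = 0.25 and the trial count are illustrative choices) simulates 200,000 independent Bernoulli trials and checks that the observed proportion of occurrences of E is close to P(E):

```python
import random

random.seed(11)

p = 0.25          # P(E): probability that E occurs on any one trial
n = 200_000      # number of independent trials

# Count the trials on which E occurs and form the relative frequency.
hits = sum(random.random() < p for _ in range(n))
proportion = hits / n
print(proportion)  # the strong law says this converges to p = 0.25
```

Re-running with larger n drives the proportion ever closer to p, which is exactly the convergence with probability 1 asserted by Equation 7.2.2.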
Running neck and neck with the Strong Law of Large Numbers for the honor of being
probability theory’s number one result is the Central Limit Theorem. Besides its the-
oretical interest and importance, this theorem provides a simple method for computing
approximate probabilities for sums of independent random variables. It also explains the
remarkable fact that the empirical frequencies of so many natural "populations" exhibit
a bell-shaped (that is, normal) curve.
Theorem (The Central Limit Theorem): Let X1, X2, ... be a sequence of independent and identically distributed random variables, each with mean µ and variance σ². Then the distribution of

(X1 + X2 + · · · + Xn − nµ)/(σ√n)

tends to the standard normal as n −→ ∞. That is,

P{ (X1 + X2 + · · · + Xn − nµ)/(σ√n) ≤ a } −→ (1/√(2π)) ∫_{−∞}^a e^{−x²/2} dx   as n −→ ∞
Note that like the other results of this section, this theorem holds for any distribution of
the Xi ’s; herein lies its power.
If X is binomially distributed with parameters n and p, then X has the same distribution
as the sum of n independent Bernoulli random variables, each with parameter p. (Recall
that the Bernoulli random variable is just a binomial random variable whose parameter
n equals 1.) Hence, the distribution of
X − E[X] X − np
p =p
Var(X) np(1 − p)
approaches the standard normal distribution as n approaches ∞. The normal approxi-
mation will, in general, be quite good for values of n satisfying np(1 − p) ≥ 10.
Since the binomial is a discrete random variable and the normal a continuous random variable, it leads to a better approximation to write the desired probability with a continuity correction. For example, if X is binomial with parameters n = 40 and p = 1/2, so that E[X] = 20 and Var(X) = 10, then

P{X = 20} = P{19.5 < X < 20.5} = P{ (19.5 − 20)/√10 < (X − 20)/√10 < (20.5 − 20)/√10 }
  = P{−0.16 < (X − 20)/√10 < 0.16} ≈ Φ(0.16) − Φ(−0.16)
where Φ(x), the probability that a standard normal random variable is less than x, is given by

Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−y²/2} dy
By the symmetry of the standard normal distribution Φ(−0.16) = P {N (0, 1) > 0.16} =
1 − Φ(0.16), where N (0, 1) is a standard normal random variable. Hence, the desired
probability is approximated by P {X = 20} ≈ 2Φ(0.16) − 1.
Using the normal probability table for Φ(x), we obtain P{X = 20} ≈ 0.1272.
The exact result is P{X = 20} = (40 choose 20)(1/2)^40, which can be shown to equal 0.1254.
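Both the exact value and the continuity-corrected normal approximation can be computed directly; here Φ is evaluated through the error function `math.erf`:

```python
import math

def phi(x):
    """Standard normal CDF, Phi(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# X ~ binomial(40, 1/2), so E[X] = 20 and Var(X) = 10.
exact = math.comb(40, 20) * 0.5**40

# Continuity-corrected normal approximation to P{X = 20}.
approx = phi((20.5 - 20) / math.sqrt(10)) - phi((19.5 - 20) / math.sqrt(10))
print(exact, approx)
```

The exact value is about 0.1254, and the unrounded approximation about 0.1256; the slightly larger 0.1272 quoted above comes from rounding 0.5/√10 = 0.158 to 0.16 before consulting the table.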
Example 7.2.7 Let X_i, i = 1, 2, ..., 10, be independent random variables, each uniformly distributed over (0, 1). Estimate P{Σ_{i=1}^{10} X_i > 7}.

Since E[X_i] = 1/2 and Var(X_i) = 1/12, we have by the Central Limit Theorem that

P{Σ_{i=1}^{10} X_i > 7} = P{ (Σ_{i=1}^{10} X_i − 5)/√(10/12) > (7 − 5)/√(10/12) } ≈ 1 − Φ(2.2) = 0.0139
Example 7.2.8 The lifetime of a special type of battery is a random variable with mean
40 hours and standard deviation 20 hours. A battery is used until it fails, at which point it
is replaced by a new one. Assuming a stockpile of 25 such batteries, the lifetimes of which
are independent, approximate the probability that over 1100 hours of use can be obtained.
If we let X_i denote the lifetime of the ith battery to be put in use, then we desire p = P{X1 + · · · + X25 > 1100}, which is approximated as follows: since E[X1 + · · · + X25] = 25 × 40 = 1000 and √(Var(X1 + · · · + X25)) = 20√25 = 100,

p = P{ (X1 + · · · + X25 − 1000)/100 > (1100 − 1000)/100 } ≈ 1 − Φ(1) ≈ 0.1587
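The arithmetic of this approximation (which, by the CLT, does not depend on the unspecified lifetime distribution) can be carried out directly:

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n, mu, sd = 25, 40.0, 20.0
total_mean = n * mu             # E[X1 + ... + X25] = 1000 hours
total_sd = sd * math.sqrt(n)    # sqrt(Var) = 20 * 5 = 100 hours

# P{X1 + ... + X25 > 1100} ~ 1 - Phi((1100 - 1000)/100) = 1 - Phi(1)
p = 1.0 - phi((1100.0 - total_mean) / total_sd)
print(p)  # about 0.1587
```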
We now present a heuristic proof of the Central Limit Theorem. Suppose first that the X_i have mean 0 and variance 1, and let E[e^{tX}] denote their common moment generating function. Then the moment generating function of (X1 + · · · + Xn)/√n is

E[exp{t(X1 + · · · + Xn)/√n}] = E[e^{tX1/√n} e^{tX2/√n} · · · e^{tXn/√n}] = (E[e^{tX/√n}])^n   by independence.
Now, for n large, we obtain from the Taylor series expansion of e^y that

e^{tX/√n} ≈ 1 + tX/√n + t²X²/(2n)

Taking expectations, and using E[X] = 0 and E[X²] = 1, gives

E[e^{tX/√n}] ≈ 1 + t²/(2n)

and therefore

E[exp{t(X1 + · · · + Xn)/√n}] ≈ (1 + t²/(2n))^n
When n goes to ∞ the approximation can be shown to become exact and we have that
lim_{n→∞} E[exp{t(X1 + · · · + Xn)/√n}] = e^{t²/2}

which is the moment generating function of a standard normal random variable. Since convergence of moment generating functions implies convergence of the corresponding distributions, this establishes the result when µ = 0 and σ² = 1. The general case follows by applying the above to the standardized random variables (X_i − µ)/σ, yielding

P{ (X1 − µ + X2 − µ + · · · + Xn − µ)/(σ√n) ≤ a } −→ Φ(a)

which proves the Central Limit Theorem.
Exercise 7.2.1 Let X be a binomial random variable with p = 0.6 and n = 150. Approximate the probability that
(a) X lies between 82 and 101 inclusive;
(b) X is greater than 97.
Solution 7.2.9 (a) We calculate the mean and the variance of X. The mean is given by µ = np = 150(0.6) = 90 and the variance by σ² = Var(X) = np(1 − p) = 150(0.6)(0.4) = 36; hence σ = √Var(X) = 6. The standardized variable is

Z = (X − µ)/σ = (X − 90)/6

The event [82 ≤ X ≤ 101] includes both endpoints. The appropriate continuity correction is to subtract 1/2 from the lower end and add 1/2 to the upper end. We then approximate

P[81.5 ≤ X ≤ 101.5] = P[ (81.5 − 90)/6 ≤ (X − 90)/6 ≤ (101.5 − 90)/6 ] = P[−1.417 ≤ Z ≤ 1.917]
(b) For [X > 97], we reason that 97 is not included, so that [X ≥ 97 + 0.5], or [X ≥ 97.5], is the event of interest:

P[X ≥ 97.5] ≈ P[ (X − 90)/6 ≥ (97.5 − 90)/6 ] = P[Z ≥ 1.25] = 1 − 0.8944 = 0.1056
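Both parts of the solution can be evaluated without tables by computing Φ through `math.erf`:

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

mu, sd = 90.0, 6.0  # np = 150(0.6) = 90, sqrt(np(1 - p)) = sqrt(36) = 6

# (a) P[82 <= X <= 101] with the continuity correction:
part_a = phi((101.5 - mu) / sd) - phi((81.5 - mu) / sd)

# (b) P[X > 97] becomes P[X >= 97.5]:
part_b = 1.0 - phi((97.5 - mu) / sd)
print(part_a, part_b)  # about 0.894 and 0.1056
```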
(a) If a random sample of size 64 is selected, what is the probability that the sample mean
will lie between 80.8 and 83.2?
(b) With a random sample of size 100, what is the probability that the sample mean will
lie between 80.8 and 83.2?
Exercise 7.2.4 Suppose that the population distribution of the gripping strengths of industrial workers is known to have a mean of 110 and a variance of 100. For a random sample of 75 workers, what is the probability that the sample mean gripping strength will be