
Chapter 2

Conditional Probability and Independence

2.1 CONDITIONAL PROBABILITY
In many statistical applications we have events A and B and want to explain or
predict A from B, i.e., we want to say how likely A is, given that B has occurred.

Example 2.1. In university admissions, we might be interested in

A = {got g.p.a. 4.0 in college}
B = {got g.p.a. 4.0 in high school}.

Example 2.2. Investment professionals might be interested in the case

A = {stocks up Today}
B = {stocks up Yesterday}.

We are interested not just in marginal probabilities, but also in conditional probabilities; that is, we want to incorporate some information into our predictions.

Definition 2.1. The probability of an event A ∈ A given an event B ∈ A, denoted P(A|B), is given by

P(A|B) = P(A ∩ B) / P(B),    (2.1)

when P(B) > 0.

The exclusion of the case P (B) = 0 has important ramifications later on.
If A and B are mutually exclusive events, then P (A|B) = 0. If A ⊆ B, then
P (A|B) = P (A)/P (B) ≥ P (A) with strict inequality unless P (B) = 1. If
B ⊆ A, then P (A|B) = 1. We will later have interest in the special case where
P (A|B) = P (A).
Note that P (·|B) is a probability measure that maps A → R+ . In particular,
we have:

Theorem 2.1. Suppose that B ∈ A. Then


1. For any A ∈ A, P(A|B) ≥ 0;
2. P(B|B) = 1;
3. P(∪_{i=1}^∞ A_i | B) = ∑_{i=1}^∞ P(A_i | B) for any pairwise disjoint events {A_i}_{i=1}^∞.

This follows directly from the definitions and properties of P .


Example 2.3. We have 100 stocks and we observe whether they went up or down on consecutive days. The information is given below:

                      Today
                   up     down
Y'day    up        53      25      78
         down      15       7      22
                   68      32     100

The information given here is effectively: P (A∩B), P (A∩B c ), P (B ∩Ac ), and


P (B c ∩ Ac ), where A = {up Y’day} and B = {up T’day}. It is an easy exercise
to convert these “joint probabilities” to marginal and conditional probabilities
using the Law of total probability. Specifically, the equation

P (A) = P (A ∩ B) + P (A ∩ B c )

determines the marginals, and then definition (2.1) gives the conditionals. Thus
P(Up Y'day) = 78/100
P(Down Y'day) = 22/100
P(Up Today|Up Y'day) = (53/100) / (78/100) = 53/78.
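A minimal sketch in Python, assuming the joint counts in the table above, that reproduces these marginal and conditional probabilities:

# Joint counts from the 2x2 table: first entry = yesterday, second = today.
counts = {("up", "up"): 53, ("up", "down"): 25,
          ("down", "up"): 15, ("down", "down"): 7}
n = sum(counts.values())  # 100 stocks in total

# Marginal probabilities for yesterday (law of total probability).
p_up_yday = (counts[("up", "up")] + counts[("up", "down")]) / n        # 78/100
p_down_yday = (counts[("down", "up")] + counts[("down", "down")]) / n  # 22/100

# Conditional probability of up today given up yesterday, definition (2.1).
p_up_today_given_up_yday = (counts[("up", "up")] / n) / p_up_yday      # 53/78

print(p_up_yday, p_down_yday, p_up_today_given_up_yday)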

2.2 BAYES THEOREM


We next give the famous Bayes Rule (or Formula, Lemma, or Theorem, depending on the source), which is an important result in probability and statistics. Often we have prior information about a marginal probability and about a conditional probability, but we are interested in the conditional probability in the reverse direction, i.e., in making an inference. This can be obtained from the following rule.
Theorem 2.2. (Bayes Rule). Let A and B be two events with P(A) > 0. Then

P(B|A) = (P(A|B) · P(B)) / P(A) = (P(A|B) · P(B)) / (P(A|B) · P(B) + P(A|B^c) · P(B^c)).    (2.2)

Proof. This follows from the definition of conditional probability and the law
of total probability:

P (A ∩ B) = P (A|B) · P (B) = P (B|A) · P (A)


P (A) = P (A ∩ B) + P (A ∩ B c ) = P (A|B) · P (B) + P (A|B c ) · P (B c ).

The probability P (B) is frequently called the prior probability and P (A|B)
is called the likelihood, while P (B|A) is called the posterior. We next give
some examples.

Example 2.4. Suppose we want to know the probability that a person is telling the truth given the results of a Polygraph test. Let a positive reading on the Polygraph be denoted by (+), and a negative reading be denoted by (−); T denotes that the person is telling the truth and L denotes that the person is lying. We have information on P(±|L) and P(±|T) from lab work (we are using a shorthand but obvious notation here). Suppose we get a + readout. Then

P(T|+) = P(+|T)P(T) / (P(+|T)P(T) + P(+|L)P(L)).

If we believe that P(T) = 0.99, say, and know that P(+|L) = 0.88, P(+|T) = 0.14, then

P(T|+) = 0.94.

This is perhaps a bit surprising; see the Prosecutor's Fallacy.
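A short computation of this posterior using the numbers above, just applying formula (2.2):

# Bayes Rule for the polygraph example, with the probabilities given in the text.
p_T = 0.99            # prior probability the person tells the truth
p_L = 1 - p_T         # prior probability the person lies
p_pos_given_T = 0.14  # likelihood of a + readout when telling the truth
p_pos_given_L = 0.88  # likelihood of a + readout when lying

# Posterior probability of truth-telling given a + readout.
p_T_given_pos = (p_pos_given_T * p_T) / (p_pos_given_T * p_T + p_pos_given_L * p_L)
print(round(p_T_given_pos, 2))  # 0.94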

Example 2.5. We next consider an example from economic theory, called the
sequential trading model. A stock price can take two possible values

V = { V_H   with prob 1 − δ
    { V_L   with prob δ,

where V_L < V_H and δ ∈ [0, 1]. This is the prior distribution on value. The investor is chosen randomly from two possible types

T = { I   with prob μ
    { U   with prob 1 − μ.

The timeline is: first, a value is chosen, and then a type of investor is chosen,
and that investor carries out his strategy. The strategies of the investors are as
follows. The informed traders (I) will buy if the value is high, V_H, and sell if the value is low, V_L, provided the quoted prices lie in the interval [V_L, V_H]. The uninformed traders (U) buy or sell with probability 1/2 each. Suppose that a buy order is received (but it is not known from whom); what does that tell us about the value of the stock? Let A = {V = V_L} and B = {buy order received}. We can calculate P(B|A) directly from knowledge of the traders' strategies and the distribution of trader types. That is,

P(B|A) = (1/2)(1 − μ)
because if the value is low the informed trader will not buy and the uninformed
trader will buy one half of the time. Likewise

P(B|A^c) = (1/2)(1 − μ) + μ
because when the value is high the informed trader will always buy. We want to
know P (A|B) as this tells us the updated value distribution. By Bayes rule, we
can calculate the updated distribution (posterior) of V

P(A|B) = P(B|A) P(A) / P(B)     [posterior = likelihood × prior / P(B)]
       = P(buy|V = V_L) P(V = V_L) / (P(buy|V = V_L) P(V = V_L) + P(buy|V = V_H) P(V = V_H))
       = ((1/2)(1 − μ) × δ) / ((1 + μ(1 − 2δ))/2)
       = ((1 − μ) / (1 + μ(1 − 2δ))) × δ
       ≤ δ,

P(V = V_H|buy) = P(A^c|B) = 1 − P(A|B) = ((1 + μ)(1 − δ)) / (1 + μ(1 − 2δ)) ≥ 1 − δ.

The information that a buy order has been received is useful and increases our
valuation of the asset. On the other hand, if a sell order were received, this would
lower our valuation of the asset.
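A minimal sketch of this Bayes update in Python; the numerical values of mu and delta are illustrative assumptions, not taken from the text:

# Posterior value distribution after a buy order in the sequential trading model.
mu, delta = 0.3, 0.4   # assumed fraction of informed traders, prior probability of V_L

p_buy_given_low = 0.5 * (1 - mu)         # only uninformed traders may buy when V = V_L
p_buy_given_high = 0.5 * (1 - mu) + mu   # informed traders always buy when V = V_H
p_buy = p_buy_given_low * delta + p_buy_given_high * (1 - delta)  # = (1 + mu*(1 - 2*delta))/2

# Bayes Rule: posterior probability that the value is low, given a buy order.
p_low_given_buy = p_buy_given_low * delta / p_buy
print(p_low_given_buy, delta)   # posterior (about 0.264) lies below the prior delta = 0.4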

Example 2.6. Why most published research findings are false, Ioannidis (2005).

2.3 INDEPENDENCE
We next define the notion of independence, which is a central property in much
of statistics. This concerns a special case of conditional probability that makes
many calculations simpler.

Definition 2.2. Independence. Suppose that P(A), P(B) > 0. Then A and B are independent events if:
(1) P (A ∩ B) = P (A) · P (B)
(2) P (A|B) = P (A)
(3) P (B|A) = P (B)

These are equivalent definitions. Definition (1) is symmetric in A and B; the


value of property (1) is that given knowledge of P (A) and P (B), we can directly
determine P (A ∩ B), whereas without independence all we can say is the bound
(1.2). Definitions (2) and (3) are perhaps easier to interpret. In (2) we are saying
that knowledge of B does not change our assessment of the likelihood of A, and
essentially B is useless for this purpose. We can in principle allow P (A) and/or
P (B) to be zero in all three definitions provided we assign 0/0 = 0.
If A and B are mutually exclusive events, i.e., P (A ∩ B) = 0, then A and B
cannot be independent unless either P (A) = 0 or P (B) = 0. If A ⊆ B, then A
and B cannot be independent (unless P (A) = 0). Independence is a symmetric
relationship, so that A is independent of B, if and only if B is independent of A.
Furthermore, if A is independent of B, then: A is independent of B c , Ac is
independent of B, and Ac is independent of B c . If you know B, then you know
B^c. However, independence is not a transitive relationship: A being independent of B and B being independent of C does not imply that A is independent of C.

Example 2.7. A counterexample: in the six-sided die example S = {1, 2, 3, 4, 5, 6}, take A = {2, 4, 6}, B = {1, 2, 3, 4}, and C = {1, 3, 5}. Then A and B are independent and B and C are independent, but A and C are mutually exclusive and hence not independent.

Independence is an important property that is often assumed in applications.


For example:
(a) Stock returns are independent from day to day;
(b) Different household spending decisions are independent of each other;
(c) Legal cases such as Sally Clark/Roy Meadow. The expert witness Meadow testified that the odds against two cot deaths occurring in the same family were 73,000,000:1, a figure which he obtained by squaring the observed ratio of live births to cot deaths in affluent non-smoking families (approximately 8,500:1), a calculation which would be valid under independence of these two events. He testified under oath that: "one sudden infant death in a family is a tragedy, two is suspicious and three is murder unless proven otherwise" (Meadow's law). See the Dawid (2002) expert witness statement in the retrial.
Although independence is a central case, in practice dependence is common
in many settings. One can measure the amount of dependence and its direction
(positive or negative) between two events in several ways:

α(A, B) = P(A ∩ B) − P(A) · P(B)

β(A, B) = P(A|B) − P(A) = P(A ∩ B)/P(B) − P(A)

γ(A, B) = P(A ∩ B)/(P(A)P(B)) − 1,
where α, β, γ can be positive or negative indicating the direction of the mutual
dependence. For example, if β > 0 this means that event A is more likely to
occur when B has occurred than when we don’t know whether B has occurred or
not. We may show that α ∈ [−1, 1/4], β ∈ [−1, 1], and γ ∈ R. These measures
allow us to rank cases according to the degree of dependence. Suppose that A ⊆ B. Then clearly, knowing B gives us some information about whether A has occurred whenever B is a strict subset of S. In this case

α(A, B) = P(A)(1 − P(B))

β(A, B) = P(A)/P(B) − P(A)

γ(A, B) = 1/P(B) − 1.
Example 2.8. In the stock example, we have very mild negative dependence
with:

α(A, B) = 0.53 − 0.78 × 0.68 = −0.0004


β(A, B) = −0.000588
γ (A, B) = −0.000754.
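A sketch reproducing these three measures directly from the joint and marginal probabilities of the stock table:

# Dependence measures for the stock table: A = up yesterday, B = up today.
p_AB = 53 / 100   # P(A and B)
p_A = 78 / 100    # P(A)
p_B = 68 / 100    # P(B)

alpha = p_AB - p_A * p_B          # P(A and B) - P(A)P(B)
beta = p_AB / p_B - p_A           # P(A|B) - P(A)
gamma = p_AB / (p_A * p_B) - 1    # P(A and B)/(P(A)P(B)) - 1

print(alpha, beta, gamma)  # all slightly negative: mild negative dependence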

Example 2.9. An example of conditional probability and independence. Suppose you deal two cards without replacement. Let

A = {first card is Ace},  B = {second card is King},  C = {first card is King}.

We have

P (A) = 4/52, P (B|A) = 4/51


P (B) = P (B|C) · P (C) + P (B|C c )P (C c ).

Furthermore, P (C) = 4/52, P (C c ) = 48/52, P (B|C) = 3/51, P (B|C c ) =


4/51. This implies that
P(B) = (3/51) · (4/52) + (4/51) · (48/52) = (12 + 192)/(51 · 52) = 204/(51 · 52) = 4/52 < 4/51.

So P(B) < P(B|A), i.e., A and B are not independent events. In this case β(A, B) = 4/51 − 4/52 = 1/663 > 0, meaning there is positive dependence.
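A quick check of these calculations using exact fractions:

# Two cards dealt without replacement: verify P(B), P(B|A), and the sign of beta.
from fractions import Fraction

p_C = Fraction(4, 52)              # first card is a King
p_B_given_C = Fraction(3, 51)      # second card King, given first card King
p_B_given_not_C = Fraction(4, 51)  # second card King, given first card not a King

# Law of total probability for B = {second card is King}.
p_B = p_B_given_C * p_C + p_B_given_not_C * (1 - p_C)
p_B_given_A = Fraction(4, 51)      # an Ace first leaves all four Kings among 51 cards

print(p_B, p_B_given_A, p_B_given_A - p_B)  # 1/13, 4/51, and beta = 1/663 > 0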

We next consider the more general case with more than two events.

Definition 2.3. A general definition of independence. Events A1 , . . . , An are


said to be mutually independent if
P(∩_{j=1}^{k} A_{i_j}) = ∏_{j=1}^{k} P(A_{i_j}),  for all A_{i_1}, . . . , A_{i_k}, k = 2, . . . , n.

For example, independence of events A, B, C requires the following conditions to hold:
1. P (A ∩ B) = P (A)P (B)
2. P (A ∩ C) = P (A)P (C)
3. P (B ∩ C) = P (B)P (C)
4. P (A ∩ B ∩ C) = P (A)P (B)P (C).

Example 2.10. The infinite monkey theorem says that if one had an infinite
number of monkeys randomly tapping on a keyboard, with probability one, at
least one of them will produce the complete works of Shakespeare. If one has a
finite set of characters on the typewriter K and a finite length document n, then
the probability that any one monkey would type this document exactly is K −n .
If there are 47 keys on a standard typewriter and the complete works contain 884,421 words, perhaps 5 million characters, then the probability that a given monkey will produce the document exactly is vanishingly small. Let A_i = {Monkey i nails it}. Then

P(A_i) = K^{−n}.

However, the probability that no monkeys would produce it is (assuming that


the monkeys are mutually independent agents and don’t interfere with other
monkeys)
   
lim_{M→∞} P(∩_{i=1}^{M} A_i^c) = lim_{M→∞} (1 − K^{−n})^M = 0.
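A quick numerical check of this limit, with deliberately tiny, illustrative values of K and n:

# Probability that none of M independent monkeys type the target document.
K, n = 5, 4               # 5 keys and a 4-character "document" (illustrative only)
p_one = K ** (-n)         # probability that a single monkey types it exactly

for M in (10, 10_000, 10_000_000):
    p_none = (1 - p_one) ** M
    print(M, p_none)      # tends to zero as M grows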

This can be strengthened to say that with probability one an infinite number
of monkeys would produce the complete works of Shakespeare. Consider the
sets B_1 = {1, 2, . . . , M}, B_2 = {M + 1, M + 2, . . . , 2M}, etc. An Arts Council grant was once awarded to investigate the infinite monkey theorem.1

1. In 2003, lecturers and students from the University of Plymouth MediaLab Arts course studied the literary output of real monkeys. They left a computer keyboard in the enclosure of six Celebes crested macaques in Paignton Zoo in Devon, England, for a month, with a radio link to broadcast the results on a website. The monkeys produced nothing but five pages largely consisting of the letter S; the lead male began by bashing the keyboard with a stone, and the monkeys continued by urinating and defecating on it.

We give two further independence concepts.

Definition 2.4. Pairwise independence. Events A_1, . . . , A_n are said to be pairwise independent if, for all i ≠ j,

P(A_i ∩ A_j) = P(A_i) · P(A_j).

We can have pairwise independence but not mutual independence, i.e., pairwise independence is the weaker property.

Example 2.11. Suppose that S = {1, 2, 3, 4} with equally likely outcomes, and let A = {1, 2}, B = {1, 3}, and C = {1, 4}. Then P(A) = P(B) = P(C) = 1/2 and P(A ∩ B) = P(A ∩ C) = P(B ∩ C) = 1/4, so the events are pairwise independent, but P(A ∩ B ∩ C) = 1/4 ≠ 1/8 = P(A)P(B)P(C), so they are not mutually independent.
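A short check of these probabilities, assuming the four outcomes in S are equally likely:

# Pairwise but not mutual independence (Example 2.11).
from fractions import Fraction

S = {1, 2, 3, 4}
A, B, C = {1, 2}, {1, 3}, {1, 4}
P = lambda E: Fraction(len(E), len(S))     # uniform probability measure on S

assert P(A & B) == P(A) * P(B)             # 1/4 = 1/2 * 1/2
assert P(A & C) == P(A) * P(C)
assert P(B & C) == P(B) * P(C)
assert P(A & B & C) != P(A) * P(B) * P(C)  # 1/4 vs 1/8: not mutually independent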

There is a further concept of interest in many applications.

Definition 2.5. Conditional Independence. Suppose that P(A), P(B), P(C) > 0. Then A and B are independent events given C if either:
(1) P (A ∩ B|C) = P (A|C)P (B|C);
(2) P (C)P (A ∩ B ∩ C) = P (A ∩ C) · P (B ∩ C).

Note that independence of A and B does not imply conditional independence


of A and B given C, and vice versa, conditional independence of A and B given
C does not imply independence of A and B.

Example 2.12. You toss two dice. Let

A = {your first die is 6},  B = {your second die is 6},  C = {both dice are the same}.

A and B here are independent. However, A and B are conditionally dependent


given C, since if you know C then your first die will tell you exactly what the
other one is.
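A sketch verifying both claims by enumerating the 36 equally likely outcomes of the two dice:

# A and B are independent, but dependent given C (Example 2.12).
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=2))        # all 36 outcomes (die1, die2)
P = lambda E: Fraction(len(E), len(S))

A = {s for s in S if s[0] == 6}                 # first die shows 6
B = {s for s in S if s[1] == 6}                 # second die shows 6
C = {s for s in S if s[0] == s[1]}              # both dice show the same number

assert P(A & B) == P(A) * P(B)                  # unconditional independence: 1/36
p_AB_given_C = P(A & B & C) / P(C)              # = 1/6
p_A_given_C = P(A & C) / P(C)                   # = 1/6
p_B_given_C = P(B & C) / P(C)                   # = 1/6
assert p_AB_given_C != p_A_given_C * p_B_given_C  # 1/6 vs 1/36: dependent given C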

There are many examples where conditional independence holds but not independence. Likewise, the direction of dependence or association can change depending on whether you look at the conditional or the unconditional distribution. This is called Simpson's Paradox: the reversal of the direction of an association when data from several groups are combined (aggregated) to form a single group. That is, we may have P(A|B) > P(A|B^c) but both P(A|B ∩ C) < P(A|B^c ∩ C) and P(A|B ∩ C^c) < P(A|B^c ∩ C^c) for events A, B, C.
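A small numerical illustration; the counts below are hypothetical and chosen only to exhibit the reversal. Here A = {success}, B = {new method used}, and C = {easy case}:

# Simpson's Paradox with made-up counts: the new method does worse within each
# stratum, yet looks better in the aggregate because it was used mostly on easy cases.
from fractions import Fraction

#            (successes, trials)
easy = {"B": (234, 270), "Bc": (81, 87)}      # stratum C: easy cases
hard = {"B": (55, 80),   "Bc": (192, 263)}    # stratum C^c: hard cases

rate = lambda s, t: Fraction(s, t)

print(float(rate(*easy["B"])), float(rate(*easy["Bc"])))   # 0.867 < 0.931: worse given C
print(float(rate(*hard["B"])), float(rate(*hard["Bc"])))   # 0.688 < 0.730: worse given C^c

agg_B = rate(easy["B"][0] + hard["B"][0], easy["B"][1] + hard["B"][1])       # 289/350
agg_Bc = rate(easy["Bc"][0] + hard["Bc"][0], easy["Bc"][1] + hard["Bc"][1])  # 273/350
print(float(agg_B), float(agg_Bc))                          # 0.826 > 0.780: reversal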

Example 2.13. A common application of conditional independence is in time


series. We have a sequence of outcomes on the stock market observed over time,
either up or down. Each outcome is random at time t, and there is a probability
p_t of an Up and 1 − p_t of a Down, and this probability may depend on past
outcomes. We assume that only what happens yesterday is relevant for today,
i.e., the future is independent of the past given the present

P(Outcome_t | Outcome_{t−1} ∩ · · · ∩ Outcome_{−∞}) = P(Outcome_t | Outcome_{t−1}).

This is called a Markov Chain. A special case of this is when p_t is time invariant, as in our next example.
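A short simulation sketch of such a chain; the transition probabilities below are illustrative assumptions, not estimates:

# Two-state Markov Chain for market outcomes: today's Up probability depends
# only on yesterday's outcome.
import random

p_up_given_up = 0.55     # assumed P(Up today | Up yesterday)
p_up_given_down = 0.45   # assumed P(Up today | Down yesterday)

random.seed(0)
state, path = "Up", []
for t in range(10):
    p_up = p_up_given_up if state == "Up" else p_up_given_down
    state = "Up" if random.random() < p_up else "Down"
    path.append(state)
print(path)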

Example 2.14. Gambler’s ruin. Each period there is a probability p of an up


and probability 1 − p of a down, and this does not depend on past outcomes.
Suppose that you have $1 and take $1 positions in the stock market indefinitely
[when the market is up, you gain $1, and when it is down you lose $1], or until
you become bankrupt. What is the probability you become bankrupt? You can
write out a table showing all the possible paths you can take to the destination
$0. An important thing to note in this problem is the symmetry. Let P_{j,k} be the probability of going from $j to $k. Clearly, P_{j+ℓ,k+ℓ} = P_{j,k} for any ℓ, i.e., just adding $ℓ to your total doesn't change anything real. Now note that

P_{1,0} = 1 − p + p P_{2,0}.

Furthermore, P_{2,1} = P_{1,0} and so P_{2,0} = P_{2,1} P_{1,0} = P_{1,0}^2. Therefore, we have a quadratic equation

P_{1,0} = 1 − p + p P_{1,0}^2,
which has solutions

P_{1,0} = 1  and  P_{1,0} = (1 − p)/p.
The first solution is relevant when p ≤ 1/2, while the second is relevant otherwise. This says, for example, that even if you have a fifty-fifty chance of success each period, you will eventually become bankrupt with probability one. This shows the advantage of the principle "quit while you are ahead".
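A Monte Carlo sketch of this result; the horizon T only truncates paths that would otherwise run forever, so the p = 1/2 estimate falls slightly short of the theoretical value of one:

# Estimate the probability of eventual bankruptcy starting from $1.
import random

def ruin_probability(p, trials=5_000, T=2_000):
    ruined = 0
    for _ in range(trials):
        wealth = 1
        for _ in range(T):
            wealth += 1 if random.random() < p else -1
            if wealth == 0:
                ruined += 1
                break
    return ruined / trials

random.seed(1)
print(ruin_probability(0.5))   # close to 1, as the theory predicts
print(ruin_probability(0.6))   # close to (1 - 0.6)/0.6 = 2/3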
