
Lecture 2: Some more introductory material

Daniel Frances © 2016

Contents

1 Some basics 2

1.1 Historical Perspective and Context . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Inference versus Decision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 How many people does it take? . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Subjective Probabilities 4

2.1 Eliciting Subjective Probabilities . . . . . . . . . . . . . . . . . . . . . . . . 4

2.2 Proving P(A) + P(Ā) = 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.3 Well Calibrated Assessments . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.4 Proper Scoring Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3 Historical vs. Extensive Form 12

4 Bayes’ Rule Revisited 13

4.1 Use Basic English - avoid the formula . . . . . . . . . . . . . . . . . . . . . . 13

4.2 Flipping Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4.3 Using Odds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4.4 Back to the Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

5 Conclusion 15


1 Some basics

Quoting Wikipedia:

"Decision Analysis (DA) is the discipline comprising the philosophy, theory, methodology,
and professional practice necessary to address important decisions in a formal manner.
Decision analysis includes many procedures, methods, and tools for identifying, representing,
and formally assessing important aspects of a decision, for prescribing a recommended course
of action by applying the maximum expected utility action axiom to a well-formed
representation of the decision, and for translating the formal representation of a decision and its
corresponding recommendation into insight for the decision maker (DM) and other
stakeholders."

Note that this course is not about how humans actually go about making decisions, but
rather about how they ought to make them.

1.1 Historical Perspective and Context


• In 1954 Leonard Jimmie Savage wrote The Foundations of Statistics, which critiqued
the field of Statistics and introduced the use of subjective probabilities and utilities
in what has since become Bayesian Statistics. To this day Statisticians are either
Classical (using a frequentist interpretation of probability) or Bayesian (using
probability as a measure of subjective belief).

• In 1968 Howard Raiffa wrote Decision Analysis: Introductory Lectures on Choices
Under Uncertainty (highly recommended reading), which expounded the same theory
but with a greater focus on making decisions rather than solving inference problems.

• In 1980 The Decision Analysis Society (DAS) was founded; it continues under
INFORMS, where it is responsible for the journal Decision Analysis.

• The field of Bayesian DA has expanded since the 1990s due to the explosive
development of numerical techniques and software, which are finally allowing the Bayesian
approach to be fully implemented.

• Lecture notes are based mainly on the text by Jim Smith, plus MCMC methods from
Bayesian Statistics texts.

• The Precision Tree, Netica and OpenBUGS software packages are available on ECF.
For download to your own PC outside ECF, Precision Tree is available on a time-limited
basis as part of the Palisades Decision Suite; Netica is available for free, either as a
limited version without a password or as a full version with a password, available
upon request from the instructor. OpenBUGS is open software.

Page 2 Compiled on 2016/09/14 at 17:32:07



1.2 Inference versus Decision

These two areas are closely related in Bayesian Decision Analysis, but not at all in Classical
Statistics.

Statistical Inference draws conclusions from datasets, when such datasets are affected by
random variation. The format for expressing these conclusions is very different for Classical
and Bayesian Statisticians.

Classical Statisticians would express their conclusions as confidence intervals and p-values
for probability distribution parameters, based purely on the data. This is the approach
taken in most undergraduate Statistics courses.

Bayesian Statisticians would express the conclusions on the same parameters as posterior
distributions, based on both the data and the DM’s prior distributions. The prior distribution
is what the client, the DM, personally believes about the parameter before observing the
data. The posterior distribution is an expression of the DM’s belief after the data has been
observed.

The dispute between the two types of Statistician is well documented; here we obviously take
the Bayesian approach. It is comforting to know that the differences between the conclusions
obtained from each approach narrow as the data accumulate, so that the impact of the
DM's prior distribution becomes irrelevant. Yet there is a big philosophical difference when
it comes to making decisions.

Classical Statisticians objectively draw conclusions entirely from the data, without any
interaction with the DM, except to know the parameters that require inference. The same
conclusions are valid for any DM that wishes to use them for decision making.

Bayesian Statisticians draw conclusions jointly with the DM. They are poised and ready
to guide the DM into making the decision that is in the best interest of the DM, using the
methods of Bayesian Decision Analysis. These methods use the posterior distributions derived
from the Inference stage and aim to maximize the DM's expected utility. Thus, whereas
Bayesian Inference stops short of recommending optimal decisions, Bayesian Decision
Analysis incorporates both an inference and a decision-making stage.

This same difference will come to light again when we deal with Bayesian Networks, which
are part of Bayesian Inference, and Influence Diagrams, which are part of Bayesian Decision
Analysis.

For Bayesians, Inference and Decision Analysis are almost inseparable.

1.3 How many people does it take?

While Classical Statistics is carried out almost exclusively by the Statistician, BDA requires
more participants.




1. The DM, who needs to make a decision, provides prior probabilities, and is the
default expert regarding the problem to be solved, e.g. the physician.
2. The Analyst (the name we will use instead of Statistician), who is the expert on the
BDA methodology, e.g. yourself, and who is responsible for:
(a) helping the DM build a subjective probability model, and documenting the basis
for this model;
(b) eliciting the DM's utility function, which will include regard for the stakeholders;
(c) calculating the expected utility of each of the viable alternatives; the best decision
is the one with the highest expected utility.
3. The Expert, to whom the DM delegates the provision of subjective probabilities when
she believes the Expert is more qualified, e.g. a specialist.
4. The Auditor, who ensures the overall analysis makes sense, e.g. the family physician
together with the patient.

Note that in this course we will assume a single individual for each of these roles. In general,
BDA does allow for group decision making, and multiple experts, to try to find consensus
among multiple cooperative players. If the players are not cooperative, then we are in the
related realm of Game Theory.

2 Subjective Probabilities

We now need to deal with the notion of subjective probabilities, and the science, mainly
psychology, of eliciting such distributions correctly.

We will distinguish between the notion that

(a) Deep down you have a quantitative measure of belief that a certain event will occur,
which we will name and process as a probability.
(b) The analyst elicits an estimate of this probability from you, which is close but not
identical to your true probability.

2.1 Eliciting Subjective Probabilities

For a fair betting wheel, the probability of winning is equal to the ratio of the winning
area to the total area, as shown in Fig. 1.

The wheel is used to elicit beliefs about future events. For example, suppose we wish to
elicit your belief that the Liberals will again win the next election, i.e. event A. The analyst
will then ask you to choose between




Figure 1: Betting Wheel

A: a bet that the Liberals will again win the election - event A takes place.

B: a bet that the spinner with initial position x falls in the wheel’s blue area.

Clearly you would like to know the probability of winning in each case. Unfortunately, while
we can clearly disclose the probability of winning with gamble B, you each have a different
subjective probability of winning gamble A.

We would start with an initial setting of the wheel, say at 50% of the area, and ask you
for your preferred bet, A or B. If A ≻ B we would infer that P(A) > P(B), and would
increase the winning area. We would continue to increase until B ≻ A. At that time we
would slightly decrease the winning area. Eventually the DM would express indifference, i.e.
A ∼ B, as shown in Fig 2.

The left side of Fig 2 depicts gamble A, with event A as the winning event and b(A)
representing the DM's perceived value of betting on A. On the right side we show gamble
B, with event Ep(x) as the winning event, with objective probability of winning p, and
b(Ep(x)) denoting the perceived value of betting on the spinning wheel. Note that once
we inspect the spinning wheel we realize that the probability of falling in the blue area is
independent of the starting position x; thus Ep = Ep(x), and at the indifference
point b(A) = b(Ep).

Figure 2: Indi↵erence Point
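The adjustment loop just described can be sketched in code. The following Python fragment is illustrative only: it replaces the incremental sweep with a bisection search on the winning area, and stands in for the DM with a hypothetical fixed belief `true_p`.

```python
def elicit_probability(prefers_a, lo=0.0, hi=1.0, tol=1e-3):
    """Bisect on the wheel's winning-area fraction until the DM is indifferent.

    prefers_a(p) -> True if the DM prefers gamble A to a wheel whose
    winning area is the fraction p of the total (i.e. judges P(A) > p).
    """
    while hi - lo > tol:
        p = (lo + hi) / 2
        if prefers_a(p):
            lo = p      # A preferred: enlarge the wheel's winning area
        else:
            hi = p      # B preferred: shrink the winning area
    return (lo + hi) / 2

true_p = 0.37   # hypothetical hidden subjective belief of the DM
estimate = elicit_probability(lambda p: true_p > p)
print(round(estimate, 3))
```

The elicited estimate converges to within the tolerance of the DM's true belief, mirroring the narrowing sequence of preferences A ≻ B, then B ≻ A, then indifference.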




2.2 Proving P(A) + P(Ā) = 1

The question is whether the laws of probability also apply to these subjective estimates; for
example, will the subjective probabilities sum to one over all mutually exclusive
possible events, i.e.

b(A) = b(Ep) and b(Ā) = b(Eq)  ⇒  p + q = 1?

Or could the DM say that, since these are only subjective estimates, p + q ≠ 1?

To make this concept more precise we use the following definitions and axioms.

Definition 2.1. Perceived Value of a Bet on event A. For any possible event A, consider
a bet in which the DM wins a prize only if event A occurs. Then the Perceived Value
of this bet on event A, written b(A), is the maximum price the DM is willing to pay for this
bet. For notational purposes the value of the prize is kept constant.

Definition 2.2. The subjective probability of event A: P(A) = p, where b(A) = b(Ep).

Axiom 2.1. Comparability. For every event A considered by the DM there is a unique
p such that b(A) = b(Ep) and P(A) = p, where Ep is an event with known probability p.

Axiom 2.2. Substitutability. If a given lottery is modified by replacing one of the
prizes in the lottery with an equivalent substitute, so that the DM is indifferent between
receiving the prize or its substitute, then, everything else remaining fixed, the DM will
be indifferent between the original and modified lotteries.

Axiom 2.3. Transitivity. Preferences among any alternatives A, B and C imply
additional preferences as follows:

A ∼ B and B ∼ C ⇒ A ∼ C;  A ≻ B and B ≻ C ⇒ A ≻ C;  A ≻ B and B ∼ C ⇒ A ≻ C.




Theorem 2.1. Given the subjective probability P(A) of an event A, and the subjective
probability P(Ā) of the mutually exclusive event not-A, or Ā, the Comparability,
Substitutability and Transitivity Axioms imply P(A) + P(Ā) = 1.

Proof. By Comparability:

Figure 3: Spin wheel equivalents for events A and Ā (the bet on A is indifferent to a wheel
bet with known winning probability P(A), and the bet on Ā to a wheel bet with winning
probability P(Ā))

By Substitutability, and by independence between the toss of a fair coin and event A:

Figure 4: Spin wheel equivalents for the combined fair coin toss and events A and Ā (the
compound gamble "heads: bet on A; tails: bet on Ā" is indifferent to a wheel bet with
winning probability ½[P(A) + P(Ā)])

Then by Transitivity, 1/2 = ½[P(A) + P(Ā)], since the compound gamble wins precisely
when the independent fair coin agrees with whether A occurred, an event of objective
probability 1/2. Hence P(A) + P(Ā) = 1.




It can similarly be shown that under the same assumptions the subjective probability follows
all the rules of objective probability, such as conditional probabilities, and most importantly
Bayes' Rule. For an excellent and most readable reference supporting this claim see Howard
Raiffa (1968): Decision Analysis: Introductory Lectures on Choices under Uncertainty.

Definition 2.3. A DM’s probability assignments are coherent if the sum of probabilities
over all mutually exclusive possible events add to one.

For an excellent article observing that subjective probabilities are near-coherent, see the
early article on prediction markets at
http://mason.gmu.edu/~rhanson/PAM/PRESS2/ORMSToday-6-04.htm, a phenomenon still
alive and well.

2.3 Well Calibrated Assessments

Suppose that the DM provides assessments on events for which we can later determine
how the events unfolded. For example, take a weather forecaster who daily predicts the
probability of rain (of course carefully defined). Suppose there were 100 days on which the
forecaster predicted a 20% probability of rain; then we would expect that on about 20% of
those days, i.e. on about 20 days, it indeed rained. In general, if there were n(q) days on
which he predicted a q probability of rain, and it rained on r(q) of those days, then we would
expect that r(q) = q · n(q), or q = r(q)/n(q).

Definition 2.4. A DM is empirically well calibrated if the probability assessments q
are equal to the historical proportions q̂ = r(q)/n(q), i.e. q = q̂.




Figure 5 shows over 10 days what the probability assessment was, and whether it in fact
rained or not.

Figure 5: Rain Forecast Data

Then

• For q = 0.0, n(q) = 1 day, r(q) = 0 days, q̂ = 0/1 = 0. Perfect!

• For q = 0.5, n(q) = 2 days, r(q) = 1 day, q̂ = 1/2 = 0.5. Perfect!

• For q = 0.6, n(q) = 5 days, r(q) = 4 days, q̂ = 4/5 = 0.8. So-so!

• For q = 0.8, n(q) = 5 days, r(q) = 4 days, q̂ = 4/5 = 0.8. Perfect!

• For q = 1.0, n(q) = 1 day, r(q) = 1 day, q̂ = 1/1 = 1. Perfect!

So this DM is fairly well calibrated.
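The calibration check above is mechanical, so it can be scripted. In the Python sketch below the `counts` dictionary simply transcribes the bullet list (the 0.05 tolerance for "well calibrated" is our own illustrative choice):

```python
# n(q) = number of days with forecast q, r(q) = rainy days among them,
# transcribed from the bullet list above.
counts = {0.0: (1, 0), 0.5: (2, 1), 0.6: (5, 4), 0.8: (5, 4), 1.0: (1, 1)}

for q, (n, r) in sorted(counts.items()):
    q_hat = r / n                       # empirical proportion for forecast q
    verdict = "well calibrated" if abs(q - q_hat) < 0.05 else "off"
    print(f"q = {q:.1f}  q_hat = {q_hat:.2f}  ({verdict})")
```

Only the q = 0.6 bin is flagged, matching the "So-so!" verdict above.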

Consider a related table of outcomes for 3 forecasters: P1, P2, P3.

Figure 6: Rain Forecast Data from P1 , P2 , P3

Clearly Figure 6 shows that P1 and P3 are empirically better calibrated than P2. Yet P2
stuck his neck out much more often, definitely predicting either rain (q = 1.0) or no rain
(q = 0.0). Thus we need better scoring rules than the strict historical record.




2.4 Proper Scoring Rules

Definition 2.5. A scoring rule is a loss function L(a, q), where a = 1 if event A occurs
and a = 0 if it does not, and q is the DM's probability assessment for event A.

Having a scoring rule to measure a forecaster's performance will lead him to minimize
his average loss over the long run. Thus, suppose he truly believes that P(A) = p; then to
minimize his expected loss (assuming EMV) he may not predict p but instead predict the q*
which solves

min_q L(q|p) = p L(1, q) + (1 − p) L(0, q)

Definition 2.6. A proper scoring rule is any scoring rule for which L(q|p) is minimized
by q* = p, i.e. it provides an incentive for the forecaster to use the best forecast.

An example of an improper scoring rule is L(a, q) = |a − q|, for which

L(q|p) = p(1 − q) + (1 − p)q = p − pq + q − pq = p + q − 2pq

(d/dq) L(q|p) = 1 − 2p  ⇒  q* = 0 if p < 1/2;  q* ∈ [0, 1] if p = 1/2;  q* = 1 if p > 1/2

One of the earliest proper scoring rules is the Brier score, for which

L(a, q) = (a − q)²

so that the forecaster will choose q* by

min_q L(q|p) = p(1 − q)² + (1 − p)q²  ⇒  2p(1 − q)(−1) + 2(1 − p)q = 0  ⇒  −p + pq + q − pq = 0  ⇒  q* = p
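Both derivations can be checked numerically. The Python sketch below (a grid search over q; the value p = 0.7 is an arbitrary illustration) shows the absolute-error rule pushing the forecast to an extreme, while the Brier rule is minimized at the honest q* = p.

```python
def expected_loss(p, q, L):
    """Expected loss when the true probability is p and the forecast is q."""
    return p * L(1, q) + (1 - p) * L(0, q)

absolute = lambda a, q: abs(a - q)   # improper rule
brier = lambda a, q: (a - q) ** 2    # proper (Brier) rule

p = 0.7
grid = [i / 100 for i in range(101)]
q_abs = min(grid, key=lambda q: expected_loss(p, q, absolute))
q_brier = min(grid, key=lambda q: expected_loss(p, q, brier))
print(q_abs, q_brier)   # extreme forecast vs honest forecast
```

Under the absolute rule the expected loss is p + q − 2pq = 0.7 − 0.4q, which keeps decreasing in q, so the minimizer is q = 1; under the Brier rule the grid minimum sits at q = p.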

Definition 2.7. The empirical score Sn = Σ_{i=1}^{n} L(a_i, q_i), where L(a, q) is a proper
scoring rule, q_i ∈ [0, 1] is the forecast for period i = 1 . . . n, and a_i ∈ {0, 1} is the actual
outcome for period i.




Thus the performance of different forecasters over time is reflected in their empirical scores;
the lower the score, the better. Let's compute the Brier score for the 3 forecasters in Figure 6.
Sn(P1) = Σ_i L(a_i, q_i)
       = Σ_{rain days} (1 − q_i)² + Σ_{no-rain days} q_i²
       = 32(1 − 0.4)² + 290(1 − 0.5)² + 204(1 − 0.6)²
         + (80 − 32)(0.4)² + (580 − 290)(0.5)² + (340 − 204)(0.6)²
       = 245.8

Sn(P2) = 20 + 16(0.6)² + 20(0.5)² + 24(0.4)² + 20(0.5)² + 40
       = 79.0

Sn(P3) = 8(0.81) + 25(0.64) + 36(0.49) + 32(0.36) + 50(0.25)
         + 36(0.16) + 84(0.09) + 124(0.04) + 81(0.01)
         + 72(0.01) + 100(0.04) + 84(0.09) + 48(0.16) + 50(0.25)
         + 24(0.36) + 36(0.49) + 31(0.64) + 9(0.81)
       = 169.0

This confirms our suspicion that in fact P2 is the best forecaster of the three.
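The empirical score of Definition 2.7 is a one-line fold over the record of forecasts and outcomes. The five-day record in this Python sketch is hypothetical, not the Figure 6 data:

```python
def brier(a, q):
    """Brier loss for a single period: (a - q)^2."""
    return (a - q) ** 2

def empirical_score(outcomes, forecasts, L=brier):
    """S_n = sum of per-period losses, as in Definition 2.7."""
    return sum(L(a, q) for a, q in zip(outcomes, forecasts))

a = [1, 0, 1, 1, 0]            # hypothetical actual outcomes (1 = rain)
q = [0.8, 0.3, 0.6, 0.9, 0.2]  # hypothetical forecasts
print(round(empirical_score(a, q), 2))
```

Swapping in a different proper rule (e.g. the logarithmic loss below) only changes the `L` argument.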

The Logarithmic score is also a widely used proper scoring rule, for which

L(a, q) = −log q if a = 1;  L(a, q) = −log(1 − q) if a = 0

The forecaster will choose q* by

min_q L(q|p) = p(−log q) + (1 − p)(−log(1 − q))  ⇒  −p/q + (1 − p)/(1 − q) = 0  ⇒  q* = p
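A quick numerical check that the logarithmic score is also proper (p = 0.7 is an arbitrary illustration; the grid avoids the endpoints, where the logarithm is undefined):

```python
import math

p = 0.7
grid = [i / 1000 for i in range(1, 1000)]  # exclude q = 0 and q = 1
q_star = min(grid, key=lambda q: -p * math.log(q) - (1 - p) * math.log(1 - q))
print(q_star)
```

The expected log loss is strictly convex in q, so the grid minimum lands exactly on the honest forecast q* = p.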




3 Historical vs. Extensive Form

Let’s consider again the decision tree in the real estate example.

Figure 7: Real Estate Decision Tree: Extensive or Rollback Form

If we look at the tree carefully we might be perturbed by the fact that the event that Joe
Lewis attends the meeting or not comes before the Residential-or-Industrial vote. In terms
of Joe, the first thing to happen is that Joe assesses how the vote will go, and then he
decides whether he should attend the meeting or not. Thus the decision
tree seems to be drawn in the wrong order, and perhaps should be shown as

Figure 8: Real Estate Decision Tree: Historical Form

But unfortunately, while this format shows a more natural historical sequence between the
two events, it misrepresents the decision, which must be taken before knowing the outcome
of the city council vote. Later, when we use Influence Diagrams, both forms can be shown
correctly.




4 Bayes’ Rule Revisited

Most undergrad probability courses include some proof of Bayes' Rule. But there is a big
difference between understanding it for an exam and understanding it well enough to accept
it when making your own decisions.

At a recent meeting of the Society for Medical Decision Making, this was a frequent topic
of discussion. Physicians constantly deal with odds of cures, side-effects, false positives (low
specificity), false negatives (low sensitivity), etc., yet they are not trained very well in
the calculus of probability. In one test, doctors were given a simple situation with prior
and conditional probabilities and asked to calculate the posterior probabilities (only Bayes'
Rule gives the correct answer), yet the answers were all over the map. Proposed solutions
included avoiding Bayes' formula at all costs and expressing probabilities as odds. Let's
explore the various approaches.

4.1 Use Basic English - avoid the formula

Let's solve a problem using plain reasoning. Suppose a patient has a 1% chance of having a
disease, and that he is sent for a diagnostic test with 90% sensitivity and 80% specificity.
What is the post-test probability of having the disease if the patient tests +ve? What is
it if the patient tests -ve?

Suppose the physician has seen 1000 such patients. The analysis follows the table below:

Figure 9: Bayes Updating Table

                                #    + (sens. 90%)   - (spec. 80%)
  prior      has disease   1%   10         9                1
             does not     99%  990       198              792
             Total            1000       207              793
  posterior  has disease                 4.3%             0.1%
             does not                   95.7%            99.9%

Thus, the chance of our patient having the disease after a positive test is 9/207 = 4.3%, and
with a negative result the chance of having the disease is reduced to 1/793, or 0.13%. Nothing
intuitive about these numbers. You have to do the math!
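The 1000-patient table can be reproduced in a few lines; in this Python sketch the counts follow directly from the stated prevalence, sensitivity and specificity.

```python
N = 1000
diseased = round(0.01 * N)         # 10 patients have the disease (1% prior)
healthy = N - diseased             # 990 do not
true_pos = round(0.9 * diseased)   # 9 diseased test positive (90% sensitivity)
false_neg = diseased - true_pos    # 1 diseased tests negative
true_neg = round(0.8 * healthy)    # 792 healthy test negative (80% specificity)
false_pos = healthy - true_neg     # 198 healthy test positive

p_pos = true_pos / (true_pos + false_pos)    # P(disease | +) = 9/207
p_neg = false_neg / (false_neg + true_neg)   # P(disease | -) = 1/793
print(round(p_pos, 3), round(p_neg, 4))
```

This "natural frequencies" layout is exactly the plain-English reasoning of the table: count patients first, divide last.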

The claim was made that after training physicians with the English versions there was a
drastic improvement in the ability to use Bayes’ Rule, although some were still hopeless.




4.2 Flipping Trees

Another way of understanding Bayes’ Rule is by using decision trees that only have random
nodes. Figure 10 shows the random variable nodes in the order in which they actually take
place. First the patient has the disease and then we observe the symptom. This is called a
Historical Tree. Unfortunately as we already saw this format - shown before in figure 8 -
is not that useful in solving decision analysis problems. Instead we “flip” the tree into an

Figure 10: Historical Tree

  Dummy ─┬─ D (0.01) ─┬─ + (0.9)
         │            └─ − (0.1)
         └─ N (0.99) ─┬─ + (0.2)
                      └─ − (0.8)

Extensive or Rollback format, as shown in Fig 11, by showing the random events in the
order in which the DM discovers them: first the symptom, and then the probability that the
patient has the disease.

Figure 11: Extensive or Rollback Tree

  Dummy ─┬─ + (0.01×0.9 + 0.99×0.2 = 0.207) ─┬─ D (0.9×0.01/0.207 = 0.04348)
         │                                   └─ N (1 − 0.04348 = 0.95652)
         └─ − (1 − 0.207 = 0.793) ──────────┬─ D (0.1×0.01/0.793 = 0.00126)
                                            └─ N (1 − 0.00126 = 0.99874)
The assigned probabilities, derived by probability logic, conform to Bayes' Rule.
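The flipping computation generalizes to any two-stage tree. Here is a minimal Python sketch using the disease/symptom numbers from Figure 10 (the dictionary layout is our own choice):

```python
prior = {"D": 0.01, "N": 0.99}
likelihood = {"D": {"+": 0.9, "-": 0.1},
              "N": {"+": 0.2, "-": 0.8}}

# Marginal probability of each test result: the flipped tree's first stage.
marginal = {s: sum(prior[d] * likelihood[d][s] for d in prior)
            for s in ("+", "-")}

# Posterior P(disease state | result) via Bayes' Rule: the second stage.
posterior = {s: {d: prior[d] * likelihood[d][s] / marginal[s] for d in prior}
             for s in ("+", "-")}

print(round(marginal["+"], 3), round(posterior["+"]["D"], 5))
```

The marginals become the first-stage branch probabilities of the flipped tree, and the posteriors become its second-stage branches, matching Figure 11.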




4.3 Using Odds

It turns out that by using odds instead of probabilities one obtains a very intuitively
appealing version of Bayes' Rule.

While we normally model the likelihood of an event A with a probability P(A), another means
is to use odds. If we say that the odds that A will occur are 'X to Y', then this is the
same as saying that P(A) = X/(X + Y). The expression 'X to Y' is written as 'X:Y' or 'X/Y', and
can be treated as a normal ratio. In addition, if we have two events A and B and we say
that the odds of event A versus event B are 'X to Y' or 'X/Y', then P(A)/P(B) = X/Y.
If the odds are given as a single number 'X', then it is assumed that the odds are 'X to 1'.

Consider the posterior odds of event i versus event k:

Posterior odds = p(i|x)/p(k|x) = [p(x|i)p(i)] / [p(x|k)p(k)] = [p(x|i)/p(x|k)] × [p(i)/p(k)]

The first ratio of probabilities is called the Likelihood Ratio, and leads to an easy way to
remember that the Posterior Odds equal the Likelihood Ratio times the Prior Odds. In
our example, the posterior odds of having disease D given a +ve test are

Posterior odds(D|+) = [P(+|D)/P(+|N)] × Prior odds(D) = (0.9/0.2)(1/99) = 0.04545

Thus the posterior probability P(D|+) = 0.04545/1.04545 = 0.043, as before.
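The odds calculation is equally short in code; this Python fragment mirrors the arithmetic above step by step.

```python
prior_odds = 0.01 / 0.99        # prior odds of disease: 1 to 99
likelihood_ratio = 0.9 / 0.2    # P(+|D) / P(+|N)

# Posterior odds = likelihood ratio x prior odds.
posterior_odds = likelihood_ratio * prior_odds

# Convert odds back to a probability: p = odds / (1 + odds).
p_post = posterior_odds / (1 + posterior_odds)
print(round(posterior_odds, 5), round(p_post, 3))
```

The multiply-then-convert pattern is why the odds form is so popular in clinical settings: updating on a second, independent test is just another multiplication by its likelihood ratio.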

4.4 Back to the Formula

P(D) = 0.01, P(N) = 0.99
P(+|D) = 0.9, P(−|D) = 0.1, P(+|N) = 0.2, P(−|N) = 0.8

Plugging in the specific numbers for our case we obtain

P(D|+) = (0.9 × 0.01) / (0.9 × 0.01 + 0.2 × 0.99) = 9/207 and P(N|+) = (0.2 × 0.99) / (0.9 × 0.01 + 0.2 × 0.99) = 198/207

P(D|−) = (0.1 × 0.01) / (0.1 × 0.01 + 0.8 × 0.99) = 1/793 and P(N|−) = (0.8 × 0.99) / (0.1 × 0.01 + 0.8 × 0.99) = 792/793

The same results as before.

5 Conclusion

With these basic tools we are ready to tackle a slightly more complex DA problem.

