Download as ppt, pdf, or txt
Download as ppt, pdf, or txt
You are on page 1of 27

Comparing Classical and Bayesian

Approaches to Hypothesis Testing

James O. Berger
Institute of Statistics and Decision Sciences
Duke University
www.stat.duke.edu
Outline

• The apparent overuse of hypothesis testing


• When is point null testing needed?
• The misleading nature of P-values
• Bayesian and conditional frequentist testing of
plausible hypotheses
• Advantages of Bayesian testing
• Conclusions
I. The apparent overuse of
hypothesis testing

• Tests are often performed when they are


irrelevant.
• Rejection by an irrelevant test is sometimes
viewed as “license” to forget statistics in further
analysis
Prototypical example

Habitat Rank Observed Hypothesis


Type Usage
A 1 3.8
B 2 3.6 H0 : "mean usage is
C 3 2.8 equal for all habitats"
D 4 1.8 Rejected (P<.025)
E 5 1.5
F 6 0.7
Statistical mistakes in the example

• The hypothesis is not plausible; testing serves no


purpose.
• The observed usage levels are given without
confidence sets.
• The rankings are based only on observed means,
and are given without uncertainties. (For instance,
perhaps Pr (A>B)=0.6 only.)
Prototypical example

Habitat Rank Observed Hypothesis


Type Usage
A 1 3.8
B 2 3.6 H0 : "mean usage is
C 3 2.8 equal for all habitats"
D 4 1.8 Rejected (P<.025)
E 5 1.5
F 6 0.7
Statistical mistakes in the example

• The hypothesis is not plausible; testing serves no


purpose.
• The observed usage levels are given without
confidence sets.
• The rankings are based only on observed means,
and are given without uncertainties. (For instance,
perhaps Pr (A>B)=0.6 only.)
Prototypical example

Habitat Rank Observed Hypothesis


Type Usage
A 1 3.8
B 2 3.6 H0 : "mean usage is
C 3 2.8 equal for all habitats"
D 4 1.8 Rejected (P<.025)
E 5 1.5
F 6 0.7
II. When is testing of a point null
hypothesis needed?
Answer: When the hypothesis is plausible, to
some degree.
Note that, while H 0 :  0 is typically not
plausible, it is a good approximation to
H 0 :| |   , as long as  <  (4 n )
(assuming n Gaussian observations with
standard deviation  ).
Examples of hypotheses that are not
realistically plausible
• H0: small mammals are as abundant on livestock grazing
land as on non-grazing land
• H0: survival rates of brood mates are independent
• H0: bird abundance does not depend on the type of forest
habitat they occupy
• H0: cottontail choice of habitat does not depend on the
season
Examples of hypotheses that may be
plausible, to at least some degree:

• H0: Males and females of a species are the same


in terms of characteristic A.
• H0: Proximity to logging roads does not affect
ground nest predation.
• H0: Pollutant A does not affect Species B.
III. For plausible hypotheses, P-values
are misleading as measures of evidence
Example: Experimental drugs D1, D2, D3, . . .
are to be tested.
Each Test: H0: Di has negligible effect
H1: Di is effective
Typical Bayesian Answer: The probability
that H0 is true is 0.06.
Classical Answer (P-value): If H0 were true, the
probability of observing hypothetical data as or
more "extreme" than the actual data is 0.06.
DRUG D1 D2 D3 D4 D5 D6
P-VALUE 0.41 0.04 0.32 0.94 0.01 0.28

DRUG D7 D8 D9 D10 D11 D12


P-VALUE 0.11 0.05 0.65 0.009 0.09 0.66

Question: How strongly do we believe that


Drug i has a nonnegligible effect when
(i) the P-value is approximately 0.05?
(ii) the P-value is approximately 0.01?
A Surprising Fact: Suppose it is known
that, apriori, about 50% of the Drugs will
have negligible effect. Then,

(i) of the Drugs for which the P-value  0.05,


at least 25% (and typically over 50%)
will have negligible effect;

(ii) of the Drugs for which the P-value  0.01,


at least 7% (and typically over 15%)
will have negligible effect.
DRUG D1 D2 D3 D4 D5 D6
P-VALUE 0.41 0.04 0.32 0.94 0.01 0.28

DRUG D7 D8 D9 D10 D11 D12


P-VALUE 0.11 0.05 0.65 0.009 0.09 0.66

Question: How strongly do we believe that


Drug i has a nonnegligible effect when
(i) the P-value is approximately 0.05?
(ii) the P-value is approximately 0.01?
A Surprising Fact: Suppose it is known
that, apriori, about 50% of the Drugs will
have negligible effect. Then,

(i) of the Drugs for which the P-value  0.05,


at least 25% (and typically over 50%)
will have negligible effect;

(ii) of the Drugs for which the P-value  0.01,


at least 7% (and typically over 15%)
will have negligible effect.
DRUG D1 D2 D3 D4 D5 D6
P-VALUE 0.41 0.04 0.32 0.94 0.01 0.28

DRUG D7 D8 D9 D10 D11 D12


P-VALUE 0.11 0.05 0.65 0.009 0.09 0.66

Question: How strongly do we believe that


Drug i has a nonnegligible effect when
(i) the P-value is approximately 0.05?
(ii) the P-value is approximately 0.01?
IV. Bayesian testing of point hypotheses
Data and Model: X has density f ( x| )
Example: X  # of eggs hatched out of n eggs
in a recently polluted area (so f is binomial,
and  is the true proportion that would hatch).
To Test: H 0 :  0 versus H1 :  0
Example: 0 is the historically known proportion
of eggs that hatch in the area
The prior distribution
Let P1 and P2 be the prior probabilities of H1 and H 2 .
(The usual default choice is P1  P2  0.5.)
Under H1 , let  ( ) be the density representing
information concerning the location of  . (The usual
default choice for the binomial problem is  ( )  1.)

Note: There are two schools of Bayesian statistics,


the subjective school, where the prior distribution
reflects real extraneous information, and the objective
school, where the prior is chosen in a default fashion.
Posterior probability that H0 is true,
given the data (from Bayes theorem):

P0 f ( x|0 )
Pr( H 0 | data x ) 
P0 f ( x|0 )  P1 
 
{ 
f ( x| ) ( ) d
0}


 1  0 x  1  0  
x n ( 1)
Beta ( x  1, n  x  1)
(for the binomial testing problem)
Note: Some prefer to use the Bayes Factor (or
weighted likelihood ratio) of H 0 to H1 ,
f ( x|0 )
B

 
{ 
f ( x| ) ( ) d
0}

likelihood of data under H 0


= ,
" average" likelihood of data under H1
since this does not involve prior probabilties of the H i .

Example: Suppose x=40 eggs hatched out of n=100.


Then Pr( H 0 | data x )  0.52 and B  0.92. (Here a
classical test would yield P  value  0.05.)
Conditional frequentist interpretation of
the posterior probability of H0

Pr( H 0 | data x ) is also the frequentist type I error


probability , conditional on observing data of the
same "strength of evidence" as the actual data x.
(The classical type I error probability makes the
mistake of reporting the error averaged over data
of very different strengths.)
V. Advantages of Bayesian testing

• Pr (H0 | data x) reflects real expected error rates:


P-values do not.
• A default formula exists for all situations:
( 1)
 

Pr( H 0 | data x )  1 
 f ( x ,  ) f ( x *
,  ) f ( x *
,  0 ) dx *
d 
 ,

  f ( x ,  0 ) f ( x *
,  ) d  

where x * is independent (unobserved) data of the smallest
size such that the above integrals exist.
• Posterior probabilities allow for incorporation of
personal opinion, if desired. Indeed, if the
published default posterior probability of H0 is P*,
and your prior probability of H0 is P0, then your
posterior probability of H0 is
( 1)
  1  1 
Pr( H 0 | data x )  1    1  *  1 
  P0   P  
Example: In the binomial example, recall P  0.52.
*

A "skeptic" has P0  01. ; hence Pr( H 0 | data x )  011


. .
A " believer" has P0  0.9; hence Pr( H 0 | data x )  0.91.
• Posterior probabilities are not affected by the
reason for stopping experimentation, and hence
do not require rigid experimental designs (as do
classical testing measures).
• Posterior probabilities can be used for multiple
models or hypotheses.
Example: H 0 : pollutant A has no effect on species B
H1 : pollutant A decreases abundance of species B
H 2 : pollutant A increases abundance of species B
Pr( H 0 | data ) .30, Pr( H1 | data ) .68, Pr( H 2 | data ) .02
An aside: integrating science and
statistics via the Bayesian paradigm
• Any scientific question can be asked (e.g., What is
the probability that switching to management plan
A will increase species abundance by 20% more
than will plan B?)
• Models can be built that simultaneously incorporate
known science and statistics.
• If desired, expert opinion can be built into the
analysis.
Conclusions
• Hypothesis testing is overutilized while (Bayesian)
statistics is underutilized.
• Hypothesis testing is needed only when testing a
“plausible” hypothesis (and this may be a rare
occurrence in wildlife studies).
• The Bayesian approach to hypothesis testing has
considerable advantages in terms of interpretability
(actual error rates), general applicability, and
flexible experimentation.

You might also like