
Statistics for phase III trials

Marc Buyse, ScD


IDDI and University of Hasselt, Belgium
James Lind (1747)
A Treatise on the Scurvy
Treatment of sick
sailors:
Sea water - 2
Vinegar - 2
Cider - 2
Elixir of vitriol - 2
Electuary - 2
Lemon juice - 2
Jan Van Helmont (1648)
« Let us cast lots »
Let us take out of the
hospitals… 200 or 500 poor
people, that have fevers,
pleurisies. Let us divide
them into halves, let us
cast lots, that one half of
them may fall to my share,
and the other to yours; I
will cure them without
bloodletting and sensible
Van Helmont thought clinical trial
N = 200

         Bloodletting   No bloodletting   Total
Alive         70              80           150
Dead          30              20            50
Total        100             100           200

Treatment effect (δ) = 30/100 – 20/100 = 0.3 – 0.2 = 0.1

P-value = 0.10

Van Helmont thought clinical trial
N = 500

         Bloodletting   No bloodletting   Total
Alive        175             200           375
Dead          75              50           125
Total        250             250           500

Treatment effect (δ) = 75/250 – 50/250 = 0.3 – 0.2 = 0.1

P-value = 0.01
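
The slides do not say which test produced these P-values. As a minimal sketch, assuming a Pearson chi-square test without continuity correction (an assumption, not stated in the source), the two quoted values can be recomputed from the tables. The function name `pearson_chi2_2x2` is illustrative only.

```python
import math

def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Rows: alive, dead; columns: the two arms of each thought trial.
chi2_200, p_200 = pearson_chi2_2x2(70, 80, 30, 20)
chi2_500, p_500 = pearson_chi2_2x2(175, 200, 75, 50)

print(f"N = 200: chi2 = {chi2_200:.2f}, P = {p_200:.2f}")
print(f"N = 500: chi2 = {chi2_500:.2f}, P = {p_500:.2f}")
```

Under this assumption the same treatment effect of 0.1 gives a P-value that rounds to 0.10 with 200 patients and to 0.01 with 500, which is the point of the two slides: the P-value depends on the sample size as well as on the effect.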
The simple two-arm design
Patients are (stratified and) randomized between
two or more treatment groups

Randomize → Experimental treatment
          → Control (e.g. standard treatment)
Treatment effect

The treatment effect is δ, the difference between
the outcomes of the two groups:
• If the endpoint is response, the outcome is the
response rate and δ = r_T – r_C
(the difference in response rates)
• If the endpoint is survival, the outcome is the
hazard rate and δ = log(h_C / h_T)
(the log hazard ratio)
• etc.
Test of hypothesis

The null hypothesis is the one we wish to reject:
H0: δ = 0

The alternative hypothesis is the one we wish to accept:
HA: δ > 0 (one-sided)
or HA: δ ≠ 0 (two-sided)
Statistical test
• If the endpoint is response, we test whether the
difference in proportions differs from 0
• If the endpoint is survival, we test whether the
log hazard ratio differs from 0
The “P-value”

The test statistic allows us to test whether the
treatment effect estimated from the data is
incompatible with H0. The P-value is

P-value = P(test statistic > observed | H0)

Conventionally, the test is significant if

P-value < 0.05
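
To make the definition concrete, here is a minimal sketch of a test for a difference in proportions, assuming a pooled two-proportion z-test with a normal approximation (one common choice, not necessarily the one used in the slides). The function name and argument names are illustrative. It is applied to the death counts of the N = 200 thought trial.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(events_c, n_c, events_t, n_t):
    """Return (delta, z, one-sided P, two-sided P) for delta = p_c - p_t."""
    p_c, p_t = events_c / n_c, events_t / n_t
    delta = p_c - p_t
    pooled = (events_c + events_t) / (n_c + n_t)   # common rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = delta / se
    std_normal = NormalDist()
    p_one_sided = 1 - std_normal.cdf(z)              # HA: delta > 0
    p_two_sided = 2 * (1 - std_normal.cdf(abs(z)))   # HA: delta != 0
    return delta, z, p_one_sided, p_two_sided

# Deaths in the N = 200 thought trial: 30/100 versus 20/100.
delta, z, p1, p2 = two_proportion_z_test(30, 100, 20, 100)
print(f"delta = {delta:.2f}, z = {z:.2f}, one-sided P = {p1:.3f}, two-sided P = {p2:.3f}")
```

The one-sided P-value is half the two-sided one here because the observed effect lies in the direction of the one-sided alternative.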


Decisions

Formulate hypotheses
↓
Calculate test statistic
↓
Test for significance
↓
P < 0.05 → H0 rejected
P ≥ 0.05 → H0 not rejected


Errors

                         Decision:
Truth                    H0 rejected          H0 not rejected
                         (P < 0.05)           (P ≥ 0.05)

H0 true (δ = 0)          FALSE POSITIVE       OK
                         α                    1-α

H0 not true (δ > 0)      OK                   FALSE NEGATIVE
                         1-β                  β
Analogy with diagnostic tests

Disease      Test positive        Test negative

Absent       FALSE POSITIVE       SPECIFICITY
             α                    1-α

Present      SENSITIVITY          FALSE NEGATIVE
             1-β                  β
Choice of α and β
• α, the probability of a false positive result*, is
usually chosen equal to 0.05
• β, the probability of a false negative result**, is
usually chosen equal to 0.10 – 0.20

* Accepting an ineffective treatment
** Missing an effective treatment
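
The two error probabilities can be illustrated by simulation. The sketch below is an illustration under stated assumptions, not part of the original slides: it simulates two-arm trials with 100 patients per arm, analyses each with a two-sided pooled z-test for proportions, and counts how often H0 is rejected. The event rates (25% in both arms under H0, 30% versus 20% otherwise), the number of simulated trials and the seed are all hypothetical choices.

```python
import random
from math import sqrt
from statistics import NormalDist

def rejects_h0(events_c, events_t, n, alpha=0.05):
    """Two-sided z-test for two proportions with n patients per arm."""
    pooled = (events_c + events_t) / (2 * n)
    if pooled in (0.0, 1.0):      # no variability: cannot reject
        return False
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    z = (events_c - events_t) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha

def rejection_rate(rate_c, rate_t, n=100, n_trials=2000, seed=1):
    """Fraction of simulated trials in which H0 is rejected."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(n_trials):
        events_c = sum(rng.random() < rate_c for _ in range(n))
        events_t = sum(rng.random() < rate_t for _ in range(n))
        rejected += rejects_h0(events_c, events_t, n)
    return rejected / n_trials

false_positive_rate = rejection_rate(0.25, 0.25)   # H0 true: estimates alpha
power = rejection_rate(0.30, 0.20)                 # H0 false: estimates 1 - beta
print(f"false positive rate ~ {false_positive_rate:.3f}")
print(f"power ~ {power:.3f}, false negative rate ~ {1 - power:.3f}")
```

With these hypothetical settings the false positive rate should land near the chosen α = 0.05, while the power stays well below the usual 80% to 90% target, so β is far larger than 0.10 to 0.20 for a trial of this size.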
Positive predictive value

Suppose we test 1,000 drugs using α = 0.05 and β = 0.2

Suppose that 10% of drugs tested are truly effective*
– 100 effective drugs tested at 80% power yield
80 true positives and 20 false negatives
– 900 ineffective drugs tested at α = 5% yield
45 false positives and 855 true negatives
– In total, 80 + 45 = 125 drugs are claimed effective; of these,
80 are true positives, hence there is only an 80 / 125 = 0.64 probability, or
about two chances in three, that a drug claimed effective is truly so
– This is the positive predictive value of the trial

* Unknown
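
The arithmetic above can be written as a small function. This is a minimal sketch: the function and argument names are illustrative, and the 10% proportion of truly effective drugs is the slide's own assumption (flagged there as unknown). The second call uses the stricter α = 0.005 that also appears in the deck.

```python
def positive_predictive_value(prior, alpha, power):
    """Probability that a drug declared effective is truly effective."""
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)

ppv_conventional = positive_predictive_value(prior=0.10, alpha=0.05, power=0.80)
ppv_stringent = positive_predictive_value(prior=0.10, alpha=0.005, power=0.80)
print(f"alpha = 0.05:  PPV = {ppv_conventional:.2f}")
print(f"alpha = 0.005: PPV = {ppv_stringent:.2f}")
```

With the same prior and power, tightening α from 0.05 to 0.005 raises the positive predictive value from 0.64 to about 0.95.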
Positive predictive value

[Figure: positive predictive value by α: 64% at α = 0.05, 95% at α = 0.005]
Sample size

N = f(α, β) / (δ/σ)²

where
α = probability of false positive
β = probability of false negative
δ/σ = standardized treatment effect

Sample size

N = f(α, β) / (δ/σ)²

If δ is divided by 2
(e.g. a 10% difference instead of a 20% difference),
the sample size is multiplied by 4 (= 2²)!

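
The slides leave f(α, β) unspecified. As a hedged sketch, the code below assumes the usual normal-approximation form for a two-sided comparison of two equal-sized arms on a continuous outcome, f(α, β) = 4 (z_{1-α/2} + z_{1-β})², which gives the total number of patients. That form and the standardized effect sizes 0.50 and 0.25 are assumptions chosen for illustration, not values from the source.

```python
from math import ceil
from statistics import NormalDist

def total_sample_size(effect_size, alpha=0.05, beta=0.20):
    """Unrounded total N for two equal arms, two-sided test, continuous outcome.

    ASSUMPTION (not from the slides): f(alpha, beta) = 4 * (z_{1-alpha/2} + z_{1-beta})^2.
    effect_size is the standardized treatment effect delta / sigma.
    """
    z = NormalDist().inv_cdf
    f = 4 * (z(1 - alpha / 2) + z(1 - beta)) ** 2
    return f / effect_size ** 2

n_large = total_sample_size(0.50)
n_small = total_sample_size(0.25)   # delta divided by 2
print(ceil(n_large), ceil(n_small), n_small / n_large)
```

Whatever form f(α, β) takes, the ratio of the two sample sizes is 4, because N depends on the effect only through 1 / (δ/σ)².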
Evidence of no difference?

In trials that set out to show a treatment difference but
fail to reach statistical significance,

no evidence of difference ≠ evidence of no difference!

… for no evidence of difference may result from a
lack of power (due to insufficient sample size)
Tests of hypothesis in equivalence trials

The null hypothesis is the one we wish to reject:
H0: δ ≥ δ_Equivalence

The alternative hypothesis is the one we wish to accept:
HA: δ < δ_Equivalence (one-sided)
or HA: |δ| < |δ_Equivalence| (two-sided)
A trial of endocrine therapy

The ATAC trial for postmenopausal women
with early breast cancer

Randomize → Anastrozole
          → Tamoxifen (Control group)
          → Anastrozole + Tamoxifen

Ref: ATAC Investigators. Lancet 359:2131;2002

Comparisons of interest in ATAC

Randomize → Anastrozole
          → Tamoxifen (standard treatment)
          → Anastrozole + Tamoxifen

δ_Anastrozole: Anastrozole vs. Tamoxifen
δ_Combination: Anastrozole + Tamoxifen vs. Tamoxifen
Hypotheses in ATAC trial

Superiority of the combination over Tamoxifen:
H0: δ_Combination = 0 vs. HA: δ_Combination > 0

Non-inferiority of Anastrozole compared to Tamoxifen:
H0: δ_Anastrozole ≥ δ_Equivalence vs. HA: δ_Anastrozole < δ_Equivalence
with δ_Equivalence such that the hazard ratio would be < 1.25
(i.e. at most 25% worse for anastrozole)
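
A non-inferiority test of this kind can be sketched as a one-sided Wald-type z-test on the log hazard ratio against the margin log(1.25). This is one possible formulation, assumed here for illustration; the slides do not describe the analysis actually used in ATAC. The hazard ratio 0.83 is the value quoted in the deck, while the standard error of 0.10 is a hypothetical number, since the source gives no measure of precision.

```python
from math import log
from statistics import NormalDist

def non_inferiority_p_value(hazard_ratio, se_log_hr, margin=1.25):
    """One-sided P-value for H0: log(HR) >= log(margin) (Wald-type z-test)."""
    z = (log(hazard_ratio) - log(margin)) / se_log_hr
    return NormalDist().cdf(z)   # small when the estimate lies well below the margin

# HR = 0.83 is quoted in the slides; se_log_hr = 0.10 is a HYPOTHETICAL value.
p = non_inferiority_p_value(hazard_ratio=0.83, se_log_hr=0.10)
print(f"one-sided P for non-inferiority = {p:.5f}")
```

An estimate sitting exactly on the margin gives P = 0.5, and the P-value shrinks as the estimated hazard ratio moves below 1.25 or as the standard error decreases.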
ATAC

It was anticipated that Anastrozole would have HR < 1.25
(in fact, it turned out to be 0.83)

It was hoped that the combination would have HR < 1
(in fact, it turned out to be 1.02)

Although the trial results contradicted the pre-specified
hypotheses, these results were reliable and clinically
useful because of the large size of the trial
