
J. R. Statist. Soc. A (2013) 176, Part 2, pp.
The risky reliance on small surrogate end point studies when planning a large prevention trial

Stuart G. Baker and Barnett S. Kramer


National Cancer Institute, Bethesda, USA

[Received November 2011. Final revision March 2012]

Summary. The definitive evaluation of treatment to prevent a chronic disease with low incidence
in middle age, such as cancer or cardiovascular disease, requires a trial with a large sample
size of perhaps 20 000 or more. To help to decide whether to implement a large true end point
trial, investigators first typically estimate the effect of treatment on a surrogate end point in a
trial with a greatly reduced sample size of perhaps 200 subjects. If investigators reject the null
hypothesis of no treatment effect in the surrogate end point trial they implicitly assume that
they would probably correctly reject the null hypothesis of no treatment effect for the true end
point. Surrogate end point trials are generally designed with adequate power to detect an effect
of treatment on the surrogate end point. However, we show that a small surrogate end point
trial is more likely than a large surrogate end point trial to give a misleading conclusion about
the beneficial effect of treatment on the true end point, which can lead to a faulty (and costly)
decision about implementing a large true end point prevention trial. If a small surrogate end
point trial rejects the null hypothesis of no treatment effect, an intermediate-sized surrogate end
point trial could be a useful next step in the decision-making process for launching a large true
end point prevention trial.
Keywords: Cancer prevention; Cardiovascular disease; Prentice criterion; Principal
stratification; Sample size calculation; Surrogate end point

1. Introduction
Searching for an effective treatment to prevent a chronic disease with low incidence in a cohort
of middle-aged people, such as cancer or cardiovascular disease, is challenging because a definitive
trial with a disease incidence end point requires a very large sample size of perhaps 20 000
or more. Before implementing such a large trial, investigators often seek evidence of treatment
benefit from a small trial of perhaps 200 by using a surrogate end point observed before the true
end point of disease incidence. For example, a recent trial of a drug to reduce the surrogate end
point of occurrence of adenoma involved a sample size of 267 (Thompson et al., 2010). If this
study had used the true end point of colorectal cancer incidence or mortality, the sample size
could have been as large as 70 000 (Atkin, 2010). As another example, a recent trial of a drug
to reduce the surrogate end point of the occurrence of bronchial dysplasia involved a sample
size of 100 (Lam et al., 2004). If this study had used the true end point of lung cancer incidence,
the sample size could have been as large as 30 000 (Alpha-Tocopherol, Beta Carotene Cancer
Prevention Study Group, 1994). These surrogate end point trials have adequate power to detect
an effect of treatment on a surrogate end point, and their small sample sizes are touted as an
important advantage (Psaty et al., 1999). But is this the proverbial case of a free lunch that is
not really free?
Address for correspondence: Stuart G. Baker, Biometry Research Group, Division of Cancer Prevention,
National Cancer Institute, EPN 3118, 6130 Executive Boulevard, MSC 7354, Bethesda, MD 20892-7354, USA.
E-mail: sb16i@nih.gov

Published 2012. This article is a US Government work and is in the public domain in the USA.
To answer this question, it is helpful to consider the primary goal of using a surrogate end
point and the data that are available to achieve this goal. The following discussion of the use of
surrogate end points is not a summary of an extensive statistical literature (e.g. Weir and Walle
(2006) and Lassere (2008)) but focuses on key points in study design and implications.
With treatment trials, usually the primary goal of using a surrogate end point is extrapolation,
namely drawing conclusions about the effect of treatment on a true end point while shortening
the duration of the trial. In this setting data are typically available from at least one historical
trial with the same surrogate and true end points associated with the trial of interest. Typically
a model that is constructed from these historical data is used to predict the effect of treatment
on the true end point based on the surrogate end point in the trial of interest (e.g. Baker et al.
(2012)).
With the prevention trials that are discussed here, usually the primary goal of using a surrogate
end point is to draw conclusions about the effect of treatment on a true end point by using
a much smaller sample size than with the true end point. (A secondary goal is shortening the
duration of the study.) In terms of available data, typically there are no historical trials with
the same surrogate and true end points as associated with the trial of interest. With no data for
predicting treatment effect on the true end point, the focus is on hypothesis testing to justify a
much larger definitive trial. In the hypothesis testing framework, Prentice (1989) defined a valid
surrogate end point as one satisfying what we call the extrapolation assumption: rejecting the
null hypothesis of no treatment effect on the surrogate end point in favour of a beneficial
treatment effect on the surrogate end point implies rejecting the null hypothesis of no treatment
effect on the true end point in favour of a beneficial treatment effect on the true end point.
Using three models relating surrogate and true end points, we show that the link between
the extrapolation assumption and the size of the surrogate end point trial explains why a small
surrogate end point trial is particularly unreliable for drawing conclusions about the effect of
treatment on the true end point.

2. Binary surrogate end point: mixture model


Let $S = 0, 1$ and $T = 0, 1$ denote binary surrogate and true end points respectively, where outcome
0 is unfavourable and outcome 1 is favourable. For example $T = 0$ and $T = 1$ are respectively
incidence and no incidence of lung cancer, and $S = 0$ and $S = 1$ are respectively occurrence and
non-occurrence of bronchial dysplasia. Let $Z = 0$ (control), $1$ (experimental) denote the
randomization group. Let $p_z = \mathrm{pr}(S = 1 \mid Z = z)$, $f_z = \mathrm{pr}(T = 1 \mid Z = z)$,
$b_s = \mathrm{pr}(T = 1 \mid S = s, Z = 0)$ and
$c_s = \mathrm{pr}(T = 1 \mid S = s, Z = 1) - \mathrm{pr}(T = 1 \mid S = s, Z = 0)$.
The probabilities of true end point are mixtures

$$f_0 = p_0 b_1 + (1 - p_0) b_0, \tag{1}$$

$$f_1 = p_1 (b_1 + c_1) + (1 - p_1)(b_0 + c_0). \tag{2}$$

Equations (1) and (2), which define the mixture model in Baker et al. (2012), imply that
$f_1 - f_0 = (p_1 - p_0)(b_1 - b_0) + d_{\mathrm{MIX}}$, where $d_{\mathrm{MIX}} = p_1 c_1 + (1 - p_1) c_0$.
The goal is to reject $f_1 = f_0$ in favour of $f_1 > f_0$. For a reasonable surrogate end point,
$b_1 - b_0 > 0$. Therefore

$$p_1 - p_0 > 0 \ \text{implies} \ f_1 - f_0 > d_{\mathrm{MIX}}. \tag{3}$$
The extrapolation assumption says that $p_1 - p_0 > 0$ (namely rejecting the null hypothesis of no
treatment effect on a surrogate end point) implies that $f_1 - f_0 > 0$ (namely rejecting the null
hypothesis of no treatment effect on a true end point); this is a special case of equation (3) in
which $d_{\mathrm{MIX}} = 0$. Thus, if an investigator believes that the extrapolation assumption holds, but in
reality $d_{\mathrm{MIX}} < 0$, it is possible to conclude that there is a beneficial effect of treatment on the true
end point when, in fact, there is a detrimental effect of treatment on the true end point equal to
$d_{\mathrm{MIX}}$. To put the magnitude of $d_{\mathrm{MIX}}$ in perspective relative to the effect of treatment, we compute
the relative error (as a percentage), namely $\mathrm{RE}_{\mathrm{MIX}} = 100\, d_{\mathrm{MIX}}/(f_1 - f_0)$. The requirement for
the extrapolation condition that $d_{\mathrm{MIX}} = 0$ implies that $c_0 = c_1 = 0$, which is called the Prentice
criterion, namely the probability of the true end point given the surrogate end point does not
depend on the randomization group. See also Buyse and Molenberghs (1998).
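The decomposition of $f_1 - f_0$ into a surrogate-driven part and $d_{\mathrm{MIX}}$ can be checked numerically. The following is a minimal sketch, not the authors' code: the function name `mixture_effect` and the input values are illustrative assumptions (they correspond to the numerical example used later in this section), and the relative error is taken relative to the effect assumed under the Prentice criterion, which appears to be how the figures in Table 1 arise.

```python
def mixture_effect(p0, p1, b0, b1, c0=0.0, c1=0.0):
    """Return (f0, f1, surrogate_part, d_mix) for the mixture model, equations (1)-(2)."""
    f0 = p0 * b1 + (1 - p0) * b0
    f1 = p1 * (b1 + c1) + (1 - p1) * (b0 + c0)
    surrogate_part = (p1 - p0) * (b1 - b0)   # part explained by the surrogate
    d_mix = p1 * c1 + (1 - p1) * c0          # part due to deviation from the Prentice criterion
    return f0, f1, surrogate_part, d_mix


# Illustrative inputs (assumptions): p_z = 100 * f_z, b0 = 0, b1 = 1/100.
P0, P1, B0, B1 = 0.3, 0.4, 0.0, 0.01

# Under the Prentice criterion (c0 = c1 = 0).
f0, f1, part, d = mixture_effect(P0, P1, B0, B1)
nominal_effect = f1 - f0

# With a small deviation c1 = -0.002.
f0_dev, f1_dev, part_dev, d_dev = mixture_effect(P0, P1, B0, B1, c1=-0.002)
realized_effect = f1_dev - f0_dev
relative_error = 100 * d_dev / nominal_effect

print(f"nominal effect   = {nominal_effect:.4f}")
print(f"realized effect  = {realized_effect:.4f}")
print(f"d_MIX            = {d_dev:.4f}")
print(f"relative error   = {relative_error:.1f}%")
```

With these inputs the surrogate still shows $p_1 - p_0 = 0.1 > 0$, yet most of the assumed benefit on the true end point is removed by the deviation term.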
For a true end point trial, a two-sided type I error of 5% and power of 90%, a standard
formula for the sample size of each equal-sized randomization group (Halperin et al., 1968) is

$$\mathrm{size}(f_0, f_1) = \frac{[1.96\{f_1(1 - f_1) + f_0(1 - f_0)\}^{1/2} + 1.28\{2 f_{\mathrm{avg}}(1 - f_{\mathrm{avg}})\}^{1/2}]^2}{(f_1 - f_0)^2}, \tag{4}$$

where $f_{\mathrm{avg}} = (f_0 + f_1)/2$. The sample size for the surrogate end point trial with a two-sided type
I error of 5% and power of 90% is $\mathrm{size}(p_0, p_1)$ computed under the Prentice criterion.
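As a worked illustration of formula (4), assuming the values $f_0 = 0.003$ and $f_1 = 0.004$ used in the example that follows, the two square-root terms are nearly equal, so the arithmetic simplifies:

$$
\begin{aligned}
f_1(1 - f_1) + f_0(1 - f_0) &= 0.003984 + 0.002991 = 0.006975, \\
2 f_{\mathrm{avg}}(1 - f_{\mathrm{avg}}) &= 2(0.0035)(0.9965) = 0.0069755, \\
\mathrm{size}(0.003, 0.004) &\approx \frac{\{(1.96 + 1.28)(0.08352)\}^2}{(0.001)^2} \approx 73\,200.
\end{aligned}
$$

This is close to the 73 300 reported in Table 1; the small difference is plausibly due to rounding of the normal quantiles 1.96 and 1.28.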
Consider a realistic example with $f_0 = 0.003$ and $f_1 = 0.004$ (Table 1). Realistic values of
parameters under the Prentice criterion can be obtained by using proportional probabilities of
end points, namely $p_z = R f_z$, which implies $b_0 = 0$ and $b_1 = 1/R$. For $R = 100$, a sample size of
73 300 for a true end point trial is reduced to 480, but the relative error for a very small deviation
from the Prentice criterion of $c_1 = -0.002$ is $-80\%$, which is a possibility of great concern. If
$1 - p_z$ is close to 1, $\mathrm{size}(R p_0, R p_1)/\mathrm{size}(p_0, p_1) \approx 1/R$. For a deviation from the Prentice
criterion in only $c_1$ (with $c_0 = 0$), the relative error is $\mathrm{RE}_{\mathrm{MIX}}(p_1) = 100\, p_1 c_1/(f_1 - f_0)$, which implies
that $\mathrm{RE}_{\mathrm{MIX}}(R p_1)/\mathrm{RE}_{\mathrm{MIX}}(p_1) = R$. Thus, regardless of the values of $f_0$ and $f_1$, if the sample
size decreases by a factor of approximately $R$, the relative error increases by a factor of $R$, in
agreement with Table 1.
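The trade-off between sample size and relative error under the mixture model can be tabulated directly. The sketch below is an illustration under stated assumptions rather than the authors' computation: the function names `size` and `surrogate_row` are hypothetical, the quantiles are the rounded 1.96 and 1.28 of formula (4), and the relative error uses the assumed effect $f_1 - f_0 = 0.001$ as its denominator.

```python
from math import sqrt


def size(q0, q1, z_alpha=1.96, z_beta=1.28):
    """Per-group sample size from formula (4) for event probabilities q0 and q1."""
    q_avg = (q0 + q1) / 2
    numerator = (z_alpha * sqrt(q1 * (1 - q1) + q0 * (1 - q0))
                 + z_beta * sqrt(2 * q_avg * (1 - q_avg))) ** 2
    return numerator / (q1 - q0) ** 2


F0, F1 = 0.003, 0.004   # probabilities of the true end point
C1 = -0.002             # deviation from the Prentice criterion (c0 = 0)


def surrogate_row(R):
    """Return (sample size, relative error in %) when p_z = R * f_z."""
    p0, p1 = R * F0, R * F1
    n = size(p0, p1)
    relative_error = 100 * p1 * C1 / (F1 - F0)
    return n, relative_error


rows = {R: surrogate_row(R) for R in (100, 10, 1)}
for R, (n, re_pct) in rows.items():
    print(f"R = {R:>3}: n per group ~ {n:,.0f}, relative error = {re_pct:.1f}%")
```

The computed sample sizes fall slightly below the rounded entries of Table 1, while the relative errors of about -80%, -8% and -0.8% scale with $R$ as the text describes.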

Table 1. Relative errors and sample sizes†

| Formulation | Key parameter | Sample size per randomization group: true end point | Sample size per randomization group: surrogate end point | Deviation from extrapolation assumption | Relative error (%) |
|---|---|---|---|---|---|
| Binary: mixture | $1/R = 0.01$ | 73 300 | 480 | $c_1 = -0.002$ instead of 0 | −80.0 |
| | $1/R = 0.10$ | 73 300 | 7100 | $c_1 = -0.002$ instead of 0 | −8.0 |
| | $1/R = 1.00$ | 73 300 | 73 300 | $c_1 = -0.002$ instead of 0 | −0.8 |
| Binary: PS | $1/R = 0.01$ | 73 300 | 480 | $h_A = -0.002$ instead of 0 | −60.0 |
| | $1/R = 0.10$ | 73 300 | 7100 | $h_A = -0.002$ instead of 0 | −6.0 |
| | $1/R = 1.00$ | 73 300 | 73 300 | $h_A = -0.002$ instead of 0 | −0.6 |
| Continuous: model for mean | $b = 0.08$ | 73 800 | 480 | $c = 0.002$ instead of 0 | −46.1 |
| | $b = 0.31$ | 73 800 | 7100 | $c = 0.002$ instead of 0 | −12.2 |
| | $b = 1.00$ | 73 800 | 73 300 | $c = 0.002$ instead of 0 | −3.8 |

†The relative error is the percentage change in the effect of treatment on the true end point
(under the deviation from the extrapolation assumption) that is consistent with rejecting
the null hypothesis of no effect of treatment on the surrogate end point (when incorrectly
making the extrapolation assumption). Computations are based on probabilities of true
end point of $f_0 = 0.003$ and $f_1 = 0.004$.
3. Binary surrogate end point: principal stratification model
The principal stratification (PS) model for surrogate end points (Frangakis and Rubin, 2002) was
originally formulated to estimate causal effects. Here we discuss its implications for hypothesis
testing. Let $S^*$ denote the four principal strata: $A$ (always), when $S = 1$ regardless of
randomization assignment, $C$ (consistent), when $S = z$, $I$ (inconsistent), when $S = 1 - z$, and
$N$ (never), when $S = 0$ regardless of randomization assignment. Let $p_{s^*} = \mathrm{pr}(S^* = s^*)$,
$b_{s^*} = \mathrm{pr}(T = 1 \mid S^* = s^*, Z = 0)$ and
$h_{s^*} = \mathrm{pr}(T = 1 \mid S^* = s^*, Z = 1) - \mathrm{pr}(T = 1 \mid S^* = s^*, Z = 0)$. By definition, $p_1 = p_A + p_C$
because $S = 1$ for $Z = 1$ only in principal strata $A$ and $C$. Similarly $p_0 = p_A + p_I$ because $S = 1$ for
$Z = 0$ only in principal strata $A$ and $I$. Consequently $p_1 - p_0 = p_C - p_I$. Combining this result
with

$$f_0 = p_A b_A + p_C b_C + p_I b_I + p_N b_N, \tag{5}$$

$$f_1 = p_A (b_A + h_A) + p_I (b_I + h_I) + p_C (b_C + h_C) + p_N (b_N + h_N) \tag{6}$$

gives $f_1 - f_0 = (p_1 - p_0) h_C + d_{\mathrm{PS}}$, where $d_{\mathrm{PS}} = h_A p_A + h_N p_N + (h_I + h_C) p_I$. The relative error
is $\mathrm{RE}_{\mathrm{PS}} = 100\, d_{\mathrm{PS}}/(f_1 - f_0)$. Here the extrapolation assumption, that $p_1 - p_0 > 0$ implies $f_1 - f_0 > 0$,
requires $d_{\mathrm{PS}} = 0$. An appealing scenario for $d_{\mathrm{PS}} = 0$ is what we call the PS criterion: $h_A = h_N = 0$
and $p_I = 0$, namely the probability of the true end point for principal strata $A$ and $N$ depends
only on the level of the surrogate end point and not on the randomization group, and there is no person
for whom the level of the surrogate end point is 'inconsistent' with the randomization group. The
PS criterion is analogous to the identifiability requirement in some PS models (e.g. Baker et al.
(2011)).
The sample size formula for a true end point trial is $\mathrm{size}(f_0, f_1)$. The sample size for the
surrogate end point trial with a two-sided type I error of 5% and power of 90% is $\mathrm{size}(p_0, p_1)$
computed under the PS criterion.
Again let $f_0 = 0.003$ and $f_1 = 0.004$. Realistic values of parameters under the PS criterion can
be obtained by using $p_z = R f_z$, which implies $p_A = f_0 R$, $p_C = (f_1 - f_0) R$ and $p_N = 1 - f_1 R$. For
$R = 100$, a sample size of 73 300 for a true end point trial is reduced to 480 (Table 1), and the
relative error for a very small deviation from the PS criterion of $h_A = -0.002$ is $-60\%$, which is
a possibility of great concern. If $1 - p_z$ is close to 1, $\mathrm{size}(R p_0, R p_1)/\mathrm{size}(p_0, p_1) \approx 1/R$. For the
deviation from the PS criterion in only $h_A$ (with $h_N = p_I = 0$) the relative error is
$\mathrm{RE}_{\mathrm{PS}}(p_0) = 100\, h_A p_A/(f_1 - f_0) = 100\, h_A p_0/(f_1 - f_0)$, so $\mathrm{RE}_{\mathrm{PS}}(R p_0)/\mathrm{RE}_{\mathrm{PS}}(p_0) = R$. Thus, as with the mixture
model, if the sample size decreases by a factor of approximately $R$, the relative error increases
by a factor of $R$, in agreement with Table 1.
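The PS calculation can be sketched in the same way. This is an illustrative check, not the authors' code: the names `true_endpoint_probs` and `ps_relative_error` are hypothetical, and the baseline probabilities $b_A = 1/R$, $b_C = b_N = 0$ and the effect $h_C = 1/R$ are assumptions chosen only so that the strata reproduce $f_0 = 0.003$ and $f_1 = 0.004$ under the PS criterion. Because $p_I = 0$ throughout, the term in $p_I$ drops out of $d_{\mathrm{PS}}$.

```python
STRATA = ("A", "C", "I", "N")
F0, F1 = 0.003, 0.004   # probabilities of the true end point under the PS criterion


def true_endpoint_probs(p, b, h):
    """Return (f0, f1) from equations (5) and (6)."""
    f0 = sum(p[s] * b[s] for s in STRATA)
    f1 = sum(p[s] * (b[s] + h[s]) for s in STRATA)
    return f0, f1


def ps_relative_error(R, h_A=-0.002):
    """Relative error (%) for a deviation in h_A only, with h_N = p_I = 0."""
    p = {"A": F0 * R, "C": (F1 - F0) * R, "I": 0.0, "N": 1 - F1 * R}
    # Hypothetical baseline values (assumptions) that give f0 = F0 and f1 = F1 when h_A = 0.
    b = {"A": 1.0 / R, "C": 0.0, "I": 0.0, "N": 0.0}
    h = {"A": h_A, "C": 1.0 / R, "I": 0.0, "N": 0.0}
    f0, f1 = true_endpoint_probs(p, b, h)
    d_ps = h["A"] * p["A"] + h["N"] * p["N"]   # the p_I term vanishes because p_I = 0
    # The realized effect splits into a surrogate-driven part and d_PS.
    assert abs((f1 - f0) - ((p["C"] - p["I"]) * h["C"] + d_ps)) < 1e-12
    return 100 * d_ps / (F1 - F0)


ps_errors = {R: ps_relative_error(R) for R in (100, 10, 1)}
for R, re_pct in ps_errors.items():
    print(f"R = {R:>3}: relative error = {re_pct:.1f}%")
```

Under these assumptions the relative errors are about -60%, -6% and -0.6%, matching the PS rows of Table 1.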

4. Continuous surrogate end point: model for the mean


To investigate small surrogate end point trials with a continuous surrogate end point, let $s_z$
denote the mean value of a continuous surrogate end point for randomization group $z$, and let
$t_z = \mathrm{logit}(f_z)$. Also let $\sigma_T^2$ and $\sigma_S^2$ denote the variance for a sample size of 1 of $t_1 - t_0$ and $s_1 - s_0$
respectively. On the basis of the delta method, $\sigma_T^2 = 1/\{f_0(1 - f_0)\} + 1/\{f_1(1 - f_1)\}$. A simple
linear model with $\sigma_T$ and $\sigma_S$ as scale factors,

$$t_0/\sigma_T = a_0 + b(s_0/\sigma_S), \tag{7}$$

$$t_1/\sigma_T = a_1 + (b + c)\, s_1/\sigma_S, \tag{8}$$


implies that $(t_1 - t_0)/\sigma_T = b(s_1 - s_0)/\sigma_S + d_{\mathrm{MEAN}}$, where $d_{\mathrm{MEAN}} = c s_1/\sigma_S + a_1 - a_0$. The relative
error is $\mathrm{RE}_{\mathrm{MEAN}} = 100\, d_{\mathrm{MEAN}}/\{(t_1 - t_0)/\sigma_T\}$. Under this model, for $b > 0$, $s_1 - s_0 > 0$ implies that
$(t_1 - t_0)/\sigma_T > d_{\mathrm{MEAN}}$. Here the extrapolation assumption, that $s_1 - s_0 > 0$ implies $t_1 - t_0 > 0$, requires
$d_{\mathrm{MEAN}} = 0$. An appealing scenario for $d_{\mathrm{MEAN}} = 0$ is what we call the Prentice criterion for the
mean: $a_0 = a_1$ and $c = 0$, which implies the same effect of the mean surrogate end point on the
true end point for each randomization group.
For the true end point trial with a two-sided type I error of 5% and power of 90%, the sample
size of each equal-sized randomization group is $\mathrm{size}^*(t_0, t_1) = (1.96 + 1.28)^2 \sigma_T^2/(t_1 - t_0)^2$. The
sample size for the surrogate end point trial with a two-sided type I error of 5% and power of
90% is $\mathrm{size}^*(s_0, s_1)$ computed under the Prentice criterion for the mean, which can be written
as $\mathrm{size}^*(b) = \mathrm{size}^*(t_0, t_1)\, b^2$.
Again consider $f_0 = 0.003$ and $f_1 = 0.004$. The sample size for the true end point trial is
73 800. To obtain realistic parameter values, we consider sample sizes computed under the other
models. For the sample size of 480 (Table 1), the relative error arising from a small deviation
from the Prentice criterion for the mean of $c = 0.002$ is $-46.1\%$, which is yet another possibility
of great concern. Note that $\mathrm{size}^*(b/k)/\mathrm{size}^*(b) = 1/k^2$. For the deviation from the Prentice
criterion for the mean involving only $c$ (with $a_0 = a_1$), $\mathrm{RE}_{\mathrm{MEAN}}(b) = 100\, t_1 \{c/(b + c)\}/(t_1 - t_0)$, so
$\mathrm{RE}_{\mathrm{MEAN}}(b/k)/\mathrm{RE}_{\mathrm{MEAN}}(b) = (b + c)/(b/k + c) \approx k$ for small $c$. Thus, if the sample size decreases
by a factor of $k^2$, the relative error increases by a factor of approximately $k$, in agreement with
Table 1.
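The quantities for the continuous surrogate end point can be evaluated with a few lines of code. The following sketch rests on assumptions: the function names are hypothetical, the quantiles are the rounded 1.96 and 1.28, and the relative error uses the closed form quoted above for a deviation in $c$ only. The results are close to, though not identical with, the rounded entries in Table 1, which may have been computed with unrounded values of $b$.

```python
from math import log

F0, F1 = 0.003, 0.004   # probabilities of the true end point


def logit(f):
    return log(f / (1 - f))


T0, T1 = logit(F0), logit(F1)
# Delta-method variance of t1 - t0 for a sample size of 1 per group.
SIGMA_T_SQ = 1 / (F0 * (1 - F0)) + 1 / (F1 * (1 - F1))


def size_true(z_alpha=1.96, z_beta=1.28):
    """Per-group sample size size*(t0, t1) for the true end point trial."""
    return (z_alpha + z_beta) ** 2 * SIGMA_T_SQ / (T1 - T0) ** 2


def size_surrogate(b):
    """Per-group sample size size*(b) = size*(t0, t1) * b**2 for the surrogate trial."""
    return size_true() * b ** 2


def re_mean(b, c=0.002):
    """Relative error (%) for a deviation in c only: 100 * t1 * {c / (b + c)} / (t1 - t0)."""
    return 100 * T1 * (c / (b + c)) / (T1 - T0)


print(f"true end point trial: n per group ~ {size_true():,.0f}")
for b in (0.08, 0.31, 1.00):
    print(f"b = {b:.2f}: n per group ~ {size_surrogate(b):,.0f}, "
          f"relative error = {re_mean(b):.1f}%")
```

Halving $b$ cuts the surrogate sample size by a factor of four while roughly doubling the relative error, which is the $k^2$ versus $k$ relationship stated in the text.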

5. Discussion
The search for treatments to prevent cancer or cardiovascular disease involves preliminary
evaluations using small surrogate end point trials, and as a practical matter this trend is likely to
continue in the genomic era to handle the likely explosion in potential hypotheses to be tested.
Although a small surrogate end point trial is typically conducted, it has a greater potential than
a large surrogate end point trial for a misleading conclusion that the treatment has a beneficial
effect on the true end point. This misleading conclusion could lead to a faulty decision about
implementing a definitive trial with a true end point. The implications regarding expenditure of
resources are enormous.
The focus here is on an incorrect conclusion to implement a large prevention trial after
rejecting the null hypothesis in the surrogate end point study. It is also possible to draw an
incorrect conclusion of not implementing a large prevention trial after not rejecting the null
hypothesis in the surrogate end point study. However, this latter incorrect conclusion is of
limited interest because the beneficial treatment effect would not be likely to be large and
researchers are looking for a large beneficial treatment effect to make a large prevention trial
worthwhile.
If a small surrogate end point trial indicates a promising treatment, investigators should next
investigate an intermediate-sized surrogate end point trial (with a more frequently occurring
surrogate end point) before jumping to a very large, resource-intensive, prevention trial with
a true end point of a rare disease. Of course the intermediate-sized surrogate end point trial
is no guarantee of drawing a correct conclusion. Therefore other sources of evidence, such as
observational studies and animal testing, need to be considered before implementing a large true
end point prevention trial. Also surrogate end point trials typically do not provide information
about multiple end points and long-term side effects, which is another reason for caution.

Acknowledgement
This research was supported by the National Institutes of Health.
References
Alpha-Tocopherol, Beta Carotene Cancer Prevention Study Group (1994) The effect of vitamin E and beta carotene on the incidence of lung cancer and other cancers in male smokers. New Engl. J. Med., 330, 1029–1035.
Atkin, W. S., Edwards, R., Kralj-Hans, I., Wooldrage, K., Hart, A. R., Northover, J. M., Parkin, D. M., Wardle, J., Duffy, S. W., Cuzick, J. and UK Flexible Sigmoidoscopy Trial Investigators (2010) Once-only flexible sigmoidoscopy screening in prevention of colorectal cancer: a multicentre randomised controlled trial. Lancet, 375, 1624–1633.
Baker, S. G., Lindeman, K. S. and Kramer, B. S. (2011) Clarifying the role of principal stratification in the paired
availability design. Int. J. Biostatist., 7, article 25.
Baker, S. G., Sargent, D. J., Buyse, M. and Burzykowski, T. (2012) Predicting treatment effect from surrogate
endpoints and historical trials: an extrapolation involving probabilities of a binary outcome or survival to a
specific time. Biometrics, 68, 248–257.
Buyse, M. and Molenberghs, G. (1998) The validation of surrogate endpoints in randomized experiments. Biometrics, 54, 1014–1029.
Frangakis, C. E. and Rubin, D. B. (2002) Principal stratification in causal inference. Biometrics, 58, 21–29.
Halperin, M., Rogot, E., Gurian, J. and Ederer, F. (1968) Sample sizes for medical trials with special reference to
long-term therapy. J. Chron. Dis., 21, 13–24.
Lam, S., leRiche, J. C., McWilliams, A., Macaulay, C., Dyachkova, Y., Szabo, E., Mayo, J., Schellenberg, R., Coldman, A., Hawk, E. and Gazdar, A. (2004) A randomized phase IIb trial of pulmicort turbuhaler (budesonide) in people with dysplasia of the bronchial epithelium. Clin. Cancer Res., 10, 6502–6511.
Lassere, M. N. (2008) The Biomarker-Surrogacy Evaluation Schema: a review of the biomarker-surrogate literature and a proposal for a criterion-based, quantitative, multidimensional hierarchical levels of evidence schema for evaluating the status of biomarkers as surrogate end points. Statist. Meth. Med. Res., 17, 303–340.
Prentice, R. L. (1989) Surrogate end points in clinical trials: definitions and operational criteria. Statist. Med., 8,
431–440.
Psaty, B. M., Weiss, N. S., Furberg, C. D., Koepsell, T. D., Siscovick, D. S., Rosendaal, F. R., Smith, N. L.,
Heckbert, S. R., Kaplan, R. C., Lin, D., Fleming, T. R. and Wagner, E. H. (1999) Surrogate end points, health
outcomes, and the drug-approval process for the treatment of risk factors for cardiovascular disease. J. Am.
Med. Ass., 282, 786–790.
Thompson, P. A., Wertheim, B. C., Zell, J. A., Chen, W. P., McLaren, C. E., LaFleur, B. J., Meyskens, F. L. and Gerner, E. W. (2010) Levels of rectal mucosal polyamines and prostaglandin E2 predict ability of DFMO and Sulindac to prevent colorectal adenoma. Gastroenterology, 139, 797–805.
Weir, C. J. and Walle, R. J. (2006) Statistical evaluation of biomarkers as surrogate end points: a literature review.
Statist. Med., 25, 183–203.
