Phase 3 Umbrella Trial Ren CCT (1)
ARTICLE INFO

Keywords: Master protocols; Umbrella trials; Optimal allocation ratio; Familywise error rate; Concurrent control; Non-concurrent control; Rolling arms

ABSTRACT

Master protocols, in particular umbrella trials and platform trials, when evaluating multiple experimental treatments with a common control, could save patient resources, increase trial efficiency, and reduce drug development cost. Compared to phase 3 platform trials that allow an unlimited number of experimental arms to be added, it is more practical for individual companies to evaluate two experimental arms with a common control in an umbrella trial and allow the second experimental arm to be added at a later time. There has been limited research on this type of trial in terms of statistical properties and guidance. In this article, we present statistical considerations of a phase 3 three-arm umbrella design including Type I error control and power, as well as the optimal allocation ratio. We intend not only to complement the existing literature, but more importantly to provide practical guidance to pave the way for its implementation by individual companies.
1. Introduction

An umbrella trial is a type of master protocol in which multiple therapies are evaluated in the same trial, while a platform trial adds a perpetual feature to an umbrella trial. When designed and implemented properly, umbrella and platform trials could increase the efficiency of drug development ([22]; FDA guidance). Compared to traditional clinical trials that evaluate one experimental treatment at a time, an umbrella or platform trial with a common control has the potential to reduce the total number of patients allocated to the control arm, making it more appealing to patients and drug developers. It could also gain operational efficiency when a new experimental arm is added using the existing trial infrastructure. Although there has been a lot of excitement about the concept of master protocols, very few Phase 3 umbrella trials have been conducted, especially by individual companies. The few phase 3 platform trials that have been conducted were run by third-party contract research organizations and co-sponsored by multiple pharmaceutical companies [24]. There are no individual company-led phase 3 platform trials so far. Individual companies usually have only a limited number of investigational regimens moving into Phase 3 within a certain timeframe. The perpetual feature of a platform trial may not be beneficial in individual company-led studies. It is more common that a pharmaceutical company has a few (two or three) investigational treatment regimens of interest within a similar time period. A Phase 3 umbrella study starting with only one experimental arm and its control arm, with the potential to add a second experimental arm mid-trial, would still have the benefit of saving patient resources and reducing enrollment competition. Despite the efficiencies of being able to add a treatment arm to an ongoing clinical trial, there is no clear methodological guidance on this topic [6]. In this article, we provide detailed statistical considerations and guidance on type I error, power, and the optimal randomization ratio for this three-arm umbrella phase 3 study design.

Several key statistical considerations arise in a phase 3 umbrella/platform trial with a common control [6]. The first one is the need for family-wise error rate (FWER) control. There have been long debates on whether FWER should be controlled at the study level. If two experimental arms are evaluated in the same trial for efficiency purposes and the objectives are to compare each experimental arm to the control arm without comparing between experimental arms, the trial can be considered similar to conducting two separate trials, and therefore FWER control is not necessary [9]. Bennett and Mander [4] only evaluated the scenario where multiplicity adjustment is required. In this article, we evaluate both cases, with or without FWER control. The second key statistical consideration is whether to include non-concurrent control patients in the analysis. We consider three different ways of utilizing the control patients in the analysis. At the same time, the optimal allocation ratio when adding a new treatment arm is often of interest, which we will
* Corresponding author.
E-mail address: yixin.ren@merck.com (Y. Ren).
https://doi.org/10.1016/j.cct.2021.106538
Received 24 May 2021; Received in revised form 2 August 2021; Accepted 6 August 2021
Available online 9 August 2021
1551-7144/© 2021 Elsevier Inc. All rights reserved.
Y. Ren et al. Contemporary Clinical Trials 109 (2021) 106538
required (Section 3.2).

3.1. Without FWER control

In this sub-section, we evaluate the situation where FWER does not need to be controlled at α. Each experimental arm to control arm comparison has its full level of type I error α. We evaluate each of the scenarios of control data use respectively.

Scenario 1 (only concurrent randomized control data used for comparison): We consider only using the concurrent randomized control data when conducting the analysis between each experimental arm and the control arm. This is a more conservative approach that maintains true randomization between the control arm and each experimental arm and is commonly used.

The standardized test statistic comparing each experimental arm i, i = 1, 2 to its concurrent control can be written as:

$$Z_{Ci} = \frac{\bar{X}_i - \bar{X}_{0C}}{\sigma\sqrt{\frac{1}{n_{0C}} + \frac{1}{n}}}, \quad i = 1, 2,$$

where (ZC1, ZC2)′ follows a standardized bivariate normal distribution BVN(0, ΣC) under the null hypothesis, with mean 0 and variance of 1, and the correlation between ZC1 and ZC2 is

$$\rho_C = \frac{rp}{R_C(1+R_C)}.$$

Here

$$\Sigma_C = \begin{bmatrix} 1 & \rho_C \\ \rho_C & 1 \end{bmatrix}.$$

The FWER can be written as:

$$\mathrm{FWER}_C = 1 - \int_{-\infty}^{Z_{1-\alpha}} \int_{-\infty}^{Z_{1-\alpha}} \Phi_C\big((z_{C1}, z_{C2})', r, p\big)\, dZ_{C1}\, dZ_{C2},$$

where ΦC((zC1, zC2)′, r, p) is the density function of the bivariate normal distribution with mean 0 and variance covariance matrix ΣC, and Z1−α is the percentile of a standard normal which satisfies P(X < Z1−α) = 1 − α.

Since both experimental arms share a portion of the control arm patients, when the control arm patients performed worse than expected

If all control data are used for Arm 2, the test statistic for Arm 2 would be:

$$Z_{H2} = \frac{\bar{X}_2 - \bar{X}_{0A}}{\sigma\sqrt{\frac{1}{n_{0A}} + \frac{1}{n}}},$$

where (ZH1, ZH2)′ ~ BVN(0, ΣH), and ΣH is a 2 × 2 matrix with diagonal entries equal to 1 and off-diagonal entries equal to ρH, with

$$\rho_H = \sqrt{\frac{R_C R_A}{(1+R_C)(1+R_A)}} \cdot \frac{rp}{R_C R_A}.$$

The FWER is

$$\mathrm{FWER}_H = 1 - \int_{-\infty}^{Z_{1-\alpha}} \int_{-\infty}^{Z_{1-\alpha}} \Phi_H\big((z_{H1}, z_{H2})', r, p\big)\, dZ_{H1}\, dZ_{H2},$$

where ΦH((zH1, zH2)′, r, p) is the density function of the bivariate normal distribution with mean 0 and variance covariance matrix ΣH. The probability of making two errors under the null hypothesis is:

$$P(\text{both tested positive} \mid \text{both inactive}) = \int_{Z_{1-\alpha}}^{\infty} \int_{Z_{1-\alpha}}^{\infty} \Phi_H\big((z_{H1}, z_{H2})', r, p\big)\, dZ_{H1}\, dZ_{H2} \tag{2}$$

Scenario 3 (all-control): In this scenario, we consider using data from both concurrent and non-concurrent control for the analyses of both experimental arms. The corresponding test statistic for comparing experimental treatment i, i = 1, 2 to control, is given by

$$Z_{Ai} = \frac{\bar{X}_i - \bar{X}_{0A}}{\sigma\sqrt{\frac{1}{n_{0A}} + \frac{1}{n}}},$$

and (ZA1, ZA2)′ ~ BVN(0, ΣA), where 0 is the zero vector of length 2, ΣA is a 2 × 2 matrix with diagonal entries equal to 1 and off-diagonal entries equal to ρA, where

$$\rho_A = \frac{rp}{R_A(1+R_A)}.$$

The FWER can then be written as:

$$\mathrm{FWER}_A = 1 - \int_{-\infty}^{Z_{1-\alpha}} \int_{-\infty}^{Z_{1-\alpha}} \Phi_A\big((z_{A1}, z_{A2})', r, p\big)\, dZ_{A1}\, dZ_{A2}.$$
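The correlations and FWER expressions of Section 3.1 can be checked numerically. The following is a minimal Python sketch, not the authors' code. It assumes R_C = 1 − p + rp and R_A = 2 − 2p + rp; these definitions are not shown in this excerpt and are inferred from Eq. (7) and the closed-form r* in Section 4.3. The bivariate normal probability is approximated by a one-dimensional Simpson integration rather than a library routine, and all function names are illustrative.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def ratios(r, p):
    """Assumed control-to-arm size ratios (inferred, not stated in the text)."""
    r_c = 1 - p + r * p      # concurrent control relative to one experimental arm
    r_a = 2 - 2 * p + r * p  # all control relative to one experimental arm
    return r_c, r_a


def correlations(r, p):
    """Return (rho_C, rho_H, rho_A) using the formulas of Section 3.1."""
    r_c, r_a = ratios(r, p)
    rho_c = r * p / (r_c * (1 + r_c))
    rho_h = math.sqrt(r_c * r_a / ((1 + r_c) * (1 + r_a))) * r * p / (r_c * r_a)
    rho_a = r * p / (r_a * (1 + r_a))
    return rho_c, rho_h, rho_a


def bvn_cdf(a, b, rho, steps=2000):
    """P(Z1 < a, Z2 < b) for a standard bivariate normal with |rho| < 1.

    Integrates phi(z) * Phi((b - rho * z) / sqrt(1 - rho^2)) over z in [-8, a]
    with Simpson's rule (steps must be even).
    """
    lo = -8.0
    if a <= lo:
        return 0.0
    h = (a - lo) / steps
    s = math.sqrt(1 - rho * rho)

    def f(z):
        return STD.pdf(z) * STD.cdf((b - rho * z) / s)

    total = f(lo) + f(a)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(lo + k * h)
    return total * h / 3


def fwer(rho, alpha=0.025):
    """FWER = 1 - P(Z1 < z, Z2 < z), with z the (1 - alpha) normal quantile."""
    z = STD.inv_cdf(1 - alpha)
    return 1 - bvn_cdf(z, z, rho)


if __name__ == "__main__":
    for p in (0.2, 0.5, 0.8, 1.0):
        rhos = correlations(1.0, p)
        print(p, [round(x, 3) for x in rhos], [round(fwer(x), 4) for x in rhos])
```

Setting the correlation to zero in `fwer` gives the two-independent-trials reference value 1 − (1 − α)^2, which is a convenient sanity check for the integration.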
Fig. 2 shows the correlations ρC, ρH, and ρA over the overlap p for r = 0.5, 1, 1.5, and 2 under three scenarios. It is clear that correlation increases as overlap increases; and as more control data is used, correlation decreases.

Assuming each experimental arm is tested at α = 2.5% (one-sided), which is typical for phase 3 studies, Fig. 3 shows the FWER of these three scenarios of control data use. The black line represents the FWER of two independent two-arm trials, which is a fixed value of 1 − (1 − α)^2; and the four dotted lines are for r = 0.5, 1, 1.5, and 2, respectively, for the proposed study design. The FWER is always smaller in the proposed umbrella study compared to the traditional setting of two independent two-arm trials due to the positive correlation between the two test statistics from shared control.

As the overlap parameter p increases, positive correlation increases and the FWER would decrease. When more control data is used, FWER becomes similar for different values of randomization ratio r. For more practical values of r (i.e., from 1 to 2), when overlap is minimum or moderate (say <0.6), FWER is similar for different values of randomization ratio r. When there is substantial overlap (say >0.6), FWER decreases as r decreases. The patterns of FWER are similar in all three scenarios on the use of control. The more non-concurrent control data included in the analysis, the smaller the correlation between the two test statistics. When two separate two-arm trials are conducted, the correlation between two test statistics is zero. As a result, FWER is the largest in the two independent two-arm trial case, followed by the all control scenario, followed by the hybrid scenario, and is the smallest in the concurrent control only scenario, assuming the same randomization ratio r and an overlap parameter p between 0 and 1. When p is 1, the FWER in all three control scenarios is the same. When p is 0, the FWER in the non-concurrent control scenario is the same as that in the two independent two-arm trials case.

We summarize and compare the probability of making two errors under the null hypothesis together with the scenarios where FWER control is required in Section 3.4.

3.2. With FWER control

When the two experimental arms are for the same or related claims, multiplicity adjustment of the two experimental arms comparison is

where m = C, H, A represents concurrent, hybrid, and all control, respectively. The critical value cm can be calculated using [4]. By replacing the critical value Zα with cm, m = C, H, A in Eqs. (1), (2), and (3), the probabilities of making two errors with multiplicity adjusted can also be calculated.

Fig. 4 shows the probability of making two errors under the null hypothesis. The first row displays values of r = 0.5, 1, 1.5, and 2 under all three control scenarios without FWER control (Section 3.1). The second row displays the same scenarios with FWER control. The solid line in each plot is the probability of making two errors under the null with two separate two-arm trials. Those probabilities are the same when p = 0 without adjusting for multiplicity. Although the absolute values of the probability of making two errors are small in all scenarios, the probability increases as p increases; and it is larger when FWER control is not required, compared to that when FWER control is required and in two separate two-arm trials each with its own control. When adjusting for multiplicity to control FWER, the probability of making two errors under the null hypothesis is smaller compared to the case of two independent two-arm studies each with its own control when overlap is small. The probability of making two errors under the global null is slightly larger than that when conducting two separate two-arm trials when overlap is large. This indicates that multiplicity adjustment has largely kept the probability of making two errors in check. Moreover, the probability of making two errors under the global null decreases as more control arm data are used, i.e., it is the largest under the concurrent control scenario and the smallest under the all control scenario.

4. Power and optimal allocation

4.1. Overall power without FWER control

As introduced in the earlier section, we define the overall power as the probability of at least one positive experimental arm when both experimental arms are active (assuming the same target treatment effect Δ).

Scenario 1: When both treatment arms are compared with the concurrent randomized control, without controlling the FWER, the overall study power = $1 - P(Z^*_{C1} < Z_{\beta_1^*}, Z^*_{C2} < Z_{\beta_2^*})$, where $Z^*_{Ci} = Z_{Ci} - \frac{\Delta}{\sqrt{\frac{1}{n_{0C}} + \frac{1}{n}}}$, $Z_{\beta_i^*} = Z_{\beta^*}$ for i = 1, 2 and satisfies

Fig. 3. Probabilities of family-wise error rate under three scenarios.

Fig. 4. Probabilities of making two errors under three scenarios without and with adjusting multiplicity.
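Section 3.2 replaces the unadjusted critical value with an adjusted value cm so that the FWER is controlled at α. The exact adjustment equation is not reproduced in this excerpt, so the Python sketch below rests on an assumption: both arms share a single critical value c, found by bisection so that 1 − P(Z1 < c, Z2 < c) = α for a given correlation ρ. It also evaluates the probability of two errors, P(Z1 > c, Z2 > c). This is an illustration, not the method of [4]; the function names are hypothetical, and the bivariate normal probability is approximated by Simpson integration.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def bvn_cdf(a, b, rho, steps=2000):
    """P(Z1 < a, Z2 < b) for a standard bivariate normal with |rho| < 1."""
    lo = -8.0
    if a <= lo:
        return 0.0
    h = (a - lo) / steps  # Simpson's rule, steps must be even
    s = math.sqrt(1 - rho * rho)

    def f(z):
        return STD.pdf(z) * STD.cdf((b - rho * z) / s)

    total = f(lo) + f(a)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(lo + k * h)
    return total * h / 3


def adjusted_critical_value(rho, alpha=0.025, iters=50):
    """Common critical value c with 1 - P(Z1 < c, Z2 < c) = alpha (assumed form)."""
    lo = STD.inv_cdf(1 - alpha)      # unadjusted value: FWER is at least alpha
    hi = STD.inv_cdf(1 - alpha / 2)  # Bonferroni value: FWER is at most alpha
    for _ in range(iters):
        mid = (lo + hi) / 2
        if 1 - bvn_cdf(mid, mid, rho) > alpha:
            lo = mid  # still too liberal, raise the threshold
        else:
            hi = mid
    return (lo + hi) / 2


def prob_two_errors(c, rho):
    """P(Z1 > c, Z2 > c) = 1 - 2 * Phi(c) + P(Z1 < c, Z2 < c)."""
    return 1 - 2 * STD.cdf(c) + bvn_cdf(c, c, rho)


if __name__ == "__main__":
    z = STD.inv_cdf(0.975)
    for rho in (0.0, 0.2, 0.4):
        c = adjusted_critical_value(rho)
        print(rho, round(c, 4), prob_two_errors(z, rho), prob_two_errors(c, rho))
```

Bisection is used because the FWER is monotone decreasing in c, and the unadjusted and Bonferroni quantiles bracket the solution for any correlation strictly between −1 and 1.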
$$\text{Overall Power} = 1 - \int_{-\infty}^{Z_{\beta^*}} \int_{-\infty}^{Z_{\beta^*}} \Phi_C\big((z^*_{C1}, z^*_{C2})', r, p\big)\, dZ^*_{C1}\, dZ^*_{C2} \tag{6}$$

Scenario 2: Under the hybrid case, the overall power = $1 - P(Z^*_{H1} < Z_{\beta_1^*}, Z^*_{H2} < Z_{\beta_2^*})$, where $Z^*_{H1}$ is the same as $Z^*_{C1}$, and $Z^*_{H2}$ is centralized by $\frac{\Delta}{\sqrt{\frac{1}{n_{0A}} + \frac{1}{n}}}$. $Z_{\beta_1^*} = Z_{\beta^*}$ and $Z_{\beta_2^*}$ satisfies

$$Z_\alpha + Z_{\beta_2^*} = \sqrt{\frac{6R_A}{(4 - 2p + rp)(1 + R_A)}}\,(Z_\alpha + Z_\beta) \tag{7}$$

With these two definitions, the overall power is

$$\text{Overall Power} = 1 - \int_{-\infty}^{Z_{\beta^*}} \int_{-\infty}^{Z_{\beta_2^*}} \Phi_H\big((z^*_{H1}, z^*_{H2})', r, p\big)\, dZ^*_{H1}\, dZ^*_{H2} \tag{8}$$

Scenario 3: When all control data are used for both treatment arms, the overall power = $1 - P(Z^*_{A1} < Z_{\beta_2^*}, Z^*_{A2} < Z_{\beta_2^*})$, where $Z^*_{A1} = Z^*_{A2} = Z^*_{H2}$. More explicitly,

$$\text{Overall Power} = 1 - \int_{-\infty}^{Z_{\beta_2^*}} \int_{-\infty}^{Z_{\beta_2^*}} \Phi_A\big((z^*_{A1}, z^*_{A2})', r, p\big)\, dZ^*_{A1}\, dZ^*_{A2} \tag{9}$$

4.2. Overall power with FWER control

The overall power calculation under multiplicity adjustment is similar to that without multiplicity adjustment by replacing Zα with cm for m = C, H, and A in Eqs. (5) and (7) for obtaining $Z_{\beta^*}$ and $Z_{\beta_2^*}$ for the concurrent, hybrid, and all control case, respectively.

The first row of Fig. 5 shows the overall study power when β = 20% and r = 0.5, 1, 1.5, and 2 under all three control data use scenarios without multiplicity adjustment. The second row shows the same parameters for scenarios with multiplicity adjustment. The overall power is similar when p is small, and the divergence becomes conspicuous when overlap is larger than 0.6. For r = 0.5, since the randomization ratio is less than half for the control arm compared to the two separate 2-arm trials, the performance is not as good for all scenarios. The umbrella design is more powerful when r is in a practical range (i.e., from 1 to 2) and there is no need to control FWER. However, the design is less powerful if controlling FWER at the α level except for the all control case when p is small. Overall, the conservatism of multiplicity adjustment makes the design unworthy to consider in practice.

4.3. Optimal allocation ratio r for overlapping enrollment period

Assuming the portion of overlap p is given, the optimal allocation for the overlapping enrollment period is obtained by maximizing the overall study power for a fixed total sample size N. A straightforward approach is to minimize the sum of the variances of $\bar{X}_i - \bar{X}_0$, without considering their correlation or multiplicity adjustment, subject to the constraint of a fixed total sample size. The minimizer can be calculated directly for the concurrent and all control cases. This approach is not applicable for the hybrid case because of the complexity and asymmetry in the structure of its test statistics. The optimal allocation is

$$r^* = 1 + \frac{\sqrt{3-p} - 1}{p}$$

for p ∈ (0, 1] under the concurrent control case. For example, when p = 0.8, the optimal allocation ratio r* = 1.6. When p = 0, the second term of r* should be discarded, and the optimal r is 1 for separate two-arm trials. When p = 1, the optimal allocation is √2, which is the same as Dunnett’s suggestion under a 3-arm trial with no delayed entry. A more sophisticated approach is to incorporate the correlation between test statistics and multiplicity adjustment in the optimal allocation ratio calculation. There is no closed form derivation of the optimal allocation ratio if we also consider the correlation between test statistics and the multiplicity adjustment, although we can evaluate the optimal allocation ratio via simulations.

We evaluate the optimal allocation ratio via the closed form derivation $r^* = 1 + \frac{\sqrt{3-p} - 1}{p}$ as well as via simulations. Results from both approaches show that the optimal allocation ratio of the overlap period r is robust to the choices of N and β. Fig. 6 shows the overall power when β = 30%, 20%, and 10% under the concurrent control case without and with multiplicity adjustment, respectively. The first row represents the results without multiplicity adjustment; and the second row shows the results where FWER is controlled at α. The pattern is the same regardless of the choices of β, p, and whether multiplicity is adjusted. When p is relatively large, the optimal r for both with and without FWER control is around 2. For example, when p = 0.8, the optimal r is around 1.85. When p is small, say p = 0.2, the curves of the overall power are fairly flat, which indicates that different randomization ratios do not impact the overall power much for small p values. Choosing r = 1 for simplicity will not introduce a large power loss. Though the figure shows r values from 0 to over 5 to give a complete picture of the pattern of r for different p values, in practice, an r value less than 1 has a much lower overall power, and an r value over 2 would randomize more patients to the control arm than to the experimental arms compared to two separate two-arm studies. Thus, both cases are not ideal.
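The closed-form ratio of Section 4.3 can be reproduced with a short calculation. As a hedged sketch (not the authors' code), assume R_C = 1 − p + rp and a total sample size N = n(4 − 2p + rp); both are inferred from Eq. (7) rather than stated in this excerpt. The variance of each treatment-versus-concurrent-control difference is then proportional to (4 − 2p + rp)(1 + 1/R_C), which is minimized at R_C = √(3 − p), that is, at r* = 1 + (√(3 − p) − 1)/p. The Python code below checks this against a grid search. It also computes the overall power under the concurrent control scenario without multiplicity adjustment, with the effect size calibrated so that a reference design with N/3 patients per arm has power 1 − β (the calibration suggested by the factor 6 in Eq. (7)). All function names are illustrative.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def r_star(p):
    """Closed-form allocation ratio for the concurrent control case."""
    if p == 0:
        return 1.0  # second term discarded: separate two-arm trials
    return 1 + (math.sqrt(3 - p) - 1) / p


def variance_factor(r, p):
    """Quantity proportional to Var(mean_i - mean_0C) for fixed N (assumed form)."""
    r_c = 1 - p + r * p
    return (4 - 2 * p + r * p) * (1 + 1 / r_c)


def grid_argmin(p, lo=0.5, hi=3.0, step=0.001):
    """Grid search for the r that minimizes the variance factor."""
    n_points = int(round((hi - lo) / step)) + 1
    grid = [lo + k * step for k in range(n_points)]
    return min(grid, key=lambda r: variance_factor(r, p))


def bvn_cdf(a, b, rho, steps=2000):
    """P(Z1 < a, Z2 < b) for a standard bivariate normal with |rho| < 1."""
    lo = -8.0
    if a <= lo:
        return 0.0
    h = (a - lo) / steps  # Simpson's rule, steps must be even
    s = math.sqrt(1 - rho * rho)

    def f(z):
        return STD.pdf(z) * STD.cdf((b - rho * z) / s)

    total = f(lo) + f(a)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(lo + k * h)
    return total * h / 3


def overall_power_concurrent(r, p, alpha=0.025, beta=0.2):
    """P(at least one arm positive) when both arms are active, no FWER control."""
    r_c = 1 - p + r * p
    rho = r * p / (r_c * (1 + r_c))
    z_a = STD.inv_cdf(1 - alpha)
    z_b = STD.inv_cdf(1 - beta)
    # Mean of each test statistic under the alternative (assumed calibration).
    shift = math.sqrt(6 * r_c / ((4 - 2 * p + r * p) * (1 + r_c))) * (z_a + z_b)
    return 1 - bvn_cdf(z_a - shift, z_a - shift, rho)


if __name__ == "__main__":
    for p in (0.2, 0.5, 0.8, 1.0):
        print(p, round(r_star(p), 3), round(grid_argmin(p), 3))
    for r in (0.5, 1.0, 1.5, 2.0):
        print(r, round(overall_power_concurrent(r, 0.8), 4))
```

Because the correlation ρC also changes with r, the ratio that maximizes `overall_power_concurrent` need not coincide with `r_star(p)`, which is one way to see why the closed-form and simulation-based optima discussed in Section 4.3 can differ.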
5. Discussion
Fig. 6. Overall power comparison at β = 30 % , 20 % , and 10% under concurrent case without and with adjusting multiplicity.
administration of the same drug, multiplicity adjustment is required. Both approaches were considered in this article.

In the case where FWER control is not required, the FWER is smaller in the 3-arm umbrella trial than that of the two separate two-arm trials. The probability of making two false positive decisions under the global null is larger than that of the two separate two-arm studies, though the absolute value is small. If FWER is required to be controlled at α, the probability of making two errors under the global null is similar to that with two separate two-arm studies. The overall power increases as the enrollment of the second treatment arm starts earlier, i.e., with larger p, when only concurrent control data are used in the analysis, under the assumption of a fixed total sample size. The trend is reversed when non-concurrent control is included in the analysis. Moreover, adjusting for multiplicity and controlling FWER at the α level makes the design more conservative and generally unworthy in terms of power.

Assuming a fixed total sample size, the optimal allocation for the overlapping enrollment period is determined by maximizing the probability of detecting a treatment effect in at least one of the two experimental arms. Apparently, the optimal allocation is robust to the total sample size and the type II error with or without multiplicity adjustment. When r is in a practical range, i.e., from 1 to 2, it is important to emphasize that adding a treatment arm midway through a two-arm trial is better than running two separate trials in terms of patient resource, time, and logistics.

Throughout this article, the standardized test statistics are assumed to follow a normal distribution with a known population variance. When the primary endpoint is a time-to-event endpoint, modifications of the derivations are needed. For example, instead of evaluating the sample mean, the log hazard ratio follows a normal distribution. If we still assume the total sample size N is given, the overall study power will be driven by the number of events rather than the total sample size of patients. On the other hand, if the all control case is considered, the number of events in Arm 1 would be more than that in the concurrent control only case due to longer follow-up even with the same sample size. Further research will be done considering time-to-event endpoints. Moreover, the population variance is unknown. Even though the normality assumption is adequate especially in a confirmatory trial with a large sample size, inaccurate approximation of the population variance will affect the operating characteristics of the trial. While one may adapt the methods discussed by Grayling et al. to handle unknown variance [10], the general conclusion is not expected to change much.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests.

References

[1] B.M. Alexander, S. Ba, M.S. Berger, et al., Adaptive global innovative learning environment for glioblastoma: GBM AGILE, Clin. Cancer Res. 24 (4) (2019) 737–743.
[2] X. Bai, Q. Deng, D. Liu, Multiplicity issues for platform trials with a shared control arm, J. Biopharm. Stat. (2020), https://doi.org/10.1080/10543406.2020.1821703.
[3] A.D. Barker, et al., I-SPY 2: an adaptive breast cancer trial design in the setting of neoadjuvant chemotherapy, Clin. Pharmacol. & Ther. 86 (1) (2009) 97–100.
[4] M. Bennett, A.P. Mander, Designs for adding a treatment arm to an ongoing clinical trial, Trials (2020), https://doi.org/10.1186/s13063-020-4073-1.
[5] F. Bretz, F. Koenig, Commentary on Parker and Weir, Clinical Trials 17 (5) (2020) 567–569.
[6] D.R. Cohen, S. Todd, W.M. Gregory, et al., Adding a treatment arm to an ongoing clinical trial: a review of methodology and practice, Trials 16 (1) (2015) 1–9.
[7] O. Collignon, C. Gartner, A.B. Haidich, et al., Current statistical considerations and regulatory perspectives on the planning of confirmatory basket, umbrella, and platform trials, Clinical Pharmacology & Therapeutics 107 (5) (2020) 1059–1067.
[9] B. Freidlin, E.L. Korn, R. Gray, et al., Multi-arm clinical trials of new agents: some design considerations, Statistics in Clin. Cancer Res. (2008), https://doi.org/10.1158/1078-0432.CCR-08-0328.
[10] M.J. Grayling, J.M. Wason, A.P. Mander, An optimized multi-arm multi-stage clinical trial design for unknown variance, Contemporary Clin. Trials 67 (2018) 116–120.
[11] D.R. Howard, J.M. Brown, S. Todd, et al., Recommendations on multiple testing adjustment in multi-arm trials with a shared control group, Stat. Methods Med. Res. 27 (5) (2018) 1513–1530.
[12] K.M. Lee, J. Wason, Including non-concurrent control patients in the analysis of platform trials: is it worth it? BMC Med. Res. Methodol. 20 (1) (2020) 1–2.
[15] R.A. Parker, C.J. Weir, Non-adjustment for multiple testing in multi-arm trials of distinct treatments: rationale and justification, Clinical Trials 17 (5) (2020) 562–566.
[16] M.A. Proschan, D.A. Follmann, Multiple comparisons with control in a single experiment versus separate experiments: why do we feel differently? Am. Stat. 49 (2) (1995) 144–149.
[18] N. Stallard, S. Todd, D. Parashar, et al., On the need to adjust for multiplicity in confirmatory clinical trials with master protocols, Ann. Oncol. 30 (4) (2019) 506.
[19] F. Van Leth, et al., Comparison of first-line antiretroviral therapy with regimens including nevirapine, efavirenz, or both drugs, plus stavudine and lamivudine: a randomised open-label trial, the 2NN study, Lancet 363 (9417) (2004) 1253–1263.
[22] J. Woodcock, L.M. LaVange, Master protocols to study multiple therapies, multiple diseases, or both, N. Engl. J. Med. 377 (1) (2017) 62–70.
[24] M. Elias Laurin Meyer, P. Mesenbrink, C. Dunger-Baldauf, H.J. Fulle, E. Glimm, Y. Li, M. Posch, F. Konig, The Evolution of Master Protocol Clinical Trials Designs: A Systematic Literature Review, Clinical Therapeutics 42 (7) (2020) 1330–1360, https://doi.org/10.1016/j.clinthera.2020.05.010.