
PHARMACEUTICAL STATISTICS

Pharmaceut. Statist. 2007; 6: 283–296


Published online 23 October 2007 in Wiley InterScience
(www.interscience.wiley.com) DOI: 10.1002/pst.316

Biomarker as a classifier in
pharmacogenomics clinical trials: a
tribute to 30th anniversary of PSI‡,§
Sue-Jane Wang*,†
Office of Biostatistics, Office of Translational Sciences, Center for Drug Evaluation and
Research, US Food and Drug Administration, Silver Spring, MD, USA

Pharmacogenetics is one of many evolving sciences that have come to the fore since the formation of
the Statisticians in the Pharmaceutical Industry (PSI) 30 years ago. Following the completion of the
human genome project and the HapMap in the early 21st century, pharmacogenetics has gradually
focused on studies of whole-genome single-nucleotide-polymorphisms screening associating disease
pathophysiology with potential therapeutic interventions. Around this time, transcription profiling
aiming at similar objectives has also been actively pursued, known as pharmacogenomics. It has
become increasingly apparent that treatment effects between different genomic patient subsets can be
dissimilar, and the value and need for genomic biomarkers to help predict effects, particularly in
cancer clinical studies, have become issues of paramount importance. Pharmacogenomics/
pharmacogenetics has thus become intensely focused on the search for genomic biomarkers for use
as classifiers to select patients in randomized-controlled trials.
We highlight that the predictive utility of a genomic classifier has tremendous clinical appeal and
that there will be growing examples in which use of a companion diagnostic will need to be considered
and may become an integral part in the utilization of drugs in medical practice. The credible
mechanism to test the clinical utility of a genomic classifier is to employ the study results from a
prospective trial that recruits all patients. Such investigations, if well designed, will allow analysis of
all relevant performance factors in the drug and diagnostic combination including the sensitivity,
specificity, positive and negative predictive values of the diagnostic test and the efficacy of the
drug. Published in 2007 by John Wiley & Sons, Ltd.

*Correspondence to: Sue-Jane Wang, Office of Biostatistics, Office of Translational Sciences, Center for Drug Evaluation and
Research, US FDA, HFD-700, WO 22, Room 6216, 10903 New Hampshire Ave., Silver Spring, MD 20993, USA.
†E-mail: suejane.wang@fda.hhs.gov
‡The views expressed in this article are not necessarily those of the US Food and Drug Administration.
§This article is a US Government work and is in the public domain in the USA.


Keywords: adaptive vs guided design; companion diagnostic; prognostic-predictive; prospective/retrospective; clinical utility

1. INTRODUCTION

In the past 30 years, during which the important role of statistics in the pharmaceutical industry has
been well recognized, developments in science have rapidly evolved and have challenged the basic and
clinical tenets of pharmaceutical research in medicine. Clinical studies of single genes that have
a major effect on the action of a therapeutic drug are the study of pharmacogenetics entertained in
the late 1950s and early 1960s, coined by Vogel [1] and used by several others, e.g. Kalow [2]. With the
completion of the human genome project and the HapMap [3], pharmacogenetics as a field has
evolved into studies of whole-genome single-nucleotide polymorphisms (SNPs) for screening or
association with disease and/or therapeutics.
Studies that incorporate microarray data in preclinical, pharmacokinetic, pharmacodynamic
or clinical studies for analysis are often viewed as pharmacogenomic studies. Simon and Wang [4]
defined pharmacogenomics as the science of determining how the benefits and adverse effects
of a drug vary among a target population of patients based on genomic features of the patient’s
germ line and diseased tissue. For the purpose of this paper, we will use the general term ‘genomic’
or ‘pharmacogenomics’ to include molecular, genetic or pharmacogenetics where appropriate.
When there is an a priori defined potential genomic subset that is postulated to have a major impact
(efficacy or toxicity) from treatment, we consider it the positive genomic subset, denoted by g+.

2. GENOMIC BIOMARKER AS A CLASSIFIER

Application of genomics in clinical trials requires the use of genomic samples and knowledge of
genomic biomarker(s), e.g. [5]. With the availability of high-dimensional data, such as gene
expression studies [6,7] or whole-genome SNPs scan studies, e.g. [8], development of genomic
(composite) biomarkers [9] that link to treatment effect in drug development has become widespread
and of clinical practical interest. The goal of this interest is the eventual individualization of
medical treatment [6,7]. A genomic (composite) biomarker is generally developed by combining
many individual genes via a prediction algorithm to form a classifier that may predict patients’
therapeutic outcome or guide drug treatment. A genomic composite biomarker as a classifier has its
appeal over a single biomarker when it provides much higher sensitivity and specificity or when it
adds much value to the existing clinical criteria to classify patients before treatment intervention.
Establishment of a genomic classifier may be pursued using the traditional clinical trial phased
approach. In this paper, we discuss a learn-vs-confirm framework that involves flexible genomic
drug trial designs following the phases of genomic biomarker development for drug response [10].
Such a framework can be performed either prospectively or retrospectively depending on the
availability and quality of stored genomic biospecimens.

2.1. Prospective

The prospective development of a genomic biomarker used as a classifier for patient selection
may begin in early phase studies. The key question is ‘Is there a genomic biomarker present at
baseline that can serve to predict therapeutic effect?’ Before launching a late-phase registration
trial, exploration of early phase data seeks to identify promising genomic classifier(s) that may
predict inconsistent treatment effects between genomic patient subsets or across tumor types.
Depending on the clinical development program
and its range of study options, there could be a number of objectives for which a genomic
biomarker can be used.

2.1.1. Enrichment objective

When there is a reasonable body of biological evidence to support a specific hypothesis that a
therapeutic agent inhibits a specific molecular target, an enrichment design [11] may be pursued
to address the targeted g+ hypothesis, viz., H0: Δ+ = 0 vs H1: Δ+ > 0. The genomic biomarker in
an enrichment design is used to exclude patients, e.g. patients whose tumors underexpress the
HER2/Neu protein were excluded from the study of trastuzumab (Herceptin®) effect [12]. Enrichment
design requires an available diagnostic assay. To ensure generation of meaningful data, it is
imperative that the analytical validation of the diagnostic assay be established before initiating an
enriched design approach. For instance, the result of a diagnostic assay providing information on the
presence/absence of the CYP2C9 allele should match with the DNA sequence gold standard [13].
An analytically validated diagnostic assay allows accurate classification of patients’ genomic status.
Such a diagnostic assay, if used in the enriched trial, permits the randomized comparison for
treatment effect in the targeted g+ genomic subset.

2.1.2. Composite objective

When there is no biological plausibility or no established known drug target, there is generally
less confidence to exclude patients on the basis of a genomic biomarker; thus, the study of all
randomized patients is preferred. Below, we discuss prospective approaches that consider randomized
comparison. First, a composite objective, H0: Δ = 0 and Δ+ = 0 vs H1: Δ > 0 or Δ+ > 0, that new
treatment is effective in all randomized patients or in the g+ genomic subset can be tested in a fixed
design or an adaptive design setting [4,14–17]. These approaches provide two opportunities to
conclude a treatment effect; thus, they are subject to multiplicity adjustment to ensure proper control
of the probability of at least one false-positive conclusion.
Simon and Wang [4] adopt the alpha allocation principle of Moye and Deswal [18] without
accounting for the correlation, whereas the approach of Wang and Hung [14] accounts for
the correlation of the two test statistics in the context of a fixed design. Wang [15], Freidlin and
Simon [16] (FS) and Wang et al. [17] (WOH) consider a two-stage adaptive design. The utility of the
correlation is discussed [15,17], but incorporation of the correlation for multiplicity adjustment is
not a requirement.
In the approach by FS, Stage 1 data are used to evaluate if there is a genomic patient subset (g+)
much more likely to benefit from the treatment, and Stage 2 adaptively tests the treatment effect in
the identified g+ subset (H0: Δ+ = 0 vs H1: Δ+ > 0) using only Stage 2 data if data from both stages
fail to demonstrate an overall treatment effect (H0: Δ = 0 vs H1: Δ > 0) [16]. To prospectively test the
treatment effect in the g+ genomic subset, FS separates Stage 1 learning data to develop the
genomic classifier from Stage 2 test data.
In contrast, the development of the genomic classifier by the WOH approach is external to the
randomized-controlled trial; thus, the analysis uses g+ data in both stages [17]. The rationale of
patient adaptation is not to exclude the g− patient subset unless the g− subset is failing the treatment
due to futility or serious enough early safety concern at the end of Stage 1. This stringent rule
often permits evaluation of the clinical utility of the genomic classifier at the study end, H0: Δ = 0
and Δ+ = 0 vs H1: Δ > 0 or Δ+ > 0. In the rare case when the interim data strongly indicate that the
g− subset is failing the treatment, recruitment of the remaining patients in Stage 2 would consider
only the g+ subset. The final composite hypothesis is thus an enriched hypothesis for either
(H0: Δ+ = 0 vs H1: Δ+ > 0) or (H0: Δ = 0 vs H1: Δ > 0) in that the percentage of g+ patients is higher
than the estimated prevalence of g+ obtained from randomized patient inclusion. WOH showed that when
the genomic classifier is developed external to the well-controlled trial, the power gain for the g+
subset effect of the considered adaptive design as
compared with that of FS ranges from 20% to 40% if the utility of the classifier is predictive and
from 10% to 35% if the utility of the classifier is prognostic-predictive using an alpha allocation of
0.02 for Δ and 0.005 for Δ+; see Figures 4 and 5(A) of WOH [17]. The power gain for the g+ subset is
due to the availability of the g+ Stage 1 data, which were not used to learn the classifier, although
the use of these data is subject to multiplicity adjustment.

2.1.3. Separate subset objective

A design stratified by the genomic biomarker status that tests H0: Δ+ = 0 vs H1: Δ+ ≠ 0 and
H0: Δ− = 0 vs H1: Δ− ≠ 0 [4,19] separately essentially undertakes two independent clinical trials,
which may not be useful if the clinical utility of the biomarker is predictive, viz., Δ+ ≠ 0 and
Δ− = 0 in truth, to be illustrated in Section 3, Graphs (vi)–(viii) of Figure 1. The sample size for
the null effect in the g− subset will be very troublesome in placebo-controlled trials and will likely
be very large if a non-inferiority objective is required to show that the effect with the new treatment
is essentially no different from the comparator with a tight non-inferiority margin in
active-controlled trials.

2.1.4. Interaction objective

An interaction study objective tests the interaction hypothesis of whether treatment effect in the g+
subset is the same as that in the g− subset, H0: Δ+ = Δ−, vs that these effects differ, H1: Δ+ ≠ Δ−.
A significant interaction test result provides evidence that the magnitude of the treatment effect
differs between g+ and g−. But the interaction test does not address the overall hypothesis
(H0: Δ = 0 vs H1: Δ > 0) and the g+ subset hypothesis (H0: Δ+ = 0 vs H1: Δ+ > 0) directly.
Often, the composite hypothesis is addressed indirectly when one starts with the interaction
test. The test sequence to indirectly address the composite hypothesis consists of two branches.
That is, if the interaction test result does not reach statistical significance based on a pre-specified
significance level, one tests whether treatment effect exists in all the patients randomized
(Δ > 0). If, however, there is a significant treatment-by-biomarker interaction effect, one then
tests the g− and g+ subset hypotheses separately.
It is not clear what significance level the interaction test should use under the branch testing
scheme described above so as to maintain the control of the overall type I error rate [17,20]. In
addition, the interaction test generally requires a much larger sample size as compared with a study
sized for an overall effect [21]. For this reason, when the composite objective is of primary focus,
the interaction objective may not be preferred as the study still needs to address whether treatment
effect can be concluded in all patients or the g+ patient subset. On the other hand, when the
significance level for testing the interaction effect guarantees the type I error rate control, the
conservative interaction test approach due to larger sample size is likely to increase the power
of the study for concluding the alternative composite hypothesis.

2.1.5. Guided objective

The biomarker-guided design, e.g. [19,22], pursues a randomized comparison between the
biomarker-guided (or biomarker-based) arm and the non-biomarker-guided arm to test the hypothesis that
the treatment effect based on the guided approach (Δg) is superior to the non-guided approach (Δg′),
i.e. H0: Δ = 0 vs H1: Δ > 0, where Δ = Δg − Δg′.
The diagnostic assay is performed at study randomization for the biomarker-guided arm and
at the study end for the non-biomarker-guided arm. Thus, treatment assignment is guided by
biomarker status for those patients randomized to the guided arm and is non-random. In contrast,
treatment assignment is random for those patients randomized to the non-guided arm if more than
one treatment is to be studied. As shown in Table I, we discuss three variations in the
two-arm-guided design and those used in practice.
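As a rough illustration of the guided contrast Δ = Δg − Δg′, the sketch below takes each arm's effect to be a prevalence-weighted average response rate. Everything numeric here is hypothetical: the response rates in RATES, the g+ prevalence of 20%, the treatment labels T1 and T2, and the assumption that the non-guided arm randomizes 1:1 between T1 and T2. It is a sketch under those assumptions, not the method of any cited design.

```python
# Hypothetical response rates by (treatment, biomarker stratum); not from the paper.
RATES = {
    ("T1", "g+"): 0.60, ("T2", "g+"): 0.30,
    ("T1", "g-"): 0.40, ("T2", "g-"): 0.40,
}


def weighted(prevalence, rate_pos, rate_neg):
    """Prevalence-weighted average of a g+ rate and a g- rate."""
    return prevalence * rate_pos + (1.0 - prevalence) * rate_neg


def guided_contrast(prevalence, rates):
    """Return (guided arm rate, non-guided arm rate, their difference).

    Guided arm: g+ patients receive T1 and g- patients receive T2.
    Non-guided arm (assumed 1:1 randomization): each stratum gets the
    average of the T1 and T2 response rates.
    """
    guided = weighted(prevalence, rates[("T1", "g+")], rates[("T2", "g-")])
    non_guided_pos = 0.5 * (rates[("T1", "g+")] + rates[("T2", "g+")])
    non_guided_neg = 0.5 * (rates[("T1", "g-")] + rates[("T2", "g-")])
    non_guided = weighted(prevalence, non_guided_pos, non_guided_neg)
    return guided, non_guided, guided - non_guided


if __name__ == "__main__":
    g, ng, delta = guided_contrast(0.20, RATES)
    print(f"guided={g:.2f} non-guided={ng:.2f} delta={delta:.2f}")
```

With these made-up inputs the between-arm contrast is 0.03, although T1 and T2 differ by 0.30 within the g+ stratum, which shows how the arm-level comparison can dilute a stratum-specific effect.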


[Figure 1: eight panels, each plotting response rate (y-axis, 0 to 80) at P, T and Pool (x-axis)
for the g− and g+ subsets. Panels: (i) Null-Dx-factor, Null-T-effect; (ii) Null-Dx-factor, T-effect;
(iii) Dx-factor, Null-T-effect; (iv) Dx-factor, T-effect; (v) Dx-factor, T-effect (differential
T-effects); (vi) Null-Dx-factor, T-effect in g+ only; (vii) Dx-factor, T-effect in g+ only;
(viii) Dx-factor, T-effect in g+ only.]

Figure 1. Utility of genomic biomarker as a classifier in therapeutic drug trial.
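The kind of panel shown in Figure 1 can be summarized numerically. The sketch below is a hedged illustration only: the Panel class, the describe function and all response rates are hypothetical and are not read off the figure. It computes the subset effects Δ− and Δ+, calls the biomarker prognostic when the response rates pooled over P and T differ between g+ and g− (assuming equal allocation to P and T), and calls it predictive when Δ+ and Δ− differ. Following the usage in this paper, an interaction is labeled quantitative when both subset effects are non-zero with the same sign, and qualitative otherwise.

```python
from dataclasses import dataclass

TOL = 1e-9  # numerical tolerance for treating two rates as equal


@dataclass(frozen=True)
class Panel:
    """Response rates (as fractions) for placebo (p) and treatment (t) by subset."""
    p_neg: float
    t_neg: float
    p_pos: float
    t_pos: float


def describe(panel):
    """Summarize one hypothetical panel: subset effects and biomarker utility."""
    delta_neg = panel.t_neg - panel.p_neg
    delta_pos = panel.t_pos - panel.p_pos
    # Pooled over P and T within each subset, assuming 1:1 allocation.
    pool_neg = 0.5 * (panel.p_neg + panel.t_neg)
    pool_pos = 0.5 * (panel.p_pos + panel.t_pos)
    prognostic = abs(pool_pos - pool_neg) > TOL
    predictive = abs(delta_pos - delta_neg) > TOL
    same_sign = (delta_pos > TOL and delta_neg > TOL) or (
        delta_pos < -TOL and delta_neg < -TOL
    )
    if not predictive:
        interaction = "none"
    elif same_sign:
        interaction = "quantitative"
    else:
        interaction = "qualitative"
    return {
        "delta_neg": delta_neg,
        "delta_pos": delta_pos,
        "prognostic": prognostic,
        "predictive": predictive,
        "interaction": interaction,
    }


if __name__ == "__main__":
    print(describe(Panel(0.36, 0.36, 0.36, 0.36)))  # no biomarker or drug effect
    print(describe(Panel(0.36, 0.46, 0.50, 0.75)))  # differential effects
    print(describe(Panel(0.28, 0.28, 0.28, 0.50)))  # effect in g+ only
```

Note that under the pooled definition the third example is flagged as prognostic even though the placebo rates are equal in both subsets, because the treatment effect in g+ raises its pooled rate.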

When preliminary data indicate that administration of a new treatment or an (approved)
standard of care (SOC) is likely to result in irreversible toxicity or futile effect in the g+
genomic subset, (1a) considers exclusion of these g+ patients in the guided arm from comparison,
e.g. [23], see also Figure 2(C) of [22]. Using SOC as an example in the non-guided arm, the caveats to
(1a) include (i) the randomized balance between the SOC arm and the guided arm in the g+ subset
is lost, and (ii) if the prevalence of g+ is low, (1 − prevalence) × 100% of patients receive SOC in
both arms, and a superior biomarker-guided overall treatment-induced toxicity or efficacy

Table I. Randomization to biomarker or non-biomarker-guided treatment assignment.

Biomarker status           Biomarker-guided experimental arm    Non-biomarker-guided control arm
determined by the genomic  (assay performed at study            (assay performed at the study end)
diagnostic assay           randomization)
------------------------------------------------------------------------------------------------------
(1a) Exclusion of likely toxic/ineffective patients; (1b) Placebo
g+                         (1a) Excluded                        All receive SOC (or TRT)*
                           (1b) Placebo*
g−                         SOC (or TRT)
(2) Randomization balance preserved
g+                         T1                                   Receive T1 or T2 through randomization**
g−                         T2
------------------------------------------------------------------------------------------------------
* SOC, standard of care; TRT, treatment.
** T1, T2 are two different treatments.
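The allocation implied by design (2) of Table I can be tabulated with a few lines of arithmetic. This is a minimal sketch under stated assumptions: a g+ prevalence of 20% and 1:1 randomization between T1 and T2 in the non-guided arm. The function names allocation and discordant_fraction are hypothetical helpers introduced only for illustration.

```python
def allocation(prevalence, share_t1=0.5):
    """Share of each study arm falling in every (stratum, treatment) cell.

    Guided arm: g+ patients get T1 and g- patients get T2.
    Non-guided arm: patients are randomized to T1 with probability share_t1,
    independently of biomarker status.
    """
    guided = {("g+", "T1"): prevalence, ("g-", "T2"): 1.0 - prevalence}
    non_guided = {
        ("g+", "T1"): prevalence * share_t1,
        ("g+", "T2"): prevalence * (1.0 - share_t1),
        ("g-", "T1"): (1.0 - prevalence) * share_t1,
        ("g-", "T2"): (1.0 - prevalence) * (1.0 - share_t1),
    }
    return guided, non_guided


def discordant_fraction(prevalence, share_t1=0.5):
    """Per stratum, the fraction of non-guided patients whose treatment
    differs from what the guided arm would have assigned."""
    _, non_guided = allocation(prevalence, share_t1)
    return {
        "g+": non_guided[("g+", "T2")] / prevalence,
        "g-": non_guided[("g-", "T1")] / (1.0 - prevalence),
    }


if __name__ == "__main__":
    guided, non_guided = allocation(0.20)
    print("guided:", guided)
    print("non-guided:", non_guided)
    print("discordant:", discordant_fraction(0.20))
```

Under these assumptions only half of the non-guided patients in each stratum receive a treatment different from the guided assignment, so only that half contributes to the between-arm contrast.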

would rely on a highly predictive biomarker, thus, a conservative approach.
An improvement to (1a) is feasible if the blinding of the biomarker status is properly
controlled and not known by either the patient or the investigator. In these circumstances,
patients can be kept in the trial by receiving the placebo, i.e. (1b), and be followed up for the
treatment duration. With this modification, one can perform a randomized comparison of SOC vs
placebo in the g+ subset. If the SOC-related toxicity event rate is high and is very likely to be
guided by the positive biomarker status, this comparison can be very powerful with very few
g+ patients needed even though the prevalence of g+ may also be low. In addition, the
randomization balance is maintained. Although it may be argued that it is unethical to withhold
treatment from these g+ patients with other available drugs once their biomarker status is known,
the justification of whether it is unethical to withhold treatment should be based on the duration
and expected impact of such a therapeutic decision.
The bias issue can be handled by proper blinding. Note that the irreversible toxicity is yet
to be tested and, thus, should not have ethical consequence. SOC in the biomarker-guided design
of (1a) or (1b) would be applicable to evaluate the suspected biomarker-associated irreversible
toxicity. If SOC is any treatment, (1b) is essentially Figure 2(A) of [19]. The (1b) design can be
applied to test clinical utility for efficacy or safety. Such guided design may not be useful for
efficacy or safety assessment unless the biomarker is highly predictive of efficacy or safety. The
treatment effect is essentially assessed in the g+ patient subset alone, and the overall treatment
effect will be diluted by the equally performing g− patient subset, who receive the same treatment
in the guided and non-guided arms.
The guided design in (2) of Table I also preserves the randomization balance. In this
guided design, in principle, if there is no enriched selection of g+ or g− patients at study
randomization, the expected percentages of g+ vs g− patients between the guided (g) and
non-guided (g′) arms should be equal in probability [19], which preserves the randomization ratio.
Thus, if one tests the hypothesis H0: Δ = 0 vs H1: Δ > 0, where Δ = Δg − Δg′, and the null
hypothesis is rejected, then the treatment effect of the guided approach (a weighted average effect
of T1 in g+ and T2 in g−) being superior to that with the non-guided approach (a weighted average
effect of T1 and T2 without necessarily treating T1 in g+ and T2 in g−) can be determined.
Such guided design, however, is inefficient compared with the simple two-arm-randomized
biomarker-stratified design. To illustrate, suppose the g+ prevalence in the studied patient
population is 20%. In the guided design, the biomarker-guided study arm will have 20% of the
patients receive T1 and 80% of the patients receive T2 and, in the non-guided study arm, 10%
receives T1 and
10% receives T2 in the 20% g+ patients, while 40% receives T1 and 40% receives T2 in the 80%
g− patients. In other words, within each (positive vs negative) biomarker stratum, half of the
patients receive the same treatment and the other half receive opposite treatments. In such scenarios,
the treatment effect within each stratum effectively uses only half of the available patients for
comparison. The inefficiency is compounded in the comparison of the weighted effects between the
guided vs the non-guided arms, Δ = Δg − Δg′. If the treatment effect is reversed between strata, e.g.
a positive effect in g+ and a negative effect in g− patient subsets, the testing of treatment effect
within each stratum is also held back due to the use of half of the patients, and the overall effect
may be much smaller due to the offset by the positive and negative effects. The usefulness of
biomarker-guided designs discussed in Table I also relies on accurate classifiers exemplifying a
receiver operating characteristics curve with very high odds ratio, the ratio of the odds of
true-positive fraction to the odds of false-positive fraction, e.g. OR c36 [24].
In practice, other than the example by Hughes et al. [23], modification to the listed guided designs
of Table I is considered in the TAILORx trial [25] and the MINDACT trial [26]. The primary
comparison in the TAILORx trial is performed only in those patients who are in the intermediate
risk group and these patients are randomized to either the experimental (biomarker-guided) arm or
the SOC arm. Thus, there is no loss in the efficiency of the design for comparison in this
intermediate risk stratum. In the MINDACT trial, the primary comparison is in the discordant pairs
between the genomic risk and the clinical risk defined before treatment. With these two guided
designs, those patients who are at low risk (genomic low and clinical low) or high risk
(genomic high and clinical high) in the TAILORx (MINDACT) trial receive specific treatment not
based on randomization. We note that the genomic diagnostic assay based on the 70-gene
expression signature, known as MammaPrint® [27], has been approved by a regulatory agency.
Based on the analytic validation of the approved prognostic utility of MammaPrint, the predictive
clinical utility of the 70-gene signature used prospectively in the MINDACT controlled trial can be
further assessed with better precision.

2.2. Retrospective

An approved drug may have several competitors on the drug market, but its benefit–risk profile may
be much less desirable than its competitors [28]. Completed late phase clinical trials may fail to
demonstrate a treatment effect. Other retrospective situations include completed cohort clinical studies
that are not necessarily randomized and well controlled, a nested case control study within a
completed controlled trial or an ongoing study that investigates genomic exposure retrospectively.
These completed/ongoing clinical studies or clinical trials can be very useful. For instance, the
recent wake-up call of high failure rates in late-phase drug development [29] and the ability to
bank the genomic samples taken before clinical trial initiation have given the clinical trial
community the opportunity to perform flexible exploratory genomic drug trials [10,28]. In such cases,
development of a genomic biomarker using banked genomic samples from completed well-controlled
clinical trials individually or meta-analytically offers a logical tool for retrospective
exploration. The goal is to identify a genomic subset showing much larger treatment benefit than
the originally intended patient population if heterogeneity of therapeutic effect is at stake.
Retrospective development of a genomic biomarker is of genuine interest and often driven by
both time and cost considerations. Putting aside the genomic sample quality issues, during the
exploration stage, there can be several genomic biomarkers that are equally plausible and
promising for their predictive clinical utility derived from gene expression or whole-genome SNPs
scan data. This is because there are often hundreds or thousands of times more genes available
than patients studied. The genomic biomarker developed using the training data should be validated
in an independent test data set [30]. Indeed, exploration using
completed clinical trials, ongoing clinical trials or observational retrospective studies where archived
genomic biospecimens are accessible has become an emerging path for research and development in
drug development, e.g. [6–8,10,28].
There are several advantages of retrospective exploration.

* One maximizes accrual through a collection of all completed clinical studies.
* The right biomarker need not be known a priori.
* Retrospective evaluation allows exploration and refinement of the genomic biomarker with a
  much larger patient database than early prospective studies.
* Treatment effect can be assessed in the g+ and g− biomarker subsets.

There are some disadvantages.

* Not all randomized patients will consent to participate in the pharmacogenomics sub(study),
  resulting in convenience patient samples that can range from as low as 10%–15% to as
  high as 80%–85%.
* Methods of genomic sample collection and genomic sample handling may be suboptimal.
* Time factor may impact on the sample quality depending on the platforms available over time.
* Missing data in clinical sample and genomic sample, and the methods of imputation of these
  missing data can be challenging.

For exploratory purposes, however, these issues may be less of concern.

3. UTILITY OF GENOMIC BIOMARKER INVOLVING DRUG TREATMENT

In a typical conventional clinical trial, the study objective is to investigate whether an experimental
treatment is superior or non-inferior to its comparator. After treatment effect is demonstrated,
the impact of important baseline covariates on treatment effect is then assessed for a
consistent or an inconsistent effect detected by treatment-by-covariate interaction. Wang et al.
[17] presented three scenarios of clinical utility of a genomic biomarker in the setting of a
randomized-controlled trial:
Scenario A: predictive of drug effect;
Scenario B: prognostic of disease mechanism;
Scenario C: prognostic of disease state and predictive of drug effect.
Here, we exhaust all types of treatment effect outcomes in a two-arm placebo-controlled clinical
trial within which a genomic biomarker might be used as a classifier to determine the treatment
effect in a specific g+ genomic subset, see Graphs (i)–(viii) in Figure 1.
To illustrate, we use response rate as the primary endpoint. In the following graphs, disease
state is abbreviated as Dx and treatment is abbreviated as T. In Figure 1, (i) (neither genomic
biomarker nor drug treatment plays a role in disease pathophysiology and drug effect) and (ii)
(genomic biomarker does not impact on treatment effect) present the scenarios where the genomic
biomarker used as a classifier has no disease utility and no clinical utility. Thus, there is no
value of such a genomic biomarker in drug development. When a genomic biomarker has the ability to
predict differential clinical outcomes in a group of patients independent of their exposure to
treatment, all graphs except (i) and (ii) would satisfy the definition of prognostics based on the
analysis pooling across the treatment (T) and placebo (P) groups, see the response rates under the
‘pool’ x-axis.
To investigate if there is a treatment effect, clinical outcomes in patients from both T and P
groups are compared. Graph (iii) shows that the genomic biomarker is prognostic of disease
mechanism and has no impact on treatment (also Scenario B of [17]), and Graph (iv) depicts a
prognostic genomic biomarker for disease mechanism and the treatment effect is not impacted
by the biomarker status. For the purpose of a diagnostic assay development, the prognostic
utility of a diagnostic is generally not considered a high-risk product, e.g. [27]. For assessing
therapeutics, a prognostic biomarker can improve
the precision of a treatment effect estimate in a hypothesis based on a single well-defined genomic
linear model. (composite) biomarker in the patient population
Using a cancer trial as an example, if the studied. Second, the analytical validity of the
predictivity defines sensitivity of a tumor to a genomic diagnostic assay should have been estab-
distinct therapeutic agent, then Graphs (v)–(viii) lished to accurately classify patients for entry to
satisfy the definition of predictivity. For power the randomized clinical trial. In practice, the
consideration in designing a clinical trial, Wang interest to analytically validate the diagnostics
et al. [17] distinguish between quantitative treat- performance at the time of initiating a phase III
ment-by-biomarker interaction (Graph (v)) and well-controlled trial for the composite hypothesis
qualitative interaction (Graphs (vi)–(viii)). For may not be high, partly when the genomic
quantitative interaction, a genomic biomarker that biomarker or the clinical utility of the genomic
is prognostic of treatment effect, but with different magnitudes between the g+ and g− biomarker subsets, such as those shown in Graph (v), the biomarker is prognostic of disease state and predictive of differential treatment effects, where Δ+ > Δ− > 0. In such cases, the genomic biomarker results in differential treatment effects and its clinical utility is considered prognostic-predictive of drug effect (also Scenario C of [17]).

In contrast, as shown in Graphs (vi)–(viii), there are two distinct treatment effects (Δ− = 0 and Δ+ > 0): a null effect in the g− genomic subset and a positive treatment effect in the g+ genomic subset. Such treatment-by-biomarker qualitative interaction signifies the predictive nature of the genomic biomarker (also Scenario A of [17]). In the extreme case, the treatment effect can be inferior in the g− genomic subset, i.e. Δ− < 0 and Δ+ > 0. Whether the genomic classifier is based on a single gene or derived from multiple genes expressed by a well-defined risk score or prediction algorithm, the genomic (composite) biomarker for assessing drug effect can be defined as a measurable characteristic serving as an indicator for differential response or predictive of response to therapy, e.g. between patients diagnostically classified as g+ or g− [9].

4. CONFIRMATION OF CLINICAL UTILITY

Ultimately, for confirmatory purposes, two major components should be required. First, one pre-specifies a superiority hypothesis in an enriched patient population or a priori clinical composite

biomarker is still under exploration or may not be well established at trial initiation, e.g. [16]. Thus, the study results yield less informative data.

When the analytical validity of the diagnostic assay is not established and only the treatment effect in the g+ subset is concluded, those designs prospectively specify study objectives that can at best be considered a preliminary assessment of clinical utility. The prospective intent to study the clinical utility of a genomic classifier is congruent with scientific principles. In the trastuzumab trial [12], it would have been useful had an analytically validated HER2 diagnostic test been concurrently used in the drug trial. Fortunately, the study showed that Herceptin+chemotherapy is superior to chemotherapy alone in all breast cancer patients studied, the efficacy standard for a two-arm controlled trial. Thus, the concerns that analytical validation had not been established for the HER2 diagnostic test in the same trial were somewhat lessened when the effect was also observed in the HER2 3+ patient subset identified by means of the home-brew IHC kits developed in the research laboratory [4].

Although the specificity of a diagnostic test can be as low as 0.6, compared with the conventional untargeted design of a single one-size-fits-all hypothesis, the enrichment design generally has sufficient power with fewer patients in terms of number of patients randomized when the genomic classifier is predictive or prognostic-predictive [31]. However, in terms of number of patients screened, the untargeted design may not always be less efficient, especially when the genomic classifier is only prognostic-predictive [31]. The adaptive design considered by WOH increases the power

Published in 2007 by John Wiley & Sons, Ltd. Pharmaceut. Statist. 2007; 6: 283–296
DOI: 10.1002/pst

dramatically in selecting a sensitive patient subset if the genomic biomarker is predictive of therapeutics [17]. The biomarker-guided design that excludes the g+ patient subset relies on highly predictive biomarker classifiers to guide the treatment for safety or for efficacy, and can be improved to include the g+ subset in the guided arm for randomized comparison of SOC vs placebo within this g+ subset if the blinding of the genomic status is feasible. Such modification can increase the power for the g+ subset hypothesis using a two-stage adaptive design with a pre-specified composite hypothesis, e.g. [17]. The modification can also provide a reference comparison in the g+ subset when the prevalence of g+ is low. It is worthwhile to note that improvement in overall power and reduction in overall variability through these designs are also a function of the prevalence of g+ in addition to the usual design parameters for sizing the study.

The pre-specified hypothesis should be tested in an independent study not used for developing the genomic classifier. Ideally, such a study should be prospectively planned and executed in a well-controlled investigation separating early learning data from data for independent confirmation. Those designs described in Section 2.1 (fixed, adaptive or guided) that are pursued after the analytical validity has been established are the best candidates to confirm the clinical utility of the genomic classifier prospectively.

In recent years, there has been enthusiastic interest in the use of a prospective/retrospective framework. This framework allows one to prospectively assess the clinical utility using clinical data from completed randomized controlled clinical trials for which the genomic classifier is developed from yet other completed clinical studies or clinical trials described in Section 2.2. The major arguments of prospective/retrospective assessment include the following:

(1) The confirmatory study is completely independent of those studies used to develop the genomic classifier.
(2) The classifier status of the genomic biomarker is prospectively determined based on the blinded principle, i.e. those who determine the classification statuses are blinded to the treatment code [28].
(3) The clinical data including treatment assignment are retrospective.
(4) The genomic study on the classifier’s clinical utility is from a randomized well-controlled clinical trial, though completed.

It seems intuitive that when the quality of the banked genomic samples meets the standards of archived biospecimens, the clinical data from the prospective/retrospective study may be considered as good as the original clinical trial conducted. However, in clinical trial practice, two levels of consent are required for conducting a clinical trial and also studying the pharmacogenomics. Only those patients who also consented to genomic samples will qualify for the pharmacogenomic (sub)study. Among other issues (genomic sample quality control, standardization of the genomic diagnostic assay, missing genomic status), it is not clear whether randomized balance is preserved in these convenience genomic samples. The various bias issues and potential confounding effects well known from convenience cohort studies cannot be ignored. In addition, such completed clinical trials often are those that have failed their primary clinical effect prospectively investigated and are known to be negative studies. Such studies would be primarily exploratory for hypothesis generation, as the experimentwise type I error rate has already been spent and the negative study evidence has already been concluded. A genomic biomarker only identified/defined at the time of the interim analysis of a trial that embarked upon the composite objective at trial initiation suffers from the same set of issues.

If the clinical hypothesis can be completely absorbed by the genomic hypothesis within the same trial, it might be argued that the experimentwise type I error rate already spent for the clinical hypothesis can be claimed back and can be re-used for the genomic hypothesis test. It is not immediately obvious that such a strong assumption can be justified. It seems reasonable that the two hypotheses, clinical and genomic, are correlated.
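The point about re-using an already spent type I error rate can be illustrated with a minimal simulation sketch. It is not taken from the paper and rests on stated assumptions: a one-sided alpha of 0.025 for each test, normally distributed test statistics, independent g+ and g− subset statistics under a global null, and a hypothetical g+ subset carrying 30% of the statistical information, so that the all-patients statistic is Z_all = sqrt(f) Z_+ + sqrt(1 − f) Z_−. The function name `familywise_error` and all parameter values are illustrative only.

```python
import math
import random
from statistics import NormalDist


def familywise_error(info_fraction, alpha=0.025, n_sim=200_000, seed=2007):
    """Monte Carlo estimate of P(reject clinical OR genomic hypothesis)
    under a global null, when each is tested at the full one-sided alpha.

    info_fraction is the (hypothetical) share of statistical information
    contributed by the g+ subset; the correlation between the all-patients
    statistic and the g+ statistic is then sqrt(info_fraction).
    """
    if not 0.0 < info_fraction < 1.0:
        raise ValueError("info_fraction must lie strictly between 0 and 1")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    z_crit = NormalDist().inv_cdf(1.0 - alpha)  # one-sided critical value
    w_pos = math.sqrt(info_fraction)
    w_neg = math.sqrt(1.0 - info_fraction)

    rejections = 0
    for _ in range(n_sim):
        z_pos = rng.gauss(0.0, 1.0)  # g+ subset statistic under the null
        z_neg = rng.gauss(0.0, 1.0)  # g- subset statistic under the null
        z_all = w_pos * z_pos + w_neg * z_neg  # all-randomized-patients statistic
        if z_all > z_crit or z_pos > z_crit:
            rejections += 1
    return rejections / n_sim


if __name__ == "__main__":
    f = 0.30  # assumed information fraction of the g+ subset
    print(f"corr(Z_all, Z_+) = {math.sqrt(f):.3f}")
    print(f"estimated familywise error = {familywise_error(f):.4f}")
```

Under these assumptions the estimate should fall between alpha and 2 × alpha: the positive correlation between the two statistics reduces the inflation relative to independent tests, but testing the genomic hypothesis again at the full alpha still pushes the overall false-positive rate above the nominal level.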


Environmental factors are likely to play some roles in between the clinical and the genomic hypotheses.

One has to acknowledge that the clinical utility of predictivity is unlikely to be established from completed studies. The predictivity is about the future and not the past. Of note, if in truth the genomic biomarker used as a classifier is predictive of therapeutic effect as those shown in Graphs (vi)–(viii), the effect size is generally much larger than the conventionally identified treatment effect size planned [14,17]. Thus, only a small number of patients would be needed to prospectively assess the treatment effect in the g+ subset alone for confirmation. This is practically doable when the prevalence of g+ is low, e.g. n = 73 in the g+ subset corresponding to 8% of the randomized patients in [23].

5. SURROGATE ENDPOINT IN PREDICTIVE BIOMARKER FRAMEWORK

Proper validation of a candidate surrogate endpoint (S) [32] in therapeutic intervention trials that can substitute for the ultimate clinical outcome (T) requires more than a single trial, presumably a series of randomized clinical trials that have both S and T measured [33,34]. The Prentice criteria [35] have been viewed as too stringent, as the correlation between S and T and the associated treatment effect on S and on T must be established for a validated predictive surrogacy. In addition, Fleming [36] argues that one can rarely establish that surrogate endpoints are valid. Even in that rare setting in which data on treatment Z would allow one to view S as a valid surrogate for T, one cannot extrapolate this surrogacy to any new treatment Z* that could have mechanisms of action that differ from those of Z. Clearly, the timeframe for establishing a valid S from the series of clinical trials needed, where T is usually overall survival, can be overwhelming.

A major difficulty in establishing a validated S statistically arises when S, e.g. progression-free survival, shows a statistically significant association to treatment, but T, the clinical endpoint, say, overall survival in cancer trials, cannot consistently show a statistically significant benefit in a series of clinical trials. In such cases, the effect of treatment is shown only on S. If one were to classify patients into two groups, those who had longer time to progression vs those showing no difference in (or much shorter) time to progression, and if these two types of patients can be described by some important baseline characteristics such as genomic classifier or clinical classifier through post hoc exploration, there might be a plausible link between the genomic classifier and the therapeutics, e.g. [12].

It is vital that the utility of the baseline characteristics as a classifier so identified be prospectively tested and studied as those described in Section 2.1, e.g. the composite objective. With such assessment, the classifier is used to prospectively select patients for treatment that is independent of treatment intervention, and longer survival is likely observed in just one of the two subsets, e.g. hazard ratio of 0.72, p < 0.002 in the good prognostic group, and 1.03, p = 0.90 in the poor prognostic group. When this is the case, the estimated hazard reduction on overall survival (e.g. hazard ratio of 0.89, p = 0.27) will be diluted due to the differential survival benefit observed between the g+ and g− subsets and, thus, is likely to fail to reach statistical significance when the two subsets are combined for analysis.

In other words, the concept of identifying a genomic biomarker that is used to prospectively select sensitive patients can be a useful tool for the problem of a candidate predictive surrogacy. It is difficult to discern if a surrogacy is confounded by drug intervention. A pre-specified g+ genomic subset may at least predict one of the multiple causal pathways influencing the drug treatment. Turning a surrogate endpoint problem into a biomarker classifier problem can be much more appealing than a candidate surrogate validation approach, which can be much more difficult, requiring a series of clinical trials, and time consuming. If the predictive clinical utility of the genomic classifier exists, as long as the survival


benefit can be demonstrated in the g+ subset, even if the survival benefit in the g− subset may be much less or futile, the explored treatment effect in the g+ subset can be used to generate hypotheses for future trial planning. Unlike validation of a surrogate endpoint, this approach does not require a series of clinical trials to identify and confirm a predictive g+ genomic subset treatment effect. It is advisable to consider use of a composite hypothesis that states treatment effect exists in either all randomized patients or only the g+ patient subsets for prospective planning and confirmation.

6. DISCUSSION

This paper discusses how pharmacogenetics and pharmacogenomics have grown to influence how design and analysis of clinical trials used in the last 30 years can be improved. The genomic biomarker may be a single biomarker or a composite biomarker derived from, e.g. transcription profiling, whole-genome SNPs scan study.

To learn if a genomic biomarker used as a classifier can offer prognostic-predictive or predictive clinical utility, there are at least three steps.

* Development of a genomic biomarker can be achieved through a series of early prospective clinical studies or a series of retrospective analyses of completed clinical studies with available banked baseline genomic samples for meta-analytic exploration.
* Preliminary understanding of the clinical utility may be demonstrated from randomized well-controlled investigations. If the analytical validation of the genomic diagnostic assay has not been established, these studies serve to provide preliminary evidence of the genomic classifier’s clinical utility and the diagnostic assay’s performance characteristics where all biomarker-positive and biomarker-negative patients are included, e.g. [16,17].
* With an approved genomic diagnostic assay that is analytically validated and a hypothesized predictive clinical utility, one can put the clinical utility of the genomic classifier to the test within the same trial that consists of both biomarker-positive and -negative patients. The ability to assess the relevant characteristics in the drug trial and the diagnostic trial allows estimation of the performance characteristics, e.g. sensitivity, specificity, positive and negative predictive values, with better precision for drug-diagnostic co-development or a companion diagnostic development. Such an approach also allows adherence to the reporting standards for the drug trial [37] and the diagnostic trial [38] where applicable.

A prospective randomized well-controlled investigation is a much more credible standard for definitive inference that uses proper statistical methods for evaluation. This was acknowledged 30 years ago and remains the gold standard today. The scientific standard of substantial evidence derived from adequate and well-controlled investigations will remain. This includes the concepts of hypothesis testing, estimation, randomization and blinding that set the foundation of statistical principle established in the 1970s during the time of the formation of PSI, materialized in ICH E-9 [39] with further emphasis on comparative benefits and risks that should be optimized for patients for real world use.

In most cases, there is large uncertainty on the predictive utility of a genomic classifier due to little available knowledge about the drug target or all the possible causal pathways for a therapeutic effect. Assume the quality standard of the banked genomic biospecimens and standardization of the diagnostic assays are acceptable. The adaptive feature, if well designed, to prospectively select sensitive patients based on molecular or genomic biomarkers, which are used as a classifier, will likely increase the therapeutic index. This includes the proposal to turn the surrogate endpoint problem into a (genomic) classifier problem. These alternative designs have great potential to bring clinical trial research one step closer to personalized medicine [25,26,40]. For the next 30 years and more, as this research unfolds, it is envisioned that there will be plenty of opportunities for improvement to benefit/risk assessments from the conventional one-size-fits-all two-arm well-controlled comparison. More research on achieving development of drug and diagnostics to target those patients who benefit will continue to be an ongoing endeavor.

ACKNOWLEDGEMENTS

The author would like to thank Dr Steven Julious for the invitation and Dr Steve Gutman and the anonymous referees for their constructive comments during the preparation of this work. This research work was supported by the RSR funds #02-06, #04-06, #05-02 and #05-14 awarded by the Center for Drug Evaluation and Research, US Food and Drug Administration.

REFERENCES

1. Vogel F. Moderne probleme der humangenetic. Ergebnisse der Inneren Medizin und Kinderheilkunde 1959; 165:835–837.
2. Kalow W. Pharmacogenetics: heredity and the response to drugs. W.B. Saunders: Philadelphia, PA/London, 1962.
3. The International HapMap Consortium. A haplotype map of the human genome. Nature 2005; 437(7063):1299–1320.
4. Simon R, Wang SJ. Use of genomic signatures in therapeutics development in oncology and other diseases. The Pharmacogenomics Journal 2006; 6:166–173.
5. Gutman S, Kessler LG. The US Food and Drug Administration perspective on cancer biomarker development. Nature Reviews: Cancer 2006; 6:565–571.
6. van’t Veer LJ, Dai H, Vijver MJvd et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002; 415:530–536.
7. Paik S, Shak S, Tang G, Kim C, Baker J, Cronin M, Baehner FL, Walker MG, Watson D, Park T, Hiller W, Fisher ER, Wickerham DL, Bryant J, Wolmark N. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. New England Journal of Medicine 2004; 351:2817–2826.
8. The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007; 447:661–683.
9. Wang SJ. Utility of high dimensional genomic biomarkers in therapeutic and/or diagnostic development. The IEEE Conference Proceedings for the Fifth International Emerging Information Technology Conference, Taipei, Taiwan. DOI 10.1109/EITC.2005.1544330. 2005; 13–16.
10. Maruvada P, Srivastava S. Joint National Cancer Institute–Food and Drug Administration Workshop on research strategies, study designs, and statistical approaches to biomarker validation for cancer diagnosis and detection. Cancer Epidemiology, Biomarkers & Prevention 2006; 15(6):1078–1082.
11. Temple RJ. Enrichment designs: efficiency in development of cancer treatments. Journal of Clinical Oncology 2005; 23(22):4838–4839.
12. Baselga J. Herceptin alone or in combination with chemotherapy in the treatment of HER2-positive metastatic breast cancer: pivotal trials. Oncology 2001; 61(S2):14–21.
13. Jain KK. Applications of AmpliChip CYP450. Molecular Diagnostics 2005; 9(3):119–127.
14. Wang SJ, Hung HMJ. Trials in trials: alpha allocation strategy and sub-trial planning. The Proceedings of American Statistical Association, Biopharmaceutical Section [CD-ROM], American Statistical Association: Alexandria, VA, 2005.
15. Wang SJ. ‘Regulatory update on pharmacogenomics: an FDA perspective.’ Special Report in the First Multi-track DIA Workshop on ‘DIA Congress on the Development and Utilization of Pharmaceuticals moving towards eRegulation/Risk Management – Safety and Efficacy/Biostatistics’, Japan, 2005; 16–20.
16. Freidlin B, Simon R. Adaptive signature design: an adaptive clinical trial design for generating and prospectively testing a gene expression signature for sensitive patients. Clinical Cancer Research 2005; 11(21):7872–7878.
17. Wang SJ, O’Neill RT, Hung HMJ. Approaches to evaluation of treatment effect in randomized clinical trials with genomic subset. Pharmaceutical Statistics 2007. DOI: 10.1002/pst.300.
18. Moye LA, Deswal A. Trials within trials: confirmatory subgroup analyses in controlled clinical experiments. Controlled Clinical Trial 2001; 22:605–619.
19. Sargent DJ, Conley BA, Allegra C, Collette L. Clinical trial designs for predictive marker validation in cancer treatment trials. Journal of Clinical Oncology 2005; 23(9):2020–2027.
20. Simon R. Validation of pharmacogenomic biomarker classifiers for treatment selection. Cancer Biomarkers 2006; 2:89–96.
21. Lachenbruch PA. A note on sample size computation for testing interactions. Statistics in Medicine 1988; 7:467–469.
22. Pusztai L, Hess KR. Clinical trial design for microarray predictive marker discovery and assessment. Annals of Oncology 2004; 15(12):1731–1737.


23. Hughes S, Hughes A, Brothers C, Spreen W, Thorborn D. PREDICT-1 (CNA106030): the first powered, prospective trial of pharmacogenetic screening to reduce drug adverse events. Pharmaceutical Statistics 2007. DOI: 10.1002/pst.286.
24. Pepe MS. Evaluating technologies for classification and prediction in medicine. Statistics in Medicine 2005; 24:3687–3696.
25. Sparano JA. Current clinical trial – the TAILORx trial: individualized options for treatment. Community Oncology 2006; 3:494–496.
26. Bogaerts J, Cardoso F, Buyse M, Braga S, Loi S, Harrison JA, Bines J, Mook S, Decker N, Ravdin P, Therasse P, Rutgers E, van’t Veer LJ, Piccart M on behalf of the TRANSBIG consortium. Gene signature evaluation as a prognostic tool: challenges in the design of the MINDACT trial. Nature Clinical Practice Oncology 2006; 3(10):540–551.
27. FDA Clears the First In Vitro Diagnostic Multivariate Index Assay. Office of In Vitro Diagnostic Device Evaluation and Safety (OIVD) at the Food and Drug Administration (FDA), 6 February 2007. Available at: http://www.fda.gov/cdrh/oivd/news.html.
28. Wang SJ, Cohen N, Katz DA, Ruano G, Shaw P, Spear B. Retrospective validation of genomic biomarkers – what are the questions, challenges and strategies for developing useful relationships to clinical outcomes – workshop summary. The Pharmacogenomics Journal 2006; 6:82–88.
29. Kola I, Landis J. Can the pharmaceutical industry reduce attrition rates? Nature Review: Drug Discovery 2004; 3(8):711–715.
30. Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. Journal of National Cancer Institute 2003; 95:14–18.
31. Maitournam A, Simon R. On the efficiency of targeted clinical trials. Statistics in Medicine 2005; 24:329–339.
32. Biomarkers Definitions Working Group. Commentary: biomarkers and surrogate endpoints: preferred definitions and conceptual framework. Clinical Pharmacology and Therapeutics 2001; 69(3):89–95.
33. Fleming TR, DeMets DL. Surrogate end points in clinical trials: are we being misled? Annals of Internal Medicine 1996; 125(7):605–613.
34. Buyse M, Molenberghs G, Burzykowski T, Renard D, Geys H. The validation of surrogate endpoints in meta-analysis of randomized experiments. Biostatistics 2000; 1:49–67.
35. Prentice RL. Surrogate endpoints in clinical trials: definition and operational criteria. Statistics in Medicine 1989; 8:431–440.
36. Fleming TR. Evaluating therapeutic interventions: some issues and experiences (with discussion and rejoinder). Statistical Science 1992; 7:428–456.
37. Moher D, Schulz KF, Altman DG. The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomized trials. The Lancet 2001; 357(9263):1191–1194.
38. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, Lijmer JG, Moher D, Rennie D, de Vet HC. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Standards for reporting of diagnostic assay. Clinical Chemistry 2003; 49(1):1–6.
39. US Food and Drug Administration. Statistical principles for clinical trials (ICH E-9). International Conference on Harmonization. U.S. Food and Drug Administration, DHHS, February 1998.
40. Wang SJ. Genomic biomarker derived therapeutic effect in pharmacogenomics clinical trials: a biostatistics view of personalized medicine. Taiwan Clinical Trials 2006; 4:57–66.
