Professional Documents
Culture Documents
Mini-Review: Kathleen F. Kerr, Allison Meisner, Heather Thiessen-Philbrook, Steven G. Coca, and Chirag R. Parikh
Mini-Review: Kathleen F. Kerr, Allison Meisner, Heather Thiessen-Philbrook, Steven G. Coca, and Chirag R. Parikh
Abstract
The field of nephrology is actively involved in developing biomarkers and improving models for predicting
patients risks of AKI and CKD and their outcomes. However, some important aspects of evaluating biomarkers
and risk models are not widely appreciated, and statistical methods are still evolving. This review describes some
of the most important statistical concepts for this area of research and identifies common pitfalls. Particular
attention is paid to metrics proposed within the last 5 years for quantifying the incremental predictive value of a
new biomarker.
Clin J Am Soc Nephrol 9: 14881496, 2014. doi: 10.2215/CJN.10351013
Introduction
There has recently been a surge of interest in biomarkers throughout medicine, including nephrology. Biomarkers for AKI are particularly exciting for
their potential to overcome the limitations of serum
creatinine and improve risk prediction (1). Risk prediction is most valuable when it enables clinicians
to match the appropriate treatment to a patients
needs or when it allows public health systems to
allocate resources effectively. Risk prediction can
also be valuable in clinical research settings. For
example, a risk prediction model that identies patients at high risk for an adverse outcome could be
used for enrollment in a clinical trial for a preventive therapy.
The broad purpose of this article is to provide guidance for nephrology researchers interested in biomarkers
for a binary (dichotomous) outcome, acknowledging
that statistical methods in this eld continue to evolve.
Specic goals are (1) promoting good statistical practice, (2) identifying misconceptions, and (3) describing
metrics that quantify the prediction increment. We pay
particular attention to recent proposals that rely on
the concept of reclassication to evaluate new markers, especially net reclassication improvement (NRI)
statistics.
There are many algorithms for combining predictors into a classier or risk prediction model. We focus
on logistic regression because it is exible and relatively familiar to clinicians. The logistic model produces a
formula for combining predictor variables into a risk
score. For example, the Society of Thoracic Surgeons
(STS) score (3) combines patient data into a risk score
for dialysis after cardiac surgery (MI indicates myocardial infarction, and NYHA indicates New York
Heart Association):
STS risk score a b1 Age b2 Surgery Type
b3 Diabetes b4 MI Recent
b5 Race b6 Chronic Lung Disease
b7 Reoperation b8 NYHA Class
b9 Cardiogenic Shock
b10 Last SerumCreatinine
Data Example
1488
*Department of
Biostatistics, School of
Public Health,
University of
Washington, Seattle,
Washington; Division
of Nephrology,
Department of
Medicine, Western
University, London,
Ontario, Canada; and
Section of
Nephrology, Yale
University School of
Medicine, Veterans
Affairs Connecticut
Healthcare System,
and the Program of
Applied Translational
Research, New
Haven, Connecticut
Correspondence:
Dr. Chirag Parikh,
Section of
Nephrology, Yale
University School of
Medicine, 60 Temple
Street, Suite 6C, New
Haven, CT 06510.
Email: chirag.parikh@
yale.edu
1489
Figure 1. | Calibration plots allow the visual assessment of risk model calibration, a fundamental aspect of model validity. The diagonal line
represents the ideal scenario where the actual and predicted probabilities are equal. The triangles are estimates of the actual event rates for
groups with similar predicted probabilities. The solid line represents an estimate of the actual event rate for every predicted probability. The
hash marks along the horizontal axis show the distribution of predicted probabilities for the dataset. (A) Calibration plot for a well calibrated risk
model. Predicted probabilities and observed event rates are nearly identical. (B) Calibration plot for a risk model that tends to overestimate risks.
For example, the group of patients with predicted probability of about 0.4 has an event rate,20%. (C) A risk model that tends to underestimate
risks. (D) A risk model that is poorly calibrated in a haphazard way, sometimes overestimating risks and sometimes underestimating risks.
knowledge of the relationship between variables and outcomes to select predictors. Important considerations in
variable selection include the following:
c A candidate predictor should be clearly defined and measureable in a standardized way that can be reproduced in the
clinic (6).
c Variables that are challenging to collect in some patients
can be problematic (e.g., family history) for future applications of the risk model.
c It is usually disadvantageous to categorize continuous variables (1114). A U-shaped relationship between a predictor
and the outcome may call for sophisticated modeling; categorization is rarely adequate.
c The set of candidate predictors expands when one considers
transformations of predictors (e.g., marker M as well as log
[M]) and statistical interaction terms (such as M1M2).
c Sample size and the number of events limit the number of predictors that should be considered. A rule of thumb is that there
1490
Figure 2. | ROC curves show the range of true and false positive rates
afforded by a marker or risk model. The figure shows ROC curves for
two biomarkers. The numbers in the legend are the AUC values for
each marker. Serum BNP has higher AUC than urinary KIM-1, although urinary KIM-1 performs better at low false-positive rates. BNP,
B-type natriuretic peptide; KIM-1, kidney injury molecule-1.
Figure 3. | The prediction increment of a new marker is not a monotone function of its marginal strength. The baseline risk model is
modestly predictive on its own, with area under the receiver-operating
characteristic curve (AUC)=0.7. A new marker is added to aid prediction, and it is correlated with the baseline risk. The figure shows that
the prediction increment of the new marker can increase or decrease as
its individual strength increases. In the figure, the prediction increment is
measured by the increase in AUC. However, results are similar for other
popular measures of incremental value.
1491
with larger individual strength will have higher incremental value. In fact, a markers incremental value is generally
not an increasing function of its individual strength (16)
(Figure 3).
Evaluating the Prediction Increment
There is broad agreement that a new marker should be
judged in terms of its incremental value and not its
individual predictive strength (23), but there is no consensus on how incremental value should be measured. We
review traditional measures and newer proposals. We
end by applying all the measures to assess the incremental
value of urinary KIM-1.
TPR, FPR, AUC
The previous section describes the TPR and the FPR. For
evaluating the prediction increment of a new marker, one
can examine how these quantities change with the addition
of the new marker (e.g., DTPR and DFPR). Similarly, for
assessing the prediction increment of a new marker Y over
established predictors X, a common metric is the change in
AUC (DAUC). However, the shortcomings of AUC carry
over to DAUC. The DeLong test should not be used to test
the null hypothesis that DAUC=0 (24), although this is the
method implemented in most statistical software. In fact,
P values for DAUC are not necessary and should be avoided
(see below) (25). These issues have prompted interest in
alternative measures of the prediction increment.
Reclassification Percentage
A reclassication table cross-tabulates how patients fall
into risk categories under the baseline risk model that uses
the established predictors, and the expanded risk model
that additionally incorporates the new marker (Table 1).
The reclassication rate (RC) (26) is the proportion of patients in the off-diagonal cells of the reclassication table.
As a descriptive statistic, a small RC means that the marker
0%10%
.10%#25%
.25%
Total
Cases
Controls
0%10%
.10%#25%
.25%
Total
0%10%
.10%#25%
.25%
Total
9 (29.0)
0 (0.0)
0 (0.0)
9 (29.0)
0 (0.0)
4 (12.9)
2 (6.5)
6 (19.4)
0 (0.0)
4 (12.9)
12 (38.7)
16 (51.6)
9 (29.0)
8 (25.8)
14 (45.2)
31 (100)
364 (83.1)
18 (4.1)
1 (0.2)
383 (87.4)
15 (3.4)
25 (5.7)
1 (0.2)
41 (9.4)
0 (0.0)
5 (1.1)
9 (2.1)
14 (3.2)
379 (86.5)
48 (11.0)
11 (2.5)
438 (100)
Risk thresholds at 10% and 25% dene low-, medium-, high-risk categories. Baseline risk model is composed of cardiopulmonary bypass time
and postoperative serum creatinine. Expanded risk model is composed of baseline risk markers and urine kidney injury molecule-1 (KIM-1).
Performance measures were calculated from reclassication table. The reclassication rate (RC) is the proportion of the population in the offdiagonal cells in the reclassication table. About 19.4% of cases and 9.1% of controls are reclassied. Overall, 9.8% of the sample is reclassied, so
RC=9.8%. The three-category event net reclassication improvement (NRIev) is calculated using only the data on cases, summing the proportions in the upper diagonal and subtracting the proportions in the lower diagonal: (0+0+4)/312(0+0+2)/31=0.06, so NRIev=0.06. The threecategory nonevent net reclassication improvement (NRIne) is calculated using only the data on controls, summing the proportions in the lower
diagonal and subtracting the proportions in the upper diagonal: (18+1+1)/438(15+0+5)/438=0, so NRIne=0. RC, NRIe, and NRIne depend on
the number of risk categories and the specic thresholds used to dene those categories. The most important information in the table is in the
Total row and columns, which show how each risk model distributes cases and controls into risk categories. This information is also in Table 3.
1492
(1)
(2)
1493
Table 2. Measures of model performance and the prediction increment for kidney injury molecule-1 in the AKI dataset
Measure
Base Model
Expanded Model
KIM-1 Prediction
Increment
0.840a
0.215a
0.857a
0.245a
0.017a
0.030a,b
0.037b
0.035
0.002
0.408
0.226
0.183
0.277
0.051
0.312
0.049
0.695a
0.136a
0.691a
0.128a
20.004a,c
20.008a,d
0.072
0.009
0.000c
0.009d
0.429a
0.026a
0.493a
0.034a
0.064a,c
0.008a,d
0.028
0.058
0.065b
20.007c
0.098
0.065
0.065
0.000
Measures of model performance and the prediction increment for kidney injury molecule-1 (KIM-1) in the AKI data set. The base model
includes time on cardiopulmonary bypass and serum creatinine, and the expanded model additionally includes urinary KIM-1. For the
purposes of illustration, we report all measures discussed in the text, including measures that we do not recommend. AUC, area under
the receiver-operating characteristic curve; MRD, mean risk difference; IDI, integrated discrimation improvement; IDIe, event IDI;
.0
IDIne, nonevent IDI; NRI.0, category-free net reclassication improvement; NRI.0
e , category-free event NRI; NRIne , category-free
nonevent NRI; TPR, true-positive rate; FPR, false-positive rate; RC, reclassication rate.
a
Corrected for resubstitution bias using a bootstrapping technique (Supplemental Material).
b
DMRD is the same as IDI. In this example, the numbers are different because the MRD values are corrected for resubstitution bias; there
is no corresponding method for IDI statistics.
c
DTPR is the same as the two-category event NRI. Values in the table differ because DTPR is corrected for resubstitution bias; there is no
corresponding method for NRI statistics.
d
DFPR is equivalent to the two-category nonevent NRI. Values in the table differ because DFPR is corrected for resubstitution bias; there
is no corresponding method for NRI statistics.
screening tests for a serious condition for which a falsepositive result has minimal consequences, then model B is
superior to model A. Yet model A may be favored by some
metrics that ignore clinical consequences.
Net benet (NB) is a measure that incorporates information on clinical consequences, specically the relative
benet of correctly identifying disease and the cost
of a false-positive result (38).
NB PTP 2 PFPw
where P(TP) is the proportion of the population that is
true positive and P(FP) is the proportion that is false positive. The weight w is the benet of identifying a true-positive
result relative to the cost of a false-positive result. For
1494
Variable
0%10%
.10%#25%
.25%
Baseline Risk
Model (%)
Expanded Risk
Model (%)
Cases
Controls
Cases
Controls
29.0
25.8
45.2
86.5
11.0
2.5
29.0
19.4
51.6
87.4
9.4
3.2
Both risk models place 29% of cases into the low-risk category, but
the expanded risk model places more cases into the highest-risk
category. This suggests that the expanded risk model does a better
job at identifying cases as high risk. However, the expanded risk
model also places more controls into the highest-risk category. Because the controls are most of the population, this could be a serious
drawback for the expanded risk model. Note that this table extracts
the information from the Total row and columns of Table 1.
Summary
Statisticians, epidemiologists, and clinicians currently
struggle to reach consensus on best practices for developing risk prediction models and assessing new markers.
Table 4 summarizes most of the guidelines discussed in
this paper.
Acknowledgments
The research was supported by National Institutes of Health
(NIH) grant RO1HL085757 (C.R.P.) to fund the TRIBE-AKI Consortium to study novel biomarkers of AKI after cardiac surgery. C.R.P.
is also supported by NIH grant K24DK090203. S.G.C. is supported by
NIH grants K23DK080132 and R01DK096549. S.G.C. and C.R.P. are
also members of the NIH-sponsored ASsess, Serial Evaluation, and
Subsequent Sequelae in Acute Kidney Injury (ASSESS-AKI) Consortium
(U01DK082185).
Disclosures
None.
References
1. Vanmassenhove J, Vanholder R, Nagler E, Van Biesen W: Urinary
and serum biomarkers for the diagnosis of acute kidney injury: An
in-depth review of the literature. Nephrol Dial Transplant 28:
254273, 2013
2. Parikh CR, Coca SG, Thiessen-Philbrook H, Shlipak MG, Koyner
JL, Wang Z, Edelstein CL, Devarajan P, Patel UD, Zappitelli M,
Krawczeski CD, Passik CS, Swaminathan M, Garg AX,
Consortium T-A; TRIBE-AKI Consortium: Postoperative biomarkers predict acute kidney injury and poor outcomes after
adult cardiac surgery. J Am Soc Nephrol 22: 17481757, 2011
3. Mehta RH, Grab JD, OBrien SM, Bridges CR, Gammie JS, Haan
CK, Ferguson TB, Peterson ED; Society of Thoracic Surgeons
National Cardiac Surgery Database Investigators: Bedside tool
for predicting the risk of postoperative dialysis in patients undergoing cardiac surgery. Circulation 114: 22082216, quiz
2208, 2006
4. Bellomo R, Ronco C, Kellum JA, Mehta RL, Palevsky P; Acute
Dialysis Quality Initiative workgroup: Acute renal failure
definition, outcome measures, animal models, fluid therapy and
information technology needs: The Second International Consensus Conference of the Acute Dialysis Quality Initiative (ADQI)
Group. Crit Care 8: R204R212, 2004
5. Mehta RL, Kellum JA, Shah SV, Molitoris BA, Ronco C, Warnock
DG, Levin A, Network AKI; Acute Kidney Injury Network: Acute
Kidney Injury Network: Report of an initiative to improve outcomes in acute kidney injury. Crit Care 11: R31, 2007
6. Moons KG, Kengne AP, Woodward M, Royston P, Vergouwe Y,
Altman DG, Grobbee DE: Risk prediction models: I. Development, internal validation, and assessing the incremental
value of a new (bio)marker. Heart 98: 683690, 2012
7. Derksen S, Keselman H: Backward, forward, and stepwise automated subset-selection algorithms - frequency of obtaining
authentic and noise variables. Br J Math Stat Psychol 45: 265
282, 1992
8. Austin PC, Tu JV: Automated variable selection methods for logistic regression produced unstable models for predicting acute
myocardial infarction mortality. J Clin Epidemiol 57: 11381146,
2004
9. Steyerberg EW, Eijkemans MJ, Habbema JD: Stepwise selection
in small data sets: A simulation study of bias in logistic regression
analysis. J Clin Epidemiol 52: 935942, 1999
10. Harrell FEJ: Regression Modeling Strategies, New York, Springer,
2001
11. Altman DG, Royston P: The cost of dichotomising continuous
variables. BMJ 332: 1080, 2006
12. Naggara O, Raymond J, Guilbert F, Roy D, Weill A, Altman DG:
Analysis by categorizing or dichotomizing continuous variables
is inadvisable: An example from the natural history of unruptured
aneurysms. AJNR Am J Neuroradiol 32: 437440, 2011
13. Frslie KF, Rislien J, Laake P, Henriksen T, Qvigstad E, Veierd
MB: Categorisation of continuous exposure variables revisited. A
response to the Hyperglycaemia and Adverse Pregnancy Outcome (HAPO) Study. BMC Med Res Methodol 10: 103, 2010
14. Royston P, Altman DG, Sauerbrei W: Dichotomizing continuous predictors in multiple regression: A bad idea. Stat Med 25: 127141, 2006
15. Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR: A
simulation study of the number of events per variable in logistic
regression analysis. J Clin Epidemiol 49: 13731379, 1996
1495
1496
35. Pickering JW, Endre ZH: New metrics for assessing diagnostic
potential of candidate biomarkers. Clin J Am Soc Nephrol 7:
13551364, 2012
36. Kerr KF, McClelland RL, Brown ER, Lumley T: Evaluating the incremental value of new biomarkers with integrated discrimination improvement. Am J Epidemiol 174: 364374, 2011
37. Uno H, Tian L, Cai T, Kohane IS, Wei LJ: A unified inference
procedure for a class of measures to assess improvement in risk
prediction systems with survival data. Stat Med 32: 24302442,
2013