Professional Documents
Culture Documents
Using Contextual Learning To Improve Diagnostic Accuracy Application in Breast Cancer Screening
Using Contextual Learning To Improve Diagnostic Accuracy Application in Breast Cancer Screening
3, MAY 2016
Abstract—Clinicians need to routinely make management deci- result in most cases, room for improvement exists in cases where
sions about patients who are at risk for a disease such as breast discerning the difference between a benign or malignant mass is
cancer. This paper presents a novel clinical decision support tool difficult [5]–[7]. CDS tools may provide better diagnostic rec-
that is capable of helping physicians make diagnostic decisions.
We apply this support system to improve the specificity of breast ommendations in these cases by exploiting past knowledge of
cancer screening and diagnosis. The system utilizes clinical context prior cases and their outcomes. Third, CDS tools help reduce
(e.g., demographics, medical history) to minimize the false positive fluctuations in diagnostic performance due to human factors
rates while avoiding false negatives. An online contextual learning (e.g., fatigue, distraction), offering consistent recommendations.
algorithm is used to update the diagnostic strategy presented to the Nevertheless, while these tools have been widely advocated to
physicians over time. We analytically evaluate the diagnostic per-
formance loss of the proposed algorithm, in which the true patient improve the diagnostic performance of clinicians, their adoption
distribution is not known and needs to be learned, as compared has remained limited given the following reasons: 1) the need to
with the optimal strategy where all information is assumed known, train current CDS tools using a fixed, predefined set of training
and prove that the false positive rate of the proposed learning al- cases; 2) the challenge of learning relevant features from high-
gorithm asymptotically converges to the optimum. In addition, our dimensional datasets; and 3) the inability to convey uncertainty
algorithm also has the important merit that it can provide individ-
ualized confidence estimates about the accuracy of the diagnosis that may be associated with a recommendation.
recommendation. Moreover, the relevancy of contextual features To address these challenges, we present an online algorithm
is assessed, enabling the approach to identify specific contextual for generating diagnostic recommendations to physicians by
features that provide the most value of information in reducing leveraging retrospective cases in the electronic health record
diagnostic errors. Experiments were conducted using patient data (EHR) to reduce the FPR of diagnosis given a prescribed false
collected at a large academic medical center. Our proposed ap-
proach outperforms the current clinical practice by 36% in terms negative rate (FNR). We demonstrate our approach in the do-
of false positive rate given a 2% false negative rate. main of breast cancer screening and diagnosis, since breast
cancer is a common cancer among women with an estimated
Index Terms—Breast cancer, computer-aided diagnosis system,
contextual learning, multiarmed bandit (MAB), online learning.
232 670 new cases among women in the U.S. in 2014 [8],
[9]. The proposed CDS tool is designed to aid physicians with
I. INTRODUCTION making management decisions particularly in borderline cases.
Radiological assessment of breast images are categorized using
LINICAL decision support (CDS) tools help clinicians
C make detection and diagnostic decisions for complex dis-
eases such as lung cancer [1], breast cancer [2], [3], and diabetes
the Breast Imaging Report and Data System (BI-RADS) score.
BI-RADS scores of 3 or 4 represent borderline cases associated
with short interval followup or biopsy, respectively. Presently,
[4]. There are a number of advantages to integrate CDS tools
many benign cases are being classified as BI-RADS 4A, which
as part of the clinical workflow instead of solely relying on hu-
has raised the concern of overdiagnosis.
man intuition. First, the diagnostic accuracy of clinicians varies
To improve the determination of whether a patient should
widely. A previous study has shown that false positive rates
be assigned as BI-RADS 3 or 4A, we explicitly consider the
(FPRs) for breast cancer detection range from 2.6% to 15.9%
contextual information of the patient (also known as situational
among different radiologists and younger and more recently
information) that affects diagnostic errors for breast cancer. The
trained radiologists have higher FPRs than experienced radiolo-
contextual information is captured as the current state of a pa-
gists [5]; the deployment of CDS tools may reduce this variabil-
tient, including demographics (age, disease history, etc.), the
ity. Second, although clinicians provide the correct diagnostic
breast density (based on the BI-RADS breast density scale), the
assessment history, whether the opposite breast has been diag-
Manuscript received August 24, 2014; revised January 6, 2015; accepted nosed with a mass, and the imaging modality that was used to
March 14, 2015. Date of publication March 20, 2015; date of current version
May 9, 2016. This work was supported by the U.S. Air Force Office of Scientific
provide the imaging data. We hypothesize that the incorpora-
Research under the DDDAS program. tion of contextual information will help provide more specific
L. Song, J. Xu, and M. van der Schaar are with the Department of Electrical personalized diagnostic recommendations to patients.
Engineering, University of California, Los Angeles, Los Angeles, CA 90095
USA (e-mail: songlinqi@ucla.edu; jiexu@ucla.edu; mihaela@ee.ucla.edu).
The rest of the paper is organized as follows. Section II dis-
W. Hsu is with the Department of Radiological Sciences, Univer- cusses related works. In Section III, we describe the system
sity of California, Los Angeles, Los Angeles, CA 90024 USA (e-mail: model and formulate the design problem. Section IV presents
willhsu@mii.ucla.edu).
Color versions of one or more of the figures in this paper are available online
a systematic methodology for determining the optimal diag-
at http://ieeexplore.ieee.org. nostic recommendation strategy. Section V discusses practical
Digital Object Identifier 10.1109/JBHI.2015.2414934 issues related to the system: relevant context selection, prior
2168-2194 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications standards/publications/rights/index.html for more information.
SONG et al.: USING CONTEXTUAL LEARNING TO IMPROVE DIAGNOSTIC ACCURACY: APPLICATION IN BREAST CANCER SCREENING 903
1: for t = 1, 2, ...do
D. Diagnostic Recommendation Problem 2: Context x t arrives. Find the active cluster C , such that x t ∈ C .
3: Determine the threshold σ t :
Based on the given CABCDS system, our design goal is to σ t = threshold determination(A, M A , σ̄ A , t, η ).
propose a recommendation algorithm that minimizes the FPR 4: Output the recommendation strategy:
π t (C ) = 0, if σ̄ C < σ t , and π t (C ) = 1, otherwise.
given a tolerable FNR η (e.g., < 2%). The tradeoff between FPR 5: The outcome s t of the patient is revealed.
and FNR can be specified by a physician or by an institution. 6: Update patient case counter for active cluster C :M C = M C + 1.
Therefore, the diagnostic recommendation problem is formally 7: Update the estimation: σ̄ C = # p o s i t i vMe c a s e s i n C .
C
8: if M C ≥ p(2 −l ) then Context space refinement
written as
9: Update the set of active clusters:
minimize FPR (A, M A , σ̄ A ) = context_space_refinement(A, M A , σ̄ A , C )
(1)
subject to FNR ≤ η 10: end if
11: end for
threshold_determination(A, M A , σ̄ A , t, η )
IV. DIAGNOSTIC RECOMMENDATION ALGORITHM Input: active cluster set A, patient case counter M A , estimation of
The main idea of our diagnostic recommendation approach is probability of malignancy σ̄ A , the FNR tolerance η .
Output: the optimal threshold σ
to adaptively cluster patients based on related contexts and then
1: Initialize the estimation of probability of malignancy σ = 1 and the
learn the best action for each patient cluster. FNR estimation μ̄ 1 = 1.
2: while μ̄ 1 > η do
A. Structure of the Optimal Strategy 3: σ = σ − t −1 .
4: Calculate the FNR for threshold σ :
Σ C :σ C ≤M C
In order to solve the diagnostic recommendation problem, μ̄ 1 = max{1 , t −1 }
.
we first analyze the structure of the optimal solution where 5: end while
6: return σ .
all information (i.e., the distribution of context f (x) and the
probability of being malignant σ(x)) is known. We observe that context_space_refinement(A, M A , σ̄ A , C )
Input: active cluster set A, patient case counter M A , estimation of
the underlying probability σ(x) varies for different contexts x, probability of malignancy σ̄ A , the cluster C to be split.
and hence, the solution is to recommend a biopsy for patients Output: the updated A, M A , σ̄ A
with a sufficiently high probability of having a malignancy, 1: Split the cluster C with size 2 −l into 2 d X subclusters with size
and to recommend a short interval followup for patients with a 2 −( l + 1 ) .
2: Remove cluster C from the active cluster set A, and add the 2 d X
sufficiently low probability.
subclusters into the active cluster set A.
Proposition 1: The optimal strategy π ∗ (x) for the diagnostic 3: Initialize the patient case counter and estimation of probability of
recommendation problem in (1) is a threshold strategy: there malignancy for each subcluster C :
M C = # patien t cases in C .
exists a threshold ση , such that the optimal strategy satisfies # positiv e ca ses in C
π ∗ (x) = 1, if σ(x) ≥ ση , and π ∗ (x) = 0, otherwise.
σ̄ C = MC ;
4: return A, M A , σ̄ A .
The intuition of this proposition is as follows: performing a
biopsy when the probability of being malignant is low induces
a high FPR. Conversely, performing a short-interval followup
patients within a cluster becomes available, our algorithm adap-
when the probability of being malignant is high induces a high
tively shrinks the cluster size, thereby allowing more specific
FNR.
recommendations to be made.
Proof: See Appendix A.
The proposed algorithm maintains a set of disjoint clusters
Note that the context distribution f (x) and outcome distri-
(referred to as “active clusters”) that cover the whole context
bution σ(x) are not known in practice. As such, the algorithm
space. For each active cluster C, the algorithm maintains an
needs to learn the distribution of contexts and outcomes.
estimate of the probability of a patient in this cluster of hav-
ing a malignant tumor σ̄C . Based on these estimates of all
B. Description of the Proposed Learning Algorithm active clusters, the algorithm finds the optimal threshold σt
Section III emphasized the need to uniquely characterize pa- determined in Proposition 1 and the corresponding recommen-
tients using contexts. However, no two patients are exactly the dation strategy for each cluster such that the FPR is minimized
same. Knowledge from diagnosis and treatment of former pa- provided that the FNR is below the given tolerance η. After the
tients can only be transferred to the present/future patient by patient outcome is revealed, the algorithm updates the estimate
recognizing and exploiting similarities among patients. Our pro- of the probability of a patient in cluster C of being malignant
posed algorithm uses patients with similar contexts to accumu- σ̄C . If the number of patient cases MC belonging to a cer-
late information about a new patient and make recommendations tain cluster C exceeds a certain threshold, the algorithm splits
based on the accumulated knowledge. As knowledge about more the cluster into smaller clusters in order to make more specific
906 IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 20, NO. 3, MAY 2016
defined as
T
Rπ (T ) = [Ec(π t (xt ), s(xt )) − Ec(π ∗ (xt ), s(xt ))].
t=1
(4)
We have the following theorem to bound this regret.
Theorem 2: The ASR of the fixed threshold-based DRA al-
gorithm is bounded by R(T ) = O(T g (d X ) ), and the IPR for
patient t is bounded by r(t) = O(tg (d X )−1 ).
Proof: See Appendix D.
The regret in terms of weighted error of the fixed threshold-
based DRA algorithm has the same sublinear order as the regret
in terms of FPR of the DRA algorithm. This finding implies
Fig. 5. Comparison of FPR for different algorithms, given tolerable that the fixed threshold-based DRA algorithm will converge to
FNR = 5%.
the optimal diagnostic strategy, summarized by the following
corollary.
Corollary 2: The performance of the fixed threshold-based
DRA algorithm converges to the optimal performance, in terms
of the expected weighted error in (4).
In addition, based on Theorem 2, we can see that the conver-
gence speed is fast (sublinear in the number of received patient
cases).
V. PRACTICAL CONSIDERATIONS
In this section, we discuss some practical issues related to
the system implementation and give appropriate approaches to
address these issues.
Fig. 6. Comparison of FPR for different algorithms, given tolerable
FNR = 2%.
A. Relevant Context Analysis
While large amounts of patient data are routinely captured
situation, the DRA algorithm degrades to a fixed threshold- in the EHR as part of clinical care, some information is inher-
based algorithm that does not need to learn the threshold value ently more relevant to assessing the probability of breast cancer
over time. Hence, step 5 of the DRA algorithm can be omitted, than others. In an online learning setting, identifying which
and the algorithm only needs to learn the distribution of patient contextual information is more relevant to making a clinical
outcomes σ(x). We consider a weighted false positive and false recommendation based on retrospective data is important. For
negative error c(π(x), s(x)) in this setting: the dX -dimensional context space X , the probability of making
c(π(x), s(x)) = ση ∗ false positive errors diagnostic errors may be correlated with missing or noisy data.
For example, the chance of making a diagnostic error may be
+(1 − ση ) ∗ false negative errors. (2) low when results of a molecular assay are available, but the
The strategy for minimizing the expectation of the weighted chance of making a diagnostic error may be high when only
error c(π(x), s(x)) is obtained by information about a patient’s previous BI-RADS assessment is
known.
min Ec(π(x), s(x)). (3) For a dX -dimensional context space, 2d X DRA learning in-
π ∈Π
stances can be executed at the same time. At time t, the average
The optimal strategy is denoted by π † (x), which assumes all FPRs of all the learning instances are evaluated. We denote
information is known, leading to the following proposition: by instance 1 the learning instance using all dX -dimensional
Proposition 2: The optimal strategy π † (x) is equivalent to contextual information. If for another learning instance i, the
the optimal strategy π ∗ (x) for the same ση . difference of FPR compared with instance 1 is below some
Proof: Appendix C. level given by physicians, then the contextual information used
Intuitively, Proposition 2 shows that the optimal weighted er- by learning instance i is relevant. Hence, the system can select
ror minimization strategy is equivalent to the optimal threshold- the more relevant context to make diagnostic recommendations,
based strategy. Hence, we define regret as the difference in the as shown in Fig. 4. From Theorem 1, when less contextual in-
weighted error, comparing our fixed threshold-based DRA al- formation is used, the convergence rate improves. This property
gorithm to the optimal strategy π ∗ (x). Formally, the ASR is is demonstrated in practice using actual data in Section VI-C.
908 IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 20, NO. 3, MAY 2016
TABLE VI 39% and 36% for η = 5% and η = 2%, respectively, the LDA
FPR AND FNR COMPARISON
approach in terms of the FPR by 34% and 30% for η = 5% and
η = 2%, respectively, and the neural-fuzzy approach in terms of
Algorithms η = 5% η = 2%
the FPR by 14% and 12% for η = 5% and η = 2%, respectively.
FPR FNR Accuracy FPR FNR Accuracy
DRA (proposed) 61.0% 4.4% 0.45 64.2% 2.0% 0.42 C. Contextual Feature Selection
Neural fuzzy 71.0% 4.9% 0.36 72.8% 1.9% 0.34
Clinical 100% 0 0.10 100% 0 0.10 The impact of contextual feature selection is twofold: 1) dif-
LDA 92.1% 7.8% 0.16 92.1% 7.8% 0.16 ferent contextual feature selections affect the performance of
the algorithm, and 2) analysis of the selected features provides
us with interesting insights of the underlying domain problem.
probability of being malignant of BI-RADS 4 is 26.12%, which
Existing feature selection methods can be used as a prepro-
is between those of BI-RADS 4A and 4B and near the total
cessing to the algorithm. We use a classical wrapper method, the
average probability of being malignant.
branch and bound method, to select a subset of features [41],
[42]. By comparing the scores (accuracy of classification) of
B. Performance Evaluation of the DRA algorithm
different feature selections, this method selects the feature sub-
To perform the online adaptive learning, the data instances set in a “backward elimination” manner: one starts with the set
described previously are sequentially fed into the algorithm. of all variables and progressively eliminates the least promising
Results are compared with the clinical approach and two other ones. We consider the scenario that three features are selected
classical classifiers: the neural-fuzzy approach, and the linear among all the five features. Results of using the branch and
discriminant analysis (LDA) approach, which are defined as bound feature selection method and prediction accuracy of all
follows: possible subsets are shown in Table VII.
(1) Clinical approach [31]: Current clinical practice may be From the above simple example, we can see that different
thought of as a threshold-based approach, which recom- features influence the accuracy of recommendation. To illus-
mends a biopsy for all patients that fall in BI-RADS 4, trate the effect of context selection, we conduct the experiments
4A, 4B, 4C and above. using the proposed DRA algorithm for BI-RADS 4A patients
(2) Neural-fuzzy approach [14], [30]: The neural-fuzzy ap- by selecting different features and analyzing their relevance to
proach models the diagnosis system as a three-layered answer the following three questions:
neural network. The first layer represents input variables (1) Does using patient age information in addition to the BI-
with various patient features; the hidden layer represents RADS results (breast density, assessment history of both
the fuzzy rules for diagnostic decision based on the in- breasts, modality, etc.) as contextual features improve the
put variables; and the third layer represents the output diagnostic accuracy?
diagnostic recommendations. (2) How do the breast density and the choice of imaging
(3) LDA [2], [32]: The LDA approach trains a classifier us- modality affect the diagnostic accuracy?
ing features extracted from imaging tests and assessment (3) Does the assessment history of patients provide valuable
report, and the trained classifier can be used to make information to the diagnostic decisions?
diagnostic recommendations. The relevance of different contextual features to predict patient
System performance using different algorithms is shown in outcome is quantitatively described as the FPR and FNR.
Figs. 5 and 6, and Table VI. The FNR is empirically given as (1) In Table VIII, we compare the results of using both age
5% and 2%, respectively. We assume that the same prior infor- information and BI-RADS result information versus us-
mation or training data are available for each algorithm. Results ing only BI-RADS result information. Results show that
show the relationship between average FPR and the percentage no significant change in FPR is seen when the age in-
of patient arrivals (patient cases). The FPR of the proposed DRA formation is considered. Although women with differ-
algorithm decreases over time. In order to achieve a lower FNR, ent ages have significantly different chances of having
the FPR and the accuracy need to be sacrificed. Table VI shows breast cancer [37], our results imply that the information
that the FPR of the clinical approach is 1 for BI-RADS 4A pa- about patient age plays a less important role in determin-
tients, since it simply recommends all BI-RADS 4A patients to ing the diagnostic strategy than the BI-RADS test result
undergo a biopsy. The LDA algorithm cannot satisfy the FNR information, such as breast density, assessment history,
constraint, since it tries to linearly cluster patients based on the characteristic of opposite breast, and modality.
contextual information. However, the structure of the contextual (2) In Table IX, we show the importance of considering
information may not be linear, and as a result, a big performance breast density and modality in order to achieve a diag-
loss is incurred. The neural fuzzy approach results in a perfor- nostic recommendation by comparing the results of using
mance loss and does not converge to optimum, likely because both the breast density and modality, not using modality,
the trained rule may not be optimal and cannot be adaptively and not using breast density or modality. Cases 1 and
updated in time. Our DRA approach can be updated over time, 4 show that without the information about breast den-
maintaining a balance between FNR and FPR. The DRA algo- sity and modality, the FPR increases by over 14% for
rithm outperforms the clinical approach in terms of the FPR by both scenarios of 2% and 5% tolerable FNR. In addition,
910 IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 20, NO. 3, MAY 2016
TABLE VII
ACCURACIES OF DIFFERENT FEATURE SELECTIONS
Subset of features
{1,2,3} {1,2,4} {1,2,5} {1,3,4} {1,3,5} {1,4,5} {2,3,4} {2,3,5} {2,4,5} {3,4,5}*
Accuracy (η = 5% ) 0.34 0.33 0.38 0.32 0.37 0.39 0.33 0.38 0.37 0.41
Accuracy (η = 2% ) 0.28 0.20 0.28 0.29 0.33 0.32 0.24 0.24 0.32 0.37
3,4,5*: This is the subset of features obtained by the branch and bound algorithm.
TABLE VIII
IMPACT OF AGE INFORMATION
Age Breast density Assessment history Opposite breast Modality FPR FNR Accuracy FPR FNR Accuracy
TABLE IX
IMPACT OF BREAST DENSITY AND MODALITY
Age Breast density Assessment history Opposite breast Modality FPR FNR Accuracy FPR FNR Accuracy
TABLE X
IMPACT OF ASSESSMENT HISTORY OF BOTH BREASTS
Age Breast density Assessment history Opposite breast Modality FPR FNR Accuracy FPR FNR Accuracy
taking into account the information about breast density information about the assessment history of both breasts
without knowing the modality can result in a significant needs to be considered when a low FNR is suggested.
increase in FPR, as shown by Cases 1 and 3. In fact, While these examples are illustrated using clinical fea-
no research has shown that breast density significantly tures, the same approach can be extended to consider
implies the risk of cancer [10], but the breast density imaging features as well. The integration of imaging-
may cause lesions to be obscured in mammography [10], derived features (e.g., texture, shape) is part of ongoing
[33]. Hence, using different modalities, such as mam- work.
mography, ultrasound, and magnetic resonance imaging, As discussed in Section V, the different feature selections also
can help reduce the diagnostic error when the patient has affect the convergence rate of our online learning algorithm. In
dense breasts. Figs. 7 and 8, we show results of convergence rate for the above
(3) In Table X, we study the impact of assessment history discussed Case 7, where the assessment history of both breasts
of both breasts on determining the diagnostic strategy by is not considered, Case 2, where the age information is not con-
comparing the results of using the assessment history of sidered, and Case 1 with all contextual information. We can see
both breasts, using the same side breast without infor- from Fig. 7 that for a high tolerable FNR (5%), the convergence
mation about the opposite side breast, and not using the rate of Case 2 with a low context dimension is higher than that
assessment history of any side breast. Results show that of the Case 7 with a medium context dimension, as well as
there is a 16% decrease in FPR when the tolerable FNR is than that of Case 1 with a high context dimension. However,
low (i.e., 2%), and there is less than 7% variation in FPR for the low tolerable FNR (2%), the convergence rate of Case
when the tolerable FNR is high (i.e., 5%). Hence, the 7 is very low. In fact, as we previously showed, Case 7 does
SONG et al.: USING CONTEXTUAL LEARNING TO IMPROVE DIAGNOSTIC ACCURACY: APPLICATION IN BREAST CANCER SCREENING 911
Fig. 7. Comparison of convergence rate for different context selection, Fig. 9. ROC of the proposed computer-aided diagnosis system.
tolerable FNR = 5%.
online contextual learning algorithm is efficient, yet general and caused by clusters that have a σ(x) near the threshold ση ,
thus, it can potentially be used for diagnosis assist in other dis- and rt,2 is the regret caused by clusters that have a σ(x)
ease domains. Each of these domains has its own unique set of far from the threshold ση , but have a wrong estimation of
contextual information and desired patient outcomes. For ex- f (x). We denote by σm in (C) = minx∈C σ(x) the minimum
ample, in lung cancer screening patients, results of the low-dose probability of being malignant for cluster C, and denote by
computed tomography study (e.g., characterization of the nod- σm ax (C) = maxx∈C σ(x) the maximum probability of being
ule), pulmonary function tests, smoking and medical history, malignant for cluster C. We define three types of clusters.
and environmental exposures are potential contexts. The con- Type I cluster: The cluster for patient t that has a σm ax (C)
textual online learning algorithm can be adapted to handle such smaller than or equal to the threshold minus a small value, i.e.,
scenarios, helping physicians leverage available clinical big data {C : C ∈ C t , σm ax (C) ≤ ση − bt−α }, where b > 0, 0 < α < 1
to inform clinical decisions in each of these respective disease are parameters.
domains. Type II cluster: The cluster for patient t that has a σm in (C)
greater than or equal the threshold plus a small value, i.e., {C :
APPENDIX A C ∈ C t , σm in (C) ≥ ση + bt−α }.
PROOF OF PROPOSITION 1 Type III cluster: The remaining clusters that have a
σ(x) near ση , i.e., {C : C ∈ C t , σm in (C) − bt−α < ση <
First, the recommended strategy is consistent: π(x ) ≤
σm ax (C) + bt−α }.
π(x ) if σ(x ) ≤ σ(x ). The optimal solution will be
Due to Berstein’s inequality, we have that the estimation for
among the threshold-based strategies: πσ (x) = 1, if σ(x) ≥
the context arrival at a cluster C has the following property:
σ, and πσ (x) = 0, otherwise. Second, the monotone prop-
erty is satisfied: Ex μ0 (πσ (x), s(x)) ≤ Ex μ0 (πσ (x), s(x)) MC
− f (C)| > b1 t1−α } ≤ b11 e−t
α
Pr{|
and Ex μ1 (πσ (x), s(x)) ≥ Ex μ1 (πσ (x), s(x)), when σ ≥ σ . t
Here, we denote by μ0 and μ1 the FNR and the FPR. This implies
where b1 and b11 are positive constants. And the realized σ̄C in
that when a higher threshold is chosen, actions for some contexts
a cluster C has the following property:
will change from undergoing a biopsy to follow up. Therefore,
the FNR is reduced and the FPR is increased. Hence, a threshold Pr{σ̄C > σm ax (C) + b2 t1−α or σ̄C < σm in (C) − b2 t1−α }
≤ b22 e−t
α
ση exists, such that for any threshold-based strategy πσ (x) with
σ > ση , the following property holds: Ex μ1 (πσ (x), s(x)) > η,
and Ex μ1 (πσ η (x), s(x)) ≤ η. Obviously, the optimal solution where b2 and b22 are positive constants. We define the
is πσ η (x), and we write the optimal solution as π ∗ (x) for short. normal state as the event that the estimations of f (x)
and σ(x) are accurate enough. The set of normal states
APPENDIX B are denoted by NC = {| MtC − f (C)| ≤ b1 t1−α , σm in (C) −
PROOF OF THEOREM 1 b2 t1−α ≤ σ̄C ≤ σm ax (C) + b2 t1−α }. And we denote the set of
abnormal state by N̄C , which is the complementary set of NC .
We first describe the intuition of the proof and then give the The probability of an abnormal
proof. The intuition is to cluster the contexts into small clusters state happens for one of the
active cluster is bounded by C ∈C t Pr{N̄C }.
over time. Within each context cluster, if σ(x) within the context Hence, we can see that the regret for rt,1 is caused by Type III
cluster has a gap from ση , and the estimation of σ(x) is accurate clusters and can be bounded by
enough, then the strategy in these context clusters are the same
as the optimal strategy. The probability that the estimation of rt,1 ≤ Pr{x : x ∈ C, σm in (C) − bt−α < ση < σm ax (C)
( d −1 ) l
σ(x) has a large deviation from the true value tends to 0. For the +bt−α } ≤ K 22 d l ≤ K2−l
context clusters with σ(x) close to ση , the strategies selected
by the algorithm may not be the same as the optimal strategy; where K is a constant, and the inequality is due to the covering
however, the probability of context arrivals in these clusters will property that the (d − 1)-dimensional surface σ(x) = ση and
tend to 0, since Pr{x : σ(x) = ση } = 0. The strategy of the the Lipschitz condition that σm ax (C) − σm in (C) ≤ L1 2−l .
learning algorithm tends to the optimal strategy except some The regret rt,2
can be bounded by the probability that an
context clusters whose arrival probability tends to 0. abnormal state C ∈C t Pr{N̄C } occurs. Hence, for patient t,
Formally, we define Lipschitz conditions for the theorem. the IPR can be bounded by
(1) Patient outcome distribution: There exists a Lipschitz
rt
= rt,1 + rt,2
constant L1 > 0, such that for all x, x ∈ X , we have
≤ C ∈C t K2−l + b11 e−t + b22 e−t ≤ O(t−1+g (d X ) )
α α
APPENDIX C we have
PROOF OF PROPOSITION 2
T
RC ,s (T ) ≤ Pr{WCt , VCt (π)}
In order to show the equivalence of the two optimal strategies, t=1 π ∈LC (B )
we consider the weighted error of choosing different actions for
T
context x. If the action π(x) = 1 is chosen, then ≤ Pr{r̄C ,π ≥ μ̄C ,π + Ht , WCt }
t=1 π ∈LC (B )
mal action set is defined as LC (B) = {π : μ̄C ,π C∗ − μC ,π > ≤ BLdX 22(d X +p−α ) T dX +p
.
BLdX 2−lα }.
α /2
We add virtual exploration process into the al- Therefore, the ASR follows by setting z = 2α/p, and p =
√
gorithm: assume at least 2tz log(t) patients are diagnosed with 3α + 9α 2 +8α d X
. Since the IPR decreases as t increases, it can
error. We then can decompose the ASR into three terms: the 2
regret caused by virtual exploration Re (T ), the regret caused by be bounded by O(tg (d X )−1 ).
suboptimal arm selection Rs (T ), and the regret caused by near
optimal arm selection Rn (T ). We first introduce three lemmas APPENDIX E
to show useful properties of the DRA algorithm. PROOF OF THEOREMS 3 AND 4
Lemma 1. The active cluster level l for patient t can be at For Theorems 3 and 4, the clinical ASR can be calculated
most (log2 t)/p + 1. by another deviating
regret term. For Theorem 3, this term can
Proof.
According to the context space partition process, we be bounded by Tt=1 ε = εT . For Theorem 4, this term can be
have lj =1 2pj ≤ t, where l + 1 is the maximum level for pa- 1 −β
bounded by Tt=1 εt ≤ T1−β . Hence, Theorems 3 and 4 follow.
tient t. Hence, the result follows.
Lemma 2. The regret caused by virtual exploration in one
REFERENCES
cluster up to patient t is bounded by 2tz log t.
Proof. Since the virtual exploration number can be bounded [1] M. N. Gurcan, B. Sahiner, N. Petrick, H. P. Chan, E. A. Kazerooni, P. N.
Cascade, and L. Hadjiiski, “Lung nodule detection on thoracic computed
by tz log t for each action, the result follows. tomography images: Preliminary evaluation of a computer-aided diagnosis
Lemma 3. If B = α /22 −α + 2, and 2αp < z < 1, then the system,” Med. Phys., vol. 29, no. 11, pp. 2552–2558, 2002.
L dX 2
[2] J. Tang, R. M. Rangayyan, J. Xu, I. E. Naqa, and Y. Yang, “Computer-
regret caused by suboptimal action selection in one cluster up aided detection and diagnosis of breast cancer with mammography: Recent
2
to patient t is bounded by 2π3 . advances,” IEEE Trans. Inf. Technol. Biomed., vol. 13, no. 2, pp. 236-251,
Proof. Let WC denote the event that the current phase is an
t Mar. 2009.
[3] K. Ganesan, U. Acharya, C. K. Chua, L. C. Min, K. Abraham, and K. Ng,
exploitation phase in the context cluster C, and let VCt (π) be the “Computer-aided breast cancer detection using mammograms: A review,”
event that the suboptimal action π is selected in at time t. Then, IEEE Rev. Biomed. Eng., vol. 6, pp. 77–98, Mar. 2013.
914 IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 20, NO. 3, MAY 2016
[4] M. F. Ganji and M. S. Abadeh, “A fuzzy classification system based on [24] S. Timp, C. Varela, and N. Karssemeijer, “Computer-aided diagnosis
Ant Colony Optimization for diabetes disease diagnosis,” Expert Syst. with temporal analysis to improve radiologists’ interpretation of mam-
Appl., vol. 38, no. 12, pp. 14650–14659, 2011. mographic mass lesions,” IEEE Trans. Inf. Technol. Biomed., vol. 14, no.
[5] J. G. Elmore, D. L. Miglioretti, L. M. Reisch, M. B. Barton, W. Kreuter, C. 3, pp. 803–808, May 2010.
L. Christiansen, and S. W. Fletcher, “Screening mammograms by commu- [25] S. Ghosh, S. Mondal, and B. Ghosh, “A comparative study of breast cancer
nity radiologists: variability in false-positive rates,” J. Nat. Cancer Inst., detection based on SVM and MLP BPN classifier,” in Proc. IEEE 1st Int.
vol. 94, no. 18, pp. 1373–1380, 2002. Conf. Autom. Control, Energy Syst., 2014, pp. 1–4.
[6] M. A. Musen, B. Middleton, and R. A. Greenes, “Clinical decision- [26] S. Singh and K. Bovis, “An evaluation of contrast enhancement techniques
support systems,” in Biomedical Informatics, London, U.K.: Springer, for mammographic breast masses,” IEEE Trans. Inf. Technol. Biomed.,
2014, pp. 643–674. vol. 9, no. 1, pp. 109–119, Mar. 2005.
[7] K. Kerlikowske, P. A. Carney, B. Geller, M. T. Mandelson, S. H. Taplin, [27] K. Panetta, Y. Zhou, S. Agaian, and H. Jia, “Nonlinear unsharp masking for
K. Malvin, V. Ernster, N. Urban, G. Cutter, R. Rosenberg, and R. Ballard- mammogram enhancement,” IEEE Trans. Inf. Technol. Biomed., vol. 15,
Barbash, “Performance of screening mammography among women with no. 6, pp. 918–928, Nov. 2011.
and without a first-degree relative with breast cancer,” Ann. Internal Med., [28] A. N. Karahaliou, I. S. Boniatis, S. G. Skiadopoulos, F. N. Sakellaropou-
vol. 133, pp. 855–863, 2000. los, N. S. Arikidis, E. A. Likaki, G. S. Panayiotakis, and L. I. Costari-
[8] Cancer Facts & Figures 2014, Am. Cancer Soc., Atlanta, GA, USA, 2014. dou , “Breast cancer diagnosis: Analyzing texture of tissue surrounding
[9] R. Seigel, D. Naishadham, and A. Jemal, Cancer Statistics. Atlanta, GA, microcalcifications,” IEEE Trans. Inf. Technol. Biomed., vol. 12, no. 6,
USA: Am. Cancer Soc., 2013. pp. 731–738, Nov. 2008.
[10] ACR BI-RADS Breast Imaging and Reporting Data System: Breast Imag- [29] G. Bozza, M. Brignone, and M. Pastorino, “Application of the no-sampling
ing, Atlas, 5th ed., Am. Coll. Radiol., Reston, VA, USA, 2013. linear sampling method to breast cancer detection,” IEEE Trans. Biomed.
[11] L. Liberman and J. H. Menell, “Breast imaging reporting and data system Eng., vol. 57, no. 10, pp. 2525–2534, Oct. 2010.
(BI-RADS),” Radiol. Clinics North Am., vol. 40, no. 3, pp. 409–430, 2002. [30] A. Keles, A. Keles, and U. Yavuz, “Expert system based on neuro-fuzzy
[12] M. L. Giger, Z. Huo, M. A. Kupinski, and C. J. Vyborny, “Computer-aided rules for diagnosis breast cancer,” Expert Syst. Appl., vol. 38, no. 5,
diagnosis in mammography,” in Handbook of Medical Imaging, vol. 2, pp. 5719–5726, 2011.
2000, pp. 915–1004, Bellingham, WA, USA, SPIE Press. [31] W. Hsu, S. Han, C. Arnold, A. Bui, and D. R. Enzmann, “RadQA: A
[13] D. Gur, J. H. Sumkin, H. E. Rockette, M. Ganott, C. Hakim, L. Hardesty, data-driven approach to assessing the accuracy and utility of radiologic
W. R. Poller, R. Shah, and L. Wallace, “Changes in breast cancer detection interpretations,” in Proc. SIIM Annu. Meet., 2014, abstract. [Online].
and mammography recall rates after the introduction of a computer-aided Available: http://siimcenter.org/education/presentation/2014/quality-
detection system,” J. Nat. Cancer Inst., vol. 96, no. 3, pp. 185–190, 2004. safety-radiation-dose-scientific-session
[14] D. Tsai, H. Fujita, K. Horita, T. Endo, C. Kido, and T. Ishigaki, “Classifica- [32] K. Armstrong, E. A. Handorf, J. Chen, and M. N. B. Demeter, “Breast
tion of breast tumors in mammograms using a neural network: Utilization cancer risk prediction and mammography biopsy decisions: A model-
of selected features,” in Proc. IEEE Int. Joint Conf. Neural Netw., 1993, based study,” Am. J. Preventive Med., vol. 44, no. 1, pp. 15–22, 2013.
pp. 967–970. [33] S. Venkataraman and P. J. Slanetz, Breast Imaging: Mammography and
[15] F. Schnorrenberg, C. S. Pattichis, K. C. Kyriacou, and C. N. Schizas, Ultrasonography. Waltham, MA, USA: UptoDate, 2014.
“Computer-aided detection of breast cancer nuclei,” IEEE Trans. Inform. [34] C. Wiratkapun, W. Bunyapaiboonsri, B. Wibulpolprasert, and P. Lert-
Technol. Biomed., vol. 1, no. 2, pp. 128–140, Jun. 1997. sithichai, “Biopsy rate and positive predictive value for breast cancer
[16] W. E. Polakowski, D. A. Cournoyer, S. K. Rogers, M. P. DeSimio, D. in BI-RADS category 4 breast lesions,” Med. J. Med. Assoc. Thailand,
W. Ruck, J. W. Hoffmeister, and R. A. Raines, “Computer-aided breast vol. 93, no. 7, pp. 830–837, 2010.
cancer detection and diagnosis of masses using difference of Gaussians [35] C. I. Flowers, C. O’Donoghue, D. Moore, A. Goss, D. Kim, J.-H. Kim,
and derivative-based feature saliency,” IEEE Trans. Med. Imaging, vol. 16, S. G. Elias, J. Fridland, and L. J. Esserman, “Reducing false-positive
no. 6, pp. 811–819, Dec. 1997. biopsies: a pilot study to reduce benign biopsy rates for BI-RADS 4A/B
[17] N. R. Mudigonda, R. M. Rangayyan, and J. L. Desautels, “Gradient and assessments through testing risk stratification and new thresholds for in-
texture analysis for the classification of mammographic masses,” IEEE tervention,” Breast Cancer Res. Treatment, vol. 139, no. 3, pp. 769–777,
Trans. Med. Imaging, vol. 19, no. 10, pp. 1032–1043, Oct. 2000. 2013.
[18] Y.-H. Chang, L. A. Hardesty, C. M. Hakim, T. S. Chang, B. Zheng, [36] T. Ayer, O. Alagoz, and N. K. Stout, “OR forum—A POMDP approach to
W. F. Good, and D. Gur, “Knowledge-based computer-aided detection personalize mammography screening decisions,” Oper. Res., vol. 60, no.
of masses on digitized mammograms: A preliminary assessment,” Med. 5, pp. 1019–1034, 2012.
Phys., vol. 28, no. 4, pp. 455–461, 2001. [37] S. W. Fletcher, Risk Prediction Models for Breast Cancer Screening.
[19] C. M. Kocur, S. K. Rogers, L. R. Myers, T. Burns, M. Kabrisky, J. W. Waltham, MA, USA: UptoDate, 2014.
Hoffmeister, K. W. Bauer, and J. M. Steppe, “Using neural networks to [38] A. Slivkins, “Contextual bandits with similarity information,” arXiv
select wavelet features for breast cancer diagnosis,” IEEE Eng. Med. Biol. preprint arXiv:0907.3986, 2009.
Mag., vol. 15, no. 3, pp. 95–102, May/Jun. 1996. [39] J. Langford and T. Zhang, “The epoch-greedy algorithm for contextual
[20] A. Urmaliya and J. Singhai, “Sequential minimal optimization for support multi-armed bandits,” Adv. Neural Inf. Process. Syst., vol. 20, pp. 1096–
vector machine with feature selection in breast cancer diagnosis,” in Proc. 1103, 2007.
IEEE 2nd Int. Conf. Image Inf. Process., 2013, pp. 481–486. [40] T. Lu, D. Pal, and M Pal, “Contextual multi-armed bandits,” in Proc. Int.
[21] M. Sameti, R. K. Ward, J. Morgan-Parkes, and B. Palcic, “Image feature Conf. Artif. Intell. Statist., pp. 485–492, 2010.
extraction in the last screening mammograms prior to detection of breast [41] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artif.
cancer,” IEEE J. Sel. Topics Signal Process., vol. 3, no. 1, pp. 46–52, Intell., vol. 97, no. 1, pp. 273–324, 1997.
Feb. 2009. [42] P. M. Narendra and K. Fukunaga, “A branch and bound algorithm for
[22] M. Donelli, I. J. Craddock, D. Gibbins, and M. Sarafianou, “A three- feature subset selection,” IEEE Trans. Comput., vol. C-26, no. 9, pp. 917–
dimensional time domain microwave imaging method for breast cancer 922, Sep. 1977.
detection based on an evolutionary algorithm,” Progress Electromagn.
Res. M, vol. 18, pp. 179–195, 2011.
[23] P. S. Pawar and D. R. Patil, “Breast cancer detection using neural network
models,” in Proc. IEEE Int. Conf. Commun. Syst. Netw. Technol., 2013,
pp. 568–572.
Author’s photographs and biographies not available at the time of publication.