Download as pdf or txt
Download as pdf or txt
You are on page 1of 13

902 IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 20, NO.

3, MAY 2016

Using Contextual Learning to Improve Diagnostic


Accuracy: Application in Breast Cancer Screening
Linqi Song, William Hsu, Member, IEEE, Jie Xu, and Mihaela van der Schaar, Fellow, IEEE

Abstract—Clinicians need to routinely make management deci- result in most cases, room for improvement exists in cases where
sions about patients who are at risk for a disease such as breast discerning the difference between a benign or malignant mass is
cancer. This paper presents a novel clinical decision support tool difficult [5]–[7]. CDS tools may provide better diagnostic rec-
that is capable of helping physicians make diagnostic decisions.
We apply this support system to improve the specificity of breast ommendations in these cases by exploiting past knowledge of
cancer screening and diagnosis. The system utilizes clinical context prior cases and their outcomes. Third, CDS tools help reduce
(e.g., demographics, medical history) to minimize the false positive fluctuations in diagnostic performance due to human factors
rates while avoiding false negatives. An online contextual learning (e.g., fatigue, distraction), offering consistent recommendations.
algorithm is used to update the diagnostic strategy presented to the Nevertheless, while these tools have been widely advocated to
physicians over time. We analytically evaluate the diagnostic per-
formance loss of the proposed algorithm, in which the true patient improve the diagnostic performance of clinicians, their adoption
distribution is not known and needs to be learned, as compared has remained limited given the following reasons: 1) the need to
with the optimal strategy where all information is assumed known, train current CDS tools using a fixed, predefined set of training
and prove that the false positive rate of the proposed learning al- cases; 2) the challenge of learning relevant features from high-
gorithm asymptotically converges to the optimum. In addition, our dimensional datasets; and 3) the inability to convey uncertainty
algorithm also has the important merit that it can provide individ-
ualized confidence estimates about the accuracy of the diagnosis that may be associated with a recommendation.
recommendation. Moreover, the relevancy of contextual features To address these challenges, we present an online algorithm
is assessed, enabling the approach to identify specific contextual for generating diagnostic recommendations to physicians by
features that provide the most value of information in reducing leveraging retrospective cases in the electronic health record
diagnostic errors. Experiments were conducted using patient data (EHR) to reduce the FPR of diagnosis given a prescribed false
collected at a large academic medical center. Our proposed ap-
proach outperforms the current clinical practice by 36% in terms negative rate (FNR). We demonstrate our approach in the do-
of false positive rate given a 2% false negative rate. main of breast cancer screening and diagnosis, since breast
cancer is a common cancer among women with an estimated
Index Terms—Breast cancer, computer-aided diagnosis system,
contextual learning, multiarmed bandit (MAB), online learning.
232 670 new cases among women in the U.S. in 2014 [8],
[9]. The proposed CDS tool is designed to aid physicians with
I. INTRODUCTION making management decisions particularly in borderline cases.
Radiological assessment of breast images are categorized using
LINICAL decision support (CDS) tools help clinicians
C make detection and diagnostic decisions for complex dis-
eases such as lung cancer [1], breast cancer [2], [3], and diabetes
the Breast Imaging Report and Data System (BI-RADS) score.
BI-RADS scores of 3 or 4 represent borderline cases associated
with short interval followup or biopsy, respectively. Presently,
[4]. There are a number of advantages to integrate CDS tools
many benign cases are being classified as BI-RADS 4A, which
as part of the clinical workflow instead of solely relying on hu-
has raised the concern of overdiagnosis.
man intuition. First, the diagnostic accuracy of clinicians varies
To improve the determination of whether a patient should
widely. A previous study has shown that false positive rates
be assigned as BI-RADS 3 or 4A, we explicitly consider the
(FPRs) for breast cancer detection range from 2.6% to 15.9%
contextual information of the patient (also known as situational
among different radiologists and younger and more recently
information) that affects diagnostic errors for breast cancer. The
trained radiologists have higher FPRs than experienced radiolo-
contextual information is captured as the current state of a pa-
gists [5]; the deployment of CDS tools may reduce this variabil-
tient, including demographics (age, disease history, etc.), the
ity. Second, although clinicians provide the correct diagnostic
breast density (based on the BI-RADS breast density scale), the
assessment history, whether the opposite breast has been diag-
Manuscript received August 24, 2014; revised January 6, 2015; accepted nosed with a mass, and the imaging modality that was used to
March 14, 2015. Date of publication March 20, 2015; date of current version
May 9, 2016. This work was supported by the U.S. Air Force Office of Scientific
provide the imaging data. We hypothesize that the incorpora-
Research under the DDDAS program. tion of contextual information will help provide more specific
L. Song, J. Xu, and M. van der Schaar are with the Department of Electrical personalized diagnostic recommendations to patients.
Engineering, University of California, Los Angeles, Los Angeles, CA 90095
USA (e-mail: songlinqi@ucla.edu; jiexu@ucla.edu; mihaela@ee.ucla.edu).
The rest of the paper is organized as follows. Section II dis-
W. Hsu is with the Department of Radiological Sciences, Univer- cusses related works. In Section III, we describe the system
sity of California, Los Angeles, Los Angeles, CA 90024 USA (e-mail: model and formulate the design problem. Section IV presents
willhsu@mii.ucla.edu).
Color versions of one or more of the figures in this paper are available online
a systematic methodology for determining the optimal diag-
at http://ieeexplore.ieee.org. nostic recommendation strategy. Section V discusses practical
Digital Object Identifier 10.1109/JBHI.2015.2414934 issues related to the system: relevant context selection, prior

2168-2194 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications standards/publications/rights/index.html for more information.
SONG et al.: USING CONTEXTUAL LEARNING TO IMPROVE DIAGNOSTIC ACCURACY: APPLICATION IN BREAST CANCER SCREENING 903

Fig. 1. Breast cancer diagnostic process overview.

information, and clinical regret. In Section VI, we present the TABLE I


COMPARISON WITH PREVIOUSLY PUBLISHED WORKS
experimental results and our findings. Section VII concludes the
paper.
Employs Adaptive Performance Trade-off between
BI-RADS strategy guarantee FPR and FNR
II. RELATED WORKS [12]–[29] No No No No
[30] Yes No No No
A. Computer-Aided Detection and Diagnosis System for Our work Yes Yes Yes Yes
Breast Cancer
Various signal processing and machine learning techniques
efficient and accurate approaches are needed to reduce unnec-
have been introduced to perform computer-aided detection
essary biopsies [35]. In [30], a neural-fuzzy approach has been
and diagnosis. Early works have focused on image process-
proposed, but these rule-based algorithms cannot be easily up-
ing and classification techniques to extract features of the image
dated. Such approaches incur some performance loss because
and predict the outcome (i.e., whether benign or malignant) in
the underlying distribution of patient (outcome and context) is
the image [12]–[20]. A neural network-based algorithm [14]
not known, and limited training data give a limited estimation of
and a linear discriminant approach [2] are proposed to solve the
the actual distribution, resulting in suboptimal diagnostic strate-
diagnosis problem.
gies. A partially observable Markov decision process has been
In breast screening, two types of CDS tools can be inte-
used in [36] to solve a screening related question. However,
grated in the diagnostic workflow, as depicted in Fig. 1: 1) the
the algorithm does not learn unknown distributions nor does
computer-aided detection system, which helps the radiologist
it provide performance guarantees. The main components of
identify important features in the image that are abnormal [21]–
our approach are illustrated in Fig. 1. Compared with existing
[23]; and 2) the computer-aided diagnosis system, which helps
works [2], [3], [30], our proposed framework employs an online
physicians determine the diagnostic strategy for the patient (e.g.,
learning approach, which continuously learns and adaptively
whether a patient should receive a biopsy or continue followup
updates the diagnostic strategy over time, eventually achieving
imaging) [24]–[30]. Our focus in this work is on the latter:
the optimal strategy. Moreover, the proposed learning algorithm
we demonstrate how context derived from clinical (e.g., demo-
quickly (and provably) learns this optimal strategy. Learning in
graphics, history) and imaging (e.g., breast density) sources can
our framework is enabled by modeling each patient as charac-
be used to provide diagnostic recommendations that reduce the
terized by his/her context and using the context to determine
number of biopsies performed while maintaining a low number
the similarity to the information gathered from other patients.
of false negatives.
Knowledge from diagnosis of former patients can only be trans-
Clinically, management decisions that involve further inter-
ferred to the present/future patient by recognizing and exploiting
ventions are indicated by a BI-RADS score of 0 (the imaging
similarities. Using contexts and their similarities, our approach
study cannot be interpreted and must be retaken), 4 (or 4A, 4B,
is able to make diagnostic recommendations that are person-
and 4C, which represent different levels of suspicion), or 5 (high
alized to the patient. A comparison of our framework against
suspicion) [10], [11]. A BI-RADS score of 3 means that the mass
existing frameworks is shown in Table I.
is likely benign with a short interval followup recommended.
On the other hand, a BI-RADS 4 or 5 indicates that a biopsy
is recommended. These decisions are made in the context of B. Contextual Multiarmed Bandit
other clinical variables (e.g., if the mass is palpable). The dif- Our diagnostic recommendation algorithm (DRA) is based on
ficulty lies in borderline cases (e.g., BI-RADS 3 or 4A) where the contextual multiarmed bandit (MAB) framework [38]–[40]
indications are unclear whether a biopsy is truly necessary. For and incorporates the following innovations. First, prior infor-
example, despite the cost and risk associated with biopsies, the mation is considered, allowing the system to learn directly from
positive predictive value for BI-RADS 4A is only 9% [34]. More prior information or other learners. Second, in existing works
904 IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 20, NO. 3, MAY 2016

TABLE II TABLE III


COMPARISON WITH EXISTING MAB ALGORITHMS TYPES OF CONTEXTS AND DESCRIPTIONS

Explores Considers prior Considers Optimizes Context Description


patients information relevant context under constraint
Demographics The characteristics of a patient, age, race, disease history, family
Prior works Yes No No No medical history, etc.
[38]–[40] Breast Density Group 1: The breast is almost entirely fat (fibrous and glandular
Our approach No Yes Yes Yes tissue < 25%).
Group 2: There are scattered fibroglandular densities (fibrous and
glandular tissue 25% to 50%).
Group 3: The breast tissue is heterogeneously dense (fibrous and
glandular tissue 50% to 75%).
Group 4: The breast tissue is extremely dense (fibrous and
glandular tissue > 75%).
Historical assessments The information contained in previous imaging exam assessments
(e.g., whether findings in BIRADS 3 or higher appear in the past,
or whether there is a significant change in the past year).
Characteristic of the The information of the opposite breast (e.g., whether findings in
opposite breast BI-RADS 3 or higher appear for the opposite breast).
Modality The modality used for imaging: mammography (MG), ultrasound
(US), magnetic resonance imaging (MRI) or computer
radiography (CR).

Fig. 2. CABCDS model.


types of contextual features are considered: patient demograph-
ics (e.g., age, race), breast density, assessment history, whether
[38]–[40], the estimated error of an action can be updated only the opposite breast has a high BI-RADS score previously (e.g.,
after the action is selected, and the algorithm needs to explore achieve BI-RADS 3), and imaging modality (e.g., mammogram,
patients (by recommending different diagnostic actions to differ- ultrasound). Regardless of whether the context variable is dis-
ent patients under the same context) in order to get information crete or continuous, each context is modeled as discrete val-
about every action. However, due to ethical reasons, we cannot ues that have numerical values between 0 and 1.1 In this way,
perform this type of exploration. In our algorithm, the diagnostic each context is a vector of features, or a point in the context
error of any action can be updated each time, and our algorithm space X = [0, 1]d X , where dX is the number of features. For
does not need to explore patients in order to learn. Third, our example, a context can be represented as x = (0.7, 0.2, 0, 1, 0),
algorithm considers minimizing the FPR, given an FNR con- where x1 = 0.7 represents a 70-year old patient with a scattered
straint. Existing works do not consider such a constraint. This fibroglandular breast density (x2 = 0.2), an assessment of only
is the first work using MABs for CDS and required several key BI-RADS 1 or 2 (x3 = 0) in the preceding screening study, an
innovations as compared to the conventional MAB works. A assessment of BI-RADS 3 for the opposite breast (x4 = 1), and
summary that compares the proposed learning approach with mammography as the imaging modality used (x5 = 0).
existing MAB works is shown in Table II.
C. Computer-Aided Diagnosis Module
III. SYSTEM MODEL
This module consists of recommendation generation and di-
A. Computer-Aided Breast Cancer Diagnosis System agnostic evaluation steps. The recommendation generation step
suggests a diagnostic strategy based on the contextual informa-
We consider a computer-aided breast cancer diagnosis sys-
tion and previous diagnostic evaluations. A diagnostic strategy
tem (CABCDS) as shown in Fig. 2. The system contains two
is the approach for selecting an action, either to undergo a biopsy
modules: context extraction and computer-aided diagnosis. We
or to follow up, based on the observed contextual information.
consider a sequence of patients numbered t = 1, 2, . . . arrive
Given the context xt of a patient t, πt (xt ) represents the ac-
with a borderline test result. Context extraction module aggre-
tion selected by the diagnostic strategy πt . The strategy set is
gates information xt from the EHR about a patient t, having a
denoted by Π.
distribution of f (xt ). Then, the computer-aided diagnosis mod-
The diagnostic evaluation module collects outcomes of pa-
ule generates a diagnostic recommendation πt ∈ {0, 1} to the
tients. The outcome of the patient t is st (xt ), which is either
physician, where 0 represents a six-month imaging followup and
0 (representing benign) or 1 (representing malignant). If a pa-
1 represents a biopsy. Here, we consider a binary decision, but
tient undergoes a biopsy or returns for a short-term followup,
the approach can easily be extended to incorporate additional
the patient’s outcome is revealed, where if the patient has been
choices.
followed up for a certain time and the condition is stable, then
the outcome is considered benign. We use σ(x) to represent
B. Context Extraction Module the probability of being malignant for a patient with context x.
To better assist physicians, the CABCDS system considers
a diverse set of contextual information to make sufficiently ac- 1 Note that the modeling of discrete contexts in different categories can also
curate recommendations. As shown in Table III, the following be considered as several learners, each corresponding to one category.
SONG et al.: USING CONTEXTUAL LEARNING TO IMPROVE DIAGNOSTIC ACCURACY: APPLICATION IN BREAST CANCER SCREENING 905

The evaluation of the diagnostic recommendation is through di- TABLE IV


DIAGNOSTIC RECOMMENDATION ALGORITHM
agnostic errors. Two types of diagnostic errors are considered:
false positive (e.g., if the outcome st (xt ) is benign, and the
diagnostic recommendation algorithm
recommended action is to undergo a biopsy) and false negative
Initialization: Set active cluster set A = {X }, patient case counter M C = 0 for each
(e.g., if the outcome st (xt ) is malignant, and the recommended active cluster C , estimation of probability of malignancy σ̄ C = 1 for each active cluster
action is a short-term followup). C.

1: for t = 1, 2, ...do
D. Diagnostic Recommendation Problem 2: Context x t arrives. Find the active cluster C , such that x t ∈ C .
3: Determine the threshold σ t :
Based on the given CABCDS system, our design goal is to σ t = threshold determination(A, M A , σ̄ A , t, η ).
propose a recommendation algorithm that minimizes the FPR 4: Output the recommendation strategy:
π t (C ) = 0, if σ̄ C < σ t , and π t (C ) = 1, otherwise.
given a tolerable FNR η (e.g., < 2%). The tradeoff between FPR 5: The outcome s t of the patient is revealed.
and FNR can be specified by a physician or by an institution. 6: Update patient case counter for active cluster C :M C = M C + 1.
Therefore, the diagnostic recommendation problem is formally 7: Update the estimation: σ̄ C = # p o s i t i vMe c a s e s i n C .
C
8: if M C ≥ p(2 −l ) then  Context space refinement
written as
9: Update the set of active clusters:
minimize FPR (A, M A , σ̄ A ) = context_space_refinement(A, M A , σ̄ A , C )
(1)
subject to FNR ≤ η 10: end if
11: end for

threshold_determination(A, M A , σ̄ A , t, η )
IV. DIAGNOSTIC RECOMMENDATION ALGORITHM Input: active cluster set A, patient case counter M A , estimation of
The main idea of our diagnostic recommendation approach is probability of malignancy σ̄ A , the FNR tolerance η .
Output: the optimal threshold σ
to adaptively cluster patients based on related contexts and then
1: Initialize the estimation of probability of malignancy σ = 1 and the
learn the best action for each patient cluster. FNR estimation μ̄ 1 = 1.
2: while μ̄ 1 > η do
A. Structure of the Optimal Strategy 3: σ = σ − t −1 .
4: Calculate the FNR for threshold σ :
Σ C :σ C ≤M C
In order to solve the diagnostic recommendation problem, μ̄ 1 = max{1 , t −1 }
.
we first analyze the structure of the optimal solution where 5: end while
6: return σ .
all information (i.e., the distribution of context f (x) and the
probability of being malignant σ(x)) is known. We observe that context_space_refinement(A, M A , σ̄ A , C )
Input: active cluster set A, patient case counter M A , estimation of
the underlying probability σ(x) varies for different contexts x, probability of malignancy σ̄ A , the cluster C to be split.
and hence, the solution is to recommend a biopsy for patients Output: the updated A, M A , σ̄ A
with a sufficiently high probability of having a malignancy, 1: Split the cluster C with size 2 −l into 2 d X subclusters with size
and to recommend a short interval followup for patients with a 2 −( l + 1 ) .
2: Remove cluster C from the active cluster set A, and add the 2 d X
sufficiently low probability.
subclusters into the active cluster set A.
Proposition 1: The optimal strategy π ∗ (x) for the diagnostic 3: Initialize the patient case counter and estimation of probability of
recommendation problem in (1) is a threshold strategy: there malignancy for each subcluster C  :
M C  = # patien t cases in C  .
exists a threshold ση , such that the optimal strategy satisfies # positiv e ca ses in C 
π ∗ (x) = 1, if σ(x) ≥ ση , and π ∗ (x) = 0, otherwise.
σ̄ C  = MC ;
4: return A, M A , σ̄ A .
The intuition of this proposition is as follows: performing a
biopsy when the probability of being malignant is low induces
a high FPR. Conversely, performing a short-interval followup
patients within a cluster becomes available, our algorithm adap-
when the probability of being malignant is high induces a high
tively shrinks the cluster size, thereby allowing more specific
FNR.
recommendations to be made.
Proof: See Appendix A.
The proposed algorithm maintains a set of disjoint clusters
Note that the context distribution f (x) and outcome distri-
(referred to as “active clusters”) that cover the whole context
bution σ(x) are not known in practice. As such, the algorithm
space. For each active cluster C, the algorithm maintains an
needs to learn the distribution of contexts and outcomes.
estimate of the probability of a patient in this cluster of hav-
ing a malignant tumor σ̄C . Based on these estimates of all
B. Description of the Proposed Learning Algorithm active clusters, the algorithm finds the optimal threshold σt
Section III emphasized the need to uniquely characterize pa- determined in Proposition 1 and the corresponding recommen-
tients using contexts. However, no two patients are exactly the dation strategy for each cluster such that the FPR is minimized
same. Knowledge from diagnosis and treatment of former pa- provided that the FNR is below the given tolerance η. After the
tients can only be transferred to the present/future patient by patient outcome is revealed, the algorithm updates the estimate
recognizing and exploiting similarities among patients. Our pro- of the probability of a patient in cluster C of being malignant
posed algorithm uses patients with similar contexts to accumu- σ̄C . If the number of patient cases MC belonging to a cer-
late information about a new patient and make recommendations tain cluster C exceeds a certain threshold, the algorithm splits
based on the accumulated knowledge. As knowledge about more the cluster into smaller clusters in order to make more specific
906 IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 20, NO. 3, MAY 2016

Fig. 3. Illustrative example of the DRA algorithm.

recommendations without sacrificing accuracy. We denote by


A the set of active clusters, by σ̄A the set of σ̄C in all clusters
C ∈ A, and by MA the set of MC in all clusters C ∈ A.
The DRA is formally presented in Table IV and depicted in
Fig. 3. When a patient arrives, the system extracts the context
of the patient. For example, the patient is 60 years old and has a
dense breast (for illustration, we only consider the two features Fig. 4. CABCDS system with relevant context.
of the context). The algorithm then finds an active cluster C,
which the patient belongs to. Based on former patient cases, the the expected FPR of our learning algorithm compared with the
algorithm determines the optimal threshold σt and recommends optimal strategy π ∗ (x), assuming all information (i.e., the dis-
the strategy: to undergo a biopsy if the estimated probability tribution of context f (x) and the probability of patient outcome
of malignancy for cluster C exceeds this threshold σt , and to σ(x)) is known. In practice, the information is not known and
follow up otherwise. To determine the optimal threshold σt , the needs to be learned. We consider two types of regrets: individ-
algorithm finds the highest value of σt such that the FNR is ual patient regret (IPR) and aggregate system regret (ASR). The
below the tolerant level η. If a sufficient number of patient cases IPR is defined as the performance difference in terms of the
exist, the context space refinement process is performed that FPR of strategy πt and that of π ∗ (x) for patient t, and the ASR
splits the current cluster into smaller clusters. The context space is defined as the aggregated performance difference in terms of
is uniformly partitioned2 by the algorithm on each dimension the FPR of strategy πt and that of π ∗ (x) for patients 1, 2, . . . , T .
(feature) by 2l , each cluster (not necessarily the active cluster) The performance of the proposed algorithm can be guaranteed
with size 2−l . The refinement process is to split an active cluster in terms of the regret, as shown in Theorem 1.
with size 2−l into 2d X smaller clusters with size 2−(l+1) .3 As Theorem 1: The ASR of the DRA algorithm up to time T
shown in Fig. 3, when dX = 2, a cluster with size 1/2 is split can be bounded by R(T ) = O(T g (d X ) ), and the IPR of the DRA
into four clusters with size 1/4. For example, consider a patient algorithm for patient t can be bounded by r(t) = O(tg (d X )−1 ),
cluster in which patients are aged from 60 to 79 with breast where 0 < g(dX ) < 1 is a parameter depending on the number
density of Group 1 or 2. As new patients are added that fit this of features dX .
cluster, a more accurate estimation of the probability of malig- Proof: See Appendix B.
nancy can be characterized for patients in this cluster. When Note that the IPR O(tg (d X )−1 ) goes to 0, as t goes to infinity,
the patient cases are sufficiently many, the cluster is split into implying that the proposed DRA algorithm will converge to
finer clusters in order to make more specific diagnostic recom- the optimal diagnostic performance. Accordingly, the following
mendations: patients aged 60 to 69 with breast density Group corollary is introduced.
1, patients aged 60 to 69 with breast density Group 2, patients Corollary 1: The performance of the proposed DRA algo-
aged 70 to 79 with breast density Group 1, and patients aged 70 rithm converges to the optimal performance, in terms of the
to 79 with breast density Group 2. After the diagnostic decision FPR.
of the patient is made and the outcome of the patient is revealed, In addition, from Theorem 1, the convergence speed is fast,
the patient case counter MC and the estimated probability of at least in a sublinear4 rate, as shown in Figs. 5 and 6.
being malignant σ̄C are updated accordingly.
D. Algorithm Performance with Known Threshold
C. Evaluation of Algorithm Performance
In previous sections, the algorithm is shown to adaptively
In this section, the performance of the proposed DRA al- learn the optimal threshold ση (probability of being malignant)
gorithm is analyzed in terms of the learning regret, which is over time, given an FNR constraint. However, in some clinical
contexts, a fixed FNR may already exist (e.g., based on physician
2 Other types of partitioning methods can also be applied. preference or clinical practice guidelines) [10], [11]. In this
3 The splitting is determined by a size-dependent function p(2−l ) = 2 p l ,
where an empirical parameter p depends on how fast the clusters are to be
partitioned. A smaller p results in a faster partition process of the context space, 4 A sublinear rate indicates that the expected performance loss is O(1/tγ )
and a larger p results in a slower partition process of the context space. for patient t, where 0 < γ < 1.
SONG et al.: USING CONTEXTUAL LEARNING TO IMPROVE DIAGNOSTIC ACCURACY: APPLICATION IN BREAST CANCER SCREENING 907

defined as

T
Rπ (T ) = [Ec(π t (xt ), s(xt )) − Ec(π ∗ (xt ), s(xt ))].
t=1
(4)
We have the following theorem to bound this regret.
Theorem 2: The ASR of the fixed threshold-based DRA al-
gorithm is bounded by R(T ) = O(T g (d X ) ), and the IPR for
patient t is bounded by r(t) = O(tg (d X )−1 ).
Proof: See Appendix D.
The regret in terms of weighted error of the fixed threshold-
based DRA algorithm has the same sublinear order as the regret
in terms of FPR of the DRA algorithm. This finding implies
Fig. 5. Comparison of FPR for different algorithms, given tolerable that the fixed threshold-based DRA algorithm will converge to
FNR = 5%.
the optimal diagnostic strategy, summarized by the following
corollary.
Corollary 2: The performance of the fixed threshold-based
DRA algorithm converges to the optimal performance, in terms
of the expected weighted error in (4).
In addition, based on Theorem 2, we can see that the conver-
gence speed is fast (sublinear in the number of received patient
cases).

V. PRACTICAL CONSIDERATIONS
In this section, we discuss some practical issues related to
the system implementation and give appropriate approaches to
address these issues.
Fig. 6. Comparison of FPR for different algorithms, given tolerable
FNR = 2%.
A. Relevant Context Analysis
While large amounts of patient data are routinely captured
situation, the DRA algorithm degrades to a fixed threshold- in the EHR as part of clinical care, some information is inher-
based algorithm that does not need to learn the threshold value ently more relevant to assessing the probability of breast cancer
over time. Hence, step 5 of the DRA algorithm can be omitted, than others. In an online learning setting, identifying which
and the algorithm only needs to learn the distribution of patient contextual information is more relevant to making a clinical
outcomes σ(x). We consider a weighted false positive and false recommendation based on retrospective data is important. For
negative error c(π(x), s(x)) in this setting: the dX -dimensional context space X , the probability of making
c(π(x), s(x)) = ση ∗ false positive errors diagnostic errors may be correlated with missing or noisy data.
For example, the chance of making a diagnostic error may be
+(1 − ση ) ∗ false negative errors. (2) low when results of a molecular assay are available, but the
The strategy for minimizing the expectation of the weighted chance of making a diagnostic error may be high when only
error c(π(x), s(x)) is obtained by information about a patient’s previous BI-RADS assessment is
known.
min Ec(π(x), s(x)). (3) For a dX -dimensional context space, 2d X DRA learning in-
π ∈Π
stances can be executed at the same time. At time t, the average
The optimal strategy is denoted by π † (x), which assumes all FPRs of all the learning instances are evaluated. We denote
information is known, leading to the following proposition: by instance 1 the learning instance using all dX -dimensional
Proposition 2: The optimal strategy π † (x) is equivalent to contextual information. If for another learning instance i, the
the optimal strategy π ∗ (x) for the same ση . difference of FPR compared with instance 1 is below some
Proof: Appendix C. level given by physicians, then the contextual information used
Intuitively, Proposition 2 shows that the optimal weighted er- by learning instance i is relevant. Hence, the system can select
ror minimization strategy is equivalent to the optimal threshold- the more relevant context to make diagnostic recommendations,
based strategy. Hence, we define regret as the difference in the as shown in Fig. 4. From Theorem 1, when less contextual in-
weighted error, comparing our fixed threshold-based DRA al- formation is used, the convergence rate improves. This property
gorithm to the optimal strategy π ∗ (x). Formally, the ASR is is demonstrated in practice using actual data in Section VI-C.
908 IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 20, NO. 3, MAY 2016

B. Learning with Prior Information TABLE V


DESCRIPTION OF DIFFERENT CATEGORIES
Although the relevant context analysis can help identify sig-
nificant predictors from the entire information space, the selec- BI-RADS No. instances Prob. of malignant
tion process can be challenging when little information is known
about the underlying patient distribution, a problem known as 4 2282 26.12%
4A 1171 9.91%
“cold-start.” One approach to solve this is to introduce prior con- 4B 827 37.24%
textual information, such as probabilistic statements that have 4C 360 78.61%
been previously reported in other research studies, showing the Total 4640 28.08%

relationship between contexts and the probability of cancer [37].


This prior information can be represented as an input parame-
ter into the relevant context analysis module to derive an initial CABCDS system is πt , the physician has a probability εt of
distribution. selecting a strategy π̂t = πt .
Another source of prior information can be drawn from previ- Theorem 4: Given a decreasing probability εt = t1β of not
ously seen patient cases with known outcomes or reported in the following the recommended strategy, the clinical ASR up to
published literature. The prior statistical information includes time T can be bounded by
the distribution of patients with cancer or the FNR and FPR
obtained from other studies. The effect of using prior informa- R(T ) = O(T g (d X ) ) + O(T 1−β ). (6)
tion is equivalent to a number of N training patient cases before
Proof: See Appendix E.
running the system. In this case, the ASR can be bounded by
In this case, the clinical ASR has another sublinear term
R(T ) = O((T − N )g (d X ) ). This shows that the performance of
O(T 1−β ). In fact, both the learning ASR of the DRA algorithm
the system is greatly improved at the beginning of its operation
and the clinical ASR are sublinear in T and, hence, converge to
since it can successfully capitalize on the prior knowledge.
the optimal strategy, as shown in the following corollary.
Corollary 3: Given a decreasing probability εt = t1β of not
C. Clinical Regret Analysis following the recommended strategy, the clinical performance
In previous sections, we describe the algorithm for the in terms of FPR converges to the optimal performance.
CABCDS system and evaluated its performance in terms of
learning regret (IPR or ASR). However, in practice, physicians VI. EXPERIMENTAL RESULTS
may not always follow the system’s recommendations due to
In this section, the performance of the designed system is
differences in opinion between the experience of the physician
shown using our proposed algorithm. First, the breast cancer
and the recommended diagnostic strategy. In this section, we
dataset used to evaluate the system performance is described.
further evaluate the system performance by taking into account
Then, our proposed online learning algorithm is evaluated and
the actions of physicians and whether their actions are in agree-
compared with other exiting algorithms. Finally, the impact of
ment with the diagnostic recommendation. We call this analysis
relevant contexts on the system performance in terms of diag-
clinical regret.
nostic error rate and convergence rate is discussed.
We make the assumption that physicians have a certain but
fixed probability of not following the recommended strategy
when the system is first deployed. In this scenario, we denote A. Data Description
the probability of not following the recommended strategy by A deidentified dataset of 4640 individuals who underwent
ε. That is, when the diagnostic strategy recommended by the screening and diagnostic mammograms at a large academic
CABCDS system is at , the physician has a probability ε of medical center is used. Patient outcome is derived from biopsy
selecting a strategy ât = at . result, which is typically obtained for individuals with a BI-
Theorem 3: Given a fixed probability ε of not following the RADS score of 4 or 5. Our focus is on analyzing cases that
recommended strategy, the clinical ASR up to time T can be are BI-RADS 4A; this category represents patients whose test
bounded by results are less suspicious for cancer, raising the concern about
unnecessary biopsies. We consider five contextual features, in-
R(T ) = O(T g (d X ) ) + O(εT ). (5)
cluding: 1) patient age; 2) breast density; 3) assessment history
Proof: See Appendix E. (whether or not the immediately preceding exam shows a finding
In this case, since the system performance converges to the of BI-RADS 3 or above); 4) assessment results for the opposite
optimal strategy, a constant probability of deviation will result in breast (whether or not the immediately preceding exam shows
a linear clinical ASR in T . We now consider the scenario where a finding of BI-RADS 3 or above); and 5) the imaging modality
the physicians have a decreasing probability of not following used.
the recommended strategy given that confidence estimates for Characteristics of different BI-RADS categories are shown
diagnostic recommendations increase as the system learns from in Table V. The probability of being malignant increases from
additional cases. In this scenario, we denote the probability of 9.91% to 78.61% as the BI-RADS category varies from 4A,
not following the recommended strategy by εt = t1β (0 < β < 4B, to 4C. Prior to the introduction of BI-RADS 4A, 4B, 4C,
1). That is, when the diagnostic strategy recommended by the all suspicious nodules were categorized as BI-RADS 4. The
SONG et al.: USING CONTEXTUAL LEARNING TO IMPROVE DIAGNOSTIC ACCURACY: APPLICATION IN BREAST CANCER SCREENING 909

TABLE VI 39% and 36% for η = 5% and η = 2%, respectively, the LDA
FPR AND FNR COMPARISON
approach in terms of the FPR by 34% and 30% for η = 5% and
η = 2%, respectively, and the neural-fuzzy approach in terms of
Algorithms η = 5% η = 2%
the FPR by 14% and 12% for η = 5% and η = 2%, respectively.
FPR FNR Accuracy FPR FNR Accuracy

DRA (proposed) 61.0% 4.4% 0.45 64.2% 2.0% 0.42 C. Contextual Feature Selection
Neural fuzzy 71.0% 4.9% 0.36 72.8% 1.9% 0.34
Clinical 100% 0 0.10 100% 0 0.10 The impact of contextual feature selection is twofold: 1) dif-
LDA 92.1% 7.8% 0.16 92.1% 7.8% 0.16 ferent contextual feature selections affect the performance of
the algorithm, and 2) analysis of the selected features provides
us with interesting insights of the underlying domain problem.
probability of being malignant of BI-RADS 4 is 26.12%, which
Existing feature selection methods can be used as a prepro-
is between those of BI-RADS 4A and 4B and near the total
cessing to the algorithm. We use a classical wrapper method, the
average probability of being malignant.
branch and bound method, to select a subset of features [41],
[42]. By comparing the scores (accuracy of classification) of
B. Performance Evaluation of the DRA algorithm
different feature selections, this method selects the feature sub-
To perform the online adaptive learning, the data instances set in a “backward elimination” manner: one starts with the set
described previously are sequentially fed into the algorithm. of all variables and progressively eliminates the least promising
Results are compared with the clinical approach and two other ones. We consider the scenario that three features are selected
classical classifiers: the neural-fuzzy approach, and the linear among all the five features. Results of using the branch and
discriminant analysis (LDA) approach, which are defined as bound feature selection method and prediction accuracy of all
follows: possible subsets are shown in Table VII.
(1) Clinical approach [31]: Current clinical practice may be From the above simple example, we can see that different
thought of as a threshold-based approach, which recom- features influence the accuracy of recommendation. To illus-
mends a biopsy for all patients that fall in BI-RADS 4, trate the effect of context selection, we conduct the experiments
4A, 4B, 4C and above. using the proposed DRA algorithm for BI-RADS 4A patients
(2) Neural-fuzzy approach [14], [30]: The neural-fuzzy ap- by selecting different features and analyzing their relevance to
proach models the diagnosis system as a three-layered answer the following three questions:
neural network. The first layer represents input variables (1) Does using patient age information in addition to the BI-
with various patient features; the hidden layer represents RADS results (breast density, assessment history of both
the fuzzy rules for diagnostic decision based on the in- breasts, modality, etc.) as contextual features improve the
put variables; and the third layer represents the output diagnostic accuracy?
diagnostic recommendations. (2) How do the breast density and the choice of imaging
(3) LDA [2], [32]: The LDA approach trains a classifier us- modality affect the diagnostic accuracy?
ing features extracted from imaging tests and assessment (3) Does the assessment history of patients provide valuable
report, and the trained classifier can be used to make information to the diagnostic decisions?
diagnostic recommendations. The relevance of different contextual features to predict patient
System performance using different algorithms is shown in outcome is quantitatively described as the FPR and FNR.
Figs. 5 and 6, and Table VI. The FNR is empirically given as (1) In Table VIII, we compare the results of using both age
5% and 2%, respectively. We assume that the same prior infor- information and BI-RADS result information versus us-
mation or training data are available for each algorithm. Results ing only BI-RADS result information. Results show that
show the relationship between average FPR and the percentage no significant change in FPR is seen when the age in-
of patient arrivals (patient cases). The FPR of the proposed DRA formation is considered. Although women with differ-
algorithm decreases over time. In order to achieve a lower FNR, ent ages have significantly different chances of having
the FPR and the accuracy need to be sacrificed. Table VI shows breast cancer [37], our results imply that the information
that the FPR of the clinical approach is 1 for BI-RADS 4A pa- about patient age plays a less important role in determin-
tients, since it simply recommends all BI-RADS 4A patients to ing the diagnostic strategy than the BI-RADS test result
undergo a biopsy. The LDA algorithm cannot satisfy the FNR information, such as breast density, assessment history,
constraint, since it tries to linearly cluster patients based on the characteristic of opposite breast, and modality.
contextual information. However, the structure of the contextual (2) In Table IX, we show the importance of considering
information may not be linear, and as a result, a big performance breast density and modality in order to achieve a diag-
loss is incurred. The neural fuzzy approach results in a perfor- nostic recommendation by comparing the results of using
mance loss and does not converge to optimum, likely because both the breast density and modality, not using modality,
the trained rule may not be optimal and cannot be adaptively and not using breast density or modality. Cases 1 and
updated in time. Our DRA approach can be updated over time, 4 show that without the information about breast den-
maintaining a balance between FNR and FPR. The DRA algo- sity and modality, the FPR increases by over 14% for
rithm outperforms the clinical approach in terms of the FPR by both scenarios of 2% and 5% tolerable FNR. In addition,
910 IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 20, NO. 3, MAY 2016

TABLE VII
ACCURACIES OF DIFFERENT FEATURE SELECTIONS

Subset of features

{1,2,3} {1,2,4} {1,2,5} {1,3,4} {1,3,5} {1,4,5} {2,3,4} {2,3,5} {2,4,5} {3,4,5}*

Accuracy (η = 5% ) 0.34 0.33 0.38 0.32 0.37 0.39 0.33 0.38 0.37 0.41
Accuracy (η = 2% ) 0.28 0.20 0.28 0.29 0.33 0.32 0.24 0.24 0.32 0.37

3,4,5*: This is the subset of features obtained by the branch and bound algorithm.

TABLE VIII
IMPACT OF AGE INFORMATION

Case Contextual feature selections η = 5% η = 2%

Age Breast density Assessment history Opposite breast Modality FPR FNR Accuracy FPR FNR Accuracy

1 X X X X X 61.0% 4.4% 0.45 64.2% 2.0% 0.42


2 X X X X 63.1% 4.7% 0.43 67.2% 1.7% 0.40

TABLE IX
IMPACT OF BREAST DENSITY AND MODALITY

Case Contextual feature selections η = 5% η = 2%

Age Breast density Assessment history Opposite breast Modality FPR FNR Accuracy FPR FNR Accuracy

1 X X X X X 61.0% 4.4% 0.45 64.2% 2.0% 0.42


3 X X X X 72.1% 4.7% 0.35 81.0% 2.0% 0.27
4 X X X 74.8% 4.3% 0.32 79.6% 2.0% 0.29

TABLE X
IMPACT OF ASSESSMENT HISTORY OF BOTH BREASTS

Case Contextual feature selections η = 5% η = 2%

Age Breast density Assessment history Opposite breast Modality FPR FNR Accuracy FPR FNR Accuracy

1 X X X X X 61.0% 4.4% 0.45 64.2% 2.0% 0.42


5 X X X X 65.6% 4.7% 0.40 77.7% 1.8% 0.30
6 X X X X 64.0% 4.8% 0.42 73.2% 2.0% 0.34
7 X X X 67.9% 4.4% 0.38 79.9% 1.9% 0.28

taking into account the information about breast density information about the assessment history of both breasts
without knowing the modality can result in a significant needs to be considered when a low FNR is suggested.
increase in FPR, as shown by Cases 1 and 3. In fact, While these examples are illustrated using clinical fea-
no research has shown that breast density significantly tures, the same approach can be extended to consider
implies the risk of cancer [10], but the breast density imaging features as well. The integration of imaging-
may cause lesions to be obscured in mammography [10], derived features (e.g., texture, shape) is part of ongoing
[33]. Hence, using different modalities, such as mam- work.
mography, ultrasound, and magnetic resonance imaging, As discussed in Section V, the different feature selections also
can help reduce the diagnostic error when the patient has affect the convergence rate of our online learning algorithm. In
dense breasts. Figs. 7 and 8, we show results of convergence rate for the above
(3) In Table X, we study the impact of assessment history discussed Case 7, where the assessment history of both breasts
of both breasts on determining the diagnostic strategy by is not considered, Case 2, where the age information is not con-
comparing the results of using the assessment history of sidered, and Case 1 with all contextual information. We can see
both breasts, using the same side breast without infor- from Fig. 7 that for a high tolerable FNR (5%), the convergence
mation about the opposite side breast, and not using the rate of Case 2 with a low context dimension is higher than that
assessment history of any side breast. Results show that of the Case 7 with a medium context dimension, as well as
there is a 16% decrease in FPR when the tolerable FNR is than that of Case 1 with a high context dimension. However,
low (i.e., 2%), and there is less than 7% variation in FPR for the low tolerable FNR (2%), the convergence rate of Case
when the tolerable FNR is high (i.e., 5%). Hence, the 7 is very low. In fact, as we previously showed, Case 7 does
SONG et al.: USING CONTEXTUAL LEARNING TO IMPROVE DIAGNOSTIC ACCURACY: APPLICATION IN BREAST CANCER SCREENING 911

Fig. 7. Comparison of convergence rate for different context selection, Fig. 9. ROC of the proposed computer-aided diagnosis system.
tolerable FNR = 5%.

diagnostic recommendations to physicians, aiming to minimize


the FPR of diagnosis, given a predefined FNR. The proposed
algorithm is an online algorithm that allows the system to update
the diagnosis strategy over time. We analytically show that the
performance of our proposed algorithm converges to the optimal
performance and quantify the rate of convergence.
The key contributions of this paper are as follows.
(1) The process of breast cancer diagnosis is represented as a
sequential decision making and online learning problem.
The DRA is formulated to make diagnostic recommenda-
tions over time while quickly converging on the optimal
strategy. The algorithm exploits the dynamic nature of
patient data (e.g., the context space grows as more pa-
tients are seen) to learn from and minimize the FPR of
Fig. 8. Comparison of convergence rate for different context selection, toler-
able FNR = 2%. diagnosis given an FNR (e.g., <2%).
(2) Two types of “regret” (learning regret and clinical regret)
are employed to evaluate the performance of the CDS
not consider the relevant contextual information regarding the tool. The regret associated with the proposed DRA al-
assessment history of both breasts. This results in poor perfor- gorithm is analytically quantified, showing that the FPR
mance in terms of both the learning speed and the FPR in the asymptotically converges to the optimal strategy and that
scenario of a low tolerable FNR (2%). the convergence rate is fast (i.e., sublinear).
(3) Selection of relevant contexts is performed in relation to
D. Receiver Operating Characteristic of the System minimizing diagnostic errors by identifying what knowl-
edge or information is most influential in determining the
To evaluate the system performance using different FNR tol-
correct diagnostic action. This information is provided to
erance, we simulate the receiver operating characteristic (ROC)
the physician who can decide what information to exploit
of the system, as shown in Fig. 9. The goal of the ROC analy-
so as to make efficient and effective diagnostic decisions.
sis is to show an overview of the system performance and the
(4) The proposed algorithm’s performance is measured
tradeoff between the FPR and the FNR, according to which the
through experiments that incorporate clinical, imaging,
physicians can determine the appropriate FNR tolerance level.
and pathology data on 4640 patients who underwent a
As can be seen from Fig. 9, the FPR increases when the FNR
diagnostic mammogram at our institution. Results show
decreases. In clinical practice, a balance between false positive
that an improvement in specificity can be achieved by
and FNRs need to be achieved: while we care more when a
exploiting the contextual information associated with the
patient does not receive a timely treatment, the number of over-
patient for the breast cancer diagnosis. Specifically, the
diagnosed cases must be minimized in order for CABCDS to be
proposed algorithm outperforms the current clinical ap-
clinically accepted. We address this tradeoff by minimizing the
proach by 36% in terms of the FPR given a 2% FNR.
FPR given a user-defined FNR.
One future work is to continue evaluating the presented frame-
work and explore its implementation in the clinic. Understand-
VII. DISCUSSION AND FUTURE WORKS
ing the utility and impact of the proposed approach in the current
This paper presents a novel design framework for a CABCDS. practice of breast cancer diagnosis requires further study. Nev-
The system incorporates contextual information and makes ertheless, the initial experimental results demonstrate that our
912 IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 20, NO. 3, MAY 2016

online contextual learning algorithm is efficient, yet general and caused by clusters that have a σ(x) near the threshold ση ,
thus, it can potentially be used for diagnosis assist in other dis- and rt,2 is the regret caused by clusters that have a σ(x)
ease domains. Each of these domains has its own unique set of far from the threshold ση , but have a wrong estimation of
contextual information and desired patient outcomes. For ex- f (x). We denote by σm in (C) = minx∈C σ(x) the minimum
ample, in lung cancer screening patients, results of the low-dose probability of being malignant for cluster C, and denote by
computed tomography study (e.g., characterization of the nod- σm ax (C) = maxx∈C σ(x) the maximum probability of being
ule), pulmonary function tests, smoking and medical history, malignant for cluster C. We define three types of clusters.
and environmental exposures are potential contexts. The con- Type I cluster: The cluster for patient t that has a σm ax (C)
textual online learning algorithm can be adapted to handle such smaller than or equal to the threshold minus a small value, i.e.,
scenarios, helping physicians leverage available clinical big data {C : C ∈ C t , σm ax (C) ≤ ση − bt−α }, where b > 0, 0 < α < 1
to inform clinical decisions in each of these respective disease are parameters.
domains. Type II cluster: The cluster for patient t that has a σm in (C)
greater than or equal the threshold plus a small value, i.e., {C :
APPENDIX A C ∈ C t , σm in (C) ≥ ση + bt−α }.
PROOF OF PROPOSITION 1 Type III cluster: The remaining clusters that have a
σ(x) near ση , i.e., {C : C ∈ C t , σm in (C) − bt−α < ση <
First, the recommended strategy is consistent: π(x ) ≤
σm ax (C) + bt−α }.
π(x ) if σ(x ) ≤ σ(x ). The optimal solution will be
Due to Berstein’s inequality, we have that the estimation for
among the threshold-based strategies: πσ (x) = 1, if σ(x) ≥
the context arrival at a cluster C has the following property:
σ, and πσ (x) = 0, otherwise. Second, the monotone prop-
erty is satisfied: Ex μ0 (πσ (x), s(x)) ≤ Ex μ0 (πσ  (x), s(x)) MC
− f (C)| > b1 t1−α } ≤ b11 e−t
α
Pr{|
and Ex μ1 (πσ (x), s(x)) ≥ Ex μ1 (πσ  (x), s(x)), when σ ≥ σ  . t
Here, we denote by μ0 and μ1 the FNR and the FPR. This implies
where b1 and b11 are positive constants. And the realized σ̄C in
that when a higher threshold is chosen, actions for some contexts
a cluster C has the following property:
will change from undergoing a biopsy to follow up. Therefore,
the FNR is reduced and the FPR is increased. Hence, a threshold Pr{σ̄C > σm ax (C) + b2 t1−α or σ̄C < σm in (C) − b2 t1−α }
≤ b22 e−t
α
ση exists, such that for any threshold-based strategy πσ (x) with
σ > ση , the following property holds: Ex μ1 (πσ (x), s(x)) > η,
and Ex μ1 (πσ η (x), s(x)) ≤ η. Obviously, the optimal solution where b2 and b22 are positive constants. We define the
is πσ η (x), and we write the optimal solution as π ∗ (x) for short. normal state as the event that the estimations of f (x)
and σ(x) are accurate enough. The set of normal states
APPENDIX B are denoted by NC = {| MtC − f (C)| ≤ b1 t1−α , σm in (C) −
PROOF OF THEOREM 1 b2 t1−α ≤ σ̄C ≤ σm ax (C) + b2 t1−α }. And we denote the set of
abnormal state by N̄C , which is the complementary set of NC .
We first describe the intuition of the proof and then give the The probability of an abnormal
proof. The intuition is to cluster the contexts into small clusters  state happens for one of the
active cluster is bounded by C ∈C t Pr{N̄C }.
over time. Within each context cluster, if σ(x) within the context Hence, we can see that the regret for rt,1 is caused by Type III
cluster has a gap from ση , and the estimation of σ(x) is accurate clusters and can be bounded by
enough, then the strategy in these context clusters are the same
as the optimal strategy. The probability that the estimation of rt,1 ≤ Pr{x : x ∈ C, σm in (C) − bt−α < ση < σm ax (C)
( d −1 ) l
σ(x) has a large deviation from the true value tends to 0. For the +bt−α } ≤ K 22 d l ≤ K2−l
context clusters with σ(x) close to ση , the strategies selected
by the algorithm may not be the same as the optimal strategy; where K is a constant, and the inequality is due to the covering
however, the probability of context arrivals in these clusters will property that the (d − 1)-dimensional surface σ(x) = ση and
tend to 0, since Pr{x : σ(x) = ση } = 0. The strategy of the the Lipschitz condition that σm ax (C) − σm in (C) ≤ L1 2−l .
learning algorithm tends to the optimal strategy except some The regret rt,2
 can be bounded by the probability that an
context clusters whose arrival probability tends to 0. abnormal state C ∈C t Pr{N̄C } occurs. Hence, for patient t,
Formally, we define Lipschitz conditions for the theorem. the IPR can be bounded by
(1) Patient outcome distribution: There exists a Lipschitz
rt 
= rt,1 + rt,2
constant L1 > 0, such that for all x, x ∈ X , we have
≤ C ∈C t K2−l + b11 e−t + b22 e−t ≤ O(t−1+g (d X ) )
α α

|σ(x) − σ(x )| ≤ L1 ||x − x ||.



(2) Context distribution: There exists a Lipschitz constant
where g(dX ) = dd X +1/2+ √
9+8d X /2
, and the worst-case context
L2 > 0, such that for all x, x ∈ X , we have |f (x) − X +3/2+ 9+8d X /2

f (x )| ≤ L2 ||x − x ||. arrival and partition is considered as in Appendix D. Hence, we


We consider the IPR rt at some sufficiently large patient obtain the ASR up to patient T :
number t. Then, we can see that the IPR can be decom- T
posed into two terms rt = rt,1 + rt,2 , where rt,1 is the regret R(T ) = rt ≤ O(T g (d X ) ).
t=1
SONG et al.: USING CONTEXTUAL LEARNING TO IMPROVE DIAGNOSTIC ACCURACY: APPLICATION IN BREAST CANCER SCREENING 913

APPENDIX C we have
PROOF OF PROPOSITION 2 
T 
RC ,s (T ) ≤ Pr{WCt , VCt (π)}
In order to show the equivalence of the two optimal strategies, t=1 π ∈LC (B )
we consider the weighted error of choosing different actions for 
T 
context x. If the action π(x) = 1 is chosen, then ≤ Pr{r̄C ,π ≥ μ̄C ,π + Ht , WCt }
t=1 π ∈LC (B )

Ec(π(x) = 1, s(x)) = (1 − ση )σ(x). (7) + Pr{r̄C ,π C∗ ≤ μ̄C ,π C∗ − Ht , WCt } + Pr{r̄C ,π ≥ r̄C ,π C∗ ,


r̄C ,π < μ̄C ,π + Ht , r̄C ,π C∗ > μ̄C ,π C∗ + Ht , WCt }
If the action π(x) = 0 is chosen, then (11)
where Ht = t−z /2 , z ≥ 2α/p. The third term on the right hand
Ec(π(x) = 0, s(x)) = ση (1 − σ(x)). (8) side of (11) is 0. Hence, we can bound the regret by

T 
Hence, the optimal strategy π † (x) satisfies: RC ,s (T ) ≤ Pr{r̄C ,π ≥ E[r̄C ,π ] + LdX 2−lα }
α /2
t=1 π ∈LC (B )

† 1, if Ec(π(x) = 1, s(x)) ≤ Ec(π(x) = 0, s(x)) 
T
+ Pr{r̄C ,π C∗ ≤ E[r̄C ,π C∗ ] − LdX 2−lα } ≤ 4t−2 ≤
α /2 4π 2
π (x) = 3 .
0, otherwise. t=1
(9) (12)
By plugging (7) and (8) into (9), we have the optimal strategy We can see that the highest level of subspaces is at most
π † (x): 1 + log2 p + d X T . Then, the maximum number of subspaces is
dX
 bounded by 22d X T d X + p .
1, if σ(x) ≥ ση
π † (x) = (10) Therefore, according to Lemma 2, we can bound the explo-
0, otherwise.
ration regret by
Therefore, the proposition follows. dX
Re (T ) ≤ 22d X +1 T d X + p T z log T. (13)

APPENDIX D Accord to Lemma 3, we can bound the suboptimal regret by


PROOF OF THEOREM 2 dX
22d X +1 π 2 T d X + p
To prove Theorem 2, we first introduce some important no- Rs (T ) ≤ . (14)
3
tions to characterize the properties of the cost. Let us define πC∗
as the best action corresponding to the context at the center of the We can also bound the near optimal regret by
subspace C. Let us also define μx,π as the expected weighted 
1+log 2 p + d X T
BLdX 2−lα
α /2
error, μ̄C ,π = maxx∈C μx,π , and μC ,π = minx∈C μx,π . For a Rn (T ) ≤
(15)
size 2−l subspace C (referred to as “level l”), the subopti- α /2
l=0
d X + p −α

mal action set is defined as LC (B) = {π : μ̄C ,π C∗ − μC ,π > ≤ BLdX 22(d X +p−α ) T dX +p
.
BLdX 2−lα }.
α /2
We add virtual exploration process into the al- Therefore, the ASR follows by setting z = 2α/p, and p =

gorithm: assume at least 2tz log(t) patients are diagnosed with 3α + 9α 2 +8α d X
. Since the IPR decreases as t increases, it can
error. We then can decompose the ASR into three terms: the 2
regret caused by virtual exploration Re (T ), the regret caused by be bounded by O(tg (d X )−1 ).
suboptimal arm selection Rs (T ), and the regret caused by near
optimal arm selection Rn (T ). We first introduce three lemmas APPENDIX E
to show useful properties of the DRA algorithm. PROOF OF THEOREMS 3 AND 4
Lemma 1. The active cluster level l for patient t can be at For Theorems 3 and 4, the clinical ASR can be calculated
most (log2 t)/p + 1. by another deviating
 regret term. For Theorem 3, this term can
Proof.
 According to the context space partition process, we be bounded by Tt=1 ε = εT . For Theorem 4, this term can be
have lj =1 2pj ≤ t, where l + 1 is the maximum level for pa-  1 −β
bounded by Tt=1 εt ≤ T1−β . Hence, Theorems 3 and 4 follow.
tient t. Hence, the result follows.
Lemma 2. The regret caused by virtual exploration in one
REFERENCES
cluster up to patient t is bounded by 2tz log t.
Proof. Since the virtual exploration number can be bounded [1] M. N. Gurcan, B. Sahiner, N. Petrick, H. P. Chan, E. A. Kazerooni, P. N.
Cascade, and L. Hadjiiski, “Lung nodule detection on thoracic computed
by tz log t for each action, the result follows. tomography images: Preliminary evaluation of a computer-aided diagnosis
Lemma 3. If B = α /22 −α + 2, and 2αp < z < 1, then the system,” Med. Phys., vol. 29, no. 11, pp. 2552–2558, 2002.
L dX 2
[2] J. Tang, R. M. Rangayyan, J. Xu, I. E. Naqa, and Y. Yang, “Computer-
regret caused by suboptimal action selection in one cluster up aided detection and diagnosis of breast cancer with mammography: Recent
2
to patient t is bounded by 2π3 . advances,” IEEE Trans. Inf. Technol. Biomed., vol. 13, no. 2, pp. 236-251,
Proof. Let WC denote the event that the current phase is an
t Mar. 2009.
[3] K. Ganesan, U. Acharya, C. K. Chua, L. C. Min, K. Abraham, and K. Ng,
exploitation phase in the context cluster C, and let VCt (π) be the “Computer-aided breast cancer detection using mammograms: A review,”
event that the suboptimal action π is selected in at time t. Then, IEEE Rev. Biomed. Eng., vol. 6, pp. 77–98, Mar. 2013.
914 IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 20, NO. 3, MAY 2016

[4] M. F. Ganji and M. S. Abadeh, “A fuzzy classification system based on [24] S. Timp, C. Varela, and N. Karssemeijer, “Computer-aided diagnosis
Ant Colony Optimization for diabetes disease diagnosis,” Expert Syst. with temporal analysis to improve radiologists’ interpretation of mam-
Appl., vol. 38, no. 12, pp. 14650–14659, 2011. mographic mass lesions,” IEEE Trans. Inf. Technol. Biomed., vol. 14, no.
[5] J. G. Elmore, D. L. Miglioretti, L. M. Reisch, M. B. Barton, W. Kreuter, C. 3, pp. 803–808, May 2010.
L. Christiansen, and S. W. Fletcher, “Screening mammograms by commu- [25] S. Ghosh, S. Mondal, and B. Ghosh, “A comparative study of breast cancer
nity radiologists: variability in false-positive rates,” J. Nat. Cancer Inst., detection based on SVM and MLP BPN classifier,” in Proc. IEEE 1st Int.
vol. 94, no. 18, pp. 1373–1380, 2002. Conf. Autom. Control, Energy Syst., 2014, pp. 1–4.
[6] M. A. Musen, B. Middleton, and R. A. Greenes, “Clinical decision- [26] S. Singh and K. Bovis, “An evaluation of contrast enhancement techniques
support systems,” in Biomedical Informatics, London, U.K.: Springer, for mammographic breast masses,” IEEE Trans. Inf. Technol. Biomed.,
2014, pp. 643–674. vol. 9, no. 1, pp. 109–119, Mar. 2005.
[7] K. Kerlikowske, P. A. Carney, B. Geller, M. T. Mandelson, S. H. Taplin, [27] K. Panetta, Y. Zhou, S. Agaian, and H. Jia, “Nonlinear unsharp masking for
K. Malvin, V. Ernster, N. Urban, G. Cutter, R. Rosenberg, and R. Ballard- mammogram enhancement,” IEEE Trans. Inf. Technol. Biomed., vol. 15,
Barbash, “Performance of screening mammography among women with no. 6, pp. 918–928, Nov. 2011.
and without a first-degree relative with breast cancer,” Ann. Internal Med., [28] A. N. Karahaliou, I. S. Boniatis, S. G. Skiadopoulos, F. N. Sakellaropou-
vol. 133, pp. 855–863, 2000. los, N. S. Arikidis, E. A. Likaki, G. S. Panayiotakis, and L. I. Costari-
[8] Cancer Facts & Figures 2014, Am. Cancer Soc., Atlanta, GA, USA, 2014. dou , “Breast cancer diagnosis: Analyzing texture of tissue surrounding
[9] R. Seigel, D. Naishadham, and A. Jemal, Cancer Statistics. Atlanta, GA, microcalcifications,” IEEE Trans. Inf. Technol. Biomed., vol. 12, no. 6,
USA: Am. Cancer Soc., 2013. pp. 731–738, Nov. 2008.
[10] ACR BI-RADS Breast Imaging and Reporting Data System: Breast Imag- [29] G. Bozza, M. Brignone, and M. Pastorino, “Application of the no-sampling
ing, Atlas, 5th ed., Am. Coll. Radiol., Reston, VA, USA, 2013. linear sampling method to breast cancer detection,” IEEE Trans. Biomed.
[11] L. Liberman and J. H. Menell, “Breast imaging reporting and data system Eng., vol. 57, no. 10, pp. 2525–2534, Oct. 2010.
(BI-RADS),” Radiol. Clinics North Am., vol. 40, no. 3, pp. 409–430, 2002. [30] A. Keles, A. Keles, and U. Yavuz, “Expert system based on neuro-fuzzy
[12] M. L. Giger, Z. Huo, M. A. Kupinski, and C. J. Vyborny, “Computer-aided rules for diagnosis breast cancer,” Expert Syst. Appl., vol. 38, no. 5,
diagnosis in mammography,” in Handbook of Medical Imaging, vol. 2, pp. 5719–5726, 2011.
2000, pp. 915–1004, Bellingham, WA, USA, SPIE Press. [31] W. Hsu, S. Han, C. Arnold, A. Bui, and D. R. Enzmann, “RadQA: A
[13] D. Gur, J. H. Sumkin, H. E. Rockette, M. Ganott, C. Hakim, L. Hardesty, data-driven approach to assessing the accuracy and utility of radiologic
W. R. Poller, R. Shah, and L. Wallace, “Changes in breast cancer detection interpretations,” in Proc. SIIM Annu. Meet., 2014, abstract. [Online].
and mammography recall rates after the introduction of a computer-aided Available: http://siimcenter.org/education/presentation/2014/quality-
detection system,” J. Nat. Cancer Inst., vol. 96, no. 3, pp. 185–190, 2004. safety-radiation-dose-scientific-session
[14] D. Tsai, H. Fujita, K. Horita, T. Endo, C. Kido, and T. Ishigaki, “Classifica- [32] K. Armstrong, E. A. Handorf, J. Chen, and M. N. B. Demeter, “Breast
tion of breast tumors in mammograms using a neural network: Utilization cancer risk prediction and mammography biopsy decisions: A model-
of selected features,” in Proc. IEEE Int. Joint Conf. Neural Netw., 1993, based study,” Am. J. Preventive Med., vol. 44, no. 1, pp. 15–22, 2013.
pp. 967–970. [33] S. Venkataraman and P. J. Slanetz, Breast Imaging: Mammography and
[15] F. Schnorrenberg, C. S. Pattichis, K. C. Kyriacou, and C. N. Schizas, Ultrasonography. Waltham, MA, USA: UptoDate, 2014.
“Computer-aided detection of breast cancer nuclei,” IEEE Trans. Inform. [34] C. Wiratkapun, W. Bunyapaiboonsri, B. Wibulpolprasert, and P. Lert-
Technol. Biomed., vol. 1, no. 2, pp. 128–140, Jun. 1997. sithichai, “Biopsy rate and positive predictive value for breast cancer
[16] W. E. Polakowski, D. A. Cournoyer, S. K. Rogers, M. P. DeSimio, D. in BI-RADS category 4 breast lesions,” Med. J. Med. Assoc. Thailand,
W. Ruck, J. W. Hoffmeister, and R. A. Raines, “Computer-aided breast vol. 93, no. 7, pp. 830–837, 2010.
cancer detection and diagnosis of masses using difference of Gaussians [35] C. I. Flowers, C. O’Donoghue, D. Moore, A. Goss, D. Kim, J.-H. Kim,
and derivative-based feature saliency,” IEEE Trans. Med. Imaging, vol. 16, S. G. Elias, J. Fridland, and L. J. Esserman, “Reducing false-positive
no. 6, pp. 811–819, Dec. 1997. biopsies: a pilot study to reduce benign biopsy rates for BI-RADS 4A/B
[17] N. R. Mudigonda, R. M. Rangayyan, and J. L. Desautels, “Gradient and assessments through testing risk stratification and new thresholds for in-
texture analysis for the classification of mammographic masses,” IEEE tervention,” Breast Cancer Res. Treatment, vol. 139, no. 3, pp. 769–777,
Trans. Med. Imaging, vol. 19, no. 10, pp. 1032–1043, Oct. 2000. 2013.
[18] Y.-H. Chang, L. A. Hardesty, C. M. Hakim, T. S. Chang, B. Zheng, [36] T. Ayer, O. Alagoz, and N. K. Stout, “OR forum—A POMDP approach to
W. F. Good, and D. Gur, “Knowledge-based computer-aided detection personalize mammography screening decisions,” Oper. Res., vol. 60, no.
of masses on digitized mammograms: A preliminary assessment,” Med. 5, pp. 1019–1034, 2012.
Phys., vol. 28, no. 4, pp. 455–461, 2001. [37] S. W. Fletcher, Risk Prediction Models for Breast Cancer Screening.
[19] C. M. Kocur, S. K. Rogers, L. R. Myers, T. Burns, M. Kabrisky, J. W. Waltham, MA, USA: UptoDate, 2014.
Hoffmeister, K. W. Bauer, and J. M. Steppe, “Using neural networks to [38] A. Slivkins, “Contextual bandits with similarity information,” arXiv
select wavelet features for breast cancer diagnosis,” IEEE Eng. Med. Biol. preprint arXiv:0907.3986, 2009.
Mag., vol. 15, no. 3, pp. 95–102, May/Jun. 1996. [39] J. Langford and T. Zhang, “The epoch-greedy algorithm for contextual
[20] A. Urmaliya and J. Singhai, “Sequential minimal optimization for support multi-armed bandits,” Adv. Neural Inf. Process. Syst., vol. 20, pp. 1096–
vector machine with feature selection in breast cancer diagnosis,” in Proc. 1103, 2007.
IEEE 2nd Int. Conf. Image Inf. Process., 2013, pp. 481–486. [40] T. Lu, D. Pal, and M Pal, “Contextual multi-armed bandits,” in Proc. Int.
[21] M. Sameti, R. K. Ward, J. Morgan-Parkes, and B. Palcic, “Image feature Conf. Artif. Intell. Statist., pp. 485–492, 2010.
extraction in the last screening mammograms prior to detection of breast [41] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artif.
cancer,” IEEE J. Sel. Topics Signal Process., vol. 3, no. 1, pp. 46–52, Intell., vol. 97, no. 1, pp. 273–324, 1997.
Feb. 2009. [42] P. M. Narendra and K. Fukunaga, “A branch and bound algorithm for
[22] M. Donelli, I. J. Craddock, D. Gibbins, and M. Sarafianou, “A three- feature subset selection,” IEEE Trans. Comput., vol. C-26, no. 9, pp. 917–
dimensional time domain microwave imaging method for breast cancer 922, Sep. 1977.
detection based on an evolutionary algorithm,” Progress Electromagn.
Res. M, vol. 18, pp. 179–195, 2011.
[23] P. S. Pawar and D. R. Patil, “Breast cancer detection using neural network
models,” in Proc. IEEE Int. Conf. Commun. Syst. Netw. Technol., 2013,
pp. 568–572.
Author’s photographs and biographies not available at the time of publication.

You might also like