14 Deep Learning in Breast Cancer Screening

Hugh Harvey, Andreas Heindl, Galvin Khara, Dimitrios Korkinof, Michael O'Neill, Joseph Yearsley, Edith Karpati, Tobias Rijken, Peter Kecskemethy, and Gabor Forrai

H. Harvey · A. Heindl · G. Khara · D. Korkinof · M. O'Neill · J. Yearsley · E. Karpati · T. Rijken · P. Kecskemethy
Kheiron Medical Technologies, London, UK
e-mail: hugh@kheironmed.com

G. Forrai
European Society of Breast Imaging, Vienna, Austria
e-mail: forrai.gabor@t-online.hu

14.1 Background
Out of the myriad proposed use-cases for artificial intelligence in radiology, breast cancer screening is perhaps the best known and most researched. Computer-aided detection (CAD) systems have been available for over a decade, meaning that the application of more recent deep learning techniques to mammography already has a benchmark against which to compete. For the most part there is also a behavioral-change barrier to overcome before deep learning technologies are accepted into clinical practice, made even more difficult by CAD's largely unconvincing performance compared to its promise.

In this chapter we discuss the history of breast cancer screening and the rise and fall of traditional CAD systems, and explore the different deep learning techniques and their associated common challenges.

14.1.1 The Breast Cancer Screening Global Landscape

Breast cancer is currently the most frequent cancer and the most frequent cause of cancer-induced death in women in Europe [1]. The favorable results of randomized clinical trials have led to the implementation of regional and national population-based screening programmes for breast cancer in many upper-middle-income countries since the end of the 1980s [2]. The primary aim of a breast screening programme is to reduce mortality from breast cancer through early detection. Early detection of cancer comprises two strategies: screening and early diagnosis.

• Screening involves the systematic application of a screening test for a specific cancer in an asymptomatic population, in order to detect and treat early cancer or pre-cancerous health conditions before they become a threat to the well-being of the individual or the community. Mammography is the cornerstone of breast cancer screening and is widely offered as a public health policy on a routine basis.

• Early diagnosis is based on improved public and professional awareness (particularly at the primary health care level) of signs and symptoms associated with cancer, improved health-care-seeking behavior, prompt clinical assessment, and early referral of suspected cancer cases, such that appropriate diagnostic investigations and treatment can be rapidly instituted, leading to improved mortality outcomes. Indeed, all developed European countries have incorporated nationwide organized screening programmes, starting from the 1980s.

Breast cancer screening using mammography has been proven to be the most effective single imaging tool at the population level for detecting breast cancer in its earliest and most treatable stage [3]. However, according to a review by Boyer et al. [4], the limiting factors in the use of mammography are the breast's structure (high density) and the fact that mammography is difficult to interpret, even for experts. As a consequence, mistakes during interpretation may occur due to fatigue, lack of attention, or failure in detection or interpretation, and can lead to significant inter- and intra-observer variation. There are as many missed lesions when they have not been seen (i.e., reading errors) as when they have been incorrectly judged (i.e., decision errors) [5]. The reported rate of missed cancers varies from 16 to 31% [6]. To reduce this rate, double-reading of screening mammograms by two independent experts was introduced in some programs. Blinded double-reading reportedly reduces false negative results, and the average radiologist can expect an 8–14% gain in sensitivity and a 4–10% increase in specificity with double-reading pairing [7]. This is not surprising considering that double readers are often more specialized and read a larger volume of cases per year. Double-reading is now well established in many European countries, whereas in the USA, where double-reading is not mandatory, single-reading programmes with ad-hoc non-invitational screening are more common.

The most common direct harm associated with errors in mammography is a false positive test result [8], which causes additional work and costs for health care providers, and emotional stress and worry for patients. Patient harms arise from false negatives, leading to delays in diagnosis and an increase in interval cancers downstream. According to Fletcher [9], false positive test results increase when technology increases sensitivity but decreases specificity. While the use of more sensitive and less specific technology may be appropriate for patients at very high risk of developing cancer (such as those with BRCA mutations or untested women with first-degree relatives with BRCA mutations), use of these tests is not appropriate for general populations at lower risk. In Europe, cancer detection rates are similar to those in the USA [10], but the European frequency of false positives is lower [11], which we may assume is due to differences in the medico-legal environment, the existence of double-reporting programmes, guidelines for appropriate false positive rates, and more standardized training requirements for mammographic readers.

A Brief History of UK Breast Screening

• 1986: In the UK, Professor Sir Patrick Forrest produced the "Forrest report," commissioned by the forward-thinking Health Secretary, Kenneth Clarke. Having taken evidence on the effectiveness of breast cancer screening from several international trials (America, Holland, Sweden, Scotland, and the UK), Forrest concluded that the NHS should set up a national breast screening program.
• 1988: This was swiftly incorporated into practice, and by 1988 the NHS had the world's first invitation-based breast cancer screening program. "Forrest units" were set up across the UK for screening women between 50 and 64 years old, who were invited every 3 years for a standard mammogram (two compressed views of each breast).
• 2000: With the advent of digital mammography, medical imaging data became available in a format amenable to computational analysis.
• 2004: Researchers developed what became known as CAD, using feature-engineered programs to highlight abnormalities on mammograms. These systems use hand-crafted features, such as breast density, parenchymal texture, and the presence of a mass or microcalcifications, to determine whether or not a cancer might be present. They were designed to alert a radiologist to the presence of a specific feature by attempting to mimic an expert's decision-making process, highlighting regions on a mammogram according to recognized characteristics.
• 2012: Ongoing analysis of the screening program proved the benefit of widening the age range for invitation to screening to between 47 and 73 years old.
• 2014: The UK system was now successfully discovering over 50% of the female population's breast cancers (within the target age range) before they became symptomatic. Screening is now the internationally recognized hallmark of best practice.

14.1.2 The Rise and Fall of CAD

14.1.2.1 Rise: The Premise and Promise

The success of screening programs has driven both demand and costs, with an estimated 37 million mammograms now being performed each year in the USA alone [12]. There are consequently not enough human radiologists to keep up with the workload. The situation is even more pressured in European countries, where double-reading essentially requires that the number of radiologists per case be double that of the USA. The shortage of expensive specialized breast radiologists is becoming so acute that murmurs of reducing the benchmark of double-reading to single-reading in the UK are being uttered, despite convincing evidence that double-reading is simply better. The cost-inefficiency of double-reading, however, is a tempting target for policy makers looking to trim expenditure.

Initially, CAD was optimistically seen as a tool that would augment the radiologist, helping lower the potential to miss cancers on a mammogram (false negatives) and reducing the frequency of false positives. Ultimately, CAD was positioned as a means to improve the economic outcomes of screening by tackling both of these challenges. It made use of some of the most advanced techniques of the time during its boom in the late 1990s. The most common user interface of CAD is that of overlaid markings on top of a mammography image, indicating the areas which the CAD has processed and detected as potentially representing a malignant feature. While there are different systems on the market, they provide broadly similar outputs.

Feature extraction utilizes machine recognition of hand-engineered visual motifs. An early example is ImageChecker M1000, which detected spiculated lesions by identifying radiating lines emerging from a 6-mm center within a 32-mm circle [13]. While this helped the method to be interpretable, it also led to significant detection limitations that manifested in a greater number of false positives, as it struggled to account for all eventualities.

Once the features are extracted, another method is used to decide, based on those features, whether an area is malignant or not. Such discriminators are traditionally rule-based systems, decision trees, support-vector machines, or multi-layer perceptrons. An example could be a simple rule such as "if spiculated mass present then cancer." These methods have many issues due to their oversimplification of the underlying problem. A toy sketch of this two-stage pipeline follows.
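The following Python sketch is purely illustrative of the hand-engineered-feature-plus-rule pattern described above; the feature, function names, and threshold are all invented for illustration and do not reproduce any vendor's actual algorithm.

```python
import numpy as np

def spiculation_score(patch: np.ndarray) -> float:
    """Toy hand-engineered feature: average strength of image edges
    aligned with lines radiating from the patch centre (loosely echoing
    the radiating-line idea above; not a real CAD feature)."""
    gy, gx = np.gradient(patch.astype(float))
    h, w = patch.shape
    yy, xx = np.mgrid[0:h, 0:w]
    ry, rx = yy - h / 2.0, xx - w / 2.0      # vectors pointing away from the centre
    norm = np.hypot(ry, rx) + 1e-6
    radial_edge = np.abs(gy * ry / norm + gx * rx / norm)
    return float(radial_edge.mean())

def rule_based_discriminator(patch: np.ndarray, threshold: float = 5.0) -> bool:
    """'If spiculated mass present then cancer' as a single hard rule.
    The threshold here is arbitrary; a deployed system would tune it."""
    return spiculation_score(patch) > threshold
```

Every eventuality such a rule fails to cover becomes either a false positive or a false negative, which is exactly the brittleness described above.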
14.1.2.2 How Does CAD Perform?

There is a wide body of literature examining the performance of different CAD systems, most commonly ImageChecker (R2, now Hologic) and iCAD SecondLook. These studies used several different methodologies, so direct comparisons between them are difficult. In this section we review the most seminal of these works, noting the differences in methodologies, sample sizes, outcome measurements, and conclusions. The variation in these studies makes for a somewhat heterogeneous overall picture, with some studies strongly advocating the use of CAD, and others showing no significant benefits [14].

Gilbert et al. [15] conducted a trial to determine whether the performance of a single reader using CAD (ImageChecker) would match that achieved by two readers. 28,204 subjects were screened in this prospective study. The authors reported similar cancer detection rates, sensitivity, specificity, and positive predictive value after single-reading with CAD and after double-reading, and a higher recall rate after single-reading with CAD. Finally, no difference was found in pathological attributes between tumors detected by single-reading with CAD alone and those detected by double-reading alone. The authors concluded that single-reading with CAD could be an alternative to double-reading (although for the detection of small breast cancers double-reading remains the best method) and could improve the rate of cancer detection from screening mammograms read by a single reader.

A second prospective analysis, CADET II, published by the same group [16], evaluated the mammographic features of breast cancer that favor lesion detection with single-reading with CAD or with double-reading. Similar results were obtained for patients in whom the predominant radiologic feature was either a mass or a microcalcification. However, the authors reported superior performance for double-reading in the detection of cancers that manifested as parenchymal deformities, and superior performance for single-reading with CAD in the detection of cancers that manifested as asymmetric densities, suggesting that for more challenging cancer cases both reading protocols have strengths and weaknesses. There was, however, a small but significant relative difference in recall rates of 18% between the two study groups.

Taylor and Potts [17] performed a meta-analysis to estimate the impact of CAD and of double-reading on odds ratios for cancer detection and recall rates. The meta-analysis included 10 studies comparing single-reading with CAD to single-reading, and 17 studies comparing double-reading to single-reading, all published between 1991 and 2008. Despite evident heterogeneity between the studies, the evidence was sufficient to claim that double-reading increases the cancer detection rate, and that double-reading with arbitration does so while lowering the recall rate. The evidence was insufficient, however, to claim that CAD improves cancer detection rates, while CAD clearly increased the recall rate. When comparing CAD and double-reading with arbitration, the authors did not find a difference in cancer detection rate, but double-reading with arbitration showed a significantly better recall rate. Based on these findings, the authors concluded that the best current evidence shows grounds for preferring double-reading to single-reading with CAD.

Noble et al. [18] aimed to assess the diagnostic performance of a CAD system (ImageChecker) for screening mammography in terms of sensitivity, specificity, incremental recall, and cancer diagnosis rates. This meta-analysis was based on the results of three retrospective and four prospective studies published between 2001 and 2008. The authors reported strong heterogeneity in the results of the different studies. They supposed that several environmental factors could influence CAD performance, including the accuracy and experience of the radiologist: very accurate radiologists may gain a smaller incremental cancer detection rate from CAD than less accurate or less experienced radiologists, because they would miss fewer cancers without it. Radiologists who are more confident in their interpretation skills may also be less likely to recall healthy women primarily on the basis of CAD findings; on the other hand, less confident mammographers concerned about false negative readings may recommend the recall of a greater proportion of healthy women when CAD is employed. Based on these findings, it was difficult to draw conclusions on the beneficial impact of using CAD for screening mammography.
Karssemeijer et al. [19] conducted a prospective study to compare full-field digital mammography (FFDM) reading using CAD (ImageChecker) with screen-film mammography (SFM) in a population-based breast cancer screening program, for initial and subsequent screening examinations. In total, 367,600 screening examinations were performed, and results similar to previous studies were reported: the detection of ductal carcinoma in situ and of microcalcification clusters improved with FFDM using CAD, while the recall rate increased. In conclusion, this study supports the use of CAD to assist a single reader in centralized screening programs for the detection of breast cancers.

Destounis et al. [20] conducted a retrospective study to evaluate the ability of the ImageChecker CAD to identify breast carcinoma in standard mammographic projections, based on the analysis of 45 biopsy-proven lesions and 44 screening BIRADS category 1 digital mammography examinations used as a comparative normal/control population. CAD demonstrated a lesion/case sensitivity of 87%. The image sensitivity was found to be 69% in the MLO (mediolateral oblique) view and 78% in the CC (craniocaudal) view. In this evaluation, CAD was able to detect all lesion types across the range of breast densities, supporting the use of CAD to assist in the detection of breast cancers.

van den Biggelaar et al. [21] prospectively assessed the impact of different mammogram reading strategies on the diagnosis of breast cancer in 1048 consecutive patients referred for digital mammography to a hospital (i.e., a symptomatic, not a screening, population). The following reading strategies were implemented: single-reading by a radiologist with or without CAD (iCAD SecondLook), and breast technologists employed as pre-readers or double readers. The authors reported that the strategy of double-reading mammograms by a radiologist and a technologist obtained the highest diagnostic yield in this patient population, compared to the strategy of pre-reading by technologists or the conventional strategy of mammogram reading by the radiologist only. Comparing the findings across the different reading strategies showed that double-reading resulted in a higher sensitivity at the cost of a lower specificity, whereas pre-reading resulted in a higher specificity at the cost of a lower sensitivity. In addition, the results demonstrated that systematic application of CAD software in a clinical population failed to improve the performance of both radiologist and technologist readers. In conclusion, this study does not support the use of CAD in the detection of breast cancers.

Sohns et al. [22] conducted a retrospective study to assess the clinical usefulness of ImageChecker in the interpretation of benign and malignant mammograms. 303 patients were analyzed by three single readers of differing experience, with and without the CAD. The authors reported that all three readers could increase their accuracy with the aid of the CAD system, with the strongest benefit for the least experienced reader, and that the increase in accuracy was strongly dependent on the readers' experience. In conclusion, this study supported the use of CAD in the interpretation of mammograms for the detection of breast cancers, especially for less experienced readers.

Murakami et al. [23] assessed, in a retrospective study including 152 patients, the usefulness of the iCAD SecondLook in the detection of breast cancers. The authors reported that the use of a CAD system with digital mammography (DM) could identify 91% of breast cancers, with a high sensitivity for cancers manifesting as calcifications (100%) or masses (98%). Of particular interest, the authors also reported that sensitivity was maintained for cancers with a histopathology for which the sensitivity of mammography is known to be lower (i.e., invasive lobular carcinomas and small neoplasms). In conclusion, this study supported the use of CAD as an effective tool for assisting the diagnosis of early breast cancer.
Cole et al. [24] aimed to assess the impact of two CAD systems on the performance of radiologists reading digital mammograms. 300 cases were retrospectively reviewed by 14 and 15 radiologists using, respectively, the iCAD SecondLook and the R2 ImageChecker. The authors reported that although both CADs increased the readers' area under the curve (AUC) and sensitivity, the average differences observed were not statistically significant. Cole et al. concluded that radiologists rarely changed their diagnostic decision after the addition of CAD, regardless of which CAD system was used.

A study conducted by Bargalló et al. [25] assessed the impact of shifting from a standard double-reading-plus-arbitration protocol to single-reading by experienced radiologists assisted by CAD (iCAD SecondLook) in a breast cancer screening program. During the 8 years of this prospective study, 47,462 consecutive screening mammograms were reviewed. As the main finding, the authors reported an increase in the cancer detection rate in the period when the single reader was assisted by CAD, which could be even higher depending on the experience of the radiologist (specialized breast radiologists performed better than general radiologists). The recall rate was slightly increased during the period when the single reader was assisted by CAD. However, according to the authors, this increase could be related to the absence of arbitration, which had been responsible for the strong reduction of the recall rate in the double-reader protocol. In conclusion, this study supported the use of CAD to assist a single reader in centralized screening programs for the detection of breast cancers.

Lehman et al. [26] aimed to measure the performance of digital screening mammography with and without CAD (unidentified manufacturer) in US community practice, retrospectively comparing the accuracy of digital screening mammography interpreted with (N = 495,818) and without (N = 129,807) CAD. The authors did not report any improvement in sensitivity, specificity, or recall rates, except for the detection of intraductal carcinoma when CAD was used, and concluded that there was no beneficial impact of using CAD for mammography interpretation. This study did not support the use of CAD in breast cancer detection in a screening setting.

14.1.2.3 So Why Did CAD "Fail"?

In the USA the use of CAD systems earns radiologists extra reimbursement (varying between $15 and $40, but trending downwards due to recent bundling of billing codes). Such reimbursement has undoubtedly incentivized clinical uptake in a payer system. As a result, the use of CAD in US screening has effectively prevented adoption of double-reading (which would be more costly); however, with the performance of CAD systems coming into disrepute, some have questioned whether monetarily favoring such systems is economically sound when their accuracy and efficacy are in doubt [27].

As discussed above, many published studies have yielded ambiguous results; however, most found the use of CAD to be associated with a higher sensitivity but lower specificity. Sanchez et al. [28] observed similar trends in a Spanish screening population, where CAD by itself produced a sensitivity of 84% and a corresponding specificity of 13.2%. Freer and Ulissey [29] found the use of CAD caused a recall rate increase of 1.2% with a 19.5% increase in the number of cancers detected. These studies were on CADs that analyzed digitized screen-film mammograms, not FFDM. Others [30] found that the use of CAD on FFDM detected 93% of calcifications and 92% of masses, at a cost of 2.3 false positive marks per case on average.

Despite heterogeneous results, by 2008 over 70% of screening hospitals in the USA had adopted CAD [31]. This widespread adoption, coupled with a lack of definitive positive evidence for CAD usage in a screening setting, has resulted in some skepticism among radiologists, all coming at a cost of over $400 million a year in the USA [32]. In essence, while CAD was sometimes shown to improve sensitivity, it often decreased specificity, distracting radiologists who had to check false positive markings [33] and increasing recall rates. Conversely, in Europe, CAD uptake is under 10%, and human double-reading is far more widespread, with significantly lower recall rates.
Table 14.1 Reported and minimum human-level performance for mammography reading

Source | Sensitivity | Specificity
Computer aids and human second reading as interventions in screening mammography: two systematic reviews to compare effects on cancer detection and recall rate [26] | 87.3% | 91.4%
US national performance benchmarks for modern screening digital mammography [34] | 87% | 89%
Minimally acceptable interpretive performance criteria for screening mammography [35] | 75% | 88%
Criteria for identifying radiologists with acceptable screening mammography interpretive performance [36] | ≥80% | ≥85%
Breast cancer surveillance consortium US digital screening mammography [37] | 84% | 91%

The technical reason for CAD's inability to perform at the level of a human reader (Table 14.1) is the underlying algorithms' lack of flexibility in predicting the diverse range of abnormalities that arise biologically in the breast. Microcalcifications are morphologically completely different to masses, and each has its own family of subclasses with distinct shapes, sizes, and orientations. Architectural distortions exhibit even more subtlety. Disappointment was inevitable given this huge level of variation, coupled with the fact that traditional CAD algorithms have to account for all of these structural and contextual differences explicitly, in the form of hand-engineered features (which are themselves based on heuristics or mathematical pixel distributions). Deep learning algorithms do not suffer from this problem, as they adaptively learn useful features for the task at hand by themselves (at the expense of requiring more computational power and more data to train, with a potential decrease in interpretability). Despite this, no system has yet reported stand-alone sensitivity and specificity that both meet even the minimum acceptance criteria.

Regardless of whether CAD adoption has been a net positive or negative for a radiologist's workflow, a number of facts cannot be disputed. Traditional CAD by itself is not capable of replicating performance similar to a radiologist, thus a significant portion of a radiologist's time is spent either discarding areas marked by the system, due to its high false positive rate, or second-guessing the CAD's false negative misses. Any cancers missed or falsely flagged by CAD have significant downstream costs for both the health care system and patient welfare—in summary, CAD is technologically insufficient to make a positive net impact. Thus, the ultimate goal of future software based on deep learning is to detect malignancies and diagnose screening cases at a level that undoubtedly supports radiologists, that is, at or beyond the level of an experienced single reader's average performance. This will ensure that, when used to support a single-reader setting, diagnostic time is minimized, with a minimum number of false positive marks. Such a system could ensure that sensitivity and specificity are not sacrificed if the second read in a double-reader program is carried out by a deep learning system. Perhaps most importantly, the reduction in false positives could lead to significant benefits in health care costs and patient outcomes. This opens up the possibility of implementing cost-effective screening programs in developing countries, where the lack of trained radiologists makes such programs impossible in the current climate.

14.1.3 A Brief History of Deep Learning for Mammography
The first successful application of deep learning in mammography dates back to 1996 [38]. The authors proposed a patch-based system able to detect the presence of a mass in regions of interest (ROIs). The decision to inspect mammograms by extracting small patches was motivated by the limited computational resources available at that time. These early deep learning systems were not designed to detect microcalcifications, because the authors referred to previous publications claiming that existing CAD could already detect them reliably enough. Subsequently, it took some years of development on the hardware side until researchers became interested again in deep learning-based approaches. Dhungel et al. [39] and Ertosun and Rubin [40] can be credited with starting the new wave of deep learning with hybrid approaches, combining traditional machine learning with deep learning. The former suggested a cascaded CNN-based approach followed by a random forest and classical image post-processing. The latter published a two-stage deep learning system in which the first stage classifies whether the image contains a mass, and the second localizes those masses.

In 2015, Carneiro et al. [41] achieved a significant improvement in mass and microcalcification detection by using deep learning networks pre-trained on ImageNet, a collection of about 14 million annotated real-life images. Deep neural networks seemed able to learn robust high-level features despite the significant differences in image content. Furthermore, the authors made use of multiple views (MLO and CC) without pre-registration of the input images. Evaluation was done on the InBreast [42] and DDSM [43] datasets.

In 2017, Kooi et al. [44] published a deep learning-based approach trained on a much larger dataset of 45,000 images. Again, the authors proposed a patch-based deep learning approach, focused on the detection of solid malignant lesions including architectural distortions, thus ignoring cysts or fibroadenomata. A small reader study showed that their algorithm had patch-level performance similar to three experienced readers.

In the same year, Teare et al. [45] described a complex network that can deal with the three classes encountered in mammography (normal, benign, malignant). Their approach encompasses a full image-based analysis, where an enhanced input image is derived by composing three differently contrasted versions of the original input image. In parallel to the full image analysis, the authors suggest using a second, patch-based network. Both outputs are later combined using a random forest that computes the final probability of cancer and the location of the suspected abnormality. The study did not provide any details on the proprietary dataset (for example, the hardware manufacturer, the distribution of lesion types, etc.).

Compared to the aforementioned approaches, Kim et al. [46] propose a setup based on purely data-driven features from raw mammograms, without any lesion annotations. An interesting observation the authors make is that their study yielded a better sensitivity on samples with a mass rather than calcifications, which contradicts previous reports utilizing traditional CAD methodologies [47]. In order to overcome vendor-specific image characteristics, the authors applied random perturbations to the pixel intensities. Differences in diagnostic performance across hardware vendors were attributed to underlying differences in the distribution of malignant cases for each vendor. One limitation of the proposed approach is that benign cases were excluded completely, which is obviously problematic when comparing results to a real-world clinical setting.

14.2 Goals for Automated Systems

Automated systems designed to assist radiologists in mammographic screening have three high-level tasks to achieve.

14.2.1 Recall Decision Support
The goal in screening is to make a decision about whether or not to recall a patient for further assessment. This, in essence, is a high-level binary output from a human radiologist for each and every screening case they read. Recall rates vary between different screening programmes, tending to be higher in the USA than in Europe, and very low in Northern Europe. The UK guidelines suggest a recall rate as low as <5–7% is achievable [48], whereas in the USA recall rates of up to 15% have been reported [49].

Assuming an autonomous system can achieve a very high (99%) sensitivity at a suitable recall rate, the possibility of triaging opens up, thus aiding screening units in optimizing their workflow; a sketch of how such an operating point could be chosen follows.
A powerful-enough software system learning from the past failures of CAD could be used to act as an effective second reader if its sensitivity and specificity are good enough (most likely requiring performance at or beyond the level of a human reader). Because of the vastly different performance, this setup is qualitatively different from the CADs of the past: such systems do not merely support a single reader but actually act as a reader themselves, albeit with a vastly different "way of thinking" (capabilities and error profiles) than radiologists, thus complementing them.

In 2016, the DREAM challenge was set up, inviting deep learning researchers to develop systems to detect breast cancers on a proprietary dataset for a monetary grand prize. The input data consisted of around 640,000 images of both breasts and, if available, previous screening exams of the same subject, together with clinical/demographic information such as race, age, and family history of breast cancer. The winning team (Therapixel, France) attained a specificity of 80.8% at a set sensitivity of 80% (AUC 0.87) [50]. This represented the first public competition to apply deep learning to screening mammography. However, none of the entrants, including the winners, came close to the performance of single-reading human radiologists (Table 14.1). This may have been due to issues in the underlying data or its labeling, limitations in the competition design, or simply the availability of mature deep learning techniques in radiology at this relatively early time for the field.

Teare et al. [45] report a case-wise AUC of 0.92, with a sensitivity of 91% and a specificity of 80.4%. Kim et al. [46] report an AUC of 0.90, with their case-wise decisions derived from summations of individual lesion detections. Ribli et al. [51] trained a network to segment lesions, achieving an AUC of 0.85. They also showed an AUC of 0.96 on the InBreast dataset and 0.3 false positives per image at a sensitivity of 90%. These results show a significant improvement on earlier CAD performance; however, the recall rates for these newer deep learning systems are not yet known, and again, human-level performance in recall decision-making is not yet met. Unfortunately, the fact that these assessments were performed on datasets of limited quality limits the conclusiveness of the results.

14.2.2 Lesion Localization

A binary recall decision is of limited interpretability, and begs the question "How can we know what the decision was based on?" For each recall decision, it would be both useful and reassuring to see the suspicious regions that led to the decision. Ideally the algorithm would not only discriminate between normal and suspicious regions, but would also display where these features are present on the image. This process is called localization.

Traditional CAD systems offered this kind of localization. However, as discussed in Sect. 14.1.2.3, these systems were undiscerning, generating so many false positives as to render the localizations almost meaningless. Deep CNNs are also capable of localization, and top the leader-boards of all the major (non-radiological) localization challenges, such as PASCAL VOC, COCO, and ImageNet. It is natural that they be applied to digital mammography too.

Most attempts to perform localization in mammography with deep learning algorithms have taken a patch-based approach [41, 44, 52–54]. At training time, individual patches are sampled and cropped from a full-sized image and fed through a CNN to produce class predictions for each patch (for example, malignant mass, benign mass, or background). At test time, the network can be slid incrementally over the image to ensure full prediction coverage, as in the sketch below. The patch-based approach mitigates the difficulty of fitting full-sized mammograms into memory.
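A minimal PyTorch sketch of the test-time sliding window follows; `model` is assumed to be a patch classifier returning class logits, and the patch and stride values are illustrative.

```python
import torch

def sliding_window_predictions(model, image, patch=256, stride=128):
    """Slide a patch classifier over a full mammogram and collect a
    coarse grid of class probabilities. image: (1, H, W) tensor."""
    _, h, w = image.shape
    probs = []
    for y in range(0, h - patch + 1, stride):
        row = []
        for x in range(0, w - patch + 1, stride):
            crop = image[:, y:y + patch, x:x + patch].unsqueeze(0)
            with torch.no_grad():
                row.append(torch.softmax(model(crop), dim=1))
        probs.append(torch.cat(row))
    return torch.stack(probs)  # shape: (grid_rows, grid_cols, n_classes)
```

Overlapping strides (stride < patch) ensure coverage, but as discussed next, they also multiply the redundant computation.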
However, it also suffers from two major drawbacks. Firstly, by cropping patches from the full image we are asking the network to base its classification decision on a small fraction of the available context. Imagine cutting out a section of irregular parenchyma from an image and trying to decide whether it is normal tissue or a mass. To make a reasoned judgment, we need to "take a step back" and study the whole surrounding area: what looks normal in a fibrous parenchyma may be a clear anomaly in a fatty breast. The result is typically a large number of false positives when the patches are re-combined into the full image. The second major drawback is the inefficiency of the process: a significant amount of redundant computation is performed over the overlapping regions of the input patches.

The current state of the art in deep learning localization does not include patch-based methods. Rather, it is semantic segmentation [55], object detection [56–58], and instance segmentation approaches (Fig. 14.1) that top the leader-boards of the big public dataset challenges. These approaches all consider the full image rather than cropped patches, which allows them to overcome the two major drawbacks of the patch-based approach: the network now sees the full context of the image, and the forward-pass computation is highly amortized over overlapping regions of the image.

In semantic segmentation the goal is to classify each individual pixel in an image as belonging to one class or another. In order to do this, the low-resolution encoding of the image produced by the contracting path of the CNN backbone must be gradually expanded back to full image resolution. This can be achieved by appending a number of subsequent layers to the CNN backbone, with the pooling operations of these layers replaced by up-sampling operations. In order to recover the location-specific information lost during the pooling operations in the contracting path, high-resolution features from the contracting path must be re-combined with the up-sampled feature maps via skip connections, as in the sketch below. Ronneberger et al. [59] had breakthrough success with the application of such a segmentation network to biomedical images, winning the ISBI cell tracking challenge 2015 by a large margin. Variants of this network are also being applied in digital mammography [60, 61].
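The following is a minimal PyTorch sketch of the encoder-decoder-with-skip pattern just described; it is not the architecture of [59] or of any published mammography model, and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """One contracting step, one expanding step, one skip connection.
    Assumes input height and width are even."""
    def __init__(self, in_ch=1, n_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)                        # contracting path
        self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)  # learned up-sampling
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, n_classes, 1)            # per-pixel class logits

    def forward(self, x):
        e = self.enc(x)
        m = self.mid(self.pool(e))
        u = self.up(m)
        u = torch.cat([u, e], dim=1)  # skip connection restores spatial detail
        return self.head(self.dec(u))
```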

Fig. 14.1 Comparison of the three state-of-the-art localization approaches in deep learning. In (a) each pixel is classified as one class or another (here mass vs background). In (b) each mass instance is separately identified via bounding boxes. In (c) approaches (a) and (b) are combined by providing bounding boxes and pixel-level labels for each separate mass instance

Semantic segmentation cannot distinguish between separate instances of each class. For example, if there were two masses in the image, the algorithm would not tell you which pixels belonged to the first mass and which to the second, but would simply say "here are all the mass pixels." In object detection, the goal is to separately identify each instance of each class. However, unlike semantic segmentation, this is not done at the pixel level. Often, some sort of selective search is used to generate region proposals with a high "objectness" score [56, 57, 62], and this is followed by a classifier head to predict a class probability for each proposed region, and a regressor head to adjust the shapes of the bounding boxes that define each region for more precise localization; a schematic of this pair of heads is given below. Ribli et al. [51] came runner-up in the digital mammography DREAM challenge using the Faster-RCNN object detection network [56] (although the challenge was actually based on case-wise recall decisions, the high sensitivity and specificity of the localizations meant that the case-wise label could also be reliably inferred).
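Schematically, the two heads of a two-stage detector can be written as the following PyTorch module; this is an illustrative sketch, not the actual Faster-RCNN implementation used in [51], and the feature dimension and class count are assumptions.

```python
import torch.nn as nn

class DetectionHeads(nn.Module):
    """Parallel classification and box-regression heads applied to each
    pooled region-proposal feature vector, as in two-stage detectors."""
    def __init__(self, feat_dim=1024, n_classes=3):
        super().__init__()
        self.cls = nn.Linear(feat_dim, n_classes)      # class scores per region
        self.box = nn.Linear(feat_dim, 4 * n_classes)  # per-class box adjustments

    def forward(self, region_feats):
        # region_feats: (n_proposals, feat_dim), pooled from the backbone
        return self.cls(region_feats), self.box(region_feats)
```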
Instance segmentation combines the pixel-level detail of semantic segmentation with the instance-level aspect of object detection. It classifies each pixel according to both class and instance; in other words, it tells you "here are all the masses, and here are all the pixels that belong to each one." This approach extends object detection by including a semantic segmentation branch in parallel with the classifier and bounding-box regression branches. For each detected object, we now get a refined bounding box, a class probability, and a segmentation mask. The state of the art in this task is the Mask-RCNN network [63], the direct descendant of Faster-RCNN, although as of yet there are no reported results of this network on mammography datasets.

14.2.3 Density Stratification and Risk Prediction

It is widely accepted that the density of breast tissue—that is, the proportion of fibroglandular to fatty tissue in the breast—is a strong hindrance to the detection of breast cancer, due to the potential for lesions to hide in a high-density background [64–66]. It is estimated that 26% of breast cancers in women under the age of 55 are attributable to breast density over 50% (independent of other risk factors such as age) [67]. A widely used measure of breast density in digital mammography has been the proportion of the mammogram that is opaque, referred to as percent density (PD). In area-based PD, opacity is judged by simple thresholding of image pixel values, while volume-based PD (Volpara) also considers the thickness of the dense tissue by making use of the unthresholded pixel values; a sketch of the area-based calculation follows. Opinions differ as to which approach, area- or volume-based PD, is the better. Shepherd [68] concluded that volumetric PD methods are better predictors of breast cancer risk than area-based PD, while others have concluded the opposite [69, 70].
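The area-based calculation is simple enough to sketch directly; the function below is illustrative rather than any product's implementation, and in practice choosing the threshold (often done interactively) is the hard part.

```python
import numpy as np

def area_percent_density(mammogram: np.ndarray,
                         breast_mask: np.ndarray,
                         dense_threshold: float) -> float:
    """Area-based percent density: the fraction of pixels inside the
    breast whose intensity exceeds a threshold. Volume-based methods
    instead use the unthresholded values to estimate tissue thickness."""
    breast_pixels = mammogram[breast_mask]
    dense = breast_pixels > dense_threshold
    return 100.0 * dense.sum() / breast_pixels.size
```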
While PD is an important risk factor, there is growing evidence to suggest that texture characteristics, which are not necessarily correlated with PD, may also be an indicator [71]. Indeed, a key recent result from the University of Manchester's PROCAS trial, set up with the aim of predicting patient risk at screening, was that texture features could in fact be a more powerful risk indicator than PD methods [72]. Similar conclusions have been drawn previously by others [73–75]. The recognition of texture characteristics as an important risk indicator has led to the adoption of classification systems that classify breasts not only according to the proportion of fibroglandular tissue, but also according to its distribution (e.g., "scattered" or "heterogeneously dense"). Two such systems are the BI-RADS density scale (Fig. 14.2) and the parenchymal pattern (PP) scale (Fig. 14.3) developed by Professor Tabar [77].

Whether breast cancer risk is best assessed by PD or texture-based approaches, or a combination of the two, is still a matter of ongoing research. What is clear is that a major downside of these methods is that they rely on rigid, hand-crafted features based on simple pixel intensities (and, in the case of texture, on the gray-level co-occurrence matrix of neighboring pixels or on Gaussian features [78]; a minimal example follows). This introduces a lot of human bias, as well as trial and error to identify which features are the most effective.
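For concreteness, the sketch below computes one such hand-crafted texture feature, the contrast of a gray-level co-occurrence matrix for a single pixel offset; real pipelines use library implementations and many offsets, and all names here are illustrative.

```python
import numpy as np

def glcm_contrast(img: np.ndarray, dx: int = 1, dy: int = 0,
                  levels: int = 16) -> float:
    """Quantize the image, count co-occurrences of gray levels at the
    offset (dy, dx), normalize, and return the GLCM 'contrast' feature."""
    q = (img.astype(float) / img.max() * (levels - 1)).astype(int)
    glcm = np.zeros((levels, levels))
    a = q[:q.shape[0] - dy, :q.shape[1] - dx]   # reference pixels
    b = q[dy:, dx:]                             # neighbour pixels
    np.add.at(glcm, (a.ravel(), b.ravel()), 1)
    glcm /= glcm.sum()
    i, j = np.indices(glcm.shape)
    return float(((i - j) ** 2 * glcm).sum())
```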
Fig. 14.2 BI-RADS density scale. Reproduced with kind permission from the American College of Radiology [76]

Fig. 14.3 Tabar Parenchymal Patterns (PP). Reproduced with kind permission from Professor Tabar [77]

With deep learning approaches, instead of needing to decide on a set of features a priori, the most salient features are learned directly from the data and are specifically tailored to the task at hand. In addition, the features learned by deep learning algorithms are significantly richer than the crude PD- or GLCM-based features. In particular, they are highly specialized for discriminating between different patterns and textures. It is unsurprising, therefore, that deep learning algorithms are already being applied to great effect in density and PP estimation. For example, Wu et al. [79] trained a CNN to classify breasts according to the BI-RADS density scale. O'Neill (Kheiron Medical Technologies) presented at RSNA 2017 on how a CNN-based model could be applied equally effectively to BI-RADS and PP classification, and that the best results were obtained by jointly training for both classification tasks at once (Fig. 14.4). Both works reported human-level accuracies on these tasks compared with a consensus of radiologists.

These results highlight the potential of deep learning to improve risk assessment in breast cancer screening based on tissue density. Not only are deep neural networks highly adept at learning the types of features used in traditional PD and texture-based approaches, they also have the flexibility to learn any other (perhaps more subtle) risk indicators present in the mammogram. They therefore offer the potential of a single unified approach to breast cancer risk assessment that is both consistent and accurate.
Fig. 14.4 ROC curves for the four density classes, taking ground truth as the multi-center radiologist consensus (true positive rate against false positive rate; AUCs: A = 0.98, B = 0.89, C = 0.90, D = 0.98). The model was trained jointly with BI-RADS and PP labels. The lower AUC for classes B and C is likely due to noisy labels—these classes are the hardest for radiologists to distinguish between

14.3 Deep Learning Challenges Specific to Mammography

14.3.1 Memory Constraints and Image Size

The majority of deep neural networks for image-perception tasks in the public domain were designed for images with a maximum size of 299 × 299 pixels. This is vastly different from FFDM images, which are orders of magnitude greater in height, width, and total pixel count. This increase in image size comes at a hefty design cost when developing such algorithms for mammography. As the amount of RAM available on the majority of high-end GPUs is currently 12 GB, one full-resolution image is too large to train on with any CNN architecture that has yielded state-of-the-art results on the ImageNet challenge over the past 6 years. This problem has traditionally been tackled by down-sampling the image to a smaller resolution, or by splitting the image up into smaller constituent patches. Both approaches are highlighted in Fig. 14.5.

Fig. 14.5 A comparison between downscaling a full mammogram and cropping full-resolution patches. If the proportion of down-sampling is too high, important visual features may be lost. Reproduced with kind permission from [80]

Patch-based approaches are by far the most common way to overcome the intractable computational requirements of full-resolution digital mammograms. Their limitations were discussed in Sect. 14.2.2.

Another approach to solving the issue of large image size is to down-sample the image to a size which it is possible to train on. The positives of this method include preserving the contextual information contained in the image, and not requiring localized pixel-level labels for classification tasks. As long as the down-sampling does not destroy important visual features, it is the most likely route to successful classification. However, it is not without drawbacks. As mentioned previously, most successful CNN architectures were designed for much smaller image sizes. Researchers must consider whether the maximum receptive field (the maximum region of the image which contributes to a neuron's activation in the final convolutional layer of a CNN) of these architectures is large enough to span the necessary contextual information; a sketch of the standard receptive-field calculation follows.
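The receptive field of a stack of convolution and pooling layers follows a standard recurrence; the toy architecture in the usage comment is purely illustrative.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs for each conv/pool
    layer. The receptive field grows by (k - 1) times the cumulative
    stride ('jump') of all preceding layers."""
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

# Five stages of 3x3 conv followed by 2x2 max-pooling:
# receptive_field([(3, 1), (2, 2)] * 5) == 94 pixels, which is tiny
# compared with the several-thousand-pixel side of a full mammogram.
```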
If this receptive field is too small (or too large), a significant amount of network redesign may be necessary, which is particularly difficult in the case of more sophisticated architectures (such as Inception networks and ResNets). Also, only much smaller batch sizes are possible when training on down-sampled images, which may affect training.

There are other, more sophisticated ways to get around the trade-off between network depth, image size, and batch size. These include advanced methods which look explicitly at the execution of training for deep learning models, using checkpointed management of the intermediate results that are usually required by the commonly used back-propagation method of updating network weights [81]; a sketch is given below. Another method introduced recently (and getting a lot of attention) is reversible networks [82], where one can compute the gradient without storing the majority of the network's activations, decreasing memory requirements by replacing stored activations with simple mathematical operations.
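In PyTorch, for example, this trade of compute for memory is available via torch.utils.checkpoint; splitting the network into the sequential blocks below is an illustrative assumption.

```python
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(blocks, x):
    """Run a network split into sequential blocks. Activations inside
    each block are not stored during the forward pass; they are
    recomputed when gradients are needed, cutting peak memory at the
    cost of extra forward computation per block."""
    for block in blocks:
        x = checkpoint(block, x)
    return x
```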
Looking towards the future, as more advanced techniques emerge and more powerful GPUs are designed, researchers will still grapple with similar memory limitations. A variety of exciting avenues are yet to be explored, including the possibility of considering a patient's entire screening history, or genetic information, when making a recall decision. Design choices similar to those described above will thus need to be considered, and such choices are a key factor in determining a model's success.

14.3.2 Data Access and Quality

Researchers have access to several public and restricted image databases. However, the quantity, quality, and availability of metadata and clinical data vary greatly between those datasets. For example, scanned hard-copy films may not be useful for developing state-of-the-art digital mammography algorithms. One of the more popular databases is DDSM, which is available to the general public and contains more than 10,000 images.
Unfortunately, the quality of the digitized films does not match that of FFDM [51], and the provided annotations are not as accurate as they should be for training machine learning systems (e.g., 339 images contain annotations although the masses are not clearly visible [83]). An up-to-date and better curated version of DDSM was published more recently [83]. At the time of writing, only one group has published work on the new release of DDSM [84]. The second most frequently cited database is MIAS; however, compared to DDSM it lacks samples. Furthermore, offering only 8-bit images is no longer state of the art, so we can only assume that this dataset will not be useful for future deep learning projects. The InBreast dataset is also often used as a benchmark, as it consists of annotated FFDM images. However, with 115 cases it is rather small, cannot be considered representative of real-world inputs, and is not suitable for assessing the performance of algorithms in real-world settings.

There are many other mammography datasets, with varying volume and quality. Table 14.2 summarizes the most popular of these publicly available data sources.

14.3.3 Data Issues During Training

It is very rarely possible to collect a dataset that is perfectly balanced with respect to different classes and features, completely unbiased, and plentiful. Even where this is possible, it can be very expensive and time-consuming. More often than not, researchers need to carefully consider the imperfections of their datasets in order to achieve the desired results.

14.3.3.1 Dataset Imbalance

It is frequently the case in medical imaging that the class we are most interested in accurately predicting is also the least frequent one. For example, the prevalence of breast cancer in a screening population is between 0.6 and 1.0%. Assuming a dataset consists of standard views (CC and MLO) for each breast, and that observing malignancies in both sides is relatively rare, it is possible that as many as 99.7% of the images will be benign. Naturally, developers wish to take advantage of all the available images, but severe class imbalance causes problems during model training. Some of the main issues can be identified as follows.

Table 14.2 Commonly used mammography datasets for deep learning

Name | Origin | Year | No. of cases | No. of images | Access
MIAS | UK | 1994 | 161 | 322 | Public
OPTIMAM | UK | 2008 | 9559 | 154,078 | On request
DDSM and CBIS-DDSM | USA | 1999 | 2620 | 10,480 | Public
Nijmegen | Netherlands | 1998 | 21 | 40 | On request
Trueta | Spain | 2008 | 89 | 320 | On request
IRMA | Germany | 2008 | Unknown | 10,509 | Public
MIRAcle | Greece | 2009 | 196 | 204 | Unknown
LLNL | USA | Unknown | 50 | 198 | Cost
Malaga | Spain | Unknown | 35 | Unknown | Unknown
NDMA | USA | Unknown | Unknown | 1,000,000 | On request
BancoWeb | Brazil | 2010 | 320 | 1400 | Public
Inbreast | Portugal | 2012 | 115 | 410 | On request
BCDR-F0X | Portugal | 2012 | 1010 | 3703 | On request
BCDR-D0X | Portugal | 2012 | 724 | 3612 | On request
SNUBH | Korea | 2015 | Unknown | 49 | Public

Insufficient Data in Minority Class
The data points in the minority class may be insufficient for training a model with the desired capacity. The straightforward solution is simply to acquire more data. If that is not an option, we may resort to transforming our existing data in a plausible way (e.g., by adding some noise or rotating the image) or to generating realistic synthetic data. Needless to say, none of these alternatives is a perfect substitute for high-quality real data. A sketch of simple label-preserving transforms is shown below.
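The transforms and noise level in this sketch are illustrative choices only; real augmentation policies are tuned and validated per task.

```python
import numpy as np

def augment_minority_patch(patch, rng=None):
    """Simple label-preserving transforms: 90-degree rotations, flips,
    and additive Gaussian noise. Assumes `patch` is a 2-D float array
    scaled to [0, 1]."""
    rng = rng or np.random.default_rng()
    patch = np.rot90(patch, k=int(rng.integers(4)))
    if rng.random() < 0.5:
        patch = np.fliplr(patch)
    return np.clip(patch + rng.normal(0.0, 0.01, patch.shape), 0.0, 1.0)
```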

Account for Class Imbalance During Training
It is necessary to account for class imbalance, otherwise training may fail completely. During supervised learning, at every step of the iterative training process, one tweaks the model parameters towards minimizing a loss function indicative of the model's performance. In most cases this function needs to be differentiable; examples of loss functions are the error rate, entropy, or mean square error. These metrics are adversely affected by data imbalance. Consider the case of a model that always predicts that a breast screening case is benign: this model will be correct at least 99% of the time in a screening population, but of course such a model would be of no practical use. There are some simple solutions for re-balancing prevalence, applied either individually or in combination (a code sketch follows the list):

• Oversampling the minority class
• Transforming the minority class in a plausible manner [85]
• Under-sampling the majority class.
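With PyTorch, for instance, oversampling can be implemented with a WeightedRandomSampler so that minority-class images are drawn more often; the dataset and label array here are assumed inputs.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def balanced_loader(dataset, labels, batch_size=32):
    """Draw each example with probability inversely proportional to
    its class frequency, so batches are roughly class-balanced."""
    labels = np.asarray(labels)
    class_counts = np.bincount(labels)
    weights = 1.0 / class_counts[labels]
    sampler = WeightedRandomSampler(
        torch.as_tensor(weights, dtype=torch.double),
        num_samples=len(labels), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```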
clinical setting the data was derived from. Over-
14.3.3.2 Dataset Bias fitting to one specific hardware manufacturer is
Detecting biases in the datasets is crucial for a real risk with a non-heterogeneous dataset, as
any machine learning method, as any discrepancy is overfitting a model to a symptomatic cohort
between the training data and reality will most of patients. The latter usually present with much
certainly be reflected in the model performance. larger and more obvious changes on their imag-
Imagine a scenario in which most of the benign ing. More subtle malignancies, as usually found
cases we have in our possession come from in a screening setting, may be overlooked by
one type of scanner and most of the malignant an overfitted model to symptomatic patient data.
Fig. 14.6 (a) Illustration of training and validation loss, and the regions of under-fitting and over-fitting. The effect is similar with the x-axis representing model capacity or number of iterations. (b) The effect of model capacity with the amount of training data
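A common practical guard is early stopping on the validation loss; in the sketch below, train_one_epoch and validation_loss are assumed placeholder callables supplied by the user.

```python
def fit_with_early_stopping(model, train_one_epoch, validation_loss,
                            patience=5, max_epochs=100):
    """Stop training once the validation loss has failed to improve
    for `patience` consecutive epochs, i.e., before over-fitting
    runs away."""
    best_loss, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        loss = validation_loss(model)
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # validation loss has plateaued or started rising
    return model
```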

Finally, in order for a deep learning algorithm to be useful across a wide variety of clinical practices, care must be taken to make the training data as representative as possible, so as to yield a more generalizable final model.
14.3.4 Data Labeling

A key ingredient in the success of any machine learning algorithm is how well a dataset is labeled. In real-world applications, at the current level of the field at the time of writing this book, no amount of sophisticated research techniques, architectures, and computational resources can make up for poorly labeled data. This general principle is colloquially described as “garbage in, garbage out” within the community. It is an even greater challenge in medical imaging, as a “ground truth” can be difficult to establish due to significant levels of inter- and intra-radiologist variability, coupled with sometimes obscure definition metrics. A perfect example of this is breast density. Radiologists disagree not only with each other, but also with themselves, i.e., they might provide widely differing opinions at repeat assessments [86, 87]. These inconsistencies inevitably lead to noisier target labels, which can make naive training a challenge. However, a good consensus requires multiple radiologists to label each image, which quickly becomes prohibitively expensive (a simple consensus scheme is sketched below). A similar problem exists for the BIRADS scheme for malignancy assessment. The subjective differences between a case being labeled BIRADS 4 (a 30% PPV for malignancy) and BIRADS 5 (a 95% PPV for malignancy) again lead to significant variability among readers, especially when considering difficult features such as architectural distortions and asymmetries [88]. The strongest possible marker for malignancy in this case is to use biopsy-proven follow-up results. However, this data may not always be available.
The problem is further exacerbated if pixel-level labeling is required (as is the case for patch-based and segmentation approaches). This can be especially difficult when combining datasets from various sources, especially those in the public domain. Figure 14.7a highlights a badly annotated calcification in the publicly available DDSM database; here the annotation passes outside of the breast region and has large areas where there are no calcifications present. Another problem arises in how specific anomalies were labeled; this is particularly tricky for calcifications, where precise hand annotation would be prohibitively time-consuming, whereas coarser region-of-interest annotations may suffer from a low signal-to-noise ratio. Figure 14.7b highlights the ideal scenario where each individual microcalcification is labeled, as well as the overall cluster.
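Where multiple readers are available, a common mitigation is to take a consensus of their labels. The sketch below computes a simple majority vote across readers; the three hypothetical readers and their density-style labels are invented for illustration, and real consensus schemes may additionally weight readers by experience.

```python
import numpy as np

# Rows: readers, columns: cases. Labels are BI-RADS-style density categories 1-4
# from three hypothetical readers.
reader_labels = np.array([
    [1, 3, 4, 2, 2],   # reader A
    [2, 3, 4, 2, 1],   # reader B
    [1, 3, 3, 2, 2],   # reader C
])

def majority_vote(labels):
    """Per-case consensus: the most frequent label, ties broken by the lower label."""
    consensus = []
    for case in labels.T:
        values, counts = np.unique(case, return_counts=True)
        consensus.append(values[np.argmax(counts)])
    return np.array(consensus)

print(majority_vote(reader_labels))  # -> [1 3 4 2 2]
```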

Fig. 14.7 Images from various mammography databases. (a) The blue contour highlights the breast edge, and the green contours are the lesion annotations. (b) The green boxes are individually labeled calcifications; the red box is the coarser cluster annotation

14.3.5 Principled Uncertainties

The most widely used deep learning models have an important shortcoming: they lack an underlying mechanism to provide uncertainty information about the predictions they make. Instead they often output point estimates between 0 and 1, which are often taken blindly as a measure of confidence. There have already been a few high profile cases where blindly trusting decisions made by deep learning algorithms has had disastrous consequences. For example, in 2016 and again in 2018, there were fatalities due to mistakes made by the perception systems of autonomous vehicles. In health care a wrong decision can be a matter of life or death, so being able to place trust in the decisions of deep learning models applied to such an industry is of critical importance.
Normally, a statistical model would output an entire predictive distribution as opposed to a point estimate. The spread of this distribution would tell us how confident a model is in its prediction and, consequently, how much we should trust it: a narrow spread of values would indicate high confidence, a broad spread the opposite. However, obtaining the exact predictive distribution of a deep learning model is an intractable problem due to their size and complexity. There have been significant recent attempts to approximate the predictive distribution [90, 91], but solving this problem is still an active area of research.
If the outputs of a deep learning model could be calibrated to a meaningful scale, such as the probability of malignancy, then we would be safe in interpreting them as a measure of confidence: an output of 0.7 in a binary malignant-or-not mammography classification task would mean a 70% chance of the scan containing a cancer, allowing a well-reasoned recall decision to be made. Unfortunately, the outputs of deep learning algorithms are notoriously un-calibrated [92] (Fig. 14.8); improving this calibration is once again an active area of research [92, 93].
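One of the approximation techniques referenced above, Monte Carlo dropout [90], is straightforward to sketch: dropout is kept active at test time, and the spread of several stochastic forward passes is used as an uncertainty estimate. The model below is a toy stand-in, not a mammography network.

```python
import torch
import torch.nn as nn

# Toy classifier with dropout; a real system would use a large CNN.
model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 1), nn.Sigmoid(),
)

def mc_dropout_predict(model, x, n_samples=50):
    """Approximate the predictive distribution by sampling dropout masks [90]."""
    model.train()  # keep dropout active; no weights are updated here
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)

x = torch.randn(1, 16)  # placeholder "image features"
mean, std = mc_dropout_predict(model, x)
print(f"malignancy score {mean.item():.2f} +/- {std.item():.2f}")
```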

Fig. 14.8 Calibration plots for a 110-layer ResNet on CIFAR-100. Confidence is the output of the network. The heights of the blue bars give the actual accuracy achieved by thresholding at the corresponding confidence. The shaded red bars show the discrepancy between the ResNet and a perfectly calibrated model; the ResNet is overconfident in its predictions. Applied to breast screening this may correspond to an excessively high recall rate. Reproduced with kind permission from [92]

14.3.6 Interpretability

With deep learning being used more widely in practical applications, there has been a lot of scrutiny on the underlying algorithms and how automated decisions are reached. Regulatory bodies in the USA, EU, and other countries have already set some requirements for explainability of automated decisions. Researchers proposing deep learning methods for breast image analysis are also making efforts to achieve some level of interpretability of the proposed algorithms [54]. Arguably, this is both due to the anticipation that interpretability will be a requirement for regulatory approval and for reassuring users (doctors and patients) that the algorithms perform as intended. However, some have argued that CAD systems should not give any localizing information to radiologists at all, and should instead simply allow them to review any images deemed suspicious by the system and marked for recall [94]. This allows a direct “machine read” output to be plugged into existing systems as an independent second or third reader, and completely avoids introducing any anchoring bias into the human-reading process.

Theoretical Perspective From a purely theoretical standpoint, a deep neural network has a deterministic and fully interpretable behavior. It is deterministic in the sense that the same input will result in the same output every time. It is fully interpretable in the sense that we can trace the final decision back to each activated neuron, and all activations can be traced back to the pixels that contributed. We can go even further and visualize the areas of the image that contribute most to the decision, as presented in [95] and sketched at the end of this section. However, this mode of interpretation does not necessarily make sense from a human perspective.
Necessity for Interpretability The main question is whether interpretability is really something necessary to have. There are undoubtedly cases where it adds value to a system and others where it arguably does not. Let us assume we wish to build a system to assist junior breast radiologists during their training. In that case, being able to explain why the system flagged a malignancy in an image is very valuable. However, assuming we wish to deploy a deep neural network for automated breast screening in a country with no breast screening program: would interpretability add any value in that case? More importantly, if faced with the decision, should we choose an interpretable system with lower sensitivity over a “black-box” one?

Interpretability Through Supervision Even in cases when interpretability cannot be naturally achieved, it can still be learned. For instance, networks can be trained to attend to and base their decisions on the regions of a mammogram that a human radiologist has indicated as important. We can even train networks to generate textual explanations of their decisions, learning that skill from radiologists' reports.
There are, however, several issues associated with doing this:

• The task to be learned becomes significantly more complex and difficult to learn.
• Annotations become even more costly and time-consuming.
• There may be inconsistencies between annotators, due to the subjective nature of the task.

Finally, there is no guarantee whatsoever that a network trained to explain itself will outperform one trained on a much simpler task, and the chances are that it will not.
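Returning to the theoretical perspective above, the pixel-level attribution of Simonyan et al. [95] can be sketched in a few lines: the magnitude of the gradient of the output score with respect to each input pixel indicates how strongly that pixel influences the decision. The tiny network and random input here are placeholders, not a trained mammography model.

```python
import torch
import torch.nn as nn

# Placeholder classifier; in practice this would be a trained mammography model.
model = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(4, 1),
)
model.eval()

image = torch.randn(1, 1, 64, 64, requires_grad=True)  # one "mammogram"
score = model(image).squeeze()
score.backward()  # gradient of the malignancy score w.r.t. every pixel

# Saliency: pixels with large absolute gradient influence the score most [95].
saliency = image.grad.abs().squeeze()
print(saliency.shape)  # torch.Size([64, 64])
```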

14.4 Future Directions

14.4.1 Generative Adversarial Networks (GANs)

The ability to train generative models that are able to synthesize images indistinguishable from real ones is the ultimate modeling objective: it implies that the underlying mechanism or distribution responsible for generating the observed images has been successfully captured. Of course, capturing the mechanism responsible for producing gigapixel breast imaging data is very difficult. Nevertheless, it is convenient that generative models do not necessarily require labeled data to train.
A framework for training generative models in an adversarial manner was introduced in the seminal paper of Goodfellow et al. [96] and signified a leap forward towards effectively training models for high-fidelity image generation. It is based on a simple but powerful idea: a generator neural network is trained to produce realistic examples, while a discriminator (a “critic”) is trained to discern between real and fake ones. The two networks form an adversarial relationship and gradually improve one another through competition, much like two opponents in a game (Fig. 14.9).

Fig. 14.9 The discriminator is trained to distinguish between real and synthetic images. The generator attempts to produce realistic images indistinguishable by the discriminator. The two networks gradually improve one another through this competition. Learning can only take place at the equilibrium between the two adversaries

GANs are not the only approach to generative modeling, but they are arguably the most successful one at present. Alternative methods include:

• Variational auto-encoders (VAEs) [97]: Without an adversarial loss, these methods use perceptual image similarity losses and tend to produce blurrier, less sharp images.
• Auto-regressive models (pixel RNNs) [98]: A recurrent neural network is an auto-regressive model that can be used to generate images sequentially, one pixel after the other. It has shown promise, but has not so far scaled to higher resolutions.

Adversarial training, on the contrary, has been demonstrated to generate sharp images. Training these models, however, is notoriously difficult due to the severe instability that manifests itself when the equilibrium between generator and discriminator is lost. For that reason, a great deal of contemporary research is focused on stabilizing the training and improving our theoretical understanding. A minimal sketch of the adversarial training loop follows.
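To make the adversarial game concrete, the sketch below trains a GAN on 1D toy data in the style of [96]. It deliberately omits the many stabilization tricks contemporary research focuses on, and all sizes and hyper-parameters are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

latent_dim = 8
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def real_batch(n=64):
    # Toy "real data": 2D points around (2, 2) standing in for images.
    return torch.randn(n, 2) * 0.5 + 2.0

for step in range(1000):
    # --- Discriminator (the critic): real -> 1, fake -> 0 ---
    real = real_batch()
    fake = G(torch.randn(64, latent_dim)).detach()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- Generator: fool the critic into predicting "real" ---
    fake = G(torch.randn(64, latent_dim))
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

print("generated sample:", G(torch.randn(1, latent_dim)).detach().numpy())
```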
The significance of GANs goes far beyond generating realistic images of faces, furniture, or natural scenes and, as we previously mentioned, stems from modeling the underlying data distribution. There are several applications of high significance for the medical imaging community and, by extension, breast imaging. In Fig. 14.10 we provide examples of some early work on whole mammogram synthesis.

Fig. 14.10 One of the two rows of images in this figure consists of real mammograms and the other of synthetically generated ones. Can you tell which is which?

Synthetic Data Generation The expectation of GANs is that well-trained generative models could be used to synthesize an unlimited amount of high-quality images, which could then be used to improve downstream detection and classification models. Promising examples of using GAN-generated synthetic data are recently emerging in the literature. For instance, Salehinejad et al. [99] generated X-ray images and Costa et al. [100] retinal images with the accompanying vessel segmentation. Our group has also published early work on the synthesis of high resolution mammograms [101] (Fig. 14.10). However, more evidence and clinical evaluation are required before there can be wide adoption of such methods.
Semi-supervised Learning Semi-supervised learning is a very effective way to leverage unlabeled data to increase model performance or reduce the requirement for labeled examples. The concept is closely related to multi-task learning, where jointly modeling multiple tasks is beneficial to each individual task as well. In the semi-supervised case, an additional benefit is that at least one of the tasks learned does not require labeled data. The benefit could be superior performance for the same amount of labeled data, or a more graceful degradation when reducing the amount of labeled data. This may be of particular use in mammography, where small datasets are publicly available but large amounts of labeled data aren't as readily accessible. GANs have been particularly useful in semi-supervised learning, and several studies have shown the aforementioned benefits in practice, both in modeling natural [102] and medical images [103]. A toy example of the general idea is sketched below.
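A very simple member of the semi-supervised family, pseudo-labeling, can be sketched in a few lines with scikit-learn. This is not the GAN-based approach of [102, 103]: the synthetic blobs stand in for image features, and the 0.9 confidence threshold is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=500, centers=2, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[:25] = True  # only 5% of cases carry labels

clf = LogisticRegression().fit(X[labeled], y[labeled])

# Pseudo-label the unlabeled pool where the model is confident, then retrain.
proba = clf.predict_proba(X[~labeled])
confident = proba.max(axis=1) > 0.9
pseudo_y = proba.argmax(axis=1)[confident]

X_aug = np.vstack([X[labeled], X[~labeled][confident]])
y_aug = np.concatenate([y[labeled], pseudo_y])
clf = LogisticRegression().fit(X_aug, y_aug)
print(f"retrained on {len(y_aug)} labeled + pseudo-labeled cases")
```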
Domain Adaptation A common problem in medical imaging is being able to transfer a trained model to a different modality, manufacturer, or other domain where labeled data are scarce or unavailable. Generative adversarial networks have been successfully used in medical imaging to do so. For example, Kamnitsas et al. [104] used GANs for domain transfer in brain CT semantic segmentation, and Wolterink et al. [105] used them to transfer from low to regular-dose CT. Future work on tomosynthesis imaging (see Sect. 14.4.3) may benefit from the use of GANs for domain adaptation.

14.4.2 Active Learning and Regulation

Current regulatory processes do not allow for active learning using deep learning models.

A “build and freeze” framework is the current standard, requiring developers to validate their model on a rigid dataset, report the results, and then apply for regulatory approval.
In the future it might well be possible for models implemented in hospitals to make use of active learning, whereby networks continuously learn from new clinical data in a live setting. In a symbiotic system, a biopsied mass could be used to provide a data label, and once this label becomes available the image could be added to a liquid training dataset. A positive biopsy result would create a positive label, along with metadata including phenotyping and genomics of the malignancy subtype. A negative result would not be assigned a data label until a set amount of time has passed, for example 2 years, giving more confidence to the negative label (a sketch of this rule follows below). This should allow the system to improve continuously, especially with the prospect of a global network learning from thousands of scans a day to help radiologists.
There are several barriers to overcome before a constantly learning system could be deployed: patient consent, validation of a model which is continuously updating, and overcoming variation between clinical sites based on their local data. However, it is up to the regulatory bodies to change their practice before symbiotic, constantly learning systems become feasible.
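The label-assignment logic described above is easy to express in code. The sketch below is hypothetical: the two-year negative window mirrors the example in the text, and the field names are invented.

```python
from datetime import date, timedelta

NEGATIVE_CONFIRMATION_WINDOW = timedelta(days=2 * 365)  # "for example 2 years"

def assign_label(biopsy_result, biopsy_date, today=None):
    """Return a training label for a screening image, or None if not yet usable.

    Positive biopsies label the image immediately; negative ones only count
    once enough follow-up time has passed without an interval cancer.
    """
    today = today or date.today()
    if biopsy_result == "positive":
        return 1
    if biopsy_result == "negative" and today - biopsy_date >= NEGATIVE_CONFIRMATION_WINDOW:
        return 0
    return None  # withhold the case from the liquid training set for now

print(assign_label("negative", date(2017, 1, 1), today=date(2019, 6, 1)))  # 0
print(assign_label("negative", date(2019, 1, 1), today=date(2019, 6, 1)))  # None
```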
14.4.3 Tomosynthesis

While the vast majority of screening programs across the globe currently employ 2D mammography, the rising use of digital breast tomosynthesis (DBT) likely signals the direction for the future of these programs. DBT is a tomography technique in which numerous low-dose X-ray images are acquired in an arc around the compressed breast. A 3D reconstruction is formed from the various projections, in a similar fashion to CT and MRI scans. DBT is also capable of constructing a 2D synthetic image, by superimposing all of the slices in a manner that resembles a traditional 2D mammogram. The primary advantage of DBT over traditional mammography is its application to dense breast tissue. Particularly dense tissue is capable of obscuring certain types of lesions on 2D mammography scans. These lesions are more easily resolved when considering the different angled slices from DBT. DBT thus enhances the morphological properties of abnormal tissue, which should yield significantly better detection rates, while delivering an X-ray dosage only slightly above that of conventional 2D mammography [106] (and well within recommended safety guidelines). The increase in resolution can also help guide biopsies by providing a more accurate target region.
However, as DBT is a relatively new technique (when considering levels of adoption), there is significantly less literature on it compared to traditional FFDM. Initial prospective studies showed either superiority or non-inferiority compared to FFDM, but these were all small in scale; they are summarized in the review by Vedantham et al. [107]. The most significant retrospective study was conducted by Gilbert et al. [108], with DBT showing moderate increases in performance over 2D mammography (AUC increased from 0.84 to 0.88), especially in the case of dense tissue (AUC increased from 0.83 to 0.87). The most significant criticisms of the modality come from a resource perspective, with significant infrastructure updates and training procedures required. As each scan consists of tens of slices, the reading time also increases [109], so hospitals which are already operating at capacity may not easily deal with the increased workload.
This increase in cognitive workload makes DBT a perfect candidate for assistance from machine learning algorithms. Past research on CAD use in DBT is sparse, with results reproducing the same limitations as 2D CAD systems: a prohibitively high number of false positives at adequate sensitivity levels. This has been shown in studies on masses [110], calcifications [111, 112], and both [113]. Thus the development of the next generation of intelligent algorithms, capable of constructively aiding a radiologist, will be critical in facilitating DBT adoption, especially in countries where radiologists are already overworked.

Unfortunately, the increase in overall workload from DBT cases also translates into extra engineering and labeling challenges. All of the memory constraints that apply to traditional four-image 2D cases must now account for cases containing a multitude of images (one possible slice-wise workaround is sketched below). Also, accurate pixel-level labels will be crucial with current techniques, which translates into costly annotation demands. Despite these hurdles, the possibility of adopting DBT as a standard for future screening programs promises an exciting future in which patient outcomes improve. Deep learning algorithms could play a critical role in realizing this future.
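One pragmatic way to respect those memory constraints is to score a DBT volume slice by slice with a 2D model and aggregate, rather than pushing the full volume through a 3D network. This is only one possible design, sketched here with a placeholder model; real systems may aggregate differently.

```python
import torch
import torch.nn as nn

# Placeholder 2D scorer; in practice a trained 2D mammography network.
slice_model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1), nn.Sigmoid(),
)

def score_volume(volume, batch_size=4):
    """Score each slice independently and take the maximum as the case score,
    so only a few slices are ever held in memory at once."""
    scores = []
    with torch.no_grad():
        for start in range(0, volume.shape[0], batch_size):
            batch = volume[start:start + batch_size].unsqueeze(1)  # N x 1 x H x W
            scores.append(slice_model(batch).squeeze(1))
    return torch.cat(scores).max().item()

dbt_volume = torch.randn(60, 256, 256)  # ~60 slices per compressed breast view
print("case-level suspicion score:", score_volume(dbt_volume))
```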
14.4.4 Genomics

The biggest challenge in fighting cancer is the heterogeneity of the disease. Progress in various scientific fields has pushed forward the understanding of the complex biological processes of invasive breast cancer. However, linking molecular data with radiological imaging data is not trivial. An interesting paper analyzing the correlation between breast cancer molecular subtypes and mammographic appearance was published by Killelea et al. [114]. Their retrospective analysis revealed characteristic associations between the appearance of the tumors on the mammographic image and the molecular profile. Architectural distortions were associated with luminal-type cancers, whereas calcifications with or without a mass correlated with HER2-positive cancers. Triple negative cancer was found to be associated with a noncalcified mass. Those promising findings raise the question of what applying deep learning to paired molecular and mammographic data could reveal. Subtle features that are not captured in the high-level statistical analysis by Killelea et al. [114] may give new insights into breast cancer development and may reveal image-based biomarkers that could potentially replace expensive sequencing in the future.
A novel way of analyzing DNA using deep learning was published by Nguyen et al. [115]. The authors propose deep CNNs for DNA sequence analysis. They keep the sequential form of the input by sliding a window of a fixed size over the DNA sequence and encode the resulting words as a binary 2D matrix (a sketch of this style of encoding follows below). The results are promising, but the authors chose their hyper-parameters empirically (word size, region size, network architecture), so further work is required to better understand how those parameters influence the analysis. Yin et al. [116] take an image representation of the DNA sequence as input to a CNN and predict key determinants of chromatin structure. Their approach is able to detect interactions between distal elements in the DNA sequence, as well as the presence and absence of splice-junctions. Compared to Nguyen et al. [115], the authors added residual connections to reuse the learned features, as well as larger convolution filters.
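A sliding-window encoding of this kind can be sketched as follows: each window (“word”) of the sequence becomes one binary row, here via per-base one-hot channels. The window size and sequence are arbitrary illustrative choices, and the exact encoding in [115] may differ in detail.

```python
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_sequence(seq, window=4, stride=1):
    """Slide a fixed-size window over a DNA sequence and one-hot encode each
    window into a binary row, yielding a 2D matrix suitable for a CNN."""
    rows = []
    for start in range(0, len(seq) - window + 1, stride):
        word = seq[start:start + window]
        row = np.zeros((window, 4), dtype=np.uint8)
        for i, base in enumerate(word):
            row[i, BASES[base]] = 1
        rows.append(row.flatten())  # window * 4 binary features per word
    return np.array(rows)

matrix = encode_sequence("ACGTACGGTTA")
print(matrix.shape)  # (8, 16): 8 windows, each a 16-element binary vector
```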
A popular term in the literature is “radiogenomics” [117], which refers to the relationship between imaging phenotypes and tumor genetics, using image-based surrogates for genetic testing [118]. Commercial genetic tests, such as OncotypeDx (Genomic Health Inc., San Francisco), which is used to predict recurrence and therapeutic response, are currently being explored with respect to radiogenomic associations. However, the majority of publications still use traditional machine learning with hand-crafted features to find associations between genetics and image-derived features [118, 119]. The promised future of radiogenomics will require linking massive mammography and genetic datasets, something that is yet to be achieved.

14.5 Summary

It is not surprising to see a flurry of deep learning activity in the mammography sector, especially in Europe, where several countries hold robust nationwide breast screening databases, with every mammogram result, biopsy, and surgical outcome linked to every screening event. Early research in deep learning has shown both sensitivity and specificity of these algorithms approaching that of single human readers. Over the next couple of years we will undoubtedly see deep learning algorithms entering into screening settings.

This will of course necessitate robust clinical trials, both retrospectively to benchmark performance and prospectively to ensure that algorithmic performance is maintained in a real-world clinical setting. The holy grail will be to prove conclusively that deep learning systems can accurately make recall decisions as well as, or better than, human double-reading, while providing highly explainable and interpretable results when needed. However, radiologists are unlikely to hand over the reins just yet, and may instead prefer single-reader programs supported by deep learning, effectively halving the workload for the already overstretched double-reading radiologists.
Deep learning technology could also potentially improve the consistency and accuracy of existing single-reader programs, such as those in the USA, as well as provide an immense new resource to countries yet to implement a screening programme at all. The potential for deep learning support in national screening, as well as for underdeveloped health care systems to leapfrog into the deep learning era, may therefore be just a few years away.
It is interesting to note that despite the advances of traditional CAD, the European guidelines for quality assurance in breast cancer screening and diagnosis (and their subsequent supplements) do not address the evaluation of processing algorithms and CAD [120]. There are, however, consolidated standards focusing on different topics to ensure the technical image quality of mammograms used for screening and assessment is sufficient to achieve the objectives of cancer detection. Perhaps, with the advent of deep learning, these guidelines will eventually be updated to include CAD usage, especially if deep learning systems are eventually proven to demonstrate stand-alone sensitivity and specificity above that of single human readers, while simultaneously reducing recall rates and the occurrence of interval cancers (cancers that present in between screening intervals).
To reach this goal, several hurdles must be overcome. First, larger, more accurately labeled datasets are required, both for algorithmic training and validation. GANs may hold the potential to unlock vast amounts of synthetic training data, although their performance at present is not sufficient to provide robust comparison against real-world data. DBT may also herald a new source of “big data,” simply by providing more images per case. It is certainly within the sights of researchers to utilize domain adaptation techniques to apply 2D mammography algorithms to 3D datasets. Finally, the era of radiogenomics, much anticipated but limited by data availability at scale, will only come of age once genomic testing in breast cancer becomes standard practice.

14.6 Take Home Points

• Breast mammography (2D FFDM) is seen as a key modality ripe for deep learning.
• Deep learning is most likely to act as a second reader in screening programs.
• Deep learning has the potential to improve accuracy and consistency of screening programs.
• As for any medical imaging analysis, access to large labeled datasets remains a challenge.
• Generative adversarial networks may assist in data augmentation via image synthesis.
• 3D tomosynthesis and radiogenomics are the next areas of research for deep learning tools.

References

1. Ferlay J, Steliarova-Foucher E, Lortet-Tieulent J, Rosso S, Coebergh JWW, Comber H, Forman D, Bray F. Cancer incidence and mortality patterns in Europe: estimates for 40 countries in 2012. Eur J Cancer. 2013;49(6):1374–1403.
2. Tabár L, Gad A, Holmberg LH, Ljungquist U, Fagerberg CJG, Baldetorp L, Gröntoft O, Lundström B, Månson JC, Eklund G, Day NE, Pettersson F. Reduction in mortality from breast cancer after mass screening with mammography: randomised trial from the breast cancer screening working group of the Swedish National Board of Health and Welfare. Lancet. 1985;325(8433):829–32.
3. Lee CH, David Dershaw D, Kopans D, Evans P, Monsees B, Monticciolo D, James Brenner R, Bassett L, Berg W, Feig S, Hendrick E, Mendelson E, D'Orsi C, Sickles E, Burhenne LW. Breast cancer screening with imaging: recommendations from the society of breast imaging and the ACR on the use of mammography, breast MRI, breast ultrasound, and other technologies for the detection of clinically occult breast cancer. J Am Coll Radiol. 2010;7(1):18–27.
4. Boyer B, Balleyguier C, Granat O, Pharaboz C. CAD in questions/answers: review of the literature. Eur J Radiol. 2009;69(1):24–33.
5. Duijm LEM, Louwman MWJ, Groenewoud JH, Van De Poll-Franse LV, Fracheboud J, Coebergh JW. Inter-observer variability in mammography screening and effect of type and number of readers on screening outcome. Br J Cancer. 2009;100(6):901–7.
6. Dinitto P, Logan-young W, Bonaccio E, Zuley ML, Willison KM. Breast imaging can computer-aided detection with double reading of screening mammograms help decrease the false-negative rate? Initial experience 1. Radiology. 2004;232(2):578–84.
7. Beam CA, Sullivan DC, Layde PM. Effect of human variability on independent double reading in screening mammography. Acad Radiol. 1996;3(11):891–7.
8. Tice JA, Kerlikowske K. Screening and prevention of breast cancer in primary care. Prim Care. 2009;36(3):533–58.
9. Fletcher SW. Breast cancer screening: a 35-year perspective. Epidemiol Rev. 2011;33(1):165–75.
10. Hofvind S, Geller BM, Skelly J, Vacek PM. Sensitivity and specificity of mammographic screening as practised in Vermont and Norway. Br J Radiol. 2012;85(1020):e1226–32.
11. Domingo L, Hofvind S, Hubbard RA, Román M, Benkeser D, Sala M, Castells X. Cross-national comparison of screening mammography accuracy measures in U.S., Norway, and Spain. Eur Radiol. 2016;26(8):2520–8.
12. Langreth R. Too many mammograms. Forbes; 2009.
13. Taylor P, Champness J, Given-Wilson R, Johnston K, Potts H. Impact of computer-aided detection prompts on the sensitivity and specificity of screening mammography. Health Technol Assess. 2005;9(6):iii, 1–58.
14. Philpotts LE. Can computer-aided detection be detrimental to mammographic interpretation? Radiology. 2009;253(1):17–22.
15. Gilbert FJ, Astley SM, Gillan MGC, Agbaje OF, Wallis MG, James J, Boggis CRM, Duffy SW. Single reading with computer-aided detection for screening mammography. N Engl J Med. 2008;359(16):1675–84.
16. Gilbert FJ, Astley SM, Gillan MG, Agbaje OF, Wallis MG, James J, Boggis CR, Duffy SW. CADET II: a prospective trial of computer-aided detection (CAD) in the UK Breast Screening Programme. J Clin Oncol. 2008;26(15 suppl):508.
17. Taylor P, Potts HWW. Computer aids and human second reading as interventions in screening mammography: two systematic reviews to compare effects on cancer detection and recall rate. Eur J Cancer. 2008;44(6):798–807.
18. Noble M, Bruening W, Uhl S, Schoelles K. Computer-aided detection mammography for breast cancer screening: systematic review and meta-analysis. Arch Gynecol Obstet. 2009;279(6):881–90.
19. Karssemeijer N, Bluekens AM, Beijerinck D, Deurenberg JJ, Beekman M, Visser R, van Engen R, Bartels-Kortland A, Broeders MJ. Breast cancer screening results 5 years after introduction of digital mammography in a population-based screening program. Radiology. 2009;253(2):353–8.
20. Destounis S, Hanson S, Morgan R, Murphy P, Somerville P, Seifert P, Andolina V, Arieno A, Skolny M, Logan-Young W. Computer-aided detection of breast carcinoma in standard mammographic projections with digital mammography. Int J Comput Assist Radiol Surg. 2009;4(4):331–6.
21. van den Biggelaar FJHM, Kessels AGH, Van Engelshoven JMA, Flobbe K. Strategies for digital mammography interpretation in a clinical patient population. Int J Cancer. 2009;125(12):2923–9.
22. Sohns C, Angic B, Sossalla S, Konietschke F, Obenauer S. Computer-assisted diagnosis in full-field digital mammography-results in dependence of readers experiences. Breast J. 2010;16(5):490–7.
23. Murakami R, Kumita S, Tani H, Yoshida T, Sugizaki K, Kuwako T, Kiriyama T, Hakozaki K, Okazaki E, Yanagihara K, Iida S, Haga S, Tsuchiya S. Detection of breast cancer with a computer-aided detection applied to full-field digital mammography. J Digit Imaging. 2013;26(4):768–73.
24. Cole EB, Zhang Z, Marques HS, Edward Hendrick R, Yaffe MJ, Pisano ED. Impact of computer-aided detection systems on radiologist accuracy with digital mammography. Am J Roentgenol. 2014;203(4):909–16.
25. Bargalló X, Santamaría G, Del Amo M, Arguis P, Ríos J, Grau J, Burrel M, Cores E, Velasco M. Single reading with computer-aided detection performed by selected radiologists in a breast cancer screening program. Eur J Radiol. 2014;83(11):2019–23.
26. Lehman CD, Wellman RD, Buist DSM, Kerlikowske K, Tosteson ANA, Miglioretti DL. Diagnostic accuracy of digital screening mammography with and without computer-aided detection. JAMA Intern Med. 2015;175(11):1828.
27. Berry DA. Computer-assisted detection and screening mammography: where's the beef? J Natl Cancer Inst. 2011;103(15):1139–41.
28. Sanchez Gómez S, Torres Tabanera M, Vega Bolivar A, Sainz Miranda M, Baroja Mazo A, Ruiz Diaz M, Martinez Miravete P, Lag Asturiano E, Muñoz Cacho P, Delgado Macias T. Impact of a CAD system in a screen-film mammography screening program: a prospective study. Eur J Radiol. 2011;80(3):e317–21.
29. Freer TW, Ulissey MJ. Screening mammography with computer-aided detection: prospective study of 12,860 patients in a community breast center. Radiology. 2001;220(3):781–6.
30. The JS, Schilling KJ, Hoffmeister JW, Friedmann E, McGinnis R, Holcomb RG. Detection of breast cancer with full-field digital mammography and computer-aided detection. Am J Roentgenol. 2009;192(2):337–40.
31. Rao VM, Levin DC, Parker L, Cavanaugh B, Frangos AJ, Sunshine JH. How widely is computer-aided detection used in screening and diagnostic mammography? J Am Coll Radiol. 2010;7(10):802–5.
32. Onega T, Aiello Bowles EJ, Miglioretti DL, Carney PA, Geller BM, Yankaskas BC, Kerlikowske K, Sickles EA, Elmore JG. Radiologists' perceptions of computer aided detection versus double reading for mammography interpretation. Acad Radiol. 2010;17(10):1217–26.
33. Kohli A, Jha S. Why CAD failed in mammography. J Am Coll Radiol. 2018;15(3 Pt B):535–7.
34. Lehman CD, Arao RF, Sprague BL, Lee JM, Buist DSM, Kerlikowske K, Henderson LM, Onega T, Tosteson ANA, Rauscher GH, Miglioretti DL. National performance benchmarks for modern screening digital mammography: update from the breast cancer surveillance consortium. Radiology. 2017;283(1):49–58.
35. Carney PA, Sickles EA, Monsees BS, Bassett LW, James Brenner R, Feig SA, Smith RA, Rosenberg RD, Andrew Bogart T, Browning S, Barry JW, Kelly MM, Tran KA, Miglioretti DL. Identifying minimally acceptable interpretive performance criteria for screening mammography. Radiology. 2010;255(2):354–61.
36. Miglioretti DL, Ichikawa L, Smith RA, Bassett LW, Feig SA, Monsees B, Parikh JR, Rosenberg RD, Sickles EA, Carney PA. Criteria for identifying radiologists with acceptable screening mammography interpretive performance on basis of multiple performance measures. Am J Roentgenol. 2015;204(4):W486–91.
37. Myers ER, Moorman P, Gierisch JM, Havrilesky LJ, Grimm LJ, Ghate S, Davidson B, Mongtomery RC, Crowley MJ, McCrory DC, Kendrick A, Sanders GD. Benefits and harms of breast cancer screening: a systematic review. J Am Med Assoc. 2015;314:1615–34.
38. Sahiner B, Chan HP, Petrick N, Wei D, Helvie MA, Adler DD, Goodsitt MM. Classification of mass and normal breast tissue: a convolution neural network classifier with spatial domain and texture images. IEEE Trans Med Imaging. 1996;15(5):598–610.
39. Dhungel N, Carneiro G, Bradley AP. Automated mass detection from mammograms using deep learning and random forest. In: International conference on digital image computing: techniques and applications; 2015. p. 1–8.
40. Ertosun MG, Rubin DL. Probabilistic visual search for masses within mammography images using deep learning. In: IEEE international conference on bioinformatics and biomedicine; 2015. p. 1310–5.
41. Carneiro G, Nascimento J, Bradley AP. Unregistered multiview mammogram analysis with pre-trained deep learning models. In: Proceedings of the 18th international conference on medical image computing and computer-assisted intervention. Lecture notes in computer science. Vol 9351. Cham: Springer; 2015. p. 652–60.
42. Moreira IC, Amaral I, Domingues I, Cardoso A, Cardoso MJ, Cardoso JS. INbreast: toward a full-field digital mammographic database. Acad Radiol. 2012;19(2):236–48.
43. Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, Moore S, Phillips S, Maffitt D, Pringle M, Tarbox L, Prior F. The cancer imaging archive (TCIA): maintaining and operating a public information repository. J Digit Imaging. 2013;26(6):1045–57.
44. Kooi T, Litjens G, van Ginneken B, Gubern-Mérida A, Sánchez CI, Mann R, den Heeten A, Karssemeijer N. Large scale deep learning for computer aided detection of mammographic lesions. Med Image Anal. 2017;35:303–12.
45. Teare P, Fishman M, Benzaquen O, Toledano E, Elnekave E. Malignancy detection on mammography using dual deep convolutional neural networks and genetically discovered false color input enhancement. J Digit Imaging. 2017;30(4):499–505.
46. Kim E-K, Kim H-E, Han K, Kang BJ, Sohn Y-M, Woo OH, Lee CW. Applying data-driven imaging biomarker in mammography for breast cancer screening: preliminary study. Sci Rep. 2018;8(1):2762.
47. Elter M, Horsch A. CADx of mammographic masses and clustered microcalcifications: a review. Med Phys. 2009;36(6):2052–68.
48. Breast screening: consolidated programme standards - GOV.UK; 2017.
49. Rothschild J, Lourenco AP, Mainiero MB. Screening mammography recall rate: does practice site matter? Radiology. 2013;269(2):348–53.
50. Sage Bionetworks. The Digital Mammography DREAM Challenge; 2016.
51. Ribli D, Horváth A, Unger Z, Pollner P, Csabai I. Detecting and classifying lesions in mammograms with deep learning. Sci Rep. 2018;8(1):4165.
52. Dhungel N, Carneiro G, Bradley AP. The automated learning of deep features for breast mass classification from mammograms. In: International conference on medical image computing and computer-assisted intervention. Cham: Springer; 2016. p. 106–14.
53. Arevalo J, Gonzalez FA, Ramos-Pollan R, Oliveira JL, Lopez MAG. Convolutional neural networks for mammography mass lesion classification. In: IEEE Engineering in Medicine and Biology Society (EMBC). Washington: IEEE; 2015. p. 797–800.
54. Lévy D, Jain A. Breast mass classification from mammograms using deep convolutional neural networks; 2016. arxiv:1612.00542.
55. Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell. 2018;40(4):834–48.
56. Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks; 2016. arxiv:1506.01497.
57. Li Y, He K, Sun J. R-fcn: object detection via region-based fully convolutional networks. In: Advances in neural information processing systems; 2016.
58. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y, Berg AC. SSD: single shot multibox detector; 2016. arxiv:1512.02325.
59. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: Medical image computing and computer-assisted intervention – MICCAI 2015; 2015. p. 234–41.
60. Zhu W, Xiang X, Tran TD, Xie X. Adversarial deep structural networks for mammographic mass segmentation; 2017. arxiv:1612.05970.
61. de Moor T, Rodriguez-Ruiz A, Mérida AG, Mann R, Teuwen J. Automated soft tissue lesion detection and segmentation in digital mammography using a u-net deep learning network; 2018. arxiv:1802.06865.
62. Uijlings JRR, van de Sande KEA, Gevers T, Smeulders AWM. Selective search for object recognition. Int J Comput Vis. 2013;104(2):154–71.
63. He K, Gkioxari G, Dollár P, Girshick R. Mask R-CNN; 2017. arxiv:1703.06870.
64. Assi V, Warwick J, Cuzick J, Duffy SW. Clinical and epidemiological issues in mammographic density. Nat Rev Clin Oncol. 2012;9(1):33–40.
65. Colin C, Schott-Pethelaz A-M. Mammographic density as a risk factor: to go out of a 30-year fog. Acta Radiol. 2017;58(6):NP1.
66. Colin C. Mammographic density: is there a public health significance linked to published relative risk data? Radiology. 2017;284(3):918–9.
67. Martin LJ, Melnichouk O, Guo H, Chiarelli AM, Hislop TG, Yaffe MJ, Minkin S, Hopper JL, Boyd NF. Family history, mammographic density, and risk of breast cancer. Cancer Epidemiol Biomarkers Prev. 2010;19(2):456–63.
68. Shepherd JA, Kerlikowske K, Ma L, Duewer F, Fan B, Wang J, Malkov S, Vittinghoff E, Cummings SR. Volume of mammographic density and risk of breast cancer. Cancer Epidemiol Biomarkers Prev. 2011;20(7):1473–82.
69. Boyd N, Martin L, Gunasekara A, Melnichouk O, Maudsley G, Peressotti C, Yaffe M, Minkin S. Mammographic density and breast cancer risk: evaluation of a novel method of measuring breast tissue volumes. Cancer Epidemiol Biomarkers Prev. 2009;18(6):1754–62.
70. Aitken Z, McCormack VA, Highnam RP, Martin L, Gunasekara A, Melnichouk O, Mawdsley G, Peressotti C, Yaffe M, Boyd NF, dos Santos Silva I. Screen-film mammographic density and breast cancer risk: a comparison of the volumetric standard mammogram form and the interactive threshold measurement methods. Cancer Epidemiol Biomarkers Prev. 2010;19(2):418–28.
71. Gastounioti A, Conant EF, Kontos D. Beyond breast density: a review on the advancing role of parenchymal texture analysis in breast cancer risk assessment. Breast Cancer Res. 2016;18(1):91.
72. Astley SM, Harkness EF, Sergeant JC, Warwick J, Stavrinos P, Warren R, Wilson M, Beetles U, Gadde S, Lim Y, Jain A, Bundred S, Barr N, Reece V, Brentnall AR, Cuzick J, Howell T, Evans DG. A comparison of five methods of measuring mammographic density: a case-control study. Breast Cancer Res. 2018;20(1):10.
73. Manduca A, Carston MJ, Heine JJ, Scott CG, Pankratz VS, Brandt KR, Sellers TA, Vachon CM, Cerhan JR. Texture features from mammographic images and risk of breast cancer. Cancer Epidemiol Biomarkers Prev. 2009;18(3):837–45.
74. Li J, Szekely L, Eriksson L, Heddson B, Sundbom A, Czene K, Hall P, Humphreys K. High-throughput mammographic-density measurement: a tool for risk prediction of breast cancer. Breast Cancer Res. 2012;14(4):R114.
75. Häberle L, Wagner F, Fasching PA, Jud SM, Heusinger K, Loehberg CR, Hein A, Bayer CM, Hack CC, Lux MP, Binder K, Elter M, Münzenmayer C, Schulz-Wendtland R, Meier-Meitinger M, Adamietz BR, Uder M, Beckmann MW, Wittenberg T. Characterizing mammographic images by using generic texture features. Breast Cancer Res. 2012;14(2):R59.
76. Bott R. ACR BI-RADS atlas. In: Igarss 2014; 2014.
77. Gram IT, Funkhouser E, Tabár L. The Tabar classification of mammographic parenchymal patterns. Eur J Radiol. 1997;24:131–6.
78. Petersen K, Nielsen M, Diao P, Karssemeijer N, Lillholm M. Breast tissue segmentation and mammographic risk scoring using deep learning. In: International workshop on breast imaging. Lecture notes in computer science. Vol 8539. Cham: Springer; 2014. p. 88–94.
79. Wu N, Geras KJ, Shen Y, Su J, Gene Kim S, Kim E, Wolfson S, Moy L, Cho K. Breast density classification with deep convolutional neural networks; 2017. arxiv:1711.03674.
80. Shin SY, Lee S, Yun ID, Jung HY, Heo YS, Kim SM, Lee SM. A novel cascade classifier for automatic microcalcification detection. Public Libr Sci. 2015;10(12):e0143725.
81. Chen T, Xu B, Zhang C, Guestrin C. Training deep nets with sublinear memory cost; 2016. arxiv:1604.06174.
82. Gomez AN, Ren M, Urtasun R, Grosse RB. The reversible residual network: backpropagation without storing activations; 2017. arxiv:1707.04585.
83. Lee RS, Gimenez F, Hoogi A, Miyake KK, Gorovoy M, Rubin DL. Data descriptor: a curated mammography data set for use in computer-aided detection and diagnosis research. Sci Data. 2017;4:170177.
84. Xi P, Shu C, Goubran R. Abnormality detection in mammography using deep convolutional neural networks; 2018. arxiv:1803.01906.
85. Chawla N, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57.
86. Keller BM, Nathan DL, Gavenonis SC, Chen J, Conant EF, Kontos D. Reader variability in breast density estimation from full-field digital mammograms: the effect of image postprocessing on relative and absolute measures. Acad Radiol. 2013;20(5):560–8.
87. Redondo A, Comas M, Macià F, Ferrer F, Murta-Nascimento C, Maristany MT, Molins E, Sala M, Castells X. Inter- and intraradiologist variability in the BI-RADS assessment and breast density categories for screening mammograms. Br J Radiol. 2012;85(1019):1465–70.
88. Lee AY, Wisner DJ, Aminololama-Shakeri S, Arasu VA, Feig SA, Hargreaves J, Ojeda-Fournier H, Bassett LW, Wells CJ, De Guzman J, Flowers CI, Campbell JE, Elson SL, Retallack H, Joe BN. Inter-reader variability in the use of BI-RADS descriptors for suspicious findings on diagnostic mammography: a multi-institution study of 10 academic radiologists. Acad Radiol. 2017;24(1):60–6.
89. Heath M, Bowyer K, Kopans D, Kegelmeyer P, Moore R, Chang K, Munishkumaran S. Current status of the digital database for screening mammography. In: Digital mammography. Dordrecht: Springer; 1998. p. 457–60.
90. Gal Y, Ghahramani Z. Dropout as a Bayesian approximation: representing model uncertainty in deep learning; 2015. arxiv:1506.02142.
91. Kendall A, Gal Y. What uncertainties do we need in Bayesian deep learning for computer vision?; 2017. arxiv:1703.04977.
92. Guo C, Pleiss G, Sun Y, Weinberger KQ. On calibration of modern neural networks; 2017. arxiv:1706.04599.
93. Cobb AD, Roberts SJ, Gal Y. Loss-calibrated approximate inference in Bayesian neural networks; 2018. arxiv:1805.03901.
94. Nishikawa RM, Bae KT. Importance of better human-computer interaction in the era of deep learning: mammography computer-aided diagnosis as a use case. J Am Coll Radiol. 2018;15(1):49–52.
95. Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks: visualising image classification models and saliency maps; 2013. arxiv:1312.6034.
96. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: Advances in neural information processing systems; 2014. p. 2672–80.
97. Kingma DP, Welling M. Auto-encoding variational Bayes. In: International conference on learning representations; 2014.
98. van den Oord A, Kalchbrenner N, Kavukcuoglu K. Pixel recurrent neural networks. In: International conference on machine learning. Vol 48; 2016. p. 1747–56.
99. Salehinejad H, Valaee S, Dowdell T, Colak E, Barfett J. Generalization of deep neural networks for chest pathology classification in X-rays using generative adversarial networks. In: IEEE international conference on acoustics, speech and signal processing (ICASSP); 2018.
100. Costa P, Galdran A, Meyer MI, Niemeijer M, Abramoff M, Mendonca AM, Campilho A. End-to-end adversarial retinal image synthesis. IEEE Trans Med Imaging. 2018;37(3):781–91.
101. Korkinof D, Rijken T, O'Neill M, Yearsley J, Harvey H, Glocker B. High-resolution mammogram synthesis using progressive generative adversarial networks; 2018. arxiv:1807.03401.
102. Adiwardana D, et al. Using generative models for semi-supervised learning. In: Medical image computing and computer-assisted intervention – MICCAI 2016; 2016. p. 106–14.
103. Lahiri A, Ayush K, Biswas PK, Mitra P. Generative adversarial learning for reducing manual annotation in semantic segmentation on large scale microscopy images: automated vessel segmentation in retinal fundus image as test case. In: IEEE Computer Society conference on computer vision and pattern recognition workshops, July 2017; 2017. p. 794–800.
104. Kamnitsas K, Baumgartner C, Ledig C, Newcombe V, Simpson J, Kane A, Menon D, Nori A, Criminisi A, Rueckert D, Glocker B. Unsupervised domain adaptation in brain lesion segmentation with adversarial networks. In: Lecture notes in computer science. Vol 10265. Cham: Springer; 2017. p. 597–609.
105. Wolterink JM, Leiner T, Viergever MA, Isgum I. Generative adversarial networks for noise reduction in low-dose CT. IEEE Trans Med Imaging. 2017;36(12):2536–45.
106. Gennaro G, Bernardi D, Houssami N. Radiation dose with digital breast tomosynthesis compared to digital mammography: per-view analysis. Eur Radiol. 2018;28(2):573–81.
107. Vedantham S, Karellas A, Vijayaraghavan GR, Kopans DB. Digital breast tomosynthesis: state of the art. Radiology. 2015;277(3):663–84.
108. Gilbert FJ, Tucker L, Gillan MGC, Willsher P, Cooke J, Duncan KA, Michell MJ, Dobson HM, Lim YY, Suaris T, Astley SM, Morrish O, Young KC, Duffy SW. Accuracy of digital breast tomosynthesis for depicting breast cancer subgroups in a UK retrospective reading study (TOMMY trial). Radiology. 2015;277(3):697–706.
109. Connor SJ, Lim YY, Tate C, Entwistle H, Morris J, Whiteside S, Sergeant J, Wilson M, Beetles U, Boggis C, Gilbert F, Astley S. A comparison of reading times in full-field digital mammography and digital breast tomosynthesis. Breast Cancer Res. 2012;14(S1):P26.
110. Chan HP, Wei J, Zhang Y, Helvie MA, Moore RH, Sahiner B, Hadjiiski L, Kopans DB. Computer-aided detection of masses in digital tomosynthesis mammography: comparison of three approaches. Med Phys. 2008;35(9):4087–95.
111. Sahiner B, Chan HP, Hadjiiski LM, Helvie MA, Wei J, Zhou C, Lu Y. Computer-aided detection of clustered microcalcifications in digital breast tomosynthesis: a 3D approach. Med Phys. 2011;39(1):28–39.
112. Samala RK, Chan HP, Lu Y, Hadjiiski L, Wei J, Sahiner B, Helvie MA. Computer-aided detection of clustered microcalcifications in multiscale bilateral filtering regularized reconstructed digital breast tomosynthesis volume. Med Phys. 2014;41(2):021901.
113. Morra L, Sacchetto D, Durando M, Agliozzo S, Carbonaro LA, Delsanto S, Pesce B, Persano D, Mariscotti G, Marra V, Fonio P, Bert A. Breast cancer: computer-aided detection with digital breast tomosynthesis. Radiology. 2015;277(1):56–63.
114. Killelea BK, Chagpar AB, Bishop J, Horowitz NR, Christy C, Tsangaris T, Raghu M, Lannin DR. Is there a correlation between breast cancer molecular subtype using receptors as surrogates and mammographic appearance? Ann Surg Oncol. 2013;20(10):3247–53.
115. Nguyen NG, Tran VA, Ngo DL, Phan D, Lumbanraja FR, Faisal MR, Abapihi B, Kubo M, Satou K. DNA sequence classification by convolutional neural network. J Biomed Sci Eng. 2016;9(9):280–6.
116. Yin B, Balvert M, Zambrano D, Sander M, Wiskunde C. An image representation based convolutional network for DNA classification; 2018. arxiv:1806.04931.
117. Rutman AM, Kuo MD. Radiogenomics: creating a link between molecular diagnostics and diagnostic imaging. Eur J Radiol. 2009;70(2):232–41.
118. Grimm LJ. Breast MRI radiogenomics: current status and research implications. J Magn Reson Imaging. 2016;43(6):1269–78.
119. Incoronato M, Aiello M, Infante T, Cavaliere C, Grimaldi AM, Mirabelli P, Monti S, Salvatore M. Radiogenomic analysis of oncological data: a technical survey. Int J Mol Sci. 2017;18(4):pii: E805.
120. Perry N. European guidelines for quality assurance in breast cancer screening and diagnosis. Ann Oncol. 2006;12(4):295–9.
