Professional Documents
Culture Documents
Introduction To Machine and Deep Learning For Medical Physicists
Introduction To Machine and Deep Learning For Medical Physicists
Introduction To Machine and Deep Learning For Medical Physicists
Sunan Cuia)
Department of Radiation Oncology, University of Michigan, Ann Arbor, MI 48103, USA
Applied Physics Program, University of Michigan, Ann Arbor, MI 48109, USA
Huan-Hsin Tseng
Department of Radiation Oncology, University of Michigan, Ann Arbor, MI 48103, USA
Julia Pakela
Department of Radiation Oncology, University of Michigan, Ann Arbor, MI 48103, USA
Applied Physics Program, University of Michigan, Ann Arbor, MI 48109, USA
Randall K. Ten Haken and Issam El Naqa
Department of Radiation Oncology, University of Michigan, Ann Arbor, MI 48103, USA
(Received 30 September 2019; revised 23 January 2020; accepted for publication 3 March 2020;
published 15 May 2020)
Recent years have witnessed tremendous growth in the application of machine learning (ML) and
deep learning (DL) techniques in medical physics. Embracing the current big data era, medical
physicists equipped with these state-of-the-art tools should be able to solve pressing problems in
modern radiation oncology. Here, a review of the basic aspects involved in ML/DL model building,
including data processing, model training, and validation for medical physics applications is pre-
sented and discussed. Machine learning can be categorized based on the underlying task into super-
vised learning, unsupervised learning, or reinforcement learning; each of these categories has its
own input/output dataset characteristics and aims to solve different classes of problems in medical
physics ranging from automation of processes to predictive analytics. It is recognized that data size
requirements may vary depending on the specific medical physics application and the nature of the
algorithms applied. Data processing, which is a crucial step for model stability and precision, should
be performed before training the model. Deep learning as a subset of ML is able to learn multilevel
representations from raw input data, eliminating the necessity for hand crafted features in classical
ML. It can be thought of as an extension of the classical linear models but with multilayer (deep)
structures and nonlinear activation functions. The logic of going “deeper" is related to learning com-
plex data structures and its realization has been aided by recent advancements in parallel computing
architectures and the development of more robust optimization methods for efficient training of these
algorithms. Model validation is an essential part of ML/DL model building. Without it, the model
being developed cannot be easily trusted to generalize to unseen data. Whenever applying ML/DL,
one should keep in mind, according to Amara’s law, that humans may tend to overestimate the abil-
ity of a technology in the short term and underestimate its capability in the long term. To establish
ML/DL role into standard clinical workflow, models considering balance between accuracy and
interpretability should be developed. Machine learning/DL algorithms have potential in numerous
radiation oncology applications, including automatizing mundane procedures, improving efficiency
and safety of auto-contouring, treatment planning, quality assurance, motion management, and out-
come predictions. Medical physicists have been at the frontiers of technology translation into medi-
cine and they ought to be prepared to embrace the inevitable role of ML/DL in the practice of
radiation oncology and lead its clinical implementation. © 2020 American Association of Physicists
in Medicine [https://doi.org/10.1002/mp.14140]
Key words: deep learning, machine learning, medical physics
e127 Med. Phys. 47 (5), May 2020 0094-2405/2020/47(5)/e127/21 © 2020 American Association of Physicists in Medicine e127
e128 Cui et al.: ML/DL for medical physicists e128
to treatment and be applied to new patients in order to per- trained. With its 152 layers, it won the championship of the
sonalize their prescription. Examples of typical supervised ILSVRC 2015 classification competition with a top-5 error
learning algorithms are logistic regressions,32 support vector rate of 3.57%, only rivaling that of human cognitive ability.
machines (SVMs),33 random forests (RFs),34 and neural net- Some of the most common architectures of DL include
works (NNs).35 convolutional NNs (CNNs),47 recurrent NNs (RNNs),56 varia-
Unsupervised learning36 operates on an input dataset tional autoencoders (VAEs),57 and generative adversarial NNs
without the need for labels. Its goal is to try to draw infer- (GANs)58 as shown in Figs. 11–13, 15. Convolutional neural
ences and identify patterns within the unlabeled data space, network are typically designed for image recognition and com-
for the purposes of clustering or data reduction. For instance, puter vision applications. They largely reduce the number of
clustering37 is a typical task of unsupervised learning, which free parameters compared to standard fully connected NNs.
can be applied for quality assurance (QA) in radiotherapy.38 They have shown competitive results in medical imaging anal-
In this case, one can distinguish outliers or unacceptable ysis, including cancer cell classification, lesion detection,59
treatment plans from acceptable ones by applying the cluster- organ segmentation60 and image enhancement. RNNs are usu-
ing algorithm to the feature set. Other typical unsupervised ally applied for natural language processing (NLP) and audio
learning tasks include dimensionality reduction such as prin- recognition problems, as they can exhibit temporal dynamic
cipal component analysis (PCA),39 t-SNE,40 and autoen- behavior that can be exploited for sequential data analysis.56
coders.41 These can be used for visualization of complex data This property also makes RNNs valuable for aiding fraction-
in higher dimensions or applied before supervised learning, ated radiotherapy, effectively taking advantage of a variety of
as a way of learning a more compact data representation for previously unused temporal information generated during the
solving complex supervised learning problems. treatment course. A VAE is an unsupervised learning algo-
Reinforcement learning42 is a ML extension of classical deci- rithm that is able to learn the distribution of compressed data
sion-making schemes, Markov Decision Processes (MDPs).43 It representations from a high-dimension dataset. In another
is concerned with how software agents can take actions when words, it is the equivalent of PCA analysis but for DL applica-
interacting with a given environment. Usually the agent needs to tions. It can be widely applied in radiation oncology consider-
achieve a definite goal via maximizing a cumulative reward ing the prevalence of high-dimension data due to the limitation
function,42 for example, therapeutic index in radiotherapy. The of patient sample sizes.14 Similar to a VAE, a GAN is also a
famous Google AlphaGO44 is an RL application, where an generative model61 that can learn the multivariate distribution
agent learns how to take actions under different situations to win and describe how the data are generated. GANs learn the dis-
a board game. In radiotherapy, RL can be applied to adaptive tribution by an adversarial competition between its generator
treatment planning, for example, how to optimize prescriptions and its discriminator. They have been successfully applied in
for patients by learning from during treatment information.45 In some medical imaging tasks, mapping magnetic resonance
this case, an agent will learn how to adapt dose fractionation imaging (MRI) into computed tomography (CT) images (syn-
(action) based on the current condition of the patient undergoing thetic CT)62 or in adaptive radiotherapy45 for generating syn-
radiotherapy (environment) to achieve the goal (reward) of better thetic data and enriching the sample size.
treatment response.
Deep learning,46 which recently demonstrated tremendous
success in image recognition problems47 and natural lan- 3. WHAT DATA ARE NEEDED FOR ML/DL
guage processing,48 is a subcategory of the broad family of APPLICATIONS?
ML algorithms. It is generally based on NN architectures,35
3.A. What training sample size is required?
using multiple layers to gradually extract higher level features
from the raw inputs; eliminating the necessary and typically When building machine or DL models for solving practi-
problematic feature engineering process49 in classical ML, cal medical problems, one should first consider how much
and hence showing superior performances. This is a key data is needed for successful training, that is, not under- or
advancement in multivariable and statistical prediction mod- overfitting the data. Often, the answer can be complicated as
eling, where data representation and task learning can be there are no off-the-shelf recipes for ML/DL algorithms as
effectively achieved in the same framework. Training a deep compared to traditional power analysis in statistics. Instead,
NN (DNN) was challenging before and during the 1990s to one needs to examine the specific problem/learning algorithm
2000s due to limitations in computational capacity and lack and perform some simulation experimentation using so called
of robust optimization techniques. The modern framework of Learning Curves (LC) on the existing data to determine
DNNs originated from earlier work on restricted Boltzmann whether there is a sufficient sample to meet the training
machines (RBMs, 1985),50 deep belief networks (DBNs, requirements of the ML/DL algorithm at hand.
2006),51 and later the efficient improvement in activation Empirically, the more complex the problem/learning algo-
functions using rectified linear unit (ReLU, 2010),52,53 among rithm (e.g., larger number of free parameters), the more data
others. Certainly, the huge leap in computing power using will be naturally required. A simple linear model with two
graphics processing units (GPUs) and advancement in opti- unknown parameters will only require two “perfect” samples
mization techniques54 were also critical. Today, a residual to fit, while a complicated nonlinear modern DL architecture
network (ResNet55) is among the deepest network that can be may need thousands of data points to train. Applying a small
or fully connected NNs. This causes a loss of spatial informa- neighbor, and so forth are also worth considering when
tion, which is another reason for the superior performance of designing applications for radiotherapy. Due to space limita-
CNNs in imaging tasks. Vectors and matrices are only two tion, the following section will mainly focus on linear models
special cases of tensors (rank 1 and 2, respectively), where a and generalized linear models (GLMs), as they are the foun-
general rank k tensor T is defined as a k-linear functional dations of current DL models. Other common classical
from a vector space V, T : V V ! R. The choice of
|fflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflffl} machine learning methods, including SVMs and RFs will be
ðktimesÞ briefly reviewed in the section that follows.
representation is problem-dependent and determined by the
complexity of problem and the size of available data. For 4.A. Linear models and generalized linear model
instance, a video or a three-dimensional (3D) volume medical (GLM)
image such a CT or an MRI can be presented by a tensor
component, if the spatial arrangement is necessary for learn- A linear model fw ðÞ considered to be linear in unknown
ing the task at hand (e.g., detection or classification), then parameter w (w 2 Rm ), has the form,
Tijk with (i,j,k)=(1 dimx,1 dimy,1 dimz), see Fig. 4. In
y ¼ fw ðxÞ ¼ wT x (3)
supervised learning, the collection of input data along with
ground truths (labels) are denoted by: Note that a linear model is not necessarily a linear function
yðiÞ ÞjxðiÞ 2 Rk1 k2 ; ^yðiÞ 2 Rl ; i ¼ 1; . . .; ng. In
X ¼ fðxðiÞ ; ^ of predictors xðx 2 Rm Þ. A model which has polynomial and
unsupervised learning setting, the data are represented by interaction terms as in Eq. (4) is also considered as a “gener-
X ¼ fðxðiÞ ÞjxðiÞ 2 Rk1 k2 i ¼ 1; . . .; ng, as no label infor- alized” linear model,
mation is required. When the dataset contains features (x) y ¼ fw ðxÞ ¼ w0 þ w1 x21 þ w2 x1 x2 þ w3 x22 (4)
highly varying in magnitudes and unit, it is important to stan-
dardize the feature values,72 otherwise, model numerical sta- as one can simply re-define x in a way that the model satisfies
bility and estimation precision may be degraded. Some the general form in Eq. (3). A linear model uses straight lines
common standardization methods include Z-score normaliza- or linear planes to fit data (see Fig. 5). The fitting process
usually involves calculating the optimal estimator b^ by mini-
tion and Min-Max scaling as in Eq. (2):
mizing a designated loss function, for example, squared error
x lðxÞ or likelihood function,
Z-score xnew ¼
stdðxÞ
(2) X
n
x minðxÞ ^ ¼ arg min
w ðyðiÞ ðxðiÞ ÞT wÞ2 (5)
Min-Max xnew ¼
maxðxÞ minðxÞ i¼1
It should be emphasized that standardization techniques are where n is the sample size and i is the index of sample. The
especially beneficial73 for learning by DNNs, which is gener- ^ , from Eq. (5) is called least squared estima-
best estimator, w
ally a challenging task. During the training, standardization tor which has a closed form,
of input data will help the optimizer to find good local mini- ^ ¼ ðX T XÞ1 X T Y
w (6)
mum more easily and faster by mitigating numerical instabil-
ð1Þ ð2Þ ðnÞ T
ity errors. Another big concern when applying ML/DL to where X ¼ ðx ; x ; :::; x Þ ðX 2 R Þ is called a design
nm
medical data is how to deal with missing data. Of course, one matrix and Y ¼ ðyð1Þ ; yð2Þ ; :::; yðnÞ ÞT ðY 2 Rn Þ is a vector of
can always drop the cases with incomplete information, but it labels.
would further reduce the sample size which may already be A linear model is concise and easy to implement, how-
small. To fully make use of the available data, imputation ever, it assumes Y is normally distributed, that is,
methods,74 that is, replacing missing values with statistical YjX Nð0; r2 Þ and there is an identity mapping [Eq. (8)]
estimates, are usually applied before analyzing the full data- between Y and Xw, which is usually not the case in prac-
set. Some simple imputation methods include mean or med- tice. To alleviate these issues, a generalized linear model
ian imputation. More sophisticated imputation methods also (GLM)75 was proposed.
exist based on maximum likelihood estimation (MLE), for A GLM is defined by three components: a random compo-
instance. However, one should keep in mind, that imputation nent, which specifies the distribution of Y given X (Y|X); a
is also dependent on the quality of observed values, therefore, systematic component, which is a linear part relating a
imputation should be applied with caution, as the imputed parameter g to a predictor X (i.e., g=Xw); and a link function
values may be inaccurate and noisy. Alternatively, one may that connects the expected value of Y to g. In GLMs, the dis-
consider data augmentation or transfer learning when dealing tribution of Y|X typically belongs to the exponential family,
with insufficient training data, as discussed earlier. for example, normal, exponential, Bernoulli and categorical
distribution. Some commonly used link functions include
identify, negative inverse, log, and Logit.
4. WHAT CLASSICAL MODELS EXIST? Logistic regression32 which is used for binary classifica-
Aside from modern DL methods, classical machine learn- tion is a special case of GLM that has Bernoulli distribution
ing algorithms such as SVMs, RFs, Naive Bayes, K-nearest and Logit link function, that is,
FIG. 4. An magnetic resonance image [left] (along slice planes) is usually represented by a rank three tensor (component) [right].
TABLE II. A snippet from a dataset of 569 breast cancer diagnosis with 30 image extracted features
FIG. 6. Activation functions: [Left]r(z)=z for regression; [Middle]rðzÞ ¼ 1þe1 z for classification. [Right]r(z)=max(0,x) for deep learning.
where Kðxj ; xk Þ ¼ /ðxj ÞT /ðxk Þ is known as a kernel, which is proposed with improved performance.77 As an ensemble
an inner product of feature maps / and acts as a cross-simi- method, RF usually reduces variance and improves general-
larity metric in the feature (Hilbert) space. Various types of ization compared to a single decision tree but with the caveat
kernels exists, such as linear, polynomial, and radial basis of reduced interpretability.
function (RBF) depending on the desired data support. For
instance, polynomials have finite data support while RBFs
4.D. Example implementations
have infinite data support and is thus commonly used despite
being more computationally expensive. Penalty terms such as As a demonstrative implementation example, a logistic
ridge loss are usually added to the loss function for regular- regression model application to breast cancer diagnosis is
ization purposes and for preventing overfitting pitfalls. Sup- presented here. The dataset (Table II), which contains 569
port vector machines have been very popular ML techniques breast cancer diagnosis cases, was created in 1995 by
due to their global optimal solutions for classification and researchers at the University of Wisconsin. Each case
regression are widely applied to radiation oncology problems. includes 30 features computed from a digitized image of a
However, as a classical ML approach, they require features to fine needle aspirate (FNA) of a breast mass and a binary
be extracted and selected prior to training. diagnosis label with “M" indicating “malignant" and “B"
A RF uses an ensemble of decision trees to solve classifi- indicating “benign.”
cation or regression problems. A decision tree has a flow- Let the mean-value features (first 10 in all 30) plus a fea-
chart-like structure where each node represents a test on ture x11 ¼ 1 (serving as intercept term) be the predictor
attributes, splitting examples into different branches. From parameters in the trained logistic model. Using
root nodes to leaf nodes, the decision is gradually made by p ¼ 1þexp1ðXwÞ as in Eq. (7), the logistic regression model
multilevel classification rules, which make them quite easy to will take the design matrix XðX 2 R56911 Þ as input and out-
interpret and desirable to use but with limited predictive put prediction p which is the probabilities of diagnosis Y = 1
power. Hence, the corresponding ensemble approach RFs, (malignant). The associated Python code for the reader refer-
that is, combining multiple weak classifiers (decision trees) ence can be found in https://github.com/sunancui/breast-ca
to achieve a stronger classification, is usually applied instead. ncer-diagnosis. As mentioned, logistic regression can be
The splitting at nodes is usually based on a Gini Index, regarded as a simple NN with only a single input/output layer
entropy function, or information gain. The ensemble is based and a sigmoid activation function. We also compared logistic
on so called bagging, that is, averaging. Recently, a gradient regression and NN in the code from this perspective; one
boosting, weighted averaging approach for RFs has been would notice that they yield similar estimation of prediction
parameters (w). A summary of the prediction results of breast layer (j) vanishes. In an extreme case where all activations are
cancer diagnosis by logistic regression is presented in Fig. 7 identities in Eq. (14), such a NN simply reduces to a linear
with the ground truth labels shown for comparison. model Eq. (3) no matter how many layers (deep) it has.
Therefore, this argument provides an insight why activation
functions are the primary source of nonlinearity of NNs. It is
5. WHAT DEEP LEARNING MODELS EXIST? thought that with each nonlinear activation, an additional fold
5.A. Neural networks (manifold) in the data can be achieved allowing for better
representation or capturing of highly complex relationships.
Neural networks in principle35 can be considered as natural As for NNs’ hyperparameters, for example, number of layers
generalizations of linear models (Section 4.A). If we modify L and number of nodes in a specific layer, Bayesian meth-
Eq. (3) by adding a so-called activation function r [Eq. (8)], ods78 can be used to optimize their number. However, most
of the current procedures still rely on trial and error schemes.
xðjþ1Þ ¼ rðjÞ wðjÞ xðjÞ þ bðjÞ (13)
j=0,1,. . ., and let it iterate recursively over itself L times, with 5.B. Why and how to go deeper?
the upper-right index (0),(1),. . .,(L) (Fig. 8), one then has a
The Universal Approximation Theorem (UAT) developed
composite function
by Hornik79 states that mathematically a NN with one hidden
fw ðxÞ ¼ xðLÞ
layer of sufficient nodes can approximate any measurable
(and hence continuous) function on compact sets under cer-
¼ rðL1Þ wðL1Þ rðL2Þ wðL2Þ þbðL2Þ þbðL1Þ
tain mild conditions on the activation functions. This fact
(14) explains why NNs are suitable for fitting complex functions
This is known as a NN, where now the number of structures and datasets. However, based on this, one probably wonders
is renamed as the number of layers (Fig. 9). In particular, the why we would bother going deeper? In fact, the theorem has
first layer (0) and the last layer (L) are called the input layer several constraints. First, we need to have a “sufficient” (can
and output layer, respectively. Any layer (j) with 0 < j < L is be infinite) number of nodes. Secondly, it does not guarantee
called a hidden layer. Moreover, it is in general referred to as the theoretical performance can be achieved through opti-
a DNN if L > 4 (i.e., more than two hidden layers), which is mization in practice due to local minima and convergence
the fundamental building block for DL. Hence, conceptually, issues. Thus, it still depends on experimentally designing the
a NN is merely an extension of a linear model. Thus the con- right architecture (e.g., activation function, regularization,
cept of activation functions in Eq. (8) and loss functions (9) number and size of layers, etc.) and adopting an appropriate
discussed in Section 4.A can be applied to DL without any training process (e.g., optimization method) in order to possi-
changes. This is known as a NN, where now the number of bly achieve the theoretical performance estimates.80
structures is renamed as the number of layers. In particular, Practically, adding more layers to a NN has been shown to
the first layer (0) and the last layer (L) are called the input provide a good architecture design vs increasing the number
layer and output layer, respectively. Any layer (j) with of nodes as suggested by UAT. A NN with more layers will
0 < j < L is called a hidden layer. Moreover, it is in general show better performance than a single layer NN that has the
referred to as a DNN if L > 4 (i.e., more than two hidden lay- same number of parameters. Intuitively, this is possibly
ers), which is the fundamental building block for DL. Hence, because each layer will transform its input, that is, folding,
conceptually a NN is merely an extension of a linear model. and creating a new representation of the data (appropriate
Thus the concept of activation functions in Eq. (8) and loss manifold). The multilevel abstraction that is being learned
functions (9) discussed in Section 4.A can be applied to DL through multiple layers can be hard coded into a single layer
without any changes. with the same number of nodes. Or formally speaking, the
However, the nonlinear activation function now plays a multilayer structures enable NNs to recognize the entangled
prominent role in (deep) NNs as it is the source of nonlinear- manifolds of the data more easily, so as to solve the desig-
ity, that is, mapping the data into higher dimensions. If the nated task.81
activation function between any two layers, Eq. (13), is an However, adding too many layers, the performance of a
identity map in Eq. (8), then the two layers can be merged NN can actually be degraded as well. The phenomenon was
since identified as the vanishing gradient problem by Hochreiter’s
Ph.D thesis82 and has been a major obstacle of DL studies for
xðjþ1Þ ¼ rðjÞ wðjÞ wðj1Þ xðj1Þ þ bðj1Þ þ bðjÞ a long time. Vanishing gradient is a problem occurring dur-
ing optimizing NN weights83 with gradient-based learning
¼ rðjÞ w ~ðjÞ
~ ðjÞ xðj1Þ þ b methods, where the gradient will become too small through
multiplication of many layers, effectively preventing the
(15)
weight from changing its value. In the modern deep architec-
with w~ ðjÞ ¼ wðjÞ wðj1Þ and b~ðjÞ ¼ wðjÞ bðj1Þ þ bðjÞ . Thus, tures, there exist several mechanisms to prevent such prob-
ðj1Þ
one sees clearly from above discussion node x
ðjÞthatðjÞthe lems, for example, residual blocks, ReLU activation,52 batch
directly connects to node x ðjþ1Þ ~
via w ; b ~ as if middle normalization.84 Hence, they help the architectures grow
deeper and more powerful without encountering such issue. posed problems by suppressing the noise in the training data.
For instance, the residual NN architecture (ResNet)55 adds Besides adding a weight penalty as shown in Section 4.B,
one extra term to the usual NN Eq. (13), one can adopt the dropout, another neuroscience inspired
0 1 trick,86 as a regularization technique. As shown in Fig. 10,
during the training, dropout will randomly select some por-
xðjþ1Þ ¼ rðjÞ @wðjÞ xðjÞ þ bðjÞ þ xðj1Þ
|ffl{zffl}
A ðResNetÞ;
tion (dropout rate, e.g., 20%–50%) of nodes being ignored.
skipconnection
They will not affect updated weights, as their contribution to
(16) the activation of downstream neurons is temporally removed.
where the additional term is called the skip connection. Indeed, dropout is currently a very effective ensemble
Heuristically, it adds an extra path or shortcut to another neu- method, performing averaging with NNs while mitigating the
ron that was not connected previously. During the back-prop- risk of memorizing the data. Hence, the resulting NN with
agation process, the network will skip some subnetworks, many layers (DNN) is capable of better generalization to
directly forwarding the gradient from higher to lower layers, unseen data and is less likely to overfit (memorize) the train-
eliminating the gradient vanishing problem. The new activa- ing data.
tion functions ReLU in Fig. 6[Right] and its variants85 elimi- More advanced optimization techniques will also be
nate vanishing gradients by avoiding squishing a large input required when training a “deeper” NN. As compared with the
space between 0 and 1 as in Sigmoid activation, thus prevent- simple case of logistic regression, whose loss function is con-
ing extremely small gradients from occurring at the edge. vex, the deeper NN will tend to have a more sophisticated
Using batch normalization layers between other layers to nor- loss landscape, making finding even the close-to-global opti-
malize the intermediate inputs is another solution. This mal solution more difficult.80 Many efforts have been made
ensures the values of intermediate inputs are inside the range to develop effective optimization algorithms87 for such large-
that has the effective gradients, hence, the gradient values scale nonlinear problems primarily based on gradient descent
will not become too small. techniques, including the use of stochastic approaches with
The next question that comes to mind when designing an momentum, which accelerate learning by increasing the gra-
architecture is how to guarantee good generalization. As the dient vectors in the direction that past gradients accumulated
model becomes extremely complex, overfitting can be a main (velocity). In the adaptive rate scheme, the algorithm will
concern. Intuitively, regularization techniques can resolve ill- adopt various learning rates per parameter according to their
history of momenta and gradients. The common optimization
techniques including Adam,54, RMSProp87 are popular
choices in DL studies as surrogates for the classical stochastic
gradient descent techniques.
X1
m;n;C
I~ k;‘;b ¼ wi;j;a;b I sðk1Þþi;sð‘1Þþj;a
i;j;a (18)
~1 ; ‘ ¼ 1; . . .; L
ðk ¼ 1; . . .; L ~2 ; b ¼ 1; . . .C2 Þ
m;n;C ;C
here w ¼ wi;j;a;b i¼1;j¼1;a¼1;b¼1
1 2
2 R4 is a four-tensor convo-
lutional filter (kernel). With stride s > 1, the output size
would be roughly reduced to 1s of the original input size.
Implementing a pooling layer of kernel size s will have the
same effect.
One can understand that using kernels in NNs as being
equivalent to template matching or “seeing” local information
of a neighborhood while blocking information from far apart
or less related regions, as depicted in Fig. 11 (left). This can
be also visualized by vectorizing the input and the output as
in Fig. 11 (right), from which one can realize CNNs only con-
nect to certain nodes within a layer when compared to a fully
FIG. 9. A graphic plot of Fig. 8 and Eq. (14). connected NN; this is called local connectivity property.
Overall, a CNN is a locally connected NN as it only considers
local relations (receptive field) while it decouples information
far away in space and/or time allowing for efficient data rep-
resentation and improved task learning.
The property that makes a CNN distinguishable from
other locally connected networks is that CNNs force the
weights (kernel) to be repeatedly used, that is,
m;n
wi;j;a;b i¼1;j¼1 2 R2 is used everywhere in the input feature
map (e.g., I ). This parameter sharing scheme effectively
takes advantage of the spatial invariance property of the
imaging data, largely reducing the number of free parameters
in the architecture.
FIG. 11. Illustration of convolutional neural network. (Left) a kernel acts as a mask to consider only neighboring information (pixels) yet block information far
part. For example, z11 in the convolved output is generated by applying 3 9 3 kernel on the receptive field (denoted) at the left upper corner. In the view from
the right, one sees clearly that utilizing kernels (filters) is essentially another implementation of locally connected networks.
where h and / serving as unknown neural weights to be opti- matrix C, where Ci;j ¼ Ai;j Bi;j ). And ∘ denotes the compo-
mized. As these parameters are shared across different time nent-wise functional composition, that is, the ith element in the
points, free parameters in an RNN are largely reduced. Two resulting vector of function is the functional composition
common RNN architectures were invented in the early devel- between the ith elements of the two input vector of function.
opments of RNNs, they are known as Elman89 and Jordan90 The affine transformation is defined by,
networks, which are simple RNNs that differed in their con-
Aff W;U;b ðn; gÞ :¼ W n þ U g þ b
nection scheme. (21)
These simple RNNs were known to suffer from two major ðW 2 Rpq ; U 2 Rpr ; b 2 Rr ; n 2 Rq ; g 2 Rr Þ
disadvantages, (a) their back-propagation gradients tended to In Eq. (20), three additional units GðtÞ a , where a = 1,2,3
vanish or explode quickly and (b) the information farther denotes the forget gate, the input gate and the output gate,
apart cannot be connected. These two problems are resolved respectively,
by using the so called long short-term memory (LSTM) archi-
tecture.56 GðtÞ ¼ r a Aff Wa ;Ua ;ba x ðtÞ ðt1Þ
; y 2 Rk (22)
a
An LSTM is a state-of-the-art RNN model composed of
basic building blocks denoted as gated units, which learns by are used to control and determine when and how much
itself to store and forget internal memories when needed such should the previous information be kept or forgotten. Total
that it is capable of creating long-term dependencies and unknown parameters of an LSTM are ðWh ; Uh ; bh Þ and
paths through a time series as in Fig. 12. An LSTM has the fðWa ; Ua ; ba Þja ¼ 1; 2; 3g. A simplified version of LSTM is
following construction. Specify fh ¼ fLSTM, g/ ¼ gLSTM in called a gated recurrent units (GRU),where the number of
Eq. (19) by gates is reduced to two, namely, reset and update gates, and
hence more computationally efficient.91
ðtÞ
hðtÞ ¼ fLSTM hðt1Þ ; yðt1Þ ; xðtÞ ¼ G1 hðt1Þ
; (20)
ðtÞ
yðtÞ ¼ gLSTM hðt1Þ ; yðt1Þ ; xðtÞ ¼ G3 r3 hðtÞ 5.C.3. Attention awareness
where ⊙ denotes the component-wise multiplication, (i.e., for Attention awareness92 is a special mechanism that equips
two m9n matrices A and B, A⊙B will produce another m9n NNs with the ability to focus on a subset of a feature map,
FIG. 12. The diagram of a gated units of long short-term memory, which is consisted of input, output and forget gates.
which is particularly useful when tailoring the architecture for connected NNs. These three basic architectures comprise
specific tasks. Suppose, the input of an attention network fw ðÞ almost all the deep architectures used today. Other architec-
is x and its output is a; ai;j 2 ½0; 1 . After applying it on the tures fall into these three categories or are a combination of
feature map z, it will produce an attention glimpse g as below: them.
For instance, a variational autoencoder57 is a special type
a ¼ fw ðxÞ; g ¼ a z; (23)
of CNN or fully connected NN which has a bottleneck struc-
where ⊙ is a component-wise multiplication. The attention ture and is employed for data compression applications. As
mechanism assigns weights a to the input feature map. When Fig. 13 suggests, its input and output dimension are exactly
a is restricted to be either 0 or 1, it is hard attention93; When the same and the dimension of its latent variables is usually
a is between 0 and 1, it is called soft attention.94 significantly smaller than its input size. In a VAE, latent vari-
Attention was first implemented in RNNs94 for lan- ables are sampled from a designate distribution, for example,
guage translation tasks. In the classical Seq2Seq model, a Gaussian distribution, which is parameterized by the encoder.
context vector will be built out of the last hidden state The decoder is responsible for re-constructing the original
from an encoder RNN, which takes source sequence as input from the latent variables. The loss function of a VAE is
input, then a decoder RNN would take the context vector defined as a lower bound of the data log-likelihood function.
as input and generate target sequences. However, the It turns out to be a combination of two terms; one penalizes
model tends to “forget” the early part of source sequences reconstruction error as in a plain autoencoder, the other
when generating target sequences. Hence, attention net- forces the similarity [Kullback–Leibler divergence (KL)]
works are proposed to be applied on all the hidden states between the learned distribution and true prior distribution.
(z) of encoder RNN to learn which part of source The additional KL-term of VAE can be regarded as a way of
sequences should be focused on when generating a certain regularization. Contrasted with the deterministic autoencoder,
piece of a target sequence (g). it helps suppressing the noise in data.
Attention mechanisms can be applied to fuse and align U-Net,96 which has been successfully applied in many
information from heterogeneous data. For instance, by adopt- medical imaging segmentations, is also CNN based, with its
ing an attention mechanism, the relation between an image name derived from its “U” shape as shown in Fig. 14. U-Net
and its caption can be learned.95 Specifically, the imaging is composed of three sections: contraction, bottleneck, and
information extracted from a CNN can be consumed by the expansion sections. The contraction (encoding) section is
RNN, for each position of the caption, an attention mecha- made of several blocks of convolutional layers and max pool-
nism fuses the information and generate the desired caption ing layers. It is responsible for learning complex features.
on a word by word basis. The bottleneck section is the bottommost layer mediating
between the contraction and expansion layers. The heart of
U-net lies in the expansion (decoding) section, which consists
5.C.4. Other interesting architectures
of several blocks of convolutional layers and an upsampling
We have introduced two fundamental architectures for DL layer. Particularly, it appends the outputs from the contraction
algorithms, that is, CNNs and RNNs, together with fully section to their corresponding components in the expansion
FIG. 13. The diagram of a variational autoencoders, which is composed of encoder and decoder networks.
FIG. 14. The diagram of a U-net consisting of contraction, bottleneck, and expansion section. The size of output is L1 L2 a, where L1 L2 is the size of
original image, and a is the number of structure that needs to be contoured.
section. Such a design helps preserve the structural integrity basic idea is that the generator tries to generate new data
of the image and reduces distortion enormously. The loss from random noise to mimic the real data, while the dis-
function of U-net is defined as energy function computed by criminator tries to distinguishable any fake data from the
a pixel-wise soft-max over the final feature map combined real data in an analogy to a Turing machine. The two parts
with the cross-entropy loss function. are alternately trained to compete with each other and over-
A generative adversarial networks (GAN)58 is another time the performance of both will be boosted. The original
popular NN architecture and is considered among the most GAN as shown in Fig. 15 works on the dataset without
exciting breakthroughs in DL of the past decade. Numerous labels (unsupervised learning) and is based on fully con-
GAN variants have been invented for different purposes fol- nected architectures. However, its idea can be directly
lowing their invention by Ian Goodfellow in 2014. Like a applied to CNNs or other architectures to create a deep con-
VAE, a GAN is also a generative model,61 which aims to volution GAN (DCGAN),97 which has had wide applica-
learn the true data distribution from the training set so as to tions in the imaging domain such as sequence generation. A
generate new data points. However, unlike VAEs, which try cGAN98 is a GAN variant that can be used to learn multi-
to minimize the lower bound of likelihood function, GANs modal data models and can be used for generation of syn-
learn the data distribution through pitting a competition thetic data for training of limited sample outcome prediction
between two adversarial components, so-called generator models, for instance. Its generator and discriminator are both
and discriminator networks in some form of a game. The trained conditioning on the label/other information.
FIG. 15. A generative adversarial neural network which learns data distribution through the competition between two adversarial components.
TABLE III. A snippet of a dataset containing the six predictor for local control prediction
InfoGAN99 is an version of GAN that learns meaningful containing 131 training/70 test 3D CT images of size (512,
codes of unlabeled data. CycleGANs100 were invented for 512, 74 784). The last dimension of images varied, as
domain image transfer; they can accomplish one-to-one patients have different numbers of slices. During the training,
translation with unpaired datasets such as synthetic CT gen- each slice from patients was taken as individual inputs.
eration from MR images.101 Hence the input of U-Net will be a 2D image. The U-Net is
applied to classify each pixel into three different classes:
liver, lesion, something else. Dice similarity was used for
5.D. Example implementations
evaluation and is calculated for the three classes to evaluate
We implemented a fully connected NN in pytorch for a the resulting classification models. The code was imple-
binary classification problem. The dataset in Table III con- mented to split the 131 training patients into training/valida-
tains 45 patients in TCGA-LUAD and TCGA-LUSC who tion sets. The model was trained up to 17 epochs and was
received external beam radiotherapy for a primary tumor as evaluated on the validation set. Some sample results are
adjuvant therapy. Only the patients who have complete dose shown in Fig. 17. Interested readers are also encouraged to
and treatment outcome information were kept. In addition to play with the architecture definitions in the code and test their
dose, six clinical variables were selected for local control pre- final models on the 70 hold-out patients.
diction (progressive = 1/local control = 0).
Before training the model, all the variables were converted
5.E. Reinforcement learning
into numerical values and were Z-score normalized to avoid
numerical instability. The dataset is randomly split into train- Reinforcement learning is the third category of machine
ing and testing sets. The code is available for the reader refer- learning apart from supervised and unsupervised. It is
ence at https://github.com/sunancui/lung_TCGA_prediction, designed to achieve a definite goal by optimizing a “reward”
where the history of training/test losses is plotted and model function. The learning process of an RL is through interac-
performance is evaluated with area under receiver operating tion with an “environment” so that RL user (also called an
curve (AUC). In Fig. 16, AUC results of both training/testing agent) tries to earn the most reward to obtain the designated
data are shown. goal.
Another implementation of deep U-Net for liver tumor Reinforcement learning is based on the MDP, five-tuple
segmentation is also provided for the reader reference at ðS; A; P; c; RÞ. S ¼ fðx1 ; . . .; xn Þ 2 Rn g is used to describe
https://github.com/sunancui/liver_segmentation. The dataset an environment, it is the space of its all possible states. A is a
is from the Liver Tumor Segmentation Challenge (LiTS), collection of all actions the agent can take. R : X ! R is the
FIG. 17. Example results from the validation set with dice similarity coefficient shown, where “nan" indicates no such structure existing in the ground truth label.
(first column: original image slice, second column: ground truth label, third column: prediction).
Data quality is another important concern when develop- quality of care for patients, which is the ultimate goal of med-
ing and applying ML/DL tools. There are many aspects of ical practice.
data quality, including accuracy, completeness and reliability
etc. Accuracy means the data that have been collected are of
10. CONCLUSIONS
original source data. Completeness refers to the fact that all
required data elements are present. Reliability requires that Machine learning/DL algorithms have witnessed increas-
data remain consistent and the information generated is ing applications in treatment planning and quality assurance
understandable. Medical physicists should examine the accu- to improve efficiency and safety in radiation oncology. These
racy of data before training new models, for example, filter- tools can also improve outcome prediction and personalizing
ing out any abnormal outliers, double checking whether of radiation oncology treatment. To utilize these tools effec-
patients are correctly linked to their information, etc. By gath- tively, medical physicists need to define the problem at hand,
ering complete high quality data that cover patients with dif- know its corresponding category in the ML/DL lexicon,
ferent age, genders and race, medical physicists may also help determine suitable algorithms/models to utilize, collect data
avoid potential bias in the algorithms. Medical physicists of appropriate size and conduct validation in the proper way
should also inspect the readability of data, that is, data yield to ensure generalization to unseen data while being mindful
the same results on repeated collections, processing, storing of possible pitfalls and how to avoid them. This article aims
and display of information in different medical records. to serve as a tutorial and stepping stone in that direction.
Quantitatively, the Shapley value metric111 was proposed as a Moreover, establishing ML/DL algorithms into standard clin-
tool for evaluating quality of healthcare data for use in Ml/DL ical workflow will require additional attention to balancing
algorithms. It interprets whether the presence of an individual accuracy and interpretability in development of these models.
data can help or hurt the overall performance of the ML/DL Medical physicists are encouraged to be informed of the latest
predictive model. Such a quantitative tool can be useful and ML/DL’s developments as part of their continued education,
may provide guidance for medical physicists to assure data get more acquainted with ML/DL literature and become lea-
quality in the future. der of the processes of accepting and commissioning ML/DL
Data of large volumes are always desirable in the era of for routine clinical practice.
precision medicine and predictive analytics. However, proper
data sharing protocols are a precondition for generating large
volumes given the limited sample size at an individual radia- ACKNOWLEDGMENTS
tion oncology department. Medical physicists should be
This work was supported in part by the National Institutes
aware with some thorny issues including data privacy, data
of Health P01-CA059827, R37-CA222215 and
security and data interoperability112 when involving in multi-
R01-CA233487.
institutional studies and assist with other stake holders toward
providing secure and safe data lake exchange platforms for
radiation oncology applications. CONFLICT OF INTEREST
The authors have no relevant conflict of interest to
9. WHAT DOES THIS ML ERA IMPLY FOR MEDICAL disclose.
PHYSICISTS?
Machine and deep learning technologies are promising a)
Author to whom correspondence should be addressed. Electronic mail:
tools to introduce substantial positive changes to the medical sunan@umich.edu.
physics workflow and is likely to become an indispensable
component of its future. Ml/DL are likely to automate many
REFERENCES
of the mundane procedures as well as some of the essential
processes involving treatment planning and quality assurance 1. Bi WL, Hosny A, Matthew B, et al. Artificial intelligence in cancer
to name of few. This is likely to lead to changes in the future imaging: clinical challenges and applications. CA. 2019;69:127–157.
2. Levine AB, Schlosser C, Grewal J, Coope R, Jones SJM, Yip S. Rise
role of the medical physicist in the clinic. Medical physicists of the machines: advances in deep learning for cancer diagnosis. Trends
as vanguards of technology in medicine are at the forefront of Cancer. 2019;5:157–169.
embracing ML role and its potential. However, this also 3. Sahiner B, Pezeshk A, Hadjiiski LM, et al. Deep learning in medical
requires that medical physicists become more acquainted imaging and radiation therapy. Med Phys. 2019;46:e1–e36.
4. Nikolov S, Blackwell S, Mendes R, et al. Deep learning to achieve clin-
with ML/DL, its algorithms and their implementations, and ically applicable segmentation of head and neck anatomy for radiother-
lead the process of acceptance and commissioning of such apy; 2018.
algorithms in their clinic. These issues are likely to require 5. Tseng H-H, Luo Y, Ten Haken RK, El Naqa I. The role of machine
learning in knowledge-based response-adapted radiotherapy. Front
changes in the medical physicist training and their job profile.
Oncol. 2018;8:266.
Nevertheless, the benefits are expected to significantly out- 6. Shiraishi S, Moore KL. Knowledge-based prediction of three-dimen-
weigh some of the possible downsides, including achieving sional dose distributions for external beam radiotherapy. Med Phys.
better efficiency, improved safety, and potentially better 2016;43:378–387.
7. Valdes G, Simone CB, Chen J, et al. Clinical decision support of radio- Emerging Artificial Intelligence Applications in Computer Engineer-
therapy treatment planning: a data-driven machine learning strategy for ing. Amsterdam, The Netherlands: IOS Press; 2007:3–24.
patient-specific dosimetric decision making. Radiother Oncol. 31. El Naqa I, Kerns SL, Coates J, et al. Radiogenomics and radiotherapy
2017;125:392–397. response modeling. Phys Med Biol. 2017;62:R179.
8. El Naqa I, Irrer J, Ritter TA, et al. Machine learning for automated 32. Peng CYJ, Lee KL, Ingersoll GM. An introduction to logistic regres-
quality assurance in radiotherapy: a proof of principle using EPID data sion analysis and reporting. J Educ Res. 2002;96:3–14.
description. Med Phys. 2019;46:1914–1921. 33. Hearst M. Support vector machines. IEEE Intell Syst. 1998;13:18–28.
9. Good D, Joseph LW, Robert LQ, Jackie W, Yin F-F, Das SK. A knowl- 34. Breiman L. Random forests. Mach Learn. 2001;45:5–32.
edge-based approach to improving and homogenizing intensity modu- 35. Priddy KL, Keller PE. Artificial Neural Networks: An Introduction
lated radiation therapy planning quality among treatment centers: an (Vol. 68). Bellingham: SPIE Press; 2005.
example application to prostate cancer planning. Int J Radiat Oncol 36. Emre Celebi M, Aydin K. Unsupervised Learning Algorithms, 1st edn.
Biol Phys. 2013;87:176–181. Berlin: Springer Publishing Company; 2016.
10. Avanzo M, Stancanello J, El Naqa I. Beyond imaging: the promise of 37. Hartigan JA, Wong MA. A k-means clustering algorithm. J STOR.
radiomics. Phys Med. 2017;38:122–139. 1979;28:100–108.
11. Mattonen SA, Palma DA, Johnson C, et al. Detection of local cancer 38. Putora PM, Peters S, Bovet M. Informatics in Radiation Oncology. In:
recurrence after stereotactic ablative radiation therapy for lung cancer: El Naqa I, Li R, Murphy MJ, eds. Machine Learning in Radiation
physician performance versus radiomic assessment. Int J Radiat Oncol Oncology: Theory and Applications. Cham, Switzerland: Springer;
Biol Phys. 2016;94:1121–1128. 2015:57–70.
12. Deist TM, Dankers FJWM, Valdes G, et al. Machine learning algo- 39. Abdi H, Williams LJ. Principal component analysis. WIREs Comput
rithms for outcome prediction in (chemo)radiotherapy: an empirical Stat. 2010;2:433–459.
comparison of classifiers. Med Phys. 2018;45:3449–3459. 40. van der Maaten L, Hinton G. Visualizing data using t-SNE; 2008.
13. Cui S, Luo Y, Hsin Tseng H, Ten Haken RK, El Naqa I. Artificial neu- 41. Baldi P. Autoencoders, unsupervised learning and deep architectures.
ral network with composite architectures for prediction of local control In: Proceedings of the 2011 International Conference on Unsupervised
in radiotherapy. IEEE Trans Radiat Plasma Med Sci. 2019;3:242–249. and Transfer Learning Workshop - Volume 27, UTLW’11. JMLR.org;
14. Cui S, Luo Y, Tseng HH, Ten Haken RK, El Naqa I. Combining hand- 2011:37–50.
crafted features with latent variables in machine learning for prediction 42. Sutton RS, Barto AG. Reinforcement Learning: An Introduction, Vol.
of radiation-induced lung damage. Med Phys. 2019;46:2497–2511. 1. Cambridge: MIT Press; 1998.
15. Men K, Geng H, Zhong H, Fan Y, Lin A, Xiao Y. A deep learning 43. Puterman ML. Markov Decision Processes: Discrete Stochastic
model for predicting xerostomia due to radiation therapy for head and Dynamic Programming, 1st edn. New York, NY: John Wiley & Sons
neck squamous cell carcinoma in the RTOG 0522 clinical trial. Int J Inc.; 1994.
Radiat Oncol Biol Phys. 2019;105:440–447. 44. Silver D, Huang A, Maddison CJ, et al. Mastering the game of go with
16. Ibragimov B, Toesca D, Chang D, Yuan Y, Koong A, Xing L. Develop- deep neural networks and tree search. Nature. 2016;529:484–503.
ment of deep neural network for individualized hepatobiliary toxicity 45. Tseng H-H, Luo Y, Sunan C, et al. Deep reinforcement learning for
prediction after liver SBRT. Med Phys. 2018;45:4763–4774. automated radiation adaptation in lung cancer. Med Phys.
17. Schuler T, Kipritidis J, Eade T, et al. Big data readiness in radiation 2017;44:6690–6705.
oncology: an efficient approach for relabeling radiation therapy struc- 46. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–
tures with their TG-263 standard name in real-world data sets. Adv 444.
Radiat Oncol. 2019;4:191–200. 47. Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with
18. Meyer P, Noblet V, Mazzara C, Lallement A. Survey on deep learning deep convolutional neural networks. In: Pereira F, Burges CJC,
for radiotherapy. Comput Biol Med. 2018;98:126–146. Bottou L, Weinberger KQ, eds. Advances in Neural Information
19. Tseng HH, Wei L, Cui S, Luo Y, Ten Haken RK, El Naqa I. Machine Processing Systems 25. Red Hook, NY: Curran Associates Inc;
learning and imaging informatics in oncology. Oncology. 2018;1–19. 2012:1097–1105.
https://doi.org/10.1159/000493575 48. Mikolov T, Karafiat M, Burget L, Cernocky JH, Khudanpur S. Recur-
20. El Naqa I. Perspectives on making big data analytics work for oncol- rent neural network based language model. In: INTERSPEECH, edited
ogy. Methods. 2016;111:32–44. by Takao Kobayashi, Keikichi Hirose, and Satoshi Nakamura. ISCA;
21. McNutt TR, Bowers M, Cheng Z, et al. Practical data collection and 2010:1045–1048.
extraction for big data applications in radiotherapy. Med Phys. 2018;45: 49. Guyon I, Elisseeff A. An introduction to variable and feature selection.
e863–e869. J Mach Learn Res. 2003;3:1157–1182.
22. Kang J, Schwartz R, Flickinger J, Beriwal S. Machine learning 50. Ackley DH, Hinton GE, Sejnowski TJ. A learning algorithm for Boltz-
approaches for predicting radiation therapy outcomes: a clinician’s per- mann machines. Cogn Sci. 1985;9:147–169.
spective. Int J Radiat Oncol Biol Phys. 2015;93:1127–1135. 51. Hinton GE, Osindero S, Teh Y-W. A fast learning algorithm for deep
23. Bibault J-E, Giraud P, Burgun A. Big data and machine learning in belief nets. Neural Comput. 2006;18:1527–1554.
radiation oncology: state of the art and future prospects. Cancer Lett. 52. Nair V, Hinton GE. Rectified linear units improve restricted boltzmann
2016;382:110–117. machines. In: Proceedings of the 27th international conference on
24. Feng M, Valdes G, Dixit N, Solberg TD. Machine learning in radiation machine learning (ICML-10); 2010:807–814.
oncology: opportunities, requirements, and needs. Front Oncol. 53. Glorot X, Bordes A, Bengio Y. Deep sparse rectifier neural networks.
2018;8:110. In: Proceedings of the fourteenth international conference on artificial
25. Boldrini L, Bibault J-E, Masciocchi C, Shen Y, Bittner M-I. Deep intelligence and statistics; 2011:315–323.
learning: a review for the radiation oncologist. Front Oncol. 54. Kingma D, Ba J. Adam: a method for stochastic optimization. arXiv
2019;9:977. preprint arXiv:1412.6980; 2014.
26. Thompson RF, Valdes G, Fuller CD, et al. Artificial intelligence in 55. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recog-
radiation oncology: a specialty-wide disruptive transformation? Radio- nition. arXiv preprint arXiv:0706.1234; 2015.
ther Oncol. 2018;129:421–426. 56. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Com-
27. Sahiner B, Pezeshk A, Hadjiiski LM, et al. Deep learning in medical put. 1997;9:1735–1780.
imaging and radiation therapy. Med Phys. 2019;46:e1–e36. 57. Kingma DP, Welling M. Auto-encoding variational Bayes; 2013. Cite
28. Domingos P. A few useful things to know about machine learning. arxiv:1312.6114.
Commun ACM. 2012;55:78–87. 58. Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial
29. Mohri M, Rostamizadeh A, Talwalkar A. Foundations of Machine nets. In: Advances in Neural Information Processing Systems;
Learning. Cambridge, MA: The MIT Press; 2012. 2014:2672–2680.
30. Kotsiantis SB. Supervised machine learning: a review of classification 59. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification
techniques. In: Real W, ed. Proceedings of the 2007 Conference on of skin cancer with deep neural networks. Nature. 2017;542:115–118.
60. Dalmisß MU, Litjens G, Holland K, et al. Using deep learning to seg- 85. Clevert AD, Unterthiner T, Hochreiter S. Fast and accurate deep net-
ment breast and fibroglandular tissue in MRI volumes. Med Phys. work learning by exponential linear units (ELUs). arXiv preprint
2017;44:533–546. arXiv:1511.07289; 2015.
61. Ng AY, Jordan MI. On discriminative vs. generative classifiers: a com- 86. Srivastava N, Hinton GE, Krizhevsky A, Sutskever I, Salakhutdinov R.
parison of logistic regression and naive bayes. In: Proceedings of the Dropout: a simple way to prevent neural networks from overfitting. J
14th International Conference on Neural Information Processing Sys- Mach Learn Res. 2014;15:1929–1958.
tems: Natural and Synthetic, NIPS’01. Cambridge, MA: MIT Press; 87. Ruder S. An overview of gradient descent optimization algorithms.
2001:841–848. CoRR abs/1609.04747; 2016. arXiv:1609.04747.
62. Wolterink JM, Dinkla AM, Savenije MHF, et al. Deep MR to CT syn- 88. Jain LC, Medsker LR. Recurrent Neural Networks: Design and Appli-
thesis using unpaired data. CoRR abs/1708.01155; 2017. cations, 1st edn. Boca Raton, FL: CRC Press Inc; 1999.
arXiv:1708.01155. 89. Elman JL. Finding structure in time. Cogn Sci. 1990;14:179–211.
63. Dietterich T. Overfitting and undercomputing in machine learning. 90. Jordan MI. Chapter 25 - serial order: A parallel distributed processing
ACM Comput Surv. 1995;27:326–327. approach. In: Neural-Network Models of Cognition, Advances in Psy-
64. Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep chology, Vol. 121, edited by John W. Donahoe and Vivian Packard Dor-
learning requires rethinking generalization; 2017. sel. North-Holland; 1997: 471–495.
65. Bishop CM. Pattern Recognition and Machine Learning (Information 91. Cho K, van Merrienboer B, Gulcehre C, et al. Learning Phrase
Science and Statistics). Berlin, Heidelberg: Springer-Verlag; 2006. Representations using RNN Encoder-Decoder for Statistical Machine
66. Linial N, Mansour Y, Rivest RL. Results on learnability and the Vap- Translation. arXiv e-prints, arXiv:1406.1078; 2014. arXiv:1406.1078
nik-Chervonenkis dimension. Inf Comput. 1991;90:33–49. [cs.CL].
67. Perlich C, Provost F, Simonoff JS. Tree induction vs. logistic regres- 92. Vaswani A, Shazeer N, Parmar N. Attention is all you need. In: Guyon
sion: a learning-curve analysis. J Mach Learn Res. 2003;4:211–255. I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Gar-
68. Perez L, Wang J. The effectiveness of data augmentation in image clas- nett R, eds. Advances in Neural Information Processing Systems 30.
sification using deep learning. CoRR abs/1712.04621; 2017, Red Hook: Curran Associates Inc; 2017:5998–6008.
arXiv:1712.04621. 93. Luong M-T, Pham H, Manning CD. Effective approaches to attention-
69. Tan C, Sun F, Kong T, Zhang W, Yang C, Liu C. A survey on deep trans- based neural machine translation. CoRR abs/1508.04025;2015,
fer learning. arXiv e-prints, arXiv:1808.01974; 2018, arXiv:1808.01974 arXiv:1508.04025.
[cs.LG]. 94. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly
70. Shin HC, Roth HR, Gao M, et al. Deep convolutional neural networks learning to align and translate; 2014. cite arxiv:1409.0473Comment:
for computer-aided detection: CNN architectures, dataset characteristics Accepted at ICLR 2015 as oral presentation.
and transfer learning. IEEE Trans Med Imaging. 2016;35:1285–1298. 95. Kelvin X, Ba J, Kiros R, et al. Show, attend and tell: neural image cap-
71. Zhen X, Chen J, Zhong Z, et al. Deep convolutional neural network tion generation with visual attention. CoRR abs/1502.03044; 2015.
with transfer learning for rectum toxicity prediction in cervical cancer arXiv:1502.03044.
radiotherapy: a feasibility study. Phys Med Biol. 2017;62:8246–8263. 96. Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks
72. Walach J, Filzmoser P, Hron K. Chapter seven-data normalization and for biomedical image segmentation. In: Medical Image Computing
scaling: consequences for the analysis in omics sciences. In: Jaumot J, and Computer-Assisted Intervention (MICCAI), LNCS, Vol. 9351.
Bedia C, Tauler R, eds. Data Analysis for Omic Sciences: Methods and Berlin: Springer; 2015:234–241. Available on arXiv:1505.04597
Applications, Comprehensive Analytical Chemistry, Vol. 82. Amster- [cs.CV].
dam: Elsevier; y:165–196. 97. Radford A, Metz L, Chintala S. Unsupervised representation learning
73. LeCun Y, Bottou L, Orr GB, Muller KR. Efficient backprop. In: Neural with deep convolutional generative adversarial networks; 2015. Cite
Net-works: Tricks of the Trade, This Book is an Outgrowth of a 1996 arxiv:1511.06434Comment: Under review as a conference paper at
NIPS Workshop. London, UK: Springer-Verlag; 1998:9–50. ICLR 2016.
74. Rubin DB. An overview of multiple imputation. In: Proceedings of the 98. Mirza M, Osindero S. Conditional generative adversarial nets. CoRR
Survey Research Section, American Statistical Association; 1988:79–84. abs/1411.1784; 2014. arXiv:1411.1784.
75. McCullagh P, Nelder JA. Generalized Linear Models. London: Chap- 99. Chen X, Duan Y, Houthooft R, Schulman J, Sutskever I, Abbeel P.
man Hall CRC; 1989. Infogan: Interpretable representation learning by information maximiz-
76. Ng AY. Feature selection, l1 vs. l2 regularization, and rotational invari- ing generative adversarial nets. In: D. D. Lee, M. Sugiyama, U. V. Lux-
ance. In: Proceedings of the Twenty-first International Conference on burg, I. Guyon, and R. Garnett, eds. Advances in Neural Information
Machine Learning, ICML ’04. New York, NY: ACM; 2004:78. Processing Systems 29. Red Hook: Curran Associates Inc; 2016:2172–
77. Natekin A, Knoll A. Gradient boosting machines, a tutorial. Front Neu- 2180.
rorobot. 2013;7:21. 100. Zhu J-Y, Park T, Isola P, Efros AA. Unpaired image-to-image translation
78. Springenberg JT, Klein A, Falkner S, Hutter F. Bayesian optimization using cycle-consistent adversarial networks. CoRR abs/1703.10593;
with robust bayesian neural networks. In: Lee DD, Sugiyama M, Lux- 2017. arXiv:1703.10593.
burg UV, Guyon I, Garnett R, eds. Advances in Neural Information 101. Emami H, Dong M, Nejad-Davarani S, Glide-Hurst C. Generating syn-
Processing Systems 29. Red Hook, NY: Curran Associates Inc; thetic CTS from magnetic resonance images using generative adversar-
2016:4134–4142. ial networks. Med Phys. 2018;45:3627–3636.
79. Hornik K, Stinchcombe M, White H. Multilayer feedforward networks 102. Sutton RS, McAllester DA, Singh SP, Mansour Y. Policy gradient
are universal approximators. Neural Netw. 1989;2:359–366. methods for reinforcement learning with function approximation.
80. Vidal R, Bruna J, Giryes R, Soatto S. Mathematics of deep learning. In: Advances in Neural Information Processing Systems;
CoRR abs/1712.04741; 2017. arXiv:1712.04741. 2000:1057–1063.
81. Brahma PP, Wu D, She Y. Why deep learning works: a manifold dis-en- 103. Efron B, Tibshirani R. An Introduction to the Bootstrap, Monographs
tanglement perspective. IEEE Trans Neural Netw Learn Syst. on Statistics and Applied Probability. New York: Chapman & Hall;
2016;27:1997–2008. 1993: xvi, 436 p.
82. Hochreiter S. Untersuchungen zu dynamischen neuronalen netzen. 104. Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent
Diploma, Technische Universit€at M€unchen 91; 1991. reporting of a multivariable prediction model for individual prognosis
83. Goller C, Kuchler A. Learning task-dependent distributed representa- or diagnosis (TRIPOD): the TRIPOD statement. Ann Intern Med.
tions by backpropagation through structure. In: IEEE International 2015;162:55–63.
Conference on Neural Networks, Vol. 1. IEEE; 1996: 347–352. 105. Ratcliffe S. Roy amara; 2016.
84. Ioffe S, Szegedy C. Batch normalization: Accelerating deep network 106. Hau K. We analyzed 16,625 papers to figure out where AI is headed
training by reducing internal covariate shift. In: Proceedings of the next. Technical Report; 2019.
32Nd International Conference on International Conference on 107. Luo Y, Tseng H-H, Cui S, Wei L, Ten Haken RK, El Naqa I.
Machine Learning - Volume 37, ICML’15. JMLR.org; 2015: 448–456. Balancing accuracy and interpretability of machine learning
approaches for radiation treatment outcomes modeling. BJR-Open 1. 110. Conforti F, Pala L, Bagnardi V, et al. Cancer immunotherapy efficacy
2019;1:20190021. and patients’ sex: a systematic review and meta-analysis. Lancet Oncol.
108. Tannenbaum C, Ellis RP, Eyssel J, Zou F, Schiebinger L. Sex and gen- 2018;19:737–746.
der analysis improves science and engineering. Nature. 2019;575:137– 111. Ghorbani A, Zou J. Data shapley: Equitable valuation of data for
146. machine learning; 2019. arXiv:1904.02868 [stat.ML].
109. Manrai AK, Funke BH, Rehm HL, et al. Genetic misdiagnoses and 112. Abouelmehdi K, Beni-Hessane A, Khaloufi H. Big healthcare data:
the potential for health disparities. N Engl J Med. 2016;375:655– preserving security and privacy. J Big Data. 2018;5:1.
665.