Introduction To Machine and Deep Learning For Medical Physicists

Introduction to machine and deep learning for medical physicists
Sunan Cuia)
Department of Radiation Oncology, University of Michigan, Ann Arbor, MI 48103, USA
Applied Physics Program, University of Michigan, Ann Arbor, MI 48109, USA
Huan-Hsin Tseng
Julia Pakela
Applied Physics Program, University of Michigan, Ann Arbor, MI 48109, USA
Randall K. Ten Haken and Issam El Naqa
(Received 30 September 2019; revised 23 January 2020; accepted for publication 3 March 2020;
published 15 May 2020)
Recent years have witnessed tremendous growth in the application of machine learning (ML) and
deep learning (DL) techniques in medical physics. Embracing the current big data era, medical
physicists equipped with these state-of-the-art tools should be able to solve pressing problems in
modern radiation oncology. Here, a review of the basic aspects involved in ML/DL model building,
including data processing, model training, and validation for medical physics applications is pre-
sented and discussed. Machine learning can be categorized based on the underlying task into super-
vised learning, unsupervised learning, or reinforcement learning; each of these categories has its
own input/output dataset characteristics and aims to solve different classes of problems in medical
physics ranging from automation of processes to predictive analytics. It is recognized that data size
requirements may vary depending on the specific medical physics application and the nature of the
algorithms applied. Data processing, which is a crucial step for model stability and precision, should
be performed before training the model. Deep learning as a subset of ML is able to learn multilevel
representations from raw input data, eliminating the necessity for hand crafted features in classical
ML. It can be thought of as an extension of the classical linear models but with multilayer (deep)
structures and nonlinear activation functions. The logic of going “deeper" is related to learning com-
plex data structures and its realization has been aided by recent advancements in parallel computing
architectures and the development of more robust optimization methods for efficient training of these
algorithms. Model validation is an essential part of ML/DL model building. Without it, the model
being developed cannot be easily trusted to generalize to unseen data. Whenever applying ML/DL,
one should keep in mind, according to Amara’s law, that humans may tend to overestimate the abil-
ity of a technology in the short term and underestimate its capability in the long term. To establish
ML/DL role into standard clinical workflow, models considering balance between accuracy and
interpretability should be developed. Machine learning/DL algorithms have potential in numerous
radiation oncology applications, including automatizing mundane procedures, improving efficiency
and safety of auto-contouring, treatment planning, quality assurance, motion management, and out-
come predictions. Medical physicists have been at the frontiers of technology translation into medi-
cine and they ought to be prepared to embrace the inevitable role of ML/DL in the practice of
radiation oncology and lead its clinical implementation. © 2020 American Association of Physicists
in Medicine [https://doi.org/10.1002/mp.14140]
Key words: deep learning, machine learning, medical physics
1. INTRODUCTION feature extraction,10,11 and outcomes modeling.12–16 Numer-

ous studies, as summarized in Fig. 1, demonstrate the poten-
Applications of machine learning (ML) and deep learning tial application of ML and DL models to clinical problems
(DL), as a branch of intelligence (AI) in medical physics have combined with the ongoing efforts toward ushering radiation
witnessed rapid growth over the past few years. These tech- oncology into the era of Big data analytics17–21 and serve as
niques have been studied as effective tools for a wide range evidence that ML and DL are poised to revolutionize the
of applications in medicine and oncology, including com- fields of medical physics and radiation oncology. Given the
puter-aided detection and diagnosis,1,2 image segmentation3,4 unique role of the medical physicist as a bridge between the
knowledge-based planning,5–7 quality assurance,8,9 radiomics clinical team and clinical technology and as a driving force
e127 Med. Phys. 47 (5), May 2020 0094-2405/2020/47(5)/e127/21 © 2020 American Association of Physicists in Medicine e127
e128 Cui et al.: ML/DL for medical physicists e128
toward developing new technological innovations in medi-

2. WHAT ARE MACHINE AND DEEP LEARNING?
cine, it stands to reason that medical physicists are the most
appropriate member of the clinical team to lead this AI-dri- Machine learning28 is the scientific study that builds math-
ven digital revolution for the medical community. ematical models and computer algorithms to perform specific
There has been several reviews of AI/ML/DL in the medi- tasks by learning patterns and inferences from data using
cal literature, introducing its basic concepts and potential computers, without being explicitly programmed to conduct
roles,22–27 but with little focus on the personnel that will be these tasks. ML algorithms differ in (a) their types of input/
tasked to lead its implementation in the field. Therefore, the output data and (b) the types of problems they are intended to
main goal of this review article is to equip medical physicists solve. They can be divided accordingly into three cate-
(who may have little to no prior experience with ML/DL gories:29 supervised learning, unsupervised learning, and
techniques) with the basic foundational knowledge and exam- reinforcement learning (RL). Table I provides an visual over-
ples necessary to develop and analyze models for application view of the main categories of ML and their common appli-
to a broad range of problems in medical physics and radiation cations in medical physics and radiation oncology.
oncology. While this publication is by no means a compre- Supervised learning30 requires a labeled dataset, as it is
hensive cookbook of ML algorithms, it should serve as useful intended to learn the relationship between input variables and
resource to help novices answer the question: “where should outputs (labels). For instance, prediction of radiotherapy out-
I start with my AI task?,” in regards to what ML/DL models comes (e.g., tumor control or normal tissue toxicity)31 is a
they should select to answer specific research questions in supervised learning task. First, one collects relevant patient
medical physics and what data processing and model valida- information (e.g., dosimetric information and clinical vari-
tion best practices are recommended to ensure robust results. ables) together with the treatment outcomes (labels). Then, a
In addition, this article provides an overview of the underly- supervised learning algorithm is applied to learn the mapping
ing practical and ethical issues pertaining to ML/DL applica- from this patient information to the labeled outcomes. Once
tions in radiation oncology. The rest of this tutorial/review is the model is obtained, it can be adopted to predict response
organized as follows: Section 2 provides a general overview
of ML and DL algorithms. Section 3 provides a description TABLE I. Common taxonomy for machine learning and deep learning
of data requirements for ML/DL models, such as sample size algorithms
and necessary pre-processing steps to ensure that the models
Classification of ML and DL algorithms
will perform as intended. Section 4 provides an overview of
classical ML techniques, while in Section 5 DL methods are Radiation oncology
presented. Section 6 discusses how to validate model perfor- Learning style Definition applications
mance on new and out-of-sample datasets. Section 7
Supervised Learns relationship Diagnosis,1,2 image
describes the role of humans in the development and use of between input segmentation,3,4
ML and DL models, while Section 8 discusses the known data and labels radiotherapy outcomes
limitations and pitfalls of these methods. Finally, Section 9 prediction12–16
envisions the role of the medical physicist in this new era of Requires labeled dataset
AI-guided radiation oncology clinics. Unsupervised Learns patterns in Anomaly detection for QA,8,9
input data radiomics feature
extraction10,11
Uses unlabeled dataset
Used for clustering or
data reduction
Reinforcement Learns to perform Decision-making in
actions in response to adaptive radiotherapy19
environment to maximize
a reward function
Data interaction Definition Examples
Classical ML An ML algorithm that Supervised: GLM,75

would require manual SVMs,33 RFs34
feature extraction and Unsupervised: k-means
selection to clustering, PCA,39 t-SNE40
perform its task Reinforcement: Q-learning
DL Can learn feature Supervised: U-NET96
representation Unsupervised: VAE,57
from raw input data GAN58
and perform
Reinforcement: DQN45
learning tasks
Nonspecific: MLP, CNN,47
FIG. 1. Frequency counts of published PubMed studies in radiation oncol- RNN48
ogy/medical physics by machine and deep learning.
Medical Physics, 47 (5), May 2020

to treatment and be applied to new patients in order to per- trained. With its 152 layers, it won the championship of the
sonalize their prescription. Examples of typical supervised ILSVRC 2015 classification competition with a top-5 error
learning algorithms are logistic regressions,32 support vector rate of 3.57%, only rivaling that of human cognitive ability.
machines (SVMs),33 random forests (RFs),34 and neural net- Some of the most common architectures of DL include
works (NNs).35 convolutional NNs (CNNs),47 recurrent NNs (RNNs),56 varia-
Unsupervised learning36 operates on an input dataset tional autoencoders (VAEs),57 and generative adversarial NNs
without the need for labels. Its goal is to try to draw infer- (GANs)58 as shown in Figs. 11–13, 15. Convolutional neural
ences and identify patterns within the unlabeled data space, network are typically designed for image recognition and com-
for the purposes of clustering or data reduction. For instance, puter vision applications. They largely reduce the number of
clustering37 is a typical task of unsupervised learning, which free parameters compared to standard fully connected NNs.
can be applied for quality assurance (QA) in radiotherapy.38 They have shown competitive results in medical imaging anal-
In this case, one can distinguish outliers or unacceptable ysis, including cancer cell classification, lesion detection,59
treatment plans from acceptable ones by applying the cluster- organ segmentation60 and image enhancement. RNNs are usu-
ing algorithm to the feature set. Other typical unsupervised ally applied for natural language processing (NLP) and audio
learning tasks include dimensionality reduction such as prin- recognition problems, as they can exhibit temporal dynamic
cipal component analysis (PCA),39 t-SNE,40 and autoen- behavior that can be exploited for sequential data analysis.56
coders.41 These can be used for visualization of complex data This property also makes RNNs valuable for aiding fraction-
in higher dimensions or applied before supervised learning, ated radiotherapy, effectively taking advantage of a variety of
as a way of learning a more compact data representation for previously unused temporal information generated during the
solving complex supervised learning problems. treatment course. A VAE is an unsupervised learning algo-
Reinforcement learning42 is a ML extension of classical deci- rithm that is able to learn the distribution of compressed data
sion-making schemes, Markov Decision Processes (MDPs).43 It representations from a high-dimension dataset. In another
is concerned with how software agents can take actions when words, it is the equivalent of PCA analysis but for DL applica-
interacting with a given environment. Usually the agent needs to tions. It can be widely applied in radiation oncology consider-
achieve a definite goal via maximizing a cumulative reward ing the prevalence of high-dimension data due to the limitation
function,42 for example, therapeutic index in radiotherapy. The of patient sample sizes.14 Similar to a VAE, a GAN is also a
famous Google AlphaGO44 is an RL application, where an generative model61 that can learn the multivariate distribution
agent learns how to take actions under different situations to win and describe how the data are generated. GANs learn the dis-
a board game. In radiotherapy, RL can be applied to adaptive tribution by an adversarial competition between its generator
treatment planning, for example, how to optimize prescriptions and its discriminator. They have been successfully applied in
for patients by learning from during treatment information.45 In some medical imaging tasks, mapping magnetic resonance
this case, an agent will learn how to adapt dose fractionation imaging (MRI) into computed tomography (CT) images (syn-
(action) based on the current condition of the patient undergoing thetic CT)62 or in adaptive radiotherapy45 for generating syn-
radiotherapy (environment) to achieve the goal (reward) of better thetic data and enriching the sample size.
treatment response.
Deep learning,46 which recently demonstrated tremendous
success in image recognition problems47 and natural lan- 3. WHAT DATA ARE NEEDED FOR ML/DL
guage processing,48 is a subcategory of the broad family of APPLICATIONS?
ML algorithms. It is generally based on NN architectures,35
3.A. What training sample size is required?
using multiple layers to gradually extract higher level features
from the raw inputs; eliminating the necessary and typically When building machine or DL models for solving practi-
problematic feature engineering process49 in classical ML, cal medical problems, one should first consider how much
and hence showing superior performances. This is a key data is needed for successful training, that is, not under- or
advancement in multivariable and statistical prediction mod- overfitting the data. Often, the answer can be complicated as
eling, where data representation and task learning can be there are no off-the-shelf recipes for ML/DL algorithms as
effectively achieved in the same framework. Training a deep compared to traditional power analysis in statistics. Instead,
NN (DNN) was challenging before and during the 1990s to one needs to examine the specific problem/learning algorithm
2000s due to limitations in computational capacity and lack and perform some simulation experimentation using so called
of robust optimization techniques. The modern framework of Learning Curves (LC) on the existing data to determine
DNNs originated from earlier work on restricted Boltzmann whether there is a sufficient sample to meet the training
machines (RBMs, 1985),50 deep belief networks (DBNs, requirements of the ML/DL algorithm at hand.
2006),51 and later the efficient improvement in activation Empirically, the more complex the problem/learning algo-
functions using rectified linear unit (ReLU, 2010),52,53 among rithm (e.g., larger number of free parameters), the more data
others. Certainly, the huge leap in computing power using will be naturally required. A simple linear model with two
graphics processing units (GPUs) and advancement in opti- unknown parameters will only require two “perfect” samples
mization techniques54 were also critical. Today, a residual to fit, while a complicated nonlinear modern DL architecture
network (ResNet55) is among the deepest network that can be may need thousands of data points to train. Applying a small

dataset to train a complex algorithm can be problematic, as it

may lead to overfitting pitfalls,63 where the complex algo-
rithm starts to fit noise or errors in the limited-size training
set, in other words, the algorithm memorizes the data rather
than learns from it. Under this circumstance, generalization64
of the model is usually not good, that is, the model performs
poorly on new, unseen out-of-sample datasets. Mathemati-
cally, we can understand this as a trade-off between variance
and bias, colloquially referred to in the ML literature as the
bias-variance dilemma.65 Specifically, suppose f describes
the underlying real relationship of the data; ^f is the model
approximation that being trained on a certain dataset. The FIG. 3. The learning curve example for diagnostic purpose of data size.
amount of ^f that changes as training sets vary is called vari-
ance. The difference between f and the model ^f is defined as
bias. It is proven that, in a unseen test dataset, both bias and
long as they are not too noisy. However, if collecting more
variance add to the total errors of model, moreover, there is a
data becomes infeasible, one can alternatively perform data
trade-off between the two as in Fig. 2. As the model complex-
augmentation or apply transfer learning approaches. Data
ity (e.g., the number of free parameters) increases, the bias
augmentation68 is an effective way to increase the data size
(i.e., the training fitting error) decreases, but the variance
and diversity for the training model by cropping, padding,
(i.e., the testing generalizability error on out-of-sample data)
shifting, flipping, and rotation, which are widely applied in
increases. The optimal trade-off point between bias and vari-
DNNs to support imaging tasks. Transfer learning69 is
ance or training and testing can be quantified using so called
another important tool in ML/DL to solve the problem of
the Vapnik–Chervonenkis (VC) dimension,66 however, this is
insufficient training data by trying to transfer knowledge from
still currently a theoretical rather than a practical measure of
a source domain to the target domain. This approach has been
model complexity.
successfully applied to medical image segmentation problems
A learning curve (LC)67 is a practical graphical tool that
by transferring knowledge from natural image applications
can be used to evaluate whether there are enough data empiri-
(e.g., Google ImageNet database).70 This idea can be
cally. Note that there exist other versions of learning curves
extended to other tasks, where Zhen et al. demonstrated a
for different purposes, for example, determining the training
CNN for predicting rectal toxicity in cervical cancer radio-
epochs (the number of times that the learning algorithm work
therapy by fine-tuning a pretrained network (VGG-16) on the
through the entire training dataset). However, the idea behind
natural images from ImageNet.71
those learning curves is the same; splitting the dataset into
training, validation, and testing. Then, one can plot the model
performance metric for training/validation separately as a 3.B. How to process the data?
function of the number of the samples to determine a suffi-
Data are the fuel of ML/DL algorithms, where models are
cient number of samples for training the ML/DL algorithm
the combustion engines. Only the combination of a good
before evaluating its generalizability on the testing data. As
engine and good fuel can bring out the most powerful perfor-
shown in Fig. 3, when there is a significant change in the per-
mance, where the case of “garbage in garbage out” is the least
formance error for training or validation, it may indicate a lar-
desirable result. Thus, it is important to have high quality data
ger sample size may be required until they both plateau.
that are properly curated and processed for successful ML
To unlock the usage of more sophisticated models, it is
applications. At face value, this may seem to contradict the cur-
always a good idea to have more training/validation data as
rent notion of Big data analytics. However, this is accounted for
by the fact that larger sized datasets, though noisy, will benefit
from variance reduction by virtue of the law of large numbers.
Depending on the experimental designer’s objectives and
choices, tabular, text, or imaging data can be represented as
vectors, matrices Eq. (1) or even higher-rank tensors.
x11 x12 x13 .. . x1n
x21 x22 x23 .. . x2n
xi ¼ ðx1 ;x2 ;. ..;xn Þ 2 R ; xi ; ¼ ..
n
.. .. . . .. 2 Rdn
. . . . .
xd1 xd2 xd3 .. . xdn
(vector) (matrix)
(1)
For example, two-dimensional (2D) images are usually pre-
sented by a matrix when fed into a CNN, but are usually rep-
FIG. 2. The trade-off between bias and variance with model complexity. resented by a long vector when serving as an input to SVMs

or fully connected NNs. This causes a loss of spatial informa- neighbor, and so forth are also worth considering when
tion, which is another reason for the superior performance of designing applications for radiotherapy. Due to space limita-
CNNs in imaging tasks. Vectors and matrices are only two tion, the following section will mainly focus on linear models
special cases of tensors (rank 1 and 2, respectively), where a and generalized linear models (GLMs), as they are the foun-
general rank k tensor T is defined as a k-linear functional dations of current DL models. Other common classical
from a vector space V, T : V V ! R. The choice of
|fflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflffl} machine learning methods, including SVMs and RFs will be
ðktimesÞ briefly reviewed in the section that follows.
representation is problem-dependent and determined by the
complexity of problem and the size of available data. For 4.A. Linear models and generalized linear model
instance, a video or a three-dimensional (3D) volume medical (GLM)
image such a CT or an MRI can be presented by a tensor
component, if the spatial arrangement is necessary for learn- A linear model fw ðÞ considered to be linear in unknown
ing the task at hand (e.g., detection or classification), then parameter w (w 2 Rm ), has the form,
Tijk with (i,j,k)=(1 dimx,1 dimy,1 dimz), see Fig. 4. In
y ¼ fw ðxÞ ¼ wT x (3)
supervised learning, the collection of input data along with
ground truths (labels) are denoted by: Note that a linear model is not necessarily a linear function
yðiÞ ÞjxðiÞ 2 Rk1 k2 ; ^yðiÞ 2 Rl ; i ¼ 1; . . .; ng. In
X ¼ fðxðiÞ ; ^ of predictors xðx 2 Rm Þ. A model which has polynomial and
unsupervised learning setting, the data are represented by interaction terms as in Eq. (4) is also considered as a “gener-
X ¼ fðxðiÞ ÞjxðiÞ 2 Rk1 k2 i ¼ 1; . . .; ng, as no label infor- alized” linear model,
mation is required. When the dataset contains features (x) y ¼ fw ðxÞ ¼ w0 þ w1 x21 þ w2 x1 x2 þ w3 x22 (4)
highly varying in magnitudes and unit, it is important to stan-
dardize the feature values,72 otherwise, model numerical sta- as one can simply re-define x in a way that the model satisfies
bility and estimation precision may be degraded. Some the general form in Eq. (3). A linear model uses straight lines
common standardization methods include Z-score normaliza- or linear planes to fit data (see Fig. 5). The fitting process
usually involves calculating the optimal estimator b^ by mini-
tion and Min-Max scaling as in Eq. (2):
mizing a designated loss function, for example, squared error
x lðxÞ or likelihood function,
Z-score xnew ¼
stdðxÞ
(2) X
n
x minðxÞ ^ ¼ arg min
w ðyðiÞ ðxðiÞ ÞT wÞ2 (5)
Min-Max xnew ¼
maxðxÞ minðxÞ i¼1
It should be emphasized that standardization techniques are where n is the sample size and i is the index of sample. The
especially beneficial73 for learning by DNNs, which is gener- ^ , from Eq. (5) is called least squared estima-
best estimator, w
ally a challenging task. During the training, standardization tor which has a closed form,
of input data will help the optimizer to find good local mini- ^ ¼ ðX T XÞ1 X T Y
w (6)
mum more easily and faster by mitigating numerical instabil-
ð1Þ ð2Þ ðnÞ T
ity errors. Another big concern when applying ML/DL to where X ¼ ðx ; x ; :::; x Þ ðX 2 R Þ is called a design
nm
medical data is how to deal with missing data. Of course, one matrix and Y ¼ ðyð1Þ ; yð2Þ ; :::; yðnÞ ÞT ðY 2 Rn Þ is a vector of
can always drop the cases with incomplete information, but it labels.
would further reduce the sample size which may already be A linear model is concise and easy to implement, how-
small. To fully make use of the available data, imputation ever, it assumes Y is normally distributed, that is,
methods,74 that is, replacing missing values with statistical YjX Nð0; r2 Þ and there is an identity mapping [Eq. (8)]
estimates, are usually applied before analyzing the full data- between Y and Xw, which is usually not the case in prac-
set. Some simple imputation methods include mean or med- tice. To alleviate these issues, a generalized linear model
ian imputation. More sophisticated imputation methods also (GLM)75 was proposed.
exist based on maximum likelihood estimation (MLE), for A GLM is defined by three components: a random compo-
instance. However, one should keep in mind, that imputation nent, which specifies the distribution of Y given X (Y|X); a
is also dependent on the quality of observed values, therefore, systematic component, which is a linear part relating a
imputation should be applied with caution, as the imputed parameter g to a predictor X (i.e., g=Xw); and a link function
values may be inaccurate and noisy. Alternatively, one may that connects the expected value of Y to g. In GLMs, the dis-
consider data augmentation or transfer learning when dealing tribution of Y|X typically belongs to the exponential family,
with insufficient training data, as discussed earlier. for example, normal, exponential, Bernoulli and categorical
distribution. Some commonly used link functions include
identify, negative inverse, log, and Logit.
4. WHAT CLASSICAL MODELS EXIST? Logistic regression32 which is used for binary classifica-
Aside from modern DL methods, classical machine learn- tion is a special case of GLM that has Bernoulli distribution
ing algorithms such as SVMs, RFs, Naive Bayes, K-nearest and Logit link function, that is,

FIG. 4. An magnetic resonance image [left] (along slice planes) is usually represented by a rank three tensor (component) [right].
4.B. Model regularizations

Often one wishes to suppress overfitting of a model when
feeding data corrupted with noise by adding a penalty term: h
(w) into the loss functions LðwÞ such as in Eq. (9)
~
LðwÞ ¼ LðwÞ þ hðwÞ (10)
where h(w) only depends on the model w (not data) for regu-
larization.76 Typical choices are:
(Elastic)hðwÞ ¼ k1 kwkL1 þ k2 kwk2L2
(Ridge) hðwÞ ¼ k2 kwk2L2 ði.e., k1 ¼ 0Þ (11)
(LASSO)hðwÞ ¼ k1 kwkL1 ði.e., k2 ¼ 0Þ
In fact, this trick can be widely played in other ML/DL tech-
niques (e.g. SVM, DNNs) leading to better solutions. This is
FIG. 5. Geometry intuition of linear regression by Eq. (5), where one wishes particularly true, if the problem is ill-posed as is commonly
to find a straight line fit such that the sum of all residual errors is smallest. the case in many ML/DL applications in medical physics,
where small errors in the training may lead to large variations
in the estimated ML/DL model.
p
g ¼ Xw ¼ log ; YjX BernoulliðpÞ (7)
1p 4.C. Nonlinear classical machine learning
One can rewrite the above to arrive at p ¼ 1þexpðXwÞ, 1
There is a whole host of nonlinear machine learning algo-
where p is the predicted probability of Y=1. Note that p=sig-
rithms that have been applied to medical physics/radiotherapy
moid(Xw), see Eq. (8); hence, the logistic model can be seen
applications.38 Two of the most common ones are briefly
as a simple NN (in Section 5.A) with only two layers (input/
reviewed here: SVMs and RFs. SVMs represent data samples
output) and a sigmoid activation function (Fig. 6),
as points in space, mapped (/) in a way that samples from
8 the different classes can be separated by a margin (gap) that
>
> identity ¼ ðz1 ;...;zm Þ ðRm -regressionÞ
>
< needs to be made as wide as possible, that is, maximized in
sigmoid ¼ 1þe1zm ðm ¼ 1 binary classificationÞ
rðz1 ;z2 ;...;zm Þ higher dimensional space, a trick known as the kernel map-
>
>
>
: softmax ¼ Pem1 zj ;...; Pem
z zm
; ðm[1 classificationÞ
j¼1
e j¼1
ezj ping. In practice, it is usually not feasible to completely sepa-
(8) rate samples, particularly when the data are noisy; hence,
some tolerance errors are allowed. A SVM is inherently a
where m is the dimension of output. A GLM model is usually binary classifier, which has a hinge loss function. To make
optimized via MLE techniques. For logistic regression, this is the optimization process numerically easier, the non-convex
equivalent to minimizing a binary cross-entropy loss: primal problem is usually converted into a convex dual prob-
lem defined as follows:
1X N
Xn
1X n X n
LðwÞ ¼ yðiÞ log pðiÞ (9) max ai aj ak yk yj Kðxj ; xk Þ
N i¼1 ai 0 2 j¼1 k¼1
i¼1
In NNs, cross-entropy is also usually applied as a loss func- 0 ai C; 8i (12)

tion for the classification problem. This is due to its favorable X n
numerical and theoretical properties, as will be further ai y i ¼ 0
explained below. i¼1

TABLE II. A snippet from a dataset of 569 breast cancer diagnosis with 30 image extracted features
id diagnosis radius_mean texture_mean perimeter_mean ... fractal_dimension_worst
842302 M 17.99 10.38 122.8 ... 0.1189

842517 M 20.5 17.77 132.9 ... 0.08903
84300903 M 19.69 21.25 130 ... 0.08758
84358301 M 11.42 20.38 77.58 ... 0.173
84348402 M 20.29 14.34 135.1 ... 0.07678
8510426 B 13.54 14.36 87.46 ... 0.07259
FIG. 6. Activation functions: [Left]r(z)=z for regression; [Middle]rðzÞ ¼ 1þe1 z for classification. [Right]r(z)=max(0,x) for deep learning.
where Kðxj ; xk Þ ¼ /ðxj ÞT /ðxk Þ is known as a kernel, which is proposed with improved performance.77 As an ensemble
an inner product of feature maps / and acts as a cross-simi- method, RF usually reduces variance and improves general-
larity metric in the feature (Hilbert) space. Various types of ization compared to a single decision tree but with the caveat
kernels exists, such as linear, polynomial, and radial basis of reduced interpretability.
function (RBF) depending on the desired data support. For
instance, polynomials have finite data support while RBFs
4.D. Example implementations
have infinite data support and is thus commonly used despite
being more computationally expensive. Penalty terms such as As a demonstrative implementation example, a logistic
ridge loss are usually added to the loss function for regular- regression model application to breast cancer diagnosis is
ization purposes and for preventing overfitting pitfalls. Sup- presented here. The dataset (Table II), which contains 569
port vector machines have been very popular ML techniques breast cancer diagnosis cases, was created in 1995 by
due to their global optimal solutions for classification and researchers at the University of Wisconsin. Each case
regression are widely applied to radiation oncology problems. includes 30 features computed from a digitized image of a
However, as a classical ML approach, they require features to fine needle aspirate (FNA) of a breast mass and a binary
be extracted and selected prior to training. diagnosis label with “M" indicating “malignant" and “B"
A RF uses an ensemble of decision trees to solve classifi- indicating “benign.”
cation or regression problems. A decision tree has a flow- Let the mean-value features (first 10 in all 30) plus a fea-
chart-like structure where each node represents a test on ture x11 ¼ 1 (serving as intercept term) be the predictor
attributes, splitting examples into different branches. From parameters in the trained logistic model. Using
root nodes to leaf nodes, the decision is gradually made by p ¼ 1þexp1ðXwÞ as in Eq. (7), the logistic regression model
multilevel classification rules, which make them quite easy to will take the design matrix XðX 2 R56911 Þ as input and out-
interpret and desirable to use but with limited predictive put prediction p which is the probabilities of diagnosis Y = 1
power. Hence, the corresponding ensemble approach RFs, (malignant). The associated Python code for the reader refer-
that is, combining multiple weak classifiers (decision trees) ence can be found in https://github.com/sunancui/breast-ca
to achieve a stronger classification, is usually applied instead. ncer-diagnosis. As mentioned, logistic regression can be
The splitting at nodes is usually based on a Gini Index, regarded as a simple NN with only a single input/output layer
entropy function, or information gain. The ensemble is based and a sigmoid activation function. We also compared logistic
on so called bagging, that is, averaging. Recently, a gradient regression and NN in the code from this perspective; one
boosting, weighted averaging approach for RFs has been would notice that they yield similar estimation of prediction

parameters (w). A summary of the prediction results of breast layer (j) vanishes. In an extreme case where all activations are
cancer diagnosis by logistic regression is presented in Fig. 7 identities in Eq. (14), such a NN simply reduces to a linear
with the ground truth labels shown for comparison. model Eq. (3) no matter how many layers (deep) it has.
Therefore, this argument provides an insight why activation
functions are the primary source of nonlinearity of NNs. It is
5. WHAT DEEP LEARNING MODELS EXIST? thought that with each nonlinear activation, an additional fold
5.A. Neural networks (manifold) in the data can be achieved allowing for better
representation or capturing of highly complex relationships.
Neural networks in principle35 can be considered as natural As for NNs’ hyperparameters, for example, number of layers
generalizations of linear models (Section 4.A). If we modify L and number of nodes in a specific layer, Bayesian meth-
Eq. (3) by adding a so-called activation function r [Eq. (8)], ods78 can be used to optimize their number. However, most
of the current procedures still rely on trial and error schemes.
xðjþ1Þ ¼ rðjÞ wðjÞ xðjÞ þ bðjÞ (13)
j=0,1,. . ., and let it iterate recursively over itself L times, with 5.B. Why and how to go deeper?
the upper-right index (0),(1),. . .,(L) (Fig. 8), one then has a
The Universal Approximation Theorem (UAT) developed
composite function
by Hornik79 states that mathematically a NN with one hidden
fw ðxÞ ¼ xðLÞ
layer of sufficient nodes can approximate any measurable
(and hence continuous) function on compact sets under cer-
¼ rðL1Þ wðL1Þ rðL2Þ wðL2Þ þbðL2Þ þbðL1Þ
tain mild conditions on the activation functions. This fact
(14) explains why NNs are suitable for fitting complex functions
This is known as a NN, where now the number of structures and datasets. However, based on this, one probably wonders
is renamed as the number of layers (Fig. 9). In particular, the why we would bother going deeper? In fact, the theorem has
first layer (0) and the last layer (L) are called the input layer several constraints. First, we need to have a “sufficient” (can
and output layer, respectively. Any layer (j) with 0 < j < L is be infinite) number of nodes. Secondly, it does not guarantee
called a hidden layer. Moreover, it is in general referred to as the theoretical performance can be achieved through opti-
a DNN if L > 4 (i.e., more than two hidden layers), which is mization in practice due to local minima and convergence
the fundamental building block for DL. Hence, conceptually, issues. Thus, it still depends on experimentally designing the
a NN is merely an extension of a linear model. Thus the con- right architecture (e.g., activation function, regularization,
cept of activation functions in Eq. (8) and loss functions (9) number and size of layers, etc.) and adopting an appropriate
discussed in Section 4.A can be applied to DL without any training process (e.g., optimization method) in order to possi-
changes. This is known as a NN, where now the number of bly achieve the theoretical performance estimates.80
structures is renamed as the number of layers. In particular, Practically, adding more layers to a NN has been shown to
the first layer (0) and the last layer (L) are called the input provide a good architecture design vs increasing the number
layer and output layer, respectively. Any layer (j) with of nodes as suggested by UAT. A NN with more layers will
0 < j < L is called a hidden layer. Moreover, it is in general show better performance than a single layer NN that has the
referred to as a DNN if L > 4 (i.e., more than two hidden lay- same number of parameters. Intuitively, this is possibly
ers), which is the fundamental building block for DL. Hence, because each layer will transform its input, that is, folding,
conceptually a NN is merely an extension of a linear model. and creating a new representation of the data (appropriate
Thus the concept of activation functions in Eq. (8) and loss manifold). The multilevel abstraction that is being learned
functions (9) discussed in Section 4.A can be applied to DL through multiple layers can be hard coded into a single layer
without any changes. with the same number of nodes. Or formally speaking, the
However, the nonlinear activation function now plays a multilayer structures enable NNs to recognize the entangled
prominent role in (deep) NNs as it is the source of nonlinear- manifolds of the data more easily, so as to solve the desig-
ity, that is, mapping the data into higher dimensions. If the nated task.81
activation function between any two layers, Eq. (13), is an However, adding too many layers, the performance of a
identity map in Eq. (8), then the two layers can be merged NN can actually be degraded as well. The phenomenon was
since identified as the vanishing gradient problem by Hochreiter’s
Ph.D thesis82 and has been a major obstacle of DL studies for
xðjþ1Þ ¼ rðjÞ wðjÞ wðj1Þ xðj1Þ þ bðj1Þ þ bðjÞ a long time. Vanishing gradient is a problem occurring dur-
ing optimizing NN weights83 with gradient-based learning
¼ rðjÞ w ~ðjÞ
~ ðjÞ xðj1Þ þ b methods, where the gradient will become too small through
multiplication of many layers, effectively preventing the
(15)
weight from changing its value. In the modern deep architec-
with w~ ðjÞ ¼ wðjÞ wðj1Þ and b~ðjÞ ¼ wðjÞ bðj1Þ þ bðjÞ . Thus, tures, there exist several mechanisms to prevent such prob-
ðj1Þ
one sees clearly from above discussion node x
ðjÞthatðjÞthe lems, for example, residual blocks, ReLU activation,52 batch
directly connects to node x ðjþ1Þ ~
via w ; b ~ as if middle normalization.84 Hence, they help the architectures grow

deeper and more powerful without encountering such issue. posed problems by suppressing the noise in the training data.
For instance, the residual NN architecture (ResNet)55 adds Besides adding a weight penalty as shown in Section 4.B,
one extra term to the usual NN Eq. (13), one can adopt the dropout, another neuroscience inspired
0 1 trick,86 as a regularization technique. As shown in Fig. 10,
during the training, dropout will randomly select some por-
xðjþ1Þ ¼ rðjÞ @wðjÞ xðjÞ þ bðjÞ þ xðj1Þ
|ffl{zffl}
A ðResNetÞ;
tion (dropout rate, e.g., 20%–50%) of nodes being ignored.
skipconnection
They will not affect updated weights, as their contribution to
(16) the activation of downstream neurons is temporally removed.
where the additional term is called the skip connection. Indeed, dropout is currently a very effective ensemble
Heuristically, it adds an extra path or shortcut to another neu- method, performing averaging with NNs while mitigating the
ron that was not connected previously. During the back-prop- risk of memorizing the data. Hence, the resulting NN with
agation process, the network will skip some subnetworks, many layers (DNN) is capable of better generalization to
directly forwarding the gradient from higher to lower layers, unseen data and is less likely to overfit (memorize) the train-
eliminating the gradient vanishing problem. The new activa- ing data.
tion functions ReLU in Fig. 6[Right] and its variants85 elimi- More advanced optimization techniques will also be
nate vanishing gradients by avoiding squishing a large input required when training a “deeper” NN. As compared with the
space between 0 and 1 as in Sigmoid activation, thus prevent- simple case of logistic regression, whose loss function is con-
ing extremely small gradients from occurring at the edge. vex, the deeper NN will tend to have a more sophisticated
Using batch normalization layers between other layers to nor- loss landscape, making finding even the close-to-global opti-
malize the intermediate inputs is another solution. This mal solution more difficult.80 Many efforts have been made
ensures the values of intermediate inputs are inside the range to develop effective optimization algorithms87 for such large-
that has the effective gradients, hence, the gradient values scale nonlinear problems primarily based on gradient descent
will not become too small. techniques, including the use of stochastic approaches with
The next question that comes to mind when designing an momentum, which accelerate learning by increasing the gra-
architecture is how to guarantee good generalization. As the dient vectors in the direction that past gradients accumulated
model becomes extremely complex, overfitting can be a main (velocity). In the adaptive rate scheme, the algorithm will
concern. Intuitively, regularization techniques can resolve ill- adopt various learning rates per parameter according to their
history of momenta and gradients. The common optimization
techniques including Adam,54, RMSProp87 are popular
choices in DL studies as surrogates for the classical stochastic
gradient descent techniques.
5.C. Prevalent architectures

5.C.1. Convolutional neural networks
CNNs47 are best known of late for image recognition and
image-related predictions, which borrow the concept of con-
volution from classical linear system filtering, where a 2D
image I : R2 ! C is convolved with a given kernel function
w : R2 ! R such that the output image is of the form,
Z
I~ ðxÞ ¼ wðx yÞI ðyÞdy (Fourier convolution)
R2
(17)
Therefore, the concept of convolution is naturally blended
FIG. 7. Plot of fitting of breast cancer diagnosis by logistic regression, cases into NNs to develop the so-called CNN when image-like data
with malignant (Y = 1), and benign (Y = 0) diagnosis were denoted.
are concerned.
FIG. 8. Self-iterations of linear transformations.

X1
m;n;C
I~ k;‘;b ¼ wi;j;a;b I sðk1Þþi;sð‘1Þþj;a
i;j;a (18)
~1 ; ‘ ¼ 1; . . .; L
ðk ¼ 1; . . .; L ~2 ; b ¼ 1; . . .C2 Þ
m;n;C ;C
here w ¼ wi;j;a;b i¼1;j¼1;a¼1;b¼1
1 2
2 R4 is a four-tensor convo-
lutional filter (kernel). With stride s > 1, the output size
would be roughly reduced to 1s of the original input size.
Implementing a pooling layer of kernel size s will have the
same effect.
One can understand that using kernels in NNs as being
equivalent to template matching or “seeing” local information
of a neighborhood while blocking information from far apart
or less related regions, as depicted in Fig. 11 (left). This can
be also visualized by vectorizing the input and the output as
in Fig. 11 (right), from which one can realize CNNs only con-
nect to certain nodes within a layer when compared to a fully
FIG. 9. A graphic plot of Fig. 8 and Eq. (14). connected NN; this is called local connectivity property.
Overall, a CNN is a locally connected NN as it only considers
local relations (receptive field) while it decouples information
far away in space and/or time allowing for efficient data rep-
resentation and improved task learning.
The property that makes a CNN distinguishable from
other locally connected networks is that CNNs force the
weights (kernel) to be repeatedly used, that is,
m;n
wi;j;a;b i¼1;j¼1 2 R2 is used everywhere in the input feature
map (e.g., I ). This parameter sharing scheme effectively
takes advantage of the spatial invariance property of the
imaging data, largely reducing the number of free parameters
in the architecture.
5.C.2. Recurrent neural networks

Recurrent neural networks (RNNs)88 are another variant
of NNs especially useful for sequential (time series) data
learning, such as voice, text data. Although, a 1D CNN can
also model such data by taking into account local sequential
FIG. 10. Illustration of dropout techniques.
relationships, the parameters (kernel) sharing scheme by
CNNs is too simple and “shallow” for complex learning
objectives. RNNs manage another way of sharing parameters
but in a deeper and more sophisticated sense. Suppose we
A CNN typically consists of several convolutional layers
have a sequential data fxðtÞ 2 Rn jt 2 Tg as input and
(filtering), pooling layers (down-sampling aimed for reducing
f~yðtÞ 2 Rm jt 2 Tg as the corresponding labels where T
dimensionality), and activation functions, where the convolu-
denotes an index set labeling separation across time steps.
tion layer is the core component that operates convolution on
Note that we only consider one sample here for convenience,
an image field to capture its relevant features and contribute
the superscript no longer stands for sample index. The goal
to the data representation for the following layers, which is
of an RNN is to learn the relation between data fxðtÞ g and
fed subsequently into a fully connected layer to perform the
labels f~yðtÞ g via hidden units fhðtÞ 2 Rk g. In RNNs, two
learning task (e.g., detection or classification). In most practi-
functions fh : Rk Rm Rn ! Rk and
cal implementations, the “convolution" in CNN is replaced
g/ : R R R ! Rm , relating fxðtÞ g, f~yðtÞ g, fhðtÞ g
k m n
by a cross-correlation operator rather than Eq. (17), for
parameterized by NNs’ weights h,/, are imposed to search
speed-up purposes. In the case of input as a 2D image (size
for the recurrence relations,
L1 L2 ) with multicolor channels (C1 ) represented by a 3D
L1 ;L2 ;C1

tensor I ¼ I i;j;a i¼1;j¼1;a¼1 2 R3, a convolutional layer with hðtÞ ¼ fh hðt1Þ ; yðt1Þ ; xðtÞ 2 Rk
stride s and kernel size m 9 n will produce an output image (19)
yðtÞ ¼ g/ hðt1Þ ; yðt1Þ ; xðtÞ 2 Rm
I~ (of size L
~1 L
~2 with C2 channels) as below,

FIG. 11. Illustration of convolutional neural network. (Left) a kernel acts as a mask to consider only neighboring information (pixels) yet block information far
part. For example, z11 in the convolved output is generated by applying 3 9 3 kernel on the receptive field (denoted) at the left upper corner. In the view from
the right, one sees clearly that utilizing kernels (filters) is essentially another implementation of locally connected networks.
where h and / serving as unknown neural weights to be opti- matrix C, where Ci;j ¼ Ai;j Bi;j ). And ∘ denotes the compo-
mized. As these parameters are shared across different time nent-wise functional composition, that is, the ith element in the
points, free parameters in an RNN are largely reduced. Two resulting vector of function is the functional composition
common RNN architectures were invented in the early devel- between the ith elements of the two input vector of function.
opments of RNNs, they are known as Elman89 and Jordan90 The affine transformation is defined by,
networks, which are simple RNNs that differed in their con-
Aff W;U;b ðn; gÞ :¼ W n þ U g þ b
nection scheme. (21)
These simple RNNs were known to suffer from two major ðW 2 Rpq ; U 2 Rpr ; b 2 Rr ; n 2 Rq ; g 2 Rr Þ
disadvantages, (a) their back-propagation gradients tended to In Eq. (20), three additional units GðtÞ a , where a = 1,2,3
vanish or explode quickly and (b) the information farther denotes the forget gate, the input gate and the output gate,
apart cannot be connected. These two problems are resolved respectively,
by using the so called long short-term memory (LSTM) archi-
tecture.56 GðtÞ ¼ r a Aff Wa ;Ua ;ba x ðtÞ ðt1Þ
; y 2 Rk (22)
a
An LSTM is a state-of-the-art RNN model composed of
basic building blocks denoted as gated units, which learns by are used to control and determine when and how much
itself to store and forget internal memories when needed such should the previous information be kept or forgotten. Total
that it is capable of creating long-term dependencies and unknown parameters of an LSTM are ðWh ; Uh ; bh Þ and
paths through a time series as in Fig. 12. An LSTM has the fðWa ; Ua ; ba Þja ¼ 1; 2; 3g. A simplified version of LSTM is
following construction. Specify fh ¼ fLSTM, g/ ¼ gLSTM in called a gated recurrent units (GRU),where the number of
Eq. (19) by gates is reduced to two, namely, reset and update gates, and
hence more computationally efficient.91
ðtÞ
hðtÞ ¼ fLSTM hðt1Þ ; yðt1Þ ; xðtÞ ¼ G1 hðt1Þ
; (20)
ðtÞ
yðtÞ ¼ gLSTM hðt1Þ ; yðt1Þ ; xðtÞ ¼ G3 r3 hðtÞ 5.C.3. Attention awareness
where ⊙ denotes the component-wise multiplication, (i.e., for Attention awareness92 is a special mechanism that equips
two m9n matrices A and B, A⊙B will produce another m9n NNs with the ability to focus on a subset of a feature map,

FIG. 12. The diagram of a gated units of long short-term memory, which is consisted of input, output and forget gates.
which is particularly useful when tailoring the architecture for connected NNs. These three basic architectures comprise
specific tasks. Suppose, the input of an attention network fw ðÞ almost all the deep architectures used today. Other architec-
is x and its output is a; ai;j 2 ½0; 1 . After applying it on the tures fall into these three categories or are a combination of
feature map z, it will produce an attention glimpse g as below: them.
For instance, a variational autoencoder57 is a special type
a ¼ fw ðxÞ; g ¼ a z; (23)
of CNN or fully connected NN which has a bottleneck struc-
where ⊙ is a component-wise multiplication. The attention ture and is employed for data compression applications. As
mechanism assigns weights a to the input feature map. When Fig. 13 suggests, its input and output dimension are exactly
a is restricted to be either 0 or 1, it is hard attention93; When the same and the dimension of its latent variables is usually
a is between 0 and 1, it is called soft attention.94 significantly smaller than its input size. In a VAE, latent vari-
Attention was first implemented in RNNs94 for lan- ables are sampled from a designate distribution, for example,
guage translation tasks. In the classical Seq2Seq model, a Gaussian distribution, which is parameterized by the encoder.
context vector will be built out of the last hidden state The decoder is responsible for re-constructing the original
from an encoder RNN, which takes source sequence as input from the latent variables. The loss function of a VAE is
input, then a decoder RNN would take the context vector defined as a lower bound of the data log-likelihood function.
as input and generate target sequences. However, the It turns out to be a combination of two terms; one penalizes
model tends to “forget” the early part of source sequences reconstruction error as in a plain autoencoder, the other
when generating target sequences. Hence, attention net- forces the similarity [Kullback–Leibler divergence (KL)]
works are proposed to be applied on all the hidden states between the learned distribution and true prior distribution.
(z) of encoder RNN to learn which part of source The additional KL-term of VAE can be regarded as a way of
sequences should be focused on when generating a certain regularization. Contrasted with the deterministic autoencoder,
piece of a target sequence (g). it helps suppressing the noise in data.
Attention mechanisms can be applied to fuse and align U-Net,96 which has been successfully applied in many
information from heterogeneous data. For instance, by adopt- medical imaging segmentations, is also CNN based, with its
ing an attention mechanism, the relation between an image name derived from its “U” shape as shown in Fig. 14. U-Net
and its caption can be learned.95 Specifically, the imaging is composed of three sections: contraction, bottleneck, and
information extracted from a CNN can be consumed by the expansion sections. The contraction (encoding) section is
RNN, for each position of the caption, an attention mecha- made of several blocks of convolutional layers and max pool-
nism fuses the information and generate the desired caption ing layers. It is responsible for learning complex features.
on a word by word basis. The bottleneck section is the bottommost layer mediating
between the contraction and expansion layers. The heart of
U-net lies in the expansion (decoding) section, which consists
5.C.4. Other interesting architectures
of several blocks of convolutional layers and an upsampling
We have introduced two fundamental architectures for DL layer. Particularly, it appends the outputs from the contraction
algorithms, that is, CNNs and RNNs, together with fully section to their corresponding components in the expansion

FIG. 13. The diagram of a variational autoencoders, which is composed of encoder and decoder networks.
FIG. 14. The diagram of a U-net consisting of contraction, bottleneck, and expansion section. The size of output is L1 L2 a, where L1 L2 is the size of
original image, and a is the number of structure that needs to be contoured.
section. Such a design helps preserve the structural integrity basic idea is that the generator tries to generate new data
of the image and reduces distortion enormously. The loss from random noise to mimic the real data, while the dis-
function of U-net is defined as energy function computed by criminator tries to distinguishable any fake data from the
a pixel-wise soft-max over the final feature map combined real data in an analogy to a Turing machine. The two parts
with the cross-entropy loss function. are alternately trained to compete with each other and over-
A generative adversarial networks (GAN)58 is another time the performance of both will be boosted. The original
popular NN architecture and is considered among the most GAN as shown in Fig. 15 works on the dataset without
exciting breakthroughs in DL of the past decade. Numerous labels (unsupervised learning) and is based on fully con-
GAN variants have been invented for different purposes fol- nected architectures. However, its idea can be directly
lowing their invention by Ian Goodfellow in 2014. Like a applied to CNNs or other architectures to create a deep con-
VAE, a GAN is also a generative model,61 which aims to volution GAN (DCGAN),97 which has had wide applica-
learn the true data distribution from the training set so as to tions in the imaging domain such as sequence generation. A
generate new data points. However, unlike VAEs, which try cGAN98 is a GAN variant that can be used to learn multi-
to minimize the lower bound of likelihood function, GANs modal data models and can be used for generation of syn-
learn the data distribution through pitting a competition thetic data for training of limited sample outcome prediction
between two adversarial components, so-called generator models, for instance. Its generator and discriminator are both
and discriminator networks in some form of a game. The trained conditioning on the label/other information.

FIG. 15. A generative adversarial neural network which learns data distribution through the competition between two adversarial components.
TABLE III. A snippet of a dataset containing the six predictor for local control prediction
id Gender pathologic_N pathologic_stage pathologic_T other_dx tobacco_smoking_history
1 Male N0 Stage IIIB T3 No 4

2 Male N1 Stage IIA T1b No 3
3 Female N2 Stage IIIA T1 No 2
4 Female NX Stage IB T2 No 4
InfoGAN99 is an version of GAN that learns meaningful containing 131 training/70 test 3D CT images of size (512,
codes of unlabeled data. CycleGANs100 were invented for 512, 74 784). The last dimension of images varied, as
domain image transfer; they can accomplish one-to-one patients have different numbers of slices. During the training,
translation with unpaired datasets such as synthetic CT gen- each slice from patients was taken as individual inputs.
eration from MR images.101 Hence the input of U-Net will be a 2D image. The U-Net is
applied to classify each pixel into three different classes:
liver, lesion, something else. Dice similarity was used for
5.D. Example implementations
evaluation and is calculated for the three classes to evaluate
We implemented a fully connected NN in pytorch for a the resulting classification models. The code was imple-
binary classification problem. The dataset in Table III con- mented to split the 131 training patients into training/valida-
tains 45 patients in TCGA-LUAD and TCGA-LUSC who tion sets. The model was trained up to 17 epochs and was
received external beam radiotherapy for a primary tumor as evaluated on the validation set. Some sample results are
adjuvant therapy. Only the patients who have complete dose shown in Fig. 17. Interested readers are also encouraged to
and treatment outcome information were kept. In addition to play with the architecture definitions in the code and test their
dose, six clinical variables were selected for local control pre- final models on the 70 hold-out patients.
diction (progressive = 1/local control = 0).
Before training the model, all the variables were converted
5.E. Reinforcement learning
into numerical values and were Z-score normalized to avoid
numerical instability. The dataset is randomly split into train- Reinforcement learning is the third category of machine
ing and testing sets. The code is available for the reader refer- learning apart from supervised and unsupervised. It is
ence at https://github.com/sunancui/lung_TCGA_prediction, designed to achieve a definite goal by optimizing a “reward”
where the history of training/test losses is plotted and model function. The learning process of an RL is through interac-
performance is evaluated with area under receiver operating tion with an “environment” so that RL user (also called an
curve (AUC). In Fig. 16, AUC results of both training/testing agent) tries to earn the most reward to obtain the designated
data are shown. goal.
Another implementation of deep U-Net for liver tumor Reinforcement learning is based on the MDP, five-tuple
segmentation is also provided for the reader reference at ðS; A; P; c; RÞ. S ¼ fðx1 ; . . .; xn Þ 2 Rn g is used to describe
https://github.com/sunancui/liver_segmentation. The dataset an environment, it is the space of its all possible states. A is a
is from the Liver Tumor Segmentation Challenge (LiTS), collection of all actions the agent can take. R : X ! R is the

between states. The probability mass function (pmf) (s,a,t)

↦P(s,a,t) denotes the transition probability from state s 2 S
to another t 2 S under an action a 2 A. Consequently, this
induces the condition probability:
Psa ðtÞ Probðtjs; aÞ Pðs; a; tÞ=Pðs; aÞ (24)
on space of next states t conditioned on previous state s and
current action a.
An MDP is also called an “environment” in modern
machine learning development. Each element si 2 S is called
a “state” representing a configuration in MDP. An action
ai 2 A corresponds to a control or a move to be exerted. The
purpose of an agent in the RL algorithm is to find a sequence
of actions fa0 ; a1 ; . . .g such that the following path in S col-
lects maximum rewards:
a0 a1 a2
FIG. 16. Area under receiver operating curves of local control prediction in s0 ! s1 ! s2 ! s3 (25)
lung cancer by neural network. p p p
In other words, an agent is a policy function p : S ! A who

provides an action a = p(s) under a state s. There are mainly
reward function given on the product space X ¼ S A S, two ways to construct a policy function by policy-based and
that is, the reward is determined by actions and states. value-based methods in RL where the former tries to find a
c 2 [0,1] is the discount factor representing the possible policy function directly102 via pðhÞ while the latter finds a pol-
rewards that propagate from future back to the present. And icy implicitly via a Q-function defined by,
P : F ! ½0; 1 is a probability describing the transition
FIG. 17. Example results from the validation set with dice similarity coefficient shown, where “nan" indicates no such structure existing in the ground truth label.
(first column: original image slice, second column: ground truth label, third column: prediction).

typically several hundred or thousand times depending on the

" # original dataset size.
X
1
Qp ðs; aÞ ¼ E ck Rðsk ; pðsk ÞÞ p; s0 ¼ s; a0 ¼ a (26) However, in practice, the notion of a comprehensive valida-
k¼0 tion process is more complicated than just implementing CV
An optimal policy p : S ! A is then derived by maximiz- considering various scenarios of sample size and requirements.
ing the Q-function such that Qp ¼ maxp Qp . Such RL algo- The Transparent reporting of a multivariable prediction model
rithms that use Q-functions are called Q-learning. A practical for individual prognosis or diagnosis (TRIPOD)104 guidelines
way to compute the Q-function Eq. (26) is via the Bellman’s provide a detailed list of different trustworthy-level model vali-
iteration defined by: dation pipelines. Specifically, it categorizes model validation
into four types, from type 1 to type 4, where the validation
~ iþ1 ðs; aÞ ¼ Et Psa Rðs; aÞ þ c max Q
~ i ðt; bÞ : rules become more stringent. Type 1 uses the whole dataset to
Q (27)
b2A develop the model and test its performance, whether using a
resampling method or not. Type 2 uses only parts of the data
such a sequence has a unique converging point where
~ i g1 ! Qp and so to develop a model and reserve the rest for testing, the split can
fQ i¼1 be either random (subtype a) or by location/time (subtype b).
~ ðs; aÞ ¼ Et Psa Rðs; aÞ þ c max Q
~ ast ðt; bÞ : Types 3 and 4 are both refer to external (independent) valida-
Q (28)
b2A tion, both using a separate dataset to evaluate the model. Par-
ticularly, type 4 further requires the model has been already
Consequently, the optimal value Qp can be easily computed published and is being evaluated in a meta-analysis like
by Bellman’s equation fQ ~ i g1 instead.
i¼1 scheme . In all, developers have the responsibility to figure out
RL serves as an independent ML field separate from the best validation pipeline for their specific studies, while
supervised and unsupervised learning in the sense that it does higher-level validation is usually desirable when possible to
not rely on observed data nor comparing its outputs with any achieve acceptance from the community and demonstrate fea-
data labels. It is designed to make an optimal decision or a sibility for clinical implementation.
control action and therefore it is particularly suitable for adap-
tive treatment planning and optimizing sequential decision-
making. An implementation in radiotherapy to optimize dose 7. ARE YOU THE HUMAN STILL RELEVANT?
adaptation can be found in.45 The human community of ML and DL developers and
users will play a decisive role (knowingly or unknowingly) in
6. HOW TO VALIDATE? the continued advancement of machine learning technologies
and on the ultimate impact these technologies will have on
When building a model one would like to know how good society including medical physics and radiation oncology. In
it is. This is particularly true in the context of ML/Dl, where considering our future role in shaping these technologies,
the models have tremendous ability to overfit (memorize) the some perspective can be gained by considering Amara’s
training data and one would like to confirm that the model Law105: “We tend to overestimate the effect of a technology
can actually have a good generalization, that is, perform well in the short run and underestimate the effect in the long run.”
on out-of-sample or unseen data. Fig. 18 provides a visual representation of this disconnect
Common validation methods are based on statistical re- between human expectations and how technological produc-
sampling techniques and include cross-validation (CV) and tivity develops. It can be conjectured that ML/DL technology
bootstrapping. CV65 (e.g., K-Fold CV, leave-one-out CV) is a is currently in the tail-end of the overestimation phase. The
widely used resampling technique in classification/regression development of ML paradigms and applications has seen
model internal validation analysis. In CV, one would system- explosive growth over the past twenty years,106 and reports
atically split the data into training/testing sets, train the model on AI can sometimes overinflate its potential. However,
on training sets, then test a selected performance metrics (ac- despite the current hype, it is suggested that, based on past
curacy, area under receiver operating curve, specificity, sensi- trends in this field, the research community’s (and eventually
tivity) on the test datasets in a round-robin fashion and by extension the general public’s) interest in AI will begin to
average the results. Stratified K-fold CV is usually applied for diminish in the coming decade,106 even as progress is made
a imbalanced classification problem, where each fold in CV to address its most significant limitations. Future develop-
contains roughly the same number of positive/negative cases. ment of these technologies will depend on continuing efforts
Bootstrapping103 is an inherently computationally more and investments in (a) curation of high quality, accessible
intensive procedure than CV but generates more realistic datasets, (b) developing algorithms to make high accuracy
results. Typically, a bootstrap pseudo-dataset is generated by predictions using available datasets, and (c) buy-in from
randomly making copies of original data samples, similar to prospective users.
Monte Carlo techniques, with an estimated inclusion proba- With respect to medical and radiation oncology commu-
bility of 63%. The bootstrap often works acceptably well even nity, this user buy-in ultimately needs to come from clinicians
when datasets are small or unevenly distributed. To achieve in order to establish ML and DL techniques as a standard of
valid results this process must be repeated many times, care within the clinical workflow. To gain acceptance from

learning process by incorporating human-computer interac-

tion into the system. Machines are known for their great
capabilities of learning from vast datasets, while humans are
much better at making decision with scarce information.
Human–computer interaction is expected to combine best
human intelligence and machine intelligence to make robust
and accurate decisions. To develop a robust system to sup-
port decision-making in clinical practice, it is important to
investigate both (a) how to make machines more accurate
with physicians/physicists’ input and (b) how to improve a
physician/physicist’s task with the aid of a machine. The
previous one helps leverage human intelligence into AI algo-
FIG. 18. Graphical representation of Amara’s law and today’s current state of rithms for more accurate and robust decisions. The latter
machine learning and deep learning technologies. one is the real decision-making scenario in practice. During
this process, the uncertainty of predictions made by human/
machine needs to be estimated. The performance of machine
the clinical team, a new technology must prove itself to be versus machine-aided human should be thoroughly investi-
useful in a clinical setting, as well as to not subject patients to gated. Specifically, in the scenarios where physicians/physi-
the risk of unnecessary harm. As a matter of ethical account- cists and machine make different decisions, a scheme should
ability, ML/DL techniques for clinical workflows need to be be designed to resolve the disagreement and to make best
transparent enough to allow human experts to be part of the use of both physicians/physicists and machine intelligence,
decision loop. which is going to be better than either alone.
7.A. Why interpretability matters? 8. WHAT PITFALLS MAY EXIST?

Although ML/DL have shown promising accuracy in solv- There are caveats to be mindful of when medical physi-
ing practical problems in radiation oncology, interpretability cists develop or deploy ML/DL models to solve practical
is a necessary requirement for ML and DL to be viable clini- problems in medicine or radiation oncology: bias, data qual-
cal tools. Accuracy in this clinical context refers to the ability ity, data sharing, and data privacy, to name a few. Biases in
of the ML and DL algorithms to perform the tasks they are the data including age, gender and race,108 or other sources of
assigned either as well, or better than what a human could underrepresented components in the AI algorithm training
accomplish. Interpretability refers to the ability of the clini- may potentially aggravate health care disparities. That is,
cian to confidently understand and interpret the results of an models developed with bias may lead to misleading diagnosis
algorithm, without necessarily having to understand the min- or poor decision support for patients in minority groups
ute details of its mechanics.107 This concept of interpretabil- whose members have not been sufficiently included in the
ity is particularly important in a clinical setting because it can dataset that is used to develop the models. It is found that
help act as a fail-safe against instances in which algorithms some genetic variants, which are more common among black
may produce results that are flawed due to inherent bias in Americans than white Americans, were mis-classified as
the training data or other unforeseen bugs. pathogenic in a study with insufficient inclusion of black
Algorithms which are both accurate and interpretable are Americans in control cohorts.109 Patients with melanoma and
able to gain a clinician’s trust as clinical tools because they can lung cancer treated with the same immunotherapy regimen
be expected to perform their function correctly most of the time may respond differently based on sex, with higher remission
and do not require the clinician to blindly accept their results. probability for male patients.110 A segmentation algorithm
Existing ML and DL techniques are recognized to suffer from designed for one anatomic site may still produce (erroneous,
a tradeoff between accuracy and interpretability, and therefore fake) contours if inappropriately applied to the wrong ana-
more work is necessary to develop ML/DL methods which can tomic site. Thus, AI Algorithms can take whatever informa-
achieve a better balance between these two qualities, such as tion we feed into them and output a result that needs to be
the use of gradient maps or proxy models with DNN or human carefully scrutinized/tested before clinical implementation to
in the loop with ML approaches as discussed below.107 mitigate such bias. It is physicists’ responsibility to under-
stand ML/DL algorithms’ scope of application as well as their
7.B. How to handle human–computer interactions limitations and to conduct a thorough battery of quality assur-
in decision-making? ance (QA) on software tools that are used in the clinic. Physi-
cists are also responsible for eliminating any potential bias by
To meet the needs of practical clinical scenarios in radia- ensuring that implemented ML/DL algorithms continue to
tion oncology, it is important to involve humans in the loop perform as expected from an accuracy and clinical context
of model development and decision-making. Human-in-the- perspective as well by monitoring their performance and con-
loop (HITL) is a practical guide to optimize the entire ducting routine QA tests monthly or yearly, for instance.

Data quality is another important concern when develop- quality of care for patients, which is the ultimate goal of med-
ing and applying ML/DL tools. There are many aspects of ical practice.
data quality, including accuracy, completeness and reliability
etc. Accuracy means the data that have been collected are of
10. CONCLUSIONS
original source data. Completeness refers to the fact that all
required data elements are present. Reliability requires that Machine learning/DL algorithms have witnessed increas-
data remain consistent and the information generated is ing applications in treatment planning and quality assurance
understandable. Medical physicists should examine the accu- to improve efficiency and safety in radiation oncology. These
racy of data before training new models, for example, filter- tools can also improve outcome prediction and personalizing
ing out any abnormal outliers, double checking whether of radiation oncology treatment. To utilize these tools effec-
patients are correctly linked to their information, etc. By gath- tively, medical physicists need to define the problem at hand,
ering complete high quality data that cover patients with dif- know its corresponding category in the ML/DL lexicon,
ferent age, genders and race, medical physicists may also help determine suitable algorithms/models to utilize, collect data
avoid potential bias in the algorithms. Medical physicists of appropriate size and conduct validation in the proper way
should also inspect the readability of data, that is, data yield to ensure generalization to unseen data while being mindful
the same results on repeated collections, processing, storing of possible pitfalls and how to avoid them. This article aims
and display of information in different medical records. to serve as a tutorial and stepping stone in that direction.
Quantitatively, the Shapley value metric111 was proposed as a Moreover, establishing ML/DL algorithms into standard clin-
tool for evaluating quality of healthcare data for use in Ml/DL ical workflow will require additional attention to balancing
algorithms. It interprets whether the presence of an individual accuracy and interpretability in development of these models.
data can help or hurt the overall performance of the ML/DL Medical physicists are encouraged to be informed of the latest
predictive model. Such a quantitative tool can be useful and ML/DL’s developments as part of their continued education,
may provide guidance for medical physicists to assure data get more acquainted with ML/DL literature and become lea-
quality in the future. der of the processes of accepting and commissioning ML/DL
Data of large volumes are always desirable in the era of for routine clinical practice.
precision medicine and predictive analytics. However, proper
data sharing protocols are a precondition for generating large
volumes given the limited sample size at an individual radia- ACKNOWLEDGMENTS
tion oncology department. Medical physicists should be
This work was supported in part by the National Institutes
aware with some thorny issues including data privacy, data
of Health P01-CA059827, R37-CA222215 and
security and data interoperability112 when involving in multi-
R01-CA233487.
institutional studies and assist with other stake holders toward
providing secure and safe data lake exchange platforms for
radiation oncology applications. CONFLICT OF INTEREST
The authors have no relevant conflict of interest to
9. WHAT DOES THIS ML ERA IMPLY FOR MEDICAL disclose.
PHYSICISTS?
Machine and deep learning technologies are promising a)
Author to whom correspondence should be addressed. Electronic mail:
tools to introduce substantial positive changes to the medical sunan@umich.edu.
physics workflow and is likely to become an indispensable
component of its future. Ml/DL are likely to automate many
REFERENCES
of the mundane procedures as well as some of the essential
processes involving treatment planning and quality assurance 1. Bi WL, Hosny A, Matthew B, et al. Artificial intelligence in cancer
to name of few. This is likely to lead to changes in the future imaging: clinical challenges and applications. CA. 2019;69:127–157.
2. Levine AB, Schlosser C, Grewal J, Coope R, Jones SJM, Yip S. Rise
role of the medical physicist in the clinic. Medical physicists of the machines: advances in deep learning for cancer diagnosis. Trends
as vanguards of technology in medicine are at the forefront of Cancer. 2019;5:157–169.
embracing ML role and its potential. However, this also 3. Sahiner B, Pezeshk A, Hadjiiski LM, et al. Deep learning in medical
requires that medical physicists become more acquainted imaging and radiation therapy. Med Phys. 2019;46:e1–e36.
4. Nikolov S, Blackwell S, Mendes R, et al. Deep learning to achieve clin-
with ML/DL, its algorithms and their implementations, and ically applicable segmentation of head and neck anatomy for radiother-
lead the process of acceptance and commissioning of such apy; 2018.
algorithms in their clinic. These issues are likely to require 5. Tseng H-H, Luo Y, Ten Haken RK, El Naqa I. The role of machine
learning in knowledge-based response-adapted radiotherapy. Front
changes in the medical physicist training and their job profile.
Oncol. 2018;8:266.
Nevertheless, the benefits are expected to significantly out- 6. Shiraishi S, Moore KL. Knowledge-based prediction of three-dimen-
weigh some of the possible downsides, including achieving sional dose distributions for external beam radiotherapy. Med Phys.
better efficiency, improved safety, and potentially better 2016;43:378–387.

7. Valdes G, Simone CB, Chen J, et al. Clinical decision support of radio- Emerging Artificial Intelligence Applications in Computer Engineer-
therapy treatment planning: a data-driven machine learning strategy for ing. Amsterdam, The Netherlands: IOS Press; 2007:3–24.
patient-specific dosimetric decision making. Radiother Oncol. 31. El Naqa I, Kerns SL, Coates J, et al. Radiogenomics and radiotherapy
2017;125:392–397. response modeling. Phys Med Biol. 2017;62:R179.
8. El Naqa I, Irrer J, Ritter TA, et al. Machine learning for automated 32. Peng CYJ, Lee KL, Ingersoll GM. An introduction to logistic regres-
quality assurance in radiotherapy: a proof of principle using EPID data sion analysis and reporting. J Educ Res. 2002;96:3–14.
description. Med Phys. 2019;46:1914–1921. 33. Hearst M. Support vector machines. IEEE Intell Syst. 1998;13:18–28.
9. Good D, Joseph LW, Robert LQ, Jackie W, Yin F-F, Das SK. A knowl- 34. Breiman L. Random forests. Mach Learn. 2001;45:5–32.
edge-based approach to improving and homogenizing intensity modu- 35. Priddy KL, Keller PE. Artificial Neural Networks: An Introduction
lated radiation therapy planning quality among treatment centers: an (Vol. 68). Bellingham: SPIE Press; 2005.
example application to prostate cancer planning. Int J Radiat Oncol 36. Emre Celebi M, Aydin K. Unsupervised Learning Algorithms, 1st edn.
Biol Phys. 2013;87:176–181. Berlin: Springer Publishing Company; 2016.
10. Avanzo M, Stancanello J, El Naqa I. Beyond imaging: the promise of 37. Hartigan JA, Wong MA. A k-means clustering algorithm. J STOR.
radiomics. Phys Med. 2017;38:122–139. 1979;28:100–108.
11. Mattonen SA, Palma DA, Johnson C, et al. Detection of local cancer 38. Putora PM, Peters S, Bovet M. Informatics in Radiation Oncology. In:
recurrence after stereotactic ablative radiation therapy for lung cancer: El Naqa I, Li R, Murphy MJ, eds. Machine Learning in Radiation
physician performance versus radiomic assessment. Int J Radiat Oncol Oncology: Theory and Applications. Cham, Switzerland: Springer;
Biol Phys. 2016;94:1121–1128. 2015:57–70.
12. Deist TM, Dankers FJWM, Valdes G, et al. Machine learning algo- 39. Abdi H, Williams LJ. Principal component analysis. WIREs Comput
rithms for outcome prediction in (chemo)radiotherapy: an empirical Stat. 2010;2:433–459.
comparison of classifiers. Med Phys. 2018;45:3449–3459. 40. van der Maaten L, Hinton G. Visualizing data using t-SNE; 2008.
13. Cui S, Luo Y, Hsin Tseng H, Ten Haken RK, El Naqa I. Artificial neu- 41. Baldi P. Autoencoders, unsupervised learning and deep architectures.
ral network with composite architectures for prediction of local control In: Proceedings of the 2011 International Conference on Unsupervised
in radiotherapy. IEEE Trans Radiat Plasma Med Sci. 2019;3:242–249. and Transfer Learning Workshop - Volume 27, UTLW’11. JMLR.org;
14. Cui S, Luo Y, Tseng HH, Ten Haken RK, El Naqa I. Combining hand- 2011:37–50.
crafted features with latent variables in machine learning for prediction 42. Sutton RS, Barto AG. Reinforcement Learning: An Introduction, Vol.
of radiation-induced lung damage. Med Phys. 2019;46:2497–2511. 1. Cambridge: MIT Press; 1998.
15. Men K, Geng H, Zhong H, Fan Y, Lin A, Xiao Y. A deep learning 43. Puterman ML. Markov Decision Processes: Discrete Stochastic
model for predicting xerostomia due to radiation therapy for head and Dynamic Programming, 1st edn. New York, NY: John Wiley & Sons
neck squamous cell carcinoma in the RTOG 0522 clinical trial. Int J Inc.; 1994.
Radiat Oncol Biol Phys. 2019;105:440–447. 44. Silver D, Huang A, Maddison CJ, et al. Mastering the game of go with
16. Ibragimov B, Toesca D, Chang D, Yuan Y, Koong A, Xing L. Develop- deep neural networks and tree search. Nature. 2016;529:484–503.
ment of deep neural network for individualized hepatobiliary toxicity 45. Tseng H-H, Luo Y, Sunan C, et al. Deep reinforcement learning for
prediction after liver SBRT. Med Phys. 2018;45:4763–4774. automated radiation adaptation in lung cancer. Med Phys.
17. Schuler T, Kipritidis J, Eade T, et al. Big data readiness in radiation 2017;44:6690–6705.
oncology: an efficient approach for relabeling radiation therapy struc- 46. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–
tures with their TG-263 standard name in real-world data sets. Adv 444.
Radiat Oncol. 2019;4:191–200. 47. Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with
18. Meyer P, Noblet V, Mazzara C, Lallement A. Survey on deep learning deep convolutional neural networks. In: Pereira F, Burges CJC,
for radiotherapy. Comput Biol Med. 2018;98:126–146. Bottou L, Weinberger KQ, eds. Advances in Neural Information
19. Tseng HH, Wei L, Cui S, Luo Y, Ten Haken RK, El Naqa I. Machine Processing Systems 25. Red Hook, NY: Curran Associates Inc;
learning and imaging informatics in oncology. Oncology. 2018;1–19. 2012:1097–1105.
https://doi.org/10.1159/000493575 48. Mikolov T, Karafiat M, Burget L, Cernocky JH, Khudanpur S. Recur-
20. El Naqa I. Perspectives on making big data analytics work for oncol- rent neural network based language model. In: INTERSPEECH, edited
ogy. Methods. 2016;111:32–44. by Takao Kobayashi, Keikichi Hirose, and Satoshi Nakamura. ISCA;
21. McNutt TR, Bowers M, Cheng Z, et al. Practical data collection and 2010:1045–1048.
extraction for big data applications in radiotherapy. Med Phys. 2018;45: 49. Guyon I, Elisseeff A. An introduction to variable and feature selection.
e863–e869. J Mach Learn Res. 2003;3:1157–1182.
22. Kang J, Schwartz R, Flickinger J, Beriwal S. Machine learning 50. Ackley DH, Hinton GE, Sejnowski TJ. A learning algorithm for Boltz-
approaches for predicting radiation therapy outcomes: a clinician’s per- mann machines. Cogn Sci. 1985;9:147–169.
spective. Int J Radiat Oncol Biol Phys. 2015;93:1127–1135. 51. Hinton GE, Osindero S, Teh Y-W. A fast learning algorithm for deep
23. Bibault J-E, Giraud P, Burgun A. Big data and machine learning in belief nets. Neural Comput. 2006;18:1527–1554.
radiation oncology: state of the art and future prospects. Cancer Lett. 52. Nair V, Hinton GE. Rectified linear units improve restricted boltzmann
2016;382:110–117. machines. In: Proceedings of the 27th international conference on
24. Feng M, Valdes G, Dixit N, Solberg TD. Machine learning in radiation machine learning (ICML-10); 2010:807–814.
oncology: opportunities, requirements, and needs. Front Oncol. 53. Glorot X, Bordes A, Bengio Y. Deep sparse rectifier neural networks.
2018;8:110. In: Proceedings of the fourteenth international conference on artificial
25. Boldrini L, Bibault J-E, Masciocchi C, Shen Y, Bittner M-I. Deep intelligence and statistics; 2011:315–323.
learning: a review for the radiation oncologist. Front Oncol. 54. Kingma D, Ba J. Adam: a method for stochastic optimization. arXiv
2019;9:977. preprint arXiv:1412.6980; 2014.
26. Thompson RF, Valdes G, Fuller CD, et al. Artificial intelligence in 55. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recog-
radiation oncology: a specialty-wide disruptive transformation? Radio- nition. arXiv preprint arXiv:0706.1234; 2015.
ther Oncol. 2018;129:421–426. 56. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Com-
27. Sahiner B, Pezeshk A, Hadjiiski LM, et al. Deep learning in medical put. 1997;9:1735–1780.
imaging and radiation therapy. Med Phys. 2019;46:e1–e36. 57. Kingma DP, Welling M. Auto-encoding variational Bayes; 2013. Cite
28. Domingos P. A few useful things to know about machine learning. arxiv:1312.6114.
Commun ACM. 2012;55:78–87. 58. Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial
29. Mohri M, Rostamizadeh A, Talwalkar A. Foundations of Machine nets. In: Advances in Neural Information Processing Systems;
Learning. Cambridge, MA: The MIT Press; 2012. 2014:2672–2680.
30. Kotsiantis SB. Supervised machine learning: a review of classification 59. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification
techniques. In: Real W, ed. Proceedings of the 2007 Conference on of skin cancer with deep neural networks. Nature. 2017;542:115–118.

60. Dalmisß MU, Litjens G, Holland K, et al. Using deep learning to seg- 85. Clevert AD, Unterthiner T, Hochreiter S. Fast and accurate deep net-
ment breast and fibroglandular tissue in MRI volumes. Med Phys. work learning by exponential linear units (ELUs). arXiv preprint
2017;44:533–546. arXiv:1511.07289; 2015.
61. Ng AY, Jordan MI. On discriminative vs. generative classifiers: a com- 86. Srivastava N, Hinton GE, Krizhevsky A, Sutskever I, Salakhutdinov R.
parison of logistic regression and naive bayes. In: Proceedings of the Dropout: a simple way to prevent neural networks from overfitting. J
14th International Conference on Neural Information Processing Sys- Mach Learn Res. 2014;15:1929–1958.
tems: Natural and Synthetic, NIPS’01. Cambridge, MA: MIT Press; 87. Ruder S. An overview of gradient descent optimization algorithms.
2001:841–848. CoRR abs/1609.04747; 2016. arXiv:1609.04747.
62. Wolterink JM, Dinkla AM, Savenije MHF, et al. Deep MR to CT syn- 88. Jain LC, Medsker LR. Recurrent Neural Networks: Design and Appli-
thesis using unpaired data. CoRR abs/1708.01155; 2017. cations, 1st edn. Boca Raton, FL: CRC Press Inc; 1999.
arXiv:1708.01155. 89. Elman JL. Finding structure in time. Cogn Sci. 1990;14:179–211.
63. Dietterich T. Overfitting and undercomputing in machine learning. 90. Jordan MI. Chapter 25 - serial order: A parallel distributed processing
ACM Comput Surv. 1995;27:326–327. approach. In: Neural-Network Models of Cognition, Advances in Psy-
64. Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep chology, Vol. 121, edited by John W. Donahoe and Vivian Packard Dor-
learning requires rethinking generalization; 2017. sel. North-Holland; 1997: 471–495.
65. Bishop CM. Pattern Recognition and Machine Learning (Information 91. Cho K, van Merrienboer B, Gulcehre C, et al. Learning Phrase
Science and Statistics). Berlin, Heidelberg: Springer-Verlag; 2006. Representations using RNN Encoder-Decoder for Statistical Machine
66. Linial N, Mansour Y, Rivest RL. Results on learnability and the Vap- Translation. arXiv e-prints, arXiv:1406.1078; 2014. arXiv:1406.1078
nik-Chervonenkis dimension. Inf Comput. 1991;90:33–49. [cs.CL].
67. Perlich C, Provost F, Simonoff JS. Tree induction vs. logistic regres- 92. Vaswani A, Shazeer N, Parmar N. Attention is all you need. In: Guyon
sion: a learning-curve analysis. J Mach Learn Res. 2003;4:211–255. I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Gar-
68. Perez L, Wang J. The effectiveness of data augmentation in image clas- nett R, eds. Advances in Neural Information Processing Systems 30.
sification using deep learning. CoRR abs/1712.04621; 2017, Red Hook: Curran Associates Inc; 2017:5998–6008.
arXiv:1712.04621. 93. Luong M-T, Pham H, Manning CD. Effective approaches to attention-
69. Tan C, Sun F, Kong T, Zhang W, Yang C, Liu C. A survey on deep trans- based neural machine translation. CoRR abs/1508.04025;2015,
fer learning. arXiv e-prints, arXiv:1808.01974; 2018, arXiv:1808.01974 arXiv:1508.04025.
[cs.LG]. 94. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly
70. Shin HC, Roth HR, Gao M, et al. Deep convolutional neural networks learning to align and translate; 2014. cite arxiv:1409.0473Comment:
for computer-aided detection: CNN architectures, dataset characteristics Accepted at ICLR 2015 as oral presentation.
and transfer learning. IEEE Trans Med Imaging. 2016;35:1285–1298. 95. Kelvin X, Ba J, Kiros R, et al. Show, attend and tell: neural image cap-
71. Zhen X, Chen J, Zhong Z, et al. Deep convolutional neural network tion generation with visual attention. CoRR abs/1502.03044; 2015.
with transfer learning for rectum toxicity prediction in cervical cancer arXiv:1502.03044.
radiotherapy: a feasibility study. Phys Med Biol. 2017;62:8246–8263. 96. Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks
72. Walach J, Filzmoser P, Hron K. Chapter seven-data normalization and for biomedical image segmentation. In: Medical Image Computing
scaling: consequences for the analysis in omics sciences. In: Jaumot J, and Computer-Assisted Intervention (MICCAI), LNCS, Vol. 9351.
Bedia C, Tauler R, eds. Data Analysis for Omic Sciences: Methods and Berlin: Springer; 2015:234–241. Available on arXiv:1505.04597
Applications, Comprehensive Analytical Chemistry, Vol. 82. Amster- [cs.CV].
dam: Elsevier; y:165–196. 97. Radford A, Metz L, Chintala S. Unsupervised representation learning
73. LeCun Y, Bottou L, Orr GB, Muller KR. Efficient backprop. In: Neural with deep convolutional generative adversarial networks; 2015. Cite
Net-works: Tricks of the Trade, This Book is an Outgrowth of a 1996 arxiv:1511.06434Comment: Under review as a conference paper at
NIPS Workshop. London, UK: Springer-Verlag; 1998:9–50. ICLR 2016.
74. Rubin DB. An overview of multiple imputation. In: Proceedings of the 98. Mirza M, Osindero S. Conditional generative adversarial nets. CoRR
Survey Research Section, American Statistical Association; 1988:79–84. abs/1411.1784; 2014. arXiv:1411.1784.
75. McCullagh P, Nelder JA. Generalized Linear Models. London: Chap- 99. Chen X, Duan Y, Houthooft R, Schulman J, Sutskever I, Abbeel P.
man Hall CRC; 1989. Infogan: Interpretable representation learning by information maximiz-
76. Ng AY. Feature selection, l1 vs. l2 regularization, and rotational invari- ing generative adversarial nets. In: D. D. Lee, M. Sugiyama, U. V. Lux-
ance. In: Proceedings of the Twenty-first International Conference on burg, I. Guyon, and R. Garnett, eds. Advances in Neural Information
Machine Learning, ICML ’04. New York, NY: ACM; 2004:78. Processing Systems 29. Red Hook: Curran Associates Inc; 2016:2172–
77. Natekin A, Knoll A. Gradient boosting machines, a tutorial. Front Neu- 2180.
rorobot. 2013;7:21. 100. Zhu J-Y, Park T, Isola P, Efros AA. Unpaired image-to-image translation
78. Springenberg JT, Klein A, Falkner S, Hutter F. Bayesian optimization using cycle-consistent adversarial networks. CoRR abs/1703.10593;
with robust bayesian neural networks. In: Lee DD, Sugiyama M, Lux- 2017. arXiv:1703.10593.
burg UV, Guyon I, Garnett R, eds. Advances in Neural Information 101. Emami H, Dong M, Nejad-Davarani S, Glide-Hurst C. Generating syn-
Processing Systems 29. Red Hook, NY: Curran Associates Inc; thetic CTS from magnetic resonance images using generative adversar-
2016:4134–4142. ial networks. Med Phys. 2018;45:3627–3636.
79. Hornik K, Stinchcombe M, White H. Multilayer feedforward networks 102. Sutton RS, McAllester DA, Singh SP, Mansour Y. Policy gradient
are universal approximators. Neural Netw. 1989;2:359–366. methods for reinforcement learning with function approximation.
80. Vidal R, Bruna J, Giryes R, Soatto S. Mathematics of deep learning. In: Advances in Neural Information Processing Systems;
CoRR abs/1712.04741; 2017. arXiv:1712.04741. 2000:1057–1063.
81. Brahma PP, Wu D, She Y. Why deep learning works: a manifold dis-en- 103. Efron B, Tibshirani R. An Introduction to the Bootstrap, Monographs
tanglement perspective. IEEE Trans Neural Netw Learn Syst. on Statistics and Applied Probability. New York: Chapman & Hall;
2016;27:1997–2008. 1993: xvi, 436 p.
82. Hochreiter S. Untersuchungen zu dynamischen neuronalen netzen. 104. Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent
Diploma, Technische Universit€at M€unchen 91; 1991. reporting of a multivariable prediction model for individual prognosis
83. Goller C, Kuchler A. Learning task-dependent distributed representa- or diagnosis (TRIPOD): the TRIPOD statement. Ann Intern Med.
tions by backpropagation through structure. In: IEEE International 2015;162:55–63.
Conference on Neural Networks, Vol. 1. IEEE; 1996: 347–352. 105. Ratcliffe S. Roy amara; 2016.
84. Ioffe S, Szegedy C. Batch normalization: Accelerating deep network 106. Hau K. We analyzed 16,625 papers to figure out where AI is headed
training by reducing internal covariate shift. In: Proceedings of the next. Technical Report; 2019.
32Nd International Conference on International Conference on 107. Luo Y, Tseng H-H, Cui S, Wei L, Ten Haken RK, El Naqa I.
Machine Learning - Volume 37, ICML’15. JMLR.org; 2015: 448–456. Balancing accuracy and interpretability of machine learning

approaches for radiation treatment outcomes modeling. BJR-Open 1. 110. Conforti F, Pala L, Bagnardi V, et al. Cancer immunotherapy efficacy
2019;1:20190021. and patients’ sex: a systematic review and meta-analysis. Lancet Oncol.
108. Tannenbaum C, Ellis RP, Eyssel J, Zou F, Schiebinger L. Sex and gen- 2018;19:737–746.
der analysis improves science and engineering. Nature. 2019;575:137– 111. Ghorbani A, Zou J. Data shapley: Equitable valuation of data for
146. machine learning; 2019. arXiv:1904.02868 [stat.ML].
109. Manrai AK, Funke BH, Rehm HL, et al. Genetic misdiagnoses and 112. Abouelmehdi K, Beni-Hessane A, Khaloufi H. Big healthcare data:
the potential for health disparities. N Engl J Med. 2016;375:655– preserving security and privacy. J Big Data. 2018;5:1.
665.

Introduction To Machine and Deep Learning For Medical Physicists

Uploaded by

Copyright:

Available Formats

You might also like

Introduction To Machine and Deep Learning For Medical Physicists

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Introduction To Machine and Deep Learning For Medical Physicists

Uploaded by

Copyright:

Available Formats

Introduction to machine and deep learning for medical physicists

1. INTRODUCTION feature extraction,10,11 and outcomes modeling.12–16 Numer-

toward developing new technological innovations in medi-

Data interaction Definition Examples

Classical ML An ML algorithm that Supervised: GLM,75

Medical Physics, 47 (5), May 2020

Medical Physics, 47 (5), May 2020

dataset to train a complex algorithm can be problematic, as it

Medical Physics, 47 (5), May 2020

Medical Physics, 47 (5), May 2020

4.B. Model regularizations

In NNs, cross-entropy is also usually applied as a loss func- 0  ai  C; 8i (12)

Medical Physics, 47 (5), May 2020

id diagnosis radius_mean texture_mean perimeter_mean ... fractal_dimension_worst

842302 M 17.99 10.38 122.8 ... 0.1189

Medical Physics, 47 (5), May 2020

Medical Physics, 47 (5), May 2020

5.C. Prevalent architectures

FIG. 8. Self-iterations of linear transformations.

Medical Physics, 47 (5), May 2020

5.C.2. Recurrent neural networks

Medical Physics, 47 (5), May 2020

Medical Physics, 47 (5), May 2020

Medical Physics, 47 (5), May 2020

Medical Physics, 47 (5), May 2020

id Gender pathologic_N pathologic_stage pathologic_T other_dx tobacco_smoking_history

1 Male N0 Stage IIIB T3 No 4

Medical Physics, 47 (5), May 2020

between states. The probability mass function (pmf) (s,a,t)

In other words, an agent is a policy function p : S ! A who

Medical Physics, 47 (5), May 2020

typically several hundred or thousand times depending on the

Medical Physics, 47 (5), May 2020

learning process by incorporating human-computer interac-

7.A. Why interpretability matters? 8. WHAT PITFALLS MAY EXIST?

Medical Physics, 47 (5), May 2020

Medical Physics, 47 (5), May 2020

Medical Physics, 47 (5), May 2020

Medical Physics, 47 (5), May 2020

Medical Physics, 47 (5), May 2020

You might also like

In NNs, cross-entropy is also usually applied as a loss func- 0 ai C; 8i (12)