To cite this article: Ammar Jalalimanesh, Hamidreza Shahabi Haghighi, Abbas Ahmadi, Hossein
Hejazian & Madjid Soltani (2017) Multi-objective optimization of radiotherapy: distributed
Q-learning and agent-based simulation, Journal of Experimental & Theoretical Artificial
Intelligence, 29:5, 1071-1086, DOI: 10.1080/0952813X.2017.1292319
1. Introduction
Cancer is the second leading cause of death worldwide after cardiovascular disease, even in developed countries (Stamatakos et al., 2002), and to date there is no reliable cure for it. Cancer is a group of diseases in which body cells change, grow aggressively and finally form a tumour (Ahmadi & Afshar, 2015). The leading approaches to cancer treatment are surgery, chemotherapy and radiotherapy (RT). Most of the time a combination of radiotherapy and chemotherapy is applied to achieve enhanced treatment. The main objective of all these therapies is to control tumour growth and to eradicate cancer cells.
About half of cancer patients are treated by RT (Orth et al., 2014). In RT, a radiation beam is projected at the cluster of malignant cells to fracture DNA bonds, which leads to cell death. Protecting the healthy tissue surrounding the tumour from radiation is a vital problem in this procedure.
Sequential therapy planning, also known as 'dynamic treatment regimes' (DTR) or 'adaptive treatment strategies' (Song, Wang, Zeng, & Kosorok, 2011), is an important problem in engineering as well as in medical science. Murphy (2003) defined the concept of 'optimal dynamic treatment regimes' as finding an optimal list of decision rules, one per time interval, for adjusting the level of treatment in a personalised manner through time according to an individual's changing status. Moodie, Chakraborty, and Kramer (2012) used Q-learning to find an optimal DTR for patients with major depressive disorder. Zhao, Kosorok, and Zeng (2009) found an optimal DTR for non-small cell lung cancer chemotherapy using the Q-learning algorithm.
Treatment planning is the most important part of RT. In this procedure, the radiation schedule and the radiation intensity are determined. The goal of RT is to treat the tumour by removing cancer cells while killing the minimum number of healthy cells (Craft, Bangert, Long, Papp, & Unkelbach, 2014). Accordingly, RT is intrinsically multi-objective. Many studies in the past decade have focused on modelling and optimisation of RT, with further attempts in recent years. Most of them focused on optimising physical aspects of RT such as beam angles and radiation intensity (Deng & Ferris, 2008; Ferris, Lim, & Shepard, 2003).
With the support of computers, RT inverse planning has become more popular. In this process the objective is to maximise the tumour control probability (TCP) while keeping the normal tissue complication probability (NTCP) at standard levels; for a typical RT treatment, TCP ≥ .5 and NTCP ≤ .05 (Podgorsak, 2005). The treatment plan is confirmed by computer before any physical experiment. Some studies focus on modelling and optimisation of RT; many of them try to determine the optimal solution using operational research methods based on medical data from RT patients' records.
Agent-based simulation is a flexible method for biological modelling of tumour growth and RT. This approach can model the heterogeneity of tumour growth and therapy selection. Current research shows the evolution of this method and the readiness of cancer therapy associations to put this kind of model into practice in the near future (Yankeelov et al., 2013). Moreover, a reliable algorithm for optimising agent-based models will help oncologists find better solutions in cancer therapy. Because agent-based models are stochastic, classic methods such as dynamic programming cannot simply be used for their optimisation. An agent-based model has a very large number of states, and it is hard to define a finite Markov decision process for it. Hence, a model-free RL algorithm such as Q-learning is appropriate for optimising such a model.
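The model-free character of Q-learning comes down to a single tabular update rule that needs no transition probabilities. The sketch below illustrates one such update with made-up tumour-size states and dose actions; it is not the paper's actual state or action set.

```python
def q_update(Q, state, action, reward, next_state, alpha=0.2, gamma=0.3):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[next_state].values())
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])

# Toy example: two tumour-size states and two dose actions (illustrative only).
actions = ["low_dose", "high_dose"]
Q = {s: {a: 0.0 for a in actions} for s in ("large", "small")}
q_update(Q, "large", "high_dose", reward=10.0, next_state="small")
print(Q["large"]["high_dose"])  # 2.0
```

No model of the environment appears anywhere in the update: the simulator only has to return a reward and the next observed state, which is exactly why the method suits a stochastic agent-based simulation.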
In previous work, we considered RT as a single-objective problem and proposed an optimisation methodology aided by agent-based simulation and Q-learning (Jalalimanesh et al., 2017). In this study, we combine the MDQ-learning algorithm with an agent-based simulation of tumour growth and RT, and find Pareto optimal solutions of RT treatment considering conflicting objectives. Using MDQ-learning we minimise treatment time, total radiation dose and healthy tissue damage simultaneously. The remainder of the paper is organised as follows. Section 2 reviews the literature on modelling and optimisation of radiotherapy. Section 3 describes our agent-based simulation model of tumour growth and displays some snapshots of the simulated tumour. Section 4 introduces our methodology for modelling radiotherapy. Section 5 defines the concepts of Q-learning, DQ-learning and MDQ-learning and presents their classic algorithms. Section 6 then demonstrates our approach to optimising RT using MDQ-learning; the pseudo-code of the proposed algorithm is provided in this section. Section 7 presents the running results of the proposed algorithm and a discussion. Section 8 contains the conclusion and our planned future research.
effects caused by the underlying dose distributions (Lim, 2002). Simulation helps us harvest the data needed for optimisation where access to the biological data of real patients is hard to obtain. Simulation also allows us to study several treatment scenarios and compare them to each other. From a biological perspective, several studies attempt to model cancer growth and the effects of RT on the tumour; some of them tried to validate their simulations against real clinical or in vivo experiments.
Kirkby, Burnet, and Faraday (2002) demonstrated the reaction of tumour cells to RT using ordinary differential equation (ODE) simulation. They modelled the cancer cells' cell cycle and considered cell age as the main parameter of radiosensitivity; in fact, an age-distributed population balance model of a typical tumour cell cycle was formulated in their research. Stamatakos et al. (2002) established a 3D model of tumour behaviour in response to RT using a Monte-Carlo simulation technique. Antipas, Stamatakos, Uzunoglu, Dionysiou, and Dale (2004) improved the Stamatakos et al. model by considering the oxygen enhancement ratio (OER) and the phenomenon of new angiogenesis. They also verified the model variables using clinical data. Borkenstein, Levegrün, and Peschke (2004) constructed a mathematical model to predict the quality of RT. The cell cycle, oxygen diffusion dynamics and tumour angiogenic factor (TAF) diffusion were considered in their model. The simulation was run in a manner discrete in both space and time. Their model is a typical kind of cellular automaton (CA) with some differential equations inside the model, mostly for oxygen and TAF diffusion. Angiogenesis is simplified as an increase in the number of oxygen sources in the tumour, and the model is run as a 3D simulation. Other studies (Basu, Paul, & Roy, 2005; Dionysiou, Stamatakos, Uzunoglu, & Nikita, 2006; Enderling, Anderson, & Chaplain, 2007; Harting, Peschke, Borkenstein, & Karger, 2007; Jiménez & Hernandez, 2011) also developed models and ran simulations of the effects of different RT methods on tumours in diverse organs.
There have also been various attempts at optimisation of RT. Wein, Cohen, and Wu (2000) developed a methodology to dynamically optimise the linear-quadratic function in different situations using an ODE-based simulation model. They used deterministic optimal control techniques to optimise the treatment plan; in contrast to conformal RT, in their approach the dose intensity is changed during treatment. Lim (2002) developed a series of techniques to optimise RT treatment plans for the Gamma Knife machine in his PhD thesis. He applied several operational research techniques, such as nonlinear programming and mixed-integer programming, and some heuristic methods like simulated annealing, to optimise the treatment plan spatiotemporally. Craft, Halabi, Shih, and Bortfeld (2006) presented an algorithm for computing well-distributed points on the convex Pareto optimal surface in a multi-objective optimisation of RT. The algorithm was applied to the intensity-modulated radiotherapy (IMRT) inverse planning problem, and results for a prostate case and a skull base case were presented in three dimensions. Chan (2007) investigated robust approaches to managing uncertainty in RT treatments in his PhD thesis. First, he studied the effect of breathing motion uncertainty on IMRT of a lung tumour; next, he provided a robust formulation for the optimisation of proton-based treatments. Thieke et al. (2007) developed a new inverse planning system called the 'Multi-criteria Interactive Radiotherapy Assistant (MIRA)'. They used the optimisation results as a database of patient-specific, Pareto-optimal plan proposals. Two clinical test cases, a para-spinal meningioma and a prostate case, were optimised using MIRA, and the results were compared with the clinically approved planning programme KonRad. Deng and Ferris (2008) investigated an on-line treatment planning strategy for fractionated RT with the aid of a neuro-dynamic programming technique, considering the effect of day-to-day patient motion. In their early limited testing, the solutions they obtained outperformed regular solutions and suggested an improved dose profile for each fraction of the treatment. Kim, Ghate, and Phillips (2009) designed fractionation schedules that consider the patient's cumulative response to radiation up to a particular treatment day to determine the fraction on that day. They claimed to be the first to propose an approach that mathematically uncovers the benefits of fractionation schemes. They developed a stylised Markov decision process (MDP) model and applied it to several simple numerical trials. Ruotsalainen, Boman, Miettinen, and Tervo (2009) demonstrated a nonlinear interactive multi-objective optimisation method for RT treatment planning using the Boltzmann transport equation (BTE) for dose calculation. They used a parameterisation method to make the dose calculation faster in the BTE model.
1074 A. JALALIMANESH ET AL.
Holdsworth et al. (2012) established a patient-specific scheme of adaptive IMRT treatment for glioblastoma by means of a multi-objective evolutionary algorithm (MOEA). They used the MOEA to generate spatially optimised dose distributions based on a mathematical model of tumour cell proliferation, diffusion and response; the sets of spatially optimal dose distributions were produced via the MOEA to generate the Pareto front. Ramakrishnan (2013) proposed a series of algorithms to dynamically optimise fractionation schedules. He also proposed a model to adaptively vary the fraction size during RT to reach the best treatment plan. Leder et al. (2014) developed a mathematical model to optimise PDGF-driven glioblastoma radiation dosing schedules, and validated their results with in vivo experiments. They found that altering the radiation fraction size during treatment can possibly lead to better outcomes.
enter mitosis. In the M phase the cell stops growing and divides into two daughter cells. There is also an inactive phase, called G0, in which the cell leaves the cycle and stops dividing.
In our simulation, the agents decide their phenotype based on cell-cycle stage and environmental factors. Due to the key role of oxygen in RT effectiveness, we model oxygen and its dynamics as the main nutrient in tissue. Figure 1 shows the flowchart of the cell cycle for all cells. The cell-cycle time is taken to be 24 h according to Humphrey and Brooks (2005); cells spend 12, 6, 4 and 2 h in the G1, S, G2 and M phases, respectively. Normal cells do not divide when there is no space available to divide into, and become quiescent, but cancer cells do not follow this rule. Cancer cells are more resistant to oxygen deficiency and turn hypoxic under poor oxygenation conditions. According to Kufe et al. (2003), a tumour with a mass of 2 mm³ has 10⁵ cells, and based on Vaupel, Kallinowski, and Okunieff (1990) the oxygen consumption rate is 3 × 10⁻⁵ ml/(cm³ s). Hence, to match the dimensions of our model, the average oxygen consumption is taken to be 2.16 × 10⁻⁹ ml per cell per hour. The remaining parameters were estimated based on this value.
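The per-cell figure follows directly from the two cited measurements; the snippet below reproduces the unit conversion.

```python
# Unit conversion behind the per-cell oxygen consumption value.
cells_per_mm3 = 1e5 / 2               # Kufe et al. (2003): 1e5 cells in a 2 mm^3 tumour
cells_per_cm3 = cells_per_mm3 * 1e3   # 1 cm^3 = 1000 mm^3
rate_per_cm3_per_s = 3e-5             # Vaupel et al. (1990): ml O2 / (cm^3 s)
per_cell_per_hour = rate_per_cm3_per_s / cells_per_cm3 * 3600
print(per_cell_per_hour)              # approximately 2.16e-9 ml per cell per hour
```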
Again, for simplicity, we used the methodology of Chen, Sprouffske, Huang, and Maley (2011) for modelling angiogenesis. They neglected the morphology of the vascular network and modelled the epithelial cells as nodes which diffuse oxygen. According to their method, V blood vessels are placed in a continuous 2D space on top of a square lattice of P patches. At every time step, each patch containing a blood vessel receives r_i units of oxygen (Equation (1)). The oxygen concentration c_j^{t+1} at patch j at time t + 1 is given by the following difference equation.
c_j^{t+1} = (1 − d_c) c_j^t + (d_c / 8) ∑_{k ∈ N(j)} c_k^t − r_a n_j^t + r_i δ_j^t        (1)
Here d_c is the resource diffusion constant, N(j) is the set of the eight neighbouring patches of j, r_a is the cell absorption rate, n_j^t is the number of cells at location j at time t, r_i is the resource creation rate of a microvessel, and δ_j^t equals 1 if there is a microvessel at position j at time t and 0 otherwise. Oxygen concentrations
are not allowed to be negative, since cells cannot absorb more resources than are present at a position. Table 1 lists the parameters used in our model.
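Equation (1) can be implemented as a simple lattice update. The sketch below is a minimal version assuming a wrapped (torus) world and placeholder parameter values, not those of Table 1.

```python
import numpy as np

def oxygen_step(c, n, vessel, d_c=0.15, r_a=1e-3, r_i=0.5):
    """One update of Equation (1) on a wrapped lattice.
    c: oxygen concentration per patch, n: cell count per patch,
    vessel: 1 where a microvessel sits, 0 elsewhere."""
    # Sum of the eight Moore neighbours via periodic shifts.
    nbr = sum(np.roll(np.roll(c, dy, axis=0), dx, axis=1)
              for dy in (-1, 0, 1) for dx in (-1, 0, 1)
              if (dy, dx) != (0, 0))
    c_next = (1 - d_c) * c + (d_c / 8) * nbr - r_a * n + r_i * vessel
    return np.maximum(c_next, 0.0)  # concentrations may not go negative

# Tiny 5x5 demo: one vessel in the centre, no cells yet.
c = np.zeros((5, 5))
vessel = np.zeros((5, 5)); vessel[2, 2] = 1.0
n = np.zeros((5, 5))
c = oxygen_step(c, n, vessel)
print(c[2, 2])  # 0.5: the vessel patch received r_i units of oxygen
```

The final clipping step enforces the non-negativity constraint stated above.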
We used the NetLogo package as our simulation platform. Figure 2 shows three snapshots of the tumour growth simulation at different stages. As each tick represents one hour, the tumour size and oxygen status can be seen at ticks 1000, 1500 and 2000, from the top frame of Figure 2 to the bottom. As time passes, cells inside the tumour become necrotic due to lack of oxygen and nutrients. In the 3D plots on the right side, red indicates lack of oxygen while white indicates sufficient oxygen. The simulation is run in a square world of 100 × 100 patches where the cell diameter is 20 μm (Powathil, Gordon, Hill, & Chaplain, 2012); consequently we can simulate a cancer tissue area of 2 × 2 mm². As Figure 2 shows, as the tumour diameter and the necrotic core volume increase, oxygen reaches the tumour through angiogenesis and causes cancer cells to proliferate inside the tumour.
Multi-objective distributed Q-learning (MDQ-learning) is like DQ-learning, except that each group of agents tries to handle one objective. At the end of each iteration, the agents negotiate with each other and compromise on the solutions to draw the Pareto front (Mariano & Morales, 2000). We explain this algorithm thoroughly for the case of RT in the next section.
In the beginning, the tumour size is recognised as the initial state. All scenarios should be applied to the same tumour so that the solutions can be compared with each other. Accordingly, we save the tumour growth simulation at a determined tumour stage/size and use this saved model to initialise the algorithm in all iterations. To explore the solution space thoroughly, in the initial iterations of the algorithm the actions are more likely to be selected arbitrarily than on the basis of maximum rewards. Over time, the Q matrix is filled by the accumulation of rewards and converges. At this phase the exploitation operator is more likely to be activated, and actions are more likely to be selected based on maximum rewards. The switch between exploration and exploitation is governed by the parameter η = δ/IterationNumber, which becomes smaller in later iterations. The w1 and w2 coefficients in the reward function express, respectively, the reward for killing cancer cells and the penalty for killing normal cells in our therapy. The w3 coefficient reflects the significance of the effects of radiation on neighbouring healthy tissues and hence on therapy outcomes. For instance, w2 and w3 should be higher if the tumour is surrounded by a more critical organ at risk (OAR). During RT, we irradiate only patches that contain cancer cells; this is similar to the IMRT technique. The termination criterion is a tumour with fewer than 10 cancer cells. We calculate the total reward that agent i achieves during a complete therapy as Vi. Once all agents complete their therapy and find their solutions, we compare the Vi values with each other to select the best solution. Next, the Q matrix is updated based on the best therapy, and for the subsequent iteration the Qi of each agent is replaced with the updated Q matrix.
5.16: end while
5.17: Calculate S_i as the value of conflicting-objective pairs such as TCP and NTCP
5.18: end for
5.19: Negotiation(S)
5.20: end repeat
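The η-controlled switch between exploration and exploitation described above amounts to a simple probabilistic rule. The sketch below is illustrative; the action names and Q-values are made up and this is not the paper's exact implementation.

```python
import random

def choose_action(q_row, iteration, delta=0.4):
    """Pick a dose action: explore with probability eta = delta / iteration,
    otherwise exploit the current Q estimates."""
    eta = delta / iteration             # large in early iterations, small later
    actions = list(q_row)
    if random.random() < eta:
        return random.choice(actions)   # exploration: arbitrary action
    return max(actions, key=q_row.get)  # exploitation: max-reward action

random.seed(0)
q_row = {"2Gy": 1.0, "2.5Gy": 3.0, "3Gy": 2.0}  # illustrative Q-values
picked = choose_action(q_row, iteration=100)     # eta = 0.004: almost surely exploits
print(picked)
```

Because η shrinks with the iteration number, random exploration dominates at the start and greedy exploitation dominates near convergence, while a small residual chance of exploring remains in every iteration.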
We developed a treatment optimisation approach based on TCP and NTCP, following the typical multi-objective decision-making (MODM) scheme in RT. Two objectives are defined, #KCC and #KNC, which represent TCP and NTCP and are interpreted as the number of cancer cells and the number of normal cells eradicated during each irradiation stage, respectively. As shown in Algorithm 5, the reward function which calculates ri comprises three elements: killing more cancer cells increases the reward, while killing normal cells and irradiating a wider area with higher intensity decreases the reward. We must specify the solution space of our problem, as in constructing the Pareto front in MODM problems, where the non-dominated solutions are found by exploring a feasible space shaped by certain constraints. However, in a stochastic environment such as an agent-based simulation, we can define neither a specific solution space nor a Markovian decision process (the state transitions are complex). Using the MDQ-learning algorithm, we propose a new methodology by defining three schemes of radiotherapy comprising different combinations of our objectives, namely invasive, conservative and moderate.
As mentioned before, the objectives of RT are to eradicate cancer cells faster (first objective) while keeping healthy tissue undamaged (second objective), with as low a total radiation dose as possible. Accordingly, the invasive scheme is a treatment which enforces killing cancer cells without paying much attention to normal cells. In contrast, the conservative scheme does not take high risks; it takes more care of normal cells and tries to destroy cancer cells in a less aggressive manner. The moderate scheme lies in between, paying equal attention to killing cancer cells and sparing normal cells during RT.
To realise these schemes, we put different emphasis on the aforementioned objectives for each of the three solver agents by assigning different wi to their reward functions. This approach gives the agents diverse behaviour so that they explore the solution space more thoroughly.
                        w1    w2    w3
Agent 1/invasive         3     1    .1
Agent 2/moderate         2     1    .3
Agent 3/conservative     1     2    .5
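Under these weights, the same irradiation outcome is scored quite differently by the three agents. The sketch below assumes a simple linear reward of the form w1·#KCC − w2·#KNC − w3·(irradiated area × intensity); the paper's exact functional form is the one given in Algorithm 5, so treat this shape as an illustration only.

```python
def reward(killed_cancer, killed_normal, irradiated_area, intensity, w1, w2, w3):
    """Hypothetical linear reward: bonus for killed cancer cells, penalties for
    killed normal cells and for the irradiated healthy area times intensity."""
    return w1 * killed_cancer - w2 * killed_normal - w3 * irradiated_area * intensity

# One and the same irradiation outcome scored under the three schemes:
outcome = dict(killed_cancer=30, killed_normal=5, irradiated_area=40, intensity=2.5)
r_invasive = reward(**outcome, w1=3, w2=1, w3=0.1)
r_moderate = reward(**outcome, w1=2, w2=1, w3=0.3)
r_conservative = reward(**outcome, w1=1, w2=2, w3=0.5)
print(r_invasive, r_moderate, r_conservative)  # roughly 75, 25, -30
```

The conservative agent rates an outcome negatively that the invasive agent rewards strongly, which is what drives the three agents into different regions of the solution space.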
Figure 5. Pareto optimal solutions of total radiation domain vs. total radiation dose and fractions number.
inside the tissue when the simulation starts. We also scattered 70 micro-vessels in the tissue which diffuse oxygen. Other parameters were set according to Table 1. After 333 ticks (hours) of running the simulation, we had a tumour with 750 cancer cells and a big necrotic core. We kept this tumour as the input for the MDQ-learning algorithm. The state space (s) was built by discretising the tumour size into 40 stages; hence a proliferation of 20 cells takes the tumour one stage higher. For example, a tumour with 605 cancer cells is in stage 31. As stated, the actions (a) consist of weak (2 Gy), normal (2.5 Gy) and intense (3 Gy) irradiation. We tuned the parameters of the algorithm using a sensitivity analysis approach, eventually arriving at α = .2, δ = .4 and γ = .3 as good parameters for fast convergence. As mentioned before, we construct three solver agents using different wi values in their reward functions; Table 2 lists these values for all schemes. The algorithm was run for 100 iterations each time. Figure 4 displays the results of one complete run of the algorithm for the three optimiser agents.
The total radiation domain objective is computed by accumulating, over all days of a complete therapy, the number of irradiated patches containing normal cells multiplied by the radiation intensity. This measure plays the role of NTCP in our simulation. The number of fractions and the total radiation dose are two related objectives: the first shows the duration of the therapy and the second the total aggregated dose irradiated during a complete therapy. These two objectives are related to TCP in nature, but their decrease indicates a higher TCP.
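Given a therapy expressed as a list of daily fractions, the three reported objectives can be accumulated as follows. The representation of a fraction as a (normal patches irradiated, dose) pair is an assumption made for illustration.

```python
def therapy_objectives(fractions):
    """Accumulate the three objectives over one complete therapy.
    Each fraction is (normal_patches_irradiated, dose_gy); field layout assumed."""
    n_fractions = len(fractions)                              # therapy duration
    total_dose = sum(dose for _, dose in fractions)           # TCP-like objective
    # 'total radiation domain' = sum over days of normal patches hit x intensity
    radiation_domain = sum(patches * dose for patches, dose in fractions)
    return n_fractions, total_dose, radiation_domain

plan = [(12, 2.0), (9, 2.5), (7, 3.0)]   # three illustrative fractions
print(therapy_objectives(plan))           # (3, 7.5, 67.5)
```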
The initial iterations are in the exploration phase, where large fluctuations are observable. Agent 1, which has invasive behaviour, paid less attention to the radiation domain and consequently converged later than the other agents. Agent 2, the moderate agent, became stable before the fortieth iteration, and the conservative agent converged soon after the twentieth iteration. It can be seen from the fluctuations of the results that all agents explore the solution space until the final iterations. This shows that the exploration operator works well even in the convergence phase.
Figure 5 depicts the Pareto points which the agents explored in the solution space over all iterations. We plotted the total radiation domain against the number of fractions and against the total radiation dose, separately. It can be concluded that the regions with no discovered solutions could become discoverable with more extreme weights and a different RT schedule. For example, the tumour cannot be eradicated in fewer than 8 daily fractions with a total dose below 17 Gy, although it would be possible if we irradiated the tumour two or more times per day. Therefore, some points are not explored by the agents because they impose high negative rewards, although there are some outliers in the plots which the agents explored due to the randomness in the exploration mechanism. The solutions found by each agent are depicted with a specific colour, and the Pareto solutions are indicated by red squares. The Pareto front could be obtained by fitting a curve connecting the optimal solutions. Our Pareto front is narrow because of the limited number of actions (dosages), which are not very different from each other. A wider Pareto front could be obtained by defining more actions (dosages) over a broader range of radiation.
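The red squares in Figure 5 are the non-dominated points. For objectives that are all minimised, such points can be filtered with a generic sketch like the one below; the values are illustrative, not taken from the experiments.

```python
def pareto_front(points):
    """Keep the non-dominated points when every objective is minimised."""
    front = []
    for p in points:
        # p is dominated if some other point is <= in every objective.
        dominated = any(all(q[i] <= p[i] for i in range(len(p))) and q != p
                        for q in points)
        if not dominated:
            front.append(p)
    return front

# Illustrative (number of fractions, total radiation domain) pairs:
pts = [(8, 120.0), (10, 90.0), (10, 130.0), (12, 85.0)]
print(pareto_front(pts))  # [(8, 120.0), (10, 90.0), (12, 85.0)]
```

Here (10, 130.0) drops out because (10, 90.0) is at least as good in both objectives; the three survivors trade fewer fractions against a smaller radiation domain.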
8. Conclusion
Using agent-based simulation, we developed a flexible multi-scale vascular tumour model. We modelled cell-cycle dynamics inside cells and the effects of radiation on cancer and healthy cells at different cell-cycle stages. We also modelled the angiogenesis process inside tissue and tumour, and the oxygen dynamics during tumour growth and the RT course. Then, we proposed an algorithm to optimise RT based on MDQ-learning, considering a fixed irradiation schedule and dynamic dose intensity during the treatment course. As RT is fundamentally multi-objective, our proposed algorithm considers the conflicting objectives of minimising the tumour therapy period while simultaneously minimising the unavoidable side effects of radiation on healthy cells. We obtained Pareto optimal solutions for RT and built the Pareto front by running the algorithm iteratively. Using diverse solver agents with different exploration behaviour let us explore the solution space more thoroughly and better construct the Pareto front.
Multi-objective distributed Q-learning is a potent technique for optimising complex stochastic problems with a multi-objective nature, such as RT. In our problem, it is not possible to apply classical optimisation techniques from the operational research context, such as dynamic programming, due to the infinite number of states and the impossibility of establishing a transition probability matrix.
Our multi-objective optimisation component works well, and it seems it could be connected with other types of biological agent-based simulation models in other contexts, such as chemotherapy or surgery; the existing procedure could be applied to such models with minimal modification.
Although multi-objective optimisation based on multi-agent simulation has previously been done in other domains, such as controlling traffic lights (Khamis & Gomaa, 2012; Khamis, Gomaa, & El-Shishiny, 2012), it is novel in our context. In future research, we will try to improve the biological aspects of the tumour growth model by forming a more sophisticated vascular tumour. Accordingly, we would expand the angiogenesis section by modelling VEGF secretion and the brush-border effect. An additional key amendment that could be considered in our tumour model is hypoxia, an important factor in radiotherapy: hypoxic cells are more resistant to RT, and the hypoxic region is the most aggressive part of the tumour (Olcina, Lecane, & Hammond, 2010). The declining sensitivity of hypoxic cells to radiation is supposed to be the core cause of treatment failure, especially in more hypoxic tumours.
Computational complexity is one of the main complications of agent-based models of tumour growth; it does not let designers scale up to real dimensions. Shirazi, Davison, von Mammen, Denzinger, and Jacob (2014) proposed a novel methodology to speed up agent-based simulation by designing hierarchical meta-agents which contain groups of lower level agents. Although this methodology needs some modifications to become compatible with our model, we plan to adopt it.
Disclosure statement
No potential conflict of interest was reported by the authors.
ORCID
Hamidreza Shahabi Haghighi http://orcid.org/0000-0002-1679-4920
Abbas Ahmadi http://orcid.org/0000-0001-9884-0830
References
Ahmadi, A., & Afshar, P. (2015). Intelligent breast cancer recognition using particle swarm optimization and support vector
machines. Journal of Experimental & Theoretical Artificial Intelligence, 28, 1021–1034.
Antipas, V. P., Stamatakos, G. S., Uzunoglu, N. K., Dionysiou, D. D., & Dale, R. G. (2004). A spatio-temporal simulation model
of the response of solid tumours to radiotherapy in vivo: Parametric validation concerning oxygen enhancement ratio
and cell cycle duration. Physics in Medicine and Biology, 49, 1485–1504.
Basu, K., Paul, S., & Roy, P. (2005). MRI-image based radiotherapy treatment optimization of brain tumours using stochastic approach. Manesar: N.B.R.C. Computational Neuroscience & Neuroimaging Laboratory.
Borkenstein, K., Levegrün, S., & Peschke, P. (2004). Modeling and computer simulations of tumor growth and tumor response
to radiotherapy. Radiation Research, 162, 71–83.
Chan, T. C.-Y. (2007). Optimization under uncertainty in radiation therapy. Massachusetts Institute of Technology, Operations
Research Center. Retrieved from http://hdl.handle.net/1721.1/40302
Chen, J., Sprouffske, K., Huang, Q., & Maley, C. C. (2011). Solving the puzzle of metastasis: The evolution of cell migration
in neoplasms. PLoS ONE, 6, e17933.
Craft, D., Bangert, M., Long, T., Papp, D., & Unkelbach, J. (2014). Shared data for intensity modulated radiation therapy (IMRT)
optimization research: The CORT dataset. GigaScience, 3(1), 1–12.
Craft, D. L., Halabi, T. F., Shih, H. A., & Bortfeld, T. R. (2006). Approximating convex Pareto surfaces in multiobjective
radiotherapy planning. Medical Physics, 33, 3399–3407.
Deng, G., & Ferris, M. C. (2008). Neuro-dynamic programming for fractionated radiotherapy planning. In C. J. S. Alves,
P. M. Pardalos, L. N. Vicente (Eds.), Optimization in medicine (pp. 47–70). New York: Springer.
Dionysiou, D. D., Stamatakos, G. S., Uzunoglu, N. K., & Nikita, K. S. (2006). A computer simulation of in vivo tumour growth
and response to radiotherapy: New algorithms and parametric results. Computers in Biology and Medicine, 36, 448–464.
Enderling, H., Anderson, A. R., & Chaplain, M. A. (2007). A model of breast carcinogenesis and recurrence after radiotherapy.
PAMM, 7, 1121701–1121702.
Ferris, M. C., Lim, J., & Shepard, D. M. (2003). Radiosurgery treatment planning via nonlinear programming. Annals of
Operations Research, 119, 247–260.
Fletcher, A. G., Mirams, G. R., Murray, P. J., Walter, A., Kang, J.-W., Cho, K.-H., … Byrne, H. M. (2010). Multiscale modeling of colonic crypts and early colorectal cancer. In T. S. Deisboeck (Ed.), Multiscale cancer modeling (pp. 111–134).
Ghate, A. (2014, September 3). Dynamic optimization in radiotherapy. INFORMS Tutorials in Operations Research, 60–74.
Seattle: University of Washington. Retrieved from http://pubsonline.informs.org/doi/abs/10.1287/educ.1110.0088
Grote, J., Süsskind, R., & Vaupel, P. (1977). Oxygen diffusivity in tumor tissue (DS-carcinosarcoma) under temperature conditions within the range of 20–40 °C. Pflügers Archiv European Journal of Physiology, 372, 37–42.
Hall, E. J., & Giaccia, A. J. (2006). Radiobiology for the radiologist (6th ed). Philadelphia, PA: Lippincott Williams & Wilkins.
Harting, C., Peschke, P., Borkenstein, K., & Karger, C. P. (2007). Single-cell-based computer simulation of the oxygen-
dependent tumour response to irradiation. Physics in Medicine and Biology, 52, 4775–4789.
Holdsworth, C., Corwin, D., Stewart, R., Rockne, R., Trister, A., Swanson, K., & Phillips, M. (2012). Adaptive IMRT using a
multiobjective evolutionary algorithm integrated with a diffusion–invasion model of glioblastoma. Physics in Medicine
and Biology, 57, 8271–8283.
Humphrey, T. C., & Brooks, G. (2005). Cell cycle control: Mechanisms and protocols (Vol. 296). Totowa, NJ: Humana Press Inc.
Jalalimanesh, A., Haghighi, H. S., Ahmadi, A., & Soltani, M. (2017). Simulation-based optimization of radiotherapy: Agent-
based modeling and reinforcement learning. Mathematics and Computers in Simulation, 133, 235–248.
Jiménez, R. P., & Hernandez, E. O. (2011). Tumour–host dynamics under radiotherapy. Chaos, Solitons & Fractals, 44, 685–692.
Khamis, M. A., & Gomaa, W. (2012). Enhanced multiagent multi-objective reinforcement learning for urban traffic light control.
Paper presented at the 11th International Conference on Machine Learning and Applications (ICMLA), 2012, Boca
Raton, FL.
Khamis, M. A., & Gomaa, W. (2014). Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal
control based on cooperative multi-agent framework. Engineering Applications of Artificial Intelligence, 29, 134–151.
Khamis, M. A., Gomaa, W., & El-Shishiny, H. (2012). Multi-objective traffic light control system based on Bayesian probability
interpretation. Paper presented at the 2012 15th International IEEE Conference on Intelligent Transportation Systems,
Anchorage.
Kim, M., Ghate, A., & Phillips, M. (2009). A Markov decision process approach to temporal modulation of dose fractions in
radiation therapy planning. Physics in Medicine and Biology, 54, 4455–4476.
Kirkby, N., Burnet, N., & Faraday, D. (2002). Mathematical modelling of the response of tumour cells to radiotherapy. Nuclear
Instruments and Methods in Physics Research Section B: Beam Interactions with Materials and Atoms, 188, 210–215.
Kufe, D. W., Pollock, R. E., Weichselbaum, R. R., Bast, R. C., Gansler, T. S., Holland, J. F., … Kalluri, R. (2003). Beginning of angiogenesis research. In Holland-Frei cancer medicine (6th ed.). Hamilton, Ontario: BC Decker.
Langlois, M., & Sloan, R. H. (2010). Reinforcement learning via approximation of the Q-function. Journal of Experimental &
Theoretical Artificial Intelligence, 22, 219–235.
Leder, K., Pitter, K., LaPlant, Q., Hambardzumyan, D., Ross, B. D., Chan, T. A., … Michor, F. (2014). Mathematical modeling of
PDGF-driven glioblastoma reveals optimized radiation dosing schedules. Cell, 156, 603–616.
Lim, J. (2002). Optimization in radiation treatment planning (Doctoral dissertation). Madison, WI: University of Wisconsin–
Madison.
Liu, C., Xu, X., & Hu, D. (2015). Multiobjective reinforcement learning: A comprehensive overview. IEEE Transactions on
Systems, Man, and Cybernetics: Systems, 45, 385–398.
Mariano, C. E., & Morales, E. F. (2000). Distributed reinforcement learning for multiple objective optimization problems. Paper
presented at the 2000 Congress on Evolutionary Computation, La Jolla, CA.
Moodie, E. E., Chakraborty, B., & Kramer, M. S. (2012). Q-learning for estimating optimal dynamic treatment rules from
observational data. Canadian Journal of Statistics, 40, 629–645.
Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 65, 331–355.
O’Neil, N. (2012). An agent-based model of tumor growth and response to radiotherapy (Master’s thesis). Virginia Commonwealth
University, Richmond, VA.
Olcina, M., Lecane, P. S., & Hammond, E. M. (2010). Targeting hypoxic cells through the DNA damage response. Clinical
Cancer Research, 16, 5624–5629.
Orth, M., Lauber, K., Niyazi, M., Friedl, A. A., Li, M., Maihöfer, C., … Belka, C. (2014). Current concepts in clinical radiation
oncology. Radiation and Environmental Biophysics, 53(1), 1–29.
Podgorsak, E. B. (2005). Review of radiation oncology physics: A handbook for teachers and students. Vienna: International
Atomic Energy Agency.
Powathil, G. G., Gordon, K. E., Hill, L. A., & Chaplain, M. A. (2012). Modelling the effects of cell-cycle heterogeneity on the
response of a solid tumour to chemotherapy: Biological insights from a hybrid multiscale cellular automaton model.
Journal of Theoretical Biology, 308, 1–19.
Ramakrishnan, J. (2013). Dynamic optimization of fractionation schedules in radiation therapy (Doctoral dissertation).
Massachusetts Institute of Technology. Retrieved from http://hdl.handle.net/1721.1/82181
Ruotsalainen, H., Boman, E., Miettinen, K., & Tervo, J. (2009). Nonlinear interactive multiobjective optimization method for
radiotherapy treatment planning with Boltzmann transport equation. Contemporary Engineering Sciences, 2, 391–422.
Shirazi, A. S., Davison, T., von Mammen, S., Denzinger, J., & Jacob, C. (2014). Adaptive agent abstractions to speed up spatial
agent-based simulations. Simulation Modelling Practice and Theory, 40, 144–160.
Song, R., Wang, W., Zeng, D., & Kosorok, M. R. (2011). Penalized Q-learning for dynamic treatment regimes. Statistica Sinica,
25, 901–920.
Stamatakos, G. S., Dionysiou, D. D., Zacharaki, E., Mouravliansky, N., Nikita, K. S., & Uzunoglu, N. K. (2002). In silico radiation
oncology: Combining novel simulation algorithms with current visualization techniques. Proceedings of the IEEE, 90,
1764–1777.
Thames, H. D., & Hendry, J. H. (1987). Fractionation in radiotherapy. London: Taylor and Francis.
Thieke, C., Küfer, K.-H., Monz, M., Scherrer, A., Alonso, F., Oelfke, U., … Bortfeld, T. (2007). A new concept for interactive
radiotherapy planning with multicriteria optimization: First clinical evaluation. Radiotherapy and Oncology, 85, 292–298.
Thiele, J. (2014). R marries NetLogo: Introduction to the RNetLogo package. Journal of Statistical Software, 58(2), 1–41.
Van der Kogel, A., & Joiner, M. (2009). Basic clinical radiobiology. London: Hodder Arnold.
Vaupel, P., Kallinowski, F., & Okunieff, P. (1990). Blood flow, oxygen consumption and tissue oxygenation of human tumors.
In J. Piiper, T. K. Goldstick, & M. Meyer (Eds.), Oxygen transport to tissue XII (pp. 895–905). New York, NY: Springer.
Wein, L. M., Cohen, J. E., & Wu, J. T. (2000). Dynamic optimization of a linear–quadratic model with incomplete repair
and volume-dependent sensitivity and repopulation. International Journal of Radiation Oncology, Biology, Physics, 47,
1073–1083.
Yankeelov, T. E., Atuegwu, N., Hormuth, D., Weis, J. A., Barnes, S. L., Miga, M. I., … Quaranta, V. (2013). Clinically relevant
modeling of tumor growth and treatment response. Science Translational Medicine, 5, 187–189.
Zhao, Y., Kosorok, M. R., & Zeng, D. (2009). Reinforcement learning design for cancer clinical trials. Statistics in Medicine,
28, 3294–3315.