Journal of Experimental & Theoretical Artificial Intelligence, 2017
VOL. 29, NO. 5, 1071–1086
https://doi.org/10.1080/0952813X.2017.1292319

Multi-objective optimization of radiotherapy: distributed Q-learning and agent-based simulation

Ammar Jalalimanesh^a, Hamidreza Shahabi Haghighi^a, Abbas Ahmadi^a, Hossein Hejazian^a and Madjid Soltani^b,c

^a Department of Industrial Engineering, Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran; ^b Department of Mechanical Engineering, K. N. Toosi University of Technology, Tehran, Iran; ^c Department of Radiology, Johns Hopkins University, Baltimore, MD, USA

ABSTRACT
Radiotherapy (RT) is among the regular techniques for the treatment of cancerous tumours, and many cancer patients are treated in this manner. Treatment planning is the most important phase in RT and plays a key role in achieving therapy quality. As the goal of RT is to irradiate the tumour with adequately high levels of radiation while sparing neighbouring healthy tissues as much as possible, it is naturally a multi-objective problem. In this study, we propose an agent-based model of vascular tumour growth and of the effects of RT. Next, we use a multi-objective distributed Q-learning algorithm to find Pareto-optimal solutions for calculating the RT dynamic dose. We consider multiple objectives, and each group of optimizer agents attempts to optimise one of them iteratively. At the end of each iteration, the agents compromise on the solutions to shape the Pareto front of the multi-objective problem. We propose a new approach by defining three schemes of treatment planning created from different combinations of our objectives, namely invasive, conservative and moderate. In the invasive scheme, we enforce killing cancer cells and pay less attention to the irradiation effects on normal cells. In the conservative scheme, we take more care of normal cells and try to destroy cancer cells in a less stressed manner. The moderate scheme stands in between. For implementation, each of these schemes is handled by one agent in the MDQ-learning algorithm, and the Pareto optimal solutions are discovered through the collaboration of the agents. By applying this methodology, we could reach Pareto treatment plans through building different scenarios of tumour growth and RT. The proposed multi-objective optimisation algorithm generates robust solutions and finds the best treatment plan for different conditions.

ARTICLE HISTORY
Received 26 July 2016; Accepted 3 February 2017

KEYWORDS
Multi-objective optimisation; agent-based modelling; reinforcement learning; radiotherapy; MDQ-learning; simulation-based optimisation; tumour treatment

1. Introduction
Cancer is the second cause of death worldwide after cardiovascular disease, even in developed countries (Stamatakos et al., 2002), and there is still no reliable cure for it. Cancer is a group of diseases in which body cells change, proliferate uncontrollably and eventually form a tumour (Ahmadi & Afshar, 2015). The leading approaches for cancer treatment are surgery, chemotherapy and radiotherapy (RT). A combination of radiotherapy and chemotherapy is often applied to achieve enhanced treatment.
CONTACT  Hamidreza Shahabi Haghighi  Shahabi@aut.ac.ir; Abbas Ahmadi  abbas.ahmadi@aut.ac.ir


© 2017 Informa UK Limited, trading as Taylor & Francis Group
The main objective of all these therapies is to control tumour growth and to eradicate cancer cells.
About half of cancer patients are treated with RT (Orth et al., 2014). In RT, a beam of radiation is projected at the cluster of malignant cells to break DNA bonds, which leads to cell death. Protecting the healthy tissue surrounding the tumour from radiation is a vital concern in this procedure.
Sequential therapy planning, also called 'dynamic treatment regimes' (DTR) or 'adaptive treatment strategies' (Song, Wang, Zeng, & Kosorok, 2011), is an important problem in both engineering and medical science.
Murphy (2003) defined a concept of ‘optimal dynamic treatment regimes’ to find an optimal list of
decision rules, one per time interval, for adjusting the level of treatment in a personalised manner
through time according to an individual’s altering status. Moodie, Chakraborty, and Kramer (2012) used
Q-learning to find an optimal DTR for patients with major depressive disorder. Zhao, Kosorok, and Zeng
(2009) found an optimal DTR for non-small cell lung cancer chemotherapy using Q-learning algorithm.
Treatment planning is the most important part of RT. In this procedure, the radiation schedule and the intensity of radiation are determined. The goal of RT is to treat the tumour by removing cancer cells while killing the minimum number of healthy cells (Craft, Bangert, Long, Papp, & Unkelbach, 2014). Accordingly, RT is intrinsically multi-objective. Many studies in the past decade have focused on modelling and optimisation of RT, and further attempts have appeared in recent years. Most of them focused on optimising physical aspects of RT such as beam angles and radiation intensity (Deng & Ferris, 2008; Ferris, Lim, & Shepard, 2003).
With computer support, RT inverse planning has become more popular. In this process the objective is to maximise the tumour control probability (TCP) while keeping the normal tissue complication probability (NTCP) at standard levels; for a typical RT treatment, TCP ≥ .5 and NTCP ≤ .05 (Podgorsak, 2005). The treatment plan is confirmed by computer before any physical experiment. Some studies focus on modelling and optimisation of RT. Many of them try to determine the optimised solution using operational research methods based on medical data from RT patients' records.
Agent-based simulation is a flexible method for biological modelling of tumour growth and RT. This approach can model the heterogeneity of tumour growth and therapy selection. Current research shows the evolution of this method and the readiness of cancer therapy associations to put this kind of model into practice in the near future (Yankeelov et al., 2013). Besides, having a reliable algorithm for the optimisation of agent-based models will help oncologists find better solutions in cancer therapy.
As agent-based models are stochastic, classic methods such as dynamic programming cannot simply be used for their optimisation. There is a large number of states in an agent-based model, and it is hard to define a finite Markov decision process for it. Hence, a model-free RL algorithm such as Q-learning is appropriate for optimising such a model.
In previous work, we considered RT as a single-objective problem and proposed an optimisation methodology with the aid of agent-based simulation and Q-learning (Jalalimanesh et al., 2017). In this study, we combine the MDQ-learning algorithm with an agent-based simulation of tumour growth and RT. We find Pareto-optimal solutions of RT treatment considering conflicting objectives. Using MDQ-learning we minimise treatment time, total radiation dose and healthy tissue damage simultaneously. The remainder of the paper is organised as follows. Section 2 reviews the literature on modelling and optimisation of radiotherapy. Section 3 describes our agent-based simulation model of tumour growth and displays some snapshots of the simulated tumour. Section 4 introduces our methodology for modelling radiotherapy. Section 5 defines the concepts of Q-learning, DQ-learning and MDQ-learning and presents their classic algorithms. Section 6 then demonstrates our approach to optimising RT using MDQ-learning; the pseudocode of the proposed algorithm is provided in this section. Section 7 presents the results of running the proposed algorithm and a discussion. Section 8 finally contains the conclusion and our planned future research.

2.  Modelling and optimisation of radiotherapy


Generally, there are two types of objective functions for RT optimisation: dose-based models and radiobiological models. In radiobiological models the optimisation should be built on the biological effects resulting from the underlying dose distributions (Lim, 2002). Simulation helps us harvest the data needed for optimisation when it is hard to access biological data of real patients. Simulation also allows us to study several treatment scenarios and compare them with each other. From a biological perspective, several studies attempt to model cancer growth and the effects of RT on the tumour. Some of them tried to validate their simulations against real clinical or in vivo experiments.
Kirkby, Burnet, and Faraday (2002) demonstrated the reaction of tumour cells to RT using ordinary differential equation (ODE) simulation. They modelled the cancer cell cycle and considered cell age as the main parameter of radiosensitivity. In fact, an age-distributed population balance model of a typical tumour cell cycle was formulated in their research. Stamatakos et al. (2002) established a 3D model of tumour behaviour in response to RT using a Monte-Carlo simulation technique. Antipas, Stamatakos, Uzunoglu, Dionysiou, and Dale (2004) improved the Stamatakos et al. model by considering the Oxygen Enhancement Ratio (OER) and the new angiogenesis phenomenon. They also verified the model variables using clinical data. Borkenstein, Levegrün, and Peschke (2004) constructed a mathematical model to predict the quality of RT. The cell cycle, oxygen diffusion dynamics and tumour angiogenic factor (TAF) diffusion were considered in their model. The simulation was run in a discrete manner in both space and time. Their model is a typical kind of cellular automaton (CA) with some differential equations inside, mostly for oxygen and TAF diffusion. Angiogenesis is simplified as an increase in the number of oxygen sources in the tumour. The model is run as a 3D simulation. Other studies (Basu, Paul, & Roy, 2005; Dionysiou, Stamatakos, Uzunoglu, & Nikita, 2006; Enderling, Anderson, & Chaplain, 2007; Harting, Peschke, Borkenstein, & Karger, 2007; Jiménez & Hernandez, 2011) also developed models and ran simulations of the effects of different RT methods on tumours in diverse organs.
There are also various attempts at optimisation of RT. Wein, Cohen, and Wu (2000) developed a methodology to dynamically optimise the linear quadratic function in different situations using an ODE-based simulation model. They used deterministic optimal control techniques to optimise the treatment plan. In contrast to conformal RT, the dose intensity is changed during treatment in their approach. Lim (2002) developed a series of techniques to optimise the RT treatment plan using the Gamma Knife machine in his PhD thesis. He applied several operational research techniques, such as nonlinear programming and mixed-integer programming, and some heuristic methods like simulated annealing, to optimise the treatment plan spatiotemporally. Craft, Halabi, Shih, and Bortfeld (2006) presented an algorithm for computing well-distributed points on the convex Pareto optimal surface in a multi-objective optimisation of RT. The algorithm was applied to the intensity-modulated radiotherapy (IMRT) inverse planning problem, and results of a prostate case and a skull base case were presented in three dimensions. Chan (2007) investigated robust approaches to manage uncertainty in RT treatments in his PhD thesis. First, he studied the effect of breathing motion uncertainty on IMRT of a lung tumour. Next, he provided a robust formulation for the optimisation of proton-based treatments. Thieke et al. (2007) developed a new inverse planning system called 'Multi-criteria Interactive Radiotherapy Assistant (MIRA)'. They used the optimisation results as a database of patient-specific, Pareto-optimal plan proposals. Two clinical test cases, a para-spinal meningioma and a prostate case, were optimised using MIRA and the results were compared with the clinically approved planning programme KonRad. Deng and Ferris (2008) investigated an on-line treatment planning strategy for fractionated RT with the aid of a neuro-dynamic programming technique. They considered the effect of day-to-day patient motion. In their early limited testing the solutions they obtained outperformed regular solutions and suggested an improved dose profile for each fraction of the treatment. Kim, Ghate, and Phillips (2009) designed fractionation schedules that consider the patient's cumulative response to radiation up to a particular treatment day to determine the fraction on that day. They claimed to propose, for the first time, an approach that mathematically uncovers the benefits of fractionation schemes. They developed a stylised Markov decision process (MDP) model and applied it to several simple numerical trials. Ruotsalainen, Boman, Miettinen, and Tervo (2009) demonstrated a nonlinear interactive multi-objective optimisation method for RT treatment planning using the Boltzmann transport equation (BTE) for dose calculation. They used a parameterisation method to make dose calculation faster in the BTE model.

Figure 1. Cell-cycle flowchart.

Holdsworth et al. (2012) established a patient-specific scheme of adaptive IMRT treatment for glioblastoma by means of a multi-objective evolutionary algorithm (MOEA). They used the MOEA to generate spatially optimised dose distributions based on a mathematical model of tumour cell proliferation, diffusion and response. The sets of spatially optimal dose distributions were produced via the MOEA method to generate the Pareto front. Ramakrishnan (2013) proposed a series of algorithms to dynamically optimise fractionation schedules. He also proposed a model to adaptively vary the fraction size during RT to reach the best treatment plan. Leder et al. (2014) developed a mathematical model to optimise PDGF-driven glioblastoma radiation dosing schedules. They also validated their results by in vivo experiments. They found that altering the fraction size during treatment may lead to better outcomes.

3.  The proposed simulation model of tumour growth


An agent-based model is developed to simulate tumour growth and the effects of RT on the tumour. As the main goal of this research is to optimise the RT treatment plan from a multi-objective viewpoint, we tried to keep our simulation model as simple as possible. Our simulation model considers the tumour in the early vascular stage, when it has a diameter of less than 2 mm. The basic elements of the simulation are cells (cancerous and healthy), which are considered the main agents in our simulation. The model is multi-scale, meaning that each cell has an internal cell cycle as its sub-cellular dynamics. The cell cycle is the systematic sequence of events wherein a cell duplicates its contents before dividing into two cells (Fletcher et al., 2010). The cell cycle is an important concept in tumour progression, since cancer is a disease of uncontrolled cell proliferation. G1, S, G2 and M are the four main steps of the cell cycle. During the G1 phase the cell increases in size and content. In the S phase DNA replication occurs. Throughout the G2 phase the cell continues to grow. There is also a G2 checkpoint which ensures that everything is ready to enter mitosis.

Table 1. Agent-based tumour growth model parameters.

Parameter | Value | Scaled model value | Calculation and reference
Critical-neighbours | 8 | 9 (including the cell itself) | O'Neil (2012)
Average-oxygen-consumption | 2.16 × 10⁻⁹ ml/(cell·h) | .216 | Vaupel et al. (1990)
Max-oxygen-consumption | 4.32 × 10⁻⁹ ml/(cell·h) | .432 | Vaupel et al. (1990)
Normal cell oxygen consumption per hour | Random normal; average 2.16 × 10⁻⁹ ml/(cell·h), st. dev. .72 × 10⁻⁹ ml/(cell·h) | Random normal; average .216, st. dev. .072 | Based on assumption
Cancer cell oxygen consumption per hour | Random normal; average 2.16 × 10⁻⁹ to 4.32 × 10⁻⁹ ml/(cell·h), st. dev. one-third of selected value | Random normal; average .216 to .432, st. dev. one-third of selected value | Based on assumption
Transition | Random normal; average 2.38 × 10⁻⁸ ml/cell, st. dev. .79 × 10⁻⁸ ml/cell | Random normal; average 2.38, st. dev. .79 | Multiplying average-oxygen-consumption by the average amount of time spent by a cell in Gap1
Quiescent-oxygen-level | 10.37 × 10⁻⁸ ml/cell | 10.37 | Average amount of oxygen required for two cells to complete the entire cell cycle
Critical-oxygen-level | 3.88 × 10⁻⁸ ml/cell | 3.88 | Three quarters of the amount of oxygen required for an entire cell cycle
Diffusion rate | 1.8 × 10⁻⁵ cm²/s | 42% (calibrated patch diffusion constant per hour) | Grote, Süsskind, and Vaupel (1977)
Bystander-radiation | Range: 0–5% | Range: 0–.05 | O'Neil (2012)
Bystander-survival-probability | Range: 37–100% | Range: .37–1 | O'Neil (2012)

In the M phase the cell stops growing and divides into two daughter cells. There is also an inactive phase, called G0, in which the cell leaves the cycle and stops dividing.
In our simulation each agent chooses its phenotype based on its cell-cycle stage and environmental factors. Due to the key role of oxygen in RT effectiveness, we track oxygen and its dynamics as the main nutrient in the tissue. Figure 1 shows the flowchart of the cell cycle for all cells. The cell-cycle time is considered to be 24 h according to (Humphrey & Brooks, 2005). Cells spend 12, 6, 4 and 2 h in the G1, S, G2 and M phases, respectively. Normal cells do not divide when there is no space available to divide into; they become quiescent instead, but cancer cells do not follow this rule. Cancer cells are more resistant to oxygen deficiency and turn hypoxic in poor oxygenation conditions. According to (Kufe et al., 2003) a tumour with a volume of 2 mm³ has 10⁵ cells. Further, based on (Vaupel, Kallinowski, & Okunieff, 1990) the oxygen consumption rate is 3 × 10⁻⁵ ml/(cm³ s). Hence, after scaling to the dimensions of our model, the average oxygen consumption is considered to be 2.16 × 10⁻⁹ ml per cell per hour. The remaining parameters were estimated based on this value.
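As a check on this scaling (our own back-of-the-envelope arithmetic, not a calculation taken from the cited sources), the per-cell figure follows directly from the two cited values:

```latex
% cell density implied by Kufe et al. (2003): 10^5 cells in 2 mm^3
\rho = \frac{10^{5}\ \text{cells}}{2\times10^{-3}\ \text{cm}^{3}} = 5\times10^{7}\ \text{cells/cm}^{3}
% tissue consumption rate of Vaupel et al. (1990), converted to hours
q = 3\times10^{-5}\ \tfrac{\text{ml}}{\text{cm}^{3}\,\text{s}} \times 3600\ \tfrac{\text{s}}{\text{h}} = 0.108\ \tfrac{\text{ml}}{\text{cm}^{3}\,\text{h}}
% per-cell consumption
\frac{q}{\rho} = \frac{0.108}{5\times10^{7}} \approx 2.16\times10^{-9}\ \tfrac{\text{ml}}{\text{cell}\cdot\text{h}}
```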
Once more, for simplicity, we used the methodology of Chen, Sprouffske, Huang, and Maley (2011) for modelling angiogenesis. They neglected the morphology of the vascular network and modelled the vessel cells as nodes which diffuse oxygen. According to their method, V blood vessels are placed in continuous 2D space on top of a square lattice of P patches. At every time step, each patch containing a blood vessel receives r_i units of oxygen (Equation (1)). The oxygen concentration c_j^{t+1} at patch j at time t + 1 is given by the following difference equation:

c_j^{t+1} = (1 - d_c)\, c_j^{t} + \frac{d_c}{8} \sum_{k \in N(j)} c_k^{t} - r_a n_j^{t} + r_i \delta_j^{t}    (1)

where d_c is the resource diffusion constant, N(j) is the set of eight neighbouring patches of j, r_a is the cell absorption ratio, n_j^{t} is the number of cells at location j at time t, r_i is the resource creation rate of a microvessel, and δ_j^{t} equals 1 if there is a microvessel at position j at time t and 0 otherwise. Oxygen concentrations are not allowed to be negative, since cells cannot absorb more resources than are present at a position.
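A minimal sketch of this update in Python/NumPy (not the NetLogo implementation actually used in the paper) may clarify how Equation (1) is applied patch by patch. The function name and the default parameter values are our own placeholders, only loosely inspired by the scaled values in Table 1, and are not calibrated quantities from the source.

```python
import numpy as np

def update_oxygen(c, n_cells, has_vessel, d_c=0.42, r_a=0.216, r_i=2.0):
    """One step of Equation (1):
    c_j^{t+1} = (1 - d_c) c_j^t + (d_c/8) * sum_{k in N(j)} c_k^t - r_a n_j^t + r_i delta_j^t."""
    P = np.pad(c, 1, mode="edge")                 # pad so every patch has 8 neighbours
    neighbour_sum = sum(
        np.roll(np.roll(P, dy, axis=0), dx, axis=1)[1:-1, 1:-1]
        for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)
    )
    c_next = (1 - d_c) * c + (d_c / 8) * neighbour_sum - r_a * n_cells + r_i * has_vessel
    return np.maximum(c_next, 0.0)                # concentrations may not become negative

# Illustrative use: a 50 x 50 tissue with two vessels and one cell per patch
c = np.zeros((50, 50))
vessels = np.zeros((50, 50)); vessels[10, 10] = vessels[30, 40] = 1
cells = np.ones((50, 50))
for _ in range(100):
    c = update_oxygen(c, cells, vessels)
```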

Figure 2. Agent-based simulation of vascular tumour growth (snapshots after 1000, 1500 and 2000 ticks).

Figure 3. Schematic process of reinforcement learning.


Table 1 depicts the parameters used in our model.
We used the NetLogo package as our simulation platform. Figure 2 shows three snapshots of the tumour growth simulation at different stages. As each tick represents one hour, the tumour size and oxygen status at ticks 1000, 1500 and 2000 are shown from the top frame of Figure 2 to the bottom. As time passes, the cells inside the tumour become necrotic due to the lack of oxygen and nutrient. On the right side of the 3D plots, red indicates a lack of oxygen while white indicates sufficient oxygen. The simulation is run in a square world of 100 × 100 patches where the cell diameter is 20 μm (Powathil, Gordon, Hill, & Chaplain, 2012); consequently we can simulate a cancer tissue of 2 × 2 mm² area. As Figure 2 shows, as the tumour diameter and the necrotic core volume increase, oxygen leaks into the tumour through the angiogenesis process and causes cancer cells to proliferate inside the tumour.

4.  Modelling of radiotherapy


To model the RT process, we considered the biological effects of radiation in different cell phases. In this part we followed the approach O'Neil used in her research (O'Neil, 2012). Like O'Neil, we modelled the direct effects of radiation on cancer and healthy cells along with bystander effects. Algorithm 1 displays the pseudocode of the RT procedure developed in NetLogo. Cells in the gap 2 (G2) or mitosis stages are more sensitive to irradiation, whereas cells in the gap 1 (G1) stage are more resistant to radiation (Van der Kogel & Joiner, 2009).
The Linear Quadratic (LQ) model is the most widely used function for modelling the response of cancer cells to RT (Thames & Hendry, 1987). Equation (2) gives the LQ function:

SF(D) = exp(−αD − βD²),    (2)

where SF(D) is the proportion of cancer cells surviving an irradiation of dose D (Gray). The Gray is the unit of ionising radiation, defined as the absorption of one joule of radiation energy per kilogram of body mass. The parameters α and β are in units of Gy⁻¹ and Gy⁻², respectively; they are radiation sensitivity parameters related to the probabilities of double-strand breaks in the DNA and their repair, respectively. We used the LQ function to calculate the survival probability of irradiated cells.
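As a concrete illustration, a short self-contained sketch of this survival calculation is shown below. It is our own example rather than code from the paper, and the α and β values are placeholders chosen only for demonstration.

```python
import math
import random

def surviving_fraction(dose_gy, alpha=0.3, beta=0.03):
    """Linear Quadratic (LQ) model: SF(D) = exp(-alpha*D - beta*D^2).
    alpha and beta here are placeholder radiosensitivity values, not from the paper."""
    return math.exp(-alpha * dose_gy - beta * dose_gy ** 2)

def cell_survives(dose_gy, alpha=0.3, beta=0.03):
    """Bernoulli draw: a cell survives irradiation with probability SF(D)."""
    return random.random() < surviving_fraction(dose_gy, alpha, beta)

# Example: survival probabilities for the three dose levels used later in the paper
for d in (2.0, 2.5, 3.0):
    print(f"SF({d} Gy) = {surviving_fraction(d):.3f}")
```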

Algorithm 1. Radiation therapy algorithm


Input: Radiation Dose, Patches Contains Cancer Cells, Cell Cycle Stage, 𝛼, β, Bystander Survival Probability
Output: Cells killed by radiation
1.1: Calculate SF(D) using LQ function based on Radiation Intensity
► Radiation Effects on Cells
1.2: if any patch is under radiation then
1.3:  for all cells in all patches do
1.4:  if cell cycle stage = G2 or mitosis then
1.5:   set RadiationReceived = RandomNumber [0:1.25]
1.7:  else if cell cycle stage = G1 then
1.8:   set RadiationReceived = RandomNumber [0:.75]
1.9:  else if cell cycle stage = synthesis or quiescent then
1.10:   set RadiationReceived = RandomNumber [0:1]
1.11:  end
1.12:  if RadiationReceived > SurvivalProbability then kill this cell end
1.13:  end
1.14: end
► Bystander Effect
1.15: for all cells in all patches do
1.16:  set NeighboursRadiated to the number of patches containing irradiated cells within a 2-patch radius
1.17:  set BystanderDose to NeighboursRadiated multiplied by BystanderRadiation
1.18:  if BystanderDose > 1 then set BystanderDose = 1 end
1.19:  if RandomNumber[0:BystanderDose] > BystanderSurvivalProbability then
1.20:  Die
1.21:  end
1.22: end

5.  Q-learning, DQ-learning and MDQ-learning


Reinforcement learning (RL) is a major class of machine learning techniques (Liu, Xu, & Hu, 2015). RL is based on interacting with an environment and taking actions to maximise the cumulative reward. The objective is to pick the best action for the current state. More precisely, the task of RL is to use observed rewards to reach an optimal (or near-optimal) policy for the environment (Langlois & Sloan, 2010). This method differs fundamentally from standard supervised learning techniques as it focuses on online performance instead of correcting input/output pairs. The method also involves finding a balance between exploration and exploitation. Figure 3 shows the conceptual diagram of RL.
The agent may be any artefact, such as a robot, which takes an action and receives rewards based on the outcome of the selected action. Among the different algorithms used for RL implementation, Q-learning is one of the most popular. Q-learning is a model-free RL algorithm that helps discover an optimised policy; the learning process is not guided by a state transition probability model (Khamis & Gomaa, 2014). As agent-based simulation is stochastic, we cannot simply extract a finite Markov decision process. Therefore, a model-free RL algorithm such as Q-learning is an appropriate technique for optimisation. In Q-learning we have a state space (S), a set of possible actions (A) and a reward function R: S × A → ℝ. Algorithm 2 presents a typical Q-learning algorithm.

Algorithm 2. Q-learning algorithm


Input: Parameters γ,α
Output: Optimised policy
2.1:  Set Q-values (Q(s,a)) randomly for all state-action pairs.
2.2:  Repeat until convergence criteria satisfied
2.3: Initialize s
2.4:  While stopping criteria is not satisfied do
2.5:  Choose an action (a) in the world state (s) based on current Q-value estimates (Q(s, a)).
2.6:  Take the action (a) and see the outcome state (s′) and reward (r).
2.7:  Update Q(s,a) := Q(s,a) + α[r + γ × max_a′ Q(s′,a′) − Q(s,a)]
2.8:  s ← s′
2.9:  End
2.10: End
By selecting an action at each iteration of the algorithm, the system enters a new state (s′). The value of the new state determines the reward of the action (a) in state (s). By running the algorithm, the Q(s,a) matrix is filled by aggregating the rewards for all state-action pairs. Finally, the algorithm converges to the best policy, which corresponds to the maximum elements in the rows of Q(s,a).
γ in line 2.7 of Algorithm 2 is the discount factor, between 0 and 1, which signifies the importance of the value of upcoming states when evaluating the current state. α (0 ≤ α < 1) is the learning rate, which is the weight given to new information when updating the Q matrix.
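To make the update in line 2.7 concrete, a minimal tabular Q-learning sketch in Python follows. It is an illustrative toy with a random stand-in environment, not the R/NetLogo implementation described later in the paper; all names and values are our own.

```python
import random
from collections import defaultdict

ACTIONS = [0, 1, 2]            # e.g. three discrete dose intensities
GAMMA, ALPHA, EPSILON = 0.3, 0.2, 0.1

Q = defaultdict(float)         # Q[(state, action)], implicitly 0 for unseen pairs

def choose_action(state):
    """epsilon-greedy: explore with probability EPSILON, otherwise exploit."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    """Line 2.7: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# Toy episode: random transitions and rewards stand in for the simulation
state = 0
for _ in range(1000):
    action = choose_action(state)
    next_state = random.randint(0, 9)        # placeholder transition
    reward = random.gauss(action, 1.0)       # placeholder reward
    q_update(state, action, reward, next_state)
    state = next_state
```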
Distributed Q-learning (DQ-learning) is a type of Q-learning in which a number of agents try to find the optimal solution in parallel, to reduce the convergence time and to increase the quality of solutions. Accordingly, during the learning course, decisions taken by the agents change the behaviour of the other agents (Mariano & Morales, 2000). Algorithm 3 expresses the pseudocode of a typical DQ-learning algorithm.

Algorithm 3. Distributed Q-learning algorithm


Input: Parameters γ,α
Output: Optimised policy
3.1:  Set Q-values (Q(s,a)) randomly for all state-action pairs.
3.2:  Repeat until convergence criteria satisfied
3.3: Initialize s; copy Q(s,a) to Qc(s,a)
3.4:  for i = 1 to m (agents)
3.5:  while stopping criteria is not satisfied do
3.6:   Choose an action (a) in the world state (s) based on current Q-value estimates (Q(s,a)).
3.7:   Take the action (a) and see the outcome state (s′) and reward (r).
3.8:   Update Qc(s,a) := Qc(s,a) + α[r + γ × max_a′ Qc(s′,a′) − Qc(s,a)]
3.9:   s ← s′
3.10:  end while
3.11:  end repeat
3.12:  Evaluate the m proposed solutions
3.13: Assign rewards to the best solution found
3.14:  Q(s,a) := Q(s,a) + α[r + γ × max_a′ Q(s′,a′) − Q(s,a)]

Multi-objective distributed Q-learning (MDQ-learning) is like DQ-learning, except that each group of agents handles one objective. At the end of each iteration, the agents negotiate with each other and compromise on the solutions to draw the Pareto front (Mariano & Morales, 2000). We explain this algorithm thoroughly for the case of RT in the next section.

6.  Simulation-based optimisation of radiotherapy


Radiotherapy is a treatment applied daily over several weeks. One of the core reasons for splitting irradiation into sessions, termed fractionation, is to let healthy cells recover between sessions. Empirical observation shows that healthy cells have a better damage-healing capability than tumours. Moreover, the repair of sub-lethal damage in healthy tissue, the redistribution of tumour cells into more radiosensitive stages, reoxygenation, and the repopulation of tumour cells have a significant effect on the quality of fractionated RT (Hall & Giaccia, 2006). The evolution of these complex dynamic processes, and consequently a patient's biological response to treatment, is stochastic and depends on how the total dose is broken into fractions. In current treatment planning methods, the total dose is delivered by breaking it into a series of equal-dosage fractions, although there are contemporary studies which use a time-varying fraction size (Ghate, 2014; Kim et al., 2009; Ramakrishnan, 2013). In this study, we tried to find the best treatment plan by considering a fixed schedule of irradiation and varying the fraction size during the treatment.

6.1.  Radiotherapy optimisation using DQ-learning


In the beginning, we used the DQ-learning algorithm as a model-free technique to optimise the RT treatment plan. Algorithm 4 is the pseudocode of our DQ-learning algorithm. We considered the intensity of radiation as the action variable and the tumour size as the state set. This means that each element in the Q(s,a) matrix pairs a radiation intensity with a given tumour size. The outcome of each action, including positive effects on the tumour and negative effects on healthy tissue, is considered as the reward. The area of irradiation during RT is treated as a penalty by multiplying IrradiatedPatches (which contain healthy cells) by RadiationIntensity. This penalty reflects that irradiating a large area of healthy tissue is not desirable in RT due to its side effects. We discretised the radiation intensity (action choices) into three classes: weak (2 Gy), normal (2.5 Gy) and intense (3 Gy). At each step, by applying a new action (dose intensity value) the agent-based simulation is run and the reward of the action-state pair is determined based on the simulation results. As the algorithm's goal is to collect as much reward as possible, it continues for a definite number of iterations or until some terminal state is reached. As described in Section 5, m solver agents search for the optimal solution. All m agents try to find the best solution simultaneously and negotiate with each other during exploration.

Algorithm 4. DQL for RT


Input: Parameters γ, α, δ, w1, w2, w3
Output: Optimised Therapy
4.1: Initialize Q(s, a)
4.2: repeat (for each iteration)
4.3:  V ← 0 (the value of best agent’s solution)
4.4:  for i = 1 to m (agents)
4.5:  Vi ← 0 (the value of agent i’s solution)
4.6:   Define current state from model
4.7:  Qi(s, a) ← Q(s, a)
4.8:  Define 𝜂 = 𝛿∕IterationNumber
4.9:  while the number of cancer cells is greater than the specified threshold do
4.10:    Define a random uniform number
4.11:   if the random uniform number is greater than η then
4.12:   Select action a as the maximum element of Q(s, a)
4.13:  else
4.14:   Select action a randomly
4.15:  end
4.16:  Take action a
4.17:  Calculate reward ri, the Qi matrix and Vi as follows:
    ri = w1 × #KilledCancerCells − w2 × #KilledNormalCells − w3 × NTCP
    Qi(s, a) ← Qi(s, a) + α(ri + γ × max_a′ Qi(s′, a′) − Qi(s, a));   Vi ← Vi + ri
4.18:  Observe the new state s′

4.19:  end while


4.20:  if Vi > V then
4.21:  Q(s, a) ← Qi(s, a)
4.22:  V ← Vi
4.23:  end if
4.24:  end for
4.25: end repeat

In the beginning, the tumour size is recognised as the initial state. All the scenarios should be applied to the same tumour to enable us to compare the solutions with each other. Accordingly, we save the tumour growth simulation at a determined tumour stage/size and use this saved model to initialise the algorithm in all iterations. To explore the solution space thoroughly, in the initial iterations of the algorithm the actions are more likely to be selected randomly than on the basis of maximum rewards. Over time, the Q matrix is filled by the accumulation of the rewards and converges. At this phase the exploitation operator is more likely to be activated and actions are more likely to be selected based on maximum rewards. The swap between exploration and exploitation is governed by the parameter η = δ/IterationNumber, which becomes smaller in higher iterations. The w1 and w2 coefficients in the reward function express the reward of killing cancer cells and the penalty of killing normal cells in our therapy, respectively. The w3 coefficient reflects the significance of the effects of radiation on neighbouring healthy tissues, and hence on therapy outcomes. For instance, w2 and w3 should be higher if the tumour is surrounded by a more critical Organ At Risk (OAR). During the RT, we only irradiate patches that contain cancer cells; this resembles the Intensity Modulated RadioTherapy (IMRT) technique. The termination criterion is to have a tumour with fewer than 10 cancer cells. We calculate the total reward which agent i has achieved during a complete therapy as Vi. Once all the agents complete their therapy and find their solutions, we compare the Vi values with each other to select the best solution. Next, the Q matrix is updated based on the best therapy. For the subsequent iterations, we exchange the Qi of each agent with the updated Q matrix.
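A compact sketch of the per-step reward and the exploration schedule just described is given below. It is our own illustration of lines 4.8–4.17 of Algorithm 4, with made-up simulation outputs, and is not the actual implementation.

```python
import random

DELTA = 0.4                         # exploration parameter (delta in Algorithm 4)
W1, W2, W3 = 3.0, 1.0, 0.1          # example weights; see Table 2 for the scheme values

def exploration_rate(iteration_number):
    """eta = delta / IterationNumber: exploration shrinks as iterations grow."""
    return DELTA / iteration_number

def select_action(q_row, iteration_number):
    """Exploit (the maximum element of the Q row) unless a uniform draw falls below eta."""
    if random.random() > exploration_rate(iteration_number):
        return max(range(len(q_row)), key=lambda a: q_row[a])
    return random.randrange(len(q_row))

def step_reward(killed_cancer_cells, killed_normal_cells, ntcp):
    """r = w1 * #KilledCancerCells - w2 * #KilledNormalCells - w3 * NTCP."""
    return W1 * killed_cancer_cells - W2 * killed_normal_cells - W3 * ntcp

# Example: one decision at iteration 10 with a 3-action Q row and fake simulation output
q_row = [5.2, 4.8, 6.1]
a = select_action(q_row, iteration_number=10)
r = step_reward(killed_cancer_cells=40, killed_normal_cells=5, ntcp=12.0)
```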

6.2.  Multi-objective optimisation of radiotherapy using MDQ-learning


Algorithm 5 displays our multi-objective optimisation algorithm for RT. As in DQ-learning, we have groups of agents which try to find the optimal solution. To cope with the computational complexity of the simulation, we assume that each group contains one agent. Accordingly, each agent is responsible for optimising one objective. Algorithm 6 demonstrates the negotiation mechanism by which the agents compromise on the solutions.

Algorithm 5. MDQ-learning of radiotherapy


Input: Parameters γ, α, δ, W
Output: Pareto Optimal Therapies
5.1: Initialize Q(s, a)
5.2: repeat (for each iteration)
5.3:  for i = 1 to m (agents)
5.4:   Define current state from model
5.5:  Define 𝜂 = 𝛿∕IterationNumber
5.6:  while the number of cancer cells is greater than the specified threshold do
5.7:   Define a random uniform number
5.8:  if the random uniform number is greater than η then
5.9:  Select action a as the maximum element of Qi(s, a)
5.10:  else
5.11:   Select action a randomly
5.12:  end
5.13:  Take action a
5.14:  Calculate reward ri and the Qi matrix as follows:
    ri = wi1 × #KilledCancerCells − wi2 × #KilledNormalCells − wi3 × NTCP
    Qi(s, a) ← Qi(s, a) + α(ri + γ × max_a′ Qi(s′, a′) − Qi(s, a))
5.15:  Observe the new state s′

5.16:  end while
5.17:  Calculate Si as the value of conflicting objectives pairs such as TCP and NTCP
5.18:  end for
5.19:  Negotiation(S)
5.20: end repeat

Algorithm 6. Negotiation (m solutions)


Input: Agents’ solutions Si
Output: Qi(s, a)
6.1: for i = 1 to m (number of agents, each of which corresponds to one objective)
6.2:  if Si (the solution of agent i) dominates any of the Pareto front members then
6.3:  Remove the dominated points and add Si to the Pareto list
6.4:  end if
6.5: end
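For clarity, a small sketch of how such a negotiation step could be implemented is shown below. It is our own illustration of the dominance test behind Algorithm 6, assuming all objectives are to be minimised; it is not code from the paper.

```python
from typing import List, Tuple

Solution = Tuple[float, ...]   # one value per objective, all to be minimised

def dominates(a: Solution, b: Solution) -> bool:
    """a dominates b if it is no worse in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def negotiate(agent_solutions: List[Solution], pareto: List[Solution]) -> List[Solution]:
    """One negotiation round: fold each agent's solution into the Pareto list."""
    for s in agent_solutions:
        if any(dominates(p, s) for p in pareto):
            continue                                          # s is dominated, discard it
        pareto = [p for p in pareto if not dominates(s, p)]   # drop points s dominates
        pareto.append(s)
    return pareto

# Example with two objectives, e.g. (total radiation domain, total radiation dose)
front = negotiate([(120.0, 22.0), (90.0, 26.0), (150.0, 20.0)], pareto=[])
```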

We developed a treatment optimisation approach based on TCP and NTCP, following the typical multi-objective decision-making (MODM) scheme for RT. Two objectives are defined, #KCC and #KNC, which represent TCP and NTCP and are interpreted as the number of cancer cells and the number of normal cells eradicated during each irradiation stage, respectively. As shown in Algorithm 5, the reward function that calculates ri comprises three elements: killing more cancer cells increases the reward, while killing normal cells and irradiating a wider area with a higher intensity decreases the reward.
We must specify the solution space of our problem in the same way that the Pareto front is constructed in MODM problems, where the non-dominated solutions are found by exploring a feasible space shaped by certain constraints. However, in a stochastic environment such as agent-based simulation we can define neither a specific solution space nor a Markovian decision process (the state transitions are complex). Using the MDQ-learning algorithm, we propose a new methodology by defining three schemes of radiotherapy comprising different combinations of our objectives, namely invasive, conservative and moderate.
As mentioned before, the objectives of RT are to eradicate cancer cells faster (first objective) while keeping healthy tissue undamaged (second objective) with as low a total radiation dose as possible. Accordingly, the invasive scheme is a kind of treatment which enforces killing cancer cells without paying much attention to normal cells. On the other hand, the conservative scheme does not take high risks; it takes more care of normal cells and tries to destroy cancer cells in a less stressed manner. The moderate scheme stands in between, paying equal attention to killing cancer cells and sparing normal cells during the RT.
To realise these schemes, we put different emphasis on the aforementioned objectives for each of the three solver agents by assigning different wi values to their reward functions, as sketched below. This approach gives the agents diverse behaviour so that they explore the solution space more thoroughly.
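A minimal sketch of how these three reward configurations could be expressed is shown below, using the coefficient values later reported in Table 2; the structure and names are our own illustration, not code from the paper.

```python
# Reward weights (w1, w2, w3) per scheme, values as reported in Table 2
SCHEMES = {
    "invasive":     {"w1": 3, "w2": 1, "w3": 0.1},   # emphasise killing cancer cells
    "moderate":     {"w1": 2, "w2": 1, "w3": 0.3},   # balance both concerns
    "conservative": {"w1": 1, "w2": 2, "w3": 0.5},   # emphasise sparing normal tissue
}

def scheme_reward(scheme, killed_cancer, killed_normal, ntcp):
    """Per-step reward of one agent/scheme, as in line 5.14 of Algorithm 5."""
    w = SCHEMES[scheme]
    return w["w1"] * killed_cancer - w["w2"] * killed_normal - w["w3"] * ntcp
```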

7.  Results and discussion


The MDQ-learning algorithm was developed in the R scripting language. The RNetLogo plug-in (Thiele, 2014) was used, which lets us transfer data between NetLogo and R. Figure 4 shows the results obtained from the MDQ-learning algorithm implementation. Due to the complexity of the MDQ-learning algorithm, we reduced the tissue size to 50 × 50 patches (1 mm²).

Figure 4. MDQ-learning algorithm results.



Table 2. Reward function coefficients for different agents.

                     | W1 | W2 | W3
Agent 1/invasive     | 3  | 1  | .1
Agent 2/moderate     | 2  | 1  | .3
Agent 3/conservative | 1  | 2  | .5

Figure 5. Pareto optimal solutions of total radiation domain vs. total radiation dose and fractions number.

There was one cancer cell inside the tissue when the simulation started. We also had 70 micro-vessels scattered in the tissue which diffuse oxygen. The other parameters were set according to Table 1. After 333 ticks (hours) of running the simulation, we had a tumour with 750 cancer cells and a large necrotic core. We kept this tumour as the input for the MDQ-learning algorithm. The state space (s) was built by discretizing the tumour size into 40 stages; hence, the proliferation of 20 cells takes the tumour to the next stage. For example, a tumour with 605 cancer cells is in stage 31. As declared, the actions (a) consist of weak (2 Gy), normal (2.5 Gy) and intense (3 Gy) irradiation. We tuned the parameters of the algorithm using a sensitivity analysis approach. Eventually we came up with α = .2, δ = .4 and γ = .3 as good parameters for fast convergence of the algorithm. As mentioned before, we construct three solver agents using different wi values in their reward functions; Table 2 depicts these values for all schemes. The algorithm was run for 100 iterations each time. Figure 4 displays the results of one complete run of the algorithm for the three optimizer agents.
The total radiation domain objective is computed by accumulating, over all days of a complete therapy, the number of irradiated patches containing normal cells multiplied by the radiation intensity. This measure plays the role of NTCP in our simulation. The number of fractions and the total radiation dose are two related objectives: the first shows the duration of the therapy and the second the total aggregated dose delivered during a complete therapy. These two objectives are related to TCP in nature; decreasing them indicates a higher TCP.
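As an illustration of how these three objectives accumulate over a course of treatment, a small sketch follows; the fraction log is invented for the example and the function is our own, not the authors' code.

```python
def therapy_objectives(fractions):
    """fractions: list of (dose_gy, irradiated_normal_patches) per treatment day."""
    num_fractions = len(fractions)                                         # therapy duration
    total_dose = sum(dose for dose, _ in fractions)                        # total radiation dose
    radiation_domain = sum(dose * patches for dose, patches in fractions)  # NTCP proxy
    return num_fractions, total_dose, radiation_domain

# Invented example: a nine-day course with varying fraction sizes
log = [(2.0, 140), (2.5, 130), (3.0, 120), (2.5, 110), (2.0, 100),
       (2.5, 90), (3.0, 80), (2.0, 70), (2.0, 60)]
print(therapy_objectives(log))   # -> (9, 21.5, 2165.0)
```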
The initial iterations are in the exploration phase, where large fluctuations are observable. Agent 1, which has invasive behaviour, pays less attention to the radiation domain and consequently converged later than the other agents. Agent 2, the moderate agent, became stable before the fortieth iteration, and the conservative agent converged just after the twentieth iteration. It can be recognised from the fluctuations in the results that all the agents keep exploring the solution space until the final iterations, which shows that the exploration operator works well even in the convergence phase.
Figure 5 depicts the Pareto points which the agents explored in the solution space over all iterations. We plotted the total radiation domain against the number of fractions and against the total radiation dose, separately. It can be concluded that the regions in which no solution was discovered could become discoverable

with more extreme weights and a different RT schedule. For example, the tumour cannot be eradicated with fewer than 8 daily fractions and a total dose of less than 17 Gy, although it would be possible if we irradiated the tumour two or more times per day. Therefore, some points are not explored by the agents because they impose high negative rewards, even though there are some outliers in the plots which the agents explored due to the randomness of the exploration mechanism. The solutions found by each agent are depicted with a specific colour. The Pareto solutions are indicated by red squares. The Pareto front could be obtained by fitting a curve that connects the optimal solutions. Our Pareto front is narrow because of the limited number of actions (dosages), which are not too different from each other. A wider Pareto front could be obtained by defining more actions (dosages) over a broader range of radiation.

8. Conclusion
Using agent-based simulation we developed a flexible multi-scale vascular tumour model. We modelled cell-cycle dynamics inside cells and the effects of radiation on cancer and healthy cells in different cell-cycle stages. We also constructed the angiogenesis process inside the tissue and tumour and the oxygen dynamics during tumour growth and the RT course. Then, we proposed an algorithm to optimise RT based on the MDQ-learning algorithm, considering a fixed irradiation schedule and a dynamic dose intensity during the treatment course. As RT is fundamentally multi-objective, our proposed algorithm considers the conflicting objectives of minimising the tumour therapy period while simultaneously minimising the unavoidable side effects of radiation on healthy cells. We obtained Pareto optimal solutions for RT and set up the Pareto front by running the algorithm iteratively. Using diverse solver agents with different exploration behaviour lets us explore the solution space more thoroughly and construct the Pareto front better.
Multi-objective distributed Q-learning is a potent technique for the optimisation of complex stochastic problems with a multi-objective nature, such as RT. In our problem, it is not possible to apply classical optimisation techniques from the operational research context, such as dynamic programming, due to the infinite number of states and the impossibility of establishing a transition probability matrix.
Our multi-objective optimisation component works well and, it seems, could be connected to other types of biological agent-based simulation models in other contexts such as chemotherapy or surgery; the existing procedure could be applied to such models with minimal modification.
Although multi-objective optimisation based on multi-agent simulation has previously been done in other domains such as controlling traffic lights (Khamis & Gomaa, 2012; Khamis, Gomaa, & El-Shishiny, 2012), it is novel in our context. In future research, we will try to improve the biological aspects of the tumour growth model by forming a more sophisticated vascular tumour. Accordingly, we would expand the angiogenesis section by modelling VEGF secretion and the brush-border effect. An additional key amendment that could be considered in our tumour model is hypoxia. This phenomenon is an important factor in radiotherapy. Hypoxic cells are more resistant to RT, and the hypoxic region is the most aggressive part of the tumour (Olcina, Lecane, & Hammond, 2010). The reduced sensitivity of hypoxic cells to radiation is supposed to be the core cause of treatment failure, especially in more hypoxic tumours.
Computational complexity is one of the main complications in agent-based models of tumour growth, and it prevents designers from scaling up to realistic sizes. Shirazi, Davison, von Mammen, Denzinger, and Jacob (2014) proposed a novel methodology to speed up agent-based simulation by designing hierarchical meta-agents which contain groups of lower-level agents. Although this methodology needs some modifications to become compatible with our model, we plan to adopt it.

Disclosure statement
No potential conflict of interest was reported by the authors.

ORCID
Hamidreza Shahabi Haghighi   http://orcid.org/0000-0002-1679-4920
Abbas Ahmadi   http://orcid.org/0000-0001-9884-0830

References
Ahmadi, A., & Afshar, P. (2015). Intelligent breast cancer recognition using particle swarm optimization and support vector
machines. Journal of Experimental & Theoretical Artificial Intelligence, 28, 1021–1034.
Antipas, V. P., Stamatakos, G. S., Uzunoglu, N. K., Dionysiou, D. D., & Dale, R. G. (2004). A spatio-temporal simulation model
of the response of solid tumours to radiotherapy in vivo: Parametric validation concerning oxygen enhancement ratio
and cell cycle duration. Physics in Medicine and Biology, 49, 1485–1504.
Basu, K., Paul, S., & Roy, P. (2005). MRI-image based radiotherapy treatment optimization of brain tumours using stochastic
approach. In Manesar (Ed.), N.B.R.C. Computational Neuroscience & Neuroimaging Laboratory.
Borkenstein, K., Levegrün, S., & Peschke, P. (2004). Modeling and computer simulations of tumor growth and tumor response
to radiotherapy. Radiation Research, 162, 71–83.
Chan, T. C.-Y. (2007). Optimization under uncertainty in radiation therapy. Massachusetts Institute of Technology, Operations
Research Center. Retrieved from http://hdl.handle.net/1721.1/40302
Chen, J., Sprouffske, K., Huang, Q., & Maley, C. C. (2011). Solving the puzzle of metastasis: The evolution of cell migration
in neoplasms. PLoS ONE, 6, e17933.
Craft, D., Bangert, M., Long, T., Papp, D., & Unkelbach, J. (2014). Shared data for intensity modulated radiation therapy (IMRT)
optimization research: The CORT dataset. GigaScience, 3(1), 1–12.
Craft, D. L., Halabi, T. F., Shih, H. A., & Bortfeld, T. R. (2006). Approximating convex Pareto surfaces in multiobjective
radiotherapy planning. Medical Physics, 33, 3399–3407.
Deng, G., & Ferris, M. C. (2008). Neuro-dynamic programming for fractionated radiotherapy planning. In C. J. S. Alves,
P. M. Pardalos, L. N. Vicente (Eds.), Optimization in medicine (pp. 47–70). New York: Springer.
Dionysiou, D. D., Stamatakos, G. S., Uzunoglu, N. K., & Nikita, K. S. (2006). A computer simulation of in vivo tumour growth
and response to radiotherapy: New algorithms and parametric results. Computers in Biology and Medicine, 36, 448–464.
Enderling, H., Anderson, A. R., & Chaplain, M. A. (2007). A model of breast carcinogenesis and recurrence after radiotherapy.
PAMM, 7, 1121701–1121702.
Ferris, M. C., Lim, J., & Shepard, D. M. (2003). Radiosurgery treatment planning via nonlinear programming. Annals of
Operations Research, 119, 247–260.
Fletcher, A. G., Mirams, G. R., Murray, P. J., Walter, A., Kang, J.-W., Cho, K.-H., … Byrne, H. M. (2010). Multiscale modeling of
colonic crypts and early colorectal cancer. Multiscale Cancer Modeling, Editor: TS Deisboeck, 6, 111–134.
Ghate, A. (2014, September 3). Dynamic optimization in radiotherapy. INFORMS Tutorials in Operations Research, 60–74.
Seattle: University of Washington. Retrieved from http://pubsonline.informs.org/doi/abs/10.1287/educ.1110.0088
Grote, J., Süsskind, R., & Vaupel, P. (1977). Oxygen diffusivity in tumor tissue (DS-carcinosarcoma) under temperature
conditions within the range of 20–40 C. Pflügers Archiv European Journal of Physiology, 372, 37–42.
Hall, E. J., & Giaccia, A. J. (2006). Radiobiology for the radiologist (6th ed). Philadelphia, PA: Lippincott Williams & Wilkins.
Harting, C., Peschke, P., Borkenstein, K., & Karger, C. P. (2007). Single-cell-based computer simulation of the oxygen-
dependent tumour response to irradiation. Physics in Medicine and Biology, 52, 4775–4789.
Holdsworth, C., Corwin, D., Stewart, R., Rockne, R., Trister, A., Swanson, K., & Phillips, M. (2012). Adaptive IMRT using a
multiobjective evolutionary algorithm integrated with a diffusion–invasion model of glioblastoma. Physics in Medicine
and Biology, 57, 8271–8283.
Humphrey, T. C., & Brooks, G. (2005). Cell cycle control: Mechanisms and protocols (Vol. 296). Totowa, NJ: Humana Press Inc.
Jalalimanesh, A., Haghighi, H. S., Ahmadi, A., & Soltani, M. (2017). Simulation-based optimization of radiotherapy: Agent-
based modeling and reinforcement learning. Mathematics and Computers in Simulation, 133, 235–248.
Jiménez, R. P., & Hernandez, E. O. (2011). Tumour–host dynamics under radiotherapy. Chaos, Solitons & Fractals, 44, 685–692.
Khamis, M. A., & Gomaa, W. (2012). Enhanced multiagent multi-objective reinforcement learning for urban traffic light control.
Paper presented at the 11th International Conference on Machine Learning and Applications (ICMLA), 2012, Boca
Raton, FL.
Khamis, M. A., & Gomaa, W. (2014). Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal
control based on cooperative multi-agent framework. Engineering Applications of Artificial Intelligence, 29, 134–151.
Khamis, M. A., Gomaa, W., & El-Shishiny, H. (2012). Multi-objective traffic light control system based on Bayesian probability
interpretation. Paper presented at the 2012 15th International IEEE Conference on Intelligent Transportation Systems,
Anchorage.
Kim, M., Ghate, A., & Phillips, M. (2009). A Markov decision process approach to temporal modulation of dose fractions in
radiation therapy planning. Physics in Medicine and Biology, 54, 4455–4476.
Kirkby, N., Burnet, N., & Faraday, D. (2002). Mathematical modelling of the response of tumour cells to radiotherapy. Nuclear
Instruments and Methods in Physics Research Section B: Beam Interactions with Materials and Atoms, 188, 210–215.
Kufe, D. W., Pollock, R. E., Weichselbaum, R. R., Bast, R. C., Gansler, T. S., Holland, J. F., … Kalluri, R. (2003). Beginning of angiogenesis research. Holland-Frei cancer medicine (6th ed.). Hamilton, Ontario: BC Decker.
Langlois, M., & Sloan, R. H. (2010). Reinforcement learning via approximation of the Q-function. Journal of Experimental &
Theoretical Artificial Intelligence, 22, 219–235.
Leder, K., Pitter, K., LaPlant, Q., Hambardzumyan, D., Ross, B. D., Chan, T. A., … Michor, F. (2014). Mathematical modeling of
PDGF-driven glioblastoma reveals optimized radiation dosing schedules. Cell, 156, 603–616.

Lim, J. (2002). Optimization in radiation treatment planning (Doctoral dissertation ). Madison, WI: University of Wisconsin–
Madison.
Liu, C., Xu, X., & Hu, D. (2015). Multiobjective reinforcement learning: A comprehensive overview. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45, 385–398.
Mariano, C. E., & Morales, E. F. (2000). Distributed reinforcement learning for multiple objective optimization problems. Paper
presented at the Proceedings of the 2000 Congress on Evolutionary Computation, LA Jolla, CA.
Moodie, E. E., Chakraborty, B., & Kramer, M. S. (2012). Q-learning for estimating optimal dynamic treatment rules from
observational data. Canadian Journal of Statistics, 40, 629–645.
Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 65, 331–355.
O’Neil, N. (2012). An agent based model of tumor growth and response to radiotherapy (Master of Science). Commonwealth
University, Richmond, VA.
Olcina, M., Lecane, P. S., & Hammond, E. M. (2010). Targeting hypoxic cells through the DNA damage response. Clinical
Cancer Research, 16, 5624–5629.
Orth, M., Lauber, K., Niyazi, M., Friedl, A. A., Li, M., Maihöfer, C., … Belka, C. (2014). Current concepts in clinical radiation
oncology. Radiation and Environmental Biophysics, 53(1), 1–29.
Podgorsak, E. B. (2005). Review of radiation oncology physics: A handbook for teachers and students. Vienna: International
Atomic Energy Agency.
Powathil, G. G., Gordon, K. E., Hill, L. A., & Chaplain, M. A. (2012). Modelling the effects of cell-cycle heterogeneity on the
response of a solid tumour to chemotherapy: Biological insights from a hybrid multiscale cellular automaton model.
Journal of Theoretical Biology, 308, 1–19.
Ramakrishnan, J. (2013). Dynamic optimization of fractionation schedules in radiation therapy (Doctoral dissertation).
Massachusetts Institute of Technology. Retrieved from http://hdl.handle.net/1721.1/82181
Ruotsalainen, H., Boman, E., Miettinen, K., & Tervo, J. (2009). Nonlinear interactive multiobjective optimization method for
radiotherapy treatment planning with Boltzmann transport equation. Contemporary Engineering Sciences, 2, 391–422.
Shirazi, A. S., Davison, T., von Mammen, S., Denzinger, J., & Jacob, C. (2014). Adaptive agent abstractions to speed up spatial
agent-based simulations. Simulation Modelling Practice and Theory, 40, 144–160.
Song, R., Wang, W., Zeng, D., & Kosorok, M. R. (2011). Penalized Q-learning for dynamic treatment regimes. Statistica Sinica,
25, 901–920.
Stamatakos, G. S., Dionysiou, D. D., Zacharaki, E., Mouravliansky, N., Nikita, K. S., & Uzunoglu, N. K. (2002). In silico radiation
oncology: Combining novel simulation algorithms with current visualization techniques. Proceedings of the IEEE, 90,
1764–1777.
Thames, H. D., & Hendry, J. H. (1987). Fractionation in radiotherapy. London: Taylor and Francis.
Thieke, C., Küfer, K.-H., Monz, M., Scherrer, A., Alonso, F., Oelfke, U., … Bortfeld, T. (2007). A new concept for interactive
radiotherapy planning with multicriteria optimization: First clinical evaluation. Radiotherapy and Oncology, 85, 292–298.
Thiele, J. (2014). R marries NetLogo: Introduction to the RNetLogo package. Journal of Statistical Software, 58(2), 1–41.
Van der Kogel, A., & Joiner, M. (2009). Basic clinical radiobiology. London: Hodder Arnold.
Vaupel, P., Kallinowski, F., & Okunieff, P. (1990). Blood flow, oxygen consumption and tissue oxygenation of human tumors.
In J. Piiper, T. K. Goldstick, M. Meyer (Eds.), Oxygen transport to tissue XII (pp. 895–905). New York, NY: Springer.
Wein, L. M., Cohen, J. E., & Wu, J. T. (2000). Dynamic optimization of a linear–quadratic model with incomplete repair
and volume-dependent sensitivity and repopulation. International Journal of Radiation Oncology* Biology* Physics, 47,
1073–1083.
Yankeelov, T. E., Atuegwu, N., Hormuth, D., Weis, J. A., Barnes, S. L., Miga, M. I., … Quaranta, V. (2013). Clinically relevant
modeling of tumor growth and treatment response. Science Translational Medicine, 5, 187–189.
Zhao, Y., Kosorok, M. R., & Zeng, D. (2009). Reinforcement learning design for cancer clinical trials. Statistics in Medicine,
28, 3294–3315.
