

Computers and Chemical Engineering 162 (2022) 107819

Contents lists available at ScienceDirect

Computers and Chemical Engineering


journal homepage: www.elsevier.com/locate/compchemeng

Multi-agent reinforcement learning-based exploration of optimal operation strategies of semi-batch reactors

Ádám Sass a,b, Alex Kummer b,∗, János Abonyi b,c

a KALL Ingredients Ltd., Tiszapüspöki, H-5211, Hungary
b Institute of Chemical and Process Engineering, Department of Process Engineering, Hungary
c MTA-PE Lendület Complex Systems Monitoring Research Group, University of Pannonia, Veszprém H-8200, Hungary

Article info

Article history:
Received 13 November 2021
Revised 5 April 2022
Accepted 22 April 2022
Available online 28 April 2022

Keywords:
Temperature control
RL-controller
Feeding trajectory
Cascade control

Abstract

The operation of semi-batch reactors requires caution because the feeding reagents can accumulate, leading to hazardous situations due to the loss of control ability. This work aims to develop a method that explores the optimal operational strategy of semi-batch reactors. Since reinforcement learning (RL) is an efficient tool to find optimal strategies, we tested the applicability of this concept. We developed a problem-specific RL-based solution for the optimal control of semi-batch reactors in different operation phases. The RL-controller varies the feeding rate in the feeding phase directly, while in the mixing phase, it works as a master in a cascade control structure. The RL-controllers were trained with different neural network architectures to define the most suitable one. The developed RL-based controllers worked very well and were able to keep the temperature at the desired setpoint in the investigated system. The results confirm the benefit of the proposed problem-specific RL-controller.

© 2022 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Semi-batch reactors (SBRs) are widely used in the industry to handle highly exothermic reactions, but their operation requires a lot of attention because the reagents can accumulate, resulting in hazardous situations (Ni et al., 2016; Bai et al., 2017). In the operation of SBRs, there is a continuous need to improve efficiency, so the best potential operation strategy has to be defined while the reactor always stays in the safe and controllable operation regime (Hall, 2018). Reinforcement learning (RL) is an efficient tool for learning complex behaviour and strategy, and an RL agent has the potential to perform any task that requires experience gained by interacting with the process (Shin et al., 2019). The operation of batch reactors is generally improved iteratively based on the experience gained from the earlier operation strategies. Since RL works similarly, based on experience, we think RL can perform superbly in finding the optimal feeding strategy. Moreover, RL is typically useful for episodic cases, which supports the use of RL for the operation of batch reactors (Nian et al., 2020). Our motivation is to explore the potential in RL-based operation of semi-batch reactors carrying out exothermic reactions. We developed a problem-specific RL-based solution for the optimal control of SBRs in both the feeding and the mixing phase. Our idea is that the operation phases can be distinguished from the aspect of the control strategy, so two different RL-agents work independently in the feeding and the mixing phase. The proposed RL-controllers fulfilled our expectations, and the results confirm their benefits for the operation of SBRs.

During the operation of SBRs, great attention must be paid to prevent the development of thermal runaways. A tried-and-tested method is to control the rate of reaction heat generation by the feeding rate of the reagent. This way, the cooling system is able to remove the reaction heat from the reactor. However, if the feeding strategy is not designed correctly, the temperature may exceed the maximum allowable temperature (MAT) (Westerterp and Molga, 2004; 2006; Copelli et al., 2014). One of the main risks during the operation of SBRs is the accumulation of the feeding component due to poorly chosen initial conditions or feeding strategy, because the higher concentration may trigger a thermal runaway resulting in a rapid and significant temperature increase (Guo et al., 2017; Ni et al., 2017). Generally, the feeding strategy is simple, using only a constant feeding rate over the entire operation of the reactor, resulting in higher batch times compared to operation cases with a more complex feeding strategy (Kummer et al., 2020b).

∗ Corresponding author. E-mail address: kummera@fmt.uni-pannon.hu (A. Kummer).
https://doi.org/10.1016/j.compchemeng.2022.107819

Notations

a        action
c        concentration of reagent, [kmol/m3]
c_p      heat capacity, [kJ/(kg K)]
e        error
E_A      activation energy, [kJ/kmol]
J        objective function
k        reaction rate constant, [m3/(kmol s)]
k_0      pre-exponential factor, [m3/(kmol s)]
K_P      proportional gain
K_I      integral gain
L        loss function
M        number of episodes
n_i      mole amount of the ith component, [kmol]
N        noise
N        mini-batch size
N_R      number of reagents
OP       actuator position
Q        action-value function
PV       process variable of controller
r        reward
r_R      reaction rate, [kmol/(m3 s)]
R        replay buffer
R_g      gas constant, [kJ/(kmol K)]
s        state of the system
SP       setpoint of controller
t        time, [s]
T        time interval, [s]
T_J      temperature of the reactor jacket, [K]
T_R      temperature of the reactor, [K]
T_SP     temperature setpoint, [K]
y        target
u        control input vector in NMPC formulation, [%]
UA       heat transfer parameter, [kW/K]
V        liquid volume, [m3]
x        state vector in NMPC formulation
γ        discount factor
ΔH_r     reaction heat, [kJ/kmol]
θ        weights in the neural network
μ        policy operator
π        general policy
ρ        density, [kg/m3]
τ        trajectory
τ_sys    time constant of the system, [min]

In recent years, researchers have focused on applying model-based solutions to define a non-constant feeding strategy, because an adequate model can be used to design an appropriate feeding plan to maximize productivity. A model-based control approach, such as model predictive control (MPC), can be used to optimize the feeding strategy (Findeisen et al., 2007). Since SBRs carrying out exothermic reactions have a high non-linearity in their dynamics, the controller has to cope with it. Non-linear MPC (NMPC) is a suitable tool to handle such non-linear processes (Seki et al., 2001).

Until today, numerous MPC-based frameworks have been developed to control semi-batch reactors safely and optimally. Rossi et al. developed the simultaneous model-based optimization and control (BSMBO&C) algorithm, which is the combination of NMPC and dynamic real-time optimization (DRTO) techniques (Rossi et al., 2015; 2016; 2017). MPC-based solutions with integrated runaway criteria to predict and avoid thermal runaway were also investigated carefully (Kähm and Vassiliadis, 2018a; 2018b; Kanavalau et al., 2019; Kummer et al., 2020a). However, there is one main disadvantage when we use NMPC to solve such a problem. Since the NMPC solves an optimization problem in every iteration step, the computational cost is high, and the computational time is critical from the point of applicability. In contrast, a reinforcement learning-based controller is trained offline, and it does not need additional optimization in the online application. As a result of this, the computational time is not critical anymore (Nian et al., 2020).

The learning of the RL-based controller is similar to the concept of batch-to-batch learning (Xiong and Zhang, 2005). Batch-to-batch optimization uses the episodic nature of batch processes to define the optimal operating policy (Clarke-Pringle and MacGregor, 1998), and the optimal operating conditions are defined iteratively in the presence of uncertainty using the previous batches (Srinivasan et al., 2001).

RL-based controllers are able to learn purely from experience, they do not require any parameter tuning, and they can handle non-linear systems (Ma et al., 2019). Only a few research articles can be found in the literature about RL-based reactor operations. Machalek et al. optimized the operation of a continuous stirred tank reactor (CSTR) by defining the optimal value of the transferred heat using three different RL algorithms (Deep Deterministic Policy Gradient (DDPG), Twin Delayed Deep Deterministic Policy Gradient, Proximal Policy Optimization) (Machalek et al., 2020). Singh and Kodamana used Q-learning and Deep Q-learning to control a batch polymerization process (Singh and Kodamana, 2020). Pan et al. proposed an oracle-assisted constrained Q-learning to handle the different system constraints (for example, maximum process temperature, maximum concentration, etc.) during the operation (Pan et al., 2021). An RL-based real-time optimization framework for steady-state optimization has been worked out and tested on a CSTR (Powell et al., 2020). A Monte–Carlo-based DDPG algorithm was presented in Yoo et al. (2021) for the optimal control of a batch polymerization process. An RL-controller (Q-learning algorithm) has been applied for pH process control in a laboratory-sized plant, and the RL-based controller provided a much smoother operation than the conventional PID controller (Syafiie et al., 2007).

The motivation of our work is to investigate the potential of RL-based operation of SBRs carrying out exothermic reactions. Our RL-based operation strategy uses an RL agent both for the feeding and the mixing phase. The RL controller in the feeding phase is a simple single-loop controller, while it acts as a master in a cascade loop during the mixing phase. During the feeding phase, the temperature was controlled by varying the feeding rate, and in the mixing phase, the temperature was controlled by manipulating the coolant rate. A model-free RL algorithm, namely the Deep Deterministic Policy Gradient (DDPG) method, is applied for the training, where we investigated different neural network architectures. We also implemented an NMPC to control the SBR in the different operation phases to demonstrate in which situations the RL-based controller can outperform the NMPC. The designed controllers could keep the reactor in a safe operation zone, and the reactor temperature did not deviate from the target value. A continuously improving area is the application of multi-agent reinforcement learning techniques to earn a better performance in control problems (Zhang et al., 2018; Chen et al., 2021). Our control strategy uses a multi-agent setup to operate semi-batch reactors, though in this case, the agents are decoupled from each other. Our contributions to the field are as follows:

• We suggested operation phase-based control structures for the operation of SBRs using multi-agents (see Section 2)
• We developed problem-specific RL-controllers used in single- and cascade loops (see Section 2)
• We developed a problem-specific NMPC (see Section 2)
• We demonstrate that the proposed method is successful in exploring the optimal operating strategies (see Section 3)
• We discuss the applicability and the benefits of the proposed method in avoiding dangerous operating zones that can lead to thermal runaway and compare its behaviour to NMPC (see Section 3).


Fig. 1. The process scheme of the investigated semi-batch reactor.

2. Operation phase-based control of SBRs

The operation of SBRs consists of several phases, like the feeding and the mixing phase. In the feeding phase, a reagent is fed into the reactor, while in the mixing phase, the only goal is to maximize the conversion at the desired temperature without any inlet feed. The operation phases can be well separated, so different control structures can be used to maximize their performance. Usually, in the feeding phase, the feed rate is designed as a constant rate using a single control loop with a PID, or the feed rate is varied using a model predictive controller to predict and avoid reagent accumulation. Since reagent accumulation cannot occur in the mixing phase, a cascade control loop is usually used with PIDs to keep the reactor temperature at the setpoint.

2.1. Definition of the control problem

The control goal in both operation phases is to keep the temperature at the desired setpoint (SP). In the feeding phase, the temperature is controlled by the feed rate of a reagent by manipulating the reagent feed valve (V1), and in the mixing phase, the temperature is controlled by manipulating the coolant flow rate (V2) in a cascade structure. Since the operation phases are independent from the control point of view, we suggest using two different controllers. Due to the potential accumulation of the feeding reagents, high non-linearities are present in the system, which a non-linear controller can handle. Based on these control structures and the handling of non-linearities, we propose an RL-based control solution to define the optimal operation strategy for SBRs. Fig. 1 presents the process scheme of the investigated system. An RL-controller means an RL-agent, i.e., a neural network that can choose the right control values (actions) for the operation. The RL-controllers in both phases use state representations (s_feeding and s_mixing) to make a decision about the action values (a_feeding and a_mixing). The agents interact with the system through actions based on the observed states, and the agents learn based on the received reward values.

2.2. The applied control schemes for the operation

The applied control structures are presented in Fig. 2. In the feeding phase, an RL-based controller leads the temperature to the desired setpoint. The RL-based controller structure is presented in Fig. 2a. The only feedback to the RL controller is the required state tuple (s_feeding) to define the appropriate action, which is the desired change of the feed valve actuator position (V1).

The RL-based controller was also tested in the mixing phase. This setup uses a cascade controller with a PID controller in the slave loop and an RL-controller in the master loop. The PID controller perfectly handled the disturbances, while the RL-controller is used to define the setpoint of the PID controller based on the changes in the temperature of the reactor. In the mixing phase, the reactor temperature is kept at the setpoint with a cascade controller, where we compared the performance of the conventional PID controllers (see Fig. 2b) with the newly proposed RL-based controller (see Fig. 2c). In this case, the RL controller acts as the master controller in the cascade algorithm, and it reacts based on the observed states (s_mixing). This cascade structure also showed that RL-based controllers can be used not only as local-level controllers but as advanced-level controllers too.
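To illustrate how the two phase-specific agents of Fig. 2 could be scheduled within one batch, the following Python sketch switches from the feeding-phase RL controller (acting on V1 directly) to the RL master plus PID slave cascade (acting on V2) once the feed is exhausted. All object and method names (reactor, feeding_agent, mixing_agent, pid_slave) are placeholders of our own; the paper does not publish its implementation.

```python
def operate_batch(reactor, feeding_agent, mixing_agent, pid_slave,
                  action_interval=100.0, batch_time=4 * 3600.0):
    """Phase-based scheduling of the two RL agents (illustrative sketch only)."""
    t = 0.0
    slave_sp = reactor.jacket_temperature()      # initial setpoint of the slave loop
    while t < batch_time:
        if not reactor.feeding_finished():
            # Feeding phase: the RL agent moves the feed valve V1 directly.
            s = reactor.observe_feeding_state()              # s_feeding tuple, Eq. (3)
            reactor.move_feed_valve(feeding_agent.act(s))    # action bounded to +/-10 %
        else:
            # Mixing phase: the RL master shifts the setpoint of the PID slave,
            # which manipulates the coolant valve V2.
            s = reactor.observe_mixing_state()               # s_mixing tuple, Eq. (4)
            slave_sp += mixing_agent.act(s)                  # action bounded to +/-10 K
            v2 = pid_slave.update(slave_sp, reactor.jacket_temperature())
            reactor.move_coolant_valve(v2)
        reactor.simulate(action_interval)                    # advance the model by 100 s
        t += action_interval
```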
2.3. Setup of the reinforcement learning algorithm

In this section, we introduce the applied RL algorithm (the Deep Deterministic Policy Gradient (DDPG) method). Then we present how the RL-based training of the controllers was carried out for the feeding and mixing phases. We present which states are observed for the control problem and which action is taken to control the reactor temperature. We also show the architectures of the neural networks and the training algorithm.

The DDPG method is an actor-critic method that was developed from two earlier RL methods, the Deep Q-network (DQN) and the Deterministic Policy Gradient (DPG) (see Appendix B). Lillicrap et al. extended the DPG method with a replay buffer to store tuples of historical samples. In DDPG the agent learns in mini-batches rather than learning online. DDPG also includes target networks, which are copies of the actor and critic networks, and they are used to calculate the target values (Lillicrap et al., 2015). The DDPG method incorporates the advantages of DQN into an actor-critic structure. The actor-critic structure with a deterministic policy gradient-based update on the actor is the first and simplest method for continuous action spaces. Using DPG with an actor-critic structure in the SBR would be a relevant solution if the reactor characteristics did not involve non-linearity.

Fig. 2. The applied control schemes for the operation of SBRs.

Fig. 3. The applied RL algorithm.

The main advantage of the DQN algorithm is that it uses deep neural networks which approximate the action-value function; thus, it can exploit the fact that neural networks can be well used for non-linear function approximation. However, using neural networks in actor-critic structures and directly implementing Q-learning (critic) proved to be unstable in many environments: because the critic network being updated is also used in calculating the target value, the Q update is prone to divergence (Lillicrap et al., 2015). These factors shaped our choice to go with the DDPG algorithm to solve our continuous non-linear optimal control problem, as it was the first to solve continuous high-dimensional control tasks effectively (Nian et al., 2020). The selected DDPG algorithm has already been applied for the continuous control of non-linear batch processes (Ma et al., 2019; Yoo et al., 2021; Xu et al., 2020; Yoo et al., 2019).

The details of the utilised DDPG algorithm can be found in Lillicrap et al. (2015). The overview of the algorithm is depicted in Fig. 3, while Algorithm 1 presents the step-by-step description of the training algorithm in the case of reactor control. In the remaining part of this section, we present how this algorithm should be interpreted.

We use target networks to calculate the target values for both the actor and critic networks, and we use soft updates for the target networks. The weights of the target networks are updated with the following equation:

θ′ ← τθ + (1 − τ)θ′    (1)

where θ′ denotes the weight parameters of the target networks. The exploration-exploitation dilemma is solved by adding noise (N) to the action at the actor network:

μ′(s_t) = μ(s_t | θ^μ_t) + N    (2)

N can be an Ornstein–Uhlenbeck process or normal noise, to explore the environment with or without noise decay.
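As an illustration of Eqs. (1) and (2), the snippet below implements a soft target-network update and Gaussian exploration noise in PyTorch. It is a minimal sketch with our own function names; the paper only states that normal action noise (σ = 0.15) and a soft-update factor of τ = 10^-3 were used.

```python
import torch

def soft_update(target_net, net, tau=1e-3):
    # Eq. (1): theta' <- tau * theta + (1 - tau) * theta'
    for p_target, p in zip(target_net.parameters(), net.parameters()):
        p_target.data.copy_(tau * p.data + (1.0 - tau) * p_target.data)

def noisy_action(actor, state, sigma=0.15, low=-1.0, high=1.0):
    # Eq. (2): mu'(s) = mu(s | theta_mu) + N, with N ~ Normal(0, sigma)
    with torch.no_grad():
        a = actor(state)
    a = a + sigma * torch.randn_like(a)      # exploration without noise decay
    return a.clamp(low, high)                # keep the action in its bounded range
```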
Several hyperparameters need to be defined to use reinforcement learning, such as the structure of the used deep neural networks, the learning rates of the networks, the discount factor (γ), the action noise (σ), the buffer size and the minibatch size.

Algorithm 1 The utilized training algorithm.

Randomly initialize the critic network Q(s, a | θ^Q) and the actor network μ(s | θ^μ) with weights θ^Q and θ^μ
Initialize the target networks Q′ and μ′ with weights θ^Q′ ← θ^Q and θ^μ′ ← θ^μ
Initialize the replay buffer R
for episode = 1, M do
    Initialize the reactor model with a uniformly random reactor temperature (T_R) and moles of the reagents (n_i,0)
    Receive the initial observation state s_1
    while t < T do
        Select action a_t = μ(s_t | θ^μ) + N_t
        Execute action a_t, observe reward r_t and observe the new state s_{t+t_sampling}
        Store transition (s_t, a_t, r_t, s_{t+t_sampling}) in R
        Sample a random minibatch of N transitions (s_i, a_i, r_i, s_{i+1}) from R
        Set y_i = r_i + γ Q′(s_{i+1}, μ′(s_{i+1} | θ^μ′) | θ^Q′)
        Update the critic by minimizing the loss: L = (1/N) Σ_i (y_i − Q(s_i, a_i | θ^Q))²
        Update the actor policy using the sampled policy gradient: ∇_{θ^μ} J ≈ (1/N) Σ_i ∇_{θ^μ} Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i | θ^μ)}
        Update the target networks:
            θ^Q′ ← τ θ^Q + (1 − τ) θ^Q′
            θ^μ′ ← τ θ^μ + (1 − τ) θ^μ′
    end while
end for
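Algorithm 1 assumes a replay buffer with uniform minibatch sampling. A minimal Python sketch of such a buffer is given below; the class name and interface are illustrative, the paper only specifies a buffer size of 1,000,000 and a minibatch size of 64 (see Section 3.2.1).

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience store with uniform minibatch sampling."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are discarded first

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)
```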
The structure of the applied neural networks is the same for
both the actor and the critic networks in all phases. A feedforward
most popular hyperparameter setting methods are the manual neural network is used with fully connected layers, and we used
search, the grid search, the random search, and the more so- layer normalization on all layers of the critic and actor.
phisticated Bayesian optimization. Hyperparameter optimization We used Adam optimizer to train the neural networks with a
techniques suffer from the requirement of many learning runs to learning rate of 10−4 and 10−3 for the actor and critic networks.
identify good hyperparameters, so these methods highly increase The target networks were updated softly with τ = 10−3 . The acti-
the computational costs. However, these methods can effectively vation layers in the hidden layers used the Relu function. The out-
improve the performance (Paul et al., 2019). Since this research put layer of the actor used tanh function to bound the actions. The
focuses on developing the control structure, a draft sensitiv- output of the critic network did not use any activation function.
ity analysis of the hyperparameters was elaborated, and the Action was included only in the last hidden layer of the critic net-
parameters were not fine-tuned by optimization. work (added to the previous layer’s outputs).
Every state which affects the process temperature and varies in
time should be included in the state representation. In the feed- 2.4. Setup of the NMPC
ing phase, the observed states are presented in Eq. (3), which are
the reactor temperature (TR ), the jacket temperature (TJ ), the actua- The performance of the RL-based controller is compared to
tor position (OP ), the concentration of the reagents, the heat trans- NMPC, which can be considered an advanced technique for opti-
fer rate (UA), and the error at the present time and at the earlier mal control.
sampling time between the reactor temperature and the setpoint The NMPC is designed to keep the temperature of the reactor
(et , et−s ). The errors must be included in the state space because, (TR ) at the desired setpoint and at all times under MAT. The ob-
as we experienced, the agents learn much easier and faster in this jective function considers that the process temperature follows the
way. It is worth noting that the knowledge earned by basic algo- setpoint, so that the error can be defined as
rithms (like PID) can help define the structure of RL-based con-
trollers, so, for example, it can help define an appropriate set of ek = TSP − TR,k (6)
states. where ek is the error between the setpoint and the temperature
There is only one actuator, which is the feeding valve (V 1), of the reactor (TRk ) at kth time step. The objective function of the
hence there is only one action: the valve’s adjustment value (OP ). open-loop optimization problem is the sum of the squared error
The action is limited between −10% and +10%. over the prediction horizon described in Fig. 7.
s f eeding = {TR , TJ , OP, ci , . . . , cNR , UA, et , et−s } (3) t pred

In the mixing phase, the observed states are presented in min ek 2 (7)
u j
Eq. (4), which are almost the same as s f eeding , but the setpoint k=0
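The architecture described above (fully connected layers, layer normalization, ReLU hidden activations, a tanh actor output, and the action entering the critic only at the last hidden layer) could be realized in PyTorch roughly as in the sketch below. Layer sizes and class names are illustrative assumptions; the paper reports results for one or two hidden layers with 64 to 512 neurons.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),        # tanh bounds the action
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.state_branch = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
        )
        # the action enters only before the last hidden layer, concatenated
        # with the previous layer's outputs
        self.joint = nn.Sequential(
            nn.Linear(hidden + action_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                            # Q-value, no output activation
        )

    def forward(self, state, action):
        h = self.state_branch(state)
        return self.joint(torch.cat([h, action], dim=-1))

# Example instantiation (state_dim=10 and action_dim=1 are placeholder dimensions)
actor, critic = Actor(10, 1), Critic(10, 1)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)    # learning rates reported in the paper
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
```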
2.4. Setup of the NMPC

The performance of the RL-based controller is compared to NMPC, which can be considered an advanced technique for optimal control.

The NMPC is designed to keep the temperature of the reactor (T_R) at the desired setpoint and at all times under MAT. The objective function considers that the process temperature follows the setpoint, so the error can be defined as

e_k = T_SP − T_R,k    (6)

where e_k is the error between the setpoint and the temperature of the reactor (T_R,k) at the kth time step. The objective function of the open-loop optimization problem is the sum of the squared errors over the prediction horizon, as described in Eq. (7):

min_{Δu_j} Σ_{k=0}^{t_pred} e_k²    (7)

subject to

u_{j+1} = u_j + Δu_j    (8)

−10% ≤ Δu_j ≤ +10%    (9)


0% ≤ u_j ≤ 100%    (10)

T_R ≤ MAT    (11)

where u_j is the manipulated variable in the jth control interval and Δu_j is its change. The manipulated variable is the valve opening of V1 or V2, depending on the operation phase of the SBR. The NMPC is formulated as a velocity-based optimization, thus the optimal changes of the manipulated variable are to be found in every time step. The change is limited to the same range as used in the case of the RL-based controller.

The parameters of the NMPC have been tuned similarly to the RL-based controller. The action interval, which is the time between two control actions, was set to 100 s, which is the same value as in the case of the RL-controller. It is essential to define the prediction horizon to be long enough to capture thermal runaway (Kähm and Vassiliadis, 2018a), so we set 2500 s as the length of the prediction horizon and 1000 s as the length of the control horizon, which adds up to 10 different values of Δu_j in one optimization step.
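The open-loop problem of Eqs. (7)-(11) can be posed for a generic SQP solver, for example with SciPy's SLSQP method, as sketched below. The decision variables are the ten moves Δu_j; simulate_temperature stands for a rollout of the process model over the prediction horizon and, like the other names, is a placeholder of ours, since the authors do not publish their implementation.

```python
import numpy as np
from scipy.optimize import minimize

def nmpc_step(u_current, x_current, setpoint, mat, simulate_temperature,
              n_moves=10, du_max=10.0, dt=100.0, t_pred=2500.0):
    """One receding-horizon NMPC step: find the optimal valve moves of Eqs. (7)-(11)."""

    def rollout(du):
        u = np.clip(u_current + np.cumsum(du), 0.0, 100.0)       # Eqs. (8) and (10)
        return simulate_temperature(x_current, u, dt, t_pred)    # predicted T_R profile

    def objective(du):
        T_R = rollout(du)
        return np.sum((setpoint - T_R) ** 2)                     # Eq. (7)

    constraints = [{"type": "ineq", "fun": lambda du: mat - rollout(du)}]  # Eq. (11)
    bounds = [(-du_max, du_max)] * n_moves                                  # Eq. (9)

    result = minimize(objective, x0=np.zeros(n_moves), method="SLSQP",
                      bounds=bounds, constraints=constraints)
    du_opt = result.x
    return float(np.clip(u_current + du_opt[0], 0.0, 100.0))     # apply only the first move
```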

3. Results and discussion

As the RL agents learn from a wealth of experience that cannot be earned on the real system, there is a need for an adequate system model. Fortunately, there is a broad, well-established toolbox for modelling batch systems. Although the modelling and validating tasks are time- and resource-intensive, they can be done in an industrial environment, so this does not limit the applicability of the method. In the following, we present the applied process model and the reactor operation results.

3.1. Process model of the reactor system

A second-order reaction is carried out in a semi-batch reactor, where product C is made from reagents A and B in the A + B → C reaction. The reaction rate (r_R [kmol/(m3 s)]) is expressed with an Arrhenius-type equation. The details of the process model and the dynamic, kinetic and thermodynamic parameters are described in Appendix A.

3.2. Results and discussion

3.2.1. Training and testing of the RL controller in the feeding phase
The training was performed for eight different network structures to define the proper structure. The RL agents for the controllers are trained with the DDPG algorithm. A replay buffer is used to store the historical samples, from which the samples are chosen uniformly, and the size of the buffer is 1,000,000. The size of the mini-batches is 64. The structure of the neural networks was the same for both the actor and the critic. One and two hidden layers were investigated based on the 2^n principle, so with 64, 128, 256 and 512 neurons in each hidden layer.

The average rewards can be seen in Fig. 4, where the average rewards were calculated within every 50 episodes. As can be seen, the reward trends saturate in the same interval, so investigating more complex network structures would not provide better solutions. The actor and critic models were saved at the maximal average reward values, which can be seen in Table 1. With one hidden layer, the controller could not learn to keep the temperature, but with two hidden layers, the results are quite acceptable, and the maximum rewards were obtained using 512 × 512 neurons.

Table 1. Maximum average rewards at the different neural network structures.

Run no.   Hidden layers   Neurons     Maximum average reward
1         1               64          −540.82
2         1               128         585.57
3         1               256         1258.55
4         1               512         3591.64
5         2               64 × 64     4250.85
6         2               128 × 128   4492.19
7         2               256 × 256   4434.51
8         2               512 × 512   4626.68

Figs. 5 and 6 present how the RL controller operates using all the neural networks with two hidden layers. In the first figure (top-left), the temperature trajectories and the setpoint are shown. The setpoint interval is shown with blue lines, and we can see how the different RL controllers drive the reactor temperatures into the setpoint interval. After ∼2 h, the heat generated in the reaction is no longer enough to keep the temperature within the setpoint interval, hence we need to change the actuator there to maintain the temperature. In the second figure (top-right), the position of the actuator (V1) is presented. In the beginning, the RL controller opens the actuator at maximum speed so that the reagents can be consumed in the reaction and generate heat. As the reactor temperature gets closer to the setpoint, the feed rate decreases to compensate for or avoid overshooting. Then the feed rate is slowly increased to keep the temperature in the setpoint interval. In the third figure (bottom-right), the action made by the RL controllers is presented. Although the maximum average reward was earned with two hidden layers and 512 × 512 neurons, the test runs show that the trained network with 128 × 128 neurons performs better; for instance, there is no overshooting at the end of the feeding phase. This contradiction may come from the random initial operating parameters and the width of the averaging window. Based on the results, we work further with the networks including two hidden layers and 128 × 128 neurons.

3.2.2. Analysis of the cascade controller in the mixing phase
We performed the open-loop analysis of the slave and master loops to get information about the dynamics of the system, and we defined the gain and the time constant, which are presented in Table 2. The time constant of the slave loop was 3.1 min, and that of the master loop was 20.5 min. The PID controller parameters (K_P, K_I) were defined based on the dynamics of the loops, which are presented in Table 2. The PID parameters were tuned IMC-based and fine-tuned by hand.

Table 2. Dynamics of the system and PID controller parameters.

                Slave loop   Master loop
Gain            −0.27        1
τ_sys [min]     3.1          20.5
K_P             14.8         5
K_I             12.4         200
and 512 neurons in each hidden layer. works. In the top figure, the operation of the master loop is pre-
The average rewards can be seen in Fig. 4, where the average sented, and on the bottom figure, the operation of the slave loop
rewards were calculated within every 50 episodes. As it can be is shown. At the time of 2.3 h, the feeding of the reagent was
seen, the reward trends saturate in the same interval, so investigat- stopped so was its cooling effect. Therefore, between 2.3 h and
ing more complex network structures will not provide better solu- 2.5 h, there was a need to decrease the jacket temperature to keep
tions. The actor and critic models were saved at the maximal aver- the reactor temperature in the setpoint interval. As it can be seen
age reward values, which can be seen in Table 1. With one hidden in Fig. 7 the reactor temperature (PVmaster ) goes to the setpoint
layer, the controller could not learn to keep the temperature, but interval but reaches it quite slowly, and the reactor temperature
with two hidden layers, the results are quite acceptable, and the stays outside of the desired setpoint interval.
maximum rewards were obtained using 512 × 512 neurons. Figs. 5 In the following, we present the control results of the cascade
and 6 presents how the RL controller operates using all the neural with RL controller. The training was performed for four different
networks with two hidden layers. In the first figure (top-left), the network structures, where the neural networks’ structure was the
temperature trajectories and the setpoint is shown. The setpoint same for both the actor and critic. We investigated the control re-
interval is shown with blue lines, and we can see how the differ- sults with one hidden layer and 64, 128, 256 and 512 neurons. The
ent RL controllers drive the reactor temperatures into the setpoint average rewards are shown in Fig. 8. The reward trends saturate in
interval. After ∼ 2 h, the heat generated in the reaction is no longer the same interval, so there is no need to investigate more complex


Fig. 4. Average rewards at different neural network structures.

Fig. 5. Performance of the RL controller (T_R,0 = 345 K, n_B,0 = 4.2 kmol).

Fig. 6. Performance of the RL controller (T_R,0 = 364 K, n_B,0 = 5.0 kmol).

Table 3. Maximum average rewards at the different neural network structures.

Run no.   Hidden layers   Neurons   Maximum average reward
1         1               64        7119.24
2         1               128       7052.75
3         1               256       7002.32
4         1               512       7181.75

The average rewards show that the controller could learn and perform the control task quite well with each network. Each controller gained more than 7000 as a maximum average reward (see Table 3), though the highest rewards were gained using 512 neurons. The control results in the mixing phase are also presented in Fig. 9, where all four controllers are presented with the different numbers of applied neurons. Using 64, 128 and 256 neurons, the temperature slightly exceeds the setpoint intervals, which is not the case using 512 neurons. Also, the smoothest action trajectory is earned using 512 neurons.

As a comparison of the RL master and the PID master controllers, the RL master outperformed the PID controller (the integrated error in the case of the PID master was 0.669, while in the case of the RL master it was 0.002).

3.2.3. Operation of the semi-batch reactor using RL-controllers
We present how the trained RL controller in the feeding phase and the trained RL-PID cascade controller work together during the operation of the semi-batch reactor, which is presented in Fig. 10.

Fig. 7. Performance of the cascade controller with PID master.

Fig. 8. Average rewards at different neural network structures.

Fig. 9. Performance of the cascade controller with RL master.

In the top subplot, the reactor temperature is shown during the operation, where the feeding phase lasts until 2.1 h. The controller switch is smooth after the feeding phase, and the reactor temperature stays at the setpoint. In the middle subplot, the position of the actuator valves is presented. The position of V2 is noisy due to the discrete actions performed by the RL master controller.

3.3. Control results with open-loop NMPC

The open-loop optimization problem was solved by the SQP optimization algorithm, where the algorithm proceeds with a moving horizon. Since we would like to compare the results of the NMPC to the RL-based control framework, we used the same 100 s as the sample time. The number of control intervals is set to 10, and the number of prediction intervals is set to 15 (excluding the control intervals). The control and prediction intervals were set quite long due to the potential hazards of a thermal runaway, because the NMPC only works if the development of the thermal runaway can be seen in advance within the prediction horizon. Fig. 11 presents the control of the SBR. The NMPC can solve the control task well, although no uncertainties were considered. There is only a slight overshoot; the reactor temperature follows the setpoint.

Fig. 10. SBR temperature control using RL-controller.

Fig. 11. SBR temperature control using NMPC.

Comparing the RL-based controllers and the NMPC, we can say that the RL-controller in the feeding phase lets in more reagent in the first half an hour; hence the setpoint is reached earlier. On the other hand, the feeding strategy is similar: there is a higher feeding rate at the beginning of the operation (until 0.6–0.7 h), then the feeding rate is decreased to avoid a high overshoot, and then the feeding rate is slightly increased to keep the reactor temperature at the setpoint. The cooling flow rate continuously decreases in the mixing phase to follow the continuously decreasing reaction rate.

3.4. Discussion

The application of RL-controllers shows great potential in exploring the operational strategies of SBRs, but we need to define problem-specific RL-controllers. Since the different operation phases of SBRs require different control approaches, using multi-agent techniques is a well-functioning idea. During the operation of SBRs the agents can be decoupled from each other, so phase-based and individual RL-controllers can be applied for the operation. Increasing the number of hidden layers and the number of neurons leads to a better performance of the RL-controllers, though we need to avoid unnecessarily complex neural networks. The RL-controller is realized in a single control loop in the feeding phase, just like an ordinary control loop using PID. Unlike a PID controller, the RL-controller can handle the non-linearities due to the accumulation of reagents, so the development of a possible runaway can be avoided. In the mixing phase, the RL-controller worked very well as a master in a cascade control loop, and it performed better than an ordinary cascade loop using PID controllers. The two RL-controllers worked well in the operation of the SBR, because the reactor temperature was led to and kept at the desired setpoint during the whole operation.

Compared to the NMPC, the RL-based controllers followed a similar strategy but reached the setpoint faster, creating a better overall yield thanks to the higher reaction rate at higher temperatures. Contrary to the performance of the RL-based controller, a slight overshoot appeared in the case of the NMPC. Regarding computational requirements, in the case of the NMPC the optimization time in some cases was very close to the action time interval of 100 s, which might jeopardize the applicability of the NMPC in the case of this complex model and optimization problem. The RL-based controller was trained on the simulator "offline" for eight hours with the same computational capacity, and the application of the trained controller required a computational capacity that allowed online application. For the RL-controllers' learning and application, we need an adequate and reliable model over a wide interval of operating parameters, so the policies used by the agents become reliable and usable on the investigated real system. The neural networks are difficult to interpret and understand, and they are also difficult to validate. Therefore, the performance of the control needs to be monitored continuously, and if any malfunction occurs, we may need to switch to a back-up controller to keep the system in hand. A potential future research area is to work out validation and failure analysis techniques for the RL-controllers.

4. Conclusion

A reinforcement learning-based approach has been investigated to operate a semi-batch reactor carrying out an exothermic reaction optimally. We developed an RL-based control scheme to operate SBRs, where an RL-controller is used for both the feeding and the mixing phase.

The operation of semi-batch reactors needs higher attention due to the potential runaway reactions. If the feeding reagents accumulate in a higher concentration, the reaction may trigger an undesired thermal runaway, resulting in a rapid temperature increase. The proposed reinforcement learning-based controller helps operate the reactor safely if the training is well-designed.

The RL-controller varied the feed rate in the feeding phase to keep the temperature at the desired setpoint without crossing the maximum allowable temperature. The agent was trained from random initial reactor temperatures and loaded amounts of reagents, so the controller could define the appropriate action to take over a wider operation interval. In the mixing phase, a cascade-type controller was used to keep the temperature at the desired setpoint, where we compared the standard cascade controller with PIDs, the RL-based cascade controller and the NMPC. The RL-based cascade controller was able to keep the temperature closer to the setpoint than the PID controller. Also, it was able to reach the setpoint faster than the NMPC and showed no sign of overshoot.

The reinforcement learning methods work from experience to experience, just like batch-to-batch learning or run-to-run optimization. Since there is no option to perform experiments on the real system, there is always a need for adequate and reliable model development for learning RL-based controllers. Unfortunately, the modelling and simulator development of any process system is time- and cost-consuming, but for most units and systems, a well-designed methodology for modelling is already worked out. Also, the use of open-source programming languages can compensate for the mentioned disadvantages.

The applied RL-based controller works very well in the investigated, simulated system, and it is worth continuing the work in this research field. However, when implementing an RL controller on a real system, there will be a plant-model mismatch when the controller is trained offline on an imperfect simulation model. The requirement of valid information for controller tuning implies that the RL-based controller must be fine-tuned using data collected from the real-time operation. Although model-plant mismatch and non-observable states are not negligible in the application of reinforcement learning, they were not analyzed in this work. However, they will be the subject of our future investigations.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Process model of the reactor

The reaction rate (r_R [kmol/(m3 s)]) is expressed with an Arrhenius-type equation:

r_R = k c_A c_B = k_0 exp(−E_A / (R_g T_R)) c_A c_B    (A.1)

where k is the reaction rate constant [m3/(kmol s)], c_A and c_B are the concentrations of the A and B reagents [kmol/m3], k_0 and E_A are the pre-exponential factor [m3/(kmol s)] and the activation energy [kJ/kmol], respectively, R_g is the gas constant [kJ/(kmol K)] and T_R is the temperature of the reactor [K].

The following differential equations describe the dynamic behaviour of the system:

dV/dt = F    (A.2)

dn_A/dt = F c_in,A − V r_R    (A.3)

dn_B/dt = −dn_C/dt = −V r_R    (A.4)

dT_R/dt = [F ρ c_p (T_in − T_R) + (−ΔH_r) V r_R − UA (T_R − T_J)] / (V ρ c_p)    (A.5)

dT_J/dt = [F_J (ρ c_p)_J (T_J,in − T_J) + UA (T_R − T_J)] / (V ρ c_p)_J    (A.6)

UA = UA_0 (1 + V_dos / V_0)    (A.7)

where t is time [s], V is the liquid volume [m3], V_dos is the dosed volume [m3], F is the feeding rate [m3/s], n_i is the mole amount of the ith component [kmol], T_R and T_J are the reactor and jacket temperatures respectively [K], ρ is the density [kg/m3], c_p is the heat capacity [kJ/(kg K)], ΔH_r is the reaction heat [kJ/kmol], and UA is the heat transfer parameter [kW/K].

The considered control valves have linear characteristics, so the following equations (Eqs. (A.8) and (A.9)) describe the inlet flow rates:

F = F_max · V1_pos / 100    (A.8)

F_J = F_J,max · V2_pos / 100    (A.9)

The kinetic, material and reactor geometry parameters are presented in Table A1, and the operating parameters are presented in Table A2.

Table A1. Parameters and initial conditions of the case study.

Parameter                                      Value          Unit
k_0, pre-exponential factor                    1.465 × 10^7   m3/(kmol s)
E_A/R_g, activation energy                     8500           K
ΔH_r, reaction heat parameter                  −3.5 × 10^5    kJ/kmol
MAT, maximum allowable temperature             373            K
UA_0, initial heat transfer parameter          1.85           kW/K
V_0, initial reagent volume                    0.5            m3
V_J, jacket volume                             0.41           m3
ρc_p, liquid property in reactor               4800           kJ/(m3 K)
(ρc_p)_J, cooling agent property               4183           kJ/(m3 K)

Table A2. Operating parameters.

Parameter                                          Value          Unit
n_A,feed, feed moles of A reagent                  n_B,0          kmol
n_A,0, initial moles of A reagent                  0              kmol
n_B,0, initial moles of B reagent                  U(4, 5)        kmol
c_in,A, feed concentration of A reagent            5              kmol/m3
F_max, maximum flow rate of feeding reagent        1.4 × 10^−4    m3/s
F_J,max, coolant flow rate                         1.1 × 10^−3    m3/s
T_R,0, initial reactor temperature                 U(333, 363)    K
T_J,0, initial jacket temperature                  298            K
T_in, reagent feed temperature                     298            K
T_J,in, coolant feed temperature                   298            K
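For readers who want to reproduce the simulations, the balance equations (A.1)-(A.9) can be integrated with SciPy as sketched below. This is a minimal sketch using the parameter values of Tables A1 and A2; the valve positions are held constant here purely for illustration, whereas in the paper they are the manipulated variables, and V_dos is taken as V − V_0.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Parameters from Tables A1-A2
k0, EA_Rg, dHr = 1.465e7, 8500.0, -3.5e5
UA0, V0 = 1.85, 0.5
rho_cp, rho_cp_J, V_J = 4800.0, 4183.0, 0.41
c_inA, F_max, FJ_max = 5.0, 1.4e-4, 1.1e-3
T_in, T_Jin = 298.0, 298.0

def reactor_rhs(t, y, v1_pos, v2_pos):
    """Right-hand side of Eqs. (A.2)-(A.7); y = [V, nA, nB, nC, T_R, T_J]."""
    V, nA, nB, nC, T_R, T_J = y
    F = F_max * v1_pos / 100.0                 # Eq. (A.8)
    F_J = FJ_max * v2_pos / 100.0              # Eq. (A.9)
    cA, cB = nA / V, nB / V
    rR = k0 * np.exp(-EA_Rg / T_R) * cA * cB   # Eq. (A.1)
    UA = UA0 * (1.0 + (V - V0) / V0)           # Eq. (A.7), assuming V_dos = V - V0
    dV = F
    dnA = F * c_inA - V * rR
    dnB = -V * rR
    dnC = V * rR
    dTR = (F * rho_cp * (T_in - T_R) + (-dHr) * V * rR - UA * (T_R - T_J)) / (V * rho_cp)
    dTJ = (F_J * rho_cp_J * (T_Jin - T_J) + UA * (T_R - T_J)) / (V_J * rho_cp_J)
    return [dV, dnA, dnB, dnC, dTR, dTJ]

# Example: 100 s horizon with both valves 50 % open, starting from T_R,0 = 345 K, n_B,0 = 4.5 kmol
y0 = [V0, 0.0, 4.5, 0.0, 345.0, 298.0]
sol = solve_ivp(reactor_rhs, (0.0, 100.0), y0, args=(50.0, 50.0), rtol=1e-6)
```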

Appendix B. Reinforcement learning

In reinforcement learning an agent interacts with the environment. The environment produces information which describes the state of the system (s_t), and the agent interacts with the environment through an action (a_t) based on the observed states. The action is performed on the environment and it transitions into the next state, then it returns the next state and the reward (r_t) to the agent. The information exchanged is the tuple (s_t, a_t, r_t), where t denotes the time step, and this tuple is called an experience. A trajectory is a sequence of experiences over an episode, τ = (s_0, a_0, r_0), (s_1, a_1, r_1), ... (s_T, a_T, r_T) (Graesser and Keng, 2019).

A policy (π) of an agent is a function that maps states to actions, and the objective of an RL problem is to find the optimal policy, where the best action is chosen at every state. The optimal policy can be found by maximizing the sum of rewards by selecting the best actions. A discounted cumulative reward is used in the objective to be maximized, which can be seen in Eq. (B.1). The discount factor (γ) can be in the interval [0, 1], but in most cases it is in the interval [0.95, 0.99].

R(τ) = r_0 + γ r_1 + γ² r_2 + … + γ^T r_T = Σ_{t=0}^{T} γ^t r_t    (B.1)

The objective (J(τ)) is the expectation of the returns over many trajectories (see Eq. (B.2)):

J(τ) = E_{τ∼π}[R(τ)] = E_τ[Σ_{t=0}^{T} γ^t r_t]    (B.2)

The agent can learn the optimal policy based on the action-value function (Q^π(s, a)) through estimating the expected return (E_τ[R(τ)]). The optimal action-value function is the maximum of the expected return after performing an action (a) at the current state (s) (Graesser and Keng, 2019).

Q*(s, a) = max_π E_{s_0=s, a_0=a, τ∼π}[Σ_{t=0}^{T} γ^t r_t]    (B.3)

The Bellman equation defines a recursive relationship to define the Q-function, where s′ and a′ are the subsequent state and action following s and a (Graesser and Keng, 2019).

Q*(s, a) = max E_{s_0=s, a_0=a, τ∼π}[r(s, a) + γ E_{a′∼π}[Q^π(s′, a′)]]    (B.4)

If the target policy is deterministic, we can describe this policy as a function μ : S → A and avoid the inner expectation (Lillicrap et al., 2015):

Q*(s, a) = max E_{s_0=s, a_0=a, τ∼π}[r(s, a) + γ Q^μ(s′, μ(s′))]    (B.5)

A commonly used policy in Q-learning is the greedy policy:

μ(s) = argmax_a Q(s, a)    (B.6)

In practice, the Q-function is represented with a non-linear function approximator due to its size issues. We can use a neural network with weights (θ) to approximate the Q value for all possible actions in each state (Q(s, a; θ) ≈ Q*(s, a)). The neural network representing the Q-function is called a Q-network. The weights of the Q-network are updated iteratively by minimizing the following loss function (Ma et al., 2019; Lillicrap et al., 2015):

Loss = (r + γ max_{a′} Q(s′, a′; θ) − Q(s, a; θ))²    (B.7)

In numerous cases, continuous actions are inevitable, and policy gradient methods help to realize them. In policy gradient methods, the policy is directly modified through its parameters to maximize the expected reward.

max_θ J(π_θ) = E_{τ∼π_θ}[R(τ)]    (B.8)

Gradient ascent is used to update the policy parameters based on the following equation:

θ ← θ + α ∇_θ J(π_θ)    (B.9)

where α is the learning rate. The term ∇_θ J(π_θ) is known as the policy gradient (Graesser and Keng, 2019).

A widespread RL algorithm family for continuous action problems is the actor-critic method. All the actor-critic algorithms have two components learning together. One member is the actor, which learns the parameterized policy based on the policy gradient theorem to define the desired action. The other member is the critic, which learns the action-value function (Q(s, a)) to evaluate the state-action pairs and provide a reinforcing signal to the actor. The actor is updated based on the following formula, derived by Silver et al. and presented as the Deterministic Policy Gradient (DPG) method (Silver et al., 2014):

∇_{θ^μ} J = E_{s_t}[∇_{θ^μ} Q(s, a | θ^Q)|_{s=s_t, a=μ(s_t|θ^μ)}]    (B.10)

where μ(s | θ^μ) is the current policy, deterministically mapping states to a specific action. The critic (Q(s, a)) learns using the Bellman equation.
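The Q-network loss of Eq. (B.7), combined with the deterministic policy gradient of Eq. (B.10) and the soft target updates of Eq. (1), gives the DDPG update used in this work. A compact PyTorch sketch is shown below; it assumes actor/critic modules and their target copies such as those sketched in Section 2.3, and pre-formed minibatch tensors, so all variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, batch, gamma=0.99):
    """One DDPG learning step: critic regression target (B.7-style) and DPG actor update (B.10)."""
    s, a, r, s_next = batch                        # minibatch tensors, r shaped [batch, 1]

    # Critic: y = r + gamma * Q'(s', mu'(s')), then minimize (y - Q(s, a))^2
    with torch.no_grad():
        y = r + gamma * critic_target(s_next, actor_target(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend the sampled policy gradient, i.e. maximize Q(s, mu(s))
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```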
CRediT authorship contribution statement

Ádám Sass: Conceptualization, Methodology, Software, Formal analysis, Investigation, Writing – original draft, Writing – review & editing, Visualization. Alex Kummer: Conceptualization, Methodology, Software, Writing – review & editing, Supervision. János Abonyi: Conceptualization, Methodology, Writing – review & editing, Supervision.

References

Bai, W., Hao, L., Guo, Z., Liu, Y., Wang, R., Wei, H., 2017. A new criterion to identify safe operating conditions for isoperibolic homogeneous semi-batch reactions. Chem. Eng. J. 308, 8–17.
Chen, K., Wang, H., Valverde-Pérez, B., Zhai, S., Vezzaro, L., Wang, A., 2021. Optimal control towards sustainable wastewater treatment plants based on multi-agent reinforcement learning. Chemosphere 279, 130498.
Clarke-Pringle, T.L., MacGregor, J.F., 1998. Optimization of molecular-weight distribution using batch-to-batch adjustments. Ind. Eng. Chem. Res. 37 (9), 3660–3669.
Copelli, S., Torretta, V., Pasturenzi, C., Derudi, M., Cattaneo, C.S., Rota, R., 2014. On the divergence criterion for runaway detection: application to complex controlled systems. J. Loss Prev. Process Ind. 28, 92–100.
Findeisen, R., Allgöwer, F., Biegler, L.T., 2007. Assessment and Future Directions of Nonlinear Model Predictive Control, vol. 358. Springer.
Graesser, L., Keng, W.L., 2019. Foundations of Deep Reinforcement Learning: Theory and Practice in Python. Addison-Wesley Professional.
Guo, Z., Chen, L., Chen, W., 2017. Development of adiabatic criterion for runaway detection and safe operating condition designing in semibatch reactors. Ind. Eng. Chem. Res. 56 (50), 14771–14780.
Hall, B., 2018. Nonlinear Model Predictive Control and Dynamic Real-Time Optimization of Semi-Batch Reactors: A Case Study of Expandable Polystyrene Production. NTNU Master's thesis.
Kähm, W., Vassiliadis, V.S., 2018. Stability criterion for the intensification of batch processes with model predictive control. Chem. Eng. Res. Des. 138, 292–313.
Kähm, W., Vassiliadis, V.S., 2018. Thermal stability criterion integrated in model predictive control for batch reactors. Chem. Eng. Sci. 188, 192–207.
Kanavalau, A., Masters, R., Kähm, W., Vassiliadis, V.S., 2019. Robust thermal stability for batch process intensification with model predictive control. Comput. Chem. Eng. 130, 106574.
Kummer, A., Nagy, L., Varga, T., 2020. NMPC-based control scheme for a semi-batch reactor under parameter uncertainty. Comput. Chem. Eng. 141, 106998.
Kummer, A., Varga, T., Nagy, L., 2020. Semi-batch reactor control with NMPC avoiding thermal runaway. Comput. Chem. Eng. 134, 106694.
Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D., 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
Ma, Y., Zhu, W., Benton, M.G., Romagnoli, J., 2019. Continuous control of a polymerization system with deep reinforcement learning. J. Process Control 75, 40–47.
Machalek, D., Quah, T., Powell, K.M., 2020. Dynamic economic optimization of a continuously stirred tank reactor using reinforcement learning. In: 2020 American Control Conference (ACC). IEEE, pp. 2955–2960.
Ni, L., Jiang, J., Mannan, M.S., Mebarki, A., Zhang, M., Pan, X., Pan, Y., 2017. Thermal runaway risk of semibatch processes: esterification reaction with autocatalytic behavior. Ind. Eng. Chem. Res. 56 (6), 1534–1542.

Ni, L., Mebarki, A., Jiang, J., Zhang, M., Dou, Z., 2016. Semi-batch reactors: thermal runaway risk. J. Loss Prev. Process Ind. 43, 559–566.
Nian, R., Liu, J., Huang, B., 2020. A review on reinforcement learning: introduction and applications in industrial process control. Comput. Chem. Eng. 139, 106886.
Pan, E., Petsagkourakis, P., Mowbray, M., Zhang, D., del Rio-Chanona, E.A., 2021. Constrained model-free reinforcement learning for process optimization. Comput. Chem. Eng. 154, 107462.
Paul, S., Kurin, V., Whiteson, S., 2019. Fast efficient hyperparameter tuning for policy gradient methods. Adv. Neural Inf. Process. Syst. 32, 4618–4628.
Powell, K.M., Machalek, D., Quah, T., 2020. Real-time optimization using reinforcement learning. Comput. Chem. Eng. 143, 107077.
Rossi, F., Copelli, S., Colombo, A., Pirola, C., Manenti, F., 2015. Online model-based optimization and control for the combined optimal operation and runaway prediction and prevention in (fed-)batch systems. Chem. Eng. Sci. 138, 760–771.
Rossi, F., Manenti, F., Buzzi-Ferraris, G., Reklaitis, G., 2017. Combined dynamic optimization, optimal control and online runaway detection & prevention under uncertainty. Chem. Eng. Trans. 57, 973–978.
Rossi, F., Reklaitis, G., Manenti, F., Buzzi-Ferraris, G., 2016. Multi-scenario robust online optimization and control of fed-batch systems via dynamic model-based scenario selection. AIChE J. 62 (9), 3264–3284.
Seki, H., Ogawa, M., Ooyama, S., Akamatsu, K., Ohshima, M., Yang, W., 2001. Industrial application of a nonlinear model predictive control to polymerization reactors. Control Eng. Pract. 9 (8), 819–828.
Shin, J., Badgwell, T.A., Liu, K.-H., Lee, J.H., 2019. Reinforcement learning – overview of recent progress and implications for process control. Comput. Chem. Eng. 127, 282–294.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., Riedmiller, M., 2014. Deterministic policy gradient algorithms. In: International Conference on Machine Learning. PMLR, pp. 387–395.
Singh, V., Kodamana, H., 2020. Reinforcement learning based control of batch polymerisation processes. IFAC-PapersOnLine 53 (1), 667–672.
Srinivasan, B., Primus, C.J., Bonvin, D., Ricker, N.L., 2001. Run-to-run optimization via control of generalized constraints. Control Eng. Pract. 9 (8), 911–919.
Syafiie, S., Tadeo, F., Martinez, E., 2007. Model-free learning control of neutralization processes using reinforcement learning. Eng. Appl. Artif. Intell. 20 (6), 767–782.
Westerterp, K.R., Molga, E.J., 2004. No more runaways in fine chemical reactors. Ind. Eng. Chem. Res. 43 (16), 4585–4594.
Westerterp, K.R., Molga, E.J., 2006. Safety and runaway prevention in batch and semibatch reactors – A review. Chem. Eng. Res. Des. 84 (7), 543–552.
Xiong, Z., Zhang, J., 2005. A batch-to-batch iterative optimal control strategy based on recurrent neural network models. J. Process Control 15 (1), 11–21.
Xu, X., Xie, H., Shi, J., 2020. Iterative learning control (ILC) guided reinforcement learning control (RLC) scheme for batch processes. In: 2020 IEEE 9th Data Driven Control and Learning Systems Conference (DDCLS). IEEE, pp. 241–246.
Yoo, H., Kim, B., Kim, J.W., Lee, J.H., 2021. Reinforcement learning based optimal control of batch processes using Monte–Carlo deep deterministic policy gradient with phase segmentation. Comput. Chem. Eng. 144, 107133.
Yoo, H., Kim, B., Lee, J.H., 2019. Deep deterministic policy gradient algorithm for batch process control. Foundations of Process Analytics and Machine Learning: FOPAM.
Zhang, K., Yang, Z., Basar, T., 2018. Networked multi-agent reinforcement learning in continuous spaces. In: 2018 IEEE Conference on Decision and Control (CDC). IEEE, pp. 2771–2776.
