
Data Driven Control of Interacting Two Tank Hybrid System using Deep Reinforcement Learning


David Mathew Jones
M.Tech. Scholar, Dept. of Electrical Engineering
National Institute of Technology Calicut
Kerala, India
davidmathewjones@ieee.org

S. Kanagalakshmi
Assistant Professor, Dept. of Electrical Engineering
National Institute of Technology Calicut
Kerala, India
kanagalakshmi@nitc.ac.in

Abstract—This paper investigates the use of a deep neural network based Reinforcement Learning (RL) algorithm applied to a non-linear system of the kind typically encountered in the process control industry. It aims to augment the large amounts of data that we possess with the classical theory of dynamic systems control. Control systems represent a non-linear optimization problem, and machine learning helps to achieve non-linear optimization using large amounts of data.
This paper demonstrates the use of Deep Deterministic Policy Gradient (DDPG), an actor-critic reinforcement learning method, applied to an Interacting Two Tank Hybrid System (ITTHS).

Index Terms—reinforcement learning, deep neural network, actor-critic method, interacting water tank system.

I. INTRODUCTION

We are in an era of unfathomable computational intelligence, which is taking the world by storm. The huge success of Artificial Intelligence in using data to achieve human-like performance has been motivating researchers to implement such algorithms in industrial applications ranging from technology to healthcare. Currently, most research revolves around "Narrow AI", which is highly task specific, and we have a long way to go before we can crack the code to "General AI". Within AI, the most prominent area of focus is machine learning, a scientific field that analyses statistical models and develops algorithms, in turn giving machines the ability to learn tasks on their own and thereby eliminating rule-based programming.
Machine learning is composed of supervised learning, unsupervised learning, and reinforcement learning; among these, supervised learning is currently the most widely employed. However, supervised learning can never outperform the supervisor or subject matter expert, since the agent only mimics the labelling behaviour of the supervisor, and the same limitation holds for unsupervised learning. Reinforcement Learning (RL) tries to break through this performance barrier by learning the best mapping from observations to actions (the policy) through trial-and-error search guided by a scalar reward signal. It also accounts for the effect of the actions taken on subsequent rewards. These unique features set RL apart from other methods of learning, pushing the current boundaries of knowledge.
A fundamental non-linear problem across the process industries is liquid flow control and liquid level control in storage tanks and reaction vessels. Solving non-linear problems is generally tricky, as they are less well understood than linear ones. Although major industries prefer conventional tank systems for their processing, conical and spherical tanks prove to be advantageous due to their lower cost, efficient usage of material, ease of cleaning, and improved product quality. However, many non-linearities are introduced when such vessels are used in the process industry. Modern computerized systems allow us to control such non-linear processes in real time with ease.
This work aims to design a controller for the Interacting Two Tank Hybrid System, a Multi Input Multi Output process. Here we have two controlled variables (the height of each tank) and two manipulated variables (the inflow from the pumps). Conventionally, such problems are handled using traditional control theory by designing a suitable PID, MPC, or non-linear MPC controller. The process starts by writing down the equations describing the dynamics of the system, which are generally non-linear, and then linearizing them. Linear state-space control theory can then be applied, for example a standard filter with a linear quadratic regulator or robust control. This works as long as we know how to write the equations describing the physical dynamics of the system. However, in real-world industrial processes, such systems are massively non-linear. They have unknown dynamics, and their dimensionality is higher, i.e., many degrees of freedom may be needed to describe the system state X. This high dimensionality poses difficulties for system simulation. Furthermore, physical systems have limited measurements and limited actuation.
A tabular comparison between PID, Model Predictive Control, and Reinforcement Learning is shown in Table I. These fundamental challenges motivate the use of modern data-driven or machine learning control. The method proposed here is called data-driven control or machine learning control, a modern mathematical approach in which machine learning is applied to control theory.
It aims to augment the large amount of data we possess with the classical theory of dynamic systems control. Control systems represent a non-linear optimization problem, and machine learning helps achieve non-linear optimization using large amounts of data.

TABLE I
COMPARISON BETWEEN PID, MPC AND RL

Attributes                      PID            MPC            RL
Controller Capability           Low            High           High
Training Computation Cost       Low            Low            High
Deployment Computational Cost   Low            High           Medium
Sensitivity to Tuning           High           High           Low
Dependency on Model             At all times   At all times   Only during training

There are many ways in which machine learning could be implemented. Firstly, it can be used to identify the system equations with data-driven models and then apply classical control, model predictive control, or optimal control; this approach has gained the most attention. Secondly, it can be used to learn suitable controllers directly from data, i.e., explore the system, try different control strategies, and develop the most effective controller. The last way machine learning can help is in sensor and actuator placement schemes that achieve nearly optimal performance while remaining fast and easy to implement. The most effective approach among these three would be to pursue the first two in parallel, obtaining both the best representation model of the system and the most efficient controller.
Here, for the Interacting Two Tank Hybrid System (ITTHS), a method is proposed that uses what is already known about the system, i.e., the system model, and then uses data-driven tools to decide on the best controller for the application. The ITTHS was chosen because of the characteristics discussed above: conical and spherical tank systems have varying cross-sectional areas, their process variables (liquid levels) are constantly changing, and they exhibit non-linear dynamic behaviour, making control of this system a complex problem.
J. Martinez-Piazuelo et al., in their paper [1], deployed an algorithm which eases the training of the control policy. It is named multi-actor-critic reinforcement learning and, unlike the traditional actor-critic algorithm, it eases the value function learning task by distributing it across multiple neural networks. The algorithm utilises the Deep Deterministic Policy Gradient (DDPG) and has been demonstrated on the quadruple tank process and the vertical tank process. The actor neural networks comprise two fully connected hidden layers with 20 neurons and tanh activation functions. The actors' output layers have two neurons representing the two components of the system's input, with a tanh activation that keeps the control actions within [-1, 1]. The critic networks, on the other hand, are defined according to two distinct schemes that aim to provide a fair comparison between single-critic and multi-critic approaches. The first scheme, used for the multi-critic approaches, employs critics with two hidden layers of 40 neurons; in the second scheme, the size of the critic's hidden layers is doubled to 80 neurons. The hidden layers in both cases have tanh activation functions, and the output layers have a single neuron with linear activation. It is concluded that the proposed multi-critic schemes do not alter the structure of the actor network and, as a result, do not increase controller complexity when compared to the standard single-critic method.
D. Machalek et al. investigate three RL algorithms in their paper [2]: deep deterministic policy gradient (DDPG), twin-delayed DDPG (TD3), and proximal policy optimization (PPO). They are graded on how well they converge to a stable solution and dynamically optimise the economics of a Continuous Stirred Tank Reactor (CSTR). DDPG is the original deep policy gradient method, and TD3 extends it by including a second critic that operates a few steps behind the first. The purpose of having two critics is to avoid overestimation, because each state's value is taken as the lower of the two critics' estimates. PPO improves stability by limiting the amount of policy change that can occur during each training round.
M. Yiming et al. performed a simulation experiment [3] for a single-tank system on the MATLAB platform with the inclusion of disturbances. It was based on the DQN algorithm for a feedforward-feedback liquid level controller. The Deep Q-Learning Network (DQN) is a neural network that combines Q-learning and deep learning to effectively approximate the Q-function. In the experiment, when compared to a PID feedback control system with the same parameters, the feedforward-feedback controller based on the DQN algorithm effectively overcame disturbances, improved the control system's performance, and demonstrated the algorithm's effectiveness.
Y. Zhang et al., in their study [4], showed how reference tracking as well as disturbance rejection (widely implemented as a two-degree-of-freedom controller) could be achieved by means of reinforcement learning techniques. The method was validated on a Continuous Stirred Tank Reactor (CSTR) by adopting Q-learning within Adaptive Dynamic Programming (ADP) and using a Kalman filter residual generator.
It is important to mention that data-driven techniques for control have been studied for much longer. They have been employed in linear system identification, such as Balanced Proper Orthogonal Decomposition (BPOD) [9], the Eigensystem Realisation Algorithm (ERA), and Dynamic Mode Decomposition (DMD), and in non-linear system identification, such as Nonlinear AutoRegressive Moving Average with eXogenous input (NARMAX), Sparse Identification of Nonlinear Dynamics (SINDy) [7], and genetic programming, as well as for the study of dynamics from measurements. Most of these methods leverage machine learning based regression; Koopman theory [6] has also been investigated. For the formulation of control laws, apart from reinforcement learning, there are SINDy with control and DMD with control [8].
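As a brief illustration of the regression character shared by these identification methods, the following sketch shows exact Dynamic Mode Decomposition as a single least-squares fit in Python/NumPy. The snapshot data and dimensions are illustrative placeholders, not data from this work, and the usual SVD rank truncation is omitted for brevity.

```python
import numpy as np

# Minimal sketch of exact Dynamic Mode Decomposition (DMD):
# given snapshots x_k of a dynamical system, fit a linear operator A
# such that x_{k+1} ~= A x_k, then inspect its eigenvalues (the DMD spectrum).

rng = np.random.default_rng(0)

# Illustrative snapshot matrix: 10 states, 100 time samples (placeholder data).
X = rng.standard_normal((10, 100))

X1, X2 = X[:, :-1], X[:, 1:]         # snapshot pairs (x_k, x_{k+1})
A = X2 @ np.linalg.pinv(X1)          # least-squares fit of the linear operator

eigvals, modes = np.linalg.eig(A)    # DMD eigenvalues and modes
print("dominant eigenvalue magnitude:", np.abs(eigvals).max())
```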
II. REINFORCEMENT LEARNING

Reinforcement learning (RL) is a branch of machine learning that deals with learning control strategies from the experience of interacting with a complex environment. In simple terms, it reinforces good behaviour with rewards. It has the potential to solve complex tasks from high-dimensional, unprocessed sensory inputs in areas such as video games, autonomous driving, natural language, and industrial automation. In contrast to other learning methods, RL learns from a dynamic environment, i.e., data that changes due to external conditions. One of the most famous RL implementations is AlphaGo, developed by DeepMind to play the board game Go, which defeated the world champion Lee Se-dol in 2016.
The components of the RL framework shown in Fig. 1 comprise the Agent, which interacts with the Environment based on a given Policy that is updated by an algorithm. The Agent measures its current state in the Environment, takes an action based on its previous experience, and occasionally receives a reward based on the action taken.

Fig. 1. Reinforcement Learning System

The challenge is to design the policy, i.e., which action to take given the current observation so as to maximize the chance of a positive future reward. From a control systems perspective, this policy is deterministic rather than probabilistic. Therefore, the goal of the RL framework is to optimize the policy given the reward at each state. Optimization techniques include dynamic programming, Monte Carlo methods, temporal-difference learning, Bellman optimization, and policy iteration. The action taken, whether good or bad, is evaluated using a value function. To guide the agent better, we use Q-learning, whose Q-function depends on both the state and the action taken and measures the quality of being in that state and taking that action, assuming optimal behaviour thereafter.

A. Reinforcement Learning Workflow

There are five steps involved in the training of an RL agent:
1) Creation of the environment: Define the context within which the agent can learn, including the interface between agent and environment. The environment may be either a simulation model or an actual physical system. Model identification is preferred for accelerated pre-training, but the model must be accurate over the RL agent's operating range.
2) Reward shaping: Describe the reward signal the agent uses to assess its performance relative to the task goals, as well as how this signal is derived from the environment. It may take a few iterations to get the reward shaping right.
3) Creation of the agent: Because the agent is made up of the policy and the training algorithm, we must
   • choose a method to represent the policy, such as neural networks or lookup tables, and
   • select a training algorithm appropriate for complex problems and large state-action spaces.
4) Training and validation of the agent: Configure training options (such as stopping criteria) and train the agent to fine-tune the policy. Simulation is the simplest way to validate a trained policy.
5) Deploying the policy: Deploy the trained policy representation using generated C/C++ or CUDA code, for example. At this point, there is no need to worry about agents or training algorithms, because the policy is a stand-alone decision-making system.

III. DEEP DETERMINISTIC POLICY GRADIENT

In deep reinforcement learning, the policy of the agent is implemented by means of deep neural networks. The idea behind using neural network architectures for reinforcement learning is that we want the reward signals obtained to strengthen the connections that lead to a good policy. Moreover, these deep neural networks can represent complex functions when given ample amounts of data.
The Deep Deterministic Policy Gradient (DDPG) [5], formulated by Lillicrap et al., was the first RL algorithm to be effective in solving numerous continuous control tasks of high dimensionality. It builds on the Deterministic Policy Gradient (DPG) algorithm by Silver et al., with insights from the success of the Deep Q-Network (DQN) by Mnih et al. DDPG is a model-free, off-policy, actor-critic reinforcement learning method which can operate on continuous action and observation spaces. It is better suited than DQN to most real-world physical control tasks, which have real-valued, high-dimensional action spaces. The DPG, also known as the actor, maps states or observations to actions, whereas the DQN, also known as the critic, estimates action-values in order to update the DPG. The combination of the two algorithms has been shown to reduce the problems of high bias and high variance. In DDPG, the policy is trained by gradient ascent on the critic's output, and deep neural network function approximators are used to estimate the action-value function.
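To make the actor-critic roles concrete, the short sketch below isolates the two numerical operations at the core of DDPG, the bootstrapped critic target and the soft (Polyak) target-network update, using plain NumPy arrays as stand-ins for network parameters. It is a simplified illustration of the update rules given in Algorithm 1 below, not the implementation used in this work.

```python
import numpy as np

def critic_target(reward, next_q_value, gamma=0.99, is_terminal=False):
    """Bootstrapped target y = r + gamma * Q'(s', mu'(s')) for one sample.

    next_q_value is the target critic's estimate at the target actor's action;
    terminal states use the reward alone.
    """
    return reward if is_terminal else reward + gamma * next_q_value

def soft_update(target_params, params, tau=0.001):
    """Polyak averaging: theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, params)]

# Illustrative usage with toy parameter vectors (not the actual networks).
theta = [np.array([0.5, -1.2]), np.array([0.1])]
theta_target = [np.zeros(2), np.zeros(1)]
theta_target = soft_update(theta_target, theta, tau=0.001)
y = critic_target(reward=1.0, next_q_value=3.2, gamma=0.99)
```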
Here the network is trained off-policy, meaning that samples from a replay buffer are utilised so as to reduce the correlation between samples. Moreover, the training is carried out using a target Q network to give consistent targets, along with batch normalisation.

Algorithm 1: DDPG
  Initialise the critic Q(S, A) with random parameters \theta^{Q} and initialise the target critic with the same values: \theta^{Q'} = \theta^{Q}
  Initialise the actor \mu(S) with random parameters \theta^{\mu} and initialise the target actor with the same values: \theta^{\mu'} = \theta^{\mu}
  Initialise the replay buffer R
  for episode = 1, M do
    Initialise a random process N for action exploration;
    Receive an initial observation S;
    for t = 1, T do
      Select action A = \mu(S) + N;
      Execute action A and observe the reward R and next observation S';
      Store the experience (S, A, R, S') in the experience buffer;
      Sample a random mini-batch of M experiences (S_i, A_i, R_i, S'_i) from the experience buffer;
      If S_i is a terminal state, set the value function target y_i to R_i, else
        y_i = R_i + \gamma Q'(S'_i, \mu'(S'_i | \theta^{\mu'}) | \theta^{Q'})
      Update the critic parameters by minimizing the loss L across all sampled experiences:
        L = \frac{1}{M} \sum_{i=1}^{M} ( y_i - Q(S_i, A_i | \theta^{Q}) )^2
      Update the actor parameters using the sampled policy gradient:
        \nabla_{\theta^{\mu}} J \approx \frac{1}{M} \sum_{i=1}^{M} G_{ai} G_{\mu i}
        G_{ai} = \nabla_{A} Q(S_i, A | \theta^{Q}), where A = \mu(S_i | \theta^{\mu})
        G_{\mu i} = \nabla_{\theta^{\mu}} \mu(S_i | \theta^{\mu})
      Update the target networks:
        \theta^{Q'} = \tau \theta^{Q} + (1 - \tau) \theta^{Q'}
        \theta^{\mu'} = \tau \theta^{\mu} + (1 - \tau) \theta^{\mu'}
    end
  end

IV. APPLICATION TO MULTI-TANK HYBRID WATER SYSTEM

In this work, an interacting hybrid multi-tank water system is used as the testbed for the demonstration of the DDPG-based RL controller. It is treated as a Multi Input Multi Output (MIMO) system which takes into account the non-linear coupled dynamics. The system is illustrated in Fig. 2, and the goal is to control the heights of tank 1 and tank 2 with reference to given desired signals. The agent is trained on a 4-core i5 processor and an Nvidia 1650 GPU (1024 CUDA cores, no tensor cores).
The system parameters are inspired by [10] and are summarised in Table II.

TABLE II
PARAMETERS OF THE ITTHS

Parameter        Description                                          Value
h1, h2           Liquid level of the conical and spherical tank       cm
                 at any particular instant of time
H1               Total height of tank 1                               70 cm
H2               Total height of tank 2                               50 cm
R1               Maximum radius of tank 1                             30 cm
R2               Maximum radius of tank 2                             25 cm
F_IN1, F_IN2     Maximum inflow of tank 1 & 2                         200 cm^3/sec
CV1, CV2         Coefficient of control valve 1 & 2                   0 to 1
MV12             Coefficient of the interacting manual valve          0 to 1
α1, α2           CSA of the output pipe of tank 1 & 2                 1.2272 cm^2
α12              CSA of the interacting pipe                          1.2272 cm^2
g                Acceleration due to gravity                          981 cm/sec^2

Fig. 2. Schematics of the ITTHS

The Interacting Two Tank Hybrid System (ITTHS) comprises a conical tank and a spherical tank, named tank 1 and tank 2 respectively. These tanks can hold liquid up to heights of 70 cm (H1) and 50 cm (H2), and their maximum radii are 30 cm (R1) and 25 cm (R2) respectively. A manually operated restriction (HV12) interconnects the two tanks. We denote F_IN1 and F_IN2 as the two input flow rates, and F_OUT1 and F_OUT2 as the output flow rates of tank 1 and tank 2, which flow through the restriction HV12 to the drain.
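The level-dependent cross-sections of the two vessels are what make the dynamics non-linear. The short sketch below, written with the nominal dimensions of Table II, shows how the liquid surface area varies with level in each tank; it is an illustrative helper rather than part of the controller.

```python
import numpy as np

H1, R1 = 70.0, 30.0   # conical tank: total height and maximum radius (cm)
H2, R2 = 50.0, 25.0   # spherical tank: total height and maximum radius (cm)

def conical_area(h1):
    """Liquid surface area of the conical tank at level h1 (cm^2).

    The radius grows linearly with level: r = R1 * h1 / H1.
    """
    return np.pi * (R1 * h1 / H1) ** 2

def spherical_area(h2):
    """Liquid surface area of the spherical tank at level h2 (cm^2): pi*(2*R2*h2 - h2^2)."""
    return np.pi * (2.0 * R2 * h2 - h2 ** 2)

for h in (10.0, 25.0, 40.0):
    print(h, conical_area(h), spherical_area(h))
```

These two expressions are exactly the level-dependent denominators that appear in the mass-balance models derived next.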
Fig. 3. Simulink Model of RL-ITTHS

The liquid levels of tank 1 (h1) and tank 2 (h2) are to be controlled by varying the inlet flows of the tanks. Magnetic flow transmitters measure the input flows F_IN1 of tank 1 and F_IN2 of tank 2 and transmit them as current signals to the data acquisition system. To control the motion of the valves, current signals from the controller are transmitted to the control valves so as to obtain the required flow and thereby maintain the required set-point level.
In the MIMO process of the ITTHS, the manipulated variables are the inflows F_IN1 and F_IN2, and the measured variables are the respective tank levels h1 and h2. From the paper [10] by Balaram Naik et al., we obtain the mathematical model of the system from first principles. This identified model can be used directly for the proposed work.
The mass balance equation was used to develop the mathematical models of the conical and spherical tanks. If D1 is the maximum diameter and H1 the maximum height of the conical tank, then its level h1 at any instant is given by:

\frac{dh_1}{dt} = \frac{1}{\left(\frac{\pi D_1^2}{4}\right)\left(\frac{h_1}{H_1}\right)^2}\left[u_1 - c_{v1}\,\alpha_1\sqrt{2 g h_1}\right]    (1)

Similarly, for the spherical tank, the mathematical model is:

\frac{dh_2}{dt} = \frac{1}{\pi\left(2 R_2 h_2 - h_2^2\right)}\left[u_2 - c_{v2}\,\alpha_2\sqrt{2 g h_2}\right]    (2)

The mathematical model of the Interacting Two Tank Hybrid System has also been developed from first principles as:

\frac{dh_1}{dt} = \frac{1}{\left(\frac{\pi D_1^2}{4}\right)\left(\frac{h_1}{H_1}\right)^2}\left[u_1 - c_{v1}\,\alpha_1\sqrt{2 g h_1} - \mathrm{sign}(h_1 - h_2)\,MV_{12}\,\alpha_{12}\sqrt{\left|2 g (h_1 - h_2)\right|}\right]    (3)

\frac{dh_2}{dt} = \frac{1}{\pi\left(2 R_2 h_2 - h_2^2\right)}\left[u_2 - c_{v2}\,\alpha_2\sqrt{2 g h_2} + \mathrm{sign}(h_1 - h_2)\,MV_{12}\,\alpha_{12}\sqrt{\left|2 g (h_1 - h_2)\right|}\right]    (4)

We have made use of these equations to model the system in Simulink. The RL agent will be trained on this model first, before validation on the physical plant. The complete Simulink model of the RL-ITTHS is shown in Fig. 3. The RL agent has three inputs, namely the Observation, the Reward, and the isdone signal. The observation takes in the present height of both tanks, the errors with respect to the set points, and the integral errors. The Observation block is shown in Fig. 4.

Fig. 4. Agent Observation Block

The reward block is formalised taking into consideration that the error must lie within a +/-0.1 band.
The isdone block makes sure that the predicted levels always stay within the physical constraints of the equipment; here the constraints are the tank heights of 70 cm and 50 cm respectively.
The Reward block and the isdone block are shown in Fig. 5 and Fig. 6.
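For illustration, a compact sketch of the simulation environment implied by equations (3) and (4), together with the reward and isdone logic just described, is given below. The Euler integration step, the specific reward values, and the small epsilon guards are assumptions made for this sketch; the actual work uses the Simulink model of Fig. 3 with the blocks of Figs. 4-6.

```python
import numpy as np

# Nominal parameters from Table II (cm, cm^2, cm^3/s).
H1, R1, H2, R2 = 70.0, 30.0, 50.0, 25.0
ALPHA1 = ALPHA2 = ALPHA12 = 1.2272
G = 981.0

def itths_derivatives(h1, h2, u1, u2, cv1=1.0, cv2=1.0, mv12=0.5):
    """Right-hand sides of equations (3) and (4) for the interacting tanks."""
    inter = np.sign(h1 - h2) * mv12 * ALPHA12 * np.sqrt(abs(2.0 * G * (h1 - h2)))
    # Level-dependent cross-sections; small epsilons avoid division by zero
    # when a tank is (nearly) empty.
    a1 = max(np.pi * (R1 * h1 / H1) ** 2, 1e-3)
    a2 = max(np.pi * (2.0 * R2 * h2 - h2 ** 2), 1e-3)
    dh1 = (u1 - cv1 * ALPHA1 * np.sqrt(2.0 * G * max(h1, 0.0)) - inter) / a1
    dh2 = (u2 - cv2 * ALPHA2 * np.sqrt(2.0 * G * max(h2, 0.0)) + inter) / a2
    return dh1, dh2

def step(h1, h2, u1, u2, ref1, ref2, dt=0.1):
    """One Euler step plus the reward and isdone signals seen by the agent."""
    dh1, dh2 = itths_derivatives(h1, h2, u1, u2)
    h1, h2 = h1 + dt * dh1, h2 + dt * dh2
    e1, e2 = ref1 - h1, ref2 - h2
    # Assumed shaping: positive reward inside the +/-0.1 cm band, penalty outside.
    reward = 1.0 if (abs(e1) <= 0.1 and abs(e2) <= 0.1) else -0.1
    # Terminate when a predicted level violates the physical tank limits.
    isdone = not (0.0 <= h1 <= H1 and 0.0 <= h2 <= H2)
    return (h1, h2), reward, isdone
```

Resetting such an environment with randomised initial levels, set points, and interaction coefficient, as described next, exposes the agent to the full operating range during training.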
It is also worth mentioning that during training, the initial height and the desired level of each tank are modified using a random function, so that the agent is trained over the complete height of each tank.

Fig. 5. Agent Reward Block

Fig. 6. Agent Stop Block

For the critic network, the observation path has a fully connected layer of 50 neurons and a ReLU activation function, followed by a second fully connected layer of 25 neurons. The action path of the critic network has a fully connected layer of 25 neurons. The output of the critic is the sum of the two sub-networks, passed through a ReLU activation function followed by a fully connected layer. For the actor network, we have a fully connected layer of 3 neurons followed by a tanh activation function and a final fully connected layer. Learning rates of 10^-4 for the actor and 10^-3 for the critic have been used. For updating the target actor and critic parameters, a smoothing factor of 0.001 is employed. We train with mini-batches of 64 and a replay buffer of length 10^6. An Ornstein-Uhlenbeck exploration process with variance 0.3 and a decay rate of 10^-5 is used. Training is run for at most 5000 episodes; however, it can be stopped automatically once the agent reaches an average cumulative reward greater than 800 over 20 consecutive episodes. (These settings are collected in a short configuration sketch after the training results below.)

V. SIMULATION RESULTS

Although the proposed algorithm for the system was designed, the training could not be completed due to computational complexity. Typical machine learning algorithms rely heavily on the GPU to accelerate learning and only minimally on the CPU. Deep reinforcement learning relies on both the GPU and the CPU. In our case, the reinforcement learning algorithm is handled by the GPU, whereas the simulation environment, the reward and observation calculation, the replay buffer, part of the policy network, and part of the rendering are handled by the CPU. GPU acceleration using tensor cores would help the neural networks train faster, and a CPU with more cores would accommodate the other parts of the simulation environment.
The system under consideration has been designed keeping in mind a wide range of inputs (desired water level and initial water level). Also, since the system is interacting, the coefficient of interaction, ranging from 0 to 1, has also been kept as a training input, which is randomly changed throughout the course of the training. Fig. 7 shows the completed training for just the conical tank; however, for the performance study, the system including both tanks and the complete range of operating conditions has to be considered. When simulated, as shown in Fig. 8, the training was only able to reach about 60 episodes due to computational constraints.

Fig. 7. RL Agent Training - Conical Tank System

Fig. 8. RL Agent Training - Overall ITTHS
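For reference, the DDPG training configuration described in the previous section can be collected into a single structure. The sketch below simply restates those values; the field names are illustrative and not tied to any particular toolbox.

```python
# Training configuration of the DDPG agent as described above
# (values restated from the text; field names are illustrative).
ddpg_config = {
    "critic": {
        "observation_path": [50, 25],   # fully connected layers with ReLU
        "action_path": [25],
        "learning_rate": 1e-3,
    },
    "actor": {
        "hidden": [3],                  # fully connected layer + tanh
        "learning_rate": 1e-4,
    },
    "target_smoothing_factor": 0.001,   # tau for the soft target update
    "mini_batch_size": 64,
    "replay_buffer_length": int(1e6),
    "exploration": {                    # Ornstein-Uhlenbeck noise
        "variance": 0.3,
        "variance_decay_rate": 1e-5,
    },
    "max_episodes": 5000,
    "stop_training": {                  # early-stop criterion
        "average_reward": 800,
        "window_episodes": 20,
    },
}
```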
Modern tools for end-to-end GPU-accelerated reinforcement learning, such as NVIDIA Isaac Gym, are also currently under study.

VI. CONCLUSION

In this paper, an RL Deep Deterministic Policy Gradient (DDPG) agent based controller was proposed for the Interacting Two Tank Hybrid System and simulated in the MATLAB environment using a system model obtained from first principles. The design of the deep neural networks for the actor and critic was completed. Future scope of this work will focus on employing high-end computation systems so as to complete the training process and to run validation on the physical ITTHS setup.

REFERENCES

[1] J. Martinez-Piazuelo, D. E. Ochoa, N. Quijano and L. F. Giraldo, "A Multi-Critic Reinforcement Learning Method: An Application to Multi-Tank Water Systems," in IEEE Access, vol. 8, pp. 173227-173238, 2020, doi: 10.1109/ACCESS.2020.3025194.
[2] D. Machalek, T. Quah and K. M. Powell, "Dynamic Economic Optimization of a Continuously Stirred Tank Reactor Using Reinforcement Learning," 2020 American Control Conference (ACC), Denver, CO, USA, 2020, pp. 2955-2960, doi: 10.23919/ACC45564.2020.9147706.
[3] M. Yiming, P. Boyu, L. Gongqing, L. Yongwen and Z. Deliang, "Feedforward Feedback Control Based on DQN," 2020 Chinese Control And Decision Conference (CCDC), Hefei, China, 2020, pp. 550-554, doi: 10.1109/CCDC49329.2020.9163990.
[4] Y. Zhang, S. X. Ding, Y. Yang and L. Li, "Data-driven design of two-degree-of-freedom controllers using reinforcement learning techniques," in IET Control Theory & Applications, vol. 9, no. 7, pp. 1011-1021, 2015, doi: 10.1049/iet-cta.2014.0156.
[5] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," CoRR abs/1509.02971, 2015.
[6] E. Kaiser, J. N. Kutz, and S. L. Brunton, "Data-driven discovery of Koopman eigenfunctions for control," 2020, arXiv:1707.01146.
[7] E. Kaiser, J. N. Kutz, and S. L. Brunton, "Sparse identification of nonlinear dynamics for model predictive control in the low-data limit," Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 474, no. 2219, 2018, doi: 10.1098/rspa.2018.0335.
[8] J. L. Proctor, S. L. Brunton, and J. N. Kutz, "Dynamic Mode Decomposition with Control," SIAM Journal on Applied Dynamical Systems, vol. 15, no. 1, pp. 142-161, 2016.
[9] B. Moore, "Principal component analysis in linear systems: Controllability, observability, and model reduction," in IEEE Transactions on Automatic Control, vol. 26, no. 1, pp. 17-32, February 1981, doi: 10.1109/TAC.1981.1102568.
[10] R. B. Balaram Naik and S. Kanagalakshmi, "Mathematical Modelling and Controller Design for Interacting Hybrid Two Tank System (IHTTS)," 2020 Fourth International Conference on Inventive Systems and Control (ICISC), Coimbatore, India, 2020, pp. 297-303, doi: 10.1109/ICISC47916.2020.9171218.
