
Proceedings of the 2022 IEEE International Conference on Mechatronics and Automation (ICMA)
August 7-10, 2022, Guilin, China
DOI: 10.1109/ICMA54519.2022.9856292

Using Q-learning to Automatically Tune Quadcopter PID Controller Online for Fast Altitude Stabilization

Yazeed Alrubyli
Politecnico di Milano, Milan, Italy
yazeednaif.alrubyli@mail.polimi.it

Andrea Bonarini
Politecnico di Milano, Milan, Italy
andrea.bonarini@polimi.it

Abstract—Unmanned Aerial Vehicles (UAVs), and more specifically quadcopters, need to be stable during their flights. Altitude stability is usually achieved by using a PID controller that is built into the flight controller software. Furthermore, the PID controller has gains that need to be tuned to reach optimal altitude stabilization during the quadcopter's flight. For that, control system engineers need to tune those gains by using extensive modeling of the environment, which might change from one environment and condition to another. As quadcopters penetrate more sectors, from the military to the consumer sector, they have been put into complex and challenging environments more than ever before. Hence, intelligent self-stabilizing quadcopters are needed to maneuver through those complex environments and situations. Here we show that by using online reinforcement learning with minimal background knowledge, altitude stability of the quadcopter can be achieved using a model-free approach. We found that by using background knowledge and an activation function like the Sigmoid, altitude stabilization can be achieved faster and with a small memory footprint. In addition, this approach accelerates development by avoiding extensive simulations before applying the PID gains to the real-world quadcopter. Our results demonstrate the possibility of using the trial-and-error approach of reinforcement learning, combined with an activation function and background knowledge, to achieve faster quadcopter altitude stabilization in different environments and conditions.

Index Terms—reinforcement learning (RL), Q-learning, PID tuning, unmanned aerial vehicle (UAV), quadcopter.

I. INTRODUCTION

In recent years, breakthroughs in building intelligent agents that reach superhuman levels have happened thanks to reinforcement learning (RL) [1]. These agents mimic the human behavior of trial and error through simulations that can be executed in parallel to gain as much experience as possible in the shortest time [2]. To name a few, complex games like chess [3], Atari [4], and Go [2] have been learned by agents. In the case of Go, an intelligent agent named AlphaGo beat the world champion [5]. Those achievements show great promise in the field of RL. Teaching an agent is made possible using a rewards-and-punishments approach: the agent is rewarded if the action it has taken is preferable, and punished otherwise. Through enough experience interacting with the environment and receiving rewards, the agent will eventually learn the sequence of actions that yields the highest rewards. This makes RL very attractive for tasks that involve complex environments and behaviors that would be difficult to model otherwise [6].

Fig. 1. Agent-environment relation in a Markov decision process [6].

According to Sutton and Barto [6], learning can be described with the formulation of the finite Markov decision process (MDP). In its simple form, the formulation is an agent that interacts over time with an environment to achieve a goal while maximizing the overall reward, see Figure 1. Moreover, there are three main ways to solve a finite Markov decision problem. First, dynamic programming (DP) methods, which require a model of the environment. On the other hand, Monte Carlo (MC) methods do not need a model of the environment, but they do not respond well to incremental, step-by-step computation. Finally, temporal-difference (TD) methods can be both model-free and incremental.

In order for the agent to evaluate the action taken, simple methods like the sample average can be used. Another method that works well, as it balances exploration and exploitation, is ϵ-greedy, where ϵ is the probability of taking a non-greedy, exploratory action instead of the greedy one. To reduce the effect of the exploration-exploitation dilemma, ϵ is reduced at every timestep. In other words, the agent is encouraged to try different experiences at the beginning of the learning process, but as time passes, the agent must exploit the policy with the highest expected reward instead of continuing to explore [6].

In RL, a policy is how actions are selected by an agent. Policies can be stationary or non-stationary; non-stationary policies depend on the timestep and can be categorized as deterministic π(s) or stochastic π(a|s). In addition, a policy can be learned from data in two ways, offline or online. Unlike the offline case, where the data is needed upfront to make the agent learn, in the online case the agent acts on the environment while gaining experiences and learning from them [7].
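As a concrete illustration of the ϵ-greedy selection with a decaying ϵ described above, the following minimal Python sketch picks an action from a tabular Q and then shrinks ϵ. The table size, the decay schedule, and all names here are illustrative assumptions, not taken from the implementation used in this paper.

```python
import random

import numpy as np


def epsilon_greedy(q_table, state, epsilon, rng=random):
    """With probability epsilon take a random (exploratory) action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(q_table.shape[1])   # explore
    return int(np.argmax(q_table[state]))        # exploit


# Illustrative linear decay: start fully exploratory, end almost fully greedy.
epsilon, epsilon_min, epsilon_step = 1.0, 0.001, 0.001

q_table = np.zeros((100, 3))                     # hypothetical tabular Q
action = epsilon_greedy(q_table, state=42, epsilon=epsilon)
epsilon = max(epsilon_min, epsilon - epsilon_step)   # reduce epsilon each timestep
```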

Fig. 2. Sigmoid function.
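Figure 2 plots the logistic sigmoid that appears below as Eq. 1. As a quick numerical check of the curve shown, here is a small sketch assuming the standard form σ(z) = 1/(1 + e^(−z)):

```python
import math


def sigmoid(z: float) -> float:
    """Standard logistic sigmoid: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Sample points matching the curve in Figure 2.
print(sigmoid(-10.0))  # ~0.000045 (left tail approaches 0)
print(sigmoid(0.0))    # 0.5       (midpoint)
print(sigmoid(10.0))   # ~0.999955 (right tail approaches 1)
```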

Additionally, when a policy is chosen by an agent, there are two methods to improve it: on-policy and off-policy methods. On-policy methods try to improve the current policy, whereas off-policy methods try to improve a policy different from the one used to generate the data [6]. Unlike off-policy methods, on-policy methods tend to introduce bias. Besides, in terms of RL algorithms, there are three categories: value-based, policy-based, and model-based algorithms. In value-based algorithms, the objective is to build the value function, which defines the policy. One of the algorithms from the value-based category is Q-learning, which has a lookup table, called the Q-table, that is used to store the results of the Q-function, Q(s, a) [7].

In the artificial neural network literature, there are nonlinear functions that are used to decide whether, and to what degree, a neuron contributes to the next layers of the network. To name a few of the available activation functions, there are the Tanh, ReLU, and Sigmoid activation functions. The Sigmoid function (Eq. 1), shown in Figure 2, maps any input value to an output between 0 and 1, which is very useful to produce a probability [8].

σ(z) = 1 / (1 + e^(−z))    (1)

In the arena of control problems, for instance keeping a quadcopter at a specific altitude regardless of the disturbances that may affect the system, a proportional-integral-derivative (PID) controller (Eq. 2) can be used [9]. PID controllers are widely applied to solve control problems [10]; more than 90% of industrial control problems are solved by PID controllers [11], [12]. The feedback loop transfers the state of the system back to the PID controller to correct for errors that might occur [9]. In this study, the feedback is the altitude of the quadcopter after applying the thrust, as shown in Figure 3. To tune the PID controller, classical or modern methods can be applied. In the classical category, there are Trial and Error, Ziegler-Nichols Step Response, Ziegler-Nichols Frequency Response, Relay Tuning, and Cohen-Coon [12]. In the trial and error approach, a modification is made to the parameters of the PID controller (Kp, Ki, Kd) until the system reaches a stable and smooth response [13].

u(t) = Kp e(t) + Ki ∫0^t e(τ) dτ + Kd de(t)/dt    (2)

Fig. 3. (a) Manually tuned PID controller feedback loop that was used in the quadcopter. (b) How Q-learning fits into the standard PID controller loop.

In the last decade, successful attempts have been made to use RL to tune various PID controllers [14], [15]. On the other hand, in the realm of UAVs, more research is needed to address real-world requirements [16], [17]. In this paper, a method based on Q-learning is presented. More specifically, it combines PID controller gain initialization with the use of an activation function borrowed from the artificial neural network literature, namely the Sigmoid function [18], which helps reduce the number of entries in the Q-table. Also, a careful selection of the reward system has been implemented to aid the agent, the quadcopter in this case, to reach the setpoint altitude faster and more smoothly.

II. PROBLEM FORMULATION

The problem follows the MDP formulation for sequential decision making. The agent senses its state s from the environment and, based on the state-action pair (s, a) taken, it receives a reward r. Using that reward, it updates the Q-table through the action-value function q(s, a); the agent then decides which action to take in the next time step, based on its current state, using the ϵ-greedy strategy. The problem at hand is tuning the PID controller with a Q-learning algorithm so that the quadcopter reaches the desired setpoint. In this case, the quadcopter is the agent, and it senses the altitude of the environment using sensors such as sonar, lidar, or a camera. The set of states S has a cardinality of 100, i.e., |S| = 100. The state of the quadcopter is decided by the output of the state function, see Algorithm 1. The state function takes the error function as input, see Algorithm 2. The set of actions A has 3 elements, i.e., A = {0.0, 0.1, −0.1}. Moreover, the reward is assigned based on the error, as shown in the reward function, see Algorithm 3.

Algorithm 1 State Function
S ← σ(err) × 100
return S

Algorithm 2 Error Function
∆alt ← alt − alt′
∆d ← ∆alt − ∆alt_prev
v ← ∆d / dt
E ← ∆alt + v
return E

Algorithm 3 Reward Function
if err < 0.01 then
    R ← 1
else if err < 0.1 then
    R ← 0
else if err < 1 then
    R ← −1
else
    R ← −2
end if
return R

To know how good it is for an agent to take a given action in a given state, a value function is computed as in Eq. 3. This function is called the Q-function, and it produces a Q-value that is placed in the Q-table.

qπ(s, a) = E[ Gt | St = s, At = a ]
         = E[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | St = s, At = a ]    (3)

The optimal action-value function gives the evaluation of how good it is to take an action in a state when following the best of all policies π.

q∗(s, a) = max_π qπ(s, a)    (4)

It can be seen that the Bellman optimality equation (Eq. 5) considers the expected return given the state-action pair. The return is composed of the reward for taking an action a in state s, combined with the optimal action-value of the next state discounted by γ, the discount rate.

q∗(s, a) = E[ R_{t+1} + γ max_{a′} q∗(s′, a′) ]    (5)

Finally, the Q-value is computed using Eq. 6 and stored in the Q-table (Eq. 7) [19].

q(s, a) = (1 − α) q(s, a) + α ( R_{t+1} + γ max_{a′} q(s′, a′) )    (6)

To adapt Eq. 6 to the problem of quadcopter altitude stabilization, choices of the parameters ϵ, γ, and α need to be made. As the agent learns, ϵ is reduced from 1.0 to 0.001 by 0.001 at each timestep, giving the agent the ability to explore first and then exploit what has been learned at a later stage. So that the agent does not only care about the immediate rewards but also about future ones, γ = 0.99 has been used. By the same token, to make the agent a little slower in replacing the Q-values in the Q-table with new values, the learning rate α has been chosen to be 0.1.

The Q-table used is a 100 × 3 matrix. The rows represent the states that the agent might find itself in, while the columns represent the three actions. The cells of the table correspond to the state-action pairs whose values are computed by Eq. 6.

Qt = [ q(s1, a1)    q(s1, a2)    q(s1, a3)
       q(s2, a1)    q(s2, a2)    q(s2, a3)
       ...          ...          ...
       q(s100, a1)  q(s100, a2)  q(s100, a3) ]    (7)

At time t = 0, the agent does not know anything about which action to take, as the learning process has just started. Therefore, the Q-table must be initialized to zero.

Q0 = [ 0  0  0
       0  0  0
       ...
       0  0  0 ]    (8)

In the classical approach of tuning the PID controller by trial and error, the parameters are set to zero and then tuned one by one, starting with the proportional gain Kp and then moving to the other two parameters, Ki and Kd [20]. On the other hand, initializing the PID controller parameters to 0.1 instead of zero makes the tuning process faster. Following the same classical order, tuning Kp first, then Ki, and finally Kd, makes the system stabilize quicker.

III. EXPERIMENT SETUP

The experiments have been done in a simulated environment with a physics engine. The quadcopter has been designed to embed real-world physical properties such as the mass and materials of each component, see Figure 4. The physical properties used are mass, gravity, drag, angular drag, and colliders. Since the algorithm is nondeterministic in nature, each experiment illustrated in this study has been executed 100 times, and the average is reported in Tables I and II, along with the upper and lower bounds and the first (Q1), second (median), and third (Q3) quartiles. In addition, the speed at which the PID controller parameters are learned by the agent and the setpoint is reached is outlined in Figure 5. The stopping criterion is set to an error of ±0.1% of the target altitude. This prevents the algorithm, after reaching the setpoint, from continuing to execute actions that rapidly increase or decrease the PID controller gains, which would lead to an unbalanced system.
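Before moving to the individual experiments, the sketch below renders Algorithms 1-3 and the update of Eq. 6 as Python, with the parameter choices stated in Section II (α = 0.1, γ = 0.99, a 100 × 3 Q-table initialized to zeros). The function names, the reading of alt′ as the setpoint altitude, and the use of the error magnitude in the reward are interpretations for illustration, not the authors' code.

```python
import math

import numpy as np

ACTIONS = [0.0, 0.1, -0.1]      # gain increments, A = {0.0, 0.1, -0.1}
ALPHA, GAMMA = 0.1, 0.99        # learning rate and discount factor
q_table = np.zeros((100, 3))    # Eq. 8: Q0 is all zeros


def error_fn(alt, alt_ref, prev_delta_alt, dt):
    """Algorithm 2: altitude error plus its rate of change (alt_ref read as the setpoint)."""
    delta_alt = alt - alt_ref
    v = (delta_alt - prev_delta_alt) / dt
    return delta_alt + v, delta_alt


def state_fn(err):
    """Algorithm 1: squash the error with the sigmoid (Eq. 1) and scale to 100 states."""
    z = max(min(err, 60.0), -60.0)                 # clamp to avoid overflow in exp()
    s = 1.0 / (1.0 + math.exp(-z)) * 100.0
    return min(int(s), 99)                         # integer index into the 100-row table


def reward_fn(err):
    """Algorithm 3: coarse reward based on the magnitude of the error (an interpretation)."""
    err = abs(err)
    if err < 0.01:
        return 1
    if err < 0.1:
        return 0
    if err < 1:
        return -1
    return -2


def q_update(q_table, s, a, reward, s_next):
    """Eq. 6: q(s,a) <- (1 - alpha) q(s,a) + alpha (R + gamma * max_a' q(s',a'))."""
    target = reward + GAMMA * np.max(q_table[s_next])
    q_table[s, a] = (1 - ALPHA) * q_table[s, a] + ALPHA * target
```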

Fig. 4. A 3D render of a custom-made physical quadcopter that uses 3D-printed components besides off-the-shelf ones, which is used for the lab experiments.

A. Mass Variation

In this experiment, the aim is to examine the effect of changing the mass on the speed of convergence of the algorithm. The drag and angular drag properties have been fixed to 1 N and 0.05 N respectively, and only the mass is changed. For every 100 experiments, 500 grams are added to the mass of the quadcopter. The experiment is terminated at 500 experiments, which is equivalent to an addition of 2 kg in total. The results are documented in Table I.

TABLE I
DESCRIPTIVE STATISTICS OF THE CONVERGENCE TIME IN SECONDS (VARIABLE = QUADCOPTER MASS)

Variable*   Mean    Min    Max     Q1     Median   Q3
1.0 kg       76.1   14.1   266.9   32.9    53.8    114.2
1.5 kg       96.2   17.8   437.3   39.1    67.7    127.2
2.0 kg      102.1   15.0   489.6   41.2    75.1    114.0
2.5 kg      100.6   10.5   754.0   40.3    66.5    110.2
3.0 kg      115.1   22.9   340.0   50.9    91.1    159.3

*Note: 100 observations are taken for each change in the variable.

B. Drag Variation

In this experiment, the objective is to examine the effect of changing the drag on the speed of convergence of the algorithm. The mass and angular drag properties have been fixed and only the drag is changed. For every 100 experiments, an additional force of 0.5 N is added to the drag that affects the quadcopter. The experiment is terminated at 500 trials, which is equivalent to an addition of 2 N of drag in total. The results are reported in Table II.

TABLE II
DESCRIPTIVE STATISTICS OF THE CONVERGENCE TIME IN SECONDS (VARIABLE = DRAG ON QUADCOPTER)

Variable*   Mean    Min    Max      Q1     Median   Q3
1.0 N        76.1   14.1    266.9   32.9    53.8    114.2
1.5 N       106.9   18.1    458.4   47.9    81.0    140.9
2.0 N       113.2   22.7    613.2   51.4    91.7    142.2
2.5 N       130.3   18.2   1199.6   50.3    87.3    157.6
3.0 N       154.6   23.6    608.4   58.6   119.1    206.9

*Note: 100 observations are taken for each change in the variable.

IV. RESULTS

Going from the initial PID controller parameters to a fully tuned PID controller takes under 1.5 minutes on average. The box plots in Figures 6a and 6b, which correspond to Tables I and II, show that a minimum of 14 seconds is needed to reach the required setpoint altitude. After learning the PID parameters, the quadcopter reaches the setpoint quickly and smoothly, as shown in Figures 5b, 5d, and 5f. Furthermore, Figures 5a, 5c, and 5e show that the agent tuned the PID controller parameters in under 30 seconds. It is worth noting that the 75th percentiles range from 2 to 4 minutes across the thousand trials.

On the other hand, due to the nondeterministic nature of RL algorithms, more experiments are needed, as the results are far from perfect. As seen from the handpicked experiments in Figure 5, the algorithm still needs to optimize the learned PID controller parameters. It can also be seen in Tables I and II that increasing the mass or the drag increases the convergence time to reach the setpoint. Also, after the agent has finished the learning process, as shown in Figure 5a, the quadcopter bounces very fast, which may seem unrealistic given the physical limitations of the motors used, as shown in Figure 5b. Furthermore, in Figure 5f it is clear that the learned parameters of the PID controller were not optimized, as it took more than 30 seconds to reach the required setpoint. It is easily inferred from Tables I and II that the average convergence time increases whenever the mass or drag on the system increases.

V. CONCLUSION

In this paper, a model-free, ϵ-greedy reinforcement learning algorithm is implemented through the use of Q-learning, with a modification of the Q-function that makes it appropriate for quadcopters, to reach altitude stability in less than half a minute on average, as evident from the experiments presented. These results are achievable without the need for extensive simulations and modeling of the environment, although minimal knowledge of the PID controller and of the application at hand is needed to help the algorithm reach stability faster and more smoothly. Hence, the approach is applicable to a wide range of quadcopter sizes and weights, and to the different drag conditions that may affect the system. The environment is unknown a priori and stochastic data is used. The results show promise, and more exploration is indeed required.
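To recap how the pieces described in Sections II and III fit together, one possible shape of the online tuning loop is sketched below. It reuses the helpers sketched earlier (PID, epsilon_greedy, error_fn, state_fn, reward_fn, q_update, ACTIONS) and a hypothetical simulator interface sim.reset()/sim.step(); those names, and the exact Kp, Ki, Kd schedule details, are assumptions for illustration, not the authors' implementation.

```python
def tune_gain(sim, pid, gain_name, q_table, setpoint, dt, max_steps=10_000):
    """Adjust one PID gain online with Q-learning until the altitude settles."""
    epsilon, prev_delta = 1.0, 0.0
    alt = sim.reset()                                  # hypothetical simulator API
    err, prev_delta = error_fn(alt, setpoint, prev_delta, dt)
    s = state_fn(err)
    for _ in range(max_steps):
        a = epsilon_greedy(q_table, s, epsilon)
        # Apply the chosen increment (0.0, +0.1 or -0.1) to the gain being tuned.
        setattr(pid, gain_name, getattr(pid, gain_name) + ACTIONS[a])
        alt = sim.step(thrust=pid.update(setpoint - alt), dt=dt)
        err, prev_delta = error_fn(alt, setpoint, prev_delta, dt)
        r, s_next = reward_fn(err), state_fn(err)
        q_update(q_table, s, a, r, s_next)
        s = s_next
        epsilon = max(0.001, epsilon - 0.001)          # decay schedule from Section II
        if abs(alt - setpoint) <= 0.001 * setpoint:    # stop within +/-0.1% of the target
            break


# Gains start at 0.1 and are tuned one at a time: Kp first, then Ki, then Kd.
# for name in ("kp", "ki", "kd"):
#     tune_gain(sim, pid, name, q_table, setpoint=5.0, dt=0.02)
```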

Fig. 5. Handpicked experiments showing the nondeterministic behavior to be expected from such an algorithm. Panels (a), (c), and (e) show the training phase. Panels (b), (d), and (f) show runs after the PID controller parameters (Kp, Ki, Kd) have been learned. All panels plot time on the horizontal axis against the quadcopter's altitude on the vertical axis.

Fig. 6. (a) Box plot for each increase in mass on the horizontal axis, and corresponding time of convergence on the vertical axis. (b) Box plot that represents
the change in drag on the horizontal axis, with the time of convergence along the vertical axis.
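The descriptive statistics reported in Tables I and II and summarized by the box plots in Figure 6 can be reproduced from the raw per-run convergence times with a few NumPy calls. A sketch follows; the 100 sampled times below are a random placeholder, not the paper's measurements.

```python
import numpy as np

# Placeholder: convergence times (seconds) of 100 runs for one mass/drag setting.
times = np.random.default_rng(0).gamma(shape=2.0, scale=40.0, size=100)

summary = {
    "Mean": times.mean(),
    "Min": times.min(),
    "Max": times.max(),
    "Q1": np.percentile(times, 25),
    "Median": np.percentile(times, 50),
    "Q3": np.percentile(times, 75),
}
print(summary)  # one row of Table I/II; a box plot draws Min, Q1, Median, Q3 and Max
```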

REFERENCES

[1] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel et al., "Mastering atari, go, chess and shogi by planning with a learned model," Nature, vol. 588, no. 7839, pp. 604-609, 2020.
[2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, 2016.
[3] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel et al., "Mastering chess and shogi by self-play with a general reinforcement learning algorithm," arXiv preprint arXiv:1712.01815, 2017.
[4] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[5] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel et al., "A general reinforcement learning algorithm that masters chess, shogi, and go through self-play," Science, vol. 362, no. 6419, pp. 1140-1144, 2018.
[6] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT Press, 2018.
[7] V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, and J. Pineau, "An introduction to deep reinforcement learning," arXiv preprint arXiv:1811.12560, 2018.
[8] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT Press, 2016.
[9] T. Luukkonen, "Modelling and control of quadcopter," Independent research project in applied mathematics, Espoo, vol. 22, p. 22, 2011.
[10] K. J. Åström and R. M. Murray, "Feedback systems," in Feedback Systems. Princeton University Press, 2010.
[11] L. Desborough and R. Miller, "Increasing customer value of industrial control performance monitoring: Honeywell's experience," in AIChE Symposium Series, no. 326. New York: American Institute of Chemical Engineers, 2002, pp. 169-189.
[12] R. P. Borase, D. Maghade, S. Sondkar, and S. Pawar, "A review of pid control, tuning methods and applications," International Journal of Dynamics and Control, vol. 9, no. 2, pp. 818-827, 2021.
[13] S. C. Pratama, E. Susanto, and A. S. Wibowo, "Design and implementation of water level control using gain scheduling pid back calculation integrator anti windup," in 2016 International Conference on Control, Electronics, Renewable Energy and Communications (ICCEREC). IEEE, 2016, pp. 101-104.
[14] X.-S. Wang, Y.-H. Cheng, and S. Wei, "A proposal of adaptive pid controller based on reinforcement learning," Journal of China University of Mining and Technology, vol. 17, no. 1, pp. 40-44, 2007.
[15] W. Koch, R. Mancuso, R. West, and A. Bestavros, "Reinforcement learning for uav attitude control," ACM Transactions on Cyber-Physical Systems, vol. 3, no. 2, pp. 1-21, 2019.
[16] A. T. Azar, A. Koubaa, N. Ali Mohamed, H. A. Ibrahim, Z. F. Ibrahim, M. Kazim, A. Ammar, B. Benjdira, A. M. Khamis, I. A. Hameed et al., "Drone deep reinforcement learning: A review," Electronics, vol. 10, no. 9, p. 999, 2021.
[17] G. Dulac-Arnold, N. Levine, D. J. Mankowitz, J. Li, C. Paduraru, S. Gowal, and T. Hester, "Challenges of real-world reinforcement learning: definitions, benchmarks and analysis," Machine Learning, vol. 110, no. 9, pp. 2419-2468, 2021.
[18] J. Han and C. Moraga, "The influence of the sigmoid function parameters on the speed of backpropagation learning," in International Workshop on Artificial Neural Networks. Springer, 1995, pp. 195-201.
[19] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3, pp. 279-292, 1992.
[20] N. Kuyvenhoven, "PID tuning methods: An automatic PID tuning study with MathCad," Calvin College ENGR, vol. 315, 2002.

