Using Q-learning to Automatically Tune Quadcopter PID Controller Online for Fast Altitude Stabilization
Authorized licensed use limited to: UNIVERSIDAD POLITECNICA SALESIANA. Downloaded on July 10,2023 at 16:52:03 UTC from IEEE Xplore. Restrictions apply.
quadcopter will be decided by the output of the state function, see Algorithm 1. The state function takes the error function as an input, see Algorithm 2. The set of actions A has three elements, i.e. A = {0.0, 0.1, −0.1}. Moreover, the reward is assigned based on the error function, as shown in the reward function, see Algorithm 3.

Algorithm 1 State Function
  S ← σ(err) × 100
  return S

Algorithm 2 Error Function
  ∆alt ← alt − alt′
  ∆d ← ∆alt − ∆alt_prev
  v ← ∆d / dt
  E ← ∆alt + v
  return E

Algorithm 3 Reward Function
  if err < 0.01 then
    R ← 1
  else if err < 0.1 then
    R ← 0
  else if err < 1 then
    R ← −1
  else
    R ← −2
  end if
  return R

To know how good it is for an agent to take a given action in a given state, a value function is computed as in Eq. 3. This function is called the Q-function, and it produces a Q-value that is placed in the Q-table.

  q_π(s, a) = E[ G_t | S_t = s, A_t = a ]
            = E[ Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s, A_t = a ]    (3)

The optimal action-value function gives the evaluation of how good it is to follow a policy π, considering all the policies.

  q∗(s, a) = max_π q_π(s, a)    (4)

It can be seen that the Bellman optimality equation (5) considers the expected return given the state-action pair. The return is composed of the reward for taking an action a in state s, in combination with the optimal action-value function multiplied by γ, the discount rate.

  q∗(s, a) = E[ R_{t+1} + γ max_{a′} q∗(s′, a′) ]    (5)

Finally, the Q-value is computed using Eq. 6 and the Q-table (Eq. 7) [19].

  q(s, a) = (1 − α) q(s, a) + α (R_{t+1} + γ max_{a′} q(s′, a′))    (6)

To adapt Eq. 6 to the problem of quadcopter altitude stabilization, choices of the parameters ϵ, γ, and α need to be made. As the agent learns, ϵ is reduced from 1.0 to 0.001 by 0.001 each timestep, giving the agent the ability to explore and then, at a later stage, exploit what has been learned. So that the agent does not only care about immediate rewards but also about future ones, γ = 0.99 has been used. By the same token, to make the agent a little slower to replace a Q-value in the Q-table with the new value, the learning rate has been chosen as α = 0.1.

The Q-table used is a 100 × 3 matrix. The rows represent the states that the agent might find itself in, while the columns represent the three actions. The cells of the table hold the state-action values computed by Eq. 6.

        | q_{s1,a1}    q_{s1,a2}    q_{s1,a3}   |
  Q_t = | q_{s2,a1}    q_{s2,a2}    q_{s2,a3}   |    (7)
        |     ...          ...          ...     |
        | q_{s100,a1}  q_{s100,a2}  q_{s100,a3} |

At time t = 0, the agent does not know anything about which action to take, as the learning process has just started. Therefore, the Q-table must be initialized to zeros.

        | 0  0  0 |
  Q_0 = | 0  0  0 |    (8)
        | .  .  . |
        | 0  0  0 |

In the classical approach of tuning the PID controller by trial and error, the parameters are set to zero and then tuned one by one, starting with the proportional gain Kp and then moving to the other two parameters, namely Ki and Kd [20]. On the other hand, initializing the PID controller parameters to 0.1 instead of zero makes the tuning process faster. Following the same classical order, tuning Kp first, then Ki, and finally Kd, makes the system stabilize more quickly.

III. EXPERIMENT SETUP

The experiments have been done in a simulated environment that has a physics engine. The quadcopter has been designed to embed real-world physical properties such as the mass and materials of each component, see Figure 4. The physical properties that have been used are mass, gravity, drag, angular drag, and colliders. Since the algorithm is indeterministic in nature, each experiment illustrated in this study has been executed 100 times, and the average is reported in Tables I and II, along with the upper and lower bounds and the first (Q1), second (median), and third (Q3) quartiles. In addition, the speed at which the PID controller parameters are learned by the agent and the setpoint is reached is outlined in Figure 5. The stopping criterion is set to an error of ±0.1% of the target altitude. This prevents the algorithm, after reaching the setpoint, from continuing to execute the favorable actions that may rapidly increase or decrease the PID controller gains, which would lead to an unbalanced system.
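Putting the preceding pieces together (Algorithms 1-3, the update rule of Eq. 6, and the choices of ϵ, γ, and α), the learning step can be sketched in Python. This is only a sketch: the exact variable wiring, e.g. treating alt′ as the target altitude, clamping the state index to 99 so it fits a 100-row table, and applying the chosen action as an adjustment to the gain currently being tuned, is our assumption and is not spelled out in the paper.

```python
import math
import random

def sigmoid(x):
    # Logistic sigmoid; squashes the error into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def state_function(err):
    # Algorithm 1: S <- sigmoid(err) * 100, discretized so it can index
    # the 100-row Q-table (clamped to 99 to stay in range; an assumption).
    return min(int(sigmoid(err) * 100), 99)

def error_function(alt, alt_target, delta_alt_prev, dt):
    # Algorithm 2: altitude error plus its rate of change.
    # alt_target plays the role of alt' (an assumption).
    delta_alt = alt - alt_target
    v = (delta_alt - delta_alt_prev) / dt
    return delta_alt + v, delta_alt

def reward_function(err):
    # Algorithm 3: thresholded reward on the error.
    if err < 0.01:
        return 1
    elif err < 0.1:
        return 0
    elif err < 1:
        return -1
    return -2

ACTIONS = [0.0, 0.1, -0.1]           # A = {0.0, 0.1, -0.1}: gain adjustments
ALPHA, GAMMA = 0.1, 0.99             # learning rate and discount factor
Q = [[0.0] * 3 for _ in range(100)]  # 100 x 3 Q-table, all zeros (Eq. 8)

def choose_action(s, epsilon):
    # Epsilon-greedy: random action with probability epsilon, else greedy.
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: Q[s][a])

def q_update(s, a, r, s_next):
    # Eq. 6: q(s,a) = (1 - alpha) q(s,a) + alpha (R + gamma max_a' q(s',a'))
    Q[s][a] = (1 - ALPHA) * Q[s][a] + ALPHA * (r + GAMMA * max(Q[s_next]))

def decay_epsilon(epsilon):
    # Epsilon is reduced from 1.0 toward 0.001 by 0.001 each timestep.
    return max(epsilon - 0.001, 0.001)
```

Each timestep, the agent would compute the error (Algorithm 2), map it to a state (Algorithm 1), pick a gain adjustment with `choose_action`, apply it to the gain currently being tuned (Kp, then Ki, then Kd), observe the reward (Algorithm 3), and update the table with `q_update`.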
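The descriptive statistics used in the experiments (mean, minimum, maximum, and the Q1/median/Q3 quartiles over the 100 runs per configuration) can be computed with the Python standard library. The convergence times below are placeholder values for illustration, not the paper's measurements.

```python
import statistics

def describe(times):
    # Mean, min, max and quartiles of a list of convergence times (seconds).
    q1, median, q3 = statistics.quantiles(times, n=4)  # Q1, median, Q3
    return {
        "mean": statistics.fmean(times),
        "min": min(times),
        "max": max(times),
        "Q1": q1,
        "median": median,
        "Q3": q3,
    }

# Placeholder convergence times for one configuration (illustrative only).
times = [14.1, 32.9, 53.8, 76.1, 114.2, 266.9]
stats = describe(times)
```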
Fig. 4. A 3D render of a custom-made physical quadcopter that uses 3D printed components besides off-the-shelf ones, and which is used for the lab experiments.

A. Mass Variation

In this experiment, the aim is to examine the effect of changing the mass on the speed of convergence of the algorithm. The drag and angular drag properties have been fixed to 1 N and 0.05 N respectively, and only the mass is changed. For every 100 experiments, 500 grams are added to the mass of the quadcopter. The experiment is terminated at 500 experiments, which is equivalent to an addition of 2 kg in total. The results are documented in Table I.

TABLE I
DESCRIPTIVE STATISTICS (VARIABLE = QUADCOPTER MASS)

  Variable*  Mean   Min   Max    Q1    Median  Q3
  1.0 kg     76.1   14.1  266.9  32.9  53.8    114.2
  1.5 kg     96.2   17.8  437.3  39.1  67.7    127.2
  2.0 kg     102.1  15.0  489.6  41.2  75.1    114.0
  2.5 kg     100.6  10.5  754.0  40.3  66.5    110.2
  3.0 kg     115.1  22.9  340.0  50.9  91.1    159.3

  *Note: 100 observations are taken for each change in the variable; all times are in seconds.

B. Drag Variation

In this experiment, the objective is to examine the effect of changing the drag on the speed of convergence of the algorithm. The mass and angular drag properties have been fixed, and only the drag is changed. For every 100 experiments, an additional force of 0.5 N is added to the drag that affects the quadcopter. The experiment is terminated at 500 trials, which is equivalent to an addition of 2 N of drag in total. The results are then reported in Table II.

TABLE II
DESCRIPTIVE STATISTICS (VARIABLE = DRAG ON QUADCOPTER)

  Variable*  Mean   Min   Max     Q1    Median  Q3
  1.0 N      76.1   14.1  266.9   32.9  53.8    114.2
  1.5 N      106.9  18.1  458.4   47.9  81.0    140.9
  2.0 N      113.2  22.7  613.2   51.4  91.7    142.2
  2.5 N      130.3  18.2  1199.6  50.3  87.3    157.6
  3.0 N      154.6  23.6  608.4   58.6  119.1   206.9

  *Note: 100 observations are taken for each change in the variable; all times are in seconds.

IV. RESULTS

The algorithm goes from the initial values of the PID controller parameters to a fully tuned PID controller in under 1.5 minutes on average. The box plots in Figures 6a-6b, which correspond to Tables I-II, show that a minimum of 14 seconds is needed to reach the required setpoint altitude. After learning the PID parameters, the quadcopter reaches the setpoint quickly and smoothly, as shown in Figures 5b, 5d, and 5f. Furthermore, Figures 5a, 5c, and 5e show that the agent tuned the PID controller parameters in under 30 seconds. It is worth noting that the 75th percentile ranges from 2 to 4 minutes across the thousand trials.

On the other hand, due to the indeterministic nature of RL algorithms, more experiments are needed, as the results are far from perfect. As seen from the handpicked experiments in Figure 3b, the algorithm still needs to optimize the learned PID controller parameters. It can also be seen in Tables I-II that increasing the mass or the drag increases the convergence time needed to reach the setpoint. Also, after the agent has finished the learning process, as shown in Figure 5a, the quadcopter bounces very fast, which may seem unrealistic given the physical limitations of the motors used, as shown in Figure 5b. Furthermore, in Figure 5f it is clear that the learned PID controller parameters were not optimized, as it took more than 30 seconds to reach the required setpoint. It is easily inferred from Tables I-II that the average convergence time increases whenever the mass of, or the drag on, the system increases.

V. CONCLUSION

In this paper, a model-free, ϵ-greedy reinforcement learning algorithm is implemented through Q-learning, with a modification of the Q-function that makes it appropriate for quadcopters to reach altitude stability in less than half a minute on average, as is evident from the experiments presented. These results are achievable without the need for extensive simulations and modeling of the environment, although minimal knowledge of the PID controller and of the application at hand is needed to help the algorithm reach stability faster and more smoothly. This makes the approach applicable to a wide range of quadcopter sizes and weights, with different drag forces that may affect the system. The environment is unknown a priori and stochastic data is used. The results show promise, and more exploration is indeed required.
Fig. 5. Handpicked experiments showing the indeterministic behavior to be expected from such an algorithm. Figures (a), (c), and (e) show the training phase. Figures (b), (d), and (f) show the runs after the PID controller parameters (Kp, Ki, Kd) have been learned. Both sides are plotted with time on the horizontal axis and the quadcopter's altitude on the vertical axis.
Fig. 6. (a) Box plot for each increase in mass on the horizontal axis, and corresponding time of convergence on the vertical axis. (b) Box plot that represents
the change in drag on the horizontal axis, with the time of convergence along the vertical axis.
REFERENCES

[1] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel et al., "Mastering atari, go, chess and shogi by planning with a learned model," Nature, vol. 588, no. 7839, pp. 604–609, 2020.
[2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[3] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel et al., "Mastering chess and shogi by self-play with a general reinforcement learning algorithm," arXiv preprint arXiv:1712.01815, 2017.
[4] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[5] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel et al., "A general reinforcement learning algorithm that masters chess, shogi, and go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
[6] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[7] V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, and J. Pineau, "An introduction to deep reinforcement learning," arXiv preprint arXiv:1811.12560, 2018.
[8] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[9] T. Luukkonen, "Modelling and control of quadcopter," Independent research project in applied mathematics, Espoo, vol. 22, p. 22, 2011.
[10] K. J. Åström and R. M. Murray, "Feedback systems," in Feedback Systems. Princeton University Press, 2010.
[11] L. Desborough and R. Miller, "Increasing customer value of industrial control performance monitoring: Honeywell's experience," in AIChE Symposium Series, no. 326. New York: American Institute of Chemical Engineers, 2002, pp. 169–189.
[12] R. P. Borase, D. Maghade, S. Sondkar, and S. Pawar, "A review of PID control, tuning methods and applications," International Journal of Dynamics and Control, vol. 9, no. 2, pp. 818–827, 2021.
[13] S. C. Pratama, E. Susanto, and A. S. Wibowo, "Design and implementation of water level control using gain scheduling PID back calculation integrator anti windup," in 2016 International Conference on Control, Electronics, Renewable Energy and Communications (ICCEREC). IEEE, 2016, pp. 101–104.
[14] X.-S. Wang, Y.-H. Cheng, and S. Wei, "A proposal of adaptive PID controller based on reinforcement learning," Journal of China University of Mining and Technology, vol. 17, no. 1, pp. 40–44, 2007.
[15] W. Koch, R. Mancuso, R. West, and A. Bestavros, "Reinforcement learning for UAV attitude control," ACM Transactions on Cyber-Physical Systems, vol. 3, no. 2, pp. 1–21, 2019.
[16] A. T. Azar, A. Koubaa, N. Ali Mohamed, H. A. Ibrahim, Z. F. Ibrahim, M. Kazim, A. Ammar, B. Benjdira, A. M. Khamis, I. A. Hameed et al., "Drone deep reinforcement learning: A review," Electronics, vol. 10, no. 9, p. 999, 2021.
[17] G. Dulac-Arnold, N. Levine, D. J. Mankowitz, J. Li, C. Paduraru, S. Gowal, and T. Hester, "Challenges of real-world reinforcement learning: Definitions, benchmarks and analysis," Machine Learning, vol. 110, no. 9, pp. 2419–2468, 2021.
[18] J. Han and C. Moraga, "The influence of the sigmoid function parameters on the speed of backpropagation learning," in International Workshop on Artificial Neural Networks. Springer, 1995, pp. 195–201.
[19] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3, pp. 279–292, 1992.
[20] N. Kuyvenhoven, "PID tuning methods: An automatic PID tuning study with MathCad," Calvin College ENGR, vol. 315, 2002.