
PD Control based on Reinforcement Learning Compensation for a DC Servo Drive


Manuel Alejandro Ojeda-Misses
Department of Automatic Control
CINVESTAV-IPN
Mexico City, Mexico
manuel.ojeda@cinvestav.mx

Abstract—This paper presents a control algorithm based on reinforcement learning for a DC servo drive. The Q-learning technique is implemented together with a PD (proportional-derivative) control scheme for the DC servo drive. This enables the controller to improve its performance online and to adapt to changes in the parameters of the servo drive model. Q-learning is used to obtain a control action that moves the servo drive towards its desired position and velocity. In addition, hybrid techniques combining Q-learning and PD control are applied; this combination can bring the servo drive to its position and velocity regardless of the applied disturbance. Finally, experimental results show the effectiveness of the proposed controller.

Keywords—Reinforcement Learning, Q-learning, Proportional Derivative Controller, DC Servo drive.

I. INTRODUCTION

Interest in reinforcement learning has grown with the diversification of the tools available to support it, and it has become one of the most active research areas in machine learning, with applications in classical control, robust control, optimal control, and robotics. In practical terms, the central objective is to learn how states are mapped to actions while maximizing a reward signal [1]. Another challenge frequently encountered in automatic control is the design of good rewards that concisely define the interaction between the process and the controller [2]. For example, rewards that bring the plant or process quickly to its goal are necessary in a physical implementation; this problem is known as the greedy-reward problem and is a major issue in reinforcement learning. Specifying good rewards for automatic control requires considerable experience and is always difficult to achieve in practice [3].

In the context of artificial intelligence, computational systems that exhibit intelligent behavior are studied and constructed: systems that can represent knowledge about the real world, follow a reasoning process, plan, learn, make decisions, and use those decisions to achieve various objectives [4]. Machine learning is an area that over the last ten years has attracted many researchers and has contributed to enormous advances in other areas such as web search, cognitive robotics, and data mining [5]. The term machine learning refers to a set of computational methods or algorithms designed to make a computer system capable of improving its performance over time using its experience. Depending on how the experience is defined, three general types of machine learning can be distinguished: reinforcement learning, supervised learning, and unsupervised learning.

This paper is based on reinforcement learning, in which an agent learns to reach an objective through its interaction with the environment [6]. This problem is commonly attacked using the Markov decision process (MDP) framework. In the reinforcement learning problem, however, neither the state transition probabilities ρ nor the reward function r are known, which prevents the use of value-iteration or policy-iteration algorithms. Instead, the agent can interact with the environment many times and, in that way, collect information about the rewards obtained and the states visited after performing different actions in different states.

Several algorithms can use the data collected by the agent to solve the reinforcement learning problem [5]. Most of them estimate the state-value function or the action-value function after many interactions with the environment. In reinforcement learning problems the value function can be stored in tables; however, for large problems the table must be replaced by a more sophisticated structure, which is known as function approximation. Another promising way of dealing with complex state and action spaces is the development of temporal abstractions using hierarchical methods. One of the main advances in reinforcement learning research is the introduction of temporal difference methods, a class of incremental learning procedures specialized in prediction problems. These methods are driven by the error, or difference, between successive temporal predictions of the states: learning occurs whenever a prediction changes over time. Reinforcement learning applies classical optimal control techniques to models that learn through repeated interaction with the environment [2], [4]. In addition to its relationship with optimal control, it is also closely related to simulation-based optimization and dynamic programming [7].

The paper is organized as follows. Section II presents the PD control with compensation. Section III describes reinforcement learning with compensation. Section IV presents the experimental results using a DC servo drive. Finally, Section V concludes the paper.
II. PD CONTROL WITH COMPENSATION

Reinforcement learning has been widely used for robots. In this paper, however, it is implemented for a DC servo drive, which differs considerably from most systems studied in reinforcement learning. Several applications have used reinforcement learning for motors: motor synergy development in high-performing deep reinforcement learning algorithms [8]; online adaptive decoding of motor imagery based on reinforcement learning [9], where advances in electronic and computer technology are making non-invasive EEG acquisition systems more widely adopted; and the relationship between the order of motor skill transfer and motion complexity in reinforcement learning [10], a method that generates an order for learning and transferring motor skills based on motion complexity, and then evaluates that order by learning the motor skills of one task and transferring them to another task (see Fig. 1).

Figure 1. Reinforcement learning scheme.

It is very difficult to assume that the true state can be observed completely and without noise. In reinforcement learning it is also difficult to know exactly which state is being considered and, sometimes, very different states can look very similar. Therefore, in the study of the DC servo drive using reinforcement learning, the model is often assumed to be observable. If the states of an experimental platform must be estimated, filters should be used to obtain better estimates, which is expensive, tedious, and difficult to reproduce. However, although experience on real systems is difficult and expensive to obtain, it cannot be replaced by learning in simulation. The servo drive used here has the dynamics of a DC motor, which can be described by the following second-order model:

J \ddot{q}(t) + f \dot{q}(t) = k u(t)    (1)

where q is the angular position, u(t) the control input voltage, J the motor and load inertia, f the viscous friction coefficient, and k the amplifier gain. The servomechanism is composed of a brushed servomotor, a power amplifier, and a position sensor. The power amplifier is set to current mode; therefore, the electromagnetic torque is proportional to the input voltage applied to the amplifier. This approach also works for both DC and AC brushless servomotors. Observe that (1) can be written as:

\ddot{q}(t) = -a \dot{q}(t) + b u(t)    (2)

where a = f/J and b = k/J are positive parameters. The estimated parameter values for this platform were obtained via the identification algorithm proposed in [11] as a = 0.45 and b = 31. It is known that PD control with friction compensation terms can achieve asymptotic stability [12]. The use of neural networks to compensate for unknown dynamics can be found in [13] and [14]. If F and G are unknown, it is feasible to use a neural network to approximate them as:

G + F - g = \hat{W}_t \sigma(x) + \epsilon    (3)

where ε is a bounded modeling error satisfying ε^T Λ_1 ε ≤ η̄, Λ_1 is a matrix such that Λ_1 = Λ_1^T > 0, σ(x) is the activation function, and x represents the input of the neural network (NN) [15]:

X = [q, \dot{q}, q_d, \dot{q}_d]^T    (4)

and g characterizes the uncertainties. The neural PD control is:

u = -k_v r - \hat{W}_t \sigma(x)    (5)

The PD control with reinforcement learning compensation is given as:

u = k_p e + k_d \dot{e} + u_r    (6)

In the scheme given by (6), e = q_d - q is the error signal and u_r is the torque generated by the reinforcement learning compensation.
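As an illustration of the closed loop described by (2) and (6), the following minimal Python sketch integrates the servo model with a forward-Euler step under the PD part of the control law only (the reinforcement learning term u_r is omitted here). The identified values a = 0.45 and b = 31 are taken from the text; the time step, reference, and PD gains are illustrative assumptions, not values reported in the paper.

import numpy as np

A_PARAM, B_PARAM = 0.45, 31.0  # identified servo parameters a = f/J, b = k/J (Section II)

def pd_control(q, qdot, q_des, qdot_des, kp=2.0, kd=0.2):
    # PD part of (6); the reinforcement learning compensation u_r is omitted in this sketch
    return kp * (q_des - q) + kd * (qdot_des - qdot)

def simulate(q_des=1.0, dt=1e-3, steps=5000):
    # forward-Euler integration of the model (2): q_ddot = -a*q_dot + b*u
    q, qdot, trajectory = 0.0, 0.0, []
    for _ in range(steps):
        u = pd_control(q, qdot, q_des, 0.0)
        qdot += dt * (-A_PARAM * qdot + B_PARAM * u)
        q += dt * qdot
        trajectory.append(q)
    return np.array(trajectory)

print("final position:", simulate()[-1])

With the assumed gains the closed loop is stable and the position settles at the reference within a few seconds of simulated time; the same loop is later extended with the learned compensation term.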
III. REINFORCEMENT LEARNING WITH COMPENSATION
The goal of the reinforcement learning problem is to interact with the environment or process so as to achieve the objectives or goals. The servo drive, or agent, also known as the decision maker, learns by trial and error to select an optimal policy [7]. The process or environment is everything that is outside the servo drive and interacts with it. Through this continuous interaction the system learns to select its actions, and the environment responds to those actions by giving a reward for each action taken and leaving the DC servo drive in a new situation or state [16]. The reward provided by the environment is a numerical value that the agent tries to maximize over time. Fig. 1 shows how the process and the servo drive interact, generating actions and awarding rewards.

A Markov decision problem with finite states and finite actions can be described as follows. At each discrete time step k = 1, 2, 3, ... the controller (servo drive/agent) observes the state x_k of the Markov process, selects an action u_k, receives a reward r_k, and observes the next state x_{k+1}. The probability distribution of receiving the reward r and reaching the state x_{k+1} depends only on x_k, u_k, and r_k, and the corresponding mathematical expectations are finite. The goal is to find a control law that maximizes the mathematical expectation of the future reward at each time step, which means that at any discrete time step k the control law should apply an action u_k that maximizes the reward. In addition, at each time step the system or agent maps states to probabilities of generating the next action. This mapping is known as the control strategy (policy) and is represented by π.

The aim of reinforcement learning is to maximize not only the immediate reward but the long-term reward, which is represented as:

G_k = R_{k+1} + R_{k+2} + R_{k+3} + \cdots + R_T    (7)

where T represents the final time; however, this sum can grow to infinity. It is therefore better to use a discounted return, which is conceptually more complex but mathematically simpler, since it reduces the contribution of the rewards as time increases:

G_k = R_{k+1} + \gamma R_{k+2} + \gamma^2 R_{k+3} + \cdots = \sum_{l=0}^{\infty} \gamma^{l} R_{k+l+1}    (8)

where γ is a gain with 0 ≤ γ ≤ 1 known as the discount rate. Although there are several ways to take future rewards into account, three optimality models are commonly used; they are presented next.
Finite horizon: the controller tries to optimize the mathematical expectation of the rewards accumulated over a finite time h, regardless of what happens afterwards:

E\left[ \sum_{l=0}^{h} R_l \right]    (9)

where R_l is the reward received l time steps in the future.

Infinite horizon: the reward received by the controller (agent/system/robot) is geometrically discounted using a factor γ (0 ≤ γ < 1):

E\left[ \sum_{l=0}^{\infty} \gamma^{l} R_l \right]    (10)

where R_l is the reward received l time steps into the future.

Average reward: its objective is to optimize the long-term average reward:

\lim_{h \to \infty} E\left[ \frac{1}{h} \sum_{l=0}^{h} R_l \right]    (11)

The drawback of the average reward criterion is that over long episodes one cannot distinguish between a strategy that receives a large reward at the beginning and one that does not. Finally, the model used here, as in the literature, is the finite horizon, and it is the one used from now on.
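For concreteness, the discounted return in (8) and (10) can be accumulated backwards with the recursion G_k = R_{k+1} + γ G_{k+1}. The short sketch below illustrates this for an arbitrary, made-up reward sequence and the discount factor γ = 0.5 used later in the experiments.

def discounted_return(rewards, gamma=0.5):
    # backward accumulation of G_k = R_{k+1} + gamma * G_{k+1}, cf. (8)
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# example with an arbitrary reward sequence: 1 + 0.5*0 + 0.25*(-1) = 0.75
print(discounted_return([1.0, 0.0, -1.0]))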
The process is modeled as a Markov decision process, that is, a reinforcement learning task that satisfies the Markov property. Given any state x_k and action u_k, the probability of the next state x_{k+1} is defined as:

p(x_{k+1} \mid x_k, u_k) = \Pr\{ x_{k+1} \mid x_k = x, u_k = u \}    (12)

This equation is known as the transition probability. In a similar way, the expected reward can be defined in terms of the current state x_k, the action u_k, and the next state x_{k+1}:

r(x_{k+1} \mid x_k, u_k) = E\left[ R_{k+1} \mid x_k = x, u_k = u, x_{k+1} \right]    (13)

Equations (12) and (13) represent the dynamics of a finite Markov decision process. The value functions can now be defined for particular strategies, where a strategy π is a mapping from states to actions and π(u_k | x_k) is the probability of taking action u_k in state x_k. In other words, the function V^π(x) is defined as the value of a state under the strategy π: the expected return when starting at x_k and following the strategy π. Therefore, the Markov decision process can be characterized by V^π(x) [1]:

V^{\pi}(x_k) = E_{\pi}\left[ G_k \mid x_k = x \right] = E_{\pi}\left[ \sum_{l=0}^{\infty} \gamma^{l} R_{k+l+1} \mid x_k = x \right]    (14)

A characteristic of this value function V^π(x) is its recursive property:

V^{\pi}(x_k) = \sum_{u_k} \pi(u_k \mid x_k) \sum_{x_{k+1}} p(x_{k+1} \mid x_k, u_k) \left[ r(x_k, u_k, x_{k+1}) + \gamma V^{\pi}(x_{k+1}) \right]    (15)

where E_π[·] represents the mathematical expectation while the controller follows the strategy π. Similarly, the value function Q^π(x_k, u_k) represents the value of taking action u_k in state x_k and then following the policy π:

Q^{\pi}(x_k, u_k) = E_{\pi}\left[ G_k \mid x_k = x, u_k = u \right] = E_{\pi}\left[ \sum_{l=0}^{\infty} \gamma^{l} R_{k+l+1} \mid x_k = x, u_k = u \right]    (16)

The following recursive expression can be derived:

Q^{\pi}(x_k, u_k) = \sum_{x_{k+1}} p(x_{k+1} \mid x_k, u_k) \left[ r(x_{k+1} \mid x_k, u_k) + \gamma V^{\pi}(x_{k+1}) \right]    (17)

Q^π is called the action-value function for policy π [1]. The last two equations represent the state-value function and the action-value function, which are the basis of reinforcement learning algorithms. In practice, one seeks the policies that produce the greatest long-term rewards.
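To make the recursion (15) concrete, the sketch below performs iterative policy evaluation on a small synthetic MDP (randomly generated transition probabilities and rewards, a uniform policy, and γ = 0.5). It is only a toy illustration of the backup in (15), not a model of the servo drive.

import numpy as np

n_states, n_actions, gamma = 3, 2, 0.5
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(x'|x,u), rows sum to 1
R = np.random.uniform(-1, 1, size=(n_states, n_actions, n_states))      # r(x,u,x'), arbitrary values
pi = np.full((n_states, n_actions), 1.0 / n_actions)                    # pi(u|x), uniform policy

def evaluate_policy(P, R, pi, gamma, sweeps=200):
    # repeatedly apply (15): V(x) = sum_u pi(u|x) sum_x' p(x'|x,u) [ r(x,u,x') + gamma V(x') ]
    V = np.zeros(n_states)
    for _ in range(sweeps):
        V = np.einsum("xu,xuy,xuy->x", pi, P, R + gamma * V[None, None, :])
    return V

print(evaluate_policy(P, R, pi, gamma))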
To determine whether a policy π_k is better than or equal to another policy π_{k+1}, the condition is π_k ≥ π_{k+1} if V^{π_k}(x_k) ≥ V^{π_{k+1}}(x_k) for all x_k ∈ X. There is, however, a policy that is better than all the others; it is known as the optimal policy, represented by π*, and its value function is:

V^{*}(x_k) = \max_{\pi_k} V^{\pi_k}(x_k)    (18)

for all x_k ∈ X. Optimal policies share the same optimal action-value function Q*, defined as:

Q^{*}(x_k, u_k) = \max_{\pi_k} Q^{\pi_k}(x_k, u_k)    (19)

Now, Q* can be expressed in terms of V* as:

Q^{*}(x_k, u_k) = E\left[ R_{k+1} + \gamma V^{*}(x_{k+1}) \mid x_k = x, u_k = u \right]    (20)

Using (15) and (16), the optimal value functions can be expressed recursively as:

V^{*}(x_k) = \max_{u_k} Q^{*}(x_k, u_k)
           = \max_{u_k} E\left[ G_k \mid x_k = x, u_k = u \right]
           = \max_{u_k} E\left[ \sum_{l=0}^{\infty} \gamma^{l} R_{k+l+1} \mid x_k = x, u_k = u \right]
           = \max_{u_k} E\left[ R_{k+1} + \gamma \sum_{l=0}^{\infty} \gamma^{l} R_{k+l+2} \mid x_k = x, u_k = u \right]
           = \max_{u_k} E\left[ R_{k+1} + \gamma V^{*}(x_{k+1}) \mid x_k = x, u_k = u \right]    (21)

V^{*}(x_k) = \max_{u_k} \sum_{x_{k+1}} p(x_{k+1} \mid x, u) \left[ r(x_k, u_k, x_{k+1}) + \gamma V^{*}(x_{k+1}) \right]    (22)

Similarly, for the Q values:

Q^{*}(x_k, u_k) = E\left[ R_{k+1} + \gamma \max_{u_{k+1}} Q^{*}(x_{k+1}, u_{k+1}) \mid x_k = x, u_k = u \right]
               = \sum_{x_{k+1}} p(x_{k+1} \mid x_k, u_k) \left[ r(x_k, u_k, x_{k+1}) + \gamma \max_{u_{k+1}} Q^{*}(x_{k+1}, u_{k+1}) \right]    (23)

The temporal difference method is used to estimate the function Q online. The Q-learning rule is defined as:

Q^{(k+1)}(x_k, u_k) = Q^{(k)}(x_k, u_k) + \alpha \left[ R^{(k)} + \gamma \max_{u_{k+1}} Q^{(k)}(x_{k+1}, u_{k+1}) - Q^{(k)}(x_k, u_k) \right]    (24)

where x_k and u_k are the state and the action at time step k, and R^{(k)} is the reward at time step k, R^{(k)} = r(x_k, u_k). Since u_k is bounded to the set {0, -1, 1} and the reward is a bounded signal, the infinite-horizon return is also bounded and the convergence condition of the Q-learning algorithm can be maintained. Here α is the learning rate, 0 < α ≤ 1, and γ is the discount factor, 0 ≤ γ ≤ 1. The applied policy is:

\pi = \beta \arg\max_{u_k} Q^{(k+1)}(x_k, u_k)    (25)

where β is a constant, β > 0. One of the conditions for the convergence of Q to Q* is that all states be visited an infinite number of times and that α decay adequately [17].
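A minimal tabular sketch of the update (24) and the action selection (25) is given below. The state index s is assumed to come from some discretization of the measurements, and the way the gain β scales the greedy action follows the reading of (25) adopted here; both points are assumptions for illustration.

import numpy as np

ACTIONS = np.array([0.0, -1.0, 1.0])   # admissible torque values
ALPHA, GAMMA, BETA = 0.55, 0.5, 3.5    # learning rate, discount factor, QL gain (values from Table 1)

def q_update(Q, s, a, reward, s_next):
    # tabular rule (24): Q(s,a) <- Q(s,a) + alpha [ r + gamma max_a' Q(s',a') - Q(s,a) ]
    Q[s, a] += ALPHA * (reward + GAMMA * np.max(Q[s_next]) - Q[s, a])

def rl_torque(Q, s):
    # action selection in the spirit of (25): the greedy action scaled by the constant beta
    return BETA * ACTIONS[int(np.argmax(Q[s]))]

# usage: Q = np.zeros((n_states, len(ACTIONS))); q_update(Q, s, a, r, s_next); u = rl_torque(Q, s)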
IV. EXPERIMENTAL RESULTS

This section presents the experimental results obtained with the proposed control methodology, which is applied on a prototype of the Center for Research and Advanced Studies of the IPN (CINVESTAV-IPN). The experiments were carried out on the servomechanism prototype with a direct-current motor shown in Fig. 2. The servomechanism used in the experiments is composed of a Clifton Precision JDTH-2250-BQ-IC motor, a tachogenerator, and an optical encoder. A Copley Controls power amplifier working in current mode drives the motor.

The STGII-8 acquisition card, installed in a computer, processes the signals of the US Digital E6 model 2500 incremental optical encoder, allowing the counting of 10,000 pulses per revolution, as well as the voltage-current relationship produced by the tachogenerator; these signals are sent to the power amplifier. The software used is Matlab-Simulink together with the WINCON real-time environment, which allows coding and executing the control algorithms.

Figure 2. Platform based on a servo drive system.

The dynamic model of the DC servo drive is given by equations (1) and (2), where a = f/J and b = k/J are positive parameters. The algorithm for learning a control policy is applied to the position control of the servo drive in the presence of a small amount of noise; it must find, by trial and error, a control policy that drives the servo drive to its reference and stabilizes it. The task is carried out over 1000 episodes with 1500 iterations each.

The Q-learning algorithm requires that the servo drive control its position and stabilize it. A reinforcement learning method such as Q-learning does not need the dynamics of the system. The idea of Q-learning is to estimate a Q value function, where Q(e_si, u_rj) is the expected sum of discounted future rewards obtained by performing an action u_r in a state e_s, as shown in Fig. 3. This figure shows the cost of taking an action u_r in a state e_s, known as Q. The states are defined through a quadratic sum of the position and velocity errors:

e_s = e_{position}^{2} + e_{velocity}^{2} = f(q, \dot{q}, q_d, \dot{q}_d)    (28)

where e_{position} = q_d - q with q_d ∈ [-π, π], e_{velocity} = \dot{q}_d - \dot{q} with \dot{q}_d ∈ [-π, π], and the actions are defined as the torque applied to the servo drive, τ = {0, -1, 1}.

Figure 3. Q-learning matrix.
The learning algorithm is carried out online; following (24), it is defined as:

Q^{(k+1)}(e_{si}, u_{rj}) = Q^{(k)}(e_{si}, u_{rj}) + \alpha \left[ R^{(k)} + \gamma \max_{u_r} Q^{(k)}(e_{si+1}, u_{rj+1}) - Q^{(k)}(e_{si}, u_{rj}) \right]    (29)

The applied policy is given by:

u_r = \beta \arg\max_{u_{rj}} Q^{(k)}(e_{si}, u_{rj})    (30)

Using (30), the input torque applied to the servo drive is obtained as:

u = u_r    (31)

where E is a finite set of states, E = { e_si | 1 ≤ i ≤ 7000 }, τ is a finite set of actions, τ = { u_rj | 1 ≤ j ≤ 3 }, R^{(k)} is a reward function given as R^{(k)} = -x_k^2 - 0.25 \dot{x}_k^2, which is an energy function, γ is defined as γ = 0.5, and α is the learning rate, α = 0.55. The Q-learning algorithm used is presented below:

Q-learning algorithm
  Initialize Q^(k)(e_si, u_rj)
  For each episode:
      Initialize e_si
      For each step of the episode:
          Select an action u_rj in e_si using the control policy derived from Q^(k)
          Execute the action u_rj, receive the reinforcement R^(k) and observe the next state e_si+1
          Update Q^(k)(e_si, u_rj) by
              Q^(k)(e_si, u_rj) ← Q^(k)(e_si, u_rj) + α [ R^(k) + γ max_ur Q^(k)(e_si+1, u_rj+1) − Q^(k)(e_si, u_rj) ]
          e_si ← e_si+1
      until e_si is a terminal state
  Define u = u_r = β arg max_urj Q^(k)(e_si, u_rj)
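A runnable sketch of the episode loop in the algorithm box is shown below, using the Table 1 parameters (α = 0.55, γ = 0.5, β = 3.5, a = 0.45, b = 31) and the 1000-episode, 1500-iteration schedule. Since the real platform is not available here, the loop is closed around an Euler simulation of (2); the time step, the uniform binning of the error measure (28), the zero reference, and the purely greedy action selection are assumptions rather than details given in the paper.

import numpy as np

A_PARAM, B_PARAM = 0.45, 31.0            # servo model parameters (Table 1)
ALPHA, GAMMA, BETA = 0.55, 0.5, 3.5      # learning rate, discount factor, QL control gain (Table 1)
ACTIONS = np.array([0.0, -1.0, 1.0])     # torque set
N_STATES, DT, ES_MAX = 7000, 1e-3, 79.0  # table size from the paper; DT and ES_MAX are assumptions

def servo_step(q, qdot, u):
    # one Euler step of the model (2)
    qdot += DT * (-A_PARAM * qdot + B_PARAM * u)
    q += DT * qdot
    return q, qdot

def state_index(q, qdot, q_des):
    # uniform binning of the error measure (28); the binning scheme itself is an assumption
    e_s = (q_des - q) ** 2 + qdot ** 2
    return int(np.clip(e_s / ES_MAX * (N_STATES - 1), 0, N_STATES - 1))

def train(episodes=1000, iterations=1500, q_des=0.0):
    Q = np.zeros((N_STATES, len(ACTIONS)))
    for _ in range(episodes):
        q, qdot = np.random.uniform(-np.pi, np.pi), 0.0   # initialize e_si with a random position error
        s = state_index(q, qdot, q_des)
        for _ in range(iterations):
            a = int(np.argmax(Q[s]))                      # greedy action, as in the box; eps-greedy could be used
            q, qdot = servo_step(q, qdot, BETA * ACTIONS[a])
            r = -(q_des - q) ** 2 - 0.25 * qdot ** 2      # energy reward (32) written on the errors
            s_next = state_index(q, qdot, q_des)
            Q[s, a] += ALPHA * (r + GAMMA * np.max(Q[s_next]) - Q[s, a])   # update (29)
            s = s_next
    return Q

if __name__ == "__main__":
    Q = train(episodes=10, iterations=1500)   # reduced episode count for a quick check
    print(Q[state_index(0.1, 0.0, 0.0)])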


The numerical results show the performance and effectiveness of the control. The results are obtained on the servo drive using Matlab-Simulink. The parameters used are shown in Table 1.

Table 1. Parameters for the servo drive using Q-learning control

    Parameter    Description        Value
    a            First parameter    0.45
    b            Second parameter   31
    β            QL control gain    3.5
    α            Learning rate      0.55
    γ            Discount factor    0.5
    τ            Action set         {0, 1, -1} Nm

Also, the energy function used is the following:

R^{(k)} = -x_k^{2} - 0.25 \dot{x}_k^{2}    (32)

The results shown are obtained after the Q matrix has already been trained over 1000 episodes with 1500 iterations per episode. Fig. 4 shows the results obtained by means of the Q-learning algorithm applied to the servo drive, where it can be seen that both the position q and the velocity q̇ settle around zero at about iteration 600.

Figure 4. (a) Position of the servo drive and (b) velocity of the servo drive using Q-learning.

Furthermore, adding a PD control to the Q matrix previously computed with the parameters of the initial plant (Table 1) gives great robustness to Q-learning when the parameter values change due to external factors, as happens in a real implementation. The graphs shown are the position q and the velocity q̇ of the servo drive. Table 2 shows the parameters for Q-learning + PD.

Table 2. Parameters for the servo drive using Q-learning + PD control

    Parameter    Description         Value
    a            First parameter     0.45
    b            Second parameter    31
    k_p          Proportional gain   150
    k_d          Derivative gain     30
    β            QL control gain     35
    τ            Action set          {0, 1, -1} Nm

The PD control gains were adjusted so that the controller stabilizes the servo drive as well as possible at its desired position, while in the Q-learning part only the gain β applied to the input torque, β u_r, was increased, keeping the same learning matrix Q^{(k)}(e_si, u_rj). The hybrid control is given as:

u = k_p e_{position} + k_d e_{velocity} + \beta u_r    (33)
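A short sketch of the hybrid law (33) with the Table 2 gains is given below. Here u_r is taken as the greedy action from the pre-trained Q matrix and is scaled once by β, which is one reading of how (30), (31), and (33) compose the reinforcement learning torque; the function and its arguments are illustrative, not part of the original implementation.

import numpy as np

KP, KD, BETA = 150.0, 30.0, 35.0       # hybrid controller gains (Table 2)
ACTIONS = np.array([0.0, -1.0, 1.0])

def hybrid_control(Q, s, q, qdot, q_des, qdot_des=0.0):
    # hybrid law (33): u = kp*e_position + kd*e_velocity + beta*u_r
    e_pos, e_vel = q_des - q, qdot_des - qdot
    u_r = ACTIONS[int(np.argmax(Q[s]))]  # greedy compensation from the pre-trained Q matrix
    return KP * e_pos + KD * e_vel + BETA * u_r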
In Fig. 5(a) a smooth trajectory is observed that reaches the desired position in fewer iterations than in the previous graph, where only the Q-learning control was applied. Fig. 5(b) also shows that the steady state is reached in a shorter time, with some noise.

Figure 5. (a) Position of the servo drive and (b) velocity of the servo drive using Q-learning + PD.

V. CONCLUSIONS

This paper presented a Q-learning algorithm implemented on a servo drive platform, where the Q matrix was learned over 1000 episodes with 1500 iterations per episode. After adding a disturbance to the plant and using the Q matrix already computed with the initial parameters, the Q-learning algorithm alone is no longer able to bring the servo drive to its desired position. Finally, a PD control was added to the Q-learning algorithm to give robustness to the control, and it was observed that the servo drive reached its desired position and velocity. The figures show the effectiveness and the performance of the proposed control. It should be mentioned that the PD control with the gain values defined in Table 2 and the Q-learning algorithm operate cooperatively, so that the servo drive reaches its final position with a shorter settling time.

REFERENCES

[1] R. S. Sutton and A. G. Barto, "Reinforcement Learning: An Introduction", The MIT Press, March 1998. ISBN 0262193981.
[2] M. P. Deisenroth, G. Neumann, and J. Peters, "A survey on policy search for robotics", Foundations and Trends in Robotics, vol. 2, pp. 1-142, 2013.
[3] A. S. Polydoros and L. Nalpantidis, "Survey of model-based reinforcement learning: Applications on robotics," Journal of Intelligent & Robotic Systems, vol. 86, pp. 153-173, 2017.
[4] M. Ghavamzadeh, S. Mannor, J. Pineau, and A. Tamar, "Bayesian reinforcement learning: A survey", Foundations and Trends in Machine Learning, vol. 8, no. 5-6, pp. 359-483, 2015.
[5] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237-285, 1996.
[6] T. Moerland, J. Broekens, and C. M. Jonker, "Emotion in reinforcement learning agents and robots: A survey," arXiv preprint arXiv:1705.05172, 2017.
[7] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey", The International Journal of Robotics Research, vol. 32, pp. 1238-1274, 2013.
[8] J. Chai and M. Hayashibe, "Motor synergy development in high-performing deep reinforcement learning algorithms," IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 1271-1278, April 2020, doi: 10.1109/LRA.2020.2968067.
[9] J. Liu, S. Qu, W. Chen, J. Chu, and Y. Sun, "Online adaptive decoding of motor imagery based on reinforcement learning," 2019 14th IEEE Conference on Industrial Electronics and Applications (ICIEA), Xi'an, China, 2019, pp. 522-527, doi: 10.1109/ICIEA.2019.8833778.
[10] N. J. Cho, S. H. Lee, I. H. Suh, and H. Kim, "Relationship between the order for motor skill transfer and motion complexity in reinforcement learning," IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 293-300, April 2019, doi: 10.1109/LRA.2018.2889026.
[11] R. Garrido and R. Miranda, "DC servomechanism parameter identification: A closed loop input error approach," ISA Transactions, vol. 51, no. 1, pp. 42-49, 2012.
[12] M. W. Spong and M. Vidyasagar, "Robot Dynamics and Control," John Wiley & Sons Inc., Canada, 1989.
[13] F. L. Lewis, A. Yesildirek, and K. Liu, "Multilayer neural-net robot controller with guaranteed tracking performance," IEEE Transactions on Neural Networks, vol. 7, no. 2, pp. 388-399, 1996.
[14] F. L. Lewis, "Neural network control of robot manipulators," IEEE Expert, vol. 11, no. 2, pp. 64-75, 1996.
[15] D. Hernandez, W. Yu, and M. A. Moreno-Armendariz, "Neural PD control with second-order sliding mode compensation for robot manipulators," The 2011 International Joint Conference on Neural Networks, San Jose, CA, 2011, pp. 2395-2402, doi: 10.1109/IJCNN.2011.6033529.
[16] R. Figueroa, A. Faust, P. Cruz, L. Tapia, and R. Fierro, "Reinforcement learning for balancing a flying inverted pendulum," Intelligent Control and Automation (WCICA), 2014 11th World Congress on, IEEE, 2014.
[17] M. Hehn and R. D'Andrea, "A flying inverted pendulum," Robotics and Automation (ICRA), 2011 IEEE International Conference on, IEEE, 2011.
the servo drive to its upper position.
