PD Control Based On Reinforcement Learning Compensation For A DC Servo Drive
Abstract—This paper presents a control algorithm based on reinforcement learning for a DC servo drive. The Q-learning technique is implemented together with a PD (proportional derivative) control scheme for the DC servo drive. This enables the controller to improve its performance online and to adapt to changes in the parameters of the servo drive model. Q-learning is used to obtain a control that can move the servo drive towards its desired position and velocity. In addition, the hybrid combination of Q-learning and PD control can bring the servo drive to its desired position and velocity regardless of the applied disturbance. Finally, experimental results show the effectiveness of the proposed controller.

Keywords—Reinforcement Learning, Q-learning, Proportional Derivative Controller, DC Servo drive.

I. INTRODUCTION

Given the growing diversification of available tools, interest in reinforcement learning has increased, and it has become one of the most active research areas in machine learning. It has multiple applications in classical control, robust control, optimal control, and robotics. In practical terms, the central objective is to learn how states are mapped to actions while maximizing a reward signal [1]. Another challenge frequently encountered in automatic control is the generation of good rewards that concisely define the interaction between the process and the controller [2]. For example, rewards that bring the plant or process quickly to its goal are necessary in a physical implementation; this is known as the greedy reward problem and represents a major issue in reinforcement learning. Specifying good rewards in automatic control requires a lot of experience and is always difficult in practice [3].

In the context of artificial intelligence, computational systems that exhibit intelligent behavior are studied and constructed: behaviors that demonstrate, among other things, that the system can represent knowledge of the real world, follow a process of reasoning, plan, learn, make decisions, and use those decisions to achieve various objectives [4]. Machine learning is an area that in the last ten years has attracted many researchers and has contributed to enormous advances in other areas, such as web search, cognitive robotics, and data mining [5]. The term machine learning refers to a set of computational methods or algorithms designed to make a computer system capable of improving its performance over time using its experience. Depending on the way in which the experience is defined, one can speak of three general types of machine learning: reinforcement learning, supervised learning, and unsupervised learning.

This paper is based on reinforcement learning, in which an agent learns to reach an objective through its interaction with the environment [6]. This problem is commonly attacked using the MDP framework. However, in the reinforcement learning problem, neither the transition probabilities of the states ρ nor the reward function r is known, which prevents the use of any value iteration or policy iteration algorithm. Instead, the agent can interact with the environment many times and in that way collect information about the rewards obtained and the states visited after performing different actions in different states.

Several algorithms can then use the data collected by the agent to solve the reinforcement learning problem [5]. Most of these algorithms calculate or estimate the value function or the action-value function after many interactions with the environment. In reinforcement learning problems, the value function can be stored using tables. However, for large problems it is necessary to replace the table with a more sophisticated structure; this is known as function approximation. Another promising way of dealing with complex state and action spaces is based on the development of temporal abstractions, using hierarchical methods. One of the main advances in reinforcement learning research was the introduction of temporal difference methods, a class of incremental learning procedures specialized in prediction problems. These methods are driven by the error, or difference, between successive temporal predictions of the states; learning occurs whenever a prediction changes over time. Reinforcement learning focuses on applying classical optimal control techniques to models that learn through repeated interaction with the environment [2], [4]. In addition to its relationship with optimal control, it also has a close relationship with simulation-based optimization and dynamic programming [7].

The paper layout is as follows. Section II presents the PD control with compensation. Section III describes reinforcement learning with compensation. Section IV
presents the experimental results using a DC servo drive. Finally, Section V concludes the paper.

II. PD CONTROL WITH COMPENSATION

Reinforcement learning has been used for robots. However, in this paper it is implemented for a DC servo drive, which differs considerably from most systems studied in reinforcement learning. There are several applications where a motor has been used. One case is motor synergy development for high performance using a deep reinforcement learning algorithm [8]. In [9], online adaptive decoding of motor imagery based on reinforcement learning is presented, since the development of electronic and computer technology is making non-invasive EEG acquisition systems more universally adopted. Finally, [10] gives a relationship between the ordering of motor skill transfer and motion complexity in reinforcement learning: the method generates an order for learning and transferring motor skills based on motion complexity, then evaluates that order by learning the motor skills of one task and transferring them to another task as a form of reinforcement learning (see Fig. 1).

Figure 1. Reinforcement learning scheme.

It is very difficult to assume that the true state can be observed completely and without noise. It is also known that in reinforcement learning it is very difficult to know exactly which state the system is in; sometimes, very different states can seem very similar. Therefore, in the study of the DC servo drive using reinforcement learning, the model is often assumed to be observable. On an experimental platform, when estimating the states, filters should be used to obtain a better estimate, which is expensive, tedious, and difficult to reproduce. However, although experience on real systems is difficult and expensive to obtain, it can never be replaced by learning in simulation.

The servo drive used here has the dynamics of a DC motor, which can be defined by the following second-order model:

J\ddot{q}(t) + f\dot{q}(t) = k u(t)    (1)

where q is the angular position, u(t) the control input voltage, J the motor and load inertia, f the viscous friction, and k the amplifier gain. The servomechanism is composed of a brushed servomotor, a power amplifier, and a position sensor. The power amplifier is set to current mode; therefore, the electromagnetic torque is proportional to the input voltage applied to the amplifier. This approach also works for both DC and AC brushless servomotors. Observe that (1) can be written as:

\ddot{q}(t) = -a\dot{q}(t) + b u(t)    (2)

where a = f/J and b = k/J are positive parameters. The estimated parameter values for this platform were obtained via the identification algorithm proposed in [11], giving a = 0.45 and b = 31. It is known that PD control with friction-compensation terms can achieve asymptotic stability [12]. The use of neural networks to compensate the dynamics can be found in [13] and [14]. If F and G are unknown, it is feasible to use a neural network to approximate them as:

G + F - g = \hat{W}_t \sigma(x) + \tilde{\epsilon}    (3)

where \tilde{\epsilon} is a bounded modeling error, \Lambda_1 is a matrix such that \Lambda_1 = \Lambda_1^T \ge 0, \sigma(x) is the activation function, and x represents the input of the Neural Network (NN) [15]:

x = [q, \dot{q}, q_d, \dot{q}_d]^T    (4)

and g characterizes uncertainties. The neural PD control is:

u = -k_v r - \hat{W}_t \sigma(x)    (5)

The PD control with the reinforcement learning compensation is given as:

u = k_p e + k_d \dot{e} + u_r    (6)

In the scheme given by (6), e = q - q_d is the error signal and u_r is the torque generated by the reinforcement learning.

III. REINFORCEMENT LEARNING WITH COMPENSATION

The goal in the reinforcement learning problem is to achieve a good interaction with the environment, or process, so as to reach the objectives or goals. The servo drive, or agent, also known as the decision-maker, learns by trial and error to select an optimal policy [7]. The process, or environment, is everything that is outside the robot and interacts with it. Through this continuous interaction the system learns to select its actions, and the environment responds to these actions by giving a reward for each action taken and leaving the DC servo drive in a new situation or state [16]. The reward provided by the environment is a numerical value that the robot or agent tries to maximize over time. Fig. 1 shows a representation of how the process and the servo drive interact, generating actions and awarding rewards.

A Markov decision problem with finite states and finite actions can be described as follows: at each discrete time step k = 1, 2, 3, ..., the controller (servo
drive/agent) observes the state x_k of the Markov process, selects an action u_k, receives a reward r_k, and observes the next state x_{k+1}. The probability distribution of receiving the reward r and reaching the state x_{k+1} depends only on x_k, u_k, and r_k. It can be observed that their mathematical expectations are finite. The goal is to find a control law that maximizes the mathematical expectation of the future reward values at each time step, which means that at any discrete time step k the control law should apply an action u_k that maximizes the reward. In addition, at each time step, the system or agent maps the states to the probabilities of generating the next action. This is known as the control strategy, or policy, and will be represented by \pi.

The aim of reinforcement learning is to maximize not only the immediate reward but the long-term reward, which is represented as follows:

G_k = R_{k+1} + R_{k+2} + R_{k+3} + \cdots + R_T    (7)

where T represents the final time; however, this sum can grow to infinity. It is better to use a discounted return, which is conceptually more complex but mathematically simpler, since it reduces the contribution of rewards received further in the future:

G_k = R_{k+1} + \gamma R_{k+2} + \gamma^2 R_{k+3} + \cdots = \sum_{l=0}^{\infty} \gamma^l R_{k+l+1}    (8)

where \gamma is a gain with 0 \le \gamma \le 1 known as the discount rate. Although there are several ways to account for future rewards, three optimality models are commonly used, presented next.

Finite horizon: the controller tries to optimize the mathematical expectation of the rewards accumulated over a finite time h, disregarding what happens afterwards:

E\left[ \sum_{l=0}^{h} R_l \right]    (9)

where R_l is the reward received after l time steps in the future.

Infinite horizon: the reward received by the controller (agent/system/robot) is geometrically discounted using a factor \gamma (0 \le \gamma < 1):

E\left[ \sum_{l=0}^{\infty} \gamma^l R_l \right]    (10)

The process is given by a Markov decision process, that is, a reinforcement learning task that satisfies the Markov property, which says that given any state x_k and action u_k, the probability of the next state x_{k+1} is defined as:

p(x_{k+1} \mid x_k, u_k) = \Pr\{ x_{k+1} \mid x_k = x, u_k = u \}    (12)

This equation is known as the transition probability. In a similar way, the expected reward can be defined, which has as parameters the current state x_k, the action u_k, and the next state x_{k+1}:

r(x_{k+1} \mid x_k, u_k) = E\{ R_{k+1} \mid x_k = x, u_k = u, x_{k+1} \}    (13)

Equations (12) and (13) represent the dynamics of a finite Markov decision process. Now, the value functions can be defined for particular strategies, where a strategy \pi is a mapping from states to actions, and \pi(u_k \mid x_k) is the probability of taking action u_k in state x_k. The function V^\pi(x) is defined as the value of a state under the strategy \pi; it is the expected return when starting at x_k and following \pi. Therefore, the value function of the Markov decision process can be defined as [1]:

V^\pi(x_k) = E_\pi\{ G_k \mid x_k = x \} = E_\pi\left\{ \sum_{l=0}^{\infty} \gamma^l R_{k+l+1} \,\middle|\, x_k = x \right\}    (14)

A characteristic of this value function V^\pi(x) is its recursive property:

V^\pi(x_k) = \sum_{u_k} \pi(u_k \mid x_k) \sum_{x_{k+1}} p(x_{k+1} \mid x_k, u_k) \left[ r(x_k, u_k, x_{k+1}) + \gamma V^\pi(x_{k+1}) \right]    (15)

where E_\pi represents the mathematical expectation while the controller follows the strategy \pi. Similarly, the action-value function Q^\pi(x_k, u_k) represents the value of taking action u_k in state x_k while following the policy \pi:

Q^\pi(x_k, u_k) = E_\pi\{ G_k \mid x_k = x, u_k = u \} = E_\pi\left\{ \sum_{l=0}^{\infty} \gamma^l R_{k+l+1} \,\middle|\, x_k = x, u_k = u \right\}    (16)

The following recursive expression can be derived:

Q^\pi(x_k, u_k) = \sum_{x_{k+1}} p(x_{k+1} \mid x_k, u_k) \left[ r(x_{k+1} \mid x_k, u_k) + \gamma V^\pi(x_{k+1}) \right]    (17)
for all x_k \in X. Optimal policies share the same optimal action-value function Q^*, defined as:

Q^*(x_k, u_k) = \max_\pi Q^\pi(x_k, u_k)    (19)

Now, Q^* can be expressed in terms of V^*:

Q^*(x_k, u_k) = E\{ R_{k+1} + \gamma V^*(x_{k+1}) \mid x_k = x, u_k = u \}    (20)

Using equations (15) and (16), the optimal value functions can be expressed recursively as:

V^*(x_k) = \max_{u_k} Q^*(x_k, u_k)
         = \max_{u_k} E\{ G_k \mid x_k = x, u_k = u \}
         = \max_{u_k} E\left\{ \sum_{l=0}^{\infty} \gamma^l R_{k+l+1} \,\middle|\, x_k = x, u_k = u \right\}
         = \max_{u_k} E\left\{ R_{k+1} + \gamma \sum_{l=0}^{\infty} \gamma^l R_{k+l+2} \,\middle|\, x_k = x, u_k = u \right\}
         = \max_{u_k} E\{ R_{k+1} + \gamma V^*(x_{k+1}) \mid x_k = x, u_k = u \}    (21)

V^*(x_k) = \max_{u_k} \sum_{x_{k+1}} p(x_{k+1} \mid x, u) \left[ r(x_k, u_k, x_{k+1}) + \gamma V^*(x_{k+1}) \right]    (22)

Similarly, the optimal Q-values are given as:

Q^*(x_k, u_k) = E\{ R_{k+1} + \gamma \max_{u_{k+1}} Q^*(x_{k+1}, u_{k+1}) \mid x_k = x, u_k = u \}
             = \sum_{x_{k+1}} p(x_{k+1} \mid x_k, u_k) \left[ r(x_k, u_k, x_{k+1}) + \gamma \max_{u_{k+1}} Q^*(x_{k+1}, u_{k+1}) \right]    (23)

The method of temporal differences is used to estimate the function Q online. The Q-learning rule is defined as:

Q^{(k+1)}(x_k, u_k) = Q^{(k)}(x_k, u_k) + \alpha \left[ R^{(k)} + \gamma \max_{u_{k+1}} Q^{(k)}(x_{k+1}, u_{k+1}) - Q^{(k)}(x_k, u_k) \right]    (24)

where x_k, u_k are the state and the action at time step k, \alpha is the learning rate, and R^{(k)} is the reward at time step k, R^{(k)} = r(x_k, u_k); hence u_k is bounded to the set (0, -1, 1) and the reward is a

IV. EXPERIMENTAL RESULTS

This section presents the experimental results of the proposed control methodology, applied on a prototype from the Center for Research and Advanced Studies of the IPN. The experiments were carried out on the servomechanism prototype with a direct-current motor shown in Fig. 2. The servomechanism used in the experiments is composed of a Clifton Precision JDTH-2250-BQ-IC motor, a tach generator, and an optical encoder. A Copley Controls power amplifier working in current mode drives the motor.

The STGII-8 acquisition card, which is integrated into a computer, processes the signals of the US Digital E6 model 2500 incremental optical encoder, allowing the counting of 10,000 pulses per revolution, and the voltage-current relationship produced by the tach generator; these signals are sent to the power amplifier. The software used is Matlab-Simulink, together with the WINCON real-time environment, allowing the coding and execution of the control algorithms.

Figure 2. Platform based on a servo drive system.
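As an illustration of how the pieces above fit together, the following minimal sketch simulates the servo model (2) with the identified values a = 0.45 and b = 31, applies the control law (6), and adapts the compensation torque u_r with the tabular Q-learning update (24) over the bounded action set {-1, 0, 1}. The PD gains, sampling step, reward, and error discretization below are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

a, b = 0.45, 31.0           # identified servo parameters from (2)
kp, kd = 2.0, 0.3           # hypothetical PD gains for (6)
alpha, gamma = 0.1, 0.9     # learning rate and discount rate for (24)
eps = 0.1                   # epsilon-greedy exploration probability
dt = 0.001                  # integration step (assumed)
actions = [-1.0, 0.0, 1.0]  # bounded compensation torques u_r

n_bins = 21
edges = np.linspace(-1.0, 1.0, n_bins - 1)

def discretize(e, ed):
    """Map the continuous error and error rate to one table index."""
    i = int(np.digitize(np.clip(e, -1, 1), edges))
    j = int(np.digitize(np.clip(ed, -1, 1), edges))
    return i * n_bins + j

Q = np.zeros((n_bins * n_bins, len(actions)))
rng = np.random.default_rng(0)
q, qd, qdes = 0.0, 0.0, 0.5          # position, velocity, set point

x = discretize(qdes - q, -qd)
for k in range(20000):
    # epsilon-greedy selection of the compensation action u_r
    u_idx = rng.integers(len(actions)) if rng.random() < eps else int(np.argmax(Q[x]))
    e, ed = qdes - q, -qd
    u = kp * e + kd * ed + actions[u_idx]   # control law (6)
    qdd = -a * qd + b * u                   # servo model (2)
    qd += qdd * dt                          # Euler integration
    q += qd * dt
    e2, ed2 = qdes - q, -qd
    r = -(e2 ** 2 + 0.1 * ed2 ** 2)         # illustrative quadratic-cost reward
    x2 = discretize(e2, ed2)
    # tabular Q-learning update (24)
    Q[x, u_idx] += alpha * (r + gamma * Q[x2].max() - Q[x, u_idx])
    x = x2

print(f"final position error: {abs(qdes - q):.4f}")
```

The PD terms alone stabilize the nominal model; the learned u_r only has to counteract what the PD terms miss, which is why a coarse table over the error and its rate is enough for this sketch.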
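The Bellman relations (21)-(23) can also be checked numerically. The short sketch below runs value iteration on a small hypothetical two-state, two-action MDP (the transition probabilities and rewards are invented for illustration only) and verifies that at the fixed point V*(x) = max_u Q*(x, u) holds:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[x, u, x2] = p(x2 | x, u),
# R[x, u, x2] = r(x, u, x2), with discount rate gamma.
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 1.0]]])

V = np.zeros(2)
for _ in range(500):
    Qv = (P * (R + gamma * V)).sum(axis=2)  # backup (23) using current V
    V_new = Qv.max(axis=1)                  # greedy improvement, as in (21)
    if np.abs(V_new - V).max() < 1e-10:     # converged to the fixed point of (22)
        break
    V = V_new

Qv = (P * (R + gamma * V)).sum(axis=2)
print(np.round(V, 3))
```

Because gamma < 1, the backup is a contraction, so the iteration converges and the resulting V and Qv satisfy the optimality relation exactly up to the stopping tolerance.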