
Reinforcement Learning Control of Robot Manipulators in Uncertain Environments

Hitesh Shah, M. Gopal


Electrical Engineering Department, IIT-Delhi, New Delhi-110016, INDIA E-mail: iitd.hitesh@gmail.com, mgopal@ee.iitd.ac.in
Abstract- Considerable attention has been given to the design of stable controllers for robot manipulators in the presence of uncertainties. We investigate here the robust tracking performance of reinforcement learning control of manipulators subjected to parameter variations and extraneous disturbances. Robustness properties, in terms of average error, absolute maximum error, and absolute maximum control effort, have been compared for reinforcement learning systems using various parameterized function approximators, such as fuzzy, neural network, decision tree, and support vector machine. Simulation results show the importance of fuzzy Q-learning control. Further improvements in this control approach through dynamic fuzzy Q-learning have also been highlighted.

I. INTRODUCTION

Since the dynamics of robot manipulators are highly nonlinear and may contain uncertain elements, many efforts have been made to develop control schemes that achieve precise tracking control of robot manipulators. Conventionally, many control techniques for robot manipulators rely on proportional-integral-derivative (PID) type controllers in industrial operations, due to their simple control structure, ease of design, and low cost. However, robot manipulators face various uncertainties in practical applications, such as payload variations, internal friction, and external disturbances [1]. Therefore, PID controllers need to be reinforced using some sort of adaptive technique.

Robust control of robot manipulators can be effectively addressed under the reinforcement learning (RL) paradigm [1]. RL is a computationally simple, direct approach to the adaptive optimal control of nonlinear systems [2]. RL agents (controllers) can learn optimal control policies mapping states to actions for a given objective, maximizing the long-term discounted reward by interacting with an environment whose dynamic model is unknown. One of the most popular RL approaches is Q-learning [3]. It has been widely used in robotic domains for its simplicity and well-developed theory. Usually, Q-learning is applied to a discrete set of states and actions in its standard tabular formulation. However, in real robot applications, state and action spaces are continuous; discrete Q-learning cannot be directly used due to the problems of discretization and the curse of dimensionality. Also, learning will not converge in a reasonable time, and generalization is not guaranteed when large spaces must be explored by the agent. Approximating the value function with function approximators (FA), which are capable of both handling continuous variables and generalizing experience among similar situations, is a possible solution. In the literature, a number of research results have been reported

for the extension of RL methods to continuous state spaces by means of FA such as neural networks (NN) [4], decision trees (DT) [5], and support vector machines (SVM) [6]. However, these results assume discrete actions. In realistic applications, it is imperative to deal with continuous states and actions. A fuzzy inference system (FIS) can be used to facilitate generalization in continuous state space and to generate continuous actions [7].

In this paper, our motivation is to investigate the performance of various FA to enhance the capability of robot manipulators for handling uncertainties in terms of payload variations and extraneous disturbances. To fulfill this objective, we compare the robustness properties of various reinforcement learning controllers, namely the fuzzy Q-learning controller (FQC) [7]-[9], neural network Q-learning controller (NNQC) [4], decision tree Q-learning controller (DTQC) [5], and support vector machine Q-learning controller (SVMQC) [6], in terms of average error, absolute maximum error, and maximum control effort with respect to payload variations and torque disturbances. Simulation results show the importance of FQC adaptive learning control in robotics. Further improvements in FQC, made possible through dynamic fuzzy Q-learning [10], have also been highlighted.

The paper is organized as follows. Section II presents the theoretical background for various approaches to value function estimation in reinforcement controller design: fuzzy Q-learning (FQL) with a new strategy to increase the learning ability of the controller, neural network Q-learning, decision tree Q-learning, and support vector machine Q-learning. Section III gives details of controller realization. Section IV compares and discusses the empirical performance study on the basis of simulation results; additionally, this section highlights the features of the dynamic fuzzy Q-learning controller (DFQC) in comparison with FQC. Finally, in Section V, we conclude and indicate directions for future work.

II. APPROACHES TO VALUE FUNCTION ESTIMATION

We consider an adaptive learning agent (controller) that interacts with its environment (a discrete-time dynamical system). For each state $x^k \in X$ of the dynamical system at discrete time $k$, there is a finite set of possible decisions (control actions), $u(x^k) \in U(x^k)$, that may be taken by the learning agent. The goal of the RL agent is to find the optimal policy. The value of a state is defined as the sum of the rewards received when starting in that state and following the policy to a terminal state. The value function can be approximated using any parameterized function approximator such as a FIS, NN, DT, or SVM.
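To make the tabular baseline that these function approximators replace concrete, a minimal sketch of discrete Q-learning is given below; the state/action counts, learning rate, and exploration level are illustrative assumptions, not the settings used in this paper.

```python
import numpy as np

# Minimal tabular Q-learning sketch (Watkins [3]) for a discretized problem.
# The state/action counts and hyperparameters are illustrative assumptions,
# not the settings used in this paper.
n_states, n_actions = 100, 3
gamma, eta, epsilon = 0.8, 0.2, 0.1   # discount, learning rate, exploration
rng = np.random.default_rng(0)

Q = np.zeros((n_states, n_actions))   # the look-up table the FA methods replace

def select_action(s):
    """Epsilon-greedy exploration/exploitation policy."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def update(s, a, r, s_next):
    """One-step Q-learning: Q(s,a) += eta * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += eta * (td_target - Q[s, a])
```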

A. Fuzzy Q-learning

FQL is an RL method to tune FIS conclusions. It is an adaptation of Watkins' Q-learning [3] for FIS, where both the actions and the Q-functions are inferred from fuzzy rules. In the FIS structure, the rule-base antecedent part (FIS(A)) depends on the number of variables in the state vector and the number of fuzzy sets used to sense each variable. There can be various types of membership functions, e.g., triangular, trapezoidal, Gaussian, etc. The N rules form the FIS rule base. The consequent part (FIS(B)) adjusts/tunes only the conclusion part of the fuzzy rules in an incremental manner, as per the concept of FQL specified in [7]. Input fuzzy labels are set using a priori task knowledge of the user. The architecture of the FQC is shown in Fig. 1. For an input
vector $x^k = [x_1^k, x_2^k, \ldots, x_n^k]$, we find the truth value $\alpha_i(x^k)$ of each rule $R_i$.

Fig. 1 FQC architecture (FIS(A) rule antecedents, FIS(B) conclusion parameters $q(i, u_i)$, global Q-value and global action defuzzifiers, $\epsilon$-greedy action selector, TD-error computation, error metric evaluator, inner PD loop with gain $K_v$, and the unknown plant).

Each rule $R_i$ has $m$ possible discrete control actions $U = \{u_1, u_2, \ldots, u_m\}$, and a parameter $q$ associated with each control action. The parameter $q$ assigns to each action in $R_i$ a quality with respect to the task, and is used to select actions so as to maximize the discounted sum of reinforcements received by the controller. FQL uses a simplified Takagi-Sugeno FIS with $N$ rules; therefore, the inferred global continuous action $u^k$ for an input vector $x^k$ and rule truth values $(\alpha_i(x^k))_{i=1}^{N}$ is

$u^k = \sum_{i=1}^{N} \alpha_i(x^k)\, u_i \Big/ \sum_{i=1}^{N} \alpha_i(x^k)$,

where $u_i \in U$ is the action selected in rule $R_i$. In order to explore the set of possible actions and acquire experience through the RL signals, actions are selected using an exploration/exploitation policy (EEP) [11]. If $u_i$, the action selected in rule $R_i$, is chosen by $\epsilon$-greedy (a function implementing the EEP strategy), while $u_i^*$ is the maximizing action, i.e., $q(i, u_i^*) = \max_{b \in U} q(i, b)$, then the Q-value for the inferred action $u^k$ is

$Q(x^k, u^k) = \sum_{i=1}^{N} \alpha_i(x^k)\, q(i, u_i) \Big/ \sum_{i=1}^{N} \alpha_i(x^k)$,

and the value of state $x^k$ is

$V(x^k) = \sum_{i=1}^{N} \alpha_i(x^k)\, q(i, u_i^*) \Big/ \sum_{i=1}^{N} \alpha_i(x^k)$.

Under action $u^k$, the system undergoes the transition $x^k \rightarrow x^{k+1}$, and the cost incurred by the agent is $c^k$. This information is used to calculate the temporal-difference (TD) [11] approximation error

$\tilde{Q} = c^k + \gamma V(x^{k+1}) - Q(x^k, u^k)$,

and the $q$ parameter values are updated as

$q(i, u_i) \leftarrow q(i, u_i) + \eta\, \tilde{Q}\, \alpha_i(x^k) \Big/ \sum_{i=1}^{N} \alpha_i(x^k)$.

Q-values are updated as

$Q(x^k, u^k) \leftarrow Q(x^k, u^k) + \eta \left[ c^k + \gamma V(x^{k+1}) - Q(x^k, u^k) \right] \qquad (1)$

where $0 \le \gamma \le 1$ is the discount factor that controls how much effect future costs have on current optimal decisions, and $\eta$ is the learning-rate parameter. We design the reinforcement FIS as per [9].
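A minimal sketch of the FQL computations above (the global action, the global Q-value and state value, and the $q$-parameter update leading to (1)) is given below; the rule count, action set, and hyperparameters are placeholders, and the firing strengths $\alpha_i(x^k)$ are assumed to be supplied by a separately defined fuzzy antecedent layer.

```python
import numpy as np

# Sketch of the FQL conclusion-part (FIS(B)) update described above.
# The antecedent layer producing the firing strengths alpha_i(x) is assumed given;
# N, U, and the hyperparameters are placeholders, not the paper's exact settings.
N = 9                                    # number of fuzzy rules (assumed)
U = np.array([-2.0, 0.0, 2.0])           # discrete action set per rule (assumed)
gamma, eta, epsilon = 0.8, 0.2, 0.1
q = np.zeros((N, len(U)))                # q(i, u): quality of action u in rule i
rng = np.random.default_rng(0)

def select_actions(alpha):
    """Per-rule epsilon-greedy choice and the inferred global continuous action u^k."""
    chosen = np.array([rng.integers(len(U)) if rng.random() < epsilon
                       else int(np.argmax(q[i])) for i in range(N)])
    u_global = float(alpha @ U[chosen]) / alpha.sum()
    return chosen, u_global

def q_value(alpha, chosen):
    """Global Q of the inferred action: weighted average of q(i, u_i)."""
    return float(alpha @ q[np.arange(N), chosen]) / alpha.sum()

def v_value(alpha):
    """State value: weighted average of the per-rule maxima q(i, u_i*)."""
    return float(alpha @ q.max(axis=1)) / alpha.sum()

def update(alpha, chosen, cost, alpha_next):
    """TD error and q-parameter update, matching (1)."""
    td = cost + gamma * v_value(alpha_next) - q_value(alpha, chosen)
    q[np.arange(N), chosen] += eta * td * alpha / alpha.sum()
```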

B. Neural Q-learning

NN function approximation is a competent way of substituting look-up tables. The NN generalizes among states and actions, and reduces the number of Q-values stored in a look-up table to a set of NN weights. In this work, we consider the approximation of the Q-function using a feed-forward NN with the back propagation (BP) algorithm. In the BP algorithm, the error propagates backward from the output nodes to the inner nodes to adjust the network's weights. In implementation, the NN has the state-action pair as input and the Q-value corresponding to that state-action pair as output. In particular, the NN configuration comprises one or two hidden layers containing a set of neurons, and an output layer with one neuron. The number of layers and neurons depends on the complexity and the dimensions of the value function to be approximated. For the hidden layers, a hyperbolic tangent function is used as the activation function. This function is antisymmetric and accelerates the learning process. The output layer has a linear activation function, which allows the NN to approximate any real value. The initialization of the NN weights is done randomly.
Fig. 2 NNQC architecture (the neural network estimates $\hat{Q}(x^k, u, w^k)$ from the state-action pair, an $\epsilon$-greedy action selector picks $u^k$, the plant transition $x^k \rightarrow x^{k+1}$ yields the cost $c^k$, and the TD error, formed using $V(x^{k+1}) = \max_{u \in U(x^{k+1})} \hat{Q}(x^{k+1}, u, w^k)$, drives the weight update).

Fig. 2 shows the architecture of the NNQC. The neural network gives a more compact representation of Q-values than a look-up table, and also allows us to interpolate $Q(x^k, u^k)$ for state-action pairs that have not been visited. The tabular Q-function, $Q(x^k, u^k)$, is replaced by a parameterized function $\hat{Q}(x^k, u^k, w^k)$, where $w^k$ is the set of parameters describing the approximation architecture used, i.e., the weights and biases of the network at time step $k$. The update equation for the transition from state $x^k$ to $x^{k+1}$ becomes

$\hat{Q}(x^k, u^k, w^k) \leftarrow \hat{Q}(x^k, u^k, w^k) + \eta \left[ c^k + \gamma V(x^{k+1}) - \hat{Q}(x^k, u^k, w^k) \right] \qquad (2)$
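As a rough sketch of how update (2) might be realized for a single-hidden-layer network (tanh hidden units, linear output), one backpropagation step per transition toward the TD target could look as follows; the input size, hidden width, and step size are assumptions, and $V(x^{k+1})$ is taken as the maximum of the network output over the discrete action set at the next state.

```python
import numpy as np

# Sketch of one TD-driven backpropagation step for Q_hat(x, u; w), as in (2).
# Layer sizes and the step size are assumptions; the input is the state-action pair.
rng = np.random.default_rng(0)
n_in, n_hid = 5, 18                       # e.g. four state variables + one action
W1 = 0.1 * rng.standard_normal((n_hid, n_in))
b1 = np.zeros(n_hid)
W2 = 0.1 * rng.standard_normal(n_hid)
b2 = 0.0
gamma, lr = 0.8, 0.01

def q_hat(xu):
    h = np.tanh(W1 @ xu + b1)             # tanh hidden layer
    return h, float(W2 @ h + b2)          # linear output: estimated Q-value

def v_next(x_next, actions):
    """V(x^{k+1}): max of Q_hat over the discrete action set at the next state."""
    return max(q_hat(np.append(x_next, a))[1] for a in actions)

def td_step(xu, cost, v_of_next_state):
    """Move Q_hat(x,u) toward the TD target c^k + gamma * V(x^{k+1})."""
    global W1, b1, W2, b2
    h, q = q_hat(xu)
    delta = (cost + gamma * v_of_next_state) - q   # TD error
    grad_h = delta * W2 * (1.0 - h ** 2)           # backprop through tanh (pre-update W2)
    W2 = W2 + lr * delta * h                       # output-layer update
    b2 = b2 + lr * delta
    W1 = W1 + lr * np.outer(grad_h, xu)            # hidden-layer update
    b1 = b1 + lr * grad_h
```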

Updated Q-values are an approximation to the actual values, but they improve as the network learns. The network gives the estimate of the Q-values, and with this information the action selector selects an action according to the $\epsilon$-greedy policy, which is fed to the plant as input. We design the reinforcement NN system as per [4].

C. Decision tree Q-learning

Decision trees (DT) have been widely used for classification and regression. DT-based approaches to value function approximation in RL allow the continuous state space to be divided with non-uniform discretization, i.e., varying levels of resolution. The tree can be used to map an input vector to one of the leaf nodes, which corresponds to a region in the state space. Reinforcement learning can then be used to learn the values of taking each action in each region. Our approach is very similar to the U-Tree algorithm [12] and the algorithm proposed by Pyeatt [5]. In implementation, the DT starts out with only one leaf node that represents the entire input space. As the algorithm runs, the leaf node gathers information in its history list. When the list reaches a threshold length, a test is performed to determine whether the leaf node should be split. If a split is required, the test also determines the decision boundary. A new decision node is created to replace the leaf node, and two new leaf nodes are created and attached to the decision node. In this manner, the tree grows from the root downward, continually subdividing the input space into smaller regions. The Q-learning based DT algorithm is summarized below (a sketch of the split test follows the list):

1. Initialize the DT with one leaf node, a null history list with a suitable threshold length, zero Q-values for each possible action at the leaf node, and the current state $x^k$.
2. Use the current state to find the leaf node representing state $x^k$.
3. Obtain the Q-values $Q(x^k, u(x^k))$ corresponding to each action $u^k$ in the action set $U$ from the leaf node.
4. Select the optimal action $u^*$ with the largest value of $Q(x^k, u)$, then obtain the actual action $u$ using the $\epsilon$-greedy action selector.
5. Calculate $\tilde{Q}(x^{k-1}, u^{k-1})$ and update $Q(x^{k-1}, u^{k-1})$.
6. Add $\tilde{Q}(x^{k-1}, u^{k-1})$ and the state vector to the history list of the leaf node corresponding to $x^{k-1}$.
7. Decide if $x^{k-1}$ should be divided into two states by examining the history list for $x^{k-1}$:
   (a) if hist_list_length < threshold length, then split := False;
   (b) else
       (i) calculate the mean $\mu$ and standard deviation $\sigma$ of $\tilde{Q}(x^{k-1}, u^{k-1})$ in the history list;
       (ii) if $|\mu| < 2\sigma$, then split := True;
       (iii) else split := False.
8. Perform the split using the T-statistic, if required.

Train the DT for the next iteration, assign $x^k = x^{k+1}$, and repeat the procedure until the learning process is finished.
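A compact sketch of the leaf-node bookkeeping and split test in steps 6-8 is given below; the history entries are assumed to be (state vector, $\tilde{Q}$) pairs, the $2\sigma$ test follows step 7, and choosing the split dimension by the largest two-sample T statistic at the median is one plausible reading of the T-statistic split in step 8.

```python
import numpy as np

# Sketch of the leaf node used in steps 6-8 above. History entries are
# (state_vector, delta_q) pairs; the threshold and the median split point
# are simplifying assumptions for illustration.
class Leaf:
    def __init__(self, n_actions, threshold=100):
        self.q = np.zeros(n_actions)          # Q-values for this state-space region
        self.history = []                     # list of (state_vector, delta_q)
        self.threshold = threshold

    def record(self, state, delta_q):
        self.history.append((np.asarray(state, dtype=float), float(delta_q)))

    def should_split(self):
        """Step 7: split when the stored updates look like noise (|mean| < 2*std)."""
        if len(self.history) < self.threshold:
            return False
        dq = np.array([d for _, d in self.history])
        return abs(dq.mean()) < 2.0 * dq.std()

    def split_decision(self):
        """Step 8 (sketch): pick the dimension/boundary with the largest
        two-sample T statistic on the stored delta-Q values."""
        states = np.array([s for s, _ in self.history])
        dq = np.array([d for _, d in self.history])
        best_t, best_dim, best_boundary = 0.0, 0, 0.0
        for dim in range(states.shape[1]):
            boundary = float(np.median(states[:, dim]))
            mask = states[:, dim] < boundary
            left, right = dq[mask], dq[~mask]
            if len(left) < 2 or len(right) < 2:
                continue
            se = np.sqrt(left.var(ddof=1) / len(left) + right.var(ddof=1) / len(right))
            t = abs(left.mean() - right.mean()) / (se + 1e-12)
            if t > best_t:
                best_t, best_dim, best_boundary = t, dim, boundary
        return best_dim, best_boundary
```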

D. SVM Q-learning

The SVM is a universal learning machine developed in the framework of structural risk minimization (SRM) [13]. SRM has better generalization ability and is superior to the traditional empirical risk minimization (ERM) principle. In SVM, the results guarantee global minima, whereas ERM can only locate local minima. SVM uses a kernel function that satisfies Mercer's condition [13] to map the input data into a high-dimensional feature space, and then constructs a linear optimal separating hyperplane in that space. Linear, Gaussian, polynomial, and RBF kernels are frequently used in SVMs. Conventional SVMs have the properties of global optimization and good adaptability. However, the optimal solutions are obtained by solving a standard quadratic program, which is numerically inefficient. The least squares support vector machine (LS-SVM) proposed by Suykens et al. [14] is an improved algorithm. It involves equality instead of inequality constraints and works with a least-squares cost function. The LS-SVM has been successfully applied to classification and regression problems.

The RL problem is converted into a regression problem, wherein the observed states and actions are considered as inputs and the value functions as output. An LS-SVM has good generalization properties and is used to approximate the Q-value of a state-action pair online, taking advantage of not falling into the trap of local minima. We use the Q-learning method on continuous state domains based on LS-SVM proposed in [6]. The input of the LS-SVM is the state-action pair $(x^k, u^k)$, while the output is the estimated Q-value corresponding to $(x^k, u^k)$. Training samples of the LS-SVM are obtained during interaction between the learning system and its environment. The computational complexity of a conventional LS-SVM increases rapidly with the number of training samples. To overcome this high computational complexity, a fixed-length training sample set ($l$) and a sliding time window are introduced into the learning system. When we add a new training sample, the oldest sample rolls out of the modeling interval.

Q-learning based on the LS-SVM can be summarized as follows. Initialize the LS-SVM model with the length of training samples ($l$), the RBF kernel function, and the current state $x^k$. Obtain the Q-values $Q(x^k, u^k)$ corresponding to $(x^k, u^k)$ for each action $u^k$ in the action set $U$ by solving the regression model of the LS-SVM, send them to an $\epsilon$-greedy action selector, and obtain the actual action. Perform the actual action and obtain the reward and successor state $x^{k+1}$. Update the Q-value according to (1) to obtain the target Q-value. Add the newly observed data into the training samples and slide the time window. Train the LS-SVM model and assign $x^k = x^{k+1}$. Repeat the procedure until the learning process is finished. We design the LS-SVM Q-learning controller (SVMQC) as per [6]. A sketch of the LS-SVM regression step is given below.
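The sketch below shows the LS-SVM regression step over a sliding window of (state-action, Q-target) samples, following the standard least-squares formulation of Suykens and Vandewalle [14]; the window length, RBF width, and regularization constant are arbitrary assumptions.

```python
import numpy as np
from collections import deque

# Sketch of LS-SVM regression of Q-values over a sliding window of
# (state_action, q_target) samples. Window length l, RBF width sigma, and
# regularization gamma_reg are assumptions, not the paper's settings.
l, sigma, gamma_reg = 100, 1.0, 10.0
window = deque(maxlen=l)                  # the oldest sample rolls out automatically

def rbf(a, b):
    return float(np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2)))

def fit(samples):
    """Solve the LS-SVM system  [[0, 1^T], [1, K + I/gamma_reg]] [b; alpha] = [0; y]."""
    X = np.array([xu for xu, _ in samples])
    y = np.array([t for _, t in samples])
    n = len(y)
    K = np.array([[rbf(xi, xj) for xj in X] for xi in X])
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma_reg
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    b, alpha = sol[0], sol[1:]
    return lambda xu: float(alpha @ np.array([rbf(xu, xi) for xi in X]) + b)

# Usage sketch: window.append((xu, q_target)); q_fun = fit(window); q_fun(xu_query)
```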

III. CONTROLLER REALIZATION

A. Two-link robot manipulator

We consider a two-link robot manipulator as the plant. There are two rigid links of lengths $l_1$ and $l_2$, and masses $m_1$ and $m_2$, respectively. The joint angles are $\theta_1$ and $\theta_2$, and $g$ is the acceleration due to gravity. $\tau = [\tau_1\ \tau_2]^T$ is the vector of input torques applied at the joints of the robot. The dynamic equations of the two-link robot manipulator can be represented as [1]

$\begin{bmatrix} \alpha + \beta + 2\eta\cos\theta_2 & \beta + \eta\cos\theta_2 \\ \beta + \eta\cos\theta_2 & \beta \end{bmatrix} \begin{bmatrix} \ddot{\theta}_1 \\ \ddot{\theta}_2 \end{bmatrix} + \begin{bmatrix} -\eta\,(2\dot{\theta}_1\dot{\theta}_2 + \dot{\theta}_2^2)\sin\theta_2 \\ \eta\,\dot{\theta}_1^2\sin\theta_2 \end{bmatrix} + \begin{bmatrix} \alpha e_1\cos\theta_1 + \eta e_1\cos(\theta_1 + \theta_2) \\ \eta e_1\cos(\theta_1 + \theta_2) \end{bmatrix} = \begin{bmatrix} \tau_1 + \tau_{1dis} \\ \tau_2 + \tau_{2dis} \end{bmatrix}$

where $\tau_{dis} = [\tau_{1dis}\ \tau_{2dis}]^T$ is the torque disturbance, and

$\alpha = (m_1 + m_2)\,l_1^2$; $\beta = m_2 l_2^2$; $\eta = m_2 l_1 l_2$; $e_1 = g/l_1$.

Manipulator parameters are: $l_1 = l_2 = 1$ m; $m_1 = m_2 = 1$ kg. The desired trajectory is $\theta_{1d} = \sin(0.5\pi t)$ and $\theta_{2d} = \cos(0.5\pi t)$.
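The plant model above can be simulated directly. A sketch with the stated parameters ($l_1 = l_2 = 1$ m, $m_1 = m_2 = 1$ kg) is given below; the numerical value of $g$ and the 10 ms integration step (used later in Section IV) are the only added assumptions.

```python
import numpy as np

# Two-link manipulator dynamics as given above, with l1 = l2 = 1 m, m1 = m2 = 1 kg.
# g = 9.81 m/s^2 is an assumed value; the paper only states that g is the
# acceleration due to gravity.
m1, m2, l1, l2, g = 1.0, 1.0, 1.0, 1.0, 9.81
alpha_p = (m1 + m2) * l1 ** 2
beta_p = m2 * l2 ** 2
eta_p = m2 * l1 * l2
e1 = g / l1

def dynamics(x, tau):
    """x = [th1, th2, th1_dot, th2_dot]; tau = applied joint torques (plus disturbance)."""
    th1, th2, dth1, dth2 = x
    M = np.array([[alpha_p + beta_p + 2 * eta_p * np.cos(th2), beta_p + eta_p * np.cos(th2)],
                  [beta_p + eta_p * np.cos(th2),               beta_p]])
    C = np.array([-eta_p * (2 * dth1 * dth2 + dth2 ** 2) * np.sin(th2),
                   eta_p * dth1 ** 2 * np.sin(th2)])
    G = np.array([alpha_p * e1 * np.cos(th1) + eta_p * e1 * np.cos(th1 + th2),
                  eta_p * e1 * np.cos(th1 + th2)])
    ddth = np.linalg.solve(M, np.asarray(tau) - C - G)
    return np.array([dth1, dth2, ddth[0], ddth[1]])

def rk4_step(x, tau, h=0.01):
    """Fourth-order Runge-Kutta step (10 ms, matching the simulations in Section IV)."""
    k1 = dynamics(x, tau)
    k2 = dynamics(x + 0.5 * h * k1, tau)
    k3 = dynamics(x + 0.5 * h * k2, tau)
    k4 = dynamics(x + h * k3, tau)
    return x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
```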

B. Controller learning details

Simulation parameters and learning details for the various value function approximators in the reinforcement learning control structure are as follows. We define the tracking error vector as $e^k = \theta_d^k - \theta^k$ and the cost function as $c^k = \dot{e}^k + \Lambda e^k$, $\Lambda = \Lambda^T > 0$, with $\Lambda = \mathrm{diag}\{30, 20\}$. The maximum limit of error is taken as 0.2 rad for both the links (10% of the peak-to-peak value of the reference trajectory). The (continuous) system state space has four variables, i.e., $x^k = [\theta_1\ \theta_2\ \dot{\theta}_1\ \dot{\theta}_2]^T = [x_1\ x_2\ \dot{x}_1\ \dot{x}_2]^T$. The controller action sets for link 1 and link 2 are $U^{(1)} = \{-20, 0, 20\}$ Nm and $U^{(2)} = \{-2, 0, 2\}$ Nm, respectively. The exploration level $\epsilon$ decays from 0.5 to 0.002 over the iterations. The discount factor $\gamma$ is set to 0.8, the learning-rate parameter $\eta$ is set to 0.2, and the PD gain matrix is $K_v = \mathrm{diag}\{20, 20\}$. We deliberately introduce deterministic noise of 1% of the control effort with a probability of 1/3, for stochastic simulation.

Fuzzy Q-learning Controller (FQC): In FQC, each link has two input variables, with each variable partitioned into three fuzzy subsets of Gaussian membership functions. The membership function parameters used in this paper are the same as in [13].

Neural Network Q-learning Controller (NNQC): The controller uses two function approximators, one for each link. Both approximators have 18 hidden neurons with a tan-sigmoid activation function and a single output layer with a linear activation function. The NN weights are initialized randomly, and the length of training samples ($l$) for batch-mode processing is chosen as 100.

Decision Tree Q-learning Controller (DTQC): We have implemented the DT-based FA proposed in [5]. The controller initializes the DT for each link with a single leaf node, a null history list, and a threshold length of 100 for performing the splitting criterion. The level of significance for the T-statistic is chosen as 0.1.

Support Vector Machine Q-learning Controller (SVMQC): An LS-SVM is used to realize the mapping from state-action pairs to the Q-value function. For simplicity, the controller uses two function approximators, one for each link. The training samples of the LS-SVM are obtained during the interaction between the controller and the environment. Online LS-SVM learning with the RBF kernel function is used in the simulation.

In the controller implementations of FQC, NNQC, DTQC, and SVMQC, we have used a controller structure with an inner PD loop. The control action to the robot manipulator is a combination of an action generated by the adaptive learning RL signal through the FIS/NN/DT/SVM and a fixed-gain PD controller signal. The PD loop maintains stability until the FIS/NN/DT/SVM controller learns, starting with zero-initialized Q-values. The controller, thus, requires no offline learning [1].

IV. SIMULATION RESULTS AND DISCUSSION

Our aim is to track the desired trajectory for the joints of the two-link robot manipulator. MATLAB 7.4.0 (R2007a) has been used as the simulation tool. The physical system has been simulated for a single run of 10 sec using the fourth-order Runge-Kutta method, with a fixed time step of 10 msec. For the various value function approximators, we trained the controller for 20 runs, then evaluated and compared the performance for two cases.

(i) Effect of external disturbances: A torque disturbance $\tau_{dis}$ with a sinusoidal variation of frequency 2 rad/sec is added to the model at 5 sec. The magnitude of $\tau_{dis}$ is expressed as a percentage of the control effort $\tau$. Figs. 3(a) and (b) show the absolute maximum error (max |e(t)|) versus torque disturbance for both links of the robotic arm, for all the controllers, i.e., FQC, NNQC, DTQC, and SVMQC. For an absolute maximum allowable error of 0.1 rad (5% of the maximum variation of the reference signals, which is 2 rad), the maximum tolerable torque disturbance was calculated and tabulated in Table I. Whenever the error on either of the links crosses the allowable limit of 0.1 rad, the value of disturbance that leads to this error has been taken as the maximum tolerable disturbance.

Fig. 3(a) Absolute maximum error (link 1): max |e(t)| (rad) versus torque disturbance (% of control effort), for FQC, NNQC, DTQC, and SVMQC.

Fig. 3(b) Absolute maximum error (link 2): max |e(t)| (rad) versus torque disturbance (% of control effort), for FQC, NNQC, DTQC, and SVMQC.

Table I  Comparison of controllers for max |e(t)| < 0.1 rad

             Effect of external disturbances        Effect of payload variations
             Tolerable limit of disturbance         Tolerable limit of mass of link 2
             (% of control effort)                  (kg; nominal value = 1 kg)
Controller     Link 1        Link 2                   Link 1        Link 2
FQC              77          82.52                     3.35          3.52
NNQC             40          49.2                      1.75          2.00
DTQC             52          60                        2.00          2.65
SVMQC            55          75.2                      1.65          2.50

From the results (Fig. 3, Table I), we observe that FQC handles the highest percentage of control effort within the absolute maximum allowable error of 0.1 rad, for both the links. Figs. 4(a) and (b) show the output tracking errors (both links), and Table II tabulates the average error, absolute maximum error (max |e(t)|), and absolute maximum control effort (max |τ|) at a torque disturbance of magnitude 70% of the control effort.

Fig. 4(a) Output tracking error (link 1)

Fig. 4(b) Output tracking error (link 2)

Table II  Comparison of controllers (torque disturbance at 70% of control effort)

             Average error (rad)      max |e(t)| (rad)       max |τ| (Nm)
Controller    Link 1     Link 2        Link 1     Link 2      Link 1   Link 2
FQC          -0.0165     0.0100        0.0867     0.0842        93       45
NNQC          0.0285     0.0157        0.1610     0.1537       176       76
DTQC          0.0179     0.0101        0.1241     0.0807       109       67
SVMQC         0.0159     0.0159        0.1416     0.0939       156       74

Fig. 4 shows that all controllers perform well, as the tracking error stays within the allowable limit. Specifically, Fig. 4 and Table II bring out the fact that FQC outperforms the other controllers, in terms of lower tracking errors and the least value of maximum absolute control effort for both the links. DTQC performs better than NNQC and SVMQC. All the measurements were taken after applying the disturbance.

(ii) Effect of payload variations: We consider the payload variation on link 2. Mass $m_2$ was suddenly changed at 5 sec from its nominal value of 1 kg. Figs. 5(a) and (b) show the absolute maximum error (max |e(t)|) versus percentage of payload change for both links of the robotic arm, for all controllers. The maximum tolerable payload change for an absolute maximum allowable error of 0.1 rad was calculated and tabulated in Table I.

Fig. 5(a) Absolute maximum error (link 1): max |e(t)| (rad) versus % of payload change, for FQC, NNQC, DTQC, and SVMQC.

Fig. 5(b) Absolute maximum error (link 2): max |e(t)| (rad) versus % of payload change, for FQC, NNQC, DTQC, and SVMQC.

From the results (Fig. 5, Table I), we observe that FQC handles the highest percentage of payload change within the absolute maximum allowable error of 0.1 rad for both the links. Figs. 6(a) and (b) show the output tracking errors (both links), and Table III tabulates the average error, absolute maximum error (max |e(t)|), and absolute maximum control effort (max |τ|) at 300% payload variation.

Fig. 6(a) Output tracking error (link 1)

Fig. 6(b) Output tracking error (link 2)

Table III  Comparison of controllers (payload variation of 300%)

             Average error (rad)      max |e(t)| (rad)       max |τ| (Nm)
Controller    Link 1     Link 2        Link 1     Link 2      Link 1   Link 2
FQC          -0.0007     0.0262        0.1354     0.1148       144       50
NNQC          0.0371     0.0285        0.2009     0.2268       244      109
DTQC          0.0108     0.0275        0.1620     0.1533       185       78
SVMQC         0.0373     0.0288        0.1988     0.2053       252      101

Fig. 6 and Table III show that the FQC outperforms the other controllers, in terms of the least value of tracking errors and maximum absolute control effort for both the links. DTQC performs better than NNQC and SVMQC. All the measurements were taken after changing the payload mass.

The superior FQC performance can be attributed to the use of the FIS in FQC, as explained by Jouffe [7]. The FIS allows efficient use of a priori knowledge in terms of the fuzzy labels of the rule premise part. This leads to accelerated learning and improved performance, whereas the NN/SVM has to learn its state perception (random initialization at the start), in addition to the Q-values. We found that a drawback of the NN/SVM reinforcement learning implementations was their tendency to over-train on the portion of the state space that is visited often and to forget the value function for portions of the state space that have not been visited recently. We find that the DT can provide better learning performance than NN/SVM function approximation. The FIS can deal with continuous actions and performs better than the NN/DT/SVM approximators, wherein the actions are discrete in nature.

Further improvement in the FQL-based reinforcement controller may be achieved through dynamic fuzzy Q-learning (DFQ). DFQ is an efficient learning method in which the conclusion part of the FIS can be constructed online and the structure of the FIS is generated dynamically. This self-organizing feature makes the system performance better than that of a conventional FQC. Here, DFQ fulfills our objective of dealing with continuous state and action spaces for the robot arm. We have implemented the DFQC proposed in [10]. A robust tracking performance comparison of FQC and DFQC has been carried out; the simulation results are shown in Figs. 7(a) and (b) and tabulated in Table IV.
Fig. 7(a) Output tracking error for torque disturbance (both links): FQC vs. DFQC.

Fig. 7(b) Output tracking error for payload change (both links): FQC vs. DFQC.

Table IV  Comparison of FQC and DFQC

                                            Average error (rad)     max |e(t)| (rad)      max |τ| (Nm)
Case                           Controller    Link 1     Link 2       Link 1     Link 2     Link 1   Link 2
External disturbance           FQC          -0.0165     0.0100       0.0867     0.0842       93       45
(at 70% of control effort)     DFQC         -0.0182    -0.0003       0.0812     0.0759       90       38
Payload variation              FQC          -0.0007     0.0262       0.1354     0.1148      144       50
(at 300% of mass change)       DFQC          0.0010     0.0265       0.1301     0.1122      140       45

From the results (Fig. 7, Table IV), we observe that DFQC gives better performance than FQC for both external disturbances and payload mass variations. In comparison to FQC, which has 18 fuzzy rules, DFQC uses a dynamic FIS with only 8 fuzzy rules; this implies that the DFQ controller is computationally efficient. Better tuning of DFQC may result in further improved performance.

V. CONCLUSION

This paper presents an approach comparing reinforcement learning systems using various parameterized function approximators, for robust tracking control of a two-link robot manipulator in the presence of external disturbances and payload variations. Simulation results show the superiority of the fuzzy Q-learning controller, in terms of speed of convergence to a small error region and lower torque requirements. The FIS allows efficient use of a priori knowledge; this leads to accelerated learning with improved performance. The use of continuous actions in the FIS leads to better performance over the discrete actions in the NN/DT/SVM function approximators. The decision tree function approximator shows promise of good performance, as revealed by our simulation studies. Exploring the use of fuzzy decision tree function approximators in reinforcement learning is a direction for future work.

REFERENCES

[1] F. L. Lewis, S. Jagannathan, and A. Yesildirek, Neural Network Control of Robotic Manipulators and Nonlinear Systems. London, U.K.: Taylor and Francis, 1999.
[2] R. S. Sutton, A. G. Barto, and R. J. Williams, "Reinforcement learning is direct adaptive optimal control," IEEE Control Syst. Mag., vol. 12, no. 2, 1992, pp. 19-22.
[3] C. H. Watkins, "Learning from delayed rewards," Ph.D. thesis, University of Cambridge, England, 1989.
[4] S. T. Hagen and B. Kröse, "Neural Q-learning," Neural Comput. & Applic., vol. 12, 2003, pp. 81-88.
[5] L. D. Pyeatt, "Reinforcement learning with decision tree," in Proc. 21st IASTED Int. Conf. Applied Informatics, Austria, 2003, pp. 26-31.
[6] X. Wang, X. Tian, and Y. Cheng, "Value approximation with least square support vector machines in reinforcement learning system," Journal of Computational and Theoretical Nanoscience, vol. 4, no. 7-8, 2007, pp. 1290-1294.
[7] L. Jouffe, "Fuzzy inference system learning by reinforcement methods," IEEE Trans. Syst., Man, and Cybernetics, Part C, vol. 28, no. 3, 1998, pp. 338-355.
[8] P. Y. Glorennec and L. Jouffe, "Fuzzy Q-learning," in Proc. 6th IEEE Conf. Fuzzy Systems, Barcelona, Spain, vol. 2, 1997, pp. 659-662.
[9] R. Sharma and M. Gopal, "A Markov game-adaptive fuzzy controller for robot manipulators," IEEE Trans. on Fuzzy Systems, vol. 16, no. 1, 2007, pp. 171-186.
[10] M. J. Er and C. Deng, "Online tuning of fuzzy inference systems using dynamic fuzzy Q-learning," IEEE Trans. on Systems, Man, and Cybernetics, Part B, vol. 34, no. 3, 2004, pp. 1478-1489.
[11] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[12] W. T. B. Uther and M. M. Veloso, "Tree based discretization for continuous state space reinforcement learning," in Proc. 16th National Conference on Artificial Intelligence (AAAI-98), Madison, 1998.
[13] V. Vapnik, The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
[14] J. A. K. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Processing Letters, vol. 9, 1999, pp. 293-300.