
Preprint · September 2021 · DOI: 10.13140/RG.2.2.14465.79208



A Model-free Deep Reinforcement Learning Approach for Robotic Manipulators Path Planning

Wenxing Liu¹, Hanlin Niu¹*, Muhammad Nasiruddin Mahyuddin², Guido Herrmann¹ and Joaquin Carrasco¹

¹ Department of Electrical and Electronic Engineering, The University of Manchester, Manchester, M13 9PL, UK
{wenxing.liu, hanlin.niu*, guido.herrmann, joaquin.carrasco}@manchester.ac.uk
*corresponding author

² School of Electrical and Electronic Engineering, Universiti Sains Malaysia, Pulau Pinang, 14300, Malaysia
nasiruddin@usm.my

Abstract: Path planning problems have attracted much attention in robotic fields such as manipulators. In this paper, a model-free off-policy actor critic based deep reinforcement learning method is proposed to solve the classical path planning problem of a UR5 robot arm. Unlike standard path planning methods, the reward design of the proposed method contains a smoothness reward, which assures a smooth trajectory of the UR5 robot arm when accomplishing path planning tasks. Additionally, the proposed method does not rely on any model, whereas the standard path planning method is model-based. The proposed method not only guarantees that the joint angles of the UR5 robotic arm lie within the allowable range each time it reaches a random target point, but also ensures that the joint angles of the UR5 robotic arm are always within the allowable range during the entire episode of training. A standard path planning method was implemented in the Robot Operating System (ROS) and the proposed method was applied in CoppeliaSim to validate the feasibility. It can be inferred from the experiments that training with the proposed method is successful.

Keywords: Model-free, Deep Reinforcement Learning, Manipulators, Path Planning.

1. INTRODUCTION

Over the past two decades, robots have established their worth in both theory and practical applications, including path planning tasks [1], formation techniques [2], human-robot interaction [3], collision avoidance [4], [5], and pick and place tasks [6]. When dealing with path planning problems, standard methods are usually considered the primary solution. However, how to avoid the occurrence of jerk in path planning remains an open question. A vital requirement is therefore to ensure the smoothness of the manipulator trajectory while accomplishing path planning tasks.

In recent years, there has been growing interest in path planning problems [7]. For instance, a Jacobian based method was proposed in [8] to handle obstacle avoidance path planning. Nevertheless, the inverse of the Jacobian matrix used in path planning problems could be a singular matrix. In [9], [10], [11], a pseudo-inverse Jacobian matrix was developed to handle the problem of a singular Jacobian matrix. Although the computation of each iteration is quite fast, the Jacobian can only be viewed as a valid approximation near the current configuration of the manipulator. As a result, the Jacobian based method has to be repeated in small steps until the end effector of the manipulator is close enough to the target point, which may be considered a limitation.

Another standard path planning approach is the heuristic iterative search method. Unlike the Jacobian methods above, this method reduces the position and orientation error of the manipulator by changing the parameters of only one joint at a time. One of the most representative heuristic iterative search methods is the Cyclic-Coordinate Descent (CCD) algorithm in [12], [13]. This kind of path planning is easy to implement and computationally fast, but may need more iterations to reach the target point. Besides, this method is more suitable for snake robots than for the UR5 robot arm.

In order to overcome the difficulties encountered in a path planning problem, reinforcement learning techniques have been widely used to solve challenging problems in engineering, computer science and robotics [14]. The main advantage of reinforcement learning is that it can be model-free [15], which allows it to be used in various applications. Agents learn mapping relationships between the environment and actions with reinforcement learning techniques [16]. Recently, many reinforcement learning methods have been investigated to solve classical problems. For instance, an actor critic based reinforcement learning approach was developed in [17] to deal with the input of a classical formation control problem. In [18], a solution for achieving consensus of multiple agents under sudden total communication failure was proposed using an actor-critic reinforcement learning method. A distributed off-policy actor critic method was developed in [19] to deal with reinforcement learning problems in a multi-agent system.
However, the problem of solving path planning problems for robotic manipulators with a model-free deep reinforcement learning method is not considered in the aforementioned works, and may be regarded as an open problem.

The goal of this paper is to propose a new model-free off-policy deep reinforcement learning method to deal with the path planning problem of the UR5 robot arm. Different from standard path planning methods, the proposed method does not rely on any model. In addition, it is able to guarantee a smooth movement of the UR5 robot arm, and all the joint angles of the UR5 robot arm lie within the allowable range during each movement. Moreover, a standard path planning method has been implemented on the real UR5 robot arm as a baseline to compare and contrast the benefits and drawbacks of both methods.

2. THE PROPOSED METHOD

The proposed method was trained in the CoppeliaSim [20] simulation environment for efficiency. The training details are described in Section 2.4.

2.1 Reward Design
The reward should make the UR5 robot arm reach the target position smoothly while keeping the end effector orientation of the UR5 robot arm as straight as possible downwards, as shown in Fig. 1. The reward function is given as follows:

    r = rd + ro + ra + rk    (1)

where r represents the total reward, rd stands for the distance reward, ro represents the orientation reward, ra is the arrive reward, and rk stands for the smoothness reward.

The distance reward rd is computed as follows:

    rd = dp − dc    (2)

where dp stands for the previous distance between the end effector position of the UR5 robot arm and the target position, and dc denotes the current distance between the end effector position and the target position. The time step between dp and dc is set to 0.01 seconds. If dp is smaller than dc, the UR5 robot arm receives a negative reward since it has moved farther from the target position.

The orientation reward ro is given by

    ro = rop if |oe − od| < oth, and ro = 0 otherwise,    (3)

where rop is a positive orientation reward, oe denotes the current end effector orientation of the UR5 robot arm, od represents the desired end effector orientation of the UR5 robot arm, which is as straight as possible downwards, oth is the orientation threshold, and | · | represents the Euclidean norm of a vector. By introducing ro in the reward function, the end effector orientation of the UR5 robot arm is more likely to remain as straight as possible downwards.

The arrive reward ra can be calculated by

    ra = rap if |dc| < rah, and ra = 0 otherwise,    (4)

where rap is a positive arrive reward and rah represents the distance threshold. Therefore, ra equals a positive reward if |dc| lies within the distance threshold, and remains 0 otherwise.

The smoothness reward rk can be computed by

    rk = −rj if |rd| > rth, and rk = 0 otherwise,    (5)

where rj represents a positive smoothness reward and rth stands for the smoothness threshold. As a result, rk is a negative reward if |rd| is above the smoothness threshold. By introducing rk in the total reward, the smooth movement of the UR5 robot arm can be ensured.
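As an illustration of equations (1)-(5), the following Python sketch computes the total reward for one time step. The constant values (ROP, RAP, RJ and the thresholds OTH, RAH, RTH) are hypothetical placeholders, since the paper does not report the actual values used in training:

```python
import numpy as np

# Hypothetical constants; the values actually used in the paper are not reported.
ROP, RAP, RJ = 1.0, 10.0, 1.0       # positive orientation, arrive and smoothness rewards
OTH, RAH, RTH = 0.1, 0.05, 0.5      # orientation, arrive and smoothness thresholds

def total_reward(d_prev, d_curr, o_curr, o_desired):
    """Total reward r = rd + ro + ra + rk, following equations (1)-(5)."""
    rd = d_prev - d_curr                                        # eq. (2): distance reward
    o_err = np.linalg.norm(np.asarray(o_curr) - np.asarray(o_desired))
    ro = ROP if o_err < OTH else 0.0                            # eq. (3): orientation reward
    ra = RAP if abs(d_curr) < RAH else 0.0                      # eq. (4): arrive reward
    rk = -RJ if abs(rd) > RTH else 0.0                          # eq. (5): smoothness reward
    return rd + ro + ra + rk
```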
2.2 Action Space and Observation Space

Fig. 1.: The joint position of a UR5 robot arm.

The shoulder pan joint of the UR5 robot arm is actuated in this training scenario, together with the shoulder lift joint, elbow joint and wrist 1 joint, as shown in Fig. 1. As a result, the action space is a 1 × 4 vector that contains the joint angles of the shoulder pan joint, shoulder lift joint, elbow joint and wrist 1 joint.

The observation space is a 1 × 17 vector which includes the action space, the locations of the elbow joint and the wrist 2 joint, the distance between the elbow joint and the goal, the distance between the wrist 2 joint and the goal, and the distance between the end effector and the goal.
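For illustration, one way such an observation vector can be assembled is sketched below. The helper names and the exact layout are assumptions (3-D joint positions and scalar Euclidean distances); under these assumptions the vector has 13 entries, so the paper's full 1 × 17 encoding must represent some components more richly than shown here:

```python
import numpy as np

def build_observation(joint_angles, elbow_pos, wrist2_pos, ee_pos, goal_pos):
    """Assemble the observation components described in Section 2.2 (layout assumed)."""
    elbow, wrist2, ee, goal = map(np.asarray, (elbow_pos, wrist2_pos, ee_pos, goal_pos))
    return np.concatenate([
        np.asarray(joint_angles, dtype=float),   # 4 actuated joint angles (the action space)
        elbow,                                   # elbow joint location
        wrist2,                                  # wrist 2 joint location
        [np.linalg.norm(elbow - goal)],          # distance between elbow joint and goal
        [np.linalg.norm(wrist2 - goal)],         # distance between wrist 2 joint and goal
        [np.linalg.norm(ee - goal)],             # distance between end effector and goal
    ])
```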
2.3 Network Structure
The proposed method is based on an actor critic off-policy deep reinforcement learning method. The structure of the actor network is shown in Fig. 2. The input of the actor network is a vector that contains 17 elements, which are the joint angles of the shoulder pan joint, shoulder lift joint, elbow joint and wrist 1 joint, the locations of the elbow joint and wrist 2 joint, the distance between the elbow joint and the goal, the distance between the wrist 2 joint and the goal, and the distance between the end effector and the goal. The input of the actor network is connected to three dense layers with the ReLU [21] activation function, and this input vector is also part of the input of the critic network architecture. The output of the actor network contains 4 elements, which are the joint angles of the shoulder pan joint, shoulder lift joint, elbow joint and wrist 1 joint. The tanh function is the activation function of the output of the actor network, and this output is also part of the input of the critic network architecture, as depicted in Fig. 3. The output of the critic is a Q value generated by a linear activation function.

Fig. 2.: The architecture of the actor network.

Fig. 3.: The architecture of the critic network.
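Based on this description, the actor and critic can be sketched in Keras as follows. The widths of the three dense layers are not reported in the paper, so the value 256 is an assumption, as is the simple concatenation of state and action at the critic input:

```python
from tensorflow import keras
from tensorflow.keras import layers

STATE_DIM, ACTION_DIM = 17, 4   # observation and action sizes from Section 2.2

def build_actor(hidden_units=(256, 256, 256)):
    """Actor (Fig. 2): 17-element state -> 4 joint angles with a tanh output."""
    state = keras.Input(shape=(STATE_DIM,))
    x = state
    for units in hidden_units:                       # three dense layers with ReLU
        x = layers.Dense(units, activation="relu")(x)
    action = layers.Dense(ACTION_DIM, activation="tanh")(x)
    return keras.Model(state, action)

def build_critic(hidden_units=(256, 256, 256)):
    """Critic (Fig. 3): (state, action) -> scalar Q value with a linear output."""
    state = keras.Input(shape=(STATE_DIM,))
    action = keras.Input(shape=(ACTION_DIM,))
    x = layers.Concatenate()([state, action])
    for units in hidden_units:
        x = layers.Dense(units, activation="relu")(x)
    q_value = layers.Dense(1, activation="linear")(x)
    return keras.Model([state, action], q_value)
```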
2.4 Training Details

The proposed method was trained in the CoppeliaSim [20] environment with Python [22] and Keras [23]. The UR5 robot arm is connected with a suction gripper, as shown in Fig. 4. The experiment is considered successful if the UR5 robot arm is able to reach the position of the pink disc within the allowable range during each iteration. The pictures in the first column of Fig. 4 describe how the UR5 robot arm gradually reaches the target point within the allowable range, while the pictures in the second column depict another solution. Algorithm 1 details the proposed model-free off-policy actor critic based deep reinforcement learning method.

Fig. 4.: The process of the UR5 robot arm reaching the target position. The pink disc depicts the position of the target.

Algorithm 1: Model-free off-policy actor critic based deep reinforcement learning method
1: Initialize the actor network φ(s|η) with weights η, the critic network Q(s, a|ξ) with weights ξ, a fixed behaviour policy µ, the maximum time limit T, the maximum episode limit N, the replay buffer R, the maximum size of the replay buffer Rm, the target actor network φ′ with weights η′ ← η, and the target critic network Q′ with weights ξ′ ← ξ.
2: for episode = 1, N do
3:   Reset the environment.
4:   for t = 1, T do
5:     Initialize a random exploration noise ot for action exploration.
6:     Choose action at = µ(st) + ot according to the fixed behaviour policy µ and the exploration noise ot.
7:     Execute action at, calculate the reward rt and observe the new state st+1.
8:     Store (st, at, rt, st+1) in the replay buffer R.
9:     Sample a minibatch (sj, aj, rj, sj+1) from R if size(R) > Rm.
10:    Set yj = rj + γ Q′(sj+1, φ(sj+1|η′)|ξ′).
11:    Update the critic by minimizing the loss:
12:      L = (1/2)(yj − Q(sj, aj|ξ))²
13:    Update the actor with policy gradient ascent:
14:      ∇η J = E[∇a Q(s, a|ξ)|s=sj, a=φ(sj) ∇η φ(s|η)|sj]
15:    Update the target networks:
16:      η′ ← τ η + (1 − τ) η′
17:      ξ′ ← τ ξ + (1 − τ) ξ′
18:  end for
19: end for
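The core update in lines 10-17 of Algorithm 1 can be sketched with Keras/TensorFlow as follows. This is a minimal illustration, reusing the actor/critic builders from the earlier network sketch; the discount factor gamma, soft-update rate tau and optimizer learning rates are assumptions, since their values are not given in the paper:

```python
import tensorflow as tf

gamma, tau = 0.99, 0.005    # assumed discount factor and soft-update rate

actor, critic = build_actor(), build_critic()              # from the earlier sketch
target_actor, target_critic = build_actor(), build_critic()
target_actor.set_weights(actor.get_weights())
target_critic.set_weights(critic.get_weights())
actor_opt = tf.keras.optimizers.Adam(1e-4)                 # assumed learning rates
critic_opt = tf.keras.optimizers.Adam(1e-3)

def train_step(s, a, r, s_next):
    """One critic/actor update and soft target update (Algorithm 1, lines 10-17).
    s, a, r, s_next are float NumPy minibatch arrays sampled from the replay buffer."""
    # Line 10: bootstrapped target y_j using the target networks.
    y = tf.cast(tf.reshape(r, (-1, 1)), tf.float32) \
        + gamma * target_critic([s_next, target_actor(s_next)])

    # Lines 11-12: critic update by minimizing the squared TD error.
    with tf.GradientTape() as tape:
        critic_loss = 0.5 * tf.reduce_mean(tf.square(y - critic([s, a])))
    critic_opt.apply_gradients(zip(tape.gradient(critic_loss, critic.trainable_variables),
                                   critic.trainable_variables))

    # Lines 13-14: actor update by gradient ascent on Q(s, φ(s)).
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([s, actor(s)]))
    actor_opt.apply_gradients(zip(tape.gradient(actor_loss, actor.trainable_variables),
                                  actor.trainable_variables))

    # Lines 15-17: soft update of the target networks.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        target.set_weights([tau * w + (1.0 - tau) * tw
                            for w, tw in zip(net.get_weights(), target.get_weights())])
```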
3. STANDARD PATH PLANNING METHOD

A standard path planning method was implemented on the real UR5 robot arm with the Robot Operating System (ROS) [24]. A Robotiq 3-finger gripper was used for accomplishing pick and place tasks, together with an rs-visard 3D camera. Fig. 5 shows the rviz [25] view of the actual UR5 robot arm, which demonstrates how the UR5 robot arm can accomplish pick and place tasks via the standard path planning method. To avoid a potential crash of the actual UR5 robot arm, the OMPL (Open Motion Planning Library) [26] was used in rviz to move the actual UR5 robot arm to the position above the ArUco marker, as shown in the first picture of Fig. 5. The actual UR5 robot arm then went down to grasp the box, moved the box to the target location and released the box in the end.

Fig. 5.: Pick and place tasks via the standard path planning method. The UR5 robot arm with orange color denotes the initial position of the actual UR5 robot arm. The UR5 robot arm with grey color represents the current position of the actual UR5 robot arm. The image viewer at the bottom right corner represents the view from the 3D camera. The location of the box can be computed from the ArUco marker [27] on top of it.
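In a typical ROS setup, OMPL-planned motions like the pick-and-place moves described above are commanded through MoveIt from Python. The snippet below is a generic sketch rather than the authors' code; the planning group name "manipulator" and the target pose values are assumptions:

```python
import sys
import rospy
import moveit_commander
from geometry_msgs.msg import Pose

# Initialise MoveIt and attach to the UR5 planning group (group name is an assumption).
moveit_commander.roscpp_initialize(sys.argv)
rospy.init_node("ur5_pick_and_place_demo")
group = moveit_commander.MoveGroupCommander("manipulator")

# Example target pose above the marker; the actual coordinates depend on the setup.
target = Pose()
target.position.x, target.position.y, target.position.z = 0.4, 0.1, 0.3
target.orientation.w = 1.0

group.set_pose_target(target)
success = group.go(wait=True)    # plan (via OMPL) and execute on the arm
group.stop()                     # make sure there is no residual movement
group.clear_pose_targets()
```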
4. RESULTS AND ANALYSIS

4.1 Evaluation on Training of the Proposed Method
Fig. 6 demonstrates the average reward of the proposed method and Fig. 7 describes the average Q value of the proposed method. The batch size is set to 512, and each average value is computed by averaging the results within a fixed batch size. The maximum episode limit is set to 215 and the maximum time limit is set to 200. The simulation of the UR5 robot arm is trained to arrive at the position of the target. In the initial stage, the robotic arm performs action exploration, so the reward at the beginning did not increase immediately. As can be seen from the figures, the reward and the Q value become stable at around 35000 steps, which indicates that the model-free off-policy actor critic based deep reinforcement learning training is successful.

Fig. 6.: Average reward of the proposed method. The transparent area indicates the standard deviation of the results.

Fig. 7.: Average Q value of the proposed method. The transparent area indicates the standard deviation of the results.
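For reference, averaged curves with a standard-deviation band of this kind can be produced as sketched below. This assumes the raw per-step results are stored in a NumPy array and that the averaging is done over consecutive, non-overlapping windows of 512 values, which is an assumption about the exact scheme used for Figs. 6 and 7:

```python
import numpy as np

def windowed_stats(values, window=512):
    """Mean and standard deviation over consecutive, non-overlapping windows."""
    values = np.asarray(values, dtype=float)
    n = (len(values) // window) * window          # drop any incomplete trailing window
    chunks = values[:n].reshape(-1, window)
    return chunks.mean(axis=1), chunks.std(axis=1)

# Example usage with hypothetical per-step rewards:
# means, stds = windowed_stats(per_step_rewards, window=512)
```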
Fig. 8 shows solutions for reaching random target positions after training for 43000 steps. The aim of the proposed method is to train the UR5 robot arm to reach the target position (the pink disc) smoothly while keeping the end effector orientation of the UR5 robot arm as straight as possible downwards. The first and the last pictures depict how the UR5 robot arm finds path planning solutions in the right half plane, where the joint values of the shoulder pan joint, shoulder lift joint, elbow joint and wrist 1 joint are all within the allowable range. The second and the third pictures depict how the UR5 robot arm finds path planning solutions in the left half plane. It can be deduced from Fig. 8 that the proposed training method is feasible.

Fig. 8.: Solutions of reaching random target positions with the proposed method.

4.2 Comparison with Standard Path Planning Method
Unlike the standard path planning method, the proposed method can not only guarantee that the joint angles of the UR5 robotic arm are within the allowable range each time it reaches the target point, but also ensure that the joint angles of the UR5 robotic arm are always within the allowable range during the entire episode of training. Fig. 9 shows an example of failed path planning. It can be seen from the pictures in Fig. 9 that the planned trajectory generated by the standard path planning method contains a severe jerk, which makes it impossible to execute in the actual experiment.

Fig. 9.: Failed path planning with the standard path planning method. The orange UR5 robot arm stands for the initial position of the actual UR5 robot arm, the grey UR5 robot arm stands for the current position of the actual UR5 robot arm, and the transparent UR5 robot arm denotes the planned trajectory for the actual movement of the real UR5 robot arm generated by OMPL (Open Motion Planning Library) in rviz.

Table 1.: Algorithm comparison for path planning.

Success rate (%)        10 trials    20 trials    40 trials
Standard method         80           75           77.5
The proposed method     100          100          100

Table 1 depicts the success rate comparison for path planning. It can be inferred that the standard path planning method failed to complete the task around 20% more often than the proposed method. On the one hand, the allowable joint angle ranges of the two methods are different. With the proposed method, each joint on the UR5 robot arm has its own allowable range, thereby saving training time and improving the training speed. In contrast, with the standard path planning method, the range of every joint of the UR5 robot arm is set to [−2π, 2π]. On the other hand, there exist multiple solutions when performing path planning with the standard method. This may lead to path planning joint solutions with large jerk manoeuvres. The total reward of the proposed method contains a smoothness reward according to equation (1). By introducing the smoothness reward inside the total reward, it is more likely to reduce the occurrence of a large jerk manoeuvre when performing path planning with the proposed method. A minor drawback of the proposed method lies in the requirement of training before it can be used. However, this can be overcome by implementing the proposed method in the simulation environment.
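One way such per-joint allowable ranges can be respected is by rescaling the actor's tanh output, which lies in [−1, 1], to each joint's own limits. The paper does not report its limit values or the exact mechanism, so the ranges below are placeholders and the scaling is only an illustrative sketch:

```python
import numpy as np

# Placeholder per-joint limits in radians (shoulder pan, shoulder lift, elbow, wrist 1);
# the actual allowable ranges used in the paper are not reported.
JOINT_LOW  = np.array([-1.5, -2.0, -2.0, -1.5])
JOINT_HIGH = np.array([ 1.5,  0.0,  2.0,  1.5])

def scale_action(tanh_output):
    """Map the actor's tanh output in [-1, 1] to each joint's allowable range."""
    tanh_output = np.clip(tanh_output, -1.0, 1.0)
    return JOINT_LOW + 0.5 * (tanh_output + 1.0) * (JOINT_HIGH - JOINT_LOW)
```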

5. CONCLUSION

In this paper, a model-free off-policy actor critic based deep reinforcement learning method is proposed to solve the problem of path planning. The simulation results in CoppeliaSim validated the feasibility of the proposed method. For the hardware implementation, a standard path planning method has been implemented as a baseline to compare and contrast both methods. It can be deduced that the proposed method can generate a smooth trajectory and keep the joint angles of the UR5 robot arm always within the allowable range. In addition, the proposed method does not rely on any model, while the standard path planning method is model-based. In future work, the proposed method will also be implemented on the real UR5 robot arm. Besides, vision information can be applied in the proposed method to accomplish more complicated tasks.

ACKNOWLEDGEMENT

This work was supported by EPSRC project No. EP/S03286X/1 and EPSRC RAIN project No. EP/R026084/1.
REFERENCES

[1] K. Wei and B. Ren, "A method on dynamic path planning for robotic manipulator autonomous obstacle avoidance based on an improved rrt algorithm," Sensors, vol. 18, no. 2, p. 571, 2018.
[2] K. Wu, J. Hu, B. Lennox, and F. Arvin, "Sdp-based robust formation-containment coordination of swarm robotic systems with input saturation," Journal of Intelligent & Robotic Systems, vol. 102, no. 1, pp. 1–16, 2021.
[3] H. Niu, Z. Ji, F. Arvin, B. Lennox, H. Yin, and J. Carrasco, "Accelerated sim-to-real deep reinforcement learning: Learning collision avoidance from human player," in 2021 IEEE/SICE International Symposium on System Integration (SII). IEEE, 2021, pp. 144–149.
[4] K. Wu, J. Hu, B. Lennox, and F. Arvin, "Finite-time bearing-only formation tracking of heterogeneous mobile robots with collision avoidance," IEEE Transactions on Circuits and Systems II: Express Briefs, 2021.
[5] J. Hu, H. Niu, J. Carrasco, B. Lennox, and F. Arvin, "Voronoi-based multi-robot autonomous exploration in unknown environments via deep reinforcement learning," IEEE Transactions on Vehicular Technology, vol. 69, no. 12, pp. 14413–14423, 2020.
[6] H. Niu, Z. Ji, Z. Zhu, H. Yin, and J. Carrasco, "3d vision-guided pick-and-place using kuka lbr iiwa robot," in 2021 IEEE/SICE International Symposium on System Integration (SII). IEEE, 2021, pp. 592–593.
[7] B. Dasgupta and T. Mruthyunjaya, "Singularity-free path planning for the stewart platform manipulator," Mechanism and Machine Theory, vol. 33, no. 6, pp. 711–725, 1998.
[8] A. A. Maciejewski and C. A. Klein, "Obstacle avoidance for kinematically redundant manipulators in dynamically varying environments," The International Journal of Robotics Research, vol. 4, no. 3, pp. 109–117, 1985.
[9] T. Greville, "The pseudoinverse of a rectangular or singular matrix and its application to the solution of systems of linear equations," SIAM Review, vol. 1, no. 1, pp. 38–43, 1959.
[10] S. R. Buss, "Introduction to inverse kinematics with jacobian transpose, pseudoinverse and damped least squares methods," IEEE Journal of Robotics and Automation, vol. 17, no. 1-19, p. 16, 2004.
[11] D. E. Whitney, "Resolved motion rate control of manipulators and human prostheses," IEEE Transactions on Man-Machine Systems, vol. 10, no. 2, pp. 47–53, 1969.
[12] J. Lander and G. Content, "Making kine more flexible," Game Developer Magazine, vol. 1, no. 15-22, p. 2, 1998.
[13] R. Mukundan, "A robust inverse kinematics algorithm for animating a joint chain," International Journal of Computer Applications in Technology, vol. 34, no. 4, pp. 303–308, 2009.
[14] J.-J. Park, J.-H. Kim, and J.-B. Song, "Path planning for a robot manipulator based on probabilistic roadmap and reinforcement learning," International Journal of Control, Automation, and Systems, vol. 5, no. 6, pp. 674–680, 2007.
[15] R. S. Sutton, "Introduction: The challenge of reinforcement learning," in Reinforcement Learning. Springer, 1992, pp. 1–3.
[16] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[17] H. Zhang, H. Jiang, Y. Luo, and G. Xiao, "Data-driven optimal consensus control for discrete-time multi-agent systems with unknown dynamics using reinforcement learning method," IEEE Transactions on Industrial Electronics, vol. 64, no. 5, pp. 4091–4100, 2016.
[18] H. Kandath, J. Senthilnath, and S. Sundaram, "Mutli-agent consensus under communication failure using actor-critic reinforcement learning," in 2018 IEEE Symposium Series on Computational Intelligence (SSCI). IEEE, 2018, pp. 1461–1465.
[19] Y. Zhang and M. M. Zavlanos, "Distributed off-policy actor-critic reinforcement learning with policy consensus," in 2019 IEEE 58th Conference on Decision and Control (CDC). IEEE, 2019, pp. 4674–4679.
[20] E. Rohmer, S. P. Singh, and M. Freese, "V-rep: A versatile and scalable robot simulation framework," in 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2013, pp. 1321–1326.
[21] V. Nair and G. E. Hinton, "Rectified linear units improve restricted boltzmann machines," in ICML, 2010.
[22] M. Lutz, Programming Python. O'Reilly Media, Inc., 2001.
[23] N. Ketkar, "Introduction to keras," in Deep Learning with Python. Springer, 2017, pp. 97–111.
[24] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, "Ros: an open-source robot operating system," in ICRA Workshop on Open Source Software, vol. 3. Kobe, Japan, 2009, p. 5.
[25] H. R. Kam, S.-H. Lee, T. Park, and C.-H. Kim, "Rviz: a toolkit for real domain data visualization," Telecommunication Systems, vol. 60, no. 2, pp. 337–345, 2015.
[26] I. A. Sucan, M. Moll, and L. E. Kavraki, "The open motion planning library," IEEE Robotics & Automation Magazine, vol. 19, no. 4, pp. 72–82, 2012.
[27] R. M. Salinas, "Aruco: A minimal library for augmented reality applications based on opencv," 2012.
