
2022 4th International Conference on Control and Robotics (ICCR)

Smooth Path Planning of 6-DOF Robot Based on Reinforcement Learning

Jiawei Tian, Dazi Li*
College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China
2020210474@mail.buct.edu.cn, lidz@mail.buct.edu.cn

Abstract—Current path planning algorithms such as the A-star algorithm and the RRT (Rapidly-exploring Random Trees) algorithm can meet the obstacle avoidance planning needs of the 6-DOF robot, but they do not consider the smoothness of the path. Working along an unreasonable path for a long time places a great load on the joints of the 6-DOF robot and seriously shortens its service life. In this paper, we use reinforcement learning to reconcile the A-star algorithm and the RRT algorithm for smooth path planning of the robot. Experimental results show that, compared with the A-star algorithm and the RRT algorithm alone, the fusion algorithm yields a smoother path in a more reasonable time.

Keywords—path planning, A-star, RRT, RL, 6-DOF robot

I. INTRODUCTION

After years of development, the mechanical arm has made great progress, for example the 4-DOF robot arm [1-3] and the 6-DOF mechanical arm. With the popularization of AI (artificial intelligence), the requirements of industrial production on the intelligence of the 6-DOF robot are increasing. At present, the traditional teaching method is still used for path planning of the 6-DOF robot in industrial production. In the traditional teaching method, on-site personnel operate the robot in advance, and the relevant parameters are recorded and saved; the robot arm system then performs repeated operations according to these parameters. This method can only deal with a fixed environment, and it is very cumbersome to adjust the parameters of each robot. Another problem is that such a path is neither the shortest nor the smoothest, which wastes production time and accelerates the wear of the 6-DOF robot joints. Therefore, the robot arm needs to carry out path planning according to the environment. Khatib [4] proposed the artificial potential field method, which can plan a high-quality path, but the planned path is not necessarily optimal and it easily falls into a local minimum. LaValle [5] proposed the RRT algorithm; with multiple obstacles its search success rate is high and an optimal path can be found, but the algorithm efficiency is low. Yin [6] used an improved RRT algorithm to realize path planning. Compared with the basic RRT algorithm, it has the ability of gradual optimization and can gradually drive the planned path toward the optimal value, but it also suffers from slow planning speed and a low success rate. Chen [7] proposed a hybrid algorithm combining a low-oscillation artificial potential field with an Adaptive Rapidly-exploring Random Tree (ARRT). First, the low-oscillation artificial potential field method is used to search, and when local minima, collisions or other problems are encountered, the algorithm switches to the ARRT algorithm to escape. This algorithm has strong adaptability, and both the success rate and the search efficiency are significantly improved. Xie [8] proposed an improved potential field method, which enables the robot arm to plan obstacle avoidance paths for dynamic targets. Zhao [9] proposed a planning algorithm based on the A-star algorithm and a cubic spline function. The collision-free grasping method of the robot arm proposed by Du [10] takes the minimum offset of the joint angle as the optimization index, can update the end trajectory according to environmental changes, and adjusts the posture accordingly to improve adaptability. Yang [11] proposed an improved A-star algorithm that changes the path weight according to the real-time path congestion, which greatly alleviates the node redundancy problem of the A-star algorithm. Ju [12] proposed an improved A-star algorithm to solve the path planning problem under specific conditions and find the global shortest path.

At present, the RRT algorithm is widely used in the path planning of the 6-DOF robot, because it is simple, easy to implement and easy to extend with non-holonomic constraints; however, the generated path is extremely random. For the optimal path, 6-DOF robot path planning relies on the A-star algorithm [13]. However, as the size of the space increases, the number of iterations of the A-star algorithm grows exponentially and the path generation time becomes too long, which prevents it from being well applied to AI production. In this paper, the complementary advantages of the two algorithms, namely the light weight and high speed of RRT and the high accuracy and smooth path of A-star, are considered. We study the use of a reinforcement learning method to integrate the two, using strategic decision-making to apply different algorithms in different scenarios, so that the planned path can meet both the smoothness requirements of the 6-DOF robot and the time requirements. The experimental results show that the output curve of the reinforcement learning fusion decision algorithm is smoother and the time required is within the demanded range.

The rest of this article is organized as follows. In Section II, the mathematical modeling of the robot, the DH parameters and the 6-DOF robot kinematics are presented. Section III presents the A-star algorithm, the RRT algorithm and the DQN (Deep Q-Network) algorithm. In Section IV, experimental results and comparisons are presented. Section V draws the conclusion and outlines future work.

II. MATHEMATICAL MODELING OF ROBOT

We chose the classic six-degree-of-freedom robot PUMA 560 as the research object. This 6-DOF robot belongs to the PUMA series of industrial robots launched by Unimation in the 1970s. It is composed of a base, two main links and an end effector, with a total of six rotating joints.

A. DH Parameters

The mathematical model of the 6-DOF robot is generally represented in the D-H coordinate system. This method was proposed by Denavit and Hartenberg [14] to describe the motion state of a chain structure composed of multiple rigid-body links connected by rotating or prismatic joints.

B. 6-DOF Robot Kinematics

The PUMA 560 has six joints, all of which are rotary joints. In order to obtain the transformation matrix of each link, the parameters of the link and the joint need to be set first. Solving the forward kinematics of the 6-DOF robot means obtaining the position and posture of the end effector from the transformation matrices when the deflection angle of each joint is known. Under the standard D-H coordinate system, the general expression of the link transformation can be written as:

{}^{i-1}_{i}T = R(z,\theta_i)\,T(0,0,d_i)\,T(a_{i-1},0,0)\,R(x,\alpha_{i-1})
            = \begin{bmatrix} c\theta_i & -s\theta_i & 0 & 0 \\ s\theta_i & c\theta_i & 0 & 0 \\ 0 & 0 & 1 & d_i \\ 0 & 0 & 0 & 1 \end{bmatrix}
              \begin{bmatrix} 1 & 0 & 0 & a_{i-1} \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
              \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & c\alpha_{i-1} & -s\alpha_{i-1} & 0 \\ 0 & s\alpha_{i-1} & c\alpha_{i-1} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}   (1)

After obtaining the transformation matrices of all the links, and assuming that the reference coordinate system of the workspace coincides with the base coordinate system of the robot arm, the pose transformation matrix of the end effector of the robot arm in the reference coordinate system can be written as:

{}^{B}_{6}T = {}^{B}_{1}T\,{}^{1}_{2}T\,{}^{2}_{3}T\,{}^{3}_{4}T\,{}^{4}_{5}T\,{}^{5}_{6}T   (2)

The inverse kinematics of the 6-DOF robot is the opposite of the forward kinematics: the possible angles of each joint must be solved for when the position and attitude of the end effector are known. Solving by the analytical method means starting from the forward kinematics equation of the robot arm, repeatedly multiplying by the inverse of each transformation matrix in the forward equation, and extracting the value of each joint in the process; this gives the inverse kinematics solution of the robot arm. The analytical solution of the 6-DOF robot can be described as:

{}^{0}_{6}T = {}^{0}_{1}T(\theta_1)\,{}^{1}_{2}T(\theta_2)\,{}^{2}_{3}T(\theta_3)\,{}^{3}_{4}T(\theta_4)\,{}^{4}_{5}T(\theta_5)\,{}^{5}_{6}T(\theta_6)   (3)

The joint angles \theta_i are the results to be obtained.
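To make the forward kinematics of equations (1)-(3) concrete, the following Python sketch chains the per-link transforms numerically. The paper provides no code, and the DH parameter values listed below are the commonly quoted PUMA 560 values under the standard DH convention; treat both the helper names and the numbers as illustrative assumptions rather than the authors' model.

import numpy as np

def dh_transform(theta, d, a, alpha):
    """Link transform of Eq. (1): Rot(z, theta) * Trans(0, 0, d) * Trans(a, 0, 0) * Rot(x, alpha)."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def forward_kinematics(q, dh_table):
    """Chain the per-link transforms of Eq. (1) to obtain the end-effector pose of Eq. (2)."""
    T = np.eye(4)
    for theta_i, (d, a, alpha) in zip(q, dh_table):
        T = T @ dh_transform(theta_i, d, a, alpha)
    return T  # 4x4 pose of the end effector in the base frame

# Commonly quoted PUMA 560 parameters (d, a, alpha) per joint -- illustrative values only.
PUMA560_DH = [
    (0.0,     0.0,     np.pi / 2),
    (0.0,     0.4318,  0.0),
    (0.15005, 0.0203, -np.pi / 2),
    (0.4318,  0.0,     np.pi / 2),
    (0.0,     0.0,    -np.pi / 2),
    (0.0,     0.0,     0.0),
]

# Initial joint angles used later in Section IV; the resulting position depends on the DH values assumed here.
q_init = [0.0, np.pi / 4, np.pi, 0.0, np.pi / 4, 0.0]
print(forward_kinematics(q_init, PUMA560_DH)[:3, 3])  # end-effector position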
III. ALGORITHM IMPLEMENTATION

A. A-star Algorithm

The A-star algorithm is a minimum-cost search algorithm that combines the advantages of the Dijkstra algorithm with a heuristic function. Provided that the heuristic function is effective, the algorithm can find the shortest feasible path. The core formula of the A-star algorithm, that is, the way it evaluates nodes, is its evaluation function:

f(n) = g(n) + h(n)   (4)

where f(n) is the total path cost from the start point to the end point when node n is selected. The cost is positively related to the length of the path; provided the path is collision-free, the smaller f(n) is, the better the selected path. g(n) is the actual path cost required to move from the starting point to node n, accumulated as the sum of the lengths of the sub-paths from the starting point to node n. h(n) is a heuristic function that estimates the path cost from node n to the end point.
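As an illustration of how the evaluation function (4) drives the search, the following minimal Python sketch implements A-star over an abstract graph. The neighbors and heuristic callables are hypothetical placeholders for the workspace discretization, which the paper does not specify.

import heapq
import itertools

def a_star(start, goal, neighbors, heuristic):
    """Minimal A-star: always expand the node with the smallest f(n) = g(n) + h(n), as in Eq. (4).
    `neighbors(n)` yields (next_node, step_cost) pairs; `heuristic(n, goal)` returns h(n)."""
    tie = itertools.count()                                    # tie-breaker for the heap
    open_set = [(heuristic(start, goal), next(tie), 0.0, start)]
    g_best, came_from = {start: 0.0}, {}
    while open_set:
        _, _, g, node = heapq.heappop(open_set)
        if node == goal:                                       # goal reached: reconstruct the path
            path = [node]
            while node in came_from:
                node = came_from[node]
                path.append(node)
            return path[::-1]
        for nxt, cost in neighbors(node):
            g_new = g + cost                                   # candidate g(n)
            if g_new < g_best.get(nxt, float("inf")):
                g_best[nxt] = g_new
                came_from[nxt] = node
                f_new = g_new + heuristic(nxt, goal)           # f(n) = g(n) + h(n)
                heapq.heappush(open_set, (f_new, next(tie), g_new, nxt))
    return None                                                # no collision-free path found

On a uniform grid, neighbors would enumerate the collision-free adjacent cells with unit cost and heuristic could be the Euclidean distance to the goal; any admissible heuristic of this kind preserves the shortest-path guarantee.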
Our research finds that even though the A-star algorithm has been improved and optimized, the growth of its iterative operations over the 6-DOF robot workspace is exponential. With the expansion of the workspace and changes in the working environment, it is increasingly difficult for the A-star algorithm to meet the path planning needs of the 6-DOF robot.

B. RRT Algorithm

The RRT algorithm builds an undirected graph on a known map by sampling and then finds a path by searching that graph; it searches, samples and builds the map at the same time, starting from a single point. The RRT algorithm can be regarded as a "tree algorithm": it starts from a starting configuration (a point in a two-dimensional diagram), continuously extends the tree, and finally connects to the target point. RRT is a sampling-based path planning algorithm. It searches the workspace globally through sampling (taking a given starting point as the root node and generating new child nodes from randomly picked points so that the expansion tree keeps growing), which gives it excellent spatial search ability, and it also performs well for path planning in high-dimensional spaces. However, due to its random sampling, the RRT algorithm suffers from strong randomness, repeated searching and unsmooth paths. The main steps of the RRT algorithm are as follows (a code sketch is given after the list):

1) Define X_init (the start node) and X_goal (the goal node) in the workspace and establish a rapidly expanding random tree T whose root node is X_init;

2) Randomly sample X_rand (a random node) in the workspace;

3) Traverse all nodes of T to find the node X_near closest to X_rand;

4) Check the feasibility between X_near and X_rand. If it is not feasible, discard X_rand and sample a new X_rand; if it is feasible, make the next judgment;

5) When the distance between X_near and X_rand is less than the step size ε, connect X_near and X_rand directly, let the new node X_new = X_rand, and add X_new to T; when the distance between X_near and X_rand is greater than or equal to ε, expand one step of length ε from X_near toward X_rand, generate a new node X_new, and add X_new to T;
6) Repeat the above process until X_new = X_goal or the distance between X_new and X_goal is less than or equal to ε; then the path planning is complete and the cycle ends;

7) Trace back from X_goal through the parent nodes to find the planned path.
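The steps above translate almost directly into code. The sketch below is a minimal Python version under assumed helpers: collision_free(p, q) stands in for the feasibility check of step 4 (applied here to the edge leading to X_new) and bounds for the sampling region; neither is specified in the paper.

import math
import random

def rrt(x_init, x_goal, collision_free, step=0.1, max_iter=5000,
        bounds=((-1.0, 1.0), (-1.0, 1.0), (-1.0, 1.0))):
    """Minimal RRT following steps 1)-7): sample X_rand, find the nearest tree node X_near,
    move one step toward X_rand to get X_new, and stop once X_new is close enough to X_goal."""
    x_init, x_goal = tuple(x_init), tuple(x_goal)
    tree = {x_init: None}                                            # node -> parent (step 1)
    for _ in range(max_iter):
        x_rand = tuple(random.uniform(lo, hi) for lo, hi in bounds)  # step 2
        x_near = min(tree, key=lambda n: math.dist(n, x_rand))       # step 3
        d = math.dist(x_near, x_rand)
        if d < step:                                                 # step 5, first case
            x_new = x_rand
        else:                                                        # step 5, second case
            x_new = tuple(n + step * (r - n) / d for n, r in zip(x_near, x_rand))
        if not collision_free(x_near, x_new):                        # step 4 (edge feasibility)
            continue
        tree[x_new] = x_near
        if math.dist(x_new, x_goal) <= step:                         # step 6: within one step of the goal
            path, node = [x_goal], x_new
            while node is not None:                                  # step 7: trace back to X_init
                path.append(node)
                node = tree[node]
            return path[::-1]
    return None                                                      # planning failed

Using the step size as the goal tolerance mirrors step 6; in practice a separate tolerance could be chosen.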
replay buffer, store the quad data {(�� ,�� ,�� ,�'� )} sampled from
In theoretical research and experimental simulation, we
the environment each time in the replay buffer, and then
found that RRT algorithm often leads to unsatisfactory path
randomly sample several data from the replay buffer for
planning due to its strong randomness in long-distance path
training when training Q-network. This can play the following
planning.
two roles:
C. DQN

Since each dimension of the state of the robot arm is continuous, it is impossible to record the values in a table, so a common solution is function fitting. Because a neural network has strong expressive ability, it can be used to represent the function Q. If the action is continuous (infinite), the input of the neural network is the state s together with the action a, and the output is a scalar representing the value obtained by taking action a in state s. If the action is discrete (finite), besides the continuous-action formulation above, the network can also take only the state s as input and output the Q value of every action at once. In general, DQN (and Q-learning) can only handle the case of discrete actions, because there is a max over actions in the update rule of the function Q. Suppose the parameter used by the neural network to fit the function Q is ω, so that the Q value of every possible action a in each state s can be expressed as Q_ω(s, a). The update rule of Q-learning is:

Q(s,a) \leftarrow Q(s,a) + \alpha\left[ r + \gamma \max_{a' \in A} Q(s',a') - Q(s,a) \right]   (5)

The formula uses the temporal-difference (TD) learning target r + \gamma \max_{a' \in A} Q(s',a') to incrementally update Q(s,a), that is, to make Q(s,a) approach the TD target. Then, for a set of data {(s_i, a_i, r_i, s'_i)}, the loss function of the Q-network can be constructed in the form of a mean square error:

\omega^* = \arg\min_{\omega} \frac{1}{2N} \sum_{i=1}^{N} \left[ Q_\omega(s_i, a_i) - F_i \right]^2   (6)

where F_i = r_i + \gamma \max_{a' \in A} Q_\omega(s'_i, a').
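For concreteness, the TD target and the mean-square loss of equations (5)-(6) can be written for a sampled batch as below. This is a plain NumPy sketch; the default γ = 0.98 is the value used in the experiments (Table I), and the optional terminal-state mask is an addition that the paper's formula does not include.

import numpy as np

def td_targets(rewards, next_q_values, gamma=0.98, dones=None):
    """F_i = r_i + gamma * max_{a'} Q_omega(s'_i, a'), the target used in Eq. (6)."""
    max_next = next_q_values.max(axis=1)           # max over actions for each s'_i
    if dones is not None:                          # assumed extra: terminal states keep only r_i
        max_next = max_next * (1.0 - dones)
    return rewards + gamma * max_next

def q_loss(q_values, actions, targets):
    """Mean square error between Q_omega(s_i, a_i) and the targets F_i, as in Eq. (6)."""
    q_taken = q_values[np.arange(len(actions)), actions]
    return 0.5 * np.mean((q_taken - targets) ** 2)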
So far, Q-learning can be extended to its neural network form, the deep Q-network. Because DQN is an off-policy algorithm, an ε-greedy strategy can be used to balance exploration and exploitation when collecting data, and the collected data can be stored for use in subsequent training. There are also two very important modules in DQN, experience replay and the target network, which help DQN achieve stable and excellent performance.

In general supervised learning, it is assumed that the training data are independent and identically distributed. Each time we train the neural network, we randomly sample one or several data points from the training set for gradient descent, and as learning continues each training sample is used multiple times. In the original Q-learning algorithm, however, each data point is used to update the Q value only once. In order to better combine Q-learning with a deep neural network, the DQN algorithm adopts the experience replay method. Specifically, a replay buffer is maintained; the quadruple (s_t, a_t, r_t, s'_t) sampled from the environment at each step is stored in the replay buffer, and when the Q-network is trained, several samples are drawn at random from the replay buffer. This plays the following two roles:

• It makes the samples satisfy the independence assumption. The data obtained by sequential sampling in an MDP do not satisfy this assumption, because the state at the current time is related to the state at the previous time. Non-independent, non-identically distributed data have a great influence on the training of neural networks and make the network overfit to the most recently collected data. Experience replay breaks the correlation between samples and lets them meet the independence assumption.

• It improves sample efficiency. Each sample can be used many times, which is very suitable for the gradient-based learning of deep neural networks.
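A replay buffer of this kind is only a few lines of Python. The sketch below stores the quadruples described above and samples uniform random minibatches; the defaults follow the M = 10000 capacity and N = 64 batch size listed in Table I, but the class itself is an assumed implementation, not the authors' code.

import random
from collections import deque

class ReplayBuffer:
    """Experience replay pool holding (s, a, r, s') quadruples in a fixed-size FIFO buffer."""

    def __init__(self, capacity=10000):                 # maximum sample size M from Table I
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=64):                    # sampling data volume N from Table I
        batch = random.sample(self.buffer, batch_size)  # uniform sampling breaks the correlation
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)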

The final update goal of the DQN algorithm is to make Q_ω(s, a) approach the target F. Since the TD target itself contains the output of the neural network, the target keeps changing while the network parameters are being updated, which easily makes the training of the neural network unstable. To solve this problem, DQN uses the idea of a target network: since the continuous updating of the Q-network during training causes the target to change constantly, it is better to temporarily fix the Q-network used in the TD target. To realize this idea, two Q-networks are needed:

• The original training network Q_ω(s, a) is used to compute Q_ω(s, a) in the loss function and is updated by ordinary gradient descent.

• The target network Q_{ω⁻}(s, a) is used to compute the F term in the loss function, where ω⁻ denotes the parameters of the target network. If the parameters of the two networks were kept identical at all times, the algorithm would still be unstable. To make the update target more stable, the target network is not updated at every step. Specifically, the target network uses a relatively old set of parameters of the training network: the training network is updated at every training step, while the parameters of the target network are synchronized with the training network only every C steps, that is, ω⁻ ← ω.

To sum up, the specific flow of the DQN algorithm is as follows:

Algorithm: DQN
  Initialize the network Q_ω(s, a) with random parameters ω;
  Copy the same parameters ω⁻ ← ω to initialize the target network Q_{ω⁻};
  Initialize the experience replay pool R;
  for e = 1 → E do
    Get the initial state s_1;
    for t = 1 → T do
      Select the action a_t from Q_ω(s_t, ·) with the ε-greedy policy;
      Execute a_t, obtain the reward r_t and the next state s_{t+1};
      Store (s_t, a_t, r_t, s_{t+1}) in R;
      if size(R) >= N then
        Sample a batch {(s_i, a_i, r_i, s_{i+1})}, i = 1, …, N, from R;
        y_i = r_i + γ max_a Q_{ω⁻}(s_{i+1}, a);
        Minimize the loss (1/N) Σ_i (y_i − Q_ω(s_i, a_i))²;
        Update the network (and every C steps synchronize ω⁻ ← ω);
      end if
    end for
  end for
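The pseudocode above maps onto a few lines of PyTorch. The sketch below is an assumed implementation of a single training update: the Q-network uses the one-layer, 128-neuron fully connected architecture with ReLU described in the next paragraph, the targets come from the frozen target network, and sync_target performs the periodic ω⁻ ← ω copy. The learning rate and discount follow Table I; everything else (tensor shapes, the Adam choice) is an assumption.

import torch
import torch.nn as nn

class QNet(nn.Module):
    """Q-network: one fully connected hidden layer of 128 units with ReLU, two discrete actions."""
    def __init__(self, state_dim, n_actions=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, s):
        return self.net(s)

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.98):
    """One gradient step: y_i = r_i + gamma * max_a Q_target(s_{i+1}, a), MSE loss against Q(s_i, a_i)."""
    states, actions, rewards, next_states = batch                 # float tensors; actions are int64
    q_sa = q_net(states).gather(1, actions.view(-1, 1)).squeeze(1)
    with torch.no_grad():                                         # the target network is not differentiated
        y = rewards + gamma * target_net(next_states).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def sync_target(q_net, target_net):
    """Every C training steps: omega^- <- omega."""
    target_net.load_state_dict(q_net.state_dict())

# Example wiring (learning rate 0.002 and discount 0.98 as in Table I):
# q_net, target_net = QNet(state_dim), QNet(state_dim); sync_target(q_net, target_net)
# optimizer = torch.optim.Adam(q_net.parameters(), lr=0.002)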
The action space of the 6-DOF robot is relatively simple; it contains only two actions: ① A-star and ② RRT. The Q-network can therefore use a single fully connected layer of 128 neurons with ReLU as the activation function. ε-greedy is a commonly used exploration strategy: when selecting an action, the agent randomly selects an unexplored action with a small probability ε, and with the remaining probability 1 − ε selects the action with the largest value estimate among the existing actions. This not only prevents the agent from gathering a large number of negative samples through excessive "exploration" of unknown strategies, which would make it difficult to reach the destination and would waste a lot of training time, but also prevents the agent from falling into a local optimum through excessive "exploitation" of the existing optimal strategy by always selecting the action with the largest Q value.
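A minimal sketch of the ε-greedy selection described above, with ε = 0.01 from Table I as the default; the action indices 0 and 1 standing for A-star and RRT are an assumed encoding.

import random

def epsilon_greedy(q_values, epsilon=0.01):
    """With probability epsilon explore a random action; otherwise exploit the largest Q value."""
    n_actions = len(q_values)                 # here 2: 0 = A-star, 1 = RRT (assumed encoding)
    if random.random() < epsilon:
        return random.randrange(n_actions)    # exploration
    return max(range(n_actions), key=lambda a: q_values[a])   # exploitation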
The reward function should first meet the most basic conditions: when the end of the robot arm reaches the target point it receives a positive reward, and when the robot arm encounters an obstacle it receives a corresponding punishment. At the same time, in order to avoid blind advancing of the robot arm, a certain punishment is also given when the robot arm has not reached the target point for a long time. The reward function is expressed as follows:

r(s_t, a_t) = \begin{cases} 200 & s_{t+1} \text{ reaches the target node} \\ -100 & s_{t+1} \text{ encounters an obstacle} \\ -100 & s_{t+1} \text{ is off the target node} \end{cases}   (7)

where r(s_t, a_t) denotes the immediate reward after the robot arm performs the action. When the robot arm reaches the target point, it is rewarded with +200 and the experiment ends; when the robot arm encounters an obstacle, it is rewarded with -100 and the experiment ends; when the robot arm is still far away from the target node after several (more than 2) movements, a penalty of -100 is given.
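The reward of Eq. (7) can be written directly as a small function. The zero reward returned for an ordinary intermediate step, and treating only the first two cases as episode-terminating, are assumptions made for the sketch; the paper specifies only the three listed cases.

def reward(reached_goal, hit_obstacle, moves_off_target):
    """Immediate reward r(s_t, a_t) of Eq. (7); returns (reward, episode_done)."""
    if reached_goal:
        return 200, True          # target node reached: +200, experiment ends
    if hit_obstacle:
        return -100, True         # obstacle encountered: -100, experiment ends
    if moves_off_target > 2:
        return -100, False        # still off the target after more than 2 movements
    return 0, False               # assumed neutral reward for an ordinary step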
IV. SIMULATION AND EXPERIMENTS

In order to verify the effect of the RL algorithm on the robot path planning results, we compared the experimental results of different algorithms in the same map environment. The main comparison methods are: ① the A-star algorithm; ② the RRT algorithm; ③ the improved A-star algorithm; ④ the improved RRT algorithm [15]; ⑤ the RL (A-star + RRT) algorithm. The algorithms are implemented using the 64-bit version of MATLAB R2022a. The hardware is an AMD 3600 CPU, an NVIDIA 1660S GPU and 16 GB of memory; the operating system is Windows 10.

In the simulation environment in MATLAB, the coordinates of the robot base were set as the origin of the coordinate system, and the initial joint angles of the robot were set as θ_init = [0, π/4, π, 0, π/4, 0]. Substituting these into the forward kinematics of the 6-DOF robot, we obtained the starting point coordinate P_init = (0.5963, -0.1501, -0.0144). The goal point coordinate is P_goal = (-0.6, 0.4, 0.2). We placed two balls with a radius of 0.1 m as obstacles; their positions are O_1 = (0.3, -0.25, 0) and O_2 = (-0.32, -0.08, 0.2). The safety margin was 0.02 m.
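The paper does not show how collisions with the two spherical obstacles are tested; a straightforward check consistent with the stated 0.1 m radius and 0.02 m safety margin might look like the following sketch (the segment sampling density is an arbitrary choice).

import numpy as np

OBSTACLES = [np.array([0.3, -0.25, 0.0]), np.array([-0.32, -0.08, 0.2])]   # sphere centers O_1, O_2
RADIUS, MARGIN = 0.1, 0.02                                                 # obstacle radius and safety margin

def point_is_free(p):
    """True if point p keeps at least RADIUS + MARGIN clearance from both obstacles."""
    return all(np.linalg.norm(p - c) > RADIUS + MARGIN for c in OBSTACLES)

def segment_is_free(p, q, n_samples=20):
    """Approximate edge feasibility check used by the planners: sample along the segment p-q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return all(point_is_free(p + t * (q - p)) for t in np.linspace(0.0, 1.0, n_samples))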
The specific parameters of DQN are set as shown in Table I.

TABLE I. SYMBOLS AND SET VALUES OF THE TRAINING PARAMETERS IN DQN

Parameter               Symbol   Set value
Training cycle          T        500
Learning rate           α        0.002
Discount rate           γ        0.98
ε-greedy                ε        0.01
Sampling data volume    N        64
Maximum sample size     M        10000

After 500 cycles of training with the parameters in the table, the training results are shown in Fig. 1.

From Fig. 1, it can be seen that the performance of DQN improves rapidly after about 100 training cycles and finally converges to the optimal return of 200. After the performance of DQN improves, it still experiences a certain degree of oscillation, mainly due to the impact of the arg max operation once the neural network has overfitted to some local empirical data.

Fig. 1. Variation of rewards in the DQN network with training cycle.

Fig. 2. Path planning results of the five algorithms in the same environment: (a) A-star, (b) improved A-star, (c) RRT, (d) improved RRT, (e) RL (A-star + RRT).

The experimental results are shown in Table II.

TABLE II. COMPARISON OF PATH PLANNING ALGORITHM RESULTS

Algorithm          Path length (m)   Time spent (s)
A-star             2.0624            49.724
Improved A-star    1.7068            4.398
RRT                2.4619            7.291
Improved RRT       2.0151            4.770
RL(A-star+RRT)     1.9174            2.667

It can be seen from Fig. 2 and Table II that, since the iterative calculation of the A-star algorithm is very complicated, a small optimization of the A-star algorithm can improve it significantly, but this improvement sacrifices the accuracy of the path, and the resulting excessively tortuous paths may damage the joints of a physical 6-DOF robot. Due to the randomness of RRT, even when an offset probability is added to the algorithm, the improvement in path and time is not obvious. The fusion algorithm may not give the best path length, but it takes the shortest time and does not lose path accuracy, which is friendlier to the movement of the robot arm.

V. CONCLUSION

This paper presents a fusion of the A-star algorithm and the RRT algorithm guided by reinforcement learning. The algorithm adopts the DQN network structure, which transforms the path planning problem of the 6-DOF robot into a choice between two discrete strategies. The algorithm takes into account both the path accuracy of the A-star algorithm and the real-time performance of the RRT algorithm. Experimental results show that the fusion algorithm performs better than either single algorithm.

In the future, we will extend this algorithm to a physical 6-DOF robot and apply it to actual production.
REFERENCES

[1] A. Sutapun and V. Sangveraphunsiri, "A 4-DOF Upper Limb Exoskeleton for Stroke Rehabilitation: Kinematics Mechanics and Control," International Journal of Mechanical Engineering and Robotics Research, vol. 4, no. 3, pp. 269-272, July 2015, doi: 10.18178/ijmerr.4.3.269-272.
[2] O. T. Abdelaziz, S. A. Maged, and M. I. Awad, "Towards Dynamic Task/Posture Control of a 4DOF Humanoid Robotic Arm," International Journal of Mechanical Engineering and Robotics Research, vol. 9, no. 1, pp. 99-105, January 2020, doi: 10.18178/ijmerr.9.1.99-105.
[3] H. S. Kim, "Design of a Novel 4-DOF High-Speed Parallel Robot," International Journal of Mechanical Engineering and Robotics Research, vol. 7, no. 5, pp. 500-506, September 2018, doi: 10.18178/ijmerr.7.5.500-506.
[4] O. Khatib, "Real-Time Obstacle Avoidance for Manipulators and Mobile Robots," The International Journal of Robotics Research, vol. 5, no. 1, pp. 90-98, Mar. 1986, doi: 10.1177/027836498600500106.
[5] S. M. LaValle, "Rapidly-Exploring Random Trees: A New Tool for Path Planning," 1999. Accessed: Sep. 04, 2022. [Online]. Available: http://www.researchgate.net/publication/2639200_rapidly-exploring_random_trees_a_new_tool_for_path_planning
[6] B. Yin, W. Lu, and W. Wei, "Obstacle avoidance planning of 6-DOF manipulator based on RRT*," Electronic Test, no. 16, pp. 46-48, 2021, doi: 10.16520/j.cnki.1000-8519.2021.16.017.
[7] M. Chen et al., "Obstacle Avoidance Path Planning of Manipulator in Multiple Obstacles Environment," Computer Integrated Manufacturing Systems, vol. 27, no. 4, pp. 990-998, 2021, doi: 10.13196/j.cims.2021.04.003.
[8] L. Xie and S. Liu, "Dynamic obstacle-avoiding motion planning for manipulator based on improved artificial potential field," Control Theory & Applications, vol. 35, no. 9, pp. 1239-1249, 2018.
[9] J. Zhao, Y. Chao, and Y. Yuan, "The Research of Industrial Manipulator Path Smoothness Based on A-star Algorithm and Cubic Spline Function," Machine Design and Research, vol. 35, no. 1, pp. 61-64+69, 2019, doi: 10.13952/j.cnki.jofmdr.2019.0100.
[10] F. Du, L. Su, J. Jiao, and Z. Cao, "A Method for Collision-free Grasp of Mobile Manipulator," Mechanical Science and Technology for Aerospace Engineering, vol. 35, no. 4, pp. 535-538, 2016, doi: 10.13433/j.cnki.1003-8728.2016.0407.
[11] R. Yang and L. Cheng, "Path Planning of Restaurant Service Robot Based on A-star Algorithms with Updated Weights," in 2019 12th International Symposium on Computational Intelligence and Design (ISCID), 2019, vol. 1, pp. 292-295, doi: 10.1109/ISCID.2019.00074.
[12] C. Ju, Q. Luo, and X. Yan, "Path Planning Using an Improved A-star Algorithm," in 2020 11th International Conference on Prognostics and System Health Management (PHM-2020 Jinan), 2020, pp. 23-26, doi: 10.1109/PHM-Jinan48558.2020.00012.
[13] P. E. Hart, N. J. Nilsson, and B. Raphael, "A Formal Basis for the Heuristic Determination of Minimum Cost Paths," IEEE Transactions on Systems Science and Cybernetics, vol. 4, no. 2, pp. 100-107, Jul. 1968, doi: 10.1109/TSSC.1968.300136.
[14] J. Denavit and R. S. Hartenberg, "A Kinematic Notation for Lower-Pair Mechanisms," Journal of Applied Mechanics, vol. 22, 1955, doi: 10.1115/1.4011045.
[15] D. Li and J. Wang, "A Robust Adaptive Method Based on ESO for Trajectory Tracking of Robot Manipulator," in 2019 IEEE 15th International Conference on Control and Automation (ICCA), Jul. 2019, pp. 506-511, doi: 10.1109/ICCA.2019.8899737.
