

ICRAIC-2021 • Journal of Physics: Conference Series 2203 (2022) 012065 • doi:10.1088/1742-6596/2203/1/012065

Simulation of Robotic Arm Grasping Control Based on Proximal Policy Optimization Algorithm

Zhizhuo Zhang1,*, Change Zheng1

1 School of Technology, Beijing Forestry University, Beijing, China
* Email: zhangzhizhuo@bjfu.edu.cn

Abstract. A robot typically has many inverse kinematics solutions, and deep reinforcement learning allows the robot to find a suitable solution in a short time. To address the problem of sparse rewards in deep reinforcement learning, this paper proposes an improved PPO algorithm. First, a simulation environment for the operation of the robotic arm is built. Second, a convolutional neural network processes the images read by the robotic arm's camera, yielding the Actor and Critic networks. Third, based on the inverse kinematics of the robotic arm and the reward mechanism of deep reinforcement learning, a hierarchical reward function incorporating motion accuracy is designed to promote the convergence of the PPO algorithm. Finally, the improved PPO algorithm is compared with the traditional PPO algorithm. The results show that the improved PPO algorithm improves both convergence speed and operating accuracy.

1. Introduction
Control of the robotic arm is an indispensable part of robot control. A robotic arm usually has many inverse kinematics solutions, and the motion of each joint differs between solutions. Through deep reinforcement learning algorithms, the robotic arm can quickly find a suitable trajectory in different working environments.
However, in many deep reinforcement learning settings, the sparse reward problem remains a key unsolved challenge [1]. In a sparse reward environment, the agent only receives a reward at the end of the sequential decision-making process. The lack of effective feedback during the intermediate steps leads to under-fitting of the policy network, slow training, and high cost. There are many ways to address the sparse reward problem. They can be roughly divided into multi-goal guidance methods that add rewards for virtual goals, hierarchical policy methods that add rewards for subtasks, imitation learning methods that add rewards for similarity to expert behaviour, and curiosity methods that add rewards for state novelty.
Designing a reward function suited to the environment can directly reduce the impact of sparse rewards on deep reinforcement learning training. In manipulator control, one of the key purposes of the reward function is to drive the end of the manipulator to the target point. Li Heyu et al. [2] improved the reward function of the PPO algorithm and divided the grasping operation of the robotic arm into two stages: from the initial position to a point directly below the target object, and from directly below the target object to the specified grasping position. This method reduced the jitter of the robot arm during motion. Zhao et al. [3] proposed a robotic grasping method that incorporates the concept of energy consumption from dynamics; the distance between the robotic arm and the real target is evaluated by calculating the energy consumption, thereby improving the accuracy of the reward function. Sun Kang et al. [4] set up a reward function based on the time to complete the capture task, the distance between the end-capture mechanism of the robotic
arm and the target point, and the magnitude of the observed joint driving torques.
In response to the problems and research discussed above, this paper modifies the reward function of the PPO algorithm and sets up a hierarchical reward function according to the distance between the end of the robotic arm and the target object. This mitigates the low learning efficiency of the PPO algorithm to a certain extent.

2. Related works
The PPO algorithm replaces the on-policy update of the policy gradient (PG) algorithm with an off-policy style update: data collected under the previous policy are reused through importance sampling, and each iteration updates the policy while keeping it close to the previous policy and bounding the loss. The objective function of PPO is as follows, where $A^{\theta_k}(s_t, a_t)$ is the advantage function [5]:

$$J^{\theta_k}(\theta) = \mathbb{E}_{(s_t,a_t)\sim p_{\theta_k}}\!\left[\frac{p_{\theta}(a_t \mid s_t)}{p_{\theta_k}(a_t \mid s_t)}\, A^{\theta_k}(s_t, a_t)\right] \tag{1}$$

This article adopts the PPO-Clip method. The update rule for the objective function is as follows:

$$\theta_{k+1} = \arg\max_{\theta}\; \mathbb{E}_{s,a\sim p_{\theta_k}}\big[L(s,a,\theta_k,\theta)\big] \tag{2}$$

$$L(s,a,\theta_k,\theta) = \min\!\left(\frac{p_{\theta}(a_t \mid s_t)}{p_{\theta_k}(a_t \mid s_t)} A^{\theta_k}(s_t,a_t),\ \operatorname{clip}\!\left(\frac{p_{\theta}(a_t \mid s_t)}{p_{\theta_k}(a_t \mid s_t)},\, 1-\varepsilon,\, 1+\varepsilon\right) A^{\theta_k}(s_t,a_t)\right) = \min\!\left(\frac{p_{\theta}(a_t \mid s_t)}{p_{\theta_k}(a_t \mid s_t)} A^{\theta_k}(s_t,a_t),\ g\big(\varepsilon, A^{\theta_k}(s_t,a_t)\big)\right) \tag{3}$$

$$g(\varepsilon,A) = \begin{cases} (1+\varepsilon)A & A \geq 0 \\ (1-\varepsilon)A & A < 0 \end{cases} \tag{4}$$
When the advantage function is positive, the next update will tend to increase the probability of
taking the same action. When the advantage function is negative, the next update will tend to reduce the
probability of taking the same action.
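As an illustration only (not the authors' code), a minimal PyTorch sketch of the clipped surrogate objective in Eq. (3) might look as follows; the tensor names logp, logp_old, and adv are assumptions standing in for the log-probabilities under the current and previous policies and the advantage estimates:

```python
import torch

def ppo_clip_loss(logp, logp_old, adv, clip_ratio=0.2):
    """Clipped surrogate objective of Eq. (3), returned as a loss to minimize."""
    ratio = torch.exp(logp - logp_old)                    # p_theta(a|s) / p_theta_k(a|s)
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * adv
    # Maximizing min(ratio * adv, clipped) is equivalent to minimizing its negative.
    return -(torch.min(ratio * adv, clipped)).mean()
```

Because of the min with the clipped term, the gradient vanishes once the probability ratio moves more than ε away from 1 in the direction favoured by the advantage, which is exactly the behaviour described above.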

3. Introduction to the working principle of the robotic arm

3.1. Introduction to the simulation environment


PyBullet is a Python module that can be used for the physical simulation of robots, games, and machine learning [6]. The authors build a simulation environment in PyBullet based on a KUKA six-axis robotic arm. The D-H parameters of the arm are listed in Table 1, where $\alpha_{i-1}$ is the link twist, $a_{i-1}$ the link length, $d_i$ the link offset, and $\theta_i$ the joint angle.
Table 1. Mechanical arm D-H parameter table
𝑖 𝛼𝑖−1 𝑎𝑖−1 (𝑚𝑚) 𝑑𝑖 (𝑚𝑚) 𝜃𝑖 Range of motion
1 0° 0 0 𝜃1 ±170°
2 −90° 340 0 𝜃2 ±120°
3 0° 440 0 𝜃3 ±170°
4 −90° 440 260 𝜃4 ±120°
5 90° 126 0 𝜃5 ±170°
6 −90° 0 0 𝜃6 ±175°
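For readers unfamiliar with PyBullet, the sketch below shows how such a simulation environment can be set up. It is only an illustration under stated assumptions: PyBullet's bundled KUKA iiwa URDF (a 7-axis arm) is used as a stand-in for the six-axis arm of Table 1, and none of the identifiers below come from the paper.

```python
import pybullet as p
import pybullet_data

# Illustrative setup only; the authors' own six-axis KUKA model may differ.
p.connect(p.DIRECT)                                    # headless physics server (p.GUI to visualize)
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)

plane = p.loadURDF("plane.urdf")
arm = p.loadURDF("kuka_iiwa/model.urdf", basePosition=[0, 0, 0], useFixedBase=True)

# List each joint and its range of motion, analogous to the last column of Table 1.
for j in range(p.getNumJoints(arm)):
    info = p.getJointInfo(arm, j)
    name, lower, upper = info[1].decode(), info[8], info[9]
    print(f"joint {j}: {name}, range [{lower:.2f}, {upper:.2f}] rad")
```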

3.2. Design of robotic arm motion


In the simulation environment, the camera transmits the collected image back to the computer. After the image is processed, it is used as the state S observed by the actor. The computer then determines the next displacement of the end of the robotic arm along the X-, Y-, and Z-axes. From the current coordinates $(x_r, y_r, z_r)$ of the end of the robot arm and the current rotation angles $(a_1, a_2, a_3, a_4, a_5, a_6)$ of its joints, the computer calculates the angle through which each joint of the
robot arm must rotate for its end to reach the next position $(x_r', y_r', z_r')$. The motor at each joint then drives the joint to rotate, and an action A is output:

$$A = [\Delta a_1, \Delta a_2, \Delta a_3, \Delta a_4, \Delta a_5, \Delta a_6] \tag{5}$$

When the joint angles of the robotic arm change, the image data collected by the camera also change, so S is updated. This process is repeated until the end of the robotic arm reaches the target position $(x_d, y_d, z_d)$.
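A single control step of this loop could be sketched as follows. This is an assumption-laden illustration rather than the authors' implementation: arm_id and end_effector_link come from the environment setup, all joints are assumed to be movable, and PyBullet's built-in inverse kinematics solver stands in for the paper's inverse kinematics computation.

```python
import numpy as np
import pybullet as p

def step_towards(arm_id, end_effector_link, delta_xyz):
    """Move the end effector by delta_xyz (the X/Y/Z displacement chosen by the
    actor) and convert it into joint commands via inverse kinematics."""
    # Current end-effector position (x_r, y_r, z_r).
    current_pos = np.array(p.getLinkState(arm_id, end_effector_link)[4])
    target_pos = current_pos + np.asarray(delta_xyz)

    # Joint angles that bring the end effector to (x_r', y_r', z_r').
    joint_targets = p.calculateInverseKinematics(arm_id, end_effector_link, target_pos.tolist())

    # Action A = [Δa1, ..., Δa6]: change of each joint angle, as in Eq. (5).
    current_joints = [p.getJointState(arm_id, j)[0] for j in range(len(joint_targets))]
    action = np.array(joint_targets) - np.array(current_joints)

    # Drive the joint motors towards their new targets and advance the simulation.
    p.setJointMotorControlArray(arm_id, list(range(len(joint_targets))),
                                p.POSITION_CONTROL, targetPositions=list(joint_targets))
    p.stepSimulation()
    return action
```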

4. Improved PPO algorithm design

4.1. Reward setting


This article improves the traditional PPO algorithm. Combined with the inverse kinematics of the robotic arm, the following stepped reward rules are designed. In what follows, the distance between the end position of the robotic arm and the target position is $dis = \sqrt{(x_r - x_d)^2 + (y_r - y_d)^2 + (z_r - z_d)^2}$, and the reward value is $r$.

4.1.1. Specify limits on the motion of the end of the robotic arm. When the end exceeds the allowed range of motion given in Eq. (6), r = −0.5.

$$\begin{cases} 0.3 < x < 0.9 \\ -0.3 < y < 0.3 \\ 0 < z < 0.6 \end{cases} \tag{6}$$

4.1.2. Set the maximum number of motion steps. When the number of motion steps is exceeded and the
end of the robotic arm has not reached the target position, r = −0.5.

4.1.3. Hierarchical reward function. To give the system enough positive rewards, the reward for approaching and reaching the target position is divided into steps:

$$r = \begin{cases} 0.1, & 0.05 < dis \leq 0.2 \\ 5, & 0.01 < dis \leq 0.05 \\ 7, & dis \leq 0.01 \end{cases} \tag{7}$$
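Putting rules 4.1.1-4.1.3 together, a reward of this form could be implemented as in the sketch below. The maximum step count (max_steps) and the reward of 0 outside the bands of Eq. (7) are assumptions; the paper does not state those values.

```python
import numpy as np

def reward(end_pos, target_pos, step_count, max_steps=100):
    """Hierarchical reward of Section 4.1 (sketch); max_steps is assumed."""
    x, y, z = end_pos
    dis = float(np.linalg.norm(np.asarray(end_pos) - np.asarray(target_pos)))

    # 4.1.1: penalty for leaving the workspace limits of Eq. (6).
    if not (0.3 < x < 0.9 and -0.3 < y < 0.3 and 0 < z < 0.6):
        return -0.5
    # 4.1.2: penalty for exhausting the step budget without reaching the target.
    if step_count > max_steps and dis > 0.01:
        return -0.5
    # 4.1.3: stepped positive rewards of Eq. (7).
    if dis <= 0.01:
        return 7.0
    if dis <= 0.05:
        return 5.0
    if dis <= 0.2:
        return 0.1
    return 0.0   # assumption: no reward outside the listed distance bands
```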

4.2. Setting of neural network parameters


Two neural networks are built with PyTorch 1.2, one to train the policy function and one to train the value function [7]. Each network has three convolutional layers, a pooling layer, and a fully connected layer. The input of the policy network is the processed 84×84×4 image, and its output is the next displacement of the end of the manipulator along the X-, Y-, and Z-axes. The input of the value network is the same image, and its output estimates the quality of the current policy. Because the output action coordinates can be positive or negative, all activation functions are set to Tanh. Figure 1 shows the specific parameters of the policy network.

Figure 1. Schematic diagram of convolutional neural network structure


The structure of the value function network is largely the same as that of the policy network, except that its output is 1 × 1.
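The exact kernel sizes and channel counts appear only in Figure 1, so the sketch below fills them in with plausible placeholder values; it also shares one convolutional trunk between the two heads for brevity, whereas the paper trains two separate networks.

```python
import torch
import torch.nn as nn

class ConvActorCritic(nn.Module):
    """Illustrative version of the Section 4.2 networks: three conv layers, one
    pooling layer, one fully connected layer, Tanh activations, 84x84x4 input.
    Kernel sizes and channel counts are placeholders, not the paper's values."""
    def __init__(self, act_dim=3):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.Tanh(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.Tanh(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1), nn.Tanh(),
            nn.MaxPool2d(kernel_size=2),
            nn.Flatten(),
        )
        with torch.no_grad():
            n_flat = self.trunk(torch.zeros(1, 4, 84, 84)).shape[1]
        self.pi = nn.Sequential(nn.Linear(n_flat, act_dim), nn.Tanh())  # X/Y/Z displacement
        self.v = nn.Linear(n_flat, 1)                                   # scalar value estimate

    def forward(self, obs):                    # obs: (batch, 4, 84, 84) image tensor
        h = self.trunk(obs)
        return self.pi(h), self.v(h)
```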

4.3. Setting of algorithm parameters


The improved PPO algorithm is built with the deep reinforcement learning framework provided by Spinning Up [8]. Training runs for 50 rounds, each consisting of 4000 interactions with the environment. The reward discount factor γ is 0.99, the target KL divergence is 0.01, and the learning rates of the policy optimizer and the value-function optimizer are both 0.001. The policy and value functions each take 80 gradient descent steps per round. The clip ratio is 0.2, the maximum number of backtracking steps K is 10, the backtracking coefficient is 0.8, and the GAE-lambda is 0.97.
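Most of these hyperparameters map directly onto the arguments of Spinning Up's PPO implementation, as sketched below. The environment factory make_kuka_grasp_env is hypothetical (the custom PyBullet grasping environment is not part of Spinning Up), and the backtracking parameters mentioned above have no counterpart in the PPO interface.

```python
from spinup import ppo_pytorch as ppo

def make_kuka_grasp_env():
    """Hypothetical factory for the PyBullet grasping environment of Section 3;
    it must return a Gym-style env with image observations and X/Y/Z actions."""
    raise NotImplementedError("plug in the custom grasping environment here")

ppo(env_fn=make_kuka_grasp_env,
    steps_per_epoch=4000,   # 4000 interactions with the environment per round
    epochs=50,              # 50 training rounds
    gamma=0.99,             # reward discount factor
    clip_ratio=0.2,         # clip ratio (epsilon)
    pi_lr=1e-3,             # policy optimizer learning rate
    vf_lr=1e-3,             # value-function optimizer learning rate
    train_pi_iters=80,      # gradient steps per round for the policy
    train_v_iters=80,       # gradient steps per round for the value function
    lam=0.97,               # GAE-lambda
    target_kl=0.01)         # target KL divergence for early stopping
```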

5. Experiment design and realization

5.1. Comparison of algorithm effects


The reward obtained in each episode is recorded during training; the horizontal axis is the episode index within a training run, and the vertical axis is the reward obtained in that episode. The left plot shows the training result of the PPO algorithm with the hierarchical reward function; the right plot shows the training result of the traditional PPO algorithm.

Figure 2. Comparison of the reward curves of the two PPO algorithms


As the figure shows, the reward obtained per episode in the left plot gradually increases as training progresses, indicating that the neural network uses the rewards gathered through interaction with the environment to adjust its parameters and gradually learns to make correct control decisions. The reward in the left plot eventually stabilizes, indicating that the network parameters have converged and a stable control effect has been reached. By contrast, the reward in the right plot shows no obvious upward trend and fluctuates strongly: because the environmental rewards are limited and the guidance signal is too narrow, the system does not receive effective rewards from the environment and therefore cannot update its parameters accurately for the next round of training.
Comparing the improved PPO algorithm with the traditional PPO algorithm shows that the improved algorithm reaches a higher reward in fewer steps, i.e. it converges faster and trains more effectively.

5.2. Grasping experiment


The authors use the improved PPO algorithm and the traditional PPO algorithm to carry out simulated grasping experiments and compare the grasping accuracy of the two algorithms. Both algorithms are trained for 50 rounds, and the results are shown in Table 2.


Figure 3. Schematic diagram of the grasping process


It can be seen that, as the amount of training increases, the success rate of the grasping operation rises and the average number of steps required falls. This shows that with continued training the system adapts better and better to the environment and the algorithm's performance keeps improving.
Table 2. Grasping test results (number of successful grasps)
Algorithm                      10 experiments   30 experiments   50 experiments
Improved PPO algorithm         9                26               47
Traditional PPO algorithm      4                17               22
Taking a successful grasp performed with the improved PPO algorithm as an example, the rotation angle of each joint of the robotic arm is plotted in Figure 4.

Figure 4. The curve of the rotation angle of each joint


It can be seen that, during a single grasp, the angle of each joint of the robotic arm changes smoothly and with a clear direction, with little repeated jitter. This shows that the robotic arm has already acquired a mature inverse kinematics plan for target objects at different positions.

6. Conclusion
This paper constructs an improved PPO algorithm, designs a reasonable reward function, and uses it to train the neural networks. Verification shows that the proposed algorithm converges in a short time, improves efficiency, and achieves a stable control effect. After training, the robotic arm can continuously learn and update itself among the many inverse kinematics solutions and approach the target object as quickly as possible.

Acknowledgement
First and foremost, I would like to show my deepest gratitude to my supervisor, Zhang Shaolin, a
respectable, responsible and resourceful scholar, who has provided me with valuable guidance in every
stage of the writing of this thesis. Without his enlightening instruction, impressive kindness and patience,
I could not have completed my thesis.


I shall extend my thanks to Associate Professor Zheng Change for all her kindness and help. Her
keen and vigorous academic observation enlightens me not only in this thesis but also in my future study.
Last but not least, I would like to thank the School of Technology, Beijing Forestry University and
the Institute of Automation, Chinese Academy of Sciences for providing an experimental environment
for my thesis.
This thesis work was partially supported by the Beijing University Student Research and Career Creation Program (project number s20201002114).

References
[1] He Q 2021 Research on Multi-goal-conditioned Method in Reinforcement Learning with Sparse Rewards (Hefei: University of Science and Technology of China)
[2] Li H, Lin T and Zeng B 2020 Control method of space manipulator by using reinforcement learning Aerospace Control 38 p 6
[3] Zhao R and Tresp V 2018 Energy-based hindsight experience prioritization Conference on Robot Learning (PMLR) pp 113-122
[4] Sun K, Wang Y, Du D and Qi N 2020 Capture control strategy of free-floating space manipulator based on deep reinforcement learning algorithm Manned Spaceflight 26 p 6
[5] Jian S 2021 Algorithm principle and implementation of proximal policy optimization [EB/OL] https://www.jianshu.com/p/9f113adc0c50
[6] Coumans E and Bai Y 2017 PyBullet Quickstart Guide [EB/OL] https://github.com/bulletphysics/bullet3
[7] Zhu G, Huo Y, Luan Q and Shi Y 2021 Research on IoT environment temperature prediction based on PPO algorithm optimization Transducer and Microsystem Technologies 40 p 4
[8] OpenAI 2018 Spinning Up [EB/OL] https://spinningup.openai.com/en/latest/
