Comparison of Multiple Reinforcement Learning and Deep Reinforcement Learning Methods For The Task Aimed at Achieving The Goal

Roman Parák and R. Matousek
Brno University of Technology
Abstract

Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) methods are a promising approach to solving complex tasks in the real world with physical robots. In this paper, we compare several reinforcement learning (Q-Learning, SARSA) and deep reinforcement learning (Deep Q-Network, Deep SARSA) methods for a task aimed at achieving a goal using the UR3 robotic arm. The main optimization problem of this experiment is to find the best solution for each RL/DRL scenario, that is, to minimize the Euclidean-distance accuracy error and to smooth the resulting path by the Bézier spline method. The simulation and the real-world application are controlled by the Robot Operating System (ROS). The learning environment is implemented using the OpenAI Gym library, which uses the RVIZ simulation tool and the Gazebo 3D modeling tool for dynamics and kinematics.

Received: 1 March 2021
Accepted: 9 May 2021
Published: 21 June 2021

Keywords: Reinforcement Learning, Deep neural network, Motion planning, Bézier spline, Robotics, UR3.
MENDEL — Soft Computing Journal, Volume 27, No. 1, June 2021, Brno, Czech Republic
tion planning, etc. (Section 2 Related Work), and we also summarize the necessary methods needed to create our work (Section 3 Methods).

In the main part of the work, we focus on solving the problem of achieving the goal using advanced methods of motion planning (Section 4 Experiments and Results). Our approach compares different learning techniques (Reinforcement Learning / Deep Reinforcement Learning) to find the trajectory from the initial position to the target position and smooths the resulting trajectory using a Bézier spline (B-spline) curve.

In the final part of the paper, we focus on the challenges we have encountered, the current limitations, and future extensions of our work (Section 5 Conclusion and Future Work).

2 Related Work

Our approach to finding a point in Cartesian space using multiple RL/DRL techniques is based on previous work in the areas of reinforcement learning, deep reinforcement learning, and motion planning. In the following section, we briefly discuss previous work on each of the relevant topics.

In research in the field of robotic motion planning, the concept of machine learning emerges. In particular, Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) techniques are an area of growing interest for the robotics research community.

Researchers at Erle Robotics have created a framework for testing various RL/DRL algorithms called OpenAI Gym [6, 32]. Various robot simulation tools are used as an extension of the OpenAI toolkit, with Gazebo [1, 15] and PyBullet [8] being the most commonly used today. The connection of the robotic tool with the robust physics core and the Gym toolkit is created using ROS (Robot Operating System) [25].

RL/DRL methods are used in several real-world robotic experiments. One approach focuses on unscrewing operations in the robotic disassembly of electronic waste using the Q-Learning method [16]; other approaches have used a robotic arm to achieve a goal using the Deep Reinforcement Learning methods DQN (Deep Q-Network) [5] and TRPO (Trust Region Policy Optimization) [22], and tested the result of the experiment in a real application. Some approaches use 2D/3D cameras and other sensors to observe the robotic environment [5, 10, 12]; others use only dynamic simulation with the specified environment [13, 16] or real-time robot learning techniques [19].

Motion planning is one of the most fundamental research topics in robotics. Some approaches have used traditional planning methods, such as RRT [18] and RRT* [31], where structured tree methods are used to find the curve from point A to point B. Other approaches use modern techniques, such as RL/DRL, but both methods use Bézier curves to characterize complex trajectories and smooth motion planning [23].

3 Methods

This section provides a brief introduction to the theory of Reinforcement Learning and Deep Reinforcement Learning, as well as path smoothing techniques using the Bézier spline curve. In each subsection, we present two methods of RL (Q-Learning, SARSA) / DRL (Deep Q-Network, Deep SARSA) control; the last subsection is focused on trajectory smoothing.

3.1 Markov Decision Process

A Markov Decision Process (MDP) is a classical formulation of sequential decision making, where actions influence not just immediate rewards, but also subsequent situations or states, and through those, future rewards [27]. MDPs include delayed reward and the need to trade off immediate and delayed reward.

The MDP contains a structure of four basic elements: (s_t, a_t, P(s_{t+1}|s_t, a_t), R(s_{t+1}|s_t, a_t)), where s_t and s_{t+1} represent the current and next state, a_t represents the action, P(s_{t+1}|s_t, a_t) is the probability of the transition to the state s_{t+1} when taking the action a_t in the state s_t, and R(s_{t+1}|s_t, a_t) represents the immediate reward received from the environment after the transition from s_t to s_{t+1}. The agent and environment interact in a sequence of discrete time steps t = 0, 1, ... The transition probability in the MDP structure depends only on the current state s_t and the chosen action a_t [11, 27, 33].

Figure 2: Basic structure of the agent-environment interaction in a Markov decision process (MDP) [11, 27]

3.2 Reinforcement Learning

Reinforcement learning (RL) is an area of machine learning that deals with sequential decision-making. The main task of this method is to learn how agents ought to take sequences of actions in an environment to maximize cumulative rewards. Markov decision processes (MDPs) are an ideal mathematical formulation for RL problems, for which a direct learning methodology to achieve the goal is proposed. The agent aims to receive not only the current reward, but also the cumulative reward over the next learning states [11, 27, 33].

The agent and MDP together form a sequence that contains a number of state-action pairs represented as τ = ((s_0, a_0), (s_1, a_1), ...). The return is defined as the
discounted return for the sequence τ at time step t:

  R_t(τ) = Σ_{t′=t}^{∞} γ^(t′−t) r_{t′},   (1)

where γ is a discount factor (0 ≤ γ ≤ 1) and r_{t′} is the reward at time step t′.

The main optimization problem of the RL method is the need to find the optimal policy π*, which is defined as the policy maximizing the expected return:

  π* = argmax_π E_{τ∼π}[R_t(τ)].   (2)

Q-Learning (QL): is one of the most popular methods of Reinforcement Learning. The QL method was developed as an off-policy TD (Temporal Difference) control algorithm, defined by

  Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)],   (3)

where α is a learning rate (0 < α ≤ 1), max_a Q(s_{t+1}, a) is the estimate of the optimal future value, and the other parameters are described in the previous section [11, 27].

State-action-reward-state-action (SARSA): is an on-policy TD control method, very similar to the previous Q-Learning method. The SARSA method is characterized by using the next action a_{t+1} instead of the maximization [27]:

  Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)].   (4)

3.3 Deep Reinforcement Learning

Deep reinforcement learning (DRL) combines artificial neural networks (ANN) with a reinforcement learning architecture that allows agents to learn the best possible actions in an individual environment to achieve their goals. The main approach of this method is to approximate the value function and optimize the goals, mapping the state-action pairs to the expected rewards [11, 33]. This area of research can solve a wide range of complex decision-making tasks that were previously out of reach for a machine.

DRL methods that give the learning agent auxiliary tasks within a jointly learned representation can significantly increase the effectiveness of the learning sample. This is based on simultaneously maximizing a number of pseudoreward functions, such as immediately predicting the reward (γ = 0), predicting changes in the next observation, or predicting the activation of some hidden unit of the agent's neural network [11].

Deep Q-Network (DQN): is a combination of Q-learning with a deep convolutional ANN, a multi-layer deep ANN specialized in the processing of spatial data fields. For a given state s, the output of the neural network is a vector of action values Q(s, a; θ), where θ are the network parameters [11, 29]. The two most important components of the DQN algorithm are the use of a target network and the use of experience replay. The formula used by DQN is then:

  Q(s_t, a_t; θ_t) ← Q(s_t, a_t; θ_t) + α[r_{t+1} + γ max_a Q(s_{t+1}, a; θ_t⁻) − Q(s_t, a_t; θ_t)],   (5)

where the target network with parameters θ_t⁻ is the same as the online network, except that its parameters are copied from the online network every τ steps, so that θ_t⁻ = θ_t at those steps, and kept fixed in between [29].

Deep SARSA (DSARSA): is similar to the previous method. In the learning structure, the approximation of the value function is done with a convolutional neural network (CNN) that uses a Q-network to obtain the Q-value, like DQN. The function is represented by a CNN with weights θ and an output that represents the Q-values for each action [20]:

  Q(s_t, a_t; θ_t) ← Q(s_t, a_t; θ_t) + α[r_{t+1} + γ Q(s_{t+1}, a_{t+1}; θ_t⁻) − Q(s_t, a_t; θ_t)].   (6)

3.4 Bézier-Spline Curves

Bézier curves use Bernstein polynomials, which were described in 1912 by the Russian mathematician Sergei Bernstein [30].

B-spline curves, like Bézier curves, use polynomials to generate a curve segment. The main difference between simple Bézier curves and a B-spline is that a B-spline uses a series of control points to determine the local geometry of the curve. This feature ensures that only a small portion of the curve is changed when a control point is moved [30]. A Bézier spline curve of degree p is defined by n + 1 control points P_0, P_1, ..., P_n [2]:

  B(t) = Σ_{i=0}^{n} N_{i,p}(t) P_i,   (7)

where N_{i,p}(t) is the normalized B-spline basis function defined over the knots.
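As a concrete illustration of Eq. (7), the following sketch evaluates a clamped B-spline using the Cox-de Boor recursion for the basis functions N_{i,p}(t). The control points and knot vector are illustrative only, not taken from the experiment; with this clamped knot vector and four control points, the curve degenerates to a single cubic Bézier segment:

```python
def basis(i, p, t, knots):
    """Cox-de Boor recursion for the normalized B-spline basis N_{i,p}(t)."""
    if p == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = right = 0.0
    if knots[i + p] != knots[i]:
        left = (t - knots[i]) / (knots[i + p] - knots[i]) * basis(i, p - 1, t, knots)
    if knots[i + p + 1] != knots[i + 1]:
        right = ((knots[i + p + 1] - t) / (knots[i + p + 1] - knots[i + 1])
                 * basis(i + 1, p - 1, t, knots))
    return left + right

def bspline_point(t, ctrl, p, knots):
    """B(t) = sum_i N_{i,p}(t) P_i (Eq. 7) for a list of 3-D control points."""
    if t >= knots[-1]:  # clamp the right endpoint of the half-open basis support
        t = knots[-1] - 1e-12
    return tuple(sum(basis(i, p, t, knots) * ctrl[i][d] for i in range(len(ctrl)))
                 for d in range(3))

# Four illustrative control points and a clamped cubic (p = 3) knot vector.
ctrl = [(0.0, 0.0, 0.0), (0.1, 0.2, 0.0), (0.3, 0.2, 0.1), (0.4, 0.0, 0.1)]
knots = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
path = [bspline_point(k / 10.0, ctrl, 3, knots) for k in range(11)]
```

The smoothed path interpolates the first and last control points (the start and target of a trajectory) and stays inside their convex hull; in practice, a library such as NURBS-Python [4] can be used for the same evaluation.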
4 Experiments and Results

In this section, we present an application that includes a task using several RL/DRL techniques for the problem of reaching a point in Cartesian space using a collaborative manipulator with 6 DOF (Degrees of Freedom). The experiment is demonstrated in simulation and in the real world.

4.1 Setting up the learning environment

For our experiment, we use the collaborative robotic arm UR3 from Universal Robots (Fig. 1), more precisely a 6-axis robotic arm with a working radius of 500 mm / 19.7 in, a payload of 3 kg / 6.6 lb, and a repeatability of ±0.1 mm [28].

The simulation is controlled through communication between the RVIZ [7, 26] simulation tool and the Gazebo 3D modeling tool [1, 15]. In our experiment, we use the UR3 URDF (Universal Robot Description Format) model without a gripper and the official ROS driver for Universal Robots [3], which is used to control both the real and the simulated robot.

The main structure of our experiment is shown in Fig. 1. Fig. 3 shows a simplified structure of the UR3 model in the RVIZ simulation tool. The blue sphere represents the initial position of the robot, and the target to be reached by the robot is represented as a red sphere. The purple box (A_target) is approximately the restricted area from which targets are selected, and the yellow box (A_search) represents the area of safe movement. The safe area is created with a margin given to the individual robot model to avoid collisions with the target area (imagine that the target area is a bin for an object-selection problem).

4.2 Definition of experiment

In our problem, we choose a task aimed at achieving a goal in Cartesian space using a robotic arm. The task aims to autonomously find the trajectory from the initial position to the target position using various learning techniques (Reinforcement Learning / Deep Reinforcement Learning) and to smooth the resulting trajectory using the Bézier spline method.

First, the robot position is initialized and the target position is randomly selected. After selecting a target position, we start the learning process for each learning technique. The distance between the start point P_i and the target point P_t is described by the Euclidean distance d_t (Eq. 8):

  d_t(P_i, P_t) = √( Σ_{k=1}^{n} (P_{i,k} − P_{t,k})² ),   (8)

where the sum runs over the n Cartesian coordinates.

4.3 Learning process

We implement the learning environment within the OpenAI Gym library [6, 32], which provides an interface to train and test the learning process.

To evaluate our algorithm, we performed identical experiments with several RL/DRL techniques. In the first experiment, we use classical RL techniques (QL, SARSA), and in the next experiment, we use modern DRL techniques (DQN, DSARSA). The hyper-parameters of the individual RL/DRL techniques are given in Tab. 1 and Tab. 2.

Table 1: Hyper-parameters used for the Reinforcement Learning methods (Q-Learning, SARSA)

Hyper-parameter      | Symbol       | Value
Episodes of Training | M_min, M_max | 500, 1000
Steps per Episode    | T            | 100
Discount Factor      | γ            | 0.75
Learning Rate        | α            | 0.3

The reward is defined as (Eq. 9):

  R_t = [d_t(P_p, P_t) − d_t(P_a, P_t)] / d_t(P_i, P_t) + R_s,   (9)

where d_t(P_i, P_t) is the initial Euclidean distance between the start point P_i and the target point P_t.
The accuracy error δ_d is expressed as a percentage of the initial distance:

  δ_d = [d_t(P_a, P_t) / d_t(P_i, P_t)] · 100.   (11)
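For illustration, Eqs. (8), (9) and (11) can be sketched as plain Python functions. The reading of P_p and P_a as the previous and the actual TCP position, and of R_s as an additive success bonus, are assumptions based on the surrounding text:

```python
import math

def euclidean(p, q):
    """Euclidean distance d_t(P, Q) between two Cartesian points (Eq. 8)."""
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

def reward(p_prev, p_actual, p_init, p_target, r_s=0.0):
    """Normalized progress reward (Eq. 9): positive when the TCP moved
    closer to the target in this step; r_s is an assumed success bonus."""
    d_init = euclidean(p_init, p_target)
    return (euclidean(p_prev, p_target) - euclidean(p_actual, p_target)) / d_init + r_s

def accuracy_error(p_actual, p_init, p_target):
    """delta_d (Eq. 11): final distance as a percentage of the initial distance."""
    return euclidean(p_actual, p_target) / euclidean(p_init, p_target) * 100.0
```

For example, a 1 mm step toward a target 5 mm away yields a reward of 0.2 and, if the episode ended there, an accuracy error of 80%.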
The learning process of the RL/DRL agent begins by exploring the environment, performing actions from the initial state to the target state and collecting the appropriate rewards (Eq. 9). In our experiment, the agent can select one of six possible actions (a_t ∈ {0, ..., 5}) in each state s_t, corresponding to moves in (X+, X−, Y+, Y−, Z+, Z−); the available actions correspond to fixed discrete steps of 5 mm of the Tool Center Point (TCP).

Once the agent moves out of A_search, the episode M ends at the current step T; otherwise, the process continues until the maximum number of episodes is reached.
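A minimal sketch of this training loop for the tabular Q-Learning case (Eq. 3, with α and γ taken from Tab. 1): the 1-D toy environment with two actions is a deliberate simplification of the authors' 3-D, six-action Gym environment, not their implementation:

```python
import random

random.seed(0)

N_STATES, START, TARGET = 11, 2, 8
ACTIONS = (-1, +1)             # 1-D stand-in for the six +-X/+-Y/+-Z TCP moves
ALPHA, GAMMA = 0.3, 0.75       # learning rate and discount factor (Tab. 1)
EPISODES, STEPS, EPSILON = 500, 100, 0.1

Q = [[0.0, 0.0] for _ in range(N_STATES)]

for _ in range(EPISODES):
    s = START
    for _ in range(STEPS):
        # epsilon-greedy action selection
        if random.random() < EPSILON:
            a = random.randrange(len(ACTIONS))
        else:
            a = max(range(len(ACTIONS)), key=lambda i: Q[s][i])
        s_next = s + ACTIONS[a]
        if not 0 <= s_next < N_STATES:  # the agent left the search area
            break
        # normalized progress reward, in the spirit of Eq. (9)
        r = (abs(s - TARGET) - abs(s_next - TARGET)) / abs(START - TARGET)
        # off-policy TD update (Eq. 3)
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s_next]) - Q[s][a])
        s = s_next
        if s == TARGET:                 # goal reached: episode ends
            break

greedy = max(range(len(ACTIONS)), key=lambda i: Q[START][i])
```

After training, the greedy policy at the start state steps toward the target; replacing the tabular Q with a network approximator turns the same loop into the DQN/DSARSA variants of Section 3.3.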
[Figures: results for (a) Q-Learning and (b) SARSA; legend: δ_d ≤ 5.0%, δ_d ≤ 10.0%, δ_d > 10.0%]
Table 3: Results of RL/DRL learning techniques for the goal achievement experiment

Learning Technique | Number of episodes | Best solution (Episode; Time) | Number of points from P_i to P_t | Accuracy error δ_d | Time (Default) | Time (B-Spline)
Q-Learning     | 500 | (279; 04:55 hr.) | 63 | 2.98% | 5.25 sec. | 2.27 sec.
SARSA          | 800 | (772; 18:29 hr.) | 92 | 8.48% | 7.83 sec. | 4.33 sec.
Deep Q-Network | 500 | (440; 15:48 hr.) | 83 | 4.16% | 6.57 sec. | 2.89 sec.
Deep SARSA     | 500 | (371; 13:25 hr.) | 78 | 2.90% | 6.79 sec. | 2.44 sec.
5 Conclusion and Future Work

In this work, we provided an experimental study of multiple reinforcement learning (RL) / deep reinforcement learning (DRL) algorithms, namely Q-Learning (QL) and SARSA for the RL methods, and Deep Q-Network (DQN) and Deep SARSA (DSARSA) for the DRL methods. We presented related work on the problem of achieving the goal using RL/DRL techniques. We also introduced in more detail the various learning methods and the techniques for trajectory smoothing. The simulation is controlled by communication between the simulation tool RVIZ and the Gazebo 3D modeling tool via the Robot Operating System (ROS). The main optimization problem of this experiment is to find the best solution for each RL/DRL scenario, that is, to minimize the Euclidean-distance accuracy error δ_d. The RL (QL) and DRL (DQN, DSARSA) techniques met the conditions for the required accuracy represented by R_s, but from the perspective of future research, the techniques based on deep neural networks are more stable and efficient. The resulting path found in each scenario is smoothed by the Bézier spline method and tested in a real robotic application using the official ROS driver.

This work can provide a foundation for future research on motion planning in the field of robotics using advanced deep reinforcement learning methods such as DDPG (Deep Deterministic Policy Gradient), TD3 (Twin Delayed Deep Deterministic Policy Gradient), and more. This research can also provide a suitable basis for other areas of learning, such as the Pick & Place task, bin picking, etc., and the use of other robotic arm models.

Acknowledgement: This work was supported by the Internal Grant Agency of BUT: FME-S-20-6538 "Industry 4.0 and AI methods".

References

[1] Aguero, C., et al. Inside the virtual robotics challenge: Simulating real-time robotic disaster response. IEEE Transactions on Automation Science and Engineering 12, 2 (April 2015), 494–506.
[2] Ammad, M., and Ramli, A. Cubic B-spline curve interpolation with arbitrary derivatives on its data points. In 2019 23rd International Conference in Information Visualization – Part II (2019), pp. 156–159.
[3] Andersen, T. T. Optimizing the Universal Robots ROS driver. Technical University of Denmark, Department of Electrical Engineering (2015).
[4] Bingol, O. R., and Krishnamurthy, A. NURBS-Python: An open-source object-oriented NURBS modeling framework in Python. SoftwareX 9 (2019), 85–94.
[5] Bogunowicz, D., Rybnikov, A., Vendidandi, K., and Chervinskii, F. Sim2Real for peg-hole insertion with eye-in-hand camera. arXiv:2005.14401 (05 2020).
[6] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv:1606.01540 (2016).
[7] Coleman, D., Sucan, I. A., Chitta, S., and Correll, N. Reducing the barrier to entry of complex robotic software: a MoveIt! case study. Journal of Software Engineering for Robotics 5 (2014), 3–16.
[8] Coumans, E., and Bai, Y. PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2019.
[9] El-Shamouty, M., Wu, X., Yang, S., Albus, M., and Huber, M. F. Towards safe human-robot collaboration using deep reinforcement learning. In 2020 IEEE International Conference on Robotics and Automation (ICRA) (2020), pp. 4899–4905.
[10] Franceschetti, A., Tosello, E., Castaman, N., and Ghidoni, S. Robotic arm control and task training through deep reinforcement learning. arXiv:2005.02632 (01 2021).
[11] Francois-Lavet, V., Henderson, P., Islam, R., Bellemare, M. G., and Pineau, J. An introduction to deep reinforcement learning. arXiv:1811.12560 (2018).
[12] Hundt, A., et al. "Good Robot!": Efficient reinforcement learning for multi-step visual tasks with sim to real transfer. IEEE Robotics and Automation Letters PP (08 2020), 1–1.
[13] Hůlka, T., Matoušek, R., Dobrovský, L., Dosoudilová, M., and Nolle, L. Optimization of snake-like robot locomotion using GA: Serpenoid design. MENDEL 26, 1 (Aug. 2020), 1–6.
[14] Kingma, D., and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations (12 2014).
[15] Koenig, N., and Howard, A. Design and use paradigms for Gazebo, an open-source multi-robot simulator. In IEEE/RSJ International Conference on Intelligent Robots and Systems (Sendai, Japan, Sep 2004), pp. 2149–2154.
[16] Kristensen, C., Sørensen, F., Nielsen, H., Andersen, M., Bendtsen, S., and Bøgh, S. Towards a robot simulation framework for e-waste disassembly using reinforcement learning. Procedia Manufacturing 38 (01 2019), 225–232.
[17] Kudela, J. Social distancing as p-dispersion problem. IEEE Access 8 (2020), 149402–149411.
[18] Lin, C., and Li, M. Motion planning with obstacle avoidance of an UR3 robot using charge system search. In 2018 18th International Conference on Control, Automation and Systems (ICCAS) (2018), pp. 746–750.
[19] Mahmood, A., Korenkevych, D., Komer, B., and Bergstra, J. Setting up a reinforcement learning task with a real-world robot. arXiv:1803.07067 (03 2018).
[20] Mesquita, A., Nogueira, Y., Vidal, C., Cavalcante-Neto, J., and Serafim, P. Autonomous foraging with SARSA-based deep reinforcement learning. In 2020 22nd Symposium on Virtual and Augmented Reality (SVR) (2020), pp. 425–433.
[21] Nguyen, H., and La, H. Review of deep reinforcement learning for robot manipulation. In 2019 Third IEEE International Conference on Robotic Computing (IRC) (2019), pp. 590–595.
[22] Rupam Mahmood, A., Korenkevych, D., Komer, B. J., and Bergstra, J. Setting up a reinforcement learning task with a real-world robot. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2018), pp. 4635–4640.
[23] Scheiderer, C., Thun, T., and Meisen, T. Bézier curve based continuous and smooth motion planning for self-learning industrial robots. Procedia Manufacturing 38 (2019), 423–430. 29th International Conference on Flexible Automation and Intelligent Manufacturing (FAIM 2019), June 24–28, 2019, Limerick, Ireland.
[24] Silver, D., et al. Mastering the game of Go with deep neural networks and tree search. Nature 529 (01 2016), 484–489.
[25] Stanford Artificial Intelligence Laboratory et al. Robot Operating System.
[26] Sucan, I. A., and Chitta, S. MoveIt. [online] Available at: moveit.ros.org.
[27] Sutton, R. S., and Barto, A. G. Reinforcement Learning: An Introduction, second ed. The MIT Press, 2018.
[28] Universal Robots. UR3. [online] Available at: https://www.universal-robots.com.
[29] van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. arXiv:1509.06461 (2015).
[30] Vince, J. Mathematics for Computer Graphics, fifth ed. Springer, London, 2017.
[31] Xinyu, W., Xiaojuan, L., Yong, G., Jiadong, S., and Rui, W. Bidirectional potential guided RRT* for motion planning. IEEE Access 7 (2019), 95046–95057.
[32] Zamora, I., Lopez, N., Vilches, V., and Cordero, A. Extending the OpenAI Gym for robotics: a toolkit for reinforcement learning using ROS and Gazebo. arXiv:1608.05742 (08 2016).
[33] Zeng, X. Reinforcement learning based approach for the navigation of a pipe-inspection robot at sharp pipe corners. University of Twente, September 2019.