Comparison of Multiple Reinforcement Learning and Deep Reinforcement Learning Methods For The Task Aimed at Achieving The Goal

Roman Parák and R. Matousek
Brno University of Technology
Abstract

Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) methods are a promising approach to solving complex tasks in the real world with physical robots. In this paper, we compare several reinforcement learning (Q-Learning, SARSA) and deep reinforcement learning (Deep Q-Network, Deep SARSA) methods for a task aimed at achieving a goal using the UR3 robotic arm. The main optimization problem of this experiment is to find the best solution for each RL/DRL scenario, that is, to minimize the Euclidean-distance accuracy error and to smooth the resulting path by the Bézier spline method. The simulation and the real-world application are controlled by the Robot Operating System (ROS). The learning environment is implemented using the OpenAI Gym library, which uses the RVIZ simulation tool and the Gazebo 3D modeling tool for dynamics and kinematics.

Received: 1 March 2021
Accepted: 9 May 2021
Published: 21 June 2021

Keywords: Reinforcement Learning, Deep neural network, Motion planning, Bézier spline, Robotics, UR3.
MENDEL — Soft Computing Journal, Volume 27, No. 1, June 2021, Brno, Czech Republic
tion planning, etc. (Section 2 Related Work), and we also summarize the necessary methods needed to create our work (Section 3 Methods).

In the main part of the work, we focus on solving the problem of achieving the goal using advanced methods of motion planning (Section 4 Experiments and Results). Our approach compares different learning techniques (Reinforcement Learning / Deep Reinforcement Learning) to find the trajectory from the initial position to the target position and smooths the resulting trajectory using a Bézier spline (B-spline) curve.

In the final part of the paper, we focus on the challenges we have encountered, the current limitations, and future extensions of our work (Section 5 Conclusion and Future Work).

2 Related Work

Our approach to finding a point in Cartesian space using multiple RL/DRL techniques is based on previous work in the areas of reinforcement learning, deep reinforcement learning, and motion planning. In the following section, we briefly discuss previous work on each of the relevant topics.

In research in the field of robotic motion planning, the concept of machine learning emerges. In particular, Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) techniques are an area of growing interest for the robotics research community.

Researchers at Erle Robotics have created a framework for testing various RL/DRL algorithms called OpenAI Gym [6, 32]. Various robot simulation tools are used as an extension of the OpenAI toolkit, with Gazebo [1, 15] and PyBullet [8] being the most commonly used today. The connection of the robotic tool with the robust physics core and the Gym toolkit is created using ROS (Robot Operating System) [25].

RL/DRL methods are used in several real-world robotic experiments. One approach focuses on unscrewing operations in the robotic disassembly of electronic waste using the Q-Learning method [16]; other approaches have used a robotic arm to achieve a goal using the Deep Reinforcement Learning methods DQN (Deep Q-Network) [5] and TRPO (Trust Region Policy Optimization) [22], and tested the result of the experiment in a real application. Some approaches use 2D/3D cameras and other sensors to observe the robotic environment [5, 10, 12]; others use only dynamic simulation with the specified environment [13, 16] or real-time robot learning techniques [19].

Motion planning is one of the most fundamental research topics in robotics. Some approaches have used traditional planning methods, such as RRT [18] and RRT* [31], where structured tree methods are used to find the curve from point A to point B. Other approaches use modern techniques, such as RL/DRL, but both methods use Bézier curves to characterize complex trajectories and smooth motion planning [23].

3 Methods

This section provides a brief introduction to the theory of Reinforcement Learning and Deep Reinforcement Learning, as well as path smoothing techniques using the Bézier spline curve. In each subsection, we present two methods of RL (Q-Learning, SARSA) / DRL (Deep Q-Network, Deep SARSA) control; the last subsection is focused on trajectory smoothing.

3.1 Markov Decision Process

A Markov Decision Process (MDP) is a classical formulation of sequential decision making, where actions influence not just immediate rewards, but also subsequent situations or states, and through those, future rewards [27]. MDPs include delayed reward and the need to trade off immediate and delayed reward.

The MDP contains a structure of four basic elements: (s_t, a_t, P(s_{t+1}|s_t, a_t), R(s_{t+1}|s_t, a_t)), where s_t and s_{t+1} represent the current and next state, a_t represents the action, P(s_{t+1}|s_t, a_t) is the probability of the transition to the state s_{t+1} when taking the action a_t in the state s_t, and R(s_{t+1}|s_t, a_t) represents the immediate reward received from the environment after the transition from s_t to s_{t+1}. The agent and environment interact in a sequence of discrete time steps t = 0, 1, ... The transition probability in the MDP structure depends only on the current state s_t and the chosen action a_t [11, 27, 33].

Figure 2: Basic structure of the agent-environment interaction in a Markov decision process (MDP) [11, 27]

3.2 Reinforcement Learning

Reinforcement learning (RL) is an area of machine learning that deals with sequential decision-making. The main task of this method is to learn how agents ought to take sequences of actions in an environment to maximize cumulative rewards. Markov decision processes (MDPs) are an ideal mathematical formulation for RL problems, for which a direct learning methodology to achieve the goal is proposed. The agent aims to receive not only the current reward, but also the cumulative reward over the next learning states [11, 27, 33].

The agent and MDP together form a sequence that contains a number of state-action pairs represented as τ = ((s_0, a_0), (s_1, a_1), ...). The return is defined as the
discounted return for the sequence τ at time step t:

  R_t(τ) = Σ_{t′=t}^{∞} γ^(t′−t) r_{t′},   (1)

where γ is a discount factor (0 ≤ γ ≤ 1) and r_{t′} is the reward at time step t′.

The main optimization problem of the RL method is the need to find the optimal policy π*, which is defined as the policy maximizing the expected return:

  π* = argmax_π E_{τ∼π}[R_t(τ)].   (2)

Q-Learning (QL): is one of the most popular methods of Reinforcement Learning. The QL method was developed as an off-policy TD (Temporal Difference) control algorithm, defined by

  Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)],   (3)

where α is a learning rate (0 < α ≤ 1), max_a Q(s_{t+1}, a) is the estimate of the optimal future value, and the other parameters are described in the previous section [11, 27].

State-action-reward-state-action (SARSA): is an on-policy TD control method, very similar to the previous Q-Learning method. The SARSA method is characterized by using the next action a_{t+1} instead of the maximization [27]:

  Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)].   (4)

3.3 Deep Reinforcement Learning

Deep reinforcement learning (DRL) combines artificial neural networks (ANN) with a reinforcement learning architecture that allows agents to learn the best possible actions in an individual environment to achieve their goals. The main approach of this method is to approximate the value function and optimize the goals, mapping the state-action pairs to the expected rewards [11, 33]. This area of research can solve a wide range of complex decision-making tasks that were previously out of reach for a machine.

DRL methods that give the learning agent auxiliary tasks within a jointly learned representation can significantly increase the effectiveness of the learning sample. This is based on simultaneously maximizing a number of pseudoreward functions, such as immediately predicting the reward (γ = 0), predicting changes in the next observation, or predicting the activation of some hidden unit of the agent's neural network [11].

Deep Q-Network (DQN): is a combination of Q-learning with a deep convolutional ANN, a multi-layer deep ANN specialized in the processing of spatial data fields. For a given state s, the output of the neural network is a vector of action values Q(s, a; θ), where θ are the network parameters [11, 29]. The two most important components of the DQN algorithm are the use of a target network and the use of experience replay. The formula used by DQN is then:

  Q(s_t, a_t; θ_t) ← Q(s_t, a_t; θ_t) + α[r_{t+1} + γ max_a Q(s_{t+1}, a; θ_t⁻) − Q(s_t, a_t; θ_t)],   (5)

where the target network with parameters θ_t⁻ is the same as the online network, except that its parameters are copied from the online network every τ steps, so that θ_t⁻ = θ_t at those steps, and kept fixed in between [29].

Deep SARSA (DSARSA): is similar to the previous method. In the learning structure, the approximation of the value function is done with a convolutional neural network (CNN) that uses a Q-network to obtain the Q-value, like DQN. The function is represented by a CNN with weights θ and an output that represents the Q-values for each action [20]:

  Q(s_t, a_t; θ_t) ← Q(s_t, a_t; θ_t) + α[r_{t+1} + γ Q(s_{t+1}, a_{t+1}; θ_t⁻) − Q(s_t, a_t; θ_t)].   (6)

3.4 Bézier-Spline Curves

Bézier curves use Bernstein polynomials, which were described in 1912 by the Russian mathematician Sergei Bernstein [30].

B-spline curves, like Bézier curves, use polynomials to generate a curve segment. The main difference between simple Bézier curves and a B-spline is that a B-spline uses a series of control points to determine the local geometry of the curve. This feature ensures that only a small portion of the curve is changed when a control point is moved [30]. A Bézier spline curve of degree p is defined by n + 1 control points P_0, P_1, ..., P_n [2]:

  B(t) = Σ_{i=0}^{n} N_{i,p}(t) P_i,   (7)

where N_{i,p}(t) is the normalized B-spline basis function defined over the knots.
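As a concrete illustration of Eq. (7), the following sketch evaluates a clamped B-spline using the Cox-de Boor recursion for the basis functions N_{i,p}(t). The control points and knot vector are illustrative only, not taken from the experiment; with this clamped knot vector and four control points, the curve degenerates to a single cubic Bézier segment:

```python
def basis(i, p, t, knots):
    """Cox-de Boor recursion for the normalized B-spline basis N_{i,p}(t)."""
    if p == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = right = 0.0
    if knots[i + p] != knots[i]:
        left = (t - knots[i]) / (knots[i + p] - knots[i]) * basis(i, p - 1, t, knots)
    if knots[i + p + 1] != knots[i + 1]:
        right = ((knots[i + p + 1] - t) / (knots[i + p + 1] - knots[i + 1])
                 * basis(i + 1, p - 1, t, knots))
    return left + right

def bspline_point(t, ctrl, p, knots):
    """B(t) = sum_i N_{i,p}(t) P_i (Eq. 7) for a list of 3-D control points."""
    if t >= knots[-1]:  # clamp the right endpoint of the half-open basis support
        t = knots[-1] - 1e-12
    return tuple(sum(basis(i, p, t, knots) * ctrl[i][d] for i in range(len(ctrl)))
                 for d in range(3))

# Four illustrative control points and a clamped cubic (p = 3) knot vector.
ctrl = [(0.0, 0.0, 0.0), (0.1, 0.2, 0.0), (0.3, 0.2, 0.1), (0.4, 0.0, 0.1)]
knots = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
path = [bspline_point(k / 10.0, ctrl, 3, knots) for k in range(11)]
```

The smoothed path interpolates the first and last control points (the start and target of a trajectory) and stays inside their convex hull; in practice, a library such as NURBS-Python [4] can be used for the same evaluation.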
4 Experiments and Results

In this section, we present an application that includes a task using several RL/DRL techniques for the problem of reaching a point in Cartesian space using a collaborative manipulator with 6 DOF (Degrees of Freedom). The experiment is demonstrated in simulation and in the real world.

4.1 Setting up the learning environment

For our experiment, we use the collaborative robotic arm UR3 from Universal Robots (Fig. 1), more precisely a 6-axis robotic arm with a working radius of 500 mm / 19.7 in, a payload of 3 kg / 6.6 lb, and a repeatability of ±0.1 mm [28].

The simulation is controlled through communication between the RVIZ [7, 26] simulation tool and the Gazebo 3D modeling tool [1, 15]. In our experiment, we use the UR3 URDF (Universal Robot Description Format) model without a gripper and the official ROS driver for Universal Robots [3], which is used to control both the real and the simulated robot.

The main structure of our experiment is shown in Fig. 1. Fig. 3 shows a simplified structure of the UR3 model in the RVIZ simulation tool. The blue sphere represents the initial position of the robot, and the target to be reached by the robot is represented as a red sphere. The purple box (A_target) is approximately the restricted area from which targets are selected, and the yellow box (A_search) represents the area of safe movement. The safe area is created with a margin given to the individual robot model to avoid collisions with the target area (imagine that the target area is a bin for an object-selection problem).

4.2 Definition of experiment

In our problem, we choose a task aimed at achieving a goal in Cartesian space using a robotic arm. The task aims to autonomously find the trajectory from the initial position to the target position using various learning techniques (Reinforcement Learning / Deep Reinforcement Learning) and to smooth the resulting trajectory using the Bézier spline method.

First, the robot position is initialized and the target position is randomly selected. After selecting a target position, we start the learning process for each learning technique. The distance between the start point P_i and the target point P_t is described by the Euclidean distance d_t (Eq. 8):

  d_t(P_i, P_t) = √( Σ_{k=1}^{n} (P_{i,k} − P_{t,k})² ),   (8)

where the sum runs over the n Cartesian coordinates.

4.3 Learning process

We implement the learning environment within the OpenAI Gym library [6, 32], which provides an interface to train and test the learning process.

To evaluate our algorithm, we performed identical experiments with several RL/DRL techniques. In the first experiment, we use classical RL techniques (QL, SARSA), and in the next experiment, we use modern DRL techniques (DQN, DSARSA). The hyper-parameters of the individual RL/DRL techniques are given in Tab. 1 and Tab. 2.

Table 1: Hyper-parameters used for the Reinforcement Learning methods (Q-Learning, SARSA)

Hyper-parameter      | Symbol       | Value
Episodes of Training | M_min, M_max | 500, 1000
Steps per Episode    | T            | 100
Discount Factor      | γ            | 0.75
Learning Rate        | α            | 0.3

The reward is defined as (Eq. 9):

  R_t = [d_t(P_p, P_t) − d_t(P_a, P_t)] / d_t(P_i, P_t) + R_s,   (9)

where d_t(P_i, P_t) is the initial Euclidean distance between the start point P_i and the target point P_t.
The accuracy error δ_d is expressed as a percentage of the initial distance:

  δ_d = [d_t(P_a, P_t) / d_t(P_i, P_t)] · 100.   (11)
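For illustration, Eqs. (8), (9) and (11) can be sketched as plain Python functions. The reading of P_p and P_a as the previous and the actual TCP position, and of R_s as an additive success bonus, are assumptions based on the surrounding text:

```python
import math

def euclidean(p, q):
    """Euclidean distance d_t(P, Q) between two Cartesian points (Eq. 8)."""
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

def reward(p_prev, p_actual, p_init, p_target, r_s=0.0):
    """Normalized progress reward (Eq. 9): positive when the TCP moved
    closer to the target in this step; r_s is an assumed success bonus."""
    d_init = euclidean(p_init, p_target)
    return (euclidean(p_prev, p_target) - euclidean(p_actual, p_target)) / d_init + r_s

def accuracy_error(p_actual, p_init, p_target):
    """delta_d (Eq. 11): final distance as a percentage of the initial distance."""
    return euclidean(p_actual, p_target) / euclidean(p_init, p_target) * 100.0
```

For example, a 1 mm step toward a target 5 mm away yields a reward of 0.2 and, if the episode ended there, an accuracy error of 80%.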
The learning process of the RL/DRL agent begins by exploring the environment, performing actions from the initial state to the target state and collecting the appropriate rewards (Eq. 9). In our experiment, the agent can select one of six possible actions (a_t ∈ {0, ..., 5}) in each state s_t, corresponding to moves in (X+, X−, Y+, Y−, Z+, Z−); the available actions correspond to fixed discrete steps of 5 mm of the Tool Center Point (TCP).

Once the agent moves out of A_search, the episode M ends at the current step T; otherwise, the process continues until the maximum number of episodes is reached.
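A minimal sketch of this training loop for the tabular Q-Learning case (Eq. 3, with α and γ taken from Tab. 1): the 1-D toy environment with two actions is a deliberate simplification of the authors' 3-D, six-action Gym environment, not their implementation:

```python
import random

random.seed(0)

N_STATES, START, TARGET = 11, 2, 8
ACTIONS = (-1, +1)             # 1-D stand-in for the six +-X/+-Y/+-Z TCP moves
ALPHA, GAMMA = 0.3, 0.75       # learning rate and discount factor (Tab. 1)
EPISODES, STEPS, EPSILON = 500, 100, 0.1

Q = [[0.0, 0.0] for _ in range(N_STATES)]

for _ in range(EPISODES):
    s = START
    for _ in range(STEPS):
        # epsilon-greedy action selection
        if random.random() < EPSILON:
            a = random.randrange(len(ACTIONS))
        else:
            a = max(range(len(ACTIONS)), key=lambda i: Q[s][i])
        s_next = s + ACTIONS[a]
        if not 0 <= s_next < N_STATES:  # the agent left the search area
            break
        # normalized progress reward, in the spirit of Eq. (9)
        r = (abs(s - TARGET) - abs(s_next - TARGET)) / abs(START - TARGET)
        # off-policy TD update (Eq. 3)
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s_next]) - Q[s][a])
        s = s_next
        if s == TARGET:                 # goal reached: episode ends
            break

greedy = max(range(len(ACTIONS)), key=lambda i: Q[START][i])
```

After training, the greedy policy at the start state steps toward the target; replacing the tabular Q with a network approximator turns the same loop into the DQN/DSARSA variants of Section 3.3.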
[Figures: results for (a) Q-Learning and (b) SARSA; legend: δ_d ≤ 5.0%, δ_d ≤ 10.0%, δ_d > 10.0%]
Table 3: Results of RL/DRL learning techniques for the goal achievement experiment

Learning Technique | Number of episodes | Best solution (Episode; Time) | Number of points from P_i to P_t | Accuracy error δ_d | Time (Default) | Time (B-Spline)
Q-Learning     | 500 | (279; 04:55 hr.) | 63 | 2.98% | 5.25 sec. | 2.27 sec.
SARSA          | 800 | (772; 18:29 hr.) | 92 | 8.48% | 7.83 sec. | 4.33 sec.
Deep Q-Network | 500 | (440; 15:48 hr.) | 83 | 4.16% | 6.57 sec. | 2.89 sec.
Deep SARSA     | 500 | (371; 13:25 hr.) | 78 | 2.90% | 6.79 sec. | 2.44 sec.
5 Conclusion and Future Work

In this work, we provided an experimental study of multiple reinforcement learning (RL) / deep reinforcement learning (DRL) algorithms, namely Q-Learning (QL) and SARSA for the RL methods, and Deep Q-Network (DQN) and Deep SARSA (DSARSA) for the DRL methods. We presented related work on the problem of achieving the goal using RL/DRL techniques. We also introduced in more detail the various learning methods and the techniques for trajectory smoothing. The simulation is controlled by communication between the simulation tool RVIZ and the Gazebo 3D modeling tool via the Robot Operating System (ROS). The main optimization problem of this experiment is to find the best solution for each RL/DRL scenario, that is, to minimize the Euclidean-distance accuracy error δ_d. The RL (QL) and DRL (DQN, DSARSA) techniques met the conditions for the required accuracy represented by R_s, but from the perspective of future research, the techniques based on deep neural networks are more stable and efficient. The resulting path found in each scenario is smoothed by the Bézier spline method and tested in a real robotic application using the official ROS driver.

This work can provide a foundation for future research on motion planning in the field of robotics using advanced deep reinforcement learning methods such as DDPG (Deep Deterministic Policy Gradient), TD3 (Twin Delayed Deep Deterministic Policy Gradient), and more. This research can also provide a suitable basis for other areas of learning, such as the Pick & Place task, bin picking, etc., and the use of other robotic arm models.

Acknowledgement: This work was supported by the Internal Grant Agency of BUT: FME-S-20-6538 "Industry 4.0 and AI methods".

References

[1] Aguero, C., et al. Inside the virtual robotics challenge: Simulating real-time robotic disaster response. IEEE Transactions on Automation Science and Engineering 12, 2 (April 2015), 494–506.
[2] Ammad, M., and Ramli, A. Cubic B-spline curve interpolation with arbitrary derivatives on its data points. In 2019 23rd International Conference in Information Visualization – Part II (2019), pp. 156–159.
[3] Andersen, T. T. Optimizing the Universal Robots ROS driver. Technical University of Denmark, Department of Electrical Engineering (2015).
[4] Bingol, O. R., and Krishnamurthy, A. NURBS-Python: An open-source object-oriented NURBS modeling framework in Python. SoftwareX 9 (2019), 85–94.
[5] Bogunowicz, D., Rybnikov, A., Vendidandi, K., and Chervinskii, F. Sim2Real for peg-hole insertion with eye-in-hand camera. arXiv:2005.14401 (05 2020).
[6] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv:1606.01540 (2016).
[7] Coleman, D., Sucan, I. A., Chitta, S., and Correll, N. Reducing the barrier to entry of complex robotic software: a MoveIt! case study. Journal of Software Engineering for Robotics 5 (2014), 3–16.
[8] Coumans, E., and Bai, Y. PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2019.
[9] El-Shamouty, M., Wu, X., Yang, S., Albus, M., and Huber, M. F. Towards safe human-robot collaboration using deep reinforcement learning. In 2020 IEEE International Conference on Robotics and Automation (ICRA) (2020), pp. 4899–4905.
[10] Franceschetti, A., Tosello, E., Castaman, N., and Ghidoni, S. Robotic arm control and task training through deep reinforcement learning. arXiv:2005.02632 (01 2021).
[11] Francois-Lavet, V., Henderson, P., Islam, R., Bellemare, M. G., and Pineau, J. An introduction to deep reinforcement learning. arXiv:1811.12560 (2018).
[12] Hundt, A., et al. "Good Robot!": Efficient reinforcement learning for multi-step visual tasks with sim to real transfer. IEEE Robotics and Automation Letters PP (08 2020), 1–1.
[13] Hůlka, T., Matoušek, R., Dobrovský, L., Dosoudilová, M., and Nolle, L. Optimization of snake-like robot locomotion using GA: Serpenoid design. MENDEL 26, 1 (Aug. 2020), 1–6.
[14] Kingma, D., and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations (12 2014).
[15] Koenig, N., and Howard, A. Design and use paradigms for Gazebo, an open-source multi-robot simulator. In IEEE/RSJ International Conference on Intelligent Robots and Systems (Sendai, Japan, Sep 2004), pp. 2149–2154.
[16] Kristensen, C., Sørensen, F., Nielsen, H., Andersen, M., Bendtsen, S., and Bøgh, S. Towards a robot simulation framework for e-waste disassembly using reinforcement learning. Procedia Manufacturing 38 (01 2019), 225–232.
[17] Kudela, J. Social distancing as p-dispersion problem. IEEE Access 8 (2020), 149402–149411.
[18] Lin, C., and Li, M. Motion planning with obstacle avoidance of an UR3 robot using charge system search. In 2018 18th International Conference on Control, Automation and Systems (ICCAS) (2018), pp. 746–750.
[19] Mahmood, A., Korenkevych, D., Komer, B., and Bergstra, J. Setting up a reinforcement learning task with a real-world robot. arXiv:1803.07067 (03 2018).
[20] Mesquita, A., Nogueira, Y., Vidal, C., Cavalcante-Neto, J., and Serafim, P. Autonomous foraging with SARSA-based deep reinforcement learning. In 2020 22nd Symposium on Virtual and Augmented Reality (SVR) (2020), pp. 425–433.
[21] Nguyen, H., and La, H. Review of deep reinforcement learning for robot manipulation. In 2019 Third IEEE International Conference on Robotic Computing (IRC) (2019), pp. 590–595.
[22] Rupam Mahmood, A., Korenkevych, D., Komer, B. J., and Bergstra, J. Setting up a reinforcement learning task with a real-world robot. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2018), pp. 4635–4640.
[23] Scheiderer, C., Thun, T., and Meisen, T. Bézier curve based continuous and smooth motion planning for self-learning industrial robots. Procedia Manufacturing 38 (2019), 423–430. 29th International Conference on Flexible Automation and Intelligent Manufacturing (FAIM 2019), June 24–28, 2019, Limerick, Ireland.
[24] Silver, D., et al. Mastering the game of Go with deep neural networks and tree search. Nature 529 (01 2016), 484–489.
[25] Stanford Artificial Intelligence Laboratory et al. Robot Operating System.
[26] Sucan, I. A., and Chitta, S. MoveIt. [online] Available at: moveit.ros.org.
[27] Sutton, R. S., and Barto, A. G. Reinforcement Learning: An Introduction, second ed. The MIT Press, 2018.
[28] Universal Robots. UR3. [online] Available at: https://www.universal-robots.com.
[29] van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. arXiv:1509.06461 (2015).
[30] Vince, J. Mathematics for Computer Graphics, fifth ed. Springer, London, 2017.
[31] Xinyu, W., Xiaojuan, L., Yong, G., Jiadong, S., and Rui, W. Bidirectional potential guided RRT* for motion planning. IEEE Access 7 (2019), 95046–95057.
[32] Zamora, I., Lopez, N., Vilches, V., and Cordero, A. Extending the OpenAI Gym for robotics: a toolkit for reinforcement learning using ROS and Gazebo. arXiv:1608.05742 (08 2016).
[33] Zeng, X. Reinforcement learning based approach for the navigation of a pipe-inspection robot at sharp pipe corners. University of Twente, September 2019.