A hybrid autonomous exploration method based on reinforcement learning
Xiangda Yan, Jie Huang, Keyan He, Huajie Hong and Dasheng Xu
College of Intelligence Science and Technology, National University of Defense Technology, Changsha, China
Abstract
Purpose – Robots equipped with LiDAR sensors can continuously perform efficient actions for mapping tasks to gradually build maps. However, as the complexity and scale of the environment increase, the computational cost rises steeply. This study aims to propose a hybrid autonomous exploration method that makes full use of LiDAR data, shortens the computation time of the decision-making process and improves efficiency. Experiments prove that the method is feasible.
Design/methodology/approach – This study improves the map update module and proposes a full-mapping approach that fully exploits the LiDAR data. Under the same hardware configuration, the scope of the mapping is expanded and the information obtained is increased. In addition, a decision-making module based on reinforcement learning is proposed, which selects the optimal or a near-optimal perceptual action using the learned policy. This module shortens the computation time of the decision-making process and improves decision-making efficiency.
Findings – The results show that the hybrid autonomous exploration method, which combines a learning-based policy with a traditional frontier-based policy, offers good performance.
Originality/value – This study proposes a hybrid autonomous exploration method that combines a learning-based policy with a traditional frontier-based policy. Extensive experiments, including trials on real robots, are conducted to evaluate the performance of the approach and prove that the method is feasible.
Keywords Mobile robots, Autonomous robots, Autonomous exploration, Deep Q-network, Reinforcement learning
Paper type Research paper
The authors disclosed receipt of the following financial support for the research, authorship and publication of this article: this research was supported by NUDT, China.

Industrial Robot: the international journal of robotics research and application, Vol. 50 No. 5 (2023), pp. 793-803
© Emerald Publishing Limited [ISSN 0143-991X], DOI 10.1108/IR-12-2022-0299
Received 19 October 2022; Revised 14 February 2023; Accepted 27 February 2023
Figure 1 Autonomous exploration system

... nearest frontiers separately. However, no schedule has been established, which results in low efficiency.
Simmons made some attempts to minimize the overall exploration time of the whole robot team, taking into account each robot's travel cost and the utility of the target points; once a target point is assigned to a robot, its utility to the other robots is reduced. This approach reduces exploration time significantly, but there is still considerable room for optimization (Burgard et al., 2000; Simmons et al., 2000).
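To make the cost-utility idea concrete, here is a toy sketch of greedy assignment with utility discounting; it is an illustration of the principle, not Simmons' or Burgard's actual algorithm, and the discount radius (5.0) and weight beta are made-up values:

```python
# Illustrative sketch of cost-utility target assignment: each robot greedily
# takes the target maximizing utility minus travel cost, and that target's
# utility (and its neighbours') is then discounted for the remaining robots.
from math import dist


def assign_targets(robots: list[tuple[float, float]],
                   targets: list[tuple[float, float]],
                   beta: float = 1.0) -> dict[int, int]:
    """Greedy cost-utility assignment; beta weighs travel cost against utility."""
    utility = {j: 1.0 for j in range(len(targets))}
    assignment: dict[int, int] = {}
    for i, robot in enumerate(robots):
        # Pick the target with the best utility-minus-cost trade-off.
        j = max(utility, key=lambda j: utility[j] - beta * dist(robot, targets[j]))
        assignment[i] = j
        # Discount the utility of targets near the assigned one (radius 5.0 is
        # an assumed value) so the other robots spread out.
        for k in utility:
            utility[k] -= max(0.0, 1.0 - dist(targets[j], targets[k]) / 5.0)
    return assignment
```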
Laumond (1983) and Chatila and Laumond (1985) build a topological model and then derive a semantic model from it, e.g. identifying “rooms” and “corridors” (Kuipers and Byun, 1991). Nodes represent specific locations in the map, such as corners, and edges represent connections between nodes, such as corridors. However, the topological model is not used to cope with measurement inaccuracy.
Figure 2 (a) Semi-mapping and (b) full-mapping

... module adopts the traditional A* + DWA method). The third part introduces map processing, frontier detection and clustering. Finally, the reinforcement-learning target calculation and the frontier rescue algorithm are combined in experiments on the data set, and the feasibility and efficiency of the algorithm are analyzed. The implementation details of each part are described below.
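The paper gives no code for the map-processing stage; the following is a minimal sketch of frontier detection and clustering on an occupancy grid, under assumed conventions (-1 unknown, 0 free, 1 occupied, 4-connectivity) that are not specified in the paper:

```python
# Minimal sketch: a frontier cell is a free cell with at least one unknown
# 4-neighbour; clusters are connected components of frontier cells.
from collections import deque

import numpy as np

UNKNOWN, FREE, OCCUPIED = -1, 0, 1
NEIGHBOURS_4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def detect_frontiers(grid: np.ndarray) -> set[tuple[int, int]]:
    """Return the set of free cells adjacent to unknown space."""
    rows, cols = grid.shape
    frontiers = set()
    for r in range(rows):
        for c in range(cols):
            if grid[r, c] != FREE:
                continue
            for dr, dc in NEIGHBOURS_4:
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr, nc] == UNKNOWN:
                    frontiers.add((r, c))
                    break
    return frontiers


def cluster_frontiers(frontiers: set[tuple[int, int]]) -> list[list[tuple[int, int]]]:
    """Group frontier cells into connected clusters via breadth-first search."""
    unvisited = set(frontiers)
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, queue = [seed], deque([seed])
        while queue:
            r, c = queue.popleft()
            for dr, dc in NEIGHBOURS_4:
                nb = (r + dr, c + dc)
                if nb in unvisited:
                    unvisited.remove(nb)
                    cluster.append(nb)
                    queue.append(nb)
        clusters.append(cluster)
    return clusters
```

A cluster centroid, or the nearest cluster, can then serve as a candidate goal for the frontier-based policy.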
4. Experiments
The simulation experiments are carried out in ROS, and the algorithm is then deployed on robots for experiments. The verification of the above algorithm is divided into two parts. In the first part, the DQN is trained to calculate the goal (in this process, the path-planning and motion-control module adopts the traditional A* + DWA method). In the second part, experiments are done on data sets by combining reinforcement learning with frontier rescue, and the feasibility and efficiency of the algorithm are analyzed. The experimental details of each part are described below; a sketch of the frontier rescue trigger follows this overview.
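The “Frontier Rescue” mechanism referred to here (and detailed in the results below) triggers when an action yields low Shannon entropy gain. A hedged sketch follows, assuming a probabilistic occupancy grid in which unknown cells hold 0.5; the grid representation and the threshold value are assumptions, not values from the paper:

```python
# Hedged sketch of the "Frontier Rescue" trigger: when the last action
# reduced map entropy by less than a threshold, send the robot to the
# nearest frontier point instead of continuing with the learned policy.
import numpy as np


def map_entropy(prob_grid: np.ndarray) -> float:
    """Shannon entropy (bits) of an occupancy-probability grid; unknown = 0.5."""
    p = np.clip(prob_grid, 1e-6, 1.0 - 1e-6)  # guard against log(0)
    return float(-(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p)).sum())


def should_trigger_rescue(grid_before: np.ndarray,
                          grid_after: np.ndarray,
                          threshold: float = 1.0) -> bool:
    """Trigger rescue when the action's entropy gain falls below threshold bits."""
    gain = map_entropy(grid_before) - map_entropy(grid_after)
    return gain < threshold
```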
... training process. Maps are of unequal size. In our experiment, the robot's LiDAR provided sensor data with a 360-degree field of view, corrupted by Gaussian noise. The robot's local observation map is used as the input state of the neural network, with the robot's position marked in it (the gray value is set to 80 within a radius of 0.3 m). Before the robot takes any action, we check whether a collision would occur, and the training samples include collisions, so the robot gradually learns to avoid collisions while exploring. However, if the robot navigates into a dead end or cannot find an effective action, the “Frontier Rescue” mechanism is triggered to drive the robot to the nearest frontier point so that training can continue.

At the testing stage, we compared RRT-Exploration (Umari and Mukhopadhyay, 2017), Explore-lite (Hörner, 2016) and our learned policy in detail on a data set consisting of an additional five maps. RRT-Exploration presents a strategy for detecting frontier points using RRT: a growing, randomly sampled tree searches for frontier points. Explore-lite searches for frontier points by breadth-first search on the map; the frontier points are filtered and queued for assignment to a robot, which then moves toward the assigned point. Both algorithms detect the frontiers first, whereas DRL obtains the goal directly from the local field of view, which clearly shows the end-to-end character of our algorithm and highlights its efficiency. During DQN-based exploration, any action that would yield a low Shannon entropy gain triggers the “Frontier Rescue” mechanism, which guides the robot out of a region full of obstacles or poor in information to the nearest frontier point.

In the training process, exploration stops only when no frontier points remain on the map or the explored proportion reaches the threshold. We choose two performance indexes. Reward represents the area of the explored region, i.e. the information gain obtained during exploration, and is the key index for evaluating a method (the reward obtained through the “Frontier Rescue” mechanism is excluded). If the rewards are the same, the robot's path length is the second key index.

We used the DQN algorithm to train the network, with an experience replay mechanism to improve sample efficiency. Before training starts, the experience replay pool must collect at least 10,000 samples, and its maximum capacity is 50,000 samples. The network is trained every 20 steps; each update randomly samples 64 transitions from the replay pool for a gradient step.

The learning rate is set to 0.0001, the batch size to 64, the reward discount rate to 0.99 and the soft update rate of the target network to 0.01, with a soft update performed every 250 training iterations. The model is trained on a machine with 32 GB RAM, an i7-11800H CPU and an Nvidia 3080 6 GB GPU. We trained for 50,000 episodes on the data set, which took about 20 h.

We successfully used the DQN algorithm to train the agent to complete the end-to-end decision-making process in complex and diverse environments. Furthermore, we trained six different strategies and recorded the final average reward of each. Figure 7(a)-(f) shows the corresponding six action spaces, and the average rewards of the six strategies are 3.897, 3.267, 4.217, 3.937, 4.015 and 3.917, respectively, as shown in Figure 7(g). With the action space of Figure 7(c), the final average reward after convergence is 4.217, the highest among the six strategies. We studied how the quantity and distribution of actions affect the quality of the learned policies. Figure 7 supports two conclusions: (1) with the same number of actions but different distributions, all strategies can learn policies that explore unknown environments, but the final reward of action space (c) in Figure 7 is about 5% higher than the other five strategies ((4.217 - 4.015)/4.015 × 100%); (2) with the same distribution but different numbers of actions, all strategies can likewise learn policies that explore unknown environments; however, it is not simply the case that the more actions are used, the higher the final reward, because with too few or too many actions it is difficult to learn a good policy. Ultimately, we adopt action space (c) of Figure 7.

Figure 8 shows the loss and reward curves during training. As the agent learns a good policy, the loss decreases gradually. In the initial stage, the agent is constantly exploring and the reward is unstable; after the agent learns a policy that obtains a large reward, the reward starts to increase and converges after about 35,000 episodes.

Furthermore, we compared the performance of the proposed algorithm with RRT-Exploration and Explore-lite, running all three exploration methods in five experimental scenarios and recording their data. As a baseline, we also set up a group of experiments in which a human operator explores by remotely controlling the robot. All methods explore more than 98% of the area of the unknown environment. Figure 9 shows an example of these exploration methods exploring an unknown environment [the action space is (c) in Figure 7] in Scenario 1; throughout the autonomous exploration, decisions rely entirely on the DQN. According to Figure 10, the shortest exploration route among the four methods is the human-controlled one, and among the three algorithms our method yields the best exploration route. Comparing decision time, our algorithm takes about 2 ms per decision, which greatly improves real-time performance. The performance comparison of the three algorithms shows the advantage of the DQN method in exploration-path efficiency and decision time over the RRT-Exploration and Explore-lite methods. We therefore conclude that our method improves efficiency compared with RRT-Exploration and Explore-lite.

Moreover, because both RRT-Exploration and Explore-lite must detect frontiers on the global map, their detection efficiency is closely tied to map size and scenario complexity, so their decision-making time grows significantly with complexity. The DQN method, by contrast, only needs the local field of view without any further processing; the goal is then determined by feeding the local map into the network.
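Tying the reported settings together, the following is a minimal PyTorch-style sketch of the training configuration. The Q-network body and the transition format are placeholders (assumptions); only the numeric hyperparameters (learning rate 0.0001, batch 64, discount 0.99, soft update rate 0.01 every 250 updates, replay warm-up 10,000 and capacity 50,000, training every 20 steps) come from the text above:

```python
# Sketch of the reported DQN training setup; network body and transition
# format are assumed, hyperparameter values are from the paper.
import random
from collections import deque

import torch
import torch.nn as nn

GAMMA, TAU, LR = 0.99, 0.01, 1e-4             # discount, soft-update rate, learning rate
BATCH, WARMUP, CAPACITY = 64, 10_000, 50_000  # minibatch, replay warm-up, replay capacity
TRAIN_EVERY, SOFT_UPDATE_EVERY = 20, 250      # env steps per update, updates per soft update


class QNet(nn.Module):
    """Placeholder Q-network: flattened local map in, one Q-value per action."""

    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, n_actions)
        )

    def forward(self, x):
        return self.net(x)


def soft_update(target: nn.Module, online: nn.Module, tau: float = TAU) -> None:
    """Polyak-average the online weights into the target network."""
    for tp, op in zip(target.parameters(), online.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * op.data)


def train_step(online: QNet, target: QNet, optimizer, replay: deque) -> float:
    """One gradient update on a random minibatch of (s, a, r, s2, done) tensors."""
    batch = random.sample(list(replay), BATCH)
    s, a, r, s2, done = (torch.stack(x) for x in zip(*batch))
    with torch.no_grad():
        y = r + GAMMA * (1.0 - done) * target(s2).max(dim=1).values
    q = online(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Assumed wiring: replay = deque(maxlen=CAPACITY), optimizer =
# torch.optim.Adam(online.parameters(), lr=LR); call train_step every
# TRAIN_EVERY environment steps once len(replay) >= WARMUP, and soft_update
# after every SOFT_UPDATE_EVERY gradient updates.
```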
Umari, H. and Mukhopadhyay, S. (2017), “Autonomous robotic exploration based on multiple rapidly-exploring randomized trees”, IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1396-1402, doi: 10.1109/IROS.2017.8202319.

Yamauchi, B. (1997), “A frontier-based approach for autonomous exploration”, IEEE International Symposium on Computational Intelligence in Robotics and Automation, pp. 146-151, doi: 10.1109/CIRA.1997.613851.

Yamauchi, B. (1998), “Frontier-based exploration using multiple robots”, International Conference on Autonomous Agents, pp. 3715-3720, doi: 10.1145/280765.280773.

Zhang, Z., Shi, C., Zhu, P., Zeng, Z. and Zhang, H. (2021), “Autonomous exploration of mobile robots via deep reinforcement learning based on spatiotemporal information on graph”, Applied Sciences, Vol. 11 No. 18, pp. 8299-8320.

Corresponding author
Keyan He can be contacted at: hekeyan2009@126.com