Autonomous exploration through deep reinforcement learning
Xiangda Yan, Jie Huang, Keyan He, Huajie Hong and Dasheng Xu
College of Intelligence Science and Technology, National University of Defense Technology, Changsha, China

Abstract
Purpose – Robots equipped with LiDAR sensors can continuously perform efficient actions for mapping tasks to build maps gradually. However, as the complexity and scale of the environment increase, the computational cost grows steeply. This study aims to propose a hybrid autonomous exploration method that makes full use of LiDAR data, shortens the computation time of the decision-making process and improves efficiency. Experiments show that this method is feasible.
Design/methodology/approach – This study improves the map-updating module and proposes a full-mapping approach that fully exploits the LiDAR data. Under the same hardware configuration, the scope of the mapping is expanded and the information obtained is increased. In addition, a decision-making module based on a reinforcement learning method is proposed, which selects the optimal or near-optimal perceptual action through the learned policy. The decision-making module shortens the computation time of the decision-making process and improves decision-making efficiency.
Findings – The results show that the hybrid autonomous exploration method, which combines the learning-based policy with a traditional frontier-based policy, offers good performance.
Originality/value – This study proposes a hybrid autonomous exploration method that combines a learning-based policy with a traditional frontier-based policy. Extensive experiments, including on real robots, are conducted to evaluate the performance of the approach and demonstrate that the method is feasible.
Keywords Mobile robots, Autonomous robots, Autonomous exploration, Deep Q-network, Reinforcement learning
Paper type Research paper

1. Introduction
Robot technology has developed rapidly in recent years, and researchers have made great efforts and many attempts in many fields. Now, robots can complete simple and repetitive work very well, and even for work requiring high accuracy great progress has been made, such as robots on automatic production lines (Qiu and Kermani, 2021). However, up to now, the performance of robots on tasks in complex environments is still not very satisfactory. For example, we hope that robots can replace humans to search, rescue and investigate in harsh conditions, that is, that robots can independently build maps in unknown environments (Liu and Nejat, 2013).
Automatically mapping unknown environments is a very challenging task for robots, and scholars have done a lot of research on autonomous mapping (Casper and Murphy, 2003). A typical autonomous exploration system consists of a map-updating module, a decision-making module and a path-planning and motion-controlling module, as shown in Figure 1:
• The map-updating module collects sensor data such as LiDAR point clouds and depth camera data, then updates the map based on the sensor data. However, in the current mapping algorithms proposed by scholars [such as Gmapping (Grisetti et al., 2005), Hector (Kohlbrecher et al., 2011), Cartographer (Hess et al., 2016) and Karto (Konolige et al., 2010)], when a measurement is larger than the effective range of the LiDAR, the grid probability of the area within this range is not calculated (we call such algorithms semi-mapping). We believe that this method is reliable but inefficient, because this kind of algorithm does not make full use of the data, and we have improved the map-updating module on this basis.
• The decision-making module determines the goal for the next time step according to the local observation map at the current time and the robot's information. A typical scheme uses the concept of the frontier proposed by Brian Yamauchi: the area at the junction of open space and unknown space in the occupancy grid map is extracted as the edge, and the robot is navigated to the nearest frontier to achieve the goal of autonomous exploration (Yamauchi, 1997, 1998).
• The path-planning and motion-controlling module plans the path for the robot to reach the target, such as A* (Hart et al., 1968), Hybrid A* (Dolgov and Thrun, 2009) and RRT (LaValle and Kuffner, 2001), and controls the robot to track the path to the target position, such as LQR (Levinson et al., 2011), MPC (Nareyko et al., 2020) and DWA (Fox et al., 1997). These algorithms all need to maintain a global or local cost map in real time to prevent collisions, which requires high computing resources from the equipment.
This paper proposes a full-mapping method to improve the map-updating module and trains the decision-making module based on reinforcement learning to shorten the computation time of the decision-making process and improve the efficiency of decision-making.
Our paper proceeds as follows: we describe related work in Section 2, elaborate on our methodology and agent network design in Section 3, showcase experimental results in Section 4 and conclude in Section 5.

The authors disclosed receipt of the following financial support for the research, authorship and publication of this article: this research was supported by NUDT, China.
Industrial Robot: the international journal of robotics research and application, Vol. 50 No. 5 (2023), pp. 793-803. © Emerald Publishing Limited [ISSN 0143-991X]. DOI 10.1108/IR-12-2022-0299. Received 19 October 2022; Revised 14 February 2023; Accepted 27 February 2023.


Figure 1 Autonomous exploration system

2. Related work

2.1 Information theory
There are many kinds of maps, such as the topology map, the feature map and the grid map, but the feature map is usually used in structured environments.
A statistical approach derived from Bayesian theory has captured our attention (Moravec, 1988). This method is fast and effective; it considers the problem of generating a grid map from independent sources of information and lays the foundation for mapping.
Thrun and Bücken (1996) describe an approach that integrates two paradigms, grid-based and topological, and give results for autonomously operating a mobile robot equipped with sonar sensors. However, the authors assume that all walls are perpendicular or parallel to each other, which is only suitable for environments of regular shape and therefore has limitations.
Yamauchi (1997) proposed the concept of frontiers for the first time: grids at the junction of free grids and unknown grids in the occupancy grid map are extracted as frontiers, and the robot navigates to the nearest frontier to achieve the purpose of autonomous map building. Later, Yamauchi (1998) extended this method to multirobot exploration. Robots share information and navigate to their nearest frontiers separately; however, no schedule is established, which results in low efficiency.
Simmons and colleagues made some attempts to minimize the overall time of the whole robot team, taking into account the cost to a robot and the utility of a target point; once a target point is assigned to a robot, its utility to the other robots is reduced. This approach reduces time significantly, but there is still a relatively large optimization space (Burgard et al., 2000; Simmons et al., 2000).
Laumond (1983) and Chatila and Laumond (1985) build a topological model and then derive a semantic model, e.g. identifying "rooms" and "corridors" (Kuipers and Byun, 1991). Nodes are used to represent specific locations in the map, such as corners, and edges represent connections between nodes, such as corridors. However, the topological model is not used to cope with measurement inaccuracy.

2.2 Reinforcement learning
At the same time, a variety of intelligent methods have emerged. The ultimate goal of these methods is to minimize the time or the path length for robots to explore the unknown environment.
Koutras et al. (2021) are the first to try to solve the problem of exploring unknown environments by adopting deep reinforcement learning. The authors train agents in an Open-AI environment, Mars Explorer. Although the robot is a simple particle-based model in this environment, the size of the environment is fixed and the environment is highly structured, this is still a brave attempt in the exploration field and has confirmed the possibility of using reinforcement learning algorithms to solve this problem.
Zhang et al. (2021) and Chen et al. (2020) propose a reinforcement learning method using a graph convolutional neural network to solve the decision-making problem in exploration and train the policy on many random maps, taking the historical positions, current position, observed landmarks and candidate frontiers of the robot as the vertices of a graph and connecting these vertices by certain rules to form the graph. However, this method still relies to a certain extent on a series of processing steps on occupancy grid maps, such as frontier detection, and when the environmental characteristics change, the efficiency of this graph-based method is not high. There is still a lot of work to be done before deploying this method on robots.
Li et al. (2020) use a convolutional neural network (CNN) to train agents and calculate the position with the largest advantage value in the map by inputting the whole map into the network. Finally, the authors successfully deployed the algorithm on robots and carried out experiments. However, when the map size changes (i.e. the input dimension of the neural network changes), the network needs to be changed and retrained. What's more, the authors do not consider the distance cost of robot movement, and the reward function is set only to the map information gain obtained by the current action, which is unreasonable.


3. Methodology
The algorithm is divided into three parts. In the first part, the map-updating module is improved to make full use of sensor data. In the second part, the weights of the policy that decides the goal are trained (the path-planning and motion-controlling module adopts the traditional A* + DWA method). The third part introduces map processing, frontier detection and clustering; the reinforcement-learning goal calculation and the frontier rescue algorithm are combined, experiments are done on the data set, and the feasibility and efficiency of the algorithm are analyzed. The implementation details of each part are described below:

3.1 Semi-mapping and full-mapping


There are many kinds of maps; considering that the occupancy grid map is easy to apply in robot path planning, the occupancy grid map is adopted in this paper. The occupancy grid map divides the whole map into grids. Each grid has only two states (state: s): s = 1 means the grid is free, and s = 0 means the grid is occupied. But in the real world, the information obtained by robots from the external environment usually contains errors, so it is more appropriate to use a probability to represent the state of the map. For a grid m_{i,j} in the map, p(m_{i,j}) represents the probability of the grid being occupied (called the confidence level bel(m_{i,j})), and p(\bar{m}_{i,j}), or 1 - p(m_{i,j}), represents the probability of the grid being free. For a grid not yet observed by the robot, the initial occupation probability is p(m_{i,j}) = 0.5 (Moravec, 1988).
The grid confidence level is computed as follows:

bel_t(m_{i,j}) = \frac{p(m_{i,j} \mid x_t, z_t)\, p(z_t \mid x_t)\, bel_{t-1}(m_{i,j})}{p(m_{i,j})\, p(z_t \mid x_{1:t}, z_{1:t-1})}   (1)

The following result is derived from Bayesian theory:

bel_t(m_{i,j}) = 1 - \frac{1}{1 + \exp(l_t)}   (2)

where:

l_t = l_{t-1} + l_{inv} - l_0,
l_0 = \log\left[ p(m_{i,j}) / p(\bar{m}_{i,j}) \right],
l_{t-1} = \log\left[ bel_{t-1}(m_{i,j}) / bel_{t-1}(\bar{m}_{i,j}) \right],
l_{inv} = \log\left[ p(m_{i,j} \mid x_t, z_t) / p(\bar{m}_{i,j} \mid x_t, z_t) \right]   (3)
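As an illustration of the recursion in equations (2) and (3), the sketch below applies the log-odds update to a single cell. The prior, the inverse sensor model value and the example measurements are assumptions chosen only for illustration, not values taken from the paper.

```python
import numpy as np

P_PRIOR = 0.5  # occupancy probability of an unobserved cell

def log_odds(p):
    """Convert a probability into log-odds form."""
    return np.log(p / (1.0 - p))

def update_cell(l_prev, p_inv):
    """One recursive log-odds update for a single grid cell.

    l_prev : log-odds of the cell belief before this measurement
    p_inv  : inverse sensor model value p(m_ij | x_t, z_t) for this measurement
    Returns the new log-odds l_t and the belief bel_t, as in equations (2)-(3).
    """
    l_0 = log_odds(P_PRIOR)                   # prior term, zero when the prior is 0.5
    l_inv = log_odds(p_inv)                   # measurement term
    l_t = l_prev + l_inv - l_0                # equation (3)
    bel_t = 1.0 - 1.0 / (1.0 + np.exp(l_t))   # equation (2)
    return l_t, bel_t

# example: a cell observed as likely occupied (p_inv = 0.7) twice in a row
l, b = update_cell(log_odds(P_PRIOR), 0.7)
l, b = update_cell(l, 0.7)
print(round(b, 3))  # the belief rises above the 0.5 prior
```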
However, when a measurement is larger than the effective range of the LiDAR, mapping algorithms such as Gmapping, Hector, Cartographer and Karto do not calculate the grid probability of the area within this range (we call such algorithms semi-mapping). We believe that this kind of method is reliable but inefficient because it does not make full use of the data.
To solve this problem, we improve semi-mapping by using all the valid data of the LiDAR. When a LiDAR return is larger than the effective range of the LiDAR, the grids within the limited range along that beam can still be considered free, which achieves the purpose of updating the map. Figure 2 shows the results of the two types of mapping methods. The map-updating area of the full-mapping method is larger than that of the semi-mapping method, so more information is obtained.

Figure 2 (a) Semi-mapping and (b) full-mapping
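The sketch below illustrates the difference between the two policies for a single beam. The grid resolution, the coarse ray-marching helper and the free/occupied probabilities are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

MAX_RANGE = 4.0    # effective LiDAR range (m), as in the simulated sensor
RESOLUTION = 0.05  # assumed grid resolution (m per cell)

def cells_along_beam(origin, angle, length):
    """Indices of grid cells crossed by a beam of the given length (coarse sampling)."""
    steps = int(length / RESOLUTION)
    xs = origin[0] + np.cos(angle) * RESOLUTION * np.arange(steps)
    ys = origin[1] + np.sin(angle) * RESOLUTION * np.arange(steps)
    return list(zip((xs / RESOLUTION).astype(int), (ys / RESOLUTION).astype(int)))

def integrate_beam(grid_update, origin, angle, measured_range, full_mapping=True):
    """Mark cells as free/occupied for one beam.

    Semi-mapping: a return beyond MAX_RANGE is discarded entirely.
    Full-mapping: even then, the cells inside MAX_RANGE are still marked free.
    grid_update(cell, p_inv) applies the log-odds update of equation (3).
    """
    if measured_range > MAX_RANGE:
        if not full_mapping:
            return                                # semi-mapping: no update at all
        for cell in cells_along_beam(origin, angle, MAX_RANGE):
            grid_update(cell, 0.3)                # free space up to the range limit
    else:
        cells = cells_along_beam(origin, angle, measured_range)
        if not cells:
            return
        for cell in cells[:-1]:
            grid_update(cell, 0.3)                # free space before the hit
        grid_update(cells[-1], 0.7)               # the hit cell is likely occupied
```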
3.2 Deep reinforcement learning
A Markov decision process (MDP) can be used to model RL problems. It is common to define an MDP as a quadruple (S, A, r, f): S is the set of all environment states, and s_t \in S represents the state of the agent at time t; A is the set of actions that can be executed by the agent, and a_t \in A represents the action taken by the agent at time t; r: S \times A \to \mathbb{R} is the reward function, and r_t \sim r(s_t, a_t) represents the immediate reward obtained by the agent when performing action a_t in state s_t; f: S \times A \times S \to [0, 1] is the state transition probability distribution function, and s_{t+1} \sim f(s_t, a_t) represents the probability that the agent, performing action a_t in state s_t, transfers to the next state s_{t+1}. The trajectory of states, actions and rewards is \tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_{T-1}, a_{T-1}, r_{T-1}, s_T). The network update formulas are:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]   (4)

\theta_{target} = \rho\, \theta_{target} + (1 - \rho)\, \theta   (5)
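A compact sketch of how the temporal-difference target in equation (4) and the target-network blend in equation (5) are typically written for a DQN. Equation (4) as printed bootstraps from Q(s_{t+1}, a_{t+1}); the sketch uses the usual max over next actions for a DQN, the numeric constants come from Section 4.1, and the network and optimizer objects are placeholders rather than the authors' code.

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # reward discount rate quoted in Section 4.1
RHO = 0.01    # soft update rate quoted in Section 4.1

def dqn_step(policy_net, target_net, optimizer, batch):
    """One gradient step on a replay batch of tensors (s, a, r, s_next, done)."""
    s, a, r, s_next, done = batch
    q_sa = policy_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s_t, a_t)
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values           # bootstrapped value of s_{t+1}
        target = r + GAMMA * q_next * (1.0 - done)               # TD target, cf. equation (4)
    loss = F.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def soft_update(target_net, policy_net, rho=RHO):
    """theta_target <- rho * theta_target + (1 - rho) * theta, following equation (5) literally."""
    for tp, p in zip(target_net.parameters(), policy_net.parameters()):
        tp.data.mul_(rho).add_((1.0 - rho) * p.data)
```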

Figure 3 Action space
Figure 5 Exploration dataset

Deep reinforcement learning is used to calculate goals. Because the input dimension of the neural network is constant, the entire map cannot be taken as input. We set the robot's local vision as the state, that is, a grayscale image of fixed size in which the robot's position is marked, and then scale the image to 84 x 84; the result is used as the input of the neural network. The action space is a series of discrete actions, 81 actions in all; their positions relative to the robot are shown in Figure 3. We integrate two different neural networks in our framework: a recurrent neural network (GRU) and a CNN. The output is influenced by the temporally sequential inputs to the neural network, considering the temporal context of an input state. What's more, we add a dropout layer to avoid overfitting, as shown in Figure 4.
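The exact layer sizes of the network in Figure 4 are not given in the text, so the sketch below only illustrates the described structure (a CNN front end on the 84 x 84 local map, a GRU for temporal context, dropout, and one Q-value per discrete action); every dimension here is an assumption.

```python
import torch
import torch.nn as nn

class ExplorationQNet(nn.Module):
    """CNN + GRU Q-network over the 84 x 84 local observation map (illustrative sizes)."""

    def __init__(self, n_actions=81, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(                                  # feature extractor on the grayscale map
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.gru = nn.GRU(64 * 7 * 7, hidden, batch_first=True)   # temporal context over successive steps
        self.dropout = nn.Dropout(p=0.5)                           # regularization mentioned in the text
        self.head = nn.Linear(hidden, n_actions)                   # one Q-value per discrete action

    def forward(self, frames, h0=None):
        # frames: (batch, time, 1, 84, 84) sequence of local observation maps
        b, t = frames.shape[:2]
        feats = self.cnn(frames.reshape(b * t, 1, 84, 84)).reshape(b, t, -1)
        out, h = self.gru(feats, h0)
        return self.head(self.dropout(out[:, -1])), h              # Q-values for the last step

q_net = ExplorationQNet()
q_values, _ = q_net(torch.zeros(1, 4, 1, 84, 84))  # e.g. a sequence of four observations
print(q_values.shape)  # torch.Size([1, 81])
```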

Data set: We use the exploration data set to provide training scenarios for the deep reinforcement learning method. The exploration data set is highly diverse, with more than 10,000 maps in total. Figure 5 shows several 3D Gazebo simulation environments.
Shannon entropy is used to represent the uncertainty of a grid in the map, and entropy is introduced as the feature of the grid (Carrillo et al., 2015). Shannon entropy:

H_g = -\sum_{i,j} \left[ p(m_{i,j}) \log\left(p(m_{i,j})\right) + \left(1 - p(m_{i,j})\right) \log\left(1 - p(m_{i,j})\right) \right]   (6)

Reward function:

R = \omega_1 R_1 + \omega_2 R_2 + \omega_3 R_3, \quad R_1 = \Delta H_g, \quad R_2, R_3: \text{path length, angle change}   (7)
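A small sketch of the reward terms in equations (6)-(7). The weights and the sign conventions for the path-length and angle-change penalties are assumptions for illustration only; the paper does not list its values.

```python
import numpy as np

W1, W2, W3 = 1.0, -0.1, -0.05   # illustrative weights, not the authors' values

def map_entropy(prob_grid):
    """Shannon entropy of an occupancy grid, equation (6)."""
    p = np.clip(prob_grid, 1e-6, 1.0 - 1e-6)   # avoid log(0) on fully known cells
    return float(-np.sum(p * np.log(p) + (1.0 - p) * np.log(1.0 - p)))

def step_reward(grid_before, grid_after, path_length, angle_change):
    """R = w1 * dH + w2 * (path length) + w3 * (angle change), following equation (7)."""
    r1 = map_entropy(grid_before) - map_entropy(grid_after)   # information gained by the action
    return W1 * r1 + W2 * path_length + W3 * angle_change
```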
3.3 Frontier rescue
If the goal from deep reinforcement learning is not reachable, or the information obtained from the action is very low, a frontier rescue policy is carried out by clustering the frontiers with the Mean-Shift algorithm.
But the frontiers directly extracted from occupancy grid maps are scattered. First, the map is edge-detected and denoised, using dilation and erosion at the same time; further, the results are averaged, and the obstacles are expanded to ensure the accessibility of the frontiers. Then, frontiers are detected on the result. On this basis, we detect the connectivity of frontiers and group the frontiers into different edges representing different regions, as shown in Algorithm 1.
So far, the detected frontiers can be used to calculate the goal for the robot (Senarathne et al., 2013). However, without further processing, this process is cumbersome, inefficient and prone to error. Therefore, Mean-Shift is performed on the frontiers, and the center of each class is used to calculate the goal for the robot, which improves the efficiency of autonomous exploration.
Mean-Shift is a hill-climbing algorithm that involves shifting a kernel iteratively to a higher-density region until convergence. Every shift is defined by a mean-shift vector, which always points toward the direction of the maximum increase in density. At every iteration, the kernel is shifted to the centroid, or the mean, of the points within it. At convergence, there is no direction in which a shift can accommodate more points inside the kernel. The bandwidth size has a physical meaning, unlike the k in k-means, and Mean-Shift can automatically determine the number of categories. Figure 6 shows the results of the frontier rescue policy.
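As an illustration of this clustering step, the sketch below runs scikit-learn's MeanShift on a set of frontier cells and keeps each cluster centre as a candidate rescue goal; the bandwidth value and the surrounding data handling are assumptions, not the authors' parameters.

```python
import numpy as np
from sklearn.cluster import MeanShift

def cluster_frontiers(frontier_cells, bandwidth=0.8):
    """Group frontier cells (N x 2 array of map coordinates) and return cluster centres.

    Each centre is a candidate goal for the frontier rescue policy; the number of
    clusters is decided automatically by the bandwidth, unlike the k in k-means.
    """
    points = np.asarray(frontier_cells, dtype=float)
    ms = MeanShift(bandwidth=bandwidth).fit(points)
    return ms.cluster_centers_, ms.labels_

# toy example: two groups of frontier cells on opposite sides of the map
cells = np.array([[0.1, 0.2], [0.3, 0.1], [0.2, 0.3], [5.0, 5.1], [5.2, 4.9]])
centres, labels = cluster_frontiers(cells)
print(centres)  # one centre per detected frontier region
```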

Figure 4 Neural network architecture


Figure 6 Frontier detection and clustering

Algorithm 1: frontiers grouping

1: Input: frontier list (Frontiers)
2: Output: clusters list (Clusters)
3: Initialize a list Clusters = []
4: While Frontiers is not empty:
5:   Initialize a list Open = []
6:   Initialize a list Close = []
7:   Open append a frontier point popped from Frontiers
8:   Repeat:
9:     If Open is not empty:
10:      Choose a frontier point from Open
11:      Open append the points in Frontiers adjoining the chosen point, and pop them from Frontiers
12:      Open pop the chosen frontier point
13:      Close append the chosen frontier point
14:    If Open is empty:
15:      Clusters append Close
16:  Until Open is empty

Source: Authors' own work
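A runnable Python rendering of the grouping idea in Algorithm 1, assuming 8-connected adjacency between frontier cells; the adjacency test and data structures are illustrative choices, not the authors' implementation.

```python
def adjacent(a, b):
    """8-connected adjacency between two frontier cells given as (row, col)."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1])) == 1 and a != b

def group_frontiers(frontiers):
    """Split a list of frontier cells into connected clusters, as in Algorithm 1."""
    remaining = set(frontiers)
    clusters = []
    while remaining:
        open_list = [remaining.pop()]        # seed a new cluster with any frontier point
        close_list = []
        while open_list:
            point = open_list.pop()
            close_list.append(point)
            neighbours = {p for p in remaining if adjacent(point, p)}
            open_list.extend(neighbours)     # grow the cluster along connected frontiers
            remaining -= neighbours
        clusters.append(close_list)
    return clusters

# example: two separate frontier segments
print(group_frontiers([(0, 0), (0, 1), (0, 2), (5, 5), (6, 5)]))
```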

Then, a greedy algorithm is used to calculate the exploration scheme with the minimum cost for the robot. The path-planning and motion-controlling module adopts the traditional A* + DWA method.
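One plausible reading of this greedy step is to pick the cluster centre with the lowest travel cost from the robot's current pose; the straight-line cost used below stands in for the planner's path cost and is an assumption.

```python
import math

def greedy_goal(robot_pose, cluster_centres):
    """Pick the cluster centre with the smallest cost from the robot (greedy scheme).

    Straight-line distance is used here as a stand-in for the A* path cost.
    """
    def cost(centre):
        return math.hypot(centre[0] - robot_pose[0], centre[1] - robot_pose[1])
    return min(cluster_centres, key=cost)

print(greedy_goal((0.0, 0.0), [(2.0, 1.0), (5.0, 5.0)]))  # -> (2.0, 1.0)
```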

4. Experiments
The simulation experiments are carried out on ROS (the algorithm is also deployed to real robots for experiments). To verify the effect of the above algorithm, the evaluation is divided into two parts. In the first part, a DQN is trained to calculate the goal (in this process, the path-planning and motion-controlling module adopts the traditional A* + DWA method). In the second part, experiments are done on data sets by combining reinforcement learning and frontier rescue, and the feasibility and efficiency of the algorithm are analyzed. The experimental details of each part are described below.

4.1 Autonomous exploration in the simulation


The training process is based on a virtual environment: ROS is a systematic framework for robot development, and Gazebo is an open-source 3D robot simulation software that integrates the ODE physics engine and OpenGL rendering and supports sensor simulation and actuator control.
The agent we trained is a standard robot development platform from ROS and Gazebo, the Scout-Mini. The Scout-Mini is a robot platform that can be equipped with LiDAR sensors and can be directly controlled via the speed topic, and strategies trained on this standard robotic development platform transfer well to other similar robotic platforms. The minimum observation distance of the LiDAR sensor is 0.3 m, and the maximum effective observation distance is 4 m. To be closer to the real environment, Gaussian noise is added to the sensor data. The raw data from the LiDAR sensor are sampled over a range of 360 degrees.

We trained the strategy in Gazebo 3D environments similar to a tunnel or corridor, using about 5,000 maps during the training process. Maps are of unequal size. In our experiment, the robot's LiDAR provided sensor data with a 360-degree field of view with Gaussian noise. The local observation map of the robot is used as the input state of the neural network, and the position of the robot is marked in it (its gray value is set to 80 within a radius of 0.3 m). Whether a collision would take place is detected before the robot takes any action, and the training samples include collisions. The robot thus gradually learns a strategy of avoiding collisions and exploring at the same time. However, if the robot navigates to the end of a channel or cannot find an effective action, the "Frontier Rescue" mechanism is triggered to drive the robot to the nearest boundary point so that training can continue.
We compared RRT-Exploration (Umari and Mukhopadhyay, 2017), Explore-lite (Hörner, 2016) and our learned policy in detail on a data set consisting of an additional five maps at the testing stage. RRT-Exploration presents a strategy for detecting frontier points using RRT: a growing, randomly sampled tree is used in the search for frontier points. Explore-lite searches for frontier points by breadth-first search on the map; the frontier points are filtered and queued to be assigned to a robot, and the robot then moves toward the assigned point. Both algorithms detect the frontiers first, while DRL directly obtains the goal from the local field of view, which shows the end-to-end characteristics more clearly and highlights the efficiency of our algorithm. In the process of exploration based on DQN, any action that would result in a low Shannon entropy gain triggers the "Frontier Rescue" mechanism, which can guide the robot from a region full of obstacles, or with poor information, to the nearest frontier point.
In the training process, exploration is stopped only when there are no frontier points left on the map or the explored proportion reaches the threshold. We choose two performance indexes to compare performance. Reward represents the area of the explored region, or the information gain obtained during exploration, which is the key index to evaluate the performance of a method (excluding the reward obtained through the "Frontier Rescue" mechanism). If the rewards are the same, the path length of the robot is also a key index.
We used the DQN algorithm to train the network and set up an experience replay mechanism to improve sample efficiency. Before training starts, the experience replay pool should collect at least 10,000 samples, and the maximum number of samples in the experience replay pool is 50,000. The network is trained every 20 steps, and 64 samples are randomly sampled from the experience replay pool for each network gradient update.
The learning rate is set to 0.0001, the sample batch size is set to 64, the reward discount rate is set to 0.99, the network soft update rate is set to 0.01 and the network soft update is performed every 250 training steps. The model is trained on a device configured with 32 GB RAM, an i7-11800H CPU and an Nvidia 3080 6 GB GPU. We trained for 50,000 episodes on the data set and spent 20 h training the network.
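The replay and update schedule described above can be summarized in a short configuration skeleton; only the numeric values come from the text, while the function and variable names are illustrative placeholders.

```python
from collections import deque
import random

# values quoted in Section 4.1
LEARNING_RATE = 1e-4
BATCH_SIZE = 64
GAMMA = 0.99
SOFT_UPDATE_RATE = 0.01
TRAIN_EVERY_STEPS = 20
SOFT_UPDATE_EVERY = 250
REPLAY_MIN, REPLAY_MAX = 10_000, 50_000
TOTAL_EPISODES = 50_000

replay = deque(maxlen=REPLAY_MAX)   # experience replay pool, capped at 50,000 samples
update_count = 0

def maybe_train(env_step, train_step, soft_update):
    """Apply the schedule: train every 20 steps once 10,000 samples exist; blend the target every 250 updates.

    train_step(batch) performs one gradient update; soft_update() blends the target network.
    """
    global update_count
    if len(replay) < REPLAY_MIN or env_step % TRAIN_EVERY_STEPS != 0:
        return
    train_step(random.sample(replay, BATCH_SIZE))   # 64 transitions per gradient update
    update_count += 1
    if update_count % SOFT_UPDATE_EVERY == 0:
        soft_update()
```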
We successfully used the DQN algorithm to train the agent to complete the end-to-end decision-making process in complex and diverse environments. What's more, we trained six different strategies and recorded the final average reward of each. Figure 7(a)-(f) shows the corresponding six action spaces, and the average rewards corresponding to the six strategies are 3.897, 3.267, 4.217, 3.937, 4.015 and 3.917, respectively, as shown in Figure 7(g). When the action space is that of Figure 7(c), the final average reward after convergence is 4.217, the highest among the six strategies. We studied the relationship between the quantity and distribution of the action space and the quality of the learned policies. We may safely draw two conclusions from Figure 7: (1) if the number of actions is the same but the distribution is different, policies that explore unknown environments can be learned in all cases; however, the final reward of action space (c) in Figure 7 is slightly higher, by about 5% ((4.217 - 4.015)/4.015 x 100%), than that of the other five strategies; (2) if the distribution of actions is the same but the number is different, policies that explore unknown environments can also be learned in all cases; however, the final reward does not simply grow with the number of actions, because if there are too few or too many actions, it is difficult to learn a good policy. Ultimately, we adopt action space (c) in Figure 7.
Figure 8 shows the change curves of loss and reward in the training process. As the agent learns a good policy, the loss decreases gradually. What's more, in the initial stage, the agent is constantly exploring and the reward is unstable. After the agent learns a policy that can obtain a large reward, the reward starts to increase gradually and converges after about 35,000 episodes.
What's more, we compared the performance of the proposed algorithm with RRT-Exploration and Explore-lite, conducted experiments using these exploration methods in five experimental scenarios and recorded the data of these exploration methods. We also set up a group of experiments in which a human operates the robot remotely to explore, as a baseline. All methods can explore more than 98% of the area in the unknown environment. Figure 9 shows an example of using these exploration methods to explore an unknown environment [the action space is (c) in Figure 7] when exploring Scenario 1. In the whole process of autonomous exploration, decisions rely fully on the DQN. According to Figure 10, the shortest exploration route among the four methods is the one obtained under human control. Among the three algorithms, our method yields the best exploration route. And by comparing the time cost of the algorithms, our algorithm consumes about 2 ms for every decision, which greatly improves real-time performance. From the performance comparison of the three algorithms, we can roughly see the advantages of the DQN method in path exploration efficiency and decision time compared with the RRT method and the LITE method. Therefore, it can be concluded that our method improves efficiency compared with RRT-Exploration and Explore-lite.
What's more, because both RRT-Exploration and Explore-lite need to detect frontiers on the global map, detection efficiency is closely related to map size and the complexity of the scenario, so the decision-making times of these two methods differ significantly and are strongly positively correlated with complexity. However, the DQN method only needs to obtain the local field of view without any further processing; the goal can then be determined by inputting the local map into the neural network, and the decision time is not related to map size or environmental complexity. So the decision time of the DQN method is almost irrelevant to the complexity of the scenario and remains almost constant.

Figure 7 Six action spaces

Figure 8 DQN learning curves


Figure 9 Exploration pathway

Figure 10 Performance comparison

            Exploration path length             Decision-making time (ms)     Efficiency
Scenario    HM      DQN     RRT     LITE        DQN     RRT      LITE         HM     DQN      RRT      LITE
1           95.5    106.5   119.8   113.5       2.0     134.2    41.3         1.0    89.6%    79.6%    84.1%
2           60.1    66.1    67.1    66.3        2.0     128.1    41.1         1.0    90.9%    89.4%    90.6%
3           46.2    46.9    68.1    49.6        1.9     120.5    33.6         1.0    98.5%    67.9%    93.1%
4           67.4    69.6    70.2    74.7        2.0     100.8    37.5         1.0    96.7%    95.9%    90.2%
5           44.4    44.5    44.8    45.7        1.9     114.4    31.8         1.0    99.7%    99.0%    97.1%

Notes: HM denotes the human-operated baseline; efficiency is the exploration path length of the human divided by the exploration path length of the exploration method
Source: Authors' own work

4.2 Hybrid autonomous exploration experiment
In most cases, the above algorithm can be used for mapping. We tested on the test set, and our method can explore more than 95% of the area on all maps. However, because of the intricacy and multiplicity of the real world, if the goal is unreachable, if the information obtained from the action is very low, or even if the robot collides with obstacles, frontier detection and clustering should be used to calculate a goal for robot rescue. This can greatly improve the robustness and reliability of the algorithm. We therefore propose a hybrid autonomous exploration method that combines the learning-based policy with the frontier rescue policy.
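To make the hybrid decision rule explicit, the loop below falls back to the frontier rescue goal whenever the learned goal is unreachable or its expected information gain is too small; the threshold, helper names and stop criterion are assumptions, not the authors' code.

```python
MIN_INFO_GAIN = 0.5   # assumed minimum acceptable entropy reduction per action

def hybrid_exploration(policy_goal, reachable, expected_gain, rescue_goal, done, navigate):
    """One possible rendering of the hybrid policy: DQN goal first, frontier rescue as fallback."""
    while not done():
        goal = policy_goal()                                    # goal from the learned DQN policy
        if not reachable(goal) or expected_gain(goal) < MIN_INFO_GAIN:
            goal = rescue_goal()                                # nearest Mean-Shift cluster centre
        navigate(goal)                                          # A* + DWA path following
```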
Extensive experiments, including on real robots, are conducted to evaluate the performance of the approach. Figure 11 proves that our method is feasible in the real world. Figure 11(a) shows the real robot we used in the demonstrations. The motion platform is a Scout-Mini equipped with a VLP-16 laser radar, and it provides localization information through the odometer and IMU on the vehicle. Communication between the equipment is supported through a router, and a Jetson Nano is used as the master control device to process data and run our algorithm. We tested the hybrid autonomous exploration algorithm in two realistic environments. Figure 11(b) shows a real experimental scene similar to a warehouse; baffles in the environment greatly increase its complexity and the difficulty of autonomous exploration. In addition, we tested the performance of the algorithm in another, larger building, as shown in Figure 11(c).
Figure 12 shows the changes at different times in the exploration process; two videos of the experiments can be viewed online: https://www.bilibili.com/video/BV1Q8411e7WL and https://www.bilibili.com/video/BV1CW4y1j7ey. At any time, the robot calculates the goal through the policy learned by deep reinforcement learning. When the goal cannot be reached or the information gain is less than the minimum expectation set by us, the "Frontier Rescue" mechanism is started. This mechanism can guide the robot out of areas where information is scarce and avoid system disorder. Figure 12 also shows the occupancy grid obtained while completing the autonomous exploration task on the real robot. By comparing our experimental environment with the final results obtained through hybrid autonomous exploration, it can be found that the environmental features are completely established. Our method explores more than 95% of the area in unknown environments and realizes the autonomous exploration task of an unknown real environment without any prior knowledge.

5. Conclusion
We improved the map-updating module, proposed a full-mapping method and used a reinforcement learning method to train the decision-making module. The learned policy can select the optimal or near-optimal perceptual action, and experiments have demonstrated that our method is feasible. In most cases, the above algorithms can be used for mapping. However, sometimes the goal of the policy is unreachable or the information obtained by the action is very low. We therefore propose a hybrid autonomous exploration method that combines the learning-based policy with the frontier rescue policy, which greatly improves the robustness and reliability of the algorithm. The proposed method learns its policy on a training set of more than 5,000 maps and is tested by performing a total of five exploration runs. The results show that the proposed strategy can successfully explore the entire map with a reasonable path length and without substantially losing performance when compared to RRT-Exploration and Explore-lite. Future work will consider cooperative exploration for multiple robots.

Figure 11 Real robot and scenario

Figure 12 Hybrid autonomous exploration in real robot


References

Burgard, W., Moors, M., Fox, D., Simmons, R. and Thrun, S. (2000), "Collaborative multi-robot exploration", IEEE International Conference on Robotics and Automation, Vol. 1, pp. 476-481, doi: 10.1109/ROBOT.2000.844100.
Carrillo, H., Dames, P., Kumar, V. and Castellanos, J.A. (2015), "Autonomous robotic exploration using occupancy grid maps and graph SLAM based on Shannon and Renyi entropy", IEEE International Conference on Robotics and Automation, pp. 487-494, doi: 10.1109/ICRA.2015.7139224.
Casper, J. and Murphy, R. (2003), "Human-robot interactions during the robot-assisted urban search and rescue response at the World Trade Center", IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), Vol. 33 No. 3, pp. 367-385, doi: 10.1007/s10846-013-9822-x.
Chatila, R. and Laumond, J.P.P. (1985), "Position referencing and consistent world modeling for mobile robots", IEEE International Conference on Robotics and Automation, pp. 138-145, doi: 10.1109/robot.1985.1087373.
Chen, F., Martin, J.D., Huang, Y., Wang, J. and Englot, B. (2020), "Autonomous exploration under uncertainty via deep reinforcement learning on graphs", IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 6140-6147, doi: 10.1109/IROS45743.2020.9341657.
Dolgov, D. and Thrun, S. (2009), "Autonomous driving in semi-structured environments: mapping and planning", IEEE International Conference on Robotics and Automation, pp. 3407-3414, doi: 10.1109/ROBOT.2009.5152682.
Fox, D., Burgard, W. and Thrun, S. (1997), "The dynamic window approach to collision avoidance", IEEE Robotics & Automation Magazine, Vol. 4 No. 1, pp. 23-33, doi: 10.1109/100.580977.
Grisetti, G., Stachniss, C. and Burgard, W. (2005), "Improving grid-based SLAM with Rao-Blackwellized particle filters by adaptive proposals and selective resampling", IEEE International Conference on Robotics and Automation, pp. 2432-2437, doi: 10.1109/ROBOT.2005.1570477.
Hart, P.E., Nilsson, N.J. and Raphael, B. (1968), "A formal basis for the heuristic determination of minimum cost paths", IEEE Transactions on Systems Science and Cybernetics, Vol. 4 No. 2, pp. 100-107, doi: 10.1109/TSSC.1968.300136.
Hess, W., Kohler, D., Rapp, H. and Andor, D. (2016), "Real-time loop closure in 2D LIDAR SLAM", IEEE International Conference on Robotics and Automation, pp. 1271-1278, doi: 10.1109/ICRA.2016.7487258.
Hörner, J. (2016), "Map-merging for multi-robot system", Charles University in Prague, Faculty of Mathematics and Physics, Prague, available at: https://is.cuni.cz/webapps/zzp/detail/174125/
Kohlbrecher, S., von Stryk, O., Meyer, J. and Klingauf, U. (2011), "A flexible and scalable SLAM system with full 3D motion estimation", IEEE International Symposium on Safety, Security, and Rescue Robotics, pp. 155-160, doi: 10.1109/SSRR.2011.6106777.
Konolige, K., Grisetti, G., Kümmerle, R., Burgard, W., Limketkai, B. and Vincent, R. (2010), "Efficient sparse pose adjustment for 2D mapping", IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 22-29, doi: 10.1109/IROS.2010.5649043.
Koutras, D.I., Kapoutsis, A.C., Amanatiadis, A.A. and Kosmatopoulos, E.B. (2021), "MarsExplorer: exploration of unknown terrains via deep reinforcement learning and procedurally generated environments", arXiv preprint arXiv:2107.09996.
Kuipers, B. and Byun, Y.T. (1991), "A robot exploration and mapping strategy based on a semantic hierarchy of spatial representations", Robotics and Autonomous Systems, Vol. 8 Nos 1/2, pp. 47-63.
Laumond, J.P.P. (1983), "Model structuring and concept recognition: two aspects of learning for a mobile robot", International Joint Conference on Artificial Intelligence, pp. 839-841.
LaValle, S.M. and Kuffner, J.J. (2001), "Randomized kinodynamic planning", The International Journal of Robotics Research, Vol. 20 No. 5, pp. 378-400, doi: 10.1177/02783640122067453.
Levinson, J., Askeland, J., Becker, J., Dolson, J., Held, D., Kammel, S., Kolter, J.Z., Langer, D., Pink, O., Pratt, V. and Sokolsky, M. (2011), "Towards fully autonomous driving: systems and algorithms", IEEE Intelligent Vehicles Symposium (IV), pp. 163-168, doi: 10.1109/IVS.2011.5940562.
Li, H., Zhang, Q. and Zhao, D. (2020), "Deep reinforcement learning-based automatic exploration for navigation in unknown environment", IEEE Transactions on Neural Networks and Learning Systems, Vol. 31 No. 6, pp. 2064-2076, doi: 10.1109/TNNLS.2019.2927869.
Liu, Y. and Nejat, G. (2013), "Robotic urban search and rescue: a survey from the control perspective", Journal of Intelligent & Robotic Systems, Vol. 72 No. 2, pp. 147-165, doi: 10.1007/s10846-013-9822-x.
Moravec, H.P. (1988), "Sensor fusion in certainty grids for mobile robots", AI Magazine, Vol. 9 No. 2, pp. 61-74, doi: 10.5555/46184.46187.
Nareyko, G., Biemelt, P. and Trächtler, A. (2020), "Real-time optimized model predictive control of an active roll stabilization system with actuator limitations", IFAC-PapersOnLine, Vol. 53 No. 2, pp. 14375-14380, doi: 10.1016/j.ifacol.2020.12.1393.
Qiu, S. and Kermani, M.R. (2021), "Inverse kinematics of high dimensional robotic arm-hand systems for precision grasping", Journal of Intelligent & Robotic Systems, Vol. 101 No. 4, pp. 1-15, doi: 10.1007/s10846-021-01349-7.
Senarathne, P., Wang, D., Wang, Z. and Chen, Q. (2013), "Efficient frontier detection and management for robot exploration", IEEE International Conference on Cyber Technology in Automation, Control and Intelligent Systems, pp. 114-119, doi: 10.1109/CYBER.2013.6705430.
Simmons, R., Apfelbaum, D., Burgard, W., Fox, D., Moors, M., Thrun, S. and Younes, H. (2000), "Coordination for multi-robot exploration and mapping", The National Conference on Artificial Intelligence, pp. 852-858.
Thrun, S.B. and Bücken, A. (1996), "Integrating grid-based and topological maps for mobile robot navigation", The National Conference on Artificial Intelligence, pp. 944-950.
Umari, H. and Mukhopadhyay, S. (2017), "Autonomous robotic exploration based on multiple rapidly-exploring randomized trees", IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1396-1402, doi: 10.1109/IROS.2017.8202319.
Yamauchi, B. (1997), "A frontier-based approach for autonomous exploration", IEEE International Symposium on Computational Intelligence in Robotics and Automation, pp. 146-151, doi: 10.1109/CIRA.1997.613851.
Yamauchi, B. (1998), "Frontier-based exploration using multiple robots", International Conference on Autonomous Agents, pp. 3715-3720, doi: 10.1145/280765.280773.
Zhang, Z., Shi, C., Zhu, P., Zeng, Z. and Zhang, H. (2021), "Autonomous exploration of mobile robots via deep reinforcement learning based on spatiotemporal information on graph", Applied Sciences, Vol. 11 No. 18, pp. 8299-8320.

Corresponding author
Keyan He can be contacted at: hekeyan2009@126.com
