
Distributed Multi-Agent Deep Reinforcement Learning based Navigation and Control of UAV Swarm for Wildfire Monitoring

Adeeba Ali, Department of Computer Engineering, Aligarh Muslim University, Aligarh, India (aliadeeba98@gmail.com)
Rashid Ali, Department of Computer Engineering, Aligarh Muslim University, Aligarh, India (rashidaliamu@rediffmail.com)
M.F. Baig, Department of Mechanical Engineering, Aligarh Muslim University, Aligarh, India (drmfbaig@yahoo.co.uk)

Abstract—Decentralized deep reinforcement learning is an emerging and effective approach to the problem of resource allocation and coordination among a swarm of UAVs. The use of autonomous aerial vehicles in wildfire monitoring is increasing and is considered a reasonably feasible option, since surveillance of calamity-hit areas can benefit from this kind of automation. Flocks of UAVs can generate maps of the affected areas, which can improve relief planning so that the necessary aid reaches the burnt areas quickly. This paper presents a multi-agent Deep Q-Network based technique for planning optimized trajectories for a UAV swarm that senses wildfire in forests and nearby regions. In this work, the UAV agents are trained over simulated wildfires in virtually generated forests with two reward schemes. The simulation results verify the effectiveness of the proposed strategy for leveraging it in real-world scenarios.

Index Terms—swarms, reinforcement learning, navigation, wildfire, decentralized control, UAV

I. INTRODUCTION

Over the last few years, the spread of massive wildfires across the world has caused substantial environmental, economic, and social damage [1]. The most dreadful effects of wildfire include the death of human beings, as well as other medical problems that arise due to high temperature, smoke inhalation, debris, and vehicle crashes. The adverse effects of wildfires on environmental health include the degradation of air quality and a change in the overall composition of the atmosphere, as wildfires emit carbon dioxide (CO2) in large quantities, thus contributing to global warming. Furthermore, uncontrolled wildfires are also responsible for the extinction of some species of flora and fauna in the affected location.

In order to reduce the losses caused by wildfires, it is important to keep account of information regarding the fuel map, the movement and spread of the fire front, and the climate conditions. The availability of this kind of accurate real-time information can reduce the risk to the lives of firefighters and increase the success of efforts to control the spread of fire [2]. One of the main sources of real-time information about wildfires is satellite imagery. However, these images can only be made available if the satellite passes over the particular region of interest. Moreover, in the case of small fires, satellite images lack high spatial resolution [3]. The alternative source for obtaining this valuable information is manned aerial vehicles, which are quite expensive and require a highly skilled human pilot. The use of autonomous UAVs (Unmanned Aerial Vehicles) to monitor wildfires and record real-time information is therefore one of the most promising solutions [3]. The flight controller of an autonomous aerial vehicle can accept commands guiding external maneuvers in addition to the ones communicated by the remote controller. The external commands include the GNSS coordinates and the state information provided by the IMU (Inertial Measurement Unit). External commands can either be transmitted from a base station over a communication link or be sent directly from a processor mounted on the drone and physically connected to the flight controller. Generally, few human resources are required by autonomous UAVs, as the system can be launched by a single operator; however, in some countries it is mandatory to assign at least one safety pilot to take control of the drone in emergency cases.

The use of multiple UAVs instead of a single UAV increases the robustness and efficiency of the system [2]. Furthermore, the use of multiple smaller and cheaper aerial vehicles would minimize the overall cost of the system, thereby increasing the adoption of this kind of system by organizations with comparatively limited financial resources. On the other hand, additional challenges are associated with the use of multiple UAVs, such as the need for an efficient algorithm that can establish proper coordination among the UAVs. Although visible-light and infrared vision based forest fire detection and the hardware required for designing a drone are already in place, state-of-the-art algorithms for the control of UAV swarms for wildfire monitoring have to be further investigated [4-9].

In the literature, collaborative control of multiple UAV agents is achieved either by planning optimized, pre-defined trajectories to the goal position or by using techniques
based on decentralized control theory or RHO (Receding Horizon Optimization). In [11] the authors proposed an approach that outperforms the traditional RHO method for wildfire monitoring with two aircraft. The technique presented in [11] is based on deep reinforcement learning (Deep-RL), and the results demonstrated in that paper show that learning-based mechanisms are promising alternatives for cooperative control tasks. Typically, researchers leverage off-policy Q-learners [12-13], an uncomplicated and widely used approach in the field of multi-robot navigation and control. In [5], deep reinforcement learning has previously been used for autonomous wildfire monitoring with multiple UAVs.

The objective of this research is to investigate and validate techniques that can efficiently deal with the challenges associated with collaborative multi-agent reinforcement learning applied to monitoring a wildfire with multiple UAVs. Inspired by [5], we propose a DQN based solution for the cooperative control of multiple UAVs while monitoring a wildfire. In addition, the effect of reward sharing, another possible feedback scheme in swarm applications [14], is explored in this research.

The remainder of the paper is organized as follows. A review of related work is presented in Section II. The problem is formally stated in Section III. Next, the implementation details of the problem formulated as a POMDP (Partially Observable Markov Decision Process) are discussed in Section IV. This is followed in Section V by a detailed description of the proposed solution for wildfire monitoring. The proposed methodology is then validated through simulations in Section VI. Finally, the summary of the paper is presented in Section VII.

II. RELATED WORK

UAVs are used to gather valuable data and information through surveillance and continuous monitoring in a wide range of applications, including smart agriculture [15], post-disaster hazardous material assessment [16], and wildfire monitoring [2]. In this work, the use of multiple UAVs in a wildfire monitoring application [2] is studied. Earlier, infrared and visible-light cameras, as well as combinations of both, have been leveraged for detecting wildfires, tracking the fire front, or searching for hot spots after a fire [3]. Various infrared-based approaches for tracking wildfires from UAVs and satellites are outlined in [17].

In contrast to satellite imaging, aerial wildfire detection and monitoring provides high temporal and spatial resolution [3], [18]. In [19], the authors demonstrated the operational characteristics of a remotely piloted UAV for monitoring wildfire in large environments. Using multiple UAVs for surveillance and monitoring tasks is more efficient than using a single UAV [3]. Techniques to control a flock of UAVs during simulated wildfire detection and monitoring were presented in [2], [7-8], [20-22]. In [23] the authors outlined a decentralized mechanism to estimate the wildfire model; the presented strategy was tested on real drone hardware, and a stochastic fire simulator was used for generating data. Later, in [24] the authors validated their proposed cooperative control algorithm in real-time field experiments.

The aforementioned approaches leverage control theory techniques to estimate the UAVs' movements. In [11], the authors showed that traditional control theory techniques are outperformed by deep reinforcement learning methods for the task of wildfire monitoring with two aircraft using a random wildfire simulator.

In [6] the authors put forward a deep reinforcement learning strategy to search for burning trees and then apply a retardant to extinguish the fire. Similarly, in [25] the authors investigated an equilibrium based Q-learning variant to successfully complete a coverage task. Moreover, deep reinforcement learning has also been utilized to gather relevant information while learning a particular task. In [27], the authors present the DeepIG method, which allows agents to capture valuable and relevant information without colliding with each other; the proposed algorithm was validated with three UAVs in a terrain mapping experiment.

The algorithms proposed in [11], [25-27] were leveraged for learning and establishing coordination among multi-robot agents, thus exemplifying the challenges associated with collaborative multi-agent reinforcement learning problems.

III. PROBLEM FORMULATION

A. Environment

The simulated wildfire surveillance environment consists of a 2-D map representing a 1 km² area, discretized into a 100×100 grid of burnable cells. The generated dynamic maps of the environment are of two types:

• B(s): Fuel map, in which each cell represents the amount of flammable material present in that particular region.
• F(s): Fire map, a boolean representation of the environment in which each cell holds either the value '0' or '1', where '1' represents the presence of fire in that particular cell.

The simulated wildfire environment, with the evolution of fire across the maps over an episode, is shown in Fig. 1. The cells of the fuel map are updated every 2.5 s according to the following two equations:

B^{t+1}(s) = \begin{cases} \max\big(0,\, B^t(s) - \beta\, p(s)\big), & \text{if } F^t(s) = 1 \\ B^t(s), & \text{otherwise} \end{cases}   (1)

where β is the burning rate, which is set to 1, and

p(s) = \begin{cases} 1 - \prod_{s' \,:\, d(s,s') < d_{mp}} \big(1 - P(s,s')\, F^t(s')\big), & \text{if } B^t(s) > 0 \\ 0, & \text{otherwise} \end{cases}   (2)

where p(s) is the probability that cell s will ignite, d(s, s') is the distance between cells s and s', d_{mp} is the maximum distance at which propagation can happen, and P(s, s') is the probability that cell s' ignites cell s, defined as follows:

P(s, s') = w \cdot d(s, s')^{-2}   (3)

where w is a constant proportional to the wind speed.
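As a concrete reading of Eqs. (1)-(3), the following is a minimal NumPy sketch of one 2.5 s simulator update. It is an illustration, not the authors' simulator: the function name `step_wildfire` is hypothetical, and the way the fire map itself evolves from p(s) is an assumption, since the paper specifies only the fuel update and the ignition probability.

```python
import numpy as np

def step_wildfire(B, F, w=1.0, beta=1.0, d_mp=3.0):
    """One update of the fuel map B and fire map F following Eqs. (1)-(3).

    B: (100, 100) float array, fuel left in each cell.
    F: (100, 100) {0, 1} array, 1 where the cell is burning.
    """
    p = np.zeros_like(B)
    burning = np.argwhere(F == 1)

    # Eq. (2): ignition probability from all burning cells closer than d_mp.
    for i, j in np.argwhere(B > 0):
        prod = 1.0
        for bi, bj in burning:
            d = np.hypot(i - bi, j - bj)
            if 0 < d < d_mp:
                prod *= 1.0 - min(1.0, w * d ** -2)   # Eq. (3): P(s, s') = w * d(s, s')^-2
        p[i, j] = 1.0 - prod

    # Eq. (1): burning cells consume fuel at rate beta * p(s).
    B_next = np.where(F == 1, np.maximum(0.0, B - beta * p), B)

    # Assumed fire-map dynamics: cells ignite with probability p(s) and stop
    # burning once their fuel is exhausted.
    F_next = np.where((np.random.rand(*B.shape) < p) | (F == 1), 1, 0)
    F_next = np.where(B_next > 0, F_next, 0)
    return B_next, F_next
```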
Fig. 1: An example wildfire evolution over time (snapshots at t = 0 s, 100 s, 200 s, 300 s, 400 s, and 500 s). Left: burning cells. Right: fuel left.

B. Agents

Fixed-wing aircraft are considered as the agents for this problem. It is assumed that each aircraft flies with a constant velocity v of 20 m/s, and each agent is allowed to decide whether to increase or decrease its bank angle ϕ by 5 degrees at a frequency of 10 Hz. The position of the aircraft is updated according to the kinematic model of a fixed-wing aerial vehicle, represented by the following expressions:

\dot{x} = v \cos\psi, \quad \dot{y} = v \sin\psi, \quad \dot{\psi} = \frac{g \tan\phi}{v}   (4)

where g is the acceleration due to gravity, 9.8 m/s², and ψ is the heading angle. The maximum bank angle that a UAV can take is set to 50 degrees, and UAV maneuvers exceeding this range are not considered valid. This constraint sets a realistic limit on the angular velocity.
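The state update implied by Eq. (4) and the bank-angle actions can be sketched as follows. This is only an illustration under assumptions not stated in the paper (forward-Euler integration, the hypothetical function name `step_aircraft`), not the authors' implementation.

```python
import math

G = 9.8                 # m/s^2, acceleration due to gravity
V = 20.0                # m/s, constant airspeed
DT = 0.1                # s, decision period (10 Hz)
MAX_BANK = math.radians(50.0)

def step_aircraft(x, y, psi, phi, action):
    """Advance one 0.1 s step; action is +1 (increase bank) or -1 (decrease bank)."""
    # Apply the +/- 5 degree bank-angle increment, clipped to the 50 degree limit.
    phi = max(-MAX_BANK, min(MAX_BANK, phi + action * math.radians(5.0)))

    # Eq. (4), integrated with a simple forward-Euler step (an assumption).
    x += V * math.cos(psi) * DT
    y += V * math.sin(psi) * DT
    psi += (G * math.tan(phi) / V) * DT
    return x, y, psi, phi
```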
IV. SWARMDP DESCRIPTION

Rather than using a Partially Observable Markov Decision Process (POMDP) as the model [28], the navigation problem is framed as a special case of the Decentralized POMDP, namely the swarMDP [29]. Through this design, the multi-agent setting can be modeled explicitly by assuming that all the agents have the same configuration.

Definition 1. The swarMDP problem is formulated in two steps. First, a prototype A = ⟨S, A, Z, π⟩, where A represents an instance of each UAV agent of the swarm and where:

• S represents the set of local states.
• A represents the set of local actions.
• Z represents the set of local observations.
• π : Z → A is the local policy.

A swarMDP is then defined as the tuple ⟨N, A, P, R, O, γ⟩, where:
• N is the set of indices corresponding to the agents. The index values range from 1 to N, where N is the total number of agents.
• A is the prototype of the agent as defined above.
• P : S × S × A → [0, 1] is the transition probability function, where P(s' | s, a) denotes the probability that an agent moves to state s' from state s after taking action a.
• R : S × A → ℝ is the reward function, where R(s, a) is the reward given to an agent when it takes action a in state s.
• O : S × Z × A → [0, 1] is the observation model, where O(z' | s, a) is the probability that an agent observes z' after playing action a in state s.
• γ ∈ [0, 1] is the discount factor for the total reward accumulated by the agent in the future.

A. State

The local state of each agent depends on the following factors:
• The fuel map B(s), which is updated every 2.5 s by the wildfire simulator.
• The positional coordinates (x_i, y_i).
• The heading angle ψ_i.
• The bank angle ϕ_i.

B. Actions

The set of local actions for each agent is A = ⟨decrease ϕ by 5 degrees, increase ϕ by 5 degrees⟩. The agent is required to choose one of them with a period of 0.1 s.

C. Observations

Each agent in the swarm gathers two kinds of observations: a feature vector representing information about the local state of the UAV and of the other UAVs, and an image giving a partial observation of the wildfire from the perspective of the UAV.

The feature vector associated with each agent i is the concatenation of four vectors related to the state variables of the UAVs: the set of all bank angles ϕ = {ϕ_j | j ∈ 1, 2, ..., N}; the distances to the other UAVs ρ_i = {ρ_ij | j ∈ 1, 2, ..., N ∧ j ≠ i}; the relative heading angles to the other UAVs, θ_i = {θ_ij | j ∈ 1, 2, ..., N ∧ j ≠ i}; and the relative heading angles of the other UAVs, ψ_i = {ψ_ij | j ∈ 1, 2, ..., N ∧ j ≠ i}.

The images are captured using the UAV's hemispherical optical camera, which has a view angle of 160 degrees, faces downwards, and can take pictures up to a maximum range of 500 m. After processing, the image is downsampled to a resolution of 30 × 40 pixels representing the burning trees and forest.

The image is captured after sampling the points of interest at horizontal angles of 40 degrees from the cardinal directions and at elevations from 30 degrees from nadir to 10 degrees below the horizon. This strategy results in a high quality image of the region right below the UAV. The characteristics of this image determine the degree of partial observability of the environment.
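A possible way to assemble the per-agent feature vector described above is sketched below. The `UAVState` container and the exact definitions of θ_ij (bearing to the other UAV relative to the own heading) and ψ_ij (relative heading) are assumptions made for illustration only.

```python
import math
from dataclasses import dataclass
from typing import List

@dataclass
class UAVState:
    x: float      # position (m)
    y: float      # position (m)
    psi: float    # heading angle (rad)
    phi: float    # bank angle (rad)

def feature_vector(i: int, swarm: List[UAVState]) -> List[float]:
    """Concatenate the four vectors (bank angles, distances, relative bearings,
    relative headings) for agent i."""
    me = swarm[i]
    banks = [u.phi for u in swarm]
    dists, rel_bearings, rel_headings = [], [], []
    for j, u in enumerate(swarm):
        if j == i:
            continue
        dists.append(math.hypot(u.x - me.x, u.y - me.y))                   # rho_ij
        rel_bearings.append(math.atan2(u.y - me.y, u.x - me.x) - me.psi)   # theta_ij (assumed)
        rel_headings.append(u.psi - me.psi)                                # psi_ij (assumed)
    return banks + dists + rel_bearings + rel_headings
```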
D. Rewards

The reward received by each agent i is the sum of four independent components:

r_1 = -\lambda_1 \min_{\{s \in S \,|\, F_t(s)\}} d_s   (5)

r_2 = -\lambda_2 \sum_{\{s \in S \,|\, d_s \le r_0\}} \big(1 - F_t(s)\big)   (6)

r_3 = -\lambda_3\, \phi_i^2   (7)

r_4 = -\lambda_4 \sum_{\{j \in 1,2,\dots,N \,\wedge\, j \ne i\}} \exp\!\left(-\frac{\rho_{ij}}{\epsilon}\right)   (8)

where d_s is the distance from the UAV to cell s of the map. The equations above assign negative rewards (penalties) for the following:

• Distance from fire (r_1): proportional to the distance to the closest burning cell.
• Safe cells nearby (r_2): proportional to the number of unburnt cells within a radius r_0.
• High bank angles (r_3): proportional to the square of the bank angle ϕ.
• Closeness to other UAVs (r_4): the sum of the contributions of each of the other UAVs, which saturates at the constant value λ_4.

The constant parameters used in the above equations are given the following values: λ_1 = 10, λ_2 = 1, λ_3 = 100, λ_4 = 1000, r_0 = 10, ϵ = 100.

The parameters are tuned to these values so that flying away from the burning region is penalized heavily and, once over the fire region, the negative rewards obtained from the bank angle and from getting close to other UAVs become dominant.
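Under the same assumptions as the earlier sketches (the hypothetical `UAVState` container, metres as the distance unit for r_0, and a zero fallback for r_1 when no cell is burning), the four penalty terms could be evaluated as follows; this is illustrative rather than the authors' code.

```python
import numpy as np

# Constants from the paper: lambda1, lambda2, lambda3, lambda4, r0, epsilon.
L1, L2, L3, L4, R0, EPS = 10.0, 1.0, 100.0, 1000.0, 10.0, 100.0

def reward(i, swarm, fire_map, cell_centers):
    """Sum of the penalty terms (5)-(8) for agent i.

    fire_map: (100, 100) {0, 1} array; cell_centers: (100, 100, 2) array of
    cell-center coordinates in the same units as the UAV positions.
    """
    me = swarm[i]
    d = np.hypot(cell_centers[..., 0] - me.x, cell_centers[..., 1] - me.y)
    burning = fire_map == 1

    r1 = -L1 * d[burning].min() if burning.any() else 0.0        # Eq. (5)
    r2 = -L2 * np.sum((d <= R0) & ~burning)                      # Eq. (6): unburnt cells near the UAV
    r3 = -L3 * me.phi ** 2                                       # Eq. (7): bank-angle penalty
    r4 = -L4 * sum(np.exp(-np.hypot(u.x - me.x, u.y - me.y) / EPS)
                   for j, u in enumerate(swarm) if j != i)       # Eq. (8): proximity penalty
    return r1 + r2 + r3 + r4
```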
V. METHODOLOGY

A. Deep Q-Networks

The objective of the presented approach is to find a policy π that optimally maps observations to actions. The degree of optimality depends on the reward function R defined in the previous section. From the optimal policy π, the value Q(s, a) can be determined for each state-action pair. The value function Q(s, a) represents the expected cumulative reward obtained by an agent that takes action a in state s. The value of the Q function can be determined using the following Bellman equation:

Q(s, a) = r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) \max_{a' \in A} Q(s', a')   (9)

For a given Q value function, the optimal policy can be deduced as:

\pi(s) = \arg\max_{a \in A} Q(s, a)   (10)
Using DQN [30], [31], the Q-value can be estimated by training a deep neural network that approximates this function. The following loss function is used for training the network:

L_i(\theta_i) = \mathbb{E}_{s,a,r,s'}\big[e^2\big]   (11)

where e is the Bellman error:

e = r(s, a) + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)   (12)

During training, the agent has to manage the trade-off between exploring the state-action space using a random policy and exploiting the knowledge already obtained through the estimate of the optimal policy in previous states. Typically, this problem is tackled by leveraging an exploratory policy in the first few training iterations and then gradually increasing the fraction of actions chosen by the exploitation policy.
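This ε-greedy trade-off can be written down in a few lines; the function name below is illustrative rather than taken from the paper.

```python
import random

def epsilon_greedy_action(q_values, epsilon: float) -> int:
    """With probability epsilon take a random (exploratory) action; otherwise
    exploit the current Q estimate over the two bank-angle actions."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```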
Deep Q-Networks (DQN) circumvent the inherent instability of deep reinforcement learning through two mechanisms:
• Randomizing the training samples using the experience replay method, which reduces the correlation effects of the sequence. This is implemented using a replay memory E.
• Training a target network (with parameters θ⁻) at the same time, but at a lower update rate than the first network (with parameters θ), to predict Q(s', a'), so that the first network learns against a fixed target.

The complete algorithm of the implemented DQN based wildfire monitoring task is presented in Algorithm 1.
Algorithm 1 Deep Q-Networks (DQN)
Input: Parameters: ε, L, N
Output: π, the approximate optimal policy
1: Initialize replay memory E to capacity L
2: Initialize θ = θ⁻ randomly
3: for each episode do
4:   Initialize s_0
5:   for each step t in the episode do
6:     Take action a_t ∼ π_ε(· | s_t) and observe r_t, s_{t+1}
7:     Update replay memory E with the current sample e_t = (s_t, a_t, s_{t+1}, r_t)
8:     Sample randomly a set N of N indices from E
9:     for each sampled experience e_t do
10:      Obtain B_t
11:      Update: θ ← Adam((B_t − Q(s_t, a_t; θ))²)
12:    Update: θ⁻ ← θ
13: for all s ∈ S do
14:   π(s) ← arg max_{a'∈A} Predict((s, a'), θ⁻)
15: return π, θ

where B_t = r_t + γ max_{a'} Q(s_{t+1}, a'; θ⁻) for non-terminal s_{t+1} and B_t = r_t for terminal s_{t+1}.
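The inner update of Algorithm 1, i.e. one gradient step on the loss (11) with targets B_t computed from the target network, is sketched below in PyTorch. It is an illustrative interpretation, not the authors' implementation: `replay` is assumed to hold (s, a, r, s_next, done) tensor tuples, and the observation is treated as a single tensor for brevity even though a two-branch network (next subsection) would take both a feature vector and an image.

```python
import random
import torch
import torch.nn.functional as nnf

GAMMA = 0.95   # discount factor; an assumed value, not stated in the paper

def dqn_update(q_net, target_net, optimizer, replay, batch_size=2000):
    """One gradient step on loss (11), with targets B_t from the target network."""
    batch = random.sample(replay, batch_size)
    s, a, r, s_next, done = map(torch.stack, zip(*batch))   # a: long action indices, done: 0/1 floats

    with torch.no_grad():
        # B_t = r_t + gamma * max_a' Q(s_{t+1}, a'; theta-) for non-terminal s_{t+1}.
        target = r + GAMMA * target_net(s_next).max(dim=1).values * (1.0 - done)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s_t, a_t; theta)
    loss = nnf.mse_loss(q_sa, target)                        # E[e^2], Eq. (11)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```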
B. Network Architecture

In the implemented method, the deep Q-networks deal with the two types of observations separately. The feature vector (Section IV-C) goes through a series of dense hidden layers, each consisting of 100 neurons with the ReLU activation function. The monocular image carrying partial information about the environment is processed by a series of convolutional and max-pooling layers, which downsample the image before feeding it to dense hidden layers of neurons. The outputs obtained from both branches are concatenated and processed by two additional dense layers, which finally estimate the Q-values for the agent. The complete network architecture is shown in Fig. 2.

Fig. 2: Network Architecture
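A two-branch network of this shape could look as follows in PyTorch. Only the 100-neuron dense layers, the ReLU activations, the 30×40 image size, and the two output actions come from the paper; the number of convolutional filters, kernel sizes, and pooling layout are assumptions.

```python
import torch
import torch.nn as nn

class SwarmQNet(nn.Module):
    """Illustrative two-branch Q-network: dense branch for the feature vector,
    convolution/max-pooling branch for the 30x40 partial-observation image."""

    def __init__(self, feat_dim: int, n_actions: int = 2):
        super().__init__()
        self.dense_branch = nn.Sequential(
            nn.Linear(feat_dim, 100), nn.ReLU(),
            nn.Linear(100, 100), nn.ReLU(),
        )
        self.conv_branch = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                            # 30x40 -> 15x20
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                            # 15x20 -> 7x10
            nn.Flatten(),
            nn.Linear(32 * 7 * 10, 100), nn.ReLU(),
        )
        self.head = nn.Sequential(                      # two additional dense layers -> Q-values
            nn.Linear(200, 100), nn.ReLU(),
            nn.Linear(100, n_actions),
        )

    def forward(self, features: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.dense_branch(features), self.conv_branch(image)], dim=1)
        return self.head(z)
```

A forward pass takes a (batch, feat_dim) feature tensor and a (batch, 1, 30, 40) image tensor and returns one Q-value per bank-angle action.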
C. Reward Schemes

During the training of the deep Q-networks, two reward approaches, which differ in their level of decentralization, are compared. Once the network is trained, both models can be used for decentralized control.
1) Independent rewards: Each agent obtains its own reward separately from the environment.
2) Shared reward: The reward is calculated as the mean of the independent rewards and is shared by all the agents.
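The two feedback schemes amount to the following small sketch (the function name is illustrative):

```python
from typing import List

def assign_rewards(independent: List[float], shared: bool) -> List[float]:
    """Independent scheme: each agent keeps its own reward.
    Shared (team) scheme: every agent receives the mean of all agents' rewards."""
    if not shared:
        return list(independent)
    team = sum(independent) / len(independent)
    return [team] * len(independent)
```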
VI. EXPERIMENTAL RESULTS

A. Simulation Setup

Both DQN based fire detection models, trained with the two different reward configurations, are tested in simulated environments. Training of the network is performed every 100 steps, after the collection of 100 new samples. In every training iteration, a batch of 2000 samples is used, taken randomly from a buffer of the last 100,000 examples. The target network is updated after every 10 value updates, that is, once every 1000 steps. In order to prevent the network from overfitting to the first sequence of samples, learning starts only after the first 2000 steps. The fraction of exploratory actions starts at 1, decreases linearly during the first 70 percent of the training time, and then stays at the constant value of 0.1 until the training ends. The simulation experiments have a duration of 10,000 steps (20 minutes). Adam [32] is used as the stochastic optimizer of the network, and the learning rate is set to 5 × 10⁻⁹.
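The exploration schedule just described reduces to a simple piecewise-linear function; a sketch with an illustrative name:

```python
def exploration_fraction(step: int, total_steps: int = 10_000,
                         eps_start: float = 1.0, eps_final: float = 0.1,
                         decay_portion: float = 0.7) -> float:
    """Fraction of exploratory actions at a training step: linear decay from 1.0
    to 0.1 over the first 70% of training, then constant at 0.1."""
    decay_steps = int(decay_portion * total_steps)
    if step >= decay_steps:
        return eps_final
    return eps_start + (eps_final - eps_start) * (step / decay_steps)
```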
Fig. 3: An example of an episode in wildfires with two aircraft trained with DQN, team reward scheme (snapshots at t = 100 s, 200 s, 300 s, and 400 s).

B. Simulation Results

The main performance metric for the proposed approach is the accumulated reward over the simulated episodes, determined as the sum of the instantaneous rewards earned by all agents over an episode. Fig. 3 shows a simulated episode demonstrating the behavior learnt by the agents. As the wildfire propagates through the region, the UAVs learn to head towards the burning areas and start making wider circles. The fourth component (r_4) of the reward function, that is, the penalty for collisions between aircraft, is effective in making the agents circle the burning region in the same direction and at opposite ends of the fire while staying far away from each other. When the region of interest (ROI) is too small, the penalty arising from the proximity between the UAVs outweighs the others, and one of the UAVs stays in place making small circles. Furthermore, as demonstrated in Fig. 4, with both reward configurations the agents learn the monitoring task effectively by minimizing their penalties on average. Similar performance is expected with both reward schemes for this problem, since three of the four reward components (r_1, r_2, r_3) are independent of the behavior learnt by the other UAV agents; that is, only the optimization of the relative positions of a UAV with respect to the others can affect the margin of improvement.

VII. CONCLUSION

In this paper, a DQN based approach is proposed to provide decentralized control to multiple UAVs for monitoring the spread of wildfires. The deep Q-networks take a processed monocular image consisting of a partial observation of the surroundings and the local state vector of the UAV as input, and generate the control actions for the agents. The proposed method is tested with two reward configurations, which differ in their degree of federating the multiple UAV agents. Simulation results show that the UAVs can effectively monitor the wildfire in a cooperative and coordinated manner with both types of reward schemes. Thus, the presented control design can be applied to real-time UAV aided surveillance of wildfire regions to minimize operational costs. The DQNs are trained on simulated fire maps of forests, but the model should perform well on all types of wildfires as long as enough data is available for training.

Fig. 4: Comparison of reward schemes at the time of training

REFERENCES

[1] A. M. Gill, S. L. Stephens, and G. J. Cary, "The worldwide 'wildfire' problem," Ecological Applications, vol. 23, no. 2, pp. 438–454, 2013.
[2] H. X. Pham, H. M. La, D. Feil-Seifer, and M. Deans, "A distributed control framework for a team of unmanned aerial vehicles for dynamic wildfire tracking," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 6648–6653.
[3] C. Yuan, Y. Zhang, and Z. Liu, "A survey on technologies for automatic forest fire monitoring, detection, and fighting using unmanned aerial vehicles and remote sensing techniques," Canadian Journal of Forest Research, vol. 45, no. 7, pp. 783–792, 2015.
[4] F. A. Hossain, Y. Zhang, and C. Yuan, "A survey on forest fire monitoring using unmanned aerial vehicles," in 2019 3rd International Symposium on Autonomous Systems (ISAS), 2019, pp. 484–489.
[5] N. K. Ure, S. Omidshafiei, B. T. Lopez, A. Agha-Mohammadi, J. P. How, and J. Vian, "Online heterogeneous multiagent learning under limited communication with applications to forest fire management," in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015, pp. 5181–5188.
[6] R. N. Haksar and M. Schwager, "Distributed deep reinforcement learning for fighting forest fires with a network of aerial robots," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 1067–1074.
[7] Z. Lin and H. H. Liu, "Topology-based distributed optimization for multi-UAV cooperative wildfire monitoring," Optimal Control Applications and Methods, vol. 39, no. 4, pp. 1530–1548, 2018.
[8] F. Afghah, A. Razi, J. Chakareski, and J. Ashdown, "Wildfire monitoring in remote areas using autonomous unmanned aerial vehicles," in IEEE INFOCOM 2019 - IEEE Conference on Computer Communications Workshops (INFOCOM WORKSHOPS), 2019, pp. 835–840.
[9] A. Sargolzaei, A. Abbaspour, and C. D. Crane, Control of Cooperative Unmanned Aerial Vehicles: Review of Applications, Challenges, and Algorithms. Cham: Springer International Publishing, 2020, pp. 229–255.
[10] S. Chung, A. A. Paranjape, P. Dames, S. Shen, and V. Kumar, "A survey on aerial swarm robotics," IEEE Transactions on Robotics, vol. 34, no. 4, pp. 837–855, 2018.
[11] K. D. Julian and M. J. Kochenderfer, "Distributed wildfire surveillance with autonomous aircraft using deep reinforcement learning," Journal of Guidance, Control, and Dynamics, pp. 1–11, 2019.
[12] C. Boutilier, "Planning, learning and coordination in multiagent decision processes," in Proceedings of the 6th Conference on Theoretical Aspects of Rationality and Knowledge. Morgan Kaufmann Publishers Inc., 1996, pp. 195–210.
[13] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente, "Multiagent cooperation and competition with deep reinforcement learning," PLoS ONE, vol. 12, no. 4, p. e0172395, 2017.
[14] M. Hüttenrauch, A. Šošić, and G. Neumann, "Deep reinforcement learning for swarm systems," Journal of Machine Learning Research, vol. 20, no. 54, pp. 1–31, 2019.
[15] L. Tang and G. Shao, "Drone remote sensing for forestry research and practices," Journal of Forestry Research, vol. 26, no. 4, pp. 791–797, 2015.
[16] P. G. Martin, S. Kwong, N. Smith, Y. Yamashiki, O. D. Payton, F. Russell Pavier, J. S. Fardoulis, D. Richards, and T. B. Scott, "3D unmanned aerial vehicle radiation mapping for assessing contaminant distribution and mobility," International Journal of Applied Earth Observation and Geoinformation, vol. 52, pp. 12–19, 2016.
[17] L. Hua and G. Shao, "The progress of operational forest fire monitoring with infrared remote sensing," Journal of Forestry Research, vol. 28, no. 2, pp. 215–229, 2017.
[18] R. S. Allison, J. M. Johnston, G. Craig, and S. Jennings, "Airborne optical and thermal remote sensing for wildfire detection and monitoring," Sensors, vol. 16, no. 8, p. 1310, 2016.
[19] T. J. Zajkowski, M. B. Dickinson, J. K. Hiers, W. Holley, B. W. Williams, A. Paxton, O. Martinez, and G. W. Walker, "Evaluation and use of remotely piloted aircraft systems for operations and research - RxCADRE 2012," International Journal of Wildland Fire, vol. 25, no. 1, pp. 114–128, 2016.
[20] D. W. Casbeer, R. W. Beard, T. W. McLain, S.-M. Li, and R. K. Mehra, "Forest fire monitoring with multiple small UAVs," in Proceedings of the 2005 American Control Conference. IEEE, 2005, pp. 3530–3535.
[21] P. Sujit, D. Kingston, and R. Beard, "Cooperative forest fire monitoring using multiple UAVs," in 2007 46th IEEE Conference on Decision and Control. IEEE, 2007, pp. 4875–4880.
[22] M. Kumar, K. Cohen, and B. HomChaudhuri, "Cooperative control of multiple uninhabited aerial vehicles for monitoring and fighting wildfires," Journal of Aerospace Computing, Information, and Communication, vol. 8, no. 1, pp. 1–16, 2011.
[23] N. K. Ure, S. Omidshafiei, B. T. Lopez, A. Agha-Mohammadi, J. P. How, and J. Vian, "Online heterogeneous multiagent learning under limited communication with applications to forest fire management," in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015, pp. 5181–5188.
[24] L. Merino, F. Caballero, J. R. Martínez-de Dios, J. Ferruz, and A. Ollero, "A cooperative perception system for multiple UAVs: Application to automatic detection of forest fires," Journal of Field Robotics, vol. 23, no. 3-4, pp. 165–184, 2006.
[25] H. X. Pham, H. M. La, D. Feil-Seifer, and A. Nefian, "Cooperative and distributed reinforcement learning of drones for field coverage," arXiv preprint arXiv:1803.07250, 2018.
[26] Y. Xiao, J. Hoffman, T. Xia, and C. Amato, "Multi-robot deep reinforcement learning with macro-actions," arXiv preprint arXiv:1909.08776, 2019.
[27] A. Viseras and R. Garcia, "DeepIG: Multi-robot information gathering with deep reinforcement learning," IEEE Robotics and Automation Letters, vol. 4, no. 3, pp. 3059–3066, 2019.
[28] K. D. Julian and M. J. Kochenderfer, "Distributed wildfire surveillance with autonomous aircraft using deep reinforcement learning," Journal of Guidance, Control, and Dynamics, pp. 1–11, 2019.
[29] A. Šošić, W. R. KhudaBukhsh, A. M. Zoubir, and H. Koeppl, "Inverse reinforcement learning in swarm systems," in Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2017, pp. 1413–1421.
[30] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[31] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[32] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
