Timely Data Collection for UAV-Based IoT Networks: A Deep Reinforcement Learning Approach
This article has been accepted for publication in IEEE Sensors Journal. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/JSEN.2023.3265935
Abstract— In some real-time Internet of Things (IoT) applications, the timeliness of sensor data is critical to system performance. How an unmanned aerial vehicle (UAV) should collect the data of sensor nodes in a specified area, where different nodes have different timeliness priorities, is a problem to be solved. To collect the data efficiently, a guided search deep reinforcement learning (GSDRL) algorithm is presented to help a UAV with different initial positions independently complete the task of data collection and forwarding. First, the data collection process is modeled as a sequential decision problem for minimizing the average age of information or maximizing the number of collected nodes according to the specific environment. Then, the data collection strategy is optimized by the GSDRL algorithm. After training the network using the GSDRL algorithm, the UAV has the ability to perform autonomous navigation and decision-making to complete the complex task more efficiently and rapidly. Simulation experiments show that the GSDRL algorithm has strong adaptability to adverse environments and obtains a good strategy for UAV data collection and forwarding.
Index Terms— Data collection, UAV trajectory optimization, age of information, deep reinforcement learning.
© 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
2 IEEE SENSORS JOURNAL, VOL. XX, NO. XX, XXXX 2023
HU et al.: TIMELY DATA COLLECTION FOR UAV-BASED IOT NETWORKS: A DEEP REINFORCEMENT LEARNING APPROACH 3
1) The analysis of channel model: The path losses of the line-of-sight (LoS) and non-line-of-sight (NLoS) signal models are, respectively [27],

PL_L = 20 log(f_c) + 20 log(4π/v_c) + 20 log(d_UD) + κ_LoS,  (2)

and

PL_N = 20 log(f_c) + 20 log(4π/v_c) + 20 log(d_UD) + κ_NLoS,  (3)

where f_c is the carrier frequency, v_c is the speed of light, d_UD is the distance between the UAV and the DC, κ_LoS is the LoS model's path loss, and κ_NLoS is the NLoS model's path loss. The probability of the LoS model is given by

P_L = 1 / ( 1 + z_0 exp( −z_1 ( (180/π) sin⁻¹(H_0/d_UD) − z_0 ) ) ),  (4)

where H_0 is the altitude of the UAV above the ground, and z_0 and z_1 are parameters related to the specific environment [28]. Thus, the average path loss (dB) between the UAV and the DC can be written as

PL_A^{U→D} = P_L · PL_L + (1 − P_L) · PL_N.  (5)

The received power (dBW) of the DC can be obtained as follows:

P_R^D = 10 lg P_T^U − PL_A^{U→D},  (6)

where P_T^U is the transmission power of the UAV. As the UAV has good maneuverability, the communication distance between the UAV and a SN is relatively short, and only the LoS model is considered between the UAV and the SN. We assume d_SU to be the distance from the SN to the UAV; the path loss PL_A^{S→U} can be obtained by replacing d_UD with d_SU in (2). Thus, we can obtain the UAV receiving power (dBW) when it is in the collection state as

P_R^U = 10 lg P_T^SN − PL_A^{S→U},  (7)

where P_T^SN is the transmission power of a SN. In addition, we also assume that the channel gain h follows the Rayleigh distribution; then the probability density function (PDF) and the cumulative distribution function (CDF) of x = |h|² are, respectively,

f(x) = (1/P_r) exp(−x/P_r),  (8)

and

F(x) = 1 − exp(−x/P_r),  (9)

where P_r is the average received power of a signal. The collection and forwarding rates of the UAV at time t are, respectively,

R_C(t) = B_0 log₂( 1 + P_T^SN |h(t)|² / P_σ² ),  (10)

and

R_D(t) = B_0 log₂( 1 + P_T^U |h(t)|² / P_σ² ),  (11)

where B_0 is the signal bandwidth and P_σ² is the noise power. From the Appendix, we can obtain the average rate E(R_C) as

E(R_C) = −(B_0 / ln 2) e^{P_σ² / (P_T^SN P_R^U)} Ei( −P_σ² / (P_T^SN P_R^U) ),  (12)

where Ei(·) denotes the one-argument exponential integral function. Similarly, combining (6), (11) and (35), we can also obtain the average rate E(R_D) as

E(R_D) = −(B_0 / ln 2) e^{P_σ² / (P_T^U P_R^D)} Ei( −P_σ² / (P_T^U P_R^D) ).  (13)

2) UAV work mode: Given a random initial position, the UAV will fly to a SN to collect its data. When the data collection process of the SN is over, the UAV needs to move to another SN to continue collecting data, or to forward the data to the DC. The UAV will repeat the above process until the data of all SNs are collected and forwarded successfully, or the UAV is forced to finish its flight due to energy shortage. To simplify the analysis of the model, this paper only considers two modes of the UAV, namely the flight mode and the hovering mode. The hovering mode includes the data collection and data forwarding states.

(1) Flight Mode

When the UAV is in flight mode, the power consumed by the UAV is [29]

P_f(V) = (δ/8) ρsAΩ³R³ ( 1 + 3V²/U_tip² ) + (1 + k) ( W^{3/2} / √(2ρA) ) ( √(1 + V⁴/(4v_0⁴)) − V²/(2v_0²) )^{1/2} + 0.5 d_0 ρsAV³,  (14)

where V is the flying speed of the UAV, ρ is the air density, and the other parameters² are mostly related to the UAV; for example, R and W are the rotor radius and the aircraft weight of the UAV, respectively.

² They can be referred to Table I in [29].

(2) Hovering Mode

When the UAV collects the data of a SN, the power consumed by the UAV is [29]

P_h1 ≈ (δ/8) ρsAΩ³R³ + (1 + k) W^{3/2} / √(2ρA).  (15)

If the UAV forwards the data to the DC, it not only needs to overcome its own gravity, wind and other factors, it also consumes certain power to transmit data, so the power consumed by the UAV is

P_h2 = P_h1 + P_c,  (16)

where P_c is the transmission power.

B. Problem Formulation

A UAV can select different SNs for data collection according to the specific environment, and forward the data at an appropriate time. In other words, it can dynamically select several nodes for data collection, and then forward their data together. To illustrate the process of data collection and forwarding more clearly, we divide it into several groups and
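The LoS/NLoS mixture in (2), (4) and (5) can be sketched numerically as follows. This is a minimal illustration, not the paper's code; the function names are ours, and the units follow the text (path loss in dB, frequency in Hz, distances in m).

```python
import math

def path_loss_db(f_c, d, kappa, v_c=3e8):
    # Per (2)/(3): 20log10(f_c) + 20log10(4*pi/v_c) + 20log10(d) + kappa.
    return (20 * math.log10(f_c) + 20 * math.log10(4 * math.pi / v_c)
            + 20 * math.log10(d) + kappa)

def los_probability(h0, d_ud, z0, z1):
    # Per (4): the elevation angle (degrees) of the UAV, seen from the DC,
    # drives the LoS probability through a logistic-style function.
    angle_deg = math.degrees(math.asin(h0 / d_ud))
    return 1.0 / (1.0 + z0 * math.exp(-z1 * (angle_deg - z0)))

def average_path_loss(f_c, d_ud, h0, z0, z1, kappa_los, kappa_nlos):
    # Per (5): average the LoS and NLoS losses, weighted by the LoS probability.
    p_l = los_probability(h0, d_ud, z0, z1)
    return (p_l * path_loss_db(f_c, d_ud, kappa_los)
            + (1 - p_l) * path_loss_db(f_c, d_ud, kappa_nlos))
```

Since 0 < P_L < 1, the average loss always lies strictly between the pure LoS and pure NLoS values.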
put the SNs whose data are forwarded to the DC together into the same group.³

There are m_0 SNs scattered randomly in a certain area, where the UAV needs to autonomously formulate its trajectory and make decisions by itself to collect or forward the data. We assume that, given a strategy π, the UAV needs N_0 offloading operations to complete the task. These SNs are divided into N_0 groups, i.e., Φ = {G_1, G_2, ..., G_N0}. A group G_i = {SN_π(1),i, SN_π(2),i, ..., SN_π(n_i),i}⁴ consists of n_i (1 ≤ n_i ≤ m_0) SNs, where SN_π(j),i, 1 ≤ j ≤ n_i, denotes the j-th collected node in the i-th group. During the process of performing tasks, the UAV needs to choose some locations at which to execute two different actions, namely collecting or forwarding data. For the m_0 SNs, there is a position set Ω = {d_0, D_1, D_2, ..., D_m0, d_DC}, where d_0 and d_DC are the start and end points of the UAV, respectively. The parameter D_i = {d_i^(1), d_i^(2)}, 1 ≤ i ≤ m_0, is the position set for SN_i, where d_i^(1) and d_i^(2) are the locations of the UAV when it collects data from SN_i and forwards SN_i's data to the DC, respectively. The UAV will forward the data only after the data of all nodes in the same group have been collected. When SN_i is not the last one to be collected in its group, we set d_i^(2) = ∅.

³ In fact, the SNs do not need to be grouped before the UAV departs. According to the specific conditions, the UAV will make autonomous decisions to complete the task, including where to collect the data, where and when to forward the data, which SNs are forwarded together, which one should be picked up first, and so on. This paper just takes the grouping method to describe the data collection and forwarding process for simplification of the analysis.
⁴ The subscript π(·) is the ordered index number of SNs.

As shown in Fig. 3, we take the SNs in the k-th group as an example to briefly explain the data collecting and forwarding process of the UAV. We assume that there are n_k (1 ≤ n_k ≤ m_0) SNs in G_k to be collected and forwarded, i.e., G_k = {SN_π(1),k, SN_π(2),k, ..., SN_π(n_k),k}. At τ_{k−1}, the UAV starts to fly from d_l to d_π(1),k^(1). The parameter d_l is the UAV position where it forwards the data of the SNs in G_{k−1}, i.e., d_l = d_π(n_{k−1}),k−1^(2). It takes T_M(d_l → d_π(1),k^(1)) seconds for the UAV to transfer from d_l to d_π(1),k^(1), after which it collects the data of SN_π(1),k. The UAV spends T_C(SN_π(1),k, Ω) = l_π(1),k / R_C seconds to collect the data of SN_π(1),k, where R_C is the collection rate.

Next, the UAV moves to d_π(2),k^(1) to collect the data of SN_π(2),k. For the Type II nodes, as their data have been generated and stored in their memory, the AoI of the node SN_π(2),k includes the time of data storage in the SN. The UAV collects the data from the remaining nodes in G_k in the same way. The parameter T_S^uav(SN_π(1),k, Ω) is the storage time of the data of SN_π(1),k in the UAV, and T_U(SN_π(1),k, Ω) = l_π(1),k / R_D is the time required to forward the data of the node to the DC, where R_D is the forwarding rate. When the data of SN_π(1),k is collected, the UAV will fly from d_π(1),k^(1) to d_π(2),k^(1) and spend T_C(SN_π(2),k, Ω) seconds to collect the data of SN_π(2),k. Similarly, the UAV continues collecting until the data of all the SNs in G_k are collected. Finally, it flies from d_π(n_k),k^(1) to d_π(n_k),k^(2) and begins to forward the data.

From Fig. 2 and Fig. 3, we can also obtain the AoI of a Type II node⁵ as

A_SN_π(j),k(Ω) = T_S^sn(SN_π(j),k, Ω) + T_C(SN_π(j),k, Ω) + T_S^uav(SN_π(j),k, Ω) + T_U(SN_π(j),k, Ω),  (17)

where T_S^sn(SN_π(j),k, Ω) is the data storage time in SN_π(j),k, which starts timing from the moment the UAV takes off. In other words, T_S^sn(SN_π(j),k, Ω) = 0 for a Type I node.

⁵ The AoI of the Type I nodes is set as T_0 when the nodes have been collected but cannot be forwarded to the DC. The AoI of the Type II nodes that have not been forwarded is also set as T_0.

It can be observed from Fig. 3 that the UAV's flight trajectory is d_l → d_π(1),k^(1) → d_π(2),k^(1) → ··· → d_π(n_k),k^(1) → d_π(n_k),k^(2). After the data of G_{k−1} is forwarded, the UAV will collect the data of G_k. Thus, the total energy E_s consumed by the UAV is

E_s = P_f(V) T_b + Σ_{k=1}^{N_0} [ P_f(V) T_M(t_k, Ω) + Σ_{j=1}^{n_k} P_h1 T_C(SN_π(j),k, Ω) G_SN_π(j),k(Ω) + Σ_{j=1}^{n_k} P_h2 T_U(SN_π(j),k, Ω) G_SN_π(j),k(Ω) ],  (18)

where T_b is the time required for the UAV to fly back to the DC. The parameter T_M(t_k, Ω) is the flight time required for the UAV to follow trajectory t_k in G_k. The parameters T_C(SN_π(j),k, Ω) and T_U(SN_π(j),k, Ω) are the times for the UAV to collect and forward the data of SN_π(j),k, respectively. The parameters P_f(V), P_h1 and P_h2 are the powers consumed by the UAV in the different states; they have been discussed in (14), (15) and (16).

Given the limited energy of a UAV with a random initial position, the UAV is dispatched to collect and forward the data of the SNs in the specified area. However, owing to the different priorities of the nodes, the UAV will obtain two different optimization strategies due to the limited residual energy and the AoI requirement of the Type II nodes in Ψ_h. When the UAV is close to the SNs, it has the ability to complete the tasks of all the nodes in Ψ_h and Ψ_o simultaneously; then N = m_0, so we aim to minimize the average AoI and reduce the energy consumption by optimizing the UAV data collection strategy. Besides, when the UAV is at the boundary of the area or far away from the SNs, it may fail to collect the data of all the nodes. In this case, the UAV has to give priority to completing the tasks of the high-priority nodes in Ψ_h and tries its best to collect as many of the ordinary nodes in Ψ_o as possible; then N < m_0. Thus, according to the environment, the objective function can be written as

if N = m_0:
    min_Ω E_c = P_f T_f(Ω) + P_c T_c(Ω) + P_u T_u(Ω),
    min_Ω (1/N) Σ_{i=1}^{N} A_SN_i(Ω);
if N < m_0:
    max_Ω N = |Ψ_h| + Σ_{i=1}^{|Ψ_o|} G_SN_i(Ω),
    min_Ω [ Σ_{i=1}^{N} A_SN_i(Ω) ] / [ |Ψ_h| + Σ_{i=1}^{|Ψ_o|} G_SN_i(Ω) ],
(19)
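The AoI decomposition in (17) is a plain sum of four delays. A minimal sketch (the function names are ours, not the paper's; times are in seconds):

```python
def aoi_type2(t_store_sn, t_collect, t_store_uav, t_forward):
    # Eq. (17): AoI of a Type II node = storage time at the SN (timed from
    # UAV take-off) + collection time T_C + storage time on the UAV T_S^uav
    # + forwarding time T_U.
    return t_store_sn + t_collect + t_store_uav + t_forward

def aoi_type1(t_collect, t_store_uav, t_forward):
    # For a Type I node the SN-side storage term is zero, per the text
    # following (17).
    return aoi_type2(0.0, t_collect, t_store_uav, t_forward)
```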
[Fig. 3: Timeline of the data collection and forwarding process in G_k, showing T_M, T_S^sn, T_C, T_S^uav and T_U for Type I and Type II nodes.]

[Fig. 4: Framework of UAV data collection and forwarding based on the DRL algorithm: the agent (prediction network, target network, replay experience pool) interacts with the environment through (s, a, s′, r).]

Subject to:

C1: A_SN_i(Ω) = T_S^sn(SN_i, Ω) + T_C(SN_i, Ω) + T_S^uav(SN_i, Ω) + T_U(SN_i, Ω), SN_i ∈ {Ψ_h, Ψ_o},
C2: 0 ≤ A_SN_j(Ω) ≤ T_SN_j^th, SN_j ∈ Ψ_h,
C3: E_s ≤ E_u,
C4: ‖d_i^(1) − d_SN_i‖ ≤ r_SN_i, ‖d_i^(2) − d_DC‖ ≤ r_DC, 1 ≤ i ≤ m_0,
C5: G_SN_i(Ω) ∈ {0, 1},
(20)

where |Ψ_h| and |Ψ_o| are the numbers of SNs in Ψ_h and Ψ_o, respectively. The parameters T_f(Ω) and T_c(Ω) are the total flight time and data collecting time of the UAV, respectively. The parameter A_SN_i(Ω) is the AoI of SN_i in (C1). Different from the nodes in Ψ_o, the nodes in Ψ_h have stricter requirements for information timeliness; the AoI of the nodes in Ψ_h must be less than or equal to T_SN_j^th in (C2). The constraint (C3) means that the total energy E_s consumed by the UAV should not exceed the maximum capacity E_u of the battery. Besides, the constraints (C4) set the communication ranges between the UAV and the DC, and between SN_i and the UAV, where d_SN_i and d_DC are the coordinates of SN_i and the DC, respectively. We also define an indicator function G_SN_i(Ω) in (C5). After the data of SN_i is collected and forwarded successfully, we set G_SN_i(Ω) = 1; otherwise, G_SN_i(Ω) = 0.

We assume that m_0 SNs are randomly scattered in a rectangle with an area of L × L. The region is partitioned into grids, and the area of each grid is d × d. It is assumed that the UAV only stays above the center of a grid when it collects data from the sensors. For a sensor SN_i, the locations of the UAV d_i^(1) and d_i^(2) each have 2(L/d)² possibilities. Thus, the task complexity of the UAV is O(2^{l_0}(L/d)^{2l_0}), where l_0 = m_0 + N_0 is the number of actions the UAV takes. To achieve the timeliness of the Type II nodes, the number of groups N_0 is also a random number due to the autonomous decision-making of the UAV according to the environment. Therefore, it is difficult to solve this problem using traditional trajectory optimization methods. The data collection and forwarding process of the UAV can be modeled as a sequential decision problem. Then, a guided search method based DRL algorithm is introduced to solve it. In the next section, we focus on the implementation of the proposed scheme.

III. THE GSDRL ALGORITHM

This section first introduces the basic principle of the DRL algorithm and the process by which the network parameters are updated. Then, a guided fast search algorithm is proposed to improve the convergence speed.

A. DRL Algorithm

In the DRL algorithm, an agent continually interacts with the environment in a trial-and-error way, and obtains the optimal strategy by maximizing the accumulated rewards. The framework of UAV data collection and forwarding based on the DRL algorithm is shown in Fig. 4. The basic framework of DRL algorithms mainly consists of two parts: an agent (the UAV) and the environment. The agent includes a prediction network, a target network, an optimizer, and a replay experience pool. The agent interacts with the environment through three parameters: state S, action A, and reward R. For the application scenario of UAV data collection and forwarding in this paper, we define them as follows.

1) State space S: The parameter S = {s} represents the state space of the UAV, where s = {B, C, D, Y, e, η} includes the information about the UAV data collection, data forwarding, the AoI of a SN, the current position of the UAV, and the residual energy. The parameters B = {b_i} and C = {c_i} are the collection and forwarding states of the SNs, respectively, where b_i, c_i ∈ {0, 1}, 1 ≤ i ≤ m_0. When b_i = 1, the data of the i-th node has been collected; otherwise, the data of the node has not been collected. The variable c_i has a similar definition. The vector D represents the set of the average AoI of all the SNs. The dimensions of B, C and D are all m_0. The parameter Y denotes the current position of the UAV, and e is the remaining power of the UAV. An indicator variable η ∈ {−1, 1} describes the work mode of the UAV. For example, if η = 1, the UAV is in the data forwarding state; otherwise, the UAV is in the data collection state.

2) Action space A: The parameter A = {a} is the action set of the UAV, where a = {E, η} and E is the current coordinate of the UAV. According to the current environment, the UAV selects an appropriate location for data collection or forwarding. The agent selects an action using the greedy algorithm; both exploration and exploitation are adopted. When the number of iterations is small and the agent has limited information about the environment, it randomly chooses an action (exploration). With the increase of the number of interactions with the environment, the agent continues to accumulate experience and autonomously rectifies its behavior to obtain a larger reward; we call this approach exploitation. As the agent's accumulated experience increases during training, the probability of exploration becomes smaller, the probability of exploitation becomes larger, and the network converges faster; however, too little exploration yields a poor local solution in most cases. Conversely, with too much exploration the network parameters converge slowly, and the system needs a long time to find a good solution.
3) Reward R: The parameter R denotes the reward of the agent when it interacts with the environment. During the learning process, we need a reward to evaluate the action the agent selected. Different actions will usually get different rewards, which stimulates the agent to choose a better action. The reward function plays an important role in the convergence speed of the algorithm, and also directly affects the optimal solution. When the UAV is close to the SNs to collect data, or close to the DC to forward data, the agent gets a positive reward. Under the same conditions, if an action chosen by the agent makes the average AoI or the consumed energy smaller, the agent receives a larger positive reward. If it chooses an unreasonable action, such as one for which the data collection and forwarding conditions are not satisfied, it gets a negative reward. So we define the reward as follows:

R = { ϖ_1 χ_1 + ϖ_2 χ_2 + ϖ_3 χ_3,      flag = 1,
    { −(ϖ_1 χ_1 + ϖ_2 χ_2 + ϖ_3 χ_3),   flag = 0,
(21)

where ϖ_1, ϖ_2, and ϖ_3 are weight parameters. The functions χ_1, χ_2, and χ_3 are, respectively,

χ_1 = 1 − [ Σ_{j=1}^{|Ψ_h|} A_SN_j ] / [ T_0 Σ_{j=1}^{|Ψ_h|} G_SN_j ],  (22)

and

χ_2 = 1 − [ Σ_{j=1}^{|Ψ_o|+|Ψ_h|} A_SN_j ] / [ T_0 Σ_{j=1}^{|Ψ_o|+|Ψ_h|} G_SN_j ],  (23)

B. Network Parameters Update

To maximize the long-term cumulative benefit, the agent interacts with the environment to update the state-action value Q(s, a) by trial and error. The parameter update in the DRL algorithm is

Q(s, a) ← Q(s, a) + β ( R + λ max_{a′} Q(s′, a′) − Q(s, a) ),  (25)

where β ∈ (0, 1) is the learning rate and λ ∈ (0, 1) is the discount factor. The parameter s′ is the next state after the agent executes the action a. From (25), we can see that the larger the discount factor λ, the more significantly Q(s, a) is affected by future returns; on the contrary, it is only affected by the immediate reward. Based on the principle of DRL algorithms, the GSDRL algorithm uses a deep neural network Q(s, a, w) to approximate Q(s, a). To improve the performance stability of the system, the proposed scheme also has two networks, the target network and the current network. They have the same structure but different parameters. The state-action values from the target network and the current network are Q^t(s, a, w^t) and Q(s, a, w), where w^t and w are the parameters of the corresponding networks, respectively. Finally, a mini-batch stochastic gradient descent algorithm [30] is adopted to solve for w. We define the loss function as

L(w) = E[ (g_k − Q(s_k, a_k, w))² ] = (1/N_1) Σ_{k=1}^{N_1} (g_k − Q(s_k, a_k, w))²,  (26)

where

g_k = R_k + λ Q^t( s′_k, arg max_{a′} Q(s′_k, a′, w), w^t ).  (27)
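As a concrete illustration of the update rule in (25), here is a minimal tabular sketch. The paper uses a neural approximator Q(s, a, w); the dictionary table below is an illustrative stand-in, with our own function name and default hyperparameters.

```python
def q_update(q, s, a, r, s_next, actions, beta=0.1, lam=0.9):
    # Eq. (25): Q(s,a) <- Q(s,a) + beta * (R + lam * max_{a'} Q(s',a') - Q(s,a)),
    # where beta is the learning rate and lam the discount factor.
    # `q` is a dict keyed by (state, action); unseen pairs default to 0.
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + beta * (r + lam * best_next - old)
    return q[(s, a)]
```

The target g_k in (27) differs from the plain max in (25) only in that the action is chosen by the current network and evaluated by the target network (the Double-DQN style decoupling).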
Algorithm 1 The training process of the GSDRL algorithm.
1: Initialization: Randomly initialize all the parameters, including the current network and the target network, and clear the replay experience pool Θ.
2: for episode i = 1 to Imax do
3:   Randomly initialize the UAV's location, the state s and the energy e = Eu.
4:   for epoch j = 1 to Jmax do
5:     Step 1: If Pmax < ∆ × i × (j + 1), set Pε = Pmax; otherwise, set Pε = ∆ × i × (j + 1). The agent chooses an action a = arg max_a Q(s, a, w) with probability Pε; otherwise, it selects a random action a.
6:     Step 2: Rectify the location of the UAV with probability Pτ according to (30).
7:     Step 3: Execute a given the state s to get the next state s′ and the reward R.
8:     Step 4: Store {s, a, s′, R} in Θ and replace the current state s with s′.
9:     Step 5: Randomly select N1 samples from Θ, and calculate gj according to (31).
10:    Step 6: Build the loss function (1/N1) Σ_{j=1}^{N1} (gj − Q(sj, aj, w))² and update w using the stochastic gradient descent method.
11:    Step 7: If i mod ξ = 0, update the parameters of the target network with wt = w.
12:    Step 8: Terminate the episode if s′ is the terminal state.
13:  end for
14: end for

Algorithm 2 The testing process of the GSDRL algorithm.
1: for episode i = 1 to Dmax do
2:   Randomly initialize the UAV's location, the state s and the UAV energy e = Eu.
3:   while csn ≤ m0 and Pr > 0 do
4:     Choose an action a from the target network trained in Algorithm 1 according to s.
5:     Execute a and output the next state s′.
6:     if FGk == 1 then
7:       Count the number of the loaded SNs csn.
8:     end if
9:     Update the current state with s = s′ and the remaining power with Pr = Pr − Pi.
10:  end while
11: end for

By connecting the starting point (x0, y0) with the position of the target SN (xs, ys), the intersection point with the signal coverage boundary is obtained:

x1 = xs ± rSN √( (xs − x0)² / ((xs − x0)² + (ys − y0)²) ),
y1 = y0 + ((ys − y0) / (xs − x0)) (x1 − x0).
(30)

Equation (30) has two different solutions, and we only need to choose the one with the shortest distance from (x0, y0). As the area is meshed, we need to further quantize the solution in (30), so multiple quantized solutions will be obtained. We can randomly select a solution which satisfies the condition of data collection. A similar way can also be used to revise the action in the data forwarding state, and an optimal location for data forwarding can also be found by replacing rSN with rDC in (30). The training and testing processes of the GSDRL algorithm are presented in Algorithm 1 and Algorithm 2, respectively.

For the training process in Algorithm 1, the agent first clears the experience pool Θ and randomly initializes all the parameters, including the location of the UAV, the original state s and the total energy of the UAV e = Eu. Given s, the agent selects an action a to affect the environment. Then, it obtains the next state s′ and the corresponding reward R. Next, the tuple {s, a, s′, R} is transferred to the pool Θ. The current state is updated with s = s′. With an increase in the number of iterations, the parameters of the current network will gradually converge to local optimal solutions. To avoid this problem, the agent selects an action according to the current network with probability Pε, or chooses a random action with probability 1 − Pε to explore the unknown environment. When Pε ≥ Pmax, we set Pε = Pmax. Then, the agent rectifies the action with probability Pτ to help find a better location to execute the mission. With the increase of the number of interactions with the environment, the samples stored in the pool Θ increase constantly. After the pool Θ accumulates enough samples, the agent starts to take out N1 random samples from the pool Θ and trains the network. The parameter gj is given by

gj = { Rj,                                                    ς = 1,
     { Rj + λ Q^t( s′, arg max_{a′} Q(s′, a′, w), w^t ),      ς = 0,
(31)

where ς describes the work state of the agent. If ς = 1, the agent stops searching and the episode is over. Using the mini-batch stochastic gradient descent algorithm, the agent builds the loss function (1/N1) Σ_{j=1}^{N1} (gj − Q(sj, aj, w))² with N1 samples to update the parameter w. The current network parameters are updated after every interaction, while the parameter w^t of the target network is updated after every ξ interactions. This helps to enhance the stability of the network parameters. If j < Jmax, the process goes back to Step 1 and continues to collect or forward the data until the task is accomplished or the remaining energy of the UAV is exhausted.

The testing process is presented in Algorithm 2. The UAV initializes its position randomly and clears the data in the pool Θ. The energy is e = Eu. Then, according to its current state s, the trained network obtained from Algorithm 1 outputs the next action a to be performed. The UAV then executes the action a to collect or forward the data. For example, if FGk = 1, which denotes that the data of the SNs in Gk have all been collected, the UAV forwards the data to the DC and counts the number of loaded SNs. Next, it updates the current state s = s′ and the remaining energy Pr = Pr − Pi, where Pi is the energy consumed to perform the action a. Using this iterative procedure, the UAV can complete the data collection and forwarding of the remaining SNs autonomously.

IV. SIMULATION RESULTS

In this section, we demonstrate the effectiveness of the proposed scheme through simulation experiments. The experimental setup is a 64-bit Windows system, CPU Core i7-6700HQ, PyCharm 2020, Torch 1.13.1, numpy 1.24.1. The number of SNs is 8, as presented in Table II. In this paper, the positions of the SNs are fixed, where the Type I nodes are at [1, 2], [1.8, 2.5], [1, −2], [0.5, −1], [−2, −3], [−1, 1], [−2, 2], and the Type II nodes are at [1.5, 1.8], [−1, −1.5]. The location of the DC is [−2, 2]. The GSDRL algorithm achieves the best performance when compared with the fixed data collection strategy, Deep Q-Network (DQN) [11] and Double DQN (DDQN) [31]. In this paper, the parameters⁶ related to the channel model and the training

⁶ UAV related parameters are selected based on Table I in [29].
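The guided-search rectification step of (30) can be sketched as follows. This is a minimal sketch under stated assumptions: the coverage boundary is taken as a circle of radius r_SN centered on the target SN (which makes the "closest root" selection meaningful), the two positions are assumed to differ in x, and grid quantization is omitted; the function name is ours.

```python
import math

def boundary_point(x0, y0, xs, ys, r_sn):
    # Intersect the line through the UAV position (x0, y0) and the target
    # SN (xs, ys) with the SN's coverage circle of radius r_sn, following
    # (30); of the two roots, keep the one closest to (x0, y0).
    # Assumes xs != x0 (a vertical line needs the symmetric formula in y).
    dx, dy = xs - x0, ys - y0
    step = r_sn * math.sqrt(dx * dx / (dx * dx + dy * dy))
    roots = []
    for x1 in (xs - step, xs + step):
        y1 = y0 + dy / dx * (x1 - x0)  # point on the connecting line
        roots.append((x1, y1))
    return min(roots, key=lambda p: math.hypot(p[0] - x0, p[1] - y0))
```

In the full algorithm this continuous point would then be quantized to the nearest grid centers, and one feasible quantized candidate chosen at random, as described in the text.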
TABLE II: Channel and UAV related simulation parameters

Parameters                                              Value
Communication radius of a SN rSN                        900 m
NLoS model's path loss κNLoS                            21 dB
LoS model's path loss κLoS                              0.1 dB
Maximum endurance time of the UAV T0                    2118 s
Number of the SNs m0                                    8
Length of the side of the rectangle L                   6 km
Length of each grid d                                   300 m
Area of an UAV search region                            36 km²
Speed of light vc                                       3 × 10^8 m/s
Flying altitude of the UAV H0                           50 m
Environment parameter z0                                4.88
Environment parameter z1                                0.43
Carrier frequency fc                                    2.5 × 10^9 Hz
System bandwidth B0                                     1 MHz
Noise power spectrum density                            -174 dBm/Hz
Maximum tolerance AoI of the nodes T_SNj^th in Ψ_h      600 s
Maximum number of episodes Imax                         50000
Maximum number of epochs Jmax                           30
Number of testing times Dmax                            100
Size of the pool Θ                                      10000
Optimizer                                               Adam
Activation functions                                    ReLU / Leaky-ReLU / ReLU
Input layer                                             (3m0 + 4) × 128
Hidden layers                                           128 × 256, 256 × 512
Output layer                                            512 × 2(L/d + 1)²

Fig. 5: Rate versus distance for SN-UAV and UAV-DC with V = 20. [Curves: SN-UAV (simulation), UAV-DC (simulation), SN-UAV (theoretical analysis); axes: Rate (bit/s/Hz) versus Distance (m).]
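The rate-versus-distance trend of Fig. 5 can be reproduced in outline from the Table II parameters using the probabilistic LoS air-to-ground model of [28]. The sketch below is illustrative only: the transmit power `p_tx_dbm` is an assumed placeholder, not a value taken from the paper.

```python
import math

# Table II parameters
z0, z1 = 4.88, 0.43                 # environment parameters
fc, vc = 2.5e9, 3e8                 # carrier frequency (Hz), speed of light (m/s)
kappa_los, kappa_nlos = 0.1, 21.0   # excess path loss (dB) for LoS / NLoS
H0, B0 = 50.0, 1e6                  # UAV altitude (m), system bandwidth (Hz)
N0_DBM = -174.0                     # noise power spectral density (dBm/Hz)

def avg_path_loss_db(d_ground):
    """Mean path loss (dB) under the probabilistic LoS model of [28]."""
    d = math.hypot(d_ground, H0)                       # 3-D SN-UAV distance
    theta = math.degrees(math.asin(H0 / d))            # elevation angle (deg)
    p_los = 1.0 / (1.0 + z0 * math.exp(-z1 * (theta - z0)))
    fspl = 20.0 * math.log10(4.0 * math.pi * fc * d / vc)  # free-space loss
    return fspl + p_los * kappa_los + (1.0 - p_los) * kappa_nlos

def rate_bps_per_hz(d_ground, p_tx_dbm=20.0):
    """Spectral efficiency log2(1 + SNR); p_tx_dbm is an assumed power."""
    noise_dbm = N0_DBM + 10.0 * math.log10(B0)
    snr_db = p_tx_dbm - avg_path_loss_db(d_ground) - noise_dbm
    return math.log2(1.0 + 10.0 ** (snr_db / 10.0))
```

As in Fig. 5, the achievable rate falls monotonically as the horizontal SN-UAV distance grows, because both the free-space loss and the NLoS probability increase with distance.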
10 IEEE SENSORS JOURNAL, VOL. XX, NO. XX, XXXX 2023
[Figure residue: curves versus speed (m/s), 16–30 m/s, for Pε = 0.5, 0.8 and 0.9; caption lost in extraction.]
the network. Although increasing Pε can improve the convergence of the network parameters, it may also lead to an inferior local solution. Based on the above factors, we choose Pε = 0.8 in this paper.
We can see from Section III.C that the agent prefers staying at the signal overlap area. If the UAV is at the boundary of the area or far away from the SNs, it may fail to collect the data of all nodes due to limited residual energy.

The collection and forwarding probabilities of the nodes with different Pτ are presented in Fig. 8. As the speed increases, the probability curves also increase and finally approach 1. From (19), we can see that the UAV prefers to collect the data of the Type II nodes to ensure their timeliness. Besides, the Pτ = 0.5 curve is higher than the 0.1 and 1 curves: as Pτ increases from 0.1 to 0.5, the probability increases, but the probability curves decrease when Pτ increases from 0.5 to 1. If we adjust the UAV's action rarely, the convergence speed is relatively slow. Conversely, revising the UAV's action very frequently is not conducive to helping the UAV find a better solution. For example, the UAV achieves a better result when it chooses a certain point in the signal overlap area instead of the boundary, so rectifying the action too frequently may miss some opportunities to learn a good strategy. As a result, it will take a longer time to complete the data collection and forwarding task. With these various factors taken into consideration, a suitable choice is Pτ = 0.5.

Fig. 9: Training process comparison of the GSDRL, DDQN and DQN algorithms. [Reward versus episodes, up to 8 × 10^4 episodes.]

C. Performance Comparison

Next, we demonstrate the superior performance of the proposed algorithm by comparing it with the DDQN algorithm, the DQN algorithm, and the fixed collection data strategy. For deep reinforcement learning algorithms, when more factors in the state s are considered, more features of the sample are collected, and the results of network training will be better. However, there are always many unpredictable factors in practical applications. Thus, the UAV's flight speed acts as an external dynamic factor to test the robustness of the different algorithms in this paper.

From Section III, we can see that when the conditions for the UAV to receive or send data are satisfied, the agent gets a positive reward. Otherwise, it is punished; in other words, it receives a negative reward. As the number of interactions between the agent and the environment increases, the agent continues to accumulate experience, and it will choose appropriate actions to obtain more rewards based on the current environment state. The three reward curves with different colors are shown in Fig. 9. As the number of iterations increases, the rewards initially increase and then become stable. In the classic DRL algorithm, the action with the largest Q value is selected as the next action each time. However, this approach results in an overestimation of the actual reward function due to noise in the Q values, so the system may reach a suboptimal local solution.

Using gk in (31), we can see that the choice and the evaluation of the actions are performed by two different networks in both the DDQN algorithm and the GSDRL algorithm. As shown in Fig. 9, the GSDRL algorithm has gradually learned to make the corresponding decisions when the number of episodes reaches about 3000, whereas the reward curve of the DDQN algorithm becomes stable only after about 5500 episodes. Thus, by rectifying the action method, the GSDRL algorithm further accelerates the convergence speed and obtains more reward.

Fig. 10 presents the collection and forwarding probability of the Type I and Type II nodes for the three algorithms. To ensure the timeliness of high-priority nodes, the UAV prefers to collect the Type II nodes
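The decoupling of action choice from action evaluation described above can be illustrated with a toy target computation. This is not the paper's gk in (31); the value lists and discount factor below are illustrative stand-ins for the outputs of the online and target networks at the next state.

```python
def dqn_target(reward, q_target_next, gamma=0.9):
    # Classic DQN: the same (target) network both selects and evaluates
    # the next action, which overestimates values when Q estimates are noisy.
    return reward + gamma * max(q_target_next)

def ddqn_target(reward, q_online_next, q_target_next, gamma=0.9):
    # Double DQN (also used by GSDRL): the online network selects the
    # action, the target network evaluates it, decoupling the two roles.
    a_star = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[a_star]
```

When the two networks disagree (here the online network prefers action 1 while the target network's noisy estimate inflates action 0), the Double DQN target is lower than the DQN target, which is exactly the overestimation the text attributes to noise in the Q values.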
[Figure residue, Fig. 10: collection and forwarding probability versus speed (16–30 m/s) for the GSDRL (Type I and Type II nodes), DDQN and DQN algorithms.]

Fig. 11: The energy efficiency of the algorithms versus the speed.

Fig. 13: The collection and forwarding probability versus the speed for the Type II nodes.
500
Combining (9) and (33), we can get the PDF of y is given by
( ( )) 2y/B0
450 1
fy (y) = fx 2y/B0 − 1 ln 2
Ψ Ψ
400 ( ( ( ))) 2y/B0
1 1 1 y/B0
= exp − 2 −1 ln 2. (34)
Average AoI (s)
Pr Pr Ψ Ψ
350
Fixed collection data strategy According to (32), so the expected value of y is
∫
300 GSDRL algorithm
E (y) = yf (y) dy
250
∫ B0 × log2 (1 + Ψx) 1
( ( ( Pr )))
200 = × exp − 1 1
Ψ 2
y/B0
Pr
−1 dx
y/B0 B0
×2 Ψ ln 2 ×
× Ψ
150 ∫ ( 1+Ψx) ln 2
16 18 20 22 24 26 28 30 B 1
= 0 ln 2 (1 + Ψx) exp − x dx
Speed (m/s) Pr Pr
∫ ( )
B0 x
Fig. 14: The average AoI versus the speed for the fixed = ln (1 + Ψx) × exp − dx
ln 2 × Pr Pr
collection data strategy and the GSDRL algorithm. 1
( )
B 1
= − 0 e ΨPr Ei − , (35)
ln 2 ΨPr
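The closed form in (35) can be checked numerically under the assumption the derivation implies, namely that x is exponentially distributed with mean Pr. The quadrature helpers below are illustrative, not part of the paper; Ei is evaluated directly from its integral definition.

```python
import math

def ei_neg(a, n=200000, upper=60.0):
    """Ei(-a) = -integral of e^(-t)/t from a to infinity, a > 0 (trapezoid)."""
    h = (upper - a) / n
    s = 0.5 * (math.exp(-a) / a + math.exp(-upper) / upper)
    for i in range(1, n):
        t = a + i * h
        s += math.exp(-t) / t
    return -s * h

def expected_rate_closed_form(B0, Psi, Pr):
    """E(y) = -(B0/ln 2) * e^{1/(Psi Pr)} * Ei(-1/(Psi Pr)), as in (35)."""
    a = 1.0 / (Psi * Pr)
    return -(B0 / math.log(2.0)) * math.exp(a) * ei_neg(a)

def expected_rate_quadrature(B0, Psi, Pr, n=200000):
    """Direct integral of B0 log2(1 + Psi x) against the exponential density."""
    upper = 50.0 * Pr                      # tail beyond this is negligible
    h = upper / n
    s = 0.0
    for i in range(1, n):                  # integrand vanishes at x = 0
        x = i * h
        s += B0 * math.log2(1.0 + Psi * x) * math.exp(-x / Pr) / Pr
    return s * h
```

For illustrative values such as B0 = 1 MHz, Ψ = 2 and Pr = 1.5, the two estimates agree to well within the quadrature tolerance, which supports the closed-form step from the integral to the Ei term.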
[13] Q. Li et al., “Joint activity and blind information detection for UAV-
assisted massive IoT access,” IEEE J. Sel. Areas Commun., vol. 40,
no. 5, pp. 1489–1508, May 2022.
[14] Y. Liu et al., “UAV communications based on Non-Orthogonal multiple
access,” IEEE Wireless Commun., vol. 26, no. 1, pp. 52–57, Feb. 2019.
[15] Y. Wang et al., “Trajectory design for UAV-based internet of things
data collection: A deep reinforcement learning approach,” IEEE Internet
Things J., vol. 9, no. 5, pp. 3899–3912, Mar. 2022.
[16] S. Zhang et al., “Joint trajectory and power optimization for UAV relay
networks,” IEEE Commun. Lett., vol. 22, no. 1, pp. 161–164, Jan. 2018.
[17] C. Zhou et al., “Deep RL-based trajectory planning for AoI minimization
in UAV-assisted IoT,” in Proc. 2019 11th International Conference on
Wireless Communications and Signal Processing (WCSP), Xi’an, China,
2019, pp. 1–6.
[18] M. A. Abd-Elmagid and H. S. Dhillon, “Average peak age-of-
information minimization in UAV-assisted IoT networks,” IEEE Trans.
Veh. Technol., vol. 68, no. 2, pp. 2003–2008, Feb. 2019.
[19] T. Peng et al., “Deep reinforcement learning for efficient data collection
in UAV-Aided internet of things,” in Proc. 2020 IEEE International
Conference on Communications Workshops (ICC Workshops), Dublin,
Ireland, 2020, pp. 1–6.
[20] M. Sun et al., “AoI-energy-aware UAV-assisted data collection for IoT
networks: A deep reinforcement learning method,” IEEE Internet Things
J., vol. 8, no. 24, pp. 17 275–17 289, Dec. 2021.
[21] C. Zhan, Y. Zeng, and R. Zhang, “Energy-efficient data collection in
UAV enabled wireless sensor network,” IEEE Wireless Commun. Lett.,
vol. 7, no. 3, pp. 328–331, Nov. 2018.
[22] G. Chatzimilioudis et al., “Crowdsourcing with smartphones,” IEEE
Internet Comput., vol. 16, no. 5, pp. 36–44, Oct. 2012.
[23] L. Wang et al., “Sparse mobile crowdsensing: challenges and opportu-
nities,” IEEE Commun. Mag., vol. 54, no. 7, pp. 161–167, Jul. 2016.
[24] M. Alwateer, S. W. Loke, and N. Fernando, “Enabling drone services:
Drone crowdsourcing and drone scripting,” IEEE Access, vol. 7, pp.
110 035–110 049, Aug. 2019.
[25] K. Arulkumaran et al., “Deep reinforcement learning: A brief survey,”
IEEE Signal Process. Mag., vol. 34, no. 6, pp. 26–38, Nov. 2017.
[26] M. Costa, M. Codreanu, and A. Ephremides, “On the age of information
in status update systems with packet management,” IEEE Trans. Inf.
Theory, vol. 62, no. 4, pp. 1897–1910, Apr. 2016.
[27] D. Athukoralage et al., “Regret based learning for UAV assisted LTE-
U/WiFi public safety networks,” in Proc. 2016 IEEE Global Commu-
nications Conference (GLOBECOM), Washington, DC, USA, 2016, pp.
1–7.
[28] A. Al-Hourani, S. Kandeepan, and A. Jamalipour, “Modeling air-to-
ground path loss for low altitude platforms in urban environments,”
in Proc. 2014 IEEE Global Communications Conference, Austin, TX,
USA, 2014, pp. 2898–2904.
[29] Y. Zeng, J. Xu, and R. Zhang, “Energy minimization for wireless
communication with rotary-wing UAV,” IEEE Trans. Wireless Commun.,
vol. 18, no. 4, pp. 2329–2345, Apr. 2019.
[30] J. Konečný et al., “Mini-batch semi-stochastic gradient descent in the
proximal setting,” IEEE J. Sel. Topics Signal Process., vol. 10, no. 2,
pp. 242–255, Mar. 2016.
[31] H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning
with double Q-learning,” in Proc. AAAI Conf. Artificial Intelligence,
2016, pp. 2094–2100.