
Resource Optimization for Delay-Tolerant Data in Blockchain-Enabled IoT with Edge Computing: A Deep Reinforcement Learning Approach

Meng Li, Member, IEEE, F. Richard Yu, Fellow, IEEE, Pengbo Si, Senior Member, IEEE, Wenjun Wu, Member, IEEE, and Yanhua Zhang

Abstract—Recently, the development of the Internet of Things (IoT) has provided plenty of opportunities and challenges in various fields. As an essential part of IoT, machine-to-machine (M2M) communications open a novel way in which machine-type communication devices (MTCDs) are connected and communicate without any human intervention. Meanwhile, delay-tolerant data plays an important role in M2M communications-based IoT, and it puts more emphasis on powerful data caching, computing and processing, as well as on the security and stability of data transmission. To meet these requirements in M2M communications networks, in this paper we introduce promising technologies such as edge computing and blockchain, and propose a joint optimization framework for caching, computation and security for delay-tolerant data in M2M communications networks based on the dueling deep Q-network (DQN). Through the dynamic decision process of the dueling DQN, the optimal selection of caching servers, computing servers and blockchain systems can be made to achieve the maximum system reward, which includes higher efficiency of data processing, lower network costs and better security of data interaction. Extensive simulation results with different system parameters show that the proposed framework can effectively improve the system performance for blockchain-enabled M2M communications compared with existing schemes.

Index Terms—Machine-to-machine communications, edge computing, edge caching, blockchain, dueling deep Q-network.

Copyright (c) 20xx IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.
This work was jointly supported in part by the National Natural Science Foundation of China under Grant 61901011 and 61671029, the China Postdoctoral Science Foundation under Grant No. 2018M640032, the Beijing Postdoctoral Science Foundation under Grant No. ZZ2019-73, the Chaoyang District Postdoctoral Science Foundation under Grant No. 2019ZZ-4, and the International Cooperation Seed Foundation of Faculty of Information Technology, Beijing University of Technology. (Corresponding author: Pengbo Si.)
Meng Li, Pengbo Si, Wenjun Wu, and Yanhua Zhang are with the Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, P.R. China (e-mail: limeng720@bjut.edu.cn; sipengbo@bjut.edu.cn; wenjunwu@bjut.edu.cn; zhangyh@bjut.edu.cn).
F. Richard Yu is with the Department of Systems and Computer Engineering, Carleton University, Ottawa, ON, K1S 5B6, Canada (e-mail: Richard.Yu@carleton.ca).

I. INTRODUCTION

CURRENTLY, with the increasing number of electronic devices, many of them are expected to be linked to the Internet and to constitute the Internet of Things (IoT) [1]. As an important part of IoT, machine-to-machine (M2M) communications emerge as a promising communication paradigm in which machine-type communication devices (MTCDs) can communicate intelligently with very limited human intervention, such as wearable devices, automotive electronics, smart grids, industry automation, etc. [2]–[4]. According to predictions and reports by research institutions and companies, the number of M2M connections will reach 14.6 billion by 2022 and around 50 billion in the near future [5]–[7].

Different from traditional human-to-human (H2H) communications, in M2M communications there exists a large portion of data traffic that can tolerate relatively long delay, for instance, the traffic of intelligent meters, environmental monitoring, and other non-real-time services [8]. This delay-tolerant data in M2M communications is uploaded or downloaded periodically; it can permit higher transmission latency but requires a powerful processing rate [9]. Meanwhile, most delay-tolerant computing tasks can hardly be cached, processed and executed solely on the local devices, since MTCDs are usually equipped with limited battery, storage and computation resources to obtain a relatively long working life [10], [11], and the micro central processing unit (micro-CPU) on an MTCD cannot execute complicated computing tasks to fulfill its computing needs [12]. On the other hand, as a distinctive feature of M2M communications, security and reliability are considered even more important, because sensitive data in IoT is usually scheduled, transmitted and exchanged among the various MTCDs without artificial control [13].

Facing these issues and challenges, in recent years many studies have focused on improving the capability of data caching and computing, as well as on enhancing the security and reliability of data traffic in various areas of IoT. In [14], the authors propose a novel nonorthogonal multiple access (NOMA)-based edge computing model for narrowband IoT (NB-IoT) networks, and they present a joint optimization framework that minimizes the maximum task execution latency required per task bit across NB-IoT devices. The authors in [15] propose a new architecture for data synchronization based on fog computing, and design a synchronization algorithm for data caching and computing on the fog servers in order to decrease the communication cost and reduce the latency. Moreover, focusing on data security in IoT, the authors of [16] propose and analyze a novel blockchain-based scheme for IoT nodes, which aggregates the blockchain data in periodic updates and further reduces the communication cost of the connected IoT devices.


A blockchain-enabled efficient data collection and secure sharing scheme is proposed in [17], which combines the Ethereum blockchain and deep reinforcement learning (DRL) to maximize the amount of collected data and to ensure the security and reliability of data sharing.

Although various excellent works have addressed data caching, computing and security based on edge computing or blockchain in M2M communications-based IoT, these three important aspects were generally considered separately. Most existing works only focus on the resource allocation of caching and computing through edge computing or cloud computing respectively, and they ignore the joint optimization of local computing, edge computing and cloud computing in M2M communications [18]. More importantly, delay-sensitive data and delay-tolerant data are rarely differentiated in the existing works. This mixed transmission strategy results in over-consumption of communication resources and serious degradation of the quality-of-service (QoS). Meanwhile, owing to the features of blockchain technology, delay-sensitive data usually cannot be uploaded to and applied in blockchain systems, because the blockchain systems need enough time to execute the transactions and the smart contracts [19].

To address the above problems and challenges, in this paper we focus on delay-tolerant data and propose a novel framework that jointly considers caching, computing and security to improve system performance based on edge computing and blockchain in M2M communications networks. In addition, we introduce the dueling deep Q-network (DQN) to learn, train and derive the optimal decisions, so that the maximum system reward can be obtained. The distinct features of this paper are listed as follows.
• In this framework, edge computing is introduced in order to improve the capability of data caching and computing. Based on edge computing, the delay-tolerant computing tasks carried by MTCDs can be offloaded to nearby edge computing servers, so that more computing tasks can be accommodated and executed selectively on the local device, the edge computing servers or the cloud computing servers according to the current network states, network environments and QoS requirements.
• In order to enhance data security and efficiency in M2M communications, blockchain is considered as a crucial technology in the proposed network model. Based on the distributed blockchain nodes, the delay-tolerant data can be uploaded into the blockchain systems after computing and processing, and data security can be authorized and ensured through the consensus mechanism.
• The schedule and strategy of caching, computation offloading as well as blockchain system usage are jointly considered and formulated as a discrete Markov decision process (MDP) to maximize the system reward, which includes a higher caching reward, lower time overhead of data computation on the edge computing servers, and efficient data processing in the blockchain systems. Particularly, to handle the dynamic and large-dimension characteristics of the optimization and decision process, a dueling DQN-based optimization algorithm is adopted, and the appropriate caching storage, computing servers and blockchain systems can be selected after training.
• Extensive simulation results with different parameters show that the proposed scheme is more advantageous and effective than the existing schemes. It is revealed that the system reward is increased and the average service latency is decreased significantly through the dynamic decision process.

The rest of this paper is organized as follows. We review the related works about M2M communications, blockchain and edge computing in Section II. Next, the proposed network architecture and system model are presented in Section III. In Section IV, the selection and decision process of caching, computing servers and blockchain systems for delay-tolerant data in M2M communications is formulated, followed by the maximum system reward derived from the optimal strategy. Then, the solution of the proposed optimization strategy is given and discussed in Section V. In Section VI, we present and discuss the simulation results. Finally, we conclude this work in Section VII with future works.

II. RELATED WORKS

In this section, some related works about delay-tolerant data in M2M communications are reviewed first. Then, some background on blockchain-enabled IoT networks is presented, followed by a description of integrated caching and computing in blockchain-enabled M2M communications.

A. Delay-Tolerant Data in M2M Communications

The concept of the delay-tolerant network was initially proposed for the InterPlaNetary Internet (IPN), because in that network environment data transmission has to suffer from very large latency, low data rates, possibly time-disjoint periods of reception, and intermittent scheduled connectivity [20], [21]. However, different from the IPN, in general IoT networks with M2M communications there is a lot of data traffic that can tolerate relatively long latency [9], such as environment monitoring, e-health, smart meters, surveillance, etc. For example, for smart meters deployed in a household, the data is uploaded or downloaded periodically every day, every week or every month. Thus, on one hand, the delay-tolerant data does not require transmission or execution with limited spectrum resources in real time [22]. On the other hand, this data in M2M communications networks has enough latitude in latency to be processed [23], for example by executing data computing on edge/cloud computing servers or processing data in blockchain systems.

It should be noted that delay-tolerant data traffic does not mean that there is no delay limitation in M2M communications. It also has its own delay requirements over its lifetime, which are much longer than those of delay-sensitive data traffic [24].


B. Blockchain-Enabled IoT Networks

With the rapid development of urbanization, various challenges and problems have emerged all over the world. In order to solve these problems, the concept of smart cities has been proposed and considered as an effective solution [25]. IoT networks play a key role in the implementation of smart cities, while M2M communications are an important foundation for IoT [6]. Nevertheless, for M2M communications, data interaction and data security are more important than throughput increase or energy saving, since vast amounts of data in M2M communications are used to support industrial production, metering systems, etc., which require higher reliability and security [7].

Fortunately, as an emerging technology, blockchain has huge potential to promote the development of smart cities and to enhance IoT services. Blockchain was first used as a peer-to-peer (P2P) ledger for Bitcoin economic transactions [26]; it can guarantee data security and efficiency by enabling anonymous and trustful transactions and removing all kinds of intermediaries. Nowadays, blockchain technology brings many good features to IoT, such as trust-freeness, transparency, automation, decentralization and security [27]. For instance, based on the features of blockchain, it is difficult to have a single point of failure when blockchain-based decentralized systems are applied in M2M communications. Thus, the security of machine-type communications networks can be enhanced effectively [19]. Moreover, in a blockchain-enabled IoT network, any collected data is signed using digital signatures, and it is linked and secured through one-way cryptographic hash functions [28]. Therefore, the data collection through blockchain-enabled M2M communications can be deemed transparent, and the reliability of IoT networks can be guaranteed adequately. Based on these advantages of blockchain technology, it will promote the implementation and deployment of trusted, secure and transparent network environments for IoT.

C. Integrated Caching and Computing in Blockchain-Enabled M2M Communications

Edge caching and computing have been widely studied in recent years, and they are beneficial for content retrieval and data processing in different M2M applications [29]. Many existing works have focused on content storage and computation offloading at the edge of the network in order to decrease network overload and system costs. For example, an optimization scheme of content caching and request routing is proposed in [30] in order to minimize data traffic latency. In [31], the joint optimization of computation offloading and resource allocation is discussed to decrease energy consumption.

For blockchain-enabled M2M communications, the caching and computing process of data, which is called "mining", has been deemed a rigorous challenge in current network systems [32]. In order to provide powerful computation capacity, many data computation management schemes have been proposed and discussed in the existing works [33], [34], but they are usually considered in wired blockchain networks, and the limited capacity of caching and computing in conventional wireless communication networks is often ignored. In other words, it is essential to support powerful data storage and computation for blockchain-enabled M2M communications. As another promising technology, edge computing can deploy caching and computation resources closer to the MTCDs, and data caching and computing can be executed at the edge computing servers [35]. Compared with conventional cloud computing, edge computing can significantly reduce the network overload and shorten the transmission latency. Therefore, integrating caching and computing through the edge computing paradigm can bring huge advantages and benefits to blockchain-enabled M2M communications.

III. NETWORK ARCHITECTURE AND SYSTEM MODEL

In this section, the physical and logical network architecture of the proposed scheme is presented first. Then, the communication model, caching model, computation model, as well as the latency constraint are given and discussed in detail. The latency model of the blockchain systems is presented last.

A. Network Architecture

An example of the network architecture for blockchain-enabled M2M communications is depicted in Fig. 1, which consists of M small cells, N MTCDs for different applications or services, and one cloud services platform. In the mth (m = 1, 2, . . . , M) small cell, we assume that one wireless access point (AP) is deployed, which is equipped with an edge computing server and a blockchain system, and it is defined as APm in this paper. The edge computing servers enable computation and storage of computing tasks, and the blockchain systems can upload data, record transactions and share information. In addition, for the MTCDs in this small cell, we consider that there are Nm MTCDs that can connect to the nearest AP to transmit delay-tolerant data randomly, and each MTCD is also equipped with a micro CPU to execute lightweight computing tasks. We use nm (n = 1, 2, . . . , Nm) to represent the nth MTCD in the mth cell.

Moreover, the cloud services platform includes one core controller equipped with powerful caching and computing servers. Typically, the cloud services platform is connected to all the APs through wired links, and the heavy or complicated computing tasks of the blockchain systems can be offloaded and executed at the cloud computing servers [18]. Thus, the data computation in the blockchain systems can be operated and processed on the edge computing servers or the cloud computing servers.
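To make the topology concrete, the following is a minimal Python sketch of the entities described above; the class and field names (MTCD, AccessPoint, CloudController) are our own illustrative choices and are not taken from the paper.

```python
# Illustrative data model of the topology in Fig. 1 (an assumption, not the
# authors' code): one AP per small cell, each AP hosting an edge server and a
# blockchain node, plus a single cloud controller.
from dataclasses import dataclass, field
from typing import List

@dataclass
class MTCD:
    device_id: int
    cpu_hz: float          # F_{n_m}: local CPU cycles per second
    tx_power_w: float      # P_{n_m}: transmit power

@dataclass
class AccessPoint:
    cell_id: int
    edge_cpu_hz: float     # F_{AP_m}: edge server CPU cycles per second
    edge_cache_bytes: int  # omega_{EC,m}: edge cache capacity
    has_blockchain: bool = True
    mtcds: List[MTCD] = field(default_factory=list)

@dataclass
class CloudController:
    cloud_cpu_hz: float    # F_c: cloud CPU cycles per second
    cache_bytes: int       # omega_{cloud}: cloud cache capacity

@dataclass
class Network:
    aps: List[AccessPoint]
    cloud: CloudController

# Example: M = 2 small cells with 3 MTCDs each (placeholder numbers).
net = Network(
    aps=[AccessPoint(m, 5e9, 10 * 2**20,
                     mtcds=[MTCD(n, 0.5e9, 0.1) for n in range(3)])
         for m in range(2)],
    cloud=CloudController(20e9, 1024 * 2**20),
)
```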


Fig. 1: Network architecture for blockchain-enabled M2M communications (small cells with computing- and blockchain-enabled APs, MTCDs, data offloaded to a nearby AP or to the cloud services platform with a core controller).

B. Communication Model

In the communication model, we assume that the uplink and downlink channels between the APs and the MTCDs are symmetric according to the reciprocity theorem when the transmissions occur in the same coherence interval. In the proposed model, the channel radio propagation is assumed to comprise path loss and Rayleigh fading. The path loss of the channel between the nm th MTCD and APm is represented as (d_{n_m,AP_m})^{-\kappa}, where d_{n_m,AP_m} is defined as the distance between the nm th MTCD and APm, and κ is the path loss exponent. Meanwhile, the Rayleigh fading between the nm th MTCD and APm is defined as h_{n_m,AP_m} with h_{n_m,AP_m} ∼ exp(µ), where µ is the corresponding scale parameter. It is noted that the h_{n_m,AP_m} are independent of each other. Moreover, the transmission power of the nm th MTCD is represented as P_{n_m}, which is on the same channel with bandwidth B. The noise power is set as σ².

Consequently, the transmission capacity for offloading the computing tasks from the nm th MTCD to APm during time slot t can be represented as

c_{n_m,AP_m}(t) = B \ln\left(1 + \frac{P_{n_m}(t) h_{n_m,AP_m}(t) (d_{n_m,AP_m}(t))^{-\kappa}}{\sigma^2 + \sum_{n' \neq n} P_{n'_m}(t) h_{n'_m,AP_m}(t) (d_{n'_m,AP_m}(t))^{-\kappa}}\right), \quad (1)

where P_{n'_m}(t) is the transmission power of the n'_m th MTCD, h_{n'_m,AP_m}(t) is the Rayleigh fading between the n'_m th MTCD and APm, and d_{n'_m,AP_m}(t) is the distance between the n'_m th MTCD and APm in the mth small cell during time slot t.

C. Caching Model

In the proposed scheme, physical storage is deployed on the edge computing servers in the mth small cell or on the cloud services platform, represented as ω_{EC,m} or ω_{cloud}, respectively. Thus, the function of data caching is considered to exist on the edge computing servers or the cloud services platform, and the core controller can determine whether or where to cache the computing data, especially the data of the cryptographic hashes of blocks. We consider that the reward of caching on the edge computing servers can be written as

r_{ca,n_m} = q_{n_m} r_{EC,m}, \quad (2)

where q_{n_m} is the request rate of the caching contents when the data is transmitted to the edge computing servers from the nm th MTCD, and r_{EC,m} is the unit caching reward of the edge computing servers connected to APm in the mth small cell. The caching reward of the cloud services platform can be represented as

r_{ca,AP_m} = q_{AP_m} r_{cloud}, \quad (3)

where q_{AP_m} is the request rate of the caching contents when the data is cached on the cloud services platform, and r_{cloud} is the unit caching reward of the cloud services platform.

Considering the different data caching modes of all MTCDs in the mth small cell, the total caching reward can be represented as

r_{total} = \sum_{n_m=0}^{\varepsilon} q_{n_m} r_{EC,m} + q_{AP_m} r_{cloud}, \quad (4)

where ε is the number of MTCDs that select to cache their data on the edge computing servers, and it needs to satisfy 0 ≤ ε ≤ N_m.

Moreover, as any cache only has limited storage capability, each of them is much smaller than the storage of the cloud center ω_{cloud}. Thus, the cached content cannot be larger than the remaining storage of the cloud center in the proposed network architecture, which can be expressed as

\sum_{m=1}^{M} \omega_{EC,m} \le \omega_{cloud}. \quad (5)

In particular, the sizes of the storage on the edge computing servers and on the cloud services platform are not the same, and their usage costs are also different. Besides, due to the limited storage capability, if the storage of an edge computing server is occupied by content, other data cannot be cached on the same server. Therefore, according to the latency requirements and the state of the storage, the delay-tolerant data selects the suitable place to be stored.
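As a quick illustration of the models above, the following Python sketch evaluates the link capacity in (1) and the caching rewards in (2)–(4); the helper names and the numerical values are assumptions made for the example, not the authors' code.

```python
# Illustrative numpy sketch (ours) of the link capacity (1) and the caching
# rewards (2)-(4). Symbol names mirror the text; the numbers are made up.
import numpy as np

def link_capacity(B, P, h, d, kappa, noise, P_int, h_int, d_int):
    """Capacity c_{n_m,AP_m}(t) of one MTCD-AP link with co-channel interference.

    P, h, d describe the intended link; P_int, h_int, d_int are arrays for the
    interfering MTCDs transmitting on the same channel.
    """
    signal = P * h * d ** (-kappa)
    interference = np.sum(P_int * h_int * d_int ** (-kappa))
    return B * np.log(1.0 + signal / (noise + interference))

def total_caching_reward(q_edge, r_edge, q_cloud, r_cloud):
    """Total caching reward (4): edge terms from (2) plus the cloud term (3).

    q_edge holds the request rates of the epsilon MTCDs cached at the edge.
    """
    return float(np.sum(np.asarray(q_edge) * r_edge) + q_cloud * r_cloud)

# Example usage with arbitrary placeholder values.
rng = np.random.default_rng(0)
c = link_capacity(B=10e6, P=0.1, h=rng.exponential(1.0), d=120.0, kappa=3,
                  noise=5e-3, P_int=np.full(4, 0.1),
                  h_int=rng.exponential(1.0, 4), d_int=rng.uniform(50, 500, 4))
r_total = total_caching_reward(q_edge=[0.4, 0.7, 0.2], r_edge=10,
                               q_cloud=0.5, r_cloud=5)
```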


D. Computation Model

As discussed above, the MTCDs are equipped with an embedded micro-computer to execute some light and simple computing tasks by themselves. Meanwhile, the APs are connected with edge computing servers that have the computation capacity to provide delay-tolerant task-execution services. The cloud services platform deploys cloud computing servers that have super computational ability to execute many complicated computing tasks, such as computationally difficult mining tasks. Consequently, we consider three data computing modes for the delay-tolerant computing tasks in the proposed scheme; in other words, the computing tasks can be processed and executed on the local devices, the edge computing servers or the cloud computing servers.

The computational abilities, in other words, the numbers of CPU cycles per second, of the nm th MTCD, the edge computing servers connected with APm, and the cloud computing servers are represented as F_{n_m}, F_{AP_m} and F_c, respectively. In addition, the delay-tolerant computing task carried by the nm th MTCD at time slot t is represented as I_{n_m}(t) \triangleq \{\alpha_{n_m}(t), \beta_{n_m}(t)\}, where α_{n_m}(t) denotes the size of the input data involved and β_{n_m}(t) denotes the total number of CPU cycles required to accomplish the computing task [36]. At each time slot, the computing tasks may be executed on the local device, or transmitted and offloaded to the edge computing servers or the cloud computing servers. Focusing on the different situations, the corresponding time costs for data computing are given in detail as follows.

1) Local Computing: Simple computing tasks can be executed on the MTCD immediately. The time consumption to execute these tasks is represented as

t_{n_m} = \frac{\beta_{n_m}(t)}{F_{n_m}}. \quad (6)

2) Edge Computing: Complicated computing tasks, such as mining with the blockchain systems, cannot be processed on the MTCDs themselves. Therefore, the full task I_{n_m} needs to be offloaded to the associated APm. In this case, the computing task has to be offloaded first and transferred through the wireless communication link. Thus, if the computing task is decided to be executed on the edge computing servers, the transmission time for offloading the computing data in the first step is represented as

t_{n_m,off} = \frac{\alpha_{n_m}(t)}{c_{n_m,AP_m}(t)}. \quad (7)

When the full computing task has been offloaded, the edge computing servers connected to APm execute the task, and the time consumption in this phase is represented as

t_{AP_m,comp} = \frac{\beta_{n_m}(t)}{F_{AP_m}}. \quad (8)

As a result, the total execution time turns out to be

t_{n_m,AP_m} = \frac{\alpha_{n_m}(t)}{c_{n_m,AP_m}(t)} + \frac{\beta_{n_m}(t)}{F_{AP_m}}. \quad (9)

3) Cloud Computing: Due to the limited energy supply and data processing capacity, it is difficult to rely solely on the local devices or the edge computing servers to accomplish the complicated computing tasks. To improve the computing capacity, cloud computing is also considered in the proposed network architecture. Similar to edge computing, the delay-tolerant computing tasks need to be offloaded to the associated AP first, and the time consumption is represented as t_{n_m,off}, which is given in (7). Because of the powerful computing ability of the cloud computing servers, we consider that they can execute unlimited computing tasks concurrently and return the results within a fixed time consumption t_{c,comp}. Furthermore, due to the wired transmission between APm and the cloud services platform, the time consumption for transmitting the input data from APm to the cloud computing servers and returning the accomplished data backward is represented as t_{c,trans}. Therefore, in the case of cloud computing, the overhead of data processing time can be represented as

t_{AP_m,c} = \frac{\alpha_{n_m}(t)}{c_{n_m,AP_m}(t)} + t_{c,comp} + t_{c,trans}. \quad (10)

Accordingly, considering the different data computing modes of all MTCDs in the mth small cell, the total execution time of the computing tasks can be represented as

t_{total} = \sum_{n_m=0}^{\lambda} \frac{\beta_{n_m}(t)}{F_{n_m}} + \sum_{n_m=0}^{\mu} \left( \frac{\alpha_{n_m}(t)}{c_{n_m,AP_m}(t)} + \frac{\beta_{n_m}(t)}{F_{AP_m}} \right) + \sum_{n_m=0}^{\nu} \left( \frac{\alpha_{n_m}(t)}{c_{n_m,AP_m}(t)} + t_{c,comp} + t_{c,trans} \right), \quad (11)

where λ, µ and ν are the numbers of computing tasks that are selected to execute on the MTCDs, the edge computing servers and the cloud computing servers, respectively, and they need to satisfy 0 ≤ λ, µ, ν ≤ N_m and 0 ≤ λ + µ + ν ≤ N_m.
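The three execution modes and the aggregation in (11) can be summarized with a short Python sketch; the function names and the example parameters (mirroring the 600 KB / 1200 Megacycle task used later in the simulations) are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative sketch (ours, under the paper's notation) of the execution
# modes in (6), (9) and (10) and the per-cell total in (11).
def local_time(beta, F_local):
    # (6): execute beta CPU cycles on the MTCD itself
    return beta / F_local

def edge_time(alpha, beta, c_link, F_edge):
    # (9) = offload time (7) + edge execution time (8)
    return alpha / c_link + beta / F_edge

def cloud_time(alpha, c_link, t_c_comp, t_c_trans):
    # (10): offload over the radio link, then a fixed cloud compute time
    # plus the wired AP-to-cloud round-trip time
    return alpha / c_link + t_c_comp + t_c_trans

def total_execution_time(tasks, F_local, F_edge, c_link, t_c_comp, t_c_trans):
    """(11): sum the per-task latency according to each task's chosen mode.

    tasks is a list of (alpha_bits, beta_cycles, mode) tuples with mode in
    {"local", "edge", "cloud"}.
    """
    total = 0.0
    for alpha, beta, mode in tasks:
        if mode == "local":
            total += local_time(beta, F_local)
        elif mode == "edge":
            total += edge_time(alpha, beta, c_link, F_edge)
        else:
            total += cloud_time(alpha, c_link, t_c_comp, t_c_trans)
    return total

# Example: one task per mode, with simulation-style parameters
# (600 KB input, 1200 Megacycles), purely for illustration.
tasks = [(600e3 * 8, 1.2e9, "local"),
         (600e3 * 8, 1.2e9, "edge"),
         (600e3 * 8, 1.2e9, "cloud")]
print(total_execution_time(tasks, F_local=0.5e9, F_edge=5e9,
                           c_link=2e6, t_c_comp=0.1, t_c_trans=0.05))
```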


E. Latency Model with Blockchain Systems

Although latency is permissible for delay-tolerant data and its tolerable delay is usually much higher than that of delay-sensitive traffic, this does not mean that the latency of data traffic and computation is unlimited. Especially for the blockchain systems, the latency is considered an important index and cannot be ignored. Generally, the transaction processing of data in blockchain systems has two phases: a block is generated first, and then a consensus on the generated block is reached among all the users. Consequently, the latency of the data processing in the blockchain systems includes the latency of block generation and block confirmation, which is represented as

T_{block} = T_g + T_d + T_v, \quad (12)

where T_g is the average time required for the blockchain systems to produce a new block, T_d is the time consumption of data delivery, and T_v is the time cost for validation in the blockchain systems. In this paper, we select practical Byzantine fault tolerance (PBFT) as the consensus mechanism in the proposed blockchain systems, and according to [19], T_d can be calculated as

T_d = \frac{1}{P}\left[ \min\left\{\frac{MS_b}{c_{n_m,n'_m}}, T_{lim}\right\} + \min\left\{\max_{n_i \neq n_m, n'_m} \frac{MS_b}{c_{n_m,n_i}}, T_{lim}\right\} + \min\left\{\max_{n_i \neq n_j \neq n_m} \frac{MS_b}{c_{n_i,n_j}}, T_{lim}\right\} + \min\left\{\max_{n_i \neq n_j} \frac{MS_b}{c_{n_i,n_j}}, T_{lim}\right\} + \min\left\{\max_{n_i \neq n_m} \frac{MS_b}{c_{n_m,n_i}}, T_{lim}\right\} \right], \quad (13)

where P is the batch size of the block, S_b represents the number of bytes contained in each block, T_{lim} denotes the average time required for the block producer to create a new block, and c_{n_i,n_j} denotes the data transmission rate of the link between each pair of MTCDs.

Moreover, the time cost for validation T_v can be calculated as

T_v = \frac{1}{P} \max_{n_m = 1, \ldots, N_m} \left\{ \frac{P\gamma + [P + 4(K + \varsigma - 1)]\delta}{F_{n_m}} \right\}, \quad (14)

where γ and δ are the computing costs for verifying signatures and generating message authentication codes, K is the number of block producers, and ς is the number of faulty replicas.

For delay-tolerant M2M communications in IoT networks, the smart MTCDs also expect to receive the finality of transactions within a finite time that satisfies their delay requirements. Therefore, we assume that one block should be issued and validated within a number of consecutive block intervals, and the constraint should be satisfied as

T_{block} \le \rho \cdot T_g, \quad (15)

where ρ is the number of block intervals, and it should satisfy ρ > 1.


IV. PERFORMANCE ANALYSIS AND DRL-BASED OPTIMIZATION FRAMEWORK

In this section, we formulate the optimization problem in a deep reinforcement learning framework to handle the dynamic and large-dimensional characteristics of M2M communications in IoT networks. As a prominent characteristic of the deep reinforcement learning framework, it includes an offline deep convolutional neural network (CNN) construction phase and an online dynamic deep Q-learning phase. In other words, the action-value function with the corresponding actions and states is formulated offline, while the action selection and the dynamic network updating are performed online. With the modeling of the action space, state space and reward function, the optimization problem is formulated as a DRL process as follows.

A. Action Space

The action space is considered as a combined space, which includes the decision of caching server selection, computing node selection, as well as whether to select the blockchain systems to protect the data or not. In time slot t, in order to maximize the system rewards, the decision is made by the controller in the proposed network architecture. Formally, let A denote the action vector; the composite action a(t) ∈ A can be expressed by

a(t) = \{a_{ca}(t), a_{comp}(t), a_{block}(t)\}, \quad (16)

where a_{ca}(t) denotes the selection of data caching servers, a_{comp}(t) denotes the selection of computing servers, and a_{block}(t) denotes the decision of whether to select the blockchain systems, respectively. Meanwhile, we denote a*(t) as the corresponding action in time slot t.

Data Caching Selection: At each time slot t, according to the dynamic network environment and the current state of the caching servers, both the nm th MTCD and the controller need to decide whether to cache the data and where to store it. Let a_{ca}(t) be the set of data caching selections in each time slot, which can be denoted as

a_{ca}(t) = \{0, a_{ca,EC}(t), a_{ca,cloud}(t)\}, \quad (17)

where 0 represents that the data does not need to be stored on any caching server, a_{ca,EC}(t) represents the decision of caching the data on the edge computing servers, and a_{ca,cloud}(t) represents the decision that the data will be cached on the cloud servers.

Computing Node Selection: Following the data caching selection, the corresponding computing node is selected and determined at each time slot. We denote the decision of computing node selection as a_{comp}(t), which can be represented as

a_{comp}(t) = \{a_{comp,l}(t), a_{comp,m}(t), a_{comp,c}(t)\}, \quad (18)

where a_{comp,l}(t) represents that the MTCD will execute the computing tasks on the local device, a_{comp,m}(t) means that the computing task will be executed on the edge computing servers connected with the APs, and a_{comp,c}(t) means that the cloud computing servers will execute the computing tasks offloaded by the MTCDs at each time slot t.

Blockchain Systems Decision: In the proposed network architecture, the blockchain systems can be used in order to ensure data security, but with inevitable latency and energy consumption. Thus, in each time slot, the decision of whether to select the blockchain systems can be represented as

a_{block}(t) = \{0, 1\}, \quad (19)

where a_{block}(t) = 0 means the blockchain systems will not be selected and utilized, while a_{block}(t) = 1 represents that the blockchain systems will be selected and the data will be loaded into the block.

B. State Space

In each time slot, the agent monitors the network environment and collects the system states dynamically. Let S denote the state space; the system state in each time slot t satisfies s(t) ∈ S, and it can be written as

s(t) = \{s_{ca,AP_m}(t), s_{EC,m}(t), d_{n_m,AP_m}(t), c_{n_m,AP_m}(t), \chi(t)\}, \quad (20)

where s_{ca,AP_m}(t) and s_{EC,m}(t) denote the state of the caching and computing servers connected with APm, respectively. For s_{ca,AP_m}(t), it holds that s_{ca,AP_m}(t) ∈ {0, 1}, where s_{ca,AP_m}(t) = 0 represents that this storage is idle and data can be cached into it, while s_{ca,AP_m}(t) = 1 otherwise. Similarly, for s_{EC,m}(t), it holds that s_{EC,m}(t) ∈ {0, 1}, where s_{EC,m}(t) = 0 means the edge computing servers in the mth small cell are in the idle state and can execute the computing tasks in this time slot, while s_{EC,m}(t) = 1 otherwise. In this paper, we assume that the state of the caching and computing servers can be modeled as a Poisson distribution [37]. Moreover, d_{n_m,AP_m}(t) is the distance between the nm th MTCD and APm, c_{n_m,AP_m}(t) is the data traffic capacity of the link between the nm th MTCD and APm, and χ(t) is represented as a union of the data transaction sizes in the blockchain systems.

C. System Reward

After each time slot t, the system gets the immediate reward re(t) based on the taken action a(t). In general, the reward function should be related to the objective function, and the optimization objective of the formulated problem is to obtain the maximal system rewards. According to the proposed network architecture and system model, the total caching reward, the total time cost of data computing as well as the latency of the blockchain systems are considered as the main factors to improve the system performance. In the proposed scheme, the use of caching servers and blockchain systems is encouraged to alleviate network overload and ensure data security, while the latency of data transmission, computation and processing is expected to decrease. Therefore, in order to improve the caching reward and data computing efficiency, and to ensure data security at the same time for delay-tolerant M2M communications, we define the unified system reward as

re(t) = \upsilon r_{total} + \sigma \frac{1}{t_{total}} + \tau \frac{1}{T_{block}}, \quad (21)

where r_{total} is the total caching reward for data caching at the caching servers, t_{total} is the total time cost of computing task execution under the corresponding action, and T_{block} is the latency of data loading and processing in the blockchain systems, which are defined in (4), (11) and (12), respectively. In addition, υ, σ and τ are the weights that facilitate the differentiated system rewards, and they should satisfy υ + σ + τ = 1. The weight coefficients can be adjusted for different services to set priorities.

Therefore, over all the decision time slots, the cumulative system reward in the proposed blockchain-enabled M2M communications with delay-tolerant data can be represented as

RE(\pi) = \sum_{t \in T} re(t), \quad (22)

where π denotes the optimization policy, which is the set of actions along the time slots, T represents the set of all the decision time slots, and we use Π to denote the set of all possible policies, with π ∈ Π.

Consequently, the optimization problem is interpreted as finding the optimal policy π* that maximizes the long-term cumulative system reward RE, which can be represented as

\pi^* = \arg\max_{\pi \in \Pi} RE(\pi). \quad (23)
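The composite action space (16)–(19) and the per-slot reward (21) can be encoded compactly; the sketch below is illustrative only, and the enumeration order and default weights are assumptions (the weights echo the values used later in the simulations).

```python
# Illustrative sketch (ours) of the composite action space (16)-(19) and the
# per-slot reward (21). The flat enumeration is an assumption, not the
# authors' implementation.
from itertools import product

CACHE   = ["none", "edge", "cloud"]        # a_ca(t), (17)
COMPUTE = ["local", "edge", "cloud"]       # a_comp(t), (18)
BLOCK   = [0, 1]                           # a_block(t), (19)

# The composite action a(t) in (16) as a flat, indexable list of tuples.
ACTIONS = list(product(CACHE, COMPUTE, BLOCK))   # 3 * 3 * 2 = 18 actions

def reward(r_total, t_total, T_block, upsilon=0.3, sigma=0.3, tau=0.4):
    """re(t) in (21): weighted caching reward plus inverse latencies."""
    return upsilon * r_total + sigma / t_total + tau / T_block

# Example: look up one composite action and score it with made-up latencies.
a = ACTIONS[7]                 # e.g. ("edge", "local", 1)
print(a, reward(r_total=12.0, t_total=2.4, T_block=1.8))
```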


V. SOLVING THE PROBLEM BY DEEP REINFORCEMENT LEARNING

In this section, since the computational complexity of the decision-making process and its dimension increase exponentially with the number of actions and states, we introduce and apply the deep Q-learning network (DQN) to cope with the problem formulated above. As a Q-learning algorithm combined with a deep neural network, the dueling DQN is widely considered a significant improvement over the conventional DQN in many existing studies [13], [38]–[41]. Specifically, the system state value and the action advantage can be calculated by the dueling DQN separately. Based on these advantages of the dueling DQN, the optimal decision can be made to maximize the system rewards in the proposed blockchain-enabled M2M communications with delay-tolerant data.

A. Reinforcement Learning and Deep Q-learning Network

DQN, as one of the most important advancements in DRL during the past few years, has exhibited excellent performance in solving complex and dynamic MDP problems [42], [43]. Compared with supervised learning and unsupervised learning, reinforcement learning focuses on maximizing long-term rewards through learning from the network state and making optimal decisions. In conventional reinforcement learning, the agent observes and obtains the system state s(t) and makes a decision a(t) at time slot t. Then, the environment returns the updated system state s(t + 1) and the immediate reward r(t) to the agent in the next time slot t + 1. In order to find the optimal policy π* that obtains the maximized long-term value, this process is repeated. The process of reinforcement learning (RL) is shown in Fig. 2.

Fig. 2: Conventional work process of reinforcement learning.

In particular, the RL algorithm based on Q-learning adopts the Q-function Q(s, a) as the value estimation function to replace the value function [44]. Consequently, the optimal value Q*(s, a) of state s and action a can be calculated as

Q^*(s, a) = \mathbb{E}\left[ r(s, a, s') + \xi \max_{a'} Q^*(s', a') \right], \quad (24)

where E[·] represents the expectation function, and ξ is the discount factor indicating the impact of the future on the current reward value, with ξ ∈ (0, 1). As a result, the optimal strategy is the set of actions that maximize the Q value in state s. In order to solve the Q-learning problem, the Q value of each step needs to be updated as follows:

Q(s(t), a(t)) \leftarrow Q(s(t), a(t)) + \theta \left[ re(t) + \xi \max_{a \in A} Q(s'(t), a'(t)) - Q(s(t), a(t)) \right], \quad (25)

where θ is the learning rate and it should satisfy θ ∈ (0, 1).

Building on the Q-learning algorithm, DQN has been proposed with further improvements. First, a deep neural network is introduced into Q-learning to approximate the value function. In other words, adding the neural network can efficiently increase the application scenarios of Q-learning, since almost any function can be approximated by neural networks owing to their nonlinear nature. Furthermore, a target DQN is introduced into the deep Q-learning algorithm to assist training; it is updated with a smaller learning rate to keep the training process stable and smooth. Then, the mean square deviation between the target Q value and the current Q value is defined as the loss function L(ω), which is represented as

L(\omega) = \mathbb{E}\left[ \left( re + \xi \max_{a'} Q(s', a'; \omega^{-}) - Q(s, a; \omega) \right)^2 \right], \quad (26)

where ω is the parameter of the neural network and ω− is the parameter of the target DQN, which keeps the Q value stable and the training process smooth. Based on formula (26), DQN updates its network parameters by minimizing L(ω).

B. Dueling Deep Q-learning Network

As an improved model-free DRL algorithm, the dueling DQN can estimate the Q-values with lower variance and uses the greedy policy to ensure adequate exploration of the action space. However, different from traditional value-based DRL, the dueling DQN is able to calculate the state value and the action advantage separately. Thus, the calculation and derivation process of the dueling DQN algorithm can be formulated as the combined value of the environmental state and the executed action, which can be written as

Q(s, a) = V(s) + A(a). \quad (27)

Through the dueling DQN, the problem of repeated calculation of the same state value can be addressed, and the capability of estimating the environmental state with a clear optimization objective can also be improved [13]. Moreover, the dueling DQN also utilizes a strategy called experience replay [45]. It stores the past experiences into a replay memory, and randomly samples mini-batches from the pool to train the deep neural network, which refrains the agent from only concentrating on what the network is currently doing. In addition, the ϵ-greedy policy is used to balance exploitation and exploration [13]. Through the above training process and continuous exploration, the available system rewards reach an optimal value and converge. Therefore, in this paper, we adopt the dueling DQN to find the optimal strategy of data caching, computing, as well as blockchain system selection for delay-tolerant data in M2M communications networks.

In addition, according to [46], [47], in order to utilize the DRL model simply, we map the actions, states and system rewards of the proposed scheme to a grid model, so that the whole training process is like playing a grid game. The formulated action space can be designed and discretized, and the action set that may be taken in each time slot is mapped to the moving directions in the grid model. Moreover, since the system state space is finite, we need to map the continuous state to a discrete state set. Meanwhile, different state sets are set as different levels and mapped to an RGB image; in other words, different colors in the image represent different levels of the state sets. Then, the agent explores the grid model with different actions. According to the different actions and corresponding system states, different system rewards can be obtained after moving one step. Based on this method, the agent will continuously explore different paths in the grid model, and the available system rewards will reach an optimal value and converge after the whole training. By repeating this training process, the trained network model is obtained. For ease of understanding, the proposed mapping is shown in Fig. 3.

Fig. 3: Actions and system states mapped to an RGB image.

The work flow of the dueling DQN is shown in Fig. 4, and the whole process of the proposed algorithm is shown in Algorithm 1 in detail.
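To make the grid-model mapping and the update rule (25) concrete, the following toy Python sketch runs tabular Q-learning with an ϵ-greedy policy on a small grid; the environment, grid size and hyper-parameters are hypothetical stand-ins, not the dueling-DQN setup evaluated in the paper.

```python
# Toy sketch (ours): tabular Q-learning with epsilon-greedy exploration on a
# 4x4 grid, echoing the grid-model mapping described above.
import numpy as np

n_states, n_actions = 16, 4            # 4x4 grid, moves: up/down/left/right
Q = np.zeros((n_states, n_actions))
theta, xi, eps = 0.1, 0.9, 0.1         # learning rate, discount, exploration

def step(s, a):
    # Hypothetical environment: deterministic moves on the grid with a
    # reward of 1 for reaching the terminal corner state 15.
    row, col = divmod(s, 4)
    row = max(0, min(3, row + (a == 1) - (a == 0)))
    col = max(0, min(3, col + (a == 3) - (a == 2)))
    s_next = row * 4 + col
    return s_next, float(s_next == 15)

rng = np.random.default_rng(0)
for episode in range(500):
    s = 0
    while s != 15:
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, re = step(s, a)
        # Update rule (25): move Q(s, a) toward re + xi * max_a' Q(s', a')
        Q[s, a] += theta * (re + xi * np.max(Q[s_next]) - Q[s, a])
        s = s_next
```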

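Similarly, the target-network loss (26) together with experience replay can be sketched as follows; the small fully connected network, the replay format and all hyper-parameters are illustrative assumptions, and PyTorch is used here purely for convenience.

```python
# Companion sketch (ours) of the loss (26) with a target network and a
# replay memory. A tiny fully connected Q-network stands in for the CNN.
import random
from collections import deque
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 18))
target_net = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 18))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10000)
xi = 0.9  # discount factor

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    s, a, re, s_next = map(torch.stack, zip(*random.sample(list(replay), batch_size)))
    q_sa = q_net(s).gather(1, a.view(-1, 1)).squeeze(1)
    with torch.no_grad():
        # Target term of (26): re + xi * max_a' Q(s', a'; omega^-)
        target = re + xi * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)      # L(omega) in (26)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Periodically the target parameters omega^- would be refreshed, e.g.
    # target_net.load_state_dict(q_net.state_dict())

# Replay entries are (state, action, reward, next_state) tensors, e.g.:
replay.append((torch.zeros(5), torch.tensor(3), torch.tensor(1.0), torch.ones(5)))
train_step()
```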

Fig. 4: Work flows of the dueling deep Q-network (experience replay, a CNN feature extractor with layer sizes 84×84×3 → 20×20×32 → 9×9×64 → 7×7×64 → 1×1×512, separate state-value V(s) and action-advantage A(a) streams, and a target network Q_target(s(t+1), a(t+1))).

Algorithm 1 Performance Optimization Framework for Delay-Tolerant Data in Blockchain-Enabled M2M Communications Based on Dueling DQN
1: Initialize:
2: Offline dueling DQN construction:
3: Parameters of the DQN ω;
4: Parameters of the target DQN ω−;
5: Load the historical system states and Q(s, a) value estimates into the experience memory;
6: Pre-train the dueling DQN with input pairs of each action and state (s, a) and the corresponding Q(s, a; ω);
7: Online dueling DQN execution:
8: Initialize the environment and the initial state;
9: for t = t, t + 1, t + 2, . . . , t + T do
10:   for nm = 1, 2, . . . , Nm do
11:     Execute a(t) based on the ϵ-greedy policy, and obtain the reward re(t) and the next state s(t + 1);
12:     Form a sample (s(t), a(t), re(t), s(t + 1)) and store it into the experience memory;
13:     Calculate the state value V(s(t)) and the action advantage A(a(t));
14:     Obtain Q(s(t), a(t); ω) based on V(s(t)) and A(a(t)) through Eq. (27);
15:     Calculate the target Q-value from the target DQN by Qtarget ← re(t + 1) + ξQtarget(s(t + 1), arg max_{a∈A} Q(s(t + 1), a(t + 1); ω); ω−);
16:     Update the target Q network with learning rate θ and loss function L(ω) in each step;
17:   end for
18: end for

VI. SIMULATION RESULTS AND DISCUSSIONS

In this section, simulation results are presented to demonstrate the performance improvement for delay-tolerant data transmission in blockchain-enabled M2M communications by our proposed scheme. Significant advantages can be observed in the results with various training parameters, different data sizes for computation offloading as well as different acceptable delay constraints.

A. Simulation Environment

In the simulation, we consider and deploy a blockchain-enabled M2M communications environment with M = 5 small cells in a 1000 m × 1000 m region. In each small cell, there are Nm = 50 MTCDs and one AP equipped with blockchain systems and edge computing servers to offer wireless access and data computation services. Moreover, we also take into account one core controller with caching and cloud computing servers in the proposed network architecture. In the initial time slot, the channel bandwidth between the MTCD and the AP is set as 10 MHz. The transmit powers of each MTCD and each AP are set as 100 mWatts and 10 Watts, while the system background noise power is 5 mWatts. The channel gain identically follows a Gaussian distribution with zero mean and unit variance, and the path loss exponent κ is set as 3. In addition, for the aspect of data computing, the data size for computation offloading is α_{n_m}(t) = 600 KB, and the total number of CPU cycles is β_{n_m}(t) = 1200 Megacycles. Meanwhile, the CPU computation capabilities of the MTCD, the edge computing servers and the cloud computing servers are set to F_{n_m} = 0.5 GHz, F_{AP_m} = 5 GHz and F_c = 20 GHz, respectively. For the aspect of data caching, the unit prices for data caching in the servers at the AP and at the cloud are set as 10 and 5, and the capacity of the cache storage is set as 10 MB. Besides, for the aspect of data uploading into the blockchain systems, the batch size of a block P is set as 3, the number of bytes contained in each block is Sb = 5 MB, the average time required to create a new block is Tlim = 1 s, the computing costs for verifying signatures and generating message authentication codes γ and δ are set as 2 MHz and 1 MHz, and the number of block producers is set as 20. Furthermore, the weights υ, σ and τ are set as 0.3, 0.3 and 0.4. The aforementioned parameters are widely used in the existing works [19], [36], [39].

The CNN is used as the evaluation network to calculate the Q value and the target Q value. A 4-layer CNN is designed and adopted in the simulations. In the proposed scheme, the initial input image is first resized to 84 × 84 × 3. The first hidden layer convolves 8 × 8 filters with stride 4 over the input image, and the size turns to 20 × 20 × 32. The second hidden layer convolves 4 × 4 filters with stride 2, and the size turns to 9 × 9 × 64, again followed by a rectifier nonlinearity. The third hidden layer convolves 3 × 3 filters with stride 1, and the size turns to 7 × 7 × 64. The final hidden layer convolves 7 × 7 filters with stride 1, and the size turns to 1 × 1 × 512. In each convolutional layer, the rectified linear unit (ReLU) function is selected as the activation function. Then, through the four layers of convolutional operations, the output nodes are fully connected to be trained in the deep reinforcement learning.

The system performance of the proposed scheme is compared with several existing schemes in the simulations, such as the greedy strategy and the random selection strategy. We also consider and study the impact of different parameters in order to ensure comparison fairness, such as different numbers of episodes in the dueling DQN, different numbers of CPU cycles, different data sizes for computation offloading, different capacities of cache storage at the edge computing servers, different block size limitations, etc.
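Under the architecture just described, a dueling evaluation network could be sketched in PyTorch as follows; the number of actions and the mean-subtracted advantage combination are our assumptions, while the convolutional layer shapes follow the sizes listed above and in Fig. 4.

```python
# A minimal PyTorch sketch (ours) of the dueling Q-network described above:
# an 84x84x3 input, four conv layers (8x8/4, 4x4/2, 3x3/1, 7x7/1) ending in
# 512 features, and separate state-value V(s) and action-advantage A(a)
# streams combined as in (27).
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    def __init__(self, n_actions=18):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),   # -> 32 x 20 x 20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 64 x 9 x 9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 64 x 7 x 7
            nn.Conv2d(64, 512, kernel_size=7, stride=1), nn.ReLU(), # -> 512 x 1 x 1
            nn.Flatten(),
        )
        self.value = nn.Linear(512, 1)               # V(s)
        self.advantage = nn.Linear(512, n_actions)   # A(s, a)

    def forward(self, x):
        z = self.features(x)
        v, a = self.value(z), self.advantage(z)
        # Dueling combination; subtracting the mean advantage keeps V and A
        # identifiable (a common refinement of the plain sum in (27)).
        return v + a - a.mean(dim=1, keepdim=True)

# Example: a batch of one 84x84 RGB "state image".
q = DuelingDQN()(torch.zeros(1, 3, 84, 84))   # shape: (1, 18)
```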


Fig. 5: Comparison of the system rewards with different learning rates.

Fig. 6: Comparison of the system rewards with different schemes.

Fig. 7: Comparison of the system rewards with different numbers of CPU cycles.

B. Performance of Convergence

Fig. 5 shows the system rewards in the proposed blockchain-enabled M2M communications networks under different learning rates. The learning rate in DRL refers to the magnitude of the network parameter update driven by the gradient of the loss function; in other words, a higher learning rate means a larger parameter update range. As shown in Fig. 5, the system rewards keep a more stable performance with a lower learning rate, because it has the capability to find the precise position of the optimal value. Moreover, it can also be seen that a higher learning rate enables better convergence. Therefore, in the simulations of this paper, we select the learning rate as 10−3.

Fig. 6 presents the comparison of the system rewards with different schemes. It can be seen in this figure that the advantage of the proposed scheme over the existing schemes, such as the conventional DQN strategy, the greedy-based strategy and the random selection strategy, is obvious. As discussed above, owing to the use of the dueling DQN, the convergence of the proposed scheme is the fastest compared with the other existing schemes. Meanwhile, the system rewards in the proposed scheme are still higher than those of the other three existing schemes; the reason is that the system rewards depend on the appropriate selection and decision of different caching and computing servers as well as the utilization of the blockchain systems. In other words, the optimal selection and decision of caching resources, computing resources as well as blockchain systems are beneficial to delay-tolerant data in M2M communications networks.

C. Performance Comparison for Computation Offloading

Fig. 7 depicts the comparison of the system rewards with different numbers of CPU cycles. In this figure, the system rewards of the proposed scheme increase much faster than those of the schemes that only consider edge computing or cloud computing. The advantage of the proposed scheme is prominent because joint edge computing and cloud computing servers can be utilized by the MTCD through the dueling DQN. Hence, with the increasing number of CPU cycles, the optimal selection and decision can be made to handle the computing tasks carried by the MTCDs through computation offloading.

Fig. 8 shows the comparison of the system rewards with different data sizes under different schemes. In order to ensure a fair comparison, we compare the proposed scheme with the existing schemes including the conventional DQN strategy, the greedy-based strategy and the random selection strategy. From this figure, it can be seen that with increasing data sizes for computation offloading, the system rewards in the proposed scheme and the conventional DQN strategy increase obviously.


C. Performance Comparison for Computation Offloading

Fig. 7 depicts the comparison of the system rewards with different numbers of CPU cycles. In this figure, the system rewards of the proposed scheme increase much faster than those of the schemes that consider only edge computing or only cloud computing. The advantage of the proposed scheme is prominent because the MTCD can jointly utilize edge computing and cloud computing servers through the DQN. Hence, with the increasing number of CPU cycles, the optimal selection and decision can be made to handle the computing tasks carried by the MTCDs through computation offloading.

Fig. 8 shows the comparison of the system rewards with different data sizes under different schemes. To ensure a fair comparison, we compare the proposed scheme with the existing schemes, including the conventional DQN strategy, the greedy-based strategy and the random selection strategy. From this figure, it can be seen that, with increasing data sizes for computation offloading, the system rewards of the proposed scheme and of the conventional DQN strategy increase markedly. The performance improvement of the DRL-based schemes over the traditional schemes, such as the greedy-based strategy and the random selection strategy, is clear. Moreover, the system rewards of the proposed scheme are still higher than those of the conventional DQN strategy and the greedy-based strategy. The reason is that, faced with different processing methods for data computation, the optimal selection of computing servers in different time slots can be determined through training of the dueling DQN. The simulation results also demonstrate that the different data computing resources should be allocated properly to increase the system rewards.

Fig. 8: Comparison of the system rewards with different data sizes for computation offloading.
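For illustration, the sketch below enumerates a small joint action space of caching server, computing server and blockchain usage, and greedily picks the action with the largest Q-value. The action encoding and the q_of() function are hypothetical stand-ins for the trained dueling DQN, not the exact formulation of this paper.

```python
# Sketch of greedy selection over the joint action space once training is done.
# The action enumeration and q_of() below are hypothetical stand-ins for the
# trained dueling DQN; they are not the exact action encoding of the paper.
from itertools import product

CACHING = ["edge_cache_1", "edge_cache_2", "cloud_cache"]
COMPUTING = ["edge_server_1", "edge_server_2", "cloud_server"]
BLOCKCHAIN = [False, True]                      # whether to record on-chain

ACTIONS = list(product(CACHING, COMPUTING, BLOCKCHAIN))

def q_of(state, action):
    """Hypothetical stand-in for the trained network's Q(state, action)."""
    caching, computing, on_chain = action
    score = -0.001 * state["cpu_megacycles"]                 # heavier tasks cost more
    score += 2.0 if computing.startswith("edge") else 1.0    # toy preference for edge
    score -= 0.0005 * state["data_kb"] if caching == "cloud_cache" else 0.0
    score += 0.5 if on_chain else 0.0                        # value of on-chain security
    return score

def select_action(state):
    # Greedy policy: pick the joint decision with the largest Q-value.
    return max(ACTIONS, key=lambda a: q_of(state, a))

print(select_action({"cpu_megacycles": 800, "data_kb": 400}))
```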
To examine the influence of the proposed scheme on the service latency, we compare the proposed scheme with schemes in which the computing tasks are executed only on edge computing servers or only on cloud computing servers. As shown in Fig. 9, which compares the average service latency under different numbers of CPU cycles, the average service latency increases markedly with the number of CPU cycles for the proposed scheme and for the existing strategy that considers only edge computing. Since the cloud computing servers are connected through wired links and have powerful computing ability, the average service latency of the cloud-only strategy remains almost stable over different numbers of CPU cycles. However, once the data transmission and offloading are taken into account, the average service latency of the cloud computing strategy is higher than that of the other two schemes. Meanwhile, the advantage of the proposed scheme is prominent because the optimal decision can be made based on the training of the dueling DQN, and the appropriate computing servers can be selected according to the different network environments.

Fig. 9: Comparison of the average service latency with different numbers of CPU cycles.
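A back-of-the-envelope latency model reproduces this qualitative behaviour: edge latency is dominated by local computation and therefore grows with the required CPU cycles, whereas cloud latency is dominated by transmission and forwarding and remains nearly flat. The link rates, forwarding delay and CPU frequencies below are illustrative assumptions, not the simulation parameters of this paper.

```python
# Toy service-latency model matching the qualitative trends discussed above.
# All rates and CPU frequencies are illustrative assumptions, not the paper's
# simulation settings.

def service_latency_s(data_kb: float, cycles_mega: float, target: str) -> float:
    data_bits = data_kb * 8e3
    if target == "edge":
        uplink_bps, cpu_hz = 10e6, 2e9        # wireless link, modest edge CPU
    elif target == "cloud":
        uplink_bps, cpu_hz = 100e6, 20e9      # wired backhaul, powerful cloud CPU
    else:
        raise ValueError(target)
    transmission = data_bits / uplink_bps
    backhaul = 0.8 if target == "cloud" else 0.0   # extra forwarding delay
    computation = cycles_mega * 1e6 / cpu_hz
    return transmission + backhaul + computation

for cycles in (500, 900, 1400):
    print(cycles,
          round(service_latency_s(500, cycles, "edge"), 3),
          round(service_latency_s(500, cycles, "cloud"), 3))
```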
D. Performance Comparison for Joint Optimization

To further compare the performance improvement achieved by the proposed scheme in various respects, we focus on the average service latency with different capacities of cache storage at the edge computing servers, and we also examine the system rewards under different block size limitations, different acceptable delay constraints and different numbers of MTCDs for the different optimization schemes.

In Fig. 10, with the increasing capacity of cache storage at the edge computing servers, the average service latency decreases for all of the schemes, but it decreases much faster for the proposed scheme than for the other existing schemes. As can be seen in this figure, the service latency of the proposed scheme outperforms that of the existing conventional DQN strategy and greedy-based strategy, especially when the capacity of the cache storage is small. In addition, for the existing random selection strategy, the service latency is longer because it has no optimization strategy for caching. Thanks to the dueling DQN in the proposed scheme, the advantage is prominent and the system performance can be improved significantly.

As an important part of the proposed scheme, the impact of the blockchain systems on the system performance cannot be ignored. In Fig. 11, we discuss the comparison of the system rewards with different block size limitations. The figure shows that the blockchain-enabled M2M communications networks obtain more system rewards with increasing block size under all of the schemes. However, the rewards cannot increase indefinitely because of the constraints in the blockchain systems. Obviously, based on the training of the dueling DQN, more delay-tolerant data in the proposed scheme have the chance to be selected and uploaded to the blockchain systems, which ensures the data security. Although the system rewards also increase in the conventional DQN strategy and the greedy-based strategy, the proposed scheme outperforms them since better decisions can be made by the dueling DQN, and the proposed scheme pays more attention to the long-term rewards over the whole time frame.
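The saturation with respect to the block size can be illustrated with a toy calculation: only the data records that fit into one block earn the security-related part of the reward, so the benefit stops growing once every pending record fits. The record size and per-record reward below are illustrative assumptions.

```python
# Toy illustration of the block-size limitation: only the records that fit
# into one block earn the security reward. Record sizes and the per-record
# reward are illustrative assumptions.

def records_on_chain(num_records: int, record_kb: float, block_limit_mb: float) -> int:
    capacity = int(block_limit_mb * 1024 // record_kb)   # records per block
    return min(num_records, capacity)

def security_reward(num_records: int, record_kb: float, block_limit_mb: float,
                    reward_per_record: float = 0.5) -> float:
    return reward_per_record * records_on_chain(num_records, record_kb, block_limit_mb)

for limit_mb in (1, 4, 8, 10):
    # Reward saturates once every pending record fits into the block.
    print(limit_mb, security_reward(num_records=2000, record_kb=2.0,
                                    block_limit_mb=limit_mb))
```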


Fig. 10: Comparison of the average service latency with different capacities of cache storage at edge computing servers.

Fig. 11: Comparison of the system rewards with different block size limitations.

Fig. 12: Comparison of the system rewards with different acceptable delay constraints.

Fig. 13: Comparison of the system rewards with different numbers of MTCDs.

Fig. 12 depicts the variation of the system rewards with different acceptable delay requirements. From Fig. 12, it can be seen that, with increasing acceptable latency, the system rewards increase for all of the schemes, and they increase most noticeably for the proposed scheme. The reason is that, with a longer acceptable latency, there is more time to select the appropriate caching storage, computing servers and blockchain systems. However, the existing random selection strategy again obtains the lowest system rewards among the four schemes, because it applies no optimization strategy. For the proposed scheme, owing to the training via the dueling DQN, the highest system rewards are maintained over all of the acceptable delay constraints. Meanwhile, the system rewards of the conventional DQN strategy are close to those of the proposed scheme, since its decisions are also made based on training.

Finally, we consider the comparison of the system rewards with different numbers of MTCDs. As can be seen in Fig. 13, with the increasing number of MTCDs in the M2M communications networks, more and more delay-tolerant data need to be transmitted and processed, and the system rewards improve accordingly. Although the system rewards of all four schemes increase with the number of MTCDs, the proposed scheme achieves the largest gain. In particular, when the number of MTCDs is small, the proposed scheme clearly outperforms the other three schemes, because there are enough network resources to cache and execute the delay-tolerant data computing tasks. Similarly, the system rewards of the conventional DQN strategy are close to those of the proposed scheme, especially when the number of MTCDs is large. In addition, due to the lack of an optimization strategy, the system rewards of the existing random selection strategy remain the lowest.
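These trends can be mimicked with a toy composite reward in which a task earns an efficiency term minus a network-cost term plus a security bonus, but only if it meets its acceptable delay constraint. The weights and task values below are illustrative assumptions, not the reward function defined in this paper.

```python
# Toy composite reward consistent with the qualitative description above:
# efficiency minus cost plus a security bonus, granted only if the task meets
# its acceptable delay constraint. Weights are illustrative assumptions.

def task_reward(latency_s: float, acceptable_delay_s: float,
                processed_kb: float, network_cost: float,
                on_chain: bool) -> float:
    if latency_s > acceptable_delay_s:
        return 0.0                       # deadline missed: no reward
    efficiency = processed_kb / max(latency_s, 1e-3)
    security_bonus = 5.0 if on_chain else 0.0
    return 0.01 * efficiency - network_cost + security_bonus

# A looser delay constraint lets more tasks finish in time, so the summed
# reward over many MTCDs grows, as in Fig. 12 and Fig. 13.
tasks = [(0.8, 300, 1.0, True), (1.6, 500, 2.0, False), (2.4, 400, 1.5, True)]
for deadline in (1.0, 2.0, 3.0):
    total = sum(task_reward(lat, deadline, kb, cost, chain)
                for lat, kb, cost, chain in tasks)
    print(deadline, round(total, 2))
```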


VII. CONCLUSIONS AND FUTURE WORK

This paper proposed a novel scheme that jointly considers the resource allocation of caching storage, computing servers and blockchain systems for delay-tolerant data in M2M communications networks, in order to decrease unnecessary latency and improve the system performance. In the proposed framework, edge computing or cloud computing servers can be selected to execute the complicated computing tasks, and the blockchain systems can be utilized to ensure the data security and authenticity. Because different selections and decisions have to be made for network resources such as storage, computation and blockchain, we employ the dueling DQN to solve the joint decision-making optimization problem. After training, the optimal decisions about caching servers, computing servers and blockchain systems can be made to achieve the maximum system rewards, which include lower data transmission latency, lower network costs and better data security guarantees. Simulation results demonstrated that, with the proposed framework, the system rewards can be increased significantly compared with the existing schemes, while the stability of the proposed scheme is also maintained. Future work is in progress to consider other important issues, such as integrating smart cities with the energy-efficient M2M communications and blockchain systems proposed in our framework.

ACKNOWLEDGMENT

We thank the editor and reviewers for their detailed reviews and constructive comments, which have helped to improve the quality of this paper.

REFERENCES

[1] A. Al-Fuqaha, M. Guizani, M. Mohammadi, M. Aledhari, and M. Ayyash, “Internet of Things: A survey on enabling technologies, protocols, and applications,” IEEE Commun. Surveys and Tutorials, vol. 17, no. 4, pp. 2347–2376, Fourthquarter 2015.
[2] N. Xia, H. Chen, and C. Yang, “Radio resource management in machine-to-machine communications-a survey,” IEEE Commun. Surveys and Tutorials, vol. 20, no. 1, pp. 791–828, Firstquarter 2018.
[3] Y. Lin, J. Huang, C. Fan, and W. Chen, “Local authentication and access control scheme in M2M communications with computation offloading,” IEEE Internet of Things Journal, vol. 5, no. 4, pp. 3209–3219, Aug. 2018.
[4] B. Al-Kaseem and H. Al-Raweshidy, “SD-NFV as an energy efficient approach for M2M networks using cloud-based 6LoWPAN testbed,” IEEE Internet of Things Journal, vol. 4, no. 5, pp. 1787–1797, Oct. 2017.
[5] Cisco, “Cisco visual networking index (VNI) complete forecast for 2017–2022,” Tech. Rep., 2018.
[6] A. Ali, W. Hamouda, and M. Uysal, “Next generation M2M cellular networks: Challenges and practical considerations,” IEEE Commun. Mag., vol. 53, no. 9, pp. 18–24, Sep. 2015.
[7] A. Barki, A. Bouabdallah, S. Gharout, and J. Traoré, “M2M security: Challenges and solutions,” IEEE Commun. Surveys and Tutorials, vol. 18, no. 2, pp. 1241–1254, Secondquarter 2016.
[8] M. T. Islam, A.-E. M. Taha, and S. Akl, “A survey of access management techniques in machine type communications,” IEEE Commun. Mag., vol. 52, no. 4, pp. 74–81, Apr. 2014.
[9] M. Li, P. Si, and Y. Zhang, “Random access and virtual resource allocation in software-defined cellular networks with machine-to-machine communications,” IEEE Trans. Veh. Tech., vol. 67, no. 10, pp. 9073–9086, Oct. 2018.
[10] M. Islam, A. M. Taha, and S. Akl, “A survey of access management techniques in machine type communications,” IEEE Commun. Mag., vol. 52, no. 4, pp. 74–81, Apr. 2014.
[11] F. Ghavimi and H.-H. Chen, “M2M communications in 3GPP LTE/LTE-A networks: Architectures, service requirements, challenges, and applications,” IEEE Commun. Surveys and Tutorials, vol. 17, no. 2, pp. 525–549, Secondquarter 2015.
[12] M. Chiang and T. Zhang, “Fog and IoT: An overview of research opportunities,” IEEE Internet of Things Journal, vol. 3, no. 6, pp. 854–864, Dec. 2016.
[13] C. Qiu, F. R. Yu, H. Yao, F. Xu, and C. Zhao, “Blockchain-based software-defined industrial Internet of Things: A dueling deep Q-learning approach,” IEEE Internet of Things Journal, vol. 6, no. 3, pp. 4627–4639, Jun. 2019.
[14] L. Qian, A. Feng, Y. Huang, Y. Wu, B. Ji, and Z. Shi, “Optimal SIC ordering and computation resource allocation in MEC-aware NOMA NB-IoT networks,” IEEE Internet of Things Journal, vol. 6, no. 2, pp. 2806–2816, Apr. 2019.
[15] T. Wang, J. Zhou, A. Liu, M. Bhuiyan, G. Wang, and W. Jia, “Fog-based computing and storage offloading for data synchronization in IoT,” IEEE Internet of Things Journal, vol. 6, no. 3, pp. 4272–4282, Jun. 2019.
[16] P. Danzi, A. Kalør, Č. Stefanović, and P. Popovski, “Delay and communication tradeoffs for blockchain systems with lightweight IoT clients,” IEEE Internet of Things Journal, vol. 6, no. 2, pp. 2354–2365, Apr. 2019.
[17] C. Liu, Q. Lin, and S. Wen, “Blockchain-enabled data collection and sharing for industrial IoT with deep reinforcement learning,” IEEE Internet of Things Journal, vol. 15, no. 6, pp. 3516–3526, Jun. 2019.
[18] M. Li, F. R. Yu, P. Si, and Y. Zhang, “Green machine-to-machine (M2M) communications with mobile edge computing (MEC) and wireless network virtualization,” IEEE Commun. Mag., vol. 56, no. 5, pp. 148–154, May 2018.
[19] M. Liu, F. R. Yu, Y. Teng, V. C. M. Leung, and M. Song, “Performance optimization for blockchain-enabled industrial Internet of Things (I-IoT) systems: A deep reinforcement learning approach,” IEEE Trans. Indust. Infor., vol. 15, no. 6, pp. 3559–3570, Jun. 2019.
[20] P. R. Pereira, A. Casaca, J. J. P. C. Rodrigues, V. N. G. J. Soares, J. Triay, and C. Cervelló-Pastor, “From delay-tolerant networks to vehicular delay-tolerant networks,” IEEE Commun. Surveys and Tutorials, vol. 14, no. 4, pp. 7–38, Fourthquarter 2012.
[21] E. Bulut, Z. Wang, and B. K. Szymanski, “Cost-effective multiperiod spraying for routing in delay-tolerant networks,” IEEE/ACM Trans. Netw., vol. 18, no. 5, pp. 1530–1543, Oct. 2010.
[22] S. Burleigh, A. Hooke, L. Torgerson, L. Fall, V. Cerf, B. Durst, K. Scott, and H. Weiss, “Delay-tolerant networking: An approach to interplanetary Internet,” IEEE Commun. Mag., vol. 41, no. 6, pp. 128–136, Jun. 2003.
[23] M. Li, F. R. Yu, P. Si, H. Yao, and Y. Zhang, “Software-defined vehicular networks with caching and computing for delay-tolerant data traffic,” in Proc. IEEE Int. Conf. Commun. (ICC), Kansas City, MO, May 2018, pp. 1–6.
[24] P. Si, Y. He, H. Yao, R. Yang, and Y. Zhang, “DaVe: Offloading delay-tolerant data traffic to connected vehicle networks,” IEEE Trans. Veh. Tech., vol. 65, no. 6, pp. 3941–3953, Jun. 2016.
[25] E. Tabane, S. M. Ngwira, and T. Zuva, “Survey of smart city initiatives towards urbanization,” in Proc. IEEE ICACCE, Durban, South Africa, Nov. 2016, pp. 437–440.
[26] S. Nakamoto, “Bitcoin: A peer-to-peer electronic cash system,” http://www.bitcoin.org/bitcoin.pdf, 2009.
[27] J. Xie, H. Tang, T. Huang, F. R. Yu, R. Xie, J. Liu, and Y. Liu, “A survey of blockchain technology applied to smart cities: Research issues and challenges,” IEEE Commun. Surveys and Tutorials, vol. 21, no. 3, pp. 2794–2830, Thirdquarter 2019.
[28] A. Kosba, A. Miller, E. Shi, Z. Wen, and C. Papamanthou, “Hawk: The blockchain model of cryptography and privacy-preserving smart contracts,” in Proc. IEEE SP, San Jose, CA, May 2016, pp. 839–858.
[29] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A survey on mobile edge computing: The communication perspective,” IEEE Commun. Surveys and Tutorials, vol. 19, no. 4, pp. 2322–2358, Fourthquarter 2017.
[30] M. Dehghan, B. Jiang, A. Seetharam, T. He, T. Salonidis, J. Kurose, D. Towsley, and R. Sitaraman, “On the complexity of optimal request routing and content caching in heterogeneous cache networks,” IEEE/ACM Trans. Netw., vol. 25, no. 3, pp. 1635–1648, Jun. 2017.
[31] Y. Mao, J. Zhang, S. Song, and K. B. Letaief, “Stochastic joint radio and computational resource management for multi-user mobile-edge computing systems,” IEEE Trans. Wire. Commun., vol. 16, no. 9, pp. 5994–6009, Sep. 2017.


[32] Z. Xiong, Y. Zhang, D. Niyato, P. Wang, and Z. Han, “When mobile blockchain meets edge computing,” IEEE Commun. Mag., vol. 56, no. 8, pp. 33–39, Aug. 2018.
[33] A. Kiayias, E. Koutsoupias, M. Kyropoulou, and Y. Tselekounis, “Blockchain mining games,” in Proc. ACM Conf. Econ. Comput., Maastricht, The Netherlands, Jul. 2016, pp. 365–382.
[34] B. A. Fisch, R. Pass, and A. Shelat, “Socially optimal mining pools,” in Proc. 13th Conf. Web Inter. Eco., Bangalore, India, Dec. 2017, pp. 205–218.
[35] M. Liu, F. R. Yu, Y. Teng, V. C. M. Leung, and M. Song, “Computation offloading and content caching in wireless blockchain networks with mobile edge computing,” IEEE Trans. Veh. Tech., vol. 67, no. 11, pp. 11008–11021, Nov. 2018.
[36] X. Chen, “Decentralized computation offloading game for mobile cloud computing,” IEEE Trans. Parallel Distrib. Syst., vol. 26, no. 4, pp. 974–983, Apr. 2015.
[37] Y.-S. Chen, C.-H. Cho, I. You, and H.-C. Chao, “A cross-layer protocol of spectrum mobility and handover in cognitive LTE networks,” Simulation Modelling Practice and Theory, vol. 19, no. 8, pp. 1723–1744, Oct. 2010.
[38] Y. He, N. Zhao, and H. Yin, “Integrated networking, caching and computing for connected vehicles: A deep reinforcement learning approach,” IEEE Trans. Veh. Tech., vol. 67, no. 1, pp. 44–55, Jan. 2018.
[39] Y. Wei, F. R. Yu, M. Song, and Z. Han, “Joint optimization of caching, computing, and radio resources for fog-enabled IoT using natural actor-critic deep reinforcement learning,” IEEE Internet of Things Journal, vol. 6, no. 2, pp. 2061–2073, Apr. 2019.
[40] J. Feng, F. R. Yu, Q. Pei, X. Chu, J. Du, and L. Zhu, “Cooperative computation offloading and resource allocation for blockchain-enabled mobile edge computing: A deep reinforcement learning approach,” IEEE Internet of Things Journal, pp. 1–15, 2019, to appear, available online.
[41] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” https://arxiv.org/abs/1312.5602.
[42] V. Mnih, K. Kavukcuoglu, D. Silver, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
[43] F. Guo, F. R. Yu, H. Zhang, H. Ji, M. Liu, and V. C. M. Leung, “Adaptive resource allocation in future wireless networks with blockchain and mobile edge computing,” IEEE Trans. Wire. Commun., vol. 19, no. 3, pp. 1689–1703, Mar. 2020.
[44] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3–4, pp. 279–292, May 1992.
[45] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” https://arxiv.org/abs/1511.05952.
[46] W. Liu, P. Si, E. Sun, M. Li, C. Fang, and Y. Zhang, “Green mobility management in UAV-assisted IoT based on dueling DQN,” in Proc. IEEE Int. Conf. Commun. (ICC), Shanghai, China, May 2019, pp. 1–6.
[47] Y. He, C. Liang, F. R. Yu, and Z. Han, “Trust-based social networks with computing, caching and communications: A deep reinforcement learning approach,” IEEE Trans. Network Science and Engineering, vol. 7, no. 1, pp. 66–79, Mar. 2020.
SDN, resource management, cognitive radio networks, etc.

Meng Li (Member, IEEE) received the B.E., M.E. and Ph.D. degrees in electronic information engineering, electronics and communication engineering, and electronic science and technology from Beijing University of Technology, Beijing, P.R. China, in 2011, 2014 and 2018, respectively. He joined Beijing University of Technology in 2018, where he is currently a Lecturer. From September 2015 to September 2016, he visited Carleton University, Ottawa, ON, Canada, as a visiting Ph.D. student funded by the China Scholarship Council (CSC). His current research interests include M2M communications, Industrial Internet, intelligent edge computing, blockchain, etc.
Dr. Li has served as a Technical Program Committee (TPC) member of IEEE Globecom 2020, Globecom 2019, Globecom 2018, Globecom 2017, WCNC 2019, 5G World Forum 2020, 5G World Forum 2019 and 5G World Forum 2018.

F. Richard Yu (Fellow, IEEE) received the Ph.D. degree in electrical engineering from the University of British Columbia (UBC) in 2003. From 2002 to 2006, he was with Ericsson (in Lund, Sweden) and a start-up in California, USA. He joined Carleton University in 2007, where he is currently a Professor. He received the IEEE TCGCC Best Journal Paper Award in 2019, Distinguished Service Awards in 2019 and 2016, the Outstanding Leadership Award in 2013, the Carleton Research Achievement Award in 2012 and 2020, the Ontario Early Researcher Award (formerly Premier's Research Excellence Award) in 2011, the Excellent Contribution Award at IEEE/IFIP TrustCom 2010, the Leadership Opportunity Fund Award from the Canada Foundation for Innovation in 2009, and the Best Paper Awards at IEEE ICNC 2018, VTC 2017 Spring, ICC 2014, Globecom 2012, IEEE/IFIP TrustCom 2009 and the Int’l Conference on Networking 2005. His research interests include connected/autonomous vehicles, security, artificial intelligence, distributed ledger technology, and wireless cyber-physical systems.
He serves on the editorial boards of several journals, including as Co-Editor-in-Chief for Ad Hoc & Sensor Wireless Networks and Lead Series Editor for IEEE Transactions on Vehicular Technology, IEEE Communications Surveys & Tutorials, and IEEE Transactions on Green Communications and Networking. He has served as the Technical Program Committee (TPC) Co-Chair of numerous conferences. Dr. Yu is a registered Professional Engineer in the province of Ontario, Canada, an IEEE Fellow, an IET Fellow, and an Engineering Institute of Canada (EIC) Fellow. The Web of Science Group has identified him as a Highly Cited Researcher. He is an IEEE Distinguished Lecturer of both the Vehicular Technology Society (VTS) and the Communications Society. He is an elected member of the Board of Governors of the IEEE VTS.

Pengbo Si (Senior Member, IEEE) received the B.S. and Ph.D. degrees from Beijing University of Posts and Telecommunications in 2004 and 2009, respectively. He joined Beijing University of Technology in 2009, where he is currently a Professor. During 2007 and 2008, he visited Carleton University, Ottawa, Canada. During 2014 and 2015, he was a visiting scholar at the University of Florida, Gainesville, FL.
Dr. Si serves as an Associate Editor of the International Journal on AdHoc Networking Systems, an Editorial Board Member of Ad Hoc & Sensor Wireless Networks, and was a Symposium Chair of IEEE Globecom 2019. He has also served as a Guest Editor of the IEEE Transactions on Emerging Topics in Computing Special Issue on Advances in Mobile Cloud Computing, TPC Co-Chair of IEEE ICCC'13-GMCN, Program Vice Chair of IEEE GreenCom'13, and TPC member of numerous conferences. His research interests include blockchain, SDN, resource management, cognitive radio networks, etc.

Wenjun Wu (Member, IEEE) received the B.S. and Ph.D. degrees from Beijing University of Posts and Telecommunications, China, in 2007 and 2012, respectively. From 2012 to 2015, she was a post-doctoral researcher at Beihang University, P.R. China. She is currently an associate professor at Beijing University of Technology, P.R. China. Her research interests are in the field of mobile edge computing, blockchain and deep reinforcement learning.


Yanhua Zhang received the B.E. degree from Xi’an University of Technology, Xi’an, P.R. China, in 1982, and the M.S. degree from Lanzhou University, Lanzhou, P.R. China, in 1988. From 1982 to 1990, he was with the Jiuquan Satellite Launch Center (JSLC), Jiuquan, P.R. China. During the 1990s, he was a visiting professor at Concordia University, Montreal, Canada. He joined Beijing University of Technology, Beijing, P.R. China, in 1997, where he is currently a Professor. His research interests include QoS-aware networking and radio resource management in wireless networks.

