

A Reinforcement Learning Method for Joint Mode Selection and Power Adaptation in the V2V Communication Network in 5G
Di Zhao, Hao Qin, Bin Song*, Senior Member, IEEE, Yanli Zhang, Xiaojiang Du, Fellow, IEEE, and
Mohsen Guizani, Fellow, IEEE

Abstract—A 5G network is the key driving factor in the development of vehicle-to-vehicle (V2V) communication technology, and V2V communication in 5G has recently attracted great interest. In the V2V communication network, users can choose different transmission modes and power levels for communication to guarantee their quality-of-service (QoS), the high capacity of vehicle-to-infrastructure (V2I) links, and the ultra-reliability of V2V links. Aiming at V2V communication mode selection and power adaptation in 5G communication networks, a reinforcement learning (RL) framework based on slow fading parameters and statistical information is proposed. In this paper, our objective is to maximize the total capacity of the V2I links while guaranteeing the strict transmission delay and reliability constraints of the V2V links. Considering the fast channel variations and the continuous-valued state in a high-mobility vehicular environment, we use a multi-agent double deep Q-learning (DDQN) algorithm. Each V2V link is considered as an agent, learning the optimal policy with the updated Q-network by interacting with the environment. Experiments verify the convergence of our algorithm. The simulation results show that the proposed scheme can significantly optimize the total capacity of the V2I links and ensure the latency and reliability requirements of the V2V links.

Index Terms—5G, V2V, mode selection, power adaptation, reinforcement learning

I. INTRODUCTION

WHEN it comes to fifth generation (5G) networks, we are not only using the greater bandwidth of the 5G network to further expand our mobile application capabilities. The important idea of 5G is that, from the beginning of its design, 5G technologies are intended to integrate with each industry. Utilizing the large bandwidth, low latency, high reliability, and wide connectivity of 5G technologies [1] can profoundly change every traditional industry and even change aspects of economic and social life. One of the important areas is the transportation and automotive industries. With the evolution of cellular networks and the rapid development of smart vehicles, short-range communication through vehicle-to-vehicle (V2V) links has become one of the key candidate technologies for boosting the performance of 5G communication networks [2-5].

The 5G network has three key performance indicators (KPIs). The first indicator is the user experience rate: to meet service needs, the user experience rate is required to reach 0.1-1 Gbps [6]. The second is the connection density: access equipment with a density of millions of devices per square kilometer is deployed to meet the communication needs of densely populated scenarios. The last is the end-to-end delay, which must reach the millisecond level to meet the needs of emerging applications. Moreover, 5G communication systems are designed for high spectrum efficiency and high energy efficiency, and V2V communication can meet these requirements: the V2V links can effectively avoid cochannel interference and further improve spectrum utilization by selecting different transmission modes, and, similarly, the V2V links can improve energy efficiency by adaptively selecting the transmit power level.

Mobile communication technologies are applied to smart vehicles to realize intercommunication and coordination among vehicles. A major feature of V2V technology is that it can meet the high-speed and high-efficiency information transmission requirements of the wireless system. Generally, information applications and message sharing [7] require frequent access to servers and the internet in the vehicular communication network. This operation involves a large amount of data transmission and information exchange. The transmission data is carried by vehicle-to-infrastructure (V2I) links with high-capacity requirements, and the safety-critical information exchange is supported by V2V links with strict ultra-reliable and low-latency communication (URLLC) requirements [8]. Hence, V2V communication technology can not only provide high-rate local communications and more efficient spectrum utilization but also improve traffic safety and efficiency.

This work has been supported by the National Natural Science Foundation of China (No. 61772387), the Fundamental Research Funds of Ministry of Education and China Mobile (MCM20170202), the National Natural Science Foundation of Shaanxi Province (Grant No. 2019ZDLGY03-03), and the ISN State Key Laboratory.
D. Zhao, H. Qin and B. Song (corresponding author) are with the State Key Laboratory of Integrated Services Networks, Xidian University, 710071, China (e-mail: dizhao1002@gmail.com, bsong@mail.xidian.edu.cn, hqin@mail.xidian.edu.cn).
Y. Zhang is with the Computer Department, Guangdong AIB Polytechnic, Guangzhou, 510507, China (e-mail: zhangyanli@gdaib.edu.cn).
X. Du is with the Dept. of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA (e-mail: dxj@ieee.org).
M. Guizani is with the Dept. of Engineering, Qatar University, Qatar (e-mail: mguizani@ieee.org).


Fig. 1. A V2V communication network.

A. Related Work
In the V2V communication network, two vehicles may communicate in the conventional cellular mode via the base station (BS) or communicate directly in the device-to-device (D2D) mode. In the D2D mode, the V2V links may choose to reuse the subbands occupied by the V2I links (V2I-occupied subbands) or use unoccupied ones (V2I-unoccupied subbands). This decision process is called mode selection. There have been many studies on mode selection in D2D networks. Various mode selection strategies have been proposed in [9-10], depending on the channel conditions, power constraints and system quality-of-service (QoS) requirements. However, much of the existing work solved this problem using game-theoretical approaches [11] and graph theory [12], which cannot be applied to complicated communication scenarios due to their high computational complexity. In the V2V communication system, research on mode selection is relatively rare, and researchers mainly focus on resource management. Considering the stringent latency and reliability constraints, various resource allocation schemes have been proposed in the V2V communication network to ensure the QoS requirements of vehicles and improve the capacity of the network [13-15]. In [16], the V2X resource allocation maximizes the throughput of the V2I links, adapting to slowly varying large-scale channel fading and thereby reducing the network signaling overhead. Furthermore, a joint mode selection and resource allocation scheme is proposed to maximize the total capacity of pedestrian users and V2V links [17].
As for mode selection and resource management, much of the work is modeled as traditional optimization problems. The optimization complexity of these methods is high, and they cannot be applied to complicated communication scenarios. Although the above work can achieve considerable performance, the solutions are not intelligent enough. Apart from traditional optimization methods, reinforcement learning (RL) approaches have been developed in several recent works to address the problem of mode selection and resource allocation [18-19]. RL provides a robust way to treat environment dynamics and perform sequential decision making by constantly interacting with an uncertain environment, reducing the computational complexity. In addition, hard-to-optimize objectives can also be well addressed in an RL framework by designing a training reward that correlates with the final objective; the learning algorithm can then determine a clever strategy to approach the ultimate goal by itself. The literature [20] presented some case studies on how to apply RL tools to manage network resources in intelligent V2X networks. However, centralized control schemes incur a large transmission overhead to obtain the global network information; thus, they are not applicable to large networks. Recently, some distributed schemes for V2V communication systems have been developed. In [21], a decentralized deep Q-network (DQN) approach is proposed to improve traffic safety and satisfy the QoS requirements of vehicle users (VUEs) in a green IoV network. Liu et al. [22] proposed a distributed cooperative RL approach to manage traffic with the help of a V2X network dynamic clustering design to balance the traffic load. Ye et al. [23] proposed a decentralized resource allocation approach in V2X networks based on multi-agent deep Q-learning, considering the latency constraints of V2V links.

B. Contributions and Organization
In this work, we study an optimization problem involving mode selection and power adaptation based on V2V communication in 5G networks. We propose to support two types of vehicular connections, namely, V2I links and V2V links, to achieve the dual advantages of V2V communication networks. Our framework assumes that the V2V links can multiplex subbands for different mode operations and adaptively select the power level for energy savings. The objective of our work is to maximize the total capacity of the V2I links while meeting the constraints on the outage probability and transmission delay of the V2V links. The major contributions of this paper are summarized as follows:
1) We explore the performance of joint mode selection and power adaptation schemes by maximizing the total capacities of the V2I links while ensuring the strict URLLC requirements of the V2V links. Unlike existing works that consider subcarrier scheduling, our work optimizes interference management and power adaptation based on different transmission modes to improve the performance of the V2V communication network.
2) Considering that it is not possible to track fast-changing wireless channels in a vehicular system, we perform mode selection and power adaptation based on slow fading parameters and statistical information of the channel rather than instantaneous channel state information (CSI). Therefore, the issues presented in this paper are addressed in a decentralized scheme with low transmission overhead, where the V2V links can make decisions autonomously and there is no need for the BS to collect global information for making decisions.
3) We propose an RL framework to solve the joint mode selection and power adaptation problem, in which each V2V link is considered to be an agent, intelligently making adaptive decisions based on continuous-valued state variables in a highly dynamic V2V communication environment to learn the optimal policy. Further, joint mode selection and power adaptation with the double deep Q-learning (DDQN) algorithm is proposed, in which a new reward function is introduced to manage interference and power to improve the performance of the V2V communication network. Moreover, the replay buffer is used to improve data utilization, and the target network is applied to improve the stability of the network.


Fig. 2(a). Potential n-hop links between V_a and V_b. Fig. 2(b). Three communication links established by V_a.

The rest of the paper is organized as follows. The system model is introduced in Section II. Section III presents the RL framework to solve the problem formulation of joint mode selection and power adaptation; we focus on maximizing the total capacity of the V2I links while guaranteeing the transmission delay and reliability constraints of the V2V links. The simulation results and analysis are shown in Section IV. Section V concludes the paper.

II. SYSTEM MODEL AND PROBLEM FORMULATION

A. System Model
The uplink spectrum resources are underutilized compared with the downlink spectrum resources; to improve the utilization of the uplink spectrum, the following study is based on the uplink.
A cellular-based vehicular communication network is shown in Fig. 1. In this model, the vehicular communication network consists of two parts: R V2I links between the BS and VUEs that support high-data-rate services, and J V2V links that communicate among close VUEs to transmit safety messages for safe driving [24]. Let ℛ = {1, 2, …, R} and 𝒥 = {1, 2, …, J} denote the set of V2I links and the set of V2V links, respectively. Furthermore, we assume that all transceivers use a single antenna and that the VUEs are uniformly located in the region covered by the BS.
We adopt a scheme in which vehicles have a pool of radio resources [25]. We assume that there is a set 𝒩 = {1, 2, …, N} of N = R + U orthogonal subbands assigned to the VUEs. R is the number of V2I-occupied subbands, which means that R subbands have been preoccupied by the R V2I links, and U is the number of V2I-unoccupied subbands. Moreover, any subband can be chosen autonomously for V2V communication. We assume that one subband can only be allocated to at most one V2I link and that one V2I link can only occupy a single subband. Any V2V link can use only one subband, either V2I-occupied or V2I-unoccupied. Additionally, a subband can be occupied by multiple V2V links.
We define the channel power gains of the rth V2I link and the jth V2V link as g_r and g_j, respectively. Let g_{j,b} denote the interfering channel gain from the jth V2V transmitter to the BS. The interference channel gain from the rth V2I link to the jth V2V link is denoted as g_{j,r}, and the interference channel gain from the j'th V2V link to the jth V2V link is denoted as g_{j,j'}.

B. Establishing Reliable V2V Links
The 5G communication technology has a wide range of applications for satisfying service requirements in vehicular networks. In the V2V communication network, a reliable link QoS is especially vital and very stringent, i.e., almost 100% reliability and millisecond latency [26]. Under these requirements, reliable message transmissions are guaranteed for various QoS requirements by exchanging safety information among the VUEs. The limit on the data rate is relatively loose for V2V links; however, the constraints on reliable and real-time data link communication are tight. In our scenario, two main aspects are considered for real-time and reliable data transmission in the vehicular network: low data collision and low delay.
As illustrated in Fig. 2(a), vehicle V_a establishes mmWave multi-hop communication links to communicate with vehicle V_b. We assume that the potential links between vehicle V_a and vehicle V_b constitute a link set ℒ^{ab}, with ℒ^{ab} ⊇ ℒ_n^{ab}, n ∈ {1, 2, …, n_max}, where ℒ_n^{ab} is the set of n-hop links and n_max is decided by the practical V2V network. We choose one link from the set ℒ^{ab} as an example to analyze the link QoS; the analysis method for other links is similar. When all links between vehicle V_a and vehicle V_b are traversed, the link with the best QoS performance will be selected as the actual communication link.
Consider an n-hop link l_n. We mark vehicle V_a as 1, vehicle V_b as n + 1, and the relay vehicles as {2, 3, …, n} in turn. Accordingly, the adjacent distances are expressed as {d_1, d_2, …, d_n}, and each adjacent distance should be within the maximum mmWave transmission range. The total transmission distance between vehicle V_a and vehicle V_b is therefore \sum_{i=1}^{n} d_i, which is recorded as D.


Considering low data collision and low transmission delay, the performance of the V2V link can be defined as

P_{l_n} = \alpha C_{l_n} + \beta D_{l_n}    (1)

where the weight factors α and β represent the degrees of importance of the corresponding requirements, with α + β = 1 and α, β ∈ (0, 1). C_{l_n} and D_{l_n} denote the connectivity probability and the transmission delay metric of link l_n, respectively.
The vehicle arrival process is usually modeled as a Poisson process. Therefore, the connectivity probability can be expressed in terms of the cumulative probability distribution of n + 1 vehicles in link l_n as [27]

C_{l_n} = \frac{(\rho_v D)^{n+1} e^{-\rho_v D}}{(n+1)!}    (2)

where ρ_v is the spatial density of vehicles. Moreover, the transmission delay metric is given by

D_{l_n} = 1 - \frac{T_{l_n}}{T_{tol}}    (3)

where T_{tol} is the maximum delay of the transmission link that can be tolerated, and T_{l_n} is the transmission delay of link l_n. If the value of T_{l_n} is within the tolerable range of the delay, then D_{l_n} ∈ [0, 1); otherwise, when T_{l_n} exceeds the upper limit of the delay, D_{l_n} < 0. As a result, the range of D_{l_n} is D_{l_n} ∈ (-∞, 1). In summary, the established optimal link between vehicle V_a and vehicle V_b can be denoted as

l^* = \arg\max_n \{P_{l_n}\}, \quad n \in \{1, 2, \ldots, n_{max}\}    (4)

In addition, we make a reasonable simplification: we assume that a vehicle only communicates with three vehicles, as shown in Fig. 2(b), and that each link is established in accordance with the method described above.
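To make the link-selection rule concrete, the following minimal Python sketch scores every candidate n-hop link by Eq. (1) and returns the hop count that maximizes P_{l_n}, as in Eq. (4). The values used for ρ_v, T_tol, α, β and the candidate links are illustrative assumptions, not parameters reported in this paper.

```python
import math

def connectivity_probability(n, rho_v, D):
    """Eq. (2): probability of finding n+1 vehicles on a link of total length D."""
    return (rho_v * D) ** (n + 1) * math.exp(-rho_v * D) / math.factorial(n + 1)

def delay_metric(T_ln, T_tol):
    """Eq. (3): positive when the link delay T_ln is within the tolerable delay T_tol."""
    return 1.0 - T_ln / T_tol

def best_link(candidates, rho_v, T_tol, alpha=0.5, beta=0.5):
    """Eqs. (1) and (4): pick the hop count n with the largest weighted score.
    `candidates` maps hop count n -> (total distance D, link delay T_ln)."""
    scores = {
        n: alpha * connectivity_probability(n, rho_v, D) + beta * delay_metric(T_ln, T_tol)
        for n, (D, T_ln) in candidates.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical candidate links between V_a and V_b: n -> (D in meters, delay in ms).
candidates = {1: (120.0, 30.0), 2: (150.0, 45.0), 3: (180.0, 70.0)}
n_star, scores = best_link(candidates, rho_v=0.01, T_tol=100.0)
print(n_star, scores)
```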


C. Communication Mode Description
In a wireless system, the transmitted electromagnetic wave interacts in a complicated way with the medium before it is received at the receiver [28]. One basic type of non-Gaussian channel that frequently occurs in practice is the fading channel model. Generally, we use two types of fading, small-scale fading and large-scale fading, to describe the channel gain of the mobile radio channel. Small-scale fading is a characteristic of radio propagation resulting from objects that cause multiple reflections, diffraction and scattering before the signal is received at the receiver, and it is modeled by the Rayleigh distribution. Large-scale fading consists of path loss and shadowing. Path loss is the attenuation in the power density of an electromagnetic wave as it propagates through space; fading that represents the average signal power attenuation or path loss due to motion over large areas is commonly known as shadowing.
The principles for establishing V2V links are described in the previous section. Mode selection solves the problem of whether to reuse subbands or use dedicated subbands. Under the given assumptions, there are three communication modes in the V2V communication network.

Fig. 3. Three types of communication modes.

Cellular Mode: As shown in Fig. 3(a), VUEs communicate through the BS as traditional cellular users in the cellular mode. Because the two users are far apart, the channel gain is poor, and a V2V link cannot be established between them. In this case, two subbands (one uplink and one downlink) will be assigned to them, and no other V2V links reuse their spectrum resources. Hence, two VUEs working in the cellular mode are identical to traditional cellular users. This mode has the lowest utilization of channel resources among all modes; it is chosen when the two other modes are not available due to interference issues. The uplink signal-to-noise ratio (SNR) of the VUEs can be written as

\gamma_j^{up}[m] = \frac{P_j[m] \cdot g_j^{up}[m]}{\sigma^2}    (5)

where P_j[m] is the transmit power of the jth V2V link allocated to the mth subband, g_j^{up} is the uplink signal channel gain of the jth V2V link, and σ^2 is the power of the additive white Gaussian noise (AWGN). Similarly, the downlink SNR is given by

\gamma_j^{down}[m] = \frac{P_b[m] \cdot g_j^{down}[m]}{\sigma^2}    (6)

where P_b[m] is the transmit power of the BS and g_j^{down} is the downlink signal channel gain of the jth V2V link. Meanwhile, we assume that γ^{down} ≥ γ^{up} is guaranteed for the downlink [29].
Dedicated Mode: In the dedicated mode, two VUEs are nearby and can communicate with each other directly, as shown in Fig. 3(b). In this case, the V2V link only requires one dedicated V2I-unoccupied subband. However, this dedicated subband can be reused by other V2V links, and interference between them may occur. Compared to the cellular mode, the dedicated mode consumes less spectrum resources. Additionally, the energy efficiency can be improved due to the proximity of the VUEs. The signal-to-interference-plus-noise ratio (SINR) of the V2V link over the V2I-unoccupied subband is expressed as

\gamma_j[m] = \frac{P_j[m] \cdot g_j[m]}{\sigma^2 + \sum_{j' \neq j} \rho_{j'}[m] P_{j'}[m] \cdot g_{j,j'}[m]}    (7)

where P_{j'}[m] is the transmit power of the j'th V2V link over the mth V2I-unoccupied subband, and ρ_{j'}[m] is a binary indicator: ρ_{j'}[m] = 1 when the mth subband is occupied by the j'th V2V link, and ρ_{j'}[m] = 0 otherwise. Based on the previous assumption, a V2V link can use at most one subband, so \sum_m \rho_{j'}[m] \le 1.
Reuse Mode: In the reuse mode, two VUEs establish a V2V link for communicating directly because they are close together. As shown in Fig. 3(c), the V2V link reuses a V2I-occupied subband; as a consequence, the spectral efficiency can be further improved. However, this raises the question of how to properly manage the cochannel interference so as not to degrade the performance of either the V2I link or the V2V link. In this mode, the V2I link is subject to interference from the V2V links with which it shares the same subband, and the number of such V2V links may be more than one. Therefore, the uplink SINR of the V2I link can be written as

\gamma_r[m] = \frac{P_r[m] \cdot g_r[m]}{\sigma^2 + \sum_{j} \rho_j[m] P_j[m] \cdot g_{j,b}[m]}    (8)

where P_r[m] denotes the transmit power of the rth V2I link. From another perspective, the V2V link will also suffer from the interference generated by the V2I link and the other V2V links sharing the same subband. Then, the uplink SINR of the V2V link is given by

\gamma_j[m] = \frac{P_j[m] \cdot g_j[m]}{\sigma^2 + P_r[m] \cdot g_{j,r}[m] + \sum_{j' \neq j} \rho_{j'}[m] P_{j'}[m] \cdot g_{j,j'}[m]}    (9)

In the vehicular communication network, each V2V link will select one of the abovementioned communication modes.

D. Problem Formulation
In this situation, the corresponding capacity of each mode is obtained as

Cap = W \log_2(1 + \gamma)    (10)

where W is the bandwidth of each spectrum subband, and γ is the SINR or SNR of the corresponding link. In this paper, considering the coexistence of the V2V and V2I links, our goal is to maximize the throughput of the V2I links while ensuring that the limitations of the V2V links, such as transmission delay and reliability, are met. The joint mode selection and power adaptation problem in the V2V communication network can be mathematically formulated as

\max \sum_m \left[ W \log_2(1 + \gamma_j^{up}[m]) + W \log_2(1 + \gamma_r[m]) \right]
\text{s.t.} \quad D_{l_n} \in [0, 1)
\quad\quad 1 - C_{l_n} \le C_{tol}    (11)
\quad\quad P_j, P_{j'}, P_r, P_b \in \{Pow_1, Pow_2, Pow_3\}

where C_tol denotes the tolerable outage probability of the V2V links, and Pow_1, Pow_2 and Pow_3 represent the three levels of transmit power.
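As a concrete illustration of Eqs. (7)-(11), the sketch below computes the reuse-mode SINRs of one V2I link and an interfering V2V link and then the capacity term that enters the objective. All channel gains, powers and the noise value are made-up numbers for illustration; only the bandwidth follows the 180 kHz per RB used later in Table III.

```python
import math

W = 180e3          # subband bandwidth in Hz (180 kHz per RB, as in Table III)
NOISE = 1e-13      # sigma^2, illustrative noise power in watts

def dbm_to_watt(p_dbm):
    """Convert a transmit power from dBm to watts."""
    return 10 ** ((p_dbm - 30) / 10.0)

def v2i_sinr_reuse(P_r, g_r, v2v_sharing):
    """Eq. (8): V2I uplink SINR when V2V links reuse the same subband.
    v2v_sharing is a list of (P_j, g_jb) pairs for the interfering V2V links."""
    interference = sum(P_j * g_jb for P_j, g_jb in v2v_sharing)
    return P_r * g_r / (NOISE + interference)

def v2v_sinr_reuse(P_j, g_j, P_r, g_jr, other_v2v):
    """Eq. (9): V2V SINR under interference from the V2I link and other V2V links."""
    interference = P_r * g_jr + sum(P_jp * g_jjp for P_jp, g_jjp in other_v2v)
    return P_j * g_j / (NOISE + interference)

def capacity(sinr):
    """Eq. (10): Shannon capacity over one subband."""
    return W * math.log2(1.0 + sinr)

# Illustrative case: one V2I link whose subband is reused by one V2V link.
P_r, P_j = dbm_to_watt(23), dbm_to_watt(10)
g_r, g_j, g_jb, g_jr = 1e-9, 5e-9, 1e-11, 2e-11   # made-up channel gains

gamma_r = v2i_sinr_reuse(P_r, g_r, [(P_j, g_jb)])
gamma_j = v2v_sinr_reuse(P_j, g_j, P_r, g_jr, [])
print(gamma_r, gamma_j, capacity(gamma_r))         # one term of the objective in Eq. (11)
```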
III. RL FOR THE JOINT MODE SELECTION AND POWER ADAPTATION PROBLEM IN THE V2V COMMUNICATION NETWORK

The purpose of RL is to solve the stochastic and multi-objective optimization problem by learning to approximate the policy. It can be used for real-time pattern selection to improve the satisfaction of each target indicator. In our scenario, limited spectrum resources are competitively occupied by multiple V2V links, and different transmission modes are formed by the way the V2V links occupy the spectrum resources; this can be modeled as a multi-agent RL problem when each V2V link is considered to be an agent. Hence, the V2V links can make intelligent decisions in the V2V communication network. In this section, the framework of deep reinforcement learning (DRL) for mode selection and power adaptation in V2V communications is introduced, including the representation of the key elements and a DDQN learning algorithm for the problem of joint mode selection and power adaptation.

A. Reinforcement Learning
Generally, RL consists of five elements: agent, environment, state, action and reward. The agent learns by constantly interacting with the environment. At each moment, the environment is in a state from which the agent obtains observations. The agent acts on the basis of the observed values in conjunction with its own history of behavior, which is generally called a strategy. The action affects the state of the environment, causing certain changes in it. The agent then obtains two pieces of information from the changing environment: observations of the new environment and the reward. After all of the above steps are completed, the agent can perform new actions based on the new observations.
The Markov decision process (MDP) is generally used as a formal model for RL. We use s_t for the observation of the environment at time t, a_t for the action taken at time t, and r_t for the immediate reward obtained after the action a_t is executed. The process of RL can be represented by a state-action-reward chain as

\{s_0, a_0, r_0, \ldots, r_{t-1}, s_t, a_t, r_t, s_{t+1}\}    (12)
RL has two salient features: one is constant trial, and the other is an emphasis on long-term returns. For the first feature, the agent needs to constantly try, coping with the various possible behaviors of the state and collecting the corresponding rewards; only by collecting such feedback information can the learning task be completed well. For the second one, the reward obtained in one or two steps becomes less important over a long horizon. Maximizing long-term returns is often difficult, and it requires more complex and precise actions.
Ideally, every action should be taken to maximize the long-term return. We multiply the future returns by a discount factor ξ to reduce the impact of the current actions on the future returns. Therefore, the long-term return is given by

Ret_t = E\left[\sum_{m=0}^{\infty} \xi^m r_{t+m+1}\right]    (13)

Value functions are used to represent the value of a strategy. According to the model form of the MDP, value functions can be divided into two types. One type is the state value function V_π(s_t), that is, the expectation of the long-term return generated by acting according to a certain strategy when the current state s_t is known:

V_\pi(s_t) = E_\tau\left[\sum_{m=0}^{\infty} \xi^m r_{t+m+1} \,\middle|\, s_t = s\right]    (14)

where τ denotes the sequence sampled by the policy and the state transitions. The other type is the expectation of the long-term return generated by certain strategic actions under the premise of knowing the current state s_t and action a_t, called the state-action value function Q_π(s_t, a_t):

Q_\pi(s_t, a_t) = E_\tau\left[\sum_{m=0}^{\infty} \xi^m r_{t+m+1} \,\middle|\, s_t = s, a_t = a\right]    (15)

B. Mapping the Scenario to the Key Elements of RL
Agent: Each active V2V link is an agent. In the V2V communication network, the BS cannot obtain accurate CSI in real time because of the high mobility of the vehicles; hence, a centralized method has little effect. To solve this problem, we propose a distributed algorithm based on slow fading parameters and statistical information of the channel. In this algorithm, an agent does not need to know all the information before making a decision, and the transmission overhead is very small. Therefore, the problem of joint mode selection and power adaptation in the V2V communication network is more suitable for implementation in a decentralized way. Based on the above descriptions, each V2V link is considered to be an agent that interacts with the communication environment. The agents make intelligent decisions based on partial observations of the environment and information exchanged among them, including the adaptive selection of the communication mode and transmit power. Generally, the agent can accumulate experience and adjust its action strategy.
State: Let us define the infinite set for the state space as 𝒮. The true environment state s_t should include the global channel conditions and all the agents' behaviors. For each V2V link, the observed state information is limited and can be depicted as s_t = [φ_t^{SV}, φ_t^{VI}, φ_t^{VV}, φ_t^{VB}, φ_{t-1}^{CS}, φ_t^{TL}, φ_t^{QR}], where φ_t^{SV} represents the observed signal channel information of the V2V link, φ_t^{VI} indicates the interference channels from all V2I transmitters, φ_t^{VV} denotes the interference channels from other V2V transmitters, φ_t^{VB} is the interference channel from its own transmitter to the BS, and φ_{t-1}^{CS} is the channel selection status of the neighbors at the previous moment, which reveals the other agents' behaviors. Additionally, φ_t^{TL} represents the remaining transmission load of the V2V link and φ_t^{QR} denotes the QoS requirements.
Action: In this paper, we focus on the mode selection and power adaptation issues in the V2V communication network. Hence, the action is defined as a_t = (χ_t^{MS}, χ_t^{PC}), where χ_t^{MS} is the communication mode selected by the V2V link at time t, and χ_t^{PC} is the power level chosen by the V2V link in the selected mode at time t. In the V2V communication network, we divide the communication modes into three types as

\chi_t^{MS} = \begin{cases} 0, & \text{Cellular Mode} \\ 1, & \text{Dedicated Mode} \\ 2, & \text{Reuse Mode} \end{cases}    (16)

Note that, through the subband occupied by the V2V link and the preassignment of the V2I links, we further determine the communication mode selected by the V2V link. In addition, we discretize the continuous power into three levels in our scenario, so the action space is 9-dimensional.
Reward: The learning process is driven by the reward function in the RL framework, and the system performance can be improved when the design of the per-step reward function is related to the desired objective. Consistent with the goal of maximizing the reward in RL, the performance of the system is optimal when the algorithm finds a sufficiently smart strategy to achieve the final objective. Considering the difference in the requirements for the different types of links in the V2V communication network, the capacities of the V2V and V2I links as well as the latency and reliability constraints are modeled in the reward function. Hence, we propose a new reward function based on the following analysis:

r_t = \eta_1 (Cap_r + Cap_j) + \eta_2 C_{l_n} + \eta_3 D_{l_n}    (17)

where η_1, η_2 and η_3 are the weight factors for the contributions of these three parts to the reward, and Cap_r and Cap_j denote the total capacities of the V2I and V2V links, respectively. The total capacity of the V2I links Cap_r can be written as

Cap_r = \sum_m \left[ W \log_2(1 + \gamma_j^{up}[m]) + W \log_2(1 + \gamma_r[m]) \right]    (18)


and Cap_j can be depicted as

Cap_j = \sum_m W \log_2(1 + \gamma_j[m])    (19)

If the policy is optimal, i.e., the optimal transmission mode and power level are selected, the V2I links will achieve the largest throughput while the connectivity probability and transmission delay of the V2V links are guaranteed. At this point, the performance of the V2V communication network is optimal, and as a result, the reward value is high. In contrast, a lower reward is given when a low-transmission-rate event happens or the constraints are violated.
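A minimal sketch of how the 9-dimensional action space of Eq. (16) and the reward of Eqs. (17)-(19) could be encoded is given below. The weight values η_1, η_2, η_3 are placeholders chosen only for illustration; the three power levels follow Table III.

```python
MODES = ("cellular", "dedicated", "reuse")       # chi^MS in Eq. (16): 0, 1, 2
POWER_LEVELS_DBM = (23, 10, 5)                   # chi^PC: three discrete levels (Table III)

def decode_action(a):
    """Map an action index 0..8 to (mode, power level); the action space is 3 x 3 = 9."""
    mode, power = divmod(a, len(POWER_LEVELS_DBM))
    return MODES[mode], POWER_LEVELS_DBM[power]

def reward(cap_r, cap_j, c_ln, d_ln, eta=(1e-6, 1.0, 1.0)):
    """Eq. (17): weighted sum of V2I+V2V capacity, connectivity probability and delay metric.
    eta holds the weight factors eta_1, eta_2, eta_3 (illustrative values)."""
    eta1, eta2, eta3 = eta
    return eta1 * (cap_r + cap_j) + eta2 * c_ln + eta3 * d_ln

for a in range(9):
    print(a, decode_action(a))
print(reward(cap_r=2.5e6, cap_j=1.2e6, c_ln=0.85, d_ln=0.6))
```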

C. Learning Algorithm
In our scenario, the number of states is large, and the state values are continuous. Generally, a table-lookup algorithm would be inefficient or even impossible for solving the above problem. A common way to solve such problems is to no longer store the values of the states or behaviors in a tabular way, but to introduce appropriate parameters, select appropriate features to describe the states, and estimate the values of states or behaviors by constructing a certain function. Here, we use an approximate representation of the state-action value Q(s_t, a_t). The function Q(s_t, a_t, ω) with parameter ω accepts the state variable s_t and the behavior variable a_t, and the value of the parameter ω is constantly adjusted to match the final behavior values based on the strategy π.
The Q-learning algorithm selects the action that can achieve the maximum reward based on the state-action value Q(s_t, a_t). Q-learning uses the current return and the estimated value of the next moment, obtained by taking the action that maximizes the value, to estimate the value of the current moment. The update formula for Q-learning can be written as

Q(s_t, a_t) = Q(s_{t-1}, a_{t-1}) + \frac{1}{M}\left[ r_{t-1} + \xi \max_{a'} Q(s_t, a') - Q(s_{t-1}, a_{t-1}) \right]    (20)

where M is the size of the mini-batch. The Q-learning algorithm does not follow the sequence of interactions but instead selects the action that maximizes the value at the next moment.
We use neural networks to approximate the state-action value Q(s_t, a_t). The approximation via neural networks is a nonlinear approximation; its basic unit is a neuron that can perform nonlinear transformations. Through the multilayered arrangement of neurons and interlayer interconnections, a complex nonlinear approximation is realized. We use a single neural network to represent all actions at once and associate a Q value with each action.
The DQN algorithm combines Q-learning and deep learning. The main innovations of the DQN are the replay buffer and the target network. In the DQN algorithm, there are two problems: a certain correlation between the sequences obtained by the interaction, and the efficiency of the use of the interactive data. To solve these problems, the replay buffer is designed, which involves collecting samples and sampling them. The collected samples are stored in the buffer in chronological order; if the replay buffer is already full, a new sample overwrites the oldest one. For sampling, the replay buffer randomly samples a batch of samples from the cache for learning. Another model with exactly the same structure is also introduced in the DQN algorithm. This model is called the target network, and the original model is called the behavior network. At the beginning of the training, the two networks use the same parameters. During the training process, the behavior network is responsible for interacting with the environment and obtaining interactive samples. Specifically, the value calculated by the target network is compared to the estimate of the behavior network to obtain the target value and update the behavior network. Whenever the training completes a certain number of iterations, the parameters of the behavior network are synchronized to the target network, so that the next phase of learning can be performed.
The target value computed by the target network can be defined as

y_t = r_t + \xi \max_{a'} Q(s_{t+1}, a'; \omega^-)    (21)

where ω^- indicates the parameters of the target network. Specifically, the formula can be further written as

y_t = r_t + \xi Q(s_{t+1}, \arg\max_{a'} Q(s_{t+1}, a'; \omega^-); \omega^-)    (22)

After expanding the formula, we find that, when using the target network in this way, the model still uses the same parametric model when selecting the optimal action and calculating the target value, which will inevitably lead to an overestimation of the value.


To minimize the impact of overestimation, a simple approach is to separate the selection of the optimal action from the estimation of its value. We use the behavior network to complete the selection of the optimal action, so we can obtain

y_t = r_t + \xi Q(s_{t+1}, \arg\max_{a'} Q(s_{t+1}, a'; \omega); \omega^-)    (23)

In the training process, we use a variant of the stochastic gradient descent (SGD) method, named mini-batch SGD, to optimize the network parameters. At each episode, a mini-batch of samples is uniformly sampled from the replay buffer ℬ for updating ω to minimize the following loss function:

Loss(\omega) = \left[ y_t - Q(s_t, a_t; \omega) \right]^2 / M    (24)

where M is the size of the mini-batch and

y_t = \begin{cases} r_t, & \text{for a terminal } s_{t+1} \\ r_t + \xi \max_{a'} Q(s_{t+1}, a'; \omega^-), & \text{otherwise} \end{cases}    (25)

TABLE I
DDQN ALGORITHM FOR THE V2V COMMUNICATION NETWORK

Algorithm: DDQN Algorithm for Joint Mode Selection and Power Adaptation in the V2V Communication Network
Input:
  Initialize the replay buffer ℬ with a capacity of Z
  Initialize the behavior network Q and its parameters ω
  Initialize the target network Q̂ and its parameters ω^-
Start:
  for episode = 1, …, M do
    Initialize the environment and get the initial state s_0
    for t = 1, …, T do
      Randomly select an action with probability ε, or choose the current best action based on the model, a_t = arg max_a Q(s_t, a; ω)
      Perform action a_t, obtain the corresponding reward r_t and the new state s_{t+1}
      Store the transition {s_t, a_t, r_t, s_{t+1}} in ℬ
      Sample a random mini-batch (of size M) of {s_t, a_t, r_t, s_{t+1}} from ℬ
      Calculate the predictions y_t according to Eq. (25)
      Perform mini-batch gradient descent on the objective [y_t − Q(s_t, a_t; ω)]^2 / M
      Update the parameters ω^- ← ω every E rounds
    end for
  end for
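The following NumPy sketch shows how the double-DQN target of Eq. (23) and the mini-batch loss of Eqs. (24)-(25) could be computed for a sampled batch. The array shapes and the q-value placeholders are assumptions made for illustration and do not reproduce the authors' implementation.

```python
import numpy as np

def ddqn_targets(rewards, q_next_behavior, q_next_target, done, xi=0.5):
    """Eq. (23): the behavior network (omega) selects argmax_a', the target
    network (omega^-) evaluates it; Eq. (25) keeps only r_t for terminal steps."""
    best_actions = np.argmax(q_next_behavior, axis=1)              # argmax_a' Q(s_{t+1}, a'; omega)
    bootstrap = q_next_target[np.arange(len(rewards)), best_actions]
    return rewards + xi * bootstrap * (1.0 - done)

def loss(targets, q_pred, actions):
    """Eq. (24): mean squared error between y_t and Q(s_t, a_t; omega) over the mini-batch."""
    q_taken = q_pred[np.arange(len(actions)), actions]
    return np.mean((targets - q_taken) ** 2)

# Toy mini-batch of M = 4 transitions over the 9-dimensional action space.
rng = np.random.default_rng(0)
M, A = 4, 9
rewards = rng.random(M)
done = np.array([0.0, 0.0, 1.0, 0.0])
q_next_behavior, q_next_target, q_pred = rng.random((M, A)), rng.random((M, A)), rng.random((M, A))
actions = rng.integers(0, A, size=M)

y = ddqn_targets(rewards, q_next_behavior, q_next_target, done)
print(loss(y, q_pred, actions))
```

Here xi = 0.5 matches the discount factor reported in Section IV; all other numbers are arbitrary.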
Fig. 4. An illustration of DDQN for joint mode selection and power adaptation in the V2V communication network in 5G.

An illustration of the DDQN for joint mode selection and power adaptation in the V2V communication network in 5G is shown in Fig. 4, and it is described as follows. The first step is to initialize the replay buffer ℬ with capacity Z as well as the behavior network and the target network; at the beginning of the training, the two state-action value models use the same parameters. Next, the environment is initialized. Our environment simulator consists of vehicles, V2I channels and V2V channels, in which the vehicles are randomly dropped and the channels are generated based on the positions of the vehicles. We obtain the observations of the state [φ_t^{SV}, φ_t^{VI}, φ_t^{VV}, φ_t^{VB}, φ_{t-1}^{CS}, φ_t^{TL}, φ_t^{QR}] from the environment simulator at each time step. During the training process, the behavior network is responsible for interacting with the environment and obtaining interactive samples. At each time step, the state vector s_t is sent as input to the behavior network, and the behavior network provides 9 output values; these nine outputs Q(s_t, a_t, ω) represent the Q-values of the actions in a given state. To balance exploration and exploitation, we adopt the ε-greedy policy: a random number is generated, and if it is less than a certain value, the behavior network selects a random action; otherwise, the action with the highest output is chosen. With the selected mode and power level of the V2V link, the environment simulator gives the corresponding reward value, and the observations of the next state can be obtained. We follow the DDQN with the replay buffer, where the generated data are stored in a memory; each sample includes s_t, a_t, r_t and s_{t+1}. The mini-batch data for updating the behavior network are sampled from this memory in each iteration. During the learning process, the target value is calculated by the target network through Eq. (25) and then compared with the estimate of the behavior network to obtain the loss value through Eq. (24). By constantly updating the network parameters, the value of the loss function decreases; when it reaches the global minimum, the corresponding optimal strategy can be derived. The policy for selecting the mode and power level of each V2V link is random at the beginning and gradually improves with the updated state-action value model. Whenever the training completes a certain number of iterations, the parameters of the behavior network are synchronized to the target network so that the next learning phase can be performed.
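Putting the pieces of Table I and the Fig. 4 walkthrough together, a compact training-loop sketch might look as follows. The environment object, the update and synchronization method names, and all hyperparameter values are assumptions made for illustration; the Q-network is abstracted as a callable returning 9 Q-values.

```python
import random
from collections import deque

import numpy as np

def train(env, q_behavior, q_target, episodes=100, steps=50,
          buffer_size=10_000, batch_size=32, epsilon=0.1, xi=0.5, sync_every=100):
    """Sketch of the DDQN loop in Table I: epsilon-greedy interaction, replay buffer,
    mini-batch updates of the behavior network, periodic sync to the target network.
    `env`, `q_behavior.update(...)` and `q_target.load_from(...)` are hypothetical APIs."""
    replay = deque(maxlen=buffer_size)            # oldest samples are overwritten when full
    updates = 0
    for _ in range(episodes):
        state = env.reset()                       # initialize the environment, get s_0
        for _ in range(steps):
            # epsilon-greedy: random action with probability epsilon, otherwise greedy
            if random.random() < epsilon:
                action = random.randrange(9)
            else:
                action = int(np.argmax(q_behavior(state)))
            next_state, reward, done = env.step(action)
            replay.append((state, action, reward, next_state, done))
            state = next_state

            if len(replay) >= batch_size:
                batch = random.sample(replay, batch_size)
                s, a, r, s2, d = map(np.array, zip(*batch))
                # double-DQN target: behavior net selects, target net evaluates (Eq. (23))
                best = np.argmax(q_behavior(s2), axis=1)
                y = r + xi * q_target(s2)[np.arange(batch_size), best] * (1 - d)
                q_behavior.update(s, a, y)          # one mini-batch SGD step on Eq. (24)
                updates += 1
                if updates % sync_every == 0:
                    q_target.load_from(q_behavior)  # omega^- <- omega every E rounds
            if done:
                break
```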
In addition, we analyze the transmission overhead of the centralized and distributed schemes. In a centralized scheme, all the information, including g_r, g_j, g_{j,b}, g_{j,r} and g_{j,j'}, is transmitted to the BS, which makes decisions about mode selection and power adaptation based on the global information; hence, the transmission overhead is large. Moreover, sufficient transmission overhead is also needed when the BS collects the feedback information from the VUEs. Additionally, the transmission overhead grows linearly with the mobile speed and quadratically with the number of vehicles; thus, the centralized control scheme is not applicable to large networks. In contrast, in the distributed algorithm the agents make decisions based on partial observations of the environment (e.g., g_j, g_{j,j'}, g_{j,b} and g_{j,r}). Further, except for g_{j,b}, this information can be accurately estimated by the receiver of the V2V link at the beginning of each time slot and can be made available instantaneously at the transmitter through feedback; g_{j,b} is estimated at the BS in each time slot and then broadcast to all VUEs, which incurs only a small transmission overhead. Compared with centralized schemes, the amount of channel information that is transferred is greatly reduced in the distributed scheme; therefore, the transmission overhead decreases greatly accordingly.

IV. SIMULATION AND EVALUATION

In this section, numerical results are presented to demonstrate the performance of the joint mode selection and power adaptation method. We further investigate the parameter selection for the multi-agent DDQN algorithm. In addition, we compare the method with the following approaches: 1) the optimal solution of the optimization problem (11) calculated with full knowledge of the environment information; 2) the centralized DQN method; and 3) the random search solution.


TABLE II
CHANNEL MODELS

Parameter                       | V2I Link                        | V2V Link
Path loss model                 | 128.1 + 37.6 log10(d), d in km  | LOS in WINNER+ B1 Manhattan [31]
Shadowing standard deviation    | 8 dB                            | 3 dB
Shadowing distribution          | Log-normal                      | Log-normal
Path loss and shadowing update  | A.1.4 in [30], every 100 ms     | A.1.4 in [30], every 100 ms
Decorrelation distance          | 50 m                            | 10 m
Fast fading                     | Rayleigh fading                 | Rayleigh fading
Fast fading update              | Every 1 ms                      | Every 1 ms

TABLE III
SIMULATION PARAMETERS

Parameter                       | Value
Carrier frequency               | 2 GHz
RB bandwidth                    | 180 kHz
Number of lanes                 | 12
Number of RBs                   | 30
Number of V2I-unoccupied RBs    | 10
Number of V2I-occupied RBs      | 20
BS antenna height               | 25 m
BS antenna gain                 | 8 dBi
BS receiver noise figure        | 5 dB
Vehicle antenna height          | 1.5 m
Vehicle antenna gain            | 3 dBi
Vehicle receiver noise figure   | 9 dB
V2I transmit power              | 23 dBm
V2V transmit power              | [23, 10, 5] dBm
Noise power                     | -114 dBm
Vehicle speed                   | [10, 20, 30, 40, 50, 60] km/h
V2V communication distance      | 100 m
Latency constraint of V2V links | 100 ms

Fig. 5. The training loss of the DDQN algorithm.
Fig. 6. The expected reward of the Q-network.
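The V2I channel gain in Table II combines path loss, log-normal shadowing and Rayleigh fast fading. A small sketch of how these components could be combined into a single linear-scale gain is given below; the distance value and the way the components are multiplied are illustrative assumptions rather than the simulator's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def v2i_path_loss_db(d_km):
    """Table II V2I path loss model: 128.1 + 37.6 log10(d), with d in km."""
    return 128.1 + 37.6 * np.log10(d_km)

def v2i_channel_gain(d_km, shadow_std_db=8.0):
    """Combine path loss, log-normal shadowing (8 dB std) and Rayleigh fast fading."""
    shadowing_db = rng.normal(0.0, shadow_std_db)
    large_scale_db = -(v2i_path_loss_db(d_km) + shadowing_db)
    fast_fading = rng.rayleigh(scale=1.0 / np.sqrt(2.0)) ** 2   # unit-mean power assumption
    return 10 ** (large_scale_db / 10.0) * fast_fading

print(v2i_channel_gain(0.2))   # gain for a vehicle 200 m from the BS (illustrative)
```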
As illustrated in Fig. 1, we consider a single cell where the vehicles are dropped based on a spatial Poisson process. Two roads cross each other to form a crossroad, with the BS lying in the center of the cell, and each road is a multilane freeway. The cellular vehicle users are randomly selected among the active smart vehicles, and the cellular mode, dedicated mode and reuse mode are selected among the active V2V links. Each vehicle can construct V2V links with three vehicles simultaneously. In addition, we build our simulator following the evaluation methodology for the Manhattan case detailed in Annex A of 3GPP TR 36.885 [30], where there are 9 blocks in all and both line-of-sight (LOS) and non-line-of-sight (NLOS) channels. The channel models for the V2I and V2V links are shown in Table II. In one time slot (0.5 ms), the radio resource is organized into a number of uplink RBs with 180 kHz per RB. We set the number of PRBs to 30, with 10 V2I-unoccupied and 20 V2I-occupied. Other main simulation parameters are presented in Table III.
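As a rough illustration of the drop model described above, the sketch below places vehicles on a multilane road according to a one-dimensional spatial Poisson process; the road length and vehicle density are invented numbers, not the 3GPP TR 36.885 layout parameters.

```python
import numpy as np

rng = np.random.default_rng(2)

def drop_vehicles(road_length_m=1000.0, lanes=12, density_per_m=0.02):
    """Spatial Poisson drop: the number of vehicles per lane is Poisson(density * length)
    and their positions are uniform along the lane."""
    vehicles = []
    for lane in range(lanes):
        n = rng.poisson(density_per_m * road_length_m)
        positions = rng.uniform(0.0, road_length_m, size=n)
        vehicles.extend((lane, float(x)) for x in positions)
    return vehicles

print(len(drop_vehicles()))   # expected around 12 lanes x 20 vehicles
```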
The DDQN network for each V2V link consists of five layers and is fully connected with three hidden layers. The numbers of neurons in the input and output layers are 82 and 9, respectively. The activation function is the ReLU function f(x) = max(0, x), and the RMSProp optimizer is used to update the network parameters. The discount factor of our DDQN algorithm is set to 0.5, and the learning rate is set to shrink over time automatically.
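The network description above maps naturally onto a small fully connected model. The PyTorch sketch below mirrors the stated 82-dimensional input, 9-dimensional output, ReLU activations and RMSProp optimizer; the hidden-layer widths and the learning rate are assumptions, since the paper does not report them.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Five-layer fully connected Q-network: 82 inputs, three hidden layers, 9 outputs."""
    def __init__(self, state_dim=82, action_dim=9, hidden=(256, 128, 64)):
        super().__init__()
        h1, h2, h3 = hidden   # hidden widths are assumptions; the paper does not give them
        self.net = nn.Sequential(
            nn.Linear(state_dim, h1), nn.ReLU(),
            nn.Linear(h1, h2), nn.ReLU(),
            nn.Linear(h2, h3), nn.ReLU(),
            nn.Linear(h3, action_dim),
        )

    def forward(self, state):
        return self.net(state)     # one Q-value per joint mode/power action

q_net = QNetwork()
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=1e-3)  # lr value is illustrative
print(q_net(torch.zeros(1, 82)).shape)   # torch.Size([1, 9])
```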


Fig. 7. The convergence performance under different learning rates. Fig. 8. The probability of choosing the three communication modes. Fig. 9. Sum throughput performance of the V2I links with varying vehicle speed. Fig. 10. Probability of the satisfied V2V links with varying vehicle speed.

Fig. 5 depicts the loss function of the DDQN during the learning process. When the learning network first starts training, the value of the loss is relatively large, and the network is in the update phase. As the number of training steps increases, the value of the loss gradually decreases, stabilizing after about 550 training steps, which means that our algorithm automatically updates its decision strategy and converges to the optimum. The figure shows that the DDQN method has good convergence for joint mode selection and power adaptation in the V2V communication network, and the convergence time is short.
To evaluate the convergence performance of our proposed multi-agent RL algorithm, we show the expected reward per training episode with increasing training iterations in Fig. 6. From the figure, the cumulative reward increases as training continues, despite some fluctuations due to the mobility-induced channel fading in the V2V communication network. The V2V communication network is highly dynamic, including the channel information and network topology, which causes a large state space. Although the RL-based approach can learn the optimal strategy, the actions taken are not necessarily identical in the face of the same state, which leads to slight fluctuations in the expected reward. Moreover, the underlying environment of each episode is not exactly the same, which also contributes to the variation of the expected reward. Fig. 6 illustrates the effectiveness of the proposed method.
In addition, we investigate the parameter selection of the multi-agent DDQN algorithm. Fig. 7 shows the impact of the learning rate. The difference in learning rate has no effect on the scale of the cumulative reward; rather, the learning rate affects the convergence rate. It can be seen that the speed of convergence is faster with a larger learning rate.
Fig. 8 shows the probability of choosing each of the three communication modes with different ratios of V2V links to V2I links. With the increase in the ratio of V2V links to V2I links, the probability of selecting the dedicated mode decreases dramatically, while the probability of choosing the cellular mode drops slowly. The probability of choosing the reuse mode increases when there are few V2V links causing interference; however, after the ratio of V2V links to V2I links reaches a certain threshold, the probability of selecting the reuse mode starts to decrease, and the probability of the cellular mode increases accordingly.
Fig. 9 depicts the impact of the vehicle speed on the sum throughput of the V2I links. From the figure, the sum throughput performance decreases as the vehicle speed increases, because high observation uncertainties are induced by the highly dynamic V2V communication network. Among the baselines, the random search has the worst performance. For the centralized DQN algorithm, when the vehicles are moving slowly, the throughput of the V2I links is not much different from that of the distributed DDQN approach; however, when the vehicle speed continues to rise, the decision-making becomes poor because the centralized DQN algorithm cannot collect channel information in real time. As can be seen from the figure, our proposed approach achieves better performance than the other approaches under various vehicle speeds.
Fig. 10 presents the probability of satisfied V2V communication links with increasing vehicle speed. From the figure, when the vehicles move faster, the probability of satisfied communication links decreases. A higher movement speed leads to an increase in the average intervehicle distance; hence, the lower received desired power in the V2V links due to sparse traffic causes such performance degradation. In addition, the higher speed results in a highly dynamic V2V communication environment, leading to high observation uncertainties, which decreases the learning efficiency; hence, more unsatisfied communication link events happen in the region of high vehicle speed. However, as the vehicle speed increases with high uncertainties, our proposed approach can still maintain the probability of satisfied V2V communication links at a considerable level and outperforms the other approaches with a higher satisfied probability, especially in the high-vehicle-speed regions. This reveals that the proposed distributed DDQN approach is more stable and robust in highly dynamic V2V networks.


communication links at a considerable level and outperforms [11] Y. Li, D. Jin, J. Yuan and Z. Han. “Coalitional games for resource
allocation in the device-to-device uplink underlaying cellular networks,”
other approaches with a higher satisfied probability, especially IEEE T. Wirel. Commun., vol. 13 no. 7, pp. 3965-3977, May 2014.
in the high vehicle speed regions. This reveals that the proposed [12] C. Xu, L. Song, Z. Han, D. Li and B. Jiao, “Resource allocation using a
distributed DDQN approach is more stable and robust in highly reverse iterative combinatorial auction for device-to-device underlay
cellular networks,” in 2012 IEEE Global Communications Conference
dynamic V2V networks. (GLOBECOM), Anaheim, CA, USA, 2012, pp. 4542-4547.
V. CONCLUSION
In this paper, we study the problem of joint mode selection and power adaptation in the V2V communication network in 5G. The V2V links can multiplex the available resources such that they can operate in different modes to meet the multiple QoS requirements enforced by the system. We base mode selection and power adaptation on slow fading parameters and statistical information of the channel to address the challenges caused by the inability to track fast-changing wireless channels. We propose an RL framework that solves the problem with the DDQN algorithm and learns the optimal policy under a continuous-valued state variable. We consider each V2V link to be an agent; thus, the V2V links are capable of intelligently making adaptive decisions to improve their performance based on instantaneous observations in highly dynamic vehicular environments. Numerical results show that the method has good convergence.
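As a concrete illustration of the double deep Q-learning update underlying this framework, the short Python sketch below computes the DDQN targets for a batch of transitions: the online network selects the greedy next action and the target network evaluates it. The array names and the joint (mode, power level) action encoding are assumptions made for illustration, not the exact implementation evaluated above.

    import numpy as np

    def ddqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
        """Double DQN targets: the online net picks actions, the target net scores them.

        rewards, dones: shape (batch,).
        q_online_next, q_target_next: shape (batch, num_actions), where each action
        index encodes one assumed joint (transmission mode, power level) choice.
        """
        batch = np.arange(rewards.shape[0])
        greedy_actions = np.argmax(q_online_next, axis=1)   # selection by online network
        bootstrap = q_target_next[batch, greedy_actions]    # evaluation by target network
        return rewards + gamma * (1.0 - dones) * bootstrap

    # Toy usage with random Q-values standing in for network outputs.
    rng = np.random.default_rng(1)
    r = rng.uniform(size=3)
    d = np.array([0.0, 0.0, 1.0])
    print(ddqn_targets(r, d, rng.normal(size=(3, 8)), rng.normal(size=(3, 8))))

Decoupling action selection from action evaluation in this way is what mitigates the overestimation bias of standard deep Q-learning in the highly dynamic vehicular setting considered here.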
REFERENCES
[1] D. Wang, D. Chen, B. Song, N. Guizani, X. Yu, and X. Du, “From IoT to 5G I-IoT: The next generation IoT-based intelligent algorithms and 5G technologies,” IEEE Commun. Mag., vol. 56, no. 10, pp. 114-120, Nov. 2018.
[2] Y. Zhang, F. Tian, B. Song, and X. Du, “Social vehicle swarms: A novel perspective on socially aware vehicular communication architecture,” IEEE Wirel. Commun., vol. 23, no. 4, pp. 82-89, Aug. 2016.
[3] Z. Zhou, C. Gao, C. Xu, Y. Zhang, S. Mumtaz, and J. Rodriguez, “Social big-data-based content dissemination in internet of vehicles,” IEEE T. Ind. Inform., vol. 14, no. 2, pp. 768-777, Jul. 2017.
[4] M. Tehrani, M. Uysal, and H. Yanikomeroglu, “Device-to-device communication in 5G cellular networks: Challenges, solutions, and future directions,” IEEE Commun. Mag., vol. 52, no. 5, pp. 86-92, May 2014.
[5] P. Dong, X. Du, H. Zhang, and T. Xu, “A detection method for a novel DDoS attack against SDN controllers by vast new low-traffic flows,” in 2016 IEEE International Conference on Communications (ICC), Kuala Lumpur, Malaysia, 2016, pp. 1-6.
[6] J. Li, G. Lei, G. Manogaran, G. Mastorakis, and C. Mavromoustakis, “D2D communication mode selection and resource optimization algorithm with optimal throughput in 5G network,” IEEE Access, vol. 7, pp. 25263-25273, Feb. 2019.
[7] J. Guo, B. Song, F. Yu, Y. Chi, and C. Yuen, “Fast video frame correlation analysis for vehicular networks by using CVS-CNN,” IEEE T. Veh. Technol., vol. 68, no. 7, pp. 6286-6292, May 2019.
[8] J. Mei, K. Zheng, L. Zhao, Y. Teng, and X. Wang, “A latency and reliability guaranteed resource allocation scheme for LTE V2V communication systems,” IEEE T. Wirel. Commun., vol. 17, no. 6, pp. 3850-3860, Mar. 2018.
[9] H. Min, W. Seo, J. Lee, S. Park, and D. Hong, “Reliability improvement using receive mode selection in the device-to-device uplink period underlaying cellular networks,” IEEE T. Wirel. Commun., vol. 10, no. 2, pp. 413-418, Dec. 2010.
[10] K. Akkarajitsakul, P. Phunchongharn, E. Hossain, and V. Bhargava, “Mode selection for energy-efficient D2D communications in LTE-advanced networks: A coalitional game approach,” in 2012 IEEE International Conference on Communication Systems (ICCS), Singapore, 2012, pp. 488-492.
[11] Y. Li, D. Jin, J. Yuan, and Z. Han, “Coalitional games for resource allocation in the device-to-device uplink underlaying cellular networks,” IEEE T. Wirel. Commun., vol. 13, no. 7, pp. 3965-3977, May 2014.
[12] C. Xu, L. Song, Z. Han, D. Li, and B. Jiao, “Resource allocation using a reverse iterative combinatorial auction for device-to-device underlay cellular networks,” in 2012 IEEE Global Communications Conference (GLOBECOM), Anaheim, CA, USA, 2012, pp. 4542-4547.
[13] W. Sun, D. Yuan, E. Ström, and F. Brännström, “Cluster-based radio resource management for D2D-supported safety-critical V2X communications,” IEEE T. Wirel. Commun., vol. 15, no. 4, pp. 2756-2769, Dec. 2015.
[14] L. Liang, G. Li, and W. Xu, “Resource allocation for D2D-enabled vehicular communications,” IEEE T. Commun., vol. 65, no. 7, pp. 3186-3197, Apr. 2017.
[15] D. Han, B. Bai, and W. Chen, “Secure V2V communications via relays: Resource allocation and performance analysis,” IEEE Wirel. Commun. Lett., vol. 6, no. 3, pp. 342-345, Mar. 2017.
[16] W. Sun, E. Ström, F. Brännström, K. Sou, and Y. Sui, “Radio resource management for D2D-based V2V communication,” IEEE T. Veh. Technol., vol. 65, no. 8, pp. 6636-6650, Sep. 2015.
[17] F. Abbas and P. Fan, “A hybrid low-latency D2D resource allocation scheme based on cellular V2X networks,” in 2018 IEEE International Conference on Communications Workshops (ICC Workshops), Kansas City, MO, USA, 2018, pp. 1-6.
[18] A. Rawat, H. Shah, and V. Patil, “Towards intelligent vehicular networks: A machine learning framework,” Int. J. Res. Engin., Sci. Manage., vol. 1, no. 9, pp. 2581-5782, Sep. 2018.
[19] A. Koushik, F. Hu, and S. Kumar, “Intelligent spectrum management based on transfer actor-critic learning for rateless transmissions in cognitive radio networks,” IEEE T. Mobile Comput., vol. 17, no. 5, pp. 1204-1215, Aug. 2017.
[20] H. Ye, L. Liang, G. Li, J. Kim, L. Lu, and M. Wu, “Machine learning for vehicular networks: Recent advances and application examples,” IEEE Veh. Technol. Mag., vol. 13, no. 2, pp. 94-101, Apr. 2018.
[21] R. Atallah, C. Assi, and M. Khabbaz, “Scheduling the operation of a connected vehicular network using deep reinforcement learning,” IEEE T. Intell. Transp., vol. 20, no. 5, pp. 1669-1682, May 2018.
[22] W. Liu, G. Qin, Y. He, and F. Jiang, “Distributed cooperative reinforcement learning-based traffic signal control that integrates V2X networks' dynamic clustering,” IEEE T. Veh. Technol., vol. 66, no. 10, pp. 8667-8681, May 2017.
[23] H. Ye, G. Li, and B. Juang, “Deep reinforcement learning based resource allocation for V2V communications,” IEEE T. Veh. Technol., vol. 68, no. 4, pp. 3163-3173, Feb. 2019.
[24] W. Saad, Z. Han, A. Hjorungnes, D. Niyato, and E. Hossain, “Coalition formation games for distributed cooperation among roadside units in vehicular networks,” IEEE J. Sel. Area. Comm., vol. 29, no. 1, pp. 48-60, Dec. 2010.
[25] R. Molina-Masegosa and J. Gozalvez, “LTE-V for sidelink 5G V2X vehicular communications: A new 5G technology for short-range vehicle-to-everything communications,” IEEE Veh. Technol. Mag., vol. 12, no. 4, pp. 30-39, Dec. 2017.
[26] H. Seo, K. Lee, S. Yasukawa, Y. Peng, and P. Sartori, “LTE evolution for vehicle-to-everything services,” IEEE Commun. Mag., vol. 54, no. 6, pp. 22-28, Jun. 2016.
[27] G. Wu and P. Xu, “Link QoS analysis of 5G-enabled V2V network based on vehicular cloud,” Science China Information Sciences, vol. 61, no. 10, p. 109305, Aug. 2018.
[28] L. Wu, X. Du, W. Wang, and B. Lin, “An out-of-band authentication scheme for internet of things using blockchain technology,” in 2018 International Conference on Computing, Networking and Communications (ICNC), Maui, HI, USA, 2018, pp. 769-773.
[29] C. Chien, Y. Chen, and H. Hsieh, “Exploiting spatial reuse gain through joint mode selection and resource allocation for underlay device-to-device communications,” in 15th International Symposium on Wireless Personal Multimedia Communications (WPMC), Taipei, China, 2012, pp. 80-84.
[30] 3rd Generation Partnership Project; Technical Specification Group Radio Access Network; Study on LTE-based V2X Services (Release 14), 3GPP TR 36.885 V14.0.0.
[31] P. Kyosti, J. Meinila, and L. Hentila, “IST-4-027756 WINNER II D1.1.2 V1.2: WINNER II channel models,” Tech. Rep., Sep. 2007.
BIOGRAPHY

Di Zhao received the B.Sc. degree in Communication Engineering from Shandong Normal University, Jinan, China, in 2018. She is currently working toward the Ph.D. degree at Xidian University, Xi'an, China. Her research interests include machine learning, deep reinforcement learning, multi-agent reinforcement learning, the Internet of Things, and big data.

Hao Qin received the B.S., M.S., and Ph.D. degrees in communication and information systems from Xidian University, Xi'an, China, in 1996, 1999, and 2004, respectively. In 2004, he joined the School of Telecommunications Engineering, Xidian University, where he is currently an Associate Professor of communications and information systems. His research interests include wireless communications and satellite communications.

Bin Song received the B.S., M.S., and Ph.D. degrees in communication and information systems from Xidian University, Xi'an, China, in 1996, 1999, and 2002, respectively. He is currently a professor at Xidian University, Xi'an, China. He has authored over 60 journal and conference papers and holds 30 patents. His research interests include distributed video coding, compressed sensing-based video coding, content-based image recognition, machine learning, deep reinforcement learning, the Internet of Things, and big data.

Yanli Zhang received the B.S. degree in Mathematics from Henan University, Kaifeng, China, in 1997 and the M.S. degree in Software Engineering from South China University of Technology, Guangzhou, China, in 2005. She is currently an associate professor in the Computer Department at Guangdong AIB Polytechnic, Guangzhou, China. Her research interests include image processing, software development, and big data.

Xiaojiang (James) Du is a tenured professor in the Department of Computer and Information Sciences at Temple University, Philadelphia, USA. Dr. Du received the B.S. and M.S. degrees in electrical engineering from Tsinghua University, Beijing, China, in 1996 and 1998, respectively, and the M.S. and Ph.D. degrees in electrical engineering from the University of Maryland, College Park, in 2002 and 2003, respectively. His research interests are security, wireless networks, and systems. He has authored over 300 journal and conference papers in these areas, as well as a book published by Springer. Dr. Du has been awarded more than $6 million in research grants from the US National Science Foundation (NSF), Army Research Office, Air Force Research Lab, NASA, Qatar, the State of Pennsylvania, and Amazon. He won the best paper award at IEEE GLOBECOM 2014 and the best poster runner-up award at ACM MobiHoc 2014. He serves on the editorial boards of three international journals. Dr. Du is a Fellow of IEEE and a Life Member of ACM.

Mohsen Guizani (S'85-M'89-SM'99-F'09) received the B.S. (with distinction) and M.S. degrees in electrical engineering and the M.S. and Ph.D. degrees in computer engineering from Syracuse University, Syracuse, NY, USA, in 1984, 1986, 1987, and 1990, respectively. He is currently a Professor in the CSE Department at Qatar University, Qatar. Previously, he served in different academic and administrative positions at the University of Idaho, Western Michigan University, University of West Florida, University of Missouri-Kansas City, University of Colorado-Boulder, and Syracuse University. His research interests include wireless communications and mobile computing, computer networks, mobile cloud computing, security, and smart grid. He is currently the Editor-in-Chief of the IEEE Network Magazine, serves on the editorial boards of several international technical journals, and is the Founder and Editor-in-Chief of the Wireless Communications and Mobile Computing journal (Wiley). He is the author of nine books and more than 500 publications in refereed journals and conferences. He has guest edited a number of special issues in IEEE journals and magazines and has served as a member, Chair, and General Chair of a number of international conferences. Throughout his career, he received three teaching awards and four research awards. He also received the 2017 IEEE Communications Society WTC Recognition Award as well as the 2018 AdHoc Technical Committee Recognition Award for his contribution to outstanding research in wireless communications and ad-hoc sensor networks. He was the Chair of the IEEE Communications Society Wireless Technical Committee and the Chair of the TAOS Technical Committee. He served as the IEEE Computer Society Distinguished Speaker and is currently an IEEE ComSoc Distinguished Lecturer. He is a Fellow of IEEE and a Senior Member of ACM.