A Reinforcement Learning Method for Joint Mode Selection and Power Adaptation in the V2V Communication Network in 5G
Each V2V link is considered to be an agent that intelligently makes adaptive decisions based on continuous-valued state variables in a highly dynamic V2V communication environment to learn the optimal policy. Further, joint mode selection and power adaptation with the double deep Q-learning (DDQN) algorithm is proposed, in which a new reward function is introduced to manage interference and power and thereby improve the performance of the V2V communication network. Moreover, a replay buffer is used to improve data utilization, and a target network is applied to improve the stability of training.

The rest of the paper is organized as follows. The system model is introduced in Section II. Section III presents the RL framework for solving the joint mode selection and power adaptation problem; we focus on maximizing the total capacity of the V2I links while guaranteeing the transmission delay and reliability constraints of the V2V links. The simulation results and analysis are presented in Section IV. Section V concludes the paper.

II. SYSTEM MODEL AND PROBLEM FORMULATION

A. System Model

The uplink spectrum resources are underutilized compared with the downlink spectrum resources. To improve the utilization of the uplink spectrum, all studies in this paper are based on the uplink.

A cellular-based vehicular communication network is shown in Fig. 1. In this model, the vehicular communication network consists of two parts: $R$ V2I links between the BS and VUEs that support high-data-rate services, and $J$ V2V links that communicate among nearby VUEs to transmit safety messages for safe driving [24]. Let $\mathcal{R} = \{1, 2, \dots, R\}$ and $\mathcal{J} = \{1, 2, \dots, J\}$ denote the set of V2I links and the set of V2V links, respectively. Furthermore, we assume that all transceivers use a single antenna and that the VUEs are uniformly located in the region covered by the BS.

We adopt a scheme in which vehicles share a pool of radio resources [25]. We assume that there is a set $\mathcal{N} = \{1, 2, \dots, N\}$ of $N = R + U$ orthogonal subbands assigned to the VUEs, where $R$ is the number of V2I-occupied subbands (i.e., $R$ subbands are preoccupied by the $R$ V2I links) and $U$ is the number of V2I-unoccupied subbands. Any subband can be chosen autonomously for V2V communication. We assume that one subband can be allocated to at most one V2I link and that one V2I link can occupy only a single subband. Any V2V link can use only one subband, either V2I-occupied or V2I-unoccupied; conversely, a subband can be occupied by multiple V2V links.

We define the channel power gains of the $r$th V2I link and the $j$th V2V link as $g_r$ and $g_j$, respectively. Let $g_{j,b}$ denote the interfering channel gain from the $j$th V2V transmitter to the BS. The interference channel gain from the $r$th V2I link to the $j$th V2V link is denoted as $g_{j,r}$, and the interference channel gain from the $j'$th V2V link to the $j$th V2V link is denoted as $g_{j,j'}$.

B. Establishing Reliable V2V Links

The 5G communication technology has a wide range of applications for satisfying service requirements in vehicular networks. In the V2V communication network, reliable link QoS is especially vital and very stringent, i.e., almost 100% reliability with millisecond latency [26]. Under these requirements, reliable message transmission for various QoS levels is guaranteed by exchanging safety information among the VUEs. The data-rate requirement of V2V links is relatively loose; however, the constraints on reliable, real-time data link communication are tight. In our scenario, two main aspects are considered for real-time and reliable data transmission in the vehicular network: low data collision and low delay.

Fig. 2 (a). Potential $n$-hop links between $V_a$ and $V_b$. Fig. 2 (b). Three communication links established by $V_a$.

As illustrated in Fig. 2(a), vehicle $V_a$ establishes mmWave multi-hop communication links to communicate with vehicle $V_b$. We assume that the potential links between vehicle $V_a$ and vehicle $V_b$ constitute a link set $\mathcal{L}^{ab}$ with $\mathcal{L}^{ab} \supseteq \mathcal{L}^{ab}_n$, $n \in \{1, 2, \dots, n_{\max}\}$, where $\mathcal{L}^{ab}_n$ is the set of $n$-hop links and $n_{\max}$ is decided by the practical V2V network. We choose one link from the set $\mathcal{L}^{ab}$ as an example to analyze link QoS; the analysis method for the other links is similar. When all links between vehicle $V_a$ and vehicle $V_b$ have been traversed, the link with the best QoS performance is selected as the actual communication link.

Suppose link $l_n$ is an $n$-hop link. We mark vehicle $V_a$ as 1, vehicle $V_b$ as $n+1$, and the relay vehicles as $\{2, 3, \dots, n\}$ in turn. Accordingly, the adjacent distances are expressed as $\{d_1, d_2, \dots, d_n\}$, and each adjacent distance should be within the maximum mmWave transmission range. The total transmission distance between vehicle $V_a$ and vehicle $V_b$ is thus $\sum_{i=1}^{n} d_i$, which is recorded as $D$.
Considering low data collision and low transmission delay, the performance of the V2V link $l_n$ can be defined as

$P_{l_n} = \alpha C_{l_n} + \beta D_{l_n}$ (1)

where the weight factors $\alpha$ and $\beta$ represent the degrees of importance of the corresponding requirements, with $\alpha + \beta = 1$ and $\alpha, \beta \in (0,1)$. $C_{l_n}$ and $D_{l_n}$ denote the connectivity probability and the transmission-delay term of link $l_n$, respectively.

The vehicle arrival process is usually modeled as a Poisson process. Therefore, the connectivity probability can be expressed in terms of the cumulative probability distribution of $n+1$ vehicles in link $l_n$ as [27]

$C_{l_n} = \dfrac{(\rho_v D)^{n+1} e^{-\rho_v D}}{(n+1)!}$ (2)

where $\rho_v$ is the spatial density of vehicles. Moreover, the transmission-delay term is given by

$D_{l_n} = 1 - T_{l_n}/T_{tol}$ (3)

where $T_{tol}$ is the maximum tolerable delay of the transmission link and $T_{l_n}$ is the transmission delay of link $l_n$. If $T_{l_n}$ is within the tolerable delay range, then $D_{l_n} \in [0,1)$; otherwise, when $T_{l_n}$ exceeds the upper delay limit, $D_{l_n} < 0$. As a result, the range of $D_{l_n}$ is $(-\infty, 1)$. In summary, the established optimal link between vehicle $V_a$ and vehicle $V_b$ can be denoted as

$l^* = \arg\max_n \{P_{l_n}\}, \quad n \in \{1, 2, \dots, n_{\max}\}$ (4)

In addition, we make a reasonable simplification of the problem. We assume that a vehicle communicates with only three other vehicles, as shown in Fig. 2(b), and that each link is established in accordance with the method described above.
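To make the link-selection rule in (1)-(4) concrete, the following Python sketch scores every candidate hop count and picks the best link. It is a minimal illustration under assumed inputs; the per-link delays, vehicle density, and delay budget below are hypothetical placeholders, not values from the paper.

```python
import math

def link_performance(n, T_ln, D, rho_v, T_tol, alpha=0.5, beta=0.5):
    """Score an n-hop link by Eq. (1): P = alpha*C + beta*D_term."""
    # Connectivity probability of n+1 vehicles on the link, Eq. (2)
    C = (rho_v * D) ** (n + 1) * math.exp(-rho_v * D) / math.factorial(n + 1)
    # Delay term, Eq. (3): in [0,1) if T_ln is tolerable, negative otherwise
    D_term = 1.0 - T_ln / T_tol
    return alpha * C + beta * D_term

# Hypothetical end-to-end delays (seconds) for candidate links l_1..l_4
delays = {1: 0.8e-3, 2: 0.5e-3, 3: 0.6e-3, 4: 0.9e-3}
D_total, rho_v, T_tol = 120.0, 0.05, 1e-3  # assumed distance, density, budget

# Eq. (4): choose the hop count with the best weighted QoS score
best_n = max(delays,
             key=lambda n: link_performance(n, delays[n], D_total, rho_v, T_tol))
```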
C. Communication Mode Description

Small-scale fading is a characteristic of radio propagation resulting from objects that cause multiple reflections, diffraction and scattering before the signal is received at the receiver, and it is modeled by the Rayleigh distribution. Large-scale fading consists of path loss and shadowing. Path loss is the attenuation in the power density of an electromagnetic wave as it propagates through space. Beyond that, the fading that represents the average signal power attenuation or path loss due to motion over large areas is commonly known as shadowing.

The principles for establishing V2V links were described in the previous section. Mode selection solves the problem of whether to reuse subbands or to use dedicated subbands. Under the given assumptions, there are three communication modes in the V2V communication network.

Cellular Mode: As shown in Fig. 3(a), in the cellular mode VUEs communicate through the BS as traditional cellular users. Because the two users are far apart, the channel gain is poor, and a V2V link cannot be established between them. In this case, two subbands (one uplink and one downlink) are assigned to them, and no other V2V links reuse their spectrum resources. Hence, two VUEs working in the cellular mode are identical to traditional cellular users. This mode has the lowest channel-resource utilization among all modes; it is chosen when the other two modes are unavailable owing to interference issues. The uplink signal-to-noise ratio (SNR) of the VUEs can be written as

$\gamma_j^{up}[m] = \dfrac{P_j[m] \cdot g_j^{up}[m]}{\sigma^2}$ (5)

where $P_j[m]$ is the transmit power of the $j$th V2V link allocated to the $m$th subband, $g_j^{up}$ is the uplink signal channel gain of the $j$th V2V link, and $\sigma^2$ is the power of the additive white Gaussian noise (AWGN). Similarly, the downlink SNR is given by

$\gamma_j^{down}[m] = \dfrac{P_b[m] \cdot g_j^{down}[m]}{\sigma^2}$ (6)

where $P_b[m]$ denotes the transmit power of the BS on the $m$th subband.
Dedicated Mode: In the dedicated mode, two VUEs establish a direct V2V link over a dedicated subband, as shown in Fig. 3(b). At this point, the V2V link requires only one dedicated V2I-unoccupied subband. However, this dedicated subband can be reused by other V2V links, and interference between them may occur. Compared to the cellular mode, the dedicated mode consumes fewer spectrum resources. Additionally, the energy efficiency can be improved owing to the proximity of the VUEs. The signal-to-interference-plus-noise ratio (SINR) of the V2V link over the V2I-unoccupied subband is expressed as

$\gamma_j[m] = \dfrac{P_j[m] \cdot g_j[m]}{\sigma^2 + \sum_{j' \neq j} \rho_{j'}[m] P_{j'}[m] \cdot g_{j,j'}[m]}$ (7)

where $P_{j'}[m]$ is the transmit power of the $j'$th V2V link over the $m$th V2I-unoccupied subband and $\rho_{j'}[m]$ is a binary indicator: $\rho_{j'}[m] = 1$ when the $m$th subband is occupied by the $j'$th V2V link, and $\rho_{j'}[m] = 0$ otherwise. Based on the previous assumption that a V2V link can use at most one subband, we have $\sum_m \rho_{j'}[m] \leq 1$.

Reuse Mode: In the reuse mode, two VUEs establish a V2V link to communicate directly because they are very close together. As shown in Fig. 3(c), the V2V link reuses a V2I-occupied subband; as a consequence, the spectral efficiency can be further improved. However, this raises the question of how to properly manage the co-channel interference so as not to degrade the performance of either the V2I link or the V2V link. In this mode, the V2I link is subject to interference from the V2V links that share the same subband, of which there may be more than one. Therefore, the uplink SINR of the V2I link can be written as

$\gamma_r[m] = \dfrac{P_r[m] \cdot g_r[m]}{\sigma^2 + \sum_j \rho_j[m] P_j[m] \cdot g_{j,b}[m]}$ (8)

where $P_r[m]$ denotes the transmit power of the $r$th V2I link. From the other perspective, the V2V link also suffers interference from the V2I link and the other V2V links sharing the subband. The uplink SINR of the V2V link is then given by

$\gamma_j[m] = \dfrac{P_j[m] \cdot g_j[m]}{\sigma^2 + P_r[m] \cdot g_{j,r}[m] + \sum_{j' \neq j} \rho_{j'}[m] P_{j'}[m] \cdot g_{j,j'}[m]}$ (9)

In the vehicular communication network, each V2V link selects one of the abovementioned communication modes.
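The SNR and SINR expressions (5)-(9) differ only in which interference terms enter the denominator. The following Python sketch shows the three cases side by side; all variable names and the numeric values are illustrative assumptions, not quantities taken from the paper.

```python
SIGMA2 = 1e-13  # assumed AWGN power

def snr_cellular(P_j, g_up):
    # Eq. (5): no co-channel interference in the cellular mode
    return P_j * g_up / SIGMA2

def sinr_dedicated(P_j, g_j, others):
    # Eq. (7): interference only from other V2V links on the same
    # V2I-unoccupied subband; `others` holds (rho, P, g) triples
    interference = sum(rho * P * g for rho, P, g in others)
    return P_j * g_j / (SIGMA2 + interference)

def sinr_reuse_v2v(P_j, g_j, P_r, g_jr, others):
    # Eq. (9): interference from the sharing V2I link plus other V2V links
    interference = P_r * g_jr + sum(rho * P * g for rho, P, g in others)
    return P_j * g_j / (SIGMA2 + interference)

# Example with made-up values: one interfering V2V link on the subband
gamma = sinr_reuse_v2v(P_j=0.1, g_j=1e-9, P_r=0.2, g_jr=2e-11,
                       others=[(1, 0.1, 5e-12)])
```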
D. Problem Formulation

In this situation, the corresponding capacity of each mode is obtained as

$Cap = W \log_2(1 + \gamma)$ (10)

where $W$ is the bandwidth of each spectrum subband and $\gamma$ is the SINR or SNR of the corresponding link. In this paper, considering the coexistence of the V2V and V2I links, our goal is to maximize the throughput of the V2I links while ensuring that the limitations of the V2V links, such as transmission delay and reliability, are met. The joint mode selection and power adaptation problem in the V2V communication network can be mathematically formulated as

$\max \ \sum_m \left[ W \log_2(1 + \gamma_j^{up}[m]) + W \log_2(1 + \gamma_r[m]) \right]$
$\text{s.t.} \quad D_{l_n} \in [0,1)$
$\qquad\ \ 1 - C_{l_n} \leq C_{tol}$ (11)
$\qquad\ \ P_j, P_{j'}, P_r, P_b \in \{Pow_1, Pow_2, Pow_3\}$

where $C_{tol}$ denotes the tolerable outage probability of the V2V links, and $Pow_1$, $Pow_2$ and $Pow_3$ represent the three levels of transmit power.

III. RL FOR THE JOINT MODE SELECTION AND POWER ADAPTATION PROBLEM IN THE V2V COMMUNICATION NETWORK

The purpose of RL is to solve the stochastic, multi-objective optimization problem by learning to approximate the optimal policy. It can be used for real-time mode selection to improve the satisfaction of each target indicator. In our scenario, limited spectrum resources are competitively occupied by multiple V2V links, and different transmission modes are formed by the way the V2V links occupy the spectrum resources; this can be modeled as a multi-agent RL problem when each V2V link is considered to be an agent. Hence, the V2V links can make intelligent decisions in the V2V communication network. In this section, the framework of deep reinforcement learning (DRL) for mode selection and power adaptation in V2V communications is introduced, including the representation of the key elements and a DDQN learning algorithm for the joint mode selection and power adaptation problem.

A. Reinforcement Learning

Generally, RL consists of five elements: agent, environment, state, action and reward. The agent learns by constantly interacting with the environment. At each moment, the environment is in some state, of which the agent obtains an observation. The agent acts on the basis of the observed values in conjunction with its own historical code of conduct, which is generally called a strategy. The action affects the state of the environment, causing certain changes in it; the agent then obtains two pieces of information from the changed environment, namely an observation of the new state and a reward. After all of the above steps are completed, the agent can perform new actions based on the new observations.

The Markov decision process (MDP) is generally used as a formal model for RL. We use $s_t$ for the observation of the environment at time $t$, $a_t$ for the action taken at time $t$, and $r_t$ for the immediate reward obtained after action $a_t$ is executed. The process of RL can then be represented by a state-action-reward chain as
$\{s_0, a_0, r_0, \dots, r_{t-1}, s_t, a_t, r_t, s_{t+1}\}$ (12)

RL has two salient features: one is constant trial and error, and the other is an emphasis on long-term returns. For the first feature, the agent needs to constantly try actions, coping with the various possible behaviors of the state and collecting the corresponding rewards; only by collecting this varied feedback can the learning task be completed well. For the second feature, the reward obtained in the next one or two steps becomes less important over a long horizon. Maximizing long-term returns is often difficult, and it requires more complex and precise actions.

Ideally, every action should be chosen to maximize the long-term return. We multiply future rewards by a discount factor $\xi$ to reduce the impact of distant rewards on the current decision. Therefore, the long-term return is given by

$Ret_t = E\left[\sum_{m=0}^{\infty} \xi^m r_{t+m+1}\right]$ (13)

Value functions are used to represent the value of a strategy. According to the model form of the MDP, value functions can be divided into two types. One type is the state value function $V_\pi(s_t)$, i.e., the expectation of the long-term return generated by acting according to the strategy when the current state $s_t$ is known:

$V_\pi(s_t) = E_\tau\left[\sum_{m=0}^{\infty} \xi^m r_{t+m+1} \,\middle|\, s_t = s\right]$ (14)

where $\tau$ denotes the sequence sampled by the policy and the state transitions. The other type is the expectation of the long-term return generated by the strategy under the premise of knowing both the current state $s_t$ and the action $a_t$, called the state-action value function $Q_\pi(s_t, a_t)$:

$Q_\pi(s_t, a_t) = E_\tau\left[\sum_{m=0}^{\infty} \xi^m r_{t+m+1} \,\middle|\, s_t = s, a_t = a\right]$ (15)
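As a quick numerical illustration of the discounted return in (13)-(15), the sketch below computes the return of a finite reward trace and a Monte Carlo estimate of a state value; the reward traces and the discount factor are made-up inputs, not values from the paper.

```python
def discounted_return(rewards, xi=0.9):
    """Eq. (13): Ret_t = sum_m xi^m * r_{t+m+1} for a finite reward trace."""
    return sum((xi ** m) * r for m, r in enumerate(rewards))

def mc_state_value(episodes, xi=0.9):
    """Eqs. (14)-(15): approximate the expectation over sampled trajectories
    by averaging the discounted returns of episodes from the same state."""
    return sum(discounted_return(ep, xi) for ep in episodes) / len(episodes)

# Made-up reward traces from three rollouts starting in the same state
rollouts = [[1.0, 0.5, 0.2], [0.8, 0.9, 0.1], [1.2, 0.3, 0.4]]
v_estimate = mc_state_value(rollouts)  # average of the three returns
```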
B. Mapping the Scenario to the Key Elements of RL

Agent: Each active V2V link is an agent. In the V2V communication network, the BS cannot obtain accurate CSI in real time because of the high mobility of the vehicles; hence, a centralized method is of little use. To solve this problem, we propose a distributed algorithm based on the slow fading parameters and statistical information of the channel. In this algorithm, an agent does not need to know all the information before making a decision, so the transmission overhead is very small. Therefore, the joint mode selection and power adaptation problem in the V2V communication network is more suitable for implementation in a decentralized way. Based on the above descriptions, each V2V link is considered to be an agent that interacts with the communication environment. The agents make intelligent decisions, including the adaptive selection of the communication mode and transmit power, from partial observations of the environment and information exchanged among them. Generally, an agent can accumulate experience and adjust its action strategy.

State: Let us define the infinite state space as $\mathcal{S}$. The true environment state $s_t$ should include the global channel conditions and all the agents' behaviors. For each V2V link, the observed state information is limited and can be depicted as $s_t = [\varphi_t^{SV}, \varphi_t^{VI}, \varphi_t^{VV}, \varphi_t^{VB}, \varphi_{t-1}^{CS}, \varphi_t^{TL}, \varphi_t^{QR}]$, where $\varphi_t^{SV}$ represents the observed signal channel information of the V2V link, $\varphi_t^{VI}$ indicates the interference channels from all V2I transmitters, $\varphi_t^{VV}$ denotes the interference channels from other V2V transmitters, $\varphi_t^{VB}$ is the interference channel from its own transmitter to the BS, and $\varphi_{t-1}^{CS}$ is the channel selection status of the neighbors at the previous moment, which reveals the other agents' behaviors. Additionally, $\varphi_t^{TL}$ represents the remaining transmission load of the V2V link, and $\varphi_t^{QR}$ denotes the QoS requirements.

Action: In this paper, we focus on the mode selection and power adaptation issues in the V2V communication network. Hence, the action is defined as $a_t = (\chi_t^{MS}, \chi_t^{PC})$, where $\chi_t^{MS}$ is the communication mode selected by the V2V link at time $t$ and $\chi_t^{PC}$ is the power level chosen by the V2V link in the selected mode at time $t$. In the V2V communication network, we divide the communication modes into three types as

$\chi_t^{MS} = \begin{cases} 0, & \text{Cellular Mode} \\ 1, & \text{Dedicated Mode} \\ 2, & \text{Reuse Mode} \end{cases}$ (16)

Note: From the subband occupied by the V2V link and the preassignment of the V2I links, we can further determine the communication mode selected by the V2V link. In addition, we discretize the continuous power into three levels in our scenario, so the action space is 9-dimensional.
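Since each action is a (mode, power level) pair, the 9 joint actions can be enumerated as below. This is a minimal sketch assuming the three power levels $Pow_1$-$Pow_3$ are given; the dBm values here are placeholders, not the paper's settings.

```python
from itertools import product

MODES = {0: "cellular", 1: "dedicated", 2: "reuse"}  # Eq. (16)
POWER_LEVELS = [5.0, 15.0, 23.0]                     # assumed dBm values

# 3 modes x 3 power levels give the 9-dimensional discrete action space
ACTIONS = list(product(MODES, range(len(POWER_LEVELS))))

def decode_action(index):
    """Map a flat Q-network output index (0..8) to (mode, transmit power)."""
    mode, power_idx = ACTIONS[index]
    return MODES[mode], POWER_LEVELS[power_idx]
```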
Reward: The learning process is driven by the reward function in the RL framework, and the system performance improves when the per-step reward is designed to reflect the desired objective. Consistent with the goal of maximizing the reward in RL, the system performance is optimal when the algorithm finds a sufficiently smart strategy to achieve the final objective. Considering the different requirements of the different link types in the V2V communication network, the capacities of the V2V and V2I links as well as the latency and reliability constraints are all modeled in the reward function. Hence, we propose a new reward function based on the following analysis:

$r_t = \eta_1 (Cap_r + Cap_j) + \eta_2 C_{l_n} + \eta_3 D_{l_n}$ (17)

where $\eta_1$, $\eta_2$ and $\eta_3$ are the weight factors of the contributions of the three parts to the reward, and $Cap_r$ and $Cap_j$ denote the total capacities of the V2I and V2V links, respectively. The total capacity of the V2I links $Cap_r$ can be written as

$Cap_r = \sum_m \left[ W \log_2(1 + \gamma_j^{up}[m]) + W \log_2(1 + \gamma_r[m]) \right]$ (18)

and $Cap_j$ is obtained analogously by summing the per-link capacities of (10) over the V2V links.
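A sketch of the per-step reward (17), wiring together the capacity, connectivity, and delay terms defined earlier; the bandwidth and the weights $\eta_1$-$\eta_3$ are arbitrary example values, not the paper's tuned settings.

```python
import math

W = 1.5e6  # assumed subband bandwidth in Hz

def capacity(gamma):
    """Eq. (10): Shannon capacity of one link at SINR/SNR gamma."""
    return W * math.log2(1.0 + gamma)

def step_reward(v2i_gammas, v2v_gammas, C_ln, D_ln,
                eta1=1e-7, eta2=1.0, eta3=1.0):
    """Eq. (17): weighted sum of total capacity, connectivity, and delay."""
    cap_r = sum(capacity(g) for g in v2i_gammas)  # V2I total, cf. Eq. (18)
    cap_j = sum(capacity(g) for g in v2v_gammas)  # V2V total
    return eta1 * (cap_r + cap_j) + eta2 * C_ln + eta3 * D_ln
```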
C. Learning Algorithm

In our scenario, the number of states is large and the state values are continuous. Generally, a table-lookup algorithm would be inefficient, or even infeasible, for solving such a problem. A common remedy is to no longer store the values of states or behaviors in tabular form but instead to introduce appropriate parameters, select suitable features to describe the states, and estimate the state or behavior values by constructing a function. Here, we use an approximate representation of the state-action value $Q(s_t, a_t)$: the function $Q(s_t, a_t, \omega)$ with parameter $\omega$ accepts the state variable $s_t$ and the behavior variable $a_t$, and the value of $\omega$ is constantly adjusted to match the final behavior values under the strategy $\pi$.

The Q-learning algorithm selects the action that can achieve the maximum reward based on the state-action value $Q(s_t, a_t)$. Q-learning estimates the value of the current moment from the current return plus the estimated value of the next moment, obtained by taking the value-maximizing action. The update formula for Q-learning can be written as

$Q(s_t, a_t) = Q(s_{t-1}, a_{t-1}) + \dfrac{1}{M}\left[ r_{t-1} + \xi \max_{a'} Q - Q(s_{t-1}, a_{t-1}) \right]$ (20)

where $M$ is the size of the mini-batch. The Q-learning algorithm does not follow the sequence of interactions but instead selects the action that maximizes the value at the next moment.
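A tabular version of the update (20) can be written in a few lines. Here the $1/M$ step size from the paper is kept as the learning rate, and the exploration scheme and toy parameters are illustrative assumptions.

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)] -> value
XI, M = 0.9, 32         # discount factor and mini-batch size

def q_update(s, a, r, s_next, actions):
    """Eq. (20): move Q(s, a) toward r + xi * max_a' Q(s', a')."""
    td_target = r + XI * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += (1.0 / M) * (td_target - Q[(s, a)])

def epsilon_greedy(s, actions, eps=0.1):
    """Act greedily w.r.t. Q most of the time, explore otherwise."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```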
We use neural networks to approximate the state-action value $Q(s_t, a_t)$. Approximation via neural networks is nonlinear: the basic unit is a neuron that performs a nonlinear transformation, and through the multilayered arrangement of neurons and the interconnections between layers, a complex nonlinear approximation is realized. We use a single neural network to represent all actions at once, associating a Q value with each action.

The DQN algorithm combines Q-learning and deep learning. The main innovations of the DQN are the replay buffer and the target network, which address two problems: the correlation between the sequences obtained by interaction, and the efficiency of using the interaction data. The replay buffer performs two functions, collecting samples and sampling them. Collected samples are stored in chronological order, and if the replay buffer is already full, a new sample overwrites the oldest one. For sampling, the replay buffer randomly draws a batch of samples from the cache for learning.
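A minimal replay buffer matching this description (FIFO overwrite when full, uniform random sampling) might look as follows; the capacity and batch size are illustrative choices, not the paper's settings.

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO experience store: a full buffer overwrites the oldest sample."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # deque drops oldest on overflow

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive interaction samples
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```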
Another model with exactly the same structure is also introduced in the DQN algorithm; this model is called the target network, and the original model is called the behavior network. At the beginning of training, the two networks use the same parameters. During training, the behavior network is responsible for interacting with the environment and obtaining interaction samples, and the value calculated by the target network is compared with the value estimated by the behavior network to obtain the target value and update the behavior network. Whenever training completes a certain number of iterations, the parameters of the behavior network are synchronized to the target network so that the next phase of learning can be performed.

Fig. 4. An illustration of DDQN for joint mode selection and power adaptation in the V2V communication network of 5G.

The training target derived from the target network can be defined as

$y_t = r_t + \xi \max_{a'} Q(s_{t+1}, a'; \omega^-)$ (21)

where $\omega^-$ indicates the parameters of the target network. Specifically, the formula can be expanded as

$y_t = r_t + \xi Q(s_{t+1}, \arg\max_{a'} Q(s_{t+1}, a'; \omega^-); \omega^-)$ (22)

After expanding the formula, we find that even with the target network, the model still uses the same parametric model both when selecting the optimal action and when calculating the target value, which inevitably leads to an overestimation of the value. To minimize the impact of overestimation, the simple but effective idea of DDQN is to decouple the two roles: the behavior network selects the optimal action, and the target network evaluates it.
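The difference between the DQN target (22) and the double-DQN target can be seen directly in code. This sketch assumes two NumPy Q-tables standing in for the behavior and target networks; the shapes and values are purely illustrative.

```python
import numpy as np

XI = 0.9
rng = np.random.default_rng(0)
q_behavior = rng.random((5, 9))  # stand-in for Q(s, a; omega), 5 states x 9 actions
q_target = rng.random((5, 9))    # stand-in for Q(s, a; omega^-)

def dqn_target(r, s_next):
    # Eqs. (21)-(22): the target network both selects and evaluates
    # the action, which tends to overestimate the value
    return r + XI * q_target[s_next].max()

def ddqn_target(r, s_next):
    # DDQN: the behavior network selects the action,
    # the target network evaluates it
    a_star = int(q_behavior[s_next].argmax())
    return r + XI * q_target[s_next, a_star]
```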
IV. SIMULATION RESULTS AND ANALYSIS

TABLE II. CHANNEL MODELS
TABLE III. SIMULATION PARAMETERS
The environment of each episode is not exactly the same, which leads to slight differences in the expected reward. Fig. 6 illustrates the effectiveness of the proposed method.

In addition, we investigate the parameter selection of the multi-agent DDQN algorithm. Fig. 7 shows the impact of the learning rate. The learning rate has no effect on the scale of the cumulative reward, but it does affect the convergence rate: the larger the learning rate, the faster the convergence.

Fig. 7. The convergence performance under different learning rates.

Fig. 8 shows the probability of choosing each of the three communication modes for different ratios of V2V links to V2I links. As the ratio of V2V links to V2I links increases, the probability of selecting the dedicated mode decreases dramatically, while the probability of choosing the cellular mode drops slowly. The probability of choosing the reuse mode increases while there are few V2V links causing interference; however, after the ratio of V2V links to V2I links reaches a certain threshold, the probability of selecting the reuse mode starts to decrease, and the probability of the cellular mode increases accordingly.

Fig. 9 depicts the impact of the vehicle speed on the sum throughput of the V2I links. As the figure shows, the sum throughput decreases as the vehicle speed increases, because high observation uncertainties are induced by the highly dynamic V2V communication network. Among the baselines, random search has the worst performance. For the centralized DQN algorithm, when the vehicles are moving slowly, the throughput of the V2I links is not much different from that of the distributed DDQN approach; however, as the vehicle speed continues to rise, its decision-making deteriorates because the centralized DQN algorithm cannot collect channel information in real time. As can be seen from the figure, our proposed approach achieves better performance than the other approaches across all vehicle speeds.

Fig. 9. Sum throughput performance of V2I links with varying vehicle speed.

Fig. 10. Probability of the satisfied V2V links with varying vehicle speed.

Fig. 10 presents the probability of satisfied V2V communication links as the vehicle speed increases. As the figure shows, when the vehicles move faster, the probability of satisfied communication links decreases. Higher movement speed leads to an increase in the average intervehicle distance; hence, the lower received desired power on the V2V links due to sparse traffic causes this performance degradation. In addition, the higher speed results in a more dynamic V2V communication environment and thus high observation uncertainties, which decrease the learning efficiency; hence, more unsatisfied communication link events occur in the high-vehicle-speed region. However, even as the vehicle speed increases with high uncertainties, our proposed approach can still maintain the probability of satisfied V2V
communication links at a considerable level and outperforms the other approaches with a higher satisfaction probability, especially in the high-vehicle-speed region. This reveals that the proposed distributed DDQN approach is more stable and robust in highly dynamic V2V networks.

V. CONCLUSION

In this paper, we study the problem of joint mode selection and power adaptation in the V2V communication network in 5G. The V2V links can multiplex the available resources such that they operate in different modes to meet the multiple QoS requirements enforced by the system. We base mode selection and power adaptation on slow fading parameters and statistical information of the channel to address the challenges caused by the inability to track fast-changing wireless channels. We propose an RL framework that solves the problem with the DDQN algorithm and learns the optimal policy under continuous-valued state variables. We consider each V2V link to be an agent; thus, the V2V links are capable of intelligently making adaptive decisions to improve their performance based on instantaneous observations in highly dynamic vehicular environments. Numerical results show that the method has good convergence.

REFERENCES

[1] D. Wang, D. Chen, B. Song, N. Guizani, X. Yu and X. Du, "From IoT to 5G I-IoT: The next generation IoT-based intelligent algorithms and 5G technologies," IEEE Commun. Mag., vol. 56, no. 10, pp. 114-120, Nov. 2018.
[2] Y. Zhang, F. Tian, B. Song and X. Du, "Social vehicle swarms: A novel perspective on socially aware vehicular communication architecture," IEEE Wirel. Commun., vol. 23, no. 4, pp. 82-89, Aug. 2016.
[3] Z. Zhou, C. Gao, C. Xu, Y. Zhang, S. Mumtaz and J. Rodriguez, "Social big-data-based content dissemination in internet of vehicles," IEEE T. Ind. Inform., vol. 14, no. 2, pp. 768-777, Jul. 2017.
[4] M. Tehrani, M. Uysal and H. Yanikomeroglu, "Device-to-device communication in 5G cellular networks: Challenges, solutions, and future directions," IEEE Commun. Mag., vol. 52, no. 5, pp. 86-92, May 2014.
[5] P. Dong, X. Du, H. Zhang and T. Xu, "A detection method for a novel DDoS attack against SDN controllers by vast new low-traffic flows," in 2016 IEEE International Conference on Communications (ICC), Kuala Lumpur, Malaysia, 2016, pp. 1-6.
[6] J. Li, G. Lei, G. Manogaran, G. Mastorakis and C. Mavromoustakis, "D2D communication mode selection and resource optimization algorithm with optimal throughput in 5G network," IEEE Access, vol. 7, pp. 25263-25273, Feb. 2019.
[7] J. Guo, B. Song, F. Yu, Y. Chi and C. Yuen, "Fast video frame correlation analysis for vehicular networks by using CVS-CNN," IEEE T. Veh. Technol., vol. 68, no. 7, pp. 6286-6292, May 2019.
[8] J. Mei, K. Zheng, L. Zhao, Y. Teng and X. Wang, "A latency and reliability guaranteed resource allocation scheme for LTE V2V communication systems," IEEE T. Wirel. Commun., vol. 17, no. 6, pp. 3850-3860, Mar. 2018.
[9] H. Min, W. Seo, J. Lee, S. Park and D. Hong, "Reliability improvement using receive mode selection in the device-to-device uplink period underlaying cellular networks," IEEE T. Wirel. Commun., vol. 10, no. 2, pp. 413-418, Dec. 2010.
[10] K. Akkarajitsakul, P. Phunchongharn, E. Hossain and V. Bhargava, "Mode selection for energy-efficient D2D communications in LTE-advanced networks: A coalitional game approach," in 2012 IEEE International Conference on Communication Systems (ICCS), Singapore, 2012, pp. 488-492.
[11] Y. Li, D. Jin, J. Yuan and Z. Han, "Coalitional games for resource allocation in the device-to-device uplink underlaying cellular networks," IEEE T. Wirel. Commun., vol. 13, no. 7, pp. 3965-3977, May 2014.
[12] C. Xu, L. Song, Z. Han, D. Li and B. Jiao, "Resource allocation using a reverse iterative combinatorial auction for device-to-device underlay cellular networks," in 2012 IEEE Global Communications Conference (GLOBECOM), Anaheim, CA, USA, 2012, pp. 4542-4547.
[13] W. Sun, D. Yuan, E. Ström and F. Brännström, "Cluster-based radio resource management for D2D-supported safety-critical V2X communications," IEEE T. Wirel. Commun., vol. 15, no. 4, pp. 2756-2769, Dec. 2015.
[14] L. Liang, G. Li and W. Xu, "Resource allocation for D2D-enabled vehicular communications," IEEE T. Commun., vol. 65, no. 7, pp. 3186-3197, Apr. 2017.
[15] D. Han, B. Bai and W. Chen, "Secure V2V communications via relays: Resource allocation and performance analysis," IEEE Wirel. Commun. Le., vol. 6, no. 3, pp. 342-345, Mar. 2017.
[16] W. Sun, E. Ström, F. Brännström, K. Sou and Y. Sui, "Radio resource management for D2D-based V2V communication," IEEE T. Veh. Technol., vol. 65, no. 8, pp. 6636-6650, Sep. 2015.
[17] F. Abbas and P. Fan, "A hybrid low-latency D2D resource allocation scheme based on cellular V2X networks," in 2018 IEEE International Conference on Communications Workshops (ICC Workshops), Kansas City, MO, USA, 2018, pp. 1-6.
[18] A. Rawat, H. Shah and V. Patil, "Towards intelligent vehicular networks: A machine learning framework," Int. J. Res. Engin., Sci. Manage., vol. 1, no. 9, pp. 2581-5782, Sep. 2018.
[19] A. Koushik, F. Hu and S. Kumar, "Intelligent spectrum management based on transfer actor-critic learning for rateless transmissions in cognitive radio networks," IEEE T. Mobile Comput., vol. 17, no. 5, pp. 1204-1215, Aug. 2017.
[20] H. Ye, L. Liang, G. Li, J. Kim, L. Lu and M. Wu, "Machine learning for vehicular networks: Recent advances and application examples," IEEE Veh. Technol. Mag., vol. 13, no. 2, pp. 94-101, Apr. 2018.
[21] R. Atallah, C. Assi and M. Khabbaz, "Scheduling the operation of a connected vehicular network using deep reinforcement learning," IEEE T. Intell. Transp., vol. 20, no. 5, pp. 1669-1682, May 2018.
[22] W. Liu, G. Qin, Y. He and F. Jiang, "Distributed cooperative reinforcement learning-based traffic signal control that integrates V2X networks' dynamic clustering," IEEE T. Veh. Technol., vol. 66, no. 10, pp. 8667-8681, May 2017.
[23] H. Ye, G. Li and B. Juang, "Deep reinforcement learning based resource allocation for V2V communications," IEEE T. Veh. Technol., vol. 68, no. 4, pp. 3163-3173, Feb. 2019.
[24] W. Saad, Z. Han, A. Hjorungnes, D. Niyato and E. Hossain, "Coalition formation games for distributed cooperation among roadside units in vehicular networks," IEEE J. Sel. Area. Comm., vol. 29, no. 1, pp. 48-60, Dec. 2010.
[25] R. Molina-Masegosa and J. Gozalvez, "LTE-V for sidelink 5G V2X vehicular communications: A new 5G technology for short-range vehicle-to-everything communications," IEEE Veh. Technol. Mag., vol. 12, no. 4, pp. 30-39, Dec. 2017.
[26] H. Seo, K. Lee, S. Yasukawa, Y. Peng and P. Sartori, "LTE evolution for vehicle-to-everything services," IEEE Commun. Mag., vol. 54, no. 6, pp. 22-28, Jun. 2016.
[27] G. Wu and P. Xu, "Link QoS analysis of 5G-enabled V2V network based on vehicular cloud," Science China Information Sciences, vol. 61, no. 10, p. 109305, Aug. 2018.
[28] L. Wu, X. Du, W. Wang and B. Lin, "An out-of-band authentication scheme for internet of things using blockchain technology," in 2018 International Conference on Computing, Networking and Communications (ICNC), Maui, HI, USA, 2018, pp. 769-773.
[29] C. Chien, Y. Chen and H. Hsieh, "Exploiting spatial reuse gain through joint mode selection and resource allocation for underlay device-to-device communications," in 15th International Symposium on Wireless Personal Multimedia Communications (WPMC), Taipei, 2012, pp. 80-84.
[30] 3GPP, "Study on LTE-based V2X services (Release 14)," 3rd Generation Partnership Project, Technical Specification Group Radio Access Network, TR 36.885 V14.0.0.
[31] P. Kyosti, J. Meinila and L. Hentila, "IST-4-027756 WINNER II D1.1.2 V1.2: WINNER II channel models," Tech. Rep., Sep. 2007.