

Federated Reinforcement Learning-Based Resource Allocation in D2D-Enabled 6G

Qi Guo, Fengxiao Tang, Member, IEEE, and Nei Kato, Fellow, IEEE

Q. Guo and N. Kato are with the Graduate School of Information Sciences (GSIS), Tohoku University, Sendai, Japan. Email: {guo.qi, kato}@it.is.tohoku.ac.jp
F. Tang is with the School of Computer Science and Engineering, Central South University, Changsha, China. Email: tangfengxiao@csu.edu.cn

Abstract—The current 5G and conceived 6G era, with ultra-high density, ultra-high frequency bandwidth, and ultra-low latency, can support emerging applications like Extended Reality (XR), Vehicle to Everything (V2X), and massive Internet of Things (IoT). With the rapid growth of transmission rate requirements and link numbers in wireless communication networks, how to allocate resources reasonably and further improve spectrum utilization challenges the traditional approaches. To address these problems, technologies such as device-to-device (D2D) communication and machine learning (ML) are introduced into the traditional cellular communication network to improve network performance. However, due to the interference caused by spectrum reuse, efficient resource allocation for both cellular users and D2D users is necessary. In this article, we consider an underlay-mode D2D-enabled wireless network to improve the spectrum utilization, and propose a federated learning (FL)-aided, deep reinforcement learning (DRL)-based decentralized resource allocation approach to maximize the sum capacity and minimize the overall power consumption while guaranteeing the quality of service (QoS) requirements of both cellular users and D2D users. The performance of the proposed scheme is evaluated through simulations under 5G millimeter-wave (mm-wave) and 6G terahertz (THz) scenarios separately. The simulation results show that the proposal achieves significant network performance gains compared with the baseline algorithms.

Index Terms—6G, Device-to-device (D2D) communication, resource allocation, federated learning (FL), deep reinforcement learning (DRL)

I. INTRODUCTION

The fifth-generation (5G) network has entered the commercial stage and is being rapidly deployed around the world. At the same time, with the rapid increase in the number of mobile devices and interactive services, data traffic and user rate requirements have increased accordingly. In addition to human-oriented communications, the scale of machine-to-machine (M2M) terminals will grow rapidly and will approach saturation by 2030. It is predicted that the number of M2M terminals will reach 97 billion, about 14 times that of 2020 [1]. It can be foreseen that 5G will be hard pressed to handle the tremendous volume of data traffic in 2030 and beyond, so research on the sixth-generation (6G) network is necessary. Properties like wide bandwidth, low latency, and massive connections in 5G [1] will be further improved in the 6G era, with ultra-high frequency bandwidth, ultra-low latency, ultra-high density, and a space-air-ground integrated communication system enabled by satellite communication, to achieve seamless coverage and the interconnection of all things [2]. Millimeter-wave, one of the key technologies in 5G, provides high frequency and wide bandwidth. The terahertz band, with a spectrum between 0.1 and 10 THz, has attracted great attention as a candidate spectrum for 6G because it can provide higher bandwidth and transmission rates than mmWave. However, both the mmWave and THz bands suffer from high transmission loss because of the high frequency, and the higher the carrier frequency, the more serious the loss. Therefore, the transmission distance of the mmWave and THz bands is limited, which makes these two kinds of waves more suitable for short-distance communication. D2D communication refers to direct communication between devices without going through the base station. In addition to the dense deployment of cells, D2D communication is another effective method to shorten the distance between the sender and the receiver; meanwhile, it improves the user quality of service (QoS) and the overall spectrum utilization of the network. A D2D-enabled 6G network model is shown in Fig. 1. Although D2D communication can improve the network spectrum efficiency, it is also necessary to reasonably control the interference between D2D links and cellular links because of radio frequency reuse [3]. Resource allocation is an effective method to reduce the interference caused by frequency reuse, and it consists of three common parts: mode selection, power control, and channel allocation. Mode selection refers to selecting whether a device works in cellular mode or D2D mode according to the current network status. Power control and channel allocation refer to allocating appropriate transmit power and channels to D2D links under the QoS requirements to reduce the interference to the cellular links.

Resource allocation in D2D networks has been widely studied recently [4]-[6]. In [4], the authors jointly addressed the issues of communication mode selection, bandwidth allocation, power control, and channel selection for a D2D-enabled heterogeneous network, aiming to maximize the long-term energy efficiency, and a deep deterministic policy gradient (DDPG) approach is employed to solve the problem. The authors in [5] presented a joint spectrum and power allocation problem in a high-rate D2D communication network to minimize the total power consumption of the mobile devices while ensuring the QoS requirements of the devices, and carrier aggregation (CA) technology is employed so that D2D devices can reuse the uplink spectrum of multiple cellular users. Moreover, a power control scheme based on the Dinkelbach algorithm and a resource allocation scheme based on the message passing algorithm (MPA) are proposed in [6] to solve the global energy efficiency optimization problem in a D2D communication network.


Recently, federated learning (FL), as a promising privacy-protecting and communication cost reduction tool in machine learning, has been widely discussed [7], [8]. The authors in [7] proposed a two-stage federated learning algorithm to predict the content caching placement by jointly considering traffic distribution, user device mobility, and localized content popularity.

In this article, FL technology is applied to achieve a decentralized resource allocation approach while protecting user privacy in a D2D-enabled wireless network. We utilize deep reinforcement learning (DRL) to optimize the dynamic resource allocation policy on each device. Specifically, a deep Q-network is adopted to give the action policy according to varying environment states. Since user devices cannot obtain overall network information under a decentralized training framework, we consider utilizing users' historical information and neighbors' information as the input of the deep Q-network to train the resource allocation model. Thus, the contributions of this article can be summarized as follows:
• We give the same priority to cellular user devices and D2D user devices to allocate their transmission resources under a decentralized paradigm, and consider the mobility of user devices in the base station's coverage area to get closer to a real dynamic network environment.
• We employ FL to achieve a decentralized policy training framework while protecting user privacy, and combine it with DRL to train the intelligent dynamic resource allocation strategy.
• We compare the performance in terms of network throughput and overall power consumption between the proposal and four baseline methods.
The remainder of this article is organized as follows. We first give a brief introduction to the promising technologies utilized in this article in Section II. In Section III, the proposed FL-aided DRL-based resource allocation approach is elaborated. Then, simulation conditions and performance analyses are introduced in Section IV. We discuss several prospective research directions in Section V. Finally, the conclusion is summarized in Section VI.

II. PROMISING TECHNOLOGIES

In this section, we introduce four technologies that are utilized in this research: the mmWave and THz regime, D2D communication, federated learning, and reinforcement learning.

A. Utilization of mmWave and THz Regime

As one of the key technologies of 5G, mmWave utilizes the spectrum within 30-300 GHz and offers four appealing features [11]. First, mmWave can provide a large amount of available bandwidth, which is orders of magnitude larger than the microwave band, and support 5 Gbps and even 10 Gbps transmission rates. Second, mmWave can achieve an air interface delay within 1 ms. Third, the size of antennas for mmWave can be further reduced because of the short wavelength. Fourth, the positioning of mmWave can be accurate to the centimeter level or even lower, because the positioning capability of a wireless system is closely related to its wavelength. Along with the advantages of mmWave, it also brings challenges. First of all, the high transmission power and broad bandwidth will cause serious non-linear distortion of signals. Moreover, atmospheric and molecular absorption severely limit the effective transmission distance of mmWave, especially in the 60 GHz band. Additionally, the high propagation loss also significantly increases the sensitivity of channel states to mobility [2].

THz communication refers to the use of THz waves (0.1-10 THz) as the carrier to transmit data, which can provide higher bandwidth; a total of 81 GHz of spectrum is allocated for fixed or mobile communications in the frequency range 100-275 GHz. Meanwhile, the characteristic of a shorter wavelength not only improves the integration level of devices but also significantly increases the dimension of antenna arrays compared with mmWave technology [12]. It is expected to employ over 10000 antennas in a single base station and provide hundreds of super-narrow beams simultaneously. Thus, it is considered a key technology for enabling 6G to achieve extremely high transmission capacity together with the ultra-massive connections that support the Internet of Everything. However, THz waves also severely suffer from high path loss and can only provide very limited coverage. Moreover, technical challenges like implementing hardware circuits, including amplifiers and antennas, significantly affect the practical deployment of THz technology [2].

B. D2D Communication

In a traditional cellular network, all communications must go through the base station even when the sender and receiver are within a short distance. D2D communication, which enables direct data exchange between D2D users without any interaction with the base station or core network [13], is regarded as a paradigm with great potential for enhancing network performance in the current 5G and coming 6G era. Unlike technologies such as Bluetooth, the use of communication resources by D2D technology is controllable. D2D communication has two main modes according to different resource allocation methods. In the underlay mode, the spectrum resource of cellular users is shared by multiple D2D users, which will cause significant interference to cellular users; therefore, the resource usage of D2D users should be determined carefully. In the overlay mode, dedicated spectrum resources are allocated to D2D communication, while the rest of the spectrum resources are used by the cellular network. Although the overlay mode avoids interference between D2D links and cellular links, the spectrum resource utilization of the overall network will decrease. In addition to improving spectrum efficiency, D2D communication can also improve throughput, energy efficiency, and fairness in a wireless communication network.

C. Federated Learning

In traditional centralized model training or resource control approaches, the data collected by mobile devices is uploaded and processed on the server, which causes privacy issues. Moreover, the complicated model training and data analysis on the user device side are always time-consuming.


Fig. 1. The architecture of federated learning based D2D-enabled 6G network

To cope with these challenges, a decentralized approach known as federated learning is introduced to guarantee that the training data remains on personal devices and to collaboratively train complex machine learning models among distributed devices [14]. There are three steps to perform FL. First, the FL server decides the training task and distributes the initial global model to the distributed devices that are selected to participate in the model training process. Then, the devices use their local data to train their local models based on the initial global model to find the optimal parameters that minimize the loss function. After several rounds of local training, the devices upload their local models to the FL server. Finally, the FL server aggregates these local models from the participants and sends the updated model back to the data owners. The above three steps are repeated until the global loss function converges or a designed training accuracy is achieved. By performing local model training based on users' local raw data on the decentralized device side, and leveraging infrequent local model aggregation at the centralized server, federated learning can significantly improve the model training performance.
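As a concrete illustration of these three steps, the following is a minimal FedAvg-style sketch. The model representation (a NumPy weight vector), the dummy quadratic local loss, and the unweighted averaging are illustrative assumptions, not the exact procedure used in this article.

```python
import numpy as np

def local_training(global_weights, local_data, lr=0.01, epochs=1):
    """Sketch of one device's local update: start from the global model and
    take gradient-style steps on local data (here a dummy quadratic loss)."""
    w = global_weights.copy()
    for _ in range(epochs):
        for x, y in local_data:
            grad = 2 * (w @ x - y) * x        # gradient of (w.x - y)^2
            w -= lr * grad
    return w

def federated_round(global_weights, clients_data):
    """One FL round: distribute the global model, train locally on each
    selected device, then average the returned local models."""
    local_models = [local_training(global_weights, data) for data in clients_data]
    return np.mean(local_models, axis=0)      # simple unweighted aggregation

# Toy usage: 3 devices, each with a few (feature, target) samples.
rng = np.random.default_rng(0)
clients = [[(rng.normal(size=4), rng.normal()) for _ in range(5)] for _ in range(3)]
w_global = np.zeros(4)
for _ in range(10):                            # repeated until convergence in practice
    w_global = federated_round(w_global, clients)
```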
D. Deep Reinforcement Learning

Reinforcement learning (RL) is the third machine learning paradigm alongside supervised learning and unsupervised learning. RL can be described as an agent that continuously learns from interactions with a time-varying environment to accomplish a specific goal [15]. In general, when an RL problem satisfies the Markov property, that is, the future depends only on the current state and action but not on the past, it is formulated as a Markov Decision Process (MDP), which is characterized by a state space, an action space, state transition probabilities, a policy, and a reward. Generally, RL can be divided into model-based and model-free RL. Dynamic programming is utilized to solve problems in model-based learning. However, the transition probabilities and reward are often unknown in realistic scenarios, so model-free learning such as Q-learning has attracted much attention. Q-learning focuses on problems in which both states and actions are discrete and finite, so tables called Q-tables can be used to record the data. However, many practical tasks have a significantly large number of states and actions, which led to the proposal of DRL. In detail, DRL combines RL and deep learning (DL): it utilizes RL to define the problems and optimization objectives, and DL to solve the modeling problem of the policy and value functions. By performing the non-linear approximation of deep neural networks (DNNs), the Q-table is replaced by a DNN and the Q-table updates are converted into network weight updates. Moreover, technologies such as experience replay and fixed target networks have been developed for DRL to improve the training process and convergence performance.

III. PROPOSED FL-AIDED DRL-BASED RESOURCE ALLOCATION APPROACH

In this section, the network scenario and research objectives of this article are elaborated first. Then, we introduce the proposed FL-aided DRL-based decentralized resource allocation algorithm.

A. Network Scenario

A general D2D-enabled heterogeneous 6G wireless network, constructed with a macro-cell and small-cells, is shown in Fig. 1. In such a network, there are several small-cell base stations (SBS) with corresponding small cells densely distributed in the area, and one macro-cell base station (MBS) covers all small cells. In each small cell, there are several users connected to the SBS or communicating directly through D2D links. Each SBS relays transmission data to the MBS. Besides, users inside the macro cell but outside the small cells can connect directly to the MBS or establish a D2D link with other users who are within the available range.


Fig. 2. The network model in this research


Fig. 3. Detailed framework of DRL

In this research, we consider a simplified network scenario consisting of a base station (BS) in the center of the area, with cellular user equipment (CUE) and D2D user equipment (DUE) distributed uniformly around it within the coverage of the BS. Uplink transmission is considered in this research, as shown in Fig. 2. In our consideration, each CUE can work in cellular mode and D2D mode at the same time; meanwhile, each DUE will select a CUE or DUE with which a D2D link can be established within the distance D. Since 5G and 6G networks utilize mmWave and THz technology, respectively, the value of D will vary depending on the network environment. In contrast to traditional D2D-enabled wireless networks, we consider the mobility of all UEs, which means each UE moves randomly in the area at a speed of 3 m/s. Therefore, D2D links go through a process of breaking down and re-establishing during the movement of UEs. In detail, once the distance between the transmitter and receiver of a D2D link exceeds D, the link breaks down and the DUE searches for other UEs within distance D to establish a new D2D link. In the considered D2D-enabled wireless network, DUEs operate in spectrum sharing mode, also called underlay mode, sharing the same channels with CUEs, and the available spectrum is divided into M orthogonal channels with bandwidth W. From Shannon's theory, we know that the channel bandwidth and the signal to interference plus noise ratio (SINR) are the factors affecting the achievable transmission capacity of UEs. The power of the additive white Gaussian noise (AWGN) at a receiver is assumed to be a constant value in this research. Therefore, the SINR depends on the signal power and the interference; in particular, due to channel reuse, mutual interference between DUEs and CUEs is unavoidable. As we consider the mobility of UEs, the interference between DUEs and CUEs should be dynamically calculated based on the instantaneous distance, path loss, and fading. According to the different characteristics of the 5G and 6G networks, the path loss model varies in different scenarios. For simplicity, it is not introduced in this article and can be found in [9], [10].
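To make the relation between link distance, SINR, and achievable rate concrete, the sketch below computes the Shannon capacity of one D2D link that reuses a channel with co-channel interferers. The simple distance-based channel gain, the noise power, and the link-break check against D are simplifying assumptions and do not reproduce the channel models referenced in [9], [10].

```python
import math

def path_gain(distance_m, exponent=3.0, ref_gain=1e-3):
    """Very simplified distance-based channel gain (assumed model, not the
    3GPP mmWave or THz channel models cited in the article)."""
    return ref_gain / max(distance_m, 1.0) ** exponent

def link_rate_bps(tx_power_w, d_tx_rx_m, interferers, bandwidth_hz, noise_w=1e-13):
    """Shannon rate of one link; `interferers` is a list of
    (interferer_tx_power_w, interferer_to_rx_distance_m) on the same channel."""
    signal = tx_power_w * path_gain(d_tx_rx_m)
    interference = sum(p * path_gain(d) for p, d in interferers)
    sinr = signal / (interference + noise_w)
    return bandwidth_hz * math.log2(1.0 + sinr)

# Toy usage: a 30 m D2D link on a 100 MHz channel reused by one CUE 80 m away.
D_MAX = 50.0                                   # assumed break distance (5G scenario)
d2d_distance = 30.0
if d2d_distance > D_MAX:
    print("link broken: search for a new D2D peer within D")
else:
    rate = link_rate_bps(tx_power_w=0.1, d_tx_rx_m=d2d_distance,
                         interferers=[(0.2, 80.0)], bandwidth_hz=100e6)
    print(f"achievable rate = {rate / 1e9:.2f} Gbps")
```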
In this article, one objective is to maximize the throughput of the network, which means providing qualified service. However, focusing only on throughput may cause huge overall power consumption in the network. Therefore, another objective is to minimize the overall power consumption of the network while guaranteeing the QoS requirements, which refer to a tolerable minimum data transmission rate, of all UEs. By finding the optimal resource allocation strategy, including channel assignment and power control, for each UE, the network performance can be significantly improved.

B. FL-aided DRL Approach

With the objectives mentioned in Section III-A, we can maximize the sum capacity and minimize the global energy consumption through the optimal allocation of channel and power resources. However, due to the mobility of UEs, the network topology changes dynamically, and dynamic resource allocation needs to be performed according to the current network states. Following the core idea of reinforcement learning introduced in Section II, that is, determining the action strategy according to the environment state, we utilize deep Q-network (DQN)-based DRL to perform dynamic resource allocation in this research.

Traditional centralized resource allocation approaches require information collection from UEs and distribution of the resource allocation policy to UEs. These kinds of approaches have two obvious shortcomings. The first one is the lack of user privacy protection during the resource allocation process. Meanwhile, the signaling overhead between UEs and the centralized server will be relatively large. To reduce the signaling overhead while protecting the privacy of the UEs, we utilize an FL-aided decentralized resource allocation approach in this research. As shown on the right side of Fig. 1, the framework of the FL-aided DRL-based approach is divided into two parts: one is DRL-based local model training, and the other is FL global model aggregation.

In the DRL-based local training process, the framework of DRL consists of an agent and an environment interacting with each other, as shown in Fig. 3.


In our considered scenario, the agents are all UEs, which continuously collect state information from the environment and choose actions, namely the channel and power allocation results, based on the collected environment states. In detail, at time slot $t$, each UE collects its own environment state and forms the state $S_t$, then performs action $a_t$; the network environment state changes to $S_{t+1}$ under action $a_t$, and the reward $r_t$ is fed back to the UE. The immediate reward is calculated as $r_t = \sum_{n \in N} c_1 R_n - \sum_{n \in N} c_2 P_n + c_3 A \cdot N_{\mathrm{over}} - c_4 R_{\mathrm{under}}$. The expression is composed of four parts: the first part is the sum capacity of the UEs, the second part is the global power consumption in the network, the third part indicates the revenue for satisfying the QoS, and the fourth part represents the penalty for unsatisfied transmission rates. Here, $N_{\mathrm{over}}$ and $A$ represent the number of user devices whose transmission rate is over the threshold $R_{\mathrm{thr}}$ and the corresponding revenue, respectively, and $c_1$, $c_2$, $c_3$, and $c_4$ are constant coefficients that balance the revenue and penalty. Moreover, $R_{\mathrm{under}} = \sum_{m \in M} (R_{\mathrm{thr}} - R_m)$, where $R_m$ is a transmission rate that is under $R_{\mathrm{thr}}$.
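A small sketch of this reward computation is given below; the variable names and the way the per-UE rates and powers are passed in are illustrative assumptions consistent with the expression above, and the coefficient values are made up.

```python
def immediate_reward(rates, powers, r_thr, c1, c2, c3, c4, revenue_a):
    """Reward r_t = c1*sum(R_n) - c2*sum(P_n) + c3*A*N_over - c4*R_under,
    where N_over counts UEs whose rate exceeds the QoS threshold R_thr and
    R_under accumulates the rate deficit of UEs below the threshold."""
    sum_capacity = sum(rates)                     # first term: sum capacity
    total_power = sum(powers)                     # second term: power consumption
    n_over = sum(1 for r in rates if r > r_thr)   # UEs satisfying the QoS
    r_under = sum(r_thr - r for r in rates if r < r_thr)  # unsatisfied-rate penalty
    return c1 * sum_capacity - c2 * total_power + c3 * revenue_a * n_over - c4 * r_under

# Toy usage with made-up coefficients (the article does not give c1..c4 or A).
r_t = immediate_reward(rates=[1.2, 0.4, 0.9], powers=[0.1, 0.3, 0.2],
                       r_thr=0.5, c1=1.0, c2=0.5, c3=0.1, c4=1.0, revenue_a=2.0)
```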

With $r_t$ and $S_{t+1}$, the UEs can update the weights of the DNNs in DRL by minimizing the loss function at each step. In this research, the mean-square error (MSE) is adopted as the loss function, which is defined as $L = \mathbb{E}\left[\left(y_t - Q(S_t, a_t)\right)^2\right]$, where $y_t = r_t + \gamma \max_{a_{t+1}} \hat{Q}(S_{t+1}, a_{t+1})$, and $Q(S_t, a_t)$ and $\hat{Q}(S_{t+1}, a_{t+1})$ represent the Q-network and the target network, respectively. The UEs choose actions based on the $\varepsilon$-greedy algorithm to cover as many actions as possible. After several rounds of local model training, the network moves to the next stage.
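The per-step local DQN update described above can be sketched as follows. The tiny two-layer network, the optimizer settings, and the state/action dimensions are placeholders, shown only to illustrate the $\varepsilon$-greedy action choice and the target $y_t = r_t + \gamma \max_{a} \hat{Q}(S_{t+1}, a)$; they are not the architecture used in the article's simulations.

```python
import random
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA, EPSILON = 8, 40, 0.9, 0.1   # placeholder sizes

def make_net():
    # Small MLP standing in for a UE's Q-network / target network.
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

q_net, target_net = make_net(), make_net()
target_net.load_state_dict(q_net.state_dict())            # fixed target network
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def choose_action(state):
    """epsilon-greedy selection over the joint channel/power action index."""
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return int(q_net(state).argmax())

def train_step(state, action, reward, next_state):
    """One local update: minimize (y_t - Q(S_t, a_t))^2 with y_t from the target net."""
    with torch.no_grad():
        y_t = reward + GAMMA * target_net(next_state).max()
    q_sa = q_net(state)[action]
    loss = nn.functional.mse_loss(q_sa, y_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Toy usage with a random transition.
s, s_next = torch.randn(STATE_DIM), torch.randn(STATE_DIM)
a = choose_action(s)
train_step(s, a, reward=1.5, next_state=s_next)
```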
In the FL global model aggregation stage, after $N_L$ rounds of local model training on the UEs, the BS acts as an aggregation server: it aggregates the local model parameters received from the UEs to update the global model parameters. After global model aggregation, the UEs download the updated global model parameters from the BS and continue local model training. These two steps are repeated until the global model converges. When the model training is over, the UEs can utilize the trained model to allocate resources autonomously according to the network states.
are pure-DRL, local learning, random and equal methods. In
over, UEs can utilize the trained model to allocate resources
detail, the resource allocation policy model training of pure-
autonomously according to the network states.
DRL and local learning methods are both based on DRL:
for the pure-DRL method, all UEs separately train their local
IV. P ERFORMANCE A NALYSIS model without utilizing FL; for the local training method, only
one UE is selected to train the local model and other UEs will
In this section, we conduct simulations to evaluate the per- share the trained model. Fig. 5 shows the simulation results of
formance of the proposed FL-aided DRL-based resource allo- the proposed approach and baseline methods in the 5G and 6G
cation approach under 5G and 6G scenarios in terms of the scenarios respectively. As shown in Fig. 5(a) and Fig. 5(d), no
learning process and network performance. We consider one matter in the case of mm-Wave or THz, with the training pro-
BS in the center, 25 CUEs, and vary the number of DUEs cess going on, throughput increased by over 30 percent in our
uniformly distributed in 120 m x 120 m 5G network area and proposal. However, for pure-DRL and local learning methods,
40 m x 40 m 6G network area. All the UEs randomly move although the throughput increases as the training process go
in the network area with the speed of 3m/s. For those D2D on, it finally reaches a relevantly low level compared with our
links distances between transmitter and receiver devices over proposal; for random and equal resource allocation methods,
50 m in the 5G scenario and 15 m in the 6G scenario will the throughput is almost a constant value. Similarly, with the
break down. The summary of other parameters used in the increase in the training process, the overall power consumption
simulation is presented in Table I. of our proposal decreased by over 30 percent, as shown in Fig.
We firstly analyze the learning process of the FL-aided DRL 5(b) and Fig. 5(e). Moreover, we further illustrate the network
algorithm, the result is shown in Fig. 4. The result demon- performance with different numbers of DUEs. The throughput
strates that the loss of the proposed algorithm is relatively high is collected after 7000 training steps. As shown in Fig. 5(c)
at the beginning of the learning process, with the increase of and Fig. 5(f), our proposed FL-aided DRL-based algorithm
the training epoch, the loss keeps decreasing until reaches a can achieve better performance under various DUEs numbers


Fig. 5. The network performance of the proposal compared with the baseline resource allocation algorithms: (a)-(c) are under 73 GHz and (d)-(f) are under 275 GHz. (a), (d) Network throughput during learning; (b), (e) overall power consumption during learning; (c), (f) network throughput with different numbers of D2D users.

V. FUTURE RESEARCH ISSUES

The above introduction and analysis illustrate the effectiveness of our proposal. However, there are still some issues that need to be further discussed in our future research. In this section, we discuss three directions in the following paragraphs.

A. Intelligent and Collaborative Inter-Cell Resource Allocation

Since mmWave and THz waves severely suffer from high path loss and are only suitable for short-distance communication, the base station deployment density for 5G and 6G networks is predictably large, and the coverage area of each base station will be relatively small. Because of this characteristic of mmWave and THz waves, mobility will cause user devices to frequently switch between different base stations. Under this consideration, in a D2D-enabled 6G network, an established D2D link may also go through switching between base stations; in particular, the transmitter and receiver of a D2D link may be located under different base stations' coverage. To handle inter-cell interference and to improve the network performance of the D2D-enabled 6G network, intelligent and collaborative inter-cell resource allocation approaches require more attention.

B. Realistic Mobility Model

In the 6G network, there will be a variety of smart devices connected to the network to realize the interconnection of everything. However, due to the heterogeneous nature of devices, the moving speed of different kinds of devices will differ from each other. Therefore, various mobility models need to be considered to be close to a realistic scenario: for example, a slow-speed model for pedestrians or robots, a high-speed model for vehicles, and an ultra-high-speed model for unmanned aerial vehicles or trains, respectively.

C. Communication Efficiency

Federated learning is considered to be a promising method for protecting user privacy and training complicated machine learning models in a decentralized paradigm. In the global model aggregation process of conventional FL, the aggregator waits for all users to finish their execution, including the slowest one; however, due to the heterogeneous nature of computation and communication resources, the local model training and data analysis execution time varies across users. Meanwhile, several rounds of communication between the participating devices and the FL server may be required to achieve the designed accuracy. The high dimensionality of the model parameter updates will also lead to high communication costs.


Thus, this kind of traditional aggregation approach will reduce the efficiency of model training. Therefore, how to improve communication efficiency and model training efficiency is an important research direction.

VI. CONCLUSION

In this article, an FL-aided DRL-based decentralized resource allocation approach for D2D-enabled dynamic 5G and 6G networks has been proposed to maximize the sum capacity and minimize the overall power consumption while guaranteeing the QoS requirements of UEs. We also briefly introduced four technologies that support our proposal, in which D2D communication improves the network performance, FL helps to protect the privacy of UEs while supporting a decentralized model training paradigm, and DRL provides UEs with the ability to decide their resource allocation policy based on network states. The simulation results show that our proposal can help UEs dynamically allocate resources according to limited network states, which significantly improves the network performance, including throughput and power consumption. We also discussed some future research directions for D2D-enabled wireless networks in which more realistic scenarios and service requirements are considered.

REFERENCES
[1] "Study on Scenarios and Requirements for Next Generation Access Technologies," 2016, [Online]. Available: http://www.3gpp.org
[2] W. Saad, M. Bennis, and M. Chen, "A Vision of 6G Wireless Systems: Applications, Trends, Technologies, and Open Research Problems," IEEE Network, vol. 34, no. 3, pp. 134–142, 2019.
[3] S. Shi, Y. Xiao, W. Lou, C. Wang, X. Li, Y. T. Hou, and J. H. Reed, "Challenges and New Directions in Securing Spectrum Access Systems," IEEE Internet of Things Journal, vol. 8, no. 8, pp. 6498–6518, 2021.
[4] T. Zhang, K. Zhu, and J. Wang, "Energy-Efficient Mode Selection and Resource Allocation for D2D-Enabled Heterogeneous Networks: A Deep Reinforcement Learning Approach," IEEE Transactions on Wireless Communications, vol. 20, no. 2, pp. 1175–1187, 2020.
[5] R. Li, P. Hong, K. Xue, M. Zhang, and T. Yang, "Energy-Efficient Resource Allocation for High-Rate Underlay D2D Communications with Statistical CSI: A One-to-Many Strategy," IEEE Transactions on Vehicular Technology, vol. 69, no. 4, pp. 4006–4018, 2020.
[6] B. Özbek, M. Pischella, and D. Le Ruyet, "Energy Efficient Resource Allocation for Underlaying Multi-D2D Enabled Multiple-Antennas Communications," IEEE Transactions on Vehicular Technology, vol. 69, no. 6, pp. 6189–6199, 2020.
[7] Z. M. Fadlullah and N. Kato, "HCP: Heterogeneous Computing Platform for Federated Learning Based Collaborative Content Caching Towards 6G Networks," IEEE Transactions on Emerging Topics in Computing, vol. 10, no. 1, pp. 112–123, 2022.
[8] Y. Lu, X. Huang, K. Zhang, S. Maharjan, and Y. Zhang, "Communication-Efficient Federated Learning for Digital Twin Edge Networks in Industrial IoT," IEEE Transactions on Industrial Informatics, vol. 17, no. 8, pp. 5709–5718, 2020.
[9] "Study on Channel Models for Frequencies from 0.5 GHz to 100 GHz," 2018, [Online]. Available: http://www.3gpp.org
[10] J. M. Jornet and I. F. Akyildiz, "Channel Modeling and Capacity Analysis for Electromagnetic Wireless Nanonetworks in the Terahertz Band," IEEE Transactions on Wireless Communications, vol. 10, no. 10, pp. 3211–3221, 2011.
[11] Y. Song, Z. Gong, Y. Chen, and C. Li, "Efficient Channel Estimation for Wideband Millimeter Wave Massive MIMO Systems with Beam Squint," IEEE Transactions on Communications, 2022.
[12] H. Sarieddeen, M.-S. Alouini, and T. Y. Al-Naffouri, "An Overview of Signal Processing Techniques for Terahertz Communications," Proceedings of the IEEE, 2021.
[13] M. Waqas, Y. Niu, Y. Li, M. Ahmed, D. Jin, S. Chen, and Z. Han, "A Comprehensive Survey on Mobility-Aware D2D Communications: Principles, Practice and Challenges," IEEE Communications Surveys & Tutorials, vol. 22, no. 3, pp. 1863–1886, 2019.
[14] T. Zhang and S. Mao, "Energy-Efficient Federated Learning with Intelligent Reflecting Surface," IEEE Transactions on Green Communications and Networking, vol. 6, no. 2, pp. 845–858, 2021.
[15] T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, "Deep Reinforcement Learning for Multiagent Systems: A Review of Challenges, Solutions, and Applications," IEEE Transactions on Cybernetics, vol. 50, no. 9, pp. 3826–3839, 2020.

BIOGRAPHIES

Qi Guo received the M.S. degree in electronics science and technology from Beijing University of Posts and Telecommunications in 2019. She is currently pursuing the Ph.D. degree with the Graduate School of Information Sciences (GSIS), Tohoku University, Japan. Her research interests include D2D communication, resource allocation, and machine learning in wireless networks.

Fengxiao Tang (S'15-M'19) is a full professor in the School of Computer Science and Engineering of Central South University. His research interests are unmanned aerial vehicle systems, IoT security, game theory optimization, network traffic control, and machine learning algorithms. He was awarded the IEEE Communications Society Asia-Pacific Outstanding Paper Award (2020).

Nei Kato (M'04-SM'05-F'13) is a full professor (Deputy Dean) with the Graduate School of Information Sciences of Tohoku University, Japan. He has been engaged in research on computer networking, wireless mobile communications, satellite communications, ad hoc sensor mesh networks, and smart grid. He is a fellow of The Engineering Academy of Japan and IEICE.
