Power Allocation in Multi-User Cellular Networks: Deep Reinforcement Learning Approaches

Fan Meng, Peng Chen, Lenan Wu, and Julian Cheng
Abstract—Model-based power allocation has been investigated for decades, but this approach requires the mathematical models to be analytically tractable and suffers from high computational complexity. Recently, data-driven model-free approaches have been rapidly developed to achieve near-optimal performance with affordable computational complexity, and deep reinforcement learning (DRL) is regarded as one such approach having great potential for future intelligent networks. In this paper, a dynamic downlink power control problem is considered for maximizing the sum-rate in a multi-user wireless cellular network. Using cross-cell coordination, the proposed multi-agent DRL framework includes off-line and on-line centralized training and distributed execution, and a mathematical analysis is presented for the top-level design of the near-static problem. Policy-based REINFORCE, value-based deep Q-learning (DQL), and actor-critic deep deterministic policy gradient (DDPG) algorithms are proposed for this sum-rate problem. Simulation results show that the data-driven approaches outperform the state-of-the-art model-based methods on sum-rate performance. Furthermore, DDPG outperforms REINFORCE and DQL in terms of both sum-rate performance and robustness.

Index Terms—Deep reinforcement learning, deep deterministic policy gradient, policy-based, interfering broadcast channel, power control, resource allocation.

Manuscript received January 17, 2019; revised June 21, 2019, November 9, 2019, and March 17, 2020; accepted June 8, 2020. Date of publication June 18, 2020; date of current version October 9, 2020. This work was supported in part by the National Natural Science Foundation of China under Grant 61801112 and Grant 61601281, in part by the Natural Science Foundation of Jiangsu Province under Grant BK20180357, in part by the foundation of the Shaanxi Key Laboratory of Integrated and Intelligent Navigation under Grant SKLIIN-20190204, and in part by the Open Program of the State Key Laboratory of Millimeter Waves, Southeast University, under Grant K202029. The associate editor coordinating the review of this article and approving it for publication was D. So. (Corresponding author: Fan Meng.)

Fan Meng and Lenan Wu are with the School of Information Science and Engineering, Southeast University, Nanjing 210096, China (e-mail: mengxiaomaomao@outlook.com; wuln@seu.edu.cn). Peng Chen is with the State Key Laboratory of Millimeter Waves, Southeast University, Nanjing 210096, China (e-mail: chenpengseu@seu.edu.cn). Julian Cheng is with the School of Engineering, The University of British Columbia, Kelowna, BC V1V 1V7, Canada (e-mail: julian.cheng@ubc.ca).

Digital Object Identifier 10.1109/TWC.2020.3001736

I. INTRODUCTION

WIRELESS data transmission has experienced tremendous growth in past years and will continue to grow in the future. When a large number of terminals such as mobile phones and wearable devices are connected to the networks, the density of access points (APs) will have to be increased. Dense deployment of small cells, such as pico-cells and femto-cells, has become the most effective solution to accommodate the demand for spectrum [1]. With denser APs and smaller cells, the intra-cell and inter-cell interference problems can be severe [2]. Therefore, power allocation and interference management are both crucial and challenging [3], [4] for small-cell networks.

A number of model-oriented algorithms have been developed to manage interference [5]–[9], and existing studies mainly focus on sub-optimal or heuristic algorithms, whose performance gaps to the optimal solution are typically challenging to quantify. Besides, the mathematical models are usually assumed to be analytically tractable, but these models can be inaccurate because both hardware and channel imperfections can exist in practical communication environments. When considering specific hardware components and realistic transmission scenarios, such as low-resolution analog-to-digital converters, nonlinear amplifiers and user distribution, it is challenging to develop signal processing techniques using model-driven tools. Moreover, the computational complexity of these algorithms is typically high and thus implementation becomes impractical. Meanwhile, machine learning [10] algorithms are potentially useful techniques for future intelligent wireless communications. These methods are usually model-free and data-driven [11], [12], and the solutions are obtained through data learning instead of model-oriented analysis and design.

Two main branches of machine learning are supervised learning and reinforcement learning (RL). When training input and output pairs are available, the supervised learning method is simple and efficient, especially for classification tasks such as modulation recognition [13] and signal detection [14], [15]. However, the desired output data or optimal solutions are usually derived by assuming certain system models. In addition, the performance of the models learned with supervised learning is suboptimal. Meanwhile, RL [16] has been developed as a goal-oriented algorithm to learn a better policy through exploration of uncharted territory and exploitation of current knowledge. RL is concerned with how agents ought to take actions in an environment such that some notion of cumulative reward is maximized, and the environment is typically formulated as a Markov decision process (MDP) [17]. Therefore, many RL algorithms [16] have been developed using dynamic programming (DP) techniques. In classic RL, a value function or a policy is stored in a tabular form, which
introduces the dimensionality problem. To address this issue, function approximation is proposed to replace the table, and this approximation can be realized by a neural network (NN) or a deep NN (DNN) [18]. When RL is combined with a DNN, deep RL (DRL) is created and widely investigated, and it can achieve stunning performance in a number of noted projects [19], such as the game of Go [20] and Atari video games [21].

The DRL algorithms can be categorized into three groups [19]: value-based, policy-based, and actor-critic algorithms. The most widely used algorithms are the value-based deep Q-learning (DQL) and the policy-based REINFORCE. These two algorithms have the following merits and defects:

1) DQL: This algorithm is efficient when the action space is finite, discrete, and low-dimensional, and it is unsuitable for joint optimization problems where the possible actions increase exponentially. Besides, the actions must be discretized for tasks having a continuous action space; therefore, quantization error can be introduced.

2) REINFORCE: Based on estimated gradients, the agent learns to generate a stochastic policy in REINFORCE. In some games the optimal policy is stochastic, which DQL has difficulty producing. However, balancing exploration and exploitation during learning is challenging, and REINFORCE usually converges to a suboptimal solution. Similar to DQL, the dimensionality problem on the action space still exists.

The actor-critic algorithm is developed as a hybrid of the value-based and policy-based algorithms. It consists of two components: an actor to generate a policy and a critic to assess the policy. A better solution is learned through solving a multi-objective optimization problem and updating the parameters of the actor and the critic alternately. As an example, deep deterministic policy gradient (DDPG) [22] generates a deterministic action and operates over high-dimensional continuous action spaces.

Consider a wireless cellular network having a single-input single-output (SISO) interfering broadcast channel (IBC) whereby multiple base stations transmit signals to a group of users within their own cells. We investigate the dynamic sum-rate maximization problem, i.e., choosing the downlink transmit power in response to physical channel conditions under maximum power constraints. This problem is NP-hard [3], and two advanced model-based algorithms, namely fractional programming (FP) [5] and weighted minimum mean squared error (WMMSE) [6], are regarded as benchmarks in this study. Supervised learning was studied in [23], where a trained DNN can accelerate processing with acceptable performance loss. To further improve the performance of supervised learning, an ensemble of DNNs is also proposed [24]. To manage the interference using DRL approaches, the current research work mainly concentrates on value-based algorithms. The QL or DQL is widely applied to various communication scenarios having different power allocation problems, such as general interference management in heterogeneous networks that are typically composed of a macro base station and several femto-cells [25]–[28]; sum-rate maximization or proportionally fair scheduling in cellular networks [29]; the problem with stringent quality-of-service (QoS) constraints on latency and reliability in vehicular-to-vehicular broadcast [30]; and power control in a cognitive radio system that consists of two users sharing a spectrum [31]. To the best of the authors' knowledge, the classic policy-based approach has seldom been considered for power allocation [32]. An actor-critic algorithm has been applied for power allocation [33], where a Gaussian probability distribution function was used to formulate a stochastic continuous policy. These studies concern multiple users sharing spectrum in a cooperative distributed manner, which agrees with our proposed multi-agent DRL design. However, unlike the previous work [33], we perform a theoretical analysis to prove that the investigated distributed optimization problem is near-static rather than highly dynamic. Our proposed DRL algorithms are simple and efficient compared with the typical DRL algorithms that solve an MDP or a partially observed (PO) MDP problem. Besides the well-known QL algorithm, we also consider the classic policy-based and the state-of-the-art actor-critic algorithms.

The sum-rate maximization is a static optimization problem, where the target is a multi-variate ordinary function. The resulting centralized algorithm is inherently unscalable for large applications, and therefore we turn to multi-agent DRL. Although inter-cell coordinations are used among the users, the distributed power allocation problem can be considered as near-static in fast fading scenarios. While the standard DRL tools are designed for DP problems that can be solved recursively, a direct use of these tools for solving the static optimization problem will suffer performance degradation. In a previous work [34], we verified that the widely applied standard DQL algorithm suffers from sum-rate performance degradation. In this work, we perform a theoretical analysis of the general DRL approaches to address the static optimization problem and improve the DRL algorithms. Based on this theoretical basis, we propose three simple and efficient algorithms, namely the policy-based REINFORCE, the value-based DQL and the actor-critic-based DDPG. Simulation results show that our DQL achieves the best sum-rate performance when compared with the algorithms using the standard DQL, and our DRL approaches also outperform the state-of-the-art model-based methods. This work makes the following contributions:

• We perform a mathematical analysis on general DRL algorithms to tackle static optimization problems, such as the centralized dynamic power allocation in multi-user cellular networks. Furthermore, we verify that the distributed sum-rate problem having coordinations is near-static.

• The training procedure of the proposed DRL algorithm is centralized and the learned model is executed distributively. After off-line learning in a simulated system, the agents are then trained on-line using transfer learning, namely sim2real. An environment tracking mechanism is also proposed to control the on-line learning dynamically.

• The logarithmic representation of channel gain and power is used to address the numerical problem in DNNs and improve training efficiency. Besides, a sorting pre-processing technique is proposed to approximate the
interferences, reduce the computation load, and accommodate varying user densities (a small illustrative sketch of this preprocessing follows the list).

• The general DRL algorithms are proposed for static optimization, and the concrete DRL design is further introduced. We propose three simple and efficient algorithms, namely REINFORCE, DQL and DDPG, which are respectively policy-based, value-based, and actor-critic-based. Simulations on sum-rate performance, generalization performance and computation complexity are also demonstrated.
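The sketch below only illustrates the preprocessing idea of the third contribution: interfering terms are mapped to a logarithmic (dB) scale, sorted, and truncated to the strongest few, so the DNN input length stays fixed when the user density varies. It is a minimal sketch, not the exact design used later in the paper; the function name, the number of kept interferers, and the padding floor are assumptions.

```python
import numpy as np

def preprocess_interferers(gains, powers, num_kept=5, floor_db=-120.0):
    """Illustrative preprocessing: map interferer gain-power products to a dB
    (logarithmic) scale and keep only the strongest `num_kept` terms, so the
    DNN input length stays fixed when the number of interferers varies."""
    interference = np.asarray(gains, dtype=float) * np.asarray(powers, dtype=float)
    interference_db = 10.0 * np.log10(np.maximum(interference, 1e-30))
    strongest = np.sort(interference_db)[::-1][:num_kept]      # descending order
    if strongest.size < num_kept:                              # pad when fewer interferers exist
        strongest = np.pad(strongest, (0, num_kept - strongest.size),
                           constant_values=floor_db)
    return strongest
```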
The remainder of this paper is organized as follows. Section II outlines the power control problem in the wireless cellular network with IBC. In Section III, the top-level DRL design for a static optimization problem is introduced and analyzed. In Section IV the proposed DRL approaches are presented. Then, the DRL methods are compared with benchmark algorithms in different scenarios, and simulation results are demonstrated in Section V. Conclusions are presented in Section VI.

II. PROBLEM FORMULATION

A. System Model

A cellular network having SISO IBC is considered, and it is composed of N cells. A base station (BS) equipped with one transmit antenna is deployed at each cell center. Using shared spectrum, K users in each cell are simultaneously served by the center BS.

At time slot t, the independent channel gain between the BS n and the user k in cell j is denoted by g^t_{n,j,k}, and it can be presented as

g^t_{n,j,k} = |h^t_{n,j,k}|^2 \beta_{n,j,k}   (1)

where |·| is the modulus operation; h^t_{n,j,k} is a complex Gaussian random variable with Rayleigh distributed envelope; β_{n,j,k} is the large-scale fading component that takes both geometric attenuation and shadow fading into account, and it is assumed to be invariant over one time slot. According to the Jakes' model [35], the small-scale flat fading can be modeled as a first-order complex Gauss-Markov process

h^t_{n,j,k} = \rho h^{t-1}_{n,j,k} + n^t_{n,j,k}   (2)

where n^t_{n,j,k} ∼ CN(0, 1 − ρ²) (h^1_{n,j,k} ∼ CN(0, 1) when t = 1), and the correlation coefficient ρ is determined by

\rho = J_0(2\pi f_d T_s)   (3)

where J_0(·) is the zero-order Bessel function of the first kind; f_d is the maximum Doppler frequency; and T_s is the time interval between adjacent instants.
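For concreteness, a minimal sketch of how (1)–(3) can be simulated is given below; SciPy's `j0` implements the zero-order Bessel function of the first kind, and the Doppler frequency, slot duration, array shape and unit large-scale gain are illustrative assumptions, not the paper's simulation settings.

```python
import numpy as np
from scipy.special import j0

def correlation_coefficient(fd_hz, ts_s):
    """rho = J0(2*pi*fd*Ts), Eq. (3)."""
    return j0(2.0 * np.pi * fd_hz * ts_s)

def evolve_small_scale(h_prev, rho, rng):
    """One step of the first-order complex Gauss-Markov process of Eq. (2):
    h^t = rho * h^{t-1} + n^t, with n^t ~ CN(0, 1 - rho**2)."""
    std = np.sqrt((1.0 - rho ** 2) / 2.0)      # per real/imaginary component
    n_t = rng.normal(0.0, std, h_prev.shape) + 1j * rng.normal(0.0, std, h_prev.shape)
    return rho * h_prev + n_t

rng = np.random.default_rng(0)
rho = correlation_coefficient(fd_hz=10.0, ts_s=0.02)                     # illustrative values
h = (rng.normal(size=(3, 2)) + 1j * rng.normal(size=(3, 2))) / np.sqrt(2.0)  # h^1 ~ CN(0, 1)
h = evolve_small_scale(h, rho, rng)                                       # h^2
g = np.abs(h) ** 2 * 1.0                                                  # Eq. (1) with beta = 1
```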
The downlink from the n-th BS to the k-th serving AP is denoted by dl_{n,k}. Assuming that the signals from different transmitters are independent of each other, the channels are fixed in each time slot. Then the signal-to-interference-plus-noise ratio (SINR) of dl_{n,k} in time slot t can be written as

\gamma^t_{n,k} = \frac{g^t_{n,n,k} p^t_{n,k}}{\sum_{k' \neq k} g^t_{n,n,k} p^t_{n,k'} + \sum_{n' \in D_n} g^t_{n',n,k} \sum_{j} p^t_{n',j} + \sigma^2}   (4)

where D_n is the set of interference cells around the n-th cell; p^t_{n,k} is the emitting power of the transmitter n to its receiver k at slot t; and σ² denotes the additive noise power. The terms \sum_{k' \neq k} g^t_{n,n,k} p^t_{n,k'} and \sum_{n' \in D_n} g^t_{n',n,k} \sum_{j} p^t_{n',j} represent the intra-cell interference and inter-cell interference, respectively. Assuming normalized bandwidth, we express the downlink rate of dl_{n,k} as

C^t_{n,k} = \log_2\left(1 + \gamma^t_{n,k}\right).   (5)
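The per-link SINR (4) and rate (5) can be evaluated directly once the gain tensor and power matrix are known. The sketch below assumes a gain array indexed as g[transmitting BS, receiving cell, user], which is an assumption about data layout rather than something specified in the text.

```python
import numpy as np

def sinr_and_rate(g, p, noise_power, n, k, neighbors):
    """SINR of Eq. (4) and rate of Eq. (5) for downlink dl_{n,k}.
    g[m, n, k] : gain from BS m to user k served in cell n.
    p[n, k]    : transmit power of BS n towards its user k.
    neighbors  : indices of the interfering cells D_n."""
    signal = g[n, n, k] * p[n, k]
    intra = g[n, n, k] * (p[n, :].sum() - p[n, k])                 # same-cell users
    inter = sum(g[m, n, k] * p[m, :].sum() for m in neighbors)     # surrounding cells
    gamma = signal / (intra + inter + noise_power)
    return gamma, np.log2(1.0 + gamma)
```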
B. Centralized Optimization Problem

In a centralized approach, the optimization problem is to maximize the overall sum-rate with respect to the non-negative power set p^t whose elements all satisfy the maximum power constraint, and it is given as

\max_{p^t} \; C(g^t, p^t)
\text{s.t. } 0 \le p^t_{n,k} \le P_{\max}, \; \forall n, k   (6)

where P_max denotes the maximum emitting power; the power set p^t, channel gain set g^t, and sum-rate C(g^t, p^t) are, respectively, defined as

p^t := \{ p^t_{n,k} \mid \forall n, k \},   (7)
g^t := \{ g^t_{n',n,k} \mid \forall n', n, k \},   (8)
C(g^t, p^t) := \sum_{n,k} C^t_{n,k}.   (9)

The problem (6) is non-convex and NP-hard. For model-based methods, the performance gaps to the optimal solution are typically challenging to quantify, and a practical implementation is also prohibitive due to high computational complexity. More importantly, the model-oriented approaches cannot accommodate future heterogeneous service requirements and randomly evolving environments. In the problem (6), the current channel state information (CSI) is the sufficient statistic of the optimal solution. Therefore, there exists a function that maps the CSI to the solution, and we resort to data-driven DRL algorithms to realize this function.
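As a small illustration of (6)–(9), the sketch below projects a candidate power matrix onto the feasible box [0, P_max] and evaluates the sum-rate objective C(g^t, p^t) by summing the per-link rates; `sinr_and_rate` refers to the hypothetical helper sketched after (5).

```python
import numpy as np

def sum_rate_objective(g, p, noise_power, p_max, neighbors_of):
    """Evaluate C(g, p) of Eq. (9) for a candidate allocation after projecting
    it onto the feasible box 0 <= p <= P_max of Eq. (6)."""
    p = np.clip(p, 0.0, p_max)
    total = 0.0
    for n in range(p.shape[0]):
        for k in range(p.shape[1]):
            _, rate = sinr_and_rate(g, p, noise_power, n, k, neighbors_of[n])
            total += rate
    return total
```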
C. Distributed Optimization Problem

The optimization problem (6) is centralized. Using an RL algorithm, the current local CSI is first estimated and transmitted to the center agent for further processing. The decisions of allocated powers are then broadcast to the corresponding transmitters and executed. However, several shortcomings of the centralized framework having a number of cells must be observed:

1) Dimensionality problem: The cardinalities of the DNN input and output (I/O) are proportional to the cell number N, and training is difficult for such a DNN since the state-action space increases exponentially with the I/O dimensions. Additionally, exploration in a high-dimensional space is inefficient, and thus the learning can be impractical.

2) Transmission problem: The center agent requires full CSI of the communication network at the current time. When the cell number N is large and low-latency service
is required, both the transmitting CSI to the agent and the broadcasting allocation scheme to each transmitter

areas are approximately identical (since their characteristics are location-invariant and the network is large), the policy can be shared in a paradigm of transfer learning. To train the network, we only train one policy and set the batch size equal to the number of users. The learned policy is executed by all the users, and each user is an agent. Therefore, the training is centralized and the execution is distributed. In multi-agent training, non-stationarity occurs when the policies of the other agents change during training. Meanwhile, non-stationarity does not occur in centralized training because all the agents share the same policy. In the centralized training, we still use independent Q-learning.
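A minimal sketch of this centralized-training, distributed-execution pattern follows: one policy network is shared by every user-agent, each agent's observation forms one row of the training batch (so the batch size equals the number of users), and at run time each agent evaluates the same network on its local observation only. The layer sizes, the sigmoid power parameterization and the generic `loss_fn` hook are assumptions; the concrete losses are those of the REINFORCE/DQL/DDPG algorithms introduced later.

```python
import torch

obs_dim, p_max = 10, 1.0        # illustrative dimensions, not the paper's settings

# One network shared by all user-agents; each agent's observation is one row of
# the batch, so a single update uses a batch whose size equals the number of users.
policy = torch.nn.Sequential(
    torch.nn.Linear(obs_dim, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 1), torch.nn.Sigmoid(),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def centralized_training_step(observations, loss_fn):
    """observations: (num_users, obs_dim); loss_fn maps the batch of per-agent
    powers to a scalar training loss (algorithm-specific)."""
    powers = p_max * policy(observations).squeeze(-1)   # one power decision per agent
    loss = loss_fn(powers)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def distributed_execution(local_observation):
    """After training, every agent runs the same shared network locally."""
    with torch.no_grad():
        return p_max * policy(local_observation.unsqueeze(0)).item()
```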
III. DEEP REINFORCEMENT LEARNING

A. Preliminary and Theoretical Analysis

A general MDP problem concerns a single agent or multiple agents interacting with an environment. In each interaction, the agent takes action a by policy π using the observed state s, then receives a feedback reward r and a new state from the environment. The agent aims to find an optimal policy to maximize the cumulative reward over the continuous interactions, and the DRL algorithms are developed for such problems.

To facilitate the analysis, we consider the discrete-time model-based MDP where the action and state spaces are assumed to be finite. The four-tuple (S, A, P, R) is known, where the elements are

1) S, a finite set of environment states;
2) A, a finite set of available actions;
3) P, the set of state transition probabilities, where element P^a_{s→s'} denotes the probability of transitioning from state s to state s' after taking action a;
4) R, a finite set of immediate rewards, where element r^a_{s→s'} denotes the reward obtained after transitioning from state s to state s' under action a.

From the perspective of MDP, we have the following conclusions when the environment satisfies certain conditions.

Theorem 1: When the environment transition is independent of the action, and the current action is only related to the reward function of this instant, the optimal policy for maximizing cumulative rewards is equivalent to a combination of single-step rewards.

Proof: First, we focus on (12) and expand it as

V^T_\pi(s^1) = \sum_{a^1 \in A} \pi(a^1 \mid s^1) \sum_{s^2 \in S} P^{a^1}_{s^1 \to s^2} \left[ \frac{1}{T} r^{a^1}_{s^1 \to s^2} + \frac{T-1}{T} V^{T-1}_\pi(s^2) \right].   (16)

The description of the assumed conditions can be mathematically formulated as

P^a_{s \to s'} = P_{s \to s'},   (17)
r^a_{s \to s'} = r^a_s.   (18)

Without loss of generality, for the probability mass functions of the policy π and the state transitioning P, we have

\sum_{a \in A} \pi(a \mid s) = 1,   (19)
\sum_{s' \in S} P_{s \to s'} = 1.   (20)

From (17), (18), (19) and (20), the state value function (16) can be rewritten as
Using (23), we can prove that the maximization of (22) with respect to {a^t | ∀t} can be decomposed into T subproblems:

\max_{\{a^t \mid \forall t\}} V^T_\pi(s) \iff \max_{a^t} r^{a^t}_{s^t}, \; \forall t.   (24)

The equivalence proof for the γ-discounted cumulative reward is similar.

Since the channel is modeled as a first-order Markov process, the environment satisfies the two conditions (17) and (18) in Theorem 1. Then we let r^{a^t}_{s^t} = C(g^t, p^t) and a^t = p^t with the power constraints. The centralized optimization problem (6) using the DRL approach is equivalent to (23), and this problem is unrelated to the value of T or γ. We will first make several observations before choosing an appropriate hyper-parameter value. We take the value-based method as an example, and the optimal Q function associated with the Bellman equation is given as

Q^*(s, a) = r^a_s + \gamma \max_{a'} Q(s', a').   (25)

The function in (25) must be estimated precisely to achieve the optimal action. Here we list two issues caused by γ > 0:

1) The Q value is overestimated, and the bias is γ max_{a'} Q(s', a'). This effect has little influence on the final performance, since this deviation is unrelated to the action a.

2) The variance of the Q value σ²_q becomes enlarged, and the noise σ²_q becomes larger as γ increases. During the training, the noise σ²_q on the data slows down the convergence speed and deteriorates the performance of the learned DNN.

B. Multi-Agent Setting

1) Without Coordinations: The conditions (17) and (18) are still true in multi-agent learning. When all the agents obtain the environment state s partially or fully, no messages are delivered between the agents. The value function (16) can be

where u_i is the received message of agent i, and it is transmitted by the other agents. Theorem 1 does not hold because u^t_i can contain information of the previous action a^{t−1} and state s^{t−1}, and these historical issues can influence the current agent's strategy.

Furthermore, we consider the impact of the environment changing. When the Doppler frequency f_d is large, the channel environment changes rapidly and the delivered message has little influence on the current policy making; then a small γ value is preferred, and vice versa. This implies that when the environment changes fast, Theorem 1 is still valid in the scenario which considers the multi-agent setting with coordinations.

Validation of Theorem 1 provides guidance for the setting of the γ value. To the authors' best knowledge, there exists no theoretically optimal setting of the hyper-parameter γ, and a proper γ value can be obtained through simulations. We have verified that an increasing γ has a negative influence on the sum-rate performance of the deep Q network (DQN) via simulations [34], as shown in Fig. 1. This implies that the influence of the agents' behaviours over time is negligible, and Theorem 1 is valid for problem (11) when we consider coordinations and a fast changing environment. Therefore, we suggest the hyper-parameter γ = 0 or T = 1 in this specific scenario, and thus Q(s, a) = r^a_s. In the remainder of this paper, we make an adjustment to the standard DRL algorithms, and claim that the Q function is equal to the reward function. The aforementioned analysis and discussion provide guidance for the DRL design.
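The practical consequence of this analysis is a one-line change to the learning target: instead of bootstrapping with the discounted maximum of the next-state Q values as in (25), the target is simply the immediate sum-rate reward. A minimal illustration with made-up numbers follows.

```python
def q_target(reward, next_q_max, gamma):
    """Standard one-step target of Eq. (25): r + gamma * max_a' Q(s', a')."""
    return reward + gamma * next_q_max

# With gamma = 0 (equivalently T = 1) the target collapses to the immediate
# reward, i.e., Q(s, a) = r_s^a, so the critic is trained as a plain regressor
# of the instantaneous sum-rate reward.
print(q_target(reward=3.7, next_q_max=12.4, gamma=0.9))   # 14.86, standard bootstrapped target
print(q_target(reward=3.7, next_q_max=12.4, gamma=0.0))   # 3.7, the adjusted target used here
```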
C. On-Line Training

In the proposed two-step training framework [34], the DNN is first pre-trained off-line. Deriving a coarse learned policy in a simulated system can reduce the on-line training stress due to the large data requirement of the data-driven algorithm. However, off-line training will suffer from system model inaccuracy, including hardware imperfections, dynamic wireless channel environments and some unknown issues. Therefore, the agent
the reward function R is changed, and thus the policy π or the Q function must be adjusted correspondingly to avoid performance degradation. Hence, the Q value needs to approximate the reward value r as accurately as possible. We define the normalized critic loss l^t_c as

l^t_c = \frac{1}{2 T_l} \sum_{t' = t - T_l + 1}^{t} \left( 1 - \frac{Q(s^{t'}, a^{t'}; \theta)}{r^{t'}} \right)^2   (28)

where θ denotes the DNN parameter; T_l is the observation window; l^t_c is an index to evaluate the accuracy of the Q function approximation to the actual environment. Once l^t_c exceeds a certain fixed threshold l_max, the training of the DNN is initiated to track the current environment; otherwise, the learning procedure is omitted. The introduction of the tracking mechanism achieves a balance between performance and efficiency. Combined with on-line training, the DRL is model-free and data-driven.
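A small sketch of the tracking mechanism of (28) is shown below: the normalized critic loss is computed over the last T_l slots and on-line training is triggered only when it exceeds the threshold l_max. The function names are assumptions, and the rewards are assumed strictly positive (as sum-rates are), since (28) divides by them.

```python
import numpy as np

def normalized_critic_loss(q_values, rewards):
    """Eq. (28): l_c^t = (1 / (2 * T_l)) * sum over the last T_l slots of
    (1 - Q(s, a; theta) / r)^2; rewards are assumed strictly positive."""
    q_values = np.asarray(q_values, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    return np.mean((1.0 - q_values / rewards) ** 2) / 2.0

def should_retrain(q_window, r_window, l_max):
    """Environment-tracking trigger: retrain the DNN only when the critic no
    longer matches the recently observed rewards closely enough."""
    return normalized_critic_loss(q_window, r_window) > l_max
```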
coordinations from interference cells, where \ is the set-difference operation. Corresponding to Γ^t_{n,k}, the last power set p^{t−1}_{n,k} is defined as

p^{t-1}_{n,k} := \{ p^{t-1}_{n,k} \mid (n, k) \in I^t_{n,k} \}.   (31)

The irrelevant or weakly correlated input elements consume more computational resources and even lead to performance degradation. However, some auxiliary information can improve the sum-rate performance of the DNN. Similar to (31), the assisted feature is given by

C^{t-1}_{n,k} := \{ C^{t-1}_{n,k} \mid (n, k) \in I^t_{n,k} \}.   (32)

Two types of feature f are considered, and they are defined as

f_1 := \{ \Gamma^t_{n,k}, p^{t-1}_{n,k} \},   (33)
f_2 := \{ \Gamma^t_{n,k}, p^{t-1}_{n,k}, C^{t-1}_{n,k} \}.   (34)

The partially observed state s for the DRL algorithms can be f_1 or f_2, and their performance will be compared in Section V-B.2.
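A minimal sketch of assembling the partially observed state of (33)–(34) is given below. The exact contents of the local measurement set Γ^t_{n,k} and the index set I^t_{n,k} are defined in a part of the paper not reproduced here, so their mapping to flat arrays is an assumption; the sketch only shows how f_1 and f_2 differ by the appended last-slot rates.

```python
import numpy as np

def build_feature(local_measurements, last_powers, last_rates=None):
    """Assemble the partially observed state of Eqs. (33)-(34).
    local_measurements : the set Gamma^t_{n,k}, flattened to an array.
    last_powers        : p^{t-1}_{n,k}, last-slot powers of the coordinating
                         links, Eq. (31).
    last_rates         : C^{t-1}_{n,k}, their last-slot rates, Eq. (32); when
                         provided the richer feature f_2 is built, otherwise f_1."""
    parts = [np.ravel(local_measurements), np.ravel(last_powers)]
    if last_rates is not None:
        parts.append(np.ravel(last_rates))
    return np.concatenate(parts)
```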
C. Value-Based: DQL

DQL is one of the most popular value-based off-policy DRL algorithms. As shown in Fig. 2, the topologies of DQL and REINFORCE are the same, and the values are estimated by a DQN Q(s, a; θ_q), where θ_q denotes the parameter. The selection of a good action relies on accurate estimation, and thus DQL aims to search for the optimal parameter θ*_q to minimize the ℓ2 loss, i.e.,

\theta_q^* = \arg\min_{\theta_q} \frac{1}{2} \left( Q(s, a; \theta_q) - r^a_s \right)^2.   (43)

The gradient of θ_q is given as

\nabla_{\theta_q} = \left( Q(s, a; \theta_q) - r^a_s \right) \nabla_{\theta_q} Q(s, a; \theta_q).   (44)

The optimal action a* is selected to maximize the Q value.
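A minimal PyTorch sketch of this value-based update is given below, using the reward-as-target simplification Q(s, a) = r^a_s from Section III and a greedy choice over discretized power levels; the layer sizes and the number of power levels are assumptions, not the settings of Table I.

```python
import torch

state_dim, num_levels = 12, 10      # illustrative: input size and discretized power levels
q_net = torch.nn.Sequential(
    torch.nn.Linear(state_dim, 64), torch.nn.ReLU(), torch.nn.Linear(64, num_levels)
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dql_update(states, actions, rewards):
    """One gradient step on the loss of Eq. (43) with the immediate reward as
    target: 0.5 * (Q(s, a; theta_q) - r)^2 averaged over a batch.
    `actions` is a LongTensor of chosen power-level indices."""
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = 0.5 * (q_sa - rewards).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def greedy_power_level(state):
    """Pick the discretized power level with the largest predicted Q value."""
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1))
```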
Fig. 3. An illustration of data flow graph with DDPG (feature f2).

D. Actor-Critic: DDPG

DDPG is presented as an actor-critic model-free algorithm based on the deterministic policy gradient operating over continuous action spaces. As shown in Fig. 3, an actor generates a deterministic action a with observation s by a mapping network A(s; θ_a), where θ_a denotes the actor parameter. The critic predicts the Q value with an action-state pair through a critic network C(s_c, a; θ_c), where θ_c denotes the critic parameter and s_c is the critic state. The critic and actor work cooperatively, and the optimal deterministic policy is achieved by solving the following joint optimization problem:
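The joint objective itself is not reproduced above, but a rough sketch of the alternating actor-critic update such a formulation typically leads to, again with the reward-as-target simplification, is given below. The actor A(s; θ_a) maps the observation to a power in [0, P_max] through a sigmoid and the critic C(s_c, a; θ_c) regresses the Q value onto the observed reward; network sizes, learning rates and the sigmoid parameterization are assumptions rather than the paper's configuration.

```python
import torch

state_dim, critic_state_dim, p_max = 12, 16, 1.0      # illustrative sizes

actor = torch.nn.Sequential(                # A(s; theta_a): observation -> power in [0, p_max]
    torch.nn.Linear(state_dim, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 1), torch.nn.Sigmoid(),
)
critic = torch.nn.Sequential(               # C(s_c, a; theta_c): (critic state, action) -> Q value
    torch.nn.Linear(critic_state_dim + 1, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(states, critic_states, actions, rewards):
    """Alternate critic and actor steps.  `actions` (shape (batch, 1)) are the
    powers actually executed (actor output plus exploration noise, omitted
    here) and `rewards` the resulting sum-rate rewards."""
    # Critic step: regress Q(s_c, a) onto the immediate reward (gamma = 0).
    q = critic(torch.cat([critic_states, actions], dim=1)).squeeze(1)
    critic_loss = 0.5 * (q - rewards).pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor step: deterministic policy gradient, ascend the critic's estimate.
    new_actions = p_max * actor(states)
    actor_loss = -critic(torch.cat([critic_states, new_actions], dim=1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return critic_loss.item(), actor_loss.item()
```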
TABLE I. Hyper-Parameters Setup and DNN Setting.
and random power schemes are considered as benchmarks to evaluate our proposed DRL algorithms.

1) Cell Range: Both the minimum cell range R_min and the half cell-to-cell range R_max are considered. As shown in Fig. 6, the expected SINR of adjacent users (near the BS) becomes smaller as the minimum cell range increases, and thus the average sum-rate decreases. In Fig. 7, both the intra-cell interference and the inter-cell interference are stronger as the cell range becomes smaller, and thus the average sum-rate decreases. The sum-rate performance of random/maximum power is the lowest, while FP [5] and WMMSE [6] achieve much higher spectral efficiency. The performances of these two algorithms are comparable, and WMMSE performs slightly better than FP. In contrast, all the data-driven algorithms outperform the model-driven methods, and the proposed actor-critic-based DDPG achieves the highest sum-rate value. Additionally, the learned models are obtained in the simulation environment having the fixed ranges R_min = 0.01 km and R_max = 1 km, but performance degradation in these unknown scenarios cannot be observed. Therefore, the learned data-driven models using the proposed algorithms show good generalization performance in terms of varying cell ranges.

The conventional optimization algorithms such as WMMSE and FP sacrifice optimality for tractability, and suboptimal solutions are often obtained. As a sub-discipline of heuristic optimization, the data-driven methods such as DRL interact with the environment and learn to improve the policy through exploration and exploitation. Therefore, it is possible for the DRL algorithms to attain solutions closer to optimal performance than the conventional optimization algorithms.

2) User Density: In a practical scenario, the user density can change over time and location, so it is considered in this simulation. The user density is changed by the number of APs per cell K, which ranges from 1 to 8. As plotted in Fig. 8, the average sum-rate drops as the users become denser, and all the algorithms have a similar trend. Apparently, the DRL approaches outperform the other schemes, and DDPG again achieves the best sum-rate performance. Hence, the simulation result shows that the learned data-driven models also have good generalization performance on different user densities.

Fig. 9. Average sum-rate versus different Doppler frequency fd.

3) Doppler Frequency: The Doppler frequency fd is a significant variable related to the small-scale fading. Since the information at the last instant is used for the current power allocation, fast fading can lead to performance degradation for our proposed data-driven models. Meanwhile, the model-driven algorithms are not influenced by fd. The Doppler frequency fd is sampled from 4 Hz to 50 Hz, and the corresponding correlation ρ value ranges from 0.920 to 0.001. The simulation results in Fig. 9 show that the average sum-rates of the data-driven algorithms drop slowly in this fd range. This indicates that the data-driven models are robust against the Doppler frequency fd.
REFERENCES

[12] T. Wang, C.-K. Wen, H. Wang, F. Gao, T. Jiang, and S. Jin, "Deep learning for wireless physical layer: Opportunities and challenges," China Commun., vol. 14, no. 11, pp. 92–111, Nov. 2017.
[13] F. Meng, P. Chen, L. Wu, and X. Wang, "Automatic modulation classification: A deep learning enabled approach," IEEE Trans. Veh. Technol., vol. 67, no. 11, pp. 10760–10772, Nov. 2018.
[14] H. Ye, G. Y. Li, and B.-H. Juang, "Power of deep learning for channel estimation and signal detection in OFDM systems," IEEE Wireless Commun. Lett., vol. 7, no. 1, pp. 114–117, Feb. 2018.
[15] F. Meng, P. Chen, and L. Wu, "NN-based IDF demodulator in band-limited communication system," IET Commun., vol. 12, no. 2, pp. 198–204, Jan. 2018.
[16] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[17] L. R. Busoniu, R. Babuska, B. De Schutter, and D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators. Boca Raton, FL, USA: CRC Press, 2010.
[18] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[19] Y. Li, "Deep reinforcement learning: An overview," CoRR, vol. abs/1701.07274, pp. 1–85, Jan. 2017. [Online]. Available: http://arxiv.org/abs/1701.07274
[20] D. Silver et al., "Mastering the game of go without human knowledge," Nature, vol. 550, no. 7676, pp. 354–359, Oct. 2017.
[21] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[22] T. P. Lillicrap et al., "Continuous control with deep reinforcement learning," Comput. Sci., vol. 8, no. 6, p. A187, 2015.
[23] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, "Learning to optimize: Training deep neural networks for interference management," IEEE Trans. Signal Process., vol. 66, no. 20, pp. 5438–5453, Oct. 2018.
[24] F. Liang, C. Shen, W. Yu, and F. Wu, "Towards optimal power control via ensembling deep neural networks," CoRR, vol. abs/1807.10025, pp. 1–30, Jul. 2018.
[25] M. Bennis and D. Niyato, "A Q-learning based approach to interference avoidance in self-organized femtocell networks," in Proc. IEEE Globecom Workshops, Dec. 2010, pp. 706–710.
[26] M. Simsek, A. Czylwik, A. Galindo-Serrano, and L. Giupponi, "Improved decentralized Q-learning algorithm for interference reduction in LTE-femtocells," in Proc. Wireless Adv., Jun. 2011, pp. 138–143.
[27] M. Simsek, M. Bennis, and I. Güvenç, "Learning based frequency- and time-domain inter-cell interference coordination in HetNets," IEEE Trans. Veh. Technol., vol. 64, no. 10, pp. 4589–4602, Oct. 2015.
[28] R. Amiri, H. Mehrpouyan, L. Fridman, R. K. Mallik, A. Nallanathan, and D. Matolak, "A machine learning approach for power allocation in HetNets considering QoS," in Proc. IEEE Int. Conf. Commun. (ICC), May 2018, pp. 1–7.
[29] Y. S. Nasir and D. Guo, "Deep reinforcement learning for distributed dynamic power allocation in wireless networks," CoRR, vol. abs/1808.00490, pp. 1–30, Aug. 2018. [Online]. Available: http://arxiv.org/abs/1808.00490
[30] H. Ye and G. Y. Li, "Deep reinforcement learning based distributed resource allocation for V2V broadcasting," in Proc. 14th Int. Wireless Commun. Mobile Comput. Conf. (IWCMC), Jun. 2018, pp. 440–445.
[31] X. Li, J. Fang, W. Cheng, H. Duan, Z. Chen, and H. Li, "Intelligent power control for spectrum sharing in cognitive radios: A deep reinforcement learning approach," IEEE Access, vol. 6, pp. 25463–25473, Apr. 2018.
[32] N. H. Viet, N. A. Vien, and T. Chung, "Policy gradient SMDP for resource allocation and routing in integrated services networks," in Proc. IEEE Int. Conf. Netw., Sens. Control, Apr. 2008, pp. 1541–1546.
[33] Y. Wei, F. R. Yu, M. Song, and Z. Han, "User scheduling and resource allocation in HetNets with hybrid energy supply: An actor-critic reinforcement learning approach," IEEE Trans. Wireless Commun., vol. 17, no. 1, pp. 680–692, Jan. 2018.
[34] F. Meng, P. Chen, and L. Wu, "Power allocation in multi-user cellular networks: Deep reinforcement learning approaches," CoRR, vol. abs/1812.02979, pp. 1–6, Dec. 2018. [Online]. Available: http://arxiv.org/abs/1812.02979
[35] P. Dent and G. E. Bottomley, "Jakes fading model revisited," Electron. Lett., vol. 29, no. 13, pp. 1162–1163, Jun. 1993.
[36] F. D. Calabrese, L. Wang, E. Ghadimi, G. Peters, L. Hanzo, and P. Soldati, "Learning radio resource management in RANs: Framework, opportunities, and challenges," IEEE Commun. Mag., vol. 56, no. 9, pp. 138–145, Sep. 2018.
[37] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Proc. Adv. Neural Inf. Process. Syst., 2000, pp. 1057–1063.
[38] P. S. Thomas and E. Brunskill, "Policy gradient methods for reinforcement learning with function approximation and action-dependent baselines," CoRR, vol. abs/1706.06643, pp. 1–2, Jun. 2017. [Online]. Available: http://arxiv.org/abs/1706.06643
[39] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, pp. 1–15, Dec. 2014. [Online]. Available: http://arxiv.org/abs/1412.6980

Fan Meng (Student Member, IEEE) was born in Jiangsu, China, in 1992. He received the B.S. degree from the School of Electronic Engineering, University of Electronic Science and Technology of China (UESTC), in 2015. He is currently pursuing the Ph.D. degree with the School of Information Science and Engineering, Southeast University, China. His main research topic is applying machine learning techniques in wireless communication systems. His research interests include machine learning in general, joint demodulation and equalization, channel coding, resource allocation, and end-to-end communication system design with machine learning tools.

Peng Chen (Member, IEEE) was born in Jiangsu, China, in 1989. He received the B.E. and Ph.D. degrees from the School of Information Science and Engineering, Southeast University, China, in 2011 and 2017, respectively. From March 2015 to April 2016, he was a Visiting Scholar with the Electrical Engineering Department, Columbia University, New York City, NY, USA. He is currently an Associate Professor with the State Key Laboratory of Millimeter Waves, Southeast University. His research interests include radar signal processing and millimeter wave communication.

Lenan Wu received the M.S. degree in electronics communication system from the Nanjing University of Aeronautics and Astronautics, China, in 1987, and the Ph.D. degree in signal and information processing from Southeast University, China, in 1997. Since 1997, he has been with Southeast University, where he is currently a Professor and the Director of the Multimedia Technical Research Institute. He is the author or coauthor of over 400 technical articles and 11 textbooks. He holds 20 Chinese patents and one international patent. His research interests include multimedia information systems and communication signal processing.

Julian Cheng (Senior Member, IEEE) received the B.Eng. degree (Hons.) in electrical engineering from the University of Victoria, Victoria, BC, Canada, in 1995, the M.Sc. (Eng.) degree in mathematics and engineering from Queen's University, Kingston, ON, Canada, in 1997, and the Ph.D. degree in electrical engineering from the University of Alberta, Edmonton, AB, Canada, in 2003. He was with Bell Northern Research and NORTEL Networks. He is currently a Full Professor with the Faculty of Applied Science, School of Engineering, The University of British Columbia, Kelowna, BC, Canada. His current research interests include digital communications over fading channels, statistical signal processing for wireless applications, optical wireless communications, and 5G wireless networks. He was the Co-Chair of the 12th Canadian Workshop on Information Theory in 2011, the 28th Biennial Symposium on Communications in 2016, and the 6th EAI International Conference on Game Theory for Networks (GameNets 2016). He serves as an Area Editor for the IEEE Transactions on Communications. He is a past Associate Editor of the IEEE Transactions on Communications, the IEEE Transactions on Wireless Communications, the IEEE Communications Letters, and IEEE Access. He served as a Guest Editor for a Special Issue of the IEEE Journal on Selected Areas in Communications on Optical Wireless Communications. He is also a Registered Professional Engineer with the Province of British Columbia, Canada. He serves as the President of the Canadian Society of Information Theory as well as the Secretary of the Radio Communications Technical Committee of the IEEE Communications Society.