Professional Documents
Culture Documents
Deep Learning Based
Deep Learning Based
Deep-Learning-Based
Wireless Resource Allocation
With Application to Vehicular
Networks
ABSTRACT | It has been a long-held belief that judicious discuss deep-learning-assisted optimization for resource allo-
resource allocation is critical to mitigating interference, cation. We then highlight the deep reinforcement learning
improving network efficiency, and ultimately optimizing wire- approach to address resource allocation problems that are
less communication performance. The traditional wisdom is difficult to handle in the traditional optimization framework.
to explicitly formulate resource allocation as an optimization We also identify some research directions that deserve further
problem and then exploit mathematical programming to solve investigation.
the problem to a certain level of optimality. Nonetheless,
KEYWORDS | Deep learning; reinforcement learning (RL);
as wireless networks become increasingly diverse and com-
resource allocation; vehicular networks; wireless communica-
plex, for example, in the high-mobility vehicular networks,
tions.
the current design methodologies face significant challenges
and thus call for rethinking of the traditional design philoso-
phy. Meanwhile, deep learning, with many success stories in I. I N T R O D U C T I O N
various disciplines, represents a promising alternative due to In the last few decades, wireless communications
its remarkable power to leverage data for problem solving. have been relentlessly pursuing higher throughput, lower
In this article, we discuss the key motivations and roadblocks latency, higher reliability, and better coverage. In addi-
of using deep learning for wireless resource allocation with tion to designing more efficient coding, modulation,
application to vehicular networks. We review major recent channel estimation, equalization, and detection/decoding
studies that mobilize the deep-learning philosophy in wireless schemes, optimizing the allocation of limited communica-
resource allocation and achieve impressive results. We first tion resources is another effective approach [1].
From Shannon’s information capacity theorem [2],
power and bandwidth are the two primitive resources in
Manuscript received July 6, 2019; revised October 1, 2019; accepted a communication system. They determine the capacity
November 30, 2019. This work was supported in part by a research gift from of a wireless channel, up to which the information can
Intel Corporation and in part by the National Science Foundation under Grant
1731017 and Grant 1815637. (Corresponding author: Le Liang.) be transmitted with an arbitrarily small error rate. For
L. Liang is with Intel Labs, Hillsboro, OR 97124 USA (e-mail: lliang@gatech.edu). modern wireless communication systems, the definition of
H. Ye and G. Y. Li are with the School of Electrical and Computer Engineering,
Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail:
communication resources has been substantially enriched.
yehao@gatech.edu; liye@ece.gatech.edu). Beams in a multiple-input multiple-output (MIMO) sys-
G. Yu is with the Department of Information Science and Electronic Engineering,
tem, time slots in a time-division multiple-access sys-
Zhejiang University, Hangzhou 310027, China (e-mail: yuguanding@zju.edu.cn).
tem (TDMA), frequency subbands in a frequency-division
Digital Object Identifier 10.1109/JPROC.2019.2957798 multiple-access system, spreading codes in a code-division
0018-9219 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Liang et al.: Deep-Learning-Based Wireless Resource Allocation With Application to Vehicular Networks
Liang et al.: Deep-Learning-Based Wireless Resource Allocation With Application to Vehicular Networks
design objective and constraints as an optimization prob- To avoid such difficulties, traditional optimization-based
lem. It is then solved to certain levels of optimality, methods break down the problem into isolated resource
depending on problem complexity and allowable compu- allocation decisions at each time step without considering
tation time, by leveraging tools from various disciplines, the long-term effect. For instance, methods in [32]–[35]
including mathematical programming, graph theory, game reformulate the requirement as a signal-to-interference-
theory, etc. In this section, we discuss the limitations of plus-noise ratio (SINR) constraint at each decision step
these optimization approaches and highlight the poten- and then use various optimization techniques to solve for
tials of uprising data-driven methods enabled by machine the resource allocation solution. Such a practice loses the
learning, in particular deep learning. Methods that com- flexibility to balance V2I and V2V performance across the
bine the theoretical models derived from domain knowl- whole time T and leads to inevitable performance loss.
edge and the data-driven capabilities of learning are also
briefly discussed. B. Deep-Learning-Assisted Optimization
Deep learning allows multilayer computation models
that learn data representations with multiple levels of
A. Limitation of Traditional Optimization abstraction [36], [37]. Each layer computes a linear com-
Approach bination of outputs from the previous layer and then
Except in a few simple cases, where we are fortu- introduces nonlinearity through an activation function to
nate enough to end up with convex optimization that improve its expressive power. Deep learning has seen a
admits a systematic procedure to find the global opti- recent surge in a wide variety of research areas due to
mum, most optimization problems formulated for wireless its exceptional performance in many tasks, such as speech
resource allocation are strongly nonconvex (continuous recognition and object detection. Coupled with the avail-
power control), combinatorial (discrete channel assign- ability of more computing power and advanced training
ment), or mixed-integer nonlinear programming (MINLP) techniques, deep learning enables a powerful data-driven
(combined power control and spectrum assignment). For approach to many problems that are deemed difficult
instance, it has been shown in [8] that the spectrum man- traditionally. In the context of resource allocation, this
agement problem in a frequency selective channel, where sheds light on solving hard optimization problems (at
multiple users share the same spectrum, is nonconvex and least partially).
NP-hard. There is no known algorithm that can solve the In a simplistic form, deep learning can be leveraged to
problem to optimality with polynomial time complexity. To learn the correspondence of the parameters and solutions
deal with problems of this kind, we are often satisfied with of an optimization problem. The computationally complex
a locally optimal solution or some good heuristics without procedure to find optimal or suboptimal solutions can
any performance guarantee. More often than not, even be taken offline. With the universal approximation capa-
these suboptimal methods are computationally complex bility of deep neural networks (DNNs), the relationship
and hard to be executed in real time. Another limitation between the parameter input and the optimization solution
of existing optimization approaches points to the require- obtained from any existing algorithm can be approxi-
ment of exact models, which tend to abstract away many mated. For implementation in real time, the new parame-
imperfections in reality for mathematical tractability, and ter is input into the trained DNN and a good solution can
the solution is highly dependent on the accuracy of the be given almost instantly, thus improving its potential for
models. However, the wireless communication environ- adoption in practice.
ment is constantly changing by nature and the resultant In learning tasks, the DNN is usually trained to mini-
uncertainty in model parameters, e.g., channel state infor- mize the discrepancy between the output and the ground
mation (CSI) accuracy, undermines the performance of the truth given an input. With this idea in mind, we can
optimized solution. leverage deep learning to directly minimize or maximize
Finally, as wireless networks grow more complex and the optimization objective, i.e., treating the objective as
versatile, a lot of new service requirements do not directly the loss function in supervised learning. Then various
translate to the performance metrics that the communica- training algorithms, such as stochastic gradient descent,
tion community is used to, such as the sum rate or pro- can be employed to find the optimization solution. Com-
portional fairness. For example, in high-mobility vehicu- pared with the direct input–output relationship learning,
lar networks, the simultaneous requirements of capacity this approach lifts the performance limit imposed by the
maximization for vehicle-to-infrastructure (V2I) links and traditional optimization algorithms that are used to gener-
reliability enhancement for vehicle-to-vehicle (V2V) links ate the training data. Alternatively, deep learning can be
[27], [28] do not admit an obvious formulation. In par- embedded as a component to accelerate some steps of a
ticular, if we define the reliability of V2V links as the well-behaved optimization algorithm, such as the pruning
successful delivery of packets of size B within the time stage of the branch-and-bound (B&B) in [38] and [39].
constraint T [30], [31], the problem becomes a sequential This method leverages the theoretical models developed
decision problem spanning the whole T time steps and with expert knowledge and achieves near-optimal perfor-
is difficult to solve in a mathematically exact manner. mance with significantly reduced execution time.
Liang et al.: Deep-Learning-Based Wireless Resource Allocation With Application to Vehicular Networks
Q(St , At ) ← Q(St , At )
Fig. 2. Agent–environment interaction in RL. + α Rt+1 + γ max
Q(St+1 , a ) − Q(St , At )
a
(1)
C. Deep RL-Based Resource Allocation where α is the step-size parameter, γ ∈ (0, 1] is the
MDP discounter factor, and the choice of At in state St
RL addresses sequential decision making via maximiz- follows some soft policies, e.g., the -greedy, meaning that
ing a numerical reward signal while interacting with the the action with maximal estimated value is chosen with
unknown environment, as illustrated in Fig. 2. Mathemati- probability 1 − while a random action is selected with
cally, the RL problem can be modeled as a Markov decision probability .
process (MDP). At each discrete time step t, the agent
observes some representation of the environment state St 2) Deep Q-Network (DQN) With Experience Replay: In
from state space S , and then selects an action At from the many problems of practical interest, the state and action
action set A. Following the action, the agent receives a space can be too large to store all action-value functions in
numerical reward Rt+1 and the environment transitions a tabular form. As a result, it is common to use function
to a new state St+1 , with transition probability p(s , r|s, a). approximation to estimate these value functions. In deep
In RL, decision-making manifests itself in a policy π(a|s), Q-learning [46], a DNN parameterized by θ, called DQN,
which is a mapping from states in S to probabilities of is used to represent the action-value function. The state-
selecting each action in A. The goal of learning is to find action space is explored with some soft policies, e.g.,
an optimal policy π∗ that maximizes the expected accumu- -greedy, and the transition tuple (St , At , Rt+1 , St+1 ) is
lative rewards from any initial state s. The RL framework stored in a replay memory at each time step. The replay
provides native support for addressing sequential decision memory accumulates experiences over many episodes of
making under uncertainty that we encounter in, e.g., the MDP. At each step, a mini-batch of experience D are
the resource allocation problem in vehicular networks. The uniformly sampled from the memory for updating θ with
problem can be tackled by designing a reward signal that variants of stochastic gradient descent methods, hence
correlates with the ultimate objective and the learning the name experience replay, to minimize the sum-squared
algorithm can figure out a decent solution to the problem error
automatically. Indeed, it is the flexibility of reward design
2
in RL that makes it appealing for solving problems that are Rt+1 + γ max Q(St+1 , a ; θ− ) − Q(St , At ; θ) (2)
a
deemed difficult traditionally due to the inability to model D
the design objective exactly.
Here, we briefly review two basic RL algorithms, where θ− is the parameter set of a target Q-network,
i.e., (deep) Q-learning and REINFORCE, as representatives which is duplicated from the training Q-network parame-
of the value-based and policy-based RL methods, respec- ters set θ periodically and fixed for a couple of updates.
tively. We also provide a concise introduction to the actor– Experience replay improves sample efficiency through
critic algorithm that blends the two RL categories. As for repeatedly sampling stored experiences and breaks cor-
more advanced algorithms, such as natural policy gradi- relation in successive updates, thus also stabilizing
ents [40], trust region policy optimization [41], and async learning.
advantage actor–critic [42], we refer interested readers to
the given reference articles and the survey on deep RL in 3) Policy Gradient and Actor Critic: Different from value-
[43] and [44]. based RL methods that estimate a value function and
Liang et al.: Deep-Learning-Based Wireless Resource Allocation With Application to Vehicular Networks
Liang et al.: Deep-Learning-Based Wireless Resource Allocation With Application to Vehicular Networks
Liang et al.: Deep-Learning-Based Wireless Resource Allocation With Application to Vehicular Networks
Table 2 Performance of Graph Embedding-Based Link Scheduling multiuser interference channel. With the channel power,
hi,j , as the input, the DNN training loss is set as the sum
rate, given by
hi,i Pi
ηSE = log2 1+ 2 (6)
resentation of each node based on topology information. i
σ + k=i hk,i Pk
The output feature vector is then input to a DNN for
link scheduling. Compared with the kernel-based feature where Pi , the ith output of the DNN, denotes the transmit
extraction process in [56] and [57], the graph embedding power of the ith transmitter, σ 2 is the noise power, and hi,j
process only involves the nonlinear function mapping and denotes the channel power from transmitter i to receiver
can be learned with fewer training samples. The simulation j . Furthermore, in [20], the CNN is used for power control
results are summarized in Table 2. Note that there are two strategy learning in a similar scenario. It can be trained
metrics: classifier accuracy and average sum rate. The for- to improve the spectral efficiency as defined in (6) or the
mer reflects the similarity between the scheduling output energy efficiency, which is expressed as
of the proposed method and the state-of-the-art FPLinQ
algorithm for link scheduling. The latter is the normalized ηSE
(i)
sum rate achieved by the proposed method with respect to ηEE = (7)
i
Pi + Pc
that by the FPLinQ algorithm. From Table 2, the proposed
method in [58] only needs hundreds of training samples to
(i)
achieve good performance without accurate CSI. where ηSE represents the ith summation term in (6) and
Pc is the fixed circuit power. The performance of both
unsupervised learning approaches is shown to be better
B. Unsupervised Learning Approach for than the state-of-the-art WMMSE heuristic [52].
Optimization Along this line of thought, resource allocation is fur-
The supervised learning paradigm sometimes suffers ther treated as a functional optimization problem in [61]
from several disadvantages. First, since the ground truth that optimizes the mapping from the channel states to
must be provided by conventional optimization algo- power allocation results such that the long term average
rithms, performance of the deep-learning approach will be performance of a wireless system is maximized. To enforce
bounded by that of the conventional algorithms. Second, constraints in the learning procedure, such as the power
a large number of labeled samples are usually required budget and QoS requirements, training of DNNs can be
to obtain a good model as demonstrated in [16] and undertaken in the dual domain, where the constraints are
[54] that use 1 000 000 and 50 000 labeled samples for linearly combined to create a weighted objective. Primal–
training, respectively. However, high-quality labeled data dual gradient descent is then proposed as a model-free
are difficult to generate in wireless resource allocation learning approach, where the gradients are estimated by
due to a few factors, e.g., inherent problem hardness sampling the model functions and wireless channels with
and computational resource constraints. It further limits the policy gradient method.
scalability of the supervised learning methods. Conventionally, the loss functions for deep-learning
In order to improve performance of the deep-learning- training are designed for regression and classification
enabled optimization, unsupervised learning approaches tasks. These loss functions are usually of simple forms and
have been proposed to train the DNN according to the have theoretical guarantees for convergence. For instance,
optimization objective directly, instead of learning from the deep linear networks can obtain the global minima
the conventional optimization approach. In general, deep with convex loss functions [62] and overparameterized
learning is trained to optimize a loss function via stochastic networks with nonlinear activation functions can achieve
gradient descent. The loss functions are designed from case zero training loss with the quadratic loss function [63]. In
to case. For instance, in classification problems, the cross- the resource allocation problems, however, the loss func-
entropy loss is often used while in regression problems, tions deviate from the commonly used ones in the regres-
l1 and l2 losses are preferred. Therefore, it is natural to sion and classification tasks, and hence issues arise from
use the objective function from the optimization problem the theoretical and practical aspects in obtaining good
as the loss function so that the objective function can be performance. As such, in the majority of existing works,
optimized based on the stochastic gradient descent during additional techniques have been applied to improve train-
training. ing performance for the optimization objectives in wireless
Various loss functions have been utilized to train the resource allocation. For example, in [57], the DNNs are
DNNs for wireless resource management and shown designed with special structures, including the use of con-
to outperform state-of-the-art heuristics. For example, volutional filters and feedback links. In [60], multiple deep
the sum rate of multiple users is treated as the loss function networks are ensembled for better performance while in
to train a fully connected DNN in [60] that learns to [20], the DNN is pretrained with the WMMSE solution as
solve the nonconvex sum rate maximization problem in a the ground truth.
Liang et al.: Deep-Learning-Based Wireless Resource Allocation With Application to Vehicular Networks
Table 3 Performance of Accelerated B&B Algorithm With Different Num- algorithm. From this table, the accelerated B&B algorithm
bers of Training Samples for the Scenario With Five Cellular Users and can achieve near-optimal performance and meanwhile
Two D2D Pairs
substantially reduce computational complexity using only
tens to hundreds of training samples.
IV. D E E P R L - B A S E D R E S O U R C E
A L L O C AT I O N
C. Deep-Learning Accelerated Optimization Deep RL has been found effective in network slicing
[64], integrated design of caching, computing, and com-
In the two aforementioned learning paradigms, the opti-
munication for software-defined and virtualized vehicular
mization procedure is usually viewed as a black box
networks [65], multitenant cross-slice resource orches-
and then completely replaced by a deep-learning module.
tration in cellular RANs [66], proactive channel selec-
These approaches do not require any prior information
tion for long-time evolution (LTE) networks in unlicensed
about the optimization problems, but need a large amount
spectrum [67], and beam selection in millimeter wave
of training data, labeled or unlabelled, to obtain good
MIMO systems in [68]. In this section, we highlight some
performance, albeit with some efforts to improve sample
exemplary cases where deep RL shows impressive promises
efficiency [56]–[58]. To address the issue, it has been
in wireless resource allocation, in particular, for dynamic
proposed to leverage the domain knowledge by embedding
spectrum access, power allocation, and joint spectrum and
deep learning as a component to accelerate certain parts of
power allocation in vehicular networks.
a well-behaved optimization algorithm.
The MINLP problem is frequently encountered in wire-
less resource allocation, which is generally NP-hard and A. Dynamic Spectrum Access
difficult to solve optimally. The generic formulation is given In its simplest form, the dynamic spectrum access prob-
by [38] lem considers a single user that chooses one of N channels
maximize f (αα, ω ) for data transmission, as shown in Fig. 5. In [21], it is
α,ω
assumed that each channel has either “good” or “bad”
α, ω ) ≤ 0
s.t. Q(α conditions in each time slot and the channel conditions
αi ∈ N, ωi ∈ C ∀i (8) vary as time evolves. The transmission is successful if the
chosen channel is good and unsuccessful otherwise. The
where f (·, ·) is the objective function, αi and ωi are user keeps transmitting in successive time slots with the
the elements of α and ω , and Q(·, ·) represents certain objective of achieving as many successful transmissions
constraints, such as power or QoS constraints. N and as possible.
C denote some discrete and continuous set constraints, The problem is inherently partially observable since
respectively. Traditional approaches to the MINLP problem the user can only sense one of the N channels in each
are often based on mathematical programming, such as time slot while the conditions of other channels remain
the globally optimal B&B algorithm, which nonetheless has unknown. To deal with such partial observability and
high complexity for real-time implementation. improve transmission success rates, deep RL has been
The resource allocation in cloud radio access networks leveraged in [21] that takes the history of previous actions
(RANs) and D2D systems has been studied in [38] and and observations (success or failure) up to M time slots
[39], respectively, which can be formulated as an MINLP
problem and subject to the B&B algorithm. By observation,
the branching procedure is the most time consuming part
in the B&B algorithm and a good pruning policy can
potentially reduce the computational complexity. The more
nodes that would not lead to the optimal solution are
pruned, the less time is consumed. Therefore, algorithm
acceleration can be formulated into a pruning policy learn-
ing task. With invariant problem-independent features
and appropriate problem-dependent features selection,
the pruning policy learning task can be further converted
into a binary classification problem to be solved by deep
learning. Simulation results of the learning accelerated
optimization method for D2D networks are summarized
in Table 3. In this table, ogap, or optimality gap, means
the performance gap between the optimal solution and the Fig. 5. Illustration of dynamic spectrum access of a single
one achieved by the accelerated algorithm, while speed transmitter with N channels. “G” and “B” represent good and bad
refers to the speedup with respect to the original B&B channel conditions, respectively.
Liang et al.: Deep-Learning-Based Wireless Resource Allocation With Application to Vehicular Networks
as the input to their DQN. The output is a vector of to gain experience for Q-network training without prior
length N with the ith element representing the action knowledge about the protocols that other devices follow.
value of the input state if channel i is selected. The reward With such a model-free deep RL approach that is purely
design naturally follows: a reward of +1 is obtained for data-driven, the performance is measurably close to the
successful transmission and −1 for failed transmission. theoretical upper bound that has full model knowledge.
The learning agent employs the -greedy policy to explore So far, we have demonstrated the power of deep RL
the unknown environment and the accumulated experi- in a single-user spectrum access problem. In fact, similar
ence is used to improve the policy with the experience observations translate to the multiuser setting as well,
replay technique to break data correlation and stabilize albeit with some variations. As investigated in [74], a set
training. When the tested N = 16 channels are strongly of K = {1, . . . , K} users attempt transmission over N =
correlated, the proposed learning-based dynamic spectrum {1, . . . , N } orthogonal channels. In each time slot, every
access method outperforms the implementation friendly user selects a channel to send its data and the trans-
model-based Whittle Index heuristic introduced in [69]. mission is successful (with an ACK signal received) if
It closely approaches the genie-aided Myopic policy (with no others use the same channel. Different from a single-
channel evolution dynamics known), which is known to user setting, various design objectives can be defined for
be optimal or near-optimal in the considered scenario multiple users, such as sum rate maximization, sum log-
[70], [71]. rate maximization (known as proportional fairness), etc.,
However, as pointed out in [72], the DQN-enabled depending on the network utility of interest. Apart from
approach developed in [21] encounters difficulty when the its strong combinatorial nature, the problem is difficult in
number of channels, N , scales large. To tackle the chal- that the environment is only partially observable to each
lenge, a model-free deep RL framework using actor–critic user and nonstationary from user’s perspective due to the
has been further proposed in [72]. The states, actions, and interaction among multiple users when they are actively
rewards are similarly defined. But two DNNs are now con- exploring and learning.
structed at the agent, one for the actor that maps the cur- The deep RL-based framework developed in [74]
rent state (history of actions and observations up to M time assumes a centralized training and distributed implemen-
slots) to the distribution over all actions and is updated tation architecture, where a central trainer collects the
using policy gradient, and the other for the critic that experience from each user, trains a DQN, and sends
evaluates the value of each action (minus a baseline) under the trained parameters to all users to update their local
the current policy. The agent explores the unknown envi- Q-networks in the training phase. In the implementation
ronment following the policy defined in the actor network phase, each user inputs local observations into its DQN
and the accumulated experience is then used to update and then acts according to the network output without
both the actor (with information from the critic network) any online coordination or information exchange among
and critic networks. Finally, the trained actor network is them. To address partial observability, a long short-term
available to guide the selection of channels at the begin- memory (LSTM) layer is added to the DQN that main-
ning of each time slot. It has been shown that the actor– tains an internal state and accumulates observations over
critic method has comparable performance as the DQN- time. The local observation, Si (t), of user i in time slot t
based approach in [21] when N = 16 and significantly includes its action (selected channel), the selected channel
outperforms that when N grows to 32 or 64. Similar to the capacity, and received ACK signal in time slot t − 1. The
original DQN-based algorithm, the actor–critic approach observation is then implicitly aggregated in the LSTM layer
also demonstrates strong potential to quickly adapt to embedded in the DQN to form a history of the agent. The
sudden nonstationary change in the environment while the action, ai (t), of user i in time slot t is drawn according
actor–critic algorithm is more computationally efficient. to the following distribution to balance exploitation and
The DQN-based method is also applied to address exploration:
dynamic spectrum access in a time-slotted heterogeneous
network in [73], where a wireless device needs to decide
whether to transmit in a time slot when it coexists with (1 − α)eβQ(a,Si (t)) α
Pr(ai (t) = a) = + (9)
other devices that follow TDMA and ALOHA protocols. The ã∈N e βQ(ã,Si (t)) N +1
transmission fails if more than one device attempts to
transmit and the goal is also to maximize the number where α ∈ (0, 1), β is the temperature to be tuned in
of successful transmissions. Similar to [21], the wireless training, and Q(a, Si (t)) is the value of selecting channel a
device constructs a DQN that takes the history of previous for a given observation Si (t) according to the DQN output.
actions (TRANSMIT or WAIT) and observations (SUC- To address the issue of environment nonstationarity,
CESS, COLLISION, or IDLENESS) up to M time slots, experience replay, which has been popular in most deep
as the input, and the output is the value of each action RL training but could continuously confuse the agent
given the current state. The reward is set to 1 if the with outdated experiences in a nonstationary environment,
observation is SUCCESS and 0 otherwise. The learning is disabled during training. More advanced techniques,
agent interacts with the unknown wireless environment such as the dueling network architecture [75] and dou-
Liang et al.: Deep-Learning-Based Wireless Resource Allocation With Application to Vehicular Networks
ble Q-learning [76], are leveraged to improve training values in different environment states. After training for
convergence. When compared with the classical slotted a while, the updated DQN parameters are broadcast to
ALOHA protocol, opportunistic channel aware algorithm all agents that use the parameters to construct/update
[77], [78], and distributed protocol developed in [79], their own DQNs for distributed execution. To have a
the proposed deep RL method consistently achieves better better representation of the communication environment,
performance in terms of average channel utilization, aver- the state observed by each agent is constructed to include
age throughput, and proportional fairness with a properly useful local information (such as its own transmit power
designed reward according to the utility of interest. in the previous time slot, total interference power, and
its own channel quality), interference from close inter-
B. Power Allocation in Wireless Networks fering neighbors, and its generated interference toward
Power allocation in wireless networks concerns the impacted neighbors. With a proper reward design that
adaption of transmit power in response to varying channel targets system-wide performance, the proposed deep RL
and user conditions such that system metrics of interest are method learns to adapt the transmit power of each link
optimized. Consider an interference channel, where N rep- only using the experience obtained from interaction with
resents communication links, denoted by N = {1, . . . , N }, the environment. It has been shown to outperform state-
share a single spectrum subband that is assumed frequency of-the-art, including the WMMSE algorithm in [52] and
flat for simplicity. The transmitter of link i sends infor- fractional programming (FP) algorithm in [80], which
mation toward its intended receiver with the power, Pi , are assumed to have accurate global CSI. Remarkably,
and the channel gain from the transmitter of link j to the the developed power allocation method leveraging deep
(t)
receiver of link i in time slot t is denoted by gj,i , for any RL returns solutions better and faster without assuming
i, j ∈ N , including both large- and small-scale fading com- prior knowledge about the channels, and thus can handle
ponents. Then the received SINR of link i in time slot t is more complicated but practical nonidealities of real sys-
tems. Another interesting observation is that the proposed
(t)
Pi gi,i learning-based approach shows impressive robustness in
(t)
γ i (P ) = (t)
(10) the sense that DQNs trained in a different setting (dif-
j=i Pj gj,i + σ 2
ferent initialization or number of links) can still achieve
decent performance. It suggests using a DQN trained on
where P = [P1 , . . . , PN ]T is the transmit power vector a simulator to jump-start a newly added node in real
for the N links and σ 2 represents noise power. Then, a communication networks is very likely to work in practice.
dynamic power allocation problem to optimize a generic An extension has been developed in [81] that considers
weighted sum rate is formulated as the case where each transmitter serves more than one
receiver and the data-driven deep RL-based approach has
N
been demonstrated to achieve better performance than
(t) (t)
maximize wi · log 1 + γi (P) benchmark algorithms.
P
i=1
The power allocation problem has also been studied in
s.t. 0 ≤ Pi ≤ Pmax , i∈N (11) [82] and [83], which consider a cellular network with
several cells and the base station in each cell serves mul-
where Pmax is the maximum transmit power for all tiple users. All base stations share the same frequency
links and wi(t) is the nonnegative weight of link i in spectrum, which is further divided into orthogonal sub-
time slot t that can be adjusted to consider sum rate bands within the cell for each user, i.e., there exists
maximization or proportionally fair scheduling. intercell interference but no intracell interference. Each
Owing to the coupling of transmit power across all links, base station is controlled by a deep RL agent that takes
the optimization problem in (11) is in general nonconvex actions (i.e., performs power adaptation) based on its
and NP-hard [8], not to mention the heavy signaling local observation of the environment, including cell power,
overhead to acquire global CSI. To address the challenge, average reference signal received power, average interfer-
a model-free deep RL-based power allocation scheme has ence, and a local cell reward. The power control action
been developed in [22] that can track the channel evo- is discretized to incremental changes of {0, ±1, ±3} dB
lution and execute in a distributed manner with limited and the reward is designed to reflect system-wide util-
information exchange. ity to avoid selfish decisions. The agents take turns to
The proposed deep RL method in [22] assumes a cen- explore the environment to minimize the impact on each
tralized training architecture, where each transmitter acts other, which stabilizes the learning process. Each deep
as a learning agent that explores the unknown environ- RL agent trains a local Q-network using the experience
ment with an -greedy policy and then sends its explo- accumulated from its interaction with the communication
ration experience to a central controller through backhaul environment. The deep RL-based approach learns to con-
links with some delay. A DQN is trained at the controller trol transmit power for each base station that achieves
using the experience collected from all agents with expe- significant energy savings and fairness among users in the
rience replay, which stores an approximation of action system.
Liang et al.: Deep-Learning-Based Wireless Resource Allocation With Application to Vehicular Networks
Fig. 6. Illustration of spectrum sharing in vehicular networks, where Gt and Ht represent the current V2V signal channel
where V2V and V2I links are indexed by k and m, respectively, and strength and the interference channel strength from the
each V2I link is preassigned an orthogonal RB.
V2V transmitter to the base station over all RBs, respec-
tively, It−1 is the received interference power, Nt−1 is the
selected RBs of neighbors in the previous time slot over all
RBs, Lt and Ut denote the remaining load and time to meet
Power allocation in a more complicated multicell net-
latency constraint from the current time slot, respectively.
work has been considered in [84], where each cell has one
The action of each V2V agent amounts to a selection of
base station serving multiple users and all base stations
RB as well as discrete transmit power levels. The reward
transmit over the same spectrum, generating both intra-
balances V2I and V2V requirements, given by
and intercell interference. Each base station is controlled
by an RL agent that decides its transmit power to maximize
the network sum throughput. Similar to [22], a single RL r t = λc C c [m] + λv C v [k] − λp (T − Ut ) (13)
agent is trained in a centralized manner using experience m k
Liang et al.: Deep-Learning-Based Wireless Resource Allocation With Application to Vehicular Networks
Liang et al.: Deep-Learning-Based Wireless Resource Allocation With Application to Vehicular Networks
Liang et al.: Deep-Learning-Based Wireless Resource Allocation With Application to Vehicular Networks
power control for multiple users within a cell, or joint have been some recent works [94] that attempt to solve the
spectrum and power allocation for multiple V2V links in issue and establish the convergence guarantee of mean-
a vehicular network. In these cases, actions of one user field multiagent RL, more investigation in this direction is
impact the performance of not only itself but also others still desired.
nearby. From the perspective of each user, the environment
that it observes then exhibits nonstationarity when other
VI. C O N C L U S I O N
users are actively exploring the state and action space
for policy learning. Meanwhile, each user can only obtain In this article, we have provided an overview of
local observations or measurements of the true underlying applying the burgeoning deep-learning technology to wire-
environment state, which, in the deep RL terminology, is a less resource allocation with application to vehicular net-
partially observable MDP. Then the learning agent needs works. We have discussed the limitations of traditional
to construct its belief state from the previous actions and optimization-based approaches and the potentials of deep-
local observations to estimate the true environment state, learning paradigms in wireless resource allocation. In
which is a challenging issue. A partial solution to the particular, we have described in detail how to lever-
problem might be to enable interagent communications age deep learning to solve difficult optimization prob-
to encourage multiuser coordination and increase local lems for resource allocation and deep RL for a direct
awareness of the global environment state. That said, such answer to many resource allocation problems that can-
a combination of environment nonstationarity and partial not be handled or even modeled in the traditional opti-
observability makes learning extremely difficult, which is mization framework. We have further identified some
made even worse if we scale the number of users large open issues and research directions that warrant future
as in the upcoming Internet of Things era. Although there investigation.
REFERENCES
[1] Z. Han and K. J. R. Liu, Resource Allocation for [13] H. Ye, L. Liang, G. Y. Li, and B.-H. F. Juang, “Deep intelligence: Embedding expert knowledge in deep
Wireless Networks: Basics, Techniques, and learning based end-to-end wireless communication neural networks for wireless system optimization,”
Applications. Cambridge, U.K.: Cambridge Univ. systems with conditional GAN as unknown IEEE Veh. Technol. Mag., vol. 14, no. 3, pp. 60–69,
Press, 2008. channel,” 2019, arXiv:1903.02551. [Online]. Jul. 2019.
[2] C. E. Shannon, “A mathematical theory of Available: https://arxiv.org/abs/1903.02551 [25] Z. Qin, H. Ye, G. Y. Li, and B.-H. F. Juang, “Deep
communication,” Bell Syst. Tech. J., vol. 27, no. 3, [14] N. Samuel, T. Diskin, and A. Wiesel, “Learning to learning in physical layer communications,” IEEE
pp. 379–423, 1948. detect,” IEEE Trans. Signal Process., vol. 67, no. 10, Wireless Commun., vol. 26, no. 2, pp. 93–99,
[3] W. Yu and R. Lui, “Dual methods for nonconvex pp. 2554–2564, May 2019. Apr. 2019.
spectrum optimization of multicarrier systems,” [15] N. Farsad and A. Goldsmith, “Neural network [26] D. Gunduz, P. de Kerret, N. Sidiroupoulos,
IEEE Trans. Commun., vol. 54, detection of data sequences in communication D. Gesbert, C. Murthy, and M. van der Schaar,
no. 7, pp. 1310–1322, Jul. 2006. systems,” IEEE Trans. Signal Process., “Machine learning in the air,” IEEE J. Sel. Areas
[4] C. Y. Wong, R. S. Cheng, K. B. Lataief, and vol. 66, no. 21, pp. 5663–5678, Nov. 2018. Commun., vol. 37, no. 10, pp. 2184–2199,
R. D. Murch, “Multiuser OFDM with adaptive [16] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and Oct. 2019.
subcarrier, bit, and power allocation,” IEEE J. Sel. N. D. Sidiropoulos, “Learning to optimize: Training [27] H. Ye, L. Liang, G. Y. Li, J. Kim, L. Lu, and M. Wu,
Areas Commun., vol. 17, no. 10, pp. 1747–1758, deep neural networks for interference “Machine learning for vehicular networks: Recent
Oct. 1999. management,” IEEE Trans. Signal Process., vol. 66, advances and application examples,” IEEE Veh.
[5] G. Song and Y. Li, “Utility-based resource allocation no. 20, pp. 5438–5453, Oct. 2018. Technol. Mag., vol. 13, no. 2, pp. 94–101,
and scheduling in OFDM-based wireless broadband [17] C. Zhang, P. Patras, and H. Haddadi, “Deep learning Jun. 2018.
networks,” IEEE Commun. Mag., vol. 43, no. 12, in mobile and wireless networking: A survey,” IEEE [28] L. Liang, H. Ye, and G. Y. Li, “Toward intelligent
pp. 127–134, Dec. 2005. Commun. Surveys Tuts., vol. 21, no. 3, vehicular networks: A machine learning
[6] G. Song and Y. Li, “Cross-layer optimization for pp. 2224–2287, 3rd Quart., 2019. framework,” IEEE Internet Things J., vol. 6, no. 1,
OFDM wireless networks—Part I: Theoretical [18] Z. Chang, L. Lei, Z. Zhou, S. Mao, and T. Ristaniemi, pp. 124–135, Feb. 2019.
framework,” IEEE Trans. Wireless Commun., vol. 4, “Learn to cache: Machine learning for network [29] M. Chen, U. Challita, W. Saad, C. Yin, and
no. 2, pp. 614–624, Mar. 2005. edge caching in the big data era,” IEEE Wireless M. Debbah, “Machine learning for wireless
[7] G. Song and Y. Li, “Cross-layer optimization for Commun., vol. 25, no. 3, pp. 28–35, Jun. 2018. networks with artificial intelligence: A tutorial on
OFDM wireless networks—Part II: Algorithm [19] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, neural networks,” IEEE Commun. Surveys Tuts.,
development,” IEEE Trans. Wireless Commun., “Wireless network intelligence at the edge,” Proc. vol. 21, no. 4, pp. 3039–3071, 4th Quart., 2019.
vol. 4, no. 2, pp. 625–634, IEEE, vol. 107, no. 11, pp. 2204–2239, [30] 3rd Generation Partnership Project; Technical
Mar. 2005. Nov. 2019. Specification Group Radio Access Network; Study on
[8] Z.-Q. Luo and S. Zhang, “Dynamic spectrum [20] W. Lee, M. Kim, and D.-H. Cho, “Deep power LTE-based V2X Services; (Release 14), 3GPP TR
management: Complexity and duality,” IEEE J. Sel. control: Transmit power control scheme based on 36.885 V14.0.0, Jun. 2016.
Topics Signal Process., vol. 2, no. 1, pp. 57–73, convolutional neural network,” IEEE Commun. [31] R. Molina-Masegosa and J. Gozalvez, “LTE-V for
Feb. 2008. Lett., vol. 22, no. 6, pp. 1276–1279, Jun. 2018. sidelink 5G V2X vehicular communications: A new
[9] M. Bennis, M. Debbah, and H. V. Poor, “Ultra [21] S. Wang, H. Liu, P. H. Gomes, and 5G technology for short-range vehicle-to-everything
reliable and low-latency wireless communication: B. Krishnamachari, “Deep reinforcement learning communications,” IEEE Veh. Technol. Mag., vol. 12,
Tail, risk, and scale,” Proc. IEEE, vol. 106, no. 10, for dynamic multichannel access in wireless no. 4, pp. 30–39, Dec. 2017.
pp. 1834–1853, Oct. 2018. networks,” IEEE Trans. Cogn. Commun. Netw., [32] L. Liang, G. Y. Li, and W. Xu, “Resource allocation
[10] T. O’Shea and J. Hoydis, “An introduction to deep vol. 4, no. 2, pp. 257–265, Jun. 2018. for D2D-enabled vehicular communications,” IEEE
learning for the physical layer,” IEEE Trans. Cogn. [22] Y. S. Nasir and D. Guo, “Multi-agent deep Trans. Commun., vol. 65, no. 7, pp. 3186–3197,
Commun. Netw., vol. 3, no. 4, pp. 563–575, reinforcement learning for dynamic power Jul. 2017.
Dec. 2017. allocation in wireless networks,” IEEE J. Sel. Areas [33] L. Liang, S. Xie, G. Y. Li, Z. Ding, and X. Yu,
[11] S. Dörner, S. Cammerer, J. Hoydis, and S. ten Brink, Commun., vol. 37, no. 10, pp. 2239–2250, “Graph-based resource sharing in vehicular
“Deep learning based communication over the air,” Oct. 2019. communication,” IEEE Trans. Wireless Commun.,
IEEE J. Sel. Areas Commun., vol. 12, no. 1, [23] A. Zappone, M. Di Renzo, and M. Debbah, vol. 17, no. 7, pp. 4579–4592, Jul. 2018.
pp. 132–143, Feb. 2018. “Wireless networks design in the era of deep [34] W. Sun, E. G. Ström, F. Brännström, K. Sou, and
[12] H. Ye, G. Y. Li, and B.-H. Juang, “Power of deep learning: Model-based, AI-based, or both?” IEEE Y. Sui, “Radio resource management for D2D-based
learning for channel estimation and signal Trans. Commun., vol. 67, no. 10, pp. 7331–7376, V2V communication,” IEEE Trans. Veh. Technol.,
detection in OFDM systems,” IEEE Wireless Oct. 2019. vol. 65, no. 8, pp. 6636–6650, Aug. 2016.
Commun. Lett., vol. 7, no. 1, pp. 114–117, [24] A. Zappone, M. Di Renzo, M. Debbah, T. T. Lam, [35] W. Sun, D. Yuan, E. G. Ström, and F. Brännström,
Feb. 2018. and X. Qian, “Model-aided wireless artificial “Cluster-based radio resource management for
Liang et al.: Deep-Learning-Based Wireless Resource Allocation With Application to Vehicular Networks
D2D-supported safety-critical V2X for wireless scheduling,” in Proc. IEEE GLOBECOM, architectures for deep reinforcement learning,”
communications,” IEEE Trans. Wireless Commun., Dec. 2018, pp. 1–6. 2015, arXiv:1511.06581. [Online]. Available:
vol. 15, no. 4, pp. 2756–2769, Apr. 2016. [57] W. Cui, K. Shen, and W. Yu, “Spatial deep learning https://arxiv.org/abs/1511.06581
[36] Y. LeCun, Y. Bengio, and G. Hinton, “Deep for wireless scheduling,” IEEE J. Sel. Areas [76] H. Van Hasselt, A. Guez, and D. Silver, “Deep
learning,” Nature, vol. 521, no. 7553, p. 436, 2015. Commun., vol. 37, no. 6, pp. 1248–1261, reinforcement learning with double q-learning,” in
[37] B.-H. Juang, “Deep neural Jun. 2019. Proc. AAAI, 2016, pp. 2094–2100.
networks—A developmental perspective,” APSIPA [58] M. Lee, G. Yu, and G. Y. Li, “Graph embedding [77] K. Cohen and A. Leshem, “Distributed
Trans. Signal Inf. Process., vol. 5, Apr. 2016. based wireless link scheduling with few training game-theoretic optimization and management of
[38] Y. Shen, Y. Shi, J. Zhang, and K. B. Letaief, “LORM: samples,” 2019, arXiv:1906.02871. [Online]. multichannel ALOHA networks,” IEEE/ACM Trans.
Learning to optimize for resource management in Available: https://arxiv.org/abs/1808.01486 Netw., vol. 24, no. 3, pp. 1718–1731, Jun. 2016.
wireless networks with few training samples,” IEEE [59] K. Shen and W. Yu, “FPLinQ: A cooperative [78] T. To and J. Choi, “On exploiting idle channels in
Trans. Wireless Commun., to be published. spectrum sharing strategy for device-to-device opportunistic multichannel ALOHA,” IEEE
[39] M. Lee, G. Yu, and G. Y. Li, “Learning to branch: communications,” in Proc. IEEE Int. Symp. Inf. Commun. Lett., vol. 14, no. 1, pp. 51–53, Jan. 2010.
Accelerating resource allocation in wireless Theory (ISIT), Jun. 2017, pp. 2323–2327. [79] I.-H. Hou and P. Gupta, “Proportionally fair
networks,” IEEE Trans. Veh. Technol., to be [60] F. Liang, C. Shen, W. Yu, and F. Wu, “Towards distributed resource allocation in multiband
published. optimal power control via ensembling deep neural wireless systems,” IEEE/ACM Trans. Netw., vol. 22,
[40] S. M. Kakade, “A natural policy gradient,” in Proc. networks,” 2018, arXiv:1807.10025. [Online]. no. 6, pp. 1819–1830, Dec. 2014.
Adv. Neural Inf. Process. Syst. (NeurIPS), 2002, Available: https://arxiv.org/abs/1807.10025 [80] K. Shen and W. Yu, “Fractional programming for
pp. 1531–1538. [61] M. Eisen, C. Zhang, L. F. O. Chamon, D. D. Lee, and communication systems—Part I: Power control and
[41] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and A. Ribeiro, “Learning optimal resource allocations beamforming,” IEEE Trans. Signal Process., vol. 66,
P. Moritz, “Trust region policy optimization,” in in wireless systems,” IEEE Trans. Signal Process., no. 10, pp. 2616–2630, May 2018.
Proc. Int. Conf. Mach. Learn.(ICML), 2015, vol. 67, no. 10, pp. 2775–2790, Apr. 2019. [81] F. Meng, P. Chen, and L. Wu, “Power allocation in
pp. 1889–1897. [62] T. Laurent and J. Brecht, “Deep linear networks multi-user cellular networks with deep Q learning
[42] V. Mnih et al., “Asynchronous methods for deep with arbitrary loss: All local minima are global,” in approach,” in Proc. IEEE ICC, May 2019, pp. 1–7.
reinforcement learning,” in Proc. Int. Conf. Mach. Proc. Int. Conf. Mach. Learn. (ICML), 2018, [82] E. Ghadimi, F. D. Calabrese, G. Peters, and
Learn.(ICML), 2016, pp. 1928–1937. pp. 2908–2913. P. Soldati, “A reinforcement learning approach to
[43] V. François-Lavet, P. Henderson, R. Islam, [63] S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai, power control and rate adaptation in cellular
M. G. Bellemare, and J. Pineau, “An introduction to “Gradient descent finds global minima of deep networks,” in Proc. IEEE ICC, May 2017, pp. 1–7.
deep reinforcement learning,” Found. Trends Mach. neural networks,” 2018, arXiv:1811.03804. [83] F. D. Calabrese, L. Wang, E. Ghadimi, G. Peters,
Learn., vol. 11, nos. 3–4, pp. 219–354, 2018. [Online]. Available: https://arxiv.org/abs/1811. L. Hanzo, and P. Soldati, “Learning radio resource
[44] K. Arulkumaran, M. P. Deisenroth, M. Brundage, 03804 management in RANs: Framework, opportunities,
and A. A. Bharath, “Deep reinforcement learning: [64] R. Li et al., “Deep reinforcement learning for and challenges,” IEEE Commun. Mag., vol. 56,
A brief survey,” IEEE Signal Process. Mag., vol. 34, resource management in network slicing,” IEEE no. 9, pp. 138–145, Sep. 2018.
no. 6, pp. 26–38, Nov. 2017. Access, vol. 6, pp. 74429–74441, 2018. [84] F. Meng, P. Chen, L. Wu, and J. Cheng, “Power
[45] C. J. C. H. Watkins and P. Dayan, “Q-learning,” [65] T. He, N. Zhao, and H. Yin, “Integrated networking, allocation in multi-user cellular networks: Deep
Mach. Learn., vol. 8, nos. 3–4, pp. 279–292, 1992. caching, and computing for connected vehicles: reinforcement learning approaches,” 2019,
[46] V. Mnih et al., “Human-level control through deep A deep reinforcement learning approach,” IEEE arXiv:1901.07159. [Online]. Available:
reinforcement learning,” Nature, vol. 518, Trans. Veh. Technol., vol. 67, no. 1, pp. 44–55, https://arxiv.org/abs/1901.07159
pp. 529–533, 2015. Jan. 2018. [85] H. Ye, G. Y. Li, and B.-H. F. Juang, “Deep
[47] R. S. Sutton, D. A. McAllester, S. P. Singh, and [66] X. Chen et al., “Multi-tenant cross-slice resource reinforcement learning based resource allocation
Y. Mansour, “Policy gradient methods for orchestration: A deep reinforcement learning for V2V communications,” IEEE Trans. Veh. Technol.,
reinforcement learning with function approach,” IEEE J. Sel. Areas Commun., vol. 37, vol. 68, no. 4, pp. 3163–3173, Apr. 2019.
approximation,” in Proc. Adv. Neural Inf. Process. no. 10, pp. 2377–2392, Oct. 2019. [86] L. Liang, H. Ye, and G. Y. Li, “Spectrum sharing in
Syst. (NeurIPS), 2000, pp. 1057–1063. [67] U. Challita, L. Dong, and W. Saad, “Proactive vehicular networks based on multi-agent
[48] R. S. Sutton and A. G. Barto, Reinforcement resource management for LTE in unlicensed reinforcement learning,” IEEE J. Sel.
Learning: An Introduction. Cambridge, MA, USA: spectrum: A deep learning perspective,” IEEE Trans. Areas Commun., vol. 37, no. 10, pp. 2282–2292,
MIT Press, 2018. Wireless Commun., vol. 17, no. 7, pp. 4674–4689, Oct. 2019.
[49] Y. Cui and V. K. N. Lau, “Distributive stochastic Jul. 2018. [87] M. I. Ashraf, M. Bennis, C. Perfecto, and W. Saad,
learning for delay-optimal OFDMA power and [68] V. Va, T. Shimizu, G. Bansal, and R. W. Heath, Jr., “Dynamic proximity-aware resource allocation in
subband allocation,” IEEE Trans. Signal Process., “Online learning for position-aided millimeter wave vehicle-to-vehicle (V2V) communications,” in Proc.
vol. 58, no. 9, pp. 4848–4858, Sep. 2010. beam training,” IEEE Access, vol. 7, IEEE GLOBECOM Workshops, Dec. 2016, pp. 1–6.
[50] V. K. N. Lau and Y. Cui, “Delay-optimal power and pp. 30507–30526, 2019. [88] J. Foerster et al., “Stabilising experience replay for
subcarrier allocation for OFDMA systems via [69] K. Liu and Q. Zhao, “Indexability of restless bandit deep multi-agent reinforcement learning,” in Proc.
stochastic approximation,” IEEE Trans. Wireless problems and optimality of Whittle index for Int. Conf. Mach. Learn. (ICML), 2017,
Commun., vol. 9, no. 1, pp. 227–233, Jan. 2010. dynamic multichannel access,” IEEE Trans. Inf. pp. 1146–1155.
[51] Q. Zheng, K. Zheng, H. Zhang, and V. C. M. Leung, Theory, vol. 56, no. 11, pp. 5547–5567, Nov. 2010. [89] G. Tesauro, “Extending Q-learning to general
“Delay-optimal virtualized radio resource [70] Q. Zhao, B. Krishnamachari, and K. Liu, “On adaptive multi-agent systems,” in Proc. Adv. Neural
scheduling in software-defined vehicular networks myopic sensing for multi-channel opportunistic Inf. Process. Syst. (NeurIPS), 2004,
via stochastic learning,” IEEE Trans. Veh. Technol., access: Structure, optimality, and performance,” pp. 871–878.
vol. 65, no. 10, pp. 7857–7867, Oct. 2016. IEEE Trans. Wireless Commun., vol. 7, no. 12, [90] L. Wang, H. Ye, L. Liang, and G. Y. Li, “Learn to
[52] Q. Shi, M. Razaviyayn, Z.-Q. Luo, and C. He, pp. 5431–5440, Dec. 2008. compress CSI and allocate resources in vehicular
“An iteratively weighted MMSE approach to [71] S. H. A. Ahmad, M. Liu, T. Javidi, Q. Zhao, and networks,” 2019, arXiv:1908.04685. [Online].
distributed sum-utility maximization for a MIMO B. Krishnamachari, “Optimality of myopic sensing Available: https://arxiv.org/abs/1908.04685
interfering broadcast channel,” IEEE Trans. Signal in multichannel opportunistic access,” IEEE Trans. [91] M. Eisen and A. Ribeiro, “Large scale wireless
Process., vol. 59, no. 9, pp. 4331–4340, Sep. 2011. Inf. Theory, vol. 55, no. 9, pp. 4040–4050, power allocation with graph neural networks,” in
[53] B. Matthiesen, A. Zappone, K.-L. Besser, Sep. 2009. Proc. IEEE SPAWC, Jul. 2019, pp. 1–5.
E. A. Jorswieck, and M. Debbah, “A globally optimal [72] C. Zhong, Z. Lu, M. C. Gursoy, and S. Velipasalar, [92] M. Eisen and A. Ribeiro, “Optimal wireless resource
energy-efficient power control framework and its “Actor-critic deep reinforcement learning for allocation with random edge graph neural
efficient implementation in wireless interference dynamic multichannel access,” in Proc. IEEE networks,” 2019, arXiv:1909.01865. [Online].
networks,” 2018, arXiv:1812.06920. [Online]. GlobalSIP, 2018, pp. 599–603. Available: https://arxiv.org/abs/1909.01865
Available: https://arxiv.org/abs/1812.06920 [73] Y. Yu, T. Wang, and S. C. Liew, “Deep-reinforcement [93] T. N. Kipf and M. Welling, “Semi-supervised
[54] M. Lee, Y. Xiong, G. Yu, and G. Y. Li, “Deep neural learning multiple access for heterogeneous wireless classification with graph convolutional networks,”
networks for linear sum assignment problems,” networks,” in Proc. IEEE ICC, May 2018, pp. 1–7. in Proc. Int. Conf. Learn. Represent. (ICLR), 2017,
IEEE Wireless Commun. Lett., vol. 7, no. 6, [74] O. Naparstek and K. Cohen, “Deep multi-user pp. 1–14.
pp. 962–965, Dec. 2018. reinforcement learning for distributed dynamic [94] M. K. Sharma, A. Zappone, M. Assaad, M. Debbah,
[55] H. W. Kuhn, “The Hungarian method for the spectrum access,” IEEE Trans. Wireless Commun., and S. Vassilaras, “Distributed power control for
assignment problem,” Naval Res. Logist., vol. 2, vol. 18, no. 1, pp. 310–323, Jan. 2019. large energy harvesting networks: A multi-agent
nos. 1–2, pp. 83–97, Mar. 1995. [75] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, deep reinforcement learning approach,” IEEE Trans.
[56] W. Cui, K. Shen, and W. Yu, “Spatial deep learning M. Lanctot, and N. De Freitas, “Dueling network Cogn. Commun. Netw., to be published.
Liang et al.: Deep-Learning-Based Wireless Resource Allocation With Application to Vehicular Networks
ABOUT THE AUTHORS Dr. Yu received the 2016 IEEE ComSoc Asia-Pacific Outstand-
ing Young Researcher Award. He regularly sits on the Technical
Le Liang (Member, IEEE) received the Program Committee (TPC) Boards of the prominent IEEE con-
B.E. degree in information engineering ferences such as ICC, GLOBECOM, and VTC. He also serves as
from Southeast University, Nanjing, China, the Symposium Co-Chair for the IEEE Globecom 2019 and the
in 2012, the M.A.Sc. degree in electrical Track Chair for the IEEE VTC 2019’Fall. He has served as a Guest
engineering from the University of Victoria, Editor of the IEEE Communications Magazine—Special Issue on
Victoria, BC, Canada, in 2015, and the Ph.D. Full Duplex Communications, an Editor of the IEEE JOURNAL ON
degree in electrical and computer engineer- SELECTED AREAS IN COMMUNICATIONS—Series on Green Communi-
ing from Georgia Institute of Technology, cations and Networking, a Lead Guest Editor of the IEEE Wire-
Atlanta, GA, USA, in 2018. less Communications Magazine—Special Issue on long-time evo-
Since 2019, he has been a Research Scientist with Intel Labs, lution (LTE) networks in unlicensed spectrum, and an Editor of
Hillsboro, OR, USA. His general research interests include wireless IEEE ACCESS. He is currently serving as an Editor of the IEEE
communications, signal processing, and machine learning. TRANSACTIONS ON GREEN COMMUNICATIONS AND NETWORKING and
Dr. Liang received the Best Paper Award at the IEEE/CIC ICCC in IEEE WIRELESS COMMUNICATIONS LETTERS.
2014. He is an Editor of IEEE COMMUNICATIONS LETTERS.
Geoffrey Ye Li (Fellow, IEEE) was with AT&T
Laboratories Research, Red Bank, NY, USA,
as a Senior and then a Principal Techni-
cal Staff Member, from 1996 to 2000, and
a Postdoctoral Research Associate with the
Hao Ye (Student Member, IEEE) received University of Maryland, College Park, MD,
the B.E. degree in information engineering USA, from 1994 to 1996. He is currently a
from Southeast University, Nanjing, China, Professor with Georgia Institute of Technol-
in 2012, and the M.E. degree in electrical ogy, Atlanta, GA, USA. His general research
engineering from Tsinghua University, Bei- interests include statistical signal processing and machine learning
jing, China, in 2015. He is currently working for wireless communications. In these areas, he has published over
toward the Ph.D. degree in electrical engi- 500 journal and conference articles in addition to over 40 granted
neering at Georgia Institute of Technology, patents. His publications have been cited over 38 000 times.
Atlanta, GA, USA. Dr. Li was a Fellow of the IEEE for his contributions to signal
His research interests include machine learning for physical and processing for wireless communications in 2005. He received the
MAC layer communications. Donald G. Fink Overview Paper Award in 2017 from the IEEE
Signal Processing Society; the James Evans Avant Garde Award
in 2013 and the Jack Neubauer Memorial Award in 2014 from the
IEEE Vehicular Technology Society; the Stephen O. Rice Prize Paper
Award in 2013, the Award for Advances in Communication in 2017,
Guanding Yu (Senior Member, IEEE) and the Edwin Howard Armstrong Achievement Award in 2019 from
received the B.E. and Ph.D. degrees in the IEEE Communications Society; and the 2015 Distinguished Fac-
communication engineering from Zhejiang ulty Achievement Award from the School of Electrical and Computer
University, Hangzhou, China, in 2001 and Engineering, Georgia Institute of Technology. He has organized and
2006, respectively. chaired many international conferences, including the Technical
From 2013 to 2015, he was a Visiting Program Vice-Chair of the IEEE ICC 2003, the Technical Program Co-
Professor with the School of Electrical and Chair of the IEEE SPAWC 2011, the General Chair of the IEEE
Computer Engineering, Georgia Institute GlobalSIP 2014, the Technical Program Co-Chair of the IEEE VTC
of Technology, Atlanta, GA, USA. He joined 2016 (Spring), and the General Co-Chair of the IEEE VTC 2019 (Fall).
Zhejiang University in 2006, where he is currently a Full Professor He has been involved in editorial activities for over 20 technical
with the College of Information Science and Electronic Engineering. journals, including founding Editor-in-Chief of the IEEE 5G Tech
His research interests include 5G communications and networks, Focus. He has been recognized as the World’s Most Influential Sci-
mobile edge computing, and machine learning for wireless entific Mind, also known as a Highly Cited Researcher, by Thomson
networks. Reuters almost every year.